<!---
Latex Macros
-->
$$
\newcommand{\bar}{\,|\,}
\newcommand{\Xs}{\mathcal{X}}
\newcommand{\Ys}{\mathcal{Y}}
\newcommand{\y}{\mathbf{y}}
\newcommand{\weights}{\mathbf{w}}
\newcommand{\balpha}{\boldsymbol{\alpha}}
\newcommand{\bbeta}{\boldsymbol{\beta}}
\newcommand{\aligns}{\mathbf{a}}
\newcommand{\align}{a}
\newcommand{\source}{\mathbf{s}}
\newcommand{\target}{\mathbf{t}}
\newcommand{\ssource}{s}
\newcommand{\starget}{t}
\newcommand{\repr}{\mathbf{f}}
\newcommand{\repry}{\mathbf{g}}
\newcommand{\x}{\mathbf{x}}
\newcommand{\prob}{p}
\newcommand{\vocab}{V}
\newcommand{\params}{\boldsymbol{\theta}}
\newcommand{\param}{\theta}
\DeclareMathOperator{\perplexity}{PP}
\DeclareMathOperator{\argmax}{argmax}
\DeclareMathOperator{\argmin}{argmin}
\newcommand{\train}{\mathcal{D}}
\newcommand{\counts}[2]{\#_{#1}(#2) }
\newcommand{\length}[1]{\text{length}(#1) }
\newcommand{\indi}{\mathbb{I}}
$$

# Assignment 3

## Introduction

In this last assignment, you will apply deep learning methods to a story understanding problem. Automatic understanding of stories is an important task in natural language understanding [[1]](http://anthology.aclweb.org/D/D13/D13-1020.pdf). Specifically, you will develop a model that, given a sequence of sentences, learns to sort these sentences into a coherent story [[2]](https://www.microsoft.com/en-us/research/wp-content/uploads/2016/06/short-commonsense-stories.pdf). This sounds (and to an extent is) trivial for humans, but it is quite a difficult task for machines, as it involves commonsense knowledge and temporal understanding.

## Goal

You are given a dataset of 45502 instances, each consisting of 5 sentences. Your system needs to output a sequence of numbers that represents the predicted order of these sentences. For example, given a story:

    He went to the store.
    He found a lamp he liked.
    He bought the lamp.
    Jan decided to get a new lamp.
    Jan's lamp broke.

your system needs to provide an answer in the following form:

    2	3	4	1	0

where the numbers correspond to the zero-based index of each sentence in the correctly ordered story. So "`2`" for "`He went to the store.`" means that this sentence should come 3rd in the correctly ordered target story. In this particular example, this order of indices corresponds to the following target story:

    Jan's lamp broke.
    Jan decided to get a new lamp.
    He went to the store.
    He found a lamp he liked.
    He bought the lamp.
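
The mapping from the answer format back to the target story can be sketched in a few lines of Python. This is only an illustration of the index convention described above; the helper name `restore_story` is hypothetical and not part of the provided code.

```python
def restore_story(sentences, order):
    """Place sentence i at position order[i] of the restored story."""
    restored = [None] * len(sentences)
    for sentence, position in zip(sentences, order):
        restored[position] = sentence
    return restored


shuffled = [
    "He went to the store.",
    "He found a lamp he liked.",
    "He bought the lamp.",
    "Jan decided to get a new lamp.",
    "Jan's lamp broke.",
]

# order[i] is the zero-based target position of shuffled[i]
restored = restore_story(shuffled, [2, 3, 4, 1, 0])
print("\n".join(restored))
```

Note that the order is not a list of "which sentence goes first, second, ...": it gives, for each input sentence, the position that sentence should end up in.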

## Resources

To develop your model(s), we provide a training and a development dataset. The test dataset will be held out, and we will use it to evaluate your models. The test set comes from the same task distribution, so you should not expect drastic changes in it.

You will use [TensorFlow](https://www.tensorflow.org/) to build a deep learning model for the task. We provide a very crude system which solves the task with a low accuracy, and a set of additional functions you will have to use to save and load the model you create so that we can run it.

As we have to run the notebooks of each submission, and as deep learning models take a long time to train, your notebook **NEEDS** to conform to the following requirements:
* You **NEED** to run your parameter optimisation offline, and provide your final model saved by using the provided function
* The maximum size of a zip file you can upload to moodle is 160MB. We will **NOT** allow submissions larger than that.
* We do not have time to train your models from scratch! You **NEED** to provide the full code you used to train your model, but you **CANNOT** call the training method in the notebook you send to us.
* We will run these notebooks automatically. If your notebook runs the training procedure, in addition to loading the model, and we need to edit your code to stop the training, you will be penalised with **-20 points**.
* If you do not provide a pretrained model, and rely on training your model on our machines, you will get **0 points**.
* It needs to be tested on the stat-nlp-book Docker setup to ensure that it does not have any dependencies outside of those that we provide. If your submission fails to adhere to this requirement, you will get **0 points**.

Running time and memory issues:
* We have tested a possible solution on a mid-2014 MacBook Pro, and a few epochs of the model run in less than 3 minutes, so it is possible to train a model on the data in reasonable time. However, be aware that you will need to run these models many times over, for a larger number of epochs. (More elaborate models trained on much larger datasets can train for weeks, but this shouldn't be the case here.) If you find training times too long for your development cycle, you can reduce the training set size. Once you have found a good solution, you can increase the size again. Caveat: model parameters tuned on a smaller dataset may not be optimal for a larger training set.
* In addition to this, as your submission is capped by size, feel free to experiment with different model sizes, numeric values of different precisions, filtering the vocabulary size, downscaling some vectors, etc.
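
To get a feeling for how vocabulary filtering, vector size and numeric precision affect the submission size, a back-of-the-envelope estimate can help. The sketch below only counts the raw embedding matrix (no other parameters, no compression), and all numbers in it are hypothetical.

```python
def embedding_size_mb(vocab_size, embedding_dim, bytes_per_value=4):
    """Approximate raw size of an embedding matrix in megabytes (10**6 bytes)."""
    return vocab_size * embedding_dim * bytes_per_value / 1e6


# Hypothetical settings, for illustration only.
full = embedding_size_mb(400000, 300)                    # large vocabulary, 32-bit floats
filtered = embedding_size_mb(20000, 50)                  # filtered vocabulary, smaller vectors
half = embedding_size_mb(20000, 50, bytes_per_value=2)   # same, stored as 16-bit floats

print(full, filtered, half)
```

Under these assumptions the unfiltered matrix alone would exceed the upload limit, while the filtered one fits comfortably.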

## Hints

A non-exhaustive list of things you might want to give a try:
- better tokenization
- experiment with pre-trained word representations such as [word2vec](https://code.google.com/archive/p/word2vec/) or [GloVe](http://nlp.stanford.edu/projects/glove/). Be aware that these representations might add a lot of parameters to your model. Be sure to use only the words you expect in the training/dev set, and account for OOV words. When saving the model parameters, pre-trained word embeddings can simply be used in the word embedding matrix of your model. As said, make sure that this word embedding matrix does not contain all of word2vec or GloVe. Your submission size is limited, and we will not allow uploading or using the whole representation set (up to 3GB!)
- reduced sizes of word representations
- bucketing and batching (our implementation is deliberately not a good one!)
  - make sure to draw random batches from the data! (we do not provide this in our code!)
- better models:
  - stacked RNNs (see tf.nn.rnn_cell.MultiRNNCell)
  - bi-directional RNNs
  - attention
  - word-by-word attention
  - conditional encoding
  - get model inspirations from papers on nlp.stanford.edu/projects/snli/
  - sequence-to-sequence encoder-decoder architecture for producing the right ordering
- better training procedure:
  - different training algorithms
  - dropout on the input and output embeddings (see tf.nn.dropout)
  - L2 regularization (see tf.nn.l2_loss)
  - gradient clipping (see tf.clip_by_value or tf.clip_by_norm)
- model selection:
  - early stopping
- hyper-parameter optimization (e.g. random search or grid search (expensive!))
    - initial learning rate
    - dropout probability
    - input and output size
    - L2 regularization
    - gradient clipping value
    - batch size
    - ...
- post-processing
  - for incorporating consistency constraints
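
As an illustration of the last hint: if a model predicts the position of each sentence independently, two sentences can end up with the same position. One possible post-processing step (a sketch, not the required approach) is to search over all permutations and keep the one with the highest joint probability. With 5 sentences there are only 120 permutations, so brute force is cheap. The function name and the toy probabilities below are hypothetical.

```python
from itertools import permutations


def best_consistent_order(probs):
    """probs[i][p] is the predicted probability that sentence i belongs at position p.

    Returns the permutation (one distinct position per sentence) that
    maximises the product of the individual probabilities.
    """
    n = len(probs)
    best_order, best_score = None, -1.0
    for order in permutations(range(n)):
        score = 1.0
        for sentence, position in enumerate(order):
            score *= probs[sentence][position]
        if score > best_score:
            best_order, best_score = list(order), score
    return best_order


# Toy example with 3 sentences (hypothetical probabilities).
probs = [
    [0.6, 0.3, 0.1],
    [0.5, 0.4, 0.1],
    [0.2, 0.2, 0.6],
]

# Independent argmax assigns position 0 to two different sentences.
independent = [max(range(3), key=lambda p: row[p]) for row in probs]
consistent = best_consistent_order(probs)
print(independent, consistent)
```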

## Setup Instructions
It is important that this file is placed in the **correct directory**. It will not run otherwise. The correct directory is

    DIRECTORY_OF_YOUR_BOOK/assignments/2016/assignment3/problem/group_X/
    
where `DIRECTORY_OF_YOUR_BOOK` is a placeholder for the directory you downloaded the book to, and `X` in `group_X` is the number of your group.

After you placed it there, **rename the notebook file** to `group_X`.

The notebook is pre-set to save models in

    DIRECTORY_OF_YOUR_BOOK/assignments/2016/assignment3/problem/group_X/model/

Be sure not to tinker with that - we expect your submission to contain a `model` subdirectory with a single saved model! 
The saving procedure might overwrite the latest save, or not. Make sure you understand what it does, and upload only a single model! (for more details check tf.train.Saver)

## General Instructions
This notebook will be used by you to provide your solution, and by us to both assess your solution and enter your marks. It contains three types of sections:

1. **Setup** Sections: these sections set up code and resources for assessment. **Do not edit, move nor copy these cells**.
2. **Assessment** Sections: these sections are used for both evaluating the output of your code, and for markers to enter their marks. **Do not edit, move, nor copy these cells**.
3. **Task** Sections: these sections require your solutions. They may contain stub code, and you are expected to edit this code. For free text answers simply edit the markdown field.  

**If you edit, move or copy any of the setup, assessments and mark cells, you will be penalised with -20 points**.

Note that you are free to **create additional notebook cells** within a task section. 

Please **do not share** this assignment nor the dataset publicly, by uploading it online, emailing it to friends etc.

## Submission Instructions

To submit your solution:

* Make sure that your solution is fully contained in this notebook. Make sure you do not use any additional files other than your saved model.
* Make sure that your solution runs linearly from start to end (no execution hops). We will run your notebook in that order.
* **Before you submit, make sure your submission is tested on the stat-nlp-book Docker setup to ensure that it does not have any dependencies outside of those that we provide. If your submission fails to adhere to this requirement, you will get 0 points**.
* **If running your notebook produces a trivially fixable error that we spot, we will correct it and penalise you with -20 points. Otherwise you will get 0 points for that solution.**
* **Rename this notebook to your `group_X`** (where `X` is the number of your group), and adhere to the directory structure requirements, if you have not already done so. **Failure to do so will result in -1 point.**
* Download the notebook in Jupyter via *File -> Download as -> Notebook (.ipynb)*.
* Your submission should be a zip file containing the `group_X` directory, containing `group_X.ipynb` notebook, and the `model` directory with _____
* Upload that file to the Moodle submission site.

## <font color='green'>Setup 1</font>: Load Libraries
This cell loads libraries important for evaluation and assessment of your model. **Do not change, move or copy it.**

In [1]:
%%capture
%load_ext autoreload
%autoreload 2
%matplotlib inline
#! SETUP 1 - DO NOT CHANGE, MOVE NOR COPY
import sys, os
_snlp_book_dir = "../../../../../"
sys.path.append(_snlp_book_dir)
# docker image contains tensorflow 0.10.0rc0. We will support execution of only that version!
import statnlpbook.nn as nn

import tensorflow as tf
import numpy as np
from random import sample

## <font color='green'>Setup 2</font>: Load Training Data

This cell loads the training data. **Do not edit the next cell, nor copy/duplicate it**. Instead refer to the variables in your own code, and slice and dice them as you see fit (but do not change their values). 
For example, no one stops you from introducing, in the corresponding task section, `my_train` and `my_dev` variables that split the data into different folds.   
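
A minimal sketch of such a split into folds is shown below. It builds new lists and leaves the original data untouched. The helper `k_fold_splits` is hypothetical, and the toy list merely stands in for `data_train`.

```python
def k_fold_splits(data, k=5):
    """Yield (train, dev) pairs; every instance lands in exactly one dev fold."""
    for fold in range(k):
        dev = [item for i, item in enumerate(data) if i % k == fold]
        train = [item for i, item in enumerate(data) if i % k != fold]
        yield train, dev


# Toy stand-in for the loaded training data.
toy = list(range(10))
folds = list(k_fold_splits(toy, k=5))
my_train, my_dev = folds[0]
print(my_train, my_dev)
```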

In [2]:
#! SETUP 2 - DO NOT CHANGE, MOVE NOR COPY
data_path = _snlp_book_dir + "data/nn/"
data_train = nn.load_corpus(data_path + "train.tsv")
data_dev = nn.load_corpus(data_path + "dev.tsv")
assert(len(data_train) == 45502)

### Data Structures

Notice that the data is loaded from tab-separated files. The files are easy to read, and we provide the loading functions that load it into a simple data structure. Feel free to check details of the loading.

The data structure at hand is an array of dictionaries, each containing a `story` and an `order` entry. `story` is a list of strings, and `order` is a list of integer indices:

In [3]:
data_train[0]

{'order': [3, 2, 1, 0, 4],
 'story': ['His parents understood and decided to make a change.',
  'The doctors told his parents it was unhealthy.',
  'Dan was overweight as well.',
  "Dan's parents were overweight.",
  'They got themselves and Dan on a diet.']}

## <font color='blue'>Task 1</font>: Model implementation

Your primary task in this assignment is to implement a model that produces the right order of the sentences in the dataset.

### Preprocessing pipeline

First, we construct a preprocessing pipeline, in our case the `pipeline` function, which takes care of:
- out-of-vocabulary words
- building a vocabulary (on the train set), and applying the same unaltered vocabulary on other sets (dev and test)
- making sure that the length of input is the same for the train and dev/test sets (for fixed-sized models)

You are free (and encouraged!) to do your own input processing function. Should you experiment with recurrent neural networks, you will find that you will need to do so.
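
The three responsibilities listed above can be sketched in plain Python as follows. This is a simplified illustration, not the implementation of `nn.pipeline`: it assumes whitespace tokenization, and the helper names `build_vocab` and `encode` are hypothetical.

```python
PAD, OOV = "<PAD>", "<OOV>"


def build_vocab(sentences):
    """Build a token-to-id mapping on the training sentences only."""
    vocab = {PAD: 0, OOV: 1}
    for sentence in sentences:
        for token in sentence.split():
            vocab.setdefault(token, len(vocab))
    return vocab


def encode(sentence, vocab, max_len):
    """Map tokens to ids (unknown tokens become OOV), then truncate or pad to max_len."""
    ids = [vocab.get(token, vocab[OOV]) for token in sentence.split()][:max_len]
    return ids + [vocab[PAD]] * (max_len - len(ids))


train_vocab = build_vocab(["He bought the lamp .", "He went home ."])

# "She" was never seen in training, so it maps to the OOV id.
encoded = encode("She bought the lamp", train_vocab, max_len=6)
print(encoded)
```

The important point is that the vocabulary and the maximum length are fixed on the training set and then reused unchanged for dev and test data.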

In [4]:
# convert train set to integer IDs
train_stories, train_orders, vocab = nn.pipeline(data_train)

You need to make sure that the `pipeline` function returns the necessary data for your computational graph feed - the required inputs in this case, as we will call this function to process your dev and test data. If you do not make sure that the same pipeline applied to the train set is applied to other datasets, your model may not work with that data!

In [5]:
# get the length of the longest sentence
max_sent_len = train_stories.shape[2]

# convert dev set to integer IDs, based on the train vocabulary and max_sent_len
dev_stories, dev_orders, _ = nn.pipeline(data_dev, vocab=vocab, max_sent_len_=max_sent_len)

You can take a look at the result of the `pipeline` with the `show_data_instance` function to make sure that your data loaded correctly:

In [6]:
nn.show_data_instance(dev_stories, dev_orders, vocab, 155)

Input:
 Story:
  The manager decided to offer John the job.
  During the interview he was very <OOV> and <OOV>
  He went to the interview very prepared and nicely dressed.
  John was excited to have a job interview.
  The manager of the company was really impressed by John's comments.
 Order:
  [4 2 1 0 3]

Desired story:
  John was excited to have a job interview.
  He went to the interview very prepared and nicely dressed.
  During the interview he was very <OOV> and <OOV>
  The manager of the company was really impressed by John's comments.
  The manager decided to offer John the job.


### Helper Functions and Custom Pipeline

In [7]:
def split_dataset(x, y, ratio = [0.7, 0.15, 0.15] ):
    # number of examples
    data_len = len(x)
    lens = [ int(data_len*item) for item in ratio ]

    trainX, trainY = x[:lens[0]], y[:lens[0]]
    testX, testY = x[lens[0]:lens[0]+lens[1]], y[lens[0]:lens[0]+lens[1]]
    validX, validY = x[-lens[-1]:], y[-lens[-1]:]

    return (trainX,trainY), (testX,testY), (validX,validY)

def split_dataset_mlp(x, y, z, ratio = [0.7, 0.15, 0.15] ):
    # number of examples
    data_len = len(x)
    lens = [ int(data_len*item) for item in ratio ]

    trainX, trainY, trainZ = x[:lens[0]], y[:lens[0]], z[:lens[0]]
    testX, testY, testZ = x[lens[0]:lens[0]+lens[1]], y[lens[0]:lens[0]+lens[1]], z[lens[0]:lens[0]+lens[1]]
    validX, validY, validZ = x[-lens[-1]:], y[-lens[-1]:], z[-lens[-1]:]

    return (trainX,trainY,trainZ), (testX,testY,testZ), (validX,validY,validZ)

def batch_gen(x, y, batch_size):
    # cycle forever over consecutive, full-sized batches
    while True:
        for i in range(0, len(x) - batch_size + 1, batch_size):
            yield x[i : i+batch_size].T, y[i : i+batch_size].T
                
def rand_batch_gen(x, y, batch_size):
    while True:
        sample_idx = sample(list(np.arange(len(x))), batch_size)
        yield x[sample_idx].T, y[sample_idx].T
        
def rand_batch_mlp(x, y, z, batch_size):
    while True:
        sample_idx = sample(list(np.arange(len(x))), batch_size)
        yield np.array(x)[sample_idx], np.array(y)[sample_idx], np.array(z)[sample_idx]
        
        
def decode(sequence, lookup, separator=''): # 0 used for padding, is ignored
    return separator.join([ lookup[element] for element in sequence if element ])


def getRevVocab(vocab): 
    return {v: k for k, v in vocab.items()}


def flattenStory(stories, lengths): 
    out_sentences_dev1 = [item for sent in stories for item in sent]
    out_seq_len_dev1 = [item for sent in lengths for item in sent]
    return out_sentences_dev1, out_seq_len_dev1

def getBatchGen(trainX, trainY, batch_size):
    # cycle forever over consecutive, non-overlapping batches
    num_batches = np.shape(trainX)[0] // batch_size
    counter = 0
    while True:
        if counter >= num_batches:
            counter = 0
        start = counter * batch_size
        yield trainX[start:start+batch_size].T, trainY[start:start+batch_size].T
        counter += 1
            

def getBatchGenMLP(trainX, trainY, trainZ, batch_size):
    # same as getBatchGen, with a third array that is not transposed
    num_batches = np.shape(trainX)[0] // batch_size
    counter = 0
    while True:
        if counter >= num_batches:
            counter = 0
        start = counter * batch_size
        yield trainX[start:start+batch_size].T, trainY[start:start+batch_size].T, trainZ[start:start+batch_size]
        counter += 1
               
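def orderStories(data, order):
    # order[i][j] is the target position of sentence j in story i, so
    # argsort yields, for each position, the sentence that belongs there
    out_sentences_ordered = []
    for i, story in enumerate(data):
        out_sentences_ordered.append([story[item] for item in np.argsort(order[i])])
    return out_sentences_ordered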

def makeDataSeq2SeqReady(data):
    out_sentences_enc = []
    out_sentences_dec = []
    for i, item in enumerate(data): 
        out_sentences_enc.append(item[:-1])
        out_sentences_dec.append(item[1:])
    
    return out_sentences_enc, out_sentences_dec

def w2vToNumpy():
    # map each word to its GloVe vector; every line is "<word> <v1> ... <vN>"
    word2vec = {}
    with open('glove.6B.50d.txt', encoding='utf-8') as fin:
        for line in fin:
            items = line.rstrip('\r\n').split(' ')
            if len(items) < 10: continue  # skip short or malformed lines
            word = items[0]
            vect = np.array([float(i) for i in items[1:] if i])
            word2vec[word] = vect

    return word2vec

In [8]:
#my_glove = w2vToNumpy()

In [9]:
#glove_vocab = {'<PAD>': 0, '<OOV>':1}
#for word in my_glove.keys():
#    glove_vocab[word] = len(glove_vocab)
#glove_embedding = np.zeros((len(glove_vocab), 50))

### Word2Vec Implementation

In [10]:
class Word2Vec():

    def __init__(self, raw_data, voc_size, window_size = 2, batch_size = 50, embedding_size = 40, num_sampled = 50):
        
        self.raw_data = raw_data
        self.window_size = window_size 
        self.batch_size = batch_size
        self.embedding_size = embedding_size
        self.number_neg_samples = num_sampled
        self.skip_gram_pairs = self.makeSkipGram(raw_data)
        self.trained_embeddings = []
        self.voc_size = voc_size
        
    def makeSkipGram(self, data):
        # pair every word with its immediate left and right neighbour
        data = data.flatten()
        cbow_pairs = []
        for i in range(1, len(data)-1):
            cbow_pairs.append([[data[i-1], data[i+1]], data[i]])

        skip_gram_pairs = []
        for c in cbow_pairs:
            skip_gram_pairs.append([c[1], c[0][0]])
            skip_gram_pairs.append([c[1], c[0][1]])

        return skip_gram_pairs
        
    def generate_batch(self, size):
        assert size < len(self.skip_gram_pairs)
        x_data=[]
        y_data = []
        r = np.random.choice(range(len(self.skip_gram_pairs)), size, replace=False)
        for i in r:
            x_data.append(self.skip_gram_pairs[i][0])  # n dim
            y_data.append([self.skip_gram_pairs[i][1]])  # n, 1 dim
        return x_data, y_data
    
    def train(self):
        
        train_inputs = tf.placeholder(tf.int32, shape=[self.batch_size])
        # need to shape [batch_size, 1] for nn.nce_loss
        train_labels = tf.placeholder(tf.int32, shape=[self.batch_size, 1])
        # Ops and variables pinned to the CPU because of missing GPU implementation
        with tf.device('/cpu:0'):
            # Look up embeddings for inputs.
            embeddings = tf.Variable(
                tf.random_uniform([self.voc_size, self.embedding_size], -1.0, 1.0), name = "emb")
            embed = tf.nn.embedding_lookup(embeddings, train_inputs) # lookup table
        
        # initialising a saver object that contains the learning embeddings. 
        
        
        # Construct the variables for the NCE loss
        nce_weights = tf.Variable(
            tf.random_uniform([self.voc_size, self.embedding_size],-1.0, 1.0))
        nce_biases = tf.Variable(tf.zeros([self.voc_size]))
        
        loss = tf.reduce_mean(
          tf.nn.nce_loss(nce_weights, nce_biases, embed, train_labels,
                         self.number_neg_samples, self.voc_size))
        # Use the adam optimizer
        train_op = tf.train.AdamOptimizer(1e-1).minimize(loss)
        # Initialise saver object
        
        saver = tf.train.Saver({"my_embeddings": embeddings})
        
        init = tf.initialize_all_variables()
        
        # Launch the graph in a session
        with tf.Session() as sess:
            # Initializing all variables
            sess.run(init)
    
            for step in range(30):
                batch_inputs, batch_labels = self.generate_batch(self.batch_size)
                _, loss_val = sess.run([train_op, loss],
                        feed_dict={train_inputs: batch_inputs, train_labels: batch_labels})
                if step % 10 == 0:
                    print("Loss at ", step, loss_val) # Report the loss
                # Every 100 steps create checkpoint of current model. 
                if step %100 == 0: 
                    print("Creating Checkpoint..")
                    #saver.save(sess, SAVER_PATH, global_step = step)
            # Final embeddings are ready for you to use. Need to normalize for practical use
            #saver.save(sess, SAVER_PATH, global_step = step)
            #saver.save(sess, os.path.join(LOG_DIR, "model.ckpt"), step)
            self.trained_embeddings = embeddings.eval()
            #saver.save(sess, "embedding_checkpoint.ckpt")

## Custom Pipeline

In [11]:
OOV = '<OOV>'
PAD = '<PAD>'

import re

# contractions ("n't", "'m", "'ll"), words and punctuation become separate tokens
TOKEN_RE = re.compile(r"[\w]+(?=n't)|n't|'m|'ll|[\w]+|[.?!;,\-()—:']")

def ttokenize(sentence):
    # lower-case the sentence, then split it with the regex above
    return TOKEN_RE.findall(sentence.lower())

def tokenize(text):
    # plain whitespace tokenization
    return text.split(' ')

def my_pipeline(data, vocab=None, max_sent_len_=None):
    is_ext_vocab = True
    if vocab is None:
        is_ext_vocab = False
        vocab = {PAD: 0, OOV: 1}

    max_sent_len = -1
    data_sentences = []
    data_orders = []

    out_seq_len = []
    
    
    for instance in data:
        sents = []
        data_seq_len = []
        for sentence in instance['story']:
            sent = []
            tokenized = ttokenize(sentence)
            
            data_seq_len.append(len(tokenized))
     
            for token in tokenized:
            
                if not is_ext_vocab and token not in vocab:
                    vocab[token] = len(vocab)
                if token not in vocab:
                    token_id = vocab[OOV]
                else:
                    token_id = vocab[token]
                sent.append(token_id)
            if len(sent) > max_sent_len:
                max_sent_len = len(sent)
            sents.append(sent)
        
        out_seq_len.append(data_seq_len)
        
        data_sentences.append(sents)
        data_orders.append(instance['order'])

    if max_sent_len_ is not None:
        max_sent_len = max_sent_len_
    out_sentences = np.full([len(data_sentences), 5, max_sent_len], vocab[PAD], dtype=np.int32)

    for i, elem in enumerate(data_sentences):
        for j, sent in enumerate(elem):
            out_sentences[i, j, 0:len(sent)] = sent

    out_orders = np.array(data_orders, dtype=np.int32)

    return out_sentences, out_orders, out_seq_len, vocab, max_sent_len

out_sentences, out_orders, out_seq_len, vocab, max_sent_len = my_pipeline(data_train)

In [12]:
out_sentences_flat, out_len_flat = flattenStory(out_sentences, out_seq_len)
test_stories1, test_orders, test_seq_len1, _ , _= my_pipeline(data_dev, vocab = vocab, max_sent_len_= max_sent_len)
test_stories1, test_seq_len1 = flattenStory(test_stories1, test_seq_len1)

batch_gen = rand_batch_mlp(out_sentences, out_seq_len, out_orders, 30)

### Model

The model we provide is a rudimentary, non-optimised model that essentially represents every word in a sentence with a fixed vector, sums these vectors up (per sentence) and puts a softmax at the end which aims to guess the order of sentences independently.
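
The idea of this baseline can be sketched without TensorFlow. The snippet below is a simplified illustration with hypothetical toy parameters, not the provided implementation: it sums the word vectors of one sentence and applies a linear layer followed by a softmax over the possible positions.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sentence_vector(token_ids, embeddings):
    """Sum the embedding vectors of all tokens in a sentence."""
    dim = len(embeddings[0])
    return [sum(embeddings[t][d] for t in token_ids) for d in range(dim)]


def position_probs(token_ids, embeddings, weights, biases):
    """Distribution over positions for one sentence, independent of the others."""
    vec = sentence_vector(token_ids, embeddings)
    logits = [sum(w * v for w, v in zip(row, vec)) + b
              for row, b in zip(weights, biases)]
    return softmax(logits)


# Hypothetical toy parameters: 3 words, 2-dimensional embeddings, 2 positions.
embeddings = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
weights = [[1.0, 0.0], [0.0, 1.0]]
biases = [0.0, 0.0]

probs = position_probs([1, 1, 2], embeddings, weights, biases)
print(probs)
```

Because each sentence is scored on its own, nothing prevents two sentences from being assigned the same position, which is one reason this baseline is weak.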

First we define the model parameters:

In [13]:
# Embeddings
#GloVe_embedings50 = np.load('GloVe_devvocab_emb.npy').item()


## Seq2seq + Multi-layer Perceptron

In [14]:
class Seq2SeqOrdering(object):

    def __init__(self, xseq_len, yseq_len,
            xvocab_size, yvocab_size,
            emb_dim, num_layers, ckpt_path,
            lr=0.01,
            epochs=10, model_name='seq2seq_model'):

        # attach these arguments to self
        self.xseq_len = xseq_len
        self.yseq_len = yseq_len
        self.ckpt_path = ckpt_path
        self.epochs = epochs
        self.model_name = model_name
        self.emb_dim = emb_dim
        self.epochs = 10000  # NOTE: this overrides the `epochs` constructor argument
        
        
        self.mlp_hidden = 64
        self.n_classes = 2
        self.mlp_input = emb_dim
        self.mlp_epochs = 10000


        # build thy graph
        #  attach any part of the graph that needs to be exposed, to the self
        def __graph__():
            
            
            ############### Placeholders for seq2seq ###############

            # placeholders
            tf.reset_default_graph()
            #  encoder inputs : list of indices of length xseq_len
            self.enc_ip = [ tf.placeholder(shape=[None,],
                            dtype=tf.int64,
                            name='ei_{}'.format(t)) for t in range(xseq_len) ]

            #  labels that represent the real outputs
            self.labels = [ tf.placeholder(shape=[None,],
                            dtype=tf.int64,
                            name='labels_{}'.format(t)) for t in range(yseq_len) ]

            #  decoder inputs : 'GO' + [ y1, y2, ... y_t-1 ]
            self.dec_ip = [ tf.zeros_like(self.enc_ip[0], dtype=tf.int64, name='GO') ] + self.labels[:-1]


            # Basic LSTM cell wrapped in Dropout Wrapper
            self.keep_prob = tf.placeholder(tf.float32)
            # define the basic cell
            
            ############### Set Up LSTM Net ###############

            basic_cell = tf.nn.rnn_cell.DropoutWrapper(
                    tf.nn.rnn_cell.BasicLSTMCell(self.emb_dim, state_is_tuple=True),
                    output_keep_prob=self.keep_prob)
            # stack cells together : n layered model
            stacked_lstm = tf.nn.rnn_cell.MultiRNNCell([basic_cell]*num_layers, state_is_tuple=True)
            

            # for parameter sharing between training model
            #  and testing model
            with tf.variable_scope('decoder') as scope:
                # build the seq2seq model
                #  inputs : encoder, decoder inputs, LSTM cell type, vocabulary sizes, embedding dimensions
                self.decode_outputs, self.decode_states = tf.nn.seq2seq.embedding_rnn_seq2seq(self.enc_ip,self.dec_ip, stacked_lstm,
                                                    xvocab_size, yvocab_size, emb_dim)
                # share parameters
                scope.reuse_variables()
                # testing model, where output of previous timestep is fed as input
                #  to the next timestep
                self.decode_outputs_test, self.decode_states_test = tf.nn.seq2seq.embedding_rnn_seq2seq(
                    self.enc_ip, self.dec_ip, stacked_lstm, xvocab_size, yvocab_size,emb_dim,
                    feed_previous=True)

            
            ############### Seq2seq Loss ###############
            
            with tf.variable_scope('loss') as scope: 
                # weighted loss
                #  TODO : add parameter hint
                loss_weights = [ tf.ones_like(label, dtype=tf.float32) for label in self.labels ]
                self.loss = tf.nn.seq2seq.sequence_loss(self.decode_outputs, self.labels, loss_weights, yvocab_size)
                
                scope.reuse_variables()
                
                self.loss_permutation = tf.nn.seq2seq.sequence_loss(self.decode_outputs_test, self.labels, loss_weights, yvocab_size)

            
            ############### Seq2seq Optimisation ###############
            
            self.train_op = tf.train.AdamOptimizer(learning_rate=lr).minimize(self.loss)
            

            self.n_hidden_1 = 512 # 1st layer number of features
            self.n_hidden_2 = 256 # 2nd layer number of features
            self.n_input = self.emb_dim * 4 
            self.n_classes_mlp = 5 
            self.learning_rate = 0.01
            self.output_size = 25

            # tf Graph input
            self.x = tf.placeholder("float", [None, self.n_input])
            self.y = tf.placeholder(tf.int64, [None, self.n_classes_mlp])


            # Store layers weight & bias
            self.weights = {
                'h1': tf.Variable(tf.random_normal([self.n_input, self.n_hidden_1])),
                'h2': tf.Variable(tf.random_normal([self.n_hidden_1, self.n_hidden_2])),
                'out': tf.Variable(tf.random_normal([self.n_hidden_2, self.output_size]))
            }
            self.biases = {
                'b1': tf.Variable(tf.random_normal([self.n_hidden_1])),
                'b2': tf.Variable(tf.random_normal([self.n_hidden_2])),
                'out': tf.Variable(tf.random_normal([self.output_size]))
            }

            # Construct model
            self.logits = self.multilayer_perceptron(self.x, self.weights, self.biases)
            #self.y = tf.reshape(self.y, [35, 5])
            self.logits_reshaped = tf.reshape(self.logits, [-1, 5, 5])
            
            self.unpacked_logits = [tensor for tensor in tf.unpack(self.logits_reshaped, axis=1)]
            self.softmaxes = [tf.nn.softmax(tensor) for tensor in self.unpacked_logits ]
            self.softmaxed_logits = tf.pack(self.softmaxes, axis=1)
            self.mlp_predict = tf.arg_max(self.softmaxed_logits , 2)

            # Define loss and optimizer
            # use the [batch, 5, 5] logits so that they line up with the [batch, 5] labels
            self.mlp_loss = tf.reduce_sum(tf.nn.sparse_softmax_cross_entropy_with_logits(logits=self.logits_reshaped, labels=self.y))
            self.mlp_optimizer = tf.train.AdamOptimizer(learning_rate=self.learning_rate).minimize(self.mlp_loss)

        # build the computation graph, then report that it is ready
        __graph__()
        sys.stdout.write('>> Graph Ready <<')
    
    # get the feed dictionary
    def get_feed(self, X, Y, keep_prob):
        feed_dict = {self.enc_ip[t]: X[t] for t in range(self.xseq_len)}
        feed_dict.update({self.labels[t]: Y[t] for t in range(self.yseq_len)})
        feed_dict[self.keep_prob] = keep_prob # dropout prob
        #print(">> Made feed dict.")
        return feed_dict
    
    # Create model
    def multilayer_perceptron(self, x, weights, biases):
        # Hidden layer with RELU activation
        layer_1 = tf.add(tf.matmul(x, weights['h1']), biases['b1'])
        layer_1 = tf.nn.relu(layer_1)
        # Hidden layer with RELU activation
        layer_2 = tf.add(tf.matmul(layer_1, weights['h2']), biases['b2'])
        layer_2 = tf.nn.relu(layer_2)
        # Output layer with linear activation
        out_layer = tf.matmul(layer_2, weights['out']) + biases['out']
        return out_layer

    # run one batch for training
    def train_batch(self, sess, train_batch_gen):
        # get batches
        batchX, batchY = next(train_batch_gen)
        # build feed
        feed_dict = self.get_feed(batchX, batchY, keep_prob=0.5)
        _, loss_v = sess.run([self.train_op, self.loss], feed_dict)
        return loss_v

    def eval_step(self, sess, eval_batch_gen):
        # get batches
        batchX, batchY = next(eval_batch_gen)
        # build feed
        feed_dict = self.get_feed(batchX, batchY, keep_prob=1.)
        loss_v, dec_op_v = sess.run([self.loss, self.decode_outputs_test], feed_dict)
        # dec_op_v is a list; also need to transpose 0,1 indices
        #  (interchange batch_size and timesteps dimensions)
        dec_op_v = np.array(dec_op_v).transpose([1,0,2])
        return loss_v, dec_op_v, batchX, batchY

    # evaluate 'num_batches' batches
    def eval_batches(self, sess, eval_batch_gen, num_batches):
        losses = []
        for i in range(num_batches):
            loss_v, dec_op_v, batchX, batchY = self.eval_step(sess, eval_batch_gen)
            losses.append(loss_v)
        return np.mean(losses)
        
    # finally the train function that
    #  runs the train_op in a session
    #   evaluates on valid set periodically
    #    prints statistics
    def train(self, train_set, valid_set, sess=None ):
        # we need to save the model periodically
        #saver = tf.train.Saver()
        # if no session is given
        if not sess:
            # create a session
            sess = tf.Session()
            # init all variables
            sess.run(tf.global_variables_initializer())

        sys.stdout.write('>> Training started <<')
        # run M epochs
        for i in range(self.epochs):
            try:
                self.train_batch(sess, train_set)
                if i % 100 == 0: #and i% (self.epochs//1) == 0: # TODO : make this tunable by the user
                    # save model to disk
                    #saver.save(sess, self.ckpt_path + self.model_name + '.ckpt', global_step=i)
                    # evaluate to get validation loss
                    val_loss = self.eval_batches(sess, valid_set, 16) # TODO : and this
                    # print stats
                    print('\nValidation at iteration #{}'.format(i))
                    print('val   loss : {0:.6f}'.format(val_loss))
                    sys.stdout.flush()
            except KeyboardInterrupt: # this will most definitely happen, so handle it
                print('Interrupted by user at iteration {}'.format(i))
                self.session = sess
                return sess
            
        return sess
    
    def trainMLP(self, train_batch, eval_batch, sess): 
        
        for i in range(self.mlp_epochs):
            try: 
                # Get training batch
                train_batchX, train_batchY, train_batchZ  = next(train_batch)
                
                # Determine batch size
                batch_size = np.shape(train_batchZ)[0]
                # Flatten story for embedding
                flatten_enc, flatten_dec = self.flattenBatch(train_batchX.T, train_batchY.T)
                # Embed training batchX with the seq2seq model
                _, embedded_x = self.getSeq2SeqEmbedding(sess, flatten_enc, flatten_dec)

                final_h = embedded_x[0].h
                flatten_embeddings = final_h.reshape(batch_size, 4*self.emb_dim)
                feed = {self.x: flatten_embeddings, self.y: array(train_batchZ)}
                # Run one optimisation step on the MLP
                sess.run(self.mlp_optimizer, feed_dict=feed)
                if i % 100 == 0:
                    # Evaluate on a held-out batch; its size may differ from the training batch
                    eval_batchX, eval_batchY, eval_batchZ = next(eval_batch)
                    eval_size = np.shape(eval_batchZ)[0]
                    flatten_enc, flatten_dec = self.flattenBatch(eval_batchX.T, eval_batchY.T)
                    _, embedded_x = self.getSeq2SeqEmbedding(sess, flatten_enc, flatten_dec)
                    final_h = embedded_x[0].h
                    flatten_embeddings = final_h.reshape(eval_size, 4*self.emb_dim)
                    feed = {self.x: flatten_embeddings, self.y: array(eval_batchZ)}

                    loss, pred = sess.run([self.mlp_loss, self.mlp_predict], feed_dict=feed)
                    acc = calculate_accuracy(eval_batchZ, pred)
                    print("Iteration: {} Loss: {} Acc: {}".format(i, loss, acc))
                
            except KeyboardInterrupt: 
                print("Training Stopped")
                break
        
        
    
    def restore_last_session(self):
        saver = tf.train.Saver()
        # create a session
        sess = tf.Session()
        # get checkpoint state
        ckpt = tf.train.get_checkpoint_state(self.ckpt_path)
        # restore session
        if ckpt and ckpt.model_checkpoint_path:
            print("Restoring last session at: ", ckpt.model_checkpoint_path)
            saver.restore(sess, ckpt.model_checkpoint_path)
            return sess
        # return to user
        else: 
            sess.close()
            print("No session saved.")
    
    def getSeq2SeqEmbedding(self, sess, x, y):
        feed = self.get_feed(x, y, keep_prob = 1)
        dec_out, dec_states = sess.run([self.decode_outputs, self.decode_states], feed_dict = feed)
        return dec_out, dec_states
    
    def flattenBatch(self, x, y): 
        enc = array([item for something in x for item in something]).T
        dec = array([item for something in y for item in something]).T
        return enc, dec
    
    def predictLogits(self, sess, x, y, t): 
        feed = self.get_feed(x, y, keep_prob = 1)
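        # feed the MLP input placeholder; a NumPy array cannot be used as a feed key
        feed.update({self.x: t})
        # self.predict is a method, so run the MLP prediction tensor instead
        return sess.run(self.mlp_predict, feed_dict = feed)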

    # prediction
    def predict(self, sess, X):
        feed_dict = {self.enc_ip[t]: X[t] for t in range(self.xseq_len)}
        feed_dict[self.keep_prob] = 1.
        dec_op_v = sess.run(self.decode_outputs_test, feed_dict)
        # dec_op_v is a list; also need to transpose 0,1 indices
        #  (interchange batch_size and timesteps dimensions)
        dec_op_v = np.array(dec_op_v).transpose([1,0,2])
        # return the index of item with highest probability
        return np.argmax(dec_op_v, axis=2)

## Baseline Model Simple LSTM 

In [15]:
### MODEL PARAMETERS ###

## hidden 132, batch 50 - 55.5% 
target_size = 5
vocab_size = len(vocab)
input_size = 30
n = 5460240
hidden_size = 134
BATCH_SIZE= 45
n_stacks = 1
embedding_dim = 60

and then we define the model

In [16]:
### Base Line MODEL ###
tf.reset_default_graph()
## PLACEHOLDERS
story = tf.placeholder(tf.int64, [None, max_sent_len], "story")        # [(batch_size * 5) x max_length], one row per sentence
order = tf.placeholder(tf.int64, [None, 5], "order")             # [batch_size x 5]
sen_len = tf.placeholder(tf.int64, [None], "sen_len")
batch_size = tf.shape(story)[0]//5

W = tf.Variable(tf.constant(0.0, shape=[vocab_size, embedding_dim]),
                trainable=False, name="W")

keep_prob = tf.placeholder(tf.float32)

learning_rate = tf.placeholder(tf.float32)

# Word embeddings
initializer = tf.random_uniform_initializer(-0.1, 0.1)

embeddings = tf.get_variable("W", [vocab_size, input_size], initializer=initializer)

sentences_embedded = tf.nn.embedding_lookup(embeddings, story)

with tf.variable_scope("encoder") as varscope:

    basic_cell = tf.nn.rnn_cell.DropoutWrapper(
                        tf.nn.rnn_cell.BasicLSTMCell(hidden_size, state_is_tuple=True),
                        output_keep_prob=keep_prob)

    _, final_first = tf.nn.dynamic_rnn(basic_cell, sentences_embedded, sequence_length=sen_len, dtype=tf.float32)

    final_firs_h = final_first.h
        
reshape_final = tf.reshape(final_firs_h, [-1, hidden_size*5])

logits_ = tf.contrib.layers.linear(reshape_final, 25)

logits = tf.reshape(logits_, [-1, 5, 5])


loss = tf.reduce_sum(tf.nn.sparse_softmax_cross_entropy_with_logits(logits=logits, labels=order))

## Optimizer
optim = tf.train.AdamOptimizer(learning_rate)
optim_op = optim.minimize(loss)
init = tf.global_variables_initializer()

unpacked_logits = [tensor for tensor in tf.unpack(logits, axis=1)]
softmaxes = [tf.nn.softmax(tensor) for tensor in unpacked_logits]
softmaxed_logits = tf.pack(softmaxes, axis=1)#
predict = tf.arg_max(softmaxed_logits, 2)

saver = tf.train.Saver()

sess= tf.Session()
sess.run(init)

We have now built our model, together with the loss, the optimiser on that loss, and the prediction function.

### Model training 

We have defined the preprocessing pipeline and set the model up, so we can finally train the model.

In [17]:

def trainModel(sess = None):
    
        if not sess: 
            sess = tf.Session()
            
        sess.run(tf.global_variables_initializer())
        
        for j in range(1):
            counter = 0
            slow_down = False
            slow_slow_down = False
            for i in range(n // BATCH_SIZE):
                x, y, z = next(batch_gen)
                x_flat, y_flat = flattenStory(x, y)
                if counter >= len(out_sentences)//BATCH_SIZE - BATCH_SIZE: 
                    counter =0 
                try:
                    # pick the learning rate from the flags set at the last accuracy check
                    if slow_slow_down:
                        l_r = 0.0001
                    elif slow_down:
                        l_r = 0.001
                    else:
                        l_r = 0.01
                        
                    inst_story = x_flat 
                    inst_order = z
                    inst_seq_len = y_flat
                    
                    feed_dict = {story: inst_story, order: inst_order, sen_len: inst_seq_len, keep_prob:0.5, learning_rate: l_r}
                    test = sess.run(optim_op, feed_dict = feed_dict)
                   
                    if i%10 == 0:
                        test_feed_dict = {story:test_stories1 , order: test_orders, sen_len:test_seq_len1,  keep_prob:1.0, learning_rate:l_r}
                        test_predicted = sess.run(predict, feed_dict=test_feed_dict)
                        test_accuracy = nn.calculate_accuracy(test_orders, test_predicted)
                        print('test_accuracy =', test_accuracy)
                        if test_accuracy > 0.538 and test_accuracy < 0.55: 
                            slow_down = True
                            slow_slow_down = False
                        elif test_accuracy > 0.55: 
                            slow_down = False
                            slow_slow_down = True 
                        else: 
                            slow_down = False
                            
                        if test_accuracy > 0.555:
                            nn.save_model(sess)
                            print(test_accuracy)
                            break

                    counter += 1
                except KeyboardInterrupt:
                    print("Training Stopped")
                    nn.save_model(sess)
                    break
    


In [18]:
"""
Error analysis functions from statNLPbook.bio 
"""
import pandas as pd
import collections as col
import statnlpbook.bio as bio
import statnlpbook.util as util

def confusion_matrix(dev_predicted, dev_orders):
    confusion = col.defaultdict(int)
    batch_sz = dev_predicted.shape[0]
    for i in range(batch_sz):
        for j in range(5):
            confusion[(dev_predicted[i][j],dev_orders[i][j])] += 1
    return confusion

#conf_matrix = confusion_matrix(dev_predicted, dev_orders)
#bio.full_evaluation_table(conf_matrix)
#util.plot_confusion_matrix_dict(conf_matrix, 90, outside_label="None")

## <font color='red'>Assessment 1</font>: Assess Accuracy (50 pts) 

We assess how well your model performs on an unseen test set. We will look at the accuracy of the predicted sentence order at the sentence level, and will score it as follows:

* 0 - 20 pts: 45% <= accuracy < 50%, linear
* 20 - 40 pts: 50% <= accuracy < 55%, linear
* 40 - 70 pts: 55% <= accuracy < Best Result, linear

The **linear** mapping maps any accuracy value between the lower and upper bound linearly to a score. For example, if your model's accuracy is $acc=54.5\%$, then your score is $20 + 20\frac{acc-50}{55-50} = 38$ points.

The *Best-Result* accuracy is the maximum of the best accuracy the course organiser achieved and the submitted accuracy scores.
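The tiered mapping above can be sketched as a small function. This is only an illustration under stated assumptions: the function name `accuracy_to_points` is hypothetical, the Best-Result accuracy is not given in the text and has to be passed in, and the behaviour below 45% (zero points) and at or above Best Result (top of the last tier) is assumed rather than specified.

```python
def accuracy_to_points(acc, best_result):
    """Map an accuracy in percent to points using the tiers listed above.

    `best_result` is the Best-Result accuracy in percent; it is not stated
    in the text, so the caller must supply it.
    """
    # (lower accuracy, upper accuracy, points at lower bound, points at upper bound)
    tiers = [
        (45.0, 50.0, 0.0, 20.0),
        (50.0, 55.0, 20.0, 40.0),
        (55.0, best_result, 40.0, 70.0),
    ]
    if acc < 45.0:
        return 0.0  # assumption: nothing is awarded below the first tier
    for low, high, pts_low, pts_high in tiers:
        if low <= acc < high:
            # linear interpolation inside the tier
            return pts_low + (pts_high - pts_low) * (acc - low) / (high - low)
    return 70.0  # assumption: accuracies at or above Best Result get the tier maximum


# The worked example from the text; 60.0 is a made-up Best-Result value
# that does not affect the second tier.
example_points = accuracy_to_points(54.5, best_result=60.0)
```
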

Change the following lines so that they construct the test set in the same way you constructed the dev set in the code above. We will insert the test set instead of the dev set here. **`test_feed_dict` variable must stay named the same**.

In [19]:
# LOAD THE DATA
data_test = nn.load_corpus(data_path + "dev.tsv")
# make sure you process this with the same pipeline as you processed your dev set
test_stories, test_orders, test_seq_len, _, _ = my_pipeline(data_test, vocab=vocab, max_sent_len_=max_sent_len)
test_stories, test_seq_len = flattenStory(test_stories, test_seq_len)
# THIS VARIABLE MUST BE NAMED `test_feed_dict`
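# keep_prob has no default value, so dropout is switched off explicitly for evaluation
test_feed_dict = {story: test_stories, order: test_orders, sen_len: test_seq_len, keep_prob: 1.0}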

The following code loads your model, computes accuracy, and exports the result. **DO NOT** change this code.

In [20]:
#! ASSESSMENT 1 - DO NOT CHANGE, MOVE NOR COPY
with tf.Session() as sess:
    # LOAD THE MODEL
    saver = tf.train.Saver()
    saver.restore(sess, './model/model.checkpoint')
    
    # RUN TEST SET EVALUATION
    dev_predicted = sess.run(predict, feed_dict=test_feed_dict)
    dev_accuracy = nn.calculate_accuracy(dev_orders, dev_predicted)

dev_accuracy

0.55521111704970605

## <font color='orange'>Mark</font>:  Your solution to Task 1 is marked with ** __ points**. 
---

## <font color='blue'>Task 2</font>: Describe your Approach

Enter a 750 words max description of your approach **in this cell**.
Make sure to provide:
- an **error analysis** of the types of errors your system makes
- a comparison of your system with the model we provide, focusing on differences and drawing useful comparisons between them

Should you need to include figures in your report, make sure they are Python-generated. For that, feel free to create new cells after this cell (before Assessment 2 cell). Link online images at your risk.

### Approach summary
The presented model (section 1.3.2) is a dynamic recurrent neural network built in TensorFlow. It takes the vectorised sentences of each story, passes each sentence through an LSTM cell with a dropout wrapper, concatenates the resulting sentence states, and applies a linear layer to produce logits, which are softmaxed and argmaxed to predict a class for each sentence. The model trains in under 15 minutes and reaches a test accuracy of 55.5%.

We also built a state-of-the-art sequence-to-sequence model combined with a multi-layer perceptron (section 1.3.1), following Logeswaran et al. (2017), but on this dataset it failed to beat the test accuracy of the LSTM model, reaching only 11%. We have included this model as a demonstration of our efforts.

### 2.1. Pre-processing
#### 2.1.1 Pipeline enhancement
We adjusted the provided pipeline function to improve the pre-processing and to provide the additional information required by our more complex model, for example improved tokenisation, sentence lengths and other helper functions.

#### 2.1.2 Word embedding
We built and tested word embeddings with Google's word2vec and Stanford's GloVe techniques (TensorFlow (2017) and Pennington et al. (2014) respectively). We trained embeddings on the corpus and tested each at 100, 200 and 300 dimensions, and we also tried the pre-trained Wiki 2014 word vectors from GloVe (http://www-nlp.stanford.edu/projects/glove/#discuss).

However, neither the pre-trained nor the corpus-trained word embeddings improved the accuracy of either model compared with a baseline of random uniform embeddings, as shown below.

<table style="width:80%">
  <tr>
    <th>Algorithm</th>
    <th>Dimension</th>
    <th>Performance compared to baseline</th>
  </tr>
  <tr>
    <td>GloVe corpus trained</td>
    <td>300D</td> 
    <td>-7.5%</td> 
  </tr>
    <tr>
    <td>GloVe pre-trained</td>
    <td>300D</td> 
    <td>-5.0%</td> 
  </tr>
    <tr>
    <td>GloVe corpus trained</td>
    <td>100D</td> 
    <td>-3.5%</td> 
  </tr>
    <tr>
    <td>word2vec corpus trained</td>
    <td>100D</td> 
    <td>-5.0%</td> 
  </tr>
</table>

As a result of these findings, we used the same random uniform embeddings as the stub model. We tuned the number of embedding dimensions and found that 40 dimensions gave the best performance.

### 2.2. Models
We built two different models, shown in sections 1.3.1 ('Sequence-to-sequence + Multi-layered Perceptron') and 1.3.2 ('LSTM model'). The structure of each model is described below.

#### 2.2.1 Sequence-to-sequence + Multi-layered Perceptron
The most promising model in theory, and by far the most complex one we built, is the **Seq2SeqOrdering** class. It is based on the idea of coherence modelling through sequence-to-sequence prediction: if a machine can predict the most probable subsequent sentence given the current sentence, then maximising a log-likelihood of the form

$$ L(s_{i}, s_{i+1}) = \frac{1}{N_{i}}\log p (s_{i+1} | s_{i})$$

should allow it to discriminate between coherent and arbitrary sentence orderings.
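As a rough illustration of this objective, the sketch below computes a length-normalised log-likelihood from per-token probabilities. It rests on two assumptions that the text does not state: that $N_i$ is the number of tokens in $s_{i+1}$, and that the decoder factorises $p(s_{i+1} | s_i)$ into per-token probabilities. The function name and the probability values are made up for the example.

```python
import math


def normalised_log_likelihood(token_probs):
    """Return (1/N) * sum of log-probabilities of the next sentence's tokens.

    `token_probs` holds the probability the decoder assigned to each token
    of the next sentence (hypothetical values in this sketch).
    """
    n_tokens = len(token_probs)
    return sum(math.log(p) for p in token_probs) / n_tokens


# Invented decoder probabilities for a plausible and an implausible continuation.
coherent_pair = normalised_log_likelihood([0.5, 0.4, 0.6])
arbitrary_pair = normalised_log_likelihood([0.05, 0.1, 0.02])

# A coherent continuation should receive the higher (less negative) score.
is_coherent_preferred = coherent_pair > arbitrary_pair
```
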

This model is trained in two stages: 

> • training a seq2seq model using TensorFlow's embedding_rnn_seq2seq wrapper by feeding it the sentences in the correct order.

> • training a 2-layer perceptron that scores each of the 120 permutations of the story.

The multi-layer perceptron is trained on the embedding produced by the seq2seq model when an unordered sentence pair is passed into the encoder/decoder. Although the seq2seq model is efficient to train, producing good sentence predictions after only a few epochs, the overall model is inefficient because it tests every permutation of each story. Despite its good sentence predictions, with a mean prediction time of ~10s this model becomes, at least for this assignment, computationally intractable.
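The cost of testing every permutation can be seen in a small sketch: five sentences give 5! = 120 candidate orders, and each one needs its own scoring pass. The `toy_score` function below is a hypothetical stand-in for the trained perceptron, not the scoring used in this notebook.

```python
from itertools import permutations


def best_order(sentences, score_fn):
    """Score every permutation of `sentences` and return the best order.

    Returns the winning index order and the number of candidates scored.
    """
    candidates = list(permutations(range(len(sentences))))
    best = max(candidates, key=lambda order: score_fn([sentences[i] for i in order]))
    return list(best), len(candidates)


def toy_score(ordered):
    # Hypothetical scorer: counts adjacent pairs that are in alphabetical order.
    return sum(1 for a, b in zip(ordered, ordered[1:]) if a < b)


# Single letters stand in for the five sentences of a story.
order, n_candidates = best_order(["d", "b", "a", "e", "c"], toy_score)
```

With a real model, each of the 120 calls to the scoring function would involve a forward pass, which is where the prediction time goes.
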

#### 2.2.2 LSTM model 
Our baseline model consists of a dynamic RNN with a single LSTM cell. Every sentence is fed into the RNN individually to create a sentence embedding. The five sentence embeddings of a story are concatenated and passed into a single linear layer, whose outputs are scored using a softmax.
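The decoding step at the end of this model can be sketched in plain Python. The linear layer emits 25 values per story, which are viewed as a 5 x 5 grid (one row per sentence, one column per candidate position); a softmax is applied to each row and the argmax gives the predicted position. The logits below are made up, and the helper names are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def decode_order(flat_logits, n=5):
    """Reshape n*n logits into n rows, softmax each row, take the argmax."""
    rows = [flat_logits[i * n:(i + 1) * n] for i in range(n)]
    probs = [softmax(row) for row in rows]
    predicted = [max(range(n), key=row.__getitem__) for row in probs]
    return predicted, probs


# Invented logits: sentence i has its largest logit at position target[i].
target = [3, 4, 2, 0, 1]
flat = [0.0] * 25
for sentence, position in enumerate(target):
    flat[sentence * 5 + position] = 3.0

predicted, probs = decode_order(flat)
```

Because each sentence is classified independently, nothing in this step forces the five predictions to form a valid permutation; two sentences can be assigned the same position.
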

To prevent overfitting and dead neurons, the LSTM cell is wrapped in a dropout wrapper that randomly switches neurons off during training. This allowed us to take advantage of a larger hidden layer size.

While increasing the hidden layer size only marginally improved the final result, the model converged very quickly to its maximum accuracy.

An adaptive learning rate was incorporated into the model training: the learning rate is reduced from 0.01 to 0.001 once test accuracy rises above 53.8%, and to 0.0001 above 55%. This substantially slows the optimisation in the later stages of training, reducing volatility and helping to find a better optimum.
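The schedule can be written as a small stand-alone function. The thresholds and rates mirror the flags used in the `trainModel` cell earlier in this notebook; the function name `pick_learning_rate` is hypothetical, and in the actual loop the accuracy is only re-checked every 10 iterations.

```python
def pick_learning_rate(test_accuracy):
    """Return the learning rate for the latest measured test accuracy."""
    if 0.538 < test_accuracy < 0.55:
        return 0.001   # first slow-down band
    if test_accuracy > 0.55:
        return 0.0001  # second, stronger slow-down
    return 0.01        # default rate early in training


rates = [pick_learning_rate(acc) for acc in (0.50, 0.54, 0.552)]
```
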

The LSTM model is far more computationally advanced than the stub model, which does not use a neural network (NN). NNs are able to learn very complex patterns because they have many more degrees of freedom than linear or other simple models. The LSTM cell benefits further from being able to store and pass on memory sequentially, which gives it good predictive power for sequential tasks like this one.

### 2.3 Parameter optimisation
Parameter optimisation was a substantial activity, as both models contain a large number of parameters to be tuned. We performed grid searches over key parameters such as learning rate, batch size and hidden layer size, and we were able to add around 15% accuracy by finding better local minima.

The key parameters for the optimisation were: learning rate, batch size, hidden layer size and embedding size.
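A grid search over these parameters can be sketched as below. This is a minimal illustration, not the search code used for the reported results: `grid_search` is a hypothetical helper, the grid values are examples, and `toy_objective` is a synthetic function standing in for "train the model and return validation accuracy".

```python
from itertools import product


def grid_search(evaluate, grid):
    """Try every combination in `grid` and return the best one.

    `evaluate` takes the parameters as keyword arguments and returns a
    score where larger is better.
    """
    names = sorted(grid)
    best_score, best_params = float("-inf"), None
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        score = evaluate(**params)
        if score > best_score:
            best_score, best_params = score, params
    return best_params, best_score


def toy_objective(learning_rate, batch_size, hidden_size, embedding_size):
    # Synthetic score, NOT a measured accuracy: peaks at an arbitrary point.
    return -(abs(learning_rate - 0.01) + abs(batch_size - 45)
             + abs(hidden_size - 134) + abs(embedding_size - 30))


example_grid = {
    "learning_rate": [0.1, 0.01, 0.001],
    "batch_size": [25, 45, 65],
    "hidden_size": [64, 134, 256],
    "embedding_size": [30, 60],
}
best_params, best_score = grid_search(toy_objective, example_grid)
```

The number of combinations is the product of the list lengths (54 here), so the cost grows quickly with every added parameter.
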

This process had not been done for the stub model; with a quick optimisation of the learning rate we were able to improve the stub's performance to 40%. The stub does not have many other parameters, which reduces the amount of optimisation required.

### 2.4. Model performance
<table style="width:80%">
  <tr>
    <th>Model</th>
    <th>Description</th>
    <th>Accuracy</th>
  </tr>
  <tr>
    <td><b>LSTM model</b></td>
    <td>Our model using a single LSTM cell to encode each sentence of a story individually.</td>
    <td>55.5% (baseline)</td>
  </tr>
  <tr>
    <td><b>Stacked LSTM</b></td>
    <td>Added a second LSTM cell to the dynamic RNN. Introducing another cell had similar results to increasing the hidden layer size of the cell itself. The results improved only marginally, at the cost of substantially longer training times.</td>
    <td>54.2%</td>
  </tr>
  <tr>
    <td><b>Stub</b></td>
    <td>Simple linear model with embeddings</td>
    <td>36%</td>
  </tr>
  <tr>
    <td><b>Sequence-to-sequence + Multi-layered Perceptron</b></td>
    <td>Seq2seq model trained on sentences in the correct order, followed by a 2-layer perceptron that scores the permutations of a story.</td>
    <td>11.33%</td>
  </tr>
</table>


### 2.5. Error analysis

#### 2.5.1 Confusion matrix
The confusion matrix shows that the first sentence is the most accurately predicted (85%), followed by the second and the last sentences (50-53%). Sentences 3 and 4 are the least accurately predicted (30% each).

The first sentence's high accuracy is likely to be caused by it having the most distinct sentence semantics. The first sentence sets the context for the rest of the story and thus may be more distinguishable. For example, the first sentence is restricted to identifying a person by name, whereas other sentences can use 'he', 'she', 'him', etc.

Sentences 3 and 4 are in the middle of the story and as such are likely to be the most interchangeable, as this is where the wide-ranging story topics are determined. This results in a weaker pattern for the NN to identify and therefore lower accuracy.

#### 2.5.2 Precision, Recall and F1
The model's precision and recall agree with the above trend, showing a high recall (88%) for sentence 1 and the lowest accuracy for sentences 2 and 3 (30%).
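Per-position precision, recall and F1 can be derived from a confusion dictionary keyed by `(predicted, gold)` pairs, which is the layout produced by the `confusion_matrix` function defined earlier. The sketch below is an assumption about how such figures could be computed, not the `statnlpbook` implementation, and the counts in the example are invented.

```python
def per_position_prf(confusion, n_positions=5):
    """Compute (precision, recall, F1) for each position label.

    `confusion` maps (predicted, gold) -> count.
    """
    results = {}
    for pos in range(n_positions):
        true_pos = confusion.get((pos, pos), 0)
        predicted_total = sum(confusion.get((pos, g), 0) for g in range(n_positions))
        gold_total = sum(confusion.get((p, pos), 0) for p in range(n_positions))
        precision = true_pos / predicted_total if predicted_total else 0.0
        recall = true_pos / gold_total if gold_total else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        results[pos] = (precision, recall, f1)
    return results


# Invented counts for a two-position toy problem.
toy_confusion = {(0, 0): 8, (1, 0): 2, (0, 1): 2, (1, 1): 8}
toy_scores = per_position_prf(toy_confusion, n_positions=2)
```

Recall for a position is the fraction of sentences that truly belong there and were placed there, so it corresponds to the per-position accuracy read off the confusion matrix.
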


### 2.6. Further improvements
- Improved sequence-to-sequence modelling
- Multi-layered LSTMs - Additional layers of LSTM cells enable more patterns to be learnt from the data. We stacked LSTM cells in TensorFlow (using rnn_cell.MultiRNNCell); however, it worsened performance. Further work could look at optimising the implementation.
- Use third-party CPUs to perform a grid search over all parameters

### Appendix A. References
Lajanugen Logeswaran, Honglak Lee & Dragomir Radev. 2017. Sentence Ordering using Recurrent Neural Networks

Jeffrey Pennington, Richard Socher, Christopher D. Manning. 2014. GloVe: Global Vectors for Word Representation (http://www-nlp.stanford.edu/pubs/glove.pdf)

TensorFlow. 2017. Vector Representations of Words (https://www.tensorflow.org/tutorials/word2vec/)

Diederik P. Kingma and Jimmy Lei Ba. 2015. Adam: A Method for Stochastic Optimization (https://arxiv.org/pdf/1412.6980.pdf)


## <font color='red'>Assessment 2</font>: Assess Description (30 pts) 

We will mark the description along the following dimensions: 

* Clarity (10pts: very clear, 0pts: we can't figure out what you did, or you did nothing)
* Creativity (10pts: we could not have come up with this, 0pts: Use only the provided model)
* Substance (10pts: implemented complex state-of-the-art classifier, compared it to a simpler model, 0pts: Only use what is already there)

## <font color='orange'>Mark</font>:  Your solution to Task 2 is marked with ** __ points**.
---

## <font color='orange'>Final mark</font>: Your solution to Assignment 3 is marked with ** __ points**.