## Sentiment Classification: classifying IMDB reviews

In this task, you will learn how to process text data and how to train neural networks with limited input text data using pre-trained embeddings for sentiment classification (classifying a review document as "positive" or "negative" based solely on the text content of the review).

We will use the `Embedding` layer in Keras to represent text input. The `Embedding` layer is best understood as a dictionary mapping integer indices (which stand for specific words) to dense vectors. It takes integers as input, looks them up in an internal dictionary, and returns the associated vectors. It's effectively a dictionary lookup.
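The lookup described above can be sketched in plain NumPy. This is only an illustration of the idea, not the Keras implementation; the sizes and the names `embedding_table` and `batch` are made up for the example.

```python
import numpy as np

# Hypothetical sizes, chosen only for illustration.
vocab_size = 6      # number of distinct integer indices
embedding_dim = 4   # length of each dense vector

rng = np.random.default_rng(0)
# The "internal dictionary": one row of weights per integer index.
embedding_table = rng.normal(size=(vocab_size, embedding_dim))

# A batch of 2 sequences, each holding 3 word indices.
batch = np.array([[1, 5, 2],
                  [3, 3, 0]])

# Looking up the indices is plain row indexing into the table.
embedded = embedding_table[batch]
print(embedded.shape)  # (2, 3, 4)
```

The same index always returns the same vector: the two `3`s in the second sequence map to identical rows.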

The `Embedding` layer takes as input a 2D tensor of integers, of shape `(samples, sequence_length)`, where each entry is a sequence of integers. It can embed sequences of variable lengths, so, for instance, we could feed the layer batches of shape `(32, 10)` (a batch of 32 sequences of length 10) or `(64, 15)` (a batch of 64 sequences of length 15). All sequences in a batch must have the same length, though (since we need to pack them into a single tensor), so sequences that are shorter than others should be padded with zeros, and sequences that are longer should be truncated.
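As a rough sketch of what padding and truncation do, here is a small NumPy helper. The function `pad_or_truncate` is hypothetical and written only for this illustration; in the notebook, Keras's `pad_sequences` plays this role. The sketch pads and truncates at the end of each sequence, matching the `padding="post", truncating="post"` arguments used later (the Keras defaults are `"pre"`).

```python
import numpy as np

def pad_or_truncate(sequences, maxlen, pad_value=0):
    """Hypothetical helper: return an array of shape (len(sequences), maxlen)."""
    out = np.full((len(sequences), maxlen), pad_value, dtype="int64")
    for row, seq in enumerate(sequences):
        kept = seq[:maxlen]             # drop anything beyond maxlen
        out[row, :len(kept)] = kept     # remaining positions keep pad_value
    return out

sequences = [[4, 9], [7, 1, 3, 8, 2], [5, 6, 2]]
padded = pad_or_truncate(sequences, maxlen=4)
print(padded)
# [[4 9 0 0]
#  [7 1 3 8]
#  [5 6 2 0]]
```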

This layer returns a 3D floating point tensor, of shape `(samples, sequence_length, embedding_dimensionality)`. Such a 3D tensor can then be processed by an RNN layer or a 1D convolution layer.

You can instantiate the `Embedding` layer by randomly initialising its weights (its internal dictionary of token vectors). During training, these word vectors will be gradually adjusted via backpropagation, structuring the space into something that the downstream model can exploit. Once fully trained, your embedding space will show a lot of structure, specialized for the specific problem you were training your model for. You can also instantiate the `Embedding` layer by initialising its weights with pre-trained word embeddings, such as GloVe word embeddings pre-trained on Wikipedia articles.

#### a) Download the IMDB data as raw text

First, create a "data" directory, then head to `http://ai.stanford.edu/~amaas/data/sentiment/` and download the raw IMDB dataset (if the URL isn't working anymore, just Google "IMDB dataset"). Save it into the "data" directory. Uncompress it. Store the individual reviews into a list of strings, one string per review, and also collect the review labels (positive / negative) into a separate `labels` list.

In [147]:
import os
import random
# write your code here
train_l=[]
test_l=[]

train_path='data/aclImdb/train/'
train_pos_filenames = [train_path+'pos/'+i for i in os.listdir(train_path+'pos')][:500]
train_neg_filenames= [train_path+'neg/'+i for i in os.listdir(train_path+'neg')][:500]
for i in range(2*len(train_pos_filenames)):
    if i%2==0:
        t=train_pos_filenames
        label='positive'
    else:
        t=train_neg_filenames
        label='negative'
    with open(t[i//2], 'r', encoding='utf8') as f:
        train_l.append((f.read(),label))


test_path='data/aclImdb/test/'
test_pos_filenames = [test_path+'pos/'+i for i in os.listdir(test_path+'pos')][:5]
test_neg_filenames= [test_path+'neg/'+i for i in os.listdir(test_path+'neg')][:5]
for i in range(2*len(test_pos_filenames)):
    if i%2==0:
        t=test_pos_filenames
        label='positive'
    else:
        t=test_neg_filenames
        label='negative'
    with open(t[i//2], 'r', encoding='utf8') as f:
        test_l.append((f.read(),label))

random.shuffle(train_l)
random.shuffle(test_l)

train_x, train_y = zip(*train_l)
test_x, test_y = zip(*test_l)


#### b) Pre-process the review documents 

Pre-process review documents by tokenisation and split the data into the training and testing sets. You can restrict the training data to the first 1000 reviews and only consider the top 5,000 words in the dataset. You can also cut reviews after 100 words (that is, each review contains a maximum of 100 words).
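To make the tokenisation step concrete, here is a minimal pure-Python sketch of the idea: keep only the most frequent words, give each an integer index starting at 1 (0 stays free for padding), and cut each review to a maximum length. The helpers `build_word_index` and `texts_to_sequences` and the regular expression are assumptions made for this illustration; they are not the Keras `Tokenizer` API.

```python
import re
from collections import Counter

TOKEN_PATTERN = re.compile(r"[a-z']+")  # assumed, very simple word pattern

def build_word_index(texts, max_words):
    """Hypothetical helper: map the max_words most frequent words to 1..max_words."""
    counts = Counter(
        word for text in texts for word in TOKEN_PATTERN.findall(text.lower())
    )
    return {word: i + 1 for i, (word, _) in enumerate(counts.most_common(max_words))}

def texts_to_sequences(texts, word_index, maxlen):
    """Hypothetical helper: drop out-of-vocabulary words, then keep the first maxlen."""
    return [
        [word_index[w] for w in TOKEN_PATTERN.findall(text.lower()) if w in word_index][:maxlen]
        for text in texts
    ]

texts = ["A great, great movie", "Not a great movie at all"]
word_index = build_word_index(texts, max_words=3)
sequences = texts_to_sequences(texts, word_index, maxlen=3)
print(word_index)
print(sequences)
```

In the exercise, `max_words` would be 5,000 and `maxlen` would be 100.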

In [183]:
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np
import scipy

# write your code here

#the list was compiled by first looking at the words with the top tfidf scores
#words indicating negation (e.g. not) and exaggeration (e.g. very) were removed from the list
stopwords=['the', 'and', 'of', 'to', 'is', 'br', 'it', 'in', 'this', 'that', 'was','with', 'as', 'for', 'you', 
'on', 'are', 'he', 'have', 'his', 'be', 'one', 'at', 'they', 'all', 'who', 'by', 'from', 'so', 'an', 'off',
'there', 'her', 'if','out', 'or', 'about', 'just', 'has', 'what', 'can', 'some', 'when', 'she', 'up', 'my',
'their', 'which', 'me', 'were', 'had', 'we', 'well', 'get', 'than', 'because', 'will', 'did', 'your','over',
'been', 'its', 'other', 'do', 'also', 'into', 'him', 'how', 'too', 'them', 'after', 'any', 'then', 'before','those']

tk=train_x
#tk.extend(test_x)
tfidf = TfidfVectorizer(max_features=6000, stop_words=stopwords)
idf=tfidf.fit_transform(tk)
#convert the sparse tfidf matrix to a dense 2d matrix and list the mean tfidf for each word
words_tfidf=idf.todense().mean(axis=0).tolist()[0]
#match each tfidf with its word and put the tuples in a list
#(sorted vocabulary order matches the column order of the tfidf matrix)
top_tokens=sorted(tfidf.vocabulary_)
d=list(zip(words_tfidf, top_tokens))
#d=sorted(d, key=lambda x: 1-x[0])


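#convert each review into a sequence of integer word indices before padding
#(index 0 is reserved for padding, so the word at position i in top_tokens gets index i+1)
analyzer=tfidf.build_analyzer()
word_index={w:i+1 for i,w in enumerate(top_tokens)}
train_x=[[word_index[w] for w in analyzer(doc) if w in word_index] for doc in train_x]
test_x=[[word_index[w] for w in analyzer(doc) if w in word_index] for doc in test_x]

train_x=pad_sequences(train_x, maxlen = 100, padding = "post", truncating = "post", value = 0)
test_x=pad_sequences(test_x, maxlen = 100, padding = "post", truncating = "post", value = 0)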



#### c) Download the GloVe word embeddings and map each word in the dataset into its pre-trained GloVe word embedding.


First go to `https://nlp.stanford.edu/projects/glove/` and download the pre-trained embeddings from 2014 English Wikipedia into the "data" directory. It's an 822 MB zip file named `glove.6B.zip`. Unzip it; the file used here, `glove.6B.100d.txt`, contains 100-dimensional embedding vectors for 400,000 words (or non-word tokens).

Parse the un-zipped file (it's a `txt` file) to build an index mapping words (as strings) to their vector representation (as number vectors).
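As a sketch of the parsing step: each line of the file holds a token followed by its space-separated vector components. The example below reads two made-up lines from an in-memory buffer rather than the real file, so the words and numbers are placeholders; `embeddings_index` is a name chosen for this illustration.

```python
import io
import numpy as np

# Two made-up lines in the GloVe text layout (the real 100d file has
# 100 numbers after each token).
sample_file = io.StringIO(
    "movie 0.1 -0.2 0.3\n"
    "great 0.5 0.4 -0.1\n"
)

embeddings_index = {}
for line in sample_file:
    parts = line.split()
    # First field is the token, the rest are the vector components.
    embeddings_index[parts[0]] = np.asarray(parts[1:], dtype="float32")

print(sorted(embeddings_index))         # ['great', 'movie']
print(embeddings_index["movie"].shape)  # (3,)
```

With the real data, `sample_file` would be replaced by `open(path, encoding="utf8")` on the unzipped text file.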

Build an embedding matrix that will be loaded into an `Embedding` layer later. It must be a matrix of shape `(max_words, embedding_dim)`, where row `i` contains the `embedding_dim`-dimensional vector for the word of index `i` in our reference word index (built during tokenization). Note that the index `0` is not supposed to stand for any word or token -- it's a placeholder.
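A minimal sketch of the matrix construction, using a tiny made-up `word_index` and `embeddings_index` (both hypothetical stand-ins for the real tokenizer index and parsed GloVe vectors). Words without a pre-trained vector keep an all-zero row, as does the placeholder index `0`.

```python
import numpy as np

max_words = 4        # 3 words plus the placeholder index 0
embedding_dim = 3    # 100 in the exercise

# Made-up reference word index and pre-trained vectors.
word_index = {"great": 1, "movie": 2, "zzyzx": 3}
embeddings_index = {
    "great": np.array([0.5, 0.4, -0.1]),
    "movie": np.array([0.1, -0.2, 0.3]),
}

embedding_matrix = np.zeros((max_words, embedding_dim))
for word, i in word_index.items():
    if i < max_words:
        vector = embeddings_index.get(word)
        if vector is not None:
            # Row i holds the vector of the word whose index is i.
            embedding_matrix[i] = vector

print(embedding_matrix)
```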

In [None]:
# write your code here
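glove_path='./glove.6B.100d.txt'
embedding_dim=100
wn=len(top_tokens)+1  #+1 because index 0 is reserved for padding
embed_matrix=np.zeros((wn, embedding_dim))
#dictionary from word to its row in the matrix (much faster than list.index)
token_index={word:i+1 for i,word in enumerate(top_tokens)}

with open(glove_path, encoding="utf8" ) as f:
    for line in f:
        splitLine = line.split()
        word = splitLine[0]
        if word in token_index:
            #the remaining fields are the vector components
            embedding = np.asarray(splitLine[1:], dtype="float32")
            embed_matrix[token_index[word],:] = embedding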

    



#### d) Build and train a simple Sequential model

The model contains an `Embedding` layer with a maximum of 10,000 tokens and an embedding dimensionality of 100. Initialise the `Embedding` layer with the pre-trained GloVe word vectors. Set the maximum length of each review to 100. Flatten the 3D embedding output to 2D and add a `Dense` layer as the classifier. Train the model with the `rmsprop` optimiser. Freeze the embedding layer by setting its `trainable` attribute to `False` so that its weights are not updated during training.
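To see what the flattening step does to the shapes, here is a NumPy sketch of the forward pass only. It is not Keras code and does no training. The sizes are small stand-ins for 10,000 / 100 / 100, the random matrix stands in for the frozen GloVe weights, and the single sigmoid output unit is an assumption about the classifier.

```python
import numpy as np

rng = np.random.default_rng(0)
max_words, embedding_dim, maxlen = 50, 8, 5   # stand-ins for 10,000 / 100 / 100

# Stand-in for the frozen pre-trained embedding weights.
embedding_matrix = rng.normal(size=(max_words, embedding_dim))

# A batch of 4 padded reviews, each a row of maxlen word indices.
x = rng.integers(0, max_words, size=(4, maxlen))

embedded = embedding_matrix[x]           # (4, 5, 8): samples, maxlen, embedding_dim
flat = embedded.reshape(len(x), -1)      # (4, 40): samples, maxlen * embedding_dim

# Dense classifier with one sigmoid unit (assumed); only these weights would train.
w = rng.normal(scale=0.1, size=(maxlen * embedding_dim, 1))
b = np.zeros(1)
probs = 1.0 / (1.0 + np.exp(-(flat @ w + b)))

print(embedded.shape, flat.shape, probs.shape)  # (4, 5, 8) (4, 40) (4, 1)
```

Because the embedding weights are frozen, the only trainable parameters in this setup are those of the dense layer.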

In [None]:
from keras.models import Sequential
from keras.layers import Embedding, Flatten, Dense

# write your code here


embed_layer = Embedding(len(top_tokens) + 1, 100, weights=[embed_matrix],input_length=100,trainable=False)

#### e) Plot the training and validation loss and accuracies and evaluate the trained model on the test set.

What do you observe from the results?

In [None]:
import matplotlib.pyplot as plt

# write your code here

#### f) Add an LSTM layer into the simple neural network architecture and re-train the model on the training set, plot the training and validation loss/accuracies, also evaluate the trained model on the test set and report the result.

In [None]:
from keras.layers import LSTM

# write your code here