# HW 2: Data Preparation for Sentiment Classification

In this homework we will prepare the IMDB movie review sentiment dataset. We will prepare it to fit a model that will predict whether a new review has a positive or negative sentiment. 

**Start by downloading the IMDB_Dataset from the .csv file into a pandas DataFrame**

In [1]:
import pandas as pd
import numpy as np

#Download the dataset into a Pandas DataFrame and display the first 5 rows
imdb = pd.read_csv("IMDB_Dataset.csv")
imdb.head()

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive


We have to process the data first, so that we have fixed length sequences.

<b>We want to split the dataset into reviews and labels. </b>

In [2]:
imdb_x = imdb['review']
imdb_y = imdb['sentiment']

In [3]:
# Paste the toBinary function created in HW 1 from this hw set (week 2)
def toBinary(data, positive):
    data.where(data == positive, 0, inplace=True)
    data.where(data != positive, 1, inplace=True)

**Use the toBinary method to tranform the sentiment column into binary, 1 for positive and 0 for negative.**

In [4]:
toBinary(imdb_y, 'positive')
imdb.head()

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,1
1,A wonderful little production. <br /><br />The...,1
2,I thought this was a wonderful way to spend ti...,1
3,Basically there's a family where a little boy ...,0
4,"Petter Mattei's ""Love in the Time of Money"" is...",1


"Lemmatization is the process of converting a word to its base form. The difference between stemming and lemmatization is, lemmatization considers the context and converts the word to its meaningful base form, whereas stemming just removes the last few characters, often leading to incorrect meanings and spelling errors."
    https://www.machinelearningplus.com/nlp/lemmatization-examples-python/

<b>Lemmatize the sentences using any library from the article. Make sure to filter out non-alphabetical characters. </b>

In [5]:
import nltk
nltk.download('wordnet')
from nltk.stem import WordNetLemmatizer 

def Lemmatize(data):
    # Init the Wordnet Lemmatizer
    data = data.str.split(" |<br /><br />|[^a-zA-Z]") # splits the data from sentences to words, feel free to change
    lemmatizer = WordNetLemmatizer()
    data = [' '.join([lemmatizer.lemmatize(word) for word in sentence]) for sentence in data ]
    
    return data

imdb_x = Lemmatize(imdb_x)

[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/siddarth/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


The data has to be put into integer form, so each integer represents a unique word, 0 represents a PAD character, 1 represents a START character and 2 represents a character that is unknown because it is not in the top `num_words`. 
Thus 3 represents the first real word. 

Also the words should be in decreasing order of frequency, so the word that 3 represents is the most common word in the dataset. 

Complete CreateDict which will take in a data column <br>
    1) Create a Dict that maps {Word: Apperances in dataset} 
       <i> Do not implement dictionary keys for PAD, START, and unknown characters (except "" see hint), this will be done later. </i>
<br>
    2) Choose the top N most recurring words and give ascending indexes starting at 3 <br>

In [6]:
numWords = 2000

def CreateDict(data, topN):
    occurrences = {}
    for sentence in data:
        words = sentence.split(' ')
        for word in words:
            occurrences[word] = occurrences.get(word, 0) + 1
    
    occurrences = sorted(occurrences.items(), key=lambda x: x[1], reverse=True)[:topN-2]
    
    index = {}
    for i in range(0, topN-2):
        index[occurrences[i][0]] = i + 2
    
    return index
    
wordCounter = CreateDict(imdb_x, numWords)

Complete  replaceByIndex which will replace known words with their index and unknown words with a 2.

In [7]:
def replaceByIndex(data, wordCounter):
    newData = []
    for sentence in data:
        words = sentence.split(' ')
        translation = []
        for word in words:
            if word in wordCounter:
                translation.append(wordCounter[word])
            else:
                translation.append(2)
        newData.append(translation)
    return newData

imdb_x = replaceByIndex(imdb_x, wordCounter)

#### We want to process the data into NumPy arrays of sequences that are all length 200. We will use these criteria: 
* We want to add a 1 at the beginning of every review to signal the beginning of the text.
* If a given sequence is shorter than 200 tokens we want to pad the beginning of the sequence out with zeros so that the sequence is 200 long. 
* Else if the sequence is longer than 200 (including the starting 1) we want to cut it down to length 200. 


In [8]:
def process_data(data):
    processed = []
    for sentence in data:
        sentence.insert(0, 1)
        if len(sentence) < 200:
            for i in range(200 - len(sentence)):
                sentence.insert(0, 0)
        sentence = sentence[:200]
        processed.append(np.asarray(sentence))
    processed = np.asarray(processed)
    return pd.DataFrame(processed)

imdb_x = process_data(imdb_x)

<b> Separate the dataset into train and test sets, test set should be 1/3 of the set.</b> <p>
This sklearn method will make your life much easier: 
[train_test_split](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html)

In [9]:
from sklearn.model_selection import train_test_split

x_train_proc, x_test_proc, y_train, y_test = train_test_split(imdb_x, imdb_y, test_size=0.33)

At this point **your job is done!!!** Congratulations, if done correctly, the sentences are processed and ready to be used as features and labels to train a Recurrent Neural Network (LSTM). You will learn how to do this yourself in the next couple weeks. For now, you can just sit back and "follow along" as we build this model using Keras and then train it. 

The first thing we will do is initialize the model using Sequential.

In [10]:
import keras
from keras import Sequential

imdb_model = Sequential()

Using TensorFlow backend.
  _np_qint8 = np.dtype([("qint8", np.int8, 1)])
  _np_quint8 = np.dtype([("quint8", np.uint8, 1)])
  _np_qint16 = np.dtype([("qint16", np.int16, 1)])
  _np_quint16 = np.dtype([("quint16", np.uint16, 1)])
  _np_qint32 = np.dtype([("qint32", np.int32, 1)])
  np_resource = np.dtype([("resource", np.ubyte, 1)])
  _np_qint8 = np.dtype([("qint8", np.int8, 1)])
  _np_quint8 = np.dtype([("quint8", np.uint8, 1)])
  _np_qint16 = np.dtype([("qint16", np.int16, 1)])
  _np_quint16 = np.dtype([("quint16", np.uint16, 1)])
  _np_qint32 = np.dtype([("qint32", np.int32, 1)])
  np_resource = np.dtype([("resource", np.ubyte, 1)])


Now we want to add an embedding layer. The purpose of an embedding layer is to take a sequence of integers representing words in our case and turn each integer into a dense vector in some embedding space. (This is essentially the idea of Word2Vec https://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf). We want to create an embedding layer with vocab size equal to the max num words we allowed when we loaded the data (in this case 1000), and a fixed dense vector of size 32. Then we have to specify the max length of our sequences and we want to mask out zeros in our sequence since we used zero to pad.
Use the docs for embedding layer to fill out the missing entries: https://keras.io/layers/embeddings/

In [11]:
from keras.layers.embeddings import Embedding
imdb_model.add(Embedding(2000, 32, input_length=200, mask_zero=True))

#### **(a)** We add an LSTM layer with 32 outputs, then a Dense layer with 16 neurons, then a relu activation, then a dense layer with 1 neuron, then a sigmoid activation. Then we print out the model summary. The Keras documentation is here: https://keras.io/

In [12]:
from keras.layers.recurrent import LSTM
from keras.layers import Dense, Activation
imdb_model.add(LSTM(32))

Instructions for updating:
Use tf.where in 2.0, which has the same broadcast rule as np.where


In [13]:
imdb_model.add(Dense(units=16, activation='relu'))
imdb_model.add(Dense(units=1, activation='sigmoid'))

In [14]:
imdb_model.summary()

Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 200, 32)           64000     
_________________________________________________________________
lstm_1 (LSTM)                (None, 32)                8320      
_________________________________________________________________
dense_1 (Dense)              (None, 16)                528       
_________________________________________________________________
dense_2 (Dense)              (None, 1)                 17        
Total params: 72,865
Trainable params: 72,865
Non-trainable params: 0
_________________________________________________________________


#### **(b)** Now we compile the model with binary cross entropy, and the adam optimizer. We include accuracy as a metric in the compile. Then train the model on the processed data.

In [15]:
imdb_model.compile(loss=keras.losses.binary_crossentropy, optimizer=keras.optimizers.Adam(), metrics=['acc'])

In [16]:
imdb_model.fit(x_train_proc, y_train)


Epoch 1/1


<keras.callbacks.callbacks.History at 0x12fd43a50>

In [17]:
print("Accuracy: ", imdb_model.evaluate(x_test_proc, y_test)[1])

Accuracy:  0.8215757608413696


## If you did the data pre-processing correctly you should be getting around an 80% accuracy. congratulations, that is much better than random! 
<i>If you are getting a test accuracy that is significantly lower, you probably did something wrong, slack your NMEP team or go to office hours to get help sorting it out :) </i>

#### Now we can look at our predictions and the sentences they correspond to.

In [18]:
y_pred = imdb_model.predict(x_test_proc)

In [19]:
word_to_id = wordCounter
word_to_id["<PAD>"] = 0
word_to_id["<START>"] = 1
word_to_id["<UNK>"] = 2

id_to_word = {value:key for key,value in word_to_id.items() if value < 2000}
def get_words(token_sequence):
    return ' '.join(id_to_word[token] for token in token_sequence)

def get_sentiment(y_pred, index):
    return 'Positive' if y_pred[index] else 'Negative'

In [20]:
y_test = [i for i in y_test]
y_pred = np.vectorize(lambda x: int(x >= 0.5))(y_pred)
correct = []
incorrect = []
for i, pred in enumerate(y_pred):
    if y_test[i] == pred:
        correct.append(i)
    else:
        incorrect.append(i)

#### Now we print out one of the sequences we got correct.

In [21]:
print(get_sentiment(y_pred, correct[10]))
print(get_words(x_test_proc[correct[10]]))

Negative
off I <PAD> <PAD> t <PAD> <PAD> chase matter <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> video movie <UNK> <PAD> <PAD> <UNK> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <UNK> <PAD> <UNK> have <PAD> <PAD> but <PAD> <PAD> <PAD> <PAD> which watch very <PAD> year <PAD> town <PAD> never <UNK> is m film <PAD> five <PAD> on <UNK> <PAD> <PAD> DVD <PAD> movie <PAD> <PAD> Michael <PAD> like <PAD> <PAD> <PAD> <PAD> <PAD> my <PAD> to <UNK> <PAD> working <PAD> for <PAD> wa <PAD> <UNK> <UNK> <PAD> a <PAD> re had a <PAD> <PAD> t <PAD> <START> of <UNK> <UNK> <PAD> <UNK> <PAD> t they <PAD> <PAD> <PAD> <PAD> spoiler <PAD> <PAD> <PAD> show <PAD> <PAD> <PAD> it An out so concept I is <PAD> <UNK> a <PAD> <PAD> a <PAD> <PAD> <PAD> <PAD> of have film about <UNK> or if <PAD> previous <UNK> <PAD> <PAD> <UNK> <PAD> <PAD> with that <PAD> sit <PAD> This John <UNK> room <PAD> <PAD> make of knew <PAD> <PAD> in <UNK> had <PAD> <PAD> wa <UNK> <PAD> <PAD> <PAD> sight little some <PAD> <PAD> <UNK> <PAD> <PAD> <PAD> a th

#### And one we got wrong.

In [22]:
print(get_sentiment(y_pred, incorrect[10]))
print(get_words(x_test_proc[incorrect[10]]))

Positive
wa the <PAD> it time <UNK> not are he s this <PAD> that <UNK> wa I <UNK> than you <PAD> <UNK> <PAD> one <PAD> <PAD> movie a if <UNK> cut It and <PAD> <UNK> we effect <UNK> <PAD> <PAD> <UNK> my director <PAD> track <PAD> all it they it to your list the the <PAD> the in <UNK> expected IMDb <PAD> <UNK> <UNK> <UNK> basically violent in <UNK> is <PAD> <PAD> <UNK> <UNK> <PAD> <UNK> awesome <UNK> young wa like <UNK> <UNK> first <UNK> <UNK> <PAD> in watching <UNK> so the <PAD> didn and <UNK> appear but <UNK> insult <PAD> young B do there of <PAD> more <PAD> playing a this <PAD> <UNK> <UNK> <UNK> than filmmaker into <UNK> <UNK> <UNK> money <UNK> <PAD> money <UNK> <PAD> <PAD> <UNK> smile doing old <PAD> think the t <UNK> <UNK> time his <PAD> to been <UNK> <PAD> to an s A is <UNK> <UNK> <UNK> to i the most get movie this half only the <PAD> and to <UNK> by that a it <UNK> <PAD> <PAD> <UNK> gave brief <PAD> and the <PAD> is <PAD> friend However <UNK> expect show <UNK> win le But <UNK> of 

#### As you can see the amount of UNKNOWN characters in the sequence cause by having only 1000 vocab words is hurting our performance. If you want, go back and increase the number of vocab words to 2000 and compare your accuracy 
(If you do so, remember to change your embedding parameter from 1000 to 2000 as well; you should get ~85% accuracy). 

## And that's it! Now you should feel like a data engineering/preprocessing expert :) 