# HW 2: Data Preparation for Sentiment Classification

In this homework we will prepare the IMDB movie review sentiment dataset so that it can be used to fit a model that predicts whether a new review has a positive or negative sentiment. 

**Start by loading the IMDB_Dataset .csv file into a pandas DataFrame**

In [430]:
import pandas as pd
import numpy as np

#Load the dataset into a pandas DataFrame and display the first 5 rows
## YOUR CODE HERE
imdb = pd.read_csv("IMDB_Dataset.csv")

We have to process the data first so that we end up with fixed-length sequences.

<b>We want to split the dataset into reviews and labels. </b>

In [431]:
imdb_x = imdb['review']
imdb_y = imdb['sentiment']

In [432]:
# Paste the toBinary function created in HW 1 from this hw set (week 2)
def toBinary(data, positive):
    ## YOUR CODE HERE
    n = data.apply(lambda x: 1 if (x == positive) else 0)
    imdb[data.name] = n
    return n

**Use the toBinary method to transform the sentiment column into binary, 1 for positive and 0 for negative.**

In [433]:
imdb_y = toBinary(imdb_y, 'positive')
imdb.head()

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,1
1,A wonderful little production. <br /><br />The...,1
2,I thought this was a wonderful way to spend ti...,1
3,Basically there's a family where a little boy ...,0
4,"Petter Mattei's ""Love in the Time of Money"" is...",1


"Lemmatization is the process of converting a word to its base form. The difference between stemming and lemmatization is, lemmatization considers the context and converts the word to its meaningful base form, whereas stemming just removes the last few characters, often leading to incorrect meanings and spelling errors."
    https://www.machinelearningplus.com/nlp/lemmatization-examples-python/
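
To make the distinction in the quote concrete, here is a toy sketch that uses only the standard library. The functions `crude_stem` and `toy_lemmatize` and the tiny `LEMMAS` table are hypothetical stand-ins invented for this illustration; they are not how NLTK or any other library from the article works internally.

```python
# Toy illustration only: real stemmers and lemmatizers are far more sophisticated.

def crude_stem(word):
    """Strip a few common suffixes without checking that the result is a word."""
    for suffix in ("ies", "ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

# Hypothetical hand-written lookup table standing in for a real lexical database.
LEMMAS = {"studies": "study", "was": "be", "better": "good", "movies": "movie"}

def toy_lemmatize(word):
    """Map a word to its dictionary base form if known, else leave it unchanged."""
    return LEMMAS.get(word, word)

for w in ["studies", "was", "better", "movies"]:
    print(w, "->", crude_stem(w), "|", toy_lemmatize(w))
```

The stemmer turns "studies" into "stud" and "movies" into "mov", which are not words, while the lookup returns "study" and "movie".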

<b>Lemmatize the sentences using any library from the article. Make sure to filter out non-alphabetical characters. </b>

In [434]:
## YOUR IMPORT STATEMENTS
import nltk
from nltk.stem import WordNetLemmatizer 
import re

In [435]:
def Lemmatize(data):
    # Split each review into words on spaces, commas, HTML line breaks,
    # quotes and exclamation marks (feel free to change the delimiters)
    data = data.str.split(" |,|<br /><br />|\"|\'|!")
    # Init the Wordnet Lemmatizer
    lemmatizer = WordNetLemmatizer()
    # Matches every non-alphabetical character
    regex = re.compile('[^a-zA-Z]')

    # YOUR CODE HERE
    all_reviews = []
    for s in data:
        # Strip non-alphabetical characters from each word, then lemmatize it
        sentence = [lemmatizer.lemmatize(regex.sub('', w)) for w in s]
        all_reviews.append(sentence)
    return pd.Series(all_reviews)

imdb_x = Lemmatize(imdb_x)

In [436]:
imdb_x = imdb_x.rename('review')
print(imdb_x)

0        [One, of, the, other, reviewer, ha, mentioned,...
1        [A, wonderful, little, production, , The, film...
2        [I, thought, this, wa, a, wonderful, way, to, ...
3        [Basically, there, s, a, family, where, a, lit...
4        [Petter, Mattei, s, , Love, in, the, Time, of,...
                               ...                        
49995    [I, thought, this, movie, did, a, down, right,...
49996    [Bad, plot, , bad, dialogue, , bad, acting, , ...
49997    [I, am, a, Catholic, taught, in, parochial, el...
49998    [I, m, going, to, have, to, disagree, with, th...
49999    [No, one, expects, the, Star, Trek, movie, to,...
Name: review, Length: 50000, dtype: object


The data has to be converted to integer form, where each integer represents a unique word. Three indexes are reserved: 0 represents a PAD token, 1 represents a START token, and 2 represents an unknown word, i.e. one that is not among the top `num_words` most frequent words. 
Thus 3 represents the first real word. 

The words should also be indexed in decreasing order of frequency, so the word that 3 represents is the most common word in the dataset. 

Complete CreateDict, which takes in a data column and should: <br>
    1) Create a Dict that maps {Word: Appearances in dataset} 
       <i> Do not implement dictionary keys for PAD, START, and unknown characters (except "" see hint), this will be done later. </i>
<br>
    2) Choose the top N most frequent words and give them ascending indexes starting at 3 <br>
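
As a rough illustration of the two steps above, the following sketch builds a frequency-ranked vocabulary for a made-up toy corpus using `collections.Counter`. The name `build_vocab` and the toy `reviews` list are assumptions for this example, and the sketch simply skips the empty string instead of mapping it to an index as the hint suggests.

```python
from collections import Counter

# Toy tokenized reviews (made up for illustration).
reviews = [["good", "movie", "good"], ["bad", "movie"], ["good", "plot"]]

RESERVED = 3  # indexes 0, 1 and 2 are kept for PAD, START and unknown

def build_vocab(tokenized_reviews, top_n):
    """Return {word: index} for the most frequent words, indexes starting at 3."""
    counts = Counter(w for review in tokenized_reviews for w in review if w != "")
    most_common = counts.most_common(top_n - RESERVED)
    return {word: i for i, (word, _) in enumerate(most_common, start=RESERVED)}

vocab = build_vocab(reviews, top_n=5)
print(vocab)  # {'good': 3, 'movie': 4}
```

With `top_n=5` only two real words fit, because three of the five indexes are reserved.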

In [437]:
numWords = 1000

def CreateDict(data, topN):
    # YOUR CODE HERE
     
    ##Hint: "" (or the empty string) is actually 
    ### the most common word but we want to cast it as an unknown word, what index should we index it as then?
    counts = {}
    for review in data:
        for w in review:
            if w == "":
                continue  # the empty string is handled separately below
            counts[w] = counts.get(w, 0) + 1

    # Sort by descending frequency and keep the topN - 3 most common words
    # (indexes 0, 1 and 2 are reserved for PAD, START and unknown)
    ranked = sorted(counts.items(), key=lambda ele: ele[1], reverse=True)[:topN - 3]

    # Assign ascending indexes starting at 3
    d_new = {word: i for i, (word, _) in enumerate(ranked, start=3)}
    d_new[''] = 2  # treat the empty string as an unknown word
    return d_new
    
wordCounter = CreateDict(imdb_x, numWords)

In [438]:
print(len(wordCounter))
maximum_val = max(wordCounter, key=wordCounter.get)
print(wordCounter[maximum_val])

998
999


Complete replaceByIndex, which will replace known words with their index and unknown words with a 2.
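
A minimal sketch of the idea on a single token list, assuming a small hypothetical vocabulary of the kind CreateDict returns (the name `encode` is made up for this example):

```python
UNKNOWN = 2  # index reserved for words outside the vocabulary

# Hypothetical vocabulary of the kind CreateDict returns.
vocab = {"good": 3, "movie": 4}

def encode(tokens, vocab):
    """Replace each token by its index, falling back to UNKNOWN."""
    return [vocab.get(tok, UNKNOWN) for tok in tokens]

encoded = encode(["good", "plot", "movie"], vocab)
print(encoded)  # [3, 2, 4]
```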

In [439]:
def replaceByIndex(data, wordCounter):
    
    # YOUR CODE HERE
    
    def replace(lst):
        # Words missing from wordCounter are unknown and map to index 2
        return [wordCounter.get(w, 2) for w in lst]
        
    return data.apply(replace)

imdb_x = replaceByIndex(imdb_x, wordCounter)

In [440]:
imdb_x.head()

0    [295, 6, 3, 82, 2, 45, 2, 12, 131, 166, 46, 2,...
1    [124, 430, 132, 353, 2, 18, 2, 2, 8, 55, 2, 55...
2    [11, 192, 14, 16, 4, 430, 91, 7, 2, 48, 23, 4,...
3    [2, 57, 13, 4, 238, 130, 4, 132, 356, 2, 102, ...
4    [2, 2, 13, 2, 2, 9, 3, 2, 6, 2, 2, 8, 4, 2, 2,...
Name: review, dtype: object

#### We want to process the data into NumPy arrays of sequences that are all length 200. We will use these criteria: 
* We want to add a 1 at the beginning of every review to signal the beginning of the text.
* If a given sequence is shorter than 200 tokens we want to pad the beginning of the sequence out with zeros so that the sequence is 200 long. 
* Else if the sequence is longer than 200 (including the starting 1) we want to cut it down to length 200. 
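
The three criteria above can be sketched on a tiny example. To keep the output readable this uses a target length of 5 instead of 200; the function name `to_fixed_length` is an assumption for this illustration.

```python
PAD, START = 0, 1

def to_fixed_length(seq, length):
    """Prepend START, then left-pad with PAD or truncate to exactly `length` items."""
    seq = [START] + list(seq)  # copy, so the caller's list is untouched
    if len(seq) > length:
        return seq[:length]    # keep the first `length` tokens
    return [PAD] * (length - len(seq)) + seq

short = to_fixed_length([7, 8], length=5)
long_ = to_fixed_length([7, 8, 9, 10, 11, 12], length=5)
print(short)  # [0, 0, 1, 7, 8]
print(long_)  # [1, 7, 8, 9, 10]
```

Note that the padding goes in front of the START token, so a short review ends with its real words.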


In [441]:
def process_data(data):
    
    #YOUR CODE HERE 
    def proc(lst):
        # Build a new list so the input lists are not modified in place
        lst = [1] + lst
        if len(lst) > 200:
            lst = lst[:200]
        elif len(lst) < 200:
            # Pad the beginning with zeros up to length 200
            lst = [0] * (200 - len(lst)) + lst
        return lst
    
    return data.apply(proc)

imdb_x = process_data(imdb_x)

In [442]:
imdb_x.head()

0    [1, 295, 6, 3, 82, 2, 45, 2, 12, 131, 166, 46,...
1    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
2    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
3    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
4    [1, 2, 2, 13, 2, 2, 9, 3, 2, 6, 2, 2, 8, 4, 2,...
Name: review, dtype: object

<b> Separate the dataset into train and test sets, test set should be 1/3 of the set.</b> <p>
This sklearn method will make your life much easier: 
[train_test_split](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html)
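
Conceptually, a random train/test split shuffles the examples and holds some of them out. The sketch below is a plain-Python stand-in written for illustration, not scikit-learn's implementation; `simple_split` and its parameters are made-up names.

```python
import random

def simple_split(items, test_fraction, seed=0):
    """Shuffle indexes and hold out a fraction of the items as a test set."""
    indexes = list(range(len(items)))
    random.Random(seed).shuffle(indexes)
    n_test = round(len(items) * test_fraction)
    test_idx, train_idx = indexes[:n_test], indexes[n_test:]
    return [items[i] for i in train_idx], [items[i] for i in test_idx]

data = list(range(12))
train, test = simple_split(data, test_fraction=1 / 3)
print(len(train), len(test))  # 8 4
```

In the homework the features and labels must be split with the same shuffle, which `train_test_split` handles when both arrays are passed in one call.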

In [443]:
## YOUR CODE HERE
from sklearn.model_selection import train_test_split

# Stack the 200-long lists into a 2-D array of shape (num_reviews, 200)
x = np.array(imdb_x.tolist())
y = np.asarray(imdb_y).astype(np.float32)

# Hold out one third of the reviews (an integer test_size is a sample count)
size = imdb_x.size
x_train_proc, x_test_proc, y_train, y_test = train_test_split(x, y, test_size=round(size/3))


In [444]:
print(x_train_proc.shape)
print(x_train_proc)

(33333, 200)
[[  0   0   0 ...   2 341   2]
 [  0   0   0 ...   2   2   2]
 [  0   0   0 ...  71   2  58]
 ...
 [  1  49   8 ... 616  30   2]
 [  0   0   0 ... 403   9 467]
 [  0   1  49 ...  58   3   2]]


At this point **your job is done!!!** Congratulations, if done correctly, the sentences are processed and ready to be used as features and labels to train a Recurrent Neural Network (LSTM). You will learn how to do this yourself in the next couple weeks. For now, you can just sit back and "follow along" as we build this model using Keras and then train it. 

The first thing we will do is initialize the model using Sequential.

In [445]:
import keras
from keras import Sequential

imdb_model = Sequential()

Now we want to add an embedding layer. The purpose of an embedding layer is to take a sequence of integers (in our case, each integer represents a word) and turn each integer into a dense vector in some embedding space. (This is essentially the idea of Word2Vec https://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf). We want to create an embedding layer with a vocab size equal to the maximum number of words we allowed when we processed the data (in this case 1000) and a fixed dense vector size of 32. Then we have to specify the max length of our sequences, and we want to mask out zeros in our sequences since we used zero to pad.
Use the docs for embedding layer to fill out the missing entries: https://keras.io/layers/embeddings/
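
To see what an embedding lookup and a padding mask amount to, here is a small plain-Python sketch. It is a conceptual illustration under toy sizes, not Keras code: the table is filled with random numbers, whereas in a real model these values are learned, and `embed` and `padding_mask` are made-up names.

```python
import random

VOCAB_SIZE, EMBED_DIM = 10, 4  # toy sizes; the homework uses 1000 and 32

rng = random.Random(0)
# One row of EMBED_DIM floats per token id; in a real model these are learned.
embedding_table = [[rng.uniform(-1, 1) for _ in range(EMBED_DIM)]
                   for _ in range(VOCAB_SIZE)]

def embed(token_ids):
    """Look up one dense vector per token id."""
    return [embedding_table[t] for t in token_ids]

def padding_mask(token_ids, pad_id=0):
    """True where a real token is present, False at padding positions."""
    return [t != pad_id for t in token_ids]

sequence = [0, 0, 1, 5, 2]
vectors = embed(sequence)
mask = padding_mask(sequence)
print(len(vectors), len(vectors[0]))  # 5 4
print(mask)  # [False, False, True, True, True]
```

A sequence of 5 integers becomes 5 vectors of size 4, which mirrors how a 200-long review becomes a 200 x 32 array in the model.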

In [446]:
from keras.layers import Embedding
imdb_model.add(Embedding(1000, 32, input_length=200, mask_zero=True))

#### **(a)** We add an LSTM layer with 32 outputs, then a Dense layer with 16 neurons, then a relu activation, then a dense layer with 1 neuron, then a sigmoid activation. Then we print out the model summary. The Keras documentation is here: https://keras.io/

In [447]:
from keras.layers import LSTM, Dense
imdb_model.add(LSTM(32))

In [448]:
imdb_model.add(Dense(units=16, activation='relu'))
imdb_model.add(Dense(units=1, activation='sigmoid'))

In [449]:
imdb_model.summary()

Model: "sequential_8"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_10 (Embedding)     (None, 200, 32)           32000     
_________________________________________________________________
lstm_10 (LSTM)               (None, 32)                8320      
_________________________________________________________________
dense_18 (Dense)             (None, 16)                528       
_________________________________________________________________
dense_19 (Dense)             (None, 1)                 17        
Total params: 40,865
Trainable params: 40,865
Non-trainable params: 0
_________________________________________________________________
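
The parameter counts in the summary can be reproduced by hand. The sketch below assumes the standard formulas for these layer types (an LSTM with four gates and bias terms, and fully connected layers with a bias); the helper function names are made up for illustration.

```python
def embedding_params(vocab_size, embed_dim):
    # One vector of size embed_dim per vocabulary entry.
    return vocab_size * embed_dim

def lstm_params(input_dim, units):
    # Four gates, each with input weights, recurrent weights and a bias vector.
    return 4 * ((input_dim + units) * units + units)

def dense_params(input_dim, units):
    # A weight matrix plus one bias per unit.
    return input_dim * units + units

counts = [
    embedding_params(1000, 32),  # 32000
    lstm_params(32, 32),         # 8320
    dense_params(32, 16),        # 528
    dense_params(16, 1),         # 17
]
print(counts, sum(counts))  # [32000, 8320, 528, 17] 40865
```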


#### **(b)** Now we compile the model with binary cross entropy, and the adam optimizer. We include accuracy as a metric in the compile. Then train the model on the processed data.
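
For intuition, binary cross entropy and accuracy can be computed by hand on a few made-up predictions. This is a plain-Python sketch of the usual definitions, not the Keras implementation; the clipping constant `eps` is an assumption chosen for the example.

```python
import math

def binary_cross_entropy(y_true, y_prob, eps=1e-7):
    """Mean of -[y*log(p) + (1-y)*log(1-p)] over all examples."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

def accuracy(y_true, y_prob, threshold=0.5):
    """Fraction of examples whose thresholded prediction matches the label."""
    hits = sum(int(p >= threshold) == y for y, p in zip(y_true, y_prob))
    return hits / len(y_true)

labels = [1, 0, 1, 0]
probs = [0.9, 0.2, 0.4, 0.6]
print(binary_cross_entropy(labels, probs))
print(accuracy(labels, probs))  # 0.5
```

Confident correct predictions contribute little to the loss, while the two predictions on the wrong side of 0.5 dominate it.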

In [450]:
imdb_model.compile(loss=keras.losses.binary_crossentropy, optimizer=keras.optimizers.Adam(), metrics=['acc'])

In [451]:
imdb_model.fit(x_train_proc, y_train)



<tensorflow.python.keras.callbacks.History at 0x157bdc7f0>

In [452]:
print("Accuracy: ", imdb_model.evaluate(x_test_proc, y_test)[1])

Accuracy:  0.7946240901947021


## If you did the data pre-processing correctly you should be getting around 80% accuracy. Congratulations, that is much better than random! 
<i>If you are getting a test accuracy that is significantly lower, you probably did something wrong. Slack your NMEP team or go to office hours to get help sorting it out :) </i>

#### Now we can look at our predictions and the sentences they correspond to.

In [453]:
y_pred = imdb_model.predict(x_test_proc)

In [454]:
word_to_id = wordCounter
word_to_id["<PAD>"] = 0
word_to_id["<START>"] = 1
word_to_id["<UNK>"] = 2

id_to_word = {value:key for key,value in word_to_id.items() if value < 2000}
def get_words(token_sequence):
    return ' '.join(id_to_word[token] for token in token_sequence)

def get_sentiment(y_pred, index):
    return 'Positive' if y_pred[index] else 'Negative'

In [455]:
y_test = list(y_test)
# Threshold the sigmoid outputs at 0.5 and flatten to a 1-D array of 0/1 labels
y_pred = (y_pred >= 0.5).astype(int).ravel()
correct = []
incorrect = []
for i, pred in enumerate(y_pred):
    if y_test[i] == pred:
        correct.append(i)
    else:
        incorrect.append(i)

#### Now we print out one of the sequences we got correct.

In [456]:
print(get_sentiment(y_pred, correct[10]))
print(get_words(x_test_proc[correct[10]]))

Positive
<START> This show really is the <UNK> American <UNK> It ha <UNK> <UNK> the British <UNK> <UNK> A guy who s sometimes nice <UNK> and a <UNK> woman Of course it is different because there is a <UNK> <UNK> and there s <UNK> and some acting we just don t see some of the acting <UNK> I gave this show a <UNK> because there are a couple <UNK> that I know a lot of people including <UNK> make if they were working for the show The first thing that really need to be <UNK> is the <UNK> <UNK> who go home I know they want to find the right <UNK> and <UNK> <UNK> but America should have the power to <UNK> who doe home There s really no point to the <UNK> The person with the <UNK> number of <UNK> usually go home anyway Another thing I d change is to see them actually act on the show What s <UNK> without the acting The last thing that need to be <UNK> is the song the <UNK> people <UNK> at the end The <UNK> <UNK> always <UNK> the same song and the <UNK> <UNK> always <UNK> the same song a


#### And one we got wrong.

In [457]:
print(get_sentiment(y_pred, incorrect[10]))
print(get_words(x_test_proc[incorrect[10]]))

Positive
<PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <START> <UNK> <UNK> <UNK> <UNK> <UNK> is the movie <UNK> of a <UNK> <UNK> All the character <UNK> from <UNK> s Paul <UNK> <UNK> of course <UNK> <UNK> to be killed <UNK> either a <UNK> <UNK> <UNK> the good guy or a <UNK> <UNK> <UNK> the villain The director simply <UNK> on the <UNK> violence people even get <UNK> <UNK> and <UNK> up <UNK> <UNK> this into an <UNK> version of <UNK> <UNK> <UNK> and <UNK> <UNK> like <UNK> <UNK> <UNK> <UNK> t

#### As you can see, the number of UNKNOWN tokens in the sequence, caused by having only 1000 vocab words, is hurting our performance. If you want, go back and increase the number of vocab words to 2000 and compare your accuracy. 
(If you do so, remember to change your embedding parameter from 1000 to 2000 as well; you should get ~85% accuracy). 
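
The effect of vocabulary size on the share of unknown tokens can be illustrated on a made-up toy corpus. The function `unknown_rate` and the corpus below are assumptions for this sketch; the actual rates for the IMDB data will differ.

```python
from collections import Counter

# Toy corpus (made up); each inner list is one tokenized review.
corpus = [["a", "b", "a", "c"], ["a", "b", "d", "e"], ["a", "f"]]

def unknown_rate(tokenized_reviews, vocab_size):
    """Fraction of tokens that fall outside the `vocab_size` most frequent words."""
    counts = Counter(w for review in tokenized_reviews for w in review)
    known = {w for w, _ in counts.most_common(vocab_size)}
    total = sum(counts.values())
    unknown = sum(c for w, c in counts.items() if w not in known)
    return unknown / total

for size in (1, 2, 4):
    print(size, unknown_rate(corpus, size))
```

On this corpus the unknown rate falls from 0.6 to 0.4 to 0.2 as the vocabulary grows from 1 to 2 to 4 words.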

## And that's it! Now you should feel like a data engineering/preprocessing expert :) 