# HW 2: Data Preparation for Sentiment Classification

In this homework we will prepare the IMDB movie review sentiment dataset. We will prepare it to fit a model that will predict whether a new review has a positive or negative sentiment. 

**Start by downloading the IMDB_Dataset from the .csv file into a pandas DataFrame**

In [111]:
import pandas as pd
import numpy as np

#Download the dataset into a Pandas DataFrame and display the first 5 rows
## YOUR CODE HERE
imdb = pd.read_csv('IMDB_Dataset.csv')

We have to process the data first, so that we have fixed length sequences.

<b>We want to split the dataset into reviews and labels. </b>

In [112]:
imdb.head()

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive


In [113]:
imdb_x = imdb['review']
imdb_y = imdb['sentiment']

In [114]:
# Paste the toBinary function created in HW 1 from this hw set (week 2)
def toBinary(data, positive):
    return data.apply(lambda x: 1 if x == positive else 0)

**Use the toBinary method to tranform the sentiment column into binary, 1 for positive and 0 for negative.**

In [115]:
imdb_y = toBinary(imdb_y, 'positive')
imdb['sentiment'] = imdb_y

"Lemmatization is the process of converting a word to its base form. The difference between stemming and lemmatization is, lemmatization considers the context and converts the word to its meaningful base form, whereas stemming just removes the last few characters, often leading to incorrect meanings and spelling errors."
    https://www.machinelearningplus.com/nlp/lemmatization-examples-python/

<b>Lemmatize the sentences using any library from the article. Make sure to filter out non-alphabetical characters. </b>

In [116]:
import nltk
from nltk.stem import WordNetLemmatizer 
import re

def Lemmatize(data):
    # Init the Wordnet Lemmatizer
    data = data.str.split(" |,|<br /><br />|\"|\'|!") # splits the data from sentences to words, feel free to change
    lemmatizer =  WordNetLemmatizer()
    
    
    for review in data:
        for word in review:
            word = lemmatizer.lemmatize(re.sub(r'\W+', '', word))
    
    return data

imdb_x = Lemmatize(imdb_x)


In [117]:
imdb_x

0        [One, of, the, other, reviewers, has, mentione...
1        [A, wonderful, little, production., , The, fil...
2        [I, thought, this, was, a, wonderful, way, to,...
3        [Basically, there, s, a, family, where, a, lit...
4        [Petter, Mattei, s, , Love, in, the, Time, of,...
                               ...                        
49995    [I, thought, this, movie, did, a, down, right,...
49996    [Bad, plot, , bad, dialogue, , bad, acting, , ...
49997    [I, am, a, Catholic, taught, in, parochial, el...
49998    [I, m, going, to, have, to, disagree, with, th...
49999    [No, one, expects, the, Star, Trek, movies, to...
Name: review, Length: 50000, dtype: object

The data has to be put into integer form, so each integer represents a unique word, 0 represents a PAD character, 1 represents a START character and 2 represents a character that is unknown because it is not in the top `num_words`. 
Thus 3 represents the first real word. 

Also the words should be in decreasing order of frequency, so the word that 3 represents is the most common word in the dataset. 

Complete CreateDict which will take in a data column <br>
    1) Create a Dict that maps {Word: Apperances in dataset} 
       <i> Do not implement dictionary keys for PAD, START, and unknown characters (except "" see hint), this will be done later. </i>
<br>
    2) Choose the top N most recurring words and give ascending indexes starting at 3 <br>

In [135]:
from itertools import islice

def take(n, iterable):
    "Return first n items of the iterable as a list"
    return list(islice(iterable, n))
numWords = 1000



def CreateDict(data, topN):
    
    wordfreqs = {}
    
    for review in data:

        for word in review:
            if word not in wordfreqs:
                wordfreqs[word] = 0
            wordfreqs[word] += 1
    
  
    ##Hint: "" (or the empty string) is actually 
    ### the most common word but we want to cast it as an unknown word, what index should we index it as then?
    final_dict = {}
    
    
    for idx, word in enumerate(sorted(wordfreqs, key=wordfreqs.get, reverse=True)[1:]):
        
        if idx == numWords + 3:
            break
        
        final_dict[word] = idx + 3
        print(word, final_dict[word])
        
    return final_dict
    
wordCounter = CreateDict(imdb_x, numWords)
wordCounter

the 3
a 4
and 5
of 6
to 7
is 8
in 9
I 10
it 11
that 12
s 13
this 14
was 15
The 16
as 17
with 18
for 19
movie 20
but 21
film 22
t 23
on 24
you 25
are 26
his 27
have 28
not 29
be 30
he 31
one 32
at 33
by 34
an 35
who 36
all 37
they 38
from 39
like 40
It 41
so 42
has 43
about 44
just 45
or 46
her 47
out 48
This 49
some 50
can 51
very 52
more 53
good 54
what 55
would 56
there 57
when 58
up 59
if 60
their 61
really 62
only 63
had 64
she 65
which 66
even 67
see 68
were 69
my 70
no 71
than 72
story 73
- 74
time 75
been 76
me 77
into 78
get 79
much 80
will 81
we 82
because 83
other 84
people 85
most 86
great 87
do 88
could 89
make 90
first 91
how 92
bad 93
also 94
its 95
don 96
any 97
him 98
made 99
But 100
think 101
And 102
well 103
it. 104
way 105
then 106
too 107
being 108
them 109
many 110
characters 111
character 112
movie. 113
movies 114
never 115
There 116
two 117
A 118
In 119
know 120
where 121
little 122
after 123
films 124
watch 125
seen 126
acting 127
plot 128
your 129
did 130
love 

{'the': 3,
 'a': 4,
 'and': 5,
 'of': 6,
 'to': 7,
 'is': 8,
 'in': 9,
 'I': 10,
 'it': 11,
 'that': 12,
 's': 13,
 'this': 14,
 'was': 15,
 'The': 16,
 'as': 17,
 'with': 18,
 'for': 19,
 'movie': 20,
 'but': 21,
 'film': 22,
 't': 23,
 'on': 24,
 'you': 25,
 'are': 26,
 'his': 27,
 'have': 28,
 'not': 29,
 'be': 30,
 'he': 31,
 'one': 32,
 'at': 33,
 'by': 34,
 'an': 35,
 'who': 36,
 'all': 37,
 'they': 38,
 'from': 39,
 'like': 40,
 'It': 41,
 'so': 42,
 'has': 43,
 'about': 44,
 'just': 45,
 'or': 46,
 'her': 47,
 'out': 48,
 'This': 49,
 'some': 50,
 'can': 51,
 'very': 52,
 'more': 53,
 'good': 54,
 'what': 55,
 'would': 56,
 'there': 57,
 'when': 58,
 'up': 59,
 'if': 60,
 'their': 61,
 'really': 62,
 'only': 63,
 'had': 64,
 'she': 65,
 'which': 66,
 'even': 67,
 'see': 68,
 'were': 69,
 'my': 70,
 'no': 71,
 'than': 72,
 'story': 73,
 '-': 74,
 'time': 75,
 'been': 76,
 'me': 77,
 'into': 78,
 'get': 79,
 'much': 80,
 'will': 81,
 'we': 82,
 'because': 83,
 'other': 84,
 'peop

Complete  replaceByIndex which will replace known words with their index and unknown words with a 2.

In [136]:
wordCounter['the']

3

In [None]:
def replaceByIndex(data, wordCounter):
    
#     for i, review in enumerate(data):
#         for j, word in enumerate(review):
#             try:
#                 data[i][j] = wordCounter[word]
#             except KeyError:
#                 data[i][j] = 2
#     return data

    [[print(word, review) for word in review] for review in data]
    return [[wordCounter[word] if word in wordCounter else 2 for word in review] for review in data]
#     return [[print(wordCounter[word]) if word in wordCounter else 2] for review in data]
            
    

imdb_x_copy = replaceByIndex(imdb_x, wordCounter)

In [None]:
imdb_x_copy

In [95]:
imdb_x_copy.head()

AttributeError: 'NoneType' object has no attribute 'head'

#### We want to process the data into NumPy arrays of sequences that are all length 200. We will use these criteria: 
* We want to add a 1 at the beginning of every review to signal the beginning of the text.
* If a given sequence is shorter than 200 tokens we want to pad the beginning of the sequence out with zeros so that the sequence is 200 long. 
* Else if the sequence is longer than 200 (including the starting 1) we want to cut it down to length 200. 


In [None]:
def process_data(data):
    
    #YOUR CODE HERE 
    
    return processed

imdb_x = process_data(imdb_x)

<b> Separate the dataset into train and test sets, test set should be 1/3 of the set.</b> <p>
This sklearn method will make your life much easier: 
[train_test_split](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html)

In [None]:
## YOUR CODE HERE

x_train_proc, x_test_proc, y_train, y_test = #YOUR CODE HERE

At this point **your job is done!!!** Congratulations, if done correctly, the sentences are processed and ready to be used as features and labels to train a Recurrent Neural Network (LSTM). You will learn how to do this yourself in the next couple weeks. For now, you can just sit back and "follow along" as we build this model using Keras and then train it. 

The first thing we will do is initialize the model using Sequential.

In [None]:
import keras
from keras import Sequential

imdb_model = Sequential()

Now we want to add an embedding layer. The purpose of an embedding layer is to take a sequence of integers representing words in our case and turn each integer into a dense vector in some embedding space. (This is essentially the idea of Word2Vec https://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf). We want to create an embedding layer with vocab size equal to the max num words we allowed when we loaded the data (in this case 1000), and a fixed dense vector of size 32. Then we have to specify the max length of our sequences and we want to mask out zeros in our sequence since we used zero to pad.
Use the docs for embedding layer to fill out the missing entries: https://keras.io/layers/embeddings/

In [None]:
from keras.layers.embeddings import Embedding
imdb_model.add(Embedding(1000, 32, input_length=200, mask_zero=True))

#### **(a)** We add an LSTM layer with 32 outputs, then a Dense layer with 16 neurons, then a relu activation, then a dense layer with 1 neuron, then a sigmoid activation. Then we print out the model summary. The Keras documentation is here: https://keras.io/

In [None]:
from keras.layers.recurrent import LSTM
from keras.layers import Dense, Activation
imdb_model.add(LSTM(32))

In [None]:
imdb_model.add(Dense(units=16, activation='relu'))
imdb_model.add(Dense(units=1, activation='sigmoid'))

In [None]:
imdb_model.summary()

#### **(b)** Now we compile the model with binary cross entropy, and the adam optimizer. We include accuracy as a metric in the compile. Then train the model on the processed data.

In [None]:
imdb_model.compile(loss=keras.losses.binary_crossentropy, optimizer=keras.optimizers.Adam(), metrics=['acc'])

In [None]:
imdb_model.fit(x_train_proc, y_train)

In [None]:
print("Accuracy: ", imdb_model.evaluate(x_test_proc, y_test)[1])

## If you did the data pre-processing correctly you should be getting around an 80% accuracy. congratulations, that is much better than random! 
<i>If you are getting a test accuracy that is significantly lower, you probably did something wrong, slack your NMEP team or go to office hours to get help sorting it out :) </i>

#### Now we can look at our predictions and the sentences they correspond to.

In [None]:
y_pred = imdb_model.predict(x_test_proc)

In [None]:
word_to_id = wordCounter
word_to_id["<PAD>"] = 0
word_to_id["<START>"] = 1
word_to_id["<UNK>"] = 2

id_to_word = {value:key for key,value in word_to_id.items() if value < 2000}
def get_words(token_sequence):
    return ' '.join(id_to_word[token] for token in token_sequence)

def get_sentiment(y_pred, index):
    return 'Positive' if y_pred[index] else 'Negative'

In [None]:
y_test = [i for i in y_test]
y_pred = np.vectorize(lambda x: int(x >= 0.5))(y_pred)
correct = []
incorrect = []
for i, pred in enumerate(y_pred):
    if y_test[i] == pred:
        correct.append(i)
    else:
        incorrect.append(i)

#### Now we print out one of the sequences we got correct.

In [None]:
print(get_sentiment(y_pred, correct[10]))
print(get_words(x_test_proc[correct[10]]))

#### And one we got wrong.

In [None]:
print(get_sentiment(y_pred, incorrect[10]))
print(get_words(x_test_proc[incorrect[10]]))

#### As you can see the amount of UNKNOWN characters in the sequence cause by having only 1000 vocab words is hurting our performance. If you want, go back and increase the number of vocab words to 2000 and compare your accuracy 
(If you do so, remember to change your embedding parameter from 1000 to 2000 as well; you should get ~85% accuracy). 

## And that's it! Now you should feel like a data engineering/preprocessing expert :) 