# Data Preprocessing for Sentiment Analysis

This notebook describes in detail the steps involved in preprocessing a dataset for a sentiment classification problem. These methods can be generalized to most NLP tasks. The methods used are quite naive and can be improved considerably; this is just a beginner's introduction.

Let's dive into the process straight away...

## Load Dependencies

**pandas** : to read the dataset

**numpy** : to help us with numerical computation

**string** : to help with cleaning

**re** : to work with regular expressions

**tqdm** : to make iterations less boring

**collections** : to do some counting

**gensim** : to help with making word embedding vectors

**multiprocessing** : to get the number of available CPU cores

In [14]:
import pandas as pd
import numpy as np
import string
import re
from tqdm import tqdm, trange
from collections import Counter
from gensim.models import Word2Vec
import multiprocessing

Now we are all set to look into our dataset.

In [19]:
data = pd.read_csv('resources/labeledTrainData.tsv', sep='\t')
print("shape: ",data.shape)
data.head()

shape:  (25000, 3)


| | id | sentiment | review |
|---|---|---|---|
| 0 | 5814_8 | 1 | With all this stuff going down at the moment w... |
| 1 | 2381_9 | 1 | `\The Classic War of the Worlds\" by Timothy Hi...` |
| 2 | 7759_3 | 0 | The film starts with a manager (Nicholas Bell)... |
| 3 | 3630_4 | 0 | It must be assumed that those who praised this... |
| 4 | 9495_8 | 1 | Superbly trashy and wondrously unpretentious 8... |


The dataset contains 3 columns and 25,000 samples:

**id** : an identifier for each review; it is irrelevant for now

**sentiment** : the sentiment label for each review (positive or negative)

**review** : the text of the movie review

## Preprocessing

Before going into the details of preprocessing, let's first look at a fragment of one of the reviews:
```
\"The Classic War of the Worlds\" by Timothy Hines is a very entertaining film that obviously goes to great effort and lengths to faithfully recreate H. G. Wells' classic book from 1968.
```
In order to classify this review as positive or negative, we need to train a model that can learn the differences between inputs. But first of all we need to understand what these inputs will be.

If we look at the sentence above, we see that it contains a lot of noise such as the characters \ " ' and . There are also cases like 'Classic' and 'classic', which may be treated as different words. We also have some less common words, like names, and years (which are numeric).

All of these are irrelevant to the sentiment and may have an adverse effect on our model. We address these issues by first collecting all the reviews into a list and then applying our preprocessing function.

In [3]:
# convert reviews from dataframe to list
sentences = list(data['review'])

In [4]:
# preprocess sentences
def preprocess(sent):
    # convert string to lower case, remove punctuation and replace each digit with '#'
    sent = ''.join('#' if s.isdigit() else s for s in sent if s not in string.punctuation).lower()
    # collapse runs of '#' (e.g. '####') into a single '#' and tokenize on spaces
    sent = re.sub('#+', '#', sent).split(' ')
    return sent

After the preprocessing function above is applied, we get output similar to this:
```
['the', 'classic', 'war', 'of', 'the', 'worlds', 'by', 'timothy', 'hines', 'is', 'a', 'very', 'entertaining', 'film', 'that', 'obviously', 'goes', 'to', 'great', 'effort', 'and', 'lengths', 'to', 'faithfully', 'recreate', 'h', 'g', 'wells', 'classic', 'book', 'from', '#']
```
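
One detail worth checking: `split(' ')` produces empty strings whenever two spaces end up next to each other, for example once a ' - ' has lost its punctuation. The sketch below is a possible variant, not the notebook's function; the name `preprocess_ws` is hypothetical. It uses `str.split()` with no argument, which splits on any run of whitespace.

```python
import re
import string

def preprocess_ws(sent):
    # lower-case, drop punctuation and replace each digit with '#'
    sent = ''.join('#' if ch.isdigit() else ch
                   for ch in sent if ch not in string.punctuation).lower()
    # collapse runs of '#' into one and split on any run of whitespace
    return re.sub('#+', '#', sent).split()

sample = 'A classic  book - from 1968.'
tokens = preprocess_ws(sample)
print(tokens)
```

With `split(' ')` the same input would also contain `''` tokens, which then end up in the vocabulary like any other word.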

This is done for all the sentences, so we get a list like this:
```
[
[...],
['the', 'classic', 'war', 'of', 'the', 'worlds', 'by', 'timothy', 'hines', 'is', 'a', 'very', 'entertaining', 'film', 'that', 'obviously', 'goes', 'to', 'great', 'effort', 'and', 'lengths', 'to', 'faithfully', 'recreate', 'h', 'g', 'wells', 'classic', 'book', 'from', '#'],
[...],
]
```

Now we need to create a vocabulary of the words present in all the sentences. We have to keep in mind, however, that infrequent words, such as the person names 'timothy' and 'hines', should not be included in the vocabulary.

In [5]:
# create vocabulary for the given dataset
def create_vocab(sent):
    # concatenate the words of all sentences in the dataset into one single list
    words = []
    for s in sent:
        words.extend(s)
    # count the occurrences of each word in the dataset (words list)
    word_count = Counter(words)
    # create the vocabulary from words occurring more than 2 times
    global vocab
    vocab = {k for k, v in word_count.items() if v > 2}

Let us suppose that, for the sentence above, the resulting vocabulary looks like this:
```
['the', 'classic', 'war', 'of', 'worlds', 'by', 'is', 'a', 'very', 'entertaining', 'film', 'that', 'obviously', 'goes', 'to', 'great', 'effort', 'and', 'lengths', 'faithfully', 'recreate', 'book', 'from', '#']
```
Note that the names, which are unlikely to be frequent across all reviews, have been left out for the sake of explanation.
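
The counting step can also be written without a global variable. The following is a minimal sketch under that assumption; `build_vocab` and its `min_count` parameter are hypothetical names, and the default mirrors the "more than 2 times" rule used above.

```python
from collections import Counter

def build_vocab(tokenized, min_count=3):
    # count every token across all tokenized sentences
    counts = Counter(tok for sent in tokenized for tok in sent)
    # keep tokens seen at least min_count times (more than 2 by default)
    return {tok for tok, n in counts.items() if n >= min_count}

toy = [['the', 'film', 'is', 'great'],
       ['the', 'film', 'by', 'hines'],
       ['the', 'film', 'is', 'long']]
toy_vocab = build_vocab(toy)
print(sorted(toy_vocab))
```

Returning the set makes it easier to try several thresholds side by side and compare the resulting vocabulary sizes.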

In [6]:
# replace words not present in the vocabulary with the 'unk' tag
def remove_unknown(sent):
    sent = [w if w in vocab else 'unk' for w in sent]
    return sent

The next step is to replace every word in a sentence that is not present in the vocabulary with the 'unk' tag. For the example sentence and vocabulary, the output will be:
```
['the', 'classic', 'war', 'of', 'the', 'worlds', 'by', 'unk', 'unk', 'is', 'a', 'very', 'entertaining', 'film', 'that', 'obviously', 'goes', 'to', 'great', 'effort', 'and', 'lengths', 'to', 'faithfully', 'recreate', 'unk', 'unk', 'unk', 'classic', 'book', 'from', '#']
```
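
To get a feeling for how much text the frequency threshold discards, it can help to measure the share of tokens that turn into 'unk'. A small sketch with toy data; `replace_unknown` and `unk_rate` are hypothetical helpers, not part of the notebook.

```python
def replace_unknown(sent, vocab, unk='unk'):
    # keep in-vocabulary tokens, map everything else to the unk tag
    return [w if w in vocab else unk for w in sent]

def unk_rate(sentences, unk='unk'):
    # fraction of all tokens that are equal to the unk tag
    total = sum(len(s) for s in sentences)
    return sum(s.count(unk) for s in sentences) / total if total else 0.0

toy_vocab = {'the', 'classic', 'book'}
toy_sents = [['the', 'classic', 'book'], ['by', 'timothy', 'hines']]
replaced = [replace_unknown(s, toy_vocab) for s in toy_sents]
print(replaced, unk_rate(replaced))
```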

In [7]:
# preprocess sentences - tokenization, removing punctuation and replacing digits
sentences = [preprocess(sent) for sent in tqdm(sentences)]
# create vocab
create_vocab(sentences)
# replace unknown words with the 'unk' tag
sentences = [remove_unknown(sent) for sent in tqdm(sentences)]

100%|██████████| 25000/25000 [00:09<00:00, 2665.88it/s]
100%|██████████| 25000/25000 [00:01<00:00, 22155.97it/s]


In [8]:
vocab_size = len(vocab)

## Word Embeddings

All the sentences now have the following form:
```
[
[...],
['the', 'classic', 'war', 'of', 'the', 'worlds', 'by', 'unk', 'unk', 'is', 'a', 'very', 'entertaining', 'film', 'that', 'obviously', 'goes', 'to', 'great', 'effort', 'and', 'lengths', 'to', 'faithfully', 'recreate', 'unk', 'unk', 'unk', 'classic', 'book', 'from', '#'],
[...],
]
```

We feed these sentences into a word2vec model to generate an embedding vector for each word present. These word vectors capture semantic relations between words that appear in similar neighborhoods.
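
To make "neighborhood" concrete: a skip-gram model (the `sg=1` option of gensim's `Word2Vec`) learns from (center, context) word pairs taken from a sliding window over each sentence. The sketch below only illustrates how such pairs could be formed. It is not gensim's implementation, it ignores details such as negative sampling, and `skipgram_pairs` and the window size are illustrative assumptions.

```python
def skipgram_pairs(tokens, window=2):
    # collect (center, context) pairs for every token within `window` positions
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

pairs = skipgram_pairs(['a', 'very', 'entertaining', 'film'], window=1)
print(pairs)
```

Words that keep showing up with the same context words are pushed towards similar vectors during training.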

In [9]:
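# set train_word2vec = True to train the word embeddings with the skip-gram model (sg=1);
# otherwise a previously saved model is loaded from disk
train_word2vec = False

if train_word2vec:
    cores = multiprocessing.cpu_count()
    # note: gensim >= 4.0 renamed `size` to `vector_size` and `iter` to `epochs`
    model = Word2Vec(sentences, min_count=1, size=200, sg=1, iter=2, negative=10, workers=cores)
    model.save('./resources/word2vec/word2vec.model')
else:
    model = Word2Vec.load('./resources/word2vec/word2vec.model')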

Once the model is trained, we have a dictionary of words and their vectors like this:
```
{
    'classic' : [0.06, ................. 0.04],
    'the'     : [0.1, 0.23 ....... 0.01, 0.06],
}
```

Now we create two dictionaries, word-to-index and index-to-word, which will help us in the following steps.

In [10]:
# index to word dictionary
id_to_word = dict(enumerate(list(vocab) + ['unk']))

# word to index dictionary
word_to_id = {v: k for k, v in id_to_word.items()}

These dictionaries look somewhat like this:
```
word_to_id = {  'classic' : 1, 'the' : 2, ....., 'from' : 20, '#' : 21  }

id_to_word = {  1 : 'classic', 2 : 'the', ....., 20 : 'from', 21 : '#' }
```
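
Because index 0 is later used for padding, one option is to reserve that index explicitly instead of relying on the order in which a set is iterated. The sketch below is an assumption about how this could be done, not the notebook's code; `build_index` and the `'<pad>'` token are hypothetical names.

```python
def build_index(vocab, pad='<pad>', unk='unk'):
    # index 0 is reserved for padding and index 1 for unknown words;
    # sorting makes the mapping reproducible across runs
    words = [pad, unk] + sorted(w for w in vocab if w not in (pad, unk))
    id_to_word = dict(enumerate(words))
    word_to_id = {w: i for i, w in id_to_word.items()}
    return id_to_word, word_to_id

toy_id_to_word, toy_word_to_id = build_index({'the', 'classic', 'from', '#'})
# words missing from the mapping fall back to the 'unk' index
encoded = [toy_word_to_id.get(w, toy_word_to_id['unk'])
           for w in ['the', 'classic', 'hines']]
print(encoded)
```

With this layout no real word can collide with the padding index, and the same vocabulary always produces the same indices.
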
Now we need to convert all the words in all the reviews to their indices.

In [11]:
# convert each sentence from a list of tokenized words to a list of word indices
sentences = [[word_to_id[w] for w in sent] for sent in sentences]

The words are replaced by their corresponding indices from the word_to_id dictionary. After this step the list of all reviews looks like this:
```
[
[2, 435, 23, 34, 234, 324],
[1, 2, 43, 67, 23, 20, 21],
[3, 4, 2, 4, 6, 234 , 45, 324],
]
```

Since the sentences have different lengths (6, 7 and 8 in the example above), we need to bring them to the same length by padding extra indices at the end. These indices can be any number that is not used for a real word, but conventionally we use 0.

In [12]:
# find max length sequence
max_length = max([len(sent) for sent in sentences])

In [15]:
# pad 0's to the end of each sequence
for i in trange(len(sentences)):
    sentences[i] = sentences[i] + [0]*(max_length - len(sentences[i]))

100%|██████████| 25000/25000 [00:00<00:00, 43499.76it/s]


After padding, the sequences become:
```
[
[2, 435, 23, 34, 234, 324, 0,  0],
[1, 2,   43, 67, 23,  20,  21, 0],
[3, 4,   2,  4,  6,   234, 45, 324],
]
```
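
Padding every review to the length of the longest one can make the array large when a few reviews are very long. A common alternative is to cap the length and truncate longer sequences. The sketch below assumes a hypothetical `max_len` cap chosen by the reader; it is not what the notebook does.

```python
def pad_or_truncate(seqs, max_len, pad_id=0):
    # cut sequences longer than max_len, right-pad shorter ones with pad_id
    return [s[:max_len] + [pad_id] * max(0, max_len - len(s)) for s in seqs]

seqs = [[2, 435, 23, 34, 234, 324],
        [1, 2, 43, 67, 23, 20, 21],
        [3, 4, 2, 4, 6, 234, 45, 324]]
padded = pad_or_truncate(seqs, max_len=7)
print(padded)
```

Truncation throws away the end of long reviews, so the cap is a trade-off between memory use and lost text.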

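For later use it is better to convert the word2vec dictionary into a single array in which row i holds the vector of the word with index i. The padded sentences are already lists of integer indices, so fetching their vectors becomes plain integer indexing into the array, which is the form embedding lookups in TensorFlow work with. A contiguous NumPy array also avoids the per-entry overhead of a Python dictionary.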

We need to convert this dictionary
```
{
    'classic' : [0.06, ................. 0.04],
    'the'     : [0.1, 0.23 ....... 0.01, 0.06],
}
```
into an array like
```
[
[0.06, ................. 0.04],        // this row index is 1 which contains word vector for 'classic' : 1
[0.1, 0.23 ....... 0.01, 0.06],        // this row index is 2 which contains word vector for 'the' : 2
]
```
and to access the vectors from the array we can use the indices of each word from this
```
word_to_id = {  'classic' : 1, 'the' : 2, ....., 'from' : 20, '#' : 21  }
```
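
With the vectors stored as rows of an array, a whole padded sentence can be turned into a matrix of vectors with a single indexing operation. A sketch with toy numbers: the made-up 3-dimensional vectors stand in for the 200-dimensional word2vec vectors, and all `toy_` names are hypothetical.

```python
import numpy as np

toy_word_to_id = {'classic': 1, 'the': 2}        # toy mapping, 0 is padding
toy_vectors = {'classic': [0.06, 0.01, 0.04],    # made-up values
               'the': [0.10, 0.23, 0.06]}

toy_embed = np.zeros((len(toy_word_to_id) + 1, 3))   # row 0 stays all zeros
for word, idx in toy_word_to_id.items():
    toy_embed[idx] = toy_vectors[word]

sentence = np.array([2, 1, 0, 0])                # 'the classic' plus padding
looked_up = toy_embed[sentence]                  # one row per position
print(looked_up.shape)
```

The padded positions map to the all-zero row, so they contribute nothing to sums or averages over the sentence.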

In [16]:
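# initialize an embeddings array that holds the word vectors for every word in the vocabulary.
# Row i of the array is the vector of the word whose index is i
embed = np.zeros((vocab_size+1, 200))
for k,v in word_to_id.items():
    # model.wv[k] also works on gensim >= 4.0, where model[k] was removed
    embed[v] = model.wv[k]
# assign an all-zero vector of size 200 to index 0, which is used for padding
# note: in the original run index 0 was assigned to the empty word ''; this depends on the
# iteration order of the `vocab` set, so check id_to_word[0] before relying on it
embed[0] = np.zeros((200))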

In [17]:
# sample sentence now assigned with indices value and padding
sentences[0]

[13443,
 34047,
 8236,
 8362,
 27533,
 33631,
 12734,
 17840,
 24362,
 13443,
 5624,
 36099,
 3685,
 29223,
 14051,
 25130,
 14368,
 37118,
 17840,
 13107,
 13239,
 35653,
 37951,
 33962,
 12047,
 17840,
 13133,
 37951,
 12047,
 9630,
 19710,
 38278,
 4168,
 21888,
 27124,
 14051,
 36470,
 13676,
 16445,
 20452,
 4604,
 8236,
 15731,
 17894,
 4168,
 882,
 3169,
 16916,
 725,
 37668,
 17840,
 12040,
 21888,
 14051,
 38278,
 24054,
 36285,
 23112,
 9833,
 22100,
 10560,
 13122,
 25810,
 3876,
 19377,
 9630,
 13122,
 16503,
 23771,
 16503,
 8203,
 37780,
 22887,
 4168,
 16116,
 27533,
 14051,
 11159,
 12734,
 17840,
 1400,
 10974,
 41729,
 3169,
 16584,
 35255,
 22043,
 12638,
 41729,
 29717,
 17101,
 33896,
 8177,
 501,
 16029,
 31208,
 17840,
 13377,
 37951,
 6917,
 17840,
 35965,
 42968,
 12638,
 18584,
 40933,
 38315,
 43232,
 5368,
 5920,
 31017,
 9330,
 12638,
 35309,
 8236,
 13122,
 34047,
 8177,
 16918,
 24797,
 38192,
 17065,
 39203,
 12314,
 2218,
 5624,
 37668,
 29332,
 10083,


In [19]:
train_X = np.array(sentences)
train_y = np.array(data['sentiment'])
! mkdir -p data
np.savez('data/train_set.npz', train_X=train_X, train_y=train_y)

In [20]:
np.savez('data/embedding.npz', embed=embed)
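
The saved archives can be read back with `np.load`, using the keyword names passed to `np.savez` as keys. The round trip below is a sketch on toy arrays written to a temporary directory, so it does not touch the `data/` folder; the `toy_` names are hypothetical.

```python
import os
import tempfile
import numpy as np

toy_X = np.array([[2, 1, 0], [1, 2, 2]])
toy_y = np.array([1, 0])

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'train_set.npz')
    np.savez(path, train_X=toy_X, train_y=toy_y)
    # np.load returns a lazy archive; index it by the names used in savez
    with np.load(path) as archive:
        loaded_X = archive['train_X']
        loaded_y = archive['train_y']

print(loaded_X.shape, loaded_y.shape)
```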