# Sentiment Analysis: Data preprocessing

In this notebook we compare different approaches to data preprocessing applied to a real-world task: analysing movie reviews written by IMDb users. Each review belongs to one of two classes: positive if the user liked the movie, and negative otherwise. This tutorial is inspired by: https://towardsdatascience.com/sentiment-analysis-with-python-part-1-5ce197074184

## Data preprocessing

Our first step is to load and prepare the data for the experiments. We use the IMDb reviews dataset. The training data is stored in the zipped folder aclImbdb.tar; it can also be downloaded from: http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
The dataset consists of 50,000 sentences split into two sets, train and test, of 25,000 sentences each.

For our experiments the corpus is split as follows:
* Train split: 17,500 sentences extracted from the original training data.
* Validation split: 7,500 sentences extracted from the original training data.
* Test split: 25,000 sentences that form the original test data.

In [None]:
import torch
from torchtext import data
from torchtext import datasets
import random
import itertools
import matplotlib.pyplot as plt

TEXT = data.Field()
LABEL = data.LabelField(dtype=torch.float)
SEED = 0

train_data, test_data = datasets.IMDB.splits(TEXT,LABEL)

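# random.seed() returns None, so seed first and pass the RNG state explicitly
# to make the 70/30 train/validation split reproducible
random.seed(SEED)
train_data, valid_data = train_data.split(random_state=random.getstate())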

# Token lists for every review in each split
train_sents = [ex.text for ex in train_data.examples]
valid_sents = [ex.text for ex in valid_data.examples]
test_sents = [ex.text for ex in test_data.examples]

# Binary targets: 1 = positive review, 0 = negative review
train_targets = [1 if ex.label == 'pos' else 0 for ex in train_data.examples]
test_targets = [1 if ex.label == 'pos' else 0 for ex in test_data.examples]
valid_targets = [1 if ex.label == 'pos' else 0 for ex in valid_data.examples]


In order to feed our data into a model that is able to predict the sentiment of our movie reviews we should first create a vocabulary of all the words that appear in our training data. 

A measure of quality of our vocabulary can be its coverage over the data. We can define the vocabulary coverage as the percentage of tokens from our data that are found in our vocabulary. 

As we will see later, vocabulary size is a parameter that can be tuned, and this coverage can serve as guidance.

**Compute the vocabulary as a dictionary with words as keys and the number of times each word appears as values.**
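As a quick illustration of the coverage definition, here is a minimal sketch on made-up data. The names `toy_vocabulary` and `toy_sentence` are hypothetical and are not part of the notebook's code.

```python
# Toy data, invented for illustration only
toy_vocabulary = {"the", "movie", "was", "good"}
toy_sentence = ["the", "movie", "was", "surprisingly", "good"]

# Coverage = tokens found in the vocabulary / total tokens
known = sum(1 for token in toy_sentence if token in toy_vocabulary)
toy_coverage = known / len(toy_sentence)
print(toy_coverage)  # 4 of the 5 tokens are known -> 0.8
```

Only "surprisingly" is missing from the vocabulary, so the coverage is 80%.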

**Compute the coverage over the data given our vocabulary.**

In [None]:
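def compute_vocabulary(train_sents):
  # Map each token to its count; '<unk>' gets a huge count so it is always
  # kept when the vocabulary is later cut down to the most frequent tokens
  vocabulary = {'<unk>':99999999}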
  for sent in train_sents:
    for token in sent:
      if token in vocabulary:
        vocabulary[token] += 1
      else:
        vocabulary[token] = 1
      
  return vocabulary

def coverage(split,voc):
  total = 0.0
  unk = 0.0
  for sent in split:
    for token in sent:
      if token not in voc:
        unk += 1.0
      total += 1.0
  return 1.0 - (unk/total)


def voc_stats(split,voc):
  print('**** VOCABULARY ***')
  print('* Unique words', len(voc))
  print('* Coverage', coverage(split,voc))

voc = compute_vocabulary(train_sents)
voc_stats(valid_sents, voc)


**** VOCABULARY ***
* Unique words 224384
* Coverage 0.9638441717742088


Our first preprocessing step is word-level tokenization. In this case every word becomes its own token, and punctuation and stop words attached to other tokens are separated. We use the 'nltk' library for this task.
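To see why this matters for the vocabulary, the sketch below compares plain whitespace splitting with a word-level split. The regular expression is only a simplified stand-in for `nltk.word_tokenize` (assumed here so the example runs without downloading NLTK data); `simple_word_tokenize` is a hypothetical helper.

```python
import re

def simple_word_tokenize(text):
    # Simplified stand-in for nltk.word_tokenize (hypothetical helper):
    # runs of word characters are tokens, each punctuation mark is its own token
    return re.findall(r"\w+|[^\w\s]", text)

review = "Great movie, loved it!"
whitespace_tokens = review.split()
word_tokens = simple_word_tokenize(review)
print(whitespace_tokens)  # ['Great', 'movie,', 'loved', 'it!']
print(word_tokens)        # ['Great', 'movie', ',', 'loved', 'it', '!']
```

With whitespace splitting, "movie," and "movie" would be two different vocabulary entries; separating the punctuation merges them into one.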

In [None]:
import nltk
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('punkt')

from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from nltk.stem import WordNetLemmatizer    
from nltk.tokenize import TweetTokenizer
import numpy as np


stop_words = set(stopwords.words('english'))  # a set gives fast membership tests
lemmatizer = WordNetLemmatizer()

def tokenize_sentence(sentence):
    return  nltk.word_tokenize(sentence)

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


**Compute the tokenization and the vocabulary using the function above. How have the vocabulary size and the coverage changed?**

In [None]:
train_tok_sents = [tokenize_sentence(' '.join(s)) for s in train_sents]
valid_tok_sents = [tokenize_sentence(' '.join(s)) for s in valid_sents]
test_tok_sents = [tokenize_sentence(' '.join(s)) for s in test_sents]

tok_voc = compute_vocabulary(train_tok_sents)
voc_stats(valid_tok_sents, tok_voc)

**** VOCABULARY ***
* Unique words 111358
* Coverage 0.9867840235348441



In addition to the previous steps, we are going to preprocess our data further.

Each review in the corpus goes through the following steps:

- Each word is replaced by its lemma. Lemmatization reduces the vocabulary size, as all the different forms of a word are grouped under a single lemma.
- Casing is removed from words. This helps reduce ambiguity, as upper- and lower-cased appearances of a word would otherwise count as different entries in the vocabulary.
- Stop words are removed. To reduce sentence length and keep only the words that carry the meaning of the sentence, we remove all stop words.
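The three steps can be sketched on a toy example. The stop-word set and the lemma table below are tiny hypothetical stand-ins for the NLTK resources, chosen only so the example is self-contained.

```python
# Hypothetical stand-ins for the NLTK stop-word list and the WordNet lemmatizer
TOY_STOP_WORDS = {"the", "a", "was", "were"}
TOY_LEMMAS = {"movies": "movie", "actors": "actor"}

def toy_preprocess(tokens):
    result = []
    for token in tokens:
        lowered = token.lower()          # remove casing
        if lowered in TOY_STOP_WORDS:    # drop stop words
            continue
        # replace the word by its lemma when the toy table knows one
        result.append(TOY_LEMMAS.get(lowered, lowered))
    return result

print(toy_preprocess(["The", "actors", "were", "great"]))  # ['actor', 'great']
```

In this sketch the stop-word lookup happens after lowercasing, so capitalised stop words such as "The" are removed as well.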

**Apply the preprocessing and compute the new vocabulary. How did the statistics change?**


In [None]:
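def preprocess_sentence(sentence):
    # Note: the stop-word test runs on the original casing, so capitalised
    # stop words such as 'The' are kept (lowercased); test word.lower()
    # instead to drop them as well.
    return [lemmatizer.lemmatize(word.lower()) for word in sentence if word not in stop_words]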


In [None]:
train_tok_sents = [preprocess_sentence(s) for s in train_tok_sents]
valid_tok_sents = [preprocess_sentence(s) for s in valid_tok_sents]
test_tok_sents = [preprocess_sentence(s) for s in test_tok_sents]

tok_voc = compute_vocabulary(train_tok_sents)
voc_stats(valid_tok_sents, tok_voc)

**** VOCABULARY ***
* Unique words 88081
* Coverage 0.9838491774231946


In both cases the vocabularies are too big to be handled in this task, as a lot of words appear only a few times in the dataset and some ambiguity is still present in the data.

**Create a 5000-token vocabulary that only includes the most frequent tokens in the dataset. What is the new coverage?**
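As a small-scale illustration of frequency-based pruning, the sketch below keeps the 2 most frequent tokens of a made-up corpus using `collections.Counter.most_common`. The corpus and the cut-off are assumptions for the example, not the notebook's data.

```python
from collections import Counter

# Toy corpus, invented for illustration
toy_sents = [["good", "movie"], ["bad", "movie"], ["good", "good", "plot"]]
counts = Counter(token for sent in toy_sents for token in sent)
# counts: good=3, movie=2, bad=1, plot=1

# Keep only the 2 most frequent tokens
top2 = dict(counts.most_common(2))
print(top2)  # {'good': 3, 'movie': 2}

# Tokens still covered after pruning: 5 out of 7
kept = sum(counts[token] for token in top2)
total = sum(counts.values())
print(kept, total)  # 5 7
```

Dropping half of the vocabulary entries loses only 2 of the 7 tokens, because the removed words are the rare ones.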

In [None]:
def reduce_vocabulary(voc,size=5000):
  items = list(voc.items())
  items.sort(key=lambda x: x[1], reverse=True)
  items = items[:size]
  return {k:v for k,v in items}

SIZE = 5000

tok_voc = reduce_vocabulary(tok_voc,size=SIZE)
voc = reduce_vocabulary(voc,size=SIZE)
voc_stats(valid_tok_sents,tok_voc)


**** VOCABULARY ***
* Unique words 5000
* Coverage 0.8758908085582482


The previous preprocessing helps reduce the ambiguity produced by different forms of the same word or by stop words included in other tokens, but the original tokens are mostly unchanged. Next we will explore two other levels of tokenization: characters and subwords.

Let's start with characters, where all words in the dataset are split into their individual characters. This has several new characteristics:
* Vocabularies are orders of magnitude smaller than their word-level counterparts, as all words share the same script, which uses a limited set of characters.
* Each token does not provide much information about the sentence, whereas individual words carry semantic and morphological information.
* Character usage is more ambiguous: words are generally used in the same context consistently throughout the dataset, while a character can belong to a great number of words.
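The coverage advantage of characters can be seen on a toy corpus (made up for this sketch). Note that on such a tiny corpus the character set is not smaller than the word set; the size advantage only appears on large corpora.

```python
# Toy corpus, invented for illustration
toy_sents = [["great", "movie"], ["great", "plot"], ["boring", "plot"]]

word_vocab = {token for sent in toy_sents for token in sent}
# Join the tokens with spaces; every character (including the space) is a token
char_vocab = {ch for sent in toy_sents for ch in " ".join(sent)}

unseen_word = "boat"
print(unseen_word in word_vocab)                    # False: unknown at word level
print(all(ch in char_vocab for ch in unseen_word))  # True: fully covered by characters
```

The word "boat" never appears in the corpus, yet each of its characters does, so a character vocabulary still covers it.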

**Based on the previous tokenization, compute a character vocabulary over the training data, and measure its size and its coverage of the validation data.**

In [None]:
train_char_sents = [' '.join(s).strip() for s in train_tok_sents]
valid_char_sents = [' '.join(s).strip() for s in valid_tok_sents]
test_char_sents = [' '.join(s).strip() for s in test_tok_sents]

char_voc = compute_vocabulary(train_char_sents)
voc_stats(valid_char_sents,char_voc)

**** VOCABULARY ***
* Unique words 130
* Coverage 0.9999934164930838


Character-level tokenization may be too extreme for some tasks, but it provides great coverage of the dataset. In those cases a good alternative is subword tokenization.

In this case words are split into pieces whose length depends on how common they are in the dataset. This way long pieces that are really common are kept, for example lemmas of common words in the data, while for uncommon words or affixes the vocabulary includes smaller pieces, down to individual characters.

This tokenization allows:
* Great coverage of the data, as only words that include characters not present in the training data will not be recognized.
* A parametrized vocabulary size. The number of tokens in the vocabulary is set when the tokenization is computed and can be tuned to improve the model's performance.
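To make the idea of frequency-dependent pieces concrete, here is a minimal sketch of how Byte-Pair Encoding merges can be learned: start from characters and repeatedly merge the most frequent adjacent pair. This is a simplified illustration, not the subword-nmt implementation; `learn_bpe` is a hypothetical helper and it leaves out details of real implementations such as end-of-word handling.

```python
from collections import Counter

def learn_bpe(words, num_merges):
    """Learn up to `num_merges` merges from a list of words (toy version)."""
    # Each distinct word is a tuple of symbols, stored with its frequency
    vocab = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair, weighted by word frequency
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Rewrite every word, replacing the best pair by one merged symbol
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] += freq
        vocab = new_vocab
    return merges, vocab

# Toy word list, invented for illustration
words = ["hug"] * 10 + ["pug"] * 5 + ["hugs"] * 5 + ["bun"] * 4
merges, segmented = learn_bpe(words, num_merges=2)
print(merges)  # [('u', 'g'), ('h', 'ug')]
```

After two merges the frequent word "hug" is a single piece, while the rarer "pug" is still split into "p" and "ug" and "bun" remains three characters. The number of merges is the knob that controls the vocabulary size.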

The following cell shows a call to Byte-Pair Encoding (BPE) tokenization using the standard subword-nmt library.

In [None]:
! pip install subword-nmt

# Write the data splits into files to compute the Byte-Pair Encoding (BPE)
with open('train.txt','w+') as tr:
  for s in train_tok_sents:
    print(' '.join(s),file=tr)

with open('valid.txt','w+') as vl:
  for s in valid_tok_sents:
    print(' '.join(s),file=vl)

with open('test.txt','w+') as ts:
  for s in test_tok_sents:
    print(' '.join(s),file=ts)


# Compute BPE codes and apply to the data splits
! subword-nmt learn-bpe -s 5000 < train.txt > bpe5000.codes
! subword-nmt apply-bpe -c bpe5000.codes < train.txt > train.bpe.txt
! subword-nmt apply-bpe -c bpe5000.codes < valid.txt > valid.bpe.txt
! subword-nmt apply-bpe -c bpe5000.codes < test.txt > test.bpe.txt

def read_bpe_file(path):
  # One review per line; subword pieces are separated by whitespace
  with open(path) as f:
    return [line.split() for line in f]

train_bpe_sents = read_bpe_file('train.bpe.txt')
valid_bpe_sents = read_bpe_file('valid.bpe.txt')
test_bpe_sents = read_bpe_file('test.bpe.txt')


Collecting subword-nmt
  Downloading https://files.pythonhosted.org/packages/74/60/6600a7bc09e7ab38bc53a48a20d8cae49b837f93f5842a41fe513a694912/subword_nmt-0.3.7-py2.py3-none-any.whl
Installing collected packages: subword-nmt
Successfully installed subword-nmt-0.3.7


**Compute the coverage and vocabulary size of the subword tokenization**

In [None]:
bpe_voc = compute_vocabulary(train_bpe_sents)
voc_stats(valid_bpe_sents,bpe_voc)

**** VOCABULARY ***
* Unique words 5196
* Coverage 0.9999679559433573


# Classification Task

In this final section we are going to measure the performance of a model when using as features the different tokenization levels described above. 

The first step in training a model is to decide how to feed our data into it. For these experiments we are going to use bag-of-words vectors, as they do not require any further preprocessing.

Bag of Words (BOW) represents each sentence of the dataset as a fixed-size vector with as many dimensions as the vocabulary has entries. For example, with our 5000-token vocabularies, each sentence is represented as a 5000-dimensional vector, built with these steps:

- Initially all vector dimensions are initialized as 0.
- For each word in the sentence, 1 is added to the position of such word in the vocabulary.

The final vector is sparse (most of its values are 0) and contains the number of times each word appears in the sentence; the sum of all its dimensions is the number of tokens in the sentence.

Note that while this gives a fixed-size representation of the sentences, it lacks information about the order in which words appear.
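The construction can be sketched on a four-entry toy vocabulary. `toy_index` and `toy_bow` are hypothetical names used only for this example.

```python
# Toy vocabulary index, invented for illustration
toy_index = {"<unk>": 0, "good": 1, "movie": 2, "bad": 3}

def toy_bow(sentence, index):
    vector = [0] * len(index)  # all dimensions start at 0
    for token in sentence:
        # Tokens outside the vocabulary are counted under '<unk>'
        vector[index.get(token, index["<unk>"])] += 1
    return vector

vector = toy_bow(["good", "good", "movie", "overall"], toy_index)
print(vector)       # [1, 2, 1, 0]
print(sum(vector))  # 4, the number of tokens in the sentence
```

The sentences "good movie" and "movie good" would produce the same vector, which is the loss of word order mentioned above.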

**Create a general function to compute the vocabulary indexes and the BOW representations of our data**

In [None]:
def idx_voc(voc):
  items = list(voc.items())
  items.sort(key=lambda x: x[1], reverse=True)
  return {item[0]:n for n,item in enumerate(items)}

def bag_of_words(splits,voc):
  voc = idx_voc(voc)
  bow_splits = []
  for split in splits:
    bow = []
    for sent in split:
      vector = [0] * len(voc)
      for s in sent:
        # Tokens outside the vocabulary are counted in the '<unk>' dimension
        vector[voc.get(s, voc['<unk>'])] += 1
      bow.append(vector)
    bow_splits.append(bow)
  return bow_splits



As a classifier we are going to use a Random Forest (https://towardsdatascience.com/understanding-random-forest-58381e0602d2), a model that consists of an ensemble of Decision Trees that see different data by means of bagging: each tree is trained on a bootstrap sample of the training data and considers a random subset of the features at each split.
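The two ingredients of bagging, bootstrap sampling and majority voting, can be sketched with the standard library alone. This is an illustration of the idea under simplifying assumptions, not how scikit-learn implements it; `bootstrap_sample` and `majority_vote` are hypothetical helpers and the tree votes are made up.

```python
import random
from collections import Counter

def bootstrap_sample(data, rng):
    """Draw len(data) items with replacement: the data one tree would see."""
    return [rng.choice(data) for _ in data]

def majority_vote(predictions):
    """Combine the predictions of the individual trees into one label."""
    return Counter(predictions).most_common(1)[0][0]

rng = random.Random(0)
data = list(range(10))  # stand-in for 10 training examples
bags = [bootstrap_sample(data, rng) for _ in range(3)]  # one bag per tree

tree_votes = [1, 0, 1]  # hypothetical predictions of 3 trees for one review
final = majority_vote(tree_votes)
print(final)  # 1
```

Because each bag is drawn with replacement, some examples are repeated and others are left out, so every tree is trained on slightly different data.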


In [None]:
from sklearn.ensemble import RandomForestClassifier


def get_accuracy(preds,labels):
    return sum([p == l for p,l in zip(preds,labels)])/len(preds)*100

def trainRandomForestClassifier(train_idx,test_idx):
  train_idx = np.array(train_idx,dtype='float')
  test_idx = np.array(test_idx,dtype='float')

  clf = RandomForestClassifier(n_estimators=100, max_depth=3,random_state=SEED)
  clf.fit(train_idx,train_targets)

  preds = clf.predict(test_idx)
  print('Accuracy', get_accuracy(preds,test_targets),'%')

**Train the model for the different preprocessing approaches. How do they affect the results?**

In [None]:
# Tokenized at word level and preprocessed data
train_idx,valid_idx,test_idx = bag_of_words([train_tok_sents,valid_tok_sents,test_tok_sents],tok_voc)
trainRandomForestClassifier(train_idx,test_idx)


Accuracy 78.488 %


In [None]:
#Tokenized and character data
train_idx,valid_idx,test_idx = bag_of_words([train_char_sents,valid_char_sents,test_char_sents],char_voc)
trainRandomForestClassifier(train_idx,test_idx)

Accuracy 60.844 %


In [None]:
#BPE tokenized data
train_idx,valid_idx,test_idx = bag_of_words([train_bpe_sents,valid_bpe_sents,test_bpe_sents],bpe_voc)
trainRandomForestClassifier(train_idx,test_idx)

Accuracy 79.392 %


# Conclusions
* Preprocessing plays an important role in the performance of NLP systems.
* Even for the same model, performance may vary a lot depending on how expressive the features selected for the task are.
* All the methods presented (with the exception of no preprocessing) are used in state-of-the-art work for several tasks.
 