# Sentiment Analysis: Data preprocessing

In this notebook we compare different approaches to data preprocessing applied to a real-world task: analysing movie reviews written by IMDb users. Each review belongs to one of two classes: positive if the user likes the movie, and negative otherwise. This tutorial is inspired by: https://towardsdatascience.com/sentiment-analysis-with-python-part-1-5ce197074184

## Data preprocessing

Our first step is to load and prepare the data for the experiments. We use the IMDb reviews dataset. The training dataset is stored in the zipped folder aclImbdb.tar and can also be downloaded from: http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
The dataset consists of 50,000 sentences split into two sets, train and test, of 25,000 sentences each.

For our experiments the corpus will be split in the following way:
* Train split: 18750 sentences extracted from the original training data.
* Validation split: 6250 sentences extracted from the original training data.
* Test split: 25000 sentences that form the original test data.

In [None]:
!pip install torchdata
!pip install 'portalocker>=2.0.0'
!pip install datasets
import nltk
nltk.download('omw-1.4')
nltk.download('punkt_tab')

In [None]:
import itertools
import random

import matplotlib.pyplot as plt
import numpy as np
import torch
from datasets import load_dataset
from sklearn.model_selection import train_test_split

# Load the IMDb dataset through the Hugging Face `datasets` library
dataset = load_dataset("imdb")


SEED = 0

train_data = dataset["train"]
test_data = dataset["test"]
train_split = list(train_data)
test_split = list(test_data)
train_sents = [t["text"].split() for t in train_split]
test_sents = [t["text"].split() for t in test_split]

train_targets = [t["label"] for t in train_split]
test_targets = [t["label"] for t in test_split]


train_sents, valid_sents, train_targets, valid_targets = train_test_split(train_sents,
                                                                          train_targets,
                                                                          test_size=0.25,
                                                                          random_state=SEED)


In order to feed our data into a model that is able to predict the sentiment of our movie reviews we should first create a vocabulary of all the words that appear in our training data.

One measure of the quality of our vocabulary is its coverage over the data. We define vocabulary coverage as the percentage of tokens from our data that are found in our vocabulary.
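Written as a formula, this is a restatement of the definition above. The symbols $D$ (the sequence of tokens in the data, counting every occurrence) and $V$ (the vocabulary) are introduced here for illustration only:

$$
\mathrm{coverage}(D, V) = \frac{\left|\{\, t \in D : t \in V \,\}\right|}{|D|} = 1 - \frac{N_{\mathrm{unk}}}{N_{\mathrm{total}}}
$$

For example, if 3 of the 20 tokens in a split are missing from the vocabulary, the coverage is $1 - 3/20 = 0.85$.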

As we will see later, vocabulary size is a parameter that can be tuned, and this coverage can serve as guidance.

**Compute the vocabulary as a dictionary with words as keys and the number of times the word appears as value.**

**Compute the coverage over the data given our vocabulary.**

In [None]:

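def compute_vocabulary(train_sents):
  """Count how often each token appears in the training sentences."""
  # '<unk>' gets a huge count, presumably so that it ranks first when sorting by frequency
  vocabulary = {'<unk>': 99999999}
  for sent in train_sents:
    for token in sent:
      vocabulary[token] = vocabulary.get(token, 0) + 1
  return vocabulary

def coverage(split, voc):
  """Fraction of the tokens in `split` that are present in `voc`."""
  total = 0
  unk = 0
  for sent in split:
    for token in sent:
      if token not in voc:
        unk += 1
      total += 1
  return 1.0 - (unk / total)


def voc_stats(split, voc):
  """Print the vocabulary size and its coverage over `split`."""
  print('**** VOCABULARY ***')
  print('* Unique words', len(voc))
  print('* Coverage', coverage(split, voc))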


In [None]:
voc = compute_vocabulary(train_sents)
voc_stats(valid_sents, voc)

Our first preprocessing step is word-level tokenization. Tokenization consists of separating words from the particles attached to them in the text, such as punctuation marks or contracted forms. Here every word is split off as its own token, so stop words that were attached to other tokens are separated as well. We are going to employ the 'nltk' library for this task.
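To make the difference with plain whitespace splitting concrete, here is a minimal sketch. It uses a regular expression as a rough stand-in for `nltk.word_tokenize`; this is an assumption made so the snippet runs without any downloads, and NLTK's real tokenizer applies more rules (for contractions, for example). The name `simple_tokenize` is hypothetical.

```python
import re

def simple_tokenize(text):
    """Rough stand-in for a word tokenizer (NOT NLTK's algorithm):
    returns runs of word characters and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)

review = "Great movie, great cast!"

# Whitespace splitting leaves punctuation glued to words ('movie,' and 'cast!').
whitespace_tokens = review.split()

# The tokenizer turns punctuation into separate tokens.
regex_tokens = simple_tokenize(review)

print(whitespace_tokens)
print(regex_tokens)
```

With whitespace splitting, `movie,` and `movie` would be two different vocabulary entries; after tokenization they collapse into one, which is why the vocabulary size and coverage are expected to change.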

In [None]:
import nltk
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('punkt')

from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import TweetTokenizer
import numpy as np


stop_words = set(stopwords.words('english'))  # a set makes membership tests fast
lemmatizer = WordNetLemmatizer()

def tokenize_sentence(sentence):
    return  nltk.word_tokenize(sentence)

**Compute the tokenization and vocabulary given the function above. How have the vocabulary size and the coverage changed?**

In [None]:
train_tok_sents = [tokenize_sentence(' '.join(s)) for s in train_sents]
valid_tok_sents = [tokenize_sentence(' '.join(s)) for s in valid_sents]
test_tok_sents = [tokenize_sentence(' '.join(s)) for s in test_sents]

tok_voc = compute_vocabulary(train_tok_sents)
voc_stats(valid_tok_sents, tok_voc)


In addition to the previous steps, we are going to further preprocess our data.

For each review in the corpus we apply a preprocessing step consisting of:

- Each word is replaced by its lemma. Lemmatization reduces vocabulary size, as all the different forms of a word are grouped under a single lemma.
- Casing is removed. This helps reduce ambiguity, as upper- and lower-cased appearances of a word would otherwise be counted as different entries in the vocabulary.
- Stop words are removed. This reduces sentence length and keeps the words that carry the meaning of the sentence.
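The three steps can be illustrated on a toy example. This is only a sketch: `TOY_STOP_WORDS` and `TOY_LEMMAS` are hypothetical miniature resources standing in for the NLTK stop-word list and the WordNet lemmatizer, and the steps are applied in a slightly different order (lowercasing first) so that the lookups work on lowercase text.

```python
# Hypothetical miniature resources, for illustration only.
TOY_STOP_WORDS = {"the", "a", "was", "and"}
TOY_LEMMAS = {"movies": "movie", "actors": "actor"}

def toy_preprocess(tokens):
    """Lowercase, drop stop words, then map each remaining token to its lemma."""
    lowered = [t.lower() for t in tokens]
    kept = [t for t in lowered if t not in TOY_STOP_WORDS]
    # Tokens without an entry in the toy lemma table are kept unchanged.
    return [TOY_LEMMAS.get(t, t) for t in kept]

tokens = ["The", "Movies", "and", "the", "actors", "shine"]
processed = toy_preprocess(tokens)
print(processed)
```

Six input tokens shrink to three, and `Movies` ends up as the same vocabulary entry as `movie` would.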

**Apply the preprocessing and compute the new vocabulary. How did the stats change?**




In [None]:
def preprocess_sentence(sentence):
    # Lowercase before the stop-word check so capitalised stop words ("The") are removed too
    return [lemmatizer.lemmatize(word.lower()) for word in sentence if word.lower() not in stop_words]


In [None]:
train_tok_sents = [preprocess_sentence(s) for s in train_tok_sents]
# TODO Complete sentence tokenizer for validation set and testset

# TODO: Make the tokenized vocabulary and calculate the stats


The previous preprocessing helps reduce the ambiguity produced by different forms of the same word or by stop words attached to other tokens, but the original tokens are mostly unchanged. Later in this notebook we will explore other levels of tokenization: characters and subwords.

In both cases (with and without tokenization) the vocabularies are too big to be handled in this task, as a lot of words appear only a few times in the dataset or some ambiguity is still present in the data. A common approach is to remove the least frequent tokens from the vocabulary until models can be trained with the resources available.
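The trade-off between vocabulary size and coverage can be seen on a toy corpus. The sketch below is an illustration under stated assumptions, not the notebook's solution: the data and the helper `coverage_at` are made up, and `collections.Counter` is used to rank tokens by frequency.

```python
from collections import Counter

# Toy data, for illustration only.
train = [["good", "movie"], ["good", "plot"], ["bad", "movie"], ["good", "acting"]]
valid = [["good", "movie", "overall"], ["bad", "plot"]]

# Token frequencies in the training data.
counts = Counter(token for sent in train for token in sent)

def coverage_at(size):
    """Coverage of `valid` when only the `size` most frequent training tokens are kept."""
    kept = {token for token, _ in counts.most_common(size)}
    total = sum(len(sent) for sent in valid)
    known = sum(token in kept for sent in valid for token in sent)
    return known / total

for size in (1, 2, 5):
    print(size, round(coverage_at(size), 2))
```

Coverage grows with vocabulary size but never reaches 1.0 here, because `overall` does not appear in the training data at all.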

**Create a 5000-token vocabulary that only includes the most frequent tokens in the dataset. What is the new coverage?**

In [None]:
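def reduce_vocabulary(voc,size=5000):
  # One possible implementation: keep the `size` most frequent tokens
  # ('<unk>' is kept because compute_vocabulary gives it the highest count)
  return dict(sorted(voc.items(), key=lambda item: item[1], reverse=True)[:size])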

SIZE_5000 = 5000
# TODO: Apply the reduced vocabulary for both tokenized and not tokenized vocabulary (e.g. tok_voc = reduce_vocabulary(...)) and calculate the new vocabulary




Let's start with characters, where every word in the dataset is broken down into its individual characters. This has several new characteristics:
* Vocabularies are orders of magnitude smaller than their word counterparts, as all words share the same script, which uses a limited set of characters.
* Each token does not provide a lot of information about the sentence, whereas individual words carry semantic and morphological information.
* Character use is more ambiguous: words are generally used in the same context consistently throughout the dataset, while a character can belong to a great number of words. Also, individual characters carry much less meaning than words or subwords.
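A toy comparison illustrates the coverage side of this trade-off. The sentences below are made up for illustration. On such a tiny corpus the character vocabulary is not smaller than the word vocabulary; the size advantage only shows at scale, where the number of distinct words keeps growing while the set of characters stays bounded by the script.

```python
# Toy training sentences, for illustration only.
sentences = ["the movie was great", "the plot was weak", "great cast weak plot"]

# Word-level and character-level vocabularies built from the same data.
word_vocab = {word for s in sentences for word in s.split()}
char_vocab = {ch for s in sentences for ch in s}

# An unseen sentence made of words that never appear in the training data.
unseen = "a sweet tale"
unseen_words = unseen.split()

word_cov = sum(w in word_vocab for w in unseen_words) / len(unseen_words)
char_cov = sum(c in char_vocab for c in unseen) / len(unseen)

print("word coverage:", word_cov)
print("char coverage:", char_cov)
```

None of the unseen words is known at word level, yet every one of their characters (including the space) was seen during training.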

**Based on the previous tokenization, compute a character vocabulary over the training data and measure its size and its coverage of the validation data.**


In [None]:
train_char_sents = [' '.join(s).strip() for s in train_tok_sents]
# TODO: Complete sentence char for validation set and test set

# TODO: Make the character level vocabulary and calculate the stats


Character level may be too extreme for some tasks, but it provides great coverage of the dataset. In those cases a good alternative is subword tokenization.

In this case words are split into pieces whose length depends on how common they are in the dataset. Long pieces that are really common are kept whole, for example the lemmas of common words in the data, while for uncommon words or affixes the vocabulary includes smaller pieces, down to individual characters.

This tokenization allows:
* Great coverage of the data, as only words that include characters not present in the training data will not be recognized.
* A parametrized vocabulary size. The number of tokens in the vocabulary is set when the tokenization is computed and can be tuned to improve the model's performance.
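To give an intuition of how such pieces are learned, here is a minimal sketch of the core Byte-Pair Encoding loop: repeatedly merge the most frequent pair of adjacent symbols. It is a simplified illustration under stated assumptions, not the subword-nmt implementation (it omits end-of-word markers and the `@@` separators, for instance); the function `learn_bpe` and the toy word counts are hypothetical.

```python
from collections import Counter

def learn_bpe(word_counts, num_merges):
    """Learn up to `num_merges` merge operations from a {word: count} mapping."""
    # Every word starts as a tuple of single characters.
    corpus = {tuple(word): count for word, count in word_counts.items()}
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pair_counts = Counter()
        for symbols, count in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += count
        if not pair_counts:
            break  # every word is already a single symbol
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        # Rewrite every word, replacing occurrences of the best pair by one symbol.
        new_corpus = {}
        for symbols, count in corpus.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            key = tuple(merged)
            new_corpus[key] = new_corpus.get(key, 0) + count
        corpus = new_corpus
    return merges, corpus

# Toy word frequencies, for illustration only.
word_counts = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
merges, segmented = learn_bpe(word_counts, num_merges=3)
print(merges)
print(segmented)
```

The number of merges plays the role of the vocabulary-size parameter: more merges produce longer, more word-like pieces, and fewer merges keep the segmentation close to characters.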

The example shows a call to Byte-Pair-Encoding tokenization using the standard subword-nmt library.

In [None]:
! pip install subword-nmt

# Write the data splits into files to compute the Byte-Pair Encoding (BPE)
with open('train.txt','w+') as tr:
  for s in train_tok_sents:
    print(' '.join(s),file=tr)

with open('valid.txt','w+') as vl:
  for s in valid_tok_sents:
    print(' '.join(s),file=vl)

with open('test.txt','w+') as ts:
  for s in test_tok_sents:
    print(' '.join(s),file=ts)


# Compute BPE codes and apply to the data splits
! subword-nmt learn-bpe -s 1000 < train.txt > bpe1000.codes
! subword-nmt apply-bpe -c bpe1000.codes < train.txt > train.bpe.txt
! subword-nmt apply-bpe -c bpe1000.codes < valid.txt > valid.bpe.txt
! subword-nmt apply-bpe -c bpe1000.codes < test.txt > test.bpe.txt

train_bpe_sents = []
with open('train.bpe.txt') as tr:
  for line in tr.readlines():
    train_bpe_sents.append(line.replace('\n','').split())

valid_bpe_sents = []
with open('valid.bpe.txt') as tr:
  for line in tr.readlines():
    valid_bpe_sents.append(line.replace('\n','').split())

test_bpe_sents = []
with open('test.bpe.txt') as tr:
  for line in tr.readlines():
    test_bpe_sents.append(line.replace('\n','').split())


**Compute the coverage and vocabulary size of the subword tokenization**

In [None]:
# TODO: Make the bpe level vocabulary and calculate the coverage and stats
bpe_voc = ...


**Pretrained Tokenizers from Language Models:**

Pretrained tokenizers are components of large, pretrained language models (like BERT, GPT, or RoBERTa). These tokenizers come with a pre-built vocabulary and a set of rules for splitting text into tokens (e.g., words or subwords). They have been trained on massive amounts of text data and are designed to work seamlessly with their corresponding language models.

Using a pretrained tokenizer has two main advantages:


*   Consistency: The tokenizer’s vocabulary and rules match those used during the model's training, ensuring that your text is processed in the same way.

*   Convenience: Instead of building your own tokenizer and managing a vocabulary manually, you can simply load an existing one using libraries such as Hugging Face Transformers.
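As an illustration of what "a pre-built vocabulary and a set of rules" means, the sketch below applies greedy longest-match-first splitting, the strategy used by WordPiece tokenizers such as BERT's, to single words. It is a simplified sketch, not the Hugging Face implementation: the vocabulary `toy_vocab` is hypothetical and tiny, and details such as text normalization and punctuation handling are left out.

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Split one word into the longest pieces found in `vocab`, left to right.
    Pieces that do not start the word carry the '##' continuation prefix."""
    tokens, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no piece matches: the whole word is unknown
        tokens.append(piece)
        start = end
    return tokens

# Hypothetical miniature vocabulary, for illustration only.
toy_vocab = {"un", "##aff", "##able", "play", "##ing", "##ed"}

for word in ("unaffable", "playing", "xyz"):
    print(word, wordpiece_tokenize(word, toy_vocab))
```

Because the vocabulary is fixed in advance, the same text is always split in the same way, which is what keeps the tokenizer consistent with its model.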



In [None]:
from transformers import AutoTokenizer

def tokenize_with_pretrained(model_name, sentences):
    # Load the tokenizer corresponding to the given model name.
    # This retrieves a pretrained tokenizer with its associated vocabulary and tokenization rules.
    tokenizer = AutoTokenizer.from_pretrained(model_name)

    # Check if the sentences are provided as lists of tokens.
    # If so, join them into raw text strings (because the tokenizer expects plain text).
    if isinstance(sentences[0], list):
        sentences = [" ".join(s) for s in sentences]

    # Tokenize the sentences:
    # - 'padding=False' keeps each sequence at its own length; we only need token strings
    #   here, and padding would add '[PAD]' tokens that distort the vocabulary statistics.
    # - 'truncation=True' limits sequences that are too long.
    # - 'max_length=512' sets the maximum allowed length of each tokenized sentence.
    encoded = tokenizer(sentences,
                        padding=False,
                        truncation=True,
                        max_length=512)

    # Convert the token IDs back into token strings.
    # This returns a list where each element corresponds to the tokenized representation of a sentence.
    return [tokenizer.convert_ids_to_tokens(ids) for ids in encoded["input_ids"]]


**Tokenize the sentences with the BERT tokenizer**




In [None]:
# Tokenize with BERT-base-uncased and XLM-RoBERTa-base
bert_tokenized_train = tokenize_with_pretrained("bert-base-uncased", train_sents)
# xlmr_tokenized_train = tokenize_with_pretrained("xlm-roberta-base", train_tok_sents)

bert_tokenized_valid = tokenize_with_pretrained("bert-base-uncased", valid_sents)
# xlmr_tokenized_valid = tokenize_with_pretrained("xlm-roberta-base", valid_tok_sents)

bert_tokenized_test = tokenize_with_pretrained("bert-base-uncased", test_sents)
# xlmr_tokenized_test = tokenize_with_pretrained("xlm-roberta-base", test_tok_sents)


In [None]:
bert_voc = compute_vocabulary(bert_tokenized_train)
voc_stats(bert_tokenized_valid,bert_voc)

In [None]:
# xlmr_voc = compute_vocabulary(xlmr_tokenized_train)
# voc_stats(xlmr_tokenized_valid,xlmr_voc)

# Classification Task

In this final section we are going to measure the performance of a model when using as features the different tokenization levels described above.

The first step in training a model is to decide how to feed our data into it. For these experiments we are going to employ bag-of-words vectors, as they do not require any further preprocessing.

Bag of Words (BOW) consists of representing each sentence of the dataset as a fixed-size vector whose length is the size of the vocabulary. For example, with our 5000-token vocabularies, each sentence is represented as a 5000-dimensional vector, built in the following steps:

- Initially all vector dimensions are initialized as 0.
- For each word in the sentence, 1 is added to the position of such word in the vocabulary.

The final vector is sparse (most of its values are 0) and contains the number of times each word appears in the sentence; the sum of all its dimensions is the number of tokens in the sentence.

Note that while this gives a fixed-size representation of the sentences, it lacks information about the order in which words appear.
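A worked example on a four-entry vocabulary shows both properties. The index `vocab_index` and the helper `to_bow` are hypothetical and only serve as an illustration.

```python
# Hypothetical token -> position mapping, for illustration only.
vocab_index = {"<unk>": 0, "good": 1, "movie": 2, "bad": 3}

def to_bow(tokens, index):
    """Count each token at its vocabulary position; unknown tokens go to '<unk>'."""
    vector = [0] * len(index)
    for token in tokens:
        vector[index.get(token, index["<unk>"])] += 1
    return vector

sentence = ["good", "good", "movie", "awful"]
vector = to_bow(sentence, vocab_index)
shuffled_vector = to_bow(["awful", "movie", "good", "good"], vocab_index)

print(vector)                     # counts per vocabulary position
print(sum(vector))                # equals the number of tokens in the sentence
print(vector == shuffled_vector)  # word order does not change the vector
```

The out-of-vocabulary token `awful` is counted in the `<unk>` position, and reordering the words yields exactly the same vector.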

In [None]:
def idx_voc(voc):
  items = list(voc.items())
  items.sort(key=lambda x: x[1], reverse=True)
  return {item[0]:n for n,item in enumerate(items)}

def bag_of_words(splits,voc):
  # Map every token to a vector position (most frequent tokens first)
  voc = idx_voc(voc)
  unk_idx = voc['<unk>']
  bow_splits = []
  for split in splits:
    bow = []
    for sent in split:
      vector = [0] * len(voc)
      for s in sent:
        # Out-of-vocabulary tokens are counted in the '<unk>' position
        vector[voc.get(s, unk_idx)] += 1
      bow.append(vector)
    bow_splits.append(bow)
  return bow_splits





When you use a pretrained BERT tokenizer, it splits text into subword units rather than whole words. Applying a simple bag-of-words model to these subwords can lead to a large and fragmented vocabulary, resulting in very high-dimensional representations in which most entries are 0. Stored as dense vectors, as our bag-of-words function does, this representation can be inefficient in both computation and memory. In contrast, the TF-IDF vectorizer produces a sparse matrix that only stores non-zero entries, and it also weights tokens by their importance across the corpus. This sparse format is more memory-friendly and better suited to the many subword tokens produced by BERT tokenization.

**TF-IDF Vectorization:**
The TfidfVectorizer is used to convert the text into a numerical representation.
When fitting on the training texts with vectorizer.fit_transform(train_texts), the vectorizer learns the vocabulary and calculates the inverse document frequency (IDF) for each token.
The test texts are then transformed using vectorizer.transform(test_texts) based on the vocabulary learned from the training data.
The parameter max_features=5000 limits the number of features (tokens) to 5000 to control memory usage and potentially reduce noise.


*   For character-level processing:
Each sentence (which is a list of characters) is joined using "".join(sent) to create a continuous string without spaces. This ensures that each character is preserved in sequence.

*   For word-level processing:
Each sentence (a list of word tokens) is joined with spaces using " ".join(sent). This creates a typical sentence string where words are separated by spaces, which is what the vectorizer expects.
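The weighting itself can be reproduced by hand on a toy corpus. The sketch below assumes scikit-learn's default settings (smoothed IDF, `idf = ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization of each document vector); it is an illustration of the formula in plain Python, not a replacement for `TfidfVectorizer`, and the documents are made up.

```python
import math
from collections import Counter

# Toy tokenized documents, for illustration only.
docs = [["good", "movie"], ["bad", "movie"], ["good", "good", "plot"]]

n_docs = len(docs)
# Document frequency: in how many documents each token appears.
doc_freq = Counter(token for doc in docs for token in set(doc))
# Smoothed inverse document frequency (assumed scikit-learn default).
idf = {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}

def tfidf(doc):
    """TF-IDF weights of one document, L2-normalized; unseen tokens are ignored."""
    counts = Counter(t for t in doc if t in idf)
    weights = {t: c * idf[t] for t, c in counts.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()} if norm else {}

vec = tfidf(docs[1])
print(vec)
```

In the second document, `bad` receives a larger weight than `movie` because it appears in fewer documents, which is the effect that a plain count vector cannot capture.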



In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

def compute_tfidf(splits, char_level=False):
    """
    Computes TF-IDF features for training and testing datasets.

    Parameters:
        splits (list): A list of three elements where:
                        - splits[0] is the training data,
                        - splits[1] is the validation data (unused here),
                        - splits[2] is the test data.
                        Each element should be a list of sentences, and each sentence is a list of tokens.
        char_level (bool): If True, use character-level tokenization; otherwise, use word-level tokenization.

    Returns:
        train_tfidf, test_tfidf: The TF-IDF transformed training and test data.
    """
    if char_level:
        # If char_level is True, we configure the vectorizer to break text into characters.
        # analyzer='char' means that individual characters (or character n-grams) are the tokens.
        # ngram_range=(1, 1) specifies that we only consider individual characters.
        # max_features limits the vocabulary size to 5000 to avoid memory issues.
        vectorizer = TfidfVectorizer(analyzer='char', ngram_range=(1, 1), max_features=5000)
        # For character-level analysis, join each sentence's characters without spaces.
        # This produces a continuous string where each character is kept in order.
        train_texts = ["".join(sent) for sent in splits[0]]
        test_texts = ["".join(sent) for sent in splits[2]]
    else:
        # Default behavior uses word-level tokenization.
        # The vectorizer will automatically tokenize the text into words.
        vectorizer = TfidfVectorizer(max_features=5000)
        # For word-level analysis, join tokens with a space to form a proper sentence.
        train_texts = [" ".join(sent) for sent in splits[0]]
        test_texts = [" ".join(sent) for sent in splits[2]]

    # Fit the vectorizer on the training data to learn the vocabulary and IDF values,
    # then transform the training texts into TF-IDF feature matrices.
    train_tfidf = vectorizer.fit_transform(train_texts)
    # Use the learned vocabulary to transform the test texts into TF-IDF feature matrices.
    test_tfidf = vectorizer.transform(test_texts)

    return train_tfidf, test_tfidf


As a classifier we are going to employ a Random Forest (https://towardsdatascience.com/understanding-random-forest-58381e0602d2), a model that consists of an ensemble of Decision Trees, each of which sees different data by means of bagging (bootstrap sampling of the training examples) and random feature selection.
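The bagging idea can be sketched in a few lines of plain Python. This is a deliberately reduced illustration, not a Random Forest: each "tree" is a one-feature threshold rule (a decision stump), and there is no random feature selection. All function names and the toy data are hypothetical.

```python
import random
from collections import Counter

def fit_stump(xs, ys):
    """Pick the threshold t (predict 1 when x >= t) with the fewest training errors."""
    best_threshold, best_errors = None, len(xs) + 1
    for threshold in sorted(set(xs)):
        errors = sum((x >= threshold) != bool(y) for x, y in zip(xs, ys))
        if errors < best_errors:
            best_threshold, best_errors = threshold, errors
    return best_threshold

def bagged_stumps(xs, ys, n_estimators=10, seed=0):
    """Train each stump on a bootstrap sample (drawn with replacement)."""
    rng = random.Random(seed)
    stumps = []
    for _ in range(n_estimators):
        idx = rng.choices(range(len(xs)), k=len(xs))
        stumps.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return stumps

def predict(stumps, x):
    """Majority vote over the individual stump predictions."""
    votes = Counter(int(x >= t) for t in stumps)
    return votes.most_common(1)[0][0]

# Toy one-dimensional data: small values are class 0, large values are class 1.
xs = [0, 1, 2, 3, 4, 5, 6, 7]
ys = [0, 0, 0, 0, 1, 1, 1, 1]
stumps = bagged_stumps(xs, ys)
print(predict(stumps, 0), predict(stumps, 7))
```

Each stump sees a slightly different sample and may learn a different threshold; averaging their votes is what makes the ensemble more stable than any single member.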


In [None]:
from sklearn.ensemble import RandomForestClassifier
def get_accuracy(preds,labels):
    return sum([p == l for p,l in zip(preds,labels)])/len(preds)*100

def preprocess_data(data):
    # Check if the data is sparse (i.e., has a toarray() method)
    if hasattr(data, "toarray"):
        return data.toarray()
    else:
        return np.array(data, dtype='float')

def trainRandomForestClassifier(train_idx, test_idx):
    # Preprocess both train and test data
    train_idx = preprocess_data(train_idx)
    test_idx = preprocess_data(test_idx)

    clf = RandomForestClassifier(n_estimators=10, max_depth=3, random_state=SEED)
    clf.fit(train_idx, train_targets)

    preds = clf.predict(test_idx)
    print('Accuracy', get_accuracy(preds, test_targets), '%')


**Train the model for the different preprocessing approaches. How do they affect the results?**

In [None]:
# NOT tokenized data
train_idx,valid_idx,test_idx = bag_of_words([train_sents,valid_sents,test_sents],voc)
trainRandomForestClassifier(train_idx,test_idx)

In [None]:
# Tokenized at word level and preprocessed data
# TODO: continue the training and evaluation for tokenized data


In [None]:
#Tokenized and character data
# TODO: continue the training and evaluation for Char data


In [None]:
#BPE tokenized data
# TODO: continue the training and evaluation for Bpe data


**Running the same experiment with TF-IDF and comparing the results**

---



In [None]:
train_idx, test_idx = compute_tfidf([train_sents,valid_sents,test_sents])
trainRandomForestClassifier(train_idx,test_idx)

In [None]:
# TODO: continue the training and evaluation for tokenized data


In [None]:
# TODO: continue the training and evaluation for Char data


In [None]:
# BPE tokenized data
# TODO: continue the training and evaluation for Bpe data


**Now we are going to try the BERT tokenizer with TF-IDF**

---



In [None]:
train_idx, test_idx = compute_tfidf([bert_tokenized_train, bert_tokenized_valid, bert_tokenized_test])
trainRandomForestClassifier(train_idx,test_idx)

# Conclusions
* Preprocessing plays an important role in the performance of NLP systems.
* Even for the same model, performance may vary a lot depending on how expressive the features selected for the task are.
* All methods presented (with the exception of no preprocessing) are employed in state-of-the-art work for several tasks.
* We can use pre-trained tokenizers from language models instead of creating our own.
