## What is sentiment analysis?

Sentiment analysis, also known as opinion mining, is a natural language processing (NLP) task that involves determining the emotional tone expressed in a piece of text. The primary goal of sentiment analysis is to classify the text into different categories based on the sentiment it conveys, such as positive, negative, or neutral. 

### Importing the modules

In [49]:
# Importing collections module for handling collections of data
import collections

# Importing datasets module for loading and processing datasets
import datasets

# Importing matplotlib.pyplot for plotting graphs and visualizations
import matplotlib.pyplot as plt

# Importing numpy for numerical operations and array manipulations
import numpy as np

# Importing torch for building and training neural networks
import torch

# Importing torch.nn for defining neural network layers and functions
import torch.nn as nn

# Importing torch.optim for optimization algorithms
import torch.optim as optim

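# Importing torchtext for text-processing utilities
import torchtext

# Importing build_vocab_from_iterator for building the vocabulary from tokenized text
from torchtext.vocab import build_vocab_from_iterator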

# For tokenization
from torchtext.data import get_tokenizer

# Importing stopwords from nltk.corpus for removing common words that do not contribute to the sentiment
from nltk.corpus import stopwords

# Storing the stopwords in a set for faster lookup
stop_words = set(stopwords.words('english'))

# Importing tqdm for progress bar visualization
import tqdm

# Importing string for its punctuation constant
import string

print("Everything imported successfully👍🏻")

Everything imported successfully👍🏻


### Defining the parameters

In [40]:
test_size = 0.25
max_length = 256
min_freq = 5

### Loading the datasets

In [8]:
train_data, test_data = datasets.load_dataset("imdb", split=["train", "test"])

Downloading readme: 100%|██████████| 7.81k/7.81k [00:00<?, ?B/s]
Downloading data: 100%|██████████| 21.0M/21.0M [00:03<00:00, 6.41MB/s]
Downloading data: 100%|██████████| 20.5M/20.5M [00:03<00:00, 6.60MB/s]
Downloading data: 100%|██████████| 42.0M/42.0M [00:05<00:00, 8.06MB/s]
Generating train split: 100%|██████████| 25000/25000 [00:00<00:00, 89288.92 examples/s]
Generating test split: 100%|██████████| 25000/25000 [00:00<00:00, 110008.76 examples/s]
Generating unsupervised split: 100%|██████████| 50000/50000 [00:00<00:00, 105918.43 examples/s]


In [9]:
# Let's check the features in the data
print(train_data.features)

{'text': Value(dtype='string', id=None), 'label': ClassLabel(names=['neg', 'pos'], id=None)}


In [11]:
train_data[10]

{'text': 'It was great to see some of my favorite stars of 30 years ago including John Ritter, Ben Gazarra and Audrey Hepburn. They looked quite wonderful. But that was it. They were not given any characters or good lines to work with. I neither understood or cared what the characters were doing.<br /><br />Some of the smaller female roles were fine, Patty Henson and Colleen Camp were quite competent and confident in their small sidekick parts. They showed some talent and it is sad they didn\'t go on to star in more and better films. Sadly, I didn\'t think Dorothy Stratten got a chance to act in this her only important film role.<br /><br />The film appears to have some fans, and I was very open-minded when I started watching it. I am a big Peter Bogdanovich fan and I enjoyed his last movie, "Cat\'s Meow" and all his early ones from "Targets" to "Nickleodeon". So, it really surprised me that I was barely able to keep awake watching this one.<br /><br />It is ironic that this movie is a

### Data cleaning

#### Tokenization
The tokenizer argument of get_tokenizer is the name of the tokenizer function. If it is None, get_tokenizer returns the split() function, which splits the sentence on whitespace. If it is basic_english, it returns the _basic_english_normalize() function, which normalizes the string first and then splits it on whitespace. If it is a callable, that function is returned as-is. If it names a tokenizer library (e.g. spacy, moses, toktok, revtok, subword), the tokenizer from the corresponding library is returned.

In [19]:
# For tokenization
tokenizer = get_tokenizer("basic_english")

print(tokenizer("My name is Yuvraj Singh"))


['my', 'name', 'is', 'yuvraj', 'singh']


In [28]:
def tokenize_sentence(raw_text, tokenizer, max_length):
    """
    Tokenizes the input text using the specified tokenizer, removes stop words and punctuation, and truncates the tokens to the maximum length.

    Args:
        raw_text (dict): A dictionary containing the text to be tokenized with the key "text".
        tokenizer (callable): A tokenizer function that takes a string and returns a list of tokens.
        max_length (int): The maximum number of tokens to return.

    Returns:
        dict: A dictionary containing the truncated list of tokens with the key "tokens".
    """
    # Tokenize, drop stop words and punctuation tokens, then truncate to max_length
    tokens = [
        token
        for token in tokenizer(raw_text["text"])
        if token not in stop_words and token not in string.punctuation
    ][:max_length]
    return {"tokens": tokens}

Each dataset provided by the datasets library is an instance of the Dataset class. The class offers many methods, but the main one we are interested in is map. With map we can apply a function to every example in the dataset and either update the example or create a new feature.

In [31]:
dummy_text = {
    "text": "Hello!!! This is a test sentence, with a lot of punctuation... and some stop words like 'the', 'is', 'at', 'which', and 'on'. Let's see how it works!"
}

tokenize_sentence(dummy_text, tokenizer,15)

{'tokens': ['hello',
  'test',
  'sentence',
  'lot',
  'punctuation',
  'stop',
  'words',
  'like',
  'let',
  'see',
  'works']}

In [32]:
train_data = train_data.map(
    tokenize_sentence, fn_kwargs={"tokenizer": tokenizer, "max_length": max_length}
)
test_data = test_data.map(
    tokenize_sentence, fn_kwargs={"tokenizer": tokenizer, "max_length": max_length}
)

Map: 100%|██████████| 25000/25000 [00:10<00:00, 2448.57 examples/s]
Map: 100%|██████████| 25000/25000 [00:09<00:00, 2538.05 examples/s]


In [33]:
train_data

Dataset({
    features: ['text', 'label', 'tokens'],
    num_rows: 25000
})
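
Conceptually, map calls the function on each example and merges the returned dictionary into that example, which is why a tokens feature now sits next to text and label. The following is a plain-Python sketch of that idea only, not the datasets implementation; map_examples, add_tokens and toy_data are hypothetical names introduced for illustration.

```python
def map_examples(examples, fn, fn_kwargs=None):
    """Apply fn to every example and merge the returned dict into a copy of it."""
    fn_kwargs = fn_kwargs or {}
    mapped = []
    for example in examples:
        updated = dict(example)                   # leave the original example untouched
        updated.update(fn(example, **fn_kwargs))  # add new keys or overwrite existing ones
        mapped.append(updated)
    return mapped


def add_tokens(example, max_length):
    # Toy stand-in for tokenize_sentence: lowercase, whitespace split, truncate
    return {"tokens": example["text"].lower().split()[:max_length]}


toy_data = [
    {"text": "A wonderful little film", "label": 1},
    {"text": "Dull and far too long", "label": 0},
]
toy_mapped = map_examples(toy_data, add_tokens, fn_kwargs={"max_length": 3})
print(toy_mapped[0])
# {'text': 'A wonderful little film', 'label': 1, 'tokens': ['a', 'wonderful', 'little']}
```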

### Creation of validation set

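The training set is used to fit the model's parameters, the validation set is used to tune hyperparameters and monitor performance during training, and the test set is held back for a final evaluation on unseen data. The IMDb dataset ships with labeled train and test splits but no validation split, so we carve a validation set out of the training data.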

We can split a Dataset using the train_test_split method, which returns a DatasetDict containing two splits, one called train and another called test.

In [36]:
train_valid_data = train_data.train_test_split(test_size=test_size)
train_data = train_valid_data["train"]
valid_data = train_valid_data["test"]

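With test_size = 0.25, a single split divides the 25,000 training examples into 18,750 training examples and 6,250 validation examples, with the original 25,000 test examples remaining untouched. The counts printed below (14,062 and 4,688, which sum to 18,750) are consistent with the split cell having been run a second time on the already-split training set, so run it only once per session.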

In [37]:
len(train_data), len(valid_data), len(test_data)

(14062, 4688, 25000)
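
The arithmetic behind these counts can be checked by hand. The sketch below assumes the held-out share is rounded up and the training share is rounded down; that rounding rule is an assumption chosen because it reproduces the printed numbers, and split_sizes is a hypothetical helper, not part of the datasets library.

```python
import math


def split_sizes(n_examples, test_size):
    """Return (n_train, n_valid) for a fractional test_size."""
    n_valid = math.ceil(n_examples * test_size)         # held-out share, rounded up
    n_train = math.floor(n_examples * (1 - test_size))  # remaining share, rounded down
    return n_train, n_valid


first = split_sizes(25_000, 0.25)     # splitting the original training set once
second = split_sizes(first[0], 0.25)  # splitting the resulting training set again
print(first)   # (18750, 6250)
print(second)  # (14062, 4688)
```

Under that assumption, applying the split twice gives exactly the 14,062 and 4,688 shown above.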

### Creating a Vocabulary

Next, we have to build a vocabulary. This is a look-up table where every unique token in your dataset has a corresponding index (an integer).

We do this because machine learning models cannot operate on strings, only on numerical values. Each index is used to construct a one-hot vector for each token. A one-hot vector is a vector where all the elements are 0, except one, which is 1, and the dimensionality is the total number of unique tokens in your vocabulary, commonly denoted by $V$.

For example:
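
A minimal sketch of one-hot encoding, using a toy four-token vocabulary. The tokens and the names toy_vocab, token_to_index and one_hot are made up for illustration and are not taken from the IMDb data. Each vector has as many entries as there are tokens in the vocabulary.

```python
# Toy vocabulary (hypothetical tokens, not taken from the IMDb data)
toy_vocab = ["film", "great", "love", "it"]
token_to_index = {token: index for index, token in enumerate(toy_vocab)}


def one_hot(token):
    """Return a list of zeros with a single 1 at the token's index."""
    vector = [0] * len(toy_vocab)
    vector[token_to_index[token]] = 1
    return vector


for token in toy_vocab:
    print(token, one_hot(token))
# film [1, 0, 0, 0]
# great [0, 1, 0, 0]
# love [0, 0, 1, 0]
# it [0, 0, 0, 1]
```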



One issue with creating a vocabulary using every single word in the dataset is that there is usually a considerable number of unique tokens. One way to combat this is either to construct the vocabulary using only the most commonly appearing tokens, or to use only tokens which appear a minimum number of times in the dataset. In this notebook, we do the latter, keeping only the tokens which appear at least 5 times (min_freq = 5).

What happens to tokens which appear fewer than 5 times? We replace them with a special unknown token, denoted by <unk>. For example, if the sentence were "This film is great and I love it", but the word "love" was not in the vocabulary, it would become: "This film is great and I <unk> it".
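
The replacement can be sketched in a few lines of plain Python. The vocabulary below is hypothetical and deliberately leaves out "love"; replace_unknown and known_tokens are illustrative names, not part of torchtext.

```python
UNK_TOKEN = "<unk>"

# Hypothetical vocabulary that happens not to contain "love"
known_tokens = {"this", "film", "is", "great", "and", "i", "it"}


def replace_unknown(tokens, vocabulary):
    """Swap every token missing from the vocabulary for the unknown token."""
    return [token if token in vocabulary else UNK_TOKEN for token in tokens]


sentence = "this film is great and i love it".split()
print(" ".join(replace_unknown(sentence, known_tokens)))
# this film is great and i <unk> it
```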

We use the build_vocab_from_iterator function from torchtext.vocab to create our vocabulary, specifying min_freq (the minimum number of times a token must appear to be added to the vocabulary) and specials (tokens which are placed at the start of the vocabulary, even if they don't appear min_freq times in the dataset).

The first special token is our unknown token; the other, <pad>, is a special token we'll use for padding sentences.

When we feed sentences into our model, we pass a batch of sentences, i.e. more than one, at the same time. Passing a batch of sentences is preferred to passing sentences one at a time, as it allows our model to perform computation on all sentences within a batch in parallel, thus reducing the time taken to train and evaluate our model. All sentences within a batch need to be the same length (in terms of the number of tokens). Thus, to ensure each sentence is the same length, any sentence shorter than the longest one in the batch needs to have padding tokens appended to its end.

For example, in a batch of two sentences with four and three tokens respectively, the second sentence is padded with a single <pad> token so that both have length four.
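
The padding step can be sketched in plain Python. The two sentences below are made-up examples, and pad_batch is a hypothetical helper written for illustration rather than a torchtext function.

```python
PAD_TOKEN = "<pad>"


def pad_batch(batch, pad_token=PAD_TOKEN):
    """Append pad tokens so every sentence matches the longest one in the batch."""
    longest = max(len(tokens) for tokens in batch)
    return [tokens + [pad_token] * (longest - len(tokens)) for tokens in batch]


toy_batch = [
    ["i", "love", "this", "film"],  # four tokens
    ["great", "movie", "overall"],  # three tokens
]
padded_batch = pad_batch(toy_batch)
for tokens in padded_batch:
    print(tokens)
# ['i', 'love', 'this', 'film']
# ['great', 'movie', 'overall', '<pad>']
```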

In [48]:
special_tokens = ["<unk>", "<pad>"]

vocab = build_vocab_from_iterator(
    train_data["tokens"],
    min_freq=min_freq,
    specials=special_tokens,
)

In [51]:
print("Length of vocabulary : ",len(vocab))
print("First 10 tokens in vocabulary : ",vocab.get_itos()[:10])

Length of vocabulary :  21734
First 10 tokens in vocabulary :  ['<unk>', '<pad>', 'movie', 'film', 'one', 'like', 'good', 'even', 'time', 'would']
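
To make the roles of min_freq and the special tokens concrete, here is a plain-Python sketch of the same idea on a toy corpus. It is a stand-in written for illustration, not torchtext's implementation: build_toy_vocab, toy_tokens, itos and stoi are hypothetical names, and the alphabetical tie-breaking is a choice made for this sketch.

```python
import collections


def build_toy_vocab(token_lists, min_freq, specials):
    """Return an index-to-token list: specials first, then frequent tokens."""
    counts = collections.Counter(token for tokens in token_lists for token in tokens)
    frequent = [
        token
        for token, count in counts.items()
        if count >= min_freq and token not in specials
    ]
    # Most frequent first; ties broken alphabetically (a choice made for this sketch)
    frequent.sort(key=lambda token: (-counts[token], token))
    return list(specials) + frequent


toy_tokens = [
    ["good", "film", "good", "cast"],
    ["bad", "film", "good", "score"],
]
itos = build_toy_vocab(toy_tokens, min_freq=2, specials=["<unk>", "<pad>"])
stoi = {token: index for index, token in enumerate(itos)}
unk_index = stoi["<unk>"]

print(itos)
# ['<unk>', '<pad>', 'good', 'film']
print([stoi.get(token, unk_index) for token in ["good", "cast", "film"]])
# [2, 0, 3]
```

In the sketch, tokens below min_freq fall back to the index of <unk> through dict.get. With the torchtext vocab object, a comparable fallback is typically configured by calling set_default_index with the index of <unk>.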
