<a href="https://colab.research.google.com/github/sachaRfd/Sentiment-Analysis-NLP/blob/main/Sentiment_Analysis_Notebook.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Simple Sentiment Analysis using IMDB PyTorch Dataset and simple LSTM:

All Imports:

In [None]:
!pip install torchdata  # Install TorchData (used by the torchtext datasets)
!pip install nltk  # Install the Natural Language Toolkit (NLTK)

import nltk
nltk.download('punkt')  # Tokeniser models used by word_tokenize
nltk.download('stopwords')  # Lists of the most common stopwords
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
import string



import numpy as np
import pandas as pd

import gc
from tqdm import tqdm

import torch
from torch import nn
from torch.nn.functional import pad
import torch.nn.functional as F
from torchtext.data import to_map_style_dataset
from torch.utils.data import DataLoader
from torch.optim import RMSprop


# Set the device to GPU if available, otherwise to CPU:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f'Your Current Device is {device}')  # Check the Colab Device we are using

from torchtext import data, datasets  # Import the datasets
from sklearn.model_selection import train_test_split  # Import splitting function
from sklearn.metrics import accuracy_score
import torchdata

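from torchtext.vocab import GloVe  # Import the GloVe embeddings

# The preprocessing functions below use a `glove` object, so it has to exist before they are defined.
# Assumed setup: the GloVe variant and dimension are not stated in this notebook.
glove = GloVe(name='6B', dim=100)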

Let's get the Train, Validation and Test Sets ready: 

In [None]:
# Get the train and test splits from the IMDB Dataset
train_dataset, test_dataset  = datasets.IMDB(root = '.data', split = ('train', 'test'))

# Let's now split the test set into a test and validation set: 
test_dataset, valid_dataset = train_test_split(list(test_dataset), train_size=.8)


## Understanding the Dataset: 
IMDB Reviews


In [None]:
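# The IMDB split is an iterable datapipe with no .shape attribute, so count its items instead
print(f'The training set contains {len(list(train_dataset))} reviews')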

Shape of the training Dataset is XXX. 

Let's Check if our data is balanced in the training set: 

In [None]:
# Code to check for balanced dataset
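
One way to check the balance is to count how often each label occurs. The snippet below is a minimal sketch on a hypothetical `toy_dataset` of `(label, text)` pairs; for the real data, the same counting could be applied to `list(train_dataset)`, assuming each item is a `(label, text)` tuple.

```python
from collections import Counter

# Hypothetical stand-in for list(train_dataset): (label, text) pairs
toy_dataset = [
    (1, "Dull plot and wooden acting."),
    (2, "A wonderful, moving film."),
    (1, "I want my two hours back."),
    (2, "Brilliant from start to finish."),
]

# Count how many reviews carry each label
label_counts = Counter(label for label, _ in toy_dataset)
total = sum(label_counts.values())

for label, count in sorted(label_counts.items()):
    print(f"label {label}: {count} reviews ({count / total:.0%})")
```

If the percentages are roughly equal, the dataset can be treated as balanced.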

### Let's now visualise some of the reviews: 

In [None]:
# Display the first 2 reviews (the datapipe cannot be sliced directly, so convert it to a list first)
list(train_dataset)[:2]

## To summarise the dataset: 
- The dataset consists of Movie reviews taken from IMDB
- The train set is formed of XXX reviews
- The validation set of XXX reviews
- and the Test set is of XXX reviews. 
- In the Y variable, a 1 denotes a Negative Review and a 2 denotes a Positive Review
- We can also see that our dataset is BALANCED OR UNBALANCED

# Data Preprocessing: 

For simple NLP applications, the data has to be processed in a certain manner: 
- Things to Check For:
  - All lower-case text
  - No Numbers in text
  - No Punctuation - Good for generalisation, even though some people use punctuation to show sentiment
- Transformation of the sentences into lists of tokens - Each sentence becomes a list of words
- We have to map each token to an integer - We will be using the GloVe vocabulary for this.
  - We want to get the Index of each of our words in the GloVe vocabulary.
- Padding of the sentences is also required, as some of the reviews are very long and others relatively short. Let's use a fixed length of 150 tokens here, which is an arbitrary choice: longer reviews are truncated and shorter ones are padded.
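
The cleaning steps in this list can be traced on a single sentence. This is only a sketch using the standard library: the example sentence is made up, and a plain whitespace split stands in for NLTK's `word_tokenize`, which may split some text differently.

```python
import string

sentence = "I watched it 3 times... Absolutely LOVED it!"

lowered = sentence.lower()                                      # all lower-case
no_digits = "".join(ch for ch in lowered if not ch.isdigit())   # drop numbers
no_punct = no_digits.translate(
    str.maketrans("", "", string.punctuation)                   # drop punctuation
)
tokens = no_punct.split()  # whitespace split, a stand-in for word_tokenize

print(tokens)
```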

# List of Functions: 
- To Remove Numbers in the sentences
- To Remove Punctuation
- To Tokenise the sentences
- To Remove Unwanted Stopwords
- To Get the Index of the words in the GloVe library
- To Pad the Sentences
- A final Function which transforms the input text by converting it to lowercase and applying all the above functions.

In [None]:
def remove_numbers(text):
    '''Function to Remove all digit characters from inputted text'''

    text = ''.join(char for char in text if not char.isdigit())
    return text

def remove_punctuation(text):
    '''Function to Remove all Punctuation from inputted text'''

    for punctuation in string.punctuation:
        text = text.replace(punctuation, '')  # Replace the punctuation with an empty string
    return text

def tokenize(text):
    '''Function to Tokenise any inputted text using NLTK tokenise'''

    word_tokens = word_tokenize(text)  # Tokenise Using the NLTK Tokenise Function
    return word_tokens

def remove_stopwords(word_tokens, language='english'):
    '''Function to remove all stopwords in given language from the inputted word tokens'''

    stop_words = set(stopwords.words(language))  # Most common stopwords for the given language
    word_tokens = [w for w in word_tokens if w not in stop_words]  # Keep only the words that are not stopwords
    return word_tokens

def get_index(text, vocab=glove):
    '''Function that gets the index of each token in a text from the GloVe vocabulary'''

    embedded_text = []
    for word in text:
        try:
            embedded_text.append(vocab.stoi[word])  # stoi: string-to-integer lookup
        except KeyError:
            pass  # Word is not in the vocabulary, so it is dropped
    return embedded_text  # Return the list of vocabulary indices of the known tokens


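def pad_sentence(text, MAX_LENGTH=150):
    '''Function that truncates or zero-pads a 1-D tensor of indices to a given length'''

    if text.shape[0] >= MAX_LENGTH:
        return text[:MAX_LENGTH]  # Too long: keep only the first MAX_LENGTH indices
    else:
        return pad(text, (0, MAX_LENGTH - text.shape[0]), 'constant', 0).long()  # Too short: pad with zeros on the right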


# Final Transform Function: 

def transform_text(text):
    '''Function that applies all the Data-Preprocessing Functions'''

    text = text.lower()
    text = remove_numbers(text)
    text = remove_punctuation(text)
    text = tokenize(text)
    text = remove_stopwords(text)
    text = torch.tensor(get_index(text)).long()
    return pad_sentence(text)
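
Two behaviours of this pipeline are easy to miss: tokens that are missing from the vocabulary are silently dropped, and every sequence comes out at exactly the same length. The sketch below shows both on plain Python lists. The names `toy_stoi`, `lookup_indices` and `pad_or_truncate` are hypothetical stand-ins for `glove.stoi`, `get_index` and `pad_sentence`, not part of the notebook.

```python
# Hypothetical toy vocabulary standing in for glove.stoi
toy_stoi = {"film": 7, "great": 12, "boring": 31}

def lookup_indices(tokens, stoi):
    """Map tokens to indices, skipping tokens missing from the vocabulary."""
    return [stoi[token] for token in tokens if token in stoi]

def pad_or_truncate(indices, max_length, pad_value=0):
    """Cut the list to max_length, or right-pad it with pad_value."""
    if len(indices) >= max_length:
        return indices[:max_length]
    return indices + [pad_value] * (max_length - len(indices))

tokens = ["great", "film", "zzzunknown", "boring"]
indices = lookup_indices(tokens, toy_stoi)   # "zzzunknown" is dropped
short = pad_or_truncate(indices, max_length=5)
cut = pad_or_truncate(indices, max_length=2)

print(indices, short, cut)
```

Note that the padding value 0 is also a valid vocabulary index, so padded positions cannot be told apart from the word stored at index 0 unless the model handles that separately.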

Now that we have setup our data-preprocessing, let's test it out on an example from our training dataset: 

In [None]:
example_train = next(iter(train_dataset))[1]  # Text of the first training example (each item is a (label, text) tuple)
transform_text(example_train)

Now that our pre-processing functions seem to work as we want, let's finalise the dataset with a dataloader. 

To avoid applying our transform function to all the data at once, let's apply it batch by batch from the dataloader using the transform_batch function: 

In [None]:
def transform_batch(batch):
    Y, X = list(zip(*batch))
    
    X_embedded = torch.stack([transform_text(txt) for txt in X])  # Get the transformed - embedded text
    
    return X_embedded, torch.tensor(Y).long()-1  # Return the Embedded text and the Y variable as .long() as it is a categorical label

train_dataset = to_map_style_dataset(train_dataset)  # Convert the iterable dataset to a map-style one (supports indexing and len), which the DataLoader needs for shuffle=True

train_loader = DataLoader(train_dataset, batch_size=256, collate_fn=transform_batch, shuffle=True)  # Make sure to have shuffle on true for best training
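
The unpacking and label shift inside `transform_batch` can be seen on a tiny batch. This is a sketch with a made-up `batch`. The shift from labels 1/2 to 0/1 is presumably there because class-index losses such as `nn.CrossEntropyLoss` expect targets that start at 0; the notebook does not say which loss it uses.

```python
# Hypothetical mini-batch of (label, text) pairs, labels in {1, 2}
batch = [
    (1, "Terrible pacing."),
    (2, "Loved every minute."),
    (2, "A real gem."),
]

labels, texts = zip(*batch)                   # unzip into two parallel tuples
zero_based = [label - 1 for label in labels]  # shift labels 1/2 to 0/1

print(labels, zero_based)
```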