# Sentiment Analysis
***
## Table of Contents

***

In [93]:
import pandas as pd
import numpy as np
import string
import re
from collections import Counter

## 1. Introduction

## 2. Loading Data

The dataset used in this project (retrieved from [Kaggle - IMDB Dataset of 50K Movie Reviews](https://www.kaggle.com/datasets/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews)) has two columns:

- **review**: Review comments in text.
- **sentiment**: Whether the review is positive or negative.

In [None]:
df = pd.read_csv("_datasets/IMDB_Dataset.csv")

In [95]:
df.head()

| | review | sentiment |
|---|---|---|
| 0 | One of the other reviewers has mentioned that ... | positive |
| 1 | A wonderful little production. <br /><br />The... | positive |
| 2 | I thought this was a wonderful way to spend ti... | positive |
| 3 | Basically there's a family where a little boy ... | negative |
| 4 | Petter Mattei's "Love in the Time of Money" is... | positive |


In [96]:
print("=" * 50)
print(f"Shape of the dataset: {df.shape}")
print("=" * 50)
print(f"Count of null values: {df.isnull().sum().sum()}")

Shape of the dataset: (50000, 2)
Count of null values: 0


In [97]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 2 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   review     50000 non-null  object
 1   sentiment  50000 non-null  object
dtypes: object(2)
memory usage: 781.4+ KB


## 3. Data Preprocessing
1. Text Cleaning
    - Lowercasing all letters
    - Removing HTML tags
    - Removing URLs
    - Removing emojis and non-ASCII characters
    - Removing punctuation
    - Removing extra whitespace
2. Tokenisation
3. Building Vocabulary and Mapping Tokens to Indices
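
The cleaning steps listed above can be traced on a single string. The following is a minimal sketch using only the standard library; the sample sentence and the URL in it are made up for illustration, and replacing each HTML tag with a space (rather than an empty string) is a choice of this sketch so that words on either side of a tag stay separate.

```python
import re
import string

# Hypothetical sample review, invented for this illustration.
sample = "I LOVED it!<br /><br />See https://example.com for more \u2764  Great   film."

text = sample.lower()                                           # lowercase
text = re.sub(r"<.*?>", " ", text)                              # HTML tags -> space
text = re.sub(r"http\S+|www\.\S+", "", text)                    # URLs
text = re.sub(r"[^\x00-\x7F]+", "", text)                       # emojis / non-ASCII
text = re.sub(f"[{re.escape(string.punctuation)}]", "", text)   # punctuation
text = re.sub(r"\s+", " ", text).strip()                        # extra whitespace

print(text)  # i loved it see for more great film
```

The order matters: tags and URLs are removed before punctuation, because stripping `<`, `>`, `:` and `/` first would leave fragments that the tag and URL patterns no longer match.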

### Text Cleaning

In [98]:
string.punctuation

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

In [99]:
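def clean_text(col: pd.Series) -> pd.Series:
    """Normalise raw review text before tokenisation."""
    col = col.str.lower()
    # Replace HTML tags with a space so words on either side of a tag stay separate
    col = col.str.replace(r"<.*?>", " ", regex=True)
    # Remove URLs
    col = col.str.replace(r"http\S+|www\.\S+", "", regex=True)
    # Remove emojis and other non-ASCII characters
    col = col.str.replace(r"[^\x00-\x7F]+", "", regex=True)
    # Remove punctuation
    col = col.str.replace(f"[{re.escape(string.punctuation)}]", "", regex=True)
    # Collapse runs of whitespace into a single space and trim both ends
    col = col.str.replace(r"\s+", " ", regex=True).str.strip()
    return col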

In [100]:
df["clean_text"] = clean_text(df["review"])
df.head()

| | review | sentiment | clean_text |
|---|---|---|---|
| 0 | One of the other reviewers has mentioned that ... | positive | one of the other reviewers has mentioned that ... |
| 1 | A wonderful little production. <br /><br />The... | positive | a wonderful little production the filming tech... |
| 2 | I thought this was a wonderful way to spend ti... | positive | i thought this was a wonderful way to spend ti... |
| 3 | Basically there's a family where a little boy ... | negative | basically theres a family where a little boy j... |
| 4 | Petter Mattei's "Love in the Time of Money" is... | positive | petter matteis love in the time of money is a ... |


### Tokenisation
Split every cleaned review on whitespace and collect the resulting tokens (words) into one flat list.

In [101]:
# Flatten the tokens of every review into a single list
all_words = [token for text in df["clean_text"] for token in text.split()]

# Equivalent version with a for loop and .extend():
# all_words = []
# for text in df["clean_text"]:
#     all_words.extend(text.split())

In [102]:
all_words[:5]

['one', 'of', 'the', 'other', 'reviewers']

### Building Vocabulary and Mapping Tokens to Indices
Use `Counter` to get the frequency of each word, then call `most_common()` to sort the words by frequency in descending order (passing `n` keeps only the top `n` most frequent words).
Then assign a unique index to each word to create the mapping (`word2index`), reserving index 0 for the padding token (`<PAD>`) and index 1 for the unknown token (`<UNK>`).
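
As a toy illustration of the counting and indexing described above, the sketch below builds a vocabulary from a ten-token sentence and keeps only the two most frequent words. The sentence and the names `toy_tokens` and `toy_word2index` are invented for this example and are not part of the notebook.

```python
from collections import Counter

# Hypothetical toy corpus: "the" occurs 3 times, "was" twice, the rest once.
toy_tokens = "the movie was good the plot was thin the end".split()
toy_counts = Counter(toy_tokens)

# Reserve 0 and 1 for the special tokens, then number real words from 2.
toy_word2index = {"<PAD>": 0, "<UNK>": 1}
for i, (word, _count) in enumerate(toy_counts.most_common(2), start=2):
    toy_word2index[word] = i

print(toy_word2index)  # {'<PAD>': 0, '<UNK>': 1, 'the': 2, 'was': 3}
```

Capping the vocabulary this way means every word outside the top `n` is later treated as unknown, which trades some detail for a smaller mapping.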

**Padding**:
- Padding is the process of adding special tokens (usually represented as `<PAD>`) to sequences so that all sequences in a batch have the same length.
- This is necessary because neural-network libraries such as PyTorch expect each batch as a single tensor, and every sequence in a tensor must have the same length.

*Example*:
- Original:
    - `["i", "loved", "this", "movie"]`
- After padding to length 6:
    - `["i", "loved", "this", "movie", "<PAD>", "<PAD>"]`
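
The padding example above could be reproduced with a small helper like the one below. `pad_tokens` is a hypothetical name, and truncating sequences longer than `max_len` is an assumption of this sketch; the notebook does not specify how over-long reviews are handled.

```python
def pad_tokens(tokens, max_len, pad_token="<PAD>"):
    """Hypothetical helper: truncate to max_len, then right-pad with pad_token."""
    tokens = tokens[:max_len]  # assumption: over-long sequences are cut off
    return tokens + [pad_token] * (max_len - len(tokens))


padded = pad_tokens(["i", "loved", "this", "movie"], max_len=6)
print(padded)  # ['i', 'loved', 'this', 'movie', '<PAD>', '<PAD>']
```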

**Unknown**:
- `<UNK>` stands for '*unknown token*', serving as a placeholder for any token (word) in the input text that does not exist in the vocabulary.

*Example*:
- Vocabulary:
    - `{"<PAD>": 0, "<UNK>": 1, "the": 2, "movie": 3}`
- Input:
    - `"the plot was amazing"` -> `["the", "<UNK>", "<UNK>", "<UNK>"]`
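
In index form, the same lookup can be sketched with `dict.get`, falling back to the `<UNK>` index for any token missing from the vocabulary. The `encode` function and `toy_vocab` are illustrative names, not part of the notebook.

```python
# Toy vocabulary taken from the example above.
toy_vocab = {"<PAD>": 0, "<UNK>": 1, "the": 2, "movie": 3}


def encode(text, vocab):
    """Hypothetical helper: map whitespace-separated tokens to vocabulary indices."""
    unk_index = vocab["<UNK>"]
    return [vocab.get(token, unk_index) for token in text.split()]


encoded = encode("the plot was amazing", toy_vocab)
print(encoded)  # [2, 1, 1, 1]
```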

In [None]:
word_counts = Counter(all_words)

In [111]:
all_words_sorted = word_counts.most_common()

In [118]:
# Indices 0 and 1 are reserved for the special tokens, so real words start at 2
word2index = {word: i for i, (word, _) in enumerate(all_words_sorted, start=2)}
word2index["<PAD>"] = 0
word2index["<UNK>"] = 1

In [None]:
word2index['dramatic']

(963, 79)