### Project Objective

This project develops an **end-to-end NLP pipeline** to classify **IMDB movie reviews** into *Positive* or *Negative* sentiment.  
The notebook demonstrates the entire workflow — from raw text cleaning, tokenization, and vectorization, to building and evaluating multiple models.  
Both **classical machine learning methods** (Naive Bayes, Random Forest) and **deep learning models** (Feedforward NN, LSTM) are compared.  
Different **feature extraction techniques** are tested, including Bag-of-Words, TF-IDF, Word2Vec embeddings, and padded sequences with embeddings.  
Evaluation primarily uses **accuracy**, but the pipeline is structured to extend easily to **precision, recall, F1-score, and ROC-AUC**.  
Results highlight the trade-offs between **fast, interpretable baselines** and **data-hungry neural networks**.  
The final notebook provides a **deployment-ready inference function** to classify unseen reviews.  
This project showcases the ability to handle **text preprocessing, model building, and comparison** in a reproducible and professional workflow.

---

#### Outline
- **Step 1:** Data collection & ingestion (IMDB reviews via `tensorflow_datasets`)  
- **Step 2:** Text preprocessing & cleaning (contractions, lowercasing, punctuation, stopwords, lemmatization/stemming)  
- **Step 3:** Feature extraction (BoW, TF-IDF, Word2Vec, Tokenizer + padding)  
- **Step 4:** Model training (Naive Bayes, Random Forest, Feedforward NN, LSTM)  
- **Step 5:** Evaluation (accuracy; extendable to F1, ROC-AUC, confusion matrix)  
- **Step 6:** Model comparison & insights (classical vs deep learning)  
- **Step 7:** Deployment readiness (inference function for new reviews)  
- **Step 8:** Conclusion & future improvements (hyperparameter tuning, error analysis, transformer baselines)  


In [1]:
# pip install gensim contractions


In [2]:
import numpy as np
import pandas as pd
import tensorflow_datasets as tfds
import tensorflow as tf

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.stem import PorterStemmer
from nltk import pos_tag

import re
import string

from gensim.models import Word2Vec
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense, Dropout
import contractions

nltk.download('punkt', quiet=True)
nltk.download('punkt_tab', quiet=True)
nltk.download('stopwords', quiet=True)
nltk.download('wordnet', quiet=True)
nltk.download('averaged_perceptron_tagger', quiet=True)
nltk.download('averaged_perceptron_tagger_eng', quiet=True)

True

### Imports and NLTK resource setup

Boilerplate note: The import statements are standard setup for this notebook and need no further explanation.

Explanation (for the nltk.download lines):
1) This block proactively downloads tokenizers, lexicons, and taggers that NLTK needs at runtime so the preprocessing functions do not fail with “Resource not found” errors on a fresh environment (e.g., Colab, new VM).
2) punkt and punkt_tab provide the Punkt sentence-tokenizer data that word_tokenize relies on (recent NLTK releases load punkt_tab). stopwords supplies the English stop word list. wordnet backs lemmatization via WordNetLemmatizer. averaged_perceptron_tagger and averaged_perceptron_tagger_eng provide the POS taggers used by pos_tag.
3) The quiet=True flag suppresses verbose downloader output; resources are cached to the user’s NLTK data directory for future runs.
4) Why this matters: downstream cleaning (stopword removal), lemmatization (requiring dictionaries), and POS-aware logic depend on these assets. Ensuring they are present makes the pipeline reproducible across machines and CI.

In [3]:
dataset, info = tfds.load('imdb_reviews', with_info=True, as_supervised=True)



Downloading and preparing dataset Unknown size (download: Unknown size, generated: Unknown size, total: Unknown size) to /root/tensorflow_datasets/imdb_reviews/plain_text/1.0.0...



Dataset imdb_reviews downloaded and prepared to /root/tensorflow_datasets/imdb_reviews/plain_text/1.0.0. Subsequent calls will reuse this data.


### Load IMDB dataset with TensorFlow Datasets

Explanation:
1) **`tfds.load('imdb_reviews', ...)`** pulls the IMDB reviews dataset directly from TensorFlow Datasets.  
   - Each sample is a tuple `(text, label)` where `text` is a review string and `label` is `0` (negative) or `1` (positive).  
2) **`with_info=True`** returns a second object (`info`) that contains dataset metadata such as number of samples, feature schema, and label names. This helps in documentation and sanity checks.  
3) **`as_supervised=True`** ensures that the dataset is returned in `(input, label)` format instead of dictionaries, making it compatible with supervised learning workflows.  
4) The variable `dataset` is a dictionary with keys `'train'`, `'test'`, and `'unsupervised'` (the three splits generated above). Each is a `tf.data.Dataset` object; only the labeled `'train'` and `'test'` splits are used in this notebook.  
5) Why this matters: This step sets up the raw data source for the entire pipeline. It gives a clean, standard dataset that can be fed into preprocessing before model training.  

In [4]:
info

tfds.core.DatasetInfo(
    name='imdb_reviews',
    full_name='imdb_reviews/plain_text/1.0.0',
    description="""
    Large Movie Review Dataset. This is a dataset for binary sentiment
    classification containing substantially more data than previous benchmark
    datasets. We provide a set of 25,000 highly polar movie reviews for training,
    and 25,000 for testing. There is additional unlabeled data for use as well.
    """,
    config_description="""
    Plain text
    """,
    homepage='http://ai.stanford.edu/~amaas/data/sentiment/',
    data_dir='/root/tensorflow_datasets/imdb_reviews/plain_text/1.0.0',
    file_format=tfrecord,
    download_size=80.23 MiB,
    dataset_size=129.83 MiB,
    features=FeaturesDict({
        'label': ClassLabel(shape=(), dtype=int64, num_classes=2),
        'text': Text(shape=(), dtype=string),
    }),
    supervised_keys=('text', 'label'),
    disable_shuffling=False,
    nondeterministic_order=False,
    splits={
        'test': <SplitInfo num_e

In [5]:
# Inspect a few examples from the dataset returned by tfds.load

train_dataset = dataset['train']
test_dataset = dataset['test']

# Get the first 5 examples from the training dataset
for example in train_dataset.take(5):
  text, label = example
  print(f"Text: {text.numpy().decode('utf-8')}")
  print(f"Label: {label.numpy()}")
  print("-" * 20)

Text: This was an absolutely terrible movie. Don't be lured in by Christopher Walken or Michael Ironside. Both are great actors, but this must simply be their worst role in history. Even their great acting could not redeem this movie's ridiculous storyline. This movie is an early nineties US propaganda piece. The most pathetic scenes were those when the Columbian rebels were making their cases for revolutions. Maria Conchita Alonso appeared phony, and her pseudo-love affair with Walken was nothing but a pathetic emotional plug in a movie that was devoid of any real meaning. I am disappointed that there are movies like this, ruining actor's like Christopher Walken's good name. I could barely sit through it.
Label: 0
--------------------
Text: I have been known to fall asleep during films, but this is usually due to a combination of things including, really tired, being warm and comfortable on the sette and having just eaten a lot. However on this occasion I fell asleep because the film 

### Inspect dataset samples

Explanation:
1) **Dataset split:**  
   - `train_dataset = dataset['train']` and `test_dataset = dataset['test']` extract the training and testing splits from the loaded IMDB dataset.  

2) **Iteration with `.take(5)`:**  
   - Instead of loading everything, this command samples the first 5 training examples for a quick look at the data format.  

3) **Inside the loop:**  
   - Each `example` is a tuple `(text, label)`.  
   - `text.numpy().decode('utf-8')` converts the TensorFlow string tensor into a readable Python string.  
   - `label.numpy()` converts the integer tensor into a NumPy integer scalar (`0 = Negative`, `1 = Positive`).  

4) **Why this matters:**  
   - It provides an immediate sanity check: reviews should look like plain English sentences, and labels should be 0 or 1.  
   - Ensures the dataset is correctly structured before preprocessing.  

5) **Expected output:**  
   - The text of each review followed by its label.  
   - Five examples are too few to confirm class balance; the dataset description reports 25,000 training and 25,000 test reviews, and per-class counts are best checked on the full split.  
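
To judge class balance over a whole split rather than five examples, the label counts can be tallied directly. The following is a minimal sketch; `label_distribution` is a hypothetical helper and the `labels` list is a made-up stand-in, not real IMDB data.

```python
from collections import Counter

def label_distribution(labels):
    """Return {label: fraction of examples} for an iterable of integer labels."""
    counts = Counter(int(label) for label in labels)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {label: count / total for label, count in sorted(counts.items())}

# Hypothetical stand-in labels; in the notebook this would be the full label list.
labels = [0, 0, 1, 1, 0, 1]
distribution = label_distribution(labels)
print(distribution)
```

A roughly even split between `0` and `1` suggests that accuracy is a reasonable headline metric; a skewed split would be a reason to look at precision and recall as well.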

In [6]:
# Step 2: Convert the TensorFlow training split into Python lists
train_texts = []
train_labels = []

for text, label in train_dataset:
    train_texts.append(text.numpy().decode('utf-8'))
    train_labels.append(label.numpy())

### Convert TensorFlow dataset to Python lists

Explanation:
1) **Purpose:**  
   - TensorFlow Datasets (`tf.data.Dataset`) objects are efficient but not as flexible for text preprocessing.  
   - Converting them into Python lists (`train_texts`, `train_labels`) makes it easier to apply `pandas`, `nltk`, and scikit-learn tools.  

2) **Loop logic:**  
   - Iterates over `train_dataset`, where each element is `(text, label)`.  
   - `text.numpy().decode('utf-8')` converts the TensorFlow string tensor into a normal Python string.  
   - `label.numpy()` converts the TensorFlow integer tensor into a NumPy integer scalar.  

3) **Result:**  
   - `train_texts`: list of raw review strings.  
   - `train_labels`: list of corresponding labels (`0` = negative, `1` = positive).  

4) **Why this matters:**  
   - Having text and labels in simple Python lists is the first step before creating a `pandas.DataFrame` and applying preprocessing like lowercasing, stopword removal, or lemmatization.  

In [7]:
# Step 2 (continued): Convert the TensorFlow test split into Python lists
test_texts = []
test_labels = []

for text, label in test_dataset:
    test_texts.append(text.numpy().decode('utf-8'))
    test_labels.append(label.numpy())

### Convert test dataset to Python lists

Explanation:
1) **Purpose:**  
   - Similar to the training set, the test set from TensorFlow Datasets is converted into plain Python lists for easier preprocessing and modeling.  

2) **Loop details:**  
   - Iterates through each `(text, label)` pair in `test_dataset`.  
   - `text.numpy().decode('utf-8')` → converts the TensorFlow text tensor into a human-readable Python string.  
   - `label.numpy()` → converts the label tensor into a NumPy integer scalar (`0` = negative, `1` = positive).  

3) **Output:**  
   - `test_texts`: list containing raw review strings from the test set.  
   - `test_labels`: list of integer sentiment labels corresponding to each review.  

4) **Why this matters:**  
   - Both training and test datasets are now in simple list format, making them compatible with `pandas.DataFrame`, scikit-learn vectorizers, and Keras tokenizers.  
   - Prepares the foundation for splitting, cleaning, and feature extraction.  

In [8]:
train_df = pd.DataFrame({'text': train_texts, 'sentiment': train_labels})
test_df = pd.DataFrame({'text': test_texts, 'sentiment': test_labels})

### Create Pandas DataFrames

Explanation:
1) **Conversion into DataFrames:**  
   - `train_df` is built with two columns:  
     - **`text`** → the review string.  
     - **`sentiment`** → the target label (0 = negative, 1 = positive).  
   - `test_df` is created in the same way for the test dataset.  

2) **Why use DataFrames:**  
   - `pandas.DataFrame` provides tabular structure, making it easier to explore, clean, and preprocess the data.  
   - Enables quick operations like `.head()`, `.value_counts()`, or applying preprocessing functions across rows.  

3) **Result:**  
   - Both training and test sets are now in a human-readable tabular format.  
   - This is the standard form for most NLP workflows before moving into feature extraction and modeling.  

In [9]:
train_df

Unnamed: 0,text,sentiment
0,This was an absolutely terrible movie. Don't b...,0
1,"I have been known to fall asleep during films,...",0
2,Mann photographs the Alberta Rocky Mountains i...,0
3,This is the kind of film for a snowy Sunday af...,1
4,"As others have mentioned, all the women that g...",1
...,...,...
24995,"I have a severe problem with this show, severa...",0
24996,"The year is 1964. Ernesto ""Che"" Guevara, havin...",1
24997,Okay. So I just got back. Before I start my re...,0
24998,When I saw this trailer on TV I was surprised....,0


In [10]:
test_df

Unnamed: 0,text,sentiment
0,There are films that make careers. For George ...,1
1,"A blackly comic tale of a down-trodden priest,...",1
2,"Scary Movie 1-4, Epic Movie, Date Movie, Meet ...",0
3,Poor Shirley MacLaine tries hard to lend some ...,0
4,As a former Erasmus student I enjoyed this fil...,1
...,...,...
24995,"Feeling Minnesota is not really a road movie, ...",0
24996,"This is, without doubt, one of my favourite ho...",1
24997,Most predicable movie I've ever seen...extreme...,0
24998,It's exactly what I expected from it. Relaxing...,1


In [11]:
# Step 3: Preprocessing - Expand contractions

train_df['text'] = train_df['text'].apply(lambda x: contractions.fix(x))
test_df['text'] = test_df['text'].apply(lambda x: contractions.fix(x))

### Expand contractions in text

Explanation:
1) **What contractions are:**  
   - Contractions are shortened forms of words like *don’t → do not*, *I’m → I am*.  
   - They introduce inconsistency in text data (e.g., "don’t" and "do not" would be treated as different tokens).  

2) **Code breakdown:**  
   - `train_df['text'].apply(lambda x: contractions.fix(x))` applies the `contractions.fix()` function to each review.  
   - This replaces all contractions in the review with their expanded form.  
   - The same operation is applied to `test_df['text']`.  

3) **Why this matters:**  
   - Expanding contractions improves **token consistency** and reduces vocabulary sparsity.  
   - This helps both **classical models** (BoW, TF-IDF) and **neural models** learn more effectively because “don’t” and “do not” are unified.  

4) **Result:**  
   - Both train and test reviews now have standardized text with contractions fully expanded.  
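
The idea behind contraction expansion can be sketched with a lookup table and a regular expression. This is a simplified stand-in, not how the `contractions` package is implemented: `CONTRACTION_MAP` and `expand_contractions` are hypothetical names, and the mapping is deliberately tiny.

```python
import re

# Hypothetical, deliberately tiny mapping; the real library covers far more forms.
CONTRACTION_MAP = {
    "don't": "do not",
    "i'm": "i am",
    "it's": "it is",
    "i've": "i have",
}

_PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(key) for key in CONTRACTION_MAP) + r")\b",
    flags=re.IGNORECASE,
)

def expand_contractions(text):
    """Replace each known contraction with its expanded form."""
    def _replace(match):
        expanded = CONTRACTION_MAP[match.group(0).lower()]
        # Preserve a leading capital letter, e.g. "Don't" -> "Do not".
        if match.group(0)[0].isupper():
            expanded = expanded[0].upper() + expanded[1:]
        return expanded
    return _PATTERN.sub(_replace, text)

print(expand_contractions("Don't be lured in. It's exactly what I've expected."))
```

A table-driven approach like this only handles the forms it lists, which is one reason to rely on a maintained library for real data.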

In [12]:
train_df

Unnamed: 0,text,sentiment
0,This was an absolutely terrible movie. Do not ...,0
1,"I have been known to fall asleep during films,...",0
2,Mann photographs the Alberta Rocky Mountains i...,0
3,This is the kind of film for a snowy Sunday af...,1
4,"As others have mentioned, all the women that g...",1
...,...,...
24995,"I have a severe problem with this show, severa...",0
24996,"The year is 1964. Ernesto ""Che"" Guevara, havin...",1
24997,Okay. So I just got back. Before I start my re...,0
24998,When I saw this trailer on TV I was surprised....,0


In [13]:
test_df

Unnamed: 0,text,sentiment
0,There are films that make careers. For George ...,1
1,"A blackly comic tale of a down-trodden priest,...",1
2,"Scary Movie 1-4, Epic Movie, Date Movie, Meet ...",0
3,Poor Shirley MacLaine tries hard to lend some ...,0
4,As a former Erasmus student I enjoyed this fil...,1
...,...,...
24995,"Feeling Minnesota is not really a road movie, ...",0
24996,"This is, without doubt, one of my favourite ho...",1
24997,Most predicable movie I have ever seen...extre...,0
24998,It is exactly what I expected from it. Relaxin...,1


In [14]:
# Step 4: Preprocessing - Convert to lowercase
train_df['text'] = train_df['text'].str.lower()
test_df['text'] = test_df['text'].str.lower()

### Convert text to lowercase

Explanation:
1) **Purpose:**  
   - Converts every character in the reviews to lowercase.  
   - Example: *"Great Movie!" → "great movie!"*.  

2) **Code breakdown:**  
   - `train_df['text'].str.lower()` applies the lowercase transformation to each string in the `text` column.  
   - The same transformation is applied to `test_df['text']`.  

3) **Why this matters:**  
   - Reduces **vocabulary size** by treating “Movie”, “movie”, and “MOVIE” as the same token.  
   - Prevents redundant features in BoW/TF-IDF and improves embedding consistency.  

4) **Result:**  
   - All reviews in training and test sets are now standardized to lowercase text, simplifying tokenization and feature extraction.  

In [15]:
train_df

Unnamed: 0,text,sentiment
0,this was an absolutely terrible movie. do not ...,0
1,"i have been known to fall asleep during films,...",0
2,mann photographs the alberta rocky mountains i...,0
3,this is the kind of film for a snowy sunday af...,1
4,"as others have mentioned, all the women that g...",1
...,...,...
24995,"i have a severe problem with this show, severa...",0
24996,"the year is 1964. ernesto ""che"" guevara, havin...",1
24997,okay. so i just got back. before i start my re...,0
24998,when i saw this trailer on tv i was surprised....,0


In [16]:
# Step 5: Preprocessing - Remove URLs
# No spaces around '|': whitespace inside a regex is matched literally.
train_df['text'] = train_df['text'].apply(lambda x: re.sub(r'http\S+|www\S+', '', x))
test_df['text'] = test_df['text'].apply(lambda x: re.sub(r'http\S+|www\S+', '', x))

### Remove URLs from text

Explanation:
1) **Why URLs matter:**  
   - Reviews may contain links like `http://example.com` or `www.imdb.com`.  
   - These links don’t add meaningful sentiment information and only increase vocabulary noise.  

2) **Regex used:**  
   - `r'http\S+|www\S+'` matches substrings starting with `http` or `www` followed by non-space characters. The alternation must not have spaces around `|`, because a space in the pattern has to match a literal space in the text, which would miss URLs at the start or end of a review.  
   - `re.sub(..., '', x)` replaces such patterns with an empty string.  

3) **Code breakdown:**  
   - Applied to each review in `train_df['text']` and `test_df['text']`.  
   - Any detected URL is removed, leaving only the review text.  

4) **Why this matters:**  
   - Ensures feature extraction (BoW, TF-IDF, embeddings) isn’t polluted by random domain names or tokens.  
   - Keeps only linguistically meaningful words for sentiment classification.  

5) **Result:**  
   - Both training and testing datasets now contain reviews stripped of any URLs or web addresses.  
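
Whitespace inside a regular expression is literal unless `re.VERBOSE` is set, so a pattern with spaces around `|` behaves differently from one without. The sketch below compares the two on hypothetical sample strings; `strip_urls` and both constant names are illustrative only.

```python
import re

SPACED_PATTERN = r'http\S+ | www\S+'  # the spaces are part of the pattern
TIGHT_PATTERN = r'http\S+|www\S+'     # no literal spaces required

def strip_urls(text, pattern):
    """Remove every match of `pattern` from `text`."""
    return re.sub(pattern, '', text)

at_end = "full review at http://example.com/review"
at_start = "www.imdb.com has the cast list"

# The spaced pattern needs a space after an http URL or before a www URL.
print(repr(strip_urls(at_end, SPACED_PATTERN)))
print(repr(strip_urls(at_end, TIGHT_PATTERN)))
print(repr(strip_urls(at_start, SPACED_PATTERN)))
print(repr(strip_urls(at_start, TIGHT_PATTERN)))
```

With the spaced pattern both sample strings come back unchanged, while the tight pattern removes the URL in each case.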

In [17]:
# Step 6: Preprocessing - Remove punctuation
train_df['text'] = train_df['text'].apply(lambda x: x.translate(str.maketrans('', '', string.punctuation)))
test_df['text'] = test_df['text'].apply(lambda x: x.translate(str.maketrans('', '', string.punctuation)))

### Remove punctuation from text

Explanation:
1) **Why remove punctuation:**  
   - Characters like `. , ! ? : ;` usually do not add sentiment meaning in this dataset.  
   - Keeping them increases the vocabulary unnecessarily (e.g., "movie!" and "movie" would be treated as different tokens).  

2) **Code breakdown:**  
   - `str.maketrans('', '', string.punctuation)` creates a translation table that maps every punctuation symbol to `None`.  
   - `.translate(...)` applies this mapping to each review, removing all punctuation marks.  
   - Done for both `train_df['text']` and `test_df['text']`.  

3) **Why this matters:**  
   - Simplifies tokens and reduces noise for feature extraction methods like BoW, TF-IDF, and embeddings.  
   - Ensures “movie!”, “movie,” and “movie” are treated as the same token.  

4) **Result:**  
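   - All reviews in training and test sets are now free of ASCII punctuation. Two side effects are worth knowing: non-ASCII marks such as curly quotes are not in `string.punctuation` and remain, and because punctuation is deleted rather than replaced by a space, text like *“good...bad”* fuses into the single token *“goodbad”*.  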
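
The translation-table approach can be tried on a couple of hypothetical strings to see exactly what it removes. In this sketch the table is built once and reused, and `strip_ascii_punctuation` is an illustrative helper name.

```python
import string

# Translation table that deletes every character in string.punctuation.
PUNCT_TABLE = str.maketrans('', '', string.punctuation)

def strip_ascii_punctuation(text):
    """Delete ASCII punctuation; every other character is left untouched."""
    return text.translate(PUNCT_TABLE)

# '/' is ASCII punctuation, so "10/10" collapses to "1010".
print(strip_ascii_punctuation("great movie!!! 10/10, would watch again."))
# Curly quotes are outside string.punctuation and survive.
print(strip_ascii_punctuation("isn’t it “great”?"))
```

Building the table once avoids recreating it for every review, which matters slightly when the function is applied to tens of thousands of rows.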

In [18]:
# Step 7: Preprocessing - Tokenize
nltk.download('punkt_tab', quiet=True)
train_df['text'] = train_df['text'].apply(word_tokenize)
test_df['text'] = test_df['text'].apply(word_tokenize)

### Preprocessing — Tokenization

Explanation:
1) **Purpose:**  
   - Breaks each review into a list of individual words (tokens).  
   - Example: *"this movie was great"* → `["this", "movie", "was", "great"]`.  

2) **Code breakdown:**  
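   - `nltk.download('punkt_tab', quiet=True)` ensures the tokenizer data is available. Recent NLTK releases load `punkt_tab` rather than the older `punkt` package, and the resource was already fetched in the imports cell, so this call is a harmless repeat.  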
   - `word_tokenize` from NLTK is applied to each review in both training and test sets.  
   - This replaces the string review with a Python list of tokens.  

3) **Why this matters:**  
   - Tokenization is the foundation of NLP pipelines.  
   - Every representation used later (BoW, TF-IDF, Word2Vec, padded sequences for the LSTM) is built from tokens, whether tokenization happens here or inside a vectorizer.  

4) **Result:**  
   - The `text` column in both DataFrames now contains tokenized reviews (lists of words) instead of plain strings.  

In [19]:
# Step 8: Preprocessing - Remove stopwords
stop_words = set(stopwords.words('english'))
train_df['text'] = train_df['text'].apply(lambda x: ' '.join([word for word in x if word not in stop_words]))
test_df['text'] = test_df['text'].apply(lambda x: ' '.join([word for word in x if word not in stop_words]))

### Preprocessing — Remove Stopwords

Explanation:
1) **What stopwords are:**  
   - Common words like *“the”*, *“is”*, *“and”*, which occur frequently but contribute little to sentiment meaning.  

2) **Code breakdown:**  
   - `stop_words = set(stopwords.words('english'))` loads the predefined English stopword list from NLTK.  
   - For each review (already tokenized into words), a list comprehension keeps only the words **not** in `stop_words`.  
   - `' '.join([...])` joins the filtered tokens back into a single string.  

3) **Why this matters:**  
   - Removes linguistic noise and reduces feature space size.  
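   - Focuses the vocabulary on content words like *“great”*, *“boring”*, *“excellent”*. One caveat: NLTK's English list also contains negations such as *“not”* and *“no”*, so *“do not be lured”* is reduced to *“lured”* (visible in the first row of the output below), which can hide the polarity of a phrase.  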

4) **Result:**  
   - Both training and test DataFrames now contain cleaned text strings with stopwords removed, ready for stemming/lemmatization and feature extraction.  
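
Because NLTK's English stopword list includes negations such as "not" and "no", filtering can drop words that carry sentiment. The sketch below shows the effect and one possible adjustment. It uses a small hand-picked `STOP_WORDS` set as a hypothetical stand-in for the full NLTK list, and `remove_stopwords` is an illustrative helper.

```python
# Hypothetical subset of an English stopword list, for illustration only.
STOP_WORDS = {"the", "is", "was", "a", "this", "do", "not", "no"}
NEGATIONS = {"not", "no"}

def remove_stopwords(tokens, stop_words):
    """Keep tokens that are not in `stop_words`, preserving order."""
    return [token for token in tokens if token not in stop_words]

tokens = ["this", "movie", "was", "not", "good"]

# Default filtering drops the negation along with the function words.
print(remove_stopwords(tokens, STOP_WORDS))
# Subtracting the negations from the stopword set keeps "not" in place.
print(remove_stopwords(tokens, STOP_WORDS - NEGATIONS))
```

Whether keeping negations helps depends on the model: unigram counts gain little from an isolated "not", while bigrams or sequence models can use it.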

In [20]:
train_df.head()

Unnamed: 0,text,sentiment
0,absolutely terrible movie lured christopher wa...,0
1,known fall asleep films usually due combinatio...,0
2,mann photographs alberta rocky mountains super...,0
3,kind film snowy sunday afternoon rest world go...,1
4,others mentioned women go nude film mostly abs...,1


In [21]:
# Step 9: Preprocessing - POS Tagging
def simple_pos_tag(text):
  tokens = word_tokenize(text)
  return pos_tag(tokens)

train_df['pos_tags'] = train_df['text'].apply(simple_pos_tag)
test_df['pos_tags'] = test_df['text'].apply(simple_pos_tag)

### Preprocessing — POS Tagging

Explanation:
1) **Purpose:**  
   - Part-of-Speech (POS) tagging assigns grammatical roles (noun, verb, adjective, etc.) to each token.  
   - Example: *“great movie”* → `[("great", "JJ"), ("movie", "NN")]`.  

2) **Code breakdown:**  
   - `simple_pos_tag` function:  
     - Takes a text string.  
     - Tokenizes it with `word_tokenize`.  
     - Applies `pos_tag` to assign POS labels.  
   - Applied to both `train_df['text']` and `test_df['text']`, creating a new column `pos_tags`.  

3) **Why this matters:**  
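   - POS tags can drive **better lemmatization** when passed to the lemmatizer (knowing whether a word is a verb, noun, or adjective). Note that the `pos_tags` column is stored here, but the lemmatization step in this notebook does not read it.  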
   - Can also be used for advanced feature engineering (e.g., analyzing adjective usage in sentiment).  

4) **Result:**  
   - Each row now has an additional column `pos_tags` containing token–POS pairs, enriching the dataset for further preprocessing or analysis.  

In [22]:
# Step 10: lemmatization

def lemmatize_text(text):
  lemmatizer = WordNetLemmatizer()
  tokens = word_tokenize(text)
  return ' '.join([lemmatizer.lemmatize(word) for word in tokens])

train_df['text_lemmatized'] = train_df['text'].apply(lemmatize_text)
test_df['text_lemmatized'] = test_df['text'].apply(lemmatize_text)

### Preprocessing — Lemmatization

Explanation:
1) **What lemmatization is:**  
   - Reduces words to their **dictionary form** (lemma).  
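   - Example (when the part of speech is supplied): *“running” → “run”* as a verb, *“better” → “good”* as an adjective. Called without a POS argument, as in the cell above, `lemmatize` treats every token as a noun, so only noun inflections such as *“movies” → “movie”* are reduced.  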
   - Unlike stemming, lemmatization is linguistically informed and uses WordNet.  

2) **Code breakdown:**  
   - `WordNetLemmatizer()` initializes the lemmatizer.  
   - `word_tokenize(text)` splits the review into tokens.  
   - Each token is passed to `.lemmatize(word)`, converting it to its base form.  
   - `' '.join([...])` reconstructs the cleaned tokens back into a string.  
   - Applied to both `train_df['text']` and `test_df['text']`, creating new columns `text_lemmatized`.  

3) **Why this matters:**  
   - Reduces vocabulary size by mapping different inflections of a word to one base form.  
   - Improves generalization for models like BoW, TF-IDF, and embeddings.  
   - With POS-aware lemmatization, sentiment-related variations (e.g., *“liked”*, *“likes”*, *“liking”*) map to one lemma; with the noun default used here, most verb forms are left unchanged.  

4) **Result:**  
   - New columns `text_lemmatized` contain reviews in standardized dictionary form, ready for feature extraction.  
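
`WordNetLemmatizer.lemmatize` accepts a `pos` argument (`'n'`, `'v'`, `'a'`, `'r'`) and defaults to noun. One way to make use of Penn Treebank tags from `pos_tag` is to map them to those codes first. The sketch below is an assumption about how that could be wired up, not part of the notebook: `penn_to_wordnet` and `lemmatize_tagged` are hypothetical helpers, and a stub stands in for the real lemmatizer so the block runs without NLTK data.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag (as returned by nltk.pos_tag) to a WordNet POS code."""
    if tag.startswith('J'):
        return 'a'  # adjective
    if tag.startswith('V'):
        return 'v'  # verb
    if tag.startswith('R'):
        return 'r'  # adverb
    return 'n'      # noun, which is also the lemmatizer's default

def lemmatize_tagged(tagged_tokens, lemmatize):
    """Lemmatize (token, tag) pairs with a caller-supplied lemmatize(word, pos) callable."""
    return [lemmatize(word, penn_to_wordnet(tag)) for word, tag in tagged_tokens]

# Stub that only records which POS code would be passed on.
def show_pos(word, pos):
    return f"{word}/{pos}"

tagged = [("great", "JJ"), ("movies", "NNS"), ("running", "VBG"), ("quickly", "RB")]
print(lemmatize_tagged(tagged, show_pos))

# In the notebook this could be called as:
#   lemmatize_tagged(pos_tag(tokens), WordNetLemmatizer().lemmatize)
```

Passing the lemmatizer in as a callable keeps the mapping logic testable on its own and leaves the choice of lemmatizer to the caller.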

In [23]:
# Step 11: Stemming

def stem_text(text):
  stemmer = PorterStemmer()
  tokens = word_tokenize(text)
  return ' '.join([stemmer.stem(word) for word in tokens])

train_df['text_stemmed'] = train_df['text'].apply(stem_text)
test_df['text_stemmed'] = test_df['text'].apply(stem_text)

### Preprocessing — Stemming

Explanation:
1) **What stemming is:**  
   - Reduces words to their **root form** by chopping off suffixes, without guaranteeing valid dictionary words.  
   - Example: *“running” → “run”*, *“happiness” → “happi”*.  
   - Faster but less precise compared to lemmatization.  

2) **Code breakdown:**  
   - `PorterStemmer()` initializes the Porter stemming algorithm.  
   - `word_tokenize(text)` splits the review into individual tokens.  
   - Each token is processed with `.stem(word)` to obtain its stem.  
   - `' '.join([...])` joins the stems back into a string.  
   - Applied to both `train_df['text']` and `test_df['text']`, creating new columns `text_stemmed`.  

3) **Why this matters:**  
   - Further reduces vocabulary size, which can improve training efficiency for BoW/TF-IDF models.  
   - Helps group related words (e.g., *“liked”*, *“likes”*, *“liking”*) under one stem (*“like”*).  

4) **Result:**  
   - New columns `text_stemmed` contain stemmed versions of the reviews, providing an alternative representation for experimentation.  
   - Now the dataset has **both lemmatized and stemmed versions** of text, offering flexibility in feature engineering.  

In [24]:
# We'll use lemmatized text for our vectorization

train_texts = train_df['text_lemmatized']
test_texts = test_df['text_lemmatized']

### Choose Lemmatized Text for Vectorization

Explanation:
1) **Why lemmatized text:**  
   - Compared to stemming, lemmatization produces proper dictionary words, preserving readability and linguistic meaning.  
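   - Example: *“happiness”* stays *“happiness”* under lemmatization but becomes *“happi”* under Porter stemming.  
   - This makes the lemmatized text easier to inspect; whether it also yields higher accuracy than stemmed text is an empirical question that the `text_stemmed` column allows testing.  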

2) **Code breakdown:**  
   - `train_texts` is assigned the lemmatized reviews from `train_df['text_lemmatized']`.  
   - `test_texts` is assigned the lemmatized reviews from `test_df['text_lemmatized']`.  

3) **Why this matters:**  
   - Ensures consistency in feature extraction (BoW, TF-IDF, embeddings, LSTM input).  
   - Keeps the preprocessing pipeline clean and avoids vocabulary distortions introduced by stemming.  

4) **Result:**  
   - Both training and test sets are now prepared with **lemmatized text** as the standard input for vectorization.  

In [None]:
# Step 12: Text Vectorization
#A. Bag of Words

vectorizer = CountVectorizer()
X_train_bow = vectorizer.fit_transform(train_texts)
X_test_bow = vectorizer.transform(test_texts)

### Text Vectorization — Bag of Words (BoW)

Explanation:
1) **What BoW does:**  
   - Converts text into a **sparse matrix of word counts**.  
   - Each column corresponds to a unique word, and each row is a review.  
   - Example: *“great movie”* → `[1,1,0,0,...]` depending on vocabulary.  

2) **Code breakdown:**  
   - `CountVectorizer()` initializes the vectorizer.  
   - `.fit_transform(train_texts)` learns the vocabulary from training data and transforms reviews into word-count vectors (`X_train_bow`).  
   - `.transform(test_texts)` applies the same vocabulary to convert test data (ensures no data leakage).  

3) **Why this matters:**  
   - Provides a baseline, interpretable representation of text.  
   - Simple but effective for classical models like Naive Bayes or Random Forest.  

4) **Result:**  
   - `X_train_bow`: sparse matrix of shape `(num_train_samples, vocab_size)`.  
   - `X_test_bow`: sparse matrix of shape `(num_test_samples, vocab_size)`.  
   - Ready to be fed into baseline machine learning models.  
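
The fit/transform split described above can be illustrated without scikit-learn. The sketch below is a simplified stand-in for `CountVectorizer` (it splits on whitespace only and returns dense lists rather than a sparse matrix); `fit_vocabulary` and `transform` are hypothetical helper names and the documents are made up.

```python
from collections import Counter

def fit_vocabulary(documents):
    """Map each distinct whitespace-separated token to a column index (sorted order)."""
    tokens = sorted({token for doc in documents for token in doc.split()})
    return {token: index for index, token in enumerate(tokens)}

def transform(documents, vocabulary):
    """Return one count vector per document; tokens outside the vocabulary are dropped."""
    rows = []
    for doc in documents:
        counts = Counter(doc.split())
        row = [0] * len(vocabulary)
        for token, count in counts.items():
            if token in vocabulary:
                row[vocabulary[token]] = count
        rows.append(row)
    return rows

train_docs = ["great movie great cast", "terrible movie"]
test_docs = ["great plot"]

# The vocabulary is learned from training documents only.
vocabulary = fit_vocabulary(train_docs)
print(vocabulary)
print(transform(train_docs, vocabulary))
# "plot" never appeared in training, so it contributes nothing to the test vector.
print(transform(test_docs, vocabulary))
```

Learning the vocabulary from the training split alone is what keeps test-set words from influencing the feature space.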

In [34]:
"""
CountVectorizer(): counts word occurrences per document

Parameters:

  max_features: None (use all words; more memory)
                5000 (limit the vocabulary to the 5000 most frequent words)

  min_df = 5 (ignore words appearing in fewer than 5 documents)

  ngram_range = (1, 2) (use unigrams and bigrams)

"""


In [35]:
# TF-IDF

tfidf_vectorizer = TfidfVectorizer(max_features = 5000)
X_train_tfidf = tfidf_vectorizer.fit_transform(train_texts).toarray()
X_test_tfidf = tfidf_vectorizer.transform(test_texts).toarray()


### Text Vectorization — TF-IDF (Term Frequency–Inverse Document Frequency)

Explanation:
1) **What TF-IDF does:**  
   - Builds on BoW but re-weights words based on importance.  
   - Common words across documents (like “movie”) get **lower weight**, while rarer but meaningful words (like “masterpiece”) get **higher weight**.  

2) **Code breakdown:**  
   - `TfidfVectorizer(max_features=5000)` keeps only the 5000 terms with the highest total count across the training corpus (a frequency cutoff, not an informativeness ranking), which reduces dimensionality.  
   - `.fit_transform(train_texts)` learns the vocabulary and weights from training data, converting reviews into TF-IDF vectors (`X_train_tfidf`).  
   - `.transform(test_texts)` applies the same learned vocabulary to test reviews.  
   - `.toarray()` converts the sparse matrix into a dense NumPy array, which some models (like Keras dense layers) require. The cost is memory: 25,000 × 5,000 `float64` values is about 1 GB per split.  

3) **Why this matters:**  
   - Captures not just **word presence** but also **informativeness**, improving performance over plain BoW for many tasks.  
   - Helps models distinguish between generic and sentiment-bearing words.  

4) **Result:**  
   - `X_train_tfidf`: dense matrix `(num_train_samples, 5000)` with TF-IDF features.  
   - `X_test_tfidf`: same format for test samples.  
   - Ready for classical ML models like Naive Bayes and Random Forest.  
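
The re-weighting can be made concrete with a small hand computation. The sketch below follows the scheme scikit-learn documents for its default settings (smoothed IDF, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization), written in plain Python. `fit_idf` and `tfidf_vector` are hypothetical helper names and the three documents are made up.

```python
import math
from collections import Counter

def fit_idf(tokenized_docs):
    """Smoothed IDF per term: ln((1 + n) / (1 + df)) + 1."""
    n_docs = len(tokenized_docs)
    doc_freq = Counter(term for doc in tokenized_docs for term in set(doc))
    return {term: math.log((1 + n_docs) / (1 + df)) + 1 for term, df in doc_freq.items()}

def tfidf_vector(tokens, idf):
    """L2-normalized TF-IDF weights for one document, as {term: weight}."""
    counts = Counter(token for token in tokens if token in idf)
    weights = {term: count * idf[term] for term, count in counts.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / norm for term, w in weights.items()} if norm else {}

docs = [["great", "movie"], ["terrible", "movie"], ["masterpiece", "movie"]]
idf = fit_idf(docs)
vector = tfidf_vector(docs[2], idf)

# "movie" occurs in every document, so its IDF is the minimum value of 1.0.
print(idf["movie"], idf["masterpiece"])
print(vector)
```

In the third document both words occur once, yet "masterpiece" ends up with the larger weight because it appears in fewer documents.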

In [36]:
# Word2Vec

tokenized_train = [word_tokenize(text) for text in train_texts]
tokenized_test = [word_tokenize(text) for text in test_texts]
"""
vector_size = dimensionality of the word vectors
window = context window used to learn word relationships
min_count = 1 : all words are included

"""
w2v_model = Word2Vec(tokenized_train, vector_size=100, window=5, min_count=1, workers=4)

### Text Vectorization — Word2Vec Embeddings

Explanation:
1) **What Word2Vec does:**  
   - Learns **dense vector representations** for words based on context.  
   - Words with similar meanings (e.g., *“great”*, *“excellent”*) end up with vectors close to each other in space.  
   - Unlike BoW/TF-IDF, it captures **semantic relationships**.  

2) **Code breakdown:**  
   - `tokenized_train` and `tokenized_test`: each review is tokenized into a list of words.  
   - `Word2Vec(...)`: trains embeddings on the tokenized training data.  
     - `vector_size=100`: each word is represented by a 100-dimensional vector.  
     - `window=5`: considers 5 words before and after a target word for context.  
     - `min_count=1`: includes even words that appear only once (useful here, but in larger corpora this would add noise).  
     - `workers=4`: uses 4 CPU threads in parallel for faster training.  

3) **Why this matters:**  
   - Produces richer features for neural models than simple word counts.  
   - Can be aggregated (e.g., averaged per document) to create **document-level embeddings**.  

4) **Result:**  
   - A trained `w2v_model` that maps each word in the vocabulary to a dense 100-dimensional embedding vector.  
   - These embeddings will later be combined (averaged or weighted) to represent entire reviews.  
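
Closeness in embedding space is usually measured with cosine similarity. The sketch below uses hand-made 3-dimensional vectors as stand-ins for learned embeddings; the numbers are invented for illustration, and real vectors in this notebook would come from `w2v_model.wv` and have 100 dimensions.

```python
import numpy as np

# Hypothetical stand-in embeddings; real ones would be w2v_model.wv[word].
toy_vectors = {
    "great":     np.array([0.9, 0.8, 0.1]),
    "excellent": np.array([0.8, 0.9, 0.2]),
    "terrible":  np.array([-0.7, -0.9, 0.3]),
}

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

sim_close = cosine_similarity(toy_vectors["great"], toy_vectors["excellent"])
sim_far = cosine_similarity(toy_vectors["great"], toy_vectors["terrible"])

print(f"great vs excellent: {sim_close:.2f}")
print(f"great vs terrible:  {sim_far:.2f}")
```

On the trained model, gensim exposes the same measure through `w2v_model.wv.similarity("great", "excellent")` and `w2v_model.wv.most_similar("great")`.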

In [37]:
def get_w2v_features(tokens,model):
  """
  Purpose: builds a document-level embedding that retains semantic information.
  Filters tokens to those in the Word2Vec vocabulary.
  Returns a zero vector if no token is in the vocabulary.
  np.mean averages the word vectors, preserving the semantic information.

  Alternatives: use TF-IDF weights or Doc2Vec for document-level embeddings.

  """
  words = [word for word in tokens if word in model.wv]

  if len(words) == 0:
    return np.zeros(100)

  return np.mean([model.wv[word] for word in words], axis = 0)

### Function — Generate Document-Level Word2Vec Features

Explanation:
1) **Purpose:**  
   - Converts a list of tokens (a single review) into one **fixed-length vector** by averaging Word2Vec embeddings of all valid tokens.  
   - This gives a **document-level representation** instead of word-level vectors.  

2) **Code breakdown:**  
   - `words = [word for word in tokens if word in model.wv]`  
     - Filters tokens to include only those found in the trained Word2Vec vocabulary.  
   - `if len(words) == 0: return np.zeros(100)`  
     - If no valid tokens are found (e.g., all words are OOV), returns a zero vector of size 100.  
   - `np.mean([model.wv[word] for word in words], axis=0)`  
     - Computes the average of word embeddings across all valid tokens.  
     - Produces a single 100-dimensional vector per document.  

3) **Why this matters:**  
   - Averaging embeddings is a simple yet effective way to summarize an entire review.  
   - Maintains semantic information compared to sparse methods like BoW/TF-IDF.  
   - Provides dense input features that work well for classical ML or shallow neural networks.  

4) **Result:**  
   - Each review can now be represented as a 100-dimensional dense vector, ready for training models.  
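
The function's docstring names TF-IDF weighting as an alternative to a plain mean. A minimal sketch of that idea follows, assuming a plain dict of word vectors and a dict of per-word weights. `weighted_doc_vector`, `toy_vectors`, and `toy_weights` are hypothetical names that do not appear in the notebook.

```python
import numpy as np

def weighted_doc_vector(tokens, vectors, weights, dim):
    """Weighted mean of word vectors; tokens without a vector are skipped."""
    known = [t for t in tokens if t in vectors]
    if not known:
        return np.zeros(dim)
    w = np.array([weights.get(t, 1.0) for t in known])   # default weight 1.0
    stacked = np.stack([vectors[t] for t in known])
    return (w[:, None] * stacked).sum(axis=0) / w.sum()

# Invented 2-dimensional vectors and weights, for illustration only.
toy_vectors = {"movie": np.array([1.0, 0.0]), "masterpiece": np.array([0.0, 1.0])}
toy_weights = {"movie": 1.0, "masterpiece": 3.0}

doc_vec = weighted_doc_vector(
    ["movie", "masterpiece", "unseen"], toy_vectors, toy_weights, dim=2
)
print(doc_vec)
```

Because the rarer word carries three times the weight, the document vector sits closer to it than a plain average would.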

In [38]:
X_train_w2v = np.array([get_w2v_features(tokens, w2v_model) for tokens in tokenized_train])
X_test_w2v = np.array([get_w2v_features(tokens, w2v_model) for tokens in tokenized_test])


### Apply Word2Vec Features to Train/Test Sets

Explanation:
1) **Purpose:**  
   - Converts all tokenized reviews into **fixed-length document embeddings** using the `get_w2v_features` function.  
   - Ensures each review is represented by a 100-dimensional vector.  

2) **Code breakdown:**  
   - `[get_w2v_features(tokens, w2v_model) for tokens in tokenized_train]`  
     - Loops through each tokenized review in training data.  
     - Generates an averaged Word2Vec embedding using the trained `w2v_model`.  
   - `np.array(...)`  
     - Collects the embeddings into a NumPy array of shape `(num_train_samples, 100)`.  
   - The same logic is applied to `tokenized_test`.  

3) **Why this matters:**  
   - Provides dense, semantic representations for both train and test reviews.  
   - These features can be used in classical ML models (e.g., Random Forest) or neural networks.  

4) **Result:**  
   - `X_train_w2v`: NumPy array of shape `(train_size, 100)`.  
   - `X_test_w2v`: NumPy array of shape `(test_size, 100)`.  
   - Each row corresponds to one review’s semantic embedding.  
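
A quick sanity check on the resulting matrices is to look for rows that are entirely zero, since `get_w2v_features` returns a zero vector when none of a review's tokens are in the vocabulary. The sketch below uses a small random stand-in (`X_demo`, a hypothetical name) rather than the real `X_test_w2v`.

```python
import numpy as np

# Stand-in for X_test_w2v: 4 reviews, 100-dimensional embeddings (random values).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(4, 100))
X_demo[2] = 0.0  # simulate a review whose tokens were all out of vocabulary

# Indices of rows where every feature is zero.
zero_rows = np.flatnonzero(~X_demo.any(axis=1))

print("shape:", X_demo.shape)
print("all-zero rows:", zero_rows.tolist())
```

A large number of zero rows in the test matrix would suggest that many test reviews share little vocabulary with the training data.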

# **Model Building**

In [40]:
y_train = train_df['sentiment'].values
y_test = test_df['sentiment'].values

### Prepare Target Variables for Modeling

Explanation:
1) **Purpose:**  
   - Extracts the sentiment labels (`0` = negative, `1` = positive) from the DataFrames into NumPy arrays.  
   - These arrays serve as the **dependent variable (y)** during training and evaluation.  

2) **Code breakdown:**  
   - `train_df['sentiment'].values` converts the sentiment column of the training set into a NumPy array (`y_train`).  
   - `test_df['sentiment'].values` does the same for the test set (`y_test`).  

3) **Why this matters:**  
   - Most machine learning models (scikit-learn, TensorFlow/Keras) expect labels as NumPy arrays or tensors.  
   - Ensures that `y_train` and `y_test` align properly with the feature matrices (`X_train_*`, `X_test_*`).  

4) **Result:**  
   - `y_train`: vector of length equal to the number of training reviews.  
   - `y_test`: vector of length equal to the number of test reviews.  
   - Both ready for supervised model training.  
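
Two cheap checks are worth running at this point: that features and labels have the same number of rows, and what accuracy a majority-class guess would reach. The arrays below are hypothetical stand-ins for `X_train_*` and `y_train`.

```python
import numpy as np

# Hypothetical stand-ins for the notebook's feature matrix and label vector.
X_demo = np.zeros((6, 100))
y_demo = np.array([1, 0, 1, 0, 0, 1])

# Features and labels must describe the same reviews, in the same order.
assert X_demo.shape[0] == y_demo.shape[0], "features and labels are misaligned"

class_counts = np.bincount(y_demo)                  # index 0 = negative, 1 = positive
majority_baseline = class_counts.max() / len(y_demo)

print("class counts:", class_counts.tolist())
print("majority-class accuracy:", majority_baseline)
```

The IMDB reviews dataset is balanced between the two classes, so the accuracies reported later should be read against a baseline of roughly 50%.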

In [41]:
# Classical ML - 1 (NB with BoW)
nb_bow = MultinomialNB()
nb_bow.fit(X_train_bow, y_train)
nb_bow_pred = nb_bow.predict(X_test_bow)
nb_bow_accuracy = accuracy_score(y_test, nb_bow_pred)

### Classical ML — Naive Bayes with Bag of Words

Explanation:
1) **Model choice:**  
   - `MultinomialNB` is well-suited for **discrete count features** like those produced by BoW.  
   - It assumes word frequencies follow a multinomial distribution, making it a strong baseline for text classification.  

2) **Code breakdown:**  
   - `nb_bow = MultinomialNB()` initializes the Naive Bayes classifier.  
   - `.fit(X_train_bow, y_train)` trains the model on BoW features from the training set.  
   - `.predict(X_test_bow)` generates predictions for the test set.  
   - `accuracy_score(y_test, nb_bow_pred)` calculates accuracy by comparing predictions against true labels.  

3) **Why this matters:**  
   - Provides a **fast, interpretable baseline** for sentiment analysis.  
   - Often surprisingly competitive despite its simplicity.  

4) **Result:**  
   - `nb_bow_accuracy`: a numeric score (e.g., ~82–83%) representing classification accuracy on the test set using Naive Bayes + BoW features.  
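
The same fit/predict pattern can be followed end to end on a corpus small enough to inspect by hand. This is a minimal sketch with a hypothetical four-review corpus; the `toy_*` names are illustrative and the labels follow the notebook's convention (`1` = positive).

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical mini-corpus, not taken from the IMDB data.
toy_texts = [
    "great movie loved it",
    "excellent film great acting",
    "terrible movie waste of time",
    "boring film awful plot",
]
toy_labels = [1, 1, 0, 0]

toy_vectorizer = CountVectorizer()
X_toy = toy_vectorizer.fit_transform(toy_texts)    # sparse word-count matrix

toy_nb = MultinomialNB()                           # alpha=1.0 (Laplace smoothing) by default
toy_nb.fit(X_toy, toy_labels)

new_texts = ["great acting", "awful waste of time"]
toy_pred = toy_nb.predict(toy_vectorizer.transform(new_texts))
print(dict(zip(new_texts, toy_pred.tolist())))
```

Each prediction follows from which class saw the review's words more often during training, which is what makes the model easy to interpret.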

In [43]:
# Classical ML - 2 (NB with TF-IDF)
nb_tfidf = MultinomialNB()
nb_tfidf.fit(X_train_tfidf, y_train)
nb_tfidf_pred = nb_tfidf.predict(X_test_tfidf)
nb_tfidf_accuracy = accuracy_score(y_test, nb_tfidf_pred)

### Classical ML — Naive Bayes with TF-IDF

Explanation:
1) **Model choice:**  
   - Naive Bayes is also effective with **TF-IDF features**, which emphasize informative words while down-weighting common ones.  
   - Works well for high-dimensional sparse text data.  

2) **Code breakdown:**  
   - `nb_tfidf = MultinomialNB()` initializes the classifier.  
   - `.fit(X_train_tfidf, y_train)` trains it on TF-IDF features from training reviews.  
   - `.predict(X_test_tfidf)` predicts sentiment for test reviews.  
   - `accuracy_score(y_test, nb_tfidf_pred)` computes the test accuracy.  

3) **Why this matters:**  
   - Comparing Naive Bayes on **BoW vs. TF-IDF** shows the effect of weighting terms by informativeness.  
   - Typically, TF-IDF improves performance slightly over plain counts.  

4) **Result:**  
   - `nb_tfidf_accuracy`: accuracy score for Naive Bayes trained on TF-IDF features (often slightly higher than BoW).  
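
`MultinomialNB` accepts SciPy sparse matrices directly, so the dense conversion done earlier with `.toarray()` is optional for this model. The sketch below checks that on a hypothetical toy corpus and works out the dense storage cost, assuming 25,000 training reviews (consistent with the 625 batches of 32 on an 80% split shown in the training logs) and 8-byte floats.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical mini-corpus; 1 = positive, 0 = negative.
toy_texts = [
    "great movie loved it",
    "excellent film great acting",
    "terrible movie waste of time",
    "boring film awful plot",
]
toy_labels = [1, 1, 0, 0]

X_sparse = TfidfVectorizer().fit_transform(toy_texts)   # SciPy sparse matrix
X_dense = X_sparse.toarray()                            # same values, dense

pred_sparse = MultinomialNB().fit(X_sparse, toy_labels).predict(X_sparse)
pred_dense = MultinomialNB().fit(X_dense, toy_labels).predict(X_dense)
same_predictions = bool(np.array_equal(pred_sparse, pred_dense))

# Dense float64 storage for 25,000 reviews x 5,000 features (assumed sizes).
dense_gb = 25_000 * 5_000 * 8 / 1e9

print("identical predictions:", same_predictions)
print(f"dense matrix size: {dense_gb:.1f} GB")
```

Keeping the matrix sparse avoids holding about a gigabyte of mostly zero values in memory for each feature set.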

In [44]:
# Classical ML - 3 (RF with BoW)
rf_bow = RandomForestClassifier(n_estimators=100, random_state=42)
rf_bow.fit(X_train_bow, y_train)
rf_bow_pred = rf_bow.predict(X_test_bow)
rf_bow_accuracy = accuracy_score(y_test, rf_bow_pred)

### Classical ML — Random Forest with Bag of Words

Explanation:
1) **Model choice:**  
   - `RandomForestClassifier` is an ensemble method that builds multiple decision trees and averages their predictions.  
   - It can capture **non-linear relationships** in text features, unlike Naive Bayes which assumes independence.  

2) **Code breakdown:**  
   - `RandomForestClassifier(n_estimators=100, random_state=42)`  
     - `n_estimators=100`: builds 100 decision trees for stable predictions.  
     - `random_state=42`: ensures reproducibility of results.  
   - `.fit(X_train_bow, y_train)` trains the model on BoW features.  
   - `.predict(X_test_bow)` predicts labels for the test data.  
   - `accuracy_score(y_test, rf_bow_pred)` computes accuracy on test predictions.  

3) **Why this matters:**  
   - Demonstrates how **tree-based ensembles** handle high-dimensional text features.  
   - Often achieves higher accuracy than Naive Bayes, though at greater computational cost.  

4) **Result:**  
   - `rf_bow_accuracy`: accuracy of Random Forest trained with BoW features (often one of the strongest classical baselines).  

In [45]:
# Classical ML - 4 (RF with TF-IDF)
rf_tfidf = RandomForestClassifier(n_estimators=100, random_state=42)
rf_tfidf.fit(X_train_tfidf, y_train)
rf_tfidf_pred = rf_tfidf.predict(X_test_tfidf)
rf_tfidf_accuracy = accuracy_score(y_test, rf_tfidf_pred)

### Classical ML — Random Forest with TF-IDF

Explanation:
1) **Model choice:**  
   - Random Forest can also work with **TF-IDF features**, which highlight rare but informative words.  
   - This setup combines the strength of tree ensembles with weighted text features.  

2) **Code breakdown:**  
   - `RandomForestClassifier(n_estimators=100, random_state=42)` initializes the model with 100 trees and a fixed seed for reproducibility.  
   - `.fit(X_train_tfidf, y_train)` trains on the TF-IDF-transformed training data.  
   - `.predict(X_test_tfidf)` generates predictions for the test set.  
   - `accuracy_score(y_test, rf_tfidf_pred)` evaluates accuracy against ground truth.  

3) **Why this matters:**  
   - Allows direct comparison of Random Forest performance on **BoW vs. TF-IDF**.  
   - TF-IDF usually reduces noise from common words, potentially improving generalization.  

4) **Result:**  
   - `rf_tfidf_accuracy`: accuracy of Random Forest using TF-IDF features, often close to or slightly lower than the BoW variant depending on the dataset.  

In [None]:
# Deep Learning

max_words = 5000
max_len = 100

"""
Tokenizer: convert text to integer sequences
num_words: limits the vocab to top n words (where n is given by user, in this case n = 5000)
fit_on_texts(): builds vocab from our training data

texts_to_sequences(): convert text to list of integer indices
pad_sequences(): pad sequences to have same length, for us, we have taken the length to 100 so it will pad to 100 tokens

"""

tokenizer = Tokenizer(num_words = max_words)
tokenizer.fit_on_texts(train_texts)

X_train_seq = tokenizer.texts_to_sequences(train_texts)
X_test_seq = tokenizer.texts_to_sequences(test_texts)

X_train_pad = pad_sequences(X_train_seq, maxlen = max_len)
X_test_pad = pad_sequences(X_test_seq, maxlen = max_len)

### Deep Learning — Tokenization and Sequence Padding

Explanation:
1) **Purpose:**  
   - Neural networks require **numerical input**. Text must be converted into sequences of integers before training.  
   - Each unique word is assigned an index, and reviews are represented as lists of these indices.  

2) **Code breakdown:**  
   - `max_words = 5000`: caps the vocabulary by frequency rank. Keras keeps word indices below `num_words`, so the 4,999 most frequent words survive and rarer words are dropped from the sequences.  
   - `max_len = 100`: fixes input review length to 100 tokens (shorter reviews are padded, longer ones are truncated).  
   - `Tokenizer(num_words=max_words)`: initializes a Keras tokenizer with that vocabulary cap.  
   - `.fit_on_texts(train_texts)`: builds the vocabulary from training reviews.  
   - `.texts_to_sequences(...)`: converts each review into a list of integer indices.  
   - `pad_sequences(..., maxlen=max_len)`: pads or truncates sequences so every review has exactly 100 tokens. With the default settings, zeros are added at the front and long reviews keep their last 100 tokens.  

3) **Why this matters:**  
   - Creates uniform input shape required by neural models.  
   - Prevents issues with variable-length text.  
   - Prepares the data for embedding layers, which will map word indices to dense vectors.  

4) **Result:**  
   - `X_train_pad`: 2D NumPy array of shape `(train_size, 100)`.  
   - `X_test_pad`: 2D NumPy array of shape `(test_size, 100)`.  
   - Both ready for deep learning models like feedforward networks and LSTM.  
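
The two steps can be reproduced in plain Python to see what the arrays contain. This is an illustrative reimplementation, not the Keras code: `fit_word_index` and `to_padded` are hypothetical helpers that copy the frequency-ranked indexing and the default front padding and front truncation, but skip the lowercasing and punctuation filtering that the Keras `Tokenizer` also performs.

```python
from collections import Counter
import numpy as np

def fit_word_index(texts):
    """Rank words by frequency; index 1 is the most frequent, 0 is reserved for padding."""
    counts = Counter(word for text in texts for word in text.split())
    return {word: rank for rank, (word, _) in enumerate(counts.most_common(), start=1)}

def to_padded(texts, word_index, num_words, maxlen):
    """Keep words with index below num_words, then truncate and zero-pad at the front."""
    rows = []
    for text in texts:
        seq = [word_index[w] for w in text.split()
               if word_index.get(w, num_words) < num_words]
        seq = seq[-maxlen:]                            # keep the last maxlen tokens
        rows.append([0] * (maxlen - len(seq)) + seq)   # zeros go in front
    return np.array(rows)

toy_texts = ["good movie", "bad movie really bad movie"]
toy_index = fit_word_index(toy_texts)
toy_padded = to_padded(toy_texts, toy_index, num_words=10, maxlen=4)

print(toy_index)
print(toy_padded)
```

The short review is left-padded with zeros and the long one loses its first token, which mirrors what happens to IMDB reviews longer than 100 tokens.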

In [49]:
# Model 5 : Deep Learning Model - Feedforward Neural Network

ffnn_model = Sequential([
    Embedding(max_words, 100, input_length=max_len),
    tf.keras.layers.Flatten(),
    Dense(128, activation='relu'),
    Dropout(0.5),
    Dense(64, activation='relu'),
    Dropout(0.5),
    Dense(1, activation='sigmoid')
])

ffnn_model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
ffnn_model.fit(X_train_pad, y_train, validation_split = 0.2, epochs=3, batch_size=32, verbose=1)
ffnn_loss, ffnn_accuracy = ffnn_model.evaluate(X_test_pad, y_test, verbose=0)

Epoch 1/3
625/625 ━━━━━━━━━━━━━━━━━━━━ 18s 26ms/step - accuracy: 0.6667 - loss: 0.5580 - val_accuracy: 0.8680 - val_loss: 0.3173
Epoch 2/3
625/625 ━━━━━━━━━━━━━━━━━━━━ 21s 26ms/step - accuracy: 0.9430 - loss: 0.1633 - val_accuracy: 0.8490 - val_loss: 0.4125
Epoch 3/3
625/625 ━━━━━━━━━━━━━━━━━━━━ 21s 26ms/step - accuracy: 0.9886 - loss: 0.0377 - val_accuracy: 0.8408 - val_loss: 0.6894


### Deep Learning — Feedforward Neural Network (FFNN)

Explanation:
1) **Model architecture:**  
   - `Embedding(max_words, 100, input_length=max_len)`: maps each word index to a dense 100-dimensional vector.  
   - `Flatten()`: flattens the sequence of embeddings into a single long vector.  
   - `Dense(128, activation='relu')`: fully connected layer with 128 neurons and ReLU activation.  
   - `Dropout(0.5)`: randomly drops 50% of neurons during training to reduce overfitting.  
   - `Dense(64, activation='relu')`: another dense hidden layer for deeper representation.  
   - `Dropout(0.5)`: additional regularization.  
   - `Dense(1, activation='sigmoid')`: output layer for binary classification (positive/negative).  

2) **Compilation:**  
   - `loss='binary_crossentropy'`: standard loss for binary classification.  
   - `optimizer='adam'`: adaptive learning optimizer.  
   - `metrics=['accuracy']`: monitors accuracy during training and evaluation.  

3) **Training:**  
   - `validation_split=0.2`: uses 20% of training data as validation.  
   - `epochs=3`: trains for 3 passes over the dataset.  
   - `batch_size=32`: processes data in mini-batches of 32.  
   - `verbose=1`: prints progress.  

4) **Evaluation:**  
   - `.evaluate(X_test_pad, y_test)` calculates loss and accuracy on unseen test data.  

5) **Why this matters:**  
   - Provides a simple deep learning baseline.  
   - Unlike Naive Bayes or Random Forest, this model learns distributed word representations.  

6) **Result:**  
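   - `ffnn_accuracy`: test accuracy of the Feedforward Neural Network, 83.91% in this run. The log shows overfitting after the first epoch: training accuracy climbs to 98.86% while validation accuracy falls from 86.80% to 84.08%.  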
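
Counting the trainable parameters helps explain why this network fits the training set so quickly. The arithmetic below is derived from the layer sizes in the cell above; it is a hand calculation, not output from `ffnn_model.summary()`.

```python
max_words, embed_dim, max_len = 5000, 100, 100

embedding_params = max_words * embed_dim        # one 100-d vector per vocabulary index
flatten_units = max_len * embed_dim             # Flatten has no weights, only reshapes
dense1_params = flatten_units * 128 + 128       # weights + biases
dense2_params = 128 * 64 + 64
output_params = 64 * 1 + 1

total_params = embedding_params + dense1_params + dense2_params + output_params
print(f"first dense layer: {dense1_params:,}")
print(f"total trainable parameters: {total_params:,}")
```

The first dense layer alone holds about 1.28 million weights, while each epoch sees 20,000 training reviews (625 batches of 32), so memorizing the training set is easy for this model.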

In [55]:
# LSTM Model
"""
LSTM: captures the sequential dependencies in text
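Here we build an LSTM with 64 units and return_sequences=False

return_sequences: True  > returns the full sequence of hidden states (needed when stacking LSTM layers)
                  False > returns only the last hidden state (single LSTM before Dense layers)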

"""

lstm_model = Sequential([
    Embedding(max_words, 100, input_length=max_len),
    LSTM(64, return_sequences=False),
    Dropout(0.5),
    Dense(32, activation='relu'),
    Dropout(0.5),
    Dense(1, activation='sigmoid')
])

lstm_model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
lstm_model.fit(X_train_pad, y_train, validation_split = 0.2, epochs=5, batch_size=32, verbose=1)
lstm_loss, lstm_accuracy = lstm_model.evaluate(X_test_pad, y_test, verbose=0)


Epoch 1/5
625/625 ━━━━━━━━━━━━━━━━━━━━ 43s 63ms/step - accuracy: 0.7259 - loss: 0.5124 - val_accuracy: 0.8748 - val_loss: 0.3095
Epoch 2/5
625/625 ━━━━━━━━━━━━━━━━━━━━ 39s 62ms/step - accuracy: 0.9076 - loss: 0.2562 - val_accuracy: 0.8714 - val_loss: 0.3171
Epoch 3/5
625/625 ━━━━━━━━━━━━━━━━━━━━ 46s 70ms/step - accuracy: 0.9332 - loss: 0.1849 - val_accuracy: 0.8604 - val_loss: 0.3708
Epoch 4/5
625/625 ━━━━━━━━━━━━━━━━━━━━ 97s 94ms/step - accuracy: 0.9508 - loss: 0.1414 - val_accuracy: 0.8508 - val_loss: 0.4081
Epoch 5/5
625/625 ━━━━━━━━━━━━━━━━━━━━ 71s 76ms/step - accuracy: 0.9635 - loss: 0.1115 - val_accuracy: 0.8524 - val_loss: 0.5539


### Deep Learning — LSTM Model

Explanation:
1) **Model architecture:**  
   - `Embedding(max_words, 100, input_length=max_len)`: converts word indices into 100-dimensional dense vectors.  
   - `LSTM(64, return_sequences=False)`:  
     - LSTM layer with 64 memory units.  
     - Captures **sequential dependencies** in reviews (word order matters).  
     - `return_sequences=False`: outputs only the final hidden state → suitable for classification.  
   - `Dropout(0.5)`: regularization to prevent overfitting.  
   - `Dense(32, activation='relu')`: dense hidden layer for learned features.  
   - `Dropout(0.5)`: extra regularization.  
   - `Dense(1, activation='sigmoid')`: final layer for binary sentiment prediction.  

2) **Compilation:**  
   - `loss='binary_crossentropy'`: standard loss for binary classification.  
   - `optimizer='adam'`: efficient adaptive optimizer.  
   - `metrics=['accuracy']`: tracks accuracy throughout training and testing.  

3) **Training:**  
   - `validation_split=0.2`: uses 20% of training data for validation.  
   - `epochs=5`: trains for 5 full passes over the dataset.  
   - `batch_size=32`: processes reviews in batches of 32.  
   - `verbose=1`: prints training progress.  

4) **Evaluation:**  
   - `.evaluate(X_test_pad, y_test)` computes test loss and accuracy on unseen reviews.  

5) **Why this matters:**  
   - Unlike BoW or TF-IDF, LSTMs consider **word order and context**, making them better at capturing sentiment nuances.  
   - This is the strongest deep learning baseline in this notebook.  

6) **Result:**  
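   - `lstm_accuracy`: test accuracy of the LSTM model, 84.58% in this run. Validation loss is lowest after the first epoch (0.3095) and rises to 0.5539 by epoch 5, so the later epochs mostly overfit.  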
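
Early stopping is the usual remedy when validation loss rises while training loss keeps falling. The sketch below replays that logic on the validation losses from the LSTM training log above. `best_epoch_with_patience` is a hypothetical helper written for illustration, not a Keras function.

```python
# Validation losses copied from the LSTM training log (epochs 1 to 5).
val_losses = [0.3095, 0.3171, 0.3708, 0.4081, 0.5539]

def best_epoch_with_patience(losses, patience=2):
    """Return (best_epoch, stop_epoch), both 1-indexed, under early stopping."""
    best_epoch, best_loss, waited = 0, float("inf"), 0
    for epoch, loss in enumerate(losses, start=1):
        if loss < best_loss:
            best_epoch, best_loss, waited = epoch, loss, 0
        else:
            waited += 1
            if waited >= patience:      # no improvement for `patience` epochs
                return best_epoch, epoch
    return best_epoch, len(losses)

best_epoch, stop_epoch = best_epoch_with_patience(val_losses, patience=2)
print(f"best epoch: {best_epoch}, training would stop after epoch {stop_epoch}")
```

In Keras the equivalent is to pass `callbacks=[tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=2, restore_best_weights=True)]` to `fit`.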

In [58]:
print("Model Accuracy Results: ")
print(f"NB with BoW: {nb_bow_accuracy}")
print(f"NB with TF-IDF: {nb_tfidf_accuracy}")
print(f"RF with BoW: {rf_bow_accuracy}")
print(f"RF with TF-IDF: {rf_tfidf_accuracy}")
print(f"FFNN: {ffnn_accuracy}")
print(f"LSTM: {lstm_accuracy}")

Model Accuracy Results: 
NB with BoW: 0.82676
NB with TF-IDF: 0.84356
RF with BoW: 0.85472
RF with TF-IDF: 0.84508
FFNN: 0.8390799760818481
LSTM: 0.8457599878311157


### Model Accuracy Comparison — Results

Explanation:
1) **Purpose:**  
   - Shows the actual test accuracy of all models built in this notebook.  
   - Useful for analyzing which combinations of **feature extraction** and **models** perform best.  

2) **Results:**  
   ```
   NB with BoW:     82.68%
   NB with TF-IDF:  84.36%
   RF with BoW:     85.47%
   RF with TF-IDF:  84.51%
   FFNN:            83.91%
   LSTM:            84.58%
   ```

3) **Insights:**  
   - **Random Forest (BoW)** achieved the **highest accuracy (85.47%)**, showing strong performance with sparse count-based features.  
   - **Naive Bayes (TF-IDF)** improved slightly over BoW, but still lags behind Random Forest.  
   - **Deep Learning models** (FFNN and LSTM) are competitive:  
     - FFNN ≈ 83.9% — solid baseline but loses context due to flattening embeddings.  
     - LSTM ≈ 84.6% — captures sequential dependencies, nearly matching Random Forest.  
   - **Conclusion:** classical models with simple features (Random Forest + BoW) remain highly competitive, while LSTM demonstrates potential for scaling with more data or pretrained embeddings.  

4) **Why this matters:**  
   - Shows trade-offs between **simplicity (Naive Bayes, Random Forest)** and **expressiveness (FFNN, LSTM)**.  
   - Recruiters or reviewers can see clear evidence of model comparison and pipeline completeness.  
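
Since several of these accuracies lie within a fraction of a percentage point of each other, it helps to rank them together with a rough uncertainty estimate. The sketch below uses the binomial standard error and assumes the standard IMDB test split of 25,000 reviews; `n_test` is that assumption, not a value printed by the notebook.

```python
import math

# Test accuracies printed by the notebook.
accuracies = {
    "NB with BoW": 0.82676,
    "NB with TF-IDF": 0.84356,
    "RF with BoW": 0.85472,
    "RF with TF-IDF": 0.84508,
    "FFNN": 0.8390799760818481,
    "LSTM": 0.8457599878311157,
}
n_test = 25_000  # assumption: standard IMDB test split size

def standard_error(acc, n):
    """Binomial standard error of an accuracy estimate."""
    return math.sqrt(acc * (1 - acc) / n)

ranked = sorted(accuracies.items(), key=lambda kv: kv[1], reverse=True)
for name, acc in ranked:
    margin = 1.96 * standard_error(acc, n_test)   # approximate 95% half-width
    print(f"{name:<15} {acc:.4f} +/- {margin:.4f}")
```

Under these assumptions each margin is a little over 0.4 percentage points. The lead of Random Forest with BoW over the LSTM (about 0.9 points) is larger than that, while the gaps among LSTM, Random Forest with TF-IDF, and Naive Bayes with TF-IDF are too small to rank with confidence from a single run.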

In [59]:
def preprocess_review(review, tokenizer):

  review = contractions.fix(review)
  review = review.lower()
  review = re.sub(r'http\S+ | www\S+', '', review)
  review = review.translate(str.maketrans('', '', string.punctuation))
  tokens = word_tokenize(review)
  stop_words = set(stopwords.words('english'))
  tokens = [word for word in tokens if word not in stop_words]
  review = ' '.join([lemmatizer.lemmatize(word) for word in tokens])

  sequence = tokenizer.texts_to_sequences([review])
  padded = pad_sequences(sequence, maxlen=max_len)
  return padded

### Function — Preprocess a Single Review for Prediction

Explanation:
1) **Purpose:**  
   - Applies the **same preprocessing pipeline** used during training to any new/unseen review.  
   - Ensures consistency between training and inference.  

2) **Step-by-step breakdown:**  
   - `contractions.fix(review)`: expands contractions (*don’t → do not*).  
   - `.lower()`: converts text to lowercase for uniformity.  
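   - `re.sub(r'http\S+ | www\S+', '', review)`: intended to remove URLs. The spaces around `|` are literal, so the pattern only matches an `http` token followed by a space or a `www` token preceded by one; `r'http\S+|www\S+'` would match both unconditionally. If the pattern is changed, change it in the training preprocessing as well so both stay identical.  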
   - `.translate(str.maketrans('', '', string.punctuation))`: strips punctuation.  
   - `word_tokenize(review)`: tokenizes into words.  
   - `stopwords.words('english')`: removes stopwords.  
   - `[lemmatizer.lemmatize(word) for word in tokens]`: lemmatizes each word into dictionary form.  
   - `' '.join(...)`: reconstructs the cleaned tokens into a processed string.  
   - `tokenizer.texts_to_sequences([review])`: converts the cleaned review into integer indices using the previously fitted tokenizer.  
   - `pad_sequences(..., maxlen=max_len)`: pads/truncates the sequence to fixed length (100 tokens here).  

3) **Why this matters:**  
   - Ensures new reviews undergo **identical transformations** as training data.  
   - Prevents mismatch between training and inference vocabularies.  

4) **Result:**  
   - Returns a padded sequence (2D NumPy array) ready to be passed into trained models like FFNN or LSTM for sentiment prediction.  
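
One detail worth checking in the URL step: whitespace inside a regular expression is literal unless `re.VERBOSE` is set, so spaces around `|` become part of each alternative. The sketch below compares a spaced and a compact pattern on two made-up strings; the example URLs are placeholders.

```python
import re

spaced = r'http\S+ | www\S+'   # spaces are part of each alternative
compact = r'http\S+|www\S+'    # matches either prefix regardless of surrounding spaces

text_a = "full review at http://example.com"   # URL at the end, no trailing space
text_b = "www.example.org has the trailer"     # URL at the start, no leading space

cleaned_spaced = [re.sub(spaced, '', t) for t in (text_a, text_b)]
cleaned_compact = [re.sub(compact, '', t) for t in (text_a, text_b)]

for before, s, c in zip((text_a, text_b), cleaned_spaced, cleaned_compact):
    print(f"{before!r}\n  spaced : {s!r}\n  compact: {c!r}")
```

The spaced pattern leaves both strings untouched, while the compact one strips the URL in each case.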

In [60]:
def predict_sentiment(review, tokenizer, model):
  preprocessed_review = preprocess_review(review, tokenizer)
  prediction = model.predict(preprocessed_review)
  sentiment = "Positive" if prediction > 0.5 else "Negative"
  return sentiment

### Function — Predict Sentiment of a Review

Explanation:
1) **Purpose:**  
   - Takes a raw review, applies preprocessing, and uses a trained model to classify sentiment as *Positive* or *Negative*.  

2) **Code breakdown:**  
   - `preprocess_review(review, tokenizer)`: cleans and tokenizes the review using the same preprocessing steps from training.  
   - `model.predict(preprocessed_review)`: returns an array of shape `(1, 1)` holding a probability between 0 and 1.  
     - Closer to 1 → strong likelihood of **positive sentiment**.  
     - Closer to 0 → strong likelihood of **negative sentiment**.  
   - `"Positive" if prediction > 0.5 else "Negative"`: applies a **threshold of 0.5**.  
     - Probability > 0.5 → Positive.  
     - Probability ≤ 0.5 → Negative.  

3) **Why this matters:**  
   - Provides a simple **deployment-ready inference function**.  
   - Allows end users to input raw text and get an immediate sentiment classification.  
   - Demonstrates reproducibility: preprocessing + model inference are encapsulated.  

4) **Result:**  
   - Returns a human-readable string: `"Positive"` or `"Negative"`.  
   - Can be easily extended with probability output, threshold tuning, or multi-class classification in future.  
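
The probability output and threshold tuning mentioned above could look like the sketch below. It is one possible variant, not the notebook's function: `label_with_probability` and `StubModel` are hypothetical, and the stub only imitates the `(batch_size, 1)` output shape of a Keras binary classifier so the example runs without a trained model.

```python
import numpy as np

class StubModel:
    """Hypothetical stand-in for a trained Keras model: returns a fixed probability."""
    def __init__(self, probability):
        self.probability = probability

    def predict(self, batch):
        # Keras binary classifiers return an array of shape (batch_size, 1).
        return np.full((len(batch), 1), self.probability)

def label_with_probability(padded_review, model, threshold=0.5):
    """Return (label, probability) for one preprocessed review."""
    probability = float(model.predict(padded_review)[0][0])
    label = "Positive" if probability > threshold else "Negative"
    return label, probability

dummy_input = np.zeros((1, 100), dtype=int)   # same shape as one padded review
result_pos = label_with_probability(dummy_input, StubModel(0.91))
result_neg = label_with_probability(dummy_input, StubModel(0.12))
print(result_pos, result_neg)
```

Extracting the scalar with `[0][0]` makes the comparison explicit, and exposing `threshold` lets a caller trade precision against recall.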

In [67]:
lemmatizer = WordNetLemmatizer()
new_review1 = "This movie was absolutely awesome, I enjoyed the climax the most and it lived upto my expectations!"
new_review = "pathetic movie. a waste of time and your patience. better to play games at home rather than watching this movie."
sentiment = predict_sentiment(new_review, tokenizer, lstm_model)

1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 122ms/step


### Sentiment Prediction on New Reviews

Explanation:
1) **Setup:**  
   - `lemmatizer = WordNetLemmatizer()`: `preprocess_review` reads `lemmatizer` as a global variable, so it must exist before `predict_sentiment` is called.  
   - `new_review1`: an example positive review (*“absolutely awesome”*, *“enjoyed the climax”*).  
   - `new_review`: an example negative review (*“pathetic movie”*, *“waste of time”*).  

2) **Prediction process:**  
   - `predict_sentiment(new_review, tokenizer, lstm_model)` applies the preprocessing pipeline: contraction expansion → lowercasing → URL/punctuation removal → tokenization → stopword removal → lemmatization → sequence conversion → padding.  
   - The processed review is passed into the trained **LSTM model**.  
   - The model outputs a probability, which is thresholded at 0.5.  

3) **Expected outcome:**  
   - For `new_review1`: the expected prediction is **Positive**, given its strong positive cues (this cell does not run it).  
   - For `new_review`: prediction should be **Negative**, due to negative words like *“pathetic”*, *“waste”*.  

4) **Why this matters:**  
   - Demonstrates the notebook’s ability to move beyond static evaluation and classify **unseen, real-world input**.  
   - Shows that the pipeline is complete and ready for **deployment use cases**.  

In [68]:
print(sentiment)

Negative


### Sentiment Prediction Result

Explanation:
1) **Execution:**  
   - `print(sentiment)` outputs the classification result from the `predict_sentiment` function.  

2) **Observed result:**  
   ```
   Negative
   ```  

3) **Why this happened:**  
   - The review *“pathetic movie. a waste of time and your patience...”* contains strong **negative cues**.  
   - After preprocessing (contractions expanded, lowercasing, punctuation removed, tokenization, stopword removal, lemmatization), the LSTM model detected negativity.  
   - The model’s predicted probability was ≤ 0.5, mapping to the **Negative** class.  

4) **Takeaway:**  
   - Shows that the pipeline runs end to end for **inference on unseen reviews**.  
   - One correct prediction is a sanity check rather than evidence of generalization; the 84.58% test accuracy is the better measure of how the LSTM performs on unseen data.  

## 📘 Notebook Summary: End-to-End NLP Sentiment Analysis Project

---

### 🎯 Project Objective
This notebook built an **end-to-end NLP pipeline** for binary sentiment classification on the **IMDB reviews dataset**.  
It explored **classical machine learning models** (Naive Bayes, Random Forest) and **deep learning models** (Feedforward Neural Network, LSTM) using different feature extraction methods (BoW, TF-IDF, Word2Vec, padded sequences).  
The pipeline included **preprocessing, model training, evaluation, and inference** on unseen reviews.

---

### 🛠️ Workflow Recap
1. **Dataset Loading:** Imported IMDB reviews from TensorFlow Datasets.  
2. **Preprocessing:** Expanded contractions, lowercased text, removed URLs/punctuation, tokenized, removed stopwords, applied POS tagging, lemmatization, and stemming.  
3. **Feature Extraction:**  
   - Bag of Words (BoW)  
   - TF-IDF (Term Frequency–Inverse Document Frequency)  
   - Word2Vec embeddings (averaged per document)  
   - Tokenizer + padded sequences (for neural models)  
4. **Model Training:**  
   - Naive Bayes (BoW, TF-IDF)  
   - Random Forest (BoW, TF-IDF)  
   - Feedforward Neural Network (FFNN)  
   - LSTM  
5. **Evaluation:** Compared accuracy across all models.  
6. **Deployment-Ready Functions:** Preprocessing and `predict_sentiment()` for new reviews.  

---

### 📊 Model Performance Results
```
NB with BoW:     82.68%
NB with TF-IDF:  84.36%
RF with BoW:     85.47%   ← Best performer
RF with TF-IDF:  84.51%
FFNN:            83.91%
LSTM:            84.58%
```

---

### 🔎 Key Insights
- **Random Forest (BoW)** delivered the best performance (~85.5%), highlighting the strength of classical ML with sparse features.  
- **Naive Bayes** provided a fast, interpretable baseline with decent accuracy.  
- **FFNN** was competitive but lacked sequential awareness, limiting performance.  
- **LSTM** captured sequential dependencies and nearly matched Random Forest’s accuracy, with potential to surpass it given more data or pretrained embeddings.  

---

### 🚀 Achievements
- Built a **complete NLP pipeline**: from raw text → preprocessing → feature extraction → modeling → evaluation.  
- Compared **classical vs. deep learning approaches** on the same dataset.  
- Created **deployment-ready functions** for inference on unseen reviews.  
- Produced a reproducible, structured notebook suitable for GitHub portfolio presentation.  

---

### ✅ Final Takeaway
- **Random Forest with BoW** provided the top accuracy here, suggesting classical models remain strong for smaller NLP tasks.  
- **LSTM** showed competitive performance and scalability for larger datasets and richer embeddings.  
- This notebook demonstrates versatility across **traditional ML** and **modern deep learning** pipelines, making it recruiter-ready.  

---

### 📌 Next Steps
- Add **evaluation metrics beyond accuracy** (precision, recall, F1, ROC-AUC).  
- Perform **hyperparameter tuning** for Random Forest and LSTM.  
- Experiment with **pretrained embeddings** (e.g., GloVe, Word2Vec Google News).  
- Add a **Transformer baseline** (e.g., DistilBERT) for state-of-the-art comparison.  
- Package the pipeline for **deployment** (FastAPI/Flask inference endpoint).  
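
As a starting point for the first item, the additional metrics are all available in `sklearn.metrics`. The sketch below runs them on eight hypothetical labels and probabilities; with the notebook's models, `y_prob` would come from `predict_proba(X)[:, 1]` for the scikit-learn classifiers or `model.predict(X).ravel()` for the Keras ones.

```python
import numpy as np
from sklearn.metrics import (
    confusion_matrix,
    precision_recall_fscore_support,
    roc_auc_score,
)

# Hypothetical labels and predicted probabilities for eight reviews.
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_prob = np.array([0.9, 0.2, 0.6, 0.4, 0.1, 0.7, 0.8, 0.3])
y_pred = (y_prob > 0.5).astype(int)

precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="binary"
)
cm = confusion_matrix(y_true, y_pred)   # rows: true class, columns: predicted class
auc = roc_auc_score(y_true, y_prob)     # uses probabilities, not thresholded labels

print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f} auc={auc:.3f}")
print(cm)
```

ROC-AUC is computed from the raw probabilities, so it does not depend on the 0.5 threshold the way precision, recall, and F1 do.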

---  
✨ This concludes the **End-to-End NLP Sentiment Analysis Project** notebook.  
