# Applying Machine Learning to Sentiment Analysis
## Project: IMDB Movie Review Data
In the mordern internet and social media age, people's opinion, reviews and recommendations have become a valuable resource for businesses.
Thanks to modern technology, we are now able to collect and analyse such data more efficiently.
In this project, I will delve into the subfield of **Natural Language Processing** called **Sentiment Analysis**, and learn how to use machine learning algorithms to classify documents based on their popularity.
 **Sentiment Analysis** is also called as **Opinion Mining**.

# Applying Machine Learning to Sentiment Analysis
## Obtaining the Movie Review Dataset

In this section, I prepare the **IMDB movie review dataset** that will be used
throughout the sentiment analysis project.

The goal is to make sure that:

1. The notebook is running inside the correct project folder.
2. A `data/aclImdb` directory exists and contains the extracted IMDB dataset.
3. If the dataset is missing, the notebook prints clear instructions on what the
   expected folder structure should look like.

### Step A – Confirm the notebook location

Before touching any data, I verify that the notebook is running inside the
expected project folder.

This small check helps avoid subtle bugs later, for example:
- running the notebook from a different directory,
- saving files to the wrong place,
- or accidentally creating duplicate `data/` folders.

The next cell prints the **current working directory** so I can visually confirm
that it matches:

`/Users/shivesh/Desktop/PythonProject/Sentiment Analysis`

In [1]:
# Double Check where the notebook is running
import os
from cProfile import label

os.getcwd()

'/Users/shivesh/Desktop/PythonProject/Sentiment Analysis'

### Step B – Make sure the IMDB dataset is available in `data/aclImdb`

The IMDB movie review dataset is distributed as the archive
`aclImdb_v1.tar` (originally `aclImdb_v1.tar.gz`).

On this machine I downloaded and extracted it **manually**:

1. Downloaded the archive from the Stanford URL in Safari.
2. Moved the file into the `Sentiment Analysis` project folder.
3. Extracted it (by double-clicking in Finder), which created a folder:

   `aclImdb/`  containing `train/` and `test/` subfolders.

After that, I want the project to follow this structure:

```text
Sentiment Analysis/
  sample.ipynb
  data/
    aclImdb/
      train/
      test/
      ...

In [2]:
import os

# Folder inside the project where we keep raw data
DATA_DIR = "data"
IMDB_DIR = os.path.join(DATA_DIR, "aclImdb")

# Make sure the data folder exists
os.makedirs(DATA_DIR, exist_ok=True)

# Check that the extracted dataset is in the right place
if os.path.isdir(IMDB_DIR):
    print("IMDB dataset is ready at:", IMDB_DIR)
else:
    print("IMDB dataset NOT found at:", IMDB_DIR)
    print()
    print("Expected structure:")
    print("  Sentiment Analysis/")
    print("    sample.ipynb")
    print("    data/aclImdb/{train,test,...}")
    raise FileNotFoundError(f"Expected folder not found: {IMDB_DIR}")

IMDB dataset is ready at: data/aclImdb


## Preprocessing the movie dataset into a more convinient format
To visualize the progress and estimated time until completion, we will use the **Python Progress Indicator**.
PyPind can be installed by executing the _"pip install pyprind"_ command in the terminal.

In [3]:
import pyprind
import pandas as pd
import os

# Base path of the unzipped movie dataset inside data/
basepath = os.path.join("data", "aclImdb")

labels = {'pos': 1, 'neg': 0}
pbar = pyprind.ProgBar(50000)

rows = []   # temporary list to store [review_text, sentiment] pairs

for s in ('test', 'train'):
    for l in ('pos', 'neg'):
        path = os.path.join(basepath, s, l)
        for file in sorted(os.listdir(path)):
            with open(os.path.join(path, file), 'r', encoding='utf-8') as infile:
                txt = infile.read()
            rows.append([txt, labels[l]])  # store row instead of df.append
            pbar.update()

# Build the DataFrame once from the collected rows
df = pd.DataFrame(rows, columns=['review', 'sentiment'])



Using Nested For Loops, we itreated over the _train_ and the _test_ subdirectories in the main aclImdb directory and read the individual text files from _pos_ and _neg_ subdirectories that we eventually appended to the _df_ pandas _DataFrame_, together with the integer class label (1 = positive, 0 = negative )

Since the class labels in the assembled dataset are sorted, we will now shuffle _DataFrame_ using the **Permutation** function from _np.random_ submodule.
This will be useful to split the dataset into training and testing datasets in the later parts of the projects, when we will stream the data from our local drive directly.

In [4]:
import numpy as np

np.random.seed(0)
df = df.reindex(np.random.permutation(df.index))
df.to_csv('movie_data.csv', index=False, encoding='utf-8')

_df = df.read_csv('movie_data.csv', encoding='utf-8')_

No need to read the CSV again in the same session.
Right now, at the end of the previous cell you already have:
_df.to_csv('movie_data.csv', index=False, encoding='utf-8')_

df is still in memory, so you can just do:
_df.head()_

In [5]:
df.head(5)

Unnamed: 0,review,sentiment
11841,"In 1974, the teenager Martha Moxley (Maggie Gr...",1
19602,OK... so... I really like Kris Kristofferson a...,0
45519,"***SPOILER*** Do not read this, if you think a...",0
25747,hi for all the people who have seen this wonde...,1
42642,"I recently bought the DVD, forgetting just how...",0


## Introducing the Bag-Of-Words Model
The idea behind the **Bag-Of-Words** is quiet simple
1. We create a vocabulearly of unique tokens- E.g. words from the entire set of documnets
2. We construct a feature vector from each document that contains the counts of how often each word occurs in a particular document.

## Transforming Words into Feature Vectors

In this step, we use CountVectorizer to build a Bag-of-Words representation of a small corpus.
The vectorizer first learns a vocabulary of all unique tokens in the four example sentences (count.vocabulary_).
Then, fit_transform converts each sentence into a row of the document–term matrix, where each column corresponds to a word in the vocabulary and each entry stores how often that word appears in the sentence.
The result (bag.toarray()) is a numerical matrix that we can feed into machine-learning models.


In [6]:
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
count = CountVectorizer()
docs = np.array(
    ['The sun is shining',
     'The weather is sweet',
     'The sun is shining, the weather is sweet',
     'and one and one is two'
     ]
)
bag = count.fit_transform(docs)
print(count.vocabulary_)

{'the': 6, 'sun': 4, 'is': 1, 'shining': 3, 'weather': 8, 'sweet': 5, 'and': 0, 'one': 2, 'two': 7}


In [7]:
print(bag.toarray())

[[0 1 0 1 1 0 1 0 0]
 [0 1 0 0 0 1 1 0 1]
 [0 2 0 1 1 1 2 0 1]
 [2 1 2 0 0 0 0 1 0]]


### Raw term frequencies

In the bag-of-words model, *raw term frequency* for a word is simply the **number of times that word appears in a document**, without any extra scaling or weighting.

- For each document, we build a vocabulary of all unique tokens (words).
- For every word in that vocabulary, we count how many times it occurs in the document.
- These counts form the feature vector for the document (one dimension per word).

Example:
If the document is:
> "the sun is shining, the weather is sweet"

and our vocabulary includes `["the", "sun", "is", "shining", "weather", "sweet"]`, then the raw term frequencies are:

- `the` → 2
- `sun` → 1
- `is` → 1
- `shining` → 1
- `weather` → 1
- `sweet` → 1

No normalization (like dividing by document length) and no weighting (like TF–IDF) is applied here — we are just using **plain counts**.

### N-gram Models (Unigrams, Bigrams, Trigrams)

In the previous section, we built a **bag-of-words** model using **raw term frequencies**.
That model was based on **unigrams**, i.e. single words.

An **n-gram** is defined as a sequence of *n* consecutive tokens (usually words) from a text:

- **Unigram (1-gram)**: sequences of length 1
  - Example: `"the weather is sweet"`
    → `["the", "weather", "is", "sweet"]`
- **Bigram (2-gram)**: sequences of length 2
  - Example: `"the weather is sweet"`
    → `["the weather", "weather is", "is sweet"]`
- **Trigram (3-gram)**: sequences of length 3
  - Example: `"the weather is sweet"`
    → `["the weather is", "weather is sweet"]`

A unigram bag-of-words model ignores word order and only uses **individual word counts** as features.
By contrast, **n-gram models** (with n ≥ 2) can capture short-range context and common phrases, such as:

- sentiment phrases like `"not good"`, `"very bad"`, `"really great"`
- named entities like `"New York"`, `"Los Angeles"`

This makes n-gram features particularly useful in **sentiment analysis** and other NLP tasks where local word order carries important meaning.

### Assessing Word Relevancy with TF–IDF

Bag-of-words and n-gram models give us **raw term frequencies** (how many times each word or phrase appears in a document). However, frequent words are not always informative. For example, words like “the”, “is”, or “movie” may appear in almost every review, regardless of sentiment.

To capture **how relevant a word is for a specific document**, we use **TF–IDF (Term Frequency–Inverse Document Frequency)**.

- **Term Frequency (TF)**
  Measures how often a term appears in a single document.
  Higher TF → the term is more important *within that document*.

- **Inverse Document Frequency (IDF)**
  Measures how common or rare a term is across the whole collection of documents.
  - Terms that appear in **many** documents (e.g., “the”) get a **low** IDF.
  - Terms that appear in **fewer** documents (e.g., “excellent”, “terrible”) get a **higher** IDF.

- **TF–IDF score**
  TF–IDF combines these two ideas:

  > **TF–IDF(term, document) = TF(term in this document) × IDF(term over all documents)**

  Intuition:
  - A term gets a **high TF–IDF** score if
    - it appears frequently in this document (**high TF**), and
    - it does **not** appear in most other documents (**high IDF**).
  - Very common words across the corpus get **low TF–IDF** scores, even if they appear often in a document.

In practice, TF–IDF helps the model focus on **discriminative words**, such as “excellent”, “boring”, “waste”, or “masterpiece”, which are more useful for tasks like **sentiment analysis** than very common function words.

Scikit-Learn library implememts yet another transformer, the _TfidfTransformer_ class, which takes in raw term frequencies from the _CountVectorize_ class as input and transforms them into tf-idf

In [8]:
from sklearn.feature_extraction.text import TfidfTransformer

tfidf = TfidfTransformer(use_idf=True, norm='l2', smooth_idf=True)
np.set_printoptions(precision=2)

# counts from CountVectorizer
counts = count.fit_transform(docs)

# transform counts → tf–idf
tfidf_matrix = tfidf.fit_transform(counts)

print(tfidf_matrix.toarray())

[[0.   0.38 0.   0.57 0.57 0.   0.46 0.   0.  ]
 [0.   0.38 0.   0.   0.   0.57 0.46 0.   0.57]
 [0.   0.46 0.   0.35 0.35 0.35 0.56 0.   0.35]
 [0.66 0.17 0.66 0.   0.   0.   0.   0.33 0.  ]]


## Cleaning Text Data

Before we build our Bag-of-Words and TF–IDF models, we first need to clean the raw text.
Real-world reviews contain a lot of “noise” such as HTML tags, punctuation, numbers,
and inconsistent casing. If we feed this directly into the model, the vocabulary
becomes messy and the model wastes capacity on useless tokens.

In this project, our basic text-cleaning pipeline will:

1. **Normalize the text**
   - Convert everything to lowercase
   - Remove HTML tags and line breaks

2. **Remove unwanted characters**
   - Strip punctuation, numbers, and other non-word symbols
   - Collapse multiple spaces into a single space

3. **Tokenize and filter**
   - Split the cleaned text into individual tokens (words)
   - Optionally remove stop words (very common words like *the*, *is*, *and*)

After this preprocessing step, each review is reduced to a cleaner sequence of words,
which we then feed into the Bag-of-Words / TF–IDF pipeline.

We will now remove all punctuations marks except for emoticon characters, such as :), since those are useful for sentiment analysis. To accomplish this task, we will use the Python's **Regular Expression (regex)** library, _re_

In [9]:
import re

def preprocessor(text):
    # remove HTML tags
    text = re.sub(r'<[^>]+>', '', text)

    # extract emoticons like :) ;-) :D etc.
    emoticons = re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)', text)

    # remove non-word chars, lowercase, then append emoticons (without '-')
    text = re.sub(r'[\W]+', ' ', text.lower())
    text = text + ' ' + ' '.join(emoticons).replace('-', '')

    return text

### Note: Regex backslashes and `SyntaxWarning: invalid escape sequence`

**What’s happening?**

- Our regex patterns contain things like `\)` and `\W`.
  - In **regex**, these are valid:
    - `\)` means “literal `)`”
    - `\W` means “non-word character”
- But in **normal Python strings**, a backslash starts an *escape sequence* (like `\n`, `\t`).
  - `\)` and `\W` are **not** valid Python string escapes.
  - Python still runs the code, but it shows warnings like:
    `SyntaxWarning: invalid escape sequence '\)'` and `'\W'`.

**Why this matters**

- The regex logic is correct, but the way it’s written as a Python string is noisy.
- These warnings can hide real problems later, so it’s good practice to fix them.

**Fix**

- Make regex patterns **raw strings** so backslashes are passed directly to the regex engine.
- Use the `r""` prefix:

```python
# bad (will raise SyntaxWarning)
re.findall('(?::|;|=)(?:-)?(?:\)|$begin:math:text$\|D\|P\)\'\, text\)
re\.sub\(\'\[\\W\]\+\'\, \' \'\, text\.lower\(\)\)

\# good \(raw strings\, no warnings\)
re\.findall\(r\'\(\?\:\:\|\;\|\=\)\(\?\:\-\)\?\(\?\:$end:math:text$|\(|D|P)', text)
re.sub(r'[\W]+', ' ', text.lower())

In [10]:
# lets confirm that our preprocessor works correctly
preprocessor(df.loc[0, 'review'] [-50:])

'and i suggest that you go see it before you judge  '

In [11]:
# Another one
preprocessor("</a> This :) is :( a test :-)!")

' this is a test  :) :( :)'

In [12]:
# Now lets apply our preprocessor function to all the movie reviews in our DataFrame
df['review'] = df['review'].apply(preprocessor)

## Cleaning Text Data

Before building our Bag-of-Words and TF–IDF models, we first normalize the raw text.

Our `preprocessor` function does three main things:

1. **Remove HTML tags and punctuation**
   We strip out HTML markup (e.g. `<br />`) and most punctuation characters so that they do not become separate “words” in our vocabulary.

2. **Preserve emoticons**
   Simple emoticons such as `:)`, `:(`, `:-)` are extracted using a regular expression and re-attached to the cleaned text.
   These patterns often carry strong sentiment and are therefore useful features.

3. **Lowercase everything**
   We convert the text to lowercase so that words like `Good`, `GOOD`, and `good` are treated as the same token.

After preprocessing, each review is a cleaner, more uniform string that is easier to tokenize and vectorize. This reduces noise in the feature space and helps the classifier focus on actual sentiment patterns instead of formatting differences.

In [13]:
def tokenizer(text):
    return text.split()
tokenizer('runners like running and thus they run')

['runners', 'like', 'running', 'and', 'thus', 'they', 'run']

## Using NLTK (Natural Language Toolkit)

In this section we introduce **NLTK – Natural Language Toolkit**, a popular Python library for basic NLP tasks.

At a high level, NLTK gives us:

- **Tokenization**
  Functions to split raw text into sentences or words.
  Example:
  `"This is a test."` → `["This", "is", "a", "test", "."]`

- **Stop word lists**
  Built-in lists of very common words such as *the, and, is* that usually do not carry much meaning for tasks like sentiment analysis.
  We can remove these tokens to reduce noise.

- **Stemming and lemmatization**
  Tools to reduce different word forms to a common base:
  - *Stemming* (e.g. Porter stemmer) cuts words down to a root:
    `"running", "runs", "ran"` → `"run"`
  - *Lemmatization* uses vocabulary and grammar rules to map words to a canonical form:
    `"mice"` → `"mouse"`, `"better"` → `"good"`

In our IMDB sentiment project we mainly use NLTK to:

1. Build a **smarter tokenizer** than just `text.split()`.
2. Optionally **remove stop words** that do not help the classifier.
3. Optionally **stem** words so that different forms (e.g. *run, running, runs*) are treated as the same feature.

This improves our Bag-of-Words and TF–IDF representations, because the model focuses on the core meaning of the text instead of superficial differences in capitalization, punctuation, or verb forms.

### Porter Stemmer Algorithm

The **Porter stemmer** is a classic rule-based algorithm for **stemming** English words.
Stemming means reducing different word forms to a simpler, common **stem** by stripping off
frequent suffixes.

The Porter stemmer works in several steps, each applying rules such as:

- `caresses → caress` (remove *es*)
- `ponies → poni` (replace *ies* with *i*)
- `caressed → caress` (remove *ed*)
- `hopping → hop`, `hoped → hope`
- `relational → relat`, `conditional → condit`

The resulting stems are not always valid English words (e.g. *relat*, *studi*), but they are
**consistent**, which is what we need for Bag-of-Words / TF–IDF features.

In the IMDB sentiment project we use the Porter stemmer to:

- collapse different inflected forms of a word (e.g. *run, runs, running, ran*) into one stem,
- reduce the size of the vocabulary,
- make the model focus on the underlying concepts rather than small spelling variations.

In [14]:
from nltk.stem.porter import PorterStemmer
porter = PorterStemmer()
def tokenizer_porter(text):
    return [porter.stem(word) for word in text.split()]
tokenizer_porter('runners like running and thus they run')

['runner', 'like', 'run', 'and', 'thu', 'they', 'run']

### Stop-Word Removal

**Stop words** are very common words such as *the, a, an, is, was, and, or, to, of, in* that usually
do not carry much content information. In a Bag-of-Words model with **raw** or **normalized term
frequencies (tf)** these words:

- appear in almost every document,
- produce very high counts,
- increase the dimensionality of the feature space,
- and add little useful signal for classification.

Therefore, stop-word removal is especially helpful when we work with tf-based features.

When we use **TF–IDF**, extremely frequent words automatically receive a low weight because their
inverse document frequency (IDF) is small. In that case, stop-word removal is less critical but
still useful to reduce noise and vocabulary size.

For sentiment analysis we usually *do not* remove negation words such as *not, no, never*, since
they can completely change the polarity of a sentence (e.g. “good” vs “not good”). We typically
start from a standard stop-word list and then customize it to keep important tokens like negations.

In [15]:
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

print(len(ENGLISH_STOP_WORDS))
list(ENGLISH_STOP_WORDS)[:20]

318


['own',
 'we',
 'else',
 'then',
 'whatever',
 'yourself',
 'mostly',
 'who',
 'often',
 'namely',
 'those',
 'also',
 'front',
 'which',
 'sincere',
 'few',
 'fill',
 'an',
 'forty',
 'enough']

## Stop-Word Removal

**Stop words** are very common words such as *the, a, an, is, was, and, or, to, of* that usually do not carry much content information.

In a Bag-of-Words model with **raw or normalized term frequencies (tf)** these words:

- appear in almost every document,
- produce very high counts,
- increase the dimensionality of the feature space, and
- add little useful signal for classification.

Because of that, **stop-word removal** is especially helpful when we work with tf-based features.

When we use **TF–IDF**, extremely frequent words automatically receive a low weight because their inverse document frequency (IDF) is small. In that case, stop-word removal is less critical, but it still helps reduce noise and vocabulary size.

For **sentiment analysis** we usually **do not** remove negation words such as *not, no, never*, since they can completely change the polarity of a sentence (e.g. “good” vs “not good”). We typically start from a standard stop-word list and then customize it to keep important tokens like negations.

---

### Using Stop Words in This Project

Instead of using NLTK’s stop words, we use the list that comes **built in with scikit-learn**:


from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

# start from scikit-learn's default English stop words
print(len(ENGLISH_STOP_WORDS))
list(ENGLISH_STOP_WORDS)[:20]

We can also customize this list, for example to keep negation words:
custom_stopwords = ENGLISH_STOP_WORDS.difference({'not', 'no', 'never'})

Then we plug this into our tokenizer + stemmer:
from nltk.stem.porter import PorterStemmer
porter = PorterStemmer()

def tokenizer_porter_no_stop(text):
    return [
        porter.stem(word)
        for word in text.split()
        if word.lower() not in custom_stopwords
    ]
This gives us stemmed tokens with most uninformative words removed, but keeps crucial negations.

---

### Why scikit-learn stop words worked but NLTK stop words did not

When we tried to use NLTK:
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
stop = stopwords.words('english')

we saw errors like:
	•	SSL: CERTIFICATE_VERIFY_FAILED
	•	LookupError: Resource 'stopwords' not found

What happened:
- nltk.download('stopwords') tries to download the stop-word corpus over HTTPS.
- On this Mac, the HTTPS request failed because of a certificate verification issue (CERTIFICATE_VERIFY_FAILED).
- As a result, the stopwords data never got saved to the local nltk_data folder.
- Later, stopwords.words('english') looked for that file, couldn’t find it, and raised a LookupError.

In contrast, scikit-learn’s ENGLISH_STOP_WORDS:
	•	is bundled directly inside the scikit-learn package,
	•	does not require any download or internet access,
	•	so it works immediately with no SSL or lookup errors.

Functionally, both NLTK and scikit-learn provide lists of common English stop words.
For this IMDB sentiment project, using scikit-learn’s built-in list is perfectly fine and avoids the NLTK download issues on this machine.

## Training a Logistic Regression Model for Document Classification

In this section we train a **logistic regression** classifier on the IMDB
movie reviews using a Bag-of-Words / TF–IDF representation.

### Why logistic regression?

For binary text classification (positive vs negative review), logistic
regression is a strong baseline:

- It works very well with **high-dimensional sparse features** (like
  bag-of-words).
- The model is **linear**, easy to regularize, and relatively fast to train.
- The output is a **probability** for each class, which is easy to interpret.

### Train / test split

We already built `movie_data.csv` with 50,000 reviews and labels:

- 25,000 reviews for training
- 25,000 reviews for testing

We extract:

- `X` = reviews (raw text)
- `y` = sentiment labels (0 = negative, 1 = positive)

Then we slice the first 25k as train and the rest as test, following the
textbook.

### Pipeline + GridSearchCV

We build a scikit-learn `Pipeline` with two steps:

1. **Vectorizer** (`TfidfVectorizer`):
   - converts raw text → token counts → TF–IDF weights
   - we will try different options for:
     - `ngram_range` (unigrams vs bigrams)
     - `stop_words` (use our stopwords or None)
     - `tokenizer` (simple split vs Porter stemmer)

2. **Classifier** (`LogisticRegression`):
   - with L2 regularization
   - hyperparameter `C` controls regularization strength
     (small `C` = stronger regularization).

We use **GridSearchCV** with 5-fold stratified cross-validation to search
over combinations of:

- BoW/TF–IDF parameters (ngrams, stop words, tokenizer, etc.)
- Logistic regression parameters (`C`, penalty)

The grid has two dictionaries:

1. Standard TF–IDF settings (`use_idf=True`, `norm='l2'`).
2. "Raw tf" style settings with `use_idf=False`, `smooth_idf=False`,
   `norm=None`.

This mirrors the idea from the chapter: **compare models based on pure term
frequency vs TF–IDF**.

We run `GridSearchCV` with:

- `scoring='accuracy'`
- `cv=5`
- `n_jobs=-1` (use all cores)
- `verbose=2` (show progress)

After the search:

- `gs_lr_tfidf.best_params_` gives the best combination of settings.
- `gs_lr_tfidf.best_score_` gives mean CV accuracy.
- `gs_lr_tfidf.best_estimator_` is the final trained pipeline.
- We then evaluate on the 25k held-out test reviews to get **test accuracy**.

The key takeaway: with a well-tuned logistic regression + TF–IDF, we can reach
~90% accuracy on IMDB sentiment classification.

In [16]:
import numpy as np
import pandas as pd

from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import TfidfVectorizer


# 1. Load data (if you don't already have df in memory)

df = pd.read_csv("movie_data.csv", encoding="utf-8")  # review, sentiment

# 25k train / 25k test split as in the book
X_train = df.loc[:24999, 'review'].values
y_train = df.loc[:24999, 'sentiment'].values

X_test  = df.loc[25000:, 'review'].values
y_test  = df.loc[25000:, 'sentiment'].values


In [17]:
# 2. Vectorizer + Logistic Regression pipeline


# if you have your own preprocessor/tokenizers, import/define them here
# from your earlier cells:
# - preprocessor (clean HTML, emoticons, lowercasing, etc.)
# - tokenizer (simple split with optional cleaning)
# - tokenizer_porter (uses PorterStemmer)

from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS
stop = ENGLISH_STOP_WORDS  # our base stop-word list

tfidf = TfidfVectorizer(
    strip_accents=None,
    lowercase=False,          # we already handle casing in preprocessor
    preprocessor=preprocessor # your custom preprocessor function
)

lr = LogisticRegression(
    random_state=0,
    solver="liblinear"        # works well for small/medium text problems
)

lr_tfidf = Pipeline([
    ("vect", tfidf),
    ("clf", lr)
])




In [18]:
# 3. Parameter grid

param_grid = [
    {
        "vect__ngram_range": [(1, 1)],   # unigrams
        "vect__stop_words":   [stop, None],
        "vect__tokenizer":    [tokenizer, tokenizer_porter],
        "clf__penalty":       ["l2"],
        "clf__C":             [1.0, 10.0, 100.0]
    },
    {
        "vect__ngram_range": [(1, 1)],
        "vect__stop_words":   [stop, None],
        "vect__tokenizer":    [tokenizer, tokenizer_porter],
        "vect__use_idf":      [False],
        "vect__norm":         [None],
        "clf__penalty":       ["l2"],
        "clf__C":             [1.0, 10.0, 100.0]
    }
]




### Regularization (short recap)

In high-dimensional text data, a logistic regression model can easily overfit:
it can give very large weights to some words or n-grams that only appear in the
training set. **Regularization** is a way to control this complexity.

We add a penalty on the size of the weight vector \( w \) to the loss
function:

- **L2 regularization (ridge)** uses \( \lambda \sum_j w_j^2 \).
  This keeps many weights **small but non-zero**, which works very well with
  Bag-of-Words / TF–IDF features.
- **L1 regularization (lasso)** uses \( \lambda \sum_j |w_j| \).
  This encourages **sparse** solutions where many weights are exactly zero,
  which acts like an automatic feature selection.

In scikit-learn’s `LogisticRegression` the strength of regularization is
controlled by the parameter **C**, which is the **inverse** of \(\lambda\):

- small `C` → **strong** regularization (simpler model),
- large `C` → **weak** regularization (more flexible model).

In this chapter we tune `C` using cross-validation to find a good balance
between **fitting the training data** and **generalizing to unseen reviews**.

In [19]:
# 4. GridSearchCV (5-fold stratified CV)

gs_lr_tfidf = GridSearchCV(
    estimator=lr_tfidf,
    param_grid=param_grid,
    scoring="accuracy",
    cv=5,
    verbose=2,
    n_jobs=-1
)

gs_lr_tfidf.fit(X_train, y_train)

print("CV Accuracy: %.3f" % gs_lr_tfidf.best_score_)
print("Best params:", gs_lr_tfidf.best_params_)


Fitting 5 folds for each of 24 candidates, totalling 120 fits
[CV] END clf__C=1.0, clf__penalty=l2, vect__ngram_range=(1, 1), vect__stop_words=frozenset({'a', 'describe', 'less', 'be', 'sixty', 'only', 'side', 'its', 'cant', 'otherwise', 'hereby', 'anyone', 'somehow', 'elsewhere', 'several', 'whenever', 'thick', 'he', 'because', 'back', 'due', 'none', 'two', 'thin', 'nevertheless', 'himself', 'neither', 'thereupon', 'so', 'made', 'becoming', 'else', 'name', 'another', 'after', 'some', 'last', 'noone', 'will', 'without', 'out', 'her', 'find', 'who', 'go', 'yourselves', 'done', 'have', 'and', 'is', 'even', 'could', 'too', 'somewhere', 'alone', 'are', 'on', 'afterwards', 'below', 'top', 'beyond', 'latter', 'anywhere', 'anyhow', 'cry', 'now', 'former', 'every', 'fifteen', 'bottom', 'nothing', 'also', 'mill', 'detail', 'into', 'part', 'seems', 'least', 'hers', 'toward', 'interest', 'me', 'amongst', 'should', 'indeed', 'five', 'ten', 'whom', 'except', 'eleven', 'meanwhile', 'fire', 'therefor



[CV] END clf__C=1.0, clf__penalty=l2, vect__ngram_range=(1, 1), vect__stop_words=frozenset({'nothing', 'describe', 'latter', 'beyond', 'please', 'however', 'will', 'a', 'all', 'whoever', 'least', 'therein', 'along', 'thereby', 'further', 'get', 'whereafter', 'herein', 'yours', 'eleven', 'some', 'else', 'through', 'hasnt', 'within', 'whither', 'eight', 'would', 'fill', 'last', 'anywhere', 'go', 'very', 'much', 'before', 'still', 'seeming', 'next', 'where', 'other', 'we', 'whereas', 'has', 'anyone', 'sixty', 'must', 'who', 'about', 'at', 'enough', 'nobody', 'itself', 'ours', 'them', 'whereby', 'serious', 'whose', 'such', 'full', 'keep', 'myself', 'even', 'thin', 'via', 'ourselves', 'but', 'anyway', 'too', 'cannot', 'twelve', 'which', 'yourselves', 'your', 'something', 'everywhere', 'their', 'above', 'less', 'might', 'alone', 'two', 'can', 'side', 'be', 'nor', 'former', 'of', 'several', 'forty', 'than', 'ten', 'her', 'me', 'whenever', 'was', 'con', 'done', 'while', 'himself', 'find', 'for



[CV] END clf__C=1.0, clf__penalty=l2, vect__ngram_range=(1, 1), vect__stop_words=None, vect__tokenizer=<function tokenizer at 0x1129b7c40>; total time=   2.5s
[CV] END clf__C=1.0, clf__penalty=l2, vect__ngram_range=(1, 1), vect__stop_words=None, vect__tokenizer=<function tokenizer at 0x110a0bc40>; total time=   2.5s
[CV] END clf__C=1.0, clf__penalty=l2, vect__ngram_range=(1, 1), vect__stop_words=None, vect__tokenizer=<function tokenizer at 0x11085bb00>; total time=   2.6s
[CV] END clf__C=1.0, clf__penalty=l2, vect__ngram_range=(1, 1), vect__stop_words=None, vect__tokenizer=<function tokenizer at 0x10ebbbb00>; total time=   2.6s
[CV] END clf__C=1.0, clf__penalty=l2, vect__ngram_range=(1, 1), vect__stop_words=None, vect__tokenizer=<function tokenizer at 0x10f117c40>; total time=   2.6s




[CV] END clf__C=100.0, clf__penalty=l2, vect__ngram_range=(1, 1), vect__stop_words=frozenset({'also', 'becoming', 'hereafter', 'further', 'such', 'as', 'two', 'sometimes', 'nothing', 'your', 'every', 'out', 'whenever', 'perhaps', 'ourselves', 'upon', 'call', 'anywhere', 'hereby', 'again', 'therein', 'bottom', 'on', 'have', 'hasnt', 'latterly', 'behind', 'get', 'this', 'several', 'empty', 'well', 'anyhow', 'latter', 'become', 'fire', 'are', 'so', 'none', 'always', 'off', 'below', 'us', 'system', 'may', 'already', 'be', 'once', 'detail', 'third', 'thereafter', 'fifteen', 'per', 'wherever', 'her', 'next', 'whereafter', 'since', 'myself', 'if', 'around', 'last', 'how', 'via', 'give', 'very', 'beforehand', 'why', 'of', 'fifty', 'move', 'might', 'must', 'whether', 'at', 'first', 'that', 'here', 'over', 'most', 'twenty', 'been', 'found', 'an', 'mill', 'still', 'amongst', 'the', 'being', 'everything', 'de', 'co', 'itself', 'hundred', 'describe', 'had', 'beside', 'ever', 'and', 'could', 'above'



[CV] END clf__C=10.0, clf__penalty=l2, vect__ngram_range=(1, 1), vect__stop_words=None, vect__tokenizer=<function tokenizer at 0x112ae4540>; total time=   2.9s
[CV] END clf__C=10.0, clf__penalty=l2, vect__ngram_range=(1, 1), vect__stop_words=None, vect__tokenizer=<function tokenizer at 0x113c07e20>; total time=   3.0s




[CV] END clf__C=100.0, clf__penalty=l2, vect__ngram_range=(1, 1), vect__stop_words=None, vect__tokenizer=<function tokenizer at 0x110d6b740>; total time=   2.9s
[CV] END clf__C=100.0, clf__penalty=l2, vect__ngram_range=(1, 1), vect__stop_words=None, vect__tokenizer=<function tokenizer at 0x10f6c54e0>; total time=   3.0s
[CV] END clf__C=100.0, clf__penalty=l2, vect__ngram_range=(1, 1), vect__stop_words=None, vect__tokenizer=<function tokenizer at 0x10ffefe20>; total time=   3.1s




[CV] END clf__C=100.0, clf__penalty=l2, vect__ngram_range=(1, 1), vect__stop_words=None, vect__tokenizer=<function tokenizer at 0x1133c7740>; total time=   3.1s
[CV] END clf__C=100.0, clf__penalty=l2, vect__ngram_range=(1, 1), vect__stop_words=None, vect__tokenizer=<function tokenizer at 0x10a91dd00>; total time=   3.1s
[CV] END clf__C=1.0, clf__penalty=l2, vect__ngram_range=(1, 1), vect__norm=None, vect__stop_words=frozenset({'down', 'about', 'or', 'seem', 'whose', 'anyhow', 'and', 'out', 'across', 'side', 'some', 'very', 'their', 'eg', 'get', 'thus', 'we', 'himself', 'back', 'after', 'amoungst', 'this', 'enough', 'please', 'no', 'as', 'top', 'up', 'anything', 'each', 'less', 'towards', 'empty', 'six', 'ourselves', 'i', 'perhaps', 'another', 'third', 'whence', 'upon', 'such', 'meanwhile', 'above', 'too', 'afterwards', 'whither', 'which', 'of', 'would', 'something', 'first', 'while', 'hasnt', 'may', 'she', 'thence', 'any', 'nine', 'though', 'until', 'must', 'on', 'what', 'not', 'amount



[CV] END clf__C=1.0, clf__penalty=l2, vect__ngram_range=(1, 1), vect__norm=None, vect__stop_words=frozenset({'down', 'about', 'or', 'seem', 'whose', 'anyhow', 'and', 'out', 'across', 'side', 'some', 'very', 'their', 'eg', 'get', 'thus', 'we', 'himself', 'back', 'after', 'amoungst', 'this', 'enough', 'please', 'no', 'as', 'top', 'up', 'anything', 'each', 'less', 'towards', 'empty', 'six', 'ourselves', 'i', 'perhaps', 'another', 'third', 'whence', 'upon', 'such', 'meanwhile', 'above', 'too', 'afterwards', 'whither', 'which', 'of', 'would', 'something', 'first', 'while', 'hasnt', 'may', 'she', 'thence', 'any', 'nine', 'though', 'until', 'must', 'on', 'what', 'not', 'amount', 'under', 'bottom', 'keep', 'several', 'more', 'nowhere', 'found', 'whereas', 'noone', 'describe', 'yet', 'these', 'fifty', 'nevertheless', 'well', 'at', 'whatever', 'give', 'from', 'even', 'system', 'although', 'everywhere', 'ten', 'ours', 'wherever', 'throughout', 'due', 'he', 'has', 'since', 'thereby', 'than', 'towa



[CV] END clf__C=1.0, clf__penalty=l2, vect__ngram_range=(1, 1), vect__norm=None, vect__stop_words=None, vect__tokenizer=<function tokenizer at 0x112acbc40>, vect__use_idf=False; total time=   4.3s




[CV] END clf__C=1.0, clf__penalty=l2, vect__ngram_range=(1, 1), vect__norm=None, vect__stop_words=None, vect__tokenizer=<function tokenizer at 0x113ea6e80>, vect__use_idf=False; total time=   4.5s




[CV] END clf__C=1.0, clf__penalty=l2, vect__ngram_range=(1, 1), vect__norm=None, vect__stop_words=None, vect__tokenizer=<function tokenizer at 0x112b7b9c0>, vect__use_idf=False; total time=   4.6s




[CV] END clf__C=1.0, clf__penalty=l2, vect__ngram_range=(1, 1), vect__norm=None, vect__stop_words=None, vect__tokenizer=<function tokenizer at 0x10a9223e0>, vect__use_idf=False; total time=   4.5s




[CV] END clf__C=1.0, clf__penalty=l2, vect__ngram_range=(1, 1), vect__stop_words=None, vect__tokenizer=<function tokenizer_porter at 0x10f1149a0>; total time=  27.2s




[CV] END clf__C=1.0, clf__penalty=l2, vect__ngram_range=(1, 1), vect__stop_words=None, vect__tokenizer=<function tokenizer_porter at 0x10f5d09a0>; total time=  27.5s
[CV] END clf__C=1.0, clf__penalty=l2, vect__ngram_range=(1, 1), vect__stop_words=None, vect__tokenizer=<function tokenizer_porter at 0x110e28860>; total time=  27.4s
[CV] END clf__C=1.0, clf__penalty=l2, vect__ngram_range=(1, 1), vect__stop_words=None, vect__tokenizer=<function tokenizer_porter at 0x10f6009a0>; total time=  27.6s
[CV] END clf__C=1.0, clf__penalty=l2, vect__ngram_range=(1, 1), vect__stop_words=None, vect__tokenizer=<function tokenizer_porter at 0x10fab09a0>; total time=  27.8s




[CV] END clf__C=10.0, clf__penalty=l2, vect__ngram_range=(1, 1), vect__stop_words=None, vect__tokenizer=<function tokenizer_porter at 0x107469d00>; total time=  28.1s
[CV] END clf__C=10.0, clf__penalty=l2, vect__ngram_range=(1, 1), vect__norm=None, vect__stop_words=frozenset({'along', 'because', 'found', 'fire', 'who', 'four', 'she', 'their', 'each', 'same', 'always', 'my', 'everywhere', 'all', 'herself', 'one', 'thereupon', 'inc', 'whose', 'there', 'whence', 'below', 'do', 'onto', 'nevertheless', 'forty', 'that', 'least', 'anyhow', 'ours', 'amoungst', 'etc', 'but', 'between', 'towards', 'any', 'to', 'upon', 'ourselves', 'should', 'whereby', 'alone', 'whom', 'noone', 'much', 'eleven', 'anywhere', 'detail', 'now', 'though', 'seemed', 'name', 'cry', 'never', 'your', 'until', 'perhaps', 'except', 'being', 'whatever', 'again', 'others', 'since', 'in', 'thin', 'per', 'next', 'another', 'became', 'system', 'you', 'former', 'sometime', 'bottom', 'with', 'fifteen', 'it', 'although', 'as', 'acr



[CV] END clf__C=1.0, clf__penalty=l2, vect__ngram_range=(1, 1), vect__norm=None, vect__stop_words=None, vect__tokenizer=<function tokenizer at 0x112ae4540>, vect__use_idf=False; total time=   4.5s




[CV] END clf__C=10.0, clf__penalty=l2, vect__ngram_range=(1, 1), vect__stop_words=None, vect__tokenizer=<function tokenizer_porter at 0x111028ae0>; total time=  27.8s
[CV] END clf__C=10.0, clf__penalty=l2, vect__ngram_range=(1, 1), vect__stop_words=None, vect__tokenizer=<function tokenizer_porter at 0x112fb0ae0>; total time=  27.9s




[CV] END clf__C=10.0, clf__penalty=l2, vect__ngram_range=(1, 1), vect__stop_words=None, vect__tokenizer=<function tokenizer_porter at 0x10f1a89a0>; total time=  28.1s




[CV] END clf__C=10.0, clf__penalty=l2, vect__ngram_range=(1, 1), vect__stop_words=None, vect__tokenizer=<function tokenizer_porter at 0x110e789a0>; total time=  28.5s




[CV] END clf__C=100.0, clf__penalty=l2, vect__ngram_range=(1, 1), vect__stop_words=None, vect__tokenizer=<function tokenizer_porter at 0x112e70720>; total time=  29.3s




[CV] END clf__C=100.0, clf__penalty=l2, vect__ngram_range=(1, 1), vect__stop_words=None, vect__tokenizer=<function tokenizer_porter at 0x11394c9a0>; total time=  28.3s
[CV] END clf__C=10.0, clf__penalty=l2, vect__ngram_range=(1, 1), vect__norm=None, vect__stop_words=None, vect__tokenizer=<function tokenizer at 0x1109abe20>, vect__use_idf=False; total time=   5.8s
[CV] END clf__C=100.0, clf__penalty=l2, vect__ngram_range=(1, 1), vect__stop_words=None, vect__tokenizer=<function tokenizer_porter at 0x10f1d0680>; total time=  28.6s




[CV] END clf__C=100.0, clf__penalty=l2, vect__ngram_range=(1, 1), vect__stop_words=None, vect__tokenizer=<function tokenizer_porter at 0x1113549a0>; total time=  29.2s
[CV] END clf__C=100.0, clf__penalty=l2, vect__ngram_range=(1, 1), vect__norm=None, vect__stop_words=frozenset({'at', 'go', 'could', 'own', 'somewhere', 'nobody', 'top', 'made', 'who', 'why', 'further', 'over', 'in', 'eg', 'hereby', 'more', 'will', 'back', 'ten', 'any', 'onto', 'whereupon', 'always', 'us', 'together', 'yourself', 'wherever', 'without', 'beforehand', 'on', 'give', 'must', 'him', 'its', 'almost', 'besides', 'been', 'around', 'seem', 'eight', 'twelve', 'thereafter', 'un', 'are', 'elsewhere', 'their', 'bill', 'same', 'whither', 'name', 'wherein', 'while', 'beside', 'becoming', 'how', 'because', 'see', 'during', 'interest', 'sometimes', 'whether', 'couldnt', 'every', 'somehow', 'can', 'thence', 'enough', 'whereafter', 'full', 'otherwise', 'moreover', 'themselves', 'least', 'three', 'five', 'ours', 'still', 'et



[CV] END clf__C=10.0, clf__penalty=l2, vect__ngram_range=(1, 1), vect__norm=None, vect__stop_words=None, vect__tokenizer=<function tokenizer at 0x1129eb880>, vect__use_idf=False; total time=   5.6s
[CV] END clf__C=10.0, clf__penalty=l2, vect__ngram_range=(1, 1), vect__norm=None, vect__stop_words=None, vect__tokenizer=<function tokenizer at 0x110a3f880>, vect__use_idf=False; total time=   5.8s




[CV] END clf__C=10.0, clf__penalty=l2, vect__ngram_range=(1, 1), vect__norm=None, vect__stop_words=None, vect__tokenizer=<function tokenizer at 0x10ec53740>, vect__use_idf=False; total time=   5.6s




[CV] END clf__C=100.0, clf__penalty=l2, vect__ngram_range=(1, 1), vect__norm=None, vect__stop_words=None, vect__tokenizer=<function tokenizer at 0x10ed679c0>, vect__use_idf=False; total time=   4.8s
[CV] END clf__C=100.0, clf__penalty=l2, vect__ngram_range=(1, 1), vect__norm=None, vect__stop_words=None, vect__tokenizer=<function tokenizer at 0x110e879c0>, vect__use_idf=False; total time=   4.9s




[CV] END clf__C=100.0, clf__penalty=l2, vect__ngram_range=(1, 1), vect__norm=None, vect__stop_words=None, vect__tokenizer=<function tokenizer at 0x113024720>, vect__use_idf=False; total time=   5.6s




[CV] END clf__C=100.0, clf__penalty=l2, vect__ngram_range=(1, 1), vect__norm=None, vect__stop_words=None, vect__tokenizer=<function tokenizer at 0x113dfb100>, vect__use_idf=False; total time=   5.3s




[CV] END clf__C=100.0, clf__penalty=l2, vect__ngram_range=(1, 1), vect__norm=None, vect__stop_words=None, vect__tokenizer=<function tokenizer at 0x11178b100>, vect__use_idf=False; total time=   5.5s




[CV] END clf__C=1.0, clf__penalty=l2, vect__ngram_range=(1, 1), vect__norm=None, vect__stop_words=None, vect__tokenizer=<function tokenizer_porter at 0x10f1149a0>, vect__use_idf=False; total time=  30.8s
[CV] END clf__C=1.0, clf__penalty=l2, vect__ngram_range=(1, 1), vect__norm=None, vect__stop_words=None, vect__tokenizer=<function tokenizer_porter at 0x10fab09a0>, vect__use_idf=False; total time=  30.4s
[CV] END clf__C=1.0, clf__penalty=l2, vect__ngram_range=(1, 1), vect__norm=None, vect__stop_words=None, vect__tokenizer=<function tokenizer_porter at 0x10f5d09a0>, vect__use_idf=False; total time=  30.6s
[CV] END clf__C=1.0, clf__penalty=l2, vect__ngram_range=(1, 1), vect__norm=None, vect__stop_words=None, vect__tokenizer=<function tokenizer_porter at 0x1086b5d00>, vect__use_idf=False; total time=  30.5s




[CV] END clf__C=1.0, clf__penalty=l2, vect__ngram_range=(1, 1), vect__norm=None, vect__stop_words=None, vect__tokenizer=<function tokenizer_porter at 0x10f6009a0>, vect__use_idf=False; total time=  31.1s
[CV] END clf__C=10.0, clf__penalty=l2, vect__ngram_range=(1, 1), vect__norm=None, vect__stop_words=None, vect__tokenizer=<function tokenizer_porter at 0x110e2d3a0>, vect__use_idf=False; total time=  30.4s




[CV] END clf__C=10.0, clf__penalty=l2, vect__ngram_range=(1, 1), vect__norm=None, vect__stop_words=None, vect__tokenizer=<function tokenizer_porter at 0x112e70720>, vect__use_idf=False; total time=  30.1s
[CV] END clf__C=10.0, clf__penalty=l2, vect__ngram_range=(1, 1), vect__norm=None, vect__stop_words=None, vect__tokenizer=<function tokenizer_porter at 0x106fc63e0>, vect__use_idf=False; total time=  29.4s
[CV] END clf__C=10.0, clf__penalty=l2, vect__ngram_range=(1, 1), vect__norm=None, vect__stop_words=None, vect__tokenizer=<function tokenizer_porter at 0x11394c9a0>, vect__use_idf=False; total time=  30.0s
[CV] END clf__C=10.0, clf__penalty=l2, vect__ngram_range=(1, 1), vect__norm=None, vect__stop_words=None, vect__tokenizer=<function tokenizer_porter at 0x10fc1d080>, vect__use_idf=False; total time=  30.0s
[CV] END clf__C=100.0, clf__penalty=l2, vect__ngram_range=(1, 1), vect__norm=None, vect__stop_words=None, vect__tokenizer=<function tokenizer_porter at 0x10f1613a0>, vect__use_idf=

60 fits failed out of a total of 120.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
1 fits failed with the following error:
Traceback (most recent call last):
  File "/Users/shivesh/Desktop/PythonProject/Sentiment Analysis/.venv/lib/python3.13/site-packages/sklearn/model_selection/_validation.py", line 833, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
    ~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/shivesh/Desktop/PythonProject/Sentiment Analysis/.venv/lib/python3.13/site-packages/sklearn/base.py", line 1336, in wrapper
    return fit_method(estimator, *args, **kwargs)
  File "/Users/shivesh/Desktop/PythonProject/Sentiment Analysis/.venv/lib/python3.13/site-packages/sklearn/pipeline.py", line 613

CV Accuracy: 0.897
Best params: {'clf__C': 10.0, 'clf__penalty': 'l2', 'vect__ngram_range': (1, 1), 'vect__stop_words': None, 'vect__tokenizer': <function tokenizer at 0x1123a6d40>}


### Why 5-Fold Stratified Cross-Validation?

We use **k-fold cross-validation** to reduce the influence of “luck” in our
evaluation. Instead of relying on a single train/test split, we split the data
into *k* folds and run *k* rounds of training + validation, each time using a
different fold as the validation set. The final score is the average accuracy
across all folds.

For classification tasks we use **StratifiedKFold**, which keeps the class
distribution (positive / negative) similar in every fold. This makes each fold
representative of the full dataset.

Choosing **5 folds instead of 10** is a practical trade-off:

- 10-fold CV has slightly lower variance in the accuracy estimate,
  but takes about **twice as long**.
- 5-fold CV is **much faster** while still giving a reliable estimate.

On the 50k IMDB reviews, a large grid search with 10-fold CV would be very
expensive, so the book uses **5-fold stratified CV** as a good balance between
runtime and robustness.

In [20]:
# 5. Evaluate on the test set

clf = gs_lr_tfidf.best_estimator_
print("Test Accuracy: %.3f" % clf.score(X_test, y_test))

Test Accuracy: 0.899


## Out-of-Core Learning with HashingVectorizer + SGDClassifier

The previous grid search uses **all 50,000 reviews in memory** at once.
For much larger datasets, this can become too slow or memory-heavy.

To handle larger data, we can use **out-of-core learning**:

- Instead of loading the whole dataset, we **stream** documents from disk in
  small mini-batches.
- We use a classifier that supports **incremental learning** via `partial_fit`
  (here: `SGDClassifier` with logistic loss).
- We use `HashingVectorizer` instead of `CountVectorizer` / `TfidfVectorizer`:
  - HashingVectorizer maps tokens to a fixed-size feature space using a hash
    function.
  - It does not need to store a vocabulary, so it is very memory efficient.
  - This is ideal when we process data in a stream.

### Steps

1. **Tokenizer function**

   We reuse a text cleaning + tokenization function that:
   - removes HTML tags and punctuation,
   - extracts emoticons,
   - lowercases text,
   - optionally removes stop words,
   - optionally applies stemming.

2. **Document stream**

   We define `stream_docs(path)`:

   - opens `movie_data.csv`,
   - skips the header,
   - yields one `(text, label)` pair at a time.

3. **Mini-batch function**

   `get_minibatch(doc_stream, size)`:

   - pulls `size` documents from the stream,
   - returns `X` (list of text) and `y` (list/array of labels).

4. **Model**

   - `HashingVectorizer` with our tokenizer and preprocessor.
   - `SGDClassifier(loss='log_loss')` for online logistic regression.
   - We call `partial_fit` on each mini-batch.

We iterate, for example, over **45 mini-batches** of size 1,000:

- 45 × 1,000 = 45,000 documents for training.
- We keep the last 5,000 documents as a held-out test set.
- At the end, we compute accuracy on that test set.

The accuracy is slightly lower than the full grid-search model (~0.86–0.87 vs
~0.90), but the training is:

- much faster,
- uses much less memory,
- scalable to much larger datasets.

This is the main idea behind **online / streaming learning** in this chapter.

In [21]:
import numpy as np
import re
import csv

from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

# If not already defined, reuse our preprocessor/tokenizer here.
# I'll show a self-contained version which is close to the book:

stop = ENGLISH_STOP_WORDS

def preprocessor_stream(text):
    # strip HTML tags
    text = re.sub(r"<[^>]*>", "", text)

    # extract emoticons
    emoticons = re.findall(r"(?::|;|=)(?:-)?(?:\)|\(|D|P)", text)

    # remove non-word characters and convert to lower case
    text = re.sub(r"[\W]+", " ", text.lower())

    # append emoticons without hyphens
    text = text + " " + " ".join(emoticons).replace("-", "")

    return text

def tokenizer_stream(text):
    return text.split()


# Document stream generator

def stream_docs(path):
    """Yield (text, label) pairs from the movie_data.csv file."""
    with open(path, "r", encoding="utf-8") as csvfile:
        reader = csv.reader(csvfile)
        next(reader)  # skip header
        for line in reader:
            text, label = line[0], int(line[1])
            yield text, label


# Minibatch helper

def get_minibatch(doc_stream, size):
    """Read `size` documents from the stream."""
    docs, y = [], []
    try:
        for _ in range(size):
            text, label = next(doc_stream)
            docs.append(text)
            y.append(label)
    except StopIteration:
        return None, None
    return docs, np.array(y)


# HashingVectorizer + SGDClassifier

vect = HashingVectorizer(
    decode_error="ignore",
    n_features=2**21,           # as in the book
    preprocessor=preprocessor_stream,
    tokenizer=tokenizer_stream
)

clf = SGDClassifier(
    loss="log_loss",            # logistic regression
    random_state=1,
    max_iter=1                  # we'll control epochs via partial_fit
)

doc_stream = stream_docs("movie_data.csv")

# Classes need to be passed for the first call to partial_fit
classes = np.array([0, 1])


# Online training on 45 mini-batches of 1,000 docs each

from pyprind import ProgBar
pbar = ProgBar(45)

for _ in range(45):
    X_train, y_train = get_minibatch(doc_stream, size=1000)
    if not X_train:
        break
    X_train = vect.transform(X_train)
    clf.partial_fit(X_train, y_train, classes=classes)
    pbar.update()


# Evaluate on the remaining 5,000 docs

X_test, y_test = get_minibatch(doc_stream, size=5000)
X_test = vect.transform(X_test)
print("Accuracy: %.3f" % clf.score(X_test, y_test))

# Optionally, update the model one last time on the test set
clf.partial_fit(X_test, y_test)

Accuracy: 0.830


0,1,2
,"loss  loss: {'hinge', 'log_loss', 'modified_huber', 'squared_hinge', 'perceptron', 'squared_error', 'huber', 'epsilon_insensitive', 'squared_epsilon_insensitive'}, default='hinge' The loss function to be used. - 'hinge' gives a linear SVM. - 'log_loss' gives logistic regression, a probabilistic classifier. - 'modified_huber' is another smooth loss that brings tolerance to  outliers as well as probability estimates. - 'squared_hinge' is like hinge but is quadratically penalized. - 'perceptron' is the linear loss used by the perceptron algorithm. - The other losses, 'squared_error', 'huber', 'epsilon_insensitive' and  'squared_epsilon_insensitive' are designed for regression but can be useful  in classification as well; see  :class:`~sklearn.linear_model.SGDRegressor` for a description. More details about the losses formulas can be found in the :ref:`User Guide ` and you can find a visualisation of the loss functions in :ref:`sphx_glr_auto_examples_linear_model_plot_sgd_loss_functions.py`.",'log_loss'
,"penalty  penalty: {'l2', 'l1', 'elasticnet', None}, default='l2' The penalty (aka regularization term) to be used. Defaults to 'l2' which is the standard regularizer for linear SVM models. 'l1' and 'elasticnet' might bring sparsity to the model (feature selection) not achievable with 'l2'. No penalty is added when set to `None`. You can see a visualisation of the penalties in :ref:`sphx_glr_auto_examples_linear_model_plot_sgd_penalties.py`.",'l2'
,"alpha  alpha: float, default=0.0001 Constant that multiplies the regularization term. The higher the value, the stronger the regularization. Also used to compute the learning rate when `learning_rate` is set to 'optimal'. Values must be in the range `[0.0, inf)`.",0.0001
,"l1_ratio  l1_ratio: float, default=0.15 The Elastic Net mixing parameter, with 0 <= l1_ratio <= 1. l1_ratio=0 corresponds to L2 penalty, l1_ratio=1 to L1. Only used if `penalty` is 'elasticnet'. Values must be in the range `[0.0, 1.0]` or can be `None` if `penalty` is not `elasticnet`. .. versionchanged:: 1.7  `l1_ratio` can be `None` when `penalty` is not ""elasticnet"".",0.15
,"fit_intercept  fit_intercept: bool, default=True Whether the intercept should be estimated or not. If False, the data is assumed to be already centered.",True
,"max_iter  max_iter: int, default=1000 The maximum number of passes over the training data (aka epochs). It only impacts the behavior in the ``fit`` method, and not the :meth:`partial_fit` method. Values must be in the range `[1, inf)`. .. versionadded:: 0.19",1
,"tol  tol: float or None, default=1e-3 The stopping criterion. If it is not None, training will stop when (loss > best_loss - tol) for ``n_iter_no_change`` consecutive epochs. Convergence is checked against the training loss or the validation loss depending on the `early_stopping` parameter. Values must be in the range `[0.0, inf)`. .. versionadded:: 0.19",0.001
,"shuffle  shuffle: bool, default=True Whether or not the training data should be shuffled after each epoch.",True
,"verbose  verbose: int, default=0 The verbosity level. Values must be in the range `[0, inf)`.",0
,"epsilon  epsilon: float, default=0.1 Epsilon in the epsilon-insensitive loss functions; only if `loss` is 'huber', 'epsilon_insensitive', or 'squared_epsilon_insensitive'. For 'huber', determines the threshold at which it becomes less important to get the prediction exactly right. For epsilon-insensitive, any differences between the current prediction and the correct label are ignored if they are less than this threshold. Values must be in the range `[0.0, inf)`.",0.1


## Naive Bayes Classifier (short note)

The chapter briefly mentions the **naive Bayes classifier** as another popular
text classification model.

Key points:

- Very simple and fast to train.
- Assumes features are **conditionally independent** given the class.
- Works surprisingly well on many text tasks (e.g. spam filtering).
- Often used as a strong baseline, especially with Bag-of-Words features.

We are not implementing naive Bayes here, but the idea is:

1. Estimate **P(word | class)** from the training corpus.
2. For a new document, combine word probabilities to compute **P(class | doc)**.
3. Choose the class with the higher posterior probability.

---

## word2vec (short note)

The book also mentions **word2vec** as a more modern alternative to
Bag-of-Words:

- Instead of representing words as one-hot vectors, word2vec learns
  **dense, low-dimensional embeddings**.
- Words with similar meanings end up close together in this vector space.
- Famous examples: **king –> man** and  **woman -> queen**.

In this chapter we just note that:

- word2vec (and more modern methods like GloVe, fastText, and transformers)
  can capture **semantic relationships** that Bag-of-Words cannot.
- Later chapters (and other resources) cover neural-network-based models
  in more detail.

For the IMDB sentiment project here, we stick to Bag-of-Words / TF-IDF +
linear models (logistic regression / SGD).

## Topic Modelling with Latent Dirichlet Allocation (LDA)

So far we have focused on **supervised learning**: predicting the sentiment
label (positive/negative) given a review.

In this section, we switch to an **unsupervised** task: **topic modelling**.

### What is LDA?

**Latent Dirichlet Allocation (LDA)** is a generative probabilistic model that
tries to discover **hidden topics** in a document collection by looking at how
words co-occur across documents.

- Each **document** is modelled as a mixture of topics.
- Each **topic** is a distribution over words.

Given a **bag-of-words matrix** (documents × words), LDA decomposes it into:

- a document-to-topic matrix (how much each topic contributes to a document),
- a topic-to-word matrix (how strongly each word belongs to each topic).

We must **choose the number of topics** in advance (here: 10). This is a
hyperparameter and can be tuned.

### LDA with scikit-learn

Steps from the textbook:

1. Load `movie_data.csv` into a DataFrame `df`.
2. Use `CountVectorizer` to create a bag-of-words matrix `X`:
   - remove very common words (`max_df=0.1` → ignore words in >10% of docs),
   - limit vocabulary size (`max_features=5000`),
   - use English stop words (`stop_words='english'`).
3. Fit an `LatentDirichletAllocation` model with:
   - `n_components=10` (topics),
   - `learning_method='batch'` (use full dataset at once),
   - `random_state=123` for reproducibility.
4. Access `lda.components_`:
   - shape `(n_topics, n_words)`,
   - each row contains word importance for a given topic.
5. For each topic, sort the word importances and print the **top N words**.

The result is a set of interpretable topics, for example:

- Topic 1: *worst minutes awful script stupid*
- Topic 2: *family mother father children girl*
- Topic 3: *american war dvd music tv*
- …

These topics give us a **high-level view** of what themes appear in the movie
reviews without using any labels.

Note: LDA here is **separate** from the sentiment classifier. In the next
chapter the authors show how to embed the classifier into a web app; LDA is
a standalone unsupervised example.

In [22]:
import numpy as np
import pandas as pd

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation


# 1. Load movie_data and build bag-of-words matrix

df = pd.read_csv("movie_data.csv", encoding="utf-8")

count = CountVectorizer(
    max_df=0.1,           # ignore very frequent words (>10% docs)
    max_features=5000,    # keep 5000 most frequent words
    stop_words="english"
)

X = count.fit_transform(df["review"].values)



In [23]:

# 2. Fit LDA model with 10 topics

lda = LatentDirichletAllocation(
    n_components=10,      # number of topics
    random_state=123,
    learning_method="batch"
)

X_topics = lda.fit_transform(X)

# Shape of components_: (n_topics, n_features)
print("lda.components_.shape:", lda.components_.shape)


lda.components_.shape: (10, 5000)


In [24]:

# 3. Print the top words per topic

n_top_words = 5
feature_names = count.get_feature_names_out()

for topic_idx, topic in enumerate(lda.components_):
    print(f"Topic {topic_idx + 1}:")
    top_indices = topic.argsort()[:-n_top_words - 1:-1]  # indices of top words
    top_words = [feature_names[i] for i in top_indices]
    print(" ".join(top_words))

Topic 1:
worst minutes awful script stupid
Topic 2:
family mother father children girl
Topic 3:
american war dvd music tv
Topic 4:
human audience cinema art sense
Topic 5:
police guy car dead murder
Topic 6:
horror house sex girl woman
Topic 7:
role performance comedy actor performances
Topic 8:
series episode war episodes tv
Topic 9:
book version original read novel
Topic 10:
action fight guy guys cool


## Comparing the Different Setups

### 5-Fold Stratified Cross-Validation

We use **k-fold cross-validation** to estimate how well our model will generalize and to choose good hyperparameters. In **5-fold stratified CV**:

- The training set is split into 5 folds.
- We train 5 models, each time holding out a different fold as validation.
- Scores are averaged across folds.
- *Stratified* means each fold keeps a similar positive/negative class ratio.

Choosing **5 folds instead of 10** is a runtime vs stability trade-off:

- 10-fold CV has slightly lower variance in the accuracy estimate, but is ~2× slower.
- 5-fold CV is much faster and still reliable on a large dataset like 50k reviews.

In this project we obtain **Test Accuracy ≈ 0.899** with 5-fold stratified CV and a
grid-searched logistic regression model (TF–IDF features).

---

### Out-of-Core Learning with HashingVectorizer + SGDClassifier

Out-of-core learning trains on **mini-batches** that are streamed from disk,
instead of loading the full dataset into memory. Here we use:

- `HashingVectorizer` for features (no stored vocabulary, uses the hashing trick),
- `SGDClassifier(loss='log_loss', penalty='l2')` with `partial_fit` updates.

This setup is very **memory-efficient** and fast for big data, but we do not run a
full grid search and we accept some information loss from hashing. As a result,
the accuracy (~0.84) is lower than the fully tuned in-memory logistic regression
(~0.899), but the example demonstrates how to scale to larger datasets.

---

### LDA vs Logistic Regression vs k-NN

- **Logistic Regression** (this chapter’s main classifier):
  Supervised, discriminative model that maps feature vectors to a probability of
  the positive class. Used for sentiment classification.

- **LDA (Latent Dirichlet Allocation)**:
  Unsupervised probabilistic topic model. Takes a bag-of-words matrix and
  decomposes it into:
  - a document–topic matrix and
  - a topic–word matrix.
  Useful for discovering themes such as “family drama”, “horror”, etc., not for
  direct sentiment labels.

- **k-NN (k-Nearest Neighbors)**:
  Supervised, non-parametric classifier that predicts a label by taking the
  **majority vote** (mode) of the k closest training samples. Mentioned here as
  another family of classifiers, but not used in this chapter.

### Logistic Regression vs LDA vs PCA vs Kernel PCA

In this chapter we mainly use **logistic regression** for sentiment classification,
but there are several related techniques that are easy to confuse:

#### Logistic Regression
- Supervised classifier.
- Models \(P(y=1 \mid x)\) using the logistic (sigmoid) function.
- Learns a linear decision boundary in feature space.
- Used here with TF–IDF features for IMDB review sentiment (positive vs negative).

#### Linear Discriminant Analysis (LDA)
- **This is a different LDA** from the topic model “Latent Dirichlet Allocation.”
- Supervised dimensionality reduction + classifier.
- Finds directions that:
  - maximize the distance between class means, and
  - minimize the variance within each class.
- Uses label information directly and is designed to separate classes well.
- Can be used as:
  - a classifier in its own right, or
  - a feature extractor before another classifier.

#### PCA (Principal Component Analysis)
- Unsupervised dimensionality reduction.
- Ignores class labels and focuses on directions of **maximum variance**.
- Often used for:
  - compressing high-dimensional features,
  - denoising,
  - visualization (e.g. projecting to 2D/3D).

#### Kernel PCA (KPCA)
- Nonlinear extension of PCA.
- Uses kernel functions (e.g. RBF kernel) to perform PCA in an implicit
  high-dimensional feature space.
- Captures **nonlinear** structure in the data, useful when linear PCA is not
  expressive enough.

In short:

- **Logistic Regression** – supervised classifier for predicting labels.
- **Linear Discriminant Analysis** – supervised dimensionality reduction that
  explicitly tries to separate classes.
- **PCA / Kernel PCA** – unsupervised dimensionality reduction methods that find
  useful low-dimensional representations (linear for PCA, nonlinear for KPCA),
  without using class labels.