

## **Summary of the Demo: Text Representation and Sentiment Classification**

This demo showcases a complete workflow of **text preprocessing, vectorization, and sentiment classification** using the **Bag-of-Words (BoW)** model and a **Naive Bayes** classifier, two of the most fundamental yet powerful techniques in natural language processing (NLP).

The core idea is to demonstrate how raw text data (like movie reviews or taglines) can be transformed into numerical representations that machine learning models can understand, and how enhancing this representation with **n-grams** can significantly impact performance and interpretability.

---

### **1. Purpose and Flow of the Demo**

The objective of this demo is to illustrate:

* How to **convert unstructured text** into a structured numerical form using BoW.
* How to **train and evaluate** a simple machine learning model (Naive Bayes) on text data.
* How **n-grams** enhance context understanding by including combinations of words rather than single tokens.
* How increasing linguistic complexity (1-gram → 3-gram) affects both **accuracy and computational cost**.

The workflow follows a clear progression: from data loading and cleaning, through feature extraction, to model evaluation and optimization.

---

### **2. Core Components Explained**

#### **a. Text Preprocessing**

Data cleaning steps like removing missing values, lowercasing text, removing stopwords, and lemmatization are performed.
This ensures that the text is consistent and that only meaningful tokens contribute to model learning.

Significance:
🧹 *Reduces noise and redundancy, ensuring more accurate text representations.*

---

#### **b. Bag-of-Words Representation**

Using `CountVectorizer`, each document (review or tagline) is transformed into a **vector of word counts**.
This creates a simple, interpretable numerical matrix where:

* Rows = documents
* Columns = unique words (features)
* Values = frequency of each word

Significance:
🔢 *BoW captures word presence and frequency, forming the foundation for many classical NLP models.*

---

#### **c. Model Training: Multinomial Naive Bayes**

A **Multinomial Naive Bayes classifier** is trained on the BoW features.
It works well for discrete count data and is widely used in text classification because it assumes word independence and calculates class probabilities efficiently.

Significance:
⚙️ *Provides a fast, effective baseline for sentiment prediction and other NLP tasks.*

---

### **3. The Power of N-Grams**

The highlight of the demo is the transition from **unigrams** (single words) to **higher-order n-grams** (word pairs and triplets).

* **Unigrams (n=1):** Capture isolated word importance (e.g., “good”, “bad”).
* **Bigrams (n=2):** Capture short context (e.g., “not good”, “very bad”).
* **Trigrams (n=3):** Capture richer, phrase-level meaning (e.g., “a waste of”, “one of the”).

By increasing `ngram_range`, the model begins to recognize **word dependencies and context**, crucial for sentiment analysis. For instance:

* A unigram model might see “good” and predict positive.
* A bigram model sees “not good” and correctly predicts negative.

However, as n increases:

* **Feature dimensionality explodes** (many more word combinations).
* **Computation time rises**, and **memory requirements grow**.

The demo quantitatively compares these effects, showing how higher-order n-grams can slightly boost accuracy at the cost of runtime and memory.

Significance:
🧩 *N-grams bridge the gap between bag-of-words simplicity and deep contextual understanding, allowing classical ML models to capture limited semantic structure without neural networks.*

---

### **4. Evaluation and Insights**

The demo measures both **accuracy** and **runtime** for unigram and trigram models, revealing an important trade-off:

* **Unigram models** → Fast, simple, baseline accuracy.
* **N-gram models** → Slightly more accurate in this demo, but computationally heavier.

It also demonstrates how the **number of features** (vocabulary size) grows rapidly with n-gram order, emphasizing the importance of balancing context richness with model efficiency.

---

### **5. Broader Takeaways**

* **N-gram models** remain highly relevant even in the age of deep learning, especially for smaller datasets or interpretable systems.
* **Preprocessing quality** (lemmatization, stopword removal) strongly influences model clarity and feature quality.
* **Naive Bayes + n-grams** provides a solid, explainable baseline for text classification tasks like sentiment analysis, spam detection, and topic categorization.

---

### **In Summary**

This demo effectively demonstrates how **linguistic granularity (via n-grams)** impacts a machine learning model’s ability to understand and predict sentiment.
It blends theory with practice, showing the trade-offs between **simplicity, interpretability, and contextual depth**.

Ultimately, it helps learners grasp how small enhancements in text representation can lead to **significant gains in understanding language nuances**, forming a bridge from traditional NLP to more advanced methods like TF-IDF and word embeddings.




In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import spacy

plt.rcParams['figure.figsize'] = (8, 8)



### **1. What is the purpose of importing `pandas`, `numpy`, and `matplotlib.pyplot`?**

**Answer:**
These libraries are essential for data analysis and visualization in Python:

* **pandas**: Handles data manipulation and analysis (e.g., DataFrames).
* **numpy**: Provides numerical operations and array handling.
* **matplotlib.pyplot**: Used for plotting graphs and visualizing data.

---

### **2. Why do we import `spacy`?**

**Answer:**
`spaCy` is a natural language processing (NLP) library used for tasks like tokenization, named entity recognition, and part-of-speech tagging. Importing it prepares the environment for text analysis.

---

### **3. What does `plt.rcParams['figure.figsize'] = (8, 8)` do?**

**Answer:**
It sets the **default size** for all figures created using Matplotlib to **8 inches by 8 inches**, ensuring consistent and readable plot dimensions across visualizations.

---

### **4. Do we need to install these libraries before using them?**

**Answer:**
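Yes. If they are not already installed, use the following commands (scikit-learn and the spaCy English model are included because later cells rely on them):

```bash
pip install pandas numpy matplotlib spacy scikit-learn
python -m spacy download en_core_web_sm
```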

---

### **5. Why are all these libraries imported at the beginning?**

**Answer:**
It’s a best practice to import all dependencies at the start of the script. This ensures that all necessary packages are loaded and available before the code execution proceeds.




## Building a bag of words model
- Bag of words model
    - Extract word tokens
    - Compute frequency of word tokens
    - Construct a word vector out of these frequencies and vocabulary of corpus

### BoW model for movie taglines
In this exercise, you have been provided with a `corpus` of more than 7000 movie tag lines. Your job is to generate the bag of words representation `bow_matrix` for these taglines. For this exercise, we will ignore the text preprocessing step and generate `bow_matrix` directly.

In [None]:
movies = pd.read_csv('movie_overviews.csv').dropna()
movies['tagline'] = movies['tagline'].str.lower()
movies.head()

Unnamed: 0,id,title,overview,tagline
1,8844,Jumanji,When siblings Judy and Peter discover an encha...,roll the dice and unleash the excitement!
2,15602,Grumpier Old Men,A family wedding reignites the ancient feud be...,still yelling. still fighting. still ready for...
3,31357,Waiting to Exhale,"Cheated on, mistreated and stepped on, the wom...",friends are the people who let you be yourself...
4,11862,Father of the Bride Part II,Just when George Banks has recovered from his ...,just when his world is back to normal... he's ...
5,949,Heat,"Obsessive master thief, Neil McCauley leads a ...",a los angeles crime saga


In [None]:
corpus = movies['tagline']

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

# Create CountVectorizer object
vectorizer = CountVectorizer()

# Generate matrix of word vectors
bow_matrix = vectorizer.fit_transform(corpus)

# Print the shape of bow_matrix
print(bow_matrix.shape)

(7033, 6614)




### **1. What does `pd.read_csv('movie_overviews.csv').dropna()` do?**

**Answer:**
It reads the CSV file named **`movie_overviews.csv`** into a pandas DataFrame and then removes all rows containing **missing (NaN)** values to ensure clean data before analysis.

---

### **2. Why do we convert the ‘tagline’ column to lowercase?**

**Answer:**
The line

```python
movies['tagline'] = movies['tagline'].str.lower()
```

ensures **text normalization**. Converting all taglines to lowercase avoids treating words like *“Love”* and *“love”* as different tokens during text vectorization.

---

### **3. What is the purpose of `CountVectorizer` here?**

**Answer:**
`CountVectorizer` converts the text data (taglines) into a **Bag-of-Words (BoW)** representation: a numerical matrix where each row represents a movie tagline and each column represents a word, with values showing the frequency of that word.

---

### **4. What does `bow_matrix = vectorizer.fit_transform(corpus)` return?**

**Answer:**
It returns a **sparse matrix** where:

* Rows = individual taglines
* Columns = unique words (vocabulary)
* Values = frequency counts of each word in each tagline

This is the numerical input for many machine learning or NLP models.

---

### **5. What does `bow_matrix.shape` indicate?**

**Answer:**
It prints a tuple of the form `(number_of_taglines, number_of_unique_words)`.
Here, the output `(7033, 6614)` means there are **7033 taglines** and **6614 unique words** in the vocabulary built from them.




You now know how to generate a bag of words representation for a given corpus of documents. Notice that the word vectors created have more than 6600 dimensions. However, most of these dimensions have a value of zero since most words do not occur in a particular tagline.
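
The sparsity mentioned above can be measured directly from the sparse matrix. The following is a minimal sketch on a hypothetical three-line stand-in for the tagline corpus (`sample` and `m` are illustrative names); the same two lines of arithmetic would apply to `bow_matrix`:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical stand-in for the tagline corpus
sample = ["roll the dice and unleash the excitement",
          "a los angeles crime saga",
          "still yelling still fighting"]

m = CountVectorizer().fit_transform(sample)

# nnz is the number of stored (non-zero) entries in the sparse matrix
n_cells = m.shape[0] * m.shape[1]
fraction_zero = 1 - m.nnz / n_cells

print(m.shape)                  # (3, 13)
print(m.nnz)                    # 13
print(round(fraction_zero, 2))  # 0.67
```

Even in this tiny corpus two thirds of the cells are zero, which is why `CountVectorizer` returns a sparse matrix rather than a dense array.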

In [None]:
nlp = spacy.load('en_core_web_sm')
stopwords = spacy.lang.en.stop_words.STOP_WORDS

In [None]:
lem_corpus = corpus.apply(lambda row: ' '.join([t.lemma_ for t in nlp(row)
                                                if t.lemma_ not in stopwords
                                                and t.lemma_.isalpha()]))

In [None]:
lem_corpus

Unnamed: 0,tagline
1,roll dice unleash excitement
2,yell fight ready love
3,friend people let let forget
4,world normal surprise life
5,los angeles crime saga
...,...
9091,kingsglaive final fantasy xv
9093,happen vegas stay vegas happen
9095,decorate officer devoted family man defend hon...
9097,god incarnate city doom


In [None]:
# Create CountVectorizer object
vectorizer = CountVectorizer()

# Generate matrix of word vectors
bow_lem_matrix = vectorizer.fit_transform(lem_corpus)

# Print the shape of bow_lem_matrix
print(bow_lem_matrix.shape)

(7033, 4946)




### **1. What does `spacy.load('en_core_web_sm')` do?**

**Answer:**
It loads **spaCy’s small English language model**, which provides linguistic features such as tokenization, lemmatization, and part-of-speech tagging. This model is required before processing text with `nlp(row)`.

---

### **2. What is the purpose of `spacy.lang.en.stop_words.STOP_WORDS`?**

**Answer:**
This loads a set of **English stopwords** (common words like *“the”, “is”, “and”*). Removing them helps focus on more meaningful words during text analysis, improving the quality of the bag-of-words representation.

---

### **3. What does the lambda function inside `apply()` do?**

**Answer:**
The lambda function processes each tagline (row) as follows:

* Runs `nlp(row)` to tokenize and lemmatize it.
* Keeps only words that are:

  * Not stopwords
  * Alphabetic (`isalpha()` removes punctuation/numbers)
* Joins the cleaned lemmas back into a single string.

This results in a **lemmatized and cleaned corpus** ready for vectorization.

---

### **4. How is `bow_lem_matrix` different from `bow_matrix` in the previous example?**

**Answer:**

* **`bow_matrix`** used raw text (original taglines).
* **`bow_lem_matrix`** uses **lemmatized and stopword-removed text**, making the matrix smaller (4946 columns instead of 6614) and more semantically meaningful, since similar words (like *“runs”, “running”, “ran”*) are reduced to a single form (*“run”*).

---

### **5. What does `bow_lem_matrix.shape` represent?**

**Answer:**
It prints a tuple `(number_of_taglines, number_of_unique_lemmatized_words)`, showing how many taglines and unique cleaned words are present in the processed corpus.




### Mapping feature indices with feature names
In the lesson video, we saw that the vocabulary `CountVectorizer` builds is not necessarily listed in alphabetical order. In this exercise, we will learn to map each feature index to its corresponding feature name from the vocabulary.

In [None]:
sentences = ['The lion is the king of the jungle',
             'Lions have lifespans of a decade',
             'The lion is an endangered species']

In [None]:
# Create CountVectorizer object
vectorizer = CountVectorizer()

# Generate matrix of word vectors
bow_matrix = vectorizer.fit_transform(sentences)

# Convert bow_matrix into a DataFrame
bow_df = pd.DataFrame(bow_matrix.toarray())

# Map the column names to vocabulary
bow_df.columns = vectorizer.get_feature_names_out()

# Print bow_df
bow_df

Unnamed: 0,an,decade,endangered,have,is,jungle,king,lifespans,lion,lions,of,species,the
0,0,0,0,0,1,1,1,0,1,0,1,0,3
1,0,1,0,1,0,0,0,1,0,1,1,0,0
2,1,0,1,0,1,0,0,0,1,0,0,1,1


Observe that the column names refer to the token whose frequency is being recorded. Since the first column name is `an`, the first feature represents the number of times the word `'an'` occurs in a particular sentence. `get_feature_names_out()` returns an array that maps each feature index to its feature name in the vocabulary.



### **1. What is happening in the `sentences` list?**

**Answer:**
The list `sentences` contains three short text strings that will be used as a **sample corpus**. Each sentence represents a document that the Bag-of-Words model will convert into numerical form.

---

### **2. What does `CountVectorizer()` do in this code?**

**Answer:**
`CountVectorizer` from `scikit-learn` transforms text into a **Bag-of-Words matrix**, where each column corresponds to a unique word and each row corresponds to a sentence.
The cell values indicate **how many times** each word appears in a given sentence.

---

### **3. Why use `vectorizer.get_feature_names_out()` instead of `get_feature_names()`?**

**Answer:**
`get_feature_names()` was **deprecated** in scikit-learn 1.0 and removed in version 1.2.
The current method is `get_feature_names_out()`, which returns the feature (word) names as an array in the same order as the matrix columns.

---

### **4. What does `bow_matrix.toarray()` and `pd.DataFrame()` achieve?**

**Answer:**
`bow_matrix` is a **sparse matrix** (for memory efficiency).

* `.toarray()` converts it into a dense NumPy array.
* `pd.DataFrame()` converts it into a DataFrame for easier viewing and analysis, showing each word as a column and each sentence as a row.

---

### **5. What kind of output does `bow_df` display?**

**Answer:**
The printed DataFrame (`bow_df`) shows the Bag-of-Words table. Note that the single-letter word "a" is absent, because the default tokenizer only keeps tokens of two or more characters:

|   | an | decade | endangered | have | is | jungle | king | lifespans | lion | lions | of | species | the |
| - | -- | ------ | ---------- | ---- | -- | ------ | ---- | --------- | ---- | ----- | -- | ------- | --- |
| 0 | 0  | 0      | 0          | 0    | 1  | 1      | 1    | 0         | 1    | 0     | 1  | 0       | 3   |
| 1 | 0  | 1      | 0          | 1    | 0  | 0      | 0    | 1         | 0    | 1     | 1  | 0       | 0   |
| 2 | 1  | 0      | 1          | 0    | 1  | 0      | 0    | 0         | 1    | 0     | 0  | 1       | 1   |

Each cell shows how many times the word appears in that sentence.



## Building a BoW Naive Bayes classifier
- Steps
    1. Text preprocessing
    2. Building a bag-of-words model (or representation)
    3. Machine Learning

### BoW vectors for movie reviews
In this exercise, you have been given two pandas Series, `X_train` and `X_test`, which consist of movie reviews. They represent the training and the test review data respectively. Your task is to preprocess the reviews and generate BoW vectors for these two sets using `CountVectorizer`.

Once we have generated the BoW vector matrices `X_train_bow` and `X_test_bow`, we will be in a very good position to apply a machine learning model to it and conduct sentiment analysis.

In [None]:
movie_reviews = pd.read_csv('movie_reviews_clean.csv')
movie_reviews.head()

Unnamed: 0,review,sentiment
0,this anime series starts out great interesting...,0
1,some may go for a film like this but i most as...,0
2,i ve seen this piece of perfection during the ...,1
3,this movie is likely the worst movie i ve ever...,0
4,it ll soon be 10 yrs since this movie was rele...,1


In [None]:
X = movie_reviews['review']
y = movie_reviews['sentiment']

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)

In [None]:
# Create a CountVectorizer object
vectorizer = CountVectorizer(lowercase=True, stop_words='english')

# fit and transform X_train
X_train_bow = vectorizer.fit_transform(X_train)

# Transform X_test
X_test_bow = vectorizer.transform(X_test)

# Print shape of X_train_bow and X_test_bow
print(X_train_bow.shape)
print(X_test_bow.shape)

(750, 15068)
(250, 15068)




### **1. What is the purpose of splitting `X` and `y` using `train_test_split()`?**

**Answer:**
`train_test_split()` divides the dataset into:

* **Training set (`X_train`, `y_train`)** → used to train the model
* **Testing set (`X_test`, `y_test`)** → used to evaluate performance on unseen data

`test_size=0.25` means 25% of the data is reserved for testing and 75% for training.

---

### **2. What does the `CountVectorizer(lowercase=True, stop_words='english')` do?**

**Answer:**
It converts text into numerical form (Bag-of-Words) with two preprocessing steps:

* `lowercase=True`: converts all words to lowercase for uniformity
* `stop_words='english'`: removes common English words (like “the”, “is”, “an”) that don’t add meaning to sentiment classification

---

### **3. Why do we use `fit_transform()` on `X_train` but only `transform()` on `X_test`?**

**Answer:**

* `fit_transform(X_train)` learns the **vocabulary** from the training data and transforms it into word counts.
* `transform(X_test)` uses the **same learned vocabulary** to convert test data, ensuring consistency between training and testing features.
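
The difference between the two calls shows up as soon as the test data contains a word the training data never had. A minimal sketch with hypothetical documents (`train_docs`, `test_docs`, and `v` are illustrative names):

```python
from sklearn.feature_extraction.text import CountVectorizer

train_docs = ["great acting", "terrible plot"]
test_docs = ["great soundtrack"]   # 'soundtrack' never appears in training

v = CountVectorizer()
train_bow = v.fit_transform(train_docs)   # learns: acting, great, plot, terrible
test_bow = v.transform(test_docs)         # reuses that vocabulary unchanged

print(train_bow.shape, test_bow.shape)    # (2, 4) (1, 4)
print(test_bow.toarray().tolist())        # [[0, 1, 0, 0]]
```

The unseen word "soundtrack" is silently dropped, so both matrices keep the same four columns. Calling `fit_transform` on the test set instead would build a different vocabulary and break the column alignment the classifier relies on.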

---

### **4. What do `X_train_bow.shape` and `X_test_bow.shape` tell us?**

**Answer:**
They show the dimensions of the Bag-of-Words matrices:

* Rows = number of samples (reviews)
* Columns = number of unique words in the vocabulary (features)

In this notebook the output is:

```
(750, 15068)
(250, 15068)
```

This means there are 750 training reviews and 250 testing reviews, both represented over the same 15,068 vocabulary features.

---

### **5. Why use Bag-of-Words before building a sentiment analysis model?**

**Answer:**
The Bag-of-Words (BoW) model converts raw text into numerical vectors that machine learning models (like Naive Bayes, Logistic Regression, or SVM) can process to **learn patterns** associated with positive or negative sentiments.




You now have a good idea of how to preprocess text and transform it into its bag-of-words representation using `CountVectorizer`. In this exercise, you set the `lowercase` argument to `True`. However, note that this is the default value of `lowercase`, so passing it explicitly is not necessary. Also, note that both `X_train_bow` and `X_test_bow` have 15,068 features. There were words present in `X_test` that were not in `X_train`. `CountVectorizer` ignores them so that the dimensions of both sets remain the same.

### Predicting the sentiment of a movie review
In the previous exercise, you generated the bag-of-words representations for the training and test movie review data. In this exercise, we will use these representations to train a Naive Bayes classifier that can detect the sentiment of a movie review and compute its accuracy. Note that since this is a binary classification problem, the model is only capable of classifying a review as either positive (1) or negative (0). It is incapable of detecting neutral reviews.

In [None]:
from sklearn.naive_bayes import MultinomialNB

# Create a MultinomialNB object
clf = MultinomialNB()

# Fit the classifier
clf.fit(X_train_bow, y_train)

# Measure the accuracy
accuracy = clf.score(X_test_bow, y_test)
print("The accuracy of the classifier on the test set is %.3f" % accuracy)

# Predict the sentiment of a negative review
review = 'The movie was terrible. The music was underwhelming and the acting mediocre.'
prediction = clf.predict(vectorizer.transform([review]))[0]
print("The sentiment predicted by the classifier is %i" % (prediction))

The accuracy of the classifier on the test set is 0.828
The sentiment predicted by the classifier is 0




### **1. Why do we use `MultinomialNB` for text classification?**

**Answer:**
`MultinomialNB` is ideal for **text data** represented as **word counts (Bag-of-Words)**.
It assumes features (word frequencies) follow a multinomial distribution, making it effective for tasks like **spam detection** or **sentiment analysis**, where input data are discrete word counts.

---

### **2. What happens in `clf.fit(X_train_bow, y_train)`?**

**Answer:**
This trains the Naive Bayes classifier using the training Bag-of-Words features (`X_train_bow`) and their corresponding sentiment labels (`y_train`).
During training, the model learns **probabilities of words appearing in positive vs. negative reviews**.

---

### **3. How does `clf.score(X_test_bow, y_test)` measure accuracy?**

**Answer:**
It evaluates the classifier on unseen test data by comparing predicted labels against true labels.
It returns a **fraction (between 0 and 1)** representing how many reviews were correctly classified.
In this notebook, the output

```
The accuracy of the classifier on the test set is 0.828
```

means 82.8% of test reviews were correctly classified.

---

### **4. How does the classifier predict sentiment for a new review?**

**Answer:**
The review is first transformed into a Bag-of-Words vector using the same `vectorizer`:

```python
vectorizer.transform([review])
```

Then `clf.predict()` uses learned word probabilities to assign a label (e.g., `1` for positive, `0` for negative).

---

### **5. What does the printed sentiment value (`%i`) mean?**

**Answer:**
The predicted sentiment (`prediction`) is usually:

* **1 → Positive review**
* **0 → Negative review**

So if the output is:

```
The sentiment predicted by the classifier is 0
```

it means the model considers the review negative.



You have successfully performed basic sentiment analysis. Note that the accuracy of the classifier is about 83% in the run shown above. Considering that it was trained on only 750 reviews, this is reasonably good performance. The classifier also correctly predicts the sentiment of the short negative review we passed into it.

## Building n-gram models
- BoW shortcomings
    - Example
        - `The movie was good and not boring` -> positive
        - `The movie was not good and boring` -> negative
    - Exactly the same BoW representation!
    - Context of the words is lost.
    - Sentiment dependent on the position of `not`
- n-grams
    - Contiguous sequence of n elements (or words) in a given document.
    - Bi-grams / Tri-grams
- n-grams Shortcomings
    - Increase the number of dimensions, leading to the curse of dimensionality
    - Higher order n-grams are rare
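
The two example sentences from the list above can be checked directly. This sketch compares their unigram vectors with their unigram-plus-bigram vectors (variable names are illustrative):

```python
from sklearn.feature_extraction.text import CountVectorizer

pair = ["The movie was good and not boring",
        "The movie was not good and boring"]

# Unigram counts: both sentences contain exactly the same words
uni = CountVectorizer(ngram_range=(1, 1)).fit_transform(pair).toarray()
same_unigram_vectors = bool((uni[0] == uni[1]).all())

# Unigrams + bigrams: word order now shows up in the features
bi_vectorizer = CountVectorizer(ngram_range=(1, 2))
bi = bi_vectorizer.fit_transform(pair).toarray()
same_bigram_vectors = bool((bi[0] == bi[1]).all())

print(same_unigram_vectors)                     # True
print(same_bigram_vectors)                      # False
print("not good" in bi_vectorizer.vocabulary_)  # True
```

With unigrams alone, a classifier receives identical inputs for a positive and a negative sentence. Adding bigrams introduces features such as "not good" and "not boring" that separate the two.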

### n-gram models for movie tag lines
In this exercise, we have been provided with a corpus of more than 7000 movie tag lines. Our job is to generate n-gram models up to n equal to 1, n equal to 2 and n equal to 3 for this data and discover the number of features for each model.

We will then compare the number of features generated for each model.

In [None]:
# Generate n-grams up to n=1
vectorizer_ng1 = CountVectorizer(ngram_range=(1, 1))
ng1 = vectorizer_ng1.fit_transform(corpus)

# Generate n-grams up to n=2
vectorizer_ng2 = CountVectorizer(ngram_range=(1, 2))
ng2 = vectorizer_ng2.fit_transform(corpus)

# Generate n-grams up to n=3
vectorizer_ng3 = CountVectorizer(ngram_range=(1, 3))
ng3 = vectorizer_ng3.fit_transform(corpus)

# Print the number of features for each model
print("ng1, ng2 and ng3 have %i, %i and %i features respectively" %
      (ng1.shape[1], ng2.shape[1], ng3.shape[1]))

ng1, ng2 and ng3 have 6614, 37100 and 76881 features respectively



### **1. What does `ngram_range=(1, 1)`, `(1, 2)`, and `(1, 3)` mean?**

**Answer:**
These specify the **range of n-grams** to extract:

* `(1, 1)` → only **unigrams** (single words)
* `(1, 2)` → **unigrams + bigrams** (single words and 2-word phrases)
* `(1, 3)` → **unigrams + bigrams + trigrams** (up to 3-word phrases)

This allows the model to capture short word sequences that may carry more context or meaning.

---

### **2. What are n-grams in natural language processing?**

**Answer:**
An **n-gram** is a sequence of *n* consecutive words from text.
Example (for the sentence *"The lion roars loudly"*):

* Unigrams: `["The", "lion", "roars", "loudly"]`
* Bigrams: `["The lion", "lion roars", "roars loudly"]`
* Trigrams: `["The lion roars", "lion roars loudly"]`
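
The lists above can be reproduced with a few lines of plain Python. This is a minimal sketch using whitespace tokenization (`word_ngrams` is a hypothetical helper, not a library function); `CountVectorizer` additionally lowercases the text and drops single-character tokens by default, so its vocabulary would differ slightly:

```python
def word_ngrams(text, n):
    """Return the contiguous n-word sequences in text (whitespace tokenization)."""
    tokens = text.split()
    # Slide a window of n tokens across the sentence
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

sentence = "The lion roars loudly"
print(word_ngrams(sentence, 1))  # ['The', 'lion', 'roars', 'loudly']
print(word_ngrams(sentence, 2))  # ['The lion', 'lion roars', 'roars loudly']
print(word_ngrams(sentence, 3))  # ['The lion roars', 'lion roars loudly']
```

A sentence of k words yields k - n + 1 n-grams, so longer n-grams are fewer per sentence but far more varied across a corpus.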

---

### **3. Why does the number of features increase when n-gram range increases?**

**Answer:**
Higher-order n-grams add a feature for every distinct word sequence that occurs in the corpus, on top of the single words.
So, `ng2` and `ng3` contain many more columns than `ng1`, since they include multi-word expressions (e.g., “good movie”, “very good movie”), expanding the vocabulary size.

---

### **4. What does the line printing feature counts show?**

**Answer:**

```python
print("ng1, ng2 and ng3 have %i, %i and %i features respectively" %
      (ng1.shape[1], ng2.shape[1], ng3.shape[1]))
```

It prints how many **unique features (n-grams)** each CountVectorizer generated.
In this notebook, the output

```
ng1, ng2 and ng3 have 6614, 37100 and 76881 features respectively
```

shows that higher n-gram ranges drastically increase the dimensionality.

---

### **5. When should we use higher n-grams (like 2 or 3)?**

**Answer:**
Use **bigrams/trigrams** when:

* Context or word order affects meaning (e.g., *“not good”*, *“very bad movie”*).

Avoid them when the dataset is small, since more n-grams increase sparsity and may lead to **overfitting** or slower computation.




You now know how to generate n-gram models containing higher order n-grams. Notice that `ng2` has over 37,000 features whereas `ng3` has over 76,000 features. This is much greater than the roughly 6,600 dimensions obtained for `ng1`. As the n-gram range increases, so does the number of features, leading to increased computational costs and a problem known as the curse of dimensionality.

### Higher order n-grams for sentiment analysis
Similar to a previous exercise, we are going to build a classifier that can detect if the review of a particular movie is positive or negative. However, this time, we will use n-grams up to n=2 for the task.

In [None]:
ng_vectorizer = CountVectorizer(ngram_range=(1, 2))
X_train_ng = ng_vectorizer.fit_transform(X_train)
X_test_ng = ng_vectorizer.transform(X_test)

In [None]:
# Define an instance of MultinomialNB
clf_ng = MultinomialNB()

# Fit the classifier
clf_ng.fit(X_train_ng, y_train)

# Measure the accuracy
accuracy = clf_ng.score(X_test_ng, y_test)
print("The accuracy of the classifier on the test set is %.3f" % accuracy)

# Predict the sentiment of a negative review
review = 'The movie was not good. The plot had several holes and the acting lacked panache'
prediction = clf_ng.predict(ng_vectorizer.transform([review]))[0]
print("The sentiment predicted by the classifier is %i" % (prediction))

The accuracy of the classifier on the test set is 0.836
The sentiment predicted by the classifier is 0




### **1. Why is `ngram_range=(1, 2)` used in `CountVectorizer`?**

**Answer:**
Setting `ngram_range=(1, 2)` makes the vectorizer capture both **unigrams (single words)** and **bigrams (two-word combinations)**.
This helps the model understand short phrases like *“not good”* or *“very bad”*, which convey stronger sentiment than single words alone.

---

### **2. How does using n-grams improve model performance?**

**Answer:**
N-grams capture **context and word relationships**, allowing the model to better interpret phrases that change meaning based on combination (e.g., “not great” vs. “great”).
This typically increases accuracy compared to using only unigrams, though it also increases the number of features.

---

### **3. What happens when `fit_transform()` and `transform()` are used here?**

**Answer:**

* `fit_transform(X_train)` → learns the vocabulary (including unigrams + bigrams) **and** transforms the training data.
* `transform(X_test)` → transforms the test data using the **same vocabulary**, ensuring feature consistency between training and testing.

---

### **4. What does the printed accuracy represent?**

**Answer:**
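The accuracy value (here `0.836`) is the **fraction of test reviews whose sentiment was predicted correctly**, i.e., 83.6%.
Higher accuracy compared to the unigram model usually means n-grams helped capture more nuanced sentiment cues. Note, however, that the earlier unigram vectorizer also used `stop_words='english'` while this one does not, so the two runs differ in more than the n-gram range.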

---

### **5. How does the classifier handle a new review prediction?**

**Answer:**
The new review is converted into its n-gram feature representation using:

```python
ng_vectorizer.transform([review])
```

Then `clf_ng.predict()` outputs the sentiment label:

* **1 → Positive review**
* **0 → Negative review**

Example output:

```
The sentiment predicted by the classifier is 0
```

indicates that the model found the review negative.




Notice how this classifier performs slightly better than the unigram BoW version (0.836 versus 0.828 in the runs shown above). Also, it succeeds at correctly identifying the sentiment of the mini-review as negative.

### Comparing performance of n-gram models
You now know how to conduct sentiment analysis by converting text into various n-gram representations and feeding them to a classifier. In this exercise, we will conduct sentiment analysis for the same movie reviews from before using two n-gram models: unigrams and n-grams up to n equal to 3.

We will then compare the performance using three criteria: accuracy of the model on the test set, time taken to execute the program and the number of features created when generating the n-gram representation.

In [None]:
import time

start_time = time.time()

# Splitting the data into training and test sets
train_X, test_X, train_y, test_y = train_test_split(movie_reviews['review'],
                                                    movie_reviews['sentiment'],
                                                    test_size=0.5,
                                                    random_state=42,
                                                    stratify=movie_reviews['sentiment'])

# Generating n-grams
vectorizer = CountVectorizer(ngram_range=(1,1))
train_X = vectorizer.fit_transform(train_X)
test_X = vectorizer.transform(test_X)

# Fit classifier
clf = MultinomialNB()
clf.fit(train_X, train_y)

# Print the accuracy, time and number of dimensions
print("The program took %.3f seconds to complete. The accuracy on the test set is %.2f. " %
      (time.time() - start_time, clf.score(test_X, test_y)))
print("The ngram representation had %i features." % (train_X.shape[1]))

The program took 0.368 seconds to complete. The accuracy on the test set is 0.75. 
The ngram representation had 12347 features.


In [None]:
start_time = time.time()

# Splitting the data into training and test sets
train_X, test_X, train_y, test_y = train_test_split(movie_reviews['review'],
                                                    movie_reviews['sentiment'],
                                                    test_size=0.5,
                                                    random_state=42,
                                                    stratify=movie_reviews['sentiment'])

# Generating n-grams
vectorizer = CountVectorizer(ngram_range=(1,3))
train_X = vectorizer.fit_transform(train_X)
test_X = vectorizer.transform(test_X)

# Fit classifier
clf = MultinomialNB()
clf.fit(train_X, train_y)

# Print the accuracy, time and number of dimensions
print("The program took %.3f seconds to complete. The accuracy on the test set is %.2f. " %
      (time.time() - start_time, clf.score(test_X, test_y)))
print("The ngram representation had %i features." % (train_X.shape[1]))

The program took 1.675 seconds to complete. The accuracy on the test set is 0.77. 
The ngram representation had 178240 features.




### **1. Why are we measuring both accuracy and time?**

**Answer:**
Measuring **accuracy** shows how well the model predicts sentiments, while measuring **time** helps evaluate how computationally expensive it is.
This comparison helps determine the trade-off between **model performance** (accuracy) and **efficiency** (speed).

---

### **2. Why does the code use two CountVectorizers, one with `(1,1)` and one with `(1,3)`?**

**Answer:**

* `(1,1)` generates **unigrams** (single words).
* `(1,3)` generates **unigrams, bigrams, and trigrams** (phrases up to 3 words).

The goal is to see how including longer word sequences affects the classifier’s **accuracy and runtime**.

---

### **3. Why does the program take longer with `(1,3)` n-grams?**

**Answer:**
Adding bigrams and trigrams drastically increases the **number of features** (unique word combinations).
This leads to:

* Larger matrices
* More computations during training and prediction

Hence, **higher runtime and memory usage**, even though it might improve accuracy slightly.

---

### **4. What does `train_X.shape[1]` tell us?**

**Answer:**
It gives the **number of features (columns)** in the n-gram representation, i.e., the total number of unique tokens or token sequences.
For the unigram run above:

```
The ngram representation had 12347 features.
```

means there were 12,347 distinct unigrams in the training data. With `ngram_range=(1,3)`, the count rises to 178,240 distinct unigrams, bigrams, and trigrams.

---

### **5. Why is `stratify=movie_reviews['sentiment']` used in `train_test_split()`?**

**Answer:**
`stratify` ensures that both training and test sets maintain the **same proportion of positive and negative reviews** as the original dataset.
This helps avoid biased splits that could distort model accuracy or evaluation.
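
The effect of `stratify` is easiest to see on imbalanced labels. A minimal sketch with a hypothetical dataset of 80 negative and 20 positive reviews (the `toy` DataFrame and the `tr_*`/`te_*` names are illustrative):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 80 negative (0), 20 positive (1)
toy = pd.DataFrame({"review": [f"review {i}" for i in range(100)],
                    "sentiment": [0] * 80 + [1] * 20})

tr_X, te_X, tr_y, te_y = train_test_split(
    toy["review"], toy["sentiment"],
    test_size=0.5, random_state=42, stratify=toy["sentiment"])

# The mean of a 0/1 label column is the share of positive reviews
print(tr_y.mean(), te_y.mean())  # 0.2 0.2
```

Both halves keep the original 20% share of positive reviews. Without `stratify`, the shares would vary from split to split, which can make accuracy figures harder to compare.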



The program took about 0.37 seconds for the unigram model and about 1.7 seconds, roughly 4.5 times longer, for the higher order n-gram model. The unigram model had over 12,000 features whereas the n-gram model for up to n=3 had over 178,000! Despite taking more computation time and generating more features, the classifier performs only marginally better in the latter case, producing an accuracy of 77% in comparison to the 75% for the unigram model.
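
The same comparison can be wrapped in a small helper so that several n-gram ranges are measured in one loop. This is a sketch under stated assumptions: `count_features` and `toy_docs` are hypothetical names, the two-sentence corpus only stands in for the review data, and the timings will vary by machine, so only the feature counts are deterministic:

```python
import time
from sklearn.feature_extraction.text import CountVectorizer

def count_features(docs, max_n):
    """Vectorize docs with n-grams up to max_n; return (n_features, seconds)."""
    start = time.perf_counter()
    matrix = CountVectorizer(ngram_range=(1, max_n)).fit_transform(docs)
    return matrix.shape[1], time.perf_counter() - start

toy_docs = ["the movie was good and not boring",
            "the movie was not good and boring"]

results = {n: count_features(toy_docs, n) for n in (1, 2, 3)}
for n, (n_features, seconds) in results.items():
    print(f"up to {n}-grams: {n_features} features, {seconds:.4f} s")
```

On this toy corpus the feature count grows from 7 to 16 to 25. `time.perf_counter()` is used here instead of `time.time()` because it is intended for measuring short durations.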