### Bag of words: Exercises


- In this exercise, you will classify whether a given movie review is **positive or negative**.
- You will use bag of words to pre-process the text and then apply different classification algorithms.
- Sklearn's CountVectorizer provides a built-in implementation of bag of words.

In [1]:
#Import necessary libraries

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report

### **About Data: IMDB Dataset**

Credits: https://www.kaggle.com/datasets/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews?resource=download


- This data consists of two columns:
    - review
    - sentiment
- Each review is the text a user wrote after watching the movie.
- The sentiment column tells whether the given review is positive or negative.

In [2]:
#1. read 'movies_sentiment_data.csv' from the ./datasets directory and store it in the df variable
df = pd.read_csv("./datasets/movies_sentiment_data.csv")

#2. print the shape of the data
print(df.shape)

#3. print top 5 datapoints
df.head()

(50000, 2)


| | review | sentiment |
|---|---|---|
| 0 | One of the other reviewers has mentioned that ... | positive |
| 1 | A wonderful little production. <br /><br />The... | positive |
| 2 | I thought this was a wonderful way to spend ti... | positive |
| 3 | Basically there's a family where a little boy ... | negative |
| 4 | Petter Mattei's "Love in the Time of Money" is... | positive |


In [3]:
#create a new column 'category' that is 1 if the sentiment is positive and 0 if it is negative
df['category'] = df['sentiment'].apply(lambda x: 1 if x == 'positive' else 0)
df.head()

| | review | sentiment | category |
|---|---|---|---|
| 0 | One of the other reviewers has mentioned that ... | positive | 1 |
| 1 | A wonderful little production. <br /><br />The... | positive | 1 |
| 2 | I thought this was a wonderful way to spend ti... | positive | 1 |
| 3 | Basically there's a family where a little boy ... | negative | 0 |
| 4 | Petter Mattei's "Love in the Time of Money" is... | positive | 1 |


In [9]:
#check the distribution of 'category' to see whether the target labels are balanced
df['category'].value_counts()

category
1    25000
0    25000
Name: count, dtype: int64

In [10]:
X = df['review']
y = df['category']

In [11]:
#split into train and test sets, holding out 20% of the data for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

**Exercise-1**

1. Using the sklearn pipeline module, create a classification pipeline that classifies movie reviews as positive or negative.

**Note:**
- Use CountVectorizer for pre-processing the text.
- Use **Random Forest** as the classifier with n_estimators=50 and criterion='entropy'.
- Print the classification report.

**References**:

- https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html

- https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html

In [12]:
#1. create a pipeline object
rf = Pipeline([
    ('vect', CountVectorizer()), 
    ('clf', RandomForestClassifier(n_estimators=50, criterion='entropy'))
    ])

#2. fit with X_train and y_train
rf.fit(X_train, y_train)

#3. get the predictions for X_test and store it in y_pred
y_pred = rf.predict(X_test)

#4. print the classification report
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.84      0.84      0.84      4961
           1       0.85      0.84      0.84      5039

    accuracy                           0.84     10000
   macro avg       0.84      0.84      0.84     10000
weighted avg       0.84      0.84      0.84     10000
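
To make the role of the pipeline concrete, here is a small sketch on made-up data showing that a Pipeline gives the same predictions as calling the vectorizer and the classifier by hand. The texts and labels are hypothetical, and MultinomialNB is used only because it is deterministic, which keeps the comparison exact.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy data (not from the IMDB dataset)
texts = ["great film", "terrible film", "great acting", "terrible plot"]
labels = [1, 0, 1, 0]
new_texts = ["great plot", "terrible acting"]

# Pipeline: fit() fits the vectorizer, transforms the text, then fits the classifier
pipe = Pipeline([("vect", CountVectorizer()), ("clf", MultinomialNB())])
pipe.fit(texts, labels)
pipe_pred = pipe.predict(new_texts)

# The same steps written out manually
vect = CountVectorizer()
clf = MultinomialNB()
clf.fit(vect.fit_transform(texts), labels)
# Note: transform(), not fit_transform(), so new text reuses the training vocabulary
manual_pred = clf.predict(vect.transform(new_texts))

print(np.array_equal(pipe_pred, manual_pred))  # True
```

The main practical benefit is that the vocabulary is learned from the training text only and reused at prediction time, so the test reviews cannot leak into the features.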



**Exercise-2**

1. Using the sklearn pipeline module, create a classification pipeline that classifies movie reviews as positive or negative.

**Note:**
- Use CountVectorizer for pre-processing the text.
- Use **KNN** as the classifier with n_neighbors=10 and metric='euclidean'.
- Print the classification report.

**References**:

- https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html
- https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html



In [13]:
#1. create a pipeline object
knn = Pipeline([
    ('vect', CountVectorizer()), 
    ('clf', KNeighborsClassifier(n_neighbors=10, metric='euclidean'))
    ])

#2. fit with X_train and y_train
knn.fit(X_train, y_train)

#3. get the predictions for X_test and store it in y_pred
y_pred = knn.predict(X_test)

#4. print the classification report
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.66      0.65      0.66      4961
           1       0.66      0.66      0.66      5039

    accuracy                           0.66     10000
   macro avg       0.66      0.66      0.66     10000
weighted avg       0.66      0.66      0.66     10000



**Exercise-3**

1. Using the sklearn pipeline module, create a classification pipeline that classifies movie reviews as positive or negative.

**Note:**
- Use CountVectorizer for pre-processing the text.
- Use **Multinomial Naive Bayes** as the classifier.
- Print the classification report.

**References**:

- https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html
- https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html



In [14]:
#1. create a pipeline object
mnb = Pipeline([
    ('vect', CountVectorizer()), 
    ('clf', MultinomialNB())
    ])

#2. fit with X_train and y_train
mnb.fit(X_train, y_train)

#3. get the predictions for X_test and store it in y_pred
y_pred = mnb.predict(X_test)

#4. print the classification report
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.83      0.88      0.85      4961
           1       0.87      0.82      0.85      5039

    accuracy                           0.85     10000
   macro avg       0.85      0.85      0.85     10000
weighted avg       0.85      0.85      0.85     10000



### **Can you write some observations on why a model like KNN fails to produce good results, unlike RandomForest and MultinomialNB?**

1. **Curse of Dimensionality**: KNN struggles in high-dimensional spaces, and a bag-of-words or TF-IDF representation has one dimension per vocabulary word. As the number of dimensions increases, the distances between points become more alike, so the "nearest" neighbors carry less information about the class.

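2. **Sparse Data**: Text data is sparse: each review uses only a small fraction of the vocabulary, so most features (words) are zero. KNN relies on distance metrics, and with Euclidean distance on raw counts the result depends heavily on review length and on the few words two reviews happen to share.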

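3. **No Feature Importance**: KNN gives every word equal weight in the distance, so frequent but uninformative words count as much as sentiment-bearing ones. RandomForest chooses informative features at each split and MultinomialNB learns per-class word probabilities, so both can focus on the words that separate the classes.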

4. **Scalability**: KNN stores the entire training set and computes distances to it for each prediction, making it computationally expensive for large datasets such as the 40,000 training reviews here. This affects prediction cost rather than accuracy, but RandomForest and MultinomialNB are far more efficient at prediction time.

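5. **Coarse Probabilistic Output**: KNN's class probabilities are only the fraction of the k neighbors voting for each class (multiples of 1/10 with n_neighbors=10 and uniform weights), so they are coarse. MultinomialNB derives its probabilities from learned per-class word distributions, which can be inspected and used for decision-making.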

6. **Sensitivity to Noise**: KNN is sensitive to noisy data and outliers, which can significantly affect its performance. RandomForest is more robust due to its ensemble nature.
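
The distance-related points above can be illustrated with a small sketch. The three documents are made up, and the example only shows how Euclidean distance on raw counts behaves; it is not a measurement of why KNN scored 0.66 in Exercise-2. The cosine comparison is included as an assumption about what a length-insensitive metric would do, not something tested on the IMDB data here.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_distances, euclidean_distances

# Hypothetical documents (not from the IMDB dataset)
docs = [
    "good movie",                                   # short positive
    "good movie good movie good movie good movie",  # same words, four times longer
    "bad movie",                                    # short negative
]

X = CountVectorizer().fit_transform(docs)  # columns: bad, good, movie

eu = euclidean_distances(X)
cos = cosine_distances(X)

# Euclidean: the short positive review is closer to the short negative one
# than to the long positive one, because raw counts scale with length.
print(np.round(eu[0, 1], 3), np.round(eu[0, 2], 3))    # 4.243 1.414

# Cosine: length is ignored, so the two positive reviews are the closest pair.
print(np.round(cos[0, 1], 3), np.round(cos[0, 2], 3))  # 0.0 0.5
```

With metric='euclidean' on raw counts, a review's neighbors can therefore be decided by how long it is rather than by what it says.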