In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report

In [3]:
df = pd.read_csv('IMDB Dataset.csv')

In [5]:
df.head()

   review                                             sentiment
0  One of the other reviewers has mentioned that ...  positive
1  A wonderful little production. <br /><br />The...  positive
2  I thought this was a wonderful way to spend ti...  positive
3  Basically there's a family where a little boy ...  negative
4  Petter Mattei's "Love in the Time of Money" is...  positive


In [9]:
df.shape

(50000, 2)

In [11]:
df['category'] = df['sentiment'].apply(lambda x: 1 if x == 'positive' else 0)

In [16]:
X_train, X_test, y_train, y_test = train_test_split(df['review'], df['category'], test_size=0.20, random_state=42)

In [24]:
pipe = Pipeline([
    ('vectorizer', CountVectorizer()),
    ('classifier', RandomForestClassifier(n_estimators=50, criterion='entropy'))
])

In [26]:
pipe.fit(X_train, y_train)

In [28]:
y_pred = pipe.predict(X_test)

In [34]:
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.83      0.85      0.84      4961
           1       0.85      0.83      0.84      5039

    accuracy                           0.84     10000
   macro avg       0.84      0.84      0.84     10000
weighted avg       0.84      0.84      0.84     10000



In [36]:
pipe = Pipeline([
    ('vectorizer', CountVectorizer()),
    ('classifier', KNeighborsClassifier(n_neighbors=10, metric='euclidean'))
])

In [38]:
pipe.fit(X_train, y_train)

In [40]:
y_pred = pipe.predict(X_test)

In [41]:
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.66      0.65      0.66      4961
           1       0.66      0.67      0.66      5039

    accuracy                           0.66     10000
   macro avg       0.66      0.66      0.66     10000
weighted avg       0.66      0.66      0.66     10000



In [44]:
pipe = Pipeline([
    ('vectorizer', CountVectorizer()),
    ('classifier', MultinomialNB())
])

In [46]:
pipe.fit(X_train, y_train)

In [47]:
y_pred = pipe.predict(X_test)

In [50]:
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.83      0.88      0.85      4961
           1       0.87      0.82      0.85      5039

    accuracy                           0.85     10000
   macro avg       0.85      0.85      0.85     10000
weighted avg       0.85      0.85      0.85     10000



## Why didn't the KNN model perform as well as other models like Random Forest and MultinomialNB?


### 1. **Sensitivity to High-Dimensional Data:**
   - KNN is a distance-based algorithm, and distances become less informative as the number of features grows. Bag-of-Words and TF-IDF representations create one feature per vocabulary word, so every review becomes a very sparse, very high-dimensional vector. In such spaces the "curse of dimensionality" makes the nearest and farthest points almost equally distant, which weakens any classifier that relies on neighborhoods.
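
The shrinking gap between near and far points can be illustrated with a small simulation. This is only a sketch on uniform random data, not on the IMDB vectors (which are sparse counts and will behave somewhat differently); `relative_contrast` is a hypothetical helper introduced for this example.

```python
import numpy as np

def relative_contrast(n_points, n_dims, seed=0):
    """Return (farthest - nearest) / nearest distance from one random query."""
    rng = np.random.default_rng(seed)
    points = rng.random((n_points, n_dims))   # uniform random stand-ins for documents
    query = rng.random(n_dims)
    dists = np.linalg.norm(points - query, axis=1)
    return (dists.max() - dists.min()) / dists.min()

# The contrast shrinks as the dimension grows, so "nearest" carries less information.
for n_dims in (2, 100, 5000):
    print(n_dims, round(relative_contrast(500, n_dims), 3))
```

A contrast close to zero means the nearest neighbor is barely closer than the farthest one, which is the situation a Euclidean KNN faces on wide feature matrices.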

### 2. **Lack of Feature Discrimination:**
   - Unlike RandomForest, whose trees split on the most informative features and effectively ignore the rest, KNN gives every feature the same weight when computing distances. In text data not all words are equally important: with raw counts and Euclidean distance, frequent but uninformative words and sheer review length can dominate the distance, which degrades KNN's performance.
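
One common way to give KNN some notion of word importance is to swap the raw counts for TF-IDF weights and use cosine distance, which ignores document length. The sketch below uses a made-up six-review corpus (`toy_reviews` and `toy_labels` are assumptions, not the IMDB data) and was not run on the notebook's dataset, so it makes no claim about how much the score would change there.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus: 1 = positive, 0 = negative.
toy_reviews = [
    "a wonderful and moving film",
    "wonderful acting, great story",
    "great fun from start to finish",
    "a dull and boring film",
    "boring plot, terrible acting",
    "terrible waste of time",
]
toy_labels = [1, 1, 1, 0, 0, 0]

knn_tfidf = Pipeline([
    ('vectorizer', TfidfVectorizer()),   # down-weights words that appear everywhere
    ('classifier', KNeighborsClassifier(n_neighbors=3, metric='cosine')),
])
knn_tfidf.fit(toy_reviews, toy_labels)

toy_pred = knn_tfidf.predict(["wonderful story", "boring and terrible"])
print(toy_pred)
```

The same pipeline shape could be fitted on `X_train` and `y_train` from the notebook to check whether the change helps on the real data.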

### 3. **Model Complexity:**
   - KNN is a simple, instance-based algorithm that doesn't learn any parameters from the data. MultinomialNB, by contrast, is designed for discrete count features such as word counts and learns per-class word probabilities, from which it estimates the probability of each class given the features. RandomForest, for its part, benefits from ensemble learning and can capture complex patterns in the data.

### 4. **Scalability Issues:**
   - KNN requires storing the entire training set and comparing each new instance against the stored points at prediction time, which becomes computationally expensive with large datasets. RandomForest and MultinomialNB scale better at prediction time because they summarize the data in a more compact form (e.g., decision trees or probabilities).
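
The prediction cost is easy to see in a brute-force implementation. The function below is a minimal sketch, not scikit-learn's actual code path; `knn_predict` and the toy arrays are hypothetical names used only for illustration.

```python
import numpy as np

def knn_predict(train_X, train_y, query, k=3):
    """Brute-force KNN: one distance per stored training row, then a majority vote."""
    dists = np.linalg.norm(train_X - query, axis=1)  # cost grows with n_samples * n_features
    nearest = np.argsort(dists)[:k]                  # indices of the k closest rows
    votes = np.bincount(train_y[nearest])            # count labels among the neighbors
    return int(np.argmax(votes))

toy_X = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]], dtype=float)
toy_y = np.array([0, 0, 0, 1, 1, 1])

pred_near_origin = knn_predict(toy_X, toy_y, np.array([0.2, 0.1]))
pred_near_five = knn_predict(toy_X, toy_y, np.array([5.5, 5.2]))
print(pred_near_origin, pred_near_five)
```

Every call touches all of `train_X`, so with 40,000 training reviews each test review requires 40,000 distance computations over the full vocabulary.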

### 5. **Sensitivity to Noisy Data:**
   - KNN can be very sensitive to noise because, with the default uniform weighting, each of the k nearest neighbors gets an equal vote, however mislabeled or atypical it is. If the closest neighbors are noisy, the prediction is likely to be wrong. RandomForest dampens noise through ensemble voting across many trees, and MultinomialNB, which assumes independence between features, combines evidence from every word in a review, so a few noisy words rarely decide the outcome.
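
A tiny one-dimensional example shows how a single mislabeled point can flip a KNN prediction, and how a larger `n_neighbors` can outvote it. The data are invented for illustration; the point at 0.5 is deliberately given the wrong label.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Class 0 lives near 0, class 1 near 3; the point at 0.5 is deliberately mislabeled as 1.
toy_X = np.array([[0.0], [0.2], [0.4], [0.6], [0.5], [3.0], [3.2], [3.4]])
toy_y = np.array([0, 0, 0, 0, 1, 1, 1, 1])
query = np.array([[0.52]])

# With k=1 the noisy point is the only voter; with k=5 its clean neighbors outvote it.
pred_k1 = KNeighborsClassifier(n_neighbors=1).fit(toy_X, toy_y).predict(query)[0]
pred_k5 = KNeighborsClassifier(n_neighbors=5).fit(toy_X, toy_y).predict(query)[0]
print(pred_k1, pred_k5)
```

Raising `n_neighbors` trades noise sensitivity for smoother decision boundaries, so it is worth tuning rather than fixing at 10 as in the notebook.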