In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report

In [57]:
df = pd.read_csv('imdbdataset.csv')
df.head(10)

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive
5,"Probably my all-time favorite movie, a story o...",positive
6,I sure would like to see a resurrection of a u...,positive
7,"This show was an amazing, fresh & innovative i...",negative
8,Encouraged by the positive comments about this...,negative
9,If you like original gut wrenching laughter yo...,positive


In [86]:
df['review'][6]

"I sure would like to see a resurrection of a up dated Seahunt series with the tech they have today it would bring back the kid excitement in me.I grew up on black and white TV and Seahunt with Gunsmoke were my hero's every week.You have my vote for a comeback of a new sea hunt.We need a change of pace in TV and this would work for a world of under water adventure.Oh by the way thank you for an outlet like this to view many viewpoints about TV and the many movies.So any ole way I believe I've got what I wanna say.Would be nice to read some more plus points about sea hunt.If my rhymes would be 10 lines would you let me submit,or leave me out to be in doubt and have me to quit,If this is so then I must go so lets do it."

In [59]:
df['review'][8]

"Encouraged by the positive comments about this film on here I was looking forward to watching this film. Bad mistake. I've seen 950+ films and this is truly one of the worst of them - it's awful in almost every way: editing, pacing, storyline, 'acting,' soundtrack (the film's only song - a lame country tune - is played no less than four times). The film looks cheap and nasty and is boring in the extreme. Rarely have I been so happy to see the end credits of a film. <br /><br />The only thing that prevents me giving this a 1-score is Harvey Keitel - while this is far from his best performance he at least seems to be making a bit of an effort. One for Keitel obsessives only."

In [5]:
df.shape

(50000, 2)

In [6]:
# Label-encode the sentiment with a lambda: 1 for 'positive', 0 for anything else
df['category'] = df['sentiment'].apply(lambda x: 1 if x=='positive' else 0)
df.head()

Unnamed: 0,review,sentiment,category
0,One of the other reviewers has mentioned that ...,positive,1
1,A wonderful little production. <br /><br />The...,positive,1
2,I thought this was a wonderful way to spend ti...,positive,1
3,Basically there's a family where a little boy ...,negative,0
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive,1
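
The lambda above sends every value that is not exactly 'positive' to 0, including typos or missing entries. A minimal sketch of a stricter alternative, assuming the only valid labels are 'positive' and 'negative'; the `toy` frame and `label_map` are illustrative names, not part of the notebook:

```python
import pandas as pd

# Toy frame standing in for the real data; the rows are invented.
toy = pd.DataFrame({
    "review": ["great film", "terrible plot", "loved it"],
    "sentiment": ["positive", "negative", "positive"],
})

# Explicit mapping: any label missing from the dict becomes NaN instead of 0.
label_map = {"positive": 1, "negative": 0}
toy["category"] = toy["sentiment"].map(label_map)

# Fail early if an unexpected label slipped through.
assert toy["category"].notna().all()
toy["category"] = toy["category"].astype(int)
print(toy["category"].tolist())
```

With `map`, an unexpected label surfaces as a missing value that can be checked, rather than being counted silently as negative.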


In [7]:
df.drop('sentiment', axis=1, inplace=True)

In [17]:
df.shape

(50000, 2)

In [21]:
X = df['review']
y = df['category']

In [10]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
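
The split above samples rows at random, which is usually fine when the classes are close to balanced. A hedged sketch of the `stratify` option of `train_test_split`, shown on invented, deliberately imbalanced labels (`X_toy` and `y_toy` are hypothetical), in case the class ratio needs to be preserved exactly:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Invented, deliberately imbalanced labels: 80 zeros and 20 ones.
y_toy = pd.Series([0] * 80 + [1] * 20)
X_toy = pd.Series([f"review {i}" for i in range(100)])

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

# With stratify=y_toy, the 80/20 class ratio is kept in both splits.
print(y_tr.value_counts().to_dict())
print(y_te.value_counts().to_dict())
```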

In [22]:
# X is a 1-D Series of raw review text, so every shape here is (n_samples,).
# The 2-D (n_samples, n_features) matrix only appears after vectorization.
print(X_train.shape) # (n_samples_train,)
print(X_test.shape)  # (n_samples_test,)
print(y_train.shape) # (n_samples_train,)
print(y_test.shape)  # (n_samples_test,)

(40000,)
(10000,)
(40000,)
(10000,)
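
The shapes above are one-dimensional because the reviews are still raw strings. A small sketch of how `CountVectorizer` turns such text into a 2-D count matrix, using three invented documents rather than the IMDB data:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Three invented documents standing in for reviews.
docs = [
    "a wonderful little production",
    "boring and cheap",
    "wonderful and fresh",
]

cv = CountVectorizer()
X_counts = cv.fit_transform(docs)  # sparse matrix of term counts

# One row per document, one column per vocabulary term.
print(X_counts.shape)
print(sorted(cv.vocabulary_))
```

The default tokenizer keeps only tokens of two or more characters, so the single-letter word "a" does not get a column.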


In [26]:
# Build an sklearn pipeline that vectorizes the text and then fits a random forest
clf = Pipeline([
    ('cv', CountVectorizer()),
    ('rfc', RandomForestClassifier(criterion='entropy', n_estimators=50))
])

In [27]:
clf.fit(X_train, y_train)

In [28]:
y_pred = clf.predict(X_test)

In [29]:
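# Note: sklearn's signature is classification_report(y_true, y_pred). With
# y_pred passed first, the precision and recall columns below are swapped and
# "support" counts predicted labels rather than true ones.
print(classification_report(y_pred, y_test))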

              precision    recall  f1-score   support

           0       0.85      0.83      0.84      5041
           1       0.83      0.85      0.84      4959

    accuracy                           0.84     10000
   macro avg       0.84      0.84      0.84     10000
weighted avg       0.84      0.84      0.84     10000
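
The report above was produced by `classification_report(y_pred, y_test)`, while scikit-learn documents the order as `(y_true, y_pred)`. A small sketch on invented labels (`y_true` and `y_hat` are hypothetical) of what swapping the arguments does to precision and recall:

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Invented labels: four true positives in the data, the model finds two.
y_true = [1, 1, 1, 1, 0, 0]
y_hat = [1, 1, 0, 0, 0, 0]

# Documented order: (y_true, y_pred).
p = precision_score(y_true, y_hat)
r = recall_score(y_true, y_hat)

# Swapped order: predictions are treated as the ground truth.
p_swapped = precision_score(y_hat, y_true)
r_swapped = recall_score(y_hat, y_true)

print("documented order:", p, r)
print("swapped order:   ", p_swapped, r_swapped)
print("accuracy either way:", accuracy_score(y_true, y_hat), accuracy_score(y_hat, y_true))
```

Accuracy is symmetric in its arguments, so the headline accuracy figures are unaffected; only the per-class precision, recall, and support columns change meaning.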



In [32]:
clk = Pipeline([
    ('cv', CountVectorizer()),
    ('knn', KNeighborsClassifier(n_neighbors=10, metric='euclidean'))
])

In [37]:
clk.fit(X_train, y_train)

In [39]:
y_pred = clk.predict(X_test)

In [40]:
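# y_pred is passed first here, so the precision and recall columns are swapped
# relative to sklearn's (y_true, y_pred) convention.
print(classification_report(y_pred, y_test))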

              precision    recall  f1-score   support

           0       0.65      0.66      0.66      4933
           1       0.67      0.66      0.66      5067

    accuracy                           0.66     10000
   macro avg       0.66      0.66      0.66     10000
weighted avg       0.66      0.66      0.66     10000



In [41]:
clm = Pipeline([
    ('cv', CountVectorizer()),
    ('mnb', MultinomialNB())
])

In [42]:
clm.fit(X_train, y_train)

In [43]:
y_pred = clm.predict(X_test)

In [44]:
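# y_pred is passed first here, so the precision and recall columns are swapped
# relative to sklearn's (y_true, y_pred) convention.
print(classification_report(y_pred, y_test))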

              precision    recall  f1-score   support

           0       0.88      0.83      0.85      5273
           1       0.82      0.87      0.85      4727

    accuracy                           0.85     10000
   macro avg       0.85      0.85      0.85     10000
weighted avg       0.85      0.85      0.85     10000



## Possible reasons why KNN underperforms RandomForest and MultinomialNB here

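High dimensionality: the bag-of-words matrix has one column per vocabulary term, so each review is a very sparse vector in a very high-dimensional space. In such a space, pairwise Euclidean distances tend to become similar to one another, and on raw counts they are also strongly influenced by review length, so the 10 nearest neighbours carry little information about sentiment. RandomForest splits on individual terms and MultinomialNB weights each term by its per-class likelihood, so neither depends on a single global distance.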

Lack of distinct clusters: KNN assumes that points close together in feature space share a class. Positive and negative reviews often use much of the same vocabulary, so the two classes overlap heavily in count space and proximity is a weak signal. RandomForest and MultinomialNB cope better because they key on the specific terms that separate the classes rather than on overall closeness.

Noise and outliers: KNN votes over a small neighbourhood, so a few unrepresentative neighbours can flip a prediction. RandomForest averages over many trees (50 here), and MultinomialNB pools term counts over every training review of a class, which dampens the effect of individual noisy samples.

Imbalanced classes: KNN's majority vote favours whichever class is more common around the query point, which hurts the minority class on imbalanced data. The support columns above sit close to 5,000 per class, so imbalance is probably not a major factor in this experiment. Where it is, RandomForest offers a `class_weight` parameter and MultinomialNB a `class_prior` parameter to compensate.

Scalability: KNN stores the whole training set and compares every query against it at prediction time, so cost and memory grow with both the number of samples and the number of dimensions. This affects runtime rather than accuracy, but it makes tuning k or the distance metric slow. RandomForest and MultinomialNB do their work at training time and predict quickly afterwards.
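
To make the distance argument concrete, here is a minimal sketch on three invented reviews showing how Euclidean distance on raw counts can rank a review of the opposite sentiment as closer than a longer review with identical wording. Cosine distance is used only as a point of comparison; it is not something the notebook tried, and whether it would help on the IMDB data is untested here.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_distances, euclidean_distances

# Invented toy reviews: the second repeats the first ten times.
docs = [
    "good movie",                    # short positive
    " ".join(["good movie"] * 10),   # long positive, same wording
    "bad movie",                     # short negative
]
counts = CountVectorizer().fit_transform(docs)

eu = euclidean_distances(counts)
cos = cosine_distances(counts)

# Euclidean: the short negative review is nearer than the long positive one.
print("euclidean  short+ vs long+ :", round(float(eu[0, 1]), 3))
print("euclidean  short+ vs short-:", round(float(eu[0, 2]), 3))

# Cosine ignores vector length, so the two positive reviews coincide.
print("cosine     short+ vs long+ :", round(float(cos[0, 1]), 3))
print("cosine     short+ vs short-:", round(float(cos[0, 2]), 3))
```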

These are hypotheses rather than measured causes: how each algorithm performs depends on the dataset and on the preprocessing. It is worth trying several algorithms, and several settings of each, and comparing their results before settling on one.
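
One way to run such a comparison is to loop over candidate models inside the same pipeline and score each with cross-validation. The sketch below uses a dozen invented reviews and small hypothetical settings (`n_neighbors=3`, `cv=3`) so that it runs in seconds; the scores it prints say nothing about the IMDB results above.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline

# Invented toy reviews: six positive (1) and six negative (0).
texts = [
    "wonderful film great acting", "loved every minute", "fresh and funny",
    "a great story well told", "brilliant cast and script", "truly wonderful",
    "boring and cheap", "awful pacing terrible plot", "waste of time",
    "bad acting bad script", "dull and lifeless", "truly awful film",
]
labels = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]

candidates = {
    "rfc": RandomForestClassifier(n_estimators=50, random_state=0),
    "knn": KNeighborsClassifier(n_neighbors=3, metric="euclidean"),
    "mnb": MultinomialNB(),
}

scores = {}
for name, model in candidates.items():
    # Same vectorizer for every model so only the classifier differs.
    pipe = Pipeline([("cv", CountVectorizer()), ("model", model)])
    fold_scores = cross_val_score(pipe, texts, labels, cv=3, scoring="accuracy")
    scores[name] = fold_scores.mean()

for name, score in scores.items():
    print(f"{name}: mean accuracy {score:.2f}")
```

Keeping the vectorizer inside the pipeline means it is refit on each training fold, so no vocabulary information leaks from the held-out fold.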