<a href="https://colab.research.google.com/github/verschhh/NLP_assignment.ipynb/blob/main/NLP_assignment.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Natural Language Processing

## Assignment: K Nearest Neighbors Model for the IMDB Movie Review Dataset

For the final project, build a K Nearest Neighbors model to predict the sentiment (positive or negative) of movie reviews. The dataset is originally hosted here: http://ai.stanford.edu/~amaas/data/sentiment/

Use the notebooks from class to implement the model, then train and test it on the corresponding datasets.

You can follow these steps:
1. Read training-test data (Given)
2. Train a KNN classifier (Implement)
3. Make predictions on your test dataset (Implement)
4. Experimentation (Implement)

__You can use the KNN Classifier from here: https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html__
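
As a minimal sketch of the classifier's API, assuming made-up 2-D points rather than review features, the snippet below fits a `KNeighborsClassifier` and predicts two new points. The names `X_toy`, `y_toy`, and `toy_predictions` are illustrative only.

```python
from sklearn.neighbors import KNeighborsClassifier

# Toy 2-D points (hypothetical data): two well-separated clusters.
X_toy = [[0.0, 0.0], [0.1, 0.2], [0.2, 0.1], [1.0, 1.0], [0.9, 1.1], [1.1, 0.9]]
y_toy = [0, 0, 0, 1, 1, 1]

# n_neighbors is the K in KNN: each prediction is a majority vote
# among the K closest training points (Euclidean distance by default).
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_toy, y_toy)

toy_predictions = knn.predict([[0.05, 0.05], [1.0, 0.95]])
print(toy_predictions)
```

For text, the numeric rows would come from a vectorizer such as `CountVectorizer` instead of being typed by hand.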

## 1. Reading the dataset

We will use the __pandas__ library to read our dataset.

#### __Training data:__
Let's read our training data. It has text and label fields. The label is 1 for positive reviews and 0 for negative reviews.

In [2]:
import pandas as pd

train_df = pd.read_csv('https://raw.githubusercontent.com/aws-samples/aws-machine-learning-university-accelerated-nlp/master/data/final_project/imdb_train.csv', header=0)
train_df.head()

Unnamed: 0,text,label
0,This movie makes me want to throw up every tim...,0
1,Listening to the director's commentary confirm...,0
2,One of the best Tarzan films is also one of it...,1
3,Valentine is now one of my favorite slasher fi...,1
4,No mention if Ann Rivers Siddons adapted the m...,0


#### __Test data:__

In [None]:
import pandas as pd

test_df = pd.read_csv('https://raw.githubusercontent.com/aws-samples/aws-machine-learning-university-accelerated-nlp/master/data/final_project/imdb_test.csv', header=0)
test_df.head()

Unnamed: 0,text,label
0,What I hoped for (or even expected) was the we...,0
1,Garden State must rate amongst the most contri...,0
2,There is a lot wrong with this film. I will no...,1
3,"To qualify my use of ""realistic"" in the summar...",1
4,Dirty War is absolutely one of the best politi...,1


## 2. Train a KNN Classifier
Here, you will apply the pre-processing operations we covered in class. Then, split your dataset into training and validation sets. For your first submission, you will use __K Nearest Neighbors Classifier__. It is available [here](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html).

In [14]:
# Implement this
import nltk, re
import gensim

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn import set_config
from nltk.corpus import stopwords
from nltk.stem import SnowballStemmer
from nltk.tokenize import word_tokenize
from gensim.models import Word2Vec

nltk.download('punkt')
nltk.download('stopwords')

train_df["label"].value_counts()
stop = stopwords.words('english')
excluding = ['against', 'not', 'don', "don't",'ain', 'aren', "aren't", 'couldn', "couldn't",
             'didn', "didn't", 'doesn', "doesn't", 'hadn', "hadn't", 'hasn', "hasn't",
             'haven', "haven't", 'isn', "isn't", 'mightn', "mightn't", 'mustn', "mustn't",
             'needn', "needn't",'shouldn', "shouldn't", 'wasn', "wasn't", 'weren',
             "weren't", 'won', "won't", 'wouldn', "wouldn't"]
stop_words = [word for word in stop if word not in excluding]
snow = SnowballStemmer('english')

def process_text(texts):
    final_text_list=[]
    for sent in texts:
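        if not isinstance(sent, str):
            sent = ""  # treat missing or non-string entries as empty text
        filtered_sentence=[]
        sent = sent.lower()
        sent = sent.strip()
        sent = re.sub(r'\s+', ' ', sent)  # raw string avoids the invalid-escape warning for \s
        sent = re.sub(r'<.*?>', '', sent)  # drop HTML tags such as <br />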
        for w in word_tokenize(sent):
            if(not w.isnumeric()) and (len(w)>2) and (w not in stop_words):
                filtered_sentence.append(snow.stem(w))
        final_string = " ".join(filtered_sentence)
        final_text_list.append(final_string)
    return final_text_list

X=train_df[["text"]]
Y=train_df["label"]
X_train, X_val, y_train, y_val = train_test_split(X, Y, test_size=0.10, shuffle=True, random_state=324)
train_text_list = process_text(X_train["text"].tolist())
val_text_list = process_text(X_val["text"].tolist())
w2v = gensim.models.Word2Vec()

pipeline = Pipeline([('text_vect', CountVectorizer(binary=True,max_features=10)), ('knn', KNeighborsClassifier())])
#pipeline = Pipeline([( 'text_vect', TfidfVectorizer(use_idf=True,max_features=10)), ('knn', KNeighborsClassifier())])

set_config(display='diagram')
pipeline
X_train = train_text_list
X_val = val_text_list
pipeline.fit(X_train, y_train.values)

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
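
One possible variation on the cleaning step, shown only as a sketch: `process_text` above collapses whitespace before it removes HTML tags and replaces each tag with an empty string, so text on either side of a `<br />` can end up joined into one token. The hypothetical helper `clean_review` below (not part of the assignment code) replaces tags with a space and collapses whitespace afterwards. Switching to it would change the vocabulary, so the scores recorded in this notebook may not carry over.

```python
import re

def clean_review(text):
    """Lowercase, drop HTML tags, then collapse whitespace (hypothetical helper)."""
    text = text.lower()
    # Replace tags with a space so "cast.<br />weak" does not become "cast.weak".
    text = re.sub(r"<.*?>", " ", text)
    # Collapse runs of whitespace last, so gaps left by removed tags are cleaned up too.
    text = re.sub(r"\s+", " ", text)
    return text.strip()

cleaned = clean_review("Great  cast.<br /><br />Weak   plot.")
print(cleaned)
```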


## 3. Make predictions on your test dataset

Once we have selected our best-performing model, we can use it to make predictions on the test dataset. Call __.fit()__ on your training data with the best-performing K value, then call __.predict()__ on your test data to get your test predictions.

In [13]:
# Implement this
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score

val_predictions = pipeline.predict(X_val)
print(confusion_matrix(y_val.values, val_predictions))
print(classification_report(y_val.values, val_predictions))
print("Accuracy (validation):", accuracy_score(y_val.values, val_predictions))

[[671 609]
 [600 620]]
              precision    recall  f1-score   support

           0       0.53      0.52      0.53      1280
           1       0.50      0.51      0.51      1220

    accuracy                           0.52      2500
   macro avg       0.52      0.52      0.52      2500
weighted avg       0.52      0.52      0.52      2500

Accuracy (validation): 0.5164
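
The cell above scores the validation split. As a hedged sketch of the final step described in this section, the snippet below fits a pipeline on training texts and calls `.predict()` on separate test texts. The corpora and the `toy_` names are made up for illustration; in the notebook, the processed training texts and the processed `test_df` texts would take their place.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline

# Tiny made-up corpora standing in for the processed train and test texts.
toy_train_texts = ["great film", "great acting", "love great film",
                   "awful film", "awful acting", "hate awful film"]
toy_train_labels = [1, 1, 1, 0, 0, 0]
toy_test_texts = ["great great film", "awful awful acting"]
toy_test_labels = [1, 0]

toy_pipeline = Pipeline([
    ("text_vect", CountVectorizer(binary=True)),
    ("knn", KNeighborsClassifier(n_neighbors=3)),
])

# fit() learns the vocabulary and stores the training vectors;
# predict() reuses that vocabulary to vectorize the unseen test texts.
toy_pipeline.fit(toy_train_texts, toy_train_labels)
toy_test_predictions = toy_pipeline.predict(toy_test_texts)
print("Accuracy (test):", accuracy_score(toy_test_labels, toy_test_predictions))
```

Because the vectorizer sits inside the pipeline, the test texts never influence the vocabulary.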


## 4. Experimentation

For each of the following tasks, track both the **weighted F1-score** and **accuracy**:

1. **Change the binary parameter in CountVectorizer**: Test both `binary=True` and `binary=False`, and evaluate performance.
2. **Switch to TfidfVectorizer**: Replace the CountVectorizer with TfidfVectorizer and compare results.
3. **Adjust the max_features**: Experiment with different values of `max_features` for both TfidfVectorizer and CountVectorizer (`binary=True`).
4. **Optimize KNN**: Select the best-performing model from task 3 and vary the number of neighbors (`n_neighbors`) in the KNN classifier.
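
Both metrics can be computed with a small helper, sketched here. `score_predictions` and the `example_` values are hypothetical and only illustrate the scikit-learn calls `f1_score(..., average="weighted")` and `accuracy_score`.

```python
from sklearn.metrics import accuracy_score, f1_score

def score_predictions(y_true, y_pred):
    """Return (weighted F1, accuracy) for one experiment run (hypothetical helper)."""
    weighted_f1 = f1_score(y_true, y_pred, average="weighted")
    accuracy = accuracy_score(y_true, y_pred)
    return weighted_f1, accuracy

# Made-up labels and predictions, only to show the call.
example_true = [0, 0, 1, 1, 1]
example_pred = [0, 1, 1, 1, 0]
example_f1, example_accuracy = score_predictions(example_true, example_pred)
print(f"weighted F1: {example_f1:.4f}, accuracy: {example_accuracy:.4f}")
```

Appending each pair to a list, together with the settings that produced it, makes the four tasks easier to compare than scrolling through printed reports.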


In [16]:
# Task 1

# Implement this
#pipeline = Pipeline([('text_vect', CountVectorizer(binary=True,max_features=10)), ('knn', KNeighborsClassifier())])
#[[671 609]
#[600 620]]
#             precision    recall  f1-score   support
#
#          0       0.53      0.52      0.53      1280
#          1       0.50      0.51      0.51      1220
#
#   accuracy                           0.52      2500
#  macro avg       0.52      0.52      0.52      2500
#weighted avg      0.52      0.52      0.52      2500
#
#Accuracy (validation): 0.5164

pipeline = Pipeline([('text_vect', CountVectorizer(binary=False,max_features=10)), ('knn', KNeighborsClassifier())])
#[[671 609]
#[566 654]]
#             precision    recall  f1-score   support
#
#          0       0.54      0.52      0.53      1280
#          1       0.52      0.54      0.53      1220
#
#   accuracy                           0.53      2500
#  macro avg       0.53      0.53      0.53      2500
#weighted avg       0.53      0.53      0.53      2500
#
#Accuracy (validation): 0.53

#OBSERVATION
#When we switch the binary parameter in CountVectorizer to False, the results are slightly better: validation accuracy rises from 0.5164 to 0.53 (about 0.01), and the weighted F1-score from 0.52 to 0.53.

set_config(display='diagram')
pipeline
X_test = test_df[["text"]]
y_test = test_df["label"]
test_text_list = process_text(X_test["text"].tolist())

X_train = train_text_list
X_val = val_text_list
pipeline.fit(X_train, y_train.values)
val_predictions = pipeline.predict(X_val)
print(confusion_matrix(y_val.values, val_predictions))
print(classification_report(y_val.values, val_predictions))
print("Accuracy (validation):", accuracy_score(y_val.values, val_predictions))

[[671 609]
 [566 654]]
              precision    recall  f1-score   support

           0       0.54      0.52      0.53      1280
           1       0.52      0.54      0.53      1220

    accuracy                           0.53      2500
   macro avg       0.53      0.53      0.53      2500
weighted avg       0.53      0.53      0.53      2500

Accuracy (validation): 0.53
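
To see what the `binary` flag changes, here is a small sketch on two made-up documents (`toy_docs` is not part of the dataset). With `binary=True` repeated words are recorded only as present, so the two documents become identical vectors, while `binary=False` keeps the counts.

```python
from sklearn.feature_extraction.text import CountVectorizer

toy_docs = ["good good good film", "good film"]  # made-up examples

count_matrix = CountVectorizer(binary=False).fit_transform(toy_docs).toarray()
binary_matrix = CountVectorizer(binary=True).fit_transform(toy_docs).toarray()

# Columns follow the alphabetically sorted vocabulary: ['film', 'good'].
print(count_matrix)   # raw term counts
print(binary_matrix)  # presence/absence only
```

For KNN this matters because distances are computed on these rows: counts let repetition pull documents apart, binary features do not.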


In [20]:
# Task 2

# Implement this

#pipeline = Pipeline([('text_vect', CountVectorizer(binary=True,max_features=10)), ('knn', KNeighborsClassifier())])

#[[671 609]
#[600 620]]
#             precision    recall  f1-score   support
#
#          0       0.53      0.52      0.53      1280
#          1       0.50      0.51      0.51      1220
#
#   accuracy                           0.52      2500
#  macro avg       0.52      0.52      0.52      2500
#weighted avg      0.52      0.52      0.52      2500
#
#Accuracy (validation): 0.5164

#pipeline = Pipeline([('text_vect', CountVectorizer(binary=False,max_features=10)), ('knn', KNeighborsClassifier())])

#[[671 609]
#[566 654]]
#             precision    recall  f1-score   support
#
#          0       0.54      0.52      0.53      1280
#          1       0.52      0.54      0.53      1220
#
#   accuracy                           0.53      2500
#  macro avg       0.53      0.53      0.53      2500
#weighted avg       0.53      0.53      0.53      2500
#
#Accuracy (validation): 0.53

#pipeline = Pipeline([( 'text_vect', TfidfVectorizer(use_idf=True,max_features=10)), ('knn', KNeighborsClassifier())])

#[[665 615]
#[589 631]]
#             precision    recall  f1-score   support
#
#          0       0.53      0.52      0.52      1280
#          1       0.51      0.52      0.51      1220
#
#   accuracy                           0.52      2500
#  macro avg       0.52      0.52      0.52      2500
#weighted avg       0.52      0.52      0.52      2500
#
#Accuracy (validation): 0.5184

pipeline = Pipeline([( 'text_vect', TfidfVectorizer(use_idf=False,max_features=10)), ('knn', KNeighborsClassifier())])

#[[661 619]
#[553 667]]
#             precision    recall  f1-score   support
#
#          0       0.54      0.52      0.53      1280
#          1       0.52      0.55      0.53      1220
#
#   accuracy                           0.53      2500
#  macro avg       0.53      0.53      0.53      2500
#weighted avg       0.53      0.53      0.53      2500
#
#Accuracy (validation): 0.5312

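# OBSERVATION: with TfidfVectorizer we get results comparable to CountVectorizer, at best slightly better: accuracy is 0.5184 with use_idf=True and 0.5312 with use_idf=False, against 0.5164 (binary=True) and 0.53 (binary=False) for CountVectorizer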

set_config(display='diagram')
pipeline
X_train = train_text_list
X_val = val_text_list
pipeline.fit(X_train, y_train.values)
val_predictions = pipeline.predict(X_val)
print(confusion_matrix(y_val.values, val_predictions))
print(classification_report(y_val.values, val_predictions))
print("Accuracy (validation):", accuracy_score(y_val.values, val_predictions))

[[665 615]
 [589 631]]
              precision    recall  f1-score   support

           0       0.53      0.52      0.52      1280
           1       0.51      0.52      0.51      1220

    accuracy                           0.52      2500
   macro avg       0.52      0.52      0.52      2500
weighted avg       0.52      0.52      0.52      2500

Accuracy (validation): 0.5184
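
A short sketch of what `use_idf` changes, using three made-up documents. It assumes the default `norm='l2'` and `smooth_idf=True` settings of `TfidfVectorizer`. With `use_idf=False` the vectorizer produces length-normalized term counts; with `use_idf=True` terms that occur in fewer documents receive larger weights.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

toy_docs = ["good film", "bad film", "good good plot"]  # made-up examples

tf_only = TfidfVectorizer(use_idf=False).fit_transform(toy_docs).toarray()
tf_idf = TfidfVectorizer(use_idf=True).fit_transform(toy_docs).toarray()

# With the default norm='l2', every row is scaled to unit length in both cases.
print(np.linalg.norm(tf_only, axis=1))
print(np.linalg.norm(tf_idf, axis=1))

# Vocabulary (sorted): ['bad', 'film', 'good', 'plot'].
# In "bad film", 'bad' occurs in one document and 'film' in two,
# so IDF weighting gives the rarer 'bad' the larger weight.
bad_weight, film_weight = tf_idf[1, 0], tf_idf[1, 1]
print(bad_weight > film_weight)
```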


In [27]:
# Task 3

# Implement this
for i in range(1,200, 10):
  print("max_features: ", i)
  pipeline = Pipeline([( 'text_vect', TfidfVectorizer(use_idf=True,max_features=i)), ('knn', KNeighborsClassifier())])
  set_config(display='diagram')
  pipeline
  X_train = train_text_list
  X_val = val_text_list
  pipeline.fit(X_train, y_train.values)
  val_predictions = pipeline.predict(X_val)
  print(confusion_matrix(y_val.values, val_predictions))
  print(classification_report(y_val.values, val_predictions))
  print("Accuracy (validation):", accuracy_score(y_val.values, val_predictions))

#for i in range(1,200, 10):
#  print("max_features: ", i)
#  pipeline = Pipeline([('text_vect', CountVectorizer(binary=True,max_features=i)), ('knn', KNeighborsClassifier())])
#  set_config(display='diagram')
#  pipeline
#  X_train = train_text_list
#  X_val = val_text_list
#  pipeline.fit(X_train, y_train.values)
#  val_predictions = pipeline.predict(X_val)
#  print(confusion_matrix(y_val.values, val_predictions))
#  print(classification_report(y_val.values, val_predictions))
#  print("Accuracy (validation):", accuracy_score(y_val.values, val_predictions))

#OBSERVATION: for the CountVectorizer, the accuracy tends to stay around 64%, with a peak of 65% at i=191. Increasing max_features further may raise the accuracy, because we observe an increasing trend.
#OBSERVATION: for the TfidfVectorizer, the accuracy tends to stay around 66%, with a peak of 68% at i=131. Increasing max_features further may raise the accuracy, but we observe a slight decreasing trend.
#
# We'll select the TfidfVectorizer for the 4th step

max_features:  1
[[842 438]
 [719 501]]
              precision    recall  f1-score   support

           0       0.54      0.66      0.59      1280
           1       0.53      0.41      0.46      1220

    accuracy                           0.54      2500
   macro avg       0.54      0.53      0.53      2500
weighted avg       0.54      0.54      0.53      2500

Accuracy (validation): 0.5372
max_features:  11
[[664 616]
 [589 631]]
              precision    recall  f1-score   support

           0       0.53      0.52      0.52      1280
           1       0.51      0.52      0.51      1220

    accuracy                           0.52      2500
   macro avg       0.52      0.52      0.52      2500
weighted avg       0.52      0.52      0.52      2500

Accuracy (validation): 0.518
max_features:  21
[[752 528]
 [454 766]]
              precision    recall  f1-score   support

           0       0.62      0.59      0.60      1280
           1       0.59      0.63      0.61      1220

 

In [29]:
# Task 4

# Implement this
for i in range(1,200, 5):
  print("number of neighboor: ", i)
  pipeline = Pipeline([( 'text_vect', TfidfVectorizer(use_idf=True,max_features=141)), ('knn', KNeighborsClassifier(n_neighbors=i))])
  set_config(display='diagram')
  pipeline
  X_train = train_text_list
  X_val = val_text_list
  pipeline.fit(X_train, y_train.values)
  val_predictions = pipeline.predict(X_val)
  print(confusion_matrix(y_val.values, val_predictions))
  print(classification_report(y_val.values, val_predictions))
  print("Accuracy (validation):", accuracy_score(y_val.values, val_predictions))

#OBSERVATION: The accuracy goes up to 73%, but we can't seem to get it any higher

number of neighboor:  1
[[782 498]
 [447 773]]
              precision    recall  f1-score   support

           0       0.64      0.61      0.62      1280
           1       0.61      0.63      0.62      1220

    accuracy                           0.62      2500
   macro avg       0.62      0.62      0.62      2500
weighted avg       0.62      0.62      0.62      2500

Accuracy (validation): 0.622
number of neighboor:  6
[[977 303]
 [478 742]]
              precision    recall  f1-score   support

           0       0.67      0.76      0.71      1280
           1       0.71      0.61      0.66      1220

    accuracy                           0.69      2500
   macro avg       0.69      0.69      0.68      2500
weighted avg       0.69      0.69      0.69      2500

Accuracy (validation): 0.6876
number of neighboor:  11
[[864 416]
 [325 895]]
              precision    recall  f1-score   support

           0       0.73      0.68      0.70      1280
           1       0.68      0.73   