<a href="https://colab.research.google.com/github/waelrash1/predictive_analytics_DT302/blob/main/NLP_word2vec_KNN.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# <a name="0">Natural Language Processing</a>

## K Nearest Neighbors Model for a Classification Problem: Classify Product Reviews as Positive or Negative

In this notebook, we use the K Nearest Neighbors method to build a classifier to predict the __isPositive__ field of our review dataset (that is very similar to the final project dataset).


1. <a href="#1">Reading the dataset</a>
2. <a href="#2">Exploratory data analysis</a>
3. <a href="#3">Text Processing: Stop words removal and stemming</a>
4. <a href="#4">Train - Validation Split</a>
5. <a href="#5">Data processing with Pipeline</a>
6. <a href="#6">Train the classifier</a>
7. <a href="#7">Test the classifier</a>
8. <a href="#8">Ideas for improvement</a>

Find more details on the KNN classifier here: https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html

Overall dataset schema:
* __reviewText:__ Text of the review
* __summary:__ Summary of the review
* __verified:__ Whether the purchase was verified (True or False)
* __time:__ UNIX timestamp for the review
* __log_votes:__ Log-scaled helpful votes, computed as log(1+votes). *Each click on the "helpful" button for a review increases its vote count by 1. Taking log(1+votes) compresses the range of the votes field.*
* __isPositive:__ Whether the review is positive or negative (1 or 0)
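As a quick check of the __log_votes__ formula, the sketch below applies log(1+votes) to a few illustrative vote counts and inverts it again. It assumes the natural logarithm, which appears consistent with values such as 1.098612 in the sample rows; the helper names `log_votes` and `votes_from_log` are hypothetical and not part of the dataset tooling.

```python
import math

def log_votes(votes):
    """Return log(1 + votes), assuming the natural logarithm."""
    return math.log1p(votes)

def votes_from_log(value):
    """Invert the transformation and round to the nearest whole vote count."""
    return round(math.expm1(value))

# Illustrative vote counts, not taken from the dataset
for votes in [0, 2, 10, 1000]:
    print(votes, round(log_votes(votes), 6))

# Recover a raw count from a transformed value
print(votes_from_log(2.397895))
```

Under this assumption, a review with no votes maps to 0.0 and a review with 1000 votes maps to roughly 6.9, so heavily voted reviews no longer dominate the scale.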


## 1. <a name="1">Reading the dataset</a>
(<a href="#0">Go to top</a>)

We will use the __pandas__ library to read our dataset.

In [1]:
import pandas as pd

df = pd.read_csv('https://raw.githubusercontent.com/aws-samples/aws-machine-learning-university-accelerated-nlp/master/data/examples/AMAZON-REVIEW-DATA-CLASSIFICATION.csv')

print('The shape of the dataset is:', df.shape)

The shape of the dataset is: (70000, 6)


Let's look at the first 10 rows of the dataset.

In [2]:
df.head(10)

Unnamed: 0,reviewText,summary,verified,time,log_votes,isPositive
0,"PURCHASED FOR YOUNGSTER WHO\nINHERITED MY ""TOO...",IDEAL FOR BEGINNER!,True,1361836800,0.0,1.0
1,unable to open or use,Two Stars,True,1452643200,0.0,0.0
2,Waste of money!!! It wouldn't load to my system.,Dont buy it!,True,1433289600,0.0,0.0
3,I attempted to install this OS on two differen...,I attempted to install this OS on two differen...,True,1518912000,0.0,0.0
4,I've spent 14 fruitless hours over the past tw...,Do NOT Download.,True,1441929600,1.098612,0.0
5,I purchased the home and business because I wa...,Quicken home and business not for amatures,True,1335312000,0.0,0.0
6,The download doesn't take long at all. And it'...,Great!,True,1377993600,0.0,1.0
7,This program is positively wonderful for word ...,Terrific for practice.,False,1158364800,2.397895,1.0
8,Fantastic protection!! Great customer support!!,Five Stars,True,1478476800,0.0,1.0
9,Obviously Win 7 now the last great operating s...,Five Stars,True,1471478400,0.0,1.0


## 2. <a name="2">Exploratory data analysis</a>
(<a href="#0">Go to top</a>)

Let's look at the distribution of the __isPositive__ field.

In [3]:
df["isPositive"].value_counts()

isPositive,count
1.0,43692
0.0,26308
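The class counts above give a useful reference point: a classifier that always predicts the majority class already reaches a certain accuracy, and any trained model should beat it. The sketch below computes that baseline directly from the two counts; the variable names are illustrative.

```python
# Class counts reported by value_counts() above
counts = {1.0: 43692, 0.0: 26308}

total = sum(counts.values())
shares = {label: n / total for label, n in counts.items()}

# A trivial model that always predicts the most frequent class
majority_label = max(counts, key=counts.get)
baseline_accuracy = counts[majority_label] / total

print(shares)
print(f"Majority-class baseline accuracy: {baseline_accuracy:.4f}")
```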


We can check the number of missing values for each column below.

In [3]:
print(df.isna().sum())

reviewText    12
summary       15
verified       0
time           0
log_votes      0
isPositive     0
dtype: int64


We have missing values in our text fields: 12 in __reviewText__ and 15 in __summary__.

## 3. <a name="3">Text Processing: Stop words removal and stemming</a>
(<a href="#0">Go to top</a>)

In [2]:
# Import NLTK and download the tokenizer and stop word data
import nltk

nltk.download('punkt')
nltk.download('stopwords')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

We will create the stop word removal and text cleaning processes below. The NLTK library provides a list of common stop words. We will use that list, but keep some of its words in the text (mostly negations, because they help us understand the sentiment of a sentence).

In [3]:
import nltk, re
from nltk.corpus import stopwords
from nltk.stem import SnowballStemmer
from nltk.tokenize import word_tokenize

# Let's get a list of stop words from the NLTK library
stop = stopwords.words('english')
print(stop)
# These words are important for our problem. We don't want to remove them.
excluding = ['against', 'not', 'don', "don't",'ain', 'aren', "aren't", 'couldn', "couldn't",
             'didn', "didn't", 'doesn', "doesn't", 'hadn', "hadn't", 'hasn', "hasn't",
             'haven', "haven't", 'isn', "isn't", 'mightn', "mightn't", 'mustn', "mustn't",
             'needn', "needn't",'shouldn', "shouldn't", 'wasn', "wasn't", 'weren',
             "weren't", 'won', "won't", 'wouldn', "wouldn't"]

# New stop word list
stop_words = [word for word in stop if word not in excluding]
print(stop_words)


['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

In [4]:
snow = SnowballStemmer('english')

def process_text(texts):
    final_text_list=[]
    for sent in texts:

        # Check if the sentence is a missing value
        if not isinstance(sent, str):
            sent = ""

        filtered_sentence=[]

        sent = sent.lower() # Lowercase
        sent = sent.strip() # Remove leading/trailing whitespace
        sent = re.sub(r'\s+', ' ', sent) # Remove extra space and tabs
        sent = re.compile('<.*?>').sub('', sent) # Remove HTML tags/markups:

        for w in word_tokenize(sent):
            # We are applying some custom filtering here, feel free to try different things
            # Check if it is not numeric and its length>2 and not in stop words
            if(not w.isnumeric()) and (len(w)>2) and (w not in stop_words):
                # Stem and add to filtered list
                filtered_sentence.append(snow.stem(w))
        final_string = " ".join(filtered_sentence) #final string of cleaned words

        final_text_list.append(final_string)

    return final_text_list

## 4. <a name="4">Train - Validation Split</a>
(<a href="#0">Go to top</a>)

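We split the dataset in two steps. First we hold out 10% of the reviews as a test set. Then we split the remaining 90% into training (90%) and validation (10%) sets. With 70,000 reviews, this gives 56,700 training, 6,300 validation, and 7,000 test reviews.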

In [5]:
from sklearn.model_selection import train_test_split

X=df[["reviewText"]]
Y=df["isPositive"]
X_training, X_test, y_training, y_test = train_test_split(X,
                                                  Y,
                                                  test_size=0.10,
                                                  shuffle=True,
                                                  random_state=324
                                                 )

X_train, X_val, y_train, y_val = train_test_split(X_training,
                                                  y_training,
                                                  test_size=0.10,
                                                  shuffle=True,
                                                  random_state=324
                                                 )
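The sizes produced by the two `train_test_split` calls above can be worked out by hand. The sketch below assumes that a fractional `test_size` is rounded up to a whole number of samples and that the training part receives the rest; `split_sizes` is a hypothetical helper, not a scikit-learn function.

```python
import math

def split_sizes(n_samples, test_size):
    """Return (n_train, n_test), assuming the test part is rounded up."""
    n_test = math.ceil(test_size * n_samples)
    return n_samples - n_test, n_test

n_total = 70000  # number of rows in the dataset

# First split: hold out the test set
n_training, n_test = split_sizes(n_total, 0.10)

# Second split: carve the validation set out of the remaining rows
n_train, n_val = split_sizes(n_training, 0.10)

print(n_train, n_val, n_test)
```

The validation size of 6,300 matches the support total in the classification report in section 7.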

In [8]:
print("Processing the reviewText fields")
train_text_list = process_text(X_train["reviewText"].tolist())
val_text_list = process_text(X_val["reviewText"].tolist())

Processing the reviewText fields


Our __process_text()__ method in section 3 uses an empty string for missing values.

## 5. <a name="5">Data processing with Pipeline</a>
(<a href="#0">Go to top</a>)

Today we will use a simple pipeline to fit a K Nearest Neighbors classifier on our text field. This example only uses a single field (reviewText). In the next lecture, we will see how to combine multiple fields.

We train a Word2Vec model on the processed training text and represent each review as the mean of its word vectors. The model below uses 1000-dimensional vectors, a context window of 5, and keeps every word (min_count=1). Feel free to experiment with different numbers here.

In [9]:
# Vectorize the text with Word2Vec, then train KNN on the averaged word embeddings

from sklearn.pipeline import Pipeline
from sklearn.neighbors import KNeighborsClassifier
from sklearn.feature_extraction.text import CountVectorizer
from gensim.models import Word2Vec
import numpy as np

# Train Word2Vec model
sentences = [text.split() for text in train_text_list]
model_w2v = Word2Vec(sentences, vector_size=1000, window=5, min_count=1, workers=4)

def document_vector(doc):
  doc = [word for word in doc if word in model_w2v.wv]
  if not doc:
    return np.zeros(model_w2v.vector_size)
  return np.mean([model_w2v.wv[word] for word in doc], axis=0)
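To make the averaging in `document_vector` concrete, here is a dependency-free sketch of the same idea on toy data. The 3-dimensional embeddings and the `mean_vector` helper are hypothetical stand-ins; in the notebook the vectors come from the trained Word2Vec model.

```python
# Hypothetical 3-dimensional embeddings; a trained Word2Vec model would supply these.
toy_embeddings = {
    "great": [1.0, 0.0, 2.0],
    "product": [3.0, 2.0, 0.0],
}
VECTOR_SIZE = 3

def mean_vector(tokens, embeddings, size):
    """Average the vectors of known tokens; fall back to zeros if none are known."""
    known = [embeddings[t] for t in tokens if t in embeddings]
    if not known:
        return [0.0] * size
    # zip(*known) groups the values of each dimension together
    return [sum(dim) / len(known) for dim in zip(*known)]

# "unseen" has no embedding, so it is skipped
print(mean_vector(["great", "product", "unseen"], toy_embeddings, VECTOR_SIZE))

# A document with no known words becomes the zero vector
print(mean_vector(["unseen"], toy_embeddings, VECTOR_SIZE))
```

Averaging discards word order, so two reviews with the same words in a different order get the same vector. Empty reviews, such as the missing values replaced by empty strings, all map to the zero vector.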




## 6. <a name="6">Train the classifier</a>
(<a href="#0">Go to top</a>)

We first convert each processed review into its document vector, then train our classifier with __.fit()__ on the training vectors.

In [12]:

train_vectors = [document_vector(text.split()) for text in train_text_list]
val_vectors = [document_vector(text.split()) for text in val_text_list]


pipeline = Pipeline([
    ('classifier', KNeighborsClassifier(n_neighbors=10))
])

pipeline.fit(train_vectors, y_train)
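For intuition about what the classifier does at prediction time, the following is a minimal pure-Python sketch of K Nearest Neighbors with Euclidean distance and an unweighted majority vote. It is an illustration on toy 2-D points, not the scikit-learn implementation, which uses optimized neighbor searches and may break ties differently; `knn_predict` is a hypothetical name.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k=3):
    """Predict the majority label among the k training points closest to query."""
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    # Keep the k smallest distances
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Toy 2-D points standing in for document vectors
points = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (1.0, 1.0), (0.9, 1.1), (1.1, 0.9)]
labels = [0, 0, 0, 1, 1, 1]

print(knn_predict(points, labels, (0.15, 0.15), k=3))
print(knn_predict(points, labels, (0.95, 1.0), k=3))
```

Note that "training" here only stores the data. All of the distance computation happens at prediction time, which is why KNN predictions get slower as the training set grows.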

## 7. <a name="7">Test the classifier</a>
(<a href="#0">Go to top</a>)

Let's evaluate the performance of the trained classifier. We use __.predict()__ this time.

In [13]:
from sklearn.metrics import accuracy_score, classification_report

# Make predictions on the validation set
y_pred = pipeline.predict(val_vectors)

# Evaluate the model
accuracy = accuracy_score(y_val, y_pred)
print(f"Accuracy: {accuracy}")

print(classification_report(y_val, y_pred))

Accuracy: 0.7961904761904762
              precision    recall  f1-score   support

         0.0       0.71      0.78      0.74      2371
         1.0       0.86      0.81      0.83      3929

    accuracy                           0.80      6300
   macro avg       0.78      0.79      0.79      6300
weighted avg       0.80      0.80      0.80      6300
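The precision, recall, and F1 columns in the report follow from simple counts of true positives, false positives, and false negatives. The sketch below computes them by hand for one class on a small set of made-up labels (not the notebook's predictions); `binary_report` is a hypothetical helper.

```python
def binary_report(y_true, y_pred, positive=1):
    """Compute accuracy plus precision, recall, and F1 for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}

# Made-up labels for illustration only
y_true_toy = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred_toy = [1, 1, 1, 0, 0, 0, 1, 1]

print(binary_report(y_true_toy, y_pred_toy, positive=1))
```

Calling the function with `positive=0` gives the row for the negative class, which is how the report produces one line per label.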



## 8. <a name="8">Ideas for improvement</a>
(<a href="#0">Go to top</a>)

We can usually improve performance with some additional work. You can try the following:
* Using your validation data, try different K values and look at accuracy for them.
* Change the feature extractor to term-frequency (TF) or TF-IDF vectors. Also experiment with different vocabulary sizes.
* Come up with other features, such as the presence of certain punctuation marks, all-capitalized words, or specific words that might be useful in this problem.
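As one possible starting point for the last idea, the sketch below counts exclamation marks, question marks, and all-capitalized words. The function name and feature names are hypothetical. It works on the raw review text, because `process_text()` lowercases the text and its length filter drops standalone punctuation tokens.

```python
import re

def extra_features(text):
    """Hypothetical hand-crafted features computed on the raw (unprocessed) text."""
    words = re.findall(r"[A-Za-z']+", text)
    return {
        "n_exclamations": text.count("!"),
        "n_questions": text.count("?"),
        # Ignore single letters such as "I" or "A"
        "n_all_caps_words": sum(1 for w in words if len(w) > 1 and w.isupper()),
    }

# Two texts from the sample rows shown in section 1
print(extra_features("Waste of money!!! It wouldn't load to my system."))
print(extra_features("Do NOT Download."))
```

Features like these could be appended to the document vectors before fitting the classifier. Whether they help is something to check on the validation set.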

In [24]:
# Tune the number of neighbors, the vectorizer type (TF-IDF or counts) and max_features in a loop
from sklearn.feature_extraction.text import TfidfVectorizer

for n_neighbors in [3, 5, 7]:
  for use_tfidf in [True, False]:
    for max_features in [100, 500, 1000]:
      if use_tfidf:
        text_vect = TfidfVectorizer(use_idf=True, max_features=max_features)
      else:
        text_vect = CountVectorizer(binary=False, max_features=max_features)
      pipeline = Pipeline([
          ('text_vect', text_vect),
          ('knn', KNeighborsClassifier(n_neighbors=n_neighbors))
      ])
      # The vectorizers expect a list of strings, so we pass the processed text lists
      pipeline.fit(train_text_list, y_train.values)
      val_predictions = pipeline.predict(val_text_list)
      print(f"Neighbors: {n_neighbors}, TF-IDF: {use_tfidf}, Max Features: {max_features}")
      print(classification_report(y_val.values, val_predictions))
      print("Accuracy (validation):", accuracy_score(y_val.values, val_predictions))
      print("-" * 30)



Neighbors: 3, TF-IDF: True, Max Features: 100
              precision    recall  f1-score   support

         0.0       0.65      0.59      0.62      2371
         1.0       0.76      0.81      0.79      3929

    accuracy                           0.73      6300
   macro avg       0.71      0.70      0.70      6300
weighted avg       0.72      0.73      0.72      6300

Accuracy (validation): 0.7258730158730159
------------------------------
Neighbors: 3, TF-IDF: True, Max Features: 500
              precision    recall  f1-score   support

         0.0       0.66      0.27      0.38      2371
         1.0       0.67      0.92      0.78      3929

    accuracy                           0.67      6300
   macro avg       0.67      0.59      0.58      6300
weighted avg       0.67      0.67      0.63      6300

Accuracy (validation): 0.6712698412698412
------------------------------
Neighbors: 3, TF-IDF: True, Max Features: 1000
              precision    recall  f1-score   support

      