<a href="https://colab.research.google.com/github/waelrash1/predictive_analytics_DT302/blob/main/NLP_KNN_BERT_EMBEEDINGS.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# <a name="0">BERT Embeddings for Natural Language Processing</a>

## K Nearest Neighbors Model for a Classification Problem: Classify Product Reviews as Positive or Negative

In this notebook, we turn each review into a BERT embedding and use the K Nearest Neighbors method to build a classifier that predicts the __isPositive__ field of our review dataset (which is very similar to the final project dataset).


1. <a href="#1">Reading the dataset</a>
2. <a href="#2">Exploratory data analysis</a>
3. <a href="#3">Text Processing: Stop words removal and stemming</a>
4. <a href="#4">Train - Validation Split</a>
5. <a href="#5">Use BERT for text embeddings</a>
6. <a href="#6">Train the classifier</a>
7. <a href="#7">Test the classifier</a>
8. <a href="#8">Ideas for improvement</a>

Find more details on the KNN classifier here: https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html

Overall dataset schema:
* __reviewText:__ Text of the review
* __summary:__ Summary of the review
* __verified:__ Whether the purchase was verified (True or False)
* __time:__ UNIX timestamp for the review
* __log_votes:__ Logarithm-adjusted votes, log(1+votes). *This field is a processed version of the votes field. Each click on the "helpful" button under a customer review increases its vote count by 1, and __log_votes__ compresses that count into a smaller range.*
* __isPositive:__ Whether the review is positive or negative (1 or 0)
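
As a small worked example of the __log_votes__ transformation, the sketch below applies log(1+votes) and its inverse using only the Python standard library. The helper names `to_log_votes` and `to_votes` are hypothetical and are not part of the dataset tooling. Under this formula, the values 1.098612 and 2.397895 that appear in the first rows of the dataset correspond to 2 and 10 votes.

```python
import math

def to_log_votes(votes: int) -> float:
    """Forward transform from the schema: log(1 + votes)."""
    return math.log1p(votes)

def to_votes(log_votes: float) -> int:
    """Inverse transform, rounded to the nearest whole vote."""
    return round(math.expm1(log_votes))

for votes in (0, 2, 10):
    print(votes, round(to_log_votes(votes), 6))

# Recover the raw vote count from a stored log_votes value
print(to_votes(2.397895))
```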


## 1. <a name="1">Reading the dataset</a>
(<a href="#0">Go to top</a>)

We will use the __pandas__ library to read our dataset.

In [None]:
import pandas as pd

df = pd.read_csv('https://raw.githubusercontent.com/aws-samples/aws-machine-learning-university-accelerated-nlp/master/data/examples/AMAZON-REVIEW-DATA-CLASSIFICATION.csv')

print('The shape of the dataset is:', df.shape)

The shape of the dataset is: (70000, 6)


In [None]:
# Optional: IMDB dataset (final project data).
# It is loaded into separate variables so that `df` keeps the Amazon review
# data that the rest of this notebook uses.
train_df = pd.read_csv('https://raw.githubusercontent.com/aws-samples/aws-machine-learning-university-accelerated-nlp/master/data/final_project/imdb_train.csv', header=0)
test_df = pd.read_csv('https://raw.githubusercontent.com/aws-samples/aws-machine-learning-university-accelerated-nlp/master/data/final_project/imdb_test.csv', header=0)

train_df.head()


Let's look at the first 10 rows of the dataset.

In [None]:
df.head(10)

Unnamed: 0,reviewText,summary,verified,time,log_votes,isPositive
0,"PURCHASED FOR YOUNGSTER WHO\nINHERITED MY ""TOO...",IDEAL FOR BEGINNER!,True,1361836800,0.0,1.0
1,unable to open or use,Two Stars,True,1452643200,0.0,0.0
2,Waste of money!!! It wouldn't load to my system.,Dont buy it!,True,1433289600,0.0,0.0
3,I attempted to install this OS on two differen...,I attempted to install this OS on two differen...,True,1518912000,0.0,0.0
4,I've spent 14 fruitless hours over the past tw...,Do NOT Download.,True,1441929600,1.098612,0.0
5,I purchased the home and business because I wa...,Quicken home and business not for amatures,True,1335312000,0.0,0.0
6,The download doesn't take long at all. And it'...,Great!,True,1377993600,0.0,1.0
7,This program is positively wonderful for word ...,Terrific for practice.,False,1158364800,2.397895,1.0
8,Fantastic protection!! Great customer support!!,Five Stars,True,1478476800,0.0,1.0
9,Obviously Win 7 now the last great operating s...,Five Stars,True,1471478400,0.0,1.0


## 2. <a name="2">Exploratory data analysis</a>
(<a href="#0">Go to top</a>)

Let's look at the distribution of the __isPositive__ field.

In [None]:
df["isPositive"].value_counts()

1.0    43692
0.0    26308
Name: isPositive, dtype: int64

We can check the number of missing values for each column below.

In [None]:
print(df.isna().sum())

reviewText    11
summary       14
verified       0
time           0
log_votes      0
isPositive     0
dtype: int64


We have missing values in our text fields.

## 3. <a name="3">Text Processing: Stop words removal and stemming</a>
(<a href="#0">Go to top</a>)

In [None]:
# Drop every row that has a missing value in any column
df = df.dropna()
print(df.isna().sum())

reviewText    0
summary       0
verified      0
time          0
log_votes     0
isPositive    0
dtype: int64
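
Since the classifier below only reads __reviewText__, one alternative is to drop rows only when that column is missing. The sketch below contrasts the two options on a toy DataFrame; the toy data and variable names are assumptions made for illustration and are not taken from the review dataset.

```python
import pandas as pd

# Toy data (hypothetical): one row lacks reviewText, another lacks summary
toy = pd.DataFrame({
    "reviewText": ["great", None, "bad", "ok"],
    "summary": ["Five Stars", "Two Stars", None, "Fine"],
    "isPositive": [1.0, 0.0, 0.0, 1.0],
})

# Drops a row if ANY column is missing (what the cell above does)
dropped_any = toy.dropna()

# Drops a row only if reviewText is missing
dropped_text_only = toy.dropna(subset=["reviewText"])

print(len(dropped_any), len(dropped_text_only))
```

With only a handful of missing values in 70,000 rows, either choice should have little effect on the results here.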


In [None]:
!pip install transformers torch scikit-learn


Classical bag-of-words pipelines usually remove stop words and clean the text at this point. The NLTK library provides a list of common stop words; a typical approach is to use that list but keep some of its words, because those words are actually useful for understanding the sentiment of a sentence. This notebook skips that step and passes the raw review text to the BERT tokenizer.
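
A minimal sketch of the stop-word removal described above is shown here. To keep it self-contained, it uses a tiny hand-written stop-word set (`STOP_WORDS`, a hypothetical stand-in for NLTK's much longer English list) and a hypothetical `clean_text` helper. Negations are kept because they change the sentiment of a sentence.

```python
import re

# Hypothetical, deliberately tiny stop-word list for illustration only.
# NLTK's English stop-word list would normally take its place.
STOP_WORDS = {"a", "an", "the", "this", "that", "is", "was", "it",
              "to", "of", "and", "i", "my", "not", "no"}

# Words we keep even though they are on the list, because they carry sentiment
KEEP = {"not", "no"}
ACTIVE_STOP_WORDS = STOP_WORDS - KEEP

def clean_text(text: str) -> str:
    """Lowercase, strip tags and punctuation, then drop stop words."""
    text = text.lower().strip()
    text = re.sub(r"<[^>]+>", " ", text)        # remove HTML-like tags
    text = re.sub(r"[^a-z0-9'\s]", " ", text)   # replace punctuation with spaces
    tokens = text.split()
    return " ".join(t for t in tokens if t not in ACTIVE_STOP_WORDS)

print(clean_text("This is NOT a good product, it was a waste of money!"))
```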

## 4. <a name="4">Train - Validation Split</a>
(<a href="#0">Go to top</a>)

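Let's split our dataset into training (90%) and validation (10%). The held-out 10% is stored in `test_data` and `test_labels` and is used to evaluate the classifiers later.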

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split

# Split the dataset into train and test sets
train_data, test_data, train_labels, test_labels = train_test_split(df["reviewText"], df['isPositive'], test_size=0.1, random_state=42)

## 5. <a name="5">Use BERT for text embeddings</a>
(<a href="#0">Go to top</a>)

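You can use the Hugging Face Transformers library to load a pre-trained BERT model and tokenize your text data. Each review is then represented by the average of its token embeddings from the last hidden layer.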

In [None]:
from transformers import BertTokenizer, BertModel
import torch

# Load BERT tokenizer and model
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

# Tokenize and encode one review; inputs longer than 128 tokens are truncated
def tokenize_text(text):
    inputs = tokenizer(text, padding=True, truncation=True, return_tensors='pt', max_length=128)
    return inputs

# Get the BERT embedding for one review
def get_bert_embeddings(text):
    inputs = tokenize_text(text)
    with torch.no_grad():  # inference only, so no gradients are needed
        outputs = model(**inputs)
    # Average pooling of token embeddings over the sequence dimension (768 values for bert-base)
    return outputs.last_hidden_state.mean(dim=1).squeeze().numpy()

# Convert the train and test data to BERT embeddings
# (one review at a time, which is slow on CPU for a dataset of this size)
train_embeddings = [get_bert_embeddings(text) for text in train_data]
test_embeddings = [get_bert_embeddings(text) for text in test_data]
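
The cell above embeds one review at a time, so no padding tokens are present and a plain mean over tokens is enough. If you batch several reviews together to speed things up, shorter reviews get padded, and a plain mean would include the padding positions. The sketch below shows mean pooling that respects an attention mask. It is written in NumPy with made-up numbers so that it runs without downloading a model; `masked_mean_pool` is a hypothetical helper, and in the notebook the mask would come from the tokenizer output (`inputs['attention_mask']`).

```python
import numpy as np

def masked_mean_pool(token_embeddings: np.ndarray, attention_mask: np.ndarray) -> np.ndarray:
    """Average token vectors per sequence, ignoring padded positions.

    token_embeddings: shape (batch, seq_len, hidden)
    attention_mask:   shape (batch, seq_len), 1 for real tokens, 0 for padding
    """
    mask = attention_mask[:, :, None].astype(token_embeddings.dtype)
    summed = (token_embeddings * mask).sum(axis=1)
    counts = np.clip(mask.sum(axis=1), 1e-9, None)  # avoid division by zero
    return summed / counts

# Toy batch: 2 sequences, 3 token slots, hidden size 2
tokens = np.array([
    [[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]],   # last slot is padding
    [[2.0, 2.0], [4.0, 4.0], [6.0, 6.0]],   # no padding
])
mask = np.array([[1, 1, 0],
                 [1, 1, 1]])

pooled = masked_mean_pool(tokens, mask)
print(pooled)                 # padding slot ignored in the first row
print(tokens.mean(axis=1))    # plain mean, distorted by the padding slot
```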


## 6. <a name="6">Train the classifier</a>
(<a href="#0">Go to top</a>)

Now that we have BERT embeddings for the text data, we can train a KNN model with scikit-learn by calling __.fit()__ on the training embeddings and labels. The cell below also fits a gradient boosting classifier on the same embeddings for comparison.

Together, these cells use BERT for text embeddings and then train a KNN model for sentiment analysis of Amazon product reviews. You can adjust parameters such as `n_neighbors`, and tune the model and preprocessing steps to improve performance.

In [None]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier

from sklearn.metrics import accuracy_score

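# Initialize and train the KNN classifier
knn = KNeighborsClassifier(n_neighbors=10)
knn.fit(train_embeddings, train_labels)

# Optional alternative: random forest
#rf=RandomForestClassifier()
#rf.fit(train_embeddings, train_labels)

# For comparison: scikit-learn's gradient boosting classifier.
# Note: despite the variable name `xgb`, this is not the XGBoost library.
xgb = GradientBoostingClassifier()
xgb.fit(train_embeddings, train_labels)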



## Test the classifier
(<a href="#0">Go to top</a>)

To evaluate the models on sentiment classification, we first compute accuracy for the KNN and gradient boosting classifiers, then generate a classification report and a confusion matrix with scikit-learn.

In [None]:
# Make predictions on the test data with KNN
predictions = knn.predict(test_embeddings)

# Calculate KNN accuracy (first line of output)
accuracy = accuracy_score(test_labels, predictions)
print(f'Accuracy: {accuracy * 100:.2f}%')

# Repeat with the gradient boosting classifier (second line of output)
#predictions = rf.predict(test_embeddings)
predictions = xgb.predict(test_embeddings)
accuracy = accuracy_score(test_labels, predictions)
print(f'Accuracy: {accuracy * 100:.2f}%')

Accuracy: 83.05%
Accuracy: 85.62%


In [None]:
from sklearn.metrics import classification_report, confusion_matrix

# `predictions` still holds the gradient boosting output from the previous cell,
# so the report below describes that model (not KNN)

# Create a classification report
class_report = classification_report(test_labels, predictions, target_names=['negative', 'positive'])

# Create a confusion matrix
conf_matrix = confusion_matrix(test_labels, predictions)

# Print the classification report and confusion matrix
print("Classification Report:")
print(class_report)

print("\nConfusion Matrix:")
print(conf_matrix)


Classification Report:
              precision    recall  f1-score   support

    negative       0.82      0.79      0.80      2583
    positive       0.88      0.90      0.89      4415

    accuracy                           0.86      6998
   macro avg       0.85      0.84      0.84      6998
weighted avg       0.86      0.86      0.86      6998


Confusion Matrix:
[[2028  555]
 [ 451 3964]]
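
To see how the numbers in the classification report follow from the confusion matrix, the sketch below recomputes them by hand. It assumes scikit-learn's layout for `confusion_matrix`, where rows are true classes and columns are predicted classes, in the order negative then positive. The last line adds a majority-class baseline (always predicting positive) for comparison.

```python
# Counts copied from the confusion matrix printed above
tn, fp = 2028, 555    # true negatives, false positives
fn, tp = 451, 3964    # false negatives, true positives
total = tn + fp + fn + tp

accuracy = (tn + tp) / total
precision_pos = tp / (tp + fp)   # of all predicted positives, how many were right
recall_pos = tp / (tp + fn)      # of all true positives, how many were found
precision_neg = tn / (tn + fn)
recall_neg = tn / (tn + fp)

# Accuracy of always predicting the majority class (positive)
majority_baseline = (fn + tp) / total

print(f"accuracy:           {accuracy:.4f}")
print(f"positive precision: {precision_pos:.2f}, recall: {recall_pos:.2f}")
print(f"negative precision: {precision_neg:.2f}, recall: {recall_neg:.2f}")
print(f"majority baseline:  {majority_baseline:.2f}")
```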


## 7. <a name="7">Test the classifier</a>
(<a href="#0">Go to top</a>)

The cells above evaluate the trained classifiers: we call __.predict()__ on the test embeddings and compare the predictions with the true labels.

## 8. <a name="8">Ideas for improvement</a>
(<a href="#0">Go to top</a>)

We can usually improve performance with some additional work. You can try the following:
* Using your validation data, try different K values and look at accuracy for them.
* Change the feature extractor to term frequency (TF) or TF-IDF, and experiment with different vocabulary sizes.
* Come up with other features, such as the presence of certain punctuation marks, all-capitalized words, or specific words that might be useful in this problem.
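
The first idea, trying different K values on validation data, can be sketched as follows. To keep the example fast and self-contained, it uses synthetic features from `make_classification` as a stand-in for the BERT embeddings (an assumption for illustration only); in the notebook you would pass the real embeddings and labels instead. The list `candidate_ks` is also a hypothetical choice.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the embeddings and labels (not the review data)
X, y = make_classification(n_samples=600, n_features=20, n_informative=8,
                           random_state=42)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.1, random_state=42, stratify=y)

candidate_ks = [1, 3, 5, 10, 20]
val_accuracy = {}
for k in candidate_ks:
    clf = KNeighborsClassifier(n_neighbors=k)
    clf.fit(X_train, y_train)
    val_accuracy[k] = clf.score(X_val, y_val)  # mean accuracy on validation data

# Pick the K with the highest validation accuracy
best_k = max(val_accuracy, key=val_accuracy.get)
for k, acc in val_accuracy.items():
    print(f"k={k:>2}  validation accuracy={acc:.3f}")
print("best k:", best_k)
```

Selecting K on the same split that is used for the final accuracy figure makes that figure optimistic, so it helps to keep a separate test split for the final evaluation.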