# Kindle Review Sentiment Analysis


## Dataset Overview  
### Context  
This dataset is a small subset of Amazon Kindle Store book reviews.  

### Content  
The full Kindle Store dataset includes 982,619 product reviews from May 1996 to July 2014, in which each reviewer and each product has at least 5 reviews. The file loaded in this notebook, `all_kindle_review.csv`, contains 12,000 of them.  

#### Columns:  
- **asin**: Unique product ID (e.g., `B000FA64PK`).  
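- **helpful**: Helpfulness votes for the review (e.g., `[8, 10]`, meaning 8 of 10 voters found it helpful).  
- **overall**: Product rating from 1 to 5 (this column is named `rating` in the CSV loaded below).  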
- **reviewText**: Full text of the review.  
- **reviewTime**: Review timestamp (raw format).  
- **reviewerID**: Unique reviewer ID (e.g., `A3SPTOKDG7WBLN`).  
- **reviewerName**: Name of the reviewer.  
- **summary**: Summary or title of the review.  
- **unixReviewTime**: Unix timestamp of the review.  

### Acknowledgements  
This dataset was sourced from the Amazon product data repository by Julian McAuley (UCSD). Full dataset details are available [here](http://jmcauley.ucsd.edu/data/amazon/).  

### Potential Analyses  
- **Sentiment Analysis**: Extract and analyze sentiments from review texts.  
- **Helpfulness Analysis**: Understand factors influencing review helpfulness.  
- **Outlier Detection**: Identify fake or unusual reviews.  
- **Product Insights**: Find top-rated products or analyze product similarities based on reviews.  
- **Other Explorations**: Discover unique patterns or insights from the data.  


### Best Practices

1. Preprocessing and cleaning
2. Train-test split
3. BoW, TF-IDF, Word2Vec
4. Train ML algorithms

---

## Loading the Dataset


In [1]:
# Loading the dataset
import pandas as pd

# Define the path to the dataset
file_path = r"D:\Repositories\Data-Science\NLP_Kindle_Review_Sentiment_Analysis\data\all_kindle_review.csv"

# Load the dataset into a DataFrame
data = pd.read_csv(file_path)

# Display the first five rows
data.head()


| | Unnamed: 0.1 | Unnamed: 0 | asin | helpful | rating | reviewText | reviewTime | reviewerID | reviewerName | summary | unixReviewTime |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 11539 | B0033UV8HI | [8, 10] | 3 | Jace Rankin may be short, but he's nothing to ... | 09 2, 2010 | A3HHXRELK8BHQG | Ridley | Entertaining But Average | 1283385600 |
| 1 | 1 | 5957 | B002HJV4DE | [1, 1] | 5 | Great short read. I didn't want to put it dow... | 10 8, 2013 | A2RGNZ0TRF578I | Holly Butler | Terrific menage scenes! | 1381190400 |
| 2 | 2 | 9146 | B002ZG96I4 | [0, 0] | 3 | I'll start by saying this is the first of four... | 04 11, 2014 | A3S0H2HV6U1I7F | Merissa | Snapdragon Alley | 1397174400 |
| 3 | 3 | 7038 | B002QHWOEU | [1, 3] | 3 | Aggie is Angela Lansbury who carries pocketboo... | 07 5, 2014 | AC4OQW3GZ919J | Cleargrace | very light murder cozy | 1404518400 |
| 4 | 4 | 1776 | B001A06VJ8 | [0, 1] | 4 | I did not expect this type of book to be in li... | 12 31, 2012 | A3C9V987IQHOQD | Rjostler | Book | 1356912000 |
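
The `Unnamed: 0.1` and `Unnamed: 0` columns in the output above look like index columns left over from earlier CSV exports. A minimal sketch of one way to drop them, assuming they carry no information needed later; the in-memory CSV and the names `csv_text`, `raw`, and `tidy` are illustrative only:

```python
import io
import pandas as pd

# Tiny in-memory stand-in for the real file (two rows, four columns).
csv_text = (
    "Unnamed: 0.1,Unnamed: 0,asin,rating\n"
    "0,11539,B0033UV8HI,3\n"
    "1,5957,B002HJV4DE,5\n"
)
raw = pd.read_csv(io.StringIO(csv_text))

# Keep only the columns whose name does not start with "Unnamed".
tidy = raw.loc[:, ~raw.columns.str.startswith("Unnamed")]
print(tidy.columns.tolist())  # ['asin', 'rating']
```

Since only `reviewText` and `rating` are selected in the next step, this cleanup is optional for the analysis that follows.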


## Selecting Relevant Columns


In [4]:
# Selecting relevant columns: 'reviewText' and 'rating'
df = data[['reviewText', 'rating']]

# Display the first five rows of the selected DataFrame
df.head()


| | reviewText | rating |
|---|---|---|
| 0 | Jace Rankin may be short, but he's nothing to ... | 3 |
| 1 | Great short read. I didn't want to put it dow... | 5 |
| 2 | I'll start by saying this is the first of four... | 3 |
| 3 | Aggie is Angela Lansbury who carries pocketboo... | 3 |
| 4 | I did not expect this type of book to be in li... | 4 |


## Checking the Shape of the Dataset


In [5]:
# Check the shape of the DataFrame (rows, columns)
df.shape


(12000, 2)

## Checking for Missing Values


In [6]:
# Check for missing values in the DataFrame
df.isnull().sum()


reviewText    0
rating        0
dtype: int64

## Checking Unique Values in the Rating Column


In [7]:
# Check unique values in the 'rating' column
df['rating'].unique()


array([3, 5, 4, 2, 1], dtype=int64)

## Counting the Frequency of Each Rating Value


In [8]:
# Count the frequency of each value in the 'rating' column
df['rating'].value_counts()

rating
5    3000
4    3000
3    2000
2    2000
1    2000
Name: count, dtype: int64

## Preprocessing And Cleaning

### Converting Ratings to Binary Labels (Positive: 1, Negative: 0)


In [9]:
# Convert ratings to binary labels: Positive (1) if rating >= 3, otherwise Negative (0)
df['rating'] = df['rating'].apply(lambda x: 0 if x < 3 else 1)


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df['rating'] = df['rating'].apply(lambda x: 0 if x < 3 else 1)
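
The warning above appears because `df` was created by selecting columns from `data`, so pandas cannot tell whether the assignment is meant to reach the original frame. A minimal sketch of one common way to avoid it, assuming an independent copy is what is wanted; the three-row frame is made-up illustration data:

```python
import pandas as pd

# Made-up stand-in for the loaded dataset.
data = pd.DataFrame({
    "reviewText": ["Loved it", "Not for me", "It was fine"],
    "rating": [5, 1, 3],
    "asin": ["A", "B", "C"],  # placeholder product IDs
})

# .copy() makes df an independent DataFrame, so later assignments are unambiguous.
df = data[["reviewText", "rating"]].copy()

# Vectorized equivalent of the lambda: 1 if rating >= 3, else 0.
df["rating"] = (df["rating"] >= 3).astype(int)
print(df["rating"].tolist())  # [1, 0, 1]
```

The original `data` frame keeps its 1 to 5 ratings, which is useful if the threshold needs to be revisited later.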


## Frequency of Binary Rating Labels


In [10]:
# Count the frequency of binary labels in the 'rating' column
df['rating'].value_counts()


rating
1    8000
0    4000
Name: count, dtype: int64

## Converting All Review Text to Lowercase


In [12]:
# Convert all text in the 'reviewText' column to lowercase
df['reviewText'] = df['reviewText'].str.lower()
df.head()


| | reviewText | rating |
|---|---|---|
| 0 | jace rankin may be short, but he's nothing to ... | 1 |
| 1 | great short read. i didn't want to put it dow... | 1 |
| 2 | i'll start by saying this is the first of four... | 1 |
| 3 | aggie is angela lansbury who carries pocketboo... | 1 |
| 4 | i did not expect this type of book to be in li... | 1 |


## Importing Libraries and Downloading Stopwords


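In [None]:
# Import required libraries
import re
import nltk
from nltk.corpus import stopwords

# Download the NLTK resources used below (stopword list and WordNet data)
nltk.download('stopwords')
nltk.download('wordnet')

# Import BeautifulSoup for parsing HTML content
from bs4 import BeautifulSoup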


## Cleaning Review Text
- Removing special characters
- Removing stopwords
- Removing URLs
- Removing HTML tags
- Removing extra spaces


In [15]:
# Removing special characters
df['reviewText'] = df['reviewText'].apply(lambda x: re.sub(r'[^a-zA-Z0-9\s-]+', '', x))

# Removing stopwords
stop_words = set(stopwords.words('english'))
df['reviewText'] = df['reviewText'].apply(lambda x: " ".join([word for word in x.split() if word not in stop_words]))

# Removing URLs
df['reviewText'] = df['reviewText'].apply(lambda x: re.sub(r'(http|https|ftp|ssh)://[\w\.-]+(?:/[\w\./-]*)?', '', str(x)))

# Removing HTML tags
df['reviewText'] = df['reviewText'].apply(lambda x: BeautifulSoup(x, 'lxml').get_text())

# Removing additional spaces
df['reviewText'] = df['reviewText'].apply(lambda x: " ".join(x.split()))
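
The order of these steps matters. Because special characters are stripped first, the later URL and HTML patterns have little left to match: `://`, `<`, and `>` are already gone by then. A minimal sketch of the same steps in a different order follows. It assumes a simple regex is an acceptable stand-in for BeautifulSoup and uses a tiny hypothetical stopword set instead of the NLTK list so that it runs without downloads; `clean_review` and `demo_stop_words` are illustrative names, not part of this notebook.

```python
import re

URL_RE = re.compile(r'(?:https?|ftp|ssh)://\S+')
TAG_RE = re.compile(r'<[^>]+>')            # simplified tag matcher (assumption)
NON_ALNUM_RE = re.compile(r'[^a-z0-9\s-]+')

def clean_review(text, stop_words):
    """Lowercase, then strip URLs, HTML tags, special characters, and stopwords."""
    text = text.lower()
    text = URL_RE.sub(' ', text)        # URLs first, while '://' is still intact
    text = TAG_RE.sub(' ', text)        # then tags, while '<' and '>' are intact
    text = NON_ALNUM_RE.sub('', text)   # then any remaining special characters
    tokens = [word for word in text.split() if word not in stop_words]
    return ' '.join(tokens)             # split/join also collapses extra spaces

demo_stop_words = {'the', 'a', 'for', 'is'}  # hypothetical, not the NLTK list
sample = 'The <b>plot</b> is GREAT! Details: https://example.com/review?id=1'
print(clean_review(sample, demo_stop_words))  # plot great details
```

The results reported later in this notebook were produced with the cell above, not with this sketch.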



## Text Preprocessing: Lemmatization and WordNet Lemmatizer
- Initializing WordNet Lemmatizer
- Defining Lemmatization Function
- Applying Lemmatization on Review Text


In [17]:
# Import WordNetLemmatizer from NLTK
from nltk.stem import WordNetLemmatizer

# Initialize the WordNetLemmatizer
lemmatizer = WordNetLemmatizer()

# Define a function to lemmatize words in the text
def lemmatize_words(text):
    return " ".join([lemmatizer.lemmatize(word) for word in text.split()])

# Apply the lemmatization function to the 'reviewText' column
df['reviewText'] = df['reviewText'].apply(lemmatize_words)




## Train-Test Split
- Splitting the dataset into training and testing sets
- 80% for training, 20% for testing


In [18]:
# Import train_test_split from scikit-learn
from sklearn.model_selection import train_test_split

# Split the dataset into training and testing sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(df['reviewText'], df['rating'], test_size=0.20)
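
The split above is random on every run and does not preserve the 2:1 class ratio. A minimal sketch of a reproducible, stratified split on toy data, assuming a fixed seed is acceptable; the seed value `42`, the toy `texts`/`labels`, and the 25% test size are arbitrary choices for illustration:

```python
from sklearn.model_selection import train_test_split

# Toy data with the same 2:1 positive-to-negative ratio as the binary labels.
texts = [f"review {i}" for i in range(12)]
labels = [1] * 8 + [0] * 4

X_tr, X_te, y_tr, y_te = train_test_split(
    texts,
    labels,
    test_size=0.25,      # 3 of the 12 toy reviews
    stratify=labels,     # keep the class ratio the same in both subsets
    random_state=42,     # arbitrary seed so the split can be reproduced
)
print(len(y_tr), len(y_te))  # 9 3
print(sum(y_te))             # 2 positives among the 3 test reviews
```

Using such a split in this notebook would change the exact numbers reported below, since those come from an unseeded split.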


## Bag of Words (BoW) Transformation
- Converting text data into numerical format using CountVectorizer
- Training set and test set transformation


In [19]:
# Import CountVectorizer from scikit-learn
from sklearn.feature_extraction.text import CountVectorizer

# Initialize CountVectorizer
bow = CountVectorizer()

# Learn the vocabulary from the training data only, then transform both sets (Bag of Words)
X_train_bow = bow.fit_transform(X_train).toarray()
X_test_bow = bow.transform(X_test).toarray()


## TF-IDF Transformation
- Converting text data into numerical format using TfidfVectorizer
- Training set and test set transformation


In [20]:
# Import TfidfVectorizer from scikit-learn
from sklearn.feature_extraction.text import TfidfVectorizer

# Initialize TfidfVectorizer
tfidf = TfidfVectorizer()

# Learn the vocabulary and IDF weights from the training data only, then transform both sets (TF-IDF)
X_train_tfidf = tfidf.fit_transform(X_train).toarray()
X_test_tfidf = tfidf.transform(X_test).toarray()
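
As a rough illustration of what the TF-IDF weights mean, the sketch below computes them by hand for a toy corpus. It follows the formula scikit-learn documents for `TfidfVectorizer` with default settings, i.e. smoothed IDF `ln((1 + n) / (1 + df)) + 1` followed by L2 normalization of each document vector; the corpus and helper name `tfidf_vector` are made up for the example.

```python
import math
from collections import Counter

# Toy corpus, already tokenized.
docs = [["great", "book", "great", "plot"], ["boring", "plot"]]
n_docs = len(docs)

# Document frequency: in how many documents does each term appear?
doc_freq = Counter(term for doc in docs for term in set(doc))

# Smoothed inverse document frequency.
idf = {term: math.log((1 + n_docs) / (1 + df)) + 1 for term, df in doc_freq.items()}

def tfidf_vector(doc):
    """Return L2-normalized TF-IDF weights for one tokenized document."""
    counts = Counter(doc)
    raw = {term: count * idf[term] for term, count in counts.items()}
    norm = math.sqrt(sum(weight ** 2 for weight in raw.values()))
    return {term: weight / norm for term, weight in raw.items()}

weights = tfidf_vector(docs[0])
print(idf["plot"])  # 1.0, because "plot" occurs in every document
print(sorted(weights, key=weights.get, reverse=True))  # ['great', 'book', 'plot']
```

Words that occur in every document, such as "plot" here, receive the smallest IDF, so frequent but uninformative words are down-weighted relative to raw counts.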


## Visualizing the Bag of Words Transformation (Training Set)
- Displaying the numerical representation of the training data


In [21]:
# Display the Bag of Words transformation of the training data
X_train_bow


array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]], dtype=int64)

In [22]:
# Get the feature names (words) from the Bag of Words model
feature_names = bow.get_feature_names_out()

# Convert the matrix to a DataFrame with feature names as columns
X_train_bow_df = pd.DataFrame(X_train_bow, columns=feature_names)

# Display the first few rows of the Bag of Words DataFrame
X_train_bow_df.head()


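| | 00 | 000 | 002 | 007 | 04 | 05 | 05142012 | 06 | 067day12262010 | 07 | ... | zoological | zoom | zoomed | zorn | zorns | zsadist | zsadists | zulu | zwhy | zyra |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |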


## Training Naive Bayes Classifier
- Training Gaussian Naive Bayes model using Bag of Words (BoW) representation
- Training Gaussian Naive Bayes model using TF-IDF representation


In [23]:
# Import GaussianNB from scikit-learn
from sklearn.naive_bayes import GaussianNB

# Train Naive Bayes model on Bag of Words features (X_train_bow)
nb_model_bow = GaussianNB().fit(X_train_bow, y_train)

# Train Naive Bayes model on TF-IDF features (X_train_tfidf)
nb_model_tfidf = GaussianNB().fit(X_train_tfidf, y_train)
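
`GaussianNB` models every feature as a continuous, normally distributed variable, which is a loose fit for sparse word counts. scikit-learn also provides `MultinomialNB`, which is designed for count features and accepts sparse matrices directly, so no `.toarray()` call is needed. A minimal sketch on made-up toy reviews follows; it is an alternative to try, not the method used in this notebook, and no claim is made here about how it would score on the Kindle data.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Made-up toy training data: 1 = positive, 0 = negative.
train_texts = [
    "loved this book",
    "great story loved it",
    "boring and slow",
    "terrible boring plot",
]
train_labels = [1, 1, 0, 0]

# The pipeline vectorizes the text and feeds the sparse counts to the classifier.
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(train_texts, train_labels)

pred = model.predict(["loved the story", "slow boring book"])
print(pred)  # [1 0]
```

Wrapping the vectorizer and classifier in one pipeline also guarantees that the vocabulary is learned from the training texts only.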


## Model Evaluation - Naive Bayes (BoW)
- Evaluating the Naive Bayes model performance using:
  - Confusion Matrix
  - Accuracy Score
  - Classification Report


In [28]:
# Make sure the prediction is done first
y_pred_bow = nb_model_bow.predict(X_test_bow)  # Making predictions with Naive Bayes model

# Importing evaluation metrics
from sklearn.metrics import confusion_matrix, accuracy_score, classification_report

# Confusion Matrix
conf_matrix_bow = confusion_matrix(y_test, y_pred_bow)
print("Confusion Matrix (BoW):\n", conf_matrix_bow)

# Accuracy Score
accuracy_bow = accuracy_score(y_test, y_pred_bow)
print("\nAccuracy Score (BoW):", accuracy_bow)

# Classification Report
class_report_bow = classification_report(y_test, y_pred_bow)
print("\nClassification Report (BoW):\n", class_report_bow)


Confusion Matrix (BoW):
 [[504 309]
 [719 868]]

Accuracy Score (BoW): 0.5716666666666667

Classification Report (BoW):
               precision    recall  f1-score   support

           0       0.41      0.62      0.50       813
           1       0.74      0.55      0.63      1587

    accuracy                           0.57      2400
   macro avg       0.57      0.58      0.56      2400
weighted avg       0.63      0.57      0.58      2400
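
The figures in the report can be reproduced directly from the confusion matrix, which helps when reading them. The sketch below recomputes them and adds a majority-class baseline for comparison, i.e. the accuracy obtained by always predicting the more frequent class; the variable names are illustrative.

```python
# Confusion matrix printed above: rows are true classes, columns are predictions.
conf = [[504, 309],
        [719, 868]]
tn, fp = conf[0]   # true class 0: predicted 0, predicted 1
fn, tp = conf[1]   # true class 1: predicted 0, predicted 1
total = tn + fp + fn + tp

accuracy = (tn + tp) / total
precision_pos = tp / (tp + fp)   # of predicted positives, how many are positive
recall_pos = tp / (tp + fn)      # of true positives, how many were found
recall_neg = tn / (tn + fp)      # of true negatives, how many were found
majority_baseline = max(tn + fp, fn + tp) / total

print(f"accuracy          {accuracy:.4f}")           # 0.5717
print(f"precision class 1 {precision_pos:.2f}")      # 0.74
print(f"recall class 1    {recall_pos:.2f}")         # 0.55
print(f"recall class 0    {recall_neg:.2f}")         # 0.62
print(f"majority baseline {majority_baseline:.4f}")  # 0.6613
```

The baseline shows that always predicting "positive" would score about 66% on this test set, which is the bar a useful model needs to clear.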




## Model Evaluation - Naive Bayes (TF-IDF)
- Evaluating the Naive Bayes model performance using the Confusion Matrix.
- The confusion matrix will help identify the number of correct and incorrect classifications made by the model.


In [30]:
# Making predictions with Naive Bayes model (TF-IDF)
y_pred_tfidf = nb_model_tfidf.predict(X_test_tfidf)

# Importing necessary evaluation metrics
from sklearn.metrics import confusion_matrix

# Confusion Matrix
conf_matrix_tfidf = confusion_matrix(y_test, y_pred_tfidf)
print("Confusion Matrix (TF-IDF):\n", conf_matrix_tfidf)


Confusion Matrix (TF-IDF):
 [[494 319]
 [702 885]]


In [31]:
print("TFIDF accuracy: ", accuracy_score(y_test, y_pred_tfidf))

TFIDF accuracy:  0.5745833333333333
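
To judge how meaningful the gap between the two accuracies is, it helps to express it in test reviews and compare it with the sampling noise of an accuracy estimate. The sketch below uses a simple binomial standard error, which is only a rough guide: it assumes independent test predictions and ignores that both models were scored on the same test set.

```python
import math

n_test = 2400
correct_bow = 504 + 868      # diagonal of the BoW confusion matrix
correct_tfidf = 494 + 885    # diagonal of the TF-IDF confusion matrix

gap_reviews = correct_tfidf - correct_bow
gap_accuracy = gap_reviews / n_test

# Binomial standard error of an accuracy near 57% measured on 2,400 reviews.
acc = correct_tfidf / n_test
std_err = math.sqrt(acc * (1 - acc) / n_test)

print(gap_reviews)              # 7
print(round(gap_accuracy, 4))   # 0.0029
print(round(std_err, 4))        # 0.0101
```

The gap of 7 reviews (about 0.3 percentage points) is well under one standard error (about 1 percentage point), so the two scores are effectively indistinguishable on this split.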


## Conclusion - Sentiment Analysis Evaluation

This project evaluated **Gaussian Naive Bayes** classifiers for binary sentiment classification (positive vs. negative reviews) using two text vectorization techniques: **Bag of Words (BoW)** and **TF-IDF**. The findings on the 2,400-review test set are below:

### Bag of Words (BoW)
- **Accuracy Score**: 57.17%
- **Confusion Matrix**:
  - True Negatives (Correctly classified as Negative): 504
  - False Positives (Incorrectly classified as Positive): 309
  - False Negatives (Incorrectly classified as Negative): 719
  - True Positives (Correctly classified as Positive): 868
- **Classification Report**:
  - **Precision** for class 1 (positive sentiment) is 0.74, so a positive prediction is usually correct, but **recall** is only 0.55: the model misses 719 of the 1,587 positive reviews.
  - **Recall** for class 0 (negative sentiment) is 0.62, but precision is only 0.41: most of the reviews the model labels negative are actually positive.

### TF-IDF
- **Accuracy Score**: 57.46%
- **Confusion Matrix**:
  - True Negatives: 494
  - False Positives: 319
  - False Negatives: 702
  - True Positives: 885
- **Per-Class Metrics** (derived from the confusion matrix, since no classification report was printed for this model):
  - Accuracy is marginally higher than with BoW (57.46% vs. 57.17%), a difference of 7 of the 2,400 test reviews. For class 1 (positive sentiment), precision is about 0.74 (885/1,204) and recall about 0.56 (885/1,587), so recall remains well below precision.
  - Recall for class 0 (negative sentiment) is about 0.61 (494/813), slightly lower than the 0.62 obtained with BoW.

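### Insights
- Both models performed poorly. Their accuracy of roughly 57% is below the 66% (1,587 of 2,400) that would be obtained by always predicting the majority positive class. Both **over-predict the negative class** (class 0), labelling about half of the test reviews negative even though negative reviews are the minority (4,000 of the 12,000 reviews, and 813 of the 2,400 test reviews). A likely contributor is that Gaussian Naive Bayes treats each feature as a continuous, normally distributed variable, which fits sparse word counts poorly.
- The **TF-IDF model scored slightly higher than BoW** in overall accuracy (by about 0.29 percentage points, or 7 test reviews). A gap this small, measured on a single random split, is not enough to conclude that TF-IDF weighting is more effective than raw counts for this task.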

### Next Steps and Recommendations
1. **Better-Suited Classical Models**: Try classifiers that handle sparse text features well, such as **Multinomial Naive Bayes**, **Logistic Regression**, **Support Vector Machines (SVM)**, or **Gradient Boosting Machines (GBM)**, which may provide better classification performance.
2. **Deep Learning Models**: Consider sequence models such as **LSTM (Long Short-Term Memory)** networks, which use the word order that BoW and TF-IDF discard.
3. **Fine-tuning Pretrained Models**: Utilize pre-trained transformer models, such as **BERT (Bidirectional Encoder Representations from Transformers)** or RoBERTa, which have shown strong results in sentiment analysis tasks and can be fine-tuned on these reviews.
4. **Handling Class Imbalance**: Use class weighting, or data augmentation techniques such as **back-translation** or **data synthesis**, to offset the 2:1 ratio of positive to negative reviews and improve the model's ability to generalize across both sentiments.

These strategies may improve the sentiment classifier; each should be checked against the majority-class baseline on a held-out test set.
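
As a starting point for the first recommendation, the sketch below wires TF-IDF features into a Logistic Regression classifier. It is a sketch under stated assumptions, not a tuned model: the toy reviews are made up, and `class_weight="balanced"` and `max_iter=1000` are illustrative settings rather than values validated on the Kindle data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Made-up toy training data: 1 = positive, 0 = negative.
train_texts = [
    "loved this book",
    "great story loved it",
    "wonderful characters",
    "boring and slow",
    "terrible boring plot",
]
train_labels = [1, 1, 1, 0, 0]

clf = make_pipeline(
    TfidfVectorizer(),
    # "balanced" reweights classes inversely to their frequency.
    LogisticRegression(class_weight="balanced", max_iter=1000),
)
clf.fit(train_texts, train_labels)

# Columns follow clf.classes_, i.e. [P(negative), P(positive)].
probs = clf.predict_proba(["loved the story"])
print(probs.shape)  # (1, 2)
```

On the real data, the pipeline would be fitted on `X_train` and `y_train` and scored on `X_test` with the same metrics used above.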
