## Kindle Review Sentiment Analysis
This is a small subset of a dataset of book reviews from the Amazon Kindle Store category.

Content
A 5-core dataset of product reviews from the Amazon Kindle Store category, covering May 1996 to July 2014.

It contains a total of 982,619 entries. Each reviewer has at least 5 reviews and each product has at least 5 reviews in this dataset.

Columns
1. asin - ID of the product, like B000FA64PK
2. helpful - helpfulness rating of the review - example: 2/3.
3. overall - rating of the product (named `rating` in the subset used here).
4. reviewText - full text of the review (body).
5. reviewTime - time of the review (raw).
6. reviewerID - ID of the reviewer, like A3SPTOKDG7WBLN
7. reviewerName - name of the reviewer.
8. summary - short summary of the review (headline).
9. unixReviewTime - unix timestamp.
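
Two of these columns need parsing before they are useful for analysis. The following is a minimal sketch, assuming `helpful` arrives from the CSV as a string such as `[8, 10]` (helpful votes, total votes) and that `unixReviewTime` is in seconds. The two-row `sample` frame and the `helpful_ratio` helper are illustrative and not part of the notebook.

```python
import ast
import pandas as pd

# Two sample rows shaped like the values shown by the notebook's data.head() output.
sample = pd.DataFrame({
    "helpful": ["[8, 10]", "[0, 0]"],
    "unixReviewTime": [1283385600, 1397174400],
})

def helpful_ratio(raw):
    """Return helpful_votes / total_votes, or None when nobody voted."""
    helpful_votes, total_votes = ast.literal_eval(raw)
    return helpful_votes / total_votes if total_votes else None

# Reviews without votes end up as NaN rather than a misleading 0.0.
sample["helpful_ratio"] = sample["helpful"].apply(helpful_ratio)
# Convert the unix timestamp (seconds) into a pandas datetime.
sample["review_date"] = pd.to_datetime(sample["unixReviewTime"], unit="s")
print(sample)
```

Keeping zero-vote reviews as missing values avoids treating "no votes" as "unhelpful" in any later helpfulness analysis.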

Acknowledgements
Julian McAuley, UCSD website. http://jmcauley.ucsd.edu/data/amazon/

All licenses for the data files belong to them.

Inspiration
1. Sentiment analysis based on user reviews.
2. Understanding the usefulness of a review and the factors that influence its helpfulness.
3. Fake review detection.
4. Best rated product IDs, or similarity between products based on reviews alone.
5. Any other interesting analysis.

#### Best Practices
1. Preprocessing and cleaning
2. Train-test split
3. BoW, TF-IDF, Word2Vec
4. Train ML algorithms

In [1]:
import os
print(os.listdir(r"C:\Users\rajro\OneDrive\Desktop\FULL STACK DATA SCIENCE\NLP\Kindle Review"))

['all_kindle_review.csv', 'Kindle Review Sentiment Analyis.ipynb']


In [2]:
# Load the dataset
import pandas as pd
import numpy as np
data = pd.read_csv('all_kindle_review.csv')
data.head()

Unnamed: 0.2,Unnamed: 0.1,Unnamed: 0,asin,helpful,rating,reviewText,reviewTime,reviewerID,reviewerName,summary,unixReviewTime
0,0,11539,B0033UV8HI,"[8, 10]",3,"Jace Rankin may be short, but he's nothing to ...","09 2, 2010",A3HHXRELK8BHQG,Ridley,Entertaining But Average,1283385600
1,1,5957,B002HJV4DE,"[1, 1]",5,Great short read. I didn't want to put it dow...,"10 8, 2013",A2RGNZ0TRF578I,Holly Butler,Terrific menage scenes!,1381190400
2,2,9146,B002ZG96I4,"[0, 0]",3,I'll start by saying this is the first of four...,"04 11, 2014",A3S0H2HV6U1I7F,Merissa,Snapdragon Alley,1397174400
3,3,7038,B002QHWOEU,"[1, 3]",3,Aggie is Angela Lansbury who carries pocketboo...,"07 5, 2014",AC4OQW3GZ919J,Cleargrace,very light murder cozy,1404518400
4,4,1776,B001A06VJ8,"[0, 1]",4,I did not expect this type of book to be in li...,"12 31, 2012",A3C9V987IQHOQD,Rjostler,Book,1356912000


In [3]:
# Take an explicit copy so later column assignments do not raise SettingWithCopyWarning
df=data[['reviewText','rating']].copy()
df.head()

Unnamed: 0,reviewText,rating
0,"Jace Rankin may be short, but he's nothing to ...",3
1,Great short read. I didn't want to put it dow...,5
2,I'll start by saying this is the first of four...,3
3,Aggie is Angela Lansbury who carries pocketboo...,3
4,I did not expect this type of book to be in li...,4


In [4]:
df.shape

(12000, 2)

In [5]:
## Missing Values
df.isnull().sum()

reviewText    0
rating        0
dtype: int64

In [6]:
df['rating'].unique()

array([3, 5, 4, 2, 1], dtype=int64)

In [7]:
df['rating'].value_counts()

rating
5    3000
4    3000
3    2000
2    2000
1    2000
Name: count, dtype: int64

In [8]:
## Preprocessing And Cleaning

In [9]:
## Positive review (rating >= 3) is 1 and negative review (rating < 3) is 0
df['rating']=df['rating'].apply(lambda x:0 if x<3 else 1)



In [10]:
df['rating'].value_counts()

rating
1    8000
0    4000
Name: count, dtype: int64
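
The 8000/4000 split follows from the threshold: ratings 3, 4 and 5 (2000 + 3000 + 3000 reviews) all map to 1, and ratings 1 and 2 map to 0. As a small sketch of the same mapping written as a vectorized comparison instead of `apply` with a lambda (the `ratings` series is toy data for illustration):

```python
import pandas as pd

# Toy ratings covering every value that occurs in the dataset.
ratings = pd.Series([3, 5, 4, 2, 1], name="rating")

# A boolean comparison cast to int gives 1 for rating >= 3 and 0 otherwise.
labels = (ratings >= 3).astype(int)
print(labels.tolist())  # [1, 1, 1, 0, 0]
```

Whether a 3-star review should count as positive is a modelling choice; the notebook treats it as positive.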

In [11]:
## 1. Lowercase all text
df['reviewText']=df['reviewText'].str.lower()



In [12]:
df.head()

Unnamed: 0,reviewText,rating
0,"jace rankin may be short, but he's nothing to ...",1
1,great short read. i didn't want to put it dow...,1
2,i'll start by saying this is the first of four...,1
3,aggie is angela lansbury who carries pocketboo...,1
4,i did not expect this type of book to be in li...,1


In [13]:
!pip install nltk



In [14]:
import re
import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\rajro\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\stopwords.zip.


True

In [15]:
from bs4 import BeautifulSoup

In [16]:
## Build the stopword set once (calling stopwords.words() for every token is very slow)
stop_words = set(stopwords.words('english'))
## Remove URLs first: once ':' and '/' are stripped, the URL pattern can no longer match
df['reviewText']=df['reviewText'].apply(lambda x: re.sub(r'(http|https|ftp|ssh)://([\w_-]+(?:(?:\.[\w_-]+)+))([\w.,@?^=%&:/~+#-]*[\w@?^=%&/~+#-])?', '' , str(x)))
## Remove HTML tags (also before '<' and '>' are stripped)
df['reviewText']=df['reviewText'].apply(lambda x: BeautifulSoup(x, 'lxml').get_text())
## Remove special characters (A-Z rather than A-z, which also matches [ \ ] ^ _ and `)
df['reviewText']=df['reviewText'].apply(lambda x:re.sub(r'[^a-zA-Z0-9 -]+', '',x))
## Remove the stopwords
df['reviewText']=df['reviewText'].apply(lambda x:" ".join([y for y in x.split() if y not in stop_words]))
## Remove any additional spaces
df['reviewText']=df['reviewText'].apply(lambda x: " ".join(x.split()))


In [17]:
df.head()

Unnamed: 0,reviewText,rating
0,jace rankin may short hes nothing mess man hau...,1
1,great short read didnt want put read one sitti...,1
2,ill start saying first four books wasnt expect...,1
3,aggie angela lansbury carries pocketbooks inst...,1
4,expect type book library pleased find price right,1
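
The order of the cleaning steps matters: URL and HTML patterns rely on characters such as `:`, `/`, `<` and `>`, so they only match if they run before special characters are stripped. Below is a minimal standard-library sketch of that ordering as one function. The tiny `STOP_WORDS` set and the simplified URL and tag patterns are assumptions for illustration; the notebook itself uses NLTK's English stopword list and BeautifulSoup.

```python
import re

# Tiny stand-in stopword list (assumption); the notebook uses NLTK's English list.
STOP_WORDS = {"the", "a", "an", "is", "it", "to", "and", "i", "this"}

URL_RE = re.compile(r"(?:http|https|ftp|ssh)://\S+")   # simplified URL pattern
TAG_RE = re.compile(r"<[^>]+>")                         # simplified HTML tag pattern
NON_ALNUM_RE = re.compile(r"[^a-z0-9 -]+")              # keep letters, digits, space, hyphen

def clean_text(text):
    text = text.lower()
    text = URL_RE.sub(" ", text)        # 1. URLs, while ':' and '/' still exist
    text = TAG_RE.sub(" ", text)        # 2. HTML tags, while '<' and '>' still exist
    text = NON_ALNUM_RE.sub("", text)   # 3. remaining special characters
    tokens = [t for t in text.split() if t not in STOP_WORDS]  # 4. stopwords
    return " ".join(tokens)             # 5. joining on single spaces collapses whitespace

print(clean_text("This is a <b>GREAT</b> read! See http://example.com/x?y=1 for more."))
# great read see for more
```

Wrapping the steps in one function also means each review is traversed once instead of once per step.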


In [18]:
## Lemmatizer
from nltk.stem import WordNetLemmatizer

In [19]:
lemmatizer=WordNetLemmatizer()

In [20]:
def lemmatize_words(text):
    return " ".join([lemmatizer.lemmatize(word) for word in text.split()])


In [22]:
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\rajro\AppData\Roaming\nltk_data...


True

In [23]:
df['reviewText']=df['reviewText'].apply(lambda x:lemmatize_words(x))



In [24]:
df.head()

Unnamed: 0,reviewText,rating
0,jace rankin may short he nothing mess man haul...,1
1,great short read didnt want put read one sitti...,1
2,ill start saying first four book wasnt expecti...,1
3,aggie angela lansbury carry pocketbook instead...,1
4,expect type book library pleased find price right,1


In [25]:
## Train Test Split
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test=train_test_split(df['reviewText'],df['rating'],
                                              test_size=0.20)
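
The split above sets no `random_state`, so the rows in each set, and therefore the accuracies reported further down, can change from run to run. It also does not guarantee that the test set keeps the 2:1 class ratio. A minimal sketch of a reproducible, stratified split on toy data follows; the seed 42 and the toy arrays are illustrative choices, not values from the notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data with the same 2:1 positive/negative ratio as the relabelled reviews.
texts = np.array([f"review {i}" for i in range(12)])
labels = np.array([1] * 8 + [0] * 4)

X_tr, X_te, y_tr, y_te = train_test_split(
    texts,
    labels,
    test_size=0.25,
    random_state=42,   # fixed seed makes the split repeatable
    stratify=labels,   # keep the class ratio in both train and test sets
)
print(len(y_tr), len(y_te), int(y_te.sum()))
```

With `stratify`, the three test rows contain two positives and one negative, mirroring the overall ratio.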

In [26]:
from sklearn.feature_extraction.text import CountVectorizer
bow=CountVectorizer()
X_train_bow=bow.fit_transform(X_train).toarray()
X_test_bow=bow.transform(X_test).toarray()

In [27]:
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf=TfidfVectorizer()
X_train_tfidf=tfidf.fit_transform(X_train).toarray()
X_test_tfidf=tfidf.transform(X_test).toarray()

In [28]:
!pip install gensim



In [29]:
from gensim.models import Word2Vec
X_train_tokens = [review.split() for review in X_train]
X_test_tokens = [review.split() for review in X_test]
# Train Word2Vec model
word2vec_model = Word2Vec(sentences=X_train_tokens, vector_size=100, window=5, min_count=2, workers=4)

# Function to compute average Word2Vec for each text
def avg_word2vec(tokens, model, vector_size):
    vectors = [model.wv[word] for word in tokens if word in model.wv]
    if vectors:
        return np.mean(vectors, axis=0)
    else:
        # Return a zero vector if no words are in the Word2Vec vocabulary
        return np.zeros(vector_size)
    
# Compute AvgWord2Vec for training and test sets
X_train_avgw2v = np.array([avg_word2vec(tokens, word2vec_model, 100) for tokens in X_train_tokens])
X_test_avgw2v = np.array([avg_word2vec(tokens, word2vec_model, 100) for tokens in X_test_tokens])
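
The averaging step can be checked by hand on a toy example. This sketch uses a plain dictionary of 2-dimensional vectors as a hypothetical stand-in for `model.wv`, so it runs without gensim; the names `toy_vectors` and `average_vector` are illustrative.

```python
import numpy as np

# Hypothetical 2-dimensional "embeddings" standing in for a trained model's vectors.
toy_vectors = {
    "great": np.array([1.0, 0.0]),
    "book": np.array([0.0, 1.0]),
}

def average_vector(tokens, vectors, vector_size):
    # Only tokens present in the vocabulary contribute to the mean.
    known = [vectors[t] for t in tokens if t in vectors]
    if known:
        return np.mean(known, axis=0)
    # No known tokens: fall back to a zero vector of the right size.
    return np.zeros(vector_size)

print(average_vector(["great", "book", "unseen"], toy_vectors, 2))  # [0.5 0.5]
print(average_vector(["unseen"], toy_vectors, 2))                   # [0. 0.]
```

Because out-of-vocabulary tokens are skipped rather than counted as zeros, a review's vector does not shrink when it contains rare words.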

In [38]:
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, accuracy_score, classification_report
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from gensim.models import Word2Vec
import numpy as np

### Step 1: Bag-of-Words (BoW) Representation

In [39]:
bow_vectorizer = CountVectorizer()
X_train_bow = bow_vectorizer.fit_transform(X_train).toarray()
X_test_bow = bow_vectorizer.transform(X_test).toarray()

### Step 2: TF-IDF Representation

In [40]:
tfidf_vectorizer = TfidfVectorizer()
X_train_tfidf = tfidf_vectorizer.fit_transform(X_train).toarray()
X_test_tfidf = tfidf_vectorizer.transform(X_test).toarray()

### Step 3: Naive Bayes Classifier

In [46]:
nb_model_bow = GaussianNB().fit(X_train_bow, y_train)
nb_model_tfidf = GaussianNB().fit(X_train_tfidf, y_train)

y_pred_bow_nb = nb_model_bow.predict(X_test_bow)
y_pred_tfidf_nb = nb_model_tfidf.predict(X_test_tfidf)

print("Naive Bayes (BoW) Accuracy: {:.2f}".format(accuracy_score(y_test, y_pred_bow_nb)))
print("Naive Bayes (TF-IDF) Accuracy: {:.2f}".format(accuracy_score(y_test, y_pred_tfidf_nb)))


Naive Bayes (BoW) Accuracy: 0.56
Naive Bayes (TF-IDF) Accuracy: 0.56
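
One possible contributor to the low score is the choice of variant: `GaussianNB` models every feature as a continuous Gaussian, which is a poor match for sparse word counts. scikit-learn's `MultinomialNB` is designed for count features and accepts sparse matrices. The sketch below shows the call pattern on toy data only (the texts and labels are invented), so it makes no claim about the accuracy it would reach on the Kindle reviews.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Invented toy reviews: 1 = positive, 0 = negative.
train_texts = [
    "great book loved it",
    "wonderful read great",
    "terrible boring book",
    "awful waste boring",
]
train_labels = [1, 1, 0, 0]

vec = CountVectorizer()
X_counts = vec.fit_transform(train_texts)   # stays sparse, no .toarray() needed

mnb = MultinomialNB()                        # default Laplace smoothing (alpha=1.0)
mnb.fit(X_counts, train_labels)

preds = mnb.predict(vec.transform(["great wonderful", "boring awful"]))
print(preds.tolist())  # [1, 0]
```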


### Step 4: Logistic Regression Classifier

In [47]:
lr_model_bow = LogisticRegression(max_iter=1000, random_state=42)
lr_model_bow.fit(X_train_bow, y_train)

lr_model_tfidf = LogisticRegression(max_iter=1000, random_state=42)
lr_model_tfidf.fit(X_train_tfidf, y_train)

y_pred_bow_lr = lr_model_bow.predict(X_test_bow)
y_pred_tfidf_lr = lr_model_tfidf.predict(X_test_tfidf)

print("Logistic Regression (BoW) Accuracy: {:.2f}".format(accuracy_score(y_test, y_pred_bow_lr)))
print("Logistic Regression (TF-IDF) Accuracy: {:.2f}".format(accuracy_score(y_test, y_pred_tfidf_lr)))

Logistic Regression (BoW) Accuracy: 0.84
Logistic Regression (TF-IDF) Accuracy: 0.83
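
The vectorizer and classifier can also be bundled into a single scikit-learn pipeline, which guarantees that the vocabulary is learned from the training texts only and lets the model work on raw strings. This is a sketch on invented toy data, reusing the notebook's `max_iter=1000` and `random_state=42`; the variable names are illustrative.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Invented toy reviews: 1 = positive, 0 = negative.
train_texts = [
    "great book loved",
    "wonderful read great",
    "terrible book boring",
    "awful waste boring",
]
train_labels = [1, 1, 0, 0]

# fit() runs TF-IDF fit_transform and then trains the classifier on the sparse output.
text_clf = make_pipeline(
    TfidfVectorizer(),
    LogisticRegression(max_iter=1000, random_state=42),
)
text_clf.fit(train_texts, train_labels)

# predict() applies the already-fitted vectorizer before classifying.
pipeline_preds = text_clf.predict(["great wonderful", "boring awful"])
print(pipeline_preds.tolist())
```

A fitted pipeline is also a single object to save and reload, which keeps the vectorizer and the model in sync.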


### Step 5: Random Forest Classifier

In [48]:
rf_model_bow = RandomForestClassifier(random_state=42)
rf_model_bow.fit(X_train_bow, y_train)

rf_model_tfidf = RandomForestClassifier(random_state=42)
rf_model_tfidf.fit(X_train_tfidf, y_train)

y_pred_bow_rf = rf_model_bow.predict(X_test_bow)
y_pred_tfidf_rf = rf_model_tfidf.predict(X_test_tfidf)

print("Random Forest (BoW) Accuracy: {:.2f}".format(accuracy_score(y_test, y_pred_bow_rf)))
print("Random Forest (TF-IDF) Accuracy: {:.2f}".format(accuracy_score(y_test, y_pred_tfidf_rf)))

Random Forest (BoW) Accuracy: 0.80
Random Forest (TF-IDF) Accuracy: 0.79


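Among the BoW and TF-IDF models, logistic regression gives the best accuracy on this data (0.84 with BoW, 0.83 with TF-IDF), ahead of random forest (0.80 and 0.79) and Gaussian Naive Bayes (0.56 for both). For reference, always predicting the positive class would score roughly 0.67 given the 8000/4000 class balance, so the Naive Bayes models fall below that baseline.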
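
Accuracy alone can hide how a model treats the minority class. The notebook imports `confusion_matrix` and `classification_report` but reports only accuracy, so here is a minimal sketch of how they could be used alongside a majority-class baseline. The `y_true_demo` and `y_pred_demo` arrays are invented; in the notebook they would be `y_test` and one of the `y_pred_*` arrays.

```python
import numpy as np
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Invented labels and predictions with a 2:1 positive/negative ratio.
y_true_demo = np.array([1, 1, 1, 1, 0, 0])
y_pred_demo = np.array([1, 1, 1, 0, 0, 1])

# Rows are true classes (0, 1); columns are predicted classes (0, 1).
cm = confusion_matrix(y_true_demo, y_pred_demo)
acc = accuracy_score(y_true_demo, y_pred_demo)

# Accuracy obtained by always predicting the more frequent class.
positive_share = np.mean(y_true_demo)
majority_baseline = max(positive_share, 1 - positive_share)

print(cm)
print("accuracy:", round(acc, 3), "baseline:", round(majority_baseline, 3))
print(classification_report(y_true_demo, y_pred_demo))
```

In this toy case the model's accuracy equals the baseline, which is exactly the situation the per-class precision and recall in the report help to expose.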

In [49]:
# Train Word2Vec model on the training data
w2v_model = Word2Vec(sentences=[text.split() for text in X_train], vector_size=100, window=5, min_count=1, workers=4)

# Function to compute average Word2Vec vectors for a dataset
def compute_avg_w2v_vectors(corpus, model, vector_size):
    avg_vectors = []
    for sentence in corpus:
        words = sentence.split()
        word_vectors = [model.wv[word] for word in words if word in model.wv]
        if word_vectors:
            avg_vectors.append(np.mean(word_vectors, axis=0))
        else:
            avg_vectors.append(np.zeros(vector_size))
    return np.array(avg_vectors)


### Generate AvgWord2Vec vectors for training and test sets

In [50]:
X_train_w2v = compute_avg_w2v_vectors(X_train, w2v_model, vector_size=100)
X_test_w2v = compute_avg_w2v_vectors(X_test, w2v_model, vector_size=100)

### Step 6: ML Models on AvgWord2Vec
### Logistic Regression

In [51]:
lr_model_w2v = LogisticRegression(max_iter=1000, random_state=42)
lr_model_w2v.fit(X_train_w2v, y_train)
y_pred_w2v_lr = lr_model_w2v.predict(X_test_w2v)
print("Logistic Regression (AvgWord2Vec) Accuracy: {:.2f}".format(accuracy_score(y_test, y_pred_w2v_lr)))

Logistic Regression (AvgWord2Vec) Accuracy: 0.76


### Random Forest

In [52]:
rf_model_w2v = RandomForestClassifier(random_state=42)
rf_model_w2v.fit(X_train_w2v, y_train)
y_pred_w2v_rf = rf_model_w2v.predict(X_test_w2v)
print("Random Forest (AvgWord2Vec) Accuracy: {:.2f}".format(accuracy_score(y_test, y_pred_w2v_rf)))

Random Forest (AvgWord2Vec) Accuracy: 0.76
