## Amazon Kindle Book Review for Sentiment Analysis



Downloaded from: https://www.kaggle.com/datasets/meetnagadia/amazon-kindle-book-review-for-sentiment-analysis

A 5-core dataset of product reviews from the Amazon Kindle Store category, covering May 1996 to July 2014. It contains a total of 982,619 entries. Each reviewer has at least 5 reviews and each product has at least 5 reviews in this dataset.

<h3>Columns</h3>

<ul>
<li>asin - ID of the product, like B000FA64PK</li>
<li>helpful - helpfulness rating of the review, e.g. 2/3</li>
<li>overall - rating of the product (this column is named rating in the CSV loaded below)</li>
<li>reviewText - full text of the review</li>
<li>reviewTime - time of the review (raw)</li>
<li>reviewerID - ID of the reviewer, like A3SPTOKDG7WBLN</li>
<li>reviewerName - name of the reviewer</li>
<li>summary - short summary (headline) of the review</li>
<li>unixReviewTime - Unix timestamp</li>
</ul>
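
In the preview of the CSV further down, `helpful` shows up as a string such as `[8, 10]` rather than `2/3`. A minimal sketch of turning that string into a ratio, assuming the first number is the count of helpful votes and the second is the total number of votes (the column list does not spell this out); `helpful_ratio` is a hypothetical helper, not part of the notebook:

```python
import ast

def helpful_ratio(raw):
    """Return helpful_votes / total_votes, or None when nobody voted."""
    # Assumption: the string holds [helpful_votes, total_votes].
    helpful_votes, total_votes = ast.literal_eval(raw)  # "[8, 10]" -> [8, 10]
    if total_votes == 0:
        return None  # avoid dividing by zero for reviews without votes
    return helpful_votes / total_votes

# Values taken from the first rows of the preview.
samples = ["[8, 10]", "[1, 1]", "[0, 0]", "[1, 3]"]
ratios = [helpful_ratio(s) for s in samples]
print(ratios)
```

Returning `None` for `[0, 0]` keeps "no votes" distinct from "voted unhelpful", which may matter when studying what makes a review helpful.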

<h3>Inspiration</h3>
<p>- Sentiment analysis on reviews.<br>
- Understanding how people rate the usefulness of a review, and which factors influence its helpfulness.<br>
- Fake reviews and outliers.<br>
- Best-rated product IDs, or similarity between products based on reviews alone (admittedly not the best idea).<br>
- Any other interesting analysis.</p>

#### The aim of the project is to follow best practices
#### Steps to solve
- Preprocessing and cleaning
- Train Test Split
- Convert to vectors: BoW, TF-IDF, Word2Vec
- Train ML algorithm

In [1]:
import pandas as pd

df = pd.read_csv('resources/9.all_kindle_review .csv')
df.head()

Unnamed: 0.2,Unnamed: 0.1,Unnamed: 0,asin,helpful,rating,reviewText,reviewTime,reviewerID,reviewerName,summary,unixReviewTime
0,0,11539,B0033UV8HI,"[8, 10]",3,"Jace Rankin may be short, but he's nothing to ...","09 2, 2010",A3HHXRELK8BHQG,Ridley,Entertaining But Average,1283385600
1,1,5957,B002HJV4DE,"[1, 1]",5,Great short read. I didn't want to put it dow...,"10 8, 2013",A2RGNZ0TRF578I,Holly Butler,Terrific menage scenes!,1381190400
2,2,9146,B002ZG96I4,"[0, 0]",3,I'll start by saying this is the first of four...,"04 11, 2014",A3S0H2HV6U1I7F,Merissa,Snapdragon Alley,1397174400
3,3,7038,B002QHWOEU,"[1, 3]",3,Aggie is Angela Lansbury who carries pocketboo...,"07 5, 2014",AC4OQW3GZ919J,Cleargrace,very light murder cozy,1404518400
4,4,1776,B001A06VJ8,"[0, 1]",4,I did not expect this type of book to be in li...,"12 31, 2012",A3C9V987IQHOQD,Rjostler,Book,1356912000


In [2]:
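## let's focus on the important features, i.e. reviewText and rating
## .copy() gives an independent DataFrame, so later column assignments do not raise SettingWithCopyWarning
df = df[['reviewText', 'rating']].copy()
df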

Unnamed: 0,reviewText,rating
0,"Jace Rankin may be short, but he's nothing to ...",3
1,Great short read. I didn't want to put it dow...,5
2,I'll start by saying this is the first of four...,3
3,Aggie is Angela Lansbury who carries pocketboo...,3
4,I did not expect this type of book to be in li...,4
...,...,...
11995,Valentine cupid is a vampire- Jena and Ian ano...,4
11996,I have read all seven books in this series. Ap...,5
11997,This book really just wasn't my cuppa. The si...,3
11998,"tried to use it to charge my kindle, it didn't...",1


In [3]:
## check for missing values
df.isnull().sum()

reviewText    0
rating        0
dtype: int64

No missing values

In [4]:
## check for unique ratings
df['rating'].unique()

array([3, 5, 4, 2, 1])

In [5]:
## check the number of reviews per rating
df['rating'].value_counts()

rating
5    3000
4    3000
3    2000
2    2000
1    2000
Name: count, dtype: int64

In [6]:
## Convert the rating into two classes: bad 0 (rating below 3) or good 1 (rating 3 and above)
df['rating'] = df['rating'].apply(lambda x: 0 if x<3 else 1)
df['rating']

0        1
1        1
2        1
3        1
4        1
        ..
11995    1
11996    1
11997    1
11998    0
11999    1
Name: rating, Length: 12000, dtype: int64

In [7]:
df['rating'].value_counts()

rating
1    8000
0    4000
Name: count, dtype: int64
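
The same mapping can also be written as a vectorized comparison instead of `apply`. A small sketch that rebuilds a rating column with the per-rating counts shown earlier and checks the resulting class balance; the synthetic `ratings` Series is only a stand-in for `df['rating']`:

```python
import pandas as pd

# Stand-in for df['rating'], using the counts from value_counts() above.
ratings = pd.Series([5] * 3000 + [4] * 3000 + [3] * 2000 + [2] * 2000 + [1] * 2000)

# Vectorized equivalent of apply(lambda x: 0 if x < 3 else 1).
labels = (ratings >= 3).astype(int)

counts = labels.value_counts()
positive_share = counts.loc[1] / len(labels)
print(counts.to_dict())
print(round(positive_share, 3))
```

Two thirds of the labels are positive, so a model that always predicts 1 would already be right about 67% of the time. That figure is a useful reference point when reading accuracy scores later.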

Start preprocessing reviewText

In [8]:
df['reviewText'] = df['reviewText'].str.lower()

In [9]:
df.head()

Unnamed: 0,reviewText,rating
0,"jace rankin may be short, but he's nothing to ...",1
1,great short read. i didn't want to put it dow...,1
2,i'll start by saying this is the first of four...,1
3,aggie is angela lansbury who carries pocketboo...,1
4,i did not expect this type of book to be in li...,1


In [10]:
import re
from nltk.corpus import stopwords
from bs4 import BeautifulSoup

In [11]:
## build the stopword set once instead of reloading the list for every word
stop_words = set(stopwords.words('english'))

## remove html tags (before special characters are stripped, while '<' and '>' still exist)
df['reviewText'] = df['reviewText'].apply(lambda x: BeautifulSoup(x, 'html.parser').get_text())

## remove urls (before special characters are stripped, while the urls are still intact)
df['reviewText'] = df['reviewText'].apply(lambda x: re.sub(r"http\S+", "", x))

## remove special characters (the text is already lowercase)
df['reviewText'] = df['reviewText'].apply(lambda x: re.sub('[^a-z0-9 ]+', '', x))

## remove the stopwords
df['reviewText'] = df['reviewText'].apply(lambda sentence: " ".join([word for word in sentence.split() if word not in stop_words]))

## remove additional spaces
df['reviewText'] = df['reviewText'].apply(lambda x: " ".join(x.split()))
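
The cleaning steps can also be bundled into one function, which makes the order of operations explicit and easy to test on a single string. This is a rough sketch, not the notebook's code: `clean_text` is a hypothetical helper, the tiny `STOPWORDS` set stands in for NLTK's English list so the example runs without downloads, and a simple regex replaces BeautifulSoup for tag removal.

```python
import re

# Hypothetical mini stopword list so the sketch runs without NLTK data;
# in the notebook this would be set(stopwords.words('english')).
STOPWORDS = {"i", "a", "an", "the", "to", "it", "is", "this", "and", "was"}

TAG_RE = re.compile(r"<[^>]+>")          # simplified stand-in for BeautifulSoup
URL_RE = re.compile(r"http\S+")
NON_ALNUM_RE = re.compile(r"[^a-z0-9 ]+")

def clean_text(text):
    text = text.lower()
    text = TAG_RE.sub(" ", text)         # strip tags while '<' and '>' still exist
    text = URL_RE.sub(" ", text)         # strip URLs while they are still intact
    text = NON_ALNUM_RE.sub("", text)    # drop punctuation and other special characters
    words = [w for w in text.split() if w not in STOPWORDS]
    return " ".join(words)               # split/join also collapses extra spaces

cleaned = clean_text("Great <b>read</b>! See http://example.com/x, I didn't want to put it down.")
print(cleaned)  # great read see didnt want put down
```

Note that "didn't" becomes "didnt" because the apostrophe is stripped before the stopword check, so contractions survive stopword removal. The same thing happens in the notebook output ("didnt", "wasnt").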



In [13]:
df.head()

Unnamed: 0,reviewText,rating
0,jace rankin may short hes nothing mess man hau...,1
1,great short read didnt want put read one sitti...,1
2,ill start saying first four books wasnt expect...,1
3,aggie angela lansbury carries pocketbooks inst...,1
4,expect type book library pleased find price right,1


In [14]:
## Apply Lemmatization
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
lemmatizer = WordNetLemmatizer()

In [15]:
df['reviewText'] = df['reviewText'].apply(
    lambda sentence: " ".join([lemmatizer.lemmatize(word) for word in word_tokenize(sentence)])
)

df.head()

Unnamed: 0,reviewText,rating
0,jace rankin may short he nothing mess man haul...,1
1,great short read didnt want put read one sitti...,1
2,ill start saying first four book wasnt expecti...,1
3,aggie angela lansbury carry pocketbook instead...,1
4,expect type book library pleased find price right,1


In [30]:
df.shape

(12000, 2)

#### Step 2: Train Test Split

In [31]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(df['reviewText'], df['rating'], test_size=0.20)
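
Because the split above is unseeded, each run produces a different partition and slightly different scores. A sketch of a repeatable, stratified variant on toy data with the same 2:1 class ratio; the seed value `42` and the toy `texts`/`labels` lists are assumptions for illustration only:

```python
from sklearn.model_selection import train_test_split

# Toy stand-in for df['reviewText'] / df['rating'] with the same 2:1 class ratio.
texts = [f"review {i}" for i in range(30)]
labels = [1] * 20 + [0] * 10

X_train, X_test, y_train, y_test = train_test_split(
    texts,
    labels,
    test_size=0.20,
    random_state=42,   # hypothetical seed; any fixed integer makes the split repeatable
    stratify=labels,   # keep the positive/negative ratio equal in both parts
)

print(len(X_train), len(X_test))   # 24 6
print(sum(y_train), sum(y_test))   # 16 4
```

Stratifying matters more as classes get more imbalanced; with a 2:1 ratio it mainly removes a small source of run-to-run noise.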

In [32]:
print(f"X_train shape: {X_train.shape}")
print(f"X_test shape: {X_test.shape}")
print(f"y_train shape: {y_train.shape}")
print(f"y_test shape: {y_test.shape}")

X_train shape: (9600,)
X_test shape: (2400,)
y_train shape: (9600,)
y_test shape: (2400,)


#### Step 3: Convert text to vectors: BoW, TF-IDF, Word2Vec


#### Using Bag of words

In [34]:
from sklearn.feature_extraction.text import CountVectorizer

bow = CountVectorizer()
X_train_bow = bow.fit_transform(X_train).toarray()
X_test_bow = bow.transform(X_test).toarray()
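
To see what the vectorizer produces, here is a toy example on three short strings (made up for illustration). The vocabulary is learned from the training documents only, and words that appear only in the test document are ignored:

```python
from sklearn.feature_extraction.text import CountVectorizer

train_docs = ["great short read", "great book great price"]
test_docs = ["great unknown word"]

bow = CountVectorizer()
X_train_bow = bow.fit_transform(train_docs)   # learns the vocabulary from train only
X_test_bow = bow.transform(test_docs)         # reuses it; unseen words are dropped

vocab = sorted(bow.vocabulary_)               # columns follow this alphabetical order
print(vocab)                                  # ['book', 'great', 'price', 'read', 'short']
print(X_train_bow.toarray().tolist())         # [[0, 1, 0, 1, 1], [1, 2, 1, 0, 0]]
print(X_test_bow.toarray().tolist())          # [[0, 1, 0, 0, 0]]
```

`fit_transform` returns a SciPy sparse matrix. Calling `.toarray()` is fine for a toy example, but on thousands of reviews the dense array is mostly zeros and can use a lot of memory.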

#### Using TF-IDF

In [35]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer()
X_train_tfidf = tfidf.fit_transform(X_train).toarray()
X_test_tfidf = tfidf.transform(X_test).toarray()
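
With its default settings, `TfidfVectorizer` multiplies each raw term count by a smoothed inverse document frequency, `ln((1 + n) / (1 + df)) + 1`, and then scales every row to unit L2 length. The sketch below recomputes this by hand on two toy documents and compares it with the library result:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ["great short read", "great book great price"]

# Raw term counts and per-term document frequency.
counts = CountVectorizer().fit_transform(docs).toarray()
n_docs = counts.shape[0]
doc_freq = (counts > 0).sum(axis=0)

# Smoothed IDF, then L2-normalize each document row.
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1
weighted = counts * idf
manual = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)

from_sklearn = TfidfVectorizer().fit_transform(docs).toarray()
matches = np.allclose(manual, from_sklearn)
print(matches)
```

Words that occur in every document (here "great") get the smallest IDF, so they contribute less than rarer words with the same count.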

#### Now we are going to apply a Naive Bayes algorithm, because it usually works well with sparse, high-dimensional text data.

In [36]:
from sklearn.naive_bayes import GaussianNB

nb_model_bow = GaussianNB()
nb_model_bow.fit(X_train_bow, y_train)


In [37]:
nb_model_tfidf = GaussianNB()
nb_model_tfidf.fit(X_train_tfidf, y_train)
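
`GaussianNB` models every feature as a continuous, normally distributed value and needs dense input, which is why the cells above call `.toarray()`. For word counts, scikit-learn also provides `MultinomialNB`, which accepts the sparse matrix directly. The following is only a sketch on made-up toy reviews, not the notebook's model, and it makes no claim about accuracy on the Kindle data:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Made-up toy reviews: 1 = good, 0 = bad.
train_docs = [
    "great book loved it",
    "wonderful story great characters",
    "boring plot bad writing",
    "terrible book bad ending",
]
train_labels = [1, 1, 0, 0]

vectorizer = CountVectorizer()
X_train_counts = vectorizer.fit_transform(train_docs)  # stays a SciPy sparse matrix

model = MultinomialNB()  # default Laplace smoothing (alpha=1.0)
model.fit(X_train_counts, train_labels)

predictions = model.predict(vectorizer.transform(["great story", "bad plot"]))
print(predictions.tolist())  # [1, 0]
```

Swapping the estimator this way leaves the rest of the workflow unchanged, since both classes share the same `fit`/`predict` interface.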

## Calculate Metrics

In [38]:
from sklearn.metrics import confusion_matrix, accuracy_score, classification_report

In [39]:
y_pred_bow = nb_model_bow.predict(X_test_bow)
y_pred_tfidf = nb_model_tfidf.predict(X_test_tfidf)

In [40]:
print("BOW accuracy: ",accuracy_score(y_test, y_pred_bow))
print("TFIDF accuracy: ",accuracy_score(y_test, y_pred_tfidf))

BOW accuracy:  0.56375
TFIDF accuracy:  0.5629166666666666


In [42]:
print("BOW confusion_matrix: ",confusion_matrix(y_test, y_pred_bow))
print("TFIDF confusion_matrix: ",confusion_matrix(y_test, y_pred_tfidf))


BOW confusion_matrix:  [[496 301]
 [746 857]]
TFIDF confusion_matrix:  [[478 319]
 [730 873]]
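
The BoW confusion matrix can be unpacked by hand to get more than a single accuracy number. In scikit-learn's layout, rows are true labels and columns are predicted labels, so the four cells are TN, FP, FN and TP for the positive class:

```python
# BoW confusion matrix copied from the output above.
# Rows are true labels (0, 1); columns are predicted labels (0, 1).
tn, fp = 496, 301
fn, tp = 746, 857

total = tn + fp + fn + tp
accuracy = (tn + tp) / total
precision = tp / (tp + fp)                         # how many predicted positives are right
recall = tp / (tp + fn)                            # how many true positives are found
majority_baseline = max(tn + fp, fn + tp) / total  # always predict the larger class

print(f"accuracy:          {accuracy:.3f}")           # 0.564
print(f"precision (1):     {precision:.3f}")          # 0.740
print(f"recall (1):        {recall:.3f}")             # 0.535
print(f"majority baseline: {majority_baseline:.3f}")  # 0.668
```

The model misses almost half of the positive reviews (recall of about 0.53) and ends up below the accuracy of always predicting class 1.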


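# The accuracy is not good: about 56% for both models, which is below the roughly 67% obtained by always predicting the majority class (1603 of the 2400 test reviews are positive)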
# To improve accuracy, we need to use 