## **About the Dataset**

### **Context**

This dataset is a curated subset of Amazon product reviews from the **Kindle Store** category. It offers valuable insights into customer opinions, product quality, and review behaviors over nearly two decades.

### **Content**

The dataset is part of Amazon’s **5-core** collection, meaning that every product and every reviewer in it has at least **five reviews**.
It spans **May 1996 to July 2014** and contains **982,619 entries** in total. The notebook below loads a 12,000-row file (`all_kindle_review.csv`).

### **Columns**

| Column           | Description                                  |
| ---------------- | -------------------------------------------- |
| `asin`           | Unique product ID (e.g., B000FA64PK)         |
| `helpful`        | Helpfulness rating of the review (e.g., 2/3) |
| `overall`        | Overall product rating, 1 to 5 (named `rating` in the CSV loaded below) |
| `reviewText`     | Full text of the review                      |
| `reviewTime`     | Original review date                         |
| `reviewerID`     | Unique reviewer ID (e.g., A3SPTOKDG7WBLN)    |
| `reviewerName`   | Name of the reviewer                         |
| `summary`        | Short summary or title of the review         |
| `unixReviewTime` | Review timestamp (Unix format)               |
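
In the CSV loaded later in this notebook, `helpful` appears as a string such as `"[8, 10]"` rather than `2/3`. The sketch below assumes that string format and reads it as (helpful votes, total votes); `helpful_ratio` is a hypothetical helper, not part of the dataset tooling.

```python
import ast

import pandas as pd


def helpful_ratio(value):
    """Convert a string such as "[8, 10]" into the fraction of helpful votes."""
    helpful_votes, total_votes = ast.literal_eval(value)
    # Reviews without any votes have no meaningful ratio, so return NaN for them.
    return helpful_votes / total_votes if total_votes else float("nan")


# Toy values in the format shown in the notebook output.
sample = pd.DataFrame({"helpful": ["[8, 10]", "[1, 1]", "[0, 0]"]})
sample["helpful_ratio"] = sample["helpful"].apply(helpful_ratio)
print(sample)
```

Keeping unvoted reviews as NaN, instead of 0, avoids treating "nobody voted" as "nobody found it helpful".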

### **Acknowledgements**

This dataset originates from the **Amazon Product Data** compiled by **Julian McAuley** and his team at **UC San Diego (UCSD)**.
Source: [Amazon Product Data – UCSD](http://jmcauley.ucsd.edu/data/amazon/)
All rights and licenses belong to the original authors.

### **Possible Applications & Inspiration**

* Perform **sentiment analysis** on customer reviews.
* Analyze **review helpfulness** — what makes a review perceived as useful?
* Detect **fake or anomalous reviews**.
* Identify **top-rated products** and explore **product similarity** based on textual reviews.
* Conduct **NLP experiments** such as topic modeling or word embeddings.


## **Best Practices**

1. **Data Preprocessing & Cleaning**

   * Handle missing values, duplicates, and irrelevant text.
   * Normalize text (lowercasing, removing punctuation, stopwords, and special characters).
   * Perform lemmatization or stemming to reduce words to their root forms.

2. **Train-Test Split**

   * Divide the dataset into training and testing sets to evaluate model performance effectively.
   * Use stratified sampling if class imbalance exists.

3. **Feature Extraction**

   * Convert text into numerical features using methods such as:

     * **Bag of Words (BoW)**
     * **TF-IDF (Term Frequency–Inverse Document Frequency)**
     * **Word2Vec or other word embeddings**

4. **Model Training & Evaluation**

   * Train various machine learning algorithms (e.g., Logistic Regression, Naive Bayes, SVM, Random Forest).
   * Compare performance using metrics such as accuracy, precision, recall, F1-score, and ROC-AUC.
   * Optimize hyperparameters and prevent overfitting through techniques like cross-validation.
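
Steps 2 to 4 can be sketched as one scikit-learn pipeline. This is a minimal illustration on a hypothetical toy corpus, with TF-IDF and Logistic Regression chosen only as examples from the lists above; it is not the configuration used later in this notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus: 1 = positive, 0 = negative.
texts = [
    "great story loved every page", "wonderful characters and plot",
    "amazing read could not stop", "fantastic book highly recommend",
    "enjoyable and well written", "excellent series loved it",
    "really good romance novel", "awesome short story",
    "terrible plot and boring", "awful writing waste of money",
    "bad editing many typos", "boring characters did not finish",
    "disappointing ending poor story", "horrible book not recommended",
    "weak plot and dull", "poorly written and confusing",
]
labels = [1] * 8 + [0] * 8

# Stratified split keeps the class ratio equal in the train and test sets.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, stratify=labels, random_state=42
)

# The vectorizer lives inside the pipeline, so it is refit on each training fold.
model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))

cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)
scores = cross_val_score(model, X_train, y_train, cv=cv, scoring="accuracy")

model.fit(X_train, y_train)
test_accuracy = model.score(X_test, y_test)
print("CV accuracy per fold:", scores)
print("Held-out accuracy:", test_accuracy)
```

Putting the vectorizer in the pipeline means the vocabulary is learned from the training folds only, so the cross-validation scores are not inflated by words seen in the validation fold.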

In [1]:
# Load the dataset
import pandas as pd
data=pd.read_csv('/content/all_kindle_review.csv', engine='python')
data.head()

Unnamed: 0.2,Unnamed: 0.1,Unnamed: 0,asin,helpful,rating,reviewText,reviewTime,reviewerID,reviewerName,summary,unixReviewTime
0,0,11539,B0033UV8HI,"[8, 10]",3,"Jace Rankin may be short, but he's nothing to ...","09 2, 2010",A3HHXRELK8BHQG,Ridley,Entertaining But Average,1283385600
1,1,5957,B002HJV4DE,"[1, 1]",5,Great short read. I didn't want to put it dow...,"10 8, 2013",A2RGNZ0TRF578I,Holly Butler,Terrific menage scenes!,1381190400
2,2,9146,B002ZG96I4,"[0, 0]",3,I'll start by saying this is the first of four...,"04 11, 2014",A3S0H2HV6U1I7F,Merissa,Snapdragon Alley,1397174400
3,3,7038,B002QHWOEU,"[1, 3]",3,Aggie is Angela Lansbury who carries pocketboo...,"07 5, 2014",AC4OQW3GZ919J,Cleargrace,very light murder cozy,1404518400
4,4,1776,B001A06VJ8,"[0, 1]",4,I did not expect this type of book to be in li...,"12 31, 2012",A3C9V987IQHOQD,Rjostler,Book,1356912000


In [2]:
# Take an explicit copy so the later in-place edits do not trigger SettingWithCopyWarning
df=data[['reviewText','rating']].copy()
df.head()

Unnamed: 0,reviewText,rating
0,"Jace Rankin may be short, but he's nothing to ...",3
1,Great short read. I didn't want to put it dow...,5
2,I'll start by saying this is the first of four...,3
3,Aggie is Angela Lansbury who carries pocketboo...,3
4,I did not expect this type of book to be in li...,4


In [3]:
df.shape

(12000, 2)

In [4]:
df.isnull().sum()

Unnamed: 0,0
reviewText,0
rating,0


In [5]:
df['rating'].unique()

array([3, 5, 4, 2, 1])

In [6]:
df['rating'].value_counts()

rating,count
5,3000
4,3000
3,2000
2,2000
1,2000


In [7]:
# Binarize the labels: ratings 1-2 become negative (0), ratings 3-5 become positive (1)
df.loc[:, 'rating'] = df['rating'].apply(lambda x: 0 if x<3 else 1)

In [8]:
df['rating'].value_counts()

rating,count
1,8000
0,4000
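
The mapping above counts 3-star reviews as positive. One common alternative, shown here only as an assumption and not as what this notebook does, is to treat 3 stars as neutral and drop those rows. The toy frame below is hypothetical.

```python
import pandas as pd

toy = pd.DataFrame({"rating": [1, 2, 3, 4, 5, 3]})

# The notebook's rule: ratings below 3 are negative (0), everything else is positive (1).
toy["label_notebook"] = (toy["rating"] >= 3).astype(int)

# Alternative sketch: drop the 3-star reviews, then label 4-5 as positive.
no_neutral = toy[toy["rating"] != 3].copy()
no_neutral["label"] = (no_neutral["rating"] > 3).astype(int)

print(toy)
print(no_neutral)
```

With the rating counts reported earlier (3,000 each for 5 and 4 stars, 2,000 each for 3, 2, and 1 stars), dropping the 3-star rows would leave 6,000 positive and 4,000 negative reviews instead of 8,000 and 4,000.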


In [9]:
df.loc[:, 'reviewText']=df['reviewText'].str.lower()

In [10]:
df.head()

Unnamed: 0,reviewText,rating
0,"jace rankin may be short, but he's nothing to ...",1
1,great short read. i didn't want to put it dow...,1
2,i'll start by saying this is the first of four...,1
3,aggie is angela lansbury who carries pocketboo...,1
4,i did not expect this type of book to be in li...,1


In [11]:
import re
import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')
nltk.download('wordnet')

from bs4 import BeautifulSoup

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [12]:
# Build the stop-word set once; calling stopwords.words() for every token is very slow
stop_words = set(stopwords.words('english'))

# Remove URLs first, while "://" and "." are still present in the text
df.loc[:, 'reviewText']=df['reviewText'].apply(lambda x: re.sub(r'(http|https|ftp|ssh)://([\w_-]+(?:(?:\.[\w_-]+)+))([\w.,@?^=%&:/~+#-]*[\w@?^=%&/~+#-])?', '' , str(x)))

# Remove HTML tags while "<" and ">" are still present
df.loc[:, 'reviewText']=df['reviewText'].apply(lambda x: BeautifulSoup(x, 'lxml').get_text())

# Remove special characters
df.loc[:, 'reviewText']=df['reviewText'].apply(lambda x:re.sub('[^a-z A-Z 0-9-]+', '',x))

# Remove the stop words
df.loc[:, 'reviewText']=df['reviewText'].apply(lambda x:" ".join([y for y in x.split() if y not in stop_words]))

# Remove any additional spaces
df.loc[:, 'reviewText']=df['reviewText'].apply(lambda x: " ".join(x.split()))


In [13]:
df.head()

Unnamed: 0,reviewText,rating
0,jace rankin may short hes nothing mess man hau...,1
1,great short read didnt want put read one sitti...,1
2,ill start saying first four books wasnt expect...,1
3,aggie angela lansbury carries pocketbooks inst...,1
4,expect type book library pleased find price right,1
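
The cleaning steps can also be collected into one function. The sketch below is a simplified stand-in, not the notebook's code: it uses a plain regular expression instead of BeautifulSoup for tags, and a tiny hypothetical stop-word set instead of NLTK's list. The order matters, because URL and tag patterns can only match while `://`, `<`, and `>` are still in the text.

```python
import re

# Small illustrative stop-word set; the notebook uses NLTK's English list instead.
STOP_WORDS = {"a", "an", "the", "is", "it", "to", "and", "of", "this", "at"}

URL_RE = re.compile(r"(?:https?|ftp|ssh)://\S+")
TAG_RE = re.compile(r"<[^>]+>")
NON_ALNUM_RE = re.compile(r"[^a-z0-9 -]+")


def clean_review(text):
    """Lowercase, strip URLs and tags, drop special characters and stop words."""
    text = text.lower()
    text = URL_RE.sub(" ", text)        # URLs first, while "://" is intact
    text = TAG_RE.sub(" ", text)        # then tags, while "<" and ">" are intact
    text = NON_ALNUM_RE.sub("", text)   # then the remaining special characters
    tokens = [tok for tok in text.split() if tok not in STOP_WORDS]
    return " ".join(tokens)             # split/join also collapses extra spaces


print(clean_review("This is <b>GREAT</b>, see http://example.com/page at once!"))
```

A single function like this can be applied once with `df['reviewText'].apply(clean_review)`, which avoids several passes over the column.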


In [14]:
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()

In [15]:
def lemmatize_text(text):
    return ' '.join([lemmatizer.lemmatize(word) for word in text.split()])

In [16]:
df.loc[:, 'reviewText']=df['reviewText'].apply(lemmatize_text)
df.head()

Unnamed: 0,reviewText,rating
0,jace rankin may short he nothing mess man haul...,1
1,great short read didnt want put read one sitti...,1
2,ill start saying first four book wasnt expecti...,1
3,aggie angela lansbury carry pocketbook instead...,1
4,expect type book library pleased find price right,1


In [17]:
# Train Test Split
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test=train_test_split(df['reviewText'], df['rating'], test_size=0.20)
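
The split above is neither seeded nor stratified, so the class ratio in the test set changes from run to run. A minimal sketch of a reproducible, stratified variant follows; it uses toy labels with the notebook's 8,000/4,000 class counts, and the seed value 42 is an arbitrary assumption.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy labels with the same 2:1 positive/negative ratio as the notebook's data.
y = np.array([1] * 8000 + [0] * 4000)
X = np.arange(len(y))  # stand-in for the review texts

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, stratify=y, random_state=42
)

n_positive = int(y_test.sum())
n_negative = len(y_test) - n_positive
print(n_positive, n_negative)
```

With `stratify=y`, the 2,400 test rows keep the 2:1 ratio of the full data.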

In [18]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
bow=CountVectorizer()
tfidf = TfidfVectorizer()

In [19]:
X_train_bow=bow.fit_transform(X_train).toarray()
X_test_bow=bow.transform(X_test).toarray()

X_train_tfidf=tfidf.fit_transform(X_train).toarray()
X_test_tfidf=tfidf.transform(X_test).toarray()

In [20]:
print(X_train_bow.shape)
print(X_test_bow.shape)

(9600, 35749)
(2400, 35749)


In [21]:
X_train_bow

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]])

In [22]:
from sklearn.naive_bayes import GaussianNB

nb_model_bow=GaussianNB().fit(X_train_bow,y_train)
nb_model_tfidf=GaussianNB().fit(X_train_tfidf,y_train)

In [23]:
from sklearn.metrics import confusion_matrix,accuracy_score,classification_report

In [24]:
y_pred_bow=nb_model_bow.predict(X_test_bow)
y_pred_tfidf=nb_model_tfidf.predict(X_test_tfidf)

In [25]:
confusion_matrix(y_test,y_pred_bow)

array([[542, 278],
       [709, 871]])

In [26]:
print("BOW accuracy: ",round(accuracy_score(y_test,y_pred_bow),2))

BOW accuracy:  0.59
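
It helps to compare this accuracy with a trivial baseline. The sketch below recomputes the accuracy from the confusion matrix above and compares it with always predicting the most frequent class in the test set; it assumes scikit-learn's layout of rows as true classes and columns as predicted classes.

```python
import numpy as np

# Confusion matrix reported for the BoW model (rows = true class, columns = predicted).
cm = np.array([[542, 278],
               [709, 871]])

# Correct predictions sit on the diagonal.
accuracy = np.trace(cm) / cm.sum()

# Accuracy obtained by always predicting the most frequent true class.
majority_baseline = cm.sum(axis=1).max() / cm.sum()

print(f"accuracy={accuracy:.3f}, majority baseline={majority_baseline:.3f}")
```

The test set holds 1,580 positive reviews out of 2,400, so the baseline is about 0.66, which is higher than the 0.59 reached by this model.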


In [27]:
confusion_matrix(y_test,y_pred_tfidf)

array([[530, 290],
       [693, 887]])

In [28]:
print("TFIDF accuracy: ",round(accuracy_score(y_test,y_pred_tfidf), 2))

TFIDF accuracy:  0.59
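
`GaussianNB` needs dense input, which is why the notebook calls `.toarray()`. A dense 9,600 x 35,749 matrix of 8-byte values takes about 2.7 GB. A hedged alternative is `MultinomialNB`, which accepts the sparse matrices produced by the vectorizers directly and is commonly paired with word-count features. The sketch below uses a hypothetical toy corpus; whether it improves on 0.59 for this dataset has not been tested here.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy data: 1 = positive, 0 = negative.
train_texts = [
    "great book loved it",
    "wonderful story",
    "terrible boring plot",
    "awful waste money",
]
train_labels = [1, 1, 0, 0]
test_texts = ["loved wonderful book", "boring awful"]

vectorizer = TfidfVectorizer()
X_train_sparse = vectorizer.fit_transform(train_texts)  # stays a SciPy sparse matrix
X_test_sparse = vectorizer.transform(test_texts)

clf = MultinomialNB().fit(X_train_sparse, train_labels)
predictions = clf.predict(X_test_sparse).tolist()
print(predictions)
```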


In [29]:
!pip install gensim



In [30]:
import gensim
from gensim.models import Word2Vec

In [31]:
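import gensim.downloader as api
# Pretrained Google News vectors (a large download). `wv` is not used in the cells below,
# which train a new Word2Vec model on the reviews instead.
wv = api.load('word2vec-google-news-300')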

In [32]:
from nltk import sent_tokenize
from gensim.utils import simple_preprocess
nltk.download('punkt_tab')

[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


True

In [33]:
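# Build one token list per review. Punctuation was already removed, so every review
# is a single "sentence"; one list per review keeps `words` aligned with the rows of df.
words=[]
for review in df['reviewText']:
  words.append(simple_preprocess(review))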

In [34]:
word2vec_model = gensim.models.Word2Vec(words, epochs=10)

In [35]:
word2vec_model.wv.index_to_key

['book',
 'story',
 'read',
 'one',
 'character',
 'like',
 'good',
 'would',
 'really',
 'love',
 'time',
 'get',
 'author',
 'reading',
 'series',
 'well',
 'much',
 'first',
 'even',
 'didnt',
 'short',
 'know',
 'way',
 'great',
 'could',
 'make',
 'sex',
 'little',
 'dont',
 'two',
 'thing',
 'want',
 'think',
 'find',
 'plot',
 'romance',
 'also',
 'end',
 'life',
 'im',
 'see',
 'enjoyed',
 'go',
 'scene',
 'never',
 'written',
 'take',
 'woman',
 'many',
 'lot',
 'kindle',
 'year',
 'say',
 'thought',
 'work',
 'bit',
 'found',
 'going',
 'give',
 'interesting',
 'liked',
 'writing',
 'novel',
 'loved',
 'another',
 'feel',
 'better',
 'got',
 'come',
 'man',
 'hot',
 'still',
 'back',
 'enough',
 'though',
 'people',
 'star',
 'reader',
 'made',
 'something',
 'review',
 'part',
 'friend',
 'page',
 'cant',
 'bad',
 'world',
 'need',
 'free',
 'new',
 'keep',
 'wasnt',
 'doesnt',
 'relationship',
 'enjoy',
 'recommend',
 'together',
 'next',
 'start',
 'felt',
 'best',
 'put',

In [36]:
word2vec_model.corpus_count

12000

In [37]:
word2vec_model.epochs

10

In [38]:
word2vec_model.wv.similar_by_word('great')

[('amazing', 0.7977503538131714),
 ('wonderful', 0.7353004217147827),
 ('good', 0.7104386687278748),
 ('fantastic', 0.6946941018104553),
 ('awesome', 0.6862539649009705),
 ('excellent', 0.6685007810592651),
 ('exciting', 0.6194494962692261),
 ('nice', 0.5893924236297607),
 ('enjoyable', 0.5808426141738892),
 ('entertaining', 0.5496951341629028)]
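
The scores returned by `similar_by_word` are cosine similarities between word vectors. The sketch below shows the computation on hypothetical 3-dimensional vectors; the real model uses 100 dimensions and its values differ.

```python
import numpy as np


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: dot product over the product of norms."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


# Hypothetical toy vectors, chosen so the results are easy to check by hand.
great = np.array([1.0, 2.0, 0.0])
amazing = np.array([2.0, 4.0, 0.0])   # same direction as `great`
boring = np.array([-2.0, 1.0, 0.0])   # orthogonal to `great`

sim_same_direction = cosine_similarity(great, amazing)
sim_orthogonal = cosine_similarity(great, boring)
print(sim_same_direction, sim_orthogonal)
```

Vectors pointing the same way score close to 1 regardless of length, and orthogonal vectors score 0.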

In [39]:
word2vec_model.wv['great'].shape

(100,)

In [40]:
word2vec_model.wv['great']

array([ 0.9119095 , -0.75312865,  0.6592263 ,  0.6134204 ,  0.06799739,
       -1.5001289 , -0.5707659 ,  0.89512175,  1.4305828 , -0.44824848,
        0.02053639,  0.6328427 , -0.12751976,  1.1200616 ,  1.5269191 ,
        0.39372218,  2.074278  , -0.5943436 ,  0.14842385, -2.3975704 ,
       -1.4656516 ,  0.22051291, -1.4560322 , -0.32656908,  0.05037889,
        0.8626645 ,  0.24027522, -0.06042769, -0.85210234, -0.92882925,
        0.87602895,  1.3005613 , -0.34921587, -0.8073981 , -1.0865742 ,
       -1.0782218 , -0.53862107, -0.2823288 , -0.7186604 ,  1.4494351 ,
        2.6538837 , -0.39145356,  0.67161286, -1.2778574 ,  0.45798293,
        1.0628244 , -0.35243604,  0.03478662, -0.29301757,  0.82747036,
        1.5394082 , -0.99645275,  0.272381  ,  1.688598  , -1.4386274 ,
        0.68632776, -0.42788133, -0.44767475,  0.11422999, -0.05857861,
        0.5116262 , -0.00981039,  0.9390869 ,  1.2363025 ,  1.1845335 ,
        0.18499449,  1.4295692 , -0.45471102,  0.10432211,  1.14

In [41]:
words[0]

['jace',
 'rankin',
 'may',
 'short',
 'he',
 'nothing',
 'mess',
 'man',
 'hauled',
 'saloon',
 'undertaker',
 'know',
 'he',
 'famous',
 'bounty',
 'hunter',
 'oregon',
 'shot',
 'man',
 'saloon',
 'finished',
 'year',
 'long',
 'quest',
 'avenge',
 'sister',
 'murder',
 'trying',
 'figure',
 'next',
 'snotty',
 'nosed',
 'farm',
 'boy',
 'rescued',
 'gang',
 'bully',
 'offer',
 'money',
 'kill',
 'man',
 'forced',
 'ranch',
 'reluctantly',
 'agrees',
 'bring',
 'man',
 'justice',
 'kill',
 'outright',
 'first',
 'need',
 'tell',
 'sister',
 'widower',
 'newskyla',
 'kyle',
 'springer',
 'bailey',
 'riding',
 'trail',
 'sleeping',
 'ground',
 'past',
 'month',
 'trying',
 'find',
 'jace',
 'want',
 'revenge',
 'man',
 'killed',
 'husband',
 'took',
 'ranch',
 'amongst',
 'crime',
 'shes',
 'keen',
 'detour',
 'jace',
 'want',
 'take',
 'realizes',
 'shes',
 'option',
 'hide',
 'behind',
 'boy',
 'persona',
 'best',
 'try',
 'keep',
 'pace',
 'confrontation',
 'along',
 'way',
 'get',

In [42]:
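import numpy as np

def avg_word2vec(doc):
    # key_to_index is a dict, so membership tests are O(1) (index_to_key is a list)
    vectors = [word2vec_model.wv[word] for word in doc if word in word2vec_model.wv.key_to_index]
    # Use a zero vector when none of the review's words are in the vocabulary
    return np.mean(vectors, axis=0) if vectors else np.zeros(word2vec_model.wv.vector_size)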

In [43]:
!pip install tqdm



In [44]:
from tqdm import tqdm
# Average the word vectors of every review
X=[]
for doc in tqdm(words):
    X.append(avg_word2vec(doc))

100%|██████████| 12000/12000 [00:34<00:00, 348.75it/s]


In [45]:
X_word2vec = pd.DataFrame(X)

In [46]:
X_word2vec.shape

(12000, 100)
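
Each row of this matrix is the mean of the word vectors in one review. The idea can be sketched with NumPy alone; `toy_vectors` is a hypothetical 2-dimensional lookup table standing in for the trained model, and `average_vector` is an illustrative helper.

```python
import numpy as np

# Hypothetical embedding table standing in for the trained Word2Vec vectors.
toy_vectors = {
    "great": np.array([1.0, 0.0]),
    "book": np.array([0.0, 1.0]),
}
DIM = 2


def average_vector(tokens, vectors, dim):
    """Mean of the vectors of known tokens; zeros if no token is known."""
    known = [vectors[tok] for tok in tokens if tok in vectors]
    if not known:
        return np.zeros(dim)
    return np.mean(known, axis=0)


doc_vector = average_vector(["great", "book", "unknownword"], toy_vectors, DIM)
empty_vector = average_vector(["unknownword"], toy_vectors, DIM)
print(doc_vector, empty_vector)
```

Returning zeros for a review with no known words keeps every row the same length, so the feature matrix stays rectangular.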

In [47]:
X_word2vec.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,90,91,92,93,94,95,96,97,98,99
0,-0.114167,0.161579,0.140269,0.281215,0.127837,-0.233559,0.169358,0.402636,-0.16223,-0.177465,...,0.49543,0.266469,-0.154997,0.001779,0.511001,0.178337,-0.14375,-0.28422,0.054304,-0.132878
1,-0.126622,0.176915,-0.071507,0.81363,0.381267,-0.159032,0.355858,0.465773,-0.009559,-0.386319,...,0.538729,0.36958,-0.064911,-0.053737,0.267023,0.045111,0.144731,-0.448408,0.202802,-0.1642
2,-0.070833,0.189026,-0.025993,0.433982,0.272961,-0.33419,0.410665,0.589773,-0.146922,-0.226773,...,0.435171,0.269906,0.155943,-0.058306,0.333816,-0.040538,-0.029088,-0.248313,-0.164111,-0.15288
3,-0.295647,0.143544,0.035283,0.247805,0.133728,-0.41271,0.185746,0.504729,0.301493,-0.0207,...,0.468709,0.011977,0.076603,-0.349125,-0.157048,-0.146275,0.555232,-0.186546,-0.230022,-0.224351
4,0.465297,0.308239,-0.307456,0.394394,-0.400529,-0.336445,0.339024,0.806053,0.318376,-0.585596,...,0.169249,0.219724,0.683552,-0.298755,0.397284,-0.035106,0.021862,0.069464,0.062833,-0.325291


In [48]:
# Train Test Split
from sklearn.model_selection import train_test_split
X_train_w2v,X_test_w2v,y_train_w2v,y_test_w2v = train_test_split(X_word2vec,df['rating'],test_size=0.20)

In [49]:
from sklearn.naive_bayes import GaussianNB

nb_model_word2vec =GaussianNB().fit(X_train_w2v,y_train_w2v)

In [50]:
y_pred_w2v=nb_model_word2vec.predict(X_test_w2v)

In [51]:
confusion_matrix(y_test_w2v,y_pred_w2v)

array([[ 570,  197],
       [ 411, 1222]])

In [52]:
print("Word2Vec Accuracy: ",round(accuracy_score(y_test_w2v,y_pred_w2v), 2))

Word2Vec Accuracy:  0.75
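
This score and the BoW and TF-IDF scores come from two separate, unseeded `train_test_split` calls, so the models were evaluated on different test reviews. A like-for-like comparison can split row indices once and reuse them for every feature matrix. The sketch below uses stand-in arrays; the names `features_a` and `features_b` are hypothetical.

```python
import numpy as np
from sklearn.model_selection import train_test_split

n_reviews = 12000
labels = np.array([1] * 8000 + [0] * 4000)  # toy labels with the notebook's class counts

# Split the row indices once; every representation then shares the same rows.
train_idx, test_idx = train_test_split(
    np.arange(n_reviews), test_size=0.20, stratify=labels, random_state=42
)

# Stand-ins for two feature matrices with one row per review.
features_a = np.random.default_rng(0).normal(size=(n_reviews, 5))
features_b = np.random.default_rng(1).normal(size=(n_reviews, 3))

a_train, a_test = features_a[train_idx], features_a[test_idx]
b_train, b_test = features_b[train_idx], features_b[test_idx]
y_train, y_test = labels[train_idx], labels[test_idx]
print(a_test.shape, b_test.shape, y_test.shape)
```

For a pandas DataFrame such as `X_word2vec`, the same indices would be applied with `.iloc[train_idx]` and `.iloc[test_idx]`.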


In [53]:
print(classification_report(y_test_w2v,y_pred_w2v))

              precision    recall  f1-score   support

           0       0.58      0.74      0.65       767
           1       0.86      0.75      0.80      1633

    accuracy                           0.75      2400
   macro avg       0.72      0.75      0.73      2400
weighted avg       0.77      0.75      0.75      2400
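
The per-class figures in this report follow directly from the Word2Vec confusion matrix shown earlier. The sketch below recomputes them, assuming scikit-learn's layout of rows as true classes and columns as predicted classes.

```python
import numpy as np

# Confusion matrix reported for the Word2Vec model.
cm = np.array([[570, 197],
               [411, 1222]])

correct = np.diag(cm)                  # correctly predicted count per class
precision = correct / cm.sum(axis=0)   # column sums = how often each class was predicted
recall = correct / cm.sum(axis=1)      # row sums = how often each class truly occurs
f1 = 2 * precision * recall / (precision + recall)

print("precision:", np.round(precision, 2))
print("recall:   ", np.round(recall, 2))
print("f1-score: ", np.round(f1, 2))
```

For class 0, precision is 570 / (570 + 411) and recall is 570 / (570 + 197): the model finds most negative reviews, but many of its negative predictions are actually positive reviews.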

