About Dataset


Context
This is a small subset of a dataset of book reviews from the Amazon Kindle Store category.



Content
A 5-core dataset of product reviews from the Amazon Kindle Store category, covering May 1996 to July 2014. It contains a total of 982,619 entries. Each reviewer has at least 5 reviews and each product has at least 5 reviews in this dataset.
Columns



-asin - ID of the product, like B000FA64PK.

-helpful - helpfulness rating of the review, for example 2/3.

-overall - rating of the product.

-reviewText - text of the review (body).

-reviewTime - time of the review (raw).

-reviewerID - ID of the reviewer, like A3SPTOKDG7WBLN.

-reviewerName - name of the reviewer.

-summary - summary of the review (heading).

-unixReviewTime - unix timestamp.

Acknowledgements




This dataset is taken from the Amazon product data published on Julian McAuley's UCSD website: http://jmcauley.ucsd.edu/data/amazon/

The license to the data files belongs to them.

Inspiration



-Sentiment analysis on reviews.

-Understanding how people rate the usefulness of a review, and which factors influence the helpfulness of a review.

-Fake reviews and outliers.

-Best-rated product IDs, or similarity between products based on reviews alone (admittedly not the best idea).





In [103]:
#loading the dataset
import pandas as pd

In [104]:
dataset = pd.read_csv("preprocessed_kindle_review .csv")


In [105]:
dataset.head()

   Unnamed: 0  rating  reviewText                                         summary
0           0       5  This book was the very first bookmobile book I...  50 + years ago...
1           1       1  When I read the description for this book, I c...  Boring! Boring! Boring!
2           2       5  I just had to edit this review. This book is a...  Wiggleliscious/new toy ready/!!
3           3       5  I don't normally buy 'mystery' novels because ...  Very good read.
4           4       5  This isn't the kind of book I normally read, a...  Great Story!


In [106]:
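# Keep only the review text and the rating. .copy() makes `data` an independent frame,
# so the later data.loc[...] assignments do not operate on a slice of `dataset`.
data = dataset[['reviewText', 'rating']].copy()
data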

       reviewText                                         rating
0      This book was the very first bookmobile book I...       5
1      When I read the description for this book, I c...       1
2      I just had to edit this review. This book is a...       5
3      I don't normally buy 'mystery' novels because ...       5
4      This isn't the kind of book I normally read, a...       5
...    ...                                                   ...
11995  Had to read certain passages twice--typos. Wi...        2
11996  Not what i expected. yet a very interesting bo...       3
11997  Dragon Knights is a world where Knights ride d...       5
11998  Since this story is very short, it's hard to s...       4
11999  from 1922 an amazing collection of info on sym...       4

[12000 rows x 2 columns]


In [107]:
data.shape

(12000, 2)

In [108]:
data.isnull().sum()

reviewText    0
rating        0
dtype: int64

In [109]:
data['rating'].unique()

array([5, 1, 4, 3, 2])

In [110]:
data['rating'].value_counts()

rating
5    3000
4    3000
1    2000
3    2000
2    2000
Name: count, dtype: int64

In [14]:
#PREPROCESSING AND DATA CLEANING

In [112]:
# Binary sentiment label: ratings 1-2 -> 0 (negative), ratings 3-5 -> 1 (positive).
data.loc[:, 'label'] = data['rating'].apply(lambda x: 0 if x < 3 else 1)


In [113]:
data.tail()

       reviewText                                         rating  label
11995  Had to read certain passages twice--typos. Wi...        2      0
11996  Not what i expected. yet a very interesting bo...       3      1
11997  Dragon Knights is a world where Knights ride d...       5      1
11998  Since this story is very short, it's hard to s...       4      1
11999  from 1922 an amazing collection of info on sym...       4      1
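
The labeling rule maps ratings 1 and 2 to 0 and ratings 3 to 5 to 1, so the fairly even rating counts become an uneven label split. A minimal sketch that derives the split from the rating counts printed earlier; the variable names are illustrative and not part of the notebook:

```python
# Rating counts as printed by data['rating'].value_counts() above.
rating_counts = {5: 3000, 4: 3000, 1: 2000, 3: 2000, 2: 2000}

# Same rule as the notebook: rating < 3 -> 0 (negative), otherwise 1 (positive).
label_counts = {0: 0, 1: 0}
for rating, count in rating_counts.items():
    label = 0 if rating < 3 else 1
    label_counts[label] += count

total = sum(label_counts.values())
# Accuracy of a trivial model that always predicts the majority label.
majority_baseline = max(label_counts.values()) / total

print(label_counts)                 # {0: 4000, 1: 8000}
print(round(majority_baseline, 4))  # 0.6667
```

Any accuracy reported for a classifier on this label is best read against that baseline of roughly 67%.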


In [115]:
data['rating'].unique()

array([5, 1, 4, 3, 2])

In [116]:
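# Preview only: the lowercased Series is displayed but not assigned back,
# so data['reviewText'] keeps its original casing.
data['reviewText'].str.lower()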

0        this book was the very first bookmobile book i...
1        when i read the description for this book, i c...
2        i just had to edit this review. this book is a...
3        i don't normally buy 'mystery' novels because ...
4        this isn't the kind of book i normally read, a...
                               ...                        
11995    had to read certain passages twice--typos.  wi...
11996    not what i expected. yet a very interesting bo...
11997    dragon knights is a world where knights ride d...
11998    since this story is very short, it's hard to s...
11999    from 1922 an amazing collection of info on sym...
Name: reviewText, Length: 12000, dtype: object

In [117]:
# Remove every character that is not a letter, a digit, or whitespace.
data.loc[:, 'reviewText'] = data['reviewText'].str.replace(r'[^a-zA-Z0-9\s]', '', regex=True)


In [118]:
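# Remove @mentions and URLs. Note: the previous cell already stripped '@', ':', '/' and '.',
# so this pattern has nothing left to match at this point.
data.loc[:, 'reviewText'] = data['reviewText'].str.replace(r'(@\w+|https?://\S+|www\.\S+)', '', regex=True)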


In [119]:
data.tail()

       reviewText                                         rating  label
11995  Had to read certain passages twicetypos Wish ...        2      0
11996  Not what i expected yet a very interesting boo...       3      1
11997  Dragon Knights is a world where Knights ride d...       5      1
11998  Since this story is very short its hard to say...       4      1
11999  from 1922 an amazing collection of info on sym...       4      1
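
The order of the two cleaning cells matters: once punctuation is gone, the URL and mention pattern can no longer match, and HTML entities such as `&#8216;` leave their digits behind (a likely source of tokens like `don8216t` in the outputs). A minimal sketch of one possible reordering, using only the standard library. `clean_review` and the sample string are hypothetical and not part of the notebook:

```python
import html
import re

URL_OR_MENTION = re.compile(r'(@\w+|https?://\S+|www\.\S+)')
NON_ALNUM = re.compile(r'[^a-z0-9\s]')
WHITESPACE = re.compile(r'\s+')

def clean_review(text: str) -> str:
    """Hypothetical helper: the notebook's patterns, applied in a different order."""
    text = html.unescape(text)            # turn entities such as &#8216; into real characters
    text = text.lower()                   # lowercase so later steps see a single casing
    text = URL_OR_MENTION.sub(' ', text)  # strip URLs/mentions while '@', ':', '/', '.' still exist
    text = NON_ALNUM.sub('', text)        # then drop punctuation and other symbols
    return WHITESPACE.sub(' ', text).strip()

# Made-up sample text, for illustration only.
sample = "Don&#8216;t miss it! See https://example.com/page @bob  GREAT read."
print(clean_review(sample))  # dont miss it see great read
```

With pandas, such a helper could be applied as `data['reviewText'].apply(clean_review)`; whether the lowercasing belongs in this step is a design choice.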


In [25]:
pip install nltk

Collecting nltk
  Using cached nltk-3.9.2-py3-none-any.whl.metadata (3.2 kB)
Collecting click (from nltk)
  Using cached click-8.3.1-py3-none-any.whl.metadata (2.6 kB)
Collecting regex>=2021.8.3 (from nltk)
  Using cached regex-2025.11.3-cp310-cp310-win_amd64.whl.metadata (41 kB)
Collecting tqdm (from nltk)
  Using cached tqdm-4.67.1-py3-none-any.whl.metadata (57 kB)
Using cached nltk-3.9.2-py3-none-any.whl (1.5 MB)
Using cached regex-2025.11.3-cp310-cp310-win_amd64.whl (277 kB)
Using cached click-8.3.1-py3-none-any.whl (108 kB)
Using cached tqdm-4.67.1-py3-none-any.whl (78 kB)
Installing collected packages: tqdm, regex, click, nltk


In [120]:
from nltk.corpus import stopwords

In [121]:
# Requires a one-time nltk.download('stopwords') if the corpus is not installed yet.
stop_words = set(stopwords.words('english'))


In [122]:
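# Drop stopwords. Note: the comparison is case-sensitive and NLTK's list is lowercase,
# so capitalized tokens such as "This" or "I" are kept because the text was never lowercased.
data.loc[:, 'reviewText'] = data['reviewText'].fillna('').apply(
    lambda x: ' '.join(w for w in x.split() if w not in stop_words)
)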


In [123]:
data.loc[:, 'reviewText'] = data['reviewText'].str.replace(r'\s+', ' ', regex=True).str.strip()


In [124]:
# Preview only: the result is displayed but not assigned back to the column.
data['reviewText'].str.lower()

0        this book first bookmobile book i bought i sch...
1        when i read description book i couldnt wait re...
2        i edit review this book i believe i got right ...
3        i dont normally buy mystery novels i dont like...
4        this isnt kind book i normally read although i...
                               ...                        
11995    had read certain passages twicetypos wish buil...
11996    not expected yet interesting book usually don8...
11997    dragon knights world knights ride dragons slay...
11998    since story short hard say much without giving...
11999    1922 amazing collection info symbols cultures ...
Name: reviewText, Length: 12000, dtype: object

In [125]:
data.tail()

       reviewText                                         rating  label
11995  Had read certain passages twicetypos Wish buil...       2      0
11996  Not expected yet interesting book usually don8...       3      1
11997  Dragon Knights world Knights ride dragons slay...       5      1
11998  Since story short hard say much without giving...       4      1
11999  1922 amazing collection info symbols cultures ...       4      1
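
The rows above still begin with words such as "Had" and "Not" because the stopword check compares the token exactly as written. A minimal sketch of the difference between a case-sensitive and a case-insensitive filter. The small `stop_words` set and the `remove_stopwords` helper are stand-ins chosen for illustration, so the sketch runs without NLTK data:

```python
# Tiny stand-in stopword set (assumption); in the notebook this would be
# set(stopwords.words('english')).
stop_words = {"this", "i", "the", "was", "had", "not", "to"}

def remove_stopwords(text: str, case_sensitive: bool) -> str:
    """Drop stopwords, optionally comparing the lowercased token."""
    kept = []
    for word in text.split():
        key = word if case_sensitive else word.lower()
        if key not in stop_words:
            kept.append(word)
    return " ".join(kept)

sample = "This book was the first book I bought"
print(remove_stopwords(sample, case_sensitive=True))   # This book first book I bought
print(remove_stopwords(sample, case_sensitive=False))  # book first book bought
```

Lowercasing the column once before filtering would have the same effect as the case-insensitive comparison.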


In [126]:
data['rating'].unique()

array([5, 1, 4, 3, 2])

In [127]:
# WordNetLemmatizer needs the WordNet corpus (one-time nltk.download('wordnet')).
from nltk.stem import WordNetLemmatizer

In [128]:
lemmatizer = WordNetLemmatizer()

In [129]:
def lemmatize_words(text):
    """Lemmatize each whitespace-separated token, treating every token as a verb (pos='v')."""
    return " ".join(lemmatizer.lemmatize(word, pos='v') for word in text.split())

In [130]:
data.loc[:, 'reviewText'] = data['reviewText'].fillna('').apply(lemmatize_words)


In [131]:
# Redundant: this repeats the lemmatization already applied in the previous cell.
data.loc[:, 'reviewText'] = data['reviewText'].apply(lemmatize_words)

In [132]:
data['reviewText']

0        This book first bookmobile book I buy I school...
1        When I read description book I couldnt wait re...
2        I edit review This book I believe I get right ...
3        I dont normally buy mystery novels I dont like...
4        This isnt kind book I normally read although I...
                               ...                        
11995    Had read certain passages twicetypos Wish buil...
11996    Not expect yet interest book usually don8216t ...
11997    Dragon Knights world Knights ride dragons slay...
11998    Since story short hard say much without give a...
11999    1922 amaze collection info symbols culture aro...
Name: reviewText, Length: 12000, dtype: object

In [133]:
data.tail()

       reviewText                                         rating  label
11995  Had read certain passages twicetypos Wish buil...       2      0
11996  Not expect yet interest book usually don8216t ...       3      1
11997  Dragon Knights world Knights ride dragons slay...       5      1
11998  Since story short hard say much without give a...       4      1
11999  1922 amaze collection info symbols culture aro...       4      1


In [163]:
data = data.drop(columns=['rating'])


In [164]:
from sklearn.model_selection import train_test_split

In [165]:
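# 80/20 split. No random_state is set, so the split and the scores below change from run to run.
X_train, X_test, y_train, y_test = train_test_split(data['reviewText'], data['label'], test_size=0.20)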

In [166]:
from sklearn.feature_extraction.text import CountVectorizer

In [167]:
vector = CountVectorizer()

In [168]:
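# Bag-of-words counts: fit the vocabulary on the training text only, then reuse it for the test text.
# .toarray() turns the sparse result into a dense NumPy array, which is memory-hungry at this vocabulary size.
X_train_bow = vector.fit_transform(X_train).toarray()
X_test_bow = vector.transform(X_test).toarray()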

In [169]:
from sklearn.feature_extraction.text import TfidfVectorizer

# TF-IDF features, fitted on the training text only and then applied to the test text.
tfidf = TfidfVectorizer()
X_train_tfidf = tfidf.fit_transform(X_train).toarray()
X_test_tfidf = tfidf.transform(X_test).toarray()

In [170]:
X_train_bow

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]], shape=(9600, 36661))
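
The shape above gives a sense of what the dense conversion costs. A back-of-the-envelope sketch, assuming 8 bytes per entry (64-bit integer counts or 64-bit float TF-IDF weights); the variable names are illustrative:

```python
# Shape reported for X_train_bow above.
n_rows, n_cols = 9600, 36661

# Assumption: 8 bytes per entry (int64 counts or float64 TF-IDF weights).
bytes_per_entry = 8

n_entries = n_rows * n_cols
dense_bytes = n_entries * bytes_per_entry
dense_gib = dense_bytes / 2**30

print(n_entries)            # 351945600
print(round(dense_gib, 2))  # 2.62
```

Under that assumption each dense training matrix takes about 2.6 GiB, almost all of it zeros. scikit-learn's `LogisticRegression` also accepts the sparse matrices the vectorizers return, so keeping them sparse is an option worth considering.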

In [176]:
from sklearn.linear_model import LogisticRegression

# One classifier per feature set; max_iter is raised from the default of 100.
log_model_bow = LogisticRegression(max_iter=1000)
log_model_bow.fit(X_train_bow, y_train)

log_model_tfidf = LogisticRegression(max_iter=1000)
log_model_tfidf.fit(X_train_tfidf, y_train)



LogisticRegression(max_iter=1000)


In [177]:
from sklearn.metrics import confusion_matrix,accuracy_score,classification_report

In [182]:
y_pred_bow=log_model_bow.predict(X_test_bow)

In [183]:
y_pred_tfidf=log_model_tfidf.predict(X_test_tfidf)

In [184]:
print(accuracy_score(y_test,y_pred_bow))

0.8316666666666667


In [185]:
print(accuracy_score(y_test,y_pred_tfidf))

0.8366666666666667
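
The two accuracies are close, so it helps to translate the gap into a number of test reviews. A rough sketch using only the printed scores and the test-set size of 2400 (20% of 12000 reviews). The standard-error formula is a simple binomial approximation that treats test reviews as independent, not something computed in the notebook:

```python
import math

n_test = 2400                    # 20% of the 12000 reviews
acc_bow = 0.8316666666666667     # accuracy printed for the bag-of-words model
acc_tfidf = 0.8366666666666667   # accuracy printed for the TF-IDF model

# How many test reviews separate the two models?
extra_correct = round((acc_tfidf - acc_bow) * n_test)

# Rough binomial standard error of a single accuracy estimate.
std_err = math.sqrt(acc_bow * (1 - acc_bow) / n_test)

print(extra_correct)      # 12
print(round(std_err, 4))  # 0.0076
```

The gap of 0.005 is smaller than this rough standard error, and the split itself is unseeded, so a single run like this one does not clearly favor either representation.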


In [186]:
confusion_matrix(y_test,y_pred_bow)

array([[ 554,  223],
       [ 181, 1442]])

In [188]:
print(classification_report(y_test, y_pred_bow))

              precision    recall  f1-score   support

           0       0.75      0.71      0.73       777
           1       0.87      0.89      0.88      1623

    accuracy                           0.83      2400
   macro avg       0.81      0.80      0.80      2400
weighted avg       0.83      0.83      0.83      2400
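
The figures in the report can be recomputed by hand from the confusion matrix printed earlier for the bag-of-words model. A small sketch in plain Python; the variable names are illustrative, and the row/column layout follows scikit-learn's convention (rows are true labels, columns are predicted labels):

```python
# Confusion matrix printed above for the bag-of-words model
# (rows = true label, columns = predicted label).
cm = [[554, 223],
      [181, 1442]]

tn, fp = cm[0]  # true label 0: predicted 0, predicted 1
fn, tp = cm[1]  # true label 1: predicted 0, predicted 1

accuracy = (tn + tp) / (tn + fp + fn + tp)

precision_1 = tp / (tp + fp)  # of reviews predicted positive, the share that is truly positive
recall_1 = tp / (tp + fn)     # of truly positive reviews, the share that was found
precision_0 = tn / (tn + fn)  # same quantities with label 0 treated as the target class
recall_0 = tn / (tn + fp)

print(round(accuracy, 4))                         # 0.8317
print(round(precision_0, 2), round(recall_0, 2))  # 0.75 0.71
print(round(precision_1, 2), round(recall_1, 2))  # 0.87 0.89
```

The lower recall for label 0 means that 223 of the 777 negative reviews in the test set were classified as positive.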

