## About Dataset
Context
This is a small subset of the Amazon book-review dataset, restricted to the Kindle Store category.

Content
A 5-core dataset of product reviews from the Amazon Kindle Store category, covering May 1996 to July 2014. It contains a total of 982619 entries. Each reviewer and each product has at least 5 reviews in this dataset.
Columns

- asin - ID of the product, like B000FA64PK
- helpful - helpfulness rating of the review, e.g. 2/3 (stored in the file as a list, e.g. [2, 2]).
- overall - rating of the product (1 to 5).
- reviewText - text of the review (body).
- reviewTime - time of the review (raw).
- reviewerID - ID of the reviewer, like A3SPTOKDG7WBLN
- reviewerName - name of the reviewer.
- summary - summary of the review (heading).
- unixReviewTime - unix timestamp.
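
The sample rows shown later store `helpful` as a string such as `[2, 2]` and `unixReviewTime` as seconds since the epoch. A minimal sketch of parsing both, assuming the two numbers mean (helpful votes, total votes); the helper names `helpful_ratio` and `review_date` are hypothetical and not part of the notebook:

```python
import ast
from datetime import datetime, timezone

def helpful_ratio(raw):
    """Parse a string like "[2, 3]" into helpful/total; None when nobody voted."""
    helpful_votes, total_votes = ast.literal_eval(raw)
    if total_votes == 0:
        return None
    return helpful_votes / total_votes

def review_date(unix_seconds):
    """Convert a unix timestamp (seconds) to a UTC calendar date."""
    return datetime.fromtimestamp(unix_seconds, tz=timezone.utc).date()

print(helpful_ratio("[0, 1]"))   # 0.0
print(helpful_ratio("[0, 0]"))   # None
print(review_date(1399248000))   # 2014-05-05
```

Returning `None` for reviews without votes keeps "no votes" distinct from "voted unhelpful".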

Acknowledgements
This dataset is taken from the Amazon product data published by Julian McAuley (UCSD): http://jmcauley.ucsd.edu/data/amazon/

The license for the data files belongs to them.

Inspiration
- Sentiment analysis on reviews.
- Understanding how people rate the usefulness of a review, and which factors influence a review's helpfulness.
- Detecting fake reviews and outliers.
- Best-rated product IDs, or similarity between products based on reviews alone (admittedly not the best idea).
- Any other interesting analysis.

In [28]:
## Loading the dataset
import pandas as pd
df = pd.read_csv("Data/kindle_reviews.csv")

In [29]:
df.head()

Unnamed: 0.1,Unnamed: 0,asin,helpful,overall,reviewText,reviewTime,reviewerID,reviewerName,summary,unixReviewTime
0,0,B000F83SZQ,"[0, 0]",5,I enjoy vintage books and movies so I enjoyed ...,"05 5, 2014",A1F6404F1VG29J,Avidreader,Nice vintage story,1399248000
1,1,B000F83SZQ,"[2, 2]",4,This book is a reissue of an old one; the auth...,"01 6, 2014",AN0N05A9LIJEQ,critters,Different...,1388966400
2,2,B000F83SZQ,"[2, 2]",4,This was a fairly interesting read. It had ol...,"04 4, 2014",A795DMNCJILA6,dot,Oldie,1396569600
3,3,B000F83SZQ,"[1, 1]",5,I'd never read any of the Amy Brewster mysteri...,"02 19, 2014",A1FV0SX13TWVXQ,"Elaine H. Turley ""Montana Songbird""",I really liked it.,1392768000
4,4,B000F83SZQ,"[0, 1]",4,"If you like period pieces - clothing, lingo, y...","03 19, 2014",A3SPTOKDG7WBLN,Father Dowling Fan,Period Mystery,1395187200


In [30]:
df = df[['reviewText','overall']]
df.head()

Unnamed: 0,reviewText,overall
0,I enjoy vintage books and movies so I enjoyed ...,5
1,This book is a reissue of an old one; the auth...,4
2,This was a fairly interesting read. It had ol...,4
3,I'd never read any of the Amy Brewster mysteri...,5
4,"If you like period pieces - clothing, lingo, y...",4


In [31]:
df.shape

(982619, 2)

In [32]:
### Missing values
df.isna().sum()

reviewText    22
overall        0
dtype: int64

In [33]:
df = df.dropna()

In [34]:
df.isna().sum()

reviewText    0
overall       0
dtype: int64

In [35]:
df['overall'].unique()

array([5, 4, 3, 2, 1], dtype=int64)

In [36]:
df['overall'].value_counts()

overall
5    575246
4    254010
3     96193
2     34130
1     23018
Name: count, dtype: int64

## Preprocessing and Cleaning

In [37]:
# Positive review (overall >= 3) = 1, negative review (overall < 3) = 0
df['rating']=df['overall'].apply(lambda x:0 if x<3 else 1)

In [38]:
df.head()

Unnamed: 0,reviewText,overall,rating
0,I enjoy vintage books and movies so I enjoyed ...,5,1
1,This book is a reissue of an old one; the auth...,4,1
2,This was a fairly interesting read. It had ol...,4,1
3,I'd never read any of the Amy Brewster mysteri...,5,1
4,"If you like period pieces - clothing, lingo, y...",4,1


In [39]:
df.drop('overall',axis=1,inplace=True)

In [40]:
df['rating'].value_counts()

rating
1    925449
0     57148
Name: count, dtype: int64
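
The mapping above counts 3-star reviews as positive, which is one reason the negative class is so small. A minimal sketch, using the rating counts from the `value_counts()` output above, of how the class sizes change with the threshold; the `positive_from` parameter and helper names are hypothetical:

```python
# Rating counts copied from the value_counts() output above
rating_counts = {5: 575246, 4: 254010, 3: 96193, 2: 34130, 1: 23018}

def binarize(overall, positive_from=3):
    """Return 1 for a positive review, 0 for a negative one."""
    return 1 if overall >= positive_from else 0

def label_counts(counts, positive_from=3):
    """Aggregate per-star counts into counts per binary label."""
    totals = {0: 0, 1: 0}
    for overall, n in counts.items():
        totals[binarize(overall, positive_from)] += n
    return totals

current = label_counts(rating_counts)                     # 3 stars counted as positive
stricter = label_counts(rating_counts, positive_from=4)   # 3 stars counted as negative

print(current)    # {0: 57148, 1: 925449}
print(stricter)   # {0: 153341, 1: 829256}
print(round(current[0] / sum(current.values()), 4))   # 0.0582
```

With the notebook's threshold, fewer than 6% of the reviews are negative, so accuracy alone says little about how well negatives are detected.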

In [41]:
## 1. Lowercase the text
df['reviewText']=df['reviewText'].str.lower()

In [42]:
df.head()

Unnamed: 0,reviewText,rating
0,i enjoy vintage books and movies so i enjoyed ...,1
1,this book is a reissue of an old one; the auth...,1
2,this was a fairly interesting read. it had ol...,1
3,i'd never read any of the amy brewster mysteri...,1
4,"if you like period pieces - clothing, lingo, y...",1


In [43]:
## 2. Removing the special characters
import re
import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')
from bs4 import BeautifulSoup

[nltk_data] Downloading package stopwords to C:\Users\Bishal
[nltk_data]     Roy\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [44]:
!pip install swifter



In [45]:
import swifter

In [47]:
import re
from bs4 import BeautifulSoup
from nltk.corpus import stopwords
import swifter

# Define stopwords
stop_words = set(stopwords.words('english'))

# Remove HTML tags first, while '<' and '>' are still present
df['reviewText'] = df['reviewText'].swifter.apply(lambda x: BeautifulSoup(x, 'lxml').get_text())

# Remove URLs before punctuation is stripped; once ':', '/' and '.' are gone the pattern can no longer match
df['reviewText'] = df['reviewText'].swifter.apply(lambda x: re.sub(r'(http|https|ftp|ssh)://[\w_-]+(\.[\w_-]+)+[\w.,@?^=%&:/~+#-]*[\w@?^=%&/~+#-]?', '', str(x)))

# Remove special characters, keeping letters, digits, hyphens and spaces
df['reviewText'] = df['reviewText'].swifter.apply(lambda x: re.sub(r'[^a-zA-Z0-9 -]+', '', x))

# Remove stopwords (apostrophes are already stripped, so forms like "dont" are not matched and stay in the text)
df['reviewText'] = df['reviewText'].swifter.apply(lambda x: " ".join([y for y in x.split() if y not in stop_words]))

# Collapse any additional whitespace
df['reviewText'] = df['reviewText'].swifter.apply(lambda x: " ".join(x.split()))


Pandas Apply:   0%|          | 0/982597 [00:00<?, ?it/s]

Pandas Apply:   0%|          | 0/982597 [00:00<?, ?it/s]

Pandas Apply:   0%|          | 0/982597 [00:00<?, ?it/s]

Pandas Apply:   0%|          | 0/982597 [00:00<?, ?it/s]

Pandas Apply:   0%|          | 0/982597 [00:00<?, ?it/s]
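
The order of the cleaning steps matters: a URL or HTML pattern can only match while characters such as `:`, `/`, `<` and `>` are still in the text. A minimal sketch of the effect in plain `re`; the simplified `TAG_RE` and `URL_RE` patterns are assumptions standing in for BeautifulSoup and the longer URL pattern used in the notebook:

```python
import re

TAG_RE = re.compile(r'<[^>]+>')                        # simplified stand-in for BeautifulSoup
URL_RE = re.compile(r'(?:http|https|ftp|ssh)://\S+')   # simplified URL pattern
SPECIAL_RE = re.compile(r'[^a-zA-Z0-9 -]+')            # same character class as the notebook

def clean_markup_first(text):
    """Strip tags and URLs while their delimiters are still present."""
    text = text.lower()
    text = TAG_RE.sub(' ', text)
    text = URL_RE.sub(' ', text)
    text = SPECIAL_RE.sub('', text)
    return ' '.join(text.split())

def clean_special_first(text):
    """Strip special characters first; tags and URLs then no longer match."""
    text = text.lower()
    text = SPECIAL_RE.sub('', text)
    text = TAG_RE.sub(' ', text)
    text = URL_RE.sub(' ', text)
    return ' '.join(text.split())

sample = "Great <b>read</b>! More at http://example.com/books today"
print(clean_markup_first(sample))    # great read more at today
print(clean_special_first(sample))   # great breadb more at httpexamplecombooks today
```

In the second variant the tag letters and the URL survive as fused tokens, which then end up in the vocabulary.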

In [48]:
df['reviewText']

0         enjoy vintage books movies enjoyed reading boo...
1         book reissue old one author born 1910 era say ...
2         fairly interesting read old- style terminology...
3         id never read amy brewster mysteries one reall...
4         like period pieces - clothing lingo enjoy myst...
                                ...                        
982614    yasss hunny great read dre mess cherika refuse...
982615    enjoyed book beginning end far lex hoe sneaky ...
982616    great book cherika fool let man get away much ...
982617    say excellent book please believe definitely p...
982618    book everything hope alexus wise move lawd tho...
Name: reviewText, Length: 982597, dtype: object

In [49]:
df.head()

Unnamed: 0,reviewText,rating
0,enjoy vintage books movies enjoyed reading boo...,1
1,book reissue old one author born 1910 era say ...,1
2,fairly interesting read old- style terminology...,1
3,id never read amy brewster mysteries one reall...,1
4,like period pieces - clothing lingo enjoy myst...,1


In [50]:
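## Lemmatization
# WordNetLemmatizer needs the WordNet corpus; if it is missing, run nltk.download('wordnet') once
from nltk.stem import WordNetLemmatizer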


In [51]:
lemmatizer = WordNetLemmatizer()

In [52]:
def lemmatize_words(text):
    return " ".join([lemmatizer.lemmatize(word) for word in text.split()])

In [53]:
df['reviewText'] = df['reviewText'].swifter.apply(lambda x: lemmatize_words(x))

Pandas Apply:   0%|          | 0/982597 [00:00<?, ?it/s]

In [54]:
df.head()

Unnamed: 0,reviewText,rating
0,enjoy vintage book movie enjoyed reading book ...,1
1,book reissue old one author born 1910 era say ...,1
2,fairly interesting read old- style terminology...,1
3,id never read amy brewster mystery one really ...,1
4,like period piece - clothing lingo enjoy myste...,1


In [74]:
## train test split
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test = train_test_split(df['reviewText'],df['rating'],
                                                test_size=0.30)

In [75]:
X_train.shape,y_train.shape

((687817,), (687817,))
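
The split above is random and unseeded, so the share of negative reviews in the test set varies from run to run. scikit-learn's `train_test_split` accepts `stratify` and `random_state` for this purpose. The following is a minimal plain-Python sketch of what a stratified, seeded split does; the function `stratified_split` is hypothetical and only illustrates the idea:

```python
import random
from collections import defaultdict

def stratified_split(labels, test_size=0.30, seed=42):
    """Return (train_idx, test_idx) with the class ratio preserved in both parts."""
    rng = random.Random(seed)            # fixed seed gives a reproducible split
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)

# Toy labels with roughly the imbalance seen above: 94 positive, 6 negative
labels = [1] * 94 + [0] * 6
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))                 # 70 30
print(sum(1 for i in test_idx if labels[i] == 0))    # 2
```

In the notebook itself, the equivalent would presumably be passing `stratify=df['rating']` and a fixed `random_state` to the existing `train_test_split` call.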

In [76]:
from sklearn.feature_extraction.text import CountVectorizer
bow = CountVectorizer()
X_train_bow = bow.fit_transform(X_train)
X_test_bow = bow.transform(X_test)

In [77]:
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer()
X_train_tfidf = tfidf.fit_transform(X_train)
X_test_tfidf = tfidf.transform(X_test)

In [78]:
from imblearn.over_sampling import SMOTE

# Initialize SMOTE
smote = SMOTE()

# Fit and resample the bag-of-words training data
X_train_bow_resampled, y_train_bow_resampled = smote.fit_resample(X_train_bow, y_train)


In [79]:
# Fit and resample the TF-IDF training data under separate names,
# so the bag-of-words result above is not overwritten
X_train_tfidf_resampled, y_train_tfidf_resampled = smote.fit_resample(X_train_tfidf, y_train)

# Note: the classifiers in the following cells are fit on X_train_bow / X_train_tfidf,
# not on these resampled matrices


In [83]:
from sklearn.naive_bayes import MultinomialNB

# Initialize and fit the MultinomialNB model
nb_model_bow = MultinomialNB().fit(X_train_bow, y_train)
nb_model_tfidf = MultinomialNB().fit(X_train_tfidf, y_train)
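
The SMOTE cells above call `fit_resample`, but the Naive Bayes models are fit on `X_train_bow` and `X_train_tfidf`, so the oversampling has no effect on the results that follow. For intuition, SMOTE creates synthetic minority samples by interpolating between a minority point and one of its nearest minority neighbours. The sketch below is a simplified illustration of that idea, not imbalanced-learn's implementation; `smote_like_oversample` is a hypothetical name:

```python
import random

def smote_like_oversample(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    point and one of its k nearest minority neighbours (simplified SMOTE)."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        # Sort the other minority points by squared Euclidean distance to x
        others = [p for p in minority if p is not x]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(x, p)))
        neighbour = rng.choice(others[:k])
        u = rng.random()   # interpolation factor in [0, 1)
        synthetic.append(tuple(a + u * (b - a) for a, b in zip(x, neighbour)))
    return synthetic

minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
majority_count = 10
new_points = smote_like_oversample(minority, majority_count - len(minority))
print(len(minority) + len(new_points))   # 10, the same size as the majority class
```

To make the oversampling count in the notebook, the arrays returned by `fit_resample` would have to be the ones passed to `MultinomialNB().fit(...)`.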


In [85]:
from sklearn.metrics import confusion_matrix,classification_report,accuracy_score

In [86]:
y_pred_bow = nb_model_bow.predict(X_test_bow)

In [88]:
y_pred_tfidf = nb_model_tfidf.predict(X_test_tfidf)

In [92]:
accuracy = accuracy_score(y_test, y_pred_bow)
print("Bow accuracy score:", accuracy)

Bow accuracy score: 0.95


In [93]:
confusion_matrix(y_test,y_pred_bow)

array([[  3920,  13231],
       [  1508, 276121]], dtype=int64)

In [95]:
accuracy = accuracy_score(y_test, y_pred_tfidf)
print("Tfidf accuracy score:", accuracy)

confusion_matrix(y_test,y_pred_tfidf)

Tfidf accuracy score: 0.9418210190650654


array([[     1,  17150],
       [     0, 277629]], dtype=int64)
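
Both accuracy scores look high, but the confusion matrices show that most negative reviews are predicted as positive. A short sketch that derives per-class recall and balanced accuracy from the two matrices above, assuming scikit-learn's layout (rows are true classes, columns are predicted classes):

```python
def per_class_recall(matrix):
    """Recall per class: diagonal entry divided by the row total."""
    return [row[i] / sum(row) for i, row in enumerate(matrix)]

def accuracy(matrix):
    """Share of all samples on the diagonal."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return correct / sum(sum(row) for row in matrix)

# Confusion matrices copied from the outputs above
bow = [[3920, 13231], [1508, 276121]]
tfidf = [[1, 17150], [0, 277629]]

for name, m in [("BoW", bow), ("TF-IDF", tfidf)]:
    neg_recall, pos_recall = per_class_recall(m)
    balanced = (neg_recall + pos_recall) / 2
    print(f"{name}: accuracy={accuracy(m):.4f} "
          f"negative recall={neg_recall:.4f} balanced accuracy={balanced:.4f}")
```

The bag-of-words model finds roughly 23% of the negative reviews, while the TF-IDF model finds one out of 17151.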

In [97]:
## Now let's repeat this using Word2Vec
!pip install gensim



In [109]:
import gensim
from gensim.models import Word2Vec, keyedvectors

In [110]:
import gensim.downloader as api

In [112]:
from nltk import sent_tokenize
from gensim.utils import simple_preprocess

In [113]:
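# Tokenise each cleaned review. Punctuation is already stripped, so sent_tokenize
# typically yields one "sentence" per review and none for a review that is empty
# after cleaning; `words` can therefore be shorter than df.
words = []
for review in df['reviewText']:
    for sentence in sent_tokenize(review):
        words.append(simple_preprocess(sentence))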


In [114]:
words


[['enjoy',
  'vintage',
  'book',
  'movie',
  'enjoyed',
  'reading',
  'book',
  'plot',
  'unusual',
  'dont',
  'think',
  'killing',
  'someone',
  'self',
  'defense',
  'leaving',
  'scene',
  'body',
  'without',
  'notifying',
  'police',
  'hitting',
  'someone',
  'jaw',
  'knock',
  'would',
  'wash',
  'todaystill',
  'good',
  'read'],
 ['book',
  'reissue',
  'old',
  'one',
  'author',
  'born',
  'era',
  'say',
  'nero',
  'wolfe',
  'introduction',
  'quite',
  'interesting',
  'explaining',
  'author',
  'he',
  'forgotten',
  'id',
  'never',
  'heard',
  'himthe',
  'language',
  'little',
  'dated',
  'time',
  'like',
  'calling',
  'gun',
  'heater',
  'also',
  'made',
  'good',
  'use',
  'fire',
  'dictionary',
  'look',
  'word',
  'like',
  'deshabille',
  'canarsie',
  'still',
  'well',
  'worth',
  'look',
  'see'],
 ['fairly',
  'interesting',
  'read',
  'old',
  'style',
  'terminologyi',
  'glad',
  'get',
  'read',
  'story',
  'doesnt',
  'coarse'

In [116]:
model = gensim.models.Word2Vec(words)

In [117]:
model.corpus_count

982588
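
`corpus_count` is 982588, while the cleaned frame has 982597 rows, so nine reviews produced no token list (most likely because they were empty after cleaning). Any labels used later have to drop the same nine rows. A minimal sketch of filtering texts and labels together; `tokenize_aligned` is a hypothetical helper and `str.split` stands in for gensim's `simple_preprocess`:

```python
def tokenize_aligned(texts, labels):
    """Tokenise each text and drop empty results together with their labels."""
    tokens, kept_labels = [], []
    for text, label in zip(texts, labels):
        doc = text.split()      # stand-in for gensim's simple_preprocess
        if doc:                 # skip reviews that are empty after cleaning
            tokens.append(doc)
            kept_labels.append(label)
    return tokens, kept_labels

texts = ["great read", "", "boring plot", "   ", "loved it"]
labels = [1, 0, 0, 1, 1]

tokens, kept = tokenize_aligned(texts, labels)
print(len(tokens), len(kept))   # 3 3
print(kept)                     # [1, 0, 1]

# Truncating the labels instead pairs "loved it" with the label of "boring plot"
print(labels[:len(tokens)])     # [1, 0, 0]
```

Cutting the label list to the new length makes the sizes match but leaves every document after the first dropped row paired with the wrong label.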

In [118]:
model.epochs

5

In [119]:
model.wv.index_to_key

['book',
 'story',
 'read',
 'one',
 'love',
 'character',
 'like',
 'really',
 'good',
 'get',
 'series',
 'author',
 'great',
 'would',
 'time',
 'well',
 'reading',
 'first',
 'life',
 'know',
 'way',
 'loved',
 'want',
 'make',
 'much',
 'thing',
 'see',
 'even',
 'find',
 'could',
 'little',
 'also',
 'end',
 'next',
 'two',
 'enjoyed',
 'think',
 'go',
 'cant',
 'romance',
 'im',
 'short',
 'lot',
 'going',
 'take',
 'dont',
 'come',
 'didnt',
 'new',
 'wait',
 'feel',
 'say',
 'written',
 'keep',
 'recommend',
 'give',
 'many',
 'back',
 'never',
 'part',
 'need',
 'friend',
 'work',
 'put',
 'people',
 'thought',
 'review',
 'another',
 'made',
 'still',
 'liked',
 'found',
 'year',
 'woman',
 'reader',
 'world',
 'start',
 'man',
 'right',
 'bit',
 'something',
 'hot',
 'family',
 'writing',
 'relationship',
 'help',
 'looking',
 'novel',
 'together',
 'got',
 'always',
 'definitely',
 'page',
 'best',
 'though',
 'plot',
 'interesting',
 'better',
 'fun',
 'different',
 'sex'

In [120]:
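import numpy as np

def avg_word2vec(doc):
    # Keep only in-vocabulary words; key_to_index is a dict, so the membership
    # test is O(1) instead of a scan through the index_to_key list
    vectors = [model.wv[word] for word in doc if word in model.wv.key_to_index]
    # Fall back to a zero vector when no word of the document is in the vocabulary
    if not vectors:
        return np.zeros(model.wv.vector_size)
    return np.mean(vectors, axis=0)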

In [135]:
!pip install tqdm




In [138]:
from tqdm import tqdm

In [None]:
#apply for the entire sentences
import numpy as np
X=[]
for i in tqdm(range(len(words))):
    X.append(avg_word2vec(words[i]))
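
Averaging word vectors turns a variable-length token list into one fixed-length feature vector. A minimal plain-Python sketch of the idea with a toy embedding table; in the notebook the table would be `model.wv`, and the names `embeddings` and `average_vector` are hypothetical:

```python
# Toy embedding table; in the notebook this role is played by model.wv
embeddings = {
    "good": [1.0, 0.0, 2.0],
    "book": [3.0, 2.0, 0.0],
}
VECTOR_SIZE = 3

def average_vector(doc, table, size):
    """Mean of the vectors of in-vocabulary words; zeros if none are known."""
    known = [table[word] for word in doc if word in table]   # dict lookup per word
    if not known:
        return [0.0] * size
    return [sum(component) / len(known) for component in zip(*known)]

print(average_vector(["good", "book", "unseen"], embeddings, VECTOR_SIZE))   # [2.0, 1.0, 1.0]
print(average_vector(["unseen"], embeddings, VECTOR_SIZE))                   # [0.0, 0.0, 0.0]
```

The zero-vector fallback keeps one row per document, so features and labels stay the same length even when a document has no known words.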

In [143]:
import cupy as cp
from tqdm import tqdm

# Function to calculate average word2vec using GPU
def avg_word2vec(doc):
    # remove out-of-vocabulary words (key_to_index is a dict, so each lookup is O(1))
    valid_words = [word for word in doc if word in model.wv.key_to_index]
    if not valid_words:
        return cp.zeros(model.wv.vector_size)
    
    # Using cupy for GPU-accelerated computation
    vectors = cp.array([model.wv[word] for word in valid_words])
    return cp.mean(vectors, axis=0)

# Apply for the entire dataset in batches
X = []
batch_size = 1000  # Define a suitable batch size
for i in tqdm(range(0, len(words), batch_size)):
    batch = words[i:i + batch_size]
    X.extend([avg_word2vec(doc) for doc in batch])

# Convert X to a CuPy array or NumPy array as needed
X = cp.asnumpy(cp.array(X))  # Convert back to NumPy if needed


--------------------------------------------------------------------------------

  CuPy may not function correctly because multiple CuPy packages are installed
  in your environment:

    cupy, cupy-cuda11x

  Follow these steps to resolve this issue:

    1. For all packages listed above, run the following command to remove all
       existing CuPy installations:

         $ pip uninstall <package_name>

      If you previously installed CuPy via conda, also run the following:

         $ conda uninstall cupy

    2. Install the appropriate CuPy package.
       Refer to the Installation Guide for detailed instructions.

         https://docs.cupy.dev/en/stable/install.html

--------------------------------------------------------------------------------


  0%|          | 0/983 [00:00<?, ?it/s]
  0%|          | 1/983 [00:17<4:48:09, 

In [142]:
!pip install cupy




In [144]:
len(X)

982588

In [145]:
X[1]

array([ 3.59689832e-01, -2.14274228e-01,  3.91006470e-01,  5.20935833e-01,
       -8.01299155e-01,  9.68654454e-01, -4.92752641e-01,  3.19551736e-01,
        5.83681390e-02,  8.40776145e-01, -4.68573719e-01,  1.37554622e+00,
        2.30846807e-01, -3.57641608e-01,  5.59558757e-02, -7.38365173e-01,
       -2.01248750e-01,  1.35500193e+00, -3.54635626e-01, -2.81497926e-01,
        9.17247385e-02,  8.57834220e-01,  4.00017142e-01, -9.67737362e-02,
       -8.26528072e-02,  6.11208022e-01,  2.13331759e-01,  2.36462399e-01,
       -3.71228725e-01,  4.05258566e-01,  3.34701650e-02, -1.04636595e-01,
       -3.84601802e-01,  6.20178998e-01,  5.10337591e-01, -1.82049409e-01,
        3.41096401e-01,  2.44448841e-01,  6.20506883e-01, -9.90347937e-02,
       -3.25532019e-01,  2.36154363e-01, -6.04074180e-01, -8.83755460e-02,
        1.04354930e+00, -4.26314652e-01,  3.94280225e-01,  1.89956203e-01,
        1.71147168e-01, -6.44864962e-02,  3.89294303e-03,  8.68300274e-02,
       -1.79602847e-01, -

In [147]:
##independent Features
X_new=np.array(X)

In [148]:
X_new[0].shape

(100,)

In [150]:
## Dependent Features
## Output Features
y = df['rating']

In [151]:
## this is the final independent features
df=pd.DataFrame()
for i in range(0,len(X)):
    df=df.append(pd.DataFrame(X[i].reshape(1,-1)),ignore_index=True)
    

AttributeError: 'DataFrame' object has no attribute 'append'

In [152]:
# X_new is already a 2-D NumPy array (one row per document), so the frame can be
# built in one call instead of concatenating one single-row frame per document
df = pd.DataFrame(X_new)


In [153]:
df.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,90,91,92,93,94,95,96,97,98,99
0,0.164693,-0.450787,-0.076586,0.285654,-0.668866,0.652217,-0.270837,0.578819,0.215276,0.886951,...,-0.911499,-0.255396,1.375522,0.152857,-1.115374,0.137921,-0.22739,0.82861,-0.687673,-0.110202
1,0.35969,-0.214274,0.391006,0.520936,-0.801299,0.968654,-0.492753,0.319552,0.058368,0.840776,...,-0.175223,-0.494471,1.107128,-0.201796,-0.635321,0.366324,-0.316507,1.126595,-0.567191,-0.41356
2,0.306312,-0.108586,-0.749057,0.196136,-0.915475,1.272307,-0.311547,-0.641302,-0.017996,1.440623,...,-0.61497,-0.554416,0.276732,-0.577004,-1.074206,0.522517,0.191669,1.484632,-0.572112,-0.139593
3,-0.100046,0.042467,-1.116595,-0.417993,-0.480786,1.146366,-0.058657,-0.106494,0.697739,1.102172,...,-0.246911,-0.499656,1.103583,-0.760878,-0.809592,0.311672,-0.226485,0.691803,-0.30935,-0.744449
4,-0.946843,0.650548,0.077292,-0.193667,-1.193294,0.29066,-0.602315,-0.244617,0.278286,1.186648,...,-0.849107,-0.559049,0.840631,-0.46553,-0.198698,0.112827,-0.214781,1.522043,-0.955251,0.288204


In [154]:
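# Assignment aligns on index labels: df has a fresh 0..n-1 index while y keeps the
# original row labels, so rows without a matching label receive NaN
df['target']=y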

In [155]:
df.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,91,92,93,94,95,96,97,98,99,target
0,0.164693,-0.450787,-0.076586,0.285654,-0.668866,0.652217,-0.270837,0.578819,0.215276,0.886951,...,-0.255396,1.375522,0.152857,-1.115374,0.137921,-0.22739,0.82861,-0.687673,-0.110202,1.0
1,0.35969,-0.214274,0.391006,0.520936,-0.801299,0.968654,-0.492753,0.319552,0.058368,0.840776,...,-0.494471,1.107128,-0.201796,-0.635321,0.366324,-0.316507,1.126595,-0.567191,-0.41356,1.0
2,0.306312,-0.108586,-0.749057,0.196136,-0.915475,1.272307,-0.311547,-0.641302,-0.017996,1.440623,...,-0.554416,0.276732,-0.577004,-1.074206,0.522517,0.191669,1.484632,-0.572112,-0.139593,1.0
3,-0.100046,0.042467,-1.116595,-0.417993,-0.480786,1.146366,-0.058657,-0.106494,0.697739,1.102172,...,-0.499656,1.103583,-0.760878,-0.809592,0.311672,-0.226485,0.691803,-0.30935,-0.744449,1.0
4,-0.946843,0.650548,0.077292,-0.193667,-1.193294,0.29066,-0.602315,-0.244617,0.278286,1.186648,...,-0.559049,0.840631,-0.46553,-0.198698,0.112827,-0.214781,1.522043,-0.955251,0.288204,1.0


In [156]:
df.dropna(inplace=True)

In [157]:
df.isna().sum()

0         0
1         0
2         0
3         0
4         0
         ..
96        0
97        0
98        0
99        0
target    0
Length: 101, dtype: int64

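In [159]:
print(len(X), len(y))


982588 982597


In [160]:
# Truncation only equalises the lengths. The 9 missing documents are not
# necessarily the last 9, so rows after the first gap may carry the wrong label.
if len(y) > len(X):
    y = y[:len(X)]

In [161]:
print(len(X), len(y))


982588 982588


In [162]:
## Train Test Split
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.30)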

In [163]:
from sklearn.ensemble import RandomForestClassifier
classifier=RandomForestClassifier()

In [164]:
classifier.fit(X_train,y_train)

In [165]:
y_pred=classifier.predict(X_test)

In [166]:
from sklearn.metrics import accuracy_score,classification_report
print(accuracy_score(y_test,y_pred))

0.9417356170935995


In [167]:
print(classification_report(y_test,y_pred))


              precision    recall  f1-score   support

           0       0.46      0.00      0.01     17163
           1       0.94      1.00      0.97    277614

    accuracy                           0.94    294777
   macro avg       0.70      0.50      0.49    294777
weighted avg       0.91      0.94      0.91    294777
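
The report shows a recall of 0.00 for class 0, so the 0.94 accuracy comes almost entirely from predicting the majority class. A short sketch comparing the score with an always-positive baseline, using only the numbers printed above:

```python
# Support per class and accuracy copied from the outputs above
support = {0: 17163, 1: 277614}
model_accuracy = 0.9417356170935995
recall = {0: 0.00, 1: 1.00}   # rounded values from the report

majority_baseline = max(support.values()) / sum(support.values())
macro_recall = sum(recall.values()) / len(recall)

print(f"always-positive baseline: {majority_baseline:.4f}")
print(f"random forest accuracy:   {model_accuracy:.4f}")
print(f"macro-average recall:     {macro_recall:.2f}")
```

On this test split, a classifier that always predicts the positive class would score marginally higher than the random forest, which is why macro-averaged recall or F1 is the more informative metric here.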

