## About Dataset
Context
This is a small subset of a dataset of book reviews from the Amazon Kindle Store category.

Content
A 5-core dataset of product reviews from the Amazon Kindle Store category, covering May 1996 to July 2014. It contains a total of 982,619 entries. Each reviewer has at least 5 reviews and each product has at least 5 reviews in this dataset.
Columns

- asin - ID of the product, like B000FA64PK
- helpful - helpfulness rating of the review - example: 2/3.
- overall - rating of the product (the code below reads this value from a column named `rating`).
- reviewText - full text of the review (body).
- reviewTime - time of the review (raw).
- reviewerID - ID of the reviewer, like A3SPTOKDG7WBLN
- reviewerName - name of the reviewer.
- summary - short summary of the review (heading).
- unixReviewTime - unix timestamp.
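
As a small illustration of two of these fields, the sketch below converts a `unixReviewTime` value to a UTC datetime and turns a `helpful` value into a ratio. It assumes `helpful` is stored in the `2/3` form shown above (helpful votes over total votes), which may not match the CSV exactly; both function names are hypothetical.

```python
from datetime import datetime, timezone

def review_datetime(unix_review_time):
    """Convert a unix timestamp (seconds) to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(unix_review_time, tz=timezone.utc)

def helpful_ratio(helpful):
    """Parse 'helpful/total' (assumed format) into a fraction, or None if no votes."""
    up, total = (int(part) for part in helpful.split("/"))
    return up / total if total else None

print(review_datetime(1399248000).date())
print(helpful_ratio("2/3"))
```

Returning `None` for reviews with no votes keeps them distinguishable from reviews that were voted unhelpful.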

Acknowledgements
This dataset is taken from the Amazon product data published by Julian McAuley (UCSD): http://jmcauley.ucsd.edu/data/amazon/

The license to the data files belongs to them.

Inspiration
- Sentiment analysis on reviews.
- Understanding how people rate the usefulness of a review, and what factors influence a review's helpfulness.
- Fake reviews / outliers.
- Best rated product IDs, or similarity between products based on reviews alone (admittedly not the best idea).
- Any other interesting analysis

#### Best Practices
1. Preprocessing And Cleaning
2. Train Test Split
3. BOW, TF-IDF, Word2Vec
4. Train ML algorithms

In [None]:
# Load the dataset
import pandas as pd
data=pd.read_csv('Kindle Reviews/all_kindle_review.csv')
data.head()

In [None]:
# Take an explicit copy so later column assignments do not trigger SettingWithCopyWarning
df=data[['reviewText','rating']].copy()
df.head()

In [None]:
df.shape

In [None]:
## Missing Values
df.isnull().sum()

In [None]:
df['rating'].unique()

In [None]:
df['rating'].value_counts()

In [None]:
## Preprocessing And Cleaning

In [None]:
## positive review is 1 and negative review is 0
df['rating']=df['rating'].apply(lambda x:0 if x<3 else 1)
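
The labelling rule in the cell above can also be read as a plain function. A minimal sketch (the function name is hypothetical) that makes the threshold explicit: ratings of 1 and 2 become 0, while ratings of 3, 4 and 5 become 1, so neutral 3-star reviews are counted as positive.

```python
def to_sentiment(rating):
    """Map a 1-5 star rating to a binary label: 0 if below 3, otherwise 1."""
    return 0 if rating < 3 else 1

labels = [to_sentiment(r) for r in [1, 2, 3, 4, 5]]
print(labels)  # [0, 0, 1, 1, 1]
```

Whether 3-star reviews belong in the positive class is a modelling choice; dropping them is another common option.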

In [None]:
df['rating'].value_counts()

In [None]:
## 1. Lower All the cases
df['reviewText']=df['reviewText'].str.lower()

In [None]:
df.head()

In [None]:
import re
import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')

In [None]:
from bs4 import BeautifulSoup

In [None]:
## Remove URLs first, while '://' and '.' are still present for the pattern to match
df['reviewText']=df['reviewText'].apply(lambda x: re.sub(r'(http|https|ftp|ssh)://([\w_-]+(?:(?:\.[\w_-]+)+))([\w.,@?^=%&:/~+#-]*[\w@?^=%&/~+#-])?', '' , str(x)))
## Remove html tags before '<' and '>' are stripped
df['reviewText']=df['reviewText'].apply(lambda x: BeautifulSoup(x, 'lxml').get_text())
## Remove special characters (A-Z rather than A-z, which would also keep [ \ ] ^ _ `)
df['reviewText']=df['reviewText'].apply(lambda x:re.sub('[^a-zA-Z0-9 -]+', '',x))
## Remove the stopwords (build the set once instead of once per word)
stop_words=set(stopwords.words('english'))
df['reviewText']=df['reviewText'].apply(lambda x:" ".join([y for y in x.split() if y not in stop_words]))
## Remove any additional spaces
df['reviewText']=df['reviewText'].apply(lambda x: " ".join(x.split()))
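
To see what these cleaning steps do to a single string, here is a standard-library-only sketch. It is an illustration rather than the notebook's pipeline: the URL pattern is simplified, tags are stripped with a regular expression instead of BeautifulSoup, and `STOP_WORDS` is a tiny hypothetical stand-in for NLTK's English list. In this sketch URLs and tags are removed before punctuation, because the punctuation filter would otherwise delete the characters those patterns rely on.

```python
import re

STOP_WORDS = {"a", "an", "the", "is", "this", "it", "and", "at"}  # hypothetical subset

def clean_text(text):
    text = text.lower()
    text = re.sub(r"(?:https?|ftp|ssh)://\S+", "", text)      # drop URLs (simplified pattern)
    text = re.sub(r"<[^>]+>", " ", text)                      # drop HTML tags
    text = re.sub(r"[^a-z0-9 -]+", "", text)                  # drop special characters
    words = [w for w in text.split() if w not in STOP_WORDS]  # drop stopwords
    return " ".join(words)                                    # collapse extra whitespace

print(clean_text("This is a <b>GREAT</b> book!!  See http://example.com/review"))
# great book see
```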


In [None]:
df.head()

In [None]:
## Lemmatizer
from nltk.stem import WordNetLemmatizer
nltk.download('wordnet')  # WordNet data is required by the lemmatizer

In [None]:
lemmatizer=WordNetLemmatizer()

In [None]:
def lemmatize_words(text):
    return " ".join([lemmatizer.lemmatize(word) for word in text.split()])


In [None]:
df['reviewText']=df['reviewText'].apply(lemmatize_words)

In [None]:
df.head()

In [None]:
## Train Test Split (random_state is fixed so the split is reproducible)
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test=train_test_split(df['reviewText'],df['rating'],
                                              test_size=0.20,random_state=42)

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
bow=CountVectorizer()
X_train_bow=bow.fit_transform(X_train).toarray()
X_test_bow=bow.transform(X_test).toarray()
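
For intuition about what `CountVectorizer` produces, the following sketch builds a bag-of-words matrix by hand on two toy sentences. It is a simplification under stated assumptions: tokens are split on whitespace only (the real vectorizer applies its own token pattern and lowercasing), and the helper names are hypothetical. As in the cell above, the vocabulary is learned from the training texts only, so words unseen in training are ignored at transform time.

```python
from collections import Counter

def fit_vocabulary(texts):
    """Assign a column index to every distinct token, in sorted order."""
    tokens = sorted({tok for text in texts for tok in text.split()})
    return {tok: idx for idx, tok in enumerate(tokens)}

def transform(texts, vocabulary):
    """Return one row of token counts per text; unknown tokens are skipped."""
    rows = []
    for text in texts:
        counts = Counter(tok for tok in text.split() if tok in vocabulary)
        rows.append([counts[tok] for tok in vocabulary])
    return rows

train = ["good book good story", "bad story"]
vocab = fit_vocabulary(train)
print(vocab)                             # {'bad': 0, 'book': 1, 'good': 2, 'story': 3}
print(transform(train, vocab))           # [[0, 1, 2, 1], [1, 0, 0, 1]]
print(transform(["good plot"], vocab))   # [[0, 0, 1, 0]]
```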

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf=TfidfVectorizer()
X_train_tfidf=tfidf.fit_transform(X_train).toarray()
X_test_tfidf=tfidf.transform(X_test).toarray()
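
The weighting behind TF-IDF can be sketched in a few lines. The version below follows the smoothed formula that scikit-learn documents for its default settings, idf(t) = ln((1 + n) / (1 + df(t))) + 1, followed by L2 normalisation of each row. Whitespace tokenisation and the function name are simplifying assumptions, so treat it as an approximation of `TfidfVectorizer` rather than a replacement for it.

```python
import math
from collections import Counter

def tfidf_rows(texts):
    docs = [Counter(text.split()) for text in texts]
    vocab = sorted({tok for doc in docs for tok in doc})
    n = len(docs)
    # Document frequency: in how many texts each token appears
    df = {tok: sum(1 for doc in docs if tok in doc) for tok in vocab}
    # Smoothed inverse document frequency
    idf = {tok: math.log((1 + n) / (1 + df[tok])) + 1 for tok in vocab}
    rows = []
    for doc in docs:
        weights = [doc[tok] * idf[tok] for tok in vocab]
        norm = math.sqrt(sum(w * w for w in weights)) or 1.0
        rows.append([w / norm for w in weights])  # L2-normalised row
    return vocab, idf, rows

vocab, idf, rows = tfidf_rows(["good book", "bad book"])
print(vocab)
print({tok: round(val, 3) for tok, val in idf.items()})
```

In this toy corpus "book" appears in every text and so gets the minimum idf of 1, while "good" and "bad" are weighted more heavily because each appears in only one text.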

In [None]:
X_train_bow

In [None]:
from sklearn.naive_bayes import GaussianNB
nb_model_bow=GaussianNB().fit(X_train_bow,y_train)
nb_model_tfidf=GaussianNB().fit(X_train_tfidf,y_train)

In [None]:
from sklearn.metrics import confusion_matrix,accuracy_score,classification_report

In [None]:
y_pred_bow=nb_model_bow.predict(X_test_bow)

In [None]:
# Use the model trained on TF-IDF features, not the BOW model
y_pred_tfidf=nb_model_tfidf.predict(X_test_tfidf)

In [None]:
confusion_matrix(y_test,y_pred_bow)

In [None]:
print("BOW accuracy: ",accuracy_score(y_test,y_pred_bow))

In [None]:
confusion_matrix(y_test,y_pred_tfidf)

In [None]:
print("TFIDF accuracy: ",accuracy_score(y_test,y_pred_tfidf))
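
When reading the outputs above, it helps to recall how the numbers are laid out. The sketch below recomputes a binary confusion matrix and accuracy by hand on made-up labels (the values are illustrative, not results from this dataset). It uses the same convention as scikit-learn's `confusion_matrix`: rows are true classes and columns are predicted classes. The function names are hypothetical.

```python
def binary_confusion_matrix(y_true, y_pred):
    """Return [[TN, FP], [FN, TP]] for labels in {0, 1}."""
    matrix = [[0, 0], [0, 0]]
    for truth, pred in zip(y_true, y_pred):
        matrix[truth][pred] += 1
    return matrix

def accuracy(matrix):
    """Share of predictions on the diagonal (correct ones)."""
    correct = matrix[0][0] + matrix[1][1]
    total = sum(sum(row) for row in matrix)
    return correct / total

y_true = [1, 1, 1, 0, 0, 1]   # made-up labels
y_pred = [1, 0, 1, 0, 1, 1]   # made-up predictions
cm = binary_confusion_matrix(y_true, y_pred)
print(cm, accuracy(cm))
```

Because most ratings in review data tend to be positive, accuracy alone can look good for a model that rarely predicts class 0, so the off-diagonal cells are worth checking too.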