### Content
A core dataset of product reviews from the Amazon Kindle Store category, covering May 1996 to July 2014. It contains a total of 982,619 entries. Each reviewer and each product has at least 5 reviews in this dataset.

### Columns


- asin - ID of the product, like B000FA64PK.
- helpful - helpfulness rating of the review, e.g. 2/3.
- overall - rating of the product (this column is named `rating` in the CSV loaded below).
- reviewText - text of the review (body).
- reviewTime - time of the review (raw).
- reviewerID - ID of the reviewer, like A3SPTOKDG7WBLN.
- reviewerName - name of the reviewer.
- summary - summary of the review (heading).
- unixReviewTime - Unix timestamp.
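
Two of these columns need decoding before they are useful. In the CSV loaded below, `helpful` appears as a string such as `"[8, 10]"`, and `unixReviewTime` is a count of seconds since the Unix epoch. The following is a minimal sketch of one way to decode both, using two sample rows copied from the `df.head()` output; the new column names (`helpful_votes`, `total_votes`, `reviewDate`) are illustrative assumptions, not part of the dataset.

```python
import ast
import pandas as pd

# Two sample rows copied from the head() output in this notebook.
sample = pd.DataFrame({
    "helpful": ["[8, 10]", "[1, 3]"],
    "unixReviewTime": [1283385600, 1404518400],
})

# `helpful` is stored as text; literal_eval turns "[8, 10]" into the list [8, 10].
votes = sample["helpful"].apply(ast.literal_eval)
sample["helpful_votes"] = votes.apply(lambda v: v[0])  # hypothetical column name
sample["total_votes"] = votes.apply(lambda v: v[1])    # hypothetical column name

# Seconds since the Unix epoch -> pandas timestamps.
sample["reviewDate"] = pd.to_datetime(sample["unixReviewTime"], unit="s")
print(sample[["helpful_votes", "total_votes", "reviewDate"]])
```

The decoded dates agree with the raw `reviewTime` strings for the same rows ("09 2, 2010" and "07 5, 2014").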

In [29]:
import pandas as pd 
import numpy as np 

In [30]:
df = pd.read_csv('review.csv')

In [31]:
df.head()

,Unnamed: 0.1,Unnamed: 0,asin,helpful,rating,reviewText,reviewTime,reviewerID,reviewerName,summary,unixReviewTime
0,0,11539,B0033UV8HI,"[8, 10]",3,"Jace Rankin may be short, but he's nothing to ...","09 2, 2010",A3HHXRELK8BHQG,Ridley,Entertaining But Average,1283385600
1,1,5957,B002HJV4DE,"[1, 1]",5,Great short read. I didn't want to put it dow...,"10 8, 2013",A2RGNZ0TRF578I,Holly Butler,Terrific menage scenes!,1381190400
2,2,9146,B002ZG96I4,"[0, 0]",3,I'll start by saying this is the first of four...,"04 11, 2014",A3S0H2HV6U1I7F,Merissa,Snapdragon Alley,1397174400
3,3,7038,B002QHWOEU,"[1, 3]",3,Aggie is Angela Lansbury who carries pocketboo...,"07 5, 2014",AC4OQW3GZ919J,Cleargrace,very light murder cozy,1404518400
4,4,1776,B001A06VJ8,"[0, 1]",4,I did not expect this type of book to be in li...,"12 31, 2012",A3C9V987IQHOQD,Rjostler,Book,1356912000


In [32]:
df.drop(columns=['Unnamed: 0.1'], inplace=True)
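
Columns named `Unnamed: 0`, `Unnamed: 0.1`, and so on usually appear when a DataFrame is written to CSV with its index and then read back, possibly more than once. The sketch below reproduces the effect on a small made-up frame (`demo` is an assumption for illustration, not the review data) and shows that writing with `index=False` avoids it.

```python
import io
import pandas as pd

demo = pd.DataFrame({"rating": [3, 5], "reviewText": ["ok", "great"]})

# Writing with the default index=True stores the index as an unnamed first column,
# which read_csv then labels "Unnamed: 0".
buf = io.StringIO()
demo.to_csv(buf)
buf.seek(0)
reloaded = pd.read_csv(buf)
print(list(reloaded.columns))

# Writing with index=False keeps the file free of that extra column.
buf = io.StringIO()
demo.to_csv(buf, index=False)
buf.seek(0)
clean = pd.read_csv(buf)
print(list(clean.columns))
```

When the file already contains such a column, passing `index_col=0` to `pd.read_csv` is another option, since it reuses the column as the index instead of keeping it as data.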

In [33]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12000 entries, 0 to 11999
Data columns (total 10 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   Unnamed: 0      12000 non-null  int64 
 1   asin            12000 non-null  object
 2   helpful         12000 non-null  object
 3   rating          12000 non-null  int64 
 4   reviewText      12000 non-null  object
 5   reviewTime      12000 non-null  object
 6   reviewerID      12000 non-null  object
 7   reviewerName    11962 non-null  object
 8   summary         11998 non-null  object
 9   unixReviewTime  12000 non-null  int64 
dtypes: int64(3), object(7)
memory usage: 937.6+ KB


In [34]:
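# Keep only the text and the label; .copy() avoids SettingWithCopyWarning on later assignments
df = df[['reviewText', 'rating']].copy()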


In [35]:
df.head()

,reviewText,rating
0,"Jace Rankin may be short, but he's nothing to ...",3
1,Great short read. I didn't want to put it dow...,5
2,I'll start by saying this is the first of four...,3
3,Aggie is Angela Lansbury who carries pocketboo...,3
4,I did not expect this type of book to be in li...,4


In [36]:
df.shape

(12000, 2)

In [37]:
df.isnull().sum()

reviewText    0
rating        0
dtype: int64

In [38]:
df['rating'].unique()

array([3, 5, 4, 2, 1], dtype=int64)

In [39]:
df['rating'].value_counts()

rating
5    3000
4    3000
3    2000
2    2000
1    2000
Name: count, dtype: int64

### Data Preprocessing

In [41]:
# Binarize the label: ratings below 3 become 0 (negative), ratings of 3 or more become 1 (positive)
df['rating'] = df['rating'].apply(lambda x:0 if x<3 else 1)
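
Note that this rule counts 3-star reviews as positive. As a small illustration on a made-up Series (not the review data), the same mapping can also be written as a vectorized comparison, which avoids calling a Python function once per row:

```python
import pandas as pd

ratings = pd.Series([1, 2, 3, 4, 5], name="rating")

# Same rule as the lambda above: ratings below 3 -> 0, ratings of 3 or more -> 1.
via_apply = ratings.apply(lambda x: 0 if x < 3 else 1)

# Vectorized equivalent: compare the whole column at once, then cast bool -> int.
via_compare = (ratings >= 3).astype(int)

print(via_compare.tolist())  # [0, 0, 1, 1, 1]
```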

In [47]:

df.head()

df['rating'].value_counts()

rating
1    8000
0    4000
Name: count, dtype: int64

In [51]:
df['reviewText'] = df['reviewText'].str.lower() 

In [54]:
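# %pip install bs4 lxml nltk
import re
import nltk
from bs4 import BeautifulSoup
from nltk.corpus import stopwords  # a bare `import stopwords` raises ModuleNotFoundError

# Download the stop-word corpus once and build a set for fast lookups
nltk.download('stopwords', quiet=True)
stop_words = set(stopwords.words('english'))

# Remove HTML tags (before special characters are stripped, while the markup is still intact)
df['reviewText'] = df['reviewText'].apply(lambda x: BeautifulSoup(x, 'lxml').get_text())

# Remove URLs (also before special characters are stripped, or '://' would already be gone)
df['reviewText'] = df['reviewText'].apply(lambda x: re.sub(r'(http|https|ftp)://[a-zA-Z0-9./]+', '', x))

# Replace special characters with spaces
df['reviewText'] = df['reviewText'].apply(lambda x: re.sub(r'[^a-zA-Z0-9]+', ' ', x))

# Remove the stopwords
df['reviewText'] = df['reviewText'].apply(lambda x: " ".join([y for y in x.split() if y not in stop_words]))

# Remove any additional spaces
df['reviewText'] = df['reviewText'].apply(lambda x: " ".join(x.split()))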
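
The cleaning steps in this cell can be condensed into one function. The sketch below is a standard-library-only approximation, not the notebook's exact pipeline: a regular expression stands in for BeautifulSoup, and `STOP_WORDS` is a tiny hypothetical list used only for illustration. It strips markup and URLs before punctuation, because once characters such as `<`, `:` and `/` have been replaced by spaces, neither tags nor URLs can be recognized.

```python
import re
from html import unescape

# Tiny illustrative stop-word set (an assumption); a real run would use a fuller list.
STOP_WORDS = {"a", "an", "the", "is", "it", "to", "and", "of", "i", "this"}

TAG_RE = re.compile(r"<[^>]+>")              # crude stand-in for an HTML parser
URL_RE = re.compile(r"(?:https?|ftp)://\S+")
NON_ALNUM_RE = re.compile(r"[^a-z0-9]+")


def clean_review(text: str) -> str:
    """Lowercase, drop tags and URLs, strip punctuation, remove stop words."""
    text = text.lower()
    text = TAG_RE.sub(" ", text)        # 1. HTML tags, while '<' and '>' still exist
    text = unescape(text)               #    decode entities such as &amp;
    text = URL_RE.sub(" ", text)        # 2. URLs, while '://' still exists
    text = NON_ALNUM_RE.sub(" ", text)  # 3. everything that is not a letter or digit
    tokens = [t for t in text.split() if t not in STOP_WORDS]  # 4. stop words
    return " ".join(tokens)             # 5. joining on single spaces collapses whitespace


sample = "<p>This is a GREAT read!</p> See http://example.com/book for more."
print(clean_review(sample))  # great read see for more
```

Applied to the DataFrame, this would be a single call such as `df['reviewText'].apply(clean_review)`.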