## Objective

The objective of this project is to build a machine learning model that classifies customer reviews on an e-commerce website as positive or negative.

Only English-language reviews are considered; reviews written in other languages are omitted.

To train the model on comparable data, we use the Amazon Review Dataset (https://nijianmo.github.io/amazon/index.html). We download the 'Clothing Shoes and Jewelry' subset in .json format, which contains 32,292,099 ratings.
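
Because the file is far too large to load comfortably in one go, it helps to see how chunked reading of a JSON Lines file behaves on a small scale. The sketch below is only an illustration: the five records and the extra `asin` field are made-up stand-ins, and the chunk size of 2 is chosen purely so the toy data splits into several chunks.

```python
import io
import pandas as pd

# Toy stand-in for the real file: one JSON object per line (JSON Lines).
# These records are invented for illustration only.
toy_json_lines = "\n".join([
    '{"overall": 5.0, "reviewText": "Great fit", "summary": "Love it", "asin": "X1"}',
    '{"overall": 1.0, "reviewText": "Fell apart", "summary": "Poor", "asin": "X2"}',
    '{"overall": 3.0, "reviewText": "It is okay", "summary": "Average", "asin": "X3"}',
    '{"overall": 4.0, "reviewText": "Nice colour", "summary": "Good", "asin": "X4"}',
    '{"overall": 2.0, "reviewText": "Too small", "summary": "Runs small", "asin": "X5"}',
])

# With chunksize set, read_json returns an iterator of DataFrames
# instead of a single DataFrame.
reader = pd.read_json(io.StringIO(toy_json_lines), lines=True, chunksize=2)

chunk_sizes = []
for chunk in reader:
    # Keep only the three columns this project uses.
    chunk = chunk[["reviewText", "overall", "summary"]]
    chunk_sizes.append(len(chunk))

print(chunk_sizes)
```

The real pipeline follows the same pattern, only with the downloaded file and a chunk size of 1 million.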

## Part 1

- The downloaded dataset has a little over 32 million reviews. Of its many columns, we keep only `reviewText`, `overall` & `summary`.


- We process 1 million reviews at a time and, from each such chunk, create a CSV file with 4,000 reviews for each of the ratings 1, 2, 4 & 5 and 8,000 reviews of rating 3.


- We save each CSV file as `<chunk_number>.csv` (for example `1.csv`). This gives 33 chunks, each with 24,000 reviews.


- We then concatenate all 33 chunks to create `final_dataset.csv` with 792,000 reviews.
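
The counts in the steps above can be checked with a few lines of arithmetic. This is just a sanity check using the figures quoted in this document; the variable names are chosen here for illustration.

```python
import math

total_reviews = 32_292_099   # ratings in the downloaded dataset
chunk_size = 1_000_000       # reviews read per chunk

# Number of chunks, counting the final partial chunk.
n_chunks = math.ceil(total_reviews / chunk_size)

# 4,000 reviews for each of ratings 1, 2, 4, 5 plus 8,000 for rating 3.
rows_per_chunk = 4 * 4000 + 8000

# Size of the concatenated dataset.
total_rows = n_chunks * rows_per_chunk

# The last chunk is much smaller than the others.
last_chunk_reviews = total_reviews - (n_chunks - 1) * chunk_size

print(n_chunks, rows_per_chunk, total_rows, last_chunk_reviews)
```

Note that the last chunk holds only 292,099 reviews, so the sampling step relies on it still containing enough reviews of every rating.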

In [None]:
import pandas as pd
import glob

# Stream the JSON Lines file in chunks of 1 million reviews to limit memory use.
reader = pd.read_json('Clothing_Shoes_and_Jewelry.json', lines=True, chunksize=1_000_000)

counter = 1
for chunk in reader:
    # Keep only the three columns used later.
    chunk = chunk[['reviewText', 'overall', 'summary']]

    # Sample 4,000 reviews for each of ratings 1, 2, 4, 5 and 8,000 for rating 3.
    df1 = chunk[chunk['overall'] == 1.0].sample(4000)
    df2 = chunk[chunk['overall'] == 2.0].sample(4000)
    df3 = chunk[chunk['overall'] == 3.0].sample(8000)
    df4 = chunk[chunk['overall'] == 4.0].sample(4000)
    df5 = chunk[chunk['overall'] == 5.0].sample(4000)

    # Write the 24,000 sampled reviews of this chunk to '<counter>.csv'.
    combined_chunk = pd.concat([df1, df2, df3, df4, df5], axis=0, ignore_index=True)
    combined_chunk.to_csv(f'{counter}.csv', index=False)
    print(f'{counter} out of 33')
    counter += 1

print('completed')

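# Match only the numbered chunk files, so that re-running this cell
# does not also pick up an existing final_dataset.csv.
files = glob.glob('[0-9]*.csv')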

combined_csv = pd.concat([pd.read_csv(f) for f in files], axis=0, ignore_index=True)
combined_csv.to_csv('final_dataset.csv', index=False)

Since only the `reviewText`, `overall` & `summary` columns are needed, we retain those and drop the rest. In other words, we drop 51 of the 54 columns in the dataset and keep the 3 listed above.
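
The per-rating sampling in the loop above assumes every chunk contains at least as many reviews of each rating as are requested; `DataFrame.sample` raises a `ValueError` when asked for more rows than exist (with the default `replace=False`). One possible guard, sketched here on toy data, is to cap the sample size at the number of available rows. The helper name `sample_up_to` and its `seed` parameter are hypothetical and not part of the original notebook.

```python
import pandas as pd

def sample_up_to(frame, n, seed=0):
    """Return n random rows, or every row if the frame has fewer than n."""
    return frame.sample(n=min(n, len(frame)), random_state=seed)

# Toy chunk: only 3 one-star reviews but 10 five-star reviews (made-up data).
toy_chunk = pd.DataFrame({
    "overall": [1.0] * 3 + [5.0] * 10,
    "reviewText": [f"review {i}" for i in range(13)],
})

ones = sample_up_to(toy_chunk[toy_chunk["overall"] == 1.0], 5)
fives = sample_up_to(toy_chunk[toy_chunk["overall"] == 5.0], 5)

print(len(ones), len(fives))
```

With such a guard a short chunk yields fewer rows instead of an error, at the cost of the chunk no longer being exactly balanced.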

## Part 2

In [12]:
import pandas as pd

df = pd.read_csv('final_dataset.csv', index_col=False)
df.columns

Index(['reviewText', 'overall', 'summary'], dtype='object')

We classify reviews with ratings 1 & 2 as negative and reviews with ratings 4 & 5 as positive. We therefore drop all rows (reviews) with a rating of 3.

In [None]:
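import numpy as np

df = pd.read_csv('final_dataset.csv')

# Drop neutral (rating 3) reviews and rows with missing values.
df = df[df['overall'] != 3]
df.dropna(inplace=True)

# Save the remaining reviews (ratings 1, 2, 4 and 5) as 'balanced_reviews.csv'.
df.to_csv('balanced_reviews.csv', index=False)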

# Adding a new column 'reviewPositive' whose value for a review (row) is 1
# if the rating of that review is greater than 3, and 0 otherwise.

df['reviewPositive'] = np.where(df['overall'] > 3, 1, 0)

# Dropping the 'overall' and 'summary' columns from the dataframe
df.drop(labels=['overall', 'summary'], axis=1, inplace=True)

# Dropping null values.
df.dropna(inplace=True)

# Saving the file as 'binarized_reviews.csv'
df.to_csv('binarized_reviews.csv')
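
To see what these steps do without the full dataset, the same filtering and labelling can be run on a handful of rows. The sketch below uses invented toy reviews and mirrors the logic of the cell above; it is an illustration rather than part of the pipeline.

```python
import numpy as np
import pandas as pd

# Six made-up reviews, one of them with missing text.
toy = pd.DataFrame({
    "reviewText": ["Fell apart", "Too small", "It is okay", "Nice", "Perfect", None],
    "overall": [1.0, 2.0, 3.0, 4.0, 5.0, 5.0],
    "summary": ["Bad", "Small", "Average", "Good", "Great", "Missing text"],
})

# Drop neutral (rating 3) reviews, then rows with missing values.
toy = toy[toy["overall"] != 3].dropna()

# 1 for ratings above 3 (positive), 0 otherwise (negative).
toy["reviewPositive"] = np.where(toy["overall"] > 3, 1, 0)

# Only the text and the binary label are kept.
toy = toy.drop(columns=["overall", "summary"])

labels = toy["reviewPositive"].tolist()
print(toy)
```

The rating-3 row and the row with missing text are removed, leaving two negative and two positive examples.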