# <b><u>Project Title: Build a recommender engine that uses customer ratings and purchase history to recommend items and improve sales</u></b>

### Amazon.com is one of the largest electronic commerce and cloud computing companies.

### A few Amazon-related facts:

### Amazon lost $4.8 million in August 2013, when its website went down for 40 minutes. It holds the patent on 1-Click buying and licenses it to Apple. Its Phoenix fulfilment centre covers 1.2 million square feet. Amazon relies heavily on a recommendation engine that reviews customer ratings and purchase history to recommend items and improve sales.


### The dataset used here contains over 2 million customer reviews and ratings of beauty products sold on Amazon's website.

### It contains

* ### the unique UserId (Customer Identification),
* ### the product ASIN (Amazon's unique product identification code for each product),
* ### Ratings (ranging from 1-5 based on customer satisfaction) and
* ### the Timestamp of the rating (in UNIX time)
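
Because the timestamps are stored as UNIX time (seconds since 1970-01-01 UTC), they are hard to read directly. The snippet below is a minimal sketch of converting them to dates with pandas. It uses two rows copied from the ratings preview in this notebook rather than the full file, and the `RatedAt` column name is a hypothetical choice made for this illustration.

```python
import pandas as pd

# Toy sample: two rows copied from the ratings preview shown in this notebook.
sample = pd.DataFrame({
    "UserId": ["A39HTATAQ9V7YF", "A3JM6GV9MNOF9X"],
    "ProductId": ["205616461", "558925278"],
    "Rating": [5.0, 3.0],
    "Timestamp": [1369699200, 1355443200],
})

# UNIX time counts seconds since 1970-01-01 UTC, so unit="s" is the matching scale.
# "RatedAt" is a hypothetical column name, not part of the original data.
sample["RatedAt"] = pd.to_datetime(sample["Timestamp"], unit="s")

print(sample[["UserId", "Rating", "RatedAt"]])
```

Once the column is a datetime, accessors such as `sample["RatedAt"].dt.year` can be used to group ratings by period.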

### These ratings are part of a larger Amazon dataset of product reviews and metadata, which includes 142.8 million reviews spanning May 1996 - July 2014.

### The full dataset includes reviews (ratings, text, helpfulness votes), product metadata (descriptions, category information, price, brand, and image features), and links (also viewed/also bought graphs).


In [14]:

import numpy as np
import scipy
import pandas as pd
import math
import random
import sklearn
from nltk.corpus import stopwords
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from scipy.sparse.linalg import svds
import matplotlib.pyplot as plt

In [15]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [3]:
# Load the dataset
rating_df = pd.read_csv('/content/drive/MyDrive/AlmaBetter/Cohort Aravali/Module 4/Week 4/ratings_Beauty.csv')

rating_json = pd.read_json('/content/drive/MyDrive/AlmaBetter/Cohort Aravali/Module 4/Week 4/reviews_Beauty_5.json.gz', compression='infer',lines = True)


rating_df.head()

Unnamed: 0,UserId,ProductId,Rating,Timestamp
0,A39HTATAQ9V7YF,205616461,5.0,1369699200
1,A3JM6GV9MNOF9X,558925278,3.0,1355443200
2,A1Z513UWSAAO0F,558925278,5.0,1404691200
3,A1WMRR494NWEWV,733001998,4.0,1382572800
4,A3IAAVS479H7M7,737104473,1.0,1274227200


In [4]:
rating_json.head()

Unnamed: 0,reviewerID,asin,reviewerName,helpful,reviewText,overall,summary,unixReviewTime,reviewTime
0,A1YJEY40YUW4SE,7806397051,Andrea,"[3, 4]",Very oily and creamy. Not at all what I expect...,1,Don't waste your money,1391040000,"01 30, 2014"
1,A60XNB876KYML,7806397051,Jessica H.,"[1, 1]",This palette was a decent price and I was look...,3,OK Palette!,1397779200,"04 18, 2014"
2,A3G6XNM240RMWA,7806397051,Karen,"[0, 1]",The texture of this concealer pallet is fantas...,4,great quality,1378425600,"09 6, 2013"
3,A1PQFP6SAJ6D80,7806397051,Norah,"[2, 2]",I really can't tell what exactly this thing is...,2,Do not work on my face,1386460800,"12 8, 2013"
4,A38FVHZTNQ271F,7806397051,Nova Amor,"[0, 0]","It was a little smaller than I expected, but t...",3,It's okay.,1382140800,"10 19, 2013"
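
The `helpful` column in the preview above holds a two-element list per review. The sketch below assumes each entry means `[helpful_votes, total_votes]` and splits it into separate columns plus a ratio. The three rows are copied from the preview, and the new column names (`helpful_votes`, `total_votes`, `helpful_ratio`) are hypothetical.

```python
import pandas as pd

# Toy sample: three rows copied from the reviews preview above.
reviews = pd.DataFrame({
    "reviewerID": ["A1YJEY40YUW4SE", "A60XNB876KYML", "A38FVHZTNQ271F"],
    "helpful": [[3, 4], [1, 1], [0, 0]],
    "overall": [1, 3, 3],
})

# Assumption: each list is [helpful_votes, total_votes].
votes = pd.DataFrame(
    reviews["helpful"].tolist(),
    columns=["helpful_votes", "total_votes"],
    index=reviews.index,
)

# Mask zero totals so reviews without votes get NaN instead of a division by zero.
votes["helpful_ratio"] = votes["helpful_votes"] / votes["total_votes"].where(votes["total_votes"] > 0)

reviews = reviews.join(votes)
print(reviews[["reviewerID", "helpful_votes", "total_votes", "helpful_ratio"]])
```

Keeping the ratio as `NaN` for reviews with no votes avoids treating "nobody voted" as "nobody found it helpful".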


In [5]:
rating_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2023070 entries, 0 to 2023069
Data columns (total 4 columns):
 #   Column     Dtype  
---  ------     -----  
 0   UserId     object 
 1   ProductId  object 
 2   Rating     float64
 3   Timestamp  int64  
dtypes: float64(1), int64(1), object(2)
memory usage: 61.7+ MB


In [6]:
rating_json.reviewerID.unique()

array(['A1YJEY40YUW4SE', 'A60XNB876KYML', 'A3G6XNM240RMWA', ...,
       'A3L85FTL937CEC', 'A1QAWMPT3S6YIB', 'A2CG5Y82ZZNY6W'], dtype=object)

In [7]:
rating_df.UserId.unique()


array(['A39HTATAQ9V7YF', 'A3JM6GV9MNOF9X', 'A1Z513UWSAAO0F', ...,
       'AFPRQT3V8C1U1', 'A1RYQPQ01T5D5R', 'A3MQDRRGC9070R'], dtype=object)

In [8]:
rating_df['UserId'].value_counts()

A3KEZLJ59C1JVH    389
A281NPSIMI1C2R    336
A3M174IC0VXOS2    326
A2V5R832QCSOMX    278
A3LJLRIZL38GG3    276
                 ... 
APAES6XEFV9Q1       1
A2KESJRGCHF12H      1
A3ULUTR76528UV      1
A2EOWJM6KDY5CH      1
A1JIE6Y2GOUQYP      1
Name: UserId, Length: 1210271, dtype: int64

In [9]:
rating_json['reviewerID'].value_counts()

A2V5R832QCSOMX    204
ALNFHVS3SC4FV     192
AKMEY1BSHSDG7     182
A3KEZLJ59C1JVH    154
ALQGOMOY1F5X9     150
                 ... 
A2RGE6WNAYPSD5      5
A28CJ91VZ63A7A      5
A3578WPHOPUCQD      5
A2F1FP9DWAWMNB      5
A5H8NA5CJ0FK4       5
Name: reviewerID, Length: 22363, dtype: int64

In [10]:
left = rating_json

In [11]:
right=rating_df

In [12]:
rating_json.rename({"reviewerID": "UserId"}, axis = "columns", inplace = True) 



In [16]:
# UserId is the only column the two frames share, so name the join key explicitly
df3 = pd.merge(rating_df, rating_json, on='UserId')
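
Joining on the user alone pairs every rating a user gave with every review that user wrote, so the result can grow well beyond either table. The sketch below shows one possible alternative, assuming a row-level match is wanted: rename `asin` to `ProductId` as well and join on both keys. The data is a made-up toy example (`U1`, `P1`, and so on are hypothetical ids), not the notebook's files.

```python
import pandas as pd

# Hypothetical toy data with the same column names as the two frames in this notebook.
ratings = pd.DataFrame({
    "UserId": ["U1", "U1", "U2", "U3"],
    "ProductId": ["P1", "P2", "P1", "P3"],
    "Rating": [5.0, 3.0, 4.0, 1.0],
})
reviews = pd.DataFrame({
    "reviewerID": ["U1", "U1", "U2"],
    "asin": ["P1", "P2", "P1"],
    "summary": ["great quality", "OK Palette!", "It's okay."],
})

# Align both key names, not just the user id.
reviews = reviews.rename(columns={"reviewerID": "UserId", "asin": "ProductId"})

# Join on the user only: each rating is paired with every review by that user.
by_user = pd.merge(ratings, reviews, on="UserId")

# Join on (user, product): one review per rating at most.
# validate raises if a key pair is duplicated; indicator adds a _merge column.
by_pair = pd.merge(
    ratings, reviews,
    on=["UserId", "ProductId"],
    how="left",
    validate="one_to_one",
    indicator=True,
)

print(len(by_user), len(by_pair))
print(by_pair["_merge"].value_counts())
```

With `how="left"` every rating is kept, and the `_merge` column shows which ratings have a matching review text.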


In [None]:
products_per_user = rating_df.groupby(by='UserId')['Rating'].count().sort_values(ascending=False)

In [None]:
products_per_user

UserId
A3KEZLJ59C1JVH           389
A281NPSIMI1C2R           336
A3M174IC0VXOS2           326
A2V5R832QCSOMX           278
A3LJLRIZL38GG3           276
                        ... 
A3BQ47C773YMU1             1
A3BQ3Y37XL049D             1
A3BQ3NGQ3JJBR3             1
A3BQ3BW37JKZZ4             1
A00008821J0F472NDY6A2      1
Name: Rating, Length: 1210271, dtype: int64

In [None]:
print('unique users =', df3['UserId'].nunique())

unique users = 1210271


In [None]:
print('unique product =', df3['ProductId'].nunique())

unique product = 71056


In [None]:
# Top 10 users by number of ratings
top_users = rating_df.groupby('UserId').size().sort_values(ascending=False)[:10]
print('Top 10 users based on ratings:\n', top_users)

Top 10 users based on ratings:
 UserId
A3KEZLJ59C1JVH    389
A281NPSIMI1C2R    336
A3M174IC0VXOS2    326
A2V5R832QCSOMX    278
A3LJLRIZL38GG3    276
ALQGOMOY1F5X9     275
AKMEY1BSHSDG7     269
A3R9H6OKZHHRJD    259
A1M04H40ZVGWVG    249
A1RRMZKOMZ2M7J    225
dtype: int64


In [None]:
# Keep only the ratings from users who have rated at least 20 items
counts = rating_df.UserId.value_counts()
rating_df1 = rating_df[rating_df.UserId.isin(counts[counts >= 20].index)]
print('Number of ratings from users who have rated 20 or more items =', len(rating_df1))
print('Number of unique users in the final data = ', rating_df1['UserId'].nunique())
print('Number of unique products in the final data = ', rating_df1['ProductId'].nunique())

Number of ratings from users who have rated 20 or more items = 97860
Number of unique users in the final data =  2826
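
The filtering step above can be wrapped in a small helper so that the threshold, the row count, and the user and product counts cannot drift apart. This is a minimal sketch on made-up toy data; `filter_active_users` and `summarize` are hypothetical helper names, and the density figure is an extra added here for illustration.

```python
import pandas as pd

def filter_active_users(ratings: pd.DataFrame, min_ratings: int) -> pd.DataFrame:
    """Keep only rows from users with at least `min_ratings` ratings."""
    counts = ratings["UserId"].value_counts()
    active = counts[counts >= min_ratings].index
    return ratings[ratings["UserId"].isin(active)]

def summarize(ratings: pd.DataFrame) -> dict:
    """Count ratings, users, and products, and how full the user-item matrix is."""
    n_users = ratings["UserId"].nunique()
    n_products = ratings["ProductId"].nunique()
    n_ratings = len(ratings)
    density = n_ratings / (n_users * n_products) if n_users and n_products else 0.0
    return {"ratings": n_ratings, "users": n_users,
            "products": n_products, "density": density}

# Hypothetical toy data: U1 rated 3 items, U2 rated 2, U3 rated 1.
toy = pd.DataFrame({
    "UserId": ["U1", "U1", "U1", "U2", "U2", "U3"],
    "ProductId": ["P1", "P2", "P3", "P1", "P2", "P1"],
    "Rating": [5.0, 4.0, 3.0, 2.0, 5.0, 1.0],
})

filtered = filter_active_users(toy, min_ratings=2)
summary = summarize(filtered)
print(summary)
```

On the notebook's data the equivalent call would be `summarize(filter_active_users(rating_df, 20))`.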
