# Project 5<br>$\color{coral}{\text{Building User-Based Recommendation Model for Amazon}}$<br>*Name of the contributor: Rajeev Vhanhuve*

![Amazon logo](https://upload.wikimedia.org/wikipedia/commons/thumb/a/a9/Amazon_logo.svg/905px-Amazon_logo.svg.png)

DESCRIPTION

The dataset provided contains movie reviews given by Amazon customers. Reviews were given between May 1996 and July 2014.
 
Data Dictionary

user_id – IDs of the 4848 customers who provided ratings<br>
Movie1 to Movie206 – the 206 movies rated by those 4848 distinct users
 
Data Considerations

-	Not every user has watched every movie, so not every movie is rated by every user. These missing values are represented by NA.
-	Ratings are on a scale of -1 to 10, where -1 is the lowest rating and 10 is the highest.
 
Analysis Task

-	Exploratory Data Analysis:
     -	Which movies have maximum views/ratings?
     -	What is the average rating for each movie? Define the top 5 movies with the maximum ratings.
     -	Define the top 5 movies with the least audience.
   
-	Recommendation Model: Some of the movies have not been watched and are therefore not rated by the users. Amazon would like to take this as an opportunity to build a machine learning recommendation algorithm that predicts the ratings for each user.

     -	Divide the data into training and test data
     -	Build a recommendation model on training data
     -	Make predictions on the test data

In [1]:
# Import necessary libraries
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from datetime import datetime

# CONFIGURATION
import warnings
warnings.filterwarnings("ignore")

In [2]:
# Load the dataset
movie_df=pd.read_csv('Amazon - Movies and TV Ratings.csv')

In [3]:
# View the top 5 elements of the dataset
movie_df.head()

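| | user_id | Movie1 | Movie2 | Movie3 | Movie4 | Movie5 | Movie6 | Movie7 | Movie8 | Movie9 | ... | Movie197 | Movie198 | Movie199 | Movie200 | Movie201 | Movie202 | Movie203 | Movie204 | Movie205 | Movie206 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | A3R5OBKS7OM2IR | 5.0 | 5.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 1 | AH3QC2PC1VTGP | NaN | NaN | 2.0 | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 2 | A3LKP6WPMP9UKX | NaN | NaN | NaN | 5.0 | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 3 | AVIY68KEPQ5ZD | NaN | NaN | NaN | 5.0 | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 4 | A1CV1WROP5KTTW | NaN | NaN | NaN | NaN | 5.0 | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |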


In [4]:
# shape of the dataset
movie_df.shape

(4848, 207)

In [5]:
# info
movie_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4848 entries, 0 to 4847
Columns: 207 entries, user_id to Movie206
dtypes: float64(206), object(1)
memory usage: 7.7+ MB


In [6]:
# Apply describe() for descriptive statistics
movie_df.describe().T

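| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| Movie1 | 1.0 | 5.000000 | NaN | 5.0 | 5.00 | 5.0 | 5.0 | 5.0 |
| Movie2 | 1.0 | 5.000000 | NaN | 5.0 | 5.00 | 5.0 | 5.0 | 5.0 |
| Movie3 | 1.0 | 2.000000 | NaN | 2.0 | 2.00 | 2.0 | 2.0 | 2.0 |
| Movie4 | 2.0 | 5.000000 | 0.000000 | 5.0 | 5.00 | 5.0 | 5.0 | 5.0 |
| Movie5 | 29.0 | 4.103448 | 1.496301 | 1.0 | 4.00 | 5.0 | 5.0 | 5.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| Movie202 | 6.0 | 4.333333 | 1.632993 | 1.0 | 5.00 | 5.0 | 5.0 | 5.0 |
| Movie203 | 1.0 | 3.000000 | NaN | 3.0 | 3.00 | 3.0 | 3.0 | 3.0 |
| Movie204 | 8.0 | 4.375000 | 1.407886 | 1.0 | 4.75 | 5.0 | 5.0 | 5.0 |
| Movie205 | 35.0 | 4.628571 | 0.910259 | 1.0 | 5.00 | 5.0 | 5.0 | 5.0 |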


In [7]:
movie_df.columns

Index(['user_id', 'Movie1', 'Movie2', 'Movie3', 'Movie4', 'Movie5', 'Movie6',
       'Movie7', 'Movie8', 'Movie9',
       ...
       'Movie197', 'Movie198', 'Movie199', 'Movie200', 'Movie201', 'Movie202',
       'Movie203', 'Movie204', 'Movie205', 'Movie206'],
      dtype='object', length=207)

In [8]:
# Number of missing values in each column.
movie_df.isna().sum()

user_id        0
Movie1      4847
Movie2      4847
Movie3      4847
Movie4      4846
            ... 
Movie202    4842
Movie203    4847
Movie204    4840
Movie205    4813
Movie206    4835
Length: 207, dtype: int64
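
The per-column counts above show how sparse the matrix is (Movie1, for example, is missing for 4847 of 4848 users). A single sparsity figure can summarise this. The sketch below is only an illustration on a made-up frame, `toy_df`, standing in for `movie_df`; the values are assumptions, not the real data.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for movie_df: 4 users, 3 movies, NaN = not rated.
toy_df = pd.DataFrame({
    "user_id": ["u1", "u2", "u3", "u4"],
    "Movie1": [5.0, np.nan, np.nan, np.nan],
    "Movie2": [np.nan, 4.0, np.nan, 2.0],
    "Movie3": [np.nan, np.nan, np.nan, 3.0],
})

ratings = toy_df.drop(columns="user_id")

n_cells = ratings.size                       # every user-movie pair
n_rated = int(ratings.notna().sum().sum())   # pairs that carry a rating
sparsity = 1 - n_rated / n_cells             # share of pairs with no rating

print(f"{n_rated} of {n_cells} cells are rated; sparsity = {sparsity:.2%}")
```

Running the same three lines on `movie_df.drop(columns="user_id")` should give the corresponding figure for the full dataset.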

### Exploratory Data Analysis:

#### - Which movies have maximum views/ratings?

In [9]:
# Movies having maximum views.
movie_df.describe().T['count'].sort_values(ascending=False)[:5]

Movie127    2313.0
Movie140     578.0
Movie16      320.0
Movie103     272.0
Movie29      243.0
Name: count, dtype: float64

In [10]:
# Movies with the highest total (summed) rating.
movie_df.drop('user_id', axis=1).sum().sort_values(ascending = False)[:5]

Movie127    9511.0
Movie140    2794.0
Movie16     1446.0
Movie103    1241.0
Movie29     1168.0
dtype: float64

*So Movie127 has the most views and the highest total rating.*
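
The two cells above answer "most views" with a count and "maximum ratings" with a sum. As a hedged alternative, both can be read from one summary table built directly with `count()`, `sum()` and `mean()`, without going through `describe()`. The frame `toy_df` and the column names `n_ratings`, `rating_sum` and `rating_mean` are made up for this sketch.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for movie_df.
toy_df = pd.DataFrame({
    "user_id": ["u1", "u2", "u3", "u4"],
    "Movie1": [5.0, np.nan, np.nan, np.nan],
    "Movie2": [4.0, 5.0, 3.0, np.nan],
    "Movie3": [np.nan, np.nan, 2.0, 5.0],
})

ratings = toy_df.drop(columns="user_id")

# One row per movie; count(), sum() and mean() all skip NaN by default.
summary = pd.DataFrame({
    "n_ratings": ratings.count(),
    "rating_sum": ratings.sum(),
    "rating_mean": ratings.mean(),
})

most_rated = summary["n_ratings"].idxmax()
print(summary.sort_values("n_ratings", ascending=False))
print("Most rated:", most_rated)
```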

#### - What is the average rating for each movie? Define the top 5 movies with the maximum ratings.

In [11]:
# Average rating for each movie.
movie_df.drop('user_id', axis=1).mean()

Movie1      5.000000
Movie2      5.000000
Movie3      2.000000
Movie4      5.000000
Movie5      4.103448
              ...   
Movie202    4.333333
Movie203    3.000000
Movie204    4.375000
Movie205    4.628571
Movie206    4.923077
Length: 206, dtype: float64

In [12]:
# Top 5 movies by average rating.
movie_df.drop('user_id', axis=1).mean().sort_values(ascending = False)[:5]

Movie1      5.0
Movie55     5.0
Movie131    5.0
Movie132    5.0
Movie133    5.0
dtype: float64
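
Movie1 tops this list with a mean of 5.0, yet the earlier output shows it has a single rating, so a plain mean can be dominated by movies rated only once. One possible refinement, not part of the original analysis, is to require a minimum number of ratings before ranking. The threshold `min_ratings` and the frame `toy_df` below are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for movie_df.
toy_df = pd.DataFrame({
    "user_id": ["u1", "u2", "u3", "u4"],
    "Movie1": [5.0, np.nan, np.nan, np.nan],   # one rating, mean 5.0
    "Movie2": [4.0, 5.0, 4.0, np.nan],         # three ratings
    "Movie3": [np.nan, 3.0, 4.0, 5.0],         # three ratings, mean 4.0
})

ratings = toy_df.drop(columns="user_id")
min_ratings = 2  # illustrative threshold, not taken from the source

counts = ratings.count()
means = ratings.mean()

# Keep only movies with enough ratings, then rank by mean.
top_movies = means[counts >= min_ratings].sort_values(ascending=False).head(5)
print(top_movies)
```

With this threshold the single-rating Movie1 drops out and Movie2 ranks first in the toy data.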

#### - Define the top 5 movies with the least audience.

In [13]:
# Movies with the fewest ratings (smallest audience).
movie_df.describe().T['count'].sort_values(ascending=True)[:5]

Movie1      1.0
Movie71     1.0
Movie145    1.0
Movie69     1.0
Movie68     1.0
Name: count, dtype: float64

#### - Recommendation Model: Some of the movies have not been watched and are therefore not rated by the users. Amazon would like to take this as an opportunity to build a machine learning recommendation algorithm that predicts the ratings for each user.
   - Divide the data into training and test data
   - Build a recommendation model on training data
   - Make predictions on the test data

In [14]:
from surprise import Reader, Dataset

In [15]:
# Reshape from wide (one column per movie) to long (one row per user-movie pair).
movie_df_melt = movie_df.melt(id_vars=movie_df.columns[0], value_vars=movie_df.columns[1:], var_name="Movies", value_name="Rating")
movie_df_melt

| | user_id | Movies | Rating |
|---|---|---|---|
| 0 | A3R5OBKS7OM2IR | Movie1 | 5.0 |
| 1 | AH3QC2PC1VTGP | Movie1 | NaN |
| 2 | A3LKP6WPMP9UKX | Movie1 | NaN |
| 3 | AVIY68KEPQ5ZD | Movie1 | NaN |
| 4 | A1CV1WROP5KTTW | Movie1 | NaN |
| ... | ... | ... | ... |
| 998683 | A1IMQ9WMFYKWH5 | Movie206 | 5.0 |
| 998684 | A1KLIKPUF5E88I | Movie206 | 5.0 |
| 998685 | A5HG6WFZLO10D | Movie206 | 5.0 |
| 998686 | A3UU690TWXCG1X | Movie206 | 5.0 |
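
`melt` produces one row per user-movie pair (4848 × 206 = 998,688 rows here), and most of those rows carry no rating. A common alternative, shown only as a sketch and assuming unrated pairs should be treated as absent rather than filled, is to drop them after melting. The frame `toy_df` is hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for movie_df.
toy_df = pd.DataFrame({
    "user_id": ["u1", "u2", "u3"],
    "Movie1": [5.0, np.nan, np.nan],
    "Movie2": [np.nan, 4.0, 2.0],
})

# Wide to long: every user-movie pair becomes a row, rated or not.
long_all = toy_df.melt(id_vars="user_id", var_name="Movies", value_name="Rating")

# Keep only the pairs that were actually rated.
long_rated = long_all.dropna(subset=["Rating"]).reset_index(drop=True)

print(len(long_all), "rows after melt,", len(long_rated), "rows with a rating")
```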


In [16]:
movie_df.columns[0]

'user_id'

In [17]:
movie_df.columns[1:]

Index(['Movie1', 'Movie2', 'Movie3', 'Movie4', 'Movie5', 'Movie6', 'Movie7',
       'Movie8', 'Movie9', 'Movie10',
       ...
       'Movie197', 'Movie198', 'Movie199', 'Movie200', 'Movie201', 'Movie202',
       'Movie203', 'Movie204', 'Movie205', 'Movie206'],
      dtype='object', length=206)

In [18]:
movie_df[movie_df['user_id']=='A3LKP6WPMP9UKX']

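| | user_id | Movie1 | Movie2 | Movie3 | Movie4 | Movie5 | Movie6 | Movie7 | Movie8 | Movie9 | ... | Movie197 | Movie198 | Movie199 | Movie200 | Movie201 | Movie202 | Movie203 | Movie204 | Movie205 | Movie206 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2 | A3LKP6WPMP9UKX | NaN | NaN | NaN | 5.0 | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |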


In [19]:
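# Reader() defaults to rating_scale=(1, 5).
# NOTE: fillna(0) turns every unrated user-movie pair into an explicit rating of 0,
# which lies outside that scale and makes up the vast majority of the rows.
rd = Reader()
data = Dataset.load_from_df(movie_df_melt.fillna(0),reader=rd)
data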

<surprise.dataset.DatasetAutoFolds at 0x7fea727e3690>

In [20]:
data.df.head()

| | user_id | Movies | Rating |
|---|---|---|---|
| 0 | A3R5OBKS7OM2IR | Movie1 | 5.0 |
| 1 | AH3QC2PC1VTGP | Movie1 | 0.0 |
| 2 | A3LKP6WPMP9UKX | Movie1 | 0.0 |
| 3 | AVIY68KEPQ5ZD | Movie1 | 0.0 |
| 4 | A1CV1WROP5KTTW | Movie1 | 0.0 |


In [21]:
from surprise.model_selection import train_test_split

In [22]:
trainset, testset = train_test_split(data,test_size=0.25)
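
With `test_size=0.25`, Surprise holds out a quarter of the rating rows for testing. The same idea can be sketched in plain pandas to make the split visible. The frame `long_df` and the seed `random_state=42` below are assumptions for illustration, not the notebook's data or settings.

```python
import pandas as pd

# Hypothetical long-format ratings: one row per observed (user, movie) rating.
long_df = pd.DataFrame({
    "user_id": ["u1", "u1", "u2", "u2", "u3", "u3", "u4", "u4"],
    "Movies": ["Movie1", "Movie2", "Movie1", "Movie3",
               "Movie2", "Movie3", "Movie1", "Movie2"],
    "Rating": [5.0, 4.0, 3.0, 5.0, 2.0, 4.0, 5.0, 1.0],
})

# Hold out 25% of the rows at random; the rest is the training set.
test_df = long_df.sample(frac=0.25, random_state=42)
train_df = long_df.drop(test_df.index)

print(len(train_df), "training rows,", len(test_df), "test rows")
```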

In [23]:
from surprise import SVD, accuracy

In [24]:
# Using SVD (singular value decomposition), fitted on the training set
svd = SVD()
svd.fit(trainset)

<surprise.prediction_algorithms.matrix_factorization.SVD at 0x7fea735917d0>

In [25]:
pred = svd.test(testset)

In [26]:
accuracy.rmse(pred)

RMSE: 1.0265


1.026489752015934

In [27]:
accuracy.mae(pred)

MAE:  1.0123


1.0122623191481983

In [28]:
from surprise.model_selection import cross_validate

In [29]:
results = cross_validate(svd, data, measures = ['RMSE', 'MAE'], cv = 3, verbose = True)

Evaluating RMSE, MAE of algorithm SVD on 3 split(s).

                  Fold 1  Fold 2  Fold 3  Mean    Std     
RMSE (testset)    1.0262  1.0266  1.0261  1.0263  0.0002  
MAE (testset)     1.0121  1.0123  1.0120  1.0122  0.0001  
Fit time          38.97   39.32   38.96   39.09   0.17    
Test time         3.46    3.46    3.43    3.45    0.01    
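
The RMSE and MAE above both sit just over 1. One possible reading, offered as an assumption rather than a diagnosis: `Reader()` defaults to a 1 to 5 rating scale and Surprise clips estimates to that scale by default, so a cell filled with 0 can never be predicted closer than 1. The toy arithmetic below uses made-up numbers (995 zero-filled cells, 5 real ratings of 5.0, and a hypothetical model that predicts near 0 everywhere) to show how that floor alone yields errors close to 1.

```python
import numpy as np

# Hypothetical data: 995 unrated cells filled with 0, 5 genuine ratings of 5.0.
truth = np.concatenate([np.zeros(995), np.full(5, 5.0)])

# Hypothetical raw model output: close to 0 everywhere, because zeros dominate.
raw_pred = np.full(truth.shape, 0.02)

# Clip to the 1-5 scale, mirroring the default clipping of estimates.
clipped_pred = np.clip(raw_pred, 1.0, 5.0)

errors = clipped_pred - truth
rmse = float(np.sqrt(np.mean(errors ** 2)))
mae = float(np.mean(np.abs(errors)))

print(f"RMSE = {rmse:.4f}, MAE = {mae:.4f}")
```

In this toy setup both metrics land slightly above 1, driven almost entirely by the filled cells rather than by the real ratings.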


In [30]:
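def repeat(ml_type,dframe):
    """Cross-validate `ml_type` on `dframe`, then print one sample prediction."""
    rd = Reader()  # default rating_scale=(1, 5)
    data = Dataset.load_from_df(dframe,reader=rd)
    # verbose=True prints the per-fold table; print() then shows the raw results dict.
    print(cross_validate(ml_type, data, measures = ['RMSE', 'MAE'], cv = 3, verbose = True))
    print("--"*15)
    usr_id ='A3LKP6WPMP9UKX'
    mv = 'Movie1'
    r_u = 0.0
    # verbose=True already prints the prediction, so print() shows it a second time.
    print(ml_type.predict(usr_id,mv,r_ui = r_u,verbose=True))
    print("--"*15)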

In [36]:
# Fill missing ratings with the mean observed rating instead of 0.
repeat(SVD(),movie_df_melt.fillna(movie_df_melt['Rating'].mean()))

Evaluating RMSE, MAE of algorithm SVD on 3 split(s).

                  Fold 1  Fold 2  Fold 3  Mean    Std     
RMSE (testset)    0.0870  0.0899  0.0807  0.0859  0.0039  
MAE (testset)     0.0100  0.0100  0.0099  0.0100  0.0001  
Fit time          38.79   39.95   39.94   39.56   0.54    
Test time         3.51    2.60    3.50    3.20    0.43    
{'test_rmse': array([0.08701355, 0.08989929, 0.08066596]), 'test_mae': array([0.01002982, 0.0100134 , 0.00989834]), 'fit_time': (38.79262447357178, 39.9533166885376, 39.93920946121216), 'test_time': (3.510366439819336, 2.596755266189575, 3.498109817504883)}
------------------------------
user: A3LKP6WPMP9UKX item: Movie1     r_ui = 0.00   est = 4.39   {'was_impossible': False}
user: A3LKP6WPMP9UKX item: Movie1     r_ui = 0.00   est = 4.39   {'was_impossible': False}
------------------------------
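
The much lower RMSE here should be read with care: once missing cells are filled with the mean, most test rows hold that same constant, so even a constant predictor scores well on them. The sketch below, on made-up numbers, compares RMSE over all cells with RMSE over the genuinely observed ratings only. The arrays and the constant predictor are assumptions for illustration.

```python
import numpy as np

# Hypothetical genuine ratings and the mean used to fill the gaps.
observed = np.array([5.0, 1.0, 4.0, 5.0, 2.0])
fill_value = observed.mean()
n_filled = 995  # hypothetical number of mean-filled cells

truth = np.concatenate([observed, np.full(n_filled, fill_value)])

# A constant predictor that always returns the mean rating.
pred = np.full(truth.shape, fill_value)

rmse_all = float(np.sqrt(np.mean((pred - truth) ** 2)))
rmse_observed = float(np.sqrt(np.mean((pred[:observed.size] - observed) ** 2)))

print(f"RMSE over all cells:      {rmse_all:.4f}")
print(f"RMSE over observed cells: {rmse_observed:.4f}")
```

In the toy data the all-cell RMSE is far smaller than the observed-only RMSE, although the predictor has learned nothing about individual users.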


Thank you