# PG AI - Machine Learning
# Assessment Project: Building a user-based recommendation model for Amazon

DESCRIPTION

The dataset provided contains movie reviews given by Amazon customers. Reviews were given between May 1996 and July 2014.<br>

## Data Dictionary:

UserID – identifiers of the 4848 customers who provided ratings<br>
Movie 1 to Movie 206 – 206 movies for which ratings are provided by 4848 distinct users<br>

## Data Considerations
- Not every user has watched every movie, so not every movie is rated by every user. These missing values are represented by NA.<br>
- Ratings are on a scale of -1 to 10, where -1 is the lowest rating and 10 is the best.<br>
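
Both considerations can be checked directly on the data before any modelling. The sketch below uses a small hypothetical frame (`toy`, with made-up users and ratings) in the same wide layout. It is an illustration of the check, not output from the Amazon file.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data in the same wide layout:
# one row per user, one column per movie, NaN where a movie is unrated.
toy = pd.DataFrame({
    "user_id": ["u1", "u2", "u3"],
    "Movie1": [5.0, np.nan, np.nan],
    "Movie2": [4.0, 2.0, np.nan],
    "Movie3": [np.nan, np.nan, 3.0],
})

ratings = toy.drop(columns="user_id")

# Share of user/movie cells that carry no rating.
missing_share = ratings.isna().mean().mean()

# Smallest and largest rating actually present (NaN is skipped by default).
observed_min = ratings.min().min()
observed_max = ratings.max().max()

print(f"missing: {missing_share:.2%}, observed range: {observed_min} to {observed_max}")
```

Running the same three lines on `MovTvRatings` would show how sparse the matrix is and which part of the stated scale the ratings actually use.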



## Analysis Task
- Exploratory Data Analysis:

Q1 - Which movies have maximum views/ratings?<br>
Q2 - What is the average rating for each movie? Define the top 5 movies with the maximum ratings.<br>
Q3 - Define the top 5 movies with the least audience.<br>

Q4 - Recommendation Model: Some of the movies have not been watched and are therefore not rated by the users. Amazon would like to take this as an opportunity to build a machine learning recommendation algorithm that predicts the ratings for each user.

- Divide the data into training and test data<br>
- Build a recommendation model on training data<br>
- Make predictions on the test data<br>

By Edson Teixeira<br>
teixeiraedson252@gmail.com <br>
November 25th 2021

In [1]:
# import the required libraries
import pandas as pd
import numpy as np
import surprise
import re
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

In [2]:
# Load the dataset and store it in the MovTvRatings DataFrame:
MovTvRatings = pd.read_csv('Amazon - Movies and TV Ratings.csv')
MovTvRatings.head()

Unnamed: 0,user_id,Movie1,Movie2,Movie3,Movie4,Movie5,Movie6,Movie7,Movie8,Movie9,...,Movie197,Movie198,Movie199,Movie200,Movie201,Movie202,Movie203,Movie204,Movie205,Movie206
0,A3R5OBKS7OM2IR,5.0,5.0,,,,,,,,...,,,,,,,,,,
1,AH3QC2PC1VTGP,,,2.0,,,,,,,...,,,,,,,,,,
2,A3LKP6WPMP9UKX,,,,5.0,,,,,,...,,,,,,,,,,
3,AVIY68KEPQ5ZD,,,,5.0,,,,,,...,,,,,,,,,,
4,A1CV1WROP5KTTW,,,,,5.0,,,,,...,,,,,,,,,,


#### Q1 - Which movies have maximum views/ratings?

In [3]:
# Views: the movie with the largest number of (non-null) ratings
MovTvRatings.describe().T['count'].sort_values(ascending=False)[:1].to_frame()

Unnamed: 0,count
Movie127,2313.0


In [4]:
# Ratings: the movie with the largest sum of ratings
MovTvRatings.drop('user_id',axis=1).sum().sort_values(ascending=False)[:1].to_frame()

Unnamed: 0,0
Movie127,9511.0
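
The two cells above answer slightly different questions: the non-null count measures how many users rated a movie, while the sum also depends on how high those ratings are. In this dataset both point to Movie127, but they need not agree. A minimal sketch on hypothetical data (`MovieA` and `MovieB` are made-up names):

```python
import numpy as np
import pandas as pd

# Hypothetical ratings: MovieA is rated more often, MovieB has the larger total.
toy = pd.DataFrame({
    "user_id": ["u1", "u2", "u3", "u4"],
    "MovieA": [2.0, 2.0, 2.0, np.nan],     # 3 ratings, sum 6
    "MovieB": [5.0, 5.0, np.nan, np.nan],  # 2 ratings, sum 10
})

ratings = toy.drop(columns="user_id")

# count() ignores NaN, so it gives the number of users who rated each movie.
most_rated = ratings.count().idxmax()

# sum() also ignores NaN, but rewards high ratings as well as many ratings.
highest_total = ratings.sum().idxmax()

print(most_rated, highest_total)
```

`ratings.count()` is also a lighter alternative to `describe().T['count']`, since it skips the other summary statistics.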


In [5]:
# Q2 - What is the average rating for each movie? 
MovTvRatings.drop('user_id',axis=1).mean()

Movie1      5.000000
Movie2      5.000000
Movie3      2.000000
Movie4      5.000000
Movie5      4.103448
              ...   
Movie202    4.333333
Movie203    3.000000
Movie204    4.375000
Movie205    4.628571
Movie206    4.923077
Length: 206, dtype: float64

In [6]:
# Define the top 5 movies with the highest average rating.
MovTvRatings.drop('user_id',axis=1).mean().sort_values(ascending=False)[:5].to_frame()

Unnamed: 0,0
Movie1,5.0
Movie55,5.0
Movie131,5.0
Movie132,5.0
Movie133,5.0
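
A perfect 5.0 average can come from a single rating: Movie1 appears in this top 5 and also in the Q3 list of movies with only one rating. One common way to make the ranking more robust is to require a minimum number of ratings first. The sketch below assumes a hypothetical threshold `min_ratings` and made-up data; the right threshold for the Amazon file is a judgment call.

```python
import numpy as np
import pandas as pd

# Hypothetical ratings, one column per movie.
toy = pd.DataFrame({
    "MovieA": [5.0, np.nan, np.nan, np.nan],  # one rating, mean 5.0
    "MovieB": [5.0, 4.0, 5.0, 4.0],           # four ratings, mean 4.5
    "MovieC": [3.0, 4.0, np.nan, 2.0],        # three ratings, mean 3.0
})

min_ratings = 3  # hypothetical threshold, not taken from the assignment

# Put the average and the number of ratings side by side for each movie.
stats = pd.DataFrame({"mean": toy.mean(), "count": toy.count()})

# Keep only movies with enough ratings, then rank by average.
ranked = stats[stats["count"] >= min_ratings].sort_values("mean", ascending=False)
top_movie = ranked.index[0]

print(ranked)
```

Here `MovieA` drops out despite its 5.0 average, and `MovieB` leads the filtered ranking.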


In [7]:
# Q3 - Define the top 5 movies with the least audience.
MovTvRatings.describe().T['count'].sort_values(ascending=True)[:5].to_frame()

Unnamed: 0,count
Movie1,1.0
Movie71,1.0
Movie145,1.0
Movie69,1.0
Movie68,1.0


In [8]:
# Question 4: Recommendation Model:
from surprise import Reader
from surprise import accuracy
from surprise import Dataset
from surprise.model_selection import train_test_split
from surprise import SVD
from surprise.model_selection import cross_validate

In [9]:
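# Reshape from wide (one column per movie) to long (one row per user/movie pair),
# giving the three columns (user, item, rating) that Dataset.load_from_df expects.
mtr_melt = MovTvRatings.melt(id_vars = MovTvRatings.columns[0],value_vars=MovTvRatings.columns[1:],var_name="Movies",value_name="Rating")
mtr_melt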


Unnamed: 0,user_id,Movies,Rating
0,A3R5OBKS7OM2IR,Movie1,5.0
1,AH3QC2PC1VTGP,Movie1,
2,A3LKP6WPMP9UKX,Movie1,
3,AVIY68KEPQ5ZD,Movie1,
4,A1CV1WROP5KTTW,Movie1,
...,...,...,...
998683,A1IMQ9WMFYKWH5,Movie206,5.0
998684,A1KLIKPUF5E88I,Movie206,5.0
998685,A5HG6WFZLO10D,Movie206,5.0
998686,A3UU690TWXCG1X,Movie206,5.0
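
The reshape is easier to follow on a tiny example. The sketch below applies the same `melt` call to a hypothetical two-user, three-movie frame. Every user/movie pair becomes one row, including the pairs with no rating.

```python
import numpy as np
import pandas as pd

# Hypothetical wide frame: 2 users x 3 movies.
wide = pd.DataFrame({
    "user_id": ["u1", "u2"],
    "Movie1": [5.0, np.nan],
    "Movie2": [np.nan, 3.0],
    "Movie3": [4.0, np.nan],
})

# With value_vars omitted, melt unpivots every column except id_vars.
long = wide.melt(id_vars="user_id", var_name="Movies", value_name="Rating")

n_rows = len(long)                      # users x movies
n_rated = long["Rating"].notna().sum()  # pairs that carry a real rating

print(long)
```

The long frame has users × movies rows, so most of its rows are NaN whenever the original matrix is sparse.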


In [10]:
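# Reader() defaults to a 1-to-5 rating scale. Note that fillna(0) turns every unrated
# user/movie pair into an explicit rating of 0, which lies outside that scale.
rd = Reader()
data = Dataset.load_from_df(mtr_melt.fillna(0),reader=rd)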
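
Filling missing ratings with 0 tells the model that those users rated the movie 0, which is different from not having rated it. A commonly used alternative is to drop the unrated pairs and train only on observed ratings. The sketch below shows the difference on hypothetical data using pandas only. Under the assumption that the real ratings run from 1 to 5, the `observed` frame could then be passed to `Dataset.load_from_df` with `Reader(rating_scale=(1, 5))`.

```python
import numpy as np
import pandas as pd

# Hypothetical long-format ratings with two unrated pairs.
long = pd.DataFrame({
    "user_id": ["u1", "u2", "u1", "u2"],
    "Movies": ["Movie1", "Movie1", "Movie2", "Movie2"],
    "Rating": [5.0, np.nan, np.nan, 3.0],
})

# Option used in the notebook: unrated pairs become ratings of 0.
filled = long.fillna(0)

# Alternative: keep only the pairs that were actually rated.
observed = long.dropna(subset=["Rating"])

filled_mean = filled["Rating"].mean()
observed_mean = observed["Rating"].mean()

print(f"mean with zeros: {filled_mean}, mean of observed ratings: {observed_mean}")
```

The placeholder zeros pull the average rating from 4.0 down to 2.0 in this toy case, and the effect grows with the sparsity of the matrix.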

In [11]:
# Split the data: 75% for training, 25% for testing
trainset, testset = train_test_split(data,test_size=0.25)

In [12]:
# Using SVD (Singular Value Decomposition)
svd = SVD()
svd.fit(trainset)

<surprise.prediction_algorithms.matrix_factorization.SVD at 0x7f5497370c50>

In [13]:
# Evaluate on the held-out test set, then run 3-fold cross-validation on the full data
pred = svd.test(testset)
accuracy.rmse(pred)
cross_validate(svd, data, measures = ['RMSE', 'MAE'], cv = 3, verbose = True)

RMSE: 1.0258
Evaluating RMSE, MAE of algorithm SVD on 3 split(s).

                  Fold 1  Fold 2  Fold 3  Mean    Std     
RMSE (testset)    1.0264  1.0266  1.0255  1.0262  0.0005  
MAE (testset)     1.0122  1.0122  1.0117  1.0120  0.0003  
Fit time          36.51   36.78   36.66   36.65   0.11    
Test time         3.70    3.16    3.16    3.34    0.25    


{'test_rmse': array([1.02641929, 1.02663252, 1.02551171]),
 'test_mae': array([1.01222883, 1.01222991, 1.01168705]),
 'fit_time': (36.51420450210571, 36.78360652923584, 36.65638709068298),
 'test_time': (3.6984102725982666, 3.155616521835327, 3.1598305702209473)}
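
RMSE and MAE are almost equal here, and both sit just above 1. One possible reading, offered as a hypothesis rather than a verified diagnosis, is that the placeholder zeros dominate the test set: Surprise clips predictions into the reader's rating scale (1 to 5 by default), so a placeholder 0 can never be predicted with an error below 1. The sketch below reproduces that effect with NumPy on synthetic data. The 99% share of zeros and the distribution of raw estimates are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical test set: about 99% placeholder zeros, about 1% real ratings from 1 to 5.
n = 100_000
is_real = rng.random(n) < 0.01
truth = np.where(is_real, rng.integers(1, 6, size=n), 0).astype(float)

# Assumed model behaviour: raw estimates close to 0, then clipped into the 1-to-5 scale.
raw = rng.normal(loc=0.05, scale=0.1, size=n)
pred = np.clip(raw, 1.0, 5.0)

# Standard definitions of the two metrics.
errors = pred - truth
rmse = np.sqrt(np.mean(errors ** 2))
mae = np.mean(np.abs(errors))

print(f"RMSE: {rmse:.4f}  MAE: {mae:.4f}")
```

If this reading is right, the scores mostly measure the clipping floor rather than how well the model predicts real ratings, so an evaluation restricted to observed ratings would be more informative.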