# Building a user-based recommendation model for Amazon

Which movies have maximum views/ratings? 
What is the average rating for each movie? Define the top 5 movies with the maximum ratings.  
Define the top 5 movies with the least audience.  

Divide the data into training and test data  
Build a recommendation model on training data  
Make predictions on the test data  

In [1]:
# Import the libraries

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics.pairwise import pairwise_distances
from sklearn.metrics import mean_squared_error
from math import sqrt

In [2]:
#Read the data using pandas

dataset = pd.read_csv('Amazon - Movies and TV Ratings.csv')
dataset.head(5)

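|  | user_id | Movie1 | Movie2 | Movie3 | Movie4 | Movie5 | Movie6 | Movie7 | Movie8 | Movie9 | ... | Movie197 | Movie198 | Movie199 | Movie200 | Movie201 | Movie202 | Movie203 | Movie204 | Movie205 | Movie206 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | A3R5OBKS7OM2IR | 5.0 | 5.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 1 | AH3QC2PC1VTGP | NaN | NaN | 2.0 | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 2 | A3LKP6WPMP9UKX | NaN | NaN | NaN | 5.0 | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 3 | AVIY68KEPQ5ZD | NaN | NaN | NaN | 5.0 | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 4 | A1CV1WROP5KTTW | NaN | NaN | NaN | NaN | 5.0 | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |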


In [3]:
n_users = dataset.user_id.nunique()
n_movies = dataset.shape[1]-1

print('Num. of Users: ', n_users)
print('Num of Movies: ',n_movies)

Num. of Users:  4848
Num of Movies:  206


In [4]:
# user_id is an identifier, not a rating, so drop it and rely on the default integer index

dataset.drop(columns='user_id', inplace=True)

### 1. Which movies have maximum views/ratings?

In [5]:
most_rated = dataset.count().idxmax()  # column with the most non-NaN ratings
print(most_rated," has maximum views/ratings.\nTotal number of ratings are ",dataset[most_rated].notna().sum())

Movie127  has maximum views/ratings.
Total number of ratings are  2313
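
The lookup above relies on `count()` skipping NaN, so the count per column is the number of ratings per movie. A minimal sketch on a toy frame; the column names and values are made up for illustration and are not from the dataset:

```python
import numpy as np
import pandas as pd

# Toy ratings matrix (hypothetical values): rows are users, columns are movies.
toy = pd.DataFrame({
    "MovieA": [5.0, np.nan, 4.0],
    "MovieB": [np.nan, np.nan, 3.0],
    "MovieC": [2.0, 1.0, 5.0],
})

# count() ignores NaN, so this is the number of ratings each movie received.
ratings_per_movie = toy.count()

most_rated = ratings_per_movie.idxmax()           # label of the largest count
most_rated_count = int(ratings_per_movie.max())   # the count itself
print(most_rated, most_rated_count)
```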


### 2. What is the average rating for each movie? Define the top 5 movies with the maximum ratings.

##### Top 5 movies by number of ratings:

In [6]:
dataset.count().sort_values(ascending = False).head(5)

Movie127    2313
Movie140     578
Movie16      320
Movie103     272
Movie29      243
dtype: int64

##### Average rating of each movie:

In [7]:
dataset.mean()

Movie1      5.000000
Movie2      5.000000
Movie3      2.000000
Movie4      5.000000
Movie5      4.103448
              ...   
Movie202    4.333333
Movie203    3.000000
Movie204    4.375000
Movie205    4.628571
Movie206    4.923077
Length: 206, dtype: float64

##### Top 5 movies by average rating:

In [8]:
dataset.mean().sort_values(ascending = False).head(5)

Movie1      5.0
Movie55     5.0
Movie131    5.0
Movie132    5.0
Movie133    5.0
dtype: float64
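
A plain sort by mean can be led by movies with very few ratings; Movie1, for example, averages 5.0 from a single rating. One possible refinement, sketched here on made-up data, is to rank only movies with a minimum number of ratings. The threshold `min_ratings` is a hypothetical parameter, not something the notebook uses:

```python
import numpy as np
import pandas as pd

# Toy ratings (hypothetical values): MovieA has one perfect rating,
# MovieB has four strong ratings, MovieC has three mixed ratings.
toy = pd.DataFrame({
    "MovieA": [5.0, np.nan, np.nan, np.nan],
    "MovieB": [4.0, 5.0, 4.0, 5.0],
    "MovieC": [3.0, np.nan, 2.0, 4.0],
})

min_ratings = 3  # hypothetical threshold, chosen only for this example

counts = toy.count()  # ratings per movie
means = toy.mean()    # average rating per movie, NaN skipped

# Rank by mean, keeping only movies with enough ratings.
eligible = means[counts >= min_ratings].sort_values(ascending=False)

top_unfiltered = means.sort_values(ascending=False).index[0]
top_filtered = eligible.index[0]
print(top_unfiltered, top_filtered)
```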

### 3. Define the top 5 movies with the least audience.

In [9]:
dataset.count().sort_values().head(5)

Movie1      1
Movie71     1
Movie145    1
Movie69     1
Movie68     1
dtype: int64

In [10]:
# Replace NaN with 0, so an unrated movie is stored as a rating of 0 from here on

dataset.fillna(0,inplace=True)
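
After zero-filling, every later average treats an unrated movie as a rating of 0. A small sketch with one made-up user row shows how much this lowers a row mean:

```python
import numpy as np
import pandas as pd

# One hypothetical user who rated a single movie out of four.
row = pd.Series([5.0, np.nan, np.nan, np.nan])

mean_rated_only = row.mean()             # NaN skipped: average of the one rating
mean_zero_filled = row.fillna(0).mean()  # zeros included: 5.0 spread over four slots
print(mean_rated_only, mean_zero_filled)
```

With 206 movie columns and most users rating only a few of them, row means computed after `fillna(0)` will sit close to 0.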

### 4. Divide the data into training and test data

In [11]:
#splitting the data into training and test set

train_data, test_data = train_test_split(dataset, test_size=0.30,random_state=27)
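
`train_test_split` on the rows places each user entirely in the training set or entirely in the test set. A different approach sometimes used for collaborative filtering is to hold out individual ratings, so that both matrices keep the same users and shape. The sketch below assumes a zero-filled NumPy matrix; `holdout_split` is a hypothetical helper and is not what this notebook does:

```python
import numpy as np

def holdout_split(ratings, test_fraction=0.3, seed=27):
    """Move a random share of the nonzero ratings into a test matrix of the same shape."""
    rng = np.random.default_rng(seed)
    train = ratings.copy()
    test = np.zeros_like(ratings)

    rows, cols = ratings.nonzero()                  # positions of observed ratings
    n_test = int(round(test_fraction * len(rows)))  # how many ratings to hold out
    picked = rng.choice(len(rows), size=n_test, replace=False)

    test[rows[picked], cols[picked]] = ratings[rows[picked], cols[picked]]
    train[rows[picked], cols[picked]] = 0           # hide held-out ratings from training
    return train, test

# Toy matrix (hypothetical values): 3 users, 4 movies, 0 means "not rated".
toy = np.array([[5., 0., 3., 0.],
                [0., 4., 0., 2.],
                [1., 0., 0., 5.]])
train, test = holdout_split(toy)
```

Every rating ends up in exactly one of the two matrices, and row `k` refers to the same user in both.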

### 5. Build a recommendation model on training data

In [12]:
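# pairwise_distances with metric='cosine' returns cosine distance (1 - cosine similarity),
# so despite its name this matrix holds distances: 0 = identical rating vectors, 1 = no overlap.
user_similarity = pairwise_distances(train_data, metric='cosine')
user_similarity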

array([[0., 1., 1., ..., 1., 1., 1.],
       [1., 0., 0., ..., 1., 1., 0.],
       [1., 0., 0., ..., 1., 1., 0.],
       ...,
       [1., 1., 1., ..., 0., 0., 1.],
       [1., 1., 1., ..., 0., 0., 1.],
       [1., 0., 0., ..., 1., 1., 0.]])
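
The zeros on the diagonal reflect that `pairwise_distances` with `metric='cosine'` returns cosine distance, which is 1 minus cosine similarity. A minimal sketch on a made-up matrix of how the two relate, in case a similarity matrix is wanted as the weight:

```python
import numpy as np
from sklearn.metrics.pairwise import pairwise_distances, cosine_similarity

# Toy ratings (hypothetical values): users 0 and 1 are identical,
# user 2 shares no rated movie with them.
toy = np.array([[5., 0., 3.],
                [5., 0., 3.],
                [0., 4., 0.]])

dist = pairwise_distances(toy, metric="cosine")  # 0 = same direction, 1 = orthogonal
sim = 1.0 - dist                                 # convert distance to similarity
print(np.round(sim, 3))
```

In `sim`, identical users score 1 and users with no movie in common score 0, which is the opposite ordering of `dist`.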

### 6. Make predictions

In [13]:
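# Convert to NumPy first: pandas 2.x no longer supports Series[:, np.newaxis]
mean_user_rating = train_data.mean(axis=1).to_numpy()
ratings_diff = train_data.to_numpy() - mean_user_rating[:, np.newaxis]
# Each user's mean plus the weighted average of the other users' deviations from their means
user_pred = mean_user_rating[:, np.newaxis] + user_similarity.dot(ratings_diff) / np.abs(user_similarity).sum(axis=1)[:, np.newaxis]
user_pred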

array([[0.00249179, 0.00249179, 0.00249179, ..., 0.00819576, 0.03671561,
        0.01900328],
       [0.00069239, 0.00069239, 0.00069239, ..., 0.01135747, 0.06468284,
        0.03156498],
       [0.00069239, 0.00069239, 0.00069239, ..., 0.01135747, 0.06468284,
        0.03156498],
       ...,
       [0.00258251, 0.00258251, 0.00258251, ..., 0.00897665, 0.04094735,
        0.02109186],
       [0.00258251, 0.00258251, 0.00258251, ..., 0.00897665, 0.04094735,
        0.02109186],
       [0.00069239, 0.00069239, 0.00069239, ..., 0.01135747, 0.06468284,
        0.03156498]])
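
Read as a formula, the prediction cell computes the following for training user $u$ and movie $i$. The notation is introduced here for illustration: $d_{u,v}$ is the entry of the matrix that the code names `user_similarity`, $r_{v,i}$ is the zero-filled rating of user $v$ for movie $i$, and $\bar{r}_u$ is the mean of user $u$'s row over all movie columns. The sums run over all training users $v$.

```latex
\hat{r}_{u,i} = \bar{r}_u + \frac{\sum_{v} d_{u,v}\,\bigl(r_{v,i} - \bar{r}_v\bigr)}{\sum_{v} \lvert d_{u,v} \rvert}
```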

In [14]:
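# RMSE over the entries where the test matrix holds a rating (nonzero).
# Caveat: user_pred has one row per training user and test_data one row per test user,
# so the positions taken from test_data index rows of a different set of users.
def rmse(prediction, original):
    rated = original.nonzero()
    prediction = prediction[rated].flatten()
    original = original[rated].flatten()
    return sqrt(mean_squared_error(original, prediction))
print('RMSE for User-based Collaborative Filtering is ' + str(rmse(user_pred, np.array(test_data))))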

RMSE for User-based Collaborative Filtering is 3.9929613991275907
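
An RMSE restricted to rated entries is only meaningful when the prediction and test matrices have the same shape and row `k` refers to the same user in both, which a row-wise split does not provide. A minimal NumPy sketch under that assumption, with made-up values; `masked_rmse` is a hypothetical helper name:

```python
import numpy as np

def masked_rmse(prediction, actual):
    """RMSE over the entries where `actual` holds a rating (nonzero)."""
    if prediction.shape != actual.shape:
        raise ValueError("prediction and actual must have the same shape")
    mask = actual != 0                        # only score observed ratings
    errors = prediction[mask] - actual[mask]
    return float(np.sqrt(np.mean(errors ** 2)))

# Toy example (hypothetical values): two held-out ratings, 5 and 3.
actual = np.array([[5., 0.],
                   [0., 3.]])
prediction = np.array([[4., 2.],
                       [1., 5.]])

score = masked_rmse(prediction, actual)  # errors are -1 and 2
print(score)
```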
