# EDSA Movie Recommendation Challenge

# Introduction

## Context

In today’s technology-driven world, recommender systems are socially and economically critical for helping individuals make appropriate choices about the content they engage with daily. Movie recommendation is one application where this is especially true: intelligent algorithms can help viewers find great titles from tens of thousands of options.

## Problem Statement
With this context, EDSA is challenging you to construct a recommendation algorithm based on content or collaborative filtering, capable of accurately predicting how a user will rate a movie they have not yet viewed based on their historical preferences.

Providing an accurate and robust solution to this challenge has immense economic potential, with users of the system being exposed to content they would like to view or purchase - generating revenue and platform affinity.

## Evaluation
The evaluation metric for this competition is Root Mean Square Error (RMSE). RMSE is commonly used in regression analysis and forecasting. It is the square root of the mean squared residual between predicted and observed values, which equals the standard deviation of the residuals when their mean is zero.
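
As a concrete illustration of the metric, here is a minimal, dependency-free sketch. The function name `rmse` and the sample ratings are illustrative assumptions, not part of the competition code or the dataset.

```python
import math

def rmse(actual, predicted):
    """Root mean square error between two equal-length sequences."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    if len(actual) == 0:
        raise ValueError("at least one observation is required")
    # Square each residual, average, then take the square root
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Illustrative ratings only (not taken from the dataset)
actual = [4.0, 3.0, 5.0, 2.0]
predicted = [3.5, 3.0, 4.0, 3.0]
print(rmse(actual, predicted))  # 0.75
```

Because the errors are squared before averaging, a single prediction that is off by one full star contributes four times as much as one that is off by half a star.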

# Data Exploration

## Installing and Importing Packages
An important module we need to install and import here is turicreate. The following description is taken directly from the module's documentation page.

Turi Create simplifies the development of custom machine learning models. You don’t have to be a machine learning expert to add recommendations, object detection, image classification, image similarity or activity classification to your app.

1. Easy-to-use: Focus on tasks instead of algorithms
2. Visual: Built-in, streaming visualizations to explore your data
3. Flexible: Supports text, images, audio, video and sensor data
4. Fast and Scalable: Work with large datasets on a single machine
5. Ready To Deploy: Export models to Core ML for use in iOS, macOS, watchOS, and tvOS apps

In [None]:
!pip install turicreate

In [5]:
import pandas as pd
import numpy as np
import turicreate

## Loading Data Files

In [6]:
path = '/content/drive/My Drive/Projects/recommendation_engine/data/train.csv'
# verbose defaults to False, so the argument is omitted
ratings_train = pd.read_csv(path)

In [7]:
ratings_train.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,5163,57669,4.0,1518349992
1,106343,5,4.5,1206238739
2,146790,5459,5.0,1076215539
3,106362,32296,2.0,1423042565
4,9041,366,3.0,833375837


In [8]:
ratings_train.pop('timestamp')

0           1518349992
1           1206238739
2           1076215539
3           1423042565
4            833375837
               ...    
10000033    1521235092
10000034    1002580977
10000035    1227674807
10000036    1479921530
10000037     858984862
Name: timestamp, Length: 10000038, dtype: int64

In [9]:
n_users = ratings_train['userId'].nunique()
n_items = ratings_train['movieId'].nunique()

print(f'Number of Users: {n_users}')
print(f'Number of Movies: {n_items}')

Number of Users: 162541
Number of Movies: 48213
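
These counts also indicate how sparse the ratings are. The sketch below reuses the figures printed above (10,000,038 rows, 162,541 users, 48,213 movies) to estimate what share of all possible user-movie pairs has a rating. It assumes each row is a distinct user-movie pair, which this notebook does not verify; the variable names are illustrative.

```python
# Figures copied from the notebook output above
n_ratings = 10_000_038  # rows in train.csv
n_users = 162_541
n_items = 48_213

# Every user could, in principle, rate every movie
possible_pairs = n_users * n_items
density = n_ratings / possible_pairs

print(f"Possible user-movie pairs: {possible_pairs:,}")
print(f"Share of pairs with a rating: {density:.4%}")
```

Under that assumption, only a small fraction of a percent of the user-movie matrix is observed, which is typical for rating data and is why the models here predict the missing entries rather than look them up.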


# Data Preprocessing

In [10]:
from sklearn.model_selection import train_test_split
train_data, test_data = train_test_split(ratings_train, test_size = 0.25)

In [11]:
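# Build SFrames directly from the pandas DataFrames;
# load_sframe is meant for SFrames previously saved to disk
train_data = turicreate.SFrame(train_data)
test_data = turicreate.SFrame(test_data)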

In [12]:
train_data

userId,movieId,rating
109731,32875,4.0
36863,208,3.0
1706,2762,5.0
101655,76077,3.5
78772,1307,4.0
51179,65130,3.5
156975,27032,4.0
92857,110,4.0
68514,33834,2.5
30048,53996,3.5


# Base Model (Popularity Recommendation Engine)
Here we implement a simple model that makes recommendations using item popularity. When a target is provided, popularity is computed using the item's mean target value. When the target column contains ratings, as in our case, the model computes the mean rating for each item and uses this to rank items for recommendations.
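
To make the idea concrete, the following is a minimal sketch of mean-rating popularity in plain Python. It is not Turi Create's implementation: the helper names `fit_item_means` and `predict`, the toy data, and the fallback to the global mean for unseen items are all assumptions made for illustration.

```python
from collections import defaultdict

def fit_item_means(ratings):
    """Compute each item's mean rating from (user_id, item_id, rating) tuples."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    grand_total, n = 0.0, 0
    for _user, item, rating in ratings:
        totals[item] += rating
        counts[item] += 1
        grand_total += rating
        n += 1
    if n == 0:
        raise ValueError("at least one rating is required")
    item_means = {item: totals[item] / counts[item] for item in totals}
    return item_means, grand_total / n

def predict(item_means, global_mean, item):
    # Assumed fallback: an item with no training ratings gets the global mean
    return item_means.get(item, global_mean)

# Toy data, not from the dataset
toy = [(1, "A", 4.0), (2, "A", 5.0), (1, "B", 2.0), (3, "B", 3.0), (3, "C", 4.0)]
item_means, global_mean = fit_item_means(toy)

# Rank items by mean rating, highest first
ranked = sorted(item_means, key=item_means.get, reverse=True)
print(ranked)                                 # ['A', 'C', 'B']
print(predict(item_means, global_mean, "Z"))  # 3.6
```

Note that the prediction for a movie is the same for every user, so this kind of model cannot personalise its recommendations.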

## Implementation

In [13]:
popularity_model = turicreate.popularity_recommender.create(train_data, user_id='userId', item_id='movieId', target='rating')

## Model Evaluation

In [14]:
popularity_model.evaluate(test_data, metric='rmse', target='rating')


Overall RMSE: 0.9625065687676638

Per User RMSE (best)
+--------+------+-------+
| userId | rmse | count |
+--------+------+-------+
| 81581  | 0.0  |   1   |
+--------+------+-------+
[1 rows x 3 columns]


Per User RMSE (worst)
+--------+-------------------+-------+
| userId |        rmse       | count |
+--------+-------------------+-------+
| 32608  | 3.721537120079721 |   1   |
+--------+-------------------+-------+
[1 rows x 3 columns]


Per Item RMSE (best)
+---------+------+-------+
| movieId | rmse | count |
+---------+------+-------+
|  190219 | 0.0  |   1   |
+---------+------+-------+
[1 rows x 3 columns]


Per Item RMSE (worst)
+---------+------+-------+
| movieId | rmse | count |
+---------+------+-------+
|  200000 | 4.5  |   1   |
+---------+------+-------+
[1 rows x 3 columns]



{'rmse_by_item': Columns:
 	movieId	int
 	rmse	float
 	count	int
 
 Rows: 31741
 
 Data:
 +---------+--------------------+-------+
 | movieId |        rmse        | count |
 +---------+--------------------+-------+
 |  26613  | 1.0801234497346435 |   3   |
 |   4441  | 0.9756793774363807 |   35  |
 |  128606 | 0.8688550639541239 |   17  |
 |   7899  | 0.6547145545649521 |   15  |
 |  182611 | 3.0334651417301375 |   1   |
 |  74426  | 1.3550289356452232 |   6   |
 |  144280 | 0.2999999999999998 |   1   |
 |  162160 | 0.7307623453915623 |   6   |
 |  138402 |        2.75        |   1   |
 |  101076 | 0.9806644814063397 |  112  |
 +---------+--------------------+-------+
 [31741 rows x 3 columns]
 Note: Only the head of the SFrame is printed.
 You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.,
 'rmse_by_user': Columns:
 	userId	int
 	rmse	float
 	count	int
 
 Rows: 159435
 
 Data:
 +--------+---------------------+-------+
 | userId |         rmse        | count |
 [output truncated]

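The overall RMSE of this popularity model on the held-out 25% of the ratings is about 0.96, a reasonable baseline given that the model predicts from each movie's mean rating alone.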
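
In the evaluation output, every best and worst per-user and per-item entry rests on a single rating (`count` is 1), so those extremes say little about the model as a whole. The sketch below shows one way such grouped figures could be computed; the helper `rmse_by_group` and the toy rows are illustrative assumptions, not Turi Create's code.

```python
import math
from collections import defaultdict

def rmse_by_group(rows):
    """rows: iterable of (group_id, actual, predicted).

    Returns {group_id: (rmse, count)}.
    """
    squared = defaultdict(float)
    counts = defaultdict(int)
    for group, actual, predicted in rows:
        squared[group] += (actual - predicted) ** 2
        counts[group] += 1
    return {g: (math.sqrt(squared[g] / counts[g]), counts[g]) for g in squared}

# Toy rows, not from the dataset: (user, actual rating, predicted rating)
rows = [
    ("u1", 4.0, 4.0),  # one rating, predicted exactly
    ("u2", 0.5, 4.0),  # one rating, badly missed
    ("u3", 3.0, 3.5),
    ("u3", 4.0, 3.5),
    ("u3", 5.0, 4.0),
]
for user, (score, count) in rmse_by_group(rows).items():
    print(user, round(score, 4), count)
```

In this toy example the users with a single rating land at the two extremes, while the user with three ratings sits in between, mirroring the pattern in the output above.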