# EDSA Movie Recommendation Challenge

# Introduction

## Context

In today’s technology-driven world, recommender systems are socially and economically critical for helping individuals make appropriate choices about the content they engage with every day. Movie recommendation is one application where this is especially true: intelligent algorithms can help viewers find great titles among tens of thousands of options.

## Problem Statement
With this context, EDSA is challenging you to construct a recommendation algorithm based on content or collaborative filtering, capable of accurately predicting how a user will rate a movie they have not yet viewed based on their historical preferences.

Providing an accurate and robust solution to this challenge has immense economic potential, with users of the system being exposed to content they would like to view or purchase - generating revenue and platform affinity.

## Evaluation
The evaluation metric for this competition is Root Mean Square Error (RMSE). RMSE is commonly used in regression analysis and forecasting. It is the square root of the mean squared residual between predicted and observed values, which equals the standard deviation of the residuals when they average to zero. Lower values indicate more accurate predictions.
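As a minimal sketch of how the metric is computed, the function below takes two equal-length lists of ratings and returns their RMSE. The function name `rmse` and the sample ratings are hypothetical and chosen only for illustration; the competition's own scoring code is not shown in this notebook.

```python
import math

def rmse(actual, predicted):
    """Root mean square error between two equal-length sequences of ratings."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    # Square each residual, average the squares, then take the square root.
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Hypothetical ratings, for illustration only.
actual = [4.0, 3.0, 5.0, 2.0]
predicted = [3.5, 3.0, 4.0, 3.0]
print(rmse(actual, predicted))  # 0.75
```

Because the residuals are squared before averaging, a single prediction that is off by a full star contributes four times as much as one that is off by half a star.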

# Data Exploration

## Installing and Importing packages
An important module we'll need to install and import here is turicreate. The following description is taken directly from the module's documentation page.

Turi Create simplifies the development of custom machine learning models. You don’t have to be a machine learning expert to add recommendations, object detection, image classification, image similarity or activity classification to your app.

1. Easy-to-use: Focus on tasks instead of algorithms
2. Visual: Built-in, streaming visualizations to explore your data
3. Flexible: Supports text, images, audio, video and sensor data
4. Fast and Scalable: Work with large datasets on a single machine
5. Ready To Deploy: Export models to Core ML for use in iOS, macOS, watchOS, and tvOS apps

In [None]:
!pip install turicreate

In [3]:
import pandas as pd
import numpy as np
import re

import turicreate

## Loading data files

In [4]:
path = '/content/drive/My Drive/Projects/recommendation_engine/data/train.csv'
ratings_train = pd.read_csv(path, verbose = False)

In [5]:
ratings_train.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,5163,57669,4.0,1518349992
1,106343,5,4.5,1206238739
2,146790,5459,5.0,1076215539
3,106362,32296,2.0,1423042565
4,9041,366,3.0,833375837


In [6]:
ratings_train.pop('timestamp')

0           1518349992
1           1206238739
2           1076215539
3           1423042565
4            833375837
               ...    
10000033    1521235092
10000034    1002580977
10000035    1227674807
10000036    1479921530
10000037     858984862
Name: timestamp, Length: 10000038, dtype: int64

In [7]:
n_users = ratings_train['userId'].nunique()
n_items = ratings_train['movieId'].nunique()

print('Number of Users: '+ str(n_users))
print('Number of Movies: '+str(n_items))

Number of Users: 162541
Number of Movies: 48213


# Data preprocessing

In [8]:
from sklearn.model_selection import train_test_split
train_data, test_data = train_test_split(ratings_train, test_size = 0.25, random_state = 42)

In [9]:
# Convert the pandas DataFrames to SFrames, the tabular type Turi Create models expect
train_data = turicreate.SFrame(train_data)
test_data = turicreate.SFrame(test_data)

In [10]:
train_data

userId,movieId,rating
125167,2455,4.0
40255,5952,4.5
51016,4310,2.5
38009,235,4.0
5487,1120,4.5
151386,8641,5.0
136136,3489,2.5
57947,2384,3.0
61869,4975,3.0
143757,4006,2.0


# Base Model (Popularity Recommendation Engine)
Here we implement a simple model that makes recommendations using item popularity. When a target is provided, popularity is computed using the item's mean target value. When the target column contains ratings, as in our case, the model computes the mean rating for each item and uses this to rank items for recommendations.
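The idea described above can be sketched in plain pandas. This is not Turi Create's implementation, only an illustration of ranking by mean rating; the toy `ratings` table and the names `item_means` and `predicted` are assumptions made for this example.

```python
import pandas as pd

# Hypothetical toy ratings, for illustration only.
ratings = pd.DataFrame({
    'userId':  [1, 1, 2, 2, 3],
    'movieId': [10, 20, 10, 30, 20],
    'rating':  [4.0, 2.0, 5.0, 3.0, 3.0],
})

# Mean rating per movie, sorted from highest to lowest.
item_means = (
    ratings.groupby('movieId')['rating']
    .mean()
    .sort_values(ascending=False)
)
print(item_means)

# The predicted rating for any (user, movie) pair is simply that movie's mean.
predicted = ratings['movieId'].map(item_means)
```

One consequence of ranking by the mean alone is that a movie with a single 5.0 rating outranks a widely watched movie that averages 4.5, so the top of the list can be dominated by rarely rated titles.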

## Implementation

In [None]:
popularity_model = turicreate.popularity_recommender.create(train_data, user_id='userId', item_id='movieId', target='rating')

## Model Evaluation

In [None]:
popularity_model.evaluate(test_data, metric ='rmse', target = 'rating')


Overall RMSE: 0.9621607059113966

Per User RMSE (best)
+--------+------+-------+
| userId | rmse | count |
+--------+------+-------+
| 145938 | 0.0  |   1   |
+--------+------+-------+
[1 rows x 3 columns]


Per User RMSE (worst)
+--------+--------------------+-------+
| userId |        rmse        | count |
+--------+--------------------+-------+
| 55933  | 3.9190066547638374 |   1   |
+--------+--------------------+-------+
[1 rows x 3 columns]


Per Item RMSE (best)
+---------+------+-------+
| movieId | rmse | count |
+---------+------+-------+
|  182305 | 0.0  |   1   |
+---------+------+-------+
[1 rows x 3 columns]


Per Item RMSE (worst)
+---------+------+-------+
| movieId | rmse | count |
+---------+------+-------+
|  169720 | 4.5  |   1   |
+---------+------+-------+
[1 rows x 3 columns]



{'rmse_by_item': Columns:
 	movieId	int
 	rmse	float
 	count	int
 
 Rows: 31819
 
 Data:
 +---------+--------------------+-------+
 | movieId |        rmse        | count |
 +---------+--------------------+-------+
 |  26613  | 1.298075498574717  |   2   |
 |   4441  | 0.7769045817644962 |   30  |
 |  159407 |        0.5         |   1   |
 |  128606 | 1.0611502524230925 |   27  |
 |  85892  | 1.6666666666666665 |   1   |
 |   7899  | 0.8393563293494889 |   21  |
 |  133925 |        0.0         |   1   |
 |  74426  | 1.0975083002271124 |   6   |
 |  144280 |       0.375        |   2   |
 |  162160 | 0.6442352540027595 |   4   |
 +---------+--------------------+-------+
 [31819 rows x 3 columns]
 Note: Only the head of the SFrame is printed.
 You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.,
 'rmse_by_user': Columns:
 	userId	int
 	rmse	float
 	count	int
 
 Rows: 159376
 
 Data:
 +--------+---------------------+-------+
 | userId |         rmse        | c

The overall RMSE from this simple model is 0.96, which gives us a baseline to beat with models that take individual user preferences into account.

## Making Movie Recommendations

In [None]:
# recommend 5 movies for a single user
user = 133
popularity_model.recommend(users = [user], k = 5)

userId,movieId,score,rank
133,125339,5.0,1
133,153296,5.0,2
133,171777,5.0,3
133,147037,5.0,4
133,185677,5.0,5


For example, for the user with userId 133, the five best recommendations are the movies shown in the table above. Note that these are the same for the user with userId 3567. This is because the popularity model does not take individual preferences into account: every user is simply recommended the same top-ranked movies.

In [11]:
imdb_data = pd.read_csv('/content/drive/My Drive/Projects/recommendation_engine/data/imdb_data.csv', usecols = ['movieId', 'title_cast', 'director', 'plot_keywords'])
imdb_data.head()

Unnamed: 0,movieId,title_cast,director,plot_keywords
0,1,Tom Hanks|Tim Allen|Don Rickles|Jim Varney|Wal...,John Lasseter,toy|rivalry|cowboy|cgi animation
1,2,Robin Williams|Jonathan Hyde|Kirsten Dunst|Bra...,Jonathan Hensleigh,board game|adventurer|fight|game
2,3,Walter Matthau|Jack Lemmon|Sophia Loren|Ann-Ma...,Mark Steven Johnson,boat|lake|neighbor|rivalry
3,4,Whitney Houston|Angela Bassett|Loretta Devine|...,Terry McMillan,black american|husband wife relationship|betra...
4,5,Steve Martin|Diane Keaton|Martin Short|Kimberl...,Albert Hackett,fatherhood|doberman|dog|mansion


In [12]:
movies_data = pd.read_csv('/content/drive/My Drive/Projects/recommendation_engine/data/movies.csv')
movies_data.head()

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


In [13]:
imdb_data['movieId'].nunique(), movies_data['movieId'].nunique()

(27278, 62423)

In [14]:
def mergeData(main_df, other_df, on = ['movieId']):
    # Outer join keeps movies that appear in only one of the two tables.
    data = pd.merge(
        left = main_df,
        right = other_df,
        on = on,
        how = 'outer'
    )
    return data

In [15]:
item_data = mergeData(main_df = imdb_data, other_df = movies_data)
item_data.head(3)

Unnamed: 0,movieId,title_cast,director,plot_keywords,title,genres
0,1,Tom Hanks|Tim Allen|Don Rickles|Jim Varney|Wal...,John Lasseter,toy|rivalry|cowboy|cgi animation,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Robin Williams|Jonathan Hyde|Kirsten Dunst|Bra...,Jonathan Hensleigh,board game|adventurer|fight|game,Jumanji (1995),Adventure|Children|Fantasy
2,3,Walter Matthau|Jack Lemmon|Sophia Loren|Ann-Ma...,Mark Steven Johnson,boat|lake|neighbor|rivalry,Grumpier Old Men (1995),Comedy|Romance


In [16]:
item_data.isna().sum() / item_data.shape[0]

movieId          0.000000
title_cast       0.734557
director         0.731565
plot_keywords    0.750135
title            0.037202
genres           0.037202
dtype: float64

In [17]:
item_data.fillna('Unknown', inplace = True)

In [18]:
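# Build one text document per movie. Spaces inside a cast name, keyword or genre
# become '_' so each stays a single token. Director names are only split on '|'.
def tokens(field):
    return field.replace(' ', '_').replace('|', ' ')

text_corpus = [
    ' '.join([tokens(row.title_cast), row.director.replace('|', ' '),
              tokens(row.plot_keywords), tokens(row.genres)])
    for row in item_data.itertuples()
]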

In [19]:
item_data['text'] = pd.Series(text_corpus)
item_data.head(3)

Unnamed: 0,movieId,title_cast,director,plot_keywords,title,genres,text
0,1,Tom Hanks|Tim Allen|Don Rickles|Jim Varney|Wal...,John Lasseter,toy|rivalry|cowboy|cgi animation,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,Tom_Hanks Tim_Allen Don_Rickles Jim_Varney Wal...
1,2,Robin Williams|Jonathan Hyde|Kirsten Dunst|Bra...,Jonathan Hensleigh,board game|adventurer|fight|game,Jumanji (1995),Adventure|Children|Fantasy,Robin_Williams Jonathan_Hyde Kirsten_Dunst Bra...
2,3,Walter Matthau|Jack Lemmon|Sophia Loren|Ann-Ma...,Mark Steven Johnson,boat|lake|neighbor|rivalry,Grumpier Old Men (1995),Comedy|Romance,Walter_Matthau Jack_Lemmon Sophia_Loren Ann-Ma...


In [20]:
del text_corpus, imdb_data, movies_data

In [21]:
item_data.drop(['title_cast', 'director', 'plot_keywords', 'title', 'genres'], axis = 1, inplace = True)

In [22]:
item_data['movieId'].nunique()

64835

In [23]:
item_data = pd.merge(
    left = ratings_train['movieId'],
    right = item_data,
    on = 'movieId',
    how = 'left'
).drop_duplicates()

In [24]:
item_data['movieId'].nunique()

48213

In [45]:
item_data = item_data.reset_index(drop = True)

In [25]:
from sklearn.feature_extraction.text import CountVectorizer

vec = CountVectorizer()
text_matrix = vec.fit_transform(item_data['text'])
text_matrix

<48213x132363 sparse matrix of type '<class 'numpy.int64'>'
	with 427270 stored elements in Compressed Sparse Row format>
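To make the tokenisation step concrete, here is a small sketch on a hypothetical two-movie corpus shaped like `item_data['text']`. The names `docs`, `vec_demo` and `counts` are made up for this example. It relies on `CountVectorizer`'s default settings, which lowercase the text and treat underscores as word characters.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical two-movie corpus in the same shape as item_data['text'].
docs = [
    'Tom_Hanks Tim_Allen toy rivalry Adventure Animation',
    'Robin_Williams board_game Adventure Children',
]

vec_demo = CountVectorizer()
counts = vec_demo.fit_transform(docs)

# Underscores count as word characters, so 'Tom_Hanks' stays one token;
# the default settings also lowercase every token.
print(sorted(vec_demo.vocabulary_))
print(counts.shape)  # (number of documents, vocabulary size)
```

This is why the corpus replaces spaces with underscores before vectorising: without it, "Tom Hanks" would be counted as two unrelated words, and two movies sharing only an actor's first name would look similar.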

In [26]:
from sklearn.neighbors import NearestNeighbors

In [27]:
number_of_neighbors = 64
knn = NearestNeighbors(n_neighbors = number_of_neighbors)
knn.fit(text_matrix)

NearestNeighbors(algorithm='auto', leaf_size=30, metric='minkowski',
                 metric_params=None, n_jobs=None, n_neighbors=64, p=2,
                 radius=1.0)

In [28]:
similarity = knn.kneighbors(text_matrix, n_neighbors = number_of_neighbors + 1)
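The call above asks for one more neighbour than needed because every query row is also in the fitted data. The sketch below shows this on a few hypothetical 2-D points (`points`, `k` and `knn_demo` are illustrative names, not part of the notebook): `kneighbors` returns a `(distances, indices)` pair, and the first column is each point matching itself.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Hypothetical 2-D points standing in for rows of text_matrix.
points = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 3.0], [5.0, 5.0]])

k = 2
knn_demo = NearestNeighbors(n_neighbors=k).fit(points)

# Ask for k + 1 neighbours: each query point is also in the fitted data,
# so its closest match is itself at distance 0.
distances, indices = knn_demo.kneighbors(points, n_neighbors=k + 1)

print(indices[:, 0])     # row positions of the self-matches
print(distances[:, 1:])  # distances to the k real neighbours
```

Note that the self-match is only guaranteed to come first when rows are distinct. If several movies have identical text (for example, rows filled mostly with 'Unknown'), they are all at distance 0 from one another, and column 0 may hold a duplicate rather than the movie itself.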

In [29]:
neighbors = similarity[1]
sim_data = pd.DataFrame(neighbors, index = item_data['movieId'])
sim_data.drop(0, axis = 1, inplace = True)
sim_data.head(5)

movieId,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64
57669,33621,24072,23684,35656,34156,12416,21453,21611,16053,5865,19844,31683,20025,36684,33862,8964,16080,41330,2880,25133,22988,11297,28691,18168,11397,10396,43878,14120,43016,16885,45211,11706,37682,29445,17067,23373,47756,36424,17419,43658,41934,32115,35910,45180,35947,37842,42705,21883,17187,14773,20210,7455,12459,41248,21246,16993,8308,36064,25461,33292,13387,37618,47802,16110
5,2364,21453,33621,24072,23684,21611,5865,34156,19844,16053,8964,16080,25133,28691,22988,11397,43878,10396,11297,18168,20025,41330,2880,24815,47756,21246,32162,42544,20210,39191,29445,37842,3501,25523,32115,14773,45211,33292,16110,8308,37682,12416,26209,30748,43016,45613,16993,17067,45082,47559,23373,37781,14120,35910,36424,42705,27780,21883,33017,17419,16885,43658,17187,5649
5459,24072,33621,34156,26010,5865,21453,16053,23684,19844,21611,20025,16080,10396,25133,8964,18168,2880,28691,11297,11397,41330,22988,43878,14120,4378,47756,25523,10840,26209,17187,17067,47559,33017,32115,27,23373,32822,45082,45613,29886,11975,42544,23488,6837,41934,12416,47802,37682,14445,36424,43658,16993,7455,27787,26438,20210,32162,27780,30272,29445,33292,32068,14773,37842
32296,24072,21611,19844,34156,16053,33621,21453,5865,23684,2880,11397,22988,41330,36684,10396,18168,8981,11297,43878,20025,16080,28691,25133,4259,8964,21932,45613,32822,17067,30748,12416,39191,37842,35656,17187,37682,33292,27780,33017,16110,45211,20210,32162,21246,14773,23488,5649,7455,16993,47559,3501,36424,42544,43016,47756,35910,47802,23373,14120,45082,29445,26209,11706,43658
366,30860,16827,31683,28691,47756,32213,20662,29203,8787,16885,561,23343,29709,32400,21246,34610,46481,32162,44752,11295,47802,32115,16345,19844,32077,47559,25729,12416,33621,28555,25523,20495,23373,17067,4508,45082,20134,4093,42544,4119,39191,24072,11779,14786,29926,21453,5865,16110,27780,37187,36424,23488,30912,35910,17187,37682,35656,42705,34156,18434,7455,39560,43658,24914


In [None]:
# The distance table is the same for every movie, so build it once outside the loop.
distances = similarity[0]
dist_data = pd.DataFrame(distances, index = item_data['movieId'])
dist_data.drop(0, axis = 1, inplace = True)

ranking = range(1, number_of_neighbors + 1)
frames = []

for movie_id in item_data['movieId']:
    # Row positions of the neighbours, mapped back to their movieIds.
    most_similar_indices = sim_data.loc[movie_id]
    most_similar_items = item_data.loc[most_similar_indices]['movieId']

    # Note: these scores are Euclidean distances (smaller means more similar).
    scores = dist_data.loc[movie_id].to_list()

    data = pd.DataFrame(
        {
            'movieId':movie_id,
            'similar':most_similar_items,
            'score':scores,
            'rank':ranking
        }
    )
    frames.append(data)

# Concatenate once instead of growing the DataFrame on every iteration.
nearest_items = pd.concat(frames)
nearest_items

In [None]:
nearest_items = turicreate.SFrame(nearest_items)
nearest_items

In [None]:
item_sim_recommender = turicreate.item_similarity_recommender.create(train_data, user_id = 'userId', item_id = 'movieId', target = 'rating', nearest_items = nearest_items)