[![movies](80s-movies.jpg)](80s-movies.jpg)

# What should I watch?

**Overview:**<br/>
Using the movie metadata and ratings datasets, we will build two recommendation engines to predict which movies we should watch. Both engines use **collaborative filtering**:
1. Item to item
2. Hybrid: User to user, followed by item to item

In general terms, each engine groups similar users and similar items.

**Method:**<br/>
Typically, the workflow of a collaborative filtering system is:

1. A user expresses his or her preferences by rating items (e.g. books, movies or CDs) of the system. These ratings can be viewed as an approximate representation of the user's interest in the corresponding domain.
2. The system matches this user's ratings against other users' and finds the people with most "similar" tastes.
3. The system then recommends items that the similar users have rated highly but that this user has not yet rated (the absence of a rating is presumably taken to mean the user is unfamiliar with the item).

A key problem of collaborative filtering is how to combine and weight the preferences of user neighbors. Sometimes, users can immediately rate the recommended items. As a result, the system gains an increasingly accurate representation of user preferences over time. ~ Wikipedia (https://en.wikipedia.org/wiki/Collaborative_filtering)<br/><br/>
[![movies](met_21_4_493_fig1a.gif)](met_21_4_493_fig1a.gif)
<br/><br/>
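The three workflow steps above can be sketched in a few lines of plain Python. This is a minimal illustration, not the engine built in this notebook: the toy `ratings` dictionary, the user names, and the helpers `cosine` and `recommend` are all made up for the example, and cosine similarity over co-rated items is only one common choice of similarity measure.

```python
from math import sqrt

# Hypothetical toy ratings: user -> {movie: rating}
ratings = {
    "ann": {"Alien": 5.0, "Blade Runner": 4.5, "Top Gun": 1.0},
    "bob": {"Alien": 4.5, "Blade Runner": 5.0, "The Thing": 4.5},
    "cat": {"Top Gun": 5.0, "Dirty Dancing": 4.5, "Alien": 1.5},
}

def cosine(a, b):
    """Cosine similarity over the movies both users rated (0.0 if none)."""
    shared = set(a) & set(b)
    if not shared:
        return 0.0
    dot = sum(a[m] * b[m] for m in shared)
    norm_a = sqrt(sum(a[m] ** 2 for m in shared))
    norm_b = sqrt(sum(b[m] ** 2 for m in shared))
    return dot / (norm_a * norm_b)

def recommend(user, ratings, k=1):
    """Rank movies the user has not rated by the similarity-weighted
    ratings of the k most similar users."""
    others = [(cosine(ratings[user], r), u) for u, r in ratings.items() if u != user]
    neighbours = sorted(others, reverse=True)[:k]
    scores = {}
    for sim, u in neighbours:
        for movie, rating in ratings[u].items():
            if movie not in ratings[user]:
                scores[movie] = scores.get(movie, 0.0) + sim * rating
    return sorted(scores, key=scores.get, reverse=True)

print(recommend("ann", ratings))  # ['The Thing']
```

Here `bob` is the closest neighbour of `ann`, so the one movie he rated that she has not is recommended. How neighbours are weighted and combined is exactly the "key problem" mentioned in the quote; the weighted sum used here is only one option.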

In [1]:
import numpy as np 
import pandas as pd
import re
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
import joblib  # sklearn.externals.joblib was removed in scikit-learn 0.23

import warnings
warnings.filterwarnings("ignore")

In [2]:
# import movie metadata and keep only the columns we need
meta = pd.read_csv("movies_metadata.csv")
meta = meta[['id', 'original_title', 'original_language',
             'revenue', 'vote_average', 'vote_count', 'popularity', 'genres']]
meta = meta.rename(columns={'id':'movieId'})
meta = meta[meta['original_language']== 'en']
meta.head()

Unnamed: 0,movieId,original_title,original_language,revenue,vote_average,vote_count,popularity,genres
0,862,Toy Story,en,373554033.0,7.7,5415.0,21.9469,"[{'id': 16, 'name': 'Animation'}, {'id': 35, '..."
1,8844,Jumanji,en,262797249.0,6.9,2413.0,17.0155,"[{'id': 12, 'name': 'Adventure'}, {'id': 14, '..."
2,15602,Grumpier Old Men,en,0.0,6.5,92.0,11.7129,"[{'id': 10749, 'name': 'Romance'}, {'id': 35, ..."
3,31357,Waiting to Exhale,en,81452156.0,6.1,34.0,3.85949,"[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'nam..."
4,11862,Father of the Bride Part II,en,76578911.0,5.7,173.0,8.38752,"[{'id': 35, 'name': 'Comedy'}]"


In [3]:
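# extract the numeric genre IDs from each genres string (raw string avoids an invalid-escape warning)
meta.genres = [list(map(int, re.findall(r'\d+', x))) for x in meta.genres]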
meta.head()

Unnamed: 0,movieId,original_title,original_language,revenue,vote_average,vote_count,popularity,genres
0,862,Toy Story,en,373554033.0,7.7,5415.0,21.9469,"[16, 35, 10751]"
1,8844,Jumanji,en,262797249.0,6.9,2413.0,17.0155,"[12, 14, 10751]"
2,15602,Grumpier Old Men,en,0.0,6.5,92.0,11.7129,"[10749, 35]"
3,31357,Waiting to Exhale,en,81452156.0,6.1,34.0,3.85949,"[35, 18, 10749]"
4,11862,Father of the Bride Part II,en,76578911.0,5.7,173.0,8.38752,[35]
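The regular expression in the cell above pulls every run of digits out of the `genres` string, which works as long as no genre name contains a digit. As an alternative sketch, and assuming each value is a Python-literal list of dictionaries like the ones displayed in the first `meta.head()` output, `ast.literal_eval` from the standard library can parse the structure directly and keep the names as well. The helper name `parse_genres` and the `sample` string are made up for this example; the sample only combines genre entries that are fully visible in that output.

```python
import ast

def parse_genres(raw):
    """Return (ids, names) parsed from a genres string; empty lists if unparsable."""
    try:
        items = ast.literal_eval(raw)
    except (ValueError, SyntaxError, TypeError):
        return [], []
    if not isinstance(items, list):
        return [], []
    ids = [int(g["id"]) for g in items if isinstance(g, dict) and "id" in g]
    names = [g["name"] for g in items if isinstance(g, dict) and "name" in g]
    return ids, names

# Hypothetical row built from genre entries visible in the output above
sample = "[{'id': 16, 'name': 'Animation'}, {'id': 35, 'name': 'Comedy'}]"
ids, names = parse_genres(sample)
print(ids)    # [16, 35]
print(names)  # ['Animation', 'Comedy']
```

Keeping the names makes later outputs easier to read, at the cost of a slower parse than the regular expression.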


In [4]:
max_length = len(max(meta.genres, key = len))
print('Max # of Genres: ', max_length)

def padarray(A, size):
    t = size - len(A)
    return np.pad(A, pad_width=(0, t), mode='constant')

meta.genres = [padarray(x, max_length) for x in meta.genres]
meta.head()

Max # of Genres:  8


Unnamed: 0,movieId,original_title,original_language,revenue,vote_average,vote_count,popularity,genres
0,862,Toy Story,en,373554033.0,7.7,5415.0,21.9469,"[16, 35, 10751, 0, 0, 0, 0, 0]"
1,8844,Jumanji,en,262797249.0,6.9,2413.0,17.0155,"[12, 14, 10751, 0, 0, 0, 0, 0]"
2,15602,Grumpier Old Men,en,0.0,6.5,92.0,11.7129,"[10749, 35, 0, 0, 0, 0, 0, 0]"
3,31357,Waiting to Exhale,en,81452156.0,6.1,34.0,3.85949,"[35, 18, 10749, 0, 0, 0, 0, 0]"
4,11862,Father of the Bride Part II,en,76578911.0,5.7,173.0,8.38752,"[35, 0, 0, 0, 0, 0, 0, 0]"
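Zero-padding gives every movie the same number of genre slots, but the slot a genre lands in depends on its position in the list, and the raw IDs end up being treated as ordered numbers by most models. One common alternative, shown here only as a sketch and not as what this notebook does, is a multi-hot encoding with one 0/1 flag per genre ID. The helper `multi_hot` is an assumption made for this example; the input lists are the unpadded genre IDs of the first three movies.

```python
def multi_hot(genre_lists):
    """Return (vocabulary, rows), where each row holds a 0/1 flag per genre ID."""
    vocab = sorted({g for genres in genre_lists for g in genres})
    rows = []
    for genres in genre_lists:
        present = set(genres)
        rows.append([int(g in present) for g in vocab])
    return vocab, rows

# Unpadded genre IDs for Toy Story, Jumanji and Grumpier Old Men
genre_lists = [[16, 35, 10751], [12, 14, 10751], [10749, 35]]
vocab, rows = multi_hot(genre_lists)

print(vocab)  # [12, 14, 16, 35, 10749, 10751]
for row in rows:
    print(row)
# [0, 0, 1, 1, 0, 1]
# [1, 1, 0, 0, 0, 1]
# [0, 0, 0, 1, 1, 0]
```

With this layout a genre always maps to the same column regardless of where it appeared in the original list.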


In [5]:
for n in range(0, max_length):
    meta['genre'+str(n+1)] = meta.genres.apply(lambda x: int(x[n]))

meta.drop('genres', axis=1, inplace=True)
meta.head()

Unnamed: 0,movieId,original_title,original_language,revenue,vote_average,vote_count,popularity,genre1,genre2,genre3,genre4,genre5,genre6,genre7,genre8
0,862,Toy Story,en,373554033.0,7.7,5415.0,21.9469,16,35,10751,0,0,0,0,0
1,8844,Jumanji,en,262797249.0,6.9,2413.0,17.0155,12,14,10751,0,0,0,0,0
2,15602,Grumpier Old Men,en,0.0,6.5,92.0,11.7129,10749,35,0,0,0,0,0,0
3,31357,Waiting to Exhale,en,81452156.0,6.1,34.0,3.85949,35,18,10749,0,0,0,0,0
4,11862,Father of the Bride Part II,en,76578911.0,5.7,173.0,8.38752,35,0,0,0,0,0,0,0


In [6]:
# import movie ratings and keep only the columns we need
ratings = pd.read_csv("ratings.csv")
ratings = ratings[['userId', 'movieId', 'rating']]

# keep the first 2.5 million rows (not a random sample) because pivoting the full data later takes too long
ratings = ratings.head(2500000)

# convert IDs to numeric before merging (unparsable values become NaN)
meta.movieId = pd.to_numeric(meta.movieId, errors = 'coerce')
ratings.movieId = pd.to_numeric(ratings.movieId, errors = 'coerce')

# merge the two datasets so that each rating carries the movie title and metadata;
# this assumes ratings.movieId and the metadata id use the same ID scheme
data = pd.merge(ratings, meta, on = 'movieId', how = 'inner')
data.head()

Unnamed: 0,userId,movieId,rating,original_title,original_language,revenue,vote_average,vote_count,popularity,genre1,genre2,genre3,genre4,genre5,genre6,genre7,genre8
0,1,858,5.0,Sleepless in Seattle,en,227799884.0,6.5,630.0,10.2349,35,18,10749,0,0,0,0,0
1,3,858,4.0,Sleepless in Seattle,en,227799884.0,6.5,630.0,10.2349,35,18,10749,0,0,0,0,0
2,5,858,5.0,Sleepless in Seattle,en,227799884.0,6.5,630.0,10.2349,35,18,10749,0,0,0,0,0
3,12,858,4.0,Sleepless in Seattle,en,227799884.0,6.5,630.0,10.2349,35,18,10749,0,0,0,0,0
4,20,858,4.5,Sleepless in Seattle,en,227799884.0,6.5,630.0,10.2349,35,18,10749,0,0,0,0,0


In [7]:
data.groupby('rating').size()

rating
0.5     10101
1.0     27515
1.5     10365
2.0     56663
2.5     32570
3.0    179982
3.5     80961
4.0    222883
4.5     58301
5.0    130337
dtype: int64

In [8]:
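# note: data_ref is another name for data, not a copy, so the new column is added to data as well
data_ref = data
# binary target: 1 if the user rated the movie 4 or higher, else 0
data_ref['target'] = np.where(data_ref.rating < 4, 0, 1)
data_ref['popularity'] = data_ref.popularity.astype(float)
data_ref.groupby('target').size()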

target
0    398157
1    411521
dtype: int64
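Because the two classes are nearly balanced, a model that always predicts the larger class is a useful reference point for the accuracy scores reported later. The short calculation below recomputes the class counts from the rating histogram printed above and derives that majority-class baseline. The variable names are chosen for this example only.

```python
# Rating counts copied from the groupby('rating') output above
rating_counts = {
    0.5: 10101, 1.0: 27515, 1.5: 10365, 2.0: 56663, 2.5: 32570,
    3.0: 179982, 3.5: 80961, 4.0: 222883, 4.5: 58301, 5.0: 130337,
}

threshold = 4.0  # same cut-off as np.where(rating < 4, 0, 1)
positives = sum(n for r, n in rating_counts.items() if r >= threshold)
negatives = sum(n for r, n in rating_counts.items() if r < threshold)
baseline = max(positives, negatives) / (positives + negatives)

print(negatives, positives)  # 398157 411521
print(f"majority-class accuracy: {baseline:.4f}")
```

The counts match the `target` distribution, and the baseline comes out at roughly 51%, so any accuracy well above that reflects signal picked up from the features.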

In [9]:
# features: drop the target, the rating it is derived from, and the columns not used as predictors
X = data_ref.drop(['target', 'revenue', 'rating', 'original_title', 'original_language'], axis=1)
y = data_ref.target

In [10]:
X.head()

Unnamed: 0,userId,movieId,vote_average,vote_count,popularity,genre1,genre2,genre3,genre4,genre5,genre6,genre7,genre8
0,1,858,6.5,630.0,10.234919,35,18,10749,0,0,0,0,0
1,3,858,6.5,630.0,10.234919,35,18,10749,0,0,0,0,0
2,5,858,6.5,630.0,10.234919,35,18,10749,0,0,0,0,0
3,12,858,6.5,630.0,10.234919,35,18,10749,0,0,0,0,0
4,20,858,6.5,630.0,10.234919,35,18,10749,0,0,0,0,0


In [11]:
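# Caveat: with warm_start=True and an unchanged n_estimators, only the first
# fit() call trains the model; later calls add no new trees. Each "epoch"
# below therefore re-scores the same model on a fresh random split whose test
# rows mostly overlap the first training set, which likely explains the jump
# in score after epoch 1.
gbc = GradientBoostingClassifier(n_estimators=30000, warm_start=True)

for n in range(1, 21):
    X_train, X_test, Y_train, Y_test = train_test_split(X, y, test_size=0.2, random_state=n)
    gbc.fit(X_train, Y_train)
    print('Epoch #', n)
    print('Score: ', gbc.score(X_test, Y_test))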

Epoch # 1
Score:  0.7051983499654184
Epoch # 2
Score:  0.7395390771662879
Epoch # 3
Score:  0.7392858907222607
Epoch # 4
Score:  0.7390018278826203
Epoch # 5
Score:  0.7399651714257485
Epoch # 6
Score:  0.7386004347396502
Epoch # 7
Score:  0.7403603892895959
Epoch # 8
Score:  0.7393044165596285
Epoch # 9
Score:  0.7380014326647565
Epoch # 10
Score:  0.7384090010868491
Epoch # 11
Score:  0.7380261337812469
Epoch # 12
Score:  0.7403974409643316
Epoch # 13
Score:  0.7403850904060864
Epoch # 14
Score:  0.7376803181503804
Epoch # 15
Score:  0.7393044165596285
Epoch # 16
Score:  0.7386004347396502
Epoch # 17
Score:  0.739650232190495
Epoch # 18
Score:  0.7389215492540263
Epoch # 19
Score:  0.7399466455883806
Epoch # 20
Score:  0.7381990415966801
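One thing to keep in mind when reading the scores above: each iteration draws a new random 80/20 split, so a test set drawn in a later iteration is not disjoint from the training set of the first one. The simulation below uses only the standard library to estimate how large that overlap is. The row count and the helper `split_indices` are made up for the illustration and do not reproduce scikit-learn's exact shuffling.

```python
import random

def split_indices(n_rows, test_size, seed):
    """Randomly split row indices into (train, test) sets, like a shuffled hold-out."""
    rng = random.Random(seed)
    indices = list(range(n_rows))
    rng.shuffle(indices)
    n_test = int(n_rows * test_size)
    return set(indices[n_test:]), set(indices[:n_test])

n_rows = 10_000  # hypothetical size, far smaller than the real data
first_train, _ = split_indices(n_rows, test_size=0.2, seed=1)

overlaps = []
for seed in range(2, 5):
    _, later_test = split_indices(n_rows, test_size=0.2, seed=seed)
    seen = len(later_test & first_train) / len(later_test)
    overlaps.append(seen)
    print(f"seed {seed}: {seen:.0%} of the test rows were in the first training set")
```

On average about 80% of a re-drawn test set was already part of the first training set. If a model fitted on the first split is scored on a later split, that score is partly measured on rows the model has already seen, so the first score is the cleanest estimate of hold-out accuracy.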


In [12]:
# Collect the feature importances in a dataframe and print those above 1%
fi1 = pd.DataFrame(data = gbc.feature_importances_, index = X_train.columns, columns = ['Importance'])
print(fi1[fi1['Importance']>.01].sort_values('Importance', ascending=False))
print('Importance sum: ', fi1.Importance.sum())

              Importance
userId          0.653210
movieId         0.108677
popularity      0.076746
vote_count      0.048804
vote_average    0.035698
genre2          0.023567
genre1          0.022462
genre3          0.016052
Importance sum:  1.0


In [13]:
joblib.dump(gbc, 'gbc30000.pkl')

['gbc30000.pkl']

In [14]:
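# Same setup with twice as many trees. The caveat about warm_start applies here
# too: with n_estimators unchanged, only the first fit() call trains the model,
# and later "epochs" score it on splits that overlap the first training set.
gbc = GradientBoostingClassifier(n_estimators=60000, warm_start=True)

for n in range(1, 21):
    X_train, X_test, Y_train, Y_test = train_test_split(X, y, test_size=0.2, random_state=n)
    gbc.fit(X_train, Y_train)
    print('Epoch #', n)
    print('Score: ', gbc.score(X_test, Y_test))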

Epoch # 1
Score:  0.7071559134472878
Epoch # 2
Score:  0.7611093271415867
Epoch # 3
Score:  0.7612575338405296
Epoch # 4
Score:  0.7614983697263116
Epoch # 5
Score:  0.7621961762671673
Epoch # 6
Score:  0.7605288509040609
Epoch # 7
Score:  0.7619059381484043
Epoch # 8
Score:  0.7606153048117775
Epoch # 9
Score:  0.7609178934887857
Epoch # 10
Score:  0.761041399071238
Epoch # 11
Score:  0.7605967789744097
Epoch # 12
Score:  0.7624308368738267
Epoch # 13
Score:  0.7629989625531074
Epoch # 14
Score:  0.7597260646181208
Epoch # 15
Score:  0.7603991700424859
Epoch # 16
Score:  0.7613131113526331
Epoch # 17
Score:  0.7615168955636795
Epoch # 18
Score:  0.7615662977966604
Epoch # 19
Score:  0.7623320324078648
Epoch # 20
Score:  0.7607017587194941


In [15]:
# Collect the feature importances in a dataframe and print those above 1%
fi1 = pd.DataFrame(data = gbc.feature_importances_, index = X_train.columns, columns = ['Importance'])
print(fi1[fi1['Importance']>.01].sort_values('Importance', ascending=False))
print('Importance sum: ', fi1.Importance.sum())

              Importance
userId          0.665493
movieId         0.094520
popularity      0.068659
vote_count      0.047611
vote_average    0.038556
genre2          0.025840
genre1          0.025255
genre3          0.017924
genre4          0.010712
Importance sum:  0.9999999999999916


In [16]:
joblib.dump(gbc, 'gbc60000.pkl')

['gbc60000.pkl']