[![movies](80s-movies.jpg)](80s-movies.jpg)

# What should I watch?

**Overview:**<br/>
Using movies & ratings datasets we will create two recommendation engine to predict what movies we should watch. Both engines will use **collaborative filtering** as the preferred method:
1. Item to item
2. Hybrid: User to user, followed by item to item

In a general sense, the engine will group similar users and similar items.

**Method:**<br/>
Typically, the workflow of a collaborative filtering system is:

1. A user expresses his or her preferences by rating items (e.g. books, movies or CDs) of the system. These ratings can be viewed as an approximate representation of the user's interest in the corresponding domain.[![melbourne]
2. The system matches this user's ratings against other users' and finds the people with most "similar" tastes.
3. With similar users, the system recommends items that the similar users have rated highly but not yet being rated by this user (presumably the absence of rating is often considered as the unfamiliarity of an item)

A key problem of collaborative filtering is how to combine and weight the preferences of user neighbors. Sometimes, users can immediately rate the recommended items. As a result, the system gains an increasingly accurate representation of user preferences over time. ~ Wikipedia (https://en.wikipedia.org/wiki/Collaborative_filtering)<br/><br/>
[![movies](met_21_4_493_fig1a.gif)](met_21_4_493_fig1a.gif)
<br/><br/>

In [1]:
import numpy as np 
import pandas as pd
import re
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.externals import joblib
from tqdm import tqdm

import warnings
warnings.filterwarnings("ignore")

In [2]:
#importing movie metadata and keep necessary columns
meta = pd.read_csv("movies_metadata.csv")
meta = meta[['id', 'original_title', 'original_language',
             'revenue', 'vote_average', 'vote_count', 'popularity', 'genres']]
meta = meta.rename(columns={'id':'movieId'})
meta = meta[meta['original_language']== 'en']
meta.head()

Unnamed: 0,movieId,original_title,original_language,revenue,vote_average,vote_count,popularity,genres
0,862,Toy Story,en,373554033.0,7.7,5415.0,21.9469,"[{'id': 16, 'name': 'Animation'}, {'id': 35, '..."
1,8844,Jumanji,en,262797249.0,6.9,2413.0,17.0155,"[{'id': 12, 'name': 'Adventure'}, {'id': 14, '..."
2,15602,Grumpier Old Men,en,0.0,6.5,92.0,11.7129,"[{'id': 10749, 'name': 'Romance'}, {'id': 35, ..."
3,31357,Waiting to Exhale,en,81452156.0,6.1,34.0,3.85949,"[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'nam..."
4,11862,Father of the Bride Part II,en,76578911.0,5.7,173.0,8.38752,"[{'id': 35, 'name': 'Comedy'}]"


In [3]:
meta.genres = [list(map(int, re.findall('\d+', x))) for x in meta.genres]
meta.head()

Unnamed: 0,movieId,original_title,original_language,revenue,vote_average,vote_count,popularity,genres
0,862,Toy Story,en,373554033.0,7.7,5415.0,21.9469,"[16, 35, 10751]"
1,8844,Jumanji,en,262797249.0,6.9,2413.0,17.0155,"[12, 14, 10751]"
2,15602,Grumpier Old Men,en,0.0,6.5,92.0,11.7129,"[10749, 35]"
3,31357,Waiting to Exhale,en,81452156.0,6.1,34.0,3.85949,"[35, 18, 10749]"
4,11862,Father of the Bride Part II,en,76578911.0,5.7,173.0,8.38752,[35]


In [4]:
max_length = len(max(meta.genres, key = len))
print('Max # of Genres: ', max_length)

def padarray(A, size):
    t = size - len(A)
    return np.pad(A, pad_width=(0, t), mode='constant')

meta.genres = [padarray(x, max_length) for x in meta.genres]
ref = meta
meta.head()

Max # of Genres:  8


Unnamed: 0,movieId,original_title,original_language,revenue,vote_average,vote_count,popularity,genres
0,862,Toy Story,en,373554033.0,7.7,5415.0,21.9469,"[16, 35, 10751, 0, 0, 0, 0, 0]"
1,8844,Jumanji,en,262797249.0,6.9,2413.0,17.0155,"[12, 14, 10751, 0, 0, 0, 0, 0]"
2,15602,Grumpier Old Men,en,0.0,6.5,92.0,11.7129,"[10749, 35, 0, 0, 0, 0, 0, 0]"
3,31357,Waiting to Exhale,en,81452156.0,6.1,34.0,3.85949,"[35, 18, 10749, 0, 0, 0, 0, 0]"
4,11862,Father of the Bride Part II,en,76578911.0,5.7,173.0,8.38752,"[35, 0, 0, 0, 0, 0, 0, 0]"


In [5]:
for n in range(0, max_length):
    meta['genre'+str(n+1)] = meta.genres.apply(lambda x: int(x[n]))

meta.drop('genres', axis=1, inplace=True)
meta.head()

Unnamed: 0,movieId,original_title,original_language,revenue,vote_average,vote_count,popularity,genre1,genre2,genre3,genre4,genre5,genre6,genre7,genre8
0,862,Toy Story,en,373554033.0,7.7,5415.0,21.9469,16,35,10751,0,0,0,0,0
1,8844,Jumanji,en,262797249.0,6.9,2413.0,17.0155,12,14,10751,0,0,0,0,0
2,15602,Grumpier Old Men,en,0.0,6.5,92.0,11.7129,10749,35,0,0,0,0,0,0
3,31357,Waiting to Exhale,en,81452156.0,6.1,34.0,3.85949,35,18,10749,0,0,0,0,0
4,11862,Father of the Bride Part II,en,76578911.0,5.7,173.0,8.38752,35,0,0,0,0,0,0,0


In [6]:
#importing movie ratings and keep necessary columns
ratings = pd.read_csv("ratings.csv")
ratings = ratings[['userId', 'movieId', 'rating']]

# taking a 2.5MM sample because it can take too long to pivot data later on
ratings = ratings.head(2500000)

#convert data types before merging
meta.movieId = pd.to_numeric(meta.movieId, errors = 'coerce')
ratings.movieId = pd.to_numeric(ratings.movieId, errors = 'coerce')

#merge the 2 datasets, so that we can have the labels for the movie titles
data= pd.merge(ratings, meta, on = 'movieId', how = 'inner')
data.head()

Unnamed: 0,userId,movieId,rating,original_title,original_language,revenue,vote_average,vote_count,popularity,genre1,genre2,genre3,genre4,genre5,genre6,genre7,genre8
0,1,858,5.0,Sleepless in Seattle,en,227799884.0,6.5,630.0,10.2349,35,18,10749,0,0,0,0,0
1,3,858,4.0,Sleepless in Seattle,en,227799884.0,6.5,630.0,10.2349,35,18,10749,0,0,0,0,0
2,5,858,5.0,Sleepless in Seattle,en,227799884.0,6.5,630.0,10.2349,35,18,10749,0,0,0,0,0
3,12,858,4.0,Sleepless in Seattle,en,227799884.0,6.5,630.0,10.2349,35,18,10749,0,0,0,0,0
4,20,858,4.5,Sleepless in Seattle,en,227799884.0,6.5,630.0,10.2349,35,18,10749,0,0,0,0,0


In [7]:
#pivot the table so that rows = users and columns = movies and the content is the ratings
matrix= data.pivot_table(index='userId', columns='original_title', values='rating')
matrix.head(10)

original_title,!Women Art Revolution,$5 a Day,'Gator Bait,'R Xmas,'Twas the Night Before Christmas,(A)Sexual,...And the Pursuit of Happiness,10 Items or Less,10 Things I Hate About You,"10,000 BC",...,Æon Flux,Бабник,Грозовые ворота,Дневник его жены,Мой сводный брат Франкенштейн,"Цирк сгорел, и клоуны разбежались",به امید دیدار,مارمولک,რამინი,黑太陽731
userId,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1,,,,,,,,,,,...,,,,,,,,,,
2,,,,,,,,,,,...,,,,,,,,,,
3,,,,,,,,,,,...,,,,,,,,,,
4,,,,,,,,,,,...,,,,,,,,,,
5,,,,,,,,,,,...,,,,,,,,,,
6,,,,,,,,,,,...,,,,,,,,,,
7,,,,,,,,,,,...,,,,,,,,,,
8,,,,,,,,,,,...,,,,,,,,,,
9,,,,,,,,,,,...,,,,,,,,,,
10,,,,,,,,,,,...,,,,,,,,,,


In [8]:
# Check to see if the columns are not empty
print('Total ratings score of userId 1: ', matrix.iloc[0].sum())
print('Mean ratings score of userId 1: ', matrix.iloc[0].mean())
print('Ratings Count of userId 1: ', matrix.iloc[0].count())

Total ratings score of userId 1:  30.0
Mean ratings score of userId 1:  4.285714285714286
Ratings Count of userId 1:  7


In [9]:
# Pearson Correlation
def pearsonR(s1, s2):
    s1_c = s1-s1.mean()
    s2_c = s2-s2.mean()
    return np.sum(s1_c * s2_c) / np.sqrt(np.sum(s1_c**2) * np.sum(s2_c**2))

In [10]:
# Create watched list based on userId.
def has_watched(M, userid):
    watched = []
    t = M[M.index==userid].notnull()
    for c in t.columns:
        if t[c].values[0] == True:
            watched.append(c)
    return watched

In [11]:
# Return the score of a recently watched movie
def returnscore(movie, userid, data_ref):
    rs = data_ref.loc[(data_ref.userId == userid) & (data_ref.original_title == movie)].reset_index(drop=True)
    rs.drop(['original_title', 'original_language', 'revenue', 'target'], axis=1, inplace=True)
    rating = rs.iloc[0]['rating']
    if rating < 4:
        s = 0
    else:
        s = 1
    return s, rating

In [12]:
def getx(movie, userid, data_ref):
    newx = data_ref.loc[data_ref.original_title == movie].reset_index(drop=True)
    newx.drop(['userId', 'rating', 'original_title', 'original_language', 'revenue', 'target'], axis=1, inplace=True)
    newx = newx[:1]

    idx = 0
    new_col = [userid]  
    newx.insert(loc=idx, column='userId', value=new_col)
    return newx

In [103]:
def findcommong(movie1, movie2, ref):
    list1 = ref[ref.original_title == movie1].genres.values
    list1 = list1[0]
    list2 = ref[ref.original_title == movie2].genres.values
    list2 = list2[0]
    common = [i for i in list1 if i in list2 if i != 0]
    return common

In [134]:
def findallcommon(list1, list2, ref):
    all_common = []
    watched_genre = []
    rec_genre = []
    mov_list1 = list1.tolist()
    mov_list2 = list2.tolist()
    
    for n, title in enumerate(mov_list1):
        
        m1 = ref[ref.original_title == mov_list1[n]].genres.values
        m1 = m1[0]
        m2 = ref[ref.original_title == mov_list2[n]].genres.values
        m2 = m2[0]
        
        watched_genre.append(m1)
        rec_genre.append(m2)
        all_common.append(findcommong(mov_list1[n], mov_list2[n], ref))
    ln = np.concatenate(all_common).ravel().tolist()
    df = pd.DataFrame()
    df['watched_title'] = list1
    df['watched_genre'] = watched_genre
    df['recommended_title'] = list2
    df['recommended_genre'] = rec_genre
    df['in_commmon_genre'] = all_common
    return df, len(ln)

# Collaborative filtering (item to item)

In [13]:
# The parameters here are: recently watched movie name, matrix name, number of recommendations, and userID.
def recommend(movie, M, n, userid):
    
    # A function called to create watched list based on userID & append recently watched movie
    watched = has_watched(M, userid)
    watched.append(movie)
    
    # A function to make N recommendations based on Pearson Correlation.
    reviews=[]
    for title in M.columns:
        if title in watched:
            continue
        cor = pearsonR(M[movie], M[title])
        if np.isnan(cor):
            continue
        else:
            reviews.append((title, cor))
    
    # Sort the table of movies descending by similarity
    reviews.sort(key= lambda tup: tup[1], reverse=True)
    rev = pd.DataFrame(reviews[:n], columns=['Title', 'Score'])
    return rev

In [14]:
def getrecdf(usr_list, mtx, num_of_rec, model, data_ref):
    
    uid = []
    movt = []
    ratings = []
    scores = []
    trec = []
    simscore = []
    proba = []
    ypred = []

    for u in tqdm(usr_list):
        hst = has_watched(matrix, u)

        for mov in hst:
            s, rg = returnscore(mov, u, data_ref)
            rec = recommend(mov, mtx, num_of_rec, u)
            t = rec.Title.values
            sc = rec.Score.values
            for n, m in enumerate(t):
                uid.append(u)
                movt.append(mov)
                ratings.append(rg)
                scores.append(s)
                trec.append(m)
                simscore.append(sc[n])


                X = getx(m, u, data_ref)
                pred = model.predict(X)
                prob = model.predict_proba(X)
                ypred.append(int(pred))
                proba.append(float(prob[:,int(pred)]))

    tempdf = pd.DataFrame()
    tempdf['userId'] = uid
    tempdf['original_title'] = movt
    tempdf['rating'] = ratings
    tempdf['target'] = scores
    tempdf['recommended_title'] = trec
    tempdf['similarity_score'] = simscore
    tempdf['probability_of_pred'] = proba
    tempdf['pred'] = ypred
    return tempdf

In [15]:
data_ref = data
data_ref['target'] = np.where(data_ref.rating < 4, 0, 1)
data_ref['popularity'] = data_ref.popularity.astype(float)

gbc = joblib.load('gbc30000.pkl') 

comp_user_list = data_ref.userId.unique()
comp_user_list = comp_user_list[:500]
user_list_under10 = []

for u in comp_user_list:
    hw = has_watched(matrix, u)
    if len(hw) < 11:
        user_list_under10.append(u)
        
trunc_user_list = user_list_under10[:50]       
print('Length of list under 10: ', len(user_list_under10))
print('Truncated list under 10: ', trunc_user_list)

Length of list under 10:  72
Truncated list under 10:  [1, 3, 5, 28, 50, 109, 138, 143, 146, 184, 204, 206, 210, 401, 448, 502, 643, 647, 655, 671, 695, 734, 812, 840, 858, 862, 867, 915, 959, 1182, 1202, 1206, 1209, 1303, 1317, 1347, 1377, 1387, 1442, 1446, 1452, 1474, 1533, 1606, 1610, 1632, 1642, 1690, 1715, 1766]


In [141]:
# let's set the num of recommendations
num_rec = 5
print('Number of recommendations is set to: ', num_rec)

Number of recommendations is set to:  5


In [16]:
df_t = getrecdf(trunc_user_list, matrix, num_rec, gbc, data_ref)
df_t.head(20)

100%|██████████| 50/50 [1:21:14<00:00, 97.50s/it] 


Unnamed: 0,userId,original_title,rating,target,recommended_title,similarity_score,probability_of_pred,pred
0,1,Fools Rush In,4.0,1,One Night at McCool's,0.200594,0.93059,1
1,1,Fools Rush In,4.0,1,Notes on a Scandal,0.168553,0.999389,1
2,1,Fools Rush In,4.0,1,The Time Machine,0.130388,0.980047,1
3,1,Fools Rush In,4.0,1,Sister Act,0.127215,0.853216,1
4,1,Fools Rush In,4.0,1,My Name Is Bruce,0.115548,0.813008,1
5,1,License to Wed,4.0,1,Beetlejuice,0.147927,0.974679,1
6,1,License to Wed,4.0,1,Terminator 3: Rise of the Machines,0.14668,0.98632,1
7,1,License to Wed,4.0,1,Point Break,0.115753,0.992133,1
8,1,License to Wed,4.0,1,The Million Dollar Hotel,0.11239,0.999353,1
9,1,License to Wed,4.0,1,Loose Screws,0.111083,0.53406,1


In [17]:
print('Accuracy Score: ', accuracy_score(df_t.target, df_t.pred))
print('Average Similarity Score: ', df_t.similarity_score.mean())
print('Average Probability Score: ', df_t.probability_of_pred.mean(), '\n')

print('Confusion Matrix: ')
pd.crosstab(df_t.target, df_t.pred)

Accuracy Score:  0.7326666666666667
Average Similarity Score:  0.14820295499016517
Average Probability Score:  0.7283862394279564 

Confusion Matrix: 


pred,0,1
target,Unnamed: 1_level_1,Unnamed: 2_level_1
0,263,192
1,209,836


In [170]:
print('There is a total of {} recommendations.'.format(len(df_t)))

com_list, l_com = findallcommon(df_t.original_title, df_t.recommended_title, ref)
print('Between the watched list from users and the recommended titles, there are {} common genres.'.format(l_com))

com_list.head(20)

There is a total of 1500 recommendations.
Between the watched list from users and the recommended titles, there are 1132 common genres.


Unnamed: 0,watched_title,watched_genre,recommended_title,recommended_genre,in_commmon_genre
0,Fools Rush In,"[18, 35, 10749, 0, 0, 0, 0, 0]",One Night at McCool's,"[28, 35, 80, 0, 0, 0, 0, 0]",[35]
1,Fools Rush In,"[18, 35, 10749, 0, 0, 0, 0, 0]",Notes on a Scandal,"[18, 10749, 0, 0, 0, 0, 0, 0]","[18, 10749]"
2,Fools Rush In,"[18, 35, 10749, 0, 0, 0, 0, 0]",The Time Machine,"[53, 12, 14, 878, 10749, 0, 0, 0]",[10749]
3,Fools Rush In,"[18, 35, 10749, 0, 0, 0, 0, 0]",Sister Act,"[10402, 35, 0, 0, 0, 0, 0, 0]",[35]
4,Fools Rush In,"[18, 35, 10749, 0, 0, 0, 0, 0]",My Name Is Bruce,"[35, 27, 0, 0, 0, 0, 0, 0]",[35]
5,License to Wed,"[35, 0, 0, 0, 0, 0, 0, 0]",Beetlejuice,"[14, 35, 0, 0, 0, 0, 0, 0]",[35]
6,License to Wed,"[35, 0, 0, 0, 0, 0, 0, 0]",Terminator 3: Rise of the Machines,"[28, 53, 878, 0, 0, 0, 0, 0]",[]
7,License to Wed,"[35, 0, 0, 0, 0, 0, 0, 0]",Point Break,"[28, 53, 80, 0, 0, 0, 0, 0]",[]
8,License to Wed,"[35, 0, 0, 0, 0, 0, 0, 0]",The Million Dollar Hotel,"[18, 53, 0, 0, 0, 0, 0, 0]",[]
9,License to Wed,"[35, 0, 0, 0, 0, 0, 0, 0]",Loose Screws,"[35, 0, 0, 0, 0, 0, 0, 0]",[35]


[![movies](cognitive-bias_feature31.jpg)](cognitive-bias_feature31.jpg)

Item to item collaborative filtering, seems to be doing fairly well, however it doesn't take into account user scoring biased. For example the item to item filtering doesn't account for a harsh scorer or a user that always gives a high score. It just simply looks at the current movie and finds the most similar movie based on the ratings.

The next method (recommendation engine), using **user to user, followed by item to item** collaborative filtering will remove this user bias by grouping similar users first, then basing the recommendation on similar items that this group of users have watched. Simply this method is equivalent to saying "Birds of the same feather, flock together".

# Collaborative filtering (user to user, followed by item to item)

In [18]:
# The parameters here are: matrix name, number of similar users, and userID.
def recommend_sim_user(movie, M, n_user, n_rec, userid):
    
    # A function to make N recommendations based on Pearson Correlation.
    users=[]
    for u in range(0, len(M)):
        if u == userid - 1:
            continue
        cor = pearsonR(M.iloc[userid - 1], M.iloc[u])
        if np.isnan(cor):
            continue
        else:
            users.append((u, cor))
    
    # Sort the table of users descending by similarity
    users.sort(key= lambda tup: tup[1], reverse=True)
    usr = pd.DataFrame(users[:n_user], columns=['User', 'Score'])
    
    # Create new matrix with just the similar user to our user
    M2 = M[M.index.isin(usr.User.values)]
    
    # A function called to create watched list based on userID & append recently watched movie
    watched = has_watched(M, userid)
    watched.append(movie)
    
    # A function to make N recommendations based on Pearson Correlation.
    reviews=[]
    for title in M2.columns:
        if title in watched:
            continue
        cor = pearsonR(M2[movie], M2[title])
        if np.isnan(cor):
            continue
        else:
            reviews.append((title, cor))
    
    # Sort the table of movies descending by similarity
    reviews.sort(key= lambda tup: tup[1], reverse=True)
    rev = pd.DataFrame(reviews[:n_rec], columns=['Title', 'Score'])
    
    return usr, rev

In [151]:
def gethybridrecdf(usr_list, mtx, num_of_user, num_of_rec, model, data_ref):
    
    uid = []
    movt = []
    ratings = []
    scores = []
    trec = []
    simscore = []
    proba = []
    ypred = []

    for u in tqdm(usr_list):
        hst = has_watched(matrix, u)

        for mov in hst:
            s, rg = returnscore(mov, u, data_ref)
            sim_usr, rec = recommend_sim_user(mov, mtx, num_of_user, num_of_rec, u)
            t = rec.Title.values
            sc = rec.Score.values
            for n, m in enumerate(t):
                uid.append(u)
                movt.append(mov)
                ratings.append(rg)
                scores.append(s)
                trec.append(m)
                simscore.append(sc[n])


                X = getx(m, u, data_ref)
                pred = model.predict(X)
                prob = model.predict_proba(X)
                ypred.append(int(pred))
                proba.append(float(prob[:,int(pred)]))

    tempdf = pd.DataFrame()
    tempdf['userId'] = uid
    tempdf['original_title'] = movt
    tempdf['rating'] = ratings
    tempdf['target'] = scores
    tempdf['recommended_title'] = trec
    tempdf['similarity_score'] = simscore
    tempdf['probability_of_pred'] = proba
    tempdf['pred'] = ypred
    return tempdf    

In [20]:
df_ht = gethybridrecdf(trunc_user_list, matrix, 750, num_rec, gbc, data_ref)
df_ht.head(20)

100%|██████████| 50/50 [2:32:49<00:00, 183.39s/it]  


Unnamed: 0,userId,original_title,rating,target,recommended_title,similarity_score,probability_of_pred,pred
0,1,Fools Rush In,4.0,1,Music Box,0.360087,0.699544,1
1,1,Fools Rush In,4.0,1,Shakespeare in Love,0.331499,0.927945,1
2,1,Fools Rush In,4.0,1,Hulk,0.328091,0.799758,1
3,1,Fools Rush In,4.0,1,Le Professionnel,0.311649,0.873384,1
4,1,Fools Rush In,4.0,1,Nell,0.307196,0.991436,1
5,1,License to Wed,4.0,1,The Ewok Adventure,0.322721,0.999588,1
6,1,License to Wed,4.0,1,The Ring,0.277714,0.556047,1
7,1,License to Wed,4.0,1,Man's Favorite Sport?,0.265726,0.551134,0
8,1,License to Wed,4.0,1,Affair in Havana,0.265722,0.54166,1
9,1,License to Wed,4.0,1,Warlords of the 21st Century,0.264543,0.720944,1


In [21]:
print('Accuracy Score: ', accuracy_score(df_ht.target, df_ht.pred))
print('Average Similarity Score: ', df_ht.similarity_score.mean())
print('Average Probability Score: ', df_ht.probability_of_pred.mean(), '\n')

print('Confusion Matrix: ')
pd.crosstab(df_ht.target, df_ht.pred)

Accuracy Score:  0.6407017543859649
Average Similarity Score:  0.3069078300453241
Average Probability Score:  0.7227120266668843 

Confusion Matrix: 


pred,0,1
target,Unnamed: 1_level_1,Unnamed: 2_level_1
0,270,155
1,357,643


In [171]:
print('There is a total of {} recommendations.'.format(len(df_ht)))

com_hlist, hl_com = findallcommon(df_ht.original_title, df_ht.recommended_title, ref)
print('Between the watched list from users and recommended titles, there are {} common genres.'.format(hl_com))

com_hlist.head(20)

There is a total of 1425 recommendations.
Between the watched list from users and recommended titles, there are 993 common genres.


Unnamed: 0,watched_title,watched_genre,recommended_title,recommended_genre,in_commmon_genre
0,Fools Rush In,"[18, 35, 10749, 0, 0, 0, 0, 0]",Music Box,"[80, 18, 10749, 53, 0, 0, 0, 0]","[18, 10749]"
1,Fools Rush In,"[18, 35, 10749, 0, 0, 0, 0, 0]",Shakespeare in Love,"[10749, 36, 0, 0, 0, 0, 0, 0]",[10749]
2,Fools Rush In,"[18, 35, 10749, 0, 0, 0, 0, 0]",Hulk,"[18, 28, 878, 0, 0, 0, 0, 0]",[18]
3,Fools Rush In,"[18, 35, 10749, 0, 0, 0, 0, 0]",Le Professionnel,"[28, 12, 53, 0, 0, 0, 0, 0]",[]
4,Fools Rush In,"[18, 35, 10749, 0, 0, 0, 0, 0]",Nell,"[18, 53, 0, 0, 0, 0, 0, 0]",[18]
5,License to Wed,"[35, 0, 0, 0, 0, 0, 0, 0]",The Ewok Adventure,"[12, 10751, 14, 878, 10770, 0, 0, 0]",[]
6,License to Wed,"[35, 0, 0, 0, 0, 0, 0, 0]",The Ring,"[18, 0, 0, 0, 0, 0, 0, 0]",[]
7,License to Wed,"[35, 0, 0, 0, 0, 0, 0, 0]",Man's Favorite Sport?,"[35, 10749, 0, 0, 0, 0, 0, 0]",[35]
8,License to Wed,"[35, 0, 0, 0, 0, 0, 0, 0]",Affair in Havana,"[18, 80, 0, 0, 0, 0, 0, 0]",[]
9,License to Wed,"[35, 0, 0, 0, 0, 0, 0, 0]",Warlords of the 21st Century,"[878, 0, 0, 0, 0, 0, 0, 0]",[]


In [23]:
df_ht2 = gethybridrecdf(trunc_user_list, matrix, 200, num_rec, gbc, data_ref)
df_ht2.head(20)

100%|██████████| 50/50 [3:35:01<00:00, 258.03s/it]  


Unnamed: 0,userId,original_title,rating,target,recommended_title,similarity_score,probability_of_pred,pred
0,1,Fools Rush In,4.0,1,The French Connection,0.599591,0.99799,1
1,1,Fools Rush In,4.0,1,Gerry,0.570916,0.585654,1
2,1,Fools Rush In,4.0,1,A Scanner Darkly,0.566139,0.975888,1
3,1,Fools Rush In,4.0,1,Music Box,0.54134,0.699544,1
4,1,Fools Rush In,4.0,1,The Pianist,0.510643,0.96393,0
5,1,License to Wed,4.0,1,Memoirs of a Geisha,0.395465,0.918861,1
6,1,License to Wed,4.0,1,A Kiss Before Dying,0.376728,0.721933,0
7,1,License to Wed,4.0,1,Mr. Brooks,0.367852,0.911889,1
8,1,License to Wed,4.0,1,Ask the Dust,0.36757,0.988353,1
9,1,License to Wed,4.0,1,Evil Dead II,0.354849,0.987123,1


In [24]:
print('Accuracy Score: ', accuracy_score(df_ht2.target, df_ht2.pred))
print('Average Similarity Score: ', df_ht2.similarity_score.mean())
print('Average Probability Score: ', df_ht2.probability_of_pred.mean(), '\n')

print('Confusion Matrix: ')
pd.crosstab(df_ht2.target, df_ht2.pred)

Accuracy Score:  0.6097378277153558
Average Similarity Score:  0.432504620259916
Average Probability Score:  0.7219301314164545 

Confusion Matrix: 


pred,0,1
target,Unnamed: 1_level_1,Unnamed: 2_level_1
0,225,150
1,371,589


In [172]:
print('There is a total of {} recommendations.'.format(len(df_ht2)))

com_hlist2, hl_com2 = findallcommon(df_ht2.original_title, df_ht2.recommended_title, ref)
print('Between the watched list from users and recommended titles, there are {} common genres.'.format(hl_com2))
com_hlist2.head(20)

There is a total of 1335 recommendations.
Between the watched list from users and recommended titles, there are 952 common genres.


Unnamed: 0,watched_title,watched_genre,recommended_title,recommended_genre,in_commmon_genre
0,Fools Rush In,"[18, 35, 10749, 0, 0, 0, 0, 0]",The French Connection,"[28, 80, 53, 0, 0, 0, 0, 0]",[]
1,Fools Rush In,"[18, 35, 10749, 0, 0, 0, 0, 0]",Gerry,"[9648, 18, 12, 0, 0, 0, 0, 0]",[18]
2,Fools Rush In,"[18, 35, 10749, 0, 0, 0, 0, 0]",A Scanner Darkly,"[16, 878, 53, 0, 0, 0, 0, 0]",[]
3,Fools Rush In,"[18, 35, 10749, 0, 0, 0, 0, 0]",Music Box,"[80, 18, 10749, 53, 0, 0, 0, 0]","[18, 10749]"
4,Fools Rush In,"[18, 35, 10749, 0, 0, 0, 0, 0]",The Pianist,"[18, 10752, 0, 0, 0, 0, 0, 0]",[18]
5,License to Wed,"[35, 0, 0, 0, 0, 0, 0, 0]",Memoirs of a Geisha,"[18, 36, 10749, 0, 0, 0, 0, 0]",[]
6,License to Wed,"[35, 0, 0, 0, 0, 0, 0, 0]",A Kiss Before Dying,"[18, 53, 80, 9648, 10749, 0, 0, 0]",[]
7,License to Wed,"[35, 0, 0, 0, 0, 0, 0, 0]",Mr. Brooks,"[18, 80, 9648, 53, 0, 0, 0, 0]",[]
8,License to Wed,"[35, 0, 0, 0, 0, 0, 0, 0]",Ask the Dust,"[18, 10749, 0, 0, 0, 0, 0, 0]",[]
9,License to Wed,"[35, 0, 0, 0, 0, 0, 0, 0]",Evil Dead II,"[27, 35, 14, 0, 0, 0, 0, 0]",[35]


In [129]:
df_ht3 = gethybridrecdf(trunc_user_list, matrix, 2000, num_rec, gbc, data_ref)
df_ht3.head(20)

100%|██████████| 50/50 [3:43:38<00:00, 268.37s/it]  


Unnamed: 0,userId,original_title,rating,target,recommended_title,similarity_score,probability_of_pred,pred
0,1,Fools Rush In,4.0,1,Music Box,0.246535,0.699544,1
1,1,Fools Rush In,4.0,1,Hulk,0.206738,0.799758,1
2,1,Fools Rush In,4.0,1,One Flew Over the Cuckoo's Nest,0.204262,0.877566,0
3,1,Fools Rush In,4.0,1,Le Professionnel,0.192837,0.873384,1
4,1,Fools Rush In,4.0,1,The French Connection,0.192679,0.99799,1
5,1,License to Wed,4.0,1,The Ewok Adventure,0.196114,0.999588,1
6,1,License to Wed,4.0,1,Beetlejuice,0.187629,0.974679,1
7,1,License to Wed,4.0,1,Rise of the Zombies,0.176958,0.529446,1
8,1,License to Wed,4.0,1,Warlords of the 21st Century,0.166285,0.720944,1
9,1,License to Wed,4.0,1,Man's Favorite Sport?,0.163935,0.551134,0


In [130]:
print('Accuracy Score: ', accuracy_score(df_ht3.target, df_ht3.pred))
print('Average Similarity Score: ', df_ht3.similarity_score.mean())
print('Average Probability Score: ', df_ht3.probability_of_pred.mean(), '\n')

print('Confusion Matrix: ')
pd.crosstab(df_ht3.target, df_ht3.pred)

Accuracy Score:  0.6701388888888888
Average Similarity Score:  0.22795395218518028
Average Probability Score:  0.7274098608904003 

Confusion Matrix: 


pred,0,1
target,Unnamed: 1_level_1,Unnamed: 2_level_1
0,271,164
1,311,694


In [173]:
print('There is a total of {} recommendations.'.format(len(df_ht3)))

com_hlist3, hl_com3 = findallcommon(df_ht3.original_title, df_ht3.recommended_title, ref)
print('Between the watched list from users and recommended titles there are {} common genres.'.format(hl_com3))
com_hlist3.head(20)

There is a total of 1440 recommendations.
Between the watched list from users and recommended titles there are 1063 common genres.


Unnamed: 0,watched_title,watched_genre,recommended_title,recommended_genre,in_commmon_genre
0,Fools Rush In,"[18, 35, 10749, 0, 0, 0, 0, 0]",Music Box,"[80, 18, 10749, 53, 0, 0, 0, 0]","[18, 10749]"
1,Fools Rush In,"[18, 35, 10749, 0, 0, 0, 0, 0]",Hulk,"[18, 28, 878, 0, 0, 0, 0, 0]",[18]
2,Fools Rush In,"[18, 35, 10749, 0, 0, 0, 0, 0]",One Flew Over the Cuckoo's Nest,"[18, 0, 0, 0, 0, 0, 0, 0]",[18]
3,Fools Rush In,"[18, 35, 10749, 0, 0, 0, 0, 0]",Le Professionnel,"[28, 12, 53, 0, 0, 0, 0, 0]",[]
4,Fools Rush In,"[18, 35, 10749, 0, 0, 0, 0, 0]",The French Connection,"[28, 80, 53, 0, 0, 0, 0, 0]",[]
5,License to Wed,"[35, 0, 0, 0, 0, 0, 0, 0]",The Ewok Adventure,"[12, 10751, 14, 878, 10770, 0, 0, 0]",[]
6,License to Wed,"[35, 0, 0, 0, 0, 0, 0, 0]",Beetlejuice,"[14, 35, 0, 0, 0, 0, 0, 0]",[35]
7,License to Wed,"[35, 0, 0, 0, 0, 0, 0, 0]",Rise of the Zombies,"[28, 27, 53, 0, 0, 0, 0, 0]",[]
8,License to Wed,"[35, 0, 0, 0, 0, 0, 0, 0]",Warlords of the 21st Century,"[878, 0, 0, 0, 0, 0, 0, 0]",[]
9,License to Wed,"[35, 0, 0, 0, 0, 0, 0, 0]",Man's Favorite Sport?,"[35, 10749, 0, 0, 0, 0, 0, 0]",[35]


.
.
.

[![movies](maxresdefault.jpg)](maxresdefault.jpg)