# Recommender Systems by Group 8 in CSDA1040 Class Fall 2019

This work is based on [Movie Recommender Systems on Kaggle](https://www.kaggle.com/rounakbanik/movie-recommender-systems), with the codebase modified for fixes, clarifications, and adaptation to a Dash app.


In [1]:
%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats

from ast import literal_eval
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.metrics.pairwise import linear_kernel, cosine_similarity
from nltk.stem.snowball import SnowballStemmer
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.corpus import wordnet

#Recommender System Library surprise
from surprise import Dataset, SVD
from surprise import Reader

from surprise.model_selection import cross_validate

from collections import defaultdict


# deprecated
# from surprise import evaluate


In [2]:
# reading csv from movie.ipynb output for a cleaned csv based on movies_metadata.csv
md = pd.read_csv('../input/movies_cleaned.csv')
md.head()

Unnamed: 0.1,Unnamed: 0,adult,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,original_title,...,release_date,revenue,runtime,spoken_languages,status,tagline,title,video,vote_average,vote_count
0,0,False,"{'id': 10194, 'name': 'Toy Story Collection', ...",30000000.0,"['Animation', 'Comedy', 'Family']",http://toystory.disney.com/toy-story,862,,en,Toy Story,...,1995-10-30,373554033.0,81.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,,Toy Story,False,7.7,5415.0
1,1,False,,65000000.0,"['Adventure', 'Fantasy', 'Family']",,8844,,en,Jumanji,...,1995-12-15,262797249.0,104.0,"[{'iso_639_1': 'en', 'name': 'English'}, {'iso...",Released,Roll the dice and unleash the excitement!,Jumanji,False,6.9,2413.0
2,2,False,"{'id': 119050, 'name': 'Grumpy Old Men Collect...",0.0,"['Romance', 'Comedy']",,15602,,en,Grumpier Old Men,...,1995-12-22,0.0,101.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Still Yelling. Still Fighting. Still Ready for...,Grumpier Old Men,False,6.5,92.0
3,3,False,,16000000.0,"['Comedy', 'Drama', 'Romance']",,31357,,en,Waiting to Exhale,...,1995-12-22,81452156.0,127.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Friends are the people who let you be yourself...,Waiting to Exhale,False,6.1,34.0
4,4,False,"{'id': 96871, 'name': 'Father of the Bride Col...",0.0,['Comedy'],,11862,,en,Father of the Bride Part II,...,1995-02-10,76578911.0,106.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Just When His World Is Back To Normal... He's ...,Father of the Bride Part II,False,5.7,173.0


## A Simple Top Movies Listing based on different genres
From the previous study, we were able to summarize all movies into 32 different genres. Given a genre, the get_top_chart_by_genre function returns the movies with the highest vote_average. To keep only the more trustworthy ratings, it first keeps the movies whose vote_count lies above a quantile threshold (by default the 0.995 quantile, i.e. the top 0.5% of movies by vote count), and the result is then shown to the end user.
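
The filtering idea described above can be tried on a small made-up table. This is only a sketch: the `toy` DataFrame, its titles, and the 0.5 quantile are illustrative assumptions and do not come from movies_cleaned.csv.

```python
import pandas as pd

# Hypothetical data: five movies with stringified genre lists, as in the cleaned CSV
toy = pd.DataFrame({
    'title': ['A', 'B', 'C', 'D', 'E'],
    'genres': ["['Drama']", "['Comedy', 'Romance']", "['Romance']",
               "['Action']", "['Drama', 'Romance']"],
    'vote_average': [9.5, 7.0, 8.0, 6.5, 7.5],
    'vote_count': [3, 500, 900, 40, 1200],
})

# Keep only movies whose vote_count is strictly above the chosen quantile
threshold = toy['vote_count'].quantile(0.5)
qualified = toy[toy['vote_count'] > threshold].sort_values('vote_average', ascending=False)

# Case-insensitive genre match on the remaining rows
romance = qualified[qualified['genres'].str.contains('romance', case=False, regex=False)]
top_titles = romance['title'].tolist()
print(threshold, top_titles)
```

Movie A has the highest vote_average but only three votes, so the vote-count filter removes it before the ranking is applied.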

In [3]:
import re

def get_top_chart_by_genre(genre, quantile=0.995):
    qualified_df = md[md['vote_count'] > md['vote_count'].quantile(quantile)].sort_values('vote_average', ascending=False)
    genre_filtered_df = qualified_df['genres'].str.contains(genre, flags=re.IGNORECASE, regex=True)
    if genre != '':
        # return qualified_df[genre_filtered_df]
        return qualified_df[genre_filtered_df].index
    else:
        # return qualified_df
        return qualified_df.index
        

In [4]:
idx = get_top_chart_by_genre('')
# idx.shape
qf_df = md[md.index.isin(idx)]
qf_df[['title', 'release_date', 'vote_average', 'vote_count' ]].sort_values('vote_average', ascending=False).head(10)

Unnamed: 0,title,release_date,vote_average,vote_count
828,The Godfather,1972-03-14,8.5,6024.0
313,The Shawshank Redemption,1994-09-23,8.5,8358.0
521,Schindler's List,1993-11-29,8.3,4436.0
12421,The Dark Knight,2008-07-16,8.3,12269.0
2198,Life Is Beautiful,1997-12-20,8.3,3643.0
2828,Fight Club,1999-10-15,8.3,9678.0
23496,Whiplash,2014-10-10,8.3,4376.0
291,Pulp Fiction,1994-09-10,8.3,8670.0
5453,Spirited Away,2001-07-20,8.3,3968.0
350,Forrest Gump,1994-07-06,8.2,8147.0


In [5]:
idx = get_top_chart_by_genre('Romance')
# idx.shape
qf_df = md[md.index.isin(idx)]
qf_df[['title', 'release_date', 'vote_average', 'vote_count' ]].sort_values('vote_average', ascending=False).head(10)

Unnamed: 0,title,release_date,vote_average,vote_count
350,Forrest Gump,1994-07-06,8.2,8147.0
7168,Eternal Sunshine of the Spotless Mind,2004-03-19,7.9,3758.0
22003,Her,2013-12-18,7.9,4215.0
40458,La La Land,2016-11-29,7.9,4745.0
23337,The Fault in Our Stars,2014-05-16,7.6,3868.0
1628,Titanic,1997-11-18,7.5,7770.0
2165,Edward Scissorhands,1990-12-05,7.5,3731.0
580,Aladdin,1992-11-25,7.4,3495.0
20762,The Great Gatsby,2013-05-10,7.3,3885.0
19598,Silver Linings Playbook,2012-09-08,7.0,4840.0


## Content Based Recommender System
The previous method only shows the movies that are top rated across all voters. However, we want recommendations that are tailored to an individual user, so we will try several ways of recommending movies to our end users. First, we will try to find movies that are similar to one another.

## Recommender System based on text mining of Movie Descriptor
We will suggest movies based on the descriptive text provided in the database (the tagline and the overview of each movie), and use this information to find the best matches for the end user.

In [6]:
md['tagline'] = md['tagline'].fillna('')
md['overview'] = md['overview'].fillna('')
md['keywords'] = md['tagline'] + ' ' + md['overview']

In [7]:
# We build a Term Frequency (TF)-Inverse Document Frequency (IDF) representation of the keywords using the scikit-learn library
# https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html

# min_df=1 (the default) keeps every term; recent scikit-learn releases may reject an integer min_df of 0
vectorizer = TfidfVectorizer(analyzer='word', ngram_range=(1, 2), min_df=1, stop_words='english')
X = vectorizer.fit_transform(md['keywords'])

# Show TF-IDF Vectorizer properties
# print('Shape: ')
# print(X.shape)
# print('Feature Names: ')
# print(vectorizer.get_feature_names_out())

# Next, we build a lookup matrix that holds the pairwise similarity score for all movies in the database.
# TF-IDF rows are L2-normalized by default, so the linear kernel (dot product) equals the cosine similarity.
# Note: this name shadows the cosine_similarity function imported from sklearn in cell [1].
cosine_similarity = linear_kernel(X, X)
# Show cosine_similarity characteristics
# cosine_similarity.shape
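
As a small check of the approach, the sketch below runs the same vectorizer settings on three made-up descriptions (they are assumptions, not rows from the dataset) and confirms that `linear_kernel` and scikit-learn's `cosine_similarity` agree on TF-IDF output.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity as sk_cosine, linear_kernel

# Hypothetical movie descriptions
docs = [
    'a caped hero protects the city at night',
    'the caped hero returns to save the city',
    'two friends open a bakery in paris',
]

tfidf = TfidfVectorizer(analyzer='word', ngram_range=(1, 2), stop_words='english')
X_toy = tfidf.fit_transform(docs)

# Rows are L2-normalized, so the plain dot product is already the cosine similarity
sim = linear_kernel(X_toy, X_toy)
same = bool(np.allclose(sim, sk_cosine(X_toy, X_toy)))

# Position of the description most similar to the first one, skipping itself
closest_to_first = int(np.argsort(sim[0])[::-1][1])
print(same, closest_to_first)
```

The two superhero descriptions share terms such as "caped hero" and "city", so they score higher with each other than with the bakery description.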

In [8]:

def get_recommended_movies_by_title(title, n=30):
    """Return the index positions of the n movies most similar to `title` ([] if the title is not found)."""
    matches = md.index[md['title'] == title]
    if matches.empty:
        return []

    # Use the first match when several movies share the same title
    idx = int(matches[0])

    # Slice the cosine_similarity matrix for this specific title and
    # pair each score with the position of the movie it refers to
    scores = list(enumerate(cosine_similarity[idx]))
    scores.sort(key=lambda x: x[1], reverse=True)

    # Skip the first entry, which is normally the movie itself
    movies_id = [i for i, _ in scores[1:n + 1]]
    return movies_id
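
The ranking step can also be written with `numpy.argsort`. The sketch below uses a hand-written 4x4 similarity matrix, and `top_n_similar` is a hypothetical helper that is not part of the notebook. It excludes the query row explicitly instead of assuming it sorts first.

```python
import numpy as np

def top_n_similar(sim_matrix, idx, n=2):
    """Hypothetical helper: positions of the n rows most similar to row idx, excluding idx."""
    order = np.argsort(sim_matrix[idx])[::-1]  # highest similarity first
    return [int(i) for i in order if i != idx][:n]

# Made-up symmetric similarity matrix for four movies
toy_sim = np.array([
    [1.0, 0.2, 0.7, 0.0],
    [0.2, 1.0, 0.1, 0.5],
    [0.7, 0.1, 1.0, 0.3],
    [0.0, 0.5, 0.3, 1.0],
])

result = top_n_similar(toy_sim, 0)
print(result)
```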

In [9]:
movie_lst = get_recommended_movies_by_title('Family Business')
rec_df = md[md.index.isin(movie_lst)]
rec_df.head(3)

Unnamed: 0.1,Unnamed: 0,adult,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,original_title,...,revenue,runtime,spoken_languages,status,tagline,title,video,vote_average,vote_count,keywords
795,801,False,,0.0,"['Comedy', 'Foreign']",,9098,,de,Echte Kerle,...,0.0,100.0,"[{'iso_639_1': 'de', 'name': 'Deutsch'}]",Released,Macho cop finds himself in a relationship with...,Regular Guys,False,5.2,9.0,Macho cop finds himself in a relationship with...
3909,3928,False,"{'id': 107469, 'name': 'Save The Last Dance Co...",13000000.0,"['Drama', 'Family', 'Romance', 'Music']",,9816,,en,Save the Last Dance,...,91038276.0,112.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,The Only Person You Need To Be Is Yourself.,Save the Last Dance,False,6.3,359.0,The Only Person You Need To Be Is Yourself. A ...
3984,4004,False,,0.0,"['Comedy', 'Fantasy']",,2608,,en,Maid to Order,...,9868521.0,93.0,"[{'iso_639_1': 'es', 'name': 'Español'}, {'iso...",Released,She was raised in a Beverly Hills mansion. Now...,Maid to Order,False,5.2,17.0,She was raised in a Beverly Hills mansion. Now...


In [10]:
movie_lst = get_recommended_movies_by_title('Batman Forever')
rec_df = md[md.index.isin(movie_lst)]
rec_df.head(3)

Unnamed: 0.1,Unnamed: 0,adult,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,original_title,...,revenue,runtime,spoken_languages,status,tagline,title,video,vote_average,vote_count,keywords
584,585,False,"{'id': 120794, 'name': 'Batman Collection', 'p...",35000000.0,"['Fantasy', 'Action']",,268,,en,Batman,...,411348924.0,126.0,"[{'iso_639_1': 'en', 'name': 'English'}, {'iso...",Released,Have you ever danced with the devil in the pal...,Batman,False,7.0,2145.0,Have you ever danced with the devil in the pal...
1321,1328,False,"{'id': 120794, 'name': 'Batman Collection', 'p...",80000000.0,"['Action', 'Fantasy']",,364,,en,Batman Returns,...,280000000.0,126.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,"The Bat, the Cat, the Penguin.",Batman Returns,False,6.6,1706.0,"The Bat, the Cat, the Penguin. Having defeated..."
1482,1491,False,"{'id': 120794, 'name': 'Batman Collection', 'p...",125000000.0,"['Action', 'Crime', 'Fantasy']",,415,,en,Batman & Robin,...,238207122.0,125.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Strength. Courage. Honor. And loyalty.,Batman & Robin,False,4.2,1447.0,Strength. Courage. Honor. And loyalty. Along w...


## Collaborative Filtering

[Surprise - FAQ How to get the top-N recommendations for each user](https://surprise.readthedocs.io/en/stable/FAQ.html#how-to-get-the-top-n-recommendations-for-each-user)



In [11]:
rating_df = pd.read_csv('../input/ratings_small.csv')
reader = Reader()
algo = SVD()

# The columns must correspond to user id, item id and ratings (in that order).
data = Dataset.load_from_df(rating_df[['userId', 'movieId', 'rating']], reader)

# Train the algorithm on the full dataset
trainset = data.build_full_trainset()
algo.fit(trainset)

# Predict a rating for every (user, item) pair that is absent from the training set
testset = trainset.build_anti_testset()
predictions = algo.test(testset)
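
For intuition about what the fitted model predicts: with its default biased setting, Surprise's SVD estimates a rating as the global mean plus a user bias, an item bias, and the dot product of the item and user factor vectors. Terms that involve an unknown user or item are dropped. The sketch below reproduces that arithmetic in NumPy. `svd_estimate` and all numbers in it are illustrative assumptions, not values learned from ratings_small.csv.

```python
import numpy as np

def svd_estimate(mu, b_u=None, p_u=None, b_i=None, q_i=None):
    """Illustrative biased matrix-factorization estimate; None marks an unknown user or item."""
    est = mu
    if b_u is not None:
        est += b_u                      # user bias
    if b_i is not None:
        est += b_i                      # item bias
    if p_u is not None and q_i is not None:
        est += float(np.dot(q_i, p_u))  # interaction of item and user factors
    return est

mu = 3.5  # made-up global mean rating
full = svd_estimate(mu, b_u=0.2, p_u=np.array([0.5, -0.2]),
                    b_i=-0.1, q_i=np.array([0.4, 0.1]))
cold = svd_estimate(mu)  # neither the user nor the item is known
print(round(full, 2), cold)
```

When neither the user nor the item is known, the estimate is the global mean alone, so every such pair receives the same predicted rating.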

In [12]:
# We can now use this dataset as we please, e.g. calling cross_validate
cross_validate(algo, data, measures=['RMSE', 'MAE'], cv=5, verbose=True)

Evaluating RMSE, MAE of algorithm SVD on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    0.8984  0.8965  0.8918  0.8994  0.8954  0.8963  0.0027  
MAE (testset)     0.6891  0.6907  0.6882  0.6912  0.6903  0.6899  0.0011  
Fit time          3.60    3.63    3.64    3.63    3.63    3.63    0.01    
Test time         0.11    0.10    0.10    0.10    0.10    0.10    0.00    


{'test_rmse': array([0.89840585, 0.89653512, 0.89176526, 0.89940498, 0.89537261]),
 'test_mae': array([0.68913586, 0.69066048, 0.68815402, 0.69123848, 0.6902898 ]),
 'fit_time': (3.603806972503662,
  3.628370761871338,
  3.642861843109131,
  3.634511947631836,
  3.630908250808716),
 'test_time': (0.10542988777160645,
  0.10388422012329102,
  0.10273623466491699,
  0.10292220115661621,
  0.10204815864562988)}
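
For reference, the two measures reported above can be computed by hand. The sketch below uses four made-up true and estimated ratings (assumptions for illustration only): RMSE is the square root of the mean squared error, and MAE is the mean absolute error.

```python
import numpy as np

# Hypothetical true ratings and model estimates
true_r = np.array([4.0, 3.0, 5.0, 2.0])
est_r = np.array([3.5, 3.0, 4.0, 3.0])

errors = est_r - true_r
rmse = float(np.sqrt(np.mean(errors ** 2)))  # penalizes large errors more heavily
mae = float(np.mean(np.abs(errors)))         # average size of the error
print(rmse, mae)
```

RMSE is never smaller than MAE on the same predictions, which matches the pattern in the cross-validation table.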

In [13]:
from surprise.model_selection import KFold
from surprise import accuracy

kf = KFold(n_splits=3)

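# Note: this loop rebinds trainset, testset and predictions and refits algo,
# so later cells see the results of the last fold rather than the full-data fit above.
for trainset, testset in kf.split(data):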

    # train and test algorithm.
    algo.fit(trainset)
    predictions = algo.test(testset)

    # Compute and print Root Mean Squared Error
    accuracy.rmse(predictions, verbose=True)

RMSE: 0.9088
RMSE: 0.9005
RMSE: 0.9035


In [14]:
def get_top_n(predictions, n=10):
    '''Return the top-N recommendation for each user from a set of predictions.

    Args:
        predictions(list of Prediction objects): The list of predictions, as
            returned by the test method of an algorithm.
        n(int): The number of recommendation to output for each user. Default
            is 10.

    Returns:
    A dict where keys are user (raw) ids and values are lists of tuples:
        [(raw item id, rating estimation), ...] of size n.
    '''

    # First map the predictions to each user.
    top_n = defaultdict(list)
    for uid, iid, true_r, est, _ in predictions:
        top_n[uid].append((iid, est))

    # Then sort the predictions for each user and retrieve the n highest ones.
    for uid, user_ratings in top_n.items():
        user_ratings.sort(key=lambda x: x[1], reverse=True)
        top_n[uid] = user_ratings[:n]

    return top_n
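
The grouping and sorting logic does not depend on Surprise itself, so it can be exercised with plain 5-tuples. The sketch below uses a compact re-implementation named `top_n_per_user` (a hypothetical name) and made-up predictions in the same field order as above: user id, item id, true rating, estimate, details.

```python
from collections import defaultdict

def top_n_per_user(predictions, n=2):
    """Compact variant of the function above: group by user, keep the n best estimates."""
    grouped = defaultdict(list)
    for uid, iid, _true_r, est, _details in predictions:
        grouped[uid].append((iid, est))
    return {
        uid: sorted(items, key=lambda x: x[1], reverse=True)[:n]
        for uid, items in grouped.items()
    }

# Hypothetical predictions for two users
toy_predictions = [
    (1, 10, None, 3.2, {}),
    (1, 11, None, 4.8, {}),
    (1, 12, None, 4.1, {}),
    (2, 10, None, 2.5, {}),
]

result = top_n_per_user(toy_predictions)
print(result)
```

A user with fewer than n predictions simply gets a shorter list, which is why some users in the printed output have fewer than ten items.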

In [15]:
top_n = get_top_n(predictions, n=10)

# Print the recommended items for each user
for uid, user_ratings in top_n.items():
    print(uid, [iid for (iid, _) in user_ratings])


2, 2905, 720, 2678, 215, 4271, 117, 3718, 4278, 334]
159 [296, 2959, 4226, 4973, 913, 1136, 260, 2858, 1198, 2329]
649 [307, 593, 527, 457, 319, 235, 364, 293, 32, 417]
96 [908, 923, 50, 912, 969, 1212, 56367, 527, 48394, 7361]
270 [50, 4226, 7361, 56367, 1704, 48780, 109487, 48516, 79132, 4973]
174 [1094, 1380, 342, 48]
395 [110, 3471, 2916, 2640, 2023, 1206, 2240]
621 [4993, 1136, 1704, 356, 3671, 2918, 8961, 2571, 914, 3101]
168 [296, 912, 589, 32, 110, 150, 924, 509, 1, 514]
532 [5952, 6377, 1097, 1262, 110, 733, 2268, 5377, 480, 2151]
576 [260, 2571, 1036, 1960, 1270, 1249, 1537, 4034, 480, 2006]
192 [527, 356, 457, 551, 364, 595, 474, 316, 2, 587]
402 [858, 1196, 1035, 1198, 78499, 31658, 5618, 5952, 6539, 76093]
94 [7153, 2692, 1233, 55820, 78499, 1222, 50, 110, 858, 6016]
617 [527, 1225, 457, 919, 356, 2572, 1784, 1387, 480, 1997]
562 [50, 527, 778, 1210, 44191, 1213, 4993, 5989, 3578, 4306]
569 [913, 912, 908, 1225, 1358, 1247, 2359, 920, 2804, 931]
615 [593, 2858, 7502, 58559

In [16]:
top_n[515]

[(48516, 4.602314507027896),
 (70286, 4.5372254958182925),
 (2804, 4.38298495713748),
 (3386, 4.350281306956536),
 (3246, 4.318912263418754),
 (3275, 4.2379702891274995),
 (8827, 4.0432662888209245),
 (53125, 3.796329959206197),
 (169, 3.768941592111123),
 (273, 3.7406592903604063)]

## Hybrid Recommender
In the hybrid recommender, we combine the content-based filter and collaborative filtering into a single recommender. First, we generate a list of movies that are similar to a movie title the user selected. Then, we predict the rating this user would give each movie in that list, and rank the list by predicted rating.

In [21]:
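def get_user_based_recommedation_by_title(uid, title):
    movie_lst = get_recommended_movies_by_title(title)
    # .copy() avoids pandas' SettingWithCopyWarning when the new column is added
    rec_df = md[md.index.isin(movie_lst)][['id', 'title']].copy()
    # Caveat: md['id'] looks like a TMDB id (Toy Story is 862), whereas the ratings are keyed by movieId;
    # if the two id schemes differ, algo.predict falls back to a baseline estimate for unknown items.
    rec_df['est_rating'] = rec_df['id'].apply(lambda x: algo.predict(uid, x).est)
    rec_df = rec_df.sort_values('est_rating', ascending=False)
    # uncomment to show the estimated rating for the df
    # print(rec_df)
    return rec_df.id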
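
The re-ranking step can be illustrated without a trained model. In the sketch below, `predict_rating`, `toy_estimates`, and the candidate table are all hypothetical stand-ins: the stub plays the role of `algo.predict(uid, x).est`, returning a stored estimate when it has one and a made-up global mean otherwise.

```python
import pandas as pd

# Hypothetical content-based candidates for one selected title
candidates = pd.DataFrame({'id': [101, 102, 103], 'title': ['X', 'Y', 'Z']})

# Stub for the collaborative-filtering model (assumed values)
toy_estimates = {101: 3.1, 102: 4.4}
global_mean = 3.5

def predict_rating(uid, movie_id):
    """Stand-in predictor: known items get their estimate, unknown items the global mean."""
    return toy_estimates.get(movie_id, global_mean)

# Attach a predicted rating for user 1 and rank the candidates by it
ranked = candidates.assign(
    est_rating=candidates['id'].apply(lambda m: predict_rating(1, m))
).sort_values('est_rating', ascending=False)

ranked_titles = ranked['title'].tolist()
print(ranked_titles)
```

Using `assign` returns a new DataFrame, so the candidate table is left unchanged and no chained-assignment warning is triggered.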

In [23]:
movie_lst = get_user_based_recommedation_by_title(800,'The Matrix')
rec_df = md[md.id.isin(movie_lst)]
rec_df[['title', 'vote_count', 'vote_average', 'id']]

id                                              title  est_rating
167     10428                                            Hackers    3.548695
602      8766                              Hellraiser: Bloodline    3.548695
43670   24959                                            Program    3.548695
43669   24914                                        Kid's Story    3.548695
43665   24960                                       World Record    3.548695
43663   24660                                  A Detective Story    3.548695
41235   29333                             Murder in a Blue World    3.548695
41127  210052                                        Wolfskinder    3.548695
34933    1877                                     Doctor Mordrid    3.548695
33545  281826                                          Algorithm    3.548695
29990   51604                                     The Rain Fairy    3.548695
28419  265712                               Stand by Me Doraemon    3.548695
27294   34

Unnamed: 0,title,vote_count,vote_average,id
167,Hackers,406.0,6.2,10428
602,Hellraiser: Bloodline,111.0,4.9,8766
1340,Sneakers,301.0,6.7,2322
3057,Supernova,109.0,4.9,10384
6478,Commando,753.0,6.4,10999
9110,Takedown,56.0,6.3,10429
9323,The Animatrix,433.0,6.9,55931
10758,Pulse,154.0,5.0,9682
12756,War Games: The Dead Code,47.0,5.1,14154
13821,The Inhabited Island,24.0,5.4,16911


In [22]:
movie_lst = get_user_based_recommedation_by_title(1,'Avatar')
rec_df = md[md.id.isin(movie_lst)]
rec_df[['title', 'vote_count', 'vote_average', 'id']]

id                                       title  est_rating
2444      603                                  The Matrix    2.943148
602      8766                       Hellraiser: Bloodline    2.837423
35368  132873                        Die Mondverschwörung    2.837423
35336  111480               OSS 117: Mission for a Killer    2.837423
33339  334394                                      Baskin    2.837423
32657  331592                                   Listening    2.837423
29891   13179                                 Tinker Bell    2.837423
29747   61803                                   Shakedown    2.837423
28419  265712                        Stand by Me Doraemon    2.837423
27300   48841           Vengeance: The Story of Tony Cimo    2.837423
27270  196255                             Gregory Go Boom    2.837423
23454   89187                              Beware of Pity    2.837423
22600  422906                       The Wizard of Baghdad    2.837423
21420   49802              The 

Unnamed: 0,title,vote_count,vote_average,id
602,Hellraiser: Bloodline,111.0,4.9,8766
2444,The Matrix,9079.0,7.9,603
3057,Supernova,109.0,4.9,10384
3517,Pandora and the Flying Dutchman,19.0,6.5,38688
3633,House Party 2,22.0,4.7,16096
3638,Project Moon Base,4.0,2.5,26270
4590,Jeepers Creepers,731.0,6.1,8922
6044,Tears of the Sun,582.0,6.4,9567
6378,Lara Croft Tomb Raider: The Cradle of Life,1443.0,5.5,1996
9028,Fetishes,10.0,5.1,63054
