# Lecture 9 - Machine Learning (3) - Recommendation System

* A recommendation system (recommender system) is an information filtering system that seeks to predict the "rating" or "preference" a user would give to an item. Such systems are used primarily in commercial applications (https://en.wikipedia.org/wiki/Recommender_system).

* There are two common types of recommender systems:
    * **Content-Based Filtering** focuses on the attributes of the items and gives recommendations based on the similarity between them.
    * **Collaborative Filtering** produces recommendations based on users' attitudes toward items, as revealed by their activity (for example, ratings).
    
* In this lecture, you will see basic examples of recommendation systems in Python.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

%matplotlib inline

* Movie recommendation is a common first step in learning about recommendation systems.
* The MovieLens dataset is a well-known dataset for learning to build recommendation systems.
    * https://grouplens.org/datasets/movielens/
    * https://kaggle.com/grouplens/movielens-20m-dataset

In [3]:
movies = pd.read_csv('./movies.csv')

In [5]:
movies

,movieId,title,genres
0,1,Toy Story (1995),Adventure Animation Children Comedy Fantasy
1,2,Jumanji (1995),Adventure Children Fantasy
2,3,Grumpier Old Men (1995),Comedy Romance
3,4,Waiting to Exhale (1995),Comedy Drama Romance
4,5,Father of the Bride Part II (1995),Comedy
...,...,...,...
9737,193581,Black Butler: Book of the Atlantic (2017),Action Animation Comedy Fantasy
9738,193583,No Game No Life: Zero (2017),Animation Comedy Fantasy
9739,193585,Flint (2017),Drama
9740,193587,Bungo Stray Dogs: Dead Apple (2018),Action Animation


* Let's build a content-based filter based on genre similarity.

In [7]:
vectorizer = CountVectorizer()
genre_vec = vectorizer.fit_transform(movies['genres'])

In [9]:
genre_vec.toarray()

array([[0, 1, 1, ..., 0, 0, 0],
       [0, 1, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [1, 0, 1, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]])
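
* To see what the vectorizer produces, here is a minimal sketch on three genre strings copied from the table above. The names `toy_genres`, `toy_vectorizer` and `toy_vec` are made up for this illustration. Each column is one lowercased token, in alphabetical order.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy genre strings in the same space-separated format as movies['genres'].
toy_genres = [
    "Adventure Animation Children Comedy Fantasy",  # Toy Story (1995)
    "Adventure Children Fantasy",                   # Jumanji (1995)
    "Comedy Romance",                               # Grumpier Old Men (1995)
]

toy_vectorizer = CountVectorizer()
toy_vec = toy_vectorizer.fit_transform(toy_genres)

# Column order: tokens are lowercased and indexed alphabetically.
print(sorted(toy_vectorizer.vocabulary_))
# One row per movie, one column per token; 1 means the genre is present.
print(toy_vec.toarray())

# Caveat: the default tokenizer splits on punctuation, so a hyphenated
# genre label such as "Sci-Fi" becomes two separate tokens.
print(sorted(CountVectorizer().fit(["Sci-Fi"]).vocabulary_))
```

* If the two-token split matters for your data, one option is to pass a custom `token_pattern` or `tokenizer` to `CountVectorizer`; whether that is needed depends on how the genre column was prepared.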

In [11]:
sim_mat = cosine_similarity(genre_vec, genre_vec)

In [13]:
sim_mat

array([[1.        , 0.77459667, 0.31622777, ..., 0.        , 0.31622777,
        0.4472136 ],
       [0.77459667, 1.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.31622777, 0.        , 1.        , ..., 0.        , 0.        ,
        0.70710678],
       ...,
       [0.        , 0.        , 0.        , ..., 1.        , 0.        ,
        0.        ],
       [0.31622777, 0.        , 0.        , ..., 0.        , 1.        ,
        0.        ],
       [0.4472136 , 0.        , 0.70710678, ..., 0.        , 0.        ,
        1.        ]])
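
* The entry 0.77459667 for Toy Story and Jumanji can be checked by hand. Cosine similarity is the dot product of two vectors divided by the product of their lengths. The sketch below uses a reduced set of six genre columns (an assumption made for brevity; the real matrix has more columns, but the extra columns are 0 for both movies and do not change the result). The helper `cosine` is written only for this illustration.

```python
import numpy as np

# Binary genre vectors over the columns
# [adventure, animation, children, comedy, fantasy, romance].
toy_story = np.array([1, 1, 1, 1, 1, 0])
jumanji = np.array([1, 0, 1, 0, 1, 0])


def cosine(a, b):
    """Cosine similarity: dot product divided by the product of the norms."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


# 3 shared genres / (sqrt(5) * sqrt(3)) = 3 / sqrt(15)
print(round(cosine(toy_story, jumanji), 8))  # 0.77459667
```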

In [15]:
genre_sim = pd.DataFrame(index=movies['title'], columns=movies['title'], data=sim_mat)

In [21]:
genre_sim['Toy Story (1995)'].sort_values(ascending=False).head(20)

title
Toy Story (1995)                                                     1.000000
Moana (2016)                                                         1.000000
Adventures of Rocky and Bullwinkle, The (2000)                       1.000000
The Good Dinosaur (2015)                                             1.000000
Antz (1998)                                                          1.000000
Asterix and the Vikings (Astérix et les Vikings) (2006)              1.000000
Emperor's New Groove, The (2000)                                     1.000000
Toy Story 2 (1999)                                                   1.000000
Shrek the Third (2007)                                               1.000000
Turbo (2013)                                                         1.000000
Wild, The (2006)                                                     1.000000
Monsters, Inc. (2001)                                                1.000000
Tale of Despereaux, The (2008)                            
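
* The sorted list always starts with the query movie itself, and many movies tie at 1.0 because their genre sets are identical. A small helper can drop the query title before ranking. This is a sketch on toy data: `recommend_by_genre`, `toy_movies` and `toy_sim` are hypothetical names, and it assumes every title is unique in the index.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Four rows in the same format as the movies dataframe.
toy_movies = pd.DataFrame({
    "title": ["Toy Story (1995)", "Jumanji (1995)",
              "Grumpier Old Men (1995)", "Antz (1998)"],
    "genres": ["Adventure Animation Children Comedy Fantasy",
               "Adventure Children Fantasy",
               "Comedy Romance",
               "Adventure Animation Children Comedy Fantasy"],
})

toy_vec = CountVectorizer().fit_transform(toy_movies["genres"])
toy_sim = pd.DataFrame(cosine_similarity(toy_vec),
                       index=toy_movies["title"],
                       columns=toy_movies["title"])


def recommend_by_genre(title, sim, n=10):
    """Return the n most genre-similar titles, excluding the query itself."""
    scores = sim[title].drop(labels=title)
    return scores.sort_values(ascending=False).head(n)


print(recommend_by_genre("Toy Story (1995)", toy_sim, n=2))
```

* Genre similarity alone cannot rank the tied movies against each other; a secondary key such as average rating would be needed for that.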

* Now, let's import one more dataset, "ratings.csv", and build a collaborative filter.

In [23]:
ratings = pd.read_csv('./ratings.csv')

In [25]:
ratings.head()

,userId,movieId,rating,timestamp
0,1,1,4.0,964982703
1,1,3,4.0,964981247
2,1,6,4.0,964982224
3,1,47,5.0,964983815
4,1,50,5.0,964982931


* Let's merge those two dataframes.

In [29]:
df_movies = pd.merge(ratings, movies, on='movieId')
df_movies

,userId,movieId,rating,timestamp,title,genres
0,1,1,4.0,964982703,Toy Story (1995),Adventure Animation Children Comedy Fantasy
1,5,1,4.0,847434962,Toy Story (1995),Adventure Animation Children Comedy Fantasy
2,7,1,4.5,1106635946,Toy Story (1995),Adventure Animation Children Comedy Fantasy
3,15,1,2.5,1510577970,Toy Story (1995),Adventure Animation Children Comedy Fantasy
4,17,1,4.5,1305696483,Toy Story (1995),Adventure Animation Children Comedy Fantasy
...,...,...,...,...,...,...
100831,610,160341,2.5,1479545749,Bloodmoon (1997),Action Thriller
100832,610,160527,4.5,1479544998,Sympathy for the Underdog (1971),Action Crime Drama
100833,610,160836,3.0,1493844794,Hazard (2005),Action Drama Thriller
100834,610,163937,3.5,1493848789,Blair Witch (2016),Horror Thriller
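
* `pd.merge` performs an inner join by default, so a rating whose `movieId` has no match in the movies table is silently dropped. The sketch below shows the difference on toy data; the `movieId` 999 and the `toy_*` names are invented for the illustration.

```python
import pandas as pd

toy_ratings = pd.DataFrame({
    "userId": [1, 1, 2],
    "movieId": [1, 3, 999],   # 999 has no entry in toy_movies
    "rating": [4.0, 4.0, 5.0],
})
toy_movies = pd.DataFrame({
    "movieId": [1, 2, 3],
    "title": ["Toy Story (1995)", "Jumanji (1995)", "Grumpier Old Men (1995)"],
})

# Default how='inner': only movieIds present in both tables survive.
inner = pd.merge(toy_ratings, toy_movies, on="movieId")
# how='left': every rating is kept; unmatched titles become NaN.
left = pd.merge(toy_ratings, toy_movies, on="movieId", how="left")

print(len(inner), len(left))  # 2 3
```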


* Now, reshape the dataframe using pivot_table so that each row is a movie title and each column is a user.

In [31]:
movie_user_matrix = df_movies.pivot_table(index='title', columns='userId', values='rating')
movie_user_matrix.head()

userId,1,2,3,4,5,6,7,8,9,10,...,601,602,603,604,605,606,607,608,609,610
title,,,,,,,,,,,,,,,,,,,,,
'71 (2014),,,,,,,,,,,...,,,,,,,,,,4.0
'Hellboy': The Seeds of Creation (2004),,,,,,,,,,,...,,,,,,,,,,
'Round Midnight (1986),,,,,,,,,,,...,,,,,,,,,,
'Salem's Lot (2004),,,,,,,,,,,...,,,,,,,,,,
'Til There Was You (1997),,,,,,,,,,,...,,,,,,,,,,
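
* The reshaping is easier to follow on a tiny table. In this sketch (the `toy` data is made up), three long-format rating rows become a 2 x 2 title-by-user matrix, and the one missing combination shows up as NaN. Note that `pivot_table` averages by default if the same (title, userId) pair occurs more than once.

```python
import pandas as pd

# Long format: one row per (user, movie) rating.
toy = pd.DataFrame({
    "userId": [1, 2, 2],
    "title": ["Toy Story (1995)", "Toy Story (1995)", "Jumanji (1995)"],
    "rating": [4.0, 5.0, 3.0],
})

# Wide format: rows are titles, columns are users, cells are ratings.
toy_matrix = toy.pivot_table(index="title", columns="userId", values="rating")
print(toy_matrix)
# User 1 never rated Jumanji, so that cell is NaN.
```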


* Fill the NaN values with 0.

In [33]:
movie_user_matrix.fillna(0, inplace=True)
movie_user_matrix

userId,1,2,3,4,5,6,7,8,9,10,...,601,602,603,604,605,606,607,608,609,610
title,,,,,,,,,,,,,,,,,,,,,
'71 (2014),0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4.0
'Hellboy': The Seeds of Creation (2004),0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
'Round Midnight (1986),0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
'Salem's Lot (2004),0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
'Til There Was You (1997),0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
eXistenZ (1999),0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,5.0,0.0,0.0,0.0,0.0,4.5,0.0,0.0
xXx (2002),0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,3.5,0.0,2.0
xXx: State of the Union (2005),0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.5
¡Three Amigos! (1986),4.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
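
* Filling with 0 treats "not rated" as if it were a rating below the lowest real score, which is a simplification. One common alternative, shown here only as a sketch and not as the method used in this lecture, is to subtract each movie's mean rating first, so that an unrated cell becomes a neutral 0. The `toy_matrix` values are invented.

```python
import numpy as np
import pandas as pd

# Rows are movies, columns are users; NaN means "not rated".
toy_matrix = pd.DataFrame(
    {1: [5.0, 4.0], 2: [3.0, np.nan], 3: [np.nan, 2.0]},
    index=["Movie A", "Movie B"],
)

# Approach used in this lecture: unrated cells become 0.
filled = toy_matrix.fillna(0)

# Alternative: center each movie on its own mean, then fill with 0,
# so 0 means "no information" rather than "very low rating".
movie_means = toy_matrix.mean(axis=1)  # NaN cells are skipped
centered = toy_matrix.sub(movie_means, axis=0).fillna(0)

print(filled)
print(centered)
```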


* Compute the cosine similarity between movies (the rows of the matrix).

In [35]:
item_based_filter = cosine_similarity(movie_user_matrix)
item_based_filter

array([[1.        , 0.        , 0.        , ..., 0.32732684, 0.        ,
        0.        ],
       [0.        , 1.        , 0.70710678, ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.70710678, 1.        , ..., 0.        , 0.        ,
        0.        ],
       ...,
       [0.32732684, 0.        , 0.        , ..., 1.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 1.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        1.        ]])
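
* `cosine_similarity` compares the rows of its input. Because the rows here are movies, the result is item-to-item similarity; transposing the matrix would give user-to-user similarity instead. A minimal sketch on invented data (3 movies, 2 users):

```python
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

# Rows are movies, columns are users, 0 means "not rated".
toy_matrix = pd.DataFrame(
    [[5.0, 0.0],
     [4.0, 0.0],
     [0.0, 3.0]],
    index=["Movie A", "Movie B", "Movie C"],
    columns=[1, 2],
)

# Rows are movies -> 3 x 3 item-item similarity.
toy_item_sim = cosine_similarity(toy_matrix)
# Transposed, rows are users -> 2 x 2 user-user similarity.
toy_user_sim = cosine_similarity(toy_matrix.T)

print(toy_item_sim.shape, toy_user_sim.shape)  # (3, 3) (2, 2)
```

* Movies A and B were rated only by user 1, so their vectors point in the same direction and their similarity is 1 even though the ratings differ (5 vs. 4); cosine similarity ignores vector length.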

In [38]:
item_sim = pd.DataFrame(index=movie_user_matrix.index, columns=movie_user_matrix.index, data=item_based_filter)

In [46]:
item_sim['(500) Days of Summer (2009)'].sort_values(ascending=False).head(20)

title
(500) Days of Summer (2009)                1.000000
Silver Linings Playbook (2012)             0.528581
Up in the Air (2009)                       0.482924
Adventureland (2009)                       0.480897
50/50 (2011)                               0.472387
Toy Story 3 (2010)                         0.462940
Crazy, Stupid, Love. (2011)                0.462535
Hangover, The (2009)                       0.455245
Scott Pilgrim vs. the World (2010)         0.451259
Zodiac (2007)                              0.450463
Descendants, The (2011)                    0.448854
Secret Life of Walter Mitty, The (2013)    0.442243
Yes Man (2008)                             0.441504
About Time (2013)                          0.440465
Kick-Ass (2010)                            0.438496
Alice in Wonderland (2010)                 0.437569
Holiday, The (2006)                        0.434027
Hitch (2005)                               0.431444
Darjeeling Limited, The (2007)             0.428977
Juno (
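
* The similarity table can also be used to estimate how a user would rate a movie they have not seen, by taking a similarity-weighted average of the ratings that user gave to other movies. The following is a sketch of that idea on invented data, not a tuned implementation; `predict_rating` and the `toy_*` names are hypothetical, and 0 is assumed to mean "not rated".

```python
import numpy as np
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

# Rows are movies, columns are users, 0 means "not rated".
toy_matrix = pd.DataFrame(
    [[5.0, 4.0, 0.0],
     [4.0, 5.0, 0.0],
     [1.0, 0.0, 5.0]],
    index=["Movie A", "Movie B", "Movie C"],
    columns=[1, 2, 3],
)
toy_item_sim = pd.DataFrame(cosine_similarity(toy_matrix),
                            index=toy_matrix.index,
                            columns=toy_matrix.index)


def predict_rating(user, title, matrix, sim):
    """Similarity-weighted average of the user's ratings for other movies."""
    user_ratings = matrix[user]
    # Movies this user actually rated, excluding the target movie.
    rated = [t for t in matrix.index if user_ratings[t] > 0 and t != title]
    weights = sim.loc[title, rated]
    if weights.sum() == 0:
        return np.nan  # no similar rated movie to base a prediction on
    return float((weights * user_ratings[rated]).sum() / weights.sum())


# User 2 has not rated Movie C; estimate it from Movies A and B.
print(round(predict_rating(2, "Movie C", toy_matrix, toy_item_sim), 2))  # 4.44
```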

* For further study, see the following materials:
    * "Recommender Systems: An Introduction" by D. Jannach et al. (https://www.amazon.com/Recommender-Systems-Introduction-Dietmar-Jannach/dp/0521493366).
    * https://www.youtube.com/watch?v=39vJRxIPSxw
    * https://developers.google.com/machine-learning/recommendation
    * https://surprise.readthedocs.io/en/stable/