# Recommender System with GraphLab
In which we demonstrate some features of GraphLab while building a movie recommendation system.

In [1]:
import graphlab as gl


## Matrix Factorization

If $\mathbf{M}$ is an $n \times n$ matrix, then its singular value decomposition is

\begin{equation}
\mathbf{M} = \mathbf{U}\mathbf{\Sigma}\mathbf{V}^{T}
\end{equation}

Both $\mathbf{U}$ and $\mathbf{V}$ are $n \times n$ and $\mathbf{\Sigma}$ is an $n \times n$ diagonal matrix.

For a recommender system, we can think of the SVD as representing the structure of the ratings:

$\mathbf{U}$ contains information about users and their preferences. The columns represent latent factors, and the entries in a row represent a given user's loadings on those factors.

Similarly, $\mathbf{V}^T$ represents information about items (e.g. movies). In this case the rows represent latent factors, and the entries in a column represent a given movie's loadings on those factors.

To build a matrix factorization recommender we limit the number of singular values or latent features we consider.

If we keep only $p \lt n$ singular values, then $\mathbf{U}$ is $n \times p$, $\mathbf{\Sigma}$ is $p \times p$, and $\mathbf{V}^{T}$ is $p \times n$.

So our predicted ratings matrix is

\begin{equation}
\overline{\mathbf{M}} = \mathbf{U}_{n \times p}\mathbf{\Sigma}_{p \times p}\mathbf{V}^{T}_{p \times n}
\end{equation}
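
Written entry by entry, the same product gives the predicted rating of user $i$ for item $j$ as a weighted sum over the $p$ retained factors. The symbols $u_{ik}$, $v_{jk}$ (entries of $\mathbf{U}$ and $\mathbf{V}$) and $\sigma_{k}$ (the $k$-th diagonal entry of $\mathbf{\Sigma}$) are notation introduced here for illustration:

\begin{equation}
\overline{m}_{ij} = \sum_{k=1}^{p} u_{ik}\,\sigma_{k}\,v_{jk}
\end{equation}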

The factorization recommender tries to minimize the objective
\begin{equation}
\min_{\mathbf{w},\mathbf{a},\mathbf{b},\mathbf{U},\mathbf{V}} \frac{1}{\lvert\mathcal{D}\rvert} \sum_{(i,j,r_{i,j}) \in \mathcal{D}} \mathcal{L}\bigl(\mathrm{score}(i,j),r_{i,j}\bigr) + \lambda_{1}\left(\lVert\mathbf{w}\rVert_{2}^{2} + \lVert\mathbf{a}\rVert_{2}^{2} + \lVert\mathbf{b}\rVert_{2}^{2}\right) + \lambda_{2}\left(\lVert\mathbf{U}\rVert_{2}^{2} + \lVert\mathbf{V}\rVert_{2}^{2}\right)
\end{equation}
where
\begin{equation}
\mathrm{score}(i,j) = \mu + w_i + w_j + \mathbf{a}^{T}\mathbf{x}_{i} + \mathbf{b}^{T}\mathbf{y}_{j} + \mathbf{u}_{i}^{T}\mathbf{v}_{j}
\end{equation}
and $\mu$ is the overall average rating, $w_i$ is the user bias, $w_j$ is the item bias, $\mathbf{x}_{i}$ is the side data for user $i$ with weight vector $\mathbf{a}$, $\mathbf{y}_{j}$ is the side data for item $j$ with weight vector $\mathbf{b}$, and $\mathbf{u}_{i}$ and $\mathbf{v}_{j}$ are the user and item latent factors. $\mathcal{L}$ is the loss function, $\mathcal{D}$ is the set of observed ratings, and $\lambda_{1}$ and $\lambda_{2}$ are the regularization weights.
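
To make the score formula concrete, here is a minimal sketch in plain Python. The function and argument names (`score`, `w_user`, `x_user`, and so on) and all of the numbers are hypothetical, chosen only so the arithmetic is easy to follow; this is not GraphLab code.

```python
def dot(u, v):
    """Inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))


def score(mu, w_user, w_item, a, x_user, b, y_item, u_user, v_item):
    """Global mean + biases + side-data terms + latent factor interaction."""
    return (mu + w_user + w_item
            + dot(a, x_user)        # user side-data contribution
            + dot(b, y_item)        # item side-data contribution
            + dot(u_user, v_item))  # latent factor interaction


# Hypothetical values, for illustration only.
example = score(
    mu=3.5, w_user=0.25, w_item=-0.5,
    a=[0.5], x_user=[1.0],
    b=[0.25], y_item=[2.0],
    u_user=[1.0, 0.5], v_item=[0.5, -1.0],
)
print(example)  # 3.25 from mean and biases, 0.5 + 0.5 from side data, 0.0 from factors
```

When no side data is supplied, the two side-data terms drop out and the score reduces to the mean, the two biases, and the factor interaction.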

In [2]:
import os


def add_path(base, name):
    """Join a base directory and a file name into a single path."""
    return os.path.join(base, name)


if __name__ == '__main__':
    base_path = "./data/sample-movie-recommender-master/dataset/ml-20m/"
    ratings_path = add_path(base_path, "ratings.csv")
    movies_path = add_path(base_path, "movies.csv")
    # Load the ratings and the movie metadata into SFrames.
    ratings = gl.SFrame.read_csv(ratings_path)
    movies = gl.SFrame.read_csv(movies_path)

    




------------------------------------------------------
Inferred types from first 100 line(s) of file as 
column_type_hints=[int,int,float,int]
If parsing fails due to incorrect types, you can correct
the inferred type list above and pass it to read_csv in
the column_type_hints argument
------------------------------------------------------


------------------------------------------------------
Inferred types from first 100 line(s) of file as 
column_type_hints=[int,str,str]
If parsing fails due to incorrect types, you can correct
the inferred type list above and pass it to read_csv in
the column_type_hints argument
------------------------------------------------------


In [3]:
movies.head()
movies[0]
movies[movies['movieId'] < 10]
movies[movies['movieId'] < 10].shape

(9, 3)

In [4]:
df = movies.to_dataframe()
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 27278 entries, 0 to 27277
Data columns (total 3 columns):
movieId    27278 non-null int64
title      27278 non-null object
genres     27278 non-null object
dtypes: int64(1), object(2)
memory usage: 639.4+ KB


In [5]:
sf = gl.SFrame(df)
sf.head()

movieId,title,genres
1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy ...
2,Jumanji (1995),Adventure|Children|Fantasy ...
3,Grumpier Old Men (1995),Comedy|Romance
4,Waiting to Exhale (1995),Comedy|Drama|Romance
5,Father of the Bride Part II (1995) ...,Comedy
6,Heat (1995),Action|Crime|Thriller
7,Sabrina (1995),Comedy|Romance
8,Tom and Huck (1995),Adventure|Children
9,Sudden Death (1995),Action
10,GoldenEye (1995),Action|Adventure|Thriller


In [6]:
train, test = gl.recommender.util.random_split_by_user(
    ratings, 'userId', 'movieId',
    max_num_users=1000, item_test_proportion=0.2)
# or: train, test = ratings.random_split(0.8, seed=10)
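
The idea behind a per-user split can be sketched in plain Python: sample a limited number of users, then hold out a share of each sampled user's ratings for testing, leaving everything else in the training set. The helper `split_by_user` below is hypothetical, and holding out each rating independently with probability `item_test_proportion` is an assumption made for simplicity, not a description of GraphLab's implementation.

```python
import random


def split_by_user(rows, max_num_users=1000, item_test_proportion=0.2, seed=0):
    """Split (user, item, rating) rows into train and test lists.

    Only a sample of users contributes to the test set. For each sampled
    user, each rating is held out with probability item_test_proportion.
    """
    rng = random.Random(seed)
    users = sorted({user for user, _, _ in rows})
    test_users = set(rng.sample(users, min(max_num_users, len(users))))

    train_rows, test_rows = [], []
    for row in rows:
        held_out = row[0] in test_users and rng.random() < item_test_proportion
        (test_rows if held_out else train_rows).append(row)
    return train_rows, test_rows


# Toy data: 5 users, each rating the same 10 items.
rows = [(user, item, 3.0) for user in range(5) for item in range(10)]
train_rows, test_rows = split_by_user(rows, max_num_users=2,
                                      item_test_proportion=0.5, seed=0)
```

Because only sampled users lose ratings to the test set, the training set stays almost as large as the full data, which matches the shape reported in the next cell.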

In [7]:
train.shape

(19973775, 4)

In [8]:
recommender = gl.recommender.factorization_recommender.create(train, 'userId', 'movieId', 'rating', max_iterations=5)
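
For intuition about what fitting such a model involves, here is a minimal stochastic-gradient sketch of the objective from the Matrix Factorization section, using squared error as the loss and omitting the side-data terms. It is not GraphLab's solver; `train_mf`, its parameters, and the toy ratings are all hypothetical.

```python
import random


def train_mf(ratings, num_factors=2, max_iterations=50, step=0.05, reg=0.01, seed=0):
    """Fit biases and latent factors to (user, item, rating) triples with SGD."""
    rng = random.Random(seed)
    mu = sum(r for _, _, r in ratings) / len(ratings)  # overall average rating
    users = sorted({u for u, _, _ in ratings})
    items = sorted({i for _, i, _ in ratings})
    w_user = {u: 0.0 for u in users}
    w_item = {i: 0.0 for i in items}
    U = {u: [rng.gauss(0, 0.1) for _ in range(num_factors)] for u in users}
    V = {i: [rng.gauss(0, 0.1) for _ in range(num_factors)] for i in items}

    def predict(u, i):
        return mu + w_user[u] + w_item[i] + sum(p * q for p, q in zip(U[u], V[i]))

    def rmse():
        total = sum((predict(u, i) - r) ** 2 for u, i, r in ratings)
        return (total / len(ratings)) ** 0.5

    history = [rmse()]  # training RMSE before and after each pass
    order = list(ratings)
    for _ in range(max_iterations):
        rng.shuffle(order)
        for u, i, r in order:
            err = r - predict(u, i)
            # Gradient step on the squared error plus L2 penalty.
            w_user[u] += step * (err - reg * w_user[u])
            w_item[i] += step * (err - reg * w_item[i])
            for k in range(num_factors):
                pu, qi = U[u][k], V[i][k]
                U[u][k] += step * (err * qi - reg * pu)
                V[i][k] += step * (err * pu - reg * qi)
        history.append(rmse())
    return predict, history


# Hypothetical toy ratings, for illustration only.
toy = [(1, 'A', 5.0), (1, 'B', 1.0), (1, 'C', 2.0), (2, 'A', 4.0),
       (2, 'B', 2.0), (3, 'A', 1.0), (3, 'B', 5.0), (3, 'C', 4.0)]
predict, history = train_mf(toy, max_iterations=50)
```

The `history` list records the training RMSE per pass, which is a simple way to check that more iterations are still helping.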

We can find similar items to a movie:
```python
inception_id = movies.filter_by('Inception (2010)', 'title')['movieId']
similar_movies = recommender.get_similar_items(inception_id)['similar']
movies.filter_by(similar_movies, 'movieId')
```


## Now that we have our model, we can put it to use
## First, we can get recommendations

In [25]:
top_movies = recommender.recommend([1])['movieId']
#recommender.recommend([1])
movies.filter_by(top_movies, 'movieId')

movieId,title,genres
356,Forrest Gump (1994),Comedy|Drama|Romance|War
1210,Star Wars: Episode VI - Return of the Jedi (1 ...,Action|Adventure|Sci-Fi
2571,"Matrix, The (1999)",Action|Sci-Fi|Thriller
3147,"Green Mile, The (1999)",Crime|Drama
3578,Gladiator (2000),Action|Adventure|Drama
7502,Band of Brothers (2001),Action|Drama|War
44555,"Lives of Others, The (Das leben der Anderen) (2 ...",Drama|Romance|Thriller
58559,"Dark Knight, The (2008)",Action|Crime|Drama|IMAX
63082,Slumdog Millionaire (2008) ...,Crime|Drama|Romance
79132,Inception (2010),Action|Crime|Drama|Mystery|Sci-Fi|Thriller|IMAX ...
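
Conceptually, producing a list like the one above amounts to scoring every item for the user, dropping items the user has already rated, and keeping the highest-scoring ones. The sketch below shows that logic in plain Python; the `recommend` helper, the `toy_scores` table, and the assumption that already-rated items are excluded are illustrative and not taken from GraphLab's source.

```python
def recommend(user, predict, all_items, seen, k=10):
    """Rank items the user has not rated by predicted score, highest first."""
    known = seen.get(user, set())
    candidates = [item for item in all_items if item not in known]
    ranked = sorted(candidates, key=lambda item: predict(user, item), reverse=True)
    return ranked[:k]


# Hypothetical predicted scores standing in for a trained model.
toy_scores = {("u1", "A"): 4.5, ("u1", "B"): 3.0, ("u1", "C"): 4.0, ("u1", "D"): 2.0}
top = recommend("u1", lambda u, i: toy_scores[(u, i)],
                all_items=["A", "B", "C", "D"], seen={"u1": {"A"}}, k=2)
print(top)  # item A is skipped because u1 has already rated it
```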


### GraphLab also allows the incorporation of side data. In this case we have some information about movies (titles and genres) that we can include.

In [27]:
# Variant fitted on the training split:
# recommender_with_side_data = gl.factorization_recommender.create(
#     train, 'userId', 'movieId', 'rating', item_data=movies, max_iterations=5)
# Note: the call below is fitted on the smaller `test` split, not on `train`.
recommender_with_side_data = gl.factorization_recommender.create(
    test, 'userId', 'movieId', 'rating',
    item_data=movies, max_iterations=5, verbose=False)



In [28]:
top_movies = recommender_with_side_data.recommend([1])['movieId']
movies.filter_by(top_movies, 'movieId')

movieId,title,genres
97,"Hate (Haine, La) (1995)",Crime|Drama
1131,Jean de Florette (1986),Drama|Mystery
1209,Once Upon a Time in the West (C'era una volta il ...,Action|Drama|Western
1797,Everest (1998),Documentary|IMAX
1927,All Quiet on the Western Front (1930) ...,Action|Drama|War
2951,"Fistful of Dollars, A (Per un pugno di doll ...",Action|Western
5772,My Dinner with André (1981) ...,Drama
7063,"Aguirre: The Wrath of God (Aguirre, der Zorn ...",Adventure|Drama
42730,Glory Road (2006),Drama
49822,"Good Shepherd, The (2006)",Drama|Thriller


## Find similar movies

In [36]:
inception_id = movies.filter_by('Inception (2010)', 'title')['movieId']
similar_movies = recommender.get_similar_items(inception_id)['similar']
#andre_id = movies.filter_by(5772, 'movieId')['movieId']
#similar_movies = recommender.get_similar_items(andre_id)['similar']
movies.filter_by(similar_movies, 'movieId')

movieId,title,genres
33166,Crash (2004),Crime|Drama
55765,American Gangster (2007),Crime|Drama|Thriller
60408,Welcome to the Sticks (Bienvenue chez les ...,Comedy
61729,Ghost Town (2008),Comedy|Fantasy|Romance
72011,Up in the Air (2009),Drama|Romance
74152,Zach Galifianakis: Live at the Purple Onion ...,Comedy|Documentary
81639,Jack Goes Boating (2010),Comedy|Romance
87234,Submarine (2010),Comedy|Drama|Romance
96821,"Perks of Being a Wallflower, The (2012) ...",Drama|Romance
97921,Silver Linings Playbook (2012) ...,Comedy|Drama
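
One common way to define item similarity in a factorization model is the cosine similarity between latent item vectors. The sketch below assumes that definition; whether it matches what `get_similar_items` computed above is an assumption, and the factor vectors and helper names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def most_similar(item, factors, k=10):
    """Return the k (item, similarity) pairs closest to `item`, best first."""
    scores = [(other, cosine(factors[item], vec))
              for other, vec in factors.items() if other != item]
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:k]


# Hypothetical two-factor item vectors, for illustration only.
item_factors = {
    "A": [1.0, 0.0],
    "B": [2.0, 0.0],   # same direction as A
    "C": [0.0, 1.0],   # orthogonal to A
    "D": [-1.0, 0.0],  # opposite direction to A
}
neighbours = most_similar("A", item_factors, k=2)
```

Cosine similarity ignores vector length, so item B counts as identical in direction to A even though its factor values are twice as large.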


In [37]:
similar_movies = recommender_with_side_data.get_similar_items(inception_id)['similar']
movies.filter_by(similar_movies, 'movieId')

movieId,title,genres
1424,Inside (1996),Action
2654,"Wolf Man, The (1941)",Drama|Fantasy|Horror
3505,No Way Out (1987),Drama|Mystery|Thriller
4082,Barfly (1987),Comedy|Drama|Romance
5299,My Big Fat Greek Wedding (2002) ...,Comedy|Romance
40755,Forty Guns (1957),Drama|Western
85574,Black Bread (Pa Negre) (2010) ...,Drama
92154,Faust (2011),Drama
102749,Captain America II: Death Too Soon (1979) ...,Action|Crime
108516,Visitors (2013),Documentary


## Find similar users

In [14]:
similar_users = recommender.get_similar_users([1])['similar']

In [15]:
users = ratings.groupby(key_columns='userId', 
                        operations={'avg_rating':gl.aggregate.AVG('rating'), 'count':gl.aggregate.COUNT()})

In [16]:
users.head()

userId,count,avg_rating
21855,22,4.36363636364
88004,34,3.70588235294
79732,24,3.5
63664,43,3.76744186047
127950,78,3.80128205128
7899,730,3.57191780822
25263,22,3.95454545455
130872,75,3.84
87629,38,3.86842105263
30621,247,3.53441295547


In [17]:
users.filter_by(similar_users, 'userId')

userId,count,avg_rating
122116,104,3.80769230769
37799,200,4.0025
138397,893,3.91377379619
1959,226,3.18805309735
120019,177,3.75706214689
83395,161,3.50931677019
82896,141,4.29078014184
129956,29,3.75862068966
136148,303,3.75742574257
112338,142,3.78521126761


In [69]:
ratings.show()



