<a id='top'></a>

------

CSCI E-82 Advanced Machine Learning, Data Mining and Artificial Intelligence
=====

# Section : Saturday 8th December 10AM EST

----------


Recommender Systems
================

[Popularity methods](#Popularity-methods)
- Books dataset   

[Content-based recommenders](#Content-based-recommenders)
 - [Movielens dataset](#Movielens)   
 
[Collaborative-filtering recommenders](#Collaborative-filtering-recommenders)
  - Movielens (cont'd)    
  
Dask (in another notebook)  

-------


In [1]:
%matplotlib inline
import numpy as np
import pandas as pd
import scipy.sparse as sp

#import sklearn
#from sklearn import metrics
#from sklearn import model_selection

from IPython import display

import time
from urllib.request import Request, urlopen

import matplotlib.pyplot as plt
from sklearn.metrics.pairwise import  pairwise_distances, cosine_similarity
from sklearn.metrics import mean_squared_error, jaccard_similarity_score

from scipy.stats.stats import pearsonr

from numpy.linalg import solve
from scipy.sparse.linalg import svds

from sklearn.model_selection import train_test_split
from urllib.request import urlretrieve
import os
import zipfile


In [2]:
# special matplotlib command for global plot configuration
from matplotlib import rcParams
import matplotlib as mpl

dark2_colors = ['#1b9e77','#d95f02','#7570b3','#e7298a','#66a61e','#e6ab02','#a6761d','#666666']

def set_mpl_params():
    rcParams['figure.figsize'] = (10, 6)
    rcParams['figure.dpi'] = 150
    rcParams['axes.prop_cycle'].by_key()['color'][1]
    rcParams['lines.linewidth'] = 2
    rcParams['axes.facecolor'] = 'white'
    rcParams['font.size'] = 16
    rcParams['patch.edgecolor'] = 'white'
    rcParams['patch.facecolor'] = dark2_colors[0]
    rcParams['font.family'] = 'StixGeneral'

set_mpl_params()

<a id='Popularity-methods'></a>
[back to top](#top)

# Popularity methods

#### Advantages:
- Require only users and items and use basic techniques - generally very fast;
- If you don't know much - use this. With very little information, they work best;
- With lots of information, they still work well: people keep up with the Joneses;
- If you are considering transactions like purchases - then use frequent pattern mining, frequent itemsets.

#### Drawbacks:
- Not really personalized - in the worst case, everyone gets the same recommendation.

## Simple popularity and correlation-based recommender for books  

#### Let's look at what goes into a recommender and the various similarity measures.

### We will use this dataset of book reviews, which contains:
- The unique reviewer ID number
- The name of the book
- A rating score from 1 to 10

#### We create a pandas dataframe of the individual book ratings and then print the first 5 ratings in the frame

In [3]:
data = pd.read_csv("data_books.csv", sep = ",", header=None,
                         names=['Reviewer', 'Book', 'Rating'])

print ("There are %d book reviews in the dataframe" %len(data))
data.head()

There are 383852 book reviews in the dataframe


Unnamed: 0,Reviewer,Book,Rating
0,276726,Rites of Passage,5
1,276729,Help!: Level 1,3
2,276729,The Amsterdam Connection : Level 4 (Cambridge ...,6
3,276744,A Painted House,7
4,276747,Little Altars Everywhere,9


In [4]:
data.Rating.describe()

count    383852.000000
mean          7.626710
std           1.841331
min           1.000000
25%           7.000000
50%           8.000000
75%           9.000000
max          10.000000
Name: Rating, dtype: float64

#### When looking at explicit (ratings provided by user) and implicit ratings data (based on browsing history/ mouse clicks, movements recordings)  - keep in mind it is typically pretty poor

In [5]:
len(data[data.duplicated(["Reviewer", "Book"])]) #Reviewers have reviewed the same book multiple times.

1046

In [6]:
data[(data.Reviewer == 8067) & (data.Book == 'The Boy Next Door')]

Unnamed: 0,Reviewer,Book,Rating
11503,8067,The Boy Next Door,10
11558,8067,The Boy Next Door,6


In [8]:
dups = data[data.duplicated(["Reviewer", "Book"], keep=False)]

book_1 = "Harry Potter and the Chamber of Secrets (Book 2)"
dups[dups.Book == book_1]

Unnamed: 0,Reviewer,Book,Rating
2284,254,Harry Potter and the Chamber of Secrets (Book 2),9
2285,254,Harry Potter and the Chamber of Secrets (Book 2),9
17701,11676,Harry Potter and the Chamber of Secrets (Book 2),10
17702,11676,Harry Potter and the Chamber of Secrets (Book 2),8
17730,11676,Harry Potter and the Chamber of Secrets (Book 2),10
48772,30735,Harry Potter and the Chamber of Secrets (Book 2),9
48773,30735,Harry Potter and the Chamber of Secrets (Book 2),9
354731,252829,Harry Potter and the Chamber of Secrets (Book 2),10
354732,252829,Harry Potter and the Chamber of Secrets (Book 2),10


In [9]:
dups = data[data.Reviewer == 11676]
dups[dups.Book.str.contains("^Harry Potter")]

Unnamed: 0,Reviewer,Book,Rating
17701,11676,Harry Potter and the Chamber of Secrets (Book 2),10
17702,11676,Harry Potter and the Chamber of Secrets (Book 2),8
17713,11676,Harry Potter and the Prisoner of Azkaban (Book 3),9
17714,11676,Harry Potter and the Goblet of Fire (Book 4),8
17715,11676,Harry Potter and the Goblet of Fire (Book 4),10
17727,11676,Harry Potter and the Sorcerer's Stone (Book 1),10
17730,11676,Harry Potter and the Chamber of Secrets (Book 2),10
17734,11676,Harry Potter and the Prisoner of Azkaban (Harr...,9
19130,11676,Harry Potter and the Sorcerer's Stone (Book 1),8
19131,11676,Harry Potter and the Sorcerer's Stone (Harry P...,10


In [10]:
#Dropping duplicates
data = data.drop_duplicates(["Reviewer", "Book"], keep='last')
len(data[data.duplicated(["Reviewer", "Book"])])

0

### Sort the reviews dataframe on review counts and show the top 20 most reviewed books
Since we want to do a recommendation based on popularity, we need to do some sorting and filtering of the ratings data 

In [11]:
# Top 20 books
print (pd.value_counts(data.Book).head(20))

The Lovely Bones: A Novel                                           707
Wild Animus                                                         581
The Da Vinci Code                                                   494
The Secret Life of Bees                                             402
The Nanny Diaries: A Novel                                          391
The Red Tent (Bestselling Backlist)                                 383
Bridget Jones's Diary                                               367
A Painted House                                                     363
Life of Pi                                                          335
Divine Secrets of the Ya-Ya Sisterhood: A Novel                     323
Harry Potter and the Chamber of Secrets (Book 2)                    321
Angels &amp                                                         316
Harry Potter and the Sorcerer's Stone (Harry Potter (Paperback))    315
The Summons                                                     

### Now sort the dataframe on reviewer counts and show the top 10 reviewers

In [12]:
# Top 10 reviewers
print (pd.value_counts(data.Reviewer).head(10)) 
#Reviewer #11676 appears 6695 times in the dataset

11676     6695
98391     5651
189835    1824
153662    1821
23902     1170
235105    1014
76499     1011
171118     956
16795      945
248718     937
Name: Reviewer, dtype: int64


### Cold-start recommendation - if you had a new user, with no other information on his/her preference for genre or past reading history

In [14]:
toprated = data.groupby('Book').agg({'Rating': [np.size, np.mean]})
toprated.sort_values([('Rating', 'mean')], ascending=False).head(10)
#Book - La Tour sombre, tome 1 : Le pistolero has been reviewed twice and average rating = 10.0

Unnamed: 0_level_0,Rating,Rating
Unnamed: 0_level_1,size,mean
Book,Unnamed: 1_level_2,Unnamed: 2_level_2
Rock Stars: People at the Top of the Charts,1,10.0
"La Tour sombre, tome 1 : Le pistolero",2,10.0
Big Honkin' Zits : A Zits Treasury,1,10.0
Big Honey Hunt (Beginner Books),1,10.0
Big Help!,1,10.0
"La Terre vue du ciel, 2e �?©dition",1,10.0
Stranger to the Sun (Angel),1,10.0
Danger Music (Five Star First Edition Mystery Series),1,10.0
Strangers and Sojourners: A Novel (Children of the Last Days (Hardcover)),1,10.0
Strangers at Dawn,1,10.0


### It doesn't help to consider books that have only or two ratings.

In [15]:
atleast_10 = toprated['Rating']['size'] >= 10
toprated[atleast_10].sort_values([('Rating', 'mean')], ascending=False).head(10)

Unnamed: 0_level_0,Rating,Rating
Unnamed: 0_level_1,size,mean
Book,Unnamed: 1_level_2,Unnamed: 2_level_2
Postmarked Yesteryear: 30 Rare Holiday Postcards,11,10.0
Dilbert: A Book of Postcards,13,9.923077
Harry Potter and the Chamber of Secrets Postcard Book,23,9.869565
The Lorax,10,9.8
Kiss of the Night (A Dark-Hunter Novel),10,9.8
Route 66 Postcards: Greetings from the Mother Road,11,9.727273
Maus 1. Mein Vater kotzt Geschichte aus. Die Geschichte eines �?�?berlebenden.,10,9.7
"The Return of the King (The Lord of The Rings, Part 3)",16,9.625
Harry Potter Und Der Feuerkelch,10,9.6
Fox in Socks (I Can Read It All by Myself Beginner Books),15,9.6


### What if you now know that your reader has read and liked *Harry Potter and the Chamber of Secrets?*

We can get the reviewer IDs of all of the reviewers of the book, and use that to make a better recommendation

In [16]:
# Getting all the reviewers for this book
book_1 = 'Harry Potter and the Chamber of Secrets (Book 2)'
book_1_reviewers = data[data.Book == book_1].Reviewer
print ("%d people have reviewed %s" %(len(book_1_reviewers), book_1))

321 people have reviewed Harry Potter and the Chamber of Secrets (Book 2)


In [17]:
book_1_reviewers.head() #book_1 has been reviewed by reviewer #278356, #254 and so on

1605    278356
2285       254
4076      2033
7269      4809
7305      4896
Name: Reviewer, dtype: int64

Create a set of reviewer IDs containing only the reviewers of Harry Potter and the Chamber of Secrets and use it to filter the original dataset



In [18]:
# Creating a set with only the reviewer of Harry Potter and the Chamber of Secrets
book1_reviewers_only = data[data.Reviewer.isin(book_1_reviewers)]
book1_reviewers_only.head(5)

Unnamed: 0,Reviewer,Book,Rating
1591,278356,MANCHILD IN THE PROMISED LAND,3
1592,278356,Is It Too Late to Run Away and Join the Circus...,9
1593,278356,The Complete Idiot's Guide to Cycling,8
1594,278356,Adventures of a bystander,9
1595,278356,Coyote Waits,9


And then sort the result on the number of reviews, from most to least

In [19]:
print (pd.value_counts(book1_reviewers_only.Book).head(10))

Harry Potter and the Chamber of Secrets (Book 2)                    321
Harry Potter and the Prisoner of Azkaban (Book 3)                   154
Harry Potter and the Goblet of Fire (Book 4)                        134
Harry Potter and the Sorcerer's Stone (Harry Potter (Paperback))     92
Harry Potter and the Sorcerer's Stone (Book 1)                       90
Harry Potter and the Order of the Phoenix (Book 5)                   75
The Fellowship of the Ring (The Lord of the Rings, Part 1)           34
Bridget Jones's Diary                                                26
The Two Towers (The Lord of the Rings, Part 2)                       23
To Kill a Mockingbird                                                21
Name: Book, dtype: int64


The Harry Potter books are, not surprisingly, the most reviewed books. But Lord of the Rings is also there.


<a id='Content-based-recommenders'></a>
[back to top](#top)

# Content-based recommenders:

### Content-Based systems focus on properties of items. Similarity of items is determined by measuring the similarity in their properties.  The ideas here are exposed in the reading list chapter from [MMDS](http://infolab.stanford.edu/~ullman/mmds/ch9.pdf).
#### This is the way that you think you should recommend things - look at the features (attributes) of items, just like you are shopping, and then look at other similar items to determine what is the best next buy  ...
#### *If a Netflix user has watched many cowboy movies, then recommend a movie classified in the database as having the “cowboy” genre*


#### Advantages:
- Good option for cold-start, before you know much about users and items together, but know something about them individually;
- Familiarity with the method: this is most like classification;
- Can be vector-based: keywords; TF-IDF of bag-of-words, but generally, you want to be recommending the same type of item: books, movies, songs, etc.;
- The profiles created can acquire their own value.

#### Drawbacks:
- Often, you recommend what the buyer already knows about or would buy anyway (or has already bought);
- This is most like classification - so there are same drawbacks: scalability, the need for the tuning of unknown hyper-parameters, overfitting & regularization;
- Need to have clear structure in the features (attributes) to exploit - so this brings in all the difficulties of feature engineering in terms of potentiallly disparate feature types and values.

<a id='Movielens'></a>
[back to top](#top)
## Movielens data 
### Movielens contains explicit ratings data on movies from the publicly-available Grouplens datasets

#### [MovieLens Latest Datasets](https://grouplens.org/datasets/movielens/) 

These datasets will change over time, and are not appropriate for reporting research results. We will keep the download links stable for automated downloads. We will not archive or make available previously released versions.

#### Ratings are made on a 5-star scale, with half-star increments (0.5 stars - 5.0 stars).

- Small: 100,000 ratings and 1,300 tag applications applied to 9,000 movies by 700 users. Last updated 10/2016.  
[README.html](http://files.grouplens.org/datasets/movielens/ml-latest-small-README.html)  
[ml-latest-small.zip (size: 1 MB)](http://files.grouplens.org/datasets/movielens/ml-latest-small.zip)    

- Full: 26,000,000 ratings and 750,000 tag applications applied to 45,000 movies by 270,000 users. Includes tag genome data with 12 million relevance scores across 1,100 tags. Last updated 8/2017.  
[README.html](http://files.grouplens.org/datasets/movielens/ml-latest-README.html)  
[ml-latest.zip (size: 224 MB)](http://files.grouplens.org/datasets/movielens/ml-latest.zip)
Permalink: http://grouplens.org/datasets/movielens/latest/    

#### Ratings are made on a 5-star scale, with single star increments (1 stars - 5 stars).

- Old 100k stable benchmark dataset. 100,000 ratings from 1000 users on 1700 movies. Released 4/1998.  
[README.txt](http://files.grouplens.org/datasets/movielens/ml-100k-README.txt)  
[ml-100k.zip (size: 5 MB)](http://files.grouplens.org/datasets/movielens/ml-100k.zip)
Index of unzipped files
Permalink: http://grouplens.org/datasets/movielens/100k/  



In [20]:
old100k_dataset_url = 'http://files.grouplens.org/datasets/movielens/ml-100k.zip'
datasets_path = ''
old100k_dataset_path = os.path.join(datasets_path, 'ml-100k.zip')

old100k_f = urlretrieve(old100k_dataset_url, old100k_dataset_path)

In [21]:
with zipfile.ZipFile(old100k_dataset_path, "r") as z:
    z.extractall(datasets_path)

In [73]:
# pass in column names for each CSV and read them using pandas. 
# Column names available in the readme file

# Note that two of the files, users and items, are pipe separated values 
# and the other, ratings, is tab separated.

# Reading users file:
u_cols = ['user_id', 'age', 'sex', 'occupation', 'zip_code']
users = pd.read_csv('./datasets/ml-100k/u.user', sep='|', names=u_cols, encoding='latin-1')

# Reading ratings file:
r_cols = ['user_id', 'movie_id', 'rating', 'unix_timestamp']
ratings = pd.read_csv('./datasets/ml-100k/u.data', sep='\t', names=r_cols, encoding='latin-1')

# Reading items file:
i_cols = ['movie id', 'movie title' ,'release date','video release date', 'IMDb URL', 'unknown', 'Action', 'Adventure',
 'Animation', 'Children\'s', 'Comedy', 'Crime', 'Documentary', 'Drama', 'Fantasy',
 'Film-Noir', 'Horror', 'Musical', 'Mystery', 'Romance', 'Sci-Fi', 'Thriller', 'War', 'Western']
items = pd.read_csv('./datasets/ml-100k/u.item', sep='|', names=i_cols,  encoding='latin-1')


### Users
#### Print out the data on the first 10 users from the users dataset 

In [74]:
print (users.shape)
users.head(10)

(943, 5)


Unnamed: 0,user_id,age,sex,occupation,zip_code
0,1,24,M,technician,85711
1,2,53,F,other,94043
2,3,23,M,writer,32067
3,4,24,M,technician,43537
4,5,33,F,other,15213
5,6,42,M,executive,98101
6,7,57,M,administrator,91344
7,8,36,M,administrator,5201
8,9,29,M,student,1002
9,10,53,M,lawyer,90703


### Items
#### Print out the first 10 rows from the items dataset  - each row contains 24 columns of  features - the id and title etc. and 19 of them are one-hot-encodings for the genre of the movie

In [75]:
print (items.shape)
items.head(5)

(1682, 24)


Unnamed: 0,movie id,movie title,release date,video release date,IMDb URL,unknown,Action,Adventure,Animation,Children's,...,Fantasy,Film-Noir,Horror,Musical,Mystery,Romance,Sci-Fi,Thriller,War,Western
0,1,Toy Story (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?Toy%20Story%2...,0,0,0,1,1,...,0,0,0,0,0,0,0,0,0,0
1,2,GoldenEye (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?GoldenEye%20(...,0,1,1,0,0,...,0,0,0,0,0,0,0,1,0,0
2,3,Four Rooms (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?Four%20Rooms%...,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
3,4,Get Shorty (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?Get%20Shorty%...,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,5,Copycat (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?Copycat%20(1995),0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0


### Ratings
#### Print out the first 10 rows from the ratings dataset  - we see it uses the triple user / movie / rating plus a timestamp

In [76]:
print (ratings.shape)
ratings.head(5)

(100000, 4)


Unnamed: 0,user_id,movie_id,rating,unix_timestamp
0,196,242,3,881250949
1,186,302,3,891717742
2,22,377,1,878887116
3,244,51,2,880606923
4,166,346,1,886397596


Double check that we have the same number of unique users and items.

In [77]:
n_users = ratings.user_id.unique().shape[0]
n_items = ratings.movie_id.unique().shape[0]
print ("Users:", n_users, "\nMovies", n_items)

Users: 943 
Movies 1682


### Sparsity of ratings
#### What is the sparsity of the ratings dataset?

In [78]:
sparsity=round(100.0-len(ratings)/float(n_users*n_items)*100,1)
print ("\nThe sparsity level of MovieLens100K dataset is ", str(sparsity),  "%")


The sparsity level of MovieLens100K dataset is  93.7 %


We want to make a cross-matrix - that is unpivot the table of ratings - so that we can have a utility matrix with users in rows and items in columns.  We need to take away 1 from each ID, since Python uses zero-indexing. 

In [79]:
mtx = np.zeros((n_users, n_items)) 
#As noticed above this matrix is too sparse, if we use regular numpy for this sparse matrix we are wasting resources
#Although it doesn't matter for this small sample dataset, it certainly does for larger datasets. 
print("Nbytes:", mtx.nbytes)

Nbytes: 12689008


In [80]:
ratings[['user_id','movie_id','rating']].describe()

Unnamed: 0,user_id,movie_id,rating
count,100000.0,100000.0,100000.0
mean,462.48475,425.53013,3.52986
std,266.61442,330.798356,1.125674
min,1.0,1.0,1.0
25%,254.0,175.0,3.0
50%,447.0,322.0,4.0
75%,682.0,631.0,4.0
max,943.0,1682.0,5.0


In [81]:
#Subtracting 1 to make it easier to work with Python zero based indexing
users.user_id = users.user_id - 1 
items['movie id'] = items['movie id'] - 1

ratings.user_id = ratings.user_id - 1
ratings.movie_id = ratings.movie_id - 1

In [82]:
ratings[['user_id','movie_id','rating']].describe()

Unnamed: 0,user_id,movie_id,rating
count,100000.0,100000.0,100000.0
mean,461.48475,424.53013,3.52986
std,266.61442,330.798356,1.125674
min,0.0,0.0,1.0
25%,253.0,174.0,3.0
50%,446.0,321.0,4.0
75%,681.0,630.0,4.0
max,942.0,1681.0,5.0


In [83]:
#Row = users, columns = movies
row_ind = ratings.user_id
col_ind = ratings.movie_id
ratings_data = ratings.rating
#We'll use scipy sparse matrices (COOrdinate format) - COO is a fast format for constructing sparse matrices
mtx = sp.coo_matrix((ratings_data, (row_ind, col_ind)), shape=(n_users, n_items), dtype=np.int8)
print("Size/Nbytes: ",mtx.col.nbytes + mtx.row.nbytes + mtx.data.nbytes)
#print(mtx)
mtx

Size/Nbytes:  900000


<943x1682 sparse matrix of type '<class 'numpy.int8'>'
	with 100000 stored elements in COOrdinate format>

## Content-based filtering by user profile

### First, we create a matrix of genres that we will use to generate a user profile


In [84]:
genres = items[items.columns[6:]]

<img src="matrix.png" alt = "matrix" style = "width:1182px; height=702px;">

*Transpose matrices for correct dimensions*

### And then score the genres for each user - we will consider only those above the average

In [85]:
#We had 1682 movies and 943 users
genres.T.shape, mtx.todense().T.shape

((18, 1682), (1682, 943))

In [86]:
#Time with regular numpy matrices
start = time.time()
profiles = np.dot(genres.T, mtx.T)
end = time.time()
print(end - start)

39.10369277000427


In [87]:
#Time with sparse matrices 
start = time.time()
profiles = np.dot(sp.coo_matrix(genres.T),mtx.T)
end = time.time()
print(end - start)

11.877945184707642


In [88]:
print(profiles.shape)
profiles.todense()

(18, 943)


matrix([[250,  38,  39, ...,  38,  74, 227],
        [123,  13,  14, ...,  27,  52, 114],
        [ 40,   4,   0, ...,  14,  19,   7],
        ...,
        [188,  43,  53, ...,  28,  80, 134],
        [ 92,  11,  14, ...,   5,  47,  53],
        [ 22,   0,   0, ...,   0,  14,  23]], dtype=int64)

In [89]:
# mean centre the data
print(profiles.mean(axis=0)[0,21], np.mean(profiles[:,21]))
profiles = profiles - profiles.mean(axis=0)

print (profiles[:,21])

# binarize the resulting matrix where 1 is for ratings above average
profiles = np.where(profiles>0, 1, 0)
print (profiles[:,21])

59.333333333333336 59.33333333333333
[[189.66666667]
 [ 61.66666667]
 [-59.33333333]
 [-49.33333333]
 [155.66666667]
 [-27.33333333]
 [-59.33333333]
 [ 11.66666667]
 [-58.33333333]
 [-54.33333333]
 [-40.33333333]
 [-42.33333333]
 [-56.33333333]
 [  9.66666667]
 [ 37.66666667]
 [ 27.66666667]
 [ -4.33333333]
 [-42.33333333]]
[1 1 0 0 1 0 0 1 0 0 0 0 0 1 1 1 0 0]


In [90]:
print(profiles.shape)
profiles

(18, 943)


array([[1, 1, 1, ..., 1, 1, 1],
       [1, 0, 0, ..., 1, 1, 1],
       [0, 0, 0, ..., 1, 0, 0],
       ...,
       [1, 1, 1, ..., 1, 1, 1],
       [0, 0, 0, ..., 0, 1, 0],
       [0, 0, 0, ..., 0, 0, 0]])

We are trying to recommend movies for user 21, let's which movies user has rated

In [91]:
#ratings[((ratings.user_id==22) & (ratings.rating>3))][['movie_id','rating']] #.values.flatten()
ratings[((ratings.user_id==21) & (ratings.rating>3))][['movie_id']].values.flatten()

array([127,  79, 257, 509,  78, 510, 226, 172, 398, 185,  95, 116, 402,
       998, 501, 434, 175, 221, 549, 392, 647, 117, 691, 225, 237, 454,
       357, 207, 171, 229, 152, 153, 193, 194, 167, 203, 650, 186, 514,
        84, 450, 839, 430, 201, 215, 183, 174,  23, 108, 522,  20, 200,
       567, 180,  16, 143, 160, 711, 289, 429,  88, 227, 731,   3, 126,
        61, 249,  49, 208, 525, 173, 384, 791,  67])

In [92]:
#Movies rated>3 by user 21
items[(items['movie id'].isin(list(ratings[((ratings.user_id==21) & (ratings.rating>3))][['movie_id']].values.flatten())) )] #ratings[ratings.user_id==22][['movie_id','rating']]
#seems like many genres with Drama (check with np.sum)

Unnamed: 0,movie id,movie title,release date,video release date,IMDb URL,unknown,Action,Adventure,Animation,Children's,...,Fantasy,Film-Noir,Horror,Musical,Mystery,Romance,Sci-Fi,Thriller,War,Western
3,3,Get Shorty (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?Get%20Shorty%...,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
16,16,From Dusk Till Dawn (1996),05-Feb-1996,,http://us.imdb.com/M/title-exact?From%20Dusk%2...,0,1,0,0,0,...,0,0,1,0,0,0,0,1,0,0
20,20,Muppet Treasure Island (1996),16-Feb-1996,,http://us.imdb.com/M/title-exact?Muppet%20Trea...,0,1,1,0,0,...,0,0,0,1,0,0,0,1,0,0
23,23,Rumble in the Bronx (1995),23-Feb-1996,,http://us.imdb.com/M/title-exact?Hong%20Faan%2...,0,1,1,0,0,...,0,0,0,0,0,0,0,0,0,0
49,49,Star Wars (1977),01-Jan-1977,,http://us.imdb.com/M/title-exact?Star%20Wars%2...,0,1,1,0,0,...,0,0,0,0,0,1,1,0,1,0
61,61,Stargate (1994),01-Jan-1994,,http://us.imdb.com/M/title-exact?Stargate%20(1...,0,1,1,0,0,...,0,0,0,0,0,0,1,0,0,0
67,67,"Crow, The (1994)",01-Jan-1994,,"http://us.imdb.com/M/title-exact?Crow,%20The%2...",0,1,0,0,0,...,0,0,0,0,0,1,0,1,0,0
78,78,"Fugitive, The (1993)",01-Jan-1993,,"http://us.imdb.com/M/title-exact?Fugitive,%20T...",0,1,0,0,0,...,0,0,0,0,0,0,0,1,0,0
79,79,Hot Shots! Part Deux (1993),01-Jan-1993,,http://us.imdb.com/M/title-exact?Hot%20Shots!%...,0,1,0,0,0,...,0,0,0,0,0,0,0,0,1,0
84,84,"Ref, The (1994)",01-Jan-1994,,"http://us.imdb.com/M/title-exact?Ref,%20The%20...",0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [93]:
tmp = items[(items['movie id'].isin(list(ratings[((ratings.user_id==21) & (ratings.rating>3))][['movie_id']].values.flatten())) )] #ratings[ratings.user_id==22][['movie_id','rating']]
tmp.sum()


movie id                                                          21797
movie title           Get Shorty (1995)From Dusk Till Dawn (1996)Mup...
release date          01-Jan-199505-Feb-199616-Feb-199623-Feb-199601...
video release date                                                    0
IMDb URL              http://us.imdb.com/M/title-exact?Get%20Shorty%...
unknown                                                               0
Action                                                               46
Adventure                                                            23
Animation                                                             0
Children's                                                            0
Comedy                                                               35
Crime                                                                 6
Documentary                                                           0
Drama                                                           

### We use the Jaccard similarity score to do the scoring (SciKit Learn)

Remember: [Jaccard similarity](#https://scikit-learn.org/stable/modules/model_evaluation.html#jaccard-similarity-score) just measures the proportion of intersection of two sets

In [94]:
 np.array(profiles[:,21])

array([1, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0])

In [95]:
print(genres.values[0,:],"\n",genres.values[1,:],"\n",genres.values[21,:],"\n",genres.values[22,:])
genres.head()

[0 0 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0] 
 [1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0] 
 [1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0] 
 [0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0]


Unnamed: 0,Action,Adventure,Animation,Children's,Comedy,Crime,Documentary,Drama,Fantasy,Film-Noir,Horror,Musical,Mystery,Romance,Sci-Fi,Thriller,War,Western
0,0,0,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0
1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0
2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0
3,1,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,1,0,1,0,0,0,0,0,0,0,1,0,0


In [96]:
target_profile = np.array(profiles[:,21])
preds = np.zeros(genres.shape[0]) 
for i in range(genres.shape[0]):
    score = jaccard_similarity_score(target_profile, genres.values[i,:])
    preds[i] = score
print(preds.shape)

(1682,)


Then sort and show the top scores

In [97]:
print(preds)
np.sort(preds)[-10:]

[0.55555556 0.77777778 0.66666667 ... 0.72222222 0.66666667 0.66666667]


array([0.83333333, 0.83333333, 0.83333333, 0.83333333, 0.83333333,
       0.83333333, 0.83333333, 0.83333333, 0.83333333, 0.83333333])

Let's consider only the movies that user 21 has not viewed, and show the top scored movies in order. This would be our recommendation to them based on Jaccard similarity.  

In [98]:
watched_movies = ratings[ratings.user_id == 21]['movie_id']
items_scored = items.copy()
items_scored[items_scored['movie id'].isin(watched_movies)==False]
items_scored['score'] =  preds

In [99]:
items_scored.sort_values(['score'], ascending=False).head(10)

Unnamed: 0,movie id,movie title,release date,video release date,IMDb URL,unknown,Action,Adventure,Animation,Children's,...,Film-Noir,Horror,Musical,Mystery,Romance,Sci-Fi,Thriller,War,Western,score
251,251,"Lost World: Jurassic Park, The (1997)",23-May-1997,,http://us.imdb.com/M/title-exact?Lost%20World%...,0,1,1,0,0,...,0,0,0,0,0,1,1,0,0,0.833333
357,357,Spawn (1997),01-Aug-1997,,http://us.imdb.com/M/title-exact?Spawn+(1997/I),0,1,1,0,0,...,0,0,0,0,0,1,1,0,0,0.833333
163,163,"Abyss, The (1989)",01-Jan-1989,,"http://us.imdb.com/M/title-exact?Abyss,%20The%...",0,1,1,0,0,...,0,0,0,0,0,1,1,0,0,0.833333
635,635,Escape from New York (1981),01-Jan-1981,,http://us.imdb.com/M/title-exact?Escape%20from...,0,1,1,0,0,...,0,0,0,0,0,1,1,0,0,0.833333
830,830,Escape from L.A. (1996),09-Aug-1996,,http://us.imdb.com/M/title-exact?Escape%20from...,0,1,1,0,0,...,0,0,0,0,0,1,1,0,0,0.833333
172,172,"Princess Bride, The (1987)",01-Jan-1987,,http://us.imdb.com/M/title-exact?Princess%20Br...,0,1,1,0,0,...,0,0,0,0,1,0,0,0,0,0.833333
384,384,True Lies (1994),01-Jan-1994,,http://us.imdb.com/M/title-exact?True%20Lies%2...,0,1,1,0,0,...,0,0,0,0,1,0,0,0,0,0.833333
256,256,Men in Black (1997),04-Jul-1997,,http://us.imdb.com/M/title-exact?Men+in+Black+...,0,1,1,0,0,...,0,0,0,0,0,1,0,0,0,0.833333
171,171,"Empire Strikes Back, The (1980)",01-Jan-1980,,http://us.imdb.com/M/title-exact?Empire%20Stri...,0,1,1,0,0,...,0,0,0,0,1,1,0,1,0,0.833333
719,719,First Knight (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?First%20Knigh...,0,1,1,0,0,...,0,0,0,0,1,0,0,0,0,0.833333


In [100]:
tmp = items_scored.sort_values(['score'], ascending=False).head(10)
tmp.sum()

movie id                                                           3938
movie title           Lost World: Jurassic Park, The (1997)Spawn (19...
release date          23-May-199701-Aug-199701-Jan-198901-Jan-198109...
video release date                                                    0
IMDb URL              http://us.imdb.com/M/title-exact?Lost%20World%...
unknown                                                               0
Action                                                               10
Adventure                                                            10
Animation                                                             0
Children's                                                            0
Comedy                                                                3
Crime                                                                 0
Documentary                                                           0
Drama                                                           

<a id='Collaborative-filtering-recommenders'></a>
[back to top](#top)

# Collaborative-filtering recommenders
### Collaborative-Filtering systems focus on the relationship between users and items. Similarity of items is determined by the similarity of the ratings of those items by the users who have rated both items. [MMDS](http://infolab.stanford.edu/~ullman/mmds/ch9.pdf)
####  You don't need to know about the items, or the users, individually - the important information is the interaction between them.
#### *Recommend items based on similarity measures between users and/or items. The items recommended to a user are those preferred by similar users.* Or vice-versa.

#### Advantages:
- No in-depth knowledge required of the users or items themselves;
- Based on the past behavior of users, but not the context of the choice;
- Powerful at predicting next the buy as long as you have some of this information;
- You don't even need to have the same types of items, it still works;

#### Drawbacks:
- Bad at cold-start and for finding long-tail items  (Serendipity)
- Major problem is how to get around the sparsity of the matrix


**Memory-Based Collaborative Filtering:** 2 approaches - item-item filtering and user-item filtering.  
- **item-item filtering** pick an item, find users who liked that item, and find other items that those users or similar users also liked. *“Users who liked this item also liked …”*
- **user-item filtering** pick a particular user, find users that are similar to that user based on similarity of ratings, and recommend items that those similar users liked. *“Users who are similar to you also liked …”*  


**Model-Based Collaborative Filtering:** using latent factors obtained through **matrix factorization** - the factorization can be done in various ways. **SVD** and a method that lends itself to parallelization, **Alternating Least Squares (ALS)**

## Movielens data - explicit feedback CF model

### We follow parts of two online examples (much of the following code is based on these): 
- the discussion in Ethan Rosenthal's blog [here](http://blog.ethanrosenthal.com/2015/11/02/intro-to-collaborative-filtering/)


## Memory-based CF using Cosine similarity

## Movielens data 
### Movielens contains explicit ratings data on movies from the publicly-available Grouplens datasets

#### [MovieLens Latest Datasets](https://grouplens.org/datasets/movielens/) 

These datasets will change over time, and are not appropriate for reporting research results. We will keep the download links stable for automated downloads. We will not archive or make available previously released versions.

#### Ratings are made on a 5-star scale, with half-star increments (0.5 stars - 5.0 stars).

- Small: 100,000 ratings and 1,300 tag applications applied to 9,000 movies by 700 users. Last updated 10/2016.  
[README.html](http://files.grouplens.org/datasets/movielens/ml-latest-small-README.html)  
[ml-latest-small.zip (size: 1 MB)](http://files.grouplens.org/datasets/movielens/ml-latest-small.zip)    

- Full: 26,000,000 ratings and 750,000 tag applications applied to 45,000 movies by 270,000 users. Includes tag genome data with 12 million relevance scores across 1,100 tags. Last updated 8/2017.  
[README.html](http://files.grouplens.org/datasets/movielens/ml-latest-README.html)  
[ml-latest.zip (size: 224 MB)](http://files.grouplens.org/datasets/movielens/ml-latest.zip)
Permalink: http://grouplens.org/datasets/movielens/latest/    

#### Ratings are made on a 5-star scale, with single star increments (1 stars - 5 stars).

- Old 100k stable benchmark dataset. 100,000 ratings from 1000 users on 1700 movies. Released 4/1998.  
[README.txt](http://files.grouplens.org/datasets/movielens/ml-100k-README.txt)  
[ml-100k.zip (size: 5 MB)](http://files.grouplens.org/datasets/movielens/ml-100k.zip)
Index of unzipped files
Permalink: http://grouplens.org/datasets/movielens/100k/  



In [7]:
#Code repeated from above

old100k_dataset_url = 'http://files.grouplens.org/datasets/movielens/ml-100k.zip'
datasets_path = ''
old100k_dataset_path = os.path.join(datasets_path, 'ml-100k.zip')

old100k_f = urlretrieve(old100k_dataset_url, old100k_dataset_path)

with zipfile.ZipFile(old100k_dataset_path, "r") as z:
    z.extractall(datasets_path)
    
# pass in column names for each CSV and read them using pandas. 
# Column names available in the readme file

# Note that two of the files, users and items, are pipe separated values 
# and the other, ratings, is tab separated.

# Reading users file:
u_cols = ['user_id', 'age', 'sex', 'occupation', 'zip_code']
users = pd.read_csv('./datasets/ml-100k/u.user', sep='|', names=u_cols, encoding='latin-1')

# Reading ratings file:
r_cols = ['user_id', 'movie_id', 'rating', 'unix_timestamp']
ratings = pd.read_csv('./datasets/ml-100k/u.data', sep='\t', names=r_cols, encoding='latin-1')

# Reading items file:
i_cols = ['movie id', 'movie title' ,'release date','video release date', 'IMDb URL', 'unknown', 'Action', 'Adventure',
 'Animation', 'Children\'s', 'Comedy', 'Crime', 'Documentary', 'Drama', 'Fantasy',
 'Film-Noir', 'Horror', 'Musical', 'Mystery', 'Romance', 'Sci-Fi', 'Thriller', 'War', 'Western']
items = pd.read_csv('./datasets/ml-100k/u.item', sep='|', names=i_cols,  encoding='latin-1')


### Users
#### Print out the data on the first 10 users from the users dataset 

In [8]:
print (users.shape)
users.head(10)

(943, 5)


Unnamed: 0,user_id,age,sex,occupation,zip_code
0,1,24,M,technician,85711
1,2,53,F,other,94043
2,3,23,M,writer,32067
3,4,24,M,technician,43537
4,5,33,F,other,15213
5,6,42,M,executive,98101
6,7,57,M,administrator,91344
7,8,36,M,administrator,5201
8,9,29,M,student,1002
9,10,53,M,lawyer,90703


### Items
#### Print out the first 10 rows from the items dataset  - each row contains 24 columns of  features - the id and title etc. and 19 of them are one-hot-encodings for the genre of the movie

In [9]:
print (items.shape)
items.head(5)

(1682, 24)


Unnamed: 0,movie id,movie title,release date,video release date,IMDb URL,unknown,Action,Adventure,Animation,Children's,...,Fantasy,Film-Noir,Horror,Musical,Mystery,Romance,Sci-Fi,Thriller,War,Western
0,1,Toy Story (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?Toy%20Story%2...,0,0,0,1,1,...,0,0,0,0,0,0,0,0,0,0
1,2,GoldenEye (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?GoldenEye%20(...,0,1,1,0,0,...,0,0,0,0,0,0,0,1,0,0
2,3,Four Rooms (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?Four%20Rooms%...,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
3,4,Get Shorty (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?Get%20Shorty%...,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,5,Copycat (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?Copycat%20(1995),0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0


### Ratings
#### Print out the first 10 rows from the ratings dataset  - we see it uses the triple user / movie / rating plus a timestamp

In [10]:
print (ratings.shape)
ratings.head(5)

(100000, 4)


Unnamed: 0,user_id,movie_id,rating,unix_timestamp
0,196,242,3,881250949
1,186,302,3,891717742
2,22,377,1,878887116
3,244,51,2,880606923
4,166,346,1,886397596


In [11]:
#Double check that we have the same number of unique users and items.
n_users = ratings.user_id.unique().shape[0]
n_items = ratings.movie_id.unique().shape[0]
print ("Users:", n_users, "\nMovies", n_items)

Users: 943 
Movies 1682


In [12]:
ratings[['user_id','movie_id','rating']].describe()

Unnamed: 0,user_id,movie_id,rating
count,100000.0,100000.0,100000.0
mean,462.48475,425.53013,3.52986
std,266.61442,330.798356,1.125674
min,1.0,1.0,1.0
25%,254.0,175.0,3.0
50%,447.0,322.0,4.0
75%,682.0,631.0,4.0
max,943.0,1682.0,5.0


In [13]:
#Python has zero based index, so subtract 1
users.user_id = users.user_id - 1 
items['movie id'] = items['movie id'] - 1

ratings.user_id = ratings.user_id - 1
ratings.movie_id = ratings.movie_id - 1

In [14]:
ratings[['user_id','movie_id','rating']].describe()

Unnamed: 0,user_id,movie_id,rating
count,100000.0,100000.0,100000.0
mean,461.48475,424.53013,3.52986
std,266.61442,330.798356,1.125674
min,0.0,0.0,1.0
25%,253.0,174.0,3.0
50%,446.0,321.0,4.0
75%,681.0,630.0,4.0
max,942.0,1681.0,5.0


In [15]:
#Row = users, columns = movies
row_ind = ratings.user_id
col_ind = ratings.movie_id
ratings_data = ratings.rating
#We'll use scipy sparse matrices (COOrdinate format) - COO is a fast format for constructing sparse matrices
mtx = sp.coo_matrix((ratings_data, (row_ind, col_ind)), shape=(n_users, n_items), dtype=np.int8)
print("Size/Nbytes: ",mtx.col.nbytes + mtx.row.nbytes + mtx.data.nbytes)
#print(mtx)
mtx

Size/Nbytes:  900000


<943x1682 sparse matrix of type '<class 'numpy.int8'>'
	with 100000 stored elements in COOrdinate format>

In [16]:
def train_test_split1(ratings_mat):
    ratings_mat_csr = ratings_mat.tocsr()
    
    test = np.zeros(ratings_mat.shape)
    train = ratings_mat_csr.copy()
    for user in range(ratings_mat.shape[0]):
        test_ratings = np.random.choice(ratings_mat_csr[user, :].nonzero()[1], 
                                        size=10, 
                                        replace=False)
        train[user, test_ratings] = 0.
        test[user, test_ratings] = ratings_mat_csr[user, test_ratings].todense()
    
    print(train.shape, test.shape)
    return train, test

## Memory-based CF using Cosine similarity

We make a train and a test dataset, and then create a set of utility matrices for each - remember, we have one matrix that contains the ratings by user with the users as rows and the movies as columns, and the other is just a mask matrix with 1s if the movie has been rated by that user.

In [20]:
rating_matrix = sp.coo_matrix((ratings.rating, (ratings.user_id, ratings.movie_id)), shape=(n_users, n_items), dtype=np.int8)
train_data, test_data = train_test_split1(rating_matrix)


(943, 1682) (943, 1682)


In [26]:
#test_data[942,:].nonzero()
#test_data[942,[ 54,  61,  66,  96, 231, 390, 442, 568, 584, 738]]
#train_data[942,[ 54,  61,  66,  96, 231, 390, 442, 568, 584, 738]].todense()
#rating_matrix.tocsr()[942,[ 54,  61,  66,  96, 231, 390, 442, 568, 584, 738]].todense()

In [27]:
rating_matrix.shape , train_data.shape, test_data.shape

((943, 1682), (943, 1682), (943, 1682))

And we can compute a similarity matrix for both the users and the items. We'll use the SciKitLearn pairwise distances method, with cosine similarity:  


$$
\text{similarity} (u, u') = 
cos(\theta{}) = 
\frac{\textbf{r}_{u} \dot{} \textbf{r}_{u'}}{\| \textbf{r}_{u} \| \| \textbf{r}_{u'} \|} = 
\sum_{i} \frac{r_{ui}r_{u'i}}{\sqrt{\sum\limits_{i} r_{ui}^2} \sqrt{\sum\limits_{i} r_{u'i}^2} }
$$

In [115]:
user_similarity = 1. - pairwise_distances(train_data, metric='cosine')
item_similarity = 1. - pairwise_distances(train_data.T, metric='cosine')

print(user_similarity[:5, :5])

[[1.         0.13911674 0.02837252 0.04916335 0.35723059]
 [0.13911674 1.         0.05786029 0.11606998 0.0716053 ]
 [0.02837252 0.05786029 1.         0.26484528 0.02383873]
 [0.04916335 0.11606998 0.26484528 1.         0.0440049 ]
 [0.35723059 0.0716053  0.02383873 0.0440049  1.        ]]


In [116]:
print (user_similarity.shape)
print (item_similarity.shape)

(943, 943)
(1682, 1682)


We next create a prediction function. 

With user-based filtering, we want to calculate a weighted sum of all other users' ratings for each movie, where the weights used are the similarity measures between those users and the user of interest.  

$$ \sum\limits_{u'} \text{similarity}(u, u') r_{u'i} $$   

We normalize by dividing by the sum of the similarity of users.  

$$ \hat{r}_{ui} = \frac{\sum\limits_{u'} \text{similarity}(u, u') r_{u'i}}{\sum\limits_{u'}\text{similarity}(u, u')} $$


https://en.wikipedia.org/wiki/Collaborative_filtering

And then we also want to correct for bias - if Siskel is always rating higher than Ebert - by subtracting the mean rating for each other user from their rating, and then adding the user's own bias back to the prediction.

$ \hat{r}_{ui} = \bar{r_{u}} + \frac{\sum\limits_{u'} \text{similarity}(u, u') (r_{u'i} - \bar{r_{u'}})}{\sum\limits_{u'} \text{similarity}(u, u')} $

With item-based filtering, we want to similarly calculate a weighted sum of all other movies' ratings, where the weights used are the similarity measures between those movies and the movie of interest.  The calculations are similar - just reversing the $u$'s and the $i$'s - and we don't bother to subtract the bias.

In [117]:
def predict(ratings, similarity, input_type='user'):
    if input_type == 'user':
        pred = np.dot(sp.csr_matrix(similarity), ratings) / np.array([similarity.sum(axis=1)]).T
    return pred

def get_mse(pred, actual):
    # Ignore nonzero terms.
    pred = np.array(pred[actual.nonzero()[0],actual.nonzero()[1]][0])[0]    
    actual = actual[actual.nonzero()[0],actual.nonzero()[1]].flatten()
    print(pred.shape,actual.shape)
    return mean_squared_error(pred, actual)

#See source for item based CF

In [118]:
user_prediction1 = predict(train_data, user_similarity, input_type='user')

print(user_prediction1.shape, test_data.shape)
print('User-based CF MSE: ' + str(get_mse(user_prediction1, test_data)))

(943, 1682) (943, 1682)
(9430,) (9430,)
User-based CF MSE: 8.412540952132092


### Top - K Collaborative filtering

In [147]:
def predict_topk1(ratings, similarity, kind='user', k=40):
    pred = np.zeros(ratings.shape)
    if kind == 'user':
        for i in range(ratings.shape[0]):
            top_k_users = [np.argsort(similarity[:,i])[:-k-1:-1]]
            pred[i,:] = np.dot(user_similarity[i,top_k_users],np.array(train_data[top_k_users[0]].todense()))
            pred[i,:] /= np.sum(np.abs(similarity[i, :][top_k_users]))
    
    return pred

In [148]:
user_prediction2 = predict_topk1(train_data, user_similarity, kind='user')

print(user_prediction2.shape, test_data.shape)



(943, 1682) (943, 1682)


In [150]:
pred = np.array(user_prediction2[test_data.nonzero()[0],test_data.nonzero()[1]])
actual = test_data[test_data.nonzero()[0],test_data.nonzero()[1]].flatten()
print(pred.shape,actual.shape)
mean_squared_error(pred, actual) #MSE with top-k is better 

(9430,) (9430,)


6.525742860857987

More Recommender Systems - 
- Matrix Factorization Method  - [Example by Moritz Haller](http://archive.is/LZIEe)

Python based libraries - 
- https://maciejkula.github.io/spotlight/  
- http://www.libfm.org/
... more libraries

RecSys Challenge 
- http://www.recsyschallenge.com/2019/  (Trivago)
- http://www.recsyschallenge.com/2018/  (Spotify) 