
# Embeddings and collaborative filtering

**To Do**
1. Get MovieLens data and prepare it for embedding input. **DONE**
2. Create simple dot product model using functional API
3. Add bias term
4. Analyze results
5. Create neural net

In [16]:
# Imports
from __future__ import division, print_function
import pandas as pd
import numpy as np
from keras.models import Model
from keras.layers import Input, Embedding, merge
from keras.layers.core import Flatten, Dense, Dropout
from keras.regularizers import l2
from keras.optimizers import Adam

In [5]:
# Basic setup
path = "/Users/stephanrasp/repositories/courses/data/movielens/ml-latest-small/"   # Mac

## Prepare data 

In [6]:
ratings = pd.read_csv(path + 'ratings.csv')
ratings.head()

| | userId | movieId | rating | timestamp |
|---|---|---|---|---|
| 0 | 1 | 31 | 2.5 | 1260759144 |
| 1 | 1 | 1029 | 3.0 | 1260759179 |
| 2 | 1 | 1061 | 3.0 | 1260759182 |
| 3 | 1 | 1129 | 2.0 | 1260759185 |
| 4 | 1 | 1172 | 4.0 | 1260759205 |


In [8]:
users = ratings.userId.unique()
movies = ratings.movieId.unique()
len(users), len(movies)

(671, 9066)

In [9]:
# Remap the movie and user IDs to contiguous integers starting at 0.
# The Embedding layers look up rows of a weight matrix by index,
# so every ID has to lie in the range [0, n).
userid2idx = {o:i for i, o in enumerate(users)}
movieid2idx = {o:i for i, o in enumerate(movies)}
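
An embedding layer looks up row `i` of its weight matrix, so the IDs have to run from 0 to n-1 without gaps. Here is a minimal sketch of the remapping on a made-up list of raw IDs. The values and the `idx2id` inverse lookup are illustrative and not part of the notebook:

```python
# Hypothetical raw IDs with gaps, standing in for ratings.movieId
raw_ids = [31, 1029, 31, 1061, 1029]

# Keep first-seen order, like pandas' unique()
unique_ids = list(dict.fromkeys(raw_ids))

id2idx = {o: i for i, o in enumerate(unique_ids)}   # raw ID -> row index
idx2id = {i: o for o, i in id2idx.items()}          # row index -> raw ID

contiguous = [id2idx[o] for o in raw_ids]
print(contiguous)
```

The inverse mapping is what lets you get from an embedding row back to the original `movieId` (and from there to a title) later on.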

In [10]:
ratings.movieId = ratings.movieId.apply(lambda x: movieid2idx[x])
ratings.userId = ratings.userId.apply(lambda x: userid2idx[x])

In [11]:
ratings.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,0,0,2.5,1260759144
1,0,1,3.0,1260759179
2,0,2,3.0,1260759182
3,0,3,2.0,1260759185
4,0,4,4.0,1260759205


In [13]:
# Now let's split into training and validation sets (random mask, roughly 80/20)
mask = np.random.rand(len(ratings)) < 0.8
train = ratings[mask]
valid = ratings[~mask]
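
Because the mask is random, the split sizes change on every run. Below is a small sketch of the same idea with a fixed seed so the split is repeatable. The seed, the row count and the `*_idx` names are assumptions for illustration only:

```python
import numpy as np

rng = np.random.RandomState(42)    # fixed seed is an assumption, for repeatability
n_rows = 1000                      # hypothetical row count

mask = rng.rand(n_rows) < 0.8      # True for roughly 80% of the rows
train_idx = np.where(mask)[0]
valid_idx = np.where(~mask)[0]

print(len(train_idx), len(valid_idx))
```

Every row lands in exactly one of the two sets, but the 80/20 ratio only holds approximately.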

In [14]:
len(train), len(valid)

(80100, 19904)

## Create simple dot product model with functional API

In [21]:
n_users = ratings.userId.nunique()
n_movies = ratings.movieId.nunique()
n_users, n_movies

(671, 9066)

In [22]:
n_factors = 50   # Number of latent factors 

In [45]:
# Now let's set up this model
# Each sample is one user ID and one movie ID, so both inputs have shape (1,)
user_in = Input(shape=(1,))
movie_in = Input(shape=(1,))

In [46]:
# Then we need these embeddings
user_emb = Embedding(n_users, n_factors, W_regularizer=l2(1e-4))(user_in)
movie_emb = Embedding(n_movies, n_factors, W_regularizer=l2(1e-4))(movie_in)
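
The `W_regularizer=l2(1e-4)` argument adds a penalty on the size of the embedding weights to the loss. As far as I understand it, the penalty is the sum of squared weights times the strength. A minimal NumPy sketch with made-up weights (the array `W` and its scale are illustrative, not the trained embedding):

```python
import numpy as np

rng = np.random.RandomState(0)
W = rng.randn(671, 50) * 0.05     # toy weights with the user embedding's shape
l2_strength = 1e-4

# Assumed form of the L2 term that gets added to the MSE during training
penalty = l2_strength * np.sum(W ** 2)
print(penalty)
```

Larger weights mean a larger penalty, which nudges the embeddings towards small values and should help against overfitting.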

In [47]:
x = merge([user_emb, movie_emb], mode='dot')

In [48]:
x = Flatten()(x)

In [49]:
# Then create the model
dot_model = Model([user_in, movie_in], x)
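
To see what this model computes, here is a rough NumPy equivalent of the forward pass: look up one row per user and per movie, then take the dot product. The table sizes and random weights are toy values, not the notebook's:

```python
import numpy as np

rng = np.random.RandomState(0)
n_users, n_movies, n_factors = 5, 7, 3        # toy sizes, not the notebook's

user_table = rng.randn(n_users, n_factors)    # stands in for the user Embedding weights
movie_table = rng.randn(n_movies, n_factors)  # stands in for the movie Embedding weights

user_idx = np.array([0, 2, 4])
movie_idx = np.array([1, 1, 6])

# Embedding lookup is row indexing; the prediction is a row-wise dot product
u = user_table[user_idx]                      # shape (3, n_factors)
m = movie_table[movie_idx]                    # shape (3, n_factors)
pred = np.sum(u * m, axis=1, keepdims=True)   # shape (3, 1), one rating per pair
print(pred.shape)
```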

In [50]:
# And compile it
dot_model.compile(Adam(0.01), 'mse')

In [60]:
batch_size = 64
dot_model.fit([train.userId, train.movieId], train.rating, 
              batch_size=batch_size, nb_epoch=1, 
              validation_data=[[valid.userId, valid.movieId], valid.rating])

Train on 80100 samples, validate on 19904 samples
Epoch 1/1


<keras.callbacks.History at 0x121867910>

Ok, so I get a better score than Jeremy, but I am also overfitting badly. So let's try to alleviate this.

Increasing the batch size speeds up training considerably, but it also reduces how much the loss drops per epoch. The best score I reach does seem to be lower, though. Hmm, this is interesting.
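
One plausible reason for the smaller improvement per epoch is simply that a larger batch means fewer weight updates per pass over the data. A quick bit of arithmetic, where 1024 is a hypothetical larger batch size and not one I recorded above:

```python
import math

n_train = 80100                      # training rows reported above

def updates_per_epoch(batch_size):
    # One weight update per mini-batch; the last batch may be smaller
    return int(math.ceil(n_train / float(batch_size)))

for bs in (64, 1024):                # 1024 is a hypothetical larger batch size
    print(bs, updates_per_epoch(bs))
```

With a batch size of 64 that is 1252 updates per epoch, versus 79 with 1024.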

## Adding bias

In [119]:
# Let's set up the same model again but with bias terms...
user_in = Input(shape=(1,))
movie_in = Input(shape=(1,))

In [120]:
user_emb = Embedding(n_users, n_factors, W_regularizer=l2(1e-4), 
                     input_length=1)(user_in)
movie_emb = Embedding(n_movies, n_factors, W_regularizer=l2(1e-4), 
                      input_length=1)(movie_in)

In [121]:
# Bias terms: one learned scalar per user and per movie,
# implemented as an Embedding with a single factor
user_bias = Embedding(n_users, 1, input_length=1)(user_in)
movie_bias = Embedding(n_movies, 1, input_length=1)(movie_in)

In [122]:
user_bias = Flatten()(user_bias)
movie_bias = Flatten()(movie_bias)

In [123]:
x = merge([user_emb, movie_emb], mode='dot')

In [124]:
x = Flatten()(x)

In [125]:
x = merge([x, user_bias], mode='sum')
x = merge([x, movie_bias], mode='sum')

In [126]:
dot_model_with_bias = Model([user_in, movie_in], x)
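
In plain NumPy, the model with bias would look roughly like the sketch below: the dot product from before plus one scalar per user and one per movie. All tables are random toy values and `predict` is a hypothetical helper, not something from Keras:

```python
import numpy as np

rng = np.random.RandomState(1)
n_users, n_movies, n_factors = 4, 6, 3      # toy sizes

user_table = rng.randn(n_users, n_factors)
movie_table = rng.randn(n_movies, n_factors)
user_bias_table = rng.randn(n_users, 1)     # one scalar per user
movie_bias_table = rng.randn(n_movies, 1)   # one scalar per movie

def predict(user_idx, movie_idx):
    # Dot product of the two embeddings, kept as a column vector
    dot = np.sum(user_table[user_idx] * movie_table[movie_idx],
                 axis=1, keepdims=True)
    # Add the per-user and per-movie offsets
    return dot + user_bias_table[user_idx] + movie_bias_table[movie_idx]

pred = predict(np.array([0, 3]), np.array([5, 2]))
print(pred.shape)
```

The biases can absorb effects like "this user rates everything high" or "this movie is liked by almost everyone", so the factors are free to model taste.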

In [127]:
dot_model_with_bias.compile(Adam(0.01), 'mse')

In [129]:
batch_size = 64
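# Lower the learning rate. K.set_value updates the optimizer's variable in place;
# a plain attribute assignment may be ignored once the training function is built.
from keras import backend as K
K.set_value(dot_model_with_bias.optimizer.lr, 0.001)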
dot_model_with_bias.fit(
    [train.userId, train.movieId], 
    train.rating, 
    batch_size=batch_size, 
    nb_epoch=10, 
    validation_data=[[valid.userId, valid.movieId], valid.rating],
)

Train on 80100 samples, validate on 19904 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x124bb9b50>

In [None]:
# I don't really get this whole flattening business..
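
My current understanding of the flattening, as a shape walk-through in NumPy. I am assuming that the dot merge contracts the factor axis for each sample, which would leave a `(batch, 1, 1)` tensor, and that `Flatten` keeps the batch axis and collapses the rest:

```python
import numpy as np

batch, n_factors = 4, 50
user_emb = np.ones((batch, 1, n_factors))    # Embedding output: (batch, input_length, n_factors)
movie_emb = np.ones((batch, 1, n_factors))

# Assumed behaviour of the dot merge: contract the factor axis per sample
dot = np.einsum('bif,bjf->bij', user_emb, movie_emb)   # shape (batch, 1, 1)

bias = np.zeros((batch, 1, 1))               # output of an Embedding with one factor

# Flatten keeps the batch axis and collapses everything else
dot_flat = dot.reshape(batch, -1)            # shape (batch, 1)
bias_flat = bias.reshape(batch, -1)          # shape (batch, 1)

out = dot_flat + bias_flat                   # one number per (user, movie) pair
print(dot.shape, dot_flat.shape, out.shape)
```

So the flattening just drops the leftover length-1 axis, so that the prediction has the same `(batch, 1)` shape as the ratings it is compared against.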

## Explore the output...

In [131]:
# Restrict to 2000 most popular movies
g = ratings.groupby('movieId')['rating'].count()
topMovies = g.sort_values(ascending=False)[:2000]
topMovies = np.array(topMovies.index)

In [133]:
# Get the learned weights, in particular the movie biases
get_movie_bias = Model(movie_in, movie_bias)  # reuses the input and bias tensors of the bias model above

In [134]:
biases = get_movie_bias.predict(topMovies)

In [135]:
# Build a lookup from the original movieId to the movie title
movie_names = pd.read_csv(path+'movies.csv').set_index('movieId')['title'].to_dict()

In [137]:
# movies[i] converts the contiguous index back to the original movieId
movie_ratings = [(b[0], movie_names[movies[i]]) for i, b in zip(topMovies, biases)]

In [138]:
movie_ratings[:5]

[(1.2869363, 'Forrest Gump (1994)'),
 (1.5347273, 'Pulp Fiction (1994)'),
 (1.8540565, 'Shawshank Redemption, The (1994)'),
 (1.544189, 'Silence of the Lambs, The (1991)'),
 (1.5551853, 'Star Wars: Episode IV - A New Hope (1977)')]

In [140]:
# Ahhh I get it: sorting by bias lists the least-liked movies first
from operator import itemgetter
sorted(movie_ratings, key=itemgetter(0))[:15]

[(-1.3430359, 'Battlefield Earth (2000)'),
 (-1.279649, 'Super Mario Bros. (1993)'),
 (-1.0247694, 'Jaws 3-D (1983)'),
 (-1.0155251, 'Police Academy 6: City Under Siege (1989)'),
 (-0.97901976, 'Mighty Morphin Power Rangers: The Movie (1995)'),
 (-0.95439267, 'Spice World (1997)'),
 (-0.86914134, 'Police Academy 5: Assignment: Miami Beach (1988)'),
 (-0.86761612, 'Police Academy 3: Back in Training (1986)'),
 (-0.81225383, 'Speed 2: Cruise Control (1997)'),
 (-0.80387968, 'Avengers, The (1998)'),
 (-0.77620196, 'Howard the Duck (1986)'),
 (-0.76154387, 'RoboCop 3 (1993)'),
 (-0.7593286, 'Bio-Dome (1996)'),
 (-0.73670113, 'House on Haunted Hill (1999)'),
 (-0.70962787, 'Anaconda (1997)')]

In [141]:
sorted(movie_ratings, key=itemgetter(0), reverse=True)[:15]

[(2.008496, 'Paradise Lost: The Child Murders at Robin Hood Hills (1996)'),
 (1.9751985, 'African Queen, The (1951)'),
 (1.9301219, "Once Upon a Time in the West (C'era una volta il West) (1968)"),
 (1.9016222,
  'Fog of War: Eleven Lessons from the Life of Robert S. McNamara, The (2003)'),
 (1.8994795, 'Paths of Glory (1957)'),
 (1.8967917, 'Far from Heaven (2002)'),
 (1.8955195, 'Band of Brothers (2001)'),
 (1.8827574, 'Paris, Texas (1984)'),
 (1.8709689, 'Touch of Evil (1958)'),
 (1.8614258, 'Ran (1985)'),
 (1.8555335, 'Rebecca (1940)'),
 (1.8540565, 'Shawshank Redemption, The (1994)'),
 (1.8537581, 'Seventh Seal, The (Sjunde inseglet, Det) (1957)'),
 (1.8438414, 'Third Man, The (1949)'),
 (1.8426716, 'Great Dictator, The (1940)')]

In [142]:
# Do the same for the embeddings
get_movie_emb = Model(movie_in, movie_emb)
movie_emb = get_movie_emb.predict([topMovies])

In [143]:
movie_emb.shape

(2000, 1, 50)

In [144]:
movie_emb = np.squeeze(movie_emb)

In [145]:
from sklearn.decomposition import PCA
pca = PCA(n_components=3)
movie_pca = pca.fit(movie_emb.T).components_

In [146]:
movie_pca.shape

(3, 2000)
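
Why the result has shape `(3, 2000)`: fitting on the transposed matrix treats the 50 factors as samples and the movies as features, so each principal component assigns one score to every movie. A rough NumPy sketch of the same computation via SVD, on random toy data. The signs of the components may differ from what scikit-learn returns:

```python
import numpy as np

rng = np.random.RandomState(0)
n_movies, n_factors = 200, 50                 # toy sizes; the notebook uses 2000 movies
emb = rng.randn(n_movies, n_factors)          # stands in for the squeezed movie embeddings

X = emb.T                                     # (n_factors, n_movies): factors are the samples
X_centered = X - X.mean(axis=0)               # PCA centres each column first

# Rows of Vt are the principal axes, ordered by explained variance
_, _, Vt = np.linalg.svd(X_centered, full_matrices=False)
components = Vt[:3]                           # shape (3, n_movies)

print(components.shape)
```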

In [147]:
fac0 = movie_pca[0]
movie_comp = [(f, movie_names[movies[i]]) for f,i in zip(fac0, topMovies)]

In [148]:
sorted(movie_comp, key=itemgetter(0), reverse=True)[:10]

[(0.084199741, 'Apocalypse Now (1979)'),
 (0.080692954, 'Fargo (1996)'),
 (0.080456316, 'Pulp Fiction (1994)'),
 (0.074975364, 'Big Lebowski, The (1998)'),
 (0.072329663, 'Easy Rider (1969)'),
 (0.072274983, 'Goodfellas (1990)'),
 (0.072027408, 'American Psycho (2000)'),
 (0.071809612, 'Lost in Translation (2003)'),
 (0.071455263, 'Clockwork Orange, A (1971)'),
 (0.070462324, 'Royal Tenenbaums, The (2001)')]

In [149]:
sorted(movie_comp, key=itemgetter(0))[:10]

[(-0.088991426, 'Pearl Harbor (2001)'),
 (-0.068364277, 'Double Jeopardy (1999)'),
 (-0.066979334, 'Batman Forever (1995)'),
 (-0.064642921, 'Forever Young (1992)'),
 (-0.063483112, 'Independence Day (a.k.a. ID4) (1996)'),
 (-0.062966213, 'Con Air (1997)'),
 (-0.062631384, 'Shrek (2001)'),
 (-0.061947726, 'Entrapment (1999)'),
 (-0.061784845, 'Bodyguard, The (1992)'),
 (-0.061012696, 'Twister (1996)')]

## Neural net

In [163]:
movie_in = Input(shape=(1,))
user_in = Input(shape=(1,))

In [164]:
movie_emb = Embedding(n_movies, n_factors)(movie_in)
user_emb = Embedding(n_users, n_factors)(user_in)

In [165]:
movie_emb

<tf.Tensor 'Gather_48:0' shape=(?, 1, 50) dtype=float32>

In [166]:
# Now concatenate them
x = merge([movie_emb, user_emb], mode='concat')

In [167]:
x = Flatten()(x)

In [168]:
# Now create the network
x = Dense(70, activation='relu')(x)
x = Dense(1, activation='linear')(x)

In [169]:
x

<tf.Tensor 'add_110:0' shape=(?, 1) dtype=float32>
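
The same network as a NumPy forward pass, to make the shapes explicit: concatenate the two 50-factor embeddings, apply a 70-unit ReLU layer, then a single linear output. The weights here are random placeholders, not the trained ones:

```python
import numpy as np

rng = np.random.RandomState(0)
batch, n_factors, n_hidden = 4, 50, 70

movie_vec = rng.randn(batch, n_factors)      # flattened movie embeddings
user_vec = rng.randn(batch, n_factors)       # flattened user embeddings

# Randomly initialised weights, standing in for the trained Dense layers
W1, b1 = rng.randn(2 * n_factors, n_hidden) * 0.1, np.zeros(n_hidden)
W2, b2 = rng.randn(n_hidden, 1) * 0.1, np.zeros(1)

x = np.concatenate([movie_vec, user_vec], axis=1)   # (batch, 100)
h = np.maximum(0, x.dot(W1) + b1)                   # Dense(70, activation='relu')
out = h.dot(W2) + b2                                # Dense(1, activation='linear')
print(x.shape, h.shape, out.shape)
```

Unlike the dot product model, nothing here forces the user and movie factors to interact in a fixed way. The hidden layer learns how to combine them.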

In [173]:
nn_model = Model([user_in, movie_in], x)

In [174]:
nn_model.compile(Adam(0.001), 'mse')

In [175]:
nn_model.fit(
    [train.userId, train.movieId], 
    train.rating, 
    batch_size=batch_size, 
    nb_epoch=1, 
    validation_data=[[valid.userId, valid.movieId], valid.rating],
)

Train on 80100 samples, validate on 19904 samples
Epoch 1/1


<keras.callbacks.History at 0x12bb40bd0>