The code used below comes from the following course:
http://course17.fast.ai/index.html

In [1]:
from __future__ import division,print_function

import os, json
from glob import glob
import numpy as np
np.set_printoptions(precision=4, linewidth=100)
from matplotlib import pyplot as plt

In [2]:
from theano.sandbox import cuda

In [3]:
import utils; reload(utils)
from utils import *

Using TensorFlow backend.


In [4]:
import os
import pandas as pd
import numpy as np

In [5]:
path = "data/dblp/"
model_path = path + 'models/'
if not os.path.exists(model_path): os.mkdir(model_path)
batch_size=64

## Set up data

We will be working with the data, which contains one rating per row, like this:

In [6]:
ratings = pd.read_csv(path+'ratings.csv')
ratings.head()

Unnamed: 0.1,Unnamed: 0,userId,movieId,rating
0,0,1000233,1833108,1.333333
1,1,1000233,1767939,0.333333
2,2,1001517,712958,1.0
3,3,100190,1293337,1.333333
4,4,1003930,1156812,1.0


In [7]:
ratings.shape

(2608, 4)

In [8]:
ratings = ratings[ratings['rating'] <= 1]

In [9]:
ratings.shape

(2103, 4)

In [10]:
ratings.drop(['Unnamed: 0'], axis=1, inplace=True)

In [11]:
ratings.head()

Unnamed: 0,userId,movieId,rating
1,1000233,1767939,0.333333
2,1001517,712958,1.0
4,1003930,1156812,1.0
5,1004722,1052580,1.0
6,1004722,361337,0.5


In [12]:
users = ratings.userId.unique()
movies = ratings.movieId.unique()

#new = np.concatenate([users,movies])
#new_users = np.unique(new)

In [13]:
users

array([1000233, 1001517, 1003930, ...,   99818,   99832,  998816])

In [14]:
userid2idx = {o:i for i,o in enumerate(users)}
movieid2idx = {o:i for i,o in enumerate(movies)}
#userid2idx = {o:i for i,o in enumerate(new_users)}

We update the movie and user ids so that they are contiguous integers, which we want when using embeddings.

In [15]:
ratings.movieId = ratings.movieId.apply(lambda x: movieid2idx[x])
ratings.userId = ratings.userId.apply(lambda x: userid2idx[x])
#ratings.movieId = ratings.movieId.apply(lambda x: userid2idx[x])
#ratings.userId = ratings.userId.apply(lambda x: userid2idx[x])

In [16]:
ratings.head()

Unnamed: 0,userId,movieId,rating
1,0,0,0.333333
2,1,1,1.0
4,2,2,1.0
5,3,3,1.0
6,3,4,0.5


In [17]:
ratings[ratings['userId'] == 1011]

Unnamed: 0,userId,movieId,rating
1517,1011,296,0.333333


In [18]:
user_min, user_max, movie_min, movie_max = (ratings.userId.min(), 
    ratings.userId.max(), ratings.movieId.min(), ratings.movieId.max())
user_min, user_max, movie_min, movie_max

(0, 1724, 0, 917)

In [19]:
n_users = ratings.userId.nunique()
n_movies = ratings.movieId.nunique()
n_users, n_movies

#n_users = len(ratings.userId)
#n_movies = len(ratings.movieId)
#n_users, n_movies

(1725, 918)

This is the number of latent factors in each embedding.

In [20]:
n_factors = 50

In [21]:
np.random.seed = 42

Randomly split into training and validation.

In [22]:
msk = np.random.rand(len(ratings)) < 0.8
trn = ratings[msk]
val = ratings[~msk]

##  Dot product and Bias

The most basic model is a dot product of a movie embedding and a user embedding along with the bias term, that is, a single bias for each user and each movie representing how positive or negative each user is, and how good each movie is. We can add that easily by simply creating an embedding with one output for each movie and each user, and adding it to our output.

In [23]:
def embedding_input(name, n_in, n_out, reg):
    inp = Input(shape=(1,), dtype='int64', name=name)
    return inp, Embedding(n_in, n_out, input_length=1, W_regularizer=l2(reg))(inp)

In [24]:
user_in, u = embedding_input('user_in', n_users, n_factors, 1e-4)
movie_in, m = embedding_input('movie_in', n_movies, n_factors, 1e-4)

  app.launch_new_instance()
  app.launch_new_instance()


In [25]:
def create_bias(inp, n_in):
    x = Embedding(n_in, 1, input_length=1)(inp)
    return Flatten()(x)

In [26]:
ub = create_bias(user_in, n_users)
mb = create_bias(movie_in, n_movies)

In [27]:
x = merge([u, m], mode='dot')
x = Flatten()(x)
x = merge([x, ub], mode='sum')
x = merge([x, mb], mode='sum')
model = Model([user_in, movie_in], x)
model.compile(Adam(0.001), loss='mse')

  if __name__ == '__main__':
  name=name)
  app.launch_new_instance()


In [28]:
model.fit([trn.userId, trn.movieId], trn.rating, batch_size=64, nb_epoch=1, 
          validation_data=([val.userId, val.movieId], val.rating))

  from ipykernel import kernelapp as app


Train on 1676 samples, validate on 427 samples
Epoch 1/1


<keras.callbacks.History at 0x7fdc0b4a1a50>

We’ve just created a embedding layer that creates a Users by Latent Factors matrix and a embedding layer that creates a Movies by Latent Factors matrix.  When the input to these is a user id and a movie id, then they return the latent factor vectors for the user and the movie, respectively.  The merge layer then takes the dot product of these two things to return rating.  We compile the model using MSE as the loss function and the AdaMax learning algorithm (which is superior to Sparse Gradient Descent).  Our callbacks monitor the validation loss.

In [29]:
model.optimizer.lr=0.01

In [30]:
model.fit([trn.userId, trn.movieId], trn.rating, batch_size=64, nb_epoch=6, 
          validation_data=([val.userId, val.movieId], val.rating))

Train on 1676 samples, validate on 427 samples
Epoch 1/6
Epoch 2/6

  from ipykernel import kernelapp as app


Epoch 3/6
Epoch 4/6
Epoch 5/6
Epoch 6/6


<keras.callbacks.History at 0x7fdc0b6b9150>

In [31]:
model.optimizer.lr=0.001

In [32]:
model.fit([trn.userId, trn.movieId], trn.rating, batch_size=64, nb_epoch=10, 
          validation_data=([val.userId, val.movieId], val.rating))

Train on 1676 samples, validate on 427 samples
Epoch 1/10
Epoch 2/10

  from ipykernel import kernelapp as app


Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x7fdc0b658490>

This problem is that we do not exactly wat is the best benchmarks, however the it looks like a not too bad approach.

In [33]:
model.save_weights(model_path+'dot_product_bias.h0')

In [34]:
model.summary()

____________________________________________________________________________________________________
Layer (type)                     Output Shape          Param #     Connected to                     
user_in (InputLayer)             (None, 1)             0                                            
____________________________________________________________________________________________________
movie_in (InputLayer)            (None, 1)             0                                            
____________________________________________________________________________________________________
embedding_1 (Embedding)          (None, 1, 50)         86250       user_in[0][0]                    
____________________________________________________________________________________________________
embedding_2 (Embedding)          (None, 1, 50)         45900       movie_in[0][0]                   
___________________________________________________________________________________________

In [35]:
model.load_weights(model_path+'dot_product_bias.h0')

We can use the model to generate predictions by passing a pair of ints - a user id and a movie id. For instance, this predicts that user #3 would really enjoy movie #6.

In [36]:
model.predict([np.array([10]), np.array([15])])

array([[ 0.1479]], dtype=float32)

In [46]:
my_list = []
for i in range(n_movies):
    array_= model.predict([np.array([10]), np.array([i])])
    my_list.append(array_[0][0])
    
print("5 strongest predictions are on these mentors:", np.argsort(my_list)[-5:], "\n", "with the following predictions:", np.sort(my_list)[-5:])


5 strongest predictions are on these mentors: [100  19 189 565  13] 
 with the following predictions: [ 0.2722  0.2766  0.2882  0.2954  0.4315]


In [None]:
# if we have to look up the key based on value and vice versa
# initial node value 200145 -> new value 1428

# find old based on given new
##lKey = [key for key, value in userid2idx.iteritems() if value == 14][0]
##lKey
# find new based on given old
#userid2idx[1019381]

In [None]:
ratings[ratings['userId']==965]

In [None]:
ratings[ratings['movieId']==14013]

##  Neural net

Rather than creating a special purpose architecture (like our dot-product with bias earlier), it's often both easier and more accurate to use a standard neural network. Let's try it! Here, we simply concatenate the user and movie embeddings into a single vector, which we feed into the neural net.

In [38]:
user_in, u = embedding_input('user_in', n_users, n_factors, 1e-4)
movie_in, m = embedding_input('movie_in', n_movies, n_factors, 1e-4)

  app.launch_new_instance()
  app.launch_new_instance()


In [39]:
x = merge([u, m], mode='concat')
x = Flatten()(x)
x = Dropout(0.3)(x)
x = Dense(70, activation='relu')(x)
x = Dropout(0.75)(x)
x = Dense(1)(x)
nn = Model([user_in, movie_in], x)
nn.compile(Adam(0.0001), loss='mse')

  if __name__ == '__main__':


In [40]:
nn.fit([trn.userId, trn.movieId], trn.rating, batch_size=64, nb_epoch=10, 
          validation_data=([val.userId, val.movieId], val.rating))

  from ipykernel import kernelapp as app


Train on 1676 samples, validate on 427 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x7fdc007af210>

This improves on our accuracy even further.

In [41]:
nn.save_weights(model_path+'nn.h0')

In [42]:
nn.load_weights(model_path+'nn.h0')

In [43]:
# prediction between a given pair of authors
nn.predict([np.array([10]), np.array([15])])

array([[ 0.191]], dtype=float32)

In [44]:
def create_pred(person_x):
    my_list = []
    for i in range(n_movies):
        array_= nn.predict([np.array([person_x]), np.array([i])])
        my_list.append(array_[0][0])
    print("For user", person_x, "the system recommends these 5 mentors:", np.argsort(my_list)[-5:], "\n", "with the following predictions:", np.sort(my_list)[-5:])


In [45]:
# give a person Id to make recommendation for it
create_pred(10)

For user 10 the system recommends these 5 mentors: [ 41 364 271 147 738] 
 with the following predictions: [ 0.2517  0.2518  0.2549  0.2577  0.26  ]


In [47]:
ratings[ratings['userId']==965]

Unnamed: 0,userId,movieId,rating
1454,965,727,0.5
1455,965,716,0.5


In [49]:
ratings[ratings['movieId']==738]

Unnamed: 0,userId,movieId,rating
1500,1001,738,1.0
2383,1577,738,0.666667


### Get the embeddings

In [None]:
users = ratings.userId.unique()
movies = ratings.movieId.unique()

In [None]:
get_user_emb = Model(user_in, u)
user_emb = np.squeeze(get_user_emb.predict([users]))
user_emb.shape

In [None]:
get_movie_emb = Model(movie_in, m)
movie_emb = np.squeeze(get_movie_emb.predict([movies]))
type(movie_emb)

In [None]:
import pandas as pd 
df = pd.DataFrame(user_emb)
df.to_csv(path + 'user_emb.csv', encoding='utf-8')

In [None]:
import pandas as pd 
df = pd.DataFrame(movie_emb)
df.to_csv(path + 'movie_emb.csv', encoding='utf-8')