# Recommendation systems
Welcome to your first homework assignment about recommendation systems.


## The Cold Start Problem 

The collaborative filtering method discussed in class does not address the problem of new users or new movies. What prediction would you use in each of these cases?

* A new user but a known movie
* A new movie and a known user
* A new user and new movie

**A new user but a known movie**:
* Option 1: Predict the average rating given to that movie by existing users.
* Option 2 (better): Collect information such as demographics, interests declared at signup, current location, and data from the user's initial interactions with the platform (clicks, search history). Use this information to match the new user with existing user profiles, then predict the rating as the average of the ratings given to that movie by the users whose profiles are closest to the new user.
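
Both options can be sketched in a few lines. This is only an illustration under stated assumptions: `ratings` is a hypothetical dense users x movies array in which 0 means "not rated", `profiles` is a hypothetical numeric feature vector per existing user, and cosine similarity is one possible choice of closeness, not something prescribed in class.

```python
import numpy as np

def movie_mean(ratings, movie):
    """Option 1: average of the observed (non-zero) ratings for one movie."""
    col = ratings[:, movie]
    return col[col > 0].mean()

def neighbor_mean(ratings, profiles, new_profile, movie, k=2):
    """Option 2: average rating from the k raters whose profiles are closest."""
    raters = np.flatnonzero(ratings[:, movie] > 0)
    p = profiles[raters]
    # cosine similarity between the new user's profile and each rater's profile
    sims = p @ new_profile / (np.linalg.norm(p, axis=1) * np.linalg.norm(new_profile))
    nearest = raters[np.argsort(-sims)[:k]]
    return ratings[nearest, movie].mean()

# toy data (made up): 4 existing users, 2 movies; 0 marks a missing rating
ratings = np.array([[5, 0], [4, 2], [1, 0], [2, 5]], dtype=float)
profiles = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
new_profile = np.array([1.0, 0.05])

print(movie_mean(ratings, 0))                            # 3.0
print(neighbor_mean(ratings, profiles, new_profile, 0))  # 4.5
```

The new user's profile resembles the first two users, so Option 2 averages their ratings (5 and 4) instead of the global mean for the movie.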

**A new movie but a known user**:
For a new movie we can create a movie profile: genre, director, cast, release date, language, length, etc. For the existing user we can create a similar profile based on the movies they have rated, for example the probability that the user likes a horror movie or a movie by Christopher Nolan. We can then use the similarity between the movie profile and the user profile to predict the rating. We can also factor in the user's bias, estimated from their existing ratings, which tells us whether they are a critical or a generous rater.
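
A minimal content-based sketch of this idea follows. The feature matrix, the mean-centred weighting, and the scale factor `2.0` that turns a similarity into a rating shift are all illustrative assumptions; in practice the mapping from similarity to rating would be fitted to data.

```python
import numpy as np

def user_profile(movie_features, user_ratings):
    """Rating-weighted sum of the feature vectors of the movies the user rated."""
    rated = user_ratings > 0
    bias = user_ratings[rated].mean()
    # centre ratings on the user's own mean so disliked movies count negatively
    weights = user_ratings[rated] - bias
    return weights @ movie_features[rated], bias

def predict_new_movie(movie_features, user_ratings, new_movie, scale=2.0):
    """User's mean rating (bias) shifted by the profile/movie cosine similarity."""
    profile, bias = user_profile(movie_features, user_ratings)
    denom = np.linalg.norm(profile) * np.linalg.norm(new_movie)
    sim = profile @ new_movie / denom if denom > 0 else 0.0
    return float(np.clip(bias + scale * sim, 1.0, 5.0))

# toy data (made up): feature columns are [horror, comedy]
movie_features = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
user_ratings = np.array([5.0, 4.0, 1.0])   # likes horror, dislikes comedy

print(predict_new_movie(movie_features, user_ratings, np.array([1.0, 0.0])))
print(predict_new_movie(movie_features, user_ratings, np.array([0.0, 1.0])))
```

The new horror movie is predicted above the user's mean rating and the new comedy below it, which is the behaviour the profile match is meant to capture.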

**A new movie and a new user**:
For the new movie, we can create a movie profile: genre, director, cast, release date, language, length, etc.
For the new user, we can collect information such as demographics, interests declared at signup, current location, and data from the user's initial interactions with the platform (clicks, search history).
From the pool of existing users and movies, we can then select the movies closest to the new movie's profile and the users closest to the new user's profile, and predict the average of the ratings given to the selected movies by the selected users.


## Matrix Factorization with bias
We want to extend the Matrix Factorization model discussed in class to add a "bias" parameter for each user and another "bias" parameter for each movie. For the problem in class we had the parameter matrices $U$ and $V$; we are adding $u^0$, a vector of dimension $n_u$, and $v^0$, a vector of dimension $n_m$. Writing $u_{0i}$ and $v_{0j}$ for the entries of $u^0$ and $v^0$, the prediction equation becomes

$$\hat{y}_{ij} = u_{0i} + v_{0j} + u_i \cdot v_j  $$ 
 
(a) How many weights (parameters) are we fitting for this problem?

(b) Write the gradient descent equations for this problem.

(a) Assuming $U$ and $V$ have dimensions $n_u \times k$ and $n_m \times k$, the number of parameters is $$n_u k + n_m k + n_u + n_m$$
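
As a quick check of the count, here is a small sketch; the function name `num_parameters` is made up for illustration, and the sizes are those of the tiny dataset used later (7 users, 4 movies) with $k = 3$.

```python
def num_parameters(n_u, n_m, k):
    """Weights in U (n_u x k) and V (n_m x k) plus one bias per user and per movie."""
    return n_u * k + n_m * k + n_u + n_m

print(num_parameters(7, 4, 3))   # 7*3 + 4*3 + 7 + 4 = 44
```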

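(b) Vectorized version, writing $\Delta = (Y - UV^T - u^0 - v^{0T}) \odot R$, where $u^0$ is broadcast across columns, $v^{0T}$ is broadcast across rows, and $\odot$ is the element-wise product:
   $$1.\: u^0 = u^0 - \eta \, \nabla E(u^0) \quad \text{where} \quad \nabla E(u^0) = -\frac{2}{N}\,\mathrm{np.sum}(\Delta,\ \mathrm{axis}=1)$$ <br/>
   $$2.\: v^0 = v^0 - \eta \, \nabla E(v^0) \quad \text{where} \quad \nabla E(v^0) = -\frac{2}{N}\,\mathrm{np.sum}(\Delta,\ \mathrm{axis}=0)^T$$<br/>
   
   $$3.\: U = U - \eta \, \nabla E(U) \quad \text{where} \quad \nabla E(U) = -\frac{2}{N}\,\Delta V$$<br/>
   $$4.\: V = V - \eta \, \nabla E(V) \quad \text{where} \quad \nabla E(V) = -\frac{2}{N}\,\Delta^T U$$<br/>
   
   where $R$ is the binarized version of $Y$ (1 where a rating is observed, 0 otherwise) and $N$ is the number of observed ratings. Summing over axis 1 adds up each user's row of residuals, which gives one entry per user; summing over axis 0 gives one entry per movie.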
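
The gradients for the biased model can be checked numerically. The sketch below is a dense NumPy illustration on made-up random data, not the assignment's sparse implementation; it takes $N$ to be the number of observed ratings, sums residuals over each row (one entry per user) for the user bias, and compares one bias gradient with a central finite difference.

```python
import numpy as np

def loss(Y, R, U, V, u0, v0):
    """Mean squared error over observed entries only (R is the 0/1 mask)."""
    err = (Y - U @ V.T - u0[:, None] - v0[None, :]) * R
    return np.sum(err ** 2) / R.sum()

def gradients(Y, R, U, V, u0, v0):
    N = R.sum()
    delta = (Y - U @ V.T - u0[:, None] - v0[None, :]) * R
    return (-2 * delta @ V / N,          # dE/dU, shape (n_u, k)
            -2 * delta.T @ U / N,        # dE/dV, shape (n_m, k)
            -2 * delta.sum(axis=1) / N,  # dE/du0, one entry per user
            -2 * delta.sum(axis=0) / N)  # dE/dv0, one entry per movie

rng = np.random.default_rng(0)
n_u, n_m, k = 4, 3, 2
Y = rng.integers(1, 6, size=(n_u, n_m)).astype(float)
R = (rng.random((n_u, n_m)) < 0.7).astype(float)
R[0, 0] = 1.0  # make sure at least one rating is observed
U, V = rng.random((n_u, k)), rng.random((n_m, k))
u0, v0 = rng.random(n_u), rng.random(n_m)

dU, dV, du0, dv0 = gradients(Y, R, U, V, u0, v0)

# central finite difference on the bias of user 1 as a sanity check
eps = 1e-6
step = np.zeros(n_u)
step[1] = eps
approx = (loss(Y, R, U, V, u0 + step, v0) - loss(Y, R, U, V, u0 - step, v0)) / (2 * eps)
print(abs(approx - du0[1]) < 1e-6)
```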
   

# Collaborative Filtering with Gradient Descent 

In this part of the assignment you will build a collaborative filtering model to predict Netflix ratings. This assignment will step you through how to do this using gradient descent. The data for this assignment is ...

**Instructions:**
- Do not use loops (for/while) in your code, unless the instructions explicitly ask you to do so.
- DO NOT change paths (-3 points)
- DO NOT submit data to github (-2 points)

**You will learn to:**
- Build the general architecture of a learning algorithm, including:
    - Encoding rating data
    - Initializing parameters
    - Calculating the cost function
    - Calculating gradient
    - Using an optimization algorithm (gradient descent) 
    - Predicting on new data
- Putting it all together.

In [1]:
import numpy as np
import pandas as pd

## Encoding rating data
Here is a very small subset of fake data to get us started.

In [2]:
# The first row says that user 11 rated movie 1 with a score of 4
!cat tiny_training2.csv 

userId,movieId,rating
11,1,4
11,23,5
2,23,5
2,4,3
31,1,4
31,23,4
4,1,5
4,3,2
52,1,1
52,3,4
61,3,5
7,23,1
7,3,3


In [3]:
# here is a handy function from fast.ai
def proc_col(col):
    """Encodes a pandas column with continuous ids.
    """
    uniq = col.unique()
    name2idx = {o:i for i,o in enumerate(uniq)}
    return name2idx, np.array([name2idx[x] for x in col]), len(uniq)
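
To see what this encoding produces, here is a plain-Python analogue applied to the `userId` column of `tiny_training2.csv`. The helper name `encode_ids` is made up for illustration; it numbers ids in order of first appearance, as `proc_col` does through `col.unique()`.

```python
import numpy as np

def encode_ids(ids):
    """Plain-Python analogue of proc_col: ids are numbered by first appearance."""
    name2idx = {}
    for x in ids:
        name2idx.setdefault(x, len(name2idx))
    return name2idx, np.array([name2idx[x] for x in ids]), len(name2idx)

user_ids = [11, 11, 2, 2, 31, 31, 4, 4, 52, 52, 61, 7, 7]
name2idx, encoded, n = encode_ids(user_ids)
print(name2idx)   # {11: 0, 2: 1, 31: 2, 4: 3, 52: 4, 61: 5, 7: 6}
print(encoded)    # [0 0 1 1 2 2 3 3 4 4 5 6 6]
print(n)          # 7
```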

In [4]:
def encode_data(df):
    """Encodes rating data with continuous user and movie ids using 
    the helpful fast.ai function from above.
    
    Arguments:
      df: a dataframe with columns userId, movieId, rating
    
    Returns:
      df: a dataframe with the encoded data
      num_users
      num_movies
      
    """
    # YOUR CODE HERE
    _,user_col,num_users = proc_col(df.userId)
    _,movie_col,num_movies = proc_col(df.movieId)
    df.userId = user_col
    df.movieId = movie_col
#     raise NotImplementedError()
    return df, num_users, num_movies

In [5]:
df = pd.read_csv("tiny_training2.csv")
df, num_users, num_movies = encode_data(df)

In [6]:
df

Unnamed: 0,userId,movieId,rating
0,0,0,4
1,0,1,5
2,1,1,5
3,1,2,3
4,2,0,4
5,2,1,4
6,3,0,5
7,3,3,2
8,4,0,1
9,4,3,4
10,5,3,5
11,6,1,1
12,6,3,3


In [7]:
assert(num_users == 7)

In [8]:
assert(num_movies == 4)

In [9]:
np.testing.assert_equal(df["userId"].values, np.array([0, 0, 1, 1, 2, 2, 3, 3, 4, 4, 5, 6, 6]))

## Initializing parameters

In [10]:
def create_embedings(n, K):
    """ Create a numpy random matrix of shape (n, K)
    
    The random matrix is initialized with uniform values in [0, 6/K)
    
    Arguments:
    n: number of items/users
    K: number of factors in the embedding
    
    Returns:
    emb: numpy array of shape (n, K)
    """
    np.random.seed(3)
    emb = 6*np.random.random((n, K)) / K
    return emb

# here is an example of what the prediction matrix looks like with 7 users and 5 movies
np.dot(create_embedings(7,3), create_embedings(5,3).transpose())

array([[ 3.55790894,  4.69774849,  0.92361109,  1.58739544,  3.00593239],
       [ 4.69774849,  7.44656163,  1.18135616,  2.64524868,  4.74559066],
       [ 0.92361109,  1.18135616,  0.24548062,  0.34025121,  0.69616965],
       [ 1.58739544,  2.64524868,  0.34025121,  1.61561   ,  2.41361975],
       [ 3.00593239,  4.74559066,  0.69616965,  2.41361975,  3.82505541],
       [ 2.02000808,  3.29656257,  0.43174569,  2.065911  ,  3.07264619],
       [ 2.07691001,  3.02887291,  0.53270924,  1.02482544,  1.90251125]])
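
One way to read the $6/K$ scale: each entry has mean $3/K$, so a dot product of $K$ independent pairs has expectation $K \cdot (3/K)^2 = 9/K$, which is 3 for $K = 3$ and 0.18 for $K = 50$. The sketch below checks this empirically. It is an illustration that draws the two factor matrices independently (the helper name and sample size are assumptions), whereas `create_embedings` resets the same seed on every call.

```python
import numpy as np

def mean_initial_prediction(K, n=2000, seed=0):
    """Average entry of U V^T when both factors are drawn uniformly from [0, 6/K)."""
    rng = np.random.default_rng(seed)
    U = 6 * rng.random((n, K)) / K
    V = 6 * rng.random((n, K)) / K
    return (U @ V.T).mean()

for K in (3, 50):
    # empirical mean next to the theoretical value 9/K
    print(K, mean_initial_prediction(K), 9 / K)
```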

## Encoding Y as a sparse matrix
This code helps you encode $Y$ as a sparse matrix from the dataframe. 

In [11]:
from scipy import sparse
def df2matrix(df, nrows, ncols, column_name="rating"):
    """ Returns a sparse matrix constructed from a dataframe
    
    This code assumes the df has columns: userId, movieId and column_name
    (rating by default), with userId and movieId already encoded as row and column indices
    """
    values = df[column_name].values
    ind_movie = df['movieId'].values
    ind_user = df['userId'].values
    return sparse.csc_matrix((values,(ind_user, ind_movie)),shape=(nrows, ncols))

In [12]:
df = pd.read_csv("tiny_training2.csv")
df, num_users, num_movies = encode_data(df)
Y = df2matrix(df, num_users, num_movies)

In [13]:
print(Y)

  (0, 0)	4
  (2, 0)	4
  (3, 0)	5
  (4, 0)	1
  (0, 1)	5
  (1, 1)	5
  (2, 1)	4
  (6, 1)	1
  (1, 2)	3
  (3, 3)	2
  (4, 3)	4
  (5, 3)	5
  (6, 3)	3


In [14]:
def sparse_multiply(df, emb_user, emb_movie):
    """ Returns U*V^T multiplied element-wise by R, as a sparse matrix.
    
    It avoids creating the dense matrix U*V^T
    """
    df["Prediction"] = np.sum(emb_user[df["userId"].values]*emb_movie[df["movieId"].values], axis=1)
    return df2matrix(df, emb_user.shape[0], emb_movie.shape[0], column_name="Prediction")
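
The trick used here is to gather one row of `emb_user` and one row of `emb_movie` per observed rating and take row-wise dot products, so the full $UV^T$ matrix is never built. The sketch below illustrates this with plain NumPy on random factors (made up for the example) and the encoded indices of the tiny dataset, and compares the result with the dense product.

```python
import numpy as np

rng = np.random.default_rng(1)
U, V = rng.random((7, 3)), rng.random((4, 3))
users = np.array([0, 0, 1, 1, 2, 2, 3, 3, 4, 4, 5, 6, 6])   # encoded userId
movies = np.array([0, 1, 1, 2, 0, 1, 0, 3, 0, 3, 3, 1, 3])  # encoded movieId

# one dot product per observed rating: rows of U and V are gathered by index
pred_observed = np.sum(U[users] * V[movies], axis=1)

# the same numbers read off the full dense product, which large data cannot afford
pred_dense = (U @ V.T)[users, movies]
print(np.allclose(pred_observed, pred_dense))  # True
```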

## Calculating the cost function

In [15]:
# Use vectorized computation for this function. No loops!
# Hint: use df2matrix and sparse_multiply
def cost(df, emb_user, emb_movie):
    """ Computes mean square error
    
    First compute prediction. Prediction for user i and movie j is
    the dot product of emb_user[i] and emb_movie[j]
    
    Arguments:
      df: dataframe with all data or a subset of the data
      emb_user: embeddings for users
      emb_movie: embeddings for movies
      
    Returns:
      error(float): this is the MSE
    """
    # YOUR CODE HERE
    df["Prediction"] = np.sum(emb_user[df["userId"].values]*emb_movie[df["movieId"].values], axis=1)
    error = np.mean(np.square(df.Prediction - df.rating))
#     raise NotImplementedError()
    return error

In [16]:
emb_user = np.ones((num_users, 3))
emb_movie = np.ones((num_movies, 3))
error = cost(df, emb_user, emb_movie)
assert(np.around(error, decimals=2) == 2.23)
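
The expected value 2.23 can be reproduced by hand: with all-ones embeddings and three factors, every prediction equals 3, so the error depends only on the 13 ratings in `tiny_training2.csv`. A short sketch of that arithmetic:

```python
import numpy as np

ratings = np.array([4, 5, 5, 3, 4, 4, 5, 2, 1, 4, 5, 1, 3])
pred = 3.0                              # dot product of two all-ones vectors of length 3
mse = np.mean((pred - ratings) ** 2)    # squared errors sum to 29 over 13 ratings
print(round(float(mse), 2))             # 2.23
```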

## Calculating gradient

In [17]:
def finite_difference(df, emb_user, emb_movie, ind_u=None, ind_m=None, k=None):
    """ Computes finite difference on MSE(U, V).
    
    This function is used for testing the gradient function. 
    """
    e = 0.000000001
    c1 = cost(df, emb_user, emb_movie)
    K = emb_user.shape[1]
    x = np.zeros_like(emb_user)
    y = np.zeros_like(emb_movie)
    if ind_u is not None:
        x[ind_u][k] = e
    else:
        y[ind_m][k] = e
    c2 = cost(df, emb_user + x, emb_movie + y)
    return (c2 - c1)/e

In [18]:
def gradient(df, Y, emb_user, emb_movie):
    """ Computes the gradient of the MSE with respect to the embeddings.
    
    First compute prediction. Prediction for user i and movie j is
    the dot product of emb_user[i] and emb_movie[j]
    
    Arguments:
      df: dataframe with all data or a subset of the data
      Y: sparse representation of df
      emb_user: embeddings for users
      emb_movie: embeddings for movies
      
    Returns:
      d_emb_user: gradient with respect to emb_user
      d_emb_movie: gradient with respect to emb_movie
    """
    # YOUR CODE HERE
    R = Y.sign().todense()
    delta = np.multiply(Y.todense(),R) - sparse_multiply(df,emb_user,emb_movie).todense()
    d_emb_user = -2*np.dot(delta,emb_movie)/len(Y.data)
    d_emb_movie = -2*np.dot(delta.transpose(),emb_user)/len(Y.data)
    
    return d_emb_user,d_emb_movie
#     raise NotImplementedError()

In [19]:
K = 3
emb_user = create_embedings(num_users, K)
emb_movie = create_embedings(num_movies, K)
Y = df2matrix(df, emb_user.shape[0], emb_movie.shape[0])
grad_user, grad_movie = gradient(df, Y, emb_user, emb_movie)

In [20]:
user=1
approx = np.array([finite_difference(df, emb_user, emb_movie, ind_u=user, k=i) for i in range(K)])
assert(np.all(np.abs(grad_user[user] - approx) < 0.0001))

In [21]:
movie=1
approx = np.array([finite_difference(df, emb_user, emb_movie, ind_m=movie, k=i) for i in range(K)])
assert(np.all(np.abs(grad_movie[movie] - approx) < 0.0001))

## Using gradient descent with momentum

In [22]:
# you can use a for loop to iterate through gradient descent
def gradient_descent(df, emb_user, emb_movie, iterations=100, learning_rate=0.01, df_val=None):
    """ Computes gradient descent with momentum (0.9) for a number of iterations.
    
    Prints training cost and validation cost (if df_val is not None) every 50 iterations.
    
    Returns:
    emb_user: the trained user embedding
    emb_movie: the trained movie embedding
    """
    Y = df2matrix(df, emb_user.shape[0], emb_movie.shape[0])
    # YOUR CODE HERE
    grad_u_moment,grad_m_moment = gradient(df,Y,emb_user,emb_movie)
    emb_user = np.array(np.subtract(emb_user,learning_rate*grad_u_moment))
    emb_movie = np.array(np.subtract(emb_movie,learning_rate*grad_m_moment))

    for i in range(iterations-1):
        grad_user,grad_movie = gradient(df,Y,emb_user,emb_movie)
        grad_u_moment = .9*grad_u_moment+.1*grad_user
        grad_m_moment = .9*grad_m_moment + .1*grad_movie
        emb_user = np.array(np.subtract(emb_user,learning_rate*grad_u_moment))
        emb_movie = np.array(np.subtract(emb_movie,learning_rate*grad_m_moment))
        
        if df_val is not None and i%50 ==0:
            print("Training cost:",cost(df, emb_user, emb_movie))
            print("Validation cost:",cost(df_val, emb_user, emb_movie))
#     raise NotImplementedError()
    return emb_user, emb_movie

In [23]:
emb_user = create_embedings(num_users, 3)
emb_movie = create_embedings(num_movies, 3)
emb_user, emb_movie = gradient_descent(df, emb_user, emb_movie, iterations=200, learning_rate=0.01)

In [24]:
train_mse = cost(df, emb_user, emb_movie)
assert(np.around(train_mse, decimals=2) == 0.53)
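
The momentum update used above keeps a running average of gradients and steps along that average. A stripped-down sketch on a one-dimensional quadratic shows the same update pattern; the function name, the learning rate, and the test function are illustrative choices, not part of the assignment.

```python
def momentum_descent(grad, x0, learning_rate=0.1, beta=0.9, iterations=500):
    """Gradient descent where each step follows a running average of gradients."""
    x = x0
    v = grad(x)                # the first step uses the plain gradient
    x = x - learning_rate * v
    for _ in range(iterations - 1):
        v = beta * v + (1 - beta) * grad(x)   # exponential moving average
        x = x - learning_rate * v
    return x

# minimise f(x) = (x - 3)^2, whose gradient is 2 (x - 3)
x_min = momentum_descent(lambda x: 2 * (x - 3), x0=0.0)
print(abs(x_min - 3) < 1e-6)
```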

## Predicting on new data
Now we should write a function that, given new data, is able to predict ratings. First we write a function that encodes new data. If a new user or item is present, that row should be removed, since collaborative filtering is not good at handling new users or new items. To help with this task, you could write an auxiliary function similar to `proc_col`.
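
The filtering step can be illustrated on bare id arrays before working with dataframes. This is a sketch with made-up ids; it reuses the first-appearance numbering of `proc_col` and drops the id that never appears in training.

```python
import numpy as np

train_users = np.array([11, 11, 2, 31])
val_users = np.array([2, 99, 11])     # 99 never appears in training

uniq = list(dict.fromkeys(train_users.tolist()))   # first-appearance order
u2idx = {u: i for i, u in enumerate(uniq)}

known = np.isin(val_users, uniq)                   # mask of rows to keep
encoded = np.array([u2idx[u] for u in val_users[known].tolist()])
print(encoded)   # [1 0]
```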

In [25]:
def encode_new_data(df_val, df_train):
    """ Encodes df_val with the same encoding as df_train.
    Rows whose user or movie does not appear in df_train are dropped.
    Returns:
    df_val: dataframe with the same encoding as df_train
    """
    # YOUR CODE HERE
    u2idx,_,_ = proc_col(df_train.userId)
    m2idx,_,_ = proc_col(df_train.movieId)
    # keep only rows with a known user and a known movie; copy so that the
    # assignments below do not write into a slice of the original dataframe
    known = df_val['userId'].isin(list(u2idx)) & df_val['movieId'].isin(list(m2idx))
    df_val = df_val.loc[known].copy()
    df_val['userId'] = df_val['userId'].map(u2idx)
    df_val['movieId'] = df_val['movieId'].map(m2idx)
    return df_val

In [26]:
df_t = pd.read_csv("tiny_training2.csv")
df_v = pd.read_csv("tiny_val2.csv")
df_v = encode_new_data(df_v, df_t)

In [27]:
assert(len(df_v.userId.unique())==2)

In [28]:
assert(len(df_v) == 2)

## Putting it all together
For this part you should get data from here
`wget http://files.grouplens.org/datasets/movielens/ml-latest-small.zip`

In [29]:
# Don't change this path; use a symlink if you have the data somewhere else
path = "ml-latest-small/"
data = pd.read_csv(path + "ratings.csv")
# taking the most recent ratings (sorted by timestamp) as validation data doesn't work, so let's just take 20%
# at random
np.random.seed(3)
msk = np.random.rand(len(data)) < 0.8
train = data[msk].copy()
val = data[~msk].copy()
df_train, num_users, num_movies = encode_data(train.copy())
df_val = encode_new_data(val.copy(), train.copy())
print(len(val), len(df_val))

20205 19507


In [30]:
K = 50
emb_user = create_embedings(num_users, K)
emb_movie = create_embedings(num_movies, K)
emb_user, emb_movie = gradient_descent(df_train, emb_user, emb_movie, iterations=2000, learning_rate=1, df_val=df_val)

Training cost: 12.339832098837672
Validation cost: 12.44785275740831
Training cost: 9.878326915111817
Validation cost: 10.015694669049159
Training cost: 7.125120820518513
Validation cost: 7.2608494209530665
Training cost: 5.167711000844929
Validation cost: 5.298338385431961
Training cost: 4.041209115702141
Validation cost: 4.165483203474745
Training cost: 3.3177954868811352
Validation cost: 3.430797749047875
Training cost: 2.8155930032536434
Validation cost: 2.918695474604298
Training cost: 2.451351434804249
Validation cost: 2.5477432509873417
Training cost: 2.1779959134714586
Validation cost: 2.2702402399715047
Training cost: 1.9669199451659873
Validation cost: 2.056784680233289
Training cost: 1.7999295462948757
Validation cost: 1.8886174999799386
Training cost: 1.665034965214191
Validation cost: 1.7533785022270296
Training cost: 1.5540982598296744
Validation cost: 1.642685795669171
Training cost: 1.4614491159128065
Validation cost: 1.5507049806418736
Training cost: 1.383039623475071


In [31]:
train_mse = cost(df_train, emb_user, emb_movie)
val_mse = cost(df_val, emb_user, emb_movie)
print(train_mse, val_mse)

0.7665730026459809 0.9104271876851561


In [32]:
train_mse = cost(df_train, emb_user, emb_movie)
assert(np.around(train_mse, decimals=2) == 0.77)

In [33]:
val_mse = cost(df_val, emb_user, emb_movie)
assert(np.around(val_mse, decimals=2) == 0.91)

#  Advanced Recommendation (Optional)
Use regression or a random forest to add other features to your recommendation algorithm.

* Here are potential features: 
    * user embedding, movie embedding (from the hw)
    * movie genres
    * daypart features
    
Other extensions: add regularization to gradient descent
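
For the regularization extension, one common choice is an L2 penalty on the embeddings. The sketch below is an assumption about how that could look, with a made-up function name and plain index arrays instead of a dataframe. The penalty `lambd * (||U||^2 + ||V||^2)` contributes `2 * lambd * U` and `2 * lambd * V` to the respective gradients.

```python
import numpy as np

def cost_l2(ratings, users, movies, emb_user, emb_movie, lambd):
    """MSE on observed ratings plus lambd * (||U||^2 + ||V||^2)."""
    pred = np.sum(emb_user[users] * emb_movie[movies], axis=1)
    mse = np.mean((pred - ratings) ** 2)
    penalty = lambd * (np.sum(emb_user ** 2) + np.sum(emb_movie ** 2))
    return mse + penalty

# toy check: predictions are exact (3 = 1+1+1), so only the penalty remains
emb_user, emb_movie = np.ones((2, 3)), np.ones((1, 3))
users, movies = np.array([0, 1]), np.array([0, 0])
ratings = np.array([3.0, 3.0])
print(cost_l2(ratings, users, movies, emb_user, emb_movie, lambd=0.1))  # 0.1 * (6 + 3), about 0.9
```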

### Bias

In [30]:
def create_bias(n):
    """ Create a numpy matrix of zeros of shape (n, 1)
    
    The biases are initialized with zeros
    
    Arguments:
    n: number of items/users 
    
    Returns:
    bias: numpy array of shape (n, 1)
    """
    bias = np.zeros((n,1))
    return bias

In [32]:
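def cost_regularized(df, emb_user, emb_movie,bias_user,bias_movie,lambd):
    """ Computes mean square error for the model with biases
    
    First compute prediction. Prediction for user i and movie j is
    the dot product of emb_user[i] and emb_movie[j], plus bias_user[i] and bias_movie[j]
    
    Arguments:
      df: dataframe with all data or a subset of the data
      emb_user: embeddings for users
      emb_movie: embeddings for movies
      bias_user: bias for each user, shape (num_users, 1)
      bias_movie: bias for each movie, shape (num_movies, 1)
      lambd: regularization parameter (not used here: the reported error is the plain MSE)
      
    Returns:
      error(float): this is the MSE
    """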
    # YOUR CODE HERE
    df["Prediction"] = np.squeeze(np.sum(emb_user[df["userId"].values]*emb_movie[df["movieId"].values], axis=1)[:,None]+bias_user[df['userId'].values]+bias_movie[df['movieId'].values])
    error = np.mean(np.square(df.Prediction - df.rating))
#     raise NotImplementedError()
    return error

In [70]:
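def gradient_regularized(df, Y, emb_user, emb_movie,bias_user,bias_movie,lambd):
    """ Computes the gradient for the model with biases and L2 regularization.
    
    First compute prediction. Prediction for user i and movie j is
    the dot product of emb_user[i] and emb_movie[j], plus bias_user[i] and bias_movie[j]
    
    Arguments:
      df: dataframe with all data or a subset of the data
      Y: sparse representation of df
      emb_user: embeddings for users
      emb_movie: embeddings for movies
      bias_user: bias for each user, shape (num_users, 1)
      bias_movie: bias for each movie, shape (num_movies, 1)
      lambd: regularization parameter
      
    Returns:
      d_emb_user
      d_emb_movie
      d_bias_user
      d_bias_movie
    """
    # YOUR CODE HERE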
    R = Y.sign().todense()
    delta = np.multiply(Y.todense(),R) - sparse_multiply(df,emb_user,emb_movie).todense()-np.multiply(bias_user,R)-np.multiply(bias_movie.T,R)
    d_emb_user = -2*np.dot(delta,emb_movie)/len(Y.data) + 2*lambd*emb_user
    d_emb_movie = -2*np.dot(delta.transpose(),emb_user)/len(Y.data) +2*lambd*emb_movie
    d_bias_user = -2*np.sum(delta,axis=1)/len(Y.data)
    d_bias_movie = -2*np.sum(delta,axis=0).T/len(Y.data)
    
    return d_emb_user,d_emb_movie,d_bias_user,d_bias_movie
#     raise NotImplementedError()

In [79]:
# you can use a for loop to iterate through gradient descent
def gradient_descent_reg(df, emb_user, emb_movie,bias_user,bias_movie,iterations=100, learning_rate=0.01, df_val=None,lambd=.01):
    """ Computes gradient descent with momentum (0.9) for a number of iterations.
    
    Prints training cost and validation cost (if df_val is not None) every 50 iterations.
    
    Returns:
    emb_user: the trained user embedding
    emb_movie: the trained movie embedding
    bias_user: the trained user bias
    bias_movie: the trained movie bias
    """
    Y = df2matrix(df, emb_user.shape[0], emb_movie.shape[0])
    # YOUR CODE HERE
    grad_u_moment,grad_m_moment ,grad_ub_moment,grad_mb_moment= gradient_regularized(df,Y,emb_user,emb_movie,bias_user,bias_movie,lambd)

    for i in range(iterations):
        grad_user,grad_movie,grad_user_bias,grad_movie_bias = gradient_regularized(df,Y,emb_user,emb_movie,bias_user,bias_movie,lambd)
        grad_u_moment = .9*grad_u_moment+.1*grad_user
        grad_m_moment = .9*grad_m_moment + .1*grad_movie
        grad_ub_moment = .9*grad_ub_moment+.1*grad_user_bias
        grad_mb_moment = .9*grad_mb_moment + .1*grad_movie_bias        
        
        emb_user = np.array(np.subtract(emb_user,learning_rate*grad_u_moment))
        emb_movie = np.array(np.subtract(emb_movie,learning_rate*grad_m_moment))
        bias_user = np.array(np.subtract(bias_user,learning_rate*grad_ub_moment))
        bias_movie = np.array(np.subtract(bias_movie,learning_rate*grad_mb_moment))
        
        if df_val is not None and i%50 ==0:
            print("Training cost:",cost_regularized(df, emb_user, emb_movie,bias_user,bias_movie,lambd))
            print("Validation cost:",cost_regularized(df_val, emb_user, emb_movie,bias_user,bias_movie,lambd))
#     raise NotImplementedError()
    return emb_user, emb_movie, bias_user,bias_movie

In [86]:
# Don't change this path; use a symlink if you have the data somewhere else
path = "ml-latest-small/"
data = pd.read_csv(path + "ratings.csv")
# taking the most recent ratings (sorted by timestamp) as validation data doesn't work, so let's just take 20%
# at random
np.random.seed(3)
msk = np.random.rand(len(data)) < 0.8
train = data[msk].copy()
val = data[~msk].copy()
df_train, num_users, num_movies = encode_data(train.copy())
df_val = encode_new_data(val.copy(), train.copy())
print(len(val), len(df_val))

20205 19507


In [81]:
num_users

671

In [82]:
num_movies

8442

In [85]:
K = 50
emb_user = create_embedings(num_users, K)
emb_movie = create_embedings(num_movies, K)
bias_user = create_bias((num_users))
bias_movie = create_bias((num_movies))

In [53]:
emb_user.shape

(671, 50)

In [54]:
emb_movie.shape

(8442, 50)

In [55]:
bias_user.shape

(671, 1)

In [56]:
bias_movie.shape

(8442, 1)

In [61]:
Y = df2matrix(df_train, emb_user.shape[0], emb_movie.shape[0])

In [87]:
emb_user, emb_movie, bias_user,bias_movie = gradient_descent_reg(df_train, emb_user, emb_movie,bias_user,bias_movie,iterations=2000,learning_rate=1, df_val=df_val,lambd=1)

Training cost: 12.230724796857148
Validation cost: 12.340560392540814
Training cost: 6.902039588580828
Validation cost: 7.041636459315589
Training cost: 4.524842216160127
Validation cost: 4.635745256699486
Training cost: 3.364780189037686
Validation cost: 3.458644312404154
Training cost: 2.715302690140108
Validation cost: 2.7982122694071387
Training cost: 2.312102255737825
Validation cost: 2.387494108266488
Training cost: 2.0417631370183598
Validation cost: 2.1116745803974415
Training cost: 1.8490628525866992
Validation cost: 1.9147995458781697
Training cost: 1.70478093137466
Validation cost: 1.767250979614349
Training cost: 1.5924258706359546
Validation cost: 1.6523037085626466
Training cost: 1.5021631241567883
Validation cost: 1.559973113477183
Training cost: 1.427832741039832
Validation cost: 1.4839945590752548
Training cost: 1.3654017657158868
Validation cost: 1.4202571190085345
Training cost: 1.3121229190713581
Validation cost: 1.3659527367125712
Training cost: 1.2660577594151192


---

In [None]:
### Metadata

In [None]:
movie = pd.read_csv('ml-latest-small/movies.csv')

In [None]:
movie.head()

In [None]:
links = pd.read_csv('ml-latest-small/links.csv')

In [None]:
links.head()

In [None]:
tags = pd.read_csv('ml-latest-small/tags.csv')

In [None]:
tags.head()

In [None]:
### Creating features

In [None]:
#### Genre

In [None]:
genre = ["Action","Adventure","Animation","Children","Comedy","Crime","Documentary","Drama","Fantasy",
 "Film-Noir","Horror","Musical","Mystery","Romance","Sci-Fi","Thriller","War","Western"]

In [None]:
movie['genres'] = movie['genres'].str.split('|').apply(lambda x:[1 if i in x else 0 for i in genre]) 
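
The line above turns each pipe-separated `genres` string into a fixed-length 0/1 vector ordered like the `genre` list. A small standalone sketch of the same transformation, using a shortened three-genre vocabulary and a made-up helper name:

```python
def multi_hot(genres_field, vocabulary):
    """1 where a vocabulary genre appears in the pipe-separated field, else 0."""
    present = set(genres_field.split('|'))
    return [1 if g in present else 0 for g in vocabulary]

short_genre = ["Action", "Comedy", "Drama"]   # shortened list for illustration
print(multi_hot("Comedy|Drama|Romance", short_genre))  # [0, 1, 1]
```

Genres outside the vocabulary, such as `Romance` in this shortened example, are ignored.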

In [None]:
movie.head()

In [None]:
tmp = movie.copy()

In [None]:
tmp[genre] = pd.DataFrame(movie.genres.values.tolist())

In [None]:
tmp = tmp.drop(['genres'],axis=1)

In [None]:
movie_tf = tmp.copy()

In [None]:
del(tmp)

In [None]:
#### Release year

In [None]:
movie_tf['release_year'] = movie_tf.title.apply(lambda x: x[-5:-1])

In [None]:
movie_tf = movie_tf.drop(['title'],axis=1)

In [None]:
#### Number of genres

In [None]:
movie_tf['num_genres'] = np.sum(movie_tf.values[:,1:19],axis=1) #adding total number of genres for each movie

In [None]:
data_tf = pd.merge(data,movie_tf,how = 'left',on = 'movieId') #joining with ratings data

In [None]:
#### Year of rating

In [None]:
import datetime

In [None]:
data_tf['rating_year'] = data_tf.timestamp.apply(lambda x:datetime.datetime.fromtimestamp(x).year) #converting integer timestamp to year of rating

In [None]:
data_tf = data_tf.drop(['timestamp'],axis=1) #dropping the raw timestamp now that the year is extracted
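
Note that `datetime.datetime.fromtimestamp(x)` without a `tz` argument converts to the local time zone of the machine, so ratings made close to New Year can land in different years on different machines. A hedged alternative, assuming UTC is the desired convention, is to pass the time zone explicitly; the helper name below is made up for illustration.

```python
import datetime

def rating_year_utc(ts):
    """Year of a Unix timestamp, interpreted in UTC regardless of the local time zone."""
    return datetime.datetime.fromtimestamp(ts, tz=datetime.timezone.utc).year

print(rating_year_utc(0))            # 1970
print(rating_year_utc(1500000000))   # 2017
```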

In [None]:
data_tf['release_year'] = data_tf.release_year.apply(lambda x: int(x) if x.isdigit() else 1900) #parsing year of release as an integer: 10 movies don't have a year, so they are assigned 1900

In [None]:
#### Difference between rating and release year

In [None]:
data_tf['year_diff'] = data_tf['rating_year'] - data_tf['release_year']

In [None]:
#### Years elapsed since release

In [None]:
data_tf['years_since_release'] = 2016 - data_tf['release_year']

In [None]:
#### Fixing data types

In [None]:
data_tf['num_genres'] = data_tf.num_genres.astype('int')

In [None]:
data_tf.shape

In [None]:
data_tf[['userId', 'movieId','Action', 'Adventure', 'Animation',
       'Children', 'Comedy', 'Crime', 'Documentary', 'Drama', 'Fantasy',
       'Film-Noir', 'Horror', 'Musical', 'Mystery', 'Romance', 'Sci-Fi',
       'Thriller', 'War', 'Western', 'release_year', 'num_genres',
       'rating_year', 'year_diff', 'years_since_release']] = data_tf[['userId', 'movieId','Action', 'Adventure', 'Animation',
       'Children', 'Comedy', 'Crime', 'Documentary', 'Drama', 'Fantasy',
       'Film-Noir', 'Horror', 'Musical', 'Mystery', 'Romance', 'Sci-Fi',
       'Thriller', 'War', 'Western', 'release_year', 'num_genres',
       'rating_year', 'year_diff', 'years_since_release']].astype('uint32')

In [None]:
#### Adding decades

In [None]:
data_tf['decade'] = data_tf.release_year.apply(lambda x: int(str(x)[2]) if x != 1900 else 11)

In [None]:
movie_list1 = list(data_tf.movieId.unique())

In [None]:
movie_list2 = list(train.movieId.unique())

In [None]:
movie_list3 = list(set(movie_list1)-set(movie_list2))

In [None]:
data_tf.shape

In [None]:
data_tf = data_tf.loc[data_tf['movieId'].isin(movie_list2)] #keeping only those movies for which we have the embeddings

In [None]:
data_tf.shape
