- In this notebook a simple SVD approach is used to fit a low rank matrix based on a given training set. The result is then used to get predictions for a separate test set. 
- In addition the extracted features of the SVD are used to analyze similarities among movie titles

- The movie & rating data is based on the MovieLens 100K (09/2018) data set

In [1]:
import numpy as np
import pandas as pd

from utility_matrix_prep import create_utility_matrix, get_merged_ratings_df

from scipy.linalg import svd, sqrtm

from sklearn.neighbors import NearestNeighbors

### load dataset

In [2]:
#load ratings data:
path = '/media/vincent/harddrive/Movielens_data/ml-small_09_2018/'
file_name = 'ratings.csv'

rating_data = pd.read_csv(path + file_name,  encoding = "ISO-8859-1", header=0)

#load movies data:
movies_data = pd.read_csv(path + 'movies.csv',encoding = "ISO-8859-1", header=0)
#adjust some col-labels:
movies_data['movie_title'] = movies_data['title']
movies_data = movies_data[['movieId','movie_title','genres']]
    


### create train & test set 

In [3]:
def train_test_split(data_df, ratio = 0.2):
    
    '''
    creates test_set based on most recent x% of ratings for each user 
    remaining entries are used as training set
    '''
    
    test_set = pd.DataFrame(columns=data_df.columns)
    train_set = pd.DataFrame(columns=data_df.columns)

    grouped_ratings = data_df.groupby('userId')
    
    #get most recent ratings of each user and split into train & testset:
    for name, group in grouped_ratings:
        set_size = int(group.shape[0] * ratio)
        temp_test = group.sort_values('timestamp').iloc[group.shape[0]-1-set_size:] #slice last (most recent) rows of sorted df
        temp_train = group.sort_values('timestamp').iloc[:group.shape[0]-2-set_size] #slice first rows of sorted df
        
        test_set = pd.concat([test_set, temp_test])
        train_set = pd.concat([train_set, temp_train])
        
    return train_set, test_set

The data set is sorted by the given timestamps. The most recent ratings of each user are used as test set (most recent 20%)

In [4]:
#split data:
train_ratings, test_ratings = train_test_split(rating_data, ratio = 0.2)
print('shape of train_ratings: ', train_ratings.shape)
print('shape of test_ratings: ', test_ratings.shape)

# prepare test_set:
print('\n test set before processing: \n  \n', test_ratings.head(2))
#call custom function:
test_ratings = get_merged_ratings_df(test_ratings,movies_data)
test_ratings['userId'] = test_ratings['userId'].apply(lambda row: 'User_' + str(row))
print('\n test set after processing: \n  \n', test_ratings.head(2))



shape of train_ratings:  (79676, 4)
shape of test_ratings:  (20550, 4)

 test set before processing: 
  
     userId movieId  rating  timestamp
192      1    2959     5.0  964983282
176      1    2654     5.0  964983393

 test set after processing: 
  
    userId           movie_title  rating
0  User_1     Fight Club (1999)     5.0
1  User_1  Wolf Man, The (1941)     5.0


### create utility matrix

Based on the training set, the utility matrix is created which contains the ratings for each user as a row entry. Columns indicate the different movie titles.

In [5]:
#call custom function to create utility matrix
train_utilmatrix = create_utility_matrix(ratings_df = train_ratings, movies_df = movies_data)
print('shape of utility matrix: ', train_utilmatrix.shape)

#get glimpse on utility matrix:
train_utilmatrix.iloc[0:5,:10]

No. of duplicated entries:  0
shape of utility matrix:  (610, 9737)


Unnamed: 0,Toy Story (1995),Jumanji (1995),Grumpier Old Men (1995),Waiting to Exhale (1995),Father of the Bride Part II (1995),Heat (1995),Sabrina (1995),Tom and Huck (1995),Sudden Death (1995),GoldenEye (1995)
User_1,4.0,,4.0,,,4.0,,,,
User_2,,,,,,,,,,
User_3,,,,,,,,,,
User_4,,,,,,,,,,
User_5,4.0,,,,,,,,,


### SVD application

In [6]:
def apply_svd(util_matrix_df, n_dim = 20):
    
    '''
    function applies svd on given matrix and returns a low rank matrix approximation based on given number of dimensions
    '''
    
    #store user indices & movie indices:
    movies_idx = util_matrix_df.columns.values
    user_idx = util_matrix_df.index.values
    
    #handle NaN values:
    utilMatrix = util_matrix_df.values
    # the nan or unavailable entries are masked
    mask = np.isnan(utilMatrix)
    #create masked array:
    masked_arr = np.ma.masked_array(utilMatrix, mask)
    #calculate mean without considering the NaN values:
    movie_means = np.mean(masked_arr, axis=0)
    
    # replace NaN entries by average rating of each movie
    utilMatrix = masked_arr.filled(movie_means)
    
    #Mean Normalization of data:
    mean_matrix = np.tile(movie_means, (utilMatrix.shape[0],1))
    #subtract mean ratings of utility matrix (zero has now a meaning! -> equals the average rating)
    utilMatrix = utilMatrix - mean_matrix
    
    #get non-masked matrix:
    utilMatrix = np.ma.getdata(utilMatrix)
    
    # Singular-value decomposition
    U, sig, VT = svd(utilMatrix)
    
    #populate sig:
    sig=np.diag(sig)
        
    #keep only "n" factors:
    U=U[:,0:n_dim]
    sig=sig[0:n_dim,0:n_dim]
    VT=VT[0:n_dim,:]
    
    #get matrix squareroot:
    sig= sqrtm(sig)

    # reconstruct matrix
    Usig = np.dot(U,sig)
    sigV=np.dot(sig,VT)
    lowMatrix = np.dot(Usig,sigV)
       
    #add mean ratings:
    lowMatrix = lowMatrix + mean_matrix
    #again get non-masked matrix
    lowMatrix = np.ma.getdata(lowMatrix)
    
    return lowMatrix, U, sig, VT, movies_idx, user_idx



In [7]:
def find_kNNs_movies(movie_latent_feat, movie_title: str, movie_lookup: np.array, k = 5):
    
    '''function maps a movie_title with the latent_feature_matrix and 
        searches for k nearest neighbours within the laten_features_matrix '''
    
    #check for matched search term:
    movie_indices = np.where(np.core.defchararray.find(movie_lookup.astype(str), movie_title)>=0)
    
    if len(movie_indices[0]) > 1:
        print('There are multiple matches \n Try another search term')
        return None
    
    elif len(movie_indices[0]) == 0:
        print('There are no matches for your search term')
        return None
    
    else:
        movie_idx = movie_indices[0][0]
        
    #transpose feature matrix to calculate distance matrix:
    feat_matrix = movie_latent_feat.transpose()
    #create vector to match:
    movie_vector = feat_matrix[movie_idx]
    movie_vector = movie_vector.reshape(1,len(movie_vector))
    #delete vector of the feature matrix (this way the movie is not fitted in the KNN model):
    feat_matrix = np.delete(feat_matrix, (movie_idx), axis=0)
    
    #create KNN model and fit data:
    neigh = NearestNeighbors(k, metric = 'cosine')
    neigh.fit(feat_matrix)
    #get k nearest neighbours: 
    matches_indices = neigh.kneighbors(movie_vector, k, return_distance=False)
    
    #get corresponding movie titles:
    matched_movie_titles = unique_movies[matches_indices] #creates 2d array
    
    return matched_movie_titles[0]
    

In [8]:
def get_movie_matches(matched_movies_array_list):
    
    '''
    given a list of movie arrays, the function finds the movies which occur in all movie arrays
    '''
        
    for i in range(len(matched_movies_array_list)):
        if i < 1:
            matched_movies_init = matched_movies_array_list[i]
        else:
            matched_movies = np.append(matched_movies_init, matched_movies_array_list[i], axis=0)
            matched_movies_init = matched_movies
    
    #threshold to filter the duplicated movies --> e.g. if there are 3 movie arrays -> the movie count as to be > 2 such that the movie is shared among all arrays
    thresh = len(matched_movies_array_list) - 1 

    #find duplicates & their count:
    movies, count = np.unique(matched_movies, return_counts=True)
    duplicates = movies[count > thresh]
    
    print('List of movies shared by all movie requests: \n')
    print(duplicates)
    
    return duplicates

    

#### apply SVD on dataset & evaluate results

In [9]:
def lookup_preds(row, util_matrix = None):
    
    '''function looks up values in util_matrix
        if there is no match since the movie does not have entries in the util_matrix,
        return the mean rating of the user's existing ratings
    '''
    
    #slice correct user_id
    df_row_slice = util_matrix[util_matrix.index == row['userId']]
    
    #slice value of column match:
    try:
        df_col_val = df_row_slice.at[row['userId'],row['movie_title']]
    except:
        df_col_val = df_row_slice.mean(axis=1)
    
    return df_col_val


def rmse(df):
    diff_sqt = (df['rating'] - df['preds'])**2
    rmse = np.sqrt(diff_sqt.sum()/len(diff_sqt))
    print('RMSE: ', rmse)
    return rmse


SVD is applied on the utility matrix: A low rank approximation of the utility matrix is obtained. In addition the latent features are returned.

In [10]:
#apply SVD on training set:
train_matrix_approx, latent_U, _, latent_VT, movie_idx, users_idx = apply_svd(train_utilmatrix, n_dim = 20)

#add index & column labels & get glimpse on matrix approximation:
train_matrix_approx_df = pd.DataFrame(train_matrix_approx, index=users_idx, columns=movie_idx)
train_matrix_approx_df.iloc[:5,:5]

Unnamed: 0,Toy Story (1995),Jumanji (1995),Grumpier Old Men (1995),Waiting to Exhale (1995),Father of the Bride Part II (1995)
User_1,4.144042,3.617682,3.507549,2.133603,3.165202
User_2,3.888115,3.425046,3.255152,2.122729,3.024906
User_3,3.859009,3.384688,3.233519,2.125313,3.001892
User_4,3.719572,3.331433,3.295986,2.126353,3.087875
User_5,3.924441,3.425925,3.222067,2.116197,3.02009


For the data in the test set the corresponding predictions are looked up in the fitted matrix. If no predictions are available for a given movie title in the fitted matrix, the average rating of an user is used as prediction

In [11]:
# call function on each row to get predictions:
test_ratings['preds'] = test_ratings.apply(lambda x: lookup_preds(x, train_matrix_approx_df), axis=1)

print('\n', test_ratings.head(3), '\n')

#get rmse:
rmse_test = rmse(test_ratings)


    userId           movie_title  rating     preds
0  User_1     Fight Club (1999)     5.0  4.071172
1  User_1  Wolf Man, The (1941)     5.0  3.167932
2  User_1        Dracula (1931)     4.0  3.235683 

RMSE:  1.3649622078094938


#### use latent features of movies to find similar movies

For the movies 'Inception', 'Dark Knight Rises' & 'Hangover', find movies which are most similar. Based on the list of similar movies check which are shared by all three titles:

In [12]:
#get unique movie titles:
unique_movies = np.array(train_matrix_approx_df.columns)

#get the 30 most similar movies for each title:
movies_matched = find_kNNs_movies(latent_VT, 'Inception (2010)', unique_movies, k=30)
movies_matched2 = find_kNNs_movies(latent_VT, 'Dark Knight Rises, The (2012)', unique_movies, k=30)
movies_matched3 = find_kNNs_movies(latent_VT, 'Hangover, The (2009)', unique_movies, k=30)

li_matched_movies = [movies_matched,movies_matched2,movies_matched3]

#get movies which are "neighbours" for all given titles:
shared_movies_arr = get_movie_matches(li_matched_movies)

List of movies shared by all movie requests: 

['Dark Knight, The (2008)' 'How To Change The World (2015)'
 'Purge: Anarchy, The (2014)']


In [13]:
movies_matched[:5]

array(['How To Change The World (2015)', 'Dark Knight, The (2008)',
       'DeadHeads (2011)', 'Wave, The (Welle, Die) (2008)',
       'Gifted (2017)'], dtype=object)