# Movie recommendation system

It uses a **Matrix Factorization (MF) algorithm for item-to-item collaborative filtering (MF IICBF)**.

It's a model-based recommender system. It's collaborative filtering, but with some additional ideas.  
The main idea is to split all items (movies) into groups called latent features. For example, we could group movies by genre, such as Action and Horror, and then assign these groups to users: Bob likes 20% of Action movies and 80% of Horror movies (so all the movies he watched and liked could be grouped into 2 features).

The main problem is how to split all movies into these groups.  
There is a technique called PCA (Principal Component Analysis), a statistical procedure that reduces the dimension of the user-item matrix while keeping most of the important information. In simple words, it is a procedure for coming up with latent features (finding features to group all movies by). This notebook uses the closely related SVD (singular value decomposition) for the same purpose.
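To make the latent-feature idea concrete, here is a minimal sketch with made-up numbers. It assumes two hypothetical latent features (Action and Horror), Bob's affinity for each, and a loading of each movie on the same features; a predicted score is then the dot product of the two vectors. The movie labels and all values are illustrative and are not taken from the dataset.

```python
import numpy as np

# Hypothetical latent features, in order: [Action, Horror]
bob = np.array([0.2, 0.8])

# Hypothetical loadings of three movies on the same two features
movie_features = {
    "Pure action movie": np.array([1.0, 0.0]),
    "Pure horror movie": np.array([0.0, 1.0]),
    "Action-horror mix": np.array([0.5, 0.5]),
}

# Predicted affinity = dot product of user vector and movie vector
scores = {title: float(np.dot(bob, vec)) for title, vec in movie_features.items()}

for title, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{title}: {score:.2f}")
```

Matrix factorization learns vectors like these from the ratings themselves, so the features do not have to correspond to named genres.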


At the beginning we do the initial setup and imports.



In [69]:

import pandas as pd
import numpy as np  
from sklearn.metrics import pairwise_distances
from scipy.sparse.linalg import svds # we will do SVD with scipy

#initial output setup
pd.set_option('display.max_rows', 70)
pd.set_option('display.max_columns', 70)
pd.set_option('display.width', 500)

Now we read the MovieLens datasets with movies and user ratings.  

In [70]:
# Reading movies file
movies = pd.read_csv('data/movies.csv', sep=',', encoding='latin-1', usecols=['movieId','title','genres'])

# Reading ratings file
ratings = pd.read_csv('data/ratings.csv', sep=',', encoding='latin-1', usecols=['userId','movieId','rating','timestamp'])

To find recommendations for a specific user, we need that user's movie ratings.  
I'm going to create a new user who gives top ratings to three "Star Wars" movies and then get recommendations for that user.

In [71]:

# Find all Star Wars movies
print(movies[movies.title.str.contains("Star Wars")])

# create a new user and add ratings for Star Wars movies.
max_user_id = ratings.userId.max()
my_user_id = max_user_id+1
my_rating = pd.DataFrame([[my_user_id,260,5],[my_user_id,1196,5],[my_user_id,1210,5]],columns=['userId','movieId',"rating"])
# append ratings to the existing dataset (DataFrame.append was removed in pandas 2.0)
ratings = pd.concat([ratings, my_rating], ignore_index=True)

# check last 5 rows to ensure user properly added
ratings.tail()

      movieId                                              title                                genres
224       260          Star Wars: Episode IV - A New Hope (1977)               Action|Adventure|Sci-Fi
898      1196  Star Wars: Episode V - The Empire Strikes Back...               Action|Adventure|Sci-Fi
911      1210  Star Wars: Episode VI - Return of the Jedi (1983)               Action|Adventure|Sci-Fi
1979     2628   Star Wars: Episode I - The Phantom Menace (1999)               Action|Adventure|Sci-Fi
3832     5378  Star Wars: Episode II - Attack of the Clones (...          Action|Adventure|Sci-Fi|IMAX
5896    33493  Star Wars: Episode III - Revenge of the Sith (...               Action|Adventure|Sci-Fi
6823    61160                   Star Wars: The Clone Wars (2008)     Action|Adventure|Animation|Sci-Fi
7367    79006  Empire of Dreams: The Story of the 'Star Wars'...                           Documentary
8683   122886  Star Wars: Episode VII - The Force Awakens (2015)  Action|

Unnamed: 0,userId,movieId,rating,timestamp
100834,610,168252,5.0,1493846000.0
100835,610,170875,3.0,1493846000.0
100836,611,260,5.0,
100837,611,1196,5.0,
100838,611,1210,5.0,


Now let's investigate the ratings dataset.  
Sparsity is the share of cells in the user-movie matrix that have no rating.

In [72]:
n_users = ratings.userId.unique().shape[0] # number of unique users
n_movies = ratings.movieId.unique().shape[0] # number of unique movies
print('Number of users = ' + str(n_users) + ' | Number of movies = ' + str(n_movies))

sparsity = round(1.0 - len(ratings) / float(n_users * n_movies), 3)
print('The sparsity level of the dataset is ' +  str(sparsity * 100) + '%')

Number of users = 611 | Number of movies = 9724
The sparsity level of the dataset is 98.3%
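The same calculation can be checked by hand on a tiny example. The sketch below uses a made-up ratings frame (3 users, 4 movies, 5 ratings); the `toy_` names are illustrative and not part of the notebook.

```python
import pandas as pd

# Toy ratings: 3 users, 4 movies, 5 ratings out of 12 possible cells
toy_ratings = pd.DataFrame({
    "userId":  [1, 1, 2, 3, 3],
    "movieId": [10, 20, 10, 30, 40],
    "rating":  [4.0, 5.0, 3.0, 2.0, 4.5],
})

toy_users = toy_ratings["userId"].nunique()
toy_movies = toy_ratings["movieId"].nunique()

# Density: share of user-movie cells that hold a rating
toy_density = len(toy_ratings) / (toy_users * toy_movies)
# Sparsity: share of cells that are empty
toy_sparsity = 1.0 - toy_density

print(f"density = {toy_density:.3f}, sparsity = {toy_sparsity:.3f}")
```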


Now we create a user-to-movie pivot table that holds a value for every user and every movie; missing ratings are filled with 0.  
Later we will use this table for MF.

In [73]:

ratings_pivot = ratings.pivot(index = 'userId', columns ='movieId', values = 'rating').fillna(0)
ratings_pivot.head()

movieId,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,34,36,38,...,187031,187541,187593,187595,187717,188189,188301,188675,188751,188797,188833,189043,189111,189333,189381,189547,189713,190183,190207,190209,190213,190215,190219,190221,191005,193565,193567,193571,193573,193579,193581,193583,193585,193587,193609
userId
1,4.0,0.0,4.0,0.0,0.0,4.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.5,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,3.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,4.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4.0,4.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
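The pivot table above is stored densely: every missing user-movie pair becomes an explicit 0.0 cell. As a minimal sketch on made-up data, the same ratings could instead be held in a SciPy sparse matrix, which `svds` also accepts. The toy frame and the variable names are illustrative assumptions, not part of the notebook.

```python
import pandas as pd
from scipy.sparse import csr_matrix

toy = pd.DataFrame({
    "userId":  [1, 1, 2, 3],
    "movieId": [10, 30, 20, 10],
    "rating":  [4.0, 5.0, 3.0, 2.5],
})

# Dense pivot: missing pairs become explicit 0.0 cells
dense = toy.pivot(index="userId", columns="movieId", values="rating").fillna(0)

# Sparse alternative: map IDs to 0-based positions, store only observed ratings
rows = toy["userId"].astype("category").cat.codes.to_numpy()
cols = toy["movieId"].astype("category").cat.codes.to_numpy()
sparse = csr_matrix(
    (toy["rating"].to_numpy(), (rows, cols)),
    shape=(int(rows.max()) + 1, int(cols.max()) + 1),
)

print(dense)
print("stored values:", sparse.nnz, "of", dense.size, "cells")
```

Both forms treat an unrated movie as 0, so the factorization cannot tell "not rated" apart from a very low rating.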


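Now we need to prepare the data.  
A common preparation step is to express every rating as a deviation from the user's mean rating: calculate the mean rating for every user, then subtract it from each value in that user's row of the original table.  

The cell below only extracts the NumPy array from the pivot table. No mean is subtracted, so the factorization that follows works on the raw zero-filled ratings.  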


In [74]:
user_ratings_pivot = ratings_pivot.values # returns NumPy representation of ratings pivot table (dataset's array representation)
print(user_ratings_pivot)

[[4. 0. 4. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 ...
 [3. 0. 0. ... 0. 0. 0.]
 [5. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]]
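The array above holds the zero-filled ratings unchanged. For comparison, a minimal sketch of per-user mean-centering could look like this. It runs on a small made-up matrix; the names `toy_matrix`, `user_means`, and `toy_demeaned` are illustrative, and taking each mean over all columns (zeros included) is one possible convention, not necessarily the one intended here.

```python
import numpy as np

toy_matrix = np.array([
    [4.0, 0.0, 4.0, 0.0],
    [0.0, 5.0, 0.0, 3.0],
    [1.0, 0.0, 0.0, 2.0],
])

# Mean rating per user (row), computed over every column, zeros included
user_means = toy_matrix.mean(axis=1)

# Subtract each user's mean from that user's row
toy_demeaned = toy_matrix - user_means.reshape(-1, 1)

print(user_means)
print(toy_demeaned)
```

If the centered matrix were factorized instead, `user_means.reshape(-1, 1)` would have to be added back to the reconstructed matrix to return to the original rating scale.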


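Now we are going to use SVD (singular value decomposition).  
It's an algorithm for matrix factorization.  
It decomposes the input matrix into three matrices: a user-to-features matrix, a diagonal matrix of singular values, and a features-to-movies matrix. Keeping only the top k latent features gives a lower-rank approximation of the input.  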


In [75]:
# apply truncated SVD to the ratings matrix and get as a result:
# U - lower-rank matrix with user-to-features values
# sigma - singular values (converted to a diagonal matrix below)
# Vt - lower-rank matrix with features-to-movies values
U, sigma, Vt = svds(user_ratings_pivot, k = 50)

sigma = np.diag(sigma) # scipy returns a 1-D array instead of a diagonal matrix, so we need to convert

# check shapes of results
print(sigma.shape)
print(U.shape)
print(Vt.shape)

(50, 50)
(611, 50)
(50, 9724)
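The shapes are easier to follow on a small example. The sketch below applies `svds` to a made-up 6 x 8 matrix with k = 3, rebuilds the rank-3 approximation, and compares the singular values with those from a full SVD. The `toy_` names and the random data are illustrative assumptions.

```python
import numpy as np
from scipy.sparse.linalg import svds

rng = np.random.default_rng(0)
toy_matrix = rng.random((6, 8))  # made-up 6 x 8 matrix

# Truncated SVD keeping k = 3 latent features (k must be < min(6, 8))
toy_U, toy_s, toy_Vt = svds(toy_matrix, k=3)

# Rank-3 approximation of the original matrix
toy_approx = toy_U @ np.diag(toy_s) @ toy_Vt

# The k singular values should match the 3 largest from a full SVD
full_s = np.linalg.svd(toy_matrix, compute_uv=False)

print(toy_U.shape, toy_s.shape, toy_Vt.shape)
rel_error = np.linalg.norm(toy_matrix - toy_approx) / np.linalg.norm(toy_matrix)
print("relative reconstruction error:", rel_error)
```

A larger k lowers the reconstruction error but keeps more noise; k = 50 in the notebook is a tunable choice.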


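Now we can calculate the predicted ratings as the product of the 3 calculated matrices.  
If the ratings had been converted to deviations from the mean, each user's mean would have to be added back here to return to the original scale. The matrix above was not centered, so the product is used directly.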


In [76]:
all_user_predicted_ratings = np.dot(np.dot(U, sigma), Vt)
print(all_user_predicted_ratings.shape)


(611, 9724)


In [77]:
preds = pd.DataFrame(all_user_predicted_ratings, columns = ratings_pivot.columns)
preds.head()

movieId,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,34,36,38,...,187031,187541,187593,187595,187717,188189,188301,188675,188751,188797,188833,189043,189111,189333,189381,189547,189713,190183,190207,190209,190213,190215,190219,190221,191005,193565,193567,193571,193573,193579,193581,193583,193585,193587,193609
0,2.18495,0.391633,0.8371,-0.082701,-0.543542,2.521065,-0.887665,-0.025187,0.196942,1.605342,-0.487325,0.238066,0.340428,0.05273,-0.112107,1.351254,-0.40911,-0.073231,0.288128,0.420735,0.232836,0.357406,-0.145697,0.248317,0.071823,-0.224034,-0.037536,0.072296,-0.161649,-0.076665,-0.335666,0.798281,1.652119,-0.317737,0.13807,...,0.033877,0.013562,0.155168,0.041159,-0.181594,-0.163434,-0.124946,0.068277,0.03837,0.053101,0.087785,0.009311,0.011174,-0.019988,0.048769,-0.010174,-0.090797,-0.015214,0.005587,0.014898,0.003725,0.005587,0.003725,0.003725,-0.032011,-0.024897,-0.02134,-0.028454,-0.028454,-0.024897,-0.028454,-0.024897,-0.024897,-0.024897,-0.05914
1,0.209973,0.004889,0.031136,0.017289,0.183989,-0.05998,0.083681,0.023824,0.048244,-0.151797,0.078741,0.063507,0.004551,-0.000828,0.045123,0.095526,-0.025819,-0.004233,0.109314,-0.008865,-0.076552,0.014678,0.008501,0.022756,-0.213004,0.04035,-0.002419,-0.022327,-0.174237,0.013023,0.032537,-0.200054,-0.165025,0.030066,1.7e-05,...,0.016133,0.050673,0.067765,0.056652,-0.008167,-0.00735,0.013486,-0.002467,0.037876,0.034995,-0.003172,0.002407,0.002888,0.018447,-0.001762,-0.001525,-0.004084,0.009439,0.001444,0.003851,0.000963,0.001444,0.000963,0.000963,0.024298,0.018899,0.016199,0.021598,0.021598,0.018899,0.021598,0.018899,0.018899,0.018899,0.03198
2,0.013576,0.034661,0.050505,0.000187,-0.005458,0.114697,-0.007452,0.000736,0.004761,-0.061295,-0.004365,0.032795,0.011505,-0.010543,0.007799,0.039761,-0.012951,0.01985,-0.061142,-0.013329,-0.073133,0.012797,-0.018178,0.0464,-0.041909,-0.004594,-0.012161,-0.005949,0.077093,0.00485,0.020475,0.029609,-0.000791,0.014199,0.003032,...,0.004665,0.001418,-0.001979,0.009616,-0.008726,-0.007854,0.013464,0.008514,-0.003015,0.007247,0.010947,-3e-06,-4e-06,-0.003638,0.006082,-0.000426,-0.004363,-0.001972,-2e-06,-5e-06,-1e-06,-2e-06,-1e-06,-1e-06,-0.002067,-0.001608,-0.001378,-0.001837,-0.001837,-0.001608,-0.001837,-0.001608,-0.001608,-0.001608,-0.000534
3,2.012104,-0.395132,-0.290011,0.093849,0.124135,0.259978,0.473116,0.036044,0.011479,-0.023374,0.663969,-0.10796,0.279998,0.265492,-0.029648,0.143392,1.799882,-0.257221,-0.014255,0.120752,1.940472,-0.187845,-0.036481,0.062927,1.637332,0.156245,0.02351,0.546534,0.496478,0.258787,-0.193984,2.078207,1.904503,0.848447,-0.078282,...,0.000263,-0.013504,0.168943,-0.084685,0.116135,0.104521,-0.004731,-0.019294,0.047298,-0.004414,-0.024807,-0.002637,-0.003164,0.005605,-0.013782,0.002416,0.058067,-0.022149,-0.001582,-0.004219,-0.001055,-0.001582,-0.001055,-0.001055,0.002527,0.001966,0.001685,0.002246,0.002246,0.001966,0.002246,0.001966,0.001966,0.001966,-0.02153
4,1.336997,0.772816,0.064191,0.113824,0.27496,0.584009,0.250691,0.131507,-0.086427,1.035331,0.963734,-0.071943,0.137122,0.236478,0.207081,0.541823,0.77436,-0.050299,0.333631,0.083555,1.143493,0.329835,0.153495,0.100726,0.863559,0.240538,0.033295,0.12076,0.037955,-0.000212,0.460581,1.260663,1.906905,1.002758,-0.016437,...,6e-05,-0.006795,0.031475,0.00818,-0.018854,-0.016969,0.00078,-0.004706,0.000969,-0.003761,-0.006051,0.007059,0.008471,-0.005795,-0.003362,0.00033,-0.009427,-0.014938,0.004236,0.011295,0.002824,0.004236,0.002824,0.002824,-0.005661,-0.004403,-0.003774,-0.005032,-0.005032,-0.004403,-0.005032,-0.004403,-0.004403,-0.004403,-0.006112


Now we write the main function, which turns the predictions into recommendations.


In [78]:
def recommend_movies(predictions, userID, movies, original_ratings, num_recommendations):
    """
    Return the top predicted movies that the user has not rated yet.

    :param predictions: the SVD-reconstructed ratings matrix (DataFrame)
    :param userID: ID of the user to recommend movies for
    :param movies: DataFrame with movie data
    :param original_ratings: original ratings DataFrame
    :param num_recommendations: number of records to return
    :return: DataFrame with the top num_recommendations movies
    """ 
    # Get and sort the user's predictions
    user_row_number = userID - 1 # User ID starts at 1, not 0
    # get all the values per movie for specified user
    sorted_user_predictions = predictions.iloc[user_row_number].sort_values(ascending=False)
        
    # Get the user's data and merge in the movie information.
    user_data = original_ratings[original_ratings.userId == userID]
    user_full = user_data.merge(movies, how = 'left', left_on = 'movieId', right_on = 'movieId').sort_values(['rating'], ascending=False)
                 
    
    print('User {0} has already rated {1} movies.'.format(userID, user_full.shape[0]))
    print('Recommending highest {0} predicted ratings movies not already rated.'.format(num_recommendations))
    
    # Recommend the highest predicted rating movies that the user hasn't seen yet.
    recommendations = (movies[~movies['movieId'].isin(user_full['movieId'])].
         merge(pd.DataFrame(sorted_user_predictions).reset_index(), how = 'left',
               left_on = 'movieId',
               right_on = 'movieId').
         rename(columns = {user_row_number: 'Predictions'}).
         sort_values('Predictions', ascending = False).
                       iloc[:num_recommendations, :-1]
                      )

    return recommendations
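`recommend_movies` finds the user's row with `userID - 1`, which relies on user IDs running from 1 upward without gaps, as they do in this dataset. A minimal sketch of a label-based alternative follows, assuming the prediction frame is built with the user IDs as its index. The toy frame and the helper `top_unseen` are hypothetical and not part of the notebook.

```python
import pandas as pd

# Toy prediction frame indexed by user ID (note the gap: there is no user 2)
toy_preds = pd.DataFrame(
    [[0.9, 0.1, 0.5], [0.2, 0.8, 0.4]],
    index=pd.Index([1, 3], name="userId"),
    columns=pd.Index([10, 20, 30], name="movieId"),
)
toy_rated = {1: {10}, 3: {20}}  # movies each user has already rated


def top_unseen(preds, user_id, rated_ids, n):
    """Return the n highest-predicted movie IDs the user has not rated."""
    row = preds.loc[user_id]                # look up by label, not by position
    row = row.drop(labels=list(rated_ids))  # remove already rated movies
    return row.sort_values(ascending=False).head(n).index.tolist()


print(top_unseen(toy_preds, 3, toy_rated[3], 2))
```

In the notebook, this would correspond to creating `preds` with `index=ratings_pivot.index` and selecting the row with `.loc[userID]`.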

In [79]:
# get 20 recommendations
recommend_movies(preds, my_user_id, movies, ratings, 20)

User 611 has already rated 3 movies.
Recommending highest 20 predicted ratings movies not already rated.


Unnamed: 0,movieId,title,genres
898,1198,Raiders of the Lost Ark (Indiana Jones and the...,Action|Adventure
1936,2571,"Matrix, The (1999)",Action|Sci-Fi|Thriller
966,1270,Back to the Future (1985),Adventure|Comedy|Sci-Fi
506,589,Terminator 2: Judgment Day (1991),Action|Sci-Fi
614,780,Independence Day (a.k.a. ID4) (1996),Action|Adventure|Sci-Fi|Thriller
987,1291,Indiana Jones and the Last Crusade (1989),Action|Adventure
658,858,"Godfather, The (1972)",Crime|Drama
897,1197,"Princess Bride, The (1987)",Action|Adventure|Comedy|Fantasy|Romance
900,1200,Aliens (1986),Action|Adventure|Horror|Sci-Fi
936,1240,"Terminator, The (1984)",Action|Sci-Fi|Thriller
