# Movie recommendation system

It uses **User Based Collaborative Filtering (CBF)**

It's a memory-based type of collaborative filtering, based on similarity between users.
The main idea here - to recommend same movies for similar users. For example if user A watched video V1,V2,V3 
and user B is watched video V2,V3 then users are similar and we could recommend video V1 to user B.


At the beginning we do an initial setup and import.



In [11]:

import pandas as pd
import numpy as np 
from sklearn.metrics import pairwise_distances

#initial output setup
pd.set_option('display.max_rows', 70)
pd.set_option('display.max_columns', 70)
pd.set_option('display.width', 500)

Now we are loading movies and ratings movielens' datasets.
We can't use IMDB here, because it doesn't provide with user ratings. And we need it for UBCF.

In [7]:

movies=pd.read_csv("data/movies.csv")
# here we set max number of rows for ratings dataset, 
# algorithm is memory based and and leads or memory overflow on the large number of ratings
ratings=pd.read_csv("data/ratings.csv",nrows=99999)

ratings.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,1,296,5.0,1147880044
1,1,306,3.5,1147868817
2,1,307,5.0,1147868828
3,1,665,5.0,1147878820
4,1,899,3.5,1147868510


I'm looking for Star Wars -like movies recommendation. So we need to search for Star Wars movies, and get their ids.

In [3]:
print(movies[movies.title.str.contains("Star Wars")])

       movieId                                              title                                genres
257        260          Star Wars: Episode IV - A New Hope (1977)               Action|Adventure|Sci-Fi
1166      1196  Star Wars: Episode V - The Empire Strikes Back...               Action|Adventure|Sci-Fi
1179      1210  Star Wars: Episode VI - Return of the Jedi (1983)               Action|Adventure|Sci-Fi
2537      2628   Star Wars: Episode I - The Phantom Menace (1999)               Action|Adventure|Sci-Fi
5270      5378  Star Wars: Episode II - Attack of the Clones (...          Action|Adventure|Sci-Fi|IMAX
9952     33493  Star Wars: Episode III - Revenge of the Sith (...               Action|Adventure|Sci-Fi
12593    61160                   Star Wars: The Clone Wars (2008)     Action|Adventure|Animation|Sci-Fi
14912    79006  Empire of Dreams: The Story of the 'Star Wars'...                           Documentary
21250   109713               Star Wars: Threads of Destiny (2014

Now we creating a new user - myself, and provide ratings for these 2 movies, taken from the previous step

Star Wars: Episode IV - A New Hope (1977)   
Star Wars: Episode VI - Return of the Jedi (1983)  
Star Wars: Episode VI - Return of the Jedi (1983)  

we create a new user with id is bigger than the maximum number of users in the movielens dataset.

In [8]:
#add new user - myself to  check the predictions
#user  - get max user value here, for now just 999999999
my_user_id = 999999999
my_rating = pd.DataFrame([[my_user_id,260,5],[my_user_id,1196,5],[my_user_id,1210,5]],columns=['userId','movieId',"rating"])
ratings = ratings.append(my_rating,ignore_index=True)

#check last 5 rows to ensure user properly added
ratings.tail()

Unnamed: 0,userId,movieId,rating,timestamp
99997,757,2118,4.0,1184014000.0
99998,757,2124,3.5,1184074000.0
99999,999999999,260,5.0,
100000,999999999,1196,5.0,
100001,999999999,1210,5.0,


Now we need to create a single dataset which will contain all movies and user ratings.

In [10]:
#merge both datasets (inner join operation) by movieId
movies_and_ratings = pd.merge(movies,ratings,on="movieId")
#show me my ratings
movies_and_ratings[movies_and_ratings['userId']==my_user_id].head()


Unnamed: 0,movieId,title,genres,userId,rating,timestamp
7674,260,Star Wars: Episode IV - A New Hope (1977),Action|Adventure|Sci-Fi,999999999,5.0,
26905,1196,Star Wars: Episode V - The Empire Strikes Back...,Action|Adventure|Sci-Fi,999999999,5.0,
28237,1210,Star Wars: Episode VI - Return of the Jedi (1983),Action|Adventure|Sci-Fi,999999999,5.0,


Now we need to create a temporary table which will contain users/movies data. Every column is a movie related user's rating.

In simple words we have a table which contains all movies ratings per user.

In [12]:
#create pivot table, which is anologue of excel spreadsheet, with userId as rows, movieId as columns and ratings as values
#we drops the oroginal indexes, and use our indexes
ratings_matrix_users = movies_and_ratings.pivot_table(index=['userId'],columns=['movieId'],values='rating').reset_index(drop=True)
#now we fill NaN values with 0 (NaN values - it's movies without user's rating)
ratings_matrix_users.fillna( 0, inplace = True )

ratings_matrix_users.head()

movieId,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,...,200838,201200,201340,201588,201646,201749,201773,201811,202263,202265,202429,202439,202759,202934,203208,203218,203222,203244,203322,203375,203513,203519,203649,204352,204542,204692,204698,204704,205054,205072,205106,205413,205499,205557,206272
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,3.5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,4.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4.5,0.0,0.0,4.5,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,3.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,3.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4.5,0.0,2.5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,4.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,5.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


Now we are calculating cosine similarity between users. We calculate this similarity per user and based on the movie's recommendations

In [14]:
movie_similarity = 1 - pairwise_distances( ratings_matrix_users.values, metric="cosine" )
#Filling diagonals with 0s for future use when sorting is done (will be explained later), 
np.fill_diagonal( movie_similarity, 0 ) 
#create a new matrix with user/user similarities
ratings_matrix_users = pd.DataFrame( movie_similarity )
ratings_matrix_users.head()
#we see here a distance between users (user,user)!!! this similarity we will use for items recommendations

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,...,723,724,725,726,727,728,729,730,731,732,733,734,735,736,737,738,739,740,741,742,743,744,745,746,747,748,749,750,751,752,753,754,755,756,757
0,0.0,0.762322,0.795569,0.74031,0.58538,0.713644,0.483813,0.613872,0.587894,0.650287,0.757955,0.791427,0.812564,0.839441,0.732855,0.787865,0.460789,0.734377,0.824454,0.74661,0.748297,0.382547,0.686465,0.721393,0.502705,0.697569,0.656762,0.630851,0.542527,0.672086,0.785477,0.518198,0.786187,0.757059,0.732197,...,0.504968,0.832759,0.302022,0.84614,0.308225,0.732292,0.778153,0.601921,0.729441,0.565407,0.734289,0.798535,0.84242,0.766931,0.820321,0.803252,0.798151,0.79727,0.728536,0.768867,0.681737,0.723453,0.663986,0.811305,0.482705,0.820605,0.731827,0.740974,0.658093,0.573658,0.507467,0.799494,0.832629,0.706292,0.435819
1,0.762322,0.0,0.851134,0.845573,0.778461,0.853804,0.59594,0.823053,0.826056,0.839609,0.808417,0.919295,0.937771,0.760281,0.905279,0.789124,0.542125,0.94147,0.89144,0.948573,0.787132,0.517673,0.865426,0.787593,0.652304,0.86438,0.884432,0.85107,0.559998,0.862938,0.931246,0.662353,0.82246,0.791767,0.832617,...,0.591909,0.826404,0.473745,0.913559,0.472995,0.788991,0.821289,0.750972,0.882246,0.792028,0.897152,0.929801,0.856294,0.910188,0.923578,0.872068,0.907889,0.921949,0.878565,0.840458,0.793979,0.775799,0.816721,0.817747,0.644709,0.907388,0.820333,0.90885,0.868289,0.705866,0.633376,0.802113,0.942991,0.860851,0.684739
2,0.795569,0.851134,0.0,0.949809,0.589485,0.756184,0.41675,0.636739,0.611415,0.684263,0.759393,0.807306,0.940765,0.850471,0.761767,0.806257,0.74024,0.783419,0.955911,0.831225,0.934976,0.334861,0.690867,0.779418,0.46652,0.750047,0.695535,0.728849,0.555639,0.761565,0.82127,0.48686,0.806443,0.764098,0.827921,...,0.442365,0.848849,0.338528,0.928882,0.342932,0.899557,0.88835,0.580027,0.738277,0.596108,0.755349,0.899591,0.86958,0.88271,0.904518,0.926165,0.959958,0.819772,0.827453,0.901127,0.893223,0.716109,0.645743,0.887117,0.445917,0.894193,0.898252,0.790688,0.691411,0.544246,0.458736,0.950494,0.891995,0.713025,0.584674
3,0.74031,0.845573,0.949809,0.0,0.553947,0.735843,0.3642,0.602503,0.596383,0.653252,0.726999,0.768222,0.916451,0.799266,0.74292,0.772563,0.76308,0.766609,0.942304,0.840308,0.945414,0.28786,0.658415,0.729692,0.411259,0.722708,0.686375,0.743533,0.494002,0.780663,0.775722,0.436994,0.761455,0.725148,0.788552,...,0.380036,0.796465,0.323227,0.914223,0.320482,0.908957,0.840103,0.538248,0.702827,0.574708,0.728259,0.899477,0.8163,0.90119,0.854153,0.893316,0.928362,0.784796,0.829117,0.915091,0.903814,0.6598,0.594408,0.856195,0.398828,0.878092,0.888267,0.75769,0.679258,0.487419,0.410016,0.935265,0.851681,0.684626,0.641613
4,0.58538,0.778461,0.589485,0.553947,0.0,0.749003,0.852828,0.960714,0.911527,0.93186,0.664093,0.845964,0.747585,0.635868,0.831131,0.553453,0.335918,0.891353,0.651776,0.767177,0.515223,0.797138,0.86824,0.646965,0.910249,0.90267,0.806332,0.746899,0.575804,0.668492,0.875108,0.885178,0.734351,0.708642,0.750078,...,0.861111,0.702467,0.696448,0.701281,0.710149,0.53907,0.58475,0.945139,0.846486,0.841572,0.822564,0.770698,0.774381,0.698273,0.785566,0.59577,0.707167,0.813837,0.603989,0.53868,0.558256,0.604386,0.914295,0.630178,0.895607,0.77878,0.588616,0.786225,0.91675,0.907414,0.892208,0.566304,0.792438,0.858039,0.558378


Now we find the maximum values for every user,based on the maximum similarity value.
The most similar user is himself, and we fill these values with 0 (diagonal values in the table above)

In [15]:
# here is why we need to use 0,0 in diagonal, to not stick with these 1.0 at diagonals!
similar_users = ratings_matrix_users.idxmax(axis=1).to_frame() #converting pivot to   datagrame 
similar_users.columns=["similarUser"]
similar_users
#1st column is user
#2nd columnt is the most similar user

Unnamed: 0,similarUser
0,106
1,604
2,242
3,646
4,668
...,...
753,351
754,207
755,216
756,369


Now we create 2 functions:
get list of recommended movies, based on the user similarity.
and a helper function to get the movie information based on movieId

In [17]:
movieId_recommended=list()
def getRecommendedMoviesAsperUserSimilarity(userId):
    """
     Recommending movies which user hasn't watched as per User Similarity
    :param user_id: user_id to whom movie needs to be recommended
    :return: movieIds to user 
    """
    user2Movies= ratings[ratings['userId']== userId]['movieId']
    sim_user=similar_users.iloc[0,0]
    df_recommended=pd.DataFrame(columns=['movieId','title','genres','userId','rating','timestamp'])
    for movieId in ratings[ratings['userId']== sim_user]['movieId']:
        if movieId not in user2Movies:
            df_new= movies_and_ratings[(movies_and_ratings.userId==sim_user) & (movies_and_ratings.movieId==movieId)]
            df_recommended=pd.concat([df_recommended,df_new])
        best10=df_recommended.sort_values(['rating'], ascending = False )[1:10]  
    return best10['movieId']

In [18]:
def movieIdToTitle(listMovieIDs):
    """
     Converting movieId to titles
    :param user_id: List of movies
    :return: movie titles
    """
    movie_titles= list()
    for id in listMovieIDs:
        movie_titles.append(movies[movies['movieId']==id]['title'])
    return movie_titles

And finally we ask to provide recommendations

In [20]:
recommend_movies= movieIdToTitle(getRecommendedMoviesAsperUserSimilarity(my_user_id))
print("Movies you should watch are:\n")
print(recommend_movies)

Movies you should watch are:

[149    Rob Roy (1995)
Name: title, dtype: object, 375    True Lies (1994)
Name: title, dtype: object, 314    Shawshank Redemption, The (1994)
Name: title, dtype: object, 310    Secret of Roan Inish, The (1994)
Name: title, dtype: object, 292    Pulp Fiction (1994)
Name: title, dtype: object, 287    Once Were Warriors (1994)
Name: title, dtype: object, 243    Hoop Dreams (1994)
Name: title, dtype: object, 549    True Romance (1993)
Name: title, dtype: object, 20    Get Shorty (1995)
Name: title, dtype: object]
