# Movie recommendation system

This notebook uses **Content-Based Filtering (CBF)**.

CBF recommends items based on their properties.  
For example, if a movie has a list of properties,  
then similar movies are the ones with a similar list of properties.


To measure similarity we can use the simple *Cosine Similarity* measure.  
We compute a property vector for every movie, compare it with the vector of the chosen movie,  
and then pick the movies with the highest similarity values.
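
As a rough illustration of the idea, here is a minimal sketch in plain Python that computes cosine similarity between 0/1 genre vectors. The four-genre vocabulary `GENRES` and the helper names `genre_vector` and `cosine_similarity` are assumptions made up for this example. They are not part of the notebook, which uses TF-IDF weights rather than 0/1 flags.

```python
import math

# Toy vocabulary, assumed for illustration only.
GENRES = ["action", "adventure", "drama", "fantasy"]

def genre_vector(genres):
    """Return a 0/1 vector marking which genres from GENRES are present."""
    present = {g.strip().lower() for g in genres.split(",")}
    return [1.0 if g in present else 0.0 for g in GENRES]

def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a movie without any known genre is similar to nothing
    return dot / (norm_a * norm_b)

query = genre_vector("Action,Drama,Fantasy")
same = genre_vector("Action,Drama,Fantasy")
partial = genre_vector("Action,Adventure")
unrelated = genre_vector("Adventure")

print(cosine_similarity(query, same))       # identical genres, close to 1
print(cosine_similarity(query, partial))    # one shared genre, between 0 and 1
print(cosine_similarity(query, unrelated))  # no shared genre, 0.0
```

Movies with identical genre lists score highest, and movies with no genre in common score zero.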




In [2]:

import pandas as pd

# initial display setup
pd.set_option('display.max_rows', 50)
pd.set_option('display.max_columns', 50)
pd.set_option('display.width', 200)

Next we load the title.basics dataset, which contains the titles and genres of all movies.  
We will use genres as the main property for recommendations.  

For example, if a movie has the genres "Action,Drama,Fantasy",  
then the most similar movies will be the ones that have the same genres.

In [3]:

import numpy as np 

movies=pd.read_csv("data/title.basics.tsv",sep="\t")

# replace NaN values in the columns we use (title and genres) with empty strings
movies["primaryTitle"] = movies["primaryTitle"].replace(np.nan, "")
movies["genres"] = movies["genres"].replace(np.nan, "")

# show the first 10 movies
movies.head(10)


Unnamed: 0,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,genres
0,tt0000001,short,Carmencita,Carmencita,0,1894,\N,1,"Documentary,Short"
1,tt0000002,short,Le clown et ses chiens,Le clown et ses chiens,0,1892,\N,5,"Animation,Short"
2,tt0000003,short,Pauvre Pierrot,Pauvre Pierrot,0,1892,\N,4,"Animation,Comedy,Romance"
3,tt0000004,short,Un bon bock,Un bon bock,0,1892,\N,12,"Animation,Short"
4,tt0000005,short,Blacksmith Scene,Blacksmith Scene,0,1893,\N,1,"Comedy,Short"
5,tt0000006,short,Chinese Opium Den,Chinese Opium Den,0,1894,\N,1,Short
6,tt0000007,short,Corbett and Courtney Before the Kinetograph,Corbett and Courtney Before the Kinetograph,0,1894,\N,1,"Short,Sport"
7,tt0000008,short,Edison Kinetoscopic Record of a Sneeze,Edison Kinetoscopic Record of a Sneeze,0,1894,\N,1,"Documentary,Short"
8,tt0000009,movie,Miss Jerry,Miss Jerry,0,1894,\N,45,Romance
9,tt0000010,short,Leaving the Factory,La sortie de l'usine Lumière à Lyon,0,1895,\N,1,"Documentary,Short"


Now we process the movie genres with the TF-IDF algorithm.  
As a result, every genre of every movie gets a weight between 0 and 1.


In [4]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

tfidf_movies_genres = TfidfVectorizer()
tfidf_movies_genres_matrix = tfidf_movies_genres.fit_transform(movies["genres"])

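# list the result matrix (toarray() builds a dense copy, which can be large)
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
print(pd.DataFrame(data = tfidf_movies_genres_matrix.toarray(),columns = tfidf_movies_genres.get_feature_names_out()))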

           action  adult  adventure  animation  biography    comedy  crime  documentary     drama    family  fantasy   fi  film  game  history  horror  music  musical  mystery  news  noir  reality  \
0        0.000000    0.0        0.0   0.000000        0.0  0.000000    0.0     0.741105  0.000000  0.000000      0.0  0.0   0.0   0.0      0.0     0.0    0.0      0.0      0.0   0.0   0.0      0.0   
1        0.000000    0.0        0.0   0.796339        0.0  0.000000    0.0     0.000000  0.000000  0.000000      0.0  0.0   0.0   0.0      0.0     0.0    0.0      0.0      0.0   0.0   0.0      0.0   
2        0.000000    0.0        0.0   0.685141        0.0  0.440164    0.0     0.000000  0.000000  0.000000      0.0  0.0   0.0   0.0      0.0     0.0    0.0      0.0      0.0   0.0   0.0      0.0   
3        0.000000    0.0        0.0   0.796339        0.0  0.000000    0.0     0.000000  0.000000  0.000000      0.0  0.0   0.0   0.0      0.0     0.0    0.0      0.0      0.0   0.0   0.0      0.0   
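
The columns `fi`, `film` and `noir` in the matrix above appear because the vectorizer lowercases the text and then splits it with its default token pattern, so hyphenated genres such as "Sci-Fi" or "Film-Noir" become two separate tokens. The sketch below reproduces that splitting with the standard `re` module. The alternative comma-based pattern and the helper `tokenize` are assumptions for illustration, not something the notebook uses.

```python
import re

# Default token pattern documented for scikit-learn's TfidfVectorizer:
# runs of two or more word characters.
DEFAULT_TOKEN_PATTERN = r"(?u)\b\w\w+\b"

# Assumed alternative: treat everything between commas as one token.
COMMA_TOKEN_PATTERN = r"[^,]+"

def tokenize(genres, pattern):
    """Lowercase the genre string and extract tokens with the given pattern."""
    return re.findall(pattern, genres.lower())

default_tokens = tokenize("Action,Sci-Fi,Film-Noir", DEFAULT_TOKEN_PATTERN)
comma_tokens = tokenize("Action,Sci-Fi,Film-Noir", COMMA_TOKEN_PATTERN)

print(default_tokens)  # ['action', 'sci', 'fi', 'film', 'noir']
print(comma_tokens)    # ['action', 'sci-fi', 'film-noir']
```

Passing a pattern like this through the vectorizer's `token_pattern` parameter would probably keep each genre as a single feature, but it would also change the matrix and therefore the recommendations shown in this notebook.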


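Now we create the function used by our recommendation system.  
Following the sklearn documentation, we use linear_kernel here  
instead of the cosine_similarity function. TfidfVectorizer L2-normalises every row by default,  
so the dot product gives the same result and is cheaper to compute.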
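
To see why a plain dot product is enough for rows of unit length (the default `norm='l2'` output of TfidfVectorizer), here is a minimal sketch in plain Python. The two vectors and the helper names `l2_normalize`, `dot` and `cosine` are made up for the example.

```python
import math

def dot(a, b):
    """Plain dot product, which is what a linear kernel computes."""
    return sum(x * y for x, y in zip(a, b))

def l2_normalize(v):
    """Scale v to unit length; an all-zero vector is returned unchanged."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v] if norm else list(v)

def cosine(a, b):
    """Cosine similarity computed from the unnormalised vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

a = [3.0, 4.0, 0.0]
b = [1.0, 2.0, 2.0]

cos_ab = cosine(a, b)
dot_of_normalized = dot(l2_normalize(a), l2_normalize(b))

# Both values agree up to floating-point rounding (11/15 for these vectors).
print(cos_ab, dot_of_normalized)
```

Because the normalisation has already been paid for once when the TF-IDF matrix was built, every later similarity query only needs the dot product.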

In [5]:
def get_recommendation_based_on_genre(movie_title):
    # Find every movie with this exact title. Index labels equal row positions
    # here because read_csv created a default RangeIndex.
    idx_movie = movies.index[movies["primaryTitle"] == movie_title]
    if len(idx_movie) == 0:
        raise ValueError(f"No movie found with title {movie_title!r}")

    # Compare a single movie (the first match) against all movies; a full
    # pairwise similarity matrix could easily run out of memory on big datasets.
    cosimsim = linear_kernel(tfidf_movies_genres_matrix[idx_movie[0]], tfidf_movies_genres_matrix)

    # Pair every row position with its similarity score and sort in descending order.
    sim_scores_for_specific_movie = list(enumerate(cosimsim[0]))
    sim_scores_for_specific_movie_sorted = sorted(sim_scores_for_specific_movie, key=lambda x: x[1], reverse=True)

    # Take the 10 items after the top one, which is assumed to be the movie itself.
    # Caveat: when many movies share exactly the same genres, their scores tie,
    # so the skipped item may be a different movie.
    sim_scores_for_specific_movie_sorted = sim_scores_for_specific_movie_sorted[1:11]

    # Collect the row positions and return the matching movies.
    movie_indices = [i[0] for i in sim_scores_for_specific_movie_sorted]
    return movies.iloc[movie_indices]




In [6]:
recg = get_recommendation_based_on_genre("Star Wars: Episode I - The Phantom Menace")
print(recg["primaryTitle"])

17107                    The Sea Beast
32731     Adventures of Captain Marvel
32974                     Le due tigri
38526                    Blonde Savage
40393    The Adventures of Sir Galahad
40446                           Bagdad
41226          Tarzan's Magic Fountain
43017                 The Magic Carpet
44387                  Son of Ali Baba
46944                        Abe Hayat
Name: primaryTitle, dtype: object
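
One thing worth noting about the `[1:11]` slice in `get_recommendation_based_on_genre`: it assumes the queried movie is the first item after sorting. When many movies have exactly the same genres, their scores tie and the stable sort keeps them in dataset order, so the item that gets skipped may be a different movie. A possible variant, sketched here on a made-up score list with the hypothetical helpers `naive_slice` and `top_k_similar`, removes the query by its position instead.

```python
def naive_slice(scores, k=10):
    """Mimic the notebook's approach: sort, then drop the first item."""
    ranked = sorted(enumerate(scores), key=lambda pair: pair[1], reverse=True)
    return [i for i, _ in ranked[1:k + 1]]

def top_k_similar(scores, query_idx, k=10):
    """Return positions of the k highest scores, excluding query_idx itself."""
    candidates = [(i, s) for i, s in enumerate(scores) if i != query_idx]
    candidates.sort(key=lambda pair: pair[1], reverse=True)
    return [i for i, _ in candidates[:k]]

# Made-up similarity scores: items 0, 2 and 3 share the query's genres exactly.
scores = [1.0, 0.2, 1.0, 1.0, 0.7]
query_idx = 2  # position of the queried movie

print(naive_slice(scores, k=3))               # [2, 3, 4], includes the query itself
print(top_k_similar(scores, query_idx, k=3))  # [0, 3, 4]
```

In the notebook, the positions returned by such a helper could be passed to `movies.iloc[...]` in the same way as `movie_indices`. Ties among equally similar movies would still be broken by dataset order, so a secondary ranking signal may be worth considering.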
