The objective of this project is to build an interactive movie recommendation system that accepts either a movie title or a genre as input. In response, the system returns a list of recommended movies, based on the ratings of users with similar preferences.

The approach is a simplified form of user-based collaborative filtering: instead of a matrix of a user's preferred movies, the input is just a single movie.

In [1]:
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
import seaborn as sns

I am going to work with the MovieLens 25M dataset, which contains movie metadata and user ratings.
Two files from this dataset are used here: movies.csv, with columns ['movieId', 'title', 'genres'],
and ratings.csv, with columns ['userId', 'movieId', 'rating', 'timestamp'].

In [2]:
df_movies = pd.read_csv('movies.csv')
df_movies.head()

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


Data Cleaning

1) There are no null values


2) Cleaning up the titles by removing all special characters (anything other than letters, digits, and spaces) and splitting the pipe-separated genres into lists

In [3]:
df_movies.isnull().sum()

movieId    0
title      0
genres     0
dtype: int64

In [4]:
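import re

def cleaned_title(title):
    # Drop every character that is not a letter, digit, or space
    return re.sub(r"[^a-zA-Z0-9 ]", "", title)

def split_genres(genres):
    # Genres are pipe-separated, e.g. "Comedy|Romance"
    return genres.split("|")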

In [5]:
df_movies['cleaned_title'] = df_movies['title'].apply(cleaned_title)
df_movies['split_genres']= df_movies['genres'].apply(split_genres)
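To make the effect of the two helpers concrete, here is a small self-contained sketch that applies them to a few strings. The first two inputs appear in the dataset; the accented title is a hypothetical input, included only to show that the pattern also drops non-ASCII letters.

```python
import re

def cleaned_title(title):
    # Keep only ASCII letters, digits, and spaces
    return re.sub(r"[^a-zA-Z0-9 ]", "", title)

def split_genres(genres):
    # MovieLens separates genres with a pipe character
    return genres.split("|")

print(cleaned_title("Kill Bill: Vol. 2 (2004)"))    # Kill Bill Vol 2 2004
print(split_genres("Comedy|Crime|Drama|Thriller"))   # ['Comedy', 'Crime', 'Drama', 'Thriller']

# Hypothetical input: the accented letter is removed along with the punctuation
print(cleaned_title("Amélie (2001)"))                # Amlie 2001
```

Because the same cleaning is applied to the search query later on, dropped characters on both sides still line up for matching.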

In [6]:
df_movies

Unnamed: 0,movieId,title,genres,cleaned_title,split_genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,Toy Story 1995,"[Adventure, Animation, Children, Comedy, Fantasy]"
1,2,Jumanji (1995),Adventure|Children|Fantasy,Jumanji 1995,"[Adventure, Children, Fantasy]"
2,3,Grumpier Old Men (1995),Comedy|Romance,Grumpier Old Men 1995,"[Comedy, Romance]"
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance,Waiting to Exhale 1995,"[Comedy, Drama, Romance]"
4,5,Father of the Bride Part II (1995),Comedy,Father of the Bride Part II 1995,[Comedy]
...,...,...,...,...,...
62418,209157,We (2018),Drama,We 2018,[Drama]
62419,209159,Window of the Soul (2001),Documentary,Window of the Soul 2001,[Documentary]
62420,209163,Bad Poems (2018),Comedy|Drama,Bad Poems 2018,"[Comedy, Drama]"
62421,209169,A Girl Thing (2001),(no genres listed),A Girl Thing 2001,[(no genres listed)]


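Part 1 - Recommend movies based on titles.

Creating a TF-IDF (term frequency-inverse document frequency) matrix that converts the cleaned movie titles into numeric vectors. With ngram_range=(1,2), both single words and pairs of adjacent words become features.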

In [7]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [8]:
vectorizer = TfidfVectorizer(ngram_range=(1,2))
tfidf= vectorizer.fit_transform(df_movies['cleaned_title'])
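The idea can be tried on a handful of titles before running it on the full dataset. The sketch below uses a made-up four-title corpus and hypothetical names (`toy_vectorizer`, `toy_tfidf`); it is only an illustration of how a query is matched against the TF-IDF matrix, not part of the pipeline.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Tiny stand-in corpus of already-cleaned titles
titles = ["Toy Story 1995", "Toy Story 2 1999", "Jumanji 1995", "Pulp Fiction 1994"]

toy_vectorizer = TfidfVectorizer(ngram_range=(1, 2))  # unigrams and bigrams
toy_tfidf = toy_vectorizer.fit_transform(titles)       # one row per title

# Vectorize the query with the fitted vocabulary and compare it to every title
query_vec = toy_vectorizer.transform(["Toy Story"])
scores = cosine_similarity(query_vec, toy_tfidf).flatten()

best = titles[int(np.argmax(scores))]
print(best)
print(sorted(toy_vectorizer.vocabulary_))
```

Printing the vocabulary shows one detail worth knowing: with the default tokenizer, single-character tokens such as the "2" in "Toy Story 2" are ignored.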

Finding most similar results using cosine similarity

In [9]:
from sklearn.metrics.pairwise import cosine_similarity

In [10]:
def search(title):
    # Clean the query the same way the movie titles were cleaned
    title = cleaned_title(title)
    new_vec = vectorizer.transform([title])
    similarity = cosine_similarity(new_vec,tfidf).flatten()
    # argsort is ascending, so take the last five and reverse them: best match first
    indices = np.argsort(similarity)[-5:][::-1]
    results = df_movies.iloc[indices]
    return results
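The order of the returned rows matters, because the first row is later treated as the best match. As a short illustration with made-up similarity scores: `np.argpartition` guarantees which indices end up among the top k, but not their order, so the candidates need an explicit sort if the best match has to come first.

```python
import numpy as np

# Made-up similarity scores for six titles
similarity = np.array([0.10, 0.90, 0.30, 0.70, 0.50, 0.20])
k = 3

# The last k positions hold the indices of the k largest scores, in no guaranteed order
top_k_unordered = np.argpartition(similarity, -k)[-k:]

# Sort just those k candidates by score, best first
order = np.argsort(similarity[top_k_unordered])[::-1]
top_k_sorted = top_k_unordered[order]

print(top_k_sorted)  # [1 3 4]
```

For a few tens of thousands of titles, a plain `np.argsort` over all scores is also fast enough and is simpler to read.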

Building an interactive widget to input movie name and get search results

In [11]:
import ipywidgets as widgets
from IPython.display import display
movie_input = widgets.Text(value= 'Bad Poem',
                          description = 'Movie title',
                          disabled = False)

movie_output = widgets.Output()

def on_enter(data):
    with movie_output:
        movie_output.clear_output()
        title = data["new"]
        if len(title)>5:
            display(search(title))
            
movie_input.observe(on_enter ,names ='value' )

display(movie_input,movie_output)

Text(value='Bad Poem', description='Movie title')

Output()

In [12]:
df_ratings = pd.read_csv('ratings.csv')
df_ratings.head()


Unnamed: 0,userId,movieId,rating,timestamp
0,1,296,5.0,1147880044
1,1,306,3.5,1147868817
2,1,307,5.0,1147868828
3,1,665,5.0,1147878820
4,1,899,3.5,1147868510


In [13]:
df_ratings.shape

(25000095, 4)
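With about 25 million rows, memory use can become noticeable. One option, which this notebook does not use, is to load only the columns that are needed and to pick smaller dtypes. The sketch below demonstrates this on a three-row stand-in for ratings.csv (the third row is invented for illustration); the dtype choices are assumptions that fit the value ranges seen here.

```python
import io
import pandas as pd

# Tiny stand-in for ratings.csv with the same column layout
sample_csv = io.StringIO(
    "userId,movieId,rating,timestamp\n"
    "1,296,5.0,1147880044\n"
    "1,306,3.5,1147868817\n"
    "3,296,4.5,1147868828\n"
)

ratings_small = pd.read_csv(
    sample_csv,
    usecols=["userId", "movieId", "rating"],  # timestamp is never used in this notebook
    dtype={"userId": "int32", "movieId": "int32", "rating": "float32"},
)

print(ratings_small.dtypes)
print(ratings_small.memory_usage(index=False).sum(), "bytes")
```

Half-star ratings such as 3.5 and 4.5 are represented exactly in float32, so comparisons like `rating > 4` behave the same as with the default float64.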

Part 2 - Recommend movies based on preferences (ratings) of users that are similar to us.


For example, suppose a user likes the movie 'Pulp Fiction (1994)'. We are going to recommend movies to this person based on the ratings of similar users who also liked this movie.

In [16]:
movie_id=296
df_movies[df_movies['movieId']==296]

Unnamed: 0,movieId,title,genres,cleaned_title,split_genres
292,296,Pulp Fiction (1994),Comedy|Crime|Drama|Thriller,Pulp Fiction 1994,"[Comedy, Crime, Drama, Thriller]"


In [17]:
# Identifying the other users who rated Pulp Fiction higher than 4 (that is, 4.5 or 5)

similar_users = df_ratings[(df_ratings['movieId']==movie_id) & (df_ratings['rating']>4)]['userId'].unique()
similar_users

array([     1,      3,      8, ..., 162533, 162534, 162536], dtype=int64)

In [18]:
#Identifying all the other movies that received ratings higher than 4 from the users in the 'Pulp Fiction' enthusiasts group

similar_recs= df_ratings[(df_ratings['userId'].isin (similar_users)) & (df_ratings['rating']>4)]["movieId"]
similar_recs                                                                        

0              296
2              307
3              665
8             1237
16            2351
             ...  
24999516    133771
24999517    134130
24999518    148626
24999519    148685
24999522    204698
Name: movieId, Length: 2544315, dtype: int64

In [19]:
similar_recs.value_counts()

296       42988
318       21304
2959      17170
593       17059
50        16328
          ...  
59290         1
6203          1
7583          1
166753        1
148685        1
Name: movieId, Length: 22220, dtype: int64

In [20]:
#To get the top recommendations, keeping only the movies that more than 10% of these users have liked.

similar_recs= similar_recs.value_counts()/len(similar_users)
similar_recs= similar_recs[similar_recs>.10]
similar_recs

296      1.000000
318      0.495580
2959     0.399414
593      0.396832
50       0.379827
           ...   
4995     0.102726
923      0.102726
3996     0.102424
2502     0.101447
44191    0.100400
Name: movieId, Length: 94, dtype: float64
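The three steps above (find the fans of the seed movie, collect everything they liked, divide by the number of fans) can be checked by hand on a tiny example. The ratings below are made up, and the names `toy_ratings`, `fans`, and `fan_share` are hypothetical.

```python
import pandas as pd

# Made-up ratings: four users, four movies
toy_ratings = pd.DataFrame({
    "userId":  [1, 1, 1, 2, 2, 3, 3, 4, 4],
    "movieId": [10, 20, 30, 10, 20, 10, 40, 20, 40],
    "rating":  [5.0, 4.5, 3.0, 4.5, 5.0, 5.0, 4.5, 5.0, 5.0],
})
seed_movie = 10

# Only ratings above 4 count as "liked"
liked = toy_ratings[toy_ratings["rating"] > 4]

# Users who liked the seed movie (users 1, 2, and 3)
fans = liked.loc[liked["movieId"] == seed_movie, "userId"].unique()

# Every movie those fans liked, and the share of fans who liked each one
fan_likes = liked.loc[liked["userId"].isin(fans), "movieId"]
fan_share = fan_likes.value_counts() / len(fans)

print(fan_share.sort_index())
```

Movie 10 gets a share of 1.0 by construction, movie 20 is liked by two of the three fans, and movie 40 by one; user 4 never liked the seed movie, so their ratings are ignored.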

Next, our focus shifts to finding movies that are specific to "Pulp Fiction" fans, rather than movies that are simply popular with everyone. For instance, an "Avengers" movie might be liked by many "Pulp Fiction" fans only because it is liked by a broad audience in general.

To correct for this, we'll compute, for each candidate movie, the fraction of all users who liked it and compare it with the fraction of "Pulp Fiction" enthusiasts who liked it. The ratio of the two lets us pinpoint recommendations that align closely with our initial input.

In [21]:
total_users = df_ratings[(df_ratings['movieId'].isin (similar_recs.index)) & (df_ratings['rating']>4)]
total_users

Unnamed: 0,userId,movieId,rating,timestamp
0,1,296,5.0,1147880044
23,1,3949,5.0,1147868678
29,1,4973,4.5,1147869080
37,1,6016,5.0,1147869090
48,1,7361,5.0,1147880055
...,...,...,...,...
25000058,162541,4995,5.0,1240951903
25000062,162541,5618,4.5,1240953299
25000065,162541,5952,5.0,1240952617
25000078,162541,7153,5.0,1240952613


In [22]:
total_users['movieId'].value_counts()

318      51678
296      42988
2571     36851
356      35527
593      34114
         ...  
1201      8383
55820     8262
223       7773
2542      7747
32587     7426
Name: movieId, Length: 94, dtype: int64

In [23]:
# Share of users who liked each candidate, among the users who liked at least one candidate
total_user_recs= total_users['movieId'].value_counts()/len(total_users['userId'].unique())
total_user_recs

318      0.347093
296      0.288727
2571     0.247508
356      0.238616
593      0.229125
           ...   
1201     0.056304
55820    0.055491
223      0.052207
2542     0.052032
32587    0.049876
Name: movieId, Length: 94, dtype: float64

In [24]:
recommendation_percentage = pd.concat([similar_recs,total_user_recs],axis=1)
recommendation_percentage.columns = ['similar_users','all_users']
recommendation_percentage

Unnamed: 0,similar_users,all_users
1,0.160929,0.126504
32,0.193310,0.101721
47,0.300154,0.146526
50,0.379827,0.203368
110,0.225412,0.163163
...,...,...
58559,0.208337,0.148555
60069,0.103564,0.077394
68157,0.143156,0.067534
79132,0.173002,0.133255


Comparing the percentage of similar users who liked a particular movie with the percentage of all users who liked it.

The score is the ratio of the two percentages. A score well above 1 means the movie is liked disproportionately often by similar users, which is what we look for in a valuable recommendation.
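As a quick worked example of the ratio, the sketch below recomputes the score for two rows taken from the tables in this notebook: Reservoir Dogs (movieId 1089) and Toy Story (movieId 1). The frame name `shares` is hypothetical.

```python
import pandas as pd

# Shares copied from the notebook output for two candidate movies
shares = pd.DataFrame(
    {
        "similar_users": [0.274379, 0.160929],
        "all_users": [0.102211, 0.126504],
    },
    index=["Reservoir Dogs (1992)", "Toy Story (1995)"],
)

# Ratio of the two shares: how much more often similar users like the movie
shares["score"] = shares["similar_users"] / shares["all_users"]

print(shares.sort_values("score", ascending=False).round(3))
```

Reservoir Dogs scores about 2.68 and Toy Story about 1.27: both are liked by a fair share of "Pulp Fiction" fans, but only the former is liked far more often by them than by users in general.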

In [25]:
recommendation_percentage['score']= recommendation_percentage['similar_users']/recommendation_percentage['all_users']
recommendation_percentage= recommendation_percentage.sort_values('score',ascending=False)
recommendation_percentage

Unnamed: 0,similar_users,all_users,score
296,1.000000,0.288727,3.463478
1089,0.274379,0.102211,2.684435
7438,0.162138,0.067252,2.410910
32587,0.117242,0.049876,2.350650
6874,0.201033,0.086098,2.334923
...,...,...,...
1,0.160929,0.126504,1.272118
480,0.128454,0.101311,1.267921
150,0.112403,0.091868,1.223536
1197,0.147832,0.121024,1.221511


In [56]:
#Merging the above results with movies dataframe to get title along with the movie id

merged_df_test=recommendation_percentage.merge(df_movies,left_index=True, right_on='movieId')
merged_df_test[:10]


Unnamed: 0,similar_users,all_users,score,movieId,title,genres,cleaned_title,split_genres
292,1.0,0.288727,3.463478,296,Pulp Fiction (1994),Comedy|Crime|Drama|Thriller,Pulp Fiction 1994,"[Comedy, Crime, Drama, Thriller]"
1062,0.274379,0.102211,2.684435,1089,Reservoir Dogs (1992),Crime|Mystery|Thriller,Reservoir Dogs 1992,"[Crime, Mystery, Thriller]"
7299,0.162138,0.067252,2.41091,7438,Kill Bill: Vol. 2 (2004),Action|Drama|Thriller,Kill Bill Vol 2 2004,"[Action, Drama, Thriller]"
9778,0.117242,0.049876,2.35065,32587,Sin City (2005),Action|Crime|Film-Noir|Mystery|Thriller,Sin City 2005,"[Action, Crime, Film-Noir, Mystery, Thriller]"
6751,0.201033,0.086098,2.334923,6874,Kill Bill: Vol. 1 (2003),Action|Crime|Thriller,Kill Bill Vol 1 2003,"[Action, Crime, Thriller]"
762,0.175677,0.075822,2.316962,778,Trainspotting (1996),Comedy|Crime|Drama,Trainspotting 1996,"[Comedy, Crime, Drama]"
2451,0.120243,0.052032,2.310923,2542,"Lock, Stock & Two Smoking Barrels (1998)",Comedy|Crime|Thriller,Lock Stock Two Smoking Barrels 1998,"[Comedy, Crime, Thriller]"
1191,0.152484,0.06783,2.248054,1222,Full Metal Jacket (1987),Drama|War,Full Metal Jacket 1987,"[Drama, War]"
1182,0.235135,0.10475,2.244732,1213,Goodfellas (1990),Crime|Drama,Goodfellas 1990,"[Crime, Drama]"
11932,0.124197,0.055491,2.23814,55820,No Country for Old Men (2007),Crime|Drama,No Country for Old Men 2007,"[Crime, Drama]"


Part 3 - Identifying the top 10 recommendations based on input genre

In [57]:
genre = ['Comedy']

In [58]:
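# Join each rating to its movie on the shared movieId column.
# (An index-based join, left_index=True, pairs movies with unrelated rating rows;
# the sample output below came from such a join.)
merged_df_genre=df_ratings.merge(df_movies,on='movieId')
filtered_columns = ['movieId','cleaned_title','split_genres','rating']
merged_df_genre=merged_df_genre[filtered_columns]
# Keep ratings above 4 for movies that have at least one of the requested genres
merged_df_genre=merged_df_genre[(merged_df_genre['rating']>4) & merged_df_genre['split_genres'].apply(lambda lst: any(item in lst for item in genre))]
merged_df_genre[:10]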

Unnamed: 0,movieId,cleaned_title,split_genres,rating
2,3,Grumpier Old Men 1995,"[Comedy, Romance]",5.0
17,18,Four Rooms 1995,[Comedy],5.0
18,19,Ace Ventura When Nature Calls 1995,[Comedy],5.0
19,20,Money Train 1995,"[Action, Comedy, Crime, Drama, Thriller]",4.5
37,38,It Takes Two 1995,"[Children, Comedy]",4.5
53,54,Big Green The 1995,"[Children, Comedy]",4.5
55,56,Kids of the Round Table 1995,"[Adventure, Children, Comedy, Fantasy]",5.0
68,69,Friday 1995,[Comedy],5.0
71,72,Kicking and Screaming 1995,"[Comedy, Drama]",5.0
81,82,Antonias Line Antonia 1995,"[Comedy, Drama]",4.5
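Taking the first ten matching rows returns movies in file order rather than by popularity. One possible way to rank them, shown here as a sketch on made-up data and not as the method used in this notebook, is to count how many ratings above 4 each movie in the genre received. The function name `top_movies_for_genres` and the toy frames are hypothetical.

```python
import pandas as pd

# Made-up movies and ratings with the same column names as the notebook
toy_movies = pd.DataFrame({
    "movieId": [1, 2, 3],
    "cleaned_title": ["Movie A", "Movie B", "Movie C"],
    "split_genres": [["Comedy"], ["Comedy", "Drama"], ["Drama"]],
})
toy_ratings = pd.DataFrame({
    "userId":  [1, 2, 3, 1, 2, 3],
    "movieId": [1, 1, 2, 2, 3, 3],
    "rating":  [5.0, 4.5, 5.0, 3.0, 5.0, 4.5],
})

def top_movies_for_genres(genre_list, ratings, movies, n=10):
    """Rank movies in any of the given genres by their number of ratings above 4."""
    wanted = set(genre_list)
    # Movies that share at least one genre with the request
    in_genre = movies[movies["split_genres"].apply(lambda g: bool(wanted & set(g)))]
    # Number of high ratings per movie
    counts = ratings.loc[ratings["rating"] > 4, "movieId"].value_counts().rename("high_ratings")
    ranked = in_genre.merge(counts, left_on="movieId", right_index=True)
    return ranked.sort_values("high_ratings", ascending=False).head(n)

result = top_movies_for_genres(["Comedy"], toy_ratings, toy_movies)
print(result[["cleaned_title", "high_ratings"]])
```

Filtering the small movies table first and counting ratings per movie avoids merging all 25 million rating rows with the movie metadata.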


Putting all the above code into functions,
so that the user can enter either a movie name or a genre and see recommendations based on it

In [61]:
def find_similar_movies(movieid):
    similar_users = df_ratings[(df_ratings['movieId']==movieid) & (df_ratings['rating']>4)]['userId'].unique()
    similar_recs= df_ratings[(df_ratings['userId'].isin (similar_users)) & (df_ratings['rating']>4)]["movieId"]
    
    similar_recs= similar_recs.value_counts()/len(similar_users)
    similar_recs= similar_recs[similar_recs>.10]
    
    total_users = df_ratings[(df_ratings['movieId'].isin (similar_recs.index)) & (df_ratings['rating']>4)]
    total_user_recs= total_users['movieId'].value_counts()/len(total_users['userId'].unique())

    recommendation_percentage = pd.concat([similar_recs,total_user_recs],axis=1)
    recommendation_percentage.columns = ['similar_users','all_users']
    
    recommendation_percentage['score']= recommendation_percentage['similar_users']/recommendation_percentage['all_users']
    recommendation_percentage= recommendation_percentage.sort_values('score',ascending=False)
    
    merged_df=recommendation_percentage.merge(df_movies,left_index=True, right_on='movieId')
    return merged_df[:10]

In [62]:
def recommend_movies_based_on_genre(genre_list):
    # Join each rating to its movie on the shared movieId column
    merged_df_genre=df_ratings.merge(df_movies,on='movieId')
    filtered_columns = ['movieId','cleaned_title','split_genres','rating']
    merged_df_genre=merged_df_genre[filtered_columns]
    # Keep ratings above 4 for movies that have at least one of the requested genres
    merged_df_genre=merged_df_genre[(merged_df_genre['rating']>4) & merged_df_genre['split_genres'].apply(lambda lst: any(item in lst for item in genre_list))]
    return merged_df_genre[:10]


In [63]:
#recommend_movies_based_on_genre(['Adventure'])

In [64]:
#Designing an interactive widget

movie_input_name = widgets.Text(value= 'Bad Poem',
                          description = 'Movie title',
                          disabled = False)
genre_input = widgets.Text(value= 'Comedy',
                           description = 'Enter Genre',
                           disabled = False)

recs_list = widgets.Output()

input_type_dropdown = widgets.Dropdown(options=['Movie', 'Genre'], description='Input Type:', value='Movie')

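def on_type(change):
    with recs_list:
        recs_list.clear_output()
        
        input_type = input_type_dropdown.value
        
        if input_type == 'Movie':
            # Read the widget value directly so the callback works for any observed widget
            title = movie_input_name.value
            if len(title) >= 5:
                results = search(title)
                movieid = results.iloc[0]['movieId']
                display(find_similar_movies(movieid))

        elif input_type == 'Genre':
            # Strip whitespace and skip empty entries, e.g. "Comedy, Drama"
            genre_list = [g.strip() for g in genre_input.value.split(',') if g.strip()]
            if genre_list:
                display(recommend_movies_based_on_genre(genre_list))

# Refresh the recommendations when any of the three inputs changes
for widget in (movie_input_name, genre_input, input_type_dropdown):
    widget.observe(on_type, names='value')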

display(movie_input_name, genre_input, input_type_dropdown, recs_list)


Text(value='Bad Poem', description='Movie title')

Text(value='Comedy', description='Enter Genre')

Dropdown(description='Input Type:', options=('Movie', 'Genre'), value='Movie')

Output()

Now when I typed in the movie name 'No fear no die', I got the top 10 recommendations based on similar user ratings.

Similarly, when I remove the title, set the input type to Genre, and type in a preferred genre, the recommendations change accordingly.
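One detail of the genre input is worth a quick look: splitting a comma-separated string keeps the spaces after the commas, and splitting an empty string yields a list with one empty entry. The helper below, `parse_genres`, is a hypothetical name for a small sketch of how the text could be normalized before matching against the genre lists.

```python
def parse_genres(text):
    """Split a comma-separated string into clean genre names."""
    return [part.strip() for part in text.split(",") if part.strip()]

print("Comedy, Drama".split(","))     # ['Comedy', ' Drama']
print(parse_genres("Comedy, Drama"))  # ['Comedy', 'Drama']

print("".split(","))                  # ['']
print(parse_genres(""))               # []
```

Matching is still case-sensitive in this sketch, so "comedy" would not match the dataset's "Comedy".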

In [74]:
#Debugging code

#find_similar_movies(296)
#search('Bad Poem')
#results = search('Bad Poem')
#movieid = results.iloc[0]['movieId']
#movieid
#find_similar_movies(movieid)