# Movie Recommendation System with Python

---

![Jupyter](https://img.shields.io/badge/Jupyter-gray?style=flat-square&logo=jupyter)
![Python](https://img.shields.io/badge/Python-gray?style=flat-square&logo=python)

A movie recommendation system built on the [MovieLens 25M dataset](https://grouplens.org/datasets/movielens/25m/), which contains movie titles and genres, user IDs, and over 25 million ratings. The first step builds a search engine in Jupyter to find a movie by title; the second builds a recommendation engine that suggests similar movies.



#### 1. Import libraries


In [1]:
import pandas as pd
import re
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np
import ipywidgets as widgets
from IPython.display import display

#### 2. Read movie dataset

In [2]:
# reading csv file that contains the movie catalog with pandas

movie_catalog = pd.read_csv("movies.csv")

##### 2.1 Display movie catalog

In [3]:
movie_catalog

| | movieId | title | genres |
|---|---|---|---|
| 0 | 1 | Toy Story (1995) | Adventure\|Animation\|Children\|Comedy\|Fantasy |
| 1 | 2 | Jumanji (1995) | Adventure\|Children\|Fantasy |
| 2 | 3 | Grumpier Old Men (1995) | Comedy\|Romance |
| 3 | 4 | Waiting to Exhale (1995) | Comedy\|Drama\|Romance |
| 4 | 5 | Father of the Bride Part II (1995) | Comedy |
| ... | ... | ... | ... |
| 62418 | 209157 | We (2018) | Drama |
| 62419 | 209159 | Window of the Soul (2001) | Documentary |
| 62420 | 209163 | Bad Poems (2018) | Comedy\|Drama |
| 62421 | 209169 | A Girl Thing (2001) | (no genres listed) |


#### 3. Clean movie titles

In [4]:
# Cleaning movie titles with regex

def clean_movie_title(title):
    return re.sub("[^a-zA-Z0-9 ]", "", title)
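A quick way to see what the cleaning step does is to run it on a few titles. The sketch below repeats the function so it runs on its own; the first three titles come from the catalog shown in this README, while the last one is a made-up sample string added only to illustrate how accented letters are treated.

```python
import re

def clean_movie_title(title):
    # Keep only ASCII letters, digits, and spaces; every other character is removed
    return re.sub("[^a-zA-Z0-9 ]", "", title)

samples = [
    "Toy Story (1995)",
    "Matrix, The (1999)",
    "Men in Black (a.k.a. MIB) (1997)",
    "Amélie (2001)",  # hypothetical sample, not taken from the dataset
]

cleaned = [clean_movie_title(t) for t in samples]
for raw, clean in zip(samples, cleaned):
    print(f"{raw!r} -> {clean!r}")
```

Because the pattern only keeps ASCII characters, accented letters are dropped rather than transliterated, which may matter when searching for non-English titles.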

In [5]:
# Adding a new column of the clean titles to the movie catalog 

movie_catalog["clean_title"] = movie_catalog["title"].apply(clean_movie_title)

In [6]:
# display movie catalog with new column added

movie_catalog

| | movieId | title | genres | clean_title |
|---|---|---|---|---|
| 0 | 1 | Toy Story (1995) | Adventure\|Animation\|Children\|Comedy\|Fantasy | Toy Story 1995 |
| 1 | 2 | Jumanji (1995) | Adventure\|Children\|Fantasy | Jumanji 1995 |
| 2 | 3 | Grumpier Old Men (1995) | Comedy\|Romance | Grumpier Old Men 1995 |
| 3 | 4 | Waiting to Exhale (1995) | Comedy\|Drama\|Romance | Waiting to Exhale 1995 |
| 4 | 5 | Father of the Bride Part II (1995) | Comedy | Father of the Bride Part II 1995 |
| ... | ... | ... | ... | ... |
| 62418 | 209157 | We (2018) | Drama | We 2018 |
| 62419 | 209159 | Window of the Soul (2001) | Documentary | Window of the Soul 2001 |
| 62420 | 209163 | Bad Poems (2018) | Comedy\|Drama | Bad Poems 2018 |
| 62421 | 209169 | A Girl Thing (2001) | (no genres listed) | A Girl Thing 2001 |


#### 4. Create a TF-IDF Matrix 

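Convert the cleaned movie titles into numeric vectors with TfidfVectorizer to build a TF-IDF matrix. With ngram_range=(1,2) the vocabulary contains both single words and two-word phrases, which helps the search match titles more precisely.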

In [7]:
vectorizer = TfidfVectorizer(ngram_range=(1,2)) 
tfidf_matrix = vectorizer.fit_transform(movie_catalog["clean_title"])
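To make the effect of ngram_range=(1,2) concrete, here is a small sketch on three cleaned titles. The names `toy_titles`, `toy_vectorizer`, and `toy_matrix` are illustrative and not part of the notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Three cleaned titles standing in for the full catalog
toy_titles = ["Toy Story 1995", "Toy Story 2 1999", "Jumanji 1995"]

toy_vectorizer = TfidfVectorizer(ngram_range=(1, 2))
toy_matrix = toy_vectorizer.fit_transform(toy_titles)

# The learned vocabulary holds lowercase unigrams and bigrams
vocabulary = sorted(toy_vectorizer.vocabulary_)
print(vocabulary)

# One row per title, one column per vocabulary entry
print(toy_matrix.shape)
```

Note that the default token pattern of TfidfVectorizer only keeps tokens of two or more characters, so the "2" in "Toy Story 2 1999" does not become a feature. Sequels that differ only by a single digit are therefore told apart mainly by their year.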

#### 5. Create a search function 

The search function cleans the typed title, converts it into a TF-IDF vector, computes the cosine similarity between that vector and every title in the movie catalog, and returns the 5 most similar titles.

In [8]:
def search(title):
    title = clean_movie_title(title)
    query_vector = vectorizer.transform([title])
    titles_similarity = cosine_similarity(query_vector, tfidf_matrix).flatten()
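    # np.argpartition does not order the selected items, so sort the scores instead
    records = np.argsort(titles_similarity)[-5:][::-1]  # indices of the 5 most similar titles, best match first
    movie_recommendation = movie_catalog.iloc[records]  # rows in descending order of similarity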
    return movie_recommendation
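The same search idea can be tried on a tiny catalog to check that the best match really comes first. This is a minimal sketch under assumed names (`toy_catalog`, `toy_search`, and the parameter `k` are hypothetical); it adds the similarity score as a column so the ordering is visible.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Tiny stand-in for the movie catalog
toy_catalog = pd.DataFrame({
    "movieId": [1, 2, 3, 4],
    "clean_title": ["Toy Story 1995", "Toy Story 2 1999", "Jumanji 1995", "Matrix The 1999"],
})

toy_vectorizer = TfidfVectorizer(ngram_range=(1, 2))
toy_matrix = toy_vectorizer.fit_transform(toy_catalog["clean_title"])

def toy_search(title, k=2):
    # Vectorize the query with the vocabulary learned from the catalog
    query_vector = toy_vectorizer.transform([title])
    similarity = cosine_similarity(query_vector, toy_matrix).flatten()
    # argsort is ascending, so take the last k indices and reverse them
    top = np.argsort(similarity)[-k:][::-1]
    return toy_catalog.iloc[top].assign(similarity=similarity[top])

result = toy_search("Matrix 1999")
print(result)
```

Sorting the whole similarity array is slower than partitioning it, but for a catalog of this size the difference is unlikely to matter, and it guarantees that the first row is the closest title.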

#### 6. Build an interactive Jupyter Notebook widget

In [9]:
# interactive Jupyter Notebook widget:
# the user types a movie name and the search results are displayed

movie_input = widgets.Text(
    value='The Matrix',
    description='Movie Title:',
    disabled=False
)

movie_list = widgets.Output()

# function on_type is called whenever the text in the text box changes
def on_type(data):
    with movie_list:
        movie_list.clear_output()
        title = data["new"]
        if len(title) > 5:
            display(search(title))

movie_input.observe(on_type, names='value')

display(movie_input, movie_list)

Text(value='The Matrix', description='Movie Title:')

Output()

- ***Here*** is a sample of the interactive Jupyter Notebook widget: the user types a movie title, in this case **The Matrix**, and the widget displays the 5 most similar movie titles.

![ezgif com-gif-maker](https://user-images.githubusercontent.com/45029403/213072576-4d99e1c2-e9ac-433f-bf44-3c34437a9d21.gif)

#### 7. Read movie ratings dataset

In this section we use the **ratings.csv** file to recommend movies similar to the one we searched for.

In [10]:
# ratings csv contains users and the ratings they give to the movies

ratings = pd.read_csv("ratings.csv")
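With over 25 million rows, the ratings table can take a noticeable amount of memory. One possible way to reduce it, sketched below, is to load only the columns used in the later steps and to request narrower dtypes. The sketch reads a three-row in-memory sample (rows copied from the table in this section) instead of the real file, and it assumes the timestamp column is not needed; the name `ratings_small` is illustrative.

```python
import io
import pandas as pd

# In-memory stand-in for ratings.csv
sample_csv = io.StringIO(
    "userId,movieId,rating,timestamp\n"
    "1,296,5.0,1147880044\n"
    "1,306,3.5,1147868817\n"
    "1,307,5.0,1147868828\n"
)

ratings_small = pd.read_csv(
    sample_csv,
    usecols=["userId", "movieId", "rating"],  # skip the unused timestamp column
    dtype={"userId": "int32", "movieId": "int32", "rating": "float32"},
)

print(ratings_small.dtypes)
```

Half-star ratings such as 3.5 are exactly representable in float32, so comparisons like rating > 4 behave the same as with the default float64.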

In [11]:
ratings

| | userId | movieId | rating | timestamp |
|---|---|---|---|---|
| 0 | 1 | 296 | 5.0 | 1147880044 |
| 1 | 1 | 306 | 3.5 | 1147868817 |
| 2 | 1 | 307 | 5.0 | 1147868828 |
| 3 | 1 | 665 | 5.0 | 1147878820 |
| 4 | 1 | 899 | 3.5 | 1147868510 |
| ... | ... | ... | ... | ... |
| 25000090 | 162541 | 50872 | 4.5 | 1240953372 |
| 25000091 | 162541 | 55768 | 2.5 | 1240951998 |
| 25000092 | 162541 | 56176 | 2.0 | 1240950697 |
| 25000093 | 162541 | 58559 | 4.0 | 1240953434 |


#### 8. Find users who also liked the searched movie

In [12]:
movie_id = 2571  # The Matrix 

similar_users = ratings[(ratings["movieId"] == movie_id) & (ratings["rating"] > 4)]["userId"].unique()

In [13]:
similar_users

array([     2,      4,     13, ..., 162532, 162534, 162541], dtype=int64)
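The filter in this step combines two boolean masks. A small sketch on made-up ratings shows which users pass it; `toy_ratings`, `target_movie`, and the ids in it are hypothetical.

```python
import pandas as pd

# Made-up ratings: three users rated movie 10, with different scores
toy_ratings = pd.DataFrame({
    "userId":  [1, 1, 2, 2, 3, 3, 4],
    "movieId": [10, 20, 10, 30, 10, 20, 20],
    "rating":  [5.0, 4.5, 4.0, 5.0, 4.5, 3.0, 5.0],
})
target_movie = 10  # hypothetical movie id

# A row passes only if it is about the target movie AND the rating is above 4
liked_target = (toy_ratings["movieId"] == target_movie) & (toy_ratings["rating"] > 4)
toy_similar_users = toy_ratings[liked_target]["userId"].unique()

print(toy_similar_users)
```

User 2 rated the movie exactly 4.0 and is excluded, because the comparison is strict. With half-star ratings, "above 4" means 4.5 or 5.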

#### 9. Find movies that similar users liked 

In [14]:
# find the movies that similar_users rated above 4 (4.5 or 5 stars)

similar_user_recommendations = ratings[(ratings["userId"].isin(similar_users)) & (ratings["rating"] > 4)]["movieId"]

In [15]:
similar_user_recommendations

72            110
74            151
76            260
79            318
80            333
            ...  
25000085     8983
25000086    31658
25000089    45517
25000090    50872
25000094    63876
Name: movieId, Length: 2391195, dtype: int64

##### 9.1 Keep only the movies liked by more than 10% of similar users

In [16]:
similar_user_recommendations = similar_user_recommendations.value_counts() / len(similar_users)

similar_user_recommendations = similar_user_recommendations[similar_user_recommendations > .10]

In [17]:
similar_user_recommendations

2571     1.000000
318      0.475537
2959     0.449214
296      0.412255
4993     0.401265
           ...   
72998    0.106130
48394    0.103552
1201     0.103389
111      0.102521
1653     0.101761
Name: movieId, Length: 107, dtype: float64
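Dividing value_counts() by the number of similar users turns raw counts into the fraction of similar users who liked each movie. The sketch below walks through that on made-up data; the names are illustrative, and the 0.5 cutoff is chosen only because the toy group has three users (the notebook uses 0.10).

```python
import pandas as pd

toy_ratings = pd.DataFrame({
    "userId":  [1, 1, 1, 2, 2, 3, 3, 3, 4, 4],
    "movieId": [10, 20, 30, 10, 20, 10, 30, 40, 20, 40],
    "rating":  [5.0, 5.0, 4.5, 4.5, 5.0, 5.0, 3.0, 4.5, 5.0, 5.0],
})
toy_similar_users = [1, 2, 3]  # assumed: the users who rated movie 10 above 4

# Movies that the similar users rated above 4
liked = toy_ratings[
    toy_ratings["userId"].isin(toy_similar_users) & (toy_ratings["rating"] > 4)
]["movieId"]

# Count per movie, divided by the group size, gives a share between 0 and 1
share = liked.value_counts() / len(toy_similar_users)

# Keep only movies liked by more than half of the group (toy threshold)
frequent = share[share > 0.5]

print(share)
print(frequent)
```

The searched movie always has a share of 1.0, since every similar user liked it by construction. That matches the first row of the output above.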

##### 9.2 Find how popular those movies are among all users

In [18]:
# ratings above 4 from any user for the candidate movies

all_users = ratings[(ratings["movieId"].isin(similar_user_recommendations.index)) & (ratings["rating"] > 4)]

In [19]:
# fraction of those users who rated each candidate movie above 4

all_user_recommendations = all_users["movieId"].value_counts() / len(all_users["userId"].unique())

In [20]:
all_user_recommendations

318      0.345393
296      0.287313
2571     0.246296
356      0.237447
593      0.228003
           ...   
1580     0.049144
72998    0.047921
54286    0.047767
5445     0.043904
1653     0.041157
Name: movieId, Length: 107, dtype: float64

#### 10. Create a recommendation score

In [21]:
# creating a recommendation score by comparing the two percentages

recommendations_percentages = pd.concat([similar_user_recommendations, all_user_recommendations], axis=1)
recommendations_percentages.columns = ["similar", "all"]

In [22]:
recommendations_percentages

| | similar | all |
|---|---|---|
| 2571 | 1.000000 | 0.246296 |
| 318 | 0.475537 | 0.345393 |
| 2959 | 0.449214 | 0.218726 |
| 296 | 0.412255 | 0.287313 |
| 4993 | 0.401265 | 0.189252 |
| ... | ... | ... |
| 72998 | 0.106130 | 0.047921 |
| 48394 | 0.103552 | 0.055600 |
| 1201 | 0.103389 | 0.056028 |
| 111 | 0.102521 | 0.082462 |


In [23]:
# Adding a new column of the score 
recommendations_percentages["score"] = recommendations_percentages["similar"] / recommendations_percentages["all"]

In [24]:
recommendations_percentages = recommendations_percentages.sort_values("score", ascending=False)
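The score is the ratio of two shares: how often similar users liked a movie, divided by how often users in general liked it. A value well above 1 suggests the movie is especially popular among people who liked the searched title, rather than simply popular with everyone. The sketch below recomputes the score for three rows whose similar and all values are copied from the tables in this section; the name `rows` is illustrative.

```python
import pandas as pd

# similar / all values copied from this README for movieIds 2571, 5445, and 527
rows = pd.DataFrame(
    {
        "similar": [1.000000, 0.117419, 0.271607],
        "all": [0.246296, 0.043904, 0.217202],
    },
    index=[2571, 5445, 527],
)

# Same formula as the notebook: share among similar users over share among all users
rows["score"] = rows["similar"] / rows["all"]

print(rows.round(3))
```

Movie 527 is liked by about 27% of similar users but also by about 22% of all users, so its score stays close to 1. Movie 5445 is liked by fewer similar users in absolute terms, yet it scores higher because it is much less popular overall.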

In [25]:
recommendations_percentages

| | similar | all | score |
|---|---|---|---|
| 2571 | 1.000000 | 0.246296 | 4.060161 |
| 5445 | 0.117419 | 0.043904 | 2.674428 |
| 1527 | 0.174351 | 0.066742 | 2.612311 |
| 1653 | 0.101761 | 0.041157 | 2.472492 |
| 32587 | 0.120594 | 0.049632 | 2.429754 |
| ... | ... | ... | ... |
| 527 | 0.271607 | 0.217202 | 1.250482 |
| 111 | 0.102521 | 0.082462 | 1.243256 |
| 912 | 0.108355 | 0.093456 | 1.159424 |
| 608 | 0.167241 | 0.145635 | 1.148360 |


##### 10.1 Display the top 10 recommendations for **The Matrix**

In [26]:
# top 10 recommendations 

recommendations_percentages.head(10).merge(movie_catalog, left_index=True, right_on="movieId")

| | similar | all | score | movieId | title | genres | clean_title |
|---|---|---|---|---|---|---|---|
| 2480 | 1.0 | 0.246296 | 4.060161 | 2571 | Matrix, The (1999) | Action\|Sci-Fi\|Thriller | Matrix The 1999 |
| 5337 | 0.117419 | 0.043904 | 2.674428 | 5445 | Minority Report (2002) | Action\|Crime\|Mystery\|Sci-Fi\|Thriller | Minority Report 2002 |
| 1475 | 0.174351 | 0.066742 | 2.612311 | 1527 | Fifth Element, The (1997) | Action\|Adventure\|Comedy\|Sci-Fi | Fifth Element The 1997 |
| 1591 | 0.101761 | 0.041157 | 2.472492 | 1653 | Gattaca (1997) | Drama\|Sci-Fi\|Thriller | Gattaca 1997 |
| 9778 | 0.120594 | 0.049632 | 2.429754 | 32587 | Sin City (2005) | Action\|Crime\|Film-Noir\|Mystery\|Thriller | Sin City 2005 |
| 1523 | 0.117663 | 0.049144 | 2.394242 | 1580 | Men in Black (a.k.a. MIB) (1997) | Action\|Comedy\|Sci-Fi | Men in Black aka MIB 1997 |
| 10679 | 0.154406 | 0.064911 | 2.378739 | 44191 | V for Vendetta (2006) | Action\|Sci-Fi\|Thriller\|IMAX | V for Vendetta 2006 |
| 10002 | 0.175518 | 0.074014 | 2.371422 | 33794 | Batman Begins (2005) | Action\|Crime\|IMAX | Batman Begins 2005 |
| 5310 | 0.14963 | 0.063554 | 2.354373 | 5418 | Bourne Identity, The (2002) | Action\|Mystery\|Thriller | Bourne Identity The 2002 |
| 1207 | 0.191827 | 0.081847 | 2.343727 | 1240 | Terminator, The (1984) | Action\|Sci-Fi\|Thriller | Terminator The 1984 |


#### 11. Create a recommendation function

Using the code from steps 8 to 10, we create a recommendation function that returns the top 10 movie recommendations.

In [27]:
def find_similar_movies(movie_id):
    similar_users = ratings[(ratings["movieId"] == movie_id) & (ratings["rating"] > 4)]["userId"].unique()
    similar_user_recommendations = ratings[(ratings["userId"].isin(similar_users)) & (ratings["rating"] > 4)]["movieId"]
    
    similar_user_recommendations = similar_user_recommendations.value_counts() / len(similar_users)
    similar_user_recommendations = similar_user_recommendations[similar_user_recommendations > .10]
    
    all_users = ratings[(ratings["movieId"].isin(similar_user_recommendations.index)) & (ratings["rating"] > 4)]
    all_user_recommendations = all_users["movieId"].value_counts() / len(all_users["userId"].unique())
    
    recommendations_percentages = pd.concat([similar_user_recommendations, all_user_recommendations], axis=1)
    recommendations_percentages.columns = ["similar", "all"]
    
    recommendations_percentages["score"] = recommendations_percentages["similar"] / recommendations_percentages["all"]
    recommendations_percentages = recommendations_percentages.sort_values("score", ascending=False)
    
    return recommendations_percentages.head(10).merge(movie_catalog, left_index=True, right_on="movieId")[["score", "title", "genres"]]
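The function above reads the global ratings and movie_catalog frames, which makes it hard to try on small data. As a sketch only, the variant below takes both frames as arguments so the whole pipeline can be checked on a made-up example. The name `find_similar_movies_in` and the parameters `min_share` and `top_n` are hypothetical, as is the early return for movies nobody rated above 4.

```python
import pandas as pd

def find_similar_movies_in(ratings, catalog, movie_id, min_share=0.10, top_n=10):
    # Only ratings above 4 count as "liked"
    liked = ratings[ratings["rating"] > 4]
    similar_users = liked.loc[liked["movieId"] == movie_id, "userId"].unique()
    if len(similar_users) == 0:
        return pd.DataFrame(columns=["score", "title"])

    # Share of similar users who liked each movie, filtered by the threshold
    similar = liked.loc[liked["userId"].isin(similar_users), "movieId"].value_counts() / len(similar_users)
    similar = similar[similar > min_share]

    # Share of all users (who liked any candidate) who liked each candidate
    candidates = liked[liked["movieId"].isin(similar.index)]
    overall = candidates["movieId"].value_counts() / candidates["userId"].nunique()

    scores = pd.concat([similar, overall], axis=1)
    scores.columns = ["similar", "all"]
    scores["score"] = scores["similar"] / scores["all"]
    top = scores.sort_values("score", ascending=False).head(top_n)
    return top.merge(catalog, left_index=True, right_on="movieId")[["score", "title"]]

# Made-up data: users 1-3 like movie 10, users 4-6 mostly like movie 30
toy_ratings = pd.DataFrame({
    "userId":  [1, 1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6],
    "movieId": [10, 20, 30, 10, 20, 10, 20, 30, 40, 30, 40, 30, 20],
    "rating":  [5.0, 5.0, 5.0, 5.0, 4.5, 4.5, 5.0, 5.0, 5.0, 4.5, 5.0, 5.0, 5.0],
})
toy_catalog = pd.DataFrame({
    "movieId": [10, 20, 30, 40],
    "title": ["Movie A", "Movie B", "Movie C", "Movie D"],
})

result = find_similar_movies_in(toy_ratings, toy_catalog, movie_id=10)
print(result)
```

As in the table in step 10.1, the searched movie itself ranks first. Dropping the row whose movieId equals movie_id before taking the top results would be one way to avoid recommending it back to the user.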

#### 12. Create an interactive recommendation widget with Jupyter

In [28]:
# Interactive recommendation widget

movie_name_input = widgets.Text(
    value='The Matrix',
    description='Movie Title:',
    disabled=False
)
recommendation_list = widgets.Output()

def on_type(data):
    with recommendation_list:
        recommendation_list.clear_output()
        title = data["new"]
        if len(title) > 5:
            results = search(title)
            movie_id = results.iloc[0]["movieId"]
            display(find_similar_movies(movie_id))

movie_name_input.observe(on_type, names='value')

display(movie_name_input, recommendation_list) 

Text(value='The Matrix', description='Movie Title:')

Output()

- ***Here*** is a sample of the finished interactive Jupyter Notebook widget: the user searches for a specific movie title, in this case **The Matrix**, and the widget displays the top 10 recommended movies similar to it, found with user-based collaborative filtering.

![ezgif com-gif-maker (1)](https://user-images.githubusercontent.com/45029403/213073816-6ba4a876-de2a-464d-a08c-447b9775e67d.gif)