# A Content-Based Movie Recommender System by Phil Han

### The following is a fairly straightforward movie recommender system built on the well-known MovieLens dataset.  There are two common types of recommender systems: 1) content-based systems, which recommend movies based on how similar the movies themselves are, and 2) collaborative systems, which recommend items based on how similar users rated them.  The system below is a hybrid, though it leans mostly on the content-based approach.

### Before I start, for this week's assignment, I am very much indebted to Vikas Paruchuri for his excellent "*Movie Recommendation System With Python And Pandas: Data Project*" tutorial on the Dataquest YouTube channel, since much of the code below closely follows his tutorial (Paruchuri, 2022).  I'd also like to acknowledge Yohan Jeong for his exceptional blog post, "*Making a Content-Based Movie Recommender With Python*".  His post gave me a good understanding of the content-based approach to movie recommendation as well as cosine similarity (Jeong, 2021).

In [1]:
# Import the necessary libraries 
import numpy as np
import pandas as pd

# load the 'movies' dataset from the movielens data
movies = pd.read_csv('movies.csv')
movies

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy
...,...,...,...
9737,193581,Black Butler: Book of the Atlantic (2017),Action|Animation|Comedy|Fantasy
9738,193583,No Game No Life: Zero (2017),Animation|Comedy|Fantasy
9739,193585,Flint (2017),Drama
9740,193587,Bungo Stray Dogs: Dead Apple (2018),Action|Animation


In [2]:
# Clean titles using 'regex' to remove any special characters within the titles

# import regular expression
import re

def clean_title(title):
    return re.sub("[^a-zA-Z0-9 ]", "", title)
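As a quick sanity check, here is the same function applied to two titles from the dataset. This is only a small standalone sketch; the function is redefined so the snippet runs on its own.

```python
import re

def clean_title(title):
    # Keep only letters, digits, and spaces; drop everything else
    return re.sub("[^a-zA-Z0-9 ]", "", title)

# Punctuation such as hyphens, commas, and parentheses is removed
print(clean_title("Spider-Man 2 (2004)"))
print(clean_title("Incredibles, The (2004)"))
```

Note that the hyphen in "Spider-Man" is deleted rather than replaced with a space, so the two halves are joined into one word.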

In [3]:
# Create a new column for the cleaned titles using the pandas 'apply' method
movies["clean_title"] = movies["title"].apply(clean_title)

In [4]:
movies

Unnamed: 0,movieId,title,genres,clean_title
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,Toy Story 1995
1,2,Jumanji (1995),Adventure|Children|Fantasy,Jumanji 1995
2,3,Grumpier Old Men (1995),Comedy|Romance,Grumpier Old Men 1995
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance,Waiting to Exhale 1995
4,5,Father of the Bride Part II (1995),Comedy,Father of the Bride Part II 1995
...,...,...,...,...
9737,193581,Black Butler: Book of the Atlantic (2017),Action|Animation|Comedy|Fantasy,Black Butler Book of the Atlantic 2017
9738,193583,No Game No Life: Zero (2017),Animation|Comedy|Fantasy,No Game No Life Zero 2017
9739,193585,Flint (2017),Drama,Flint 2017
9740,193587,Bungo Stray Dogs: Dead Apple (2018),Action|Animation,Bungo Stray Dogs Dead Apple 2018


## Build Search Engine

### *Step 1*: We want to build a search engine.  We do this by converting the titles into numeric vectors using the 'term frequency-inverse document frequency' method, or TF-IDF vectorizer, which will help us find exact titles as well as similar ones.

In [5]:
# Import the TF-IDF vectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

# Initialize the vectorizer; ngram_range=(1,2) makes it look at single words as well as pairs of consecutive words
vectorizer = TfidfVectorizer(ngram_range=(1,2))

# Use the vectorizer to turn the titles into a matrix of numbers (one row per title)
tfidf = vectorizer.fit_transform(movies["clean_title"])
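To make the idea behind `ngram_range=(1,2)` more concrete, here is a minimal pure-Python sketch of what "single words plus consecutive word pairs" means, together with a smoothed inverse document frequency weight. The helper names `ngrams` and `idf` are my own, the three-title corpus is just a toy sample, and the real `TfidfVectorizer` differs in details (tokenization, term-frequency weighting, and normalization), so treat this as an illustration rather than a reimplementation.

```python
import math
from collections import Counter

def ngrams(text, n_max=2):
    """Return every run of 1..n_max consecutive words in a lowercased title."""
    words = text.lower().split()
    return [
        " ".join(words[i:i + n])
        for n in range(1, n_max + 1)
        for i in range(len(words) - n + 1)
    ]

# Three cleaned titles used as a toy corpus
titles = ["Toy Story 1995", "Toy Story 2 1999", "Jumanji 1995"]
docs = [Counter(ngrams(t)) for t in titles]

# Document frequency: in how many titles does each term appear?
doc_freq = Counter(term for doc in docs for term in doc)

def idf(term):
    """Smoothed inverse document frequency: rarer terms get larger weights."""
    return math.log((1 + len(docs)) / (1 + doc_freq[term])) + 1

print(ngrams("Toy Story 1995"))
for term in ["toy story", "1995", "jumanji"]:
    print(term, round(idf(term), 3))
```

Terms that appear in only one title, such as "jumanji", receive a higher weight than terms shared by several titles, which is what lets the search engine tell titles apart.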

### *Step 2*: We will use cosine similarity to build our recommender system by calculating the similarity between movies.  Cosine similarity provides a way to measure how similar users, items, or pieces of content are (Stieber, 2018).
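For intuition, cosine similarity is the dot product of two vectors divided by the product of their lengths. The sketch below computes it by hand on small made-up vectors; the `cosine` helper is hypothetical and is not the scikit-learn function used in this notebook.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # treat an all-zero vector as similar to nothing
    return dot / (norm_u * norm_v)

same = cosine([1, 1, 0], [1, 1, 0])      # identical direction, close to 1
partial = cosine([1, 1, 0], [1, 0, 1])   # one shared term out of two, close to 0.5
disjoint = cosine([1, 0], [0, 1])        # no shared terms, exactly 0

print(round(same, 3), round(partial, 3), round(disjoint, 3))
```

For TF-IDF vectors, which are never negative, the value ranges from 0 (no terms in common) to 1 (same direction).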

In [6]:
# Compute similarities using Cosine Similarity

# import cosine similarity and numpy
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

def search(title):
    title = clean_title(title)  # clean the query the same way as the titles
    query_vec = vectorizer.transform([title])  # use the fitted vectorizer to turn the query into a TF-IDF vector
    
    # compare the query vector to each of the titles in the dataset
    similarity = cosine_similarity(query_vec, tfidf).flatten()
    
    # Find the five most similar titles using numpy 'argpartition'
    indices = np.argpartition(similarity, -5)[-5:]

    # argpartition does not order the top five, so sort them by similarity (best match first)
    indices = indices[np.argsort(similarity[indices])[::-1]]

    # Select the matching rows from our movie data
    results = movies.iloc[indices]

    return results

In [7]:

# Create widgets for input and output data.  But we will expand this step a bit further at the end of this code
# to display the ten recommended movies.

import ipywidgets as widgets
from IPython.display import display

movie_input = widgets.Text(
    value="Toy Story",
    description="Movie Title:",
    disabled=False
)

# Make output widget
movie_list = widgets.Output()

def on_type(data):
    with movie_list:
        movie_list.clear_output()
        title = data["new"]
        if len(title) > 5:
            display(search(title))
            
movie_input.observe(on_type, names='value')

# display both of the outputs
display(movie_input, movie_list)

Text(value='Toy Story', description='Movie Title:')

Output()

## Find Movies Similar To Movies We Like

### We will now use another dataset called 'ratings' to find similar users by ratings.

In [8]:
# Read the ratings data
ratings = pd.read_csv("ratings.csv")

In [9]:
ratings

Unnamed: 0,userId,movieId,rating,timestamp
0,1,1,4.0,964982703
1,1,3,4.0,964981247
2,1,6,4.0,964982224
3,1,47,5.0,964983815
4,1,50,5.0,964982931
...,...,...,...,...
100831,610,166534,4.0,1493848402
100832,610,168248,5.0,1493850091
100833,610,168250,5.0,1494273047
100834,610,168252,5.0,1493846352


In [10]:
ratings.dtypes

userId         int64
movieId        int64
rating       float64
timestamp      int64
dtype: object

In [11]:
movie_id = 1

In [12]:
# Find users who liked the same movie (rated it above 4)
similar_users = ratings[(ratings["movieId"] == movie_id) & (ratings["rating"] > 4)]["userId"].unique()

In [13]:
similar_users

array([  7,  17,  31,  40,  43,  46,  57,  63,  71,  73,  96,  98, 145,
       151, 159, 166, 169, 171, 177, 201, 206, 220, 229, 234, 240, 247,
       252, 254, 269, 270, 273, 275, 280, 282, 288, 304, 328, 341, 347,
       353, 357, 364, 367, 378, 380, 382, 389, 396, 411, 438, 448, 451,
       453, 456, 460, 471, 484, 488, 533, 559, 562, 573, 584, 587, 610])

## Create A Movie Recommender System

### For our recommender system, we will build a content-based recommender, which relies on the similarity of movies when it makes recommendations to users.  In our case, when a user likes a movie, the recommender system finds and recommends movies that are similar to the one the user likes (Jeong, 2021).

In [14]:
# Find the other movies that these similar users also rated above 4
similar_user_recs = ratings[(ratings["userId"].isin(similar_users)) & (ratings["rating"]>4)]["movieId"]

In [15]:
similar_user_recs

874            1
875           50
877          150
879          260
880          356
           ...  
100821    160527
100829    164179
100832    168248
100833    168250
100834    168252
Name: movieId, Length: 3754, dtype: int64

In [16]:
# Keep only the movies that more than 10% of the similar users rated above 4

similar_user_recs = similar_user_recs.value_counts() / len(similar_users)

similar_user_recs = similar_user_recs[similar_user_recs > .1]
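The filtering logic in the last few cells can be sketched on a tiny hand-made ratings list without pandas. Everything here is hypothetical: the ratings are invented, and the threshold is set to 0.5 instead of 0.1 only because the toy data has three similar users.

```python
from collections import Counter

# Hypothetical toy ratings: (userId, movieId, rating)
toy_ratings = [
    (1, 10, 5.0), (1, 20, 4.5), (1, 30, 3.0),
    (2, 10, 4.5), (2, 20, 5.0), (2, 40, 5.0),
    (3, 10, 5.0), (3, 30, 4.5),
    (4, 20, 5.0), (4, 40, 4.5),
]
target_movie = 10

# Users who rated the target movie above 4
similar = {u for u, m, r in toy_ratings if m == target_movie and r > 4}

# Count how many of those users rated each movie above 4
liked = Counter(m for u, m, r in toy_ratings if u in similar and r > 4)

# Convert counts to the share of similar users, then apply the threshold
shares = {m: count / len(similar) for m, count in liked.items()}
kept = {m: share for m, share in shares.items() if share > 0.5}

print(sorted(similar))
print(kept)
```

The target movie itself always ends up with a share of 1.0, which is why movie 1 sits at the top of the real output.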

In [17]:
similar_user_recs

1        1.000000
318      0.430769
296      0.400000
356      0.384615
593      0.369231
           ...   
8368     0.107692
1097     0.107692
74458    0.107692
1219     0.107692
733      0.107692
Name: movieId, Length: 103, dtype: float64

In [18]:
# Find how all users (not just the similar ones) rated the movies on our candidate list
all_users = ratings[(ratings["movieId"].isin(similar_user_recs.index)) & (ratings["rating"]>4)]

In [19]:
all_users

Unnamed: 0,userId,movieId,rating,timestamp
3,1,47,5.0,964983815
4,1,50,5.0,964982931
15,1,260,5.0,964981680
25,1,457,5.0,964981909
28,1,527,5.0,964984002
...,...,...,...,...
100227,610,51255,5.0,1479542571
100310,610,58559,4.5,1493844688
100326,610,60069,4.5,1493844866
100429,610,74458,4.5,1479542157


In [20]:
# Find the share of those users who rated each candidate movie above 4
all_user_recs = all_users["movieId"].value_counts() / len(all_users["userId"].unique())

In [21]:
all_user_recs

318      0.362007
296      0.299283
356      0.277778
2571     0.268817
2959     0.232975
           ...   
8636     0.044803
899      0.039427
733      0.037634
78499    0.037634
500      0.030466
Name: movieId, Length: 103, dtype: float64

In [22]:
# Recommend movies by score

# Compare the percentages of the similar users and all users (avg. persons) who like what we recommend
rec_percentages = pd.concat([similar_user_recs, all_user_recs], axis=1)
rec_percentages.columns = ["similar", "all"]

In [23]:
rec_percentages

Unnamed: 0,similar,all
1,1.000000,0.116487
318,0.430769,0.362007
296,0.400000,0.299283
356,0.384615,0.277778
593,0.369231,0.229391
...,...,...
8368,0.107692,0.055556
1097,0.107692,0.069892
74458,0.107692,0.060932
1219,0.107692,0.062724


In [24]:
# Create score by recommendations from similar users divided by rec % from all users
rec_percentages["score"] = rec_percentages["similar"] / rec_percentages["all"]
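To see what the score means, here is the same division done by hand on three rows of the `rec_percentages` table shown above. The values are copied from that output, so the results are only accurate to the displayed rounding.

```python
# (similar, all) shares copied from the rec_percentages output above, keyed by movieId
rows = {
    1: (1.000000, 0.116487),    # Toy Story (1995)
    318: (0.430769, 0.362007),
    593: (0.369231, 0.229391),
}

# score = share among similar users / share among all users
scores = {movie_id: sim / overall for movie_id, (sim, overall) in rows.items()}

# Print from highest to lowest score
for movie_id, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(movie_id, round(score, 2))
```

A score above 1 means the movie is liked more often by users similar to us than by users in general. Movie 318 is liked by many similar users, but it is almost as popular with everyone, so its score stays close to 1.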

In [25]:
# Sort values of score
rec_percentages = rec_percentages.sort_values("score", ascending=False)

In [26]:
rec_percentages

Unnamed: 0,similar,all,score
1,1.000000,0.116487,8.584615
3114,0.307692,0.059140,5.202797
78499,0.184615,0.037634,4.905495
500,0.138462,0.030466,4.544796
8961,0.184615,0.057348,3.219231
...,...,...,...
110,0.184615,0.181004,1.019954
4973,0.107692,0.105735,1.018514
79132,0.123077,0.123656,0.995318
2571,0.261538,0.268817,0.972923


In [27]:
# Display ten movies recommended by those who like similar movies
rec_percentages.head(10).merge(movies, left_index=True, right_on="movieId")

Unnamed: 0,similar,all,score,movieId,title,genres,clean_title
0,1.0,0.116487,8.584615,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,Toy Story 1995
2355,0.307692,0.05914,5.202797,3114,Toy Story 2 (1999),Adventure|Animation|Children|Comedy|Fantasy,Toy Story 2 1999
7355,0.184615,0.037634,4.905495,78499,Toy Story 3 (2010),Adventure|Animation|Children|Comedy|Fantasy|IMAX,Toy Story 3 2010
436,0.138462,0.030466,4.544796,500,Mrs. Doubtfire (1993),Comedy|Drama,Mrs Doubtfire 1993
5374,0.184615,0.057348,3.219231,8961,"Incredibles, The (2004)",Action|Adventure|Animation|Children|Comedy,Incredibles The 2004
32,0.169231,0.053763,3.147692,34,Babe (1995),Children|Drama,Babe 1995
2038,0.169231,0.053763,3.147692,2716,Ghostbusters (a.k.a. Ghost Busters) (1984),Action|Comedy|Sci-Fi,Ghostbusters aka Ghost Busters 1984
506,0.246154,0.082437,2.985953,588,Aladdin (1992),Adventure|Animation|Children|Comedy|Musical,Aladdin 1992
592,0.107692,0.037634,2.861538,733,"Rock, The (1996)",Action|Adventure|Thriller,Rock The 1996
5260,0.123077,0.044803,2.747077,8636,Spider-Man 2 (2004),Action|Adventure|Sci-Fi|IMAX,SpiderMan 2 2004


## Building A Recommendation Function

In [28]:
# Create a recommendation function
def find_sim_movies(movie_id):
    # Find users who are similar to you
    similar_users = ratings[(ratings["movieId"] == movie_id) & (ratings["rating"] > 4)]["userId"].unique()
    similar_user_recs = ratings[(ratings["userId"].isin(similar_users)) & (ratings["rating"]>4)]["movieId"]
    
    # Keep only the movies that more than 10% of the similar users rated above 4
    similar_user_recs = similar_user_recs.value_counts() / len(similar_users)
    similar_user_recs = similar_user_recs[similar_user_recs > .1]
    
    # Find how often all users rated those same movies above 4
    all_users = ratings[(ratings["movieId"].isin(similar_user_recs.index)) & (ratings["rating"]>4)]
    all_user_recs = all_users["movieId"].value_counts() / len(all_users["userId"].unique())
    
    # Compare the percentages of the similar users and all users (avg. persons) who like what we recommend
    rec_percentages = pd.concat([similar_user_recs, all_user_recs], axis=1)
    rec_percentages.columns = ["similar", "all"]
    
    # Create score by similar divided by all
    rec_percentages["score"] = rec_percentages["similar"] / rec_percentages["all"]
    
    # Sort values of score
    rec_percentages = rec_percentages.sort_values("score", ascending=False)
    
    # Display top ten movies recommended by those who like similar movies in terms of score, title, & genres
    return rec_percentages.head(10).merge(movies, left_index=True, right_on="movieId")[["score","title","genres"]]

## Creating An Interactive Recommendation Widget For Movie Input

In [29]:
# Import the widgets and display libraries
import ipywidgets as widgets
from IPython.display import display

# Make input widget
movie_input = widgets.Text(
    value="Toy Story",   # initial value set to 'toy story'
    description="Movie Title:",
    disabled=False
)

# Make output widget
movie_list = widgets.Output()

# Create an on-type function
def on_type(data):
    with movie_list:
        movie_list.clear_output()
        title = data["new"]  # Grab a title from input widget
        if len(title) > 5:
            results = search(title)
            movie_id = results.iloc[0]["movieId"] # Extract movie id
            display(find_sim_movies(movie_id)) # find and display similar movies
            
# observe movie input
movie_input.observe(on_type, names='value')


## Please enter a movie title in the box below to get 10 similar movie recommendations

In [30]:
# display a widget input where user types in to get ten movie recommendations.
display(movie_input, movie_list)

Text(value='Toy Story', description='Movie Title:')

Output()

### References:

### Paruchuri, V. (2022, May 27).  *Movie Recommendation System With Python And Pandas: Data Project* [Video]. Dataquest  YouTube.  https://www.youtube.com/watch?v=eyEabQRBMQA 

### Jeong, Y. (2021, April 5).  Making a Content-Based Movie Recommender With Python.  *Geek Culture*.  https://medium.com/geekculture/creating-content-based-movie-recommender-with-python-7f7d1b739c63

### Stieber, B. (2018, December 31).  Recommending Songs Using Cosine Similarity in R.  *Deeper Data Digressions*.  https://bgstieber.github.io/post/recommending-songs-using-cosine-similarity-in-r/