### Tutorial Link
https://www.youtube.com/watch?v=eyEabQRBMQA&t=184s

In [1]:
import pandas as pd

movies = pd.read_csv("movies.csv")

In [2]:
movies

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy
...,...,...,...
62418,209157,We (2018),Drama
62419,209159,Window of the Soul (2001),Documentary
62420,209163,Bad Poems (2018),Comedy|Drama
62421,209169,A Girl Thing (2001),(no genres listed)


## Cleaning Movie Titles with Regex

In [3]:
import re
# re is Python's built-in regular expression module

def clean_title(title):
    # Remove every character that is not a letter, digit, or space
    return re.sub("[^A-Za-z0-9 ]", "", title)

In [4]:
movies["clean_title"] = movies["title"].apply(clean_title)
# This creates a new column in the data frame called clean_title by applying the clean_title function to every value in the title column
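
To see what the regex does, here is a small check on two titles that appear in this notebook's tables. It restates `clean_title` so the snippet runs on its own.

```python
import re

def clean_title(title):
    # Drop every character that is not an ASCII letter, digit, or space
    return re.sub("[^A-Za-z0-9 ]", "", title)

samples = ["Toy Story (1995)", "Nutty Professor, The (1996)"]
cleaned = [clean_title(t) for t in samples]
print(cleaned)
```

Note that the pattern keeps only ASCII letters, so accented characters in a title would be stripped as well.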

In [5]:
movies

Unnamed: 0,movieId,title,genres,clean_title
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,Toy Story 1995
1,2,Jumanji (1995),Adventure|Children|Fantasy,Jumanji 1995
2,3,Grumpier Old Men (1995),Comedy|Romance,Grumpier Old Men 1995
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance,Waiting to Exhale 1995
4,5,Father of the Bride Part II (1995),Comedy,Father of the Bride Part II 1995
...,...,...,...,...
62418,209157,We (2018),Drama,We 2018
62419,209159,Window of the Soul (2001),Documentary,Window of the Soul 2001
62420,209163,Bad Poems (2018),Comedy|Drama,Bad Poems 2018
62421,209169,A Girl Thing (2001),(no genres listed),A Girl Thing 2001


## Creating a Term Frequency Matrix
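Computers work with numbers, so we convert each title into a numeric vector through a process called vectorization. The first step is a term frequency matrix, which records how often each term appears in each title (for short titles, mostly 1s and 0s). The second step weights those counts by inverse document frequency, which down-weights terms that appear in many titles so that rarer, more distinctive terms count for more. `TfidfVectorizer` performs both steps.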

In [6]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(ngram_range = (1,2))
# An ngram is a series of adjacent items. With ngram_range = (1,2), we vectorize single words and also pairs of adjacent words, which makes the search more accurate

tfidf = vectorizer.fit_transform(movies["clean_title"])
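
As a rough illustration of what the vectorizer learns, the sketch below fits it on a hypothetical two-title corpus (`toy_titles` is made up for this example, not the real data) and inspects the resulting vocabulary and matrix shape.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical mini corpus, already cleaned
toy_titles = ["Toy Story 1995", "Jumanji 1995"]
toy_vectorizer = TfidfVectorizer(ngram_range=(1, 2))
toy_tfidf = toy_vectorizer.fit_transform(toy_titles)

# vocabulary_ maps each term (single word or word pair) to its column index
vocab = toy_vectorizer.vocabulary_
print(sorted(vocab))
print(toy_tfidf.shape)  # (number of titles, number of distinct terms)

# "1995" appears in both titles, so it is weighted lower than "jumanji"
jumanji_row = toy_tfidf[1].toarray()[0]
print(jumanji_row[vocab["jumanji"]] > jumanji_row[vocab["1995"]])
```

The vocabulary contains both single words such as `toy` and pairs such as `toy story`, and terms are lowercased by default.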

## Search Function
### Cosine Similarity
See: https://medium.com/@arjunprakash027/understanding-cosine-similarity-a-key-concept-in-data-science-72a0fcc57599.
Since the titles are now vectors, we can compare them with cosine similarity. It measures the angle between two vectors: the smaller the angle, the more closely they point in the same direction and the more similar the titles are. It's commonly used for document similarity. This is different from Euclidean distance, which measures the straight-line distance between two vectors. We don't use Euclidean distance because we don't care much about a vector's magnitude, just its general direction.
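
The difference between the two measures can be seen in a minimal sketch with hand-picked 2D vectors (the vectors and the helper name `cosine` are illustrative assumptions, not part of the tutorial).

```python
import numpy as np

def cosine(a, b):
    # Dot product divided by the product of the vector lengths
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

a = np.array([1.0, 2.0])
b = np.array([2.0, 4.0])   # same direction as a, twice the length
c = np.array([2.0, -1.0])  # perpendicular to a

cos_ab = cosine(a, b)                    # direction matches, so similarity is 1
cos_ac = cosine(a, c)                    # perpendicular, so similarity is 0
dist_ab = float(np.linalg.norm(a - b))   # Euclidean distance is still nonzero

print(cos_ab, cos_ac, dist_ab)
```

Here `a` and `b` count as identical under cosine similarity even though they are some distance apart, which is the behavior we want when only direction matters.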

In [7]:
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

def search(title):
    # Clean the query the same way the titles were cleaned, then vectorize it
    title = clean_title(title)
    query_vec = vectorizer.transform([title])
    similarity = cosine_similarity(query_vec, tfidf).flatten()
    # This finds the 5 most similar titles to our search term. argpartition moves the indices of the 5 largest scores to the end of the array without fully sorting it
    indices = np.argpartition(similarity, -5)[-5:]
    # Here, we look up the movies in our df using the indices, then reverse the results.
    results = movies.iloc[indices][::-1]
    return results
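
`np.argpartition` only guarantees that the five largest scores end up in the last five positions; it does not order them among themselves. If the best match should reliably come first, one option is to sort just those few indices afterwards. The helper name `top_k_sorted` below is hypothetical and is not part of the tutorial.

```python
import numpy as np

def top_k_sorted(similarity, k=5):
    # Hypothetical helper: indices of the k largest scores, best first
    k = min(k, len(similarity))
    top = np.argpartition(similarity, -k)[-k:]   # k largest, in arbitrary order
    order = np.argsort(similarity[top])[::-1]    # sort only those k, descending
    return top[order]

scores = np.array([0.1, 0.9, 0.3, 0.7, 0.5, 0.2])
best = top_k_sorted(scores, k=3)
print(best)
```

Sorting only `k` items keeps the cost close to that of `argpartition` alone, which matters when the similarity array has tens of thousands of entries.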

## Building the Search Widget

In [8]:
import ipywidgets as widgets
from IPython.display import display

movie_input = widgets.Text(
    value = "Toy Story",
    description="Movie Title:",
    disabled=False
)

movie_list = widgets.Output()

def on_type(data):
    with movie_list:
        movie_list.clear_output()
        title = data["new"]
        if len(title) > 5:
            display(search(title))
            
movie_input.observe(on_type, names='value')
display(movie_input, movie_list)

Text(value='Toy Story', description='Movie Title:')

Output()

## Recommendation System
When we enter a title, we want to find other users that liked that title and recommend movies they liked. This makes use of user-based collaborative filtering.

In [9]:
ratings = pd.read_csv("ratings.csv")

In [10]:
ratings

Unnamed: 0,userId,movieId,rating,timestamp
0,1,296,5.0,1147880044
1,1,306,3.5,1147868817
2,1,307,5.0,1147868828
3,1,665,5.0,1147878820
4,1,899,3.5,1147868510
...,...,...,...,...
25000090,162541,50872,4.5,1240953372
25000091,162541,55768,2.5,1240951998
25000092,162541,56176,2.0,1240950697
25000093,162541,58559,4.0,1240953434


In [11]:
ratings.dtypes

userId         int64
movieId        int64
rating       float64
timestamp      int64
dtype: object

In [12]:
# Here, we're looking for the unique userIds of users who watched the same movie and liked it (i.e. rated it 5.0)
movie_id = 3
similar_users = ratings[(ratings["movieId"] == movie_id) & (ratings["rating"] == 5)]["userId"].unique()

In [13]:
similar_users

array([    23,     58,    198, ..., 162315, 162321, 162377], dtype=int64)

In [14]:
# Here, we take the users who liked the same movie as us (by checking if they're in similar_users) and find the movies they liked, i.e. rated above 4
similar_user_recs = ratings[(ratings["userId"].isin(similar_users)) & (ratings["rating"] > 4)]["movieId"]

In [15]:
similar_user_recs

3864           3
3867          11
3872          36
3876          50
3878          58
            ... 
24968625    1059
24968627    1073
24968629    1356
24968630    1367
24968632    1405
Name: movieId, Length: 49704, dtype: int64
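
The two filtering steps above can be tried on a small, made-up ratings table. `toy_ratings` and its values are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical ratings: three users, three movies
toy_ratings = pd.DataFrame({
    "userId":  [1, 1, 2, 2, 3, 3],
    "movieId": [3, 10, 3, 20, 3, 10],
    "rating":  [5.0, 4.5, 5.0, 3.0, 2.0, 5.0],
})

movie_id = 3
# Each condition needs its own parentheses because & binds tighter than ==
liked_movie = (toy_ratings["movieId"] == movie_id) & (toy_ratings["rating"] == 5)
similar_users = toy_ratings[liked_movie]["userId"].unique()

# Keep rows from those users only, and only the movies they rated above 4
liked_by_similar = toy_ratings[
    toy_ratings["userId"].isin(similar_users) & (toy_ratings["rating"] > 4)
]["movieId"]

print(similar_users)
print(liked_by_similar.tolist())
```

User 3 rated movie 3 poorly, so their 5.0 rating for movie 10 is ignored, while the searched movie itself shows up once per similar user.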

Now, we want to narrow our results down to movies that more than 10% of the users similar to us liked.

In [16]:
similar_user_recs = similar_user_recs.value_counts() / len(similar_users)
similar_user_recs = similar_user_recs[similar_user_recs > .10]

In [17]:
similar_user_recs

movieId
3      1.000000
780    0.322425
356    0.321463
733    0.307026
260    0.301251
         ...   
339    0.113571
586    0.110683
135    0.102984
376    0.102984
17     0.100096
Name: count, Length: 63, dtype: float64
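
To make the normalization step concrete, here is a sketch on made-up data. The Series `liked`, the user count, and the 0.5 threshold are assumptions chosen so the toy numbers are easy to follow; the notebook itself uses 0.10.

```python
import pandas as pd

# Hypothetical movieIds liked by 3 similar users (one entry per liked rating)
liked = pd.Series([3, 10, 3, 20, 10, 3, 30], name="movieId")
n_similar_users = 3

# value_counts gives how many similar users liked each movie;
# dividing by the number of similar users turns counts into shares
share = liked.value_counts() / n_similar_users

# Keep only movies liked by more than half of the similar users
kept = share[share > 0.5]
print(kept)
```

Movie 3 has a share of 1.0 for the same reason it does in the real output: every similar user liked it by definition.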

## Finding out how much all users like movies
Some of these movies are specific to our similar users, while others are simply popular with everyone. So next we find what percentage of all users like these movies too. In essence, we don't just want to return all the movies the similar users like; we want the movies that are distinctive to people who liked the one you're searching for. I think this deals with the bias that might otherwise make the system keep recommending popular movies for popularity's sake.

In [18]:
# Finding every rating above 4, from any user, for a movie in our recommendations
all_users = ratings[(ratings["movieId"].isin(similar_user_recs.index)) & (ratings["rating"] > 4)]

In [19]:
all_users

Unnamed: 0,userId,movieId,rating,timestamp
0,1,296,5.0,1147880044
72,2,110,5.0,1141416589
76,2,260,5.0,1141417172
79,2,318,5.0,1141417181
81,2,349,4.5,1141417045
...,...,...,...,...
24999916,162541,50,5.0,1240953428
24999923,162541,260,5.0,1240952836
24999934,162541,527,4.5,1240953464
24999957,162541,1196,5.0,1240952840


In [20]:
# For each movie in similar_user_recs, find the fraction of these users (everyone who rated at least one of those movies above 4) who liked it
all_users_recs = all_users["movieId"].value_counts() / len(all_users["userId"].unique())

In [21]:
all_users_recs

movieId
318    0.384189
296    0.319585
356    0.264118
593    0.253613
260    0.249457
         ...   
788    0.008958
3      0.008646
5      0.007196
376    0.006052
135    0.004044
Name: count, Length: 63, dtype: float64

In [22]:
rec_percentages = pd.concat([similar_user_recs, all_users_recs], axis=1)
rec_percentages.columns = ["similar", "all"]
# This gives us a table indexed by movie id, showing the share of users similar to us who liked each movie and the share of all users who liked it

In [23]:
rec_percentages

movieId,similar,all
3,1.000000,0.008646
780,0.322425,0.060693
356,0.321463,0.264118
733,0.307026,0.051021
260,0.301251,0.249457
...,...,...
339,0.113571,0.026481
586,0.110683,0.023083
135,0.102984,0.004044
376,0.102984,0.006052


We want movies with a high ratio between these two numbers, because that shows the movie is especially liked by users similar to us rather than just being popular with everyone.

In [24]:
rec_percentages["score"] = rec_percentages["similar"] / rec_percentages["all"]

In [25]:
rec_percentages = rec_percentages.sort_values("score", ascending=False)

In [26]:
rec_percentages

movieId,similar,all,score
3,1.000000,0.008646,115.659501
135,0.102984,0.004044,25.464219
5,0.169394,0.007196,23.538717
788,0.165544,0.008958,18.479358
376,0.102984,0.006052,17.017856
...,...,...,...
318,0.227141,0.384189,0.591224
858,0.135707,0.235815,0.575482
1196,0.118383,0.211089,0.560821
50,0.118383,0.225103,0.525907
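
As a quick sanity check on the score, the sketch below recomputes it for two movies using the rounded shares printed earlier in this notebook. Because the inputs are rounded, the results differ slightly from the score column above.

```python
# (similar, all) shares copied from the rounded rec_percentages output
shares = {
    135: (0.102984, 0.004044),  # Down Periscope (1996): niche, but loved by similar users
    356: (0.321463, 0.264118),  # liked by many similar users, but also by everyone else
}

scores = {
    movie_id: similar / overall
    for movie_id, (similar, overall) in shares.items()
}

for movie_id, score in scores.items():
    print(movie_id, round(score, 2))
```

Movie 356 has the higher `similar` share, yet its score stays close to 1 because the general audience likes it almost as often, while movie 135 scores far higher.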


In [27]:
# Now, we're going to use this data to retrieve the titles
rec_percentages.head(10).merge(movies, left_index = True, right_on="movieId")

Unnamed: 0,similar,all,score,movieId,title,genres,clean_title
2,1.0,0.008646,115.659501,3,Grumpier Old Men (1995),Comedy|Romance,Grumpier Old Men 1995
133,0.102984,0.004044,25.464219,135,Down Periscope (1996),Comedy,Down Periscope 1996
4,0.169394,0.007196,23.538717,5,Father of the Bride Part II (1995),Comedy,Father of the Bride Part II 1995
772,0.165544,0.008958,18.479358,788,"Nutty Professor, The (1996)",Comedy|Fantasy|Romance|Sci-Fi,Nutty Professor The 1996
371,0.102984,0.006052,17.017856,376,"River Wild, The (1994)",Action|Thriller,River Wild The 1994
489,0.14437,0.009174,15.736987,494,Executive Decision (1996),Action|Adventure|Thriller,Executive Decision 1996
770,0.154957,0.009925,15.613134,786,Eraser (1996),Action|Drama|Thriller,Eraser 1996
93,0.145332,0.011189,12.989305,95,Broken Arrow (1996),Action|Adventure|Thriller,Broken Arrow 1996
642,0.12897,0.01069,12.064002,653,Dragonheart (1996),Action|Adventure|Fantasy,Dragonheart 1996
786,0.180943,0.015292,11.832296,802,Phenomenon (1996),Drama|Romance,Phenomenon 1996
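
The merge above joins the index of one table against a column of another. Here is a minimal sketch of that pattern on made-up frames (`scores` and `toy_movies` are hypothetical, with titles borrowed from the output above).

```python
import pandas as pd

# Hypothetical score table, indexed by movie id and already sorted by score
scores = pd.DataFrame({"score": [25.0, 2.0]}, index=[135, 3])

toy_movies = pd.DataFrame({
    "movieId": [3, 5, 135],
    "title": [
        "Grumpier Old Men (1995)",
        "Father of the Bride Part II (1995)",
        "Down Periscope (1996)",
    ],
})

# Match the left table's index against the right table's movieId column
merged = scores.merge(toy_movies, left_index=True, right_on="movieId")
print(merged)
```

The default inner join drops movie 5, which has no score, and keeps the rows in the left table's order. The row labels of the result come from `toy_movies`, which is why the first column of the notebook output shows positions in `movies` rather than movie ids.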


In [28]:
def find_similar_movies(movie_id):
    # STEP 1: Find users similar to us
    similar_users = ratings[(ratings["movieId"] == movie_id) & (ratings["rating"] > 4)]["userId"].unique()
    similar_user_recs = ratings[(ratings["userId"].isin(similar_users)) & (ratings["rating"] > 4)]["movieId"]
   
    # STEP 2: Adjust so we only keep recs that more than 10% of similar users liked
    similar_user_recs = similar_user_recs.value_counts() / len(similar_users)
    similar_user_recs = similar_user_recs[similar_user_recs > .10]
    
    # STEP 3: Finding how common the recommendations were amongst all users
    all_users = ratings[(ratings["movieId"].isin(similar_user_recs.index)) & (ratings["rating"] > 4)]
    all_users_recs = all_users["movieId"].value_counts() / len(all_users["userId"].unique())
    
    # STEP 4: Creating our similarity score & sorting
    rec_percentages = pd.concat([similar_user_recs, all_users_recs], axis=1)
    rec_percentages.columns = ["similar", "all"]
    rec_percentages["score"] = rec_percentages["similar"] / rec_percentages["all"]
    rec_percentages = rec_percentages.sort_values("score", ascending=False)
    
    return rec_percentages.head(10).merge(movies, left_index = True, right_on="movieId")[["score", "title", "genres"]]
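
To try the whole pipeline without loading the full dataset, one option is a variant that takes the data frames as arguments. The function name `find_similar_movies_from` and the toy tables below are assumptions for illustration; the steps mirror `find_similar_movies`.

```python
import pandas as pd

def find_similar_movies_from(ratings, movies, movie_id):
    # Hypothetical variant of find_similar_movies with the data passed in
    similar_users = ratings[(ratings["movieId"] == movie_id) & (ratings["rating"] > 4)]["userId"].unique()
    similar_recs = ratings[(ratings["userId"].isin(similar_users)) & (ratings["rating"] > 4)]["movieId"]
    similar_recs = similar_recs.value_counts() / len(similar_users)
    similar_recs = similar_recs[similar_recs > .10]

    all_users = ratings[(ratings["movieId"].isin(similar_recs.index)) & (ratings["rating"] > 4)]
    all_recs = all_users["movieId"].value_counts() / len(all_users["userId"].unique())

    rec = pd.concat([similar_recs, all_recs], axis=1)
    rec.columns = ["similar", "all"]
    rec["score"] = rec["similar"] / rec["all"]
    rec = rec.sort_values("score", ascending=False)
    return rec.head(10).merge(movies, left_index=True, right_on="movieId")[["score", "title"]]

# Made-up data: users 1 and 2 like movie 1; movie 3 is popular with everyone else
toy_ratings = pd.DataFrame({
    "userId":  [1, 1, 2, 2, 2, 3, 3, 4, 5],
    "movieId": [1, 2, 1, 2, 3, 3, 1, 3, 2],
    "rating":  [5.0, 5.0, 5.0, 4.5, 5.0, 5.0, 2.0, 5.0, 5.0],
})
toy_movies = pd.DataFrame({"movieId": [1, 2, 3], "title": ["Movie A", "Movie B", "Movie C"]})

result = find_similar_movies_from(toy_ratings, toy_movies, movie_id=1)
print(result)
```

In this toy data, Movie C ends up last even though it has the most likes overall, because those likes mostly come from users who did not like Movie A.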

## Building the Widget

In [29]:
movie_name_input = widgets.Text(
    value="Toy Story",
    description="Movie Title:",
    disabled = False
)

recommendation_list = widgets.Output()

def on_type(data):
    with recommendation_list:
        recommendation_list.clear_output()
        title = data["new"]
        if len(title) > 5:
            results = search(title)
            movie_id = results.iloc[0]["movieId"]
            display(find_similar_movies(movie_id))
            
movie_name_input.observe(on_type, names="value")
display(movie_name_input, recommendation_list)

Text(value='Toy Story', description='Movie Title:')

Output()

Next steps: ask users to specify genres and use metadata to improve recommendations.