# Introduction

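This notebook builds a simple movie recommendation system in three steps: a title search based on a TF-IDF matrix, a recommender that scores movies by how much more the users with similar taste liked them than users overall, and interactive widgets that tie the two together.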

## Import and clean data

In [1]:
import pandas as pd

movies = pd.read_csv("movies.csv")
movies

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy
...,...,...,...
62418,209157,We (2018),Drama
62419,209159,Window of the Soul (2001),Documentary
62420,209163,Bad Poems (2018),Comedy|Drama
62421,209169,A Girl Thing (2001),(no genres listed)


The dataset contains over 62K movies, with ID, title and genres as features.

Each title includes the release year in parentheses. Parentheses and other special characters (such as apostrophes) make searching harder, so we will strip them with a regular expression.

Genres are separated by | and some movies are marked "(no genres listed)". We will get back to this later.

In [5]:
import re

def clean_title(title):
    #Remove every character that is not a letter (a-z, A-Z), a digit (0-9) or a space
    title = re.sub("[^a-zA-Z0-9 ]", "", title)
    return title
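
To see what the regex does in practice, here is a small sketch that applies the same pattern to a few titles from the dataset. The sample list is only illustrative. Note that removing a character that sits between two spaces (such as the ampersand) leaves a double space behind.

```python
import re

def clean_title(title):
    # Drop every character that is not a letter, a digit or a space
    return re.sub("[^a-zA-Z0-9 ]", "", title)

samples = ["Toy Story (1995)", "Bug's Life, A (1998)", "Batman & Bill (2017)"]
cleaned = [clean_title(t) for t in samples]

for before, after in zip(samples, cleaned):
    print(f"{before!r} -> {after!r}")
```

Because the pattern keeps only ASCII letters, accented characters in a title would be dropped as well.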

In [6]:
movies["clean_title"] = movies["title"].apply(clean_title)
movies

Unnamed: 0,movieId,title,genres,clean_title
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,Toy Story 1995
1,2,Jumanji (1995),Adventure|Children|Fantasy,Jumanji 1995
2,3,Grumpier Old Men (1995),Comedy|Romance,Grumpier Old Men 1995
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance,Waiting to Exhale 1995
4,5,Father of the Bride Part II (1995),Comedy,Father of the Bride Part II 1995
...,...,...,...,...
62418,209157,We (2018),Drama,We 2018
62419,209159,Window of the Soul (2001),Documentary,Window of the Soul 2001
62420,209163,Bad Poems (2018),Comedy|Drama,Bad Poems 2018
62421,209169,A Girl Thing (2001),(no genres listed),A Girl Thing 2001


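## Search Algorithm: TF-IDF Matrix
TF-IDF (term frequency-inverse document frequency) turns each title into a vector with one entry per term. A term's weight grows with how often it appears in the title and shrinks with the number of titles that contain it, so rare, distinctive words count for more than common ones. Once every title is a vector, titles can be compared numerically.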

In [8]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [15]:
#Also consider n-grams, i.e. sequences of up to 2 consecutive words (ngram_range=(1,2))
vectorizer = TfidfVectorizer(ngram_range=(1,2))

tfidf = vectorizer.fit_transform(movies["clean_title"])
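
To make the weighting concrete, here is a rough pure-Python sketch of a TF-IDF computation over unigrams and bigrams. It is meant to follow scikit-learn's documented defaults (lowercasing, tokens of at least two characters, smoothed IDF, L2 normalization), but it is an illustration only and has not been checked against the real vectorizer's output. The helper names `ngrams` and `tfidf_vectors` and the three sample titles are assumptions made for this example.

```python
import math
import re
from collections import Counter

def ngrams(text, max_n=2):
    # Lowercase, keep tokens of 2+ word characters, then build 1..max_n-grams
    tokens = re.findall(r"\b\w\w+\b", text.lower())
    grams = []
    for n in range(1, max_n + 1):
        grams += [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return grams

def tfidf_vectors(docs):
    doc_terms = [Counter(ngrams(d)) for d in docs]
    n_docs = len(docs)
    # Document frequency: in how many titles each term occurs
    df = Counter(term for terms in doc_terms for term in terms)
    # Smoothed inverse document frequency: rarer terms get larger weights
    idf = {t: math.log((1 + n_docs) / (1 + c)) + 1 for t, c in df.items()}
    vectors = []
    for terms in doc_terms:
        weights = {t: tf * idf[t] for t, tf in terms.items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        vectors.append({t: w / norm for t, w in weights.items()})
    return vectors, idf

titles = ["Toy Story 1995", "Toy Story 2 1999", "Jumanji 1995"]
vectors, idf = tfidf_vectors(titles)

print(sorted(vectors[1]))                      # terms kept for the second title
print(round(idf["toy"], 3), round(idf["jumanji"], 3))
```

In this toy corpus "jumanji" occurs in one title and "toy" in two, so "jumanji" receives the larger IDF weight. The single-character token "2" is dropped by the tokenizer, which is worth keeping in mind for sequels.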

Now we need a function that compares the search query we enter to all the titles. We will use cosine similarity to compare the vectors.
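
As a quick illustration of the measure itself, cosine similarity is the dot product of two vectors divided by the product of their lengths, so it depends only on direction. The sketch below uses hand-made three-dimensional vectors (assumed values, not real TF-IDF output) and a hypothetical helper `cosine`.

```python
import math

def cosine(u, v):
    # Dot product divided by the product of the vector lengths
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

query = [1.0, 0.0, 0.0]
same_direction = [2.0, 0.0, 0.0]   # same terms, different magnitude
partial = [1.0, 1.0, 0.0]          # shares one term with the query
unrelated = [0.0, 0.0, 3.0]        # shares no terms with the query

for name, vec in [("same", same_direction), ("partial", partial), ("unrelated", unrelated)]:
    print(name, round(cosine(query, vec), 4))
```

A score of 1 means the vectors point the same way, and 0 means they share no terms.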

In [19]:
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

def search(title):
    #Clean the title and vectorize it
    title = clean_title(title)
    query_vec = vectorizer.transform([title])
    #Compare to our titles, using cosine similarity
    similarity = cosine_similarity(query_vec, tfidf).flatten()
    #similarity now holds one cosine similarity score per title
    #Get the indices of the 5 most similar titles (argpartition does not sort them)
    indices = np.argpartition(similarity, -5)[-5:]
    #Reverse the selection (the order within the top 5 is not guaranteed)
    results = movies.iloc[indices].iloc[::-1]
    return results
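
`np.argpartition` guarantees only that the k largest values end up in the last k positions; their order within that slice is unspecified. If a strict best-first ranking is wanted, one option is to sort just those k candidates afterwards. A minimal sketch, where `top_k_sorted` and the sample scores are hypothetical and not part of the notebook:

```python
import numpy as np

def top_k_sorted(scores, k=5):
    scores = np.asarray(scores)
    k = min(k, len(scores))
    # argpartition moves the k largest scores into the last k slots (unordered)
    candidates = np.argpartition(scores, -k)[-k:]
    # Sort only those k candidates, highest score first
    return candidates[np.argsort(scores[candidates])[::-1]]

scores = np.array([0.10, 0.90, 0.30, 0.70, 0.50, 0.20, 0.80])
top = top_k_sorted(scores, k=3)
print(top, scores[top])
```

Sorting only k items keeps the cost low even when the similarity vector has tens of thousands of entries.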

In [21]:
search('batman')

Unnamed: 0,movieId,title,genres,clean_title
8614,26152,Batman (1966),Action|Adventure|Comedy,Batman 1966
17357,91054,Batman (1943),Action|Adventure|Crime|Sci-Fi|Thriller,Batman 1943
584,592,Batman (1989),Action|Crime|Thriller,Batman 1989
53229,186985,Batman Ninja (2018),Action|Animation,Batman Ninja 2018
46251,172067,Batman & Bill (2017),Documentary,Batman Bill 2017


## Creating our search widget

In [22]:
import ipywidgets as widgets
from IPython.display import display

In [23]:
#Generating widget
movie_input = widgets.Text(
    value='',
    description='Movie Title:',
    disabled=False
)

In [24]:
movie_input

Text(value='', description='Movie Title:')

The widget doesn't do much on its own. We need to connect it to an output widget, i.e. a function that takes the text typed into the input widget and displays the result.

In [25]:
#Generating input widget
movie_input = widgets.Text(
    value='',
    description='Movie Title:',
    disabled=False
)

movie_list = widgets.Output()

#Callback that runs whenever the text in the input widget changes
def on_type(data):
    with movie_list:
        movie_list.clear_output()
        title = data["new"]
        if len(title) > 5:
            #call our search function
            display(search(title))

#Connect input widget to on_type function.
movie_input.observe(on_type, names='value')


display(movie_input, movie_list)

Text(value='', description='Movie Title:')

Output()
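
The callback logic can be tried out without a running widget. The sketch below assumes, as the notebook code does, that the observer receives a dict-like change object whose "new" entry holds the updated text. `make_handler` and `fake_search` are hypothetical stand-ins introduced only for this example.

```python
def make_handler(search_fn, min_chars=5):
    """Build a callback that searches once the text is longer than min_chars."""
    shown = []

    def on_type(change):
        title = change["new"]
        # Same condition as in the notebook: only search for longer inputs
        if len(title) > min_chars:
            shown.append(search_fn(title))

    return on_type, shown

def fake_search(title):
    # Stand-in for the real search() so the example is self-contained
    return f"results for {title!r}"

handler, shown = make_handler(fake_search)
for text in ["Bat", "Batma", "Batman"]:
    handler({"name": "value", "old": "", "new": text})

print(shown)
```

One consequence of the `len(title) > 5` condition is that a title of five characters or fewer never triggers a search.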

## Recommendation System

To build recommendations, we need rating data. We will use the ratings file for this.

In [26]:
ratings = pd.read_csv('ratings.csv')
ratings

Unnamed: 0,userId,movieId,rating,timestamp
0,1,296,5.0,1147880044
1,1,306,3.5,1147868817
2,1,307,5.0,1147868828
3,1,665,5.0,1147878820
4,1,899,3.5,1147868510
...,...,...,...,...
25000090,162541,50872,4.5,1240953372
25000091,162541,55768,2.5,1240951998
25000092,162541,56176,2.0,1240950697
25000093,162541,58559,4.0,1240953434


In [29]:
ratings.dtypes

userId         int64
movieId        int64
rating       float64
timestamp      int64
dtype: object

All columns are numeric (integers or floats).

### Finding How Much Similar Users Liked Movies

Our system will find all the users who liked the movie we searched for, then look at the other movies they liked to create our prediction.
That is, we are assuming that people who like the same movie as us generally have similar taste in movies.

For this we will use a boolean mask. We filter the ratings down to those for the movie we select (for now hard-coded as movie_id), then keep only the users who gave it a rating of at least x (hard-coded to 4). From this we get all our unique users.

In [32]:
movie_id = 1

In [33]:
similar_users = ratings[(ratings['movieId'] == movie_id) & (ratings['rating'] >= 4)]['userId'].unique()
similar_users

array([     3,      5,      8, ..., 162530, 162533, 162534])
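
A boolean mask is easier to follow on a tiny table. The sketch below builds a made-up six-row ratings frame (the values are invented for illustration) and applies the same two conditions.

```python
import pandas as pd

ratings = pd.DataFrame({
    "userId":  [1, 1, 2, 2, 3, 3],
    "movieId": [1, 2, 1, 3, 1, 2],
    "rating":  [5.0, 3.0, 4.0, 4.5, 2.0, 5.0],
})
movie_id = 1

# True only for rows that are about movie_id AND rated at least 4
mask = (ratings["movieId"] == movie_id) & (ratings["rating"] >= 4)
similar_users = ratings[mask]["userId"].unique()

print(mask.tolist())
print(similar_users)
```

User 3 also rated movie 1, but with 2.0, so only users 1 and 2 count as similar here.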

Now we want to see the other movies they rated at least 4 stars:

In [62]:
similar_user_recs = ratings[(ratings['userId'].isin(similar_users)) & (ratings['rating'] >= 4)]['movieId']
similar_user_recs

254              1
255             29
256             32
257             50
258            111
             ...  
24999332    166643
24999342    171763
24999348    177593
24999351    177765
24999378    198609
Name: movieId, Length: 5101989, dtype: int64

What joins the datasets is the movie ID. There are a lot of movies, so to narrow them down we keep only the movies that more than 10% of the users similar to us liked.

In [63]:
# Count how often each movie appears, then divide by the number
# of similar users to get the share of them who liked it.
similar_user_recs = similar_user_recs.value_counts() / len(similar_users)

similar_user_recs = similar_user_recs[similar_user_recs > .10]

In [64]:
similar_user_recs

1       1.000000
318     0.549604
260     0.531518
356     0.517224
296     0.495744
          ...   
235     0.101249
1242    0.100931
1907    0.100772
3527    0.100613
2761    0.100135
Name: movieId, Length: 273, dtype: float64
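
The same counting step can be sketched in plain Python. The `liked` mapping below is invented toy data, and the threshold is raised to 0.30 only because four users are too few for a 10% cut to remove anything.

```python
from collections import Counter

# Hypothetical data: each similar user -> the movies they rated at least 4
liked = {
    "u1": [1, 318, 260],
    "u2": [1, 318],
    "u3": [1, 260, 356],
    "u4": [1],
}

# How many similar users liked each movie
counts = Counter(m for movies in liked.values() for m in movies)
# Turn counts into the share of similar users
share = {m: c / len(liked) for m, c in counts.items()}

threshold = 0.30  # the notebook uses 0.10; raised here for the toy data
popular = {m: s for m, s in share.items() if s > threshold}
print(popular)
```

Movie 1 has a share of 1.0 by construction, since every similar user was selected for liking it.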

### Finding How Much All Users Liked Movies

We need to distinguish movies that people like because their taste is similar to ours from movies that are liked by almost everyone. For example, (maybe) everybody likes Harry Potter, and not because they are fantasy fans.

In [65]:
all_users = ratings[(ratings["movieId"].isin(similar_user_recs.index)) & (ratings["rating"] > 4)]

In [66]:
all_users

Unnamed: 0,userId,movieId,rating,timestamp
0,1,296,5.0,1147880044
19,1,2692,5.0,1147869100
23,1,3949,5.0,1147868678
29,1,4973,4.5,1147869080
37,1,6016,5.0,1147869090
...,...,...,...,...
25000065,162541,5952,5.0,1240952617
25000077,162541,7147,4.5,1240952343
25000078,162541,7153,5.0,1240952613
25000081,162541,7361,4.5,1240953484


In [67]:
all_user_recs = all_users["movieId"].value_counts() / len(all_users["userId"].unique())

We want the difference between how much everybody liked each movie and how much the users similar to us liked it. So we will compare the percentages.

In [69]:
rec_percentages = pd.concat([similar_user_recs, all_user_recs], axis=1)
rec_percentages.columns = ["similar", 'all']
rec_percentages

Unnamed: 0,similar,all
1,1.000000,0.120850
2,0.105598,0.017112
6,0.162879,0.047294
10,0.122623,0.021405
11,0.101408,0.021116
...,...,...
91529,0.120422,0.053184
99114,0.112732,0.055417
109487,0.117426,0.071740
112852,0.102681,0.041609


In [70]:
# Score = ratio of similar user likes movie to how much everyone likes movie
rec_percentages["score"] = rec_percentages["similar"] / rec_percentages["all"]

In [71]:
rec_percentages = rec_percentages.sort_values("score", ascending=False)
rec_percentages

Unnamed: 0,similar,all,score
1,1.000000,0.120850,8.274754
2355,0.191095,0.024311,7.860413
648,0.187382,0.028527,6.568707
440,0.104537,0.016509,6.332170
3114,0.328914,0.052036,6.320939
...,...,...,...
858,0.355883,0.203523,1.748618
2959,0.351826,0.209977,1.675543
318,0.549604,0.331577,1.657542
79132,0.209870,0.127298,1.648656
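
As a worked check of the score, the sketch below recomputes the ratio for two rows using the rounded values printed in the table, so the results match the notebook only to about three decimals.

```python
# movieId: (share among similar users, share among all users), from the table
rows = {
    2355: (0.191095, 0.024311),
    318: (0.549604, 0.331577),
}

# Score = how many times more common the movie is among similar users
scores = {movie_id: similar / everyone for movie_id, (similar, everyone) in rows.items()}
ranked = sorted(scores, key=scores.get, reverse=True)

for movie_id in ranked:
    print(movie_id, round(scores[movie_id], 3))
```

Movie 318 is liked by about 55% of similar users, but also by about 33% of everyone, so its score is low. Movie 2355 is liked by fewer similar users yet is far more specific to them.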


To decipher the IDs, we will merge with our (semi-original) movies dataframe.

In [72]:
rec_percentages.head(10).merge(movies, left_index=True, right_on="movieId")

Unnamed: 0,similar,all,score,movieId,title,genres,clean_title
0,1.0,0.12085,8.274754,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,Toy Story 1995
2264,0.191095,0.024311,7.860413,2355,"Bug's Life, A (1998)",Adventure|Animation|Children|Comedy,Bugs Life A 1998
637,0.187382,0.028527,6.568707,648,Mission: Impossible (1996),Action|Adventure|Mystery|Thriller,Mission Impossible 1996
435,0.104537,0.016509,6.33217,440,Dave (1993),Comedy|Romance,Dave 1993
3021,0.328914,0.052036,6.320939,3114,Toy Story 2 (1999),Adventure|Animation|Children|Comedy|Fantasy,Toy Story 2 1999
3650,0.128378,0.020756,6.184955,3751,Chicken Run (2000),Animation|Children|Comedy,Chicken Run 2000
584,0.200642,0.03244,6.184933,592,Batman (1989),Action|Crime|Thriller,Batman 1989
1,0.105598,0.017112,6.170978,2,Jumanji (1995),Adventure|Children|Fantasy,Jumanji 1995
2705,0.152139,0.024863,6.119119,2797,Big (1988),Comedy|Drama|Fantasy|Romance,Big 1988
2895,0.15129,0.024882,6.08028,2987,Who Framed Roger Rabbit? (1988),Adventure|Animation|Children|Comedy|Crime|Fant...,Who Framed Roger Rabbit 1988
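
The merge pairs the index of the score table (which holds movie IDs) with the movieId column of the movies table. A small sketch with made-up frames, assuming the default inner join:

```python
import pandas as pd

# Scores indexed by movie ID, like rec_percentages
scores = pd.DataFrame({"score": [8.27, 7.86]}, index=[1, 2355])

# A tiny stand-in for the movies dataframe
movies = pd.DataFrame({
    "movieId": [1, 2, 2355],
    "title": ["Toy Story (1995)", "Jumanji (1995)", "Bug's Life, A (1998)"],
})

# Match the left frame's index against the right frame's movieId column
merged = scores.merge(movies, left_index=True, right_on="movieId")
print(merged[["movieId", "title", "score"]])
```

Movies without a score (Jumanji in this toy data) drop out, because an inner join keeps only IDs present in both frames.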


## Recommendation Widget

Put everything into a single function. Note that it treats a movie as liked only when the rating is strictly above 4.

In [81]:
def find_similar_movies(movie_id):
    #Finding recommendations from similar users
    similar_users = ratings[(ratings["movieId"] == movie_id) & (ratings["rating"] > 4)]["userId"].unique()
    similar_user_recs = ratings[(ratings["userId"].isin(similar_users)) & (ratings["rating"] > 4)]["movieId"]
    similar_user_recs = similar_user_recs.value_counts() / len(similar_users)

    #Movies that more than 10% of users similar to us liked 
    similar_user_recs = similar_user_recs[similar_user_recs > .10]
    
    #Liking of all users for movies
    all_users = ratings[(ratings["movieId"].isin(similar_user_recs.index)) & (ratings["rating"] > 4)]
    all_user_recs = all_users["movieId"].value_counts() / len(all_users["userId"].unique())
    
    #Comparing how much similar users and all users liked each movie
    rec_percentages = pd.concat([similar_user_recs, all_user_recs], axis=1)
    rec_percentages.columns = ["similar", "all"]
    
    #Generating scores and sorting by them
    rec_percentages["score"] = rec_percentages["similar"] / rec_percentages["all"]
    rec_percentages = rec_percentages.sort_values("score", ascending=False)
    
    return rec_percentages.head(10).merge(movies, left_index=True, right_on="movieId")[["score", "title", "genres"]]

In [83]:
movie_name_input = widgets.Text(
    value='Toy Story',
    description='Movie Title:',
    disabled=False
)
recommendation_list = widgets.Output()

def on_type(data):
    with recommendation_list:
        recommendation_list.clear_output()
        title = data["new"]
        if len(title) > 5:
            results = search(title)
            movie_id = results.iloc[0]["movieId"]
            display(find_similar_movies(movie_id))

movie_name_input.observe(on_type, names='value')

In [84]:
display(movie_name_input, recommendation_list)

Text(value='Toy Story', description='Movie Title:')

Output()

## Next steps

- Improve recommendation logic

- Add genre data to the decision, or allow the user to select a genre