# Research on Algorithms and Notes

Reference: 
The algorithm implemented in this project follows the CodeHeroku Intro ML video series by Mihir Thakkar, the Kaggle Movie Recommender article by Rounak Banik, and several other YouTube videos. The dataset comes from Kaggle.com.
(Note: I originally used a much larger dataset from MovieLens, but my code ran very slowly on it, so I switched to a smaller dataset.)

## Recommendation System General Notes 
There are three types of movie recommendation systems: rating-based, content-based, and collaborative-based systems.
### 1. Rating-based 
A rating-based filter focuses on ratings and popularity. The system usually keeps track of click counts (and/or movie ratings) so that it always recommends the most trending movies, typically ordered from the most popular to the least popular.
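As a rough illustration of the idea, here is a minimal sketch that ranks movies purely by a popularity score. The toy data and the helper name `most_popular` are assumptions made for this example; only the column names mirror the dataset used later in this notebook.

```python
import pandas as pd

# Hypothetical toy data (not taken from the real dataset).
movies = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "popularity": [12.5, 98.1, 45.0, 3.2],
    "vote_average": [7.9, 6.4, 8.1, 5.0],
})

def most_popular(frame, n=3):
    """Return the n titles with the highest popularity score."""
    ranked = frame.sort_values("popularity", ascending=False)
    return ranked["title"].head(n).tolist()

top_titles = most_popular(movies)
print(top_titles)
```

Every user gets the same list here, which is the main limitation of a purely rating-based system.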
### 2. Content-based
A content-based filter focuses on the similarity of content. It takes the movies that a user currently likes as input and analyzes features such as genre, cast, and director to find other movies with similar content, which it then recommends to the user.
### 3. Collaborative-Based
This type of system takes similar users as input, grouping people by their preferences. If several users are classified as similar to each other (for example, they watch movies by the same director or of the same genre), the system checks whether one user has not watched a movie that several other "similar" users have watched, and recommends that movie to this user.
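This project does not implement collaborative filtering, but the idea can be sketched on a tiny, made-up ratings matrix. Everything below (the ratings, the titles, and the helper names `cosine` and `recommend_for`) is hypothetical and only meant to show the mechanism: find the users most similar to the target user, then suggest a movie they rated highly that the target user has not watched.

```python
import numpy as np

# Hypothetical ratings: rows are users, columns are movies, 0 = not watched.
titles = ["Movie A", "Movie B", "Movie C", "Movie D"]
ratings = np.array([
    [5, 4, 0, 1],   # target user (has not watched Movie C)
    [5, 5, 4, 1],   # similar taste
    [4, 5, 5, 0],   # similar taste
    [1, 0, 2, 5],   # different taste
], dtype=float)

def cosine(u, v):
    """Cosine similarity between two rating vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def recommend_for(user, k=2):
    """Suggest one unwatched movie based on the k most similar users."""
    # Similarity of the target user to every other user (-1 excludes the user itself).
    sims = np.array([
        cosine(ratings[user], ratings[other]) if other != user else -1.0
        for other in range(len(ratings))
    ])
    neighbours = np.argsort(sims)[::-1][:k]
    # Average the neighbours' ratings, then ignore movies the user already watched.
    scores = ratings[neighbours].mean(axis=0)
    scores[ratings[user] > 0] = -np.inf
    return titles[int(np.argmax(scores))]

suggestion = recommend_for(0)
print(suggestion)
```

A real system would also need to handle users with no ratings and would usually weight neighbours by their similarity; both are left out to keep the sketch short.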

#### Final Work done:
1. First, I implemented a content-based system: 
    Features accounted for are: 
    * `director`
    * `genre`
    * `cast`
    * `keywords`
2. To improve the content-based recommendation system, I added the average rating (`vote_average`) as a second layer in the sorting process, so the most similar movies are re-ranked by rating before the final recommendation.
3. Implemented a new feature to recommend the top rated movies by `genre`
4. Made a UI mock-up hosted on figma: https://www.figma.com/proto/LuHyG0h47FUhRasBUEhpN4/HAI?node-id=1%3A210&scaling=scale-down

In [80]:
# Package Import 
# general packages
import pandas as pd  
import numpy as np  

In [81]:
# read dataset
df = pd.read_csv("movie_dataset.csv")
df.head()

Unnamed: 0,index,budget,genres,homepage,id,keywords,original_language,original_title,overview,popularity,...,runtime,spoken_languages,status,tagline,title,vote_average,vote_count,cast,crew,director
0,0,237000000,Action Adventure Fantasy Science Fiction,http://www.avatarmovie.com/,19995,culture clash future space war space colony so...,en,Avatar,"In the 22nd century, a paraplegic Marine is di...",150.437577,...,162.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}, {""iso...",Released,Enter the World of Pandora.,Avatar,7.2,11800,Sam Worthington Zoe Saldana Sigourney Weaver S...,"[{'name': 'Stephen E. Rivkin', 'gender': 0, 'd...",James Cameron
1,1,300000000,Adventure Fantasy Action,http://disney.go.com/disneypictures/pirates/,285,ocean drug abuse exotic island east india trad...,en,Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha...",139.082615,...,169.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"At the end of the world, the adventure begins.",Pirates of the Caribbean: At World's End,6.9,4500,Johnny Depp Orlando Bloom Keira Knightley Stel...,"[{'name': 'Dariusz Wolski', 'gender': 2, 'depa...",Gore Verbinski
2,2,245000000,Action Adventure Crime,http://www.sonypictures.com/movies/spectre/,206647,spy based on novel secret agent sequel mi6,en,Spectre,A cryptic message from Bond’s past sends him o...,107.376788,...,148.0,"[{""iso_639_1"": ""fr"", ""name"": ""Fran\u00e7ais""},...",Released,A Plan No One Escapes,Spectre,6.3,4466,Daniel Craig Christoph Waltz L\u00e9a Seydoux ...,"[{'name': 'Thomas Newman', 'gender': 2, 'depar...",Sam Mendes
3,3,250000000,Action Crime Drama Thriller,http://www.thedarkknightrises.com/,49026,dc comics crime fighter terrorist secret ident...,en,The Dark Knight Rises,Following the death of District Attorney Harve...,112.31295,...,165.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,The Legend Ends,The Dark Knight Rises,7.6,9106,Christian Bale Michael Caine Gary Oldman Anne ...,"[{'name': 'Hans Zimmer', 'gender': 2, 'departm...",Christopher Nolan
4,4,260000000,Action Adventure Science Fiction,http://movies.disney.com/john-carter,49529,based on novel mars medallion space travel pri...,en,John Carter,"John Carter is a war-weary, former military ca...",43.926995,...,132.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"Lost in our world, found in another.",John Carter,6.1,2124,Taylor Kitsch Lynn Collins Samantha Morton Wil...,"[{'name': 'Andrew Stanton', 'gender': 2, 'depa...",Andrew Stanton


In [82]:
for col in df.columns: 
    print(col) 

index
budget
genres
homepage
id
keywords
original_language
original_title
overview
popularity
production_companies
production_countries
release_date
revenue
runtime
spoken_languages
status
tagline
title
vote_average
vote_count
cast
crew
director


## Feature 1: Content-based Filtering
### 1. Define Similarity score    
The similarity score is defined as the cosine similarity. We count how often each word occurs in each movie (for example, a certain word may occur once in movie 1 and twice in movie 2), treat those counts as the coordinates of a vector, and measure the angle $\theta$ between the two vectors: $\cos\theta = \frac{u \cdot v}{\lVert u \rVert \, \lVert v \rVert}$. The advantage of this method is that, because word counts are never negative, the final score always falls in the range 0 to 1.
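To make the definition concrete, here is a small worked example with NumPy. The two count vectors are made up for illustration: each has two entries, standing for how often two hypothetical words occur in movie 1 and in movie 2.

```python
import numpy as np

def cosine_score(u, v):
    """Cosine of the angle between two count vectors: (u . v) / (|u| |v|)."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hypothetical word counts: movie 1 uses the words (1, 2) times, movie 2 (2, 1) times.
score = cosine_score([1, 2], [2, 1])
print(score)  # dot product 4, both lengths sqrt(5), so 4 / 5

# Vectors pointing the same way score 1, vectors sharing no words score 0.
same_direction = cosine_score([1, 2], [2, 4])
no_overlap = cosine_score([1, 0], [0, 3])
```

Note that the score depends only on the direction of the vectors, not on their length, so a movie with many keywords is not favoured over one with few.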
### 2. Calculate Cosine Similarity
scikit-learn can calculate the cosine similarity score for us via `cosine_similarity` in the `sklearn.metrics.pairwise` module. We also need to turn each text into word counts, which `CountVectorizer` from the `sklearn.feature_extraction.text` module does.

In [83]:
# string parser and cosine score calculator
from sklearn.feature_extraction.text import CountVectorizer  #parse text
from sklearn.metrics.pairwise import cosine_similarity  # cosine_similarity package, for content-based similarity modeling
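Before applying these two tools to the real dataset, it may help to see them on a toy input. The three strings below are invented for this sketch; they stand in for the combined feature text of three movies.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical combined-feature strings for three movies.
docs = [
    "batman gotham crime",
    "batman gotham joker",
    "romance paris wedding",
]

# Each row of `counts` is one movie, each column one word of the vocabulary.
counts = CountVectorizer().fit_transform(docs)

# scores[i][j] is the cosine similarity between movie i and movie j.
scores = cosine_similarity(counts)
print(scores.round(2))
```

The first two strings share two of their three words, so their score is 2/3, while the third string shares no words with the others and scores 0 against both.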

## Feature Selection
First, I need to select which features to use in the prediction. From `df.head()` we can see the content of each column; `genres`, `director`, `keywords`, and `cast` look like good fits for matching similar texts. The `overview` column might be useful as well, but it would add many common words such as prepositions, which could distract the algorithm. After some research, I decided not to use this feature for now.
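If the overview were to be used later, one possible way to reduce the noise from common words is the `stop_words="english"` option of `CountVectorizer`. This is only a sketch of that option, not something this notebook does; the sentence below is made up rather than taken from the dataset.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical overview-style sentence.
overview = ["a hero is on the run from the law in the city"]

# Default settings keep common words such as "the" and "from".
plain = CountVectorizer()
plain.fit(overview)

# With the built-in English stop-word list, those common words are dropped.
filtered = CountVectorizer(stop_words="english")
filtered.fit(overview)

print(sorted(plain.vocabulary_))
print(sorted(filtered.vocabulary_))
```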

In [84]:
# select features I want to use
features = ['keywords','cast','genres','director']


In [85]:
# combine all selected features
for feature in features:
    df[feature] = df[feature].fillna('') # fill all NA cells

# produce a column which has all features combined for each movie
def combine_features(movie):
    # join the selected feature strings of one row with single spaces
    return " ".join(str(movie[feature]) for feature in features)

df["combine_features"] = df.apply(combine_features,axis=1)

print("Features:\n", df["combine_features"].head())


Features:
 0    culture clash future space war space colony so...
1    ocean drug abuse exotic island east india trad...
2    spy based on novel secret agent sequel mi6 Dan...
3    dc comics crime fighter terrorist secret ident...
4    based on novel mars medallion space travel pri...
Name: combine_features, dtype: object


In [86]:
# Create count matrix from this new column
cv = CountVectorizer() # from sklearn.feature_extraction.text 
count_matrix = cv.fit_transform(df["combine_features"])


In [87]:
# Compute the Cosine Similarity based on the count_matrix
cosine = cosine_similarity(count_matrix) # from sklearn.metrics.pairwise


In [88]:
# helper
# given an index, get the title of selected movie
def get_title(index):
    return df[df.index == index]["title"].values[0]

# given a title, get the index of selected movie
def get_index(title):
    return df[df.title == title]["index"].values[0]


In [89]:
# test with the dark knight... 
# hypothetically, it will recommend things like batman, dark knight rises, etc. 
user_fav = "The Dark Knight"

In [90]:
# Get index from title
idx = get_index(user_fav)
similar = list(enumerate(cosine[idx]))

In [91]:
# Sort by similarity score in descending order, skip the top match
# (normally the movie itself), and keep the next 19 movies
sorted_similar = sorted(similar,key=lambda x:x[1],reverse=True)[1:20]

In [93]:
# Print titles of first 10 movies
print("top 10 movies we think you would like!")
for element in sorted_similar[:10]:
    print(get_title(element[0]))

top 10 movies we think you would like!
The Dark Knight Rises
Batman Begins
Amidst the Devil's Wings
The Prestige
Kick-Ass
Kick-Ass 2
Batman Returns
Batman
The Killer Inside Me
Batman & Robin


#### Feature 1.5
Next, I add a new layer to the sorting: the average vote.

In [94]:
# take voting into account
df["vote_average"].head()

0    7.2
1    6.9
2    6.3
3    7.6
4    6.1
Name: vote_average, dtype: float64

In [95]:
sort_by_average_voting = sorted(sorted_similar,key=lambda x:df["vote_average"][x[0]],reverse=True)
print(sort_by_average_voting)

[(1196, 0.37630890450319093), (3, 0.6923076923076922), (119, 0.6538461538461537), (1720, 0.3706246583305506), (3966, 0.2926847035024818), (1359, 0.3530090432487313), (3332, 0.3149448894660932), (428, 0.3602883460614461), (3819, 0.2926847035024818), (1740, 0.36417852036461484), (4099, 0.29268470350248177), (3293, 0.28022426915890253), (3359, 0.29417420270727607), (1792, 0.28625128703833574), (2793, 0.3508232077228117), (1503, 0.2853908964926965), (2398, 0.28625128703833574), (210, 0.3396831102433788), (4638, 0.45291081365783836)]


In [96]:
print("We think you would like to watch the following movies:\n")
for element in sort_by_average_voting[:10]:
    print(get_title(element[0]))

We think you would like to watch the following movies:

The Prestige
The Dark Knight Rises
Batman Begins
Kick-Ass
Point Blank
Batman
Harry Brown
Batman Returns
Defendor
Kick-Ass 2


In [97]:
# function to print top recommended movies
def recommend_by_title(title):
    idx = get_index(title)
    similar = list(enumerate(cosine[idx]))
    # keep the 19 most similar movies, skipping the top match (normally the movie itself)
    sorted_similar = sorted(similar,key=lambda x:x[1],reverse=True)[1:20]
    # re-rank those candidates by average vote
    sort_by_average_voting = sorted(sorted_similar,key=lambda x:df["vote_average"][x[0]],reverse=True)
    print("We think you would like to watch the following movies:\n")
    for element in sort_by_average_voting[:10]:
        print(get_title(element[0]))

In [98]:
recommend_by_title('The Dark Knight')

We think you would like to watch the following movies:

The Prestige
The Dark Knight Rises
Batman Begins
Kick-Ass
Point Blank
Batman
Harry Brown
Batman Returns
Defendor
Kick-Ass 2


### Performance
I cross-checked my algorithm's output with Google: the results are somewhat similar. I saw the three Batman movies in both my list and Google's recommendations, as well as The Dark Knight Rises.

![darkknight1.png](attachment:darkknight1.png)

## Feature 2: Top-Rated Movies by genre

I added this feature to recommend top-rated movies by genre. The high-level idea is adapted from IMDB's weighted rating formula:

$$\text{weighted rating} = \frac{v}{v+m} \cdot R + \frac{m}{v+m} \cdot C$$

where $v$ is the number of votes for the movie (`vote_count`), $m$ is the minimum number of votes required to be listed on the board, $R$ is the average rating of the movie (`vote_average`), and $C$ is the mean rating across the whole board.

To get enough movies in the pool, I set the cutoff at the 75th percentile: to be put on the board, a movie must have at least as many votes as 75% of the movies in the entire dataset.
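The percentile cutoff can be sketched on a handful of made-up vote counts. The numbers and the variable names `cutoff` and `on_board` below are hypothetical; the only real API used is pandas' `Series.quantile`.

```python
import pandas as pd

# Hypothetical vote counts for five movies.
toy_vote_counts = pd.Series([10, 50, 200, 800, 3000])

# Minimum number of votes a movie needs to be put on the board.
cutoff = toy_vote_counts.quantile(0.75)

# Movies at or above the cutoff qualify.
on_board = toy_vote_counts[toy_vote_counts >= cutoff].tolist()
print(cutoff, on_board)
```

Raising the percentile makes the board more selective, while lowering it lets in movies with fewer votes, whose average ratings are less reliable.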

In [99]:
from ast import literal_eval

In [100]:
df['vote_average'].head()

0    7.2
1    6.9
2    6.3
3    7.6
4    6.1
Name: vote_average, dtype: float64

In [101]:
df['vote_count'].head()

0    11800
1     4500
2     4466
3     9106
4     2124
Name: vote_count, dtype: int64

In [102]:
# clean vote_count column
md = df[~df['vote_count'].astype(str).str.contains('id',na=False)]
md = md[~md['vote_count'].astype(str).str.contains('name',na=False)]

In [103]:
# clean vote_average column
md = md[~md['vote_average'].astype(str).str.contains('id',na=False)]
md = md[~md['vote_average'].astype(str).str.contains('name',na=False)]

In [104]:
# convert to int
vote_counts = md[md['vote_count'].notnull()]['vote_count'].astype(str).astype(int)
    

In [105]:
# convert score to float
vote_averages = md[md['vote_average'].notnull()]['vote_average'].astype(str).astype(float)

In [106]:
# calculate mean across the board 
C = vote_averages.mean()
C

6.092171559442011

In [107]:
# calculate 0.75 quantile
m = vote_counts.quantile(0.75)
m

737.0

In [108]:
# calculate weighted ratings
def weighted_rating(x):
    v = x['vote_count']
    R = x['vote_average']
    return (v/(v+m) * R) + (m/(m+v) * C)

In [109]:
qualified = md[(md['vote_count'] >= m) & 
                   (md['vote_count'].notnull()) & 
                   (md['vote_average'].notnull())][['title', 
                                                    'vote_count', 
                                                    'vote_average', 
                                                    'popularity']]

In [110]:
qualified['weighted_rating'] = qualified.apply(weighted_rating, axis=1)

In [111]:
qualified = qualified.sort_values('weighted_rating', ascending=False).head(250)

In [112]:
qualified.head(15)

| | title | vote_count | vote_average | popularity | weighted_rating |
|---|---|---|---|---|---|
| 1881 | The Shawshank Redemption | 8205 | 8.5 | 136.747729 | 8.301547 |
| 3337 | The Godfather | 5893 | 8.4 | 143.659698 | 8.143459 |
| 662 | Fight Club | 9413 | 8.3 | 146.757391 | 8.139688 |
| 3232 | Pulp Fiction | 8428 | 8.3 | 121.463076 | 8.122458 |
| 65 | The Dark Knight | 12002 | 8.2 | 187.322927 | 8.078054 |
| 809 | Forrest Gump | 7927 | 8.2 | 138.133331 | 8.020698 |
| 96 | Inception | 13752 | 8.1 | 167.58371 | 7.997869 |
| 1818 | Schindler's List | 4329 | 8.3 | 104.469351 | 7.978806 |
| 3865 | Whiplash | 4254 | 8.3 | 192.528841 | 7.973979 |
| 95 | Interstellar | 10867 | 8.1 | 724.247784 | 7.972478 |
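The first row of the output above can be checked by hand. The sketch below plugs in the values reported earlier in this notebook (`C` and `m` from the mean and quantile cells, `v` and `R` from the row for The Shawshank Redemption); the function name `weighted` is just a standalone stand-in for `weighted_rating`.

```python
# Values copied from the outputs above.
C = 6.092171559442011   # mean vote_average across the board
m = 737.0               # 0.75 quantile of vote_count
v = 8205                # vote_count of The Shawshank Redemption
R = 8.5                 # vote_average of The Shawshank Redemption

def weighted(v, R, m, C):
    """Weighted rating: blends the movie's own rating R with the board mean C."""
    return v / (v + m) * R + m / (v + m) * C

wr = weighted(v, R, m, C)
print(round(wr, 6))
```

Because this movie has far more votes than the cutoff, the result stays close to its own rating of 8.5; a movie with only about 737 votes would land roughly halfway between its own rating and the board mean.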


In [113]:
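# filter by genre
# note: `genres` is a space-separated string in this dataset, so each movie keeps
# its full genre string in the new `genre` column; filtering relies on substring matching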
s = df.apply(lambda x: pd.Series(x['genres']),axis=1).stack().reset_index(level=1, drop=True)
s.name = 'genre'
gen_md = df.drop('genres', axis=1).join(s)

In [114]:
# test
gen_md[gen_md['genre'].astype(str).str.contains('Action')].head()

Unnamed: 0,index,budget,homepage,id,keywords,original_language,original_title,overview,popularity,production_companies,...,status,tagline,title,vote_average,vote_count,cast,crew,director,combine_features,genre
0,0,237000000,http://www.avatarmovie.com/,19995,culture clash future space war space colony so...,en,Avatar,"In the 22nd century, a paraplegic Marine is di...",150.437577,"[{""name"": ""Ingenious Film Partners"", ""id"": 289...",...,Released,Enter the World of Pandora.,Avatar,7.2,11800,Sam Worthington Zoe Saldana Sigourney Weaver S...,"[{'name': 'Stephen E. Rivkin', 'gender': 0, 'd...",James Cameron,culture clash future space war space colony so...,Action Adventure Fantasy Science Fiction
1,1,300000000,http://disney.go.com/disneypictures/pirates/,285,ocean drug abuse exotic island east india trad...,en,Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha...",139.082615,"[{""name"": ""Walt Disney Pictures"", ""id"": 2}, {""...",...,Released,"At the end of the world, the adventure begins.",Pirates of the Caribbean: At World's End,6.9,4500,Johnny Depp Orlando Bloom Keira Knightley Stel...,"[{'name': 'Dariusz Wolski', 'gender': 2, 'depa...",Gore Verbinski,ocean drug abuse exotic island east india trad...,Adventure Fantasy Action
2,2,245000000,http://www.sonypictures.com/movies/spectre/,206647,spy based on novel secret agent sequel mi6,en,Spectre,A cryptic message from Bond’s past sends him o...,107.376788,"[{""name"": ""Columbia Pictures"", ""id"": 5}, {""nam...",...,Released,A Plan No One Escapes,Spectre,6.3,4466,Daniel Craig Christoph Waltz L\u00e9a Seydoux ...,"[{'name': 'Thomas Newman', 'gender': 2, 'depar...",Sam Mendes,spy based on novel secret agent sequel mi6 Dan...,Action Adventure Crime
3,3,250000000,http://www.thedarkknightrises.com/,49026,dc comics crime fighter terrorist secret ident...,en,The Dark Knight Rises,Following the death of District Attorney Harve...,112.31295,"[{""name"": ""Legendary Pictures"", ""id"": 923}, {""...",...,Released,The Legend Ends,The Dark Knight Rises,7.6,9106,Christian Bale Michael Caine Gary Oldman Anne ...,"[{'name': 'Hans Zimmer', 'gender': 2, 'departm...",Christopher Nolan,dc comics crime fighter terrorist secret ident...,Action Crime Drama Thriller
4,4,260000000,http://movies.disney.com/john-carter,49529,based on novel mars medallion space travel pri...,en,John Carter,"John Carter is a war-weary, former military ca...",43.926995,"[{""name"": ""Walt Disney Pictures"", ""id"": 2}]",...,Released,"Lost in our world, found in another.",John Carter,6.1,2124,Taylor Kitsch Lynn Collins Samantha Morton Wil...,"[{'name': 'Andrew Stanton', 'gender': 2, 'depa...",Andrew Stanton,based on novel mars medallion space travel pri...,Action Adventure Science Fiction


In [115]:
def build_rec(genre, percentile=0.75):
    md = gen_md[gen_md['genre'].astype(str).str.contains(genre)]
    vote_counts = md[md['vote_count'].notnull()]['vote_count'].astype('int')
    vote_averages = md[md['vote_average'].notnull()]['vote_average'].astype('float')
    C = vote_averages.mean()
    m = vote_counts.quantile(percentile)
    
    qualified = md[(md['vote_count'] >= m) & 
                   (md['vote_count'].notnull()) & 
                   (md['vote_average'].notnull())][['title', 
                                                    'vote_count', 
                                                    'vote_average', 
                                                    'popularity']]
    qualified['vote_count'] = qualified['vote_count'].astype('int')
    qualified['vote_average'] = qualified['vote_average'].astype('float')
    
    qualified['weighted_rating'] = qualified.apply(lambda x: (x['vote_count']/(x['vote_count']+m) * x['vote_average']) + (m/(m+x['vote_count']) * C), axis=1)
    qualified = qualified.sort_values('weighted_rating', ascending=False).head(250)
    
    return qualified

In [116]:
# for example, I want to search for top crime movies
print('We think you would love these Crime movies: \n')
for index, row in build_rec('Crime').head(10).iterrows():
    print(row['title'])

We think you would love these Crime movies: 

The Shawshank Redemption
The Godfather
Pulp Fiction
The Dark Knight
The Godfather: Part II
The Green Mile
Se7en
The Silence of the Lambs
GoodFellas
The Usual Suspects
