## Types of recommender systems

### Demographic filtering
Gives the same generalized recommendations to every user, based on movie popularity and/or genre.
Users with similar demographic features are recommended the same movies.
#### Too simple, since every user is different

### Content based filtering
#### If a user likes an item, they will probably like similar items
The system uses item metadata (genre, director, description, actors, ...) to recommend.

### Collaborative filtering
Matches users with similar interests and provides recommendations based on those matches.
No item metadata is required.
#### Same interests, same recommendations



In [41]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel 

df1 = pd.read_csv('../input/tmdb-movie-metadata/tmdb_5000_credits.csv')
df2 = pd.read_csv("../input/tmdb-movie-metadata/tmdb_5000_movies.csv")

### Features in DataFrames df1 and df2
df1 (DataFrame 1)
- movie_id - ID for each movie
- title - title of the movie
- cast
- crew

df2 (DataFrame 2)
- budget
- genres
- homepage - link to the homepage of the movie
- id - ID for each movie (same as movie_id in df1)
- keywords - keywords and tags related to the movie
- original_language - original language
- original_title - title of the movie before translation/adaptation
- overview - description of the movie
- popularity - a numerical quantity for movie popularity
- production_companies - the production houses
- production_countries - countries of origin
- release_date
- revenue - worldwide revenue
- runtime - in minutes
- spoken_languages - languages spoken in the movie
- status - "Released" or "Rumored"
- tagline - the movie's tagline
- title - title of the movie
- vote_average - average rating received
- vote_count - number of votes received

Joining df1, df2 on id column using [pandas.DataFrame.merge()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.merge.html)


In [42]:
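# rename df1's columns so that its key matches df2's "id";
# naming the second column "tittle" keeps it distinct from df2's "title"
df1.columns = ['id', 'tittle', 'cast', 'crew']
df2 = df2.merge(df1, on='id')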
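
A small sketch of why the column names matter in the merge above. The two toy frames and their values are made up for illustration; only `DataFrame.merge`, `rename` and `drop` are real pandas calls. When both frames carry a non-key column with the same name, pandas keeps both and appends `_x` / `_y` suffixes.

```python
import pandas as pd

# Toy frames standing in for the credits and movies tables (made-up rows).
credits = pd.DataFrame({"movie_id": [1, 2], "title": ["A", "B"], "cast": ["x", "y"]})
movies = pd.DataFrame({"id": [1, 2], "title": ["A", "B"], "budget": [10, 20]})

# Align the key name, but leave "title" in both frames: pandas adds suffixes.
clash = movies.merge(credits.rename(columns={"movie_id": "id"}), on="id")
print(list(clash.columns))  # ['id', 'title_x', 'budget', 'title_y', 'cast']

# One alternative: drop the redundant column before merging.
clean = movies.merge(
    credits.rename(columns={"movie_id": "id"}).drop(columns="title"), on="id"
)
print(list(clean.columns))  # ['id', 'title', 'budget', 'cast']
```

Renaming the duplicate column, as the cell above does, or dropping it, as in this sketch, both leave a single `title` column to work with.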

Peek at the merged DataFrame df2:

In [43]:
df2.head()

Unnamed: 0,budget,genres,homepage,id,keywords,original_language,original_title,overview,popularity,production_companies,...,runtime,spoken_languages,status,tagline,title,vote_average,vote_count,tittle,cast,crew
0,237000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://www.avatarmovie.com/,19995,"[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...",en,Avatar,"In the 22nd century, a paraplegic Marine is di...",150.437577,"[{""name"": ""Ingenious Film Partners"", ""id"": 289...",...,162.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}, {""iso...",Released,Enter the World of Pandora.,Avatar,7.2,11800,Avatar,"[{""cast_id"": 242, ""character"": ""Jake Sully"", ""...","[{""credit_id"": ""52fe48009251416c750aca23"", ""de..."
1,300000000,"[{""id"": 12, ""name"": ""Adventure""}, {""id"": 14, ""...",http://disney.go.com/disneypictures/pirates/,285,"[{""id"": 270, ""name"": ""ocean""}, {""id"": 726, ""na...",en,Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha...",139.082615,"[{""name"": ""Walt Disney Pictures"", ""id"": 2}, {""...",...,169.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"At the end of the world, the adventure begins.",Pirates of the Caribbean: At World's End,6.9,4500,Pirates of the Caribbean: At World's End,"[{""cast_id"": 4, ""character"": ""Captain Jack Spa...","[{""credit_id"": ""52fe4232c3a36847f800b579"", ""de..."
2,245000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://www.sonypictures.com/movies/spectre/,206647,"[{""id"": 470, ""name"": ""spy""}, {""id"": 818, ""name...",en,Spectre,A cryptic message from Bond’s past sends him o...,107.376788,"[{""name"": ""Columbia Pictures"", ""id"": 5}, {""nam...",...,148.0,"[{""iso_639_1"": ""fr"", ""name"": ""Fran\u00e7ais""},...",Released,A Plan No One Escapes,Spectre,6.3,4466,Spectre,"[{""cast_id"": 1, ""character"": ""James Bond"", ""cr...","[{""credit_id"": ""54805967c3a36829b5002c41"", ""de..."
3,250000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 80, ""nam...",http://www.thedarkknightrises.com/,49026,"[{""id"": 849, ""name"": ""dc comics""}, {""id"": 853,...",en,The Dark Knight Rises,Following the death of District Attorney Harve...,112.31295,"[{""name"": ""Legendary Pictures"", ""id"": 923}, {""...",...,165.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,The Legend Ends,The Dark Knight Rises,7.6,9106,The Dark Knight Rises,"[{""cast_id"": 2, ""character"": ""Bruce Wayne / Ba...","[{""credit_id"": ""52fe4781c3a36847f81398c3"", ""de..."
4,260000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://movies.disney.com/john-carter,49529,"[{""id"": 818, ""name"": ""based on novel""}, {""id"":...",en,John Carter,"John Carter is a war-weary, former military ca...",43.926995,"[{""name"": ""Walt Disney Pictures"", ""id"": 2}]",...,132.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"Lost in our world, found in another.",John Carter,6.1,2124,John Carter,"[{""cast_id"": 5, ""character"": ""John Carter"", ""c...","[{""credit_id"": ""52fe479ac3a36847f813eaa3"", ""de..."


## Demographic filtering

- define a metric to score each movie
- compute the score for every movie
- sort by score and recommend the best-rated movies to users

#### Weighted rating

- wr := weighted rating
- v (vote_count) := number of votes for the movie
- m := minimum number of votes required to be listed in the chart
- R (vote_average) := average rating of the movie
- C := mean vote_average across the whole movie dataset

$$wr = \left(\frac{v}{v+m}\cdot R\right)+\left(\frac{m}{v+m}\cdot C\right)$$
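
To see what the formula does, here is a minimal worked example. The values of `m`, `C`, `v` and `R` are hypothetical and chosen only for illustration, not taken from the dataset. A movie with few votes is pulled toward the global mean C, while a heavily voted movie keeps a score close to its own average R.

```python
def weighted_rating(v, R, m, C):
    """Blend a movie's own average R with the global mean C, weighted by vote count v."""
    return (v / (v + m)) * R + (m / (v + m)) * C

# Hypothetical cutoff and global mean (illustration only).
m, C = 1000, 6.0

few_votes = weighted_rating(v=10, R=9.0, m=m, C=C)         # barely any votes
many_votes = weighted_rating(v=100_000, R=9.0, m=m, C=C)   # heavily voted

print(round(few_votes, 2))   # 6.03, close to C
print(round(many_votes, 2))  # 8.97, close to R
```

When v equals m the two terms get equal weight, so the score lands halfway between R and C.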



In [44]:
# C = mean of df2["vote_average"]
C = df2["vote_average"].mean()

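#### m (minimum votes required)
To be listed, a movie must have more votes than at least 90% of the movies in the dataset, so m is the 90th percentile of vote_count.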


In [45]:
# m = 90th percentile of vote_count, used as the cutoff
m = df2["vote_count"].quantile(.9)
m

1838.4000000000015
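
A small sketch of how the percentile cutoff behaves, using ten made-up vote counts rather than the real column. `Series.quantile` interpolates linearly between neighbouring values by default, which is why the cutoff does not have to be a value that occurs in the data.

```python
import pandas as pd

# Made-up vote counts for ten movies (illustration only).
votes = pd.Series([5, 12, 30, 44, 80, 150, 400, 900, 2500, 12000], name="vote_count")

# Position (n - 1) * 0.9 = 8.1, so the result lies 10% of the way from 2500 to 12000.
cutoff = votes.quantile(0.9)
qualified = votes[votes >= cutoff]

print(cutoff)            # about 3450
print(list(qualified))   # [12000]
```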

Keep only the movies that qualify, i.e. those with vote_count >= m:

In [46]:
q_ = df2.copy().loc[df2["vote_count"] >= m]
q_.shape

(481, 23)

#### metric for movie

$$\textrm{Weighted Rating } (WR)=\left(\frac{v}{v+m}\cdot R\right)+\left(\frac{m}{v+m}\cdot C\right)$$

In [47]:
# weighted rating
def weighted_rating(x, m=m, C=C):
    """
    args:
        x : row of the DataFrame with "vote_count" and "vote_average"
        m : minimum votes required (90th percentile of vote_count)
        C : mean of vote_average over all movies
    return:
        weighted rating (WR) of the movie
    """
    v = x["vote_count"]
    R = x["vote_average"]
    # IMDB weighted rating formula
    return (v/(v+m)*R + m/(v+m)*C)

In [48]:
# define new feature "score" & 
# calculate its value with weighted_rating
q_["score"] = q_.apply(weighted_rating, axis=1)

# sort movie based on score
q_ = q_.sort_values("score", ascending=False)
q_[["title", "vote_count", "vote_average", "score"]].head(10)

Unnamed: 0,title,vote_count,vote_average,score
1881,The Shawshank Redemption,8205,8.5,8.059258
662,Fight Club,9413,8.3,7.939256
65,The Dark Knight,12002,8.2,7.92002
3232,Pulp Fiction,8428,8.3,7.904645
96,Inception,13752,8.1,7.863239
3337,The Godfather,5893,8.4,7.851236
95,Interstellar,10867,8.1,7.809479
809,Forrest Gump,7927,8.2,7.803188
329,The Lord of the Rings: The Return of the King,8064,8.1,7.727243
1990,The Empire Strikes Back,5879,8.2,7.697884


## Content based filtering

The content of a film (overview, cast and crew, keywords, tagline, genre, etc.) is used to measure its similarity to other movies, and the most similar movies are recommended.



### Plot based recommender

Compute a pairwise similarity score (SS) for every pair of movies based on their plot overviews (descriptions), and recommend movies based on that score.

In [49]:
df2["overview"].head()

0    In the 22nd century, a paraplegic Marine is di...
1    Captain Barbossa, long believed to be dead, ha...
2    A cryptic message from Bond’s past sends him o...
3    Following the death of District Attorney Harve...
4    John Carter is a war-weary, former military ca...
Name: overview, dtype: object

Computing the TF-IDF vector for the overview of every movie:
$$TF=\frac{\textrm{occurrences of the term in the document}}{\textrm{total number of terms in the document}}$$

$$IDF=\log\Big( \frac{\textrm{no. of documents}}{\textrm{no. of documents containing the term}} \Big)$$

$$\textrm{TF-IDF}=TF \times IDF$$
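
The formulas above can be checked by hand on a tiny corpus. This is a minimal sketch using three made-up "overviews" and hypothetical helper names (`tf`, `idf`, `tfidf`). It follows the textbook definitions exactly; scikit-learn's `TfidfVectorizer` by default uses a smoothed IDF and L2-normalizes each row, so its numbers will differ from these.

```python
import math
from collections import Counter

# Three made-up overviews, already lowercased and split on whitespace.
docs = [
    "a hero saves the city".split(),
    "a villain attacks the city".split(),
    "a hero falls in love".split(),
]

def tf(term, doc):
    # occurrences of the term / total number of terms in this document
    return Counter(doc)[term] / len(doc)

def idf(term, docs):
    # log(number of documents / number of documents containing the term);
    # assumes the term occurs in at least one document
    containing = sum(1 for d in docs if term in d)
    return math.log(len(docs) / containing)

def tfidf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

print(tfidf("a", docs[0], docs))       # 0.0, the word appears in every document
print(tfidf("hero", docs[0], docs))    # appears in 2 of 3 documents
print(tfidf("saves", docs[0], docs))   # appears in 1 of 3 documents, highest weight
```

Words shared by every document carry no weight, while words that are rare across the corpus dominate the vector, which is what makes TF-IDF useful for comparing plots.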

Scikit-learn has a built-in TF-IDF vectorizer, TfidfVectorizer, in the module sklearn.feature_extraction.text.

In [50]:
# TfidfVectorizer is imported from sklearn.feature_extraction.text

# replace NaN for an empty string in overview column
df2["overview"] = df2["overview"].fillna("")

# define a TF-IDF vectorizer object that removes all English stop words
tfidf = TfidfVectorizer(stop_words = "english")

# construct the TF-IDF matrix by fitting & transforming the data
tfidf_matrix = tfidf.fit_transform(df2["overview"])

# shape of tfidf_matrix
tfidf_matrix.shape 

(4803, 20978)

### Cosine similarity
$$\textrm{similarity}=\cos(\theta)=\frac{\mathbf{A}\cdot\mathbf{B}}{\|\mathbf{A}\|\,\|\mathbf{B}\|}=\dfrac{\sum_{i=1}^N A_i B_i}{\sqrt{\sum_{i=1}^N A_i^2}\sqrt{\sum_{i=1}^N B_i^2}}$$

Here A and B are the TF-IDF vectors of two different movies.
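
A minimal sketch of the cosine formula in plain Python, with two made-up weight vectors and hypothetical helper names (`dot`, `norm`, `cosine`, `l2_normalize`). It also shows that once both vectors are scaled to unit length, a plain dot product gives the same value, which is the reason a dot-product kernel can stand in for cosine similarity on L2-normalized TF-IDF rows.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def norm(a):
    return math.sqrt(dot(a, a))

def cosine(a, b):
    # A.B / (|A| |B|)
    return dot(a, b) / (norm(a) * norm(b))

def l2_normalize(a):
    n = norm(a)
    return [x / n for x in a]

# Made-up 3-term weight vectors for two movies (illustration only).
A = [3.0, 4.0, 0.0]
B = [4.0, 3.0, 0.0]

print(cosine(A, B))                            # 24 / (5 * 5) = 0.96
print(dot(l2_normalize(A), l2_normalize(B)))   # same value, up to float rounding
```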

In [51]:
# TfidfVectorizer L2-normalizes each row by default, so the dot product
# (linear_kernel) equals the cosine similarity and is faster than cosine_similarity()
cos_sim = linear_kernel(tfidf_matrix, tfidf_matrix)

We need a reverse map from movie titles to DataFrame indices, i.e. a way to look up the index of a movie in df2 given its title.
 


In [52]:
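# Series mapping each movie title to its row index in df2;
# keep only the first occurrence of any repeated title
indices = pd.Series(data=df2.index, index=df2["title"])
indices = indices[~indices.index.duplicated(keep="first")]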

+ Get the index of the movie from its title
+ Get the list of similarity scores (from linear_kernel) of the given movie with all other movies
+ Sort the list by similarity, highest first
+ Return the movies at positions 1 to 10 of the sorted list (position 0 is the movie itself)

In [53]:
def recommendations(title, sim = cos_sim):
    # get the row index of the movie with this title
    idx = indices[title]
    
    # pair every movie index with its similarity to the given movie
    sim_score = list(enumerate(sim[idx]))
    
    # sort the pairs by similarity, highest first
    sim_score = sorted(sim_score, key = lambda x : x[1], reverse = True )
    
    # keep the 10 most similar movies (position 0 is the movie itself)
    sim_score = sim_score[1:11]
    
    # get the movie indices
    movies_ID = [i[0] for i in sim_score]
    
    # return the titles corresponding to movies_ID
    return df2["title"].iloc[movies_ID]

In [54]:
# trial
recommendations('The Dark Knight Rises')


65                              The Dark Knight
299                              Batman Forever
428                              Batman Returns
1359                                     Batman
3854    Batman: The Dark Knight Returns, Part 2
119                               Batman Begins
2507                                  Slow Burn
9            Batman v Superman: Dawn of Justice
1181                                        JFK
210                              Batman & Robin
Name: title, dtype: object

In [56]:
recommendations('The Avengers')


7               Avengers: Age of Ultron
3144                            Plastic
1715                            Timecop
4124                 This Thing of Ours
3311              Thank You for Smoking
3033                      The Corruptor
588     Wall Street: Money Never Sleeps
2136         Team America: World Police
1468                       The Fountain
1286                        Snowpiercer
Name: title, dtype: object

### Credit, genre & keywords based recommender


We will use the following metadata:
+ top 3 actors
+ director
+ related genres
+ movie plot keywords
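
The cast, crew, genres and keywords columns hold JSON-like strings, so the metadata listed above has to be parsed out first. The following is only a sketch on shortened, made-up strings. The helper names `top_names` and `director` are hypothetical, and the `"name"` key in cast records and the `"job"` key in crew records are assumptions about the full records, since the printed rows are truncated.

```python
import json

# Shortened, made-up strings in the same list-of-dicts shape as the columns.
# ASSUMPTION: cast entries have a "name" key and crew entries have "name" and "job".
cast_raw = ('[{"name": "Actor A"}, {"name": "Actor B"}, '
            '{"name": "Actor C"}, {"name": "Actor D"}]')
crew_raw = ('[{"name": "Editor E", "job": "Editor"}, '
            '{"name": "Director F", "job": "Director"}]')
genres_raw = '[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}]'

def top_names(raw, limit=3):
    """Return up to `limit` names from a JSON list of dicts, in listed order."""
    return [item["name"] for item in json.loads(raw)][:limit]

def director(raw):
    """Return the name of the first crew member whose job is 'Director', or None."""
    for member in json.loads(raw):
        if member.get("job") == "Director":
            return member["name"]
    return None

print(top_names(cast_raw))     # ['Actor A', 'Actor B', 'Actor C']
print(director(crew_raw))      # Director F
print(top_names(genres_raw))   # ['Action', 'Adventure']
```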
