## Types of recommender systems

### Demographic filtering
Generalized recommendations for every user, based on movie popularity and/or genre.
Recommends the same movies to users with similar demographic features.
#### Too simple, since every user is different

### Content-based filtering
#### If a user likes an item, they will probably like similar items
The system uses item metadata (genre, director, description, actors, ...) to make recommendations.

### Collaborative filtering
Matches users with similar interests and provides recommendations based on those matches.
No item metadata is required.
#### Same interests, same recommendations



In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel 

df1 = pd.read_csv('../input/tmdb-movie-metadata/tmdb_5000_credits.csv')
df2 = pd.read_csv("../input/tmdb-movie-metadata/tmdb_5000_movies.csv")

# Features in DataFrames df1 and df2
df1 (DataFrame 1)
- movie_id - ID for each movie
- title - title of the movie
- cast 
- crew

df2 (DataFrame 2)
- budget 
- genres
- homepage - link to the homepage of the movie
- id - ID for each movie (same as movie_id in df1)
- keywords - keywords or tags related to the movie
- original_language - original language
- original_title - title of the movie before translation/adaptation
- overview - description of the movie
- popularity - a numerical measure of movie popularity 
- production_companies - the production house
- production_countries - country of origin
- release_date
- revenue - worldwide revenue
- runtime - in minutes
- status - "Released" or "Rumored"
- tagline - the movie's tagline
- title - title of the movie
- vote_average - average rating received
- vote_count - number of votes received

Join df1 and df2 on the `id` column using [pandas.DataFrame.merge()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.merge.html)


In [None]:
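# Rename df1's columns so that movie_id becomes 'id', the merge key.
# The name 'tittle' keeps df1's title column from clashing with df2's 'title' during the merge.
df1.columns = ['id', 'tittle', 'cast', 'crew']
df2 = df2.merge(df1, on='id')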
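
To see why the credits title column gets a different name before the merge, here is a small sketch on made-up frames (the `credits` and `movies` names and their rows are illustrative, not taken from the TMDB files). When both frames carry a `title` column, `merge` keeps both and appends its default suffixes `_x` and `_y`, so a plain `title` column would no longer exist.

```python
import pandas as pd

# Toy stand-ins for the two CSV files (hypothetical rows).
credits = pd.DataFrame({"id": [1, 2], "title": ["A", "B"], "cast": ["c1", "c2"]})
movies = pd.DataFrame({"id": [1, 2], "title": ["A", "B"], "budget": [10, 20]})

# Both frames have 'title': pandas disambiguates with the suffixes _x and _y.
clash = movies.merge(credits, on="id")
clash_columns = list(clash.columns)

# Renaming the credits column first leaves the movies 'title' untouched.
renamed = credits.rename(columns={"title": "tittle"})
clean = movies.merge(renamed, on="id")
clean_columns = list(clean.columns)

print(clash_columns)
print(clean_columns)
```

An alternative would be to drop the duplicate column before merging; renaming simply keeps it around for inspection.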

Peek at the merged DataFrame df2.

In [None]:
df2.head()

## Demographic filtering

- choose a metric to score each movie
- compute the score for every movie
- sort by score and recommend the best-rated movies to users

#### Weighted average rating

- wr := weighted rating
- v (vote_count) := number of votes for the movie
- m := minimum number of votes required to be listed in the chart
- R (vote_average) := average rating of the movie
- C := mean vote_average across the whole movie dataset

$$wr = \left(\frac{v}{v+m} \cdot R\right)+\left(\frac{m}{v+m} \cdot C\right)$$
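
A short worked example may make the formula more concrete. The numbers below are hypothetical and chosen only for illustration: a movie with many votes keeps a score close to its own average R, while a movie with few votes is pulled toward the global mean C.

```python
def weighted_rating(v, R, m, C):
    """Weighted rating: blend a movie's own mean R with the global mean C."""
    return (v / (v + m)) * R + (m / (v + m)) * C

# Hypothetical values for the cutoff m and the global mean C.
m, C = 500, 6.0

many_votes = weighted_rating(v=10_000, R=8.0, m=m, C=C)  # stays close to R
few_votes = weighted_rating(v=10, R=9.0, m=m, C=C)       # pulled toward C

print(round(many_votes, 3), round(few_votes, 3))
```

Even though the second movie has the higher raw average, it ends up with the lower score because ten votes carry little weight against m.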



In [None]:
# C = mean of df2["vote_average"]
C = df2["vote_average"].mean()
C

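#### m (minimum votes required)
A movie qualifies if its vote count is at least the 90th percentile of all vote counts, i.e. it has more votes than roughly 90% of the movies in the dataset.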


In [None]:
# m = 90th percentile of vote_count, used as the cutoff
m = df2["vote_count"].quantile(.9)
m
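
As a rough illustration of what the 0.9 quantile means, the sketch below computes it by hand on a made-up list of vote counts, using linear interpolation between neighbouring sorted values (the default interpolation in pandas `quantile`). The helper name `quantile_linear` is hypothetical.

```python
def quantile_linear(values, q):
    """q-quantile with linear interpolation between sorted neighbours."""
    ordered = sorted(values)
    pos = q * (len(ordered) - 1)      # fractional position in the sorted list
    lower = int(pos)                  # index of the neighbour below
    upper = min(lower + 1, len(ordered) - 1)
    frac = pos - lower
    return ordered[lower] + frac * (ordered[upper] - ordered[lower])

# Made-up vote counts for ten movies.
vote_counts = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
cutoff = quantile_linear(vote_counts, 0.9)

# Only movies with at least `cutoff` votes would qualify for the chart.
qualified = [v for v in vote_counts if v >= cutoff]
print(cutoff, qualified)
```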

Keep only the movies that qualify (vote_count >= m).

In [None]:
q_ = df2.loc[df2["vote_count"] >= m].copy()
q_.shape

#### Score for each movie


In [None]:
# weighted rating (IMDB formula)
def weighted_rating(x, m=m, C=C):
    v = x["vote_count"]
    R = x["vote_average"]
    # v/(v+m) weights the movie's own rating, m/(v+m) weights the global mean
    return (v / (v + m)) * R + (m / (v + m)) * C

In [None]:
# define new feature "score" & 
# calculate its value with weighted_rating
q_["score"] = q_.apply(weighted_rating, axis=1)

# sort movie based on score
q_ = q_.sort_values("score", ascending=False)
q_[["title", "vote_count", "vote_average", "score"]].head(18)

In [None]:
pop = df2.sort_values("popularity", ascending=False)

plt.figure(figsize=(12,4))
x = 6
plt.barh(pop["title"].head(x), pop["popularity"].head(x), 
         align="edge", color="skyblue")

plt.gca().invert_yaxis()
plt.xlabel("popularity")
plt.ylabel("popular movies")

## Content-based filtering
Recommend movies that are similar to a given movie.

In [None]:
df2[["overview", "title"]].head()

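#### Plot description based recommender


Convert each overview into a word vector by computing Term Frequency-Inverse Document Frequency (TF-IDF) vectors.

**TF** (term frequency) - relative frequency of a word in a document
 $$TF = \frac{\text{term instances}}{\text{total instances}}$$

**IDF** (inverse document frequency) - log of the number of documents divided by the number of documents containing the term
 $$IDF = \log \left(\frac{\text{number of documents}}{\text{documents with term}}\right)$$
 
$$\text{overall importance} = TF \times IDF$$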

=> matrix
- column - one per word in the vocabulary of all overviews
- row - one per movie
* this reduces the weight of words that occur frequently across overviews, and so their effect on the final **similarity score**
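
The TF and IDF definitions above can be sketched by hand on a tiny made-up corpus. This is only an illustration of the plain formulas; scikit-learn's `TfidfVectorizer` uses a smoothed IDF and normalizes each row by default, so its numbers will differ.

```python
import math

# Tiny made-up corpus: each "overview" is already split into words.
docs = [
    ["space", "war", "hero"],
    ["space", "love"],
    ["love", "story", "love"],
]

def tf(term, doc):
    """Relative frequency of `term` within one document."""
    return doc.count(term) / len(doc)

def idf(term, corpus):
    """Log of (number of documents / documents containing `term`)."""
    containing = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / containing)

vocab = sorted({word for doc in docs for word in doc})

# One row per document, one column per vocabulary word.
matrix = [[tf(word, doc) * idf(word, docs) for word in vocab] for doc in docs]

for doc, row in zip(docs, matrix):
    print(doc, [round(value, 3) for value in row])
```

In the first document, "war" appears in only one overview and so gets a higher weight than "space", which appears in two.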

scikit-learn provides this as `TfidfVectorizer` in `sklearn.feature_extraction.text`.

In [None]:


# define a TF-IDF vectorizer from sklearn.feature_extraction.text; remove English stop words ("the", "a", ...)
tfidf = TfidfVectorizer(stop_words="english")

# replace NaN with empty string
df2["overview"] = df2["overview"].fillna('')

# construct the TF-IDF matrix by fitting and transforming the overviews
tfidf_mat = tfidf.fit_transform(df2["overview"])

# output the shape of tfidf_mat
tfidf_mat.shape

(4803, 20978)

20,978 different words are used in the overviews of 4,803 movies.

### Similarity score
- Euclidean distance
- Pearson correlation
- cosine similarity score

### Cosine similarity score
$$similarity=\cos(\theta)=\frac{A \cdot B}{\|A\| \, \|B\|}=\frac{\sum_{i=1}^{n}A_i B_i}{\sqrt{\sum_{i=1}^{n}A_i^2}\,\sqrt{\sum_{i=1}^{n}B_i^2}}$$

TfidfVectorizer L2-normalizes each row by default, so the dot product of two rows directly gives their cosine similarity score.

*So use sklearn's linear_kernel() rather than cosine_similarity(): it is faster.*
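
The claim that a plain dot product gives the cosine similarity once the vectors have unit length can be checked with a few lines of plain Python. The vectors `A` and `B` and the helper names are made up for this sketch.

```python
import math

def dot(a, b):
    """Sum of element-wise products."""
    return sum(x * y for x, y in zip(a, b))

def norm(a):
    """Euclidean (L2) length of a vector."""
    return math.sqrt(dot(a, a))

def cosine(a, b):
    """Cosine similarity from the full formula."""
    return dot(a, b) / (norm(a) * norm(b))

def normalize(a):
    """Scale a vector to unit length."""
    length = norm(a)
    return [x / length for x in a]

A = [1.0, 2.0, 2.0]
B = [2.0, 0.0, 1.0]

full = cosine(A, B)
shortcut = dot(normalize(A), normalize(B))  # dot product of unit vectors
print(round(full, 4), round(shortcut, 4))
```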

In [None]:
# compute the pairwise cosine similarity matrix

cos_sim = linear_kernel(tfidf_mat, tfidf_mat)
cos_sim.shape

Define a function that takes a movie title and outputs the 10 most similar movies.

- this needs a reverse mapping from movie titles to DataFrame indices

In [None]:
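# reverse map: movie title -> DataFrame index
indices = pd.Series(df2.index, index=df2["title"])
# keep the first index for any title that appears more than once
indices = indices[~indices.index.duplicated(keep="first")]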

indices
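
One detail worth checking when building such a mapping: `Series.drop_duplicates()` compares the values of the Series (here the row indices), not its index labels (here the titles). The sketch below, on made-up titles, shows that a repeated title survives `drop_duplicates()` and that `Index.duplicated` can be used to keep one row per title.

```python
import pandas as pd

# Made-up titles; "Twin" appears twice.
titles = pd.Series(["Solo", "Twin", "Twin"])
mapping = pd.Series(titles.index, index=titles)

# The values 0, 1, 2 are all distinct, so nothing is dropped.
after_drop = mapping.drop_duplicates()

# Filter on the index labels instead, keeping the first occurrence.
unique_titles = mapping[~mapping.index.duplicated(keep="first")]

print(len(after_drop), len(unique_titles))
print(unique_titles["Twin"])
```

With a duplicated title, `mapping["Twin"]` returns a Series of two positions rather than a single integer, which is why one position per title is convenient for the lookup.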

## Define the recommender function
- get the index of the movie from its title
- get the cosine similarity scores of that movie with all movies
    - convert them to a list of tuples (pos, score)
- sort on score, highest first
- take the top x+1 elements of the list; the first is the movie itself (highest similarity), so skip it
- return the titles at the indices of the remaining x elements
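
The steps above can be sketched on a toy example without pandas. The titles and the similarity matrix below are made up, and `top_similar` is a hypothetical helper name; the sketch assumes the query movie itself sorts to the top, which may not hold if another movie has an identical description.

```python
# Made-up titles and a hand-written symmetric similarity matrix.
titles = ["Alpha", "Beta", "Gamma", "Delta"]
sim = [
    [1.0, 0.2, 0.7, 0.1],
    [0.2, 1.0, 0.3, 0.6],
    [0.7, 0.3, 1.0, 0.4],
    [0.1, 0.6, 0.4, 1.0],
]
title_to_pos = {title: pos for pos, title in enumerate(titles)}

def top_similar(title, x=2):
    """Return the x titles most similar to `title`, excluding itself."""
    idx = title_to_pos[title]
    # Pair each position with its similarity to the query movie.
    scored = list(enumerate(sim[idx]))
    # Highest similarity first; the query itself (score 1.0) sorts to the top.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    # Drop the first entry (the movie itself) and keep the next x.
    return [titles[pos] for pos, _ in scored[1:x + 1]]

print(top_similar("Alpha"))
```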

In [None]:
# f(title, x) = x most similar movies
def rec(title, score=cos_sim, x=10, indices=indices):
    """
    param:
    - title, string, title of the movie
    - score | = cos_sim - cosine similarity matrix
    - x | = 10, no. of titles to be recommended
    - indices | = indices - mapping from title to DataFrame index
    """
    # get the DataFrame index matching the title
    idx = indices[title]

    # pair each movie position with its similarity to movie[idx]
    sim_scores = list(enumerate(score[idx]))

    # sort movies by score, most similar first
    sim_scores = sorted(sim_scores, key=lambda pair: pair[1], reverse=True)

    # skip position 0 (the movie itself) and keep the x most similar movies
    sim_scores = sim_scores[1:x + 1]

    # positions of the selected movies
    movie_indices = [pos for pos, _ in sim_scores]

    # return the matching titles
    return df2["title"].iloc[movie_indices]

In [None]:
rec("Avatar")
