<a href="https://colab.research.google.com/github/redman157/HocML/blob/master/Recommand-system-tutorial.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [2]:
# Run this cell and select the kaggle.json file downloaded
# from the Kaggle account settings page.
from google.colab import files
files.upload()

Saving kaggle.json to kaggle.json


{'kaggle.json': b'{"username":"pson2809","key":"<redacted>"}'}

In [3]:
# Let's make sure the kaggle.json file is present.
!ls -lha kaggle.json

-rw-r--r-- 1 root root 64 Mar 17 02:48 kaggle.json


In [0]:
# Next, install the Kaggle API client.
!pip install -q kaggle

In [0]:
# The Kaggle API client expects this file to be in ~/.kaggle,
# so move it there.
!mkdir -p ~/.kaggle
!cp kaggle.json ~/.kaggle/

# This permissions change avoids a warning on Kaggle tool startup.
!chmod 600 ~/.kaggle/kaggle.json

In [6]:
# List available datasets.
!kaggle datasets list

ref                                                          title                                                size  lastUpdated          downloadCount  
-----------------------------------------------------------  --------------------------------------------------  -----  -------------------  -------------  
ronitf/heart-disease-uci                                     Heart Disease UCI                                     3KB  2018-06-25 11:33:56          16378  
karangadiya/fifa19                                           FIFA 19 complete player dataset                       2MB  2018-12-21 03:52:59          14154  
russellyates88/suicide-rates-overview-1985-to-2016           Suicide Rates Overview 1985 to 2016                 396KB  2018-12-01 19:18:25          12281  
mohansacharya/graduate-admissions                            Graduate Admissions                                   9KB  2018-12-28 10:07:14          13658  
iarunava/cell-images-for-detecting-malaria                

In [7]:
!kaggle datasets download -d tmdb/tmdb-movie-metadata
!kaggle datasets download -d rounakbanik/the-movies-dataset

Downloading tmdb-movie-metadata.zip to /content
 54% 5.00M/9.30M [00:00<00:00, 39.3MB/s]
100% 9.30M/9.30M [00:00<00:00, 45.3MB/s]
Downloading the-movies-dataset.zip to /content
 94% 215M/228M [00:02<00:00, 129MB/s]
100% 228M/228M [00:02<00:00, 84.6MB/s]


In [8]:
!unzip the-movies-dataset.zip

Archive:  the-movies-dataset.zip
  inflating: ratings.csv             
  inflating: ratings_small.csv       
  inflating: links.csv               
  inflating: links_small.csv         
  inflating: keywords.csv            
  inflating: movies_metadata.csv     
  inflating: credits.csv             


In [9]:
!unzip tmdb-movie-metadata.zip

Archive:  tmdb-movie-metadata.zip
  inflating: tmdb_5000_credits.csv   
  inflating: tmdb_5000_movies.csv    


In [10]:
!pip install surprise

Collecting surprise
  Downloading https://files.pythonhosted.org/packages/61/de/e5cba8682201fcf9c3719a6fdda95693468ed061945493dea2dd37c5618b/surprise-0.1-py2.py3-none-any.whl
Collecting scikit-surprise (from surprise)
  Downloading https://files.pythonhosted.org/packages/4d/fc/cd4210b247d1dca421c25994740cbbf03c5e980e31881f10eaddf45fdab0/scikit-surprise-1.0.6.tar.gz (3.3MB)
    100% |████████████████████████████████| 3.3MB 7.0MB/s
Building wheels for collected packages: scikit-surprise
  Building wheel for scikit-surprise (setup.py) ... done
  Stored in directory: /root/.cache/pip/wheels/ec/c0/55/3a28eab06b53c220015063ebbdb81213cd3dcbb72c088251ec
Successfully built scikit-surprise
Installing collected packages: scikit-surprise, surprise
Successfully installed scikit-surprise-1.0.6 surprise-0.1


In [0]:
%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
from ast import literal_eval
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.metrics.pairwise import linear_kernel, cosine_similarity
from nltk.stem.snowball import SnowballStemmer
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.corpus import wordnet
from surprise import Reader, Dataset, SVD, evaluate

import warnings; warnings.simplefilter('ignore')

df1 = pd.read_csv("tmdb_5000_credits.csv")
df2 = pd.read_csv('tmdb_5000_movies.csv')

In [0]:
def choose_keywords(X):
  # Each cell holds a stringified list of dicts; parse it and keep only the 'name' values.
  X = X.apply(literal_eval).apply(lambda x: [i['name'] for i in x] if isinstance(x, list) else [])
  return X
df2['genres'] = choose_keywords(df2['genres'])
df2['keywords'] = choose_keywords(df2['keywords'])

There are three basic types of recommender systems:

> *  **Demographic Filtering**: These systems offer generalized recommendations to every user, based on movie popularity and/or genre. The system recommends the same movies to users with similar demographic features. Since each user is different, this approach is considered too simple. The basic idea is that movies that are more popular and critically acclaimed have a higher probability of being liked by the average audience.

> *  **Content Based Filtering**: These systems suggest items similar to a particular item. They use item metadata (for movies: genre, director, description, actors, etc.) to make recommendations. The general idea is that if a person liked a particular item, he or she will also like an item that is similar to it.

> *  **Collaborative Filtering**: These systems match people with similar interests and provide recommendations based on this matching. Unlike their content-based counterparts, collaborative filters do not require item metadata.

# **Demographic Filtering**
Before getting started, we need to:
* Choose a metric to score or rate each movie
* Calculate the score for every movie
* Sort by score and recommend the best-rated movies to users

We could use a movie's average rating as the score, but that would not be fair: a movie with an 8.9 average rating from only 3 votes cannot be considered better than a movie with a 7.8 average rating from 40 votes.
So I'll be using IMDB's weighted rating (wr), which is given as:

![](https://image.ibb.co/jYWZp9/wr.png)
where,
* v is the number of votes for the movie;
* m is the minimum votes required to be listed in the chart;
* R is the average rating of the movie; and
* C is the mean vote across the whole dataset
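
In case the image does not load: assuming it shows the usual form of the IMDB weighted rating, the formula can be written out as below. This form matches the `Weighted` function defined later in this notebook.

$$
WR = \frac{v}{v+m} \cdot R + \frac{m}{v+m} \cdot C
$$

The two weights sum to 1, so the weighted rating always lies between the movie's own average R and the global mean C. When v is much larger than m, the score approaches R; when v is much smaller than m, it is pulled toward C.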

We already have v (**vote_count**) and R (**vote_average**); C and the cutoff m can be calculated as follows:


In [0]:
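# Note: astype('int') truncates fractional ratings (e.g. 7.8 becomes 7).
vote_average = df2['vote_average'].astype('int')
vote_count = df2['vote_count'].astype('int')

# C: mean rating across all movies
C = vote_average.mean()

# m: minimum votes required, taken as the 95th percentile of vote counts
m = vote_count.quantile(0.95)
print(C, m)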

In [0]:
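def Weighted(data, m=m, C=C):
  # IMDB weighted rating: blend the movie's own average R with the global mean C,
  # trusting R more as the vote count v grows past the threshold m.
  v = data['vote_count']
  R = data['vote_average']
  return (v/(v+m) * R) + (m/(m+v) * C)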
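
To see what the weighting does, here is a minimal sketch applied to the two movies from the motivating example (8.9 from 3 votes versus 7.8 from 40 votes). The function name `weighted_rating` and the values `m_demo = 50` and `C_demo = 6.0` are assumptions chosen for illustration, not values computed from the dataset.

```python
def weighted_rating(v, R, m, C):
    """IMDB-style weighted rating: (v/(v+m))*R + (m/(v+m))*C."""
    return (v / (v + m)) * R + (m / (v + m)) * C

# Assumed demo values, not taken from the TMDB data.
m_demo, C_demo = 50, 6.0

few_votes = weighted_rating(v=3, R=8.9, m=m_demo, C=C_demo)
many_votes = weighted_rating(v=40, R=7.8, m=m_demo, C=C_demo)

print(round(few_votes, 2), round(many_votes, 2))  # 6.16 6.8
```

With these assumed values, the movie with only 3 votes is pulled almost all the way down to the global mean, so the movie with 40 votes ranks higher despite its lower raw average.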


In [0]:
# `x != np.nan` is always True, so test for missing dates with pd.notnull instead.
df2['year'] = pd.to_datetime(df2['release_date'], errors='coerce').apply(lambda x: str(x).split('-')[0] if pd.notnull(x) else np.nan)

Therefore, to qualify for the chart, a movie must have at least **434 votes** on TMDB. We also see that the average rating for a movie on TMDB is **5.244** on a scale of 10. **2274** movies qualify for our chart.
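
The cutoff logic can be sketched on a tiny table. The five rows below are made up for illustration, and the 0.60 quantile is an assumption chosen so that the toy table keeps more than one row (the notebook itself uses 0.95).

```python
import pandas as pd

# Toy data: titles and numbers are invented for this example.
toy = pd.DataFrame({
    "title": ["M1", "M2", "M3", "M4", "M5"],
    "vote_count": [3, 40, 120, 900, 15],
    "vote_average": [8.9, 7.8, 6.5, 7.1, 5.0],
})

# Global mean rating and the vote-count cutoff at the chosen quantile.
C_toy = toy["vote_average"].mean()
m_toy = toy["vote_count"].quantile(0.60)

# Only movies with at least m_toy votes are considered for the chart.
qualified_toy = toy[toy["vote_count"] >= m_toy]

print(C_toy, m_toy)
print(qualified_toy["title"].tolist())
```

Here the cutoff comes out to 72 votes (pandas interpolates linearly between 40 and 120), so only M3 and M4 qualify. Raising the quantile makes the chart more selective.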

In [0]:
qualified = df2[(df2['vote_count'] >= m) & (df2['vote_count'].notnull()) & (df2['vote_average'].notnull())][['title', 'year', 'vote_count', 'vote_average', 'popularity', 'genres']]
qualified['vote_count'] = qualified['vote_count'].astype('int')
qualified['vote_average'] = qualified['vote_average'].astype('int')
qualified.shape

In [0]:
qualified['wr'] = qualified.apply(Weighted, axis=1)
qualified = qualified.sort_values('wr', ascending=False).head(250)

In [0]:
qualified.head(15)

In [0]:
# Expand each movie into one row per genre so that charts can be built per genre.
hobby = df2.apply(lambda x : pd.Series(x['genres']),axis = 1).stack().reset_index(level = 1, drop = True)
hobby.name = 'genre'
gen_df = df2.drop('genres', axis = 1 ).join(hobby)

In [0]:
def build_chart(genre, percentile = 0.85):
  # Filter by the genre of interest, e.g. romance
  df = gen_df[gen_df['genre'] == genre]
  # Cast the two vote columns to int
  vote_counts = df[df['vote_count'].notnull()]['vote_count'].astype('int')
  vote_averages = df[df['vote_average'].notnull()]['vote_average'].astype('int')
  C = vote_averages.mean()
  m = vote_counts.quantile(percentile)
  
  # Keep only the rows that meet the vote-count threshold m
  qualified = df[(df['vote_count'] >= m) & (df['vote_count'].notnull()) & 
                 (df['vote_average'].notnull())][['title', 'year', 'vote_count', 'vote_average', 'popularity']]
  
  # Cast the DataFrame columns to int
  qualified['vote_count'] = qualified['vote_count'].astype('int')
  qualified['vote_average'] = qualified['vote_average'].astype('int')
  
  qualified['wr']  = qualified.apply(lambda x : Weighted(x, m ,C), axis = 1 )
  qualified = qualified.sort_values('wr', ascending = False).head(150)
  
  return qualified



In [0]:
build_chart('Action')

## Content Based Recommender

The recommender we built in the previous section suffers from some severe limitations. For one, it gives the same recommendations to everyone, regardless of the user's personal taste. If a person who loves romantic movies (and hates action) were to look at our Top 15 Chart, s/he probably wouldn't like most of the movies. If s/he were to go one step further and look at our charts by genre, s/he still wouldn't be getting the best recommendations.

For instance, consider a person who loves *Dilwale Dulhania Le Jayenge*, *My Name is Khan* and *Kabhi Khushi Kabhi Gham*. One inference we can draw is that the person loves the actor Shahrukh Khan and the director Karan Johar. Even if s/he were to access the romance chart, s/he wouldn't find these as the top recommendations.

To personalise our recommendations further, I am going to build an engine that computes the similarity between movies based on certain metrics and suggests the movies most similar to a particular movie that a user liked. Since we will be using movie metadata (or content) to build this engine, this is also known as **Content Based Filtering.**

I will build two Content Based Recommenders based on:
* Movie Overviews and Taglines
* Movie Cast, Crew, Keywords and Genre

Also, as mentioned in the introduction, I will be using a subset of all the movies available to us because of the limited computing power available to me.

In [0]:
links_small = pd.read_csv('links_small.csv')
md = pd.read_csv('movies_metadata.csv')
links_small = links_small[links_small['tmdbId'].notnull()]['tmdbId'].astype('int')

In [0]:
# Drop three malformed rows whose 'id' cannot be cast to int in the next cell.
md = md.drop([19730, 29503, 35587])

In [0]:
md['id']= md['id'].astype('int')
# .copy() avoids SettingWithCopyWarning when columns are added to smd later.
smd = md[md['id'].isin(links_small)].copy()
smd.shape


In [0]:
smd.head(5)
smd['genres'] = choose_keywords(smd['genres'])
smd['spoken_languages'] = choose_keywords(smd['spoken_languages'])


In [0]:
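smd['tagline'] = smd['tagline'].fillna('')
# Fill missing overviews first (NaN + str stays NaN, which would drop the tagline)
# and separate the two fields with a space.
smd['description'] = smd['overview'].fillna('') + ' ' + smd['tagline']
smd['description'] = smd['description'].fillna('')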

In [0]:
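# min_df=1 is the default (keep every term); recent scikit-learn versions may reject an integer min_df of 0.
tf = TfidfVectorizer(analyzer='word', ngram_range=(1, 2), min_df=1, stop_words='english')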
tfidf_matrix = tf.fit_transform(smd['description'])

#### Cosine Similarity

I will be using cosine similarity to calculate a numeric quantity that denotes the similarity between two movies. Mathematically, it is defined as follows:

$\text{cosine}(x, y) = \frac{x \cdot y^\intercal}{\|x\| \, \|y\|}$

Since the TF-IDF vectorizer L2-normalizes each document vector by default, the dot product directly gives the cosine similarity score. Therefore, we will use sklearn's **linear_kernel** instead of **cosine_similarity**, since it is much faster.
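
This claim can be checked on a toy corpus. The sketch below assumes the default `norm='l2'` setting of `TfidfVectorizer`; the three descriptions are made up for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel

# Invented descriptions, standing in for the movie overviews.
docs = [
    "a mafia family fights for power",
    "a family of toys comes to life",
    "a crime family struggles for power in new york",
]

tfidf = TfidfVectorizer(stop_words="english").fit_transform(docs)

# Each row has unit L2 norm, so the denominator in the cosine formula is 1.
row_norms = np.sqrt(tfidf.multiply(tfidf).sum(axis=1))

dot = linear_kernel(tfidf, tfidf)      # plain dot products
cos = cosine_similarity(tfidf, tfidf)  # normalizes again, then dot products

print(np.allclose(row_norms, 1.0))
print(np.allclose(dot, cos))
```

Both checks should print `True`. `cosine_similarity` repeats a normalization that the vectorizer has already done, which is the extra work `linear_kernel` avoids.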

In [0]:
cosine_sim = linear_kernel(tfidf_matrix, tfidf_matrix)

In [0]:
smd = smd.reset_index()
titles = smd['title']
indices = pd.Series(smd.index, index=smd['title'])


In [0]:
a = (sorted(list(enumerate(cosine_sim[692])), key = lambda x : x[1], reverse= True)[1:11])
titles.iloc[[i[0] for i in a]]

In [0]:
def get_recommendations(title):
  idx = indices[title]
  similarity_scores = list(enumerate(cosine_sim[idx]))
  similarity_scores = sorted(similarity_scores, key = lambda x: x[1], reverse= True)
  similarity_scores = similarity_scores[1:21]
  movies_indices = [i[0] for i in similarity_scores]
  return titles.iloc[movies_indices]
  

In [0]:
get_recommendations('The Godfather').head(10)

### Metadata Based Recommender

To build our standard metadata-based content recommender, we need to merge our current dataset with the credits and keywords datasets. Let us prepare this data as our first step.
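
As a minimal sketch of the merge step, the toy frames below stand in for the metadata and credits tables. The names `movies_toy` and `credits_toy` and all row values are hypothetical. The key column must have the same dtype on both sides, which is why the ids are cast to int first.

```python
import pandas as pd

# Toy stand-ins: ids arrive as strings in one table and as ints in the other.
movies_toy = pd.DataFrame({
    "id": ["1", "2", "3"],
    "title": ["Movie A", "Movie B", "Movie C"],
})
credits_toy = pd.DataFrame({
    "id": [1, 2],
    "cast": ["Actor X", "Actor Y"],
})

# Align the key dtype, then join on 'id'.
movies_toy["id"] = movies_toy["id"].astype("int")
merged_toy = movies_toy.merge(credits_toy, on="id")

print(merged_toy)
```

`merge` performs an inner join by default, so "Movie C", which has no credits row, is dropped from the result.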

In [0]:
import pandas as pd
credits = pd.read_csv('credits.csv')
keywords = pd.read_csv('keywords.csv')


In [6]:
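links_small = pd.read_csv('links_small.csv')
links_small = links_small[links_small['tmdbId'].notnull()]['tmdbId'].astype('int')
md = pd.read_csv('movies_metadata.csv')
md = md.drop([19730, 29503, 35587])  # same malformed rows as before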



def charge_type(data):
  data['id'] = data['id'].astype('int')



In [0]:
keywords['id']= keywords['id'].astype('int')
credits['id'] = credits['id'].astype('int')
md['id'] =  md['id'].astype('int')
#print('keywords', keywords.shape,'credits',credits.shape)

In [9]:
md.shape

(45463, 24)

In [0]:
md = md.merge(credits, on= 'id')

In [0]:
smd = md[md['id'].isin(links_small)]
smd.shape

In [0]:
smd.head(10)
