Recommender systems usually follow one of two approaches: **collaborative filtering** and **content-based filtering**.

To illustrate the difference between the two approaches, here is an example: a **Netflix show recommendation system**.

Suppose Todd loves watching action movies. With content-based filtering, the system recommends more action movies that Todd hasn't watched. With collaborative filtering, the system looks at other users instead: if Desmond watched the same action movie as Todd plus a horror movie, the system recommends that horror movie to Todd.
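The Todd and Desmond example can be written as a tiny toy program. This is only a sketch: the movie names, the `watched` and `genres` dictionaries, and both helper functions are hypothetical and exist purely for illustration.

```python
# Toy data (hypothetical): what each user watched, and each movie's genre.
watched = {
    "Todd": {"Action A"},
    "Desmond": {"Action A", "Horror B"},
}
genres = {"Action A": "action", "Action C": "action", "Horror B": "horror"}


def content_based(user):
    """Recommend unseen movies that share a genre with something the user watched."""
    liked_genres = {genres[m] for m in watched[user]}
    return sorted(
        m for m, g in genres.items()
        if g in liked_genres and m not in watched[user]
    )


def collaborative(user):
    """Recommend unseen movies watched by users whose history overlaps with this user's."""
    recs = set()
    for other, seen in watched.items():
        if other != user and seen & watched[user]:
            recs |= seen - watched[user]
    return sorted(recs)


print(content_based("Todd"))   # ['Action C']
print(collaborative("Todd"))   # ['Horror B']
```

The content-based helper only needs Todd's own history and the movie attributes, while the collaborative helper needs other users' histories.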

Collaborative filtering has a limitation: it **requires a user community and a sufficient amount of interaction data**. When that data is not available, content-based filtering is the more practical choice.

In this notebook, I will showcase how to use content-based filtering to build a recommender system on the TMDB Movie Dataset.

### Import Libraries

In [None]:
import os
os.getcwd()

In [None]:
import pandas as pd
import numpy as np

# Load the credits and movies tables
credits = pd.read_csv('../input/tmdb-movie-metadata/tmdb_5000_credits.csv')
movies = pd.read_csv('../input/tmdb-movie-metadata/tmdb_5000_movies.csv')

In [None]:
credits

In [None]:
# Rename the shared key column so the two datasets can be merged later
movies.rename(columns={'id':'movie_id'},inplace=True)

In [None]:
movies

In [None]:
# Merge two datasets
movies_combine = pd.merge(movies,credits,on=['movie_id','title'])
movies_combine

In [None]:
movies_combine.info()

The `homepage` and `tagline` columns have a relatively high share of missing values, so we drop both.

In [None]:
movies_combine.drop(['homepage','tagline'],axis=1,inplace=True)
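One way to quantify "relatively high" is to compute the share of missing values per column. The sketch below uses a small made-up frame (`toy`) in place of `movies_combine`, and the 0.4 threshold is an arbitrary assumption, not a value taken from the dataset.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for movies_combine (all values are made up).
toy = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "homepage": [np.nan, "http://example.com", np.nan, np.nan],
    "tagline": [np.nan, np.nan, "A tagline", "Another tagline"],
})

# Fraction of missing values per column, largest first.
missing_share = toy.isna().mean().sort_values(ascending=False)
print(missing_share)

# Drop every column whose missing share exceeds the chosen threshold.
to_drop = missing_share[missing_share > 0.4].index.tolist()
toy_clean = toy.drop(columns=to_drop)
print(toy_clean.columns.tolist())
```

On the real data, the same `isna().mean()` call on `movies_combine` would show how sparse each column is before deciding what to drop.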

### Content Based Filtering

Using the descriptions in the `overview` column, we extract important keywords as features. We then compare movies with cosine similarity: the more weighted keywords a movie shares with the target, the higher its similarity score.

In [None]:
movies_combine.info()

In [None]:
movies_combine['overview'] = movies_combine['overview'].fillna('')

To extract the important keywords, we use TF-IDF (term frequency-inverse document frequency), a weighting scheme frequently used in NLP.

In [None]:
# Import TfidfVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer(stop_words='english')
tfidf_matrix = tfidf.fit_transform(movies_combine['overview'])

In [None]:
print(tfidf_matrix)

We use cosine similarity to compare movies. The higher the score, the more similar a movie is to the target.

In [None]:
from sklearn.metrics.pairwise import cosine_similarity
cos_sim = cosine_similarity(tfidf_matrix,tfidf_matrix)
cos_sim

In [None]:
cos_sim.shape
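Cosine similarity is the dot product of two vectors divided by the product of their lengths. The sketch below checks this by hand on made-up overviews (the `docs` list is hypothetical) and compares it with `cosine_similarity`. Because `TfidfVectorizer` L2-normalises each row by default, `linear_kernel` (a plain dot product) should give the same matrix here.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel

# Made-up overviews; the first two share several keywords, the third shares none.
docs = [
    "a marine fights aliens on a distant moon",
    "aliens invade earth and a marine fights back",
    "a chef opens a restaurant in paris",
]
X = TfidfVectorizer(stop_words="english").fit_transform(docs)

# Cosine similarity between the first two documents, computed by hand.
a = X[0].toarray().ravel()
b = X[1].toarray().ravel()
manual = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

sim = cosine_similarity(X)
print(manual, sim[0, 1])

# Rows are already unit length, so the dot product alone matches.
print(np.allclose(sim, linear_kernel(X)))
```

The resulting matrix is square, with one row and one column per document, and each document has similarity 1 with itself.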

In [None]:
movies_combine.index

In [None]:
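# Map each movie title to its row index, keeping the first row when a title repeats
# (Series.drop_duplicates would compare the row indices, which are always unique)
movies_index = pd.Series(movies_combine.index, index=movies_combine['title'])
movies_index = movies_index[~movies_index.index.duplicated(keep='first')]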
movies_index
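A title can appear more than once in a movie table (remakes, for example), so a lookup by title may match several rows. The sketch below, on a made-up three-row frame, shows that `Series.drop_duplicates` compares the values (the row indices) rather than the index labels, and that `Index.duplicated` is one way to keep the first row per title.

```python
import pandas as pd

# Toy frame with a repeated title (hypothetical data).
toy = pd.DataFrame({"title": ["Alpha", "Beta", "Alpha"]})
idx = pd.Series(toy.index, index=toy["title"])

# The values 0, 1, 2 are all unique, so nothing is removed.
print(len(idx.drop_duplicates()))  # 3

# Keep only the first row index for each title.
first_idx = idx[~idx.index.duplicated(keep="first")]
print(first_idx["Alpha"])  # 0
```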

### Here are the steps for building a content-based recommender system

#### 1. Input the target movie for recommendation

#### 2. Find the corresponding movie index in the dataset

#### 3. Find the cosine similarity scores for that movie index

#### 4. Sort the score list in descending order

#### 5. Retrieve the indices with the top n similarity scores

#### 6. Map the indices back to the dataset to look up the movie titles

In [None]:
## 1. Input the target movie for recommendation
title = 'Avatar'
## 2. Find the corresponding movie index in the dataset
target_index = movies_index[movies_index.index==title].values[0]


In [None]:
## 3. Find the cosine similarity scores for that movie index
cos_sim[target_index]

In [None]:
# Pair each similarity score with its row index by using enumerate()
a = list(enumerate(cos_sim[target_index]))


In [None]:
## 4. Sort the score list in descending order
sort_index = sorted(a,key=lambda x:x[1],reverse=True)
# Skip position 0 (normally the target movie itself) and show the next 10 entries
sort_index[1:11]

In [None]:
## 5. Retrieve the indices with the top 10 similarity scores
## 6. Map the indices back to the dataset to look up the movie titles
for i in sort_index[1:11]:
    # i[0] is the row position of the movie, i[1] is its similarity score
    recommend_movie = movies_combine['title'].iloc[i[0]]
    print(recommend_movie)

### Create a function for steps 1 to 6

In [None]:
def content_based_recommend(title, n):
    """Return the titles of the n movies most similar to `title`."""
    output_list = []
    target_index = movies_index[movies_index.index == title].values[0]
    # Pair each movie's row position with its similarity to the target
    target_cos_list = list(enumerate(cos_sim[target_index]))
    sort_list = sorted(target_cos_list, key=lambda x: x[1], reverse=True)
    # Skip position 0, which is normally the target movie itself
    for i in sort_list[1:n+1]:
        recommend_movie = movies_combine['title'].iloc[i[0]]
        output_list.append(recommend_movie)
    return output_list

In [None]:
content_based_recommend('Avatar',10)
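As a possible variation, the sorting step can be done with `np.argsort` instead of sorting a Python list of tuples. The sketch below is self-contained and runs on a made-up four-movie frame. The `toy` data and the `recommend` function are hypothetical and not part of the notebook above. It also raises a `KeyError` for unknown titles and removes the target by position rather than assuming it is always ranked first.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

# Made-up movies standing in for movies_combine.
toy = pd.DataFrame({
    "title": ["Space War", "Alien Dawn", "Paris Kitchen", "Moon Base"],
    "overview": [
        "a marine fights aliens on a distant moon",
        "aliens invade earth and a marine fights back",
        "a chef opens a restaurant in paris",
        "scientists build a base on the moon",
    ],
})

# TF-IDF rows are L2-normalised by default, so the dot product equals cosine similarity.
tfidf_toy = TfidfVectorizer(stop_words="english").fit_transform(toy["overview"])
sim = linear_kernel(tfidf_toy)


def recommend(title, n, titles=toy["title"], sim=sim):
    """Return up to n titles most similar to `title`, excluding the title itself."""
    matches = np.flatnonzero(titles.to_numpy() == title)
    if matches.size == 0:
        raise KeyError(f"{title!r} not found in the dataset")
    target = matches[0]
    # Row positions ordered from most to least similar.
    order = np.argsort(-sim[target], kind="stable")
    # Remove the target explicitly instead of assuming it is ranked first.
    order = order[order != target][:n]
    return titles.iloc[order].tolist()


print(recommend("Space War", 2))
```

Dropping the target explicitly matters when a movie has an empty overview: its TF-IDF row is all zeros, so its similarity with itself is 0 and it is not guaranteed to sort first.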