### Power of Recommendation Engine

Suppose you're planning to buy a laptop but have no idea which configuration is right for you. You would probably ask friends and colleagues for recommendations, and they would suggest laptops based on your requirements, their own knowledge, and what is currently trending. A recommendation engine works the same way. For instance, Amazon recommends laptops based on your previous searches and on popularity, and it keeps showing its best recommendations to tempt you into buying even after you have dropped the plan. All the major companies build recommendations into their products; YouTube, for example, shows recommendations based on your interests and activity.

We'll explore how to implement one. Before that, note that there are two main types of recommendation engine:

1. **Content Based Filtering**
2. **Collaborative Filtering**

#### Content Based Filtering
This algorithm recommends products which are similar to the ones that a user has liked in the past.

#### Collaborative Filtering
The collaborative filtering algorithm uses user behavior, such as the ratings or viewing history of many users, to recommend items.
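
To make the idea of recommending from user behavior concrete, here is a minimal sketch of user-based collaborative filtering. The rating matrix and the helper names (`cosine_sim_matrix`, `most_similar_user`, `recommend_for_user`) are made up for illustration and are not part of this kernel; it simply finds the user with the most similar rating pattern and suggests what that user rated.

```python
import numpy as np

# Hypothetical ratings: rows are users, columns are items, 0 means "not rated".
ratings = np.array([
    [5, 4, 0, 1],
    [4, 5, 4, 1],
    [1, 0, 5, 4],
    [0, 1, 4, 5],
], dtype=float)

def cosine_sim_matrix(m):
    """Pairwise cosine similarity between the rows of m."""
    norms = np.linalg.norm(m, axis=1, keepdims=True)
    normalized = m / np.where(norms == 0, 1.0, norms)
    return normalized @ normalized.T

def most_similar_user(ratings, user):
    """Index of the user whose rating pattern is closest to `user`."""
    sims = cosine_sim_matrix(ratings)[user].copy()
    sims[user] = -1.0  # exclude the user themselves
    return int(np.argmax(sims))

def recommend_for_user(ratings, user):
    """Items the nearest neighbour rated that `user` has not rated, best first."""
    neighbour = most_similar_user(ratings, user)
    unseen = np.where(ratings[user] == 0)[0]
    ranked = sorted(unseen, key=lambda item: ratings[neighbour, item], reverse=True)
    return [int(item) for item in ranked if ratings[neighbour, item] > 0]

print(recommend_for_user(ratings, 0))
```

Unlike content-based filtering, nothing here looks at what the items are; only the pattern of ratings matters.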

*In this kernel, we'll look at an implementation of Content Based Filtering.*

**Our task: when a user searches for a movie, recommend the top 10 most similar movies.**

The implementation is simple. We combine all the features in the given datasets into one bag of keywords per movie, compute the similarity between every pair of movies, and return the most similar ones.
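
The steps just described can be sketched on a tiny, made-up catalogue before applying them to the real data. The `toy` DataFrame, its contents, and the `top_similar` helper are assumptions for illustration only; the actual notebook code works on the TMDB files.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical toy catalogue; the real notebook uses the TMDB 5000 files.
toy = pd.DataFrame({
    "title": ["Space War", "Galaxy Battle", "Love in Paris", "Paris Romance"],
    "genres": ["action scifi", "action scifi", "romance drama", "romance comedy"],
    "overview": ["rebels fight an empire in space",
                 "a fleet defends the galaxy in space",
                 "two strangers fall in love in paris",
                 "a chef finds love in paris"],
})

# Step 1: merge every feature into one keyword string per movie.
toy["keywords"] = toy[["genres", "overview"]].apply(" ".join, axis=1)

# Step 2: vectorise the keyword strings and compare every pair of movies.
toy_matrix = TfidfVectorizer(stop_words="english").fit_transform(toy["keywords"])
toy_sim = cosine_similarity(toy_matrix)

def top_similar(title, k=1):
    """Titles of the k movies most similar to `title`, excluding itself."""
    idx = toy.index[toy["title"] == title][0]
    order = toy_sim[idx].argsort()[::-1]          # most similar first
    order = [i for i in order if i != idx][:k]    # drop the query movie
    return toy.loc[order, "title"].tolist()

print(top_similar("Space War"))
print(top_similar("Love in Paris"))
```

Movies that share genre and overview words end up with a high cosine similarity, which is the whole mechanism behind the recommendations.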

In [None]:
import pandas as pd
import numpy as np

from sklearn.metrics.pairwise import cosine_similarity
from sklearn.feature_extraction.text import TfidfVectorizer

import json
from functools import reduce

In [None]:
credits = pd.read_csv("../input/tmdb-movie-metadata/tmdb_5000_credits.csv")
movies = pd.read_csv("../input/tmdb-movie-metadata/tmdb_5000_movies.csv")

In [None]:
# Dataset shape
print("Credits shape is {}".format(credits.shape))
print("Movies shape is {}".format(movies.shape))


In [None]:
print(credits.columns)
print(movies.columns)

In [None]:
credits.head()

In [None]:
movies.head()

`movie_id` in the credits dataset and `id` in the movies dataset identify the same movie, so we'll use them to merge the two datasets.
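
As a small illustration of how such a merge behaves, the sketch below joins two made-up frames (`movies_toy` and `credits_toy` are hypothetical stand-ins for the real files). It shows two things worth knowing up front: a left join keeps every movie even when no credits row matches, and a column name present in both frames (here `title`) comes back with `_x` and `_y` suffixes.

```python
import pandas as pd

# Hypothetical miniature versions of the two files.
movies_toy = pd.DataFrame({"id": [1, 2, 3], "title": ["A", "B", "C"]})
credits_toy = pd.DataFrame({"movie_id": [1, 2], "title": ["A", "B"],
                            "cast": ["[]", "[]"]})

merged = pd.merge(movies_toy, credits_toy,
                  left_on="id", right_on="movie_id", how="left")

# The shared column name "title" is split into title_x (left) and title_y (right).
print(merged.columns.tolist())

# Movie 3 has no credits row, so its right-hand columns are NaN.
print(merged["cast"].isnull().sum())
```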

In [None]:
final_dataset = pd.merge(movies,credits,left_on='id',right_on='movie_id',how='left')

#### Data Preprocessing

In [None]:
final_dataset.isnull().sum() 

In [None]:
# Drop homepage, release_date and runtime
final_dataset.drop(['homepage','release_date','runtime'],axis=1,inplace=True)

In [None]:
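# Assign the result back instead of calling fillna(inplace=True) on a column selection,
# which newer pandas versions warn about and may not apply to the original frame
final_dataset['overview'] = final_dataset['overview'].fillna('')
final_dataset['tagline'] = final_dataset['tagline'].fillna('')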

In [None]:
final_dataset.isnull().sum() 

In [None]:
# Some columns hold JSON strings (lists of dictionaries). Parse them with json.loads
# and keep only the 'name' values, joined by spaces.

def convertJson(y):
    y = json.loads(y)
    return " ".join([val['name'] for val in y])
final_dataset['genres'] = final_dataset['genres'].apply(convertJson)
final_dataset['keywords'] = final_dataset['keywords'].apply(convertJson)
final_dataset['production_companies'] = final_dataset['production_companies'].apply(convertJson)
final_dataset['production_countries'] = final_dataset['production_countries'].apply(convertJson)

In [None]:
final_dataset.drop(['id','spoken_languages','status','budget','popularity','revenue','vote_average','vote_count','crew'],inplace=True,axis=1)

In [None]:
final_dataset['genres']

In [None]:
# Keep only the first 5 cast entries (character and actor name); this gave better recommendations
def get_cast(y):
    y = json.loads(y)
    return " ".join([val['character']+" "+ val['name'] for val in y[:5]])
final_dataset['cast'] = final_dataset['cast'].apply(get_cast)

In [None]:
columns = ['original_language','original_title','overview',\
              'production_countries','tagline','title_x','title_y','cast']
final_dataset['title'] = final_dataset['title_x']
final_dataset['keywords'] = final_dataset[['keywords','genres','production_companies'] + columns].apply(" ".join,axis=1)
final_dataset.drop(columns,inplace=True,axis=1)

In [None]:
final_dataset.head()

All keywords are in English, but our model understands only numbers, so we'll convert the keywords into a sparse matrix using either CountVectorizer or TfidfVectorizer. CountVectorizer only counts how often each word appears, so frequent words dominate and rare words that could help distinguish movies carry little weight. TfidfVectorizer instead scales each word's frequency by how rare the word is across all movies and normalizes the result, which is why it is usually the recommended choice and the one we'll use.
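
The difference between the two vectorizers is easy to see on a tiny example. The three-document corpus below is made up for illustration: "movie" appears in every document while "heist" appears in only one.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Hypothetical corpus: "movie" is in every document, "heist" in just one.
docs = ["movie hero movie", "movie heist", "movie hero"]

count_vec = CountVectorizer()
counts = count_vec.fit_transform(docs).toarray()

tfidf_vec = TfidfVectorizer()
weights = tfidf_vec.fit_transform(docs).toarray()

# vocabulary_ maps each word to its column index in the matrix.
count_cols = count_vec.vocabulary_
vocab = tfidf_vec.vocabulary_

# Raw counts for document 1 ("movie heist"): both words count once.
print(counts[1, count_cols["movie"]], counts[1, count_cols["heist"]])

# TF-IDF for the same document: the rarer word gets the larger weight.
print(weights[1, vocab["movie"]], weights[1, vocab["heist"]])
```

With raw counts the two words look equally important; with TF-IDF the word that is rare across the corpus stands out.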

In [None]:
# stop_words='english' removes common English words (a, an, the, i, me, my, etc.) that
# inflate the word counts and add noise to our model

c_vect = TfidfVectorizer(stop_words='english')
X = c_vect.fit_transform(final_dataset['keywords'])

In [None]:
# Other similarity/distance metrics are available, such as Euclidean distance, Manhattan distance
# and the Pearson coefficient. Cosine similarity is a common choice for sparse TF-IDF vectors
# because it compares the direction of the vectors rather than their magnitude.
cosine_sim = cosine_similarity(X)

In [None]:
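def get_movie_recommendation(movie_name):
    # Match titles containing movie_name as a literal substring
    # (regex=False so characters such as '(' or '.' are not treated as a pattern)
    idx = final_dataset[final_dataset['title'].str.contains(movie_name, regex=False)].index
    if len(idx):
        # Rank every movie by similarity to the first match; the top entry is
        # normally the movie itself, so skip it and keep the next 10
        scores = sorted(enumerate(cosine_sim[idx[0]]), key=lambda x: x[1], reverse=True)[1:11]
        return [i for i, _ in scores]
    return []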

In [None]:
title = "The Avengers"
recommended_movie_list = get_movie_recommendation(title)
final_dataset.loc[recommended_movie_list,['title','genres']]

In [None]:
final_dataset.loc[[3, 65, 3854]]

In [None]:
title = "The Dark Knight Rises"
recommended_movie_list = get_movie_recommendation(title)
final_dataset.loc[recommended_movie_list,['title','genres']]

Our system returns closely related movies: a list of Marvel movies for The Avengers and a list of DC movies for The Dark Knight Rises.

The major drawback of this approach is that it returns the same list of movies to every user who searches for Avengers, irrespective of their interests and likes. To make predictions based on user behaviour, we need a different algorithm: collaborative filtering.

I'm writing another kernel on collaborative filtering and will update this one once it is completed.

**Please upvote it if you like it. Thanks**