I, [Manasi Pandharkar](https://www.linkedin.com/in/manasi-kulkarni-pandharkar-094784a/), am building an ML-based recommendation engine in collaboration with [Mr. Rocky Jagtiani](https://www.linkedin.com/today/author/rocky-jagtiani-3b390649/).
 
> This is a simple data science project: a movie recommendation system that suggests movies similar to one you name, based on that movie's overview (plot summary) and metadata.

> Dataset: tmdb_5000_credits.csv and tmdb_5000_movies.csv, both from Kaggle

> Tech stack used: Python, pandas, scikit-learn

> Recommended links : 

> https://datascience.suvenconsultants.com  ( For DS / AI / ML )

> https://monster.suvenconsultants.com  ( For Web development )

Recommender systems are among the most popular applications of data science today. They are used to predict the "rating" or "preference" that a user would give to an item. Almost every major tech company has applied them in some form. Amazon uses them to suggest products to customers, YouTube uses them to decide which video to play next on autoplay, and Facebook uses them to recommend pages to like and people to follow.

Recommender systems have also been developed for exploring research articles, experts, collaborators, and financial services.

Recommender systems can be classified into two types:

> **Content-based recommenders**: suggest items similar to a particular item. They use item metadata (for movies: genre, director, description, actors, and so on) to make these recommendations. The underlying idea is that a person who likes a particular item will also like items similar to it, so the system draws on the metadata of items the user liked in the past. YouTube is a good example: based on your history, it suggests new videos you might want to watch.

> **Collaborative filtering engines**: these widely used systems try to predict the rating or preference a user would give an item based on the past ratings and preferences of other users. Unlike their content-based counterparts, collaborative filters do not require item metadata.

Here we are going to implement **content-based filtering**.

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 5GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
# Import Pandas
import pandas as pd

# Loading Data sets
full_url='/kaggle/input/tmdb-movie-metadata/tmdb_5000_credits.csv'

full_url1='/kaggle/input/tmdb-movie-metadata/tmdb_5000_movies.csv'

credits = pd.read_csv(full_url)
movies=pd.read_csv(full_url1)

In [None]:
# Printing 1st 5 elements of credits dataset
credits.head()

In [None]:
# Printing the first 2 rows of the movies dataset
movies.head(2)

In [None]:
# Printing the shapes of both the datasets
print("Credits:",credits.shape)
print("Movies:",movies.shape)

In [None]:
# Renaming movie_id to id so that both datasets share a key column
credits_renamed=credits.rename(columns={'movie_id':'id'})
credits_renamed.head()

In [None]:
# Merging both data sets
merge=movies.merge(credits_renamed,on='id')
merge.head()
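
The `title_x` and `title_y` columns dropped in the next step come from the merge itself: when both frames contain a non-key column with the same name, pandas keeps both and appends its default `_x` / `_y` suffixes. The sketch below shows this on two tiny made-up frames (`movies_toy` and `credits_toy` are hypothetical stand-ins, not the real TMDB data).

```python
import pandas as pd

# Toy stand-ins for the two TMDB tables (all values are made up)
movies_toy = pd.DataFrame({"id": [1, 2], "title": ["A", "B"], "budget": [10, 20]})
credits_toy = pd.DataFrame({"id": [1, 2], "title": ["A", "B"], "cast": ["[]", "[]"]})

# Both frames have a 'title' column, so the merged frame gets title_x and title_y
merged_toy = movies_toy.merge(credits_toy, on="id")
print(list(merged_toy.columns))
```

Since the two title columns carry the same information here, dropping them and relying on `original_title` loses nothing in this toy case.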

In [None]:
# Dropping unnecessary columns 
cleaned=merge.drop(columns=['homepage','title_x','title_y','status','production_countries'])
cleaned.head()

In [None]:
cleaned['overview'].head()

In [None]:
cleaned['overview'].isnull().sum() # checking for null values in the overview column

In [None]:
#Replace NaN with an empty string
cleaned['overview'] = cleaned['overview'].fillna('')


In [None]:
#Import TfIdfVectorizer from scikit-learn
from sklearn.feature_extraction.text import TfidfVectorizer

#Define a TF-IDF vectorizer object. Remove English stop words such as 'the' and 'a',
#use word n-grams of length 1 to 3, and ignore terms that appear in fewer than 3 overviews
tfidf = TfidfVectorizer(stop_words='english',ngram_range=(1,3),min_df=3,analyzer='word')
#Reference: http://www.tfidf.com/

#Construct the required TF-IDF matrix by fitting and transforming the data
tfidf_matrix = tfidf.fit_transform(cleaned['overview'])

#Output the shape of tfidf_matrix
tfidf_matrix.shape

#Ref:https://deepai.org/machine-learning-glossary-and-terms/cosine-similarity

In [None]:
from sklearn.metrics.pairwise import cosine_similarity

# Compute the cosine similarity matrix
cosine_sim = cosine_similarity(tfidf_matrix, tfidf_matrix)

In [None]:
print(cosine_sim.shape)
print(cosine_sim[1])
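
To see what the similarity matrix contains, here is a minimal sketch on three made-up overviews (`toy_overviews` is hypothetical, and the vectorizer uses default settings rather than the n-gram and `min_df` options above). Overviews that share words get a positive score, and overviews with no words in common get zero. Because `TfidfVectorizer` L2-normalises each row by default, the plain dot product (`linear_kernel`) gives the same numbers as `cosine_similarity`.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel

# Three invented overviews: the first two share 'alien' and 'moon'
toy_overviews = [
    "a space marine explores an alien moon",
    "an alien crash lands on a remote moon",
    "a chef opens a small restaurant in paris",
]

toy_tfidf = TfidfVectorizer(stop_words="english")
toy_matrix = toy_tfidf.fit_transform(toy_overviews)

toy_sim = cosine_similarity(toy_matrix)
print(np.round(toy_sim, 2))

# Rows are already unit length, so the dot product equals the cosine similarity
print(np.allclose(toy_sim, linear_kernel(toy_matrix, toy_matrix)))
```

On a larger matrix, `linear_kernel` may be a slightly cheaper way to get the same result, since it skips the normalisation step.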

We are going to define a function that takes a movie title as input and outputs a list of the 10 most similar movies. First, we need a reverse mapping from movie titles to DataFrame indices. In other words, we need a mechanism to find the index of a movie in our metadata DataFrame, given its title.
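
The idea can be sketched on a tiny made-up frame (`toy` and `toy_indices` are hypothetical names). The sketch also handles one case worth keeping in mind: if a title occurs more than once, a label lookup returns several rows instead of one integer, so only the first occurrence of each title is kept.

```python
import pandas as pd

# Made-up titles; 'Alpha' appears twice on purpose
toy = pd.DataFrame({"original_title": ["Alpha", "Beta", "Alpha", "Gamma"]})

# Map title -> row position
toy_indices = pd.Series(toy.index, index=toy["original_title"])

# Keep only the first row for each repeated title
toy_indices = toy_indices[~toy_indices.index.duplicated(keep="first")]

print(toy_indices["Gamma"])  # 3
print(toy_indices["Alpha"])  # 0
```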

In [None]:
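#Construct a reverse map from movie titles to indices
indices = pd.Series(cleaned.index, index=cleaned['original_title'])
#Keep the first row per title; drop_duplicates() would compare the index values, which are already unique
indices = indices[~indices.index.duplicated(keep='first')]

indices[:5]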


Steps:
1. Get the index of the movie given its title
2. Get the cosine similarity scores of that movie with all movies. Convert them into a list of tuples where the first element is the position and the second is the similarity score.
3. Sort the list of tuples by similarity score, i.e. the second element, in descending order.
4. Get the top 10 elements of the list, skipping the first one because it refers to the movie itself.
5. Return the titles corresponding to the indices of the top elements.
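
The steps above can be traced by hand on a small example. The sketch below uses a made-up 4x4 similarity matrix and hypothetical names (`toy_titles`, `toy_sim`, `toy_recommend`). It differs from the steps in one detail: it excludes the query movie by its index rather than by dropping the first sorted element, which stays correct even if another movie happens to tie with a score of 1.0.

```python
import numpy as np

toy_titles = ["Alpha", "Beta", "Gamma", "Delta"]

# Invented symmetric similarity matrix with 1.0 on the diagonal
toy_sim = np.array([
    [1.0, 0.2, 0.7, 0.0],
    [0.2, 1.0, 0.1, 0.5],
    [0.7, 0.1, 1.0, 0.3],
    [0.0, 0.5, 0.3, 1.0],
])

def toy_recommend(title, top_n=2):
    idx = toy_titles.index(title)                                    # step 1
    scores = list(enumerate(toy_sim[idx]))                           # step 2
    scores = sorted(scores, key=lambda pair: pair[1], reverse=True)  # step 3
    top = [i for i, _ in scores if i != idx][:top_n]                 # step 4
    return [toy_titles[i] for i in top]                              # step 5

print(toy_recommend("Alpha"))  # ['Gamma', 'Beta']
```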

In [None]:
def get_recommendations(title, cosine_sim=cosine_sim):
    # Get the index of the movie that matches the title
    idx = indices[title]

    # Get the pairwise similarity scores of all movies with that movie
    sim_scores = list(enumerate(cosine_sim[idx]))
    print(sim_scores[ :5])
    print("--------------------")

    # Sort the movies based on the similarity scores
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)

    # Get the scores of the 10 most similar movies
    sim_scores = sim_scores[1:11]
    print(sim_scores[ :5])
    print("--------------------")
    # Get the movie indices
    movie_indices = [i[0] for i in sim_scores]

    # Return the top 10 most similar movies
    return cleaned['original_title'].iloc[movie_indices]

In [None]:
# Getting the recommendation
get_recommendations('Avatar',cosine_sim)

In [None]:
get_recommendations('The Dark Knight Rises',cosine_sim)

# Enhancement

In [None]:
cleaned.columns

In [None]:
cleaned['crew'].values[0]

In [None]:
## From the new features cast, crew and keywords,
## we need to extract the three most important actors,
## the director and the keywords associated with each movie

## First convert the data into a usable form

## Parse the stringified features into their corresponding Python objects
from ast import literal_eval

features = ['cast', 'crew','keywords','genres']
for feature in features:
    cleaned[feature]= cleaned[feature].apply(literal_eval)
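
As a small illustration of what this parsing step does, the sketch below applies `literal_eval` to one made-up cell value (`raw_cast` is hypothetical and only mimics the shape of the real column). Before parsing, the cell is a single string; afterwards it is a list of dictionaries that can be indexed. `literal_eval` only accepts Python literals, so it does not execute arbitrary code the way `eval` would.

```python
from ast import literal_eval

# One invented cell, shaped like a stringified list of dictionaries
raw_cast = '[{"cast_id": 1, "name": "Actor One"}, {"cast_id": 2, "name": "Actor Two"}]'
print(type(raw_cast).__name__)     # str

parsed_cast = literal_eval(raw_cast)
print(type(parsed_cast).__name__)  # list
print(parsed_cast[0]["name"])      # Actor One
```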



In [None]:
cleaned['crew'].values[0]

In [None]:
##function to get director's name
def get_director(x):
    for i in x:
        if i['job'] == 'Director':
            return i['name']
    return np.nan

Next we will write a function that returns the top 3 elements, or the entire list if it has fewer than 3. Here the list refers to the cast, keywords or genres.

In [None]:
def get_list(x):
    if isinstance(x,list):
        names = [i['name'] for i in x]
        #if more than 3 elements exist, return only the first three
        if len(x) > 3:
            names = names[:3]
        return names
    ## return an empty list in case of missing or malformed data
    return []

In [None]:
cleaned['director']= cleaned['crew'].apply(get_director)

features= ['cast','keywords','genres']
for feature in features:
    cleaned[feature]=cleaned[feature].apply(get_list)

In [None]:
#Print the features of the first three movies
cleaned[['original_title','cast','director','keywords','genres']].head(3)

In [None]:
## function to convert all strings to lowercase and strip names of spaces
def cleaned_data(x):
    if isinstance(x,list):
        return [str.lower(i.replace(" ","")) for i in x]
    else:
        if isinstance(x,str):
            return str.lower(x.replace(" ",""))
        else:
            return ''
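
One way to see why the spaces are removed: a word-level vectorizer splits on whitespace, so two different people who share a first name would otherwise share a token and look partly similar. The sketch below uses two invented names (`raw_names` is hypothetical) and compares the vocabulary with and without the cleaning.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two invented people who share a first name
raw_names = ["Jane Smith", "Jane Doe"]
joined_names = [name.replace(" ", "").lower() for name in raw_names]

# Vocabulary learned from the raw names vs. the cleaned names
vocab_raw = sorted(CountVectorizer().fit(raw_names).vocabulary_)
vocab_joined = sorted(CountVectorizer().fit(joined_names).vocabulary_)

print(vocab_raw)     # ['doe', 'jane', 'smith']
print(vocab_joined)  # ['janedoe', 'janesmith']
```

With the raw names, both people contribute the token `jane`; after cleaning, each person maps to exactly one distinct token.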

In [None]:
## Apply the cleaned_data function to your features
features = ['cast','keywords','director','genres']
for feature in features:
    cleaned[feature]=cleaned[feature].apply(cleaned_data)
    

In [None]:
def create_metadata(x):
    # 'director' is a single string here, so append it as-is; joining it would split it into characters
    return ' '.join(x['keywords']) + ' ' + ' '.join(x['cast']) + ' ' + x['director'] + ' ' + ' '.join(x['genres'])

In [None]:
## create a new metadata feature
cleaned['metadata']= cleaned.apply(create_metadata, axis=1)

In [None]:
cleaned[['metadata']].head(2)


In [None]:
## import the CountVectorizer and create the count matrix

from sklearn.feature_extraction.text import CountVectorizer
count = CountVectorizer(stop_words ='english')
count_matrix = count.fit_transform(cleaned['metadata'])



In [None]:
count_matrix.shape

In [None]:
#Compute the cosine similarity matrix based on the count_matrix
from sklearn.metrics.pairwise import cosine_similarity
cosine_sim2 = cosine_similarity(count_matrix, count_matrix)
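
A minimal sketch of how the count-based similarity behaves, using three invented metadata strings (`toy_soups` and all tokens in it are hypothetical). Each token counts once, so the score depends only on how many tokens two movies share relative to how many each has. A plausible reason to prefer raw counts over TF-IDF here, although the notebook does not state one, is that TF-IDF would down-weight an actor or director simply for appearing in many movies.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Invented metadata strings: keywords, cast, director, genres
toy_soups = [
    "heist janedoe johnroe action thriller",
    "heist janedoe action",
    "romance marymajor comedy",
]

toy_counts = CountVectorizer().fit_transform(toy_soups)
toy_sim2 = cosine_similarity(toy_counts)

# Movies 0 and 1 share 3 tokens: 3 / (sqrt(5) * sqrt(3))
print(round(toy_sim2[0, 1], 4))
# Movies 0 and 2 share nothing
print(toy_sim2[0, 2])
```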

In [None]:
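## Reset the index of the main DataFrame and construct the reverse mapping as before
cleaned = cleaned.reset_index(drop=True)
indices = pd.Series(cleaned.index, index = cleaned['original_title'])
indices = indices[~indices.index.duplicated(keep='first')]
indices[:2]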

In [None]:
## You can now reuse the get_recommendations() function by passing in the new cosine_sim2 matrix as the second argument
get_recommendations('The Dark Knight Rises',cosine_sim2)

In [None]:
get_recommendations('The Godfather',cosine_sim2)

I would like to humbly and sincerely thank my mentor [Rocky Jagtiani](https://www.linkedin.com/today/author/rocky-jagtiani-3b390649/). He is more of a friend to me than a mentor. The Machine Learning course he teaches, and the various projects we have done and are still doing, are the best way to learn and build skills in the data science field. See https://datascience.suvenconsultants.com for more.