# Movie Recommendation System

I used the TMDB movie metadata dataset (https://www.kaggle.com/tmdb/tmdb-movie-metadata) to build the recommendation systems below.

There are three types of recommender systems:
 
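1. Demographic Filtering - It uses the demographic data of a user to determine which items may be appropriate for recommendation. In this notebook it takes the simple form of recommending the same popular, highly rated movies to every user.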

2. Content Based Filtering - It uses item features to recommend other items similar to what the user likes, based on their previous actions or explicit feedback.

3. Collaborative Filtering - This system matches people with similar interests and provides recommendations based on this matching. Unlike their content-based counterparts, collaborative filters do not require item metadata.

## Importing libraries and loading data

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))
        
import matplotlib.pyplot as plt
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel
from ast import literal_eval

In [None]:
d1=pd.read_csv('/kaggle/input/tmdb-movie-metadata/tmdb_5000_credits.csv')
d2=pd.read_csv('/kaggle/input/tmdb-movie-metadata/tmdb_5000_movies.csv')

### Merging d1 & d2

In [None]:
d1.columns = ['id','title','cast','crew']
d2 = d2.merge(d1,on='id')
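
Both tables contain a `title` column, so after the merge pandas keeps both copies and adds its default suffixes, which is why the rest of the notebook refers to `title_x`. A minimal sketch on made-up rows (the `*_toy` names and their contents are hypothetical, not taken from the dataset):

```python
import pandas as pd

# Toy stand-ins for the credits and movies tables (hypothetical rows)
credits_toy = pd.DataFrame({
    "id": [1, 2],
    "title": ["Movie A", "Movie B"],
    "cast": ["[]", "[]"],
    "crew": ["[]", "[]"],
})
movies_toy = pd.DataFrame({
    "id": [1, 2],
    "title": ["Movie A", "Movie B"],
    "overview": ["first plot", "second plot"],
})

# Same call shape as in the notebook: left table is movies, right table is credits
merged_toy = movies_toy.merge(credits_toy, on="id")

# The overlapping 'title' column is split into 'title_x' (left) and 'title_y' (right)
print(merged_toy.columns.tolist())
```

Here `title_x` comes from the left (movies) table and `title_y` from the right (credits) table.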

In [None]:
d2.head()

## Demographic Filtering

Before getting started, we need to:

1. Choose a metric to score or rate each movie.
2. Calculate the score for every movie.
3. Sort the scores and recommend the best-rated movies to the users.

We could use each movie's average rating as the score, but that would not be fair: a movie with an 8.9 average rating from only 3 votes cannot be considered better than a movie with a 7.8 average rating from 40 votes. So I'll use IMDB's weighted rating (wr), which is given as:

![alt text](https://github.com/wandererabir/Movie-Recommendation-System/raw/9b1c1ef209e7731744f50563e9be656d146e6cbf/wr.png)

where,

1. v is the number of votes for the movie;
2. m is the minimum votes required to be listed in the chart;
3. R is the average rating of the movie; and
4. C is the mean vote across the whole report.
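
In case the image does not render, the weighted rating can be written out as below. This transcription follows the expression returned by `w_rating()` in this notebook, using the same symbols as the list above:

$$
\mathrm{WR} = \frac{v}{v+m}\,R + \frac{m}{v+m}\,C
$$

The two weights sum to 1, so the score is a blend of the movie's own average and the global mean: with few votes ($v \ll m$) it stays close to $C$, and with many votes ($v \gg m$) it approaches $R$.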

We already have v (`vote_count`) and R (`vote_average`); C can be calculated as:

In [None]:
C= d2['vote_average'].mean()
C

### Calculating m: we use the 90th percentile as the cutoff. To qualify, a movie must have more votes than at least 90% of the movies in the list.

In [None]:
m= d2['vote_count'].quantile(0.9)
m

### Qualified movies

In [None]:
q_movies = d2.copy().loc[d2['vote_count'] >= m]
q_movies.head(2)

In [None]:
q_movies.shape

### Calculate our metric for each qualified movie using function w_rating

In [None]:
def w_rating(x, m=m, C=C):
    v = x['vote_count']
    R = x['vote_average']
    # Calculation based on the IMDB formula
    return (v/(v+m) * R) + (m/(m+v) * C)
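
To see how the formula treats the two movies from the earlier example (8.9 from 3 votes versus 7.8 from 40 votes), here is a small sketch. The function name `weighted_rating` and the values `m_demo` and `C_demo` are assumptions chosen for illustration only, not values computed from the dataset:

```python
def weighted_rating(v, R, m, C):
    """IMDB-style weighted rating: blend a movie's own average R with the global mean C."""
    return (v / (v + m)) * R + (m / (v + m)) * C

# Hypothetical values, chosen only for illustration
m_demo = 25    # assumed vote threshold
C_demo = 6.0   # assumed mean rating across all movies

few_votes = weighted_rating(v=3, R=8.9, m=m_demo, C=C_demo)    # high average, only 3 votes
many_votes = weighted_rating(v=40, R=7.8, m=m_demo, C=C_demo)  # lower average, 40 votes

print(round(few_votes, 2), round(many_votes, 2))
```

With these assumed values the 3-vote movie scores about 6.31 and the 40-vote movie about 7.11, so the better-supported rating ranks higher. In the notebook itself a movie with fewer than `m` votes would already be filtered out of `q_movies`; it is kept here only to show the shrinkage effect.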

In [None]:
# Define a new feature 'score' and calculate its value with `w_rating()`
q_movies['score'] = q_movies.apply(w_rating, axis=1)

In [None]:
#Sort movies based on score calculated above
q_movies = q_movies.sort_values('score', ascending=False)

#Print the top 10 movies
q_movies[['title_x', 'vote_count', 'vote_average', 'score']].head(10)

### Sorting by popularity gives something like the 'Trending Now' tab of a streaming app

In [None]:
pop= d2.sort_values('popularity', ascending=False)

plt.figure(figsize=(12,4))
plt.barh(pop['title_x'].head(6),pop['popularity'].head(6), align='center',color='blue')
plt.gca().invert_yaxis()
plt.xlabel("Popularity")
plt.title("Popular Movies")

## Content Based Filtering

### 1. Plot description based Recommender

We are going to build a plot-description-based recommender that ranks all movies by the similarity of their overview to that of a given movie.

In [None]:
d2['overview'].head()

### TfIdfVectorizer

We'll use Term Frequency-Inverse Document Frequency (TF-IDF) vectors for each overview.

This is a very common algorithm for transforming text into a meaningful numeric representation that can be fed to a machine learning algorithm for prediction.

![TfIdfVectorizer](https://github.com/wandererabir/Movie-Recommendation-System/raw/9b1c1ef209e7731744f50563e9be656d146e6cbf/tfidf.png)
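
As a rough illustration of what the vectorizer produces, the sketch below fits it on three made-up overviews (hypothetical text, not from the dataset). It shows that a word shared by several documents gets a lower inverse document frequency than a word that appears in only one, and that each row is scaled to unit length under the default settings:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Three made-up overviews (hypothetical, not from the dataset)
docs_toy = [
    "a hacker discovers the world is a simulation",
    "a detective hunts a hacker in a future city",
    "two friends open a bakery in a small town",
]

vectorizer_toy = TfidfVectorizer(stop_words="english")
matrix_toy = vectorizer_toy.fit_transform(docs_toy)  # sparse matrix, one row per document

# 'hacker' appears in two documents, 'simulation' in one, so 'hacker' gets the lower IDF
idf_hacker = vectorizer_toy.idf_[vectorizer_toy.vocabulary_["hacker"]]
idf_simulation = vectorizer_toy.idf_[vectorizer_toy.vocabulary_["simulation"]]
print(idf_hacker < idf_simulation)

# With the default norm='l2', every non-empty row has Euclidean length 1
row_norms = np.linalg.norm(matrix_toy.toarray(), axis=1)
print(matrix_toy.shape[0], row_norms)
```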

In [None]:
#Define a TF-IDF Vectorizer Object. Remove all english stop words such as 'the', 'a'
tfidf = TfidfVectorizer(stop_words='english')

#Replace NaN with an empty string
d2['overview'] = d2['overview'].fillna('')

#Construct the required TF-IDF matrix by fitting and transforming the data
tfidf_matrix = tfidf.fit_transform(d2['overview'])

#Output the shape of tfidf_matrix
tfidf_matrix.shape

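### Using linear_kernel()

We will use cosine similarity to calculate a numeric quantity that denotes the similarity between two movies. We use the cosine similarity score since it is independent of magnitude and is relatively easy and fast to calculate. Because TfidfVectorizer L2-normalizes each row by default, the plain dot product computed by linear_kernel() already equals the cosine similarity and avoids a redundant normalization step. Mathematically, it is defined as follows: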

![cosine similarity](https://github.com/wandererabir/Movie-Recommendation-System/raw/9b1c1ef209e7731744f50563e9be656d146e6cbf/simi.png)

In [None]:
# Compute the cosine similarity matrix
cosine_sim = linear_kernel(tfidf_matrix, tfidf_matrix)
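
A quick way to check that `linear_kernel` really yields cosine similarities on TF-IDF vectors is to compare it with `cosine_similarity` on a toy corpus. This is a sketch on made-up sentences (the `*_toy` names and the text are hypothetical), assuming the vectorizer's default L2 normalization is left on:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel

# Made-up overviews (hypothetical, not from the dataset)
overviews_toy = [
    "a hacker discovers the world is a simulation",
    "a detective hunts a hacker in a future city",
    "two friends open a bakery in a small town",
]

tfidf_toy = TfidfVectorizer(stop_words="english").fit_transform(overviews_toy)

dot_toy = linear_kernel(tfidf_toy, tfidf_toy)      # plain dot products
cos_toy = cosine_similarity(tfidf_toy, tfidf_toy)  # dot products of normalized rows

# Rows already have unit length, so the two matrices agree
print(np.allclose(dot_toy, cos_toy))
```

If the vectorizer were created with `norm=None`, the two results would generally differ and `cosine_similarity` would be the one to use.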

### Function that takes a movie title as input and outputs a list of the 10 most similar movies

In [None]:
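#Construct a reverse map from movie titles to row indices, keeping the first row per title
#(Series.drop_duplicates() compares the values, which are already unique here, not the titles)
indices = pd.Series(d2.index, index=d2['title_x'])
indices = indices[~indices.index.duplicated(keep='first')]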

In [None]:
# Function that takes in movie title as input and outputs most similar movies
def get_recommendations(title, cosine_sim=cosine_sim):
    # Get the index of the movie that matches the title
    idx = indices[title]

    # Get the pairwise similarity scores of all movies with that movie
    sim_scores = list(enumerate(cosine_sim[idx]))

    # Sort the movies based on the similarity scores
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)

    # Get the scores of the 10 most similar movies
    sim_scores = sim_scores[1:11]

    # Get the movie indices
    movie_indices = [i[0] for i in sim_scores]

    # Return the top 10 most similar movies
    return d2['title_x'].iloc[movie_indices]
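
The slice `sim_scores[1:11]` assumes that the queried movie always sorts first. That may not hold when another movie has an identical vector, or when the movie's overview is empty and its similarity with itself is 0. One possible alternative, sketched here on a small hand-written similarity matrix, is to exclude the query index explicitly and rank with `np.argsort`. The helper name `top_k_similar` and the matrix `sim_toy` are hypothetical:

```python
import numpy as np

def top_k_similar(sim_matrix, idx, k=10):
    """Return the indices of the k rows most similar to row idx, excluding idx itself."""
    scores = np.asarray(sim_matrix[idx], dtype=float).copy()
    scores[idx] = -np.inf          # never recommend the query item itself
    order = np.argsort(-scores)    # indices sorted by descending similarity
    return order[:k].tolist()

# Hand-written symmetric similarity matrix for four items (hypothetical values)
sim_toy = np.array([
    [1.0, 0.2, 0.9, 0.1],
    [0.2, 1.0, 0.3, 0.8],
    [0.9, 0.3, 1.0, 0.4],
    [0.1, 0.8, 0.4, 1.0],
])

print(top_k_similar(sim_toy, 0, k=2))  # items 2 and 1 are closest to item 0
```

The returned positions could then be passed to `d2['title_x'].iloc[...]` in the same way as `movie_indices` above.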

In [None]:
get_recommendations('The Godfather')

In [None]:
get_recommendations('Inception')

### 2. Credits, Genres and Keywords Based Recommender

We are going to build a recommender based on the following metadata: the top 3 actors, the director, related genres and the movie plot keywords. Using this richer data should improve the quality of the recommendations.

In [None]:
features = ['cast', 'crew', 'keywords', 'genres']
for feature in features:
    d2[feature] = d2[feature].apply(literal_eval)
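
These columns are read from the CSV as strings that merely look like lists of dictionaries, which is why they are passed through `literal_eval` first. A minimal sketch of what that call does, using a sample string written for illustration (`raw_genres` is hypothetical, shaped like a `genres` cell):

```python
from ast import literal_eval

# A cell as it arrives from the CSV: one long string (illustrative sample)
raw_genres = '[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}]'

parsed_genres = literal_eval(raw_genres)  # safely evaluates the literal into Python objects

print(type(raw_genres).__name__, "->", type(parsed_genres).__name__)
print([g["name"] for g in parsed_genres])
```

Unlike `eval`, `literal_eval` only accepts Python literals (strings, numbers, lists, dicts and so on), so it does not execute arbitrary code found in the data.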

In [None]:
# Get the director's name from the crew feature. If director is not listed, return NaN
def get_director(x):
    for i in x:
        if i['job'] == 'Director':
            return i['name']
    return np.nan

In [None]:
# Returns the first 3 elements of the list, or the entire list if it has 3 or fewer.
def get_list(x):
    if isinstance(x, list):
        names = [i['name'] for i in x]
        #Check if more than 3 elements exist. If yes, return only first three. If no, return entire list.
        if len(names) > 3:
            names = names[:3]
        return names

    #Return empty list in case of missing/malformed data
    return []

In [None]:
# Define new director, cast, genres and keywords features that are in a suitable form.
d2['director'] = d2['crew'].apply(get_director)

features = ['cast', 'keywords', 'genres']
for feature in features:
    d2[feature] = d2[feature].apply(get_list)

In [None]:
d2[['title_x', 'cast', 'director', 'keywords', 'genres']].head(2)

In [None]:
# Function to convert all strings to lower case and strip names of spaces
def clean_data(x):
    if isinstance(x, list):
        return [str.lower(i.replace(" ", "")) for i in x]
    else:
        #Check if director exists. If not, return empty string
        if isinstance(x, str):
            return str.lower(x.replace(" ", ""))
        else:
            return ''

In [None]:
# Apply clean_data function to your features.
features = ['cast', 'keywords', 'director', 'genres']

for feature in features:
    d2[feature] = d2[feature].apply(clean_data)

We are now in a position to create our "metadata soup", which is a string that contains all the metadata that we want to feed to our vectorizer.

In [None]:
def create_soup(x):
    return ' '.join(x['keywords']) + ' ' + ' '.join(x['cast']) + ' ' + x['director'] + ' ' + ' '.join(x['genres'])
d2['soup'] = d2.apply(create_soup, axis=1)

### Using CountVectorizer

It is used to transform a given text into a vector on the basis of the frequency (count) of each word that occurs in the entire text.
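
A small sketch of the counting behaviour on two made-up soups (the strings and the `*_toy` names are hypothetical):

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two made-up metadata soups (hypothetical)
soups_toy = [
    "action hero action",
    "hero comedy",
]

count_toy = CountVectorizer()
counts_toy = count_toy.fit_transform(soups_toy)  # sparse matrix of raw token counts

action_col = count_toy.vocabulary_["action"]     # column index assigned to 'action'
print(counts_toy.shape)                          # (documents, distinct tokens)
print(counts_toy[0, action_col], counts_toy[1, action_col])
```

One plausible reason to prefer raw counts over TF-IDF for the soup is that TF-IDF would down-weight an actor or director simply for appearing in many movies, which is probably not what we want here.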

In [None]:
# Import CountVectorizer and create the count matrix
from sklearn.feature_extraction.text import CountVectorizer

count = CountVectorizer(stop_words='english')
count_matrix = count.fit_transform(d2['soup'])

In [None]:
# Compute the Cosine Similarity matrix based on the count_matrix
from sklearn.metrics.pairwise import cosine_similarity

cosine_sim2 = cosine_similarity(count_matrix, count_matrix)

In [None]:
# Reset index of our main DataFrame and construct reverse mapping as before
d2 = d2.reset_index()
indices = pd.Series(d2.index, index=d2['title_x'])

In [None]:
get_recommendations('Batman Begins', cosine_sim2)

In [None]:
get_recommendations('The Avengers', cosine_sim2)

This recommendation system appears to give better recommendations than the plot-based one because it uses more of the available metadata.

Fans of Marvel or DC Comics are likely to enjoy other movies from the same production house. Therefore, we can add the production company (the `production_companies` column) to the features above. We can also increase the weight of the director by adding that feature multiple times to the soup.
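
Both ideas could be sketched roughly as follows. The function name `create_weighted_soup`, the `director_weight` parameter and the example row are hypothetical, and the sketch assumes `production_companies` has already been parsed and cleaned into a list of lower-case tokens in the same way as `genres`:

```python
def create_weighted_soup(row, director_weight=3):
    """Build the metadata soup, repeating the director token director_weight times.

    Assumes every field is already cleaned: lists of lower-case tokens without
    spaces, and a single string (possibly empty) for the director.
    """
    parts = (
        row["keywords"]
        + row["cast"]
        + [row["director"]] * director_weight   # repetition raises the director's count
        + row["genres"]
        + row.get("production_companies", [])   # optional extra feature (assumed cleaned)
    )
    return " ".join(p for p in parts if p)       # skip empty strings such as a missing director

# Made-up, already cleaned row for illustration
row_toy = {
    "keywords": ["vigilante", "gotham"],
    "cast": ["actorone", "actortwo"],
    "director": "somedirector",
    "genres": ["action", "crime"],
    "production_companies": ["somestudio"],
}

soup_toy = create_weighted_soup(row_toy)
print(soup_toy)
```

Because `CountVectorizer` counts tokens, the repeated director token contributes three times as much to the similarity as a single keyword. The function also works with `d2.apply(create_weighted_soup, axis=1)`, since a pandas row supports `.get` as well.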

## Collaborative Filtering

### User based filtering

These systems recommend products to a user that similar users have liked. To measure the similarity between two users we can use either Pearson correlation or cosine similarity.
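
The notebook does not implement this, but the idea can be sketched on a tiny hand-written rating matrix. All ratings and names below are hypothetical, and treating 0 as "not rated" inside the cosine computation is a simplification:

```python
import numpy as np

# Rows = users, columns = movies; 0 means "not rated" (hypothetical ratings)
ratings_toy = np.array([
    [5, 4, 0, 1],   # user 0 (target user)
    [4, 5, 5, 0],   # user 1
    [1, 0, 5, 4],   # user 2
], dtype=float)

def cosine(u, v):
    """Cosine similarity between two rating vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

target = 0
# Similarity of the target user to every other user (the target itself is excluded)
sims = [
    cosine(ratings_toy[target], ratings_toy[u]) if u != target else -1.0
    for u in range(len(ratings_toy))
]
nearest = int(np.argmax(sims))

# Candidate recommendations: movies the nearest user rated but the target has not
unseen = np.where((ratings_toy[target] == 0) & (ratings_toy[nearest] > 0))[0]
print(nearest, unseen.tolist())
```

Here user 1 is the closest match to user 0, so movie 2 (rated by user 1, unseen by user 0) becomes the candidate. A fuller version would average over several neighbours and weight their ratings by similarity.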

### Item Based Collaborative Filtering 

Instead of measuring the similarity between users, item-based CF recommends items based on their similarity to the items that the target user has rated. Here too, the similarity can be computed with Pearson correlation or cosine similarity.
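
A comparable sketch for the item-based variant, this time using Pearson correlation between the columns of a small, fully rated toy matrix (all values and names are hypothetical):

```python
import numpy as np

# Rows = users, columns = movies; every user rated every movie (hypothetical ratings)
item_ratings_toy = np.array([
    [5, 4, 1, 1],
    [4, 5, 2, 1],
    [1, 2, 5, 4],
    [2, 1, 4, 5],
], dtype=float)

# np.corrcoef treats each row as a variable, so transpose to correlate the movies
item_corr = np.corrcoef(item_ratings_toy.T)

liked_item = 0                      # a movie the target user rated highly
scores = item_corr[liked_item].copy()
scores[liked_item] = -np.inf        # exclude the movie itself
most_similar = int(np.argmax(scores))

print(most_similar, round(float(item_corr[liked_item, most_similar]), 2))
```

Movie 1 is rated in much the same pattern as movie 0 across users, so it comes out as the most similar item. With real data most ratings are missing, so the correlation would have to be computed only over users who rated both movies.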

### I have not implemented collaborative filtering in this notebook. Feel free to explore it on your own.

# Conclusion
I have created recommendation systems using demographic and content-based filtering. Demographic filtering is very simple and elementary and of limited practical use on its own, whereas hybrid systems can take advantage of both content-based and collaborative filtering, as the two approaches have proved to be almost complementary. This model is a baseline and only provides a fundamental framework to start with.