# Content-based recommendation system | *Domain: Movie Recommendation*

![](https://cdn.analyticsvidhya.com/wp-content/uploads/2018/06/Screenshot-from-2018-06-21-10-57-38.png)

## Advantages 

1. This kind of recommender doesn't require user data to train on.
2. It requires only item data.
3. The core concept is Natural Language Processing, so there is a ready-made preprocessing pipeline to follow, and it carries over from one domain to another.
4. It acts more like a script that can be run once some amount of item data is available, which makes it a good fit for early-stage start-ups.
5. It requires fewer resources (training time, processing power) because the algorithm is standard, and its results are highly explainable.

***

## Disadvantages

1. Every item **must** have an item name and an item description.
2. Since the code is run as a script, the recommendations may be skewed. The remedy is more data: the more item data is available, the better the recommendations.
3. Item names and item descriptions must follow consistent naming conventions so that the algorithm can interpret them.
4. The regex filtering changes from domain to domain.

## Imports

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import sigmoid_kernel
import warnings
warnings.filterwarnings('ignore')

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))


In [None]:
credits = pd.read_csv('/kaggle/input/tmdb-movie-metadata/tmdb_5000_credits.csv')
movies_df = pd.read_csv('/kaggle/input/tmdb-movie-metadata/tmdb_5000_movies.csv')

In [None]:
credits.head()

In [None]:
movies_df.head()

## Merging both dataframes and keeping only required columns

In [None]:
credits_column_renamed = credits.rename(columns = {"movie_id": "id"})
movies_df_merged = movies_df.merge(credits_column_renamed, on= 'id')
movies_df_merged.head()
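
A minimal sketch of the same rename-then-merge step on toy data, assuming (as a made-up illustration) that both tables also share a non-key `title` column. The frames `toy_movies` and `toy_credits` and their values are invented here. It shows how pandas suffixes the overlapping column, and how the `validate` argument can guard against repeated ids.

```python
import pandas as pd

# Toy stand-ins for the two tables (values are made up for illustration)
toy_movies = pd.DataFrame(
    {"id": [1, 2, 3], "title": ["A", "B", "C"], "overview": ["x", "y", None]}
)
toy_credits = pd.DataFrame(
    {"movie_id": [1, 2, 3], "title": ["A", "B", "C"], "cast": ["[]", "[]", "[]"]}
)

toy_merged = toy_movies.merge(
    toy_credits.rename(columns={"movie_id": "id"}),
    on="id",
    how="inner",
    validate="one_to_one",  # raises MergeError if an id repeats on either side
)

# The shared non-key column comes back twice, as title_x and title_y
print(toy_merged.columns.tolist())
```

Selecting only the columns that are needed afterwards, as the next cell does, also drops the suffixed duplicates.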

In [None]:
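# .copy() gives an independent frame, so later column assignments don't write to a slice of movies_df_merged
movies_cleaned_df = movies_df_merged[['id', 'original_title', 'overview']].copy()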
movies_cleaned_df.head()

In this type of recommendation system, we try to find similarity between items. There are two ways to do it:

- Statistical approach -> a weighted hybrid technique; requires item data plus generic data (total ratings, popularity)
- NLP approach -> requires item data only and uses standard preprocessing steps; can be run as a script

Example overview

In [None]:
movies_cleaned_df['overview'].iloc[0]

***The only thing we need to take care of is that the regex differs between use cases***
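
To see why the pattern matters, the sketch below applies three patterns to one made-up sentence using Python's `re` module directly. The first is the pattern this notebook passes as `token_pattern`, the second is scikit-learn's documented default, and the third (`hyphen_pattern`) is a hypothetical domain-specific variant that is not used anywhere in the notebook. Lowercasing the text stands in for the vectorizer's default lowercase step.

```python
import re

text = "A spy's 2nd mission: sci-fi at 9 p.m.".lower()

# Pattern used in this notebook: any run of one or more word characters
notebook_tokens = re.findall(r"\w{1,}", text)

# scikit-learn's documented default: tokens of two or more word characters
default_tokens = re.findall(r"(?u)\b\w\w+\b", text)

# Hypothetical variant for a domain where hyphenated terms should stay whole
hyphen_pattern = r"\w+(?:-\w+)*"
hyphen_tokens = re.findall(hyphen_pattern, text)

print(notebook_tokens)
print(default_tokens)
print(hyphen_tokens)
```

The notebook's pattern keeps single-character tokens such as `a` and `9`, the default drops them, and the hyphen-aware variant keeps `sci-fi` as one token.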

In [None]:
tfv = TfidfVectorizer(min_df=3, max_features=None, strip_accents='unicode',
                      analyzer='word', token_pattern=r'\w{1,}', ngram_range=(1, 3))

# Fill missing overviews (NaN) with empty strings so the vectorizer only receives strings
movies_cleaned_df['overview'] = movies_cleaned_df['overview'].fillna('')

In [None]:
# Fit the vocabulary and build the sparse TF-IDF matrix (rows: movies, columns: n-gram terms)
tfv_matrix = tfv.fit_transform(movies_cleaned_df['overview'])
tfv_matrix.shape
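
A small sketch of what `min_df` and `ngram_range` do, using a made-up three-sentence corpus. `min_df` is lowered to 2 here only because the toy corpus is so small; the other settings are left at their defaults apart from the token pattern and n-gram range used above.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up overviews for illustration
toy_overviews = [
    "a secret agent saves the world",
    "a secret agent falls in love",
    "toys come to life",
]

toy_tfv = TfidfVectorizer(min_df=2, token_pattern=r"\w{1,}", ngram_range=(1, 3))
toy_matrix = toy_tfv.fit_transform(toy_overviews)

# Only unigrams, bigrams and trigrams that occur in at least 2 documents survive
kept_terms = sorted(toy_tfv.vocabulary_)
print(kept_terms)
print(toy_matrix.shape)

# The third overview shares no surviving term, so its row is all zeros
print(toy_matrix[2].nnz)
```

A row of zeros carries no information for similarity, which is one reason short or unusual descriptions tend to get weaker recommendations.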

## Computing pairwise similarity with the sigmoid kernel

![](https://qph.fs.quoracdn.net/main-qimg-6ab7369356c16f17ac39fbb83d5d56c1)

In [None]:
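# Pairwise similarity between all movies. scikit-learn's sigmoid_kernel computes
# tanh(gamma * <x, y> + coef0), so the scores lie in (-1, 1), not [0, 1]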
sig = sigmoid_kernel(tfv_matrix, tfv_matrix)

In [None]:
sig[0]
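
The sketch below checks what `sigmoid_kernel` computes on a tiny, made-up matrix. scikit-learn documents the kernel as `tanh(gamma * <x, y> + coef0)`, with `gamma` defaulting to `1 / n_features` and `coef0` to `1`. The rows of `X` are invented unit-length vectors standing in for L2-normalised TF-IDF rows.

```python
import numpy as np
from sklearn.metrics.pairwise import sigmoid_kernel

# Made-up, L2-normalised rows standing in for TF-IDF vectors
X = np.array([
    [1.0, 0.0, 0.0],
    [0.6, 0.8, 0.0],
    [0.0, 0.0, 1.0],
])

K = sigmoid_kernel(X, X)  # defaults: gamma = 1 / n_features, coef0 = 1

# Recompute by hand from the documented formula
gamma = 1.0 / X.shape[1]
dots = X @ X.T
manual = np.tanh(gamma * dots + 1.0)
print(np.allclose(K, manual))

# Smallest and largest score in this toy matrix
print(K.min(), K.max())

# tanh is monotonic, so ranking by K matches ranking by the raw dot product
same_ranking = np.array_equal(
    np.argsort(-K[0], kind="stable"), np.argsort(-dots[0], kind="stable")
)
print(same_ranking)
```

Because non-negative dot products map to scores of at least `tanh(1)` (about 0.76), the values in `sig` sit in a narrow band. They are useful for ranking, but should not be read as probabilities.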

## Reverse mapping from movie titles to indices

In [None]:
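# Map each original_title to its row position; when a title repeats, keep the first row
indices = pd.Series(movies_cleaned_df.index, index=movies_cleaned_df['original_title'])
indices = indices[~indices.index.duplicated(keep='first')]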
indices
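
A title-to-position lookup is only safe when each title appears once in the index. The toy sketch below (titles chosen for illustration) shows that `Series.drop_duplicates()` compares the values, which here are row positions and already unique, so a repeated title survives it. Filtering on `index.duplicated()` is one way to keep a single row per title.

```python
import pandas as pd

# Made-up title column with one repeated title
toy_titles = pd.Series(["Batman", "Spectre", "Batman"], name="original_title")

# drop_duplicates() looks at the values (0, 1, 2), so nothing is removed
naive = pd.Series(toy_titles.index, index=toy_titles).drop_duplicates()
lookup_naive = naive["Batman"]  # a Series with two entries, not a single position
print(len(lookup_naive))

# Keep only the first row for each title in the index
deduped = naive[~naive.index.duplicated(keep="first")]
lookup = deduped["Batman"]  # a single row position
print(lookup)
```

Keeping the first occurrence is an arbitrary choice; a lookup keyed on `id` would avoid the ambiguity altogether.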

In [None]:
def give_rec(title, sig=sig):
    # Row position of the movie with this original_title
    idx = indices[title]

    # Pair every row position with its similarity to the query movie,
    # skip the query movie itself, and sort from most to least similar
    sig_scores = [(i, score) for i, score in enumerate(sig[idx]) if i != idx]
    sig_scores = sorted(sig_scores, key=lambda x: x[1], reverse=True)

    # Row positions of the 10 most similar movies
    movies_indices = [i for i, _ in sig_scores[:10]]

    # Top 10 similar movies
    return movies_cleaned_df.original_title.iloc[movies_indices]
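
The same top-10 selection can also be sketched with NumPy instead of sorting a Python list of tuples. `top_k_similar` is a hypothetical helper name, not part of the notebook. It masks the query item explicitly rather than assuming it will sort into first place.

```python
import numpy as np

def top_k_similar(sim_row, query_idx, k=10):
    """Return the positions of the k highest scores, excluding query_idx."""
    scores = np.asarray(sim_row, dtype=float).copy()
    scores[query_idx] = -np.inf           # never recommend the query item itself
    k = min(k, scores.size - 1)           # cannot return more items than exist
    order = np.argsort(-scores, kind="stable")  # descending; ties keep row order
    return order[:k]

# Made-up similarity row for item 0
toy_row = [0.87, 0.83, 0.76, 0.85]
top_two = top_k_similar(toy_row, query_idx=0, k=2)
print(top_two)
```

The returned positions can be passed to `.iloc[...]` on the title column in the same way as `movies_indices` above.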

## Kids Recommendation (Animated)

In [None]:
give_rec("Toy Story 3")

## Action movie recommendation (James Bond)

In [None]:
give_rec("Spectre")

## Romance recommendation

In [None]:
give_rec("Newlyweds")