With the precomputed sparse matrices, we can easily build a movie recommendation system based on content-based filtering. Let's go through it step by step.

## Importing Libraries and Datasets

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
from sklearn.metrics.pairwise import linear_kernel #to generate the similarity matrices
import scipy.sparse #to process/handle the sparse matrices
import warnings #to ignore the warnings

warnings.filterwarnings("ignore")

# Input data files are available in the read-only "../input/" directory
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

Let's consider **Genre: *Action***, **Genre: *Animation*** and **Genre: *Comedy*** as examples. We'll also load the dataset containing all the movies for experimentation.

In [None]:
#importing the datasets
action_df = pd.read_csv("/kaggle/input/imdb-movies-dataset/datasets/action_df.csv")
animation_df = pd.read_csv("/kaggle/input/imdb-movies-dataset/datasets/animation_df.csv")
comedy_df = pd.read_csv("/kaggle/input/imdb-movies-dataset/datasets/comedy_df.csv")
allMovies_df = pd.read_csv("/kaggle/input/imdb-movies-dataset/datasets/all_df.csv")

In [None]:
action_df.info()

In [None]:
animation_df.info()

In [None]:
comedy_df.info()

As we can see, the datasets are quite clean and suitable for content-based filtering.
<br>
The Comedy dataset is much larger than the Animation and Action datasets, with a total of 25200 movies. Let's look at the information about the allMovies dataset.

In [None]:
allMovies_df.info()

74889 movies, with far fewer null values.

## Applying Content-Based Filtering

Since we already have sparse matrices, obtained by applying TF-IDF to each dataset with the settings (analyzer='word', ngram_range=(1,3), stop_words='english'), we can directly load these .npz files and proceed with generating the similarity matrix.

<br>
In my opinion, TF-IDF (term frequency – inverse document frequency) is an efficient way to find out ‘how important is a word in its corresponding document’. The product of the term frequency and the inverse document frequency of each word acts as a score of its importance.

**Term Frequency** tells us the probability of a word occurring in a document (i.e. the number of times the word occurs divided by the total number of words in the document).
<br>
**Inverse Document Frequency** of a word in a given document corpus (dataset) is the logarithm of the ratio of the total number of documents to the number of documents in which the word occurs. In our case, the description of each movie is a document and the collection of all the descriptions is the document corpus.
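
Written out, the two definitions above look like this. The symbols are my own notation: $f_{t,d}$ is the number of times term $t$ occurs in document $d$, $N$ is the total number of documents, and $n_t$ is the number of documents containing $t$.

$$
\mathrm{tf}(t, d) = \frac{f_{t,d}}{\sum_{t'} f_{t',d}}, \qquad
\mathrm{idf}(t) = \log\frac{N}{n_t}, \qquad
\text{tf-idf}(t, d) = \mathrm{tf}(t, d) \cdot \mathrm{idf}(t)
$$

Note that libraries often use a slightly different variant (for example, scikit-learn by default uses raw counts for the term frequency, a smoothed idf, and then normalizes each row), so the exact values in the precomputed matrices may not match this textbook form.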

Loading the .npz files *(sparse matrices)* for **Genres: *Action, Animation*** and ***Comedy***. 

In [None]:
action_sm = scipy.sparse.load_npz("/kaggle/input/imdb-movies-dataset/sparse_matrices/action_sm.npz")
animation_sm = scipy.sparse.load_npz("/kaggle/input/imdb-movies-dataset/sparse_matrices/animation_sm.npz")
comedy_sm = scipy.sparse.load_npz("/kaggle/input/imdb-movies-dataset/sparse_matrices/comedy_sm.npz")
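
For context, a matrix of this kind could be produced roughly as shown below. This is only a sketch: the toy descriptions and the temporary file name are made up, and it assumes the matrices were built with scikit-learn's `TfidfVectorizer`, which the dataset itself does not confirm.

```python
import os
import tempfile

import scipy.sparse
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy descriptions standing in for the real movie plots (hypothetical data).
descriptions = [
    "A fearless cop fights a corrupt politician in a small town.",
    "A princess with ice powers flees her kingdom.",
    "Three friends chase easy money and land in trouble.",
]

# Settings quoted earlier; in scikit-learn the stop-word option is spelled `stop_words`.
vectorizer = TfidfVectorizer(analyzer="word", ngram_range=(1, 3), stop_words="english")
toy_sm = vectorizer.fit_transform(descriptions)  # sparse matrix, one row per description

# Round-trip through an .npz file, mirroring how the dataset's matrices are loaded.
path = os.path.join(tempfile.mkdtemp(), "toy_sm.npz")
scipy.sparse.save_npz(path, toy_sm)
loaded_sm = scipy.sparse.load_npz(path)

print(toy_sm.shape, loaded_sm.shape)
```

Each row of the matrix corresponds to one description and each column to one word or n-gram, which is why the row order of a loaded matrix has to match the row order of its dataframe.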

Now we can calculate the similarity between the movies on the basis of their tf-idf scores (received as a sparse matrix) using an appropriate kernel method, so that movies whose descriptions have similar tf-idf scores get a higher similarity value, and the value drops as the scores diverge. These values are stored in a separate similarity matrix, one per genre, which acts as our recommender.

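Here, we use the linear kernel, i.e. the dot product of two vectors. When the tf-idf rows are L2-normalized (the default in scikit-learn's TfidfVectorizer, which is assumed here for the precomputed matrices), this equals their cosine similarity. For the linear_kernel method, the closer the value is to 1 for two given data points, the more similar the data points are.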

The linear_kernel function returns a ***numpy.ndarray*** holding the similarity score of each movie compared with every other movie in the dataframe.
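
The relation between the linear kernel and cosine similarity can be checked on a small example. The sketch below uses random vectors (made-up data, not the movie matrices) and shows that the two agree only when the rows have unit length.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel
from sklearn.preprocessing import normalize

rng = np.random.default_rng(0)
raw = rng.random((4, 6))          # arbitrary non-negative vectors (hypothetical data)
unit = normalize(raw, norm="l2")  # each row rescaled to length 1

# On unit-length rows the dot product and the cosine similarity coincide.
same_on_unit = np.allclose(linear_kernel(unit, unit), cosine_similarity(unit, unit))
# On unnormalized rows they generally differ.
same_on_raw = np.allclose(linear_kernel(raw, raw), cosine_similarity(raw, raw))

print(same_on_unit, same_on_raw)
```

So reading a linear_kernel score as a cosine similarity is only valid if the sparse matrices were normalized when they were created.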

Calculating the similarity matrices of the above sparse matrices.

In [None]:
action_simmat = linear_kernel(action_sm, action_sm)
animation_simmat = linear_kernel(animation_sm, animation_sm)
comedy_simmat = linear_kernel(comedy_sm, comedy_sm)
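
One thing to keep in mind: a full similarity matrix is dense and has n × n entries, so for the 25200 Comedy movies it needs roughly 25200² × 8 bytes ≈ 5 GB as float64. If memory is tight, a possible alternative is to compute only the row for the movie you are interested in. The sketch below illustrates this on a random sparse matrix; `similarity_row` is a hypothetical helper, not part of the dataset or of scikit-learn.

```python
import numpy as np
import scipy.sparse
from sklearn.metrics.pairwise import linear_kernel

# Small random sparse matrix standing in for a tf-idf matrix (hypothetical data).
toy_sm = scipy.sparse.random(50, 200, density=0.05, format="csr", random_state=0)

def similarity_row(sparse_mat, row):
    """Similarity of one row against all rows, returned as a 1-D array."""
    # Indexing with a list keeps the selected row two-dimensional.
    return linear_kernel(sparse_mat[[row]], sparse_mat).ravel()

full = linear_kernel(toy_sm, toy_sm)   # dense n x n matrix
one_row = similarity_row(toy_sm, 7)    # only n values

print(one_row.shape, np.allclose(one_row, full[7]))
```

The single row holds the same scores as the corresponding row of the full matrix, so it could be used for one-off lookups without building the whole matrix.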

**Defining a generic function for getting the top 10 recommendations.**

In [None]:
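def give_recommendations(df, sim_mat, mov_name):
    # row of the first movie with this title (assumes the default 0..n-1 index)
    movie = df[df.original_title == mov_name].index[0]
    # indices of the 10 best scores, skipping the very top one (assumed to be the movie itself)
    index_recomm = sim_mat[movie].argsort(axis=0)[-11:-1]
    
    print("Original Description: ",df.description[movie],"\n")

    # argsort is ascending, so flip to print the best match first
    for i in np.flipud(index_recomm):
        print("Score: ",sim_mat[movie][i],"\t Title: ",df.original_title[i])
        print("IMDb Title ID: ",df.imdb_title_id[i])
        print(df.description[i],"\n")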
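
The `[-11:-1]` slice relies on the queried movie having the single highest score in its own row. If another movie has an identical description (and hence the same score), the slice may drop that movie instead of the queried one. A slightly more defensive variant could exclude the query index explicitly. The helper below, `top_k_similar`, is a hypothetical sketch and not part of the original notebook.

```python
import numpy as np

def top_k_similar(sim_row, query_idx, k=10):
    """Indices of the k highest scores in sim_row, excluding query_idx, best first."""
    order = np.argsort(sim_row)[::-1]   # all indices, highest score first
    order = order[order != query_idx]   # drop the queried movie itself
    return order[:k]

# Hypothetical similarity row for movie 1; movie 2 is an exact duplicate of it.
scores = np.array([0.2, 1.0, 1.0, 0.5, 0.0])
print(top_k_similar(scores, query_idx=1, k=3))
```

Here the duplicate (index 2) is kept as the best match and the queried movie never appears in its own recommendations, regardless of how ties are ordered.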

Now, let's see the recommendations.

In [None]:
give_recommendations(action_df, action_simmat, 'Singham') #let's try with an Indian Movie: Singham

Well, the recommendations look good...considering the plot of the movie.

Now, let's try out for Animation.

In [None]:
give_recommendations(animation_df, animation_simmat, 'Frozen') #Everyone knows about Frozen, right!

In [None]:
give_recommendations(comedy_df, comedy_simmat, 'Phir Hera Pheri') #The classic, the source of all memes.

Now, most people would doubt the credibility of the recommender, as it hasn't recommended ***Hera Pheri*** here. Well guys, remember the plot?

In [None]:
hera_pheri = comedy_df[(comedy_df.original_title == 'Hera Pheri') & (comedy_df.year == '2000')]
hera_pheri = hera_pheri.reset_index()
print("Plot of Hera Pheri(2000) : ", hera_pheri['description'][0])

Since the story is quite different and *content-based filtering* depends entirely on the description/plot, we don't get Hera Pheri as a recommendation.

**Great!**

So that's how to use these datasets with content-based filtering to get genre-wise recommendations. I hope you liked it. *Please upvote if you did :)*


Looking forward to your awesome implementations.

## *Thank you !* 