### Content Based Recommendation System

In [1]:
# import necessary libraries:

import numpy as np
import pandas as pd
pd.options.display.max_colwidth = 500

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel

In [2]:
from IPython.display import display, HTML

display(HTML(data="""
<style>
    div#notebook-container    { width: 95%; }
    div#menubar-container     { width: 65%; }
    div#maintoolbar-container { width: 99%; }
</style>
"""))

### Read the Dataset `movies_metadata.csv`

In [43]:
# Read the movies dataset:

movies_df = pd.read_csv('movies_metadata.csv')



In [44]:
# View dataset head:

movies_df.head(1)

Unnamed: 0,adult,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,original_title,overview,...,release_date,revenue,runtime,spoken_languages,status,tagline,title,video,vote_average,vote_count
0,False,"{'id': 10194, 'name': 'Toy Story Collection', 'poster_path': '/7G9915LfUQ2lVfwMEEhDsn3kT4B.jpg', 'backdrop_path': '/9FBwqcd9IRruEDUrTdcaafOMKUq.jpg'}",30000000,"[{'id': 16, 'name': 'Animation'}, {'id': 35, 'name': 'Comedy'}, {'id': 10751, 'name': 'Family'}]",http://toystory.disney.com/toy-story,862,tt0114709,en,Toy Story,"Led by Woody, Andy's toys live happily in his room until Andy's birthday brings Buzz Lightyear onto the scene. Afraid of losing his place in Andy's heart, Woody plots against Buzz. But when circumstances separate Buzz and Woody from their owner, the duo eventually learns to put aside their differences.",...,1995-10-30,373554033.0,81.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,,Toy Story,False,7.7,5415.0


In [45]:
# View shape of dataset:

movies_df.shape

(45466, 24)

**Inference:**
    
    There are 45,466 rows and 24 columns in movies dataframe

In [11]:
# View datatypes:

movies_df.dtypes

adult                     object
belongs_to_collection     object
budget                    object
genres                    object
homepage                  object
id                        object
imdb_id                   object
original_language         object
original_title            object
overview                  object
popularity                object
poster_path               object
production_companies      object
production_countries      object
release_date              object
revenue                  float64
runtime                  float64
spoken_languages          object
status                    object
tagline                   object
title                     object
video                     object
vote_average             float64
vote_count               float64
dtype: object
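
Columns that look numeric (`budget`, `id`, `popularity`) were read as `object`, which usually means at least some rows hold non-numeric strings. The following is a minimal sketch of coercing such columns to numbers. It uses a small made-up frame (`toy_df` and its values are assumptions for illustration, not rows from `movies_metadata.csv`):

```python
import pandas as pd

# Toy stand-in for movies_df; the values are invented for illustration.
toy_df = pd.DataFrame({
    "budget": ["30000000", "65000000", "bad value"],
    "popularity": ["21.9", "17.0", "bad value"],
})

# errors="coerce" turns anything that cannot be parsed into NaN instead of raising.
for col in ["budget", "popularity"]:
    toy_df[col] = pd.to_numeric(toy_df[col], errors="coerce")

print(toy_df.dtypes)
print(toy_df)
```

The recommender below only uses text columns, so this step is optional here; it matters once numeric columns are used for filtering or ranking.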

### Create a new column named `description` by combining the `overview` and `tagline` columns of the dataset

In [None]:
# The tagline column showed null values in the preview above. Let's check the null value count:

In [46]:
movies_df['tagline'].isna().sum()

25054

In [13]:
# Since tagline has many null values, replace them with empty strings before combining with the overview column:

movies_df['tagline'] = movies_df['tagline'].fillna('')

# Combine the overview and tagline columns to create a new 'description' column.
# Note: the two strings are joined without a separator, and rows with a null overview stay null.

movies_df['description'] = movies_df['overview'] + movies_df['tagline']

In [14]:
# View the combined columns:

movies_df[['overview', 'tagline', 'description']].head(3)

Unnamed: 0,overview,tagline,description
0,"Led by Woody, Andy's toys live happily in his room until Andy's birthday brings Buzz Lightyear onto the scene. Afraid of losing his place in Andy's heart, Woody plots against Buzz. But when circumstances separate Buzz and Woody from their owner, the duo eventually learns to put aside their differences.",,"Led by Woody, Andy's toys live happily in his room until Andy's birthday brings Buzz Lightyear onto the scene. Afraid of losing his place in Andy's heart, Woody plots against Buzz. But when circumstances separate Buzz and Woody from their owner, the duo eventually learns to put aside their differences."
1,"When siblings Judy and Peter discover an enchanted board game that opens the door to a magical world, they unwittingly invite Alan -- an adult who's been trapped inside the game for 26 years -- into their living room. Alan's only hope for freedom is to finish the game, which proves risky as all three find themselves running from giant rhinoceroses, evil monkeys and other terrifying creatures.",Roll the dice and unleash the excitement!,"When siblings Judy and Peter discover an enchanted board game that opens the door to a magical world, they unwittingly invite Alan -- an adult who's been trapped inside the game for 26 years -- into their living room. Alan's only hope for freedom is to finish the game, which proves risky as all three find themselves running from giant rhinoceroses, evil monkeys and other terrifying creatures.Roll the dice and unleash the excitement!"
2,"A family wedding reignites the ancient feud between next-door neighbors and fishing buddies John and Max. Meanwhile, a sultry Italian divorcée opens a restaurant at the local bait shop, alarming the locals who worry she'll scare the fish away. But she's less interested in seafood than she is in cooking up a hot time with Max.",Still Yelling. Still Fighting. Still Ready for Love.,"A family wedding reignites the ancient feud between next-door neighbors and fishing buddies John and Max. Meanwhile, a sultry Italian divorcée opens a restaurant at the local bait shop, alarming the locals who worry she'll scare the fish away. But she's less interested in seafood than she is in cooking up a hot time with Max.Still Yelling. Still Fighting. Still Ready for Love."
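
In the output above, the tagline is appended directly to the last word of the overview (for example `creatures.Roll`), so the tokenizer may see a fused boundary. A hedged alternative is to join the two columns with a single space. The sketch below uses a made-up frame (`toy_df` is an assumption, not the real data) and is not what the notebook ran:

```python
import pandas as pd

# Toy stand-in for movies_df; the text is invented for illustration.
toy_df = pd.DataFrame({
    "overview": ["A board game comes alive.", "Two neighbors feud.", None],
    "tagline": ["Roll the dice!", None, "Orphan tagline."],
})

# Fill missing taglines with an empty string, as in the notebook.
tagline = toy_df["tagline"].fillna("")

# Join with one space, then strip the trailing space left by empty taglines.
# A missing overview still yields a missing description.
toy_df["description"] = (toy_df["overview"] + " " + tagline).str.strip()

print(toy_df["description"].tolist())
```

Switching to this variant would change the tf-idf vocabulary slightly, so the counts and recommendations shown in this notebook would not necessarily be reproduced exactly.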


In [15]:
# Verify shape once again:

movies_df.shape

(45466, 25)

### Let's drop the null values in the `description` column

In [16]:
# Check the count of null values:

movies_df['description'].isna().sum()

954

**Inference:**
    
    There are 954 null values in the description column. These are the rows where overview is null, since tagline nulls were already filled.

In [17]:
# Drop na in description column:

movies_df.dropna(subset=['description'], inplace=True)

In [18]:
movies_df.shape

(44512, 25)

**Inference:**
    
    There are now 44,512 rows after removing the 954 rows with a null description. The column count remains the same.

### Keep the first occurrence and drop duplicates of each title in column `title`

In [19]:
# Check whether duplicates are present in the title column:

any(movies_df.duplicated(subset=['title']))

True

**Inference:**
    
    Duplicates are present in title column

In [20]:
movies_df.shape

(44512, 25)

In [21]:
# Drop duplicates in title column:

movies_df = movies_df.drop_duplicates('title')

In [22]:
movies_df.shape

(41372, 25)

**Inference:**
    
    Now we are left with 41372 rows and 25 columns

### Since rows with duplicate `title` values were dropped in the step above, reset the index [make sure no new column is added to the dataframe while resetting the index]

In [23]:
# Reset index:

movies_df.reset_index(drop=True, inplace=True)

In [24]:
movies_df.shape

(41372, 25)

**Inference:**
    
    Column count is 25. This proves that no new column was added while resetting the index
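
The index reset matters because `recommend` later uses the index label of a title as a row position in the similarity matrix, so labels and positions have to agree. A small sketch of the two steps on a made-up frame (`toy_df` is an assumption for illustration):

```python
import pandas as pd

# Toy stand-in for movies_df with one repeated title.
toy_df = pd.DataFrame({
    "title": ["Batman", "Heat", "Batman", "Alien"],
    "year": [1989, 1995, 1966, 1979],
})

# keep="first" (the default) retains the first row for each title.
deduped = toy_df.drop_duplicates(subset="title", keep="first")
print(deduped.index.tolist())  # the old labels, with a gap where a row was dropped

# Without drop=True the old index is kept as a new 'index' column.
with_old_index = deduped.reset_index()

# With drop=True the old index is discarded and no column is added.
clean = deduped.reset_index(drop=True)
print(clean)
```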

In [25]:
# View the dataset head:

movies_df.head(2)

Unnamed: 0,adult,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,original_title,overview,...,revenue,runtime,spoken_languages,status,tagline,title,video,vote_average,vote_count,description
0,False,"{'id': 10194, 'name': 'Toy Story Collection', 'poster_path': '/7G9915LfUQ2lVfwMEEhDsn3kT4B.jpg', 'backdrop_path': '/9FBwqcd9IRruEDUrTdcaafOMKUq.jpg'}",30000000,"[{'id': 16, 'name': 'Animation'}, {'id': 35, 'name': 'Comedy'}, {'id': 10751, 'name': 'Family'}]",http://toystory.disney.com/toy-story,862,tt0114709,en,Toy Story,"Led by Woody, Andy's toys live happily in his room until Andy's birthday brings Buzz Lightyear onto the scene. Afraid of losing his place in Andy's heart, Woody plots against Buzz. But when circumstances separate Buzz and Woody from their owner, the duo eventually learns to put aside their differences.",...,373554033.0,81.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,,Toy Story,False,7.7,5415.0,"Led by Woody, Andy's toys live happily in his room until Andy's birthday brings Buzz Lightyear onto the scene. Afraid of losing his place in Andy's heart, Woody plots against Buzz. But when circumstances separate Buzz and Woody from their owner, the duo eventually learns to put aside their differences."
1,False,,65000000,"[{'id': 12, 'name': 'Adventure'}, {'id': 14, 'name': 'Fantasy'}, {'id': 10751, 'name': 'Family'}]",,8844,tt0113497,en,Jumanji,"When siblings Judy and Peter discover an enchanted board game that opens the door to a magical world, they unwittingly invite Alan -- an adult who's been trapped inside the game for 26 years -- into their living room. Alan's only hope for freedom is to finish the game, which proves risky as all three find themselves running from giant rhinoceroses, evil monkeys and other terrifying creatures.",...,262797249.0,104.0,"[{'iso_639_1': 'en', 'name': 'English'}, {'iso_639_1': 'fr', 'name': 'Français'}]",Released,Roll the dice and unleash the excitement!,Jumanji,False,6.9,2413.0,"When siblings Judy and Peter discover an enchanted board game that opens the door to a magical world, they unwittingly invite Alan -- an adult who's been trapped inside the game for 26 years -- into their living room. Alan's only hope for freedom is to finish the game, which proves risky as all three find themselves running from giant rhinoceroses, evil monkeys and other terrifying creatures.Roll the dice and unleash the excitement!"


### Generate tf-idf matrix using the column `description`. Consider till 3-grams, with minimum document frequency as 0.

In [26]:
# Construct tf-idf matrix:

tf_id_vect = TfidfVectorizer(analyzer='word', ngram_range=(1,3), stop_words='english', min_df=0)

tf_id_vect.fit(movies_df['description'])

desc_matrix = tf_id_vect.transform(movies_df['description'])

In [27]:
desc_matrix

<41372x2237767 sparse matrix of type '<class 'numpy.float64'>'
	with 3674457 stored elements in Compressed Sparse Row format>
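
To see what the 2,237,767 columns represent, here is a minimal sketch on a three-document corpus (the documents are invented for illustration). With `ngram_range=(1, 3)` every unigram, bigram and trigram left after stop-word removal becomes its own feature, which is why the vocabulary grows so quickly. Newer scikit-learn versions validate parameters more strictly and may reject an integer `min_df=0`; the sketch relies on the default `min_df=1`, which keeps every term that occurs in at least one document.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Tiny made-up corpus standing in for movies_df['description'].
docs = [
    "toys live happily in the room",
    "enchanted board game opens a magical world",
    "toys and board game",
]

vect = TfidfVectorizer(analyzer="word", ngram_range=(1, 3), stop_words="english")

# fit_transform learns the vocabulary and builds the matrix in one call;
# it is equivalent to calling fit and then transform on the same data.
matrix = vect.fit_transform(docs)

print(matrix.shape)                  # (number of documents, number of n-gram features)
print(sorted(vect.vocabulary_)[:10])

# Each row is L2-normalized by default (norm='l2').
row_norms = np.linalg.norm(matrix.toarray(), axis=1)
print(row_norms)
```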

### Create cosine similarity matrix

In [28]:
# Create similarity matrix:

cos_sim_matrix = linear_kernel(desc_matrix, desc_matrix)

In [29]:
cos_sim_matrix.shape

(41372, 41372)
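
`linear_kernel` computes plain dot products. It gives cosine similarities here because `TfidfVectorizer` L2-normalizes each row by default, so the dot product of two rows already equals their cosine. A small check on made-up documents (the corpus is an assumption for illustration):

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel

# Tiny made-up corpus.
docs = [
    "batman protects gotham city",
    "the joker threatens gotham city",
    "a family of toys in a bedroom",
]
tfidf = TfidfVectorizer().fit_transform(docs)

dot = linear_kernel(tfidf, tfidf)      # raw dot products
cos = cosine_similarity(tfidf, tfidf)  # normalizes the rows first, then dot products

print(np.allclose(dot, cos))

# Size of a dense float64 matrix with the shape shown above, in GB.
n_rows = 41372
print(n_rows * n_rows * 8 / 1e9)
```

The dense 41372 x 41372 float64 result takes roughly 13.7 GB, so this cell needs a machine with enough memory; `linear_kernel` is used because it skips the redundant normalization step.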

### Write a function with name `recommend` which takes `title` as argument and returns a list of 10 recommended title names in the output based on the above cosine similarities

In [40]:
# Define recommend function:

def recommend(title_name):
    
    # Look up the row index of the requested title (raises IndexError if the title is absent):
    title_id = movies_df[movies_df['title'] == title_name].index.values[0]
    
    # Sort all movies by similarity, highest first; skip position 0 (normally the movie itself) and keep the next 10:
    top_n_idx = np.flip(np.argsort(cos_sim_matrix[title_id,]),axis=0)[1:11]
    top_n_sim_values = cos_sim_matrix[title_id, top_n_idx]
    
    # Keep only the candidates with similarity > 0
    top_n_idx = top_n_idx[top_n_sim_values > 0]
        
    # Map the row positions back to movie titles
    sim_movie_idx = movies_df['title'].iloc[top_n_idx].values
            
    # Collate results into a dataframe
    result = pd.DataFrame({"Movie title" : movies_df['title'].iloc[title_id],
                           "Similar movies": sim_movie_idx,
                          },
                          columns = ["Movie title", "Similar movies"])
    
    return result
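
If the full similarity matrix does not fit in memory, one alternative is to compute similarities for a single row on demand. The sketch below is an assumption-laden variant, not the function used in this notebook: `toy_df`, `toy_matrix`, `title_to_row` and `recommend_row` are hypothetical names, and the data is invented.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

# Toy stand-in for movies_df; titles are unique and the index is 0..n-1.
toy_df = pd.DataFrame({
    "title": ["Batman", "Batman Returns", "Toy Story", "Jumanji"],
    "description": [
        "Batman fights the Joker in Gotham City",
        "Batman faces the Penguin in Gotham City",
        "Toys come alive when their owner leaves the room",
        "A magical board game releases jungle animals",
    ],
})
toy_matrix = TfidfVectorizer(stop_words="english").fit_transform(toy_df["description"])

# Map each title to its row position for a fast lookup.
title_to_row = pd.Series(toy_df.index, index=toy_df["title"])


def recommend_row(title_name, top_n=10):
    """Return up to top_n similar titles, computing only one row of similarities."""
    if title_name not in title_to_row.index:
        return []  # unknown title: return nothing instead of raising
    row = int(title_to_row[title_name])
    # Similarity of the query row against every row: shape (1, n) flattened to (n,).
    sims = linear_kernel(toy_matrix[row:row + 1], toy_matrix).ravel()
    sims[row] = -1.0  # exclude the query movie explicitly
    order = np.argsort(sims)[::-1][:top_n]
    order = order[sims[order] > 0]  # keep only positive similarities
    return toy_df["title"].iloc[order].tolist()


print(recommend_row("Batman"))
```

Excluding the query by its own row index, rather than by skipping position 0 of the sorted order, avoids dropping the wrong movie when another description is an exact duplicate of the query's.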

### Give the recommendations from the above function for the movies `The Godfather` and `The Dark Knight Rises`

In [41]:
# Give recommendations for users who have watched 'The Godfather'

recommend('The Godfather')

| | Movie title | Similar movies |
|---|---|---|
| 0 | The Godfather | The Godfather Trilogy: 1972-1990 |
| 1 | The Godfather | The Godfather: Part II |
| 2 | The Godfather | Honor Thy Father |
| 3 | The Godfather | Blood Ties |
| 4 | The Godfather | The Cave of the Yellow Dog |
| 5 | The Godfather | A Mother Should Be Loved |
| 6 | The Godfather | The Outside Man |
| 7 | The Godfather | Household Saints |
| 8 | The Godfather | Made |
| 9 | The Godfather | Shanghai Triad |


In [42]:
# Give recommendations for users who have watched 'The Dark Knight Rises'

recommend('The Dark Knight Rises')

| | Movie title | Similar movies |
|---|---|---|
| 0 | The Dark Knight Rises | The Dark Knight |
| 1 | The Dark Knight Rises | Batman Forever |
| 2 | The Dark Knight Rises | Batman Returns |
| 3 | The Dark Knight Rises | Batman: Mask of the Phantasm |
| 4 | The Dark Knight Rises | Batman |
| 5 | The Dark Knight Rises | Batman: Mystery of the Batwoman |
| 6 | The Dark Knight Rises | Batman: Under the Red Hood |
| 7 | The Dark Knight Rises | Batman Beyond: Return of the Joker |
| 8 | The Dark Knight Rises | Batman vs Dracula |
| 9 | The Dark Knight Rises | Batman Unmasked: The Psychology of the Dark Knight |


## Overall Summary:

- Read the movies_metadata file
- The raw dataset has 45,466 rows and 24 columns
- Combined the overview and tagline columns to create a new column called 'description'. Since tagline had null values, replaced them with empty strings before combining with the overview column
- Dropped null values in the newly created 'description' column
- Dropped duplicates in the 'title' column and reset the index
- This left 41,372 rows and 25 columns
- Generated a tf-idf matrix using the 'description' column (1- to 3-grams, min_df = 0)
- Created a cosine similarity matrix from the tf-idf matrix
- Provided 10 similar movie recommendations for users who have watched 'The Godfather' and 'The Dark Knight Rises'.

### <center> End of R5 Content Based Recommendation System External lab </center>