<h3>Preprocessing the data</h3>

The movies dataset is used to build a content-based filtering model. This file shows the cleaning process applied before building the content-based recommender system. First, the metadata of the movies dataset is read into a pandas DataFrame, and all of its columns are displayed.

In [13]:
import pandas as pd
import numpy as np
from ast import literal_eval

df = pd.read_csv(r'C:\Users\sweth\OneDrive\Desktop\2nd_Semester\Machine Learning 1\Project\Dataset\the-movies-dataset\movies_metadata.csv', low_memory=False)
#Print all the features (or columns) of the DataFrame
df.columns

Index(['adult', 'belongs_to_collection', 'budget', 'genres', 'homepage', 'id',
       'imdb_id', 'original_language', 'original_title', 'overview',
       'popularity', 'poster_path', 'production_companies',
       'production_countries', 'release_date', 'revenue', 'runtime',
       'spoken_languages', 'status', 'tagline', 'title', 'video',
       'vote_average', 'vote_count'],
      dtype='object')

In [14]:
df.head()

Unnamed: 0,adult,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,original_title,overview,...,release_date,revenue,runtime,spoken_languages,status,tagline,title,video,vote_average,vote_count
0,False,"{'id': 10194, 'name': 'Toy Story Collection', ...",30000000,"[{'id': 16, 'name': 'Animation'}, {'id': 35, '...",http://toystory.disney.com/toy-story,862,tt0114709,en,Toy Story,"Led by Woody, Andy's toys live happily in his ...",...,1995-10-30,373554033.0,81.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,,Toy Story,False,7.7,5415.0
1,False,,65000000,"[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...",,8844,tt0113497,en,Jumanji,When siblings Judy and Peter discover an encha...,...,1995-12-15,262797249.0,104.0,"[{'iso_639_1': 'en', 'name': 'English'}, {'iso...",Released,Roll the dice and unleash the excitement!,Jumanji,False,6.9,2413.0
2,False,"{'id': 119050, 'name': 'Grumpy Old Men Collect...",0,"[{'id': 10749, 'name': 'Romance'}, {'id': 35, ...",,15602,tt0113228,en,Grumpier Old Men,A family wedding reignites the ancient feud be...,...,1995-12-22,0.0,101.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Still Yelling. Still Fighting. Still Ready for...,Grumpier Old Men,False,6.5,92.0
3,False,,16000000,"[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'nam...",,31357,tt0114885,en,Waiting to Exhale,"Cheated on, mistreated and stepped on, the wom...",...,1995-12-22,81452156.0,127.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Friends are the people who let you be yourself...,Waiting to Exhale,False,6.1,34.0
4,False,"{'id': 96871, 'name': 'Father of the Bride Col...",0,"[{'id': 35, 'name': 'Comedy'}]",,11862,tt0113041,en,Father of the Bride Part II,Just when George Banks has recovered from his ...,...,1995-02-10,76578911.0,106.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Just When His World Is Back To Normal... He's ...,Father of the Bride Part II,False,5.7,173.0


In [15]:
type(df)

pandas.core.frame.DataFrame

Only the features relevant to calculating the similarity scores are retained. These features were chosen after trying various combinations and looking at how much each one contributed to the similarity score. Also, given the size of the dataset, factoring in more than 10 features was computationally impractical.

In [16]:
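# Only keep the features that we require; .copy() avoids a SettingWithCopyWarning
# when new columns are assigned to this subset later on
df = df[['title','id','overview','popularity','genres', 'release_date', 'runtime', 'vote_average', 'vote_count']].copy()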
df.head(5)

Unnamed: 0,title,id,overview,popularity,genres,release_date,runtime,vote_average,vote_count
0,Toy Story,862,"Led by Woody, Andy's toys live happily in his ...",21.946943,"[{'id': 16, 'name': 'Animation'}, {'id': 35, '...",1995-10-30,81.0,7.7,5415.0
1,Jumanji,8844,When siblings Judy and Peter discover an encha...,17.015539,"[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...",1995-12-15,104.0,6.9,2413.0
2,Grumpier Old Men,15602,A family wedding reignites the ancient feud be...,11.7129,"[{'id': 10749, 'name': 'Romance'}, {'id': 35, ...",1995-12-22,101.0,6.5,92.0
3,Waiting to Exhale,31357,"Cheated on, mistreated and stepped on, the wom...",3.859495,"[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'nam...",1995-12-22,127.0,6.1,34.0
4,Father of the Bride Part II,11862,Just when George Banks has recovered from his ...,8.387519,"[{'id': 35, 'name': 'Comedy'}]",1995-02-10,106.0,5.7,173.0


In [17]:
df.shape
# We have 45466 movies in our dataframe

(45466, 9)

Here, only the year is extracted from the release date, rather than keeping the whole date.

In [18]:
# Convert release_date into pandas datetime format
df['release_date'] = pd.to_datetime(df['release_date'], errors='coerce')

# Extract the year from the datetime as a string.
# Unparseable dates are NaT at this point and yield the string 'NaT'.
df['year'] = df['release_date'].apply(lambda x: str(x).split('-')[0])

At this point, the year is stored as a string rather than an 'int'. The helper function below converts the year to an int, and returns zero when the conversion fails (for example, for a missing release date).

In [19]:
# Function to convert missing or malformed years to 0 and all other years to integers.
def converting_to_integer(x):
    try:
        return int(x)
    except (ValueError, TypeError):
        # Raised for values such as 'NaT' or None
        return 0

In [20]:
#Apply the above function to the year feature
df['year'] = df['year'].apply(converting_to_integer)
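The same conversion can also be sketched without a helper function, using the `.dt` accessor on the datetime column. This is only a minimal alternative shown on a small hypothetical frame (`sample` and its values are made up for illustration); missing and malformed dates end up as 0, as with the helper above.

```python
import pandas as pd

# Hypothetical sample: one valid date, one missing value, one malformed string
sample = pd.DataFrame({'release_date': ['1995-10-30', None, 'not a date']})

# Missing or unparseable values become NaT
sample['release_date'] = pd.to_datetime(sample['release_date'], errors='coerce')

# .dt.year yields NaN for NaT; replace those with 0 and cast to int
sample['year'] = sample['release_date'].dt.year.fillna(0).astype(int)

print(sample['year'].tolist())  # [1995, 0, 0]
```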

Now we drop the original release_date column and retain only the release year, stored as an 'int' in the column 'year'.

In [21]:
#Drop the release_date column
df = df.drop('release_date', axis=1)

#Display the dataframe
df.head()

Unnamed: 0,title,id,overview,popularity,genres,runtime,vote_average,vote_count,year
0,Toy Story,862,"Led by Woody, Andy's toys live happily in his ...",21.946943,"[{'id': 16, 'name': 'Animation'}, {'id': 35, '...",81.0,7.7,5415.0,1995
1,Jumanji,8844,When siblings Judy and Peter discover an encha...,17.015539,"[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...",104.0,6.9,2413.0,1995
2,Grumpier Old Men,15602,A family wedding reignites the ancient feud be...,11.7129,"[{'id': 10749, 'name': 'Romance'}, {'id': 35, ...",101.0,6.5,92.0,1995
3,Waiting to Exhale,31357,"Cheated on, mistreated and stepped on, the wom...",3.859495,"[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'nam...",127.0,6.1,34.0,1995
4,Father of the Bride Part II,11862,Just when George Banks has recovered from his ...,8.387519,"[{'id': 35, 'name': 'Comedy'}]",106.0,5.7,173.0,1995


As shown below, a movie like 'Toy Story' falls into several genres: Animation, Comedy and Family. The genres are stored as a string, so literal_eval is used to parse that string into a list of dictionaries, from which the lowercase genre names are extracted into a single list per movie.

In [22]:
#Print genres of the first movie
df.iloc[0]['genres']

"[{'id': 16, 'name': 'Animation'}, {'id': 35, 'name': 'Comedy'}, {'id': 10751, 'name': 'Family'}]"

In [23]:
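# Treat missing genres as an empty list (still in string form at this point)
df['genres'] = df['genres'].fillna('[]')
# Parse the string into a Python list of dictionaries
df['genres'] = df['genres'].apply(literal_eval)
# Keep only the lowercase genre names
df['genres'] = df['genres'].apply(lambda x: [i['name'].lower() for i in x] if isinstance(x, list) else [])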
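To make the transformation concrete, here is a minimal sketch applied to the genres string of the first movie printed above. The variable names `raw`, `parsed` and `names` are chosen for illustration only.

```python
from ast import literal_eval

# Raw value as stored in the CSV: a string, not a list
raw = "[{'id': 16, 'name': 'Animation'}, {'id': 35, 'name': 'Comedy'}, {'id': 10751, 'name': 'Family'}]"

# literal_eval safely parses the string into a list of dictionaries
parsed = literal_eval(raw)

# Keep only the lowercase genre names
names = [genre['name'].lower() for genre in parsed]

print(names)  # ['animation', 'comedy', 'family']
```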

In [24]:
# Remove all the movies that have null values for runtime, average vote, vote count or popularity
df = df.dropna(subset=['runtime', 'vote_average', 'vote_count', 'popularity'])

In [25]:
df['vote_count'].mean()

110.50651505431055

In [26]:
df['vote_count'].quantile(0.80)

51.0

Movies whose vote count is at or below the median (the 50th percentile, which is 10 votes in this case) are not taken into consideration. This is due to two reasons:
1. Computational power
2. Given the size of the dataset, it seemed reasonable to drop those movies and still be able to recommend similar movies.

In [27]:
df = df[df['vote_count'] > df['vote_count'].quantile(0.50)]
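The effect of this filter can be sketched on a small hypothetical frame (the titles and vote counts below are made up for illustration). Because the comparison is strict, a movie sitting exactly at the median is dropped as well.

```python
import pandas as pd

# Hypothetical vote counts for five movies
votes = pd.DataFrame({
    'title': ['A', 'B', 'C', 'D', 'E'],
    'vote_count': [2, 5, 10, 34, 92],
})

# 50th percentile (median) of the vote counts
threshold = votes['vote_count'].quantile(0.50)

# Keep only movies strictly above the median
kept = votes[votes['vote_count'] > threshold]

print(threshold)              # 10.0
print(kept['title'].tolist())  # ['D', 'E']
```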

In [28]:
df.shape

(21748, 9)

In [29]:
df.head()

Unnamed: 0,title,id,overview,popularity,genres,runtime,vote_average,vote_count,year
0,Toy Story,862,"Led by Woody, Andy's toys live happily in his ...",21.946943,"[animation, comedy, family]",81.0,7.7,5415.0,1995
1,Jumanji,8844,When siblings Judy and Peter discover an encha...,17.015539,"[adventure, fantasy, family]",104.0,6.9,2413.0,1995
2,Grumpier Old Men,15602,A family wedding reignites the ancient feud be...,11.7129,"[romance, comedy]",101.0,6.5,92.0,1995
3,Waiting to Exhale,31357,"Cheated on, mistreated and stepped on, the wom...",3.859495,"[comedy, drama, romance]",127.0,6.1,34.0,1995
4,Father of the Bride Part II,11862,Just when George Banks has recovered from his ...,8.387519,[comedy],106.0,5.7,173.0,1995


The cleaned dataset is written to a CSV file, which is used later to build the content-based recommender system.

In [30]:
df.to_csv(r'C:\Users\sweth\OneDrive\Desktop\2nd_Semester\Machine Learning 1\Project\Dataset\the-movies-dataset\clean_data.csv', index=False)
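One caveat when this file is read back later: a CSV has no list type, so the `genres` lists are written as their string representation. The sketch below illustrates this on a small hypothetical frame, using an in-memory buffer in place of the real path; the names `cleaned`, `buffer` and `reloaded` are made up for illustration.

```python
import io
from ast import literal_eval

import pandas as pd

# Hypothetical one-row stand-in for the cleaned dataframe
cleaned = pd.DataFrame({
    'title': ['Toy Story'],
    'genres': [['animation', 'comedy', 'family']],
})

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
cleaned.to_csv(buffer, index=False)
buffer.seek(0)

reloaded = pd.read_csv(buffer)
type_before = type(reloaded.loc[0, 'genres'])  # str: the list was serialized as text

# Parse the text back into a Python list
reloaded['genres'] = reloaded['genres'].apply(literal_eval)

print(type_before.__name__, reloaded.loc[0, 'genres'])
```

Parsing the column once right after loading keeps the rest of the recommender code working with real lists.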