# MOVIE RECOMMENDER SYSTEM

We are using a dataset from TMDB (https://www.themoviedb.org/about). **The Movie Database** (TMDB) is a community-built movie and TV database. The dataset covers around 5000 movies and their associated details, and it is available on Kaggle: https://www.kaggle.com/datasets/tmdb/tmdb-movie-metadata.

### TASK:
> Users will be asked to type or select a movie that interests them. We will develop a recommender system that suggests a list of movies similar to the one selected. At most 6 recommended movies will be returned for each choice.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
import warnings
warnings.filterwarnings('ignore')

color_pal = sns.color_palette()
sns.set()

In [2]:
movies = pd.read_csv('tmdb_5000_movies.csv')
credits = pd.read_csv('tmdb_5000_credits.csv')

In [3]:
print('Shape of movies dataframe:', movies.shape)
print('Shape of credits dataframe:', credits.shape)

Shape of movies dataframe: (4803, 20)
Shape of credits dataframe: (4803, 4)


In [4]:
movies.head(2)

Unnamed: 0,budget,genres,homepage,id,keywords,original_language,original_title,overview,popularity,production_companies,production_countries,release_date,revenue,runtime,spoken_languages,status,tagline,title,vote_average,vote_count
0,237000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://www.avatarmovie.com/,19995,"[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...",en,Avatar,"In the 22nd century, a paraplegic Marine is di...",150.437577,"[{""name"": ""Ingenious Film Partners"", ""id"": 289...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2009-12-10,2787965087,162.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}, {""iso...",Released,Enter the World of Pandora.,Avatar,7.2,11800
1,300000000,"[{""id"": 12, ""name"": ""Adventure""}, {""id"": 14, ""...",http://disney.go.com/disneypictures/pirates/,285,"[{""id"": 270, ""name"": ""ocean""}, {""id"": 726, ""na...",en,Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha...",139.082615,"[{""name"": ""Walt Disney Pictures"", ""id"": 2}, {""...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2007-05-19,961000000,169.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"At the end of the world, the adventure begins.",Pirates of the Caribbean: At World's End,6.9,4500


In [5]:
credits.head(2)

Unnamed: 0,movie_id,title,cast,crew
0,19995,Avatar,"[{""cast_id"": 242, ""character"": ""Jake Sully"", ""...","[{""credit_id"": ""52fe48009251416c750aca23"", ""de..."
1,285,Pirates of the Caribbean: At World's End,"[{""cast_id"": 4, ""character"": ""Captain Jack Spa...","[{""credit_id"": ""52fe4232c3a36847f800b579"", ""de..."


In [6]:
movies.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4803 entries, 0 to 4802
Data columns (total 20 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   budget                4803 non-null   int64  
 1   genres                4803 non-null   object 
 2   homepage              1712 non-null   object 
 3   id                    4803 non-null   int64  
 4   keywords              4803 non-null   object 
 5   original_language     4803 non-null   object 
 6   original_title        4803 non-null   object 
 7   overview              4800 non-null   object 
 8   popularity            4803 non-null   float64
 9   production_companies  4803 non-null   object 
 10  production_countries  4803 non-null   object 
 11  release_date          4802 non-null   object 
 12  revenue               4803 non-null   int64  
 13  runtime               4801 non-null   float64
 14  spoken_languages      4803 non-null   object 
 15  status               

In [7]:
credits.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4803 entries, 0 to 4802
Data columns (total 4 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   movie_id  4803 non-null   int64 
 1   title     4803 non-null   object
 2   cast      4803 non-null   object
 3   crew      4803 non-null   object
dtypes: int64(1), object(3)
memory usage: 150.2+ KB


In [8]:
# both dataframes share a common feature named 'title'
# we merge the two dataframes on this common feature
# note: titles are not guaranteed to be unique, so the merged frame can have more rows than either input

merged_df = movies.merge(credits, on='title')

In [9]:
print('Shape of merged dataframe:', merged_df.shape)

Shape of merged dataframe: (4809, 23)
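
The merged dataframe has 4809 rows, six more than either input (4803). This is consistent with a few titles appearing more than once: a join on `title` pairs every row with each same-titled row in the other frame. The sketch below reproduces the effect on toy data (hypothetical values, not the TMDB files) and shows a join on the identifier instead. It assumes that `id` in the movies frame corresponds to `movie_id` in the credits frame, which the `head()` outputs suggest but which has not been checked for every row.

```python
import pandas as pd

# Toy frames (hypothetical data): two different films share the title "Batman".
movies_toy = pd.DataFrame({
    "id": [1, 2, 3],
    "title": ["Batman", "Batman", "Avatar"],
})
credits_toy = pd.DataFrame({
    "movie_id": [1, 2, 3],
    "title": ["Batman", "Batman", "Avatar"],
    "cast": ["cast A", "cast B", "cast C"],
})

# Join on title: each "Batman" row matches both "Batman" rows, giving 2*2 + 1 = 5 rows.
by_title = movies_toy.merge(credits_toy, on="title")

# Join on the identifier: one row per film.
by_id = movies_toy.merge(
    credits_toy.drop(columns="title"),
    left_on="id",
    right_on="movie_id",
)

print(len(by_title), len(by_id))  # 5 3
```

The rest of this notebook keeps the title-based merge, so the row counts shown in the outputs refer to that version.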


In [10]:
merged_df.head(2)

Unnamed: 0,budget,genres,homepage,id,keywords,original_language,original_title,overview,popularity,production_companies,...,runtime,spoken_languages,status,tagline,title,vote_average,vote_count,movie_id,cast,crew
0,237000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://www.avatarmovie.com/,19995,"[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...",en,Avatar,"In the 22nd century, a paraplegic Marine is di...",150.437577,"[{""name"": ""Ingenious Film Partners"", ""id"": 289...",...,162.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}, {""iso...",Released,Enter the World of Pandora.,Avatar,7.2,11800,19995,"[{""cast_id"": 242, ""character"": ""Jake Sully"", ""...","[{""credit_id"": ""52fe48009251416c750aca23"", ""de..."
1,300000000,"[{""id"": 12, ""name"": ""Adventure""}, {""id"": 14, ""...",http://disney.go.com/disneypictures/pirates/,285,"[{""id"": 270, ""name"": ""ocean""}, {""id"": 726, ""na...",en,Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha...",139.082615,"[{""name"": ""Walt Disney Pictures"", ""id"": 2}, {""...",...,169.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"At the end of the world, the adventure begins.",Pirates of the Caribbean: At World's End,6.9,4500,285,"[{""cast_id"": 4, ""character"": ""Captain Jack Spa...","[{""credit_id"": ""52fe4232c3a36847f800b579"", ""de..."


In [11]:
merged_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4809 entries, 0 to 4808
Data columns (total 23 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   budget                4809 non-null   int64  
 1   genres                4809 non-null   object 
 2   homepage              1713 non-null   object 
 3   id                    4809 non-null   int64  
 4   keywords              4809 non-null   object 
 5   original_language     4809 non-null   object 
 6   original_title        4809 non-null   object 
 7   overview              4806 non-null   object 
 8   popularity            4809 non-null   float64
 9   production_companies  4809 non-null   object 
 10  production_countries  4809 non-null   object 
 11  release_date          4808 non-null   object 
 12  revenue               4809 non-null   int64  
 13  runtime               4807 non-null   float64
 14  spoken_languages      4809 non-null   object 
 15  status               

In [12]:
merged_df['cast'][0]

'[{"cast_id": 242, "character": "Jake Sully", "credit_id": "5602a8a7c3a3685532001c9a", "gender": 2, "id": 65731, "name": "Sam Worthington", "order": 0}, {"cast_id": 3, "character": "Neytiri", "credit_id": "52fe48009251416c750ac9cb", "gender": 1, "id": 8691, "name": "Zoe Saldana", "order": 1}, {"cast_id": 25, "character": "Dr. Grace Augustine", "credit_id": "52fe48009251416c750aca39", "gender": 1, "id": 10205, "name": "Sigourney Weaver", "order": 2}, {"cast_id": 4, "character": "Col. Quaritch", "credit_id": "52fe48009251416c750ac9cf", "gender": 2, "id": 32747, "name": "Stephen Lang", "order": 3}, {"cast_id": 5, "character": "Trudy Chacon", "credit_id": "52fe48009251416c750ac9d3", "gender": 1, "id": 17647, "name": "Michelle Rodriguez", "order": 4}, {"cast_id": 8, "character": "Selfridge", "credit_id": "52fe48009251416c750ac9e1", "gender": 2, "id": 1771, "name": "Giovanni Ribisi", "order": 5}, {"cast_id": 7, "character": "Norm Spellman", "credit_id": "52fe48009251416c750ac9dd", "gender": 

- {"cast_id": 242, "character": "Jake Sully", "credit_id": "5602a8a7c3a3685532001c9a", "gender": 2, "id": 65731, "name": "Sam Worthington", "order": 0}

Looking at the first record of the 'cast' feature, we can see that it holds the details of every cast member, including the *cast ID*, *gender*, *name of the character played*, *real-life name of the actor/actress*, etc.

In [13]:
merged_df['crew'][0]

'[{"credit_id": "52fe48009251416c750aca23", "department": "Editing", "gender": 0, "id": 1721, "job": "Editor", "name": "Stephen E. Rivkin"}, {"credit_id": "539c47ecc3a36810e3001f87", "department": "Art", "gender": 2, "id": 496, "job": "Production Design", "name": "Rick Carter"}, {"credit_id": "54491c89c3a3680fb4001cf7", "department": "Sound", "gender": 0, "id": 900, "job": "Sound Designer", "name": "Christopher Boyes"}, {"credit_id": "54491cb70e0a267480001bd0", "department": "Sound", "gender": 0, "id": 900, "job": "Supervising Sound Editor", "name": "Christopher Boyes"}, {"credit_id": "539c4a4cc3a36810c9002101", "department": "Production", "gender": 1, "id": 1262, "job": "Casting", "name": "Mali Finn"}, {"credit_id": "5544ee3b925141499f0008fc", "department": "Sound", "gender": 2, "id": 1729, "job": "Original Music Composer", "name": "James Horner"}, {"credit_id": "52fe48009251416c750ac9c3", "department": "Directing", "gender": 2, "id": 2710, "job": "Director", "name": "James Cameron"},

- {"credit_id": "52fe48009251416c750aca23", "department": "Editing", "gender": 0, "id": 1721, "job": "Editor", "name": "Stephen E. Rivkin"}

Looking at the first record of the 'crew' feature, we can see that it holds the details of every crew member, including the *name of the crew member*, *their department (e.g. Editing, Sound, Directing)*, *gender*, *their job (e.g. Production Design, Sound Designer, Director of Photography, Art Director, Visual Effects Producer)*, etc.

Since we are building a *content-based recommender system*, we need to select the features that help derive the *tags* on which our recommendation model is based. Recommendations are then produced by comparing these tags. The features we will use are listed below:
- genres
- movie_id
- keywords
- title
- overview
- cast
- crew

In [14]:
# 'budget', 'popularity', 'revenue', 'runtime', 'vote_average', 'vote_count'
# these are numeric features and they do not help in deriving tags for the movies

# consider budget as an example:
# say a user named 'Mr. XYZ' liked a movie with a high budget of $300 million
# there is no guarantee that this user will like other movies with a similar budget
# the same reasoning applies to the other numeric features listed above
# so we will drop these features

In [15]:
merged_df['original_language'].value_counts()

# let's analyze the counts below
# roughly 94% of the movies in our dataset (4510 of 4809) have 'en' (English) as their original language
# since the feature is so imbalanced, it won't help us derive any relevant tags
# so we can drop this feature from our dataframe

en    4510
fr      70
es      32
zh      27
de      27
hi      19
ja      16
it      14
ko      12
cn      12
ru      11
pt       9
da       7
sv       5
nl       4
fa       4
th       3
he       3
ta       2
cs       2
ro       2
id       2
ar       2
vi       1
sl       1
ps       1
no       1
ky       1
hu       1
pl       1
af       1
nb       1
tr       1
is       1
xx       1
te       1
el       1
Name: original_language, dtype: int64
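
The share of English-language movies can be computed rather than estimated by eye. A minimal sketch, assuming pandas' `value_counts(normalize=True)` and using a small Series rebuilt from the counts above; the `'other'` label is a hypothetical bucket that groups every language except `en` and `fr`:

```python
import pandas as pd

# Rebuild the language column from the counts printed above:
# 4510 'en', 70 'fr', and the remaining 229 rows grouped as 'other' (hypothetical label).
langs = pd.Series(["en"] * 4510 + ["fr"] * 70 + ["other"] * 229)

# normalize=True returns fractions of the total instead of raw counts.
share = langs.value_counts(normalize=True)

print(round(share["en"] * 100, 1))  # 93.8
```

On the real dataframe the same call would be `merged_df['original_language'].value_counts(normalize=True)`.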

In [16]:
# 'homepage', 'production_companies', 'production_countries'
# these features are of little use in deriving tags for the movies
# in daily life, we rarely recommend movies to our friends
# based on the production house or the country in which it is located

In [17]:
# the 'original_title' feature might contain the name of a movie in its native/regional language
# so we keep the 'title' feature in our dataframe instead of 'original_title'
# since the 'title' feature holds the English name of each movie in our dataset

# the 'overview' feature holds a short description of the storyline
# in daily life we often recommend a movie to family/friends by giving a brief outline of its story
# so we can derive useful tags from the 'overview' feature
# 'overview' serves this purpose better than the 'tagline' feature
# so we keep 'overview' instead of 'tagline' in our dataframe

In [18]:
# 'cast', 'crew' - we include these features in our dataframe
# in daily life, we often recommend movies to family/friends
# based on actors/actresses, directors, etc.

# for example, say there is a user named 'Mrs. ABC'
# based on her reviews, she likes movies directed by a person named 'Tom Holland'
# so there is a good chance she would also like other movies directed by this person

In [19]:
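# .copy() gives an independent dataframe, so later column assignments do not act on a slice of merged_df
feat_df = merged_df[['movie_id', 'title', 'genres', 'keywords', 'overview', 'cast', 'crew']].copy()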

In [20]:
feat_df.head()

Unnamed: 0,movie_id,title,genres,keywords,overview,cast,crew
0,19995,Avatar,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...","[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...","In the 22nd century, a paraplegic Marine is di...","[{""cast_id"": 242, ""character"": ""Jake Sully"", ""...","[{""credit_id"": ""52fe48009251416c750aca23"", ""de..."
1,285,Pirates of the Caribbean: At World's End,"[{""id"": 12, ""name"": ""Adventure""}, {""id"": 14, ""...","[{""id"": 270, ""name"": ""ocean""}, {""id"": 726, ""na...","Captain Barbossa, long believed to be dead, ha...","[{""cast_id"": 4, ""character"": ""Captain Jack Spa...","[{""credit_id"": ""52fe4232c3a36847f800b579"", ""de..."
2,206647,Spectre,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...","[{""id"": 470, ""name"": ""spy""}, {""id"": 818, ""name...",A cryptic message from Bond’s past sends him o...,"[{""cast_id"": 1, ""character"": ""James Bond"", ""cr...","[{""credit_id"": ""54805967c3a36829b5002c41"", ""de..."
3,49026,The Dark Knight Rises,"[{""id"": 28, ""name"": ""Action""}, {""id"": 80, ""nam...","[{""id"": 849, ""name"": ""dc comics""}, {""id"": 853,...",Following the death of District Attorney Harve...,"[{""cast_id"": 2, ""character"": ""Bruce Wayne / Ba...","[{""credit_id"": ""52fe4781c3a36847f81398c3"", ""de..."
4,49529,John Carter,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...","[{""id"": 818, ""name"": ""based on novel""}, {""id"":...","John Carter is a war-weary, former military ca...","[{""cast_id"": 5, ""character"": ""John Carter"", ""c...","[{""credit_id"": ""52fe479ac3a36847f813eaa3"", ""de..."


In [21]:
# checking for missing values
feat_df.isnull().sum()

movie_id    0
title       0
genres      0
keywords    0
overview    3
cast        0
crew        0
dtype: int64

In [22]:
# as the number of missing values is very small, we can drop the rows that contain them
feat_df = feat_df.dropna()

In [23]:
# checking for duplicated values
feat_df.duplicated().sum()

0

In [24]:
# our goal is a final dataframe with 3 features: 'movie_id', 'title', 'tags'
# for the 'tags' feature we need to merge the 'genres', 'overview', 'keywords', 'cast' and 'crew' features together
# so that all the relevant words are captured under a single feature
# this lets us compare the tags of two or more movies and finally share the recommendations

## PRE-PROCESSING STEP:

#### I. TRANSFORMING THE "GENRES" FEATURE

In [25]:
feat_df['genres'][0]

'[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}, {"id": 14, "name": "Fantasy"}, {"id": 878, "name": "Science Fiction"}]'

In [26]:
# from the result above we can infer that, for a single record,
# the 'genres' feature holds several dictionaries bundled in a list

# [{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}, {"id": 14, "name": "Fantasy"}, {"id": 878, "name": "Science Fiction"}]
# for the above data, we want to retrieve a list of tags (as shown below):
    # - ['Action', 'Adventure', 'Fantasy', 'Science Fiction']

# however, there is a small problem: as seen below, the value is not a list but a string
# so the names cannot be read from it directly until the string is parsed into a list
type(feat_df['genres'][0])

str

In [27]:
# the literal_eval() function from the ast module parses a string holding a Python literal
# (here, a list of dictionaries) into the corresponding object, so we first import the ast module

import ast
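
A short sketch of what `ast.literal_eval` does with a string in this shape, using a shortened version of the 'genres' value shown earlier. Since the value also happens to be valid JSON, `json.loads` is shown as a possible alternative; that it would work for every row of the dataset is an assumption, not something verified here.

```python
import ast
import json

# Shortened sample in the same format as the 'genres' column.
raw = '[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}]'

parsed = ast.literal_eval(raw)           # str -> list of dicts
names = [d["name"] for d in parsed]      # keep only the 'name' values

print(type(raw).__name__, type(parsed).__name__)  # str list
print(names)                                      # ['Action', 'Adventure']

# For this particular string, json.loads yields the same structure.
assert json.loads(raw) == parsed
```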

In [28]:
# let us create a function (func_1) that unpacks the bundled dictionaries
# the object is a string representation of a list, so we first convert it into a list using ast.literal_eval()
# we then collect the 'name' value of each dictionary to form a list of tags

def func_1(obj):
    names = []  # not named 'list', which would shadow the built-in type
    for i in ast.literal_eval(obj):
        names.append(i['name'])
    return names

In [29]:
feat_df['genres'] = feat_df['genres'].apply(func_1)

In [30]:
feat_df.head()

Unnamed: 0,movie_id,title,genres,keywords,overview,cast,crew
0,19995,Avatar,"[Action, Adventure, Fantasy, Science Fiction]","[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...","In the 22nd century, a paraplegic Marine is di...","[{""cast_id"": 242, ""character"": ""Jake Sully"", ""...","[{""credit_id"": ""52fe48009251416c750aca23"", ""de..."
1,285,Pirates of the Caribbean: At World's End,"[Adventure, Fantasy, Action]","[{""id"": 270, ""name"": ""ocean""}, {""id"": 726, ""na...","Captain Barbossa, long believed to be dead, ha...","[{""cast_id"": 4, ""character"": ""Captain Jack Spa...","[{""credit_id"": ""52fe4232c3a36847f800b579"", ""de..."
2,206647,Spectre,"[Action, Adventure, Crime]","[{""id"": 470, ""name"": ""spy""}, {""id"": 818, ""name...",A cryptic message from Bond’s past sends him o...,"[{""cast_id"": 1, ""character"": ""James Bond"", ""cr...","[{""credit_id"": ""54805967c3a36829b5002c41"", ""de..."
3,49026,The Dark Knight Rises,"[Action, Crime, Drama, Thriller]","[{""id"": 849, ""name"": ""dc comics""}, {""id"": 853,...",Following the death of District Attorney Harve...,"[{""cast_id"": 2, ""character"": ""Bruce Wayne / Ba...","[{""credit_id"": ""52fe4781c3a36847f81398c3"", ""de..."
4,49529,John Carter,"[Action, Adventure, Science Fiction]","[{""id"": 818, ""name"": ""based on novel""}, {""id"":...","John Carter is a war-weary, former military ca...","[{""cast_id"": 5, ""character"": ""John Carter"", ""c...","[{""credit_id"": ""52fe479ac3a36847f813eaa3"", ""de..."


#### II. TRANSFORMING THE "KEYWORDS" FEATURE

In [31]:
feat_df['keywords'][0]

# the 'keywords' feature has the same format: dictionaries bundled in a list, stored as a string
# so we can apply the same func_1() function to extract the tags from this (str) object

'[{"id": 1463, "name": "culture clash"}, {"id": 2964, "name": "future"}, {"id": 3386, "name": "space war"}, {"id": 3388, "name": "space colony"}, {"id": 3679, "name": "society"}, {"id": 3801, "name": "space travel"}, {"id": 9685, "name": "futuristic"}, {"id": 9840, "name": "romance"}, {"id": 9882, "name": "space"}, {"id": 9951, "name": "alien"}, {"id": 10148, "name": "tribe"}, {"id": 10158, "name": "alien planet"}, {"id": 10987, "name": "cgi"}, {"id": 11399, "name": "marine"}, {"id": 13065, "name": "soldier"}, {"id": 14643, "name": "battle"}, {"id": 14720, "name": "love affair"}, {"id": 165431, "name": "anti war"}, {"id": 193554, "name": "power relations"}, {"id": 206690, "name": "mind and soul"}, {"id": 209714, "name": "3d"}]'

In [32]:
feat_df['keywords'] = feat_df['keywords'].apply(func_1)

In [33]:
feat_df.head()

Unnamed: 0,movie_id,title,genres,keywords,overview,cast,crew
0,19995,Avatar,"[Action, Adventure, Fantasy, Science Fiction]","[culture clash, future, space war, space colon...","In the 22nd century, a paraplegic Marine is di...","[{""cast_id"": 242, ""character"": ""Jake Sully"", ""...","[{""credit_id"": ""52fe48009251416c750aca23"", ""de..."
1,285,Pirates of the Caribbean: At World's End,"[Adventure, Fantasy, Action]","[ocean, drug abuse, exotic island, east india ...","Captain Barbossa, long believed to be dead, ha...","[{""cast_id"": 4, ""character"": ""Captain Jack Spa...","[{""credit_id"": ""52fe4232c3a36847f800b579"", ""de..."
2,206647,Spectre,"[Action, Adventure, Crime]","[spy, based on novel, secret agent, sequel, mi...",A cryptic message from Bond’s past sends him o...,"[{""cast_id"": 1, ""character"": ""James Bond"", ""cr...","[{""credit_id"": ""54805967c3a36829b5002c41"", ""de..."
3,49026,The Dark Knight Rises,"[Action, Crime, Drama, Thriller]","[dc comics, crime fighter, terrorist, secret i...",Following the death of District Attorney Harve...,"[{""cast_id"": 2, ""character"": ""Bruce Wayne / Ba...","[{""credit_id"": ""52fe4781c3a36847f81398c3"", ""de..."
4,49529,John Carter,"[Action, Adventure, Science Fiction]","[based on novel, mars, medallion, space travel...","John Carter is a war-weary, former military ca...","[{""cast_id"": 5, ""character"": ""John Carter"", ""c...","[{""credit_id"": ""52fe479ac3a36847f813eaa3"", ""de..."


In [34]:
# now let us analyze the 'cast' & 'crew' features

#### III. TRANSFORMING THE "CAST" FEATURE

In [35]:
feat_df['cast'][0]

# we can see a long list of actors/actresses under the 'cast' feature for a single movie
# let us make an assumption here: we keep only the top 4 actors/actresses for each movie
# first dictionary - {"cast_id": 242, "character": "Jake Sully", "credit_id": "5602a8a7c3a3685532001c9a", "gender": 2, "id": 65731, "name": "Sam Worthington", "order": 0}
# second dictionary - {"cast_id": 3, "character": "Neytiri", "credit_id": "52fe48009251416c750ac9cb", "gender": 1, "id": 8691, "name": "Zoe Saldana", "order": 1}
# third dictionary - {"cast_id": 25, "character": "Dr. Grace Augustine", "credit_id": "52fe48009251416c750aca39", "gender": 1, "id": 10205, "name": "Sigourney Weaver", "order": 2}
# fourth dictionary - {"cast_id": 4, "character": "Col. Quaritch", "credit_id": "52fe48009251416c750ac9cf", "gender": 2, "id": 32747, "name": "Stephen Lang", "order": 3}

# we want only the names of the top 4 actors/actresses
# so we only need the 'name' attribute from each of these dictionary objects

'[{"cast_id": 242, "character": "Jake Sully", "credit_id": "5602a8a7c3a3685532001c9a", "gender": 2, "id": 65731, "name": "Sam Worthington", "order": 0}, {"cast_id": 3, "character": "Neytiri", "credit_id": "52fe48009251416c750ac9cb", "gender": 1, "id": 8691, "name": "Zoe Saldana", "order": 1}, {"cast_id": 25, "character": "Dr. Grace Augustine", "credit_id": "52fe48009251416c750aca39", "gender": 1, "id": 10205, "name": "Sigourney Weaver", "order": 2}, {"cast_id": 4, "character": "Col. Quaritch", "credit_id": "52fe48009251416c750ac9cf", "gender": 2, "id": 32747, "name": "Stephen Lang", "order": 3}, {"cast_id": 5, "character": "Trudy Chacon", "credit_id": "52fe48009251416c750ac9d3", "gender": 1, "id": 17647, "name": "Michelle Rodriguez", "order": 4}, {"cast_id": 8, "character": "Selfridge", "credit_id": "52fe48009251416c750ac9e1", "gender": 2, "id": 1771, "name": "Giovanni Ribisi", "order": 5}, {"cast_id": 7, "character": "Norm Spellman", "credit_id": "52fe48009251416c750ac9dd", "gender": 

Let us create a function (func_2) that unpacks the bundled dictionaries for the 'cast' feature. The object is a string representation of a list, so we first convert it into a list using ast.literal_eval(). We then append the names of the first 4 actors/actresses to an empty list to form the list of names.

## FOR SIMPLIFICATION:
- "character": "Jake Sully" ---- "name": "Sam Worthington"
- This indicates that Sam Worthington played the character named 'Jake Sully' in the movie titled 'Avatar'
- So we need the real-life name (i.e. Sam Worthington) in our list
- This will help recommend other relevant movies featuring this actor

In [36]:
def func_2(obj):
    names = []  # not named 'list', which would shadow the built-in type
    # slicing keeps the first 4 entries (or fewer, if the cast is shorter)
    for i in ast.literal_eval(obj)[:4]:
        names.append(i['name'])
    return names
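
The function above relies on the cast entries already being stored in billing order. As a hedged alternative, the sketch below sorts by the `order` field before taking the top entries, so it does not depend on how the list happens to be arranged. `top_cast` and its parameter `n` are hypothetical names introduced only for this illustration, and the sample string is a shortened, reordered version of the Avatar record.

```python
import ast

def top_cast(obj, n=4):
    """Return the names of the n cast members with the lowest 'order' value."""
    members = ast.literal_eval(obj)                       # str -> list of dicts
    members = sorted(members, key=lambda m: m["order"])   # billing order
    return [m["name"] for m in members[:n]]

# Shortened sample, deliberately out of order.
sample = (
    '[{"name": "Zoe Saldana", "order": 1}, '
    '{"name": "Sam Worthington", "order": 0}, '
    '{"name": "Sigourney Weaver", "order": 2}]'
)

print(top_cast(sample, n=2))  # ['Sam Worthington', 'Zoe Saldana']
```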

In [37]:
feat_df['cast'] = feat_df['cast'].apply(func_2)
print(feat_df['cast'])

0       [Sam Worthington, Zoe Saldana, Sigourney Weave...
1       [Johnny Depp, Orlando Bloom, Keira Knightley, ...
2       [Daniel Craig, Christoph Waltz, Léa Seydoux, R...
3       [Christian Bale, Michael Caine, Gary Oldman, A...
4       [Taylor Kitsch, Lynn Collins, Samantha Morton,...
                              ...                        
4804    [Carlos Gallardo, Jaime de Hoyos, Peter Marqua...
4805    [Edward Burns, Kerry Bishé, Marsha Dietlein, C...
4806    [Eric Mabius, Kristin Booth, Crystal Lowe, Geo...
4807    [Daniel Henney, Eliza Coupe, Bill Paxton, Alan...
4808    [Drew Barrymore, Brian Herzlinger, Corey Feldm...
Name: cast, Length: 4806, dtype: object


In [38]:
# we have extracted the top 4 actors/actresses of each movie as a list
# this gives us the lead actors/actresses of each movie, which we can use in our 'tags' feature
# to relate it to other movies featuring the same actor/actress and recommend those to the users

In [39]:
feat_df.head()

Unnamed: 0,movie_id,title,genres,keywords,overview,cast,crew
0,19995,Avatar,"[Action, Adventure, Fantasy, Science Fiction]","[culture clash, future, space war, space colon...","In the 22nd century, a paraplegic Marine is di...","[Sam Worthington, Zoe Saldana, Sigourney Weave...","[{""credit_id"": ""52fe48009251416c750aca23"", ""de..."
1,285,Pirates of the Caribbean: At World's End,"[Adventure, Fantasy, Action]","[ocean, drug abuse, exotic island, east india ...","Captain Barbossa, long believed to be dead, ha...","[Johnny Depp, Orlando Bloom, Keira Knightley, ...","[{""credit_id"": ""52fe4232c3a36847f800b579"", ""de..."
2,206647,Spectre,"[Action, Adventure, Crime]","[spy, based on novel, secret agent, sequel, mi...",A cryptic message from Bond’s past sends him o...,"[Daniel Craig, Christoph Waltz, Léa Seydoux, R...","[{""credit_id"": ""54805967c3a36829b5002c41"", ""de..."
3,49026,The Dark Knight Rises,"[Action, Crime, Drama, Thriller]","[dc comics, crime fighter, terrorist, secret i...",Following the death of District Attorney Harve...,"[Christian Bale, Michael Caine, Gary Oldman, A...","[{""credit_id"": ""52fe4781c3a36847f81398c3"", ""de..."
4,49529,John Carter,"[Action, Adventure, Science Fiction]","[based on novel, mars, medallion, space travel...","John Carter is a war-weary, former military ca...","[Taylor Kitsch, Lynn Collins, Samantha Morton,...","[{""credit_id"": ""52fe479ac3a36847f813eaa3"", ""de..."


#### IV. TRANSFORMING THE "CREW" FEATURE

In [40]:
feat_df['crew'][0]

'[{"credit_id": "52fe48009251416c750aca23", "department": "Editing", "gender": 0, "id": 1721, "job": "Editor", "name": "Stephen E. Rivkin"}, {"credit_id": "539c47ecc3a36810e3001f87", "department": "Art", "gender": 2, "id": 496, "job": "Production Design", "name": "Rick Carter"}, {"credit_id": "54491c89c3a3680fb4001cf7", "department": "Sound", "gender": 0, "id": 900, "job": "Sound Designer", "name": "Christopher Boyes"}, {"credit_id": "54491cb70e0a267480001bd0", "department": "Sound", "gender": 0, "id": 900, "job": "Supervising Sound Editor", "name": "Christopher Boyes"}, {"credit_id": "539c4a4cc3a36810c9002101", "department": "Production", "gender": 1, "id": 1262, "job": "Casting", "name": "Mali Finn"}, {"credit_id": "5544ee3b925141499f0008fc", "department": "Sound", "gender": 2, "id": 1729, "job": "Original Music Composer", "name": "James Horner"}, {"credit_id": "52fe48009251416c750ac9c3", "department": "Directing", "gender": 2, "id": 2710, "job": "Director", "name": "James Cameron"},

In [41]:
# from the 'crew' feature, we only need the director's name for our recommendation model
# so we need a function that searches the dictionaries for the "job": "Director" value
# we assume one director per movie and keep only the first one listed
# so once a director's name is found, we terminate the loop using the break statement

# for simplification:
# {"credit_id": "52fe48009251416c750aca23", "department": "Editing", "gender": 0, "id": 1721, "job": "Editor", "name": "Stephen E. Rivkin"}
# in the above dictionary, the person named 'Stephen E. Rivkin' is the Editor of the movie titled 'Avatar'

In [42]:
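def fetch_director(obj):
    director = []  # not named 'list', which would shadow the built-in type
    for i in ast.literal_eval(obj):
        if i['job'] == 'Director':
            director.append(i['name'])
            break  # stop at the first director found
    return director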
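
The function above stops at the first match, so a co-directed movie contributes only one name. If every director should be kept, a variant along the following lines could be used. This is a sketch only: `fetch_directors` is a hypothetical name, and the sample record is made up for illustration.

```python
import ast

def fetch_directors(obj):
    """Return the names of all crew members whose job is 'Director'."""
    crew = ast.literal_eval(obj)  # str -> list of dicts
    return [m["name"] for m in crew if m.get("job") == "Director"]

# Made-up crew record with two directors and one editor.
sample = (
    '[{"job": "Editor", "name": "Person A"}, '
    '{"job": "Director", "name": "Person B"}, '
    '{"job": "Director", "name": "Person C"}]'
)

print(fetch_directors(sample))  # ['Person B', 'Person C']
```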

In [43]:
feat_df['crew'] = feat_df['crew'].apply(fetch_director)
feat_df['crew']

0           [James Cameron]
1          [Gore Verbinski]
2              [Sam Mendes]
3       [Christopher Nolan]
4          [Andrew Stanton]
               ...         
4804     [Robert Rodriguez]
4805         [Edward Burns]
4806          [Scott Smith]
4807          [Daniel Hsia]
4808     [Brian Herzlinger]
Name: crew, Length: 4806, dtype: object

In [44]:
feat_df.head()

Unnamed: 0,movie_id,title,genres,keywords,overview,cast,crew
0,19995,Avatar,"[Action, Adventure, Fantasy, Science Fiction]","[culture clash, future, space war, space colon...","In the 22nd century, a paraplegic Marine is di...","[Sam Worthington, Zoe Saldana, Sigourney Weave...",[James Cameron]
1,285,Pirates of the Caribbean: At World's End,"[Adventure, Fantasy, Action]","[ocean, drug abuse, exotic island, east india ...","Captain Barbossa, long believed to be dead, ha...","[Johnny Depp, Orlando Bloom, Keira Knightley, ...",[Gore Verbinski]
2,206647,Spectre,"[Action, Adventure, Crime]","[spy, based on novel, secret agent, sequel, mi...",A cryptic message from Bond’s past sends him o...,"[Daniel Craig, Christoph Waltz, Léa Seydoux, R...",[Sam Mendes]
3,49026,The Dark Knight Rises,"[Action, Crime, Drama, Thriller]","[dc comics, crime fighter, terrorist, secret i...",Following the death of District Attorney Harve...,"[Christian Bale, Michael Caine, Gary Oldman, A...",[Christopher Nolan]
4,49529,John Carter,"[Action, Adventure, Science Fiction]","[based on novel, mars, medallion, space travel...","John Carter is a war-weary, former military ca...","[Taylor Kitsch, Lynn Collins, Samantha Morton,...",[Andrew Stanton]


#### V. TRANSFORMING THE "OVERVIEW" FEATURE

In [45]:
feat_df['overview'][0]

'In the 22nd century, a paraplegic Marine is dispatched to the moon Pandora on a unique mission, but becomes torn between following orders and protecting an alien civilization.'

In [46]:
feat_df['overview'] = feat_df['overview'].apply(lambda x:x.split())
feat_df['overview']

0       [In, the, 22nd, century,, a, paraplegic, Marin...
1       [Captain, Barbossa,, long, believed, to, be, d...
2       [A, cryptic, message, from, Bond’s, past, send...
3       [Following, the, death, of, District, Attorney...
4       [John, Carter, is, a, war-weary,, former, mili...
                              ...                        
4804    [El, Mariachi, just, wants, to, play, his, gui...
4805    [A, newlywed, couple's, honeymoon, is, upended...
4806    ["Signed,, Sealed,, Delivered", introduces, a,...
4807    [When, ambitious, New, York, attorney, Sam, is...
4808    [Ever, since, the, second, grade, when, he, fi...
Name: overview, Length: 4806, dtype: object
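
Note that `split()` separates on whitespace only, so punctuation stays attached to the words (for example `century,` in the first row above). Whether that matters depends on how the tags are vectorized later. As an optional sketch, assuming lowercase word tokens are acceptable, a regular expression can be used instead of `split()`:

```python
import re

text = "In the 22nd century, a paraplegic Marine is dispatched to the moon Pandora"

# Whitespace split keeps the trailing comma on 'century,'.
print(text.split()[:4])   # ['In', 'the', '22nd', 'century,']

# \w+ matches runs of letters, digits and underscores, so punctuation is dropped.
tokens = re.findall(r"\w+", text.lower())
print(tokens[:4])         # ['in', 'the', '22nd', 'century']
```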

In [47]:
feat_df.head()

Unnamed: 0,movie_id,title,genres,keywords,overview,cast,crew
0,19995,Avatar,"[Action, Adventure, Fantasy, Science Fiction]","[culture clash, future, space war, space colon...","[In, the, 22nd, century,, a, paraplegic, Marin...","[Sam Worthington, Zoe Saldana, Sigourney Weave...",[James Cameron]
1,285,Pirates of the Caribbean: At World's End,"[Adventure, Fantasy, Action]","[ocean, drug abuse, exotic island, east india ...","[Captain, Barbossa,, long, believed, to, be, d...","[Johnny Depp, Orlando Bloom, Keira Knightley, ...",[Gore Verbinski]
2,206647,Spectre,"[Action, Adventure, Crime]","[spy, based on novel, secret agent, sequel, mi...","[A, cryptic, message, from, Bond’s, past, send...","[Daniel Craig, Christoph Waltz, Léa Seydoux, R...",[Sam Mendes]
3,49026,The Dark Knight Rises,"[Action, Crime, Drama, Thriller]","[dc comics, crime fighter, terrorist, secret i...","[Following, the, death, of, District, Attorney...","[Christian Bale, Michael Caine, Gary Oldman, A...",[Christopher Nolan]
4,49529,John Carter,"[Action, Adventure, Science Fiction]","[based on novel, mars, medallion, space travel...","[John, Carter, is, a, war-weary,, former, mili...","[Taylor Kitsch, Lynn Collins, Samantha Morton,...",[Andrew Stanton]


In [48]:
# all 5 features (genres, keywords, overview, cast, crew) are now stored as lists
# however, spaces inside multi-word entries would hurt our recommendation model

# for example:
    # 'Avatar' has an actor named 'Sam Worthington'
    # although this is a single name, the vectorizer splits text on spaces
    # so it would produce two separate tags: 'Sam' & 'Worthington'
    
    # 'Spectre' has a director named 'Sam Mendes'
    # which would likewise produce two separate tags: 'Sam' & 'Mendes'
    
# the shared 'Sam' tag would then make 'Avatar' and 'Spectre' look more similar than they are,
# even though 'Sam Worthington' and 'Sam Mendes' are unrelated people
# this can lead to incorrect recommendations
# to avoid this, we remove the spaces so that the model treats each name as a single tag

# 'Sam Worthington' --> 'SamWorthington'

In [49]:
# so we need to remove the spaces inside each string for 4 features: (genres, keywords, cast, crew)

feat_df['genres'] = feat_df['genres'].apply(lambda x:[i.replace(" ", "") for i in x])
feat_df['keywords'] = feat_df['keywords'].apply(lambda x:[i.replace(" ", "") for i in x])
feat_df['cast'] = feat_df['cast'].apply(lambda x:[i.replace(" ", "") for i in x])
feat_df['crew'] = feat_df['crew'].apply(lambda x:[i.replace(" ", "") for i in x])
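
To see why the spaces matter, here is a minimal sketch that splits text with the same regular expression `CountVectorizer` uses as its default `token_pattern`. The helper name `tokenize` is made up for illustration and is not part of the notebook.

```python
import re

# CountVectorizer's default token_pattern: runs of two or more word characters
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

def tokenize(text):
    """Lowercase the text and split it into tokens (hypothetical helper)."""
    return TOKEN_PATTERN.findall(text.lower())

# With spaces, both names contribute the same 'sam' token
with_spaces = tokenize("Sam Worthington") + tokenize("Sam Mendes")

# Without spaces, each name becomes one distinct token
without_spaces = tokenize("SamWorthington") + tokenize("SamMendes")

print(with_spaces)     # ['sam', 'worthington', 'sam', 'mendes']
print(without_spaces)  # ['samworthington', 'sammendes']
```

With the spaces removed, two movies only share a cast or crew tag when they share the same person.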

In [50]:
feat_df.head()

Unnamed: 0,movie_id,title,genres,keywords,overview,cast,crew
0,19995,Avatar,"[Action, Adventure, Fantasy, ScienceFiction]","[cultureclash, future, spacewar, spacecolony, ...","[In, the, 22nd, century,, a, paraplegic, Marin...","[SamWorthington, ZoeSaldana, SigourneyWeaver, ...",[JamesCameron]
1,285,Pirates of the Caribbean: At World's End,"[Adventure, Fantasy, Action]","[ocean, drugabuse, exoticisland, eastindiatrad...","[Captain, Barbossa,, long, believed, to, be, d...","[JohnnyDepp, OrlandoBloom, KeiraKnightley, Ste...",[GoreVerbinski]
2,206647,Spectre,"[Action, Adventure, Crime]","[spy, basedonnovel, secretagent, sequel, mi6, ...","[A, cryptic, message, from, Bond’s, past, send...","[DanielCraig, ChristophWaltz, LéaSeydoux, Ralp...",[SamMendes]
3,49026,The Dark Knight Rises,"[Action, Crime, Drama, Thriller]","[dccomics, crimefighter, terrorist, secretiden...","[Following, the, death, of, District, Attorney...","[ChristianBale, MichaelCaine, GaryOldman, Anne...",[ChristopherNolan]
4,49529,John Carter,"[Action, Adventure, ScienceFiction]","[basedonnovel, mars, medallion, spacetravel, p...","[John, Carter, is, a, war-weary,, former, mili...","[TaylorKitsch, LynnCollins, SamanthaMorton, Wi...",[AndrewStanton]


In [51]:
# concatenating all the 5 features into one feature titled 'tags'
feat_df['tags'] = feat_df['overview'] + feat_df['genres'] + feat_df['keywords'] + feat_df['cast'] + feat_df['crew']

In [52]:
feat_df.head()

Unnamed: 0,movie_id,title,genres,keywords,overview,cast,crew,tags
0,19995,Avatar,"[Action, Adventure, Fantasy, ScienceFiction]","[cultureclash, future, spacewar, spacecolony, ...","[In, the, 22nd, century,, a, paraplegic, Marin...","[SamWorthington, ZoeSaldana, SigourneyWeaver, ...",[JamesCameron],"[In, the, 22nd, century,, a, paraplegic, Marin..."
1,285,Pirates of the Caribbean: At World's End,"[Adventure, Fantasy, Action]","[ocean, drugabuse, exoticisland, eastindiatrad...","[Captain, Barbossa,, long, believed, to, be, d...","[JohnnyDepp, OrlandoBloom, KeiraKnightley, Ste...",[GoreVerbinski],"[Captain, Barbossa,, long, believed, to, be, d..."
2,206647,Spectre,"[Action, Adventure, Crime]","[spy, basedonnovel, secretagent, sequel, mi6, ...","[A, cryptic, message, from, Bond’s, past, send...","[DanielCraig, ChristophWaltz, LéaSeydoux, Ralp...",[SamMendes],"[A, cryptic, message, from, Bond’s, past, send..."
3,49026,The Dark Knight Rises,"[Action, Crime, Drama, Thriller]","[dccomics, crimefighter, terrorist, secretiden...","[Following, the, death, of, District, Attorney...","[ChristianBale, MichaelCaine, GaryOldman, Anne...",[ChristopherNolan],"[Following, the, death, of, District, Attorney..."
4,49529,John Carter,"[Action, Adventure, ScienceFiction]","[basedonnovel, mars, medallion, spacetravel, p...","[John, Carter, is, a, war-weary,, former, mili...","[TaylorKitsch, LynnCollins, SamanthaMorton, Wi...",[AndrewStanton],"[John, Carter, is, a, war-weary,, former, mili..."


In [53]:
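# .copy() gives an independent dataframe, so the column assignments below do not act on a slice of feat_df
preprocessed_df = feat_df[['movie_id', 'title', 'tags']].copy()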

In [54]:
preprocessed_df.head()

Unnamed: 0,movie_id,title,tags
0,19995,Avatar,"[In, the, 22nd, century,, a, paraplegic, Marin..."
1,285,Pirates of the Caribbean: At World's End,"[Captain, Barbossa,, long, believed, to, be, d..."
2,206647,Spectre,"[A, cryptic, message, from, Bond’s, past, send..."
3,49026,The Dark Knight Rises,"[Following, the, death, of, District, Attorney..."
4,49529,John Carter,"[John, Carter, is, a, war-weary,, former, mili..."


In [55]:
# now we will join the list of tags into a single space-separated string

preprocessed_df['tags'] = preprocessed_df['tags'].apply(lambda x:" ".join(x))
preprocessed_df

Unnamed: 0,movie_id,title,tags
0,19995,Avatar,"In the 22nd century, a paraplegic Marine is di..."
1,285,Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha..."
2,206647,Spectre,A cryptic message from Bond’s past sends him o...
3,49026,The Dark Knight Rises,Following the death of District Attorney Harve...
4,49529,John Carter,"John Carter is a war-weary, former military ca..."
...,...,...,...
4804,9367,El Mariachi,El Mariachi just wants to play his guitar and ...
4805,72766,Newlyweds,A newlywed couple's honeymoon is upended by th...
4806,231617,"Signed, Sealed, Delivered","""Signed, Sealed, Delivered"" introduces a dedic..."
4807,126186,Shanghai Calling,When ambitious New York attorney Sam is sent t...


In [56]:
# also we will have all these words converted to lowercase for simplification

preprocessed_df['tags'] = preprocessed_df['tags'].apply(lambda x:x.lower())
preprocessed_df

Unnamed: 0,movie_id,title,tags
0,19995,Avatar,"in the 22nd century, a paraplegic marine is di..."
1,285,Pirates of the Caribbean: At World's End,"captain barbossa, long believed to be dead, ha..."
2,206647,Spectre,a cryptic message from bond’s past sends him o...
3,49026,The Dark Knight Rises,following the death of district attorney harve...
4,49529,John Carter,"john carter is a war-weary, former military ca..."
...,...,...,...
4804,9367,El Mariachi,el mariachi just wants to play his guitar and ...
4805,72766,Newlyweds,a newlywed couple's honeymoon is upended by th...
4806,231617,"Signed, Sealed, Delivered","""signed, sealed, delivered"" introduces a dedic..."
4807,126186,Shanghai Calling,when ambitious new york attorney sam is sent t...


In [57]:
preprocessed_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4806 entries, 0 to 4808
Data columns (total 3 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   movie_id  4806 non-null   int64 
 1   title     4806 non-null   object
 2   tags      4806 non-null   object
dtypes: int64(1), object(2)
memory usage: 279.2+ KB


In [58]:
preprocessed_df['tags'][0]

'in the 22nd century, a paraplegic marine is dispatched to the moon pandora on a unique mission, but becomes torn between following orders and protecting an alien civilization. action adventure fantasy sciencefiction cultureclash future spacewar spacecolony society spacetravel futuristic romance space alien tribe alienplanet cgi marine soldier battle loveaffair antiwar powerrelations mindandsoul 3d samworthington zoesaldana sigourneyweaver stephenlang jamescameron'

In [59]:
preprocessed_df['tags'][1]

"captain barbossa, long believed to be dead, has come back to life and is headed to the edge of the earth with will turner and elizabeth swann. but nothing is quite as it seems. adventure fantasy action ocean drugabuse exoticisland eastindiatradingcompany loveofone'slife traitor shipwreck strongwoman ship alliance calypso afterlife fighter pirate swashbuckler aftercreditsstinger johnnydepp orlandobloom keiraknightley stellanskarsgård goreverbinski"

## COUNT VECTORIZATION

The data under the 'tags' feature is text, and to measure the similarity between two tag records we need to compare the words they have in common. We can't do this manually for 4806 movies, and a real-world application might hold far more records. So we transform each tag record into a numeric vector. Converting *text into vectors* is termed 'TEXT VECTORIZATION', and 'COUNT VECTORIZATION' is the variant that uses word counts as the vector entries. Common text vectorization techniques include:
- BAG OF WORDS (*we would be using it here in our problem*)
- TF-IDF
- word2vec

### BAG OF WORDS
'Bag of words' is the simplest of these techniques. We first concatenate all the tag records together into one 'bag' and then extract the words that occur most frequently across the complete bag. Our dataset holds about 4800 movies, and we will extract the 5000 most frequent words.

For simplification let's assume:
- **tag1** = 'in the 22nd century, a paraplegic marine is dispatched to the moon pandora on a unique mission, but becomes torn between following orders and protecting an alien civilization. action adventure fantasy sciencefiction cultureclash future spacewar spacecolony society spacetravel futuristic romance space alien tribe alienplanet cgi marine soldier battle loveaffair antiwar powerrelations mindandsoul 3d samworthington zoesaldana sigourneyweaver stephenlang jamescameron'
- **tag2** = 'captain barbossa, long believed to be dead, has come back to life and is headed to the edge of the earth with will turner and elizabeth swann. but nothing is quite as it seems. adventure fantasy action ocean drugabuse exoticisland eastindiatradingcompany loveofone'slife traitor shipwreck strongwoman ship alliance calypso afterlife fighter pirate swashbuckler aftercreditsstinger johnnydepp orlandobloom keiraknightley stellanskarsgård goreverbinski'

As above, there are 4806 tag records in textual form. We group them together under a single entity named 'bag':
- *bag* = *tag1 + tag2 + tag3 + ........ + tag4806*
- from this bag we extract the 5000 words that appear most frequently
- we refer to them as *w1, w2, w3, .......... , w5000*
- for each movie, we count how many times each of these words occurs in its tags
- this gives one vector of 5000 counts per movie

For example, with the vector positions ordered as [*w1, w2, w3, .......... , w5000*]:
- m1 = [*5, 3, 0, .......... , 2*]
- m2 = [*6, 1, 2, .......... , 0*]
- m3 = [*0, 0, 1, .......... , 0*]
- m4 = [*4, 2, 0, .......... , 0*]
- ..........
- ..........
- m4806 = [*1, 0, 0, .......... , 0*]

> This indicates that the first common word (w1) occurs 5 times in the tag data for movie (m1), the second common word (w2) occurs 3 times, the third common word (w3) occurs 0 times, and so on.

Each movie is now represented as a vector, and together these vectors form a matrix of shape (x, y), where (x) is the number of movies and (y) is the number of common words. In our problem the matrix has shape (4806, 5000): 4806 vectors, each with 5000 entries.
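
The counting procedure described above can be sketched by hand on a toy corpus. This is a simplified stand-in for `CountVectorizer`, assuming plain whitespace tokenization and no stop-word removal; the function name `bag_of_words` and the three toy tag strings are invented for illustration.

```python
from collections import Counter

def bag_of_words(docs, max_features):
    """Return (vocabulary, count vectors) for a list of text documents."""
    tokenized = [doc.lower().split() for doc in docs]

    # Step 1: pool every word from every document into one "bag"
    totals = Counter(word for tokens in tokenized for word in tokens)

    # Step 2: keep the max_features most frequent words, in alphabetical order
    vocab = sorted(word for word, _ in totals.most_common(max_features))

    # Step 3: count how often each vocabulary word occurs in each document
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append([counts[word] for word in vocab])
    return vocab, vectors

toy_tags = ["space alien space war", "alien planet", "pirate ship war"]
toy_vocab, toy_vectors = bag_of_words(toy_tags, max_features=3)

print(toy_vocab)    # ['alien', 'space', 'war']
print(toy_vectors)  # [[1, 2, 1], [1, 0, 0], [0, 0, 1]]
```

Three documents and three vocabulary words give a 3x3 matrix here; the same procedure on 4806 tag records with 5000 words gives the (4806, 5000) matrix.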

#### SO EVERY MOVIE IS NOW A VECTOR MAPPED UNDER A 5000-dimensional SPACE

**NOTE**: 'Stopwords' are words that help form sentences in the English language but contribute little to their meaning, for example: a, is, are, if, to, from, and. We exclude these words from our bag of common words because they do not help distinguish one movie from another.
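
As a small illustration of the effect, the sketch below fits two vectorizers on a made-up pair of sentences, one with and one without `stop_words='english'`, and compares the learned vocabularies. The toy sentences and variable names are assumptions for the example only.

```python
from sklearn.feature_extraction.text import CountVectorizer

toy_docs = ["a marine is on the moon", "the pirate is on a ship"]

# Vocabulary learned without any stop-word filtering
plain_cv = CountVectorizer()
plain_cv.fit(toy_docs)

# Vocabulary learned with scikit-learn's built-in English stop-word list
filtered_cv = CountVectorizer(stop_words='english')
filtered_cv.fit(toy_docs)

plain_vocab = sorted(plain_cv.vocabulary_)
filtered_vocab = sorted(filtered_cv.vocabulary_)

print(plain_vocab)     # ['is', 'marine', 'moon', 'on', 'pirate', 'ship', 'the']
print(filtered_vocab)  # ['marine', 'moon', 'pirate', 'ship']
```

Note that 'a' is missing from both vocabularies: the default token pattern ignores single-character tokens regardless of the stop-word setting.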

In [60]:
from sklearn.feature_extraction.text import CountVectorizer

In [61]:
cv = CountVectorizer(stop_words='english',
                     max_features=5000)

In [62]:
# fit_transform() on the cv object returns a SciPy sparse matrix
# toarray() converts it into a dense NumPy array so that we can inspect the values directly

vectors = cv.fit_transform(preprocessed_df['tags']).toarray()

In [63]:
print('Shape:', vectors.shape)
vectors

Shape: (4806, 5000)


array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]], dtype=int64)

In [64]:
print('Vector matrix for the first movie:', vectors[0])
print('Vector matrix for the second movie:', vectors[1])
print('Vector matrix for the third movie:', vectors[2])
print('Vector matrix for the fourth movie:', vectors[3])
print("")
print('Vector matrix for the second last movie:', vectors[-2])
print('Vector matrix for the last movie:', vectors[-1])

Vector matrix for the first movie: [0 0 0 ... 0 0 0]
Vector matrix for the second movie: [0 0 0 ... 0 0 0]
Vector matrix for the third movie: [0 0 0 ... 0 0 0]
Vector matrix for the fourth movie: [0 0 0 ... 0 0 0]

Vector matrix for the second last movie: [0 0 0 ... 0 0 0]
Vector matrix for the last movie: [0 0 0 ... 0 0 0]


In [65]:
# viewing the 5000 most frequent words that we extracted from the corpus
# (get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it)
cv.get_feature_names_out().tolist()

['000',
 '007',
 '10',
 '100',
 '11',
 '12',
 '13',
 '14',
 '15',
 '16',
 '17',
 '18',
 '18th',
 '19',
 '1930s',
 '1940s',
 '1950s',
 '1960s',
 '1970s',
 '1971',
 '1976',
 '1980',
 '1980s',
 '1985',
 '1990s',
 '19th',
 '19thcentury',
 '20',
 '200',
 '2003',
 '2009',
 '20th',
 '24',
 '25',
 '30',
 '300',
 '3d',
 '40',
 '50',
 '500',
 '60',
 '60s',
 '70',
 '70s',
 'aaron',
 'aaroneckhart',
 'aarontaylor',
 'abandoned',
 'abducted',
 'abigailbreslin',
 'abilities',
 'ability',
 'able',
 'aboard',
 'abuse',
 'abusive',
 'academy',
 'accept',
 'accepted',
 'accepts',
 'access',
 'accident',
 'accidental',
 'accidentally',
 'accompanied',
 'accomplish',
 'account',
 'accountant',
 'accused',
 'ace',
 'achieve',
 'act',
 'acting',
 'action',
 'actionhero',
 'actions',
 'activist',
 'activities',
 'activity',
 'actor',
 'actors',
 'actress',
 'acts',
 'actual',
 'actually',
 'adam',
 'adambrody',
 'adams',
 'adamsandler',
 'adamshankman',
 'adaptation',
 'adapted',
 'addict',
 'addicted',
 'ad

In [66]:
# there is a problem with the common words that have been extracted above
# many of them are variants of the same word, e.g. ['commit', 'commitment', 'committed']
# more examples of such words in our list:
    # - ['abilities', 'ability']
    # - ['accident', 'accidental', 'accidentally']
    # - ['activist', 'activities', 'activity']
# the words in each group are closely related, yet our count vectorizer object treats them as different features
# so we will apply 'stemming' to reduce this duplication

# How (stemming) works:
# suppose we have three different words ['loved', 'loving', 'love']
# stemming reduces each word to its root term
# final output will be: ['love', 'love', 'love']
# we will apply it to the tags just before running the count vectorization technique
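
A quick sketch of the stemmer on its own, using NLTK's `PorterStemmer`. The variable name `demo_stemmer` is chosen for this example only.

```python
from nltk.stem.porter import PorterStemmer

demo_stemmer = PorterStemmer()

# The three variants from the explanation collapse to one root
words = ['loved', 'loving', 'love']
stems = [demo_stemmer.stem(word) for word in words]
print(stems)  # ['love', 'love', 'love']

# A token with punctuation attached is passed through unchanged,
# so text split on whitespace alone keeps forms such as 'century,'
with_comma = demo_stemmer.stem('century,')
print(with_comma)  # century,
```

The second case is worth keeping in mind: when text is split only on whitespace, words followed by a comma or full stop reach the stemmer with the punctuation attached and are left as they are.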

#### APPLYING STEMMING FUNCTIONALITY & RE-RUNNING COUNT VECTORIZATION TECHNIQUE

In [67]:
import nltk

In [68]:
from nltk.stem.porter import PorterStemmer
ps = PorterStemmer()

In [69]:
# defining a helper function to implement the stemming functionality to our tag records

def stem(text): # --- e.g. text = preprocessed_df['tags'][0]
    y = []
    for i in text.split(): # --- splitting the input text on whitespace
        y.append(ps.stem(i)) # --- stemming each word in turn
    return " ".join(y) # --- joining the stemmed words back into a single string

In [70]:
preprocessed_df['tags'] = preprocessed_df['tags'].apply(stem)

In [71]:
preprocessed_df['tags']

0       in the 22nd century, a parapleg marin is dispa...
1       captain barbossa, long believ to be dead, ha c...
2       a cryptic messag from bond’ past send him on a...
3       follow the death of district attorney harvey d...
4       john carter is a war-weary, former militari ca...
                              ...                        
4804    el mariachi just want to play hi guitar and ca...
4805    a newlyw couple' honeymoon is upend by the arr...
4806    "signed, sealed, delivered" introduc a dedic q...
4807    when ambiti new york attorney sam is sent to s...
4808    ever sinc the second grade when he first saw h...
Name: tags, Length: 4806, dtype: object

In [72]:
vectors = cv.fit_transform(preprocessed_df['tags']).toarray()

In [73]:
print('Vector matrix for the first movie:', vectors[0])
print('Vector matrix for the second movie:', vectors[1])
print('Vector matrix for the third movie:', vectors[2])
print('Vector matrix for the fourth movie:', vectors[3])
print("")
print('Vector matrix for the second last movie:', vectors[-2])
print('Vector matrix for the last movie:', vectors[-1])

Vector matrix for the first movie: [0 0 0 ... 0 0 0]
Vector matrix for the second movie: [0 0 0 ... 0 0 0]
Vector matrix for the third movie: [0 0 0 ... 0 0 0]
Vector matrix for the fourth movie: [0 0 0 ... 0 0 0]

Vector matrix for the second last movie: [0 0 0 ... 0 0 0]
Vector matrix for the last movie: [0 0 0 ... 0 0 0]


In [74]:
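# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
cv.get_feature_names_out().tolist()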

['000',
 '007',
 '10',
 '100',
 '11',
 '12',
 '13',
 '14',
 '15',
 '16',
 '17',
 '18',
 '18th',
 '19',
 '1910',
 '1920',
 '1930',
 '1940',
 '1950',
 '1950s',
 '1960',
 '1960s',
 '1970',
 '1970s',
 '1980',
 '1985',
 '1990',
 '19th',
 '19thcenturi',
 '20',
 '200',
 '2009',
 '20th',
 '24',
 '25',
 '30',
 '300',
 '3d',
 '40',
 '50',
 '500',
 '60',
 '70',
 'aaron',
 'aaroneckhart',
 'aarontaylor',
 'abandon',
 'abduct',
 'abigailbreslin',
 'abil',
 'abl',
 'aboard',
 'abov',
 'abus',
 'academ',
 'academi',
 'accept',
 'access',
 'accid',
 'accident',
 'acclaim',
 'accompani',
 'accomplish',
 'account',
 'accus',
 'ace',
 'achiev',
 'acquaint',
 'act',
 'action',
 'actionhero',
 'activ',
 'activist',
 'activities',
 'actor',
 'actress',
 'actual',
 'ad',
 'adam',
 'adambrodi',
 'adamsandl',
 'adamshankman',
 'adapt',
 'add',
 'addict',
 'adjust',
 'admir',
 'admit',
 'adolesc',
 'adopt',
 'ador',
 'adrienbrodi',
 'adult',
 'adultanim',
 'adulteri',
 'adulthood',
 'advanc',
 'adventur',
 'adv

In [75]:
print('Shape:', vectors.shape)

Shape: (4806, 5000)


In [76]:
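# stemming has merged most of the repeated word variants in our tags feature
# a few variants remain (e.g. 'activ' alongside 'activities'), likely because stem() splits on whitespace only,
# so words attached to punctuation (like 'century,') reach the stemmer unchanged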

In [77]:
# now that all our 4806 movies are represented as vectors in a 5000-dimensional space
# we will measure how close these vectors are to each other to check for similarity
# we won't use the 'Euclidean Distance' as it becomes less reliable in a high-dimensional space
# so we will use the 'Cosine Similarity' technique instead
# cosine distance = 1 - cosine similarity, so a larger distance means a lower similarity

# for example (Cosine Similarity):
    # we compare two movies using the angle 'theta' between their vectors
    # cosine similarity = cos(theta)
    # higher 'theta' value = cosine distance is larger --> movies are less similar to each other
    # lower 'theta' value = cosine distance is smaller --> movies are more similar to each other
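
The relationship between the angle, cosine similarity and Euclidean distance can be checked by hand on tiny count vectors. This is a plain-Python sketch with invented helper names and toy vectors; the notebook itself relies on scikit-learn's `cosine_similarity`.

```python
import math

def cosine_similarity_pair(u, v):
    """cos(theta) between two vectors; defined as 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def euclidean_distance(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

short_tags = [1, 2, 0]
long_tags = [2, 4, 0]    # same word proportions, twice as many words
other_tags = [0, 0, 3]   # no words in common with the first two

sim_same = cosine_similarity_pair(short_tags, long_tags)
sim_other = cosine_similarity_pair(short_tags, other_tags)
dist_same = euclidean_distance(short_tags, long_tags)

# Clamp to 1.0 before acos to guard against floating-point overshoot
theta_same = math.degrees(math.acos(min(1.0, sim_same)))
theta_other = math.degrees(math.acos(min(1.0, sim_other)))

print(round(sim_same, 6), round(theta_same, 2))    # similarity ~1, angle ~0 degrees
print(round(sim_other, 6), round(theta_other, 2))  # similarity 0, angle 90 degrees
print(round(dist_same, 3))                         # Euclidean distance is still > 0
```

The first two vectors point in the same direction, so their cosine similarity is 1 even though their Euclidean distance is not zero. Cosine similarity therefore does not penalise a movie simply for having a longer tag string.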

In [78]:
from sklearn.metrics.pairwise import cosine_similarity

In [79]:
similarity = cosine_similarity(vectors)

In [80]:
similarity.shape

(4806, 4806)

In [81]:
# so every movie has its similarity compared against all the movies (including itself)
    # movie_1 compared against [movie_1, movie_2, movie_3, ..... , movie_4806]
    # movie_2 compared against [movie_1, movie_2, movie_3, ..... , movie_4806]
    # movie_3 compared against [movie_1, movie_2, movie_3, ..... , movie_4806]
    # ............
    # ............
    # movie_4806 compared against [movie_1, movie_2, movie_3, ..... , movie_4806]

# hence we receive a (4806x4806) matrix of similarity values between 0 and 1, with 1 on the diagonal where a movie is compared against itself
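
The structure of this matrix can be reproduced on a toy input with NumPy. The function `pairwise_cosine` below is a hand-written sketch for illustration, not scikit-learn's implementation, and the three toy rows are invented.

```python
import numpy as np

def pairwise_cosine(X):
    """Cosine similarity between every pair of rows of X."""
    X = np.asarray(X, dtype=float)
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    norms[norms == 0] = 1.0          # avoid dividing by zero for empty rows
    unit = X / norms                 # scale each row to unit length
    return unit @ unit.T             # dot products of unit vectors = cos(theta)

toy_counts = [[1, 2, 0],
              [2, 4, 0],
              [0, 0, 3]]
toy_similarity = pairwise_cosine(toy_counts)

print(toy_similarity.shape)  # (3, 3): one row and one column per "movie"
print(np.round(toy_similarity, 3))
```

Each row `i` holds the similarity of item `i` against every item, the matrix is symmetric, and the diagonal is 1 because every item is identical to itself.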

In [82]:
similarity

array([[1.        , 0.08226127, 0.0860309 , ..., 0.04543109, 0.        ,
        0.        ],
       [0.08226127, 1.        , 0.05976143, ..., 0.02366905, 0.        ,
        0.02577696],
       [0.0860309 , 0.05976143, 1.        , ..., 0.02475369, 0.        ,
        0.        ],
       ...,
       [0.04543109, 0.02366905, 0.02475369, ..., 1.        , 0.04174829,
        0.04270814],
       [0.        , 0.        , 0.        , ..., 0.04174829, 1.        ,
        0.09093258],
       [0.        , 0.02577696, 0.        , ..., 0.04270814, 0.09093258,
        1.        ]])

In [83]:
similarity[0]

array([1.        , 0.08226127, 0.0860309 , ..., 0.04543109, 0.        ,
       0.        ])

## SAMPLE UNDERSTANDING OF HOW THE RECOMMENDATION MODEL WOULD WORK

In [84]:
# now we need to define a helper function that takes a movie requested by the user,
# fetches the top 5 most similar movies and shares them back to the user as recommendations

# Steps to be followed under the helper function:
    # the movie requested by the user is passed as an input to the function
    # we fetch the index value for this movie and store it in a variable
    # using this index we look up the movie's row of similarity scores against all movies
    # we then sort these similarity scores in decreasing order
    
    ## -- while sorting we must keep track of each score's original index position, otherwise we can no longer tell which movie a score belongs to
    ## for example:
        ## for 'Avatar' --> cosine similarity = 1 against itself (movie_1)
        ## for 'Avatar' --> cosine similarity = 0.08226127 against movie_2
        ## for 'Avatar' --> cosine similarity = 0.0860309 against movie_3
        
        ## sorting the scores alone gives: [1, 0.0860309, 0.08226127]
        ## but our movies are still in the sequence: [movie_1, movie_2, movie_3]
        ## whereas the correct sequence is: [movie_1, movie_3, movie_2] (based on similarity)

# so we need to ensure that the index position is not lost while sorting the similarity values
# for this we use enumerate() to pair each score with its index and convert the result into a list (as illustrated below)

In [85]:
list(enumerate(similarity[0]))

[(0, 1.0000000000000002),
 (1, 0.08226127456606226),
 (2, 0.08603090020146065),
 (3, 0.07300534327409847),
 (4, 0.1873171623163388),
 (5, 0.10743376064838502),
 (6, 0.04024218182927669),
 (7, 0.14509525002200235),
 (8, 0.05923488777590923),
 (9, 0.0967301666813349),
 (10, 0.101338918387628),
 (11, 0.09365858115816941),
 (12, 0.08885233166386385),
 (13, 0.044151078568834795),
 (14, 0.12824729401064427),
 (15, 0.06282808624375433),
 (16, 0.07894736842105264),
 (17, 0.13977653617040256),
 (18, 0.09558988911273408),
 (19, 0.0837707816583391),
 (20, 0.057807331301608),
 (21, 0.106676149412533),
 (22, 0.0662266178532522),
 (23, 0.08603090020146065),
 (24, 0.05407380704358751),
 (25, 0.05101627678885769),
 (26, 0.15389675281277312),
 (27, 0.18848425873126295),
 (28, 0.10968169942141635),
 (29, 0.065033247714309),
 (30, 0.06622661785325219),
 (31, 0.15609763526361567),
 (32, 0.08447772061910234),
 (33, 0.09544271444636668),
 (34, 0.0),
 (35, 0.09733285267845754),
 (36, 0.1692777916923361),
 (3

- (0, 1.0000000000000002) ---> similarity of movie_1 compared against movie_1 = 1.0000000000000002
- (1, 0.08226127456606226) ---> similarity of movie_1 compared against movie_2 = 0.08226127456606226
- (2, 0.08603090020146065) ---> similarity of movie_1 compared against movie_3 = 0.08603090020146065

In [86]:
# --- now we will sort this list of tuples
sorted(list(enumerate(similarity[0])),
      reverse = True,
      key = lambda x:x[1])

# ---- 'key' parameter is used to indicate that we want to sort the values based on second element in each of these tuples

[(0, 1.0000000000000002),
 (1216, 0.28676966733820225),
 (2409, 0.26310068027921696),
 (507, 0.255608593705383),
 (3730, 0.25391668753850405),
 (539, 0.2467838236981868),
 (582, 0.24511108480187255),
 (1204, 0.23918243661746996),
 (1194, 0.2367785320221084),
 (61, 0.23179316248638276),
 (778, 0.2294157338705618),
 (1920, 0.2252817784447915),
 (4048, 0.22329687826943603),
 (2786, 0.22269966704152225),
 (172, 0.21239769762143662),
 (972, 0.2073221072156823),
 (2971, 0.20602141085758227),
 (322, 0.20519567041703082),
 (2333, 0.20443988269091456),
 (3608, 0.20437977982832192),
 (4192, 0.2029530274475215),
 (1444, 0.20277677641345318),
 (1089, 0.2020475485519274),
 (260, 0.20073876713674155),
 (74, 0.20054543301971392),
 (151, 0.19867985355975665),
 (3675, 0.1979082783981174),
 (973, 0.19767387315371682),
 (577, 0.1976738731537168),
 (47, 0.19529164171612676),
 (3327, 0.19117977822546817),
 (1201, 0.19088542889273336),
 (942, 0.1892994097121204),
 (27, 0.18848425873126295),
 (305, 0.1884842

In [87]:
# to suggest top 6 recommendations against movie titled 'Avatar'
sorted(list(enumerate(similarity[0])), reverse = True, key = lambda x:x[1])[1:7]

[(1216, 0.28676966733820225),
 (2409, 0.26310068027921696),
 (507, 0.255608593705383),
 (3730, 0.25391668753850405),
 (539, 0.2467838236981868),
 (582, 0.24511108480187255)]
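
The enumerate-then-sort step can be wrapped in a small helper. The sketch below uses an invented function name, `top_k_similar`, and a four-element toy score list. Instead of slicing off the first result, it drops the query movie by its index.

```python
def top_k_similar(scores, query_index, k):
    """Return the k (index, score) pairs most similar to the query, best first."""
    # Pair every score with its position so the position survives sorting
    ranked = sorted(enumerate(scores), key=lambda pair: pair[1], reverse=True)
    # Exclude the query itself by index rather than assuming it sorts first
    return [(i, s) for i, s in ranked if i != query_index][:k]

toy_scores = [1.0, 0.08, 0.09, 0.07]   # similarity of item 0 against items 0..3
top_two = top_k_similar(toy_scores, query_index=0, k=2)

print(top_two)  # [(2, 0.09), (1, 0.08)]
```

Filtering by index is a design choice: slicing with `[1:]` assumes the query movie always lands in first place, which may not hold if another movie has an identical tag vector and ties at a similarity of 1.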

In [88]:
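# now let us define the helper function to help implement the process illustrated above

def recommend(movie):
    # use the row *position* (not the index label): the dataframe index has gaps
    # (4806 rows labelled 0 to 4808), while similarity is indexed by position
    matches = np.flatnonzero((preprocessed_df['title'] == movie).to_numpy())
    if len(matches) == 0:
        print(f"'{movie}' not found in the dataset")
        return
    distance = similarity[matches[0]]
    # skip the first entry (the movie itself) and keep the next 5
    movies_list = sorted(list(enumerate(distance)), reverse = True, key = lambda x:x[1])[1:6]
    for i in movies_list:
        print(preprocessed_df.iloc[i[0]].title)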

In [89]:
# testing if our helper function: recommend() works successfully or not

print('Recommendations similar to Batman Begins:')
recommend('Batman Begins')
print("")
print('Recommendations similar to John Carter:')
recommend('John Carter')

Recommendations similar to Batman Begins:
The Dark Knight
Batman
Batman
The Dark Knight Rises
10th & Wolf

Recommendations similar to John Carter:
Riddick
Krrish
The Other Side of Heaven
The Legend of Hercules
Get Carter
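
One detail to watch when looking up a movie: the `info()` output earlier shows 4806 entries labelled 0 to 4808, so the dataframe index has gaps, while the similarity matrix is addressed by row position. The toy sketch below (invented titles and index) shows how a label and a position can differ.

```python
import numpy as np
import pandas as pd

# Three rows, but label 2 is missing, as happens after rows are dropped
toy_df = pd.DataFrame({'title': ['A', 'B', 'C']}, index=[0, 1, 3])

# .index[0] returns the index *label* of the matching row
label = toy_df[toy_df['title'] == 'C'].index[0]

# np.flatnonzero on the boolean mask returns the row *position*
position = np.flatnonzero((toy_df['title'] == 'C').to_numpy())[0]

print(label, position)                # 3 2
print(toy_df.iloc[position].title)    # C
```

A 3x3 similarity matrix built from this frame has no row 3, so the label would be out of range while the position is correct. Calling `reset_index(drop=True)` on the dataframe beforehand is another way to make labels and positions agree.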


In [90]:
import pickle

In [91]:
# the with-block closes the file handle once the dump is complete
with open('movie_list.pkl', 'wb') as f:
    pickle.dump(preprocessed_df, f)

In [92]:
# the with-block closes the file handle once the dump is complete
with open('similarity_check.pkl', 'wb') as f:
    pickle.dump(similarity, f)
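
The saved objects can be read back later with `pickle.load`. The round trip below is a self-contained sketch that uses a temporary directory and a tiny stand-in matrix rather than the real `similarity_check.pkl`; the file name `toy_similarity.pkl` is made up for the example.

```python
import os
import pickle
import tempfile

toy_similarity = [[1.0, 0.25], [0.25, 1.0]]  # stand-in for the real matrix

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'toy_similarity.pkl')

    # Write the object to disk in binary mode
    with open(path, 'wb') as f:
        pickle.dump(toy_similarity, f)

    # Read it back; the file must also be opened in binary mode
    with open(path, 'rb') as f:
        restored = pickle.load(f)

print(restored == toy_similarity)  # True
```

Since unpickling can execute arbitrary code, only load pickle files that come from a trusted source.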