## Movies Recommender System
### `Project By : Data Detectives`

## 1. Import Necessary Libraries

In [1]:
import ast
import numpy as np
import pandas as pd

import warnings
warnings.filterwarnings('ignore')

pd.set_option('display.max_columns', 50)

## 2. Load the Datasets and Understand the Data

### _Credits_

In [2]:
credits = pd.read_csv('datasets/credits.csv')
print(credits.shape)
credits.head()

(45476, 3)


Unnamed: 0,cast,crew,id
0,"[{'cast_id': 14, 'character': 'Woody (voice)',...","[{'credit_id': '52fe4284c3a36847f8024f49', 'de...",862
1,"[{'cast_id': 1, 'character': 'Alan Parrish', '...","[{'credit_id': '52fe44bfc3a36847f80a7cd1', 'de...",8844
2,"[{'cast_id': 2, 'character': 'Max Goldman', 'c...","[{'credit_id': '52fe466a9251416c75077a89', 'de...",15602
3,"[{'cast_id': 1, 'character': ""Savannah 'Vannah...","[{'credit_id': '52fe44779251416c91011acb', 'de...",31357
4,"[{'cast_id': 1, 'character': 'George Banks', '...","[{'credit_id': '52fe44959251416c75039ed7', 'de...",11862


* **cast:** Casting information: each actor's name, gender, and character name in the movie.
* **crew:** Crew information, such as the movie's director, editor, and so on.
* **id:** The movie ID assigned by TMDb.

### _Keywords_

In [3]:
keywords = pd.read_csv('datasets/keywords.csv')
print(keywords.shape)
keywords.head()

(46419, 2)


Unnamed: 0,id,keywords
0,862,"[{'id': 931, 'name': 'jealousy'}, {'id': 4290,..."
1,8844,"[{'id': 10090, 'name': 'board game'}, {'id': 1..."
2,15602,"[{'id': 1495, 'name': 'fishing'}, {'id': 12392..."
3,31357,"[{'id': 818, 'name': 'based on novel'}, {'id':..."
4,11862,"[{'id': 1009, 'name': 'baby'}, {'id': 1599, 'n..."


* **id:** The movie ID assigned by TMDb.
* **keywords:** A list of tags/keywords for the movie.

### _Links_

In [4]:
links = pd.read_csv('datasets/links.csv')
print(links.shape)
links.head()

(45843, 3)


Unnamed: 0,movieId,imdbId,tmdbId
0,1,114709,862.0
1,2,113497,8844.0
2,3,113228,15602.0
3,4,114885,31357.0
4,5,113041,11862.0


* **movieId:** A serial number for the movie.
* **imdbId:** The movie's ID on IMDb.
* **tmdbId:** The movie's ID on TMDb.

### _Links Small_

In [5]:
links_small = pd.read_csv('datasets/links_small.csv')
print(links_small.shape)
links_small.head()

(9125, 3)


Unnamed: 0,movieId,imdbId,tmdbId
0,1,114709,862.0
1,2,113497,8844.0
2,3,113228,15602.0
3,4,114885,31357.0
4,5,113041,11862.0


* **movieId:** A serial number for the movie.
* **imdbId:** The movie's ID on IMDb.
* **tmdbId:** The movie's ID on TMDb.

### _Ratings Small_

In [6]:
ratings_small = pd.read_csv('datasets/ratings_small.csv')
print(ratings_small.shape)
ratings_small.head()

(100004, 4)


Unnamed: 0,userId,movieId,rating,timestamp
0,1,31,2.5,1260759144
1,1,1029,3.0,1260759179
2,1,1061,3.0,1260759182
3,1,1129,2.0,1260759185
4,1,1172,4.0,1260759205


* **userId:** The ID of the user.
* **movieId:** The ID of the movie. Its values follow the `movieId` serial numbers in the links files, not the TMDb `id`.
* **rating:** The rating the user gave to the movie.
* **timestamp:** The time at which the user gave the rating.
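
The `timestamp` values shown above look like Unix epoch seconds. Assuming that is the case, a minimal sketch of turning them into readable datetimes could look like this. The `ratings` frame and the `rated_at` column are illustrative names built from the first two rows of the output, not part of the original notebook.

```python
import pandas as pd

# Two rows copied from the ratings_small preview above.
ratings = pd.DataFrame({
    'userId': [1, 1],
    'movieId': [31, 1029],
    'rating': [2.5, 3.0],
    'timestamp': [1260759144, 1260759179],
})

# Assumption: timestamps are seconds since 1970-01-01 (UTC).
ratings['rated_at'] = pd.to_datetime(ratings['timestamp'], unit='s')
print(ratings[['timestamp', 'rated_at']])
```

The original integer column is kept, so nothing is lost if the epoch-seconds assumption turns out to be wrong.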

### _Movies Meta Data_

In [7]:
movies_meta = pd.read_csv('datasets/movies_metadata.csv')
print(movies_meta.shape)
movies_meta.head()

(45466, 24)


Unnamed: 0,adult,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,original_title,overview,popularity,poster_path,production_companies,production_countries,release_date,revenue,runtime,spoken_languages,status,tagline,title,video,vote_average,vote_count
0,False,"{'id': 10194, 'name': 'Toy Story Collection', ...",30000000,"[{'id': 16, 'name': 'Animation'}, {'id': 35, '...",http://toystory.disney.com/toy-story,862,tt0114709,en,Toy Story,"Led by Woody, Andy's toys live happily in his ...",21.946943,/rhIRbceoE9lR4veEXuwCC2wARtG.jpg,"[{'name': 'Pixar Animation Studios', 'id': 3}]","[{'iso_3166_1': 'US', 'name': 'United States o...",1995-10-30,373554033.0,81.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,,Toy Story,False,7.7,5415.0
1,False,,65000000,"[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...",,8844,tt0113497,en,Jumanji,When siblings Judy and Peter discover an encha...,17.015539,/vzmL6fP7aPKNKPRTFnZmiUfciyV.jpg,"[{'name': 'TriStar Pictures', 'id': 559}, {'na...","[{'iso_3166_1': 'US', 'name': 'United States o...",1995-12-15,262797249.0,104.0,"[{'iso_639_1': 'en', 'name': 'English'}, {'iso...",Released,Roll the dice and unleash the excitement!,Jumanji,False,6.9,2413.0
2,False,"{'id': 119050, 'name': 'Grumpy Old Men Collect...",0,"[{'id': 10749, 'name': 'Romance'}, {'id': 35, ...",,15602,tt0113228,en,Grumpier Old Men,A family wedding reignites the ancient feud be...,11.7129,/6ksm1sjKMFLbO7UY2i6G1ju9SML.jpg,"[{'name': 'Warner Bros.', 'id': 6194}, {'name'...","[{'iso_3166_1': 'US', 'name': 'United States o...",1995-12-22,0.0,101.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Still Yelling. Still Fighting. Still Ready for...,Grumpier Old Men,False,6.5,92.0
3,False,,16000000,"[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'nam...",,31357,tt0114885,en,Waiting to Exhale,"Cheated on, mistreated and stepped on, the wom...",3.859495,/16XOMpEaLWkrcPqSQqhTmeJuqQl.jpg,[{'name': 'Twentieth Century Fox Film Corporat...,"[{'iso_3166_1': 'US', 'name': 'United States o...",1995-12-22,81452156.0,127.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Friends are the people who let you be yourself...,Waiting to Exhale,False,6.1,34.0
4,False,"{'id': 96871, 'name': 'Father of the Bride Col...",0,"[{'id': 35, 'name': 'Comedy'}]",,11862,tt0113041,en,Father of the Bride Part II,Just when George Banks has recovered from his ...,8.387519,/e64sOI48hQXyru7naBFyssKFxVd.jpg,"[{'name': 'Sandollar Productions', 'id': 5842}...","[{'iso_3166_1': 'US', 'name': 'United States o...",1995-02-10,76578911.0,106.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Just When His World Is Back To Normal... He's ...,Father of the Bride Part II,False,5.7,173.0


### Features

* **adult:** Indicates if the movie is X-Rated or Adult.
* **belongs_to_collection:** A stringified dictionary that gives information on the movie series the particular film belongs to.
* **budget:** The budget of the movie in dollars.
* **genres:** A stringified list of dictionaries that list out all the genres associated with the movie.
* **homepage:** The Official Homepage of the movie.
* **id:** The ID of the movie.
* **imdb_id:** The IMDb ID of the movie.
* **original_language:** The language in which the movie was originally shot.
* **original_title:** The original title of the movie.
* **overview:** A brief blurb of the movie.
* **popularity:** The Popularity Score assigned by TMDb.
* **poster_path:** The URL of the poster image.
* **production_companies:** A stringified list of production companies involved with the making of the movie.
* **production_countries:** A stringified list of countries where the movie was shot/produced.
* **release_date:** Theatrical Release Date of the movie.
* **revenue:** The total revenue of the movie in dollars.
* **runtime:** The runtime of the movie in minutes.
* **spoken_languages:** A stringified list of spoken languages in the film.
* **status:** The status of the movie (Released, To Be Released, Announced, etc.)
* **tagline:** The tagline of the movie.
* **title:** The Official Title of the movie.
* **video:** Indicates whether TMDb has a video of the movie.
* **vote_average:** The average rating of the movie.
* **vote_count:** The number of votes by users, as counted by TMDb.
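
Several of these columns are described as "stringified" lists of dictionaries, meaning each cell holds a Python literal stored as text. A small sketch of how such a value can be parsed back into Python objects with `ast.literal_eval` is shown here. The `raw` string is an abbreviated example in the same shape as the `genres` column, not a full cell from the dataset.

```python
import ast

# Abbreviated example in the same shape as a `genres` cell.
raw = "[{'id': 16, 'name': 'Animation'}, {'id': 35, 'name': 'Comedy'}]"

parsed = ast.literal_eval(raw)             # text -> list of dicts
names = [item['name'] for item in parsed]  # keep only the genre names
print(names)
```

`ast.literal_eval` only evaluates literals (strings, numbers, lists, dicts, and so on), which makes it a safer choice than `eval` for data read from a file.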

## 3. Data Pre-processing

* **We'll consolidate the columns we need from these files into a single DataFrame.**

* **The `id` column in `credits` and `keywords` refers to the same TMDb movie ID, so let's merge the two on `id`.**

In [9]:
credits = credits.merge(keywords, on='id')
credits.head()

Unnamed: 0,cast,crew,id,keywords
0,"[{'cast_id': 14, 'character': 'Woody (voice)',...","[{'credit_id': '52fe4284c3a36847f8024f49', 'de...",862,"[{'id': 931, 'name': 'jealousy'}, {'id': 4290,..."
1,"[{'cast_id': 1, 'character': 'Alan Parrish', '...","[{'credit_id': '52fe44bfc3a36847f80a7cd1', 'de...",8844,"[{'id': 10090, 'name': 'board game'}, {'id': 1..."
2,"[{'cast_id': 2, 'character': 'Max Goldman', 'c...","[{'credit_id': '52fe466a9251416c75077a89', 'de...",15602,"[{'id': 1495, 'name': 'fishing'}, {'id': 12392..."
3,"[{'cast_id': 1, 'character': ""Savannah 'Vannah...","[{'credit_id': '52fe44779251416c91011acb', 'de...",31357,"[{'id': 818, 'name': 'based on novel'}, {'id':..."
4,"[{'cast_id': 1, 'character': 'George Banks', '...","[{'credit_id': '52fe44959251416c75039ed7', 'de...",11862,"[{'id': 1009, 'name': 'baby'}, {'id': 1599, 'n..."
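
One thing to keep in mind with `merge` is that repeated keys multiply rows: if an `id` appears twice on both sides, the result contains four rows for it. This is one plausible reason why later cells in this notebook report more rows than `movies_metadata.csv` contains, along with a number of duplicated rows. The toy frames below are made up purely to illustrate the effect, and dropping duplicate ids before merging is only one possible way to handle it.

```python
import pandas as pd

# Made-up frames: id 1 appears twice on each side.
left = pd.DataFrame({'id': [1, 1, 2], 'cast': ['a', 'a', 'b']})
right = pd.DataFrame({'id': [1, 1, 2], 'keywords': ['x', 'x', 'y']})

merged = left.merge(right, on='id')
print(len(merged))  # id 1 contributes 2 x 2 rows, id 2 contributes 1

# Dropping repeated ids first keeps one row per movie;
# validate= raises an error if the keys are still not unique.
deduped = left.drop_duplicates('id').merge(
    right.drop_duplicates('id'), on='id', validate='one_to_one'
)
print(len(deduped))
```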


In [10]:
links.rename(columns = {'tmdbId':'id'}, inplace=True)

In [11]:
links = credits.merge(links, on='id')
links.head()

Unnamed: 0,cast,crew,id,keywords,movieId,imdbId
0,"[{'cast_id': 14, 'character': 'Woody (voice)',...","[{'credit_id': '52fe4284c3a36847f8024f49', 'de...",862,"[{'id': 931, 'name': 'jealousy'}, {'id': 4290,...",1,114709
1,"[{'cast_id': 1, 'character': 'Alan Parrish', '...","[{'credit_id': '52fe44bfc3a36847f80a7cd1', 'de...",8844,"[{'id': 10090, 'name': 'board game'}, {'id': 1...",2,113497
2,"[{'cast_id': 2, 'character': 'Max Goldman', 'c...","[{'credit_id': '52fe466a9251416c75077a89', 'de...",15602,"[{'id': 1495, 'name': 'fishing'}, {'id': 12392...",3,113228
3,"[{'cast_id': 1, 'character': ""Savannah 'Vannah...","[{'credit_id': '52fe44779251416c91011acb', 'de...",31357,"[{'id': 818, 'name': 'based on novel'}, {'id':...",4,114885
4,"[{'cast_id': 1, 'character': 'George Banks', '...","[{'credit_id': '52fe44959251416c75039ed7', 'de...",11862,"[{'id': 1009, 'name': 'baby'}, {'id': 1599, 'n...",5,113041


In [12]:
movies_meta = movies_meta.drop([29503,35587,19730])
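
The three rows above are dropped by hard-coded index, presumably because their `id` values cannot be converted to integers in the next cell. As an alternative sketch that does not depend on fixed row positions, malformed ids can be located with `pd.to_numeric(..., errors='coerce')`. The `toy` frame and its bad value are invented for illustration and are not taken from the dataset.

```python
import pandas as pd

# Toy stand-in for movies_meta; the third id is deliberately not a number.
toy = pd.DataFrame({
    'id': ['862', '8844', 'not-an-id'],
    'title': ['Toy Story', 'Jumanji', None],
})

# Values that cannot be parsed become NaN instead of raising.
numeric_id = pd.to_numeric(toy['id'], errors='coerce')

bad_rows = toy[numeric_id.isna()]        # rows to inspect or drop
clean = toy[numeric_id.notna()].copy()   # rows with a usable id
clean['id'] = numeric_id[numeric_id.notna()].astype('int64')
print(bad_rows.index.tolist())
```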

In [13]:
movies_meta['id'] = movies_meta['id'].astype('int64')

In [14]:
movies_meta = movies_meta.merge(links, on='id')

In [15]:
movies_meta.head()

Unnamed: 0,adult,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,original_title,overview,popularity,poster_path,production_companies,production_countries,release_date,revenue,runtime,spoken_languages,status,tagline,title,video,vote_average,vote_count,cast,crew,keywords,movieId,imdbId
0,False,"{'id': 10194, 'name': 'Toy Story Collection', ...",30000000,"[{'id': 16, 'name': 'Animation'}, {'id': 35, '...",http://toystory.disney.com/toy-story,862,tt0114709,en,Toy Story,"Led by Woody, Andy's toys live happily in his ...",21.946943,/rhIRbceoE9lR4veEXuwCC2wARtG.jpg,"[{'name': 'Pixar Animation Studios', 'id': 3}]","[{'iso_3166_1': 'US', 'name': 'United States o...",1995-10-30,373554033.0,81.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,,Toy Story,False,7.7,5415.0,"[{'cast_id': 14, 'character': 'Woody (voice)',...","[{'credit_id': '52fe4284c3a36847f8024f49', 'de...","[{'id': 931, 'name': 'jealousy'}, {'id': 4290,...",1,114709
1,False,,65000000,"[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...",,8844,tt0113497,en,Jumanji,When siblings Judy and Peter discover an encha...,17.015539,/vzmL6fP7aPKNKPRTFnZmiUfciyV.jpg,"[{'name': 'TriStar Pictures', 'id': 559}, {'na...","[{'iso_3166_1': 'US', 'name': 'United States o...",1995-12-15,262797249.0,104.0,"[{'iso_639_1': 'en', 'name': 'English'}, {'iso...",Released,Roll the dice and unleash the excitement!,Jumanji,False,6.9,2413.0,"[{'cast_id': 1, 'character': 'Alan Parrish', '...","[{'credit_id': '52fe44bfc3a36847f80a7cd1', 'de...","[{'id': 10090, 'name': 'board game'}, {'id': 1...",2,113497
2,False,"{'id': 119050, 'name': 'Grumpy Old Men Collect...",0,"[{'id': 10749, 'name': 'Romance'}, {'id': 35, ...",,15602,tt0113228,en,Grumpier Old Men,A family wedding reignites the ancient feud be...,11.7129,/6ksm1sjKMFLbO7UY2i6G1ju9SML.jpg,"[{'name': 'Warner Bros.', 'id': 6194}, {'name'...","[{'iso_3166_1': 'US', 'name': 'United States o...",1995-12-22,0.0,101.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Still Yelling. Still Fighting. Still Ready for...,Grumpier Old Men,False,6.5,92.0,"[{'cast_id': 2, 'character': 'Max Goldman', 'c...","[{'credit_id': '52fe466a9251416c75077a89', 'de...","[{'id': 1495, 'name': 'fishing'}, {'id': 12392...",3,113228
3,False,,16000000,"[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'nam...",,31357,tt0114885,en,Waiting to Exhale,"Cheated on, mistreated and stepped on, the wom...",3.859495,/16XOMpEaLWkrcPqSQqhTmeJuqQl.jpg,[{'name': 'Twentieth Century Fox Film Corporat...,"[{'iso_3166_1': 'US', 'name': 'United States o...",1995-12-22,81452156.0,127.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Friends are the people who let you be yourself...,Waiting to Exhale,False,6.1,34.0,"[{'cast_id': 1, 'character': ""Savannah 'Vannah...","[{'credit_id': '52fe44779251416c91011acb', 'de...","[{'id': 818, 'name': 'based on novel'}, {'id':...",4,114885
4,False,"{'id': 96871, 'name': 'Father of the Bride Col...",0,"[{'id': 35, 'name': 'Comedy'}]",,11862,tt0113041,en,Father of the Bride Part II,Just when George Banks has recovered from his ...,8.387519,/e64sOI48hQXyru7naBFyssKFxVd.jpg,"[{'name': 'Sandollar Productions', 'id': 5842}...","[{'iso_3166_1': 'US', 'name': 'United States o...",1995-02-10,76578911.0,106.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Just When His World Is Back To Normal... He's ...,Father of the Bride Part II,False,5.7,173.0,"[{'cast_id': 1, 'character': 'George Banks', '...","[{'credit_id': '52fe44959251416c75039ed7', 'de...","[{'id': 1009, 'name': 'baby'}, {'id': 1599, 'n...",5,113041


In [16]:
movies_meta.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 46910 entries, 0 to 46909
Data columns (total 29 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   adult                  46910 non-null  object 
 1   belongs_to_collection  4602 non-null   object 
 2   budget                 46910 non-null  object 
 3   genres                 46910 non-null  object 
 4   homepage               8049 non-null   object 
 5   id                     46910 non-null  int64  
 6   imdb_id                46893 non-null  object 
 7   original_language      46899 non-null  object 
 8   original_title         46910 non-null  object 
 9   overview               45915 non-null  object 
 10  popularity             46906 non-null  object 
 11  poster_path            46511 non-null  object 
 12  production_companies   46906 non-null  object 
 13  production_countries   46906 non-null  object 
 14  release_date           46822 non-null  object 
 15  re

In [17]:
movies_meta.isnull().sum()

adult                        0
belongs_to_collection    42308
budget                       0
genres                       0
homepage                 38861
id                           0
imdb_id                     17
original_language           11
original_title               0
overview                   995
popularity                   4
poster_path                399
production_companies         4
production_countries         4
release_date                88
revenue                      4
runtime                    268
spoken_languages             4
status                      86
tagline                  26001
title                        4
video                        4
vote_average                 4
vote_count                   4
cast                         0
crew                         0
keywords                     0
movieId                      0
imdbId                       0
dtype: int64

In [18]:
movies_meta.duplicated().sum()

1388

In [19]:
df = movies_meta.drop_duplicates()
df.head()

Unnamed: 0,adult,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,original_title,overview,popularity,poster_path,production_companies,production_countries,release_date,revenue,runtime,spoken_languages,status,tagline,title,video,vote_average,vote_count,cast,crew,keywords,movieId,imdbId
0,False,"{'id': 10194, 'name': 'Toy Story Collection', ...",30000000,"[{'id': 16, 'name': 'Animation'}, {'id': 35, '...",http://toystory.disney.com/toy-story,862,tt0114709,en,Toy Story,"Led by Woody, Andy's toys live happily in his ...",21.946943,/rhIRbceoE9lR4veEXuwCC2wARtG.jpg,"[{'name': 'Pixar Animation Studios', 'id': 3}]","[{'iso_3166_1': 'US', 'name': 'United States o...",1995-10-30,373554033.0,81.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,,Toy Story,False,7.7,5415.0,"[{'cast_id': 14, 'character': 'Woody (voice)',...","[{'credit_id': '52fe4284c3a36847f8024f49', 'de...","[{'id': 931, 'name': 'jealousy'}, {'id': 4290,...",1,114709
1,False,,65000000,"[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...",,8844,tt0113497,en,Jumanji,When siblings Judy and Peter discover an encha...,17.015539,/vzmL6fP7aPKNKPRTFnZmiUfciyV.jpg,"[{'name': 'TriStar Pictures', 'id': 559}, {'na...","[{'iso_3166_1': 'US', 'name': 'United States o...",1995-12-15,262797249.0,104.0,"[{'iso_639_1': 'en', 'name': 'English'}, {'iso...",Released,Roll the dice and unleash the excitement!,Jumanji,False,6.9,2413.0,"[{'cast_id': 1, 'character': 'Alan Parrish', '...","[{'credit_id': '52fe44bfc3a36847f80a7cd1', 'de...","[{'id': 10090, 'name': 'board game'}, {'id': 1...",2,113497
2,False,"{'id': 119050, 'name': 'Grumpy Old Men Collect...",0,"[{'id': 10749, 'name': 'Romance'}, {'id': 35, ...",,15602,tt0113228,en,Grumpier Old Men,A family wedding reignites the ancient feud be...,11.7129,/6ksm1sjKMFLbO7UY2i6G1ju9SML.jpg,"[{'name': 'Warner Bros.', 'id': 6194}, {'name'...","[{'iso_3166_1': 'US', 'name': 'United States o...",1995-12-22,0.0,101.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Still Yelling. Still Fighting. Still Ready for...,Grumpier Old Men,False,6.5,92.0,"[{'cast_id': 2, 'character': 'Max Goldman', 'c...","[{'credit_id': '52fe466a9251416c75077a89', 'de...","[{'id': 1495, 'name': 'fishing'}, {'id': 12392...",3,113228
3,False,,16000000,"[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'nam...",,31357,tt0114885,en,Waiting to Exhale,"Cheated on, mistreated and stepped on, the wom...",3.859495,/16XOMpEaLWkrcPqSQqhTmeJuqQl.jpg,[{'name': 'Twentieth Century Fox Film Corporat...,"[{'iso_3166_1': 'US', 'name': 'United States o...",1995-12-22,81452156.0,127.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Friends are the people who let you be yourself...,Waiting to Exhale,False,6.1,34.0,"[{'cast_id': 1, 'character': ""Savannah 'Vannah...","[{'credit_id': '52fe44779251416c91011acb', 'de...","[{'id': 818, 'name': 'based on novel'}, {'id':...",4,114885
4,False,"{'id': 96871, 'name': 'Father of the Bride Col...",0,"[{'id': 35, 'name': 'Comedy'}]",,11862,tt0113041,en,Father of the Bride Part II,Just when George Banks has recovered from his ...,8.387519,/e64sOI48hQXyru7naBFyssKFxVd.jpg,"[{'name': 'Sandollar Productions', 'id': 5842}...","[{'iso_3166_1': 'US', 'name': 'United States o...",1995-02-10,76578911.0,106.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Just When His World Is Back To Normal... He's ...,Father of the Bride Part II,False,5.7,173.0,"[{'cast_id': 1, 'character': 'George Banks', '...","[{'credit_id': '52fe44959251416c75039ed7', 'de...","[{'id': 1009, 'name': 'baby'}, {'id': 1599, 'n...",5,113041


In [20]:
# Let's drop the following columns, which are not needed for the recommender:
# adult
# belongs_to_collection
# homepage
# imdb_id
# original_language
# original_title
# poster_path
# production_countries
# release_date
# status
# tagline
# video
# movieId
# imdbId

In [21]:
df = df.drop(['adult',
              'belongs_to_collection',
              'homepage',
              'imdb_id',
              'original_language',
              'original_title',
              'poster_path',
              'production_countries',
              'release_date',
              'status',
              'tagline',
              'video',
              'movieId',
              'imdbId',
              ], axis=1)

In [22]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 45522 entries, 0 to 46909
Data columns (total 15 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   budget                45522 non-null  object 
 1   genres                45522 non-null  object 
 2   id                    45522 non-null  int64  
 3   overview              44568 non-null  object 
 4   popularity            45519 non-null  object 
 5   production_companies  45519 non-null  object 
 6   revenue               45519 non-null  float64
 7   runtime               45262 non-null  float64
 8   spoken_languages      45519 non-null  object 
 9   title                 45519 non-null  object 
 10  vote_average          45519 non-null  float64
 11  vote_count            45519 non-null  float64
 12  cast                  45522 non-null  object 
 13  crew                  45522 non-null  object 
 14  keywords              45522 non-null  object 
dtypes: float64(4), int6

In [23]:
df = df[['id', 'title', 'genres', 'overview', 'spoken_languages', 'cast', 'crew', 'vote_average', 'vote_count', 'popularity', 'budget', 'keywords', 'production_companies', 'revenue', 'runtime']]
df.head()

Unnamed: 0,id,title,genres,overview,spoken_languages,cast,crew,vote_average,vote_count,popularity,budget,keywords,production_companies,revenue,runtime
0,862,Toy Story,"[{'id': 16, 'name': 'Animation'}, {'id': 35, '...","Led by Woody, Andy's toys live happily in his ...","[{'iso_639_1': 'en', 'name': 'English'}]","[{'cast_id': 14, 'character': 'Woody (voice)',...","[{'credit_id': '52fe4284c3a36847f8024f49', 'de...",7.7,5415.0,21.946943,30000000,"[{'id': 931, 'name': 'jealousy'}, {'id': 4290,...","[{'name': 'Pixar Animation Studios', 'id': 3}]",373554033.0,81.0
1,8844,Jumanji,"[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...",When siblings Judy and Peter discover an encha...,"[{'iso_639_1': 'en', 'name': 'English'}, {'iso...","[{'cast_id': 1, 'character': 'Alan Parrish', '...","[{'credit_id': '52fe44bfc3a36847f80a7cd1', 'de...",6.9,2413.0,17.015539,65000000,"[{'id': 10090, 'name': 'board game'}, {'id': 1...","[{'name': 'TriStar Pictures', 'id': 559}, {'na...",262797249.0,104.0
2,15602,Grumpier Old Men,"[{'id': 10749, 'name': 'Romance'}, {'id': 35, ...",A family wedding reignites the ancient feud be...,"[{'iso_639_1': 'en', 'name': 'English'}]","[{'cast_id': 2, 'character': 'Max Goldman', 'c...","[{'credit_id': '52fe466a9251416c75077a89', 'de...",6.5,92.0,11.7129,0,"[{'id': 1495, 'name': 'fishing'}, {'id': 12392...","[{'name': 'Warner Bros.', 'id': 6194}, {'name'...",0.0,101.0
3,31357,Waiting to Exhale,"[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'nam...","Cheated on, mistreated and stepped on, the wom...","[{'iso_639_1': 'en', 'name': 'English'}]","[{'cast_id': 1, 'character': ""Savannah 'Vannah...","[{'credit_id': '52fe44779251416c91011acb', 'de...",6.1,34.0,3.859495,16000000,"[{'id': 818, 'name': 'based on novel'}, {'id':...",[{'name': 'Twentieth Century Fox Film Corporat...,81452156.0,127.0
4,11862,Father of the Bride Part II,"[{'id': 35, 'name': 'Comedy'}]",Just when George Banks has recovered from his ...,"[{'iso_639_1': 'en', 'name': 'English'}]","[{'cast_id': 1, 'character': 'George Banks', '...","[{'credit_id': '52fe44959251416c75039ed7', 'de...",5.7,173.0,8.387519,0,"[{'id': 1009, 'name': 'baby'}, {'id': 1599, 'n...","[{'name': 'Sandollar Productions', 'id': 5842}...",76578911.0,106.0


### _Cleaning_

In [24]:
df.shape

(45522, 15)

In [25]:
df.isnull().sum()

id                        0
title                     3
genres                    0
overview                954
spoken_languages          3
cast                      0
crew                      0
vote_average              3
vote_count                3
popularity                3
budget                    0
keywords                  0
production_companies      3
revenue                   3
runtime                 260
dtype: int64

In [26]:
df.duplicated().sum()

60

In [27]:
df = df.dropna()

In [28]:
df = df.drop_duplicates()

In [29]:
# Genres

def genre_convert(data):
    genre = []
    for i in ast.literal_eval(data):
        genre.append(i['name'])
    return genre 

In [30]:
df['genres'] = df['genres'].apply(genre_convert)

In [31]:
# Keywords

def name_convert(data):
    keywords = []
    for i in ast.literal_eval(data):
        keywords.append(i['name'])
    return keywords

In [32]:
df['keywords'] = df['keywords'].apply(name_convert)

In [33]:
# Cast

def cast_names(data):
    # Keep only the first three cast members listed for the movie
    actors = []
    for i in ast.literal_eval(data)[:3]:
        actors.append(i['name'])
    return actors

In [34]:
df['cast'] = df['cast'].apply(cast_names)

In [35]:
# Crew

def crew_names(data):
    director = []
    for i in ast.literal_eval(data):
        if i['job'] == "Director":
            director.append(i['name'])
            break
    return director

In [36]:
df['crew'] = df['crew'].apply(crew_names)

In [37]:
# Spoken Languages

def spoken_lang(data):
    languages = []
    for i in ast.literal_eval(data):
        languages.append(i['name'])
    return languages

In [38]:
df['spoken_languages'] = df['spoken_languages'].apply(spoken_lang)

In [39]:
# Production Company

def prod_company(data):
    companies = []
    for i in ast.literal_eval(data):
        companies.append(i['name'])
    return companies

In [41]:
df['production_companies'] = df['production_companies'].apply(prod_company)
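
The genre, keyword, spoken-language, and production-company converters above all do the same thing: parse the string and collect each `name`. As a sketch only, they could be folded into one helper. `extract_names` and its `limit` parameter are hypothetical names introduced here, and returning an empty list for non-string input is an added assumption rather than behaviour of the original functions.

```python
import ast
import pandas as pd

def extract_names(data, limit=None):
    """Return the 'name' of each dict in a stringified list."""
    if not isinstance(data, str):   # assumption: treat NaN/None as "no entries"
        return []
    items = ast.literal_eval(data)
    if limit is not None:           # e.g. limit=3 mimics cast_names
        items = items[:limit]
    return [item['name'] for item in items]

# Made-up sample covering a normal value, an empty list, and a missing value.
sample = pd.Series([
    "[{'id': 16, 'name': 'Animation'}, {'id': 35, 'name': 'Comedy'}]",
    "[]",
    None,
])
print(sample.apply(extract_names).tolist())
```

With such a helper, `df['genres'].apply(extract_names)` and `df['cast'].apply(extract_names, limit=3)` would cover most of the cells above, while the director lookup still needs its own filter on `job`.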

In [42]:
df['genres'] = df['genres'].apply(lambda x:[i.replace(' ','') for i in x])
df['cast'] = df['cast'].apply(lambda x:[i.replace(' ','') for i in x])
df['crew'] = df['crew'].apply(lambda x:[i.replace(' ','') for i in x])
df['keywords'] = df['keywords'].apply(lambda x:[i.replace(' ','') for i in x])

In [43]:
df['genres'] = df['genres'].apply(lambda x:' '.join(x))
df['cast'] = df['cast'].apply(lambda x:' '.join(x))
df['crew'] = df['crew'].apply(lambda x:' '.join(x))
df['keywords'] = df['keywords'].apply(lambda x:' '.join(x))
df['spoken_languages'] = df['spoken_languages'].apply(lambda x:' '.join(x))
df['production_companies'] = df['production_companies'].apply(lambda x:' '.join(x))
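
The notebook does not say why spaces are removed before the lists are joined. A common reason in content-based recommenders is that later text vectorization splits on whitespace, so "Tom Hanks" would otherwise become two unrelated tokens. The short sketch below illustrates that assumption with plain string splitting; the `cast` list is a made-up two-name example.

```python
# Made-up example list of cast names.
cast = ['Tom Hanks', 'Tim Allen']

# Joining as-is: a whitespace tokenizer sees four separate words.
spaced = ' '.join(cast).split()

# Removing inner spaces first: each person stays a single token.
collapsed = ' '.join(name.replace(' ', '') for name in cast).split()

print(spaced)
print(collapsed)
```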

In [44]:
df.head()

Unnamed: 0,id,title,genres,overview,spoken_languages,cast,crew,vote_average,vote_count,popularity,budget,keywords,production_companies,revenue,runtime
0,862,Toy Story,Animation Comedy Family,"Led by Woody, Andy's toys live happily in his ...",English,TomHanks TimAllen DonRickles,JohnLasseter,7.7,5415.0,21.946943,30000000,jealousy toy boy friendship friends rivalry bo...,Pixar Animation Studios,373554033.0,81.0
1,8844,Jumanji,Adventure Fantasy Family,When siblings Judy and Peter discover an encha...,English Français,RobinWilliams JonathanHyde KirstenDunst,JoeJohnston,6.9,2413.0,17.015539,65000000,boardgame disappearance basedonchildren'sbook ...,TriStar Pictures Teitler Film Interscope Commu...,262797249.0,104.0
2,15602,Grumpier Old Men,Romance Comedy,A family wedding reignites the ancient feud be...,English,WalterMatthau JackLemmon Ann-Margret,HowardDeutch,6.5,92.0,11.7129,0,fishing bestfriend duringcreditsstinger oldmen,Warner Bros. Lancaster Gate,0.0,101.0
3,31357,Waiting to Exhale,Comedy Drama Romance,"Cheated on, mistreated and stepped on, the wom...",English,WhitneyHouston AngelaBassett LorettaDevine,ForestWhitaker,6.1,34.0,3.859495,16000000,basedonnovel interracialrelationship singlemot...,Twentieth Century Fox Film Corporation,81452156.0,127.0
4,11862,Father of the Bride Part II,Comedy,Just when George Banks has recovered from his ...,English,SteveMartin DianeKeaton MartinShort,CharlesShyer,5.7,173.0,8.387519,0,baby midlifecrisis confidence aging daughter m...,Sandollar Productions Touchstone Pictures,76578911.0,106.0


In [45]:
df.shape

(44505, 15)

In [46]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 44505 entries, 0 to 46909
Data columns (total 15 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   id                    44505 non-null  int64  
 1   title                 44505 non-null  object 
 2   genres                44505 non-null  object 
 3   overview              44505 non-null  object 
 4   spoken_languages      44505 non-null  object 
 5   cast                  44505 non-null  object 
 6   crew                  44505 non-null  object 
 7   vote_average          44505 non-null  float64
 8   vote_count            44505 non-null  float64
 9   popularity            44505 non-null  object 
 10  budget                44505 non-null  object 
 11  keywords              44505 non-null  object 
 12  production_companies  44505 non-null  object 
 13  revenue               44505 non-null  float64
 14  runtime               44505 non-null  float64
dtypes: float64(4), int6
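
In the summary above, `popularity` and `budget` are still stored as `object` (text) even though they hold numbers. If numeric operations are needed on them later, one possible conversion is sketched below with `pd.to_numeric`. The `toy` frame reuses a few values from the preview, and coercing unparseable entries to NaN is an assumption about how bad values should be handled.

```python
import pandas as pd

# Values copied from the preview above, stored as text like in df.
toy = pd.DataFrame({
    'popularity': ['21.946943', '17.015539'],
    'budget': ['30000000', '0'],
})

for col in ['popularity', 'budget']:
    # errors='coerce' turns anything unparseable into NaN instead of raising.
    toy[col] = pd.to_numeric(toy[col], errors='coerce')

print(toy.dtypes)
```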

In [47]:
df.nunique()

id                      44475
title                   41370
genres                   4045
overview                44302
spoken_languages         1835
cast                    41326
crew                    17247
vote_average               92
vote_count               1820
popularity              43319
budget                   1205
keywords                25856
production_companies    22406
revenue                  6845
runtime                   353
dtype: int64

## 4. Save the Final DataFrame

In [48]:
df.to_csv('datasets/movies_dataset.csv', index=False)

# Thank You