# Movies Database

* We predict the revenue and popularity of a movie from attributes such as its title, genre, production company, and budget.

* We also recommend movies to users based on their searches.

## Data Preprocessing

### Importing basic libraries.

In [139]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

import warnings
warnings.filterwarnings('ignore')

### Importing `movies_metadata.csv` as `movies_df`.

In [140]:
movies_df = pd.read_csv('./Data/movies_metadata.csv', low_memory=False)
movies_df.head()

Unnamed: 0,adult,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,original_title,overview,...,release_date,revenue,runtime,spoken_languages,status,tagline,title,video,vote_average,vote_count
0,False,"{'id': 10194, 'name': 'Toy Story Collection', ...",30000000,"[{'id': 16, 'name': 'Animation'}, {'id': 35, '...",http://toystory.disney.com/toy-story,862,tt0114709,en,Toy Story,"Led by Woody, Andy's toys live happily in his ...",...,1995-10-30,373554033.0,81.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,,Toy Story,False,7.7,5415.0
1,False,,65000000,"[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...",,8844,tt0113497,en,Jumanji,When siblings Judy and Peter discover an encha...,...,1995-12-15,262797249.0,104.0,"[{'iso_639_1': 'en', 'name': 'English'}, {'iso...",Released,Roll the dice and unleash the excitement!,Jumanji,False,6.9,2413.0
2,False,"{'id': 119050, 'name': 'Grumpy Old Men Collect...",0,"[{'id': 10749, 'name': 'Romance'}, {'id': 35, ...",,15602,tt0113228,en,Grumpier Old Men,A family wedding reignites the ancient feud be...,...,1995-12-22,0.0,101.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Still Yelling. Still Fighting. Still Ready for...,Grumpier Old Men,False,6.5,92.0
3,False,,16000000,"[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'nam...",,31357,tt0114885,en,Waiting to Exhale,"Cheated on, mistreated and stepped on, the wom...",...,1995-12-22,81452156.0,127.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Friends are the people who let you be yourself...,Waiting to Exhale,False,6.1,34.0
4,False,"{'id': 96871, 'name': 'Father of the Bride Col...",0,"[{'id': 35, 'name': 'Comedy'}]",,11862,tt0113041,en,Father of the Bride Part II,Just when George Banks has recovered from his ...,...,1995-02-10,76578911.0,106.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Just When His World Is Back To Normal... He's ...,Father of the Bride Part II,False,5.7,173.0


### Basic info of each row in `movies_df`.

In [141]:
movies_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 45466 entries, 0 to 45465
Data columns (total 24 columns):
adult                    45466 non-null object
belongs_to_collection    4494 non-null object
budget                   45466 non-null object
genres                   45466 non-null object
homepage                 7782 non-null object
id                       45466 non-null object
imdb_id                  45449 non-null object
original_language        45455 non-null object
original_title           45466 non-null object
overview                 44512 non-null object
popularity               45461 non-null object
poster_path              45080 non-null object
production_companies     45463 non-null object
production_countries     45463 non-null object
release_date             45379 non-null object
revenue                  45460 non-null float64
runtime                  45203 non-null float64
spoken_languages         45460 non-null object
status                   45379 non-null object

### Converting each `release_date` to a `datetime.date` object, with 1800-01-01 as a placeholder for missing or malformed dates.

In [142]:
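import datetime

for ind in movies_df.index:
    try:
        year, month, day = str(movies_df.loc[ind, 'release_date']).split('-')
        # .loc writes to movies_df itself; chained indexing may write to a temporary copy.
        movies_df.loc[ind, 'release_date'] = datetime.date(int(year), int(month), int(day))
    except ValueError:
        # Missing or malformed dates get a placeholder value.
        movies_df.loc[ind, 'release_date'] = datetime.date(1800, 1, 1)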
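The row-by-row loop above could probably also be written as one vectorised call. The following is only a sketch on a small made-up frame (`sample` and its values are placeholders, not rows from `movies_metadata.csv`); note that it produces a `datetime64` column rather than `datetime.date` objects.

```python
import pandas as pd

# Made-up stand-in for movies_df['release_date'].
sample = pd.DataFrame({'release_date': ['1995-10-30', None, 'not a date', '1995-12-15']})

# errors='coerce' turns missing or malformed values into NaT instead of raising.
parsed = pd.to_datetime(sample['release_date'], format='%Y-%m-%d', errors='coerce')

# Use the same placeholder date as the loop for rows that could not be parsed.
sample['release_date'] = parsed.fillna(pd.Timestamp('1800-01-01'))
print(sample)
```

Leaving the failed rows as `NaT` instead of filling them would be another option, since a placeholder year of 1800 can look like real data later on.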

### Importing `credits.csv` file as `credits_df`.

In [143]:
credits_df = pd.read_csv('./Data/credits.csv')
credits_df.head()

Unnamed: 0,cast,crew,id
0,"[{'cast_id': 14, 'character': 'Woody (voice)',...","[{'credit_id': '52fe4284c3a36847f8024f49', 'de...",862
1,"[{'cast_id': 1, 'character': 'Alan Parrish', '...","[{'credit_id': '52fe44bfc3a36847f80a7cd1', 'de...",8844
2,"[{'cast_id': 2, 'character': 'Max Goldman', 'c...","[{'credit_id': '52fe466a9251416c75077a89', 'de...",15602
3,"[{'cast_id': 1, 'character': ""Savannah 'Vannah...","[{'credit_id': '52fe44779251416c91011acb', 'de...",31357
4,"[{'cast_id': 1, 'character': 'George Banks', '...","[{'credit_id': '52fe44959251416c75039ed7', 'de...",11862


### Basic info of each row in `credits_df`.

In [144]:
credits_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 45476 entries, 0 to 45475
Data columns (total 3 columns):
cast    45476 non-null object
crew    45476 non-null object
id      45476 non-null int64
dtypes: int64(1), object(2)
memory usage: 1.0+ MB


### Keeping only the lead role from the cast, on the assumption that the rest of the cast has little influence on revenue.

In [145]:
LC = []
for ind in credits_df.index:
    words = str(credits_df['cast'][ind]).split(',')
    name = ''
    i = 0
    for word in words:
        if i < 1:
            if 'name' in word:
                name = word[10:-1]
                i+=1
    LC.append(name)    
credits_df['lead_role'] = pd.Series(LC)
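Splitting on commas and slicing fixed character positions works only as long as every entry has exactly the expected layout. Since each `cast` cell looks like the text form of a Python list of dictionaries, one alternative is to parse it with `ast.literal_eval`. This is a sketch under that assumption; the helper name `first_cast_name` and the sample frame are hypothetical.

```python
import ast
import pandas as pd

def first_cast_name(cast_cell):
    """Return the 'name' of the first cast entry, or '' if there is none."""
    try:
        cast = ast.literal_eval(str(cast_cell))
    except (ValueError, SyntaxError):
        # Missing values or text that is not a valid Python literal.
        return ''
    if isinstance(cast, list) and cast and isinstance(cast[0], dict):
        return cast[0].get('name', '')
    return ''

# Made-up stand-in for credits_df['cast'].
sample = pd.DataFrame({'cast': [
    "[{'cast_id': 14, 'character': 'Woody (voice)', 'name': 'Tom Hanks'}]",
    "[]",
]})
sample['lead_role'] = sample['cast'].apply(first_cast_name)
print(sample['lead_role'].tolist())
```

Like the loop, this treats the first listed cast member as the lead.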

### Keeping only the director from the crew, on the assumption that the rest of the crew has little influence on revenue.

In [146]:
director = []
for ind in credits_df.index:
    words = str(credits_df['crew'][ind]).split(',')
    name = ''
    i = 0
    for j in range(len(words)):
        if i < 1:
            if 'Director' in words[j]:
                name = words[j+1][10:-1]
                i+=1
    director.append(name)
credits_df['director'] = pd.Series(director)
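One thing to watch in the loop above: the test `'Director' in words[j]` is a substring match, so it also fires on job titles such as "Director of Photography" or "Art Director" if one of them appears earlier in the crew list. A stricter variant could parse the cell and compare the job exactly. The sketch below assumes each crew entry is a dictionary with `job` and `name` keys (the `job` key is not visible in the truncated output above); `find_director` and the sample string are hypothetical.

```python
import ast

def find_director(crew_cell):
    """Return the name of the first crew member whose job is exactly 'Director'."""
    try:
        crew = ast.literal_eval(str(crew_cell))
    except (ValueError, SyntaxError):
        return ''
    if not isinstance(crew, list):
        return ''
    for member in crew:
        # Exact comparison, so 'Director of Photography' is not a match.
        if isinstance(member, dict) and member.get('job') == 'Director':
            return member.get('name', '')
    return ''

# Made-up crew list; the names are placeholders.
sample_crew = (
    "[{'department': 'Camera', 'job': 'Director of Photography', 'name': 'Person A'}, "
    "{'department': 'Directing', 'job': 'Director', 'name': 'Person B'}]"
)
print(find_director(sample_crew))
```

It could be applied with `credits_df['crew'].apply(find_director)` before the `crew` column is dropped.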

### Removing the `cast` and `crew` columns as we have new `lead_role` and `director` columns.

In [147]:
credits_df.drop(['cast', 'crew'], axis=1, inplace=True)

### Now the `credits_df` would appear as:

In [148]:
credits_df.head()

Unnamed: 0,id,lead_role,director
0,862,Tom Hanks,John Lasseter
1,8844,Robin Williams,Joe Johnston
2,15602,Walter Matthau,Howard Deutch
3,31357,Whitney Houston,Forest Whitaker
4,11862,Steve Martin,Elliot Davis


### Checking the datatype of `id` in both dataframes.

In [149]:
movies_df['id'].dtype, credits_df['id'].dtype

(dtype('O'), dtype('int64'))

### As the type of `id` in `movies_df` is not integer, we will convert it to integer.

#### Some of the `id`s look like dates. As there are only 3 of them, we set them to '-1' so that they can be removed later.

In [150]:
for ind in movies_df.index:
    if '-' in str(movies_df.loc[ind, 'id']):
        # .loc writes to movies_df itself rather than to a temporary copy.
        movies_df.loc[ind, 'id'] = '-1'

#### Converting `id`s of movies_df into `int64`.

In [151]:
movies_df = movies_df.astype({'id': 'int64'})

#### Checking again.

In [152]:
movies_df['id'].dtype

dtype('int64')

#### Removing the rows from `movies_df` whose `id` is equal to -1.

In [153]:
movies_df = movies_df[movies_df['id'] != -1]
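The four steps above (mark, convert, check, filter) can likely be collapsed with `pd.to_numeric`. This is a minimal sketch on made-up values; the date-like string stands in for the malformed `id`s and is not copied from the data.

```python
import pandas as pd

# Made-up stand-in for movies_df['id'], including one date-like value.
sample = pd.DataFrame({'id': ['862', '8844', '1997-08-20', '15602']})

# Anything that is not a number becomes NaN instead of raising.
ids = pd.to_numeric(sample['id'], errors='coerce')

# Keep only rows with a valid id, then store the ids as int64.
valid = ids.notna()
sample = sample[valid].copy()
sample['id'] = ids[valid].astype('int64')
print(sample)
```

This avoids the '-1' sentinel, so a genuine negative or missing id cannot be confused with a malformed one.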

### Checking for duplicate `id`s in `movies_df`.

In [154]:
len_of_movies_df = len(movies_df)
movies_df_ids = list(movies_df['id'])
movies_df_duplicate_ids = []
for i in range(len_of_movies_df):
    for j in range(i+1, len_of_movies_df):
        if movies_df_ids[i] == movies_df_ids[j] and movies_df_ids[i] not in movies_df_duplicate_ids:
            movies_df_duplicate_ids.append(movies_df_ids[i])

In [155]:
movies_df_duplicate_ids

[105045,
 132641,
 22649,
 84198,
 10991,
 110428,
 15028,
 12600,
 109962,
 4912,
 5511,
 23305,
 69234,
 14788,
 77221,
 13209,
 159849,
 141971,
 168538,
 97995,
 18440,
 11115,
 42495,
 99080,
 25541,
 119916,
 152795,
 265189,
 298721]
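The nested loop compares every pair of rows, which grows quadratically with the number of movies. pandas can find repeated values directly with `Series.duplicated`. A small sketch on made-up ids follows; the order of the resulting list may differ from the loop's.

```python
import pandas as pd

# Made-up ids; 862 appears three times and 8844 twice.
sample = pd.DataFrame({'id': [862, 8844, 862, 15602, 8844, 862]})

# duplicated() flags every occurrence after the first one;
# unique() then lists each repeated id once.
duplicate_ids = sample.loc[sample['id'].duplicated(), 'id'].unique().tolist()
print(duplicate_ids)
```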

### Removing rows that are exact duplicates of another row.

In [156]:
movies_df.drop_duplicates(inplace=True)
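Called without arguments, `drop_duplicates` removes only rows that are identical in every column, so two rows that share an `id` but differ in some other column both survive. If the goal is one row per `id`, passing `subset='id'` is one option. The sketch below uses a made-up two-column frame to show the difference.

```python
import pandas as pd

# Made-up rows: id 862 is repeated exactly, id 8844 differs in popularity.
sample = pd.DataFrame({
    'id': [862, 862, 8844, 8844],
    'popularity': [21.9, 21.9, 17.0, 17.5],
})

exact = sample.drop_duplicates()             # compares all columns
by_id = sample.drop_duplicates(subset='id')  # keeps the first row per id

print(len(exact), len(by_id))
```

Which of the rows sharing an `id` to keep is a judgment call; `keep='first'` is the pandas default.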

### Let's check `movies_df` again.

In [157]:
movies_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 45446 entries, 0 to 45465
Data columns (total 24 columns):
adult                    45446 non-null object
belongs_to_collection    4490 non-null object
budget                   45446 non-null object
genres                   45446 non-null object
homepage                 7777 non-null object
id                       45446 non-null int64
imdb_id                  45429 non-null object
original_language        45435 non-null object
original_title           45446 non-null object
overview                 44492 non-null object
popularity               45443 non-null object
poster_path              45060 non-null object
production_companies     45443 non-null object
production_countries     45443 non-null object
release_date             45446 non-null object
revenue                  45443 non-null float64
runtime                  45186 non-null float64
spoken_languages         45443 non-null object
status                   45362 non-null object

### Checking for duplicate `id`s in `credits_df`.

In [158]:
len_of_credits_df = len(credits_df)
credits_df_ids = list(credits_df['id'])
credits_df_duplicate_ids = []
for i in range(len_of_credits_df):
    for j in range(i+1, len_of_credits_df):
        if credits_df_ids[i] == credits_df_ids[j] and credits_df_ids[i] not in credits_df_duplicate_ids:
            credits_df_duplicate_ids.append(credits_df_ids[i])

In [159]:
credits_df_duplicate_ids

[105045,
 132641,
 22649,
 84198,
 10991,
 110428,
 15028,
 12600,
 109962,
 4912,
 5511,
 23305,
 69234,
 14788,
 77221,
 13209,
 159849,
 141971,
 168538,
 97995,
 18440,
 11115,
 42495,
 99080,
 25541,
 119916,
 152795,
 265189,
 116723,
 3057,
 125458,
 199591,
 24023,
 24026,
 11752,
 142563,
 157301,
 9755,
 123634,
 8767,
 43629,
 187156,
 298721]

### Removing rows that are exact duplicates of another row.

In [160]:
credits_df.drop_duplicates(inplace=True)

### Let's check `credits_df` again.

In [161]:
credits_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 45434 entries, 0 to 45475
Data columns (total 3 columns):
id           45434 non-null int64
lead_role    45434 non-null object
director     45434 non-null object
dtypes: int64(1), object(2)
memory usage: 1.4+ MB


### Creating a `main_df` which is an outer merge of `movies_df` and `credits_df`.

In [173]:
main_df = credits_df.merge(movies_df, how='outer', on='id')
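With an outer merge it can be useful to see how many rows matched on both sides and how many came from only one frame. `merge` accepts `indicator=True` for this. The frames below are made-up placeholders, not the real `credits_df` and `movies_df`.

```python
import pandas as pd

# Made-up frames: two ids match, one id exists only on each side.
credits_sample = pd.DataFrame({'id': [862, 8844, 99999],
                               'director': ['Director A', 'Director B', 'Director C']})
movies_sample = pd.DataFrame({'id': [862, 8844, 11111],
                              'title': ['Movie A', 'Movie B', 'Movie D']})

# indicator=True adds a '_merge' column: 'both', 'left_only' or 'right_only'.
merged = credits_sample.merge(movies_sample, how='outer', on='id', indicator=True)
print(merged['_merge'].value_counts())
```

If any `id` is still repeated on either side, the merge produces one row per matching pair, which can make the result longer than either input.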

### Basic info of each row in `main_df`.

In [174]:
main_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 45449 entries, 0 to 45448
Data columns (total 26 columns):
id                       45449 non-null int64
lead_role                45448 non-null object
director                 45448 non-null object
adult                    45449 non-null object
belongs_to_collection    4490 non-null object
budget                   45449 non-null object
genres                   45449 non-null object
homepage                 7777 non-null object
imdb_id                  45432 non-null object
original_language        45438 non-null object
original_title           45449 non-null object
overview                 44495 non-null object
popularity               45446 non-null object
poster_path              45063 non-null object
production_companies     45446 non-null object
production_countries     45446 non-null object
release_date             45449 non-null object
revenue                  45446 non-null float64
runtime                  45189 non-null float64

### A look at `main_df`.

In [175]:
main_df

Unnamed: 0,id,lead_role,director,adult,belongs_to_collection,budget,genres,homepage,imdb_id,original_language,...,release_date,revenue,runtime,spoken_languages,status,tagline,title,video,vote_average,vote_count
0,862,Tom Hanks,John Lasseter,False,"{'id': 10194, 'name': 'Toy Story Collection', ...",30000000,"[{'id': 16, 'name': 'Animation'}, {'id': 35, '...",http://toystory.disney.com/toy-story,tt0114709,en,...,1995-10-30,373554033.0,81.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,,Toy Story,False,7.7,5415.0
1,8844,Robin Williams,Joe Johnston,False,,65000000,"[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...",,tt0113497,en,...,1995-12-15,262797249.0,104.0,"[{'iso_639_1': 'en', 'name': 'English'}, {'iso...",Released,Roll the dice and unleash the excitement!,Jumanji,False,6.9,2413.0
2,15602,Walter Matthau,Howard Deutch,False,"{'id': 119050, 'name': 'Grumpy Old Men Collect...",0,"[{'id': 10749, 'name': 'Romance'}, {'id': 35, ...",,tt0113228,en,...,1995-12-22,0.0,101.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Still Yelling. Still Fighting. Still Ready for...,Grumpier Old Men,False,6.5,92.0
3,31357,Whitney Houston,Forest Whitaker,False,,16000000,"[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'nam...",,tt0114885,en,...,1995-12-22,81452156.0,127.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Friends are the people who let you be yourself...,Waiting to Exhale,False,6.1,34.0
4,11862,Steve Martin,Elliot Davis,False,"{'id': 96871, 'name': 'Father of the Bride Col...",0,"[{'id': 35, 'name': 'Comedy'}]",,tt0113041,en,...,1995-02-10,76578911.0,106.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Just When His World Is Back To Normal... He's ...,Father of the Bride Part II,False,5.7,173.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
45444,111109,Angel Aquino,Lav Diaz,False,,0,"[{'id': 18, 'name': 'Drama'}]",,tt2028550,tl,...,2011-11-17,0.0,360.0,"[{'iso_639_1': 'tl', 'name': ''}]",Released,,Century of Birthing,False,9.0,3.0
45445,67758,Erika Eleniak,Mark L. Lester,False,,0,"[{'id': 28, 'name': 'Action'}, {'id': 18, 'nam...",,tt0303758,en,...,2003-08-01,0.0,90.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,A deadly game of wits.,Betrayal,False,3.8,6.0
45446,227506,Iwan Mosschuchin,Yakov Protazanov,False,,0,[],,tt0008536,en,...,1917-10-21,0.0,87.0,[],Released,,Satan Triumphant,False,0.0,0.0
45447,461257,,Daisy Asquith,False,,0,[],,tt6980792,en,...,2017-06-09,0.0,75.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,,Queerama,False,0.0,0.0


### Checking the null values in `main_df`.

In [176]:
main_df.isna().sum()

id                           0
lead_role                    1
director                     1
adult                        0
belongs_to_collection    40959
budget                       0
genres                       0
homepage                 37672
imdb_id                     17
original_language           11
original_title               0
overview                   954
popularity                   3
poster_path                386
production_companies         3
production_countries         3
release_date                 0
revenue                      3
runtime                    260
spoken_languages             3
status                      84
tagline                  25041
title                        3
video                        3
vote_average                 3
vote_count                   3
dtype: int64

### Removing the unnecessary characters in `belongs_to_collection`, keeping only the collection name.

In [177]:
collection = []
for ind in main_df.index:
    words = str(main_df['belongs_to_collection'][ind]).split(',')
    name = 'no collection'
    i = 0
    for word in words:
        if i < 1:
            if 'name' in word:
                name = word[10:-1]
                i+=1
    collection.append(name)
main_df['belongs_to_collection'] = pd.Series(collection)

### Checking `main_df` again.

In [178]:
main_df

Unnamed: 0,id,lead_role,director,adult,belongs_to_collection,budget,genres,homepage,imdb_id,original_language,...,release_date,revenue,runtime,spoken_languages,status,tagline,title,video,vote_average,vote_count
0,862,Tom Hanks,John Lasseter,False,Toy Story Collection,30000000,"[{'id': 16, 'name': 'Animation'}, {'id': 35, '...",http://toystory.disney.com/toy-story,tt0114709,en,...,1995-10-30,373554033.0,81.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,,Toy Story,False,7.7,5415.0
1,8844,Robin Williams,Joe Johnston,False,no collection,65000000,"[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...",,tt0113497,en,...,1995-12-15,262797249.0,104.0,"[{'iso_639_1': 'en', 'name': 'English'}, {'iso...",Released,Roll the dice and unleash the excitement!,Jumanji,False,6.9,2413.0
2,15602,Walter Matthau,Howard Deutch,False,Grumpy Old Men Collection,0,"[{'id': 10749, 'name': 'Romance'}, {'id': 35, ...",,tt0113228,en,...,1995-12-22,0.0,101.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Still Yelling. Still Fighting. Still Ready for...,Grumpier Old Men,False,6.5,92.0
3,31357,Whitney Houston,Forest Whitaker,False,no collection,16000000,"[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'nam...",,tt0114885,en,...,1995-12-22,81452156.0,127.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Friends are the people who let you be yourself...,Waiting to Exhale,False,6.1,34.0
4,11862,Steve Martin,Elliot Davis,False,Father of the Bride Collection,0,"[{'id': 35, 'name': 'Comedy'}]",,tt0113041,en,...,1995-02-10,76578911.0,106.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Just When His World Is Back To Normal... He's ...,Father of the Bride Part II,False,5.7,173.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
45444,111109,Angel Aquino,Lav Diaz,False,no collection,0,"[{'id': 18, 'name': 'Drama'}]",,tt2028550,tl,...,2011-11-17,0.0,360.0,"[{'iso_639_1': 'tl', 'name': ''}]",Released,,Century of Birthing,False,9.0,3.0
45445,67758,Erika Eleniak,Mark L. Lester,False,no collection,0,"[{'id': 28, 'name': 'Action'}, {'id': 18, 'nam...",,tt0303758,en,...,2003-08-01,0.0,90.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,A deadly game of wits.,Betrayal,False,3.8,6.0
45446,227506,Iwan Mosschuchin,Yakov Protazanov,False,no collection,0,[],,tt0008536,en,...,1917-10-21,0.0,87.0,[],Released,,Satan Triumphant,False,0.0,0.0
45447,461257,,Daisy Asquith,False,no collection,0,[],,tt6980792,en,...,2017-06-09,0.0,75.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,,Queerama,False,0.0,0.0


### Dropping the columns `homepage`, `imdb_id`, `original_title`, `overview`, `poster_path`, `tagline`, `video` as they are not expected to contribute to `revenue`.

In [179]:
main_df.drop(['homepage', 'imdb_id', 'original_title', 'overview', 'poster_path', 'tagline', 'video'], axis=1, inplace=True)
main_df.head()

Unnamed: 0,id,lead_role,director,adult,belongs_to_collection,budget,genres,original_language,popularity,production_companies,production_countries,release_date,revenue,runtime,spoken_languages,status,title,vote_average,vote_count
0,862,Tom Hanks,John Lasseter,False,Toy Story Collection,30000000,"[{'id': 16, 'name': 'Animation'}, {'id': 35, '...",en,21.946943,"[{'name': 'Pixar Animation Studios', 'id': 3}]","[{'iso_3166_1': 'US', 'name': 'United States o...",1995-10-30,373554033.0,81.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Toy Story,7.7,5415.0
1,8844,Robin Williams,Joe Johnston,False,no collection,65000000,"[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...",en,17.015539,"[{'name': 'TriStar Pictures', 'id': 559}, {'na...","[{'iso_3166_1': 'US', 'name': 'United States o...",1995-12-15,262797249.0,104.0,"[{'iso_639_1': 'en', 'name': 'English'}, {'iso...",Released,Jumanji,6.9,2413.0
2,15602,Walter Matthau,Howard Deutch,False,Grumpy Old Men Collection,0,"[{'id': 10749, 'name': 'Romance'}, {'id': 35, ...",en,11.7129,"[{'name': 'Warner Bros.', 'id': 6194}, {'name'...","[{'iso_3166_1': 'US', 'name': 'United States o...",1995-12-22,0.0,101.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Grumpier Old Men,6.5,92.0
3,31357,Whitney Houston,Forest Whitaker,False,no collection,16000000,"[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'nam...",en,3.859495,[{'name': 'Twentieth Century Fox Film Corporat...,"[{'iso_3166_1': 'US', 'name': 'United States o...",1995-12-22,81452156.0,127.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Waiting to Exhale,6.1,34.0
4,11862,Steve Martin,Elliot Davis,False,Father of the Bride Collection,0,"[{'id': 35, 'name': 'Comedy'}]",en,8.387519,"[{'name': 'Sandollar Productions', 'id': 5842}...","[{'iso_3166_1': 'US', 'name': 'United States o...",1995-02-10,76578911.0,106.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Father of the Bride Part II,5.7,173.0


### Removing the unnecessary characters in `production_companies` and keeping only the first listed company if there are several.

In [180]:
production_companies = []
for ind in main_df.index:
    try:
        # Only parse cells that are neither empty strings nor empty lists.
        if str(main_df['production_companies'][ind]) != '' and str(main_df['production_companies'][ind]) != '[]':
            words = str(main_df['production_companies'][ind]).split(',')[0]
            word = str(words).split(':')[1]
            name = str(word[2: -1])
            production_companies.append(name)
        else:
            production_companies.append('not specified')
    except IndexError:
        production_companies.append('not specified')
        
main_df['production_companies'] = pd.Series(production_companies)

### Removing the unnecessary characters in `production_countries` and keeping only the first listed country if there are several.

In [181]:
production_countries = []
for ind in main_df.index:
    try:
        # Only parse cells that are neither empty strings nor empty lists.
        if str(main_df['production_countries'][ind]) != '' and str(main_df['production_countries'][ind]) != '[]':
            words = str(main_df['production_countries'][ind]).split(',')[1]
            word = str(words).split(':')[1]
            word_1 = str(word).split('}')[0]
            name = str(word_1[2: -1])
            production_countries.append(name)
        else:
            production_countries.append('not specified')
    except IndexError:
        production_countries.append('not specified')
        
main_df['production_countries'] = pd.Series(production_countries)

### `genres` and `spoken_languages` still have to be cleaned: they play an important role, and each can hold multiple values per movie.
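One possible way to handle such multi-valued columns is to extract every name from the cell and then build one 0/1 indicator column per value. The following is only a sketch of that idea, not the approach this notebook ends up using; `names_from_list` and the sample frame are hypothetical.

```python
import ast
import pandas as pd

def names_from_list(cell):
    """Return all 'name' values from a cell holding a list of dictionaries."""
    try:
        items = ast.literal_eval(str(cell))
    except (ValueError, SyntaxError):
        return []
    if not isinstance(items, list):
        return []
    return [item['name'] for item in items if isinstance(item, dict) and 'name' in item]

# Made-up stand-in for main_df['genres'].
sample = pd.DataFrame({'genres': [
    "[{'id': 16, 'name': 'Animation'}, {'id': 35, 'name': 'Comedy'}]",
    "[{'id': 35, 'name': 'Comedy'}]",
    "[]",
]})

sample['genre_names'] = sample['genres'].apply(names_from_list)

# Join each list into one string, then expand it into indicator columns.
genre_flags = sample['genre_names'].str.join('|').str.get_dummies(sep='|')
print(genre_flags)
```

The same helper would apply to `spoken_languages`, assuming its entries also carry a `name` key as the outputs above suggest.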