# Movie Data ETL Pipeline - Extract

This notebook will begin the ETL process by extracting the data from the 2 data sources (2 CSV files from Kaggle and 1 JSON file with data scraped from Wikipedia). We will also start the transform step by cleaning the data.

### Dependencies and data

The Wikipedia data was scraped from the movie pages on [Wikipedia](https://en.wikipedia.org/).

Download `movies_metadata.csv` and `ratings.csv` from the TMDB movie dataset at the Kaggle link below, then move both files into the `data/raw/` directory.

Source: https://www.kaggle.com/rounakbanik/the-movies-dataset

In [1]:
# Dependencies
import os
import json
import re
import numpy as np
import pandas as pd

In [2]:
# Path to data directory
data_path = os.path.join('..', 'data')

# Paths to data files
kmovies_path = os.path.join(data_path, 'raw', 'movies_metadata.csv')
wmovies_path = os.path.join(data_path, 'raw', 'wikipedia_movies.json')
print(kmovies_path)
print(wmovies_path)

../data/raw/movies_metadata.csv
../data/raw/wikipedia_movies.json


In [3]:
# Kaggle movie metadata
kmovies_df = pd.read_csv(kmovies_path, low_memory=False)
kmovies_df.head(2)

Unnamed: 0,adult,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,original_title,overview,...,release_date,revenue,runtime,spoken_languages,status,tagline,title,video,vote_average,vote_count
0,False,"{'id': 10194, 'name': 'Toy Story Collection', ...",30000000,"[{'id': 16, 'name': 'Animation'}, {'id': 35, '...",http://toystory.disney.com/toy-story,862,tt0114709,en,Toy Story,"Led by Woody, Andy's toys live happily in his ...",...,1995-10-30,373554033.0,81.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,,Toy Story,False,7.7,5415.0
1,False,,65000000,"[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...",,8844,tt0113497,en,Jumanji,When siblings Judy and Peter discover an encha...,...,1995-12-15,262797249.0,104.0,"[{'iso_639_1': 'en', 'name': 'English'}, {'iso...",Released,Roll the dice and unleash the excitement!,Jumanji,False,6.9,2413.0


In [4]:
# Wikipedia movie data
wmovies_df = pd.read_json(wmovies_path)
wmovies_df.head(2)

Unnamed: 0,url,year,imdb_link,title,Directed by,Produced by,Screenplay by,Story by,Based on,Starring,...,Predecessor,Founders,Area served,Products,Services,Russian,Hebrew,Revenue,Operating income,Polish
0,https://en.wikipedia.org/wiki/The_Adventures_o...,1990.0,https://www.imdb.com/title/tt0098987/,The Adventures of Ford Fairlane,Renny Harlin,"[Steve Perry, Joel Silver]","[David Arnott, James Cappe, Daniel Waters]","[David Arnott, James Cappe]","[Characters, by Rex Weiner]","[Andrew Dice Clay, Wayne Newton, Priscilla Pre...",...,,,,,,,,,,
1,"https://en.wikipedia.org/wiki/After_Dark,_My_S...",1990.0,https://www.imdb.com/title/tt0098994/,"After Dark, My Sweet",James Foley,"[Ric Kidney, Robert Redlin]","[James Foley, Robert Redlin]",,"[the novel, After Dark, My Sweet, by, Jim Thom...","[Jason Patric, Rachel Ward, Bruce Dern, George...",...,,,,,,,,,,


In [5]:
# Inspect cols
print(wmovies_df.columns.sort_values().tolist())

['Actor control', 'Adaptation by', 'Alias', 'Alma mater', 'Also known as', 'Animation by', 'Arabic', 'Area', 'Area served', 'Artist(s)', 'Attraction type', 'Audio format', 'Author', 'Based on', 'Biographical data', 'Bopomofo', 'Born', 'Box office', 'Budget', 'Camera setup', 'Cantonese', 'Characters', 'Children', 'Chinese', 'Cinematography', 'Closing date', 'Color process', 'Comics', 'Composer(s)', 'Coordinates', 'Country', 'Country of origin', 'Cover artist', 'Created by', 'Date premiered', 'Designer(s)', 'Developed by', 'Developer(s)', 'Dewey Decimal', 'Died', 'Directed by', 'Director', 'Distributed by', 'Distributor', 'Divisions', 'Duration', 'Edited by', 'Editor(s)', 'Ending theme', 'Engine', 'Engine(s)', 'Executive producer(s)', 'Family', 'Fate', 'Film(s)', 'Followed by', 'Format(s)', 'Formerly', 'Founded', 'Founder', 'Founders', 'French', 'Full name', 'Gender', 'Genre', 'Genre(s)', 'Genres', 'Gwoyeu Romatzyh', 'Hangul', 'Hanyu Pinyin', 'Headquarters', 'Hebrew', 'Height', 'Hepburn'

The Wikipedia data is a lot messier than the Kaggle data: reading it in directly as a Pandas dataframe resulted in 193 columns and a lot of null values. Instead, we'll read this data in as JSON and clean it up before converting it to a dataframe.
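To see where so many columns come from, it can help to count how often each key appears across the raw records. The following is a minimal sketch on a few made-up records (`sample_records` is hypothetical and not taken from the file); with the real data, the list returned by `json.load` would take its place.

```python
from collections import Counter

# Hypothetical records standing in for the scraped Wikipedia data
sample_records = [
    {'title': 'Movie A', 'Directed by': 'Someone', 'Running time': '90 minutes'},
    {'title': 'Movie B', 'Director': 'Someone Else', 'Length': '100 minutes'},
    {'title': 'Show C', 'No. of seasons': '3', 'Created by': 'A Creator'},
]

# Count how many records contain each key
key_counts = Counter(key for record in sample_records for key in record)

# Every distinct key becomes its own column when the records are loaded as a dataframe
print('Distinct keys:', len(key_counts))
for key, count in key_counts.most_common():
    print(f'{key}: {count} of {len(sample_records)} records')
```

Keys that appear in only a handful of records are the ones that end up as mostly-null columns.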

In [6]:
# Wikipedia movie data
with open(wmovies_path, 'r') as f:
    wmovies = json.load(f)
    
print('Number of records:', len(wmovies))
print('Sample record:')
wmovies[0]

Number of records: 7311
Sample record:


{'url': 'https://en.wikipedia.org/wiki/The_Adventures_of_Ford_Fairlane',
 'year': 1990,
 'imdb_link': 'https://www.imdb.com/title/tt0098987/',
 'title': 'The Adventures of Ford Fairlane',
 'Directed by': 'Renny Harlin',
 'Produced by': ['Steve Perry', 'Joel Silver'],
 'Screenplay by': ['David Arnott', 'James Cappe', 'Daniel Waters'],
 'Story by': ['David Arnott', 'James Cappe'],
 'Based on': ['Characters', 'by Rex Weiner'],
 'Starring': ['Andrew Dice Clay',
  'Wayne Newton',
  'Priscilla Presley',
  'Lauren Holly',
  'Morris Day',
  'Robert Englund',
  "Ed O'Neill"],
 'Narrated by': 'Andrew "Dice" Clay',
 'Music by': ['Cliff Eidelman', 'Yello'],
 'Cinematography': 'Oliver Wood',
 'Edited by': 'Michael Tronick',
 'Productioncompany ': 'Silver Pictures',
 'Distributed by': '20th Century Fox',
 'Release date': ['July 11, 1990', '(', '1990-07-11', ')'],
 'Running time': '102 minutes',
 'Country': 'United States',
 'Language': 'English',
 'Budget': '$20 million',
 'Box office': '$21.4 milli

### Filter for movies

There seem to be TV shows mixed into the data. To filter for movies, records must have a value for:
1. `imdb_link`
2. `Directed by`/`Director`
3. `Duration`/`Length`/`Running time`

Records should also NOT have a value for:
1. `No. of seasons`
2. `No. of episodes`
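The criteria above can also be written as a small predicate, which makes each rule easier to read and test on its own. This is a sketch only: `is_movie` and the two sample records are hypothetical names and data introduced for illustration. Like the filter in this notebook, it checks whether a key is present rather than whether its value is non-empty.

```python
def is_movie(record):
    """Return True if a record looks like a movie under the criteria above."""
    has_link = 'imdb_link' in record
    has_director = any(key in record for key in ('Directed by', 'Director'))
    has_runtime = any(key in record for key in ('Duration', 'Length', 'Running time'))
    is_series = any(key in record for key in ('No. of seasons', 'No. of episodes'))
    return has_link and has_director and has_runtime and not is_series


# Hypothetical records for illustration only
movie = {'imdb_link': 'https://www.imdb.com/title/tt0000001/',
         'Directed by': 'A Director', 'Running time': '95 minutes'}
tv_show = {'imdb_link': 'https://www.imdb.com/title/tt0000002/',
           'Directed by': 'A Director', 'Running time': '45 minutes',
           'No. of seasons': '2'}

print(is_movie(movie))    # True
print(is_movie(tv_show))  # False
```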

In [7]:
# Filter out tv shows
wmovies_filtered = [movie for movie in wmovies if 
                    ('imdb_link' in movie) and 
                    ('Directed by' in movie or 'Director' in movie) and 
                    ('Duration' in movie or 'Length' in movie or 'Running time' in movie) and 
                    ('No. of seasons' not in movie) and 
                    ('No. of episodes' not in movie)]
len(wmovies_filtered)

6936

In [8]:
# Convert to df
wmovies_df = pd.DataFrame(wmovies_filtered)
print(wmovies_df.shape)

# Inspect cols
wmovies_cols = wmovies_df.columns.sort_values()
wmovies_cols

(6936, 75)


Index(['Adaptation by', 'Also known as', 'Animation by', 'Arabic',
       'Audio format', 'Based on', 'Box office', 'Budget', 'Cantonese',
       'Chinese', 'Cinematography', 'Color process', 'Composer(s)', 'Country',
       'Country of origin', 'Created by', 'Directed by', 'Director',
       'Distributed by', 'Distributor', 'Edited by', 'Editor(s)',
       'Executive producer(s)', 'Followed by', 'French', 'Genre', 'Hangul',
       'Hebrew', 'Hepburn', 'Japanese', 'Label', 'Language', 'Length',
       'Literally', 'Mandarin', 'McCune–Reischauer', 'Music by', 'Narrated by',
       'Original language(s)', 'Original network', 'Original release',
       'Original title', 'Picture format', 'Polish', 'Preceded by',
       'Produced by', 'Producer', 'Producer(s)', 'Production company(s)',
       'Production location(s)', 'Productioncompanies ', 'Productioncompany ',
       'Recorded', 'Release date', 'Released', 'Revised Romanization',
       'Romanized', 'Running time', 'Russian', 'Screen st

### Clean columns

Just by filtering out the TV shows, the data has been reduced to 75 columns. Let's take a look at a sample value for each column to get a better understanding of the data. Since there are a lot of missing values in most of the columns, the sample values will not all correspond to the same record.

In [9]:
# Sample val for each column
for col in wmovies_cols:
    # Print the first non-null val in the col
    print(col, ':', wmovies_df.loc[wmovies_df[col].notnull(), col].values[0])

Adaptation by : ['John L. Balderston', 'Paul Perez', 'Daniel Moore']
Also known as : Detonator II: Night Watch
Animation by : ['Andreas Deja', 'Gary Dunn', 'Deboissy Sylvain']
Arabic : قضية رقم ٢٣
Audio format : Stereo
Based on : ['Characters', 'by Rex Weiner']
Box office : $21.4 million
Budget : $20 million
Cantonese : ['Jip', '6', 'Man', '6', 'Saam', '1']
Chinese : 摇滚藏獒
Cinematography : Oliver Wood
Color process : Technicolor
Composer(s) : Richard Bellis
Country : United States
Country of origin : United States
Created by : ['John William Corrington', '(novel)']
Directed by : Renny Harlin
Director : Mark "Aldo" Miceli
Distributed by : 20th Century Fox
Distributor : NBC
Edited by : Michael Tronick
Editor(s) : ['Christopher Cooke', 'James Galloway']
Executive producer(s) : Rich Melcombe
Followed by : See below
French : Le Cinquième Élément
Genre : Thriller
Hangul : 원더풀 데이즈
Hebrew : פוֹקְסטְרוֹט
Hepburn : Omoide no Mānī
Japanese : 思い出のマーニー
Label : ['Warner Music Vision', 'Warner-Reprise

We will be addressing a few things here:
1. There are a lot of columns holding alternate titles for the movies so we're going to group all of these together in the JSON data.
2. There are also a lot of redundant columns giving the same information, such as `Directed by` and `Director`. We'll be grouping these together as well.
3. The column names are inconsistent, so we will rename them for consistency.

In [10]:
# Keys holding alternate titles
title_keys = ['Also known as', 'Arabic', 'Cantonese', 'Chinese', 'French', 
              'Hangul', 'Hebrew', 'Hepburn', 'Japanese', 'Literally', 'Mandarin', 
              'McCune–Reischauer', 'Original title', 'Polish', 'Revised Romanization', 
              'Romanized', 'Russian', 'Simplified', 'Traditional', 'Yiddish']

# Key rename pairs (old name: new name)
keys_to_rename = {
    'Adaptation by': 'writers',
    'Animation by': 'animators',
    'Audio format': 'audio_format',
    'Based on': 'based_on',
    'Box office': 'box_office',
    'Budget': 'budget',
    'Cinematography': 'cinematographers',
    'Color process': 'color_process',
    'Composer(s)': 'composers',
    'Country': 'country',
    'Country of origin': 'country',
    'Created by': 'creators',
    'Directed by': 'director',
    'Director': 'director',
    'Distributed by': 'distributor',
    'Distributor': 'distributor',
    'Edited by': 'editors',
    'Editor(s)': 'editors',
    'Executive producer(s)': 'executive_producers',
    'Followed by': 'sequel',
    'Genre': 'genre',
    'Label': 'label',
    'Language': 'languages',
    'Length': 'duration',
    'Music by': 'composers',
    'Narrated by': 'narrator',
    'Original language(s)': 'languages', 
    'Original network': 'network', 
    'Original release': 'release_date',
    'Picture format': 'picture_format',
    'Preceded by': 'prequel',
    'Produced by': 'producers',
    'Producer': 'producers',
    'Producer(s)': 'producers',
    'Production company(s)': 'production_companies',
    'Production location(s)': 'production_locations',
    'Productioncompanies ': 'production_companies',
    'Productioncompany ': 'production_companies',
    'Recorded': 'recorded',
    'Release date': 'release_date',
    'Released': 'release_date',
    'Running time': 'duration',
    'Screen story by': 'writers',
    'Screenplay by': 'writers',
    'Starring': 'stars',
    'Story by': 'writers',
    'Suggested by': 'suggestors',
    'Theme music composer': 'composers',
    'Venue': 'venue',
    'Voices of': 'voicers',
    'Written by': 'writers',
    'imdb_link': 'imdb_link', 
    'title': 'title',
    'url': 'url', 
    'year': 'year'
}

len(title_keys), len(keys_to_rename)

(20, 55)

In [11]:
def clean_movie(movie_dict, title_keys=title_keys, keys_to_rename=keys_to_rename):
    
    """
    Clean movie dictionary with the following steps:
    1. combine all alternate titles into a single key
    2. rename keys for consistency and to consolidate similar columns into 1
    
    Parameters
    ----------
    movie_dict : dict
        Record to clean
    title_keys : list[str]
        Names of keys with the movie's alternate titles
    keys_to_rename : dict
        Mapping of old key name to new key name
    
    Returns
    -------
    Dict
        Clean movie dictionary
    """
    
    # Copy of movie dict and empty dict for alt titles
    movie_dict, alt_titles = dict(movie_dict), dict()
    
    # Add keys with alt titles to `alt_titles` and delete the original key
    for key in title_keys:
        if key in movie_dict:
            alt_titles[key.lower().replace(' ', '_')] = movie_dict.pop(key)
            
    # Add new key for alt titles
    if len(alt_titles):
        movie_dict['alternate_titles'] = alt_titles
        
    # Rename keys
    for old, new in keys_to_rename.items():
        if old in movie_dict:
            movie_dict[new] = movie_dict.pop(old)
        
    return movie_dict
            
    
# Test func
clean_movie(wmovies_filtered[849])

{'alternate_titles': {'mandarin': 'Xǐyàn', 'traditional': '喜宴'},
 'box_office': '$23.6 million',
 'budget': '$1 million',
 'cinematographers': 'Jong Lin',
 'country': ['Taiwan', 'United States'],
 'director': 'Ang Lee',
 'distributor': 'The Samuel Goldwyn Company',
 'editors': 'Tim Squyres',
 'languages': ['Mandarin Chinese', 'English'],
 'composers': 'Mader',
 'producers': ['Ang Lee', 'Ted Hope', 'James Schamus'],
 'production_companies': 'Good Machine',
 'release_date': ['4 August 1993', '(', '1993-08-04', ')', '(United States)'],
 'duration': '106 minutes',
 'stars': ['Ah-Leh Gua',
  'Sihung Lung',
  'May Chin',
  'Winston Chao',
  'Mitchell Lichtenstein'],
 'writers': ['Ang Lee', 'Neil Peng', 'James Schamus'],
 'imdb_link': 'https://www.imdb.com/title/tt0107156/',
 'title': 'The Wedding Banquet',
 'url': 'https://en.wikipedia.org/wiki/The_Wedding_Banquet',
 'year': 1993}

In [12]:
# Clean movie dictionaries
wmovies_clean = [clean_movie(movie) for movie in wmovies_filtered]

# Convert Wikipedia data to df
wmovies_df = pd.DataFrame(wmovies_clean)
print(wmovies_df.shape)

# Inspect cols
wmovies_df.columns.sort_values()

(6936, 38)


Index(['alternate_titles', 'animators', 'audio_format', 'based_on',
       'box_office', 'budget', 'cinematographers', 'color_process',
       'composers', 'country', 'creators', 'director', 'distributor',
       'duration', 'editors', 'executive_producers', 'genre', 'imdb_link',
       'label', 'languages', 'narrator', 'network', 'picture_format',
       'prequel', 'producers', 'production_companies', 'production_locations',
       'recorded', 'release_date', 'sequel', 'stars', 'suggestors', 'title',
       'url', 'venue', 'voicers', 'writers', 'year'],
      dtype='object')

### Drop duplicate rows

With the columns taken care of, let's inspect the data for duplicate records. The IMDB ID should be unique for each movie. The data does have the IMDB link, so we can extract the IMDB ID from that link.

In [13]:
# Inspect IMDB links
wmovies_df.loc[:1, 'imdb_link']

0    https://www.imdb.com/title/tt0098987/
1    https://www.imdb.com/title/tt0098994/
Name: imdb_link, dtype: object

In [14]:
# Extract IMDB ID from the link
wmovies_df['imdb_id'] = wmovies_df['imdb_link'].str.extract(r'(tt\d{7})')
wmovies_df.loc[:1, 'imdb_id']

0    tt0098987
1    tt0098994
Name: imdb_id, dtype: object

In [15]:
# Check for duplicate movies
wmovies_dup = wmovies_df.duplicated(subset=['imdb_id'], keep=False)
print(wmovies_dup.sum())
wmovies_df[wmovies_dup].sort_values('imdb_id').head(4)

83


Unnamed: 0,based_on,box_office,budget,cinematographers,country,director,distributor,editors,languages,composers,...,creators,prequel,suggestors,alternate_titles,label,recorded,venue,animators,color_process,imdb_id
23,"[Characters, by, H. P. Lovecraft]",,,Rick Fichter,United States,Brian Yuzna,50th Street Films,Peter Teschner,English,Richard Band,...,,,,,,,,,,tt0099180
273,"[Characters, by, H. P. Lovecraft]",,,Rick Fichter,United States,Brian Yuzna,50th Street Films,Peter Teschner,English,Richard Band,...,,,,,,,,,,tt0099180
199,,$15.1 million,"[60 million, Norwegian Kroner]",Erling Thurmann-Andersen,"[Norway, Sweden, United States]",Nils Gaup,Buena Vista Pictures,Nils Pagh Andersen,English,Patrick Doyle,...,,,,,,,,,,tt0099816
611,,$15.1 million,"[60 million, Norwegian Kroner]",Erling Thurmann-Andersen,"[Norway, Sweden, United States]",Nils Gaup,Buena Vista Pictures,Nils Pagh Andersen,English,Patrick Doyle,...,,,,,,,,,,tt0099816


In [16]:
# Drop duplicate movies
wmovies_df.drop_duplicates(subset=['imdb_id'], inplace=True)
wmovies_df.shape

(6894, 39)
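
Note that 83 rows were flagged as duplicates but only 42 rows were dropped (6936 to 6894). With `keep=False`, every member of a duplicate group is flagged, whereas `drop_duplicates` keeps the first occurrence of each group by default. A minimal sketch with made-up IDs (`toy_df` is hypothetical) shows the difference:

```python
import pandas as pd

# Hypothetical IDs: 'tt0000001' appears three times, 'tt0000002' twice
toy_df = pd.DataFrame({'imdb_id': ['tt0000001', 'tt0000002', 'tt0000001',
                                   'tt0000003', 'tt0000002', 'tt0000001']})

# keep=False flags every row that belongs to a duplicate group
flagged = toy_df.duplicated(subset=['imdb_id'], keep=False).sum()

# drop_duplicates keeps the first row of each group by default
deduped = toy_df.drop_duplicates(subset=['imdb_id'])
dropped = len(toy_df) - len(deduped)

print('Rows flagged:', flagged)    # 5
print('Rows dropped:', dropped)    # 3
print('Rows kept:', len(deduped))  # 3
```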

### Drop columns with mostly missing values

There are a lot of columns that are missing most of their values, so there is no point in keeping them. We will be looking for and dropping columns with at least half of their values missing.

In [17]:
# Check cols where >= half of the vals are missing
cols_missing = wmovies_df.isnull().mean()
cols_missing50 = cols_missing[cols_missing >= 0.5]
cols_missing50

based_on                0.684798
narrator                0.959240
genre                   0.984624
network                 0.982594
audio_format            0.991297
executive_producers     0.986365
picture_format          0.991007
production_locations    0.993328
sequel                  0.998695
voicers                 0.999710
creators                0.998549
prequel                 0.998549
suggestors              0.999855
alternate_titles        0.996954
label                   0.999710
recorded                0.999710
venue                   0.999855
animators               0.999710
color_process           0.999855
dtype: float64

In [18]:
# Drop these cols and inspect remaining cols
wmovies_df.drop(cols_missing50.index, axis=1, inplace=True)
print(wmovies_df.shape)
wmovies_df.isnull().mean()

(6894, 20)


box_office              0.208297
budget                  0.315202
cinematographers        0.093560
country                 0.031912
director                0.000000
distributor             0.043951
editors                 0.070351
languages               0.010299
composers               0.067740
producers               0.025239
production_companies    0.231651
release_date            0.003626
duration                0.000000
writers                 0.026690
stars                   0.023934
imdb_link               0.000000
title                   0.000000
url                     0.000000
year                    0.000000
imdb_id                 0.000000
dtype: float64

### Inspect data types

With 20 columns remaining, let's check if their data types are appropriate.

In [19]:
# Inspect dtypes
wmovies_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 6894 entries, 0 to 6935
Data columns (total 20 columns):
 #   Column                Non-Null Count  Dtype 
---  ------                --------------  ----- 
 0   box_office            5458 non-null   object
 1   budget                4721 non-null   object
 2   cinematographers      6249 non-null   object
 3   country               6674 non-null   object
 4   director              6894 non-null   object
 5   distributor           6591 non-null   object
 6   editors               6409 non-null   object
 7   languages             6823 non-null   object
 8   composers             6427 non-null   object
 9   producers             6720 non-null   object
 10  production_companies  5297 non-null   object
 11  release_date          6869 non-null   object
 12  duration              6894 non-null   object
 13  writers               6710 non-null   object
 14  stars                 6729 non-null   object
 15  imdb_link             6894 non-null   

Columns that need to be recast:
- `release_date` - to datetime
- `budget` - to numeric
- `box_office` - to numeric
- `duration` - to numeric

### Convert `release_date` to datetime type

The data contains lists, which are unhashable, so we'll need to join the list items into a single string before making the conversion.

In [20]:
def list_to_str(obj):
    
    """ Convert list to string if not already a string """
    
    return ' '.join(obj) if isinstance(obj, list) else obj


# Convert lists in `release_date` to strs
release_date = wmovies_df['release_date'].apply(list_to_str)
# release_date.unique().tolist()

Some of the dates in the data are actually ranges (e.g. `January 15 - 16, 1990`). We will remove the right limit of these date ranges and keep the left limit, so `January 15 - 16, 1990` becomes `January 15, 1990`. Once the ranges are reduced, the dates appear in 5 different formats:
1. (DD) (Month) (YYYY) - 01 January 1970
2. (Month) (DD), (YYYY) - January 01, 1970
3. (Month), (YYYY) - January, 1970
4. (YYYY)-(MM)-(DD) - 1970-01-01
5. (YYYY) - 1970

These formats are all parsable by Pandas, so we will be extracting these patterns from the date strings in order to convert them into datetime type.
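
As a quick check of the approach, the sketch below applies the range removal and the two date patterns used in this notebook to a few made-up strings, using only the standard `re` module. The `extract_date` helper and the `samples` list are hypothetical names for illustration, and the dash variants are written as `\u2013` and `\u2014` escapes rather than literal characters.

```python
import re

# Equivalent to the patterns used in this notebook
range_pattern = r' [-\u2013\u2014] \d\d?'  # hyphen, en dash, or em dash followed by a day
date_format1 = r'(?:\d\d? )?[a-z]{3,9}(?: \d\d?)?,? \d{4}'
date_format2 = r'\d{4}(?:\D\d\d?\D\d\d?)?'
date_formats = f'({date_format1}|{date_format2})'


def extract_date(text):
    """Drop the upper limit of a date range, then return the first matching date."""
    text = re.sub(range_pattern, '', text.strip())
    match = re.search(date_formats, text, flags=re.IGNORECASE)
    return match.group(1) if match else None


# Made-up strings: one per format, plus one date range
samples = ['01 January 1970', 'January 01, 1970', 'January, 1970',
           '1970-01-01', '1970', 'January 15 - 16, 1990']

for text in samples:
    print(repr(text), '->', extract_date(text))
```

Because the search returns the leftmost match, a string holding the same date twice (for example `4 August 1993 ( 1993-08-04 )`) yields only its first date.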

In [21]:
# Select lower limit of date ranges
release_date = release_date.str.strip().str.replace(r' [-–—] \d\d?', '', regex=True)

In [22]:
# Date formats
date_format1 = r'(?:\d\d? )?[a-z]{3,9}(?: \d\d?)?,? \d{4}' # (DD) (Month) (YYYY) | (Month) (DD), (YYYY) | (Month), (YYYY)
date_format2 = r'\d{4}(?:\D\d\d?\D\d\d?)?' # (YYYY)-(MM)-(DD) | (YYYY)
date_formats = f'({date_format1}|{date_format2})' # date formats 1 and 2

# Check if there are any other dates not captured by these 2 patterns
release_date.dropna()[~release_date.dropna().str.contains(date_formats, flags=re.IGNORECASE)]


Series([], Name: release_date, dtype: object)

In [23]:
# Extract date from str
release_date = release_date.str.extract(date_formats, flags=re.IGNORECASE)[0]

# Convert date str to datetime type
wmovies_df['release_date'] = pd.to_datetime(release_date, infer_datetime_format=True)
wmovies_df['release_date'].isnull().sum()

25

### Convert `budget` to numeric

As with `release_date`, there are lists and ranges in the `budget` values, so we will convert any lists to strings and reduce ranges to their lower limit in the same way. In addition, `budget` has citation brackets (e.g. [1]), which we will remove as well.

In [24]:
# Convert lists in `budget` to strs
budget = wmovies_df['budget'].apply(list_to_str)

# Clean str and select lower limit of amount ranges
budget = budget.str.strip().str.replace(r'\[\s*(?:\w+\s*)*\]', '', regex=True) \
                           .str.replace(r'[-–—]\s?\$?\d+', '', regex=True)
# budget.unique().tolist()

It looks like 2 formats will capture most or all of the `budget` values:
1. \\$xxx.x mil - \\$100 million, \\$1.2 million
2. \\$xxx,xxx,xxx - \\$100,000,000, \\$20,000
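
The two formats can be tried out on a few made-up values before running them on the full column. This is a sketch under assumptions: `budget_to_float` is a hypothetical helper that combines extraction and conversion in one step, and the sample strings are illustrative rather than taken from the data.

```python
import re

budget_format1 = r'\$?\s?\d{1,3}(?:\.\d+)?\s*mil'  # $xxx.x mil
budget_format2 = r'\$?\s?\d{1,3}(?:,\d{3})+'       # $xxx,xxx,xxx
budget_formats = f'({budget_format1}|{budget_format2})'


def budget_to_float(text):
    """Return the amount in dollars, or None if neither format matches."""
    match = re.search(budget_formats, text, flags=re.IGNORECASE)
    if match is None:
        return None
    # Remove $, spaces, and commas from the matched text
    amount = re.sub(r'[\$\s,]', '', match.group(1)).lower()
    if amount.endswith('mil'):
        return float(amount[:-3]) * 1e6  # millions
    return float(amount)


for text in ['$20 million', '$1.5 million', '$100,000,000', 'Unknown', '$218.32']:
    print(repr(text), '->', budget_to_float(text))
```

Values such as `Unknown` or `$218.32` match neither format and come back as `None`, which corresponds to the values set to NaN in this notebook.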

In [25]:
# Budget formats
budget_format1 = r'\$?\s?\d{1,3}(?:\.\d+)?\s*mil' # $xxx.x mil
budget_format2 = r'\$?\s?\d{1,3}(?:,\d{3})+' # $xxx,xxx,xxx
budget_formats = f'({budget_format1}|{budget_format2})'

# Check formats
budget_contains = budget.dropna().str.contains(budget_formats, flags=re.IGNORECASE)
budget_to_na = budget.dropna()[~budget_contains].unique().tolist()
budget_to_na

['Unknown', 'HBO', '$218.32', 'N/A', '19 crore', '3.5 crore']

In [26]:
# Replace above vals with NaN
for val in budget_to_na:
    budget.replace(val, np.nan, inplace=True)  # np.nan works across NumPy versions; the np.NaN alias was removed in NumPy 2.0
budget.dropna()[~budget_contains]

Series([], Name: budget, dtype: object)

In [27]:
# Extract amount from str
budget = budget.str.extract(budget_formats, flags=re.IGNORECASE)[0]
# budget.unique().tolist()

With the patterns extracted, we can now parse these values in order to convert them to numeric type.

In [28]:
def parse_budget(s):
    
    """ Convert budget string to numeric """
    
    # Check if the input is NaN or already a float
    if isinstance(s, float):
        return s
    
    # Remove $, spaces, and commas
    s = re.sub(r'[\$\s,]', '', s).lower()
    
    # Convert to float
    if 'mil' in s:
        f = float(s.replace('mil', '')) * 1e6 # million
    else:
        f = float(s)
        
    return f


# Convert budget to float type
wmovies_df['budget'] = budget.apply(parse_budget)
wmovies_df['budget'].isnull().sum()

2182

### Convert `box_office` to numeric

This is similar to `budget`, except that there are no ranges to handle. The formats are also similar, but in addition to millions, some values are given in thousands ("k") or billions.

In [29]:
# Convert lists in `box_office` to strs
boxoffice = wmovies_df['box_office'].apply(list_to_str)
# boxoffice.unique().tolist()

In [30]:
# Box office formats
boxoffice_format1 = r'\$?\s?\d{1,3}(?:\.\d+)?\s*[kmb]' # $xxx.x k/m/b
boxoffice_format2 = r'\$?\s?\d{1,3}(?:[\s\.,]?\d{3})+\$?' # $xxx,xxx,xxx
boxoffice_formats = f'({boxoffice_format1}|{boxoffice_format2})'

# Check formats
boxoffice_contains = boxoffice.dropna().str.contains(boxoffice_formats, flags=re.IGNORECASE)
boxoffice_to_na = boxoffice.dropna()[~boxoffice_contains].unique().tolist()
boxoffice_to_na

['N/A',
 '$309',
 'TBA',
 '$20-30',
 '£2.56',
 'Unknown',
 '$588',
 'less than $372',
 '8 crore']

In [31]:
# Replace above values with NaN
for val in boxoffice_to_na:
    boxoffice.replace(val, np.nan, inplace=True)  # np.nan works across NumPy versions; the np.NaN alias was removed in NumPy 2.0
boxoffice.dropna()[~boxoffice_contains]

Series([], Name: box_office, dtype: object)

In [32]:
# Extract amount from string
boxoffice = boxoffice.str.extract(boxoffice_formats, flags=re.IGNORECASE)[0]
# boxoffice.unique().tolist()

Special values to consider when parsing `box_office`:
1. Numbers ending with "k" (thousands) : \\$10 k = \\$10,000
2. Numbers ending with "m" (millions) : \\$10 m = \\$10,000,000
3. Numbers ending with "b" (billions) : \\$10 b = \\$10,000,000,000
4. Numbers using "." as a thousand-separator (instead of ",") : \\$10.000.000  = \\$10,000,000
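
One way to handle these cases is a lookup table of suffix multipliers. The sketch below is an alternative to a chain of `if`/`elif` checks, not the parser used in this notebook; `MULTIPLIERS` and `boxoffice_to_float` are hypothetical names. It assumes the input is a non-empty string that has already been extracted with the box office patterns, so any suffix is the last character.

```python
import re

# Multipliers for the suffixes left after extraction
MULTIPLIERS = {'k': 1e3, 'm': 1e6, 'b': 1e9}


def boxoffice_to_float(text):
    """Convert an extracted box office string to a dollar amount."""
    # Remove $, spaces, and commas
    s = re.sub(r'[\$\s,]', '', text).lower()
    if s[-1] in MULTIPLIERS:
        return float(s[:-1]) * MULTIPLIERS[s[-1]]
    # No suffix: any remaining "." is treated as a thousands separator
    return float(s.replace('.', ''))


for text in ['$10 k', '$2.5 m', '$1.5 b', '$10.000.000', '$10,000,000']:
    print(repr(text), '->', boxoffice_to_float(text))
```

The last branch is only safe because values like `$218.32` are filtered out before parsing; otherwise a decimal point would be misread as a separator.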

In [33]:
def parse_boxoffice(s):
    
    """ Convert box office string to numeric """
    
    # Check if the input is NaN or already a float
    if isinstance(s, float):
        return s
    
    # Remove $, spaces, and commas
    s = re.sub(r'[\$\s,]', '', s).lower()
    
    # Convert to float
    if 'k' in s:
        f = float(s.replace('k', '')) * 1e3 # thousand
    elif 'm' in s:
        f = float(s.replace('m', '')) * 1e6 # million
    elif 'b' in s:
        f = float(s.replace('b', '')) * 1e9 # billion
    elif '.' in s:
        f = float(s.replace('.', '')) # for amounts using "." as a thousand-separator
    else:
        f = float(s)
        
    return f


# Convert box office to float type
wmovies_df['box_office'] = boxoffice.apply(parse_boxoffice)
wmovies_df['box_office'].isnull().sum()

1445

### Convert `duration` to numeric

Just like `budget`, `duration` contains lists, citation brackets, and ranges. We will remove these the same way we did for `budget`.

In [34]:
# Convert lists in `duration` to strings
duration = wmovies_df['duration'].apply(list_to_str)

# Clean string and select lower limit of amount ranges
duration = duration.str.strip().str.replace(r'\[\d\]', '', regex=True) \
                               .str.replace(r'[-–—]\s?\d+', '', regex=True)
# duration.unique().tolist()

There are a few different formats for the `duration`:
1. x hours xx m - 2 hours 22 minutes
2. xxx m - 100 minutes
3. xx hours - 10 hours
4. xx : xx - 10 : 10
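
To make the target unit concrete, here is a minimal sketch that converts one made-up string per format into whole minutes. `duration_to_minutes` is a hypothetical helper, and it assumes that `xx : xx` means minutes and seconds, with the seconds discarded.

```python
import re


def duration_to_minutes(text):
    """Convert a duration string in one of the four formats to whole minutes."""
    s = text.lower()

    # "xx : xx" is read as minutes : seconds; the seconds are discarded
    clock = re.fullmatch(r'\s*(\d{1,3})\s*:\s*\d{1,2}\s*', s)
    if clock:
        return int(clock.group(1))

    # Look for an hours part and a minutes part independently
    hours = re.search(r'(\d+)\s*h', s)
    minutes = re.search(r'(\d{1,3})\s*m', s)
    if not hours and not minutes:
        return None

    total = 0
    if hours:
        total += int(hours.group(1)) * 60  # hours to minutes
    if minutes:
        total += int(minutes.group(1))
    return total


for text in ['2 hours 22 minutes', '100 minutes', '10 hours', '10 : 10', 'varies']:
    print(repr(text), '->', duration_to_minutes(text))
```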

In [35]:
# Duration formats
duration_format1 = r'(?:\d\s*ho?u?r?s?\s*)?\d{1,3}\s*m' # x hours xx m | xxx m
duration_format2 = r'\d\s*ho?u?r?s?|\d{1,2}\s*\:\s*\d{1,2}' # xx hours | xx : xx
duration_formats = f'({duration_format1}|{duration_format2})'

# Check formats
duration_contains = duration.dropna().str.contains(duration_formats, flags=re.IGNORECASE)
duration_to_na = duration.dropna()[~duration_contains].unique().tolist()
duration_to_na

['varies', 'minutes']

In [36]:
# Replace above values with NaN
for val in duration_to_na:
    duration.replace(val, np.nan, inplace=True)  # np.nan works across NumPy versions; the np.NaN alias was removed in NumPy 2.0
duration.dropna()[~duration_contains]

Series([], Name: duration, dtype: object)

In [37]:
# Extract duration from string
duration = duration.str.extract(duration_formats, flags=re.IGNORECASE)[0]
duration.unique().tolist()

['102 m',
 '114 m',
 '113 m',
 '106 m',
 '95 m',
 '100 m',
 '99 m',
 '50 m',
 '93 m',
 '110 m',
 '126 m',
 '121 m',
 '118 m',
 '90 m',
 '94 m',
 '190 m',
 '85 m',
 '96 m',
 '97 m',
 '32 m',
 '98 m',
 '84 m',
 '101 m',
 '86 m',
 '138 m',
 '91 m',
 '181 m',
 '108 m',
 '120 m',
 '111 m',
 '103 m',
 '105 m',
 '124 m',
 '30 m',
 '82 m',
 '74 m',
 '81 m',
 '87 m',
 '107 m',
 '128 m',
 '83 m',
 '162 m',
 '145 m',
 '92 m',
 '88 m',
 '109 m',
 '140 m',
 '136 m',
 '130 m',
 '135 m',
 '115 m',
 '192 m',
 '89 m',
 '129 m',
 '75 m',
 '78 m',
 '127 m',
 '119 m',
 '132 m',
 '77 m',
 '117 m',
 '104 m',
 '7 m',
 '80 m',
 '134 m',
 '60 m',
 '200 m',
 '25 m',
 '189 m',
 '137 m',
 '112 m',
 '122 m',
 '141 m',
 '188 m',
 '116 m',
 '143 m',
 '79 m',
 '148 m',
 '72 m',
 '187 m',
 '76 m',
 '202 m',
 '51 m',
 '67 m',
 '23 m',
 '57 m',
 '123 m',
 '156 m',
 '150 m',
 '49 m',
 '69  m',
 '131 m',
 '139 m',
 '142 m',
 '180 m',
 '144 m',
 '154 m',
 '254 m',
 '133 m',
 '71 m',
 '125 m',
 '195 m',
 '79    m',
 '73 m',

In [38]:
def parse_duration(s):
    
    """ Convert duration string to integer """
    
    # Check if the input is NaN or already a float
    if isinstance(s, float):
        return s
    
    # Remove seconds
    s = re.sub(r'\:\s*\d{1,2}', '', s)
    
    # Remove "m" and spaces
    s = re.sub(r'm|\s', '', s, flags=re.IGNORECASE)
    
    # Convert to int
    match = re.search(r'(\d)(ho?u?r?s?)(\d\d?)?', s, flags=re.IGNORECASE)
    if match: # if time is in hours
        i = int(match.group(1)) * 60 # hours to minutes
        if match.group(3):
            i += int(match.group(3)) # add minutes
    else: # if time is in minutes
        i = int(s)
        
    return i


# Convert duration to numeric (stored as float64 because the missing values are NaN)
wmovies_df['duration'] = duration.apply(parse_duration)
wmovies_df['duration'].isnull().sum()

2

### Save Wikipedia data

In [39]:
# Inspect data
wmovies_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 6894 entries, 0 to 6935
Data columns (total 20 columns):
 #   Column                Non-Null Count  Dtype         
---  ------                --------------  -----         
 0   box_office            5449 non-null   float64       
 1   budget                4712 non-null   float64       
 2   cinematographers      6249 non-null   object        
 3   country               6674 non-null   object        
 4   director              6894 non-null   object        
 5   distributor           6591 non-null   object        
 6   editors               6409 non-null   object        
 7   languages             6823 non-null   object        
 8   composers             6427 non-null   object        
 9   producers             6720 non-null   object        
 10  production_companies  5297 non-null   object        
 11  release_date          6869 non-null   datetime64[ns]
 12  duration              6892 non-null   float64       
 13  writers           

In [40]:
# Save data
wmovies_pkl_path = os.path.join(data_path, 'wiki_movies.pkl')
wmovies_df.to_pickle(wmovies_pkl_path)
pd.read_pickle(wmovies_pkl_path).head(2)

Unnamed: 0,box_office,budget,cinematographers,country,director,distributor,editors,languages,composers,producers,production_companies,release_date,duration,writers,stars,imdb_link,title,url,year,imdb_id
0,21400000.0,20000000.0,Oliver Wood,United States,Renny Harlin,20th Century Fox,Michael Tronick,English,"[Cliff Eidelman, Yello]","[Steve Perry, Joel Silver]",Silver Pictures,1990-07-11,102.0,"[David Arnott, James Cappe]","[Andrew Dice Clay, Wayne Newton, Priscilla Pre...",https://www.imdb.com/title/tt0098987/,The Adventures of Ford Fairlane,https://en.wikipedia.org/wiki/The_Adventures_o...,1990,tt0098987
1,2700000.0,6000000.0,Mark Plummer,United States,James Foley,Avenue Pictures,Howard E. Smith,English,Maurice Jarre,"[Ric Kidney, Robert Redlin]",Avenue Pictures,1990-05-17,114.0,"[James Foley, Robert Redlin]","[Jason Patric, Rachel Ward, Bruce Dern, George...",https://www.imdb.com/title/tt0098994/,"After Dark, My Sweet","https://en.wikipedia.org/wiki/After_Dark,_My_S...",1990,tt0098994


### Inspect Kaggle data

In [41]:
# Inspect data
kmovies_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 45466 entries, 0 to 45465
Data columns (total 24 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   adult                  45466 non-null  object 
 1   belongs_to_collection  4494 non-null   object 
 2   budget                 45466 non-null  object 
 3   genres                 45466 non-null  object 
 4   homepage               7782 non-null   object 
 5   id                     45466 non-null  object 
 6   imdb_id                45449 non-null  object 
 7   original_language      45455 non-null  object 
 8   original_title         45466 non-null  object 
 9   overview               44512 non-null  object 
 10  popularity             45461 non-null  object 
 11  poster_path            45080 non-null  object 
 12  production_companies   45463 non-null  object 
 13  production_countries   45463 non-null  object 
 14  release_date           45379 non-null  object 
 15  re

### Drop duplicate rows

The Kaggle movie data also has an `imdb_id` column, which we can use to find and drop duplicate records.

In [42]:
# Check for duplicated rows
kmovies_dup = kmovies_df.duplicated(subset='imdb_id', keep=False)
print(kmovies_dup.sum())
kmovies_df.loc[kmovies_dup, 'imdb_id'].head(2)

79


676    tt0111613
838    tt0046468
Name: imdb_id, dtype: object
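The count above uses `keep=False`, which flags every row in a duplicate group rather than only the later repeats, and pandas treats missing values as equal to each other when checking for duplicates. A minimal sketch of the difference, using made-up ids rather than rows from the Kaggle file:

```python
import numpy as np
import pandas as pd

# Toy data with hypothetical ids (not taken from the Kaggle file)
toy = pd.DataFrame({'imdb_id': ['tt0000001', 'tt0000002', 'tt0000001', np.nan, np.nan]})

# keep=False flags every member of a duplicate group, including both NaN rows
flag_all = toy.duplicated(subset='imdb_id', keep=False)

# The default keep='first' flags only the later repeats
flag_repeats = toy.duplicated(subset='imdb_id')

print(flag_all.sum(), flag_repeats.sum())
```

So the 79 reported above counts all rows involved in a duplicate group, which is more than the number of rows that will eventually be removed.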

There are 3 records that have an `imdb_id` of 0. Other records may also have an invalid `imdb_id`, so let's first count every non-null value that does not match the expected pattern (`tt` followed by 7 digits) before dropping anything.

In [43]:
# Invalid IMDB ids
kmovies_df['imdb_id'].count() - kmovies_df['imdb_id'].dropna().str.contains(r'tt\d{7}').sum()

3
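Note that `str.contains` matches the pattern anywhere in the string, so a value with extra characters around a valid id would still pass. A stricter check, sketched here on made-up values, could use `str.fullmatch` and also list the offending values instead of only counting them. The sample ids below are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical sample values, not rows from the Kaggle file
ids = pd.Series(['tt0114709', '0', 'xtt0114709', None])

# fullmatch requires the whole string to match; na=False marks missing values as non-matching
is_valid = ids.str.fullmatch(r'tt\d{7}', na=False)

# Non-null values that fail the strict pattern
invalid = ids[ids.notna() & ~is_valid]
print(invalid.tolist())
```

Here `'xtt0114709'` would pass the `str.contains` check but is rejected by `str.fullmatch`.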

In [44]:
# Replace `imdb_id` 0 with NaN
kclean_df = kmovies_df.copy()
kclean_df['imdb_id'] = kclean_df['imdb_id'].replace('0', np.nan)
kclean_df['imdb_id'].count() - kclean_df['imdb_id'].dropna().str.contains(r'tt\d{7}').sum()

0

In [45]:
# Drop duplicate movies (rows with a missing `imdb_id` count as duplicates of each other, so only the first one is kept)
kclean_df = kclean_df.drop_duplicates(subset=['imdb_id'])
kclean_df.duplicated(subset='imdb_id').sum()

0
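Because `drop_duplicates` treats missing keys as equal, the cell above also collapses all rows without an `imdb_id` into one. If those rows needed to be preserved, one possible alternative is to deduplicate only the rows that have a key. The sketch below uses toy data and is not what this notebook does.

```python
import numpy as np
import pandas as pd

# Toy data with hypothetical ids and titles
toy = pd.DataFrame({
    'imdb_id': ['tt0000001', 'tt0000001', np.nan, np.nan],
    'title': ['A', 'A again', 'B', 'C'],
})

# Same call as in the notebook: only the first row with a missing key survives
collapsed = toy.drop_duplicates(subset=['imdb_id'])

# Alternative: only treat a row as a repeat when its key is present
is_repeat = toy['imdb_id'].notna() & toy.duplicated(subset='imdb_id')
preserved = toy[~is_repeat]

print(len(collapsed), len(preserved))
```

Which behavior is preferable depends on the later steps: if the data is joined on `imdb_id`, rows without one cannot be matched anyway.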

### Drop columns with mostly missing values

In [46]:
# Check columns where at least half of the values are missing
cols_missing = kclean_df.isnull().mean()
cols_missing50 = cols_missing[cols_missing >= 0.5]
cols_missing50

belongs_to_collection    0.901226
homepage                 0.828853
tagline                  0.550873
dtype: float64

In [47]:
# Drop these columns and inspect remaining columns
kclean_df.drop(cols_missing50.index, axis=1, inplace=True)
kclean_df.isnull().mean()

adult                   0.000000
budget                  0.000000
genres                  0.000000
id                      0.000000
imdb_id                 0.000022
original_language       0.000242
original_title          0.000000
overview                0.020961
popularity              0.000066
poster_path             0.008389
production_companies    0.000066
production_countries    0.000066
release_date            0.001850
revenue                 0.000066
runtime                 0.005681
spoken_languages        0.000066
status                  0.001850
title                   0.000066
video                   0.000066
vote_average            0.000066
vote_count              0.000066
dtype: float64
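The two cells above can be wrapped into a small reusable helper. This is only a sketch: the function name `drop_sparse_columns` and the toy data are assumptions, and unlike the `inplace=True` call above it returns a new DataFrame.

```python
import numpy as np
import pandas as pd

def drop_sparse_columns(df, threshold=0.5):
    """Return a copy of df without columns whose missing fraction is >= threshold."""
    missing = df.isnull().mean()
    return df.drop(columns=missing[missing >= threshold].index)

# Toy data (hypothetical values)
toy = pd.DataFrame({
    'title': ['A', 'B', 'C', 'D'],
    'tagline': [np.nan, np.nan, 'x', np.nan],  # 75% missing
    'runtime': [81.0, np.nan, 104.0, 90.0],    # 25% missing
})

kept = drop_sparse_columns(toy)
print(kept.columns.tolist())
```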

There are 2 columns containing titles: `original_title` and `title`. Keeping both is redundant, so we will also drop `original_title` and keep `title`.

In [48]:
kclean_df.drop('original_title', axis=1, inplace=True)
kclean_df.shape

(45417, 20)
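Before dropping one of two similar columns, it can help to measure how often they actually differ. A minimal sketch on made-up rows (this check is not part of the original notebook):

```python
import pandas as pd

# Hypothetical sample rows for illustration
toy = pd.DataFrame({
    'original_title': ['Toy Story', 'La Cité des enfants perdus', 'Jumanji'],
    'title': ['Toy Story', 'The City of Lost Children', 'Jumanji'],
})

# True where the two columns disagree (note: a missing value never equals anything,
# so rows with a missing title would also count as differing)
differs = toy['original_title'] != toy['title']

print(differs.sum(), 'of', len(toy), 'rows differ')
```

Running the same comparison on `kclean_df` before the drop would show how much information `original_title` adds.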

### Inspect data types

In [49]:
# Column dtypes and sample vals
for col in kclean_df.columns:
    print(col, '(', kclean_df[col].dtype, ') :', kclean_df.loc[0, col])

adult ( object ) : False
budget ( object ) : 30000000
genres ( object ) : [{'id': 16, 'name': 'Animation'}, {'id': 35, 'name': 'Comedy'}, {'id': 10751, 'name': 'Family'}]
id ( object ) : 862
imdb_id ( object ) : tt0114709
original_language ( object ) : en
overview ( object ) : Led by Woody, Andy's toys live happily in his room until Andy's birthday brings Buzz Lightyear onto the scene. Afraid of losing his place in Andy's heart, Woody plots against Buzz. But when circumstances separate Buzz and Woody from their owner, the duo eventually learns to put aside their differences.
popularity ( object ) : 21.946943
poster_path ( object ) : /rhIRbceoE9lR4veEXuwCC2wARtG.jpg
production_companies ( object ) : [{'name': 'Pixar Animation Studios', 'id': 3}]
production_countries ( object ) : [{'iso_3166_1': 'US', 'name': 'United States of America'}]
release_date ( object ) : 1995-10-30
revenue ( float64 ) : 373554033.0
runtime ( float64 ) : 81.0
spoken_languages ( object ) : [{'iso_639_1': 'en', 'na

Columns that need to be recast:
- `adult` - to boolean
- `video` - to boolean
- `release_date` - to datetime
- `budget` - to numeric
- `id` - to numeric
- `popularity` - to numeric

### Convert `adult` and `video` to boolean type

In [50]:
# Inspect `adult` and `video` vals
print(kclean_df['adult'].value_counts())
print(kclean_df['video'].value_counts())

False    45408
True         9
Name: adult, dtype: int64
False    45321
True        93
Name: video, dtype: int64


Both `adult` and `video` consist almost entirely of a single value (`False`), so they provide little information and will be dropped rather than recast. Before dropping these 2 columns, the 9 records flagged as `adult` are removed.

In [51]:
# Remove adult movies and drop both columns (`adult` holds the strings 'True'/'False', hence the quoted comparison)
kclean_df = kclean_df.query('adult != "True"').drop(['adult', 'video'], axis=1)
kclean_df.shape

(45408, 18)
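Had the columns been kept, the recast to boolean would need some care, because the values are the strings `'True'` and `'False'`. A sketch of one way to do it with an explicit mapping, on toy values:

```python
import pandas as pd

# Toy column of string flags, mirroring the object dtype seen above
adult = pd.Series(['False', 'True', 'False'], dtype=object)

# Pitfall: any non-empty string is truthy, so astype(bool) turns 'False' into True
naive = adult.astype(bool)

# An explicit mapping converts the strings to real booleans
adult_bool = adult.map({'True': True, 'False': False})

print(naive.tolist())
print(adult_bool.tolist())
```

Any value outside the mapping would become NaN, which makes unexpected entries easy to spot.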

### Convert `release_date` to datetime type

In [52]:
# Recast release date to datetime
kclean_df['release_date'] = pd.to_datetime(kclean_df['release_date'])
kclean_df.loc[:1, 'release_date']

0   1995-10-30
1   1995-12-15
Name: release_date, dtype: datetime64[ns]
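`pd.to_datetime` raises an error by default if a value cannot be parsed. If the column contained malformed dates, one option would be to pass an explicit format together with `errors='coerce'` and then inspect what was turned into `NaT`. The sample values below are made up for illustration.

```python
import pandas as pd

# Hypothetical raw values, including one that is not a date
raw = pd.Series(['1995-10-30', '1995-12-15', 'not a date', None])

# Unparseable entries become NaT instead of raising
parsed = pd.to_datetime(raw, format='%Y-%m-%d', errors='coerce')

# Values that were present in the input but lost during parsing
newly_missing = raw[raw.notna() & parsed.isna()]
print(newly_missing.tolist())
```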

### Convert `budget`, `id`, and `popularity` to numeric

In [53]:
# Recast budget and ID to int
kclean_df['budget'] = kclean_df['budget'].astype(int)
kclean_df['id'] = kclean_df['id'].astype(int)

# Recast popularity to float
kclean_df['popularity'] = kclean_df['popularity'].astype(float)
kclean_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 45408 entries, 0 to 45465
Data columns (total 18 columns):
 #   Column                Non-Null Count  Dtype         
---  ------                --------------  -----         
 0   budget                45408 non-null  int64         
 1   genres                45408 non-null  object        
 2   id                    45408 non-null  int64         
 3   imdb_id               45407 non-null  object        
 4   original_language     45397 non-null  object        
 5   overview              44456 non-null  object        
 6   popularity            45405 non-null  float64       
 7   poster_path           45027 non-null  object        
 8   production_companies  45405 non-null  object        
 9   production_countries  45405 non-null  object        
 10  release_date          45325 non-null  datetime64[ns]
 11  revenue               45405 non-null  float64       
 12  runtime               45150 non-null  float64       
 13  spoken_languages
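`astype(int)` fails on the first value that is not a valid integer string. A quick way to find such values beforehand is `pd.to_numeric` with `errors='coerce'`. The sketch below uses a made-up bad value and is not a cell from this notebook.

```python
import pandas as pd

# Toy column of numeric strings with one hypothetical corrupt entry
budget = pd.Series(['30000000', '65000000', '/poster.jpg'], dtype=object)

# Non-numeric strings become NaN instead of raising
as_number = pd.to_numeric(budget, errors='coerce')

# Values that were present but could not be converted
bad_values = budget[as_number.isna() & budget.notna()]
print(bad_values.tolist())
```

If `bad_values` is empty, the strict `astype` conversion used above is safe.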

### Save Kaggle data

In [54]:
# Inspect data
kclean_df.head(2)

Unnamed: 0,budget,genres,id,imdb_id,original_language,overview,popularity,poster_path,production_companies,production_countries,release_date,revenue,runtime,spoken_languages,status,title,vote_average,vote_count
0,30000000,"[{'id': 16, 'name': 'Animation'}, {'id': 35, '...",862,tt0114709,en,"Led by Woody, Andy's toys live happily in his ...",21.946943,/rhIRbceoE9lR4veEXuwCC2wARtG.jpg,"[{'name': 'Pixar Animation Studios', 'id': 3}]","[{'iso_3166_1': 'US', 'name': 'United States o...",1995-10-30,373554033.0,81.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Toy Story,7.7,5415.0
1,65000000,"[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...",8844,tt0113497,en,When siblings Judy and Peter discover an encha...,17.015539,/vzmL6fP7aPKNKPRTFnZmiUfciyV.jpg,"[{'name': 'TriStar Pictures', 'id': 559}, {'na...","[{'iso_3166_1': 'US', 'name': 'United States o...",1995-12-15,262797249.0,104.0,"[{'iso_639_1': 'en', 'name': 'English'}, {'iso...",Released,Jumanji,6.9,2413.0


In [55]:
# Save data
kmovies_pkl_path = os.path.join(data_path, 'kaggle_movies.pkl')
kclean_df.to_pickle(kmovies_pkl_path)
pd.read_pickle(kmovies_pkl_path).head(2)

Unnamed: 0,budget,genres,id,imdb_id,original_language,overview,popularity,poster_path,production_companies,production_countries,release_date,revenue,runtime,spoken_languages,status,title,vote_average,vote_count
0,30000000,"[{'id': 16, 'name': 'Animation'}, {'id': 35, '...",862,tt0114709,en,"Led by Woody, Andy's toys live happily in his ...",21.946943,/rhIRbceoE9lR4veEXuwCC2wARtG.jpg,"[{'name': 'Pixar Animation Studios', 'id': 3}]","[{'iso_3166_1': 'US', 'name': 'United States o...",1995-10-30,373554033.0,81.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Toy Story,7.7,5415.0
1,65000000,"[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...",8844,tt0113497,en,When siblings Judy and Peter discover an encha...,17.015539,/vzmL6fP7aPKNKPRTFnZmiUfciyV.jpg,"[{'name': 'TriStar Pictures', 'id': 559}, {'na...","[{'iso_3166_1': 'US', 'name': 'United States o...",1995-12-15,262797249.0,104.0,"[{'iso_639_1': 'en', 'name': 'English'}, {'iso...",Released,Jumanji,6.9,2413.0
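Pickle is used here instead of CSV because it preserves dtypes such as `datetime64[ns]` and Python objects such as lists. A small round-trip check on toy data, written to a temporary directory rather than to `data_path`, could look like this (the file name is hypothetical):

```python
import os
import tempfile
import pandas as pd

# Toy frame with an int, a datetime, and a list-valued column
df = pd.DataFrame({
    'id': [862, 8844],
    'release_date': pd.to_datetime(['1995-10-30', '1995-12-15']),
    'genres': [['Animation', 'Comedy'], ['Adventure']],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    pkl_path = os.path.join(tmp_dir, 'toy_movies.pkl')  # hypothetical file name
    df.to_pickle(pkl_path)
    restored = pd.read_pickle(pkl_path)

# Dtypes and list values survive the round trip
print(restored.dtypes.equals(df.dtypes))
print(restored['genres'].tolist() == df['genres'].tolist())
```

Keep in mind that pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.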
