# Movie Data ETL Pipeline - Transform

This notebook continues the transform step of the pipeline by merging the Wikipedia data with the Kaggle data. The combined data needs some additional cleaning and transformation before it is ready to be loaded into a SQL database.

### Dependencies

In [1]:
%matplotlib inline

import datetime
import json
import re
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

### Wikipedia data

In [2]:
# Path to data directory
data_path = '../data/'

In [3]:
# Wikipedia movie data (cleaned from notebook 1)
wmovies_df = pd.read_pickle(data_path + 'wiki_movies.pkl')
print(wmovies_df.info())
wmovies_df.head(2)

<class 'pandas.core.frame.DataFrame'>
Int64Index: 6894 entries, 0 to 6935
Data columns (total 20 columns):
 #   Column                Non-Null Count  Dtype         
---  ------                --------------  -----         
 0   box_office            5449 non-null   float64       
 1   budget                4712 non-null   float64       
 2   cinematographers      6249 non-null   object        
 3   country               6674 non-null   object        
 4   director              6894 non-null   object        
 5   distributor           6591 non-null   object        
 6   editors               6409 non-null   object        
 7   languages             6823 non-null   object        
 8   composers             6427 non-null   object        
 9   producers             6720 non-null   object        
 10  production_companies  5297 non-null   object        
 11  release_date          6869 non-null   datetime64[ns]
 12  duration              6892 non-null   float64       
 13  writers           

Unnamed: 0,box_office,budget,cinematographers,country,director,distributor,editors,languages,composers,producers,production_companies,release_date,duration,writers,stars,imdb_link,title,url,year,id
0,21400000.0,20000000.0,Oliver Wood,United States,Renny Harlin,20th Century Fox,Michael Tronick,English,"[Cliff Eidelman, Yello]","[Steve Perry, Joel Silver]",Silver Pictures,1990-07-11,102.0,"[David Arnott, James Cappe]","[Andrew Dice Clay, Wayne Newton, Priscilla Pre...",https://www.imdb.com/title/tt0098987/,The Adventures of Ford Fairlane,https://en.wikipedia.org/wiki/The_Adventures_o...,1990,tt0098987
1,2700000.0,6000000.0,Mark Plummer,United States,James Foley,Avenue Pictures,Howard E. Smith,English,Maurice Jarre,"[Ric Kidney, Robert Redlin]",Avenue Pictures,1990-05-17,114.0,"[James Foley, Robert Redlin]","[Jason Patric, Rachel Ward, Bruce Dern, George...",https://www.imdb.com/title/tt0098994/,"After Dark, My Sweet","https://en.wikipedia.org/wiki/After_Dark,_My_S...",1990,tt0098994


### Kaggle data

In [4]:
# Kaggle movie metadata (cleaned from notebook 1)
kmovies_df = pd.read_pickle(data_path + 'kaggle_movies.pkl')
print(kmovies_df.info())
kmovies_df.head(2)

<class 'pandas.core.frame.DataFrame'>
Int64Index: 45408 entries, 0 to 45465
Data columns (total 19 columns):
 #   Column                Non-Null Count  Dtype         
---  ------                --------------  -----         
 0   budget                45408 non-null  int64         
 1   genres                45408 non-null  object        
 2   id                    45408 non-null  int64         
 3   imdb_id               45407 non-null  object        
 4   original_language     45397 non-null  object        
 5   original_title        45408 non-null  object        
 6   overview              44456 non-null  object        
 7   popularity            45405 non-null  float64       
 8   poster_path           45027 non-null  object        
 9   production_companies  45405 non-null  object        
 10  production_countries  45405 non-null  object        
 11  release_date          45325 non-null  datetime64[ns]
 12  revenue               45405 non-null  float64       
 13  runtime         

Unnamed: 0,budget,genres,id,imdb_id,original_language,original_title,overview,popularity,poster_path,production_companies,production_countries,release_date,revenue,runtime,spoken_languages,status,title,vote_average,vote_count
0,30000000,"[{'id': 16, 'name': 'Animation'}, {'id': 35, '...",862,tt0114709,en,Toy Story,"Led by Woody, Andy's toys live happily in his ...",21.946943,/rhIRbceoE9lR4veEXuwCC2wARtG.jpg,"[{'name': 'Pixar Animation Studios', 'id': 3}]","[{'iso_3166_1': 'US', 'name': 'United States o...",1995-10-30,373554033.0,81.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Toy Story,7.7,5415.0
1,65000000,"[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...",8844,tt0113497,en,Jumanji,When siblings Judy and Peter discover an encha...,17.015539,/vzmL6fP7aPKNKPRTFnZmiUfciyV.jpg,"[{'name': 'TriStar Pictures', 'id': 559}, {'na...","[{'iso_3166_1': 'US', 'name': 'United States o...",1995-12-15,262797249.0,104.0,"[{'iso_639_1': 'en', 'name': 'English'}, {'iso...",Released,Jumanji,6.9,2413.0


In [None]:
# Kaggle ratings data
kratings_df = pd.read_csv(data_path + 'raw/ratings.csv')
print(kratings_df.info(show_counts=True))  # `null_counts` was renamed to `show_counts` in pandas 1.2
kratings_df.head(2)

### Join movie data from both sources

The Kaggle data has two ID columns, `id` and `imdb_id`. Only `imdb_id` is needed for the join, because the Wikipedia data stores the IMDb ID in its own `id` column, so we drop Kaggle's `id` before merging. After the merge, the Wikipedia `id` column duplicates `imdb_id`, so we drop it as well.

In [None]:
# Merge data
movies_df = pd.merge(wmovies_df, kmovies_df.drop('id', axis=1), 
                     left_on='id', right_on='imdb_id', 
                     how='inner', suffixes=['_wiki', '_kaggle']).drop('id', axis=1)
print(movies_df.info())
movies_df.head(2)
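
To illustrate how the merge treats overlapping column names, here is a minimal sketch on two toy frames. The frames `wiki_toy` and `kaggle_toy` and all of their values are invented for illustration. Columns present in both frames (other than the join keys) receive the `_wiki` and `_kaggle` suffixes. The `validate='one_to_one'` argument is an optional safeguard that is not used in the cell above; it raises an error if either key column contains duplicates.

```python
import pandas as pd

# Toy stand-ins for the two sources (values are invented for illustration)
wiki_toy = pd.DataFrame({
    'id': ['tt001', 'tt002', 'tt003'],
    'title': ['A', 'B', 'C'],
    'budget': [1.0, None, 3.0],
})
kaggle_toy = pd.DataFrame({
    'imdb_id': ['tt001', 'tt002', 'tt004'],
    'title': ['A', 'B (1999)', 'D'],
    'budget': [1, 0, 4],
})

# Inner join on the IMDb ID; overlapping column names get the suffixes
merged_toy = pd.merge(
    wiki_toy, kaggle_toy,
    left_on='id', right_on='imdb_id',
    how='inner', suffixes=('_wiki', '_kaggle'),
    validate='one_to_one',  # optional check: raises MergeError on duplicate keys
).drop('id', axis=1)

print(merged_toy.columns.tolist())
print(len(merged_toy))
```

Only the IDs found in both toy frames survive the inner join, which is why the merged movie data has fewer rows than either source.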

### Find redundant columns

Since the two dataframes being joined describe the same movies, several of their columns hold the same information. Keeping both columns of such a pair would be redundant, so we will examine each pair and decide what to do with it. In most cases, we will keep the column with the more consistent data. A table will keep track of the action chosen for each pair.

In [None]:
# List columns
movies_df.columns.sort_values()

In [None]:
# List of redundant column name pairs (wikipedia column, kaggle column)
redundant_cols = [
    ('box_office', 'revenue'),
    ('budget_wiki', 'budget_kaggle'),
    ('duration', 'runtime'),
    ('languages', 'original_language'),
    ('production_companies_wiki', 'production_companies_kaggle'),
    ('release_date_wiki', 'release_date_kaggle'),
    ('title_wiki', 'title_kaggle')
]

# Create a table to keep track of the action taken for each pair
redundant_cols = pd.DataFrame(redundant_cols, columns=['wikipedia', 'kaggle'])
redundant_cols['action'] = ''
redundant_cols
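
The pairs that share a base name and differ only by suffix can also be found programmatically. Pairs named differently in each source (such as `box_office` and `revenue`) still have to be identified by hand. A minimal sketch, assuming a hypothetical helper `suffixed_pairs` and a toy column index:

```python
import pandas as pd

def suffixed_pairs(columns, suffixes=('_wiki', '_kaggle')):
    """Hypothetical helper: return (left, right) column pairs that differ only by suffix."""
    left_sfx, right_sfx = suffixes
    cols = set(columns)
    pairs = []
    for col in sorted(cols):
        if col.endswith(left_sfx):
            partner = col[:-len(left_sfx)] + right_sfx
            if partner in cols:
                pairs.append((col, partner))
    return pairs

# Toy column index (a subset of the merged columns, for illustration)
toy_cols = pd.Index(['budget_wiki', 'budget_kaggle', 'title_wiki', 'title_kaggle',
                     'box_office', 'revenue'])
pairs = suffixed_pairs(toy_cols)
print(pairs)
```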

### Inspect title columns

In [None]:
# Extract title columns
title = movies_df[['title_wiki', 'title_kaggle']]

# Rows where titles are different between the 2 columns
print(title[title['title_wiki'] != title['title_kaggle']].shape[0])
title[title['title_wiki'] != title['title_kaggle']].head(4)

It looks like the Kaggle data has more complete `title`s. If it has no missing `title`s, or at least fewer than the Wikipedia data, we will drop the Wikipedia `title`s and keep the Kaggle `title`s.

In [None]:
# Check for null titles in the Kaggle column
title[(title['title_kaggle'] == '') | (title['title_kaggle'].isnull())].shape[0]

In [None]:
# Add action for title - drop the Wikipedia column
redundant_cols.loc[6, 'action'] = 'drop wiki col'
redundant_cols
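
The same checks (how many rows differ, how many values are missing) are repeated for each pair below, so they could be wrapped in a small function. The sketch below uses a hypothetical helper `compare_pair` on invented data. One detail worth noting: in pandas, a missing value never compares equal to another missing value, so a plain `!=` also counts rows that are null in both columns. The helper excludes those rows explicitly.

```python
import pandas as pd

def compare_pair(df, col_a, col_b):
    """Hypothetical helper: summarize how two columns holding the same information differ."""
    a, b = df[col_a], df[col_b]
    both_null = a.isnull() & b.isnull()
    return pd.Series({
        'rows': len(df),
        # Exclude rows that are null in both columns from the "different" count
        'different': int(((a != b) & ~both_null).sum()),
        f'{col_a}_missing': int((a.isnull() | (a == '')).sum()),
        f'{col_b}_missing': int((b.isnull() | (b == '')).sum()),
    })

# Invented titles for illustration
toy = pd.DataFrame({
    'title_wiki': ['A', 'B', None, ''],
    'title_kaggle': ['A', 'B (1999)', None, 'D'],
})
summary = compare_pair(toy, 'title_wiki', 'title_kaggle')
print(summary)
```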

### Inspect language columns

In [None]:
# Extract language columns
lang = movies_df[['languages', 'original_language']]

# Rows where languages are different between the 2 columns
print(lang[lang['languages'] != lang['original_language']].shape[0])
lang[lang['languages'] != lang['original_language']].head(4)

The Wikipedia `languages` column contains lists, so it would be easier to keep Kaggle's `original_language` column, assuming there are no missing or strange values.

In [None]:
# Check for null languages in the Kaggle column
print(lang[(lang['original_language'] == '') | (lang['original_language'].isnull())].shape[0])

# Check Kaggle column values
lang['original_language'].unique()

In [None]:
# Add action for language - drop the Wikipedia column
redundant_cols.loc[3, 'action'] = 'drop wiki col'
redundant_cols

### Inspect production company columns

In [None]:
# Extract production company columns
production = movies_df[['production_companies_wiki', 'production_companies_kaggle']]

# Rows where values are different between the 2 columns
print(production[production['production_companies_wiki'] != production['production_companies_kaggle']].shape[0])
production[production['production_companies_wiki'] != production['production_companies_kaggle']].head(4)

In [None]:
# Check Kaggle column values
production['production_companies_kaggle'].unique()

In [None]:
# Check for null values in the Kaggle column
production[(production['production_companies_kaggle'] == '') | (production['production_companies_kaggle'] \
    .isnull())].shape[0]

We will be keeping the Kaggle column because it's more descriptive than the Wikipedia values.

In [None]:
# Add action to production company - drop the Wikipedia column
redundant_cols.loc[4, 'action'] = 'drop wiki col'
redundant_cols

### Inspect budget columns

In [None]:
# Extract budget columns
budget = movies_df[['budget_wiki', 'budget_kaggle']]

# Scatterplot
budget.plot(kind='scatter', x='budget_wiki', y='budget_kaggle', figsize=(10, 4), grid=True)
plt.show()

In [None]:
# Number of 0 values
print('Number of 0s in Wikipedia `budget`:', (budget['budget_wiki'] == 0).sum())
print('Number of 0s in Kaggle `budget`:', (budget['budget_kaggle'] == 0).sum())

# Check null values
budget.isnull().sum()

The Wikipedia data has 1589 missing `budget` values, but the Kaggle data has 2350 `budget` values of 0, which are probably missing values. We will use `budget_kaggle` to fill any missing `budget_wiki` and then drop `budget_kaggle`.

In [None]:
# Add action to budget - fill Wikipedia's missing values with Kaggle's budget values, then drop the Kaggle column
redundant_cols.loc[1, 'action'] = 'fill wiki missing, drop kaggle col'
redundant_cols

### Inspect revenue columns

In [None]:
# Extract revenue columns
revenue = movies_df[['box_office', 'revenue']]

# Scatterplot
revenue.plot(kind='scatter', x='box_office', y='revenue', figsize=(10, 4), grid=True)
plt.show()

In [None]:
# Number of 0 values
print('Number of 0s in Wikipedia `box_office`:', (revenue['box_office'] == 0).sum())
print('Number of 0s in Kaggle `revenue`:', (revenue['revenue'] == 0).sum())

# Check null values
revenue.isnull().sum()

This is the same case as `budget` so we will take the same action for `revenue`.

In [None]:
# Add action to revenue - same as budget
redundant_cols.loc[0, 'action'] = 'fill wiki missing, drop kaggle col'
redundant_cols

### Inspect runtime columns

In [None]:
# Extract runtime columns
run = movies_df[['duration', 'runtime']]

# Scatterplot
run.plot(kind='scatter', x='duration', y='runtime', figsize=(10, 4), grid=True)
plt.show()

In [None]:
# Number of 0 values
print('Number of 0s in Wikipedia `duration`:', (run['duration'] == 0).sum())
print('Number of 0s in Kaggle `runtime`:', (run['runtime'] == 0).sum())

# Check null values
run.isnull().sum()

Even though the Wikipedia column (`duration`) has fewer missing or zero values, it also has some extreme outliers. We will therefore keep the Kaggle column (`runtime`), use `duration` only to fill in its zero values, and then drop `duration`.

In [None]:
# Add action to runtime - fill the zero values in the Kaggle column, then drop the Wikipedia column
redundant_cols.loc[2, 'action'] = 'fill kaggle missing, drop wiki col'
redundant_cols
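
The outliers were spotted by eye on the scatterplot. As a complement, they could also be listed with a simple rule. The sketch below uses the interquartile range (IQR) with a wide multiplier; both the hypothetical helper `iqr_outliers` and the choice of `k=3.0` are assumptions for illustration, not what this notebook used, and the durations are invented.

```python
import pandas as pd

def iqr_outliers(s, k=3.0):
    """Hypothetical helper: return values of s lying more than k * IQR outside the quartiles."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return s[(s < q1 - k * iqr) | (s > q3 + k * iqr)]

# Invented durations in minutes; the last value mimics an extreme outlier
durations = pd.Series([90.0, 95.0, 100.0, 105.0, 110.0, 120.0, 1440.0])
outliers = iqr_outliers(durations)
print(outliers)
```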

### Inspect release date columns

In [None]:
# Extract release date columns
release = movies_df[['release_date_wiki', 'release_date_kaggle']]

# Scatterplot
release.plot(x='release_date_wiki', y='release_date_kaggle', style='.', 
             figsize=(10, 4), grid=True, legend=False, ylabel='release_date_kaggle')
plt.xticks(rotation=0, ha='center')
plt.show()

There is a strange outlier with a `release_date` before 1960 in the Kaggle data but after 2000 in the Wikipedia data. Let's inspect this movie.

In [None]:
# Inspect outlier
release_outlier = movies_df.loc[(movies_df['release_date_wiki'] > '2000') & (movies_df['release_date_kaggle'] < '1960'),
                                ['title_wiki', 'title_kaggle', 'release_date_wiki', 'release_date_kaggle']]
release_outlier

In [None]:
# Check null values
release.isnull().sum()

It looks like two different movies shared an `imdb_id`, so they were merged into a single row. We will drop this row, and we will also drop the Wikipedia `release_date` column since it has some missing values.

In [None]:
# Add action to release date - drop outlier and drop the Wikipedia column
redundant_cols.loc[5, 'action'] = 'drop outlier, drop wiki col'
redundant_cols
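
Mismatched rows like this one can also be flagged without a plot by measuring the gap between the two dates. A minimal sketch on invented data; the 10-year threshold is an arbitrary assumption:

```python
import pandas as pd

# Invented rows for illustration; the second row mimics the mismatched movie
toy = pd.DataFrame({
    'title_wiki': ['A', 'B', 'C'],
    'release_date_wiki': pd.to_datetime(['1995-10-30', '2006-02-10', None]),
    'release_date_kaggle': pd.to_datetime(['1995-11-22', '1953-08-28', '2001-05-01']),
})

# Absolute gap between the two sources; NaT where either date is missing
gap = (toy['release_date_wiki'] - toy['release_date_kaggle']).abs()

# Flag rows whose dates disagree by more than roughly 10 years (arbitrary threshold)
suspect = toy[gap > pd.Timedelta(days=3650)]
print(suspect)
```

Rows with a missing date in either column are not flagged, because comparing `NaT` with a threshold evaluates to `False`.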

### Drop redundant columns

With an action defined for each pair of redundant columns in the table above, we can now carry them out.

In [None]:
# Drop `title_wiki`, `languages`, and `production_companies_wiki`
movies_df.drop(['title_wiki', 'languages', 'production_companies_wiki'], axis=1, inplace=True)

# Drop the `release_date` outlier then drop `release_date_wiki`
movies_df.drop(release_outlier.index, inplace=True)
movies_df.drop('release_date_wiki', axis=1, inplace=True)
movies_df.shape

In [None]:
def filla_dropb(cola, colb, data):
    
    """ Fill missing values in column a (keep) using values in column b (drop), then drop column b """
    
    # Copy dataframe
    df = data.copy()
    
    def filla(row):
        cond1 = pd.isnull(row[cola]) or row[cola] == 0
        cond2 = pd.notnull(row[colb]) and row[colb] != 0
        return row[colb] if cond1 and cond2 else row[cola]
            
    df[cola] = df.apply(filla, axis=1)
    return df.drop(colb, axis=1)
    

# Fill missing `box_office`, `budget_wiki`, and `runtime` values, then drop `revenue`, `budget_kaggle`, and `duration`
movies_df = filla_dropb('box_office', 'revenue', data=movies_df)
movies_df = filla_dropb('budget_wiki', 'budget_kaggle', data=movies_df)
movies_df = filla_dropb('runtime', 'duration', data=movies_df)
movies_df.info()
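
The row-wise `apply` in `filla_dropb` can be slow on larger frames. A vectorized variant built on `Series.mask` could look like the sketch below. The name `fill_a_drop_b` is hypothetical and the toy frame is invented; the function is intended to apply the same rule (fill when column a is null or 0 and column b is neither).

```python
import numpy as np
import pandas as pd

def fill_a_drop_b(col_a, col_b, data):
    """Hypothetical vectorized variant: fill null/zero values in col_a from col_b, then drop col_b."""
    df = data.copy()
    a_missing = df[col_a].isnull() | (df[col_a] == 0)
    b_usable = df[col_b].notnull() & (df[col_b] != 0)
    # mask() replaces the values where the condition is True
    df[col_a] = df[col_a].mask(a_missing & b_usable, df[col_b])
    return df.drop(col_b, axis=1)

# Invented budgets for illustration
toy = pd.DataFrame({
    'budget_wiki': [10.0, np.nan, 0.0, np.nan],
    'budget_kaggle': [12, 20, 30, 0],
})
filled = fill_a_drop_b('budget_wiki', 'budget_kaggle', toy)
print(filled)
```

The last toy row stays missing because neither column has a usable value for it.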

### Rename and reorder columns

In [32]:
movies_drop_cols2_df.columns

Index(['Based on', 'Country', 'Cinematographer(s)', 'Director', 'Distributor',
       'Editor(s)', 'Composer(s)', 'Producer(s)', 'Writer(s)', 'Stars',
       'IMDB link', 'URL', 'Year', 'budget', 'genres', 'id', 'imdb_id',
       'original_language', 'original_title', 'overview', 'popularity',
       'poster_path', 'production_companies', 'production_countries',
       'release_date', 'revenue', 'runtime', 'spoken_languages', 'status',
       'tagline', 'title', 'vote_average', 'vote_count'],
      dtype='object')

In [33]:
# Column order
col_order = ['id', 'imdb_id', 'IMDB link', 'URL', 'poster_path', 'title', 'original_title', 'tagline', 'Based on', 'overview',
             'release_date', 'Year', 'runtime', 'budget', 'revenue', 'popularity', 'vote_average', 'vote_count',
             'genres', 'status', 'Country', 'original_language', 'spoken_languages',
             'Writer(s)', 'Producer(s)', 'Director', 'Stars', 'Cinematographer(s)', 'Composer(s)', 'Editor(s)',
             'production_companies', 'production_countries', 'Distributor']

# New column names
col_names = ['kaggle_id', 'imdb_id', 'imdb_link', 'wikipedia_url', 'poster_path', 'title', 'original_title', 'tagline', 
             'based_on', 'overview', 'release_date', 'year', 'runtime', 'budget', 'revenue', 'popularity', 
             'vote_average', 'vote_count', 'genres', 'status', 'country', 'original_language', 'spoken_languages', 
             'writers', 'producers', 'director', 'stars', 'cinematographers', 'composers', 'editors', 
             'production_companies', 'production_countries', 'distributor']

# Rename and order columns
movies_rename_cols = movies_drop_cols2_df.rename({old: new for old, new in zip(col_order, col_names) if old != new}, axis=1)
movies_rename_cols = movies_rename_cols[col_names]
movies_rename_cols.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 5982 entries, 0 to 5982
Data columns (total 33 columns):
 #   Column                Non-Null Count  Dtype         
---  ------                --------------  -----         
 0   kaggle_id             5982 non-null   int64         
 1   imdb_id               5982 non-null   object        
 2   imdb_link             5982 non-null   object        
 3   wikipedia_url         5982 non-null   object        
 4   poster_path           5981 non-null   object        
 5   title                 5982 non-null   object        
 6   original_title        5982 non-null   object        
 7   tagline               4880 non-null   object        
 8   based_on              1966 non-null   object        
 9   overview              5977 non-null   object        
 10  release_date          5982 non-null   datetime64[ns]
 11  year                  5982 non-null   int64         
 12  runtime               5982 non-null   float64       
 13  budget            
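
Keeping two parallel lists (`col_order` and `col_names`) in sync by position is easy to get wrong. One alternative is a single mapping from old to new names whose order also defines the final column order. A minimal sketch on an invented one-row frame; `rename_map` is a hypothetical name:

```python
import pandas as pd

# Invented one-row frame for illustration
toy = pd.DataFrame({'Year': [1990], 'id': [9548], 'URL': ['https://example.org/wiki/Example']})

# Old name -> new name; the dict order also defines the final column order
rename_map = {'id': 'kaggle_id', 'URL': 'wikipedia_url', 'Year': 'year'}

# Fail early if a column listed in the mapping does not exist
missing = set(rename_map) - set(toy.columns)
assert not missing, f'columns not found: {missing}'

renamed = toy.rename(columns=rename_map)[list(rename_map.values())]
print(renamed.columns.tolist())
```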

In [34]:
# Copy df
movies_df = movies_rename_cols.copy()
movies_df.head(2)

Unnamed: 0,kaggle_id,imdb_id,imdb_link,wikipedia_url,poster_path,title,original_title,tagline,based_on,overview,...,writers,producers,director,stars,cinematographers,composers,editors,production_companies,production_countries,distributor
0,9548,tt0098987,https://www.imdb.com/title/tt0098987/,https://en.wikipedia.org/wiki/The_Adventures_o...,/yLeX2QLkHeRlYQRcbU8BKgMaYYD.jpg,The Adventures of Ford Fairlane,The Adventures of Ford Fairlane,Kojak. Columbo. Dirty Harry. Wimps.,"[Characters, by Rex Weiner]","Ford ""Mr. Rock n' Roll Detective"" Fairlane is ...",...,"[David Arnott, James Cappe]","[Steve Perry, Joel Silver]",Renny Harlin,"[Andrew Dice Clay, Wayne Newton, Priscilla Pre...",Oliver Wood,"[Cliff Eidelman, Yello]",Michael Tronick,[{'name': 'Twentieth Century Fox Film Corporat...,"[{'iso_3166_1': 'US', 'name': 'United States o...",20th Century Fox
1,25501,tt0098994,https://www.imdb.com/title/tt0098994/,"https://en.wikipedia.org/wiki/After_Dark,_My_S...",/3hjcHNtWn9T6jVGXgNXyCsMWBdj.jpg,"After Dark, My Sweet","After Dark, My Sweet",All they risked was everything.,"[the novel, After Dark, My Sweet, by, Jim Thom...",The intriguing relationship between three desp...,...,"[James Foley, Robert Redlin]","[Ric Kidney, Robert Redlin]",James Foley,"[Jason Patric, Rachel Ward, Bruce Dern, George...",Mark Plummer,Maurice Jarre,Howard E. Smith,"[{'name': 'Avenue Pictures Productions', 'id':...","[{'iso_3166_1': 'US', 'name': 'United States o...",Avenue Pictures


### Save data

In [35]:
movies_df.to_pickle(data_path + 'movies.pkl')
pd.read_pickle(data_path + 'movies.pkl').head(2)

Unnamed: 0,kaggle_id,imdb_id,imdb_link,wikipedia_url,poster_path,title,original_title,tagline,based_on,overview,...,writers,producers,director,stars,cinematographers,composers,editors,production_companies,production_countries,distributor
0,9548,tt0098987,https://www.imdb.com/title/tt0098987/,https://en.wikipedia.org/wiki/The_Adventures_o...,/yLeX2QLkHeRlYQRcbU8BKgMaYYD.jpg,The Adventures of Ford Fairlane,The Adventures of Ford Fairlane,Kojak. Columbo. Dirty Harry. Wimps.,"[Characters, by Rex Weiner]","Ford ""Mr. Rock n' Roll Detective"" Fairlane is ...",...,"[David Arnott, James Cappe]","[Steve Perry, Joel Silver]",Renny Harlin,"[Andrew Dice Clay, Wayne Newton, Priscilla Pre...",Oliver Wood,"[Cliff Eidelman, Yello]",Michael Tronick,[{'name': 'Twentieth Century Fox Film Corporat...,"[{'iso_3166_1': 'US', 'name': 'United States o...",20th Century Fox
1,25501,tt0098994,https://www.imdb.com/title/tt0098994/,"https://en.wikipedia.org/wiki/After_Dark,_My_S...",/3hjcHNtWn9T6jVGXgNXyCsMWBdj.jpg,"After Dark, My Sweet","After Dark, My Sweet",All they risked was everything.,"[the novel, After Dark, My Sweet, by, Jim Thom...",The intriguing relationship between three desp...,...,"[James Foley, Robert Redlin]","[Ric Kidney, Robert Redlin]",James Foley,"[Jason Patric, Rachel Ward, Bruce Dern, George...",Mark Plummer,Maurice Jarre,Howard E. Smith,"[{'name': 'Avenue Pictures Productions', 'id':...","[{'iso_3166_1': 'US', 'name': 'United States o...",Avenue Pictures
