# Movie Dataset Creation (DS Project) Tutorial

This notebook follows [Keith Galli's YouTube tutorial](https://www.youtube.com/watch?v=Ewgy-G9cmbg).

IMPORTANT NOTE: Websites publish rules for scraping spiders. Most sites are fine with scraping for personal purposes, but commercial use calls for more care. It's worth reading the rules either way.

To see a site's rules, append `/robots.txt` to its domain (for example, `https://en.wikipedia.org/robots.txt`).
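
The rules check described above can also be done in code. The following is a minimal sketch using the standard library's `urllib.robotparser`. The rules in `SAMPLE_ROBOTS_TXT`, the `is_allowed` helper, the `my-scraper` user agent and the `example.org` URLs are all made up for illustration and are not Wikipedia's actual rules.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt rules, inlined so the sketch runs without network access.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Allow: /wiki/
"""

def is_allowed(robots_txt, user_agent, url):
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)

print(is_allowed(SAMPLE_ROBOTS_TXT, 'my-scraper', 'https://example.org/wiki/Toy_Story_3'))
print(is_allowed(SAMPLE_ROBOTS_TXT, 'my-scraper', 'https://example.org/private/notes'))
```

For a live site, `RobotFileParser.set_url()` followed by `read()` downloads the real file instead of parsing an inline string.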


## Task 1: Create Spider

In [1]:
# import libraries
import requests
from bs4 import BeautifulSoup as bs
import re
import json
from datetime import datetime
import pickle
import pandas as pd

In [2]:
# get website as BeautifulSoup object
r = requests.get('https://en.wikipedia.org/wiki/Toy_Story_3')
website = bs(r.content)

In [3]:
# find infobox table
table = website.body.find('table', attrs={'class': 'infobox vevent'})

In [4]:
# create table dict
ts3_dict = {'Title': 0}

In [5]:
# replace title
ts3_dict['Title'] = table.find('tr').string

In [6]:
# select remaining infobox elements
rows = table.tbody.select('tr')[2:]

In [7]:
# move infobox items into table dict
for row in rows:
    k = row.find('th').get_text(' ', strip=True)
    if row.select('ul'):
        l = row.select('li')
        v = [li.get_text(' ', strip=True).replace('\xa0', ' ') for li in l]
    else:
        v = row.find('td').get_text(' ', strip=True).replace('\xa0', ' ')
    ts3_dict[k] = v

In [8]:
# completed infobox scrape
ts3_dict

{'Title': 'Toy Story 3',
 'Directed by': 'Lee Unkrich',
 'Screenplay by': 'Michael Arndt',
 'Story by': ['John Lasseter', 'Andrew Stanton', 'Lee Unkrich'],
 'Produced by': 'Darla K. Anderson',
 'Starring': ['Tom Hanks',
  'Tim Allen',
  'Joan Cusack',
  'Don Rickles',
  'Wallace Shawn',
  'John Ratzenberger',
  'Estelle Harris',
  'Ned Beatty',
  'Michael Keaton',
  'Jodi Benson',
  'John Morris'],
 'Cinematography': ['Jeremy Lasky', 'Kim White'],
 'Edited by': 'Ken Schretzmann',
 'Music by': 'Randy Newman',
 'Production companies': ['Walt Disney Pictures', 'Pixar Animation Studios'],
 'Distributed by': 'Walt Disney Studios Motion Pictures',
 'Release dates': ['June 12, 2010 ( 2010-06-12 ) ( Taormina Film Fest )',
  'June 18, 2010 ( 2010-06-18 ) (United States)'],
 'Running time': '103 minutes [1]',
 'Country': 'United States',
 'Language': 'English',
 'Budget': '$200 million [1]',
 'Box office': '$1.067 billion [1]'}

In [9]:
# another solution to above

info_rows = table.select('tr')

def get_content_value(row_data):
    if row_data.find('li'):
        return [li.get_text(' ', strip=True).replace('\xa0', ' ') for li in row_data.find_all('li')]
    else:
        return row_data.get_text(' ', strip=True).replace('\xa0', ' ')

movie_info = {}

for index, row in enumerate(info_rows):
    if index == 0:
        movie_info['Title'] = row.find('th').get_text(' ', strip=True)
    elif index == 1:
        continue
    else:
        content_key = row.find('th').get_text(' ', strip=True)
        content_value = get_content_value(row.find('td'))
        movie_info[content_key] = content_value

movie_info

{'Title': 'Toy Story 3',
 'Directed by': 'Lee Unkrich',
 'Screenplay by': 'Michael Arndt',
 'Story by': ['John Lasseter', 'Andrew Stanton', 'Lee Unkrich'],
 'Produced by': 'Darla K. Anderson',
 'Starring': ['Tom Hanks',
  'Tim Allen',
  'Joan Cusack',
  'Don Rickles',
  'Wallace Shawn',
  'John Ratzenberger',
  'Estelle Harris',
  'Ned Beatty',
  'Michael Keaton',
  'Jodi Benson',
  'John Morris'],
 'Cinematography': ['Jeremy Lasky', 'Kim White'],
 'Edited by': 'Ken Schretzmann',
 'Music by': 'Randy Newman',
 'Production companies': ['Walt Disney Pictures', 'Pixar Animation Studios'],
 'Distributed by': 'Walt Disney Studios Motion Pictures',
 'Release dates': ['June 12, 2010 ( 2010-06-12 ) ( Taormina Film Fest )',
  'June 18, 2010 ( 2010-06-18 ) (United States)'],
 'Running time': '103 minutes [1]',
 'Country': 'United States',
 'Language': 'English',
 'Budget': '$200 million [1]',
 'Box office': '$1.067 billion [1]'}

In [10]:
# get value function for key-value pairs
def get_content_value(row_data):
    if row_data.find('li'):
        return [li.get_text(' ', strip=True).replace('\xa0', ' ') for li in row_data.find_all('li')]
    else:
        return row_data.get_text(' ', strip=True).replace('\xa0', ' ')

# moving scrape into a function
def wiki_infobox_scrape(wiki_href):
    # get page as BeautifulSoup
    r = requests.get('https://en.wikipedia.org/' + wiki_href)
    page = bs(r.content)
    # find infobox table
    table = page.body.find('table', attrs={'class': 'infobox vevent'})
    # get table rows
    info_rows = table.select('tr')
    # initialize dictionary
    movie_info = {}
    # populate dictionary
    for index, row in enumerate(info_rows):
        if index == 0:
            movie_info['Title'] = row.find('th').get_text(' ', strip=True)
        elif index == 1:
            continue
        else:
            content_key = row.find('th').get_text(' ', strip=True)
            content_value = get_content_value(row.find('td'))
            movie_info[content_key] = content_value

    return movie_info

In [11]:
# quick function test
dict_test = wiki_infobox_scrape('wiki/Toy_Story_3')
dict_test

{'Title': 'Toy Story 3',
 'Directed by': 'Lee Unkrich',
 'Screenplay by': 'Michael Arndt',
 'Story by': ['John Lasseter', 'Andrew Stanton', 'Lee Unkrich'],
 'Produced by': 'Darla K. Anderson',
 'Starring': ['Tom Hanks',
  'Tim Allen',
  'Joan Cusack',
  'Don Rickles',
  'Wallace Shawn',
  'John Ratzenberger',
  'Estelle Harris',
  'Ned Beatty',
  'Michael Keaton',
  'Jodi Benson',
  'John Morris'],
 'Cinematography': ['Jeremy Lasky', 'Kim White'],
 'Edited by': 'Ken Schretzmann',
 'Music by': 'Randy Newman',
 'Production companies': ['Walt Disney Pictures', 'Pixar Animation Studios'],
 'Distributed by': 'Walt Disney Studios Motion Pictures',
 'Release dates': ['June 12, 2010 ( 2010-06-12 ) ( Taormina Film Fest )',
  'June 18, 2010 ( 2010-06-18 ) (United States)'],
 'Running time': '103 minutes [1]',
 'Country': 'United States',
 'Language': 'English',
 'Budget': '$200 million [1]',
 'Box office': '$1.067 billion [1]'}
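
The scrape function builds its URL by string concatenation, so an href with a leading slash (such as `/wiki/Dumbo`) produces a double slash after the domain. The requests evidently still worked, but a small sketch with `urllib.parse.urljoin` shows one way to normalise both forms. `BASE_URL` and `build_wiki_url` are hypothetical names, not part of the tutorial.

```python
from urllib.parse import urljoin

# Assumed base URL; the notebook hard-codes the same domain inside its function.
BASE_URL = 'https://en.wikipedia.org/'

def build_wiki_url(wiki_href):
    """Join a relative wiki href to the base URL without doubling slashes."""
    return urljoin(BASE_URL, wiki_href)

print(build_wiki_url('wiki/Toy_Story_3'))
print(build_wiki_url('/wiki/Toy_Story_3'))
```

Both calls resolve to the same address, so the helper works for the hand-typed test href as well as for hrefs read from a page.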

## Task 2: Run spider on all Disney films

In [12]:
# get BeautifulSoup object
movie_scrape = bs(requests.get('https://en.wikipedia.org/wiki/List_of_Walt_Disney_Pictures_films').content)

In [13]:
# get tables of movies
tables = movie_scrape.find_all('table', attrs={'class': re.compile('wikitable')})

In [14]:
# relative paths
hrefs = []
# movie titles
titles = []
# movies without hrefs
ref_fails = []

# gets hrefs and titles for each movie (or adds movie to ref_fails)
for table in tables:
    for row in table.select('tr')[1:]:
        td = row.findChild()
        try:
            movie_href = td.a['href']
            movie_title = td.a['title']

            hrefs.append(movie_href)
            titles.append(movie_title)
        except Exception as e:
            ref_fails.append(td.get_text())

In [15]:
# see successful hrefs
print(len(hrefs), len(titles), len(ref_fails))
hrefs[:20]

557 557 19


['/wiki/Snow_White_and_the_Seven_Dwarfs_(1937_film)',
 '/wiki/Pinocchio_(1940_film)',
 '/wiki/Fantasia_(1940_film)',
 '/wiki/The_Reluctant_Dragon_(1941_film)',
 '/wiki/Dumbo',
 '/wiki/Bambi',
 '/wiki/Saludos_Amigos',
 '/wiki/Victory_Through_Air_Power_(film)',
 '/wiki/The_Three_Caballeros',
 '/wiki/Make_Mine_Music',
 '/wiki/Song_of_the_South',
 '/wiki/Fun_and_Fancy_Free',
 '/wiki/Melody_Time',
 '/wiki/So_Dear_to_My_Heart',
 '/wiki/The_Adventures_of_Ichabod_and_Mr._Toad',
 '/wiki/Cinderella_(1950_film)',
 '/wiki/Treasure_Island_(1950_film)',
 '/wiki/Alice_in_Wonderland_(1951_film)',
 '/wiki/The_Story_of_Robin_Hood_(film)',
 '/wiki/Peter_Pan_(1953_film)']

In [16]:
# show reference failures
ref_fails

['Trail of the Panda\n',
 'Growing Up Wild\n',
 'Expedition China\n',
 'Wish\n',
 'Elio\n',
 '29 Dates ‡\n',
 'Aloha Rodeo ‡\n',
 'Knights\n',
 'Merlin\n',
 'Penelope\n',
 'Sadé\n',
 'Society of Explorers and Adventurers\n',
 'Song for a Whale ‡\n',
 'Spooked ‡\n',
 'Untitled Josie Trinidad film\n',
 'Untitled Kristen Lester film\n',
 'Untitled Marc Smith film\n',
 "World's Best ‡\n",
 "Wouldn't It Be Nice ‡\n"]

None of the reference failures contain a link.

In [17]:
# COMMENTED OUT - ACTUAL SCRAPE (REQUESTS) PROCESSED IN CLEANED CELLS BELOW

# # list of movie infobox dictionaries
# movie_infobox_list = []
# # list of hrefs for failed scrapes
# scrape_failure = {}

# # moves all movie infobox scrapes into movie_infobox_list
# for href in hrefs:
#     try:
#         movie_infobox_list.append(wiki_infobox_scrape(href))
#     except Exception as e:
#         scrape_failure[href] = e

In [18]:
# failed scrapes
# scrape_failure

In [19]:
# saves converted json into local variable
# movie_json = json.dumps(movie_infobox_list)

In [20]:
# for saving to repo
def save_data(file_title, data):
    with open(file_title, 'w', encoding='utf-8') as f:
        json.dump(data, f, ensure_ascii=False, indent=2)

In [21]:
# for loading from repo
def load_data(file_title):
    with open(file_title, encoding='utf-8') as f:
        return json.load(f)

In [22]:
# saves data to local repo
# save_data('disney_movies.json', movie_infobox_list)

In [23]:
# loads data from local repo
# load_data('disney_movies.json')
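
A quick round trip shows what the two helpers above do. This sketch repeats their definitions so it runs on its own, and writes to a temporary directory rather than the repo. The sample record and the file name `sample_movies.json` are made up for illustration.

```python
import json
import os
import tempfile

def save_data(file_title, data):
    with open(file_title, 'w', encoding='utf-8') as f:
        json.dump(data, f, ensure_ascii=False, indent=2)

def load_data(file_title):
    with open(file_title, encoding='utf-8') as f:
        return json.load(f)

# Hypothetical record with a non-ASCII title and a list value.
sample = [{'Title': 'Sadé', 'Starring': ['Actor A', 'Actor B']}]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'sample_movies.json')
    save_data(path, sample)
    with open(path, encoding='utf-8') as f:
        raw_text = f.read()
    restored = load_data(path)

print(restored == sample)
print('Sadé' in raw_text)
```

Because of `ensure_ascii=False`, the accented character is written as-is instead of as a `\u` escape, which keeps the saved file readable.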

## Task 3: Cleaning data

- Clean up references (subscript boxes like [1])
- Convert running time to integer
- Convert dates to datetime
- Split up long strings
- Convert budget & box office to numbers

Clean up references and split strings in scraping function

In [24]:
# get value function for key-value pairs
def get_content_value(row_data):
    if row_data.find('li'):
        return [li.get_text(' ', strip=True).replace('\xa0', ' ') for li in row_data.find_all('li')]
    elif row_data.find('br'):
        return [text for text in row_data.stripped_strings]
    else:
        return row_data.get_text(' ', strip=True).replace('\xa0', ' ')

# deletes subscript and span tags
def clean_tags(soup):
    for tag in soup.find_all(['sup', 'span']):
        tag.decompose()

# moving scrape into a function
def wiki_infobox_scrape(wiki_href):
    # get page as BeautifulSoup
    r = requests.get('https://en.wikipedia.org/' + wiki_href)
    page = bs(r.content)
    # cleans tags
    clean_tags(page)
    # find infobox table
    table = page.body.find('table', attrs={'class': 'infobox vevent'})
    # get table rows
    info_rows = table.select('tr')
    # initialize dictionary
    movie_info = {}
    # populate dictionary
    for index, row in enumerate(info_rows):
        if index == 0:
            movie_info['Title'] = row.find('th').get_text(' ', strip=True)
        else:
            header = row.find('th')
            if header:
                content_key = row.find('th').get_text(' ', strip=True)
                content_value = get_content_value(row.find('td'))
                movie_info[content_key] = content_value

    return movie_info

In [25]:
wiki_infobox_scrape('wiki/Toy_Story_3')

{'Title': 'Toy Story 3',
 'Directed by': 'Lee Unkrich',
 'Screenplay by': 'Michael Arndt',
 'Story by': ['John Lasseter', 'Andrew Stanton', 'Lee Unkrich'],
 'Produced by': 'Darla K. Anderson',
 'Starring': ['Tom Hanks',
  'Tim Allen',
  'Joan Cusack',
  'Don Rickles',
  'Wallace Shawn',
  'John Ratzenberger',
  'Estelle Harris',
  'Ned Beatty',
  'Michael Keaton',
  'Jodi Benson',
  'John Morris'],
 'Cinematography': ['Jeremy Lasky', 'Kim White'],
 'Edited by': 'Ken Schretzmann',
 'Music by': 'Randy Newman',
 'Production companies': ['Walt Disney Pictures', 'Pixar Animation Studios'],
 'Distributed by': ['Walt Disney Studios', 'Motion Pictures'],
 'Release dates': ['June 12, 2010 ( Taormina Film Fest )',
  'June 18, 2010 (United States)'],
 'Running time': '103 minutes',
 'Country': 'United States',
 'Language': 'English',
 'Budget': '$200 million',
 'Box office': '$1.067 billion'}
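
The cleanup above works at the HTML level by deleting tags before text is extracted. As an alternative sketch (not the tutorial's method), reference markers like `[1]` can also be stripped from already-scraped strings with a regular expression. `REFERENCE_RE` and `strip_references` are hypothetical names.

```python
import re

# Matches a bracketed number such as '[1]' plus any whitespace before it.
REFERENCE_RE = re.compile(r'\s*\[\d+\]')

def strip_references(value):
    """Remove '[n]' markers from a string or from each string in a list."""
    if isinstance(value, list):
        return [strip_references(item) for item in value]
    return REFERENCE_RE.sub('', value).strip()

print(strip_references('103 minutes [1]'))
print(strip_references(['$200 million [1]', '$1.067 billion [1]']))
```

This only handles numeric markers; it leaves other tag content (such as the hidden ISO dates in the release rows) untouched, which is one reason the tag-based approach is used in the notebook.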

In [26]:
# list of movie infobox dictionaries
movie_infobox_list = []
# list of hrefs for failed scrapes
scrape_failure = {}

# moves all movie infobox scrapes into movie_infobox_list
for href in hrefs:
    try:
        movie_infobox_list.append(wiki_infobox_scrape(href))
    except Exception as e:
        scrape_failure[href] = e
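
The loop above sends several hundred requests back to back. In keeping with the scraping-rules note at the top, here is a sketch of the same loop with a pause between requests. `scrape_all`, `delay_seconds` and the stand-in `fake_scrape` are hypothetical; the one-second default is an assumption, not a documented limit.

```python
import time

def scrape_all(hrefs, scrape_fn, delay_seconds=1.0):
    """Run scrape_fn on each href, pausing between calls and collecting failures."""
    results = []
    failures = {}
    for index, href in enumerate(hrefs):
        # Pause before every request except the first.
        if index > 0 and delay_seconds > 0:
            time.sleep(delay_seconds)
        try:
            results.append(scrape_fn(href))
        except Exception as error:
            failures[href] = error
    return results, failures

# Stand-in scraper so the sketch runs offline.
def fake_scrape(href):
    if href.endswith('Chris_Paul'):
        raise AttributeError('no infobox found')
    return {'Title': href.rsplit('/', 1)[-1]}

results, failures = scrape_all(['/wiki/Dumbo', '/wiki/Chris_Paul'], fake_scrape, delay_seconds=0)
print(results)
print(failures)
```

In the notebook this would be called as `scrape_all(hrefs, wiki_infobox_scrape)`.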

In [27]:
# failed scrapes
scrape_failure

{'/wiki/Zorro_(1957_TV_series)#Theatrical': AttributeError("'NoneType' object has no attribute 'find'"),
 '/wiki/The_Beatles:_Get_Back#The_Beatles:_Get_Back_–_The_Rooftop_Concert': AttributeError("'NoneType' object has no attribute 'find'"),
 '/wiki/Elemental_(2023_film)': AttributeError("'NoneType' object has no attribute 'select'"),
 '/wiki/Chris_Paul': AttributeError("'NoneType' object has no attribute 'select'"),
 '/wiki/All_Night_Long_(All_Night)': AttributeError("'NoneType' object has no attribute 'find'"),
 '/wiki/Big_Thunder_Mountain_Railroad': AttributeError("'NoneType' object has no attribute 'select'"),
 '/wiki/Keeper_of_the_Lost_Cities#Film_adaptation': AttributeError("'NoneType' object has no attribute 'select'"),
 '/wiki/Jim_Henson#Legacy': AttributeError("'NoneType' object has no attribute 'select'"),
 '/wiki/One_Thousand_and_One_Nights': AttributeError("'NoneType' object has no attribute 'select'"),
 '/wiki/Shrunk_(film)': AttributeError("'NoneType' object has no attrib
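
Several of the failed hrefs above contain a `#`, meaning they point to a section of another page rather than to a film article. A sketch for separating these before scraping, using `urllib.parse.urlsplit`; the helper name `split_by_fragment` is made up, and treating every fragment link as a non-film page is an assumption.

```python
from urllib.parse import urlsplit

def split_by_fragment(hrefs):
    """Return (plain hrefs, hrefs that contain a '#fragment')."""
    plain = []
    with_fragment = []
    for href in hrefs:
        if urlsplit(href).fragment:
            with_fragment.append(href)
        else:
            plain.append(href)
    return plain, with_fragment

sample_hrefs = [
    '/wiki/Dumbo',
    '/wiki/Zorro_(1957_TV_series)#Theatrical',
    '/wiki/Jim_Henson#Legacy',
]
plain, with_fragment = split_by_fragment(sample_hrefs)
print(plain)
print(with_fragment)
```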

In [28]:
# saves converted json into local variable
movie_json = json.dumps(movie_infobox_list)

Converting running time to integer

In [29]:
# get a movie entry
movie_infobox_list[-10]

{'Title': 'Aladdin',
 'Directed by': 'Guy Ritchie',
 'Screenplay by': ['John August', 'Guy Ritchie'],
 'Based on': ["Disney 's Aladdin by Ron Clements John Musker Ted Elliott Terry Rossio",
  'Ron Clements',
  'John Musker',
  'Ted Elliott',
  'Terry Rossio',
  'Aladdin and the Magic Lamp which is associated with One Thousand and One Nights'],
 'Produced by': ['Dan Lin', 'Jonathan Eirich'],
 'Starring': ['Will Smith',
  'Mena Massoud',
  'Naomi Scott',
  'Marwan Kenzari',
  'Navid Negahban',
  'Nasim Pedrad',
  'Billy Magnussen'],
 'Cinematography': 'Alan Stewart',
 'Edited by': 'James Herbert',
 'Music by': 'Alan Menken',
 'Production companies': ['Walt Disney Pictures', 'Rideback'],
 'Distributed by': ['Walt Disney Studios', 'Motion Pictures'],
 'Release dates': ['May 8, 2019 ( Grand Rex )',
  'May 24, 2019 (United States)'],
 'Running time': '128 minutes',
 'Country': 'United States',
 'Language': 'English',
 'Budget': '$183 million',
 'Box office': ''}

In [30]:
# get running times to look for edge cases
[movie.get('Running time', 'N/A') for movie in movie_infobox_list][:20]

['83 minutes',
 '88 minutes',
 '126 minutes',
 '74 minutes',
 '64 minutes',
 '70 minutes',
 '42 minutes',
 '65 min',
 '71 minutes',
 '75 minutes',
 '94 minutes',
 '73 minutes',
 '75 minutes',
 '82 minutes',
 '68 minutes',
 '74 minutes',
 '96 minutes',
 '75 minutes',
 '84 minutes',
 '77 minutes']

In [31]:
# runtime conversion function
def minutes_to_integer(running_time):
    if running_time == 'N/A':
        return None
    if isinstance(running_time, list):
        entry = running_time[0]
    else:
        entry = running_time
    value = int(entry.split(' ')[0])
    return value
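
The function above assumes the first space-separated token is a number, so a value such as an empty string would raise a `ValueError`. A more tolerant sketch, using a regular expression to pull out the first run of digits; `parse_minutes` is a hypothetical name and the inputs are modelled on the sample output above.

```python
import re

def parse_minutes(running_time):
    """Return the first integer found in a running-time entry, or None."""
    if isinstance(running_time, list):
        running_time = running_time[0] if running_time else None
    if not running_time:
        return None
    match = re.search(r'\d+', running_time)
    return int(match.group()) if match else None

for entry in ['83 minutes', '65 min', ['128 minutes', '130 minutes'], 'N/A', '']:
    print(repr(entry), parse_minutes(entry))
```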

In [32]:
# appending conversions to entries
for movie in movie_infobox_list:
    movie['Running time (int)'] = minutes_to_integer(movie.get('Running time', 'N/A'))

In [33]:
# testing conversion
movie_infobox_list[-10]

{'Title': 'Aladdin',
 'Directed by': 'Guy Ritchie',
 'Screenplay by': ['John August', 'Guy Ritchie'],
 'Based on': ["Disney 's Aladdin by Ron Clements John Musker Ted Elliott Terry Rossio",
  'Ron Clements',
  'John Musker',
  'Ted Elliott',
  'Terry Rossio',
  'Aladdin and the Magic Lamp which is associated with One Thousand and One Nights'],
 'Produced by': ['Dan Lin', 'Jonathan Eirich'],
 'Starring': ['Will Smith',
  'Mena Massoud',
  'Naomi Scott',
  'Marwan Kenzari',
  'Navid Negahban',
  'Nasim Pedrad',
  'Billy Magnussen'],
 'Cinematography': 'Alan Stewart',
 'Edited by': 'James Herbert',
 'Music by': 'Alan Menken',
 'Production companies': ['Walt Disney Pictures', 'Rideback'],
 'Distributed by': ['Walt Disney Studios', 'Motion Pictures'],
 'Release dates': ['May 8, 2019 ( Grand Rex )',
  'May 24, 2019 (United States)'],
 'Running time': '128 minutes',
 'Country': 'United States',
 'Language': 'English',
 'Budget': '$183 million',
 'Box office': '',
 'Running time (int)': 128}

Convert Budget & Box office

In [34]:
# get movie budgets for viewing edge cases
movie_budgets = [movie.get('Budget', 'N/A') for movie in movie_infobox_list]
movie_budgets[:20]

['$1.49 million',
 '$2.6 million',
 '$2.28 million',
 '$600,000',
 '$950,000',
 '$858,000',
 'N/A',
 '$788,000',
 'N/A',
 '$1.35 million',
 '$2.125 million',
 'N/A',
 '$1.5 million',
 '$1.5 million',
 'N/A',
 '$2.2 million',
 '$1,800,000',
 '$3 million',
 'N/A',
 '$4 million']

In [35]:
# regular expressions for money strings
amounts = r"thousand|million|billion"
number = r"\d+(,\d{3})*\.*\d*"
value_re = rf"\${number}"
word_re = rf"\${number}(-|\sto\s)?({number})?\s({amounts})"

# converts magnitude words ('thousand', 'million', 'billion') to multipliers
def word_to_value(word):
    value_dict = {'thousand': 1000, "million": 1000000, "billion": 1000000000}
    return value_dict[word]

# gets numerical value from entries like '$10 million'
def parse_word_syntax(string):
    value_string = re.search(number, string).group()
    value = float(value_string.replace(',', ''))
    word = re.search(amounts, string, flags=re.I).group().lower()
    word_value = word_to_value(word)
    return value * word_value

# gets numerical value from entries like '$730,000'
def parse_value_syntax(string):
    value_string = re.search(number, string).group()
    value = float(value_string.replace(',', ''))
    return value

# converts money string value to float
def money_conversion(money):

    if isinstance(money, list):
        money = money[0]

    word_syntax = re.search(word_re, money, flags=re.I)
    value_syntax = re.search(value_re, money)

    if word_syntax:
        return parse_word_syntax(word_syntax.group())
    
    elif value_syntax:
        return parse_value_syntax(value_syntax.group())
    
    else:
        return None

In [36]:
# check money_conversion function
[money_conversion(budget) for budget in movie_budgets][:20]

[1490000.0,
 2600000.0,
 2280000.0,
 600000.0,
 950000.0,
 858000.0,
 None,
 788000.0,
 None,
 1350000.0,
 2125000.0,
 None,
 1500000.0,
 1500000.0,
 None,
 2200000.0,
 1800000.0,
 3000000.0,
 None,
 4000000.0]
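
The range part of `word_re` accepts a hyphen or the word "to" between two numbers. If an infobox writes a range with an en dash (for example `$70–80 million`), that pattern would not match, and the plain-value fallback would pick up only `$70`. The sketch below copies the notebook's patterns and adds the en dash as an extra separator; `range_re`, `multipliers` and `range_lower_bound` are hypothetical names, and the sample strings are made up.

```python
import re

amounts = r"thousand|million|billion"
number = r"\d+(,\d{3})*\.*\d*"
# Same shape as the notebook's word_re, with '–' (en dash) added as an assumed separator.
range_re = rf"\${number}(-|–|\sto\s)?({number})?\s({amounts})"
multipliers = {'thousand': 1_000, 'million': 1_000_000, 'billion': 1_000_000_000}

def range_lower_bound(money):
    """Return the lower bound of a '$X–Y million' style string as a float, or None."""
    match = re.search(range_re, money, flags=re.I)
    if match is None:
        return None
    text = match.group()
    # The first number in the matched text is the lower bound.
    value = float(re.search(number, text).group().replace(',', ''))
    word = re.search(amounts, text, flags=re.I).group().lower()
    return value * multipliers[word]

for sample in ['$70–80 million', '$1.5 to 2 million', '$600,000']:
    print(sample, range_lower_bound(sample))
```

Like the notebook's function, this keeps only the lower bound of a range; plain values such as `$600,000` return `None` here and would still need the separate value pattern.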

In [37]:
# create numerical versions of budget and box office
for movie in movie_infobox_list:
    movie['Budget (float)'] = money_conversion(movie.get('Budget', 'N/A'))
    movie['Box office (float)'] = money_conversion(movie.get('Box office', 'N/A'))

In [38]:
# check money conversion in movie dicts
movie_infobox_list[-10]

{'Title': 'Aladdin',
 'Directed by': 'Guy Ritchie',
 'Screenplay by': ['John August', 'Guy Ritchie'],
 'Based on': ["Disney 's Aladdin by Ron Clements John Musker Ted Elliott Terry Rossio",
  'Ron Clements',
  'John Musker',
  'Ted Elliott',
  'Terry Rossio',
  'Aladdin and the Magic Lamp which is associated with One Thousand and One Nights'],
 'Produced by': ['Dan Lin', 'Jonathan Eirich'],
 'Starring': ['Will Smith',
  'Mena Massoud',
  'Naomi Scott',
  'Marwan Kenzari',
  'Navid Negahban',
  'Nasim Pedrad',
  'Billy Magnussen'],
 'Cinematography': 'Alan Stewart',
 'Edited by': 'James Herbert',
 'Music by': 'Alan Menken',
 'Production companies': ['Walt Disney Pictures', 'Rideback'],
 'Distributed by': ['Walt Disney Studios', 'Motion Pictures'],
 'Release dates': ['May 8, 2019 ( Grand Rex )',
  'May 24, 2019 (United States)'],
 'Running time': '128 minutes',
 'Country': 'United States',
 'Language': 'English',
 'Budget': '$183 million',
 'Box office': '',
 'Running time (int)': 128,
 

Convert dates to datetime

In [39]:
# look at movie dates
movie_dates = [movie.get('Release date', 'N/A') for movie in movie_infobox_list]
movie_dates[:20]

['N/A',
 'N/A',
 ['November 13, 1940'],
 ['June 27, 1941'],
 'N/A',
 'N/A',
 'N/A',
 ['July 17, 1943'],
 'N/A',
 'N/A',
 'N/A',
 ['September 27, 1947'],
 'May 27, 1948',
 'N/A',
 ['October 5, 1949'],
 'N/A',
 'N/A',
 'N/A',
 'N/A',
 ['February 5, 1953 (United States)']]

In [40]:
# strip parenthetical notes such as '(United States)' from a date string
def clean_date(date):
    return date.split('(')[0].strip()

# convert string date to datetime
def date_conversion(date):
    if isinstance(date, list):
        date = date[0]

    if date == 'N/A':
        return None

    date_str = clean_date(date)

    # try each known format in turn; return None only if none of them match
    fmts = ["%B %d, %Y", "%d %B %Y"]
    for fmt in fmts:
        try:
            return datetime.strptime(date_str, fmt)
        except ValueError:
            continue
    return None

In [41]:
# create datetime value for each movie
for movie in movie_infobox_list:
    movie['Release date (datetime)'] = date_conversion(movie.get('Release date', 'N/A'))
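
Many entries in the date listing above come back as `'N/A'`. The infobox dictionaries shown earlier use the key `'Release dates'` (plural) when a film has more than one date, and the lookup only asks for `'Release date'`. A sketch that checks both keys, assuming those are the only two spellings; `first_release_date` is a hypothetical helper and the sample dictionaries are trimmed copies of the outputs shown in this notebook.

```python
from datetime import datetime

def first_release_date(movie):
    """Parse the first release date found under either key, or return None."""
    raw = movie.get('Release date') or movie.get('Release dates')
    if not raw:
        return None
    if isinstance(raw, list):
        raw = raw[0]
    # Drop parenthetical notes such as '( Taormina Film Fest )'.
    date_str = raw.split('(')[0].strip()
    for fmt in ("%B %d, %Y", "%d %B %Y"):
        try:
            return datetime.strptime(date_str, fmt)
        except ValueError:
            continue
    return None

print(first_release_date({'Release dates': ['June 12, 2010 ( Taormina Film Fest )',
                                            'June 18, 2010 (United States)']}))
print(first_release_date({'Release date': ['December 21, 2007']}))
print(first_release_date({'Title': 'No date listed'}))
```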

In [42]:
# check datetime conversion
movie_infobox_list[-20]

{'Title': 'National Treasure: Book of Secrets',
 'Directed by': 'Jon Turteltaub',
 'Screenplay by': ['Cormac Wibberley', 'Marianne Wibberley'],
 'Story by': ['Gregory Poirier',
  'Cormac Wibberley',
  'Marianne Wibberley',
  'Ted Elliott',
  'Terry Rossio'],
 'Based on': ['Characters', 'by', 'Jim Kouf', 'Oren Aviv', 'Charles Segars'],
 'Produced by': ['Jerry Bruckheimer', 'Jon Turteltaub'],
 'Starring': ['Nicolas Cage',
  'Diane Kruger',
  'Justin Bartha',
  'Jon Voight',
  'Helen Mirren',
  'Ed Harris',
  'Harvey Keitel',
  'Bruce Greenwood'],
 'Cinematography': ['John Schwartzman', 'Amir Mokri'],
 'Edited by': ['William Goldenberg', 'David Rennie'],
 'Music by': 'Trevor Rabin',
 'Production companies': ['Walt Disney Pictures',
  'Jerry Bruckheimer Films',
  'Junction Entertainment',
  'Saturn Films'],
 'Distributed by': 'Walt Disney Studios Motion Pictures',
 'Release date': ['December 21, 2007'],
 'Running time': '124 minutes',
 'Country': 'United States',
 'Language': 'English',
 '

### Pickle Movie Data

In [43]:
# save data in pickle
def save_data_pickle(name, data):
    with open(name, 'wb') as f:
        pickle.dump(data, f)

In [44]:
# load data from pickle
def load_data_pickle(name):
    with open(name, 'rb') as f:
        return pickle.load(f)

In [45]:
# save_data_pickle('disney_movie_data_cleaned_more.pickle', movie_infobox_list)

In [46]:
# a = load_data_pickle('disney_movie_data_cleaned_more.pickle')

In [47]:
# equivalent!
# a == movie_infobox_list
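
The equality check above works because pickle stores Python objects as they are, including `datetime` values. A small in-memory sketch of the same round trip, using `pickle.dumps` and `pickle.loads` instead of a file; the sample record is a trimmed, illustrative entry.

```python
import pickle
from datetime import datetime

# Illustrative record with a datetime value, as produced by the date conversion step.
sample = [{'Title': 'Toy Story 3',
           'Release date (datetime)': datetime(2010, 6, 18)}]

restored = pickle.loads(pickle.dumps(sample))

print(restored == sample)
print(type(restored[0]['Release date (datetime)']))
```

Pickle files should only be loaded from sources you trust, since unpickling can execute arbitrary code.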

## Task 4: Attach IMDB/Rotten Tomatoes Scores

This task requires an API key, so it is skipped here to keep the scope small.

## Task 5: Save data as JSON & CSV

In [48]:
# creating copy (prevent alterations to original list)
movie_info_copy = [movie.copy() for movie in movie_infobox_list]

In [49]:
# JSON cannot serialize datetime objects,
# so convert them to strings before saving
for movie in movie_info_copy:
    current_date = movie['Release date (datetime)']
    if current_date:
        movie['Release date (datetime)'] = current_date.strftime("%B %d, %Y")
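
An alternative to converting each date by hand is to give `json.dumps` a `default` function, which it calls for any object it cannot serialize. This is a sketch of that approach, not what the tutorial does; `encode_datetime` is a hypothetical name, and it writes ISO 8601 strings rather than the `"%B %d, %Y"` format used above.

```python
import json
from datetime import datetime

def encode_datetime(value):
    """Fallback encoder for json.dumps: turn datetime objects into ISO strings."""
    if isinstance(value, datetime):
        return value.isoformat()
    raise TypeError(f'Cannot serialize {type(value).__name__}')

sample = {'Title': 'Toy Story 3',
          'Release date (datetime)': datetime(2010, 6, 18)}

encoded = json.dumps(sample, default=encode_datetime)
decoded = json.loads(encoded)

print(encoded)
# ISO strings can be turned back into datetime objects after loading.
print(datetime.fromisoformat(decoded['Release date (datetime)']))
```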

In [50]:
# save to JSON using previous function
# save_data('disney_data_final.json', movie_info_copy)

In [51]:
# move data into dataframe for saving to csv
df = pd.DataFrame(movie_infobox_list)
df.head()

| | Title | Directed by | Written by | Based on | Produced by | Starring | Music by | Production company | Distributed by | Release dates | ... | Traditional | Simplified | Original title | Layouts by | Music | Lyrics | Book | Basis | Productions | Awards |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Snow White and the Seven Dwarfs | [David Hand, William Cottrell, Wilfred Jackson... | [Ted Sears, Richard Creedon, Otto Englander, D... | [Snow White, by The, Brothers Grimm] | Walt Disney | [Adriana Caselotti, Lucille La Verne, Harry St... | [Frank Churchill, Paul Smith, Leigh Harline] | Walt Disney Productions | RKO Radio Pictures | [December 21, 1937 ( Carthay Circle Theatre ),... | ... | | | | | | | | | | |
| 1 | Pinocchio | [Ben Sharpsteen, Hamilton Luske, Bill Roberts,... | | [The Adventures of Pinocchio, by, Carlo Collodi] | Walt Disney | [Cliff Edwards, Dickie Jones, Christian Rub, W... | [Leigh Harline, Paul J. Smith] | Walt Disney Productions | RKO Radio Pictures | [February 7, 1940 ( Center Theatre ), February... | ... | | | | | | | | | | |
| 2 | Fantasia | [Samuel Armstrong, James Algar, Bill Roberts, ... | | | [Walt Disney, Ben Sharpsteen] | [Leopold Stokowski, Deems Taylor] | See program | Walt Disney Productions | RKO Radio Pictures | | ... | | | | | | | | | | |
| 3 | The Reluctant Dragon | [Alfred Werker, (live action), Hamilton Luske,... | [Live-action:, Ted Sears, Al Perkins, Larry Cl... | | Walt Disney | [Robert Benchley, Frances Gifford, Buddy Peppe... | [Frank Churchill, Larry Morey] | Walt Disney Productions | RKO Radio Pictures | | ... | | | | | | | | | | |
| 4 | Dumbo | [Ben Sharpsteen, Norman Ferguson, Wilfred Jack... | | [Dumbo, the Flying Elephant, by, Helen Aberson... | Walt Disney | [Edward Brophy, Verna Felton, Cliff Edwards, H... | [Frank Churchill, Oliver Wallace] | Walt Disney Productions | RKO Radio Pictures | [October 23, 1941 (New York City), October 31,... | ... | | | | | | | | | | |


In [52]:
# save to csv
# df.to_csv('disney_movie_data_final.csv')
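
Several columns in the dataframe hold Python lists. CSV has no list type, so such cells end up as text. The sketch below uses the standard library's `csv` module (not pandas) to show the effect and one way to recover the list with `ast.literal_eval`. It assumes list cells are written as their string representation, which is how the `csv` module handles them; the two-column sample is made up.

```python
import ast
import csv
import io

# Write a tiny table with one list-valued cell to an in-memory CSV.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(['Title', 'Starring'])
writer.writerow(['Toy Story 3', ['Tom Hanks', 'Tim Allen']])

# Read it back: the list is now a plain string.
buffer.seek(0)
rows = list(csv.DictReader(buffer))
raw_cell = rows[0]['Starring']
print(type(raw_cell).__name__, raw_cell)

# literal_eval safely parses the string back into a list.
restored = ast.literal_eval(raw_cell)
print(type(restored).__name__, restored)
```

If list columns need to survive a save and reload unchanged, the JSON or pickle outputs from the earlier cells avoid this conversion step.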