## Disney Dataset Creation (w/ Python BeautifulSoup)

Scrape and clean a list of Disney Wikipedia pages to build a dataset for further analysis.

### Task #1: Get Info Box for a test movie (store in Python dictionary)

#### Import necessary libraries

In [1]:
from bs4 import BeautifulSoup as bs
import requests

#### Load the webpage

In [7]:
r = requests.get('https://en.wikipedia.org/wiki/Toy_Story_3')

# Convert to BeautifulSoup object
soup = bs(r.content)

# Print out the HTML
contents = soup.prettify()
# print(contents)

In [6]:
# Limiting our scope to the infobox section of the wikipedia page

info_box = soup.find(class_ = 'infobox vevent')
print(info_box.prettify())

<table class="infobox vevent" style="font-size:90%;">
 <tbody>
  <tr>
   <th class="infobox-above summary" colspan="2" style="font-size:110%;font-style:italic;">
    Toy Story 3
   </th>
  </tr>
  <tr>
   <td class="infobox-image" colspan="2">
    <a class="image" href="/wiki/File:Toy_Story_3_poster.jpg" title="All of the toys packed close together, holding up a large numeral '3', with Buzz, who is putting a friendly arm around Woody's shoulder, and Woody holding the top of the 3.">
     <img alt="All of the toys packed close together, holding up a large numeral '3', with Buzz, who is putting a friendly arm around Woody's shoulder, and Woody holding the top of the 3." class="thumbborder" data-file-height="326" data-file-width="220" decoding="async" height="326" src="//upload.wikimedia.org/wikipedia/en/6/69/Toy_Story_3_poster.jpg" width="220"/>
    </a>
    <div class="infobox-caption" style="font-size:95%;padding:0.35em 0.35em 0.25em;line-height:1.25em;">
     Theatrical release poster

In [10]:
# Finding table rows (tr) in the infobox

info_rows = info_box.find_all('tr')

for row in info_rows:
    print(row.prettify())

<tr>
 <th class="infobox-above summary" colspan="2" style="font-size:110%;font-style:italic;">
  Toy Story 3
 </th>
</tr>

<tr>
 <td class="infobox-image" colspan="2">
  <a class="image" href="/wiki/File:Toy_Story_3_poster.jpg" title="All of the toys packed close together, holding up a large numeral '3', with Buzz, who is putting a friendly arm around Woody's shoulder, and Woody holding the top of the 3.">
   <img alt="All of the toys packed close together, holding up a large numeral '3', with Buzz, who is putting a friendly arm around Woody's shoulder, and Woody holding the top of the 3." class="thumbborder" data-file-height="326" data-file-width="220" decoding="async" height="326" src="//upload.wikimedia.org/wikipedia/en/6/69/Toy_Story_3_poster.jpg" width="220"/>
  </a>
  <div class="infobox-caption" style="font-size:95%;padding:0.35em 0.35em 0.25em;line-height:1.25em;">
   Theatrical release poster
  </div>
 </td>
</tr>

<tr>
 <th class="infobox-label" scope="row" style="white-space

To build a dictionary, we will get the key from the table head (th) and the value from the table data (td).

In [18]:
def get_content_value(row_data):
    if row_data.find('li'):
        return [li.get_text(' ', strip = True).replace('\xa0', ' ') for li in row_data.find_all('li')]
    else:
        return row_data.get_text(' ', strip = True).replace('\xa0', ' ') # replace non-breaking spaces (\xa0) with regular spaces

movie_info = {}

for index, row in enumerate(info_rows):
    if index == 0:
        movie_info['title'] = row.find('th').get_text(' ', strip = True)
    elif index == 1:
        continue
    else:
        content_key = row.find('th').get_text(' ', strip = True)
        content_value = get_content_value(row.find('td'))
        movie_info[content_key] = content_value
    
movie_info

{'title': 'Toy Story 3',
 'Directed by': 'Lee Unkrich',
 'Produced by': 'Darla K. Anderson',
 'Screenplay by': 'Michael Arndt',
 'Story by': ['John Lasseter', 'Andrew Stanton', 'Lee Unkrich'],
 'Starring': ['Tom Hanks',
  'Tim Allen',
  'Joan Cusack',
  'Don Rickles',
  'Wallace Shawn',
  'John Ratzenberger',
  'Estelle Harris',
  'Ned Beatty',
  'Michael Keaton',
  'Jodi Benson',
  'John Morris'],
 'Music by': 'Randy Newman',
 'Cinematography': ['Jeremy Lasky', 'Kim White'],
 'Edited by': 'Ken Schretzmann',
 'Production companies': ['Walt Disney Pictures', 'Pixar Animation Studios'],
 'Distributed by': 'Walt Disney Studios Motion Pictures',
 'Release date': ['June 12, 2010 ( 2010-06-12 ) ( Taormina Film Fest )',
  'June 18, 2010 ( 2010-06-18 ) (United States)'],
 'Running time': '103 minutes [1]',
 'Country': 'United States',
 'Language': 'English',
 'Budget': '$200 million [1]',
 'Box office': '$1.067 billion [1]'}

### Task #2: Get info box for all movies

In [28]:
r = requests.get('https://en.wikipedia.org/wiki/List_of_Walt_Disney_Pictures_films')

# Convert to BeautifulSoup object
soup = bs(r.content)

# Print out the HTML
contents = soup.prettify()
# print(contents)

In [33]:
movies = soup.select('.wikitable.sortable i')
movies[0:10]

'Academy Award Review of Walt Disney Cartoons'

In [46]:
def get_content_value(row_data):
    if row_data.find('li'):
        return [li.get_text(' ', strip=True).replace('\xa0', ' ') for li in row_data.find_all('li')]
    elif row_data.find('br'):
        return [text for text in row_data.stripped_strings]
    else:
        return row_data.get_text(' ', strip=True).replace('\xa0', ' ')

def clean_tags(soup):
    for tag in soup.find_all(['sup', 'span']):
        tag.decompose()
        
def get_info_box(url):

    r = requests.get(url)
    soup = bs(r.content)
    info_box = soup.find(class_= 'infobox vevent')
    info_rows = info_box.find_all('tr')
    
    clean_tags(soup)

    movie_info = {}
    for index, row in enumerate(info_rows):
        if index == 0:
            movie_info['title'] = row.find('th').get_text(' ', strip=True)
        else:
            header = row.find('th')
            if header:
                # Reuse the header tag found above instead of searching the row again
                content_key = header.get_text(' ', strip=True)
                content_value = get_content_value(row.find('td'))
                movie_info[content_key] = content_value
            
    return movie_info

In [55]:
### WARNING! Long run-time!

r = requests.get('https://en.wikipedia.org/wiki/List_of_Walt_Disney_Pictures_films')
soup = bs(r.content)
movies = soup.select('.wikitable.sortable i a')

base_path = 'https://en.wikipedia.org'  # no trailing slash: each href already starts with '/'

movie_info_list = []
for index, movie in enumerate(movies):
    if index % 10 == 0:
        print(index)
    try:
        relative_path = movie['href']
        full_path = base_path + relative_path
        title = movie['title']
        
        movie_info_list.append(get_info_box(full_path))
        
    except Exception as e:
        print(movie.get_text())
        print(e)

0
10
20
30
40
Zorro the Avenger
'NoneType' object has no attribute 'find'
The Sign of Zorro
'NoneType' object has no attribute 'find'
50
60
70
80
90
100
110
120
True-Life Adventures
'NoneType' object has no attribute 'find_all'
130
140
The London Connection
'NoneType' object has no attribute 'find'
150
160
170
180
190
200
210
220
230
240
250
260
270
280
290
300
310
320
330
340
350
360
370
380
390
400
410
420
430
440
450
Hollywood Stargirl
'NoneType' object has no attribute 'find_all'


After investigating the five errors listed above, I found that they are edge cases whose Wikipedia pages differ from the norm: either the page has no infobox, or the film has no page of its own because it is part of a series. I decided to ignore these cases, since they are a tiny minority of the movies in scope.
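
The failures above could also be collected in a list instead of only being printed, which makes them easier to review later. The following is a minimal sketch under stated assumptions, not the code used in this notebook: `scrape_all` and `fake_fetch` are hypothetical names, and `fake_fetch` stands in for `get_info_box` so that the example runs without network access. It also joins URLs with `urllib.parse.urljoin` from the standard library.

```python
from urllib.parse import urljoin

BASE_URL = 'https://en.wikipedia.org/'

def scrape_all(relative_paths, fetch):
    """Call fetch(url) for every path and return (results, failures)."""
    results, failures = [], []
    for path in relative_paths:
        url = urljoin(BASE_URL, path)  # handles the leading '/' of the href
        try:
            results.append(fetch(url))
        except AttributeError as e:
            # Raised when a page has no infobox (a find() call returned None)
            failures.append((url, str(e)))
    return results, failures

# Hypothetical stand-in for get_info_box, so the sketch runs offline
def fake_fetch(url):
    if 'Zorro' in url:
        raise AttributeError("'NoneType' object has no attribute 'find'")
    return {'title': url.rsplit('/', 1)[-1].replace('_', ' ')}

results, failures = scrape_all(
    ['/wiki/Toy_Story_3', '/wiki/The_Sign_of_Zorro'], fake_fetch)
print(results)
print(failures)
```

Keeping the failed URLs around makes it possible to revisit the skipped pages if they turn out to matter for the analysis.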

In [56]:
len(movie_info_list)

447

#### Save/Reload Movie Data

In [57]:
import json

def save_data(title, data):
    with open(title, 'w', encoding = 'utf-8') as f:
        json.dump(data, f, ensure_ascii = False, indent = 2)

In [58]:
import json

def load_data(title):
    with open(title, encoding = 'utf-8') as f:
        return json.load(f)

In [59]:
save_data('disney_data.json', movie_info_list)

### Task #3: Clean the data

#### Subtasks
- Clean up references (e.g. [1])
- Convert running time into integer
- Convert dates to datetime object
- Split up the long strings (e.g. Name 1 Name 2 Name 3 ...)
- Convert Budget and Box Office to numbers

In [None]:
# Clean up references (remove [1] [2] and similar things)
# Done in infobox function definition section

In [None]:
# Split up the long strings
# Done in infobox function definition section

#### Convert Running time into integer

In [61]:
# Check running times

[movie.get('Running time', 'N/A') for movie in movie_info_list]

['41 minutes (74 minutes 1966 release)',
 '83 minutes',
 '88 minutes',
 '126 minutes',
 '74 minutes',
 '64 minutes',
 '70 minutes',
 '42 minutes',
 '70 min',
 '71 minutes',
 '75 minutes',
 '94 minutes',
 '73 minutes',
 '75 minutes',
 '82 minutes',
 '68 minutes',
 '74 minutes',
 '96 minutes',
 '75 minutes',
 '84 minutes',
 '77 minutes',
 '92 minutes',
 '69 minutes',
 '81 minutes',
 ['60 minutes (VHS version)', '71 minutes (original)'],
 '127 minutes',
 '92 minutes',
 '76 minutes',
 '75 minutes',
 '73 minutes',
 '85 minutes',
 '81 minutes',
 '70 minutes',
 '90 min.',
 '80 minutes',
 '75 minutes',
 '83 minutes',
 '83 minutes',
 '72 minutes',
 '97 minutes',
 '75 minutes',
 '104 minutes',
 '93 minutes',
 '105 minutes',
 '95 minutes',
 '97 minutes',
 '134 minutes',
 '69 minutes',
 '92 minutes',
 '126 minutes',
 '79 minutes',
 '97 minutes',
 '128 minutes',
 '74 minutes',
 '91 minutes',
 '105 minutes',
 '98 minutes',
 '130 minutes',
 '89 min.',
 '93 minutes',
 '67 minutes',
 '98 minutes',
 '10

In [76]:
# Define function to convert

def minutes_to_integer(running_time):
    if running_time == 'N/A':
        return None # if running time is not listed (mostly new movies)
    if isinstance(running_time, list):
        return int(running_time[0].split(" ")[0]) # in case multiple running times are listed (different editions)
    else: 
        return int(running_time.split(" ")[0])

In [78]:
for movie in movie_info_list:
    movie['Running time (int)'] = minutes_to_integer(movie.get('Running time', 'N/A'))

print([movie.get('Running time (int)', 'N/A') for movie in movie_info_list])

[41, 83, 88, 126, 74, 64, 70, 42, 70, 71, 75, 94, 73, 75, 82, 68, 74, 96, 75, 84, 77, 92, 69, 81, 60, 127, 92, 76, 75, 73, 85, 81, 70, 90, 80, 75, 83, 83, 72, 97, 75, 104, 93, 105, 95, 97, 134, 69, 92, 126, 79, 97, 128, 74, 91, 105, 98, 130, 89, 93, 67, 98, 100, 118, 103, 110, 80, 79, 91, 91, 97, 118, 139, 92, 131, 87, 116, 93, 110, 110, 131, 101, 108, 84, 78, 75, 164, 106, 110, 99, 113, 108, 112, 93, 91, 93, 100, 100, 79, 96, 113, 89, 118, 92, 88, 92, 87, 93, 93, 93, 90, 83, 96, 88, 89, 91, 93, 92, 97, 100, 100, 89, 91, 112, 115, 95, 91, 95, 104, 74, 48, 77, 104, 128, 101, 94, 104, 90, 100, 88, 93, 98, 112, 84, 98, 97, 114, 96, 100, 109, 83, 90, 107, 96, 103, 91, 95, 105, 113, 80, 101, 89, 74, 90, 89, 110, 74, 93, 84, 83, 74, 77, 107, 93, 88, 108, 84, 121, 89, 104, 90, 86, 84, 108, 107, 96, 98, 105, 108, 94, 106, 102, 88, 102, 102, 97, 111, 100, 96, 98, 78, 81, 108, 89, 99, 89, 81, 92, 100, 89, 79, 91, 101, 104, 103, 86, 105, 93, 92, 98, 95, 93, 87, 93, 87, 128, 86, 95, 114, 93, 83, 8
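
The conversion above relies on the number being the first space-separated token. A slightly more tolerant variant, sketched here as an assumption rather than as the notebook's method, searches for the first run of digits with a regular expression. `parse_minutes` is a hypothetical name, and the `'approx. 85 minutes'` sample is invented for illustration; it does not come from the scraped data.

```python
import re

def parse_minutes(running_time):
    """Return the first integer found in a running-time entry, or None."""
    if running_time is None or running_time == 'N/A':
        return None
    if isinstance(running_time, list):
        if not running_time:
            return None
        running_time = running_time[0]  # same choice as above: first edition listed
    match = re.search(r'\d+', running_time)
    return int(match.group()) if match else None

samples = [
    '41 minutes (74 minutes 1966 release)',
    '90 min.',
    ['60 minutes (VHS version)', '71 minutes (original)'],
    'approx. 85 minutes',  # invented sample: leading text before the number
    'N/A',
]
print([parse_minutes(s) for s in samples])
```

With `int(running_time.split(" ")[0])`, the invented fourth sample would raise a `ValueError`, because its first token is not a number.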

#### Convert Budget & Box Office to numbers

In [83]:
# Check budgets
print([movie.get('Budget') for movie in movie_info_list])

[None, '$1.49 million', '$2.6 million', '$2.28 million', '$600,000', '$950,000', '$858,000', None, '$788,000', None, '$1.35 million', '$2.125 million', None, '$1.5 million', '$1.5 million', None, '$2.9 million', '$1,800,000', '$3 million', None, '$4 million', '$2 million', '$300,000', '$1.8 million', None, '$5 million', None, '$4 million', None, None, None, None, None, None, '$700,000', None, None, None, None, None, '$6 million', 'under $1 million or $1,250,000', None, '$2 million', None, None, '$2.5 million', None, None, '$4 million', '$3.6 million', None, None, None, None, '$3 million', None, '$3 million', None, None, None, None, None, None, None, None, None, '$3 million', None, None, None, None, '$4.4–6 million', None, None, None, None, None, None, None, None, None, None, None, '$4 million', None, '$5 million', None, None, None, None, '$5 million', None, None, None, None, None, None, '$4 million', None, None, None, '$6.3 million', None, None, None, None, None, None, None, None, '$5 

In [84]:
# Developing regex for parsing numerical values

import re

amounts = r"thousand|million|billion"
number = r"\d+(,\d{3})*\.*\d*"

word_re = rf"\${number}(-|\sto\s|–)?({number})?\s({amounts})"
value_re = rf"\${number}"

def word_to_value(word):
    value_dict = {"thousand": 1000, "million": 1000000, "billion": 1000000000}
    return value_dict[word]

def parse_word_syntax(string):
    value_string = re.search(number, string).group()
    value = float(value_string.replace(",", ""))
    word = re.search(amounts, string, flags=re.I).group().lower()
    word_value = word_to_value(word)
    return value*word_value

def parse_value_syntax(string):
    value_string = re.search(number, string).group()
    value = float(value_string.replace(",", ""))
    return value

def money_conversion(money):
    '''
    money_conversion("$12.2 million") --> 12200000 ## Word syntax
    money_conversion("$790,000") --> 790000        ## Value syntax
    '''
    if money == "N/A":
        return None

    if isinstance(money, list):
        money = money[0]
        
    word_syntax = re.search(word_re, money, flags=re.I)
    value_syntax = re.search(value_re, money)

    if word_syntax:
        return parse_word_syntax(word_syntax.group())

    elif value_syntax:
        return parse_value_syntax(value_syntax.group())

    else:
        return None

In [85]:
# Conversion
for movie in movie_info_list:
    movie['Budget (float)'] = money_conversion(movie.get('Budget', 'N/A'))
    movie['Box office (float)'] = money_conversion(movie.get('Box office', 'N/A'))
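
One thing to be aware of: ranges are not handled uniformly by the regex above. For '$5–10 million' it returns the lower bound, while for '$76.4–$83.3 million' it returns the upper bound, because the pattern does not allow a dollar sign in front of the second number (compare the Born in China and Fantasia values in the outputs further down). The sketch below is one possible alternative that returns both bounds; `parse_money_range`, `MONEY_RE`, `NUM` and `MULTIPLIERS` are hypothetical names, not part of the original notebook.

```python
import re

MULTIPLIERS = {'thousand': 1_000, 'million': 1_000_000, 'billion': 1_000_000_000}
NUM = r'\d+(?:,\d{3})*(?:\.\d+)?'
# Pattern: $low, optionally a separator and [$]high, optionally a scale word
MONEY_RE = re.compile(
    rf'\$(?P<low>{NUM})(?:\s*(?:-|–|to)\s*\$?(?P<high>{NUM}))?'
    rf'(?:\s*(?P<word>thousand|million|billion))?',
    flags=re.I,
)

def parse_money_range(text):
    """Return (low, high) in dollars, or None if no dollar amount is found."""
    match = MONEY_RE.search(text)
    if not match:
        return None
    word = match.group('word')
    scale = MULTIPLIERS[word.lower()] if word else 1
    low = float(match.group('low').replace(',', '')) * scale
    high_text = match.group('high')
    # A single value is reported as a range whose bounds coincide
    high = float(high_text.replace(',', '')) * scale if high_text else low
    return low, high

for sample in ['$5–10 million', '$76.4–$83.3 million',
               '$960,000 (worldwide rentals)', '$1,800,000']:
    print(sample, '->', parse_money_range(sample))
```

Returning both bounds leaves the choice between lower bound, upper bound or midpoint to the analysis step, instead of hiding it inside the parser.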

In [86]:
movie_info_list[-40]

{'title': 'Ralph Breaks the Internet',
 'Directed by': ['Rich Moore', 'Phil Johnston'],
 'Produced by': 'Clark Spencer',
 'Screenplay by': ['Phil Johnston', 'Pamela Ribon'],
 'Story by': ['Rich Moore',
  'Phil Johnston',
  'Jim Reardon',
  'Pamela Ribon',
  'Josie Trinidad'],
 'Starring': ['John C. Reilly',
  'Sarah Silverman',
  'Gal Gadot',
  'Taraji P. Henson',
  'Jack McBrayer',
  'Jane Lynch',
  'Alan Tudyk',
  'Alfred Molina',
  "Ed O'Neill"],
 'Music by': 'Henry Jackman',
 'Cinematography': ['Nathan Detroit Warner (layout)',
  'Brian Leach (lighting)'],
 'Edited by': 'Jeremy Milton',
 'Production companies': ['Walt Disney Pictures',
  'Walt Disney Animation Studios'],
 'Distributed by': 'Walt Disney Studios Motion Pictures',
 'Release date': ['November 5, 2018 ( El Capitan Theatre )',
  'November 21, 2018 (United States)'],
 'Running time': '112 minutes',
 'Country': 'United States',
 'Language': 'English',
 'Budget': '$175 million',
 'Box office': '$529.3 million',
 'Running ti

#### Convert dates to datetime objects

In [87]:
# Checking date formats

print([movie.get('Release date') for movie in movie_info_list])

[['May 19, 1937'], ['December 21, 1937 ( Carthay Circle Theatre , Los Angeles , CA , premiere)'], ['February 7, 1940 ( Center Theatre )', 'February 23, 1940 (United States)'], ['November 13, 1940'], ['June 27, 1941'], ['October 23, 1941 (New York City)', 'October 31, 1941 (U.S.)'], ['August 9, 1942 (World Premiere-London)', 'August 13, 1942 (Premiere-New York City)', 'August 21, 1942 (U.S.)'], ['August 24, 1942 (World Premiere-Rio de Janeiro)', 'February 6, 1943 (U.S. Premiere-Boston)', 'February 19, 1943 (U.S.)'], ['July 17, 1943'], ['December 21, 1944 (Mexico City)', 'February 3, 1945 (US)'], ['April 20, 1946 (New York City premiere)', 'August 15, 1946 (U.S.)'], ['November 12, 1946 (Premiere: Atlanta, Georgia)', 'November 20, 1946', 'March 30, 1947 (Stanford Theatre, Palo Alto, California)'], ['September 27, 1947'], 'May 27, 1948', ['November 29, 1948 (Chicago, Illinois)', 'January 19, 1949 (Indianapolis, Indiana)'], ['October 5, 1949'], ['February 15, 1950 (Boston)', 'March 4, 1950 

In [88]:
# Import necessary library

from datetime import datetime

In [95]:
dates = [movie.get('Release date', 'N/A') for movie in movie_info_list]

# Defining date conversion functions
def clean_date(date):
    return date.split('(')[0].strip() # get rid of content in parentheses

def date_conversion(date):
    if isinstance(date, list):
        date = date[0]
    
    if date == 'N/A':
        return None
    
    date_str = clean_date(date)
    
    fmts = ["%B %d, %Y", '%d %B %Y'] # dates appear in more than one format, so try each in turn
    for fmt in fmts:
        try:
            return datetime.strptime(date_str, fmt)
        except ValueError: # strptime raises ValueError when the format does not match
            pass
    return None
    
for date in dates:
    print(date_conversion(date))
    print()

1937-05-19 00:00:00

1937-12-21 00:00:00

1940-02-07 00:00:00

1940-11-13 00:00:00

1941-06-27 00:00:00

1941-10-23 00:00:00

1942-08-09 00:00:00

1942-08-24 00:00:00

1943-07-17 00:00:00

1944-12-21 00:00:00

1946-04-20 00:00:00

1946-11-12 00:00:00

1947-09-27 00:00:00

1948-05-27 00:00:00

1948-11-29 00:00:00

1949-10-05 00:00:00

1950-02-15 00:00:00

1950-06-22 00:00:00

1951-07-26 00:00:00

1952-03-13 00:00:00

1953-02-05 00:00:00

1953-08-08 00:00:00

1953-11-10 00:00:00

1953-10-26 00:00:00

1954-08-17 00:00:00

1954-12-23 00:00:00

1955-05-25 00:00:00

1955-06-22 00:00:00

1955-09-14 00:00:00

1955-12-22 00:00:00

1956-06-08 00:00:00

1956-07-18 00:00:00

1956-09-04 00:00:00

1956-12-20 00:00:00

1957-06-19 00:00:00

1957-08-28 00:00:00

1957-12-25 00:00:00

1958-07-08 00:00:00

1958-08-12 00:00:00

1958-12-25 00:00:00

1959-01-29 00:00:00

1959-03-19 00:00:00

1959-06-24 00:00:00

1959-11-10 00:00:00

1960-01-21 00:00:00

1960-02-24 00:00:00

1960-05-19 00:00:00

None

1960-11

In some cases we still get None, either because no release date is listed or because the date uses a different format (in most cases only a year is given). We could fine-tune the function to handle these edge cases, but they are rare, so for now I will move on to the other tasks.
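
For reference, the year-only case could be covered by adding more formats to the list. This is a sketch under assumptions, not something that was run against the scraped data: `date_conversion_lenient` and `FORMATS` are hypothetical names, and the two extra formats (`'%B %Y'` and `'%Y'`) are guesses at what the remaining entries look like.

```python
from datetime import datetime

# Formats tried in order; the last two are assumed extras for partial dates
FORMATS = ['%B %d, %Y', '%d %B %Y', '%B %Y', '%Y']

def date_conversion_lenient(date):
    """Like date_conversion, but also accepts 'Month Year' and 'Year'."""
    if isinstance(date, list):
        date = date[0] if date else 'N/A'
    if date is None or date == 'N/A':
        return None
    date_str = date.split('(')[0].strip()  # drop content in parentheses
    for fmt in FORMATS:
        try:
            return datetime.strptime(date_str, fmt)
        except ValueError:
            continue
    return None

for sample in ['June 18, 2010 (United States)', '21 December 1937',
               'November 1960', '1960', 'TBA']:
    print(sample, '->', date_conversion_lenient(sample))
```

Note that `strptime` fills a missing month or day with 1, so a year-only entry becomes January 1 of that year. That overstates the precision of the source, which may or may not be acceptable for the later analysis.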

In [96]:
for movie in movie_info_list:
    movie['Release date (datetime)'] = date_conversion(movie.get('Release date', 'N/A'))

{'title': 'Frozen II',
 'Directed by': ['Chris Buck', 'Jennifer Lee'],
 'Produced by': 'Peter Del Vecho',
 'Screenplay by': ['Jennifer Lee'],
 'Story by': ['Chris Buck',
  'Jennifer Lee',
  'Marc E. Smith',
  'Kristen Anderson-Lopez',
  'Robert Lopez'],
 'Starring': ['Kristen Bell', 'Idina Menzel', 'Josh Gad', 'Jonathan Groff'],
 'Music by': ['Christophe Beck (score)',
  'Robert Lopez (songs)',
  'Kristen Anderson-Lopez (songs)'],
 'Cinematography': ['Tracy Scott Beattie (layout)',
  'Mohit Kallianpur (lighting)'],
 'Edited by': 'Jeff Draheim',
 'Production companies': ['Walt Disney Pictures',
  'Walt Disney Animation Studios'],
 'Distributed by': ['Walt Disney Studios', 'Motion Pictures'],
 'Release date': ['November 7, 2019 ( Dolby Theatre )',
  'November 22, 2019 (United States)'],
 'Running time': '103 minutes',
 'Country': 'United States',
 'Language': 'English',
 'Budget': '$150 million',
 'Box office': '$1.450 billion',
 'Running time (int)': 103,
 'Budget (float)': 150000000.0,

#### Saving the dataset
We need pickle to save the dataset at this point, because the newly added datetime values cannot be written to JSON directly.

In [99]:
import pickle

In [100]:
def save_data_pickle(name, data):
    with open(name, 'wb') as f:
        pickle.dump(data, f)

In [104]:
def load_data_pickle(name):
    with open(name, 'rb') as f:
        return pickle.load(f)

In [105]:
save_data_pickle('movie_data_cleaned.pickle', movie_info_list)

In [106]:
a = load_data_pickle('movie_data_cleaned.pickle')

In [109]:
# Checking if saving and loading went through without issues
a == movie_info_list

True
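
To illustrate why pickle is used here, the short sketch below shows that `json.dumps` rejects a `datetime` value with a `TypeError`, while a pickle round trip preserves it. The `record` dictionary is a reduced, hand-written example rather than an entry taken from `movie_info_list`.

```python
import json
import pickle
from datetime import datetime

record = {'title': 'Toy Story 3',
          'Release date (datetime)': datetime(2010, 6, 12)}

# The json module has no default encoding for datetime objects
try:
    json.dumps(record)
    json_ok = True
except TypeError as e:
    json_ok = False
    print('json failed:', e)

# pickle serializes the object as-is and restores the same type
restored = pickle.loads(pickle.dumps(record))
print(restored == record, type(restored['Release date (datetime)']).__name__)
```

Keep in mind that unpickling can execute arbitrary code, so pickle files should only be loaded when they come from a trusted source.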

### Task #4: Merge IMDB, Rotten Tomatoes and Metacritic ratings

We will use the Open Movie Database (OMDB) to get the movie ratings.

url: https://www.omdbapi.com/
key: http://www.omdbapi.com/?apikey=[yourkey]&

In [117]:
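# Import necessary packages and set up requests
import requests
import urllib.parse  # import the submodule explicitly; a bare "import urllib" does not guarantee urllib.parse is loaded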

def get_omdb_info(title):
    base_url = 'https://www.omdbapi.com/?'
    parameters = {'apikey': '412e64b4', 't': title}
    params_encoded = urllib.parse.urlencode(parameters)
    full_url = base_url + params_encoded
    return requests.get(full_url).json()

# test
get_omdb_info('into the woods')

{'Title': 'Into the Woods',
 'Year': '2014',
 'Rated': 'PG',
 'Released': '25 Dec 2014',
 'Runtime': '125 min',
 'Genre': 'Adventure, Comedy, Drama, Fantasy, Musical',
 'Director': 'Rob Marshall',
 'Writer': 'James Lapine (screenplay by), James Lapine (based on the musical by)',
 'Actors': 'Anna Kendrick, Daniel Huttlestone, James Corden, Emily Blunt',
 'Plot': 'A witch tasks a childless baker and his wife with procuring magical items from classic fairy tales to reverse the curse put on their family tree.',
 'Language': 'English',
 'Country': 'USA',
 'Awards': 'Nominated for 3 Oscars. Another 10 wins & 71 nominations.',
 'Poster': 'https://m.media-amazon.com/images/M/MV5BMTY4MzQ4OTY3NF5BMl5BanBnXkFtZTgwNjM5MDI3MjE@._V1_SX300.jpg',
 'Ratings': [{'Source': 'Internet Movie Database', 'Value': '5.9/10'},
  {'Source': 'Rotten Tomatoes', 'Value': '71%'},
  {'Source': 'Metacritic', 'Value': '69/100'}],
 'Metascore': '69',
 'imdbRating': '5.9',
 'imdbVotes': '134,229',
 'imdbID': 'tt2180411',


The IMDB rating and the Metascore are available as top-level keys, but the Rotten Tomatoes rating is nested inside the 'Ratings' list, so we need an additional function to extract it.

In [119]:
# Define function for extracting Rotten Tomatoes score

def get_rotten_tomatoes_score(omdb_info):
    ratings = omdb_info.get('Ratings', [])
    for rating in ratings:
        if rating['Source'] == 'Rotten Tomatoes':
            return rating['Value']
    return None


In [120]:
# WARNING! Long run-time!

for movie in movie_info_list:
    title = movie['title']
    omdb_info = get_omdb_info(title)
    movie['imdb'] = omdb_info.get('imdbRating', None)
    movie['metascore'] = omdb_info.get('Metascore', None)
    movie['rotten_tomatoes'] = get_rotten_tomatoes_score(omdb_info)
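
The ratings stored above are still strings such as '5.9/10', '71%' or '69/100'. If numeric values are needed later, a small helper along the following lines could convert them. This is a sketch: `rating_to_float` is a hypothetical name, and treating 'N/A' as a missing value is an assumption about how OMDB reports unavailable ratings.

```python
def rating_to_float(value):
    """Convert a rating string to a float.

    'x/y' and 'z%' are normalized to a 0-100 scale; a bare number such as
    '69' is returned unchanged. Missing values give None.
    """
    if value is None or value == 'N/A':  # 'N/A' as missing marker is an assumption
        return None
    value = value.strip()
    if value.endswith('%'):
        return float(value[:-1])
    if '/' in value:
        score, scale = value.split('/')
        # Round to one decimal to hide floating-point noise
        return round(float(score) / float(scale) * 100, 1)
    return float(value)

for sample in ['5.9/10', '71%', '69/100', '69', 'N/A', None]:
    print(repr(sample), '->', rating_to_float(sample))
```

Putting the three sources on the same 0-100 scale makes them directly comparable, at the cost of losing the original presentation of each rating.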

In [122]:
# Checking if we successfully added the ratings

movie_info_list[-50]

{'title': 'Born in China',
 'Traditional': '我們誕生在中國',
 'Simplified': '我们诞生在中国',
 'Directed by': 'Lu Chuan',
 'Produced by': ['Roy Conli', 'Brian Leith', 'Phil Chapman'],
 'Screenplay by': ['David Fowler', 'Brian Leith', 'Phil Chapman', 'Lu Chuan'],
 'Narrated by': ['John Krasinski', 'Zhou Xun', 'Claire Keim'],
 'Music by': 'Barnaby Taylor',
 'Edited by': 'Matthew Meech',
 'Production companies': ['Disneynature',
  'Shanghai Media Group',
  'Chuan Films',
  'Brian Leith Productions'],
 'Distributed by': ['Walt Disney Studios Motion Pictures',
  'Shanghai Media Group',
  '(China)'],
 'Release date': ['August 12, 2016 (China)',
  'April 21, 2017 (United States)',
  'August 23, 2017 (France)'],
 'Running time': '76 minutes',
 'Countries': ['United States', 'China', 'France'],
 'Languages': ['English', 'Mandarin', 'French'],
 'Budget': '$5–10 million',
 'Box office': '$25.1 million',
 'Running time (int)': 76,
 'Budget (float)': 5000000.0,
 'Box office (float)': 25100000.0,
 'Release date (

#### Saving the final dataset

In [123]:
save_data_pickle('movie_database_final', movie_info_list)

### Task #5: Save data as JSON and CSV

In [124]:
# Creating copy

movie_info_copy = [movie.copy() for movie in movie_info_list]

In [126]:
# Replace datetime with string

for movie in movie_info_copy:
    current_date = movie['Release date (datetime)']
    if current_date:
        movie['Release date (datetime)'] = current_date.strftime('%B %d, %Y')
    else:
        movie['Release date (datetime)'] = None

In [129]:
# Save data to JSON

save_data('movie_data_json.json', movie_info_copy)
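
An alternative to rewriting the dates by hand is to pass a `default` hook to `json.dumps` and store the dates as ISO 8601 strings, which sort chronologically and can be parsed back with `datetime.fromisoformat`. This is a sketch of that option, not the approach taken above; `encode_datetime` is a hypothetical name and `movie` is a reduced, hand-written record.

```python
import json
from datetime import datetime

def encode_datetime(obj):
    """json.dumps hook: store datetimes as ISO 8601 strings."""
    if isinstance(obj, datetime):
        return obj.isoformat()
    raise TypeError(f'Cannot serialize {type(obj).__name__}')

movie = {'title': 'Frozen II',
         'Release date (datetime)': datetime(2019, 11, 7)}

text = json.dumps(movie, default=encode_datetime)
loaded = json.loads(text)

# The string can be turned back into a datetime after loading
restored = datetime.fromisoformat(loaded['Release date (datetime)'])
print(text)
print(restored == movie['Release date (datetime)'])
```

The hook is only called for objects the encoder cannot handle on its own, so strings, numbers, lists and None values pass through unchanged.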

#### Save data as CSV

In [132]:
# Import package
import pandas as pd

# Convert to dataframe
df = pd.DataFrame(movie_info_list)

df.head()

Unnamed: 0,title,Production company,Release date,Running time,Country,Language,Box office,Running time (int),Budget (float),Box office (float),...,Languages,Screenplay by,Countries,Production companies,Japanese,Hepburn,Adaptation by,Animation by,Traditional,Simplified
0,Academy Award Review of,Walt Disney Productions,"[May 19, 1937]",41 minutes (74 minutes 1966 release),United States,English,$45.472,41.0,,45.472,...,,,,,,,,,,
1,Snow White and the Seven Dwarfs,Walt Disney Productions,"[December 21, 1937 ( Carthay Circle Theatre , ...",83 minutes,United States,English,$418 million,83.0,1490000.0,418000000.0,...,,,,,,,,,,
2,Pinocchio,Walt Disney Productions,"[February 7, 1940 ( Center Theatre ), February...",88 minutes,United States,English,$164 million,88.0,2600000.0,164000000.0,...,,,,,,,,,,
3,Fantasia,Walt Disney Productions,"[November 13, 1940]",126 minutes,United States,English,$76.4–$83.3 million,126.0,2280000.0,83300000.0,...,,,,,,,,,,
4,The Reluctant Dragon,Walt Disney Productions,"[June 27, 1941]",74 minutes,United States,English,"$960,000 (worldwide rentals)",74.0,600000.0,960000.0,...,,,,,,,,,,


In [133]:
# Save to CSV

df.to_csv('movie_data_csv.csv')