<h1><center>DATA ACQUISITION</center></h1>

Predictive modeling depends crucially on the quality of the data. In the first part of this project, I obtained up-to-date information about more than 6000 films from their Wikipedia pages using the Wikipedia API, and from their IMDB pages using the OMDB API. Regular expressions were used to extract budget and box-office information (normalized to millions of US dollars). 

The Wikipedia API was also used to scrape information about the Academy Awards at two levels:
1. The awards and nominations for films in each of the major categories from 1940 to 2019
2. The awards and nominations for individuals in each category from 1940 to 2019

# Contents

1. [Obtaining Movie Information using Wikipedia and OMDB API](#movie_info)
2. [Data Acquisition](#acquisition)<br>
   i) [Web scraping](#scrape)<br>
   ii) [Challenges and Fixes](#challenges) <br>
   iii) [Creating Movies DataFrame](#dataframe)
3. [Obtaining Academy Award Information](#academy)<br>
   i) [For Films](#films)<br>
   ii) [For Individuals](#individuals)<br>


This project aims to analyze movie data from the IMDB database along three dimensions (commercial, critical and financial) and to predict the nominees and winners for the 92<sup>nd</sup> Academy Awards, due to be held in 2020. 

The Data Acquisition phase comprises two steps:
1. To obtain a list of English language films between the years 1955 and 2019
2. To obtain the complete list of Academy Award winners and nominees, both in the film and individual categories

By acquiring up-to-date data about Hollywood films from 1940 to 2019 from **Wikipedia** and the **IMDB** website, we will attempt to identify the key features that make films successful, popular and enduring. 

We will look at three key parameters:
1. Accolades given by The Academy of Motion Pictures
2. Box office revenues
3. IMDB ratings based on fan reviews

Based on these outcomes, we will explore the history of Hollywood films and the Academy Awards to see what makes movies resonate with fans and endure over time. Why do some films become commercially successful but fail to win critical acclaim? Is there a correlation between budget and box-office revenue, that is, do big-budget movies make more money? Is there any significant correlation between how fans and the Academy view a film's success? 

By pooling data from various sources and defining metrics that are simple and intuitive, we will try to uncover the secrets of great films. Using this knowledge, we will try to predict the nominees and winners for the upcoming Oscars due to be held in 2020. 

In [86]:
# import necessary libraries
import pandas as pd
import numpy as np
import requests
import wikipedia
import re
import matplotlib.pyplot as plt
from bs4 import BeautifulSoup
from urllib.request import urlopen
from scrapy import selector
import datetime as dt
import pickle
from skimage import io
from IPython.display import clear_output
import pdb
plt.style.use('ggplot')
%matplotlib inline

<a class="anchor" id="movie_info"></a>
# 1.0 Obtaining Movie List and Creating Initial Database

To analyze movie success and predict the Academy Award winners for the year 2020, our first step was to obtain a list of all major English films made between the years 1950 and 2019. To do that, we looked at the Wikipedia lists of American and British films made in each year of that period. 

In [9]:
%%time
primary_list = []
for year in range(1960,2020):
    if year%10==0:
        print(f"In the {year}'s")
    # Wikipedia list pages for the American and British films of this year
    my_url = ['List_of_American_films_of_'+ str(year), 'List_of_British_films_of_'+ str(year)]
    
    for url in my_url:
        page = wikipedia.page(url)
        soup = BeautifulSoup(page.html(),'lxml')
        tables = soup.find_all('table', class_ = 'wikitable')
        for table in tables:
            # Film titles are italicized in the list tables
            films = table.find_all('i')
            for film in films:
                # Keep only titles that link to their own Wikipedia page
                link = film.find_all('a', href=True, title = True)
                if len(link)==0:
                    continue
                primary_list.append((year,film.text.lower()))

        
pickle.dump(primary_list,open('my_data_4/PRIMARY_LIST_1960_2019', "wb" ))
len(primary_list)


In the 1960's
In the 1970's
In the 1980's
In the 1990's
In the 2000's
In the 2010's
CPU times: user 23.5 s, sys: 614 ms, total: 24.1 s
Wall time: 3min 18s


17776

In [51]:
%%time
##########GET ALL ##########
movie_dict_ALL = dict()
for i,(year, title) in enumerate(primary_list):
    if i%21 == 0:
        clear_output()
    movie_dict_ALL[(title,year)] = get_all_movie_info(title, year)
    pickle.dump(movie_dict_ALL,open( 'my_data_4/movie_dict_ALL', "wb" ))
        

#SAVE movie_dict_ALL
print(len(movie_dict_ALL))
pickle.dump(movie_dict_ALL,open( 'my_data_4/movie_dict_ALL', "wb" ))

<a class="anchor" id="scrape"></a>
# The Wikipedia API

This section defines a collection of helper functions built on the Wikipedia API. They standardize budgets (to millions of US dollars) and map genres to a predefined set of codes so that the final DataFrames are consistent.

The following functions are defined below:
1. wikiapi_film(title, year): returns the Wikipedia page object for the film, from which the **intro** and **infobox** HTML are extracted
2. get_genre(intro): scans the intro paragraphs of the Wikipedia page and identifies genres
3. movie_infobox_dict(infobox_items, intro, title, year): returns a dictionary with all the information
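
The page lookup tries the plain title first and then the title with a ` (film)` suffix. As a minimal sketch of that fallback pattern without any network access: `first_available`, `fake_fetch` and `fake_pages` are hypothetical names, and a plain `LookupError` stands in for the exceptions raised by the `wikipedia` package.

```python
def first_available(title, fetch, suffixes=("", " (film)")):
    """Try each candidate title in order; return (candidate, page) for the first hit."""
    for suffix in suffixes:
        candidate = title + suffix
        try:
            return candidate, fetch(candidate)
        except LookupError:
            # This candidate does not exist; try the next one
            continue
    return None, None


# Stand-in for a real page lookup: a small dictionary of known pages
fake_pages = {
    "annie hall": "<html>page for annie hall</html>",
    "carmilla (film)": "<html>page for carmilla (film)</html>",
}


def fake_fetch(name):
    # Raises KeyError (a subclass of LookupError) when the page is missing
    return fake_pages[name]


print(first_available("annie hall", fake_fetch)[0])
print(first_available("carmilla", fake_fetch)[0])
print(first_available("no such film", fake_fetch))
```

Keeping the candidate list in one place makes it easy to add further suffixes later, for example one that includes the release year, if that turns out to be needed.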

In [103]:
## WIKIPEDIA API Function: Returns page for specified film and year
def wikiapi_film(title, year):   
    '''Takes in a title and year, and returns the wikipedia.page object for the film
    (None if no page could be fetched). The year is only used to validate the input.
    ''' 
    if len(title)==0 or len(str(year)) == 0:
        return None
    
    # First try the plain title; fetch the page once and reuse it
    try:
        page = wikipedia.page(title)
        if page is not None:
            print(title)
            return page
    except wikipedia.exceptions.DisambiguationError:
        pass
    except wikipedia.exceptions.PageError:
        pass

    # Fall back to the title with a ' (film)' suffix
    title_with_suffix = title + ' (film)'
    try:
        page = wikipedia.page(title_with_suffix)
        if page is not None:
            print(title_with_suffix)
            return page
   
    except wikipedia.exceptions.DisambiguationError:
        print(f"\tCould not fetch information for {title}: Disambiguation error")
        return None
    except wikipedia.exceptions.PageError:
        print(f"\tThe Wikipedia page for the title {title} could not be found: Page Error ")
        return None

In [113]:
# Get the standardized genre specified in genre_dict file
genre_dict = {
        'romance': 'rom',
        'romantic': 'rom',
        'comedy': 'com',
        'musical':'mus',
        'animated':'ani',
        'superhero':'sup',
        'horror':'hor',
        'crime': 'cri',
        'war ':'war',
        'psychological':'psy',
        'psychology':'psy',
        'action':'act',
        'dystopian':'dys',
        'political':'pol',
        'spy':'spy',
        'science': 'sci',
        'sci-fi': 'sci',
        'adventure':'adv',
        'fantasy': 'fan',
        'biography': 'bio',
        'biographical': 'bio',
        'historical':'his',
        'mystery':'mys',
        'epic':'epi',
        'thrill':'thr',
        'drama':'dra',
        'monster': 'mon',
        'disaster':'dis',
        'other':'oth'
    }
pickle.dump(genre_dict,open( 'my_data_4/genre_dict', "wb" ))

del genre_dict['psychology']
del genre_dict['romantic']
del genre_dict['biography']
del genre_dict['science']

genre_dict2 = {}
for key,value in genre_dict.items():
    genre_dict2[value]= key
genre_dict2['other'] = 'other'
pickle.dump(genre_dict2,open( 'my_data_4/genre_dict2', "wb" ))
genre_dict2

{'rom': 'romance',
 'com': 'comedy',
 'mus': 'musical',
 'ani': 'animated',
 'sup': 'superhero',
 'hor': 'horror',
 'cri': 'crime',
 'war': 'war ',
 'psy': 'psychological',
 'act': 'action',
 'dys': 'dystopian',
 'pol': 'political',
 'spy': 'spy',
 'sci': 'sci-fi',
 'adv': 'adventure',
 'fan': 'fantasy',
 'bio': 'biographical',
 'his': 'historical',
 'mys': 'mystery',
 'epi': 'epic',
 'thr': 'thrill',
 'dra': 'drama',
 'mon': 'monster',
 'dis': 'disaster',
 'oth': 'other',
 'other': 'other'}

In [89]:
def get_genre(intro):
    """Returns the list of standardized genre codes found in the first sentence of the
    earliest intro paragraph that mentions a genre; returns ['other'] if none is found
    """
    genre_dict = pickle.load(open("my_data_4/genre_dict","rb"))
    this_movie_genres = []
    for line in intro: # go through all the paragraphs
        line = line.text.split('.')
        line = line[0]
        line = line.lower()
        for genre in genre_dict:
            if genre in line: # also catches a keyword at the very start of the sentence
                this_movie_genres.append(genre_dict[genre])
        
        if len(this_movie_genres)>0:
            return list(set(this_movie_genres))
    return ['other']
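
The keyword matching in `get_genre` can be tried out without fetching a page. The sketch below is an illustration only: it works on plain strings instead of BeautifulSoup tags, `GENRE_CODES` is a small subset of the genre table, and `match_genres` is a hypothetical name.

```python
# Subset of the keyword-to-code table, for illustration only
GENRE_CODES = {"romantic": "rom", "comedy": "com", "drama": "dra", "war ": "war"}


def match_genres(paragraphs, codes=GENRE_CODES):
    """Return the sorted genre codes found in the first sentence of the first
    paragraph that mentions any keyword; ['other'] if no paragraph does."""
    for paragraph in paragraphs:
        first_sentence = paragraph.split(".")[0].lower()
        found = {code for keyword, code in codes.items() if keyword in first_sentence}
        if found:
            return sorted(found)
    return ["other"]


intro = [
    "",  # an empty paragraph is simply skipped
    "Annie Hall is a 1977 American romantic comedy film directed by Woody Allen.",
]
print(match_genres(intro))
print(match_genres(["A paragraph without any genre keyword."]))
```

Because this is substring matching, a short keyword can also match inside a longer word: `"epic" in "depicts"` is `True`, so a sentence saying that a film "depicts" something would be tagged with the `epi` code.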

In [90]:
# Create dictionary from Wikipedia page of the film using the INFOBOX table

def movie_infobox_dict(infobox_items, intro, title, year):
    """Accepts the infobox rows and intro paragraphs of a Wikipedia page and returns a
    dictionary with the film's information (empty if the film is not in English or has no release year)
    """
    # initialize empty dictionary
    _movie_dict = dict()
    
    # Go through each infobox item and extract information
    for item in infobox_items:
        try:
            if len(item) < 2:
                continue
            
            #director
            if item.th.text.lower().find('direct')>-1:
                _movie_dict['director'] = []
                if item.td.find('div', class_="plainlist")==None: # single entries
                    for line in item.td.find_all('a'):
                        _movie_dict['director'].append(line.text)      
                else:
                    for line in item.td.find('div', class_="plainlist").find_all('li'):
                        _movie_dict['director'].append(line.text)
                if len(_movie_dict['director'])==0:
                    _movie_dict['director'] = item.td.text.split('br')
            #producer
            if item.th.text.lower().find('produce')>-1:
                _movie_dict['producer'] = []
                if item.td.find('div', class_="plainlist")==None: # single entries
                    for line in item.td.find_all('a'):
                        _movie_dict['producer'].append(line.text)    
                else:
                    for line in item.td.find('div', class_="plainlist").find_all('li'):
                        _movie_dict['producer'].append(line.text) 
                if len(_movie_dict['producer'])==0:
                    _movie_dict['producer'] = item.td.text.split('br')

            #cast
            if item.th.text.lower().find('star')>-1:
                _movie_dict['cast'] = []
                if item.td.find('div', class_="plainlist")==None: # single entries
                    for line in item.td.find_all('a'):
                        _movie_dict['cast'].append(line.text)
                else:
                    for line in item.td.find('div', class_="plainlist").find_all('li'):
                        _movie_dict['cast'].append(line.text)
                if len(_movie_dict['cast'])==0:
                    _movie_dict['cast'] = item.td.text.split('br')

            #screenplay
            if item.th.text.lower().find('screenplay')>-1 or item.th.text.lower().find('written')>-1:
                _movie_dict['screenplay'] = []
                if item.td.find('div', class_="plainlist")==None: # single entries
                    for line in item.td.find_all('a'):
                        _movie_dict['screenplay'].append(line.text)
                else:
                    for line in item.td.find('div', class_="plainlist").find_all('li'):
                        _movie_dict['screenplay'].append(line.text)
                if len(_movie_dict['screenplay'])==0:
                    _movie_dict['screenplay'] = item.td.text.split('br')

            #cinematography
            if item.th.text.lower().find('cinematography')>-1:
                _movie_dict['cinematography'] = []
                if item.td.find('div', class_="plainlist")==None: # single entries
                    for line in item.td.find_all('a'):
                        _movie_dict['cinematography'].append(line.text)
                else:
                    for line in item.td.find('div', class_="plainlist").find_all('li'):
                        _movie_dict['cinematography'].append(line.text)
                if len(_movie_dict['cinematography'])==0:
                    _movie_dict['cinematography'] = item.td.text.split('br')

            # music
            if item.th.text.lower().find('music')>-1 or item.th.text.lower().find('score')>-1:
                _movie_dict['music'] = []
                if item.td.find('div', class_="plainlist")==None: # single entries
                    for line in item.td.find_all('a'):
                        _movie_dict['music'].append(line.text)
                else:
                    for line in item.td.find('div', class_="plainlist").find_all('li'):
                        _movie_dict['music'].append(line.text)
                if len(_movie_dict['music'])==0:
                    _movie_dict['music'] = item.td.text
                
            # editing
            if item.th.text.lower().find('edit')>-1:
                _movie_dict['edit'] = []
                if item.td.find('div', class_="plainlist")==None: # single entries
                    for line in item.td.find_all('a'):
                        _movie_dict['edit'].append(line.text)
                else:
                    for line in item.td.find('div', class_="plainlist").find_all('li'):
                        _movie_dict['edit'].append(line.text)
                if len(_movie_dict['edit'])==0:
                    _movie_dict['edit'] = item.td.text.split('br')

            # Based on a book (Y/N)
            if item.th.text.lower().find('based')>-1:
                _movie_dict['book'] = 'yes'

            # Get budget
            if item.th.text.lower().find('budget')>-1:
                budget = item.td.text.strip()
                budget = re.sub(r'\[.+\]+', "", budget) #remove square bracket references
                _movie_dict['budget'] = currency_to_million(budget)

            # Get box office
            if item.th.text.lower().find('box')>-1:
                box_office = item.td.text.strip()
                box_office = re.sub(r'\[.+\]+', "", box_office)#remove square bracket references
                _movie_dict['box_office'] = currency_to_million(box_office)

            # Get running time
            if item.th.text.lower().find('running time')>-1:
                running_time = item.td.text.strip()
                running_time = re.findall(r'(\d+)\smin', running_time)
                if len(running_time)==0:
                    _movie_dict['running_time'] = 0
                    continue
                _movie_dict['running_time'] = int(running_time[0])

            # Language 
            if item.th.text.lower().find('language')>-1:
                language = item.td.text.strip()
                if (language.lower().find('english') == -1 and 
                    language.lower().find('silent') == -1): # film does not contain english
                    return dict()
                _movie_dict['language']= language.split('\n')

            # Release date 
            if item.th.text.lower().find('release')>-1:
                release = item.td.text.strip()
                release = re.findall(r'\d\d\d\d', release)
                if len(release)==0:
                    print(f'Year not found for {title} in {year}')
                    return dict()
                release = dt.datetime.strptime(release[0], '%Y').year
#                 if release !=year: # or len(str(release))==0:
#                     print(f'Wrong year for {title}! Corrected to wikipedia year')
#                     return dict()
#                     # _movie_dict['year']= year
#                 else:
                _movie_dict['year']=release
            
            
            # Get genre
            _movie_dict['genre'] = get_genre(intro)

        except AttributeError:
            print(f'\tCould not fetch info for {title} from infobox items')

    return _movie_dict


# Converts different denominations to millions of US dollars
def currency_to_million(money):
    ''' Returns the amount in millions of US dollars, using regular expressions:
        Accepts $12 million and returns 12.0
        Accepts $13,678,654 and returns 13.68
        Accepts $15-25 million and returns 25.0 (upper end of the range)
        Accepts $2 billion and returns 2000.0
        Amounts in pounds are multiplied by a fixed factor of 1.3
    '''
    if money is None or len(money)==0:
        return np.nan

    # Check for a dollar or pound sign, otherwise return nan
    money = money.lower()
    if money.find('$')>-1:
        factor=1
    elif money.find('£')>-1:
        factor = 1.3
    else:
        return np.nan
    
    money = re.sub(r'\[.*\]', '', money) #remove square bracket citation

    if money.find('illion')>0: # when currency expressed in million/billion

        # Billion: $12.4 billion
        # The hyphen is escaped so that the class matches only $, £, - and –
        reg = r"[$£\-–]([0-9.]+)\sbillion"
        num = re.findall(reg, money) # find number like $12.4 billion
        if len(num)>0:
            try:
                return round(float(num[0])*1e3*factor,2)
            except ValueError:
                return np.nan

        # Million: $6.8 million
        reg = r"[$£\-–]([0-9.]+)\smillion"
        num = re.findall(reg, money) # find number like $6.8 million
        if len(num)>0:
            try:
                return round(float(num[0])*factor,2)
            except ValueError:
                return np.nan

    else: # When currency not expressed in millions  
        reg = r"[$£]\s?([\d,]+)[\D\s]?"
        num = re.findall(reg, money)
        if len(num) >0:
            num = re.sub(r',', '', num[0])
            try:
                return round(float(num)/1e6*factor,2)
            except ValueError:
                return np.nan

    # No recognizable amount was found
    return np.nan
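
A note on the character class that precedes the number in the million and billion patterns. In a class written as `[\$-–]`, the unescaped hyphen defines a range that runs from `$` to the en dash, so it also matches digits, letters and the pound sign. The sketch below compares that form with a class in which the hyphen is escaped; the stricter class is one possible alternative, shown for illustration, and the sample strings are made up.

```python
import re

# Unescaped hyphen: the class spans every code point from '$' to the en dash
loose = re.compile(r"[\$-–]([0-9.]+)\smillion")
# Escaped hyphen: the class holds only the four listed characters
strict = re.compile(r"[$£\-–]([0-9.]+)\smillion")

# A letter falls inside the range of the loose class but not the strict one
print(bool(re.fullmatch(r"[\$-–]", "a")))
print(bool(re.fullmatch(r"[$£\-–]", "a")))

# Compare what each pattern extracts from two sample budget strings
for text in ["$15-25 million", "$10 to 12 million"]:
    print(text, loose.findall(text), strict.findall(text))
```

Both patterns pick the upper end of a hyphenated range. For a string such as `$10 to 12 million`, the loose class lets the digit `1` act as the leading character and captures only `2`, whereas the strict class finds no match, which is easier to detect and handle as a missing value.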


In [101]:
# title = 'annie hall'
# year = 1977
# page = wikiapi_film(title, year)
# soup = BeautifulSoup(page.html(),'lxml') 
# infobox_items = soup.find('table', class_ = 'infobox vevent').tbody.find_all('tr')
# infobox_items

In [91]:
# This is the main function that calls wikiapi_film, OMDB API and ancillary functions to get the dictionaries
def get_all_movie_info(title, year, skip_imdb = 0):
    """This is the MAIN function that uses previous functions to accept title and year and return 
    all the movie information, returns dictionary
    """
    
    # Wikipedia info
    page = wikiapi_film(title, year)
    if page == None:
        return dict()

    soup = BeautifulSoup(page.html(),'lxml') 
    
    # Get wikipedia introduction paragraph for genres
    try:
        intro = soup.find_all('p')
    except:
        print(f'Could not find p-tags info for {title}')
        return dict()
    
    # Get wikipedia infobox tables for all other information
    try:
        infobox_items = soup.find('table', class_ = 'infobox vevent').tbody.find_all('tr')
    except: 
        print(f'Could not find infobox for {title}')
        return dict()
        
    this_dict = movie_infobox_dict(infobox_items, intro, title, year) # get movie info dictionary
    if len(this_dict) == 0:
        print(f'Wikipedia could not get {title} in {year}')
        return dict()
    # return this_dict

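    # obtain json file from OMDB API and get info
    # Passing the title through `params` URL-encodes it (spaces, '&', apostrophes);
    # the fixed IMDb id from the sample URL is dropped so the lookup is by title only
    url_base = 'http://www.omdbapi.com/'
    r = requests.get(url_base, params={'apikey': '5db77b44', 't': str(title)})
    json_data = r.json()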
    if 'Error' in json_data:
        print(f'\t{title} not found in OMDB API')
        return dict()
    if skip_imdb == 1:
        return this_dict
    else:
        try:
            this_dict['imdbID'] = json_data['imdbID']
            this_dict['imdb_rating'] = float(json_data['imdbRating'])
            this_dict['n_votes'] = int(re.sub(r',','',json_data['imdbVotes']))
            this_dict['title'] =title
            return this_dict
        except:
            print(f'Unknown error in {title}: could not fetch info')
            return dict()
    
    

### Get Movie Information from 1950 to 2018

In [104]:
%%time
# Get movie_dict from df_imdb list and SAVE. df_imdb was created in an older version of the 
df_imdb = pd.read_csv('my_data_4/df_imdb.csv')
movie_dict = pickle.load(open("my_data_4/movie_dict","rb"))
year_start = 1950
year_stop = 2018
years = range(year_start,year_stop+1)
df_imdb.reset_index(inplace=True, drop=True)
for idx, row in df_imdb.iterrows():
    
    if idx%20==0: # prevent clutter, clear output after every 20 movies
        clear_output()
        
    title = row.title.lower()
    year = row.year
    
    if year not in years:
        continue
    
    # Get information and save in movie_dict (titles already fetched are skipped)
    if (title, year) not in movie_dict:
        movie_dict[(title,year)] = get_all_movie_info(title, year)

#Save dictionary       
pickle.dump(movie_dict,open( 'my_data_4/movie_dict_new2', "wb" ))

CPU times: user 2.13 s, sys: 413 ms, total: 2.54 s
Wall time: 10.8 s


### Get 2019 Movie Information

In [106]:
%%time
##########GET 2019 movies ##########
movie_dict_2019 = pickle.load(open("my_data_4/movie_dict_2019","rb"))

movies_all = pickle.load(open('my_data_4/PRIMARY_LIST_1960_2019', "rb" ))
skip_imdb = 0
for (year, title) in movies_all:
    if year!=2019:
        continue
    if (title,year) not in movie_dict_2019 or len(movie_dict_2019[(title,year)])==0:
        movie_dict_2019[(title,year)] = get_all_movie_info(title, year)


star wars: the rise of skywalker
Unknown error in star wars: the rise of skywalker: could not fetch info
alita: battle angel
tyler perry's a madea family funeral
	tyler perry's a madea family funeral not found in OMDB API
buffaloed
Unknown error in buffaloed: could not fetch info
darlin' (film)
Could not find infobox for darlin'
monos
Wikipedia could not get monos in 2019
star wars: the rise of skywalker
Unknown error in star wars: the rise of skywalker: could not fetch info
mr. jones
	mr. jones not found in OMDB API
for sama
Wikipedia could not get for sama in 2019
nomad: in the footsteps of bruce chatwin
Unknown error in nomad: in the footsteps of bruce chatwin: could not fetch info
first love
Wikipedia could not get first love in 2019
balance, not symmetry
Unknown error in balance, not symmetry: could not fetch info
carmilla
Could not find infobox for carmilla
carmilla
Could not find infobox for carmilla
shooting clerks
Year not found for shooting clerks in 2019
Wikipedia could not 

### Manual Imputations

For two of the 2019 movies, "Uncut Gems" and "Once Upon a Time in Hollywood", the IMDB information appeared to be corrupted. This information was therefore added manually.

In [93]:
# Filling in information for uncut gems
movie_dict_2019 = pickle.load(open("my_data_4/movie_dict_2019","rb"))
title = 'uncut gems'
year = 2019
movie_dict_2019[(title, year)]['imdbID'] = 'tt5727208'
movie_dict_2019[(title, year)]['imdb_rating'] = 8.1
movie_dict_2019[(title, year)]['n_votes'] = 12969
movie_dict_2019[(title, year)]['title']=title
movie_dict_2019[(title, year)]

title = 'once upon a time in hollywood'
year = 2019
movie_dict_2019[(title, year)]['imdbID'] = 'tt7131622'
movie_dict_2019[(title, year)]['imdb_rating'] = 7.8
movie_dict_2019[(title, year)]['n_votes'] = 303486
movie_dict_2019[(title, year)]['title']=title
# movie_dict_2019[(title, year)]


In [26]:
pickle.dump(movie_dict_2019,open( 'my_data_4/movie_dict_2019', "wb" ))

<a class="anchor" id="dataframe"></a>
## Create DataFrame from movie_dict

In [61]:
def create_dataframe_from_dict2(movie_dict, all_columns, years):
    """Accept the movie_info_dict and creates the DataFrame 
    containing budget, box_office, running_time, cast etc (mentioned in columns)
    Returns a dataframe
    """
    df = pd.DataFrame(columns = all_columns)
    rejects = []
    for i,(title,year) in enumerate(movie_dict):

        
        #Tracking display
        if i%20 == 0:
            clear_output()

        print(title,year)
        movie = movie_dict[title,year]
        if len(movie)==0:
            print(f'\tCould not prepare DataFrame for {title}. Movie info missing!')
            rejects.append((title,year))

            continue
        if 'year' not in movie:
            movie['year']=year

        if movie['year'] != year or movie['year'] not in years: # the year information is incorrect
            print(title, year)
            print('Broken!')
            rejects.append((title,year))
            continue

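        # `x in series` tests the index, so compare against the column values instead.
        # The row is still added; de-duplicate the returned DataFrame on imdbID afterwards.
        if movie.get('imdbID') in df.imdbID.values:
            print(f'{title} duplicated')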

        for col in all_columns:
            # print(i,title)
            if col in movie:
                if type(movie[col])!=list:
                    df.loc[i,col] = movie[col]
                elif type(movie[col])==list:
                    df.loc[i,col] = len(movie[col])
            else:
                rejects.append((title,year))

                
    return df, list(set(rejects))

In [62]:
%%time
# Create DataFrame up to year 2018 and SAVE as df_movies
movie_dict = pickle.load(open("my_data_4/movie_dict","rb"))
columns = ['imdbID','title','year','n_votes','imdb_rating','budget', 'box_office','cast', 'genre','running_time']
df_movies, rejected = create_dataframe_from_dict2(movie_dict, columns, range(1950,2019))
print(df_movies.shape)
df_movies = df_movies[df_movies.imdbID.duplicated() == False]
print('after dropna()',df_movies.shape)
print('Done!')

# Keep the year column as an integer year and reset the index
df_movies.year = [y.year for y in pd.to_datetime(df_movies.year.astype('int'), format='%Y')]
df_movies = df_movies.reset_index(drop=True)
df_movies.head()

mr. turner 2014
	Could not prepare DataFrame for mr. turner. Movie info missing!
begin again 2014
ida 2014
ex machina 2015
cinderella 2015
the lobster 2016
passengers 2016
13 hours: the secret soldiers of benghazi 2016
	Could not prepare DataFrame for 13 hours: the secret soldiers of benghazi. Movie info missing!
star wars: the last jedi 2017
victoria & abdul 2017
roma 2018
at eternity's gate 2018
the wife 2018
first reformed 2018
the ballad of buster scruggs 2018
isle of dogs 2018
never look away 2018
(6148, 10)
after dropna() (5867, 10)
Done!
CPU times: user 1min 36s, sys: 3.84 s, total: 1min 40s
Wall time: 1min 38s


| | imdbID | title | year | n_votes | imdb_rating | budget | box_office | cast | genre | running_time |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | tt0097626 | johnny handsome | 1989 | 8324 | 6.2 | 20 | 7.24 | 7.0 | 3 | 96.0 |
| 1 | tt0092997 | extreme prejudice | 1987 | 5649 | 6.7 | 22 | 1130780.0 | 3.0 | 3 | 104.0 |
| 2 | tt0076759 | star wars: episode iv - a new hope | 1977 | 1150440 | 8.6 | 1308 | 9323.0 | | 4 | |
| 3 | tt0120903 | x-men | 2000 | 545484 | 7.4 | 75 | 296.3 | 10.0 | 4 | 104.0 |
| 4 | tt0118200 | x | 2011 | 2395 | 6.2 | 1670 | 6020.0 | | 3 | |


## Adjusting for Inflation
Inflation chart obtained from https://www.usinflationcalculator.com/
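
The adjustment used in this notebook multiplies each dollar amount by 100 / CPI for the film's release year, which expresses it in dollars of the CPI base period (where the index equals 100). A minimal sketch of that arithmetic in plain Python rather than pandas: the two CPI values are back-computed from the `inflation_factor` column in this notebook's output, and the names `CPI_AVG` and `adjust_for_inflation` are illustrative only.

```python
# Annual average CPI by year (index = 100 in the base period).
# Only two years are listed here, for illustration.
CPI_AVG = {1989: 124.0, 2000: 172.2}


def adjust_for_inflation(amount_millions, year, cpi=CPI_AVG):
    """Express an amount (in millions of dollars) in base-period dollars."""
    factor = 100.0 / cpi[year]
    return amount_millions * factor


print(round(adjust_for_inflation(20, 1989), 3))  # a 20 million budget from 1989
print(round(adjust_for_inflation(75, 2000), 3))  # a 75 million budget from 2000
```

To express amounts in the dollars of a chosen target year instead, the factor would become `cpi[target_year] / cpi[year]`.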

In [63]:
# DF_MOVIES
# df_movies = pd.read_csv('my_data_4/df_movies.csv')
cpi = pd.read_csv('my_data_4/CPI_index.csv')
cpi.rename({'Year':'year'}, axis = 1,inplace=True)
cpi.set_index('year', inplace=True)
cpi.loc[2019,'Avg']=255.0 # 2019 information was missing and entered manually

cpi = cpi[['Avg']]
cpi['factor'] = 1/(cpi.Avg/100)
cpi.tail()

# Create new DataFrame adjusting for inflation
df_budget = df_movies[['year', 'budget']]                  
df_movies['inflation_factor'] = [cpi.loc[x,'factor'] for x in df_movies.year]
df_movies['budget_adjusted'] = df_movies['budget']*df_movies['inflation_factor']
df_movies['box_office_adjusted'] = df_movies['box_office']*df_movies['inflation_factor']

print(df_movies.shape)
df_movies.to_csv('my_data_4/df_movies_new.csv')

(5867, 13)


In [64]:
df_movies.head()

| | imdbID | title | year | n_votes | imdb_rating | budget | box_office | cast | genre | running_time | inflation_factor | budget_adjusted | box_office_adjusted |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | tt0097626 | johnny handsome | 1989 | 8324 | 6.2 | 20 | 7.24 | 7.0 | 3 | 96.0 | 0.806452 | 16.129 | 5.83871 |
| 1 | tt0092997 | extreme prejudice | 1987 | 5649 | 6.7 | 22 | 1130780.0 | 3.0 | 3 | 104.0 | 0.880282 | 19.3662 | 995409.0 |
| 2 | tt0076759 | star wars: episode iv - a new hope | 1977 | 1150440 | 8.6 | 1308 | 9323.0 | | 4 | | 1.650165 | 2158.42 | 15384.5 |
| 3 | tt0120903 | x-men | 2000 | 545484 | 7.4 | 75 | 296.3 | 10.0 | 4 | 104.0 | 0.58072 | 43.554 | 172.067 |
| 4 | tt0118200 | x | 2011 | 2395 | 6.2 | 1670 | 6020.0 | | 3 | | 0.444565 | 742.424 | 2676.28 |


## Create DataFrame for 2019

In [17]:
movie_dict_2019['uncut gems',2019]

{'director': ['Josh Safdie Benny Safdie'],
 'genre': ['cri', 'thr'],
 'producer': ['Scott Rudin', 'Eli Bush', 'Sebastian Bear-McClard'],
 'screenplay': ['Ronald Bronstein', 'Josh Safdie', 'Benny Safdie'],
 'cast': ['Adam Sandler',
  'Lakeith Stanfield',
  'Julia Fox',
  'Kevin Garnett',
  'Idina Menzel',
  'Eric Bogosian',
  'Judd Hirsch',
  'Keith Williams Richards',
  'Mike Francesa',
  'Jonathan Aranbayev',
  'Noa Fisher',
  'Abel Tesfaye'],
 'music': ['Daniel Lopatin'],
 'cinematography': ['Darius Khondji'],
 'edit': ['Ronald Bronstein', 'Benny Safdie'],
 'year': 2019,
 'running_time': 135,
 'language': ['English'],
 'box_office': 2.2}

In [95]:
%%time
# Create DataFrame for year 2019 and SAVE as df_movies_2019
movie_dict_2019 = pickle.load(open("my_data_4/movie_dict_2019","rb"))

columns = ['imdbID','title','year','n_votes','imdb_rating','budget', 'box_office','cast', 'genre','running_time']
df_movies_2019, rejected = create_dataframe_from_dict2(movie_dict_2019, columns,[2019])
print(df_movies_2019.shape)
df_movies_2019 = df_movies_2019[df_movies_2019.year==2019]
# df_movies_2019 = df_movies_2019.dropna()
print('after dropna()',df_movies_2019.shape)
print('Done!')

# Convert year column from integer to date_time and SAVE
# yr_column = pd.to_datetime(df_movies.year.astype('int'), format='%Y')
df_movies_2019.year = [y.year for y in pd.to_datetime(df_movies_2019.year.astype('int'), format='%Y')]
df_movies_2019.reset_index(inplace=True, drop=True)
df_movies_2019.head()

#df_movies_2019.to_csv('my_data_4/df_movies_2019_2.csv')

# Adjust for inflation
cpi = pd.read_csv('my_data_4/CPI_index.csv')
cpi.rename({'Year':'year'}, axis = 1,inplace=True)
cpi.set_index('year', inplace=True)
cpi.loc[2019,'Avg']=255.0 # 2019 information was missing and entered manually

cpi = cpi[['Avg']]
cpi['factor'] = 1/(cpi.Avg/100)
cpi.tail()

# Create new DataFrame adjusting for inflation
df_budget = df_movies_2019[['year', 'budget']]                  
df_movies_2019['inflation_factor'] = [cpi.loc[x,'factor'] for x in df_movies_2019.year]
df_movies_2019['budget_adjusted'] = df_movies_2019['budget']*df_movies_2019['inflation_factor']
df_movies_2019['box_office_adjusted'] = df_movies_2019['box_office']*df_movies_2019['inflation_factor']

print(df_movies_2019.shape)
df_movies_2019.to_csv('my_data_4/df_movies_2019.csv')

the courier 2019
the courier 2019
Broken!
old possum's book of practical cats 2019
	Could not prepare DataFrame for old possum's book of practical cats. Movie info missing!
henry iv, part 1 2019
	Could not prepare DataFrame for henry iv, part 1. Movie info missing!
henry iv, part 2 2019
	Could not prepare DataFrame for henry iv, part 2. Movie info missing!
henry v 2019
henry v 2019
Broken!
the surgeon of crowthorne 2019
	Could not prepare DataFrame for the surgeon of crowthorne. Movie info missing!
the queen's corgi 2019
robert the bruce 2019
pain and glory 2019
	Could not prepare DataFrame for pain and glory. Movie info missing!
transit 2019
	Could not prepare DataFrame for transit. Movie info missing!
(254, 10)
after dropna() (254, 10)
Done!
(254, 13)
CPU times: user 1.05 s, sys: 104 ms, total: 1.15 s
Wall time: 1.14 s


# 2.0 Data Acquisition for Academy Awards with Webscraping and API
<a id="acquisition"></a>

We will use BeautifulSoup for web scraping together with the Wikipedia API to obtain information about the most popular films released between 1960 and 2019 and the various Academy Award nominations and wins during that period. Our goal is to look at Hollywood films under three categories of performance (awards, critical review and revenue) to identify what predicts the success of a movie. 

In other words, we describe each film in terms of its measurable and identifiable features and see how they contributed to its success. Are big-budget movies more likely to win awards? Is it the size or the profile of the cast that is more likely to draw critical acclaim? Are longer movies more popular with both the award committee and the fans?

In order to obtain clean and reliable information about movies and their awards and nominations in the different film categories, we use two APIs: the Wikipedia API and the OMDB API. 

We also use the BeautifulSoup module along with regular expressions (re) to extract the relevant information from Wikipedia and IMDB. 

## 2.2 Web scraping from Wikipedia

The following function obtains information about a film, namely its director, cast, running time, budget and box office, as shown below for Lawrence of Arabia (1962). This information was used for predicting the Oscars for 2020. 
The information is returned as a Python dictionary.
<br>
<img src="files/wiki4.png" height="100" width="100" align="left" style="width:40%">
<img src="files/wiki2.png" height="100" width="100" align="center" style="width:40%">


## 2.1 The Wikipedia API
The following function accepts a title, year or category and obtains the HTML document for the relevant Wikipedia page. 

The Wikipedia pages for the Academy Awards expect requests not by year (e.g. "1974 Academy Awards") but in the form "N<sup>th</sup> Academy Awards", such as <a href="https://en.wikipedia.org/wiki/34th_Academy_Awards">"34th Academy Awards"</a>. The same Wikipedia API can fetch individual film information if the movie and year are specified, and Academy Awards information for a given year, as shown below.
<br>

#### Getting Oscar Editions
The Wikipedia page for The Academy Awards is referenced **not** by the year (eg. 1962 Academy Awards) but by its edition (34th Academy Awards) as shown below. 
<br>
<img src="files/wiki.jpeg" height="800" width="800" align="left">

In [66]:
# The Wikipedia pages for Academy Awards are listed in terms of their edition: 1st, 2nd, 3rd, etc.


def wikiapi_nth(year, award = ' Academy Awards'):
    year =get_edition(year)
    year_award_wikiformat = year + award
    try:
        return wikipedia.page(year_award_wikiformat)
    except wikipedia.exceptions.PageError:
        pass
    
    try:
        return wikipedia.page(year_award_wikiformat.replace(' ','_'))
    except wikipedia.exceptions.PageError:
        print(f"The page for the year {year_award_wikiformat} could not be found ")
        return

########################################################
def get_edition(year):
    '''Takes in a year and returns the edition of that year's Oscars,
    e.g. 1930 returns "2nd" (for the 2nd Academy Awards)'''
    
    # Ordinal suffix for each possible last digit of the edition number
    editions = ['th','st', 'nd', 'rd','th', 'th', 'th', 'th', 'th', 'th', 'th']
    # editions_dict = {}
    nth = year - 1928
    if nth>10 and nth<20:
        year_th = str(nth)+'th'
    else:
        year_th = str(nth) + editions[nth%10]
    return year_th


#########################################################
def get_genre(intro):
    """Takes in the intro paragraph from Wikipedia and scrapes out information about genre
    """
    genre_dict = pickle.load(open("my_data_2/genre_dict","rb"))
    this_movie_genres = []
    for line in intro: # go through all the lines
        line = line.text.split('.')
        line = line[0]
        line = line.lower()
        for genre in genre_dict:
            if genre in line:  # also matches a genre at the very start of the sentence
                this_movie_genres.append(genre_dict[genre])
        
        if len(this_movie_genres)>0:
            return list(set(this_movie_genres))
    return 
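
As a quick check of the year-to-edition mapping used above, here is a standalone sketch. The helper name `edition_for_year` is hypothetical; the 1928 offset is the one used in `get_edition`. It is not the notebook's function, only a compact restatement of the same rule that also covers 111th to 113th.

```python
def edition_for_year(year, first_year=1928):
    """Return the ordinal edition string for a ceremony year, e.g. 1962 -> '34th'."""
    nth = year - first_year
    # 11th, 12th and 13th (and 111th, ...) take 'th' despite ending in 1, 2, 3
    if 11 <= nth % 100 <= 13:
        suffix = "th"
    else:
        suffix = {1: "st", 2: "nd", 3: "rd"}.get(nth % 10, "th")
    return f"{nth}{suffix}"


for year in (1940, 1962, 2019):
    print(year, "->", edition_for_year(year) + " Academy Awards")
```

The resulting strings (for example "34th Academy Awards" for 1962) are the page titles passed to the Wikipedia API.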


In [67]:
# Categories definition
major_categories = ['picture','director','s_actor','s_actress','actor', 
                   'actress','screenplay']
minor_categories = ['music','cinematography','editing','effects','sound',
                    'costume','song', 'art_direction']
all_categories = major_categories + minor_categories

pickle.dump(major_categories,open( 'my_data_4/major_categories', "wb" ))
pickle.dump(minor_categories,open( 'my_data_4/minor_categories', "wb" ))
pickle.dump(all_categories,open( 'my_data_4/all_categories', "wb" ))

print(major_categories)
print(minor_categories)
print(all_categories)


['picture', 'director', 's_actor', 's_actress', 'actor', 'actress', 'screenplay']
['music', 'cinematography', 'editing', 'effects', 'sound', 'costume', 'song', 'art_direction']
['picture', 'director', 's_actor', 's_actress', 'actor', 'actress', 'screenplay', 'music', 'cinematography', 'editing', 'effects', 'sound', 'costume', 'song', 'art_direction']


In [68]:
def get_main_category(category):
    """This function accepts variants of basic categories such as 
    'Best Motion Picture' and 'Best Picture' and 'Outstanding Picture'
    and returns "picture"
    """
    category = category.lower()

    if category.find('story')>=0:
        return 'other'
    if category.find('best picture')>=0 or category.find('best motion picture')>=0:
        return 'picture'
    if category.find('outstanding production')>=0 or category.find('outstanding picture')>=0:
        return 'picture'
    if category.find('actor')>=0:
        if category.find('supporting')>=0:
            return 's_actor'
        else:
            return 'actor'   
    if category.find('actress')>=0:
        if category.find('supporting')>=0:
            return 's_actress'
        else:
            return 'actress'    
    if category.find('best director')>=0:
        return 'director'
    if category.find('screenplay')>=0:
        return 'screenplay'
    if category.find('music')>=0 or (category.find('scor')>=0):
        return 'music'
    if category.find('costume')>=0:
        return 'costume'
    if category.find('editing')>=0:
        return 'editing'
    if category.find('effects')>=0:
        return 'effects'
    if category.find('cinematography')>=0:
        return 'cinematography'
    if category.find('sound')>=0:
        return 'sound'
    if category.find('song')>=0:
        return 'song'
    if category.find('art')>=0 and category.find('direct')>=0:
        return 'art_direction'
    # if category.find('art direction')>=0:
        # return 'art direction'
    else:
        # print(f'Warning:{category} did not get matched!')
        return 'other'
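
The order of the checks in `get_main_category` matters: 'story' is excluded first, and 'supporting' is tested before the plain acting categories. The same idea can be sketched as an ordered rule table. The names `RULES` and `normalize_category` are hypothetical, and only a subset of the categories is shown.

```python
# Ordered (keywords, label) rules: the first rule whose keywords ALL occur
# in the lower-cased header wins, so more specific rules must come first.
RULES = [
    (("story",), "other"),
    (("best picture",), "picture"),
    (("best motion picture",), "picture"),
    (("outstanding production",), "picture"),
    (("outstanding picture",), "picture"),
    (("supporting", "actor"), "s_actor"),
    (("actor",), "actor"),
    (("supporting", "actress"), "s_actress"),
    (("actress",), "actress"),
    (("best director",), "director"),
    (("art", "direct"), "art_direction"),
]


def normalize_category(raw):
    """Map a raw Wikipedia category header to a standard label."""
    text = raw.lower()
    for keywords, label in RULES:
        if all(keyword in text for keyword in keywords):
            return label
    return "other"


for header in ("Best Motion Picture", "Best Supporting Actress", "Best Story"):
    print(header, "->", normalize_category(header))
```

A table like this keeps the precedence visible in one place, at the cost of being less explicit than the chain of `if` statements.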

## Film Awards and Nominations
The following function obtains award information at the level of the movie and NOT the individual winners.

In [69]:
# Awards and Nominations
def awards_and_nominations(year, award = ' Academy Awards', all_categories = 'all'):
    '''This function accepts year, title or award and goes into the Wikipedia page of the Award 
    or the Movie and extracts all the necessary information.
    '''
    
    if all_categories=='all':
        all_categories = pickle.load(open("my_data_4/all_categories","rb")) 

    
    # initialize empty DataFrame and the corresponding Screening Number
    oscars_wn = pd.DataFrame()
    missing_categories = []
    winner_list = dict()
    nominee_list = dict()
    
    # Get the html file from the Wikipedia page using the wikipedia API.
    # Parse it with BeautifulSoup
    page = wikiapi_nth(year)
    soup = BeautifulSoup(page.html(),'lxml')
    print(get_edition(year))
        
    # Get the table-body (tbody) from the wikipedia page where Oscars information are stored
    tbody = soup.body.find('table', class_="wikitable").find('tbody')
       
    # Make sure number of cells and header match
    if len(tbody.find_all('td')) !=len(tbody.find_all(['div', 'th'])):
    
        # The wikipedia tables needs to be fixed! Some category header may be missing.
        print('Warning: Number of cell <td> element  and header <th> does not add up for year:', year)
        print('Returned Empty dataFrame')
        return oscars_wn, missing_categories
        
        
    # Get winners and nominees
    try:
        for td,th in zip(tbody.find_all('td'),tbody.find_all(['div', 'th'])):
            cat = th.text.strip()
            # Get standardized categories
            category = get_main_category(cat)
            if category not in all_categories:
                continue
            winner_list[category] = [] # initialize empty lists for winners and nominees
            nominee_list[category] = []

            if category=='other':
                missing_categories.append(cat)
                # print(f'Warning in {category} in {year}')

            # Go into the list and look at every line
            for tli in td.find_all('li'):
                # go down each line (li) and check if it is bolded
                if tli.find('b')!= None: # winner
                    winner = tli.find('i').text.strip()       # get italicized movie
                    winner = re.sub(r'–',"", winner).strip()
                    winner = winner.lower()
                    winner_list[category].append(winner) # add movie to winner list
                    oscars_wn.loc[winner,'year']= int(year)-1
                    oscars_wn.loc[winner,category]='W'
                    
                elif tli.find('b')== None: #nominee
                    nominee = tli.find('i').text.strip()      # get italicized movie
                    nominee = re.sub(r'–',"", nominee).strip()
                    nominee = nominee.lower()
                    nominee_list[category].append(nominee)
                    oscars_wn.loc[nominee,'year']= int(year)-1
                    if nominee in winner_list[category]: # same movie already won in this category: mark as 'WN'
                        oscars_wn.loc[nominee,category]='WN'
                    else:
                        oscars_wn.loc[nominee,category]='N'

        
    except AttributeError:
        print(f'Warning: in Category {category} for Year: {year}')
            

    # oscars_wn['film'] = oscars_wn.index
    return oscars_wn.fillna('O'), missing_categories
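
The result codes produced by the function above are 'W' (win), 'N' (nomination), 'WN' (a win plus a further nomination in the same category) and 'O' (neither). Here is a minimal sketch of that encoding in plain Python, without the scraping. The input tuples are a hand-written subset of the 1939 results shown later in this notebook; the variable names are hypothetical.

```python
CATEGORIES = ["actress", "s_actress"]

# (film, category, is_winner) in page order: the winner is listed first.
entries = [
    ("gone with the wind", "actress", True),
    ("dark victory", "actress", False),
    ("love affair", "actress", False),
    ("gone with the wind", "s_actress", True),
    ("gone with the wind", "s_actress", False),
    ("love affair", "s_actress", False),
]

table = {}                                   # film -> {category: code}
winners = {cat: set() for cat in CATEGORIES}

for film, cat, is_winner in entries:
    row = table.setdefault(film, {})
    if is_winner:
        winners[cat].add(film)
        row[cat] = "W"
    elif film in winners[cat]:
        row[cat] = "WN"                      # won and also nominated again
    else:
        row[cat] = "N"

# Fill untouched cells with 'O', as fillna('O') does in the function above
for row in table.values():
    for cat in CATEGORIES:
        row.setdefault(cat, "O")

for film, row in table.items():
    print(film, row)
```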

In [70]:
%%time
# GET MOVIE AWARDS FROM ALL CATEGORIES

df_oscars_wide = pd.DataFrame()
for year in range(1940,2020):
    if year%10==0:
        clear_output()
        print(f'In year {year}')
    df, missing = awards_and_nominations(year)
    df_oscars_wide = pd.concat([df_oscars_wide, df])  # DataFrame.append was removed in pandas 2.0

# Convert year to datetime, remove NaNs and SAVE
df_oscars_wide.year = df_oscars_wide.year.astype('int').astype('str')
x = pd.to_datetime(df_oscars_wide.year, format='%Y', exact=True)
df_oscars_wide.year = [x.year for x in pd.to_datetime(df_oscars_wide.year, format='%Y', exact=True)]
df_oscars_wide.reset_index(inplace=True)
df_oscars_wide.rename(columns = {'index':'film'},inplace=True)
print(df_oscars_wide.head())

# Prepare the long version
df_oscars_long = pd.melt(df_oscars_wide, id_vars = ['film', 'year'], var_name='category', value_name = 'result')
print(df_oscars_long.head())


In year 2010
82nd
83rd
84th
85th
86th
87th
88th
89th
90th
91st
                           film actor actress art_direction cinematography  \
0            gone with the wind     N       W             W              W   
1                  dark victory     O       N             O              O   
2            goodbye, mr. chips     W       N             O              O   
3                   love affair     O       N             N              O   
4  mr. smith goes to washington     N       O             N              O   

  costume director editing effects music picture s_actor s_actress screenplay  \
0     NaN        W       W       N     N       W       O        WN          W   
1     NaN        O       O       O     N       N       O         O          O   
2     NaN        N       N       O     O       N       O         O          N   
3     NaN        O       O       O     O       N       O         N          O   
4     NaN        N       N       O     N       N       N       

In [75]:
df_oscars_wide.fillna('O',inplace=True)
print(df_oscars_wide.shape)
print(df_oscars_long.shape)
df_oscars_wide.rename(columns = {'film':'title'},inplace=True)
df_oscars_wide.info()

(2547, 17)
(38205, 4)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2547 entries, 0 to 2546
Data columns (total 17 columns):
title             2547 non-null object
actor             2547 non-null object
actress           2547 non-null object
art_direction     2547 non-null object
cinematography    2547 non-null object
costume           2547 non-null object
director          2547 non-null object
editing           2547 non-null object
effects           2547 non-null object
music             2547 non-null object
picture           2547 non-null object
s_actor           2547 non-null object
s_actress         2547 non-null object
screenplay        2547 non-null object
song              2547 non-null object
sound             2547 non-null object
year              2547 non-null int64
dtypes: int64(1), object(16)
memory usage: 338.4+ KB


In [80]:
df_oscars_wide.to_csv('my_data_4/df_oscars_wide.csv')

In [85]:
df_oscars_long = pd.melt(df_oscars_wide, id_vars = ['title', 'year'], var_name='category', value_name = 'result')
print(df_oscars_long.head())

                          title  year category result
0            gone with the wind  1939    actor      N
1                  dark victory  1939    actor      O
2            goodbye, mr. chips  1939    actor      W
3                   love affair  1939    actor      O
4  mr. smith goes to washington  1939    actor      N
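
To make the reshaping concrete, here is what `pd.melt` does to the wide table, written out in plain Python on two toy rows taken from the output above. This is an illustration of the transformation, not the notebook's code; the variable names are hypothetical.

```python
# Two rows and two category columns from the wide table
wide = [
    {"title": "gone with the wind", "year": 1939, "actor": "N", "actress": "W"},
    {"title": "dark victory", "year": 1939, "actor": "O", "actress": "N"},
]

id_vars = ("title", "year")
categories = [col for col in wide[0] if col not in id_vars]

# One output row per (category, film) pair. Like pd.melt, all rows for the
# first category come first, then all rows for the next category.
long_rows = [
    {"title": row["title"], "year": row["year"], "category": cat, "result": row[cat]}
    for cat in categories
    for row in wide
]

for record in long_rows:
    print(record)

# The long table always has (films x categories) rows and 4 columns
print(len(long_rows), "rows")
```

The same arithmetic explains the shapes reported above: 2547 films with 15 category columns (17 columns minus title and year) give 2547 × 15 = 38205 long rows.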


In [84]:
df_oscars_long.to_csv('my_data_4/df_oscars_long.csv')

## Individual Awards and Nomination
The previous function obtained information about the various winners and nominees at the movie level. The next function obtains information in the individual categories.

In [None]:
# Awards and Nominations for individuals: actor, actress, director
def individual_awards_and_nominations(year, get_categories ='all', award =' Academy Awards'):
    '''This function accepts year, title or award and goes into the wikipedia page of the Award 
    or the Movie and extracts all the necessary information for INDIVIDUAL winners, example 
    Steven Spielberg in directing and Tom Hanks in Actor category etc. 
    '''
    
    if get_categories == 'all':
        get_categories = pickle.load(open("my_data_3/main_categories","rb"))

    
    # initialize empty DataFrame and the corresponding Screening Number
    individual_wn = pd.DataFrame()
    missing_categories = []
    idx = 0
    
    # Get the html file from the Wikipedia page using the wikipedia API.
    # Parse it with BeautifulSoup
    page = wikiapi_nth(year)
    soup = BeautifulSoup(page.html(),'lxml')
        
    # Get the table-body (tbody) from the wikipedia page where Oscars information are stored
    tbody = soup.body.find('table', class_="wikitable").find('tbody')
       
    # Make sure number of cells and header match
    if len(tbody.find_all('td')) !=len(tbody.find_all(['div', 'th'])):
    
        # The wikipedia tables needs to be fixed! Some category header may be missing.
        print('Warning: Number of cell <td> element  and header <th> does not add up for year:', year)
        print('Returned Empty dataFrame')
        return individual_wn, missing_categories
        
        
    # Get winners and nominees
    try:
        for td,th in zip(tbody.find_all('td'),tbody.find_all(['div', 'th'])):
            cat = th.text.strip()
            category = get_main_category(cat)
            # print('\nCategory:', category)
            if category not in get_categories:
                # print(year, category)
                continue

            # Go into the list and remove the film names in <i> italic tags
            for line in td.find_all('li'):
                if line.find('i') != None:
                    line.i.decompose() # remove film
                    
                if line.find('b')!= None: # if in bold then winner
                    for wins in line.find_all('b'):
                        if wins.find('i') != None: # remove film
                            wins.i.decompose()
                            
                        if wins.find('a')!=None: # winner will be the first hyperlink
                            winner = wins.find_all('a')[0].text.strip()
                        else:
                            winner = wins.text.strip() #if no hyperlink, winner is first text
                        winner = re.sub(r'–',"", winner).strip()
                        if winner != None:
                            individual_wn.loc[idx,'year']= int(year)
                            individual_wn.loc[idx, 'name']= winner
                            individual_wn.loc[idx, 'category']= category
                            individual_wn.loc[idx, 'result']= 'W'
                            idx = idx + 1
                
                elif line.find('b') == None: # if no bold then not a winner
                    if line.find('a') != None:
                        nominee = line.find_all('a')[0].text.strip()
                    else:
                        nominee = line.text.strip()
                    nominee = re.sub(r'–',"", nominee).strip()
                    if nominee != None:
                        individual_wn.loc[idx,'year']= int(year)
                        individual_wn.loc[idx, 'name']= nominee
                        individual_wn.loc[idx, 'category']= category
                        individual_wn.loc[idx, 'result']= 'N'
                        idx = idx + 1

        
    except AttributeError:
        print(f'Warning: in Category {category} for Year: {year}')

    return individual_wn, missing_categories
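
Both scraping functions clean each extracted name the same way: remove the en dash that separates a person from the film on the Wikipedia page, then trim whitespace. A small sketch of that step follows. The helper name `clean_name` and the sample strings are hypothetical.

```python
import re


def clean_name(text, lowercase=False):
    """Strip en dashes and surrounding whitespace from a scraped name."""
    cleaned = re.sub(r"–", "", text).strip()
    return cleaned.lower() if lowercase else cleaned


# Person names keep their case; film titles are lower-cased in the notebook
print(repr(clean_name("Tom Hanks – ")))
print(repr(clean_name(" Lawrence of Arabia –", lowercase=True)))
```

Only the en dash is removed here, matching the pattern used in the functions above; other separators would need their own handling.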

In [None]:
%%time
# GET INDIVIDUAL AWARDS FROM ALL CATEGORIES
get_categories = 'all'
df_individual_long = pd.DataFrame()
for year in range(1940,2020):
    if year%10==0:
        print(f'In year {year}')
    df, missing = individual_awards_and_nominations(year)
    df_individual_long = pd.concat([df_individual_long, df])  # DataFrame.append was removed in pandas 2.0


# Convert year to datetime, and SAVE
df_individual_long.year = df_individual_long.year.astype('int').astype('str')
x = pd.to_datetime(df_individual_long.year, format='%Y', exact=True)
df_individual_long.year = [x.year for x in pd.to_datetime(df_individual_long.year, format='%Y', exact=True)]
df_individual_long.reset_index(inplace=True)
df_individual_long.to_csv('my_data_3/df_individual_long.csv')

## Get Missing Movies

Some movies that have received Academy Award nominations are missing from the movie dictionary and DataFrame because they were not listed in the original IMDB dataset. These movies will now be included.
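
The lookup itself is a membership test on `(title, year)` keys: a film counts as missing if its key is absent or maps to an empty entry. Here is a minimal sketch on toy data. The helper `find_missing` and the dictionary contents are hypothetical; the 1965 cutoff mirrors the one used in the cell below.

```python
def find_missing(nominated, movie_dict, min_year=1965):
    """Return (title, year) keys that are absent from, or empty in, movie_dict."""
    return [
        (title, year)
        for title, year in nominated
        if year >= min_year and not movie_dict.get((title, year))
    ]


# Toy dictionary keyed by (title, year); an empty dict means a failed fetch
movie_dict = {
    ("blowup", 1966): {},
    ("m*a*s*h", 1970): {"title": "m*a*s*h"},
}

nominated = [
    ("blowup", 1966),
    ("m*a*s*h", 1970),
    ("gambit", 1966),
    ("love affair", 1939),   # before the cutoff, so it is skipped
]

print(find_missing(nominated, movie_dict))
```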

In [33]:
df_oscars_wide = pd.read_csv('my_data_4/df_oscars_wide.csv', index_col=[0])
movie_dict = pickle.load(open("my_data_4/movie_dict","rb"))
df_movies = pd.read_csv('my_data_4/df_movies.csv', index_col=[0])


In [62]:
try:
    for row in df_oscars_wide.iterrows():
        idx = row[0]
        title = row[1]['title']  # the 'film' column was renamed to 'title' before the CSV was saved
        year = row[1].year
        if year < 1965:
            continue
        if (title, year) not in movie_dict or len(movie_dict[title,year])==0:
            print(f'Fetching information for {title},{year}')
            movie_dict[(title,year)] = get_all_movie_info(title, year)
except wikipedia.exceptions.WikipediaException:
    pass
    

Fetching information for the pleasure seekers,1965
the pleasure seekers
Could not find infobox for the pleasure seekers
Fetching information for blowup,1966
blowup
Unknown error in blowup: could not fetch info
Fetching information for return of the seven,1966
return of the seven
	return of the seven not found in OMDB API
Fetching information for gambit,1966
gambit
Could not find infobox for gambit
Fetching information for mandragola,1966
mandragola
mandragola (film)
Wikipedia could not get mandragola in 1966
Fetching information for rachel, rachel,1968
rachel, rachel
Fetching information for the fox,1968
the fox
Could not find infobox for the fox
Fetching information for chitty chitty bang bang,1968
chitty chitty bang bang
Fetching information for goodbye, mr. chips,1969
goodbye, mr. chips
Could not find infobox for goodbye, mr. chips
Fetching information for m*a*s*h,1970
m*a*s*h
Could not find infobox for m*a*s*h
Fetching information for i girasoli,1970
i girasoli
Wikipedia could not 

In [107]:
%%time
# Create DataFrame up to year 2018 and SAVE as df_movies
# movie_dict = pickle.load(open("my_data_4/movie_dict","rb"))
columns = ['imdbID','title','year','n_votes','imdb_rating','budget', 'box_office','cast', 'genre','running_time']
df_movies, rejected = create_dataframe_from_dict2(movie_dict, columns, range(1950,2019))
print(df_movies.shape)
df_movies = df_movies.dropna()
df_movies = df_movies[df_movies.imdbID.duplicated() == False]
print('after dropna()',df_movies.shape)
print('Done!')

# Convert year column from integer to date_time and SAVE
# yr_column = pd.to_datetime(df_movies.year.astype('int'), format='%Y')
df_movies.year = [y.year for y in pd.to_datetime(df_movies.year.astype('int'), format='%Y')]
clear_output()
df_movies.reset_index()

CPU times: user 1min 30s, sys: 4.16 s, total: 1min 34s
Wall time: 1min 35s


Unnamed: 0,index,imdbID,title,year,n_votes,imdb_rating,budget,box_office,cast,genre,running_time
0,0,tt0097626,johnny handsome,1989,8324,6.2,20,7.24,7,3,96
1,1,tt0092997,extreme prejudice,1987,5649,6.7,22,1.13078e+06,3,3,104
2,3,tt0120903,x-men,2000,545484,7.4,75,296.3,10,4,104
3,5,tt0077975,national lampoon's animal house,1978,106817,7.5,3,141.6,6,1,109
4,6,tt0091757,pirates,1986,7361,6.1,1274,4560,6,3,726
...,...,...,...,...,...,...,...,...,...,...,...
4891,6506,tt0452624,the good german,2006,23319,6,32,6,3,4,105
4892,6507,tt0497116,an inconvenient truth,2006,78298,7.4,1.5,49.8,1,2,97
4893,6511,tt0857191,the visitor,2008,40468,7.6,4,18.1,4,1,103
4894,6518,tt1667353,mirror mirror,2012,80479,5.6,85,183,7,5,106
