# 1.0 Data Acquisition with Webscraping and API

Webscraping and API to get all the information about a movie from one given year. Out goial is to look at hollywood films under three categories of performance: awards, critical review and revenue to identify what predicts the success of a movie. 

In other words, we see each film in terms of its measurable and identifiable features and see hwo they contributed to the success of a movie. Are big budget movies more likely to win awards? Is it the size or profile of the cast size that are likely to draw a more critically acclaim? Are longer movies more popular among the award committe and the fans alike?

In order to obtain clean and reliable movie information and their awards and nomination in the different film categories, we use two APIs: the wikiedpa API and the omdb API. 

We also use the BeautifulSoup module aloing with regular expressions (re) to extract the various information from wikipedia and IMDB to obtain the most reliable information about films. 


In [1]:
# import necessary libraries
import pandas as pd
import numpy as np
import requests
import wikipedia
import re
from bs4 import BeautifulSoup
from urllib.request import urlopen
from scrapy import selector
import datetime as dt
import pickle

# Load imdb movie set
# df_imdb = pd.read_excel('Movies/statcrunch_IMDB.xlsx')
# df_imdb = df_imdb[df_imdb.votes>1000]

In [2]:
pd

<module 'pandas' from '/Users/ranitsengupta/opt/anaconda3/lib/python3.7/site-packages/pandas/__init__.py'>

# The Wikipedia API
The following function accepts a title, year or category and obtains the HTML document for the relevant wikipedia page. 

The wikpedia page works most consistently when the get request is submitted as "Nth Academy Awards", such as "24th Academy Awards". The same wikipedia API can can individual film information if movie and year are specified, and The Academy Awards information for a given year.


In [2]:
pd

<module 'pandas' from '/Users/ranitsengupta/opt/anaconda3/lib/python3.7/site-packages/pandas/__init__.py'>

In [3]:
## WIKIPEDIA API Function

# Define function fr obtaining movie data using Wikipedia API
def wikiapi(title = "", award = "", year = ""):   
    '''
    Takes in a title or a year and returns the wikipedia page
    ''' 
    # When both title and year specified, fetches movie info
    if len(title)>0 and len(year) > 0:
        title_wikiformat = title + ' (film)'
        try:
            return wikipedia.page(title_wikiformat)  
        except wikipedia.exceptions.PageError:
            print(f"The page for the title {title} could not be found ")
            return ""
        except wikipedia.exceptions.DisambiguationError:
            print(f"Multiple pages for the title {title} ")
            return ""
        
    if len(title)>0 and len(year) > 0:
        title_wikiformat = title + f' ({str(year)} film)'
        try:
            return wikipedia.page(title_wikiformat)  
        except wikipedia.exceptions.PageError:
            print(f"The page for the title {title} could not be found ")
            return ""
        except wikipedia.exceptions.DisambiguationError:
            print(f"Multiple pages for the title {title} ")
            return ""
    
        
            
    # Fetches award info for that year        
    if len(award)> 0 and len(year) > 0:
        year_award_wikiformat = year + award
        try:
            return wikipedia.page(year_award_wikiformat)
        except wikipedia.exceptions.PageError:
            print(f"The page for the year {year_award_wikiformat} could not be found ")
            return ""
    else:
        return ""

# Standardizing Oscar Categories
The names of categories underwent many changes over the years. New categories were added and older ones were dropped. "Best Motion Picture" was changed to "Best Picture". The next pice of the code identifies these different categories and puts them in one of the 13 predecided categories with the variants within parentheses:
1. picture (Best Picture or Best Motion Picture)<br>
2. director (Best Director)
3. actress (Best Actress)
4. actor (Best Actor)
5. supporting actress (Best Supporting Actress)
6. supporting actor (Best Supporting Actor)
7. screenplay(Best Adapted/Original Screenplay)
8. music (Best Music/Musical Score/Music Composition)
9. cinematography (Best Cinematography Color/Black and White)
10. editing (Best Editing Special/Sound)
11. special effects (Best Special Effects)
12. sound (Best Sound Design, Sound Effects)
13. costume (Best Costume Design)
    

In [16]:
# Prediceded list of award categories
main_categories = ['picture','director','s_actor','s_actress','actor', 
                        'actress','screenplay','music','cinematography',
                        'editing','special effects','sound','costume']

                    
# print(main_categories)
def get_main_category(category):
    category = category.lower()

    if category.find('story')>=0:
        return 'other'
    if category.find('best picture')>=0 or category.find('best motion picture')>=0:
        return 'picture'
    if category.find('actor')>=0:
        if category.find('supporting')>=0:
            return 's_actor'
        else:
            return 'actor'   
    if category.find('actress')>=0:
        if category.find('supporting')>=0:
            return 's_actress'
        else:
            return 'actress'    
    if category.find('best director')>=0:
        return 'director'
    if category.find('screenplay')>=0:
        return 'screenplay'
    if category.find('music')>=0 or (category.find('scor')>=0):
        return 'music'
    if category.find('costume')>=0:
        return 'costume'
    if category.find('editing')>=0:
        return 'editing'
    if category.find('effects')>=0:
        return 'effects'
    if category.find('cinematography')>=0:
        return 'cinematography'
    if category.find('sound')>=0:
        return 'sound'
    # if category.find('art direction')>=0:
        # return 'art direction'
    else:
        # print(f'Warning:{category} did not get matched!')
        return 'other'

    # The Wikipedia pages for Academy Awards are listed in terms of their edition: 1st, 2nd, 3rd, etc.
def get_edition(year):
    
    '''Takes in a year and returns the edition of the given year's oscars
    eg: 1930 --> 2nd'''
    
    editions = '0th, 1st, 2nd, 3rd, 4th, 5th, 6th, 7th, 8th, 9th'.split(',')
    editions = ['th','st', 'nd', 'rd','th', 'th', 'th', 'th', 'th', 'th', 'th']
    editions_dict = {}
    nth = year - 1928
    if nth>10 and nth<20:
        year_th = str(nth)+'th'
    else:
        year_th = str(nth) + editions[nth%10]
    return year_th


In [5]:
# Awards and Nominations

def awards_and_nominations(year, award = ' Academy Awards'):
    '''
    This function accepts year, title or award and goes into the wikipedia page of the Award 
    or the Movie and extrac t all the necessary information.
    '''
    
    # initialize empty DataFrame and the corresponding Screening Number
    oscars_wn = pd.DataFrame()
    missing_categories = []
    
    # Get the html file from the Wikipedia page using the wikipedia API.
    # Parse it with BeautifulSoup
    page = wikiapi(year=get_edition(year), award = award)
    soup = BeautifulSoup(page.html(),'html5lib')
        
    # Get the table-body (tbody) from the wikipedia page where Oscars information are stored
    tbody = soup.body.find('table', class_="wikitable").find('tbody')
       
    # Make sure number of cells and header match
    if len(tbody.find_all('td')) !=len(tbody.find_all(['div', 'th'])):
    
        # The wikipedia tables needs to be fixed! Some category header may be missing.
        print('Warning: Number of cell <td> element  and header <th> does not add up for year:', year)
        print('Returned Empty dataFrame')
        return oscars_wn, missing_categories
        
        
    # Get winners and nominees
    try:
        for td,th in zip(tbody.find_all('td'),tbody.find_all(['div', 'th'])):
            cat = th.text.strip()
            category = get_main_category(cat)
            # print('\nCategory:', category)
            if category=='other':
                missing_categories.append(cat)
                continue

            # Go into the list and get the winners in bold
            for tli in td.find_all('li'):

                # Get the winners in bold
                for tb in tli.find_all('b'):
                    winner = tb.find('i').text.strip()
                    winner = re.sub(r'–',"", winner).strip()
                    if winner != None:
                        oscars_wn.loc[winner,'year']= int(year)-1
                        oscars_wn.loc[winner,category]='W'

                # Get the nominees from the next generation within the list
                for tli2 in tli.find_all('li'):
                    nominee = tli2.find('i').text.strip()
                    nominee = re.sub(r'–',"", nominee).strip()
                    if nominee != None:
                        oscars_wn.loc[nominee,'year']= int(year)-1
                        oscars_wn.loc[nominee,category]='N'
        
    except AttributeError:
        print(f'Warning: in Category {category} for Year: {year}')
            

    oscars_wn['film'] = oscars_wn.index
    return oscars_wn.fillna('O'), missing_categories


In [44]:
# Awards and Nominations for individuals: actor, actress, director

def individual_awards_and_nominations(year, get_categories, award =' Academy Awards'):
    '''
    This function accepts year, title or award and goes into the wikipedia page of the Award 
    or the Movie and extrac t all the necessary information.
    '''
    
    # initialize empty DataFrame and the corresponding Screening Number
    individual_wn = pd.DataFrame()
    missing_categories = []
    idx = 0
    
    # Get the html file from the Wikipedia page using the wikipedia API.
    # Parse it with BeautifulSoup
    page = wikiapi(year=get_edition(year), award = award)
    soup = BeautifulSoup(page.html(),'html5lib')
        
    # Get the table-body (tbody) from the wikipedia page where Oscars information are stored
    tbody = soup.body.find('table', class_="wikitable").find('tbody')
       
    # Make sure number of cells and header match
    if len(tbody.find_all('td')) !=len(tbody.find_all(['div', 'th'])):
    
        # The wikipedia tables needs to be fixed! Some category header may be missing.
        print('Warning: Number of cell <td> element  and header <th> does not add up for year:', year)
        print('Returned Empty dataFrame')
        return oscars_wn, missing_categories
        
        
    # Get winners and nominees
    try:
        for td,th in zip(tbody.find_all('td'),tbody.find_all(['div', 'th'])):
            cat = th.text.strip()
            category = get_main_category(cat)
            # print('\nCategory:', category)
            if category not in get_categories:
                # print(year, category)
                continue

            # Go into the list and get the winners in bold
            for tli in td.find_all('li'):

                # Get the winners in bold
                for tb in tli.find_all('b'):
                    winner = tb.find('a').text.strip()
                    winner = re.sub(r'–',"", winner).strip()
                    if winner != None:
                        individual_wn.loc[idx,'year']= int(year)
                        individual_wn.loc[idx, 'name']= winner
                        individual_wn.loc[idx, 'category']= category
                        individual_wn.loc[idx, 'result']= 'W'
                        idx = idx + 1

                # Get the nominees from the next generation within the list
                for tli2 in tli.find_all('li'):
                    nominee = tli2.find('a').text.strip()
                    nominee = re.sub(r'–',"", nominee).strip()
                    if nominee != None:
                        individual_wn.loc[idx,'year']= int(year)
                        individual_wn.loc[idx, 'name']= nominee
                        individual_wn.loc[idx, 'category']= category
                        individual_wn.loc[idx, 'result']= 'N'
                        idx = idx + 1
        
    except AttributeError:
        print(f'Warning: in Category {category} for Year: {year}')
            

    
    return individual_wn, missing_categories


In [45]:
# Get oscars Data from year_first to year_last

year_start = 1940
year_end = 2019
df_individual = pd.DataFrame()
categories = ['director','s_actor','s_actress','actor','actress']
for year in range(year_start,year_end+1,1):
    
    if year%10 == 0:
        print(f'In the decade {year%100}s')
    df1, missing = individual_awards_and_nominations(year, categories)
    df_individual = df_individual.append(df1)
    missing_categories[year] = missing


In the decade 40s
In the decade 50s
In the decade 60s
In the decade 70s
In the decade 80s
In the decade 90s
In the decade 0s
In the decade 10s


In [46]:
file_name = "individual_"+ str(year_start)+"_"+str(year_end) + ".xlsx"
print(file_name)
df_oscars.to_excel(file_name, sheet_name='Sheet_name_1')    

individual_1940_2019.xlsx


In [7]:
df_oscars = df_oscars[['year', 'film', 'picture','director',
                      'actor', 'actress','s_actor', 's_actress', 
                      'cinematography','screenplay','editing', 
                      'costume','effects', 'music', 'sound']]
df_oscars.film = df_oscars.film.str.lower()

# print(set(df_oscars.film))
df_oscars['year']= pd.to_datetime(df_oscars['year']) 
df_oscars['year'] = pd.DatetimeIndex(df_oscars['year']).year
df_oscars.head(15)

df_oscars.set_index ('film')


Unnamed: 0_level_0,year,picture,director,actor,actress,s_actor,s_actress,cinematography,screenplay,editing,costume,effects,music,sound
film,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1
ben-hur,1970,W,W,W,O,W,O,W,N,W,W,W,W,W
anatomy of a murder,1970,N,O,N,O,N,O,N,N,N,O,O,O,O
the diary of anne frank,1970,N,N,O,O,N,W,W,O,O,N,O,N,O
the nun's story,1970,N,N,O,N,O,O,N,N,N,O,O,N,N
room at the top,1970,N,N,N,W,O,N,O,W,O,O,O,O,O
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
beauty and the beast,1970,O,O,O,O,O,O,O,O,O,N,O,O,O
victoria & abdul,1970,O,O,O,O,O,O,O,O,O,N,O,O,O
guardians of the galaxy vol. 2,1970,O,O,O,O,O,O,O,O,O,O,N,O,O
kong: skull island,1970,O,O,O,O,O,O,O,O,O,O,N,O,O


In [10]:
file_name = "oscars_"+ str(year_start)+"_"+str(year_end) + ".xlsx"
print(file_name)
df_oscars.to_excel(file_name, sheet_name='Sheet_name_1')                  

oscars_1960_2018.xlsx


In [11]:
# Calculate points
def oscar_score(df_oscars, year, category):
    
    #select year and category and create series
    series = df_oscars[df_oscars['year'] == year][category]  
    #drop unnominated films
    series = series[series!='o']                             
    series.dropna()
    print(series)
    #computer number of nominees
    n = len(series)   
    #initialize
    df_points = pd.Series(index = df.index)
    for idx in series.index:
        if series[idx] == 'w':
            df_points[idx] = float(n + 1/n)
        elif series[idx] == 'n':
            df_points[idx] = float(1/n)
    return df_points.fillna(0)
    

# Movie information using OMDB API

For a specified movie, get all its information using the OMDB-API and Wikpedia, including Box-office and Budget information

In [13]:
# OMDB API returns the following keys: dict_keys(['Title', 'Year', 'Rated', 'Released', 
# 'Runtime', 'Genre', 'Director', 'Writer', 'Actors', 'Plot', 'Language', 'Country', 
# 'Awards', 'Poster', 'Ratings', 'Metascore', 'imdbRating', 'imdbVotes', 'imdbID', 'Type', 
# 'DVD', 'BoxOffice', 'Production', 'Website', 'Response'])

# Define function fr obtaining movie data using omdb API
def omdbapi(title):
    if not isinstance(title, str):
        return {}
    
    url_base = 'http://www.omdbapi.com/?i=tt3896198&apikey=5db77b44&'
    url = url_base + 't=' + str(title)
    r = requests.get(url)
    json_data = r.json()
    return json_data


# IMDB Web scraping

In the following function, we scrape movie information directly from the IMDB website for each year within the specified range.

In [3]:
# Web scraping directly from IMDB
# Variables retieved: ['title', 'year', 'runtime', 'imdb_rating', 'metascore', 'n_votes','gross'] 
def get_imdb_movie_data(year):
    my_urls = ['','']
    my_urls[0] = 'https://www.imdb.com/search/title/?year=' + str(year) + '&title_type=feature&'
    my_urls[1] = 'https://www.imdb.com/search/title/?title_type=feature&year='+str(year)+'-01-01,'+str(year)+'-12-31&start=51&ref_=adv_nxt'
    'https://www.imdb.com/search/title/?title_type=feature&year='2011'-01-01,'2011-12-31&start=101&ref_=adv_nxt
    DF = pd.DataFrame()
    
    # each page
    df = pd.DataFrame()
    movie_info = dict()
    
    for this_url in my_urls:

        response = requests.get(this_url)
        soup = BeautifulSoup(response.text,'html.parser')
        movie_cells = soup.find_all('div', class_ = 'lister-item mode-advanced')

        for movie in movie_cells:
            
            # initialize for this
            
            info = movie.find_all('div', class_="lister-item-content")
            movie = info[0]
            idx = re.findall(r'(tt\d+)', movie.a['href'])[0]
            movie_info[idx] = dict()
            


            try:
                title_year = movie.find('h3',class_="lister-item-header").text.replace('\n','')
                title = re.findall(r'\d*\.(.+)\(', title_year)[0]
                year = re.findall(r'\((\d{4})\)', title_year)[0]

            except AttributeError:
                title = ""
                year = ""
                print(f'Title could not be fetched for {title} in {year}')
            df.loc[idx,'title'] = title
            df.loc[idx,'year'] = year
            # print(title, year)

            try:
                runtime = movie.find('span',class_="runtime").text.strip()
            except AttributeError:
                title = ""
                print( 'Title could not be fetched')
            df.loc[idx,'runtime'] = int(runtime[:-4])

            try:
                imdb_rating = movie.find('div',class_="ratings-bar").strong.text
            except AttributeError:
                imdb_rating = np.nan
                print( 'IMDB rating could not be fetched for', title)     
            df.loc[idx,'imdb_rating'] = float(imdb_rating)


            try:
                metascore = movie.find('div',class_="inline-block ratings-metascore").span.text.strip()
            except AttributeError:
                metascore = ""
                # print( 'Metascore could not be fetched for', title)
            df.loc[idx,'metascore'] = metascore

            try:
                votes_gross = movie.find('p',class_="sort-num_votes-visible").text.replace('\n',"")
                n_votes,gross = ("", "")
                if votes_gross.find('Votes') > -1:
                    n_votes = re.findall(r'Votes:([\d,]+)', votes_gross)[0].strip()
                if votes_gross.find('Gross') > -1:
                    gross = re.findall(r'Gross:\$([\d.,]+)M', votes_gross)[0].strip()
            except AttributeError:
                print( 'Votes and gross could not be fetched for', title)

            if len(n_votes) > 0:
                df.loc[idx,'n_votes'] = int(n_votes.replace(',',''))
            else:
                df.loc[idx,'n_votes'] = np.nan
            if len(gross) > 0:
                df.loc[idx,'gross'] = float(gross)
            else:
                df.loc[idx,'gross'] = np.nan
            
            budget = get_budget_info(idx)
            df.loc[idx,'budget'] = budget


            # dictionary of director, cast and genre
            try:
                director_cast = movie.find('p',class_="").text.split('|')

                direction = director_cast[0].replace('\n',"").strip()
                director = re.findall(r"Director:([\w\s]+)",direction)
                

                casting = director_cast[1].replace('\n',"").strip()
                cast = re.findall(r"Stars:(.+)",casting)[0]

            except AttributeError:
                director = ""
                cast = ""
            movie_info[idx]['director'] = director
            movie_info[idx]['cast'] = cast
            df.loc[idx,'cast_size'] = int(len(cast.split(',')))

            try:
                genres = movie.find('p',class_="text-muted").text.replace('\n',"").strip().split('|')
                if(len(genres)<3):
                    continue
                rated = genres[0]
                movie_info['genre'] = genres[2]
            except AttributeError:
                genres = []
                rated = ""
                print(f'Genres could not be fetched for {title} in {year}')
            movie_info[idx]['genre'] = genres
            


    return df, movie_info


# def get_imdb_budget_boxoffice(movie_id):

def get_budget_info(movie_id):
    my_url = 'https://www.imdb.com/title/'+movie_id
    response = requests.get(my_url)
    soup = BeautifulSoup(response.text,'html.parser')
    headers = soup.find_all('div', class_="txt-block")
    budget=0
    for header in headers:
        try :
            if header.text.find('Budget') > -1:
                budget = re.findall(r'Budget:(\$[\d,]+)',header.text)
                if len(budget)==0:
                    return 0
                budget = budget[0]
                budget = budget.strip('$').replace(',','')
                budget =  float(budget)/1e6
                return budget
        except AttributeError:
            return budget
    


In [13]:
df_info = pd.DataFrame()
movie_info = dict()

In [17]:
%%time
year_start = 2019
year_stop = 2020

for year in range(year_start,year_stop):
    print(year)
    df, mvinfo = get_imdb_movie_data(year)
    df_info = df_info.append(df)
    movie_info = {**movie_info, **mvinfo} 
print(df_info.shape)
df_info.tail(15)

2019
Title could not be fetched
IMDB rating could not be fetched for 
Votes and gross could not be fetched for 
IMDB rating could not be fetched for Star Wars: The Rise Of Skywalker
Votes and gross could not be fetched for Star Wars: The Rise Of Skywalker
IMDB rating could not be fetched for Sonic the Hedgehog
Votes and gross could not be fetched for Sonic the Hedgehog
IMDB rating could not be fetched for Jumanji: The Next Level
Votes and gross could not be fetched for Jumanji: The Next Level
IMDB rating could not be fetched for Spies in Disguise
Votes and gross could not be fetched for Spies in Disguise
IMDB rating could not be fetched for Little Women
Votes and gross could not be fetched for Little Women
IMDB rating could not be fetched for Bombshell
Votes and gross could not be fetched for Bombshell
Title could not be fetched
IMDB rating could not be fetched for 
Votes and gross could not be fetched for 
IMDB rating could not be fetched for 1917
Votes and gross could not be fetched 

Unnamed: 0,title,year,runtime,imdb_rating,metascore,n_votes,gross,budget,cast_size
tt7329656,47 Meters Down: Uncaged,2019,90.0,5.0,43.0,8801.0,21.11,12.0,4.0
tt8613070,Portrait of a Lady on Fire,2019,121.0,8.3,93.0,4203.0,,,4.0
tt10481868,Black Christmas,2019,121.0,,,4203.0,,,4.0
tt7390588,Little Monsters,2019,93.0,6.3,59.0,4902.0,,,4.0
tt6398184,Downton Abbey,2019,122.0,7.7,64.0,16947.0,31.03,20.0,4.0
tt7394816,Primal,2019,97.0,4.9,32.0,1408.0,,,4.0
tt8110640,In the Shadow of the Moon,2019,115.0,6.2,48.0,24235.0,,,4.0
tt6823368,Glass,2019,129.0,6.7,43.0,170006.0,111.04,20.0,4.0
tt6565702,Dark Phoenix,2019,113.0,5.8,43.0,111882.0,65.85,200.0,4.0
tt8364368,Crawl,2019,87.0,6.2,60.0,37549.0,39.01,13.5,4.0


6001

In [25]:
pickle.dump(movie_info, open( "movie_info_1960_2019.p", "wb" ))
file_name = "IMDB_movie_info_"+ str(year_start)+"_"+str(year_stop) + "_oscars.xlsx"
print(file_name)
df_info.to_excel(file_name, sheet_name='Sheet_name_1')

IMDB_movie_info_2019_2020_oscars.xlsx


## Obtaining Budget and Rating Information

We used the OMDB API to get the information about the film, namely the title, year of release, running time, cast, etc. Unfortunateely, the OMDB API did not have reliable information about the movies budget and box office revenues, which are important variables for our analyses. 

We therefore used the Wikipedia API to obtain the budget and box office information. Examining several Wikipedia pages durign the data acquisition phase clearly demonstrated that the WIkpedia information about movie budget and movie box office were reliable. 

In [14]:
# Function to obtain budget and box office information from WIkipedia API
def get_budget_boxoffice(movie, year):
    
    budget = ""
    box_office = ""
    page = wikiapi(title = movie, year = year)
    if len(str(page))==0:
        print(f'Warning: Budget and Boxoffice info for {movie} could not be obtained from wikipedia ')
        return"",""
    try:
        soup = BeautifulSoup(page.html(),'html5lib')
    except AttributeError:
        print(f'Warning: Budget and BX info for {movie} could not be obtained from wikipedia ')
    if soup.body.find('table', class_="infobox vevent")==None:
        return "", ""

    try:
        trows = soup.body.find('table', class_="infobox vevent").find('tbody').find_all('tr')
    except AttributeError:
        print(f"Something wrong {movie} infobox. Budget/Boxoffice could not be fetched!")
        return budget, box_office
    for row in trows:
        if row.find('th') == None:
            continue
        if row.find('th').text.lower() == 'budget':
            budget = row.find('td').text
        if row.find('th').text.lower() == 'box office':
            box_office = row.find('td').text
    return currency_to_million(budget), currency_to_million(box_office)


def currency_to_million(money):
    ''' Accepts $12 million and returns 12000000
        Accepts $13,678,654 and reurns  13678654
        using regular expressions
    '''
    if money == None:
        return ""
    if len(money)==0:
        return ""
    
    # Check to see dollar, otherwise return nan
    money = money.lower()
    if money.find('$')<0:
        return ""
    money = re.sub(r'\[.*\]', '', money)
    
    if money.find('illion')>0: # when currency expressed in million/billion
        
        # Billion: $12.4 billion
        reg = r"[\$-–]([0-9.]+)\sbillion"
        num = re.findall(reg, money) # find number like $12.4 billion
        if len(num)>0:
            # num = re.sub(r'\D', "", num) # drop any non-numeric characters like comma, dash etc
            num =  round(float(num[0])*1e3)
            return str(num) + ' million'
        
        # Million: $6.8 million
        reg = r"[\$-–]([0-9.]+)\smillion"
        num = re.findall(reg, money) # find number like $6.8 million
        if len(num)>0:
            # num = re.sub(r'\D', "", num) # drop any non-numeric characters like comma, dash etc
            return str(round(float(num[0]))) + ' million'
    
    else: # When currency not expressed in millions  
        reg = r"\D*"
        num = re.sub(reg, "", money)
        num = round(float(num)/1e6)
        return str(num) + ' million'

    
# Function to obtain rating information from OMDB API
def get_ratings_as_float(ratings):
    
    imdb, rotten_tomatoes, metacritic = np.nan, np.nan, np.nan

    for rating in ratings:
        if (rating['Source']=='Internet Movie Database') & (len(rating['Value'])>0):
            rate = rating['Value'].strip('/')
            imdb = float(rate[0])
        if (rating['Source']=='Rotten Tomatoes') & (len(rating['Value'])>0):
            rate = rating['Value']
            rotten_tomatoes = float(rate[:-1])/10
        if (rating['Source']=='Metacritic') & (len(rating['Value'])>0):
            rate = rating['Value'].strip('/')
            metacritic = float(rate[:-4])/10
    
    return imdb, rotten_tomatoes, metacritic



In [15]:
def get_movie_info(movie):
    
    df_info = pd.DataFrame()
    
    json = omdbapi(movie)
    if 'Error' in json:
        print('OMDB API could not fetch info for', movie)
        return df_info
    if json['Language'].lower() != 'english':
        return df_info

    # Get imdb Index
    if 'imdbID' in json.keys():
        idx = json['imdbID']
    else:
        print(f'Title {movie} not found')
        return df_info
    
    if 'Title' in json.keys():
        df_info.loc[idx,'film']=json['Title']
    if 'Year' in json.keys():
        df_info.loc[idx,'year']=json['Year']
    if 'Runtime' in json.keys():
        df_info.loc[idx,'runtime']=json['Runtime']
    if 'Released' in json.keys():
        df_info.loc[idx,'released']=json['Released']
    if 'Rated' in json.keys():
        df_info.loc[idx,'released']=json['Rated']
    if 'imdbRating' in json.keys():
        df_info.loc[idx,'imdb_ratings']=json['imdbRating']
    if 'imdbVotes' in json.keys():  
        df_info.loc[idx,'imdb_num_votes']=json['imdbVotes']

    # Get Ratings as float
    imdb, rotten_tomatoes, metacritic=get_ratings_as_float(json['Ratings'])
    df_info.loc[idx,'imdb']=imdb
    df_info.loc[idx,'rotten_tomatoes']=rotten_tomatoes
    df_info.loc[idx,'metacritic']= metacritic
        
    # Get budget and box-office from wikipedia
    budget, box_office = get_budget_boxoffice(movie, json['Year']) 
    
    
 
    #budget = budget[0]
    df_info.loc[idx,'budget']= budget


    #box_office = box_office[0]
    df_info.loc[idx,'box_office']= box_office
    
    return df_info


# Evaluation of films 3 Criterion

Once the information of movies have been obtained, movies are evaluated under 3 broad categories. <br>
1. Critical Accolades based on awards and nominations <br>
2. Critical Feedback based on IMDB, Metacrritic and Rotten tomatoes <br>
3. Commercial success based on Budget and Box office returns<br>
<br>

## Accolades
Each movie was scored in each category for both nominations and wins. The following heuristic was used. <br>
A movie scores higher if it competed with and beat more films nomiated in that category. 
If in a given year, more number of movies are nominated, each nominated film earns a lower score.


## Critical Feedback
Two kinds of ratings were considered, both separately as well as their average, on a scale of 10. 

## Box Office
The commercial success was measured as a form of %age which made it independent of changes inflation rate over the years. A second measure, total earnings, was also calculated accounting for inflation rate.

In [28]:
df_imdb = pd.read_excel('my_data/IMDB_movie_info_2019_2020_oscars.xlsx')
df_academy = pd.read_excel('my_data/oscars_1960_2018.xlsx')

In [37]:
df_imdb.title = df_imdb.title.str.lower()
df_imdb.head()

Unnamed: 0.1,Unnamed: 0,title,year,runtime,imdb_rating,metascore,n_votes,gross,budget,cast_size
0,tt0054357,swiss family robinson,1960,126,7.2,61.0,12663,40.36,5.0,4
1,tt0054215,psycho,1960,109,8.5,97.0,556010,32.0,0.806947,4
2,tt0054047,the magnificent seven,1960,128,7.7,74.0,82485,4.91,2.0,4
3,tt0054135,ocean's 11,1960,127,6.6,57.0,18682,12.32,2.8,4
4,tt0054195,pollyanna,1960,134,7.4,,8319,,,4


In [38]:
df_academy.film = df_academy.film.str.lower()
df_academy.head()

Unnamed: 0.1,Unnamed: 0,year,film,picture,director,actor,actress,s_actor,s_actress,cinematography,screenplay,editing,costume,effects,music,sound
0,Ben-Hur,1970,ben-hur,W,W,W,O,W,O,W,N,W,W,W,W,W
1,Anatomy of a Murder,1970,anatomy of a murder,N,O,N,O,N,O,N,N,N,O,O,O,O
2,The Diary of Anne Frank,1970,the diary of anne frank,N,N,O,O,N,W,W,O,O,N,O,N,O
3,The Nun's Story,1970,the nun's story,N,N,O,N,O,O,N,N,N,O,O,N,N
4,Room at the Top,1970,room at the top,N,N,N,W,O,N,O,W,O,O,O,O,O


In [41]:
imdb_films = set(df_imdb.title)
academy_films = set(df_academy.film)
print(len(imdb_films))
print(len(academy_films))
unlisted = academy_films - imdb_films
print('Following oscar winning films are not listed in IMDB:\n\n')
print(unlisted)


5879
1501
Following oscar winning films are not listed in IMDB:


{'the hot rock', 'defiance', 'agatha', 'jackie', "murphy's romance", 'snow falling on cedars', 'the last station', 'a cry in the dark', 'drive', 'séance on a wet afternoon', 'mrs brown', 'butch and sundance: the early days', 'the shoes of the fisherman', 'one true thing', 'frost/nixon', 'the seven-per-cent solution', 'spotlight', 'agnes of god', 'birds do it, bees do it', 'tender mercies', 'caravans', 'little dorrit', 'shirley valentine', 'transamerica', 'lovers and other strangers', 'longtime companion', 'the gazebo', 'the diary of anne frank', 'the milagro beanfield war', "you're a big boy now", 'venus', 'pollock', 'madame bovary', 'woodstock', 'the pigeon that took rome', 'reuben, reuben', 'the kite runner', 'bon voyage!', 'khovanshchina', 'shadowlands', 'love field', "the preacher's wife", 'oscar and lucinda', 'libel', 'the magic flute', 'victor/victoria', 'and now my love', 'the visit', 'silence', 'mr. turner', 'mur