<h1><center>DATA ACQUISITION</center></h1>

Predictive modeling depends crucially on the quality of the data. In the first part of this project, I obtained up-to-date information about more than 6000 films from their Wikipedia pages using the Wikipedia API, and from their IMDB pages using the OMDB API. Regular expressions were used to extract budget and box-office information (normalized to millions of dollars).

The Wikipedia API was also used to scrape information about The Academy Awards at two levels:
1. The different awards and nominations in each of the major categories from 1940 to 2019
2. The individual awards and nominations in each category awarded from 1940 to 2019

# Contents

1. [Obtaining Movie Information using Wikipedia and OMDB API](#movieinfo)
2. [Data Acquisition](#data_acquisition)<br>
   i) [Web scraping](#scrape)<br>
   ii) [Challenges and Fixes](#challenges) <br>
   iii) [Creating Movies DataFrame](#dataframe)
3. [Obtaining Academy Award Information](#academy)<br>
   i) [For Films](#films)<br>
   ii) [For Individuals](#individuals)<br>


This project aims to analyze movie data from the IMDB database along three dimensions (commercial, critical and financial) and to predict the nominees and winners of the 92<sup>nd</sup> Academy Awards, due to be held in 2020.

The Data Acquisition phase consists of two steps:
1. To obtain a list of English-language films from 1955 to 2019
2. To obtain the complete list of Academy Award winners and nominees, both in the film and individual categories

By acquiring up-to-date data about Hollywood films from 1940 to 2019 from **Wikipedia** and the **IMDB** website, we will attempt to identify the key features that make films successful, popular and enduring.

We will look at three key parameters:
1. Accolades given by The Academy of Motion Pictures
2. Box office revenues
3. IMDB ratings based on fan reviews

Based on these outcomes, we will explore the history of Hollywood films and The Academy Awards to see what makes movies resonate with fans and endure over time. Why do some films become commercially successful but fail to win critical acclaim? Is there a correlation between budget and box-office revenues, that is, do big-budget movies make more money? Is there any significant correlation between how fans and the Academy view a film's success?

By pooling data from various sources and defining metrics and measures that are simple and intuitive, we will try to uncover the secrets of great films. Using this knowledge and information, we will try to predict the nominees and winners for the upcoming Oscars due to be held in 2020.

In [16]:


# import necessary libraries
import pandas as pd
import numpy as np
import requests
import wikipedia
import re
import matplotlib.pyplot as plt
from bs4 import BeautifulSoup
from urllib.request import urlopen
from scrapy import selector
import datetime as dt
import pickle
from skimage import io
from IPython.display import clear_output
import pdb
plt.style.use('ggplot')
%matplotlib inline

# Webscraping Movie Information

We will use Python's requests module and the Wikipedia API to scrape information for all films between 1960 and 2019. First, we create a list of films by scraping the title and year of all American and British films from 1960 to 2019 using the following piece of code. To do that, we look at the lists of American and British films made in each year of that period using the Wikipedia API. The resulting list contains the Wikipedia link of each movie, which we will use to obtain information about that particular film.

In [4]:
%%time
## This loop collects all British and American films listed for the specified years
primary_list = []  # (year, lowercase title, relative Wikipedia link) tuples
for year in range(1950,2020):
    print(f"In {year}")
    if year%10==0:
        clear_output()
    # Set URL
    my_url = ['List_of_American_films_of_'+ str(year), 'List_of_British_films_of_'+ str(year)]
    
    #Define empty list for the year
    
    for url in my_url:
        page = wikipedia.page(url)
        soup = BeautifulSoup(page.html(),'lxml')
        tables = soup.find_all('table', class_ = 'wikitable') # , class_="wikitable sortable jquer-tablesorter")
        for table in tables:
            films = table.find_all('i')
            for film in films:
                title = film.text
                # print(title)
                link = film.find_all('a', href=True, title = True)
                if len(link)==0:
                    continue
                else:
                    link = link[0]['href']
                primary_list.append((year,film.text.lower(),link))

        
pickle.dump(primary_list,open('my_data/PRIMARY_LIST_1950_2019', "wb" ))
len(primary_list)
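
The `link` stored in each `primary_list` tuple is the raw `href` taken from the table, which on Wikipedia is typically site-relative (for example `/wiki/...`). A minimal sketch of turning such links into full URLs before requesting them, assuming the English Wikipedia host; the helper name `to_absolute_url` and the sample entry are hypothetical and not part of the original notebook:

```python
from urllib.parse import urljoin

# Assumption: links were scraped from English Wikipedia
WIKI_BASE = "https://en.wikipedia.org"

def to_absolute_url(href, base=WIKI_BASE):
    """Resolve a site-relative href against the base; absolute URLs pass through unchanged."""
    return urljoin(base, href)

# Illustrative entry in the (year, title, link) format used by primary_list
sample_entry = (2012, "lincoln", "/wiki/Lincoln_(film)")
full_url = to_absolute_url(sample_entry[2])
print(full_url)
```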


In [26]:
movie_dict = pickle.load(open("my_data/movie_dict_FINAL","rb"))

In [27]:
movie_dict['avengers endgame', 2019]

{'title': 'avengers endgame',
 'year': 2019,
 'screenplay': ['steve gan',
  'joe simon',
  'jack kirby',
  'stephen mcfeely',
  'don heck',
  'christopher markus',
  'gamora & drax created by)',
  'jim starlin (thanos',
  'steve englehart',
  'bill mantlo',
  'keith giffen',
  'larry lieber',
  'stan lee'],
 'director': ['anthony russo', 'joe russo'],
 'cast': ['chris evans',
  'chris hemsworth',
  'robert downey jr.',
  'mark ruffalo'],
 'imdbID': 'tt4154796',
 'plot': "after the devastating events of avengers: infinity war (2018), the universe is in ruins. with the help of remaining allies, the avengers assemble once more in order to reverse thanos' actions and restore balance to the universe.",
 'language': 'English, Japanese, Xhosa, German',
 'running_time': '181 min',
 'imdb_rating': 8.5,
 'n_votes': 640310.0,
 'metscore': 7.8,
 'rotten_tomatoes': 9.4,
 'other_wins': 32.0,
 'other_noms': 75.0,
 'genre': ['action', ' adventure', ' drama', ' sci-fi'],
 'budget': 356.0,
 'opening_wee

## Get Movie Information

Obtain the introductory paragraph and the infobox items for each movie in the list.

In [2]:
# Create dictionary from Wikipedia page of the film using the INFOBOX table

def movie_infobox_dict(infobox_items, title, year):
    """Accepts an infobox item and intro sections of wikipedia page and returns a dictionary
    """
    # initialize empty dictionary
    _movie_dict = dict()
    _movie_dict['title']= title
    _movie_dict['year']= year
    
    # Go through each infobox item and extract information
    for item in infobox_items:
        try:
            if len(item) < 2:
                continue
            
            #director
            _movie_dict.setdefault('director', [])  # create once; do not reset on every infobox row
            if item.th.text.lower().find('direct')>-1:
                if item.td.find('div', class_="plainlist")==None: # single entries
                    for line in item.td.find_all('a'):
                        _movie_dict['director'].append(line.text.lower())      
                else:
                    for line in item.td.find('div', class_="plainlist").find_all('li'):
                        _movie_dict['director'].append(line.text.lower())
                if len(_movie_dict['director'])==0:
                    _movie_dict['director'] = item.td.text.split('br')
            
            #producer
            _movie_dict.setdefault('producer', [])  # create once; do not reset on every infobox row
            if item.th.text.lower().find('produce')>-1:
                if item.td.find('div', class_="plainlist")==None: # single entries
                    for line in item.td.find_all('a'):
                        _movie_dict['producer'].append(line.text.lower())    
                else:
                    for line in item.td.find('div', class_="plainlist").find_all('li'):
                        _movie_dict['producer'].append(line.text.lower()) 
                if len(_movie_dict['producer'])==0:
                    _movie_dict['producer'] = item.td.text.split('br')

            #cast
            _movie_dict.setdefault('cast', [])  # create once; do not reset on every infobox row
            if item.th.text.lower().find('star')>-1:
                if item.td.find('div', class_="plainlist")==None: # single entries
                    for line in item.td.find_all('a'):
                        _movie_dict['cast'].append(line.text.lower())
                else:
                    for line in item.td.find('div', class_="plainlist").find_all('li'):
                        _movie_dict['cast'].append(line.text.lower())
                if len(_movie_dict['cast'])==0:
                    _movie_dict['cast'] = item.td.text.split('br')

            #screenplay
            _movie_dict.setdefault('screenplay', [])  # create once; do not reset on every infobox row
            if item.th.text.lower().find('screenplay')>-1 or item.th.text.lower().find('written')>-1:
                if item.td.find('div', class_="plainlist")==None: # single entries
                    for line in item.td.find_all('a'):
                        _movie_dict['screenplay'].append(line.text.lower())
                else:
                    for line in item.td.find('div', class_="plainlist").find_all('li'):
                        _movie_dict['screenplay'].append(line.text.lower())
                if len(_movie_dict['screenplay'])==0:
                    _movie_dict['screenplay'] = item.td.text.split('br')

            #cinematography
            _movie_dict.setdefault('cinematography', [])  # create once; do not reset on every infobox row
            if item.th.text.lower().find('cinematography')>-1:
                if item.td.find('div', class_="plainlist")==None: # single entries
                    for line in item.td.find_all('a'):
                        _movie_dict['cinematography'].append(line.text.lower())
                else:
                    for line in item.td.find('div', class_="plainlist").find_all('li'):
                        _movie_dict['cinematography'].append(line.text.lower())
                if len(_movie_dict['cinematography'])==0:
                    _movie_dict['cinematography'] = item.td.text.split('br')

            # music
            _movie_dict.setdefault('music', [])  # create once; do not reset on every infobox row
            if item.th.text.lower().find('music')>-1 or item.th.text.lower().find('score')>-1:
                if item.td.find('div', class_="plainlist")==None: # single entries
                    for line in item.td.find_all('a'):
                        _movie_dict['music'].append(line.text.lower())
                else:
                    for line in item.td.find('div', class_="plainlist").find_all('li'):
                        _movie_dict['music'].append(line.text.lower())
                if len(_movie_dict['music'])==0:
                    _movie_dict['music'] = item.td.text.lower()
                
            # editing
            _movie_dict.setdefault('edit', [])  # create once; do not reset on every infobox row
            if item.th.text.lower().find('edit')>-1:
                if item.td.find('div', class_="plainlist")==None: # single entries
                    for line in item.td.find_all('a'):
                        _movie_dict['edit'].append(line.text.lower())
                else:
                    for line in item.td.find('div', class_="plainlist").find_all('li'):
                        _movie_dict['edit'].append(line.text.lower())
                if len(_movie_dict['edit'])==0:
                    _movie_dict['edit'] = item.td.text.split('br')

            # Based on a book (Y/N)
            _movie_dict.setdefault('book', '0')  # default once; a later row must not undo a 'based on' match
            if item.th.text.lower().find('based')>-1:
                _movie_dict['book'] = '1'

            # Get budget
            if item.th.text.lower().find('budget')>-1:
                budget = item.td.text.strip()
                budget = re.sub(r'\[.+\]+', "", budget) #remove square bracket references
                _movie_dict['budget'] = currency_to_million(budget)

            # Get box office
            if item.th.text.lower().find('box')>-1:
                box_office = item.td.text.strip()
                box_office = re.sub(r'\[.+\]+', "", box_office)#remove square bracket references
                _movie_dict['box_office'] = currency_to_million(box_office)

            # Get running time
            if item.th.text.lower().find('running time')>-1:
                running_time = item.td.text.strip()
                running_time = re.findall(r'(\d+)\smin', running_time)
                if len(running_time)==0:
                    _movie_dict['running_time'] = 0
                    continue
                _movie_dict['running_time'] = int(running_time[0])

            # Language 
            if item.th.text.lower().find('language')>-1:
                language = item.td.text.strip()
                if (language.lower().find('english') == -1 and 
                    language.lower().find('silent') == -1): # film does not contain english
                    return dict()
                _movie_dict['language']= language.split('\n')

            # Release date 
            if item.th.text.lower().find('release')>-1:
                release = item.td.text.strip()
                release = re.findall(r'\d\d\d\d', release)
                if len(release)==0:
                    print(f'Year not found for {title} in {year}')
                    return dict()
                release = dt.datetime.strptime(release[0], '%Y').year
#                 if release !=year: # or len(str(release))==0:
#                     print(f'Wrong year for {title}! Corrected to wikipedia year')
#                     return dict()
#                     # _movie_dict['year']= year
#                 else:
                _movie_dict['release_year']=release


        except AttributeError:
            print(f'\tCould not fetch info for {title} from infobox items')

    return _movie_dict

#########################################################################################    
# Converts different denominations to millions of dollars (e.g. 3.5 for $3.5 million or $3,500,000)
def currency_to_million(money):
    ''' Accepts $12 million and returns 12.0
        Accepts $13,678,654 and returns 13.68
        Accepts $15-25 million and returns 25.0
        Accepts $2 billion and returns 2000.0
        Amounts in pounds are converted at a fixed rate of 1.3 dollars per pound
        using regular expressions
    '''
    if money == None:
        return np.nan
    if len(money)==0:
        return np.nan

    # Check to see dollar, otherwise return nan
    money = money.lower()
    if money.find('$')>-1:
        factor=1
    elif money.find('£')>-1:
        factor = 1.3
    else:
        return np.nan
    
    money = re.sub(r'\[.*\]', '', money) #remove square bracket citation

    if money.find('illion')>0: # when currency expressed in million/billion

        # Billion: $12.4 billion
        reg = r"[$£\-–]([0-9.]+)\sbillion"  # dash escaped so it is a literal, not a character range
        num = re.findall(reg, money) # find number like $12.4 billion
        if len(num)>0:
            # num = re.sub(r'\D', "", num) # drop any non-numeric characters like comma, dash etc
            return  round(float(num[0])*1e3*factor,2)

        # Million: $6.8 million
        reg = r"[$£\-–]([0-9.]+)\smillion"  # dash escaped so it is a literal, not a character range
        num = re.findall(reg, money) # find number like $6.8 million
        if len(num)>0:
            # num = re.sub(r'\D', "", num) # drop any non-numeric characters like comma, dash etc
            try:
                return round(float(num[0])*factor,2)
            except:
                return np.nan

    else: # When currency not expressed in millions  
        reg = r"[$£]\s?([\d,]+)[\D\s]?"
        num = re.findall(reg, money)
        if len(num) >0:
            num = re.sub(r',', '', num[0])
            try:
                return round(float(num)/1e6*factor,2)
            except:
                return np.nan

#########################################################################################    
# Get genre information from wikipedia introductory paragraph            
def get_genre(intro):
    """returns the movie genre based on wikipedia intro, returns list of genres
    """
    genre_dict = pickle.load(open("my_data_4/genre_dict","rb"))
    this_movie_genres = []
    for line in intro: # go through all the lines
        line = line.text.split('.')
        line = line[0]
        line = line.lower()
        for genre in genre_dict:
            if line.find(genre) > -1:  # -1 means "not found"; 0 is a valid match position
                this_movie_genres.append(genre_dict[genre])
        
        if len(this_movie_genres)>0:
            return list(set(this_movie_genres))
    return ['other']

#########################################################################################    

def num_awards_noms(my_str, cat):
    """Returns the number of wins (cat='w') or nominations (cat='n') found in an awards string
    e.g. "11 wins & 4 nominations" returns 11.0 for cat='w' and 4.0 for cat='n'
    """
    my_str = my_str.lower()
    if my_str.lower() == 'n/a':
        return 0
    if cat == 'w':
        my_str = my_str.lower()
        x = re.findall(r'(\d+)\sw', my_str)
        if len(x) > 0:
            return float(x[0])
        else:
            return 0
    if cat == 'n':
        my_str = my_str.lower()
        x = re.findall(r'(\d+)\sn', my_str)
        if len(x) > 0:
            return float(x[0])
        else:
            return 0
#########################################################################################    

def get_list(my_str):
    """Cleans data and gets list of writers, actors, directors, etc
    """
    my_str = my_str.lower()
    this_list = my_str.split(',')
    
    for i,element in enumerate(this_list):
        reg = r'(\(.+\))'
        this_list[i] = re.sub(reg,'',element).strip()
    return this_list

#########################################################################################    

def get_runtime(my_str):
    """Returns runtime in minutes as float, or np.nan if no number is found (e.g. 'N/A')
    """
    my_str = my_str.lower()
    x = re.findall(r'(\d+)\s\w',my_str)
    if len(x) == 0:
        return np.nan
    return float(x[0])
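
The infobox code removes bracketed citations with the greedy pattern `\[.+\]+`, which also deletes any text sitting between two citations. A small sketch of a narrower alternative, assuming citations never contain a closing bracket; `strip_citations` is a hypothetical helper and the sample string is illustrative:

```python
import re

def strip_citations(text):
    """Remove bracketed reference markers such as [1] or [note 2]."""
    # [^\]]* stops at the first closing bracket, so text between citations survives
    return re.sub(r"\[[^\]]*\]", "", text).strip()

raw = "$10 million[1] (US) or £8 million[2]"
greedy_result = re.sub(r"\[.+\]+", "", raw)  # pattern used in the notebook
narrow_result = strip_citations(raw)         # narrower pattern

print(greedy_result)
print(narrow_result)
```

With the greedy pattern everything from the first `[` to the last `]` is dropped, so the second amount is lost; the narrower pattern keeps both amounts.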

In [3]:
# This is the main function that calls wikiapi_film, OMDB API and ancillary functions to get the dictionaries
def get_all_movie_info(title, year, url):
    """This is the MAIN function that uses previous functions to accept title and year and return 
    all the movie information, returns dictionary
    """
    this_dict = dict()
    this_dict['title'] = title
    this_dict['year'] = year
    this_dict['screenplay'] = []
    this_dict['director'] = []
    this_dict['cast'] = []
    
    
#     # Wikipedia info
#     r = requests.get(url)
#     soup = BeautifulSoup(r.content,'lxml')

#     # Get wikipedia infobox tables for all other information
#     try:
#         infobox_items = soup.find('table', class_ = 'infobox vevent').tbody.find_all('tr')
#     except: 
#         print(f'Could not find infobox for {title}')
#         return dict()
        
#     this_dict = movie_infobox_dict(infobox_items, title, year) # get movie info dictionary
#     if len(this_dict) == 0:
#         print(f'Wikipedia could not get {title} in {year}')
#         return dict()



    # obtain json file from OMDB API and get info
    # format: http://www.omdbapi.com/?t=the+lighthouse&y=2016&plot=full
    url_base_imdb = 'http://www.omdbapi.com/?i=tt3896198&apikey=5db77b44&'
    url = url_base_imdb + 't=' + str(title) + '&y=' + str(year)
    r = requests.get(url)
    try:
        json_data = r.json()
    except:
        print(f'Json error in {title}: could not fetch info')
        return dict()
    
    if 'Error' in json_data:
        print(f'\t{title} not found in OMDB API')
        return dict()

    else:
        try:
            this_dict['imdbID'] = json_data['imdbID']
            this_dict['plot'] = json_data['Plot'].lower()
            this_dict['language'] = json_data['Language']
            this_dict['running_time'] = json_data['Runtime'] # get_runtime(json_data['Runtime'])
        except:
            print(f'ValueError in {title} basic info: could not fetch info')
            return dict()
        
        # Get imdb ratings
        try: 
            this_dict['imdb_rating'] = float(json_data['imdbRating'])
            this_dict['n_votes'] = float(re.sub(r',','',json_data['imdbVotes']))
        except:
            print(f'ValueError in {title} imdb ratings: could not fetch info')
            return dict()
        
        if len(json_data['Ratings'])>1:
            # Get meta ratings
            try:
                this_dict['metscore'] = float(json_data['Metascore'])/10
            except:
                print(f'ValueError in {title} metascore ratings: could not fetch info')
            
            # Get rotten ratings
            try:
                this_dict['rotten_tomatoes'] = float(json_data['Ratings'][1]['Value'][:-1])/10
            except:
                print(f'ValueError in {title} rotten toms ratings: could not fetch info')
        
        # get OTHER wins and noms
        if 'Awards' in json_data:
            try:
                this_dict['other_wins'] = num_awards_noms(json_data['Awards'], cat = 'w')
                this_dict['other_noms'] = num_awards_noms(json_data['Awards'], cat = 'n')
            except:
                print(f'ValueError in {title} wins and noms: could not fetch info')
        
        # Get film casting, direction, writing, box office info
        if 'Genre' in json_data:
            this_dict['genre'] = json_data['Genre'].lower().split(',')
        if 'Writer' in json_data:
            this_dict['screenplay'] = list(set(this_dict['screenplay'] + get_list(json_data['Writer'])))
        if 'Director' in json_data:
            this_dict['director'] = list(set(this_dict['director'] + get_list(json_data['Director'])))
        if 'Actors' in json_data:
            this_dict['cast'] = list(set(this_dict['cast'] + get_list(json_data['Actors'])))
        if 'BoxOffice' in json_data and json_data['BoxOffice'] != 'N/A':
            this_dict['box_office'] = currency_to_million(json_data['BoxOffice'])

        
        return this_dict

# GET BUDGET AND BOX OFFICE INFORMATION
# # for title,year in movie_dict:
# start_over = 2981

# for i, (title, year) in enumerate(movie_dict):
#     if i < start_over:
#         continue
#     if i%10 == 0:
#         clear_output()
#     print(i, title, year)
#     try:
#         imdbID = movie_dict[title,year]['imdbID']
#     except KeyError:
#         movie_dict[title,year]['budget'] = -1
#         movie_dict[title,year]['opening_weekend'] = -1
#         movie_dict[title,year]['gross_box_office'] = -1
#         continue
#     url = f'https://www.imdb.com/title/{imdbID}/'
#     r = requests.get(url)
#     soup = BeautifulSoup(r.content,'lxml')
#     divs = soup.find_all('div', class_="txt-block")
#     for div in divs:
#         inline = div.find_all('h4', class_="inline")
#         for line in inline:
#             if line.text.lower().find('budget') > -1:
#                 movie_dict[title,year]['budget'] = currency_to_million(div.text.strip())
#             if line.text.lower().find('opening weekend') > -1:
#                 movie_dict[title,year]['opening_weekend'] = currency_to_million(div.text.strip())
#             if line.text.lower().find('cumulative') > -1:
#                 movie_dict[title,year]['gross_box_office'] = currency_to_million(div.text.strip())
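
`get_all_movie_info` builds the OMDB query by string concatenation, so a title containing `&` or `#` would alter the query. A minimal sketch of encoding the parameters instead, using only the standard library; `build_omdb_url` and the placeholder key are hypothetical, while `t`, `y` and `apikey` are the parameter names already used in the URL above:

```python
from urllib.parse import urlencode

OMDB_ENDPOINT = "http://www.omdbapi.com/"

def build_omdb_url(title, year, api_key):
    """Return an OMDB query URL with the title and year safely percent-encoded."""
    params = {"apikey": api_key, "t": title, "y": year}
    return OMDB_ENDPOINT + "?" + urlencode(params)

# "YOUR_KEY" is a placeholder, not a working key
query_url = build_omdb_url("fast & furious", 2009, "YOUR_KEY")
print(query_url)
```

The resulting string can be passed to `requests.get` as before; calling `requests.get(OMDB_ENDPOINT, params=params)` is an alternative that lets requests do the encoding.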



In [180]:
movie_dict['lincoln', 2012]

{'title': 'lincoln',
 'year': 2012,
 'screenplay': ['tony kushner', 'doris kearns goodwin'],
 'director': ['steven spielberg'],
 'cast': ['sally field',
  'david strathairn',
  'daniel day-lewis',
  'joseph gordon-levitt'],
 'imdbID': 'tt0443272',
 'plot': "as the american civil war continues to rage, america's president struggles with continuing carnage on the battlefield as he fights with many inside his own cabinet on the decision to emancipate the slaves.",
 'language': 'English',
 'running_time': '150 min',
 'imdb_rating': 7.3,
 'n_votes': 233363.0,
 'metscore': 8.6,
 'rotten_tomatoes': 8.9,
 'other_wins': 108.0,
 'other_noms': 247.0,
 'genre': ['biography', ' drama', ' history', ' war'],
 'box_office': 129.48,
 'budget': 65.0,
 'opening_weekend': 0.94,
 'gross_box_office': 275.29}
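
The stored `rotten_tomatoes` value comes from `json_data['Ratings'][1]`, which relies on the Rotten Tomatoes entry being second in the list. A hedged sketch of looking the entry up by its `Source` field instead; the sample payload is illustrative (shaped like an OMDB response, with values chosen to match the record above) and `rating_by_source` is a hypothetical helper:

```python
def rating_by_source(ratings, source):
    """Return the 'Value' string for the given source, or None if it is absent."""
    for entry in ratings:
        if entry.get("Source") == source:
            return entry.get("Value")
    return None

# Illustrative payload, not fetched from the API
sample_ratings = [
    {"Source": "Internet Movie Database", "Value": "7.3/10"},
    {"Source": "Rotten Tomatoes", "Value": "89%"},
    {"Source": "Metacritic", "Value": "86/100"},
]

value = rating_by_source(sample_ratings, "Rotten Tomatoes")
# Same scaling as the notebook: drop the '%' sign and divide by 10
rotten = float(value[:-1]) / 10 if value else None
print(rotten)
```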

In [105]:
# Get genres and save them on pickle
all_genres = []
movie_dict = pickle.load(open('my_data/movie_dict_FINAL', "rb" ))

for (title, year) in movie_dict:
    if 'genre' not in movie_dict[title,year]:
        continue
    if len(movie_dict[title,year]['genre'])>0:
        all_genres = all_genres + [x.strip() for x in movie_dict[title,year]['genre']]
all_genres = list(set(all_genres))
print(all_genres)
print(len(all_genres))
all_genres.remove('n/a')
all_genres

['adult', 'action', 'animation', 'news', 'reality-tv', 'sci-fi', 'family', 'adventure', 'crime', 'n/a', 'music', 'film-noir', 'short', 'biography', 'history', 'western', 'fantasy', 'musical', 'romance', 'talk-show', 'sport', 'comedy', 'horror', 'drama', 'mystery', 'documentary', 'war', 'game-show', 'thriller']
29


['adult',
 'action',
 'animation',
 'news',
 'reality-tv',
 'sci-fi',
 'family',
 'adventure',
 'crime',
 'music',
 'film-noir',
 'short',
 'biography',
 'history',
 'western',
 'fantasy',
 'musical',
 'romance',
 'talk-show',
 'sport',
 'comedy',
 'horror',
 'drama',
 'mystery',
 'documentary',
 'war',
 'game-show',
 'thriller']

In [106]:

pickle.dump(all_genres,open('my_data/all_genres_omdb', "wb" ))

# Create DataFrame
Now that all movie information has been obtained and saved in the movie_dict dictionary, we create a DataFrame.

In [12]:
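def create_dataframe_from_dict(movie_dict, all_columns, years):
    """Accepts the movie dictionary and creates a DataFrame 
    containing budget, box_office, running_time, cast etc (listed in all_columns).
    List-valued entries (e.g. cast, genre) are stored as the length of the list.
    Returns the DataFrame and a list of rejected (title, year, 1) tuples.
    The years argument is currently unused.
    """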
    df = pd.DataFrame(columns = all_columns)
    rejects = []
    for i,(title,year) in enumerate(movie_dict):

        #Tracking display
        if i%11 == 0:
            clear_output()

        print(title,year)
        movie = movie_dict[title,year]
        if len(movie)==0:
            print(f'\tCould not prepare DataFrame for {title}. Movie info missing!')
            rejects.append((title,year,1))
            continue

        for col in all_columns:
            if col in movie:
                if type(movie[col])!=list:
                    df.loc[i,col] = movie[col]
                elif type(movie[col])==list:
                    df.loc[i,col] = len(movie[col])
            else:
                df.loc[i,col] = None

                
    return df, list(set(rejects))
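
`create_dataframe_from_dict` fills the frame one cell at a time with `df.loc`, which tends to be slow for thousands of films. A sketch of an alternative that first builds plain row dictionaries, assuming the same rule that list-valued fields are replaced by their length; `flatten_movie` is a hypothetical helper and the sample entries are illustrative:

```python
def flatten_movie(movie, columns):
    """Map one movie dictionary to a flat row: lists become their length, missing keys become None."""
    row = {}
    for col in columns:
        value = movie.get(col)
        row[col] = len(value) if isinstance(value, list) else value
    return row

# Illustrative subset of movie_dict entries
movies = {
    ("titanic", 1997): {
        "title": "titanic", "year": 1997,
        "cast": ["kathy bates", "billy zane", "kate winslet", "leonardo dicaprio"],
        "genre": ["drama", " romance"], "budget": 200.0,
    },
    ("unknown film", 2000): {},  # empty entries are skipped, as in the original function
}
columns = ["title", "year", "cast", "genre", "budget", "box_office"]

rows = [flatten_movie(m, columns) for m in movies.values() if m]
print(rows)
```

Passing `rows` to `pd.DataFrame(rows, columns=columns)` would then build the frame in a single step.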

In [107]:
movie_dict['titanic',1997]

{'title': 'titanic',
 'year': 1997,
 'screenplay': ['james cameron'],
 'director': ['james cameron'],
 'cast': ['kathy bates', 'billy zane', 'kate winslet', 'leonardo dicaprio'],
 'imdbID': 'tt0120338',
 'plot': 'a seventeen-year-old aristocrat falls in love with a kind but poor artist aboard the luxurious, ill-fated r.m.s. titanic.',
 'language': 'English, Swedish, Italian',
 'running_time': '194 min',
 'imdb_rating': 7.8,
 'n_votes': 983627.0,
 'metscore': 7.5,
 'rotten_tomatoes': 8.9,
 'other_wins': 115.0,
 'other_noms': 80.0,
 'genre': ['drama', ' romance'],
 'budget': 200.0,
 'opening_weekend': 28.64,
 'gross_box_office': 2187.46}

In [19]:
%%time
movie_dict_all = pickle.load(open("my_data/movie_dict_FINAL","rb"))
columns = ['imdbID','title','year','n_votes','imdb_rating', 'metscore', 'rotten_tomatoes', 'budget', 
           'box_office','cast', 'genre','running_time', 'other_wins', 'other_noms', 'opening_weekend', 'gross_box_office']
df_movies, rejected = create_dataframe_from_dict(movie_dict_all, columns, range(1960,2020))
df_movies.rename(columns = {'cast':'cast_size', 'genre':'genre_span'},inplace=True)

a shaun the sheep movie: farmageddon 2019
parasite 2019
pain and glory 2019
CPU times: user 30min 4s, sys: 24 s, total: 30min 28s
Wall time: 32min 19s


In [193]:
df_movies.year = [y.year for y in pd.to_datetime(df_movies.year.astype('int'), format='%Y')]
df_movies.reset_index(drop=True)
df_movies.head()

| Unnamed: 0 | imdbID | title | year | n_votes | imdb_rating | metscore | rotten_tomatoes | budget | box_office | cast_size | genre_span | running_time | other_wins | other_noms | opening_weekend | gross_box_office |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | tt0054380 | the 3rd voice | 1960 | 98.0 | 7.1 | 7.1 | 7.1 | | | 4.0 | 3.0 | 79.0 | 0.0 | 0.0 | | |
| 1 | tt0053561 | 5 branded women | 1960 | 395.0 | 6.7 | 6.7 | 6.7 | | | 4.0 | 2.0 | 115.0 | 0.0 | 0.0 | | |
| 2 | tt0054415 | 12 to the moon | 1960 | 947.0 | 3.1 | 3.1 | 3.1 | 0.15 | | 4.0 | 1.0 | 74.0 | 0.0 | 0.0 | | |
| 3 | tt0053558 | 13 fighting men | 1960 | 54.0 | 4.7 | 4.7 | 4.7 | | | 4.0 | 3.0 | 69.0 | 0.0 | 0.0 | | |
| 4 | tt0053559 | 13 ghosts | 1960 | 4873.0 | 6.1 | 5.6 | 3.6 | | | 4.0 | 1.0 | 85.0 | 0.0 | 1.0 | | |


In [195]:
df_movies.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 13957 entries, 0 to 16260
Data columns (total 16 columns):
imdbID              13957 non-null object
title               13957 non-null object
year                13957 non-null int64
n_votes             13957 non-null float64
imdb_rating         13957 non-null float64
metscore            13957 non-null float64
rotten_tomatoes     13957 non-null float64
budget              7564 non-null float64
box_office          3480 non-null float64
cast_size           13957 non-null float64
genre_span          13957 non-null float64
running_time        13882 non-null float64
other_wins          13957 non-null float64
other_noms          13957 non-null float64
opening_weekend     7460 non-null float64
gross_box_office    8662 non-null float64
dtypes: float64(13), int64(1), object(2)
memory usage: 1.8+ MB


In [196]:
df_movies.to_csv('my_data/df_movies_FINAL.csv')

In [34]:
# Fill in metscore and rotten tomatoes info if missing
for row in df_movies.iterrows():
    if row[1].metscore is None: # | np.isnan(row[1].metscore):
        df_movies.loc[row[0],'metscore'] = row[1].imdb_rating

    if row[1].rotten_tomatoes is None: # | np.isnan(row[1].rotten_tomatoes):
        df_movies.loc[row[0],'rotten_tomatoes'] = row[1].imdb_rating

        
# Convert running_time from '132 min' to 132.0      
df_movies = pd.read_csv('my_data/df_movies_FINAL.csv')
for row in df_movies.iterrows():
    idx = row[0]
    x = row[1].running_time
    if type(x) != str:
        df_movies.loc[idx,'running_time'] = np.nan
        continue
        
    df_movies.loc[idx,'running_time'] = float(re.findall(r'(\d+)',x)[0])
df_movies.running_time = df_movies.running_time.astype(float)
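
The two row-by-row loops above can also be expressed with vectorized pandas operations. A minimal sketch on a toy frame, assuming missing scores are stored as NaN (as they are after `pd.read_csv`) and that `running_time` holds strings like `'132 min'`; the toy values are illustrative:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "imdb_rating": [7.1, 6.1],
    "metscore": [np.nan, 5.6],
    "rotten_tomatoes": [np.nan, 3.6],
    "running_time": ["79 min", np.nan],
})

# Fall back to the IMDB rating where a critic score is missing
for col in ["metscore", "rotten_tomatoes"]:
    toy[col] = toy[col].fillna(toy["imdb_rating"])

# Pull the first run of digits out of strings like '79 min'; missing values stay NaN
toy["running_time"] = toy["running_time"].str.extract(r"(\d+)", expand=False).astype(float)

print(toy)
```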

In [109]:
df_movies.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16261 entries, 0 to 16260
Data columns (total 16 columns):
imdbID              13957 non-null object
title               13957 non-null object
year                13957 non-null float64
n_votes             13957 non-null float64
imdb_rating         13957 non-null float64
metscore            13957 non-null float64
rotten_tomatoes     13957 non-null float64
budget              9868 non-null float64
box_office          3480 non-null float64
cast_size           13957 non-null float64
genre_span          13957 non-null float64
running_time        13882 non-null float64
other_wins          13957 non-null float64
other_noms          13957 non-null float64
opening_weekend     9764 non-null float64
gross_box_office    10966 non-null float64
dtypes: float64(14), object(2)
memory usage: 2.0+ MB


In [114]:
movie_dict['titanic', 1997]

{'title': 'titanic',
 'year': 1997,
 'screenplay': ['james cameron'],
 'director': ['james cameron'],
 'cast': ['kathy bates', 'billy zane', 'kate winslet', 'leonardo dicaprio'],
 'imdbID': 'tt0120338',
 'plot': 'a seventeen-year-old aristocrat falls in love with a kind but poor artist aboard the luxurious, ill-fated r.m.s. titanic.',
 'language': 'English, Swedish, Italian',
 'running_time': '194 min',
 'imdb_rating': 7.8,
 'n_votes': 983627.0,
 'metscore': 7.5,
 'rotten_tomatoes': 8.9,
 'other_wins': 115.0,
 'other_noms': 80.0,
 'genre': ['drama', ' romance'],
 'budget': 200.0,
 'opening_weekend': 28.64,
 'gross_box_office': 2187.46}

# Oscar Information

Enter the code for getting oscars information and individual award and nomination information. 

# 2.0 Data Acquisition for Academy Awards with Webscraping and API
<a id="acquisition"></a>

We will use BeautifulSoup for web scraping together with the Wikipedia API to obtain information about the most popular films released between 1960 and 2019 and the various Academy Award nominations and wins during that period. Our goal is to look at Hollywood films under three categories of performance (awards, critical review and revenue) to identify what predicts the success of a movie. 

In other words, we describe each film in terms of its measurable and identifiable features and see how they contributed to its success. Are big-budget movies more likely to win awards? Is it the size or the profile of the cast that draws critical acclaim? Are longer movies more popular with the award committee and the fans alike?

To obtain clean and reliable information about movies and their awards and nominations in the different film categories, we use two APIs: the Wikipedia API and the OMDB API. 

We also use the BeautifulSoup module along with regular expressions (re) to extract information from Wikipedia and IMDB. 

## 2.2 Web scraping from Wikipedia

The following function obtains information about a film, namely its director, cast, running time, budget and box office, as shown below for Lawrence of Arabia (1962), which was used for predicting the 2020 Oscars.
The information is returned as a Python dictionary.
<br>
<img src="files/wiki4.png" height="100" width="100" align="left" style="width:40%">
<img src="files/wiki2.png" height="100" width="100" align="center" style="width:40%">


## 2.1 The Wikipedia API
The following function accepts a title, year or category and obtains the HTML document for the relevant Wikipedia page.

The Wikipedia pages for the Academy Awards are not titled by year (e.g. "1974 Academy Awards") but by edition, in the form "N<sup>th</sup> Academy Awards", such as <a href="https://en.wikipedia.org/wiki/34th_Academy_Awards">"34th Academy Awards"</a>. The same Wikipedia API can return individual film information if the movie title and year are specified, and the Academy Awards information for a given year, as shown below.
<br>

#### Getting Oscar Editions
The Wikipedia page for the Academy Awards is referenced **not** by year (e.g. 1962 Academy Awards) but by edition (e.g. 34th Academy Awards), as shown below.
<br>
 

In [None]:
# The Wikipedia pages for Academy Awards are listed in terms of their edition: 1st, 2nd, 3rd, etc.


def wikiapi_nth(year, award = ' Academy Awards'):
    '''Returns the Wikipedia page of the Academy Awards ceremony held in the given year,
    or None if the page could not be found.'''
    page_title = get_edition(year) + award  # e.g. '34th Academy Awards'
    # Try the title with spaces first, then with underscores
    for candidate in (page_title, page_title.replace(' ','_')):
        try:
            return wikipedia.page(candidate)
        except wikipedia.exceptions.PageError:
            continue
    print(f"The page for {page_title} could not be found")
    return None

########################################################
def get_edition(year):
    '''Takes in the year of an Oscars ceremony and returns its edition as an ordinal string,
    e.g. 1930 returns '2nd' and 1962 returns '34th'.'''
    
    # Ordinal suffixes indexed by the last digit of the edition number
    suffixes = ['th', 'st', 'nd', 'rd', 'th', 'th', 'th', 'th', 'th', 'th']
    nth = year - 1928
    # 11th to 19th (and 111th to 119th) always take 'th'
    if 10 < nth % 100 < 20:
        year_th = str(nth) + 'th'
    else:
        year_th = str(nth) + suffixes[nth % 10]
    return year_th


#########################################################
def get_genre(intro):
    """Takes in the intro paragraphs from Wikipedia and scrapes out information about genre.
    Returns a list of unique genres, or None if no genre is found.
    """
    with open("my_data_2/genre_dict", "rb") as f:
        genre_dict = pickle.load(f)
    this_movie_genres = []
    for line in intro: # go through all the paragraphs
        # only look at the first sentence, in lower case
        line = line.text.split('.')[0].lower()
        for genre in genre_dict:
            if genre in line: # also matches a genre at the very start of the sentence
                this_movie_genres.append(genre_dict[genre])
        
        if len(this_movie_genres)>0:
            return list(set(this_movie_genres))
    return None
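
The year-to-edition mapping can be checked in isolation. The following is a minimal sketch, not the notebook's own function: `edition_for_year` and its `first_year` parameter are hypothetical names, and the sketch assumes exactly one ceremony per calendar year starting in 1929.

```python
def edition_for_year(year, first_year=1929):
    """Return the ordinal edition (e.g. '34th') of the ceremony held in `year`.

    Assumption: one ceremony per calendar year, the first one in `first_year`.
    """
    nth = year - first_year + 1
    if 10 < nth % 100 < 14:
        # 11th, 12th and 13th are irregular
        suffix = "th"
    else:
        suffix = {1: "st", 2: "nd", 3: "rd"}.get(nth % 10, "th")
    return f"{nth}{suffix}"


for y in (1930, 1940, 1962, 2020):
    # Build the page title in the form Wikipedia expects
    print(y, edition_for_year(y) + " Academy Awards")
```

The resulting strings ("34th Academy Awards" for 1962, "92nd Academy Awards" for 2020) are the page titles that are then passed to the Wikipedia API.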


In [None]:
# Define Oscar Categories 
major_categories = ['picture','director','s_actor','s_actress','actor', 
                   'actress','screenplay']
minor_categories = ['music','cinematography','editing','effects','sound',
                    'costume','song', 'art_direction']
all_categories = major_categories + minor_categories

pickle.dump(major_categories,open( 'my_data_4/major_categories', "wb" ))
pickle.dump(minor_categories,open( 'my_data_4/minor_categories', "wb" ))
pickle.dump(all_categories,open( 'my_data_4/all_categories', "wb" ))

In [None]:
def get_main_category(category):
    """This function accepts variants of basic categories such as 
    'Best Motion Picture', 'Best Picture' and 'Outstanding Picture'
    and returns the standardized name, here "picture".
    """
    category = category.lower()

    if category.find('story')>=0:
        return 'other'
    if category.find('best picture')>=0 or category.find('best motion picture')>=0:
        return 'picture'
    if category.find('outstanding production')>=0 or category.find('outstanding picture')>=0:
        return 'picture'
    if category.find('actor')>=0:
        if category.find('supporting')>=0:
            return 's_actor'
        else:
            return 'actor'   
    if category.find('actress')>=0:
        if category.find('supporting')>=0:
            return 's_actress'
        else:
            return 'actress'    
    if category.find('best director')>=0:
        return 'director'
    if category.find('screenplay')>=0:
        return 'screenplay'
    if category.find('music')>=0 or (category.find('scor')>=0):
        return 'music'
    if category.find('costume')>=0:
        return 'costume'
    if category.find('editing')>=0:
        return 'editing'
    if category.find('effects')>=0:
        return 'effects'
    if category.find('cinematography')>=0:
        return 'cinematography'
    if category.find('sound')>=0:
        return 'sound'
    if category.find('song')>=0:
        return 'song'
    if category.find('art')>=0 and category.find('direct')>=0:
        return 'art_direction'
    # if category.find('art direction')>=0:
        # return 'art direction'
    else:
        # print(f'Warning:{category} did not get matched!')
        return 'other'
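
The same keyword rules can also be written as an ordered table, which makes the order of precedence easier to read. This is a hedged sketch that covers only a subset of the categories above; `RULES` and `normalize_category` are hypothetical names introduced for illustration.

```python
# Ordered (keywords, label) rules: the first rule whose keywords all appear wins.
# Only a subset of the categories is listed here.
RULES = [
    (("story",), "other"),
    (("best picture",), "picture"),
    (("best motion picture",), "picture"),
    (("outstanding production",), "picture"),
    (("outstanding picture",), "picture"),
    (("supporting", "actor"), "s_actor"),
    (("actor",), "actor"),
    (("supporting", "actress"), "s_actress"),
    (("actress",), "actress"),
    (("best director",), "director"),
    (("screenplay",), "screenplay"),
]


def normalize_category(header):
    """Map a raw award header to a standardized label, or 'other' if nothing matches."""
    text = header.lower()
    for keywords, label in RULES:
        if all(keyword in text for keyword in keywords):
            return label
    return "other"


for header in ("Best Actor in a Supporting Role", "Outstanding Production", "Best Story"):
    print(header, "->", normalize_category(header))
```

Because the rules are checked in order, the more specific "supporting" rules must come before the plain "actor" and "actress" rules.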

## Film Awards and Nominations
The following function obtains award information at the level of the movie and NOT the individual winners.

In [None]:
# Awards and Nominations
def awards_and_nominations(year, award = ' Academy Awards', all_categories = 'all'):
    '''This function accepts year, title or award and goes into the wikipedia page of the Award 
    or the Movie and extracts all the necessary information.
    '''
    
    if all_categories=='all':
        all_categories = pickle.load(open("my_data_4/all_categories","rb")) 

    
    # initialize an empty DataFrame and the containers for missing categories, winners and nominees
    oscars_wn = pd.DataFrame()
    missing_categories = []
    winner_list = dict()
    nominee_list = dict()
    
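    # Get the html file from the Wikipedia page using the wikipedia API.
    # Parse it with BeautifulSoup
    page = wikiapi_nth(year)
    if page is None: # the page lookup failed, so there is nothing to parse
        return oscars_wn, missing_categories
    soup = BeautifulSoup(page.html(),'lxml')
    print(get_edition(year))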
        
    # Get the table-body (tbody) from the wikipedia page where Oscars information are stored
    tbody = soup.body.find('table', class_="wikitable").find('tbody')
    
    print("td:", len(tbody.find_all('td')))
    print("th:", len(tbody.find_all('th')))
    print("div:", len(tbody.find_all('div')))
       
    # Make sure number of cells and header match
    if len(tbody.find_all('td')) !=len(tbody.find_all(['div', 'th'])):
    
        # The Wikipedia table needs to be fixed! Some category headers may be missing.
        print('Warning: Number of cell <td> elements and header <th> elements does not match for year:', year)
        print('Returned empty DataFrame')
        return oscars_wn, missing_categories
        
        
    # Get winners and nominees
    try:
        for td,th in zip(tbody.find_all('td'),tbody.find_all(['div', 'th'])):
            cat = th.text.strip()
            # Get standardized categories
            category = get_main_category(cat)
            if category=='other':
                # record headers that could not be mapped to a standard category
                missing_categories.append(cat)
            if category not in all_categories:
                continue
            winner_list[category] = [] # initialize empty lists for this category
            nominee_list[category] = []

            # Go into the list and look at every line
            for tli in td.find_all('li'):
                # go down each line (li) and check if it is bolded
                if tli.find('b') is not None: # winner
                    winner = tli.find('i').text.strip()       # get italicized movie
                    winner = re.sub(r'–',"", winner).strip()
                    winner = winner.lower()
                    winner_list[category].append(winner) # add movie to winner list
                    oscars_wn.loc[winner,'year']= int(year)-1
                    oscars_wn.loc[winner,category]='W'
                    
                else: # nominee
                    nominee = tli.find('i').text.strip()      # get italicized movie
                    nominee = re.sub(r'–',"", nominee).strip()
                    nominee = nominee.lower()
                    nominee_list[category].append(nominee)
                    oscars_wn.loc[nominee,'year']= int(year)-1
                    if nominee in winner_list[category]: # same movie has already won in this category: mark it as 'WN'
                        oscars_wn.loc[nominee,category]='WN'
                    else:
                        oscars_wn.loc[nominee,category]='N'

        
    except AttributeError:
        print(f'Warning: in Category {category} for Year: {year}')
            

    # oscars_wn['film'] = oscars_wn.index
    return oscars_wn.fillna('O'), missing_categories
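
The parsing convention used above (film titles in italics, winners in bold) can be tried on a small hand-written fragment. This is a simplified sketch under stated assumptions: the HTML below is hypothetical and much simpler than a real Wikipedia awards table, and it uses Python's built-in `html.parser` instead of `lxml`.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment that mimics the assumed layout:
# one header cell per category, one data cell with a list of films.
html = """
<table class="wikitable"><tbody>
<tr><th>Best Picture</th></tr>
<tr><td><ul>
  <li><b><i>Film A</i> (Producer One)</b></li>
  <li><i>Film B</i> (Producer Two)</li>
  <li><i>Film C</i> (Producer Three)</li>
</ul></td></tr>
</tbody></table>
"""

soup = BeautifulSoup(html, "html.parser")
table = soup.find("table", class_="wikitable")

results = []
for th, td in zip(table.find_all("th"), table.find_all("td")):
    category = th.get_text(strip=True)
    for li in td.find_all("li"):
        film = li.find("i").get_text(strip=True).lower()  # italicized film title
        result = "W" if li.find("b") is not None else "N"  # bold marks the winner
        results.append((category, film, result))

for row in results:
    print(row)
```

Real pages deviate from this layout in places, which is why the function above compares the number of header and data cells before parsing.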

In [None]:
%%time
# GET MOVIE AWARDS FROM ALL CATEGORIES

frames = []
for year in range(2020,2021):
    if year%10==0:
        clear_output()
        print(f'In year {year}')
    df, missing = awards_and_nominations(year)
    frames.append(df)
# pd.concat replaces DataFrame.append, which was removed in pandas 2.0
df_oscars_wide = pd.concat(frames)

# Convert year to datetime and back to an integer year
df_oscars_wide.year = df_oscars_wide.year.astype('int').astype('str')
df_oscars_wide.year = [x.year for x in pd.to_datetime(df_oscars_wide.year, format='%Y', exact=True)]
df_oscars_wide.reset_index(inplace=True)
df_oscars_wide.rename(columns = {'index':'film'},inplace=True)
print(df_oscars_wide.head())

# Prepare the long version
df_oscars_long = pd.melt(df_oscars_wide, id_vars = ['film', 'year'], var_name='category', value_name = 'result')



# Check 
df_oscars_wide.fillna('O',inplace=True)
print(df_oscars_wide.shape)
print(df_oscars_long.shape)
df_oscars_wide.rename(columns = {'film':'title'},inplace=True)
df_oscars_wide.info()
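
The difference between the wide and the long layout is easiest to see on a toy table. The following sketch uses made-up films and results; only `pd.melt` and its arguments mirror the cell above.

```python
import pandas as pd

# Wide layout: one row per film, one column per award category
wide = pd.DataFrame({
    "film": ["film a", "film b"],
    "year": [2019, 2019],
    "picture": ["W", "N"],
    "director": ["N", "O"],
})

# Long layout: one row per (film, category) pair
long_df = pd.melt(wide, id_vars=["film", "year"],
                  var_name="category", value_name="result")
print(long_df)
```

Each of the two films contributes one row per category, so the long table has four rows with the columns `film`, `year`, `category` and `result`.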

## Individual Awards and Nominations
The previous function obtained information about the various winners and nominees at the movie level. The next function obtains information in the individual categories.

In [None]:
# Awards and Nominations for individuals: actor, actress, director
def individual_awards_and_nominations(year, get_categories ='all', award =' Academy Awards'):
    '''This function accepts year, title or award and goes into the wikipedia page of the Award 
    or the Movie and extracts all the necessary information for INDIVIDUAL winners, example 
    Steven Spielberg in directing and Tom Hanks in Actor category etc. 
    '''
    
    if get_categories == 'all':
        get_categories = pickle.load(open("my_data_4/main_categories","rb"))

    
    # initialize an empty DataFrame, the list of missing categories and the row counter
    individual_wn = pd.DataFrame()
    missing_categories = []
    idx = 0
    
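    # Get the html file from the Wikipedia page using the wikipedia API.
    # Parse it with BeautifulSoup
    page = wikiapi_nth(year)
    if page is None: # the page lookup failed, so there is nothing to parse
        return individual_wn, missing_categories
    soup = BeautifulSoup(page.html(),'lxml')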
        
    # Get the table-body (tbody) from the wikipedia page where Oscars information are stored
    tbody = soup.body.find('table', class_="wikitable").find('tbody')
       
    # Make sure number of cells and header match
    if len(tbody.find_all('td')) !=len(tbody.find_all(['div', 'th'])):
    
        # The Wikipedia table needs to be fixed! Some category headers may be missing.
        print('Warning: Number of cell <td> elements and header <th> elements does not match for year:', year)
        print('Returned empty DataFrame')
        return individual_wn, missing_categories
        
        
    # Get winners and nominees
    try:
        for td,th in zip(tbody.find_all('td'),tbody.find_all(['div', 'th'])):
            cat = th.text.strip()
            category = get_main_category(cat)
            # print('\nCategory:', category)
            if category not in get_categories:
                # print(year, category)
                continue

            # Go into the list and remove the film names in <i> italic tags
            for line in td.find_all('li'):
                if line.find('i') != None:
                    line.i.decompose() # remove film
                    
                if line.find('b')!= None: # if in bold then winner
                    for wins in line.find_all('b'):
                        if wins.find('i') != None: # remove film
                            wins.i.decompose()
                            
                        if wins.find('a')!=None: # winner will be the first hyperlink
                            winner = wins.find_all('a')[0].text.strip()
                        else:
                            winner = wins.text.strip() #if no hyperlink, winner is first text
                        winner = re.sub(r'–',"", winner).strip()
                        if winner != None:
                            individual_wn.loc[idx,'year']= int(year)
                            individual_wn.loc[idx, 'name']= winner
                            individual_wn.loc[idx, 'category']= category
                            individual_wn.loc[idx, 'result']= 'W'
                            idx = idx + 1
                
                elif line.find('b') == None: # if no bold then not a winner
                    if line.find('a') != None:
                        nominee = line.find_all('a')[0].text.strip()
                    else:
                        nominee = line.text.strip()
                    nominee = re.sub(r'–',"", nominee).strip()
                    if nominee != None:
                        individual_wn.loc[idx,'year']= int(year)
                        individual_wn.loc[idx, 'name']= nominee
                        individual_wn.loc[idx, 'category']= category
                        individual_wn.loc[idx, 'result']= 'N'
                        idx = idx + 1

        
    except AttributeError:
        print(f'Warning: in Category {category} for Year: {year}')

    return individual_wn, missing_categories

In [None]:
%%time
# GET INDIVIDUAL AWARDS FROM ALL CATEGORIES
major_categories = pickle.load(open("my_data_4/major_categories","rb"))
minor_categories = pickle.load(open("my_data_4/minor_categories","rb"))
get_categories = major_categories + minor_categories
print(get_categories)
frames = []
for year in range(2020,2021):
    if year%10==0:
        clear_output()
    print(f'In year {year}')
    df, missing = individual_awards_and_nominations(year, get_categories)
    frames.append(df)
# pd.concat replaces DataFrame.append, which was removed in pandas 2.0
df_individual_long = pd.concat(frames)


# Convert year to datetime and back to an integer year, and SAVE
df_individual_long.year = df_individual_long.year.astype('int').astype('str')
df_individual_long.year = [x.year for x in pd.to_datetime(df_individual_long.year, format='%Y', exact=True)]
df_individual_long.reset_index(inplace=True)
# df_individual_long.to_csv('my_data_4/df_individual_long.csv')
df_individual_long.to_csv('my_data/df_individual_long_2020.csv')

# Genre DataFrame

In [200]:
def get_movie_genres(df_mov, mov_dict, all_genres):  
    """Takes in the movie information and the movie dictionary and creates a 
    one-hot encoded genre DataFrame with one column per genre in all_genres.
    """
    
   
    columns = list(df_mov.columns) + all_genres
    df_genre = pd.DataFrame(columns = columns)
    
    counter = 0
    for row in df_mov.iterrows(): # go through each movie
        counter = counter + 1
        if counter%1000 == 0:
            print(counter)
        idx = row[0]
        title = row[1].title
        year = row[1].year
        imdbID = row[1].imdbID
        df_genre.loc[idx,'title'] = title
        df_genre.loc[idx,'year'] = year
        df_genre.loc[idx,'imdbID'] = imdbID
        
        if np.isnan(year):
            continue
        

        if 'genre' in mov_dict[title,year] and len(mov_dict[title,year]['genre']) > 0:
            
            for gen in mov_dict[title,year]['genre']:
                gen = gen.strip()
                df_genre.loc[idx,gen]=1


    return df_genre.fillna('0') 
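
The row-by-row loop above can also be expressed with vectorized string operations. This is a hedged alternative sketch on toy data, not the notebook's method: it assumes each film's genres are available as a list of strings (possibly with stray spaces, as in the `movie_dict` entries) and uses `Series.str.get_dummies`.

```python
import pandas as pd

# Toy data: genres stored as lists, some with leading spaces
movies = pd.DataFrame({
    "title": ["film a", "film b", "film c"],
    "genre": [["drama", " romance"], ["comedy"], []],
})

# Join the cleaned genres into one string per film, then split into indicator columns
joined = movies["genre"].apply(lambda genres: "|".join(g.strip() for g in genres))
one_hot = joined.str.get_dummies(sep="|")

df_genre_sketch = pd.concat([movies[["title"]], one_hot], axis=1)
print(df_genre_sketch)
```

A film with no genres gets a row of zeros, and the indicator columns are integers rather than strings.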

In [201]:
%%time
all_genres = pickle.load(open('my_data/all_genres_omdb', "rb" ))
movie_dict = pickle.load(open('my_data/movie_dict_FINAL', "rb"))
df_genres = get_movie_genres(df_movies[['title','year', 'imdbID']], movie_dict, all_genres)

1000
2000
3000
4000
5000
6000
7000
8000
9000
10000
11000
12000
13000
CPU times: user 1min 50s, sys: 2.26 s, total: 1min 53s
Wall time: 1min 55s


In [204]:
df_movies.shape

(13957, 16)

In [203]:
# df_genres.year = [y.year for y in pd.to_datetime(df_genres.year.astype('int'), format='%Y')]
df_genres.tail()
df_genres.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 13957 entries, 0 to 16260
Data columns (total 32 columns):
title          13957 non-null object
year           13957 non-null float64
imdbID         13957 non-null object
adult          13957 non-null object
action         13957 non-null object
animation      13957 non-null object
news           13957 non-null object
reality-tv     13957 non-null object
sci-fi         13957 non-null object
family         13957 non-null object
adventure      13957 non-null object
crime          13957 non-null object
music          13957 non-null object
film-noir      13957 non-null object
short          13957 non-null object
biography      13957 non-null object
history        13957 non-null object
western        13957 non-null object
fantasy        13957 non-null object
musical        13957 non-null object
romance        13957 non-null object
talk-show      13957 non-null object
sport          13957 non-null object
comedy         13957 non-null object
ho

In [205]:
df_genres.to_csv('my_data/df_genres_FINAL.csv')

In [207]:
df_movies.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 13957 entries, 0 to 16260
Data columns (total 16 columns):
imdbID              13957 non-null object
title               13957 non-null object
year                13957 non-null int64
n_votes             13957 non-null float64
imdb_rating         13957 non-null float64
metscore            13957 non-null float64
rotten_tomatoes     13957 non-null float64
budget              7564 non-null float64
box_office          3480 non-null float64
cast_size           13957 non-null float64
genre_span          13957 non-null float64
running_time        13882 non-null float64
other_wins          13957 non-null float64
other_noms          13957 non-null float64
opening_weekend     7460 non-null float64
gross_box_office    8662 non-null float64
dtypes: float64(13), int64(1), object(2)
memory usage: 1.8+ MB


# Score and Count

We will obtain an Oscar score and count for each movie.

In [208]:
df_oscars_wide = pd.read_csv('my_data/df_oscars_wide.csv', index_col=[0])
df_oscars_long = pd.read_csv('my_data/df_oscars_long.csv', index_col=[0])
df_individual_long = pd.read_csv('my_data/df_individual_long.csv', index_col=[0])
# df_oscars_wide.rename(columns = {'film':'title'},inplace=True)
df_oscars_wide.title = df_oscars_wide.title.str.replace(':','')

df_individual_long.name = df_individual_long.name.str.lower()

In [211]:
df_oscars_long.head()

Unnamed: 0,title,year,category,result
0,gone with the wind,1939,actor,N
1,dark victory,1939,actor,O
2,"goodbye, mr. chips",1939,actor,W
3,love affair,1939,actor,O
4,mr. smith goes to washington,1939,actor,N


In [232]:
def pre_oscar_count(df_mov, categories, movie_dict, df_ind):
    """Takes the films in df_mov and gives each one a total count based on the prior Oscar 
    records of its actors, directors and editorial crew.
    """
    if categories == 'all':
        categories = pickle.load(open("my_data_3/main_categories","rb"))
    
    df_mov = df_mov.copy() # work on a copy so the caller's (possibly sliced) frame is not modified
    df_mov['precount_wins'] = 0 # initialize all scores to zero
    df_mov['precount_noms'] = 0
            
    for row in df_mov.iterrows(): # go through each movie
        idx = row[0]
        if idx%10==0:
            clear_output()
            print(f'{idx}of {len(df_mov)}')
        title = row[1].title
        year = row[1].year
        if (title,year) in movie_dict:
            movie = movie_dict[title,year]
        else: 
            print(f'{title} in {year} missing in movie_dict')
            continue

        # get all_members from the infobox information in movie_dict
        count_wins = 0
        count_noms = 0
        checked_members = [] # to prevent double counting if any personnel is repeated (example: director and writer)

        
        # aggregate across categories and members
        for field in movie:
            if type(movie[field])==list:
                for member in movie[field]:
                    if member in checked_members:
                        continue
                    checked_members.append(member)
                    for cat in categories:

                        
                        df_this = df_ind[(df_ind.name == member)&(df_ind.category == cat)&(df_ind.year<=year)]
                        series = df_this.result.value_counts()
                        if 'W' in series:
                            count_wins = count_wins + series['W']
                            # print('W', cat, member, series['W'])
                        if 'N' in series:
                            count_noms = count_noms + series['N']
                            # print('N', cat, member, series['N'])
        
        df_mov.loc[idx,'precount_wins'] = count_wins
        df_mov.loc[idx,'precount_noms'] = count_noms
        
    
    return df_mov
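
Since the nested loops above are slow on the full data set, the same kind of count can be sketched with a merge and a group-by. This is an illustrative alternative on toy data, under assumptions: the film credits are available as one row per (film, person), the category filter is left out, and the names `credits` and `individual` are hypothetical.

```python
import pandas as pd

# One row per (film, person); duplicates are dropped so nobody is counted twice
credits = pd.DataFrame({
    "title": ["film x", "film x", "film y"],
    "year": [2000, 2000, 1990],
    "name": ["person a", "person b", "person a"],
}).drop_duplicates()

# Individual award records: ceremony year, person, result
individual = pd.DataFrame({
    "year": [1985, 1995, 1999, 2005],
    "name": ["person a", "person a", "person b", "person a"],
    "result": ["N", "W", "N", "W"],
})

# Pair every credited person with all of their award records
merged = credits.merge(individual, on="name", suffixes=("", "_award"))
# Keep only records from ceremonies held up to the film's year
prior = merged[merged["year_award"] <= merged["year"]]

counts = (prior.assign(precount_wins=prior["result"].eq("W"),
                       precount_noms=prior["result"].eq("N"))
               .groupby(["title", "year"])[["precount_wins", "precount_noms"]]
               .sum())
print(counts)
```

Films whose crew has no prior record do not appear in `counts`; after joining back onto the movie table, their counts would need to be filled with zero.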

In [233]:
%%time
categories = pickle.load(open("my_data/major_oscar_categories","rb"))
df_precount = pre_oscar_count(df_movies[['title','year', 'imdbID']], 
                                            categories, movie_dict, df_individual_long)



16260of 13957
CPU times: user 42min 20s, sys: 17.7 s, total: 42min 38s
Wall time: 42min 46s


In [235]:
# df_precount.precount_wins = df_precount_wins_total
# df_precount.precount_noms = df_precount_noms_total
df_precount.sort_values(by = "precount_wins", ascending=False)

Unnamed: 0,title,year,imdbID,precount_wins,precount_noms
12760,the post,2017,tt6294822,7,25
5248,love affair,1994,tt0110391,7,17
10024,how do you know,2010,tt1341188,6,10
602,lawrence of arabia,1962,tt0056172,6,9
13266,alita battle angel,2019,tt0437086,6,1
...,...,...,...,...,...
7544,joe dirt,2001,tt0245686,0,0
7545,joe somebody,2001,tt0279889,0,0
7546,josie and the pussycats,2001,tt0236348,0,0
7547,joy ride,2001,tt0206314,0,0


In [237]:
df_precount.to_csv('my_data/df_precount_FINAL.csv')

In [238]:
def post_oscar_count(df_mov,df_osc,categories):
    """
    This returns an oscar count for each year and film: 
    number of wins and nominations
    """
    
    # initialize the count columns on a copy, so the caller's frame is left untouched
    df_mov = df_mov.copy()
    missing_rows = [] # Oscar rows that have no match in df_mov
    df_mov['win'] = 0
    df_mov['nom'] = 0
    df_mov['none'] = 0
    
    for row in df_osc.iterrows():
        idx = row[0]
        title = row[1].title
        year = row[1].year
        
        if idx%10:
            clear_output()

        wins = 0
        noms = 0
        nons = 0
        for cat in categories:
            if row[1][cat] == 'W':
                wins = wins + 1
            if row[1][cat] == 'N':
                noms = noms + 1
            if row[1][cat] == 'O':
                nons = nons + 1
            if row[1][cat] == 'WN':
                wins = wins + 1
                noms = noms + 1
                
        # Look up the film once per Oscar row, after all categories have been counted
        this = df_mov[(df_mov.title==title) & (df_mov.year==year)]
        if len(this) == 0:
            print(f'WARNING: Oscar nominated film {title} in year {year} missing from df_movies ')
            missing_rows.append(row[1])
        if len(this) > 1:
            print(f'WARNING: Oscar nominated film {title} in year {year} multiple entries')
        if len(this) == 1:
            idx2 = this.index[0]
            df_mov.loc[idx2,'win'] = wins
            df_mov.loc[idx2,'nom'] = noms+wins # nominations include wins
            df_mov.loc[idx2,'none'] = nons
            
    df_missing = pd.DataFrame(missing_rows)
    return df_mov, df_missing
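
The per-film tally of wins and nominations can be illustrated on a toy wide table. The sketch below uses made-up films and follows the same convention as the function above: a 'WN' entry counts as both a win and a nomination, and the `nom` column includes wins.

```python
import pandas as pd

wide = pd.DataFrame({
    "title": ["film a", "film b"],
    "year": [2019, 2019],
    "picture": ["W", "N"],
    "director": ["N", "O"],
    "actor": ["WN", "O"],
})
categories = ["picture", "director", "actor"]

flags = wide[categories]
wins = flags.isin(["W", "WN"]).sum(axis=1)       # categories with a win
noms_only = flags.isin(["N", "WN"]).sum(axis=1)  # categories with a losing nomination

wide["win"] = wins
wide["nom"] = wins + noms_only  # nominations include wins
print(wide[["title", "win", "nom"]])
```

For "film a" this gives two wins and four nominations, and for "film b" no wins and one nomination.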

In [239]:
%%time
categories = pickle.load(open("my_data/major_oscar_categories","rb"))
df_osc = df_oscars_wide[df_oscars_wide.year >=1955]
df_postcount,df_missing = post_oscar_count(df_movies[['title','year', 'imdbID']],df_osc,categories)
df_postcount.drop('none', axis=1, inplace=True)
print(df_postcount.shape)

(13957, 6)
CPU times: user 1min 10s, sys: 2.35 s, total: 1min 13s
Wall time: 1min 12s


In [248]:
df_postcount.query("title == 'lincoln'")

Unnamed: 0,title,year,imdbID,win,nom
10820,lincoln,2012,tt0443272,1,6


In [250]:

df_postcount.to_csv('my_data/df_postcount_FINAL.csv')

# Combining Into 1 DataFrame

We will combine df_movies, df_precount, df_postcount, df_genres and df_oscars_wide into a single DataFrame.

In [4]:
df_oscars_wide = pd.read_csv('my_data/df_oscars_wide_FINAL.csv', index_col=[0])
df_movies = pd.read_csv('my_data/df_movies_FINAL.csv', index_col=[0])
df_genres = pd.read_csv('my_data/df_genres_FINAL.csv', index_col=[0])
df_precount = pd.read_csv('my_data/df_precount_FINAL.csv', index_col=[0])
df_postcount = pd.read_csv('my_data/df_postcount_FINAL.csv', index_col=[0])

print(df_oscars_wide.shape)
print(df_movies.shape)
print(df_genres.shape)
print(df_precount.shape)
print(df_postcount.shape)

(13957, 17)
(13957, 16)
(13957, 32)
(13957, 5)
(13957, 5)


In [8]:
df_postcount.sort_values(by = "win", ascending=False).head(10)

Unnamed: 0,title,year,imdbID,win,nom
4240,kramer vs. kramer,1979,tt0079417,5,7
3548,one flew over the cuckoo's nest,1975,tt0073486,5,6
4536,the silence of the lambs,1991,tt0102926,5,5
14201,terms of endearment,1983,tt0086425,5,7
7452,a beautiful mind,2001,tt0268978,4,5
6763,shakespeare in love,1998,tt0138097,4,6
3319,the godfather part ii,1974,tt0071562,4,7
427,west side story,1961,tt0055614,4,5
8451,million dollar baby,2004,tt0405159,4,6
10203,the king's speech,2010,tt1504320,4,6


In [5]:
df_oscars_wide.query("title == 'the irishman'")

Unnamed: 0,title,year,imdbID,actor,actress,cinematography,costume,director,editing,effects,music,picture,s_actor,s_actress,screenplay,song,sound
11514,the irishman,2019,tt1302006,O,O,N,N,N,N,N,O,N,N,O,N,O,O
12958,the irishman,1978,tt0077749,O,O,O,O,O,O,O,O,O,O,O,O,O,O


In [266]:
# Merge Oscars Wide 
# df_osc = df_movies[['title', 'year', 'imdbID']].merge(df_oscars_wide, how = 'left').fillna('O')
# print(df_osc.shape)
# df_osc.to_csv('my_data/df_oscars_wide_FINAL.csv')

(13957, 17)

In [9]:
df_oscars_wide.columns

Index(['title', 'year', 'imdbID', 'actor', 'actress', 'cinematography',
       'costume', 'director', 'editing', 'effects', 'music', 'picture',
       's_actor', 's_actress', 'screenplay', 'song', 'sound'],
      dtype='object')

In [13]:
df_main = df_movies
df_main = pd.merge(df_main, df_precount)
df_main = pd.merge(df_main, df_postcount)
df_main = pd.merge(df_main, df_genres.drop(['n/a','adult'], axis=1))
df_main = pd.merge(df_main.drop('music', axis=1), df_oscars_wide)
df_main.title = df_main.title.str.replace(":","")
df_main.shape

(13957, 60)
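
The merges above rely on pandas picking the shared columns as join keys. A more defensive variant names the keys explicitly and checks that no rows are duplicated or lost. This is a sketch on toy data, assuming that `imdbID`, `title` and `year` together identify a film; the frames `movies` and `counts` are hypothetical.

```python
import pandas as pd

movies = pd.DataFrame({
    "imdbID": ["tt1", "tt2"],
    "title": ["film a", "film b"],
    "year": [2019, 2019],
    "budget": [10.0, None],
})
counts = pd.DataFrame({
    "imdbID": ["tt1", "tt2"],
    "title": ["film a", "film b"],
    "year": [2019, 2019],
    "win": [1, 0],
    "nom": [3, 0],
})

keys = ["imdbID", "title", "year"]
# validate raises an error if a key combination is repeated on either side
merged = movies.merge(counts, on=keys, how="left", validate="one_to_one")

assert len(merged) == len(movies)  # a left merge must keep every movie exactly once
print(merged.shape)
```

With explicit keys, an accidental name clash between two tables (such as a `music` genre column and a `music` award column) produces suffixed columns instead of silently becoming a join key.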

In [14]:
df_main.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 13957 entries, 0 to 13956
Data columns (total 60 columns):
imdbID              13957 non-null object
title               13957 non-null object
year                13957 non-null int64
n_votes             13957 non-null float64
imdb_rating         13957 non-null float64
metscore            13957 non-null float64
rotten_tomatoes     13957 non-null float64
budget              7564 non-null float64
box_office          3480 non-null float64
cast_size           13957 non-null float64
genre_span          13957 non-null float64
running_time        13882 non-null float64
other_wins          13957 non-null float64
other_noms          13957 non-null float64
opening_weekend     7460 non-null float64
gross_box_office    8662 non-null float64
precount_wins       13957 non-null int64
precount_noms       13957 non-null int64
win                 13957 non-null int64
nom                 13957 non-null int64
action              13957 non-null int64
animat

In [15]:
df_main.to_csv('my_data/df_main_FINAL.csv')