The first attempt at answering the question "Should Leonardo DiCaprio have won an Oscar for an earlier performance?" resulted in models that grossly overpredicted the number of nominations and wins for Best Actor. Additionally, both of the original models seemed to agree on which titles should have received nominations, but differed slightly as to whether a particular title should have won. One performance they both agreed upon was his portrayal of Arnie Grape in What's Eating Gilbert Grape. From this original data set it was apparent that additional information was necessary, for example: How important is the role to the title? How old was the actor when they appeared in the title? We attempt to improve the reliability of the DiCaprio predictions by adding more features to the data set.

The new datasets were retrieved from: https://www.kaggle.com/stefanoleone992/imdb-extensive-dataset

In [1]:
import pandas as pd
import numpy as np
import datetime as dt
import re
import unidecode
import glob

In [2]:
movies_df = pd.read_csv('IMDb_movies.csv', usecols=['imdb_title_id','year','date_published','duration','budget','worlwide_gross_income','metascore','avg_vote','votes'],low_memory=False)
names_df = pd.read_csv('IMDb_names.csv', usecols=['imdb_name_id','name','height','date_of_birth'])
principals_df = pd.read_csv('IMDb_title_principals.csv', usecols=['imdb_title_id','imdb_name_id','ordering'], low_memory=False)
titles_df = pd.read_csv('title.basics.tsv',sep='\t', usecols=['tconst','primaryTitle'])
ratings_df = pd.read_csv('IMDb_ratings.csv')
ba_df = pd.read_csv('processed_data.csv')

While exploring the titles column in movies_df, it became apparent that a large portion of the title information was stored in French and Spanish. This necessitated extracting the title information from the larger, comprehensive IMDb data set.

In [3]:
titles_df.rename({'tconst':'imdb_title_id','primaryTitle':'title'}, axis ='columns', inplace=True)
names_df.rename({'name':'actor'}, axis ='columns',inplace=True)
ba_df.rename(str.lower, axis ='columns',inplace=True)
names_df.actor = names_df.actor.apply(lambda x: unidecode.unidecode(str(x)))

Here columns are given consistent names across data sets for easier joining. Additionally, actor names are transliterated to ASCII characters to match those in the original data set, ensuring that all actors who have received an Academy Award or nomination remain after filtering.
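As a rough illustration of why the names are folded to plain ASCII before filtering, the sketch below uses the standard-library `unicodedata` module as a stand-in for `unidecode`. This is an assumption made so the example has no third-party dependency; `unidecode` covers more characters than this approach. The helper `to_ascii` and the sample spellings are hypothetical.

```python
import unicodedata

def to_ascii(name):
    """Approximate ASCII folding: decompose accented characters, then drop the accents."""
    decomposed = unicodedata.normalize("NFKD", str(name))
    return decomposed.encode("ascii", "ignore").decode("ascii")

# Hypothetical spellings of the same name in the two data sets
award_names = {"Jose Ferrer"}
imdb_name = "José Ferrer"

matches_raw = imdb_name in award_names               # accented form does not match
matches_folded = to_ascii(imdb_name) in award_names  # folded form matches
```

Without the folding step, an actor whose name carries an accent in one source but not the other would silently drop out of the `isin` filter.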

In [4]:
actor_list = ba_df['actor'].unique().tolist()
names_df = names_df[names_df.actor.isin(actor_list)]

The data set retrieved from Kaggle is filtered so that only the actors who have been nominated for or won the Academy Award for Best Actor remain.

In [5]:
names_df.at[38127,'height']=175.0
names_df.at[38127,'date_of_birth']='1916-12-09'
names_df.at[71464,'height']=183.0
names_df.at[81083,'height']=188.0
names_df.at[85305,'height']=191.0
names_df.at[85305,'date_of_birth']='1896-08-30'
names_df.at[119054,'height']=188.0
names_df.at[127665,'height']=179.0
names_df.at[161858,'height']=178.0
names_df.at[161858,'date_of_birth']='1907-05-22'
names_df.at[177644,'height']=182.0
names_df.at[183892,'height']=1.0
names_df.at[183892,'date_of_birth']='1900-01-01'
names_df.at[218016,'height']=1.0
names_df.at[218016,'date_of_birth']='1900-01-01'
names_df.at[282012,'height']=1.0

Due to discrepancies in the new dataset found on Kaggle, the height and date-of-birth columns have missing values for several actors; these were filled in manually when the data could be found. In two cases no further record of the actor's height or date of birth could be found, so placeholder values (a height of 1.0 and a date of birth of 1900-01-01) were assigned.
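One possible way to keep such manual fixes auditable is to hold them in a single mapping and record which rows were touched. The sketch below is only an illustration: the frame `names`, its index labels, the actor names, and the `manually_patched` column are all hypothetical and not part of the original notebook.

```python
import pandas as pd

# Hypothetical rows standing in for names_df; the index labels are illustrative
names = pd.DataFrame(
    {
        "actor": ["Actor A", "Actor B"],
        "height": [None, 180.0],
        "date_of_birth": [None, "1950-01-01"],
    },
    index=[101, 202],
)

# One entry per row label, listing only the fields that need fixing
corrections = {
    101: {"height": 175.0, "date_of_birth": "1916-12-09"},
}

for row_label, fields in corrections.items():
    for column, value in fields.items():
        names.at[row_label, column] = value

# Record which rows were patched by hand so they can be reviewed later
names["manually_patched"] = names.index.isin(list(corrections))
```

A flag column like this makes it straightforward to exclude placeholder rows from later statistics if they turn out to distort them.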

In [6]:
movies_named = pd.merge(titles_df,movies_df, how='inner',on='imdb_title_id')
principals_named = pd.merge(principals_df, names_df, how='inner',on=['imdb_name_id'])
movies_rated = pd.merge(movies_named, ratings_df, how='inner', on='imdb_title_id')
big_df = pd.merge(principals_named,movies_rated, how='inner', on='imdb_title_id')

The different data sets are joined on the IMDb title and name IDs that link each actor to the titles they appeared in. These ID columns are then dropped, as they are irrelevant to the analyses.
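When chaining several joins, duplicate keys on the lookup side can silently multiply rows. A minimal sketch of guarding against that with the `validate` argument of `pd.merge` is shown here; the tiny frames and their contents are hypothetical stand-ins for `principals_df` and `names_df`.

```python
import pandas as pd

# Hypothetical stand-ins for principals_df and names_df
principals = pd.DataFrame(
    {
        "imdb_title_id": ["tt1", "tt2", "tt3"],
        "imdb_name_id": ["nm1", "nm1", "nm2"],
        "ordering": [1, 2, 1],
    }
)
names = pd.DataFrame({"imdb_name_id": ["nm1", "nm2"], "actor": ["Actor A", "Actor B"]})

# Each principal row should match at most one name row
merged = pd.merge(principals, names, how="inner", on="imdb_name_id", validate="many_to_one")

# A duplicated name id on the right-hand side is reported instead of inflating the result
names_dup = pd.concat([names, names.iloc[[0]]], ignore_index=True)
try:
    pd.merge(principals, names_dup, how="inner", on="imdb_name_id", validate="many_to_one")
    duplicate_detected = False
except pd.errors.MergeError:
    duplicate_detected = True
```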

In [7]:
big_df = big_df.fillna('0')
big_df.drop(['imdb_title_id','imdb_name_id'], axis=1, inplace=True)

In [8]:
big_df.at[9181,'title'] = 'Star Wars the Rise of Skywalker'
big_df.at[3221,'date_published'] = '2019-09-08'
big_df.at[3221,'year']='2019'

The IMDb ID columns are dropped. At this point, an aberrant value was noticed in the year and date_published columns for row 3221, as well as a title discrepancy in row 9181. Fixing the date values resolved the issue with converting to datetime objects. That the title discrepancy was only *noticed* by chance suggests that other discrepancies may exist.
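Aberrant date strings of this kind can also be located systematically rather than by chance. The following sketch assumes the dates are stored as strings; the sample values, including the malformed one, are made up for illustration.

```python
import pandas as pd

# Illustrative values; the second entry is a made-up malformed date
dates = pd.Series(["2019-09-08", "TV Movie 2019", "1994-05-01"])

# errors="coerce" turns anything that does not match the format into NaT
parsed = pd.to_datetime(dates, format="%Y-%m-%d", errors="coerce")

# Rows whose original string could not be parsed
bad_rows = dates[parsed.isna()]
```

Inspecting `bad_rows` before overwriting values shows exactly which row labels need a manual fix.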

In [9]:
big_df.year = pd.to_datetime(big_df.year, format='%Y').dt.year

for datecol in ['date_of_birth','date_published']:
    big_df[datecol] = pd.to_datetime(big_df[datecol], format='%Y-%m-%d').dt.date


In [10]:
# Age in whole years: release year minus birth year
big_df['age_at_performance'] = (big_df.year - pd.DatetimeIndex(big_df['date_of_birth']).year).astype(float)

All date columns are converted from string to datetime objects, and since the formatting of both columns is consistent they can be looped over with the same conversion.

The actor's age at performance is computed as the difference between the title's release year and the actor's birth year.
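Because the code above subtracts calendar years, the result can be one year higher than the actor's age on the release date when the release falls before their birthday. A hedged sketch of the difference, using made-up dates and hypothetical variable names:

```python
import pandas as pd

# Made-up dates: the release falls before the birthday in the release year
df = pd.DataFrame(
    {
        "date_of_birth": pd.to_datetime(["1970-11-15"]),
        "date_published": pd.to_datetime(["1990-03-01"]),
    }
)

# Year-only difference, as in the cell above
year_diff = df.date_published.dt.year - df.date_of_birth.dt.year

# True when the birthday has not yet occurred in the release year
before_birthday = (
    df.date_published.dt.month * 100 + df.date_published.dt.day
) < (df.date_of_birth.dt.month * 100 + df.date_of_birth.dt.day)

# Age on the release date
exact_age = year_diff - before_birthday.astype(int)
```

For a feature meant to capture roughly how old an actor was, the year-only difference is likely adequate; the exact version matters only if a one-year error is significant.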


In [11]:
# Strip punctuation, transliterate to ASCII and lowercase the titles
big_df.title = big_df.title.apply(lambda x: re.sub(r"[:;,.']", '', unidecode.unidecode(str(x).strip())).lower())
big_df.actor = big_df.actor.apply(lambda x: unidecode.unidecode(str(x)))
# Keep only the first run of digits in each monetary column (drops currency symbols)
for moneycol in ['budget','worlwide_gross_income']:
    big_df[moneycol] = big_df[moneycol].str.extract(r'(\d+)', expand=False).astype(int)

The same substitution and removal of particular characters and the transliteration to ASCII that were applied to the first data set are applied here as well to ensure consistency, as these are the columns used to merge onto the original dataset.
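The title clean-up can be sketched as a small standalone function. This version leaves out the `unidecode` step so that it runs with the standard library only; the function name `normalize_title` and the sample titles are illustrative.

```python
import re

# Characters removed from titles: colon, semicolon, comma, period, apostrophe
_PUNCTUATION = re.compile(r"[:;,.']")

def normalize_title(title):
    """Trim whitespace, remove the listed punctuation and lowercase the title."""
    return _PUNCTUATION.sub("", str(title).strip()).lower()

examples = [
    normalize_title("The Godfather: Part II"),
    normalize_title("  Hart's War "),
]
```

Applying the same function to both data sets is what makes the later join on title possible, since the key strings have to match exactly.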

In [12]:
big_df.drop(['date_of_birth','date_published'],axis=1, inplace=True)
big_df.head(5)

Unnamed: 0,ordering,actor,height,title,year,duration,avg_vote,votes,budget,worlwide_gross_income,...,females_30age_votes,females_45age_avg_vote,females_45age_votes,top1000_voters_rating,top1000_voters_votes,us_voters_rating,us_voters_votes,non_us_voters_rating,non_us_voters_votes,age_at_performance
0,2,Emil Jannings,183.0,passion,1919,85,6.8,753,0,0,...,40,8.2,34,6.2,88,6.6,139.0,6.8,455.0,35.0
1,3,Emil Jannings,183.0,the eyes of the mummy,1918,63,5.5,554,0,0,...,12,6.1,19,5.6,81,5.3,178.0,5.6,219.0,34.0
2,2,Emil Jannings,183.0,the merry jail,1917,48,6.2,515,0,0,...,27,7.2,42,5.3,38,6.3,224.0,6.1,198.0,33.0
3,1,Emil Jannings,183.0,power,1920,99,6.3,138,0,0,...,7,6.5,2,6.1,26,5.5,21.0,6.4,80.0,36.0
4,2,Emil Jannings,183.0,deception,1920,100,6.6,576,0,0,...,19,7.1,28,6.3,88,6.5,144.0,6.6,318.0,36.0


In [13]:
big_df.describe()

Unnamed: 0,ordering,height,year,duration,avg_vote,votes,budget,worlwide_gross_income,weighted_average_vote,total_votes,...,males_30age_votes,males_45age_avg_vote,males_45age_votes,females_allages_avg_vote,females_allages_votes,us_voters_rating,us_voters_votes,non_us_voters_rating,non_us_voters_votes,age_at_performance
count,9404.0,9404.0,9404.0,9404.0,9404.0,9404.0,9404.0,9404.0,9404.0,9404.0,...,9404.0,9404.0,9404.0,9404.0,9404.0,9404.0,9404.0,9404.0,9404.0,9404.0
mean,1.942897,180.293067,1977.481923,105.737346,6.478031,54566.46,16812090.0,40260850.0,6.478031,54566.46,...,19273.918545,6.423203,5799.456933,6.605838,7188.005849,6.527935,9252.549128,6.364419,24526.527648,46.66695
std,1.295216,7.624958,27.369836,22.138056,0.89562,157334.8,98022060.0,128390500.0,0.89562,157334.8,...,53238.889701,0.897744,12436.542185,0.897784,20601.778072,0.924482,23857.669423,0.918541,66670.230287,13.641218
min,1.0,1.0,1917.0,45.0,1.8,100.0,0.0,0.0,1.8,100.0,...,5.0,1.6,9.0,1.2,1.0,1.5,4.0,1.0,6.0,9.0
25%,1.0,175.0,1954.0,92.0,6.0,661.0,0.0,0.0,6.0,661.0,...,121.0,5.9,264.0,6.1,82.0,6.0,231.0,5.9,235.0,36.0
50%,1.0,180.0,1980.0,103.0,6.5,3443.5,400000.0,0.0,6.5,3443.5,...,889.0,6.5,1095.0,6.7,418.0,6.6,1004.0,6.4,1422.5,45.0
75%,3.0,185.0,2002.0,117.0,7.1,29520.5,15000000.0,15666420.0,7.1,29520.5,...,10409.0,7.0,5469.0,7.2,3856.0,7.1,6624.0,6.9,13752.0,56.0
max,10.0,196.0,2020.0,357.0,9.3,2278845.0,7000000000.0,2797801000.0,9.3,2278845.0,...,743676.0,9.2,165852.0,10.0,278964.0,9.3,348363.0,9.5,887226.0,121.0


At this point it is worth mentioning that big_df contains only the actors whose names match those of interest in the initial data set, so it has considerably fewer entries than the full Kaggle data.

In [14]:
big_df[(big_df.age_at_performance>90)]

Unnamed: 0,ordering,actor,height,title,year,duration,avg_vote,votes,budget,worlwide_gross_income,...,females_30age_votes,females_45age_avg_vote,females_45age_votes,top1000_voters_rating,top1000_voters_votes,us_voters_rating,us_voters_votes,non_us_voters_rating,non_us_voters_votes,age_at_performance
2736,5,Orson Welles,183.0,the other side of the wind,2018,122,6.8,5887,0,0,...,117,5.5,62,5.9,190,6.9,1349.0,6.7,2504.0,103.0
2750,7,Orson Welles,183.0,the hitchhiker,2007,86,3.6,938,500000,0,...,62,5.9,38,4.0,29,4.3,345.0,3.1,445.0,92.0
2751,1,Orson Welles,183.0,jucy,2010,90,5.7,110,150000,0,...,26,5.4,12,3.0,4,5.3,32.0,5.8,58.0,95.0
4373,1,Ernest Borgnine,177.0,another harvest moon,2010,89,6.0,182,0,0,...,9,5.5,22,4.8,32,6.5,81.0,5.4,60.0,93.0
4374,1,Ernest Borgnine,177.0,the lion of judah,2011,87,3.4,527,15000000,0,...,37,7.0,18,2.9,22,3.6,130.0,3.3,243.0,94.0
4375,4,Ernest Borgnine,177.0,the genesis code,2010,138,5.3,633,5100000,0,...,57,6.4,68,4.7,49,5.4,275.0,4.0,109.0,93.0
4376,1,Ernest Borgnine,177.0,night club,2011,95,6.5,229,1000000,0,...,17,5.7,23,6.0,24,6.8,135.0,5.9,33.0,94.0
4377,1,Ernest Borgnine,177.0,the man who shook the hand of vicente fernandez,2012,99,6.1,284,0,10782,...,15,7.5,19,4.9,29,6.8,108.0,5.5,99.0,95.0
8953,8,Raymond Massey,191.0,whale music,1994,107,7.1,532,0,0,...,17,7.1,24,5.4,20,6.9,47.0,6.8,276.0,98.0
8954,7,Raymond Massey,191.0,black point,2002,100,5.2,552,6000000,0,...,26,4.8,27,4.8,43,5.2,132.0,5.3,277.0,106.0


The descriptive statistics above show a maximum age_at_performance of 121 years. Listing the rows with an age above 90 helps to show that these implausible values are caused by the date-of-birth information filled in earlier for Raymond Massey and James Dean (not the first actor of that name), for whom no date of birth could be found by any means.

In [15]:
shortened_df = pd.merge(ba_df, big_df, how='inner',on=['actor','title'])

In [16]:
exclude_roles = ['director','producer','executive producer','narrator','host','writer','original music', 'music','screenwriter','himself']
shortened_df = shortened_df[~shortened_df.role.isin(exclude_roles)]

In [17]:
shortened_df[shortened_df.age_at_performance>90]

Unnamed: 0,actor,role,title,award,score_1,score_2,num_reviews,polarity_mean,objectivity_mean,numeric_key,...,females_30age_votes,females_45age_avg_vote,females_45age_votes,top1000_voters_rating,top1000_voters_votes,us_voters_rating,us_voters_votes,non_us_voters_rating,non_us_voters_votes,age_at_performance
8003,Ernest Borgnine,rex,the man who shook the hand of vicente fernandez,0.0,68.0,40.0,3.0,-0.131944,0.415278,62,...,15,7.5,19,4.9,29,6.8,108.0,5.5,99.0,95.0
8004,Ernest Borgnine,slink,the lion of judah,0.0,62.0,0.0,2.0,0.05,0.825,62,...,37,7.0,18,2.9,22,3.6,130.0,3.3,243.0,94.0
8005,Ernest Borgnine,carl taylor,the genesis code,0.0,0.0,0.0,0.0,0.0,0.0,62,...,57,6.4,68,4.7,49,5.4,275.0,4.0,109.0,93.0
8006,Ernest Borgnine,frank,another harvest moon,0.0,0.0,0.0,1.0,0.25625,0.4625,62,...,9,5.5,22,4.8,32,6.5,81.0,5.4,60.0,93.0


This confirms that the erroneous rows were not retained after the merge.

In [18]:
shortened_df

Unnamed: 0,actor,role,title,award,score_1,score_2,num_reviews,polarity_mean,objectivity_mean,numeric_key,...,females_30age_votes,females_45age_avg_vote,females_45age_votes,top1000_voters_rating,top1000_voters_votes,us_voters_rating,us_voters_votes,non_us_voters_rating,non_us_voters_votes,age_at_performance
0,Adam Driver,charlie barber,marriage story,1.0,85.0,94.0,48.0,0.190969,0.479507,0,...,14992,7.5,3370,7.1,419,8.0,22814.0,7.9,88146.0,36.0
1,Adolphe Menjou,walter burns,the front page,1.0,60.0,93.0,1.0,0.500000,1.000000,1,...,81,6.8,134,6.5,184,6.8,1080.0,6.6,663.0,41.0
2,Adrien Brody,wladyslaw szpilman,the pianist,2.0,96.0,95.0,40.0,0.128576,0.603966,2,...,53249,8.4,10601,8.1,739,8.4,82064.0,8.5,323314.0,29.0
3,Al Pacino,frank serpico,serpico,1.0,88.0,90.0,6.0,0.238542,0.572917,3,...,3241,7.8,1742,7.6,624,7.7,19673.0,7.7,53318.0,33.0
4,Al Pacino,michael corleone,the godfather part ii,1.0,97.0,98.0,17.0,0.380286,0.646926,3,...,47656,8.7,16306,8.8,856,9.1,164758.0,9.0,448555.0,34.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8416,Terrence Howard,lt lincoln a scott,harts war,0.0,48.0,59.0,36.0,0.128374,0.495090,206,...,2185,6.6,995,6.2,428,6.3,10113.0,6.3,25963.0,33.0
8417,Terrence Howard,robby,angel eyes,0.0,46.0,33.0,38.0,0.102158,0.544197,206,...,3611,6.1,1076,5.3,338,5.5,5027.0,5.6,10778.0,32.0
8418,Terrence Howard,chris,love beat the hell outta me,0.0,0.0,0.0,0.0,0.000000,0.000000,206,...,10,2.8,17,3.5,11,3.9,61.0,1.0,11.0,31.0
8419,Terrence Howard,byron,spark,0.0,0.0,0.0,0.0,0.000000,0.000000,206,...,15,6.4,23,5.2,16,5.5,93.0,5.4,77.0,29.0


After joining the data sets on the actor and title columns and dropping the non-performance credits, the resulting data set has 8,340 entries. The original data set contained 14,474 entries, so adding the new features costs about 42.4% of the rows.
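To see which rows drop out of a join on actor and title, a left merge with `indicator=True` can serve as an anti-join. The frames below are hypothetical miniatures of `ba_df` and `big_df`, not real data.

```python
import pandas as pd

# Hypothetical miniatures of ba_df (awards) and big_df (IMDb features)
awards = pd.DataFrame(
    {
        "actor": ["Actor A", "Actor A", "Actor B"],
        "title": ["first film", "second film", "third film"],
    }
)
imdb = pd.DataFrame(
    {
        "actor": ["Actor A", "Actor B"],
        "title": ["first film", "third film (1950)"],
        "year": [1990, 1950],
    }
)

# Keep every awards row and mark whether it found a partner in the IMDb data
checked = pd.merge(awards, imdb, how="left", on=["actor", "title"], indicator=True)

# Rows present only on the left are the ones an inner join would discard
unmatched = checked.loc[checked["_merge"] == "left_only", ["actor", "title"]]
```

Reviewing `unmatched` would show whether rows are lost because the title is absent from the new data or merely spelled differently.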

In [19]:
shortened_df.describe()

Unnamed: 0,award,score_1,score_2,num_reviews,polarity_mean,objectivity_mean,numeric_key,ordering,height,year,...,males_30age_votes,males_45age_avg_vote,males_45age_votes,females_allages_avg_vote,females_allages_votes,us_voters_rating,us_voters_votes,non_us_voters_rating,non_us_voters_votes,age_at_performance
count,8340.0,8340.0,8340.0,8340.0,8340.0,8340.0,8340.0,8340.0,8340.0,8340.0,...,8340.0,8340.0,8340.0,8340.0,8340.0,8340.0,8340.0,8340.0,8340.0,8340.0
mean,0.063909,40.59964,40.608873,9.725659,0.085345,0.333301,116.752878,1.81211,180.458513,1977.03729,...,20230.250839,6.455408,6095.756595,6.626918,7481.309832,6.553381,9724.732494,6.392758,25678.468225,46.415588
std,0.285757,34.446202,37.447243,14.160255,0.159247,0.287861,66.50214,1.037871,6.953434,27.368352,...,53995.163243,0.878404,12494.841276,0.872528,20909.731628,0.907233,24100.386069,0.895061,67603.998745,13.341051
min,0.0,0.0,0.0,0.0,-1.0,0.0,0.0,1.0,157.0,1917.0,...,5.0,1.6,11.0,1.2,1.0,1.5,6.0,1.0,6.0,9.0
25%,0.0,0.0,0.0,0.0,0.0,0.0,63.0,1.0,175.0,1953.0,...,136.0,6.0,301.75,6.2,94.0,6.1,273.0,5.9,267.75,36.0
50%,0.0,46.0,39.0,2.0,0.0,0.46658,115.0,1.0,180.0,1979.0,...,1065.0,6.5,1278.5,6.7,495.5,6.6,1187.5,6.4,1699.0,45.0
75%,0.0,73.0,78.0,14.0,0.171954,0.572105,173.0,2.0,186.0,2002.0,...,12050.0,7.0,6017.25,7.2,4308.25,7.1,7364.0,7.0,15904.0,56.0
max,2.0,99.0,100.0,66.0,1.0,1.0,232.0,10.0,196.0,2020.0,...,743676.0,9.2,165852.0,10.0,278964.0,9.3,348363.0,9.5,887226.0,95.0


In [20]:
shortened_df.award.value_counts()

0.0    7898
1.0     351
2.0      91
Name: award, dtype: int64

Here it can be noted that several nominations and wins are not present in this dataset.
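How many rows of each award value were lost could be quantified by comparing the value counts before and after the merge. The series below are invented for illustration and only assume that the award column is coded with the values 0, 1 and 2, as in the output above.

```python
import pandas as pd

# Invented award columns before and after a merge
awards_before = pd.Series([0, 0, 0, 1, 1, 2, 2], name="award")
awards_after = pd.Series([0, 0, 0, 1, 2], name="award")

# Per-value difference in row counts; fill_value handles values missing on one side
lost = (
    awards_before.value_counts()
    .sub(awards_after.value_counts(), fill_value=0)
    .astype(int)
    .sort_index()
)
lost_by_value = {int(k): int(v) for k, v in lost.items()}
```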

In [21]:
shortened_df.to_csv('best_actor_hd.csv', index=False)