# Every Academy Award for Best Picture Winner
## (1927-2021)

## Intro
What makes an Academy Award Best Picture? This process involves web scraping every movie that was nominated for Best Picture from Wikipedia. Then, the data will be prepared for analysis to find any common threads between these Oscar worthy movies. Are there any quantitative or qualitative measurements that helps a movie get nominated? Let's find out!

## Data
https://en.wikipedia.org/wiki/Academy_Award_for_Best_Picture

## Data Preparation & Cleaning

### Import Data

In [1]:
import json

def load_data(title):
    with open(title, encoding='utf-8') as f:
        return json.load(f)

In [2]:
data = load_data('Best_Picture_data.json')

In [3]:
import pandas as pd

df = pd.DataFrame(data)

### Data Info

In [4]:
df.notna().sum()

title                     580
Directed by               580
Written by                206
Screenplay by             384
Based on                  380
Produced by               574
Starring                  580
Cinematography            578
Edited by                 576
Distributed by            579
Release dates             287
Running time              579
Country                   481
Language                  513
Budget                    519
Box office                551
Production company        241
Release date              293
Story by                   57
Music by                  540
Color process              35
Languages                  59
Production companies      235
Countries                  96
Narrated by                18
Additional dialogue by      1
Suggested by                2
Narration by                1
Traditional                 1
Simplified                  1
Mandarin                    1
Hangul                      1
Revised Romanization        1
McCune–Rei

In [5]:
df.drop(columns=['Color process', 'Narrated by', 'Additional dialogue by', 'Suggested by', 'Narration by', 'Traditional', 'Simplified', 'Mandarin', 'Hangul', 'Revised Romanization',
        'McCune–Reischauer', 'Japanese', 'Hepburn'],inplace=True)
df.columns

Index(['title', 'Directed by', 'Written by', 'Screenplay by', 'Based on',
       'Produced by', 'Starring', 'Cinematography', 'Edited by',
       'Distributed by', 'Release dates', 'Running time', 'Country',
       'Language', 'Budget', 'Box office', 'Production company',
       'Release date', 'Story by', 'Music by', 'Languages',
       'Production companies', 'Countries'],
      dtype='object')

In [6]:
df.shape

(580, 23)

In [7]:
pd.set_option('display.max_columns', 26)
df.head()

Unnamed: 0,title,Directed by,Written by,Screenplay by,Based on,Produced by,Starring,Cinematography,Edited by,Distributed by,Release dates,Running time,Country,Language,Budget,Box office,Production company,Release date,Story by,Music by,Languages,Production companies,Countries
0,7th Heaven,Frank Borzage,"[Harry H. Caldwell (titles), Katharine Hillike...",Benjamin Glazer,"[Seventh Heaven, by Austin Strong]",William Fox,"[Janet Gaynor, Charles Farrell, Ben Bard]","[Ernest Palmer, Joseph A. Valentine]",Barney Wolf,Fox Film Corporation,"[May 6, 1927 (1927-05-06) (Los Angeles), May 2...",110 min,United States,Silent (English intertitles),$1.3 million,$2.5 million,,,,,,,
1,The Racket,Lewis Milestone,"[Bartlett Cormack, Tom Miranda, Uncredited:, H...",,,Howard Hughes,"[Thomas Meighan, Marie Prevost, Louis Wolheim]",Tony Gaudio,Eddie Adams,Paramount Pictures,,84 minutes,United States,Silent (English intertitles),,,The Caddo Company,"[November 1, 1928 (1928-11-01)]",,,,,
2,The Broadway Melody,Harry Beaumont,"[Sarah Y. Mason, (continuity), Norman Houston,...",,,"[Irving Thalberg, Lawrence Weingarten]","[Charles King, Anita Page, Bessie Love]",John Arnold,"[Sam S. Zimbalist, Uncredited:, William LeVanw...",Metro-Goldwyn-Mayer,"[February 1, 1929 (1929-02-01) (Grauman's Chin...",100 minutes,United States,English,"$379,000",$4.4 million,,,Edmund Goulding,(see article),,,
3,Alibi,Roland West,Elaine Sterne Carrington,,"[Nightstick, by, Elaine Sterne Carrington, ,, ...",Roland West,"[Chester Morris, Mae Busch]",Ray June,,United Artists,,90 minutes,United States,English,,,,"[April 20, 1929 (1929-04-20)]",,,,,
4,The Hollywood Revue of 1929,Charles Reisner,"[Al Boasberg, Robert E. Hopkins, Joseph W. Far...",,,"[Irving Thalberg, Harry Rapf]","[Conrad Nagel, Jack Benny]","[John Arnold, Max Fabian, Irving G. Ries, John...","[William S. Gray, Cameron K. Wood]",Metro-Goldwyn-Mayer,"[June 20, 1929, (Los Angeles)]","[130 minutes (roadshow), 118 min (Turner libra...",United States,English,"$426,000","$2,421,000 (worldwide rental)",,,,"[Gus Edwards, Arthur Freed, ("", Singin' in the...",,,


In [8]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 580 entries, 0 to 579
Data columns (total 23 columns):
 #   Column                Non-Null Count  Dtype 
---  ------                --------------  ----- 
 0   title                 580 non-null    object
 1   Directed by           580 non-null    object
 2   Written by            206 non-null    object
 3   Screenplay by         384 non-null    object
 4   Based on              380 non-null    object
 5   Produced by           574 non-null    object
 6   Starring              580 non-null    object
 7   Cinematography        578 non-null    object
 8   Edited by             576 non-null    object
 9   Distributed by        579 non-null    object
 10  Release dates         287 non-null    object
 11  Running time          579 non-null    object
 12  Country               481 non-null    object
 13  Language              513 non-null    object
 14  Budget                519 non-null    object
 15  Box office            551 non-null    ob

### Convert 'Running Time' to Float

In [9]:
df.iloc[-20]

title                                                                1917
Directed by                                                    Sam Mendes
Written by                             [Sam Mendes, Krysty Wilson-Cairns]
Screenplay by                                                         NaN
Based on                                                              NaN
Produced by             [Sam Mendes, Pippa Harris, Jayne-Ann Tenggren,...
Starring                [George MacKay, Dean-Charles Chapman, Mark Str...
Cinematography                                              Roger Deakins
Edited by                                                       Lee Smith
Distributed by          [Universal Pictures (Worldwide), Entertainment...
Release dates           [4 December 2019 (2019-12-04) (London), 25 Dec...
Running time                                                  119 minutes
Country                                                               NaN
Language                              

In [10]:
def convert_running_time(run):
    if isinstance(run,list):
        return run[0].split(" ")[0].replace(",","")
    elif isinstance(run,float):
        return str(run).split(" ")[0].replace(",","")
    elif "–" in run:
        run.split("–")[0].replace(",","")
    else:
        return run.split(" ")[0].replace(",","")

In [11]:
df.iloc[196]

title                                                      Mister Roberts
Directed by             [John Ford, Mervyn LeRoy, Joshua Logan, (uncre...
Written by                                                            NaN
Screenplay by                             [Frank S. Nugent, Joshua Logan]
Based on                [Mister Roberts, (1946 novel), by, Thomas Hegg...
Produced by                                                Leland Hayward
Starring                [Henry Fonda, James Cagney, William Powell, Ja...
Cinematography                                             Winton C. Hoch
Edited by                                                     Jack Murray
Distributed by                                               Warner Bros.
Release dates                                                         NaN
Running time                [120,, 123,, or, 120-121, 123 or 126 minutes]
Country                                                     United States
Language                              

In [12]:
df['Running time(minutes)'] = [convert_running_time(run) for run in df['Running time']]

In [13]:
df['Running time(minutes)'] = df['Running time(minutes)'].astype(float)
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 580 entries, 0 to 579
Data columns (total 24 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   title                  580 non-null    object 
 1   Directed by            580 non-null    object 
 2   Written by             206 non-null    object 
 3   Screenplay by          384 non-null    object 
 4   Based on               380 non-null    object 
 5   Produced by            574 non-null    object 
 6   Starring               580 non-null    object 
 7   Cinematography         578 non-null    object 
 8   Edited by              576 non-null    object 
 9   Distributed by         579 non-null    object 
 10  Release dates          287 non-null    object 
 11  Running time           579 non-null    object 
 12  Country                481 non-null    object 
 13  Language               513 non-null    object 
 14  Budget                 519 non-null    object 
 15  Box of

In [14]:
df['Running time(minutes)'].sort_values()

223      2.0
32      66.0
136     75.0
23      80.0
46      80.0
       ...  
89     221.0
229    227.0
236    251.0
93       NaN
161      NaN
Name: Running time(minutes), Length: 580, dtype: float64

In [15]:
df.iloc[-50]

title                                                        Hacksaw Ridge
Directed by                                                     Mel Gibson
Written by                                                             NaN
Screenplay by                            [Robert Schenkkan, Andrew Knight]
Based on                  [The Conscientious Objector, by, Terry Benedict]
Produced by              [Bill Mechanic, David Permut, Terry Benedict, ...
Starring                 [Andrew Garfield, Sam Worthington, Luke Bracey...
Cinematography                                                Simon Duggan
Edited by                                                     John Gilbert
Distributed by           [Lionsgate (United States and United Kingdom),...
Release dates            [September 4, 2016 (2016-09-04) (Venice), Nove...
Running time                                                   139 minutes
Country                                                                NaN
Language                 

### Convert Money to Float

In [84]:
import re

number = r"\d+(,\d{3})*\.*\d*"
amount = "thousand|million|billion|trillion"

word_usd = rf"\${number}(–|-|\sto\s)?({number})?\s{amount}"
value_usd = rf"\${number}"
word_bpound = rf"\£{number}(–|-|\sto\s)?({number})?\s{amount}"
value_bpound = rf"\£{number}"

def word_to_number(word):
    word_dict = {"thousand":1000,"million":1000000,"billion":1000000000,"trillion":1000000000000}
    return word_dict[word]
                 
def parse_word_syntax(string):
    if string.count("$")>1:
        if "–" in string:
            string=string.split("–")[1]
            value_string = re.search(number,string).group()        
            value = float(value_string.replace(",","").replace("$",""))
            word_string = re.search(amount,string).group()
            word_value = word_to_number(word_string)
        elif "or" in string:
            string=string.split("or")[0].strip()
            value_string = re.search(number,string).group()        
            value = float(value_string.replace(",","").replace("$",""))
            try:
                word_string = re.search(amount,string).group()
                word_value = word_to_number(word_string)
            except Exception as e:
                word_value = 1
        else:
            string=string.split("(")[0].strip()
            value_string = re.search(number,string).group()        
            value = float(value_string.replace(",","").replace("$",""))
            try:
                word_string = re.search(amount,string).group()
                word_value = word_to_number(word_string)
            except Exception as e:
                word_value = 1
        return round(value*word_value,2)
    elif "£" in string:
        string=string.split("$")[1]
        value_string = re.search(number,string).group()        
        value = float(value_string.replace(",","").replace("$",""))
        word_string = re.search(amount,string).group()
        word_value = word_to_number(word_string)   
        return round(value*word_value,2)
    elif string.count("$")==1:
        value_string = re.search(number,string).group()        
        value = float(value_string.replace(",","").replace("$",""))
        word_string = re.search(amount,string).group()
        word_value = word_to_number(word_string)
        return round(value*word_value,2)
    else:
        None
                 
def parse_value_syntax(string):
    value_string = re.search(number,string).group()        
    value = float(value_string.replace(",","").replace("$",""))
    return round(value,2)

def parse_bpound_word_syntax(string):
    value_string = re.search(number,string).group()
    value = float(value_string.replace(",","").replace("£",""))
    word_string = re.search(amount,string).group()
    word_value = word_to_number(word_string)
    value_bpound = value*word_value
    return round(value_bpound*1.16,2) #as of 9/1/22     
                 
def parse_bpound_value_syntax(string):
    value_string = re.search(number,string).group()
    value_bpound = float(value_string.replace(",","").replace("£",""))
    return round(value_bpound*1.16,2) #as of 9/1/22    

def convert_money(money):
    if isinstance(money,list):
        money=money[0]
    
    money = str(money)
    
    if "N/A" in money:
        return None
    
    word_syntax = re.search(word_usd,money,flags=re.I)                 
    value_syntax = re.search(value_usd,money)
    bpound_word_syntax = re.search(word_bpound,money,flags=re.I)
    bpound_value_syntax = re.search(value_bpound,money)
    
    # if ("$" in money) & ("£" in money):
    #     if bpound_word_syntax:
    #         return parse_bpound_word_syntax(money)
    #     elif bpound_value_syntax:
    #         return parse_bpound_value_syntax(money)
    #     else:
    #         return None
    if "$" in money:
        if word_syntax:
            return parse_word_syntax(money)
        elif value_syntax:
            return parse_value_syntax(money)
        else:
            return None
    elif "£" in money:
        if bpound_word_syntax:
            return parse_bpound_word_syntax(money)
        elif bpound_value_syntax:
            return parse_bpound_value_syntax(money)
        else:
            return None
    else:
        return None

In [57]:
box = "$11,000,000 or $4.3 million (US rentals)"
convert_money(box)

$11,000,000 or $4.3 million (US rentals)
$11,000,000 or $4.3 million (US rentals)!!!!!!!!!!!!!!


11000000.0

In [58]:
convert_money("$1,200")

$1,200


1200.0

In [59]:
convert_money("£1 million")

£1 million


1160000.0

In [18]:
import numpy as np
df['Budget'].replace(np.nan,"N/A")

0        $1.3 million
1                 N/A
2            $379,000
3                 N/A
4            $426,000
            ...      
575       $50 million
576       $40 million
577       $60 million
578    $35–39 million
579      $100 million
Name: Budget, Length: 580, dtype: object

In [85]:
df['Budget(USD)'] = df['Budget'].apply(lambda x: convert_money(x))

In [86]:
df['Box office(USD)'] = df['Box office'].apply(lambda x: convert_money(x))

In [63]:
df['Budget(USD)'].sort_values(ascending=False)

465    237000000.0
404    200000000.0
481    200000000.0
546    200000000.0
472    175000000.0
          ...     
308            NaN
360            NaN
553            NaN
571            NaN
573            NaN
Name: Budget(USD), Length: 580, dtype: float64

In [64]:
df['Box office(USD)'].sort_values(ascending=False)

465    2.847000e+09
404    2.202000e+09
546    1.348000e+09
434    1.146000e+09
557    1.074000e+09
           ...     
116             NaN
235             NaN
282             NaN
286             NaN
301             NaN
Name: Box office(USD), Length: 580, dtype: float64

In [89]:
compare_money_convert = df[['Budget','Budget(USD)','Box office','Box office(USD)']]
compare_money_convert.to_csv('compare_money_conversion.csv',index=False,encoding='utf-8')

### Convert 'Release dates' to Datetime

In [65]:
df['Release dates'].head(20)

0     [May 6, 1927 (1927-05-06) (Los Angeles), May 2...
1                                                   NaN
2     [February 1, 1929 (1929-02-01) (Grauman's Chin...
3                                                   NaN
4                        [June 20, 1929, (Los Angeles)]
5     [Premiere:, December 25, 1928, (, 1928-12-25, ...
6                                                   NaN
7                                                   NaN
8                                                   NaN
9                                                   NaN
10                                                  NaN
11    [November 19, 1929, (New York City), January 1...
12    [January 26, 1931 (1931-01-26) (Premiere-New Y...
13    [February 20, 1931 (1931-02-20) (New York City...
14                                                  NaN
15                                                  NaN
16                                                  NaN
17    [April 12, 1932 (1932-04-12) (New York Cit

In [66]:
df.loc[577,['Release dates']]

Release dates    [December 1, 2021 (2021-12-01) (Alice Tully Ha...
Name: 577, dtype: object

In [67]:
from datetime import datetime

dates = df['Release dates']

def clean_date(date):
    if isinstance(date,list):
        if date[0] != 'Premiere:':
            date=date[0]
        else:
            date=date[1]
        
    if isinstance(date,float): 
        return "N/A"
    elif date == 'NaN':
        return "N/A"
    else:
        date = str(date)
        return date.split("(")[0].strip()

def convert_date(date):
    if date != "N/A":
        for fmt in ('%B %d, %Y', '%B %Y', '%d %B %Y'):
            try:
                return datetime.strptime(date, fmt)
            except ValueError:
                pass
        raise ValueError('no valid date format found')
            

In [68]:
clean_date(['December 1, 2021 (2021-12-01) (Alice Tully Hall)',
  'December 17, 2021 (2021-12-17) (United States)'])

'December 1, 2021'

In [69]:
df['Release dates(dt)'] = [clean_date(date) for date in df['Release dates']]
df['Release dates(dt)']

0            May 6, 1927
1                    N/A
2       February 1, 1929
3                    N/A
4          June 20, 1929
             ...        
575    September 2, 2021
576                  N/A
577     December 1, 2021
578    September 2, 2021
579    November 29, 2021
Name: Release dates(dt), Length: 580, dtype: object

In [70]:
df['Release dates(dt)'].value_counts().head(60)

N/A                   293
November 11, 2014       2
September 2, 2021       2
30 August 2018          2
September 2, 2005       2
May 6, 1927             1
September 24, 2010      1
January 25, 2010        1
July 8, 2010            1
December 6, 2010        1
September 1, 2010       1
6 September 2010        1
4 September 2010        1
September 5, 2009       1
May 13, 2009            1
January 15, 2009        1
May 20, 2009            1
18 January 2009         1
13 August 2009          1
June 12, 2010           1
15 May 2011             1
January 21, 2010        1
August 31, 2012         1
5 December 2012         1
October 8, 2012         1
September 28, 2012      1
December 11, 2012       1
January 20, 2012        1
20 May 2012             1
4 December 2011         1
September 4, 2008       1
May 16, 2011            1
September 9, 2011       1
May 11, 2011            1
October 10, 2011        1
August 9, 2011          1
September 10, 2011      1
December 10, 2009       1
October 28, 

In [71]:
convert_date('9 May 2001')
convert_date('December 10, 2012')

datetime.datetime(2012, 12, 10, 0, 0)

In [72]:
df['Release dates(dt)'] = [convert_date(date) for date in df['Release dates(dt)']]
df['Release dates(dt)']

0     1927-05-06
1            NaT
2     1929-02-01
3            NaT
4     1929-06-20
         ...    
575   2021-09-02
576          NaT
577   2021-12-01
578   2021-09-02
579   2021-11-29
Name: Release dates(dt), Length: 580, dtype: datetime64[ns]

In [73]:
df.head()

Unnamed: 0,title,Directed by,Written by,Screenplay by,Based on,Produced by,Starring,Cinematography,Edited by,Distributed by,Release dates,Running time,Country,...,Budget,Box office,Production company,Release date,Story by,Music by,Languages,Production companies,Countries,Running time(minutes),Budget(USD),Box office(USD),Release dates(dt)
0,7th Heaven,Frank Borzage,"[Harry H. Caldwell (titles), Katharine Hillike...",Benjamin Glazer,"[Seventh Heaven, by Austin Strong]",William Fox,"[Janet Gaynor, Charles Farrell, Ben Bard]","[Ernest Palmer, Joseph A. Valentine]",Barney Wolf,Fox Film Corporation,"[May 6, 1927 (1927-05-06) (Los Angeles), May 2...",110 min,United States,...,$1.3 million,$2.5 million,,,,,,,,110.0,1300000.0,2500000.0,1927-05-06
1,The Racket,Lewis Milestone,"[Bartlett Cormack, Tom Miranda, Uncredited:, H...",,,Howard Hughes,"[Thomas Meighan, Marie Prevost, Louis Wolheim]",Tony Gaudio,Eddie Adams,Paramount Pictures,,84 minutes,United States,...,,,The Caddo Company,"[November 1, 1928 (1928-11-01)]",,,,,,84.0,,,NaT
2,The Broadway Melody,Harry Beaumont,"[Sarah Y. Mason, (continuity), Norman Houston,...",,,"[Irving Thalberg, Lawrence Weingarten]","[Charles King, Anita Page, Bessie Love]",John Arnold,"[Sam S. Zimbalist, Uncredited:, William LeVanw...",Metro-Goldwyn-Mayer,"[February 1, 1929 (1929-02-01) (Grauman's Chin...",100 minutes,United States,...,"$379,000",$4.4 million,,,Edmund Goulding,(see article),,,,100.0,379000.0,4400000.0,1929-02-01
3,Alibi,Roland West,Elaine Sterne Carrington,,"[Nightstick, by, Elaine Sterne Carrington, ,, ...",Roland West,"[Chester Morris, Mae Busch]",Ray June,,United Artists,,90 minutes,United States,...,,,,"[April 20, 1929 (1929-04-20)]",,,,,,90.0,,,NaT
4,The Hollywood Revue of 1929,Charles Reisner,"[Al Boasberg, Robert E. Hopkins, Joseph W. Far...",,,"[Irving Thalberg, Harry Rapf]","[Conrad Nagel, Jack Benny]","[John Arnold, Max Fabian, Irving G. Ries, John...","[William S. Gray, Cameron K. Wood]",Metro-Goldwyn-Mayer,"[June 20, 1929, (Los Angeles)]","[130 minutes (roadshow), 118 min (Turner libra...",United States,...,"$426,000","$2,421,000 (worldwide rental)",,,,"[Gus Edwards, Arthur Freed, ("", Singin' in the...",,,,130.0,426000.0,2421000.0,1929-06-20


In [74]:
df['Countries'].explode().unique()

array([nan, 'United Kingdom', 'United States', 'Australia', 'Greece',
       'Italy', 'Algeria', 'France', 'Turkey', 'Canada', 'India',
       'Brazil', 'Ireland', 'Japan', 'New Zealand', 'Belgium', 'Taiwan',
       'China', 'Hong Kong', 'Germany', 'Poland', 'Mexico', 'Morocco',
       'South Africa', 'United Arab Emirates', 'Spain', 'Austria',
       'Netherlands', 'Czech Republic'], dtype=object)

In [75]:
df['Languages'].explode().unique()

array([nan, 'Silent film', 'English sequences', 'English', 'Italian',
       'English (primarily), German, Italian', 'French', 'German',
       'English, Cantonese', 'Welsh', 'English, French', 'Irish',
       'Japanese', 'Greek', 'Spanish', 'Russian', 'Mandarin', 'Sicilian',
       'Vietnamese', 'Turkish', 'Maltese', 'Khmer', 'Swahili',
       'Portuguese', 'Pennsylvania Dutch', 'American Sign Language',
       'Latin', 'Guaraní', 'Lakota', 'Pawnee', 'Māori',
       'British Sign Language', 'Arabic', 'Japanese Sign Language',
       'Berber languages', 'Hindi', 'Tamil', 'Somali', 'Bengali',
       'Mixtec', 'Korean'], dtype=object)

In [76]:
df.to_csv("Best_Picture_clean.csv",index=False,encoding='utf-8')