## Import the necessary basic libraries

In [1]:
import numpy as np
import pandas as pd
import csv

## Obtaining the datasets for:
> 1. **Top 50 Directors** 
> 2. **Top 1000 Actors and Actresses**
> 3. **The Oscar Award**
------------------------

###### Top 50 Directors dataset:

In [2]:
# The dataset is in CSV format, hence we use the read_csv function from Pandas.
# Immediately after importing, we will take a quick look at the data using the head function.
directorData = pd.read_csv('Top 50 directors.csv')
directorData.head()

Unnamed: 0,Rank,Name of Director
0,1,Steven Spielberg
1,2,Martin Scorsese
2,3,Francis Ford Coppola
3,4,Stanley Kubrick
4,5,Alfred Hitchcock


Check the vital statistics of the dataset using the `type` and `shape` attributes.

In [3]:
print("Data type : ", type(directorData))
print("Data dims : ", directorData.shape)

Data type :  <class 'pandas.core.frame.DataFrame'>
Data dims :  (50, 2)


Check the variables (and their types) in the dataset using the `dtypes` attribute.

In [4]:
print(directorData.dtypes)

Rank                 int64
Name of Director    object
dtype: object


###### Top 1000 Actors and Actresses dataset:

In [5]:
# The dataset is in CSV format, hence we use the read_csv function from Pandas.
# Immediately after importing, we will take a quick look at the data using the head function.
actorData = pd.read_csv('Top 1000 Actors and Actresses.csv')
actorData.head()

Unnamed: 0,Position,Const,Created,Modified,Description,Name,Known For,Birth Date
0,1,nm0000134,2014-03-09,2014-03-09,,Robert De Niro,Raging Bull,1943-08-17
1,2,nm0000197,2014-03-09,2015-10-25,,Jack Nicholson,Chinatown,1937-04-22
2,3,nm0000008,2014-03-09,2014-03-09,,Marlon Brando,Apocalypse Now,1924-04-03
3,4,nm0000243,2014-03-09,2014-03-09,,Denzel Washington,Fences,1954-12-28
4,5,nm0000031,2014-03-09,2014-03-09,,Katharine Hepburn,The Lion in Winter,1907-05-12


Check the vital statistics of the dataset using the `type` and `shape` attributes.

In [6]:
print("Data type : ", type(actorData))
print("Data dims : ", actorData.shape)

Data type :  <class 'pandas.core.frame.DataFrame'>
Data dims :  (1000, 8)


Check the variables (and their types) in the dataset using the `dtypes` attribute.

In [7]:
print(actorData.dtypes)

Position         int64
Const           object
Created         object
Modified        object
Description    float64
Name            object
Known For       object
Birth Date      object
dtype: object


###### Oscar Award dataset:

In [8]:
# The dataset is in CSV format, hence we use the read_csv function from Pandas.
# Immediately after importing, we will take a quick look at the data using the head function.
oscarData = pd.read_csv('oscarData.csv')
oscarData.head()

Unnamed: 0,Title,Year,Genre,Rating,Runtime,Sales,Language,Age Rating,Director,Cast1,Cast2,Cast3,Country of Origin,Number of Wins
0,Sunrise,1927,"Drama, Romance",8.1,1 hour 34 minutes,121107,"None, English",Passed,F.W. Murnau,George O'Brien,Janet Gaynor,Margaret Livingston,United States,3
1,Wings,1927,"Drama, Romance",7.6,2 hours 24 minutes,746,English,PG-13,William A. Wellman,Clara Bow,Charles 'Buddy' Rogers,Richard Arlen,United States,1
2,Hollywood Revue,1928,"Comedy, Music",5.8,2 hours 10 minutes,5277780,English,Passed,Charles Reisner,Conrad Nagel,Jack Benny,John Gilbert,United States,0
3,Morocco,1930,"Drama, Romance",7.0,1 hour 32 minutes,191,"English, French, Spanish, Arabic, Italian",Passed,Josef von Sternberg,Gary Cooper,Marlene Dietrich,Adolphe Menjou,United States,0
4,The Public Enemy,1930,"Crime, Drama",7.6,1 hour 23 minutes,1214260,English,Not Rated,William A. Wellman,James Cagney,Jean Harlow,Edward Woods,United States,0


Check the vital statistics of the dataset using the `type` and `shape` attributes.

In [9]:
print("Data type : ", type(oscarData))
print("Data dims : ", oscarData.shape)

Data type :  <class 'pandas.core.frame.DataFrame'>
Data dims :  (1669, 14)


Check the variables (and their types) in the dataset using the `dtypes` attribute.

In [10]:
print(oscarData.dtypes)

Title                 object
Year                   int64
Genre                 object
Rating               float64
Runtime               object
Sales                  int64
Language              object
Age Rating            object
Director              object
Cast1                 object
Cast2                 object
Cast3                 object
Country of Origin     object
Number of Wins         int64
dtype: object


## Adding our variables to the dataset

Because we are using the actors' and directors' popularity to determine how many times a film wins at the Oscar Awards, we need to add these variables to the dataset, matching on the existing actor and director names.

We begin by adding the following columns to the `oscarData` dataframe:
> 1. Cast1 Rank
> 2. Cast2 Rank
> 3. Cast3 Rank
> 4. Director Rank
> 5. Sum of Cast Rankings


The default director rank is set to 100000, far larger than any real rank in the top 50 list. Unranked directors therefore carry a much larger value than ranked ones, which makes the top 50 directors stand out.

In [11]:
oscarData.insert(9, "Director Rank", 100000)

The default cast rank is likewise set to 100000, so that the top 1000 actors and actresses stand out against unranked cast members.
The `Sum of Cast Rankings` column is initialised with a placeholder of 400000; it is overwritten further down with the actual sum of the three cast ranks (300000 when no cast member is ranked).

In [12]:
oscarData.insert(11, "Cast1 Rank", 100000)
oscarData.insert(13, "Cast2 Rank", 100000)
oscarData.insert(15, "Cast3 Rank", 100000)
oscarData.insert(16, "Sum of Cast Rankings", 400000)
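The insert positions above go up in steps of two because every inserted rank column shifts the later columns one place to the right. A minimal sketch of that effect, using a made-up three-column frame (`demo` is an illustrative name, not part of the notebook):

```python
import pandas as pd

# Illustrative frame with a director and two cast columns.
demo = pd.DataFrame({
    "Director": ["F.W. Murnau"],
    "Cast1": ["George O'Brien"],
    "Cast2": ["Janet Gaynor"],
})

# DataFrame.insert(loc, column, value) places the new column at position loc.
demo.insert(1, "Director Rank", 100000)  # directly after Director
demo.insert(3, "Cast1 Rank", 100000)     # Cast1 has moved to position 2
demo.insert(5, "Cast2 Rank", 100000)     # Cast2 has moved to position 4

print(demo.columns.tolist())
```

Each rank column ends up immediately to the right of the name column it describes.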

The new dataframe:

In [13]:
oscarData.head()

Unnamed: 0,Title,Year,Genre,Rating,Runtime,Sales,Language,Age Rating,Director,Director Rank,Cast1,Cast1 Rank,Cast2,Cast2 Rank,Cast3,Cast3 Rank,Sum of Cast Rankings,Country of Origin,Number of Wins
0,Sunrise,1927,"Drama, Romance",8.1,1 hour 34 minutes,121107,"None, English",Passed,F.W. Murnau,100000,George O'Brien,100000,Janet Gaynor,100000,Margaret Livingston,100000,400000,United States,3
1,Wings,1927,"Drama, Romance",7.6,2 hours 24 minutes,746,English,PG-13,William A. Wellman,100000,Clara Bow,100000,Charles 'Buddy' Rogers,100000,Richard Arlen,100000,400000,United States,1
2,Hollywood Revue,1928,"Comedy, Music",5.8,2 hours 10 minutes,5277780,English,Passed,Charles Reisner,100000,Conrad Nagel,100000,Jack Benny,100000,John Gilbert,100000,400000,United States,0
3,Morocco,1930,"Drama, Romance",7.0,1 hour 32 minutes,191,"English, French, Spanish, Arabic, Italian",Passed,Josef von Sternberg,100000,Gary Cooper,100000,Marlene Dietrich,100000,Adolphe Menjou,100000,400000,United States,0
4,The Public Enemy,1930,"Crime, Drama",7.6,1 hour 23 minutes,1214260,English,Not Rated,William A. Wellman,100000,James Cagney,100000,Jean Harlow,100000,Edward Woods,100000,400000,United States,0


Now, we will add the corresponding director and cast ranks.

In [14]:
# Add the director ranks by matching each director's name in oscarData against directorData.
# Directors that do not appear in directorData are assigned a rank of 100000.
# This makes the ranked directors stand out with much smaller values.
for i in range(0, len(oscarData)):
    found = False
    for j in range(0, len(directorData)):
        if oscarData['Director'][i] == directorData['Name of Director'][j]:
            oscarData.at[i, 'Director Rank'] = directorData['Rank'][j]
            found = True
    # Store the default as an integer (not a string) so the column stays numeric
    if not found:
        oscarData.at[i, 'Director Rank'] = 100000
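The nested loop above compares every film against every director. As a hedged alternative, the same lookup can be sketched with `Series.map`. The two small frames below are made-up stand-ins for `oscarData` and `directorData`, and `DEFAULT_RANK`, `director_demo`, and `oscar_demo` are names introduced only for this illustration.

```python
import pandas as pd

DEFAULT_RANK = 100000  # same sentinel value the notebook uses for unranked directors

# Tiny stand-ins for directorData and oscarData (illustrative rows only).
director_demo = pd.DataFrame({
    "Rank": [1, 2],
    "Name of Director": ["Steven Spielberg", "Martin Scorsese"],
})
oscar_demo = pd.DataFrame({
    "Title": ["Film A", "Film B", "Film C"],
    "Director": ["Martin Scorsese", "F.W. Murnau", "Steven Spielberg"],
})

# Build a name -> rank lookup (assumes director names are unique),
# then map it onto the Director column; unmatched names become NaN.
rank_by_name = director_demo.set_index("Name of Director")["Rank"]
oscar_demo["Director Rank"] = (
    oscar_demo["Director"].map(rank_by_name).fillna(DEFAULT_RANK).astype(int)
)

print(oscar_demo)
```

This sketch relies on exact string matches, just like the loop, so differences in spelling or spacing would still leave a director unranked.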

In [15]:
# Add the cast ranks by matching each cast member's name in oscarData against actorData.
# Cast members that do not appear in actorData are assigned a rank of 100000.
# This makes the ranked actors and actresses stand out with much smaller values.
castColumns = ['Cast1', 'Cast2', 'Cast3']
for i in range(0, len(oscarData)):
    for cast in castColumns:
        found = False
        # Strip stray whitespace around the name before comparing
        name = oscarData[cast][i].strip()
        for j in range(0, len(actorData)):
            if name == actorData['Name'][j]:
                oscarData.at[i, cast + ' Rank'] = actorData['Position'][j]
                found = True
        # If there was no corresponding rank data, the rank remains at 100000.
        # Store it as an integer (not a string) so the column stays numeric.
        if not found:
            oscarData.at[i, cast + ' Rank'] = 100000
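The cast lookup above can be sketched the same way for all three cast columns at once. This is only an illustration under the assumption that actor names are unique; `actor_demo`, `oscar_demo`, and `DEFAULT_RANK` are hypothetical names, and the handful of rows reuse positions shown in the notebook output.

```python
import pandas as pd

DEFAULT_RANK = 100000  # sentinel for cast members who are not in the top 1000 list

# Tiny stand-ins for actorData and oscarData.
actor_demo = pd.DataFrame({
    "Position": [40, 71, 241, 244],
    "Name": ["James Cagney", "Gary Cooper", "Marlene Dietrich", "Jean Harlow"],
})
oscar_demo = pd.DataFrame({
    "Cast1": ["Gary Cooper", " James Cagney "],  # second name has stray spaces
    "Cast2": ["Marlene Dietrich", "Jean Harlow"],
    "Cast3": ["Adolphe Menjou", "Edward Woods"],
})

position_by_name = actor_demo.set_index("Name")["Position"]

for col in ["Cast1", "Cast2", "Cast3"]:
    cleaned = oscar_demo[col].str.strip()  # mirrors the .strip() in the loop
    oscar_demo[col + " Rank"] = (
        cleaned.map(position_by_name).fillna(DEFAULT_RANK).astype(int)
    )

print(oscar_demo)
```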

In [16]:
# Add the Sum of Cast Rankings.
# The running value is called `total` so that it does not shadow Python's built-in sum().
for i in range(0, len(oscarData)):
    total = oscarData['Cast1 Rank'][i] + oscarData['Cast2 Rank'][i] + oscarData['Cast3 Rank'][i]
    oscarData.at[i, 'Sum of Cast Rankings'] = total
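The row-by-row addition above can also be written as a single column-wise sum. A minimal sketch on a made-up frame (`ranks_demo` is an illustrative name) that reuses rank values from the first, fourth, and fifth rows of the output shown in this notebook:

```python
import pandas as pd

# Three rows of cast ranks, copied from the notebook's displayed output.
ranks_demo = pd.DataFrame({
    "Cast1 Rank": [100000, 71, 40],
    "Cast2 Rank": [387, 241, 244],
    "Cast3 Rank": [100000, 100000, 100000],
})

rank_cols = ["Cast1 Rank", "Cast2 Rank", "Cast3 Rank"]

# axis=1 adds across the columns of each row.
ranks_demo["Sum of Cast Rankings"] = ranks_demo[rank_cols].sum(axis=1)

print(ranks_demo["Sum of Cast Rankings"].tolist())  # [200387, 100312, 100284]
```

This only works if all three rank columns are numeric, which is why the default rank should be stored as an integer rather than a string.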

###### The new oscarData dataframe:

In [17]:
oscarData.head()

Unnamed: 0,Title,Year,Genre,Rating,Runtime,Sales,Language,Age Rating,Director,Director Rank,Cast1,Cast1 Rank,Cast2,Cast2 Rank,Cast3,Cast3 Rank,Sum of Cast Rankings,Country of Origin,Number of Wins
0,Sunrise,1927,"Drama, Romance",8.1,1 hour 34 minutes,121107,"None, English",Passed,F.W. Murnau,100000,George O'Brien,100000,Janet Gaynor,387,Margaret Livingston,100000,200387,United States,3
1,Wings,1927,"Drama, Romance",7.6,2 hours 24 minutes,746,English,PG-13,William A. Wellman,100000,Clara Bow,100000,Charles 'Buddy' Rogers,100000,Richard Arlen,100000,300000,United States,1
2,Hollywood Revue,1928,"Comedy, Music",5.8,2 hours 10 minutes,5277780,English,Passed,Charles Reisner,100000,Conrad Nagel,100000,Jack Benny,100000,John Gilbert,100000,300000,United States,0
3,Morocco,1930,"Drama, Romance",7.0,1 hour 32 minutes,191,"English, French, Spanish, Arabic, Italian",Passed,Josef von Sternberg,100000,Gary Cooper,71,Marlene Dietrich,241,Adolphe Menjou,100000,100312,United States,0
4,The Public Enemy,1930,"Crime, Drama",7.6,1 hour 23 minutes,1214260,English,Not Rated,William A. Wellman,100000,James Cagney,40,Jean Harlow,244,Edward Woods,100000,100284,United States,0


In [18]:
oscarData.to_csv("temp.csv")

In [38]:
oscarData = pd.read_csv("temp.csv")
oscarData

Unnamed: 0.1,Unnamed: 0,Title,Year,Genre,Rating,Runtime,Sales,Language,Age Rating,Director,Director Rank,Cast1,Cast1 Rank,Cast2,Cast2 Rank,Cast3,Cast3 Rank,Sum of Cast Rankings,Country of Origin,Number of Wins
0,0,Sunrise,1927,"Drama, Romance",8.1,1 hour 34 minutes,121107,"None, English",Passed,F.W. Murnau,100000,George O'Brien,100000,Janet Gaynor,387,Margaret Livingston,100000,200387,United States,3
1,1,Wings,1927,"Drama, Romance",7.6,2 hours 24 minutes,746,English,PG-13,William A. Wellman,100000,Clara Bow,100000,Charles 'Buddy' Rogers,100000,Richard Arlen,100000,300000,United States,1
2,2,Hollywood Revue,1928,"Comedy, Music",5.8,2 hours 10 minutes,5277780,English,Passed,Charles Reisner,100000,Conrad Nagel,100000,Jack Benny,100000,John Gilbert,100000,300000,United States,0
3,3,Morocco,1930,"Drama, Romance",7.0,1 hour 32 minutes,191,"English, French, Spanish, Arabic, Italian",Passed,Josef von Sternberg,100000,Gary Cooper,71,Marlene Dietrich,241,Adolphe Menjou,100000,100312,United States,0
4,4,The Public Enemy,1930,"Crime, Drama",7.6,1 hour 23 minutes,1214260,English,Not Rated,William A. Wellman,100000,James Cagney,40,Jean Harlow,244,Edward Woods,100000,100284,United States,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1664,1664,Saria,2019,"Short, Drama",7.2,22 minutes,330661,Spanish,Parents guide,Bryan Buckley,100000,Estefanía Tellez,100000,Gabriela Ramírez,100000,Verónica Zúñiga,100000,300000,United States,0
1665,1665,Ad Astra,2019,"Adventure, Drama",6.5,2 hours 3 minutes,127461872,"English, Norwegian",PG13,James Gray,100000,Brad Pitt,120,Tommy Lee Jones,103,Ruth Negga,100000,100223,"China, United States, Brazil",0
1666,1666,Avengers: Endgame,2019,"Action, Adventure",8.4,3 hours 1 minute,2797501328,"English, Japanese, Xhosa, German",PG13,Anthony Russo,100000,Robert Downey Jr.,125,Chris Evans,457,Mark Ruffalo,516,1098,United States,0
1667,1667,The Lion King,2019,"Animation, Adventure",6.8,1 hour 58 minutes,1662899439,"English, Xhosa, Zulu, French, Spanish, Hindi",PG,Jon Favreau,100000,Donald Glover,100000,Beyoncé,831,Seth Rogen,462,101293,"United States, United Kingdom",0


In [39]:
# Remove the extra "Unnamed: 0" column: to_csv wrote the index to the file and read_csv loaded it back as a column
del oscarData["Unnamed: 0"]
oscarData.head()

Unnamed: 0,Title,Year,Genre,Rating,Runtime,Sales,Language,Age Rating,Director,Director Rank,Cast1,Cast1 Rank,Cast2,Cast2 Rank,Cast3,Cast3 Rank,Sum of Cast Rankings,Country of Origin,Number of Wins
0,Sunrise,1927,"Drama, Romance",8.1,1 hour 34 minutes,121107,"None, English",Passed,F.W. Murnau,100000,George O'Brien,100000,Janet Gaynor,387,Margaret Livingston,100000,200387,United States,3
1,Wings,1927,"Drama, Romance",7.6,2 hours 24 minutes,746,English,PG-13,William A. Wellman,100000,Clara Bow,100000,Charles 'Buddy' Rogers,100000,Richard Arlen,100000,300000,United States,1
2,Hollywood Revue,1928,"Comedy, Music",5.8,2 hours 10 minutes,5277780,English,Passed,Charles Reisner,100000,Conrad Nagel,100000,Jack Benny,100000,John Gilbert,100000,300000,United States,0
3,Morocco,1930,"Drama, Romance",7.0,1 hour 32 minutes,191,"English, French, Spanish, Arabic, Italian",Passed,Josef von Sternberg,100000,Gary Cooper,71,Marlene Dietrich,241,Adolphe Menjou,100000,100312,United States,0
4,The Public Enemy,1930,"Crime, Drama",7.6,1 hour 23 minutes,1214260,English,Not Rated,William A. Wellman,100000,James Cagney,40,Jean Harlow,244,Edward Woods,100000,100284,United States,0
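The extra `Unnamed: 0` column appears because `to_csv` writes the index by default. A small sketch of the difference, using an in-memory buffer instead of a file so it can run anywhere (`demo`, `cols_with_index`, and `cols_without_index` are illustrative names):

```python
import io

import pandas as pd

demo = pd.DataFrame({"Title": ["Sunrise", "Wings"], "Year": [1927, 1927]})

# Default behaviour: the index is written as an unnamed first column.
buffer_with_index = io.StringIO()
demo.to_csv(buffer_with_index)
buffer_with_index.seek(0)
cols_with_index = pd.read_csv(buffer_with_index).columns.tolist()

# index=False: only the real columns are written.
buffer_without_index = io.StringIO()
demo.to_csv(buffer_without_index, index=False)
buffer_without_index.seek(0)
cols_without_index = pd.read_csv(buffer_without_index).columns.tolist()

print(cols_with_index)     # ['Unnamed: 0', 'Title', 'Year']
print(cols_without_index)  # ['Title', 'Year']
```

Passing `index=False` when saving would make the later `del` step unnecessary; alternatively, `pd.read_csv(..., index_col=0)` reads the first column back as the index.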


Because runtime will be treated as a numerical variable, we need to convert each runtime string into a total number of minutes.

In [40]:
# Storing runtime as a new dataframe
runtime = pd.DataFrame(oscarData['Runtime'])
runtime.head()

Unnamed: 0,Runtime
0,1 hour 34 minutes
1,2 hours 24 minutes
2,2 hours 10 minutes
3,1 hour 32 minutes
4,1 hour 23 minutes


In [41]:
#inserting the necessary columns
runtime.insert(1, "Hours", 0)
runtime.insert(2, "Minutes", 0)

In [54]:
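for i in range(0, len(runtime)):
    # Reset both values for every row, so that a runtime without a minutes part
    # (e.g. "2 hours") does not silently reuse the minutes of the previous row.
    hour = 0
    minute = 0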
    if "hours" in runtime["Runtime"][i]:
        temp1 = runtime["Runtime"][i].split("hours")
        hour = temp1[0].strip()
        if "minutes" in runtime["Runtime"][i]:
            temp2 = temp1[1].split("minutes")
            minute = temp2[0].strip()
        elif "minute" in runtime["Runtime"][i]:
            temp2 = temp1[1].split("minute")
            minute = temp2[0].strip()
        
    elif "hour" in runtime["Runtime"][i]:
        temp1 = runtime["Runtime"][i].split("hour")
        hour = temp1[0].strip()
        if "minutes" in runtime["Runtime"][i]:
            temp2 = temp1[1].split("minutes")
            minute = temp2[0].strip()
        elif "minute" in runtime["Runtime"][i]:
            temp2 = temp1[1].split("minute")
            minute = temp2[0].strip()
    else:
        if "minutes" in runtime["Runtime"][i]:
            temp1 = runtime["Runtime"][i].split("minutes")
            hour = 0
            minute = temp1[0].strip()
        
        elif "minute" in runtime["Runtime"][i]:
            temp1 = runtime["Runtime"][i].split("minute")
            hour = 0
            minute = temp1[0].strip()
        
    runtime.at[i,'Hours'] = hour
    runtime.at[i,'Minutes'] = minute

In [58]:
#new runtime dataframe
runtime.head()

Unnamed: 0,Runtime,Hours,Minutes
0,1 hour 34 minutes,1,34
1,2 hours 24 minutes,2,24
2,2 hours 10 minutes,2,10
3,1 hour 32 minutes,1,32
4,1 hour 23 minutes,1,23


In [59]:
#converting the hours into minutes and putting back the data into the oscarData dataframe
time = 0
for i in range(0, len(runtime)):
    time = int(runtime["Hours"][i])*60 + int(runtime["Minutes"][i])
    oscarData.at[i,'Runtime'] = time
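The parsing and conversion steps above can also be sketched as one small helper based on regular expressions. This is an alternative under the assumption that every runtime looks like the strings shown in the output ("1 hour 34 minutes", "22 minutes", and so on); `runtime_to_minutes` is a hypothetical name, not part of the notebook.

```python
import re


def runtime_to_minutes(text):
    """Convert a string such as '1 hour 34 minutes' into total minutes."""
    # 'hours?' and 'minutes?' match both the singular and the plural form.
    hours = re.search(r"(\d+)\s*hours?", text)
    minutes = re.search(r"(\d+)\s*minutes?", text)

    total = 0
    if hours:
        total += int(hours.group(1)) * 60
    if minutes:
        total += int(minutes.group(1))
    return total


print(runtime_to_minutes("1 hour 34 minutes"))  # 94
print(runtime_to_minutes("3 hours 1 minute"))   # 181
print(runtime_to_minutes("22 minutes"))         # 22
```

On the real data this could be applied with something like `oscarData['Runtime'].apply(runtime_to_minutes)`; a missing part simply contributes zero minutes.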

In [None]:
# Categorising the films by their primary genre
gens = []
for i in range(0, len(oscarData)):
    gen = oscarData['Genre'][i].split(',')
    gens.append(gen[0])
    
gdf = pd.Series(gens)
oscarData.insert(3, 'Genre 1', gdf)
gdf.value_counts()

In [None]:
# Categorising the films by whether their primary language of release is English
pri = []
for i in range(0, len(oscarData)):
    # Only the first listed language is treated as the primary language
    lang = oscarData['Language'][i].split(',')
    if lang[0] == 'English':
        pri.append('Yes')
    else:
        pri.append('No')
        
pdf = pd.Series(pri)
oscarData.insert(8, 'English Film', pdf)
pdf.value_counts()

In [None]:
# Categorising the films by their primary country of origin
cofs = []
for i in range(0, len(oscarData)):
    cof = oscarData['Country of Origin'][i].split(',')
    cofs.append(cof[0])
    
cpd = pd.Series(cofs)
oscarData.insert(21, 'Country of Origin 1', cpd)
cpd.value_counts()
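The three categorisation loops above all take the first entry of a comma-separated string. A hedged sketch of the same idea using pandas string methods on a made-up two-row frame (`demo` and `first_entry` are illustrative names); the row values are copied from the notebook output:

```python
import pandas as pd

demo = pd.DataFrame({
    "Genre": ["Drama, Romance", "Crime, Drama"],
    "Language": ["None, English", "English"],
    "Country of Origin": ["United States", "China, United States, Brazil"],
})


def first_entry(column):
    """Return the first comma-separated entry of each string, without spaces."""
    return column.str.split(",").str[0].str.strip()


demo["Genre 1"] = first_entry(demo["Genre"])
demo["Country of Origin 1"] = first_entry(demo["Country of Origin"])

# 'Yes' only when the first listed language is English.
is_english = first_entry(demo["Language"]).eq("English")
demo["English Film"] = is_english.map({True: "Yes", False: "No"})

print(demo[["Genre 1", "English Film", "Country of Origin 1"]])
```

Note that a film listed as "None, English" counts as non-English here, exactly as in the loop, because only the first entry is inspected.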

In [60]:
oscarData

Unnamed: 0,Title,Year,Genre,Rating,Runtime,Sales,Language,Age Rating,Director,Director Rank,Cast1,Cast1 Rank,Cast2,Cast2 Rank,Cast3,Cast3 Rank,Sum of Cast Rankings,Country of Origin,Number of Wins
0,Sunrise,1927,"Drama, Romance",8.1,94,121107,"None, English",Passed,F.W. Murnau,100000,George O'Brien,100000,Janet Gaynor,387,Margaret Livingston,100000,200387,United States,3
1,Wings,1927,"Drama, Romance",7.6,144,746,English,PG-13,William A. Wellman,100000,Clara Bow,100000,Charles 'Buddy' Rogers,100000,Richard Arlen,100000,300000,United States,1
2,Hollywood Revue,1928,"Comedy, Music",5.8,130,5277780,English,Passed,Charles Reisner,100000,Conrad Nagel,100000,Jack Benny,100000,John Gilbert,100000,300000,United States,0
3,Morocco,1930,"Drama, Romance",7.0,92,191,"English, French, Spanish, Arabic, Italian",Passed,Josef von Sternberg,100000,Gary Cooper,71,Marlene Dietrich,241,Adolphe Menjou,100000,100312,United States,0
4,The Public Enemy,1930,"Crime, Drama",7.6,83,1214260,English,Not Rated,William A. Wellman,100000,James Cagney,40,Jean Harlow,244,Edward Woods,100000,100284,United States,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1664,Saria,2019,"Short, Drama",7.2,22,330661,Spanish,Parents guide,Bryan Buckley,100000,Estefanía Tellez,100000,Gabriela Ramírez,100000,Verónica Zúñiga,100000,300000,United States,0
1665,Ad Astra,2019,"Adventure, Drama",6.5,123,127461872,"English, Norwegian",PG13,James Gray,100000,Brad Pitt,120,Tommy Lee Jones,103,Ruth Negga,100000,100223,"China, United States, Brazil",0
1666,Avengers: Endgame,2019,"Action, Adventure",8.4,181,2797501328,"English, Japanese, Xhosa, German",PG13,Anthony Russo,100000,Robert Downey Jr.,125,Chris Evans,457,Mark Ruffalo,516,1098,United States,0
1667,The Lion King,2019,"Animation, Adventure",6.8,118,1662899439,"English, Xhosa, Zulu, French, Spanish, Hindi",PG,Jon Favreau,100000,Donald Glover,100000,Beyoncé,831,Seth Rogen,462,101293,"United States, United Kingdom",0


In [61]:
oscarData.to_csv("fullOscarData.csv")