In [1]:
import pandas as pd
import numpy as np

In [2]:
dataset = pd.read_csv("oscars.csv")
dataset.head() 

Unnamed: 0,year,movie,movie_id,certificate,duration,genre,rate,metascore,synopsis,votes,...,New_York_Film_Critics_Circle_nominated,New_York_Film_Critics_Circle_nominated_categories,Los_Angeles_Film_Critics_Association_won,Los_Angeles_Film_Critics_Association_won_categories,Los_Angeles_Film_Critics_Association_nominated,Los_Angeles_Film_Critics_Association_nominated_categories,release_date.year,release_date.month,release_date.day-of-month,release_date.day-of-week
0,2001,Kate & Leopold,tt0035423,PG-13,118,Comedy|Fantasy|Romance,6.4,44.0,An English Duke from 1876 is inadvertedly drag...,66660,...,0,,0,,0,,2001.0,12.0,25.0,2.0
1,2000,Chicken Run,tt0120630,G,84,Animation|Adventure|Comedy,7.0,88.0,When a cockerel apparently flies into a chicke...,144475,...,1,Best Animated Film,1,Best Animation,1,Best Animation,2000.0,6.0,23.0,5.0
2,2005,Fantastic Four,tt0120667,PG-13,106,Action|Adventure|Family,5.7,40.0,A group of astronauts gain superpowers after a...,273203,...,0,,0,,0,,2005.0,7.0,8.0,5.0
3,2002,Frida,tt0120679,R,123,Biography|Drama|Romance,7.4,61.0,"A biography of artist Frida Kahlo, who channel...",63852,...,0,,0,,0,,2002.0,11.0,22.0,5.0
4,2001,The Lord of the Rings: The Fellowship of the Ring,tt0120737,PG-13,178,Adventure|Drama|Fantasy,8.8,92.0,A meek Hobbit from the Shire and eight compani...,1286275,...,0,,1,Best Music,2,Best Music|Best Production Design,2001.0,12.0,19.0,3.0


In [3]:
dataset.shape

(1183, 119)

Let's look at the columns 20 at a time to make the data easier to explore.

In [4]:
dataset[dataset.columns[:20]].head()

Unnamed: 0,year,movie,movie_id,certificate,duration,genre,rate,metascore,synopsis,votes,gross,release_date,user_reviews,critic_reviews,popularity,awards_wins,awards_nominations,Oscar_Best_Picture_won,Oscar_Best_Picture_nominated,Oscar_Best_Director_won
0,2001,Kate & Leopold,tt0035423,PG-13,118,Comedy|Fantasy|Romance,6.4,44.0,An English Duke from 1876 is inadvertedly drag...,66660,47100000.0,2001-12-25,318.0,125.0,2363.0,1,4,No,No,No
1,2000,Chicken Run,tt0120630,G,84,Animation|Adventure|Comedy,7.0,88.0,When a cockerel apparently flies into a chicke...,144475,106790000.0,2000-06-23,361.0,186.0,2859.0,5,11,No,No,No
2,2005,Fantastic Four,tt0120667,PG-13,106,Action|Adventure|Family,5.7,40.0,A group of astronauts gain superpowers after a...,273203,154700000.0,2005-07-08,1008.0,278.0,1876.0,0,0,No,No,No
3,2002,Frida,tt0120679,R,123,Biography|Drama|Romance,7.4,61.0,"A biography of artist Frida Kahlo, who channel...",63852,25780000.0,2002-11-22,272.0,126.0,2508.0,2,12,No,No,No
4,2001,The Lord of the Rings: The Fellowship of the Ring,tt0120737,PG-13,178,Adventure|Drama|Fantasy,8.8,92.0,A meek Hobbit from the Shire and eight compani...,1286275,313840000.0,2001-12-19,5078.0,296.0,204.0,26,67,No,Yes,No


From the first 20, let's drop `movie` (or maybe the title has a little influence xD), use `movie_id` as the index instead of as a feature, and maybe drop `synopsis` (no NLP). `awards_wins` is, I guess, the number of awards won before the Oscars ([awards timeline](https://www.indiewire.com/2017/07/awards-calendar-2017-2018-oscars-golden-globes-indie-spirits-1201792252/)). Let's check this using the second row.

In [59]:
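# Count the "won" columns equal to 1 in the second row (Chicken Run).
# This film didn't win any Oscars, and each of its other award counts is 0 or 1,
# so counting the ones gives its total number of non-Oscar wins.
wins = 0  # avoid shadowing the built-in sum()
for col in dataset.columns:
    if "won" in col and dataset[col].iloc[1] == 1:
        wins += 1
wins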

5
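Checking a single row is fairly weak evidence. Here is a minimal sketch of the same comparison for every row at once, run on a toy frame standing in for `dataset`. The name `toy` and its values are made up for illustration, and the sketch assumes that the per-award count columns are exactly the numeric columns whose names end in `_won`.

```python
import pandas as pd

# Toy stand-in for `dataset`; values are illustrative, not taken from oscars.csv.
toy = pd.DataFrame({
    "awards_wins": [1, 5, 0],
    "Oscar_Best_Picture_won": ["No", "No", "No"],  # Yes/No flag, not a count
    "Golden_Globes_won": [1, 0, 0],
    "Golden_Globes_won_categories": ["Best Original Song", None, None],
    "Critics_Choice_won": [0, 2, 0],
    "BAFTA_won": [0, 3, 0],
})

# Keep only numeric columns ending in "_won": this skips the Oscar Yes/No
# flags and the "..._won_categories" text columns.
count_cols = [
    c for c in toy.columns
    if c.endswith("_won") and pd.api.types.is_numeric_dtype(toy[c])
]

# Sum the per-award counts row by row and compare with awards_wins.
summed_wins = toy[count_cols].sum(axis=1)
mismatches = toy[summed_wins != toy["awards_wins"]]
print(len(mismatches), "rows where the totals disagree")
```

On the real data, any rows left in `mismatches` would suggest that `awards_wins` also counts awards that have no column of their own in this table.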

Our guess holds, at least for this row. The same applies to `awards_nominations`. Since 2004 the Oscars (Academy Awards) have been held at the end of February or the beginning of March; before that they were held at the end of March (google "oscars date year" and see for yourself lol). Notice that some movies won't have had the opportunity to win some of these awards before the Oscars ceremony. The next columns are a bunch of Oscar nominations and wins. I guess these are our targets. Is this a multi-label classification problem? Or should we use several models, each one predicting whether a film will win one Oscar category or not? In the second case we would have

In [70]:
sum_ = 0
for i in list(dataset.columns):
    if "Oscar" in i and "won" in i:
        sum_ += 1
print(sum_, "targets")

8 targets


while in the first we'd have one target with 8 possible class labels, and a single row could carry more than one label. I don't know how to do this, so I think we'd better try the second approach first.
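For what it's worth, both approaches can start from the same target encoding. Mapping the eight `Oscar_..._won` Yes/No columns to 0/1 gives one row per film and one column per category, which is the usual indicator-matrix layout for multi-label problems, and each column on its own is the binary target for the one-model-per-category approach. A sketch on made-up data (`toy`, `Y` and the values are illustrative assumptions):

```python
import pandas as pd

# Toy stand-in for `dataset` with three of the eight Oscar target columns.
toy = pd.DataFrame({
    "movie_id": ["tt001", "tt002", "tt003"],
    "Oscar_Best_Picture_won": ["Yes", "No", "No"],
    "Oscar_Best_Picture_nominated": ["Yes", "Yes", "No"],
    "Oscar_Best_Director_won": ["Yes", "No", "No"],
    "Oscar_Best_Actor_won": ["No", "Yes", "No"],
})

# Target columns: Oscar columns that end in "_won" (nominations are excluded).
target_cols = [
    c for c in toy.columns
    if c.startswith("Oscar_") and c.endswith("_won")
]

# Map Yes/No to 1/0. Y is the multi-label indicator matrix;
# Y[col] alone is the binary target for a single-category model.
Y = (toy[target_cols] == "Yes").astype(int)

# How many categories each film won, and the share of winners per category.
labels_per_film = Y.sum(axis=1)
positive_rate = Y.mean()
```

A film such as the first row here, with two wins, is exactly the "more than one label on a single row" case.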

In [71]:
dataset[dataset.columns[:20]].dtypes

year                              int64
movie                            object
movie_id                         object
certificate                      object
duration                          int64
genre                            object
rate                            float64
metascore                       float64
synopsis                         object
votes                             int64
gross                           float64
release_date                     object
user_reviews                    float64
critic_reviews                  float64
popularity                      float64
awards_wins                       int64
awards_nominations                int64
Oscar_Best_Picture_won           object
Oscar_Best_Picture_nominated     object
Oscar_Best_Director_won          object
dtype: object

The types seem fine. We need to encode `certificate` (assign numbers that respect its order), one-hot-encode `genre` (this one will need a little playing around) and the `Oscar_..._nominated` columns. I'm not sure about `release_date`.
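A rough sketch of both encodings on a toy frame. The certificate ordering below is an assumption (least to most restrictive) and would need to cover whatever certificates actually occur in the data; `toy` and the output column names are made up for illustration.

```python
import pandas as pd

# Toy stand-in for `dataset`.
toy = pd.DataFrame({
    "certificate": ["PG-13", "G", "R", None],
    "genre": [
        "Comedy|Fantasy|Romance",
        "Animation|Adventure|Comedy",
        "Biography|Drama|Romance",
        "Drama",
    ],
})

# Assumed ordering, from least to most restrictive.
certificate_order = ["G", "PG", "PG-13", "R"]
cert = pd.Categorical(toy["certificate"], categories=certificate_order, ordered=True)
# .codes follows the order above (0..n-1) and is -1 for missing or unknown values.
toy["certificate_code"] = cert.codes

# Split the pipe-separated genres into one 0/1 column per genre.
genre_dummies = toy["genre"].str.get_dummies(sep="|").add_prefix("genre_")
encoded = pd.concat([toy.drop(columns="genre"), genre_dummies], axis=1)
```

The `-1` code for a missing certificate would still have to be imputed or handled before modelling.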

In [72]:
dataset[dataset.columns[20:40]].head()

Unnamed: 0,Oscar_Best_Director_nominated,Oscar_Best_Actor_won,Oscar_Best_Actor_nominated,Oscar_Best_Actress_won,Oscar_Best_Actress_nominated,Oscar_Best_Supporting_Actor_won,Oscar_Best_Supporting_Actor_nominated,Oscar_Best_Supporting_Actress_won,Oscar_Best_Supporting_Actress_nominated,Oscar_Best_AdaScreen_won,Oscar_Best_AdaScreen_nominated,Oscar_Best_OriScreen_won,Oscar_Best_OriScreen_nominated,Oscar_nominated,Oscar_nominated_categories,Golden_Globes_won,Golden_Globes_won_categories,Golden_Globes_nominated,Golden_Globes_nominated_categories,BAFTA_won
0,No,No,No,No,No,No,No,No,No,No,No,No,No,1,"Best Music, Original Song",1,Best Original Song - Motion Picture,2,Best Original Song - Motion Picture|Best Perfo...,0
1,No,No,No,No,No,No,No,No,No,No,No,No,No,0,,0,,1,Best Motion Picture - Comedy or Musical,0
2,No,No,No,No,No,No,No,No,No,No,No,No,No,0,,0,,0,,0
3,No,No,No,No,Yes,No,No,No,No,No,No,No,No,6,"Best Music, Original Score|Best Makeup|Best Pe...",1,Best Original Score - Motion Picture,2,Best Original Score - Motion Picture|Best Perf...,1
4,Yes,No,No,No,No,No,Yes,No,No,No,Yes,No,No,13,"Best Cinematography|Best Makeup|Best Music, Or...",0,,4,Best Motion Picture - Drama|Best Director - Mo...,5


In [74]:
dataset[dataset.columns[20:40]].dtypes

Oscar_Best_Director_nominated              object
Oscar_Best_Actor_won                       object
Oscar_Best_Actor_nominated                 object
Oscar_Best_Actress_won                     object
Oscar_Best_Actress_nominated               object
Oscar_Best_Supporting_Actor_won            object
Oscar_Best_Supporting_Actor_nominated      object
Oscar_Best_Supporting_Actress_won          object
Oscar_Best_Supporting_Actress_nominated    object
Oscar_Best_AdaScreen_won                   object
Oscar_Best_AdaScreen_nominated             object
Oscar_Best_OriScreen_won                   object
Oscar_Best_OriScreen_nominated             object
Oscar_nominated                             int64
Oscar_nominated_categories                 object
Golden_Globes_won                           int64
Golden_Globes_won_categories               object
Golden_Globes_nominated                     int64
Golden_Globes_nominated_categories         object
BAFTA_won                                   int64


In [73]:
dataset[dataset.columns[40:60]].head()

Unnamed: 0,BAFTA_won_categories,BAFTA_nominated,BAFTA_nominated_categories,Screen_Actors_Guild_won,Screen_Actors_Guild_won_categories,Screen_Actors_Guild_nominated,Screen_Actors_Guild_nominated_categories,Critics_Choice_won,Critics_Choice_won_categories,Critics_Choice_nominated,Critics_Choice_nominated_categories,Directors_Guild_won,Directors_Guild_won_categories,Directors_Guild_nominated,Directors_Guild_nominated_categories,Producers_Guild_won,Producers_Guild_won_categories,Producers_Guild_nominated,Producers_Guild_nominated_categories,Art_Directors_Guild_won
0,,0,,0,,0,,0,,1,Best Song,0,,0,,0,,0,,0
1,,2,|Best Achievement in Special Visual Effects,0,,0,,1,Best Animated Film,1,Best Animated Film,0,,0,,0,,0,,0
2,,0,,0,,0,,0,,0,,0,,0,,0,,0,,0
3,Best Make Up/Hair,4,Best Make Up/Hair|Best Performance by an Actre...,0,,2,Outstanding Performance by a Female Actor in a...,0,,2,Best Supporting Actor|Best Actress,0,,0,,0,,0,,0
4,|Best Film|Best Achievement in Special Visual ...,14,|Best Film|Best Achievement in Special Visual ...,1,Outstanding Performance by a Male Actor in a S...,2,Outstanding Performance by a Male Actor in a S...,3,Favorite Film Franchise|Best Song|Best Composer,5,Favorite Film Franchise|Best Song|Best Compose...,0,,1,Outstanding Directorial Achievement in Motion ...,0,,1,Outstanding Producer of Theatrical Motion Pict...,0


In [75]:
dataset[dataset.columns[40:60]].dtypes

BAFTA_won_categories                        object
BAFTA_nominated                              int64
BAFTA_nominated_categories                  object
Screen_Actors_Guild_won                      int64
Screen_Actors_Guild_won_categories          object
Screen_Actors_Guild_nominated                int64
Screen_Actors_Guild_nominated_categories    object
Critics_Choice_won                           int64
Critics_Choice_won_categories               object
Critics_Choice_nominated                     int64
Critics_Choice_nominated_categories         object
Directors_Guild_won                          int64
Directors_Guild_won_categories              object
Directors_Guild_nominated                    int64
Directors_Guild_nominated_categories        object
Producers_Guild_won                          int64
Producers_Guild_won_categories              object
Producers_Guild_nominated                    int64
Producers_Guild_nominated_categories        object
Art_Directors_Guild_won        

One-hot-encode the `..._categories` columns, which will also need a bit of playing around. Maybe drop the count columns (the ones without `categories`), or try it both ways.
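Part of the "playing around" is that some category strings start with a stray `|` and most rows are NaN. A sketch of one way to handle both, using a toy frame (`toy` and the `column=category` naming scheme are illustrative assumptions):

```python
import pandas as pd

# Toy stand-in for `dataset`; note the leading "|" seen in the real output.
toy = pd.DataFrame({
    "BAFTA_nominated": [0, 2, 3],
    "BAFTA_nominated_categories": [
        None,
        "|Best Achievement in Special Visual Effects",
        "|Best Film|Best Achievement in Special Visual Effects",
    ],
})

col = "BAFTA_nominated_categories"

# Treat missing values as "no categories", then split on "|".
dummies = toy[col].fillna("").str.get_dummies(sep="|")
# Defensive: drop an empty-string column if the stray "|" produced one.
dummies = dummies.drop(columns=[""], errors="ignore")
# Prefix with the source column so categories from different awards stay apart.
dummies = dummies.add_prefix(col + "=")
```

Repeating this for every `..._categories` column will produce a lot of sparse features, so it may be worth counting the resulting columns before committing to it.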

In [76]:
dataset[dataset.columns[60:80]].head()

Unnamed: 0,Art_Directors_Guild_won_categories,Art_Directors_Guild_nominated,Art_Directors_Guild_nominated_categories,Writers_Guild_won,Writers_Guild_won_categories,Writers_Guild_nominated,Writers_Guild_nominated_categories,Costume_Designers_Guild_won,Costume_Designers_Guild_won_categories,Costume_Designers_Guild_nominated,Costume_Designers_Guild_nominated_categories,Online_Film_Television_Association_won,Online_Film_Television_Association_won_categories,Online_Film_Television_Association_nominated,Online_Film_Television_Association_nominated_categories,Online_Film_Critics_Society_won,Online_Film_Critics_Society_won_categories,Online_Film_Critics_Society_nominated,Online_Film_Critics_Society_nominated_categories,People_Choice_won
0,,0,,0,,0,,0,,0,,0,,1,"Best Music, Original Song",0,,0,,0
1,,0,,0,,0,,0,,0,,1,Best Animated Picture,2,Best Animated Picture|Best Voice-Over Performance,1,Top Ten Films of the Year,1,Top Ten Films of the Year,0
2,,0,,0,,0,,0,,0,,0,,0,,0,,0,,0
3,,0,,0,,0,,0,,1,Excellence in Period/Fantasy Film,0,,1,Best Costume Design,0,,0,,0
4,,1,Period or Fantasy Film,0,,1,Best Screenplay Based on Material Previously P...,0,,0,,13,Motion Picture|Best Picture|Best Ensemble|Best...,22,Motion Picture|Best Picture|Best Ensemble|Best...,1,Top Ten Films of the Year,8,Top Ten Films of the Year|Best Picture|Best Di...,2


In [77]:
dataset[dataset.columns[80:100]].head()

Unnamed: 0,People_Choice_won_categories,People_Choice_nominated,People_Choice_nominated_categories,London_Critics_Circle_Film_won,London_Critics_Circle_Film_won_categories,London_Critics_Circle_Film_nominated,London_Critics_Circle_Film_nominated_categories,American_Cinema_Editors_won,American_Cinema_Editors_won_categories,American_Cinema_Editors_nominated,American_Cinema_Editors_nominated_categories,Hollywood_Film_won,Hollywood_Film_won_categories,Hollywood_Film_nominated,Hollywood_Film_nominated_categories,Austin_Film_Critics_Association_won,Austin_Film_Critics_Association_won_categories,Austin_Film_Critics_Association_nominated,Austin_Film_Critics_Association_nominated_categories,Denver_Film_Critics_Society_won
0,,0,,0,,0,,0,,0,,0,,0,,0,,0,,0
1,,0,,0,,2,British Film of the Year|British Producer of t...,0,,0,,0,,0,,0,,0,,0
2,,0,,0,,0,,0,,0,,0,,0,,0,,0,,0
3,,0,,0,,0,,0,,0,,0,,0,,0,,0,,0
4,Favorite Motion Picture|Favorite Dramatic Moti...,3,Favorite Move Fan Following|Favorite Motion Pi...,0,,0,,0,,1,Best Edited Feature Film - Dramatic,0,,0,,0,,1,Best Movie of the Decade,0


In [81]:
dataset[dataset.columns[100:120]].head()

Unnamed: 0,Denver_Film_Critics_Society_won_categories,Denver_Film_Critics_Society_nominated,Denver_Film_Critics_Society_nominated_categories,Boston_Society_of_Film_Critics_won,Boston_Society_of_Film_Critics_won_categories,Boston_Society_of_Film_Critics_nominated,Boston_Society_of_Film_Critics_nominated_categories,New_York_Film_Critics_Circle_won,New_York_Film_Critics_Circle_won_categories,New_York_Film_Critics_Circle_nominated,New_York_Film_Critics_Circle_nominated_categories,Los_Angeles_Film_Critics_Association_won,Los_Angeles_Film_Critics_Association_won_categories,Los_Angeles_Film_Critics_Association_nominated,Los_Angeles_Film_Critics_Association_nominated_categories,release_date.year,release_date.month,release_date.day-of-month,release_date.day-of-week
0,,0,,0,,0,,0,,0,,0,,0,,2001.0,12.0,25.0,2.0
1,,0,,0,,0,,1,Best Animated Film,1,Best Animated Film,1,Best Animation,1,Best Animation,2000.0,6.0,23.0,5.0
2,,0,,0,,0,,0,,0,,0,,0,,2005.0,7.0,8.0,5.0
3,,0,,0,,0,,0,,0,,0,,0,,2002.0,11.0,22.0,5.0
4,,0,,0,,1,Best Director,0,,0,,1,Best Music,2,Best Music|Best Production Design,2001.0,12.0,19.0,3.0


Columns 35 to 114 (0-indexed) are all about various awards. Then, in columns 115 to 118, we have `release_date.year`, `release_date.month`, `release_date.day-of-month` and `release_date.day-of-week`.

In [82]:
dataset[dataset.columns[100:120]].dtypes

Denver_Film_Critics_Society_won_categories                    object
Denver_Film_Critics_Society_nominated                          int64
Denver_Film_Critics_Society_nominated_categories              object
Boston_Society_of_Film_Critics_won                             int64
Boston_Society_of_Film_Critics_won_categories                 object
Boston_Society_of_Film_Critics_nominated                       int64
Boston_Society_of_Film_Critics_nominated_categories           object
New_York_Film_Critics_Circle_won                               int64
New_York_Film_Critics_Circle_won_categories                   object
New_York_Film_Critics_Circle_nominated                         int64
New_York_Film_Critics_Circle_nominated_categories             object
Los_Angeles_Film_Critics_Association_won                       int64
Los_Angeles_Film_Critics_Association_won_categories           object
Los_Angeles_Film_Critics_Association_nominated                 int64
Los_Angeles_Film_Critics_Associati

Convert these last four (`release_date.*`, currently floats) to integers?
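If these columns contain missing values, a plain `int64` cast will fail, but pandas' nullable `Int64` dtype can hold them. A small sketch on toy data (the frame and the name `converted` are illustrative):

```python
import numpy as np
import pandas as pd

# Toy stand-in for the release_date.* columns, including a missing row.
toy = pd.DataFrame({
    "release_date.year": [2001.0, 2000.0, np.nan],
    "release_date.month": [12.0, 6.0, np.nan],
})

date_cols = [c for c in toy.columns if c.startswith("release_date.")]

# "Int64" (capital I) is the nullable integer dtype: whole-number floats
# become integers and NaN becomes <NA> instead of raising an error.
converted = toy.astype({c: "Int64" for c in date_cols})
print(converted.dtypes)
```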

In [36]:
dataset.isin(["?"]).sum().unique() # Checking if missing values are encoded as "?"

array([0])

In [13]:
dataset.isna().sum()[dataset.isna().sum() != 0]

certificate                                                    10
metascore                                                      14
gross                                                          24
release_date                                                    9
user_reviews                                                    7
critic_reviews                                                  7
popularity                                                    119
Oscar_nominated_categories                                    551
Golden_Globes_won_categories                                 1024
Golden_Globes_nominated_categories                            732
BAFTA_won_categories                                         1004
BAFTA_nominated_categories                                    783
Screen_Actors_Guild_won_categories                           1103
Screen_Actors_Guild_nominated_categories                      910
Critics_Choice_won_categories                                 966
Critics_Ch

Holy crap. For the columns with only a few missing values, we could impute them by looking the values up (maybe some won't be available though). Otherwise, just drop those rows if the film didn't win any Oscar, or think of something else if it did.
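A sketch of that rule on toy data: rows with gaps and no Oscar win are dropped, while winners with gaps are set aside for manual lookup. The frame, the choice of `sparse_cols` and the variable names are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Toy stand-in for `dataset`.
toy = pd.DataFrame({
    "movie": ["A", "B", "C", "D"],
    "metascore": [44.0, np.nan, np.nan, 61.0],
    "gross": [4.7e7, 1.0e8, np.nan, np.nan],
    "Oscar_Best_Picture_won": ["No", "No", "Yes", "No"],
    "Oscar_Best_Actor_won": ["No", "No", "No", "No"],
})

sparse_cols = ["metascore", "gross"]  # columns with only a few NaNs
oscar_cols = [
    c for c in toy.columns
    if c.startswith("Oscar_") and c.endswith("_won")
]

has_gap = toy[sparse_cols].isna().any(axis=1)        # any sparse column missing
won_oscar = (toy[oscar_cols] == "Yes").any(axis=1)   # won at least one Oscar

# Winners with gaps: keep them, but list them for lookup or imputation.
to_review = toy[has_gap & won_oscar]
# Drop only the rows that have gaps and no Oscar win.
cleaned = toy[~(has_gap & ~won_oscar)]
```

Keeping the winners matters because there are so few positive examples per category.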

In [87]:
na_counts = dataset.isna().sum()
na_counts[(na_counts > 0) & (na_counts < 30)]

certificate                  10
metascore                    14
gross                        24
release_date                  9
user_reviews                  7
critic_reviews                7
release_date.year             9
release_date.month            9
release_date.day-of-month     9
release_date.day-of-week      9
dtype: int64

And for the rest... ok so we have 

In [90]:
na_counts = dataset.isna().sum()
na_counts[na_counts > 30]

popularity                                                    119
Oscar_nominated_categories                                    551
Golden_Globes_won_categories                                 1024
Golden_Globes_nominated_categories                            732
BAFTA_won_categories                                         1004
BAFTA_nominated_categories                                    783
Screen_Actors_Guild_won_categories                           1103
Screen_Actors_Guild_nominated_categories                      910
Critics_Choice_won_categories                                 966
Critics_Choice_nominated_categories                           614
Directors_Guild_won_categories                               1163
Directors_Guild_nominated_categories                         1088
Producers_Guild_won_categories                               1149
Producers_Guild_nominated_categories                         1006
Art_Directors_Guild_won_categories                           1137
Art_Direct

and

In [88]:
print(dataset.shape[0], "rows")

1183 rows


In [89]:
na_counts = dataset.isna().sum()
na_counts[na_counts > 1000]

Golden_Globes_won_categories                            1024
BAFTA_won_categories                                    1004
Screen_Actors_Guild_won_categories                      1103
Directors_Guild_won_categories                          1163
Directors_Guild_nominated_categories                    1088
Producers_Guild_won_categories                          1149
Producers_Guild_nominated_categories                    1006
Art_Directors_Guild_won_categories                      1137
Writers_Guild_won_categories                            1150
Writers_Guild_nominated_categories                      1016
Costume_Designers_Guild_won_categories                  1142
Online_Film_Critics_Society_won_categories              1036
People_Choice_won_categories                            1127
London_Critics_Circle_Film_won_categories               1065
American_Cinema_Editors_won_categories                  1140
Hollywood_Film_won_categories                           1017
Hollywood_Film_nominated

I would say we drop the columns where more than ?% of the data is missing? They are all `..._categories` columns... Also, it seems that whenever one of these columns is `np.nan`, the corresponding count column is zero. These missing values may indicate that the movie came out after that award's ceremony, so it could not participate? Or simply that it didn't win (or wasn't nominated for) anything there.

We may be better off just dropping them, because we are already going to have too many features after one-hot encoding... We may want to keep `Golden_Globes_won_categories` and `BAFTA_won_categories`, because they are among the most important awards.
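Both observations can be checked mechanically. The sketch below, on a toy frame, tests whether NaN in each `..._categories` column lines up exactly with a zero in its count column, and lists the columns above a missing-data cutoff. The cutoff value `max_missing` is purely hypothetical (the percentage is still an open question above), and `toy` and the other names are illustrative.

```python
import pandas as pd

# Toy stand-in for `dataset`.
toy = pd.DataFrame({
    "Golden_Globes_won": [1, 0, 0, 0],
    "Golden_Globes_won_categories": ["Best Original Song", None, None, None],
    "BAFTA_won": [0, 0, 1, 5],
    "BAFTA_won_categories": [None, None, "Best Make Up/Hair", "Best Film|Best Sound"],
})

# For each "..._categories" column, check that NaN occurs exactly where
# the matching count column (same name without the suffix) is zero.
consistent = {}
for cat_col in [c for c in toy.columns if c.endswith("_categories")]:
    count_col = cat_col[: -len("_categories")]
    same = toy[cat_col].isna() == (toy[count_col] == 0)
    consistent[cat_col] = bool(same.all())

# Fraction of missing values per column, and the columns above the cutoff.
max_missing = 0.6  # hypothetical cutoff, not a value chosen in this notebook
missing_frac = toy.isna().mean()
to_drop = missing_frac[missing_frac > max_missing].index.tolist()
```

If `consistent` comes out all `True` on the real data, the NaNs carry no information beyond the count columns and could be treated as "no categories" rather than as genuinely missing values.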

Making sure our data makes sense...

In [6]:
dataset["year"].value_counts()

2013    73
2000    71
2011    71
2006    71
2004    71
2010    69
2005    69
2014    68
2007    68
2002    67
2012    67
2009    67
2001    67
2015    66
2008    64
2003    63
2016    61
2017    30
Name: year, dtype: int64

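So we have data since the beginning of this century (2000 to 2017), although for 2017 we don't really have much data. Let's see if there is only one winner per category per year. The table above lists 18 distinct years, so 17 winners per category would mean that one year (possibly the sparse 2017) has no winner recorded.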

In [13]:
num_winners = {}
for i in list(dataset.columns):
    if "Oscar" in i and "won" in i:
        num_winners[i] = dataset[dataset[i] == "Yes"].shape[0]
num_winners

{'Oscar_Best_Picture_won': 17,
 'Oscar_Best_Director_won': 17,
 'Oscar_Best_Actor_won': 17,
 'Oscar_Best_Actress_won': 17,
 'Oscar_Best_Supporting_Actor_won': 17,
 'Oscar_Best_Supporting_Actress_won': 17,
 'Oscar_Best_AdaScreen_won': 17,
 'Oscar_Best_OriScreen_won': 17}
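A total of 17 per category does not by itself show one winner per year: a year with two winners and a year with none would give the same total. A sketch of a per-year check on toy data (the frame and names are illustrative, and it assumes `year` is the year the winners should be grouped by):

```python
import pandas as pd

# Toy stand-in for `dataset`, with deliberate anomalies in 2001 and 2002.
toy = pd.DataFrame({
    "year": [2000, 2000, 2001, 2001, 2002],
    "Oscar_Best_Picture_won": ["Yes", "No", "No", "Yes", "No"],
    "Oscar_Best_Director_won": ["No", "Yes", "Yes", "Yes", "No"],
})

oscar_cols = [
    c for c in toy.columns
    if c.startswith("Oscar_") and c.endswith("_won")
]

# Winners per year and category: rows are years, columns are categories.
is_winner = (toy[oscar_cols] == "Yes").astype(int)
winners_per_year = is_winner.groupby(toy["year"]).sum()

# Years where at least one category does not have exactly one winner.
odd_years = winners_per_year[(winners_per_year != 1).any(axis=1)]
print(odd_years)
```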

Oh btw, we only have the Oscars for Best Picture, Best Director, Best Actor, Best Actress, Best Supporting Actor/Actress, Adapted Screenplay and Original Screenplay ([see](http://oscar.go.com/nominees/writing-adapted-screenplay)).

Also worth taking into account: our targets are heavily imbalanced, with only 17 positives out of 1183 rows (about 1.4%) per category. Of course, out of all the movies coming out every year, only one can win each Oscar category!

I think this is all for the EDA for now. Next is Data Preprocessing.