# Exploratory Data Analysis

### Setup
Load the relevant packages.

In [153]:
import pandas as pd
from cleaning_tools import flatten, one_hot

# print floats more nicely
pd.set_option('display.float_format', '{:.2f}'.format)
# pd.set_option('display.float_format', lambda x: f'%.{len(str(x%1))-2}f' % x)

Read in data.

In [154]:
%store -r raw_data

### Data Description
We look at the number of movies in the data, number of columns, and data types.

In [155]:
# number of observations and columns
raw_data.shape

(5000, 23)

In [156]:
# information on columns
raw_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 23 columns):
 #   Column                Non-Null Count  Dtype         
---  ------                --------------  -----         
 0   adult                 5000 non-null   bool          
 1   budget                5000 non-null   int64         
 2   genres                5000 non-null   object        
 3   id                    5000 non-null   int64         
 4   imdb_id               4938 non-null   object        
 5   original_language     5000 non-null   object        
 6   original_title        5000 non-null   object        
 7   overview              4963 non-null   object        
 8   popularity            5000 non-null   float64       
 9   production_companies  5000 non-null   object        
 10  production_countries  5000 non-null   object        
 11  release_date          4980 non-null   datetime64[ns]
 12  revenue               5000 non-null   int64         
 13  runtime           

### Summary Statistics
We look at summary statistics for the numerical columns `budget`, `revenue`, `runtime`, `vote_average`, and `vote_count`.

We exclude `popularity` because it is a metric specific to the TMDB website, which has a far smaller user base than, e.g., IMDb or Letterboxd.

In [157]:
financial_cols = ["budget", "revenue"]
vote_cols = ["vote_average", "vote_count"] 

raw_data[financial_cols].describe()

             budget        revenue
count       5000.00        5000.00
mean    28183016.52    95756509.60
std     47092793.24   198856725.48
min            0.00           0.00
25%            0.00           0.00
50%      5500000.00    11251443.50
75%     35000000.00   104419265.00
max    380000000.00  2847246203.00


In [158]:
raw_data[vote_cols].describe()

       vote_average  vote_count
count       5000.00     5000.00
mean           6.50     2195.85
std            1.24     3397.82
min            0.00        0.00
25%            6.00      186.00
50%            6.60      867.50
75%            7.20     2707.25
max           10.00    31119.00


In [159]:
raw_data["runtime"].describe()

count   4999.00
mean     100.66
std       28.45
min        0.00
25%       90.00
50%      100.00
75%      115.00
max      248.00
Name: runtime, dtype: float64

### Distribution of Genres

In [165]:
genres_flat = flatten(raw_data, "genres")

In [168]:
genres_flat["genres_name"].value_counts()

Action             1637
Drama              1549
Comedy             1462
Thriller           1272
Adventure          1150
Animation           982
Family              855
Fantasy             829
Science Fiction     792
Horror              783
Romance             629
Crime               570
Mystery             396
History             170
War                 138
Music               113
TV Movie             92
Documentary          87
Western              51
Name: genres_name, dtype: int64
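
The `flatten` helper comes from the local `cleaning_tools` module, whose source is not shown here. As a rough illustration of the kind of reshaping the counts above rely on, the sketch below assumes that each `genres` cell holds a list of dictionaries with `id` and `name` keys and that the result has one row per movie-genre pair. The function name `flatten_sketch` and the toy data are hypothetical, and this is not necessarily how `flatten` is implemented.

```python
import pandas as pd

# Hypothetical stand-in for raw_data: each genres cell is a list of dicts.
toy = pd.DataFrame({
    "id": [1, 2, 3],
    "genres": [
        [{"id": 28, "name": "Action"}, {"id": 18, "name": "Drama"}],
        [{"id": 18, "name": "Drama"}],
        [],  # a movie with no genres listed
    ],
})

def flatten_sketch(df, col):
    """Return one row per (movie, list element), with dict keys as `<col>_<key>` columns."""
    # explode() gives each list element its own row; empty lists become NaN and are dropped.
    exploded = df[["id", col]].explode(col).dropna(subset=[col]).reset_index(drop=True)
    # Turn the remaining dicts into columns and prefix them with the source column name.
    expanded = pd.json_normalize(exploded[col].tolist()).add_prefix(f"{col}_")
    return pd.concat([exploded[["id"]], expanded], axis=1)

toy_flat = flatten_sketch(toy, "genres")
genre_counts = toy_flat["genres_name"].value_counts()
```

Because a movie can carry several genres, counts computed this way sum to more than the number of movies.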

### Award for Most Hardworking
For some models, it may be preferable to recategorise cast and crew data into the 10 most frequent values versus everything else. For example, cast members could be grouped into the 10 most commonly occurring actors plus one additional category for all others.
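
One way to do this kind of recategorisation is sketched below. It assumes a flattened table with one row per movie-actor pair; the column names `cast_name` and `cast_group`, the helper `top_n_or_other`, and the toy data are all hypothetical, and `n=2` is used only to keep the example small.

```python
import pandas as pd

# Hypothetical flattened cast table: one row per (movie, actor) pair.
cast_flat = pd.DataFrame({
    "id": [1, 1, 2, 2, 3, 3, 4],
    "cast_name": ["A", "B", "A", "C", "A", "B", "D"],
})

def top_n_or_other(series, n=10, other="Other"):
    """Keep the n most frequent values and replace everything else with `other`."""
    # value_counts() sorts by frequency, so the first n index entries are the top n.
    top = series.value_counts().head(n).index
    return series.where(series.isin(top), other)

cast_flat["cast_group"] = top_n_or_other(cast_flat["cast_name"], n=2)
```

The new column could then be passed to a one-hot encoder, giving at most `n + 1` indicator columns instead of one per actor. Ties at the cut-off are broken by whatever order `value_counts` returns.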

Here we see who the most 'hardworking' cast and crew members are on a 

### Unoriginality of Movies
A common complaint is that movies are becoming increasingly similar and that studios are releasing more and more remakes. Are movies actually getting more similar?

Let's look at the distribution and similarity of keywords and taglines in our dataset. 
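
The text does not commit to a particular similarity measure. As one possible starting point, the sketch below uses Jaccard similarity (the size of the intersection divided by the size of the union) on per-movie keyword sets. The movie titles and keywords are made up for illustration, and the `jaccard` helper is an assumption rather than something taken from the dataset or from `cleaning_tools`.

```python
from itertools import combinations

# Hypothetical keyword sets per movie, e.g. built from a flattened keywords table.
keywords = {
    "Movie A": {"heist", "revenge", "sequel"},
    "Movie B": {"heist", "sequel", "remake"},
    "Movie C": {"romance", "paris"},
}

def jaccard(a, b):
    """Jaccard similarity of two sets: |a & b| / |a | b|, defined here as 0.0 for two empty sets."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

# Similarity for every unordered pair of movies.
pairwise = {
    (m1, m2): jaccard(keywords[m1], keywords[m2])
    for m1, m2 in combinations(keywords, 2)
}
```

Averaging such pairwise scores within each release year would give one rough indicator of whether keyword overlap rises over time, though computing all pairs grows quadratically with the number of movies.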