# Finding a Model

We start with imports and reading data.

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, MultiLabelBinarizer, FunctionTransformer
from sklearn.ensemble import RandomForestRegressor
from CustomExtractors import extract_numbers, extract_rating

m_cast = pd.read_csv('./data/Movie Cast.csv')
m_data = pd.read_csv('./data/Movie Data.csv')

print(m_cast.shape)
m_cast.head()

(203333, 4)


Unnamed: 0,Movie,Type,Name,Role
0,"10,000 B.C.",Actors,Steven Strait,D'Leh
1,"10,000 B.C.",Actors,Camilla Belle,Evolet
2,"10,000 B.C.",Actors,Cliff Curtis,Tic-Tic
3,"10,000 B.C.",Actors,Reece Ritchie,Moha
4,"10,000 B.C.",Actors,Marco Khan,One Eye


In [2]:
print(m_data.shape)
m_data.head()

(5691, 16)


Unnamed: 0,Movie,Budget (thousands of $),Domestic Box Office Revenue (thousands of $),International Box Office Revenue (thousands of $),MPAA Rating,Running time,Franchise,Original source,Genre,Production method,Type,Production companies,Production country,Languages,Distributor,Release year
0,10 Questions for the Dalai Lama,,224.5,260.0,Not Rated,,,Real Life Events,Documentary,Live Action,Factual,,United States,English,Monterey Media,2007
1,10th & Wolf,8000.0,54.7,89.1,Not Rated,,,Real Life Events,Drama,Live Action,Dramatization,,United States,English,ThinkFilm,2006
2,2006 Academy Award Nominated Short Films,,335.1,,Not Rated,,Academy Award Short Film Nominations,Compilation,Thriller,Multiple Production Methods,Multiple Creative Types,,United States,English,Magnolia Pictures,2007
3,24 Hour Party People,,1169.0,2435.9,"R for strong language, drug use and sexuality",,,Real Life Events,Drama,Live Action,Dramatization,,United States,English,MGM,2002
4,39 Pounds of Love,,28.1,2.1,Not Rated,,,Real Life Events,Documentary,Live Action,Factual,,United States,English,Balcony Releasing,2005


So I have a list of Staff with over 200k rows and also a list of movies with almost 6k rows. I need to combine this data somehow, including the staff into the data of each movie.

The Target values would be **Total Box Office Revenue**, a sum of **Domestic** and **International Box Office Revenue**. This makes it a **Regression** problem.

The models that we can try to approximate to data are:
* Gradient descent (SGDRegressor & Lasso)
* Random Forest
* Boosted Trees

In any case, before implementing the model we need to do some feature engineering, as there's a lot of missing data and columns which will clearly not have any effect (like **Role**).

Finally, it would be smart to create some Dummy predictions so we can compare our models to it.

But before anything, we need to split our data intro train/test, to avoid any data leakage when doing Feat. engineering.

## Train/Test split

We are not gonna split the `m_cast` data as it is just an extension for `m_data`. We already will select some movies to train the model on, and then we will take the data from the Cast & Crew of only the movies we've selected for training.

In [3]:
target_cols = [
    'Domestic Box Office Revenue (thousands of $)', 
    'International Box Office Revenue (thousands of $)'
]
clean_target_cols = m_data[target_cols].fillna(0)
y = clean_target_cols.iloc[:,0] + clean_target_cols.iloc[:,1]
y

0        484.500
1        143.800
2        335.100
3       3604.900
4         30.200
          ...   
5686    1259.500
5687      40.300
5688      77.131
5689      64.600
5690    1944.300
Length: 5691, dtype: float64

In [4]:
X = m_data.drop(target_cols, axis=1)

# Divide data into training and test subsets
X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    train_size=0.8,
    test_size=0.2,
    random_state=0
)

## Feature engineering

Let's check the data from each column to see how to approach the missing data.

In [5]:
nans = X.shape[0] - X.dropna().shape[0]
print ("%d rows have missing values in the train data" %nans)
X.isnull().sum()

4993 rows have missing values in the train data


Movie                         0
Budget (thousands of $)    2166
MPAA Rating                   6
Running time               1035
Franchise                  4751
Original source             170
Genre                        11
Production method           105
Type                        226
Production companies       1812
Production country            6
Languages                    30
Distributor                  58
Release year                  0
dtype: int64

We see we can drop "Franchise" as it has too many rows without data.

Then there is some data that will not be known previous to release, like "Release year", "Distributor" or "MPAA Rating" but they can be estimated (in case of Release and MPAA Rating) or, in the case of the Distributor, it can be inputed to check which one will give us more profits.

### Excluded cols

In [6]:
dropped_cols = ['Franchise']

### Budget & Running time

In [7]:
X.dtypes

Movie                       object
Budget (thousands of $)    float64
MPAA Rating                 object
Running time                object
Franchise                   object
Original source             object
Genre                       object
Production method           object
Type                        object
Production companies        object
Production country          object
Languages                   object
Distributor                 object
Release year                 int64
dtype: object

To work with NaN, in the case of Running time and Budget we should use the mean or median, but Running time is saved as a string, so we should change it to float (in minutes).

In [8]:
X['Running time'].unique()

array([nan, '101 minutes', '100 minutes', '80 minutes', '90 minutes',
       '107 minutes', '102 minutes', '140 minutes', '89 minutes',
       '127 minutes', '95 minutes', '92 minutes', '119 minutes',
       '110 minutes', '96 minutes', '109 minutes', '137 minutes',
       '132 minutes', '105 minutes', '93 minutes', '124 minutes',
       '70 minutes', '87 minutes', '99 minutes', '117 minutes',
       '86 minutes', '82 minutes', '103 minutes', '79 minutes',
       '88 minutes', '150 minutes', '125 minutes', '135 minutes',
       '115 minutes', '133 minutes', '71 minutes', '114 minutes',
       '97 minutes', '94 minutes', '116 minutes', '120 minutes',
       '129 minutes', '81 minutes', '112 minutes', '108 minutes',
       '118 minutes', '128 minutes', '84 minutes', '104 minutes',
       '40 minutes', '98 minutes', '158 minutes', '146 minutes',
       '91 minutes', '136 minutes', '190 minutes', '83 minutes',
       '111 minutes', '138 minutes', '121 minutes', '164 minutes',
       '123 m

In [9]:
num_extract_cols = ['Running time']

# Preprocessing for numerical data
number_extractor_transformer = Pipeline(steps=[
    ('extractor', FunctionTransformer(extract_numbers)),
    ('imputer', SimpleImputer(strategy='median'))
])

num_cols = ['Budget (thousands of $)', 'Release year']

# Preprocessing for numerical data
numerical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median'))
])

## Rating

Now let's check string columns, starting with Rating. This is a perfect candidate for **ordinal encoding**. Let's check which values does it have.

In [10]:
rating_col = ['MPAA Rating']

X[rating_col[0]].unique()

array(['Not Rated', 'R for strong language, drug use and sexuality',
       'PG for thematic material, sensuality and language. ', ...,
       'R for sequences of strong violence, language, some drug use and sexuality',
       'R for pervasive strong violence and some sexual content',
       'PG-13 for mature thematic material, sexual content and a rude gesture'],
      dtype=object)

We just need the first string in the column data, the rest is unnecessary.

In [11]:
X[rating_col[0]].str.extract('(Not Rated|[^\s]+)')[0].value_counts()

0
R            2120
PG-13        1765
Not Rated     937
PG            718
G              99
G(Rating       32
NC-17           8
Open            2
GP              2
PGPG            1
M/PG            1
Name: count, dtype: int64

We find that we have cases besides the 5 current ratings (G, PG, PG-13, R, NC-17). With a bit of research we find that **GP** became PG. M was the previous iteration of PG, that's why *Z (1969)* (a great movie) has the **M/PG** rating, which should be transformed to PG. **PGPG** is a typo, as the movie is *E.T.*. The **G(Rating** ones just have the following format: *G(Rating bulletin XXXX, MM/DD/YYYY)*, so they should be G.

Now, the two movies with **Open** have a bit more interesting stories. They were both released without rating as the directors disagreed with the rating given by the MPAA, which in both cases were NC-17, as it would've commercially killed the movies. For our model, we will be fair with the directors and give them both R ratings.

For the NaN cases, we'll make them Not Rated.

In [12]:
ratings_categories = [
    'Not Rated',
    'G',
    'PG',
    'PG-13',
    'R',
    'NC-17'
];

ordinal_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='Not Rated')),
    ('extractor', FunctionTransformer(extract_rating)),
    ('encoder', OrdinalEncoder(categories=ratings_categories))
])

### Other categorical data

In [13]:
for col in X.columns[5:-1]:
    print(X[col].value_counts())

Original source
Original Screenplay           2902
Real Life Events               864
Fiction Book/Short Story       857
Remake                         211
Comic/Graphic Novel            155
Factual Book/Article           136
TV                             120
Play                            69
Folk Tale/Legend/Fairytale      40
Game                            35
Spin-Off                        26
Movie                           19
Short Film                      18
Musical or Opera                15
Toy                             15
Religious Text                  14
Compilation                     11
Theme Park Ride                  7
Musical Group                    3
Web Series                       2
Ballet                           1
Song                             1
Name: count, dtype: int64
Genre
Drama              1743
Comedy              931
Thriller            616
Adventure           572
Action              545
Documentary         475
Horror              323
Romantic Comed

Looks like **Languages**, **Production country** and **Production companies** need some O-H Encoding with the list of data they have. Distributor is complicated, because there are 431 different data! How to categorize it?

#### Simple categorical columns

In [14]:
categorical_cols_simple = ['Original source', 'Genre', 'Production method', 'Type']

simple_categorical_cols_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('encoder', OneHotEncoder(handle_unknown='ignore'))
])

#### MultiLabel columns

In [15]:
categorical_cols_multilabel = ['Production companies', 'Production country', 'Languages']

multilabel_categorical_cols_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('encoder', MultiLabelBinarizer())
])

## Model

In [16]:
# Preprocessor
preprocessor = ColumnTransformer(
    transformers=[
        ('num_extract', number_extractor_transformer, num_extract_cols),
        ('num', numerical_transformer, num_cols),
        ('rating', ordinal_transformer, rating_col),
        ('cat_simp', simple_categorical_cols_transformer, categorical_cols_simple),
        ('cat_multi', multilabel_categorical_cols_transformer, categorical_cols_multilabel)
    ])

# Model
model = RandomForestRegressor(n_estimators=100, random_state=0)

## First try

In [17]:
my_pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                              ('model', model)
                             ])

my_pipeline.fit(X_train, y_train)

[['R for language, some strong violence, and drug content']
 ['R for strong sexual content, nudity, language and brief drug use']
 ['PG for language, sexual situations, and some thematic material including partying (after editing, originally rated PG-13)']
 ...
 ['PG-13 for some sexuality and violence']
 ['PG for some rude humor, mild language and brief smoking(Rating bulletin 2103, 1/13/2010)']
 ['R for strong fantasy horror violence and gore, brief sexuality/nudity and language']]


ValueError: 2