# Table of Contents

* [Introduction](#Introduction)
* [1. EDA & Data Transformation](#1.-EDA-&-Data-Transformation)
  * [1.1. Checking missing values](#1.1.-Checking-missing-values)
  * [1.2. Measurement Scales](#1.2.-Measurement-Scales)
  * [1.3. Nominal Data](#1.3.-Nominal-Data)
    * [1.3.1. Dummifying](#1.3.1.-Dummifying)
    * [1.3.2. Binarizaiton](#1.3.2.-Binarizaiton)
    * [1.3.3. Counts](#1.3.3.-Counts)
    * [1.3.4. Date Extraction](#1.3.4.-Date-Extraction)
  * [1.4. Nominal Data Transformation](#1.4.-Nominal-Data-Transformation)
    * [1.4.1. Feature Selector](#1.4.1.-Feature-Selector)
    * [1.4.2. Dictionary Vectorizer](#1.4.2.-Dictionary-Vectorizer)
    * [1.4.3. Top Features](#1.4.3.-Top-Features)
    * [1.4.4. Sum Transformer](#1.4.4.-Sum-Transformer)
    * [1.4.5. Binarizer](#1.4.5.-Binarizer)
    * [1.4.6. Date Transformer](#1.4.6.-Date-Transformer)
    * [1.4.7. Item Counter](#1.4.7.-Item-Counter)
  * [1.5. Numerical data](#1.5.-Numerical-data)
* [2. Building a Pipeline](#2.-Building-a-Pipeline)
  * [2.1. Feature Union](#2.1.-Feature-Union)
  * [2.2. Model Selection](#2.2.-Model-Selection)
* [Conclusion](#Conclusion)
* [References](#References)

# Introduction

In this kernel, I'll focus on feature engineering using `sklearn pipelines`. 

What are `pipelines`? In short, `pipelines` organize your transformers in a manageable, linear way. I like to think of each `pipeline` as a list of step-by-step instructions for transforming your data. You can find more information in the [sklearn.Pipeline](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html) documentation.

I noticed that once you get used to them, you can deal with data imputation and transformation quickly and easily. Moreover, they help prevent data leakage, and you only write each transformer once: you fit it on the training dataset and then reuse the fitted transformer on the test set.
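As a minimal sketch of that idea (the synthetic data and the `StandardScaler` + `Ridge` steps are assumptions chosen for illustration, not the pipeline built later in this kernel), the scaler below learns its statistics from the training split only, and the same fitted scaler is reused on the test split:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data, purely illustrative.
rng = np.random.RandomState(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=200)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

pipe = make_pipeline(StandardScaler(), Ridge(alpha=1.0))
pipe.fit(X_tr, y_tr)            # scaler statistics come from X_tr only
test_pred = pipe.predict(X_te)  # the already fitted scaler is applied to X_te

scaler = pipe.named_steps['standardscaler']
```

Because the test split never takes part in `fit`, nothing about it can leak into the scaling step.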

Here are some resources I find useful to explain what `pipelines` are and how to use them:
* [Kevin Goetsch - Deploying Machine Learning using sklearn pipelines](https://www.youtube.com/watch?v=URdnFlZnlaE)
* [Julie Michelman - Pandas, Pipelines, and Custom Transformers](https://www.youtube.com/watch?v=BFaadIqWlAg)

As this is my first data science project, I'm aiming to establish a clear and consistent workflow for future projects, so this notebook is mostly directed towards beginners looking for inspiration or a reference.

<br/><br/>

**The goal of the project is to predict revenue of a movie using TMDB 5000 Movie Dataset.**

<br/><br/>

Let's start with imports and train/test split!

In [None]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline, make_union
import seaborn as sns
import matplotlib.pyplot as plt

import random
random.seed(42)
np.random.seed(42)  # train_test_split uses NumPy's global random state when random_state is unset

In [None]:
credits = pd.read_csv('../input/tmdb_5000_credits.csv', index_col='movie_id')
movies = pd.read_csv('../input/tmdb_5000_movies.csv', index_col='id')

data = pd.merge(movies, credits)
print(data.shape)

data = data.loc[data['revenue'] != 0]
data = data.dropna(subset=['revenue'])  # calling dropna on the selected column alone would leave `data` unchanged
print(data.shape)


X_train, X_test, y_train, y_test = train_test_split(data.drop(['revenue'], axis=1), data['revenue']) 

Exploratory = X_train.copy() # I'm using the copy of the data (not the view!) just in case, not to mess with the original dataset.

I'm dropping observations where the target (*revenue*) is zero or missing, so I'm left only with data I can use for predictions.
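The filtering step can be sketched on a toy frame. The `toy` data below is made up; only the column name `revenue` comes from the dataset:

```python
import numpy as np
import pandas as pd

# Hypothetical four-row frame standing in for the merged dataset.
toy = pd.DataFrame({'title': ['a', 'b', 'c', 'd'],
                    'revenue': [100.0, 0.0, np.nan, 250.0]})

# Keep only rows whose revenue is present and non-zero.
usable = toy.dropna(subset=['revenue'])
usable = usable.loc[usable['revenue'] != 0]
```

Note that `dropna(subset=...)` returns a new frame, so the result has to be assigned back.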

# 1. EDA & Data Transformation

Exploratory Data Analysis (EDA) and Data Transformation are often thought of as a cycle (EDA &rarr; Transform &rarr; EDA, etc.). I'll start with **checking missing values**, then transform the nominal data, and later do EDA on the numeric data. As the nominal data in this dataset is quite entangled (after extraction it resembles a text corpus), I'll omit the EDA for that part.

## 1.1. Checking missing values

In [None]:
nan_percent = Exploratory.isna().mean()*100
nan_count = Exploratory.isna().sum()
pd.concat([nan_count.rename('missing_count'), nan_percent.round().rename('missing_percent')], axis=1)

Luckily, only two features have any missing values.

Let's consider whether we really need all the features we have so far. I believe the *title* won't tell us much about future revenue, and *keywords* already covers much of what *overview* and *tagline* would tell us, so let's get rid of those.

In [None]:
columns_to_drop = ['original_title', 'overview', 'tagline', 'title']
# original_title/title - not informative
# overview/tagline - similar features may be found in 'keywords'

Exploratory = Exploratory.drop(columns_to_drop, axis=1)

Since I find *tagline* uninformative and dropped it completely, *homepage* is the only remaining feature with missing values. As about 60% of its values are missing, we can later binarize this column according to whether or not a movie has a homepage (**True** if a movie has a homepage, **False** if the homepage is missing).

In [None]:
dtypes_description = pd.Series(['ratio', 'nominal', 'nominal', 'nominal', 'nominal', 'ratio', 'nominal', 'nominal', \
                     'interval', 'ratio', 'nominal', 'nominal', 'ratio', 'ratio', 'nominal', 'nominal'], \
                     index=Exploratory.dtypes.index)

pd.concat([Exploratory.dtypes.rename('dtype'), Exploratory.iloc[420].rename('example'), dtypes_description.rename('description')], axis=1)

## 1.2. Measurement Scales

The overview above, built around a single example row, shows us what kind of data we are dealing with. Seven columns hold lists of dictionaries (genres, keywords, production_companies, production_countries, spoken_languages, cast, and crew). We also have five numerical columns (budget, popularity, runtime, vote_average, and vote_count) and other string columns, which are stored with the `object` dtype: original_language, release_date, and status.

You probably noticed I labeled each column with its type. Each label refers to a measurement scale. In short, these scales describe what kind of information the values carry, where:
* **ratio** - a numerical scale with an absolute zero, for example, age;
* **interval** - also a numerical scale, but without an absolute zero, as is the case for the Fahrenheit scale (for temperature, Kelvin would be a ratio scale);
* **ordinal** - not present in our data set; refers to measurements you can put in order, but without a meaningful quantitative difference between adjacent values;
* **nominal** - categories that are only labels, with no order or magnitude, for example, city names.

You can find a more detailed overview of measurement scales in the [Multivariate Data Analysis](https://www.pearson.com/us/higher-education/program/Hair-Multivariate-Data-Analysis-7th-Edition/PGM263675.html) book.
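The scale of a column usually decides how it gets encoded. Here is a small sketch under stated assumptions: the `budget_tier` column is hypothetical (this dataset has no ordinal feature) and the values are made up.

```python
import pandas as pd

toy = pd.DataFrame({
    'budget': [1_000_000, 250_000, 5_000_000],   # ratio: usable as a number
    'budget_tier': ['low', 'high', 'medium'],    # ordinal (hypothetical column)
    'language': ['en', 'fr', 'en'],              # nominal
})

# Ordinal: preserve the order by mapping categories to ordered integer codes.
order = ['low', 'medium', 'high']
toy['tier_code'] = pd.Categorical(toy['budget_tier'], categories=order, ordered=True).codes

# Nominal: there is no order, so use one indicator column per category.
language_dummies = pd.get_dummies(toy['language'], prefix='lang')
```

Encoding a nominal column with integer codes would suggest an order that does not exist, which is why the nominal columns in this kernel are dummified instead.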

## 1.3. Nominal Data

Let's now try to come up with a plan to deal with our nominal data. As the data comes in several different forms, we have a wide range of options for dealing with it.

In [None]:
Exploratory[['genres', 'spoken_languages', 'crew']].head()

### 1.3.1. Dummifying

For most of the data coming in the form of a list of dictionaries, we'll simply extract the fields that interest us and dummify them. In some cases, to avoid overly sparse data, we'll keep only some fraction of the most frequent values.

**Columns to dummify**:

*genres, keywords, production_companies, production_countries, crew*
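To make the idea concrete, here is a plain-Python sketch of dummifying a list-of-dictionaries column. The two rows are made up to mimic the shape of the `genres` column, and parsing with `ast.literal_eval` is an assumption about how the strings can be read:

```python
import ast

# Two made-up rows in the same shape as the `genres` column.
rows = [
    '[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}]',
    '[{"id": 12, "name": "Adventure"}]',
]

# Parse each string safely and pull out the field of interest.
names_per_row = [[d['name'] for d in ast.literal_eval(r)] for r in rows]

# One indicator column per distinct value.
vocabulary = sorted({n for names in names_per_row for n in names})
dummies = [[int(v in names) for v in vocabulary] for names in names_per_row]
```

With many distinct values (for example thousands of keywords) the resulting table gets wide and mostly zero, which is the motivation for keeping only the most frequent values.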

In [None]:
Exploratory[['homepage', 'original_language', 'status']].head()

### 1.3.2. Binarizaiton

Here we'll simply binarize the data: the column gets the label `True` or `False` (or `1` or `0`) depending on a given condition.

**Columns to binarize**:

*homepage, original_language, status, spoken_languages*

In [None]:
Exploratory['cast'].head().to_frame()

### 1.3.3. Counts

As an example, we'll count how many popular actors (having most appearances) are cast in a movie. Perhaps the more of them playing in one movie, the higher is the revenue...
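A rough sketch of that count in plain Python follows. The cast lists and the cut-off of two "popular" actors are invented for illustration:

```python
from collections import Counter

# Made-up cast lists; in the kernel these come from the `cast` column.
casts = [
    ['Actor A', 'Actor B'],
    ['Actor A', 'Actor C'],
    ['Actor A', 'Actor B', 'Actor D'],
]

# Count appearances across all movies and keep the most frequent actors.
appearances = Counter(name for cast in casts for name in cast)
top_actors = {name for name, _ in appearances.most_common(2)}

# Number of "popular" actors in each movie.
top_cast_count = [sum(name in top_actors for name in cast) for cast in casts]
```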

In [None]:
Exploratory['release_date'].head().to_frame()

### 1.3.4. Date Extraction

In this data set, *release_date* comes as a string. It will probably be better to extract the year, month, and weekday from it, and dummify the latter two.
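As a quick illustration of the extraction, here is a vectorized sketch using `pd.to_datetime`. This is an alternative to the `datetime`-based transformer written later in this kernel, not a copy of it, and the two dates are just sample values:

```python
import pandas as pd

dates = pd.Series(['2009-12-10', '2015-06-09'], name='release_date')
parsed = pd.to_datetime(dates, format='%Y-%m-%d')

parts = pd.DataFrame({
    'year': parsed.dt.year,
    'month': parsed.dt.month,        # 1-12
    'weekday': parsed.dt.dayofweek,  # Monday = 0
})
```

Month and weekday are cyclical labels rather than quantities, which is why they are dummified instead of being used as plain numbers.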

## 1.4. Nominal Data Transformation

Let's now get to writing actual transformers to, well, transform the data. First, we need to import certain classes our custom transformers need to inherit from.

In [None]:
from sklearn.base import BaseEstimator, TransformerMixin

### 1.4.1. Feature Selector

This transformer is really straightforward: it takes the name of the column (or a list of column names) we want to extract and returns that part of our Data Frame.

In [None]:
class FeatureSelector(BaseEstimator, TransformerMixin):

    def __init__(self, feature_names):
        self.feature_names = feature_names
        
    def fit(self, X, y=None):
        return self
        
    def transform(self, X):
        return X[self.feature_names]

In [None]:
prod_companies = FeatureSelector('production_companies').fit_transform(Exploratory)
prod_companies.to_frame().head()

### 1.4.2. Dictionary Vectorizer

This one is a bit more complex. Its role is to:
* 1<sup>st</sup> - extract values from dictionaries,
* 2<sup>nd</sup> - join them in one string,
* 3<sup>rd</sup> - dummify it using `sklearn` Count Vectorizer.

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
import ast
import re

def extract_items(list_, key, all_=True):
    # Parse the stringified list of dictionaries without executing arbitrary code.
    items = ast.literal_eval(list_)
    sub = lambda x: re.sub(r'[^A-Za-z0-9]', '_', x)
    if all_:
        # Join every value into one space-separated string.
        return ' '.join(sub(dict_[key].strip()) for dict_ in items)
    elif not items:
        return 'no_data'
    else:
        # Keep only the first item of the list.
        return sub(items[0][key].strip())

class DictionaryVectorizer(BaseEstimator, TransformerMixin):
    
    def __init__(self, key, all_=True):
        self.key = key
        self.all = all_
    
    def fit(self, X, y=None):
        genres = X.apply(lambda x: extract_items(x, self.key, self.all))
        self.vectorizer = CountVectorizer().fit(genres)
        # scikit-learn 1.0+ provides get_feature_names_out() for the same purpose.
        self.columns = self.vectorizer.get_feature_names()
        return self
        
    def transform(self, X):
        # Use the same all_ setting as in fit, so train and test are extracted alike.
        genres = X.apply(lambda x: extract_items(x, self.key, self.all))
        data = self.vectorizer.transform(genres)
        return pd.DataFrame(data.toarray(), columns=self.columns, index=X.index)

In [None]:
prod_companies_vectorized = DictionaryVectorizer('name').fit_transform(prod_companies)
prod_companies_vectorized.head()

### 1.4.3. Top Features

This transformer expects a dummified data set and extracts the most popular features.

In [None]:
class TopFeatures(BaseEstimator, TransformerMixin):
    
    def __init__(self, percent):
        if percent > 100:
            self.percent = 100
        else:
            self.percent = percent
    
    def fit(self, X, y=None):
        counts = X.sum().sort_values(ascending=False)
        index_ = int(counts.shape[0]*self.percent/100)
        self.columns = counts[:index_].index
        return self
    
    def transform(self, X):
        return X[self.columns]

In [None]:
top_companies = TopFeatures(1).fit_transform(prod_companies_vectorized)
top_companies.head()

### 1.4.4. Sum Transformer

Sum Transformer simply computes a sum across given features. We'll use it on our sparse data (after dummification).

In [None]:
class SumTransformer(BaseEstimator, TransformerMixin):
    
    def __init__(self, series_name):
        self.series_name = series_name
    
    def fit(self, X, y=None):
        return self
        
    def transform(self, X):
        return X.sum(axis=1).to_frame(self.series_name)

In [None]:
companies_count = SumTransformer('companies_count').fit_transform(prod_companies_vectorized)
companies_count.head()

### 1.4.5. Binarizer

Binarizer takes as input a function that decides whether to label a value as `True` or `False`.

In [None]:
class Binarizer(BaseEstimator, TransformerMixin):
    
    def __init__(self, condition, name):
        self.condition = condition
        self.name = name
    
    def fit(self, X, y=None):
        return self
    
    def transform(self, X):
        return X.apply(lambda x : int(self.condition(x))).to_frame(self.name)

In [None]:
missing_homepage = Binarizer(lambda x: isinstance(x, float), 'missing_homepage').fit_transform(Exploratory['homepage'])
missing_homepage.head(15)

### 1.4.6. Date Transformer

As mentioned earlier, this transformer takes a date in string format and extracts the values of interest.

In [None]:
from datetime import datetime

def get_year(date):
    return datetime.strptime(date, '%Y-%m-%d').year

def get_month(date):
    return datetime.strptime(date, '%Y-%m-%d').strftime('%b')

def get_weekday(date):
    return datetime.strptime(date, '%Y-%m-%d').strftime('%a')

class DateTransformer(BaseEstimator, TransformerMixin):
    
    def __init__(self):
        pass
    
    def fit(self, X, y=None):
        return self
    
    def transform(self, X):
        year = X.apply(get_year).rename('year')
        month = pd.get_dummies(X.apply(get_month))
        day = pd.get_dummies(X.apply(get_weekday))
        return pd.concat([year, month, day], axis=1)        

In [None]:
date = DateTransformer().fit_transform(Exploratory['release_date'])
date.head()

### 1.4.7. Item Counter

Item Counter counts how many items are in a list.

In [None]:
def get_list_len(list_):
    return len(eval(list_))

class ItemCounter(BaseEstimator, TransformerMixin):
        
    def __init__(self):
        pass
    
    def fit(self, X, y=None):
        return self
        
    def transform(self, X):
        return X.apply(lambda x: int(get_list_len(x)))

In [None]:
language_count = ItemCounter().fit_transform(Exploratory['spoken_languages'])
language_count.head().to_frame('language_count')

## 1.5. Numerical data

Now that we've dealt with nominal data, it is time to take care of numerical data. Thanks to our transformers, we have new numerical columns: *year* and *top_cast_count*.

In [None]:
year = DateTransformer().fit_transform(Exploratory['release_date'])['year']
top_cast_count = make_pipeline(FeatureSelector('cast'), DictionaryVectorizer('name'), 
                               TopFeatures(0.25), SumTransformer('top_cast_count')).fit_transform(Exploratory)

In [None]:
notional_to_numeric = pd.concat([year, top_cast_count], axis=1)
notional_to_numeric.head(15)

Let's check whether we have any abnormal values in our numerical columns.

In [None]:
numeric = pd.concat([Exploratory.select_dtypes(['int64', 'float64']), notional_to_numeric], axis=1)

numeric.hist(figsize=(15,15), bins=25)

Seems like everything looks fine, although the data is skewed.

In [None]:
numeric.corr().style.background_gradient(cmap='coolwarm')

We see we have two features (*popularity* and *vote_count*) that are strongly correlated. Let's take a closer look.

In [None]:
numeric.plot(kind='scatter', x='popularity', y='vote_count')
possible_outliers = Exploratory[Exploratory['popularity'] > 400]

numeric[['popularity', 'vote_count']] = np.log(Exploratory[['popularity', 'vote_count']] + 1)
numeric.plot(kind='scatter', x='popularity', y='vote_count')

We had to take care of heteroscedasticity. Luckily, the log transformation took care of it. Now we have more or less the same variance of residuals across all values.
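To see why `log(x + 1)` helps, here is a tiny sketch on made-up, right-skewed values. `np.log1p` is NumPy's equivalent of `np.log(x + 1)`:

```python
import numpy as np

# Made-up right-skewed values, including a zero.
votes = np.array([0, 10, 100, 1_000, 10_000], dtype=float)

logged = np.log1p(votes)  # same as np.log(votes + 1), and defined at zero

# The range shrinks from four orders of magnitude to under ten units.
spread_before = votes.max() - votes.min()
spread_after = logged.max() - logged.min()
```

Large values are compressed far more than small ones, so the points no longer fan out as the values grow.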

We could also notice some outliers. Let's take a look at observations with popularity higher than 400.

In [None]:
possible_outliers

It seems that we're dealing with huge blockbusters here. I guess there's nothing to worry about in this case.

In [None]:
numeric.corr().style.background_gradient(cmap='coolwarm')

After our transformation *vote_count* and *popularity* are even more correlated. It's time to combine them into one feature. I believe taking their average is good enough.
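The row-wise average itself is a one-liner in pandas. The sketch below uses made-up numbers and compares averaging raw values with averaging log-transformed values. The log variant is an assumption added for comparison, while the transformer defined next simply averages whatever columns it receives:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({'popularity': [10.0, 150.0, 40.0],
                    'vote_count': [200, 12_000, 1_500]})

raw_mean = toy.mean(axis=1)            # dominated by the larger-scale vote_count
log_mean = np.log1p(toy).mean(axis=1)  # both columns contribute on a similar scale
```

When two columns live on very different scales, the raw average mostly mirrors the larger one, which is worth keeping in mind when reading the combined feature.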

In [None]:
class MeanTransformer(BaseEstimator, TransformerMixin):
    
    def __init__(self, name):
        self.name = name
    
    def fit(self, X, y=None):
        return self
    
    def transform(self, X):
        return X.mean(axis=1).to_frame(self.name)

In [None]:
feature_mean = make_pipeline(FeatureSelector(['vote_count', 'popularity']), MeanTransformer('popularity_vote')).fit_transform(Exploratory)
feature_mean.head()

In [None]:
numeric['popularity_vote'] = feature_mean['popularity_vote']  # same name as the transformer's output column
numeric.drop(columns=['popularity', 'vote_count'], inplace=True)

In [None]:
sns.pairplot(numeric)

Now we're left with uncorrelated columns. That's good news as we don't actually count any feature 'twice'.

Maybe we should also transform our target value, so that our features fit it better?

In [None]:
from scipy.stats import pearsonr

transformations = [lambda x: x, np.sqrt, lambda x: np.log(x+1)]
tran_description = [' no transformation', ' sqrt', ' log']
numeric_columns = numeric.columns

fig, axes = plt.subplots(len(numeric_columns), len(transformations), figsize=(20,15))
fig.tight_layout()

for col_idx, col in enumerate(numeric_columns):
    for tran_idx, tran in enumerate(transformations):
        axes[col_idx, tran_idx].scatter(x=numeric[col], y=tran(y_train))
        axes[col_idx, tran_idx].set_xticklabels([])
        axes[col_idx, tran_idx].set_xticks([]) 
        R2 = pearsonr(numeric[col], tran(y_train))[0]**2     
        axes[col_idx, tran_idx].title.set_text(f'{col}, {tran_description[tran_idx]} \n R2 coefficient: {R2:.2f}')
               
plt.show()

Looks like it's not worth the hassle, as we get similar or lower R2 scores for the transformed target.
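The R2 values in the plot titles are squared Pearson correlations. For a single feature, this equals the R2 of a least-squares line fitted to the same points, which the following sketch checks on synthetic data (the data and the noise level are made up):

```python
import numpy as np

rng = np.random.RandomState(0)
x = rng.uniform(1, 100, size=200)
y = 3.0 * x + rng.normal(scale=20.0, size=200)

# Squared Pearson correlation.
r = np.corrcoef(x, y)[0, 1]
r_squared = r ** 2

# R2 of a least-squares line fitted to the same data.
slope, intercept = np.polyfit(x, y, deg=1)
residuals = y - (slope * x + intercept)
r2_fit = 1 - residuals.var() / y.var()
```

This only holds feature by feature; it says nothing about how well all features predict the target together.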

# 2. Building a Pipeline

Now that we've dealt with transformers, it's time to combine them into a `pipeline`. We'll apply our transformers to selected columns and combine the transformed data into one `Data Frame`.

## 2.1. Feature Union

To combine the data, we need a class to do this for us. Unfortunately, `sklearn` doesn't provide one that works with `Pandas` out of the box, as we would expect. `sklearn` Feature Union takes a `Pandas` Data Frame as input but returns a `numpy` array, and we would like to get a `Pandas` Data Frame as output as well. To do this, we need to subclass Feature Union and override some of its methods, reusing parts of the `sklearn` source code, so it works as intended. Luckily, someone has already done that for us.

To learn more you can read this [blog post](https://zablo.net/blog/post/pandas-dataframe-in-scikit-learn-feature-union/) by Marcin Zabłocki, along with the [source code](https://github.com/marrrcin/pandas-feature-union/blob/master/pandas_feature_union.py).
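The behaviour described above is easy to reproduce. In this sketch the toy frame and the `pick` helper are hypothetical, and default `sklearn` output settings are assumed. The stock Feature Union stacks the pieces into a plain array, while `pd.concat` keeps the labels:

```python
import numpy as np
import pandas as pd
from sklearn.pipeline import FeatureUnion
from sklearn.preprocessing import FunctionTransformer

toy = pd.DataFrame({'budget': [1.0, 2.0], 'runtime': [90.0, 120.0]})

def pick(column):
    # Hypothetical helper: a stateless transformer returning a one-column DataFrame.
    return FunctionTransformer(lambda df: df[[column]], validate=False)

union = FeatureUnion([('budget', pick('budget')), ('runtime', pick('runtime'))])
stacked = union.fit_transform(toy)  # column names and index are lost here

# Concatenating the same pieces with pandas keeps the labels.
labelled = pd.concat([toy[['budget']], toy[['runtime']]], axis='columns')
```

The subclass below replaces the stacking step with exactly this kind of `pd.concat` call.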

In [None]:
from joblib import Parallel, delayed  # sklearn.externals.joblib was removed in scikit-learn 0.23
from sklearn.pipeline import FeatureUnion, _fit_transform_one, _transform_one, _name_estimators
from scipy import sparse

import warnings
warnings.filterwarnings('ignore')

class PandasFeatureUnion(FeatureUnion):
    def fit_transform(self, X, y=None, **fit_params):
        self._validate_transformers()
        result = Parallel(n_jobs=self.n_jobs)(
            delayed(_fit_transform_one)(
                transformer=trans,
                X=X,
                y=y,
                weight=weight,
                **fit_params)
            for name, trans, weight in self._iter())

        if not result:
            # All transformers are None
            return np.zeros((X.shape[0], 0))
        Xs, transformers = zip(*result)
        self._update_transformer_list(transformers)
        if any(sparse.issparse(f) for f in Xs):
            Xs = sparse.hstack(Xs).tocsr()
        else:
            Xs = self.merge_dataframes_by_column(Xs)
        return Xs

    def merge_dataframes_by_column(self, Xs):
        return pd.concat(Xs, axis="columns", copy=False)

    def transform(self, X):
        Xs = Parallel(n_jobs=self.n_jobs)(
            delayed(_transform_one)(
                transformer=trans,
                X=X,
                y=None,
                weight=weight)
            for name, trans, weight in self._iter())
        if not Xs:
            # All transformers are None
            return np.zeros((X.shape[0], 0))
        if any(sparse.issparse(f) for f in Xs):
            Xs = sparse.hstack(Xs).tocsr()
        else:
            Xs = self.merge_dataframes_by_column(Xs)
        return Xs
    
def make_union(*transformers, **kwargs):
    n_jobs = kwargs.pop('n_jobs', None)
    verbose = kwargs.pop('verbose', False)
    if kwargs:
        # We do not currently support `transformer_weights` as we may want to
        # change its type spec in make_union
        raise TypeError('Unknown keyword arguments: "{}"'
                        .format(list(kwargs.keys())[0]))
    return PandasFeatureUnion(
        _name_estimators(transformers), n_jobs=n_jobs, verbose=verbose)

Now we just need to apply transformations on each column we intend to work on.

In [None]:
union = make_union(
    make_pipeline(
        FeatureSelector('genres'),
        DictionaryVectorizer('name')
    ),
    make_pipeline(
        FeatureSelector('homepage'),
        Binarizer(lambda x: isinstance(x, float), 'missing_homepage')
    ),
    make_pipeline(
        FeatureSelector('keywords'),
        DictionaryVectorizer('name'),
        TopFeatures(0.5)
    ),
    make_pipeline(
        FeatureSelector('original_language'),
        Binarizer(lambda x: x == 'en', 'en')
    ),
    make_pipeline(
        FeatureSelector('production_companies'),
        DictionaryVectorizer('name'),
        TopFeatures(1)
    ),
    make_pipeline(
        FeatureSelector('production_countries'),
        DictionaryVectorizer('name'),
        TopFeatures(25)
    ),
    make_pipeline(
        FeatureSelector('release_date'),
        DateTransformer()
    ),
    make_pipeline(
        FeatureSelector('spoken_languages'),
        ItemCounter(),
        Binarizer(lambda x: x > 1, 'multilingual')
    ),
    make_pipeline(
        FeatureSelector('status'),
        Binarizer(lambda x: x == 'Released', 'Released')
    ),    
    make_pipeline(
        FeatureSelector('cast'),
        DictionaryVectorizer('name'),
        TopFeatures(0.25),
        SumTransformer('top_cast_count')
    ),
    make_pipeline(
        FeatureSelector('crew'),
        DictionaryVectorizer('name', False),
        TopFeatures(1)
    ),
    make_pipeline(
        FeatureSelector(['budget', 'runtime', 'vote_average'])
    ),
    make_pipeline(
        FeatureSelector(['popularity', 'vote_count']),
        MeanTransformer('popularity_vote')
    )
)

What's left now is to fit our pipeline on our training data and transform both train and test data.

In [None]:
union.fit(X_train)

X_train_T = union.transform(X_train)
X_test_T = union.transform(X_test)

print(X_train_T.shape)
print(X_test_T.shape)

And here is our result:

In [None]:
X_train_T.head()

## 2.2. Model Selection

Now that we have the data set in the form we wanted, let's fit some models and see which performs best. `Sklearn` provides a very useful utility for this purpose, namely Grid Search CV. It searches the parameter grid we provide for the best combination, scoring each candidate with cross-validation.

In [None]:
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import Ridge
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor

First, we import models, and then we set parameters for each model in the form of a dictionary.

**Attention!** You can also put models in `pipelines`, but I find this approach a bit messy, so I decided to implement them separately.
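For reference, this is roughly what the pipeline variant looks like. It is a sketch on synthetic data with an assumed `StandardScaler` step, not the setup used in this kernel:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data, purely illustrative.
rng = np.random.RandomState(0)
X = rng.normal(size=(120, 4))
y = X @ np.array([2.0, 0.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=120)

pipe = Pipeline([('scale', StandardScaler()), ('ridge', Ridge())])

# Parameters of a step are addressed as '<step name>__<parameter>'.
grid = GridSearchCV(pipe, {'ridge__alpha': [0.1, 1.0, 10.0]}, cv=3)
grid.fit(X, y)

best_model = grid.best_estimator_  # refitted on all of X, y with the best parameters
```

The upside of this variant is that preprocessing is refitted inside every cross-validation fold; the downside is the longer parameter names.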

In [None]:
lin_params = dict(alpha=np.logspace(1,7,7), normalize=(False, True))
for_params = dict(n_estimators=np.linspace(10,40,4).astype(int), min_samples_split=(2,3), min_samples_leaf=(1,2,3))
gbr_params = dict(n_estimators=np.linspace(100,300,3).astype(int), min_samples_split=(2,3))

In [None]:
ridge_grid = GridSearchCV(Ridge(random_state=42), lin_params, cv=10)
forest_grid = GridSearchCV(RandomForestRegressor(random_state=42), for_params, cv=10)
gbr_grid = GridSearchCV(GradientBoostingRegressor(random_state=42), gbr_params, cv=10)

In [None]:
ridge_grid.fit(X_train_T, y_train)

In [None]:
forest_grid.fit(X_train_T, y_train)

In [None]:
gbr_grid.fit(X_train_T, y_train)

Now that our models are fit on the training data, let's see which one performs best, along with the best parameters chosen by Grid Search.

In [None]:
print(f'Ridge:\n\t *best params: {ridge_grid.best_params_}\n\t *best score: {ridge_grid.best_score_}')
print(f'Forest:\n\t *best params: {forest_grid.best_params_}\n\t *best score: {forest_grid.best_score_}')
print(f'Gradient Boost:\n\t *best params: {gbr_grid.best_params_}\n\t *best score: {gbr_grid.best_score_}')

Now that we have the best models with the best parameters, let's find out how they perform on test data.

In [None]:
best_ridge = Ridge(alpha=100, normalize=False)
best_forest = RandomForestRegressor(min_samples_leaf=3, min_samples_split=2, n_estimators=40)
best_gbr = GradientBoostingRegressor(min_samples_split=2, n_estimators=300)

In [None]:
from sklearn.metrics import r2_score

In [None]:
best_ridge.fit(X_train_T, y_train)
predicted = best_ridge.predict(X_test_T)

print(f'Ridge test score: {r2_score(y_test, predicted)}')

best_forest.fit(X_train_T, y_train)
predicted = best_forest.predict(X_test_T)

print(f'Random Forest test score: {r2_score(y_test, predicted)}')

best_gbr.fit(X_train_T, y_train)
predicted = best_gbr.predict(X_test_T)

print(f'Gradient Boosted Regressor test score: {r2_score(y_test, predicted)}')

We can also interpret each model by looking at its coefficients.

In [None]:
ridge_coefs_df = pd.DataFrame(dict(score=best_ridge.coef_, column=X_test_T.columns))
ridge_coefs_df.sort_values(['score'], ascending=False).head(10)

In [None]:
print(f'Train target variable mean: ${round(y_train.mean()):,}.')

Top coefficients refer to our dummy variables. How can we interpret this? Basically, each coefficient shows how much the predicted revenue changes when that dummy is set, with all other features held fixed. As the mean of the target variable is counted in hundreds of millions of dollars, no wonder that those values are so high! In addition, Ridge Regression has a regularization term that shrinks those coefficients.
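The shrinking effect of the regularization term can be seen on synthetic data. In the sketch below, the data and the three `alpha` values are arbitrary choices for illustration:

```python
import numpy as np
from sklearn.linear_model import Ridge

# Synthetic data, purely illustrative.
rng = np.random.RandomState(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([5.0, -3.0, 1.0]) + rng.normal(scale=0.5, size=100)

# Fit the same model with increasing penalties and record the coefficient norm.
alphas = [0.01, 10.0, 1000.0]
norms = [np.linalg.norm(Ridge(alpha=a).fit(X, y).coef_) for a in alphas]
```

The larger the `alpha`, the smaller the overall size of the coefficients, so coefficient magnitudes are only comparable between models fitted with the same penalty.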

What about the numerical values?

In [None]:
ridge_coefs_df.loc[136:]

As we can see, these are much lower, especially the *budget* coefficient, which we can interpret as follows: for every dollar invested in a movie, we expect about $1.6 of revenue.

Both Random Forest and Gradient Boosting are ensemble regressors. For these models, the scores (feature importances) add up to 1. We can interpret them as the influence of each feature in predicting the target value.
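A small check of that property on synthetic data, where only the first column carries signal (the data and the forest size are made up for illustration):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic data, purely illustrative.
rng = np.random.RandomState(0)
X = rng.normal(size=(200, 3))
y = 4.0 * X[:, 0] + rng.normal(scale=0.1, size=200)  # only column 0 matters

forest = RandomForestRegressor(n_estimators=20, random_state=0).fit(X, y)
importances = forest.feature_importances_
```

Because the importances are normalized, they describe relative influence within one model and cannot be read as dollar amounts the way Ridge coefficients can.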

In [None]:
pd.DataFrame(dict(score=best_forest.feature_importances_, column=X_test_T.columns)).sort_values(['score'], ascending=False).head(10)

In [None]:
pd.DataFrame(dict(score=best_gbr.feature_importances_, column=X_test_T.columns)).sort_values(['score'], ascending=False).head(10)

What we can conclude is that *popularity_vote* and *budget* are the strongest predictors, while the importance of the other features is almost insignificant.

# Conclusion

`Pipelines` are a useful way to transform and model your data. Used correctly, they can save a lot of unnecessary lines of code and help avoid unexpected issues such as data leakage. I believe that proficiency with them can make the workflow smoother and the code more readable and easier to maintain.

# References
* [PyData Youtube Channel](https://www.youtube.com/user/PyDataTV)
* [Marcin Zabłocki blog](https://zablo.net/)
* [Multivariate Data Analysis - Joseph F. Hair Jr. William C. Black Barry J. Babin Rolph E. Anderson](https://www.pearson.com/us/higher-education/program/Hair-Multivariate-Data-Analysis-7th-Edition/PGM263675.html)
* [Feature Engineering for Machine Learning - Alice Zheng, Amanda Casari](http://shop.oreilly.com/product/0636920049081.do)