# The Comprehensive Machine Learning Pipeline (For Business)

*"Do machine learning like the great engineer you are, not like the great machine learning
expert you aren't"* - [Martin Zinkevich.](http://martin.zinkevich.org/rules_of_ml/rules_of_ml.pdf)

That is because the vast majority of problems you'll face are not related to ML engineering per se, but to common **infrastructure roadblocks** and **business constraints**.

# Before ML

## Get Your Business Intelligence Hours In

The **best way** to present an ML product to a business owner or manager is to show how it can improve a metric they already track without disrupting their core business and infrastructure, *or at least while trying not to*. But how?

0. Understand the problem from the business perspective.
 - What's the objective?
 - What are the metrics used for decision making?
 - What heuristic or decision-making process is already in place for solving this problem?
 - What's the lag between procedure and results?
 - What could be the positive and negative impacts of implementing a model to solve it?
1. Understand business constraints. Problems are rarely univariate.
 - What are the freshness requirements for a model-based approach? For example, a credit scoring model can have an impact on revenue in under a month.
 - What situations can influence the result you're tracking? Maybe economic conditions or some business procedure that is indirectly related to it.
 - What ethical, legal and financial implications must be taken into consideration in a model-based strategy?
2. Discuss how an ML approach can solve the problem.
 - What are the stakeholders' expectations?
 - What's the MVP?
 - Define deliverables.

## Define Judging Metrics Early On

At this stage you should already devise a strategy to track every point of interest defined in the MVP. Ideally, you'll end up with an automated way to judge whether a model is good or bad.
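One way such an automated judgment could look is sketched below. It is only an illustration: the `MetricRule` class, the `judge` and `is_good` helpers, the metric names and every threshold are hypothetical, and the real rules should come from the metrics agreed on with the business.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class MetricRule:
    """One acceptance rule for a candidate model (hypothetical structure)."""

    name: str                       # metric name, e.g. "recall"
    baseline: float                 # value achieved by the current heuristic
    min_gain: float = 0.0           # required improvement over the baseline
    higher_is_better: bool = True   # False for metrics such as latency


def judge(candidate: dict[str, float], rules: list[MetricRule]) -> dict[str, bool]:
    """Return a pass/fail verdict per metric for the candidate model."""
    verdicts = {}
    for rule in rules:
        value = candidate[rule.name]
        # Express the improvement so that a positive gain is always good.
        gain = value - rule.baseline if rule.higher_is_better else rule.baseline - value
        verdicts[rule.name] = gain >= rule.min_gain
    return verdicts


def is_good(candidate: dict[str, float], rules: list[MetricRule]) -> bool:
    """A model is accepted only if it passes every rule."""
    return all(judge(candidate, rules).values())


# Assumed numbers, for illustration only.
rules = [
    MetricRule("recall", baseline=0.70, min_gain=0.05),
    MetricRule("latency_ms", baseline=120.0, higher_is_better=False),
]
candidate_metrics = {"recall": 0.78, "latency_ms": 95.0}
verdicts = judge(candidate_metrics, rules)
```

Keeping the rules as data rather than as scattered `if` statements makes it easier to review them with stakeholders and to rerun the same judgment on every new model.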

# ML Step -1: Data Pipelines

*"Good models require good data. Good production ML systems require good data pipelines."*

What does it mean to have "good" data? Here's a list of quality characteristics to begin with:

- Normalized (like in database normalization)
- Complete
- Consistent
- Accurate
- Timestamped
- [Tidy](http://vita.had.co.nz/papers/tidy-data.pdf)
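
Some of the characteristics above can be checked mechanically. The following is a minimal sketch with pandas, assuming a table that has a unique key column and a timestamp column; the `quality_report` helper, the column names and the sample rows are all made up for illustration, and accuracy or tidiness would still need domain-specific checks.

```python
import pandas as pd


def quality_report(df: pd.DataFrame, key: str, timestamp: str) -> dict:
    """Summarize a few basic data-quality signals for one table."""
    return {
        # Complete: share of missing cells per column.
        "missing_ratio": df.isna().mean().to_dict(),
        # Consistent: the same key should not appear twice.
        "duplicate_keys": int(df[key].duplicated().sum()),
        # Timestamped: count rows whose timestamp cannot be parsed.
        "bad_timestamps": int(
            pd.to_datetime(df[timestamp], errors="coerce").isna().sum()
        ),
    }


# Hypothetical sample with one missing value, one repeated key
# and one unparseable timestamp.
sample = pd.DataFrame(
    {
        "customer_id": [1, 2, 2, 4],
        "income": [3200.0, None, 4100.0, 2800.0],
        "created_at": ["2019-01-05", "2019-01-06", "not a date", "2019-01-09"],
    }
)
report = quality_report(sample, key="customer_id", timestamp="created_at")
```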

What about good data pipelines? Brace yourselves.
<img src="http://2s7gjr373w3x22jf92z99mgm5w-wpengine.netdna-ssl.com/wp-content/uploads/2015/09/pentaho_1.png" />

# ML Step 0: Getting Started

## Basic Pipeline (MVP)

You can think of a **basic** ML product pipeline as a moving assembly line, where data passes from one module to another (data acquisition, pre-processing...) in a sequential manner, and the prediction is the final product.

**Every module should be independent and changeable.** Suppose you're working with a tree-based model and decide it is time to update it to a neural network; this should be easy to do without destroying the underlying pipeline. **This type of quality can only be achieved through *skillful discipline and commitment to fundamentals*.**


<div style="width:100; font-size:80%; text-align:center;"><img src="https://d1ldz4te4covpm.cloudfront.net/graphics/9781788838290/graphics/171dfaaa-00a1-4a62-9dd4-786ed31b412f.png" alt="ML MVP Pipeline" style="padding-bottom:0.5em;" />ML MVP Pipeline</div>
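
The tree-to-neural-network swap described above can be sketched with scikit-learn's `Pipeline`, which is one possible way to keep modules independent, not the only one. The synthetic data, the step names `"scale"` and `"model"`, and the hyperparameters are assumptions chosen to keep the example small.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data stands in for the real data acquisition module.
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

# Each named step is one module of the assembly line.
pipeline = Pipeline(
    steps=[
        ("scale", StandardScaler()),
        ("model", GradientBoostingClassifier(random_state=0)),
    ]
)
pipeline.fit(X, y)
tree_predictions = pipeline.predict(X)

# Swap only the "model" step; the pre-processing module is left untouched.
pipeline.set_params(
    model=MLPClassifier(hidden_layer_sizes=(16,), max_iter=500, random_state=0)
)
pipeline.fit(X, y)
network_predictions = pipeline.predict(X)
```

Because callers only ever use `pipeline.fit` and `pipeline.predict`, nothing downstream has to change when the model step is replaced.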

## Production Pipeline

Chances are the basic pipeline structure won't be good enough for production. You'll need a more robust system, and its overall functionality is outlined in the following picture.

<img src="https://pbs.twimg.com/media/DxnbU_0X0AIeEo_.jpg" />

## Pipeline Dry Run

*"All systems nominal."*

A dry run is a test in which the effects of a possible failure are intentionally mitigated. In our case, it is an end-to-end test that assesses the quality of the pipeline and checks that the results fall within an acceptable range. You should perform this step rigorously and repeatedly before deploying.

**A good pipeline consists of independent modules that smoothly transform instances into predictions.**
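
A dry run along these lines could be sketched as follows. Everything here is a toy stand-in: the `acquire`, `preprocess` and `predict` stages, the fixture rows and the acceptable range of 0 to 1 are assumptions, and no stage writes to an external system, which is what keeps a failure harmless.

```python
from typing import Callable, Iterable

Stage = Callable[[list], list]


def acquire(_: list) -> list:
    # Hypothetical fixture standing in for the real data source.
    return [{"amount": 10.0}, {"amount": 250.0}, {"amount": 40.0}]


def preprocess(rows: list) -> list:
    # Scale the amount into [0, 1] using an assumed cap of 500.
    return [min(row["amount"], 500.0) / 500.0 for row in rows]


def predict(features: list) -> list:
    # Toy stand-in for a trained model: the score is the scaled feature.
    return [round(x, 3) for x in features]


def dry_run(stages: Iterable[Stage], low: float = 0.0, high: float = 1.0) -> list:
    """Run every stage in order and fail loudly on empty or out-of-range output."""
    data: list = []
    for stage in stages:
        data = stage(data)
        if not data:
            raise RuntimeError(f"stage {stage.__name__} produced no output")
    out_of_range = [p for p in data if not low <= p <= high]
    if out_of_range:
        raise ValueError(f"predictions out of range: {out_of_range}")
    return data


predictions = dry_run([acquire, preprocess, predict])
```

Running the same function in CI before each deployment is one way to make the check repeatable.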

# Imports

```python
%load_ext autoreload
%autoreload 2
%matplotlib inline
# Notebook configuration above.

# Data collection.
import numpy as np
import pandas as pd
pd.set_option('display.max_columns', None)

# Data preparation.
from sklearn.preprocessing import LabelEncoder

# Model development.
from sklearn.ensemble import GradientBoostingClassifier

# Model training.
from sklearn.utils import compute_class_weight

# Model evaluation.
from sklearn.metrics import classification_report, confusion_matrix, recall_score

# Data visualization.
import matplotlib.pyplot as plt
import seaborn as sns
```

# Objective


## Metric definition



# Data collection


# Data preparation


# Data split

# Model configuration


# Model training


# Model evaluation


# Model deployment

