In [1]:
# even if you don't use an import until way later on in the file, put your imports here!
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from IPython.display import Image
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.base import BaseEstimator, ClassifierMixin, TransformerMixin
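from sklearn.impute import SimpleImputer
# helpers used by the custom estimator example further down
from sklearn.metrics import euclidean_distances
from sklearn.utils.multiclass import unique_labels
from sklearn.utils.validation import check_X_y, check_array, check_is_fitted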

# SLU19 - Workflow

In this notebook we will be covering the following:

- Workflow
- Workflow Tips
- Pipelines and custom estimators

The goal for this LU is to establish the common steps and tools that you'll use to keep your data science workflow tight and efficient. If you get a new dataset or are starting out with a hackathon and you find yourself asking "where and how do I begin?", this LU is your best friend!

## Why learn workflow?

Because you want to spend your time doing science! Not debugging stupid little things constantly and wasting your time!

Also, Data Science is largely an engineering discipline - you might as well just accept that right now. Writing code is an engineering practice and most data science is done with code these days. The most PURE industry research scientist I've spoken to works at Google's DeepMind lab and he said that he is 50% software engineer! Nailing down a workflow and how to express it in code will make your life an order of magnitude easier. Furthermore, and more importantly, it makes your data science more responsible.

You don't want the following:

<img src="media/xkcd-machine-learning.jpg" width="300" />

Before we get started, I need to take a dig at Jupyter:

## Jupyter is a terrible development environment

Because it isn't one! A development environment is centered around being able to organize your code in an effective way. Jupyter is made primarily for rapid prototyping and communication, not software engineering, so there are going to be significant drawbacks when it comes to organizing your code. You will need to be extremely anal about following best practices because Jupyter won't do any of it for you the way that a real IDE would.

## So why are we using Jupyter?

Because our primary task in this academy is not to teach you how to be software engineers. It's to help you learn how to prototype and communicate as data scientists.

## No Jupyter in production

For the love of god, don't use Jupyter notebooks in production. Write code in real .py files that can be tested, properly tracked, diffed, imported into other code, linted in CI/CD, viewed in any editor, and a million other advantages.

# Workflow

## Step 1: Get the data

In a real live environment, this step could literally take months. It depends on the organization, who guards the data, how well the data itself is known, what format it is in, as well as a million other factors. Throughout the academy we will be largely skipping this step. With the exception of the Data Wrangling Specialization, we will be handing you nice and tidy CSVs that you will be able to bring into your experiment with a single simple function call.

I would love to expand a bit more upon the substeps involved in this step but they vary so much in practice that the only thing I can say for certain is that it will involve a lot of meetings and will likely result in reading from a system that behaves a bit like the following:

<img src="media/xkcd-data-pipeline.png" width="600" />

## Step 2: Data analysis and preparation

This step has some more definitive substeps than the previous. In general you'll hit the following steps:

1. Data analysis
1. Dealing with data problems
1. Feature engineering
1. Feature selection

### 2.1 Data analysis

You've already learned quite a bit about how to do Data Analysis. In SLU01, SLU02, SLU03, SLU04, and SLU05 you have a nice pile of tools that you can use to get a feel for the type of data that you are dealing with. Use them until you feel comfortable enough that you could confidently describe the most important characteristics of the data set you are working with.

<img src="media/xkcd-quality-data-analysis.png" width="600"/>

### 2.2 Dealing with data problems

Your data analysis will certainly uncover data problems. Some of these data problems you may be able to deal with once and others you may need to make part of a pipeline. An example of a data problem that you would deal with once at the beginning of your workflow is changing numbers that are stored as strings in a CSV into actual numbers. An example of something that you might want to put off until later is filling in NaNs, so that you can experiment with imputation strategies.

In any case, the first time someone delivers you a dataset, the experience is likely to be very much like the following:

<img src="media/xkcd-dirty-data.png" width="200"/>
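As a rough sketch of the "deal with it once" kind of fix mentioned above, here is one way to turn numbers stored as strings into actual numbers with `pd.to_numeric`. The `raw` DataFrame and its column names are made up for illustration; your own data will have its own quirks.

```python
import pandas as pd

# Hypothetical raw data: numbers that arrived as strings, with some junk mixed in
raw = pd.DataFrame({
    "age": ["22", "38", "unknown", "35"],
    "fare": ["7.25", "71.28", "8.05", ""],
})

clean = raw.copy()
for col in ["age", "fare"]:
    # errors="coerce" turns anything that can't be parsed into NaN instead of raising
    clean[col] = pd.to_numeric(clean[col], errors="coerce")

print(clean.dtypes)
print(clean.isna().sum())
```

Note that the resulting NaNs are left alone on purpose: filling them is the kind of decision you may want to postpone so that you can try different imputation strategies later.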

### 2.3 Feature engineering

Once you've got some clean data and have a benchmark model as a reference you may want to create some new features out of the existing features. A classic example of this would be to create a debt to income ratio feature for credit risk by simply dividing the debt of a person by some measure of their income.

You will likely iterate on this step several times.
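A minimal sketch of the debt to income example, assuming a hypothetical `loans` DataFrame with `monthly_debt` and `monthly_income` columns (neither comes from a dataset in this notebook):

```python
import numpy as np
import pandas as pd

# Hypothetical credit data, invented for illustration
loans = pd.DataFrame({
    "monthly_debt": [500.0, 1200.0, 300.0],
    "monthly_income": [2500.0, 3000.0, 0.0],
})

# Treat zero income as missing so we get NaN rather than an infinite ratio
income = loans["monthly_income"].replace(0, np.nan)
loans["debt_to_income"] = loans["monthly_debt"] / income

print(loans)
```

How to handle the zero-income case is a modelling decision in itself; turning it into NaN is just one option.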

### 2.4 Feature selection

You can do feature selection in a few different stages. One stage is right at the beginning when you can remove features that you KNOW for sure should not be in there. Examples of these are features that are all unique, all one value, are leakage, or are disallowed by law. Features that you may want to remove at a later stage are those that turn out to be redundant or to have no predictive power.

You will likely iterate on this step several times.
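The "remove it right at the beginning" checks can be partly automated. The sketch below flags columns with a single value and non-float columns where every value is unique. The DataFrame and the "skip floats" heuristic are assumptions made for illustration; leakage and legally disallowed features still need a human to spot them.

```python
import pandas as pd

# Hypothetical data, invented for illustration
df = pd.DataFrame({
    "passenger_id": [1, 2, 3, 4],          # unique per row: an identifier
    "country": ["PT", "PT", "PT", "PT"],   # a single value: carries no information
    "age": [22, 38, 22, 35],
    "fare": [7.25, 71.28, 8.05, 53.10],
})

n_unique = df.nunique()

# Columns where every row has the same value
constant_cols = n_unique[n_unique <= 1].index.tolist()

# Columns where every row is different. Continuous floats are often legitimately
# all unique, so (as a heuristic) only non-float columns are flagged.
id_like_cols = [
    col for col in df.columns
    if n_unique[col] == len(df) and not pd.api.types.is_float_dtype(df[col])
]

df_selected = df.drop(columns=constant_cols + id_like_cols)
print(df_selected.columns.tolist())
```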

## Step 3: Train model

You know the drill here. Based upon the attributes of the problem at hand (binary classification, multi-class classification, supervised, unsupervised, regression, etc.), choose a few different types of models to experiment with. Note that you should start as simple as possible in order to keep your complexity under control.

I'll also take the opportunity to, one more time, stress the importance of creating a training and test set. Never mix the two. Ever.

<img src="media/not-xkcd-model-training.png" width="300"/>
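If nobody hands you a separate test set, a quick way to create one is `train_test_split`. This is a sketch on synthetic data; the names `X_fit` and `X_holdout` are chosen here only to avoid clashing with the variables used elsewhere in this notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data, invented for illustration
rng = np.random.RandomState(0)
X = rng.normal(size=(100, 3))
y = rng.randint(0, 2, size=100)

# Hold out 20% of the rows; stratify=y keeps the class balance similar in both parts
X_fit, X_holdout, y_fit, y_holdout = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

print(X_fit.shape, X_holdout.shape)
```

Do the split before any fitting or imputation, so that nothing learned from the held-out rows can leak into training.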

## Step 4: Evaluate results

You've properly separated training and test data, fitted your model, and made some predictions on your test set. Now, depending on the type of problem once again, you need to select a metric or set of metrics to understand how your model is performing. This is also a great time to use learning curves.
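For a binary classification problem, that evaluation might look something like the sketch below. The data is synthetic, and accuracy and ROC AUC are just two example metrics; which ones make sense depends on your problem.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic binary classification data, invented for illustration
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_fit, X_holdout, y_fit, y_holdout = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

model = LogisticRegression(max_iter=1000).fit(X_fit, y_fit)

# Accuracy works on hard class predictions
accuracy = accuracy_score(y_holdout, model.predict(X_holdout))
# ROC AUC works on the predicted probability of the positive class
auc = roc_auc_score(y_holdout, model.predict_proba(X_holdout)[:, 1])

print(f"accuracy: {accuracy:.3f}, ROC AUC: {auc:.3f}")
```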

Try not to suffer from too much tunnel vision here when trying to optimize a single test set on a single metric. That will be tough, especially since the nature of the hackathons in the course is actually all about doing just this... However, when you put a model into production, you won't have the luxury of knowing what your test set will look like, so be properly skeptical and be aware of your model's characteristics.

Remember, just because something has never happened doesn't mean it may never happen. A model that is overfitted on your training set is blissfully unaware of this. Keep assumptions to a minimum and you'll fail more gracefully when previously unseen things happen.

<img src="media/xkcd-unseen-data.png" width="600"/>

# More tips and tricks

## Establish a simple baseline FAST

We've already mentioned this a few times but it deserves its own section. Run as quickly as you can toward a simple baseline, no matter how simple it may be. For the specific problem you're working on, it's the data that is important and even the simplest model will give you an idea as to whether or not it has signal.
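One possible way to get that first reference point is to compare a model that ignores the features entirely (scikit's `DummyClassifier`) with a very simple real model. The data below is synthetic and the choice of logistic regression is only an example.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic data, invented for illustration
X, y = make_classification(n_samples=300, n_features=5, random_state=0)

# Always predicts the most frequent class: the bar any real model has to clear
dummy_scores = cross_val_score(DummyClassifier(strategy="most_frequent"), X, y, cv=5)

# A very simple real model
simple_scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)

print(f"dummy accuracy:  {dummy_scores.mean():.3f}")
print(f"simple accuracy: {simple_scores.mean():.3f}")
```

If the simple model can't beat the dummy, that tells you something about the signal in your data before you've invested in anything complicated.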

## Incrementally increase complexity

Take your super simple baseline model and increase complexity a little bit at a time. Like any responsible scientist, you don't want to be changing more than 1 variable at a time when running experiments.

## Use Scikit pipelines

Sooner or later, you will run into the problem of having to do duplicate pre-processing for a training and a test set or for different folds in cross validation. This can be a huge pain in the butt and can result in duplicated code or functions that have a crazy amount of arguments.

Let's start off with a bit of motivation by looking at the Titanic dataset, where we will drop all categorical features and fill the nulls on the rest with the median.

In [2]:
train_df = pd.read_csv('data/titanic.csv')
X_train, y_train = train_df.drop('Survived', axis=1), train_df.Survived.copy()
X_test = pd.read_csv('data/titanic-test.csv')

# now let's preprocess and train

X_train_clean = X_train.select_dtypes(exclude='object').copy()
# note that you will want to impute with the median age from the training set
# and NOT the test set. This creates a few difficulties when trying to design
# around it
X_train_clean['Age'] = X_train_clean.Age.fillna(X_train_clean.Age.median())

clf = RandomForestClassifier(n_estimators=10)
clf.fit(X_train_clean, y_train)

# then to test, we will need to do the same set of preprocessing

X_test_clean = X_test.select_dtypes(exclude='object').copy()
X_test_clean['Age'] = X_test_clean.Age.fillna(X_train_clean.Age.median())
# It turns out that the test set has nulls in Fare that the training set
# didn't have, so the preprocessing has to be a bit different: fill them
# with the median Fare from the training set.
X_test_clean['Fare'] = X_test_clean.Fare.fillna(X_train.Fare.median())

preds = clf.predict_proba(X_test_clean)[:, 1]

It's totally true that we could write a few functions to take care of this, but scikit already provides some tooling that ends up being cleaner: [pipelines](http://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html).

What is a pipeline? Pretty simple: it's a set of steps that has a model at the end of it. It implements the same API as the models (has `predict` and/or `predict_proba`) but it applies each of the steps before calling the model with the input!

Let's see how to use one to simplify the code we just looked at:

In [5]:
train_df = pd.read_csv('data/titanic.csv')
X_train, y_train = train_df.drop('Survived', axis=1), train_df.Survived.copy()
X_test = pd.read_csv('data/titanic-test.csv')

# for some cases, we will want to create our own pipeline step
# this provides a lot of flexibility!
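class RemoveObjectColumns(TransformerMixin, BaseEstimator):
    """Drop every column with dtype object."""

    def fit(self, X, y=None):
        # nothing to learn here, but the pipeline still expects fit to return self
        return self

    def transform(self, X):
        return X.select_dtypes(exclude='object').copy()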

# now let's make the pipeline
pipeline = make_pipeline(
    RemoveObjectColumns(),
    # it's cool how scikit already has a median imputer ready to go!
    SimpleImputer(strategy='median'),
    RandomForestClassifier(n_estimators=10)
)

# No need for us to manually preprocess the X_train at all
pipeline.fit(X_train, y_train)

# Same thing for X_test
probas = pipeline.predict_proba(X_test)

These can take a bit of time to learn how to use and they do have some strange behavior in some cases so be a bit patient with them - the work pays off!
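Pipelines also take care of the "different folds in cross validation" problem: because the imputer lives inside the pipeline, it gets re-fitted on the training part of every fold. Here is a sketch on synthetic data; the DataFrame, the target, and the name `cv_pipeline` are invented for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Synthetic data with some missing ages, invented for illustration
rng = np.random.RandomState(0)
X = pd.DataFrame({
    "age": rng.uniform(1, 80, size=200),
    "fare": rng.uniform(5, 100, size=200),
})
X.loc[rng.choice(200, size=30, replace=False), "age"] = np.nan
y = (X["fare"] > 50).astype(int)

cv_pipeline = make_pipeline(
    SimpleImputer(strategy="median"),
    RandomForestClassifier(n_estimators=10, random_state=0),
)

# For each fold, the median is computed on that fold's training rows only
scores = cross_val_score(cv_pipeline, X, y, cv=5)
print(scores.mean())
```

Doing the same thing by hand would mean repeating the imputation code inside a loop over the folds, which is exactly the duplication we are trying to avoid.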

### Custom Estimator

You may have noticed that the APIs that we use in our models are very simple. There is a `fit()`, a `predict`, a `predict_proba`, and sometimes a `transform` that we use and really not much else. If you're thinking that you can create your own estimator without much trouble, you'd be 100% correct!

#### There are good docs for this

Check out the section of the scikit user guide called [rolling your own estimator](https://scikit-learn.org/stable/developers/contributing.html#rolling-your-own-estimator) for the official explanations
of exactly how to do this. One of the more useful things in these docs is a pointer to a
repo that has [project templates](https://github.com/scikit-learn-contrib/project-template/), which
includes several examples of [custom estimators](https://github.com/scikit-learn-contrib/project-template/blob/master/skltemplate/_template.py).

#### Let's roll our own model

Alright, let's consider a binary classifier that does something really stupid: it flips a coin for
each observation and assigns a class based upon the outcome.

For custom estimators, there are normally the following steps:

1. Create a class that inherits from `BaseEstimator` and, optionally, a mixin such as `ClassifierMixin`.
1. Implement a constructor with your hyperparams.
1. Implement the `fit()` method.
1. Implement a `predict()` method.

Consider the following (taken directly from the scikit documentation), which implements a classifier based upon
a 1-NN scheme. In this classifier, we are given a set of observations with labels and when we get a new one,
we just find the sample from the training data that is closest to it and give it the same label. Simple :-)

In [4]:
class TemplateClassifier(ClassifierMixin, BaseEstimator):
    """ An example classifier which implements a 1-NN algorithm.
    For more information regarding how to build your own classifier, read more
    in the :ref:`User Guide <user_guide>`.
    Parameters
    ----------
    demo_param : str, default='demo'
        A parameter used for demonstration of how to pass and store parameters.
    Attributes
    ----------
    X_ : ndarray, shape (n_samples, n_features)
        The input passed during :meth:`fit`.
    y_ : ndarray, shape (n_samples,)
        The labels passed during :meth:`fit`.
    classes_ : ndarray, shape (n_classes,)
        The classes seen at :meth:`fit`.
    """
    def __init__(self, demo_param='demo'):
        self.demo_param = demo_param

    def fit(self, X, y):
        """A reference implementation of a fitting function for a classifier.
        Parameters
        ----------
        X : array-like, shape (n_samples, n_features)
            The training input samples.
        y : array-like, shape (n_samples,)
            The target values. An array of int.
        Returns
        -------
        self : object
            Returns self.
        """
        # Check that X and y have correct shape
        X, y = check_X_y(X, y)
        # Store the classes seen during fit
        self.classes_ = unique_labels(y)

        self.X_ = X
        self.y_ = y
        # Return the classifier
        return self

    def predict(self, X):
        """ A reference implementation of a prediction for a classifier.
        Parameters
        ----------
        X : array-like, shape (n_samples, n_features)
            The input samples.
        Returns
        -------
        y : ndarray, shape (n_samples,)
            The label for each sample is the label of the closest sample
            seen during fit.
        """
        # Check that fit has been called
        check_is_fitted(self, ['X_', 'y_'])

        # Input validation
        X = check_array(X)

        closest = np.argmin(euclidean_distances(X, self.X_), axis=1)
        return self.y_[closest]
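Following the same four steps, the coin-flipping classifier described at the start of this section could be sketched roughly as follows. The class name `CoinFlipClassifier` and its `random_state` hyperparameter are made up for this example; they are not part of scikit. The mixin is listed before `BaseEstimator`, which is the order the scikit developer docs recommend.

```python
import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin
from sklearn.utils.multiclass import unique_labels
from sklearn.utils.validation import check_array, check_is_fitted, check_X_y


class CoinFlipClassifier(ClassifierMixin, BaseEstimator):
    """Hypothetical classifier that ignores the features and flips a coin."""

    def __init__(self, random_state=None):
        # hyperparams are stored as-is in the constructor, with no extra logic
        self.random_state = random_state

    def fit(self, X, y):
        X, y = check_X_y(X, y)
        # the only thing "learned" is which classes exist
        self.classes_ = unique_labels(y)
        return self

    def predict(self, X):
        check_is_fitted(self, ["classes_"])
        X = check_array(X)
        rng = np.random.RandomState(self.random_state)
        # one random draw from the known classes per observation
        return rng.choice(self.classes_, size=X.shape[0])


# Tiny made-up dataset just to show the API in action
X_demo = np.array([[0.0], [1.0], [2.0], [3.0]])
y_demo = np.array([0, 1, 0, 1])

coin = CoinFlipClassifier(random_state=0).fit(X_demo, y_demo)
coin_preds = coin.predict(X_demo)
print(coin_preds)
```

Because it follows the scikit API, it should drop straight into a pipeline as the final step, and `ClassifierMixin` gives it a `score()` method for free.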

## Some advice for working in hackathon teams

### General advice

- Aim to make a submission as early as possible (baseline model)
- During the EDA, make sure to output some plots and save them - they will be helpful to build your presentation
- Try to keep a "pipeline" for your code, from the beginning to the end. Do not rely on successively editing the same DataFrame object, or you might end up unable to re-run your code. (I remember we suffered a lot in our first hackathon because of this.)


### Advice for working in teams

How to split work: Should everyone work on their own notebooks? Should you keep a single notebook? What is the best strategy?

Our advice is to keep a "main" notebook that everyone has access to. Nominate a "guardian" of that notebook. Work locally on small problems, starting from the "main" notebook - make sure that everyone on the team knows which problem you are attacking. Once you are happy with the solution, add it to the main notebook and make sure everyone knows it has been updated.

Also, set time deadlines for tasks. For instance: "now, everyone has 40 minutes to explore these variables, and we talk again afterwards to share our findings". Time goes by fast!

## Wrapping up

Keep this notebook open and reference it regularly, especially when you are doing your first few hackathons. The first few times you get a new dataset and it's 100% up to you to make all decisions about the steps to take,
it will be VERY easy to skip important steps, which will lead you to have much less fun than you deserve!

For lots more advice on how to organize your code in your notebooks, check out the Examples notebook,
which has lots of tips mostly focused on writing well-organized code.