# Classification Workflow: Pipelines!

## Objectives

- Formulate and implement an iterative modeling workflow
- Recognize how pipelines streamline the preprocessing and modeling process

## Why Pipeline?

Pipelines keep our code neat and clean all the way from cleaning and preprocessing our data to creating models and fine-tuning them!

**Advantages**: 
- Reduces complexity
- Convenient 
- Flexible 
- Can help prevent mistakes (like data leakage between train and test set) 
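
To make the data leakage point concrete, here is a minimal sketch on made-up numbers (not the competition data; all variable names are hypothetical). When a scaler lives inside a `Pipeline`, calling `fit` on the training split learns its statistics from that split only, and held-out rows are merely transformed with them.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Toy data, invented for illustration: one numeric feature with some signal
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 1))
y_toy = (X_toy[:, 0] + rng.normal(scale=0.5, size=200) > 0).astype(int)

X_tr, X_va, y_tr, y_va = train_test_split(X_toy, y_toy, test_size=0.25,
                                          random_state=0)

leak_demo = Pipeline([
    ("scale", StandardScaler()),
    ("model", LogisticRegression()),
])
leak_demo.fit(X_tr, y_tr)  # the scaler only ever sees the training split

learned_mean = leak_demo.named_steps["scale"].mean_
print("mean learned by the scaler:", learned_mean)
print("mean of the training split:", X_tr.mean(axis=0))
print("mean of all rows          :", X_toy.mean(axis=0))

# Held-out rows are scaled with the training statistics, then predicted
val_preds = leak_demo.predict(X_va)
```

Scaling the full dataset by hand before splitting would let the held-out rows influence the scaler; keeping the scaler inside the pipeline avoids that by construction.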

## Today's Agenda: 

We'll introduce pipelines through the lens of simplifying the whole classification workflow, top to bottom!

Our data: https://www.kaggle.com/c/cat-in-the-dat-ii

The goal is to classify the `target` column. 

The competition's main metric is ROC-AUC score! We can explore other metrics, but we should be sure to use ROC-AUC to evaluate our models.

### Steps:

1. Data Exploration
2. Define and structure data preprocessing steps
3. Run a `DummyClassifier` to create a model-less baseline, using a pipeline to combine the classifier with preprocessing steps
4. Run a `LogisticRegression` and compare results to the model-less baseline

If we have time, we'll keep iterating to improve!

In [None]:
# Imports
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt

from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder, StandardScaler
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, cross_validate
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
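# The plot_* helper functions were removed in scikit-learn 1.2;
# the Display classes below replace them
from sklearn.metrics import roc_auc_score, RocCurveDisplay, PrecisionRecallDisplay
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay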

In [None]:
# Grab, then explore data
df = pd.read_csv("data/cat_in_the_dat2_train.csv", index_col='id')

In [None]:
# Define our X and y

X = df.drop(columns=['target'])
y = df['target']

# and train test split - to create our val holdout set!
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.1,
                                                  random_state=0)

## Data Exploration

Explore the **training** data, checking out both numeric and categorical features.

We should run _at least_ one visualization to explore relationships among features!

In [None]:
# Your code here
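
One possible starting point, sketched on a small made-up frame so it runs on its own. The column names imitate the competition's naming style, but the values are invented; with the real data, the same calls would be applied to `X_train` and `y_train`.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the training data (values are invented)
rng = np.random.default_rng(0)
n = 300
toy = pd.DataFrame({
    "bin_0": rng.integers(0, 2, size=n).astype(float),
    "ord_0": rng.integers(1, 4, size=n).astype(float),
    "nom_0": rng.choice(["Red", "Blue", "Green"], size=n),
    "target": rng.integers(0, 2, size=n),
})
toy.loc[::10, "nom_0"] = np.nan    # inject missing values every 10th row
toy.loc[5::20, "ord_0"] = np.nan   # and every 20th row, starting at row 5

null_counts = toy.isna().sum()                        # missing values per column
cardinality = toy.drop(columns="target").nunique()    # unique values per feature
class_balance = toy["target"].value_counts(normalize=True)
target_rate = toy.groupby("nom_0")["target"].mean()   # share of positives per category

print(null_counts, cardinality, class_balance, target_rate, sep="\n\n")
```

Calling `target_rate.plot.bar()` would turn the last summary into a quick visualization of how the target varies across one categorical feature.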

## Data Preprocessing

Let's outline our data processing strategy!

#### Discuss:

Some questions we can ask ourselves:

> How will we handle any null values? How will we handle any categorical features? What if our categorical features have 20+ unique values in each column? How will we scale our features?

- 


Let's build a column transformer to define our data processing steps. It's typically easiest to create lists of column names to match up with each processing step. Also, don't repeat columns! Keep the lists passed to SKLearn's `ColumnTransformer` mutually exclusive: a column listed under two transformers is processed by both and shows up twice in the output.

Reference: https://scikit-learn.org/stable/auto_examples/compose/plot_column_transformer_mixed_types.html

In [None]:
# Your code here
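
A minimal sketch of one such preprocessor, assuming median imputation plus min-max scaling for numeric columns and most-frequent imputation plus one-hot encoding for categorical columns. These choices, the toy frame, and the column lists are illustrative assumptions rather than the answer for the competition data.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

# Toy stand-in for X_train (values are invented)
rng = np.random.default_rng(0)
n = 300
X_toy = pd.DataFrame({
    "bin_0": rng.integers(0, 2, size=n).astype(float),
    "ord_0": rng.integers(1, 4, size=n).astype(float),
    "nom_0": rng.choice(["Red", "Blue", "Green"], size=n),
})
X_toy.loc[::10, "nom_0"] = np.nan
X_toy.loc[5::20, "ord_0"] = np.nan

# Mutually exclusive column lists, one per processing route
num_cols = ["bin_0", "ord_0"]
cat_cols = ["nom_0"]

num_steps = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", MinMaxScaler()),
])
cat_steps = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("ohe", OneHotEncoder(handle_unknown="ignore")),  # unseen categories become all zeros
])

toy_preprocessor = ColumnTransformer([
    ("num", num_steps, num_cols),
    ("cat", cat_steps, cat_cols),
])

X_toy_processed = toy_preprocessor.fit_transform(X_toy)
print(X_toy_processed.shape)  # rows unchanged; one column per numeric feature and per category
```

Setting `handle_unknown="ignore"` matters once the fitted transformer meets validation rows containing categories it never saw during fitting.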

## Baseline Model

Let's find out how hard our problem is, by creating a model-less baseline!

If we use SKLearn's `DummyClassifier`, we can create our first full Pipeline!

Reference: https://scikit-learn.org/stable/modules/compose.html#pipeline

In [None]:
# Your code here
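
A sketch of what the baseline pipeline could look like, again on invented toy data with a hypothetical preprocessor. Because `DummyClassifier(strategy="most_frequent")` gives every row the same score, its cross-validated ROC-AUC lands at 0.5, which is the number a real model needs to beat.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.dummy import DummyClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_validate
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

# Toy stand-in for X_train / y_train (values are invented)
rng = np.random.default_rng(0)
n = 300
X_toy = pd.DataFrame({
    "ord_0": rng.integers(1, 4, size=n).astype(float),
    "nom_0": rng.choice(["Red", "Blue", "Green"], size=n),
})
X_toy.loc[::10, "nom_0"] = np.nan
y_toy = pd.Series(rng.integers(0, 2, size=n), name="target")

toy_preprocessor = ColumnTransformer([
    ("num", Pipeline([("impute", SimpleImputer(strategy="median")),
                      ("scale", MinMaxScaler())]), ["ord_0"]),
    ("cat", Pipeline([("impute", SimpleImputer(strategy="most_frequent")),
                      ("ohe", OneHotEncoder(handle_unknown="ignore"))]), ["nom_0"]),
])

# Preprocessing and the "model" travel together as one estimator
baseline_pipe = Pipeline([
    ("preprocess", toy_preprocessor),
    ("model", DummyClassifier(strategy="most_frequent")),
])

# The preprocessor is re-fit inside each fold, on that fold's training rows only
baseline_cv = cross_validate(baseline_pipe, X_toy, y_toy, cv=5, scoring="roc_auc")
print(baseline_cv["test_score"])
```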

#### Evaluate:

- 


## Logistic Regression

Let's build an initial logistic regression model, using the same preprocessing steps:

In [None]:
# Your code here
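
A sketch of the swap, assuming the same kind of hypothetical preprocessor as before: only the final pipeline step changes. The toy target here is built to follow `bin_0` with 10% label noise, so the scores say nothing about the competition data.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

# Toy stand-in for X_train / y_train (values are invented)
rng = np.random.default_rng(0)
n = 1000
X_toy = pd.DataFrame({
    "bin_0": rng.integers(0, 2, size=n).astype(float),
    "ord_0": rng.integers(1, 4, size=n).astype(float),
    "nom_0": rng.choice(["Red", "Blue", "Green"], size=n),
})
flip = rng.random(n) < 0.1  # 10% label noise
y_toy = pd.Series(np.where(flip, 1 - X_toy["bin_0"], X_toy["bin_0"]).astype(int),
                  name="target")
X_toy.loc[::10, "nom_0"] = np.nan
X_toy.loc[5::20, "ord_0"] = np.nan

toy_preprocessor = ColumnTransformer([
    ("num", Pipeline([("impute", SimpleImputer(strategy="median")),
                      ("scale", MinMaxScaler())]), ["bin_0", "ord_0"]),
    ("cat", Pipeline([("impute", SimpleImputer(strategy="most_frequent")),
                      ("ohe", OneHotEncoder(handle_unknown="ignore"))]), ["nom_0"]),
])

logreg_pipe = Pipeline([
    ("preprocess", toy_preprocessor),
    ("model", LogisticRegression(max_iter=1000)),
])

logreg_cv = cross_validate(logreg_pipe, X_toy, y_toy, cv=5,
                           scoring="roc_auc", return_train_score=True)
print("train ROC-AUC:", logreg_cv["train_score"].mean())
print("test ROC-AUC :", logreg_cv["test_score"].mean())
```

Comparing train and test scores side by side gives a first read on overfitting before any tuning happens.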

#### Evaluate:

- 


## Iterate

Let's change something in our preprocessor, something about our logistic regression model setup, or the features we're using, then try again.

In [None]:
# Your code here
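
One way to iterate on the model setup without rebuilding anything is `set_params`, which addresses a nested parameter as `<step name>__<parameter>`. The sketch below tries a few values of the regularization strength `C`; the step names, the toy data, and the candidate values are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Toy stand-in for X_train / y_train (values are invented)
rng = np.random.default_rng(0)
n = 1000
X_toy = pd.DataFrame({
    "bin_0": rng.integers(0, 2, size=n).astype(float),
    "nom_0": rng.choice(["Red", "Blue", "Green"], size=n),
})
flip = rng.random(n) < 0.1  # 10% label noise
y_toy = pd.Series(np.where(flip, 1 - X_toy["bin_0"], X_toy["bin_0"]).astype(int),
                  name="target")
X_toy.loc[::10, "nom_0"] = np.nan

iter_pipe = Pipeline([
    ("preprocess", ColumnTransformer([
        ("num", SimpleImputer(strategy="median"), ["bin_0"]),
        ("cat", Pipeline([("impute", SimpleImputer(strategy="most_frequent")),
                          ("ohe", OneHotEncoder(handle_unknown="ignore"))]), ["nom_0"]),
    ])),
    ("model", LogisticRegression(max_iter=1000)),
])

# Try a few regularization strengths, reusing the same pipeline object
results = {}
for C in (0.01, 1.0, 100.0):
    iter_pipe.set_params(model__C=C)
    cv = cross_validate(iter_pipe, X_toy, y_toy, cv=5, scoring="roc_auc")
    results[C] = cv["test_score"].mean()

for C, score in results.items():
    print(f"C={C}: mean ROC-AUC {score:.3f}")
```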

#### Evaluate:

- 


## Validate

How does our best model (so far) perform on our holdout validation set?

In [None]:
# Your code here
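
A sketch of the final check, assuming a logistic regression pipeline turned out best: fit once on the training split, then score the untouched validation split with ROC-AUC using the predicted probability of the positive class. The data and pipeline below are toy stand-ins.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Toy stand-in for the full dataset (values are invented)
rng = np.random.default_rng(0)
n = 1000
X_toy = pd.DataFrame({
    "bin_0": rng.integers(0, 2, size=n).astype(float),
    "nom_0": rng.choice(["Red", "Blue", "Green"], size=n),
})
flip = rng.random(n) < 0.1  # 10% label noise
y_toy = pd.Series(np.where(flip, 1 - X_toy["bin_0"], X_toy["bin_0"]).astype(int),
                  name="target")
X_toy.loc[::10, "nom_0"] = np.nan

X_tr, X_va, y_tr, y_va = train_test_split(X_toy, y_toy, test_size=0.1,
                                          random_state=0)

best_pipe = Pipeline([
    ("preprocess", ColumnTransformer([
        ("num", SimpleImputer(strategy="median"), ["bin_0"]),
        ("cat", Pipeline([("impute", SimpleImputer(strategy="most_frequent")),
                          ("ohe", OneHotEncoder(handle_unknown="ignore"))]), ["nom_0"]),
    ])),
    ("model", LogisticRegression(max_iter=1000)),
])

best_pipe.fit(X_tr, y_tr)                        # fit on training rows only
val_proba = best_pipe.predict_proba(X_va)[:, 1]  # probability of class 1
val_auc = roc_auc_score(y_va, val_proba)
print(f"validation ROC-AUC: {val_auc:.3f}")
```

ROC-AUC needs scores rather than hard labels, which is why `predict_proba` is used here instead of `predict`.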

#### Discuss:

- 


## Resources

Check out Aurélien Géron's notebook of an [end-to-end ML project](https://github.com/ageron/handson-ml2/blob/master/02_end_to_end_machine_learning_project.ipynb) in the GitHub repo that accompanies his book [_Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems (2nd ed)_](https://www.oreilly.com/library/view/hands-on-machine-learning/9781491962282/).

## Level Up - What to do with too many options in categorical columns?


The `category_encoders` library, installed separately, offers MORE encoding techniques beyond one-hot encoding: https://contrib.scikit-learn.org/category_encoders/

- These encoders work within SKLearn pipelines, since they're written in the SKLearn style!
