In [1]:
import pandas as pd
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import Binarizer, PolynomialFeatures, \
StandardScaler, FunctionTransformer, OneHotEncoder
from sklearn.pipeline import Pipeline, make_pipeline, \
FeatureUnion, make_union
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif

# Intro to Object Oriented Programming

## Object Oriented Programming

In Python, everything is an "object" of some type. This is the basis of what is known as Object Oriented Programming (OOP). Broadly, the focus within an OOP language is on objects -- what they are and what we can do with them. Some examples include:

- A `list` -- we can add, sum, or remove elements in a list, or do something to each element there
- A pandas `DataFrame` -- we can slice and manipulate each column, move them to new DataFrames, or reduce them to single numbers
- A sklearn `DecisionTreeClassifier` -- we can create splits for each of the nodes, access and view them, refit it to new data, or use its internal values to predict or score new data

Some things that are common across all of these objects (and broadly, across this programming paradigm):

1. Every object we define has certain things it can and can't do -- in other words, there's inherent behavior to one type of object versus another.
2. Almost every object is mutable in some fashion -- we can add or modify values in all three of the examples above. 
3. Our focus as developers in an OOP language like Python is what we want to do to each of those objects and how to connect or modify them.

Contrast this to the other broad paradigm, Functional Programming. In Functional Programming, our data is (typically) immutable -- we are not changing the underlying values or doing anything to them. Instead, through the use of functions and control flow, we represent that data differently. 

In other words (very broadly), OOP programs are interested in changing objects, whereas FP programs are interested in the evaluation of functions. Most languages are not purely one or the other -- in fact, many offer elements of both paradigms. 

> Popular Data Science languages `R` and `Julia` typically have stronger functional programming elements than Python!

An example of Pythonic OOP programming to add 2 to every number in a list:

In [2]:
my_list = [1, 2, 3, 4, 5]
my_list = [x + 2 for x in my_list]
print(my_list)

[3, 4, 5, 6, 7]


An alternative to that would be a functional programming approach:

In [3]:
def add2(value):
    return value + 2

my_list = [1, 2, 3, 4, 5]
print(list(map(add2, my_list)), my_list)

[3, 4, 5, 6, 7] [1, 2, 3, 4, 5]


Notice how in the OOP approach to the problem, the result is a **new** or **modified** object. In the FP approach to the problem, the result is the output of `my_list` (which is not modified) through `map` applying `add2` to each value, then `list` to convert that mapping into a list representation.
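
To make the contrast concrete, here is a small sketch. The helper names `add2_in_place` and `add2_pure` are made up for this illustration. The first function changes the list it is given; the second leaves its input alone and returns a new list.

```python
def add2_in_place(values):
    """Mutate the list that was passed in (change the object itself)."""
    for i in range(len(values)):
        values[i] += 2


def add2_pure(values):
    """Leave the input untouched and return a new list."""
    return [v + 2 for v in values]


original = [1, 2, 3, 4, 5]

result = add2_pure(original)
print(original, result)   # original is unchanged; result is a new list

add2_in_place(original)
print(original)           # the same list object now holds new values
```

Which style is preferable depends on whether other parts of the program still need the original values.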

While there are many other differences between functional and object oriented programming paradigms, here's what to take away from this:

1. Everything in Python is an object.
2. Much of our work as Python developers will involve creating, using, or modifying the contents of objects.
3. Functional programming involves passing (immutable) data through multiple functions to represent it in our desired way. While this is achievable in Python (and most programming languages), we are _typically_ solving problems using an object-oriented paradigm.

## Classes in Python 

A class is a type of object. You can think of a class definition as a sort of "blueprint" that specifies the construction of a new object when instantiated.

One quick example is to imagine that I have two sets of data and I fit a `DecisionTreeClassifier` to each of them:

```python
df1 = pd.DataFrame(...)
df2 = pd.DataFrame(...)

dt1 = DecisionTreeClassifier()
dt1.fit(df1, target)

dt2 = DecisionTreeClassifier()
dt2.fit(df2, target)
```

Both `dt1` and `dt2` have the same underlying blueprint. Calling `.fit()` on both will start the process for fitting a decision tree. Calling `.score()` will call the same scoring method for both, etc. They are both the _same_ kind of object and can _do_ the same kind of things.

**However**, both have different _attributes_. If we called `.tree_` on both, we would see a different underlying decision tree, with different splits, different nodes, and different data for each. In other words, while they both can _do_ the same things, they have different internal values. 

Instantiating a class lets us create a unique copy of that class -- it can do all the same things that every other member of its class can do (such as `.fit()`, `.score()`, etc.) but lets us keep unique internal values (or attributes) to that specific instance of the class.
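
The same idea can be tried out with a class from the standard library. This sketch uses `collections.Counter` as a stand-in for the two decision trees (the data is made up): both instances share one blueprint, but each keeps its own internal values.

```python
from collections import Counter

# Two instances of the same class, built from different data
c1 = Counter(['cat', 'cat', 'dog'])
c2 = Counter(['dog', 'dog', 'dog', 'cat'])

# Same blueprint: both are Counters and share the same methods
print(type(c1) is type(c2))
print(c1.most_common(1), c2.most_common(1))

# Different internal values: changing one does not touch the other
c1.update(['bird'])
print('bird' in c1, 'bird' in c2)
```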

## Check For Understanding (10-15 Minutes)

In a small group (2-4 people), try to work through the following example:
1. Pick a pet (like a cat, or a dog, or a lizard)
2. What are things that every member of that pet type can do (such as eat)? 
3. What are attributes of a specific pet that might differ across members of the class (such as their name)?
4. Are there things that those pets can do that _other_ pets can also do? _Other_ animals? 
5. Write down a set of the following:
    - Things that all animals can do
    - Things that members of your pet class can do
    - Things that are unique to a specific pet 

This thought experiment might seem silly, but breaking items up into these questions of shared behavior and distinct attributes underlies how classes work in object oriented programs overall. 

If your group finishes early, replace _pet_ with the Python `list()` object. What are things that every list can do? If we create a specific list (`[1, 2, 3, 4, 5]`), what attributes of that list are unique to it?

## Code Example: Creating Classes and Inheritance  

Let's pretend that we were creating vehicles. We might begin by creating a generic Vehicle class.

> Note: by convention, Python class names use CapWords (also called CamelCase), e.g. `Vehicle` or `DecisionTreeClassifier`

In [None]:
class Vehicle(object):
    pass

vehicle = Vehicle()
vehicle

We start off with `class` to denote that what follows is a class, followed by the name of the class. `(object)` denotes that we are inheriting things from the `object` class:

In [None]:
object

What is `object`? `object` is basically the grandfather object of all other objects in Python. Here we're just letting Python know that `Vehicle` should be able to do everything that `object` can (which is not much!) We'll dive a little more into what that means in a second.
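
A quick way to check this relationship is sketched below. `PlainVehicle` is a hypothetical second class, added only to show that in Python 3 leaving out `(object)` gives the same parent.

```python
class Vehicle(object):
    pass


class PlainVehicle:
    pass


vehicle = Vehicle()

# Every instance of Vehicle is also an instance of object
print(isinstance(vehicle, Vehicle), isinstance(vehicle, object))

# The method resolution order lists the classes Python searches, in order
print(Vehicle.__mro__)

# In Python 3, a class without explicit parents still inherits from object
print(PlainVehicle.__bases__)
```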

Next, let's give `Vehicle` some attributes: a `number_of_wheels`, a `name`, a `current_speed`, and a `max_speed`.

In [None]:
class Vehicle(object):
    
    def __init__(self):
        self.number_of_wheels = 4
        self.name = 'KITT'
        self.current_speed = 0
        self.max_speed = 60
        
vehicle = Vehicle()
print(vehicle,
     vehicle.number_of_wheels,
     vehicle.name,
     vehicle.current_speed,
     vehicle.max_speed)

Let's break down what we did here:

1. Every class can have internal functions (which we call **methods**) that do things in that class. This is how we define what an object can do!
2. We define a special function known as `__init__(self)` (the `__` is pronounced "dunder" for double underscore). This method runs as soon as a specific instance of the object is created (we call that **instantiating**). 
> For all of the methods we assign to an object, their first argument will be `self` to tell the class to look internally for attributes or variables.
3. Inside that function, we detail things that we might want to have populated as soon as the object exists, like its name and max speed. 
4. We denote the **attributes** or things specific to **one instance** of the object with `self.` -- this points back to whatever current instance of the object exists.
5. Once we instantiate the object (`vehicle = Vehicle()`), Python calls the internal `__init__(self)` and sets up anything that is inside of there.
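
The role of `self` in the steps above can be sketched with a trimmed-down `Vehicle`. The `describe()` method is hypothetical and exists only for this illustration.

```python
class Vehicle(object):

    def __init__(self):
        # Runs automatically when Vehicle() is called
        self.name = 'KITT'
        self.current_speed = 0

    def describe(self):
        return '{} is moving at {}'.format(self.name, self.current_speed)


first = Vehicle()
second = Vehicle()

# Attributes set through self belong to one instance only
second.name = 'Herbie'
print(first.name, second.name)

# Calling a method on an instance passes that instance in as self,
# so these two calls are equivalent
print(first.describe())
print(Vehicle.describe(first))
```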

We can also have certain attributes be defined by the user when they instantiate the class:

In [None]:
class Vehicle(object):
    
    def __init__(self, current_speed, name='KITT'):
        self.number_of_wheels = 4
        self.name = name
        self.current_speed = current_speed
        self.max_speed = 60
        
vehicle = Vehicle(25, name='Big Blue Van')

print(vehicle,
     vehicle.number_of_wheels,
     vehicle.name,
     vehicle.current_speed,
     vehicle.max_speed)

Note that we have changed the arguments that go into `__init__()` to include an argument for `current_speed` and a keyword argument for `name`. We then assign whatever values are passed into `Vehicle` _when it is instantiated_ to those attributes:

```python
def __init__(self, current_speed, name='KITT'):
    self.name = name # name comes through the class instantiation
    self.current_speed = current_speed # same here
```

Because `name` has a default value, the user doesn't have to supply it when instantiating the class: 

In [None]:
my_car = Vehicle(10)

print(vehicle.name, my_car.name)

But leaving out an argument that has no default (like `current_speed`) raises a `TypeError`:

In [None]:
error_car = Vehicle()
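
If we want to see the error without stopping the notebook, one option is to catch it. The sketch below uses a shortened `Vehicle` with only two attributes; the exact wording of the message depends on the Python version.

```python
class Vehicle(object):

    def __init__(self, current_speed, name='KITT'):
        self.name = name
        self.current_speed = current_speed


# name falls back to its default; current_speed must always be supplied
default_name = Vehicle(10)
by_keyword = Vehicle(name='Big Blue Van', current_speed=25)
print(default_name.name, by_keyword.name)

try:
    Vehicle()  # current_speed is missing
except TypeError as err:
    error_message = str(err)
    print('TypeError:', error_message)
```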

Let's add in a couple of extra methods to do things with any vehicle we want:

1. A method to change the current speed of the vehicle
2. A method to check if our speed is too fast

In [None]:
class Vehicle(object):
    
    def __init__(self, current_speed, name='KITT'):
        self.number_of_wheels = 4
        self.name = name
        self.current_speed = current_speed
        self.max_speed = 60
        
    # New methods here 
    
    def set_current_speed(self, speed):
        self.current_speed = speed
        
    def check_speed(self):
        if self.current_speed >= self.max_speed:
            print('Woah! You are driving at {}'.format(self.current_speed))
            print('Warning! Too fast! Slowing you down!')
            self.current_speed = self.max_speed
        else:
            print('You are driving at {}!'.format(self.current_speed))
            print('Thank you for driving safely!')

Just like `__init__()`, we add in two more functions, using `def` to define them and passing in `self` as the first argument. If we want extra information from the end user for that call, we can also set that up as additional arguments. 

In [None]:
death_cab = Vehicle(95, name='Death Cab')

death_cab.check_speed()

We can use `.set_current_speed()` to go slower:

In [None]:
death_cab.set_current_speed(20)

And then try `.check_speed()` again:

In [None]:
death_cab.check_speed()

However, what if `death_cab` only had 3 wheels? We could set this manually:

In [None]:
death_cab.number_of_wheels = 3
print(death_cab.number_of_wheels)

But different vehicles inherently have different numbers of wheels. While we could write separate classes from scratch for, say, `Bicycle` and `EighteenWheelerTruck`, that would mean repeating a lot of code. 

Thankfully, classes allow for inheritance. This means that we can use an existing class as the start of a blueprint and modify as we need to. Let's make a `Bicycle` class that inherits from `Vehicle`:

In [None]:
class Bicycle(Vehicle):
    
    def __init__(self, current_speed, name='Bikey'):
        super().__init__(current_speed, name)
        self.number_of_wheels = 2
        self.max_speed = 15
        
    def ring_bell(self):
        print('BRRRRRING')

When we inherit from a class, we can overwrite a method that it already has (like we do with `__init__()` above) or we can add a new method (like `ring_bell()`).

`super()` is a special function within a class. It lets us go to the class that we are inheriting from and call a method that we may have overwritten. 

In this example, we initialize all four attributes in `Vehicle` by using `super()` and passing in `current_speed` and `name`, **THEN** we reset `number_of_wheels` and `max_speed` as is appropriate for Bicycle.

> Note, in Python 3, `super()` already implicitly passes in `self`, so we don't need to worry about adding it.

`Bicycle` has access to anything defined in `Vehicle`, but has its own defaults for `max_speed` and `number_of_wheels`, plus a method `ring_bell()` that `Vehicle` does not have. 

In [None]:
car = Vehicle(0)
bicycle = Bicycle(10)

In [None]:
car.check_speed()

In [None]:
bicycle.check_speed()

In [None]:
bicycle.number_of_wheels

In [None]:
bicycle.ring_bell()

In [None]:
car.ring_bell()
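
That last call fails because inheritance only flows one way: the child class gets everything from its parent, but the parent gains nothing from the child. A self-contained sketch of that relationship, using shortened versions of the two classes:

```python
class Vehicle(object):

    def __init__(self, current_speed, name='KITT'):
        self.number_of_wheels = 4
        self.name = name
        self.current_speed = current_speed
        self.max_speed = 60


class Bicycle(Vehicle):

    def __init__(self, current_speed, name='Bikey'):
        super().__init__(current_speed, name)  # run Vehicle's setup first
        self.number_of_wheels = 2              # then override the defaults
        self.max_speed = 15

    def ring_bell(self):
        print('BRRRRRING')


car = Vehicle(0)
bicycle = Bicycle(10)

# A Bicycle is also a Vehicle, but a Vehicle is not a Bicycle
print(isinstance(bicycle, Vehicle), isinstance(car, Bicycle))

# Methods added in the child class are invisible to the parent
print(hasattr(bicycle, 'ring_bell'), hasattr(car, 'ring_bell'))

try:
    car.ring_bell()
except AttributeError as err:
    print('AttributeError:', err)
```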

## Check for Understanding (20 Minutes)

Individually, please attempt the following:

1. Modify the `Vehicle` class to have:
    1. A `seats` attribute that lists the number of seats. This value should be set to 4.
    2. A `currently_riding` attribute that will hold the names of whoever is currently in the vehicle. This value should be set to an empty list (`[]`).
    3. Three methods: 
        - `add_passenger()`: this should take the name of the passenger in as an argument. Before it adds the passenger to the `currently_riding`, it should check how many people are currently riding and, if it is the same value as the number of seats that the vehicle can support, it should tell you that the vehicle is full.
        - `let_passenger_off()`: this should take the name of a passenger in as an argument. It should check to see if that name exists in the list of `currently_riding`. If the name exists there, it should remove it from that list. If the name does not exist there, it should warn the user.
        - `current_passengers()`: this should print out each element in the `currently_riding` attribute.
2. Write some code that tests whether the changes you have made to `Vehicle` are working successfully.
    > This can be very informal, but the key ideas are to know what it should look like if each of your additions worked successfully, then try and give it input to make sure that it works correctly. 
    > For example, if I wanted to test that `number_of_wheels` was added successfully I might write something like:
```python
# I think if I call number_of_wheels it should return back the number of wheels, which in this case is 4
vehicle.number_of_wheels
>>> 4
# Success!
```
3. Modify the `Bicycle` class to:
    1. Have a default `seats` attribute of 1.
4. Create a new `Train` class that:
    1. Inherits from the `Vehicle` class.
    2. Sets the `number_of_wheels` attribute to 12.
    3. Sets the `seats` attribute to 24. 
    
This is a pretty intimidating check for understanding, so feel free to reach out to your classmates, the Slack, and your local instructors as well, if you feel lost. There is solution code in this repository as well, but I highly encourage you to spend 20 minutes pushing yourself to try and develop these skills.

Question 2 can be coded up as informally or formally as you wish, but learning how to test your code to ensure that its behavior is as you expect can be a really key skill (and is one I do frequently when I code) -- just because it looks like it works doesn't mean that it **actually** works.
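
One informal way to write such checks is with plain `assert` statements, as sketched below. The class is a trimmed-down `Vehicle` and the test function names are made up; this deliberately tests existing behavior rather than the exercise additions.

```python
class Vehicle(object):
    """Trimmed-down copy of the class above, just enough to test."""

    def __init__(self, current_speed, name='KITT'):
        self.name = name
        self.current_speed = current_speed
        self.max_speed = 60

    def set_current_speed(self, speed):
        self.current_speed = speed


def test_defaults():
    vehicle = Vehicle(25)
    assert vehicle.name == 'KITT'
    assert vehicle.current_speed == 25


def test_set_current_speed():
    vehicle = Vehicle(25)
    vehicle.set_current_speed(40)
    assert vehicle.current_speed == 40, 'speed should be updated'


# Run each check; an AssertionError pinpoints the expectation that failed
for test in (test_defaults, test_set_current_speed):
    test()
    print(test.__name__, 'passed')
```

Each test states what should be true first, then lets Python confirm it, which is the same habit as the comment-and-check example in Question 2.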

As developers, we will typically build out classes to handle the organization and manipulation of data -- such as parameters, distances, etc. We've already seen cases like this with both `DataFrame` and modeling libraries like `DecisionTreeClassifier`.

What we will usually do with classes is extend or build on modules that already exist. Our next step will be to build a custom transformer that behaves like the ones in sklearn's `preprocessing` module. 

## OOP / Classes Applied: Creating a `FeatureExtractor` class

In this applied exercise, we're going to create a `FeatureExtractor` class using sklearn objects. Our goal is to create a sklearn object with a `.fit()` and `.transform()` method that, when given a Pandas DataFrame, will extract a specific column by column name. 

Why would we want this class to exist? We will use it in our next section on data pipelines and the sklearn `Pipeline` object!

### sklearn `mixin` object

**Note**: This portion is taking advantage of some years of experience working with scikit-learn and is intended to illustrate how a developer might use their insight into classes to help extend a library. We are _not_ expecting you, at this point, to have the knowledge of sklearn or the experience with Python to be able to apply this to a different case down the line. Use this as an example of workflow, not an expectation for skills development.

We're going to make use of a [`mixin`](https://en.wikipedia.org/wiki/Mixin) class. These classes usually exist in larger libraries and provide a framework for a set of expected behaviors. 

Within sklearn, there are a set of mixin classes kept in the [`base`](http://scikit-learn.org/stable/modules/classes.html#module-sklearn.base) library. These are the backbones of all of the modules we've used to date. For example:

- [`LogisticRegression`](https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/linear_model/logistic.py#L962) inherits three mixin classes (`BaseEstimator`, `LinearClassifierMixin`, and `SparseCoefMixin`) -- each of these classes provide standardized methods and structure for the `LogisticRegression` class to work.
- [`StandardScaler`](https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/preprocessing/data.py#L461) inherits from `BaseEstimator` and `TransformerMixin`, each of which give it standardized methods as well.

#### What this means

Sklearn's preprocessing classes all inherit from the `BaseEstimator` and `TransformerMixin` classes. Inheriting from both gives us the following methods:
- `get_params()` (from `BaseEstimator`)
- `set_params()` (from `BaseEstimator`)
- `fit_transform()` (from `TransformerMixin`)

and they behave exactly the way sklearn expects. It is up to us to create the object and define our own `.fit()` and `.transform()` methods. 

#### Why both `.fit()` and `.transform()`?

Our class to extract a specific column only needs to find a column by name and extract it -- it never needs to "remember" anything. Our `.fit()` method won't do very much at all. However, other things in sklearn are going to expect to see a `.fit()` method, so we will include one. 
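
To see the difference between a transformer that has nothing to remember and one that does, here is a toy sketch in plain Python. `ColumnPicker` and `MeanCenterer` are hypothetical classes, not sklearn objects, and they skip sklearn's conventions to keep the example short.

```python
class ColumnPicker:
    """Stateless toy transformer: there is nothing to learn in fit."""

    def __init__(self, index):
        self.index = index

    def fit(self, rows):
        pass  # nothing to remember

    def transform(self, rows):
        return [row[self.index] for row in rows]


class MeanCenterer:
    """Stateful toy transformer: fit has to remember the mean."""

    def fit(self, values):
        self.mean_ = sum(values) / len(values)

    def transform(self, values):
        return [v - self.mean_ for v in values]


rows = [(1, 6), (2, 7), (3, 8)]

picker = ColumnPicker(1)
picker.fit(rows)
picked = picker.transform(rows)
print(picked)

centerer = MeanCenterer()
centerer.fit(picked)
# New data is shifted by the mean learned during fit, not by its own mean
print(centerer.transform([7, 10]))
```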

#### Let's create some fake data for testing

In [None]:
df = pd.DataFrame({'a': [1, 2, 3, 4, 5],
                   'b': [6, 7, 8, 9, 10]})

df

#### Now, let's set up the basic framework for our class -- we'll create a `.fit()` and a `.transform()` method that do nothing, then we will fill them in later.

In [None]:
class FeatureExtractor(BaseEstimator, TransformerMixin):
    def fit(self):
        print('Called the fit method')
        pass
    
    def transform(self):
        print('Called the transform method')
        pass

In [None]:
fe = FeatureExtractor()
fe.fit()

In [None]:
fe.transform()

Because of the inheritance, we also have access to `fit_transform()` which will do a `fit()` then immediately `transform()` the same data, all in one step. However...

In [None]:
fe.fit_transform()

We will need to have an `X` variable passed in when we call `.fit_transform()` -- this is standard templating across sklearn. The classes are looking for the features which, internally, they will call `X` (some classes also look for the target, which they'll internally call `y`).

In [None]:
fe.fit_transform(df)

In addition, because `.fit_transform()` is calling for an `X`, once we gave it one, it tries to immediately pass it to *our* `.fit()`. *Our* `.fit()` method right now does not take an `X` and so throws an exception. We'll fix that shortly.

Do the other methods inherited from `BaseEstimator` work?

In [None]:
fe.get_params()

In [None]:
fe.set_params()

Looks like they do!

Let's refactor our `FeatureExtractor` class to take in an `X` in our `.fit()` and `.transform()` methods.

In [None]:
class FeatureExtractor(BaseEstimator, TransformerMixin):
    def fit(self, X): # new line
        print('Called the fit method')
        pass
    
    def transform(self, X): # new line
        print('Called the transform method')
        pass

And let's test it:

In [None]:
fe = FeatureExtractor()
fe.fit_transform(df)

What could be happening here? If we look at the `TransformerMixin` source code, `.fit_transform()` is defined along these lines:

```python
def fit_transform(self, X, y=None, **fit_params):
        ...
        if y is None:
            # fit method of arity 1 (unsupervised transformation)
            return self.fit(X, **fit_params).transform(X)
        else:
            # fit method of arity 2 (supervised transformation)
            return self.fit(X, y, **fit_params).transform(X)
```

So, it's calling `.transform(X)` on whatever comes out of `self.fit(X)` -- we're not passing anything out of `.fit()` and so `.transform()` fails. How do we fix this?

Instead of `pass`, we can end `.fit()` with `return self` -- this means that calling `.fit()` hands back the instance it was called on. This will let us **chain commands** together in the way that sklearn expects.

**Note**: This is behavior that does not exist in every library (because each library is different, coded by different developers, etc.) -- it's good to know but will not 100% be a pattern in every library you see. 
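
The effect of `return self` can be seen in isolation with two toy classes (`NoReturn` and `ReturnsSelf` are hypothetical names, unrelated to sklearn):

```python
class NoReturn:
    """Toy class whose fit() ends without returning anything."""

    def fit(self, X):
        pass

    def transform(self, X):
        return X


class ReturnsSelf:
    """Toy class whose fit() hands back the instance it was called on."""

    def fit(self, X):
        return self

    def transform(self, X):
        return X


data = [1, 2, 3]

print(NoReturn().fit(data))  # None, so there is nothing to call .transform() on

try:
    NoReturn().fit(data).transform(data)
except AttributeError as err:
    print('AttributeError:', err)

model = ReturnsSelf()
print(model.fit(data) is model)          # fit() returned the same instance
print(model.fit(data).transform(data))   # so the calls can be chained
```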

In [None]:
class FeatureExtractor(BaseEstimator, TransformerMixin):
    def fit(self, X): 
        print('Called the fit method')
        return self # new line
    
    def transform(self, X): 
        print('Called the transform method')
        pass
    
fe = FeatureExtractor() # need to reinstantiate with new code
fe.fit_transform(df)

Success! Now all we need to do is modify fit and transform to do what we want. 

So, let's take a step back and think about what we want to do:

1. when we instantiate `FeatureExtractor`, we want to be able to give it a column name to look for. 
2. when we call `.fit()`, we don't need to do anything.
3. when we call `.transform()`, we want `FeatureExtractor` to look at what it's been given and return back a numpy array with just those values
    > Why a numpy array and not a `DataFrame`? Numpy arrays tend to work more easily with other sklearn components. 
    
Since we don't need to do anything for step 2, let's get step 1 targeted first. We know that when we instantiate a class, it will run whatever is in the `__init__()` method, and this is where we can establish start-of-life attributes. Let's add in an `__init__()` method!

In [None]:
class FeatureExtractor(BaseEstimator, TransformerMixin):
    def __init__(self, column):
        print('Initialized Class') # new line
        self.column = column # new line
        
    def fit(self, X): 
        print('Called the fit method')
        return self 
    
    def transform(self, X): 
        print('Called the transform method')
        pass
    
fe = FeatureExtractor('b') 
fe.fit_transform(df)
print(fe.column) # new line
print(fe.get_params()) # new line
print(fe.set_params()) # new line

Notice that a couple of new things have happened -- we can now call the `.column` attribute of the class itself, and the `.get_params()` and `.set_params()` methods (which know what to look for) are returning information about our new attribute as well. 
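
How does `.get_params()` know what to look for? As far as I understand it, sklearn reads the argument names of `__init__()` and then looks up an attribute with the same name on the instance. The sketch below imitates that idea in plain Python. `ParamsMixin`, `ToyExtractor`, and `Mismatched` are hypothetical classes written for this illustration, not sklearn's actual code.

```python
import inspect


class ParamsMixin:
    """Simplified stand-in for the idea behind get_params()."""

    def get_params(self):
        # Read the argument names of __init__ (self is already bound)
        names = list(inspect.signature(self.__init__).parameters)
        # Look up an attribute with the same name as each argument
        return {name: getattr(self, name) for name in names}


class ToyExtractor(ParamsMixin):

    def __init__(self, column):
        self.column = column  # attribute name matches the argument name


class Mismatched(ParamsMixin):

    def __init__(self, column):
        self.col = column     # stored under a different name


print(ToyExtractor('b').get_params())

try:
    Mismatched('b').get_params()
except AttributeError as err:
    print('AttributeError:', err)
```

This is why it is a good habit to store each `__init__()` argument under an attribute of exactly the same name, as we did with `self.column = column`.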

Let's change the `.transform()` method -- we'll also add in a little boilerplate code so that if we do pass in an `X` and a `y`, it can accept the `y` and then safely ignore it:

In [None]:
class FeatureExtractor(BaseEstimator, TransformerMixin):
    def __init__(self, column):
        print('Initialized Class') 
        self.column = column 
        
    def fit(self, X, y=None): # new line
        print('Called the fit method')
        return self 
    
    def transform(self, X, y=None): # new line
        print('Called the transform method')
        return X[[self.column]].values # new line
    
fe = FeatureExtractor('b') 
fe.fit_transform(df)
print(fe.column) 
print(fe.get_params())
print(fe.set_params()) 

Everything ran successfully, but that _should_ have transformed our data. Let's make one tweak (because we are now returning something in `.transform()`, we want to print out what comes out of the method as opposed to sending it off into the ether).

In [None]:
class FeatureExtractor(BaseEstimator, TransformerMixin):
    def __init__(self, column):
        print('Initialized Class') 
        self.column = column 
        
    def fit(self, X, y=None):
        print('Called the fit method')
        return self 
    
    def transform(self, X, y=None):
        print('Called the transform method')
        return X[[self.column]].values 
    
fe = FeatureExtractor('b') 
print(fe.fit_transform(df)) # new line
print(fe.column) 
print(fe.get_params())
print(fe.set_params()) 

It's our `b` column from our fake `DataFrame`. We've got this class successfully made!
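
Note the double brackets in `X[[self.column]]`. A short sketch of why they matter, assuming the same fake data as above: single brackets give a one-dimensional array, while double brackets keep a column dimension, which is the two-dimensional (rows by features) layout that sklearn estimators generally expect.

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3, 4, 5],
                   'b': [6, 7, 8, 9, 10]})

one_d = df['b'].values      # single brackets: a Series, so a 1-D array
two_d = df[['b']].values    # double brackets: a DataFrame, so a 2-D array

print(one_d.shape)  # (5,)
print(two_d.shape)  # (5, 1)
```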

Our final edit will be to remove the print statements inside of the class, now that we know that it is working:

In [None]:
class FeatureExtractor(BaseEstimator, TransformerMixin):
    def __init__(self, column):
        self.column = column 
        
    def fit(self, X, y=None):
        return self 
    
    def transform(self, X, y=None):
        return X[[self.column]].values 
    
fe = FeatureExtractor('b') 
print(fe.fit_transform(df))
print(fe.column) 
print(fe.get_params())
print(fe.set_params()) 

# we can now even do this
fe.fit(df)
column = fe.transform(df)
print(df, '\n', column)

We'll use this custom class moving forward in our `Pipeline` discussion.

## Data Pipelines and sklearn `Pipeline` library

If you remember from our discussion a couple of weeks ago, we talked about a typical Data Science workflow:

![](./images/model_development.png)

In this sort of scheme, we can imagine a couple of steps happening across the model fitting and model deployment processes:

#### Model Generation

1. Fit transformations to training data
2. Transform training data
3. Fit model to transformed training data
4. Predict / Predict Probabilities / Score Model 

#### Model Deployment

1. Transform incoming data based on previous fit
2. Predict / Predict Probabilities / Score Previously fit Model

In other words, **when fitting**, we're going to have a set of steps to do to fit data transformations. Then, we'd like to transform our data, doing all of those in the correct order. Then we'd like to fit our model on that transformed data. Then we'd like to do stuff with that model. **When deployed**, we want to reproduce just the transformation and predictions steps.

So, really, we could break this down into something simpler:

#### Model Generation
1. Data_Transform -- all **fit** steps
2. Data_Transform -- all **transform** steps
3. Model -- all **fit** steps
4. Model -- all **predict** steps

#### Model Deployment
1. Data_Transform -- all **transform** steps
2. Model -- all **predict** steps

Sklearn has a class (`Pipeline`) that, once you know how it works, makes this much easier than handling all of these steps by hand.

## Serial Steps -- the `Pipeline` Class

`Pipeline` allows you to set up a list of steps. Once set up, the `Pipeline` will take a set of data and sequentially feed it through the steps, calling fit and transform at each step and passing that result into the next step. At the last step, the `Pipeline` object will do whatever your last command for it is (i.e., if you call .predict() it will return the predictions of whatever you wanted).

A `Pipeline` requires a list of steps. Each step is a tuple containing a name for the step (a string) followed by the object that performs it. Each object can either be instantiated ahead of time and assigned to a variable, or instantiated directly inside the list.
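
To build some intuition for what happens inside, here is a toy version written in plain Python. `ToyPipeline`, `AddOne`, and `ThresholdModel` are hypothetical classes for illustration only; sklearn's real `Pipeline` does considerably more (validation, parameter handling, and so on).

```python
class AddOne:
    """Toy transformer with nothing to learn."""

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return [x + 1 for x in X]


class ThresholdModel:
    """Toy final step: predicts 1 when a value is above the training mean."""

    def fit(self, X, y=None):
        self.cutoff_ = sum(X) / len(X)
        return self

    def predict(self, X):
        return [int(x > self.cutoff_) for x in X]


class ToyPipeline:

    def __init__(self, steps):
        self.steps = steps  # list of (name, object) tuples

    def fit(self, X, y=None):
        # Fit and transform with every step except the last...
        for name, step in self.steps[:-1]:
            X = step.fit(X, y).transform(X)
        # ...then fit the final step on the transformed data
        self.steps[-1][1].fit(X, y)
        return self

    def predict(self, X):
        # Only transform here: nothing is refit on new data
        for name, step in self.steps[:-1]:
            X = step.transform(X)
        return self.steps[-1][1].predict(X)


pipe = ToyPipeline([('add_one', AddOne()), ('model', ThresholdModel())])
pipe.fit([1, 2, 3])
print(pipe.predict([0, 5]))
```

The key design point is that `predict()` reuses what was learned during `fit()`, which is the "Model Deployment" half of the workflow described earlier.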

Let's use the Iris data and the following libraries:

- `FeatureExtractor` -- the custom sklearn object we created above
- `Binarizer` -- to create a cutoff
- `KNeighborsClassifier` -- to classify the Iris flowers.

We'll begin by using `train_test_split()` to split the data into a training set and a test set. This will be to show off that the `Pipeline` object can be reproducibly used!

In [None]:
iris = pd.read_csv('datasets/iris.csv')

def split_species(val):
    if val == 'setosa':
        return 0
    elif val == 'versicolor':
        return 1
    else:
        return 2
    
iris['species'] = iris['species'].apply(split_species)

iris.head()

We'll split `species` into its own `y` variable and assign the remaining features to `X`, then train-test split (using 2017 as the `random_state`):

In [None]:
y = iris['species'].copy()
X = iris[[col for col in iris.columns if col !='species']].copy()

X_train, X_test, y_train, y_test = train_test_split(X, y, 
                                                    random_state=2017)

Next, we'll set up the steps. The model we want to make is one where we take the `petal_length` feature, bin it into a dummy variable where the cutoff is the median petal length, and then predict the flower species using that one feature. 

First, we will list out the steps in sequential order. Each step in the list will look like this:

```python
('NAMEOFSTEP', callable())
```

where the `'NAMEOFSTEP'` is the name of that step (as a string), and the `callable()` is any sklearn object that has a `.fit()` and a `.transform()` method. The **last** item in the list of steps can be a model object with `.score()`, `.predict()`, and `.predict_proba()` as well. 

In [None]:
modeling_steps = [
    ('extract_petal_length', FeatureExtractor('petal_length')),
    ('cut_off_at_median', Binarizer(threshold=X_train['petal_length'].median())),
    ('predict_using_knn', KNeighborsClassifier())
]

print(modeling_steps)

Next, we'll instantiate a Pipeline object, passing in the steps:

In [None]:
model1 = Pipeline(modeling_steps)
model1

Finally, we'll fit it to the data we have!

In [None]:
model1.fit(X_train, y_train)

`model1` now acts like a `KNeighborsClassifier()` **plus** it does the transformation required to predict and score the results. For example, if we wanted to see the score on the training data:

In [None]:
print(model1.score(X_train, y_train))

But if we pass in **new** data, it will still do the requisite transformations, provided that it sees a column named `'petal_length'` in the incoming data:

In [None]:
print(model1.score(X_test, y_test))

Or we can get predictions as well:

In [None]:
print(model1.predict(X_test)[0:5])
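
To check the claim that the pipeline transforms and then predicts, we can reproduce `.predict()` by hand from the fitted steps. This sketch uses a small synthetic data set and a `StandardScaler` instead of the Iris steps above, so the variable and step names here (`X_demo`, `y_demo`, `'scale'`, `'knn'`) are made up for the example.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Small synthetic data set (made up for this sketch)
rng = np.random.RandomState(2017)
X_demo = rng.normal(size=(40, 2))
y_demo = (X_demo[:, 0] > 0).astype(int)

pipe = Pipeline([
    ('scale', StandardScaler()),
    ('knn', KNeighborsClassifier(n_neighbors=3)),
])
pipe.fit(X_demo, y_demo)

# Pull the fitted steps back out by name and chain them manually
scaler = pipe.named_steps['scale']
knn = pipe.named_steps['knn']
manual = knn.predict(scaler.transform(X_demo))

# The pipeline's predictions match the manual transform-then-predict
print(np.array_equal(pipe.predict(X_demo), manual))
```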

## Checks for Understanding

For the `Pipeline` Checks for Understanding, we will be using a dataset from the University of California at Irvine's Machine Learning Repository on predicting the age of [Abalones](https://en.wikipedia.org/wiki/Abalone), which are a type of marine snail. Confirming the age of an abalone by hand is a difficult process that involves cutting the shell open and using a microscope to count rings. Automating this process saved researchers quite a bit of time!

![](./images/abalone.jpg)

Our dataset can be accessed directly via the [Abalone Dataset](http://archive.ics.uci.edu/ml/datasets/Abalone) page and contains the following features:

|Name|Data Type|Units|Description|
|:---|:---|:---|:---|
|Sex|nominal||M, F, and I (infant)|
|Length|continuous|mm|longest shell measurement|
|Diameter|continuous|mm|perpendicular to length|
|Height|continuous|mm|with meat in shell|
|Whole weight|continuous|grams|whole abalone|
|Shucked weight|continuous|grams|weight of meat|
|Viscera weight|continuous|grams|gut weight (after bleeding)|
|Shell weight|continuous|grams|after being dried|
|Rings|integer||+1.5 gives the age in years|

In [None]:
abalone_columns = ['sex', 'length', 'diameter', 'height',
                  'whole_weight', 'shucked_weight', 'viscera_weight',
                  'shell_weight', 'rings']

abalone = pd.read_csv('http://archive.ics.uci.edu/ml/machine-learning-databases/abalone/abalone.data',
                     names=abalone_columns)

abalone.head()

In [None]:
abalone.info()

We could approach predicting the number of rings in a couple of ways:

1. We could treat it like a continuous variable and use a regression technique
2. We could attempt to predict each number of rings as if it were its own class
3. We could engineer a new feature to predict using `rings` as the starting point. 

For the ease of learning, we're going to pick the third option and predict whether or not the abalone in question is above the average age in our sample:

In [None]:
X = abalone[[col for col in abalone.columns if col != 'rings']].copy()
rings = abalone['rings'].copy()
# 1 if the abalone has more rings than the sample mean, else 0
y = (rings > rings.mean()).astype(int)

In [None]:
print(y.describe(), '\n\n', rings.describe())

## Check for Understanding 1 

Our target column is stored in `y` and our features are now stored in `X`. Using the `abalone` dataset, please do the following individually:

1. Use `train_test_split()` with the following parameters:
    - `test_size`: `0.25`
    - `random_state`: `20171107`
    - Name your dataframes `X_train`, `X_test`, `y_train`, `y_test`
2. Using `FeatureExtractor`, `PolynomialFeatures`, `StandardScaler`, and `LogisticRegression`, create a list of steps that:
    1. Extracts the `diameter` column using `FeatureExtractor`
    2. Creates a set of $\text{diameter}$, $\text{diameter}^2$, and $\text{diameter}^3$ using `PolynomialFeatures` (don't forget to ignore the bias term!)
    3. Standardizes those new features using `StandardScaler`
    4. Feeds the resulting 3 columns into `LogisticRegression`
> Remember, each step needs to be formatted: `('name of step', callable())`
3. Create a `Pipeline` object with your list of steps.
4. Fit your `Pipeline` object to your `X_train` and `y_train` dataframes.
5. **Without refitting your `Pipeline` object**, use it to score `X_test` and `y_test`
6. Create predictions using your `Pipeline` object on your test set. Use `confusion_matrix()` and `classification_report()` to investigate the goodness of fit of your model. How well does it fit?

## Serial Steps -- `make_pipeline()`

If we don't care about the names of each individual step (and we rarely will), we can use the `make_pipeline()` helper function to build the pipeline directly from the steps:

In [None]:
pipe = make_pipeline(
    FeatureExtractor('diameter'),
    PolynomialFeatures(3, include_bias=False),
    StandardScaler(),
    LogisticRegression()
)

print(pipe)

Instead of having to create a `list` of steps separately, `make_pipeline()` takes in a series of arguments and creates the pipeline for you, automatically naming the steps as needed. It can be a huge timesaver!
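
As a quick sketch of that automatic naming, the block below builds a pipeline with `make_pipeline()` and lists the generated step names, which are the lowercased class names. The custom `FeatureExtractor` step is left out here only so the block runs on its own, and the variable names are assumptions for illustration.

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.linear_model import LogisticRegression

named_pipe = make_pipeline(
    PolynomialFeatures(3, include_bias=False),
    StandardScaler(),
    LogisticRegression()
)

# Each entry in .steps is a (name, estimator) tuple
step_names = [name for name, step in named_pipe.steps]
print(step_names)
```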

## Quick Diversion: Categorical Variables Encoded as Strings

So, there are some steps here to modeling the `abalone` dataset that sklearn does not handle particularly elegantly:

1. Extracting features from a DataFrame by column name
2. Creating dummy features from categorical variables that are encoded as strings

Step 1 we've already taken care of with our custom `FeatureExtractor` class. Step 2, however, is a little inelegant. We'll be creating a second custom class, `CategoricalExtractor()`, and using a preprocessing class known as `OneHotEncoder()`.

### `OneHotEncoder`

OneHotEncoder will take a column of data and convert it into dummy variables (one per unique instance in the feature). 

Two keyword arguments we will want to set are:
- `sparse=False`: `OneHotEncoder` defaults to returning a `sparse` matrix after transformation. We want to see dense numpy arrays instead
- `handle_unknown='ignore'`: `OneHotEncoder` defaults to raising an error if it sees a value it does not know when transforming. Setting this to `'ignore'` suppresses that error: if `OneHotEncoder` transforms a row containing a category it did not see during fitting, it sets every dummy column for that row to `0`, which is what we want.
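
Here is a small sketch of what `handle_unknown='ignore'` does, using made-up integer categories. It keeps the default sparse output and calls `.toarray()` to view it densely, so the block does not depend on the dense-output keyword, which was renamed from `sparse` to `sparse_output` in later scikit-learn releases.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

encoder = OneHotEncoder(handle_unknown='ignore')
encoder.fit(np.array([[1], [2], [3]]))  # categories seen during fit: 1, 2, 3

# 2 was seen during fit; 4 was not, so its row is all zeros
encoded = encoder.transform(np.array([[2], [4]])).toarray()
print(encoded)
```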

**However**, `OneHotEncoder` will _only_ transform columns that use integers to represent categories. (A more user-friendly version of `OneHotEncoder` named `CategoricalEncoder` that will, among other things, gracefully handle strings is currently under active development; see the PR [here](https://github.com/scikit-learn/scikit-learn/pull/9151).) Until then, we need a class that will do the following:

1. Take in a Pandas dataframe
2. Extract a single column
3. When we fit the column: it should assign a unique integer to each string category.
4. When we transform the column:
    - It should convert each of the string categories to that integer
    - It should return that column as a 2D numpy array.
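
The fit and transform behavior in steps 3 and 4 can be sketched with plain pandas before wrapping it in a class. The toy Series and the choice to send unseen categories to `0` are assumptions for illustration.

```python
import pandas as pd

train_col = pd.Series(['M', 'M', 'M', 'F', 'F', 'I'])
new_col = pd.Series(['F', 'I', 'X'])  # 'X' never appeared in the training data

# "Fit": give each category an integer starting at 1, most frequent first
mapping = {cat: i + 1 for i, cat in enumerate(train_col.value_counts().index)}

# "Transform": look up each value (0 for unseen categories),
# then reshape to a 2D array with a single column
codes = new_col.apply(lambda val: mapping.get(val, 0)).values.reshape(-1, 1)

print(mapping)
print(codes)
```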

In the interests of time, I'm going to define that class here and move forward, but this is a **great opportunity** to, on your own time, investigate what makes this class work.

(Also note: this is not a rigorously tested piece of code (certainly not to the standards that sklearn's code is!) -- you should not expect 100% performance from it!)

In [None]:
class CategoricalExtractor(BaseEstimator, TransformerMixin):
    def __init__(self, column):
        self.column = column
        self.values = None
        
    def _create_values(self, indices):
        return {ind: i+1 for i, ind in enumerate(indices)}
    
    def _apply_values(self, row_val):
        return self.values.get(row_val, 0)
        
    def fit(self, X, y=None):
        self.values = self._create_values(X[self.column].value_counts().index)
        return self 
    
    def transform(self, X, y=None):
        col = X[self.column].apply(self._apply_values)
        return col.values.reshape(-1, 1)

We can use `CategoricalExtractor` (for cases where we have incoming categories) instead of `FeatureExtractor`. Once we do that, we'll pass it directly into `OneHotEncoder` with the two options we identified above.

In [None]:
pipe = make_pipeline(
    CategoricalExtractor('sex'),
    OneHotEncoder(sparse=False, handle_unknown='ignore'))

pipe.fit(X_train)
print(pipe.transform(X_train)[0:5, :])
print(X_train[['sex']].head())

## Parallel Steps -- `FeatureUnion()`

`Pipeline` works great if we have a set of steps that we want to do in a direct series. A great application of this is to create a series of steps to apply to one specific column.

**However**, we frequently want to do different things to different columns. `FeatureUnion()` lets us take multiple transformers and join the results back into one array.

Let's imagine we have two features we want to predict with in the Iris dataset:

- `petal_length` as a dummy feature coded 1 if it is above the mean for that column
- `petal_width` as both itself and a $\text{petal_width}^2$ feature.

We can create each pipeline separately, first for `petal_length`:

In [None]:
petal_length_pipe = make_pipeline(
    FeatureExtractor('petal_length'),
    Binarizer(threshold=iris['petal_length'].mean())
)

And a second for the `petal_width` feature:

In [None]:
petal_width_pipe = make_pipeline(
    FeatureExtractor('petal_width'),
    PolynomialFeatures(2, include_bias=False)
)

`FeatureUnion()` lets me run these in parallel and join the results. `.fit()` will call the `.fit()` method of each of the constituent transformers, and `.transform()` will transform each and then join the results. 

Just like `Pipeline()`, `FeatureUnion()` needs a list of tuples, each with a name and a step. Each of the pipelines we've created above (`petal_length_pipe` and `petal_width_pipe`) is considered a step.

In [None]:
fu = FeatureUnion([
    ('petal_length_transformer', petal_length_pipe),
    ('petal_width_transformer', petal_width_pipe)
])

fu.fit(iris)

Then calling `.transform()` will run each pipeline in parallel on the *same* set of data.

In [None]:
fu.transform(iris)[0:5, :]

Features are joined in the order of the steps provided.
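
A tiny sketch of that ordering, using a made-up single-column array rather than the Iris data: the first transformer contributes the first output column, and the second contributes the remaining two. The step names and values are assumptions for illustration.

```python
import numpy as np
from sklearn.pipeline import FeatureUnion
from sklearn.preprocessing import Binarizer, PolynomialFeatures

X_demo = np.array([[1.0], [2.0], [3.0], [4.0]])

demo_union = FeatureUnion([
    ('binarized', Binarizer(threshold=2.5)),                # 1 column: 1 if x > 2.5
    ('poly', PolynomialFeatures(2, include_bias=False)),    # 2 columns: x and x^2
])

# Columns appear left to right in the order the steps were listed
joined = demo_union.fit_transform(X_demo)
print(joined)
```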

Just like `Pipeline`, `FeatureUnion` also has a function that removes some of the boilerplate code (`make_union()`):

In [None]:
fu = make_union(
    petal_length_pipe,
    petal_width_pipe
)

fu.fit(iris)
fu.transform(iris)[0:5, :]

In [None]:
abalone.head()

## Check For Understanding 2 (10 minutes)

Using the `abalone` dataset:

1. Create three pipeline objects (using `make_pipeline()`) that do the following:
   1. Extract `length`, use `Binarizer` to cut at the average value for length
   2. Extract `diameter`, use `PolynomialFeatures` (`include_bias=False`) to create a $\text{diameter}$ and $\text{diameter}^2$ feature
   3. Extract `height`, use `Binarizer` to cut it at a value of 0.10
2. Before checking your work with Python, how many features would these three pipelines make? 
3. Feed each of these steps into a feature union, using `make_union()`
4. Fit and transform `X_train`. What is the shape of this numpy array? Does it match your expectations from question 2?

## The power of `Pipeline` and `FeatureUnion`

The true power of `Pipeline` and `FeatureUnion` is that we can **chain them together**.

As long as we can break down the feature engineering and modeling into sets of steps that need to happen in parallel and steps that need to happen in series (and provided that every step is an sklearn transformer that provides a `.fit()` and a `.transform()` method), we can map out the entire modeling process **and fit / run it** with one call. 
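
As a minimal sketch of that chaining, the block below nests a feature union inside a pipeline and fits everything with one call. It uses synthetic NumPy data and `FunctionTransformer` column selectors as stand-ins for the custom `FeatureExtractor`. The helper functions `first_column` and `second_column`, the data, and the choice of `LogisticRegression` are assumptions for illustration.

```python
import numpy as np
from sklearn.pipeline import make_pipeline, make_union
from sklearn.preprocessing import FunctionTransformer, PolynomialFeatures, StandardScaler
from sklearn.linear_model import LogisticRegression

# Synthetic data (an assumption for illustration)
rng = np.random.RandomState(42)
X_demo = rng.normal(size=(80, 2))
y_demo = (X_demo[:, 0] ** 2 + X_demo[:, 1] > 1).astype(int)

def first_column(X):
    # Hypothetical stand-in for FeatureExtractor: keep column 0 as a 2D array
    return X[:, [0]]

def second_column(X):
    # Hypothetical stand-in for FeatureExtractor: keep column 1 as a 2D array
    return X[:, [1]]

# Parallel part: one two-step pipeline and one single transformer
features = make_union(
    make_pipeline(FunctionTransformer(first_column),
                  PolynomialFeatures(2, include_bias=False)),
    FunctionTransformer(second_column)
)

# Serial part: features -> scaling -> model, all fit with one call
full_model = make_pipeline(features, StandardScaler(), LogisticRegression())
full_model.fit(X_demo, y_demo)

# x0, x0^2, and x1 come out of the union
n_features = full_model.named_steps['featureunion'].transform(X_demo).shape[1]
print(n_features)
print(full_model.score(X_demo, y_demo))
```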

Let's use this to create a quick model using the `abalone` data. My model will consist of the following steps:

1. Data Transformation (in parallel):
    - Extract `sex` and create dummy variables (one pipeline)
    - Extract `length` and do nothing else (one transformer)
    - Extract `height` and do nothing else (one transformer)
    - Extract `diameter` and create polynomial features (one pipeline)
2. Scale and fit a `RandomForestClassifier` (one pipeline)

First, I'll create the two multistep extractors (for `sex` and `diameter`):

In [None]:
extract_sex_pipeline = make_pipeline(
    CategoricalExtractor('sex'),
    OneHotEncoder(sparse=False, handle_unknown='ignore')
)

extract_diameter_pipeline = make_pipeline(
    FeatureExtractor('diameter'),
    PolynomialFeatures(2, include_bias=False)
)

Next, I'll create the feature union that extracts everything and makes one large array with all of my transformed columns:

In [None]:
feature_transformers = make_union(
    extract_sex_pipeline,
    FeatureExtractor('length'),
    FeatureExtractor('height'),
    extract_diameter_pipeline
)

Finally, I'll set up my main modeling pipeline. This has three steps:

1. Do all of the feature transformations
2. Standard Scale the entire set of data
3. Pass into `RandomForestClassifier`

In [None]:
modeling_pipe = make_pipeline(
    feature_transformers,
    StandardScaler(),
    RandomForestClassifier()
)

And finally, we'll fit and score our `RandomForestClassifier` on the training data, then the test data, then use predictions to see how well we did via a confusion matrix and classification report:

In [None]:
modeling_pipe.fit(X_train, y_train)

In [None]:
modeling_pipe.score(X_train, y_train)

Not bad!

In [None]:
modeling_pipe.score(X_test, y_test)

Worse!

In [None]:
predictions = modeling_pipe.predict(X_test)

print(pd.DataFrame(confusion_matrix(y_test, predictions),
                  columns=['Predicted 0', 'Predicted 1'],
                  index=['Actual 0', 'Actual 1']))
print('\n')
print(classification_report(y_test, predictions))

All that work for nothing!

## Check for Understanding 3 (15 minutes)

Your goal for the next 15 minutes is to try and hack our model to make our predictions on the test set more accurate. How you choose to do so is up to you (provided that your changes are added to the pipeline), but you can consider doing any of the following:
- Modify the `feature_transformers` portion of our pipeline to add in more features or change how they are transformed.
- Change the type of model used in the pipeline.
- Change the hyperparameters that are used in the model

All of the code from above has been copied below to make editing easier for you:

In [None]:
extract_sex_pipeline = make_pipeline(
    CategoricalExtractor('sex'),
    OneHotEncoder(sparse=False, handle_unknown='ignore')
)

extract_diameter_pipeline = make_pipeline(
    FeatureExtractor('diameter'),
    PolynomialFeatures(2, include_bias=False)
)

feature_transformers = make_union(
    extract_sex_pipeline,
    FeatureExtractor('length'),
    FeatureExtractor('height'),
    extract_diameter_pipeline
)

modeling_pipe = make_pipeline(
    feature_transformers,
    StandardScaler(),
    RandomForestClassifier()
)

modeling_pipe.fit(X_train, y_train)
print('Training Set Score:', modeling_pipe.score(X_train, y_train))
print('Test Set Score:', modeling_pipe.score(X_test, y_test))
predictions = modeling_pipe.predict(X_test)

print('Confusion Matrix')
print(pd.DataFrame(confusion_matrix(y_test, predictions),
                  columns=['Predicted 0', 'Predicted 1'],
                  index=['Actual 0', 'Actual 1']))
print('\nClassification Report')
print(classification_report(y_test, predictions))

# Conclusion

`Pipeline` and `FeatureUnion` give us a robust and powerful way to transform data and fit models systematically and reproducibly. In other words, these tools allow us to do the following:

1. Be explicit about the transformations that our data undergoes as well as the final model type
2. Package everything up into one command, dramatically reducing the chance (compared to doing things manually) that user error affects our results.

This sort of structure means that Pipelines are typically best suited to a refactoring step -- once we have a model and associated data transformations that we are happy with, we can refactor our code to apply those transformations reproducibly. 

A bonus section follows this that briefly introduces the other case where `Pipeline` can be very useful -- in letting us use `GridSearchCV` to iterate over many more parameters than just the hyperparameters of one model. This is fairly advanced material (and dives deep into the guts of sklearn) so we will cover it only briefly, if at all, in class.

## **Bonus**: Applying `GridSearchCV` to Pipelines

You do not need to fully master the following content or be able to apply it on your own (and we may not even get to it in class, depending on time). This is optional material to look at and work through if you're interested in some of the guts of sklearn.

One of the more fascinating parts about `Pipeline` objects is that they are recognized as estimators, just like any of your model objects. One wrinkle is that we will need to change how we write out the dictionary of parameters to grid search over. Let's take the pipeline we were working with on the `abalone` dataset:

In [None]:
extract_sex_pipeline = make_pipeline(
    CategoricalExtractor('sex'),
    OneHotEncoder(sparse=False, handle_unknown='ignore')
)

extract_diameter_pipeline = make_pipeline(
    FeatureExtractor('diameter'),
    PolynomialFeatures(2, include_bias=False)
)

feature_transformers = make_union(
    extract_sex_pipeline,
    FeatureExtractor('length'),
    FeatureExtractor('height'),
    extract_diameter_pipeline
)

modeling_pipe = make_pipeline(
    feature_transformers,
    StandardScaler(),
    RandomForestClassifier()
)

If we look at `modeling_pipe` we can see the names of each of the steps:

In [None]:
modeling_pipe.named_steps

What we're most interested in are the keys:

In [None]:
modeling_pipe.named_steps.keys()

We're going to use these keys in the grid of parameters we pass to `GridSearchCV`. Originally, when we were only fitting an estimator, we would pass them in like this:

```python
params_grid = {'NAMEOFHYPERPARAMETER': [values of hyperparameter]}
```

Now that we have multiple steps, we need to preface each of the entries in that dictionary with which step has that hyperparameter first:

```python
params_grid = {'STEPNAME__HYPERPARAMETER': [values]}
```

We'll separate each of the steps with `__` (double underscore).

A good way to think about how to craft this is as follows:

In [None]:
modeling_pipe.named_steps['randomforestclassifier']

To access any of the hyperparameters here, we would want to set up our params grid like this:

```python
params_grid = {'randomforestclassifier__n_estimators': [10, 100, 1000]}
```

If we want to change the items in the `featureunion` step, it's a little trickier. We can see all of the steps through the following view:

In [None]:
modeling_pipe.named_steps['featureunion'].transformer_list

We would want to continue diving down named steps until we found the actual step we wanted:

For example, if we wanted to grid search and see if it would be better to have 1, 2, or 3 polynomial features for `diameter`, we would need to traverse down:

(notice that we're looking at the **first** entry in each of these tuples!)

1. The named step in the final pipeline: `featureunion`
2. The next step that contains the transformer: `pipeline-2`
3. Inside of `pipeline-2` is the transformer that contains the thing we want to tweak (`PolynomialFeatures`): `polynomialfeatures`

That means our entry in the parameter grid would be:

```python
params_grid = {'featureunion__pipeline-2__polynomialfeatures__degree': [1, 2, 3]}
```

where each named step is printed, in order, and then separated by double underscores `__` until the actual hyperparameter.
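
One way to check a nested key like this before running a grid search is to look it up in `.get_params()`, which lists every settable parameter by its full double-underscore path. The sketch below rebuilds a similar nested structure from stock sklearn transformers. The structure is an assumption for illustration, not the `abalone` pipeline itself.

```python
from sklearn.pipeline import make_pipeline, make_union
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.linear_model import LogisticRegression

demo_pipe = make_pipeline(
    make_union(
        make_pipeline(StandardScaler()),                            # auto-named 'pipeline-1'
        make_pipeline(PolynomialFeatures(2, include_bias=False))    # auto-named 'pipeline-2'
    ),
    LogisticRegression()
)

key = 'featureunion__pipeline-2__polynomialfeatures__degree'

# The key exists in the parameter dictionary
print(key in demo_pipe.get_params())

# set_params uses the same path that GridSearchCV would use
demo_pipe.set_params(**{key: 3})
new_degree = demo_pipe.get_params()[key]
print(new_degree)
```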

It's a little messy, but this lets `GridSearchCV` optimize a number of things at once!

Let's do one final example. Again, this is _fairly_ advanced and included more as a sneak peek for what sklearn is capable of (not for you to be able to replicate on your own immediately). 

Here, we're going to let `GridSearchCV` optimize the following choices for us:

1. How many features to keep, using `SelectKBest`
2. Whether or not to use `StandardScaler`
3. Which of the following three models to use:
    - `RandomForestClassifier` (at `n_estimators: 1000`)
    - `KNeighborsClassifier` (at `n_neighbors: 5`)
    - `LogisticRegression` (at default)
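
For the first choice, here is a minimal sketch of what `SelectKBest` does on its own, using synthetic data from `make_classification` (the data and the value `k=2` are assumptions for illustration). It scores every column against the target with `f_classif` and keeps only the `k` highest-scoring columns.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic data (an assumption for illustration): 6 features, 3 informative
X_demo, y_demo = make_classification(n_samples=100, n_features=6,
                                     n_informative=3, random_state=0)

selector = SelectKBest(score_func=f_classif, k=2)
X_selected = selector.fit_transform(X_demo, y_demo)

print(X_demo.shape, X_selected.shape)
print(selector.get_support())  # boolean mask of the columns that were kept
```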

In [None]:
extract_sex_pipeline = make_pipeline(
    CategoricalExtractor('sex'),
    OneHotEncoder(sparse=False, handle_unknown='ignore')
)

feature_transformer = make_union(
    extract_sex_pipeline,
    FeatureExtractor('length'),
    FeatureExtractor('diameter'),
    FeatureExtractor('height'),
    FeatureExtractor('whole_weight'),
    FeatureExtractor('shucked_weight'),
    FeatureExtractor('viscera_weight'),
    FeatureExtractor('shell_weight')
)

model = Pipeline([
    ('feature', feature_transformer),
    ('selectkbest', SelectKBest(score_func=f_classif)),
    ('scaler', StandardScaler()),
    ('clf', LogisticRegression())
])

model.named_steps.keys()

In [None]:
grid = {
    'selectkbest__k': range(2, 10, 2), # tries k = 2, 4, 6, 8 selected columns
    'scaler': [None, StandardScaler()], # tries StandardScaler or skips step
    'clf': [RandomForestClassifier(n_estimators=1000),
           KNeighborsClassifier(n_neighbors=5),
           LogisticRegression()],
}

gs = GridSearchCV(model, grid, verbose=1, n_jobs=-1)
gs.fit(X_train, y_train)

In [None]:
gs.best_score_

In [None]:
gs.best_params_

Apparently a `RandomForestClassifier` with no scaling and 8 features selected by `SelectKBest` had the best cross-validated score on our training set. How does this look against our test set?

In [None]:
gs.best_estimator_.score(X_test, y_test)

(Because we passed in a `Pipeline` object into `GridSearchCV`, the `.best_estimator_` in the grid search is the entire pipeline -- making it really easy to check our results against a holdout set!)
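
A small sketch of that point on synthetic data follows. The data, the two-step pipeline, and the reduced grid are assumptions for illustration, chosen to run quickly. After fitting, `.best_estimator_` is a fitted `Pipeline` that can score a holdout set directly.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression

# Synthetic data (an assumption for illustration)
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo,
                                          test_size=0.25, random_state=0)

demo_model = Pipeline([
    ('scaler', StandardScaler()),
    ('clf', LogisticRegression()),
])

demo_grid = {
    'scaler': [None, StandardScaler()],   # None skips the scaling step
    'clf': [LogisticRegression(), KNeighborsClassifier(n_neighbors=5)],
}

demo_gs = GridSearchCV(demo_model, demo_grid, cv=3)
demo_gs.fit(X_tr, y_tr)

# The best estimator is the whole pipeline, refit on the training data
best = demo_gs.best_estimator_
holdout_score = best.score(X_te, y_te)
print(type(best).__name__)
print(holdout_score)
```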

In [None]:
predictions = gs.best_estimator_.predict(X_test)

print('Confusion Matrix')
print(pd.DataFrame(confusion_matrix(y_test, predictions),
                  columns=['Predicted 0', 'Predicted 1'],
                  index=['Actual 0', 'Actual 1']))
print('\nClassification Report')
print(classification_report(y_test, predictions))

We might want to swap between different modeling techniques like this once we have a good idea of which settings work best for each of them and need to make a final determination across them!