In [78]:
import numpy as np
import pandas as pd

%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('ggplot')

from stringstamper import StringStamper

from sklearn.preprocessing import StandardScaler, KBinsDiscretizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.pipeline import Pipeline, FeatureUnion

# Object Oriented Programming and Sklearn

In this lesson we will be using the `sklearn` library to give examples of how object oriented programming is used in practice.  Our objective is not so much to learn about `sklearn`, but to explore how a professionally developed, widely used library uses the organizational principles of object oriented programming to provide a good user experience.

The main theme of our explorations will be the power of **Providing a Consistent Interface**.

One of the fundamental commandments of OOP is:

> Program to an interface instead of an implementation.

Sklearn is a stellar example of this important programming philosophy.
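As a small illustration of that commandment, here is a minimal sketch using two hypothetical classes (`Upper` and `Reverser` are invented for this example and belong to no library). The function `apply_all` relies only on each object having a `transform` method; it never checks which class an object belongs to.

```python
class Upper:
    """Hypothetical example class: upper-cases a string."""
    def transform(self, text):
        return text.upper()


class Reverser:
    """Hypothetical example class: reverses a string."""
    def transform(self, text):
        return text[::-1]


def apply_all(steps, text):
    """Apply each step in order, relying only on the shared `transform` method."""
    for step in steps:
        text = step.transform(text)
    return text


result = apply_all([Upper(), Reverser()], "hobbit")
print(result)  # TIBBOH
```

Any new class that provides a `transform` method could be dropped into the list without changing `apply_all` at all.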

## Review Example: The String Stamper

To begin our exploration, let's review the basic concepts of object oriented programming using a simple `class` I wrote called `StringStamper`.

Here's how string stamper works:

In [16]:
# Create a new object:
ss_matt = StringStamper(message="Property of Matt.")

# Use the object to stamp something:
print(ss_matt.stamp("The Hobbit."))

# Use the SAME object to stamp a different thing:
print(ss_matt.stamp("The Lord of the Rings."))

# Create a DIFFERENT string stamper object:
ss_jack = StringStamper(message="Property of Jack.")

# Jack has a different copy of the hobbit.
print(ss_jack.stamp("The Hobbit."))
print(ss_jack.stamp("The Lord of the Rings."))

The Hobbit. Property of Matt.
The Lord of the Rings. Property of Matt.
The Hobbit. Property of Jack.
The Lord of the Rings. Property of Jack.


#### Questions:

- What things above can be considered *objects* in python?
- What things above can be considered *classes*?
- Where did we use a *constructor*, i.e. where did we create new objects?
- What things above can be considered *methods*?
- What is the type of `ss_matt` and `ss_jack`?

Here's the code for the `StringStamper`:

In [15]:
class StringStamper:
    """Stamps a message on the end of a string.

    Attributes
    ----------
    message: str
      A message to stamp on the end of a string.

    Usage
    -----
    $ stamper = StringStamper("Property of Matt.")
    $ stamper.stamp("Elements of Statistical Learning.")
    "Elements of Statistical Learning. Property of Matt."
    """
    def __init__(self, message):
        self.message = message

    def stamp(self, string):
        return string + " " + self.message

#### Questions:

- What does the `class` keyword do here?
- What does the `__init__` method do?
- What does the notation `self.message = message` do?
- What does `self` refer to?

Generally, as data scientists, we will be **consumers** of code written in an object oriented style.  So let's spend the rest of the lesson studying an important example of this style from the `sklearn` library.

## Sklearn and the Transformer Interface

[Sklearn](https://scikit-learn.org/stable/), or scikit-learn, is the standard library for machine learning in Python.  It contains many, many tools that will take a very long time for us to explore.  Our goal for today will be to explore only a small corner of it, the [transformers](https://scikit-learn.org/stable/modules/preprocessing.html#preprocessing).

**Transformers** are tools that transform data sets.  This is a very common operation: we have some set of data, and we want to modify it in some consistent way for use in some task.  Sklearn expresses these types of operations in a very consistent way, and it leans heavily on the concepts of OOP to create a consistent user experience.

Let's start with a couple of examples.

### Standardizing Data

One of the most ubiquitous transformers is the `StandardScaler`, which is used to **standardize** a data set.

A vector $x$ is said to be **standardized** if it has mean zero and standard deviation one.  If we take *any* vector, then we can transform it into a standardized one by subtracting its mean and dividing by its standard deviation.  This process is called **standardization**.

$$ x_{\text{standardized}} = \frac{x - \bar x}{\text{sd}(x)} $$

If you have some experience with `numpy`, then you can probably see that standardizing a numpy array is quite simple:

In [19]:
x = np.array([1, 0, 2, 2, 0, 1, 0, 2])
x_standardized = (x - np.mean(x)) / np.std(x)

The mean and standard deviation of the standardized vector are zero and one respectively:

In [20]:
print("Mean of standardized vector: ", np.mean(x_standardized))
print("Standard deviation of standardized vector: ", np.std(x_standardized))

Mean of standardized vector:  0.0
Standard deviation of standardized vector:  1.0


This approach to standardization is simple and understandable, which are certainly virtues!  It does start to run into some issues when used in real production machine learning code though:

  - We often want to standardize many vectors together in a bundle, but use different means and standard deviations for the different vectors.  This becomes awkward with the straight numpy approach.
  - Because of a concept called *data leakage*, which we will discuss later, it is often necessary to standardize one vector and then use the **same** mean and standard deviation to transform other, different vectors.  This requires us to remember a bunch of means and standard deviations, and it's good to have an organizing principle for this type of work.
  
Whenever we are in a situation where data transformations of some kind depend on some **parameters** (like the mean and standard deviation of the vector), the concept of object oriented programming starts to shine.
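To make the bookkeeping concrete, here is a sketch of both situations in plain numpy. The arrays `X_train` and `X_new` are made-up data for illustration only.

```python
import numpy as np

# Made-up data: two columns on very different scales.
X_train = np.array([[1.0, 100.0],
                    [2.0, 200.0],
                    [3.0, 300.0]])
X_new = np.array([[2.0, 400.0]])

# One mean and one standard deviation per column (axis=0).
train_means = X_train.mean(axis=0)
train_sds = X_train.std(axis=0)

# We have to carry these two arrays around ourselves, and remember to
# reuse them on the new data instead of recomputing them there.
X_train_standardized = (X_train - train_means) / train_sds
X_new_standardized = (X_new - train_means) / train_sds

print(X_train_standardized.mean(axis=0))
print(X_new_standardized)
```

This works, but nothing ties `train_means` and `train_sds` to the operation that uses them. Bundling parameters together with the code that needs them is exactly what an object is for.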

### The Standard Scaler

Sklearn includes a class used for standardizing all of the columns in a data set.

The `StandardScaler` class implements the **transformer interface**.   Here's how you use it.

#### 1. Create a `StandardScaler` object.

In [22]:
standardizer = StandardScaler()

#### 2.  Fit the `StandardScaler` object to a data set.

Use the `fit` method, and pass in the data set you would like to standardize.  Behind the scenes this computes and memorizes the mean and standard deviation of all the columns in the dataset.

We'll use a data set about distance to wells in Bangladesh as our working example:

In [26]:
wells = pd.read_csv('./wells.dat', sep=' ', index_col='id')
wells.head()

| id | switch | arsenic | dist | assoc | educ |
|---|---|---|---|---|---|
| 1 | 1 | 2.36 | 16.826 | 0 | 0 |
| 2 | 1 | 0.71 | 47.321999 | 0 | 0 |
| 3 | 0 | 2.07 | 20.966999 | 0 | 10 |
| 4 | 1 | 1.15 | 21.486 | 0 | 12 |
| 5 | 1 | 1.1 | 40.874001 | 1 | 14 |


In [28]:
standardizer.fit(wells)

StandardScaler(copy=True, with_mean=True, with_std=True)

#### Question:

Notice that calling the fit method doesn't seem to really **do** much of anything.  So, what happens behind the scenes when we call the `fit` method?

#### 3. Transform a (possibly) different data set with the `StandardScaler` object.

Use the `transform` method on *any* dataset to perform the standardization (i.e., subtract the memorized mean from each column and divide by its memorized standard deviation).

In [30]:
wells_standardized = standardizer.transform(wells)

We now have a **new** dataset that is a transformed version of our wells dataset.

In [36]:
wells_standardized.head()

AttributeError: 'numpy.ndarray' object has no attribute 'head'

Huh?

This is a common surprise when working with sklearn: its transformers accept data frames as input, but by default they return numpy arrays, **not** data frames.

In [38]:
type(wells_standardized)

numpy.ndarray

We can make the transformed data back into a dataframe using standard pandas techniques we've already learned:

In [39]:
wells_standardized = pd.DataFrame(wells_standardized, columns=wells.columns)

In [41]:
wells_standardized.head()

|   | switch | arsenic | dist | assoc | educ |
|---|---|---|---|---|---|
| 0 | 0.859436 | 0.634996 | -0.818923 | -0.855947 | -1.202115 |
| 1 | 0.859436 | -0.855245 | -0.026249 | -0.855947 | -1.202115 |
| 2 | -1.163554 | 0.373075 | -0.711287 | -0.855947 | 1.287521 |
| 3 | 0.859436 | -0.457848 | -0.697797 | -0.855947 | 1.785448 |
| 4 | 0.859436 | -0.503006 | -0.19385 | 1.168297 | 2.283375 |


**Question:**

What's going on here? Why are there only two unique values of `switch` in the standardized data frame?

In [58]:
wells_standardized.loc[:, "switch"].unique()

array([ 0.85943576, -1.1635541 ])

**Question:**

What are the means and standard deviations of the columns in `wells_standardized`?

In [55]:
for name, (_, col) in zip(wells.columns, wells_standardized.T.iterrows()):
    print("Mean of column {}: {:2.2f}".format(name, col.mean()))
    print("Standard Deviation of column {}: {:2.2f}".format(name, col.std()))

Mean of column switch: -0.00
Standard Deviation of column switch: 1.00
Mean of column arsenic: -0.00
Standard Deviation of column arsenic: 1.00
Mean of column dist: 0.00
Standard Deviation of column dist: 1.00
Mean of column assoc: -0.00
Standard Deviation of column assoc: 1.00
Mean of column educ: 0.00
Standard Deviation of column educ: 1.00
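Step 3 says the transformed data set may differ from the one used to fit. Here is a minimal sketch of that case, using small made-up arrays (`X_train` and `X_new` are illustrative, not part of the wells data). It reads the memorized parameters from the scaler's `mean_` and `scale_` attributes.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up data for illustration.
X_train = np.array([[1.0, 10.0],
                    [2.0, 20.0],
                    [3.0, 30.0]])
X_new = np.array([[2.0, 40.0]])

scaler = StandardScaler()
scaler.fit(X_train)

# The parameters memorized during fit, one entry per column.
print(scaler.mean_)
print(scaler.scale_)

# The new data is standardized with the means and standard deviations
# of X_train, not with its own.
X_new_scaled = scaler.transform(X_new)
print(X_new_scaled)
```

Because `X_new` is scaled with the parameters learned from `X_train`, its columns will generally not have mean zero and standard deviation one.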


## Discretization (Binning)

It is occasionally useful to take a continuous feature and convert it into a discrete feature by binning together the values of the original feature that fall within certain ranges.

For example, if we start with the feature vector:

$$ x = \left( \begin{array}{cccccccccc} 0.00 & 0.15 & 0.71 & 0.79 & 0.37 & 1.00 & 0.36 & 0.06 & 0.04 & 0.15 \end{array} \right) $$

We may have occasion to bin this into three buckets, say:

$$ B_1 = \left( -\infty, \frac{1}{3} \right\rbrack, \ B_2 = \left( \frac{1}{3}, \frac{2}{3} \right\rbrack, \ B_3 = \left( \frac{2}{3}, 1 \right\rbrack $$

If we label any data point in the first bucket as a `0`, and in the second bucket as a `1`, and any in the final bucket as a `2`, then this would transform our vector into:

$$ x_{\text{bucketed}} = \left( \begin{array}{cccccccccc} 0 & 0 & 2 & 2 & 1 & 2 & 1 & 0 & 0 & 0 \end{array} \right) $$
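This hand-specified bucketing can be sketched in plain numpy with `np.digitize`. Passing `right=True` makes each bucket closed on the right, matching the intervals $B_1, B_2, B_3$ above; the variable names are chosen for this example.

```python
import numpy as np

x = np.array([0.00, 0.15, 0.71, 0.79, 0.37, 1.00, 0.36, 0.06, 0.04, 0.15])

# Only the interior endpoints are needed: anything at or below 1/3 gets
# label 0, anything above 2/3 gets label 2.
inner_edges = np.array([1 / 3, 2 / 3])
x_bucketed = np.digitize(x, inner_edges, right=True)

print(x_bucketed)  # [0 0 2 2 1 2 1 0 0 0]
```

Note that 0.36 and 0.37 both fall in the middle bucket, since each lies between 1/3 and 2/3.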

Sklearn also contains a tool for this type of operation.

In [148]:
# Note: Sklearn tools are designed to work with *column* vectors!
# This means you will often have to reshape(-1, 1) your row vectors into column vectors.
x = np.array([0.00, 0.15, 0.71, 0.79, 0.37, 1.00, 0.36, 0.06, 0.04, 0.15]).reshape(-1, 1)

#### 1. Create a `KBinsDiscretizer` object.

This time we need to supply a few parameters: the number of bins we want, and a strategy for computing the endpoints of the bins.  For information on what strategies are implemented, and how these strategies work, see [the documentation](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.KBinsDiscretizer.html#sklearn.preprocessing.KBinsDiscretizer).

**Note**: You'll find in the documentation that there is a simple strategy that sklearn has **not** implemented: the user supplying the endpoints manually!  This happens often, where a tool provides some useful things, but not the thing that you really need.  This is why it is important to get practice programming, and develop the courage to build and test your own tools.

In [85]:
binner = KBinsDiscretizer(n_bins=3, encode='ordinal', strategy='quantile')

#### 2. Fit the KBinsDiscretizer object to a data set.

This works the same way as fitting the `StandardScaler` object.

In [86]:
binner.fit(x)

KBinsDiscretizer(encode='ordinal', n_bins=3, strategy='quantile')

#### Questions:

Again, what's going on behind the scenes when we call the `fit` method?

#### 3. Transform a (possibly) different data set with the `KBinsDiscretizer` object.

Again, this works the same way as with `StandardScaler`.

In [87]:
binner.transform(x)

array([[0.],
       [1.],
       [2.],
       [2.],
       [2.],
       [2.],
       [1.],
       [0.],
       [0.],
       [1.]])

And, of course, we can transform a different array:

In [92]:
binner.transform(np.random.uniform(size=(10, 1)))

array([[2.],
       [1.],
       [2.],
       [1.],
       [0.],
       [2.],
       [0.],
       [0.],
       [1.],
       [2.]])
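If you are curious which endpoints the fitted object is using, they can be inspected after fitting. A minimal sketch, assuming the same vector as above: the fitted `KBinsDiscretizer` exposes a `bin_edges_` attribute holding one array of edges per column. The exact interior edges depend on how your sklearn version computes quantiles, so none are asserted here.

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

x = np.array([0.00, 0.15, 0.71, 0.79, 0.37, 1.00, 0.36, 0.06, 0.04, 0.15]).reshape(-1, 1)

binner = KBinsDiscretizer(n_bins=3, encode='ordinal', strategy='quantile')
binner.fit(x)

# One array of edges per column; we only have a single column here.
edges = binner.bin_edges_[0]
print(edges)
```

With three bins there are four edges: the minimum of the data, two interior cut points, and the maximum.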

## Creating our Own Transformer: Grabbing Specific Columns

We saw in our last example that we may need to create our own transformers.  Let's try our hand by creating a simple one that picks out specific columns in a data set.  So, for example, we should be able to use our transformer to:

  - Select the first column only.
  - Select all but the last column.
  - Select the columns at even indexes.
  
  
#### The Transformer interface.  

To define a transformer, we need to define a class that implements both `fit` and `transform` methods.  This framework of "write a class that implements certain methods" is very, very important in object oriented programming.

In [97]:
class ColumnSelector:
    """Select columns out of an array or DataFrame.
    
    Parameters
    ----------
    idxs: np.array of int
      The column indexes to select.
    """
    def __init__(self, idxs):
        self.idxs = np.asarray(idxs)
        
    # Fit here doesn't need to do anything.  We already know the indices of the columns
    # we want to keep.
    def fit(self, *args, **kwargs):
        return self
    
    def transform(self, X, **transform_params):
        # Need to treat pandas data frames and numpy arrays slightly differently since the [...]
        # indexing behaves differently for arrays and data frames.
        if isinstance(X, pd.DataFrame):
            return X.iloc[:, self.idxs]
        return X[:, self.idxs]

There are a few rules we need to follow:

  - `fit` needs to be defined as either `fit(self, *args, **kwargs)` (if we do not need to look at the data to fit the transformer), or `fit(self, X, *args, **kwargs)` (if we *do* need to look at the data).
  - `fit` needs to return `self`.  This is a common oversight, and will cause problems when using `Pipeline` below if forgotten.
  - `transform` needs to be defined as `transform(self, X, **transform_params)`, and returns the transformed data set.
  
This process, of implementing certain methods under some constraints, is called **coding to an interface**.  As long as it is done properly, it allows us to seamlessly use our objects inside of code that was designed to work with built in transformer objects.
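For contrast with `ColumnSelector`, here is a sketch of a transformer whose `fit` *does* need to look at the data. `SimpleStandardizer` is a hypothetical toy class written for this lesson, not sklearn's `StandardScaler`; for instance, it does not guard against columns with zero standard deviation.

```python
import numpy as np


class SimpleStandardizer:
    """Hypothetical toy transformer: standardizes each column.

    A teaching sketch only; it does not handle constant columns.
    """
    def fit(self, X, *args, **kwargs):
        X = np.asarray(X, dtype=float)
        # Learn the parameters from the data and memorize them on the object.
        self.means_ = X.mean(axis=0)
        self.sds_ = X.std(axis=0)
        # Returning self is what allows the fit(...).transform(...) chain below.
        return self

    def transform(self, X, **transform_params):
        X = np.asarray(X, dtype=float)
        return (X - self.means_) / self.sds_


X_demo = np.array([[1.0, 10.0],
                   [2.0, 20.0],
                   [3.0, 30.0]])
X_demo_std = SimpleStandardizer().fit(X_demo).transform(X_demo)
print(X_demo_std.mean(axis=0))
print(X_demo_std.std(axis=0))
```

If `fit` returned `None` instead of `self`, the chained call on the second-to-last data line would raise an `AttributeError`.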

Let's try it out.

In [100]:
# Selecting the first column
column_selector = ColumnSelector([0])
column_selector.fit()
wells_column_selected = column_selector.transform(wells)

Let's make sure our new DataFrame has exactly one column:

In [101]:
wells_column_selected.iloc[0:10, :]

| id | switch |
|---|---|
| 1 | 1 |
| 2 | 1 |
| 3 | 0 |
| 4 | 1 |
| 5 | 1 |
| 6 | 1 |
| 7 | 1 |
| 8 | 1 |
| 9 | 1 |
| 10 | 1 |
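The other selections listed at the start of this section only need different index arrays. Here is a sketch of how those arrays might be built for a made-up five-column array; the same arrays could be passed as the `idxs` argument of `ColumnSelector`, but they are applied directly here so the example stands alone.

```python
import numpy as np

# Made-up array with 4 rows and 5 columns.
X_demo = np.arange(20).reshape(4, 5)
n_cols = X_demo.shape[1]

all_but_last = np.arange(n_cols - 1)    # indexes 0, 1, 2, 3
even_indexes = np.arange(0, n_cols, 2)  # indexes 0, 2, 4

print(X_demo[:, all_but_last].shape)  # (4, 4)
print(X_demo[:, even_indexes])
```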


## Combining Objects: Pipelines

In [103]:
wells.head()

| id | switch | arsenic | dist | assoc | educ |
|---|---|---|---|---|---|
| 1 | 1 | 2.36 | 16.826 | 0 | 0 |
| 2 | 1 | 0.71 | 47.321999 | 0 | 0 |
| 3 | 0 | 2.07 | 20.966999 | 0 | 10 |
| 4 | 1 | 1.15 | 21.486 | 0 | 12 |
| 5 | 1 | 1.1 | 40.874001 | 1 | 14 |


One of the neat ideas that transformers allow is **chaining**.  We can take a single input dataset and apply multiple transformations in sequence.  This process is often called **pipelining** because we have metaphorically plumbed together a sequence of transformations.

The `Pipeline` class in sklearn allows us to chain together transformers, and optionally end the chain with a single regression or classification model.

In [106]:
wells_pipeline = Pipeline([
    # Select the columns that contain continuous data: arsenic and dist.
    ('continuous_column_selector', ColumnSelector([1, 2])),
    ('standardize', StandardScaler())
])

Above, we have chained together some of the transformers we discussed earlier:

```
X --ColumnSelector--> X_continuous_columns 
  --StandardScaler--> X_continuous_columns_standardized 
```

Once we have a `Pipeline` we only have to `fit` it **one time**.

In [107]:
wells_pipeline.fit(wells)

Pipeline(memory=None,
     steps=[('continuous_column_selector', <__main__.ColumnSelector object at 0x12e4bf748>), ('standardize', StandardScaler(copy=True, with_mean=True, with_std=True))])

This has **many, many advantages**.

  - It allows us to write **less code**.  Every line of code we write is a potential home for bugs, and bugs are bad.
  - It creates a **conceptual unit**.  The `Pipeline` makes clear that each of these transformations is intended to be used **together**.  This makes the code easier to understand.
  - It makes our code **harder to abuse**.  Since we intend the transformers to be used together, we would like other developers or our future selves to think hard about whether they want to separate them.
  - It makes the code **more reusable**.  Future developers only have to do **one thing** to reuse all our transformers, instead of needing to `fit` each of them separately.  This reduces the chance of mistakes.
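To see why a single `fit` call can be enough, here is a toy sketch of the idea behind a pipeline. `ToyPipeline`, `AddOne`, and `Center` are hypothetical classes invented for this illustration; this is not sklearn's actual implementation, only the core idea that each step is fit on the output of the step before it.

```python
import numpy as np


class ToyPipeline:
    """Hypothetical, simplified illustration of chaining transformers."""
    def __init__(self, steps):
        self.steps = steps  # list of (name, transformer) pairs

    def fit(self, X, *args, **kwargs):
        # Fit each step on the data as transformed by the previous steps.
        for _, step in self.steps:
            X = step.fit(X).transform(X)
        return self

    def transform(self, X, **transform_params):
        for _, step in self.steps:
            X = step.transform(X)
        return X


class AddOne:
    """Hypothetical transformer with nothing to learn."""
    def fit(self, X, *args, **kwargs):
        return self

    def transform(self, X, **transform_params):
        return X + 1


class Center:
    """Hypothetical transformer that memorizes the column means."""
    def fit(self, X, *args, **kwargs):
        self.mean_ = X.mean(axis=0)
        return self

    def transform(self, X, **transform_params):
        return X - self.mean_


toy = ToyPipeline([('add_one', AddOne()), ('center', Center())])
X_demo = np.array([[1.0], [2.0], [3.0]])
out = toy.fit(X_demo).transform(X_demo)
print(out.ravel())  # [-1.  0.  1.]
```

Notice that the loop in `fit` only works because every step's `fit` returns `self`, and that `ToyPipeline` itself follows the same `fit`/`transform` interface as the objects it contains.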

Once our pipeline is fit, we can use it to transform our data set.  This applies **both** the transformers in the `Pipeline` **in sequence**.

In [109]:
wells_pipeline.transform(wells)

array([[ 0.63499567, -0.81892321],
       [-0.85524506, -0.02624912],
       [ 0.37307458, -0.71128737],
       ...,
       [-1.0358803 , -1.05592488],
       [-0.9184674 , -0.66255101],
       [-0.90040387, -0.71448445]])

We can also access the various transformers and regressors/classifiers by using the `named_steps` attribute.

In [110]:
# The column means memorized by the pipeline.
print(wells_pipeline.named_steps['standardize'].mean_)
# The column standard deviations memorized by the pipeline.
print(wells_pipeline.named_steps['standardize'].scale_)

[ 1.65693046 48.33186257]
[ 1.10720366 38.47230347]


These contain the means and standard deviations of the two columns in the data set that we selected.

In [115]:
print("Mean of arsenic: ", np.mean(wells['arsenic']))
print("Standard Deviation of arsenic: ", np.std(wells['arsenic']))

Mean of arsenic:  1.656930463576163
Standard Deviation of arsenic:  1.1072036618468495


#### Question:

When were these means and standard deviations computed?

### Another Custom Transformer in a Pipeline: Polynomial Expansion

Often it is useful to fit a **polynomial term** in a regression model (stay tuned!).  So, instead of creating a regression like

$$ \text{WingSize} \approx a + b \times \text{Latitude} $$

We would fit a polynomial curve, for example, a quadratic like

$$ \text{WingSize} \approx a + b \times \text{Latitude} + c \times \text{Latitude}^2 $$

Our first task is to write a transformer class that consumes a **single** column, and creates a matrix with the square, cube, etc of the column.

In [127]:
class PolynomialExpansion:
    """Transform a single column array or data frame using a polynomial.
    
    Parameters
    ----------
    degree: int
      The highest power of the column to create.
    """
    def __init__(self, degree):
        self.degree = degree
        
    def fit(self, *args, **kwargs):
        # We still don't need to do anything when we fit this transformer, 
        # it doesn't need to learn anything from the data!
        return self
    
    def transform(self, X, **transform_params):
        # Initialize our return value as a matrix of all zeros.
        # We are going to overwrite all of these zeros in the code below.
        X_poly = np.zeros((X.shape[0], self.degree))
        # The first column in our transformed matrix is just the vector we started with.
        X_poly[:, 0] = X.squeeze()
        # Cleverness Alert:
        # We create the subsequent columns by multiplying the most recently created column
        # by X.  This creates the sequence X -> X^2 -> X^3 -> etc...
        for i in range(2, self.degree + 1):
            X_poly[:, i-1] = X_poly[:, i-2] * X.squeeze()
        return X_poly

Let's test this out on a simple example.

In [128]:
X = np.array([[1], [2], [3], [4]])
poly = PolynomialExpansion(3)
poly.fit(X)
poly.transform(X)

array([[ 1.,  1.,  1.],
       [ 2.,  4.,  8.],
       [ 3.,  9., 27.],
       [ 4., 16., 64.]])
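As a cross-check, the same matrix can be built without a loop by using numpy broadcasting: raising a column of shape `(n, 1)` to a row of powers of shape `(degree,)` yields an `(n, degree)` array. This is offered as an alternative sketch, not as a replacement for the class above.

```python
import numpy as np

X = np.array([[1], [2], [3], [4]])
degree = 3

# Powers 1, 2, ..., degree, broadcast against the single column of X.
powers = np.arange(1, degree + 1)
X_poly = X.astype(float) ** powers

print(X_poly)
```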

We can use this in a pipeline if our initial matrix has more than one column:  

  - We use the `ColumnSelector` to grab a single column.
  - Then a `PolynomialExpansion` to make two polynomial columns.

In [129]:
dist_poly = Pipeline([
    ('distance_selector', ColumnSelector([2])),
    ('quadratic_expansion', PolynomialExpansion(2))
])
dist_poly.fit(wells)

Pipeline(memory=None,
     steps=[('distance_selector', <__main__.ColumnSelector object at 0x12dc742b0>), ('quadratic_expansion', <__main__.PolynomialExpansion object at 0x12dc74390>)])

Now we can transform the data frame to get our polynomial terms.

In [130]:
quadratic_distance = dist_poly.transform(wells)
quadratic_distance

array([[  16.82600021,  283.11428319],
       [  47.3219986 , 2239.37155114],
       [  20.96699905,  439.61504933],
       ...,
       [   7.70800018,   59.41326682],
       [  22.84199905,  521.75692078],
       [  20.84399986,  434.47233028]])

The second column is indeed the square of the first:

In [131]:
quadratic_distance[:, 1] == quadratic_distance[:, 0]**2

array([ True,  True,  True, ...,  True,  True,  True])
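A word of caution: exact `==` comparisons between floating point numbers can fail when two computations reach the "same" value by different routes. `np.allclose`, which compares up to a small tolerance, is usually the safer check. A minimal illustration with made-up numbers:

```python
import numpy as np

lhs = np.array([0.1 + 0.2])
rhs = np.array([0.3])

print(lhs == rhs)             # [False]
print(np.allclose(lhs, rhs))  # True
```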

## Combining Objects: FeatureUnion

What if we want to create a polynomial expansion using **two** features in our model?

To accomplish this, we would need to grab two different columns, take a polynomial transformation of them individually, and then re-join the results into a single matrix:

```  
    +--- Select Column 1 --- Polynomial Expansion ---+
X --+                                                +--- Rejoin --> X transformed
    +--- Select Column 2 --- Polynomial Expansion ---+
```

The splitting and rejoining operation can be accomplished with another sklearn feature, the `FeatureUnion`.

### A silly example of `FeatureUnion`.

Here's a simple example:

```  
    +--- Select Column 1 ---+
X --+                       +--- Rejoin --> X transformed
    +--- Select Column 2 ---+
```

In [133]:
two_columns = FeatureUnion([
    ('switch_selector', ColumnSelector([0])),
    ('arsenic_selector', ColumnSelector([1]))
])
two_columns.fit(wells)
print(two_columns.transform(wells))

[[1.   2.36]
 [1.   0.71]
 [0.   2.07]
 ...
 [0.   0.51]
 [0.   0.64]
 [1.   0.66]]


You can see that all we've done is select the first two columns, which is, admittedly, kind of silly.  In this case we could have just used `ColumnSelector([0, 1])`.
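Conceptually, the rejoin step amounts to placing each branch's output side by side. Here is a rough sketch of that idea in plain numpy, using a small made-up array; it illustrates the concept and is not a description of sklearn's internals.

```python
import numpy as np

# Made-up data with three columns.
X_demo = np.array([[1.0, 2.36, 16.8],
                   [1.0, 0.71, 47.3],
                   [0.0, 2.07, 21.0]])

# Each "branch" produces its own two-dimensional output.
branch_one = X_demo[:, [0]]
branch_two = X_demo[:, [1]]

# Rejoin the branches by stacking their columns horizontally.
rejoined = np.hstack([branch_one, branch_two])
print(rejoined.shape)  # (3, 2)
```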

In [135]:
wells.head()

| id | switch | arsenic | dist | assoc | educ |
|---|---|---|---|---|---|
| 1 | 1 | 2.36 | 16.826 | 0 | 0 |
| 2 | 1 | 0.71 | 47.321999 | 0 | 0 |
| 3 | 0 | 2.07 | 20.966999 | 0 | 10 |
| 4 | 1 | 1.15 | 21.486 | 0 | 12 |
| 5 | 1 | 1.1 | 40.874001 | 1 | 14 |


### A more useful example of `FeatureUnion`.

Let's end by putting together the example I outlined above, with a standardization step added to each branch:

```  
    +--> Select Column 1 --> Standardize --> Polynomial ---+
X --+                                                      +--- Rejoin --> X transformed
    +--> Select Column 2 --> Standardize --> Polynomial ---+
```

We will use polynomials of degree 2.

In [144]:
wells_pipeline = FeatureUnion([
    ('arsenic_quadratic', Pipeline([
        ('arsenic_selector', ColumnSelector([1])),
        ('arsenic_standardizer', StandardScaler()),
        ('quadratic_expansion', PolynomialExpansion(2))
    ])),
    ('distance_quadratic', Pipeline([
        ('distance_selector', ColumnSelector([2])),
        ('distance_standardizer', StandardScaler()),
        ('quadratic_expansion', PolynomialExpansion(2))         
    ]))
])

This is now a pipeline of considerable complexity.  Even so, using it is exactly the same as any of the simpler pipelines that we constructed earlier.

In [145]:
wells_pipeline.fit(wells)

FeatureUnion(n_jobs=None,
       transformer_list=[('arsenic_quadratic', Pipeline(memory=None,
     steps=[('arsenic_selector', <__main__.ColumnSelector object at 0x12c3d25f8>), ('arsenic_standardizer', StandardScaler(copy=True, with_mean=True, with_std=True)), ('quadratic_expansion', <__main__.PolynomialExpansion object at 0x12c3d..., with_std=True)), ('quadratic_expansion', <__main__.PolynomialExpansion object at 0x12c3d20f0>)]))],
       transformer_weights=None)

In [146]:
wells_quadratic = wells_pipeline.transform(wells)

In [147]:
wells_quadratic

array([[ 6.34995675e-01,  4.03219507e-01, -8.18923213e-01,
         6.70635228e-01],
       [-8.55245061e-01,  7.31444115e-01, -2.62491165e-02,
         6.89016115e-04],
       [ 3.73074576e-01,  1.39184639e-01, -7.11287369e-01,
         5.05929721e-01],
       ...,
       [-1.03588030e+00,  1.07304800e+00, -1.05592488e+00,
         1.11497735e+00],
       [-9.18467395e-01,  8.43582357e-01, -6.62551010e-01,
         4.38973841e-01],
       [-9.00403871e-01,  8.10727132e-01, -7.14484453e-01,
         5.10488034e-01]])

#### Final Exercise:

Find another combination of transformers, `Pipeline`s, and `FeatureUnion`s that accomplishes the same task, then code it up and try it out!