# Scikit-Learn's new integration with Pandas
Scikit-Learn will make one of its biggest upgrades in recent years with its mammoth version 0.20 release. For many data scientists, a typical workflow consists of using Pandas to do exploratory data analysis before moving to scikit-learn for machine learning. This new release will make the process simpler, more feature-rich, robust, and standardized.

## Summary and goals of this article
This article is aimed at those who use Scikit-Learn as their machine learning library but depend on Pandas as their data exploration and preparation tool.

* It assumes you have some familiarity with both Scikit-Learn and Pandas
* We explore the new ColumnTransformer estimator, which allows us to apply separate transformations to different subsets of our data in parallel before concatenating the results together.
* A major pain point for users (and in my opinion the worst part of Scikit-Learn) was preparing a pandas DataFrame with string values in its columns. This process should become much more standardized.
* The OneHotEncoder estimator was given a nice upgrade to encode columns with string values. 
* To help with one hot encoding, we use the new SimpleImputer estimator to fill in missing values with constants
* We will build a custom estimator that does all the "basic" transformations on a DataFrame instead of relying on the built-in Scikit-Learn tools. This will also transform the data with a couple different features not present within Scikit-Learn.
* Finally, we explore binning numeric columns with the new KBinsDiscretizer estimator.

### A note before we get started
This tutorial is provided as a preview of things to come. The final version 0.20 has not been released. It is very likely that this tutorial will be updated at a future date to reflect any changes.

### Continuing…
If you use Pandas as your exploratory and preparation tool before moving to Scikit-Learn for machine learning, you are likely familiar with the non-standard process of handling columns containing strings. Scikit-Learn's machine learning models require the input to be a two-dimensional data structure of numeric values. No string values are allowed. Scikit-Learn never provided a canonical way to handle columns of strings, a very common occurrence in data science.

This led to numerous tutorials all handling string columns in their own way. Some solutions turned to the Pandas `get_dummies` function. Some used Scikit-Learn's `LabelBinarizer`, which does one-hot encoding but was designed for labels (the target variable) and not for the input. Others created their own custom estimators. Even entire packages such as [sklearn-pandas][1] were built to support this trouble spot. This lack of standardization made for a painful experience for those wanting to build machine learning models with string columns.

Furthermore, there was poor support for making transformations to specific columns and not to the entire dataset. For instance, it's very common to standardize continuous features but not categorical features.  This will now become much easier.

### Upgrading to version 0.20
The first release candidate for version 0.20 was put out just a couple days ago. You can install it with either conda:

> `conda install scikit-learn=0.20rc1 -c conda-forge/label/rc -c conda-forge`

or pip:

> `pip install --pre scikit-learn`

# Introducing `ColumnTransformer` and the upgraded `OneHotEncoder`
With the upgrade to version 0.20, many workflows from Pandas to Scikit-Learn should start looking similar. The `ColumnTransformer` estimator applies a transformation to a specific subset of columns of your Pandas DataFrame (or array).

The `OneHotEncoder` estimator is not new but has been upgraded to encode string columns. Before, it only encoded columns containing numeric categorical data.

Let's see how these new additions work to handle string columns in a Pandas DataFrame.

## Kaggle Housing Dataset
One of Kaggle's beginning machine learning competitions is the [Housing Prices: Advanced Regression Techniques][2]. The goal is to predict housing prices given about 80 features. There is a mix of continuous and categorical columns. You can download the data from the website or use their [command line tool][3] (which is very nice).

## Inspect the data
Let's read in our DataFrame and output the first few rows.

In [1]:
import pandas as pd
import numpy as np

train = pd.read_csv('data/housing/train.csv')
train.head()

Unnamed: 0,Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,...,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice
0,1,60,RL,65.0,8450,Pave,,Reg,Lvl,AllPub,...,0,,,,0,2,2008,WD,Normal,208500
1,2,20,RL,80.0,9600,Pave,,Reg,Lvl,AllPub,...,0,,,,0,5,2007,WD,Normal,181500
2,3,60,RL,68.0,11250,Pave,,IR1,Lvl,AllPub,...,0,,,,0,9,2008,WD,Normal,223500
3,4,70,RL,60.0,9550,Pave,,IR1,Lvl,AllPub,...,0,,,,0,2,2006,WD,Abnorml,140000
4,5,60,RL,84.0,14260,Pave,,IR1,Lvl,AllPub,...,0,,,,0,12,2008,WD,Normal,250000


In [2]:
train.shape

(1460, 81)

### Remove the target variable from the training set
The target variable is `SalePrice` which we remove and assign as an array to its own variable.

In [3]:
y = train.pop('SalePrice').values

# Encoding a single string column
To start off, let's encode a single string column, `HouseStyle`, which has values for the exterior of the house. Let's output the unique counts of each string value.

In [4]:
vc = train['HouseStyle'].value_counts()
vc

1Story    726
2Story    445
1.5Fin    154
SLvl       65
SFoyer     37
1.5Unf     14
2.5Unf     11
2.5Fin      8
Name: HouseStyle, dtype: int64

We have 8 unique values in this column.

In [5]:
len(vc)

8

## Scikit-Learn Gotcha - Must have 2D data
Most Scikit-Learn estimators require that data be strictly 2-dimensional. If we select the column above as `train['HouseStyle']`, this technically creates a Pandas Series, which is a single dimension of data. We can force Pandas to create a one-column DataFrame by passing a single-item list to the brackets.

In [6]:
hs_train = train[['HouseStyle']].copy()
hs_train.ndim

2
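
Two other common ways to get a 2D object from a single column are `Series.to_frame()` and reshaping the underlying NumPy array. The sketch below uses a tiny made-up DataFrame (`toy` is a hypothetical stand-in, not the housing data) to compare the dimensions of each result.

```python
import pandas as pd

# Tiny stand-in for the housing data (hypothetical values)
toy = pd.DataFrame({'HouseStyle': ['2Story', '1Story', '2Story']})

as_series = toy['HouseStyle']                           # 1D Series
as_frame = toy['HouseStyle'].to_frame()                 # 2D one-column DataFrame
as_array = toy['HouseStyle'].to_numpy().reshape(-1, 1)  # 2D NumPy array

print(as_series.ndim, as_frame.ndim, as_array.ndim)
print(as_frame.shape, as_array.shape)
```

The double-bracket selection used above and `to_frame()` both keep the column name, which the reshaped array loses.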

# Import, Instantiate, Fit - The three-step process for each estimator
The scikit-learn API is consistent for all estimators and uses a three-step process to train or fit the data. 

1. Import the estimator we want from the module it's located in.
1. Instantiate the estimator, possibly changing its defaults.
1. Fit the estimator to the data. Possibly transform the data to its new space if need be.

Below, we import `OneHotEncoder`, instantiate it and ensure that we get a dense (and not sparse) array returned and then encode our single column with the `fit_transform` method.

In [7]:
from sklearn.preprocessing import OneHotEncoder
ohe = OneHotEncoder(sparse=False)
hs_train_transformed = ohe.fit_transform(hs_train)
hs_train_transformed

array([[0., 0., 0., ..., 1., 0., 0.],
       [0., 0., 1., ..., 0., 0., 0.],
       [0., 0., 0., ..., 1., 0., 0.],
       ...,
       [0., 0., 0., ..., 1., 0., 0.],
       [0., 0., 1., ..., 0., 0., 0.],
       [0., 0., 1., ..., 0., 0., 0.]])

As expected, it has encoded each unique value as its own binary column.

In [8]:
hs_train_transformed.shape

(1460, 8)

# We have a NumPy array. Where are the column names?
Notice that our output is a NumPy array and not a Pandas DataFrame. Scikit-Learn was not originally built to be directly integrated with Pandas. All Pandas objects are converted to NumPy arrays internally and NumPy arrays are always returned after a transformation.

We can still get our column name from the `OneHotEncoder` object through its `get_feature_names` method.

In [9]:
feature_names = ohe.get_feature_names()
feature_names

array(['x0_1.5Fin', 'x0_1.5Unf', 'x0_1Story', 'x0_2.5Fin', 'x0_2.5Unf',
       'x0_2Story', 'x0_SFoyer', 'x0_SLvl'], dtype=object)
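
The `x0_` prefix simply means "first input column". If you would rather see the real column name, one option is to rebuild a labeled DataFrame from the encoder's fitted `categories_` attribute. This is only a sketch on made-up data: `toy`, `ohe_toy`, and `labels` are hypothetical names, and it uses the encoder's default sparse output converted with `.toarray()` rather than the dense setting used in this tutorial.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical stand-in for the HouseStyle column
toy = pd.DataFrame({'HouseStyle': ['2Story', '1Story', 'SLvl', '1Story']}, dtype=object)

ohe_toy = OneHotEncoder()                       # default: sparse matrix output
encoded = ohe_toy.fit_transform(toy).toarray()  # convert to a dense array

# categories_ holds one array of sorted categories per input column
labels = ['HouseStyle_' + str(cat) for cat in ohe_toy.categories_[0]]
encoded_df = pd.DataFrame(encoded, columns=labels, index=toy.index)
print(encoded_df)
```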

## Verifying our first row of data is correct
It's good to verify that our estimator is properly working. Let's look at the first row of encoded data.

In [10]:
row0 = hs_train_transformed[0]
row0

array([0., 0., 0., 0., 0., 1., 0., 0.])

This encodes the 6th value in the array as 1. Let's use boolean indexing to reveal the feature name.

In [11]:
feature_names[row0 == 1]

array(['x0_2Story'], dtype=object)

Now, let's verify that the first value in our original DataFrame column is the same.

In [12]:
hs_train.values[0]

array(['2Story'], dtype=object)

### Use `inverse_transform` to automate this
Just like most transformer objects, there is an `inverse_transform` method that will get you back your original data. Here we must wrap `row0` in a list to make it a 2D array.

In [13]:
ohe.inverse_transform([row0])

array([['2Story']], dtype=object)

We can verify all values by inverting the entire transformed array.

In [14]:
hs_inv = ohe.inverse_transform(hs_train_transformed)
hs_inv

array([['2Story'],
       ['1Story'],
       ['2Story'],
       ...,
       ['2Story'],
       ['1Story'],
       ['1Story']], dtype=object)

In [15]:
np.array_equal(hs_inv, hs_train.values)

True

## Applying transformation to the test set
Whatever transformation we do to our training set, we must apply to our test set. Let's read in the test set and get the same column and apply our transformation.

In [16]:
test = pd.read_csv('data/housing/test.csv')
test.head()

Unnamed: 0,Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,...,ScreenPorch,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition
0,1461,20,RH,80.0,11622,Pave,,Reg,Lvl,AllPub,...,120,0,,MnPrv,,0,6,2010,WD,Normal
1,1462,20,RL,81.0,14267,Pave,,IR1,Lvl,AllPub,...,0,0,,,Gar2,12500,6,2010,WD,Normal
2,1463,60,RL,74.0,13830,Pave,,IR1,Lvl,AllPub,...,0,0,,MnPrv,,0,3,2010,WD,Normal
3,1464,60,RL,78.0,9978,Pave,,IR1,Lvl,AllPub,...,0,0,,,,0,6,2010,WD,Normal
4,1465,120,RL,43.0,5005,Pave,,IR1,HLS,AllPub,...,144,0,,,,0,1,2010,WD,Normal


In [17]:
hs_test = test[['HouseStyle']].copy()
hs_test_transformed = ohe.transform(hs_test)
hs_test_transformed

array([[0., 0., 1., ..., 0., 0., 0.],
       [0., 0., 1., ..., 0., 0., 0.],
       [0., 0., 0., ..., 1., 0., 0.],
       ...,
       [0., 0., 1., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 1., 0.],
       [0., 0., 0., ..., 1., 0., 0.]])

We should again get 8 columns and we do.

In [18]:
hs_test_transformed.shape

(1459, 8)

This example works nicely, but there are multiple cases where we will run into problems. Let's examine them now.

# Trouble area #1 - Categories unique to the test set
What happens if we have a home with a house style that is unique to just the test set? Say something like `3Story`. Let's change the first value of the house styles and see what the default is from Scikit-Learn.

In [19]:
hs_test = test[['HouseStyle']].copy()
hs_test.iloc[0, 0] = '3Story'
print(hs_test.head(3))

  HouseStyle
0     3Story
1     1Story
2     2Story


In [20]:
ohe.transform(hs_test)

ValueError: Found unknown categories ['3Story'] in column 0 during transform

## Error: Unknown Category
By default, our encoder will produce an error. This is likely what we want as we need to know if there are unique strings in the test set. If you do have this problem then there could be something much deeper that needs investigating. For now, we will ignore this problem and encode this row as all 0's by setting the `handle_unknown` parameter to 'ignore' upon instantiation.

In [21]:
ohe = OneHotEncoder(sparse=False, handle_unknown='ignore')
ohe.fit(hs_train)

hs_test_transformed = ohe.transform(hs_test)
hs_test_transformed

array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 1., ..., 0., 0., 0.],
       [0., 0., 0., ..., 1., 0., 0.],
       ...,
       [0., 0., 1., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 1., 0.],
       [0., 0., 0., ..., 1., 0., 0.]])

Let's verify that the first row is all 0's.

In [22]:
hs_test_transformed[0]

array([0., 0., 0., 0., 0., 0., 0., 0.])
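
Before silencing the error, it can be worth listing exactly which test values were never seen during training. A minimal sketch with plain Pandas, using made-up Series (`toy_train` and `toy_test` are hypothetical names):

```python
import pandas as pd

# Hypothetical train and test columns
toy_train = pd.Series(['1Story', '2Story', 'SLvl', '1Story'], name='HouseStyle')
toy_test = pd.Series(['1Story', '3Story', '2Story'], name='HouseStyle')

# Values present in the test column but never seen in training
unseen = set(toy_test.dropna().unique()) - set(toy_train.dropna().unique())
print(unseen)

# Number of test rows that would be encoded as all zeros
n_affected = int(toy_test.isin(list(unseen)).sum())
print(n_affected)
```

If the count is large, that is a hint that the training and test sets differ in some deeper way.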

# Trouble area #2 - Missing Values in test set
If you have missing values in your test set (NaN or None), then these will be ignored as long as `handle_unknown` is set to 'ignore'.

In [23]:
hs_test = test[['HouseStyle']].copy()
hs_test.iloc[0, 0] = np.nan
hs_test.iloc[1, 0] = None
print(hs_test.head(4))

  HouseStyle
0        NaN
1       None
2     2Story
3     2Story


In [24]:
hs_test_transformed = ohe.transform(hs_test)
hs_test_transformed[:4]

array([[0., 0., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 1., 0., 0.],
       [0., 0., 0., 0., 0., 1., 0., 0.]])

# Trouble area #3 - Missing Values in training set
Missing values in the training set are more of an issue. As of now, the `OneHotEncoder` estimator cannot fit with missing values.

In [25]:
hs_train = hs_train.copy()
hs_train.iloc[0, 0] = np.nan
hs_train.head(3)

  HouseStyle
0        NaN
1     1Story
2     2Story


In [26]:
ohe = OneHotEncoder(sparse=False, handle_unknown='ignore')
ohe.fit_transform(hs_train)

TypeError: '<' not supported between instances of 'str' and 'float'

It would be nice if there was an option to ignore them like what happens when transforming the test set above. But this doesn't exist.

# Must impute missing values
For now, we must impute the missing values. The old `Imputer` from the preprocessing module has been deprecated. A new module, `impute`, was formed in its place, with a new estimator, `SimpleImputer`, and a new strategy, 'constant'. By default, this strategy fills missing values in string columns with the string 'missing_value'. We can choose a different replacement with the `fill_value` parameter.

In [27]:
hs_train = train[['HouseStyle']].copy()
hs_train.iloc[0, 0] = np.nan

from sklearn.impute import SimpleImputer
si = SimpleImputer(strategy='constant', fill_value='MISSING')
hs_train_imputed = si.fit_transform(hs_train)
hs_train_imputed

array([['MISSING'],
       ['1Story'],
       ['2Story'],
       ...,
       ['2Story'],
       ['1Story'],
       ['1Story']], dtype=object)
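
To see the default fill value next to a custom one, here is a small sketch on a made-up object column (`toy`, `default_fill`, and `custom_fill` are hypothetical names):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical string column with one missing value
toy = pd.DataFrame({'HouseStyle': ['2Story', np.nan, '1Story']}, dtype=object)

# No fill_value given: string data falls back to 'missing_value'
default_fill = SimpleImputer(strategy='constant').fit_transform(toy)

# Explicit fill_value, as used in this tutorial
custom_fill = SimpleImputer(strategy='constant', fill_value='MISSING').fit_transform(toy)

print(default_fill.ravel())
print(custom_fill.ravel())
```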

From here we can then encode as we did previously.

In [28]:
hs_train_transformed = ohe.fit_transform(hs_train_imputed)
hs_train_transformed

array([[0., 0., 0., ..., 1., 0., 0.],
       [0., 0., 1., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 1., ..., 0., 0., 0.],
       [0., 0., 1., ..., 0., 0., 0.]])

Notice that we now have an extra column and an extra feature name.

In [29]:
hs_train_transformed.shape

(1460, 9)

In [30]:
ohe.get_feature_names()

array(['x0_1.5Fin', 'x0_1.5Unf', 'x0_1Story', 'x0_2.5Fin', 'x0_2.5Unf',
       'x0_2Story', 'x0_MISSING', 'x0_SFoyer', 'x0_SLvl'], dtype=object)

### Apply both transformations to test set
We can manually apply each of the two steps above in order like this:

In [31]:
hs_test = test[['HouseStyle']].copy()
hs_test.iloc[0, 0] = 'unique value to test set'
hs_test.iloc[1, 0] = np.nan

hs_test_imputed = si.transform(hs_test)
hs_test_transformed = ohe.transform(hs_test_imputed)
hs_test_transformed.shape

(1459, 9)

## Why just the `transform` method for the test set?
When transforming the test set, it's important to call only the `transform` method and not `fit_transform`. When we ran `fit_transform` on the training set, Scikit-Learn learned everything it needs (here, the set of categories) to transform any other dataset containing the same columns.

In [32]:
ohe.get_feature_names()

array(['x0_1.5Fin', 'x0_1.5Unf', 'x0_1Story', 'x0_2.5Fin', 'x0_2.5Unf',
       'x0_2Story', 'x0_MISSING', 'x0_SFoyer', 'x0_SLvl'], dtype=object)
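
A small sketch of what goes wrong if you refit on the test set. The data is made up (`toy_train` and `toy_test` are hypothetical), and the test set happens to lack one of the training categories:

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical data: the test set is missing the 'SLvl' category
toy_train = pd.DataFrame({'HouseStyle': ['1Story', '2Story', 'SLvl']}, dtype=object)
toy_test = pd.DataFrame({'HouseStyle': ['2Story', '1Story']}, dtype=object)

ohe_toy = OneHotEncoder(handle_unknown='ignore')
ohe_toy.fit(toy_train)

# Correct: reuse the categories learned from the training data
good = ohe_toy.transform(toy_test).toarray()

# Incorrect: refitting learns a new, smaller set of categories from the test data
bad = OneHotEncoder(handle_unknown='ignore').fit_transform(toy_test).toarray()

print(good.shape)
print(bad.shape)
```

The refit version has a different number of columns, so a model trained on the training encoding could not use it.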

## Use a `Pipeline` instead
Scikit-Learn provides a Pipeline transformer and estimator that takes a list of transformations and applies them in succession. You can also run a machine learning model as the final estimator. Here we simply impute and encode.

In [33]:
from sklearn.pipeline import Pipeline

Each step is a two-item tuple consisting of a string that labels the step and the instantiated estimator.

In [34]:
si_step = ('si', SimpleImputer(strategy='constant', fill_value='MISSING'))
ohe_step = ('ohe', OneHotEncoder(sparse=False, handle_unknown='ignore'))
steps = [si_step, ohe_step]

pipe = Pipeline(steps)

hs_train = train[['HouseStyle']].copy()
hs_train.iloc[0, 0] = np.nan

hs_transformed = pipe.fit_transform(hs_train)
hs_transformed.shape

(1460, 9)

The test set is easily transformed by passing through each step of the pipeline.

In [35]:
hs_test = test[['HouseStyle']].copy()
hs_test_transformed = pipe.transform(hs_test)
hs_test_transformed.shape

(1459, 9)

# Multiple String Columns
Encoding multiple string columns is not a problem. Select the columns you want and then pass the new DataFrame through the pipeline again.

In [36]:
string_cols = ['RoofMatl', 'HouseStyle']
string_train = train[string_cols]
print(string_train.head(3))

  RoofMatl HouseStyle
0  CompShg     2Story
1  CompShg     1Story
2  CompShg     2Story


In [37]:
string_train_transformed = pipe.fit_transform(string_train)
string_train_transformed.shape

(1460, 16)

### Get individual pieces of the pipeline
It is possible to get each individual transformer through its name. In this instance, we get the one-hot encoder so that we can output the feature names.

In [38]:
ohe = pipe.named_steps['ohe']
ohe.get_feature_names()

array(['x0_ClyTile', 'x0_CompShg', 'x0_Membran', 'x0_Metal', 'x0_Roll',
       'x0_Tar&Grv', 'x0_WdShake', 'x0_WdShngl', 'x1_1.5Fin', 'x1_1.5Unf',
       'x1_1Story', 'x1_2.5Fin', 'x1_2.5Unf', 'x1_2Story', 'x1_SFoyer',
       'x1_SLvl'], dtype=object)

# Use the new ColumnTransformer to choose columns
The brand new `ColumnTransformer` (part of the new compose module) allows you to choose which columns get which transformations. Categorical columns will almost always need different transformations than continuous columns.

The `ColumnTransformer` is currently experimental, meaning that its functionality can change in the future. The `ColumnTransformer` takes a list of three-item tuples. The first value in the tuple is a name that labels it, the second is an instantiated estimator, and the third is a list of columns you want to apply the transformation to. The tuple will look like this:

```('name', SomeTransformer(parameters), columns)```

The columns actually don't have to be column names. Instead, you can use the integer indexes of the columns, a boolean array, or even a function (which accepts the entire DataFrame as the argument and must return a selection of columns).
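
As a rough illustration of these options, the sketch below selects the same two columns four different ways and checks that the results agree. Everything here is made up for the example: `toy` is a hypothetical DataFrame and `object_columns` is a hypothetical helper. Setting `sparse_threshold=0` is a choice made here to always get a dense array back.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

toy = pd.DataFrame({
    'LotArea': [8450, 9600, 11250],
    'HouseStyle': pd.Series(['2Story', '1Story', '2Story'], dtype=object),
    'RoofMatl': pd.Series(['CompShg', 'CompShg', 'Metal'], dtype=object),
})

def object_columns(df):
    # Hypothetical helper: return the names of the object-dtype columns
    return [col for col in df.columns if df[col].dtype == object]

selectors = {
    'names': ['HouseStyle', 'RoofMatl'],
    'positions': [1, 2],
    'mask': np.array([False, True, True]),
    'callable': object_columns,
}

results = {}
for label, cols in selectors.items():
    # sparse_threshold=0 asks ColumnTransformer to always return a dense array
    ct_toy = ColumnTransformer([('ohe', OneHotEncoder(), cols)], sparse_threshold=0)
    results[label] = ct_toy.fit_transform(toy)
    print(label, results[label].shape)
```

Column names tend to be the most readable choice with DataFrames, while a function is handy when the column list depends on the data itself.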

You can also use NumPy arrays with the ColumnTransformer, but this tutorial is focused on the integration of Pandas so we will stick with just using DataFrames.

# Pass a Pipeline to the ColumnTransformer
We can even pass a pipeline of many transformations to the column transformer, which is what we do here because we have multiple transformations on our string columns.

Below, we reproduce the above imputation and encoding using the `ColumnTransformer`. Notice that the pipeline is exactly the same as above, just with `cat` prepended to each variable name. We will add a different pipeline for the numeric columns in an upcoming section.

In [39]:
from sklearn.compose import ColumnTransformer

cat_si_step = ('si', SimpleImputer(strategy='constant', fill_value='MISSING'))
cat_ohe_step = ('ohe', OneHotEncoder(sparse=False, handle_unknown='ignore'))
cat_steps = [cat_si_step, cat_ohe_step]

cat_pipe = Pipeline(cat_steps)
cat_cols = ['RoofMatl', 'HouseStyle']
cat_transformers = [('cat', cat_pipe, cat_cols)]

ct = ColumnTransformer(transformers=cat_transformers)

## Pass the entire DataFrame to the ColumnTransformer
The ColumnTransformer instance selects the columns we want to use, so we simply pass the entire DataFrame to the fit_transform method. The desired columns will be selected for us.

In [40]:
X_cat_transformed = ct.fit_transform(train)
X_cat_transformed

array([[0., 1., 0., ..., 1., 0., 0.],
       [0., 1., 0., ..., 0., 0., 0.],
       [0., 1., 0., ..., 1., 0., 0.],
       ...,
       [0., 1., 0., ..., 1., 0., 0.],
       [0., 1., 0., ..., 0., 0., 0.],
       [0., 1., 0., ..., 0., 0., 0.]])

In [41]:
X_cat_transformed.shape

(1460, 16)

We can now transform our test set in the same manner.

In [42]:
X_cat_transformed_test = ct.transform(test)
X_cat_transformed_test.shape

(1459, 16)

### Retrieving the feature names
We have to do a little digging to get the feature names. All the transformers are stored in the `named_transformers_` dictionary attribute. We then use the name (the first item of the three-item tuple) to select the specific transformer. Below, we select our transformer (there is only one here: a pipeline named 'cat').

In [43]:
pl = ct.named_transformers_['cat']

Then from this pipeline we select the one-hot encoder object and finally get the feature names.

In [44]:
ohe = pl.named_steps['ohe']
ohe.get_feature_names()

array(['x0_ClyTile', 'x0_CompShg', 'x0_Membran', 'x0_Metal', 'x0_Roll',
       'x0_Tar&Grv', 'x0_WdShake', 'x0_WdShngl', 'x1_1.5Fin', 'x1_1.5Unf',
       'x1_1Story', 'x1_2.5Fin', 'x1_2.5Unf', 'x1_2Story', 'x1_SFoyer',
       'x1_SLvl'], dtype=object)

# Transforming the numeric columns
The numeric columns will need a different set of transformations. Instead of imputing missing values with a constant, the median or mean is often chosen. And instead of encoding the values, we usually standardize them by subtracting the mean of each column and dividing by the standard deviation. This helps many models like ridge regression produce a better fit.
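
To make the standardization formula concrete, this sketch compares `StandardScaler` with the same computation done by hand on a made-up column (`X_toy` is hypothetical):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical lot areas, one numeric column
X_toy = np.array([[8450.0], [9600.0], [11250.0], [9550.0]])

scaled = StandardScaler().fit_transform(X_toy)

# Same computation by hand; StandardScaler uses the population std (ddof=0)
manual = (X_toy - X_toy.mean(axis=0)) / X_toy.std(axis=0)

print(np.allclose(scaled, manual))
print(scaled.mean(axis=0), scaled.std(axis=0))
```

Note that the Pandas `std` method defaults to the sample standard deviation (`ddof=1`), so standardizing by hand with Pandas gives slightly different numbers unless you pass `ddof=0`.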

## Usually all the numeric columns

Instead of selecting just one or two columns by hand like we did above with the string columns, we can select all of the numeric columns. We do this by first finding the data type of each column with the dtypes attribute and then testing whether the kind of each dtype is 'O'. The dtypes attribute returns a Series of NumPy dtype objects. Each of these has a kind attribute that is a single character. We can use this to find the numeric or string columns. Pandas stores all of its string columns as object which have a kind equal to 'O'. See the [NumPy docs][1] for more on the `kind` attribute.

[1]: https://docs.scipy.org/doc/numpy/reference/generated/numpy.dtype.kind.html

In [45]:
train.dtypes.head()

Id               int64
MSSubClass       int64
MSZoning        object
LotFrontage    float64
LotArea          int64
dtype: object

Get the kinds, a one character string representing the dtype.

In [46]:
kinds = np.array([dt.kind for dt in train.dtypes])
kinds[:5]

array(['i', 'i', 'O', 'f', 'i'], dtype='<U1')

Assume all numeric columns are non-object.

In [47]:
all_columns = train.columns.values
is_num = kinds != 'O'
num_cols = all_columns[is_num]
num_cols[:5]

array(['Id', 'MSSubClass', 'LotFrontage', 'LotArea', 'OverallQual'],
      dtype=object)

In [48]:
cat_cols = all_columns[~is_num]
cat_cols[:5]

array(['MSZoning', 'Street', 'Alley', 'LotShape', 'LandContour'],
      dtype=object)
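
An alternative to inspecting `kind` by hand is the Pandas `select_dtypes` method. The sketch below runs both approaches on a made-up DataFrame (`toy` and the variable names are hypothetical) and shows that they agree here:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'Id': [1, 2, 3],
    'MSZoning': pd.Series(['RL', 'RL', 'RM'], dtype=object),
    'LotFrontage': [65.0, np.nan, 68.0],
    'Street': pd.Series(['Pave', 'Pave', 'Grvl'], dtype=object),
})

# Approach from the text: inspect the one-character kind of each dtype
kinds_toy = np.array([dt.kind for dt in toy.dtypes])
num_by_kind = toy.columns[kinds_toy != 'O'].tolist()

# Alternative: let pandas select by dtype
num_by_select = toy.select_dtypes(include='number').columns.tolist()
cat_by_select = toy.select_dtypes(exclude='number').columns.tolist()

print(num_by_kind, num_by_select, cat_by_select)
```

The two approaches can disagree on boolean or datetime columns, which are not object columns but are also not counted as numbers by `select_dtypes`.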

Once we have our numeric column names, we can use the `ColumnTransformer` again.

In [49]:
from sklearn.preprocessing import StandardScaler

num_si_step = ('si', SimpleImputer(strategy='median'))
num_ss_step = ('ss', StandardScaler())
num_steps = [num_si_step, num_ss_step]

num_pipe = Pipeline(num_steps)
num_transformers = [('num', num_pipe, num_cols)]

ct = ColumnTransformer(transformers=num_transformers)

X_num_transformed = ct.fit_transform(train)
X_num_transformed.shape

(1460, 37)

# Combining both categorical and numerical column transformations
We can apply separate transformations to each section of our DataFrame with `ColumnTransformer`. We will use every single column in this example. 

We then create a separate pipeline for the categorical and numerical columns and use the `ColumnTransformer` to transform them independently. Conceptually, these two transformations happen **in parallel**: neither one sees the other's columns or output (they only run concurrently if you set the `n_jobs` parameter). The results of each are then concatenated together.

In [50]:
transformers = [('cat', cat_pipe, cat_cols),
                ('num', num_pipe, num_cols)]

ct = ColumnTransformer(transformers=transformers)

X = ct.fit_transform(train)
X.shape

(1460, 305)
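
To see that the combined output really is just the individual blocks placed side by side, here is a sketch on a made-up two-column DataFrame (`toy`, `ct_toy`, and the other names are hypothetical). It assumes `sparse_threshold=0` to force a dense result.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

toy = pd.DataFrame({
    'HouseStyle': pd.Series(['2Story', '1Story', '2Story', 'SLvl'], dtype=object),
    'LotArea': [8450.0, 9600.0, 11250.0, 9550.0],
})

ct_toy = ColumnTransformer(
    [('cat', OneHotEncoder(), ['HouseStyle']),
     ('num', StandardScaler(), ['LotArea'])],
    sparse_threshold=0,  # always return a dense array
)
combined = ct_toy.fit_transform(toy)

# Transform each block separately, then concatenate in the same order
cat_part = OneHotEncoder().fit_transform(toy[['HouseStyle']]).toarray()
num_part = StandardScaler().fit_transform(toy[['LotArea']])
by_hand = np.hstack([cat_part, num_part])

print(combined.shape)
print(np.allclose(combined, by_hand))
```

The column order of the result follows the order of the tuples in the transformers list, not the column order of the original DataFrame.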

# Machine Learning
The whole point of this exercise is to set up our data so that we can do machine learning. We can create one final pipeline and add a machine learning model as the final estimator. The first step in the pipeline will be the entire transformation we just did above. We assigned `y` way back at the top of the tutorial as the `SalePrice`. Here, we will just use the `fit` method instead of `fit_transform` since our final step is a machine learning model and does no transformations.

In [51]:
from sklearn.linear_model import Ridge

ml_pipe = Pipeline([('transform', ct), ('ridge', Ridge())])
ml_pipe.fit(train, y)

Pipeline(memory=None,
     steps=[('transform', ColumnTransformer(n_jobs=None, remainder='drop', sparse_threshold=0.3,
         transformer_weights=None,
         transformers=[('cat', Pipeline(memory=None,
     steps=[('si', SimpleImputer(copy=True, fill_value='MISSING', missing_values=nan,
       strategy='constant', verbos...it_intercept=True, max_iter=None,
   normalize=False, random_state=None, solver='auto', tol=0.001))])

In [52]:
ml_pipe.score(train, y)

0.9220545988101002

# Cross Validation
Of course, scoring ourselves on the training set is not useful. Let's do some K-fold cross validation to get an idea of how well we would do with unseen data. We set a random state so that the splits will be the same throughout the rest of the tutorial.

In [53]:
from sklearn.model_selection import KFold, cross_val_score
kf = KFold(n_splits=5, shuffle=True, random_state=123)

cross_val_score(ml_pipe, train, y, cv=kf).mean()

0.8133920019569663

# Selecting parameters when Grid Searching
Grid searching in Scikit-Learn requires us to pass a dictionary of parameter names mapped to possible values. When using a pipeline, we must use the name of the step followed by a double-underscore and then the parameter name. If there are multiple layers to your pipeline, as we have here, we must continue using double-underscores to move down a level until we reach the estimator whose parameters we would like to optimize.

In [54]:
from sklearn.model_selection import GridSearchCV

param_grid = {
    'transform__num__si__strategy': ['mean', 'median'],
    'ridge__alpha': [.001, 0.1, 1.0, 5, 10, 50, 100, 1000],
}
gs = GridSearchCV(ml_pipe, param_grid, cv=kf)

In [55]:
gs.fit(train, y)
gs.best_params_

{'ridge__alpha': 10, 'transform__num__si__strategy': 'median'}

In [56]:
gs.best_score_

0.8190367464419676
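
If you are unsure of the exact double-underscore path for a parameter, the `get_params` method lists every name that a grid search can address. A sketch on a pipeline with the same nesting as above, using a hypothetical single-column setup (the `_toy` names are made up):

```python
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Same nesting as in the text, with a hypothetical column list
num_pipe_toy = Pipeline([('si', SimpleImputer(strategy='median')),
                         ('ss', StandardScaler())])
ct_toy = ColumnTransformer([('num', num_pipe_toy, ['LotArea'])])
ml_pipe_toy = Pipeline([('transform', ct_toy), ('ridge', Ridge())])

# Every tunable parameter, addressed by its double-underscore path
param_names = sorted(ml_pipe_toy.get_params().keys())
strategy_names = [name for name in param_names if name.endswith('strategy')]
print(strategy_names)
print('ridge__alpha' in param_names)

# set_params accepts the same paths, which is what the grid search relies on
ml_pipe_toy.set_params(transform__num__si__strategy='mean')
new_strategy = ml_pipe_toy.get_params()['transform__num__si__strategy']
print(new_strategy)
```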

## Getting all the grid search results in a Pandas DataFrame
All the results of the grid search are stored in the `cv_results_` attribute. This is a dictionary that can be converted to a Pandas DataFrame for a nice display, which provides a structure that is much easier to scan manually.

In [57]:
pd.options.display.max_columns = 100
pd.options.display.max_colwidth = 200

In [58]:
pd.DataFrame(gs.cv_results_)



Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_ridge__alpha,param_transform__num__si__strategy,params,split0_test_score,split1_test_score,split2_test_score,split3_test_score,split4_test_score,mean_test_score,std_test_score,rank_test_score,split0_train_score,split1_train_score,split2_train_score,split3_train_score,split4_train_score,mean_train_score,std_train_score
0,0.035772,0.000894,0.008333,0.001063,0.001,mean,"{'ridge__alpha': 0.001, 'transform__num__si__strategy': 'mean'}",0.898019,0.024032,0.77595,0.883812,0.662549,0.648872,0.323738,16,0.935225,0.945623,0.945871,0.938929,0.933276,0.939785,0.005197
1,0.03359,0.000494,0.007693,0.000153,0.001,median,"{'ridge__alpha': 0.001, 'transform__num__si__strategy': 'median'}",0.898015,0.024086,0.775946,0.883812,0.662624,0.648897,0.323717,15,0.935224,0.945617,0.945872,0.938929,0.933274,0.939783,0.005196
2,0.035185,0.001886,0.009171,0.000782,0.1,mean,"{'ridge__alpha': 0.1, 'transform__num__si__strategy': 'mean'}",0.897111,0.779393,0.800366,0.884558,0.653398,0.802965,0.087697,12,0.934181,0.942736,0.944092,0.93766,0.933105,0.938355,0.004418
3,0.038704,0.003397,0.00852,0.000977,0.1,median,"{'ridge__alpha': 0.1, 'transform__num__si__strategy': 'median'}",0.897114,0.779408,0.800366,0.884561,0.653472,0.802984,0.087672,11,0.934181,0.942731,0.944092,0.93766,0.933104,0.938354,0.004417
4,0.034963,0.000945,0.008691,0.000903,1.0,mean,"{'ridge__alpha': 1.0, 'transform__num__si__strategy': 'mean'}",0.891715,0.809381,0.822107,0.876908,0.66671,0.813364,0.07972,8,0.922745,0.931678,0.928385,0.925259,0.929545,0.927523,0.003164
5,0.037801,0.001084,0.008998,0.001275,1.0,median,"{'ridge__alpha': 1.0, 'transform__num__si__strategy': 'median'}",0.891728,0.809413,0.822128,0.876924,0.666768,0.813392,0.079704,7,0.922745,0.93168,0.928385,0.925261,0.929545,0.927523,0.003164
6,0.034113,0.001811,0.008526,0.000628,5.0,mean,"{'ridge__alpha': 5, 'transform__num__si__strategy': 'mean'}",0.889795,0.820456,0.820334,0.871716,0.684609,0.817382,0.071892,4,0.906954,0.916934,0.912647,0.909262,0.920456,0.913251,0.004928
7,0.037207,0.002088,0.008591,0.000501,5.0,median,"{'ridge__alpha': 5, 'transform__num__si__strategy': 'median'}",0.889804,0.820471,0.820383,0.871755,0.684598,0.817402,0.071905,3,0.906956,0.916942,0.912647,0.909265,0.92046,0.913254,0.004929
8,0.031866,0.002102,0.007863,0.000569,10.0,mean,"{'ridge__alpha': 10, 'transform__num__si__strategy': 'mean'}",0.89047,0.824359,0.819779,0.871443,0.689068,0.819024,0.070385,2,0.899419,0.909901,0.905999,0.901841,0.915168,0.906466,0.005635
9,0.03432,0.001349,0.008168,0.000819,10.0,median,"{'ridge__alpha': 10, 'transform__num__si__strategy': 'median'}",0.890488,0.824358,0.819844,0.871495,0.688998,0.819037,0.070422,1,0.899421,0.909908,0.905997,0.901841,0.915177,0.906469,0.005639


# Building a custom transformer that does it all
There are a few limitations to the above workflow. For instance, it would be nice if the `OneHotEncoder` gave you the option of ignoring missing values during the fit method. It could simply encode missing values as a row of all zeros. Currently, it forces us to fill the missing values with some string and then encodes this string as a separate column.

### Low frequency strings
Also, string values that appear only a few times in the training set may not be reliable predictors in the test set. We may want to encode those as if they were missing as well.

### Writing your own estimator class
Scikit-Learn provides some help within its documentation on writing your own estimator class. The `BaseEstimator` class found within the base module provides the `get_params` and `set_params` methods for you. The `set_params` method is necessary when doing a grid search. You can write your own or inherit from the `BaseEstimator`. There is also a `TransformerMixin` but it just writes the `fit_transform` method for you. We do this in one line of code below, so we don't inherit from it.

The following class BasicTransformer does the following:

* Fills in missing values with either the mean or median for numeric columns
* Standardizes all numeric columns
* Uses one hot encoding for string columns
* Does not fill in missing values for categorical columns. Instead, it encodes them as all 0's
* Ignores unique values in string columns in the test set
* Allows you to choose a threshold for the number of occurrences a value must have in a string column. Strings below this threshold will be encoded as all 0's
* It only works with DataFrames and is just experimental and not tested so it will break for some datasets
* It is called 'basic' because these are probably the most basic transformations that typically get done to many datasets.

In [59]:
from sklearn.base import BaseEstimator

class BasicTransformer(BaseEstimator):
    
    def __init__(self, cat_threshold=None, num_strategy='median', return_df=False):
        # store parameters as public attributes
        self.cat_threshold = cat_threshold
        
        if num_strategy not in ['mean', 'median']:
            raise ValueError('num_strategy must be either "mean" or "median"')
        self.num_strategy = num_strategy
        self.return_df = return_df
        
    def fit(self, X, y=None):
        # Assumes X is a DataFrame
        self._columns = X.columns.values
        
        # Split data into categorical and numeric
        self._dtypes = X.dtypes.values
        self._kinds = np.array([dt.kind for dt in X.dtypes])
        self._column_dtypes = {}
        is_cat = self._kinds == 'O'
        self._column_dtypes['cat'] = self._columns[is_cat]
        self._column_dtypes['num'] = self._columns[~is_cat]
        self._feature_names = self._column_dtypes['num']
        
        # Create a dictionary mapping categorical column to unique values above threshold
        self._cat_cols = {}
        for col in self._column_dtypes['cat']:
            vc = X[col].value_counts()
            if self.cat_threshold is not None:
                vc = vc[vc > self.cat_threshold]
            vals = vc.index.values
            self._cat_cols[col] = vals
            self._feature_names = np.append(self._feature_names, col + '_' + vals)
            
        # get total number of new categorical columns    
        self._total_cat_cols = sum([len(v) for col, v in self._cat_cols.items()])
        
        # get mean or median
        self._num_fill = X[self._column_dtypes['num']].agg(self.num_strategy)
        return self
        
    def transform(self, X):
        # check that we have a DataFrame with same column names as the one we fit
        if set(self._columns) != set(X.columns):
            raise ValueError('Passed DataFrame has different columns than fit DataFrame')
        elif len(self._columns) != len(X.columns):
            raise ValueError('Passed DataFrame has different number of columns than fit DataFrame')
            
        # fill missing values    
        X_num = X[self._column_dtypes['num']].fillna(self._num_fill)
        
        # Standardize numerics using the mean and std of the DataFrame passed in,
        # not statistics stored during fit
        std = X_num.std()
        X_num = (X_num - X_num.mean()) / std
        zero_std = np.where(std == 0)[0]
        
        # If there is 0 standard deviation, then all values are the same. Set them to 0.
        if len(zero_std) > 0:
            X_num.iloc[:, zero_std] = 0
        X_num = X_num.values
        
        # create separate array for new encoded categoricals
        X_cat = np.empty((len(X), self._total_cat_cols), dtype='int')
        i = 0
        for col in self._column_dtypes['cat']:
            vals = self._cat_cols[col]
            for val in vals:
                X_cat[:, i] = X[col] == val
                i += 1
                
        # concatenate transformed numeric and categorical arrays
        data = np.column_stack((X_num, X_cat))
        
        # return either a DataFrame or an array
        if self.return_df:
            return pd.DataFrame(data=data, columns=self._feature_names)
        else:
            return data
    
    def fit_transform(self, X, y=None):
        return self.fit(X).transform(X)
    
    def get_feature_names(self):
        return self._feature_names
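
The categorical part of `transform` can be looked at in isolation. The following sketch uses a made-up column and a made-up list of learned values (not the housing data) to show why missing values and values unseen during `fit` both end up as rows of zeros: every comparison against the learned values is False.

```python
import numpy as np
import pandas as pd

# Values kept during fit (made up for illustration)
learned_vals = ['RL', 'RM']

# New data with a learned value, an unseen value and a missing value
col = pd.Series(['RM', 'FV', np.nan])

encoded = np.empty((len(col), len(learned_vals)), dtype='int')
for i, val in enumerate(learned_vals):
    # Comparisons with unseen or missing values are False, stored as 0
    encoded[:, i] = col == val

print(encoded)
```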

# Using our `BasicTransformer`
Our `BasicTransformer` estimator can be used just like any other scikit-learn estimator. We can instantiate it and then transform our data.

In [60]:
bt = BasicTransformer(cat_threshold=3, return_df=True)
train_transformed = bt.fit_transform(train)
train_transformed.head(3)

Unnamed: 0,Id,MSSubClass,LotFrontage,LotArea,OverallQual,OverallCond,YearBuilt,YearRemodAdd,MasVnrArea,BsmtFinSF1,BsmtFinSF2,BsmtUnfSF,TotalBsmtSF,1stFlrSF,2ndFlrSF,LowQualFinSF,GrLivArea,BsmtFullBath,BsmtHalfBath,FullBath,HalfBath,BedroomAbvGr,KitchenAbvGr,TotRmsAbvGrd,Fireplaces,GarageYrBlt,GarageCars,GarageArea,WoodDeckSF,OpenPorchSF,EnclosedPorch,3SsnPorch,ScreenPorch,PoolArea,MiscVal,MoSold,YrSold,MSZoning_RL,MSZoning_RM,MSZoning_FV,MSZoning_RH,MSZoning_C (all),Street_Pave,Street_Grvl,Alley_Grvl,Alley_Pave,LotShape_Reg,LotShape_IR1,LotShape_IR2,LotShape_IR3,...,KitchenQual_Ex,KitchenQual_Fa,Functional_Typ,Functional_Min2,Functional_Min1,Functional_Mod,Functional_Maj1,Functional_Maj2,FireplaceQu_Gd,FireplaceQu_TA,FireplaceQu_Fa,FireplaceQu_Ex,FireplaceQu_Po,GarageType_Attchd,GarageType_Detchd,GarageType_BuiltIn,GarageType_Basment,GarageType_CarPort,GarageType_2Types,GarageFinish_Unf,GarageFinish_RFn,GarageFinish_Fin,GarageQual_TA,GarageQual_Fa,GarageQual_Gd,GarageCond_TA,GarageCond_Fa,GarageCond_Gd,GarageCond_Po,PavedDrive_Y,PavedDrive_N,PavedDrive_P,Fence_MnPrv,Fence_GdPrv,Fence_GdWo,Fence_MnWw,MiscFeature_Shed,SaleType_WD,SaleType_New,SaleType_COD,SaleType_ConLD,SaleType_ConLw,SaleType_ConLI,SaleType_CWD,SaleCondition_Normal,SaleCondition_Partial,SaleCondition_Abnorml,SaleCondition_Family,SaleCondition_Alloca,SaleCondition_AdjLand
0,-1.730272,0.07335,-0.220799,-0.207071,0.651256,-0.517023,1.050634,0.878367,0.513928,0.575228,-0.288554,-0.944267,-0.459145,-0.793162,1.161454,-0.120201,0.370207,1.107431,-0.240978,0.78947,1.227165,0.163723,-0.211381,0.911897,-0.950901,1.01725,0.311618,0.35088,-0.751918,0.216429,-0.359202,-0.116299,-0.270116,-0.068668,-0.087658,-1.598563,0.13873,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
1,-1.7279,-0.872264,0.460162,-0.091855,-0.071812,2.178881,0.15668,-0.42943,-0.570555,1.171591,-0.288554,-0.641008,0.466305,0.257052,-0.794891,-0.120201,-0.482347,-0.819684,3.947457,0.78947,-0.76136,0.163723,-0.211381,-0.318574,0.600289,-0.10789,0.311618,-0.06071,1.625638,-0.704242,-0.359202,-0.116299,-0.270116,-0.068668,-0.087658,-0.488943,-0.614228,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
2,-1.725528,0.07335,-0.084607,0.073455,0.651256,-0.517023,0.984415,0.82993,0.325803,0.092875,-0.288554,-0.30154,-0.313261,-0.627611,1.188943,-0.120201,0.514836,1.107431,-0.240978,0.78947,1.227165,0.163723,-0.211381,-0.318574,0.600289,0.933906,0.311618,0.63151,-0.751918,-0.070337,-0.359202,-0.116299,-0.270116,-0.068668,-0.087658,0.990552,0.13873,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0


## Using our transformer in a pipeline
Our transformer can be part of a pipeline.

In [61]:
basic_pipe = Pipeline([('bt', bt), ('ridge', Ridge())])
basic_pipe.fit(train, y)
basic_pipe.score(train, y)

0.9035633239411638

We can also cross-validate with it and get a score similar to the one from our scikit-learn column transformer pipeline above.

In [62]:
cross_val_score(basic_pipe, train, y, cv=kf).mean()

0.8157670129480069

We can use it as part of a grid search as well. It turns out that excluding low-count strings did not help this particular model, though it stands to reason it could in other models. The best score did improve a bit, perhaps due to using a slightly different encoding scheme.

In [63]:
param_grid = {
    'bt__cat_threshold': [0, 1, 2, 3, 4, 5],
    'ridge__alpha': [.1, 1, 10, 30, 100]
}

gs = GridSearchCV(basic_pipe, param_grid, cv=kf)
gs.fit(train, y)
gs.best_params_

{'bt__cat_threshold': 0, 'ridge__alpha': 10}

In [64]:
gs.best_score_

0.8297473585998105

In [65]:
pd.DataFrame(gs.cv_results_)



Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_bt__cat_threshold,param_ridge__alpha,params,split0_test_score,split1_test_score,split2_test_score,split3_test_score,split4_test_score,mean_test_score,std_test_score,rank_test_score,split0_train_score,split1_train_score,split2_train_score,split3_train_score,split4_train_score,mean_train_score,std_train_score
0,0.151099,0.008482,0.058924,0.006349,0,0.1,"{'bt__cat_threshold': 0, 'ridge__alpha': 0.1}",0.896662,0.781349,0.803144,0.878337,0.684654,0.808829,0.075833,26,0.933923,0.942257,0.94403,0.937377,0.93309,0.938136,0.00437
1,0.13427,0.006803,0.051739,0.001623,0,1.0,"{'bt__cat_threshold': 0, 'ridge__alpha': 1}",0.893991,0.80601,0.832287,0.874838,0.70053,0.821531,0.067956,19,0.922321,0.931128,0.928199,0.924859,0.929494,0.9272,0.003193
2,0.125818,0.003608,0.052028,0.001495,0,10.0,"{'bt__cat_threshold': 0, 'ridge__alpha': 10}",0.894996,0.822835,0.835066,0.871813,0.724028,0.829747,0.058787,1,0.899157,0.909725,0.9057,0.901508,0.915036,0.906225,0.005699
3,0.137552,0.009644,0.056596,0.007646,0,30.0,"{'bt__cat_threshold': 0, 'ridge__alpha': 30}",0.895582,0.82546,0.831901,0.871506,0.723175,0.829525,0.059091,2,0.884856,0.896451,0.89307,0.887565,0.904876,0.893364,0.007045
4,0.133064,0.005175,0.059503,0.010216,0,100.0,"{'bt__cat_threshold': 0, 'ridge__alpha': 100}",0.889236,0.823003,0.822421,0.866925,0.715614,0.823439,0.059746,13,0.864086,0.876754,0.874573,0.867295,0.890486,0.874639,0.009175
5,0.126686,0.005695,0.049627,0.001913,1,0.1,"{'bt__cat_threshold': 1, 'ridge__alpha': 0.1}",0.884871,0.712477,0.821721,0.863899,0.717456,0.800085,0.072432,30,0.920131,0.931422,0.918306,0.92341,0.932306,0.925115,0.005755
6,0.130103,0.004108,0.053175,0.001933,1,1.0,"{'bt__cat_threshold': 1, 'ridge__alpha': 1}",0.888417,0.797304,0.821625,0.866313,0.703418,0.815415,0.064544,22,0.913693,0.922866,0.914073,0.916202,0.92745,0.918857,0.005413
7,0.131027,0.003182,0.051814,0.002601,1,10.0,"{'bt__cat_threshold': 1, 'ridge__alpha': 10}",0.894205,0.82206,0.831289,0.870218,0.723553,0.828265,0.058516,10,0.896995,0.907343,0.90254,0.899486,0.914338,0.90414,0.006157
8,0.124641,0.004546,0.050944,0.003075,1,30.0,"{'bt__cat_threshold': 1, 'ridge__alpha': 30}",0.895353,0.825277,0.830322,0.870972,0.723009,0.828987,0.059015,5,0.883881,0.895349,0.891696,0.88666,0.904584,0.892434,0.007255
9,0.127741,0.004085,0.051622,0.003958,1,100.0,"{'bt__cat_threshold': 1, 'ridge__alpha': 100}",0.88918,0.822979,0.821884,0.866778,0.715578,0.82328,0.059727,15,0.863687,0.876302,0.874037,0.866918,0.890375,0.874264,0.009269
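
The full `cv_results_` table is wide, so it can help to keep only the parameter and score columns and sort by rank. The sketch below does this on synthetic data with a plain `Ridge`; the names `X_demo`, `y_demo` and `gs_demo` are made up and stand in for the housing data and the grid search above.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# Synthetic data stands in for the housing data
X_demo, y_demo = make_regression(n_samples=100, n_features=5,
                                 noise=10, random_state=0)

gs_demo = GridSearchCV(Ridge(), {'alpha': [0.1, 1, 10]}, cv=5)
gs_demo.fit(X_demo, y_demo)

# Keep the parameter column and the headline scores, best rank first
results = pd.DataFrame(gs_demo.cv_results_)
cols = ['param_alpha', 'mean_test_score', 'std_test_score', 'rank_test_score']
summary = results[cols].sort_values('rank_test_score')
print(summary)
```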


# Binning and encoding numeric columns with the new KBinsDiscretizer
There are a few columns that contain years. It makes more sense to bin the values in these columns and treat them as categories. Scikit-Learn introduced the new estimator `KBinsDiscretizer` to do just this. It not only bins the values, but it encodes them as well. Before, you could have done this manually with the Pandas `cut` or `qcut` functions.
Let's see how it works with just the `YearBuilt` column.

In [66]:
from sklearn.preprocessing import KBinsDiscretizer
kbd = KBinsDiscretizer(encode='onehot-dense')
year_built_transformed = kbd.fit_transform(train[['YearBuilt']])
year_built_transformed

array([[0., 0., 0., 0., 1.],
       [0., 0., 1., 0., 0.],
       [0., 0., 0., 1., 0.],
       ...,
       [1., 0., 0., 0., 0.],
       [0., 1., 0., 0., 0.],
       [0., 0., 1., 0., 0.]])

By default, each bin contains (approximately) an equal number of observations. Let's sum up each column to verify this.

In [67]:
year_built_transformed.sum(axis=0)

array([292., 274., 307., 266., 321.])

This is the 'quantile' strategy. You can choose 'uniform' to make the bin edges equally spaced or 'kmeans', which uses k-means clustering to find the bin edges.

In [68]:
kbd.bin_edges_

array([array([1872. , 1947.8, 1965. , 1984. , 2003. , 2010. ])],
      dtype=object)
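
To see how the strategies differ, here is a sketch on a handful of made-up years (not the housing data). It uses the 'uniform' strategy and then does quantile binning by hand with Pandas `qcut` and `get_dummies`, roughly the manual approach that was needed before.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import KBinsDiscretizer

# Made-up years, skewed toward recent values
years = np.array([[1900], [1950], [1990], [2000],
                  [2005], [2008], [2009], [2010]])

# 'uniform' splits the range 1900-2010 into two equal-width intervals
uniform = KBinsDiscretizer(n_bins=2, encode='ordinal', strategy='uniform')
uniform_codes = uniform.fit_transform(years).ravel()
print(uniform.bin_edges_[0])  # edges at 1900, 1955 and 2010
print(uniform_codes)          # only the first two years fall in bin 0

# Manual alternative: qcut makes quantile bins, get_dummies encodes them
manual = pd.get_dummies(pd.qcut(years.ravel(), q=2))
print(manual.sum(axis=0).tolist())  # four observations in each bin
```

With skewed data, equal-width bins can be very unbalanced, while quantile bins keep the counts even.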

# Processing all the year columns separately with ColumnTransformer
We now have another subset of columns that need separate processing and we can do this with the `ColumnTransformer`. The following code adds one more step to our previous transformation. We also drop the Id column which was just identifying each row.

In [70]:
year_cols = ['YearBuilt', 'YearRemodAdd', 'GarageYrBlt', 'YrSold']
not_year = ~np.isin(num_cols, year_cols + ['Id'])
num_cols2 = num_cols[not_year]

year_si_step = ('si', SimpleImputer(strategy='median'))
year_kbd_step = ('kbd', KBinsDiscretizer(n_bins=5, encode='onehot-dense'))
year_steps = [year_si_step, year_kbd_step]
year_pipe = Pipeline(year_steps)

transformers = [('cat', cat_pipe, cat_cols),
                ('num', num_pipe, num_cols2),
                ('year', year_pipe, year_cols)]

ct = ColumnTransformer(transformers=transformers)
X = ct.fit_transform(train)
X.shape

(1460, 320)

We cross-validate and score, and see that all this work yielded no improvement.

In [71]:
ml_pipe = Pipeline([('transform', ct), ('ridge', Ridge())])
cross_val_score(ml_pipe, train, y, cv=kf).mean()

0.812785401015533

Using a different number of bins for each column might improve our results. Still, the KBinsDiscretizer makes it easy to bin numeric variables.
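
The `n_bins` parameter also accepts one value per column, which is one way to try this. A minimal sketch on two made-up columns (the names `X_demo` and `kbd_demo` are not part of the pipeline above):

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

# Two made-up numeric columns with 20 distinct values each
rng = np.random.RandomState(0)
X_demo = np.column_stack((np.arange(20), rng.permutation(20) * 10))

# One entry per column: 3 bins for the first column, 5 for the second
kbd_demo = KBinsDiscretizer(n_bins=[3, 5], encode='onehot-dense')
X_binned = kbd_demo.fit_transform(X_demo)
print(X_binned.shape)
```

The encoded output has 3 + 5 = 8 columns. Inside a `ColumnTransformer`, the order of the `n_bins` list would have to match the order of the columns passed to that step.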

# More goodies in Scikit-Learn 0.20
There are more new features that come with the upcoming release. Check the What's New section of the docs for more. There are a ton of changes.

# Conclusion
This article introduced a new workflow that will be available to Scikit-Learn users who rely on Pandas for the initial data exploration and preparation. The new and improved estimators `ColumnTransformer`, `SimpleImputer`, `OneHotEncoder`, and `KBinsDiscretizer` provide a much smoother and more feature-rich process for taking a Pandas DataFrame and transforming it so that it is ready for machine learning.

I am very excited to see this new upgrade and am going to be integrating these new workflows immediately into my projects and teaching materials.
