# A Real-Life Massive Machine Learning Pipeline with scikit-learn

The objective is to build a scikit-learn pipeline that takes in raw data, transforms many different subsets of it, learns from it, tunes hyperparameters, and makes predictions. The final outcome will be a single scikit-learn estimator that does it all and can later be used to make predictions in a single step.

This tutorial includes information on how to:
* Categorize columns into distinct groups
* Apply independent transformations to each column grouping
* Create a single pipeline that handles all the steps of transforming, modeling, validating, and predicting
* Build custom transformers to create new features
* Save our final model to disk to be used to make future predictions

## Focus on transformations
Although this builds a complete machine learning pipeline, most of it will focus on how to transform and prepare the data for the machine learning models.

## Assume fundamentals of scikit-learn
This tutorial assumes you are familiar with doing machine learning with scikit-learn. At a minimum, you need to know what a scikit-learn estimator is and how it behaves.

## Prior Issues with transformations in scikit-learn
Until the release of scikit-learn version 0.20 in September 2018, there was no easy way to apply separate transformations to different subsets of the data. Encoding categorical features was an additional burden. For instance, building a single pipeline to handle input data with a mix of continuous and categorical variables was far from trivial, and many divergent workflows were built to accommodate such transformations.

The addition of the `ColumnTransformer` and upgrade to the `OneHotEncoder` in version 0.20 alleviated these painful issues. Building an entire pipeline in scikit-learn was not pleasant before these additions. A previous post of mine details the exciting new workflow that became possible.

## Real-World Data
In this post, we will work with the [Ames, Iowa housing dataset][0] from a popular Kaggle competition. You are given a dataset with 79 features with the objective of learning a model to predict the sale price.

[0]: https://www.kaggle.com/c/house-prices-advanced-regression-techniques

## Begin by reading in data
Let's begin by reading our training data into a pandas DataFrame.

In [1]:
import pandas as pd
pd.set_option('display.max_columns', 100)
housing = pd.read_csv('data/train.csv')
housing.head(3)

Unnamed: 0,Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,LotConfig,LandSlope,Neighborhood,Condition1,Condition2,BldgType,HouseStyle,OverallQual,OverallCond,YearBuilt,YearRemodAdd,RoofStyle,RoofMatl,Exterior1st,Exterior2nd,MasVnrType,MasVnrArea,ExterQual,ExterCond,Foundation,BsmtQual,BsmtCond,BsmtExposure,BsmtFinType1,BsmtFinSF1,BsmtFinType2,BsmtFinSF2,BsmtUnfSF,TotalBsmtSF,Heating,HeatingQC,CentralAir,Electrical,1stFlrSF,2ndFlrSF,LowQualFinSF,GrLivArea,BsmtFullBath,BsmtHalfBath,FullBath,HalfBath,BedroomAbvGr,KitchenAbvGr,KitchenQual,TotRmsAbvGrd,Functional,Fireplaces,FireplaceQu,GarageType,GarageYrBlt,GarageFinish,GarageCars,GarageArea,GarageQual,GarageCond,PavedDrive,WoodDeckSF,OpenPorchSF,EnclosedPorch,3SsnPorch,ScreenPorch,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice
0,1,60,RL,65.0,8450,Pave,,Reg,Lvl,AllPub,Inside,Gtl,CollgCr,Norm,Norm,1Fam,2Story,7,5,2003,2003,Gable,CompShg,VinylSd,VinylSd,BrkFace,196.0,Gd,TA,PConc,Gd,TA,No,GLQ,706,Unf,0,150,856,GasA,Ex,Y,SBrkr,856,854,0,1710,1,0,2,1,3,1,Gd,8,Typ,0,,Attchd,2003.0,RFn,2,548,TA,TA,Y,0,61,0,0,0,0,,,,0,2,2008,WD,Normal,208500
1,2,20,RL,80.0,9600,Pave,,Reg,Lvl,AllPub,FR2,Gtl,Veenker,Feedr,Norm,1Fam,1Story,6,8,1976,1976,Gable,CompShg,MetalSd,MetalSd,,0.0,TA,TA,CBlock,Gd,TA,Gd,ALQ,978,Unf,0,284,1262,GasA,Ex,Y,SBrkr,1262,0,0,1262,0,1,2,0,3,1,TA,6,Typ,1,TA,Attchd,1976.0,RFn,2,460,TA,TA,Y,298,0,0,0,0,0,,,,0,5,2007,WD,Normal,181500
2,3,60,RL,68.0,11250,Pave,,IR1,Lvl,AllPub,Inside,Gtl,CollgCr,Norm,Norm,1Fam,2Story,7,5,2001,2002,Gable,CompShg,VinylSd,VinylSd,BrkFace,162.0,Gd,TA,PConc,Gd,TA,Mn,GLQ,486,Unf,0,434,920,GasA,Ex,Y,SBrkr,920,866,0,1786,1,0,2,1,3,1,Gd,6,Typ,1,TA,Attchd,2001.0,RFn,2,608,TA,TA,Y,0,42,0,0,0,0,,,,0,9,2008,WD,Normal,223500


Output the number of rows and columns.

In [2]:
housing.shape

(1460, 81)

### Remove the sale price
The target variable to predict is the sale price. We will remove it and assign to the variable `y` with the `pop` method, which modifies the DataFrame in place.

In [3]:
# Running this cell twice will cause an error since the column is no longer in the DataFrame
y = housing.pop('SalePrice').values
y[:5]

array([208500, 181500, 223500, 140000, 250000])

## Use the data dictionary to gain a deeper understanding of the problem

A very useful data dictionary is provided that gives descriptions on each column in the dataset. We will rely on it to understand the meaning of the columns and help with making more logical transformations.

Its contents may be printed out directly into the notebook for easy access. For now, look at the first column. It's composed of numeric values, but they are just codes for a type of house.

In [4]:
print(open('data/data_description.txt').read())

MSSubClass: Identifies the type of dwelling involved in the sale.	

        20	1-STORY 1946 & NEWER ALL STYLES
        30	1-STORY 1945 & OLDER
        40	1-STORY W/FINISHED ATTIC ALL AGES
        45	1-1/2 STORY - UNFINISHED ALL AGES
        50	1-1/2 STORY FINISHED ALL AGES
        60	2-STORY 1946 & NEWER
        70	2-STORY 1945 & OLDER
        75	2-1/2 STORY ALL AGES
        80	SPLIT OR MULTI-LEVEL
        85	SPLIT FOYER
        90	DUPLEX - ALL STYLES AND AGES
       120	1-STORY PUD (Planned Unit Development) - 1946 & NEWER
       150	1-1/2 STORY PUD - ALL AGES
       160	2-STORY PUD - 1946 & NEWER
       180	PUD - MULTILEVEL - INCL SPLIT LEV/FOYER
       190	2 FAMILY CONVERSION - ALL STYLES AND AGES

MSZoning: Identifies the general zoning classification of the sale.
		
       A	Agriculture
       C	Commercial
       FV	Floating Village Residential
       I	Industrial
       RH	Residential High Density
       RL	Residential Low Density
       RP	Residential Low Density Park 
       RM

## Classifying columns into groups
One of the most important tasks during this tutorial is going to be the classification of each column into a particular group. Each group will undergo its own set of transformations before being used as input in the machine learning model. A commonly taught approach is to classify each column as either **categorical** or **continuous**. Values in categorical columns are discrete and typically the total number of categories is known. Continuous columns are always numeric and not limited to a known set of values.

### Different transformations for each group
Categorical columns are processed differently than continuous ones, with each group needing its own set of transformations. For instance, we may want to fill in missing values with the most frequent value for categorical columns, but with the mean for continuous ones. Categorical columns are often one-hot encoded, while continuous columns are often standardized. As we will see, there are more groups of columns than just categorical and continuous, and each column group will have its own set of transformations.

## A Mini Machine Learning Pipeline
Before we embark on our massive machine learning pipeline, we'll create a much simpler one that consists of a few columns that are either categorical or continuous. Creating this small pipeline will help understand the larger one.

### Select columns for each group
We'll begin by assigning a few categorical and continuous columns to lists. The columns here are chosen arbitrarily, as the focus is going to be on applying the transformations and building the pipeline.

In [5]:
cat_cols = ['Neighborhood', 'LotShape', 'OverallQual', 'MasVnrType']
cont_cols = ['GrLivArea', 'GarageArea', 'LotFrontage']

### Build a pipeline for each column group
Each group will go through two transformations. For the categorical columns, the missing values will be filled with the most frequent value and then one-hot encoded. For the continuous columns, the missing values will be filled with the mean and then standardized. A scikit-learn `Pipeline` can be used whenever two or more transformations need to be applied in succession.

Before we build the pipeline, we will import each of the transformers and instantiate them. Notice that the `OneHotEncoder` is constructed with the `handle_unknown` parameter set to 'ignore'. This will help us when predicting on new data that has categories not seen in the training set.

In [6]:
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

si_cat = SimpleImputer(strategy='most_frequent')
# Note: `sparse` was renamed to `sparse_output` in scikit-learn 1.2
ohe = OneHotEncoder(sparse=False, handle_unknown='ignore')

si_cont = SimpleImputer(strategy='mean')
ss = StandardScaler()
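
To see what `handle_unknown='ignore'` buys us, here is a minimal sketch on made-up data. The neighborhood name 'Mystery' is hypothetical and stands in for any category that shows up only at prediction time. The encoder is left with its default sparse output and converted with `toarray` so the example does not depend on how a given scikit-learn version names the dense-output parameter.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Toy training data with two known neighborhoods
train = np.array([['CollgCr'], ['Veenker'], ['CollgCr']], dtype=object)
# 'Mystery' is a hypothetical category that was never seen during fit
new = np.array([['Veenker'], ['Mystery']], dtype=object)

enc = OneHotEncoder(handle_unknown='ignore')
enc.fit(train)

# The default output is a sparse matrix, so convert it for display
encoded = enc.transform(new).toarray()
print(enc.categories_[0])
print(encoded)
```

The unseen category is encoded as a row of all zeros rather than raising an error, so the model treats that house as belonging to none of the known neighborhoods.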

We can now import the scikit-learn `Pipeline` to create the actual pipelines for each column group. To instantiate a `Pipeline`, create a list of two-item tuples where each tuple consists of a **name** and the **transformer**. The name is an arbitrary string that may be used to reference the transformer at a later time. Below, we create two pipelines, `cat_pipe` and `cont_pipe`.

In [7]:
from sklearn.pipeline import Pipeline

cat_steps = [
    ('si', si_cat),
    ('ohe', ohe)
]
cat_pipe = Pipeline(cat_steps)

cont_steps = [
    ('si', si_cont),
    ('ss', ss)
]
cont_pipe = Pipeline(cont_steps)
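
As a small illustration of why the step names matter, the sketch below fits a two-step pipeline on a toy column (the values are made up) and then uses the names to look up a fitted step and to change one of its parameters. The `toy` and `toy_pipe` names are ours and are not part of the tutorial's pipeline.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# One continuous column with a single missing value
toy = pd.DataFrame({'GrLivArea': [1000.0, 2000.0, np.nan, 3000.0]})

toy_pipe = Pipeline([
    ('si', SimpleImputer(strategy='mean')),
    ('ss', StandardScaler()),
])
toy_pipe.fit(toy)

# Look up a fitted step by the name given in its (name, transformer) tuple
imputer = toy_pipe.named_steps['si']
print(imputer.statistics_)  # the mean used to fill the missing value

# The same names address parameters as '<step name>__<parameter name>'
toy_pipe.set_params(si__strategy='median')
print(toy_pipe.named_steps['si'].strategy)
```

The double-underscore naming is also what hyperparameter search tools in scikit-learn use to reach into a pipeline, which is one reason to pick short, descriptive step names.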

By default, scikit-learn will apply the transformations in a pipeline to all of the columns. But, we are only interested in passing the categorical columns through the categorical pipeline and the continuous columns through the continuous pipeline. The excellent `ColumnTransformer` allows us to do this. We instantiate it with a list of three-item tuples where each tuple consists of a **name** of the transformation, the **transformer**, and a list of **columns** to apply the transformation to. In our case, the transformer is a pipeline of individual transformers.

In [8]:
from sklearn.compose import ColumnTransformer

transformers = [
    ('cat_cols', cat_pipe, cat_cols),
    ('cont_cols', cont_pipe, cont_cols)
]
ct = ColumnTransformer(transformers)

### Visualizing the ColumnTransformer
At this point, we have constructed the machinery to do the transformations. We haven't actually done any transformations, but we are ready to do so. The following shows how our data would flow at this point. Our raw data would be passed to the `ColumnTransformer`, which sends the categorical columns to the categorical pipeline and the continuous columns to the continuous pipeline. Each pipeline applies two successive transformations to its data. After each pipeline has completed, the `ColumnTransformer` concatenates the results back together to form a single transformed dataset.

![](images/simple_columntransformer.png)

### Passing the data through the `ColumnTransformer`

Let's pass our data through the `ColumnTransformer` to obtain our final transformed dataset. Only the columns that appear in either the `cat_cols` or `cont_cols` lists will be transformed. Any other columns will be dropped. Pass in the pandas DataFrame to the `fit_transform` method and you will be returned a numpy array of the transformed data.

In [9]:
X_t = ct.fit_transform(housing)
type(X_t)

numpy.ndarray

Let's see the shape of the new dataset.

In [10]:
X_t.shape

(1460, 46)

We transformed 7 columns and were returned 46. This is entirely due to one-hot encoding. All the other transformers mapped each input column to exactly one output column.
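
The column count can be checked directly from the fitted encoder. The sketch below does so on a small made-up DataFrame; the `toy_*` names are ours, and a bare `OneHotEncoder` and `StandardScaler` stand in for the two pipelines to keep the example short.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Made-up rows with two categorical columns and one continuous column
toy = pd.DataFrame({
    'LotShape': ['Reg', 'IR1', 'Reg', 'IR2'],
    'OverallQual': [7, 6, 7, 8],
    'GrLivArea': [1710.0, 1262.0, 1786.0, 1717.0],
})
toy_cat, toy_cont = ['LotShape', 'OverallQual'], ['GrLivArea']

toy_ct = ColumnTransformer([
    ('cat_cols', OneHotEncoder(handle_unknown='ignore'), toy_cat),
    ('cont_cols', StandardScaler(), toy_cont),
])
toy_t = toy_ct.fit_transform(toy)

# One output column per category found in each categorical column
fitted_ohe = toy_ct.named_transformers_['cat_cols']
n_per_col = [len(cats) for cats in fitted_ohe.categories_]
print(dict(zip(toy_cat, n_per_col)))
print(sum(n_per_col) + len(toy_cont), toy_t.shape[1])
```

The number of one-hot columns plus the number of continuous columns equals the width of the transformed data, which is the same bookkeeping behind the 7-to-46 jump above.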

### Create one more pipeline to do machine learning
Our data is now ready to be fed into a machine learning model. We could use the variable `X_t` that was created above, but instead, we will build another pipeline where the first step passes the data through the `ColumnTransformer` and the second passes the transformed data to the machine learning model. We'll use Ridge Regression for learning. Let's create this final two-step pipeline.

In [16]:
from sklearn.linear_model import Ridge
ridge = Ridge()

steps = [
    ('ct', ct),
    ('ridge', ridge)
]

final_pipe = Pipeline(steps)

### Visualizing the final pipeline

Our final pipeline is a bit complex. It consists of only two steps, but the ColumnTransformer in the first step contains two separate pipelines and each of those pipelines contains two transformations. The image below shows how the data is processed.

![](images/final_pipeline.png)

### Train the model
Let's use this pipeline to train a Ridge regression model. We simply pass the original DataFrame to the `fit` method, which will run all the transformations and train the model.

In [17]:
final_pipe.fit(housing, y);

### Making predictions
After training, we can now make predictions on new datasets that have the same column names as the original. Let's read in the test dataset; we will need its `Id` column later when we create a submission.

In [18]:
housing_test = pd.read_csv('data/test.csv')
housing_test.head(3)

Unnamed: 0,Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,LotConfig,LandSlope,Neighborhood,Condition1,Condition2,BldgType,HouseStyle,OverallQual,OverallCond,YearBuilt,YearRemodAdd,RoofStyle,RoofMatl,Exterior1st,Exterior2nd,MasVnrType,MasVnrArea,ExterQual,ExterCond,Foundation,BsmtQual,BsmtCond,BsmtExposure,BsmtFinType1,BsmtFinSF1,BsmtFinType2,BsmtFinSF2,BsmtUnfSF,TotalBsmtSF,Heating,HeatingQC,CentralAir,Electrical,1stFlrSF,2ndFlrSF,LowQualFinSF,GrLivArea,BsmtFullBath,BsmtHalfBath,FullBath,HalfBath,BedroomAbvGr,KitchenAbvGr,KitchenQual,TotRmsAbvGrd,Functional,Fireplaces,FireplaceQu,GarageType,GarageYrBlt,GarageFinish,GarageCars,GarageArea,GarageQual,GarageCond,PavedDrive,WoodDeckSF,OpenPorchSF,EnclosedPorch,3SsnPorch,ScreenPorch,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition
0,1461,20,RH,80.0,11622,Pave,,Reg,Lvl,AllPub,Inside,Gtl,NAmes,Feedr,Norm,1Fam,1Story,5,6,1961,1961,Gable,CompShg,VinylSd,VinylSd,,0.0,TA,TA,CBlock,TA,TA,No,Rec,468.0,LwQ,144.0,270.0,882.0,GasA,TA,Y,SBrkr,896,0,0,896,0.0,0.0,1,0,2,1,TA,5,Typ,0,,Attchd,1961.0,Unf,1.0,730.0,TA,TA,Y,140,0,0,0,120,0,,MnPrv,,0,6,2010,WD,Normal
1,1462,20,RL,81.0,14267,Pave,,IR1,Lvl,AllPub,Corner,Gtl,NAmes,Norm,Norm,1Fam,1Story,6,6,1958,1958,Hip,CompShg,Wd Sdng,Wd Sdng,BrkFace,108.0,TA,TA,CBlock,TA,TA,No,ALQ,923.0,Unf,0.0,406.0,1329.0,GasA,TA,Y,SBrkr,1329,0,0,1329,0.0,0.0,1,1,3,1,Gd,6,Typ,0,,Attchd,1958.0,Unf,1.0,312.0,TA,TA,Y,393,36,0,0,0,0,,,Gar2,12500,6,2010,WD,Normal
2,1463,60,RL,74.0,13830,Pave,,IR1,Lvl,AllPub,Inside,Gtl,Gilbert,Norm,Norm,1Fam,2Story,5,5,1997,1998,Gable,CompShg,VinylSd,VinylSd,,0.0,TA,TA,PConc,Gd,TA,No,GLQ,791.0,Unf,0.0,137.0,928.0,GasA,Gd,Y,SBrkr,928,701,0,1629,0.0,0.0,2,1,3,1,TA,6,Typ,1,TA,Attchd,1997.0,Fin,2.0,482.0,TA,TA,Y,212,34,0,0,0,0,,MnPrv,,0,3,2010,WD,Normal


We now pass in the test set to the pipeline's `predict` method to get the predictions.

In [19]:
y_pred = final_pipe.predict(housing_test)
y_pred[:5]

array([132549.05709854, 151506.41270501, 172053.69388497, 182112.62922525,
       255906.38643771])

## Saving the model to disk for future use
We can preserve our trained pipeline exactly as it is by saving it to disk with the `dump` function from the `joblib` library. See the [joblib documentation][1] for more information. Pass it the pipeline and a name for the new file.

[1]: https://joblib.readthedocs.io/en/latest/persistence.html#persistence

In [20]:
import joblib
joblib.dump(final_pipe, 'models/minipipeline_ridge.joblib')

['models/minipipeline_ridge.joblib']

## Retrieve the saved model
We can retrieve the model from disk with the `load` function. It has preserved every step of the pipeline. We verify this by checking that its predictions match the originals.

In [21]:
final_pipe_new = joblib.load('models/minipipeline_ridge.joblib')
y_pred_new = final_pipe_new.predict(housing_test)
(y_pred == y_pred_new).all()

True
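
The same round trip can be tried without the housing data. The following is a minimal sketch on randomly generated data, written to a temporary directory so nothing is left on disk; the `toy_pipe` and `restored` names are ours. It compares predictions with `np.allclose`, which tolerates tiny floating-point differences where `==` demands exact equality.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Random stand-in data; the coefficients are arbitrary
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(20, 3))
y_toy = X_toy @ np.array([1.0, 2.0, 3.0]) + rng.normal(size=20)

toy_pipe = Pipeline([('ss', StandardScaler()), ('ridge', Ridge())])
toy_pipe.fit(X_toy, y_toy)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'toy_pipe.joblib')
    joblib.dump(toy_pipe, path)
    restored = joblib.load(path)

# The restored pipeline should predict the same values as the original
same = np.allclose(toy_pipe.predict(X_toy), restored.predict(X_toy))
print(same)
```

Keep in mind that joblib files are pickles under the hood, so only load files you trust, and expect the most reliable results when loading with the same scikit-learn version that saved the model.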

## Submit to Kaggle

We now have a prediction for each test observation and can submit them to Kaggle to get scored and ranked against the other competitors.

### Create a csv of Id and SalePrice
We need to submit a csv file of the row Id and our predicted sale price. We do so with the DataFrame constructor to create a two-column DataFrame.

In [22]:
sub01 = pd.DataFrame({'Id': housing_test['Id'], 'SalePrice': y_pred})
sub01.head(3)

Unnamed: 0,Id,SalePrice
0,1461,132549.057099
1,1462,151506.412705
2,1463,172053.693885


### File naming and directory structure

From here we can export our DataFrame as a csv. I strongly recommend creating a submissions folder within the data folder and within that folder a new folder for each date that you track submissions. Within that folder is where the submission files will be saved.

In [23]:
sub01.to_csv('data/submissions/20190710/sub01.csv', index=False)
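
Note that `to_csv` does not create missing folders, so the dated folder has to exist before the cell above runs. One way to automate the layout is sketched below; `submission_path` is a hypothetical helper of ours, and a temporary directory stands in for the `data` folder.

```python
import os
import tempfile
from datetime import date

import pandas as pd


def submission_path(root, name, day=None):
    """Return root/submissions/YYYYMMDD/name, creating folders as needed."""
    day = day or date.today()
    folder = os.path.join(root, 'submissions', day.strftime('%Y%m%d'))
    os.makedirs(folder, exist_ok=True)
    return os.path.join(folder, name)


# A temporary directory stands in for the tutorial's data folder
with tempfile.TemporaryDirectory() as root:
    path = submission_path(root, 'sub01.csv', date(2019, 7, 10))
    toy_sub = pd.DataFrame({'Id': [1461], 'SalePrice': [132549.06]})
    toy_sub.to_csv(path, index=False)
    written = os.path.exists(path)
    relative = os.path.relpath(path, root)

print(written, relative)
```

Leaving out the `day` argument uses today's date, so each day's submissions land in their own folder automatically.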

Our directory structure for the data takes the following shape:

![](images/dir.png)

### Make a submission to Kaggle from python
Kaggle has kindly provided a [python library][2] (`pip install kaggle`) to make submissions programmatically. You'll need to read the documentation to learn how to authenticate your account.

[2]: https://github.com/Kaggle/kaggle-api

In [24]:
import kaggle

We submit our csv by passing the submission function our file location, a message, and the competition name. It's important to give a good descriptive message so that you can remember how that particular submission was created.

In [25]:
file = 'data/submissions/20190710/sub01.csv'
message = '''
One hot encoded four categorical columns and standardized three continuous
columns. Modeled with ridge regression with alpha=1
'''
competition = 'house-prices-advanced-regression-techniques'
kaggle.api.competition_submit(file, message, competition)

100%|██████████| 33.6k/33.6k [00:01<00:00, 28.0kB/s]


Successfully submitted to House Prices: Advanced Regression Techniques

### Retrieve score
We can retrieve a list of all of our prior scores with the following function call. The most recent submission will be the first item in the list which we print to the screen.

In [26]:
all_submissions = kaggle.api.competitions_submissions_list(competition)
all_submissions[0]

{'ref': 11780576,
 'totalBytes': 34422,
 'date': '2019-07-10T23:26:52.247Z',
 'description': '\nOne hot encoded four categorical columns and standardized three continuous\ncolumns. Modeled with ridge regression with alpha=1\n',
 'errorDescription': None,
 'fileName': 'sub01.csv',
 'publicScore': '0.18034',
 'privateScore': None,
 'status': 'complete',
 'submittedBy': 'Ted Petrou',
 'submittedByRef': 'tedpetrou',
 'teamName': 'Ted Petrou',
 'type': 'standard',
 'url': 'https://www.kaggle.com/submissions/11780576/11780576.raw'}

The competition is scored with the root mean squared error between the logarithm of the predicted price and the logarithm of the actual price. Our score is about 0.18. Unfortunately, there is no easy way to get our place on the leaderboard programmatically, but you can navigate [directly to the Kaggle leaderboard][3] instead.

[3]: https://www.kaggle.com/c/house-prices-advanced-regression-techniques/leaderboard
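
To get a feel for the scale of this metric, here is a minimal sketch of how it could be computed locally. The function name `rmse_log` is ours, and the implementation assumes the plain natural logarithm of strictly positive prices.

```python
import numpy as np


def rmse_log(y_true, y_pred):
    """Root mean squared error between the logs of the prices."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.sqrt(np.mean((np.log(y_pred) - np.log(y_true)) ** 2))


actual = [208500, 181500, 223500]
perfect = rmse_log(actual, actual)
# Every prediction is 10% too high
ten_pct_high = rmse_log(actual, [p * 1.1 for p in actual])
print(perfect, ten_pct_high)
```

Because the error is measured on logged prices, being 10% too high costs the same on a cheap house as on an expensive one. A linear model such as ridge regression can in principle predict a non-positive price, which would make the logarithm undefined, so it is worth checking the predictions before scoring them this way.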

## Pipeline vs ColumnTransformer
It may not be crystal clear what the difference is between the `Pipeline` and `ColumnTransformer`. The simplest distinction is to think of the `Pipeline` as moving all of the data **vertically** in succession from one transformer to the next. The `ColumnTransformer` splits data **horizontally** into multiple subsets. Each of these subsets gets applied a transformation that is independent of all the other subsets of data. All the transformed subsets are then concatenated together to form a single dataset.

In a `Pipeline`, all columns will be passed into each estimator with the result being passed to the next estimator. A `Pipeline` allows for the last step to be a machine learning model whereas the `ColumnTransformer` only allows transformers.
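
The difference can be seen on a toy DataFrame with two continuous columns (the values are made up). In this sketch, the `Pipeline` scales every column, while the `ColumnTransformer` scales only the column it is given. We set `remainder='passthrough'` so the other column is kept rather than dropped, which is a choice made for this illustration and not something the tutorial's pipeline uses.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

toy = pd.DataFrame({'GrLivArea': [1000.0, 2000.0, 3000.0],
                    'GarageArea': [200.0, 400.0, 600.0]})

# Pipeline: every column flows through every step
toy_pipe = Pipeline([('ss', StandardScaler())])
all_scaled = toy_pipe.fit_transform(toy)

# ColumnTransformer: only the listed columns reach the transformer;
# remainder='passthrough' keeps the rest instead of dropping them
toy_ct = ColumnTransformer([('ss', StandardScaler(), ['GrLivArea'])],
                           remainder='passthrough')
one_scaled = toy_ct.fit_transform(toy)

print(all_scaled)
print(one_scaled)
```

In the second result only the first column is standardized, and the untouched `GarageArea` values are appended after the transformed columns.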

### All steps in one cell
All the above steps for our mini machine learning pipeline were written in different cells, which probably makes it harder to see the full structure of the program as it would appear outside of a Jupyter Notebook or tutorial. Here they are together in a single cell.

In [28]:
si_cat = SimpleImputer(strategy='most_frequent')
ohe = OneHotEncoder(sparse=False, handle_unknown='ignore')

si_cont = SimpleImputer(strategy='mean')
ss = StandardScaler()

cat_cols = ['Neighborhood', 'LotShape', 'OverallQual', 'MasVnrType']
cont_cols = ['GrLivArea', 'GarageArea', 'LotFrontage']

cat_steps = [('si', si_cat), ('ohe', ohe)]
cat_pipe = Pipeline(cat_steps)

cont_steps = [('si', si_cont), ('ss', ss)]
cont_pipe = Pipeline(cont_steps)

transformers = [('cat_cols', cat_pipe, cat_cols), ('cont_cols', cont_pipe, cont_cols)]
ct = ColumnTransformer(transformers)

steps = [('ct', ct), ('ridge', ridge)]
final_pipe = Pipeline(steps)
final_pipe.fit(housing, y)
y_pred = final_pipe.predict(housing_test)
sub01 = pd.DataFrame({'Id': housing_test['Id'], 'SalePrice': y_pred})
sub01.to_csv('data/submissions/20190710/sub01.csv', index=False)

## Massive Machine Learning Pipeline
We will move on to building the massive machine learning pipeline. The overall architecture will look similar to the mini-pipeline from above with the major difference being the number of distinct column groups. Each of the column groupings we create will have its own pipeline.

## Create column groupings
For this massive pipeline, we will use all of the columns. Typically, it's a good idea to do exploratory data analysis first to manually inspect the columns and possibly select a subset to model on. We will forgo this step and go straight into pipeline construction. Further along in the tutorial we will do some data inspection and feature engineering.

Below, we separate the data into five separate column groupings. The data type of each column is either numeric or string. Both numeric and string columns can be categorical, but only numeric data can be continuous. Categorical data is further subdivided into nominal (no natural ordering) or ordinal (has a natural ordering - basement quality for example). The data dictionary was used to help classify each column correctly.

In [29]:
str_nomial = ['MSZoning', 'Street', 'Alley', 'LotShape', 'LandContour', 'Utilities', 'LotConfig', 
              'LandSlope', 'Neighborhood', 'Condition1', 'Condition2', 'BldgType', 'HouseStyle',
              'RoofStyle', 'RoofMatl', 'Exterior1st', 'Exterior2nd', 'MasVnrType', 'Foundation',
              'Heating', 'CentralAir', 'Electrical', 'GarageType', 'GarageFinish', 'PavedDrive',
              'MiscFeature', 'SaleType', 'SaleCondition']
str_ordinal = ['ExterQual', 'ExterCond', 'BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1',
               'BsmtFinType2', 'HeatingQC', 'KitchenQual', 'Functional', 'GarageQual', 'GarageCond',
               'PoolQC', 'Fence', 'FireplaceQu']

numeric_nominal = ['MSSubClass', 'MoSold', 'YrSold', 'YearBuilt', 'YearRemodAdd', 'GarageYrBlt']
numeric_ordinal = ['OverallQual', 'OverallCond', 'BsmtFullBath', 'BsmtHalfBath', 'FullBath', 'HalfBath',
                   'BedroomAbvGr', 'KitchenAbvGr', 'TotRmsAbvGrd', 'Fireplaces', 'GarageCars', ]
numeric_cont = ['LotFrontage', 'LotArea', 'MasVnrArea', 'BsmtFinSF1', 'BsmtFinSF2', 'BsmtUnfSF', 
                'TotalBsmtSF','1stFlrSF', '2ndFlrSF', 'LowQualFinSF', 'GrLivArea', 'GarageArea', 
                'WoodDeckSF', 'OpenPorchSF', 'EnclosedPorch', '3SsnPorch', 'ScreenPorch', 'PoolArea',
                'MiscVal']

### Nominal vs Ordinal
Most of the time it is easy to determine whether a categorical column is nominal or ordinal. The neighborhood column, for instance, has no natural ordering and would therefore be classified as nominal.

But with a column like LotShape, it isn't so clear. The values are Reg (Regular), IR1 (Slightly Irregular), IR2 (Moderately Irregular), and IR3 (Irregular). The values appear to have an order, with Reg being the 'best' and IR3 the 'worst'. But, without being an expert in this field, it's probably safer to assume less and treat it as nominal.

### One-Hot encoding all categorical columns
While scikit-learn does provide an `OrdinalEncoder` to encode ordinal features, we will opt to one-hot encode them instead. One reason is that ordinal encoding places a stricter assumption on the data: the integer codes impose an order and an equal spacing between categories, which a linear model such as ridge regression treats as real.
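
For comparison, here is a minimal sketch of the two encodings on a made-up quality column. It assumes the usual ordering of the quality codes from worst to best (Po, Fa, TA, Gd, Ex), and it passes that order explicitly because the encoders would otherwise sort the categories alphabetically.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

quality = np.array([['TA'], ['Gd'], ['Ex'], ['Fa'], ['TA']], dtype=object)

# Assumed ordering of the quality codes, from worst to best
order = [['Po', 'Fa', 'TA', 'Gd', 'Ex']]

ordinal = OrdinalEncoder(categories=order).fit_transform(quality)
onehot = OneHotEncoder(categories=order).fit_transform(quality).toarray()

print(ordinal.ravel())  # a single column of codes: Po=0, Fa=1, TA=2, ...
print(onehot.shape)     # one column per category
```

With the ordinal codes, a linear model has a single coefficient for the whole column, so the step from Fa to TA is forced to have the same effect as the step from Gd to Ex. One-hot encoding gives each level its own coefficient at the cost of more columns.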

### Create pipeline steps for each column group

In [30]:
str_nominal_steps = [
    ('si', SimpleImputer(strategy='most_frequent')),
    ('ohe', OneHotEncoder(handle_unknown='ignore', sparse=False))
]
str_ordinal_steps = [
    ('si', SimpleImputer(strategy='most_frequent')),
    ('ohe', OneHotEncoder(handle_unknown='ignore', sparse=False))
]
numeric_nominal_steps = [
    ('si', SimpleImputer(strategy='most_frequent')),
    ('ohe', OneHotEncoder(handle_unknown='ignore', sparse=False))
]
numeric_ordinal_steps = [
    ('si', SimpleImputer(strategy='median')),
    ('ohe', OneHotEncoder(handle_unknown='ignore', sparse=False))
]
numeric_cont_steps = [
    ('si', SimpleImputer(strategy='mean')),
    ('ss', StandardScaler())
]

In [31]:
str_nominal_pipe = Pipeline(str_nominal_steps)
str_ordinal_pipe = Pipeline(str_ordinal_steps)
numeric_nominal_pipe = Pipeline(numeric_nominal_steps)
numeric_ordinal_pipe = Pipeline(numeric_ordinal_steps)
numeric_cont_pipe = Pipeline(numeric_cont_steps)

In [32]:
transformers = [
    ('str_nominal_pipe', str_nominal_pipe, str_nomial),
    ('str_ordinal_pipe', str_ordinal_pipe, str_ordinal),
    ('numeric_nominal_pipe', numeric_nominal_pipe, numeric_nominal),
    ('numeric_ordinal_pipe', numeric_ordinal_pipe, numeric_ordinal),
    ('numeric_cont_pipe', numeric_cont_pipe, numeric_cont)
]
ct = ColumnTransformer(transformers)

In [33]:
final_steps = [
    ('ct', ct),
    ('ridge', Ridge())
]
final_pipe = Pipeline(final_steps)

In [34]:
final_pipe.fit(housing, y);

In [36]:
y_pred = final_pipe.predict(housing_test)

In [37]:
sub02 = pd.DataFrame({'Id': housing_test['Id'], 'SalePrice': y_pred})
sub02.to_csv('data/submissions/20190710/sub02.csv', index=False)

In [39]:
file = 'data/submissions/20190710/sub02.csv'
message = '''
One hot encoded all categorical columns and standardized all continuous
columns. Modeled with ridge regression with alpha=1
'''
competition = 'house-prices-advanced-regression-techniques'
kaggle.api.competition_submit(file, message, competition)

100%|██████████| 33.6k/33.6k [00:02<00:00, 16.2kB/s]


Successfully submitted to House Prices: Advanced Regression Techniques

In [40]:
all_submissions = kaggle.api.competitions_submissions_list(competition)

In [43]:
all_submissions[0]

{'ref': 11780599,
 'totalBytes': 34451,
 'date': '2019-07-10T23:30:10.557Z',
 'description': '\nOne hot encoded all categorical columns and standardized all continuous\ncolumns. Modeled with ridge regression with alpha=1\n',
 'errorDescription': None,
 'fileName': 'sub02.csv',
 'publicScore': '0.18201',
 'privateScore': None,
 'status': 'complete',
 'submittedBy': 'Ted Petrou',
 'submittedByRef': 'tedpetrou',
 'teamName': 'Ted Petrou',
 'type': 'standard',
 'url': 'https://www.kaggle.com/submissions/11780599/11780599.raw'}

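In [None]:
from sklearn.preprocessing import KBinsDiscretizer

# `numeric_bin` is not defined above: it should list three columns, one per n_bins entry
df = housing[numeric_bin]
kbd = KBinsDiscretizer(n_bins=[2, 3, 4], encode='onehot-dense')
kbd.fit(df.fillna(2000))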

In [None]:
kbd.bin_edges_

### Replacing low-frequency categorical values
Before we start transforming groups of columns, we can look at individual columns for particular values that appear very few times (sometimes referred to as outliers). Categorical values that appear infrequently are candidates to be reclassified as another similar category or grouped together with other infrequent categories into an 'other' category.

### Why reclassify low-frequency categoricals?
A primary goal of machine learning is to build a model that generalizes well to future, unseen data. If our model is built with too many low-frequency categorical values, it may overfit to those particular categories. As a concrete example, imagine that there are just 2 houses in our training data from a particular neighborhood and both of these houses, by chance, just happen to be very poor quality houses and are not representative of the entire neighborhood. Our model might unfairly give too much negative weight to that neighborhood and then make poor predictions in the future.

Of course, this isn't always the case and a single unique category can actually give useful information. Perhaps there is a single house that has a solid-gold toilet that massively increases the value of the house. 

But, in general, I like to experiment with consolidating low-frequency categories so that the model can generalize better.

### Finding low-frequency categoricals
The `value_counts` Series method finds the number of times each category appears. Let's see an example.

In [None]:
housing['LotConfig'].value_counts()

In this example, the `LotConfig` feature has 5 unique values but 'FR3' only appears 4 times. By looking at the data dictionary, we can see its description is similar to that of FR2, so we can consider replacing it.

### An automated way to find low-frequency categoricals
We can loop through each column and run the `value_counts` method on it if it is a string column ('object' in pandas). Several columns in this dataset are numeric but represent discrete categories, as we saw with the first column, 'MSSubClass'. To catch these, we also run `value_counts` if the number of unique values is below a certain threshold. Below, only the columns that have at least one category appearing 5 or fewer times are printed to the screen.

In [None]:
for col in housing.columns:
    if housing[col].dtype == 'object' or housing[col].nunique() < 30:
        vc = housing[col].value_counts(dropna=False)
        if vc.min() <= 5:
            print(f'\nColumn {col}')
            print(vc)
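
If you would rather collect the results than print them, the same idea can be wrapped in a function. The following is a sketch under our own naming (`low_frequency_columns`, `max_count`, `max_unique`), shown on a made-up DataFrame. It tests for non-numeric columns with `pd.api.types.is_numeric_dtype` rather than comparing against 'object', which is a substitution on our part.

```python
import pandas as pd


def low_frequency_columns(df, max_count=5, max_unique=30):
    """Map each categorical-looking column to its rare values and their counts."""
    rare = {}
    for col in df.columns:
        is_text = not pd.api.types.is_numeric_dtype(df[col])
        if is_text or df[col].nunique() < max_unique:
            vc = df[col].value_counts(dropna=False)
            if vc.min() <= max_count:
                # Keep only the values at or below the frequency threshold
                rare[col] = vc[vc <= max_count].to_dict()
    return rare


toy = pd.DataFrame({
    'LotConfig': ['Inside'] * 8 + ['Corner'] * 6 + ['FR3'],
    'Street': ['Pave'] * 15,
})
rare = low_frequency_columns(toy)
print(rare)
```

The returned dictionary lists only the offending values, which makes a convenient starting point for deciding what to merge or relabel.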

### Replacing with `replace`

Let's pick a few columns to demonstrate how to replace values. In pandas, the `replace` method helps us do this. We need to create a dictionary mapping the column name to another dictionary that maps the value to be replaced with its replacement value.

In [None]:
replace_dict = \
{
    'LotConfig': {'FR3': 'FR2'},
    # replace all railroads with RR
    'Condition1': {'PosA': 'PosN', 
                   'RRAe': 'RR', 
                   'RRNe': 'RR', 
                   'RRAn': 'RR', 
                   'RRNn': 'RR'},
    'OverallQual': {1: 2},
    'OverallCond': {1: 2},
    'Exterior1st': {'BrkComm': 'OTHER', 
                    'Stone': 'OTHER', 
                    'AsphShn': 'OTHER',
                    'CBlock': 'OTHER',
                    'ImStucc': 'OTHER'},
    'ExterCond':{'Po': 'Fa', 
                 'Ex': 'Gd'},
    'Foundation': {'Stone': 'OTHER',
                   'Wood': 'OTHER'},
    'Functional': {'Sev': 'Maj',
                   'Maj1': 'Maj',
                   'Maj2': 'Maj'},
    'HeatingQC': {'Po': 'Fa'}
}

#### Make the replacement

In [None]:
h2 = housing.replace(replace_dict)
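
The nested dictionary keeps each replacement scoped to its own column. Here is a small sketch on made-up data, where the value 'FR3' is deliberately placed in a second column to show that it is left alone there; the `toy` names are ours.

```python
import pandas as pd

toy = pd.DataFrame({
    'LotConfig': ['Inside', 'FR3', 'FR2'],
    # 'FR3' does not belong in this column; it is here only for the demonstration
    'Exterior1st': ['FR3', 'Stone', 'VinylSd'],
})
toy_replace = {'LotConfig': {'FR3': 'FR2'}}

# replace returns a new DataFrame and leaves the original untouched
replaced = toy.replace(toy_replace)
print(replaced)
```

Only the `LotConfig` column changes, and the original DataFrame still holds its 'FR3' value afterwards.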

Let's check some of the replaced columns.

In [None]:
h2['LotConfig'].value_counts()

In [None]:
h2['Condition1'].value_counts()

### Replace all but a few categories
If a column has many low-frequency categories and only a few that you want to keep, it may be easier to list just the categories to keep. This is awkward to do with `replace`, but straightforward with the `where` method. Let's see this with the `Heating` column.

In [None]:
# before
housing['Heating'].value_counts()

In [None]:
# after
keep = housing['Heating'].isin(['GasA', 'GasW'])
new_col = housing['Heating'].where(keep, 'OTHER')
new_col.value_counts()
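
One detail worth knowing about this approach is how it treats missing values. In the sketch below, on a made-up Series, `isin` returns `False` for a missing value, so it is relabeled 'OTHER' along with the rare categories. If you would rather leave missing values for a later imputation step, you can add them to the mask, as the second half shows.

```python
import numpy as np
import pandas as pd

heating = pd.Series(['GasA', 'GasW', 'Wall', np.nan, 'GasA'], dtype=object)

# Missing values are not in the keep list, so they become 'OTHER' too
keep = heating.isin(['GasA', 'GasW'])
lumped = heating.where(keep, 'OTHER')

# Also keeping missing values leaves them for a later imputation step
keep_missing = keep | heating.isna()
lumped_keep_nan = heating.where(keep_missing, 'OTHER')

print(lumped.tolist())
print(lumped_keep_nan.tolist())
```

Which behavior is preferable depends on whether a missing value should count as its own kind of 'other' or be filled in like any other gap.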

### Create a dictionary of the values you want to keep
Map the column name to the list of values you want to keep. All other values will be transformed to 'OTHER'.

In [None]:
keep_dict = \
{
    'Condition2': ['Norm'],
    'RoofStyle': ['Gable', 'Hip'],
    'RoofMatl': ['CompShg'],
    'Heating': ['GasA', 'GasW'],
    'Electrical': ['SBrkr', 'FuseA','FuseF'],
    'SaleType': ['WD', 'New', 'COD']
    
}

In [None]:
h3 = h2.copy()
for col, keep_vals in keep_dict.items():
    keep = h3[col].isin(keep_vals)
    h3[col] = h3[col].where(keep, 'OTHER')

In [None]:
# before
h2['Condition2'].value_counts()

In [None]:
# after
h3['Condition2'].value_counts()

## Create custom transformer in scikit-learn

In [None]:
from sklearn.base import BaseEstimator, TransformerMixin

In [None]:
# Mixins go to the left of BaseEstimator, as scikit-learn recommends
class Replacer(TransformerMixin, BaseEstimator):

    def __init__(self, replace_dict=None, keep_dict=None):
        self.replace_dict = replace_dict
        self.keep_dict = keep_dict

    def fit(self, X, y=None):
        # Nothing to learn; y is accepted so the transformer works inside a Pipeline
        return self

    def transform(self, X):
        if not isinstance(X, pd.DataFrame):
            raise TypeError('`X` must be a DataFrame')
        # Work on a copy so the caller's DataFrame is never modified
        df = X.copy()
        if self.replace_dict:
            df = df.replace(self.replace_dict)
        if self.keep_dict:
            for col, keep_vals in self.keep_dict.items():
                keep = df[col].isin(keep_vals)
                df[col] = df[col].where(keep, 'OTHER')
        return df

In [None]:
repl = Replacer(replace_dict, keep_dict)

In [None]:
housing_replaced = repl.fit_transform(housing)
housing_replaced.head()

In [None]:
housing_replaced['LotConfig'].value_counts()

In [None]:
housing_replaced['Condition2'].value_counts()
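
A transformer like this can run as the first step of a pipeline, before the `ColumnTransformer`, because it both receives and returns a DataFrame. The sketch below shows the idea on made-up data. `KeepReplacer` is a hypothetical, trimmed-down version of the class above that handles only `keep_dict`, and its `fit` method accepts an optional `y` because `Pipeline.fit` passes the target along to every step.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler


class KeepReplacer(TransformerMixin, BaseEstimator):
    """Hypothetical, trimmed-down Replacer that only handles keep_dict."""

    def __init__(self, keep_dict=None):
        self.keep_dict = keep_dict

    def fit(self, X, y=None):
        # Nothing to learn; y is accepted because Pipeline.fit passes it
        return self

    def transform(self, X):
        df = X.copy()
        for col, keep_vals in (self.keep_dict or {}).items():
            df[col] = df[col].where(df[col].isin(keep_vals), 'OTHER')
        return df


# Made-up rows standing in for the housing data
toy = pd.DataFrame({
    'Heating': ['GasA', 'GasW', 'Wall', 'GasA', 'Grav', 'GasA'],
    'GrLivArea': [1710.0, 1262.0, 1786.0, 1717.0, 2198.0, 1362.0],
})
toy_y = [208500, 181500, 223500, 140000, 250000, 143000]

toy_ct = ColumnTransformer([
    ('cat', OneHotEncoder(handle_unknown='ignore'), ['Heating']),
    ('cont', StandardScaler(), ['GrLivArea']),
])
toy_pipe = Pipeline([
    # Runs first, while the data is still a DataFrame with column names
    ('repl', KeepReplacer({'Heating': ['GasA', 'GasW']})),
    ('ct', toy_ct),
    ('ridge', Ridge()),
])
toy_pipe.fit(toy, toy_y)

fitted_ohe = toy_pipe.named_steps['ct'].named_transformers_['cat']
cats = list(fitted_ohe.categories_[0])
print(cats)
```

The encoder only ever sees the consolidated categories, so the rare heating types share a single 'OTHER' column in the model.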

## Months and years

In [None]:
numeric_months = ['MoSold']
numeric_years = ['YearBuilt', 'YearRemodAdd', 'GarageYrBlt', 'YrSold']

## Percentage values
numeric_percent = []

### Clip values
Keep values within a range with `clip`
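
As a minimal sketch of clipping, the example below uses a made-up lot-area Series. The bounds of 2000 and 50000 are arbitrary values chosen for illustration, not recommendations for this dataset.

```python
import pandas as pd

lot_area = pd.Series([1300, 8450, 9600, 215245], name='LotArea')

# Values outside [2000, 50000] are pulled to the nearest bound
clipped = lot_area.clip(lower=2000, upper=50000)
print(clipped.tolist())
```

Values inside the range pass through unchanged, so clipping limits the influence of extreme values without dropping any rows.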

### Binarize 
Pool/No Pool
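
One way to build such a pool indicator is sketched below on made-up pool areas, assuming that an area of zero means the house has no pool. It shows a plain pandas comparison alongside scikit-learn's `Binarizer`, which could later be placed inside a pipeline.

```python
import pandas as pd
from sklearn.preprocessing import Binarizer

pool_area = pd.DataFrame({'PoolArea': [0, 512, 0, 648]})

# pandas: a house has a pool when its pool area is positive
has_pool = (pool_area['PoolArea'] > 0).astype(int)

# scikit-learn: Binarizer maps values above the threshold to 1, others to 0
has_pool_sk = Binarizer(threshold=0.0).fit_transform(pool_area)

print(has_pool.tolist())
print(has_pool_sk.ravel().tolist())
```

Both approaches give the same indicator; the scikit-learn version has the advantage of fitting into a `ColumnTransformer` like any other transformer.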
