### Using Pipelines in scikit-learn

In [1]:
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import Lasso
from sklearn.ensemble import RandomForestRegressor

In [2]:
train = pd.read_csv('data/train.csv')
test  = pd.read_csv('data/test.csv')
y     = np.log(train['SalePrice'])
train.drop('SalePrice', axis=1, inplace=True)

First, we'll split the columns into numeric and categorical.

In [3]:
numeric_columns = train.select_dtypes(include=np.number).columns.tolist()
cat_columns     = train.select_dtypes(include='object').columns.tolist()  # np.object was removed in NumPy 1.24
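
To see what this split produces, here is a minimal sketch on a made-up three-column frame. The frame and its values are assumptions for illustration, not the housing data. It takes the categorical columns as the complement of the numeric ones, which is a variation on the cell above and treats every non-numeric column as categorical.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; column names and values are illustrative only.
toy = pd.DataFrame({
    'LotArea': [8450, 9600, 11250],
    'YearBuilt': [2003, 1976, 2001],
    'Street': ['Pave', 'Grvl', 'Pave'],
})

# Numeric columns, and everything else as "categorical".
toy_numeric = toy.select_dtypes(include=np.number).columns.tolist()
toy_categorical = toy.select_dtypes(exclude=np.number).columns.tolist()

print(toy_numeric)
print(toy_categorical)
```

Taking the complement is only reasonable when the frame has no datetime or boolean columns that need separate handling.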

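Then, we'll create a pipe for each class of columns: one that fills in missing values and scales the numeric columns, and one that fills in missing values and one-hot encodes the categorical columns.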

In [16]:
numeric_pipeline = make_pipeline(SimpleImputer(strategy='median'), StandardScaler())
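# handle_unknown='ignore' encodes categories not seen during fit as all zeros instead of raising an error
cat_pipeline     = make_pipeline(SimpleImputer(strategy='most_frequent'), OneHotEncoder(handle_unknown='ignore'))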
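
To make the two pipes concrete, the sketch below runs each one on a tiny hand-made column. The toy frames and the names `toy_num`, `toy_cat`, `num_pipe`, and `cat_pipe` are assumptions for illustration only. It shows the missing numeric value being replaced by the median before scaling, and the missing category being replaced by the most frequent value before encoding.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy columns, each with one missing value.
toy_num = pd.DataFrame({'LotArea': [1.0, 2.0, np.nan, 3.0]})
toy_cat = pd.DataFrame({'Street': ['Pave', np.nan, 'Grvl', 'Pave']}, dtype=object)

num_pipe = make_pipeline(SimpleImputer(strategy='median'), StandardScaler())
cat_pipe = make_pipeline(SimpleImputer(strategy='most_frequent'), OneHotEncoder())

# The NaN becomes the median (2.0), which equals the column mean and so scales to 0.
num_out = num_pipe.fit_transform(toy_num)

# The NaN becomes 'Pave'; the encoder returns a sparse matrix, so densify it to inspect.
cat_out = cat_pipe.fit_transform(toy_cat).toarray()

print(num_out.ravel())
print(cat_out)
```

The encoded columns follow the sorted category order, so here the first column is `Grvl` and the second is `Pave`.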

Now, we'll combine them into a master pipe that applies each pipe to its own group of columns and stacks the results side by side.

In [17]:
transformers = [
    ('num', numeric_pipeline, numeric_columns),
    ('cat', cat_pipeline, cat_columns)
]

combined_pipe = ColumnTransformer(transformers)
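
As a rough illustration of how `ColumnTransformer` routes columns, the sketch below applies the same kind of setup to a made-up frame. The frame, its values, and the name `toy_pipe` are assumptions. Note that a column not listed in any transformer (here `Id`) is dropped, because `remainder='drop'` is the default.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy frame: one unlisted column, one numeric, one categorical.
toy = pd.DataFrame({
    'Id': [1, 2, 3, 4],
    'LotArea': [1.0, 2.0, np.nan, 3.0],
    'Street': pd.Series(['Pave', 'Grvl', np.nan, 'Pave'], dtype=object),
})

toy_pipe = ColumnTransformer([
    ('num', make_pipeline(SimpleImputer(strategy='median'), StandardScaler()), ['LotArea']),
    ('cat', make_pipeline(SimpleImputer(strategy='most_frequent'), OneHotEncoder()), ['Street']),
])

out = toy_pipe.fit_transform(toy)
# The result may be sparse or dense depending on its density, so normalise it.
out = out.toarray() if hasattr(out, 'toarray') else np.asarray(out)

# One scaled numeric column plus two one-hot columns; 'Id' is not included.
print(out.shape)  # (4, 3)
```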

We'll now fit the combined pipe on our training set and transform it:

In [18]:
train_clean = combined_pipe.fit_transform(train)

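And, importantly, transform our test set *using what was learned from our training set*: the same medians, scaling statistics, and one-hot columns, so both matrices end up with the same number of columns.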

In [7]:
test_clean  = combined_pipe.transform(test)
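
The sketch below illustrates this fit-on-train, transform-on-test pattern with made-up data. The frames and names are assumptions, and it assumes the encoder is built with `handle_unknown='ignore'` so that a category absent from the training data does not raise an error. The test rows are imputed and scaled with the training median and mean, and the unseen category becomes a row of zeros in the one-hot columns.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data; 'Dirt' appears only in the test frame.
toy_train = pd.DataFrame({
    'LotArea': [1.0, 2.0, 3.0],
    'Street': pd.Series(['Pave', 'Grvl', 'Pave'], dtype=object),
})
toy_test = pd.DataFrame({
    'LotArea': [2.0, np.nan],
    'Street': pd.Series(['Pave', 'Dirt'], dtype=object),
})

toy_pipe = ColumnTransformer([
    ('num', make_pipeline(SimpleImputer(strategy='median'), StandardScaler()), ['LotArea']),
    ('cat', make_pipeline(SimpleImputer(strategy='most_frequent'),
                          OneHotEncoder(handle_unknown='ignore')), ['Street']),
])

def to_dense(matrix):
    """Return a dense array whether the transformer produced sparse or dense output."""
    return matrix.toarray() if hasattr(matrix, 'toarray') else np.asarray(matrix)

# fit_transform learns the median, mean, standard deviation, and categories from train only.
train_out = to_dense(toy_pipe.fit_transform(toy_train))
# transform reuses those learned values; nothing is re-estimated from the test frame.
test_out = to_dense(toy_pipe.transform(toy_test))

print(train_out.shape, test_out.shape)
print(test_out)
```

Calling `fit_transform` on the test frame instead would re-learn the statistics and categories from the test data, which can yield a different number of columns.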