# Creating an end-to-end Pipeline
The code below is mainly adapted from this [blog post about creating Pipelines](https://medium.com/vickdata/a-simple-guide-to-scikit-learn-pipelines-4ac0d974bdcf).

**Question 4:** Try creating a single pipeline that does the full data preparation plus the final transformation.

What data preparation steps do we have to include? <br>

*Exclude data observation/wrangling:* We aren't concerned with checking the integrity/usefulness of our features or making any choices about composite features (e.g. feature engineering). We are only concerned with the *preparation* of the data for use in an ML algorithm. <br>

However, in the chapter, we decided to include certain composite features in our model. Since any test or real-life data passed through the Pipeline also needs these features, we need to include a transformer that adds them for new instances. Luckily, the book already provides the code for this custom transformer, which I'll use in my answer.

In this case, we *are* concerned about the following steps:
1. Impute or drop missing values 
2. Encode categorical features (One-hot)
3. Add combined features 
4. Scaling/standardization of continuous features 

## 1. Import modules, get data and split

In [57]:
import sklearn

import numpy as np
import os
import pandas as pd

In [58]:
# get data: copied from Ch2 Notebook
import os
import tarfile
import urllib.request  # the request submodule must be imported explicitly

DOWNLOAD_ROOT = "https://raw.githubusercontent.com/ageron/handson-ml2/master/"
HOUSING_PATH = os.path.join("datasets", "housing")
HOUSING_URL = DOWNLOAD_ROOT + "datasets/housing/housing.tgz"

def fetch_housing_data(housing_url=HOUSING_URL, housing_path=HOUSING_PATH):
    if not os.path.isdir(housing_path):
        os.makedirs(housing_path)
    tgz_path = os.path.join(housing_path, "housing.tgz")
    urllib.request.urlretrieve(housing_url, tgz_path)
    housing_tgz = tarfile.open(tgz_path)
    housing_tgz.extractall(path=housing_path)
    housing_tgz.close()
    
fetch_housing_data()

In [59]:
import pandas as pd

def load_housing_data(housing_path=HOUSING_PATH):
    csv_path = os.path.join(housing_path, "housing.csv")
    return pd.read_csv(csv_path)
housing = load_housing_data()

In [60]:
# using random test/train split
from sklearn.model_selection import train_test_split

train_set, test_set = train_test_split(housing, test_size=0.2, random_state=42)

In [61]:
train_set.head()

| | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value | ocean_proximity |
|---|---|---|---|---|---|---|---|---|---|---|
| 14196 | -117.03 | 32.71 | 33.0 | 3126.0 | 627.0 | 2300.0 | 623.0 | 3.2596 | 103000.0 | NEAR OCEAN |
| 8267 | -118.16 | 33.77 | 49.0 | 3382.0 | 787.0 | 1314.0 | 756.0 | 3.8125 | 382100.0 | NEAR OCEAN |
| 17445 | -120.48 | 34.66 | 4.0 | 1897.0 | 331.0 | 915.0 | 336.0 | 4.1563 | 172600.0 | NEAR OCEAN |
| 14265 | -117.11 | 32.69 | 36.0 | 1421.0 | 367.0 | 1418.0 | 355.0 | 1.9425 | 93400.0 | NEAR OCEAN |
| 2271 | -119.8 | 36.78 | 43.0 | 2382.0 | 431.0 | 874.0 | 380.0 | 3.5542 | 96500.0 | INLAND |


In [62]:
y_train = train_set['median_house_value']
X_train = train_set.drop(['median_house_value'], axis=1)
y_test = test_set['median_house_value']
X_test = test_set.drop(['median_house_value'], axis=1)  # test features go in X_test, not X_train

## 2. Import and define transformers

We want the following transformers:
1. `SimpleImputer`
2. `OneHotEncoder`
3. Our custom transformer `CombinedAttributesAdder`
4. `StandardScaler`
5. `ColumnTransformer`

In [63]:
# import Pipeline 
from sklearn.pipeline import Pipeline

In [64]:
# import the sklearn built-in transformers
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.compose import ColumnTransformer

In [65]:
# define custom transformer
from sklearn.base import BaseEstimator, TransformerMixin

class CombinedAttributesAdder(BaseEstimator, TransformerMixin):
    def __init__(self, rooms_ix, bedrooms_ix, population_ix, households_ix, add_bedrooms_per_room = True): # no *args or **kargs
        self.add_bedrooms_per_room = add_bedrooms_per_room
        self.rooms_ix = rooms_ix
        self.bedrooms_ix = bedrooms_ix
        self.population_ix = population_ix
        self.households_ix = households_ix
    def fit(self, X, y=None):
        return self  # nothing else to do
    def transform(self, X):
        # use the indices stored on the instance, not global names
        rooms_per_household = X[:, self.rooms_ix] / X[:, self.households_ix]
        population_per_household = X[:, self.population_ix] / X[:, self.households_ix]
        if self.add_bedrooms_per_room:
            bedrooms_per_room = X[:, self.bedrooms_ix] / X[:, self.rooms_ix]
            return np.c_[X, rooms_per_household, population_per_household,
                         bedrooms_per_room]
        else:
            return np.c_[X, rooms_per_household, population_per_household]


## 3. Make Pipeline object 

We can nest Pipelines: a Pipeline object can be used as a step within another Pipeline object. The blog post I'm using as a reference suggests three levels of nesting. The outermost Pipeline preprocesses the data and fits the model in one step. It has two steps: a preprocessing transformer and an estimator. The preprocessing transformer is an instance of `ColumnTransformer` and contains two Pipelines: one for numerical features and one for categorical features. Each of these chains the steps its feature type needs: an imputer followed by one-hot encoding for categorical features, and an imputer followed by a scaler for numerical features. In my answer, there is a third Pipeline, for adding combined features.

Steps:
1. Define numerical preprocessor (Pipeline object)
2. Define categorical preprocessor (Pipeline object)
3. Define combined features preprocessor (Pipeline object)
4. Combine all three into the preprocessing object with ColumnTransformer
5. Make overall Pipeline of preprocessor + estimator 
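
To make the nesting concrete, here is a minimal sketch on a made-up three-column frame (the `toy_*` names and values are hypothetical, and `LinearRegression` is only a placeholder estimator). It shows that a parameter of an inner step can be reached from the outermost Pipeline by joining the step names with double underscores.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data, not the housing set
toy = pd.DataFrame({
    "rooms": [3.0, 5.0, np.nan, 4.0],
    "income": [2.0, 4.0, 3.0, 5.0],
    "proximity": ["INLAND", "NEAR OCEAN", "INLAND", "NEAR OCEAN"],
})
toy_y = np.array([100.0, 300.0, 150.0, 350.0])

# Level 3: one Pipeline per feature type
toy_num = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler())])
toy_cat = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="constant", fill_value="missing")),
    ("onehot", OneHotEncoder(handle_unknown="ignore"))])

# Level 2: ColumnTransformer routes columns to the matching Pipeline
toy_pre = ColumnTransformer(transformers=[
    ("num", toy_num, ["rooms", "income"]),
    ("cat", toy_cat, ["proximity"])])

# Level 1: preprocessing plus estimator
toy_model = Pipeline(steps=[("preprocessor", toy_pre),
                            ("regressor", LinearRegression())])

# Nested parameters are addressed as <step>__<sub-step>__<parameter>
toy_model.set_params(preprocessor__num__imputer__strategy="mean")
toy_model.fit(toy, toy_y)
toy_pred = toy_model.predict(toy)

# Number of columns the fitted preprocessor hands to the estimator
n_features_out = toy_model.named_steps["preprocessor"].transform(toy).shape[1]
```

The same double-underscore names are what a parameter search would use to tune steps inside the nested structure.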

### Numerical preprocessor

In [66]:
# adapted from the blog post 
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())])

### Categorical preprocessor

In [67]:
# adapted from the blog post 
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))])

### Combined feature preprocessor

In [68]:
#combined_features = Pipeline(steps=[
#    ('combine', CombinedAttributesAdder(rooms_ix, bedrooms_ix, population_ix, households_ix))])
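
One possible way to wire the combined features in, sketched under assumptions: rather than a separate branch, the adder could sit inside the numeric Pipeline, after the imputer (so the ratios are computed on filled values) and before the scaler (so the new columns are scaled too). The indices then refer to positions within the numeric columns only, because that is all the step receives. `RatioFeaturesAdder` is a simplified, hypothetical stand-in for `CombinedAttributesAdder`, and the `demo_*` data is made up.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


class RatioFeaturesAdder(BaseEstimator, TransformerMixin):
    """Hypothetical stand-in for CombinedAttributesAdder: appends three ratios."""

    def __init__(self, rooms_ix, bedrooms_ix, population_ix, households_ix):
        self.rooms_ix = rooms_ix
        self.bedrooms_ix = bedrooms_ix
        self.population_ix = population_ix
        self.households_ix = households_ix

    def fit(self, X, y=None):
        return self  # stateless: nothing to learn

    def transform(self, X):
        X = np.asarray(X)
        rooms_per_household = X[:, self.rooms_ix] / X[:, self.households_ix]
        population_per_household = X[:, self.population_ix] / X[:, self.households_ix]
        bedrooms_per_room = X[:, self.bedrooms_ix] / X[:, self.rooms_ix]
        return np.c_[X, rooms_per_household, population_per_household,
                     bedrooms_per_room]


# Made-up numeric frame that reuses four of the housing column names
demo_num = pd.DataFrame({
    "total_rooms": [1000.0, 2000.0, 1500.0],
    "total_bedrooms": [200.0, np.nan, 300.0],
    "population": [500.0, 1000.0, 600.0],
    "households": [250.0, 400.0, 300.0],
})

# Look up column positions by name instead of hard-coding them
ix = {name: pos for pos, name in enumerate(demo_num.columns)}

numeric_with_combined = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="median")),
    ("combine", RatioFeaturesAdder(ix["total_rooms"], ix["total_bedrooms"],
                                   ix["population"], ix["households"])),
    ("scaler", StandardScaler())])

demo_out = numeric_with_combined.fit_transform(demo_num)
```

With the real data, the same lookup would be done on the list of numeric feature names that is passed to the `ColumnTransformer`.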

### Use ColumnTransformer

In [69]:
train_set.dtypes

longitude             float64
latitude              float64
housing_median_age    float64
total_rooms           float64
total_bedrooms        float64
population            float64
households            float64
median_income         float64
median_house_value    float64
ocean_proximity        object
dtype: object

In [70]:
# adapted from blog post
numeric_features = train_set.select_dtypes(include=['int64', 'float64']).drop(['median_house_value'], axis=1).columns
categorical_features = train_set.select_dtypes(include=['object']).columns

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)])

In [72]:
type(categorical_features)

pandas.core.indexes.base.Index

### Overall Pipeline

In [73]:
from sklearn.ensemble import RandomForestRegressor

rf = Pipeline(steps=[('preprocessor', preprocessor),
                      ('regressor', RandomForestRegressor())])

#### What is missing here?
1. Feature selection (needs to come before Pipeline)
2. Hyperparameter tuning
3. Cross-validation

Need to figure out how to incorporate 2 and 3 into the final Pipeline object
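
One common arrangement for items 2 and 3, shown here as a sketch rather than the definitive answer, is to put the Pipeline inside the search instead of the search inside the Pipeline: `GridSearchCV` takes the whole Pipeline as its estimator, so every cross-validation fold refits the preprocessing together with the model. The data below is synthetic, and the `demo_pipe`, `X_demo` and `y_demo` names and the grid values are arbitrary assumptions.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption: any numeric regression data works here)
X_demo, y_demo = make_regression(n_samples=120, n_features=4, noise=5.0,
                                 random_state=42)

demo_pipe = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
    ("regressor", RandomForestRegressor(random_state=42))])

# Keys are <step name>__<parameter>; values are the candidates to try
param_grid = {
    "imputer__strategy": ["mean", "median"],
    "regressor__n_estimators": [10, 30],
    "regressor__max_features": [2, 4],
}

# 3-fold cross-validation over all 8 parameter combinations
search = GridSearchCV(demo_pipe, param_grid, cv=3,
                      scoring="neg_mean_squared_error")
search.fit(X_demo, y_demo)

# best_score_ is a negated MSE, so flip the sign before taking the root
best_rmse = (-search.best_score_) ** 0.5
print(search.best_params_, best_rmse)
```

After fitting, `search.best_estimator_` is a complete Pipeline refit on all the data passed to `fit`, so it can be used for prediction directly.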

## 4. Fit the Pipeline and predict

In [74]:
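rf.fit(X_train, y_train)  # raised the ValueError recorded below

ValueError: Number of labels=16512 does not match number of samples=4128

The error recorded above is not caused by the custom transformer, which is not part of this pipeline. It is a shape mismatch: 16512 is the number of training labels and 4128 is the size of the test set, so it appears whenever `X_train` holds the test features (as happens if the second `drop` call in the split cell is assigned to `X_train` instead of `X_test`).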

In [None]:
y_pred = rf.predict(X_test)