In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt


%matplotlib inline

---

# Part 2. Using sklearn pipeline

**Goals:**

* Implement ML pipeline in sklearn
    * Check that there are no NAs
    * Train/test split
    * Preprocess columns
    * Train model
    * Evaluate on test
* Practice using pipeline

In [2]:
data = pd.read_csv('house_prices_small.csv')
data.head()

Unnamed: 0,SalePrice,LotArea,OverallQual,SaleCondition,YearBuilt
0,208500,8450,7,Normal,2003
1,181500,9600,6,Normal,1976
2,223500,11250,7,Normal,2001
3,140000,9550,7,Abnorml,1915
4,250000,14260,8,Normal,2000


## 1. Prepare the data

### 1.1 Explore the dataset


In [3]:
data.head()

Unnamed: 0,SalePrice,LotArea,OverallQual,SaleCondition,YearBuilt
0,208500,8450,7,Normal,2003
1,181500,9600,6,Normal,1976
2,223500,11250,7,Normal,2001
3,140000,9550,7,Abnorml,1915
4,250000,14260,8,Normal,2000


In [4]:
# missing values?
data.isna().sum()

SalePrice        0
LotArea          0
OverallQual      0
SaleCondition    0
YearBuilt        0
dtype: int64

### 1.2 Separate features from the target and perform a train-test split

The function `train_test_split` randomly splits the dataset into two parts:
- training data, which we will use to find the optimal parameters of the model
- test data, which will be used to report the final performance of the model

E.g. if your dataset initially had 1000 observations and you set the argument `test_size=0.2`, it will return 800 random observations as the **train dataset** and the rest (200 observations) as the **test dataset**.
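
The 1000-observation example can be checked with a minimal sketch. The `toy` DataFrame and its column names are hypothetical and are not part of the house price data; `random_state` is fixed only so that the sketch is reproducible.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy dataset with 1000 observations (not the house price data)
toy = pd.DataFrame({
    'feature': np.arange(1000),
    'target': np.arange(1000) * 2.0,
})

# random_state is fixed only to make this sketch reproducible
toy_train, toy_test = train_test_split(toy, test_size=0.2, random_state=0)

print(len(toy_train), len(toy_test))  # 800 200
```

The two parts are disjoint: every row ends up in exactly one of them.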

In [8]:
from sklearn.model_selection import train_test_split

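# test_size is not set, so the default applies (25% of the rows go to the test set);
# random_state is not set either, so the split differs between runs
tr, te = train_test_split(data)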

y_train = tr.SalePrice
X_train = tr.drop(['SalePrice'], axis=1)

y_test = te.SalePrice
X_test = te.drop(['SalePrice'], axis=1)


### 1.3 Encode categorical and ordinal features, scale numerical ones

In [9]:
X_train.head()

Unnamed: 0,LotArea,OverallQual,SaleCondition,YearBuilt
902,7875,7,Normal,2003
771,8877,4,Normal,1951
1154,13700,7,Normal,1965
925,15611,5,Abnorml,1977
289,8730,6,Normal,1915


How to preprocess the features:
* `LotArea`, `YearBuilt` - numerical features, scale them
* `SaleCondition` - categorical feature, apply one-hot encoding
* `OverallQual` - ordinal feature that is already stored as integers, so no encoding is needed

So we need to apply different transformations to different columns. This can be done with `ColumnTransformer`:

```
ColumnTransformer([
    ('name1', transform1, column_names1),
    ('name2', transform2, column_names2)
])
```

In [18]:
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer

transforms = ColumnTransformer([
    ('ohe', OneHotEncoder(), ['SaleCondition']),
    ('scaling', StandardScaler(), ['LotArea', 'YearBuilt'])
], remainder='passthrough')
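
To see what such a transformer produces, here is a minimal sketch on a few made-up rows. The `toy_X` values and the `toy_` names are hypothetical; `sparse_threshold=0` is added here only so that the result is always a dense array that is easy to inspect.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy rows that mimic the columns of X_train
toy_X = pd.DataFrame({
    'LotArea': [8000, 9000, 10000, 11000],
    'OverallQual': [7, 6, 7, 5],
    'SaleCondition': ['Normal', 'Normal', 'Abnorml', 'Normal'],
    'YearBuilt': [2003, 1976, 2001, 1915],
})

toy_transforms = ColumnTransformer([
    ('ohe', OneHotEncoder(), ['SaleCondition']),
    ('scaling', StandardScaler(), ['LotArea', 'YearBuilt']),
], remainder='passthrough', sparse_threshold=0)  # 0 forces a dense array

toy_out = toy_transforms.fit_transform(toy_X)

# Columns come out in transformer order, with passthrough columns last:
# [SaleCondition_Abnorml, SaleCondition_Normal, LotArea, YearBuilt, OverallQual]
print(toy_out.shape)  # (4, 5)
```

The one-hot columns contain a single 1 per row, the scaled columns have zero mean, and `OverallQual` is passed through unchanged.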

## 2. Train the model

Now we are ready to train the model. We will use a `LinearRegression` model.

In [19]:
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LinearRegression

pipe = Pipeline([
    ('col_transforms', transforms),
    ('regression', LinearRegression())
])

pipe.fit(X_train, y_train)


Pipeline(memory=None,
         steps=[('col_transforms',
                 ColumnTransformer(n_jobs=None, remainder='passthrough',
                                   sparse_threshold=0.3,
                                   transformer_weights=None,
                                   transformers=[('ohe',
                                                  OneHotEncoder(categories='auto',
                                                                drop=None,
                                                                dtype=<class 'numpy.float64'>,
                                                                handle_unknown='error',
                                                                sparse=True),
                                                  ['SaleCondition']),
                                                 ('scaling',
                                                  StandardScaler(copy=True,
                                                                 with
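
Once a pipeline is fitted, its steps can be inspected by name through `named_steps`. The sketch below does this on made-up data; the `toy_` names and all values are hypothetical and only mirror the structure of the pipeline above.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data with the same columns as X_train
toy_X = pd.DataFrame({
    'LotArea': [8000, 9000, 10000, 11000, 12000, 7000],
    'OverallQual': [7, 6, 7, 5, 8, 4],
    'SaleCondition': ['Normal', 'Normal', 'Abnorml', 'Normal', 'Abnorml', 'Normal'],
    'YearBuilt': [2003, 1976, 2001, 1915, 2000, 1950],
})
toy_y = pd.Series([208500, 181500, 223500, 140000, 250000, 120000])

toy_pipe = Pipeline([
    ('col_transforms', ColumnTransformer([
        ('ohe', OneHotEncoder(), ['SaleCondition']),
        ('scaling', StandardScaler(), ['LotArea', 'YearBuilt']),
    ], remainder='passthrough')),
    ('regression', LinearRegression()),
])
toy_pipe.fit(toy_X, toy_y)

# One coefficient per transformed column: 2 one-hot + 2 scaled + 1 passthrough
toy_coef = toy_pipe.named_steps['regression'].coef_
print(toy_coef.shape)  # (5,)

# predict() applies the fitted transformations before the regression step
toy_preds = toy_pipe.predict(toy_X.head(2))
```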

## 3. Evaluate on the test set

$$
\text{Root Mean Squared Error} = \sqrt{\frac{1}{N}\sum_i \left(y_i - \hat{y}_i   \right)^2 }
$$

In [22]:
# predict on the held-out test set
y_pred = pipe.predict(X_test)
# RMSE: square root of the mean squared difference between predictions and targets
np.mean((y_pred - y_test) ** 2) ** 0.5

45763.64446499611
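
The RMSE formula above can be checked by hand on three made-up values. The arrays `y_true` and `y_hat` below are hypothetical and are unrelated to the house price data.

```python
import numpy as np

# Hypothetical targets and predictions
y_true = np.array([3.0, 5.0, 7.0])
y_hat = np.array([2.0, 5.0, 9.0])

squared_errors = (y_true - y_hat) ** 2  # [1, 0, 4]
rmse = np.sqrt(squared_errors.mean())   # sqrt(5 / 3)
print(rmse)
```

Because the errors are squared before averaging, the largest error (2) dominates the result. RMSE is expressed in the same units as the target, so the value reported above is in the units of `SalePrice`.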