# Linear Regression Modelling with Elastic Net
Build a pipeline to model an optimized Elastic Net solution.
Evaluate Feature Importances.

**Data Sources**

- `data/raw/train.csv`: Training set from [Kaggle](https://www.kaggle.com/c/house-prices-advanced-regression-techniques/data).

**Changes**

- 2019-03-22: Start notebook



<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Import-libraries,-load-data" data-toc-modified-id="Import-libraries,-load-data-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Import libraries, load data</a></span></li><li><span><a href="#Go-quick-&amp;-dirty" data-toc-modified-id="Go-quick-&amp;-dirty-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Go quick &amp; dirty</a></span></li><li><span><a href="#Pre-process-data-(outside-of-sklearn-pipeline)" data-toc-modified-id="Pre-process-data-(outside-of-sklearn-pipeline)-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Pre-process data (outside of sklearn pipeline)</a></span><ul class="toc-item"><li><span><a href="#General-pre-processing" data-toc-modified-id="General-pre-processing-3.1"><span class="toc-item-num">3.1&nbsp;&nbsp;</span>General pre-processing</a></span></li><li><span><a href="#Create-two-versions-of-training-data-for-different-outlier-treatment-(experiment)" data-toc-modified-id="Create-two-versions-of-training-data-for-different-outlier-treatment-(experiment)-3.2"><span class="toc-item-num">3.2&nbsp;&nbsp;</span>Create two versions of training data for different outlier treatment (experiment)</a></span></li><li><span><a href="#Split-train-&amp;-test-sets" data-toc-modified-id="Split-train-&amp;-test-sets-3.3"><span class="toc-item-num">3.3&nbsp;&nbsp;</span>Split train &amp; test sets</a></span></li></ul></li><li><span><a href="#Fit,-tune,-predict-(with-Pipelines)" data-toc-modified-id="Fit,-tune,-predict-(with-Pipelines)-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Fit, tune, predict (with Pipelines)</a></span><ul class="toc-item"><li><span><a href="#Build-Pipe" data-toc-modified-id="Build-Pipe-4.1"><span class="toc-item-num">4.1&nbsp;&nbsp;</span>Build Pipe</a></span></li><li><span><a href="#Fit-&amp;-Tune" data-toc-modified-id="Fit-&amp;-Tune-4.2"><span class="toc-item-num">4.2&nbsp;&nbsp;</span>Fit &amp; Tune</a></span></li></ul></li></ul></div>

---

## Import libraries, load data

In [1]:
# Import libraries
import numpy as np
import pandas as pd
from tqdm import tqdm

from scipy import stats
from scipy.stats import norm, skew

from sklearn.linear_model import ElasticNetCV
from sklearn.model_selection import train_test_split, learning_curve
from sklearn.metrics import r2_score, mean_squared_error, make_scorer
from sklearn.model_selection import StratifiedKFold, GridSearchCV

from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer

# My functions
import EDA_functions as EDA
import cleaning_functions as cleaning
from linRegModel_class import LinRegModel

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns #, sns.set_style('whitegrid')
color = 'rebeccapurple'
%matplotlib inline

# Display settings
from IPython.display import display
pd.options.display.max_columns = 100

In [2]:
# Load data
raw_data = pd.read_csv('data/raw/train.csv')

In [3]:
# Load variables from notebook 1
%store -r cols_to_del
%store -r cols_to_log
%store -r outliers_to_del
%store -r top_corr_columns

## Go quick & dirty
Use my 'quick & dirty' function for a baseline model on unprocessed data.

In [4]:
# Initialize a scikit-learn model object of choice
model_simple = ElasticNetCV(alphas=[0.5, 0.1, 1.5], copy_X=True, cv=5, eps=0.001, 
                            fit_intercept=True, l1_ratio=0.5, max_iter=2000, 
                            n_alphas=None, n_jobs=-1)

# Create an instance of the LinRegModel class by passing df, target variable and model object
elastic_net_simple = LinRegModel(raw_data, 'SalePrice', model_simple)

# Output instance
display(elastic_net_simple)

ElasticNetCV(alphas=[0.5, 0.1, 1.5], copy_X=True, cv=5, eps=0.001,
       fit_intercept=True, l1_ratio=0.5, max_iter=2000, n_alphas=None,
       n_jobs=-1, normalize=False, positive=False, precompute='auto',
       random_state=None, selection='cyclic', tol=0.0001, verbose=0)

In [5]:
# Perform the modelling
elastic_net_simple.go_quickDirty()



In [6]:
# Output result
elastic_net_simple

ElasticNetCV(alphas=[0.5, 0.1, 1.5], copy_X=True, cv=5, eps=0.001,
       fit_intercept=True, l1_ratio=0.5, max_iter=2000, n_alphas=None,
       n_jobs=-1, normalize=False, positive=False, precompute='auto',
       random_state=None, selection='cyclic', tol=0.0001, verbose=0)

RMSE on test data 34631.39, r2-score 0.79.

In [7]:
# Check best alpha value
model_simple.alpha_

0.1
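The selected alpha of 0.1 is the smallest value in the candidate list `[0.5, 0.1, 1.5]`, which may hint that the optimum lies below the grid. A minimal sketch of searching a wider, log-spaced grid follows; the synthetic data and the grid bounds are assumptions for illustration, not values from this notebook.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNetCV

# Synthetic stand-in for the housing data (assumption, illustration only)
X, y = make_regression(n_samples=200, n_features=20, n_informative=5,
                       noise=10.0, random_state=0)

# Log-spaced candidates from 0.001 to 10 (hypothetical bounds)
alphas = np.logspace(-3, 1, 20)

model = ElasticNetCV(alphas=alphas, l1_ratio=0.5, cv=5, max_iter=5000)
model.fit(X, y)

# If the selected alpha sits on an edge of the grid, widen the grid further
on_edge = bool(np.isclose(model.alpha_, [alphas.min(), alphas.max()]).any())
print(f"best alpha: {model.alpha_:.4f}, on grid edge: {on_edge}")
```

If `on_edge` is true, the grid probably does not bracket the optimum yet.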

## Pre-process data (outside of sklearn pipeline)
Pre-processing steps that take place before the data enters the pipeline.

### General pre-processing

In [8]:
# Disable warning
pd.set_option('mode.chained_assignment', None)

# Create and clean training set with variables from the EDA notebook
train_data = (raw_data
              .pipe(cleaning.change_dtypes, cols_to_category=raw_data.select_dtypes(object))
              .pipe(cleaning.delete_columns, cols_to_delete=cols_to_del)
              .pipe(cleaning.apply_log, cols_to_transform=cols_to_log)
             )

train_data.drop(outliers_to_del, inplace=True)
train_data.dropna(subset=['MasVnrArea', 'MasVnrType', 'Electrical'], inplace=True);

'Alley successfully deleted'

'Id successfully deleted'

'Fence successfully deleted'

'PoolQC successfully deleted'

'FireplaceQu successfully deleted'

'MiscFeature successfully deleted'
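The cleaning chain above relies on `DataFrame.pipe`, which passes the frame as the first argument to each helper. The helpers in `cleaning_functions` are not shown in this notebook, so the sketch below uses hypothetical stand-ins with the same names and a made-up toy frame, only to illustrate the chaining pattern.

```python
import numpy as np
import pandas as pd

def delete_columns(df, cols_to_delete):
    """Hypothetical stand-in: return a copy without the given columns."""
    return df.drop(columns=cols_to_delete)

def apply_log(df, cols_to_transform):
    """Hypothetical stand-in: log1p-transform the given columns on a copy."""
    df = df.copy()
    for col in cols_to_transform:
        df[col] = np.log1p(df[col])
    return df

# Toy data (made-up rows)
raw = pd.DataFrame({"Id": [1, 2],
                    "LotArea": [8450, 9600],
                    "SalePrice": [208500, 181500]})

# Each step receives the output of the previous one; `raw` stays untouched
clean = (raw
         .pipe(delete_columns, cols_to_delete=["Id"])
         .pipe(apply_log, cols_to_transform=["LotArea"]))
print(clean)
```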

### Create two versions of training data for different outlier treatment (experiment)

In [9]:
# Create list of columns containing NaN
nan_cols = train_data.columns[train_data.isnull().any()].tolist()

In [10]:
# Check results
nan_cols

['LotFrontage',
 'BsmtQual',
 'BsmtCond',
 'BsmtExposure',
 'BsmtFinType1',
 'BsmtFinType2',
 'GarageType',
 'GarageYrBlt',
 'GarageFinish',
 'GarageQual',
 'GarageCond']

**Note:** Apart from `LotFrontage` and `GarageYrBlt` (a numeric year), all remaining columns with missing values are categorical.
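To check which of these columns are categorical, it can help to list the missing-value count next to each column's dtype. A small sketch on a made-up toy frame (the values are assumptions, only the column names are borrowed from the data set):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for train_data (values are made up)
df = pd.DataFrame({
    "LotFrontage": [65.0, np.nan, 80.0],
    "GarageYrBlt": [2003.0, np.nan, 1998.0],
    "BsmtQual": pd.Categorical(["Gd", None, "TA"]),
    "LotArea": [8450, 9600, 11250],
})

# Count missing values per column and keep only columns that have any
nan_counts = df.isnull().sum()
has_nan = nan_counts > 0
nan_summary = pd.DataFrame({
    "n_missing": nan_counts[has_nan],
    "dtype": df.dtypes[has_nan].astype(str),
})
print(nan_summary)
```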

In [11]:
# Create train set without the columns that contain missing values
train_data_reduced = train_data.drop(nan_cols, axis=1)

In [12]:
# Check results
print("train set with NaN: ", train_data.shape[1])
print("train set without NaN: ", train_data_reduced.shape[1])

train set with NaN:  75
train set without NaN:  64


### Split train & test sets

In [13]:
# Set with NaN
X_train = train_data.drop('SalePrice', axis=1)
y_train = train_data['SalePrice'].copy()

In [14]:
categorical_features = X_train.select_dtypes(include=['category']).columns
numeric_features = X_train.select_dtypes(include=['float64', 'int64']).columns

In [15]:
len(categorical_features) + len(numeric_features)

74

In [16]:
# Set without NaN
X_train_reduced = train_data_reduced.drop('SalePrice', axis=1)
y_train_reduced = train_data_reduced['SalePrice'].copy()

In [17]:
categorical_features_reduced = X_train_reduced.select_dtypes(include=['category']).columns
numeric_features_reduced = X_train_reduced.select_dtypes(include=['float64', 'int64']).columns

In [18]:
len(categorical_features_reduced) + len(numeric_features_reduced)

63
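The counts above confirm that every feature column lands in one of the two groups. Selecting by the explicit names `'float64'` and `'int64'` would silently skip other numeric widths, though. The sketch below, on a made-up toy frame, contrasts that with `include='number'` and adds the same coverage check as an assertion.

```python
import pandas as pd

# Toy frame (made-up values); note the int32 column
X = pd.DataFrame({
    "LotArea": pd.Series([8450, 9600], dtype="int32"),
    "GrLivArea": [1710.0, 1262.0],
    "Street": pd.Categorical(["Pave", "Grvl"]),
})

# Explicit dtype names miss other numeric widths such as int32
explicit = X.select_dtypes(include=["float64", "int64"]).columns
# 'number' matches every numeric dtype
numeric = X.select_dtypes(include="number").columns
categorical = X.select_dtypes(include="category").columns

# Sanity check: every column is assigned to exactly one group
assert len(numeric) + len(categorical) == X.shape[1]
print(list(explicit), list(numeric), list(categorical))
```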

## Fit, tune, predict (with Pipelines)

### Build Pipe

In [19]:
# Assemble pipeline (define function)
def build_pipe(X_train, y_train, numeric_features, categorical_features, clf):
    """Build a pipeline for preprocessing and modelling.
    
    ARGUMENTS:
        X_train: training features (df or array)
        y_train: training labels (df or array)
        numeric_features: list of strings, numeric columns
        categorical_features: list of strings, categorical columns
        clf: estimator (sk-learn model object, here a regressor)
        
    RETURNS:
        full_pipe: pipeline object
    """
    # level 1 - two separate pipes for cat and num features
    numeric_transformer = Pipeline(steps=[
        ('imputer_n', SimpleImputer(strategy='median')),
        ('scaler', StandardScaler()),
            ])

    categorical_transformer = Pipeline(steps=[
        ('imputer_c', SimpleImputer(strategy='constant', fill_value='missing')),
        ('ohe', OneHotEncoder(handle_unknown='ignore')),
            ])

    # level 2 - wrap the two level 1 pipes into a ColumnTransformer
    preprocessor = ColumnTransformer(
            transformers=[
                ('num', numeric_transformer, numeric_features),
                ('cat', categorical_transformer, categorical_features),
                         ])

    # level 3 - pipe it with the estimator passed in as `clf`
    full_pipe = Pipeline(steps=[
                       ('preprocessor', preprocessor),
                       ('clf', clf),
                               ]) 
    
    return full_pipe

In [20]:
# Build pipeline
full_pipe = build_pipe(X_train, y_train, numeric_features, 
                       categorical_features, model_simple)

In [21]:
# Build pipeline for train set without NaN
full_pipe_reduced = build_pipe(X_train_reduced, y_train_reduced, 
                               numeric_features_reduced, 
                               categorical_features_reduced, model_simple)
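A self-contained sketch of the same three-level structure on toy data may make the nesting easier to follow. The data, the column names `area` and `zone`, and the plain `ElasticNet` estimator are assumptions for illustration. The sketch also shows how the nested step names form the parameter keys that a grid search addresses.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import ElasticNet
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy data (made up) with one missing value per column type
X = pd.DataFrame({
    "area": [50.0, 80.0, np.nan, 120.0, 65.0, 95.0],
    "zone": pd.Series(["A", "B", "A", np.nan, "B", "A"], dtype=object),
})
y = np.array([100.0, 160.0, 130.0, 240.0, 135.0, 190.0])

# Level 1 and 2: per-type pipes wrapped in a ColumnTransformer
preprocessor = ColumnTransformer(transformers=[
    ("num", Pipeline([("imputer_n", SimpleImputer(strategy="median")),
                      ("scaler", StandardScaler())]), ["area"]),
    ("cat", Pipeline([("imputer_c", SimpleImputer(strategy="constant",
                                                  fill_value="missing")),
                      ("ohe", OneHotEncoder(handle_unknown="ignore"))]), ["zone"]),
])

# Level 3: preprocessing followed by the estimator
pipe = Pipeline([("preprocessor", preprocessor), ("clf", ElasticNet(alpha=0.1))])
pipe.fit(X, y)

# An unseen category ("C") is encoded as all zeros instead of raising
new = pd.DataFrame({"area": [70.0], "zone": pd.Series(["C"], dtype=object)})
prediction = pipe.predict(new)

# Nested names are joined with double underscores
key = "preprocessor__num__imputer_n__strategy"
print(prediction, key in pipe.get_params())
```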

### Fit & Tune

In [22]:
def fit_pipe(X_train, y_train, pipe, scorer, cv=3):
    """Fit training data to a pipeline with GridSearchCV
    for best parameter tuning.
    
    ARGUMENTS:
        X_train: training features (df or array)
        y_train: training labels (df or array)
        pipe: pipeline (sk-learn pipeline object)
        scorer: evaluation metric for validation
        cv: number of folds or CV splitter, default is 3 (plain KFold for
            regressors; StratifiedKFold is meant for classification labels)
        
    RETURNS:
        grid: grid search object
        grid_results: dict with grid search results
    """
    parameters = {
            'preprocessor__num__imputer_n__strategy': ['mean', 'median'],
#             'classifier__C': [0.1, 1.0, 10, 100],

                 }

    # NOTE: `iid` was removed in scikit-learn 0.24; drop it on newer versions
    grid_search = GridSearchCV(pipe, param_grid=parameters, scoring=scorer, n_jobs=-1, iid=False,
                               cv=cv, error_score='raise', return_train_score=False, verbose=1)

    grid = grid_search.fit(X_train, y_train) 
    grid_results = grid.cv_results_

    return grid, grid_results

In [23]:
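# NOTE: make_scorer defaults to greater_is_better=True, so GridSearchCV treats the
# highest MSE as best; pass greater_is_better=False to select the lowest MSE instead
scorer = make_scorer(mean_squared_error)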
cv = 3

# Pipe with NaN
grid, grid_results = fit_pipe(X_train, y_train, full_pipe, scorer, cv=cv)

Fitting 3 folds for each of 2 candidates, totalling 6 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done   6 out of   6 | elapsed:    7.3s remaining:    0.0s
[Parallel(n_jobs=-1)]: Done   6 out of   6 | elapsed:    7.3s finished


In [24]:
grid.best_estimator_

Pipeline(memory=None,
     steps=[('preprocessor', ColumnTransformer(n_jobs=None, remainder='drop', sparse_threshold=0.3,
         transformer_weights=None,
         transformers=[('num', Pipeline(memory=None,
     steps=[('imputer_n', SimpleImputer(copy=True, fill_value=None, missing_values=nan, strategy='mean',
       verbo...ive=False, precompute='auto',
       random_state=None, selection='cyclic', tol=0.0001, verbose=0))])

In [25]:
grid.best_score_

822119404.750297
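Note that `make_scorer(mean_squared_error)` uses the default `greater_is_better=True`, so the score above is the larger of the two candidates' mean MSE values, not the smaller. A minimal sketch of a loss-style scorer on synthetic data follows; the data, the plain `ElasticNet` estimator, and the alpha grid are assumptions for illustration. The built-in scoring string `'neg_mean_squared_error'` should behave the same way.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet
from sklearn.metrics import make_scorer, mean_squared_error
from sklearn.model_selection import GridSearchCV

# Synthetic regression data (assumption, illustration only)
X, y = make_regression(n_samples=150, n_features=10, noise=5.0, random_state=0)

# greater_is_better=False negates the metric so that "higher is better" holds
neg_mse = make_scorer(mean_squared_error, greater_is_better=False)

grid = GridSearchCV(ElasticNet(max_iter=5000),
                    param_grid={"alpha": [0.01, 1.0, 100.0]},
                    scoring=neg_mse, cv=3)
grid.fit(X, y)

# best_score_ is the negated MSE; flip the sign to report it
best_mse = -grid.best_score_
print(grid.best_params_, round(best_mse, 2))
```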

In [26]:
# Pipe without NaN
grid_reduced, grid_results_reduced = fit_pipe(X_train_reduced, y_train_reduced, 
                                              full_pipe_reduced, scorer, cv=cv)

Fitting 3 folds for each of 2 candidates, totalling 6 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done   6 out of   6 | elapsed:    2.8s remaining:    0.0s
[Parallel(n_jobs=-1)]: Done   6 out of   6 | elapsed:    2.8s finished


In [28]:
grid_reduced.best_score_

905441377.1598457
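Both scores are mean squared errors, so they are hard to read next to the baseline RMSE reported earlier. Taking the square root puts them in the units of the target. This is only a rough comparison, since these are cross-validation scores on processed data while the baseline was measured on a test split of unprocessed data.

```python
import math

# Mean cross-validated MSE values reported above
mse_with_nan_cols = 822119404.750297
mse_reduced = 905441377.1598457

# RMSE is the square root of MSE and is in the units of the target
rmse_with_nan_cols = math.sqrt(mse_with_nan_cols)
rmse_reduced = math.sqrt(mse_reduced)

print(f"all columns (imputed): {rmse_with_nan_cols:.2f}")
print(f"reduced columns:       {rmse_reduced:.2f}")
```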

The commented-out pipeline below includes the following pre-processing steps identified in notebook 1, because I want them to be evaluated:

- watch multicollinearity (possibly remove columns: '1stFloor', 'GarageArea', 'FirstFlSF')
- test the IQR method on `top_corr_columns` as an alternative outlier treatment
- one-hot-encode categorical features
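The IQR method mentioned in the list can be sketched as a plain function that marks rows lying inside the usual 1.5 × IQR fences. The function name `iqr_outlier_mask`, the factor, and the toy data are assumptions for illustration, not the notebook's own implementation. Because scikit-learn transformers do not change `y`, dropping rows inside a `Pipeline` is awkward, so this sketch applies the mask before fitting.

```python
import pandas as pd

def iqr_outlier_mask(df, columns, factor=1.5):
    """Return a boolean mask that is True for rows inside the IQR fences."""
    q1 = df[columns].quantile(0.25)
    q3 = df[columns].quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - factor * iqr, q3 + factor * iqr
    # A row is kept only if every selected column lies within its fences
    # (rows with NaN in a selected column compare as False and are dropped)
    return ((df[columns] >= lower) & (df[columns] <= upper)).all(axis=1)

# Toy data (made up) with one obvious outlier in GrLivArea
toy = pd.DataFrame({"GrLivArea": [1200, 1500, 1700, 1600, 1400, 9000],
                    "SalePrice": [150, 180, 200, 190, 170, 185]})

mask = iqr_outlier_mask(toy, ["GrLivArea"])
filtered = toy[mask]
print(filtered)
```

Filtering features and target with the same mask keeps them aligned.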

In [27]:
# # OLD PIPE

# cols_to_crop = top_corr_columns[1:]  # 'SalePrice' has to be dropped
# cols_to_del_multicol = ['1stFlrSF', 'GarageArea', 'TotRmsAbvGrd', 'GarageYrBlt']


# first_transformer = Pipeline(steps=[
#     ('crop', OutlierDropperIQR(columns=cols_to_crop)),
# #     ('drop', ColumnDropper(columns=cols_to_del_multicol)),
#     ])

# # level 1 - two separate pipes for cat and num features

# numeric_features = X_train.select_dtypes(include=['float64', 'int64']).columns
# numeric_transformer = Pipeline(steps=[
#     ('imputer', SimpleImputer(strategy='median')),
#     ('scaler', StandardScaler())])

# categorical_features = X_train.select_dtypes(include=['category']).columns
# categorical_transformer = Pipeline(steps=[
#     ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
#     ('onehot', OneHotEncoder(handle_unknown='ignore'))])

# # level 2 - wrap the two level 1 pipes into a ColumnTransformer
# preprocessor = ColumnTransformer(
#         transformers=[
#             ('num', numeric_transformer, numeric_features),
#             ('cat', categorical_transformer, categorical_features),
#                      ])

# # level 3 - pipe it with a classifier
# clf = Pipeline(steps=[
#                    ('first', first_transformer),
#                    ('preprocessor', preprocessor),
#                    ('regressor', model_simple),
#                      ]) 

# # apply the preprocessor and then pass transformed data to the predictor 
# clf.fit(X_train, y_train)

NameError: name 'OutlierDropperIQR' is not defined