# Linear Regression Modelling with Elastic Net
Build a scikit-learn pipeline to fit a tuned Elastic Net model.
Evaluate feature importances.

**Data Sources**

- `data/raw/train.csv`: Training set from [Kaggle](https://www.kaggle.com/c/house-prices-advanced-regression-techniques/data).

**Changes**

- 2019-03-22: Start notebook



<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Import-libraries,-load-data" data-toc-modified-id="Import-libraries,-load-data-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Import libraries, load data</a></span></li><li><span><a href="#Go-quick-&amp;-dirty" data-toc-modified-id="Go-quick-&amp;-dirty-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Go quick &amp; dirty</a></span></li><li><span><a href="#Pre-process-outside-of-sklearn-pipeline" data-toc-modified-id="Pre-process-outside-of-sklearn-pipeline-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Pre-process outside of sklearn pipeline</a></span><ul class="toc-item"><li><span><a href="#Split-train-&amp;-test-set" data-toc-modified-id="Split-train-&amp;-test-set-3.1"><span class="toc-item-num">3.1&nbsp;&nbsp;</span>Split train &amp; test set</a></span></li></ul></li><li><span><a href="#Build-sklearn-Pipeline" data-toc-modified-id="Build-sklearn-Pipeline-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Build sklearn Pipeline</a></span></li></ul></div>

---

## Import libraries, load data

In [1]:
# Import libraries
import numpy as np
import pandas as pd
from tqdm import tqdm

from scipy import stats
from scipy.stats import norm, skew

from sklearn.linear_model import ElasticNetCV
from sklearn.model_selection import train_test_split, learning_curve
from sklearn.metrics import r2_score, mean_squared_error

from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer

# My functions
import EDA_functions as EDA
import cleaning_functions as cleaning
from linRegModel_class import LinRegModel
from custom_transformers import ColumnDropper, OutlierDropperIQR

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns #, sns.set_style('whitegrid')
color = 'rebeccapurple'
%matplotlib inline

# Display settings
from IPython.display import display
pd.options.display.max_columns = 100

In [2]:
# Load data
raw_data = pd.read_csv('data/raw/train.csv')

In [3]:
# Load variables from notebook 1
%store -r cols_to_del
%store -r cols_to_log
%store -r outliers_to_del
%store -r top_corr_columns

## Go quick & dirty
Use my 'quick & dirty' function for a baseline model on unprocessed data.

In [4]:
# Initialize a scikit-learn model object of choice
model_simple = ElasticNetCV(alphas=[0.5, 0.1, 1.5], copy_X=True, cv=5, eps=0.001, 
                            fit_intercept=True, l1_ratio=0.5, max_iter=2000, 
                            n_alphas=None, n_jobs=-1)

# Create an instance of the LinRegModel class by passing the dataframe, the target variable and the model object
elastic_net_simple = LinRegModel(raw_data, 'SalePrice', model_simple)

# Output instance
display(elastic_net_simple)

ElasticNetCV(alphas=[0.5, 0.1, 1.5], copy_X=True, cv=5, eps=0.001,
       fit_intercept=True, l1_ratio=0.5, max_iter=2000, n_alphas=None,
       n_jobs=-1, normalize=False, positive=False, precompute='auto',
       random_state=None, selection='cyclic', tol=0.0001, verbose=0)

In [5]:
# Perform the modelling
elastic_net_simple.go_quickDirty()



In [6]:
# Output result
elastic_net_simple

ElasticNetCV(alphas=[0.5, 0.1, 1.5], copy_X=True, cv=5, eps=0.001,
       fit_intercept=True, l1_ratio=0.5, max_iter=2000, n_alphas=None,
       n_jobs=-1, normalize=False, positive=False, precompute='auto',
       random_state=None, selection='cyclic', tol=0.0001, verbose=0)

RMSE on test data 34631.39, r2-score 0.79.

In [7]:
# Check best alpha value
model_simple.alpha_

0.1
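
`LinRegModel.go_quickDirty()` is a custom helper whose code is not shown in this notebook. The following is only a minimal sketch of what such a baseline might do. It assumes the helper keeps the numeric columns, fills missing values, holds out a test split, and reports RMSE and r2. The function name `quick_dirty_baseline`, its parameters, and the synthetic demo data are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import ElasticNetCV
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split


def quick_dirty_baseline(df, target, model, test_size=0.2, random_state=0):
    """Fit `model` on the numeric columns of `df` and score it on a hold-out split."""
    # Keep numeric predictors only and fill gaps with the column median
    # (done before the split for brevity, which is part of the "dirty")
    X = df.drop(columns=[target]).select_dtypes(include="number")
    X = X.fillna(X.median())
    y = df[target]

    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=test_size, random_state=random_state)
    model.fit(X_tr, y_tr)
    pred = model.predict(X_te)

    rmse = float(np.sqrt(mean_squared_error(y_te, pred)))
    return rmse, float(r2_score(y_te, pred))


# Synthetic stand-in for the housing data (hypothetical columns)
rng = np.random.default_rng(0)
demo = pd.DataFrame({"area": rng.normal(1500, 300, 200),
                     "quality": rng.integers(1, 11, 200).astype(float),
                     "zone": rng.choice(["A", "B"], 200)})
demo["price"] = 100 * demo["area"] + 10_000 * demo["quality"] + rng.normal(0, 5_000, 200)
demo.loc[::25, "area"] = np.nan  # a few missing values

rmse, r2 = quick_dirty_baseline(demo, "price", ElasticNetCV(alphas=[0.5, 0.1, 1.5], cv=5))
print(f"RMSE on test data {rmse:.2f}, r2-score {r2:.2f}.")
```

With the real data the call would presumably look like `quick_dirty_baseline(raw_data, 'SalePrice', model_simple)`. The scores need not match the ones reported above, since the actual helper may pre-process differently.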

## Pre-process outside of sklearn pipeline
Pre-processing steps that I do not want to evaluate further inside the cross-validated pipeline.

In [8]:
# Disable warning
pd.set_option('mode.chained_assignment', None)

# Create and clean training set with variables from the EDA notebook
train_data = (raw_data
              .pipe(cleaning.change_dtypes, cols_to_category=raw_data.select_dtypes(object))
              .pipe(cleaning.delete_columns, cols_to_delete=cols_to_del)
              .pipe(cleaning.apply_log, cols_to_transform=cols_to_log)
             )

train_data.drop(outliers_to_del, inplace=True)
# NOTE: dropna() returns a new DataFrame; without assignment this call leaves train_data unchanged
train_data.dropna(subset=['MasVnrArea', 'MasVnrType', 'Electrical']);

'Alley successfully deleted'

'Id successfully deleted'

'Fence successfully deleted'

'PoolQC successfully deleted'

'FireplaceQu successfully deleted'

'MiscFeature successfully deleted'

### Split train & test set

In [9]:
X_train = train_data.drop('SalePrice', axis=1)
y_train = train_data['SalePrice'].copy()

In [10]:
categorical_features = X_train.select_dtypes(include=['category'])
numeric_features = X_train.select_dtypes(include=['float64', 'int64'])

In [11]:
len(categorical_features.columns) + len(numeric_features.columns)

74

## Build sklearn Pipeline

The pipeline includes the following pre-processing steps identified in notebook 1, because I want them to be evaluated:

- watch multicollinearity (possibly remove columns, e.g. '1stFlrSF', 'GarageArea')
- try different strategies for columns with many missing values
- test the IQR method on `top_corr_columns` as an alternative way to remove outliers
- one-hot-encode categorical features
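
The IQR method mentioned in the list can be sketched as a row mask. This is not the code of `OutlierDropperIQR` (that class lives in `custom_transformers` and is not shown here). It is an illustration that assumes the usual 1.5 × IQR fences; the function name `iqr_mask`, the `factor` parameter and the demo values are made up for the example.

```python
import pandas as pd


def iqr_mask(df, columns, factor=1.5):
    """Return a boolean Series that is True for rows inside the IQR fences of all `columns`."""
    q1 = df[columns].quantile(0.25)
    q3 = df[columns].quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - factor * iqr, q3 + factor * iqr
    inside = (df[columns] >= lower) & (df[columns] <= upper)
    # NaN compares as False; treat missing values as "not an outlier"
    inside = inside | df[columns].isna()
    return inside.all(axis=1)


# Illustrative values only
demo = pd.DataFrame({"GrLivArea": [1000, 1100, 1200, 1300, 9000],
                     "OverallQual": [5, 6, 5, 7, 6]})
mask = iqr_mask(demo, ["GrLivArea", "OverallQual"])
print(demo[mask])
```

Returning a mask instead of a filtered frame makes it possible to apply the same selection to the target as well.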

In [12]:
cols_to_crop = top_corr_columns[1:]  # 'SalePrice' has to be dropped

cols_to_del_multicol = ['1stFlrSF', 'GarageArea', 'TotRmsAbvGrd', 'GarageYrBlt']
cols_to_del_multicol_NaN = ['1stFlrSF', 
                            'GarageArea', 
                            'TotRmsAbvGrd', 
                            'GarageYrBlt',
                            'LotFrontage', 
                            'GarageType', 
                            'GarageFinish', 
                            'GarageQual', 
                            'GarageCond',
                            'BsmtExposure', 
                            'BsmtFinType2', 
                            'BsmtFinType1',
                            'BsmtCond',
                            'BsmtQual',
                            'MasVnrArea',
                           ]

first_transformer = Pipeline(steps=[
    ('crop', OutlierDropperIQR(columns=cols_to_crop)),
#     ('drop', ColumnDropper(columns=cols_to_del_multicol)),
    ])

# level 1 - two separate pipes for cat and num features

numeric_features = X_train.select_dtypes(include=['float64', 'int64']).columns
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())])

categorical_features = X_train.select_dtypes(include=['category']).columns
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))])

# level 2 - wrap the two level 1 pipes into a ColumnTransformer
preprocessor = ColumnTransformer(
        transformers=[
            ('num', numeric_transformer, numeric_features),
            ('cat', categorical_transformer, categorical_features),
                     ])

# level 3 - pipe it with a regressor
clf = Pipeline(steps=[
                   ('first', first_transformer),
                   ('preprocessor', preprocessor),
                   ('regressor', model_simple),
                     ]) 

# apply the preprocessor and then pass transformed data to the predictor 
clf.fit(X_train, y_train)

ValueError: X and y have inconsistent dimensions (1340 != 1456)
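
A likely reading of this error, assuming `OutlierDropperIQR` removes rows in its `transform` step: a scikit-learn `Pipeline` passes only `X` through its transformers, so `X` shrinks to 1340 rows while `y_train` keeps all 1456. One possible workaround, sketched below on synthetic data with hypothetical column names, is to compute the row filter outside the pipeline and apply it to both `X` and `y` before fitting.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import ElasticNet
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data with one obvious outlier in column "a"
rng = np.random.default_rng(0)
X = pd.DataFrame({"a": rng.normal(size=100), "b": rng.normal(size=100)})
X.loc[0, "a"] = 50.0
y = pd.Series(2 * X["a"] + X["b"] + rng.normal(scale=0.1, size=100))

# Row filter computed outside the pipeline (simple IQR fences on column "a")
q1, q3 = X["a"].quantile([0.25, 0.75])
keep = X["a"].between(q1 - 1.5 * (q3 - q1), q3 + 1.5 * (q3 - q1))

# Apply the same mask to X and y so their lengths stay equal
X_kept, y_kept = X[keep], y[keep]

pipe = Pipeline(steps=[("scaler", StandardScaler()),
                       ("regressor", ElasticNet(alpha=0.1))])
pipe.fit(X_kept, y_kept)
print(X_kept.shape[0], y_kept.shape[0])
```

The trade-off is that the outlier step is then no longer part of what the pipeline cross-validates.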

In [None]:
X_train.shape

In [None]:
raw_data.shape

---

In [None]:
X_train['OverallQual'].max()

In [None]:
raw_data['OverallQual'].max()