__Kaggle competition - house prices__

1. [Kaggle competition - house prices](#Kaggle-competition-house-prices)
1. [Import](#Import)
    1. [Tools](#Tools)
    1. [Data](#Data)    
1. [Initial EDA](#Initial-EDA)
    1. [Categorical feature EDA](#Categorical-feature-EDA)
        1. [Univariate & feature vs. target](#Univariate-&-feature-vs.-target)
    1. [Continuous feature EDA](#Continuous-feature-EDA)
        1. [Univariate & feature vs. target](#Univariate-&-feature-vs.-target2)
        1. [Correlation](#Correlation)
            1. [Correlation (all samples)](#Correlation-all-samples)
            1. [Correlation (top vs. target)](#Correlation-top-vs-target)
        1. [Pair plot](#Pair-plot)
    1. [Target variable evaluation](#Target-variable-evaluation)    
1. [Data cleaning](#Data-cleaning)
    1. [Outliers (preliminary)](#Outliers-preliminary)
        1. [Training](#Training5)
        1. [Validation](#Validation5)
    1. [Missing data](#Missing-data)
        1. [Evaluate](#Evaluate1)
        1. [Training](#Training1)
        1. [Validation](#Validation1)
    1. [Engineering](#Engineering)
        1. [Evaluate](#Evaluate3)
        1. [Training](#Training3)
        1. [Validation](#Validation3)
    1. [Encoding](#Encoding)
        1. [Evaluate](#Evaluate2)
        1. [Training](#Training2)
        1. [Validation](#Validation2)
    1. [Transformation](#Transformation)
        1. [Evaluate](#Evaluate4)
        1. [Training](#Training4)
        1. [Validation](#Validation4)
    1. [Outliers (final)](#Outliers-final)
        1. [Training](#Training6)
1. [Data evaluation](#Data-evaluation)
    1. [Feature importance](#Feature-importance)
    1. [Rationality](#Rationality)
    1. [Value override](#Value-override)
    1. [Continuous feature EDA](#Continuous-feature-EDA3)
        1. [Univariate & feature vs. target](#Univariate-&-feature-vs.-target3)
        1. [Correlation](#Correlation3)
            1. [Correlation (top vs. target)](#Correlation-top-vs-target3)
1. [Modeling](#Modeling)
    1. [Prepare training data](#Prepare-training-data)
    1. [Prepare validation data](#Prepare-validation-data)
    1. [GridSearch](#GridSearch)
        1. [Evaluation](#Evaluation)
        1. [Model explainability](#Model-explanability)
            1. [Permutation importance](#Permutation-importance)
            1. [Partial plots](#Partial-plots)
            1. [SHAP values](#SHAP-values)
    1. [Stacking](#Stacking)
        1. [Primary models](#Primary-models)
        1. [Meta model](#Meta-model)        
1. [Submission](#Submission)
    1. [Stack](#Stack)
    1. [Standard](#Standard)

# Kaggle competition - house prices

<a id = 'Kaggle-competition-house-prices'></a>

# Import

<a id = 'Import'></a>

## Tools

<a id = 'Tools'></a>

In [None]:
# standard library and settings
import os
import sys
import importlib
import itertools
import csv
import ast
from timeit import default_timer as timer

global ITERATION
import time
from functools import reduce

rundate = time.strftime("%Y%m%d")

import warnings

warnings.simplefilter("ignore")
from IPython.display import display, HTML

display(HTML("<style>.container { width:95% !important; }</style>"))

# data extensions and settings
import numpy as np

np.set_printoptions(threshold=np.inf, suppress=True)
import pandas as pd

pd.set_option("display.max_rows", 500)
pd.set_option("display.max_columns", 500)
pd.options.display.float_format = "{:,.6f}".format

# modeling extensions
import sklearn.base as base
import sklearn.cluster as cluster
import sklearn.datasets as datasets
import sklearn.decomposition as decomposition
import sklearn.discriminant_analysis as discriminant_analysis
import sklearn.ensemble as ensemble
import sklearn.feature_extraction as feature_extraction
import sklearn.feature_selection as feature_selection
import sklearn.gaussian_process as gaussian_process
import sklearn.linear_model as linear_model
import sklearn.kernel_ridge as kernel_ridge
import sklearn.metrics as metrics
import sklearn.model_selection as model_selection
import sklearn.naive_bayes as naive_bayes
import sklearn.neighbors as neighbors
import sklearn.pipeline as pipeline
import sklearn.preprocessing as preprocessing
import sklearn.svm as svm
import sklearn.tree as tree
import sklearn.utils as utils

import eif as iso

from scipy import stats, special
import xgboost
import lightgbm
import catboost

from hyperopt import hp, tpe, Trials, fmin, STATUS_OK
from hyperopt.pyll.stochastic import sample

# visualization extensions and settings
import seaborn as sns
import matplotlib.pyplot as plt

%matplotlib inline

# custom extensions and settings
for path in ("/home/mlmachine", "/home/prettierplot"):
    if path not in sys.path:
        sys.path.append(path)

import mlmachine as mlm
from prettierplot.plotter import PrettierPlot
import prettierplot.style as style

## Data

<a id = 'Data'></a>

In [None]:
# load data and print dimensions
dfTrain = pd.read_csv("/home/data-science-portfolio/data/kaggleHousingPrices/train.csv")
dfValid = pd.read_csv("/home/data-science-portfolio/data/kaggleHousingPrices/test.csv")

print("Training data dimensions: {}".format(dfTrain.shape))
print("Validation data dimensions: {}".format(dfValid.shape))

In [None]:
# display info and first 5 rows
dfTrain.info()
display(dfTrain[:5])

In [None]:
# review counts of different column types
dfTrain.dtypes.value_counts()

In [None]:
# load training data into mlmachine
train = mlm.Machine(
    data=dfTrain,
    target=["SalePrice"],
    removeFeatures=["Id"],
    overrideCat=[
        "MSSubClass",
        "OverallQual",
        "OverallCond",
        "YearBuilt",
        "YearRemodAdd",
        "MoSold",
        "YrSold",
    ],
    targetType="continuous",
)
print(train.data.shape)

In [None]:
# load validation data into mlmachine
valid = mlm.Machine(
    data=dfValid,
    removeFeatures=["Id"],
    overrideCat=[
        "MSSubClass",
        "OverallQual",
        "OverallCond",
        "YearBuilt",
        "YearRemodAdd",
        "MoSold",
        "YrSold",
    ],
)
print(valid.data.shape)

# Initial EDA

<a id = 'Initial-EDA'></a>

## Categorical feature EDA

<a id = 'Categorical-feature-EDA'></a>

### Univariate & feature vs. target

<a id = 'Univariate-&-feature-vs.-target'></a>

In [None]:
# categorical features
train.edaNumTargetCatFeat()

## Continuous feature EDA

<a id = 'Continuous-feature-EDA'></a>

### Univariate & feature vs. target

<a id = 'Univariate-&-feature-vs.-target2'></a>

In [None]:
# continuous features
train.edaNumTargetNumFeat()

### Correlation

<a id = 'Correlation'></a>

#### Correlation (all samples)

<a id = 'Correlation-all-samples'></a>

In [None]:
# correlation heat map
p = PrettierPlot(chartProp=25)
ax = p.makeCanvas()
p.prettyCorrHeatmap(df=train.data, ax=ax)

#### Correlation (top vs. target)

<a id = 'Correlation-top-vs-target'></a>

In [None]:
# correlation heat map with most highly correlated features relative to the target
p = PrettierPlot()
ax = p.makeCanvas()
p.prettyCorrHeatmapTarget(df=train.data, target=train.target, thresh=0.6, ax=ax)

> Remarks - There are three pairs of highly correlated features:
> - 'GarageArea' and 'GarageCars'
> - 'TotRmsAbvGrd' and 'GrLivArea'
> - '1stFlrSF' and 'TotalBsmtSF'
>
> This makes sense given what each feature represents and how the two features in each pair relate to one another. We likely need only one feature from each pair.
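
The same pairs can also be listed programmatically rather than read off the heat map. The following is a minimal sketch using plain pandas on synthetic stand-in data; `correlated_pairs` is a hypothetical helper, not part of `prettierplot` or `mlmachine`.

```python
import numpy as np
import pandas as pd


def correlated_pairs(df, thresh=0.6):
    """Return feature pairs whose absolute Pearson correlation exceeds thresh."""
    corr = df.corr().abs()
    # keep only the strict upper triangle so each pair appears once
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack()
    return pairs[pairs > thresh].sort_values(ascending=False)


# synthetic stand-in data (not the Kaggle data)
rng = np.random.default_rng(0)
area = rng.normal(500, 100, size=200)
demo = pd.DataFrame(
    {
        "GarageArea": area,
        "GarageCars": area / 250 + rng.normal(0, 0.1, size=200),
        "LotFrontage": rng.normal(70, 20, size=200),
    }
)
high_pairs = correlated_pairs(demo, thresh=0.6)
print(high_pairs)
```

Applied to the real data, a call such as `correlated_pairs(train.data[continuous_cols])` would work the same way, assuming `continuous_cols` names only numeric columns.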

### Pair plot

<a id = 'Pair-plot'></a>

In [None]:
# pair plot
p = PrettierPlot(chartProp=10)
p.prettyPairPlot(
    df=train.data,
    cols=[
        "LotFrontage",
        "LotArea",
        "MasVnrArea",
        "BsmtFinSF1",
        "BsmtFinSF2",
        "BsmtUnfSF",
        "TotalBsmtSF",
        "1stFlrSF",
        "2ndFlrSF",
        "GrLivArea",
        "TotRmsAbvGrd",
        "GarageYrBlt",
        "GarageArea",
        "WoodDeckSF",
        "OpenPorchSF",
    ],
    diag_kind="auto",
)

## Target variable evaluation

<a id = 'Target-variable-evaluation'></a>

In [None]:
# evaluate distribution of target variable
train.edaTransformInitial(data=train.target, name=train.target.name)
train.edaTransformLog1(data=train.target, name=train.target.name)

In [None]:
# apply log(1 + x) transform to the target
train.target = np.log1p(train.target)
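
Because the model is trained on the transformed target, its predictions have to be converted back to the original price scale with `np.expm1`. The sketch below uses synthetic, right-skewed prices (an assumption for illustration, not the competition data) to show the effect on skew and the round trip.

```python
import numpy as np
from scipy import stats

# synthetic right-skewed prices, not the Kaggle data
rng = np.random.default_rng(0)
prices = rng.lognormal(mean=12.0, sigma=0.4, size=1000)

# log(1 + x) compresses the long right tail
log_prices = np.log1p(prices)
skew_before = stats.skew(prices)
skew_after = stats.skew(log_prices)
print("skew before: {:.3f}, after: {:.3f}".format(skew_before, skew_after))

# np.expm1 inverts np.log1p, which converts predictions back to prices
restored = np.expm1(log_prices)
```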

# Data cleaning

<a id = 'Data-cleaning'></a>

## Outliers (preliminary)

<a id = 'Outliers-preliminary'></a>

### Training

<a id = 'Training5'></a>

In [None]:
# identify columns that have zero missing values
nonNull = train.data.columns[train.data.isnull().sum() == 0].values.tolist()

# identify intersection between non-null columns and continuous columns
nonNullNumCol = list(set(nonNull).intersection(train.featureByDtype_["continuous"]))
print(nonNull)
print(nonNullNumCol)

In [None]:
# identify outliers using IQR
trainPipe = pipeline.Pipeline(
    [
        (
            "outlier",
            train.OutlierIQR(
                outlierCount=5, iqrStep=1.5, features=nonNullNumCol, dropOutliers=False
            ),
        )
    ]
)
train.data = trainPipe.transform(train.data)

# capture outliers
iqrOutliers = np.array(sorted(trainPipe.named_steps["outlier"].outliers_))
print(iqrOutliers)

# remove outliers
# train.target = np.delete(train.target, trainPipe.named_steps['outlier'].outliers_)
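
For readers without `mlmachine`, the idea behind the IQR step can be sketched in plain pandas: flag a row when it falls outside the Tukey fences in at least a given number of features. This is an assumption about what `OutlierIQR` does, inferred from its `outlierCount` and `iqrStep` parameters; `iqr_outliers` is a hypothetical helper and the data is synthetic.

```python
import numpy as np
import pandas as pd


def iqr_outliers(df, features, outlier_count=5, iqr_step=1.5):
    """Return index labels of rows outside the Tukey fences in at least
    outlier_count of the given features."""
    q1 = df[features].quantile(0.25)
    q3 = df[features].quantile(0.75)
    step = iqr_step * (q3 - q1)
    # boolean frame: True where a value lies outside its column's fences
    outside = (df[features] < q1 - step) | (df[features] > q3 + step)
    flagged = outside.sum(axis=1)
    return flagged[flagged >= outlier_count].index.tolist()


# synthetic data with one row that is extreme in every column
rng = np.random.default_rng(0)
demo = pd.DataFrame(rng.normal(size=(100, 3)), columns=["a", "b", "c"])
demo.loc[7] = [25.0, -30.0, 40.0]

flagged_rows = iqr_outliers(demo, ["a", "b", "c"], outlier_count=3)
print(flagged_rows)
```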

In [None]:
# identify outliers using Isolation Forest
clf = ensemble.IsolationForest(
    max_samples=train.data.shape[0], random_state=0, contamination=0.02
)
clf.fit(train.data[nonNullNumCol])
preds = clf.predict(train.data[nonNullNumCol])
# np.unique(preds, return_counts = True)

# evaluate index values (np.where returns a tuple, so take its first element)
mask = np.isin(preds, -1)
ifOutliers = np.where(mask)[0]
print(ifOutliers)

In [None]:
# identify outliers using Extended Isolation Forest
if_eif = iso.iForest(
    train.data[nonNullNumCol].values, ntrees=100, sample_size=256, ExtensionLevel=1
)

# calculate anomaly scores
anomalies_ratio = 0.009
anomaly_scores = if_eif.compute_paths(X_in=train.data[nonNullNumCol].values)
anomaly_scores_sorted = np.argsort(anomaly_scores)
eifOutliers = anomaly_scores_sorted[
    -int(np.ceil(anomalies_ratio * train.data.shape[0])) :
]
print(sorted(eifOutliers))

In [None]:
# identify outliers that are identified in multiple algorithms
# reduce(np.intersect1d, (iqrOutliers, ifOutliers, eifOutliers))
outliers = reduce(np.intersect1d, (ifOutliers, eifOutliers))
print(outliers)
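
The consensus step keeps only the observations that every detector agrees on. A minimal sketch with made-up positions (the arrays below are hypothetical, not results from this notebook):

```python
from functools import reduce

import numpy as np

# hypothetical positions flagged by an IQR rule and an extended isolation forest
iqr_idx = np.array([3, 8, 15, 42])
eif_idx = np.array([5, 8, 42, 49])

# scikit-learn style predictions: -1 marks an outlier, 1 an inlier
preds = np.ones(50, dtype=int)
preds[[8, 15, 23, 42]] = -1
forest_idx = np.where(preds == -1)[0]  # flat array of outlier positions

# intersect the arrays pairwise, left to right
consensus = reduce(np.intersect1d, (iqr_idx, forest_idx, eif_idx))
print(consensus)
```

Requiring agreement between detectors is conservative: it removes fewer rows than any single method would, at the cost of missing outliers that only one method catches.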

In [None]:
# capture index values of known outliers
knownOutliers = (
    train.data[train.data["LotArea"] > 60000].index.values.tolist()
    + train.data[train.data["LotFrontage"] > 300].index.values.tolist()
    + train.data[train.data["GrLivArea"] > 4000].index.values.tolist()
)
knownOutliers = sorted(set(knownOutliers))
print(knownOutliers)

# train.data = train.data.drop(train.data.index[outliers])
# train.target = np.delete(train.target, outliers)

In [None]:
# index of known outliers and outliers identified with the known outliers removed
outliers = [
    53,
    185,
    197,
    437,
    492,
    762,
    796,
    821,
    847,
    1161,
    1221,
    1318,
    1376,
    249,
    313,
    335,
    451,
    523,
    691,
    706,
    934,
    1182,
    1298,
]
print(outliers)

In [None]:
# remove outliers from predictors and response
train.data = train.data.drop(train.data.index[outliers])
train.target = train.target.drop(index=outliers)
print(train.data.shape)
print(train.target.shape)

### Validation

<a id = 'Validation5'></a>

## Missing data

- __MCAR__ (missing completely at random) - Missingness is completely unsystematic and unrelated to any variable, observed or unobserved. Simple imputation with the mean, median, or mode is most acceptable for this type of missingness.

- __MAR__ (missing at random) - Missingness is related to observed data in other variables, not to the missing values themselves. The missing data is conditional on some other variable. For example, men are more likely to tell you their weight than women, so the missingness of weight depends on gender.

- __MNAR__ (missing not at random) - The propensity of a value to be missing is related to the value itself. For example, the wealthiest people may choose not to state their income.
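
The three mechanisms can be illustrated with a small simulation. The sketch below uses entirely synthetic weights and made-up missingness probabilities; it shows that the observed mean stays close to the true mean under MCAR, drifts under MAR and MNAR, and that imputing within groups of the observed variable repairs the MAR case.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 10_000
is_male = rng.random(n) < 0.5
weight = np.where(is_male, 85.0, 70.0) + rng.normal(0, 8, size=n)
df = pd.DataFrame({"is_male": is_male, "weight": weight})

# MCAR: every value has the same 30% chance of being missing
mcar = df["weight"].mask(rng.random(n) < 0.3)

# MAR: missingness depends on an observed variable (gender), not on weight
p_missing = np.where(df["is_male"], 0.1, 0.5)
mar = df["weight"].mask(rng.random(n) < p_missing)

# MNAR: the heaviest people are the most likely to withhold their weight
heavy = df["weight"] > df["weight"].quantile(0.8)
mnar = df["weight"].mask(heavy & (rng.random(n) < 0.7))

true_mean = df["weight"].mean()
bias = {
    name: series.mean() - true_mean
    for name, series in [("MCAR", mcar), ("MAR", mar), ("MNAR", mnar)]
}
print(bias)

# under MAR, filling with the mean of each observed group recovers the overall mean
mar_group_filled = mar.fillna(mar.groupby(df["is_male"]).transform("mean"))
print(mar_group_filled.mean() - true_mean)
```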



<a id = 'Missing-data'></a>

### Evaluate


<a id = 'Evaluate1'></a>

In [None]:
# evaluate missing data
train.edaMissingSummary()

In [None]:
# evaluate missing data
valid.edaMissingSummary()

In [None]:
# compare features with missing data in the training and validation datasets
train.missingColCompare(train.data, valid.data)

In [None]:
# missingdata_df = merged_df.columns[merged_df.isnull().any()].tolist()
# msno.matrix(merged_df[missingdata_df])

# msno.bar(merged_df[missingdata_df], color="blue", log=True, figsize=(30,18))

# #
# msno.heatmap(merged_df[missingdata_df], figsize=(20,20))

### Training


<a id = 'Training1'></a>

In [None]:
# apply imputations to missing data in training dataset
trainPipe = pipeline.Pipeline(
    [
        (
            "imputeConstantCat",
            train.ConstantImputer(
                cols=[
                    "PoolQC",
                    "Alley",
                    "Fence",
                    "FireplaceQu",
                    "GarageType",
                    "GarageFinish",
                    "GarageQual",
                    "MiscFeature",
                    "GarageCond",
                    "BsmtQual",
                    "BsmtCond",
                    "BsmtExposure",
                    "BsmtFinType1",
                    "BsmtFinType2",
                    "MasVnrType",
                ],
                fill="Nonexistent",
            ),
        ),
        (
            "imputeConstantNum",
            train.ConstantImputer(cols=["GarageYrBlt", "MasVnrArea"], fill=0),
        ),
        ("imputeMode", train.ModeImputer(cols=["Electrical"])),
        (
            "imputeContext",
            train.ContextImputer(
                nullCol="LotFrontage", contextCol="Neighborhood", strategy="mean"
            ),
        ),
    ]
)
train.data = trainPipe.transform(train.data)
train.edaMissingSummary()

### Validation


<a id = 'Validation1'></a>

In [None]:
# apply imputations to missing data in validation dataset
validPipe = pipeline.Pipeline(
    [
        (
            "imputeConstantCat",
            valid.ConstantImputer(
                cols=[
                    "PoolQC",
                    "Alley",
                    "Fence",
                    "FireplaceQu",
                    "GarageType",
                    "GarageFinish",
                    "GarageQual",
                    "MiscFeature",
                    "GarageCond",
                    "BsmtQual",
                    "BsmtCond",
                    "BsmtExposure",
                    "BsmtFinType1",
                    "BsmtFinType2",
                    "MasVnrType",
                ],
                fill="Nonexistent",
            ),
        ),
        (
            "imputeConstantNum",
            valid.ConstantImputer(
                cols=[
                    "GarageYrBlt",
                    "MasVnrArea",
                    "BsmtUnfSF",
                    "GarageArea",
                    "BsmtFinSF1",
                    "TotalBsmtSF",
                    "BsmtFinSF2",
                ],
                fill=0,
            ),
        ),
        (
            "imputeModeCat",
            valid.ModeImputer(
                cols=[
                    "Functional",
                    "SaleType",
                    "Exterior1st",
                    "MSZoning",
                    "Exterior2nd",
                    "KitchenQual",
                    "Utilities",
                ]
            ),
        ),
        (
            "imputeModeNum",
            valid.NumericalImputer(
                cols=["BsmtHalfBath", "GarageCars", "BsmtFullBath"],
                strategy="most_frequent",
            ),
        ),
        (
            "imputeContext",
            valid.ContextImputer(
                nullCol="LotFrontage",
                contextCol="Neighborhood",
                strategy="mean",
                train=False,
                trainDf=trainPipe.named_steps["imputeContext"].fillDf,
            ),
        ),
    ]
)
valid.data = validPipe.transform(valid.data)
valid.edaMissingSummary()
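
The context imputation above fills `LotFrontage` with a per-neighborhood mean that is learned on the training data and reused for the validation data. A minimal pandas sketch of that idea, assuming this is how `ContextImputer` behaves with `train=False`; the tiny data frames are made up for illustration.

```python
import numpy as np
import pandas as pd

train_df = pd.DataFrame(
    {
        "Neighborhood": ["A", "A", "A", "B", "B"],
        "LotFrontage": [60.0, 80.0, np.nan, 100.0, np.nan],
    }
)
valid_df = pd.DataFrame(
    {"Neighborhood": ["A", "B", "B"], "LotFrontage": [np.nan, np.nan, 90.0]}
)

# learn one fill value per neighborhood from the training rows only
fill_values = train_df.groupby("Neighborhood")["LotFrontage"].mean()

# reuse the training means for both datasets so no validation statistics leak in
train_filled = train_df["LotFrontage"].fillna(
    train_df["Neighborhood"].map(fill_values)
)
valid_filled = valid_df["LotFrontage"].fillna(
    valid_df["Neighborhood"].map(fill_values)
)
print(valid_filled.tolist())
```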

## Engineering

<a id = 'Engineering'></a>

### Evaluate


<a id = 'Evaluate3'></a>

### Training


<a id = 'Training3'></a>

In [None]:
# additional features
train.data["BsmtFinSF"] = train.data["BsmtFinSF1"] + train.data["BsmtFinSF2"]
train.data["TotalSF"] = (
    train.data["TotalBsmtSF"] + train.data["1stFlrSF"] + train.data["2ndFlrSF"]
)

### Validation


<a id = 'Validation3'></a>

In [None]:
# additional features
valid.data["BsmtFinSF"] = valid.data["BsmtFinSF1"] + valid.data["BsmtFinSF2"]
valid.data["TotalSF"] = (
    valid.data["TotalBsmtSF"] + valid.data["1stFlrSF"] + valid.data["2ndFlrSF"]
)

## Encoding

<a id = 'Encoding'></a>

### Evaluate


<a id = 'Evaluate2'></a>

In [None]:
# counts of unique values in training data categorical columns
train.data[train.featureByDtype_["categorical"]].apply(pd.Series.nunique, axis=0)

In [None]:
# print unique values in each categorical column
for col in train.data[train.featureByDtype_["categorical"]]:
    print(col, np.unique(train.data[col]))

In [None]:
# counts of unique values in validation data categorical columns
valid.data[valid.featureByDtype_["categorical"]].apply(pd.Series.nunique, axis=0)

In [None]:
# print unique values in each categorical column
for col in valid.data[valid.featureByDtype_["categorical"]]:
    print(col, np.unique(valid.data[col]))

In [None]:
# identify values that are present in the training data but not the validation data, and vice versa
for col in train.featureByDtype_["categorical"]:
    trainValues = train.data[col].unique()
    validValues = valid.data[col].unique()

    trainDiff = set(trainValues) - set(validValues)
    validDiff = set(validValues) - set(trainValues)

    if len(trainDiff) > 0 or len(validDiff) > 0:
        print("\n\n*** " + col)
        print("Value present in training data, not in validation data")
        print(trainDiff)
        print("Value present in validation data, not in training data")
        print(validDiff)

### Training


<a id = 'Training2'></a>

In [None]:
# ordinal column encoding instructions
ordinalEncodings = {
    "Street": {"Grvl": 0, "Pave": 1},
    "Alley": {"Nonexistent": 0, "Grvl": 1, "Pave": 2},
    "LotShape": {"IR3": 0, "IR2": 1, "IR1": 2, "Reg": 3},
    "Utilities": {"ELO": 0, "NoSeWa": 1, "NoSewr": 2, "AllPub": 3},
    "LotConfig": {"FR3": 0, "FR2": 1, "Corner": 2, "Inside": 3, "CulDSac": 4},
    "LandSlope": {"Sev": 0, "Mod": 1, "Gtl": 2},
    "ExterQual": {"Po": 0, "Fa": 1, "TA": 2, "Gd": 3, "Ex": 4},
    "ExterCond": {"Po": 0, "Fa": 1, "TA": 2, "Gd": 3, "Ex": 4},
    "BsmtQual": {"Nonexistent": 0, "Po": 1, "Fa": 2, "TA": 3, "Gd": 4, "Ex": 5},
    "BsmtCond": {"Nonexistent": 0, "Po": 1, "Fa": 2, "TA": 3, "Gd": 4, "Ex": 5},
    "BsmtExposure": {"Nonexistent": 0, "No": 1, "Mn": 2, "Av": 3, "Gd": 4},
    "BsmtFinType1": {
        "Nonexistent": 0,
        "Unf": 1,
        "LwQ": 2,
        "BLQ": 3,
        "Rec": 4,
        "ALQ": 5,
        "GLQ": 6,
    },  # split?
    "BsmtFinType2": {
        "Nonexistent": 0,
        "Unf": 1,
        "LwQ": 2,
        "BLQ": 3,
        "Rec": 4,
        "ALQ": 5,
        "GLQ": 6,
    },  # split?
    "HeatingQC": {"Po": 0, "Fa": 1, "TA": 2, "Gd": 3, "Ex": 4},
    "CentralAir": {"N": 0, "Y": 1},
    "Electrical": {"FuseP": 0, "FuseF": 1, "FuseA": 2, "Mix": 3, "SBrkr": 4},
    "KitchenQual": {"Po": 0, "Fa": 1, "TA": 2, "Gd": 3, "Ex": 4},
    "Functional": {
        "Sal": 0,
        "Sev": 1,
        "Maj2": 2,
        "Maj1": 3,
        "Mod": 4,
        "Min2": 5,
        "Min1": 6,
        "Typ": 7,
    },
    "FireplaceQu": {"Nonexistent": 0, "Po": 1, "Fa": 2, "TA": 3, "Gd": 4, "Ex": 5},
    "GarageFinish": {"Nonexistent": 0, "Unf": 1, "RFn": 2, "Fin": 3},
    "GarageQual": {"Nonexistent": 0, "Po": 1, "Fa": 2, "TA": 3, "Gd": 4, "Ex": 5},
    "GarageCond": {"Nonexistent": 0, "Po": 1, "Fa": 2, "TA": 3, "Gd": 4, "Ex": 5},
    "PavedDrive": {"N": 0, "P": 1, "Y": 2},
    "PoolQC": {"Nonexistent": 0, "Fa": 1, "TA": 2, "Gd": 3, "Ex": 4},
}

# nominal columns
nomCatCols = [
    "MSSubClass",
    "MSZoning",
    "LandContour",
    "Neighborhood",
    "Condition1",
    "Condition2",
    "BldgType",
    "HouseStyle",
    "RoofStyle",
    "RoofMatl",
    "Exterior1st",
    "Exterior2nd",
    "MasVnrType",
    "Foundation",
    "Heating",
    "GarageType",
    "Fence",
    "SaleType",
    "SaleCondition",
    "MiscFeature",
]

# apply encodings to training data
trainPipe = pipeline.Pipeline(
    [
        ("encodeOrdinal", train.CustomOrdinalEncoder(encodings=ordinalEncodings)),
        ("dummyNominal", train.Dummies(cols=nomCatCols, dropFirst=False)),
    ]
)
train.data = trainPipe.transform(train.data)
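
The two encoding steps can be approximated with plain pandas: an explicit dictionary lookup for ordered categories and one-hot columns for unordered ones. This is a sketch of the general technique on a made-up frame, not the internals of `CustomOrdinalEncoder` or `Dummies`.

```python
import pandas as pd

encodings = {"ExterQual": {"Po": 0, "Fa": 1, "TA": 2, "Gd": 3, "Ex": 4}}
demo = pd.DataFrame(
    {
        "ExterQual": ["TA", "Gd", "Ex", "Fa"],
        "RoofStyle": ["Gable", "Hip", "Gable", "Flat"],
    }
)

encoded = demo.copy()

# ordinal columns: replace each label with its rank
for col, mapping in encodings.items():
    encoded[col] = encoded[col].map(mapping)

# nominal columns: one indicator column per level, keeping every level
encoded = pd.get_dummies(encoded, columns=["RoofStyle"], drop_first=False)
print(encoded)
```

Note that `Series.map` returns `NaN` for any label missing from the dictionary, so an unexpected level shows up as a missing value rather than an error.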

### Validation


<a id = 'Validation2'></a>

In [None]:
# apply encodings to validation data
validPipe = pipeline.Pipeline(
    [
        ("encodeOrdinal", valid.CustomOrdinalEncoder(encodings=ordinalEncodings)),
        ("dummyNominal", valid.Dummies(cols=nomCatCols, dropFirst=False)),
        ("levels", valid.MissingDummies(trainCols=train.data.columns)),
    ]
)
valid.data = validPipe.transform(valid.data)
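
When a categorical level appears in only one dataset, the dummy columns of the two datasets no longer match. One common remedy, sketched below with made-up values, is to reindex the validation columns against the training columns. Whether `MissingDummies` also drops validation-only levels is not confirmed by this notebook; the sketch does.

```python
import pandas as pd

train_cat = pd.DataFrame({"RoofMatl": ["CompShg", "Metal", "WdShake"]})
valid_cat = pd.DataFrame({"RoofMatl": ["CompShg", "Tar&Grv"]})

train_dummies = pd.get_dummies(train_cat, columns=["RoofMatl"])
valid_dummies = pd.get_dummies(valid_cat, columns=["RoofMatl"])

# levels seen only in training become all-zero columns, levels seen only in
# validation are dropped, and the column order follows the training data
valid_aligned = valid_dummies.reindex(columns=train_dummies.columns, fill_value=0)
print(valid_aligned)
```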

## Transformation

<a id = 'Transformation'></a>

### Evaluate


<a id = 'Evaluate4'></a>

In [None]:
# evaluate skew of continuous features - training data
train.skewSummary()

In [None]:
# evaluate skew of continuous features - validation data
valid.skewSummary()

### Training


<a id = 'Training4'></a>

In [None]:
# correct skew in the training dataset, which also learns the best lambda value for each column
trainPipe = pipeline.Pipeline(
    [
        (
            "skew",
            train.SkewTransform(
                cols=train.featureByDtype_["continuous"], skewMin=0.75, pctZeroMax=1.0
            ),
        )
    ]
)
train.data = trainPipe.transform(train.data)
train.skewSummary()

### Validation


<a id = 'Validation4'></a>

In [None]:
# skew correction in validation dataset using lambdas learned on training data
validPipe = pipeline.Pipeline(
    [
        (
            "skew",
            valid.SkewTransform(
                train=False, trainDict=trainPipe.named_steps["skew"].colValueDict_
            ),
        )
    ]
)
valid.data = validPipe.transform(valid.data)
valid.skewSummary()
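
The pattern of learning a lambda on the training data and reusing it on the validation data can be sketched with SciPy's Box-Cox helpers. This assumes a Box-Cox style transform on `1 + x`; how `SkewTransform` actually chooses its lambdas is not shown in this notebook, and the columns below are synthetic.

```python
import numpy as np
from scipy import special, stats

# synthetic skewed, strictly positive columns
rng = np.random.default_rng(0)
train_col = rng.lognormal(mean=0.0, sigma=1.0, size=500)
valid_col = rng.lognormal(mean=0.0, sigma=1.0, size=200)

# learn lambda on the training column only (shifted by 1 to match boxcox1p)
lam = stats.boxcox_normmax(train_col + 1)

# apply the same lambda to both datasets
train_t = special.boxcox1p(train_col, lam)
valid_t = special.boxcox1p(valid_col, lam)

print("lambda: {:.3f}".format(float(lam)))
print("train skew before: {:.3f}, after: {:.3f}".format(
    stats.skew(train_col), stats.skew(train_t)))
```

Reusing the training lambda keeps both datasets on the same scale, which matters because the fitted model only ever sees values transformed with that lambda.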

## Outliers (final)

<a id = 'Outliers-final'></a>

### Training

<a id = 'Training6'></a>

In [None]:
# identify outliers using IQR
trainPipe = pipeline.Pipeline(
    [
        (
            "outlier",
            train.OutlierIQR(
                outlierCount=8, iqrStep=1.5, features=nonNullNumCol, dropOutliers=False
            ),
        )
    ]
)
train.data = trainPipe.transform(train.data)

# capture outliers
iqrOutliers = np.array(sorted(trainPipe.named_steps["outlier"].outliers_))
print(iqrOutliers)

In [None]:
# identify outliers using Isolation Forest
clf = ensemble.IsolationForest(
    max_samples=train.data.shape[0], random_state=0, contamination=0.02
)
clf.fit(train.data[nonNullNumCol])
preds = clf.predict(train.data[nonNullNumCol])
# np.unique(preds, return_counts = True)

# evaluate index values (np.where returns a tuple, so take its first element)
mask = np.isin(preds, -1)  # np.in1d if np.isin is not available
ifOutliers = np.where(mask)[0]
print(ifOutliers)

In [None]:
# identify outliers using Extended Isolation Forest
import eif as iso

if_eif = iso.iForest(
    train.data[nonNullNumCol].values, ntrees=100, sample_size=256, ExtensionLevel=1
)

# calculate anomaly scores
anomalies_ratio = 0.009
anomaly_scores = if_eif.compute_paths(X_in=train.data[nonNullNumCol].values)
anomaly_scores_sorted = np.argsort(anomaly_scores)
eifOutliers = anomaly_scores_sorted[
    -int(np.ceil(anomalies_ratio * train.data.shape[0])) :
]
print(sorted(eifOutliers))

In [None]:
# identify outliers that are identified in multiple algorithms
# reduce(np.intersect1d, (iqrOutliers, ifOutliers, eifOutliers))
reduce(np.intersect1d, (ifOutliers, eifOutliers))

# Data evaluation

<a id = 'Data-evaluation'></a>

## Feature importance

<a id = 'Feature-importance'></a>

In [None]:
# feature importance summary table
featureImp = train.featureImportanceSummary()
featureImp

## Rationality

<a id = 'Rationality'></a>

In [None]:
# percent difference summary
dfDiff = abs(
    (
        ((valid.data.describe() + 1) - (train.data.describe() + 1))
        / (train.data.describe() + 1)
    )
    * 100
)
dfDiff = dfDiff[dfDiff.columns].replace({0: np.nan})
dfDiff[dfDiff < 0] = np.nan
dfDiff = dfDiff.fillna("")
display(dfDiff)
display(train.data.describe())
display(valid.data.describe())

## Value override

<a id = 'Value-override'></a>

In [None]:
# change clearly erroneous value to what it probably was
valid.data["GarageYrBlt"] = valid.data["GarageYrBlt"].replace({2207: 2007})

## Continuous feature EDA

<a id = 'Continuous-feature-EDA3'></a>

### Univariate & feature vs. target

<a id = 'Univariate-&-feature-vs.-target3'></a>

In [None]:
# continuous features
train.edaNumTargetNumFeat()

### Correlation

<a id = 'Correlation3'></a>

#### Correlation (top vs. target)

<a id = 'Correlation-top-vs-target3'></a>

In [None]:
# correlation heat map with most highly correlated features relative to the target
p = PrettierPlot()
ax = p.makeCanvas()
p.prettyCorrHeatmapTarget(df=train.data, target=train.target, thresh=0.6, ax=ax)

> Remarks - There are three pairs of highly correlated features:
> - 'GarageArea' and 'GarageCars'
> - 'TotRmsAbvGrd' and 'GrLivArea'
> - '1stFlrSF' and 'TotalBsmtSF'
>
> This makes sense given what each feature represents and how the two features in each pair relate to one another. We likely need only one feature from each pair.

# Modeling

<a id = 'Modeling'></a>

## Prepare training data

<a id = 'Prepare-training-data'></a>

In [None]:
# import training data
dfTrain = pd.read_csv("/home/data-science-portfolio/data/kaggleHousingPrices/train.csv")
train = mlm.Machine(
    data=dfTrain,
    target=["SalePrice"],
    removeFeatures=["Id", "MiscVal"],
    overrideCat=[
        "MSSubClass",
        "OverallQual",
        "OverallCond",
        "YearBuilt",
        "YearRemodAdd",
        "MoSold",
        "YrSold",
    ],
    targetType="continuous",
)

### training data transformation pipeline
### ordinal columns
ordinalEncodings = {
    "Street": {"Grvl": 0, "Pave": 1},
    "Alley": {"Nonexistent": 0, "Grvl": 1, "Pave": 2},
    "LotShape": {"IR3": 0, "IR2": 1, "IR1": 2, "Reg": 3},
    "Utilities": {"ELO": 0, "NoSeWa": 1, "NoSewr": 2, "AllPub": 3},
    "LotConfig": {"FR3": 0, "FR2": 1, "Corner": 2, "Inside": 3, "CulDSac": 4},
    "LandSlope": {"Sev": 0, "Mod": 1, "Gtl": 2},
    "ExterQual": {"Po": 0, "Fa": 1, "TA": 2, "Gd": 3, "Ex": 4},
    "ExterCond": {"Po": 0, "Fa": 1, "TA": 2, "Gd": 3, "Ex": 4},
    "BsmtQual": {"Nonexistent": 0, "Po": 1, "Fa": 2, "TA": 3, "Gd": 4, "Ex": 5},
    "BsmtCond": {"Nonexistent": 0, "Po": 1, "Fa": 2, "TA": 3, "Gd": 4, "Ex": 5},
    "BsmtExposure": {"Nonexistent": 0, "No": 1, "Mn": 2, "Av": 3, "Gd": 4},
    "BsmtFinType1": {
        "Nonexistent": 0,
        "Unf": 1,
        "LwQ": 2,
        "BLQ": 3,
        "Rec": 4,
        "ALQ": 5,
        "GLQ": 6,
    },  # split?
    "BsmtFinType2": {
        "Nonexistent": 0,
        "Unf": 1,
        "LwQ": 2,
        "BLQ": 3,
        "Rec": 4,
        "ALQ": 5,
        "GLQ": 6,
    },  # split?
    "HeatingQC": {"Po": 0, "Fa": 1, "TA": 2, "Gd": 3, "Ex": 4},
    "CentralAir": {"N": 0, "Y": 1},
    "Electrical": {"FuseP": 0, "FuseF": 1, "FuseA": 2, "Mix": 3, "SBrkr": 4},
    "KitchenQual": {"Po": 0, "Fa": 1, "TA": 2, "Gd": 3, "Ex": 4},
    "Functional": {
        "Sal": 0,
        "Sev": 1,
        "Maj2": 2,
        "Maj1": 3,
        "Mod": 4,
        "Min2": 5,
        "Min1": 6,
        "Typ": 7,
    },
    "FireplaceQu": {"Nonexistent": 0, "Po": 1, "Fa": 2, "TA": 3, "Gd": 4, "Ex": 5},
    "GarageFinish": {"Nonexistent": 0, "Unf": 1, "RFn": 2, "Fin": 3},
    "GarageQual": {"Nonexistent": 0, "Po": 1, "Fa": 2, "TA": 3, "Gd": 4, "Ex": 5},
    "GarageCond": {"Nonexistent": 0, "Po": 1, "Fa": 2, "TA": 3, "Gd": 4, "Ex": 5},
    "PavedDrive": {"N": 0, "P": 1, "Y": 2},
    "PoolQC": {"Nonexistent": 0, "Fa": 1, "TA": 2, "Gd": 3, "Ex": 4},
}

### nominal columns
nomCatCols = [
    "MSSubClass",
    "MSZoning",
    "LandContour",
    "Neighborhood",
    "Condition1",
    "Condition2",
    "BldgType",
    "HouseStyle",
    "RoofStyle",
    "RoofMatl",
    "Exterior1st",
    "Exterior2nd",
    "MasVnrType",
    "Foundation",
    "Heating",
    "GarageType",
    "Fence",
    "SaleType",
    "SaleCondition",
    "MiscFeature",
]

### additional features
train.data["BsmtFinSF"] = train.data["BsmtFinSF1"] + train.data["BsmtFinSF2"]
train.data["TotalSF"] = (
    train.data["TotalBsmtSF"] + train.data["1stFlrSF"] + train.data["2ndFlrSF"]
)

### observation removal
outliers = [
    53,
    185,
    197,
    437,
    492,
    762,
    796,
    821,
    847,
    1161,
    1221,
    1318,
    1376,
    249,
    313,
    335,
    451,
    523,
    691,
    706,
    934,
    1182,
    1298,
]
train.data = train.data.drop(train.data.index[outliers])
train.target = train.target.drop(index=outliers)

### pre-processing pipeline
trainPipe = pipeline.Pipeline(
    [
        (
            "imputeConstantCat",
            train.ConstantImputer(
                cols=[
                    "PoolQC",
                    "Alley",
                    "Fence",
                    "FireplaceQu",
                    "GarageType",
                    "GarageFinish",
                    "GarageQual",
                    "MiscFeature",
                    "GarageCond",
                    "BsmtQual",
                    "BsmtCond",
                    "BsmtExposure",
                    "BsmtFinType1",
                    "BsmtFinType2",
                    "MasVnrType",
                ],
                fill="Nonexistent",
            ),
        ),
        (
            "imputeConstantNum",
            train.ConstantImputer(cols=["GarageYrBlt", "MasVnrArea"], fill=0),
        ),
        ("imputeMode", train.ModeImputer(cols=["Electrical"])),
        (
            "imputeContext",
            train.ContextImputer(
                nullCol="LotFrontage", contextCol="Neighborhood", strategy="mean"
            ),
        ),
        ("encodeOrdinal", train.CustomOrdinalEncoder(encodings=ordinalEncodings)),
        ("dummyNominal", train.Dummies(cols=nomCatCols, dropFirst=False)),
        (
            "skew",
            train.SkewTransform(
                cols=train.featureByDtype_["continuous"], skewMin=0.75, pctZeroMax=1.0
            ),
        ),
        ("scale", train.Robust(cols="non-binary")),
    ]
)
train.data = trainPipe.transform(train.data)

train.target = np.log1p(train.target)

## Prepare validation data

<a id = 'Prepare-validation-data'></a>

In [None]:
# import validation data
dfValid = pd.read_csv("/home/data-science-portfolio/data/kaggleHousingPrices/test.csv")
valid = mlm.Machine(
    data=dfValid,
    removeFeatures=["Id", "MiscVal"],
    overrideCat=[
        "MSSubClass",
        "OverallQual",
        "OverallCond",
        "YearBuilt",
        "YearRemodAdd",
        "MoSold",
        "YrSold",
    ],
    targetType="continuous",
)

### additional features
valid.data["BsmtFinSF"] = valid.data["BsmtFinSF1"] + valid.data["BsmtFinSF2"]
valid.data["TotalSF"] = (
    valid.data["TotalBsmtSF"] + valid.data["1stFlrSF"] + valid.data["2ndFlrSF"]
)
valid.data.loc[valid.data["TotalSF"].isnull(), "TotalSF"] = (
    valid.data["1stFlrSF"] + valid.data["2ndFlrSF"]
)

### pre-processing pipeline
validPipe = pipeline.Pipeline(
    [
        (
            "imputeConstantCat",
            valid.ConstantImputer(
                cols=[
                    "PoolQC",
                    "Alley",
                    "Fence",
                    "FireplaceQu",
                    "GarageType",
                    "GarageFinish",
                    "GarageQual",
                    "MiscFeature",
                    "GarageCond",
                    "BsmtQual",
                    "BsmtCond",
                    "BsmtExposure",
                    "BsmtFinType1",
                    "BsmtFinType2",
                    "MasVnrType",
                ],
                fill="Nonexistent",
            ),
        ),
        (
            "imputeConstantNum",
            valid.ConstantImputer(
                cols=[
                    "GarageYrBlt",
                    "MasVnrArea",
                    "BsmtUnfSF",
                    "GarageArea",
                    "BsmtFinSF1",
                    "TotalBsmtSF",
                    "BsmtFinSF2",
                ],
                fill=0,
            ),
        ),
        (
            "imputeModeCat",
            valid.ModeImputer(
                cols=[
                    "Functional",
                    "SaleType",
                    "Exterior1st",
                    "MSZoning",
                    "Exterior2nd",
                    "KitchenQual",
                    "Utilities",
                ]
            ),
        ),
        (
            "imputeModeNum",
            valid.NumericalImputer(
                cols=["BsmtHalfBath", "GarageCars", "BsmtFullBath"],
                strategy="most_frequent",
            ),
        ),
        (
            "imputeContext",
            valid.ContextImputer(
                nullCol="LotFrontage",
                contextCol="Neighborhood",
                strategy="mean",
                train=False,
                trainDf=trainPipe.named_steps["imputeContext"].fillDf,
            ),
        ),
        ("encodeOrdinal", valid.CustomOrdinalEncoder(encodings=ordinalEncodings)),
        ("dummyNominal", valid.Dummies(cols=nomCatCols, dropFirst=False)),
        (
            "skew",
            valid.SkewTransform(
                cols=valid.featureByDtype_["continuous"],
                train=False,
                trainDict=trainPipe.named_steps["skew"].colValueDict_,
            ),
        ),
        (
            "scale",
            valid.Robust(
                cols="non-binary",
                train=False,
                trainDict=trainPipe.named_steps["scale"].colValueDict_,
            ),
        ),
        ("levels", valid.MissingDummies(trainCols=train.data.columns)),
    ]
)
valid.data = validPipe.transform(valid.data)
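
The validation pipeline reuses statistics learned on the training data (`train=False`, `trainDf`, `trainDict`) and then aligns the dummy columns with `MissingDummies`. The internals of those `mlm` transformers are not shown here, so the following is only a plain-pandas sketch of the same idea under that assumption: neighborhood means come from training rows only, and validation dummies are reindexed to the training columns. The toy frames and variable names are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy frames standing in for the training and validation data.
train_df = pd.DataFrame({
    "Neighborhood": ["A", "A", "B", "B", "C"],
    "LotFrontage": [60.0, 80.0, 40.0, np.nan, 50.0],
})
valid_df = pd.DataFrame({
    "Neighborhood": ["A", "B", "D"],
    "LotFrontage": [np.nan, np.nan, 55.0],
})

# Learn per-neighborhood means from the training data only.
train_means = train_df.groupby("Neighborhood")["LotFrontage"].mean()

# Fill validation nulls with the training means; a neighborhood unseen in
# training would stay null and need a fallback.
valid_df["LotFrontage"] = valid_df["LotFrontage"].fillna(
    valid_df["Neighborhood"].map(train_means)
)

# One-hot encode each frame, then force the validation columns to match the
# training columns: missing levels become 0, unseen levels are dropped.
train_dummies = pd.get_dummies(train_df["Neighborhood"], prefix="Neighborhood", dtype=int)
valid_dummies = pd.get_dummies(valid_df["Neighborhood"], prefix="Neighborhood", dtype=int)
valid_aligned = valid_dummies.reindex(columns=train_dummies.columns, fill_value=0)
```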

## GridSearch

<a id = 'GridSearch'></a>

In [None]:
# parameter space
allSpace = {
    "linear_model.Lasso": {"alpha": hp.uniform("alpha", 0.0000001, 10)},
    "linear_model.Ridge": {"alpha": hp.uniform("alpha", 0.0001, 20)},
    "linear_model.ElasticNet": {
        "alpha": hp.uniform("alpha", 0.0000001, 10),
        "l1_ratio": hp.uniform("l1_ratio", 0.0, 0.2),
    },
    "kernel_ridge.KernelRidge": {
        "alpha": hp.uniform("alpha", 0.0001, 15),
        "kernel": hp.choice("kernel", ["linear", "polynomial", "rbf"]),
        "degree": hp.choice("degree", [2, 3]),
        "gamma": hp.uniform("gamma", 6.0, 8.0),
    },
    "lightgbm.LGBMRegressor": {
        "colsample_bytree": hp.uniform("colsample_bytree", 0.4, 0.65),
        "boosting_type": hp.choice("boosting_type", ["gbdt"]),
        "subsample": hp.uniform("subsample", 0.5, 1),
        "learning_rate": hp.uniform("learning_rate", 0.000000001, 0.05),
        "max_depth": hp.choice("max_depth", np.arange(2, 8, dtype=int)),
        "min_child_samples": hp.uniform("min_child_samples", 10, 100),
        "n_estimators": hp.choice("n_estimators", np.arange(100, 4000, 10, dtype=int)),
        "num_leaves": hp.uniform("num_leaves", 8, 150),
        "reg_alpha": hp.uniform("reg_alpha", 0.0, 0.2),
        "reg_lambda": hp.uniform("reg_lambda", 0.05, 0.25),
        "subsample_for_bin": hp.uniform("subsample_for_bin", 200000, 400000),
    },
    "xgboost.XGBRegressor": {
        "colsample_bytree": hp.uniform("colsample_bytree", 0.4, 0.65),
        "gamma": hp.uniform("gamma", 0.0, 2),
        "reg_alpha": hp.uniform("reg_alpha", 0.0, 0.3),
        "reg_lambda": hp.uniform("reg_lambda", 0.4, 1.0),
        "learning_rate": hp.uniform("learning_rate", 0.00001, 0.08),
        "max_depth": hp.choice("max_depth", np.arange(2, 12, dtype=int)),
        "min_child_weight": hp.uniform("min_child_weight", 1, 8),
        "n_estimators": hp.choice("n_estimators", np.arange(4000, 10000, 10, dtype=int))
        # ,'objective' : hp.choice('objective', ['binary:logistic'])
        ,
        "subsample": hp.uniform("subsample", 0.5, 0.8),
    },
    "ensemble.RandomForestRegressor": {
        "bootstrap": hp.choice("bootstrap", [True, False]),
        "max_depth": hp.choice("max_depth", np.arange(8, 20, dtype=int)),
        "n_estimators": hp.choice("n_estimators", np.arange(100, 40000, 10, dtype=int)),
        "max_features": hp.choice("max_features", ["sqrt"]),
        "min_samples_split": hp.choice(
            "min_samples_split", np.arange(2, 20, dtype=int)
        ),
        "min_samples_leaf": hp.choice("min_samples_leaf", np.arange(2, 15, dtype=int)),
    },
    "ensemble.GradientBoostingRegressor": {
        "n_estimators": hp.choice("n_estimators", np.arange(100, 10000, 10, dtype=int)),
        "max_depth": hp.choice("max_depth", np.arange(2, 20, dtype=int)),
        "max_features": hp.choice("max_features", ["auto", "sqrt"]),
        "learning_rate": hp.uniform("learning_rate", 0.01, 0.2),
        "loss": hp.choice("loss", ["ls", "lad", "huber", "quantile"]),
        "min_samples_split": hp.choice(
            "min_samples_split", np.arange(2, 40, dtype=int)
        ),
        "min_samples_leaf": hp.choice("min_samples_leaf", np.arange(2, 40, dtype=int)),
    },
    "ensemble.AdaBoostRegressor": {
        "n_estimators": hp.choice("n_estimators", np.arange(100, 10000, 10, dtype=int)),
        "learning_rate": hp.uniform("learning_rate", 0.01, 0.2),
        "loss": hp.choice("loss", ["linear", "square", "exponential"]),
    },
    "ensemble.ExtraTreesRegressor": {
        "n_estimators": hp.choice("n_estimators", np.arange(100, 10000, 10, dtype=int)),
        "max_depth": hp.choice("max_depth", np.arange(2, 20, dtype=int)),
        "min_samples_split": hp.choice(
            "min_samples_split", np.arange(2, 40, dtype=int)
        ),
        "min_samples_leaf": hp.choice("min_samples_leaf", np.arange(2, 40, dtype=int)),
        "max_features": hp.choice("max_features", ["auto", "sqrt"]),
    },
    "svm.SVR": {
        "C": hp.uniform("C", 0.00001, 10),
        "kernel": hp.choice("kernel", ["linear", "poly", "rbf", "sigmoid"]),
        "degree": hp.choice("degree", [2, 3]),
        "gamma": hp.uniform("gamma", 0.0001, 10),
        "epsilon": hp.uniform("epsilon", 0.001, 5),
    },
    "neighbors.KNeighborsRegressor": {
        "algorithm": hp.choice("algorithm", ["auto", "ball_tree", "kd_tree", "brute"]),
        "n_neighbors": hp.choice("n_neighbors", np.arange(1, 20, dtype=int)),
        "weights": hp.choice("weights", ["distance", "uniform"]),
        "p": hp.choice("p", [1, 2]),
    },
}

In [None]:
# execute bayesian optimization grid search
analysis = "housing"
train.execBayesOptimSearch(
    allSpace=allSpace,
    resultsDir="data/{}_hyperopt_{}.csv".format(rundate, analysis),
    X=train.data,
    y=train.target,
    scoring="rmsle",
    n_folds=8,
    n_jobs=16,
    iters=1500,
    verbose=0,
)
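
What `execBayesOptimSearch` does internally is not shown in this notebook. As a rough mental model only, the loop below samples a hyperparameter, scores it with cross-validation, and keeps the best trial. Note that it uses plain random search rather than hyperopt's Bayesian (TPE) sampling, and that the synthetic data, the `objective` helper and the trial count are assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score

# Synthetic regression data standing in for the prepared training set.
X, y = make_regression(n_samples=200, n_features=10, noise=10.0, random_state=0)
rng = np.random.default_rng(0)
cv = KFold(n_splits=5, shuffle=True, random_state=0)


def objective(alpha):
    """Return the square root of the mean cross-validated MSE for a Ridge model."""
    scores = cross_val_score(
        Ridge(alpha=alpha), X, y, scoring="neg_mean_squared_error", cv=cv
    )
    return float(np.sqrt(-scores.mean()))


# Random search: draw alpha uniformly from the same range as the Ridge space above.
trials = []
for _ in range(20):
    alpha = rng.uniform(0.0001, 20)
    trials.append({"alpha": alpha, "loss": objective(alpha)})

best = min(trials, key=lambda t: t["loss"])
```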

### Evaluation

<a id = 'Evaluation'></a>

In [None]:
# create model with full set of predictor variables
linReg = linear_model.LinearRegression()
linReg.fit(XTrain, yTrain)
yPredsTrain = linReg.predict(XTrain)
yPredsTest = linReg.predict(XTest)

In [None]:
# stack actual and predicted values (train first, then test) and flag each row: 0 = train, 1 = test
yActual = np.vstack((yTrain, yTest))
yPreds = np.vstack((yPredsTrain, yPredsTest))
yType = np.hstack((np.repeat(0, yTrain.shape[0]), np.repeat(1, yTest.shape[0])))
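
`np.vstack` in the cell above works when the targets are column vectors of shape `(n, 1)`. If they are 1-D arrays of different lengths it raises a `ValueError`, and `np.concatenate` is the safer choice. A minimal sketch with made-up log-price values (all arrays here are hypothetical):

```python
import numpy as np

# Hypothetical 1-D targets and predictions on the log scale.
y_train = np.array([12.0, 11.5, 12.3])
y_test = np.array([11.9, 12.1])
pred_train = np.array([12.1, 11.4, 12.2])
pred_test = np.array([12.0, 11.8])

# Join train and test end to end.
y_actual = np.concatenate([y_train, y_test])
y_preds = np.concatenate([pred_train, pred_test])

# 0 marks training rows, 1 marks test rows.
split_flag = np.concatenate(
    [np.zeros(len(y_train), dtype=int), np.ones(len(y_test), dtype=int)]
)

# Residuals as plotted in the residual plot: predicted minus actual.
residuals = y_preds - y_actual
```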

In [None]:
# visualize predictions using residual plot
p = PrettierPlot()
ax = p.makeCanvas(title="", xLabel="Predicted values", yLabel="Residuals", yShift=0.8)
p.pretty2dScatterHue(
    x=yPreds,
    y=yPreds - yActual,
    target=yType,
    label=["Training", "Test"],
    xUnits="f",
    yUnits="f",
    bbox=(1.2, 0.9),
    ax=ax,
)
plt.hlines(y=0, xmin=-10, xmax=50, color="black", lw=5)

In [None]:
# read scores summary table
resultsDf = pd.read_csv("data/20190504_hyperopt_housing.csv", na_values="nan")
results = train.unpackParams(resultsDf)

In [None]:
# loss plot
train.lossPlot(resultsDf=results)

In [None]:
# estimator parameter plots
train.paramPlot(results=results, allSpace=allSpace, nIter=100)

### Model explainability

https://www.kaggle.com/learn/machine-learning-explainability

<a id = 'Feature-importance'></a>

#### Permutation importance

<a id = 'Permutation-importance'></a>

In [None]:
# permutation importance
import eli5
from eli5.sklearn import PermutationImportance

perm = PermutationImportance(my_model, random_state=1).fit(val_X, val_y)
eli5.show_weights(perm, feature_names=val_X.columns.tolist())
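
The cell above uses `eli5`, with `my_model`, `val_X` and `val_y` left as placeholders. As an alternative sketch, scikit-learn (0.22 and later) provides `sklearn.inspection.permutation_importance`, which measures how much the validation score drops when a column is shuffled. The synthetic data below is an assumption for illustration: only feature 0 drives the target, so it should rank first.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data in which only feature 0 carries signal.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 3))
y = 3.0 * X[:, 0] + 0.1 * rng.normal(size=300)

X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)
model = RandomForestRegressor(n_estimators=50, random_state=0).fit(X_train, y_train)

# Shuffle each column of the validation set several times and record the
# drop in the model's default score (R^2 for regressors).
result = permutation_importance(model, X_val, y_val, n_repeats=5, random_state=0)

# Feature indices ordered from most to least important.
ranking = np.argsort(result.importances_mean)[::-1]
```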

#### Partial plots

<a id = 'Partial-plots'></a>

In [None]:
# partial dependence plot for a single feature
from matplotlib import pyplot as plt
from pdpbox import pdp, get_dataset, info_plots

# Create the data that we will plot
pdp_goals = pdp.pdp_isolate(
    model=tree_model, dataset=val_X, model_features=feature_names, feature="Goal Scored"
)

# plot it
pdp.pdp_plot(pdp_goals, "Goal Scored")
plt.show()

In [None]:
feature_to_plot = "Distance Covered (Kms)"
pdp_dist = pdp.pdp_isolate(
    model=tree_model,
    dataset=val_X,
    model_features=feature_names,
    feature=feature_to_plot,
)

pdp.pdp_plot(pdp_dist, feature_to_plot)
plt.show()

In [None]:
# Build Random Forest model
rf_model = RandomForestClassifier(random_state=0).fit(train_X, train_y)

pdp_dist = pdp.pdp_isolate(
    model=rf_model, dataset=val_X, model_features=feature_names, feature=feature_to_plot
)

pdp.pdp_plot(pdp_dist, feature_to_plot)
plt.show()

In [None]:
# 2D plots
# Similar to the previous PDP plot, except we use pdp_interact instead of pdp_isolate
# and pdp_interact_plot instead of pdp_plot
features_to_plot = ["Goal Scored", "Distance Covered (Kms)"]
inter1 = pdp.pdp_interact(
    model=tree_model,
    dataset=val_X,
    model_features=feature_names,
    features=features_to_plot,
)

pdp.pdp_interact_plot(
    pdp_interact_out=inter1, feature_names=features_to_plot, plot_type="contour"
)
plt.show()
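
The `pdpbox` cells above still refer to placeholder names (`tree_model`, `"Goal Scored"`). To make the underlying idea concrete without that library, a one-dimensional partial dependence curve can be computed by hand: fix one feature at each grid value for every row, predict, and average. The helper `partial_dependence_1d` and the synthetic data are assumptions for illustration.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Synthetic data: the target rises linearly with feature 0 and ignores feature 1.
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(300, 2))
y = 2.0 * X[:, 0] + rng.normal(scale=0.05, size=300)

model = GradientBoostingRegressor(random_state=0).fit(X, y)


def partial_dependence_1d(model, X, feature, grid):
    """Average prediction when `feature` is forced to each value in `grid`."""
    averages = []
    for value in grid:
        X_mod = X.copy()
        X_mod[:, feature] = value  # overwrite the feature for every row
        averages.append(model.predict(X_mod).mean())
    return np.array(averages)


grid = np.linspace(0.1, 0.9, 5)
pd_curve = partial_dependence_1d(model, X, feature=0, grid=grid)
```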

#### SHAP values

<a id = 'SHAP-values'></a>

In [None]:
# select a single row to explain
row_to_show = 5
data_for_prediction = val_X.iloc[
    row_to_show
]  # use 1 row of data here. Could use multiple rows if desired
data_for_prediction_array = data_for_prediction.values.reshape(1, -1)


my_model.predict_proba(data_for_prediction_array)

In [None]:
import shap  # package used to calculate Shap values

# Create object that can calculate shap values
explainer = shap.TreeExplainer(my_model)

# Calculate Shap values
shap_values = explainer.shap_values(data_for_prediction)

In [None]:
shap.initjs()
shap.force_plot(explainer.expected_value[1], shap_values[1], data_for_prediction)

In [None]:
# use Kernel SHAP to explain test set predictions
k_explainer = shap.KernelExplainer(my_model.predict_proba, train_X)
k_shap_values = k_explainer.shap_values(data_for_prediction)
shap.force_plot(k_explainer.expected_value[1], k_shap_values[1], data_for_prediction)

In [None]:
shap.DeepExplainer

In [None]:
import shap  # package used to calculate Shap values

# Create object that can calculate shap values
explainer = shap.TreeExplainer(my_model)

# calculate shap values. This is what we will plot.
# Calculate shap_values for all of val_X rather than a single row, to have more data for plot.
shap_values = explainer.shap_values(val_X)

# Make plot. Index of [1] is explained in text below.
shap.summary_plot(shap_values[1], val_X)

In [None]:
import shap  # package used to calculate Shap values

# Create object that can calculate shap values
explainer = shap.TreeExplainer(my_model)

# calculate shap values. This is what we will plot.
shap_values = explainer.shap_values(X)

# make plot.
shap.dependence_plot(
    "Ball Possession %", shap_values[1], X, interaction_index="Goal Scored"
)
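
The SHAP cells above follow a binary-classifier pattern (`predict_proba`, `shap_values[1]`); for a regressor on log `SalePrice` there is typically a single output, so that indexing would likely need to change. The additivity property that SHAP relies on can be illustrated without the `shap` package: for a linear model, assuming independent features, each feature's contribution is its coefficient times the feature's deviation from its mean, and the contributions plus the average prediction reproduce the model output. The data below is synthetic and purely illustrative.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic linear data.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + 12.0

model = LinearRegression().fit(X, y)

# Base value: the average prediction over the background data.
base_value = model.predict(X).mean()

# Per-feature contributions for one row (linear model, independence assumed).
row = X[5]
contributions = model.coef_ * (row - X.mean(axis=0))

# Additivity: base value plus contributions equals the row's prediction.
reconstructed = base_value + contributions.sum()
```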

## Stacking

<a id = 'Stacking'></a>

### Primary models

<a id = 'Primary-models'></a>

### Meta model

<a id = 'Meta-model'></a>
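
This section is still a placeholder. One common way to stack models, sketched here under assumptions and not necessarily the approach intended for this notebook, is to train the meta model on out-of-fold predictions of the primary models. The estimators, the synthetic data and the `stack_predict` helper are all illustrative choices.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import KFold, cross_val_predict

X, y = make_regression(n_samples=200, n_features=8, noise=5.0, random_state=0)
cv = KFold(n_splits=5, shuffle=True, random_state=0)

primary_models = [
    Ridge(alpha=1.0),
    RandomForestRegressor(n_estimators=50, random_state=0),
]

# Out-of-fold predictions: every row is predicted by a model that did not
# see that row during fitting, which limits leakage into the meta model.
oof = np.column_stack([cross_val_predict(m, X, y, cv=cv) for m in primary_models])

# The meta model learns how to combine the primary predictions.
meta_model = LinearRegression().fit(oof, y)

# Refit the primary models on all rows for use at prediction time.
fitted = [m.fit(X, y) for m in primary_models]


def stack_predict(X_new):
    """Feed primary-model predictions into the meta model."""
    level_one = np.column_stack([m.predict(X_new) for m in fitted])
    return meta_model.predict(level_one)


stacked = stack_predict(X[:5])
```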

# Submission

<a id = 'Submission'></a>

## Standard

<a id = 'Standard'></a>

In [None]:
# generate prediction submission file
my_submission = pd.DataFrame({"Id": dfValid.Id, "SalePrice": np.expm1(yPred)})
my_submission.to_csv("data/submission.csv", index=False)

## Stack

<a id = 'Stack'></a>

In [None]:
# generate prediction submission file
my_submission = pd.DataFrame({"Id": dfValid.Id, "SalePrice": np.expm1(yPred)})
my_submission.to_csv("data/submission.csv", index=False)

# misc code


In [None]:
# Filling missing value of Age

## Fill Age with the median age of similar rows according to Pclass, Parch and SibSp
# Index of NaN age rows
index_NaN_age = list(dataset.index[dataset["Age"].isnull()])

for i in index_NaN_age:
    age_med = dataset["Age"].median()
    age_pred = dataset["Age"][
        (
            (dataset["SibSp"] == dataset.loc[i, "SibSp"])
            & (dataset["Parch"] == dataset.loc[i, "Parch"])
            & (dataset["Pclass"] == dataset.loc[i, "Pclass"])
        )
    ].median()
    # index labels are used with .loc on the frame itself; the chained form
    # dataset["Age"].iloc[i] = ... can end up writing to a copy
    if not np.isnan(age_pred):
        dataset.loc[i, "Age"] = age_pred
    else:
        dataset.loc[i, "Age"] = age_med
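
The same imputation can be written without a Python loop by using `groupby(...).transform("median")`. One difference to be aware of: the loop recomputes medians as values are filled in, whereas this sketch computes them once from the originally observed ages. The toy frame below is hypothetical.

```python
import numpy as np
import pandas as pd

# Toy passenger frame with three missing ages.
dataset = pd.DataFrame({
    "Pclass": [1, 1, 1, 3, 3, 2],
    "SibSp": [0, 0, 0, 1, 1, 0],
    "Parch": [0, 0, 0, 0, 0, 0],
    "Age": [30.0, 40.0, np.nan, 20.0, np.nan, np.nan],
})

# Median age of rows sharing the same SibSp, Parch and Pclass,
# broadcast back to the original row order.
group_median = dataset.groupby(["SibSp", "Parch", "Pclass"])["Age"].transform("median")

# Fallback for groups that have no observed age at all.
overall_median = dataset["Age"].median()

dataset["Age"] = dataset["Age"].fillna(group_median).fillna(overall_median)
```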

In [None]:
# libs

from sklearn.linear_model import ElasticNet, Lasso, BayesianRidge, LassoLarsIC
from sklearn.kernel_ridge import KernelRidge
from sklearn.preprocessing import RobustScaler
from sklearn.base import BaseEstimator, TransformerMixin, RegressorMixin, clone
from sklearn.model_selection import KFold, cross_val_score, train_test_split

In [None]:
n_folds = 10


def rmsle_cv(model):
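    # pass the KFold object itself as cv; calling .get_n_splits() would return
    # only an integer, which discards shuffle=True and random_state
    kf = model_selection.KFold(n_folds, shuffle=True, random_state=42)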
    rmse = np.sqrt(
        -model_selection.cross_val_score(
            model,
            train.data.values,
            train.target,
            scoring="neg_mean_squared_error",
            cv=kf,
        )
    )
    return rmse


averaged_models = AveragingModels(models=(topXGBoost, topGBR, topLGBM))
avg_scores = rmsle_cv(averaged_models)


cv = train.rmsleCV(
    estimator=topLGBM,
    X=train.data,
    y=train.target,
    scoring="neg_mean_squared_error",
    cv=10,
    modelDesc="lightgbm",
)


# top 20 random forest feature importances

z = list(zip(rfrFinal.feature_importances_, np.append(numCols, catCols)))
z = sorted(z, key=lambda tup: tup[0], reverse=True)[:20]

# plot horizontal bar by feature importance
z.sort(reverse=False)
values, labels = zip(*z)
plt.figure(figsize=(15, 8))
plt.subplot(121)
plt.barh(labels, values)
plt.xlabel("Percent Contribution to Random Forest Model")
plt.ylabel("Feature Names")
plt.title("Comparison of Features by Importance")

# reverse sorting (for a more intuitive aesthetic) and plot the cumulative value of features
plt.subplot(122)
z.sort(reverse=True)
values, labels = zip(*z)
plt.plot(np.cumsum(values))
plt.ylabel("Contribution to the Random Forest Model")
plt.xlabel("Number of Features")
plt.title("Cumulative Value of Features by Importance")

plt.subplots_adjust(wspace=0.75)

In [None]:
class AveragingModels(base.BaseEstimator, base.RegressorMixin, base.TransformerMixin):
    def __init__(self, models):
        self.models = models

    # we define clones of the original models to fit the data in
    def fit(self, X, y):
        self.models_ = [base.clone(x) for x in self.models]

        # Train cloned base models
        for model in self.models_:
            model.fit(X, y)

        return self

    # Now we do the predictions for cloned models and average them
    def predict(self, X):
        predictions = np.column_stack([model.predict(X) for model in self.models_])
        return np.mean(predictions, axis=1)
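
A short usage sketch for this kind of averaging estimator, assuming synthetic data and two simple linear models in place of the tuned `topXGBoost`, `topGBR` and `topLGBM`. The class is restated compactly as `AveragingSketch` (a hypothetical name) so the example runs on its own, and a shuffled `KFold` object is passed directly as `cv`.

```python
import numpy as np
from sklearn.base import BaseEstimator, RegressorMixin, clone
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge
from sklearn.model_selection import KFold, cross_val_score


class AveragingSketch(BaseEstimator, RegressorMixin):
    """Fit clones of several regressors and average their predictions."""

    def __init__(self, models):
        self.models = models

    def fit(self, X, y):
        # Clone so the estimators passed in are left unfitted.
        self.models_ = [clone(m).fit(X, y) for m in self.models]
        return self

    def predict(self, X):
        return np.mean([m.predict(X) for m in self.models_], axis=0)


X, y = make_regression(n_samples=200, n_features=10, noise=10.0, random_state=0)
averaged = AveragingSketch(models=(Ridge(alpha=1.0), Lasso(alpha=0.1)))

# Shuffled folds; the KFold object itself is handed to cross_val_score.
kf = KFold(n_splits=5, shuffle=True, random_state=42)
mse = -cross_val_score(averaged, X, y, scoring="neg_mean_squared_error", cv=kf)
rmse = np.sqrt(mse)

# The averaged prediction equals the mean of the individual predictions.
averaged.fit(X, y)
manual = np.mean([m.predict(X[:3]) for m in averaged.models_], axis=0)
```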

In [None]:
# aggregate claim-level rows to one row per year and member


class ClaimAggregater(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        return self

    def transform(self, X, y=None):
        agg_op_dt_claim = {
            "PayDelay": {
                "max_PayDelay": "max",
                "min_PayDelay": "min",
                "avg_PayDelay": "mean",
            },
            "LengthOfStay": {"max_LOS": "max", "min_LOS": "min", "avg_LOS": "mean"},
            "DSFS": {"max_dsfs": "max", "min_dsfs": "min", "avg_dsfs": "mean"},
            "CharlsonIndex": {
                "max_CharlsonIndex": "max",
                "min_CharlsonIndex": "min",
                "avg_CharlsonIndex": "mean",
            },
        }

        # add binary categorical columns to agg_op_dt_claim for groupby
        for i in X.columns[np.array(X.dtypes == "uint8")]:
            agg_op_dt_claim["{0}".format(i)] = {"Sum_{0}".format(i): "sum"}

        result = X.groupby(["Year", "MemberID"]).agg(agg_op_dt_claim)
        result.columns = result.columns.droplevel()
        result = result.reset_index(level=["Year", "MemberID"])
        result["range_dsfs"] = result["max_dsfs"] - result["min_dsfs"]
        result["range_CharlsonIndex"] = (
            result["max_CharlsonIndex"] - result["min_CharlsonIndex"]
        )
        return result
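
The nested-dictionary renaming used in `agg` above was deprecated and then removed in pandas 1.0, where it raises a `SpecificationError`. Named aggregation (pandas 0.25 and later) expresses the same thing and returns flat column names, so `droplevel` is no longer needed. A minimal sketch on a made-up claims frame:

```python
import pandas as pd

# Hypothetical claim-level rows.
claims = pd.DataFrame({
    "Year": ["Y1", "Y1", "Y1", "Y2"],
    "MemberID": [1, 1, 2, 1],
    "PayDelay": [10, 30, 20, 5],
    "DSFS": [15, 45, 75, 15],
})

# Named aggregation: output_name=(input_column, function).
result = (
    claims.groupby(["Year", "MemberID"])
    .agg(
        max_PayDelay=("PayDelay", "max"),
        min_PayDelay=("PayDelay", "min"),
        avg_PayDelay=("PayDelay", "mean"),
        max_dsfs=("DSFS", "max"),
        min_dsfs=("DSFS", "min"),
    )
    .reset_index()
)
result["range_dsfs"] = result["max_dsfs"] - result["min_dsfs"]
```

Dynamically built aggregations, such as the per-column sums for the binary columns, can be passed the same way by unpacking a dictionary of `name: (column, function)` pairs with `**`.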

In [None]:
# preprocess via pipeline


class DrugAttributesAdder(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        return self

    def transform(self, X, y=None):
        dsfs_dt = {
            "0- 1 month": 15,
            "1- 2 months": 45,
            "2- 3 months": 75,
            "3- 4 months": 105,
            "4- 5 months": 135,
            "5- 6 months": 165,
            "6- 7 months": 195,
            "7- 8 months": 225,
            "8- 9 months": 255,
            "9-10 months": 285,
            "10-11 months": 315,
            "11-12 months": 345,
        }
        X["DSFS"] = X["DSFS"].apply(lambda x: dsfs_dt[x])
        X["DrugCount"] = X["DrugCount"].apply(lambda x: 7 if x == "7+" else int(x))
        return X


class DrugAggregater(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        return self

    def transform(self, X, y=None):
        agg_op_dt_drug = {
            "DrugCount": {
                "max_DrugCount": "max",
                "min_DrugCount": "min",
                "avg_DrugCount": "mean",
                "months_DrugCount": "count",
            }
        }
        result = X.groupby(["Year", "MemberID"]).agg(agg_op_dt_drug)
        result.columns = result.columns.droplevel()
        result = result.reset_index(level=["Year", "MemberID"])
        return result

In [None]:
# convert 'AgeAtFirstClaim' to numerical approximation


class MemberAttributesAdder(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        return self

    def transform(self, X, y=None):
        age_dt = {
            "40-49": 45,
            "70-79": 75,
            "50-59": 55,
            "60-69": 65,
            "30-39": 35,
            "10-19": 15,
            "0-9": 5,
            "20-29": 25,
            "80+": 85,
        }
        X["AgeAtFirstClaim"] = X["AgeAtFirstClaim"].apply(
            lambda x: None if pd.isnull(x) else age_dt[x]
        )
        return X
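
The dictionary lookup with an explicit null check can also be written with `Series.map`, which leaves missing values as `NaN`. The trade-off is that `map` also returns `NaN` silently for any label missing from the dictionary, whereas `age_dt[x]` would raise a `KeyError`. A small sketch with a shortened, hypothetical mapping:

```python
import numpy as np
import pandas as pd

# Shortened mapping from age band to an approximate midpoint.
age_dt = {"0-9": 5, "10-19": 15, "80+": 85}

members = pd.DataFrame({"AgeAtFirstClaim": ["10-19", np.nan, "80+", "0-9"]})

# map() looks each label up in the dictionary; nulls stay null.
members["AgeAtFirstClaim"] = members["AgeAtFirstClaim"].map(age_dt)
```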

In [None]:
def splitPrep(dataset):
    dfTrain, dfTest = train_test_split(dataset, test_size=0.3, random_state=42)

    yTrain = dfTrain["label"]
    yTest = dfTest["label"]

    xTrain = dfTrain.drop(["label"], axis=1)
    xTest = dfTest.drop(["label"], axis=1)

    allCols = xTrain.columns.values
    catCols = ["ClaimsTruncated", "F", "M"]
    index = [np.argwhere(allCols == i)[0][0] for i in catCols]
    numCols = np.delete(allCols, index)

    numPipeline = Pipeline(
        [
            ("selector", DataFrameSelector(numCols)),
            ("imputer", Imputer(strategy="median")),
            ("std_scaler", StandardScaler()),
        ]
    )

    catPipeline = Pipeline([("selector", DataFrameSelector(catCols))])

    fullPipeline = FeatureUnion(
        transformer_list=[("numPipeline", numPipeline), ("catPipeline", catPipeline)]
    )

    xTrain = fullPipeline.fit_transform(xTrain)
    xTest = fullPipeline.transform(xTest)

    return xTrain, xTest, yTrain, yTest, numCols, catCols
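
`Imputer` was removed from scikit-learn in version 0.22, and `DataFrameSelector` is a custom class not defined in this notebook. In current scikit-learn the same preprocessing can be sketched with `SimpleImputer` and `ColumnTransformer`. The column names and toy frames below are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Toy frames: two numeric columns and one binary categorical column.
x_train = pd.DataFrame({
    "num_a": [1.0, np.nan, 3.0, 5.0],
    "num_b": [10.0, 20.0, 30.0, 40.0],
    "F": [1, 0, 1, 0],
})
x_test = pd.DataFrame({"num_a": [np.nan], "num_b": [25.0], "F": [1]})

cat_cols = ["F"]
num_cols = [c for c in x_train.columns if c not in cat_cols]

# Numeric columns: median imputation followed by standardization.
num_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("std_scaler", StandardScaler()),
])

# Binary columns pass through unchanged.
full_pipeline = ColumnTransformer([
    ("num", num_pipeline, num_cols),
    ("cat", "passthrough", cat_cols),
])

# Fit on training rows only, then apply the fitted transform to test rows.
x_train_t = full_pipeline.fit_transform(x_train)
x_test_t = full_pipeline.transform(x_test)
```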

LandContour: Flatness of the property

       Lvl	Near Flat/Level	
       Bnk	Banked - Quick and significant rise from street grade to building
       HLS	Hillside - Significant slope from side to side
       Low	Depression

MiscFeature: Miscellaneous feature not covered in other categories
		
       Elev	Elevator
       Gar2	2nd Garage (if not described in garage section)
       Othr	Other
       Shed	Shed (over 100 SF)
       TenC	Tennis Court
       NA	None
       
MiscVal: $Value of miscellaneous feature


Condition1: Proximity to various conditions
	
       Artery	Adjacent to arterial street
       Feedr	Adjacent to feeder street	
       Norm	Normal	
       RRNn	Within 200' of North-South Railroad
       RRAn	Adjacent to North-South Railroad
       PosN	Near positive off-site feature--park, greenbelt, etc.
       PosA	Adjacent to positive off-site feature
       RRNe	Within 200' of East-West Railroad
       RRAe	Adjacent to East-West Railroad
	
Condition2: Proximity to various conditions (if more than one is present)
		
       Artery	Adjacent to arterial street
       Feedr	Adjacent to feeder street	
       Norm	Normal	
       RRNn	Within 200' of North-South Railroad
       RRAn	Adjacent to North-South Railroad
       PosN	Near positive off-site feature--park, greenbelt, etc.
       PosA	Adjacent to positive off-site feature
       RRNe	Within 200' of East-West Railroad
       RRAe	Adjacent to East-West Railroad

