 __Project - Wine classification__

1. [Import](#Import)
    1. [Tools](#Tools)
    1. [Data](#Data)    
1. [Initial EDA](#Initial-EDA)
    1. [Categorical feature EDA](#Categorical-feature-EDA)
        1. [Univariate & feature vs. target](#Univariate-&-feature-vs.-target)
    1. [Continuous feature EDA](#Continuous-feature-EDA)
        1. [Univariate & feature vs. target](#Univariate-&-feature-vs.-target2)
        1. [Correlation](#Correlation)
        1. [Pair plot](#Pair-plot)
    1. [Faceting](#Faceting)
    1. [Target variable evaluation](#Target-variable-evaluation)    
1. [Data preparation](#Data-preparation)
    1. [Outliers (preliminary)](#Outliers-preliminary)
    1. [Missing data](#Missing-data)
    1. [Engineering](#Engineering)
    1. [Encoding](#Encoding)
    1. [Transformation](#Transformation)
        1. [Polynomial features](#Polynomial-features)
        1. [Skew](#Skew)
    1. [Outliers (final)](#Outliers-final)
1. [Data evaluation](#Data-evaluation)
    1. [Feature importance](#Feature-importance)    
    1. [Rationality](#Rationality)
    1. [Value override](#Value-override)
    1. [Continuous feature EDA](#Continuous-feature-EDA3)
    1. [Correlation](#Correlation3)
1. [Modeling](#Modeling)
    1. [Data preparation](#Data-preparation-1)
    1. [Bayesian hyper-parameter optimization](#Bayesian-hyper-parameter-optimization)
        1. [Model loss by iteration](#Model-loss-by-iteration)
        1. [Parameter selection by iteration](#Parameter-selection-by-iteration)
    1. [Model performance evaluation - standard models](#Model-performance-evaluation-standard-models)
    1. [Validation set evaluation - standard models](#Validation-set-evaluation-standard-models)
    1. [Model explainability](#Model-explanability)
        1. [Permutation importance](#Permutation-importance)
        1. [Partial plots](#Partial-plots)
        1. [SHAP values](#SHAP-values)
1. [Stacking](#Stacking)
    1. [Primary models](#Primary-models)
    1. [Meta model](#Meta-model)
    1. [Model performance evaluation - stacked models](#Model-performance-evaluation-stacked-models)
    1. [Validation set evaluation - stacked models](#Validation-set-evaluation-stacked-models)

# Import

<a id = 'Import'></a>

## Tools

<a id = 'Tools'></a>

In [None]:
# standard library and settings
import os
import sys
import importlib
from functools import reduce
import time; rundate = time.strftime("%Y%m%d")

import warnings
warnings.simplefilter("ignore")

from IPython.display import display, HTML
display(HTML("<style>.container { width:95% !important; }</style>"))

# data extensions and settings
import numpy as np
np.set_printoptions(threshold=np.inf, suppress=True)

import pandas as pd
pd.set_option("display.max_rows", 500); pd.set_option("display.max_columns", 500)
pd.options.display.float_format = "{:,.6f}".format

# modeling extensions
import sklearn.base as base
import sklearn.datasets as datasets
import sklearn.ensemble as ensemble
import sklearn.impute as impute
import sklearn.pipeline as pipeline
import sklearn.preprocessing as preprocessing

import eif
import eli5
import shap
shap.initjs()
from eli5.sklearn import PermutationImportance
from hyperopt import hp
from pdpbox import pdp, get_dataset, info_plots

# visualization extensions and settings
import seaborn as sns
import matplotlib.pyplot as plt
import missingno as msno

%matplotlib inline

# make local checkouts of mlmachine and prettierplot importable
for localPath in ("../../../mlmachine", "../../../prettierplot"):
    if localPath not in sys.path:
        sys.path.append(localPath)

try:
    import mlmachine as mlm
    from mlmachine.features.preprocessing import (
        DataFrameSelector,
        PlayWithPandas,
        UnprocessedColumnAdder,
        ContextImputer,
        PandasFeatureUnion,
        DualTransformer,
        dataRefresh,
    )
    from prettierplot.plotter import PrettierPlot
    import prettierplot.style as style
except ModuleNotFoundError:
    print(
        "This notebook relies on the libraries mlmachine and prettierplot. Please run:"
    )
    print("\tpip install mlmachine")
    print("\tpip install prettierplot")
    raise

## Data

<a id = 'Data'></a>

In [None]:
# load and inspect data
data = pd.read_csv("../../data/wine.data", header=None)

data.columns = [
    "Class label",
    "Alcohol",
    "Malic acid",
    "Ash",
    "Alcalinity of ash",
    "Magnesium",
    "Total phenols",
    "Flavanoids",
    "Nonflavanoid phenols",
    "Proanthocyanins",
    "Color intensity",
    "Hue",
    "OD280/OD315 of diluted wines",
    "Proline",
]

print("Data dimensions: {}".format(data.shape))

In [None]:
# display info and first 5 rows
data.info()
display(data[:5])

In [None]:
# review counts of different column types
data.dtypes.value_counts()

In [None]:
# split dataset into train and validation datasets
dfTrain, dfValid = mlm.trainTestCompile(data=data, targetCol='Class label')

In [None]:
# Load training data into mlmachine
train = mlm.Machine(
    data=dfTrain,
    target="Class label",
    targetType="categorical",
)
print(train.data.shape)

In [None]:
# Load validation data into mlmachine
valid = mlm.Machine(
    data=dfValid,
    target="Class label",
    targetType="categorical",
)
print(valid.data.shape)

# Initial EDA

<a id = 'Initial-EDA'></a>

## Categorical feature EDA

<a id = 'Categorical-feature-EDA'></a>

### Univariate & feature vs. target

<a id = 'Univariate-&-feature-vs.-target'></a>

In [None]:
# categorical features
for feature in train.featureByDtype["categorical"]:
    train.edaCatTargetCatFeat(feature=feature)

## Continuous feature EDA

<a id = 'Continuous-feature-EDA'></a>

### Univariate & feature vs. target

<a id = 'Univariate-&-feature-vs.-target2'></a>

In [None]:
# continuous features
for feature in train.featureByDtype["continuous"]:
    train.edaCatTargetNumFeat(feature=feature)

### Correlation

<a id = 'Correlation'></a>

##### Correlation (all samples)

In [None]:
# correlation heat map
p = PrettierPlot()
ax = p.makeCanvas()
p.prettyCorrHeatmap(df=train.data, annot=False, ax=ax)

##### Correlation (top vs. target)

In [None]:
# correlation heat map with most highly correlated features relative to the target
p = PrettierPlot(plotOrientation='tall')
ax = p.makeCanvas()
p.prettyCorrHeatmapTarget(
    df=train.data, target=train.target, thresh=0.02, annot=True, ax=ax
)

### Pair plot

<a id = 'Pair-plot'></a>

In [None]:
# pair plot
p = PrettierPlot(chartProp=12)
p.prettyPairPlot(df=train.data, cols=train.featureByDtype['continuous'], diag_kind="auto")

In [None]:
# pair plot, colored by target, for the first 10 continuous features
p = PrettierPlot(chartProp=12)
p.prettyPairPlot(
    df=train.data.dropna(),
    diag_kind="kde",
    target=train.target,
    cols=train.featureByDtype['continuous'][:10],
#     legendLabels=["Stays", "Leaves"],
    bbox=(2.0, 0.0),
)

## Faceting

<a id = 'Faceting'></a>

## Target variable evaluation

<a id = 'Target-variable-evaluation'></a>

In [None]:
# null score
pd.Series(train.target).value_counts(normalize=True)
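
The null score is the accuracy obtained by always predicting the most frequent class, and any model should beat it. The sketch below shows the calculation on scikit-learn's bundled copy of the wine data, which is used here only as a stand-in for `train.target` (an assumption; the bundled labels are 0, 1, 2 rather than the 1, 2, 3 used in `wine.data`, and it covers all samples rather than the training split).

```python
import pandas as pd
from sklearn.datasets import load_wine

# stand-in for train.target: scikit-learn's bundled wine labels (0, 1, 2)
y = pd.Series(load_wine().target, name="Class label")

# share of each class, then the accuracy of always guessing the largest one
class_share = y.value_counts(normalize=True)
majority_class = class_share.idxmax()
null_accuracy = class_share.max()

print(f"majority class: {majority_class}, null accuracy: {null_accuracy:.3f}")
```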

# Data preparation

<a id = 'Data-preparation'></a>

## Outliers (preliminary)


<a id = 'Outliers-preliminary'></a>

##### Training

In [None]:
# identify columns that have zero missing values
nonNull = train.data.columns[train.data.isnull().sum() == 0].values.tolist()

# identify intersection between non-null columns and continuous columns
nonNullNumCol = list(set(nonNull).intersection(train.featureByDtype["continuous"]))
print(nonNullNumCol)

In [None]:
# identify outliers using IQR
trainPipe = pipeline.Pipeline([
    ("outlier",train.OutlierIQR(
                outlierCount=2,
                iqrStep=1.5,
                features=nonNullNumCol,
                dropOutliers=False,))
    ])
train.data = trainPipe.transform(train.data)

# capture outliers
iqrOutliers = np.array(sorted(trainPipe.named_steps["outlier"].outliers_))
print(iqrOutliers)
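
For intuition, the IQR rule can be sketched in plain pandas. This is not the `OutlierIQR` implementation; it assumes that `iqrStep` scales the Tukey fences and that `outlierCount` is the minimum number of features in which a row must fall outside the fences. The function name `iqr_outliers` and the toy data are hypothetical.

```python
import numpy as np
import pandas as pd


def iqr_outliers(df, features, iqr_step=1.5, outlier_count=2):
    """Return index labels of rows outside the Tukey fences in at least `outlier_count` features."""
    q1 = df[features].quantile(0.25)
    q3 = df[features].quantile(0.75)
    step = iqr_step * (q3 - q1)
    outside = (df[features] < q1 - step) | (df[features] > q3 + step)
    flagged = outside.sum(axis=1) >= outlier_count
    return np.array(sorted(df.index[flagged]))


# toy data: row 10 is extreme in both columns, row 11 in only one
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0, 100.0, 5.0],
    "b": [10.0, 11.0, 12.0, 13.0, 14.0, 15.0, 16.0, 17.0, 18.0, 19.0, 500.0, 400.0],
})
print(iqr_outliers(toy, ["a", "b"]))
```

Requiring two or more offending features keeps rows that are unusual in a single measurement, which is often preferable on a small dataset.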

In [None]:
# identify outliers using Isolation Forest
clf = ensemble.IsolationForest(
    behaviour="new", max_samples=train.data.shape[0], random_state=0, contamination=0.02
)
clf.fit(train.data[nonNullNumCol])
preds = clf.predict(train.data[nonNullNumCol])

# evaluate index values
mask = np.isin(preds, -1)
ifOutliers = np.array(train.data[mask].index)
print(ifOutliers)

In [None]:
# identify outliers using extended isolation forest
trainPipe = pipeline.Pipeline([
    ("outlier",train.ExtendedIsoForest(
                cols=nonNullNumCol,
                nTrees=100,
                sampleSize=int(np.ceil(train.data.shape[0] * .25)),
                ExtensionLevel=1,
                anomaliesRatio=0.03,
                dropOutliers=False,))
    ])
train.data = trainPipe.transform(train.data)

# capture outliers
eifOutliers = np.array(sorted(trainPipe.named_steps["outlier"].outliers_))
print(eifOutliers)

In [None]:
# identify outliers that are identified in multiple algorithms
outliers = reduce(np.intersect1d, (iqrOutliers, ifOutliers, eifOutliers))
# outliers = reduce(np.intersect1d, (ifOutliers, eifOutliers))
print(outliers)
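
The intersection above is a strict consensus: a row is kept only if every detector flags it. A looser alternative is to keep rows flagged by at least two detectors. The sketch below compares both rules; the index values are made up for illustration and are not results from this dataset.

```python
from collections import Counter
from functools import reduce

import numpy as np

# made-up index values standing in for iqrOutliers, ifOutliers and eifOutliers
iqr_ix = np.array([4, 17, 42])
if_ix = np.array([17, 30, 42])
eif_ix = np.array([17, 55])

# strict consensus: flagged by every detector
in_all = reduce(np.intersect1d, (iqr_ix, if_ix, eif_ix))

# relaxed consensus: flagged by at least two detectors
votes = Counter(np.concatenate((iqr_ix, if_ix, eif_ix)).tolist())
in_two_plus = np.array(sorted(ix for ix, n in votes.items() if n >= 2))

print(in_all, in_two_plus)
```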

In [None]:
# review outlier identification summary
outlierSummary = train.outlierSummary(iqrOutliers=iqrOutliers,
                             ifOutliers=ifOutliers,
                             eifOutliers=eifOutliers
                            )
outlierSummary

##### Validation

##### Remove

In [None]:
# remove outliers from predictors and response
outliers = np.array([59,121])
train.data = train.data.drop(outliers)
train.target = train.target.drop(index=outliers)

## Missing data

No missing data


<a id = 'Missing-data'></a>

##### Training

In [None]:
# evaluate missing data
train.edaMissingSummary()

##### Validation

In [None]:
# evaluate missing data
valid.edaMissingSummary()

##### Impute

## Engineering

<a id = 'Engineering'></a>

##### Training

In [None]:
# print new columns
for col in train.data.columns:
    if (
        col not in train.featureByDtype["categorical"]
        and col not in train.featureByDtype["continuous"]
    ):
        print(col)

In [None]:
# evaluate additional features
for feature in train.featureByDtype["categorical"]:
    train.edaCatTargetCatFeat(feature=feature)

##### Validation

In [None]:
# print new columns
for col in valid.data.columns:
    if (
        col not in valid.featureByDtype["categorical"]
        and col not in valid.featureByDtype["continuous"]
    ):
        print(col)

## Encoding

No categorical features

<a id = 'Encoding'></a>

## Transformation

<a id = 'Transformation'></a>

### Polynomial features

<a id = 'Polynomial-features'></a>

##### Transformation

In [None]:
X = train.data
Y = valid.data

pipe = PandasFeatureUnion([
    ("polynomial", pipeline.make_pipeline(
        DataFrameSelector(train.featureByDtype["continuous"]),
        PlayWithPandas(preprocessing.PolynomialFeatures(degree=2, interaction_only=False, include_bias=False))
    )),
])

X = pipe.fit_transform(train.data)
Y = pipe.transform(valid.data)
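
The same idea can be sketched with scikit-learn alone: fit `PolynomialFeatures` on the training frame, then wrap the output back into a DataFrame so that column names and the index survive. This is an illustration with made-up values, not the `PlayWithPandas` implementation; the naming scheme for the generated columns is an assumption.

```python
import pandas as pd
from sklearn.preprocessing import PolynomialFeatures

# toy stand-ins for train.data and valid.data (values are made up)
X_train = pd.DataFrame({"Alcohol": [13.0, 12.5, 14.1], "Hue": [1.0, 0.9, 1.2]})
X_valid = pd.DataFrame({"Alcohol": [14.0], "Hue": [1.1]}, index=[7])

poly = PolynomialFeatures(degree=2, interaction_only=False, include_bias=False)
poly.fit(X_train)

# build readable column names from the fitted exponent matrix
names = [
    " * ".join(
        f"{col}^{p}" if p > 1 else col
        for col, p in zip(X_train.columns, row)
        if p > 0
    )
    for row in poly.powers_
]

train_poly = pd.DataFrame(poly.transform(X_train), columns=names, index=X_train.index)
valid_poly = pd.DataFrame(poly.transform(X_valid), columns=names, index=X_valid.index)
print(valid_poly)
```

Two input columns produce five output columns at degree 2, so the feature count grows quickly; with 13 continuous wine features the expansion yields 104 columns.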

In [None]:
# update main train and validation datasets
train.data, valid.data = dataRefresh(transformeredTrainData=X,
                                     trainData=train.data,
                                     transformeredValidationData=Y,
                                     validationData=valid.data,
#                                      columnsToDrop=nominalColumns
                                    )

### Skew

<a id = 'Skew'></a>

##### Training

In [None]:
# evaluate skew of continuous features - training data
train.skewSummary()

##### Validation

In [None]:
# evaluate skew of continuous features - validation data
valid.skewSummary()

##### Transform

In [None]:
X = train.data
Y = valid.data

pipe = PandasFeatureUnion([
    ("ordinal", pipeline.make_pipeline(
        DataFrameSelector(train.featureByDtype["continuous"]),
        DualTransformer(),
    )),
    
])

X = pipe.fit_transform(train.data)
Y = pipe.transform(valid.data)
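
The `_bc` and `_yj` column suffixes that appear later in this notebook suggest that `DualTransformer` applies Box-Cox and Yeo-Johnson transformations; that reading is an assumption. The sketch below shows the underlying pattern with SciPy on synthetic data: estimate the lambdas on training data only, then reuse them on validation data.

```python
import numpy as np
from scipy import stats

# synthetic right-skewed, strictly positive data standing in for one feature
rng = np.random.default_rng(0)
train_col = rng.lognormal(mean=0.0, sigma=1.0, size=500)
valid_col = rng.lognormal(mean=0.0, sigma=1.0, size=200)

# fit: estimate the lambdas on training data only
train_bc, bc_lambda = stats.boxcox(train_col)
train_yj, yj_lambda = stats.yeojohnson(train_col)

# apply: reuse the training lambdas on validation data
valid_bc = stats.boxcox(valid_col, lmbda=bc_lambda)
valid_yj = stats.yeojohnson(valid_col, lmbda=yj_lambda)

print("skew before:", stats.skew(train_col))
print("skew after Box-Cox:", stats.skew(train_bc))
print("skew after Yeo-Johnson:", stats.skew(train_yj))
```

Box-Cox requires strictly positive inputs, while Yeo-Johnson also accepts zero and negative values, which is one reason to compute both.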

In [None]:
# update main train and validation datasets
train.data, valid.data = dataRefresh(transformeredTrainData=X,
                                     trainData=train.data,
                                     transformeredValidationData=Y,
                                     validationData=valid.data,
#                                      columnsToDrop=nominalColumns
                                    )

## Outliers (final)


<a id = 'Outliers-final'></a>

##### Training

In [None]:
# identify outliers using IQR
trainPipe = pipeline.Pipeline([
    ("outlier",train.OutlierIQR(
                outlierCount=2,
                iqrStep=1.5,
                features=nonNullNumCol,
                dropOutliers=False,))
    ])
train.data = trainPipe.transform(train.data)

# capture outliers
iqrOutliers = np.array(sorted(trainPipe.named_steps["outlier"].outliers_))
print(iqrOutliers)

In [None]:
# identify outliers using Isolation Forest
clf = ensemble.IsolationForest(
    behaviour="new", max_samples=train.data.shape[0], random_state=0, contamination=0.02
)
clf.fit(train.data[nonNullNumCol])
preds = clf.predict(train.data[nonNullNumCol])

# evaluate index values
mask = np.isin(preds, -1)
ifOutliers = np.array(train.data[mask].index)
print(ifOutliers)

In [None]:
# identify outliers using extended isolation forest
trainPipe = pipeline.Pipeline([
    ("outlier",train.ExtendedIsoForest(
                cols=nonNullNumCol,
                nTrees=100,
                sampleSize=int(np.ceil(train.data.shape[0] * .25)),
                ExtensionLevel=1,
                anomaliesRatio=0.03,
                dropOutliers=False,))
    ])
train.data = trainPipe.transform(train.data)

# capture outliers
eifOutliers = np.array(sorted(trainPipe.named_steps["outlier"].outliers_))
print(eifOutliers)

In [None]:
# identify outliers that are identified in multiple algorithms
outliers = reduce(np.intersect1d, (iqrOutliers, ifOutliers, eifOutliers))
# outliers = reduce(np.intersect1d, (ifOutliers, eifOutliers))
print(outliers)

In [None]:
# review outlier identification summary
outlierSummary = train.outlierSummary(iqrOutliers=iqrOutliers,
                                      ifOutliers=ifOutliers,
                                      eifOutliers=eifOutliers)
outlierSummary

##### Validation

##### Remove

In [None]:
# # remove outliers from predictors and response
# outliers = np.array([59,121])
# train.data = train.data.drop(outliers)
# train.target = train.target.drop(index=outliers)

# Data evaluation

<a id = 'Data-evaluation'></a>

In [None]:
# scale features
trainPipe = pipeline.Pipeline([
        ("scale", train.Robust(cols="non-binary")),
    ])
train.data = trainPipe.transform(train.data)

In [None]:
# scale and sync
validPipe = pipeline.Pipeline([
        ("scale",valid.Robust(cols="non-binary",train=False,trainValue=trainPipe.named_steps["scale"].trainValue_)),
    ])
valid.data = validPipe.transform(valid.data)
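
The two cells above follow the usual rule of learning scaling statistics from the training data and reusing them on the validation data. A minimal scikit-learn sketch of that rule, assuming `Robust` behaves like `RobustScaler` (centering on the median and dividing by the interquartile range); the values are made up:

```python
import pandas as pd
from sklearn.preprocessing import RobustScaler

# toy stand-ins for one column of train.data and valid.data
X_train = pd.DataFrame({"Proline": [500.0, 700.0, 900.0, 1100.0, 1300.0]})
X_valid = pd.DataFrame({"Proline": [900.0, 1500.0]})

scaler = RobustScaler()

# median and IQR are learned from the training data only
train_scaled = pd.DataFrame(
    scaler.fit_transform(X_train), columns=X_train.columns, index=X_train.index
)

# the same statistics are reused for the validation data
valid_scaled = pd.DataFrame(
    scaler.transform(X_valid), columns=X_valid.columns, index=X_valid.index
)

print(scaler.center_, scaler.scale_)
print(valid_scaled)
```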

## Feature importance

<a id = 'Feature-importance'></a>

In [None]:
# generate feature importance summary
estimators = [
    "lightgbm.LGBMClassifier",
    "ensemble.RandomForestClassifier",
    "ensemble.GradientBoostingClassifier",
    "ensemble.ExtraTreesClassifier",
    "ensemble.AdaBoostClassifier",
    "xgboost.XGBClassifier",
]

featureSummary = train.featureSelectorSummary(estimators=estimators)

In [None]:
# calculate cross-validation performance
estimators = [
    "svm.SVC",
    "lightgbm.LGBMClassifier",
    "linear_model.LogisticRegression",
    "xgboost.XGBClassifier",
    "ensemble.RandomForestClassifier",
    "ensemble.GradientBoostingClassifier",
    "ensemble.AdaBoostClassifier",
    "ensemble.ExtraTreesClassifier",
    "neighbors.KNeighborsClassifier",
]

cvSummary = train.featureSelectorCrossVal(
    estimators=estimators,
    featureSummary=featureSummary,
    metrics=["accuracy","f1_macro"],
    nFolds=8,
    step=1
)

###### Accuracy

In [None]:
# visualize CV performance for diminishing feature set
train.featureSelectorResultsPlot(
    cvSummary=cvSummary,
    featureSummary=featureSummary,
    metric="accuracy",
    showFeatures=True,
    titleScale=0.8,
)

In [None]:
df = train.featuresUsedSummary(
    cvSummary=cvSummary, metric="accuracy", featureSummary=featureSummary
)
df

In [None]:
# list features that were used by at least 7 models
df[df["count"] >= 7].index

###### F1 macro

In [None]:
# visualize CV performance for diminishing feature set
train.featureSelectorResultsPlot(
    cvSummary=cvSummary,
    featureSummary=featureSummary,
    metric="f1_macro",
    showFeatures=True,
    titleScale=0.8,
)

In [None]:
df = train.featuresUsedSummary(
    cvSummary=cvSummary, metric="f1_macro", featureSummary=featureSummary
)
df

In [None]:
# list features that were used by at least 7 models
df[df["count"] >= 7].index

## Rationality

<a id = 'Rationality'></a>

In [None]:
# percent difference summary
dfDiff = abs(
    (
        ((valid.data.describe() + 1) - (train.data.describe() + 1))
        / (train.data.describe() + 1)
    )
    * 100
)
dfDiff = dfDiff[dfDiff.columns].replace({0: np.nan})
dfDiff[dfDiff < 0] = np.nan
dfDiff = dfDiff.fillna("")
display(dfDiff)
display(train.data[dfDiff.columns].describe())
display(valid.data[dfDiff.columns].describe())

## Value override

<a id = 'Value-override'></a>

In [None]:
# change clearly erroneous value to what it probably was
# exploreValid.data['GarageYrBlt'].replace({2207 : 2007}, inplace = True)

## Continuous feature EDA

<a id = 'Continuous-feature-EDA3'></a>

## Correlation

<a id = 'Correlation3'></a>

In [None]:
# correlation heat map with most highly correlated features relative to the target
p = PrettierPlot()
ax = p.makeCanvas()
p.prettyCorrHeatmapTarget(df=train.data, target=train.target, thresh=0.2, ax=ax)

# Modeling

<a id = 'Modeling'></a>

## Data preparation

<a id = 'Data-preparation-1'></a>

In [None]:
# split dataset into train and validation datasets
dfTrain, dfValid = mlm.trainTestCompile(data=data, targetCol='Class label')

##### Prepare training data

In [None]:
# import training data
train = mlm.Machine(
    data=dfTrain,
    target="Class label",
    targetType="categorical",
)

# remove outliers
outliers = np.array([59,121])
train.data = train.data.drop(outliers)
train.target = train.target.drop(index=outliers)

### pipeline
trainPipe = pipeline.Pipeline([
        ("coerce",train.NumericCoercer()),
        ("skew",train.DualTransformer(cols=train.featureByDtype["continuous"])),
        ("scale", train.Robust(cols="non-binary")),
    ])
train.data = trainPipe.transform(train.data)

# ['Flavanoids', 'Proline', 'OD280/OD315 of diluted wines',
#        'Color intensity', 'Flavanoids_bc', 'Flavanoids_yj', 'Hue',
#        'Color intensity_bc', 'OD280/OD315 of diluted wines_bc',
#        'Color intensity_yj', 'OD280/OD315 of diluted wines_yj', 'Proline_bc',
#        'Hue_bc', 'Alcohol']

# drop features
print('completed')

##### Prepare validation data

In [None]:
### import valid data
valid = mlm.Machine(
    data=dfValid,
    target="Class label",
    targetType="categorical",
)

### pipeline
# step names must be unique within a scikit-learn Pipeline
validPipe = pipeline.Pipeline([
        ("syncInitial", valid.FeatureSync(trainCols=train.data.columns)),
        ("coerce",valid.NumericCoercer()),
        ("skew",valid.DualTransformer(train=False, yjLambdasDict=trainPipe.named_steps["skew"].yjLambdasDict_,
            bcLambdasDict=trainPipe.named_steps["skew"].bcLambdasDict_, bcP1LambdasDict=trainPipe.named_steps["skew"].bcP1LambdasDict_)),
        ("syncFinal", valid.FeatureSync(trainCols=train.data.columns)),
        ("scale",valid.Robust(cols="non-binary",train=False,trainValue=trainPipe.named_steps["scale"].trainValue_)),
    ])
valid.data = validPipe.transform(valid.data)
print('completed')

## Bayesian hyper-parameter optimization

<a id = 'Bayesian-hyper-parameter-optimization'></a>

In [None]:
# model/parameter space
allSpace = {
    "lightgbm.LGBMClassifier": {
        "class_weight": hp.choice("class_weight", [None, "balanced"]),
        "colsample_bytree": hp.uniform("colsample_bytree", 0.5, 1.0),
        "boosting_type": hp.choice("boosting_type", ["gbdt", "dart", "goss"])
        # ,'boosting_type': hp.choice('boosting_type'
        #                    ,[{'boosting_type': 'gbdt', 'subsample': hp.uniform('gdbt_subsample', 0.5, 1)}
        #                    ,{'boosting_type': 'dart', 'subsample': hp.uniform('dart_subsample', 0.5, 1)}
        #                    ,{'boosting_type': 'goss', 'subsample': 1.0}])
        ,
        "learning_rate": hp.uniform("learning_rate", 0.000001, 0.2),
        "max_depth": hp.choice("max_depth", np.arange(2, 20, dtype=int)),
        "min_child_samples": hp.uniform("min_child_samples", 20, 500),
        "n_estimators": hp.choice("n_estimators", np.arange(100, 10000, 10, dtype=int)),
        "num_leaves": hp.uniform("num_leaves", 8, 150),
        "reg_alpha": hp.uniform("reg_alpha", 0.0, 1.0),
        "reg_lambda": hp.uniform("reg_lambda", 0.0, 1.0),
        "subsample_for_bin": hp.uniform("subsample_for_bin", 20000, 400000),
    },
    "linear_model.LogisticRegression": {
        "C": hp.loguniform("C", np.log(0.001), np.log(0.2)),
        "penalty": hp.choice("penalty", ["l1", "l2"]),
    },
    "xgboost.XGBClassifier": {
        "colsample_bytree": hp.uniform("colsample_bytree", 0.5, 1.0),
        "gamma": hp.uniform("gamma", 0.0, 10),
        "learning_rate": hp.uniform("learning_rate", 0.000001, 0.2),
        "max_depth": hp.choice("max_depth", np.arange(2, 20, dtype=int)),
        "min_child_weight": hp.uniform("min_child_weight", 1, 20),
        "n_estimators": hp.choice("n_estimators", np.arange(100, 10000, 10, dtype=int)),
        "subsample": hp.uniform("subsample", 0.5, 1),
    },
    "ensemble.RandomForestClassifier": {
        "bootstrap": hp.choice("bootstrap", [True, False]),
        "max_depth": hp.choice("max_depth", np.arange(2, 20, dtype=int)),
        "n_estimators": hp.choice("n_estimators", np.arange(100, 10000, 10, dtype=int)),
        "max_features": hp.choice("max_features", ["auto", "sqrt"]),
        "min_samples_split": hp.choice(
            "min_samples_split", np.arange(2, 40, dtype=int)
        ),
        "min_samples_leaf": hp.choice("min_samples_leaf", np.arange(2, 40, dtype=int)),
    },
    "ensemble.AdaBoostClassifier": {
        "n_estimators": hp.choice("n_estimators", np.arange(100, 10000, 10, dtype=int)),
        "learning_rate": hp.uniform("learning_rate", 0.000001, 0.2),
        "algorithm": hp.choice("algorithm", ["SAMME", "SAMME.R"]),
    },
    "ensemble.ExtraTreesClassifier": {
        "n_estimators": hp.choice("n_estimators", np.arange(100, 10000, 10, dtype=int)),
        "max_depth": hp.choice("max_depth", np.arange(2, 20, dtype=int)),
        "min_samples_split": hp.choice(
            "min_samples_split", np.arange(2, 40, dtype=int)
        ),
        "min_samples_leaf": hp.choice("min_samples_leaf", np.arange(2, 40, dtype=int)),
        "max_features": hp.choice("max_features", ["auto", "sqrt"]),
        "criterion": hp.choice("criterion", ["gini", "entropy"]),
    },
    "svm.SVC": {
        "C": hp.uniform("C", 0.00001, 10),
        "decision_function_shape": hp.choice("decision_function_shape", ["ovo", "ovr"]),
        "gamma": hp.uniform("gamma", 0.00001, 10),
    },
    "neighbors.KNeighborsClassifier": {
        "algorithm": hp.choice("algorithm", ["auto", "ball_tree", "kd_tree", "brute"]),
        "n_neighbors": hp.choice("n_neighbors", np.arange(1, 20, dtype=int)),
        "weights": hp.choice("weights", ["distance", "uniform"]),
    },
}
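
The search space mixes `hp.uniform` and `hp.loguniform` priors. As a reminder of the difference, `hp.loguniform(label, low, high)` draws a uniform value between `low` and `high` and exponentiates it, so the bounds are given in log space and small values are sampled as often as large ones on a log scale. The NumPy sketch below reproduces the prior used for the logistic regression `C` parameter without calling hyperopt.

```python
import numpy as np

rng = np.random.default_rng(0)
low, high = np.log(0.001), np.log(0.2)

# equivalent of hp.loguniform("C", low, high): exponentiate a uniform draw
c_samples = np.exp(rng.uniform(low, high, size=10_000))

# for comparison, a plain uniform prior over the same range of C
u_samples = rng.uniform(0.001, 0.2, size=10_000)

print("loguniform median:", np.median(c_samples))
print("uniform median:", np.median(u_samples))
```

The log-uniform median sits near the geometric mean of the bounds (about 0.014), while the uniform median sits near the arithmetic midpoint (about 0.1), which is why log-uniform priors are the usual choice for regularization strengths.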

In [None]:
# execute Bayesian hyper-parameter optimization search
analysis = "wine"
train.execBayesOptimSearch(
    allSpace=allSpace,
    resultsDir="{}_hyperopt_{}.csv".format(rundate, analysis),
    X=train.data,
    y=train.target,
    scoring="accuracy",
    nFolds=2,
    nJobs=3,
    iters=8,
    verbose=0,
)

### Model loss by iteration

<a id = 'Model-loss-by-iteration'></a>

In [None]:
# read scores summary table
analysis = "wine"
rundate = '20190808'
bayesOptimSummary = pd.read_csv("{}_hyperopt_{}.csv".format(rundate, analysis), na_values="nan")
bayesOptimSummary[:5]

In [None]:
# model loss plot
for estimator in np.unique(bayesOptimSummary["estimator"]):
    train.modelLossPlot(bayesOptimSummary=bayesOptimSummary, estimator=estimator)

### Parameter selection by iteration

<a id = 'Parameter-selection-by-iteration'></a>

In [None]:
# estimator parameter plots
for estimator in np.unique(bayesOptimSummary['estimator']):
    train.modelParamPlot(bayesOptimSummary = bayesOptimSummary,
                         estimator=estimator,
                         allSpace=allSpace,
                         nIter=100,
                         chartProp=15)

In [None]:
sampleSpace = {
                'param': hp.loguniform('param', np.log(0.4), np.log(0.6))
#     "": 0.000001 + hp.uniform("gamma", 0.000001, 10)
    #             'param2': hp.loguniform('param2', np.log(0.001), np.log(0.01))
}

train.samplePlot(sampleSpace, 1000)

## Model performance evaluation - standard models

<a id = 'Model-performance-evaluation-standard-models'></a>

In [None]:
topModels = train.topBayesOptimModels(bayesOptimSummary=bayesOptimSummary, numModels=1)
topModels

In [None]:
# classification panel, single model
estimator = "svm.SVC"; modelIter = 66
# estimator = 'ensemble.GradientBoostingClassifier'; modelIter = 590
# estimator = 'xgboost.XGBClassifier'; modelIter = 380

model = train.BayesOptimModelBuilder(
    bayesOptimSummary=bayesOptimSummary, estimator=estimator, modelIter=modelIter
)

train.classificationPanel(
    model=model, XTrain=train.data, yTrain=train.target, labels=np.unique(train.target).tolist(), nFolds=4
)

In [None]:
# create classification reports
for estimator, modelIters in topModels.items():
    for modelIter in modelIters:
        model = train.BayesOptimModelBuilder(
            bayesOptimSummary=bayesOptimSummary,
            estimator=estimator,
            modelIter=modelIter,
        )
        train.classificationPanel(
            model=model, XTrain=train.data, yTrain=train.target, labels=np.unique(train.target).tolist(), nFolds=4
        )

## Validation set evaluation - standard models

<a id = 'Validation-set-evaluation-standard-models'></a>

In [None]:
## standard model fit and predict
# select estimator and iteration
# estimator = "lightgbm.LGBMClassifier"; modelIter = 476
estimator = "xgboost.XGBClassifier"; modelIter = 418
# estimator = "ensemble.RandomForestClassifier"; modelIter = 382
# estimator = "ensemble.GradientBoostingClassifier"; modelIter = 238
# estimator = "svm.SVC"; modelIter = 135

# extract params and instantiate model
model = train.BayesOptimModelBuilder(
    bayesOptimSummary=bayesOptimSummary, estimator=estimator, modelIter=modelIter
)

# classification panel for validation data
train.classificationPanel(
    model=model,
    XTrain=train.data,
    yTrain=train.target,
    XValid=valid.data,
    yValid=valid.target,
    labels=np.unique(train.target).tolist(),
)

In [None]:
# create classification reports
for estimator, modelIters in topModels.items():
    for modelIter in modelIters:
        model = train.BayesOptimModelBuilder(
            bayesOptimSummary=bayesOptimSummary,
            estimator=estimator,
            modelIter=modelIter,
        )
        train.classificationPanel(
            model=model,
            XTrain=train.data,
            yTrain=train.target,
            XValid=valid.data,
            yValid=valid.target,
            labels=np.unique(train.target).tolist(),
        )

## Model explainability

<a id = 'Model-explanability'></a>

In [None]:
# build and fit the model used in the explainability analysis below
estimator = "ensemble.ExtraTreesClassifier"; modelIter = 145

modelE = train.BayesOptimModelBuilder(
    bayesOptimSummary=bayesOptimSummary, estimator=estimator, modelIter=modelIter
)

modelE.fit(train.data.values, train.target.values)

### Permutation importance

<a id = 'Permutation-importance'></a>

In [None]:
# permutation importance - how much does performance decrease when shuffling a certain feature?
featureNames = train.data.columns.tolist()
perm = PermutationImportance(modelE.model, random_state=1).fit(train.data, train.target)
eli5.show_weights(perm, feature_names=featureNames)
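
Permutation importance can also be computed with scikit-learn's own `permutation_importance`, which may be convenient where eli5 is unavailable. The sketch below is a stand-alone illustration on scikit-learn's bundled wine data with an untuned `ExtraTreesClassifier`; neither the model settings nor the data split match the tuned model above.

```python
import pandas as pd
from sklearn.datasets import load_wine
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.inspection import permutation_importance

wine = load_wine()
X = pd.DataFrame(wine.data, columns=wine.feature_names)
y = wine.target

# untuned model, used only to demonstrate the technique
model = ExtraTreesClassifier(n_estimators=200, random_state=0).fit(X, y)

# shuffle each feature 10 times and record the mean drop in accuracy
result = permutation_importance(model, X, y, n_repeats=10, random_state=1)
importances = pd.Series(result.importances_mean, index=X.columns).sort_values(
    ascending=False
)
print(importances.head())
```

As in the cell above, the importances here are measured on the data the model was fit on; scoring on held-out data generally gives a more honest picture of which features the model relies on.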

### Partial plots

<a id = 'Partial-plots'></a>

In [None]:
for feature in featureNames:
    pdpFeature = pdp.pdp_isolate(
        model=modelE.model, dataset=train.data, model_features=featureNames, feature=feature
    )

    pdp.pdp_plot(pdpFeature, feature)
    plt.rcParams["axes.facecolor"] = "white"
    plt.rcParams["figure.facecolor"] = "white"

    plt.grid(b=None)
    plt.show()

### SHAP values

<a id = 'SHAP-values'></a>

##### Force plots - single observations

In [None]:
for i in np.arange(0, 4):
    train.singleShapVizTree(obsIx=i, model=modelE, data=train.data)

##### Force plots - multiple observations

In [None]:
visual = train.multiShapVizTree(obsIxs=np.arange(0, train.data.shape[0]), model=modelE, data=train.data)
visual

##### Dependence plots

In [None]:
obsData, _, obsShapValues = train.multiShapValueTree(
    obsIxs=np.arange(0, train.data.shape[0]), model=modelE, data=train.data
)
# example feature pair; any two columns of train.data can be used
train.shapDependencePlot(
    obsData=obsData,
    obsShapValues=obsShapValues,
    scatterFeature="Proline",
    colorFeature="Alcohol",
    featureNames=train.data.columns.tolist(),
)

In [None]:
obsData, _, obsShapValues = train.multiShapValueTree(
    obsIxs=np.arange(0, train.data.shape[0]), model=modelE, data=train.data
)
featureNames = train.data.columns.tolist()
topShap = np.argsort(-np.sum(np.abs(obsShapValues), 0))

# generate a dependence plot per feature, ordered by total absolute SHAP value
for topIx in topShap:
    train.shapDependencePlot(
        obsData=obsData,
        obsShapValues=obsShapValues,
        scatterFeature=featureNames[topIx],
        colorFeature="Alcohol",
        featureNames=featureNames,
    )

##### Summary plots

In [None]:
obsData, _, obsShapValues = train.multiShapValueTree(
    obsIxs=np.arange(0, train.data.shape[0]), model=modelE, data=train.data
)
featureNames = train.data.columns.tolist()
train.shapSummaryPlot(
        obsData=obsData,
        obsShapValues=obsShapValues,
        featureNames=featureNames,
    )

# Stacking

<a id = 'Stacking'></a>

## Primary models

<a id = 'Primary-models'></a>

In [None]:
# get out-of-fold predictions
oofTrain, oofValid, columns = train.modelStacker(
    models=topModels,
    bayesOptimSummary=bayesOptimSummary,
    XTrain=train.data.values,
    yTrain=train.target.values,
    XValid=valid.data.values,
    nFolds=10,
    nJobs=10,
)
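
Out-of-fold predictions are what make stacking safe: each training row is predicted by a model that never saw that row, so the meta model is not trained on leaked labels. The sketch below shows the idea with scikit-learn's `cross_val_predict` on the bundled wine data. It is not the `modelStacker` implementation, and the two primary models and their settings are placeholders rather than the tuned models selected above.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_predict, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# placeholder primary models
primary_models = {
    "rf": RandomForestClassifier(n_estimators=100, random_state=0),
    "lr": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
}
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

oof_train, oof_valid = [], []
for name, model in primary_models.items():
    # each training row is predicted by a fold model that never saw it
    oof_train.append(
        cross_val_predict(model, X_train, y_train, cv=cv, method="predict_proba")
    )
    # validation features come from the model refit on all training rows
    oof_valid.append(model.fit(X_train, y_train).predict_proba(X_valid))

oof_train = np.hstack(oof_train)
oof_valid = np.hstack(oof_valid)

# the meta model learns from the primary models' class probabilities
meta = LogisticRegression(max_iter=1000).fit(oof_train, y_train)
print("meta model validation accuracy:", meta.score(oof_valid, y_valid))
```

With three classes and two primary models, the meta model sees six columns. Highly correlated columns add little, which is what the correlation heat map in the next cell checks.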

In [None]:
# view correlations of predictions
p = PrettierPlot()
ax = p.makeCanvas()
p.prettyCorrHeatmap(
    df=pd.DataFrame(oofTrain, columns=columns), annot=True, ax=ax, vmin=0
)

## Meta model

<a id = 'Meta-model'></a>

In [None]:
# parameter space
allSpace = {
    "lightgbm.LGBMClassifier": {
        "class_weight": hp.choice("class_weight", [None]),
        "colsample_bytree": hp.uniform("colsample_bytree", 0.4, 0.7),
        "boosting_type": hp.choice("boosting_type", ["dart"]),
        "subsample": hp.uniform("subsample", 0.5, 1),
        "learning_rate": hp.uniform("learning_rate", 0.15, 0.25),
        "max_depth": hp.choice("max_depth", np.arange(4, 20, dtype=int)),
        "min_child_samples": hp.quniform("min_child_samples", 50, 150, 5),
        "n_estimators": hp.choice("n_estimators", np.arange(100, 4000, 10, dtype=int)),
        "num_leaves": hp.quniform("num_leaves", 30, 70, 1),
        "reg_alpha": hp.uniform("reg_alpha", 0.75, 1.25),
        "reg_lambda": hp.uniform("reg_lambda", 0.0, 1.0),
        "subsample_for_bin": hp.quniform("subsample_for_bin", 100000, 350000, 20000),
    },
    "xgboost.XGBClassifier": {
        "colsample_bytree": hp.uniform("colsample_bytree", 0.4, 0.7),
        "gamma": hp.quniform("gamma", 0.0, 10, 0.05),
        "learning_rate": hp.quniform("learning_rate", 0.01, 0.2, 0.01),
        "max_depth": hp.choice("max_depth", np.arange(2, 15, dtype=int)),
        "min_child_weight": hp.quniform("min_child_weight", 2.5, 7.5, 1),
        "n_estimators": hp.choice("n_estimators", np.arange(100, 4000, 10, dtype=int)),
        "subsample": hp.uniform("subsample", 0.4, 0.7),
    },
    "ensemble.RandomForestClassifier": {
        "bootstrap": hp.choice("bootstrap", [True, False]),
        "max_depth": hp.choice("max_depth", np.arange(2, 10, dtype=int)),
        "n_estimators": hp.choice("n_estimators", np.arange(100, 8000, 10, dtype=int)),
        "max_features": hp.choice("max_features", ["sqrt"]),
        "min_samples_split": hp.choice(
            "min_samples_split", np.arange(15, 25, dtype=int)
        ),
        "min_samples_leaf": hp.choice("min_samples_leaf", np.arange(2, 20, dtype=int)),
    },
    "ensemble.GradientBoostingClassifier": {
        "n_estimators": hp.choice("n_estimators", np.arange(100, 4000, 10, dtype=int)),
        "max_depth": hp.choice("max_depth", np.arange(2, 11, dtype=int)),
        "max_features": hp.choice("max_features", ["sqrt"]),
        "learning_rate": hp.quniform("learning_rate", 0.01, 0.09, 0.01),
        # note: scikit-learn 1.1 renamed "deviance" to "log_loss"; "deviance" was removed in 1.3
        "loss": hp.choice("loss", ["deviance", "exponential"]),
        "min_samples_split": hp.choice(
            "min_samples_split", np.arange(2, 40, dtype=int)
        ),
        "min_samples_leaf": hp.choice("min_samples_leaf", np.arange(2, 40, dtype=int)),
    },
    "svm.SVC": {
        "C": hp.uniform("C", 0.00000001, 15),
        "decision_function_shape": hp.choice("decision_function_shape", ["ovr", "ovo"]),
        "gamma": hp.uniform("gamma", 0.00000001, 1.5),
    },
}

In [None]:
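# execute Bayesian hyper-parameter optimization search for the meta model
# note: rundate and analysis must already be defined when this cell runs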
train.execBayesOptimSearch(
    allSpace=allSpace,
    resultsDir="{}_hyperopt_meta_{}.csv".format(rundate, analysis),
    X=oofTrain,
    y=train.target,
    scoring="accuracy",
    nFolds=8,
    nJobs=10,
    iters=1000,
    verbose=0,
)

In [None]:
# read scores summary table
analysis = "wine"
rundate = "20190807"
bayesOptimSummaryMeta = pd.read_csv("{}_hyperopt_meta_{}.csv".format(rundate, analysis))
bayesOptimSummaryMeta[:5]

In [None]:
# model loss plot
for estimator in np.unique(bayesOptimSummaryMeta["estimator"]):
    train.modelLossPlot(bayesOptimSummary=bayesOptimSummaryMeta, estimator=estimator)

In [None]:
# estimator parameter plots
for estimator in np.unique(bayesOptimSummaryMeta["estimator"]):
    train.modelParamPlot(
        bayesOptimSummary=bayesOptimSummaryMeta,
        estimator=estimator,
        allSpace=allSpace,
        nIter=100,
        chartProp=15,
    )

## Model performance evaluation - stacked models

<a id = 'Model-performance-evaluation-stacked-models'></a>

In [None]:
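# select the top-ranked meta model(s) from the search results
# note: this overwrites the topModels used for the primary models above
topModels = train.topBayesOptimModels(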
    bayesOptimSummary=bayesOptimSummaryMeta, numModels=1
)
topModels

In [None]:
# best second level learning model
estimator = "lightgbm.LGBMClassifier"; modelIter = 668
# estimator = "xgboost.XGBClassifier"; modelIter = 380
# estimator = "ensemble.RandomForestClassifier"; modelIter = 411
# estimator = "ensemble.GradientBoostingClassifier"; modelIter = 590
# estimator = "svm.SVC"; modelIter = 135

# extract params and instantiate model
model = train.BayesOptimModelBuilder(
    bayesOptimSummary=bayesOptimSummaryMeta, estimator=estimator, modelIter=modelIter
)
train.classificationPanel(
    model=model, XTrain=oofTrain, yTrain=train.target, labels=[0, 1]
)

In [None]:
# create classification reports
for estimator, modelIters in topModels.items():
    for modelIter in modelIters:
        model = train.BayesOptimModelBuilder(
            bayesOptimSummary=bayesOptimSummaryMeta,
            estimator=estimator,
            modelIter=modelIter,
        )
        train.classificationPanel(
            model=model, XTrain=oofTrain, yTrain=train.target, labels=[0, 1], nFolds=4
        )

## Validation set evaluation - stacked models

<a id = 'Validation-set-evaluation-stacked-models'></a>

In [None]:
# stacked (meta) model fit and predict
# select estimator and iteration
estimator = "lightgbm.LGBMClassifier"; modelIter = 668
# estimator = "xgboost.XGBClassifier"; modelIter = 380
# estimator = "ensemble.RandomForestClassifier"; modelIter = 411
# estimator = "ensemble.GradientBoostingClassifier"; modelIter = 590
# estimator = "svm.SVC"; modelIter = 135

# extract params and instantiate model
model = train.BayesOptimModelBuilder(
    bayesOptimSummary=bayesOptimSummaryMeta, estimator=estimator, modelIter=modelIter
)

# fit the meta model on the out-of-fold training predictions
model.fit(oofTrain, train.target.values)

# predict on the validation-set predictions from the primary models
yPred = model.predict(oofValid)
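
`yPred` can also be scored directly with scikit-learn, independently of `classificationPanel`. The sketch below uses small made-up label arrays in place of `valid.target` and `yPred`, so the numbers are illustrative only.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# made-up labels standing in for valid.target and yPred
y_true = np.array([0, 1, 1, 0, 1, 0, 1, 1])
y_pred = np.array([0, 1, 0, 0, 1, 1, 1, 1])

# fraction of validation rows classified correctly
acc = accuracy_score(y_true, y_pred)

# rows are true classes, columns are predicted classes, in the order given by labels
cm = confusion_matrix(y_true, y_pred, labels=[0, 1])

print(acc)  # 0.75
print(cm)   # [[2 1]
            #  [1 4]]
```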

In [None]:
train.classificationPanel(
    model=model,
    XTrain=oofTrain,
    yTrain=train.target,
    XValid=oofValid,
    yValid=valid.target,
    labels=[0, 1],
)

In [None]:
# create classification reports
for estimator, modelIters in topModels.items():
    for modelIter in modelIters:
        model = train.BayesOptimModelBuilder(
            bayesOptimSummary=bayesOptimSummaryMeta,
            estimator=estimator,
            modelIter=modelIter,
        )
        train.classificationPanel(
            model=model,
            XTrain=oofTrain,
            yTrain=train.target,
            XValid=oofValid,
            yValid=valid.target,
            labels=[0, 1],
        )