# HOW TO USE OUT OF FOLD PREDICTIONS IN MACHINE LEARNING

Machine Learning Algorithms are typically evaluated using *resampling techniques* such as **K-fold cross-validation** 

During cross-validation process, predictions are made on test sets comprised of data not used to train the model. These predictions are referred to as **out-of-fold predictions**, a type of out-of-sample predictions

Out-of-fold predictions play an important role in ML in both estimating the performance of a model when making predictions on new data in the future, so-called the generalization performance of the model and in the development of ensemble models

<li>Out-of-fold predictions are a type of out-of-sample predictions made on data not used to train a model</li>
<li>Out-of-fold predictions are mostly used to estimate the performance of the of a model when making predictions on an unseen data</li>
<li>Out-of-fold predictions can be used to construct an ensemble model called a stacked generalization or stacking ensemble</li>

Two main uses for **out-of-fold predictions**

<li>Estimate the performance of the model on unseen data</li>
<li>Fit an ensemble model</li>

## Out-of-fold Predictions for evaluation

In [1]:
# The example below prepares a data sample and summarizes the shape of the 
# input and output elements of the dataset
from sklearn.datasets import make_blobs

# create the inputs and outputs
X, y = make_blobs(n_samples=1000, centers=2, n_features=100, cluster_std=20)
print(X.shape, y.shape)

(1000, 100) (1000,)


In [10]:
# Evaluate model by averaging performance across each fold
import numpy as np
from tqdm.auto import tqdm
from sklearn.model_selection import KFold
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

data_y, data_yhat = list(), list()
kfold = KFold(n_splits=10, shuffle=True)
# enumerate splits
for fold, (tr_idx, test_idx) in enumerate(tqdm(kfold.split(X), total=10)):
    # get the data
    train_X, test_X = X[tr_idx], X[test_idx]
    train_y, test_y = y[tr_idx], y[test_idx]
    # fit the model
    model = KNeighborsClassifier()
    model.fit(train_X, train_y)
    # evaluate the model
    yhat = model.predict(test_X)
    # store
    data_y.extend(test_y)
    data_yhat.extend(yhat)
    
# summarize the model performance
acc = accuracy_score(data_y, data_yhat)
print(f"Accuracy {acc:.3f}%")

  0%|          | 0/10 [00:00<?, ?it/s]

Accuracy 0.941%


## Out-of-fold Predictions for Ensembles

An ensemble is a machine learning model that combines the predictions from two or more models prepared on the same training dataset

Out-of-fold predictions in aggregate provide information about how the model performs on each example in the training dataset when not used to train the model. This information can be used to train a model to correct or improve upon those predictions

First, the k-fold cross-validation procedure is performed on each base model of interest, and each out-of-fold predictions are collected. 
<li><b>Base Models</b>: Models evaluated using k-fold cross-validation on the training dataset and all out-of-fold predictions are retained</li>
<li><b>Meta Models</b>: Model that takes the <em>out-of-fold predictions</em> made by one or more models as input and shows how to best combine and correct the predictions</li>
<li><b>Meta Model Input</b>: Input portion of a given sample concatenated with the predictions made by each base model</li>
<li><b>Meta Model Output</b>: Output portion of a given sample</li>

In [14]:
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression

# create a meta dataset
def create_meta_dataset(data_x, yhat1, yhat2):
    # convert to columns
    yhat1 = np.array(yhat1).reshape((len(yhat1), 1))
    yhat2 = np.array(yhat2).reshape((len(yhat2), 1))
    # stack as separate columns
    meta_X = np.hstack((data_x, yhat1, yhat2))
    return meta_X


# make predictions with stacked model
def stack_prediction(model1, model2, meta_model, X):
    # make predictions
    yhat1 = model1.predict(X)
    yhat2 = model2.predict(X)
    meta_X = create_meta_dataset(X, yhat1, yhat2)
    # predict
    return meta_model.predict(meta_X)


# create the inputs and outputs
X, y = make_blobs(n_samples=1000, centers=2, n_features=100, cluster_std=20)
# split the data
X, X_val, y, y_val = train_test_split(X, y, test_size=0.33)
# collect out of sample predictions
data_x, data_y, knn_yhat, cart_yhat = list(), list(), list(), list()
kfold = KFold(n_splits=10, shuffle=True)
for fold, (tr_idx, test_idx) in enumerate(tqdm(kfold.split(X), total=10)):
    # get data
    train_X, test_X = X[tr_idx], X[test_idx]
    train_y, test_y = y[tr_idx], y[test_idx]
    data_x.extend(test_X)
    data_y.extend(test_y)
    # fit and make predictions with cart
    model1 = DecisionTreeClassifier()
    model1.fit(train_X, train_y)
    yhat1 = model1.predict_proba(test_X)[:, 0]
    cart_yhat.extend(yhat1)
    
    # fit and make predictions with knn
    model2 = KNeighborsClassifier()
    model2.fit(train_X, train_y)
    yhat2 = model2.predict_proba(test_X)[:, 0]
    knn_yhat.extend(yhat2)
    
# construct meta dataset
meta_X = create_meta_dataset(data_x, knn_yhat, cart_yhat)

# fit final ensemble
model1 = DecisionTreeClassifier()
model1.fit(X, y)

model2 = KNeighborsClassifier()
model2.fit(X, y)

# construct meta classifier
meta_model = LogisticRegression(solver='liblinear')
meta_model.fit(meta_X, data_y)

# evaluate sub models on hold out dataset
acc1 = accuracy_score(y_val, model1.predict(X_val))
acc2 = accuracy_score(y_val, model2.predict(X_val))
print(f"Model 1 acc: {acc1:.4f}\nModel 2 acc: {acc2: 4f}")

# evaluate meta model on hold out dataset
yhat = stack_prediction(model1, model2, meta_model, X_val)
acc = accuracy_score(y_val, yhat)
print(f"Meta model accuracy: {acc:.4f}")

  0%|          | 0/10 [00:00<?, ?it/s]

Model 1 acc: 0.7152
Model 2 acc:  0.924242
Meta model accuracy: 0.9576
