# # Building Ensembles with Scikit-Learn and PyTorch
* https://www.youtube.com/watch?v=przbLRCRL24&list=PLjy4p-07OYzuy_lHcRW8lPTLPTTOmUpmi&index=39
* https://github.com/jeffheaton/app_deep_learning/blob/main/t81_558_class_08_2_pytorch_ensembles.ipynb




In [7]:
try:
    from google.colab import drive
    drive.mount('/content/drive')
    COLAB = True
    print("Note: using Google CoLab")
except:
    print("Note: not using Google CoLab")
    COLAB = False

Mounted at /content/drive
Note: using Google CoLab


In [8]:
# Notice formatted time string
def hms_string(sec_elapsed):
    h = int(sec_elapsed / (60 * 60))
    m = int((sec_elapsed % (60 * 60)) / 60)
    s = sec_elapsed % 60
    return f"{h}:{m}:{round(s,1)}"

In [9]:
# Early stopping
import copy
class EarlyStopping:
    def __init__(self, patience=5, min_delta=0, restore_best_weights=True):
        self.patience = patience
        self.min_delta = min_delta
        self.restore_best_weights = restore_best_weights
        self.best_model = None
        self.best_loss = None
        self.counter = 0
        self.status = ""

    def __call__(self, model, val_loss):
        if self.best_model is None:
            self.best_model = copy.deepcopy(model.state_dict())
            self.best_loss = val_loss
        elif self.best_loss - val_loss > self.min_delta:
            self.best_model = copy.deepcopy(model.state_dict())
            self.best_loss = val_loss
            self.counter = 0
            self.status = f"Improvement found, counter reset to {self.counter}"
        else:
            self.counter += 1
            self.status = f"No improvement in the last {self.counter} epochs"
            if self.counter >= self.patience:
                self.status = f"Early stopping triggered after {self.counter} epochs."
                if self.restore_best_weights:
                    model.load_state_dict(self.best_model)
                return True
        return False


# Make use of a GPU or cpu
import torch
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
print(device)

cpu


## Evaluating Feature Importance
Feature importance tells us how important each feature (from the feature/import vector) is to predicting a neural network or another model. There are many different ways to evaluate the feature importance of neural networks. The following paper presents an excellent (and readable) overview of the varios means of assessing the significance of neural network inputs/features.
    * An accurate comparison of methods for quantifying variable importance in artificial neural networks using simulated data [http://depts.washington.edu/oldenlab/wordpress/wp-content/uploads/2013/03/EcologicalModelling_2004.pdf]. Ecological Modelling, 178(3), 389-397.

In summary, the following methods are available to neural networks:
* Connection Weights Algorithm
* Partial Derivatives
* Input Perturbation
* Sensitivity Analysis
* Forward Stepwise Addition
* Improved Stepwise Selection 1
* Backward Stepwise Elimination
* Improved Stepwise Selection

For this chapter, we will use the input Perturbation feature ranking algorithm. This algorithm will work with any regression or claasification network. In the next section, I provide an implementation of the input perturbation algorithm for scikit-learn. This code implements a function below that will work with any scikit-learn model.


Leo Breiman provided this algorithm in his seminal paper on random forests. Althourh he presented this algorithm in conjunction with random forests, it is model-independent and appropriate for any supervised learning model. This algorithm, known as the input perturbation algorithm, works by evaluating a trained model's accuracy with each input individually shuffled from a data set.
Shuffling an input causes it to become useless -- effectivvely removing it from the model. More important inputs will produce a less accurate score when they are removed by shuffling them. This process makes sense because important features will contribute to the model's accuracy.
    * Early stabilizing feature importance for TensorFlow deep neural networks [https://www.heatonresearch.com/dload/phd/IJCNN%202017-v2-final.pdf]

This algorithm will use log loss to evaluate a classification problem and RMSE for regression.


In [None]:
from sklearn import metrics
import scipy as sp
import numpy as np
import pandas as pd
import math

import torch
import torch.nn as nn
import torch.nn.functional as F

In [14]:
def perturbation_rank(device, model, x, y, names, regression):
    model.to(device)
    model.eval() # Set the model to evaluation mode

    errors = []

    for i in range(x.shape[1]):
        hold = x[:, i].clone() # 元の列値を保存
        # i列の値を乱数で置き換える
        x[:, i] = torch.randperm(x.shape[0]).to(device) # randperm:整数の順列を生成

        with torch.no_grad():
            pred = model(x)

        if regression:
            loss_fn = nn.MSELoss()
            error = loss_fn(pred, y).item()
        else:
            # pred should be probabilities: apply softmax if not done in model's forward method
            if len(pred.shape) == 2 and pred.shape[1] > 1: # 分類数が2以上ならsoftmax
                pred = F.softmax(pred, dim=1)
                loss_fn = nn.CrossEntropyLoss()
                error = loss_fn(pred, y.long()).item()
            else:
                loss_fn = nn.MSELoss() # Mean Square Loss
                error = loss_fn(pred, y).item()

        errors.append(error)
        x[:, i] = hold

    # feature importanceの算出
    max_error = max(errors)
    importances = [e / max_error for e in errors]

    data = {'name': names, 'error':errors, 'importance':importances}
    result = pd.DataFrame(data, columns=[ 'name', 'error', 'importance'])
    result = result.sort_values(by='importance', ascending=False)
    result.reset_index(inplace=True, drop=True)
    return result

## Classification and Input Perturbation Ranking
We now look at the code to perform perturbation ranking for a classification neural network. The implementation technique is slightly diffferent for classification vs. regression, so I must provide two different implementaions. The primary difference between classification and regression is how we evaluate the accuracy of the neural network in each of these two network types. We will use the Root Mean Square (RMSE) error calculatuion, whereas we will use log loss for classirfication.

In [None]:
import time

import numpy as np
import pandas as pd
import torch
import tqdm
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler
from torch import nn
from torch.autograd import Variable
from torch.utils.data import DataLoader, TensorDataset

In [None]:
from os import XATTR_CREATE
# Set random seed for reproducibility
np.random.seed(42)
torch.manual_seed(42)

def load_data():
    df = pd.read_csv(
        "https://data.heatonresearch.com/data/t81-558/iris.csv", na_values=["NA", "?"]
    )

    le = LabelEncoder() # transform category to numbers

    x = df[["sepal_l", "sepal_w", "petal_l", "petal_w"]].values
    y = le.fit_transform(df["species"])
    species = le.classes_

    # Split into validation and training stes
    x_train, x_test, y_train, y_test = train_test_split(
        x, y, test_size=0.25, random_state=42
    )

    scaler = StandardScaler()
    x_train = scaler.fit_transform(x_train)
    x_test = scaler.transform(x_test)

    # Numpy to Torch Tensor
    x_train = torch.tensor(x_train, device=device, dtype=torch.float32)
    y_train = torch.tensor(y_train, device=device, dtype=torch.long)

    x_test = torch.tensor(x_test, device=device, dtype=torch.float32)
    y_test = torch.tensor(y_test, device=device, dtype=torch.long)

    return x_train, x_test, y_train, y_test, species, df.columns

x_train, x_test, y_train, y_test, species, columns = load_data()
columns = list(columns)
columns.remove("species") # remove the target(y)

In [None]:
# Create datasets
BATCH_SIZE = 16

dataset_train = TensorDataset(x_train, y_train)
dataloader_train = DataLoader(dataset_train, batch_size=BATCH_SIZE, shuffle=True)

dataset_test = TensorDataset(x_test, y_test)
dataloader_test = DataLoader(dataset_test, batch_size=BATCH_SIZE, shuffle=False)

# Create model using nn.Sequenctial
model = nn.Sequential(
    nn.Linear(x_train.shape[1], 50),
    nn.ReLU(),
    nn.Linear(50, 25),
    nn.ReLU(),
    nn.Linear(25, len(species)),
    nn.LogSoftmax(dim=1),
)

model = torch.compile(model, backend="aot_eager").to(device)

# Set loss
loss_fn = nn.CrossEntropyLoss()

# Set optimizer
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)

# Set early stopping
es = EarlyStopping()

epoch = 0
done = False
while epoch < 1000 and not done:
    epoch += 1
    steps = list(enumerate(dataloader_train))
    pbar = tqdm.tqdm(steps)
    model.train()
    for i, (x_batch, y_batch) in pbar:
        y_batch_pred = model(x_batch.to(device))
        loss = loss_fn(y_batch_pred, y_batch.to(device))
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        loss, current = loss.item(), (i + 1) * len(x_batch)
        if i == len(steps) - 1:
            model.eval()
            with torch.no_grad():
                pred = model(x_test.to(device))
                vloss = loss_fn(pred, y_test.to(device))
                if es(model, vloss):
                    done = True
                pbar.set_description(
                    f"Epoch: {epoch}, tloss: {loss}, vloss: {vloss:>7f}, {es.status}"
                )
        else:
            pbar.set_description(f"Epoch: {epoch}, tloss: {loss}")

Epoch: 1, tloss: 0.7962242960929871, vloss: 0.607716, : 100%|██████████| 7/7 [00:00<00:00, 13.25it/s]
Epoch: 2, tloss: 0.2632836103439331, vloss: 0.254798, Improvement found, counter reset to 0: 100%|██████████| 7/7 [00:00<00:00, 143.22it/s]
Epoch: 3, tloss: 0.1994396448135376, vloss: 0.159267, Improvement found, counter reset to 0: 100%|██████████| 7/7 [00:00<00:00, 149.91it/s]
Epoch: 4, tloss: 0.18039251863956451, vloss: 0.096190, Improvement found, counter reset to 0: 100%|██████████| 7/7 [00:00<00:00, 128.36it/s]
Epoch: 5, tloss: 0.0985158309340477, vloss: 0.062836, Improvement found, counter reset to 0: 100%|██████████| 7/7 [00:00<00:00, 137.68it/s]
Epoch: 6, tloss: 0.16301701962947845, vloss: 0.045974, Improvement found, counter reset to 0: 100%|██████████| 7/7 [00:00<00:00, 150.73it/s]
Epoch: 7, tloss: 0.02891465462744236, vloss: 0.035409, Improvement found, counter reset to 0: 100%|██████████| 7/7 [00:00<00:00, 164.05it/s]
Epoch: 8, tloss: 0.030066024512052536, vloss: 0.024942,

Next, we evaluate the accuracy of the trained model. Here we see that the neural network performs great with an accuracy of 1.0. We might fear overfitting with such high accuracy for a more complex dataset. However, for this example, we are more interested in determining the importance of each column.

In [None]:
from sklearn.metrics import accuracy_score

pred = model(x_test)
_, predict_classes = torch.max(pred, dim=1)
print('Accuracy:')
accuracy_score(y_test.cpu().numpy(), predict_classes.cpu().numpy())

Accuracy:


1.0

We are now ready to call the input perturbation algorithm. First, we extract the column names and remove the target column. The target column is not important, as it is the objective, not one of the inputs. In supervised learning, the target is of the utmost importance.


We can see importance displayed in the following table. The most important column is always 1.0, and lessor columns will continue in a downward trend. The least important column will have the lowest rank.

In [None]:
# Rank the features
from IPython.display import display, HTML

rank = perturbation_rank(device, model, x_test, y_test, columns, False)
display(rank)

Unnamed: 0,name,error,importance
0,petal_w,1.205691,1.0
1,petal_l,1.197067,0.992847
2,sepal_w,1.081718,0.897176
3,sepal_l,0.968614,0.803368


## Regression and Input Perturbation Ranking
We now see how to use input perturbation ranking for a regression neural network. we will use the MPG dataset as a demonstration. The code below loads the MPG dataset and creates a regression neural network for this dataset. The code trains the neural network and calculates an RMSE evaluation.

In [4]:
import time

import numpy as np
import pandas as pd
import torch
import torch.nn as nn
import torch.nn.functional as F
import tqdm
from sklearn import preprocessing
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from torch.autograd import Variable
from torch.utils.data import DataLoader, TensorDataset

In [5]:
# Read the MPG dataset.
df = pd.read_csv(
    "https://data.heatonresearch.com/data/t81-558/auto-mpg.csv", na_values=["NA", "?"]
)

cars = df["name"]

# Handle missing value
df["horsepower"] = df["horsepower"].fillna(df["horsepower"].median())
df.head()

Unnamed: 0,mpg,cylinders,displacement,horsepower,weight,acceleration,year,origin,name
0,18.0,8,307.0,130.0,3504,12.0,70,1,chevrolet chevelle malibu
1,15.0,8,350.0,165.0,3693,11.5,70,1,buick skylark 320
2,18.0,8,318.0,150.0,3436,11.0,70,1,plymouth satellite
3,16.0,8,304.0,150.0,3433,12.0,70,1,amc rebel sst
4,17.0,8,302.0,140.0,3449,10.5,70,1,ford torino


In [10]:
# Pandas to Numpy
x = df[
    [
        "cylinders",
        "displacement",
        "horsepower",
        "weight",
        "acceleration",
        "year",
        "origin",
    ]
].values
y = df["mpg"].values  # regression

# Split into validation and training sets
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.25, random_state=42
)

# Numpy to Torch Tensor
x_train = torch.tensor(x_train, device=device, dtype=torch.float32)
y_train = torch.tensor(y_train, device=device, dtype=torch.float32)

x_test = torch.tensor(x_test, device=device, dtype=torch.float32)
y_test = torch.tensor(y_test, device=device, dtype=torch.float32)


# Create datasets
BATCH_SIZE = 16

dataset_train = TensorDataset(x_train, y_train)
dataloader_train = DataLoader(dataset_train, batch_size=BATCH_SIZE, shuffle=True)

dataset_test = TensorDataset(x_test, y_test)
dataloader_test = DataLoader(dataset_test, batch_size=BATCH_SIZE, shuffle=True)

In [12]:
# Create model
model = nn.Sequential(
    nn.Linear(x_train.shape[1], 50),
    nn.ReLU(),
    nn.Linear(50, 25),
    nn.ReLU(),
    nn.Linear(25, 1)
)
model = torch.compile(model, backend="aot_eager").to(device)

# Set loss function for regression
loss_fn = nn.MSELoss()

# Set optimizer
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)

# Set early stopping
es = EarlyStopping()


epoch = 0
done = False
while epoch < 1000 and not done:
    epoch += 1
    steps = list(enumerate(dataloader_train))
    pbar = tqdm.tqdm(steps)
    model.train()
    for i, (x_batch, y_batch) in pbar:
        y_batch_pred = model(x_batch.to(device)).flatten()
        loss = loss_fn(y_batch_pred, y_batch.to(device))
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        loss, current = loss.item(), (i + 1) * len(x_batch)
        if i == len(steps) - 1:
            model.eval()
            with torch.no_grad():
                pred = model(x_test.to(device)).flatten()
                vloss = loss_fn(pred, y_test.to(device))
                if es(model, vloss):
                    done = True
                pbar.set_description(
                    f"Epoch: {epoch}, tloss: {loss}, vloss: {vloss:>7f}, {es.status}"
                )
        else:
            pbar.set_description(f"Epoch: {epoch}, tloss: {loss}")

from sklearn import metrics

# Measure RMSE error.
pred = model(x_test)
score = torch.sqrt(torch.nn.functional.mse_loss(pred.flatten(), y_test))
print(f"RMSE: {score}")

Epoch: 1, tloss: 216.49368286132812, vloss: 456.890594, : 100%|██████████| 19/19 [00:01<00:00, 18.40it/s]
Epoch: 2, tloss: 317.75323486328125, vloss: 242.168182, Improvement found, counter reset to 0: 100%|██████████| 19/19 [00:00<00:00, 99.67it/s] 
Epoch: 3, tloss: 191.02847290039062, vloss: 191.179001, Improvement found, counter reset to 0: 100%|██████████| 19/19 [00:00<00:00, 112.12it/s]
Epoch: 4, tloss: 91.01396179199219, vloss: 132.316315, Improvement found, counter reset to 0: 100%|██████████| 19/19 [00:00<00:00, 35.11it/s]
Epoch: 5, tloss: 230.85031127929688, vloss: 99.294991, Improvement found, counter reset to 0: 100%|██████████| 19/19 [00:00<00:00, 30.25it/s]
Epoch: 6, tloss: 103.84315490722656, vloss: 75.213066, Improvement found, counter reset to 0: 100%|██████████| 19/19 [00:00<00:00, 55.16it/s]
Epoch: 7, tloss: 61.395042419433594, vloss: 71.006981, Improvement found, counter reset to 0: 100%|██████████| 19/19 [00:00<00:00, 122.55it/s]
Epoch: 8, tloss: 108.26927185058594, 

RMSE: 3.049299716949463


Just as before, we extract the column names and discard the target. We can now create a ranking of the importancee of each of the input features. The feature with a ranking of 1.0 is the most important.

In [15]:
# Rank the features
from IPython.display import display, HTML

names = list(df.columns) # x+y column names
names.remove("name")
names.remove("mpg") # remove the target(y)
rank = perturbation_rank(device, model, x_test, y_test, names, True)
display(rank)

  return F.mse_loss(input, target, reduction=self.reduction)


Unnamed: 0,name,error,importance
0,weight,618.877869,1.0
1,origin,545.104797,0.880795
2,year,313.514557,0.506586
3,cylinders,160.791,0.259811
4,acceleration,152.779205,0.246865
5,displacement,115.985611,0.187413
6,horsepower,101.247238,0.163598


## Biological Response with Neural Network
The following sections will demonstrate how to use feature importance ranking and ensumbling with a more complex dataset. Ensumbling is the process where you combine multiple models for greater accuracy. Kaggle competition winners frequently make use of ensumbling for high-ranking solution.


We will use the biological response dataset, a Kaggle dataset, where there is an unusually high number of columns. **Because of the large number of columns, it is essential to use feature ranking to determine the importance of these columns.** We begin by loading the dataset and preprocessing. This Kaggle dataset is a binary classification problem. You must predict if certain conditions will cause a biological response.

In [16]:
import pandas as pd
import numpy as np
from sklearn import metrics
from scipy.stats import zscore
from sklearn.model_selection import KFold
from IPython.display import HTML, display

URL = "https://data.heatonresearch.com/data/t81-558/kaggle/"

df_train = pd.read_csv(
    URL+"bio_train.csv",
    na_values=['NA', '?'])

df_test = pd.read_csv(
    URL+"bio_test.csv",
    na_values=['NA', '?'])

activity_classes = df_train['Activity']

In [18]:
print(df_train.shape)
display(df_train.head())

(3751, 1777)


Unnamed: 0,Activity,D1,D2,D3,D4,D5,D6,D7,D8,D9,...,D1767,D1768,D1769,D1770,D1771,D1772,D1773,D1774,D1775,D1776
0,1,0.0,0.497009,0.1,0.0,0.132956,0.678031,0.273166,0.585445,0.743663,...,0,0,0,0,0,0,0,0,0,0
1,1,0.366667,0.606291,0.05,0.0,0.111209,0.803455,0.106105,0.411754,0.836582,...,1,1,1,1,0,1,0,0,1,0
2,1,0.0333,0.480124,0.0,0.0,0.209791,0.61035,0.356453,0.51772,0.679051,...,0,0,0,0,0,0,0,0,0,0
3,1,0.0,0.538825,0.0,0.5,0.196344,0.72423,0.235606,0.288764,0.80511,...,0,0,0,0,0,0,0,0,0,0
4,0,0.1,0.517794,0.0,0.0,0.494734,0.781422,0.154361,0.303809,0.812646,...,0,0,0,0,0,0,0,0,0,0


A large number of columns is evident when we display the shape of the dataset.



The following code constructs a classification neural network and trains it for the biological response dataset. Once trained, the accuracy is measured.

In [19]:
import os
import pandas as pd
import torch
import torch.nn as nn
import torch.optim as optim
from sklearn.model_selection import train_test_split
import numpy as np
import sklearn
from sklearn import metrics
from torch.utils.data import DataLoader, TensorDataset

In [22]:
# Assuming df_train and df_test are predefined
x_columns = df_train.columns.drop('Activity')
x = torch.tensor(df_train[x_columns].values, dtype=torch.float32)
y = torch.tensor(df_train['Activity'].values, dtype=torch.float32).view(-1, 1) # For binary cross entropy
x_submit = torch.tensor(df_test[x_columns].values, dtype=torch.float32)

In [23]:
y.shape

torch.Size([3751, 1])

In [24]:
# Split into train/test
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25, random_state=42)

# Move to GPU if available
x_train, y_train, x_test, y_test = map(lambda t: t.clone().detach().to(device), (x_train, y_train, x_test, y_test))

train_dataset = TensorDataset(x_train, y_train)
test_dataset = TensorDataset(x_test, y_test)
train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=32, shuffle=False)

# Define model using Sequential
model = nn.Sequential(
    nn.Linear(x_train.shape[1], 25),
    nn.ReLU(),
    nn.Linear(25, 10),
    nn.Linear(10, 1),
    nn.Sigmoid()
).to(device)

# Loss and optimizer
criterion = nn.BCELoss()
optimizer = optim.Adam(model.parameters())

In [26]:
# Training with early stopping
best_loss = float('inf')
patience = 5
no_improvements = 0
for epoch in range(1000):
    model.train()
    for batch in train_loader:
        input, lables = batch

        optimizer.zero_grad()
        outputs = model(input)
        loss = criterion(outputs, lables)
        loss.backward()
        optimizer.step()

    model.eval()
    with torch.no_grad():
        val_loss = sum(criterion(model(inputs), labels) for inputs, labels in test_loader)
        if val_loss < best_loss - 1e-3:
            best_loss = val_loss
            no_improvements = 0
        else:
            no_improvements += 1
        if no_improvements >= patience:
            print("Early stopping")
            break


# Prediction
with torch.no_grad():
    pred = model(x_test).cpu().numpy().flatten()
    pred = np.clip(pred, a_min=1e-6, a_max=1-1e-6)

    print("Validation logloss: {}".format(sklearn.metrics.log_loss(y_test.cpu(), pred)))

    pred_binary = (pred > 0.5).astype(int)
    score = metrics.accuracy_score(y_test.cpu().numpy(), pred_binary)
    print("Validation accuracy score: {}".format(score))

    pred_submit = model(x_submit.to(device)).cpu().numpy().flatten()
    pred_submit = np.clip(pred_submit, a_min=1e-6, a_max=1-1e-6)

    submit_df = pd.DataFrame({'MoleculeId': [x+1 for x in range(len(pred_submit))], 'PredictedProbability': pred_submit})


Early stopping
Validation logloss: 0.5600866362644804
Validation accuracy score: 0.767590618336887


### What Features/Columns are Important
The following uses perturbation ranking to evaluate the neural network.

In [27]:
# Rank the features
from IPython.display import display, HTML

names = list(df_train.columns) # x+y column names
names.remove("Activity") # remove the target(y)
rank = perturbation_rank(device, model, x_test, y_test, names, False)
display(rank[0:10])

Unnamed: 0,name,error,importance
0,D603,0.571014,1.0
1,D129,0.570891,0.999785
2,D179,0.570877,0.999761
3,D149,0.570474,0.999055
4,D490,0.570162,0.998508
5,D240,0.570091,0.998385
6,D827,0.570015,0.99825
7,D887,0.569835,0.997935
8,D273,0.569802,0.997879
9,D850,0.569619,0.997557


## Neural Network Ensemble
A neural network ensemble combines neural network predictions with other models. The program determines the exact blend of these models by logistic regression. The following code performs this blend for a classification.

In [28]:
import numpy as np
import os
import pandas as pd
import math
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import StratifiedKFold
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression

In [33]:
SHUFFLE = False
FOLDS = 10

# Using nn.Sequential to define the model
def build_ann(input_size, classes, neurons):
    model = nn.Sequential(
        nn.Linear(input_size, neurons),
        nn.ReLU(),
        nn.Linear(neurons, 1),
        nn.Linear(1, classes),
        nn.Softmax(dim=1)
    )
    return model

def mlogloss(y_test, preds):
    epsilon = 1e-15
    sum = 0
    for row in zip(preds,y_test):
        x = row[0][row[1]]
        x = max(epsilon,x)
        x = min(1-epsilon,x)
        sum+=math.log(x)
    return( (-1/len(preds))*sum)

def stretch(y):
    return (y - y.min()) / (y.max() - y.min())

def blend_ensemble(x, y, x_submit):
    kf = StratifiedKFold(FOLDS)
    folds = list(kf.split(x,y))

    models = [
        build_ann(x.shape[1], 2, 20),
        KNeighborsClassifier(n_neighbors=3),
        RandomForestClassifier(n_estimators=100, n_jobs=-1, criterion='gini'),
        RandomForestClassifier(n_estimators=100, n_jobs=-1, criterion='entropy'),
        ExtraTreesClassifier(n_estimators=100, n_jobs=-1, criterion='gini'),
        ExtraTreesClassifier(n_estimators=100, n_jobs=-1, criterion='entropy'),
        GradientBoostingClassifier(learning_rate=0.05, subsample=0.5, max_depth=6, n_estimators=50)
    ]

    dataset_blend_train = np.zeros((x.shape[0], len(models)))
    dataset_blend_test = np.zeros((x_submit.shape[0], len(models)))

    for j, model in enumerate(models):
        print("Model: {} : {}".format(j, model))
        fold_sums = np.zeros((x_submit.shape[0], len(folds)))
        total_loss = 0
        for i, (train, test) in enumerate(folds):
            x_train = torch.tensor(x[train], dtype=torch.float32)
            y_train = torch.tensor(y[train].values, dtype=torch.int64)
            x_test = torch.tensor(x[test], dtype=torch.float32)
            y_test = torch.tensor(y[test].values, dtype=torch.int64)

            if isinstance(model, nn.Module):  # Check if the model is a PyTorch model
                optimizer = optim.Adam(model.parameters())
                criterion = nn.CrossEntropyLoss()

                # Training
                optimizer.zero_grad()
                outputs = model(x_train)
                loss = criterion(outputs, y_train)
                loss.backward()
                optimizer.step()

                # Prediction
                with torch.no_grad():
                    outputs_test = model(x_test)
                    _, predicted = outputs_test.max(1)
                    pred = F.softmax(outputs_test, dim=1).numpy()
                    outputs_submit = model(torch.tensor(x_submit, dtype=torch.float32))
                    pred2 = F.softmax(outputs_submit, dim=1).numpy()
            else:
                model.fit(x_train, y_train)
                pred = np.array(model.predict_proba(x_test))
                pred2 = np.array(model.predict_proba(x_submit))

            dataset_blend_train[test, j] = pred[:, 1]
            fold_sums[:, i] = pred2[:, 1]
            loss = mlogloss(y_test, pred)
            total_loss+=loss
            print("Fold #{}: loss={}".format(i,loss))
        print("{}: Mean loss={}".format(model.__class__.__name__, total_loss/len(folds)))
        dataset_blend_test[:, j] = fold_sums.mean(1)

    print()
    print("Blending models.")
    blend = LogisticRegression(solver='lbfgs')
    blend.fit(dataset_blend_train, y)
    return blend.predict_proba(dataset_blend_test), dataset_blend_train, dataset_blend_test

In [34]:
np.random.seed(42)  # seed to shuffle the train set

print("Loading data...")
URL = "https://data.heatonresearch.com/data/t81-558/kaggle/"

df_train = pd.read_csv(URL+"bio_train.csv", na_values=['NA', '?'])
df_submit = pd.read_csv(URL+"bio_test.csv", na_values=['NA', '?'])

Loading data...


In [32]:
df_train.head()

Unnamed: 0,Activity,D1,D2,D3,D4,D5,D6,D7,D8,D9,...,D1767,D1768,D1769,D1770,D1771,D1772,D1773,D1774,D1775,D1776
0,1,0.0,0.497009,0.1,0.0,0.132956,0.678031,0.273166,0.585445,0.743663,...,0,0,0,0,0,0,0,0,0,0
1,1,0.366667,0.606291,0.05,0.0,0.111209,0.803455,0.106105,0.411754,0.836582,...,1,1,1,1,0,1,0,0,1,0
2,1,0.0333,0.480124,0.0,0.0,0.209791,0.61035,0.356453,0.51772,0.679051,...,0,0,0,0,0,0,0,0,0,0
3,1,0.0,0.538825,0.0,0.5,0.196344,0.72423,0.235606,0.288764,0.80511,...,0,0,0,0,0,0,0,0,0,0
4,0,0.1,0.517794,0.0,0.0,0.494734,0.781422,0.154361,0.303809,0.812646,...,0,0,0,0,0,0,0,0,0,0


In [44]:
df_train['Activity'].unique()

array([1, 0])

In [36]:
predictors = list(df_train.columns.values)
predictors.remove('Activity')
x = df_train[predictors].values
y = df_train['Activity']
x_submit = df_submit.values

if SHUFFLE:
    idx = np.random.permutation(y.size)
    x = x[idx]
    y = y[idx]

submit_data, dataset_blend_train, dataset_blend_test = blend_ensemble(x, y, x_submit)
submit_data = stretch(submit_data)

Model: 0 : Sequential(
  (0): Linear(in_features=1776, out_features=20, bias=True)
  (1): ReLU()
  (2): Linear(in_features=20, out_features=1, bias=True)
  (3): Linear(in_features=1, out_features=2, bias=True)
  (4): Softmax(dim=1)
)
Fold #0: loss=0.6958829953342398
Fold #1: loss=0.6941009778087774
Fold #2: loss=0.6927663971013983
Fold #3: loss=0.6913332873280058
Fold #4: loss=0.6905866068713171
Fold #5: loss=0.6904894890764086
Fold #6: loss=0.6880869763879794
Fold #7: loss=0.6873027566818372
Fold #8: loss=0.6851593466737296
Fold #9: loss=0.6805799294265442
Sequential: Mean loss=0.6896288762690237
Model: 1 : KNeighborsClassifier(n_neighbors=3)
Fold #0: loss=3.606678388314123
Fold #1: loss=2.2256421551487593
Fold #2: loss=3.6815437059542186
Fold #3: loss=2.416161292225968
Fold #4: loss=4.442472310149748
Fold #5: loss=4.321350530738247
Fold #6: loss=3.400455469543658
Fold #7: loss=3.1724147110842513
Fold #8: loss=2.117356283193681
Fold #9: loss=3.0532135963322586
KNeighborsClassifier: Me

In [37]:
# Build submit file
ids = [id+1 for id in range(submit_data.shape[0])]
submit_df = pd.DataFrame({'MoleculeId': ids, 'PredictedProbability': submit_data[:, 1]}, columns=['MoleculeId','PredictedProbability'])

In [43]:
display(submit_data.shape)
display(dataset_blend_train.shape)
display(dataset_blend_test.shape)
display(submit_df)

(2501, 2)

(3751, 7)

(2501, 7)

Unnamed: 0,MoleculeId,PredictedProbability
0,1,0.950695
1,2,0.963416
2,3,0.419992
3,4,0.985059
4,5,0.066068
...,...,...
2496,2497,0.260016
2497,2498,0.065480
2498,2499,0.978737
2499,2500,0.782989


In [41]:
dataset_blend_train
# 行ごとに、各モデルの予測値(バイナリclassificationの1の確率)が入っている

array([[0.60081685, 0.66666667, 0.9       , ..., 0.88      , 0.8       ,
        0.8474707 ],
       [0.60228235, 1.        , 1.        , ..., 0.99      , 1.        ,
        0.88462191],
       [0.59956253, 0.        , 0.18      , ..., 0.22      , 0.13      ,
        0.18570176],
       ...,
       [0.55831301, 0.33333333, 0.33      , ..., 0.5       , 0.39      ,
        0.25553465],
       [0.57499862, 0.66666667, 0.88      , ..., 0.88      , 0.91      ,
        0.79543032],
       [0.57660043, 0.        , 0.1       , ..., 0.03      , 0.02      ,
        0.1344547 ]])

In [42]:
dataset_blend_train

array([[0.60081685, 0.66666667, 0.9       , ..., 0.88      , 0.8       ,
        0.8474707 ],
       [0.60228235, 1.        , 1.        , ..., 0.99      , 1.        ,
        0.88462191],
       [0.59956253, 0.        , 0.18      , ..., 0.22      , 0.13      ,
        0.18570176],
       ...,
       [0.55831301, 0.33333333, 0.33      , ..., 0.5       , 0.39      ,
        0.25553465],
       [0.57499862, 0.66666667, 0.88      , ..., 0.88      , 0.91      ,
        0.79543032],
       [0.57660043, 0.        , 0.1       , ..., 0.03      , 0.02      ,
        0.1344547 ]])

### The function of `mlogloss`
```python
def mlogloss(y_test, preds):
    epsilon = 1e-15
    sum = 0
    for row in zip(preds,y_test):
        x = row[0][row[1]]
        x = max(epsilon,x)
        x = min(1-epsilon,x)
        sum += math.log(x)
    return (-1/len(preds)) * sum
```
#### Purpose:
The `mlogloss` function calculates the multi-class logarithmic loss, also known as log loss. Log loss is a performance metric for classification models, particulary useful for probabilistic models where the output is a probability distribution over classes.

#### Explanation:
* `epsilon` is a small value to ensure numerical stability and avoil log(0).
* The function iterates over each prediction and the corresponding true label.
* For each prediction, it retrieves the predicted probability of the true class label.
* The prediction probability is clipped between `epsilon` and `1-epsilon` to avoid log(0) errors.
* The log of this probability is summed over all samples.
* Finally, the sum is normalized by the number of predictions and negated(否定された) to give the final log loss value.

### Function `stretch`
```python
def stretch(y):
    return (y - y.min()) / (y.max() - y.min())
```

#### Purpose:
The `stretch` function normalizes an array of values to the range [0, 1]. This is often useful in machine learning to ensure that all features have the same scale, which can improve the performance of many models.

#### Explanation:
* `y.min()` and `y.max()` are the minimum and maximum values in the array `y`.
* The function scales the values in `y` such that the minimum value becomes 0 and the maximum value becomes 1.
* This is done by subtracting the minimum value from each element and then dividing by the range (max - min).

### Function `blend_ensemble`
#### Purpose:
The `blend_ensemble` function performs model blending, an ensemble technique where multiple models are trained and their predicitons are combined to produce a final prediction. This often improves the overall performance by leveraging the strengths of different models.

### Explanation:
* `StratifiedKFold` is used to create stratified folds cross-validation, ensuring each fold has the same proportion of class labels as original dataset.
* A list of models (including neural networks and varions ensemble classifiers) is created.
* Two arrays, `dataset_blend_train` and `dataset_blend_test`, are initialized to store the blended predictions.
* For each model, cross-validation is performed:
     * If the model is a PyTorch model, it is trained using Adam optimizer and cross-entropy loss.
    * If the model is a scikit-learn model, it is trained using its fit method.
    * Predictions are made for both the validation set and the submission set.
    * Log loss is calculated for the validation set predictions.
* After all folds, the mean predictions for the submission set are stored in `dataset_blend_test`.
* Finally, a logistic regression model is trained on the blended training predictions and used to make the final predicitons for the submission set.

### detail of `mlogloss` function
```python
def mlogloss(y_test, preds):
    epsilon = 1e-15
    sum = 0
    for row in zip(preds, y_test):
        x = row[0][row[1]]
        x = max(epsilon, x)
        x = min(1 - epsilon, x)
        sum += math.log(x)
    return (-1 / len(preds)) * sum
```

#### Purpose:
The `mlogloss` function calculates the multi-class logarithmic loss, a performance metric often used in classification problems to measure the accuracy of probabilistic predictions.

### Detailed Breakdown:
1. **Intialize `epsilon` and `sum`:
```python
epsilon = 1e-15
sum = 0
```
* `epsilon` is set to a very smal value (**1e-15`**). This is used to avoid issue with taking the logarithm of zero or one, which would result in underfined or infinite values.
* `sum` is initialized to zero and will accumulate the log loss values for each prediction.

2. **Loop over each prediction and true label pair**:
```python
for row in zip(preds, y_test):
```
* The `zip(preds y_test)` function pairs each prediction (from `preds`) with its corresponding true label (from `y_test`)
* The loop iterates over these pairs, processing one pair at a time.

3. **Extract the predicted probability for the true class**
```python
x = row[0][row[1]]
```
* `row[0]` is the predicted probability distribution (an array) for a particular sample.
* `row[1]` is the true class labels for that sample.
* `row[0][row[1]]` extracts the predicted probability of the true class label.

4. **Clip the predicted probability to avoid log(0) and log(1)**:
```python
x = max(epsilon, x)
x = min(1 - epsilon, x)
```
* The predicted probability `x` is adjusted to ensure it is within the range `[epsilon, 1 - epsilon]`.
* This prevents `x` from being exactly 0 or 1, which would cause isssues when taking the logarithm.

5. **Accumulated the log loss**:
```python
sum += math.log(x)
```
* The natural logarithm of the cipped predicted probability `x` is added to the cumulative `sum`.

6. **Calculate the average log loss**:
```python
return (-1 / len(preds)) * sum
```
* The cumulative log loss `sum` is divided by the number of predictions (`len(preds)`) to get the average log loss.
* The result is negated to ensure the log loss is a positive value (since the log of probabilities between 0 and 1 is negative).


### Summary:
The `mlogloss` function cumputes the log loss for a set of predictions, providing a measure of how well the predicted probabilities match the true class labels. It ensures numerical stabitily by clipping the predicted probabilities and returns the average log loss over all samples. This metric is useful for evaluating the performance of probabilistic classification models, with lower values indicating better performance.


# About Logarithmic loss
Logarithmic loss, often abbreviated as log loss, is a performance metric used to evaluate the accuracy of probabilistic classification models. It measures the uncertainty of predictions and penalizes both false classifications and confident but incorrect predictions. Log loss is particularly useful for models that provide probability estimates rather than just class labels.

## Key Points of Log Loss
1. **Probabilistic Predictions**:
    * Log loss evaluates the predicted probabilities of each class rather than the predicted class labels.
    * A good probabilistic model assigns high probabilities to the correct classes and low probabilities to the incorrect classes.
2. **Penalty for Incorrect Predictions**:
    * Incorrect predictions are penalized based on their confidence. For example, predicting a high probability for the wrong class results in a higher penalty than predicting a low probability for the wrong class.
3. **Range and Interpretation**:
    * Log loss values range from 0 to infinity, where 0 indicates perfect predictions and higher values indicate worse performance.
    * Lower log loss values are better, indicating more accurate and confident predictions.

## Formula for Log Loss:
For a set of $N$ samples, the log loss is calculated as:
$$
\text{Log Loss} = -\frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{M} y_{ij} \log(p_{ij})
$$

where:
* $N$ is the number of samples.
* $M$ is the number of classes.
* $y_{ij}$ is a binary indicator (0 or 1) if the true class label of sample $i$ is $j$.
* $p_{ij}$ is the predicted probability thta sample $i$ belongs to class $j$.

## Example Calculation
Consider a binary classification problem with two classes (0 and 1). Suppose we have three samples with the following true labels and predicted probabilities:

| Sample | True Label | Predicted Probability (Class 0) | Predicted Probability (Class 1) |
|--------|------------|----------------------------------|----------------------------------|
| 1      | 0          | 0.8                              | 0.2                              |
| 2      | 1          | 0.4                              | 0.6                              |
| 3      | 1          | 0.1                              | 0.9                              |

<br>

The log loss for each sample is calculated as:

- Sample 1: $-\log(0.8)$
- Sample 2: $-\log(0.6)$
- Sample 3: $-\log(0.9)$

Then, the overall log loss is the average of these individual log losses:

$$
\text{Log Loss} = -\frac{1}{3} \left[ \log(0.8) + \log(0.6) + \log(0.9) \right]
$$

Using a calculator to find the logarithms:

- $$\log(0.8) \approx -0.223$$
- $$\log(0.6) \approx -0.511$$
- $$\log(0.9) \approx -0.105$$

So,
$$
\text{Log Loss} = -\frac{1}{3} \left[ -0.223 + -0.511 + -0.105 \right] \approx 0.279
$$

## Significance:
* **Model Evaluation**: Log loss provides a way to evaluate how well a model's predictd probabiilities align with the true class labels.
* **Model Comparison**: It allows for the comparison of different models, especially in scenarios where models output probabilistic predictions.
* **Hyperparameter Tuning**: Log loss can be used as a metric for optimizing model hyperparameters to improve probabilistic accuracy.

In summary, log loss is a crucial metric for evaluating and comparing the performance of probabilistic classification models, emphasizing the importance of both correct and cofident predictions.






Logarithmic loss, often abbreviated as log loss, is a performance metric used to evaluate the accuracy of probabilistic classification models. It measures the uncertainty of predictions and penalizes both false classifications and confident but incorrect predictions. Log loss is particularly useful for models that provide probability estimates rather than just class labels.

### Key Points of Log Loss:

1. **Probabilistic Predictions:**
   - Log loss evaluates the predicted probabilities of each class rather than the predicted class labels.
   - A good probabilistic model assigns high probabilities to the correct classes and low probabilities to the incorrect classes.

2. **Penalty for Incorrect Predictions:**
   - Incorrect predictions are penalized based on their confidence. For example, predicting a high probability for the wrong class results in a higher penalty than predicting a low probability for the wrong class.

3. **Range and Interpretation:**
   - Log loss values range from 0 to infinity, where 0 indicates perfect predictions and higher values indicate worse performance.
   - Lower log loss values are better, indicating more accurate and confident predictions.

### Formula for Log Loss:

For a set of \( N \) samples, the log loss is calculated as:

\[ \text{Log Loss} = -\frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{M} y_{ij} \log(p_{ij}) \]

where:
- \( N \) is the number of samples.
- \( M \) is the number of classes.
- \( y_{ij} \)
- \( p_{ij} \) is the predicted probability that sample \( i \) belongs to class \( j \).

### Example Calculation:

Consider a binary classification problem with two classes (0 and 1). Suppose we have three samples with the following true labels and predicted probabilities:


### Significance:

- **Model Evaluation:** Log loss provides a way to evaluate how well a model's predicted probabilities align with the true class labels.
- **Model Comparison:** It allows for the comparison of different models, especially in scenarios where models output probabilistic predictions.
- **Hyperparameter Tuning:** Log loss can be used as a metric for optimizing model hyperparameters to improve probabilistic accuracy.

In summary, log loss is a crucial metric for evaluating and comparing the performance of probabilistic classification models, emphasizing the importance of both correct and confident predictions.

# The reason of using Logistic Regression for Model Blending

Using `LogisticRegression(solver='lbfgs')` to blend models in an ensemble approach is a technique known as **stacking** or **stacked generalization**. The idea is to use a meta-model to learn how to best combine the predictions of several base models. Here's a detailed explanation of why this approach is used:

## Why Use Logistic Regression for Blending:
1. **Combining Predictions**:
    * When we have multiple models, each providing their own predictions, the goal is to combine these predictions in a way that leverages the strengths of each model.
    * Logistic regression is a simple yet powerful algorithm that can learn the optimal weights for combining these predictions. It essentially learns weighted sum of the base model predictions to produce the final prediction.

2. **Handling Probalistic Outputs**:
    * Logistic regression works well with probabilistic outputs, which is often the case in classfication problems. Each base model provides a probability distribution over the classes.
    * Logistic regression can take these probabilities as input features and learn how to combine them to maximize the accuracy of the final predictions.

3. **Flexibility and Simplicity**:
    * Logisitic regression is relatively simple and computationally efficient, making it suitable for blending even when the number of base models is large.
    * It doesn't require extensive hyperparameter tuning compared to more complex models, which simplifies the stacking process.

## Detailed Process in the Code:
1. **Create Base Models**:
```python
models = [
    build_ann(x.shape[1], 2, 20),
    KNeighborsClassifier(n_neighbors=3),
    RandomForestClassifier(n_estimators=100, n_jobs=-1, criterion='gini'),
    RandomForestClassifier(n_estimators=100, n_jobs=-1, criterion='entropy'),
    ExtraTreesClassifier(n_estimators=100, n_jobs=-1, criterion='gini'),
    ExtraTreesClassifier(n_estimators=100, n_jobs=-1, criterion='entropy'),
    GradientBoostingClassifier(learning_rate=0.05, subsample=0.5, max_depth=6, n_estimators=50)
]
```
    * A list of diverse models is created, including neural networks and several ensemble classifiers.

2. **Training Base Models and Collecting Predictions**:
```python
dataset_blend_train = np.zeros((x.shape[0], len(models)))
dataset_blend_test = np.zeros((x_submit.shape[0], len(models)))
```
    * Two arrays are initialized to store the predictions from each base model for both the training and submission datasets.

3. **Cross-Validation and Blending**:
```python
for j, model in enumerate(models):
    ...
    for i, (train, test) in enumerate(folds):
        ...
        if isinstance(model, nn.Module):
            ...
        else:
            model.fit(x_train, y_train)
            pred = np.array(model.predict_proba(x_test))
            pred2 = np.array(model.predict_proba(x_submit))
            
        dataset_blend_train[test, j] = pred[:, 1]
        fold_sums[:, i] = pred2[:, 1]
        ...
    dataset_blend_test[:, j] = fold_sums.mean(1)
```
    * Each base model is trained using cross-validation.
    * Predictions for both training and submission datasets are stored in `dataset_blend_train` and `dataset_blend_test`.

4. **Blending with Logistic Regression**:
```python
blend = LogisticRegression(solver='lbfgs')
blend.fit(dataset_blend_train, y)
return blend.predict_proba(dataset_blend_test)
```
    * Logistic regression is used as the meta-model to blend the predicitons from the base models.
    * It learns the optimal weights for combining the base model predictios to produce the final prediction.


## Intuition Behind Using Logistic Regression:
* **Ensemble Learning**: Combining the predicitons of multiple models generally leads to better performance than any individual model. Each model's errors tend to be uncorrelated, so the ensemble can average out these errors.
* **Weight Learning**: Logistic regression effectively learns the importance of each base model's predictions. It can assign higher weights to more accurate models and lower weights to less accurate ones.
* **Regulariztion**: The `lbfgs` solver includes regularizaion by default, which helps prevent overfitting when combining the predictions.


By using logistic regression to blend the outputs of multiple models, we leverage the strengths of each model, mitigate their individual weaknesses, and produce a more robust final prediction. This method of stacking with logistic regression is a practical and effective way to improve predictive performance.

## Difference between `Logistic Regression` and `Simple Average` of ensemble model

The key difference between using Logistic Regression to blend ensemble model predicitons and simply calculating the mean of all predictions lies in the flexibility and learning capability of the two approaches:

### 1. Logistic Regression for Blending:
Losistic Regression is a supervised learning algorithm that learns the optimal weights for combining predictions from different models. Here's how it works:
* **Learning Weights**:
    * Logistic Regression learns different weights for each model's predictions based on how well each model performs on the training data. Models that are more accurate will receive higher weights.
    * It effectively adjusts the contribution of each model's prediction to the final ensemble prediction.

* **Handling Probabilities**:
    * It works well with probabilistic outputs (i.e., predicted probabilities from each model) and can learn the best way to combine these probabilities to minimize log loss or another suitable metric.

* **Regularization**:
    * Logistic Regression can include regularization (like L1 or L2 regularization) to prevent overfitting, especially useful when there are many base models.

### 2. Calculating the Mean of All Model Predictions:
Averaging predictions is a simpler approach where each model's prediction is given equal weight. Here's how it works:
* **Equal Weighting**:
    * Each model's prediction contributes equally to the final predicitons. No learning or adjustment of weights occurs based on model performance.
    * The final prediction is simply the average of all model predictions.

* **Simplicity**:
    * This approach is straightforward an doesn't require a meta-model or additional training.
    * It's easy to implement and computationally less intensive since no additional learning step is involved.

### Comparison:
**Felexibility and Adaptiveness**:
* **Logistic Regression**: Adapts to the strengths and weaknesses of individual models by learning weights based on their performance. This can lead to better overall performance as the ensemble can give more weight to more accurate models.
* **Mean Averaging**: Assumes all models are equally good, which might not be true in practice. It doesn't adapt to the varying performance levels of different models.

**Handling Probabilities**:
* **Logistic Regression**: Can efffectively handle and combine probabilstic outputs from models, making it more suitable for tasks where probabilistic predicitons are crucial.
* **Mean Avraging**: simply averages the probabilities, which might not always lead to the optimal combination of predictions.

**Overfitting and Regularization**:
* **Logistic Regression**: Can include regularization to prevent overfitting, making it robust, especially with many models or small datasets.
* **Mean Averaging**: Doesn't inherently include any mechanism to prevent overfitting. If some models overfit, their influence is not adjusted.


### When to Use Each Approach:
* **Logistic Regression**:
    * When you have a diverse set of models with varying performance.
    * When the task benefits from learning the optimal combination of model predictions.
    * When you need a more sophisticated and potentially more accurate ensemble method.
    * when you can afford the additional computational cost and complexity.

* **Mean Average**:
    * When simplicity and ease of implementation are priorites.
    * When you have models of similar performance and want a quick way to ensemble them.
    * When computational resources are limited.
    * When you want a baseline ensemble method to compare against more complex methods.

### Example to Illustrate:
Suppose we have three models with the following predicted probabilities for a binary classification problem (Class 1):

| Sample | Model 1 | Model 2 | Model 3 |
|--------|---------|---------|---------|
| 1      | 0.7     | 0.8     | 0.6     |
| 2      | 0.4     | 0.5     | 0.3     |
| 3      | 0.9     | 0.7     | 0.8     |


**Mean Averaging:**
- For Sample 1: $$ (0.7 + 0.8 + 0.6) / 3 = 0.7\ $$
- For Sample 2: $$ (0.4 + 0.5 + 0.3) / 3 = 0.4\ $$
- For Sample 3: $$ (0.9 + 0.7 + 0.8) / 3 = 0.8\ $$

**Logistic Regression:**
- Logistic regression might learn weights **(e.g., 0.5, 0.3, 0.2)** based on model performance.
- For Sample 1: \$$ 0.7 \times 0.5 + 0.8 \times 0.3 + 0.6 \times 0.2) = 0.74\ $$
- For Sample 2: $$ (0.4 \times 0.5 + 0.5 \times 0.3 + 0.3 \times 0.2) = 0.43\ $$
- For Sample 3: $$(0.9 \times 0.5 + 0.7 \times 0.3 + 0.8 \times 0.2) = 0.81\ $$


In summary, using Logistic Regression allows the ensemble to be more adaptive and potentially more accurate by learning the best combination of model predicitons, while meann averaging is a simpler and less adaptive approach.


# How to Combine predictions of ensemble models for Regression

When dealing with regression problems and using ensemble models, there are several ways to combine the predictions of all models. Here are some common ensemble methods for regression:

### 1. Simple Averaging

In simple averaging, the predictions from each model are given equal weight, and the final prediction is the mean of these predictions.

**Formula:**
$$ \hat{y} = \frac{1}{M} \sum_{m=1}^{M} \hat{y}_m $$

where $\hat{y}_m$ is the prediction from the $m$-th model and $M$ is the total number of models.

**Implementation:**
```python
import numpy as np

predictions = [model1.predict(X_test), model2.predict(X_test), model3.predict(X_test)]
final_prediction = np.mean(predictions, axis=0)
```

### 2. Weighted Averaging

In weighted averaging, different models' predictions are given different weights based on their performance or importance. The final prediction is a weighted sum of the individual predictions.

**Formula:**
$$ \hat{y} = \sum_{m=1}^{M} w_m \hat{y}_m $$

where $w_m$ is the weight assigned to the $m$-th model's prediction.

**Implementation:**
```python
weights = [0.5, 0.3, 0.2]
predictions = [model1.predict(X_test), model2.predict(X_test), model3.predict(X_test)]
final_prediction = np.average(predictions, axis=0, weights=weights)
```

### 3. Stacking

Stacking involves using a meta-model to learn how to best combine the predictions of the base models. The predictions of the base models are used as input features to the meta-model.

**Steps:**
1. Train each base model on the training data.
2. Use each base model to make predictions on the training and test data.
3. Use these predictions as features to train a meta-model on the training data predictions.
4. Use the meta-model to make the final prediction on the test data predictions.

**Implementation:**
```python
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold

kf = KFold(n_splits=5)
train_meta_features = np.zeros((X_train.shape[0], len(models)))
test_meta_features = np.zeros((X_test.shape[0], len(models)))

for i, model in enumerate(models):
    for train_idx, valid_idx in kf.split(X_train):
        model.fit(X_train[train_idx], y_train[train_idx])
        train_meta_features[valid_idx, i] = model.predict(X_train[valid_idx])
    test_meta_features[:, i] = model.predict(X_test)

meta_model = LinearRegression()
meta_model.fit(train_meta_features, y_train)
final_prediction = meta_model.predict(test_meta_features)
```

### 4. Bagging (Bootstrap Aggregating)

Bagging involves training multiple instances of the same model on different subsets of the data and averaging their predictions. Random Forest is a common example of bagging applied to decision trees.

**Implementation:**
```python
from sklearn.ensemble import BaggingRegressor
from sklearn.tree import DecisionTreeRegressor

model = BaggingRegressor(base_estimator=DecisionTreeRegressor(), n_estimators=50, random_state=42)
model.fit(X_train, y_train)
final_prediction = model.predict(X_test)
```

### 5. Boosting

Boosting involves training models sequentially, with each model focusing on correcting the errors of the previous one. The final prediction is a weighted sum of all model predictions. Gradient Boosting and AdaBoost are common examples.

**Implementation:**
```python
from sklearn.ensemble import GradientBoostingRegressor

model = GradientBoostingRegressor(n_estimators=100, learning_rate=0.1, random_state=42)
model.fit(X_train, y_train)
final_prediction = model.predict(X_test)
```

### 6. Voting Regressor

Voting regressor is a simple ensemble method where multiple different regressors are trained, and their predictions are averaged to produce the final prediction.

**Implementation:**
```python
from sklearn.ensemble import VotingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

model1 = LinearRegression()
model2 = DecisionTreeRegressor()

voting_regressor = VotingRegressor(estimators=[('lr', model1), ('dt', model2)])
voting_regressor.fit(X_train, y_train)
final_prediction = voting_regressor.predict(X_test)
```

### Summary

Each of these ensemble methods has its own strengths and weaknesses, and the choice of method often depends on the specific problem and dataset. Simple and weighted averaging are straightforward and easy to implement, while stacking and boosting can provide more powerful and accurate predictions at the cost of increased complexity and computation. Bagging, such as Random Forest, is effective for reducing variance, while boosting is effective for reducing bias. Voting regressor is a simple and effective method for combining different types of regressors.

## How to select each way from above

Certainly! Here's a guide on the best situations for using each ensemble method in regression problems:

### 1. Simple Averaging

**Best Situation:**
- **Homogeneous Models:** When you have multiple models of similar type and performance.
- **Quick Ensemble:** When you need a simple and quick way to combine predictions.
- **Baseline:** When you want to establish a baseline ensemble method before trying more complex approaches.

**Example:**
You have trained several linear regression models with slightly different parameters or features, and they perform similarly. Averaging their predictions can give you a robust baseline.

### 2. Weighted Averaging

**Best Situation:**
- **Heterogeneous Models:** When you have models of different types or varying performance.
- **Performance-Based Weighting:** When you can assign weights based on model performance, such as cross-validation scores or validation set errors.
- **Expert Knowledge:** When you have domain knowledge to justify the weights assigned to different models.

**Example:**
You have trained a linear regression, a decision tree, and a neural network, and you want to combine their predictions by giving more weight to the neural network due to its better performance on validation data.

### 3. Stacking

**Best Situation:**
- **Complex Relationships:** When you believe that a meta-model can learn complex relationships between the base model predictions.
- **Diverse Models:** When you have a diverse set of base models (e.g., linear models, tree-based models, and neural networks).
- **Adequate Data:** When you have enough data to train both the base models and the meta-model effectively.

**Example:**
You have several models trained on different subsets or features of the data. You use their predictions as input to a meta-model, such as a linear regression or another machine learning algorithm, to learn the optimal combination of predictions.

### 4. Bagging (Bootstrap Aggregating)

**Best Situation:**
- **Reducing Variance:** When your base models (e.g., decision trees) are prone to overfitting and you want to reduce variance.
- **Homogeneous Models:** When you can use the same model type but want to improve stability and accuracy.
- **Parallel Training:** When you can train multiple models in parallel to save time.

**Example:**
You are using decision trees, which are known to be high-variance models. Using a Bagging Regressor (e.g., Random Forest) helps reduce overfitting by averaging the predictions of many decision trees trained on different bootstrap samples of the data.

### 5. Boosting

**Best Situation:**
- **Reducing Bias:** When you need to improve a model that is high-bias and underfitting the data.
- **Sequential Learning:** When you can afford sequential training of models to focus on correcting errors of previous models.
- **High Performance Needs:** When you need high performance and can manage the increased computational cost.

**Example:**
You have a weak base learner, such as a shallow decision tree, and you use boosting (e.g., Gradient Boosting or AdaBoost) to sequentially train multiple models, each focusing on the errors of the previous ones, to create a strong predictive model.

### 6. Voting Regressor

**Best Situation:**
- **Diverse Models:** When you have a set of different regressors and you want to combine their predictions.
- **Simplicity:** When you want a straightforward way to ensemble models without the complexity of training a meta-model.
- **Equal Contribution:** When you want to give equal importance to all models in the ensemble.

**Example:**
You have trained a linear regression, a decision tree regressor, and a k-nearest neighbors regressor. Using a Voting Regressor, you combine their predictions by averaging them, leveraging the strengths of each model type.

### Summary

- **Simple Averaging:** Best for homogeneous models with similar performance, or as a quick baseline.
- **Weighted Averaging:** Ideal for heterogeneous models with varying performance, when you can assign meaningful weights.
- **Stacking:** Suitable for diverse models with potential complex relationships between their predictions, and when you have enough data.
- **Bagging:** Effective for reducing variance in high-variance models like decision trees, providing stability and accuracy.
- **Boosting:** Powerful for reducing bias in weak learners, improving performance through sequential error correction.
- **Voting Regressor:** Great for combining different types of regressors in a simple, straightforward manner, giving equal importance to all models.

By understanding the strengths and best-use scenarios of each method, you can choose the most appropriate ensemble technique for your specific regression problem and dataset.