[The source of code](https://www.kaggle.com/code/unmoved/classify-possum-population)

**Company Name:**
- OpenIntro Statistics Book

**Problem Type:**
- Binary Classification

**Problem:**
- Can you use your skills to predict the age of a possum, its head length, whether it is male or female?

**Goal:**
- We will be trying to determine the population as either Victoria or other (New South Wales or Queensland) based on the features provided.
- These details (features we will use to predict) are as follows:
  - case = observation number
  - site = The site number where the possum was trapped.
  - sex = Gender, either m (male) or f (female).
  - age = Age of possum
  - hdlngth = Head length, in mm.
  - skullw = Skull width, in mm.
  - totlngth = Total length, in cm.
  - taill = Tail length, in cm.
  - footlgth = foot length
  - earconch = ear conch length
  - eye = distance from medial canthus to lateral canthus of right eye
  - chest = chest girth (in cm)
  - belly = belly girth (in cm)
  
  
- Which will let us determine the target variable which is:
  - Pop = Population, either Vic (Victoria) or other (New South Wales or Queensland).

# UNMOVED TEMPLATE GUIDE

<div style="border:3px solid #FFD700; padding: 15px; border-radius: 15px; background-color: #FFFACD;">
    <h3 style="color: #DAA520; text-align: center;"><a href="https://www.kaggle.com/unmoved" style="color: #DAA520; text-decoration: none;">Unmoved's Template Guide</a></h3>
    <p style="font-size: 15px; color: #333333; text-align: center;">
        Hey, I'm <strong><a href="https://www.kaggle.com/unmoved" style="color: #0000EE;">Unmoved</a></strong>, and this is my template. 
    </p>
    <p style="font-size: 15px; color: #333333; text-align: center;">
        Please ensure to change these variables inside of the <strong>SET IMPORTANT VARIABLES</strong> section of this notebook before proceeding.
    </p>
    <p style="font-size: 15px; color: #333333; text-align: center;">
        If you use my template, please be sure to keep this text in your notebook to credit me.
    </p>
    <h3 style="color: #DAA520;">Basics</h3>
    <ul style="font-size: 14px; color: #333333;">
        <li><strong>seed_value</strong>: For reproducibility</li>
        <li><strong>problem_type</strong>: 'classify' or 'regress' depending on your task.</li>
        <li><strong>target_column</strong>: The target column in your dataset.</li>
        <li><strong>load_from_file</strong>: Load data from a CSV file or generate sample data.</li>
        <li><strong>file_name</strong>: Provide the name of the CSV file if loading data from a file.</li>
        <li><strong>desired_samples</strong>: 
            <ul>
                <li>Controls the total number of samples to generate if not loading from a file.</li>
                <li>Determines the number of samples retained after undersampling if applicable.</li>
            </ul>
        </li>
        <li><strong>do_undersample_data</strong>: Controls if you want to undersample your data (will preserve balance class distribution if classification), regardless if it was loaded or generated. The <strong>desired_samples</strong> parameter will control the number of samples retained.</li>
        <li><strong>show_eda</strong>: Decide whether to display EDA (Exploratory Data Analysis).</li>
    </ul>
    <h3 style="color: #DAA520;">Cross Validation</h3>
    <ul style="font-size: 14px; color: #333333;">
        <li><strong>n_folds</strong>: Set the number of folds for cross-validation. This setting applies to everything that uses cross-validation in the notebook.</li>
    </ul>
    <h3 style="color: #DAA520;">Genetic Algorithm and Model Loading Settings</h3>
    <ul style="font-size: 14px; color: #333333;">
        <li><strong>load_existing_model</strong>: Load an existing model from the current directory. Remember, the code saves the model after making by default, so will use the same model unless you delete it.</li>
        <li><strong>model_name</strong>: The name to save the new model as, or the name to try loading the existing model as.</li>
        <li><strong>categorical_threshold</strong>: Adjust the threshold for treating a column as categorical.</li>
        <li><strong>n_population</strong>: Adjust the population size for the genetic algorithm.</li>
        <li><strong>n_generations</strong>: Set the number of generations for the genetic algorithm.</li>
        <li><strong>cxpb</strong>: Modify the crossover probability for the genetic algorithm.</li>
        <li><strong>mutpb</strong>: Adjust the mutation probability for the genetic algorithm.</li>
    </ul>
    <h3 style="color: #DAA520;">Permutation Feature Importance Settings</h3>
    <ul style="font-size: 14px; color: #333333;">
        <li><strong>get_permutation_importance</strong>: Decide whether to show permutation importance.</li>
        <li><strong>n_repeats</strong>: Set the number of times to repeat the permutation importance calculation. This will be validated by the same number of folds in n_folds.</li>
    </ul>
</div>


# INITIALIZE

## INSTALLS

In [1]:
import sys
import subprocess

print("Note: If you get a cant find _C error, try simply restarting the notebook, otherwise you may need to manually ensure these are installed\n")

packages = [
    'numpy', 'pandas', 'psutil', 'joblib', 'deap', 'torch', 
    'sklearn', 'xgboost', 'catboost', 'lightgbm', 'plotly', 
    'IPython', 'sweetviz', 'tqdm', 'deap', 'seaborn', 
    'matplotlib'
]

for package in packages:
    try:
        __import__(package)
        print(f"'{package}' is already installed.")
    except ImportError:
        print(f"'{package}' not found, installing...")
        subprocess.check_call([sys.executable, '-m', 'pip', 'install', package])

Note: If you get a cant find _C error, try simply restarting the notebook, otherwise you may need to manually ensure these are installed

'numpy' is already installed.
'pandas' is already installed.
'psutil' is already installed.
'joblib' is already installed.
'deap' is already installed.
'torch' is already installed.
'sklearn' is already installed.
'xgboost' is already installed.
'catboost' is already installed.


'lightgbm' is already installed.
'plotly' is already installed.
'IPython' is already installed.
'sweetviz' not found, installing...
Collecting sweetviz
  Downloading sweetviz-2.3.1-py3-none-any.whl (15.1 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 15.1/15.1 MB 64.7 MB/s eta 0:00:00
Installing collected packages: sweetviz
Successfully installed sweetviz-2.3.1




'tqdm' is already installed.
'deap' is already installed.
'seaborn' is already installed.
'matplotlib' is already installed.


## IMPORTS

In [2]:
# Data Manipulation
import numpy as np
import pandas as pd
import random
from scipy.stats import uniform, randint

# System 
import os
import sys
import psutil
import platform
import subprocess
import time
import copy
import hashlib
import warnings
import pickle
from itertools import product
from joblib import Parallel, delayed

# PyTorch
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset

# Sample Datasets
from sklearn.datasets import make_classification, make_regression

# Data Splitting and Cross-Validation
from sklearn.model_selection import KFold, StratifiedKFold, train_test_split, cross_val_score

# Imputation
from sklearn.impute import SimpleImputer

# Encoders
from sklearn.preprocessing import OrdinalEncoder, OneHotEncoder

# Scaling
from sklearn.preprocessing import StandardScaler

# Feature Extraction and Dimensionality Reduction
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

# Feature importance
from sklearn.inspection import permutation_importance
from sklearn.feature_selection import RFECV

# Feature Transforms
from sklearn.preprocessing import PolynomialFeatures

# Pipelines and Quality of Life
from sklearn.compose import make_column_selector, make_column_transformer
from sklearn.pipeline import make_pipeline
from sklearn.base import BaseEstimator, TransformerMixin, ClassifierMixin, RegressorMixin
from sklearn.preprocessing import FunctionTransformer
from sklearn.utils.validation import check_X_y, check_array
from sklearn.metrics import make_scorer
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer

# Classifiers
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier, StackingClassifier, GradientBoostingClassifier, AdaBoostClassifier, VotingRegressor, VotingClassifier
from sklearn.linear_model import RidgeClassifier, RidgeClassifierCV, LogisticRegression, LogisticRegressionCV
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from xgboost import XGBClassifier
from catboost import CatBoostClassifier
from lightgbm import LGBMClassifier
from sklearn.neural_network import MLPClassifier

# Regressors
from sklearn.ensemble import ExtraTreesRegressor, RandomForestRegressor, StackingRegressor, GradientBoostingRegressor, AdaBoostRegressor, HistGradientBoostingRegressor
from sklearn.linear_model import Ridge, RidgeCV, Lasso, LassoCV, LinearRegression, ElasticNet, ElasticNetCV
from sklearn.svm import SVR
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor
from xgboost import XGBRegressor
from catboost import CatBoostRegressor
from lightgbm import LGBMRegressor
from sklearn.neural_network import MLPRegressor

# Hyperparameter Optimization
from sklearn.model_selection import RandomizedSearchCV
from sklearn.model_selection import GridSearchCV
from deap import base, creator, tools, algorithms # For genetic search 

# Plotting and Visuals
import plotly.express as px
from IPython.display import IFrame
import sweetviz as sv
from tqdm import tqdm
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.graph_objects as go
from sklearn.utils import estimator_html_repr

## GENETIC SEARCH FUNCTION

In [3]:
def make_pipeline_and_genetic_search(X, y, cat_cols, num_cols, problem_type='classify', n_folds=5, n_population=20, n_generations=10, cxpb=0.5, mutpb=0.2):   
        
    # Numerical transformer with scaling
    numerical_transformer = Pipeline(steps=[
        ('imputer', SimpleImputer(strategy='median')),
        ('scaler', StandardScaler())
    ])

    # Categorical transformer
    categorical_transformer = Pipeline(steps=[
        ('imputer', SimpleImputer(strategy='most_frequent', fill_value='missing')),
        ('encoder', OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1))
    ])

    # Combine transformers
    preprocessor = ColumnTransformer(
        transformers=[
            ('num', numerical_transformer, num_cols),
            ('cat', categorical_transformer, cat_cols)
        ]
    )

    # Define PCA
    pca = PCA()

    # Check if GPU is available
    gpu_available = torch.cuda.is_available()

    if problem_type == 'classify':
        model_choices = {
            'rf': RandomForestClassifier(verbose=0, n_jobs=-1),
            'xgb': XGBClassifier(verbosity=0, n_jobs=-1, device='gpu' if gpu_available else 'cpu'),
            'lgbm': LGBMClassifier(verbosity=-1, n_jobs=-1, device='gpu' if gpu_available else 'cpu')
        }
        cv = StratifiedKFold(n_splits=n_folds)
        scoring_metric = 'accuracy'
    elif problem_type == 'regress':
        model_choices = {
            'rf': RandomForestRegressor(verbose=0, n_jobs=-1),
            'xgb': XGBRegressor(verbosity=0, n_jobs=-1, device='cuda' if gpu_available else 'cpu'),
            'lgbm': LGBMRegressor(verbosity=-1, n_jobs=-1, device='gpu' if gpu_available else 'cpu')
        }
        cv = KFold(n_splits=n_folds)
        scoring_metric = 'r2'

    best_estimators = {}
    parameter_hashes = {}

    # Start a timer
    evaluating_start_time = time.time()

    for model_name, model in model_choices.items():
        print(f"Optimizing model: {model_name} using Genetic Algorithm")
        
        # Define inner functions for evaluation, mutation, and mating

        def hash_parameters(params):
            """Generate a hash for a given set of parameters."""
            param_str = str(params)
            return hashlib.md5(param_str.encode()).hexdigest()

        def evaluate_individual_rf(individual):
            param_hash = hash_parameters(individual)
            if param_hash in parameter_hashes:
                print(f"[rf] Skipping duplicate evaluation: {individual}, score reused: {parameter_hashes[param_hash]:.4f}")
                return parameter_hashes[param_hash],
            
            (n_estimators, max_depth, min_samples_split, min_samples_leaf, max_features, bootstrap, pca_n_components) = individual
            
            model.set_params(
                n_estimators=int(n_estimators),
                max_depth=int(max_depth),
                min_samples_split=int(min_samples_split),
                min_samples_leaf=int(min_samples_leaf),
                max_features=max_features,
                bootstrap=bootstrap
            )
            
            pca.n_components = pca_n_components if pca_n_components else None
            
            pipeline = Pipeline(steps=[
                ('preprocessor', preprocessor),
                ('pca', pca),
                ('model', model)
            ])
            
            scores = cross_val_score(pipeline, X, y, cv=cv, scoring=scoring_metric, n_jobs=-1)
            score_mean = scores.mean()
            print(f"[rf] Evaluating: {individual} | Score: {score_mean:.4f}")
            parameter_hashes[param_hash] = score_mean  # Cache the score
            return score_mean,

        def evaluate_individual_xgb(individual):
            param_hash = hash_parameters(individual)
            if param_hash in parameter_hashes:
                print(f"[xgb] Skipping duplicate evaluation: {individual}, score reused: {parameter_hashes[param_hash]:.4f}")
                return parameter_hashes[param_hash],
            
            (n_estimators, max_depth, learning_rate, subsample, colsample_bytree, gamma, 
             reg_alpha, reg_lambda, pca_n_components) = individual
            
            model.set_params(
                n_estimators=int(n_estimators),
                max_depth=int(max_depth),
                learning_rate=learning_rate,
                subsample=subsample,
                colsample_bytree=colsample_bytree,
                gamma=gamma,
                reg_alpha=reg_alpha,
                reg_lambda=reg_lambda
            )
            
            pca.n_components = pca_n_components if pca_n_components else None
            
            pipeline = Pipeline(steps=[
                ('preprocessor', preprocessor),
                ('pca', pca),
                ('model', model)
            ])
            
            scores = cross_val_score(pipeline, X, y, cv=cv, scoring=scoring_metric, n_jobs=-1)
            score_mean = scores.mean()
            print(f"[xgb] Evaluating: {individual} | Score: {score_mean:.4f}")
            parameter_hashes[param_hash] = score_mean  # Cache the score
            return score_mean,

        def evaluate_individual_lgbm(individual):
            param_hash = hash_parameters(individual)
            if param_hash in parameter_hashes:
                print(f"[lgbm] Skipping duplicate evaluation: {individual}, score reused: {parameter_hashes[param_hash]:.4f}")
                return parameter_hashes[param_hash],
            
            (n_estimators, max_depth, learning_rate, num_leaves, min_child_samples, subsample, 
             colsample_bytree, reg_alpha, reg_lambda, pca_n_components) = individual
            
            model.set_params(
                n_estimators=int(n_estimators),
                max_depth=int(max_depth),
                learning_rate=learning_rate,
                num_leaves=int(num_leaves),
                min_child_samples=int(min_child_samples),
                subsample=subsample,
                colsample_bytree=colsample_bytree,
                reg_alpha=reg_alpha,
                reg_lambda=reg_lambda
            )
            
            pca.n_components = pca_n_components if pca_n_components else None
            
            pipeline = Pipeline(steps=[
                ('preprocessor', preprocessor),
                ('pca', pca),
                ('model', model)
            ])
            
            scores = cross_val_score(pipeline, X, y, cv=cv, scoring=scoring_metric, n_jobs=-1)
            score_mean = scores.mean()
            print(f"[lgbm] Evaluating: {individual} | Score: {score_mean:.4f}")
            parameter_hashes[param_hash] = score_mean  # Cache the score
            return score_mean,

        def custom_mutate_rf(individual):
            print(f"[rf] Before mutation: {individual}")

            for i in range(len(individual)):
                if i == 0:  # n_estimators, integer mutation
                    individual[i] += int(random.gauss(0, 50))  
                    individual[i] = max(100, min(1000, int(individual[i])))  
                elif i == 1:  # max_depth, integer mutation
                    individual[i] += int(random.gauss(0, 5))
                    individual[i] = max(10, min(50, int(individual[i])))
                elif i == 2:  # min_samples_split, integer mutation
                    individual[i] += int(random.gauss(0, 1))
                    individual[i] = max(2, min(10, int(individual[i])))  
                elif i == 3:  # min_samples_leaf, integer mutation
                    individual[i] += int(random.gauss(0, 1))
                    individual[i] = max(1, min(10, int(individual[i])))  
                elif i == 4:  # max_features, categorical mutation
                    individual[i] = random.choice([None, 'sqrt', 'log2'])  
                elif i == 5:  # bootstrap, boolean mutation
                    individual[i] = random.choice([True, False])
                elif i == 6:  # pca_n_components, categorical mutation
                    individual[i] = random.choice([None, 0.90, 0.99])

            print(f"[rf] After mutation: {individual}")
            return individual,

        def custom_mutate_xgb(individual):
            print(f"[xgb] Before mutation: {individual}")

            for i in range(len(individual)):
                if i == 0:  # n_estimators, integer mutation
                    individual[i] += int(random.gauss(0, 50))  
                    individual[i] = max(100, min(1000, int(individual[i])))  
                elif i == 1:  # max_depth, integer mutation
                    individual[i] += int(random.gauss(0, 5))
                    individual[i] = max(10, min(50, int(individual[i])))
                elif i == 2:  # learning_rate, continuous mutation
                    individual[i] += random.gauss(0, 0.01)
                    individual[i] = max(0.01, min(0.3, individual[i]))
                elif i == 3:  # subsample, continuous mutation
                    individual[i] += random.gauss(0, 0.1)
                    individual[i] = max(0.6, min(1.0, individual[i]))
                elif i == 4:  # colsample_bytree, continuous mutation
                    individual[i] += random.gauss(0, 0.1)
                    individual[i] = max(0.5, min(1.0, individual[i]))
                elif i == 5:  # gamma, continuous mutation
                    individual[i] += random.gauss(0, 0.1)
                    individual[i] = max(0, min(1.0, individual[i]))
                elif i == 6:  # reg_alpha, continuous mutation
                    individual[i] += random.gauss(0, 1.0)
                    individual[i] = max(0, min(10.0, individual[i]))
                elif i == 7:  # reg_lambda, continuous mutation
                    individual[i] += random.gauss(0, 1.0)
                    individual[i] = max(0, min(10.0, individual[i]))
                elif i == 8:  # pca_n_components, categorical mutation
                    individual[i] = random.choice([None, 0.90, 0.99])

            print(f"[xgb] After mutation: {individual}")
            return individual,

        def custom_mutate_lgbm(individual):
            print(f"[lgbm] Before mutation: {individual}")

            for i in range(len(individual)):
                if i == 0:  # n_estimators, integer mutation
                    individual[i] += int(random.gauss(0, 50))  
                    individual[i] = max(100, min(1000, int(individual[i])))  
                elif i == 1:  # max_depth, integer mutation
                    individual[i] += int(random.gauss(0, 5))
                    individual[i] = max(10, min(50, int(individual[i])))
                elif i == 2:  # learning_rate, continuous mutation
                    individual[i] += random.gauss(0, 0.01)
                    individual[i] = max(0.01, min(0.3, individual[i]))
                elif i == 3:  # num_leaves, integer mutation
                    individual[i] += int(random.gauss(0, 5))
                    individual[i] = max(10, min(50, int(individual[i])))
                elif i == 4:  # min_child_samples, integer mutation
                    individual[i] += int(random.gauss(0, 5))
                    individual[i] = max(5, min(100, int(individual[i])))
                elif i == 5:  # subsample, continuous mutation
                    individual[i] += random.gauss(0, 0.1)
                    individual[i] = max(0.6, min(1.0, individual[i]))
                elif i == 6:  # colsample_bytree, continuous mutation
                    individual[i] += random.gauss(0, 0.1)
                    individual[i] = max(0.5, min(1.0, individual[i]))
                elif i == 7:  # reg_alpha, continuous mutation
                    individual[i] += random.gauss(0, 1.0)
                    individual[i] = max(0, min(10.0, individual[i]))
                elif i == 8:  # reg_lambda, continuous mutation
                    individual[i] += random.gauss(0, 1.0)
                    individual[i] = max(0, min(10.0, individual[i]))
                elif i == 9:  # pca_n_components, categorical mutation
                    individual[i] = random.choice([None, 0.90, 0.99])

            print(f"[lgbm] After mutation: {individual}")
            return individual,

        def custom_mate(ind1, ind2):
            print(f"Before crossover:\n  Parent1: {ind1}\n  Parent2: {ind2}")
            tools.cxTwoPoint(ind1, ind2)
            print(f"After crossover:\n  Child1: {ind1}\n  Child2: {ind2}")
            return ind1, ind2  

        # DEAP setup
        creator.create("FitnessMax", base.Fitness, weights=(1.0,))
        creator.create("Individual", list, fitness=creator.FitnessMax)

        toolbox = base.Toolbox()
        
        if model_name == 'rf':
            toolbox.register("n_estimators", random.randint, 100, 1000)
            toolbox.register("max_depth", random.randint, 10, 50)
            toolbox.register("min_samples_split", random.randint, 2, 10)
            toolbox.register("min_samples_leaf", random.randint, 1, 8)
            toolbox.register("max_features", random.choice, [None, 'sqrt', 'log2'])
            toolbox.register("bootstrap", random.choice, [True, False])
            toolbox.register("pca_n_components", random.choice, [None, 0.90, 0.99])
            toolbox.register("individual", tools.initCycle, creator.Individual,
                             (toolbox.n_estimators, toolbox.max_depth, toolbox.min_samples_split,
                              toolbox.min_samples_leaf, toolbox.max_features, toolbox.bootstrap,
                              toolbox.pca_n_components), n=1)
            toolbox.register("evaluate", evaluate_individual_rf)
            toolbox.register("mutate", custom_mutate_rf)

        elif model_name == 'xgb':
            toolbox.register("n_estimators", random.randint, 100, 1000)
            toolbox.register("max_depth", random.randint, 10, 50)
            toolbox.register("learning_rate", random.uniform, 0.01, 0.3)
            toolbox.register("subsample", random.uniform, 0.6, 1.0)
            toolbox.register("colsample_bytree", random.uniform, 0.5, 1.0)
            toolbox.register("gamma", random.uniform, 0, 1.0)
            toolbox.register("reg_alpha", random.uniform, 0, 10.0)
            toolbox.register("reg_lambda", random.uniform, 0, 10.0)
            toolbox.register("pca_n_components", random.choice, [None, 0.90, 0.99])
            toolbox.register("individual", tools.initCycle, creator.Individual,
                             (toolbox.n_estimators, toolbox.max_depth, toolbox.learning_rate, toolbox.subsample,
                              toolbox.colsample_bytree, toolbox.gamma, toolbox.reg_alpha, toolbox.reg_lambda,
                              toolbox.pca_n_components), n=1)
            toolbox.register("evaluate", evaluate_individual_xgb)
            toolbox.register("mutate", custom_mutate_xgb)

        elif model_name == 'lgbm':
            toolbox.register("n_estimators", random.randint, 100, 1000)
            toolbox.register("max_depth", random.randint, 10, 50)
            toolbox.register("learning_rate", random.uniform, 0.01, 0.3)
            toolbox.register("num_leaves", random.randint, 10, 50)
            toolbox.register("min_child_samples", random.randint, 5, 100)
            toolbox.register("subsample", random.uniform, 0.6, 1.0)
            toolbox.register("colsample_bytree", random.uniform, 0.5, 1.0)
            toolbox.register("reg_alpha", random.uniform, 0, 10.0)
            toolbox.register("reg_lambda", random.uniform, 0, 10.0)
            toolbox.register("pca_n_components", random.choice, [None, 0.90, 0.99])
            toolbox.register("individual", tools.initCycle, creator.Individual,
                             (toolbox.n_estimators, toolbox.max_depth, toolbox.learning_rate, toolbox.num_leaves,
                              toolbox.min_child_samples, toolbox.subsample, toolbox.colsample_bytree,
                              toolbox.reg_alpha, toolbox.reg_lambda, toolbox.pca_n_components), n=1)
            toolbox.register("evaluate", evaluate_individual_lgbm)
            toolbox.register("mutate", custom_mutate_lgbm)

        toolbox.register("population", tools.initRepeat, list, toolbox.individual)
        toolbox.register("mate", custom_mate)
        toolbox.register("select", tools.selTournament, tournsize=3)
        
        population = toolbox.population(n=n_population)
        algorithms.eaSimple(population, toolbox, cxpb=cxpb, mutpb=mutpb, ngen=n_generations, verbose=True)
        
        best_individual = tools.selBest(population, k=1)[0]
        
        if model_name == 'rf':
            best_model = model.set_params(
                n_estimators=int(best_individual[0]),
                max_depth=int(best_individual[1]),
                min_samples_split=int(best_individual[2]),
                min_samples_leaf=int(best_individual[3]),
                max_features=best_individual[4],
                bootstrap=best_individual[5]
            )
        elif model_name == 'xgb':
            best_model = model.set_params(
                n_estimators=int(best_individual[0]),
                max_depth=int(best_individual[1]),
                learning_rate=best_individual[2],
                subsample=best_individual[3],
                colsample_bytree=best_individual[4],
                gamma=best_individual[5],
                reg_alpha=best_individual[6],
                reg_lambda=best_individual[7]
            )
        elif model_name == 'lgbm':
            best_model = model.set_params(
                n_estimators=int(best_individual[0]),
                max_depth=int(best_individual[1]),
                learning_rate=best_individual[2],
                num_leaves=int(best_individual[3]),
                min_child_samples=int(best_individual[4]),
                subsample=best_individual[5],
                colsample_bytree=best_individual[6],
                reg_alpha=best_individual[7],
                reg_lambda=best_individual[8]
            )
        
        best_estimators[model_name] = Pipeline(steps=[
            ('preprocessor', preprocessor),
            ('pca', pca),
            ('model', best_model)
        ])
    
    estimators = [(name, model) for name, model in best_estimators.items()]

    if problem_type == 'classify':
        stacking_model = StackingClassifier(estimators=estimators, final_estimator=RidgeClassifier(), n_jobs=-1)
    elif problem_type == 'regress':
        stacking_model = StackingRegressor(estimators=estimators, final_estimator=Ridge(), n_jobs=-1)

    print("\nEvaluating Stacking Model...")
    cv_results = cross_val_score(stacking_model, X, y, cv=cv, scoring=scoring_metric, verbose=1, n_jobs=-1)
    
    evaluating_end_time = time.time()
    print(f"\nModel evaluation took: {evaluating_end_time - evaluating_start_time:.2f} seconds.\n")
    return stacking_model, cv_results

## HELPER FUNCTIONS

In [4]:
def manual_undersampling(df, do_under_sample_data=True, target_column=None, desired_samples=1000, seed_value=42, problem_type='classify'):
    if not do_under_sample_data:
        print("Skipping undersampling as 'do_under_sample_data' is set to False.")
        return df

    np.random.seed(seed_value)

    original_distribution = df[target_column].value_counts()
    original_distribution_normalized = df[target_column].value_counts(normalize=True)
    sampled_dataframes = []

    if problem_type == 'classify':
        print("Original class distribution (before resampling):")
        original_distribution_df = pd.DataFrame({
            target_column: original_distribution.index,
            'Raw Counts': original_distribution.values,
            'Proportion': original_distribution_normalized.values
        })
        display(original_distribution_df)
        
        class_distribution = df[target_column].value_counts(normalize=True)
        samples_per_class = (class_distribution * desired_samples).round().astype(int)

        total_sampled = samples_per_class.sum()
        if total_sampled != desired_samples:
            samples_per_class = (class_distribution * (desired_samples / total_sampled * desired_samples)).round().astype(int)
        
        for cls in df[target_column].unique():
            class_df = df[df[target_column] == cls]
            class_sample_count = min(len(class_df), samples_per_class[cls])
            class_df = class_df.sample(n=class_sample_count, random_state=seed_value)
            sampled_dataframes.append(class_df)

        df_resampled = pd.concat(sampled_dataframes)
        new_distribution = df_resampled[target_column].value_counts()
        new_distribution_normalized = df_resampled[target_column].value_counts(normalize=True)
        
        print("\nNew class distribution (after resampling):")
        new_distribution_df = pd.DataFrame({
            target_column: new_distribution.index,
            'Raw Counts': new_distribution.values,
            'Proportion': new_distribution_normalized.values
        })
        display(new_distribution_df)

        final_sampled = len(df_resampled)
        if final_sampled != desired_samples:
            print(f"\nDesired {desired_samples} samples, but obtained {final_sampled} samples due to class distribution constraints.")
    elif problem_type == 'regress':
        if len(df) > desired_samples:
            df_resampled = df.sample(n=desired_samples, random_state=seed_value)
        else:
            df_resampled = df
        final_sampled = len(df_resampled)

    print(f"\nData under-sampled successfully and reduced to {final_sampled} rows.")
    return df_resampled

In [5]:
# Function to get system info
def get_system_info():
    # Get CPU Name
    CPU_Name = None
    try:
        # Try fetching CPU model on Windows
        CPU_Name = platform.processor()
        if not CPU_Name:
            raise ValueError("No CPU name found")
    except Exception:
        try:
            # Try fetching CPU model on Linux
            with open('/proc/cpuinfo') as f:
                for line in f:
                    if "model name" in line:
                        CPU_Name = line.split(':')[1].strip()
                        break
                if not CPU_Name:
                    raise ValueError("No CPU name found")
        except Exception:
            try:
                # Try fetching CPU model on macOS
                result = subprocess.run(['sysctl', '-n', 'machdep.cpu.brand_string'], text=True, capture_output=True)
                CPU_Name = result.stdout.strip()
                if not CPU_Name:
                    raise ValueError("No CPU name found")
            except Exception:
                CPU_Name = "CPU name could not be determined."

    # Get GPU Info
    GPU_Info = None
    try:
        if torch.cuda.is_available():
            GPU_Info = torch.cuda.get_device_name(0)
        else:
            GPU_Info = "No GPU available"
    except Exception:
        GPU_Info = "GPU info could not be determined."

    # Collect system information
    data = {
        "OS Name": os.name,
        "Python Version": sys.version.split()[0],
        "Python Executable": sys.executable,
        "Working Directory": os.getcwd(),
        "Total RAM (GB)": f"{psutil.virtual_memory().total / 1e9:.2f}",
        "Available RAM (GB)": f"{psutil.virtual_memory().available / 1e9:.2f}",
        "Current Memory Use (GB)": f"{psutil.virtual_memory().used / 1e9:.2f}",
        "CPU Name": CPU_Name,
        "CPU Freq": f"{psutil.cpu_freq().current:.2f}Mhz {psutil.cpu_percent()}%",
        "Number of Physical CPUs": psutil.cpu_count(logical=False),
        "CPU Cores": psutil.cpu_count(),
        "GPU Info": GPU_Info,
        "Disk Total (GB)": f"{psutil.disk_usage('/').total / 1e9:.2f}",
        "Disk Free (GB)": f"{psutil.disk_usage('/').free / 1e9:.2f}",
    }

    # Convert dictionary to DataFrame and transpose it
    system_info_df = pd.DataFrame(data, index=['VALUE']).T

    return system_info_df

In [6]:
def obtain_unique_values_columns(df):
    # Sort columns based on the number of unique values, prioritize those with fewer unique values
    sorted_columns = sorted(df.columns, key=lambda col: df[col].nunique())

    # Collect data for the DataFrame
    unique_values_in_columns = []
    for col in sorted_columns:
        num_unique = df[col].nunique()
        unique_values = df[col].unique() if num_unique < 11 else None
        unique_values_in_columns.append({
            "Column": col,
            "Number of Unique Values": num_unique,
            "Unique Values (for under 11 unique)": unique_values
        })

    # Create and return the DataFrame
    unique_values_columns_df = pd.DataFrame(unique_values_in_columns)
    
    return unique_values_columns_df

In [7]:
# Function to save load and test and confirm the model structure
def save_load_test(X, y, model=None, model_file_name=None):
    # Save the model to a file
    with open(model_file_name, 'wb') as file:
        pickle.dump(model, file)
    print(f"Model saved to: {model_file_name}")

    # Load the model from the file
    with open(model_file_name, 'rb') as file:
        loaded_model = pickle.load(file)
    print("\nModel loaded successfully.")

    # Make a prediction on some of the data to ensure it still works
    predictions = loaded_model.predict(X.head())

    # Show prediction results and actual values
    test_results = pd.DataFrame({
        'Prediction': predictions,
        'Actual': y.head()
    })

    return loaded_model, test_results

In [8]:
def load_or_generate_data(load_from_file=None, file_name=None, problem_type=None, desired_samples=None, seed_value=None):
     
    # See if load from file is true
    if load_from_file:
        try:
            # Load the data from the CSV file
            df = pd.read_csv(file_name)
            print("Data loaded successfully.")
        except Exception as e:
            # Look for the file in all subdirectories of /kaggle/
            file_found = False
            try:
                for dirname, _, filenames in os.walk('/kaggle/'):
                    for filename in filenames:
                        if filename == file_name:
                            df = pd.read_csv(os.path.join(dirname, filename))
                            print("Data loaded successfully.")
                            file_found = True
                            break
                    if file_found:
                        break
                if not file_found:
                    raise FileNotFoundError(f"File {file_name} not found in local directory or in any /kaggle/ subdirectory.")
            except Exception as e:
                print(f"An error occurred: {e}")
                print("Please check the file name and path and try again.")
                return None
    else:
        # Generate synthetic data based on problem type
        if problem_type == 'classify':
            X, y = make_classification(n_samples=desired_samples, n_features=20, n_classes=2, n_informative=10, n_clusters_per_class=2, random_state=seed_value)
        elif problem_type == 'regress':
            X, y = make_regression(n_samples=desired_samples, n_features=20, n_informative=10, noise=0.1, random_state=seed_value)
        else:
            raise ValueError("Invalid 'problem_type' specified. Please set 'problem_type' to either 'classify' or 'regress'.")

        # Create a DataFrame from the generated data
        df = pd.DataFrame(X, columns=[f"Feature_{i+1}" for i in range(X.shape[1])])
        df['target'] = y

        print("Data generated successfully.")

    # Show the data
    print("Rows: ", df.shape[0])
    print("Columns: ", df.shape[1])
    print("Total Missing values: ", df.isnull().sum().sum())
    print("Columns with Missing: ", df.columns[df.isnull().any()].tolist())
    print("\nFirst 5 rows of the dataframe: ")
    display(df.head())
    
    return df

In [9]:
def sweetviz_eda(df, report_name="Sweetviz_Report.html", show_eda=True):
    if show_eda:
        try:
            # Create a Sweetviz report
            report = sv.analyze(df)

            # Save the report to an HTML file
            report.show_html(report_name, open_browser=False)

            # Display the report in the Jupyter Notebook
            display(IFrame(src=report_name, width=1200, height=1200))

        except Exception as e:
            print(f"There was an issue generating the Sweetviz report: {e}")
    else:
        print("Skipping EDA as show_eda is not set to True.")

In [10]:
def bug_alert_colab():
    try:
        import google.colab
        print("[Alert from function bug_alert_colab()]\n\nWARNING:\n- Hey, we noticed you are in Google Colab.\n- Please note that as of the time of writing this, LightGBM is not working on Google Colab if GPU is enabled.\n- Consider setting it to no longer auto-use GPU if it's available, if needed.\n- If this is no longer applicable feel free to delete this bug alert function.")
    except ImportError:
        pass

In [11]:
def make_new_or_load_existing_model(df, target_column=None, categorical_threshold=15, model_name='stacking_model', load_existing_model=True, problem_type='classify', n_folds=5, n_population=20, n_generations=10, cxpb=0.5, mutpb=0.2):
    
    def display_cv_results(cv_results):
        print("[Stacking Model Genetic Cross-Validation Results]")
        for i, score in enumerate(cv_results):
            print(f"- Fold {i+1} Score: {score:.3f}")
        print(f"- Average CV Score: {np.mean(cv_results):.3f}")

    def determine_num_and_cat_cols(df, target_column=None, categorical_threshold=15):
        # Initialize lists to hold names of categorical and numerical columns
        cat_cols = []
        num_cols = []

        # If no target column is specified, use the last column by default
        if target_column is None:
            print("\nWARNING: No target column specified, using the last column as default.\n")
            target_column = df.columns[-1]  # Use the last column as the default target

        # Iterate through each column in the DataFrame
        print("[Auto determining categorical and numerical columns for preprocessors]\n")
        for col in df.columns:
            if col != target_column:  # Skip the target column
                # Count the number of unique values in the column
                unique_vals = df[col].nunique()

                # Determine if the column should be categorical
                if unique_vals <= categorical_threshold:
                    # If the column is determined to be categorical due to the threshold
                    cat_cols.append(col)  # Add to categorical columns list
                    print(f"'{col}' auto added to cat_cols, due to passing categorical_threshold.")
                    # Convert the column to strings then category type
                    df[col] = df[col].astype(str).astype('category')
                    print(f"- Converting '{col}' to strings then setting type to category.\n")
                elif df[col].dtype == 'object' or df[col].dtype.name == 'category':
                    # If the column is determined to be categorical due to its type
                    cat_cols.append(col)  # Add to categorical columns list
                    print(f"'{col}' auto added to cat_cols, due to being object or categorical type.")
                    # Convert the column to strings then category type
                    df[col] = df[col].astype(str).astype('category')
                    print(f"- Converting '{col}' to strings then setting type to category.\n")
                else:
                    # If the column is determined to be numerical
                    num_cols.append(col)  # Add to numerical columns list
                    print(f"'{col}' auto added to num_cols, due to not being an object or categorical type.")
                    # Convert the column to a float type
                    df[col] = df[col].astype(float)
                    print(f"- Converting '{col}' to floats.\n")

        # Separate the features (X) from the target (y)
        X = df.drop(columns=[target_column])  # Drop the target column to get the feature matrix
        y = df[target_column]  # Extract the target column

        # Return the features, target, and lists of categorical and numerical columns
        return X, y, cat_cols, num_cols

    model_filename = f"{model_name}.pkl"
    stacking_model = None

    if load_existing_model:
        try:
            print(f"Trying to load the model '{model_filename}' from the file since 'load_existing_model' is set to True.")
            # Load the model from the file
            with open(model_filename, 'rb') as file:
                stacking_model = pickle.load(file)
                print(f"Model '{model_filename}' loaded successfully.")
        except FileNotFoundError:
            print(f"File '{model_filename}' not found. Proceeding to create a new model.")
            load_existing_model = False
        except Exception as e:
            print(f"An error occurred loading the model '{model_filename}': {e}")
            print("Please check the file name and path and try again.")
            load_existing_model = False

    if not load_existing_model:
        print(f"Creating a new model since 'load_existing_model' is set to False or loading the model failed.")
        print("\nTraining will now begin.\n")
        # Preprocess Data and Create a New Model
        X, y, cat_cols, num_cols = determine_num_and_cat_cols(df, target_column=target_column, categorical_threshold=categorical_threshold)
        stacking_model, cv_results = make_pipeline_and_genetic_search(X, y, cat_cols, num_cols,
                                                                      problem_type=problem_type,
                                                                      n_folds=n_folds,
                                                                      n_population=n_population,
                                                                      n_generations=n_generations,
                                                                      cxpb=cxpb,
                                                                      mutpb=mutpb
                                                                      )
        display_cv_results(cv_results)
        # Fit the model on all the data before saving
        print("\nEvaluation complete, fitting on all available data.")
        stacking_model.fit(X, y)

        # Save the newly created model
        with open(model_filename, 'wb') as file:
            pickle.dump(stacking_model, file)
        print(f"\nNew model '{model_filename}' created and saved successfully.")

        # Save the pipeline structure to 'pipeline.html'
        with open("pipeline.html", "w", encoding="utf-8") as f:
            f.write(estimator_html_repr(stacking_model))
        print("Pipeline structure has been saved to 'pipeline.html'.")

    return stacking_model

In [12]:
def get_permutation_importance_plots(df, target_column=None, get_permutation_importance=None, model=None, n_folds=5, seed_value=42, n_repeats=10, colorscale='Viridis'):
    if get_permutation_importance is None or not get_permutation_importance:
        print("Skipping permutation importance as get_permutation_importance is set to False.")
        return None

    if target_column is None:
        raise ValueError("You must specify a target_column.")

    # Separate the features (X) and the target (y)
    X = df.drop(columns=[target_column])
    y = df[target_column]

    # Function to calculate permutation importance for a single fold
    def compute_permutation_importance(fold):
        return permutation_importance(model, X, y, n_repeats=n_repeats, random_state=seed_value + fold).importances_mean

    # Use joblib to parallelize the permutation importance computation across folds
    permutation_importance_folds = Parallel(n_jobs=-1)(delayed(compute_permutation_importance)(fold) for fold in range(n_folds))

    # Convert the list to a NumPy array for easier manipulation
    permutation_importance_folds = np.array(permutation_importance_folds)

    # Create the 3D Plotly figure
    x = np.array([np.arange(len(X.columns)) for _ in range(n_folds)])
    y = np.array([[fold] * len(X.columns) for fold in range(n_folds)])
    z = permutation_importance_folds

    fig_3d = go.Figure(data=[go.Surface(z=z, x=x, y=y, colorscale=colorscale)])
    fig_3d.update_layout(
        title='3D Permutation Importance Across CV Folds',
        scene=dict(
            xaxis=dict(
                ticktext=X.columns,
                tickvals=np.arange(len(X.columns)),
                title="Features"
            ),
            yaxis=dict(title="CV Fold"),
            zaxis=dict(title="Permutation Importance"),
            camera_eye=dict(x=1.5, y=1.5, z=0.6)
        ),
        width=800,
        height=800
    )

    # Return the 3D figure
    return fig_3d

## HELPER CLASSES

In [13]:
# This can be integrated into SKlearn pipeline if desired, or used standalone to fit and transform data.
class CustomSupConFeatureExtractor(BaseEstimator, TransformerMixin):
    class SupConLoss(nn.Module):
        def __init__(self, temperature=0.07):
            super(CustomSupConFeatureExtractor.SupConLoss, self).__init__()
            self.temperature = temperature

        def forward(self, features, labels):
            features = nn.functional.normalize(features, dim=1)
            similarity_matrix = torch.matmul(features, features.T) / self.temperature
            labels = labels.unsqueeze(1)
            mask = torch.eq(labels, labels.T).float()
            log_softmax_sim = nn.functional.log_softmax(similarity_matrix, dim=1)
            loss = -torch.sum(mask * log_softmax_sim) / torch.sum(mask)
            return loss

    class SimpleNN(nn.Module):
        def __init__(self, input_dim, output_dim):
            super(CustomSupConFeatureExtractor.SimpleNN, self).__init__()
            self.fc1 = nn.Linear(input_dim, 128)
            self.fc2 = nn.Linear(128, 64)
            self.fc3 = nn.Linear(64, output_dim)

        def forward(self, x):
            x = torch.relu(self.fc1(x))
            x = torch.relu(self.fc2(x))
            x = self.fc3(x)
            return x

    def __init__(self, input_dim, output_dim=None, temperature=0.07, epochs=1, batch_size=32, learning_rate=0.01, early_stopping=False, patience=5):
        self.input_dim = input_dim
        self.output_dim = output_dim  # Keep the original value of output_dim
        self.temperature = temperature
        self.epochs = epochs
        self.batch_size = batch_size
        self.learning_rate = learning_rate
        self.early_stopping = early_stopping
        self.patience = patience
        self.model = None  # Initialize model as None
        self.criterion = None
        self.optimizer = None

        if self.input_dim is not None and self.output_dim is not None:
            if isinstance(self.output_dim, float) and self.output_dim < 1:
                effective_output_dim = max(1, int(np.ceil(self.input_dim * self.output_dim)))
            else:
                effective_output_dim = self.output_dim

            self.model = self.SimpleNN(input_dim, effective_output_dim)
            self.criterion = self.SupConLoss(temperature=temperature)
            self.optimizer = optim.Adam(self.model.parameters(), lr=self.learning_rate)

    def fit(self, X, y=None):
        if self.model is None:
            return self  # Skip fitting if model is None

        # Automatically convert DataFrame to numpy array if necessary
        if isinstance(X, pd.DataFrame):
            X = X.values
        if isinstance(y, pd.Series):
            y = y.values

        train_dataset = torch.utils.data.TensorDataset(torch.tensor(X, dtype=torch.float32), torch.tensor(y, dtype=torch.long))
        train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=self.batch_size, shuffle=True)

        best_loss = np.inf
        epochs_no_improve = 0

        for epoch in range(self.epochs):
            self.model.train()
            epoch_loss = 0

            for inputs, labels in train_loader:
                self.optimizer.zero_grad()
                features = self.model(inputs)
                loss = self.criterion(features, labels)
                loss.backward()
                self.optimizer.step()
                epoch_loss += loss.item()

            epoch_loss /= len(train_loader)

            if self.early_stopping:
                if epoch_loss < best_loss:
                    best_loss = epoch_loss
                    epochs_no_improve = 0
                else:
                    epochs_no_improve += 1

                if epochs_no_improve >= self.patience:
                    print(f"[CustomSupConFeatureExtractor] Early stopping triggered after {epoch+1} epochs.")
                    break

        return self

    def transform(self, X):
        if self.model is None:
            return X  # Return the original features if model is None

        # Automatically convert DataFrame to numpy array if necessary
        if isinstance(X, pd.DataFrame):
            X = X.values

        self.model.eval()
        with torch.no_grad():
            X_tensor = torch.tensor(X, dtype=torch.float32)
            return self.model(X_tensor).numpy()

# SET IMPORTANT VARIABLES

In [14]:
# Basics

seed_value = 42                      # For reproducibility
problem_type = 'classify'            # 'classify' or 'regress' depending on your task.
target_column = 'Pop'                # The target column in your dataset.
load_from_file = True                # Load data from a CSV file or generate sample data.
file_name = 'possum.csv'             # Provide the name of the CSV file if loading data from a file.
desired_samples = 4000               # Controls the total number of samples to generate if not loading from a file.
                                     # Determines the number of samples retained after undersampling if applicable.
do_under_sample_data = True          # Controls if you want to undersample your data to balance class distribution or reduce dataset size.
                                     # The desired_samples parameter will control the number of samples retained.
show_eda = True                      # Decide whether to display EDA (Exploratory Data Analysis).

# Genetic Algorithm and Model Loading Settings

load_existing_model = False          # Load an existing model from the current directory.
                                     # The code saves the model after making by default, so will use the same model unless you delete it.
model_name = 'possum_model'          # The name to save the new model as, or the name to try loading the existing model as.
n_folds = 5                          # Set the number of folds for cross-validation. This setting applies to everything that uses cross-validation in the notebook.
                                     # This also applies to both the cross-validation during permutation feature importance
categorical_threshold = 15           # Adjust the threshold for treating a column as categorical.
n_population = 9                     # Adjust the population size for the genetic algorithm.
n_generations = 4                    # Set the number of generations for the genetic algorithm.
cxpb = 0.5                           # Modify the crossover probability for the genetic algorithm.
mutpb = 0.2                          # Adjust the mutation probability for the genetic algorithm.

# Permutation Feature Importance Settings

get_permutation_importance = True    # Decide whether to show permutation importance.
n_repeats = 10                       # Set the number of times to repeat the permutation importance calculation.
                                     # This will be validated by the same number of folds in n_folds.

# Hide warnings
warnings.simplefilter("ignore")  

# Set random seeds for reproducibility
os.environ['PYTHONHASHSEED'] = str(seed_value)
random.seed(seed_value)
np.random.seed(seed_value)
torch.manual_seed(seed_value)
if torch.cuda.is_available():
    torch.cuda.manual_seed(seed_value)
    torch.cuda.manual_seed_all(seed_value)  # if using multi-GPU
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False

# KNOWN BUG ALERTS

In [15]:
# If the user is in colab, give them any warnings related to colab
bug_alert_colab()

# SYSTEM INFO

In [16]:
# Get system information
system_info_df = get_system_info()

# Display system information
system_info_df

Unnamed: 0,VALUE
OS Name,posix
Python Version,3.7.12
Python Executable,/opt/conda/bin/python
Working Directory,/kaggle/working
Total RAM (GB),33.66
Available RAM (GB),32.42
Current Memory Use (GB),0.76
CPU Name,x86_64
CPU Freq,2200.26Mhz 21.5%
Number of Physical CPUs,2


# LOAD OR GENERATE DATA

<div style="border:3px solid #9370DB; padding: 15px; border-radius: 15px; background-color: #F3E5F5;">
    <h3 style="color: #6A5ACD; text-align: center;">Reminder: Adjust Data Loading and Sampling Settings</h3>
    <p style="font-size: 15px; color: #333333; text-align: center;">
        Before proceeding, please make sure to review and adjust the following settings in the <strong>SET IMPORTANT VARIABLES</strong> section of this notebook as needed:
    </p>
    <ul style="font-size: 14px; color: #333333;">
        <li><strong>load_from_file</strong>: Load data from a CSV file or generate sample data.</li>
        <li><strong>file_name</strong>: Provide the name of the CSV file if loading data from a file.</li>
        <li><strong>problem_type</strong>: 'classify' or 'regress' depending on your task.</li>
        <li><strong>desired_samples</strong>: 
            <ul>
                <li>Controls the total number of samples to generate if not loading from a file.</li>
                <li>Determines the number of samples retained after undersampling if applicable.</li>
            </ul>
        </li>
        <li><strong>seed_value</strong>: For reproducibility.</li>
    </ul>
</div>


In [17]:
df = load_or_generate_data(load_from_file=load_from_file, 
                           file_name=file_name, 
                           problem_type=problem_type,
                           desired_samples=desired_samples,
                           seed_value=seed_value)

Data loaded successfully.
Rows:  104
Columns:  14
Total Missing values:  3
Columns with Missing:  ['age', 'footlgth']

First 5 rows of the dataframe: 


Unnamed: 0,case,site,Pop,sex,age,hdlngth,skullw,totlngth,taill,footlgth,earconch,eye,chest,belly
0,1,1,Vic,m,8.0,94.1,60.4,89.0,36.0,74.5,54.5,15.2,28.0,36.0
1,2,1,Vic,f,6.0,92.5,57.6,91.5,36.5,72.5,51.2,16.0,28.5,33.0
2,3,1,Vic,f,6.0,94.0,60.0,95.5,39.0,75.4,51.9,15.5,30.0,34.0
3,4,1,Vic,f,6.0,93.2,57.1,92.0,38.0,76.1,52.2,15.2,28.0,34.0
4,5,1,Vic,f,2.0,91.5,56.3,85.5,36.0,71.0,53.2,15.1,28.5,33.0


# DTYPES

In [18]:
# Show the data types of each column
df.dtypes

case          int64
site          int64
Pop          object
sex          object
age         float64
hdlngth     float64
skullw      float64
totlngth    float64
taill       float64
footlgth    float64
earconch    float64
eye         float64
chest       float64
belly       float64
dtype: object

# UNIQUES

In [19]:
unique_values_columns_df = obtain_unique_values_columns(df)

unique_values_columns_df

Unnamed: 0,Column,Number of Unique Values,Unique Values (for under 11 unique)
0,Pop,2,"[Vic, other]"
1,sex,2,"[m, f]"
2,site,7,"[1, 2, 3, 4, 5, 6, 7]"
3,age,9,"[8.0, 6.0, 2.0, 1.0, 9.0, 5.0, 3.0, 4.0, 7.0, ..."
4,taill,19,
5,chest,19,
6,belly,24,
7,totlngth,34,
8,eye,35,
9,skullw,64,


# OPTIONALLY UNDERSAMPLE

<div style="border:3px solid #FFD700; padding: 15px; border-radius: 15px; background-color: #FFFACD;">
    <h3 style="color: #DAA520; text-align: center;">Reminder: Adjust Sampling and Undersampling Settings</h3>
    <p style="font-size: 15px; color: #333333; text-align: center;">
        Before proceeding, please make sure to review and adjust the following settings in the <strong>SET IMPORTANT VARIABLES</strong> section of this notebook as needed:
    </p>
    <ul style="font-size: 14px; color: #333333;">
        <li><strong>do_undersample_data</strong>: Controls whether to undersample your data (maintaining class balance in classification tasks) regardless of whether the data was loaded or generated. The <strong>desired_samples</strong> parameter controls the number of samples retained after undersampling.</li>
        <li><strong>target_column</strong>: The target column in your dataset.</li>
        <li><strong>desired_samples</strong>: 
            <ul>
                <li>Controls the total number of samples to generate if not loading from a file.</li>
                <li>Determines the number of samples retained after undersampling, if applicable.</li>
            </ul>
        </li>
        <li><strong>seed_value</strong>: For reproducibility.</li>
        <li><strong>problem_type</strong>: 'classify' or 'regress' depending on your task.</li>
    </ul>
</div>


In [20]:
# Optionally under-sample the data
df = manual_undersampling(df, 
                          do_under_sample_data=do_under_sample_data, 
                          target_column=target_column, 
                          desired_samples=desired_samples, 
                          seed_value=seed_value, 
                          problem_type=problem_type)

Original class distribution (before resampling):


Unnamed: 0,Pop,Raw Counts,Proportion
0,other,58,0.557692
1,Vic,46,0.442308



New class distribution (after resampling):


Unnamed: 0,Pop,Raw Counts,Proportion
0,other,58,0.557692
1,Vic,46,0.442308



Desired 4000 samples, but obtained 104 samples due to class distribution constraints.

Data under-sampled successfully and reduced to 104 rows.


# MANUAL ADJUSTMENTS

<div style="border:2px solid #FF6347; padding: 10px; border-radius: 10px; background-color: #FFE4E1;">
    <h3 style="color: #FF4500; text-align: center;">Reminder: Perform Manual Adjustments</h3>
    <p style="font-size: 14px; color: #333333; text-align: center;">
        This section is intended for any manual adjustments you may need to make to your target or features. Ensure that you review and apply any necessary changes here.
    </p>
    <ul style="font-size: 14px; color: #333333;">
        <li><strong>Target Adjustments</strong>: Check if any conversions or adjustments are required for your target variable.</li>
        <li><strong>Feature Adjustments</strong>: Review your features to see if any manual modifications are needed.</li>
    </ul>
</div>

In [21]:
# Tell user target adjustments being made
print("Target manually adjusted as per the following steps:")

# Convert values in the "Pop" column from "Vic" to 1 and "Other" to 0
df.loc[df["Pop"] == "Vic", "target"] = "1"
df.loc[df["Pop"] == "other", "target"] = "0"
print(" - Converted target values from Vic and other to 1 and 0")

print("\nFeatures manually adjusted as per the following steps:")

print(" - No conversions needed")

Target manually adjusted as per the following steps:
 - Converted target values from Vic and other to 1 and 0

Features manually adjusted as per the following steps:
 - No conversions needed


# EDA

<div style="border:3px solid #32CD32; padding: 15px; border-radius: 15px; background-color: #F0FFF0;">
    <h3 style="color: #228B22; text-align: center;">Reminder: Adjust EDA Display Setting</h3>
    <p style="font-size: 15px; color: #333333; text-align: center;">
        Before proceeding, please make sure to review and adjust the following settings in the <strong>SET IMPORTANT VARIABLES</strong> section of this notebook as needed:
    </p>
    <ul style="font-size: 14px; color: #333333;">
        <li><strong>show_eda</strong>: Decide whether to display EDA (Exploratory Data Analysis) based on whether you need a quick overview of the data or wish to skip it for rapid testing.</li>
    </ul>
</div>


In [22]:
sweetviz_eda(df, 
             show_eda=show_eda)

                                             |          | [  0%]   00:00 -> (? left)

Report Sweetviz_Report.html was generated.


# GENETIC CV

<div style="border:2px solid #87CEFA; padding: 20px; border-radius: 10px; background-color: #F0F8FF; color: #333333; font-family: Arial, sans-serif;">
    <h3 style="color: #4682B4; text-align: center;">Reminder: Adjust Genetic Search and Model Loading Settings</h3>
    <p style="font-size: 14px; color: #333333; text-align: center;">
        Before proceeding, please make sure to review and adjust the following settings in the <strong>SET IMPORTANT VARIABLES</strong> section of this notebook as needed:
    </p>
    <ul style="font-size: 14px; color: #333333; margin-top: 10px;">
        <li><strong>target_column</strong>: The target column in your dataset.</li>
        <li><strong>categorical_threshold</strong>: Adjust the threshold for treating a column as categorical.</li>
        <li><strong>model_name</strong>: The name to save the new model as, or the name to try loading the existing model as.</li>
        <li><strong>load_existing_model</strong>: Decide whether to load an existing model from the current directory. Remember, the code saves the model at the end, so by default, it will use the same model unless you delete it.</li>
        <li><strong>problem_type</strong>: 'classify' or 'regress' depending on your task.</li>
        <li><strong>n_folds</strong>: Set the number of folds for cross-validation. This setting applies to everything that uses cross-validation in the notebook.</li>
        <li><strong>n_population</strong>: Adjust the number of individuals in the population based on your problem size and available resources.</li>
        <li><strong>n_generations</strong>: Modify the number of generations to ensure adequate exploration of the search space.</li>
        <li><strong>cxpb</strong>: Review the crossover probability to balance exploration and exploitation.</li>
        <li><strong>mutpb</strong>: Adjust the mutation probability to control the diversity in the population.</li>
    </ul>
    <p style="font-size: 14px; color: #333333; margin-top: 15px;">
        <strong>Technical Example:</strong><br>
        Let's follow the journey of three models: Model1 (RandomForest), Model2 (XGBoost), and Model3 (LightGBM) through the genetic algorithm.
    </p>
    <ol style="font-size: 14px; color: #333333; margin-top: 10px;">
        <li><strong>Initial Population Generation:</strong><br>
        The genetic algorithm starts by generating an initial population of lets say 20 individuals for each model. Each individual is a unique configuration of hyperparameters. For example, an individual for Model1 might have 500 trees (<code>n_estimators</code>), a maximum depth of 15 (<code>max_depth</code>), and a 70% sampling rate (<code>subsample</code>).</li>
        <li><strong>Evaluation:</strong><br>
        Each individual's fitness is evaluated using cross-validation on the dataset. For instance, an individual for Model1 might achieve an accuracy of 0.85. This fitness score determines how well the individual performs.</li>
        <li><strong>Tournament Selection:</strong><br>
        The algorithm conducts tournaments to select parents for the next generation. In each tournament, a small group of individuals is randomly selected from the population. The individual with the highest fitness in this group wins the tournament and is selected as a parent. For example, if Model3's tournament includes individuals with accuracies of 0.82, 0.84, and 0.88, the individual with 0.88 would be selected.</li>
        <li><strong>Crossover (Mating):</strong><br>
        Once parents are selected, they undergo crossover to produce offspring. For example, if a Model1 individual with 500 trees and a maximum depth of 15 mates with another individual with 400 trees and a depth of 20, their offspring might inherit 500 trees from one parent and a depth of 20 from the other.</li>
        <li><strong>Mutation:</strong><br>
        After crossover, some offspring undergo mutation. This might involve randomly increasing the <code>learning_rate</code> of a Model2 individual from 0.1 to 0.12. Mutation ensures that the algorithm continues to explore new regions of the hyperparameter space.</li>
        <li><strong>New Generation:</strong><br>
        The offspring replace the old population, and the process of evaluation, selection, crossover, and mutation is repeated. Over multiple generations, the population evolves, with better-performing individuals gradually becoming more common.</li>
        <li><strong>Final Model Selection:</strong><br>
        After 10 generations, the best individual for each model is selected. For example, Model1 might have evolved to use 700 trees with a maximum depth of 18, achieving an accuracy of 0.90. These best individuals are the final optimized models.</li>
        <li><strong>Stacking:</strong><br>
        The optimized models (Model1, Model2, and Model3) are then combined in a stacking ensemble. In stacking, the predictions from each model are used as inputs to a final estimator (e.g., RidgeClassifier). This final model learns how to best combine the strengths of Model1, Model2, and Model3 to improve overall prediction performance.</li>
        <li><strong>Final Evaluation:</strong><br>
        The stacked model is evaluated using cross-validation to ensure that it generalizes well to unseen data. The final result is a robust, optimized model that benefits from the combined power of multiple machine learning algorithms.</li>
    </ol>
</div>

In [23]:
# Make new or load existing stacking model
stacking_model = make_new_or_load_existing_model(df,
                                                 target_column=target_column,
                                                 categorical_threshold=categorical_threshold,
                                                 model_name=model_name, 
                                                 load_existing_model=load_existing_model, 
                                                 problem_type=problem_type, 
                                                 n_folds=n_folds, 
                                                 n_population=n_population, 
                                                 n_generations=n_generations, 
                                                 cxpb=cxpb, 
                                                 mutpb=mutpb)

Creating a new model since 'load_existing_model' is set to False or loading the model failed.

Training will now begin.

[Auto determining categorical and numerical columns for preprocessors]

'case' auto added to num_cols, due to not being an object or categorical type.
- Converting 'case' to floats.

'site' auto added to cat_cols, due to passing categorical_threshold.
- Converting 'site' to strings then setting type to category.

'sex' auto added to cat_cols, due to passing categorical_threshold.
- Converting 'sex' to strings then setting type to category.

'age' auto added to cat_cols, due to passing categorical_threshold.
- Converting 'age' to strings then setting type to category.

'hdlngth' auto added to num_cols, due to not being an object or categorical type.
- Converting 'hdlngth' to floats.

'skullw' auto added to num_cols, due to not being an object or categorical type.
- Converting 'skullw' to floats.

'totlngth' auto added to num_cols, due to not being an object or categor

[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done   5 out of   5 | elapsed:  1.6min finished



Model evaluation took: 210.40 seconds.

[Stacking Model Genetic Cross-Validation Results]
- Fold 1 Score: 1.000
- Fold 2 Score: 1.000
- Fold 3 Score: 1.000
- Fold 4 Score: 0.952
- Fold 5 Score: 0.900
- Average CV Score: 0.970

Evaluation complete, fitting on all available data.

New model 'possum_model.pkl' created and saved successfully.
Pipeline structure has been saved to 'pipeline.html'.


# PERMUTATION FEATURE IMPORTANCE CV
### (Final Estimator In Stack)

<div style="border:2px solid #FF69B4; padding: 10px; border-radius: 10px; background-color: #FFD1DC;">
    <h3 style="color: #FF1493; text-align: center;">Reminder: Adjust Permutation Feature Importance Settings</h3>
    <p style="font-size: 14px; color: #4B0082; text-align: center;">
        Before proceeding, please make sure to review and adjust the following settings in the <strong>SET IMPORTANT VARIABLES</strong> section of this notebook as needed:
    </p>
    <ul style="font-size: 14px; color: #4B0082;">
        <li><strong>target_column</strong>: The target column in your dataset.</li>
        <li><strong>get_permutation_importance</strong>: Decide whether to show permutation importance.</li>
        <li><strong>model</strong>: The model used for generating permutation importance plots.</li>
        <li><strong>n_folds</strong>: Set the number of folds for cross-validation.</li>
        <li><strong>seed_value</strong>: For reproducibility.</li>
        <li><strong>n_repeats</strong>: Set the number of times to repeat the permutation importance calculation. This will be validated by the same number of folds in <code>n_folds</code>.</li>
        <li><strong>colorscale</strong>: Optionally control the colorscale to be used for the plot (e.g., 'Inferno'). This setting is not actually in SET IMPORTANT VARIABLES since it's not important so you will need to set it here.</li>
    </ul>
</div>


In [24]:
# Perform n_folds cross-validation using the stacking model while permutating each feature n_repeats times to determine most important features
permutation_fig_3d = get_permutation_importance_plots(df=df, 
                                                      target_column=target_column,
                                                      get_permutation_importance=get_permutation_importance, 
                                                      model=stacking_model, 
                                                      n_folds=n_folds, 
                                                      seed_value=seed_value, 
                                                      n_repeats=n_repeats, 
                                                      colorscale='Inferno')

# Assuming permutation_fig_3d is returned from the function
if permutation_fig_3d is not None:
    # Save the figure as an HTML file
    permutation_fig_3d.write_html('permutation_fig_3d.html')

    # Display the saved HTML file in a notebook
    display(IFrame('permutation_fig_3d.html', width=900, height=900))