# Playground S4E3 - Steel Plate Defect Prediction

Author: [shpatrickguo](https://www.kaggle.com/shpatrickguo)

The goal of this notebook is to predict the probability of each of several defect types on steel plates. The competition dataset (both train and test) was generated from a deep learning model trained on the [Steel Plates Faults dataset](https://archive.ics.uci.edu/dataset/198/steel+plates+faults) from UCI. Submissions are scored by computing the ROC AUC separately for each defect class and then averaging the per-class scores into an overall AUC.

There are 7 different types of defects that can occur in steel plates:

- `Pastry`
- `Z_Scratch`
- `K_Scatch`
- `Stains`
- `Dirtiness`
- `Bumps`
- `Other_Faults`
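
As a rough illustration of the metric described above, the sketch below computes one ROC AUC per defect class and averages them. The helper name `mean_columnwise_auc` and the toy data are assumptions made for this example; they are not part of the competition code.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

target_classes = ["Pastry", "Z_Scratch", "K_Scatch", "Stains",
                  "Dirtiness", "Bumps", "Other_Faults"]

def mean_columnwise_auc(y_true, y_prob):
    """Return (mean AUC, list of per-class AUCs) for (n_samples, n_classes) arrays."""
    y_true = np.asarray(y_true)
    y_prob = np.asarray(y_prob)
    per_class = [
        roc_auc_score(y_true[:, j], y_prob[:, j])
        for j in range(y_true.shape[1])
    ]
    return float(np.mean(per_class)), per_class

# Toy data (made up): one defect per row, scores are a noisy copy of the truth.
rng = np.random.default_rng(0)
labels = rng.integers(0, len(target_classes), size=200)
y_true = np.eye(len(target_classes), dtype=int)[labels]
y_prob = 0.3 * y_true + rng.random(y_true.shape)

score, per_class = mean_columnwise_auc(y_true, y_prob)
for name, auc in zip(target_classes, per_class):
    print(f"{name}: {auc:.3f}")
print(f"Mean AUC: {score:.3f}")
```

Because each class is scored independently, the predicted probabilities in a row do not need to sum to 1.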

## Imports

In [1]:
%%capture
# Install extra packages
!pip install lazypredict -q

In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import roc_auc_score
from sklearn.ensemble import VotingClassifier
from sklearn.model_selection import train_test_split, StratifiedKFold, RepeatedStratifiedKFold
from sklearn.preprocessing import StandardScaler
import xgboost as xgb
from xgboost import XGBClassifier
import lightgbm as lgb
from lightgbm import LGBMClassifier
import optuna
from sklearn.inspection import permutation_importance
import lazypredict
from lazypredict.Supervised import LazyClassifier
import time
import json
import warnings
from collections import defaultdict
import h2o
from h2o.automl import H2OAutoML

warnings.filterwarnings('ignore')

## Load Data

1. Location Features:
    - `X_Minimum`: The minimum x-coordinate of the fault.
    - `X_Maximum`: The maximum x-coordinate of the fault.
    - `Y_Minimum`: The minimum y-coordinate of the fault.
    - `Y_Maximum`: The maximum y-coordinate of the fault.
2. Size Features:
    - `Pixels_Areas`: Area of the fault in pixels.
    - `X_Perimeter`: Perimeter along the x-axis of the fault.
    - `Y_Perimeter`: Perimeter along the y-axis of the fault.
3. Luminosity Features:
    - `Sum_of_Luminosity`: Sum of luminosity values in the fault area.
    - `Minimum_of_Luminosity`: Minimum luminosity value in the fault area.
    - `Maximum_of_Luminosity`: Maximum luminosity value in the fault area.
4. Material and Index Features:
    - `TypeOfSteel_A300`: Type of steel (A300).
    - `TypeOfSteel_A400`: Type of steel (A400).
    - `Steel_Plate_Thickness`: Thickness of the steel plate.
    - `Edges_Index`, `Empty_Index`, `Square_Index`, `Outside_X_Index`, `Edges_X_Index`, `Edges_Y_Index`, `Outside_Global_Index`: Various index values related to edges and geometry.
5. Logarithmic Features:
    - `LogOfAreas`: Logarithm of the area of the fault.
    - `Log_X_Index`, `Log_Y_Index`: Logarithmic indices related to X and Y coordinates.
6. Statistical Features:
    - `Orientation_Index`: Index describing orientation.
    - `Luminosity_Index`: Index related to luminosity.
    - `SigmoidOfAreas`: Sigmoid function applied to areas.

In [3]:
train = pd.read_csv('/kaggle/input/playground-series-s4e3/train.csv')
faults = pd.read_csv('/kaggle/input/faulty-steel-plates/faults.csv')
test = pd.read_csv('/kaggle/input/playground-series-s4e3/test.csv')
sample_submission = pd.read_csv('/kaggle/input/playground-series-s4e3/sample_submission.csv')

## Feature Engineering

Feature generation adapted from https://www.kaggle.com/competitions/playground-series-s4e3/discussion/481687 by [Ivan Zadorozniy](https://www.kaggle.com/ivanzadorozniy).

In [4]:
# Define target classes
target_classes = ["Pastry", "Z_Scratch", "K_Scatch", "Stains", "Dirtiness", "Bumps", "Other_Faults"]

# Remove 'id' column from train and test DataFrames
train.drop("id", axis=1, inplace=True)
test.drop("id", axis=1, inplace=True)

# Calculate the sum of target classes for each row
row_sums = train[target_classes].sum(axis=1)

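# Keep only rows with exactly one positive label (drops rows with zero or several labels).
# Note: filtered_train is not used below; the unfiltered train frame is carried forward.
filtered_train = train[(row_sums > 0) & (row_sums <= 1)]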

# Specify if dataset is synthetically generated
train['generated'] = 1
faults['generated'] = 0
test['generated'] = 1

# Concatenate faults DataFrame with train DataFrame (ignore_index=True already resets the index)
train = pd.concat([train, faults], ignore_index=True)

# Separate features (X) and target (y)
X = train.drop(target_classes, axis=1)
y = train[target_classes]
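
The cell above computes `filtered_train` but does not use it afterwards. As a small sketch of what that row filter selects, the toy frame below (made-up data, not the competition file) keeps only rows with exactly one positive label:

```python
import pandas as pd

target_classes = ["Pastry", "Z_Scratch", "K_Scatch", "Stains",
                  "Dirtiness", "Bumps", "Other_Faults"]

# Toy label frame: all zeros, then a few labels set by hand.
toy = pd.DataFrame(0, index=range(4), columns=target_classes)
toy.loc[0, "Pastry"] = 1                    # exactly one label -> kept
toy.loc[1, ["Bumps", "Other_Faults"]] = 1   # two labels -> dropped
# row 2 has no label at all -> dropped
toy.loc[3, "Stains"] = 1                    # exactly one label -> kept

row_sums = toy[target_classes].sum(axis=1)
single_label = toy[row_sums == 1]
print(single_label.index.tolist())
```

For 0/1 labels, `row_sums == 1` is equivalent to the `(row_sums > 0) & (row_sums <= 1)` condition used in the notebook.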

In [5]:
def generate_features(data):
    epsilon = 1e-6  # A small constant to avoid division by zero or taking the logarithm of zero
    
    # Location Features
    data['X_Distance'] = data['X_Maximum'] - data['X_Minimum']
    data['Y_Distance'] = data['Y_Maximum'] - data['Y_Minimum']

    # Density Feature
    data['Density'] = data['Pixels_Areas'] / (data['X_Perimeter'] + data['Y_Perimeter'] + epsilon)

    # Relative Perimeter Feature
    data['Relative_Perimeter'] = data['X_Perimeter'] / (data['X_Perimeter'] + data['Y_Perimeter'] + epsilon)

    # Circularity Feature
    data['Circularity'] = data['Pixels_Areas'] / (data['X_Perimeter'] ** 2 + epsilon)

    # Symmetry Index Feature
    data['Symmetry_Index'] = np.abs(data['X_Distance'] - data['Y_Distance']) / (data['X_Distance'] + data['Y_Distance'] + epsilon)

    # Color Contrast Feature
    data['Color_Contrast'] = data['Maximum_of_Luminosity'] - data['Minimum_of_Luminosity']

    # Combined Geometric Index Feature
    data['Combined_Geometric_Index'] = data['Edges_Index'] * data['Square_Index']

    # Interaction Term Feature
    data['X_Distance*Pixels_Areas'] = data['X_Distance'] * data['Pixels_Areas']

    # Additional Features
    data['sin_orientation'] = np.sin(data['Orientation_Index'])
    data['Edges_Index2'] = np.exp(data['Edges_Index'] + epsilon)
    data['X_Maximum2'] = np.sin(data['X_Maximum'])
    data['Y_Minimum2'] = np.sin(data['Y_Minimum'])
    data['Aspect_Ratio_Pixels'] = np.where(data['Y_Perimeter'] == 0, 0, data['X_Perimeter'] / (data['Y_Perimeter'] + epsilon))
    data['Aspect_Ratio'] = np.where(data['Y_Distance'] == 0, 0, data['X_Distance'] / (data['Y_Distance'] + epsilon))

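    # Average Luminosity Feature
    # Note: this averages Sum_of_Luminosity with Minimum_of_Luminosity (not the maximum),
    # so the result is almost perfectly correlated with Sum_of_Luminosity.
    data['Average_Luminosity'] = (data['Sum_of_Luminosity'] + data['Minimum_of_Luminosity']) / 2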

    # Normalized Steel Thickness Feature
    data['Normalized_Steel_Thickness'] = (data['Steel_Plate_Thickness'] - data['Steel_Plate_Thickness'].min()) / (data['Steel_Plate_Thickness'].max() - data['Steel_Plate_Thickness'].min())

    # Logarithmic Features
    data['Log_Perimeter'] = np.log(data['X_Perimeter'] + data['Y_Perimeter'] + epsilon)
    data['Log_Luminosity'] = np.log(data['Sum_of_Luminosity'] + epsilon)
    data['Log_Aspect_Ratio'] = np.log(data['Aspect_Ratio'] ** 2 + epsilon)

    # Statistical Features
    data['Combined_Index'] = data['Orientation_Index'] * data['Luminosity_Index']
    data['Sigmoid_Areas'] = 1 / (1 + np.exp(-data['LogOfAreas'] + epsilon))

    return data

X = generate_features(X)
test = generate_features(test)
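
Note that `generate_features` derives the minimum and maximum for `Normalized_Steel_Thickness` from whichever frame it receives, so `X` and `test` are normalized with different constants. One possible alternative, sketched here with hypothetical helper names (`fit_min_max`, `apply_min_max`) and made-up values, is to take the statistics from the training data and reuse them for the test data. This is not what the notebook does.

```python
import pandas as pd

def fit_min_max(series):
    """Return the (min, max) of a training column."""
    return float(series.min()), float(series.max())

def apply_min_max(series, lo, hi, epsilon=1e-6):
    """Scale a column with previously fitted bounds; epsilon guards against hi == lo."""
    return (series - lo) / (hi - lo + epsilon)

# Made-up thickness values for illustration only.
train_thickness = pd.Series([40.0, 70.0, 100.0, 300.0])
test_thickness = pd.Series([40.0, 170.0])

lo, hi = fit_min_max(train_thickness)
train_norm = apply_min_max(train_thickness, lo, hi)
test_norm = apply_min_max(test_thickness, lo, hi)
print(test_norm.tolist())
```

With shared bounds, the same raw thickness maps to the same normalized value in both frames.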

In [6]:
cat_cols = ['TypeOfSteel_A300', 'TypeOfSteel_A400', 'Outside_Global_Index', 'generated']
# Convert columns to category dtype
X[cat_cols] = X[cat_cols].astype('category')
test[cat_cols] = test[cat_cols].astype('category')

## Feature Selection

In [7]:
def remove_highly_correlated_features(df_train, df_test, threshold=0.95):
    # Compute the correlation matrix for the training set
    corr_matrix_train = df_train.corr().abs()
    
    # Exclude the main diagonal
    np.fill_diagonal(corr_matrix_train.values, 0)
    
    # Create a mask for features with high correlation in the training set
    mask_train = corr_matrix_train > threshold
    
    # Find the index of features to drop in the training set
    features_to_drop = set()
    for col in mask_train.columns:
        correlated_cols = list(mask_train.index[mask_train[col]])
        for correlated_col in correlated_cols:
            if col < correlated_col:  # Of each correlated pair, drop the column whose name sorts later alphabetically
                features_to_drop.add(correlated_col)
    
    # Print out the dropped columns and the columns they were correlated to
    for col in features_to_drop:
        correlated_cols = list(mask_train.index[mask_train[col]])
        corr_values = list(corr_matrix_train.loc[mask_train[col], col])
        for i, correlated_col in enumerate(correlated_cols):
            print(f"Dropped column: {col}, Correlated to: {correlated_col}, Correlation coefficient: {corr_values[i]}")
    
    # Drop highly correlated features from both training and test sets
    df_train_filtered = df_train.drop(columns=features_to_drop)
    df_test_filtered = df_test.drop(columns=features_to_drop)
    
    return df_train_filtered, df_test_filtered

X, test = remove_highly_correlated_features(X, test)

Dropped column: TypeOfSteel_A400, Correlated to: TypeOfSteel_A300, Correlation coefficient: 0.9979390823770107
Dropped column: X_Minimum, Correlated to: X_Maximum, Correlation coefficient: 0.989682979661914
Dropped column: sin_orientation, Correlated to: Orientation_Index, Correlation coefficient: 0.9992870031386843
Dropped column: Edges_Index2, Correlated to: Edges_Index, Correlation coefficient: 0.9934093935302362
Dropped column: Log_Luminosity, Correlated to: LogOfAreas, Correlation coefficient: 0.9716250377331292
Dropped column: Sum_of_Luminosity, Correlated to: Average_Luminosity, Correlation coefficient: 0.9999999985943983
Dropped column: Steel_Plate_Thickness, Correlated to: Normalized_Steel_Thickness, Correlation coefficient: 0.9999999999999988
Dropped column: Y_Minimum, Correlated to: Y_Maximum, Correlation coefficient: 0.9720417278216671
Dropped column: Log_Perimeter, Correlated to: LogOfAreas, Correlation coefficient: 0.96363945583183
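
The function above picks which column of a correlated pair to drop by comparing column names. A common variant, sketched below on made-up data, looks only at the upper triangle of the correlation matrix, so the later column by position is dropped instead. The function name `drop_correlated` is hypothetical.

```python
import numpy as np
import pandas as pd

def drop_correlated(df, threshold=0.95):
    """Drop every column whose |correlation| with an earlier column exceeds threshold."""
    corr = df.corr().abs()
    # Keep only entries strictly above the diagonal; the rest become NaN.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return df.drop(columns=to_drop), to_drop

# Toy data: 'a_scaled' is a linear function of 'a', 'b' is independent noise.
rng = np.random.default_rng(0)
a = rng.normal(size=500)
toy = pd.DataFrame({
    "a": a,
    "a_scaled": 2.0 * a + 1.0,
    "b": rng.normal(size=500),
})

reduced, dropped = drop_correlated(toy)
print(dropped)
print(reduced.columns.tolist())
```

Either rule keeps one column from each highly correlated pair; they only differ in which one survives.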


## Feature Scaling

In [8]:
features_to_scale = [
    'X_Minimum', 'X_Maximum', 'Y_Minimum', 'Y_Maximum', 'Pixels_Areas',
    'X_Perimeter', 'Y_Perimeter', 'Sum_of_Luminosity', 'Minimum_of_Luminosity',
    'Maximum_of_Luminosity', 'Length_of_Conveyer', 'Steel_Plate_Thickness',
    'X_Distance', 'Y_Distance', 'Density', 'Circularity', 'Symmetry_Index',
    'Color_Contrast', 'X_Distance*Pixels_Areas', 'Aspect_Ratio_Pixels',
    'Aspect_Ratio', 'Average_Luminosity'
]

# Filter features to scale based on the remaining columns in the DataFrame
features_to_scale = [col for col in features_to_scale if col in X.columns]

# Initialize StandardScaler
scaler = StandardScaler()
scaler.fit(X[features_to_scale])
X[features_to_scale] = scaler.transform(X[features_to_scale])
test[features_to_scale] = scaler.transform(test[features_to_scale])
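
The cell above fits the scaler on the training features only and reuses it for the test set. A minimal sketch of the same pattern on made-up numbers shows the effect: training columns end up with zero mean and unit variance, while test values are expressed relative to the training statistics.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

cols = ["Pixels_Areas", "X_Perimeter"]
# Toy frames with made-up values.
train_toy = pd.DataFrame({"Pixels_Areas": [10.0, 20.0, 30.0],
                          "X_Perimeter": [1.0, 2.0, 3.0]})
test_toy = pd.DataFrame({"Pixels_Areas": [20.0, 40.0],
                         "X_Perimeter": [2.0, 5.0]})

scaler = StandardScaler().fit(train_toy[cols])   # statistics come from train only
train_scaled = scaler.transform(train_toy[cols])
test_scaled = scaler.transform(test_toy[cols])

print(train_scaled.mean(axis=0), train_scaled.std(axis=0))
print(test_scaled)
```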

## Model Selection

In [9]:
"""
for target in target_classes:
    print(f"Lazy predict for target class: {target}")
    print("*" * 80)

    # Splitting dataset into training and testing part
    X_train, X_test, y_train, y_test = train_test_split(
        X,
        y[target],
        test_size=0.3,
        random_state=42,
        stratify=y[target], 
        shuffle=True
    )
    clf = LazyClassifier(verbose=0, ignore_warnings=True, custom_metric=None)
    models, predictions = clf.fit(X_train, X_test, y_train, y_test)
    print(models)
"""



## Model Tuning

In [10]:
"""
%%capture
# Define dictionaries to store best hyperparameters and ROC AUC values
best_params_xgb = {}
best_auc_xgb = {}
best_params_lgb = {}
best_auc_lgb = {}

# Iterate over each target class
for target_class in target_classes:
    print(f"Tuning hyperparameters for {target_class}...")

    # Split the data into train and validation sets for the current target class
    X_train, X_val, y_train, y_val = train_test_split(X, y[target_class], test_size=0.2, random_state=42, stratify=y[target_class], shuffle=True)
    
    # Define the objective function for hyperparameter optimization for XGBoost
    def objective_xgb(trial):
        params = {
            "objective": "binary:logistic",
            "n_estimators": 1000,
            "verbosity": 0,
            "learning_rate": trial.suggest_float("learning_rate", 1e-3, 0.1, log=True),
            "max_depth": trial.suggest_int("max_depth", 1, 10),
            "subsample": trial.suggest_float("subsample", 0.05, 1.0),
            "colsample_bytree": trial.suggest_float("colsample_bytree", 0.05, 1.0),
            "min_child_weight": trial.suggest_int("min_child_weight", 1, 20),
            "enable_categorical": True
        }

        model = xgb.XGBClassifier(**params)
        model.fit(X_train, y_train, verbose=False)
        predictions = model.predict_proba(X_val)[:, 1]  # Predict probabilities for the positive class
        roc_auc = roc_auc_score(y_val, predictions)
        return roc_auc

    # Perform hyperparameter optimization using Optuna for XGBoost
    study_xgb = optuna.create_study(direction='maximize')  # Maximize validation ROC AUC
    study_xgb.optimize(objective_xgb, n_trials=30)
    # Store the best hyperparameters and ROC AUC for XGBoost
    best_params_xgb[target_class] = study_xgb.best_params
    best_auc_xgb[target_class] = study_xgb.best_value
    
    # Define the objective function for hyperparameter optimization for LightGBM
    def objective_lgb(trial):
        params = {
            "objective": "binary",
            "metric": "auc",
            "n_estimators": 1000,
            "bagging_freq": 1,
            "learning_rate": trial.suggest_float("learning_rate", 1e-3, 0.1, log=True),
            "num_leaves": trial.suggest_int("num_leaves", 2, 2**10),
            "subsample": trial.suggest_float("subsample", 0.05, 1.0),
            "colsample_bytree": trial.suggest_float("colsample_bytree", 0.05, 1.0),
            "min_data_in_leaf": trial.suggest_int("min_data_in_leaf", 1, 100),
            # No "enable_categorical" here: that is an XGBoost option; LightGBM detects pandas 'category' columns automatically
        }

        model = lgb.LGBMClassifier(**params)
        model.fit(X_train, y_train)
        predictions = model.predict_proba(X_val)[:, 1]  # Predict probabilities for the positive class
        roc_auc = roc_auc_score(y_val, predictions)
        return roc_auc

    # Perform hyperparameter optimization using Optuna for LightGBM
    study_lgb = optuna.create_study(direction='maximize')  # Maximize validation ROC AUC
    study_lgb.optimize(objective_lgb, n_trials=30)
    # Store the best hyperparameters and ROC AUC for LightGBM
    best_params_lgb[target_class] = study_lgb.best_params
    best_auc_lgb[target_class] = study_lgb.best_value
    
# Save the dictionaries to JSON files
with open('best_params_xgb.json', 'w') as f:
    json.dump(best_params_xgb, f)

with open('best_auc_xgb.json', 'w') as f:
    json.dump(best_auc_xgb, f)

with open('best_params_lgb.json', 'w') as f:
    json.dump(best_params_lgb, f)

with open('best_auc_lgb.json', 'w') as f:
    json.dump(best_auc_lgb, f)
"""


### Best Parameters

In [11]:
best_params_xgb = {
    'Pastry': {
        'learning_rate': 0.006926977447202338,
        'max_depth': 8,
        'subsample': 0.4940484982010708,
        'colsample_bytree': 0.2387416720485505,
        'min_child_weight': 7
    },
    'Z_Scratch': {
        'learning_rate': 0.004986245704292724,
        'max_depth': 7,
        'subsample': 0.9332436730077105,
        'colsample_bytree': 0.48907554356577143,
        'min_child_weight': 8
    },
    'K_Scatch': {
        'learning_rate': 0.012033749117039628,
        'max_depth': 3,
        'subsample': 0.7325661464279343,
        'colsample_bytree': 0.12231748494766136,
        'min_child_weight': 11
    },
    'Stains': {
        'learning_rate': 0.006196927928720472,
        'max_depth': 4,
        'subsample': 0.8534492089576168,
        'colsample_bytree': 0.3761987501528039,
        'min_child_weight': 12
    },
    'Dirtiness': {
        'learning_rate': 0.006031795590671394,
        'max_depth': 8,
        'subsample': 0.9258644109322758,
        'colsample_bytree': 0.19262200620009873,
        'min_child_weight': 1
    },
    'Bumps': {
        'learning_rate': 0.030511454287023506,
        'max_depth': 3,
        'subsample': 0.9894325575143829,
        'colsample_bytree': 0.2691197048033656,
        'min_child_weight': 14
    },
    'Other_Faults': {
        'learning_rate': 0.005695980576574583,
        'max_depth': 5,
        'subsample': 0.7415198064018484,
        'colsample_bytree': 0.22189734386288398,
        'min_child_weight': 10
    }
}

best_params_lgb = {
    'Pastry': {
        'learning_rate': 0.005838510189618896,
        'num_leaves': 413,
        'subsample': 0.668486759118746,
        'colsample_bytree': 0.32125270364553377,
        'min_data_in_leaf': 82
    },
    'Z_Scratch': {
        'learning_rate': 0.0028573346654447536,
        'num_leaves': 969,
        'subsample': 0.8069989336666283,
        'colsample_bytree': 0.5920712547068819,
        'min_data_in_leaf': 94
    },
    'K_Scatch': {
        'learning_rate': 0.0010011424770011905,
        'num_leaves': 878,
        'subsample': 0.8805178529367013,
        'colsample_bytree': 0.3669661156317522,
        'min_data_in_leaf': 28
    },
    'Stains': {
        'learning_rate': 0.0035045365968749084,
        'num_leaves': 684,
        'subsample': 0.7679208745010446,
        'colsample_bytree': 0.32902244287866944,
        'min_data_in_leaf': 21
    },
    'Dirtiness': {
        'learning_rate': 0.005251331571844952,
        'num_leaves': 258,
        'subsample': 0.6080883184894392,
        'colsample_bytree': 0.6583700658822181,
        'min_data_in_leaf': 24
    },
    'Bumps': {
        'learning_rate': 0.0056976290404213105,
        'num_leaves': 1001,
        'subsample': 0.36947216922049836,
        'colsample_bytree': 0.673006584019963,
        'min_data_in_leaf': 54
    },
    'Other_Faults': {
        'learning_rate': 0.0031447823170776255,
        'num_leaves': 366,
        'subsample': 0.991229746792238,
        'colsample_bytree': 0.3250828708107952,
        'min_data_in_leaf': 79
    }
}

## Ensemble Models

In [12]:
"""
%%capture
def train_xgb_model(X_train, X_test, y_train, y_test, params):
    # Initialize XGBoost classifier with given parameters
    xgb_model = XGBClassifier(**params)
    
    # Train the model on the training data
    xgb_model.fit(X_train, y_train)
    
    # Predict probabilities for the positive class on the test data
    y_pred_proba = xgb_model.predict_proba(X_test)[:, 1]
    
    # Calculate ROC AUC score
    roc_auc = roc_auc_score(y_test, y_pred_proba)
    
    return xgb_model, roc_auc

def train_lgb_model(X_train, X_test, y_train, y_test, params):
    # Initialize LightGBM classifier with given parameters
    lgb_model = LGBMClassifier(**params)
    
    # Train the model on the training data
    lgb_model.fit(X_train, y_train)
    
    # Predict probabilities for the positive class on the test data
    y_pred_proba = lgb_model.predict_proba(X_test)[:, 1]
    
    # Calculate ROC AUC score
    roc_auc = roc_auc_score(y_test, y_pred_proba)
    
    return lgb_model, roc_auc

# Define a dictionary to store the trained models and their ROC AUC scores
models = {}

# Train models for each target class
for target_class in target_classes:
    print(f"Training models for {target_class}...")

    # Get best parameters for XGBoost and LightGBM for the current target class
    best_params_xgb_target = best_params_xgb.get(target_class, {})
    best_params_xgb_target['enable_categorical'] = True
    best_params_lgb_target = best_params_lgb.get(target_class, {})
    # 'enable_categorical' is an XGBoost option; LightGBM detects pandas 'category' columns automatically

    # Initialize RepeatedStratifiedKFold
    kf = RepeatedStratifiedKFold(n_splits=10, n_repeats=10, random_state=42)
    
    # Initialize lists to store the scores for each fold
    xgb_roc_auc_scores = []
    lgb_roc_auc_scores = []

    # Split the data using RepeatedStratifiedKFold
    for train_index, test_index in kf.split(X, y[target_class]):
        X_train, X_test = X.iloc[train_index], X.iloc[test_index]
        y_train, y_test = y[target_class].iloc[train_index], y[target_class].iloc[test_index]

        # Train XGBoost model for the current target class
        xgb_model, xgb_roc_auc = train_xgb_model(X_train, X_test, y_train, y_test, best_params_xgb_target)
        xgb_roc_auc_scores.append(xgb_roc_auc)

        # Train LightGBM model for the current target class
        lgb_model, lgb_roc_auc = train_lgb_model(X_train, X_test, y_train, y_test, best_params_lgb_target)
        lgb_roc_auc_scores.append(lgb_roc_auc)

    # Calculate the mean ROC AUC scores across all folds
    mean_xgb_roc_auc = np.mean(xgb_roc_auc_scores)
    mean_lgb_roc_auc = np.mean(lgb_roc_auc_scores)

    print(f"Mean XGBoost ROC AUC for {target_class}: {mean_xgb_roc_auc}")
    print(f"Mean LightGBM ROC AUC for {target_class}: {mean_lgb_roc_auc}")

    # Store the trained models and their mean ROC AUC scores in the dictionary
    models[target_class] = {'xgb_model': xgb_model, 'xgb_roc_auc': mean_xgb_roc_auc,
                             'lgb_model': lgb_model, 'lgb_roc_auc': mean_lgb_roc_auc}

# Calculate weights based on mean ROC AUC scores
weights = {}
for target_class in target_classes:
    xgb_weight = models[target_class]['xgb_roc_auc'] / (models[target_class]['xgb_roc_auc'] + models[target_class]['lgb_roc_auc'])
    lgb_weight = 1 - xgb_weight
    weights[target_class] = {'xgb': xgb_weight, 'lgb': lgb_weight}

# Ensemble the models
ensemble_models = {}
for target_class in target_classes:
    xgb_model = models[target_class]['xgb_model']
    lgb_model = models[target_class]['lgb_model']
    ensemble_model = VotingClassifier(estimators=[('xgb', xgb_model), ('lgb', lgb_model)], 
                                      voting='soft', 
                                      weights=[weights[target_class]['xgb'], weights[target_class]['lgb']])
    ensemble_models[target_class] = ensemble_model
"""
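
The weighting scheme in the disabled cell above gives each model a weight proportional to its mean ROC AUC, and soft voting then takes the weighted average of the predicted probabilities. A minimal numeric sketch, with made-up AUC values and probabilities and hypothetical helper names, looks like this:

```python
import numpy as np

def auc_weights(auc_a, auc_b):
    """Weights proportional to each model's AUC; they sum to 1."""
    w_a = auc_a / (auc_a + auc_b)
    return w_a, 1.0 - w_a

def blend(p_a, p_b, w_a, w_b):
    """Weighted average of two probability vectors (what soft voting computes)."""
    return w_a * np.asarray(p_a) + w_b * np.asarray(p_b)

# Made-up mean AUCs for one target class.
w_xgb, w_lgb = auc_weights(0.88, 0.86)

# Made-up positive-class probabilities from the two models.
p_xgb = np.array([0.2, 0.9])
p_lgb = np.array([0.4, 0.7])
blended = blend(p_xgb, p_lgb, w_xgb, w_lgb)

print(round(w_xgb, 4), round(w_lgb, 4))
print(blended)
```

When the two AUCs are close, as they typically are for two tuned gradient-boosting models, these weights stay near 0.5 each, so the result is close to a plain average.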


In [13]:
"""
# Calculate feature importance
for target_class in target_classes:
    print(f"Feature importance for {target_class}:")
    
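    # XGBoost feature importance
    # X_test is left over from the last CV fold above, so look up the labels of the current class for those rows
    xgb_importance = permutation_importance(models[target_class]['xgb_model'], X_test, y.loc[X_test.index, target_class], scoring='roc_auc', n_repeats=30, random_state=42)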
    xgb_feature_importance = pd.DataFrame({'Feature': X.columns, 'Importance': xgb_importance.importances_mean})
    xgb_feature_importance = xgb_feature_importance.sort_values(by='Importance', ascending=False)
    print("XGBoost Model:")
    print(xgb_feature_importance)
"""
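
Permutation importance measures how much a score drops when one feature's values are shuffled, and it is most informative when computed on rows the model was not trained on. The sketch below shows the call on toy data with an explicit validation split. It swaps in `LogisticRegression` for the XGBoost model to stay lightweight, and the feature names `signal` and `noise` are made up.

```python
import numpy as np
import pandas as pd
from sklearn.inspection import permutation_importance
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Toy data: the label depends on 'signal' only; 'noise' is irrelevant.
rng = np.random.default_rng(42)
X_toy = pd.DataFrame({"signal": rng.normal(size=400),
                      "noise": rng.normal(size=400)})
y_toy = (X_toy["signal"] + 0.3 * rng.normal(size=400) > 0).astype(int)

X_tr, X_val, y_tr, y_val = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=42, stratify=y_toy
)
model = LogisticRegression().fit(X_tr, y_tr)

# Shuffle each feature on the held-out rows and record the mean drop in ROC AUC.
result = permutation_importance(
    model, X_val, y_val, scoring="roc_auc", n_repeats=10, random_state=42
)
importance = pd.Series(result.importances_mean, index=X_val.columns)
importance = importance.sort_values(ascending=False)
print(importance)
```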


## Predict

In [14]:
"""
%%capture
# Define a dictionary to store the predictions for each target class
predictions = {}

# Predict on test data using ensemble models
for target_class in target_classes:
    ensemble_model = ensemble_models[target_class]  # Get the ensemble model for the current target class
    ensemble_model.fit(X, y[target_class])
    y_pred_proba = ensemble_model.predict_proba(test)[:, 1]  # Predict probabilities for the positive class
    predictions[target_class] = y_pred_proba
"""


In [15]:
# Import H2O
import h2o
from h2o.automl import H2OAutoML

h2o.init()

train_h2o = h2o.import_file('/kaggle/input/playground-series-s4e3/train.csv')
test_h2o = h2o.import_file('/kaggle/input/playground-series-s4e3/test.csv')
faults_h2o = h2o.import_file('/kaggle/input/faulty-steel-plates/faults.csv')

# Remove 'id' column from train and test DataFrames
train_h2o = train_h2o.drop("id", axis=1)
test_h2o = test_h2o.drop("id", axis=1)

# Calculate the sum of target classes for each row
row_sums = train_h2o[target_classes].sum(axis=1)

# Keep only rows with exactly one positive label (filtered_train is not used below)
filtered_train = train_h2o[(row_sums > 0) & (row_sums <= 1)]

# Specify if dataset is synthetically generated
train_h2o['generated'] = 1
faults_h2o['generated'] = 0
test_h2o['generated'] = 1

# Concatenate faults DataFrame with train DataFrame
train_h2o = train_h2o.rbind(faults_h2o)
train_h2o[target_classes] = train_h2o[target_classes].asfactor()

# Define a dictionary to store predictions
predictions_dict = {}

# Loop through each target class
for target_class in target_classes:
    print(f"Training AutoML model for {target_class}...")

    # Feature column names (the same for every target class)
    X_h2o = train_h2o.columns[:-8]  # Assumes the last 8 columns are the 7 target classes plus 'generated'

    # Train AutoML model with ROC AUC as stopping metric
    aml = H2OAutoML(max_models=10, seed=42, stopping_metric='AUC')
    aml.train(x=X_h2o, y=target_class, training_frame=train_h2o)

    # Get the best model
    best_model = aml.leader

    # Predict probabilities for the test dataset
    predictions = best_model.predict(test_h2o)

    # Save predictions to the dictionary
    predictions_dict[target_class] = predictions.as_data_frame()

# Stop H2O cluster (h2o.shutdown() is deprecated in favor of the cluster object's method)
h2o.cluster().shutdown()

Checking whether there is an H2O instance running at http://localhost:54321..... not found.
Attempting to start a local H2O server...
  Java Version: openjdk version "11.0.21" 2023-10-17; OpenJDK Runtime Environment (build 11.0.21+9-post-Ubuntu-0ubuntu120.04); OpenJDK 64-Bit Server VM (build 11.0.21+9-post-Ubuntu-0ubuntu120.04, mixed mode, sharing)
  Starting server from /opt/conda/lib/python3.10/site-packages/h2o/backend/bin/h2o.jar
  Ice root: /tmp/tmpazr2_zlp
  JVM stdout: /tmp/tmpazr2_zlp/h2o_unknownUser_started_from_python.out
  JVM stderr: /tmp/tmpazr2_zlp/h2o_unknownUser_started_from_python.err
  Server is running at http://127.0.0.1:54321
Connecting to H2O server at http://127.0.0.1:54321 ... successful.


H2O_cluster_uptime: 06 secs
H2O_cluster_timezone: Etc/UTC
H2O_data_parsing_timezone: UTC
H2O_cluster_version: 3.44.0.3
H2O_cluster_version_age: 3 months and 3 days
H2O_cluster_name: H2O_from_python_unknownUser_ku8pov
H2O_cluster_total_nodes: 1
H2O_cluster_free_memory: 7.230 Gb
H2O_cluster_total_cores: 1
H2O_cluster_allowed_cores: 1


Parse progress: |████████████████████████████████████████████████████████████████| (done) 100%
Parse progress: |████████████████████████████████████████████████████████████████| (done) 100%
Parse progress: |████████████████████████████████████████████████████████████████| (done) 100%
Training AutoML model for Pastry...
AutoML progress: |███████████████████████████████████████████████████████████████| (done) 100%
stackedensemble prediction progress: |███████████████████████████████████████████| (done) 100%
Export File progress: |██████████████████████████████████████████████████████████| (done) 100%
Training AutoML model for Z_Scratch...
AutoML progress: |███████████████████████████████████████████████████████████████| (done) 100%
stackedensemble prediction progress: |███████████████████████████████████████████| (done) 100%
Export File progress: |██████████████████████████████████████████████████████████| (done) 100%
Training AutoML model for K_Scatch...
AutoML progress: |██████████████

## Submission

In [16]:
p1_values = {}

for target_class in target_classes:
    # Select the dataframe corresponding to the target class
    df_target_class = predictions_dict[target_class]
    
    # Extract the p1 values (probability of the positive class) from the dataframe
    p1_values[target_class] = df_target_class['p1']

In [17]:
submission = pd.DataFrame(p1_values)
submission.insert(0, "id", sample_submission["id"])
submission.to_csv("submission.csv", index=False)
submission.head()

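|   | id | Pastry | Z_Scratch | K_Scatch | Stains | Dirtiness | Bumps | Other_Faults |
|---|---|---|---|---|---|---|---|---|
| 0 | 19219 | 0.47 | 0.0 | 0.0 | 0.0 | 0.01 | 0.15 | 0.34 |
| 1 | 19220 | 0.3 | 0.01 | 0.01 | 0.0 | 0.15 | 0.16 | 0.37 |
| 2 | 19221 | 0.0 | 0.04 | 0.05 | 0.0 | 0.01 | 0.26 | 0.48 |
| 3 | 19222 | 0.16 | 0.0 | 0.0 | 0.01 | 0.0 | 0.48 | 0.48 |
| 4 | 19223 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.66 | 0.39 |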
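
The submission cells above can be exercised without an H2O cluster. The sketch below uses made-up stand-ins for `predictions_dict` and `sample_submission`; it assumes each prediction frame has the `predict`, `p0`, and `p1` columns that H2O returns for a binary target, and adds a few sanity checks before writing the file.

```python
import pandas as pd

target_classes = ["Pastry", "Z_Scratch", "K_Scatch", "Stains",
                  "Dirtiness", "Bumps", "Other_Faults"]

# Made-up stand-ins for the per-class H2O prediction frames.
predictions_dict = {
    t: pd.DataFrame({"predict": [0, 1], "p0": [0.8, 0.3], "p1": [0.2, 0.7]})
    for t in target_classes
}
sample_submission = pd.DataFrame({"id": [19219, 19220]})

# One column of positive-class probabilities per defect type, with id first.
submission = pd.DataFrame({t: predictions_dict[t]["p1"] for t in target_classes})
submission.insert(0, "id", sample_submission["id"])

# Sanity checks: column order, row count, and valid probabilities.
assert list(submission.columns) == ["id"] + target_classes
assert len(submission) == len(sample_submission)
assert submission[target_classes].apply(lambda c: c.between(0, 1)).all().all()
print(submission)
```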
