# **Improving Classification Algorithm for Difficult Data Conditions**

In this assignment, we will be implementing a classification algorithm from scratch, analyze how a specific data issue affects its performance, propose a modification to improve it, and validate the results through empirical evaluation on benchmark datasets.

# 1st Step - Install dependencies

In [32]:
!pip install pandas numpy matplotlib seaborn imbalanced-learn autograd tqdm nbformat plotly



# 2nd Step - Import Libraries

In [33]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import os
import glob
import base64
import io
import shutil
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix, roc_curve, auc, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_classification
from sklearn.preprocessing import OneHotEncoder
from sklearn.impute import KNNImputer
from sklearn.preprocessing import StandardScaler
from imblearn.over_sampling import SMOTE, ADASYN
import logging
import autograd.numpy as np
from autograd import grad
from tqdm.auto import tqdm
import plotly.graph_objects as go
from plotly.subplots import make_subplots

# 3rd Step - Algorithm Selection

For this assignment we selected collectively to explore the Logistic Regression as our classification algorithm. We are going to use the standard version of the algorithm found on https://github.com/rushter/MLAlgorithms.


Logistic Regression models the probability of class membership by applying the sigmoid function to a linear combination of input features, enabling it to separate classes based on learned decision boundaries. It is sensitive to both noise and class imbalance, with a particular vulnerability to class imbalance, as it naturally biases toward the majority class during training. Although extensions allow logistic regression to handle multiclass problems, its performance is primarily affected by noise and imbalance rather than the multiclass setting itself.

For this specific reason we are going to evaluate our algorithm on the Dataset Group 2: Class imbalance in binary classification.

# 4th Step - Preprocess the Data

For our algorithm to work as expected we need to make sure our data is clean and ready to be used, so we need to check the following cases:

 - Files categorical values on it;
 - Files with missing values.

After we look up these files and fix them we also need to make sure that all our "target" classes have the same name, and after that we are going to attribute 1 to the minority classes and 0 to the majorities ones, to easier future comparison.

Let´s start by seeing how files have categorical features.

In [34]:
folder = "class_imbalance"

total_files = 0
files_with_categorical = 0
files_with_categorical_list = []

for file in os.listdir(folder):
    if file.endswith(".csv"):
        try:
            df = pd.read_csv(os.path.join(folder, file))
            total_files += 1

            print(f"\nFile: {file}")
            print(f"   Shape: {df.shape[0]} rows, {df.shape[1]} columns")

            # Check for categorical features
            feature_cols = df.columns[:-1]  # exclude target
            categorical_features = df[feature_cols].select_dtypes(include=['object', 'category']).columns.tolist()

            if categorical_features:
                print(f"   Categorical features detected: {categorical_features}")
                files_with_categorical += 1
                files_with_categorical_list.append(file)
            else:
                print("   No categorical features detected.")

            # Get the last column as the target
            target_col = df.columns[-1]
            print(f"   Target column: '{target_col}'")

            # Print class distribution
            print("\nClass distribution:")
            print(df[target_col].value_counts(normalize=True))

        except Exception as e:
            print(f"Error with file {file}: {e}")

# Final summary
print(f"\n\nTotal files processed: {total_files}")
print(f"Files with categorical features: {files_with_categorical}")


File: dataset_1000_hypothyroid.csv
   Shape: 3772 rows, 30 columns
   Categorical features detected: ['sex', 'on thyroxine', 'query on thyroxine', 'on antithyroid medication', 'sick', 'pregnant', 'thyroid surgery', 'I131 treatment', 'query hypothyroid', 'query hyperthyroid', 'lithium', 'goitre', 'tumor', 'hypopituitary', 'psych', 'TSH measured', 'T3 measured', 'TT4 measured', 'T4U measured', 'FTI measured', 'TBG measured', 'referral source']
   Target column: 'binaryClass'

Class distribution:
binaryClass
P    0.922853
N    0.077147
Name: proportion, dtype: float64

File: dataset_1002_ipums_la_98-small.csv
   Shape: 7485 rows, 56 columns
   Categorical features detected: ['gq', 'gqtypeg', 'farm', 'ownershg', 'relateg', 'sex', 'raceg', 'marst', 'chborn', 'school', 'educrec', 'schltype', 'empstatg', 'labforce', 'classwkg', 'wkswork2', 'hrswork2', 'workedyr', 'migrat5g', 'vetstat']
   Target column: 'binaryClass'

Class distribution:
binaryClass
P    0.894322
N    0.105678
Name: proporti

In [35]:
print("\nFiles with categorical features:")
for f in files_with_categorical_list:
    print(f"- {f}")



Files with categorical features:
- dataset_1000_hypothyroid.csv
- dataset_1002_ipums_la_98-small.csv
- dataset_1014_analcatdata_dmft.csv
- dataset_1016_vowel.csv
- dataset_1018_ipums_la_99-small.csv
- dataset_1023_soybean.csv
- dataset_38_sick.csv
- dataset_757_meta.csv
- dataset_764_analcatdata_apnea3.csv
- dataset_765_analcatdata_apnea2.csv
- dataset_767_analcatdata_apnea1.csv
- dataset_865_analcatdata_neavote.csv
- dataset_867_visualizing_livestock.csv
- dataset_875_analcatdata_chlamydia.csv
- dataset_966_analcatdata_halloffame.csv
- dataset_968_analcatdata_birthday.csv


16 files with categorical values as expected!

Now let's check for missing values, we are going to use common representations of missing values to garantee no mistake is made and to not let missing values go unseen.

In [36]:
folder = "class_imbalance"

# Variable to track how many files have missing values
files_with_missing_values = 0

# Define common representations of missing values
common_na = ["", "NA", "NaN", "n/a", "N/A", "null", "NULL", "-", "--", "?", "none", "None", " "]

# Loop through all CSV files in the folder
for file in os.listdir(folder):
    if file.endswith(".csv"):
        try:
            file_path = os.path.join(folder, file)

            # Read CSV with common missing value representations
            df = pd.read_csv(file_path, na_values=common_na)

            print(f"\nFile: {file}")

            # Check for missing values
            missing_values = df.isnull().sum()
            if missing_values.sum() == 0:
                print("No missing values.")
            else:
                print("Missing values:")
                print(missing_values[missing_values > 0])
                files_with_missing_values += 1

        except Exception as e:
            print(f"Error with file {file}: {e}")

# Final summary
print(f"\n\nTotal files with missing values: {files_with_missing_values}")


File: dataset_1000_hypothyroid.csv
Missing values:
age       1
sex     150
TSH     369
T3      769
TT4     231
T4U     387
FTI     385
TBG    3772
dtype: int64

File: dataset_1002_ipums_la_98-small.csv
Missing values:
ownershg     132
chborn      4424
educrec      322
schltype    5314
empstatg    1811
labforce    1811
classwkg    3184
wkswork2    3690
hrswork2    4154
workedyr    1811
migrat5g    3954
vetstat     1820
dtype: int64

File: dataset_1004_synthetic_control.csv
No missing values.

File: dataset_1013_analcatdata_challenger.csv
No missing values.

File: dataset_1014_analcatdata_dmft.csv
No missing values.

File: dataset_1016_vowel.csv
No missing values.

File: dataset_1018_ipums_la_99-small.csv
Missing values:
ownershg     168
chborn      5421
school       428
educrec      428
schltype     428
empstatg    2110
labforce    2110
classwkg    3671
wkswork2    4198
hrswork2    4782
yrlastwk    6172
workedyr    2110
migrat5g     707
vetstat     2110
dtype: int64

File: dataset_1020

Now that we know we have 16 files with categorical features and 10 with missing values we are going to fix them!

After a quick analysis we can see that some files have entire features with missing values, so we are going to remove them since they are useless for our project.

In [37]:
# Input and output folders
input_folder = "class_imbalance"
cleaned_folder = "class_imbalance_cleaned"
imputed_folder = "class_imbalance_imputed"

# Create output folders if they don't exist
os.makedirs(cleaned_folder, exist_ok=True)
os.makedirs(imputed_folder, exist_ok=True)

Once our new folders are created we can move on to cleaning.

In [38]:
total_files_cleaned = 0

for file in os.listdir(input_folder):
    if file.endswith(".csv"):
        input_path = os.path.join(input_folder, file)
        output_path = os.path.join(cleaned_folder, file)

        try:
            # Read CSV
            df = pd.read_csv(input_path)
            initial_columns = df.columns.tolist()

            # Drop columns that are all NaN
            df_cleaned = df.dropna(axis=1, how='all')
            final_columns = df_cleaned.columns.tolist()

            # Detect removed columns
            removed_columns = list(set(initial_columns) - set(final_columns))

            # Save cleaned file
            df_cleaned.to_csv(output_path, index=False)
            total_files_cleaned += 1

            # Only print if some columns were removed
            if removed_columns:
                print(f"\nFile: {file}")
                print(f"Removed columns: {removed_columns}")

        except Exception as e:
            print(f"Error cleaning {file}: {e}")

print(f"\n\nTotal files cleaned: {total_files_cleaned}")



File: dataset_1000_hypothyroid.csv
Removed columns: ['TBG']

File: dataset_38_sick.csv
Removed columns: ['TBG']


Total files cleaned: 50


We can proceed to impute the missing values, we are doing so by using the mode for binary and categorical features and using the KNN for the remaining!

In [39]:
def is_binary(series):
    """Checks if a series is binary (ignoring NaNs)."""
    unique_vals = series.dropna().astype(str).str.lower().unique()
    return len(unique_vals) == 2

def is_numeric(series):
    """Checks if a series is numeric."""
    try:
        pd.to_numeric(series.dropna())
        return True
    except:
        return False

def apply_smart_imputer(df):
    """Applies smart imputation: mode for binary/categorical, KNN for numeric."""
    feature_cols = df.columns[:-1]  # Last column is target
    target_col = df.columns[-1]
    
    X = df[feature_cols].copy()
    y = df[target_col]

    binary_cols = []
    categorical_cols = []
    numeric_cols = []

    for col in X.columns:
        if is_binary(X[col]):
            binary_cols.append(col)
        elif not is_numeric(X[col]) or X[col].nunique() <= 10:
            categorical_cols.append(col)
        else:
            numeric_cols.append(col)

    print(f"Imputing {len(binary_cols)} binary, {len(categorical_cols)} categorical, {len(numeric_cols)} numeric columns.")

    # Binary + Categorical: fill with mode
    for col in binary_cols + categorical_cols:
        if not X[col].mode().empty:
            mode_val = X[col].mode()[0]
            X[col] = X[col].fillna(mode_val)
        else:
            print(f"Warning: Column '{col}' has no mode (all NaNs). Skipping.")

    # Numeric: KNN impute
    valid_numeric = X[numeric_cols].dropna(axis=1, how='all')
    removed = set(numeric_cols) - set(valid_numeric.columns)

    if not valid_numeric.empty:
        imputer = KNNImputer(n_neighbors=5)
        imputed_array = imputer.fit_transform(valid_numeric)
        X[valid_numeric.columns] = imputed_array

    if removed:
        print(f"Skipped all-NaN numeric columns: {removed}")

    # Return combined DataFrame with target
    return pd.concat([X, y], axis=1)

def validate_imputation(original_df, imputed_df):
    """Compare missing values before and after imputation."""
    feature_cols = original_df.columns[:-1]

    before_missing = original_df[feature_cols].isnull().sum()
    after_missing = imputed_df[feature_cols].isnull().sum()

    still_missing = after_missing[after_missing > 0]

    if still_missing.empty:
        print("✅ Imputation successful: no missing values remain.")
    else:
        print("⚠️ Some columns still have missing values:")
        print(still_missing)

    # Optional: show change summary
    total_before = before_missing.sum()
    total_after = after_missing.sum()
    print(f"Missing before: {total_before} → after: {total_after}")

In [40]:
#Process cleaned files with smart imputer

total_files_imputed = 0

for file in os.listdir(cleaned_folder):
    if file.endswith(".csv"):
        input_path = os.path.join(cleaned_folder, file)
        output_path = os.path.join(imputed_folder, file)

        try:
            df = pd.read_csv(input_path)
            df_imputed = apply_smart_imputer(df)

            df_imputed.to_csv(output_path, index=False)
            total_files_imputed += 1

            print(f"Imputed and saved: {file}")

        except Exception as e:
            print(f"Error imputing {file}: {e}")

print(f"\nTotal files imputed: {total_files_imputed}")

Imputing 20 binary, 2 categorical, 6 numeric columns.
Imputed and saved: dataset_1000_hypothyroid.csv
Imputing 8 binary, 29 categorical, 18 numeric columns.
Imputed and saved: dataset_1002_ipums_la_98-small.csv
Imputing 0 binary, 0 categorical, 60 numeric columns.
Imputed and saved: dataset_1004_synthetic_control.csv
Imputing 0 binary, 1 categorical, 1 numeric columns.
Imputed and saved: dataset_1013_analcatdata_challenger.csv
Imputing 1 binary, 3 categorical, 0 numeric columns.
Imputed and saved: dataset_1014_analcatdata_dmft.csv
Imputing 1 binary, 1 categorical, 10 numeric columns.
Imputed and saved: dataset_1016_vowel.csv
Imputing 9 binary, 28 categorical, 19 numeric columns.
Imputed and saved: dataset_1018_ipums_la_99-small.csv
Imputing 0 binary, 0 categorical, 64 numeric columns.
Imputed and saved: dataset_1020_mfeat-karhunen.csv
Imputing 0 binary, 0 categorical, 10 numeric columns.
Imputed and saved: dataset_1021_page-blocks.csv
Imputing 0 binary, 240 categorical, 0 numeric colum

Let´s see if our code ran correctly checking again for missing values and manually see if the imputation was done correctly.

In [41]:
# Variable to track how many files have missing values
files_with_missing_values = 0

# Loop through all CSV files in the folder
for file in os.listdir(imputed_folder):
    if file.endswith(".csv"):
        try:
            df = pd.read_csv(os.path.join(imputed_folder, file))

            print(f"\nFile: {file}")

            # Check for missing values
            missing_values = df.isnull().sum()
            if missing_values.sum() == 0:
                print("No missing values.")
            else:
                print("Missing values:")
                print(missing_values[missing_values > 0])
                files_with_missing_values += 1

        except Exception as e:
            print(f"Error with file {file}: {e}")

# Final summary
print(f"\n\nTotal files with missing values: {files_with_missing_values}")


File: dataset_1000_hypothyroid.csv
No missing values.

File: dataset_1002_ipums_la_98-small.csv
No missing values.

File: dataset_1004_synthetic_control.csv
No missing values.

File: dataset_1013_analcatdata_challenger.csv
No missing values.

File: dataset_1014_analcatdata_dmft.csv
No missing values.

File: dataset_1016_vowel.csv
No missing values.

File: dataset_1018_ipums_la_99-small.csv
No missing values.

File: dataset_1020_mfeat-karhunen.csv
No missing values.

File: dataset_1021_page-blocks.csv
No missing values.

File: dataset_1022_mfeat-pixel.csv
No missing values.

File: dataset_1023_soybean.csv
No missing values.

File: dataset_1039_hiva_agnostic.csv
No missing values.

File: dataset_1045_kc1-top5.csv
No missing values.

File: dataset_1049_pc4.csv
No missing values.

File: dataset_1050_pc3.csv
No missing values.

File: dataset_1056_mc1.csv
No missing values.

File: dataset_1059_ar1.csv
No missing values.

File: dataset_1061_ar4.csv
No missing values.

File: dataset_1064_ar6.

Now that we have none missing values we can use Label Encoder and One-Hot Encoder to verify that all our data is numerical. For simple categories like "Yes/No", we use Label Encoding (assigning 0 and 1). For multiple categories like colors, we apply One-Hot Encoding, creating separate columns with 1s and 0s to avoid false numerical relationships.

We'll handle the target column separately later, ensuring the minority class is labeled as 1 and majority as 0 for clarity. These transformations give us clean, numerical data ready for machine learning models to process effectively.

In [42]:
def encode_features(df, ohe_threshold):
    feature_cols = df.columns[:-1]  # Para não tocar na coluna target
    target_col = df.columns[-1]

    categorical_features = df[feature_cols].select_dtypes(include=['object', 'category']).columns.tolist()

    if not categorical_features:
        return df, [], []

    df_encoded = df.copy()
    ohe = OneHotEncoder(sparse_output=False, handle_unknown='ignore')
    ohe_cols = []
    label_cols = []

    for col in categorical_features:
        unique_values = df_encoded[col].nunique()

        if unique_values <= ohe_threshold:
            # One-Hot Encoder
            ohe_df = pd.DataFrame(
                ohe.fit_transform(df_encoded[[col]]),
                columns=[f"{col}{cat}" for cat in ohe.categories_[0]],
                index=df_encoded.index
            )
            df_encoded = pd.concat([df_encoded.drop(columns=[col]), ohe_df], axis=1)
            print(f" One-Hot Encoded '{col}' ({unique_values} categorias)")
            ohe_cols.append(col)
        else:
            # Label Encoder
            df_encoded[col] = pd.factorize(df_encoded[col])[0]
            print(f" Label Encoded '{col}' ({unique_values} categorias)")
            label_cols.append(col)

    # Manter target no fim
    cols = [col for col in df_encoded.columns if col != target_col] + [target_col]
    df_encoded = df_encoded[cols]

    return df_encoded, ohe_cols, label_cols

def process_files(original_folder, modified_folder, ohe_threshold):
    os.makedirs(modified_folder, exist_ok=True)
    report_lines = []
    
    datasets_with_encoders = 0  # Contador para datasets que utilizaram o encoder

    for file in os.listdir(original_folder):
        if file.endswith(".csv"):
            original_path = os.path.join(original_folder, file)
            modified_path = os.path.join(modified_folder, file)

            try:
                df = pd.read_csv(original_path)
                df_encoded, ohe_cols, label_cols = encode_features(df, ohe_threshold)

                df_encoded.to_csv(modified_path, index=False)
                print(f"Processed and saved: {file}\n")

                # Verificar se o dataset utilizou algum encoder
                if ohe_cols or label_cols:
                    datasets_with_encoders += 1  # Incrementar se houver uso de encoder
                
                report_lines.append(f"File: {file}")
                if ohe_cols:
                    report_lines.append(f"  One-Hot Encoded columns: {', '.join(ohe_cols)}")
                if label_cols:
                    report_lines.append(f"  Label Encoded columns: {', '.join(label_cols)}")
                if not ohe_cols and not label_cols:
                    report_lines.append("  No categorical columns detected.")
                report_lines.append("")  # linha em branco entre ficheiros

            except Exception as e:
                print(f"Error processing {file}: {e}\n")
                report_lines.append(f"Error processing {file}: {e}")
                report_lines.append("")

    # Guardar relatório
    report_path = os.path.join(modified_folder, "encoding_report.txt")
    with open(report_path, "w") as f:
        f.write("\n".join(report_lines))

    print("\nAll files processed! Report saved to 'encoding_report.txt'")

    # Imprimir o número de datasets que utilizaram o encoder
    print(f"\nTotal datasets processed with encoders (OHE or Label Encoding): {datasets_with_encoders}")
    return datasets_with_encoders


# Parâmetros de execução
original_folder = "class_imbalance_imputed"
modified_folder = "class_imbalance_modified"
ohe_threshold = 15

# Chamada da função para processar os arquivos
datasets_with_encoders = process_files(original_folder, modified_folder, ohe_threshold)


 One-Hot Encoded 'sex' (2 categorias)
 One-Hot Encoded 'on thyroxine' (2 categorias)
 One-Hot Encoded 'query on thyroxine' (2 categorias)
 One-Hot Encoded 'on antithyroid medication' (2 categorias)
 One-Hot Encoded 'sick' (2 categorias)
 One-Hot Encoded 'pregnant' (2 categorias)
 One-Hot Encoded 'thyroid surgery' (2 categorias)
 One-Hot Encoded 'I131 treatment' (2 categorias)
 One-Hot Encoded 'query hypothyroid' (2 categorias)
 One-Hot Encoded 'query hyperthyroid' (2 categorias)
 One-Hot Encoded 'lithium' (2 categorias)
 One-Hot Encoded 'goitre' (2 categorias)
 One-Hot Encoded 'tumor' (2 categorias)
 One-Hot Encoded 'hypopituitary' (2 categorias)
 One-Hot Encoded 'psych' (2 categorias)
 One-Hot Encoded 'TSH measured' (2 categorias)
 One-Hot Encoded 'T3 measured' (2 categorias)
 One-Hot Encoded 'TT4 measured' (2 categorias)
 One-Hot Encoded 'T4U measured' (2 categorias)
 One-Hot Encoded 'FTI measured' (2 categorias)
 One-Hot Encoded 'TBG measured' (1 categorias)
 One-Hot Encoded 'referr

We can see that the total datasets processed with encoders were 16, the same number of datasets with categorical values in them! Also, we have the report saved in the folder to check manually if everything went as expected.

Now that we have a folder with all the data cleaned we just need to do the last step, name the last column of all datasets as "target" and make sure the minority class is 1 and 0 the majority.

In [43]:
# Define input and output folders
input_folder = "class_imbalance_modified"             
output_folder = "class_imbalance_final"      
os.makedirs(output_folder, exist_ok=True)

# Process each CSV file in the input folder
for filename in os.listdir(input_folder):
    if not filename.lower().endswith(".csv"):
        continue  # skip non-CSV files
    
    # Read the dataset
    file_path = os.path.join(input_folder, filename)
    df = pd.read_csv(file_path)
    
    # Rename the last column to 'target'
    df.rename(columns={df.columns[-1]: "target"}, inplace=True)
    
    # Only handle binary classification cases
    counts = df['target'].value_counts()
    if len(counts) != 2:
        # Skip files that are not binary
        print(f"Skipping {filename}: not binary classification.")
        continue
    
    # Identify majority and minority class labels
    minority_label = counts.idxmin()  # label with fewer instances
    majority_label = counts.idxmax()  # label with more instances
    
    # Map the minority class to 1 and the majority class to 0
    df['target'] = df['target'].apply(lambda x: 1 if x == minority_label else 0)
    
    # Save the modified dataset to the new folder
    output_path = os.path.join(output_folder, filename)
    df.to_csv(output_path, index=False)

Let's check if everything is right!

In [44]:
folder = "class_imbalance_final"

total_files = 0
files_with_categorical = []
files_with_missing = []

for file in os.listdir(folder):
    if not file.lower().endswith(".csv"):
        continue

    df = pd.read_csv(os.path.join(folder, file))
    total_files += 1

    print(f"\nFile: {file}")
    print(f"   Shape: {df.shape[0]} rows, {df.shape[1]} columns")

    # Record files that have any categorical features
    feature_cols = df.columns[:-1]
    if df[feature_cols].select_dtypes(include=['object', 'category']).any().any():
        files_with_categorical.append(file)

    # Record files that have any missing values
    if df.isnull().any().any():
        files_with_missing.append(file)

    # Show target info
    target_col = df.columns[-1]
    print(f"   Target column: '{target_col}'")
    print("   Class distribution:")
    print(df[target_col].value_counts(normalize=True))

# Final summary counts
print(f"\nFiles with categorical features: {len(files_with_categorical)}")
print(f"Files with missing values: {len(files_with_missing)}")
print(f"Total files processed: {total_files}")


File: dataset_1000_hypothyroid.csv
   Shape: 3772 rows, 53 columns
   Target column: 'target'
   Class distribution:
target
0    0.922853
1    0.077147
Name: proportion, dtype: float64

File: dataset_1002_ipums_la_98-small.csv
   Shape: 7485 rows, 136 columns
   Target column: 'target'
   Class distribution:
target
0    0.894322
1    0.105678
Name: proportion, dtype: float64

File: dataset_1004_synthetic_control.csv
   Shape: 600 rows, 61 columns
   Target column: 'target'
   Class distribution:
target
0    0.833333
1    0.166667
Name: proportion, dtype: float64

File: dataset_1013_analcatdata_challenger.csv
   Shape: 138 rows, 3 columns
   Target column: 'target'
   Class distribution:
target
0    0.934783
1    0.065217
Name: proportion, dtype: float64

File: dataset_1014_analcatdata_dmft.csv
   Shape: 797 rows, 8 columns
   Target column: 'target'
   Class distribution:
target
0    0.805521
1    0.194479
Name: proportion, dtype: float64

File: dataset_1016_vowel.csv
   Shape: 990 ro

Now we are reading to start our real work!

# 5th Step - Model Training & Evaluation: Pre‑Changes

Now that we have our data ready we can start the model training and evaluation on our base model! Firstly let's define our algorithm.

In [45]:
np.random.seed(1000)


def binary_crossentropy(y_true, y_pred):
    eps = 1e-8  # to avoid log(0)
    return -np.mean(y_true * np.log(y_pred + eps) + (1 - y_true) * np.log(1 - y_pred + eps))


class BasicRegression:
    def __init__(self, lr=0.001, penalty=None, C=0.01, tolerance=1e-4, max_iters=1000):
        self.lr = lr
        self.penalty = penalty
        self.C = C
        self.tolerance = tolerance
        self.max_iters = max_iters
        self.errors = []
        self.theta = None
        self.n_samples, self.n_features = None, None
        self.cost_func = None

    def _add_penalty(self, loss, w):
        if self.penalty == "l1":
            loss += self.C * np.abs(w[1:]).sum()
        elif self.penalty == "l2":
            loss += 0.5 * self.C * (w[1:] ** 2).sum()
        return loss

    def _cost(self, X, y, theta):
        prediction = X.dot(theta)
        error = self.cost_func(y, prediction)
        return error

    def _add_intercept(self, X):
        b = np.ones([X.shape[0], 1])
        return np.concatenate([b, X], axis=1)

    def fit(self, X, y):
        X = np.array(X)
        y = np.array(y).reshape(-1)

        self.X = self._add_intercept(X)
        self.y = y
        self.n_samples, self.n_features = X.shape

        self.init_cost()

        self.theta = np.random.normal(size=(self.n_features + 1), scale=0.5)
        self.theta, self.errors = self._gradient_descent()

        logging.info(f"Training completed. Final loss: {self.errors[-1]}")

    def _gradient_descent(self):
        theta = self.theta
        errors = [self._cost(self.X, self.y, theta)]
        cost_d = grad(self._loss)

        for i in range(1, self.max_iters + 1):
            delta = cost_d(theta)
            theta -= self.lr * delta

            current_error = self._cost(self.X, self.y, theta)
            errors.append(current_error)

            logging.info(f"Iteration {i}, error {current_error}")

            if np.abs(errors[-2] - errors[-1]) < self.tolerance:
                logging.info("Convergence has reached.")
                break

        return theta, errors

    def predict(self, X, threshold=0.5):
        probs = self.predict_proba(X)
        return (probs >= threshold).astype(int)

    def predict_proba(self, X):
        raise NotImplementedError("This method should be implemented in a subclass.")

    def _loss(self, w):
        raise NotImplementedError()

    def init_cost(self):
        raise NotImplementedError()


class LogisticRegression(BasicRegression):
    def init_cost(self):
        self.cost_func = binary_crossentropy

    def _loss(self, w):
        predictions = self.sigmoid(np.dot(self.X, w))

        # Clamp predictions to avoid log(0)
        eps = 1e-8
        predictions = np.clip(predictions, eps, 1 - eps)

        loss = self.cost_func(self.y, predictions)
        return self._add_penalty(loss, w)

    @staticmethod
    def sigmoid(x):
        return 0.5 * (np.tanh(0.5 * x) + 1)

    def predict_proba(self, X):
        X = np.array(X)
        X = self._add_intercept(X)
        return self.sigmoid(np.dot(X, self.theta))


We now are going to train our model and evaluate it but for that we need to standardize features, we are going to do that using StandardScaler. To garantee a fair training we are also going to make sure both classes appear in both training and testing!

In [46]:
#Load Data and Initialize Structures

data_folder = "class_imbalance_final"
csv_files = sorted(glob.glob(os.path.join(data_folder, "*.csv")))

# Lists to accumulate results
results = []  # Will hold dicts of metrics for each dataset
roc_images = []  # Will hold (dataset_name, base64_png, auc) for plots

In [47]:
#Process Each Dataset
# For each CSV:
# 1. Read data into X (features) and y (binary target).
# 2. Repeatedly split (stratified) until both classes present in train and test.
# 3. Scale features with StandardScaler (fit on train, transform both).
# 4. Train logistic regression.
# 5. Compute metrics on test set.
# 6. Plot and save ROC curve for this dataset.

for file_path in tqdm(csv_files, desc="Datasets"):
    dataset_name = os.path.basename(file_path)
    df = pd.read_csv(file_path)
    if 'target' not in df.columns:
        # If not labeled, rename last col to 'target'
        df = df.rename(columns={df.columns[-1]: 'target'})
    X = df.drop(columns=['target']).values
    y = df['target'].values

    # Stratified split with retries to ensure both classes in train and test
    seed = 0
    while True:
        X_train, X_test, y_train, y_test = train_test_split(
            X, y, test_size=0.3, stratify=y, random_state=seed
        )
        # Check class presence
        if len(np.unique(y_train)) == 2 and len(np.unique(y_test)) == 2:
            break
        seed += 1

    # Feature scaling
    scaler = StandardScaler()
    X_train_scaled = scaler.fit_transform(X_train)
    X_test_scaled = scaler.transform(X_test)

    # Model training
    model = LogisticRegression()  # using default parameters
    model.fit(X_train_scaled, y_train)

    # Predictions and probabilities
    y_pred = model.predict(X_test_scaled)
    try:
        y_proba = model.predict_proba(X_test_scaled)[:, 1]  # probability of class 1
    except:
        # Some implementations might use a different method name
        y_proba = model.predict(X_test_scaled)
    
    # Compute metrics
    acc = accuracy_score(y_test, y_pred)
    prec = precision_score(y_test, y_pred, zero_division=0)
    rec = recall_score(y_test, y_pred, zero_division=0)
    f1 = f1_score(y_test, y_pred, zero_division=0)
    auc = roc_auc_score(y_test, y_proba)
    cm = confusion_matrix(y_test, y_pred)

    # Store results
    results.append({
        'Dataset': dataset_name,
        'Accuracy': acc,
        'Precision': prec,
        'Recall': rec,
        'F1-Score': f1,
        'ROC AUC': auc
    })

    # Plot ROC curve
    fpr, tpr, _ = roc_curve(y_test, y_proba)
    plt.figure(figsize=(5, 4))
    plt.plot(fpr, tpr, label=f'AUC = {auc:.3f}')
    plt.plot([0, 1], [0, 1], 'k--', alpha=0.3)
    plt.title(f'ROC Curve: {dataset_name}')
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    plt.legend(loc='lower right')
    plt.tight_layout()

    # Save plot to base64
    buf = io.BytesIO()
    plt.savefig(buf, format='png')
    plt.close()
    buf.seek(0)
    img_base64 = base64.b64encode(buf.read()).decode('utf-8')
    roc_images.append((dataset_name, img_base64, auc))

Datasets:   0%|          | 0/50 [00:00<?, ?it/s]


invalid value encountered in log


invalid value encountered in log


invalid value encountered in log


invalid value encountered in log


invalid value encountered in log


invalid value encountered in log


invalid value encountered in log


invalid value encountered in log


invalid value encountered in log


invalid value encountered in log


invalid value encountered in log


invalid value encountered in log


invalid value encountered in log


invalid value encountered in log


invalid value encountered in log


invalid value encountered in log


invalid value encountered in log


invalid value encountered in log


invalid value encountered in log


invalid value encountered in log


invalid value encountered in log


invalid value encountered in log


invalid value encountered in log


invalid value encountered in log


invalid value encountered in log


invalid value encountered in log


invalid value encountered in log


invalid value encountered in log


invalid value encou

In [48]:
#Compile Metrics and Compute Averages

metrics_df = pd.DataFrame(results)
# Round metrics for display
metrics_df[['Accuracy','Precision','Recall','F1-Score','ROC AUC']] = metrics_df[
    ['Accuracy','Precision','Recall','F1-Score','ROC AUC']].round(3)

# Compute average metrics
avg_metrics = metrics_df[['Accuracy','Precision','Recall','F1-Score','ROC AUC']].mean().to_dict()

In [49]:
# Começar estrutura HTML
html_lines = []

html_lines.append("<html><head><title>Relatório Comparativo</title>")
html_lines.append("<style>")
html_lines.append("body {font-family: Arial, sans-serif; margin: 20px; background-color: #f0f8ff; color: #222;}")
html_lines.append("h1, h2, h3 { color: #007acc; }")
html_lines.append("table {border-collapse: collapse; width: 100%; margin-bottom: 30px; background-color: #ffffff;}")
html_lines.append("th, td {border: 1px solid #ddd; padding: 8px; text-align: center;}")
html_lines.append("th {background-color: #e6f2ff; cursor: pointer;}")
html_lines.append("th:hover {background-color: #d0e7ff;}")
html_lines.append("tr:nth-child(even) {background-color: #f9fbfd;}")
html_lines.append("tr:hover {background-color: #eef7ff;}")
html_lines.append("img {border: 1px solid #ccc; margin: 5px;}")
html_lines.append("a { color: #007acc; text-decoration: none; }")
html_lines.append("a:hover { text-decoration: underline; }")
html_lines.append("button { background-color: #007acc; color: white; border: none; padding: 8px 12px; border-radius: 4px; cursor: pointer; }")
html_lines.append("button:hover { background-color: #005f99; }")
html_lines.append("</style>")
html_lines.append("""
<script>
function sortTable(n, id) {
  var table = document.getElementById(id), rows, switching = true, i, x, y, shouldSwitch, dir = "asc", switchcount = 0;
  while (switching) {
    switching = false;
    rows = table.rows;
    for (i = 1; i < (rows.length - 1); i++) {
      shouldSwitch = false;
      x = rows[i].getElementsByTagName("TD")[n];
      y = rows[i + 1].getElementsByTagName("TD")[n];
      var xVal = isNaN(x.innerHTML) ? x.innerHTML.toLowerCase() : parseFloat(x.innerHTML);
      var yVal = isNaN(y.innerHTML) ? y.innerHTML.toLowerCase() : parseFloat(y.innerHTML);
      if ((dir === "asc" && xVal > yVal) || (dir === "desc" && xVal < yVal)) {
        shouldSwitch = true; break;
      }
    }
    if (shouldSwitch) {
      rows[i].parentNode.insertBefore(rows[i + 1], rows[i]);
      switching = true; switchcount++;
    } else {
      if (switchcount === 0 && dir === "asc") {
        dir = "desc"; switching = true;
      }
    }
  }
}
</script>
""")
html_lines.append("</head><body>")
html_lines.append('<a id="inicio"></a>')
html_lines.append("<h1>Relatório Comparativo: Modelos de Classificação</h1>")
html_lines.append('<div style="margin-bottom: 20px; font-size: 16px;">')
html_lines.append('<b>Ir para:</b> ')
html_lines.append('<a href="#modelo1">Baseline</a> | ')
html_lines.append('<a href="#modelo2">After Changes</a> | ')
html_lines.append('<a href="#modelo3">With SMOTE</a>')
html_lines.append('<a href="#comparacao">Comparação por Dataset</a>')
html_lines.append('</div>')



In [50]:
html_lines.append('<h2 id="modelo1">Modelo 1: Logistic Regression (Baseline) <a href="#inicio"> Voltar ao topo</a></h2>')
html_lines.append("<h3>Médias das Métricas</h3>")
html_lines.append("<ul>")
html_lines.append(f"<li>Accuracy: {avg_metrics['Accuracy']:.3f}</li>")
html_lines.append(f"<li>Precision: {avg_metrics['Precision']:.3f}</li>")
html_lines.append(f"<li>Recall: {avg_metrics['Recall']:.3f}</li>")
html_lines.append(f"<li>F1-Score: {avg_metrics['F1-Score']:.3f}</li>")
html_lines.append(f"<li>ROC AUC: {avg_metrics['ROC AUC']:.3f}</li>")
html_lines.append("</ul>")

html_lines.append('<table id="table_baseline"><tr>')
cols = ['Dataset', 'Accuracy', 'Precision', 'Recall', 'F1-Score', 'ROC AUC']
for i, col in enumerate(cols):
    html_lines.append(f'<th onclick="sortTable({i}, \'table_baseline\')">{col}</th>')
html_lines.append("</tr>")
for _, row in metrics_df.iterrows():
    html_lines.append("<tr>")
    for col in cols:
        val = row[col]
        html_lines.append(f"<td>{val:.3f}</td>" if isinstance(val, float) else f"<td>{val}</td>")
    html_lines.append("</tr>")
html_lines.append("</table>")

html_lines.append("<h3>Curvas ROC</h3><div style='display: flex; flex-wrap: wrap;'>")
for name, img_b64, auc in roc_images:
    html_lines.append("<div style='margin:10px; text-align:center;'>")
    html_lines.append(f"<img src='data:image/png;base64,{img_b64}' width='300'><br>")
    html_lines.append(f"<span><b>{name}</b><br>AUC = {auc:.3f}</span>")
    html_lines.append("</div>")
html_lines.append("</div><hr>")


We are going to make a HTML report to evaluate our metrics!

The overall averages using the Logistic Regression with no modification were:

- Accuracy: 0.639

- Precision: 0.167

- Recall: 0.507

- F1-Score: 0.226

- ROC AUC: 0.579

And these results align what we were expecting because the **precision is low** due to the struggle to identify the minority class; **recall is higher** because it's correctly flagging some positives but with many false positives; **F1-score is approximately 0.22** what's in line with a precision-recall trade-off.

Also, we have some datasets where the precision and the recall were 0 because the model predicted only one class due to the strong imbalance!

And we have too somes cases with high accuracy and bad recall, a typical illusion of accuracy when the majority class dominates.

Now we can proceed for the next step!

# 6th Step - Model Training & Evaluation: Post‑Changes

For this next step we are going to change our initial algorithm by creating a new hiperparameter 'imbalance_penalty' that does:
- Helps address class imbalance by **increasing the weight of the minority class** in the loss function.
- If set to *>1.0* , the model will **penalize errors on the minority class more heavily**, making it more sensitive to that class.
- Applied dynamically based on which class is underrepresented in the training data.
- Affects both **loss calculation** and **gradient updates**, guiding the model to perform better on imbalanced datasets.

In [51]:
class LossLogisticRegression:
    def __init__(self, lr=0.01, penalty=None, C=0.01, tolerance=1e-4, max_iters=1000, imbalance_penalty=1.0):
        self.lr = lr
        self.penalty = penalty    # 'l1', 'l2', or None
        self.C = C
        self.tolerance = tolerance
        self.max_iters = max_iters
        self.imbalance_penalty = imbalance_penalty
        self.coef_ = None  # Includes intercept as first element

    def sigmoid(self, z):
        return 1 / (1 + np.exp(-z))

    def fit(self, X, y):
        X = np.array(X)
        y = np.array(y)
        N = len(y)
        # Determine class counts
        N_pos = np.sum(y == 1)
        N_neg = np.sum(y == 0)
        # Setup class weights manually based on imbalance_penalty
        if N_pos == 0 or N_neg == 0:
            # If one class is absent, no weighting needed
            self.w_pos = 1.0
            self.w_neg = 1.0
        else:
            if self.imbalance_penalty != 1.0:
                # Identify minority class
                if N_pos < N_neg:
                    self.w_pos = self.imbalance_penalty
                    self.w_neg = 1.0
                elif N_neg < N_pos:
                    self.w_pos = 1.0
                    self.w_neg = self.imbalance_penalty
                else:
                    # If classes are balanced, no extra weighting
                    self.w_pos = 1.0
                    self.w_neg = 1.0
            else:
                # No imbalance penalty => no weighting
                self.w_pos = 1.0
                self.w_neg = 1.0

        # Add intercept term
        X_b = np.hstack([np.ones((N,1)), X])
        # Initialize coefficients (bias + weights)
        self.coef_ = np.zeros(X_b.shape[1])

        prev_cost = float('inf')
        for i in range(self.max_iters):
            # Compute predictions
            z = X_b.dot(self.coef_)
            y_pred = self.sigmoid(z)
            # Clip predictions to avoid log(0)
            eps = 1e-15
            y_pred = np.clip(y_pred, eps, 1 - eps)

            # Compute weighted binary cross-entropy loss
            cost = -(1.0 / N) * (
                self.w_pos * (y * np.log(y_pred)).sum() +
                self.w_neg * ((1 - y) * np.log(1 - y_pred)).sum()
            )
            # Add regularization penalty (skip intercept at index 0)
            if self.penalty == 'l2':
                cost += 0.5 * self.C * np.sum(self.coef_[1:]**2)
            elif self.penalty == 'l1':
                cost += self.C * np.sum(np.abs(self.coef_[1:]))

            # Check for convergence
            if abs(prev_cost - cost) < self.tolerance:
                break
            prev_cost = cost

            # Compute gradient of weighted loss
            error = y_pred - y
            weights = np.where(y == 1, self.w_pos, self.w_neg)
            gradient = (X_b.T.dot(weights * error)) / N

            # Add gradient of penalty (not applying to intercept)
            if self.penalty == 'l2':
                gradient[1:] += self.C * self.coef_[1:]
            elif self.penalty == 'l1':
                gradient[1:] += self.C * np.sign(self.coef_[1:])

            # Gradient descent update
            self.coef_ -= self.lr * gradient

        return self

    def predict_proba(self, X):
        X = np.array(X)
        N = X.shape[0]
        X_b = np.hstack([np.ones((N,1)), X])
        return self.sigmoid(X_b.dot(self.coef_))

    def predict(self, X):
        proba = self.predict_proba(X)
        return (proba >= 0.5).astype(int)


In [52]:
#Load Data and Initialize Structures

data_folder = "class_imbalance_final"
csv_files = sorted(glob.glob(os.path.join(data_folder, "*.csv")))

# Lists to accumulate results
results1 = []  # Will hold dicts of metrics for each dataset
roc_images1 = []  # Will hold (dataset_name, base64_png, auc) for plots

In [53]:
#Process Each Dataset
# For each CSV:
# 1. Read data into X (features) and y (binary target).
# 2. Repeatedly split (stratified) until both classes present in train and test.
# 3. Scale features with StandardScaler (fit on train, transform both).
# 4. Train logistic regression.
# 5. Compute metrics on test set.
# 6. Plot and save ROC curve for this dataset.

for file_path in tqdm(csv_files, desc="Datasets"):
    dataset_name = os.path.basename(file_path)
    df = pd.read_csv(file_path)
    # Assume the last column is 'target' as preprocessed (0/1)
    if 'target' not in df.columns:
        # If not labeled, rename last col to 'target'
        df = df.rename(columns={df.columns[-1]: 'target'})
    X = df.drop(columns=['target']).values
    y = df['target'].values

    # Stratified split with retries to ensure both classes in train and test
    seed = 0
    while True:
        X_train, X_test, y_train, y_test = train_test_split(
            X, y, test_size=0.3, stratify=y, random_state=seed
        )
        # Check class presence
        if len(np.unique(y_train)) == 2 and len(np.unique(y_test)) == 2:
            break
        seed += 1

    # Feature scaling
    scaler = StandardScaler()
    X_train_scaled = scaler.fit_transform(X_train)
    X_test_scaled = scaler.transform(X_test)

    # Model training
    model = LossLogisticRegression()  # using default parameters
    model.fit(X_train_scaled, y_train)

    # Predictions and probabilities
    y_pred = model.predict(X_test_scaled)
    try:
        y_proba = model.predict_proba(X_test_scaled)[:, 1]  # probability of class 1
    except:
        # Some implementations might use a different method name
        y_proba = model.predict(X_test_scaled)
    
    # Compute metrics
    acc = accuracy_score(y_test, y_pred)
    prec = precision_score(y_test, y_pred, zero_division=0)
    rec = recall_score(y_test, y_pred, zero_division=0)
    f1 = f1_score(y_test, y_pred, zero_division=0)
    auc = roc_auc_score(y_test, y_proba)
    cm = confusion_matrix(y_test, y_pred)

    # Store results
    results1.append({
        'Dataset': dataset_name,
        'Accuracy': acc,
        'Precision': prec,
        'Recall': rec,
        'F1-Score': f1,
        'ROC AUC': auc
    })

    # Plot ROC curve
    fpr, tpr, _ = roc_curve(y_test, y_proba)
    plt.figure(figsize=(5, 4))
    plt.plot(fpr, tpr, label=f'AUC = {auc:.3f}')
    plt.plot([0, 1], [0, 1], 'k--', alpha=0.3)
    plt.title(f'ROC Curve: {dataset_name}')
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    plt.legend(loc='lower right')
    plt.tight_layout()

    # Save plot to base64
    buf = io.BytesIO()
    plt.savefig(buf, format='png')
    plt.close()
    buf.seek(0)
    img_base64 = base64.b64encode(buf.read()).decode('utf-8')
    roc_images1.append((dataset_name, img_base64, auc))

Datasets:   0%|          | 0/50 [00:00<?, ?it/s]

In [54]:
#Compile Metrics and Compute Averages

metrics_df1 = pd.DataFrame(results1)
# Round metrics for display
metrics_df1[['Accuracy','Precision','Recall','F1-Score','ROC AUC']] = metrics_df1[
    ['Accuracy','Precision','Recall','F1-Score','ROC AUC']].round(3)

# Compute average metrics
avg_metrics1 = metrics_df1[['Accuracy','Precision','Recall','F1-Score','ROC AUC']].mean().to_dict()

In [55]:
html_lines.append('<h2 id="modelo2">Modelo 2: Logistic Regression (After Changes) <a href="#inicio">🔝 Voltar ao topo</a></h2>')
html_lines.append("<h3>Médias das Métricas</h3>")
html_lines.append("<ul>")
html_lines.append(f"<li>Accuracy: {avg_metrics1['Accuracy']:.3f}</li>")
html_lines.append(f"<li>Precision: {avg_metrics1['Precision']:.3f}</li>")
html_lines.append(f"<li>Recall: {avg_metrics1['Recall']:.3f}</li>")
html_lines.append(f"<li>F1-Score: {avg_metrics1['F1-Score']:.3f}</li>")
html_lines.append(f"<li>ROC AUC: {avg_metrics1['ROC AUC']:.3f}</li>")
html_lines.append("</ul>")

html_lines.append('<table id="table_after"><tr>')
for i, col in enumerate(cols):
    html_lines.append(f'<th onclick="sortTable({i}, \'table_after\')">{col}</th>')
html_lines.append("</tr>")
for _, row in metrics_df1.iterrows():
    html_lines.append("<tr>")
    for col in cols:
        val = row[col]
        html_lines.append(f"<td>{val:.3f}</td>" if isinstance(val, float) else f"<td>{val}</td>")
    html_lines.append("</tr>")
html_lines.append("</table>")

html_lines.append("<h3>Curvas ROC</h3><div style='display: flex; flex-wrap: wrap;'>")
for name, img_b64, auc in roc_images1:
    html_lines.append("<div style='margin:10px; text-align:center;'>")
    html_lines.append(f"<img src='data:image/png;base64,{img_b64}' width='300'><br>")
    html_lines.append(f"<span><b>{name}</b><br>AUC = {auc:.3f}</span>")
    html_lines.append("</div>")
html_lines.append("</div><hr>")


Now we are also going to compare the result using our changed algorithm with SMOTE (Synthetic Minority Over-sampling Technique) to see if the results change!

We are using k=4 neighbours instead of the default k=5 because SMOTE requires at least *k+1* minority class samples, and we have some cases with only 5 minority cases.

In [56]:
# Load Data and Initialize Structures
data_folder = "class_imbalance_final"
csv_files = sorted(glob.glob(os.path.join(data_folder, "*.csv")))

results2 = []  # Will hold dicts of metrics for each dataset
roc_images2 = []  # Will hold (dataset_name, base64_png, auc) for plots

# Process each dataset
for file_path in tqdm(csv_files, desc="Datasets"):
    dataset_name = os.path.basename(file_path)
    df = pd.read_csv(file_path)

    # Assume the last column is 'target' as preprocessed (0/1)
    if 'target' not in df.columns:
        df = df.rename(columns={df.columns[-1]: 'target'})

    X = df.drop(columns=['target']).values
    y = df['target'].values

    # Stratified split with retries to ensure both classes in train and test
    seed = 0
    while True:
        X_train, X_test, y_train, y_test = train_test_split(
            X, y, test_size=0.3, stratify=y, random_state=seed
        )
        if len(np.unique(y_train)) == 2 and len(np.unique(y_test)) == 2:
            break
        seed += 1

    # Feature scaling
    scaler = StandardScaler()
    X_train_scaled = scaler.fit_transform(X_train)
    X_test_scaled = scaler.transform(X_test)

    # Apply SMOTE to training data
    smote = SMOTE(random_state=42, k_neighbors=4)
    X_train_resampled, y_train_resampled = smote.fit_resample(X_train_scaled, y_train)

    # Model training
    model = LossLogisticRegression()
    model.fit(X_train_resampled, y_train_resampled)

    # Predictions and probabilities
    y_pred = model.predict(X_test_scaled)
    try:
        y_proba = model.predict_proba(X_test_scaled)
    except:
        y_proba = y_pred  # fallback if proba isn't supported

    # Compute metrics
    acc = accuracy_score(y_test, y_pred)
    prec = precision_score(y_test, y_pred, zero_division=0)
    rec = recall_score(y_test, y_pred, zero_division=0)
    f1 = f1_score(y_test, y_pred, zero_division=0)
    auc = roc_auc_score(y_test, y_proba)
    cm = confusion_matrix(y_test, y_pred)

    # Store results
    results2.append({
        'Dataset': dataset_name,
        'Accuracy': acc,
        'Precision': prec,
        'Recall': rec,
        'F1-Score': f1,
        'ROC AUC': auc
    })

    # Plot ROC curve
    fpr, tpr, _ = roc_curve(y_test, y_proba)
    plt.figure(figsize=(5, 4))
    plt.plot(fpr, tpr, label=f'AUC = {auc:.3f}')
    plt.plot([0, 1], [0, 1], 'k--', alpha=0.3)
    plt.title(f'ROC Curve: {dataset_name}')
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    plt.legend(loc='lower right')
    plt.tight_layout()

    # Save plot to base64
    buf = io.BytesIO()
    plt.savefig(buf, format='png')
    plt.close()
    buf.seek(0)
    img_base64 = base64.b64encode(buf.read()).decode('utf-8')
    roc_images2.append((dataset_name, img_base64, auc))

    # Convert results list to DataFrame
metrics_df2 = pd.DataFrame(results2)

# Compute average metrics
avg_metrics2= metrics_df2[['Accuracy', 'Precision', 'Recall', 'F1-Score', 'ROC AUC']].mean()

Datasets:   0%|          | 0/50 [00:00<?, ?it/s]

In [57]:
html_lines.append('<h2 id="modelo3">Modelo 3: Logistic Regression (With SMOTE) <a href="#inicio">🔝 Voltar ao topo</a></h2>')
html_lines.append("<h3>Médias das Métricas</h3>")
html_lines.append("<ul>")
html_lines.append(f"<li>Accuracy: {avg_metrics2['Accuracy']:.3f}</li>")
html_lines.append(f"<li>Precision: {avg_metrics2['Precision']:.3f}</li>")
html_lines.append(f"<li>Recall: {avg_metrics2['Recall']:.3f}</li>")
html_lines.append(f"<li>F1-Score: {avg_metrics2['F1-Score']:.3f}</li>")
html_lines.append(f"<li>ROC AUC: {avg_metrics2['ROC AUC']:.3f}</li>")
html_lines.append("</ul>")

html_lines.append('<table id="table_smote"><tr>')
for i, col in enumerate(cols):
    html_lines.append(f'<th onclick="sortTable({i}, \'table_smote\')">{col}</th>')
html_lines.append("</tr>")
for _, row in metrics_df2.iterrows():
    html_lines.append("<tr>")
    for col in cols:
        val = row[col]
        html_lines.append(f"<td>{val:.3f}</td>" if isinstance(val, float) else f"<td>{val}</td>")
    html_lines.append("</tr>")
html_lines.append("</table>")

html_lines.append("<h3>Curvas ROC</h3><div style='display: flex; flex-wrap: wrap;'>")
for name, img_b64, auc in roc_images2:
    html_lines.append("<div style='margin:10px; text-align:center;'>")
    html_lines.append(f"<img src='data:image/png;base64,{img_b64}' width='300'><br>")
    html_lines.append(f"<span><b>{name}</b><br>AUC = {auc:.3f}</span>")
    html_lines.append("</div>")
html_lines.append("</div><hr>")


In [58]:
html_lines.append('<hr>')
html_lines.append('<a id="comparacao"></a>')
html_lines.append('<h2>🔍 Comparação por Dataset</h2>')
html_lines.append('<div style="margin-top: 20px;"><a href="#inicio">🔝 Voltar ao topo</a></div>')
html_lines.append("""
<label for="datasetSelect">Escolhe um dataset:</label>
<select id="datasetSelect" onchange="compararDataset()">
  <option value="">-- Seleciona --</option>
</select>

<table id="comparacaoResultados" style="margin-top:15px; display:none;">
  <thead>
    <tr>
      <th>Modelo</th>
      <th>Accuracy</th>
      <th>Precision</th>
      <th>Recall</th>
      <th>F1-Score</th>
      <th>ROC AUC</th>
    </tr>
  </thead>
  <tbody></tbody>
</table>

<script>
const modelos = ["Baseline", "After Changes", "With SMOTE"];
const tabelas = {
  "Baseline": document.getElementById("table_baseline"),
  "After Changes": document.getElementById("table_after"),
  "With SMOTE": document.getElementById("table_smote")
};


// Recolher todos os nomes de dataset existentes
const nomesDatasets = new Set();
for (const modelo in tabelas) {
  const linhas = tabelas[modelo].querySelectorAll("tbody tr");
  for (const linha of linhas) {
    const nome = linha.children[0].textContent.trim();
    nomesDatasets.add(nome);
  }
}

// Preencher o dropdown
const select = document.getElementById("datasetSelect");
[...nomesDatasets].sort().forEach(nome => {
  const option = document.createElement("option");
  option.value = nome;
  option.textContent = nome;
  select.appendChild(option);
});

function compararDataset() {
  const ds = select.value;
  const tabela = document.getElementById("comparacaoResultados");
  const tbody = tabela.querySelector("tbody");
  tbody.innerHTML = "";

  if (!ds) {
    tabela.style.display = "none";
    return;
  }

  for (const modelo in tabelas) {
    const linhas = tabelas[modelo].querySelectorAll("tbody tr");
    for (const linha of linhas) {
      if (linha.children[0].textContent.trim() === ds) {
        const novaLinha = document.createElement("tr");
        novaLinha.innerHTML = `<td>${modelo}</td>` +
          [...linha.children].slice(1).map(td => `<td>${td.textContent}</td>`).join("");
        tbody.appendChild(novaLinha);
        break;
      }
    }
  }
  tabela.style.display = "table";
}
</script>
""")


In [59]:
html_lines.append("</body></html>")

with open("relatorio_comparativo_final.html", "w", encoding="utf-8") as f:
    f.write("\n".join(html_lines))


We observed that adding SMOTE to a cost-sensitive logistic regression led to modest improvements, particularly in AUC but overall the gains were incremental.

In [60]:
html_lines = []

# Cabeçalho com CSS e JavaScript
html_lines.append("<html><head><title>Relatório Comparativo</title>")
html_lines.append("<style>")
html_lines.append("body {font-family: Arial, sans-serif; margin: 20px; background-color: #f0f8ff; color: #222;}")
html_lines.append("h1, h2, h3 { color: #007acc; }")
html_lines.append("table {border-collapse: collapse; width: 100%; margin-bottom: 30px; background-color: #ffffff;}")
html_lines.append("th, td {border: 1px solid #ddd; padding: 8px; text-align: center;}")
html_lines.append("th {background-color: #e6f2ff; cursor: pointer;}")
html_lines.append("th:hover {background-color: #d0e7ff;}")
html_lines.append("tr:nth-child(even) {background-color: #f9fbfd;}")
html_lines.append("tr:hover {background-color: #eef7ff;}")
html_lines.append("img {border: 1px solid #ccc; margin: 5px;}")
html_lines.append("a { color: #007acc; text-decoration: none; }")
html_lines.append("a:hover { text-decoration: underline; }")
html_lines.append("button { background-color: #007acc; color: white; border: none; padding: 8px 12px; border-radius: 4px; cursor: pointer; }")
html_lines.append("button:hover { background-color: #005f99; }")
html_lines.append("</style>")
html_lines.append("""
<script>
function sortTable(n, id) {
  var table = document.getElementById(id), rows, switching = true, i, x, y, shouldSwitch, dir = "asc", switchcount = 0;
  while (switching) {
    switching = false;
    rows = table.rows;
    for (i = 1; i < (rows.length - 1); i++) {
      shouldSwitch = false;
      x = rows[i].getElementsByTagName("TD")[n];
      y = rows[i + 1].getElementsByTagName("TD")[n];
      var xVal = isNaN(x.innerHTML) ? x.innerHTML.toLowerCase() : parseFloat(x.innerHTML);
      var yVal = isNaN(y.innerHTML) ? y.innerHTML.toLowerCase() : parseFloat(y.innerHTML);
      if ((dir === "asc" && xVal > yVal) || (dir === "desc" && xVal < yVal)) {
        shouldSwitch = true; break;
      }
    }
    if (shouldSwitch) {
      rows[i].parentNode.insertBefore(rows[i + 1], rows[i]);
      switching = true; switchcount++;
    } else {
      if (switchcount === 0 && dir === "asc") {
        dir = "desc"; switching = true;
      }
    }
  }
}
</script>
""")
html_lines.append("</head><body>")
html_lines.append("<h1>Relatório Comparativo: Modelos de Classificação</h1>")


# 7th Step - Creating the HTML for results viewing

Now that we have all the results we are going to create two HTML files for comparison of results!

We are going to be able to compare the results through models and through datasets as well!

In [61]:
# Assume these DataFrames exist from earlier steps:
# metrics_df  -> Baseline metrics (indexed by 'Dataset')
# metrics_df1 -> After Changes metrics (indexed by 'Dataset')
# metrics_df2 -> Changes + SMOTE metrics (indexed by 'Dataset')

# List of metric names (as in the notebook)
metrics = ['Accuracy', 'Precision', 'Recall', 'F1-Score', 'ROC AUC']
# List of models and colors
models = ['Baseline', 'After Changes', 'Changes + SMOTE']
colors = {'Baseline': 'steelblue', 
          'After Changes': 'orange', 
          'Changes + SMOTE': 'green'}

# Ensure 'Dataset' is index (if not already)
metrics_df = metrics_df.set_index('Dataset')
metrics_df1 = metrics_df1.set_index('Dataset')
metrics_df2 = metrics_df2.set_index('Dataset')

# All dataset names to include in dropdown
dataset_names = list(metrics_df.index)

# Create subplots: 1 row x 2 cols
fig = make_subplots(rows=1, cols=2, subplot_titles=('Metrics Comparison', 'ROC Curve'))

# Initial dataset (first one)
initial_ds = dataset_names[0]
# Extract initial y-values for bars
y_base = [metrics_df.loc[initial_ds, m]    for m in metrics]
y_after = [metrics_df1.loc[initial_ds, m]  for m in metrics]
y_smote = [metrics_df2.loc[initial_ds, m]  for m in metrics]

# Add bar chart traces (one trace per model)
fig.add_trace(go.Bar(name='Baseline', x=metrics, y=y_base, marker_color=colors['Baseline']), row=1, col=1)
fig.add_trace(go.Bar(name='After Changes', x=metrics, y=y_after, marker_color=colors['After Changes']), row=1, col=1)
fig.add_trace(go.Bar(name='Changes + SMOTE', x=metrics, y=y_smote, marker_color=colors['Changes + SMOTE']), row=1, col=1)

# Add ROC curve traces for initial dataset (dummy example curves here)
# In practice, replace x= and y= with the actual fpr/tpr arrays for each model.
fig.add_trace(go.Scatter(x=[0,1], y=[0,1], mode='lines', name='Baseline ROC', 
                         line=dict(color=colors['Baseline'])), row=1, col=2)
fig.add_trace(go.Scatter(x=[0,1], y=[0.1,0.9], mode='lines', name='After Changes ROC', 
                         line=dict(color=colors['After Changes'])), row=1, col=2)
fig.add_trace(go.Scatter(x=[0,1], y=[0.05,0.95], mode='lines', name='Changes+SMOTE ROC', 
                         line=dict(color=colors['Changes + SMOTE'], dash='dash')), row=1, col=2)

# Update layout titles
fig.update_layout(title_text=f"Model Comparison: Dataset = {initial_ds}", 
                  showlegend=True, 
                  legend_title_text='Model',
                  yaxis=dict(title='Score'), 
                  yaxis2=dict(title='True Positive Rate'),
                  xaxis2=dict(title='False Positive Rate'))

# Prepare dropdown buttons (one per dataset)
buttons = []
for ds in dataset_names:
    # New bar heights for this dataset
    yb = [metrics_df.loc[ds, m]    for m in metrics]
    ya = [metrics_df1.loc[ds, m]   for m in metrics]
    ys = [metrics_df2.loc[ds, m]   for m in metrics]
    # (Here we would also compute the new ROC curves if we have data)
    # For example, new ROC points could be precomputed fpr/ tpr arrays:
    # fpr_base, tpr_base = compute_roc('Baseline', ds)
    # etc. As placeholders, we'll keep the same line segments.
    new_y = [yb, ya, ys,  # bar traces update
             [0,1], [0.1,0.9], [0.05,0.95]]  # ROC traces (dummy here)
    buttons.append(dict(label=ds,
                        method='update',
                        args=[{
                            'y': new_y,
                            'x': [metrics, metrics, metrics, [0,1], [0,1], [0,1]]
                        }, {
                            'title': f"Model Comparison: Dataset = {ds}"
                        }]))
# Add dropdown to layout
fig.update_layout(
    updatemenus=[dict(active=0, buttons=buttons, 
                      x=0.5, xanchor='center', y=1.15, yanchor='top')],
    margin=dict(t=100)  # make space for dropdown
)

fig.show()



# **Conclusion**

| Metric    | Baseline | After Changes | Changes + SMOTE |
| --------- | -------- | ------------- | --------------- |
| Accuracy  | 0.639    | 0.935         | 0.785           |
| Precision | 0.167    | 0.707         | 0.359           |
| Recall    | 0.507    | 0.429         | 0.845           |
| F1-Score  | 0.226    | 0.490         | 0.477           |
| ROC AUC   | 0.579    | 0.709         | 0.881           |


The model changes led to significant improvements in accuracy, precision, and F1-Score, although recall slightly decreased. After applying SMOTE, recall increased substantially (to 0.845), and overall performance improved (ROC AUC of 0.881), despite a drop in precision. Therefore, the SMOTE-enhanced model is more suitable when identifying positive cases is critical, while the model without SMOTE is preferable when higher precision is required.