In [2]:
pip install xgboost


In [3]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score
import xgboost as xgb
from xgboost import XGBClassifier
import matplotlib.pyplot as plt

In [4]:
train = pd.read_csv(r"C:\Users\cn4330\OneDrive - BDO AS\ML\Kaggle\kaggle_prediction_rainfall\data\train.csv")
test = pd.read_csv(r"C:\Users\cn4330\OneDrive - BDO AS\ML\Kaggle\kaggle_prediction_rainfall\data\test.csv")

In [5]:
train.head()

Unnamed: 0,id,day,pressure,maxtemp,temparature,mintemp,dewpoint,humidity,cloud,sunshine,winddirection,windspeed,rainfall
0,0,1,1017.4,21.2,20.6,19.9,19.4,87.0,88.0,1.1,60.0,17.2,1
1,1,2,1019.5,16.2,16.9,15.8,15.4,95.0,91.0,0.0,50.0,21.9,1
2,2,3,1024.1,19.4,16.1,14.6,9.3,75.0,47.0,8.3,70.0,18.1,1
3,3,4,1013.4,18.1,17.8,16.9,16.8,95.0,95.0,0.0,60.0,35.6,1
4,4,5,1021.8,21.3,18.4,15.2,9.6,52.0,45.0,3.6,40.0,24.8,0


Check whether the dataset contains any categorical features.

In [6]:
# Define features
RMV = ["id", "rainfall"]
FEATURES = [c for c in train.columns if c not in RMV]
CATS = [c for c in FEATURES if train[c].dtype == "object"]

print(f"Features: {len(FEATURES)} (Categorical: {len(CATS)})")

Features: 11 (Categorical: 0)


Downcast float64 columns to float32 and int64 columns to int32 to save memory.

In [7]:
for c in FEATURES:
    if train[c].dtype == "float64":
        train[c] = train[c].astype("float32")
        test[c] = test[c].astype("float32")
    elif train[c].dtype == "int64":
        train[c] = train[c].astype("int32")
        test[c] = test[c].astype("int32")
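
To see what the downcast buys, here is a minimal sketch on synthetic data (the array is made up for illustration and is not the competition data). It compares the memory footprint of the same values stored as float64 and float32, and measures the rounding error the narrower type introduces.

```python
import numpy as np

# Synthetic stand-in for a numeric feature column (hypothetical values).
rng = np.random.default_rng(0)
pressure64 = rng.uniform(990.0, 1040.0, size=10_000)  # float64 by default
pressure32 = pressure64.astype("float32")

# Memory used by the raw values in each representation.
bytes64 = pressure64.nbytes
bytes32 = pressure32.nbytes

# Largest difference introduced by storing the values with fewer bits.
max_abs_error = float(np.max(np.abs(pressure64 - pressure32.astype("float64"))))

print(f"float64: {bytes64} bytes, float32: {bytes32} bytes")
print(f"largest rounding error: {max_abs_error:.2e}")
```

The footprint halves, and for readings recorded to one decimal place the rounding error should be far below the precision of the data.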

## XGBoost

We focus on building a robust XGBoost model for the Kaggle Rainfall Prediction competition. Since the dataset contains no categorical features, no encoding step is needed, and XGBoost's native handling of numerical data and missing values keeps preprocessing to a minimum.

XGBoost handles NaN values in numerical features natively:

- Learned default direction: instead of filling NaNs with a fixed value (e.g., mean or median), XGBoost learns at each tree split whether rows with a missing value should go to the left or the right child, choosing the direction that improves the training objective most.
- Missingness as signal: because the default direction is learned from the data, the model can pick up patterns related to missingness itself.

This eliminates the need for manual imputation, reducing preprocessing complexity and preserving the information carried by missing values.
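
The idea of learning where missing values go can be sketched in a few lines. This is a simplified illustration, not XGBoost's implementation: it uses the squared error of leaf means as a stand-in for XGBoost's gradient-based gain, and the function names and toy data are hypothetical.

```python
import math

def sse(values):
    """Sum of squared errors around the mean (0.0 for an empty leaf)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)

def best_default_direction(x, y, threshold):
    """Route missing values left, then right, and keep the option with lower loss."""
    losses = {}
    for direction in ("left", "right"):
        left, right = [], []
        for xi, yi in zip(x, y):
            if math.isnan(xi):
                goes_left = direction == "left"   # missing rows follow the default
            else:
                goes_left = xi < threshold        # observed rows follow the split
            (left if goes_left else right).append(yi)
        losses[direction] = sse(left) + sse(right)
    return min(losses, key=losses.get), losses

# Toy data (made up): humidity readings with gaps; rows with a gap are all rainy.
humidity = [52.0, 60.0, 75.0, 87.0, 95.0, float("nan"), float("nan")]
rainfall = [0, 0, 1, 1, 1, 1, 1]

direction, losses = best_default_direction(humidity, rainfall, threshold=70.0)
print(direction, losses)
```

In this toy case the missing rows are sent to the high-humidity side because that keeps both leaves pure, which is the kind of missingness pattern a fixed mean or median fill would hide.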

## Hyperparameter Tuning with Optuna for XGBoost
In this section, we use Optuna to tune the hyperparameters of our XGBoost model. Optuna is a hyperparameter optimization framework that automates the search for a good configuration.

### Optuna Objective Function
We define an objective function that Optuna will optimize. This function:

- Suggests hyperparameters for each trial.
- Trains the XGBoost model using 5-fold cross-validation.
- Evaluates performance using ROC AUC (the competition metric) and returns the mean across folds.
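
For intuition about the metric, ROC AUC can be read as the probability that a randomly chosen rainy day receives a higher score than a randomly chosen dry day. The sketch below computes it that way on made-up labels and scores; `roc_auc` is a hypothetical helper written for illustration, while the notebook itself relies on scikit-learn's `roc_auc_score`.

```python
def roc_auc(y_true, y_score):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Made-up labels and predicted probabilities.
y_true = [1, 1, 0, 1, 0]
y_score = [0.9, 0.8, 0.7, 0.4, 0.2]

auc = roc_auc(y_true, y_score)
print(f"AUC: {auc:.4f}")
```

Because only the ordering of the scores matters, rescaling the predictions leaves the AUC unchanged. The pairwise loop is quadratic and meant for illustration only.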

In [11]:
pip install optuna


In [12]:
import optuna
import logging
from sklearn.model_selection import KFold
from sklearn.metrics import roc_auc_score


In [13]:



def optimize_xgboost(train, FEATURES, n_trials=30):
    def objective(trial):
        # Hyperparameter suggestions
        params = {
            "max_depth": trial.suggest_int("max_depth", 3, 10),
            "learning_rate": trial.suggest_float("learning_rate", 0.01, 0.3, log=True),
            "colsample_bytree": trial.suggest_float("colsample_bytree", 0.6, 1.0),
            "subsample": trial.suggest_float("subsample", 0.6, 1.0),
            "min_child_weight": trial.suggest_int("min_child_weight", 1, 10),
            "gamma": trial.suggest_float("gamma", 0.0, 5.0),
            "scale_pos_weight": trial.suggest_float("scale_pos_weight", 1.0, 3.0),
            "n_estimators": trial.suggest_int("n_estimators", 1000, 5000),
            "reg_alpha": trial.suggest_float("reg_alpha", 0.0, 10.0),
            "reg_lambda": trial.suggest_float("reg_lambda", 0.0, 10.0),
        }

        # 5-fold cross-validation
        FOLDS = 5
        kf = KFold(n_splits=FOLDS, shuffle=True, random_state=42)
        auc_scores = []

        for train_idx, valid_idx in kf.split(train):
            X_train = train.iloc[train_idx][FEATURES]
            y_train = train.iloc[train_idx]["rainfall"]
            X_valid = train.iloc[valid_idx][FEATURES]
            y_valid = train.iloc[valid_idx]["rainfall"]

            # Train XGBoost model
            model = XGBClassifier(
                **params,
                eval_metric="auc",
                early_stopping_rounds=650,
                random_state=42,
                tree_method="hist",
                enable_categorical=False,  # No categorical features in this dataset
                verbosity=0
            )

            model.fit(
                X_train, y_train,
                eval_set=[(X_valid, y_valid)],
                verbose=0
            )

            # Evaluate on validation set
            preds = model.predict_proba(X_valid)[:, 1]
            auc = roc_auc_score(y_valid, preds)
            auc_scores.append(auc)

        # Return mean AUC across folds
        return np.mean(auc_scores)

    # Create Optuna study
    optuna.logging.set_verbosity(optuna.logging.ERROR)
    study = optuna.create_study(direction="maximize")  # Maximize AUC
    study.optimize(objective, n_trials=n_trials)
    return study.best_params

# Run optimization
best_params = optimize_xgboost(train, FEATURES, n_trials=30)
print("Best hyperparameters:", best_params)




Best hyperparameters: {'max_depth': 3, 'learning_rate': 0.17442866330834925, 'colsample_bytree': 0.9480940469785304, 'subsample': 0.6146800205988725, 'min_child_weight': 5, 'gamma': 0.27522234243330534, 'scale_pos_weight': 2.3293170664287075, 'n_estimators': 4977, 'reg_alpha': 8.432357529943625, 'reg_lambda': 0.10870310010673911}


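## Why Optuna Gives Different Hyperparameters Each Run

- Random sampling: Optuna's default TPE sampler starts with randomly sampled trials drawn from the hyperparameter ranges you define.
- Stochastic processes: shuffled cross-validation splits and model training (e.g., XGBoost's `subsample`) introduce randomness unless their `random_state` is fixed, as it is in the code above.
- Probabilistic search: Optuna's TPE algorithm explores the parameter space probabilistically, so different early trials lead to different search paths.
- No fixed seed: the study above is created without a seeded sampler. Optuna does support one: passing `sampler=optuna.samplers.TPESampler(seed=...)` to `optuna.create_study` makes the sampling reproducible for sequential trials.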

In [19]:
# Best hyperparameters from Optuna (these change on every run)
best_params = {
    'max_depth': 3,
    'learning_rate': 0.17442866330834925,
    'colsample_bytree': 0.9480940469785304,
    'subsample': 0.6146800205988725,
    'min_child_weight': 5,
    'gamma': 0.27522234243330534,
    'scale_pos_weight': 2.3293170664287075,
    'n_estimators': 4977,
    'reg_alpha':  8.432357529943625,
    'reg_lambda': 0.10870310010673911
}

## XGBoost config

Once Optuna has found the best hyperparameters, we can configure the final model and train it on the training set.

In [20]:
model = XGBClassifier(
    **best_params,
    eval_metric="auc",
    early_stopping_rounds=300,
    random_state=42,
    tree_method="hist",
    enable_categorical=False,
)

## Cross-Validation Setup
We use 10-fold cross-validation with shuffled `KFold` splits (not stratified) to:

- Get a more reliable estimate of model performance than a single train/validation split.
- Average test predictions over ten models, each trained on a different subset of the data.
- Generate out-of-fold (OOF) predictions for ensembling or model diagnostics.

Each fold trains on 90% of the data and validates on the remaining 10%; the OOF predictions from all folds together cover every training row exactly once.
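
The bookkeeping behind OOF and averaged test predictions can be sketched without any model library. The example below is an illustration under stated assumptions: the data are random, the folds are built by hand with NumPy rather than with `KFold`, and the "model" is a placeholder that predicts the positive rate of its training rows.

```python
import numpy as np

rng = np.random.default_rng(42)
n_train, n_test, n_folds = 20, 5, 10

y = rng.integers(0, 2, size=n_train)  # made-up binary target
# Shuffle the row indices and cut them into validation folds of equal size.
fold_ids = np.array_split(rng.permutation(n_train), n_folds)

oof = np.full(n_train, np.nan)  # one out-of-fold prediction per training row
test_pred = np.zeros(n_test)    # running average of the fold models on the test set

for val_idx in fold_ids:
    train_mask = np.ones(n_train, dtype=bool)
    train_mask[val_idx] = False  # everything outside the fold is training data

    # Placeholder "model": predict the positive rate seen in the training rows.
    fold_prediction = y[train_mask].mean()

    oof[val_idx] = fold_prediction           # rows in this fold were not used for fitting
    test_pred += fold_prediction / n_folds   # each fold contributes 1/n_folds

print(oof)
print(test_pred)
```

Every training row ends up with a prediction from a model that never saw it, which is what makes the OOF score an honest estimate, while the test prediction is the mean of the ten fold models.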

In [21]:
%%time

FOLDS = 10
kf = KFold(n_splits=FOLDS, shuffle=True, random_state=42)
    
oof_xgb = np.zeros(len(train))
pred_xgb = np.zeros(len(test))

for fold, (train_idx, val_idx) in enumerate(kf.split(train)):

    print("#"*25)
    print(f"### Fold {fold+1}")
    print("#"*25)

    X_train = train.iloc[train_idx][FEATURES]
    y_train = train.iloc[train_idx]["rainfall"]
    X_val = train.iloc[val_idx][FEATURES]
    y_val =  train.iloc[val_idx]["rainfall"]
    
    model.fit(
        X_train, y_train,
        eval_set=[(X_val, y_val)],
        verbose=200
    )
    
    oof_xgb[val_idx] = model.predict_proba(X_val)[:, 1]
    pred_xgb += model.predict_proba(test[FEATURES])[:, 1] / FOLDS

#########################
### Fold 1
#########################
[0]	validation_0-auc:0.79323


[200]	validation_0-auc:0.83065
[307]	validation_0-auc:0.83362
#########################
### Fold 2
#########################
[0]	validation_0-auc:0.82554
[200]	validation_0-auc:0.89172
[400]	validation_0-auc:0.89276
[455]	validation_0-auc:0.89324
#########################
### Fold 3
#########################
[0]	validation_0-auc:0.80726
[200]	validation_0-auc:0.90351
[400]	validation_0-auc:0.90327
[476]	validation_0-auc:0.90363
#########################
### Fold 4
#########################
[0]	validation_0-auc:0.80083
[200]	validation_0-auc:0.86215
[368]	validation_0-auc:0.86102
#########################
### Fold 5
#########################
[0]	validation_0-auc:0.89384
[200]	validation_0-auc:0.94283
[400]	validation_0-auc:0.94271
[498]	validation_0-auc:0.94294
#########################
### Fold 6
#########################
[0]	validation_0-auc:0.84416
[200]	validation_0-auc:0.88358
[400]	validation_0-auc:0.88737
[600]	validation_0-auc:0.88683
[800]	validation_0-auc:0.88672
[855]	validat

In [22]:
print(f"OOF AUC: {roc_auc_score(train['rainfall'], oof_xgb):.4f}")

OOF AUC: 0.8879


In [23]:
# Load the sample submission and fill in the averaged fold predictions
sub = pd.read_csv(r"C:\Users\cn4330\OneDrive - BDO AS\ML\Kaggle\kaggle_prediction_rainfall\data\sample_submission.csv")
ensemble_preds = pred_xgb  # only one model so far, so the "ensemble" is XGBoost alone
sub['rainfall'] = ensemble_preds
sub.to_csv("submission_xgb_2.csv", index=False)