# **FOREWORD**

1. I start with the competition data, without any additional features
2. I train a number of ML models, along with TabPFNClassifier
3. I collate their OOF predictions and hill-climb them to arrive at a blend
4. Like others, I take the best public kernel and blend it with the result of step 3 to check the LB score
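
Step 4 above can be sketched as follows. This is a minimal illustration, assuming two prediction vectors that are first min-max scaled to a common 0-100 range and then averaged with extrapolating weights (the `[-0.4, 1.4]` weights are the ones used in the submission cell; the `minmax_scale` helper and the toy arrays are hypothetical).

```python
import numpy as np

def minmax_scale(x, lo=0.0, hi=100.0):
    """Linearly rescale a 1-D array to the range [lo, hi]."""
    x = np.asarray(x, dtype=float)
    return lo + (hi - lo) * (x - x.min()) / (x.max() - x.min())

# Toy stand-ins for my blend and the public kernel's predictions
my_preds = np.array([0.20, 0.55, 0.90, 0.40])
public_preds = np.array([0.10, 0.60, 0.95, 0.30])

# One column per source, each on the same 0-100 scale
stacked = np.stack([minmax_scale(my_preds), minmax_scale(public_preds)], axis=1)

# A negative weight on one source pushes the blend away from it,
# so the result can fall outside [0, 100]
blend = np.average(stacked, axis=1, weights=[-0.4, 1.4])
print(blend)
```

Values outside the 0-100 range are harmless here because AUC only looks at the ordering of the predictions.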


### **COMPETITION DETAILS AND METRIC**

This kernel is for the Playground Series Season 5, Episode 3 competition, with AUC as the evaluation metric. <br>
The metric is to be maximized, and the task is **binary classification**: predicting the probability of rainfall from a Hong Kong weather dataset.
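
Because AUC depends only on how the predictions rank positives against negatives, any strictly increasing rescaling of the scores leaves it unchanged. The sketch below illustrates this with a small hand-rolled pairwise AUC; the `pairwise_auc` helper and the toy labels are illustrative assumptions, not part of the competition code.

```python
import numpy as np

def pairwise_auc(y_true, y_score):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count half."""
    y_true = np.asarray(y_true).astype(bool)
    y_score = np.asarray(y_score, dtype=float)
    pos, neg = y_score[y_true], y_score[~y_true]
    diff = pos[:, None] - neg[None, :]          # every positive against every negative
    return float(((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size)

y_toy = np.array([0, 0, 1, 1, 0, 1])
p_toy = np.array([0.10, 0.40, 0.35, 0.80, 0.20, 0.90])

auc_raw = pairwise_auc(y_toy, p_toy)
auc_scaled = pairwise_auc(y_toy, 100 * p_toy + 5)   # strictly increasing transform
print(auc_raw, auc_scaled)
```

This is why later steps can rescale or extrapolate the predictions without hurting the score, as long as the ordering is what we want.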


My companion kernel [here](https://www.kaggle.com/code/ravi20076/playgrounds5e3-public-imports-v1) has all the background scripts ready to use. I can't call this AutoML, but it is a humble step in that direction. I load these scripts below, plug and play!

### **REFERENCES**
- https://www.kaggle.com/code/cdeotte/rapids-knn-starter-ensemble-lb-0-956-wow
- https://www.kaggle.com/code/act18l/lb-probing


In this version, I assume this is a time series problem and keep the last 2 years as the dev-set. Let's see how we score with this approach!
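
The split implied here can be sketched as follows, assuming (as the notebook does) that the 2,190 training rows are consecutive days with ids 0 to 2189, so that each block of 365 ids forms one year. The array names are illustrative.

```python
import numpy as np

ids = np.arange(2190)        # assumed training ids 0..2189
year_nb = ids // 365         # labels 0..5, one per assumed 365-day year

train_mask = year_nb <= 3    # first four years are used for fitting
dev_mask = year_nb > 3       # last two years are held out as the dev-set

print(train_mask.sum(), dev_mask.sum())
```

If the rows are not actually in chronological order, this degenerates into an ordinary holdout split by id.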


# **IMPORTS**

In [1]:
%%time 

!pip install -q uv
!uv pip install -q --system -r /kaggle/input/playgrounds5e3-public-imports-v1/req_kaggle.txt

exec( open(f"/kaggle/input/playgrounds5e3-public-imports-v1/myimports.py", "r").read() )
exec( open(f"/kaggle/input/playgrounds5e3-public-imports-v1/myutils.py", "r").read() )
exec( open(f"/kaggle/input/playgrounds5e3-public-imports-v1/training.py", "r").read() )

print()

---> Sklearn = 1.6.1| Pandas = 2.2.3
---> Imports- part 1 done
---> Commencing imports-part2
---> XGBoost = 2.1.4 | LightGBM = 4.5.0
---> Imports- part 2 done
---> Seeding everything

---> Imports done


CPU times: user 12.6 s, sys: 1.63 s, total: 14.2 s
Wall time: 30.4 s


# **CONFIGURATION**

In [2]:
%%time

class CFG:
    """
    Configuration class for parameters and CV strategy for tuning and training
    Some parameters may be unused here as this is a general configuration class
    """;

    # Data preparation:-
    version_nb  = 5
    model_id    = "V5_2"
    model_label = "ML"

    test_req           = False
    test_sample_frac   = 200

    gpu_switch         = "ON" if torch.cuda.is_available() else "OFF"
    state              = 42
    target             = f"rainfall"
    grouper            = f""
    tgt_mapper         = {}

    ip_path            = f"/kaggle/input/playground-series-s5e3"
    op_path            = f"/kaggle/working"
    orig_path          = f""
    data_path          = f""

    dtl_preproc_req    = False
    ftre_plots_req     = True
    ftre_imp_req       = True

    nb_orig            = 0
    orig_all_folds     = False

    # Model Training:-
    pstprcs_oof        = False
    pstprcs_train      = False
    pstprcs_test       = False
    ML                 = True
    test_preds_req     = True

    n_splits           = 1
    n_repeats          = 1
    nbrnd_erly_stp     = 0
    mdlcv_mthd         = 'KF'
    metric_obj         = 'maximize'

    # Global variables for plotting:-
    grid_specs = {'visible'  : True,
                  'which'    : 'both',
                  'linestyle': '--',
                  'color'    : 'lightgrey',
                  'linewidth': 0.75
                 }

    title_specs = {'fontsize'   : 9,
                   'fontweight' : 'bold',
                   'color'      : '#992600',
                  }

collect()

CPU times: user 249 ms, sys: 793 µs, total: 249 ms
Wall time: 251 ms


0

# **PREPROCESSING**

We load the datasets and fix the CV-folds here <br>
We also impute the solitary null value in the test set (in the `winddirection` column) before proceeding to model development
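
As a small illustration of the imputation used here, forward fill copies the last observed value into a missing slot. The toy frame below is an assumption for demonstration only.

```python
import numpy as np
import pandas as pd

# Toy column with a single missing reading
toy = pd.DataFrame({"winddirection": [70.0, 50.0, np.nan, 20.0]})

# Forward fill: the gap takes the value of the previous row
toy["winddirection"] = toy["winddirection"].ffill()
print(toy)
```

Note that forward fill cannot repair a null in the very first row, since there is no earlier value to copy.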

In [3]:
%%time 

train  = pd.read_csv(f"{CFG.ip_path}/train.csv", index_col = "id")
test   = pd.read_csv(f"{CFG.ip_path}/test.csv", index_col = "id")
sub_fl = pd.read_csv(f"{CFG.ip_path}/sample_submission.csv", index_col = "id")

PrintColor(f"---> Shapes = {train.shape} {test.shape} {sub_fl.shape}")

Xtrain = train.drop(CFG.target, axis=1)
ytrain = train[CFG.target].astype(np.uint8)
Xtest  = test.copy()
Xtest  = Xtest[Xtrain.columns]

Xtest["winddirection"] = Xtest["winddirection"].ffill()
Xtrain["Source"], Xtest["Source"] = ("Competition" , "Competition")

Xtrain["year_nb"] = Xtrain.index // 365

PrintColor(f"---> Shapes = {Xtrain.shape} {Xtest.shape} {ytrain.shape}\n")

[1m[34m---> Shapes = (2190, 12) (730, 11) (730, 1)[0m
[1m[34m---> Shapes = (2190, 13) (730, 12) (2190,)
[0m
CPU times: user 22.3 ms, sys: 4.03 ms, total: 26.3 ms
Wall time: 59 ms


# **MODEL TRAINING**

## **IO**

In [4]:
%%time 

Mdl_Master = \
{
    "LGBM1C" : LGBMC(
                    objective         = "binary",
                    eval_metric       = "auc",
                    n_estimators      = 400,
                    learning_rate     = 0.02,
                    max_depth         = 3,
                    colsample_bytree  = 0.95,
                    reg_alpha         = 0.001,
                    reg_lambda        = 0.001,
                    random_state      = CFG.state,
                    verbosity         = -1,
               ),

    "XGB1C" : XGBC(
                    objective         = "binary:logistic",
                    eval_metric       = "auc",
                    n_estimators      = 350,
                    learning_rate     = 0.02,
                    max_depth         = 3,
                    colsample_bytree  = 0.90,
                    colsample_bylevel = 0.80,
                    colsample_bynode  = 0.85,
                    reg_alpha         = 0.001,
                    reg_lambda        = 0.001,
                    random_state      = CFG.state,
                    verbosity         = 0,
                  ),

    "CB1C" : CBC(
                loss_function     = "Logloss",
                eval_metric       = "AUC",
                task_type         = "CPU",
                iterations        = 350,
                learning_rate     = 0.02,
                max_depth         = 3,
                colsample_bylevel = 0.90,
                random_state      = CFG.state,
                verbose           = 0,
                l2_leaf_reg       = 0.25,
                ),

    "HGB1C" : HGBC(
                    loss              = 'log_loss',
                    learning_rate     = 0.025,
                    max_iter          = 300,
                    max_depth         = 5,
                    l2_regularization = 0.40,
                    random_state      = CFG.state,
                  ),

    "RF1C"  : RFC(
                    n_estimators     = 250,
                    max_depth        = 5,
                    min_samples_leaf = 8,
                    random_state     = CFG.state,
                    verbose          = 0,
               ),

    "LR1C"  : LRC(C = 0.05, max_iter = 5_000, random_state = CFG.state),

    "KNN1C" : KNNC(n_neighbors = 125) ,
}

if CFG.gpu_switch == "ON" :
    Mdl_Master["TPFN1C"] = TPFNC(random_state = CFG.state, )
else:
    pass

OOF_Preds = {}
Mdl_Preds = {}
drop_cols = ["Source", "id", "Id", "Label", CFG.target, "fold_nb",  "year_nb"]


CPU times: user 535 µs, sys: 0 ns, total: 535 µs
Wall time: 2.47 ms


## **OFFLINE TRAINING**

We retain the final 2 years as dev-set and build the model pipeline with the blend!
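
The `HillClimber` class used below comes from the imported scripts and its internals are not shown here. As a rough idea of what a hill-climb blend does, here is a generic greedy sketch under my own assumptions: start from the best single model, then repeatedly mix in whichever remaining model and weight (negative weights allowed) improves the dev-set AUC the most. The function names, the weight grid and the toy data are all hypothetical.

```python
import numpy as np

def pairwise_auc(y_true, y_score):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count half."""
    y_true = np.asarray(y_true).astype(bool)
    y_score = np.asarray(y_score, dtype=float)
    pos, neg = y_score[y_true], y_score[~y_true]
    diff = pos[:, None] - neg[None, :]
    return float(((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size)

def hill_climb(oof, y, weights=None, tol=1e-6):
    """Greedy blend. oof maps model name -> 1-D array of dev-set predictions."""
    if weights is None:
        weights = np.linspace(-0.5, 0.5, 101)        # assumed grid, negatives allowed
    scores = {name: pairwise_auc(y, p) for name, p in oof.items()}
    start = max(scores, key=scores.get)              # best single model
    blend, best = np.asarray(oof[start], dtype=float).copy(), scores[start]
    history = [(start, 1.0, best)]
    remaining = [name for name in oof if name != start]
    while remaining:
        cand = None
        for name in remaining:
            for w in weights:
                trial = (1 - w) * blend + w * oof[name]
                s = pairwise_auc(y, trial)
                if s > best + tol:                   # keep only strict improvements
                    best, cand = s, (name, float(w), trial)
        if cand is None:                             # nothing helps any more
            break
        name, w, blend = cand
        history.append((name, w, best))
        remaining.remove(name)
    return blend, history

# Toy dev-set: two informative models and one pure-noise model
rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=200)
signal = y + rng.normal(0, 1.0, size=200)
oof = {
    "A": signal + rng.normal(0, 0.5, size=200),
    "B": signal + rng.normal(0, 0.5, size=200),
    "C": rng.normal(0, 1.0, size=200),
}
blend, history = hill_climb(oof, y)
for name, w, s in history:
    print(f"{name:>2} weight = {w:+.2f} score = {s:.6f}")
```

Since the weights are tuned on the same dev-set that scores them, the final blend score is optimistic; with only two years of dev data, small gains should be read with care.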

In [5]:
%%time 

Xtr  = Xtrain.loc[ Xtrain.year_nb <= 3 ].drop(drop_cols, axis=1, errors = "ignore")
Xdev = Xtrain.loc[ Xtrain.year_nb > 3  ].drop(drop_cols, axis=1, errors = "ignore")
ytr  = ytrain.loc[ Xtr.index ]
ydev = ytrain.loc[ Xdev.index ]
Xt   = Xtest.drop( drop_cols, axis= 1, errors = "ignore" )

for method, mymodel in tqdm( Mdl_Master.items() ) :

    model = Pipeline([("SS", StandardScaler()), ("M", mymodel)])
    model.fit(Xtr, ytr)

    dev_preds         = model.predict_proba( Xdev)[:, 1]
    OOF_Preds[method] = dev_preds 
    Mdl_Preds[method] = model.predict_proba( Xt )[:, 1]

    score = utils.ScoreMetric( ydev, dev_preds)
    PrintColor( 
        f"---> Score = {score :,.8f} | {method}",
        color = Fore.CYAN,
    )

oof_preds = pd.DataFrame(OOF_Preds)
mdl_preds = pd.DataFrame(Mdl_Preds)   

hc = HillClimber()
dev_preds, preds, _ = \
hc.DoHillClimb(
    target    = CFG.target, 
    direction = CFG.metric_obj, 
    cutoff    = 100, 
    neg_wgt   = "Y",
    OOF_Preds = oof_preds,
    Mdl_Preds = mdl_preds.copy() ,
    y         = ydev,
)

score = utils.ScoreMetric( ydev, dev_preds)
PrintColor( 
    f"\n\n---> Final Ensemble Score = {score :,.8f}",
    color = Fore.BLACK,
    )

_ = utils.CleanMemory()

  0%|          | 0/7 [00:00<?, ?it/s]

[1m[36m---> Score = 0.87463439 | LGBM1C[0m
[1m[36m---> Score = 0.87702747 | XGB1C[0m
[1m[36m---> Score = 0.88113294 | CB1C[0m
[1m[36m---> Score = 0.87713383 | HGB1C[0m
[1m[36m---> Score = 0.87303900 | RF1C[0m
[1m[36m---> Score = 0.88424926 | LR1C[0m
[1m[36m---> Score = 0.87090650 | KNN1C[0m
[1m[34m
----- Data preparation: ------ 
[0m


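|       | LR1C    | CB1C    | HGB1C   | XGB1C   | LGBM1C  | RF1C    | KNN1C   |
|-------|---------|---------|---------|---------|---------|---------|---------|
| Score | 0.88425 | 0.88113 | 0.87713 | 0.87703 | 0.87463 | 0.87304 | 0.87091 |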


[1m[34m
 ----- Initiating hill-climb ----- 
[0m
[1m[36m 1.LR1C                                                Weight =  0.5000       Score = 0.881090[0m
[1m[36m 2.XGB1C                                               Weight = -0.5000       Score = 0.882579[0m
[1m[36m 3.RF1C                                                Weight = -0.4500       Score = 0.883622[0m
[1m[36m 4.CB1C                                                Weight =  0.4900       Score = 0.884345[0m
[1m[36m 5.KNN1C                                               Weight = -0.0600       Score = 0.884590[0m
[1m[30m

---> Final Ensemble Score = 0.88458961[0m
CPU times: user 9.31 s, sys: 517 ms, total: 9.83 s
Wall time: 8.28 s


# **FULL REFIT**

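We reuse ensemble weights from the offline hill-climb and refit every model on the full training set for the leaderboard. Note that the weights hard-coded below do not exactly match the hill-climb output printed above (for instance, HGB1C carries a weight here but was not selected in that run).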
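
One detail worth keeping in mind for the blend in this section: `np.average` divides by the sum of the weights, so weights that do not sum to 1 rescale the output. The sketch below checks this on toy predictions (the 2x6 array is an assumption; the weights are the ones used in the cell).

```python
import numpy as np

# Two toy rows of predictions from six models
toy_preds = np.array([
    [0.9, 0.8, 0.7, 0.6, 0.5, 0.4],
    [0.2, 0.3, 0.1, 0.4, 0.2, 0.3],
])
weights = np.array([0.50, 0.50, -0.48, -0.30, 0.15, -0.10])

blend = np.average(toy_preds, weights=weights, axis=1)

# Same thing by hand: weighted sum divided by the total weight
manual = toy_preds @ weights / weights.sum()
print(weights.sum(), blend, manual)
```

Dividing by a positive constant keeps the ordering of the rows, so the AUC of the blend is unaffected; only the scale of the values changes.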

In [6]:
%%time 

Mdl_Preds = {}

for method, mymodel in tqdm( Mdl_Master.items() ) :
    model = Pipeline([("SS", StandardScaler()), ("M", mymodel)])
    model.fit(Xtrain.drop(drop_cols, axis=1, errors = "ignore") , ytrain)
    Mdl_Preds[method] = model.predict_proba( Xt )[:, 1]

preds = \
np.average(
    pd.DataFrame( Mdl_Preds )[["LR1C", "CB1C", "XGB1C", "RF1C", "HGB1C", "KNN1C"]], 
    weights = [0.5000, 0.5000, -0.4800, -0.3000, 0.1500 , -0.1000],
    axis=1
)



  0%|          | 0/7 [00:00<?, ?it/s]

CPU times: user 3.79 s, sys: 528 ms, total: 4.32 s
Wall time: 2.29 s


# **SUBMISSION**

In [7]:
%%time 

sub = \
pd.read_csv(
    f"/kaggle/input/rapids-knn-starter-ensemble-lb-0-956-wow/submission_ensemble.csv"
)[CFG.target].values.flatten()


scl = MinMaxScaler((0, 100))
sub_fl[CFG.target] = \
np.average(
    scl.fit_transform( np.stack([preds, sub], axis=1) ),
    axis=1,
    weights = [-0.4, 1.4]
)

sub_fl.to_csv("submission.csv", index = True)


print()
!ls
print()
!head submission.csv
del sub
print()


catboost_info  __notebook__.ipynb  submission.csv  xgb_optimize.log

id,rainfall
2190,104.75637167107821
2191,104.48963176595213
2192,105.1569445016439
2193,8.99855087028047
2194,3.5654747363276456
2195,78.33000099073452
2196,87.40889489926793
2197,104.3485593016294
2198,100.34368214704736

CPU times: user 21.2 ms, sys: 24.1 ms, total: 45.3 ms
Wall time: 284 ms
