LGBM Method (gradient boosting)

In [1]:
import numpy as np
import pandas as pd

In [2]:
df = pd.read_csv("train.csv")  # load the training dataset

In [3]:
%pip install lightgbm

Note: you may need to restart the kernel to use updated packages.


In [4]:
from lightgbm import LGBMClassifier, early_stopping, log_evaluation
from sklearn.metrics import accuracy_score

In [5]:
drop_cols = []  # columns to exclude from the features
if 'Id' in df.columns:
    drop_cols.append('Id')  # 'Id' is only a row identifier, so it carries no useful signal for the model

X = df.drop(['Cover_Type'] + drop_cols, axis=1)  # X: the inputs (everything except 'Id' and 'Cover_Type')
y = df['Cover_Type']  # y: the target (the 'Cover_Type' column)

In [6]:
test = pd.read_csv("test-full.csv")  # load the test dataset (used once we are happy with our model)

X_test = test.drop(drop_cols, axis=1)  # remove the 'Id' column from the test dataset


In [7]:
# split the labelled data into training and validation sets (stratified by class)

from sklearn.model_selection import train_test_split

X_tr, X_val, y_tr, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)


In [8]:
# LightGBM classifier
lgbm = LGBMClassifier(
    n_estimators=2000,       # total trees (no early stopping in this cell, so all 2000 are built)
    learning_rate=0.03,      # step size (smaller = finer steps, needs more trees)
    num_leaves=64,           # controls complexity of each tree
    max_depth=-1,            # -1 = no limit (let num_leaves decide)
    subsample=0.8,           # row sampling for diversity (only takes effect when subsample_freq > 0)
    colsample_bytree=0.8,    # feature sampling for diversity
    random_state=42
)

# Train while monitoring the validation set (no early_stopping callback is passed here)
lgbm.fit(
    X_tr, y_tr,
    eval_set=[(X_val, y_val)],          # track performance on the validation set
    eval_metric="multi_logloss",        # metric for multi-class classification
)


[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.001010 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 2155
[LightGBM] [Info] Number of data points in the train set: 12096, number of used features: 44
[LightGBM] [Info] Start training from score -1.945910
[LightGBM] [Info] Start training from score -1.945910
[LightGBM] [Info] Start training from score -1.945910
[LightGBM] [Info] Start training from score -1.945910
[LightGBM] [Info] Start training from score -1.945910
[LightGBM] [Info] Start training from score -1.945910
[LightGBM] [Info] Start training from score -1.945910
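
The seven identical "Start training from score" lines can be read as one starting score per class. A minimal check, assuming LightGBM initialises each class score to the natural log of that class's share of the training rows, and assuming the seven cover types are equally frequent in the stratified split:

```python
import math

n_classes = 7                       # one "Start training from score" line per class
class_share = 1 / n_classes         # assumption: perfectly balanced classes
init_score = math.log(class_share)  # natural log of the class prior

print(f"{init_score:.6f}")  # -1.945910
```

If the classes were imbalanced, each line would show a different value, so the log doubles as a quick balance check.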


In [9]:
# model accuracy 

val_pred = lgbm.predict(X_val)
print("Validation Accuracy:", accuracy_score(y_val, val_pred))

Validation Accuracy: 0.8756613756613757
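
A single accuracy number hides which cover types are misclassified. The sketch below breaks accuracy down per class (recall). `per_class_recall` is a hypothetical helper name, and the toy labels are made up for illustration; they are not taken from the dataset.

```python
from collections import Counter

def per_class_recall(y_true, y_pred):
    """Return {label: fraction of that label's rows predicted correctly}."""
    totals = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {label: hits[label] / totals[label] for label in sorted(totals)}

# Toy labels for illustration only
toy_true = [1, 1, 2, 2, 2, 3]
toy_pred = [1, 2, 2, 2, 1, 3]
recall = per_class_recall(toy_true, toy_pred)
print(recall)
```

In this notebook it could be called as `per_class_recall(y_val, val_pred)` to see which classes pull the overall score down.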


I will now try to improve the accuracy.


Cross-validation:


In [10]:
from sklearn.model_selection import StratifiedKFold, cross_val_score


In [11]:
# 5-fold cross-validation: checks whether the model setup holds up across different splits

model = LGBMClassifier(
    n_estimators=2000,
    learning_rate=0.03,
    num_leaves=31,
    random_state=42
)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy", n_jobs=-1)
print("CV accuracy: %.4f ± %.4f" % (np.mean(scores), np.std(scores)))

CV accuracy: 0.8759 ± 0.0062


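Cross-validation does not improve accuracy by itself, which is expected: it only changes how the model is evaluated. The estimate is more reliable than a single split (0.8759 ± 0.0062 over 5 folds, versus 0.8757 on one validation set), but it costs much more computation time because the model is trained five times.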
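
To make the idea of stratified folds concrete, here is a minimal sketch that deals each class's rows round-robin into folds. It is a simplified illustration, not scikit-learn's `StratifiedKFold` algorithm; `stratified_folds` and the toy labels are hypothetical.

```python
import random
from collections import defaultdict

def stratified_folds(labels, n_splits=5, seed=42):
    """Yield (train_idx, val_idx) pairs with class shares roughly preserved."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)
    folds = [[] for _ in range(n_splits)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal each class's rows round-robin so every fold gets its share
        for pos, idx in enumerate(indices):
            folds[pos % n_splits].append(idx)

    for k in range(n_splits):
        val_idx = sorted(folds[k])
        train_idx = sorted(i for j, fold in enumerate(folds) if j != k for i in fold)
        yield train_idx, val_idx

# Toy labels: 3 classes with 10 rows each (illustrative, not the real data)
toy_labels = [c for c in (1, 2, 3) for _ in range(10)]
splits = list(stratified_folds(toy_labels))
for train_idx, val_idx in splits:
    print(len(train_idx), len(val_idx))
```

Every row lands in exactly one validation fold, which is why the model has to be trained once per fold.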

In [12]:
# Manual 5-fold cross-validation with early stopping, run once we are happy with the parameters

from lightgbm import LGBMClassifier, early_stopping, log_evaluation

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

params = dict(
    learning_rate=0.03,
    n_estimators=2000,         # large; early stopping picks the actual number of trees
    num_leaves=31,             # try 63/127/255
    max_depth=-1,
    min_child_samples=200,     # try 20/60/120/200
    subsample=0.8,             # note: only takes effect when subsample_freq > 0
    colsample_bytree=0.8,
    reg_lambda=4.0,            # try 1–5
    reg_alpha=0.7,
    random_state=42
)

accs, best_iters = [], []

for tr_idx, va_idx in cv.split(X, y):
    X_tr, X_val = X.iloc[tr_idx], X.iloc[va_idx]
    y_tr, y_val = y.iloc[tr_idx], y.iloc[va_idx]

    model = LGBMClassifier(**params)
    model.fit(
        X_tr, y_tr,
        eval_set=[(X_val, y_val)],
        eval_metric="multi_logloss",
        callbacks=[early_stopping(200), log_evaluation(200)]
    )
    pred = model.predict(X_val)
    accs.append(accuracy_score(y_val, pred))
    best_iters.append(model.best_iteration_)

print("LGBM CV accuracy: %.4f ± %.4f" % (np.mean(accs), np.std(accs)))
print("Avg best_iteration:", int(np.mean(best_iters)))


[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000308 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 2136
[LightGBM] [Info] Number of data points in the train set: 12096, number of used features: 35
[LightGBM] [Info] Start training from score -1.945910
[LightGBM] [Info] Start training from score -1.945910
[LightGBM] [Info] Start training from score -1.945910
[LightGBM] [Info] Start training from score -1.945910
[LightGBM] [Info] Start training from score -1.945910
[LightGBM] [Info] Start training from score -1.945910
[LightGBM] [Info] Start training from score -1.945910
Training until validation scores don't improve for 200 rounds
[200]	valid_0's multi_logloss: 0.476117
[400]	valid_0's multi_logloss: 0.410511
[600]	valid_0's multi_logloss: 0.390901
[800]	valid_0's multi_logloss: 0.381744
[1000]	valid_0's multi_logloss: 0.380715
[1200]	valid_0's multi_logloss: 0.380131
Early stopping, best iteratio
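
The "Training until validation scores don't improve for 200 rounds" message describes a patience rule. The sketch below shows that rule on a plain list of losses. It is a simplified illustration, not LightGBM's callback (which also handles several metrics and datasets); `early_stop` and the toy loss curve are hypothetical.

```python
def early_stop(val_losses, patience):
    """Return (best_iteration, stopped_at), both 1-based.

    stopped_at is None when the loss list ends before patience runs out.
    """
    best_loss = float("inf")
    best_iter = 0
    for i, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_loss, best_iter = loss, i
        elif i - best_iter >= patience:
            return best_iter, i
    return best_iter, None

# Toy loss curve for illustration only (not the run above)
toy_losses = [0.9, 0.7, 0.6, 0.61, 0.62, 0.63, 0.59]
result = early_stop(toy_losses, patience=3)
print(result)  # (3, 6)
```

Note that the later improvement at position 7 is never seen: a short patience can stop before a late gain, which is one reason to use a generous value such as 200 with a small learning rate.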

In [13]:
# overfit check on the last CV fold (model, X_tr, X_val, y_tr, y_val come from the final loop iteration)

train_pred = model.predict(X_tr)
val_pred   = model.predict(X_val)

print("Train accuracy:", accuracy_score(y_tr, train_pred))
print("Val accuracy:", accuracy_score(y_val, val_pred))


Train accuracy: 0.9969411375661376
Val accuracy: 0.8449074074074074


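The model is overfitting: it reaches about 99.7% accuracy on the training fold but only about 84.5% on the validation fold, a gap of roughly 15 percentage points.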
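
To put the gap in concrete terms, the two accuracies can be converted back into error counts. The fold sizes below are an assumption: 12096 training rows as reported in the LightGBM log, and 3024 validation rows inferred as the remaining fifth of 15120 rows.

```python
train_acc = 0.9969411375661376   # from the cell output above
val_acc = 0.8449074074074074     # from the cell output above

n_train, n_val = 12096, 3024     # assumption: 4/5 and 1/5 of 15120 rows

gap = train_acc - val_acc
train_errors = round((1 - train_acc) * n_train)
val_errors = round((1 - val_acc) * n_val)

print(f"gap: {gap:.4f}")
print(f"train errors: {train_errors} of {n_train}")
print(f"val errors:   {val_errors} of {n_val}")
```

Under these assumptions the model misclassifies only a few dozen training rows but several hundred validation rows, which is the usual signature of memorising the training fold.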