# XGBoost Modeling Pipeline (main non-linear model)

This is our main non-linear model, where we build and tune per-language XGBoost classifiers to predict the root node. The main tricks/adjustments that we used over the "vanilla" XGBoost workflow are:
- Custom evaluation metric (root_score) selecting exactly one root per sentence.
- Sentence-wise grouping during train/validation splits to avoid data leakage.
- Class‐imbalance handling via scale_pos_weight.
- Regularization (L1/L2) and parameter grid search.
- GPU acceleration (cuda) with early stopping to speed up the training time.

In [1]:
import pandas as pd, numpy as np, ast
from time import time
from datetime import timedelta
from sklearn.preprocessing    import StandardScaler
from sklearn.model_selection  import GroupShuffleSplit
from itertools                import product
from tqdm                     import tqdm
import xgboost as xgb

import warnings
from sklearn.exceptions import FitFailedWarning

warnings.filterwarnings("ignore", category=FitFailedWarning)
warnings.filterwarnings("ignore", category=UserWarning)

In [2]:
TRAIN_FEATS_PATH   = '../data/normalized_expanded_train.csv'
TRAIN_META_PATH    = '../data/train.csv'
TEST_FEATS_PATH    = '../data/normalized_expanded_test.csv'
TEST_META_PATH     = '../data/test.csv'
LABELED_TEST_PATH  = '../data/labeled_test.csv'

We use the centrality scores and sentence length as features. We are aware that normalization is not a necessary step for XBG, but since our preprocessing already handled it, it remained in the pipeline.

In [3]:
FEATURES = [
    'n','degree','closeness','harmonic','betweeness','load','pagerank',
    'eigenvector','katz','information','current_flow_betweeness',
    'percolation','second_order','laplacian'
]

In [4]:
# Hyperparameter grid
param_grid = {
  'max_depth':        [4,6,8],
  'eta':               [0.01,0.05,0.1],
  'subsample':        [0.7,1.0],
  'colsample_bytree': [0.7,1.0],
  'gamma':            [0,1],
  # new ones:
  'min_child_weight': [1,5,10],
  'reg_alpha':        [0,0.01,0.1],
  'reg_lambda':       [1,10,100],
}

We use a custom scoring function, a sentence-level score that picks the node with maximum predicted probability per sentence:

$$
\mathrm{root\_score}
\;=\;
\frac{1}{|S|}\,
\sum_{s \in S}
\mathbf{1}\!\bigl(\arg\max_{v \in s}\,\hat p_v \;=\;\mathrm{true\_root}(s)\bigr)
$$

In [5]:
# Custom root‐accuracy scoring (1 root per sentence)
def root_score(sent_ids, y_true, probs):
    dfp = pd.DataFrame({'sent': sent_ids, 'y': y_true, 'p': probs})
    picks = dfp.loc[dfp.groupby('sent')['p'].idxmax()]
    return (picks.y == 1).mean()

In [6]:
print("Loading training data…")
exp  = pd.read_csv(TRAIN_FEATS_PATH)
meta = pd.read_csv(TRAIN_META_PATH)
meta['edgelist'] = meta['edgelist'].apply(ast.literal_eval)
df   = exp.merge(
    meta[['language','sentence','edgelist','root']],
    on=['language','sentence']
)

Loading training data…


### Training Loop
We iterate over each language, performing:
1. 80/20 GroupShuffleSplit by sentence
2. DMatrix creation
3. $$
\text{scale\_pos\_weight}
\;=\;
\frac{\#\text{negatives}}{\#\text{positives}}
$$
4. Grid search over all hyperparams, training with early stopping on validation error
5. Re-tune the number of trees based on our root_score over three candidate iterations:
    - best_iteration ±20
6. Store the best model, scaler, and tuned ntree_limit

In [7]:
models = {}
t0 = time()
print("Tuning per language XGB on GPU with early stopping\n")

for lang in tqdm(sorted(df.language.unique())):
    sub = df[df.language == lang].reset_index(drop=True)

    # 80/20 sentence-wise split (to avoid leakage)
    gss = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
    tr_idx, val_idx = next(gss.split(sub, sub.root, groups=sub.sentence))
    train, val = sub.iloc[tr_idx], sub.iloc[val_idx]

    # raw feature matrices
    X_tr  = train[FEATURES].values
    y_tr  = (train.root == train.vertex).astype(int)
    X_val = val[FEATURES].values
    y_val = (val.root   == val.vertex ).astype(int)

    dtrain = xgb.DMatrix(X_tr,  label=y_tr)
    dval   = xgb.DMatrix(X_val, label=y_val)

    # class-imbalance weight
    pos = y_tr.sum();  neg = len(y_tr) - pos
    spw = (neg/pos) if pos > 0 else 1.0

    best_sc, best_cfg, best_bst = -1, None, None

    # grid search
    for md, eta, subs, colsm, gm, mcw, alpha, lmbda in product(
        param_grid['max_depth'],
        param_grid['eta'],
        param_grid['subsample'],
        param_grid['colsample_bytree'],
        param_grid['gamma'],
        param_grid['min_child_weight'],
        param_grid['reg_alpha'],
        param_grid['reg_lambda'],
    ):
        cfg = {
            'objective':        'binary:logistic',
            'eval_metric':      'error',
            'tree_method':      'hist',
            'device':           'cuda',
            'max_depth':        md,
            'eta':              eta,
            'subsample':        subs,
            'colsample_bytree': colsm,
            'gamma':            gm,
            'min_child_weight': mcw,
            'alpha':            alpha,
            'lambda':           lmbda,
            'scale_pos_weight': spw,
            'seed':             42,
            'verbosity':        0,
        }

        bst = xgb.train(
            cfg, dtrain,
            num_boost_round=200,
            evals=[(dval, 'validation')],
            early_stopping_rounds=20,
            verbose_eval=False
        )

        best_iter     = bst.best_iteration
        best_local_sc = -1
        best_local_it = best_iter

        for it in (best_iter-20, best_iter, best_iter+20):
            it = int(np.clip(it, 1, best_iter))
            p  = bst.predict(dval, iteration_range=(0, it))
            sc = root_score(val.sentence.values, y_val, p)
            if sc > best_local_sc:
                best_local_sc, best_local_it = sc, it

        if best_local_sc > best_sc:
            best_sc  = best_local_sc
            best_cfg = dict(cfg, ntree_limit=best_local_it)
            best_bst = bst

    print(f"{lang:12s} -> val-root-acc={best_sc:.3f}  "
          f"best_round={best_cfg['ntree_limit']}  cfg={best_cfg}")

    # store only the booster and tuned tree-count
    models[lang] = (best_bst, best_cfg['ntree_limit'])

print(f"\nTuning completed in {timedelta(seconds=time()-t0)}\n")

Tuning per language XGB on GPU with early stopping



  5%|█████▌                                                                                                                | 1/21 [12:19<4:06:37, 739.86s/it]

Arabic       -> val-root-acc=0.570  best_round=1  cfg={'objective': 'binary:logistic', 'eval_metric': 'error', 'tree_method': 'hist', 'device': 'cuda', 'max_depth': 4, 'eta': 0.01, 'subsample': 1.0, 'colsample_bytree': 1.0, 'gamma': 0, 'min_child_weight': 1, 'alpha': 0, 'lambda': 100, 'scale_pos_weight': 17.4775, 'seed': 42, 'verbosity': 0, 'ntree_limit': 1}


 10%|███████████▏                                                                                                          | 2/21 [30:14<4:56:33, 936.50s/it]

Chinese      -> val-root-acc=0.340  best_round=6  cfg={'objective': 'binary:logistic', 'eval_metric': 'error', 'tree_method': 'hist', 'device': 'cuda', 'max_depth': 4, 'eta': 0.01, 'subsample': 0.7, 'colsample_bytree': 1.0, 'gamma': 0, 'min_child_weight': 10, 'alpha': 0.1, 'lambda': 1, 'scale_pos_weight': 17.5475, 'seed': 42, 'verbosity': 0, 'ntree_limit': 6}


 14%|████████████████▊                                                                                                     | 3/21 [47:00<4:50:35, 968.63s/it]

Czech        -> val-root-acc=0.610  best_round=1  cfg={'objective': 'binary:logistic', 'eval_metric': 'error', 'tree_method': 'hist', 'device': 'cuda', 'max_depth': 4, 'eta': 0.01, 'subsample': 0.7, 'colsample_bytree': 0.7, 'gamma': 0, 'min_child_weight': 1, 'alpha': 0, 'lambda': 100, 'scale_pos_weight': 15.0875, 'seed': 42, 'verbosity': 0, 'ntree_limit': 1}


 19%|██████████████████████                                                                                              | 4/21 [1:01:27<4:22:59, 928.19s/it]

English      -> val-root-acc=0.680  best_round=1  cfg={'objective': 'binary:logistic', 'eval_metric': 'error', 'tree_method': 'hist', 'device': 'cuda', 'max_depth': 4, 'eta': 0.01, 'subsample': 0.7, 'colsample_bytree': 1.0, 'gamma': 0, 'min_child_weight': 1, 'alpha': 0, 'lambda': 100, 'scale_pos_weight': 17.715, 'seed': 42, 'verbosity': 0, 'ntree_limit': 1}


 24%|███████████████████████████▌                                                                                        | 5/21 [1:12:31<3:42:09, 833.09s/it]

Finnish      -> val-root-acc=0.570  best_round=7  cfg={'objective': 'binary:logistic', 'eval_metric': 'error', 'tree_method': 'hist', 'device': 'cuda', 'max_depth': 4, 'eta': 0.1, 'subsample': 0.7, 'colsample_bytree': 1.0, 'gamma': 0, 'min_child_weight': 1, 'alpha': 0, 'lambda': 100, 'scale_pos_weight': 12.5775, 'seed': 42, 'verbosity': 0, 'ntree_limit': 7}


 29%|█████████████████████████████████▏                                                                                  | 6/21 [1:27:46<3:35:13, 860.89s/it]

French       -> val-root-acc=0.480  best_round=1  cfg={'objective': 'binary:logistic', 'eval_metric': 'error', 'tree_method': 'hist', 'device': 'cuda', 'max_depth': 4, 'eta': 0.01, 'subsample': 1.0, 'colsample_bytree': 0.7, 'gamma': 0, 'min_child_weight': 1, 'alpha': 0, 'lambda': 100, 'scale_pos_weight': 21.4075, 'seed': 42, 'verbosity': 0, 'ntree_limit': 1}


 33%|██████████████████████████████████████▋                                                                             | 7/21 [1:44:24<3:31:19, 905.65s/it]

Galician     -> val-root-acc=0.550  best_round=1  cfg={'objective': 'binary:logistic', 'eval_metric': 'error', 'tree_method': 'hist', 'device': 'cuda', 'max_depth': 4, 'eta': 0.01, 'subsample': 0.7, 'colsample_bytree': 1.0, 'gamma': 0, 'min_child_weight': 1, 'alpha': 0, 'lambda': 100, 'scale_pos_weight': 20.0975, 'seed': 42, 'verbosity': 0, 'ntree_limit': 1}


 38%|████████████████████████████████████████████▏                                                                       | 8/21 [1:57:32<3:08:09, 868.45s/it]

German       -> val-root-acc=0.650  best_round=1  cfg={'objective': 'binary:logistic', 'eval_metric': 'error', 'tree_method': 'hist', 'device': 'cuda', 'max_depth': 4, 'eta': 0.01, 'subsample': 0.7, 'colsample_bytree': 1.0, 'gamma': 0, 'min_child_weight': 1, 'alpha': 0, 'lambda': 100, 'scale_pos_weight': 17.695, 'seed': 42, 'verbosity': 0, 'ntree_limit': 1}


 43%|█████████████████████████████████████████████████▋                                                                  | 9/21 [2:13:58<3:01:01, 905.16s/it]

Hindi        -> val-root-acc=0.320  best_round=17  cfg={'objective': 'binary:logistic', 'eval_metric': 'error', 'tree_method': 'hist', 'device': 'cuda', 'max_depth': 4, 'eta': 0.1, 'subsample': 0.7, 'colsample_bytree': 1.0, 'gamma': 0, 'min_child_weight': 1, 'alpha': 0, 'lambda': 100, 'scale_pos_weight': 20.6575, 'seed': 42, 'verbosity': 0, 'ntree_limit': 17}


 48%|██████████████████████████████████████████████████████▊                                                            | 10/21 [2:27:51<2:41:52, 882.92s/it]

Icelandic    -> val-root-acc=0.540  best_round=0  cfg={'objective': 'binary:logistic', 'eval_metric': 'error', 'tree_method': 'hist', 'device': 'cuda', 'max_depth': 4, 'eta': 0.01, 'subsample': 1.0, 'colsample_bytree': 1.0, 'gamma': 0, 'min_child_weight': 1, 'alpha': 0, 'lambda': 100, 'scale_pos_weight': 15.73, 'seed': 42, 'verbosity': 0, 'ntree_limit': 0}


 52%|████████████████████████████████████████████████████████████▏                                                      | 11/21 [2:44:39<2:33:29, 920.98s/it]

Indonesian   -> val-root-acc=0.560  best_round=1  cfg={'objective': 'binary:logistic', 'eval_metric': 'error', 'tree_method': 'hist', 'device': 'cuda', 'max_depth': 4, 'eta': 0.01, 'subsample': 0.7, 'colsample_bytree': 0.7, 'gamma': 0, 'min_child_weight': 5, 'alpha': 0, 'lambda': 1, 'scale_pos_weight': 16.1, 'seed': 42, 'verbosity': 0, 'ntree_limit': 1}


 57%|█████████████████████████████████████████████████████████████████▋                                                 | 12/21 [3:04:07<2:29:26, 996.27s/it]

Italian      -> val-root-acc=0.550  best_round=1  cfg={'objective': 'binary:logistic', 'eval_metric': 'error', 'tree_method': 'hist', 'device': 'cuda', 'max_depth': 4, 'eta': 0.01, 'subsample': 1.0, 'colsample_bytree': 0.7, 'gamma': 0, 'min_child_weight': 1, 'alpha': 0, 'lambda': 100, 'scale_pos_weight': 20.595, 'seed': 42, 'verbosity': 0, 'ntree_limit': 1}


 62%|██████████████████████████████████████████████████████████████████████▌                                           | 13/21 [3:27:27<2:29:08, 1118.61s/it]

Japanese     -> val-root-acc=0.150  best_round=1  cfg={'objective': 'binary:logistic', 'eval_metric': 'error', 'tree_method': 'hist', 'device': 'cuda', 'max_depth': 4, 'eta': 0.05, 'subsample': 0.7, 'colsample_bytree': 0.7, 'gamma': 0, 'min_child_weight': 5, 'alpha': 0, 'lambda': 1, 'scale_pos_weight': 24.7675, 'seed': 42, 'verbosity': 0, 'ntree_limit': 1}


 67%|████████████████████████████████████████████████████████████████████████████                                      | 14/21 [3:45:01<2:08:13, 1099.01s/it]

Korean       -> val-root-acc=0.370  best_round=1  cfg={'objective': 'binary:logistic', 'eval_metric': 'error', 'tree_method': 'hist', 'device': 'cuda', 'max_depth': 4, 'eta': 0.1, 'subsample': 0.7, 'colsample_bytree': 1.0, 'gamma': 0, 'min_child_weight': 10, 'alpha': 0, 'lambda': 100, 'scale_pos_weight': 14.02, 'seed': 42, 'verbosity': 0, 'ntree_limit': 1}


 71%|██████████████████████████████████████████████████████████████████████████████████▏                                | 15/21 [3:55:29<1:35:41, 956.88s/it]

Polish       -> val-root-acc=0.650  best_round=1  cfg={'objective': 'binary:logistic', 'eval_metric': 'error', 'tree_method': 'hist', 'device': 'cuda', 'max_depth': 4, 'eta': 0.01, 'subsample': 1.0, 'colsample_bytree': 0.7, 'gamma': 0, 'min_child_weight': 1, 'alpha': 0, 'lambda': 100, 'scale_pos_weight': 14.8325, 'seed': 42, 'verbosity': 0, 'ntree_limit': 1}


 76%|███████████████████████████████████████████████████████████████████████████████████████▌                           | 16/21 [4:11:22<1:19:38, 955.72s/it]

Portuguese   -> val-root-acc=0.510  best_round=1  cfg={'objective': 'binary:logistic', 'eval_metric': 'error', 'tree_method': 'hist', 'device': 'cuda', 'max_depth': 4, 'eta': 0.01, 'subsample': 0.7, 'colsample_bytree': 0.7, 'gamma': 0, 'min_child_weight': 1, 'alpha': 0, 'lambda': 100, 'scale_pos_weight': 19.825, 'seed': 42, 'verbosity': 0, 'ntree_limit': 1}


 81%|█████████████████████████████████████████████████████████████████████████████████████████████                      | 17/21 [4:25:42<1:01:48, 927.13s/it]

Russian      -> val-root-acc=0.670  best_round=1  cfg={'objective': 'binary:logistic', 'eval_metric': 'error', 'tree_method': 'hist', 'device': 'cuda', 'max_depth': 4, 'eta': 0.01, 'subsample': 0.7, 'colsample_bytree': 1.0, 'gamma': 0, 'min_child_weight': 1, 'alpha': 0, 'lambda': 100, 'scale_pos_weight': 15.4125, 'seed': 42, 'verbosity': 0, 'ntree_limit': 1}


 86%|████████████████████████████████████████████████████████████████████████████████████████████████████▎                | 18/21 [4:41:10<46:21, 927.18s/it]

Spanish      -> val-root-acc=0.500  best_round=1  cfg={'objective': 'binary:logistic', 'eval_metric': 'error', 'tree_method': 'hist', 'device': 'cuda', 'max_depth': 4, 'eta': 0.01, 'subsample': 0.7, 'colsample_bytree': 0.7, 'gamma': 0, 'min_child_weight': 5, 'alpha': 0, 'lambda': 100, 'scale_pos_weight': 20.0925, 'seed': 42, 'verbosity': 0, 'ntree_limit': 1}


 90%|█████████████████████████████████████████████████████████████████████████████████████████████████████████▊           | 19/21 [4:55:56<30:29, 914.95s/it]

Swedish      -> val-root-acc=0.650  best_round=8  cfg={'objective': 'binary:logistic', 'eval_metric': 'error', 'tree_method': 'hist', 'device': 'cuda', 'max_depth': 6, 'eta': 0.05, 'subsample': 0.7, 'colsample_bytree': 0.7, 'gamma': 0, 'min_child_weight': 10, 'alpha': 0, 'lambda': 100, 'scale_pos_weight': 16.1525, 'seed': 42, 'verbosity': 0, 'ntree_limit': 8}


 95%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████▍     | 20/21 [5:14:16<16:10, 970.42s/it]

Thai         -> val-root-acc=0.620  best_round=1  cfg={'objective': 'binary:logistic', 'eval_metric': 'error', 'tree_method': 'hist', 'device': 'cuda', 'max_depth': 6, 'eta': 0.01, 'subsample': 1.0, 'colsample_bytree': 0.7, 'gamma': 0, 'min_child_weight': 5, 'alpha': 0, 'lambda': 100, 'scale_pos_weight': 21.0775, 'seed': 42, 'verbosity': 0, 'ntree_limit': 1}


100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 21/21 [5:25:07<00:00, 928.93s/it]

Turkish      -> val-root-acc=0.480  best_round=22  cfg={'objective': 'binary:logistic', 'eval_metric': 'error', 'tree_method': 'hist', 'device': 'cuda', 'max_depth': 8, 'eta': 0.01, 'subsample': 1.0, 'colsample_bytree': 1.0, 'gamma': 0, 'min_child_weight': 1, 'alpha': 0, 'lambda': 100, 'scale_pos_weight': 13.8225, 'seed': 42, 'verbosity': 0, 'ntree_limit': 22}

Tuning completed in 5:25:07.645642






In [8]:
print("Loading test data…")
test_feats = pd.read_csv(TEST_FEATS_PATH)
raw_test   = pd.read_csv(TEST_META_PATH)
raw_test['edgelist'] = raw_test['edgelist'].apply(ast.literal_eval)

test_df = test_feats.merge(
    raw_test[['id','language','sentence']],
    on=['id','language','sentence']
)

Loading test data…


### Prediction
Here, we pick the vertex with highest predicted probability

In [9]:
print("Predicting on test set…")
results = []

for tid, grp in test_df.groupby('id', sort=False):
    # unpack booster and ntree_limit
    bst, ntree_limit = models[grp.language.iloc[0]]

    Xs = grp[FEATURES].values          # raw features
    dm = xgb.DMatrix(Xs)
    probs = bst.predict(dm, iteration_range=(0, ntree_limit))

    pick = int(grp.vertex.values[probs.argmax()])
    results.append({'id': tid, 'root_pred': pick})

submission = pd.DataFrame(results)

Predicting on test set…


In [11]:
submission.to_csv('../data/submission_XGB_advanced_new.csv', index=False)