## MCTS-Strength | Relevant Baseline
This notebook provides a compact, reusable baseline for training regressors, with no redundant code that would obscure the key concepts. It is not meant to be complete or comprehensive.

Inspired by: 
- https://www.kaggle.com/competitions/um-game-playing-strength-of-mcts-variants/discussion/532341
- https://www.kaggle.com/code/andreasbis/um-mcts-lightgbm-baseline

The key added features are: 
- **split_agent_features** which converts agent names to four categorical features covering their component characteristics
- **GroupKFold** with *GameRulesetName* used as the group - motivated by the hint that the test set contains the same agent types but a different set of games (see this [Best Single Model CV LB thread](https://www.kaggle.com/competitions/um-game-playing-strength-of-mcts-variants/discussion/532617)) - this seems to give CV results much closer to those achieved on the Private LB
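
To make the **split_agent_features** idea concrete, here is a minimal sketch in plain Python. It assumes agent names are dash-separated strings with a constant leading token followed by four components; the example name and the helper `split_agent_name` are illustrative assumptions, not part of the notebook. The notebook does the same split with polars and names the resulting columns by position (`agent1_1` to `agent1_4`).

```python
def split_agent_name(name, prefix):
    """Split an agent string on '-' and keep everything after the leading token."""
    parts = name.split("-")
    # Index 0 is the leading token shared by all agents, so it carries no information.
    return {f"{prefix}_{idx}": part for idx, part in enumerate(parts) if idx > 0}


# Hypothetical agent name, assumed to follow the competition's naming scheme.
features = split_agent_name("MCTS-UCB1-0.6-NST-true", "agent1")
print(features)
# {'agent1_1': 'UCB1', 'agent1_2': '0.6', 'agent1_3': 'NST', 'agent1_4': 'true'}
```

Treating each component as its own categorical feature lets the model share information between agents that differ in only one component.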

In [None]:
import os
import sys
import warnings
warnings.filterwarnings('ignore')

import numpy as np
import polars as pl
import pandas as pd
from sklearn.model_selection import GroupKFold
import lightgbm as lgb
from lightgbm import early_stopping, log_evaluation

import kaggle_evaluation.mcts_inference_server

In [None]:
constant_cols = pd.read_csv('/kaggle/input/um-gps-of-mcts-variants-constant-columns/constant_columns.csv').columns.to_list()
game_col = 'GameRulesetName'
target_col = 'utility_agent1'
game_rule_cols = ['EnglishRules', 'LudRules']
output_cols = ['num_wins_agent1', 'num_draws_agent1', 'num_losses_agent1']
agent_cols = ['agent1', 'agent2']
dropped_cols = ['Id'] + output_cols + constant_cols + game_rule_cols

In [None]:
class Config:
    train_path = '/kaggle/input/um-game-playing-strength-of-mcts-variants/train.csv'
    
    early_stop = 100
    n_splits = 5
    split_agent_features = True
    
    lgbm_params = {
#         num_boost_round: 
#         - big enough to ensure early_stopping is triggered in most cases
#         - but small enough not to take forever to compute in case it isn't
        'num_boost_round': 10_000,
        'seed': 1212,
        'verbose': -1,
#         Some common params to experiment with (here are default values):
#         The full list: https://lightgbm.readthedocs.io/en/latest/Parameters.html
        
#         'learning_rate': 0.1,
#         'reg_lambda': 0.0,
#         'num_leaves': 31,
#         'max_depth': -1,
#         'max_bin': 255,
#         'extra_trees': False,
    }

In [None]:
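def preprocess_data(df):
    # Drop the Id, outcome-count, constant and free-text rule columns that are present.
    df = df.drop([col for col in dropped_cols if col in df.columns])
    if Config.split_agent_features:
        for col in agent_cols:
            # Split on '-' into {col}_0, {col}_1, ...; {col}_0 (the leading token) is dropped.
            df = df.with_columns(pl.col(col).str.split(by="-").list.to_struct(fields=lambda idx: f"{col}_{idx}")).unnest(col).drop(f"{col}_0")
    # Agent-derived columns become categoricals; all others except the game name become Float32.
    df = df.with_columns([pl.col(col).cast(pl.Categorical) for col in df.columns if col[:6] in agent_cols])
    df = df.with_columns([pl.col(col).cast(pl.Float32) for col in df.columns if col[:6] not in agent_cols and col != game_col])
    return df.to_pandas()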

In [None]:
def train_lgb(data):
    X = data.drop([target_col, game_col], axis=1)
    y = data[target_col]
    groups = data[game_col]

    cv = GroupKFold(n_splits=Config.n_splits)
    models = []
    for fi, (train_idx, valid_idx) in enumerate(cv.split(X, y, groups)):
        print(f'Fold {fi+1}/{Config.n_splits} ...')
        model = lgb.LGBMRegressor(**Config.lgbm_params)
        model.fit(X.iloc[train_idx], y.iloc[train_idx],
                  eval_set=[(X.iloc[valid_idx], y.iloc[valid_idx])],
                  eval_metric='rmse',
                  callbacks=[lgb.early_stopping(Config.early_stop)])
        models.append(model)
    return models

def infer_lgb(data, models):
    return np.mean([model.predict(data) for model in models], axis=0)
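
As a quick sanity check of the grouping used in `train_lgb`, the sketch below runs `GroupKFold` on toy data and confirms that no group appears in both the training and validation part of a fold. The game names and array sizes are made up for illustration; only the `GroupKFold` call mirrors the notebook.

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# Toy data: 12 rows spread evenly over 4 hypothetical game names.
groups = np.array(["chess", "go", "hex", "nim"] * 3)
X = np.arange(len(groups)).reshape(-1, 1)
y = np.zeros(len(groups))

overlaps = []
for train_idx, valid_idx in GroupKFold(n_splits=2).split(X, y, groups):
    # Games present on both sides of the split; expected to be empty.
    shared = set(groups[train_idx]) & set(groups[valid_idx])
    overlaps.append(shared)
    print("held-out games:", sorted(set(groups[valid_idx])), "| shared:", shared)
```

Because every validation fold contains only games the model has never seen, the CV score estimates performance on new games rather than on new matchups within known games.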

### Submission via the competition API
We follow https://www.kaggle.com/code/sohier/mcts-demo-submission/ \
We add a global **run_i** counter so that the models are trained (or loaded) during the first **predict** call, which is not subject to the 10-minute runtime limit.
    

In [None]:
run_i = 0
def predict(test_data, submission):
    global run_i, models
    if run_i == 0:
        train_df = pl.read_csv(Config.train_path)
        models = train_lgb(preprocess_data(train_df))
    run_i += 1
    
    test_data = preprocess_data(test_data).drop(columns=game_col)
    return submission.with_columns(pl.Series(target_col, infer_lgb(test_data, models)))

inference_server = kaggle_evaluation.mcts_inference_server.MCTSInferenceServer(predict)
if os.getenv('KAGGLE_IS_COMPETITION_RERUN'):
    inference_server.serve()
else:
    inference_server.run_local_gateway(
        (
            '/kaggle/input/um-game-playing-strength-of-mcts-variants/test.csv',
            '/kaggle/input/um-game-playing-strength-of-mcts-variants/sample_submission.csv'
        )
    )