Load Data

In [1]:
import pandas as pd
# Download the sample data from https://github.com/sifuHK/rasc/blob/main/Data/dat.xlsx
dat = pd.read_excel('../../Data/dat.xlsx')
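
Before splitting, it can help to confirm that the loaded frame has the shape the rest of this notebook relies on: a binary target column named `y` with feature columns alongside it. The helper below is a minimal sketch. `summarize_target` is a hypothetical name, the 0/1 coding of `y` is an assumption, and the toy frame only stands in for `dat`.

```python
import pandas as pd


def summarize_target(df: pd.DataFrame, target: str = "y") -> dict:
    """Return basic facts about a binary target column (hypothetical helper)."""
    if target not in df.columns:
        raise KeyError(f"column {target!r} not found")
    values = set(df[target].dropna().unique())
    # Assumption: the target is coded as 0/1.
    if not values <= {0, 1}:
        raise ValueError(f"{target!r} is not a 0/1 column: {sorted(values)}")
    return {
        "rows": len(df),
        "features": df.shape[1] - 1,
        "positive_rate": float(df[target].mean()),
        "missing_target": int(df[target].isna().sum()),
    }


# Toy stand-in for `dat`; with the real data, call summarize_target(dat).
toy = pd.DataFrame({"x1": [0.1, 0.4, 0.3, 0.9], "x2": [1, 0, 1, 1], "y": [0, 1, 0, 1]})
summary = summarize_target(toy)
print(summary)
```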

In [2]:
# By default, rascpy follows the operating system's language, provided it ships a matching language pack.
# If that default does not suit you, switch the language pack manually; the pack can be one bundled with rascpy or one you supply yourself.

# from rascpy.Lan_EN import EN
# from rascpy import Tree
# Tree.lan = EN

Splitting Data  
1. `sklearn.model_selection.train_test_split` is a simple splitter. At most it stratifies on the y labels; it does not sample according to the joint distribution of (X, y). As a result, model evaluation metrics can differ noticeably between the resulting splits. Because the gap comes from the sampling itself, the only remedy left on the modeling side is to treat it as overfitting, that is, to sacrifice some model validity in exchange for better generalization.  
2. `rascpy.Sampling.split_cls` is a high-precision sampling algorithm for binary classification problems. Compared with the stratified sampling in `sklearn.model_selection.train_test_split`, it reduces the inconsistency in the joint distribution of (X, y) across the split datasets. This improves model generalization without compromising model validity.

In [3]:
from rascpy.Sampling import split_cls
train,valid = split_cls(dat,y='y',test_size=0.3,random_state=0)
train_X = train.loc[:,train.columns!='y']
valid_X = valid.loc[:,valid.columns!='y']
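
One rough way to see how closely two splits agree on (X, y) is to compare each feature's distribution between the splits separately within each class. The sketch below does this with a two-sample Kolmogorov-Smirnov statistic written in NumPy. It is only a diagnostic under stated assumptions (numeric features, both classes present in both splits) and says nothing about how `split_cls` works internally. `ks_statistic` and `split_drift` are hypothetical names, and the toy data is split with a plain random permutation.

```python
import numpy as np
import pandas as pd


def ks_statistic(a: np.ndarray, b: np.ndarray) -> float:
    """Two-sample KS statistic: the largest gap between the two empirical CDFs."""
    a = np.sort(a[~np.isnan(a)])
    b = np.sort(b[~np.isnan(b)])
    grid = np.concatenate([a, b])
    cdf_a = np.searchsorted(a, grid, side="right") / len(a)
    cdf_b = np.searchsorted(b, grid, side="right") / len(b)
    return float(np.max(np.abs(cdf_a - cdf_b)))


def split_drift(train: pd.DataFrame, valid: pd.DataFrame, y: str = "y") -> pd.DataFrame:
    """Per-feature, per-class KS statistic between two splits, largest first."""
    rows = []
    for col in train.columns.drop(y):
        for label in sorted(train[y].unique()):
            t = train.loc[train[y] == label, col].to_numpy(dtype=float)
            v = valid.loc[valid[y] == label, col].to_numpy(dtype=float)
            rows.append({"feature": col, "class": label, "ks": ks_statistic(t, v)})
    return pd.DataFrame(rows).sort_values("ks", ascending=False, ignore_index=True)


# Toy data and a plain random split, used only to run the sketch.
rng = np.random.default_rng(0)
toy = pd.DataFrame({"x1": rng.normal(size=1000), "x2": rng.exponential(size=1000)})
toy["y"] = (toy["x1"] + rng.normal(size=1000) > 1).astype(int)
order = rng.permutation(len(toy))
toy_train, toy_valid = toy.iloc[order[:700]], toy.iloc[order[700:]]

report = split_drift(toy_train, toy_valid)
print(report)
```

With the notebook's own objects the call would be `split_drift(train, valid)`. Larger `ks` values point to features whose class-conditional distribution differs more between the two splits.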

Train Model  
- **cands_num**: The number of top-scoring candidate models to retain (5 in the example below).  
- **variance_level**: Usually set to 1. If the models returned with `variance_level=1` already show very little variance (that is, the metrics on the training set and the validation set are nearly equal), you can set `variance_level` to 2, which trades some of the model's generalization ability for better evaluation metrics.  
- **cost_time**: The time budget for the search. The example below passes `60*5`, so the value appears to be in seconds. In the author's practical use, a budget of 3-5 minutes is enough to obtain the optimal model in the vast majority of real-world cases. Budgets of 8 minutes or more are not recommended, as the extra time is very likely wasted.

In [4]:
from rascpy.Tree import auto_xgb
perf_cands,params_cands,clf_cands,vars_cands = auto_xgb(train_X,train.y,valid_X,valid.y,metric='auc',cost_time=60*5,cands_num=5,variance_level = 1)

Return value `perf_cands`: list  
The list has `cands_num` elements. Each element is a dictionary recording the following details for one model:  
- `Performance of the model`: the recommendation priority (1 is recommended first)  
- `train_xx` (where `xx` is the metric set by the user, e.g. `train_auc`)  
- `val_xx` (where `xx` is the metric set by the user, e.g. `val_auc`)  
- `|train - val|`: the absolute difference between the training and validation metrics  
- `Count of variables`: the number of input variables

In [5]:
perf_cands

[{'Performance of the model': 1,
  'train_auc': np.float64(0.7832),
  'val_auc': np.float64(0.7534),
  '|train - val|': np.float64(0.02980000000000005),
  'Count of variables': np.int64(87)},
 {'Performance of the model': 2,
  'train_auc': np.float64(0.7755),
  'val_auc': np.float64(0.7493),
  '|train - val|': np.float64(0.0262),
  'Count of variables': np.int64(73)},
 {'Performance of the model': 3,
  'train_auc': np.float64(0.7717),
  'val_auc': np.float64(0.7472),
  '|train - val|': np.float64(0.024500000000000077),
  'Count of variables': np.int64(58)},
 {'Performance of the model': 4,
  'train_auc': np.float64(0.7819),
  'val_auc': np.float64(0.7516),
  '|train - val|': np.float64(0.030299999999999994),
  'Count of variables': np.int64(79)},
 {'Performance of the model': 5,
  'train_auc': np.float64(0.7595),
  'val_auc': np.float64(0.7365),
  '|train - val|': np.float64(0.02299999999999991),
  'Count of variables': np.int64(36)}]
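
Because `perf_cands` is a list of dictionaries, it loads directly into a `pandas.DataFrame`, which makes it easier to compare candidates and to apply a selection rule of your own. The sketch below uses the values from the output above. The rule itself (cap the train/validation gap at 0.025, then take the highest validation AUC) is a hypothetical policy chosen for illustration, not something rascpy prescribes.

```python
import numpy as np
import pandas as pd

# Values copied from the `perf_cands` output above.
perf_cands = [
    {"Performance of the model": 1, "train_auc": np.float64(0.7832), "val_auc": np.float64(0.7534),
     "|train - val|": np.float64(0.02980000000000005), "Count of variables": np.int64(87)},
    {"Performance of the model": 2, "train_auc": np.float64(0.7755), "val_auc": np.float64(0.7493),
     "|train - val|": np.float64(0.0262), "Count of variables": np.int64(73)},
    {"Performance of the model": 3, "train_auc": np.float64(0.7717), "val_auc": np.float64(0.7472),
     "|train - val|": np.float64(0.024500000000000077), "Count of variables": np.int64(58)},
    {"Performance of the model": 4, "train_auc": np.float64(0.7819), "val_auc": np.float64(0.7516),
     "|train - val|": np.float64(0.030299999999999994), "Count of variables": np.int64(79)},
    {"Performance of the model": 5, "train_auc": np.float64(0.7595), "val_auc": np.float64(0.7365),
     "|train - val|": np.float64(0.02299999999999991), "Count of variables": np.int64(36)},
]

perf = pd.DataFrame(perf_cands)  # row i describes the candidate at list position i

# Hypothetical policy: keep candidates whose gap is at most 0.025,
# then pick the one with the highest validation AUC.
max_gap = 0.025
eligible = perf[perf["|train - val|"] <= max_gap]
best_pos = int(eligible["val_auc"].idxmax())

print(perf)
print("chosen list position:", best_pos)
```

The resulting `best_pos` can be used to index the other return values, for example `clf_cands[best_pos]` and `vars_cands[best_pos]`.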

Return value `params_cands`: list  
Each element holds the hyperparameters of one model and corresponds to the entry at the same position in `perf_cands`.

In [6]:
params_cands

[{'n_estimators': 877,
  'max_depth': 1,
  'min_child_weight': np.float64(8.872923967672701),
  'learning_rate': np.float64(0.028322515394950374),
  'subsample': np.float64(0.9216171973524886),
  'colsample_bylevel': np.float64(0.5014526925699818),
  'colsample_bytree': np.float64(0.8810992553587585),
  'reg_alpha': np.float64(1.5148828490326136),
  'reg_lambda': np.float64(0.3692803435996255),
  'n_jobs': -1,
  'verbosity': 0},
 {'n_estimators': 825,
  'max_depth': 1,
  'min_child_weight': np.float64(19.34469455637533),
  'learning_rate': np.float64(0.022059412941599865),
  'subsample': np.float64(0.8422789854685215),
  'colsample_bylevel': np.float64(0.5237206410323396),
  'colsample_bytree': np.float64(0.7185684638396863),
  'reg_alpha': np.float64(0.21591846567713957),
  'reg_lambda': np.float64(0.3949205099231725),
  'n_jobs': -1,
  'verbosity': 0},
 {'n_estimators': 201,
  'max_depth': 1,
  'min_child_weight': np.float64(18.670134585089667),
  'learning_rate': np.float64(0.076874
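
The hyperparameter dictionaries contain NumPy scalars such as `np.float64`. If you want to store a chosen configuration as JSON, converting those scalars to built-in Python numbers first avoids type surprises, since not every NumPy scalar type is accepted by `json.dumps` (`np.int64`, for example, raises `TypeError`). The sketch below uses the first entry from the output above; `to_builtin` is a hypothetical helper name.

```python
import json

import numpy as np


def to_builtin(value):
    """Convert NumPy scalars (np.float64, np.int64, ...) to plain Python numbers."""
    return value.item() if isinstance(value, np.generic) else value


# First entry of `params_cands`, copied from the output above.
params = {
    "n_estimators": 877,
    "max_depth": 1,
    "min_child_weight": np.float64(8.872923967672701),
    "learning_rate": np.float64(0.028322515394950374),
    "subsample": np.float64(0.9216171973524886),
    "colsample_bylevel": np.float64(0.5014526925699818),
    "colsample_bytree": np.float64(0.8810992553587585),
    "reg_alpha": np.float64(1.5148828490326136),
    "reg_lambda": np.float64(0.3692803435996255),
    "n_jobs": -1,
    "verbosity": 0,
}

plain = {key: to_builtin(val) for key, val in params.items()}
text = json.dumps(plain, indent=2)   # safe to write to a file
restored = json.loads(text)          # round-trips to the same values

print(text)
```

Assuming `xgboost` is installed, a dictionary restored this way could be passed as `XGBClassifier(**restored)` to rebuild an untrained estimator with the same settings.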

Return value `clf_cands`: list  
Each element is a trained XGBoost model, corresponding one-to-one with the entries in `perf_cands`.  
The models are plain XGBoost instances (`XGBClassifier`), so they can be deployed and used in production environments without installing `rascpy` on the server.

In [7]:
clf_cands

[XGBClassifier(base_score=None, booster=None, callbacks=None,
               colsample_bylevel=np.float64(0.5014526925699818),
               colsample_bynode=None,
               colsample_bytree=np.float64(0.8810992553587585), device=None,
               early_stopping_rounds=None, enable_categorical=False,
               eval_metric=None, feature_types=None, gamma=None,
               grow_policy=None, importance_type=None,
               interaction_constraints=None,
               learning_rate=np.float64(0.028322515394950374), max_bin=None,
               max_cat_threshold=None, max_cat_to_onehot=None,
               max_delta_step=None, max_depth=1, max_leaves=None,
               min_child_weight=np.float64(8.872923967672701), missing=nan,
               monotone_constraints=None, multi_strategy=None, n_estimators=877,
               n_jobs=-1, num_parallel_tree=None, random_state=0, ...),
 XGBClassifier(base_score=None, booster=None, callbacks=None,
               colsample_by

Return value `vars_cands`: list  
Each element contains the actual input variables used by the model, corresponding one-to-one with the entries in `perf_cands`.

In [8]:
vars_cands

[array(['x3', 'x4', 'x5', 'x6', 'x9', 'x11', 'x12', 'x19', 'x20', 'x26',
        'x29', 'x34', 'x35', 'x37', 'x38', 'x43', 'x47', 'x56', 'x57',
        'x61', 'x62', 'x63', 'x67', 'x84', 'x89', 'x90', 'x95', 'x96',
        'x100', 'x101', 'x102', 'x103', 'x104', 'x105', 'x109', 'x110',
        'x111', 'x116', 'x117', 'x119', 'x120', 'x121', 'x122', 'x124',
        'x126', 'x129', 'x130', 'x131', 'x134', 'x138', 'x143', 'x144',
        'x147', 'x148', 'x150', 'x151', 'x152', 'x155', 'x157', 'x163',
        'x167', 'x168', 'x169', 'x170', 'x174', 'x175', 'x176', 'x180',
        'x188', 'x189', 'x190', 'x191', 'x192', 'x193', 'x194', 'x196',
        'x197', 'x201', 'x207', 'x208', 'x209', 'x210', 'x211', 'x214',
        'x215', 'x216', 'x217'], dtype='<U4'),
 array(['x3', 'x4', 'x5', 'x9', 'x15', 'x19', 'x20', 'x34', 'x35', 'x37',
        'x38', 'x43', 'x47', 'x56', 'x57', 'x61', 'x62', 'x63', 'x67',
        'x84', 'x89', 'x90', 'x95', 'x96', 'x100', 'x101', 'x102', 'x103',
        'x104'
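
Since each entry of `vars_cands` is a NumPy array of column names, NumPy's set routines give a quick view of how two candidates differ in their inputs. The arrays below are shortened, illustrative stand-ins for `vars_cands[0]` and `vars_cands[1]`, not the full lists.

```python
import numpy as np

# Shortened, illustrative stand-ins for vars_cands[0] and vars_cands[1].
vars_a = np.array(["x3", "x4", "x5", "x6", "x9", "x11"])
vars_b = np.array(["x3", "x4", "x5", "x9", "x15", "x19"])

shared = np.intersect1d(vars_a, vars_b)  # used by both models
only_a = np.setdiff1d(vars_a, vars_b)    # used only by the first model
only_b = np.setdiff1d(vars_b, vars_a)    # used only by the second model

# Results are sorted as strings, so 'x11' sorts before 'x6'.
print("shared:", shared)
print("only in first:", only_a)
print("only in second:", only_b)
```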

Users can select the model they need by index for prediction.

In [9]:
# clf_cands[0] is the model the automatic optimization algorithm recommends first. Any other candidate can be selected by index in the same way.
xgb_model = clf_cands[0]

# XGBoost's native predict_proba requires passing all variables used during training to the XGBoost model.
hat = xgb_model.predict_proba(valid_X)

# The native predict_proba returns a two-dimensional numpy.ndarray containing probabilities for class 0 and class 1.
print(hat)
print('-' * 30)

# rascpy's own predict_proba needs only the variables the fitted model actually uses. Variables that were passed at training time but are not used by the model can be left out at prediction time.
from rascpy.Tool import predict_proba

# Only the variables actually used by the model need to be passed. This is more convenient for online deployment, but the server must have rascpy installed.
cols_in_model = vars_cands[0]
hat = predict_proba(xgb_model, valid_X[cols_in_model], decimals=4)

# Returns a pandas.Series containing only the probability of class 1, while preserving the index column of the input data.
print(hat)
print('-' * 30)

# Extra columns can also be passed to rascpy.Tool.predict_proba (here all of `valid`, including `y`); it filters them out automatically.
hat = predict_proba(xgb_model, valid)

# The returned result is the same as when passing valid_X[cols_in_model].
print(hat)

[[0.9709641  0.02903594]
 [0.9679828  0.03201716]
 [0.98568344 0.01431655]
 ...
 [0.56139714 0.43860286]
 [0.7213056  0.2786944 ]
 [0.67495286 0.3250471 ]]
------------------------------
2785     0.0290
3506     0.0320
27512    0.0143
29244    0.0118
28684    0.0209
          ...  
38852    0.4711
26433    0.2351
2124     0.4386
25394    0.2787
27479    0.3250
Length: 11779, dtype: float64
------------------------------
2785     0.0290
3506     0.0320
27512    0.0143
29244    0.0118
28684    0.0209
          ...  
38852    0.4711
26433    0.2351
2124     0.4386
25394    0.2787
27479    0.3250
Length: 11779, dtype: float64
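
If `rascpy` is not installed on the scoring server, a similar class-1 `pandas.Series` can be produced from the native `predict_proba` output with a few lines of pandas. The sketch below is not rascpy's implementation. `score_positive_class` is a hypothetical helper that selects the full set of training columns in their training order (which the native method requires, as noted above), keeps the input index, and rounds the result. `StubModel` is a stand-in used only so the example runs without a fitted XGBoost model.

```python
import numpy as np
import pandas as pd


def score_positive_class(model, X: pd.DataFrame, train_columns, decimals: int = 4) -> pd.Series:
    """Class-1 probabilities as a Series indexed like X (hypothetical helper)."""
    missing = [c for c in train_columns if c not in X.columns]
    if missing:
        raise KeyError(f"missing training columns: {missing}")
    # Pass exactly the training columns, in training order; extra columns are dropped.
    proba = model.predict_proba(X[list(train_columns)])
    return pd.Series(np.round(proba[:, 1], decimals), index=X.index)


class StubModel:
    """Stand-in with a scikit-learn-style predict_proba, used only to run the sketch."""

    def predict_proba(self, X):
        p1 = 1.0 / (1.0 + np.exp(-X.to_numpy(dtype=float).sum(axis=1)))
        return np.column_stack([1.0 - p1, p1])


frame = pd.DataFrame({"x1": [0.0, 1.0], "x2": [0.0, -1.0], "y": [0, 1]}, index=[2785, 3506])
hat = score_positive_class(StubModel(), frame, ["x1", "x2"])
print(hat)
```

With the notebook's objects, the equivalent call would be `score_positive_class(xgb_model, valid, train_X.columns)`.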
