### Using Boruta on the Madelon Data Set

This example shows how to use Boruta to identify all relevant features in the Madelon data set. Madelon is an artificial data set that was used at NIPS 2003 and is also cited in the [Boruta paper](https://www.jstatsoft.org/article/view/v036i11/v36i11.pdf).

The data set has 2000 observations and 500 features. We use Boruta to identify the features that are relevant to the classification task.

In [1]:
import polars as pl
from boruta import BorutaPy
from sklearn.ensemble import RandomForestClassifier

In [20]:
def load_data(train_data_url: str, train_label_url: str) -> pl.DataFrame:
    """Load the Madelon training set with the label in a `target` column."""
    # Keep the first 499 columns (indices 0-498) of the space-separated file.
    x_data = pl.read_csv(train_data_url, separator=" ", has_header=False).select(
        pl.nth(range(499))
    )
    y_data = pl.read_csv(train_label_url, separator=" ", has_header=False).rename(
        {"column_1": "target"}
    )
    # Rename the feature columns from column_1..column_N to "0".."N-1".
    mapping_val = {f"column_{i + 1}": f"{i}" for i in range(x_data.width)}
    return y_data.hstack(x_data).rename(mapping_val)


In [21]:
# URLs for the data set, hosted by UCI
train_data_url = "https://archive.ics.uci.edu/ml/machine-learning-databases/madelon/MADELON/madelon_train.data"
train_label_url = "https://archive.ics.uci.edu/ml/machine-learning-databases/madelon/MADELON/madelon_train.labels"

data = load_data(train_data_url, train_label_url)
data.head()

target,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,…,462,463,464,465,466,467,468,469,470,471,472,473,474,475,476,477,478,479,480,481,482,483,484,485,486,487,488,489,490,491,492,493,494,495,496,497,498
i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,…,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64
-1,485,477,537,479,452,471,491,476,475,473,455,500,456,507,478,491,447,422,480,482,515,482,464,484,477,496,509,491,459,482,483,505,508,458,509,517,…,496,488,462,498,480,511,500,437,537,470,515,476,467,401,485,499,495,490,508,463,487,531,515,476,482,463,467,479,477,481,477,485,511,485,481,479,475
-1,483,458,460,487,587,475,526,479,485,469,434,483,465,503,472,478,469,518,495,491,478,530,462,494,549,469,516,487,475,486,478,514,542,406,469,452,…,485,483,500,487,476,526,449,363,466,478,465,479,482,549,470,506,481,494,492,448,492,447,598,507,478,483,492,485,463,478,487,338,513,486,483,492,510
-1,487,542,499,468,448,471,442,478,480,477,468,497,477,491,493,502,465,567,510,475,474,483,490,492,544,482,454,496,491,495,476,438,489,432,486,512,…,513,445,509,492,487,524,479,441,529,481,485,478,479,454,503,501,500,484,479,470,466,529,482,486,487,480,522,481,487,481,492,650,506,501,480,489,499
1,480,491,510,485,495,472,417,474,502,476,455,520,437,472,481,436,503,522,488,468,492,488,476,479,477,485,506,476,472,506,479,472,504,553,492,494,…,472,564,496,481,490,446,486,485,576,481,457,476,481,602,500,503,481,487,465,473,477,515,525,468,487,458,489,490,491,480,474,572,454,469,475,482,494
1,484,502,528,489,466,481,402,478,487,468,432,494,493,434,505,497,486,471,467,476,455,517,483,465,512,552,505,464,472,419,479,543,491,488,446,563,…,528,525,485,492,468,505,476,513,444,476,500,479,483,560,474,486,459,494,509,456,472,520,468,482,473,473,498,485,488,479,452,435,486,508,481,504,495


In [22]:
y = data.get_column("target")
x = data.drop(y.name)

Boruta follows the scikit-learn API and can be used either inside a Pipeline or on its own. Here we show standalone usage.

First, instantiate the estimator that Boruta will use. Then instantiate the Boruta object.

In [23]:
rf = RandomForestClassifier(n_jobs=-1, class_weight=None, max_depth=7, random_state=0)
# Define Boruta feature selection method
feat_selector = BorutaPy(rf, n_estimators="auto", verbose=2, random_state=0)

Once created, this object can be used to identify the relevant features in the data set.

In [26]:
feat_selector.fit(x.to_numpy(), y.to_numpy())


Iteration: 	1 / 100
Confirmed: 	0
Tentative: 	499
Rejected: 	0
Iteration: 	2 / 100
Confirmed: 	0
Tentative: 	499
Rejected: 	0
Iteration: 	3 / 100
Confirmed: 	0
Tentative: 	499
Rejected: 	0
Iteration: 	4 / 100
Confirmed: 	0
Tentative: 	499
Rejected: 	0
Iteration: 	5 / 100
Confirmed: 	0
Tentative: 	499
Rejected: 	0
Iteration: 	6 / 100
Confirmed: 	0
Tentative: 	499
Rejected: 	0
Iteration: 	7 / 100
Confirmed: 	0
Tentative: 	499
Rejected: 	0
Iteration: 	8 / 100
Confirmed: 	0
Tentative: 	21
Rejected: 	478
Iteration: 	9 / 100
Confirmed: 	18
Tentative: 	3
Rejected: 	478
Iteration: 	10 / 100
Confirmed: 	18
Tentative: 	3
Rejected: 	478
Iteration: 	11 / 100
Confirmed: 	18
Tentative: 	3
Rejected: 	478
Iteration: 	12 / 100
Confirmed: 	19
Tentative: 	2
Rejected: 	478
Iteration: 	13 / 100
Confirmed: 	19
Tentative: 	2
Rejected: 	478
Iteration: 	14 / 100
Confirmed: 	19
Tentative: 	2
Rejected: 	478
Iteration: 	15 / 100
Confirmed: 	19
Tentative: 	2
Rejected: 	478
Iteration: 	16 / 100
Confirmed: 	19
Tenta

Boruta selected only the useful features. At the end of the run, it was still undecided about two features.

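You can check which features were selected by inspecting `.support_`. It is a boolean array that can be used to extract only the relevant columns from the feature matrix. Of course, following the scikit-learn API, you can also use `.transform`.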

In [29]:
# Check selected features
print(feat_selector.support_)
# Select the chosen features from our dataframe.
selected = x.select(
    pl.nth([i for i, is_selected in enumerate(feat_selector.support_) if is_selected])
)
print()
print("Selected Feature Matrix Shape")
print(selected.shape)

[False False False False False False False False False False False False
 False False False False False False False False False False False False
 False False False False  True False False False False False False False
 False False False False False False False False False False False False
  True False False False False False False False False False False False
 False False False False  True False False False False False False False
 False False False False False False False False False False False False
 False False False False False False False False False False False False
 False False False False False False False False False  True False False
 False False False False False False False False False False False False
 False False False False False False False False  True False False False
 False False False False False False False False False False False False
 False False False False False False False False False  True False False
 False False False False False False False False Fa
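
As a minimal sketch of what mask-based selection amounts to, the following uses a small made-up NumPy matrix and hand-written masks in place of the real `feat_selector.support_`. The second mask stands in for BorutaPy's `support_weak_` attribute, which flags features that are still tentative. The toy values and variable names are assumptions for illustration only, not output from the run above.

```python
import numpy as np

# Toy feature matrix: 4 samples x 6 features (made-up values, not Madelon).
X = np.arange(24).reshape(4, 6)

# Hypothetical stand-ins for feat_selector.support_ (confirmed features)
# and feat_selector.support_weak_ (tentative features).
support = np.array([False, True, False, False, True, False])
support_weak = np.array([False, False, True, False, False, False])

# Boolean indexing keeps only the confirmed columns.
confirmed = X[:, support]

# Combining both masks also keeps the tentative columns.
confirmed_or_tentative = X[:, support | support_weak]

# Integer positions of the confirmed features, the same information the
# polars select above derives with enumerate().
confirmed_idx = np.flatnonzero(support)

print(confirmed.shape)               # (4, 2)
print(confirmed_or_tentative.shape)  # (4, 3)
print(confirmed_idx.tolist())        # [1, 4]
```

With the fitted selector, `feat_selector.transform(x.to_numpy())` should give the same columns as the confirmed mask, and, as far as we know, passing `weak=True` to `transform` additionally keeps the tentative features.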

You can also inspect the ranking of the features that were not selected via `.ranking_`.
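
To help read the raw array, here is a small sketch of how the ranks can be interpreted. According to the BorutaPy docstring, confirmed features get rank 1 and tentative features get rank 2; larger values belong to rejected features. The `ranking` array below is a made-up stand-in for `feat_selector.ranking_`, and the variable names are hypothetical.

```python
import numpy as np

# Hypothetical stand-in for feat_selector.ranking_ (made-up values).
ranking = np.array([5, 1, 2, 7, 1, 3])

# Rank 1 marks confirmed features, rank 2 marks tentative ones.
confirmed_idx = np.flatnonzero(ranking == 1)
tentative_idx = np.flatnonzero(ranking == 2)

# Everything above rank 2 was rejected; order those from best to worst rank.
rejected_idx = np.flatnonzero(ranking > 2)
rejected_sorted = rejected_idx[np.argsort(ranking[rejected_idx], kind="stable")]

print(confirmed_idx.tolist())    # [1, 4]
print(tentative_idx.tolist())    # [2]
print(rejected_sorted.tolist())  # [5, 0, 3]
```

Sorting the rejected features by rank is one way to see which ones came closest to being kept, which can be useful when deciding whether to rerun Boruta with more iterations.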

In [30]:
feat_selector.ranking_

array([426, 228, 266, 431,   7, 193, 254, 473, 347, 429,   3, 254, 121,
        88, 165, 208, 301, 202,  71, 347, 337, 307, 390, 155, 189, 157,
        95, 375,   1, 246, 414, 132,  59, 176, 349,  92, 256, 373,  25,
       463, 452, 220, 101,  45,  40, 329, 213, 140,   1, 191, 239, 130,
       342, 393, 188,  26,  20, 250, 441,  89, 294, 332, 351, 458,   1,
       232, 215, 206, 144, 402, 285, 311, 309,  44, 204,  99, 448, 248,
       179, 208, 321,  87, 356, 154, 128,  57, 444, 414, 218, 165, 480,
       271, 323, 377, 421, 355, 437, 150, 244, 136, 405, 419, 428, 314,
       173,   1,  92, 393, 374, 101, 358, 223, 416, 268, 210, 367,  30,
       371, 186,  55, 459, 195, 287, 234, 369,  69, 181, 234,   1,  66,
        61,  89, 297, 446, 161, 247,   9,  19, 439, 310,  55, 378, 398,
       334, 340, 232, 412, 114, 420,  28, 362, 184,  52,   1, 462, 307,
        64, 304, 117, 259, 229, 120, 409, 429, 126, 274, 471, 426, 470,
       337, 315, 181, 184, 468, 334, 123, 369, 159, 296, 251,  7