# AutoGluon Ensemble for California Housing Dataset

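This notebook uses AutoGluon to build a regression model that predicts house values in the **California Housing** dataset.
It trains the requested models listed below and combines them through ensembling (Voting, Stacking).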

- GBM (LightGBM)
- XGBoost
- CatBoost
- Random Forest
- Weighted Ensemble (Voting)
- Stacking (via AutoGluon capabilities)

In [None]:
# If AutoGluon is not installed, uncomment the line below and run it.
# !pip install autogluon

In [1]:
import numpy as np
import pandas as pd
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from autogluon.tabular import TabularDataset, TabularPredictor

# Limit NumPy's printed precision to 4 decimal places
np.set_printoptions(precision=4)



In [5]:
%pip install ipywidgets

In [6]:
from tqdm import tqdm
import time

# Iterate from 0 to 99 while displaying a progress bar.
for i in tqdm(range(100)):
    time.sleep(0.05)  # Short sleep to simulate work

100%|██████████| 100/100 [00:05<00:00, 18.79it/s]


## 1. Data Loading and Preprocessing

In [2]:
# Load the data
housing = fetch_california_housing()
df = pd.DataFrame(housing.data, columns=housing.feature_names)
df['MedHouseVal'] = housing.target

print("Data shape:", df.shape)
df.head()

Data shape: (20640, 9)


,MedInc,HouseAge,AveRooms,AveBedrms,Population,AveOccup,Latitude,Longitude,MedHouseVal
0,8.3252,41.0,6.984127,1.02381,322.0,2.555556,37.88,-122.23,4.526
1,8.3014,21.0,6.238137,0.97188,2401.0,2.109842,37.86,-122.22,3.585
2,7.2574,52.0,8.288136,1.073446,496.0,2.80226,37.85,-122.24,3.521
3,5.6431,52.0,5.817352,1.073059,558.0,2.547945,37.85,-122.25,3.413
4,3.8462,52.0,6.281853,1.081081,565.0,2.181467,37.85,-122.25,3.422


In [3]:
# Split into training and test sets (80/20)
train_data, test_data = train_test_split(df, test_size=0.2, random_state=42)

print("Train size:", train_data.shape)
print("Test size:", test_data.shape)

Train size: (16512, 9)
Test size: (4128, 9)
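
The row counts above follow directly from the 80/20 split. The sketch below is a simplified stand-in for `train_test_split`, not its actual implementation: it shuffles row indices with Python's `random` module, so the selected rows will differ from scikit-learn's even with the same seed. The helper name `holdout_split` is hypothetical.

```python
import math
import random


def holdout_split(n_rows, test_size=0.2, seed=42):
    """Shuffle row indices and split them into train and test index lists."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    # Round the test count up for a fractional test_size; the rest is training data.
    n_test = math.ceil(n_rows * test_size)
    return indices[n_test:], indices[:n_test]


train_idx, test_idx = holdout_split(20640)
print("Train rows:", len(train_idx))
print("Test rows:", len(test_idx))
```

Because the two index lists are disjoint, no test row can leak into training, which is what makes the test-set scores later in the notebook a fair estimate.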


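## 2. Training the AutoGluon Models

This section uses AutoGluon's `TabularPredictor`.
- **presets='best_quality'**: Automatically applies bagging and stacking, aiming for the best predictive performance.
- **hyperparameters**: Explicitly lists GBM, XGB, CAT, and RF so that these models are included in training. An empty dict keeps each model's default settings.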
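
To make the bagging and stacking terms concrete, here is a minimal sketch of k-fold bagging that produces out-of-fold (OOF) predictions, which a stacking layer can then use as extra input features. It is a toy illustration, not AutoGluon's implementation: the "model" is simply the mean of the training targets, the target values are made up, and `oof_predictions` is a hypothetical helper.

```python
def oof_predictions(targets, n_folds=5):
    """Return one out-of-fold prediction per row using a mean-value 'model'."""
    n_rows = len(targets)
    oof = [None] * n_rows
    for fold in range(n_folds):
        # Rows whose index falls in this fold are held out; the rest train the model.
        held_out = [i for i in range(n_rows) if i % n_folds == fold]
        train = [targets[i] for i in range(n_rows) if i % n_folds != fold]
        fold_model = sum(train) / len(train)  # "fit": predict the training mean
        for i in held_out:
            oof[i] = fold_model  # each row is predicted by a model that never saw it
    return oof


# Toy target values for illustration only (not taken from the dataset).
targets = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]
oof = oof_predictions(targets, n_folds=5)
print([round(p, 3) for p in oof])
```

Since every row's prediction comes from a fold model that did not train on that row, a second-level model fitted on these predictions sees honest estimates rather than memorized targets.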

In [4]:
# Directory where the trained models are saved
save_path = 'agModels-california_housing'

# Target column
label = 'MedHouseVal'

# Models to train (GBM = LightGBM)
hyperparameters = {
    'GBM': {},      # LightGBM
    'XGB': {},      # XGBoost
    'CAT': {},      # CatBoost
    'RF': {},       # Random Forest
}

predictor = TabularPredictor(label=label, path=save_path, problem_type='regression', eval_metric='rmse').fit(
    train_data,
    presets='best_quality', # High quality preset (enables stacking/bagging)
    hyperparameters=hyperparameters,
    time_limit=600,         # Time limit in seconds; increase if needed (e.g. 3600)
    num_stack_levels=1,     # Number of stacking levels (0 = bagging only, 1 or more = stacking)
    num_bag_folds=5         # Number of bagging folds
)

Verbosity: 2 (Standard Logging)
AutoGluon Version:  1.5.0
Python Version:     3.11.14
Operating System:   Windows
Platform Machine:   AMD64
Platform Version:   10.0.22631
CPU Count:          16
Pytorch Version:    2.9.1+cpu
CUDA Version:       CUDA is not available
Memory Avail:       14.88 GB / 31.72 GB (46.9%)
Disk Space Avail:   325.24 GB / 476.83 GB (68.2%)
Presets specified: ['best_quality']
Setting dynamic_stacking from 'auto' to True. Reason: Enable dynamic_stacking when use_bag_holdout is disabled. (use_bag_holdout=False)
Stack configuration (auto_stack=True): num_stack_levels=1, num_bag_folds=5, num_bag_sets=1
DyStack is enabled (dynamic_stacking=True). AutoGluon will try to determine whether the input data is affected by stacked overfitting and enable or disable stacking as a consequence.
	This is used to identify the optimal `num_stack_levels` value. Copies of AutoGluon will be fit on subsets of the data. Then holdout validation data is used to detect stacked overfitting.
	R

(_ray_fit pid=27736) [1000]	valid_set's rmse: 0.445414


(_dystack pid=12540) 	-0.4494	 = Validation score   (-root_mean_squared_error)
(_dystack pid=12540) 	16.3s	 = Training   runtime
(_dystack pid=12540) 	1.29s	 = Validation runtime
(_dystack pid=12540) Fitting model: RandomForest_BAG_L1 ... Training model for up to 32.82s of the 69.78s of remaining time.
(_dystack pid=12540) 	Fitting 1 model on all data (use_child_oof=True) | Fitting with cpus=16, gpus=0, mem=0.1/11.0 GB
(_dystack pid=12540) 	-0.5041	 = Validation score   (-root_mean_squared_error)
(_dystack pid=12540) 	14.79s	 = Training   runtime
(_dystack pid=12540) 	1.59s	 = Validation runtime
(_dystack pid=12540) Fitting model: CatBoost_BAG_L1 ... Training model for up to 15.59s of the 52.55s of remaining time.
(_dystack pid=12540) 	Fitting 5 child models (S1F1 - S1F5) | Fitting with ParallelLocalFoldFittingStrategy (5 workers, per: cpus=3, gpus=0, memory=3.47%)
(_ray_fit pid=28988) 	R

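## 3. Model Evaluation (Leaderboard)

Compare the performance of every trained model on the test data.
- **WeightedEnsemble**: A voting-style ensemble that takes a weighted average of the predictions from several models.
- **Stacker**: Models produced by stacking (the `_BAG_L2` entries in the leaderboard), which use the predictions of the level-1 models as inputs.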
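
As a rough illustration of the weighted-average idea, the sketch below blends two sets of predictions and compares RMSE before and after. All values are toy numbers, and the weights are fixed by hand; AutoGluon chooses its ensemble weights itself from validation predictions, which this sketch does not attempt to reproduce. The helper names are hypothetical.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


def weighted_ensemble(predictions, weights):
    """Combine per-model prediction lists into one weighted average per row."""
    total = sum(weights.values())
    n_rows = len(next(iter(predictions.values())))
    return [
        sum(weights[name] * predictions[name][i] for name in weights) / total
        for i in range(n_rows)
    ]


# Toy values for illustration only (not taken from the notebook run).
actual = [1.0, 2.0, 3.0, 4.0]
predictions = {
    "model_a": [1.2, 2.2, 3.2, 4.2],  # consistently too high
    "model_b": [0.8, 1.8, 2.8, 3.8],  # consistently too low
}
weights = {"model_a": 0.5, "model_b": 0.5}

blended = weighted_ensemble(predictions, weights)
for name, preds in predictions.items():
    print(name, round(rmse(actual, preds), 4))
print("ensemble", round(rmse(actual, blended), 4))
```

The blend helps here because the two toy models err in opposite directions; models with highly correlated errors gain much less from averaging.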

In [7]:
# Compute the leaderboard on the test data
leaderboard = predictor.leaderboard(test_data, silent=True)

# Scores are negated RMSE (higher is better), so rows appear from lowest to highest test RMSE.
print(leaderboard[['model', 'score_test', 'score_val', 'stack_level', 'fit_time']])

                 model  score_test  score_val  stack_level    fit_time
0  WeightedEnsemble_L3   -0.420905  -0.432958            3  297.083992
1  WeightedEnsemble_L2   -0.423198  -0.434335            2  233.957715
2      CatBoost_BAG_L2   -0.423289  -0.435540            2  250.774108
3      CatBoost_BAG_L1   -0.423296  -0.436624            1  167.285167
4      LightGBM_BAG_L2   -0.423308  -0.438153            2  238.121042
5       XGBoost_BAG_L2   -0.423562  -0.440154            2  242.521442
6      LightGBM_BAG_L1   -0.428762  -0.448697            1   22.804592
7  RandomForest_BAG_L2   -0.429793  -0.447631            2  271.513745
8       XGBoost_BAG_L1   -0.433572  -0.454339            1   25.088161
9  RandomForest_BAG_L1   -0.501292  -0.501281            1   18.731850
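
The scores above are negative because AutoGluon flips the sign of error metrics so that higher is always better. The sketch below converts a few `score_test` values, copied from the leaderboard output above, back to plain RMSE and computes the relative reduction between the best and worst of the selected rows. The variable names are illustrative.

```python
# (model, score_test) pairs copied from the leaderboard output.
leaderboard_rows = [
    ("WeightedEnsemble_L3", -0.420905),
    ("CatBoost_BAG_L1", -0.423296),
    ("LightGBM_BAG_L1", -0.428762),
    ("XGBoost_BAG_L1", -0.433572),
    ("RandomForest_BAG_L1", -0.501292),
]

# Flip the sign to recover RMSE in the target's own units.
test_rmse = {model: -score for model, score in leaderboard_rows}

best_model = min(test_rmse, key=test_rmse.get)
worst_model = max(test_rmse, key=test_rmse.get)
reduction = 1 - test_rmse[best_model] / test_rmse[worst_model]

print(best_model, test_rmse[best_model])
print(f"RMSE reduction vs {worst_model}: {reduction:.1%}")
```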


In [11]:
# Check which model performed best
print("Best model:", predictor.model_best)

Best model: WeightedEnsemble_L3


## 4. Making Predictions
Predict on the test data and compare the first few predictions with the actual values.

In [12]:
y_test = test_data[label]
test_data_nolab = test_data.drop(columns=[label])

# Predict
y_pred = predictor.predict(test_data_nolab)

# Compare actual and predicted values (first 5 rows)
comparison = pd.DataFrame({'Actual': y_test, 'Predicted': y_pred})
comparison.head()

,Actual,Predicted
20046,0.477,0.55533
3024,0.458,0.751491
15663,5.00001,5.030818
20484,2.186,2.456395
9814,2.78,2.516374
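
To connect this table to the leaderboard metric, the sketch below computes RMSE by hand on just the five rows shown above. Five rows are far too few to match the full test-set score; the point is only to show how the metric is formed from actual and predicted values.

```python
import math

# The five rows shown in the comparison output.
actual = [0.477, 0.458, 5.00001, 2.186, 2.78]
predicted = [0.55533, 0.751491, 5.030818, 2.456395, 2.516374]

# Square each error, average, then take the square root.
squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
rmse = math.sqrt(sum(squared_errors) / len(squared_errors))
print(f"RMSE on these five rows: {rmse:.4f}")
```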


In [None]:
# Check feature importance
# In a stacked model, per-base-model contributions are hard to interpret, so use the predictor-level feature_importance function (or a single best model)
predictor.feature_importance(test_data)

Computing feature importance via permutation shuffling for 8 features using 4128 rows with 5 shuffle sets...
	261.04s	= Expected runtime (52.21s per shuffle set)
	82.46s	= Actual runtime (Completed 5 of 5 shuffle sets)


,importance,stddev,p_value,n,p99_high,p99_low
Latitude,1.10024,0.00764,2.789868e-10,5,1.115971,1.084509
Longitude,1.037359,0.009422,8.165002e-10,5,1.056759,1.017959
MedInc,0.305755,0.005765,1.51578e-08,5,0.317625,0.293885
AveOccup,0.169231,0.002746,8.313771e-09,5,0.174885,0.163578
AveRooms,0.139872,0.005526,2.918132e-07,5,0.151251,0.128493
HouseAge,0.056705,0.003777,2.349193e-06,5,0.064483,0.048927
AveBedrms,0.014482,0.002274,7.061248e-05,5,0.019164,0.0098
Population,0.012494,0.000736,1.436911e-06,5,0.014009,0.010979
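
The log above says the importances come from permutation shuffling. A minimal sketch of that idea follows: shuffle one feature column, re-predict, and measure how much the error grows. This is a toy illustration with a hand-written stand-in model and random data, using a single shuffle; it does not reproduce AutoGluon's averaging over multiple shuffle sets or its significance statistics. All names are hypothetical.

```python
import math
import random


def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


def toy_model(row):
    """Stand-in for a trained model: uses feature 0 and ignores feature 1."""
    return 3.0 * row[0]


def permutation_importance(rows, targets, column, seed=0):
    """RMSE increase after shuffling one feature column across rows."""
    baseline = rmse(targets, [toy_model(r) for r in rows])
    shuffled = [r[column] for r in rows]
    random.Random(seed).shuffle(shuffled)
    permuted = [r[:column] + [v] + r[column + 1:] for r, v in zip(rows, shuffled)]
    return rmse(targets, [toy_model(r) for r in permuted]) - baseline


rng = random.Random(1)
rows = [[rng.random(), rng.random()] for _ in range(200)]
targets = [toy_model(r) for r in rows]  # toy targets that the model fits exactly

importance_0 = permutation_importance(rows, targets, column=0)
importance_1 = permutation_importance(rows, targets, column=1)
print("feature 0:", importance_0)
print("feature 1:", importance_1)
```

Shuffling the feature the model relies on raises the error, while shuffling the ignored feature leaves it unchanged, which mirrors how `Latitude` and `Longitude` dominate the table above while `Population` contributes little.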


