
# AutoGluon Automatic Feature Engineering

*Prepared: 2025-10-14*

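This notebook fits a `TabularPredictor` on the California housing dataset with a custom `AutoMLPipelineFeatureGenerator`, then inspects the features AutoGluon used and their permutation importance.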

In [7]:
!pip -q install -U autogluon scikit-learn

In [9]:
from sklearn.datasets import fetch_california_housing
import pandas as pd
from sklearn.model_selection import train_test_split
from autogluon.tabular import TabularPredictor
from autogluon.features.generators import AutoMLPipelineFeatureGenerator

cal = fetch_california_housing(as_frame=True)
df = cal.frame.copy()
df = df.rename(columns={'MedHouseVal': 'target'})

# Add a synthetic datetime column (cycling through the 365 days of 2010)
# to demonstrate how datetime features are handled
import numpy as np
base = pd.Timestamp('2010-01-01')
df['date'] = base + pd.to_timedelta((np.arange(len(df)) % 365), unit='D')

train_df, val_df = train_test_split(df, test_size=0.2, random_state=0)

# The text options only take effect when text columns exist (this dataset has none).
# Datetime columns such as `date` are handled by the generator's default datetime processing.
fg = AutoMLPipelineFeatureGenerator(
    enable_raw_text_features=True,
    enable_text_ngram_features=True,
)

predictor = TabularPredictor(label='target', problem_type='regression', eval_metric='rmse', path='ag_feat_eng/')
predictor.fit(train_df, time_limit=600, feature_generator=fg)

Verbosity: 2 (Standard Logging)
AutoGluon Version:  1.4.0
Python Version:     3.12.12
Operating System:   Linux
Platform Machine:   x86_64
Platform Version:   #1 SMP Thu Oct  2 10:42:05 UTC 2025
CPU Count:          2
Memory Avail:       10.49 GB / 12.67 GB (82.8%)
Disk Space Avail:   178.76 GB / 225.83 GB (79.2%)
No presets specified! To achieve strong results with AutoGluon, it is recommended to use the available presets. Defaulting to `'medium'`...
	Recommended Presets (For more details refer to https://auto.gluon.ai/stable/tutorials/tabular/tabular-essentials.html#presets):
	presets='extreme' : New in v1.4: Massively better than 'best' on datasets <30000 samples by using new models meta-learned on https://tabarena.ai: TabPFNv2, TabICL, Mitra, and TabM. Absolute best accuracy. Requires a GPU. Recommended 64 GB CPU memory and 32+ GB GPU memory.
	presets='best'    : Maximize accuracy. Recommended for most users. Use in competitions and benchmarks.
	presets='high'    : Strong accuracy w

[1000]	valid_set's rmse: 0.489498
[2000]	valid_set's rmse: 0.480285
[3000]	valid_set's rmse: 0.477489
[4000]	valid_set's rmse: 0.476069
[5000]	valid_set's rmse: 0.475154
[6000]	valid_set's rmse: 0.474492
[7000]	valid_set's rmse: 0.474618


	-0.4743	 = Validation score   (-root_mean_squared_error)
	18.77s	 = Training   runtime
	1.45s	 = Validation runtime
Fitting model: LightGBM ... Training model for up to 578.63s of the 578.62s of remaining time.
	Fitting with cpus=1, gpus=0, mem=0.0/10.3 GB


[1000]	valid_set's rmse: 0.46108


	-0.4583	 = Validation score   (-root_mean_squared_error)
	4.97s	 = Training   runtime
	0.3s	 = Validation runtime
Fitting model: RandomForestMSE ... Training model for up to 573.15s of the 573.14s of remaining time.
	Fitting with cpus=2, gpus=0
	-0.5258	 = Validation score   (-root_mean_squared_error)
	54.73s	 = Training   runtime
	0.23s	 = Validation runtime
Fitting model: CatBoost ... Training model for up to 514.30s of the 514.29s of remaining time.
	Fitting with cpus=1, gpus=0
	-0.4426	 = Validation score   (-root_mean_squared_error)
	69.75s	 = Training   runtime
	0.01s	 = Validation runtime
Fitting model: ExtraTreesMSE ... Training model for up to 444.50s of the 444.49s of remaining time.
	Fitting with cpus=2, gpus=0
	-0.5381	 = Validation score   (-root_mean_squared_error)
	14.27s	 = Training   runtime
	0.22s	 = Validation runtime
Fitting model: NeuralNetFastAI ... Training model for up to 423.91s of the 423.90s of remaining time.
	Fitting with cpus=1, gpus=0, mem=0.0/10.2 GB
	-

[1000]	valid_set's rmse: 0.458538
[2000]	valid_set's rmse: 0.457234
[3000]	valid_set's rmse: 0.457017
[4000]	valid_set's rmse: 0.456939
[5000]	valid_set's rmse: 0.456908
[6000]	valid_set's rmse: 0.456898
[7000]	valid_set's rmse: 0.456895
[8000]	valid_set's rmse: 0.456891
[9000]	valid_set's rmse: 0.456891
[10000]	valid_set's rmse: 0.456891


	-0.4569	 = Validation score   (-root_mean_squared_error)
	54.68s	 = Training   runtime
	4.94s	 = Validation runtime
Fitting model: WeightedEnsemble_L2 ... Training model for up to 360.00s of the 42.33s of remaining time.
	Ensemble Weights: {'CatBoost': 0.667, 'LightGBMLarge': 0.25, 'NeuralNetTorch': 0.083}
	-0.4387	 = Validation score   (-root_mean_squared_error)
	0.02s	 = Training   runtime
	0.0s	 = Validation runtime
AutoGluon training complete, total runtime = 557.74s ... Best model: WeightedEnsemble_L2 | Estimated inference throughput: 332.1 rows/s (1652 batch size)
TabularPredictor saved. To load, use: predictor = TabularPredictor.load("/content/ag_feat_eng")


<autogluon.tabular.predictor.predictor.TabularPredictor at 0x7d606f766ea0>
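The `WeightedEnsemble_L2` entry in the log reports one weight per selected base model. As a rough illustration of what those weights mean, the sketch below assumes the ensemble's prediction is a weighted sum of its base models' predictions. The weights are copied from the log; the base-model predictions are made-up values for three hypothetical rows, and `weighted_ensemble` is an illustrative helper, not an AutoGluon function.

```python
# Weights copied from the "Ensemble Weights" line of the training log.
weights = {"CatBoost": 0.667, "LightGBMLarge": 0.25, "NeuralNetTorch": 0.083}

# Hypothetical base-model predictions for three rows (illustrative values only,
# not taken from the notebook run).
base_preds = {
    "CatBoost": [2.10, 0.95, 3.40],
    "LightGBMLarge": [2.30, 1.05, 3.10],
    "NeuralNetTorch": [1.90, 1.20, 3.60],
}


def weighted_ensemble(preds, weights):
    """Combine per-model predictions row by row as a weighted sum."""
    n_rows = len(next(iter(preds.values())))
    return [
        sum(weights[name] * preds[name][i] for name in weights)
        for i in range(n_rows)
    ]


ensemble_preds = weighted_ensemble(base_preds, weights)
print([round(p, 4) for p in ensemble_preds])
```

Because the weights are non-negative and sum to 1, each ensemble prediction lies between the smallest and largest base prediction for that row.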

In [10]:
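# List the input features AutoGluon kept. By default this returns the original
# column names; pass feature_stage='transformed' to see the features after the
# feature generator has run.
predictor.features()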


['MedInc',
 'HouseAge',
 'AveRooms',
 'AveBedrms',
 'Population',
 'AveOccup',
 'Latitude',
 'Longitude',
 'date']
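The list above shows `date` as a single original column. To make the idea of date-part expansion concrete, here is a minimal sketch that rebuilds the notebook's synthetic date and splits it into numeric parts. It uses only the standard library; the part names (`year`, `month`, `day`, `dayofweek`) are an assumption about what a datetime feature generator typically derives, and the exact set AutoGluon produces may differ by version. In pandas, the same parts are available through the `Series.dt` accessor.

```python
from datetime import date, timedelta

BASE = date(2010, 1, 1)


def synthetic_date(row_index):
    """Mirror the notebook's construction: cycle through 365 consecutive days."""
    return BASE + timedelta(days=row_index % 365)


def date_parts(d):
    """Expand one date into numeric parts a model can consume."""
    return {
        "year": d.year,
        "month": d.month,
        "day": d.day,
        "dayofweek": d.weekday(),  # Monday=0 ... Sunday=6
    }


# Two full cycles of the synthetic date.
rows = [date_parts(synthetic_date(i)) for i in range(730)]

# Count distinct values per derived column; a column with a single distinct
# value carries no information.
distinct = {key: len({row[key] for row in rows}) for key in rows[0]}
print(distinct)
```

Since every synthetic date falls in 2010, the `year` part is constant here. The date is also assigned by row position rather than by any real event, so any importance attributed to it reflects row ordering in the dataset, not a genuine temporal effect.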

In [11]:
# Permutation feature importance, computed on the held-out validation split
fi = predictor.feature_importance(val_df)
fi.head(30)


Computing feature importance via permutation shuffling for 9 features using 4128 rows with 5 shuffle sets...
	831.0s	= Expected runtime (166.2s per shuffle set)
	658.68s	= Actual runtime (Completed 5 of 5 shuffle sets)
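The log describes the method as permutation shuffling with 5 shuffle sets. The sketch below illustrates the general technique under simple assumptions: a feature's importance is the increase in RMSE after that column is shuffled, averaged over the shuffle sets. `toy_model`, the random data, and `permutation_importance` are hypothetical stand-ins written for this example; this is not AutoGluon's implementation.

```python
import random
from math import sqrt


def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    return sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def toy_model(row):
    """Hypothetical fitted model: uses features 0 and 1, ignores feature 2."""
    return 3.0 * row[0] + 0.5 * row[1]


def permutation_importance(model, X, y, num_shuffle_sets=5, seed=0):
    """Mean RMSE increase per feature when that feature's column is shuffled."""
    rng = random.Random(seed)
    baseline = rmse(y, [model(row) for row in X])
    importances = []
    for col in range(len(X[0])):
        increases = []
        for _ in range(num_shuffle_sets):
            shuffled = [row[col] for row in X]
            rng.shuffle(shuffled)
            # Rebuild the rows with only this one column permuted.
            X_perm = [row[:col] + [v] + row[col + 1:] for row, v in zip(X, shuffled)]
            increases.append(rmse(y, [model(r) for r in X_perm]) - baseline)
        importances.append(sum(increases) / num_shuffle_sets)
    return importances


data_rng = random.Random(42)
X = [[data_rng.gauss(0, 1) for _ in range(3)] for _ in range(500)]
y = [toy_model(row) for row in X]  # noise-free target, so the baseline RMSE is 0

importances = permutation_importance(toy_model, X, y)
print([round(v, 3) for v in importances])
```

The ignored feature gets an importance of exactly zero, and the feature with the larger coefficient gets the larger score. Each shuffle set requires a full prediction pass per feature, which is why the run above took several minutes for 9 features.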


| feature | importance | stddev | p_value | n | p99_high | p99_low |
|---|---|---|---|---|---|---|
| Latitude | 1.260842 | 0.018849 | 5.991885e-09 | 5 | 1.299652 | 1.222031 |
| Longitude | 1.155982 | 0.017113 | 5.762378e-09 | 5 | 1.191219 | 1.120746 |
| MedInc | 0.340177 | 0.007897 | 3.482169e-08 | 5 | 0.356437 | 0.323918 |
| AveOccup | 0.177111 | 0.002892 | 8.527639e-09 | 5 | 0.183066 | 0.171157 |
| AveRooms | 0.128915 | 0.003964 | 1.071532e-07 | 5 | 0.137077 | 0.120753 |
| HouseAge | 0.062497 | 0.002248 | 2.004086e-07 | 5 | 0.067125 | 0.057869 |
| date | 0.022724 | 0.001405 | 1.744952e-06 | 5 | 0.025617 | 0.019831 |
| AveBedrms | 0.020947 | 0.00078 | 2.301203e-07 | 5 | 0.022552 | 0.019341 |
| Population | 0.015191 | 0.002217 | 5.290196e-05 | 5 | 0.019756 | 0.010627 |
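The `p99_low` and `p99_high` columns can be sanity-checked by hand. The sketch below assumes they form a two-sided 99% confidence interval from Student's t distribution with `n - 1` degrees of freedom, that is, `importance ± t * stddev / sqrt(n)`. The critical value for 4 degrees of freedom is hard-coded from a standard t-table to avoid a SciPy dependency, and `p99_interval` is an illustrative helper, not part of AutoGluon.

```python
from math import sqrt

# Two-sided 99% critical value of Student's t with n - 1 = 4 degrees of freedom
# (standard t-table value; valid only for n = 5 shuffle sets).
T_CRIT_99_DF4 = 4.6041


def p99_interval(importance, stddev, n, t_crit=T_CRIT_99_DF4):
    """Return (low, high) bounds of the assumed t-based 99% interval."""
    half_width = t_crit * stddev / sqrt(n)
    return importance - half_width, importance + half_width


# (importance, stddev, n) copied from two rows of the table above.
rows = {
    "Latitude": (1.260842, 0.018849, 5),
    "Population": (0.015191, 0.002217, 5),
}

intervals = {name: p99_interval(*values) for name, values in rows.items()}
for name, (low, high) in intervals.items():
    print(f"{name}: p99_low={low:.6f}, p99_high={high:.6f}")
```

Under that assumption the computed bounds agree with the table to within rounding of the displayed inputs. Every interval in the table stays above zero, including the one for the synthetic `date` column.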
