
## Introduction to Feature Engineering in AutoGluon

Feature engineering transforms raw tabular data into a format suitable for machine learning models. It can also enhance certain features to provide models with more relevant information, improving accuracy.

AutoGluon handles much of this automatically, but you can customize the process. This document explains the default behavior and how to modify it.

### Column Types
AutoGluon recognizes and processes the following feature types:
- **Boolean** (e.g., A, B)
- **Numerical** (e.g., 1.3, 2.0)
- **Categorical** (e.g., Red, Blue)
- **Datetime** (e.g., 1/31/2021)
- **Text** (e.g., "Mary had a little lamb")

The MultiModal option also supports additional feature types like images (e.g., 'path/image123.png').

### Column Type Detection
- **Boolean**: Columns with exactly 2 unique values.
- **Categorical**: String columns that are not detected as text.
- **Numerical**: Columns with an integer or float dtype.
- **Text**: String columns where most values are unique and contain multiple words.
- **Datetime**: Columns whose values can be converted to Pandas datetimes.
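
The detection rules above can be sketched in plain Python. This is a simplified illustration, not AutoGluon's implementation: the function name `infer_column_type`, the single date format, and the two ratio thresholds are all assumptions made for the example.

```python
from datetime import datetime


def _parses_as_date(value, fmt="%m/%d/%Y"):
    """Return True if the string matches the (assumed) date format."""
    try:
        datetime.strptime(value, fmt)
        return True
    except ValueError:
        return False


def infer_column_type(values):
    """Guess a feature type for a list of raw values (None means missing)."""
    present = [v for v in values if v is not None]
    if not present:
        return "categorical"  # nothing to inspect; arbitrary fallback
    if len(set(present)) == 2:
        return "boolean"
    if all(isinstance(v, (int, float)) for v in present):
        return "numerical"
    if all(isinstance(v, str) for v in present):
        if all(_parses_as_date(v) for v in present):
            return "datetime"
        unique_ratio = len(set(present)) / len(present)
        multiword_ratio = sum(" " in v.strip() for v in present) / len(present)
        # Hypothetical thresholds: mostly unique, mostly multi-word strings
        if unique_ratio > 0.9 and multiword_ratio > 0.5:
            return "text"
    return "categorical"


print(infer_column_type(["Red", "Blue", "Green", "Red"]))
print(infer_column_type(["1/31/2021", "2/1/2021", "3/15/2021"]))
```

The boolean check runs first, so any column with two distinct values is treated as boolean regardless of its dtype.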

### Problem Type Detection
AutoGluon infers whether the task is classification or regression based on the label column. You can override this by passing the `problem_type` argument to `TabularPredictor`.
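
To make the idea of label-based inference concrete, here is a rough sketch in plain Python. It is only an illustration under stated assumptions: the function `infer_problem_type` and the `max_classes` threshold are hypothetical, and AutoGluon's actual heuristics are more involved. The returned strings mirror the `problem_type` values `'binary'`, `'multiclass'`, and `'regression'`.

```python
def infer_problem_type(labels, max_classes=10):
    """Guess the task type from label values (simplified, hypothetical rules)."""
    unique = set(labels)
    if len(unique) == 2:
        return "binary"
    if all(isinstance(v, str) for v in unique):
        return "multiclass"
    # Numeric labels with fractional parts look like a continuous target
    if any(isinstance(v, float) and not v.is_integer() for v in unique):
        return "regression"
    # Integer-like labels: few distinct values suggest classes, many suggest a quantity
    return "multiclass" if len(unique) <= max_classes else "regression"


print(infer_problem_type([4.526, 3.585, 3.521]))
print(infer_problem_type([0, 1, 1, 0]))
```

When a heuristic like this would guess wrong (for example, an integer rating that should be treated as regression), passing `problem_type` explicitly avoids the ambiguity.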

### Automatic Feature Engineering
- **Numerical Columns**: No automatic feature engineering.
- **Categorical Columns**: Mapped to integers.
- **Datetime Columns**: Converted into a numerical Pandas datetime plus year, month, day, and dayofweek features.
- **Text Columns**: Processed using either a full Transformer model (with MultiModal) or via n-grams and special numerical features like word/character counts.
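
The categorical and datetime transformations in the list above can be illustrated with the standard library alone. This is a minimal sketch, not AutoGluon's code: the helper names are hypothetical, and assigning codes in order of first appearance is an assumption made for simplicity.

```python
from datetime import datetime


def encode_categories(values):
    """Map each distinct category to an integer code, in order of first appearance."""
    codes = {}
    encoded = [codes.setdefault(v, len(codes)) for v in values]
    return encoded, codes


def expand_datetime(value):
    """Split a datetime into calendar features (dayofweek: Monday=0)."""
    return {
        "year": value.year,
        "month": value.month,
        "day": value.day,
        "dayofweek": value.weekday(),
    }


encoded, mapping = encode_categories(["Red", "Blue", "Red"])
print(encoded, mapping)
print(expand_datetime(datetime(2021, 1, 31)))
```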

Additional processing includes dropping columns that contain only one unique value and columns that duplicate another column.
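
As a rough illustration of that cleanup step, the sketch below removes constant and duplicated columns from a table stored as a plain dict of lists. The function name and data layout are assumptions for the example and do not reflect how AutoGluon stores or processes data internally.

```python
def drop_uninformative_columns(table):
    """Drop columns with a single unique value or an exact copy of an earlier column.

    `table` maps column name -> list of values.
    """
    kept = {}
    seen = set()
    for name, values in table.items():
        if len(set(values)) <= 1:
            continue  # constant column carries no information
        key = tuple(values)
        if key in seen:
            continue  # identical to a column that was already kept
        seen.add(key)
        kept[name] = values
    return kept


table = {"a": [1, 2, 3], "b": [7, 7, 7], "c": [1, 2, 3], "d": [3, 2, 1]}
print(list(drop_uninformative_columns(table)))
```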


In [1]:
# Upgrade pip and install the latest version of AutoGluon
!python -m pip install --upgrade pip
!pip install autogluon



In [3]:
# Import necessary libraries
from autogluon.tabular import TabularDataset, TabularPredictor
from autogluon.features.generators import AutoMLPipelineFeatureGenerator, PipelineFeatureGenerator, CategoryFeatureGenerator, IdentityFeatureGenerator
from autogluon.common.features.types import R_INT, R_FLOAT
import pandas as pd
import numpy as np
from sklearn.datasets import fetch_california_housing
from datetime import datetime
import random



In [4]:
# Load the California Housing dataset
california_housing = fetch_california_housing()
dfx = pd.DataFrame(california_housing.data, columns=california_housing.feature_names)
dfy = pd.DataFrame(california_housing.target, columns=['MedHouseValue'])

In [5]:
# Modify some columns for feature generation demonstration
dfx['AveOccup'] = (dfx['AveOccup']).astype(int)  # Convert to integer
dfx['HouseAge'] = datetime(2000,1,1) + pd.to_timedelta(dfx['HouseAge'].astype(int), unit='D')  # Convert to datetime
dfx['MedInc'] = pd.cut(dfx['MedInc'] * 10, [-np.inf, 2, 4, 6, np.inf], labels=['Low', 'Medium', 'High', 'Very High'])  # Categorical binning
dfx['Latitude'] = pd.Series(list(' '.join(random.choice(["a", "b", "c", "d", "e"]) for i in range(3)) for j in range(len(dfx))))  # Random strings


In [6]:
# Combine features and target into a single dataframe
df = pd.concat([dfx, dfy], axis=1)

In [7]:
#  Create the TabularDataset
dataset = TabularDataset(df)

In [8]:
# Feature Generation with AutoMLPipelineFeatureGenerator
auto_ml_pipeline_feature_generator = AutoMLPipelineFeatureGenerator()
processed_data = auto_ml_pipeline_feature_generator.fit_transform(X=dfx)

Fitting AutoMLPipelineFeatureGenerator...
	Available Memory:                    11456.28 MB
	Train Data (Original)  Memory Usage: 2.19 MB (0.0% of available memory)
	Inferring data type of each feature based on column values. Set feature_metadata_in to manually specify special dtypes of the features.
	Stage 1 Generators:
		Fitting AsTypeFeatureGenerator...
			Note: Converting 1 features to boolean dtype as they only contain 2 unique values.
	Stage 2 Generators:
		Fitting FillNaFeatureGenerator...
	Stage 3 Generators:
		Fitting IdentityFeatureGenerator...
		Fitting CategoryFeatureGenerator...
			Fitting CategoryMemoryMinimizeFeatureGenerator...
		Fitting DatetimeFeatureGenerator...
		Fitting TextSpecialFeatureGenerator...
			Fitting BinnedFeatureGenerator...
			Fitting DropDuplicatesFeatureGenerator...
		Fitting TextNgramFeatureGenerator...
			Fitting CountVectorizer for text features: ['Latitude']
			Removing text_ngram feature due to error: '__nlp__'
	Stage 4 Generators:
		Fitting Dro

In [9]:
# Display processed features
print(processed_data.head())

   MedInc  AveRooms  AveBedrms  Population  AveOccup  Longitude Latitude  \
0       0  6.984127   1.023810       322.0         2    -122.23       83   
1       0  6.238137   0.971880      2401.0         2    -122.22      102   
2       0  8.288136   1.073446       496.0         2    -122.24       62   
3       0  5.817352   1.073059       558.0         2    -122.25      112   
4       0  6.281853   1.081081       565.0         2    -122.25       29   

             HouseAge  HouseAge.month  HouseAge.day  HouseAge.dayofweek  
0  950227200000000000               2            11                   4  
1  948499200000000000               1            22                   5  
2  951177600000000000               2            22                   1  
3  951177600000000000               2            22                   1  
4  951177600000000000               2            22                   1  
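
The large integers in the `HouseAge` column are nanoseconds since the Unix epoch. The short check below decodes the first two values with the standard library and confirms that they agree with the `month`, `day`, and `dayofweek` columns. The helper `decode_ns` is introduced here for illustration only. The absence of a `HouseAge.year` column is presumably because every generated date falls in the year 2000, so that feature would be constant.

```python
from datetime import datetime, timedelta

EPOCH = datetime(1970, 1, 1)


def decode_ns(ns):
    """Convert nanoseconds since the Unix epoch to a naive UTC datetime."""
    return EPOCH + timedelta(seconds=ns // 1_000_000_000)


first_row = decode_ns(950227200000000000)
second_row = decode_ns(948499200000000000)

# weekday() uses Monday=0, the same convention as dayofweek in the output above
print(first_row.date(), first_row.month, first_row.day, first_row.weekday())    # 2000-02-11 2 11 4
print(second_row.date(), second_row.month, second_row.day, second_row.weekday())  # 2000-01-22 1 22 5
```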


In [10]:
#  Training the AutoGluon model
predictor = TabularPredictor(label='MedHouseValue')  # Initialize the TabularPredictor with the target column
predictor.fit(df, hyperparameters={'GBM': {}}, feature_generator=auto_ml_pipeline_feature_generator)  # Train model


No path specified. Models will be saved in: "AutogluonModels/ag-20240927_055553"
Verbosity: 2 (Standard Logging)
AutoGluon Version:  1.1.1
Python Version:     3.10.12
Operating System:   Linux
Platform Machine:   x86_64
Platform Version:   #1 SMP PREEMPT_DYNAMIC Thu Jun 27 21:05:47 UTC 2024
CPU Count:          2
Memory Avail:       11.15 GB / 12.67 GB (88.0%)
Disk Space Avail:   61.77 GB / 107.72 GB (57.3%)
No presets specified! To achieve strong results with AutoGluon, it is recommended to use the available presets.
	Recommended Presets (For more details refer to https://auto.gluon.ai/stable/tutorials/tabular/tabular-essentials.html#presets):
	presets='best_quality'   : Maximize accuracy. Default time_limit=3600.
	presets='high_quality'   : Strong accuracy with fast inference speed. Default time_limit=3600.
	presets='good_quality'   : Good accuracy with very fast inference speed. Default time_limit=3600.
	presets='medium_quality' : Fast training time, ideal for initial prototyping.
Be

<autogluon.tabular.predictor.predictor.TabularPredictor at 0x780ede6c9690>

In [13]:
# Handling missing values: set every value in the first row to NaN
dfx.iloc[0] = np.nan

In [15]:
# Retrain after introducing missing values

# Re-create the feature generator since it cannot be refitted
auto_ml_pipeline_feature_generator_new = AutoMLPipelineFeatureGenerator()

# Apply the new feature generator to the dataset with missing values
processed_data_with_missing = auto_ml_pipeline_feature_generator_new.fit_transform(X=dfx)

# Display the processed data with missing values
print(processed_data_with_missing.head())

Fitting AutoMLPipelineFeatureGenerator...
	Available Memory:                    11286.61 MB
	Train Data (Original)  Memory Usage: 2.19 MB (0.0% of available memory)
	Inferring data type of each feature based on column values. Set feature_metadata_in to manually specify special dtypes of the features.
	Stage 1 Generators:
		Fitting AsTypeFeatureGenerator...
	Stage 2 Generators:
		Fitting FillNaFeatureGenerator...
	Stage 3 Generators:
		Fitting IdentityFeatureGenerator...
		Fitting CategoryFeatureGenerator...
			Fitting CategoryMemoryMinimizeFeatureGenerator...
		Fitting DatetimeFeatureGenerator...
		Fitting TextSpecialFeatureGenerator...
			Fitting BinnedFeatureGenerator...
			Fitting DropDuplicatesFeatureGenerator...
		Fitting TextNgramFeatureGenerator...
			Fitting CountVectorizer for text features: ['Latitude']
			Removing text_ngram feature due to error: '__nlp__'
	Stage 4 Generators:
		Fitting DropUniqueFeatureGenerator...
	Stage 5 Generators:
		Fitting DropDuplicatesFeatureGenerat

   AveRooms  AveBedrms  Population  AveOccup  Longitude MedInc Latitude  \
0       NaN        NaN         NaN       NaN        NaN    NaN      NaN   
1  6.238137   0.971880      2401.0       2.0    -122.22      1      102   
2  8.288136   1.073446       496.0       2.0    -122.24      1       62   
3  5.817352   1.073059       558.0       2.0    -122.25      1      112   
4  6.281853   1.081081       565.0       2.0    -122.25      1       29   

             HouseAge  HouseAge.month  HouseAge.day  HouseAge.dayofweek  \
0  949159199883715200               1            29                   5   
1  948499200000000000               1            22                   5   
2  951177600000000000               2            22                   1   
3  951177600000000000               2            22                   1   
4  951177600000000000               2            22                   1   

   Latitude.char_count  
0                    0  
1                    1  
2                    1 
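
In the output above, the numerical columns keep their NaN in row 0, while `HouseAge` receives a timestamp that does not fall on a whole day. That is consistent with the missing datetime being filled with the column mean. The sketch below shows mean imputation for datetimes in plain Python; the function name is hypothetical and the code is an illustration of the idea, not AutoGluon's implementation.

```python
from datetime import datetime, timedelta


def fill_missing_with_mean(dates):
    """Replace None entries with the mean of the non-missing datetimes."""
    present = [d for d in dates if d is not None]
    origin = present[0]
    # Average the offsets from an arbitrary origin, since datetimes cannot be summed directly
    mean_offset = sum((d - origin for d in present), timedelta()) / len(present)
    mean = origin + mean_offset
    return [mean if d is None else d for d in dates]


dates = [None, datetime(2000, 1, 22), datetime(2000, 2, 22)]
filled = fill_missing_with_mean(dates)
print(filled[0])
```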

In [17]:
# Custom Pipeline for feature generation
mypipeline = PipelineFeatureGenerator(
    generators=[
        [
            # Encode categorical columns, keeping at most 10 categories per column
            CategoryFeatureGenerator(maximum_num_cat=10),
            # Pass integer and float columns through unchanged
            IdentityFeatureGenerator(infer_features_in_args=dict(valid_raw_types=[R_INT, R_FLOAT])),
        ]
    ]
)

In [18]:
# Apply the custom pipeline to the data
processed_pipeline_data = mypipeline.fit_transform(X=dfx)

Fitting PipelineFeatureGenerator...
	Available Memory:                    11240.88 MB
	Train Data (Original)  Memory Usage: 2.19 MB (0.0% of available memory)
	Inferring data type of each feature based on column values. Set feature_metadata_in to manually specify special dtypes of the features.
	Stage 1 Generators:
		Fitting AsTypeFeatureGenerator...
	Stage 2 Generators:
		Fitting FillNaFeatureGenerator...
	Stage 3 Generators:
		Fitting CategoryFeatureGenerator...
			Fitting CategoryMemoryMinimizeFeatureGenerator...
		Fitting IdentityFeatureGenerator...
	Stage 4 Generators:
		Fitting DropUniqueFeatureGenerator...
	Stage 5 Generators:
		Fitting DropDuplicatesFeatureGenerator...
	Unused Original Features (Count: 1): ['HouseAge']
		These features were not used to generate any of the output features. Add a feature generator compatible with these features to utilize them.
		Features can also be unused if they carry very little information, such as being categorical but having almost entirel

In [19]:
# Display the results from the custom pipeline
print(processed_pipeline_data.head())

  MedInc Latitude  AveRooms  AveBedrms  Population  AveOccup  Longitude
0    NaN      NaN       NaN        NaN         NaN       NaN        NaN
1      1      NaN  6.238137   0.971880      2401.0       2.0    -122.22
2      1        6  8.288136   1.073446       496.0       2.0    -122.24
3      1      NaN  5.817352   1.073059       558.0       2.0    -122.25
4      1      NaN  6.281853   1.081081       565.0       2.0    -122.25
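
Most `Latitude` values are NaN in this output. A likely explanation is the `maximum_num_cat=10` setting: the random three-letter strings can take up to 125 distinct values, and only the most frequent categories are kept. The sketch below illustrates that kind of top-k filtering with the standard library. The function `keep_top_categories` is hypothetical and is not how AutoGluon implements the option.

```python
from collections import Counter


def keep_top_categories(values, maximum_num_cat):
    """Keep the most frequent categories and replace all others with None."""
    counts = Counter(v for v in values if v is not None)
    top = {category for category, _ in counts.most_common(maximum_num_cat)}
    return [v if v in top else None for v in values]


values = ["a", "a", "a", "b", "b", "c"]
print(keep_top_categories(values, maximum_num_cat=2))
```

Raising `maximum_num_cat` would retain more categories, at the cost of a wider range of codes for the model to handle.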


In summary, AutoGluon's feature engineering framework automates much of the preprocessing, but also provides customizable pipelines to meet specific needs, offering both ease of use and flexibility for machine learning tasks.