<a href="https://colab.research.google.com/github/ykalathiya-2/AutoGluon/blob/main/titanic_feature_engineering/titanic_feature_engineering.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


# AutoGluon Tabular: Feature Engineering on Kaggle Titanic

**Dataset:** [Kaggle — Titanic: Machine Learning from Disaster](https://www.kaggle.com/c/titanic)  


In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [2]:
!pip install -U pip
!pip install -U --quiet autogluon


In [3]:
!unzip /content/drive/MyDrive/AutoGluon_dataset/titanic.zip -d /content/

Archive:  /content/drive/MyDrive/AutoGluon_dataset/titanic.zip
  inflating: /content/gender_submission.csv  
  inflating: /content/test.csv       
  inflating: /content/train.csv      



## Load data

In [4]:

import pandas as pd

train_path = '/content/train.csv'
test_path  = '/content/test.csv'

train_df = pd.read_csv(train_path)
test_df  = pd.read_csv(test_path)

print(train_df.shape, test_df.shape)
display(train_df.head())
train_df.isna().mean().sort_values(ascending=False).head(10)


(891, 12) (418, 11)


| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S |


| Column | Fraction missing |
|---|---|
| Cabin | 0.771044 |
| Age | 0.198653 |
| Embarked | 0.002245 |
| PassengerId | 0.0 |
| Name | 0.0 |
| Pclass | 0.0 |
| Survived | 0.0 |
| Sex | 0.0 |
| Parch | 0.0 |
| SibSp | 0.0 |



## Manual feature engineering (domain-driven)

We'll create interpretable features often used in Titanic solutions:

- **Title** from `Name` (e.g., Mr, Mrs, Miss, Master).  
- **FamilySize** = `SibSp + Parch + 1`.  
- **IsAlone** = 1 if `FamilySize == 1` else 0.  
- **CabinDeck** = first letter of `Cabin`.  
- **CabinMultiple** = 1 if more than one cabin is listed, else 0.  
- **TicketGroupSize** = number of passengers sharing the same ticket.  
- **FarePerPerson** = `Fare / FamilySize`.  
- **AgeClass** = `Age * Pclass` (interaction).

We'll apply the same function to both train and test.
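
As a small worked example of the rules listed above, the sketch below applies them by hand to the first training row from the preview (`Braund, Mr. Owen Harris`), using only the standard library. The last line feeds the same regex an illustrative name whose title carries a leading article; that input is an assumption chosen to show an edge case, not a claim about how often it occurs.

```python
import re

def extract_title(name: str) -> str:
    """Return the text between the comma and the first period, e.g. 'Mr'."""
    m = re.search(r",\s*([^.]+)\.", name)
    return m.group(1).strip() if m else "Unknown"

# Values copied from the first row of the training preview.
row = {"Name": "Braund, Mr. Owen Harris", "SibSp": 1, "Parch": 0,
       "Fare": 7.25, "Age": 22.0, "Pclass": 3, "Cabin": None}

title = extract_title(row["Name"])
family_size = row["SibSp"] + row["Parch"] + 1   # passenger counts as 1
is_alone = int(family_size == 1)
fare_per_person = row["Fare"] / family_size
age_class = row["Age"] * row["Pclass"]

cabin = row["Cabin"] or "Unknown"               # missing cabin -> 'Unknown'
cabin_deck = "Unknown" if cabin == "Unknown" else cabin[0]
cabin_multiple = int(len(cabin.split()) > 1)

print(title, family_size, is_alone, fare_per_person, age_class,
      cabin_deck, cabin_multiple)
# Mr 2 0 3.625 66.0 Unknown 0

# Edge case (illustrative input): the capture keeps the leading article.
edge_title = extract_title("Rothes, the Countess. of (Lucy Noel)")
print(edge_title)  # the Countess
```

A capture such as `the Countess` would not match a mapping key like `'Countess'`, so a title of that shape would fall through to the rare-title rule instead of being mapped by name.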


In [5]:

import re
import numpy as np
import pandas as pd

def extract_title(name):
    # Extract title using a simple regex
    m = re.search(r',\s*([^\.]+)\.', name)
    return m.group(1).strip() if m else 'Unknown'

def add_engineered_features(df):
    out = df.copy()
    # Title
    out['Title'] = out['Name'].astype(str).apply(extract_title)
    # Simplify rare titles
    mapping = {
        'Mlle':'Miss', 'Ms':'Miss', 'Mme':'Mrs',
        'Lady':'Royalty', 'Countess':'Royalty', 'Sir':'Royalty', 'Don':'Royalty', 'Dona':'Royalty', 'Jonkheer':'Royalty',
        'Capt':'Officer', 'Col':'Officer', 'Dr':'Officer', 'Major':'Officer', 'Rev':'Officer'
    }
    out['Title'] = out['Title'].replace(mapping)
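    # Collapse titles seen fewer than 10 times in this frame into 'Rare'
    # (the counts are per frame, so train and test can group titles differently)
    title_counts = out['Title'].value_counts()
    rare = title_counts[title_counts < 10].index
    out['Title'] = out['Title'].replace({t:'Rare' for t in rare})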

    # Family size & isolation
    out['FamilySize'] = out.get('SibSp', 0) + out.get('Parch', 0) + 1
    out['IsAlone'] = (out['FamilySize'] == 1).astype(int)

    # Cabin features
    out['Cabin'] = out['Cabin'].fillna('Unknown')
    out['CabinDeck'] = out['Cabin'].astype(str).str[0].replace({'U':'Unknown'})
    out['CabinMultiple'] = out['Cabin'].astype(str).apply(lambda x: int(len(str(x).split())>1))

    # Ticket group size
    if 'Ticket' in out.columns:
        ticket_counts = out['Ticket'].value_counts()
        out['TicketGroupSize'] = out['Ticket'].map(ticket_counts)
    else:
        out['TicketGroupSize'] = 1

    # Fare per person
    if 'Fare' in out.columns:
        out['FarePerPerson'] = out['Fare'] / out['FamilySize'].replace(0,1)
    else:
        out['FarePerPerson'] = np.nan

    # Age*Class interaction (some rows may miss Age)
    if 'Age' in out.columns and 'Pclass' in out.columns:
        out['AgeClass'] = out['Age'] * out['Pclass']
    else:
        out['AgeClass'] = np.nan

    return out

train_eng = add_engineered_features(train_df)
test_eng  = add_engineered_features(test_df)

display(train_eng[[
    'Name','Title','FamilySize','IsAlone','Cabin','CabinDeck','CabinMultiple','Ticket','TicketGroupSize','Fare','FarePerPerson','Pclass','Age','AgeClass'
]].head())


| | Name | Title | FamilySize | IsAlone | Cabin | CabinDeck | CabinMultiple | Ticket | TicketGroupSize | Fare | FarePerPerson | Pclass | Age | AgeClass |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Braund, Mr. Owen Harris | Mr | 2 | 0 | Unknown | Unknown | 0 | A/5 21171 | 1 | 7.25 | 3.625 | 3 | 22.0 | 66.0 |
| 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | Mrs | 2 | 0 | C85 | C | 0 | PC 17599 | 1 | 71.2833 | 35.64165 | 1 | 38.0 | 38.0 |
| 2 | Heikkinen, Miss. Laina | Miss | 1 | 1 | Unknown | Unknown | 0 | STON/O2. 3101282 | 1 | 7.925 | 7.925 | 3 | 26.0 | 78.0 |
| 3 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | Mrs | 2 | 0 | C123 | C | 0 | 113803 | 2 | 53.1 | 26.55 | 1 | 35.0 | 35.0 |
| 4 | Allen, Mr. William Henry | Mr | 1 | 1 | Unknown | Unknown | 0 | 373450 | 1 | 8.05 | 8.05 | 3 | 35.0 | 105.0 |
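
Because `add_engineered_features` is called on train and test separately, count-based features such as `TicketGroupSize` are computed within each frame. One possible alternative, sketched here with the standard library and made-up ticket values, is to count over train and test together so a given ticket maps to the same group size in both. The helper name `group_sizes` and the toy lists are assumptions for illustration.

```python
from collections import Counter

# Toy ticket columns; the assignment of tickets to rows is made up.
train_tickets = ["113803", "113803", "A/5 21171"]
test_tickets = ["113803", "PC 17599"]

def group_sizes(tickets, counts):
    """Map each ticket to the number of passengers sharing it per `counts`."""
    return [counts[t] for t in tickets]

# Per-frame counts: what separate calls on train and test produce.
train_only = group_sizes(train_tickets, Counter(train_tickets))
test_only = group_sizes(test_tickets, Counter(test_tickets))

# Counts over the union of train and test tickets.
combined = Counter(train_tickets + test_tickets)
train_combined = group_sizes(train_tickets, combined)
test_combined = group_sizes(test_tickets, combined)

print(train_only, test_only)          # [2, 2, 1] [1, 1]
print(train_combined, test_combined)  # [3, 3, 1] [3, 1]
```

With pandas the same idea would be `pd.concat([train_df['Ticket'], test_df['Ticket']]).value_counts()` followed by `.map(...)` on each frame. The ticket column carries no label information, so pooling it does not expose `Survived` to the test rows.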



### Train AutoGluon with manual features


In [6]:
label = 'Survived'

In [24]:
from autogluon.tabular import TabularDataset, TabularPredictor

# Create TabularDataset objects
train_eng_ag = TabularDataset(train_eng)
test_eng_ag  = TabularDataset(test_eng)

In [21]:
from autogluon.features.generators import PipelineFeatureGenerator, FillNaFeatureGenerator

minimal_gen = PipelineFeatureGenerator(
    generators=[
        FillNaFeatureGenerator()
    ]
)

predictor_manual = TabularPredictor(label=label).fit(
    train_eng_ag,
    feature_generator=minimal_gen,
)


No path specified. Models will be saved in: "AutogluonModels/ag-20251015_231451"
Verbosity: 2 (Standard Logging)
AutoGluon Version:  1.4.0
Python Version:     3.12.12
Operating System:   Linux
Platform Machine:   x86_64
Platform Version:   #1 SMP Thu Oct  2 10:42:05 UTC 2025
CPU Count:          8
Memory Avail:       48.22 GB / 50.99 GB (94.6%)
Disk Space Avail:   189.90 GB / 235.68 GB (80.6%)
No presets specified! To achieve strong results with AutoGluon, it is recommended to use the available presets. Defaulting to `'medium'`...
	Recommended Presets (For more details refer to https://auto.gluon.ai/stable/tutorials/tabular/tabular-essentials.html#presets):
	presets='extreme' : New in v1.4: Massively better than 'best' on datasets <30000 samples by using new models meta-learned on https://tabarena.ai: TabPFNv2, TabICL, Mitra, and TabM. Absolute best accuracy. Requires a GPU. Recommended 64 GB CPU memory and 32+ GB GPU memory.
	presets='best'    : Maximize accuracy. Recommended for most 

In [22]:
leaderboard_manual = predictor_manual.leaderboard(train_eng_ag, silent=True)
leaderboard_manual

| | model | score_test | score_val | eval_metric | pred_time_test | pred_time_val | fit_time | pred_time_test_marginal | pred_time_val_marginal | fit_time_marginal | stack_level | can_infer | fit_order |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | RandomForestEntr | 0.962963 | 0.815642 | accuracy | 0.098886 | 0.0793 | 0.702761 | 0.098886 | 0.0793 | 0.702761 | 1 | True | 4 |
| 1 | RandomForestGini | 0.960718 | 0.804469 | accuracy | 0.100253 | 0.069435 | 0.720247 | 0.100253 | 0.069435 | 0.720247 | 1 | True | 3 |
| 2 | ExtraTreesEntr | 0.956229 | 0.782123 | accuracy | 0.102596 | 0.076503 | 0.6848 | 0.102596 | 0.076503 | 0.6848 | 1 | True | 7 |
| 3 | ExtraTreesGini | 0.956229 | 0.782123 | accuracy | 0.104453 | 0.068881 | 0.667079 | 0.104453 | 0.068881 | 0.667079 | 1 | True | 6 |
| 4 | CatBoost | 0.948373 | 0.832402 | accuracy | 0.00737 | 0.002013 | 1.092574 | 0.00737 | 0.002013 | 1.092574 | 1 | True | 5 |
| 5 | XGBoost | 0.948373 | 0.826816 | accuracy | 0.022173 | 0.003278 | 0.268842 | 0.022173 | 0.003278 | 0.268842 | 1 | True | 9 |
| 6 | LightGBM | 0.94725 | 0.826816 | accuracy | 0.0065 | 0.00221 | 0.455924 | 0.0065 | 0.00221 | 0.455924 | 1 | True | 2 |
| 7 | LightGBMLarge | 0.930415 | 0.821229 | accuracy | 0.003214 | 0.001431 | 0.781129 | 0.003214 | 0.001431 | 0.781129 | 1 | True | 11 |
| 8 | NeuralNetTorch | 0.860831 | 0.804469 | accuracy | 0.023582 | 0.009903 | 2.328692 | 0.023582 | 0.009903 | 2.328692 | 1 | True | 10 |
| 9 | NeuralNetFastAI | 0.857464 | 0.826816 | accuracy | 0.021312 | 0.007411 | 0.707917 | 0.021312 | 0.007411 | 0.707917 | 1 | True | 8 |
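
The leaderboard above is scored on `train_eng_ag`, the same rows used for fitting, so `score_test` there is a training-set accuracy and `score_val` is the more conservative of the two columns. A minimal sketch of one way to get a held-out estimate is to set aside row positions before fitting. The helper name `holdout_positions`, the 20% fraction, and the seed are assumptions, not part of the original run.

```python
import random

def holdout_positions(n_rows, holdout_frac=0.2, seed=0):
    """Return (fit_positions, holdout_positions) as disjoint sorted lists."""
    positions = list(range(n_rows))
    random.Random(seed).shuffle(positions)      # reproducible shuffle
    n_holdout = int(round(n_rows * holdout_frac))
    holdout = sorted(positions[:n_holdout])
    fit = sorted(positions[n_holdout:])
    return fit, holdout

# 891 is the number of training rows reported by train_df.shape.
fit_pos, holdout_pos = holdout_positions(891, holdout_frac=0.2, seed=0)
print(len(fit_pos), len(holdout_pos))  # 713 178
```

The two lists could then be used as `train_eng.iloc[fit_pos]` for `fit(...)` and `train_eng.iloc[holdout_pos]` for `leaderboard(...)`, so that `score_test` reflects rows the models never saw.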



## Controlling FE with `FeatureGenerator`

AutoGluon exposes feature generators to customize transforms.  
Below, we configure a pipeline to:
- Encode categoricals,
- Expand datetimes (none here, but left as example),
- Extract text n‑grams from any text fields (e.g., `Name`, `Cabin`, `Ticket`),
- Impute missing values and drop near-constant columns.
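
To make the n-gram bullet concrete, here is a standard-library sketch of what counting word n-grams on `Name` strings looks like. It only illustrates the idea; the tokenization rule, the minimum-count threshold, and the helper name `word_ngrams` are assumptions and do not reproduce AutoGluon's internal vectorizer.

```python
import re
from collections import Counter

# Three names taken from the training preview.
names = [
    "Braund, Mr. Owen Harris",
    "Heikkinen, Miss. Laina",
    "Allen, Mr. William Henry",
]

def word_ngrams(text, n):
    """Lowercase, keep runs of letters as tokens, return word n-grams."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def all_ngrams(text):
    """Unigrams and bigrams of one string."""
    return word_ngrams(text, 1) + word_ngrams(text, 2)

# Count every unigram and bigram across the sample.
counts = Counter()
for name in names:
    counts.update(all_ngrams(name))

# Keep n-grams that occur at least twice overall; each becomes a count column.
vocab = sorted(g for g, c in counts.items() if c >= 2)
features = [[all_ngrams(name).count(g) for g in vocab] for name in names]

print(vocab)     # ['mr']
print(features)  # [[1], [0], [1]]
```

On this tiny sample only `mr` repeats, which hints at why n-grams over `Name` can recover much of the signal that the hand-built `Title` feature captures.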


In [28]:
# from autogluon.common.features.types import text
from autogluon.tabular import TabularDataset, TabularPredictor
# from autogluon.common.features.feature_metadata import FeatureMetadata
from autogluon.features.generators import AutoMLPipelineFeatureGenerator

custom_gen = AutoMLPipelineFeatureGenerator(enable_numeric_features=True,
                                            enable_categorical_features=True,
                                            enable_datetime_features=True,
                                            enable_text_special_features=True,
                                            enable_text_ngram_features=True,
                                            enable_raw_text_features=False,
                                            enable_vision_features=True
                                            )

predictor_custom = TabularPredictor(label=label, eval_metric='accuracy', path='ag_models__custom').fit(
    TabularDataset(train_df),
    feature_generator=custom_gen,
    presets='medium_quality',
    time_limit=600
)

leaderboard_custom = predictor_custom.leaderboard(train_df, silent=True)
leaderboard_custom


Verbosity: 2 (Standard Logging)
AutoGluon Version:  1.4.0
Python Version:     3.12.12
Operating System:   Linux
Platform Machine:   x86_64
Platform Version:   #1 SMP Thu Oct  2 10:42:05 UTC 2025
CPU Count:          8
Memory Avail:       48.19 GB / 50.99 GB (94.5%)
Disk Space Avail:   189.83 GB / 235.68 GB (80.5%)
Presets specified: ['medium_quality']
Using hyperparameters preset: hyperparameters='default'
Beginning AutoGluon training ... Time limit = 600s
AutoGluon will save models to "/content/ag_models__custom"
Train Data Rows:    891
Train Data Columns: 11
Label Column:       Survived
AutoGluon infers your prediction problem is: 'binary' (because only two unique label-values observed).
	2 unique label values:  [np.int64(0), np.int64(1)]
	If 'binary' is not the correct problem_type, please manually specify the problem_type parameter during Predictor init (You may specify problem_type as one of: ['binary', 'multiclass', 'regression', 'quantile'])
Problem Type:       binary
Preprocessi

| | model | score_test | score_val | eval_metric | pred_time_test | pred_time_val | fit_time | pred_time_test_marginal | pred_time_val_marginal | fit_time_marginal | stack_level | can_infer | fit_order |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | RandomForestEntr | 0.962963 | 0.815642 | accuracy | 0.099764 | 0.072452 | 0.700063 | 0.099764 | 0.072452 | 0.700063 | 1 | True | 4 |
| 1 | RandomForestGini | 0.962963 | 0.815642 | accuracy | 0.10057 | 0.072381 | 0.669903 | 0.10057 | 0.072381 | 0.669903 | 1 | True | 3 |
| 2 | ExtraTreesGini | 0.962963 | 0.815642 | accuracy | 0.10302 | 0.071636 | 0.724549 | 0.10302 | 0.071636 | 0.724549 | 1 | True | 6 |
| 3 | ExtraTreesEntr | 0.961841 | 0.810056 | accuracy | 0.100659 | 0.069512 | 0.65807 | 0.100659 | 0.069512 | 0.65807 | 1 | True | 7 |
| 4 | LightGBMLarge | 0.928171 | 0.815642 | accuracy | 0.004868 | 0.004167 | 0.846711 | 0.004868 | 0.004167 | 0.846711 | 1 | True | 11 |
| 5 | LightGBM | 0.905724 | 0.821229 | accuracy | 0.005553 | 0.003165 | 0.437307 | 0.005553 | 0.003165 | 0.437307 | 1 | True | 2 |
| 6 | WeightedEnsemble_L2 | 0.903479 | 0.871508 | accuracy | 0.028933 | 0.018443 | 4.368245 | 0.001866 | 0.000732 | 0.080283 | 2 | True | 12 |
| 7 | NeuralNetTorch | 0.896745 | 0.849162 | accuracy | 0.019938 | 0.013376 | 3.305662 | 0.019938 | 0.013376 | 3.305662 | 1 | True | 10 |
| 8 | NeuralNetFastAI | 0.892256 | 0.826816 | accuracy | 0.025164 | 0.010485 | 0.811386 | 0.025164 | 0.010485 | 0.811386 | 1 | True | 8 |
| 9 | XGBoost | 0.87991 | 0.815642 | accuracy | 0.018471 | 0.005619 | 0.292594 | 0.018471 | 0.005619 | 0.292594 | 1 | True | 9 |



## Compare results & interpret

Let's line up the leaderboards and compute validation scores.  
We'll also check **per‑feature importance** from our best model.


In [29]:

def tidy_lb(lb, tag):
    x = lb.copy()
    x['setup'] = tag
    return x[['model','score_val','fit_time','pred_time_val','setup']]

lb_all = pd.concat([
    tidy_lb(leaderboard_manual, 'manual_fe'),
    tidy_lb(leaderboard_custom, 'custom_generator'),
], ignore_index=True)

lb_all.sort_values(['score_val','fit_time'], ascending=[False, True]).head(20)


| | model | score_val | fit_time | pred_time_val | setup |
|---|---|---|---|---|---|
| 18 | WeightedEnsemble_L2 | 0.871508 | 4.368245 | 0.018443 | custom_generator |
| 19 | NeuralNetTorch | 0.849162 | 3.305662 | 0.013376 | custom_generator |
| 10 | LightGBMXT | 0.832402 | 0.391335 | 0.001798 | manual_fe |
| 11 | WeightedEnsemble_L2 | 0.832402 | 0.472513 | 0.002528 | manual_fe |
| 4 | CatBoost | 0.832402 | 1.092574 | 0.002013 | manual_fe |
| 5 | XGBoost | 0.826816 | 0.268842 | 0.003278 | manual_fe |
| 6 | LightGBM | 0.826816 | 0.455924 | 0.00221 | manual_fe |
| 9 | NeuralNetFastAI | 0.826816 | 0.707917 | 0.007411 | manual_fe |
| 20 | NeuralNetFastAI | 0.826816 | 0.811386 | 0.010485 | custom_generator |
| 22 | CatBoost | 0.826816 | 0.982301 | 0.004335 | custom_generator |


In [30]:

# Pick the better of the two setups based on validation score
best_setup = lb_all.sort_values('score_val', ascending=False).iloc[0]['setup']
best_predictor = {'manual_fe': predictor_manual, 'custom_generator': predictor_custom}[best_setup]
print("Best setup:", best_setup)

# Permutation importance is computed on the rows passed in (here, training rows).
# Columns the winning predictor was not trained on are ignored.
fi = best_predictor.feature_importance(train_eng)
fi.head(20)


These features in provided data are not utilized by the predictor and will be ignored: ['Title', 'FamilySize', 'IsAlone', 'CabinDeck', 'CabinMultiple', 'TicketGroupSize', 'FarePerPerson', 'AgeClass']
Computing feature importance via permutation shuffling for 11 features using 891 rows with 5 shuffle sets...
	3.57s	= Expected runtime (0.71s per shuffle set)


Best setup: custom_generator


	1.45s	= Actual runtime (Completed 5 of 5 shuffle sets)


| | importance | stddev | p_value | n | p99_high | p99_low |
|---|---|---|---|---|---|---|
| Sex | 0.088664 | 0.007142 | 5e-06 | 5 | 0.103371 | 0.073958 |
| Name | 0.063749 | 0.009893 | 6.7e-05 | 5 | 0.084119 | 0.043379 |
| Ticket | 0.050505 | 0.005082 | 1.2e-05 | 5 | 0.060968 | 0.040042 |
| Pclass | 0.036139 | 0.004376 | 2.5e-05 | 5 | 0.045149 | 0.02713 |
| Parch | 0.026936 | 0.004827 | 0.000119 | 5 | 0.036876 | 0.016996 |
| SibSp | 0.023569 | 0.001775 | 4e-06 | 5 | 0.027223 | 0.019915 |
| Age | 0.017508 | 0.006023 | 0.001445 | 5 | 0.02991 | 0.005107 |
| Embarked | 0.014366 | 0.004078 | 0.000702 | 5 | 0.022762 | 0.00597 |
| Cabin | 0.012346 | 0.003272 | 0.000541 | 5 | 0.019083 | 0.005608 |
| Fare | 0.007632 | 0.003404 | 0.003711 | 5 | 0.014641 | 0.000623 |
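
The log above reports that importance was computed "via permutation shuffling" with 5 shuffle sets. The sketch below shows the general idea on toy data: shuffle one column, measure how much accuracy drops, and average over several shuffles. The toy rows, the stand-in `toy_model`, and the function names are assumptions for illustration; this is not AutoGluon's implementation.

```python
import random

def accuracy(model, rows, labels):
    """Fraction of rows where the model's prediction equals the label."""
    return sum(model(r) == y for r, y in zip(rows, labels)) / len(rows)

def permutation_importance(model, rows, labels, feature, n_shuffles=5, seed=0):
    """Mean drop in accuracy after shuffling one feature's values across rows."""
    rng = random.Random(seed)
    baseline = accuracy(model, rows, labels)
    drops = []
    for _ in range(n_shuffles):
        values = [r[feature] for r in rows]
        rng.shuffle(values)
        # Rebuild the rows with only the chosen feature permuted.
        shuffled = [{**r, feature: v} for r, v in zip(rows, values)]
        drops.append(baseline - accuracy(model, shuffled, labels))
    return sum(drops) / n_shuffles

# Toy data: the label depends only on Sex; Noise is irrelevant by construction.
data_rng = random.Random(42)
rows = [{"Sex": data_rng.choice(["male", "female"]), "Noise": data_rng.random()}
        for _ in range(200)]
labels = [int(r["Sex"] == "female") for r in rows]

def toy_model(row):
    """Stand-in for a fitted predictor: predicts 1 for female passengers."""
    return int(row["Sex"] == "female")

imp_sex = permutation_importance(toy_model, rows, labels, "Sex")
imp_noise = permutation_importance(toy_model, rows, labels, "Noise")
print("Sex:", imp_sex, "Noise:", imp_noise)
```

A feature the model never uses gets an importance of exactly zero, while a feature the model relies on shows a clear drop. When importance is measured on training rows, as in the cell above, the values describe what the fitted models depend on rather than how well those dependencies generalize.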



## Train on full data & create Kaggle submission

We'll refit using the **best configuration** on the entire training set (no validation split), then predict on the test set and write `submission.csv` with columns `PassengerId,Survived`.
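
The refit step itself is not among the cells shown here, although the `_FULL` model names in the fit summary suggest it was run. A minimal sketch of what that step could look like follows, assuming `TabularPredictor.refit_full()`, which retrains models on the training and validation rows combined. The helper name `refit_and_predict` is hypothetical, and it is written against any object exposing `refit_full()` and `predict()`.

```python
def refit_and_predict(predictor, test_data):
    """Refit on all training rows, then predict with the predictor's best model.

    `predictor` is expected to behave like autogluon.tabular.TabularPredictor;
    refit_full() is assumed to return a dict that maps each original model
    name to the name of its refit counterpart.
    """
    refit_map = predictor.refit_full()
    predictions = predictor.predict(test_data)
    return refit_map, predictions

# Intended use in this notebook (not executed in this sketch):
# refit_map, pred_test = refit_and_predict(best_predictor, test_df)
```

Refit models have no validation split of their own, so their quality has to be judged from the validation scores of the models they were refit from.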


In [33]:
best_predictor.fit_summary()

*** Summary of fit() ***
Estimated performance of each model:
                       model  score_val eval_metric  pred_time_val  fit_time  pred_time_val_marginal  fit_time_marginal  stack_level  can_infer  fit_order
0        WeightedEnsemble_L2   0.871508    accuracy       0.018443  4.368245                0.000732           0.080283            2       True         12
1             NeuralNetTorch   0.849162    accuracy       0.013376  3.305662                0.013376           3.305662            1       True         10
2                   CatBoost   0.826816    accuracy       0.004335  0.982301                0.004335           0.982301            1       True          5
3            NeuralNetFastAI   0.826816    accuracy       0.010485  0.811386                0.010485           0.811386            1       True          8
4                   LightGBM   0.821229    accuracy       0.003165  0.437307                0.003165           0.437307            1       True          2
5       

{'model_types': {'LightGBMXT': 'LGBModel',
  'LightGBM': 'LGBModel',
  'RandomForestGini': 'RFModel',
  'RandomForestEntr': 'RFModel',
  'CatBoost': 'CatBoostModel',
  'ExtraTreesGini': 'XTModel',
  'ExtraTreesEntr': 'XTModel',
  'NeuralNetFastAI': 'NNFastAiTabularModel',
  'XGBoost': 'XGBoostModel',
  'NeuralNetTorch': 'TabularNeuralNetTorchModel',
  'LightGBMLarge': 'LGBModel',
  'WeightedEnsemble_L2': 'WeightedEnsembleModel',
  'LightGBMXT_FULL': 'LGBModel',
  'LightGBM_FULL': 'LGBModel',
  'RandomForestGini_FULL': 'RFModel',
  'RandomForestEntr_FULL': 'RFModel',
  'CatBoost_FULL': 'CatBoostModel',
  'ExtraTreesGini_FULL': 'XTModel',
  'ExtraTreesEntr_FULL': 'XTModel',
  'NeuralNetFastAI_FULL': 'NNFastAiTabularModel',
  'XGBoost_FULL': 'XGBoostModel',
  'NeuralNetTorch_FULL': 'TabularNeuralNetTorchModel',
  'LightGBMLarge_FULL': 'LGBModel',
  'WeightedEnsemble_L2_FULL': 'WeightedEnsembleModel'},
 'model_performance': {'LightGBMXT': 0.8156424581005587,
  'LightGBM': 0.821229050279329

In [35]:
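# In the run shown, the best predictor was trained on the raw columns, so the
# engineered columns in test_eng are not used at prediction time.
pred_test = best_predictor.predict(test_eng)

# Write submission
sub = pd.DataFrame({
    'PassengerId': test_eng['PassengerId'],
    'Survived': pred_test
})
sub.to_csv('submission.csv', index=False)
# Note: model_names()[0] is the first model in the predictor's list, which is not
# necessarily the model predict() used; best_predictor.model_best reports that one.
print("Wrote submission.csv with model:", best_predictor.model_names()[0])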


Wrote submission.csv with model: LightGBMXT
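
Before uploading, it can help to sanity-check the file against the format stated above (columns `PassengerId,Survived`, one row per test passenger). The following is a standard-library sketch; the function name `check_submission` and the specific checks are assumptions, not Kaggle's own validator.

```python
import csv

def check_submission(path, expected_rows):
    """Raise ValueError unless the file has exactly the two expected columns,
    the expected row count, unique ids, and only 0/1 predictions."""
    with open(path, newline="") as f:
        reader = csv.DictReader(f)
        if reader.fieldnames != ["PassengerId", "Survived"]:
            raise ValueError(f"unexpected columns: {reader.fieldnames}")
        rows = list(reader)
    if len(rows) != expected_rows:
        raise ValueError(f"expected {expected_rows} rows, found {len(rows)}")
    ids = [r["PassengerId"] for r in rows]
    if len(set(ids)) != len(ids):
        raise ValueError("duplicate PassengerId values")
    bad = {r["Survived"] for r in rows} - {"0", "1"}
    if bad:
        raise ValueError(f"non-binary predictions: {sorted(bad)}")
    return len(rows)

# In this notebook, test_df has 418 rows:
# check_submission("submission.csv", expected_rows=418)
```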
