# Testing ML Feasibility with AutoGluon: A Hands-On Tutorial

**MIDAS AI in Research Handbook — Chapter 12**

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/xiaosuhu/midas-ai-in-research/blob/v1.0-dev/docs/notebooks/autogluon_tabular_demo.ipynb)

---

This notebook walks you through a complete ML feasibility test using AutoGluon on a tabular dataset. By the end, you will have trained a model, inspected which algorithms performed best, and identified which features matter most, all in a few dozen lines of code.

**Dataset:** A 500-row sample based on the California Housing dataset, where each row represents a census block group. The task is to predict the median house value (`MedHouseVal`) from neighborhood characteristics.

**What you need:** A Google account to run this in Colab. No local installation required.

## Step 1 — Install AutoGluon

This step only needs to run once per Colab session. It will take about 2-3 minutes.

In [None]:
!pip install autogluon.tabular -q

## Step 2 — Load the Data

We load the dataset directly from the GitHub repo, so there is nothing to upload or download manually.

In [None]:
import pandas as pd

DATA_URL = "https://raw.githubusercontent.com/xiaosuhu/midas-ai-in-research/v1.0-dev/docs/data/ca_housing_sample.csv"

df = pd.read_csv(DATA_URL)
print(f"Dataset shape: {df.shape}")
df.head()

## Step 3 — Quick Look at the Data

Before modeling anything, it is worth spending a minute understanding what we have. Each row is a census block group in California. The columns are:

| Column | Description |
|---|---|
| `MedInc` | Median income (tens of thousands of USD) |
| `HouseAge` | Median age of houses in the block group (years) |
| `AveRooms` | Average number of rooms per household |
| `AveBedrms` | Average number of bedrooms per household |
| `Population` | Block group population |
| `AveOccup` | Average number of occupants per household |
| `Latitude` | Block group latitude |
| `Longitude` | Block group longitude |
| `MedHouseVal` | **Target** — Median house value (hundreds of thousands of USD) |

In [None]:
df.describe()
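
Before going further, it can help to confirm that the loaded file really contains the columns listed in the table above. The following is a minimal sketch: `missing_columns` and `EXPECTED_COLUMNS` are hypothetical helper names introduced here, not part of pandas or AutoGluon, and the short header is a hand-written stand-in for `df.columns`.

```python
# Column names copied from the table in this step.
EXPECTED_COLUMNS = (
    "MedInc", "HouseAge", "AveRooms", "AveBedrms", "Population",
    "AveOccup", "Latitude", "Longitude", "MedHouseVal",
)


def missing_columns(actual_columns, expected=EXPECTED_COLUMNS):
    """Return the expected column names that are absent, in table order."""
    actual = set(actual_columns)
    return [name for name in expected if name not in actual]


# Hand-written stand-in for df.columns (an assumption for illustration only).
example_header = ["MedInc", "HouseAge", "Latitude", "Longitude", "MedHouseVal"]
print(missing_columns(example_header))
```

In the notebook itself, `missing_columns(df.columns)` should return an empty list if the file matches the table.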

## Step 4 — Split into Train and Test Sets

We hold out 20% of the data for final evaluation. AutoGluon handles its own internal validation during training, but this test set is reserved so we can assess performance on data the model has never seen.

In [None]:
from sklearn.model_selection import train_test_split

train_df, test_df = train_test_split(df, test_size=0.2, random_state=42)

print(f"Training rows: {len(train_df)}")
print(f"Test rows:     {len(test_df)}")
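
To see what a holdout split does under the hood, here is a dependency-free sketch of the same idea: shuffle the row indices with a fixed seed, then set aside 20% of them. `holdout_split` is a hypothetical helper written for this illustration. It will not select the same rows as scikit-learn's `train_test_split`, which uses a different random number generator.

```python
import random


def holdout_split(n_rows, test_size=0.2, seed=42):
    """Shuffle row indices and split them into (train, test) index lists."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)  # seeded, so the split is repeatable
    n_test = round(n_rows * test_size)
    return indices[n_test:], indices[:n_test]


# The dataset in this notebook has 500 rows.
train_idx, test_idx = holdout_split(500)
print(len(train_idx), len(test_idx))  # 400 100
```

Fixing the seed matters for a feasibility test: it lets you rerun the notebook and compare results against the same held-out rows.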

## Step 5 — Train with AutoGluon

This is the core step. We tell AutoGluon which column is our target, give it a time budget, and let it run. It will train and evaluate multiple model types automatically (gradient boosting, random forests, neural networks, and others), then combine them into a weighted ensemble. Multi-layer stacking is reserved for higher-quality presets such as `best_quality`.

Setting `time_limit=120` gives AutoGluon a training budget of roughly 2 minutes. For a feasibility test on a dataset this small, that is usually plenty.

In [None]:
from autogluon.tabular import TabularPredictor

predictor = TabularPredictor(
    label="MedHouseVal",
    eval_metric="rmse",
    path="autogluon_housing_model"
).fit(
    train_data=train_df,
    time_limit=120,
    presets="medium_quality"
)
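
The final ensembling step can be pictured as a weighted average of the individual models' predictions. The sketch below shows only that blending arithmetic, with made-up predictions and weights. It is a simplification and does not reproduce how AutoGluon chooses its ensemble weights; `weighted_ensemble` is a hypothetical helper name.

```python
def weighted_ensemble(predictions_by_model, weights):
    """Blend per-model predictions into one prediction per row."""
    if len(predictions_by_model) != len(weights):
        raise ValueError("need exactly one weight per model")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive number")
    n_rows = len(predictions_by_model[0])
    return [
        sum(w * preds[i] for w, preds in zip(weights, predictions_by_model)) / total
        for i in range(n_rows)
    ]


# Made-up predictions from two hypothetical models for three block groups.
model_a = [2.0, 1.0, 3.0]
model_b = [3.0, 1.0, 5.0]
blended = weighted_ensemble([model_a, model_b], weights=[0.75, 0.25])
print(blended)  # [2.25, 1.0, 3.5]
```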

## Step 6 — Inspect the Leaderboard

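AutoGluon trains many models and scores each one on its internal validation data (`score_val`). Because we pass `test_df`, the leaderboard also reports each model's score on the held-out test set (`score_test`). AutoGluon reports every metric so that higher is better, so RMSE values appear as negative numbers. The leaderboard is one of the most useful parts of the workflow: it shows you exactly what was tried and how each approach performed, rather than treating the process as a black box.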

In [None]:
leaderboard = predictor.leaderboard(test_df, silent=True)
leaderboard

## Step 7 — Evaluate on the Test Set

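Now we evaluate the best model on our held-out test set. Root Mean Squared Error (RMSE) summarizes how far our predictions typically are from the true house values, with large errors weighted more heavily than small ones. Since `MedHouseVal` is in hundreds of thousands of USD, an RMSE of 0.5 corresponds to a typical error of about $50,000. AutoGluon reports metrics so that higher is better, so the RMSE is printed as a negative number; read its absolute value.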

In [None]:
performance = predictor.evaluate(test_df)
print(performance)
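
To make the RMSE arithmetic concrete, here is a small dependency-free sketch. It computes RMSE by hand, converts it to dollars using the units stated for `MedHouseVal`, and compares it against a naive baseline that always predicts the training mean. All three function names are hypothetical helpers and the numbers are made-up toy values, not results from this dataset.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


def rmse_to_dollars(rmse_value):
    """Convert an RMSE in units of $100,000 to dollars.

    abs() undoes the sign flip AutoGluon applies to error metrics.
    """
    return abs(rmse_value) * 100_000


def mean_baseline_rmse(train_targets, test_targets):
    """RMSE of always predicting the training-set mean."""
    mean_value = sum(train_targets) / len(train_targets)
    return rmse(test_targets, [mean_value] * len(test_targets))


# Made-up toy values, in hundreds of thousands of USD.
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.5, 1.5, 3.5, 3.5]
print(rmse(y_true, y_pred))   # 0.5
print(rmse_to_dollars(-0.5))  # 50000.0
```

For a feasibility test, the comparison with the mean baseline is the useful part: if the model's RMSE is not clearly lower than `mean_baseline_rmse(...)` computed on the same test rows, the features are contributing little.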

## Step 8 — Feature Importance

For research purposes, knowing which features drive predictions is often just as important as the prediction itself. AutoGluon estimates feature importance by measuring how much model performance drops when each feature is randomly shuffled — a method known as permutation importance.

This can help you generate hypotheses, identify redundant variables, or flag potential data leakage.

In [None]:
importance = predictor.feature_importance(test_df)
importance
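
The idea behind permutation importance can be sketched in a few lines of plain Python: score the model, shuffle one feature column, score again, and record how much the error grew. This is a simplified illustration with a hypothetical toy model and made-up data. AutoGluon's own `feature_importance` is more elaborate (for example, it repeats the shuffle several times), so treat this as the concept only.

```python
import math
import random


def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


def permutation_importance(predict, rows, targets, n_features, seed=0):
    """Increase in RMSE after shuffling each feature column in turn."""
    rng = random.Random(seed)
    baseline = rmse(targets, [predict(row) for row in rows])
    importances = []
    for j in range(n_features):
        column = [row[j] for row in rows]
        rng.shuffle(column)  # break the link between feature j and the target
        shuffled_rows = [
            row[:j] + [value] + row[j + 1:] for row, value in zip(rows, column)
        ]
        shuffled_score = rmse(targets, [predict(row) for row in shuffled_rows])
        importances.append(shuffled_score - baseline)
    return importances


def toy_model(row):
    """Hypothetical model that uses feature 0 and ignores feature 1."""
    return 2.0 * row[0]


# Made-up data: the target depends only on feature 0.
rows = [[float(i), float(i % 3)] for i in range(20)]
targets = [2.0 * row[0] for row in rows]
print(permutation_importance(toy_model, rows, targets, n_features=2))
```

Shuffling the ignored feature leaves the error unchanged, so its importance is zero, while shuffling the feature the model relies on increases the error.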

Let's visualize the feature importance to make it easier to interpret.

In [None]:
import matplotlib.pyplot as plt

importance_sorted = importance["importance"].sort_values()

fig, ax = plt.subplots(figsize=(7, 5))
importance_sorted.plot(kind="barh", ax=ax, color="steelblue")
ax.set_xlabel("Permutation Importance")
ax.set_title("Feature Importance — AutoGluon Best Model")
ax.axvline(0, color="gray", linewidth=0.8)
plt.tight_layout()
plt.show()

---

## Hands-On Exercise

Now it is your turn. Try modifying the code above to answer these questions:

1. **Change the time limit** — what happens to the leaderboard if you give AutoGluon 5 minutes (`time_limit=300`) instead of 2?
2. **Change the preset** — try `presets="best_quality"` and compare RMSE to `medium_quality`. Does it improve, and if so, by how much?
3. **Drop a feature** — remove `Latitude` and `Longitude` from the training data using `train_df.drop(columns=["Latitude", "Longitude"])`. Does performance drop? What does that tell you about these features?

The cells below are yours to work in.

In [None]:
# Exercise 1: Change the time limit
# YOUR CODE HERE


In [None]:
# Exercise 2: Try a different preset
# YOUR CODE HERE


In [None]:
# Exercise 3: Drop geographic features and retrain
# YOUR CODE HERE


---

## What's Next?

This notebook covered the core tabular prediction workflow. AutoGluon also supports time series forecasting and multimodal data (combining text, images, and tables). See the [MIDAS AI in Research Handbook](https://midas-ai-in-research.readthedocs.io) for more, and the [AutoGluon documentation](https://auto.gluon.ai) for the full API reference.

**Citation:** AutoGluon was developed by Amazon. If you use it in your research, please cite:

> Erickson, N., Mueller, J., Shirkov, A., Zhang, H., Larroy, P., Li, M., & Smola, A. (2020). AutoGluon-Tabular: Robust and Accurate AutoML for Structured Data. *arXiv preprint arXiv:2003.06505*.