# Introduction to **TabPFN** and **TabICL**

[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/x-datascience-datacamp/datacamp-master/blob/main/11_tabular_foundational_models/01-tabpfn-tabicl.ipynb)

Authors: [Pedro L. C. Rodrigues](https://plcrodrigues.github.io) and [Thomas Moreau](https://tommoral.github.io)

- **TabPFN** : Hollmann et al. "Accurate predictions on small data with a tabular foundation model" (2025) [[link](https://www.nature.com/articles/s41586-024-08328-6)]
- **TabICL** : Qu et al. "TabICL: A Tabular Foundation Model for In-Context Learning on Large Data" (2025) [[link](https://arxiv.org/abs/2502.05564)]

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from tqdm import tqdm

%matplotlib inline

The Python implementation of TabPFN is developed by [Prior Labs](https://docs.priorlabs.ai/overview) and follows the same API as `scikit-learn`.

Note, however, that using the latest version of TabPFN's foundation model requires authenticating with Hugging Face, which can be a bit messy. Because of this, we will focus on TabPFN-V2, which should be more than enough.

⚡ GPU Recommended: For optimal performance, use a GPU (even older ones with ~8GB VRAM work well; 16GB needed for some large datasets). On CPU, only small datasets (≲1000 samples) are feasible. Note that **this notebook can be run on Colab with the badge at the top**.

First of all, you will need to install the package as shown below.

In [None]:
!pip install -U tabpfn
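Since a GPU is recommended, it can help to check which device is available before fitting anything. The snippet below is a minimal sketch, assuming PyTorch was installed as a dependency of `tabpfn`; the variable name `device` is only illustrative.

```python
# Detect whether a CUDA GPU is visible; fall back to CPU otherwise.
try:
    import torch

    device = "cuda" if torch.cuda.is_available() else "cpu"
except ImportError:
    # PyTorch is not installed in this environment
    device = "cpu"

print(f"Running on: {device}")
```

If the result is `cpu`, expect the larger experiments below to be slow or to run out of memory. Whether and how the estimators accept a device argument should be checked against the TabPFN documentation for the installed version.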

# Regression with TabPFN

We investigate how TabPFN can be used for regression and compare its performance with that of other classic regressors.

In [None]:
from sklearn.utils import shuffle
from sklearn.model_selection import KFold, cross_val_score
from sklearn.datasets import fetch_california_housing, load_diabetes

from tabpfn import TabPFNRegressor
from tabpfn.constants import ModelVersion

# Load the Boston housing dataset from the UCI repository
# (whitespace-separated file without a header row)
cols = ["CRIM","ZN","INDUS","CHAS","NOX","RM","AGE","DIS","RAD","TAX","PTRATIO","B","LSTAT","MEDV"]
df_boston = pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/housing/housing.data",
                        sep=r"\s+", header=None, names=cols)

print(df_boston.shape)
df_boston.head()

In [None]:
# Convert to pure numpy arrays
X, y = df_boston.drop(columns=["MEDV"]).values, df_boston["MEDV"].values

# Choose cross-validation strategy
cv = KFold(n_splits=5, shuffle=True, random_state=42)  # fixed seed so every model sees the same folds

# Instantiate TabPFN for regression
# regressor_tabpfn = TabPFNRegressor()
regressor_tabpfn = TabPFNRegressor.create_default_for_version(ModelVersion.V2)
regressor_tabpfn.n_estimators = 1

scores = []
for idx_train, idx_test in tqdm(cv.split(X, y)):
    X_train, y_train = X[idx_train], y[idx_train]
    X_test, y_test = X[idx_test], y[idx_test]
    regressor_tabpfn.fit(X_train, y_train)
    scores.append(regressor_tabpfn.score(X_test, y_test))
print(np.mean(scores))
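For scikit-learn style regressors, `score` returns the coefficient of determination $R^2$, so the value printed above should be read as an average $R^2$ over the folds (assuming `TabPFNRegressor` follows this convention, as its scikit-learn compatible API suggests). As a reminder of what this number measures, here is a small sketch on made-up values; `r2_score_manual` is a hypothetical helper, not part of any library.

```python
import numpy as np


def r2_score_manual(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)          # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
    return 1.0 - ss_res / ss_tot


y_true = np.array([3.0, 5.0, 7.0, 9.0])  # toy targets, not taken from the dataset
print(r2_score_manual(y_true, y_true))                      # perfect predictions -> 1.0
print(r2_score_manual(y_true, np.full(4, y_true.mean())))   # always predicting the mean -> 0.0
```

A score close to 1 means that the model explains most of the variance of the target, whereas 0 is what a constant prediction equal to the mean would obtain.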

<div class="alert alert-success">
    <b>EXERCISE</b>:
     <ul>
         <li>What is happening in the <span><code>fit</code></span> method?</li>
     </ul>
</div>

Let's now see how a `RandomForestRegressor` and a `HistGradientBoostingRegressor` behave.

In [None]:
from sklearn.ensemble import RandomForestRegressor, HistGradientBoostingRegressor
regressor_rf = RandomForestRegressor()
regressor_hgbr = HistGradientBoostingRegressor()
est_dict = {'rf':regressor_rf, 'hgbr':regressor_hgbr}
for key, value in est_dict.items():
    scores = cross_val_score(value, X, y, cv=cv)
    print(key, np.mean(scores))

We see that TabPFN beats both baselines by quite a margin. However, it took much more time...

Let's now consider a different dataset.
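To put a number on "much more time", one option is to wrap the fit and score calls with a timer. The helper below is a minimal sketch; `timed_fit_score` is a hypothetical name, and it works with any object exposing `fit` and `score`.

```python
import time


def timed_fit_score(estimator, X_train, y_train, X_test, y_test):
    """Fit the estimator, score it, and return (score, elapsed seconds)."""
    start = time.perf_counter()
    estimator.fit(X_train, y_train)
    score = estimator.score(X_test, y_test)
    elapsed = time.perf_counter() - start
    return score, elapsed


# Example usage with the objects defined in this notebook:
# score, seconds = timed_fit_score(regressor_tabpfn, X_train, y_train, X_test, y_test)
# print(f"R2 = {score:.3f} in {seconds:.1f} s")
```

Note that for TabPFN most of the time is likely spent in `score` (that is, at prediction time) rather than in `fit`, which is why the helper times both calls together.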

In [None]:
from sklearn.datasets import fetch_california_housing
df_california, targets = fetch_california_housing(return_X_y=True, as_frame=True)
print(df_california.shape)
df_california.head()

The dataset is bigger than the previous one, so let's see how TabPFN behaves.

In [None]:
from sklearn.model_selection import train_test_split  # skip cross-validation to save time
X, y = df_california.values, targets.values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
regressor_tabpfn.fit(X_train, y_train)
print(regressor_tabpfn.score(X_test, y_test))

<div class="alert alert-success">
    <b>EXERCISE</b>:
     <ul>
         <li>Why do you think we ran out of memory?</li>
     </ul>
</div>

One possible trick is to subsample the dataset and use an ensemble of TabPFN regressors as below.

In [None]:
regressor_tabpfn.ignore_pretraining_limits = True
regressor_tabpfn.n_estimators = 1
regressor_tabpfn.inference_config = {"SUBSAMPLE_SAMPLES": 500}
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
regressor_tabpfn.fit(X_train, y_train)
print(regressor_tabpfn.score(X_test, y_test))
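The idea of combining subsampling with an ensemble can also be written by hand. The sketch below is independent of how TabPFN handles `SUBSAMPLE_SAMPLES` internally: it fits several estimators on random row subsets and averages their predictions. The function name `subsample_ensemble_predict` and its parameters are hypothetical.

```python
import numpy as np


def subsample_ensemble_predict(make_estimator, X_train, y_train, X_test,
                               n_members=4, subsample=500, seed=0):
    """Average the predictions of estimators fitted on random row subsets."""
    rng = np.random.default_rng(seed)
    n_rows = min(subsample, len(X_train))
    predictions = []
    for _ in range(n_members):
        # Draw a random subset of training rows without replacement
        idx = rng.choice(len(X_train), size=n_rows, replace=False)
        estimator = make_estimator()  # fresh estimator for each member
        estimator.fit(X_train[idx], y_train[idx])
        predictions.append(estimator.predict(X_test))
    # Element-wise mean over the ensemble members
    return np.mean(predictions, axis=0)


# Example usage with the objects defined in this notebook:
# y_pred = subsample_ensemble_predict(
#     lambda: TabPFNRegressor.create_default_for_version(ModelVersion.V2),
#     X_train, y_train, X_test,
# )
```

Each member only sees `subsample` rows, which keeps memory bounded, while averaging over members recovers part of the information lost by discarding data.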

Or even better, use a GPU :-) 

[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/x-datascience-datacamp/datacamp-master/blob/main/11_tabular_foundational_models/01-tabpfn-tabicl.ipynb)


In [None]:
regressor_rf = RandomForestRegressor()
regressor_hgbr = HistGradientBoostingRegressor()
est_dict = {'rf':regressor_rf, 'hgbr':regressor_hgbr}
for key, est in est_dict.items():
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
    est.fit(X_train, y_train)
    score = est.score(X_test, y_test)
    print(key, score)

In the slides, we saw that **TabICL** can, in principle, scale to any number of samples, due to the way that rows and columns are embedded in its architecture. So should we try to use it?

In [None]:
!pip install -U tabicl # watch out for the python version!

Checking the **TabICL** [documentation](https://github.com/soda-inria/tabicl), we notice that it currently does not work for regression... :'(

At least for now... ;)

<div class="alert alert-success">
    <b>EXERCISE</b>:
     <ul>
         <li>Why can TabPFN do regression out-of-the-box whereas TabICL cannot?</li>
     </ul>
</div>

# Classification with TabPFN and TabICL

Let's switch to a classification problem, first with a small dataset and then with a big one.

In [None]:
from sklearn.datasets import fetch_openml
df, target = fetch_openml("titanic", version=1, as_frame=True, return_X_y=True)
X, y = df.values, target.values
df.head()

We saw in previous classes that we cannot simply plug the Titanic dataset into standard scikit-learn estimators. First, it is necessary to pre-process the data, encode categorical features, etc. But what happens in TabPFN?

In [None]:
from sklearn.model_selection import StratifiedKFold
cv = StratifiedKFold(shuffle=True, n_splits=5)

# Instantiate TabPFN for classification
from tabpfn import TabPFNClassifier
clf_tabpfn = TabPFNClassifier.create_default_for_version(ModelVersion.V2)
clf_tabpfn.n_estimators = 1

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
clf_tabpfn.fit(X_train, y_train)
print(clf_tabpfn.score(X_test, y_test))

What about TabICL?

In [None]:
from tabicl import TabICLClassifier
clf_icl = TabICLClassifier(n_estimators=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
clf_icl.fit(X_train, y_train)
print(clf_icl.score(X_test, y_test))

The documentation can help us: https://github.com/soda-inria/tabicl?tab=readme-ov-file#basic-integration

In [None]:
from skrub import TableVectorizer
from tabicl import TabICLClassifier
from sklearn.pipeline import make_pipeline

pipeline = make_pipeline(
    TableVectorizer(),  # automatically handles various data types
    TabICLClassifier(n_estimators=1)  # beware of the default parameters!
)

X_train, X_test, y_train, y_test = train_test_split(df, target, test_size=0.33, random_state=42) # note that we pass the dataframe!
pipeline.fit(X_train, y_train)
print(pipeline.score(X_test, y_test))
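To make "encoding categorical features" concrete, here is a small sketch of the simplest option, which maps each category to an integer code. The column below is made up for illustration, and this is only one encoding among several (scikit-learn provides `OrdinalEncoder` and `OneHotEncoder`); it is not necessarily what `TableVectorizer` does.

```python
import numpy as np

# Toy categorical column (not the real Titanic data)
sex = np.array(["female", "male", "male", "female"])

# np.unique returns the sorted categories and, for each row,
# the index of its category in that sorted list
categories, codes = np.unique(sex, return_inverse=True)

print(categories)  # ['female' 'male']
print(codes)       # [0 1 1 0]
```

Integer codes impose an arbitrary order on the categories, which tree-based models tolerate well but which can mislead models that treat inputs as continuous quantities.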

<div class="alert alert-success">
    <b>EXERCISE</b>:
     <ul>
         <li>Why can TabPFN preprocess categorical features directly while TabICL needs a pipeline?</li>
     </ul>
</div>

In [None]:
from skrub import tabular_pipeline
est = tabular_pipeline('classifier')
X_train, X_test, y_train, y_test = train_test_split(df, target, test_size=0.33, random_state=42) # note that we pass the dataframe!
est.fit(X_train, y_train)
print(est.score(X_test, y_test))

Let's now consider a larger dataset and see how **TabICL** behaves.

In [None]:
import pandas as pd
from pathlib import Path
from urllib.request import urlretrieve

DATA_DIR = Path().parent / "data"

url = ("https://archive.ics.uci.edu/ml/machine-learning-databases"
       "/adult/adult.data")

local_filename = DATA_DIR / Path(url).name
if not local_filename.exists():
    print("Downloading Adult Census dataset from UCI")
    DATA_DIR.mkdir(exist_ok=True)
    urlretrieve(url, local_filename)

names = ("age, workclass, fnlwgt, education, education-num, "
         "marital-status, occupation, relationship, race, sex, "
         "capital-gain, capital-loss, hours-per-week, "
         "native-country, income").split(', ') 
df = pd.read_csv(local_filename, names=names)
df = df.rename(columns={'income': 'class'})

columns_to_plot = [
    "age",
    "education-num",
    "capital-loss",
    "capital-gain",
    "hours-per-week",
    "class",
]
df = df[columns_to_plot]
print(df.shape)
df.head()

In [None]:
target_name = "class"
target = df[target_name]
X, y = df.iloc[:,:-1].values, target.values
y = (y == ' <=50K').astype(np.int8)
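Note the leading space in `' <=50K'`: the raw file separates fields with a comma followed by a space, so string values keep that space unless the reader is told to skip it. The sketch below illustrates this on two made-up rows that mimic the file layout, using the `skipinitialspace` option of `pd.read_csv`.

```python
import io

import pandas as pd

# Two made-up rows with the same ", " separator as the raw file
sample = "39, State-gov, <=50K\n52, Private, >50K\n"
names = ["age", "workclass", "class"]

raw = pd.read_csv(io.StringIO(sample), names=names)
clean = pd.read_csv(io.StringIO(sample), names=names, skipinitialspace=True)

print(repr(raw["class"][0]))    # ' <=50K' (leading space kept)
print(repr(clean["class"][0]))  # '<=50K'  (leading space removed)
```

With this option the comparison could be written against `'<=50K'` directly, which is less error-prone than remembering the extra space.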

In [None]:
clf_icl = TabICLClassifier(n_estimators=4)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
clf_icl.fit(X_train, y_train)
print(clf_icl.score(X_test, y_test))

In [None]:
from sklearn.ensemble import HistGradientBoostingClassifier

clf_hgbr = HistGradientBoostingClassifier()
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
clf_hgbr.fit(X_train, y_train)
print(clf_hgbr.score(X_test, y_test))
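Since `score` reports accuracy for classifiers and the two income classes are not equally frequent, both numbers above are easier to interpret next to a trivial baseline that always predicts the most frequent class. The helper below is a minimal sketch on toy labels; `majority_class_accuracy` is a hypothetical name (scikit-learn offers `DummyClassifier(strategy="most_frequent")` for the same purpose).

```python
import numpy as np


def majority_class_accuracy(y):
    """Accuracy obtained by always predicting the most frequent label."""
    _, counts = np.unique(y, return_counts=True)
    return counts.max() / counts.sum()


y_toy = np.array([1, 1, 1, 0])  # toy labels, not the Adult data
print(majority_class_accuracy(y_toy))  # 0.75

# Example usage with the objects defined in this notebook:
# print(majority_class_accuracy(y_test))
```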