### Example: Three TARTE post-training paradigms

In this example, we run TARTE on the `movies` data with three post-training paradigms:

- Fine-tuning on a specific task
- Table featurizer with a frozen backbone
- Boosting a complementary model

In [1]:
# Set the current working directory and import packages
import os
from pathlib import Path
os.chdir(Path().cwd().parent)

Let's first load the data and set the splits.

In [9]:
import numpy as np
from sklearn.model_selection import train_test_split
from tarte_ai import load_data

# Set basic specifications
data_name = "movies"    # Name of the dataset
num_train = 256         # Number of training samples
random_state = 42       # Seed for the train/test split

data, configs = load_data(data_name)
data.fillna(value=np.nan, inplace=True)  # Represent all missing values as NaN

target_name = configs["target_name"]
X = data.drop(columns=target_name)
y = data[target_name]
y = np.array(y)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=num_train, random_state=random_state
)

(1) To run fine-tuning, we simply use:

- `TARTE_TablePreprocessor` to convert the tables into suitable input for the TARTE transformer.
- `TARTEFinetuneRegressor` to run TARTE fine-tuning (similar to the `CARTERegressor` in the CARTE package).

In [3]:
from sklearn.metrics import r2_score
from tarte_ai import TARTE_TablePreprocessor, TARTEFinetuneRegressor

preprocessor = TARTE_TablePreprocessor()
X_train_tarte_ft = preprocessor.fit_transform(X_train, y_train)
X_test_tarte_ft = preprocessor.transform(X_test)

fixed_params = dict()
fixed_params["num_model"] = 10 # 10 models for the bagging strategy
fixed_params["disable_pbar"] = False # Set to True to hide the progress bars
fixed_params["random_state"] = 0
fixed_params["device"] = "cpu"
fixed_params["num_heads"] = 24
fixed_params["n_jobs"] = 10
fixed_params["num_layers"] = 1

estimator = TARTEFinetuneRegressor(**fixed_params)
estimator.fit(X=X_train_tarte_ft, y=y_train)

y_pred = estimator.predict(X_test_tarte_ft)
score = r2_score(y_test, y_pred)
print(f"\nThe R2 score for TARTE fine-tuning: {score:.4f}")


Model No. xx:  11%|█         | 56/500 [00:56<07:26,  1.01s/it]
Model No. xx:  12%|█▏        | 58/500 [00:56<07:13,  1.02it/s]
Model No. xx:  12%|█▏        | 59/500 [01:02<07:47,  1.06s/it]
Model No. xx:  12%|█▏        | 60/500 [01:03<07:47,  1.06s/it]
Model No. xx:  12%|█▏        | 61/500 [01:03<07:39,  1.05s/it]
Model No. xx:  12%|█▏        | 59/500 [01:04<08:02,  1.09s/it]
Model No. xx:  17%|█▋        | 83/500 [01:14<06:12,  1.12it/s]
Model No. xx:  17%|█▋        | 84/500 [01:20<06:40,  1.04it/s]
Model No. xx:  21%|██        | 104/500 [01:27<05:34,  1.19it/s]
Model No. xx:  23%|██▎       | 115/500 [01:35<05:20,  1.20it/s]



The R2 score for TARTE fine-tuning: 0.5811
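
All scores in this example are R2 (coefficient of determination) values computed with `sklearn.metrics.r2_score`. As a reminder of what the number measures, here is a minimal pure-Python sketch of the single-output formula, 1 - SS_res / SS_tot. The helper name `r2_by_hand` and the toy values are illustrative only and are not part of TARTE or scikit-learn.

```python
def r2_by_hand(y_true, y_pred):
    """Coefficient of determination for a single regression target.

    Illustrative helper (hypothetical name). Assumes y_true is not constant,
    otherwise the total sum of squares would be zero.
    """
    mean_true = sum(y_true) / len(y_true)
    # Residual sum of squares: error left after using the predictions
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # Total sum of squares: error of always predicting the mean of y_true
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Made-up targets, used only to show the two reference points of the metric
y_true_toy = [3.0, 5.0, 7.0, 9.0]
perfect = r2_by_hand(y_true_toy, y_true_toy)   # predictions equal the targets
mean_only = r2_by_hand(y_true_toy, [6.0] * 4)  # always predict the mean
```

A score of 1 means perfect predictions and a score of 0 means the model does no better than predicting the mean of the test targets, so the values reported here (around 0.58 to 0.61) sit between those two reference points.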


(2) To run TARTE featurizer, we simply use:

- `TARTE_TablePreprocessor` to convert the tables into suitable input for the TARTE transformer.
- `TARTE_TableEncoder` to encode the preprocessed tables into embeddings with the frozen TARTE backbone.
- A suitable estimator (`RidgeCV` in this case).

In [4]:
from tarte_ai import TARTE_TableEncoder
from sklearn.pipeline import Pipeline
from sklearn.linear_model import RidgeCV

tarte_tab_prepper = TARTE_TablePreprocessor()
tarte_tab_encoder = TARTE_TableEncoder(layer_index=2) # layer_index selects the layer from which embeddings are extracted

prep_pipe = Pipeline([("prep", tarte_tab_prepper), ("tabenc", tarte_tab_encoder)])

X_train_featurizer = prep_pipe.fit_transform(X_train, y_train)
X_test_featurizer = prep_pipe.transform(X_test)

estimator = RidgeCV()
estimator.fit(X=X_train_featurizer, y=y_train)
y_pred = estimator.predict(X_test_featurizer)
score = r2_score(y_test, y_pred)
print(f"\nThe R2 score for TARTE-Ridge:", "{:.4f}".format(score))



The R2 score for TARTE-Ridge: 0.5751


(3) To run TARTE boosting with a complementary model, we additionally need to choose a base model and prepare its input. In this example, we use TableVectorizer + TabPFNv2 as the base model.

- `TableVectorizer` to prepare the input of the base model.
- `TARTE_TablePreprocessor` to convert the tables into suitable input for the TARTE transformer.
- `TARTE_TableEncoder` to encode the preprocessed tables into TARTE embeddings.
- Form a list of inputs suitable for `TARTEBoostRegressor_TabPFN`.
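
The last step, pairing each row of the base-model features with the same row of the TARTE features, can be sketched on toy data as follows. This is only an illustration of the pairing and its ordering: the helper `pair_views` and the toy values are hypothetical, and plain Python lists stand in for the real feature arrays.

```python
# Toy stand-ins for the two feature views of the same three rows (made-up values)
base_view = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]                    # base-model features
tarte_view = [[1.0, 1.1, 1.2], [2.0, 2.1, 2.2], [3.0, 3.1, 3.2]]    # TARTE embeddings


def pair_views(first, second):
    """Pair row i of `first` with row i of `second` (hypothetical helper).

    The order of the arguments fixes the order inside each tuple, so the
    base-model features must be passed first and the TARTE features second.
    """
    if len(first) != len(second):
        raise ValueError("Both views must describe the same number of rows")
    return list(zip(first, second))


paired = pair_views(base_view, tarte_view)
```

Each element of `paired` is a tuple `(base_row, tarte_row)`. The two views may have different widths, but they must have the same number of rows in the same row order.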

In [None]:
from skrub import TableVectorizer

# Prepare tablevectorizer
tabvec = TableVectorizer()
X_train_tabvec = tabvec.fit_transform(X_train).to_numpy()
X_test_tabvec = tabvec.transform(X_test).to_numpy()

# Prepare input for TARTE-B Regressor
# TARTE features are already prepared from TARTE-Featurizer above
# It is very important to keep the order: TabVec first, then TARTE
X_train_boost = [(X_train_tabvec[i], X_train_featurizer[i]) for i in range(len(y_train))]
X_test_boost = [(X_test_tabvec[i], X_test_featurizer[i]) for i in range(len(y_test))]




To run TabPFN, it is advised to use a GPU (thus, we set `device="cuda"`).

In [None]:
from tarte_ai import TARTEBoostRegressor_TabPFN

params_boost = dict()
params_boost["model_names"] = ["tabpfn", "ridge"]
params_boost["fit_order"] = "fixed"
params_boost["device"] = "cuda" # For the TabPFN implementation

estimator = TARTEBoostRegressor_TabPFN(**params_boost)
estimator.fit(X=X_train_boost, y=y_train)
y_pred = estimator.predict(X_test_boost)
score = r2_score(y_test, y_pred)
print(f"\nThe R2 score for TARTE-B: {score:.4f}")



The R2 score for TARTE-B: 0.6133
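
For intuition about why combining two models on two feature views can help, here is a minimal pure-Python sketch of two-stage residual fitting. It rests on an assumption: that "boosting" here means the second model is fit on the residuals left by the first. This example does not spell out what `TARTEBoostRegressor_TabPFN` does internally, so this is not its implementation. The helpers `fit_line` and `predict_line` and the toy data are hypothetical, with a one-feature least-squares line standing in for each model.

```python
def fit_line(x, y):
    """Ordinary least squares for y ~ slope * x + intercept (one feature)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    var_x = sum((xi - mean_x) ** 2 for xi in x)
    cov_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = cov_xy / var_x
    return slope, mean_y - slope * mean_x


def predict_line(model, x):
    slope, intercept = model
    return [slope * xi + intercept for xi in x]


# Toy data: the target depends on two different feature views (made-up values)
view_a = [0.0, 1.0, 2.0, 3.0]      # stands in for the base-model features
view_b = [1.0, -1.0, -1.0, 1.0]    # stands in for the complementary features
y_toy = [2.0 * a + 0.5 * b for a, b in zip(view_a, view_b)]

# Stage 1: fit the first model on its own view
stage1 = fit_line(view_a, y_toy)
pred1 = predict_line(stage1, view_a)

# Stage 2: fit the second model on what stage 1 could not explain
residuals = [yi - pi for yi, pi in zip(y_toy, pred1)]
stage2 = fit_line(view_b, residuals)
pred2 = predict_line(stage2, view_b)

# Final prediction: sum of both stages
boosted = [p1 + p2 for p1, p2 in zip(pred1, pred2)]

sse_stage1 = sum((yi - pi) ** 2 for yi, pi in zip(y_toy, pred1))
sse_boosted = sum((yi - pi) ** 2 for yi, pi in zip(y_toy, boosted))
```

On this toy data the second stage removes the error that the first view alone cannot explain, so `sse_boosted` is smaller than `sse_stage1`. Real gains depend on how complementary the two feature views are.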
