# TFM Tutorial Notebook 4: Final Assessment for Tabular Foundation Models for Time Series

In this tutorial, you will complete a coding exercise that applies TabPFN v2 to a crop yield prediction task. We use only its zero-shot prediction capability, without any fine-tuning. Make sure TabPFN is installed before you start. If it is not, install it with:

``!pip install tabpfn``

In this coding exercise, you are given a crop yield dataset from Kaggle. Please download it from my [Google Drive](https://drive.google.com/file/d/1pJPOnNOVOLXfyu8PYOwtmfWYtoyQEtMS/view?usp=sharing).

The target is **"crop yield"**, and the covariates are **soil humidity, air temperature, air humidity, pressure, wind speed, and wind gust**. Complete the following tasks:

1. For dataset sizes of 50, 100, 500, and 1000 rows, fit a TabPFN model with a test split ratio of 0.2 and compute the Root Mean Squared Error (RMSE) and R^2. Plot the metrics and discuss the results. Use all the covariates for this task, and also fit a couple of other ML methods for comparison.
2. To investigate how the covariates affect the target, run experiments with different sets of covariates at a fixed dataset size (e.g., 1000) and report the results (RMSE and R^2). Try to identify which covariate has the highest impact.
3. Show the results in either tables or plots.
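
For reference, the two metrics follow their standard definitions, which are also what scikit-learn's `root_mean_squared_error` and `r2_score` compute. The symbols are introduced here for illustration and do not appear elsewhere in this notebook: $y_i$ is the true yield of test sample $i$, $\hat{y}_i$ is the prediction, $\bar{y}$ is the mean of the true test values, and $n$ is the number of test samples.

$$
\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2},
\qquad
R^2 = 1 - \frac{\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}{\sum_{i=1}^{n}\left(y_i - \bar{y}\right)^2}
$$

A lower RMSE is better. R^2 is at most 1 and becomes negative when a model does worse than always predicting the mean $\bar{y}$.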


In [1]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import root_mean_squared_error, r2_score  # root_mean_squared_error requires scikit-learn >= 1.4
from tabpfn import TabPFNRegressor

After downloading the dataset, please specify the local path.

In [None]:
csv_path = "your_local_path"
df = pd.read_csv(csv_path)
df.drop(columns=['Unnamed: 0'], inplace=True)
print("Loaded CSV shape:", df.shape)
print("Columns:", df.columns.tolist())

Define the target and the features, fill in missing values, and create the train/test split.

In [None]:
target_col = "Crop Yield"
features = [
    'Soil humidity 1',
    'Air temperature (C)',
    'Air humidity (%)',
    'Pressure (KPa)',
    'Wind speed (Km/h)',
    'Wind gust (Km/h)'
]
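length = 1000  # Number of rows to use. If larger than 1000, you need a GPU.
bare_mini = 10  # Minimum number of rows before we warn about too little data.
horizon = 1  # Use the covariates at the current step to predict the target at the next step.
df_sub = df[features + [target_col]].iloc[:length].copy()
# Fill missing values: median for numeric columns, the string 'missing' otherwise.
for c in df_sub.columns:
    if df_sub[c].dtype.kind in 'iufc':
        df_sub[c] = df_sub[c].fillna(df_sub[c].median())
    else:
        df_sub[c] = df_sub[c].fillna('missing').astype(str)
if df_sub.shape[0] < bare_mini:
    print("Warning: very few rows; TabPFN works best on small-to-medium datasets but still needs enough examples.")
# Shift by `horizon`: the features at row t are paired (by position) with the target at row t + horizon.
X = df_sub[features].iloc[:-horizon]
y = df_sub[target_col].iloc[horizon:]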
X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.2,
    random_state=42
)
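
To make the `horizon` shift above concrete, here is a minimal sketch on a toy frame. The column names `x` and `target` and all values are made up for illustration and are not taken from the crop yield dataset. It shows that row *t* of the features ends up paired with the target at row *t + horizon*, and that the pairing is by position rather than by index label.

```python
import pandas as pd

# Toy data, purely illustrative (not from the crop yield dataset).
toy = pd.DataFrame({"x": [10, 20, 30, 40], "target": [1.0, 2.0, 3.0, 4.0]})
horizon = 1

X_toy = toy[["x"]].iloc[:-horizon]    # features at steps 0 .. n-2
y_toy = toy["target"].iloc[horizon:]  # targets at steps 1 .. n-1

# Pair rows by position: features at step t with the target at step t + 1.
pairs = list(zip(X_toy["x"].tolist(), y_toy.tolist()))
print(pairs)  # [(10, 2.0), (20, 3.0), (30, 4.0)]
```

The last feature row is dropped because it has no "next step" target, and the first target row is dropped because it has no preceding features.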

## TODO: Train/Test the TabPFN Model and Show the Results

In [None]:
model = TabPFNRegressor()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
print(
    "TabPFN RMSE:", root_mean_squared_error(y_test, y_pred),
    "R2:", r2_score(y_test, y_pred)
)

# Plot ground truth & prediction.
# Ground truth from test split
plt.plot(
    range(len(y_test)),
    y_test,
    label='Ground Truth (Test)',
)

# Predictions
plt.plot(
    range(len(y_pred)),
    y_pred,
    label='Predicted (TabPFN)',
    color='red',
)

plt.title("Crop Yield Prediction Using TabPFN")
plt.xlabel("Sample")
plt.ylabel("Yield (tons/ha)")
plt.grid(True)
plt.legend()
plt.tight_layout()
plt.show()
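
One possible way to organize tasks 1 and 2 is sketched below. This is a starting point under stated assumptions, not a reference solution: the helper names `evaluate`, `sweep_sizes`, and `drop_one_covariate`, the dictionary `baseline_factories`, and the choice of linear regression and random forest as comparison methods are all illustrative. The ablation removes one covariate at a time, which is only one of several ways to probe covariate impact. RMSE is computed as the square root of `mean_squared_error` so the sketch also runs on older scikit-learn versions.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split


def evaluate(model, X, y, test_size=0.2, seed=42):
    """Fit `model` on a train split and return (RMSE, R^2) on the test split."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=test_size, random_state=seed
    )
    model.fit(X_tr, y_tr)
    pred = model.predict(X_te)
    rmse = float(np.sqrt(mean_squared_error(y_te, pred)))
    return rmse, float(r2_score(y_te, pred))


def sweep_sizes(df, features, target_col, factories, sizes, horizon=1):
    """Task 1 sketch: one row of metrics per (dataset size, model)."""
    rows = []
    for n in sizes:
        sub = df.iloc[:n]
        X = sub[features].iloc[:-horizon]   # covariates at step t
        y = sub[target_col].iloc[horizon:]  # target at step t + horizon
        for name, make_model in factories.items():
            rmse, r2 = evaluate(make_model(), X, y)
            rows.append({"size": n, "model": name, "rmse": rmse, "r2": r2})
    return pd.DataFrame(rows)


def drop_one_covariate(df, features, target_col, make_model, horizon=1):
    """Task 2 sketch: metrics with all covariates, then with each one removed."""
    rows = []
    for dropped in [None] + list(features):
        kept = [f for f in features if f != dropped]
        X = df[kept].iloc[:-horizon]
        y = df[target_col].iloc[horizon:]
        rmse, r2 = evaluate(make_model(), X, y)
        rows.append({"dropped": dropped or "(none)", "rmse": rmse, "r2": r2})
    return pd.DataFrame(rows)


# Comparison models chosen for illustration; each value is a zero-argument
# callable that returns a fresh, unfitted regressor.
baseline_factories = {
    "LinearRegression": LinearRegression,
    "RandomForest": lambda: RandomForestRegressor(n_estimators=100, random_state=0),
}
```

With the variables defined earlier in this notebook, you could add TabPFN with `factories = {"TabPFN": TabPFNRegressor, **baseline_factories}`, call `sweep_sizes(df_sub, features, target_col, factories, sizes=[50, 100, 500, 1000])` for task 1, and call `drop_one_covariate(df_sub, features, target_col, TabPFNRegressor)` for task 2. Both return a DataFrame that can be printed as a table or plotted. In the ablation table, a covariate whose removal raises RMSE the most is a candidate for the highest-impact covariate, although correlated covariates can mask each other.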