# TabPFN

# Set up

⚙️ Step 1: Set your notebook to GPU

The next two cells take ~2 min, so start running them now while we talk! 👇👇

In [29]:
# get workshop code
import os
import sys
IN_COLAB = os.getenv("COLAB_RELEASE_TAG")
if IN_COLAB:
    !git clone https://github.com/rajaonsonella/crosstalk-uoft
    sys.path.append('./crosstalk-uoft')
else:
    sys.path.append('..')
!pip install -r crosstalk-uoft/requirements.txt
!pip install tabpfn

fatal: destination path 'crosstalk-uoft' already exists and is not an empty directory.


In [30]:
# Download data from google drive
import gdown
import os

file_ids = {'test' : '1Gyv_ldUTi0Ymy6wVMfruAO0UraCQ70CR',
            'train':'11S5p0QgP1X9rOFiIjNSLydLenJwm7hle'}

for name, file_id in file_ids.items():
    filename = f'crosstalk_{name}.parquet'
    if not os.path.exists(filename):
        gdown.download(id=file_id, output=filename, quiet=False)

We had a problem with too many requests on the shared data file. If the download above fails with an error, upload the file to your own Google Drive, mount the drive with the following cell, and then adjust the file paths, e.g. `/content/drive/My Drive/crosstalk_train.parquet`.


In [31]:
from google.colab import drive
drive.mount('/content/drive/')

Drive already mounted at /content/drive/; to attempt to forcibly remount, call drive.mount("/content/drive/", force_remount=True).
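
One way to handle the path adjustment is a small helper that prefers a local copy and falls back to the mounted Drive folder. This is only a sketch: `resolve_data_path` and its `search_dirs` parameter are hypothetical names, not part of the workshop code, and the Drive folder is assumed to be the one from the example path above.

```python
import os


def resolve_data_path(filename, search_dirs=(".", "/content/drive/My Drive")):
    """Return the first existing copy of `filename` among `search_dirs`.

    Hypothetical helper, not part of the workshop code.
    """
    for directory in search_dirs:
        candidate = os.path.join(directory, filename)
        if os.path.exists(candidate):
            return candidate
    raise FileNotFoundError(f"{filename} not found in any of: {list(search_dirs)}")
```

With this in place, `resolve_data_path("crosstalk_train.parquet")` could be passed wherever the notebook uses the bare filename.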


## 📈 Load the data

In [36]:
# package to efficiently read in data from parquet file
from pyarrow.parquet import ParquetFile
import pyarrow as pa
import pyarrow.parquet as pq

# packages to help us with data manipulation in tables
import pandas as pd
import numpy as np
import scipy
# progress bars, scikit-learn models, metrics, and utilities
from tqdm.auto import tqdm
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import roc_auc_score, average_precision_score
from sklearn.model_selection import train_test_split
from sklearn.cluster import KMeans
from sklearn.metrics import pairwise_distances_argmin_min
# TabPFN
import tabpfn
from tabpfn import TabPFNClassifier
from tabpfn.constants import ModelVersion
import more_itertools
import dataset

This is a large file, so we won't load the whole thing into memory all at once. Parquet has some nice utilities for telling us some info about the dataset, without having to actually load it.



In [33]:
# Get parquet file
pf = ParquetFile('crosstalk_train.parquet')

In [34]:
# How many rows and columns?
N = pf.metadata.num_rows
print(pf.metadata)

<pyarrow._parquet.FileMetaData object at 0x7ee880716d40>
  created_by: parquet-cpp-arrow version 17.0.0
  num_columns: 16
  num_rows: 375595
  num_row_groups: 1
  format_version: 2.6
  serialized_size: 53693


In [35]:
# What are the column names?
print(pf.schema.names)

['ID', 'DEL_ID', 'DELLabel', 'RawCount', 'Target', 'ECFP4', 'ECFP6', 'FCFP4', 'FCFP6', 'MACCS', 'RDK', 'AVALON', 'ATOMPAIR', 'TOPTOR', 'MW', 'ALOGP']


## Incremental learning with batches

We'll use a data loader that yields one batch at a time by iterating through the parquet file. Note: this only works for models that support incremental learning (in scikit-learn, estimators with a `partial_fit` method), which not all do.

The data loader also handles the train/test split *internally* in this case, because we are loading the data incrementally.



⚠️ The training loop below takes a long time to run, so we cut it off early for demo purposes. If you use it in your own code, be sure to increase `max_batches` so that the model sees all the data.

In [39]:
batch_size = 1024
n_batches = N // batch_size
max_batches = None
iterator = dataset.parquet_split_dataloader("crosstalk_train.parquet", "ECFP6", "DELLabel",
                                    chunk_size=batch_size)
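
The internals of `dataset.parquet_split_dataloader` are not shown here. As a rough sketch of the idea, the generator below walks over in-memory NumPy arrays (rather than a parquet file) in chunks and splits each chunk into a train part and a held-out part. The name `split_dataloader` and the `test_frac` and `seed` parameters are assumptions for illustration, not the workshop's API.

```python
import numpy as np


def split_dataloader(x, y, chunk_size=1024, test_frac=0.2, seed=0):
    """Yield ((x_train, y_train), (x_test, y_test)) one chunk at a time.

    Hypothetical stand-in for the workshop loader: every chunk is split
    independently, so no row lands in both the train and the test part.
    """
    rng = np.random.default_rng(seed)
    for start in range(0, len(x), chunk_size):
        x_chunk = x[start:start + chunk_size]
        y_chunk = y[start:start + chunk_size]
        # Shuffle row positions within the chunk, then cut off the test part.
        order = rng.permutation(len(x_chunk))
        n_test = int(round(test_frac * len(x_chunk)))
        test_idx, train_idx = order[:n_test], order[n_test:]
        yield (x_chunk[train_idx], y_chunk[train_idx]), (x_chunk[test_idx], y_chunk[test_idx])


# Toy data: 10 rows in chunks of 4 gives chunks of 4, 4, and 2 rows.
x_toy = np.arange(20).reshape(10, 2)
y_toy = np.arange(10) % 2
batches = list(split_dataloader(x_toy, y_toy, chunk_size=4, test_frac=0.25))
```

Because only one chunk is held at a time, the same loop shape works whether the chunks come from arrays, as here, or from a file reader.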

In [30]:
model = SGDClassifier(loss="log_loss")

In [19]:
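import itertools

first = True
test_x_list = []
test_y_list = []
# islice stops after max_batches batches; max_batches=None runs through the whole file
batches = itertools.islice(iterator, max_batches)
for (x_train, y_train), (x_test, y_test) in tqdm(batches, total=max_batches or n_batches):
    if first:
        # the first partial_fit call must be told every class label up front
        model.partial_fit(x_train, y_train, classes=[0, 1])
        first = False
    else:
        model.partial_fit(x_train, y_train)
    # keep each batch's held-out split for evaluation
    test_x_list.append(x_test)
    test_y_list.append(y_test)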

  0%|          | 0/366 [00:00<?, ?it/s]

CPU times: user 2min 9s, sys: 1.12 s, total: 2min 11s
Wall time: 2min 13s


In [20]:
# Evaluate (note: x_test and y_test hold only the last batch's held-out split)
y_pred = model.predict_proba(x_test)[:, 1]
auc = roc_auc_score(y_test, y_pred)
auprc = average_precision_score(y_test, y_pred)
print(f"AUROC: {auc:.4f}")
print(f"AUPRC: {auprc:.4f}")

AUROC: 0.9331
AUPRC: 0.6727
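
The training loop also stores each batch's held-out split in `test_x_list` and `test_y_list`, while the evaluation cell scores only the last batch. Below is a sketch of pooling the stored splits into one evaluation set. It assumes the splits are dense NumPy arrays (the loader's actual return type is not shown here, so treat that as an assumption), and it uses toy lists, `toy_x_list` and `toy_y_list`, in place of the real ones.

```python
import numpy as np

# Toy stand-ins for the per-batch held-out splits collected during training.
toy_x_list = [np.zeros((3, 4)), np.ones((2, 4))]
toy_y_list = [np.array([[0], [1], [0]]), np.array([[1], [0]])]

# Stack the feature rows and flatten the labels into one vector each.
x_test_all = np.vstack(toy_x_list)
y_test_all = np.concatenate([np.ravel(y) for y in toy_y_list])

print(x_test_all.shape, y_test_all.shape)
```

The pooled arrays could then be passed to `model.predict_proba`, `roc_auc_score`, and `average_precision_score` in place of `x_test` and `y_test`.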


# TabPFN



In [37]:
x = dataset.load_x("crosstalk_train.parquet", ["ECFP6"])
y = dataset.load_y("crosstalk_train.parquet")

print(x.shape, y.shape)

Total rows: 375595
Expected Memory for inputs: 2.87 GBs


  0%|          | 0/375 [00:00<?, ?it/s]

(375595, 2048) (375595, 1)


Downsample feature dimension

In [38]:
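feat_dim = 250
og_dim = x.shape[1]
assert feat_dim < og_dim, "Check your dims!"
# pick feat_dim distinct column indices at random (unseeded, so the choice changes between runs)
feat_mask = np.zeros(og_dim, dtype=bool)
selected_indices = np.random.choice(og_dim, feat_dim, replace=False)
feat_mask[selected_indices] = True
# a boolean mask keeps the selected columns in their original order
x = x[:, feat_mask]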

Downsample samples

In [39]:
n_samples = 5000

x_train, x_val, y_train, y_val = train_test_split(
    x,
    y,
    train_size=n_samples,
    stratify=y,
    random_state=42
)
print(x_train.shape, y_train.shape)

(5000, 250) (5000, 1)


Create a model

In [40]:
model = TabPFNClassifier.create_default_for_version(ModelVersion.V2)

TabPFN only supports medium-sized datasets, which is why we downsampled the features and samples above!

In [42]:
model.fit(x_train, y_train.ravel())

In [59]:
batch_size = 1024
y_pred = []
n_val = len(x_val)
iterator = more_itertools.chunked(range(n_val), batch_size)
n_chunks = -(-n_val // batch_size)  # ceiling division so the last partial chunk is counted
for chunk in tqdm(iterator, total=n_chunks):
    # predict_proba returns one column per class; keep P(class 1)
    y_pred.append(model.predict_proba(x_val[chunk])[:, 1])
y_preds = np.hstack(y_pred)
print(y_preds.shape)

  0%|          | 0/2 [00:00<?, ?it/s]

(370,)
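
The loop above predicts in chunks, presumably so that only `batch_size` rows pass through the model at once. The sketch below shows the same pattern with plain NumPy slicing. `predict_in_batches` is a hypothetical helper, and `toy_predict_proba` is a made-up stand-in for a fitted classifier's `predict_proba`, so neither is part of TabPFN or the workshop code.

```python
import math

import numpy as np


def predict_in_batches(predict_proba, x, batch_size=1024):
    """Call `predict_proba` on consecutive row slices and stack P(class 1)."""
    n_chunks = math.ceil(len(x) / batch_size)  # counts the final partial chunk
    scores = []
    for i in range(n_chunks):
        rows = slice(i * batch_size, (i + 1) * batch_size)
        scores.append(predict_proba(x[rows])[:, 1])
    return np.hstack(scores) if scores else np.empty(0)


def toy_predict_proba(x):
    """Made-up model: P(class 1) is a sigmoid of the first feature."""
    p1 = 1.0 / (1.0 + np.exp(-x[:, 0]))
    return np.column_stack([1.0 - p1, p1])


# 2500 rows with batch_size=1024 gives three chunks: 1024, 1024, and 452 rows.
x_demo = np.linspace(-3, 3, 2500).reshape(-1, 1)
batched = predict_in_batches(toy_predict_proba, x_demo, batch_size=1024)
```

Slicing by position avoids building an explicit index list, and the result has one score per input row regardless of how the rows were chunked.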


Evals

In [None]:
# score the stacked validation predictions against the validation labels
auc = roc_auc_score(y_val.ravel(), y_preds)
auprc = average_precision_score(y_val.ravel(), y_preds)
print(f"AUROC: {auc:.4f}")
print(f"AUPRC: {auprc:.4f}")