# TabLLM Replication: Few-shot Classification of Tabular Data

This notebook replicates the TabLLM approach of Hegselmann et al., 2023 (https://github.com/clinicalml/TabLLM), with the adaptations listed below.

**Key Components:**
- Text Template Serialization: Convert rows to natural language ("The [column] is [value]")
- Model: T0-3B with PEFT/LoRA fine-tuning
- Training: Full training data (optimized for Google Colab free GPU)
- Evaluation: Comprehensive metrics (Log Loss, AUC, Accuracy, Precision, Recall, F1)

**Datasets:**
1. Postpartum Depression
2. Student Depression
3. AI Tool Usage by Indian College Students
4. Hilton Employee Retention

## 1. Setup and Installation

In [None]:
# Install required packages
# Note: Using bitsandbytes 0.43.0+ for CUDA 12.6 support
# The bitsandbytes specifier is quoted so the shell does not treat ">" as an output redirect
!pip install -q transformers==4.36.2 datasets==2.16.1 peft==0.7.1 accelerate==0.25.0 "bitsandbytes>=0.43.0" scikit-learn==1.3.2 pandas numpy tqdm

In [None]:
import os
import pandas as pd
import numpy as np
import torch
from datasets import Dataset, DatasetDict
from transformers import (
    AutoTokenizer,
    AutoModelForSeq2SeqLM,
    Trainer,
    TrainingArguments,
    DataCollatorForSeq2Seq,
    EarlyStoppingCallback
)
from peft import LoraConfig, get_peft_model, TaskType, PeftModel
from sklearn.metrics import (
    log_loss, roc_auc_score, accuracy_score, 
    precision_recall_fscore_support, classification_report,
    confusion_matrix
)
from tqdm.auto import tqdm
import warnings
warnings.filterwarnings('ignore')

# Check GPU availability
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    print(f"Available GPU memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.2f} GB")

## 2. Text Template Serialization Function

Based on TabLLM's approach, we convert each row into natural language sentences.
Format: "The [column name] is [value]."

In [None]:
def serialize_row_to_text(row, exclude_cols=None, feature_descriptions=None):
    """
    Convert a dataframe row to natural language text using TabLLM's template approach.
    
    Args:
        row: pandas Series representing a single row
        exclude_cols: list of column names to exclude (e.g., target, id columns)
        feature_descriptions: dict mapping column names to human-readable descriptions
    
    Returns:
        str: Natural language representation of the row
    """
    if exclude_cols is None:
        exclude_cols = []
    
    sentences = []
    for col, value in row.items():
        # Skip excluded columns and NaN values
        if col in exclude_cols or pd.isna(value):
            continue
        
        # Use feature description if available, otherwise use column name
        if feature_descriptions and col in feature_descriptions:
            col_name = feature_descriptions[col]
        else:
            # Clean column name: replace underscores with spaces
            col_name = col.replace('_', ' ').lower()
        
        # Format the sentence
        sentences.append(f"The {col_name} is {value}.")
    
    return " ".join(sentences)


def create_classification_prompt(text, question, choices=("No", "Yes")):
    """
    Create a classification prompt in TabLLM's format.
    Based on their YAML templates (e.g., templates_heart.yaml)
    
    Args:
        text: Serialized row text
        question: Classification question
        choices: List of answer options
    
    Returns:
        str: Formatted prompt
    """
    prompt = f"{text}\n\nQuestion: {question}\nAnswer choices: {', '.join(choices)}\nAnswer:"
    return prompt


# Test the serialization function
test_row = pd.Series({
    'age': '30-35',
    'feeling_sad_or_tearful': 'Yes',
    'irritable_towards_baby_partner': 'No',
    'feeling_anxious': 'Yes'
})

print("Example serialization:")
print(serialize_row_to_text(test_row))
print("\nExample prompt:")
print(create_classification_prompt(
    serialize_row_to_text(test_row),
    "Does this person show signs of postpartum depression?"
))

## 3. Dataset Configuration

Define configurations for each dataset including file paths, target columns, and classification questions.
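
Because each configuration is a plain dictionary, a missing key or a target column left out of `exclude_cols` only surfaces later, either as a loading error or as the label leaking into the prompt. A minimal sketch of an up-front check follows. `validate_config` and `REQUIRED_KEYS` are hypothetical names introduced here for illustration and are not part of TabLLM.

```python
# Hypothetical helper: sanity-check one dataset configuration before loading data.
REQUIRED_KEYS = (
    "train_file", "test_file", "target_col", "exclude_cols",
    "question", "positive_label", "negative_label", "label_map",
)

def validate_config(name, config):
    """Return a list of human-readable problems found in `config` (empty if none)."""
    problems = []
    for key in REQUIRED_KEYS:
        if key not in config:
            problems.append(f"{name}: missing required key '{key}'")
    # The target must be excluded from serialization, otherwise the label leaks into the prompt.
    target = config.get("target_col")
    if target is not None and target not in config.get("exclude_cols", []):
        problems.append(f"{name}: target column '{target}' is not in exclude_cols")
    # Labels are binarized downstream, so the map should only produce 0 or 1.
    bad_values = sorted(set(config.get("label_map", {}).values()) - {0, 1})
    if bad_values:
        problems.append(f"{name}: label_map contains non-binary values {bad_values}")
    return problems

# Toy configuration with a deliberate mistake: the target is not excluded.
example = {
    "train_file": "train.csv", "test_file": "test.csv",
    "target_col": "Depression", "exclude_cols": ["id"],
    "question": "Does this student have depression?",
    "positive_label": "Yes", "negative_label": "No",
    "label_map": {0: 0, 1: 1},
}
for problem in validate_config("example", example):
    print(problem)
```

In this notebook the same check could be run over `DATASET_CONFIGS.items()` before any file is read.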

In [None]:
# Dataset configurations
DATASET_CONFIGS = {
    'postpartum_depression': {
        'train_file': 'newdata/train_postpartum_depression.csv',
        'test_file': 'newdata/test_postpartum_depression.csv',
        'target_col': 'feeling anxious',
        'exclude_cols': ['timestamp', 'feeling anxious'],
        'question': 'Does this person experience feeling anxious (a predictor of postpartum depression)?',
        'positive_label': 'Yes',
        'negative_label': 'No',
        'label_map': {'No': 0, 'Maybe': 1, 'Sometimes': 1, 'Often': 1, 'Yes': 1}  # Binarize
    },
    'student_depression': {
        'train_file': 'newdata/train_student_depression_sample.csv',
        'test_file': 'newdata/test_student_depression_sample.csv',
        'target_col': 'Depression',
        'exclude_cols': ['id', 'Depression'],
        'question': 'Does this student have depression?',
        'positive_label': 'Yes',
        'negative_label': 'No',
        'label_map': {0: 0, 1: 1}
    },
    'ai_tools_usage': {
        'train_file': 'newdata/train_StudentsAITools.csv',
        'test_file': 'newdata/test_StudentsAITools.csv',
        'target_col': 'willing_to_pay_for_access',
        'exclude_cols': ['willing_to_pay_for_access'],
        'question': 'Is this student willing to pay for AI tool access?',
        'positive_label': 'Yes',
        'negative_label': 'No',
        'label_map': {'No': 0, 'Yes': 1}
    },
    'hilton_employee': {
        'train_file': 'newdata/train_HiltonEmployee.csv',
        'test_file': 'newdata/test_HiltonEmployee.csv',
        'target_col': 'intenttostayhl',
        'exclude_cols': ['inncode', 'intenttostayhl', 'recommendhl'],  # Exclude ID and target-related cols
        'question': 'Does this employee intend to stay at Hilton?',
        'positive_label': 'Yes',
        'negative_label': 'No',
        'label_map': {0: 0, 1: 1},
        'feature_descriptions': {  # Add meaningful descriptions for key features
            'generation': 'employee generation',
            'department': 'department',
            'fulltimeparttime': 'employment type (full-time or part-time)',
            'tenure': 'tenure category',
            'managementlevel': 'management level',
            'engagement': 'overall engagement score',
            'wellbeing': 'wellbeing score',
            'workenvironment': 'work environment score',
            'learningdevelopment': 'learning and development score',
            'worklifebalance': 'work-life balance score',
            'rewardsbenefits': 'rewards and benefits score'
        }
    }
}

print("Dataset configurations loaded:")
for name, config in DATASET_CONFIGS.items():
    print(f"  - {name}: {config['question']}")

## 4. Data Loading and Preprocessing

Load datasets and apply text template serialization.

In [None]:
def load_and_serialize_dataset(config, dataset_name):
    """
    Load a dataset and apply text serialization.
    
    Args:
        config: Dataset configuration dictionary
        dataset_name: Name of the dataset for logging
    
    Returns:
        tuple: (train_df, test_df) with serialized text and binary labels
    """
    print(f"\n{'='*60}")
    print(f"Loading {dataset_name}")
    print(f"{'='*60}")
    
    # Load data
    train_df = pd.read_csv(config['train_file'])
    test_df = pd.read_csv(config['test_file'])
    
    print(f"Train shape: {train_df.shape}")
    print(f"Test shape: {test_df.shape}")
    
    # Extract and map labels
    label_map = config['label_map']
    train_df['label'] = train_df[config['target_col']].map(label_map)
    test_df['label'] = test_df[config['target_col']].map(label_map)
    
    # Handle any unmapped labels
    if train_df['label'].isna().any() or test_df['label'].isna().any():
        print(f"Warning: Found unmapped labels. Unique values:")
        print(f"  Train: {train_df[config['target_col']].unique()}")
        print(f"  Test: {test_df[config['target_col']].unique()}")
        # Drop rows with unmapped labels
        train_df = train_df.dropna(subset=['label'])
        test_df = test_df.dropna(subset=['label'])
    
    train_df['label'] = train_df['label'].astype(int)
    test_df['label'] = test_df['label'].astype(int)
    
    # Print label distribution
    print(f"\nLabel distribution:")
    print(f"  Train: {train_df['label'].value_counts().to_dict()}")
    print(f"  Test: {test_df['label'].value_counts().to_dict()}")
    
    # Serialize rows to text
    feature_descriptions = config.get('feature_descriptions', None)
    
    print("\nSerializing rows to text...")
    train_df['text'] = train_df.apply(
        lambda row: serialize_row_to_text(
            row, 
            exclude_cols=config['exclude_cols'],
            feature_descriptions=feature_descriptions
        ),
        axis=1
    )
    
    test_df['text'] = test_df.apply(
        lambda row: serialize_row_to_text(
            row, 
            exclude_cols=config['exclude_cols'],
            feature_descriptions=feature_descriptions
        ),
        axis=1
    )
    
    # Create prompts
    question = config['question']
    choices = [config['negative_label'], config['positive_label']]
    
    train_df['prompt'] = train_df['text'].apply(
        lambda x: create_classification_prompt(x, question, choices)
    )
    test_df['prompt'] = test_df['text'].apply(
        lambda x: create_classification_prompt(x, question, choices)
    )
    
    # Create target text (verbalizer)
    train_df['target'] = train_df['label'].map({0: config['negative_label'], 1: config['positive_label']})
    test_df['target'] = test_df['label'].map({0: config['negative_label'], 1: config['positive_label']})
    
    # Show example
    print(f"\nExample serialized row:")
    print(f"Text: {train_df['text'].iloc[0][:200]}...")
    print(f"\nFull prompt: {train_df['prompt'].iloc[0][:300]}...")
    print(f"Target: {train_df['target'].iloc[0]}")
    
    # Check prompt lengths using whitespace-separated words.
    # This is only a rough proxy: subword tokenization usually yields more tokens than words.
    avg_tokens_train = train_df['prompt'].str.split().str.len().mean()
    max_tokens_train = train_df['prompt'].str.split().str.len().max()
    print(f"\nPrompt length statistics (whitespace words, approximate):")
    print(f"  Average words: {avg_tokens_train:.0f}")
    print(f"  Max words: {max_tokens_train:.0f}")
    
    if max_tokens_train > 800:  # Leave room for actual tokenization overhead
        print(f"  Warning: Some prompts may exceed T0's 1024 token limit!")
    
    return train_df[['prompt', 'target', 'label']], test_df[['prompt', 'target', 'label']]


# Load all datasets
datasets = {}
for name, config in DATASET_CONFIGS.items():
    train_df, test_df = load_and_serialize_dataset(config, name)
    datasets[name] = {
        'train': train_df,
        'test': test_df,
        'config': config
    }
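
The length statistics printed by the loader split prompts on whitespace, which counts words rather than subword tokens and so tends to understate the true length. A rough sketch of measuring lengths with an actual encoder is below. `count_over_limit` is a hypothetical helper that accepts any callable mapping text to a list of token ids; in this notebook that would presumably be `tokenizer.encode` once the tokenizer is loaded. The stand-in encoder exists only to keep the snippet runnable on its own.

```python
def count_over_limit(prompts, encode, limit=512):
    """Return (max_length, number of prompts longer than `limit`) under `encode`."""
    lengths = [len(encode(p)) for p in prompts]
    over = sum(1 for n in lengths if n > limit)
    return (max(lengths) if lengths else 0), over

# Stand-in encoder for illustration only: one "token" per 4 characters, rounded up.
def fake_encode(text):
    return list(range((len(text) + 3) // 4))

prompts = [
    "The age is 30-35. The feeling sad or tearful is Yes.",
    "The age is 40-45.",
]
max_len, n_over = count_over_limit(prompts, fake_encode, limit=10)
print(f"max length: {max_len}, prompts over limit: {n_over}")
```

The default `limit=512` mirrors the `max_length` passed to the tokenizer later in the notebook, since prompts longer than that are truncated.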

## 5. Model Setup: T0 with LoRA

Following TabLLM's approach, we use T0 (T0-3B) with parameter-efficient fine-tuning.
TabLLM used IA3 adapters; this notebook uses LoRA instead, which is more widely available and similarly parameter-efficient.
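
To make the efficiency claim concrete: a LoRA adapter of rank r on a weight matrix of shape d_in x d_out adds r * (d_in + d_out) trainable parameters. The sketch below applies that formula to the query and value projections with the rank of 8 configured later in this notebook. The architecture numbers (hidden size 2048, attention inner dimension 2048, 24 encoder and 24 decoder blocks) are assumptions about the T0-3B checkpoint rather than values stated in this notebook, so treat the total as an estimate and compare it with what `create_lora_model` prints.

```python
def lora_params_per_matrix(d_in, d_out, r):
    """LoRA adds A (r x d_in) and B (d_out x r), i.e. r * (d_in + d_out) parameters."""
    return r * (d_in + d_out)

def lora_params_t5_qv(d_model, d_attn, n_enc, n_dec, r):
    """Count LoRA parameters when only the q and v projections are adapted.

    Each encoder block has one attention layer (self-attention); each decoder
    block has two (self-attention and cross-attention).
    """
    attn_layers = n_enc + 2 * n_dec
    adapted_matrices = 2 * attn_layers  # q and v in every attention layer
    return adapted_matrices * lora_params_per_matrix(d_model, d_attn, r)

# Assumed T0-3B dimensions (not stated in this notebook).
total = lora_params_t5_qv(d_model=2048, d_attn=2048, n_enc=24, n_dec=24, r=8)
print(f"Estimated LoRA parameters: {total:,}")
```

Under these assumptions the adapters amount to a fraction of a percent of a 3-billion-parameter model.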

In [None]:
# Model configuration
MODEL_NAME = "bigscience/T0_3B"  # T0-3B model (3 billion parameters)
# Alternative: "bigscience/T0pp" (11B, requires more memory)

print(f"Loading model: {MODEL_NAME}")
print("This may take a few minutes...\n")

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)

# T5-based tokenizers such as T0's normally ship with a pad token; fall back to eos_token only if it is missing
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

print(f"Tokenizer loaded. Vocab size: {len(tokenizer)}")
print(f"Max model length: {tokenizer.model_max_length}")

In [None]:
def create_lora_model(model_name=MODEL_NAME):
    """
    Create T0 model with LoRA adapters for parameter-efficient fine-tuning.
    
    Based on TabLLM's IA3 configuration:
    - They used learning rate 0.003, batch size 4
    - IA3 adds ~0.1% trainable parameters
    - LoRA is similar, we configure it for comparable efficiency
    """
    # Load base model in 8-bit for memory efficiency (important for Colab)
    model = AutoModelForSeq2SeqLM.from_pretrained(
        model_name,
        load_in_8bit=True,  # Quantization for memory efficiency
        device_map="auto",
        torch_dtype=torch.float16
    )
    
    # LoRA configuration
    # Following TabLLM's parameter-efficient approach
    lora_config = LoraConfig(
        task_type=TaskType.SEQ_2_SEQ_LM,
        r=8,  # LoRA rank - lower = more efficient
        lora_alpha=32,  # LoRA scaling factor
        lora_dropout=0.1,
        target_modules=["q", "v"],  # Apply LoRA to query and value projections
        inference_mode=False
    )
    
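    # Prepare the 8-bit model for adapter training (e.g. upcasts layer norms, enables input gradients)
    from peft import prepare_model_for_kbit_training
    model = prepare_model_for_kbit_training(model)
    
    # Wrap model with LoRA
    model = get_peft_model(model, lora_config)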
    
    # Print trainable parameters
    trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
    total_params = sum(p.numel() for p in model.parameters())
    
    print(f"\nModel: {model_name}")
    print(f"Trainable parameters: {trainable_params:,} ({100 * trainable_params / total_params:.2f}%)")
    print(f"Total parameters: {total_params:,}")
    
    return model


# We'll create the model fresh for each dataset to avoid cross-contamination
print("Model setup function ready.")
print("Models will be created per-dataset during training.")

## 6. Training and Evaluation Functions

In [None]:
def prepare_dataset_for_training(df, tokenizer, max_length=512):
    """
    Convert DataFrame to HuggingFace Dataset and tokenize.
    
    Args:
        df: DataFrame with 'prompt', 'target', 'label' columns
        tokenizer: Tokenizer instance
        max_length: Maximum sequence length (T0 supports up to 1024)
    
    Returns:
        Dataset: Tokenized dataset ready for training
    """
    dataset = Dataset.from_pandas(df)
    
    def tokenize_function(examples):
        # Tokenize inputs (prompts)
        model_inputs = tokenizer(
            examples['prompt'],
            max_length=max_length,
            padding='max_length',
            truncation=True,
            return_tensors=None
        )
        
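        # Tokenize targets (labels); text_target replaces the deprecated as_target_tokenizer()
        labels = tokenizer(
            text_target=examples['target'],
            max_length=3,  # "Yes" or "No" is very short
            padding='max_length',
            truncation=True,
            return_tensors=None
        )
        
        # Replace pad token ids with -100 so padded positions are ignored by the loss
        model_inputs['labels'] = [
            [tok if tok != tokenizer.pad_token_id else -100 for tok in seq]
            for seq in labels['input_ids']
        ]
        return model_inputs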
    
    tokenized_dataset = dataset.map(
        tokenize_function,
        batched=True,
        remove_columns=dataset.column_names
    )
    
    return tokenized_dataset


def compute_metrics_comprehensive(predictions, references, positive_label_text, negative_label_text, tokenizer):
    """
    Compute comprehensive evaluation metrics as requested.
    
    Args:
        predictions: Model predictions (logits)
        references: Ground truth labels
        positive_label_text: Text for positive class (e.g., "Yes")
        negative_label_text: Text for negative class (e.g., "No")
        tokenizer: Tokenizer to decode predictions
    
    Returns:
        dict: Dictionary of metrics
    """
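    # Trainer.predict on a seq2seq model can return a tuple (logits, encoder states); keep the logits
    if isinstance(predictions, tuple):
        predictions = predictions[0]
    
    # Decode predictions
    pred_ids = np.argmax(predictions, axis=-1)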
    
    # Get token IDs for Yes/No
    positive_token_id = tokenizer.encode(positive_label_text, add_special_tokens=False)[0]
    negative_token_id = tokenizer.encode(negative_label_text, add_special_tokens=False)[0]
    
    # Convert predictions to binary labels
    # Extract first non-pad token from predictions
    pred_labels = []
    for pred_seq in pred_ids:
        first_token = pred_seq[0]  # First token should be Yes/No
        if first_token == positive_token_id:
            pred_labels.append(1)
        elif first_token == negative_token_id:
            pred_labels.append(0)
        else:
            # Default to negative if unclear
            pred_labels.append(0)
    
    pred_labels = np.array(pred_labels)
    
    # Convert references to binary labels
    true_labels = []
    for ref_seq in references:
        # Filter out padding tokens (-100)
        ref_seq_filtered = [t for t in ref_seq if t != -100]
        if len(ref_seq_filtered) > 0:
            first_token = ref_seq_filtered[0]
            if first_token == positive_token_id:
                true_labels.append(1)
            else:
                true_labels.append(0)
        else:
            true_labels.append(0)
    
    true_labels = np.array(true_labels)
    
    # Get probabilities for positive class (for AUC and log loss)
    # Take softmax over the first token position for Yes/No tokens
    yes_no_logits = predictions[:, 0, [negative_token_id, positive_token_id]]
    probs = torch.softmax(torch.tensor(yes_no_logits), dim=-1).numpy()
    pred_probs = probs[:, 1]  # Probability of positive class
    
    # Calculate metrics
    accuracy = accuracy_score(true_labels, pred_labels)
    
    # Precision, Recall, F1 (overall and per-class)
    precision_macro, recall_macro, f1_macro, _ = precision_recall_fscore_support(
        true_labels, pred_labels, average='macro', zero_division=0
    )
    precision_per_class, recall_per_class, f1_per_class, _ = precision_recall_fscore_support(
        true_labels, pred_labels, average=None, zero_division=0
    )
    
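    # AUC and Log Loss
    # roc_auc_score raises ValueError when true_labels contains a single class
    try:
        auc = roc_auc_score(true_labels, pred_probs)
    except ValueError:
        auc = 0.0
    
    # Passing labels=[0, 1] keeps log_loss defined even if only one class is present
    try:
        logloss = log_loss(true_labels, pred_probs, labels=[0, 1])
    except ValueError:
        logloss = float('inf')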
    
    # Confusion matrix for additional insights
    # labels=[0, 1] guarantees a 2x2 matrix even if one class never appears
    cm = confusion_matrix(true_labels, pred_labels, labels=[0, 1])
    
    metrics = {
        'accuracy': accuracy,
        'auc': auc,
        'log_loss': logloss,
        'precision_overall': precision_macro,
        'recall_overall': recall_macro,
        'f1_overall': f1_macro,
        'precision_negative': precision_per_class[0] if len(precision_per_class) > 0 else 0.0,
        'precision_positive': precision_per_class[1] if len(precision_per_class) > 1 else 0.0,
        'recall_negative': recall_per_class[0] if len(recall_per_class) > 0 else 0.0,
        'recall_positive': recall_per_class[1] if len(recall_per_class) > 1 else 0.0,
        'f1_negative': f1_per_class[0] if len(f1_per_class) > 0 else 0.0,
        'f1_positive': f1_per_class[1] if len(f1_per_class) > 1 else 0.0,
    }
    
    return metrics, cm


print("Training and evaluation functions ready.")
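
The probability used for AUC and log loss in `compute_metrics_comprehensive` comes from a softmax restricted to the two label tokens at the first decoding position. For two logits this reduces to a logistic function of their difference, which the following standalone sketch illustrates. The function name and the example logits are made up for illustration.

```python
import math

def positive_probability(neg_logit, pos_logit):
    """Softmax over [neg_logit, pos_logit], returning the positive-class probability.

    Equivalent to sigmoid(pos_logit - neg_logit); the max is subtracted for
    numerical stability.
    """
    m = max(neg_logit, pos_logit)
    e_neg = math.exp(neg_logit - m)
    e_pos = math.exp(pos_logit - m)
    return e_pos / (e_neg + e_pos)

# Made-up logits for the "No" and "Yes" tokens.
p = positive_probability(neg_logit=1.2, pos_logit=2.3)
sigmoid = 1.0 / (1.0 + math.exp(-(2.3 - 1.2)))
print(f"softmax: {p:.4f}, sigmoid of difference: {sigmoid:.4f}")
```

Because every other vocabulary entry is ignored, this probability can look confident even when the model's unrestricted top token is neither label.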

## 7. Fine-tuning Loop

Train T0 with LoRA on each dataset and evaluate.
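
The training function in this section passes the test split as `eval_dataset`, so early stopping and best-checkpoint selection are driven by test data. One way to avoid that is to carve a validation set out of the training rows. The sketch below does a stratified split using only the standard library. `stratified_split` is a hypothetical helper; in the notebook itself, scikit-learn's `train_test_split` with `stratify=` would be the more natural choice.

```python
import random
from collections import defaultdict

def stratified_split(labels, val_fraction=0.1, seed=42):
    """Split row indices into (train_idx, val_idx), preserving class proportions."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)
    train_idx, val_idx = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        # Keep at least one validation row per class when the class has more than one row.
        n_val = max(1, round(len(indices) * val_fraction)) if len(indices) > 1 else 0
        val_idx.extend(indices[:n_val])
        train_idx.extend(indices[n_val:])
    return sorted(train_idx), sorted(val_idx)

# Toy labels: 80 negatives followed by 20 positives.
labels = [0] * 80 + [1] * 20
train_idx, val_idx = stratified_split(labels, val_fraction=0.1)
print(len(train_idx), len(val_idx))
print(sum(labels[i] for i in val_idx), "positive rows in validation")
```

With a DataFrame, the returned indices could be applied via `train_df.iloc[train_idx]` and `train_df.iloc[val_idx]`, leaving the test split untouched until the final evaluation.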

In [None]:
def train_and_evaluate_dataset(dataset_name, data, output_dir='./results'):
    """
    Train and evaluate T0 model on a single dataset.
    
    Args:
        dataset_name: Name of the dataset
        data: Dictionary with 'train', 'test', 'config' keys
        output_dir: Directory to save results
    
    Returns:
        dict: Evaluation metrics
    """
    print(f"\n\n{'='*80}")
    print(f"TRAINING: {dataset_name.upper()}")
    print(f"{'='*80}\n")
    
    # Create output directory
    dataset_output_dir = f"{output_dir}/{dataset_name}"
    os.makedirs(dataset_output_dir, exist_ok=True)
    
    # Get data
    train_df = data['train']
    test_df = data['test']
    config = data['config']
    
    # Create fresh model for this dataset
    print("Loading model...")
    model = create_lora_model(MODEL_NAME)
    
    # Prepare datasets
    print("\nPreparing datasets...")
    train_dataset = prepare_dataset_for_training(train_df, tokenizer, max_length=512)
    test_dataset = prepare_dataset_for_training(test_df, tokenizer, max_length=512)
    
    # Data collator
    data_collator = DataCollatorForSeq2Seq(
        tokenizer=tokenizer,
        model=model,
        padding=True
    )
    
    # Training arguments
    # Based on TabLLM: lr=0.003, batch_size=4, gradient_accumulation=1
    # Adapted for full dataset training (not just k-shot)
    training_args = TrainingArguments(
        output_dir=dataset_output_dir,
        num_train_epochs=3,  # Adjust based on dataset size
        per_device_train_batch_size=4,  # Same as TabLLM
        per_device_eval_batch_size=8,
        learning_rate=3e-3,  # Same as TabLLM (0.003)
        warmup_steps=100,
        logging_steps=50,
        evaluation_strategy="steps",  # argument name in the pinned transformers 4.36.2; newer releases call it eval_strategy
        eval_steps=200,
        save_strategy="steps",
        save_steps=200,
        save_total_limit=2,
        load_best_model_at_end=True,
        metric_for_best_model="loss",
        greater_is_better=False,
        fp16=True,  # Mixed precision for speed
        gradient_accumulation_steps=1,
        report_to="none",  # Disable wandb/tensorboard
        remove_unused_columns=False,
        label_names=["labels"],
    )
    
    # Trainer
    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=train_dataset,
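        # NOTE: the test split is also used for early stopping and checkpoint selection,
        # so the final test metrics are optimistic; a separate validation split avoids this
        eval_dataset=test_dataset,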
        data_collator=data_collator,
        callbacks=[EarlyStoppingCallback(early_stopping_patience=3)]
    )
    
    # Train
    print("\nStarting training...")
    print(f"Training samples: {len(train_dataset)}")
    print(f"Evaluation samples: {len(test_dataset)}")
    print(f"Epochs: {training_args.num_train_epochs}")
    print(f"Batch size: {training_args.per_device_train_batch_size}")
    print(f"Learning rate: {training_args.learning_rate}\n")
    
    trainer.train()
    
    # Evaluate
    print("\n\nEvaluating on test set...")
    
    # Generate predictions
    predictions_output = trainer.predict(test_dataset)
    predictions = predictions_output.predictions
    references = predictions_output.label_ids
    
    # Compute metrics
    metrics, cm = compute_metrics_comprehensive(
        predictions,
        references,
        positive_label_text=config['positive_label'],
        negative_label_text=config['negative_label'],
        tokenizer=tokenizer
    )
    
    # Print results
    print("\n" + "="*60)
    print(f"RESULTS: {dataset_name.upper()}")
    print("="*60)
    print(f"\nLog Loss:          {metrics['log_loss']:.4f}")
    print(f"AUC:               {metrics['auc']:.4f}")
    print(f"Accuracy:          {metrics['accuracy']:.4f}")
    print(f"\nPrecision (Overall): {metrics['precision_overall']:.4f}")
    print(f"  - Negative:        {metrics['precision_negative']:.4f}")
    print(f"  - Positive:        {metrics['precision_positive']:.4f}")
    print(f"\nRecall (Overall):    {metrics['recall_overall']:.4f}")
    print(f"  - Negative:        {metrics['recall_negative']:.4f}")
    print(f"  - Positive:        {metrics['recall_positive']:.4f}")
    print(f"\nF1 Score (Overall):  {metrics['f1_overall']:.4f}")
    print(f"  - Negative:        {metrics['f1_negative']:.4f}")
    print(f"  - Positive:        {metrics['f1_positive']:.4f}")
    print(f"\nConfusion Matrix:")
    print(cm)
    print("="*60 + "\n")
    
    # Save results
    results_dict = {
        'dataset': dataset_name,
        **metrics
    }
    results_df = pd.DataFrame([results_dict])
    results_df.to_csv(f"{dataset_output_dir}/metrics.csv", index=False)
    
    # Save confusion matrix
    # Rows are actual classes, columns are predicted classes
    cm_df = pd.DataFrame(cm, 
                         columns=['Predicted Negative', 'Predicted Positive'],
                         index=['Actual Negative', 'Actual Positive'])
    cm_df.to_csv(f"{dataset_output_dir}/confusion_matrix.csv")
    
    # Clean up to save memory
    del model, trainer
    torch.cuda.empty_cache()
    
    return results_dict


print("Training function ready.")

## 8. Run Training on All Datasets

**Note:** This will take several hours on Colab's free GPU. 
You can comment out datasets you don't want to train immediately.

In [None]:
# Train on all datasets
all_results = []

for dataset_name, data in datasets.items():
    try:
        results = train_and_evaluate_dataset(dataset_name, data)
        all_results.append(results)
    except Exception as e:
        print(f"\nError training {dataset_name}: {str(e)}")
        print(f"Skipping to next dataset...\n")
        continue

print("\n\nAll training complete!")

## 9. Consolidated Results

View all results in a single table.

In [None]:
# Create consolidated results table
if len(all_results) > 0:
    results_df = pd.DataFrame(all_results)
    
    # Reorder columns for readability
    column_order = [
        'dataset',
        'log_loss',
        'auc',
        'accuracy',
        'recall_overall',
        'recall_positive',
        'recall_negative',
        'precision_overall',
        'precision_positive',
        'precision_negative',
        'f1_overall',
        'f1_positive',
        'f1_negative'
    ]
    
    results_df = results_df[column_order]
    
    # Save consolidated results
    results_df.to_csv('./results/consolidated_results.csv', index=False)
    
    # Display
    print("\n" + "="*100)
    print("CONSOLIDATED RESULTS - ALL DATASETS")
    print("="*100)
    print(results_df.to_string(index=False))
    print("="*100)
    
    # Summary statistics
    print("\n\nSummary Statistics Across Datasets:")
    print("-" * 60)
    summary = results_df.drop('dataset', axis=1).describe().loc[['mean', 'std', 'min', 'max']]
    print(summary.to_string())
    
else:
    print("No results to display. Training may have failed.")

## 10. Additional Analysis (Optional)

Compare with baselines, analyze errors, etc.
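
A simple reference point for the metrics reported earlier is a classifier that ignores the features and always predicts the training-set positive rate. The following sketch computes its accuracy and log loss from label lists. `prior_baseline` is a hypothetical helper and the labels are toy values, not results from these datasets.

```python
import math

def prior_baseline(train_labels, test_labels, eps=1e-15):
    """Accuracy and log loss of predicting the training positive rate for every test row."""
    p = sum(train_labels) / len(train_labels)
    p = min(max(p, eps), 1 - eps)  # clip so the logarithms stay finite
    majority = 1 if p >= 0.5 else 0
    accuracy = sum(1 for y in test_labels if y == majority) / len(test_labels)
    log_loss = -sum(
        math.log(p) if y == 1 else math.log(1 - p) for y in test_labels
    ) / len(test_labels)
    return {"positive_rate": p, "accuracy": accuracy, "log_loss": log_loss}

# Toy labels for illustration only.
baseline = prior_baseline(train_labels=[0, 0, 0, 1], test_labels=[0, 1, 0, 0])
print(baseline)
```

A constant predictor has an AUC of 0.5, so a fine-tuned model is only informative to the extent that it beats this log loss and that AUC.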

In [None]:
# Plot results comparison
import matplotlib.pyplot as plt

if len(all_results) > 0:
    fig, axes = plt.subplots(2, 2, figsize=(15, 10))
    fig.suptitle('TabLLM Results Across Datasets', fontsize=16, fontweight='bold')
    
    metrics_to_plot = [
        ('accuracy', 'Accuracy'),
        ('auc', 'AUC'),
        ('f1_overall', 'F1 Score (Overall)'),
        ('log_loss', 'Log Loss')
    ]
    
    for idx, (metric, title) in enumerate(metrics_to_plot):
        ax = axes[idx // 2, idx % 2]
        
        values = results_df[metric].values
        datasets_names = results_df['dataset'].values
        
        bars = ax.bar(range(len(values)), values, color='steelblue', alpha=0.7)
        ax.set_xticks(range(len(values)))
        ax.set_xticklabels(datasets_names, rotation=45, ha='right')
        ax.set_ylabel(title)
        ax.set_title(title)
        ax.grid(axis='y', alpha=0.3)
        
        # Add value labels on bars
        for i, (bar, val) in enumerate(zip(bars, values)):
            height = bar.get_height()
            ax.text(bar.get_x() + bar.get_width()/2., height,
                   f'{val:.3f}', ha='center', va='bottom', fontsize=9)
    
    plt.tight_layout()
    plt.savefig('./results/metrics_comparison.png', dpi=300, bbox_inches='tight')
    plt.show()
    
    print("\nPlot saved to ./results/metrics_comparison.png")
else:
    print("No results to plot.")

## Summary

This notebook adapts the TabLLM approach:

1. **Text Serialization**: Converted tabular rows to natural language using "The [column] is [value]" format
2. **Prompting**: Created task-specific classification prompts following TabLLM's template structure
3. **Fine-tuning**: Used T0-3B with LoRA adapters (similar efficiency to their IA3 approach)
4. **Training**: Fine-tuned on the full training data (rather than TabLLM's few-shot setting) with TabLLM's learning rate and batch size (lr=0.003, batch_size=4)
5. **Evaluation**: Computed comprehensive metrics across all datasets

**Key Adaptations for Colab:**
- Used 8-bit quantization for memory efficiency
- LoRA instead of IA3 (more widely available, similar efficiency)
- Batch size 4 with no gradient accumulation, sized for a T4 GPU

**Results:**
See the consolidated table above and individual dataset results in `./results/` folder.