# FLAN-T5 Fine-tuning for WISDM Activity Recognition (Google Colab)

This notebook fine-tunes FLAN-T5 on the WISDM accelerometer dataset by converting time-series sensor data into text format for sequence-to-text generation.

**Before running:**
1. Upload your WISDM dataset files to Google Drive
2. Mount Google Drive in the notebook
3. Update the dataset path to point to your uploaded files

## Step 1: Mount Google Drive and Install Dependencies

In [13]:
# Mount Google Drive
from google.colab import drive
drive.mount('/content/drive')

# Install required packages
!pip install -q transformers datasets torch scikit-learn tqdm

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


## Step 2: Import Libraries and Helper Functions

In [14]:
import os
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from datasets import Dataset, DatasetDict
from transformers import (
    AutoTokenizer,
    AutoModelForSeq2SeqLM,
    DataCollatorForSeq2Seq,
    Seq2SeqTrainingArguments,
    Seq2SeqTrainer
)
import torch
from tqdm import tqdm

## Step 3: Load and Process WISDM Dataset

### Define Helper Functions

In [15]:
# ==================== STEP 1: Load WISDM Data ====================
def load_wisdm_data(file_path):
    """
    Load WISDM raw accelerometer data from text file.

    Args:
        file_path (str): Path to WISDM_ar_v1.1_raw.txt

    Returns:
        pd.DataFrame: DataFrame with columns [user, activity, timestamp, x, y, z]
    """
    print("Loading WISDM data...")

    # Read the file line by line and parse
    data = []
    with open(file_path, 'r') as f:
        for line in f:
            # Remove trailing semicolon and whitespace
            line = line.strip().rstrip(';')
            if not line:
                continue

            try:
                parts = line.split(',')
                if len(parts) == 6:
                    user = int(parts[0])
                    activity = parts[1]
                    timestamp = int(parts[2])
                    x = float(parts[3])
                    y = float(parts[4])
                    z = float(parts[5])
                    data.append([user, activity, timestamp, x, y, z])
            except:
                continue

    df = pd.DataFrame(data, columns=['user', 'activity', 'timestamp', 'x', 'y', 'z'])
    print(f"Loaded {len(df)} sensor readings")
    print(f"Activities: {df['activity'].unique()}")
    print(f"Activity distribution:\n{df['activity'].value_counts()}")

    return df

In [16]:
# ==================== STEP 2: Create Sliding Windows ====================
def create_sliding_windows(df, window_size=80, step_size=40):
    """
    Create sliding windows from time-series data for each user and activity.

    Args:
        df (pd.DataFrame): Input dataframe with sensor readings
        window_size (int): Number of samples per window (default: 80 ≈ 4 seconds at 20Hz)
        step_size (int): Step size for sliding window (default: 40, 50% overlap)

    Returns:
        list: List of (window_data, activity_label) tuples
    """
    print(f"\nCreating sliding windows (size={window_size}, step={step_size})...")

    windows = []

    # Group by user and activity to maintain continuity
    for (user, activity), group in tqdm(df.groupby(['user', 'activity'])):
        # Sort by timestamp
        group = group.sort_values('timestamp')

        # Extract sensor values
        sensor_data = group[['x', 'y', 'z']].values

        # Create windows
        for i in range(0, len(sensor_data) - window_size + 1, step_size):
            window = sensor_data[i:i + window_size]
            if len(window) == window_size:
                windows.append((window, activity))

    print(f"Created {len(windows)} windows")
    return windows

In [17]:
# ==================== STEP 3: Convert Windows to Text ====================
def window_to_text(window):
    """
    Convert a numeric sensor window to text format for T5 input.

    Args:
        window (np.ndarray): Array of shape (window_size, 3) with x, y, z values

    Returns:
        str: Text representation of the window
    """
    # Convert sensor readings to text format
    # Use statistics to make input more meaningful and avoid echo patterns
    x_vals = window[:, 0]
    y_vals = window[:, 1]
    z_vals = window[:, 2]

    # Calculate statistics to reduce dimensionality
    x_mean, x_std = float(np.mean(x_vals)), float(np.std(x_vals))
    y_mean, y_std = float(np.mean(y_vals)), float(np.std(y_vals))
    z_mean, z_std = float(np.mean(z_vals)), float(np.std(z_vals))

    # Create a descriptive text input using only statistics
    # Avoid comma-separated numbers that model might echo
    text = (f"Accelerometer data: "
            f"x-axis mean {x_mean:.2f} std {x_std:.2f}, "
            f"y-axis mean {y_mean:.2f} std {y_std:.2f}, "
            f"z-axis mean {z_mean:.2f} std {z_std:.2f}. "
            f"What activity is this?")
    return text


def create_text_dataset(windows):
    """
    Convert windows to text dataset format for T5.

    Args:
        windows (list): List of (window_data, activity_label) tuples

    Returns:
        pd.DataFrame: DataFrame with 'input_text' and 'target_text' columns
    """
    print("\nConverting windows to text format...")

    data = []
    for window, activity in tqdm(windows):
        input_text = window_to_text(window)
        # Make target more explicit for T5
        target_text = activity.strip()
        data.append({'input_text': input_text, 'target_text': target_text})

    df = pd.DataFrame(data)
    print(f"Created {len(df)} text examples")
    print(f"\nSample input: {df['input_text'].iloc[0]}")
    print(f"Sample target: {df['target_text'].iloc[0]}")

    return df


In [18]:
# ==================== STEP 4: Prepare Dataset for T5 ====================
def prepare_dataset(text_df, test_size=0.2, val_size=0.1):
    """
    Split data and create HuggingFace Dataset objects.

    Args:
        text_df (pd.DataFrame): DataFrame with input_text and target_text
        test_size (float): Proportion for test set
        val_size (float): Proportion for validation set (from training set)

    Returns:
        DatasetDict: Dictionary with train, validation, and test datasets
    """
    print("\nSplitting dataset...")

    # First split: train+val vs test
    train_val_df, test_df = train_test_split(
        text_df, test_size=test_size, random_state=42, stratify=text_df['target_text']
    )

    # Second split: train vs val
    train_df, val_df = train_test_split(
        train_val_df, test_size=val_size, random_state=42, stratify=train_val_df['target_text']
    )

    print(f"Train: {len(train_df)}, Validation: {len(val_df)}, Test: {len(test_df)}")

    # Create HuggingFace datasets
    dataset_dict = DatasetDict({
        'train': Dataset.from_pandas(train_df, preserve_index=False),
        'validation': Dataset.from_pandas(val_df, preserve_index=False),
        'test': Dataset.from_pandas(test_df, preserve_index=False)
    })

    return dataset_dict

In [None]:
# ==================== STEP 5: Tokenization ====================
def preprocess_function(examples, tokenizer, max_input_length=512, max_target_length=16):
    """
    Tokenize input and target texts for T5.
    T5 expects: "task_name: input_text"

    Args:
        examples: Batch of examples from dataset
        tokenizer: T5 tokenizer
        max_input_length (int): Max length for input tokens
        max_target_length (int): Max length for target tokens

    Returns:
        dict: Tokenized inputs and labels
    """
    # Add task prefix for T5
    inputs = ["classify activity: " + text for text in examples['input_text']]
    targets = examples['target_text']

    # Tokenize inputs
    model_inputs = tokenizer(
        inputs,
        max_length=max_input_length,
        truncation=True,
        padding="max_length",
        return_tensors=None
    )

    # Tokenize targets
    labels = tokenizer(
        targets,
        max_length=max_target_length,
        truncation=True,
        padding="max_length",
        return_tensors=None
    )

    # Set labels (replace padding token id with -100 to ignore in loss)
    model_inputs['labels'] = labels['input_ids']
    return model_inputs


def tokenize_dataset(dataset_dict, tokenizer):
    """
    Apply tokenization to all splits in the dataset.

    Args:
        dataset_dict (DatasetDict): Dataset with train/val/test splits
        tokenizer: T5 tokenizer

    Returns:
        DatasetDict: Tokenized dataset
    """
    print("\nTokenizing dataset...")

    tokenized_datasets = dataset_dict.map(
        lambda examples: preprocess_function(examples, tokenizer),
        batched=True,
        remove_columns=dataset_dict['train'].column_names
    )

    print("Tokenization complete!")
    return tokenized_datasets


In [None]:
# ==================== STEP 6: Fine-tune FLAN-T5 ====================
def train_model(tokenized_datasets, model, tokenizer, output_dir='./results', num_epochs=5):
    """
    Fine-tune FLAN-T5 model using HuggingFace Trainer.

    Args:
        tokenized_datasets (DatasetDict): Tokenized train/val/test data
        model: FLAN-T5 model
        tokenizer: T5 tokenizer
        output_dir (str): Directory to save model checkpoints
        num_epochs (int): Number of training epochs

    Returns:
        Seq2SeqTrainer: Trained model trainer
    """
    print("\nSetting up training...")

    # Data collator for dynamic padding
    data_collator = DataCollatorForSeq2Seq(
        tokenizer=tokenizer,
        model=model,
        padding=True,
        pad_to_multiple_of=8  # For better GPU utilization
    )

    def compute_metrics(eval_preds):
        """Compute accuracy metrics during evaluation."""
        predictions, labels = eval_preds
        
        # predictions are already token IDs when predict_with_generate=True
        # If they're logits (3D array), we need to take argmax
        if len(predictions.shape) == 3:
            predictions = np.argmax(predictions, axis=-1)
        
        # Decode predictions and labels
        decoded_preds = tokenizer.batch_decode(predictions, skip_special_tokens=True)
        
        # Replace -100 in labels as we can't decode them
        labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
        decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)
        
        # Calculate exact match accuracy
        accuracy = sum([pred.strip().lower() == label.strip().lower() 
                       for pred, label in zip(decoded_preds, decoded_labels)]) / len(decoded_preds)
        
        return {"accuracy": accuracy}

    # Training arguments - optimized for classification
    training_args = Seq2SeqTrainingArguments(
        output_dir=output_dir,
        eval_strategy="epoch",
        save_strategy="epoch",
        learning_rate=5e-5,  # Standard learning rate for fine-tuning
        per_device_train_batch_size=8,  # Reduced batch size for stability
        per_device_eval_batch_size=16,
        num_train_epochs=num_epochs,
        weight_decay=0.01,
        save_total_limit=3,
        predict_with_generate=True,
        fp16=torch.cuda.is_available(),  # Use mixed precision if GPU available
        logging_dir=f'{output_dir}/logs',
        logging_steps=100,
        load_best_model_at_end=True,
        metric_for_best_model="eval_loss",
        report_to="none",  # Disable wandb/tensorboard
        seed=42,  # For reproducibility
        gradient_accumulation_steps=2,  # Accumulate gradients for stability
        optim="adafactor"  # Use adafactor optimizer for better stability
    )

    # Initialize trainer
    trainer = Seq2SeqTrainer(
        model=model,
        args=training_args,
        train_dataset=tokenized_datasets['train'],
        eval_dataset=tokenized_datasets['validation'],
        tokenizer=tokenizer,
        data_collator=data_collator,
        compute_metrics=compute_metrics
    )

    print("Starting training...")
    trainer.train()

    print("\nTraining complete!")
    return trainer


In [None]:
# ==================== STEP 7: Evaluate Model ====================
def evaluate_model(trainer, tokenized_datasets, tokenizer, model):
    """
    Evaluate the fine-tuned model on test set.

    Args:
        trainer (Seq2SeqTrainer): Trained model trainer
        tokenized_datasets (DatasetDict): Dataset with test split
        tokenizer: T5 tokenizer
        model: Fine-tuned model

    Returns:
        dict: Evaluation metrics
    """
    print("\nEvaluating on test set...")

    # Evaluate
    metrics = trainer.evaluate(eval_dataset=tokenized_datasets['test'])
    print(f"Test Loss: {metrics['eval_loss']:.4f}")

    return metrics


def test_predictions(model, tokenizer, test_inputs, num_samples=5):
    """
    Generate predictions for sample inputs.

    Args:
        model: Fine-tuned model
        tokenizer: T5 tokenizer
        test_inputs (list): List of test input texts
        num_samples (int): Number of samples to test

    Returns:
        list: Generated predictions
    """
    print("\n" + "="*60)
    print("Testing predictions on sample inputs...")
    print("="*60)

    model.eval()
    device = next(model.parameters()).device

    predictions = []
    for i, input_text in enumerate(test_inputs[:num_samples]):
        # Add task prefix to match training format
        prefixed_input = "classify activity: " + input_text
        
        # Tokenize input
        inputs = tokenizer(
            prefixed_input,
            return_tensors="pt",
            max_length=512,
            truncation=True
        ).to(device)

        # Generate prediction with constrained decoding
        with torch.no_grad():
            outputs = model.generate(
                input_ids=inputs['input_ids'],
                attention_mask=inputs['attention_mask'],
                max_length=16,
                num_beams=1,
                temperature=0.7,
                top_p=0.9
            )

        # Decode prediction
        prediction = tokenizer.decode(outputs[0], skip_special_tokens=True).strip()
        predictions.append(prediction)

        print(f"\nSample {i+1}:")
        print(f"Input: {input_text[:150]}...")
        print(f"Predicted Activity: {prediction}")

    print("="*60)
    return predictions

## Step 4: Run the Main Pipeline

### Configuration
Update the `WISDM_FILE` path based on where you uploaded the WISDM dataset in Google Drive.
For example, if you uploaded it to `/My Drive/WISDM_ar_v1.1/`, the path would be `/content/drive/My Drive/WISDM_ar_v1.1/WISDM_ar_v1.1_raw.txt`

In [22]:
# ==================== MAIN PIPELINE ====================
print("="*60)
print("FLAN-T5 Fine-tuning for WISDM Activity Recognition")
print("="*60)

# Configuration
# UPDATE THESE PATHS BASED ON YOUR GOOGLE DRIVE STRUCTURE
# If you uploaded to /My Drive/, use: /content/drive/My\ Drive/WISDM_ar_v1.1/WISDM_ar_v1.1_raw.txt
WISDM_FILE = "/content/drive/MyDrive/SIU/Flan-T5/WISDM_ar_v1.1/WISDM_ar_v1.1_raw.txt"
MODEL_NAME = "google/flan-t5-small"  # Using small version for faster training
OUTPUT_DIR = "/content/drive/MyDrive/SIU/Flan-T5/flan-t5-wisdm"
WINDOW_SIZE = 80  # ~4 seconds at 20Hz
STEP_SIZE = 40    # 50% overlap
NUM_EPOCHS = 5    # Increased epochs for better training

# Check if WISDM data exists
if not os.path.exists(WISDM_FILE):
    print(f"Error: WISDM data file not found at {WISDM_FILE}")
    print("Please ensure the WISDM dataset is uploaded to Google Drive at the specified path.")
    print(f"Current file path: {WISDM_FILE}")
else:
    # Step 1: Load WISDM data
    df = load_wisdm_data(WISDM_FILE)

    # Optionally limit data for faster training (remove for full dataset)
    # df = df.groupby('activity').head(10000)

    # Step 2: Create sliding windows
    windows = create_sliding_windows(df, window_size=WINDOW_SIZE, step_size=STEP_SIZE)

    # Step 3: Convert to text dataset
    text_df = create_text_dataset(windows)

    # Optionally limit dataset size for demonstration (remove for full training)
    # text_df = text_df.groupby('target_text').head(500)

    # Step 4: Prepare train/val/test splits
    dataset_dict = prepare_dataset(text_df)

    # Step 5: Load model and tokenizer
    print(f"\nLoading FLAN-T5 model: {MODEL_NAME}...")
    tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
    model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_NAME)

    # Move model to GPU if available
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    print(f"Using device: {device}")
    model = model.to(device)

    # Step 6: Tokenize dataset
    tokenized_datasets = tokenize_dataset(dataset_dict, tokenizer)

    # Step 7: Train model
    trainer = train_model(
        tokenized_datasets,
        model,
        tokenizer,
        output_dir=OUTPUT_DIR,
        num_epochs=NUM_EPOCHS
    )

    # Step 8: Evaluate model
    metrics = evaluate_model(trainer, tokenized_datasets, tokenizer, model)

    # Step 9: Test predictions on samples
    test_inputs = [dataset_dict['test'][i]['input_text'] for i in range(min(5, len(dataset_dict['test'])))]
    test_targets = [dataset_dict['test'][i]['target_text'] for i in range(min(5, len(dataset_dict['test'])))]

    predictions = test_predictions(model, tokenizer, test_inputs)

    # Print accuracy on test samples
    print("\n" + "="*60)
    print("Sample Predictions vs Ground Truth:")
    print("="*60)
    correct = 0
    for i, (pred, target) in enumerate(zip(predictions, test_targets)):
        match = pred.strip().lower() == target.strip().lower()
        if match:
            correct += 1
        print(f"Sample {i+1}: Predicted='{pred.strip()}' | Actual='{target.strip()}' | Match={match}")

    accuracy = (correct / len(predictions)) * 100 if predictions else 0
    print(f"\nSample Accuracy: {accuracy:.1f}%")

    # Save final model
    print(f"\nSaving final model to {OUTPUT_DIR}/final_model...")
    trainer.save_model(f"{OUTPUT_DIR}/final_model")
    tokenizer.save_pretrained(f"{OUTPUT_DIR}/final_model")

    print("\n" + "="*60)
    print("PIPELINE COMPLETE!")
    print("="*60)
    print(f"Model saved to: {OUTPUT_DIR}/final_model")
    print("You can now use this model for activity recognition from accelerometer data.")


FLAN-T5 Fine-tuning for WISDM Activity Recognition
Loading WISDM data...
Loaded 1086465 sensor readings
Activities: ['Jogging' 'Walking' 'Upstairs' 'Downstairs' 'Sitting' 'Standing']
Activity distribution:
activity
Walking       418393
Jogging       336445
Upstairs      122869
Downstairs    100425
Sitting        59939
Standing       48394
Name: count, dtype: int64

Creating sliding windows (size=80, step=40)...


100%|██████████| 179/179 [00:01<00:00, 141.03it/s]


Created 26893 windows

Converting windows to text format...


100%|██████████| 26893/26893 [00:09<00:00, 2760.96it/s]


Created 26893 text examples

Sample input: sensor x_mean:-1.8 x_std:3.3 y_mean:9.8 y_std:4.2 z_mean:2.4 z_std:4.4 readings:-0.1,9.2,-0.3 -0.2,10.0,4.8 -5.0,16.2,6.0 -5.4,7.7,3.1 0.6,9.3,0.8 0.3,12.6,5.4 -3.5,7.2,0.7 0.8,14.1,1.6 2.5,19.2,8.9 5.0,12.9,7.6
Sample target: Downstairs

Splitting dataset...
Train: 19362, Validation: 2152, Test: 5379

Loading FLAN-T5 model: google/flan-t5-small...
Using device: cuda

Tokenizing dataset...


Map:   0%|          | 0/19362 [00:00<?, ? examples/s]

Map:   0%|          | 0/2152 [00:00<?, ? examples/s]

Map:   0%|          | 0/5379 [00:00<?, ? examples/s]

Tokenization complete!

Setting up training...
Starting training...


Epoch,Training Loss,Validation Loss
1,0.0,
2,0.0,
3,0.0,
4,0.0,
5,0.0,



Training complete!

Evaluating on test set...


Test Loss: nan

Testing predictions on sample inputs...

Sample 1:
Input: sensor x_mean:-1.2 x_std:4.2 y_mean:9.0 y_std:8.7 z_mean:1.8 z_std:5.8 readings:-0.9,-0.8,4.8 -7.4,19.6,2.1 2.7,-7.5,-3.7 -6.8,19.0,-8.3 -1.8,3.2,3.9 ...
Predicted Activity: -1.1,-6.2,-3.8

Sample 2:
Input: sensor x_mean:4.0 x_std:2.6 y_mean:9.2 y_std:3.6 z_mean:1.3 z_std:2.7 readings:3.1,9.7,7.5 6.0,2.8,-2.5 5.0,7.5,0.5 7.3,9.7,1.0 4.4,10.9,-0.2 2.0,5.9,...
Predicted Activity: 2.0,16.0,0.3

Sample 3:
Input: sensor x_mean:0.0 x_std:0.0 y_mean:0.0 y_std:0.0 z_mean:0.0 z_std:0.0 readings:0.0,0.0,0.0 0.0,0.0,0.0 0.0,0.0,0.0 0.0,0.0,0.0 0.0,0.0,0.0 0.0,0.0,0.0...
Predicted Activity: x_mean:0.0 x

Sample 4:
Input: sensor x_mean:1.0 x_std:0.0 y_mean:9.9 y_std:0.0 z_mean:0.3 z_std:0.0 readings:1.0,9.9,0.3 1.0,9.9,0.3 1.1,9.9,0.3 1.0,9.9,0.3 0.9,9.9,0.3 1.0,9.9,0.2...
Predicted Activity: 1.0,9.9,0.3

Sample 5:
Input: sensor x_mean:-1.1 x_std:2.8 y_mean:10.0 y_std:4.2 z_mean:0.5 z_std:4.3 readings:5.2,19.0,11.3 -1.4,7.8,-