# Download the dataset

This notebook **automatically checks and downloads** the required datasets from Hugging Face **if they are not already present locally**.

- ✅ If you have already run `Download_datasets.ipynb`, no action is needed.
- ⬇️ If not, the dataset will be downloaded automatically when you run the first cell.
- 🔁 The process is safe to run multiple times and will **only download missing files**.

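⚠️ **Important**:  
Make sure you have a valid Hugging Face access token set in your environment:

```bash
HUGGINGFACE_HUB_TOKEN=your_token_here
```

You can put this line in a `.env` file, which the cell below loads with `load_dotenv()`, or assign the token directly in the code where the `ValueError` is raised.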

In [4]:
# ============================================================
# Dataset bootstrap (safe to run multiple times)
# ============================================================

import os
from pathlib import Path
from dotenv import load_dotenv
from huggingface_hub import hf_hub_download

# ------------------
# Configuration
# ------------------
REPO_ID = "CristianLazoQuispe/pose-action-recognition"
SPLITS = ["Train", "Val", "Test"]

DATASET_DIR = Path("/data/cristian/paper_2025/Testing/")
DATASET_DIR.mkdir(parents=True, exist_ok=True)

REQUIRED_FILES = [
    "ISLR/WLASL/WLASL100/WLASL100_135-Train.hdf5",
    "ISLR/WLASL/WLASL100/WLASL100_135-Val.hdf5",
    "ISLR/WLASL/WLASL100/WLASL100_135-Test.hdf5",
    "ISLR/WLASL/WLASL100/wlasl_100_maplabels.json",
]

# ------------------
# Load Hugging Face token
# ------------------

load_dotenv()
token = os.getenv("HUGGINGFACE_HUB_TOKEN")

if token is None:
    raise ValueError(
        "HUGGINGFACE_HUB_TOKEN not found.\n"
        "Create a .env file with:\n"
        "HUGGINGFACE_HUB_TOKEN=your-token"
    )

# ------------------
# Check which files already exist
# ------------------
missing_files = [
    f for f in REQUIRED_FILES if not (DATASET_DIR / f).exists()
]

if not missing_files:
    print("✅ Dataset already present. Skipping download.")
else:
    print("⬇️  Downloading missing dataset files...")

    for filename in missing_files:
        print(f"Downloading {filename}...")
        hf_hub_download(
            repo_id=REPO_ID,
            filename=filename,
            token=token,
            repo_type="dataset",
            local_dir=DATASET_DIR,
            local_dir_use_symlinks=False
        )

    print("✅ Dataset download completed.")

✅ Dataset already present. Skipping download.




### 🧠 Model, Dataset, and Training Setup

This cell initializes all core components required for training and evaluating the **GCN-BERT model** on the WLASL100 dataset.

**What this cell does:**
- Imports PyTorch, evaluation utilities, and project-specific modules.
- Defines absolute paths to the **Train / Validation / Test** HDF5 files and the label-mapping JSON.
- Builds PyTorch `DataLoader`s with:
  - Data augmentation enabled for training
  - A custom `collate_fn` compatible with the GCN-BERT architecture
- Instantiates the **GCN-BERT** model and automatically selects **GPU (CUDA)** if available.
- Configures the optimizer and loss function for multi-class classification.

**Important notes:**
- The dataset files must exist at the specified paths. If not, run the dataset bootstrap cell at the top of this notebook.
- The batch size (8), learning rate (0.001), and model hyperparameters are starting values and can be adjusted for experimentation.
- GPU acceleration is used automatically when available; otherwise, training falls back to CPU.

This setup cell is required **before running the training or evaluation loops** below.
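Because the loaders are built from hard-coded absolute paths, a quick existence check before constructing them can give a clearer error than a failure inside the dataset class. The following is a minimal sketch: `check_dataset_paths` is a hypothetical helper, not part of the project code, and it relies only on the standard library.

```python
from pathlib import Path


def check_dataset_paths(paths):
    """Raise FileNotFoundError listing every path that is not an existing file.

    Hypothetical helper for illustration; not part of the project code.
    """
    missing = [str(p) for p in map(Path, paths) if not p.is_file()]
    if missing:
        raise FileNotFoundError(
            "Missing dataset files (run the dataset bootstrap cell first):\n"
            + "\n".join(missing)
        )
    return True
```

In this notebook it could be called as `check_dataset_paths([train_path, val_path, test_path, map_label_path])` just before the `DataLoader`s are created.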

In [16]:
import h5py
h5py.__version__

'3.15.1'

In [3]:
import os
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader
from sklearn.metrics import accuracy_score
from training.dataset.dataloader import SimpleHDF5Dataset
from training.models.gcn_bert.util import collate_fn_gcnbert
from training.models.gcn_bert.gcn_bert import GCN_BERT


# === Paths
train_path     = "/data/cristian/paper_2025/Testing/ISLR/WLASL/WLASL100/WLASL100_135-Train.hdf5"
val_path       = "/data/cristian/paper_2025/Testing/ISLR/WLASL/WLASL100/WLASL100_135-Val.hdf5"
test_path      = "/data/cristian/paper_2025/Testing/ISLR/WLASL/WLASL100/WLASL100_135-Test.hdf5"

map_label_path = "/data/cristian/paper_2025/Testing/ISLR/WLASL/WLASL100/wlasl_100_maplabels.json"


# === Loaders
train_loader = DataLoader(SimpleHDF5Dataset(train_path, map_label_path, augmentation=True, noise_std=0.01), batch_size=8, shuffle=True, collate_fn=collate_fn_gcnbert)
val_loader   = DataLoader(SimpleHDF5Dataset(val_path, map_label_path), batch_size=8, collate_fn=collate_fn_gcnbert)
test_loader  = DataLoader(SimpleHDF5Dataset(test_path, map_label_path), batch_size=8, collate_fn=collate_fn_gcnbert)

# === Model
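# NOTE: WLASL100 defines 100 sign classes; with num_classes=135 the extra output
# logits go unused unless the label map defines 135 labels.
model = GCN_BERT(num_classes=135, hidden_features=2, seq_len=50, num_joints=135, nhead=5)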
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)

# === Optimizer
optimizer = optim.Adam(model.parameters(), lr=0.001)
criterion = nn.CrossEntropyLoss()



### 🚀 Training, Validation, Early Stopping, and Final Evaluation

This cell executes the **full training pipeline** for the GCN-BERT model, including training, validation, early stopping, and final testing.


**Key configuration parameters:**
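- `EPOCHS`: Maximum number of training epochs (2 in this run).
- `PATIENCE`: Number of consecutive epochs without an improvement in validation accuracy before training stops early. With `EPOCHS = 2` and `PATIENCE = 7`, early stopping cannot trigger in this run.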
- `BEST_MODEL_PATH`: Location where the best model checkpoint is stored.

**Important notes:**
- Early stopping is triggered **only by validation accuracy**, not training accuracy.
- The final test accuracy is computed with the checkpoint that achieved the **best validation accuracy**, so the test set plays no part in model selection.
- GPU acceleration is used automatically if available; otherwise, training runs on CPU.
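The patience rule described above can be isolated from the training loop. The sketch below is illustrative only: `EarlyStopper` is a hypothetical class name, the training cell implements the same logic inline, and the accuracy values are made up.

```python
class EarlyStopper:
    """Track the best validation accuracy and count epochs without improvement.

    Illustrative sketch; the training cell implements this logic inline.
    """

    def __init__(self, patience):
        self.patience = patience
        self.best = 0.0
        self.epochs_no_improve = 0

    def step(self, val_acc):
        """Record one epoch and return (improved, should_stop)."""
        if val_acc > self.best:
            # Strict improvement: this is where a checkpoint would be saved.
            self.best = val_acc
            self.epochs_no_improve = 0
            return True, False
        self.epochs_no_improve += 1
        return False, self.epochs_no_improve >= self.patience


# Made-up validation accuracies, for illustration only.
history = [0.20, 0.25, 0.24, 0.25, 0.23]
stopper = EarlyStopper(patience=3)
stopped_at = None
for epoch, acc in enumerate(history, start=1):
    improved, should_stop = stopper.step(acc)
    if should_stop:
        stopped_at = epoch
        break
```

Because the comparison is strict, an epoch that only ties the best accuracy counts as "no improvement", which matches the `epoch_val_acc > best_val_acc` check in the training cell.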


In [4]:
from tqdm import tqdm
from sklearn.metrics import accuracy_score
import torch
import os

# === Configuration
EPOCHS = 2
PATIENCE = 7
BEST_MODEL_PATH = "../../results/models/gcn_bert/wlasl_best.pth"
os.makedirs(os.path.dirname(BEST_MODEL_PATH), exist_ok=True)

best_val_acc = 0.0
epochs_no_improve = 0

for epoch in range(EPOCHS):
    model.train()
    train_loss, train_preds, train_targets = 0.0, [], []

    loop = tqdm(train_loader, total=len(train_loader), ncols=100, desc=f"Epoch {epoch+1}/{EPOCHS} [Train]")
    for x, y, mask, _ in loop:
        x, y, mask = x.to(device), y.to(device), mask.to(device)
        optimizer.zero_grad()
        out = model(x, mask)
        loss = criterion(out, y)
        loss.backward()
        optimizer.step()

        train_loss += loss.item()
        train_preds.extend(out.argmax(dim=1).cpu().numpy())
        train_targets.extend(y.cpu().numpy())

        acc = accuracy_score(train_targets, train_preds)
        loop.set_postfix(loss=loss.item(), acc=acc)

    epoch_train_acc = accuracy_score(train_targets, train_preds)

    # === Validation
    model.eval()
    val_loss, val_preds, val_targets = 0.0, [], []
    with torch.no_grad():
        val_loop = tqdm(val_loader, total=len(val_loader), ncols=100, desc=f"Epoch {epoch+1}/{EPOCHS} [Val  ]")
        for x, y, mask, _ in val_loop:
            x, y, mask = x.to(device), y.to(device), mask.to(device)
            out = model(x, mask)
            loss = criterion(out, y)

            val_loss += loss.item()
            val_preds.extend(out.argmax(dim=1).cpu().numpy())
            val_targets.extend(y.cpu().numpy())

            acc_val = accuracy_score(val_targets, val_preds)
            val_loop.set_postfix(loss=loss.item(), acc=acc_val)

    epoch_val_acc = accuracy_score(val_targets, val_preds)

    # === Early stopping check
    if epoch_val_acc > best_val_acc:
        best_val_acc = epoch_val_acc
        epochs_no_improve = 0
        torch.save(model.state_dict(), BEST_MODEL_PATH)
        print(f"✅ New best model saved! Val Acc: {best_val_acc:.4f}")
    else:
        epochs_no_improve += 1
        print(f"⏳ No improvement for {epochs_no_improve} epochs")

    if epochs_no_improve >= PATIENCE:
        print(f"🛑 Early stopping at epoch {epoch+1}")
        break

# === Final Test
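# map_location keeps the checkpoint loadable when the current device differs from the one it was saved on
model.load_state_dict(torch.load(BEST_MODEL_PATH, map_location=device))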
model.eval()
test_loss, test_preds, test_targets = 0.0, [], []
with torch.no_grad():
    test_loop = tqdm(test_loader, total=len(test_loader), ncols=100, desc="[TEST]")
    for x, y, mask, _ in test_loop:
        x, y, mask = x.to(device), y.to(device), mask.to(device)
        out = model(x, mask)
        loss = criterion(out, y)

        test_loss += loss.item()
        test_preds.extend(out.argmax(dim=1).cpu().numpy())
        test_targets.extend(y.cpu().numpy())

        acc_test = accuracy_score(test_targets, test_preds)
        test_loop.set_postfix(loss=loss.item(), acc=acc_test)

final_test_acc = accuracy_score(test_targets, test_preds)
# Note: test_loss is the sum of the per-batch losses, not their mean.
print(f"\n✅ [TEST FINAL] Loss: {test_loss:.4f} | Accuracy: {final_test_acc:.4f}")


Epoch 1/2 [Train]: 100%|████████████████████| 181/181 [00:04<00:00, 40.67it/s, acc=0.293, loss=3.06]
Epoch 1/2 [Val  ]: 100%|█████████████████████| 43/43 [00:00<00:00, 132.79it/s, acc=0.275, loss=3.27]

✅ New best model saved! Val Acc: 0.2751

Epoch 2/2 [Train]: 100%|████████████████████| 181/181 [00:04<00:00, 40.63it/s, acc=0.327, loss=2.34]
Epoch 2/2 [Val  ]: 100%|█████████████████████| 43/43 [00:00<00:00, 125.15it/s, acc=0.299, loss=3.74]

✅ New best model saved! Val Acc: 0.2988

[TEST]: 100%|████████████████████████████████| 33/33 [00:00<00:00, 132.89it/s, acc=0.298, loss=4.67]

✅ [TEST FINAL] Loss: 93.6709 | Accuracy: 0.2984
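
The final `Loss` value is the sum of the per-batch losses accumulated in `test_loss`, so it grows with the number of batches. Dividing 93.6709 by the 33 test batches shown in the progress bar gives a mean of roughly 2.84 per batch. A sample-weighted mean is a more comparable figure; the sketch below shows one way to track it. `RunningMean` is a hypothetical helper, not part of the project code, and the numbers in the usage lines are made up.

```python
class RunningMean:
    """Accumulate a sample-weighted mean of per-batch mean losses."""

    def __init__(self):
        self.total = 0.0
        self.count = 0

    def update(self, batch_mean_loss, batch_size):
        # nn.CrossEntropyLoss averages over the batch by default, so multiply
        # by the batch size to recover the summed loss for that batch.
        self.total += batch_mean_loss * batch_size
        self.count += batch_size

    @property
    def mean(self):
        # Return 0.0 before any batch has been recorded.
        return self.total / self.count if self.count else 0.0


# Made-up per-batch losses, for illustration only.
meter = RunningMean()
meter.update(3.0, batch_size=8)  # full batch
meter.update(1.0, batch_size=4)  # smaller final batch
print(f"mean loss per sample: {meter.mean:.4f}")
```

Inside the evaluation loop this would be called as `meter.update(loss.item(), y.size(0))`, with `meter.mean` reported in place of the summed `test_loss`.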



