# Whisper LoRA Training on Google Colab

This notebook prepares a Colab GPU runtime, installs dependencies, and launches the same `scripts/train.py` command used for local training.

## 1. Check GPU availability
Make sure Colab actually gave us a CUDA device.

In [None]:
!nvidia-smi
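
For a programmatic check, the same information can be read from Python. The sketch below is one possible approach, assuming `nvidia-smi` is on the `PATH` when a GPU is attached; the helper names `parse_gpu_query` and `list_gpus` are illustrative and not part of the project.

```python
import shutil
import subprocess


def parse_gpu_query(csv_text):
    """Parse `nvidia-smi --query-gpu=name,memory.total --format=csv,noheader` output."""
    gpus = []
    for line in csv_text.strip().splitlines():
        if "," not in line:
            continue  # skip blank or unexpected lines
        # Split on the last comma so the memory field is always the second part.
        name, memory = [field.strip() for field in line.rsplit(",", 1)]
        gpus.append({"name": name, "memory_total": memory})
    return gpus


def list_gpus():
    """Return the GPUs reported by nvidia-smi, or an empty list if none are visible."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=name,memory.total", "--format=csv,noheader"],
        capture_output=True,
        text=True,
    )
    if result.returncode != 0:
        return []
    return parse_gpu_query(result.stdout)


gpus = list_gpus()
if gpus:
    for gpu in gpus:
        print(f"Found GPU: {gpu['name']} ({gpu['memory_total']})")
else:
    print("No GPU detected. In Colab, choose a GPU under Runtime > Change runtime type.")
```

Failing early here is cheaper than discovering a CPU-only runtime after the dependency install has finished.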

## 2. Pull the project into the Colab VM

Set `REPO_URL` to your fork (or the upstream repo if it contains the latest changes you need). If your project isn’t on GitHub, upload a ZIP via the *Files* pane and replace this cell with the appropriate unzip commands.

In [None]:
import os

REPO_URL = "https://github.com/<your-username>/ChineseTaiwaneseWhisper.git"  # <-- edit me
BRANCH = "main"
PROJECT_DIR = "/content/ChineseTaiwaneseWhisper"

# Reject the unedited placeholder too: its "<" and ">" would be parsed as shell redirections.
if not REPO_URL.startswith("https://github.com/") or "<" in REPO_URL:
    raise ValueError("Please set REPO_URL to your repository URL (https://github.com/...) or upload a ZIP instead.")

if not os.path.exists(PROJECT_DIR):
    !git clone --depth 1 -b $BRANCH $REPO_URL $PROJECT_DIR
else:
    print(f"Skipping clone; {PROJECT_DIR} already exists")

%cd $PROJECT_DIR
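
For the ZIP route mentioned above, the clone cell could be replaced with something along these lines. This is a minimal sketch: the archive path `/content/ChineseTaiwaneseWhisper.zip` and the helper `extract_project` are assumptions for illustration, so adjust them to wherever the upload actually lands.

```python
import os
import zipfile


def extract_project(zip_path, dest_dir):
    """Extract an uploaded project archive into dest_dir and return dest_dir."""
    if not os.path.isfile(zip_path):
        raise FileNotFoundError(f"Upload the archive first: {zip_path} not found")
    os.makedirs(dest_dir, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(dest_dir)
    return dest_dir


ZIP_PATH = "/content/ChineseTaiwaneseWhisper.zip"  # hypothetical upload location
DEST_DIR = "/content/ChineseTaiwaneseWhisper"      # same path as PROJECT_DIR above

if os.path.isfile(ZIP_PATH):
    extract_project(ZIP_PATH, DEST_DIR)
    print(f"Extracted into {DEST_DIR}: {sorted(os.listdir(DEST_DIR))}")
else:
    print(f"No archive at {ZIP_PATH}; upload it via the Files pane first.")
```

Archives downloaded from GitHub usually wrap everything in a top-level folder named after the repository and branch, so the directory to `%cd` into may sit one level below `DEST_DIR`.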

## 3. (Optional) Mount Google Drive for persistent checkpoints
Skip this if you’re fine with storing outputs on the Colab VM’s ephemeral filesystem, which is wiped when the runtime is recycled.

In [None]:
MOUNT_DRIVE = False  # Set to True if you want to store outputs on Google Drive
DRIVE_OUTPUT_DIR = "/content/drive/MyDrive/whisper-medium-taiwanese-lora"

if MOUNT_DRIVE:
    from google.colab import drive
    drive.mount("/content/drive")
    os.makedirs(DRIVE_OUTPUT_DIR, exist_ok=True)
    print(f"Drive mounted. Outputs will be stored under {DRIVE_OUTPUT_DIR}")

## 4. Install dependencies (GPU wheels for PyTorch + project requirements)
This installs the CUDA 12.1 (`cu121`) builds of PyTorch and torchaudio 2.5.1, then the packages listed in `requirements.txt`.

In [None]:
!pip install --upgrade pip setuptools wheel
!pip install torch==2.5.1 torchaudio==2.5.1 --index-url https://download.pytorch.org/whl/cu121
!pip install -r requirements.txt
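
After the install, it can be worth confirming which builds actually ended up in the environment, since `pip install -r requirements.txt` may replace the PyTorch build if the requirements file pins a different version. The following is a small sketch using only the standard library; `installed_versions` is an illustrative helper, not part of the project.

```python
from importlib import metadata


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if missing."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


for name, version in installed_versions(["torch", "torchaudio"]).items():
    print(f"{name}: {version or 'NOT INSTALLED'}")
```

Wheels from the PyTorch `cu121` index typically report a version with a `+cu121` suffix, which makes an accidental swap easy to spot.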

## 5. Configure and launch training
Adjust any of the parameters below before running. The defaults mirror your local command.

In [None]:
MODEL_NAME = "openai/whisper-medium"
LANGUAGE = "chinese"
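# Use the Drive folder only if the optional mount cell ran with MOUNT_DRIVE = True;
# otherwise fall back to a directory inside the project checkout.
OUTPUT_DIR = DRIVE_OUTPUT_DIR if 'DRIVE_OUTPUT_DIR' in globals() and MOUNT_DRIVE else "./whisper-medium-taiwanese-lora"
PREPROCESS_WORKERS = 4
# Backslash-newline inside the f-string joins the lines, so TRAIN_CMD is a single shell line.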
TRAIN_CMD = f"""
python scripts/train.py \
  --model_name_or_path \"{MODEL_NAME}\" \
  --language \"{LANGUAGE}\" \
  --use_peft \
  --peft_method \"lora\" \
  --dataset \"common_voice_13_train\" \
  --dataset_dir \".\" \
  --output_dir \"{OUTPUT_DIR}\" \
  --num_train_epochs 10 \
  --per_device_train_batch_size 4 \
  --gradient_accumulation_steps 4 \
  --learning_rate 1e-5 \
  --fp16 \
  --timestamp False \
  --logging_steps 50 \
  --save_steps 500 \
  --preprocessing_num_workers {PREPROCESS_WORKERS}
""".strip()

print("Training command:\n", TRAIN_CMD)
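
When adjusting the batch flags, it helps to keep the effective batch size in view. The sketch below works through the arithmetic for the defaults above, assuming `scripts/train.py` treats these flags the way the Hugging Face `Trainer` does (one step is one optimizer update) and that a single GPU is used; neither assumption is confirmed by this notebook.

```python
def effective_batch_size(per_device_batch_size, gradient_accumulation_steps, num_devices=1):
    """Number of samples contributing to one optimizer update."""
    return per_device_batch_size * gradient_accumulation_steps * num_devices


# Defaults from the command above: batch size 4, accumulation 4, one GPU.
samples_per_update = effective_batch_size(4, 4)    # 4 * 4 * 1 = 16
# With --save_steps 500, a checkpoint is written every 500 updates.
samples_per_checkpoint = samples_per_update * 500  # 16 * 500 = 8000
print(samples_per_update, samples_per_checkpoint)
```

If a smaller GPU forces `--per_device_train_batch_size` down to 2, raising `--gradient_accumulation_steps` to 8 keeps the same 16 samples per update.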

In [None]:
# Launch training (this will stream logs and the tqdm progress bar)
import os

os.environ.setdefault("TOKENIZERS_PARALLELISM", "false")
os.environ.setdefault("USE_MLFLOW", "false")  # keep MLflow optional unless you configure it explicitly

!{TRAIN_CMD}
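
Colab sessions can disconnect mid-run, so it is useful to know which checkpoint was written last. The sketch below assumes the script saves Hugging Face `Trainer`-style `checkpoint-<step>` folders under the output directory; that layout, and whether `scripts/train.py` offers a way to resume from one, are assumptions to verify against the script itself. `latest_checkpoint` is an illustrative helper.

```python
import os
import re


def latest_checkpoint(output_dir):
    """Return the path of the highest-numbered checkpoint-<step> folder, or None."""
    if not os.path.isdir(output_dir):
        return None
    pattern = re.compile(r"^checkpoint-(\d+)$")
    candidates = []
    for entry in os.listdir(output_dir):
        match = pattern.match(entry)
        if match and os.path.isdir(os.path.join(output_dir, entry)):
            # Compare step numbers as integers so 1000 sorts after 500.
            candidates.append((int(match.group(1)), entry))
    if not candidates:
        return None
    return os.path.join(output_dir, max(candidates)[1])


# Assumed default output directory; use the Drive path instead if it was mounted.
print(latest_checkpoint("./whisper-medium-taiwanese-lora"))
```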