# XiYan-SQL Training on Google Colab

This notebook is a step-by-step guide to training the XiYan-SQL model on Google Colab.

## Prerequisites
- Upload your model files to Google Drive (e.g., `Qwen2.5-Coder-3B-Instruct` folder)
- Upload your dataset files to Google Drive (raw data, processed data, or both)
- Enable GPU runtime in Colab (Runtime → Change runtime type → GPU)
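
Since Step 7 turns on `bf16` and flash attention, it can help to confirm which GPU the runtime actually provides. The following is a minimal sketch, assuming PyTorch is already present in the runtime (it usually is on Colab); the helper name `describe_gpu_support` is hypothetical. It relies on the fact that native bfloat16 and FlashAttention-2 need compute capability 8.0 (Ampere) or newer, so a T4 (7.5) would need both options set to `False`.

```python
def describe_gpu_support(capability):
    """Map a CUDA compute capability (major, minor) to the features Step 7 relies on."""
    major, _minor = capability
    ampere_or_newer = major >= 8  # Ampere starts at compute capability 8.0
    return {"bf16": ampere_or_newer, "flash_attention": ampere_or_newer}


try:
    import torch

    if torch.cuda.is_available():
        cap = torch.cuda.get_device_capability(0)
        print(torch.cuda.get_device_name(0), cap, describe_gpu_support(cap))
    else:
        print("No CUDA GPU visible; enable a GPU runtime first")
except ImportError:
    print("PyTorch is not installed yet; rerun this check after Step 1")
```

If either value comes back `False`, set `use_flash_attention` and `bf16` accordingly in the Step 7 configuration.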

## Step 1: Install Dependencies

Install all required packages for XiYan-SQL training.

In [None]:
# Install system dependencies
!apt-get update -qq
!apt-get install -y -qq libaio-dev  # Required for DeepSpeed

# Install Python packages
# Quote every version specifier: unquoted, the shell treats ">" as output redirection
!pip install -q "accelerate>=1.12.0"
!pip install -q "datasets>=3.0.0"
!pip install -q "deepspeed>=0.18.4"
!pip install -q "llama-index>=0.9.6.post2"
!pip install -q "markupsafe==2.1.3"  # Pin to <3.0
!pip install -q "modelscope>=1.33.0"
!pip install -q "mysql-connector-python>=9.5.0"
!pip install -q "ninja>=1.13.0"
!pip install -q "numpy>=1.23.0,<2.0"
!pip install -q "packaging>=24.1"
!pip install -q "pandas>=2.3.3"
!pip install -q "peft==0.11.1"
!pip install -q "protobuf>=6.33.3"
!pip install -q "psycopg2-binary>=2.9.11"
!pip install -q "sentencepiece>=0.2.1"
!pip install -q "setuptools>=70.2.0"
!pip install -q "sqlalchemy>=2.0.45"
!pip install -q "sqlglot>=28.5.0"
!pip install -q "swanlab>=0.7.6"
!pip install -q "textdistance>=4.6.3"
!pip install -q "torch==2.9.0" --index-url https://download.pytorch.org/whl/cu126
!pip install -q "torchaudio==2.9.0" --index-url https://download.pytorch.org/whl/cu126
!pip install -q "torchvision==0.24.0" --index-url https://download.pytorch.org/whl/cu126
!pip install -q "transformers==4.42.3"
!pip install -q "wheel>=0.45.1"

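# Install flash-attn (optional, for faster attention)
# Note: This may take a while to compile
# A `!pip` shell command never raises a Python exception, so check the exit code instead
import subprocess
import sys

result = subprocess.run(
    [sys.executable, "-m", "pip", "install", "-q", "flash-attn", "--no-build-isolation"]
)
if result.returncode == 0:
    print("✅ flash-attn installed successfully")
else:
    print("⚠️  flash-attn installation failed, continuing without it")

print("\n✅ Dependency installation finished (check the pip output above for errors)")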
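
After the install cell finishes, a quick look at what actually got installed can catch silent failures before training starts. This is a small sketch using the standard library's `importlib.metadata`; the helper name `installed_versions` and the choice of packages to check are illustrative, not part of the XiYan-SQL repository.

```python
from importlib import metadata


def installed_versions(package_names):
    """Return {name: installed version, or None if the package is missing}."""
    versions = {}
    for name in package_names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


# Packages that the install cell pins to exact versions
for name, version in installed_versions(["torch", "transformers", "peft", "markupsafe"]).items():
    print(f"{name}: {version or 'NOT INSTALLED'}")
```

Compare the printed versions with the pins in the install cell; a mismatch usually means a later `pip install` upgraded or downgraded a shared dependency.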

## Step 2: Clone Repository

Clone the XiYan-SQL repository to Colab.

In [None]:
# Change to content directory
import os
import sys
os.chdir('/content')

# Clone the repository
# ⚠️ Replace REPO_URL with your own repository URL if you use a fork
REPO_URL = "https://github.com/rezaarrazi/XiYan-SQL.git"

if not os.path.exists('XiYan-SQL'):
    exit_status = os.system(f'git clone {REPO_URL}')
    if exit_status != 0:
        raise RuntimeError(f"git clone failed with status {exit_status}")
    print("✅ Repository cloned successfully")
else:
    print("✅ Repository already exists")

# Navigate to training directory
os.chdir('XiYan-SQL/XiYan-SQLTraining')

# Add to Python path so imports work correctly
TRAINING_DIR = os.getcwd()
if TRAINING_DIR not in sys.path:
    sys.path.insert(0, TRAINING_DIR)
if os.path.dirname(TRAINING_DIR) not in sys.path:
    sys.path.insert(0, os.path.dirname(TRAINING_DIR))

print(f"\n📁 Current directory: {os.getcwd()}")
print("✅ Python path configured")
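
The later steps call three scripts by relative path (`data/data_processing.py`, `data/data_assembler.py`, and `train/sft4xiyan.py`), so it is worth confirming they exist in the cloned tree. A minimal sketch, assuming the current directory is the training directory; `missing_files` is a hypothetical helper.

```python
import os


def missing_files(root, relative_paths):
    """Return the relative paths that do not exist as files under root."""
    return [p for p in relative_paths if not os.path.isfile(os.path.join(root, p))]


# Script paths taken from the commands used in Steps 6 and 8
EXPECTED_SCRIPTS = [
    "data/data_processing.py",
    "data/data_assembler.py",
    "train/sft4xiyan.py",
]

missing = missing_files(os.getcwd(), EXPECTED_SCRIPTS)
print("Missing scripts:", missing if missing else "none")
```

If anything is reported missing, the clone probably points at a fork with a different layout, or the working directory is not `XiYan-SQL/XiYan-SQLTraining`.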

## Step 3: Mount Google Drive

Mount your Google Drive to access model and dataset files.

In [None]:
from google.colab import drive

# Mount Google Drive
drive.mount('/content/drive')

print("✅ Google Drive mounted successfully")
print("\n📂 Drive path: /content/drive/MyDrive")

## Step 4: Copy Model from Google Drive

Copy your pre-downloaded model from Google Drive to the model directory.

**Configured Path:** `My Drive/Xiyan-SQL/Models/Qwen/`

The script will automatically detect and copy the model folder(s) from this location.

In [None]:
import shutil
import os

# Path to your model in Google Drive
MODEL_DRIVE_PATH = "/content/drive/MyDrive/Xiyan-SQL/Models/Qwen"

# Target directory in the repository
MODEL_TARGET_DIR = "train/model/Qwen"

# Create target directory if it doesn't exist
os.makedirs(MODEL_TARGET_DIR, exist_ok=True)

# Check if model directory exists in Drive
if os.path.exists(MODEL_DRIVE_PATH):
    print(f"📥 Found model directory at {MODEL_DRIVE_PATH}")
    
    # List contents to see what's inside
    contents = os.listdir(MODEL_DRIVE_PATH)
    print(f"📁 Contents: {contents}")
    
    # Check if it's a single model folder or contains multiple model folders
    model_folders = [item for item in contents if os.path.isdir(os.path.join(MODEL_DRIVE_PATH, item))]
    
    if len(model_folders) == 1:
        # Single model folder - copy it directly
        model_name = model_folders[0]
        source_path = os.path.join(MODEL_DRIVE_PATH, model_name)
        target_path = os.path.join(MODEL_TARGET_DIR, model_name)
        
        if os.path.exists(target_path):
            print(f"⚠️  Model already exists at {target_path}")
            print("Skipping copy (delete manually if you want to re-copy)")
        else:
            print(f"📥 Copying model '{model_name}' from {source_path}...")
            shutil.copytree(source_path, target_path)
            print(f"✅ Model copied to {target_path}")
        
        MODEL_PATH = target_path
    else:
        # Multiple folders or files - copy the entire Qwen directory
        target_path = MODEL_TARGET_DIR
        if os.path.exists(target_path) and os.listdir(target_path):
            print(f"⚠️  Model directory already exists at {target_path}")
            print("Skipping copy (delete manually if you want to re-copy)")
        else:
            print(f"📥 Copying all models from {MODEL_DRIVE_PATH}...")
            for item in contents:
                source_item = os.path.join(MODEL_DRIVE_PATH, item)
                target_item = os.path.join(target_path, item)
                if os.path.isdir(source_item):
                    if not os.path.exists(target_item):
                        shutil.copytree(source_item, target_item)
                        print(f"  ✅ Copied {item}")
                else:
                    if not os.path.exists(target_item):
                        shutil.copy2(source_item, target_item)
                        print(f"  ✅ Copied {item}")
            print(f"✅ All models copied to {target_path}")
        
        # Set MODEL_PATH to the first model folder found, or let user specify
        if model_folders:
            MODEL_PATH = os.path.join(MODEL_TARGET_DIR, model_folders[0])
            print(f"\n📌 Using model: {MODEL_PATH}")
            print("💡 If you want to use a different model, update MODEL_PATH in Step 7")
        else:
            MODEL_PATH = MODEL_TARGET_DIR
            print(f"\n📌 Model directory: {MODEL_PATH}")
            print("💡 Please specify the exact model folder name in Step 7")
    
    print(f"\n📌 Model path for training: {MODEL_PATH}")
else:
    print(f"❌ Model not found at {MODEL_DRIVE_PATH}")
    print("\nPlease check:")
    print("1. Google Drive is mounted correctly")
    print("2. The path 'My Drive/Xiyan-SQL/Models/Qwen/' exists in your Drive")
    MODEL_PATH = None
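
Model folders are large, and a copy that runs out of local disk fails partway through. The sketch below estimates the size of the Drive folder and compares it with the free space on the Colab disk before copying. The helper names and the 10% safety margin are assumptions chosen for illustration.

```python
import os
import shutil


def dir_size_bytes(path):
    """Sum the sizes of all regular files under path (0 if path does not exist)."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(path):
        for filename in filenames:
            file_path = os.path.join(dirpath, filename)
            if os.path.isfile(file_path):
                total += os.path.getsize(file_path)
    return total


def has_room_for(source_dir, target_parent, safety_factor=1.1):
    """True if target_parent's filesystem can hold source_dir plus a margin."""
    needed = dir_size_bytes(source_dir) * safety_factor
    return shutil.disk_usage(target_parent).free >= needed


drive_model_dir = "/content/drive/MyDrive/Xiyan-SQL/Models/Qwen"  # path used in this step
if os.path.isdir(drive_model_dir):
    size_gb = dir_size_bytes(drive_model_dir) / 1e9
    print(f"Model folder size: {size_gb:.2f} GB")
    print("Enough local disk space:", has_room_for(drive_model_dir, "/content"))
```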

## Step 5: Copy and Extract Dataset from Google Drive

Copy and extract your dataset zip file from Google Drive.

**Configured Path:** `My Drive/Xiyan-SQL/Dataset/train.zip`

The script will automatically:
1. Copy the zip file from Google Drive
2. Extract it to the appropriate location
3. Verify the extraction was successful

In [None]:
import shutil
import os
import zipfile

# Path to your dataset zip file in Google Drive
DATASET_ZIP_PATH = "/content/drive/MyDrive/Xiyan-SQL/Dataset/train.zip"

# Target directory
DATA_TARGET_DIR = "data/data_warehouse"
os.makedirs(DATA_TARGET_DIR, exist_ok=True)

# Check if zip file exists
if os.path.exists(DATASET_ZIP_PATH):
    print(f"📥 Found dataset zip file at {DATASET_ZIP_PATH}")
    
    # Check if already extracted
    extracted_path = os.path.join(DATA_TARGET_DIR, "train")
    
    if os.path.exists(extracted_path) and os.listdir(extracted_path):
        print(f"✅ Dataset already extracted at {extracted_path}")
        print("Skipping extraction (delete the folder manually if you want to re-extract)")
    else:
        # Copy zip file to local directory first (faster extraction)
        local_zip = "/content/train.zip"
        print("📥 Copying zip file to local storage...")
        shutil.copy2(DATASET_ZIP_PATH, local_zip)
        print("✅ Zip file copied")
        
        # Extract zip file
        print(f"📦 Extracting dataset from {local_zip}...")
        print("⏳ This may take a few minutes depending on file size...")
        
        with zipfile.ZipFile(local_zip, 'r') as zip_ref:
            zip_contents = zip_ref.namelist()
            if zip_contents:
                # Decide where to extract based on the zip's top-level entries
                top_level = {name.split('/')[0] for name in zip_contents}
                if 'data_warehouse' in top_level:
                    # Zip already contains data_warehouse/: extract into its parent
                    # so the files land in DATA_TARGET_DIR instead of a nested copy
                    extract_root = os.path.dirname(DATA_TARGET_DIR)
                elif 'train' in top_level:
                    # Zip contains train/: extract into data_warehouse
                    extract_root = DATA_TARGET_DIR
                else:
                    # Zip contains loose files: extract into a new train folder
                    extract_root = extracted_path
                zip_ref.extractall(extract_root)
                print(f"✅ Dataset extracted to {extract_root}")
        
        # Clean up local zip file
        if os.path.exists(local_zip):
            os.remove(local_zip)
        
        print("\n✅ Dataset extraction completed!")
        print(f"📁 Check contents at: {DATA_TARGET_DIR}")
    
    # Verify extraction
    if os.path.exists(extracted_path):
        contents = os.listdir(extracted_path)
        suffix = "..." if len(contents) > 10 else ""
        print(f"\n📋 Extracted contents: {contents[:10]}{suffix}")
else:
    print(f"❌ Dataset zip file not found at {DATASET_ZIP_PATH}")
    print("\nPlease check:")
    print("1. Google Drive is mounted correctly")
    print("2. The file 'My Drive/Xiyan-SQL/Dataset/train.zip' exists in your Drive")
    print("3. The file name matches exactly (case-sensitive)")
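
Where the files end up depends on how the zip was packed, so it can be useful to look at its top-level entries before extracting. A minimal sketch with the standard `zipfile` module; `zip_top_level` and `inspect_zip` are hypothetical helper names.

```python
import os
import zipfile


def zip_top_level(names):
    """Return the sorted, de-duplicated first path component of each archive entry."""
    return sorted({name.split("/")[0] for name in names if name})


def inspect_zip(zip_path):
    """List the top-level entries of a zip file without extracting it."""
    with zipfile.ZipFile(zip_path) as zf:
        return zip_top_level(zf.namelist())


zip_path = "/content/drive/MyDrive/Xiyan-SQL/Dataset/train.zip"  # path used in this step
if os.path.exists(zip_path):
    print("Top-level entries:", inspect_zip(zip_path))
```

A result of `['train']` means the archive unpacks to `data/data_warehouse/train`, which is the layout Step 6 searches first.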

## Step 6: Prepare Training Data

If you have raw data, process it first. If you have processed data, assemble it into training format.

In [None]:
import os
import subprocess
import glob

# Set training directory
TRAINING_DIR = "/content/XiYan-SQL/XiYan-SQLTraining"
os.chdir(TRAINING_DIR)

# Find dataset paths (handle different extraction structures)
DATA_WAREHOUSE_DIR = "data/data_warehouse"

# Check for processed data in various possible locations
PROCESSED_DATA_PATTERNS = [
    "data/data_warehouse/train/processed_data/train_nl2sqlite.json",
    "data/data_warehouse/processed_data/train_nl2sqlite.json",
    "data/data_warehouse/*/processed_data/*.json",
]

# Check for raw data in various possible locations
RAW_DATA_PATTERNS = [
    "data/data_warehouse/train/raw_data/train.json",
    "data/data_warehouse/raw_data/train.json",
    "data/data_warehouse/*/raw_data/*.json",
]

PROCESSED_DATA_PATH = None
RAW_DATA_PATH = None

# Find processed data
for pattern in PROCESSED_DATA_PATTERNS:
    matches = glob.glob(pattern)
    if matches:
        PROCESSED_DATA_PATH = matches[0]
        break

# Find raw data
for pattern in RAW_DATA_PATTERNS:
    matches = glob.glob(pattern)
    if matches:
        RAW_DATA_PATH = matches[0]
        break

DB_CONN_CONFIG = "data/data_warehouse/train/db_conn.json"  # ⚠️ You may need to create this

print("📊 Dataset search results:")
print(f"  Processed data: {PROCESSED_DATA_PATH or 'Not found'}")
print(f"  Raw data: {RAW_DATA_PATH or 'Not found'}")

# Step 6a: Process raw data (if raw data exists and processed data doesn't)
if RAW_DATA_PATH and os.path.exists(RAW_DATA_PATH) and (not PROCESSED_DATA_PATH or not os.path.exists(PROCESSED_DATA_PATH)):
    print("📊 Processing raw data...")
    print("⚠️  Note: You need db_conn.json for database connections")
    
    # Create directories
    os.makedirs("data/data_warehouse/train/processed_data", exist_ok=True)
    os.makedirs("data/data_warehouse/train/mschema", exist_ok=True)
    
    # Run data processing
    # Note: You may need to create db_conn.json first
    cmd = [
        "python", "data/data_processing.py",
        "--raw_data_path", RAW_DATA_PATH,
        "--db_conn_config", DB_CONN_CONFIG,
        "--processed_data_dir", "data/data_warehouse/train/processed_data/",
        "--save_mschema_dir", "data/data_warehouse/train/mschema/",
        "--save_to_configs", "data/configs/datasets_all.json"
    ]
    
    try:
        result = subprocess.run(cmd, cwd=TRAINING_DIR, check=True, capture_output=True, text=True)
        print(result.stdout)
        if result.stderr:
            print("Warnings:", result.stderr)
        print("✅ Raw data processed successfully")
    except subprocess.CalledProcessError as e:
        print(f"❌ Error processing data: {e}")
        print(f"Error output: {e.stderr}")
        print("\n⚠️  If you get database connection errors, you can skip this step if you already have processed data")
elif PROCESSED_DATA_PATH and os.path.exists(PROCESSED_DATA_PATH):
    print(f"✅ Processed data already exists at {PROCESSED_DATA_PATH}, skipping processing step")
else:
    print("⚠️  No raw or processed data found.")
    print("Please check:")
    print("1. The dataset zip was extracted correctly in Step 5")
    print("2. The zip file contains the expected folder structure")
    print("3. If you have processed data, proceed to data assembly step")
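
The cell above walks a list of glob patterns in priority order and keeps the first hit. The same lookup can be written as a small reusable function, sketched here under the assumption that the first pattern with any match should win. `first_match` is a hypothetical name, and the matches are sorted because `glob.glob` does not guarantee an order.

```python
import glob


def first_match(patterns):
    """Return the first path matched by the earliest pattern that matches anything."""
    for pattern in patterns:
        matches = sorted(glob.glob(pattern))  # sort for a deterministic choice
        if matches:
            return matches[0]
    return None


# Same priority order as the raw-data patterns in this step
raw_data_path = first_match([
    "data/data_warehouse/train/raw_data/train.json",
    "data/data_warehouse/raw_data/train.json",
    "data/data_warehouse/*/raw_data/*.json",
])
print("Raw data:", raw_data_path or "Not found")
```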

In [None]:
# Step 6b: Assemble training dataset
import os
import subprocess
import json
import glob

# Set training directory
TRAINING_DIR = "/content/XiYan-SQL/XiYan-SQLTraining"
os.chdir(TRAINING_DIR)

# Find processed data (handle different extraction structures)
PROCESSED_DATA_PATTERNS = [
    "data/data_warehouse/train/processed_data/train_nl2sqlite.json",
    "data/data_warehouse/processed_data/train_nl2sqlite.json",
    "data/data_warehouse/*/processed_data/*.json",
]

PROCESSED_DATA_PATH = None
for pattern in PROCESSED_DATA_PATTERNS:
    matches = glob.glob(pattern)
    if matches:
        PROCESSED_DATA_PATH = matches[0]
        break

TRAIN_DATASET_PATH = "train/datasets/nl2sql_standard_train.json"

if PROCESSED_DATA_PATH and os.path.exists(PROCESSED_DATA_PATH):
    # Create dataset config if it doesn't exist
    dataset_config_path = "data/configs/datasets_nl2sql_standard.json"
    
    if not os.path.exists(dataset_config_path):
        # Create a simple dataset config
        config = {
            "train_data": {
                "data_path": PROCESSED_DATA_PATH,
                "sample_num": -1,  # Use all samples
                "task_name": "nl2sqlite",
                "data_aug": False
            }
        }
        os.makedirs("data/configs", exist_ok=True)
        with open(dataset_config_path, 'w') as f:
            json.dump(config, f, indent=2)
        print(f"✅ Created dataset config at {dataset_config_path}")
    
    # Assemble training dataset
    if not os.path.exists(TRAIN_DATASET_PATH):
        print("📦 Assembling training dataset...")
        os.makedirs("train/datasets", exist_ok=True)
        
        cmd = [
            "python", "data/data_assembler.py",
            "--dataset_config_path", dataset_config_path,
            "--save_path", TRAIN_DATASET_PATH
        ]
        
        try:
            result = subprocess.run(cmd, cwd=TRAINING_DIR, check=True, capture_output=True, text=True)
            print(result.stdout)
            if result.stderr:
                print("Warnings:", result.stderr)
            print(f"✅ Training dataset assembled at {TRAIN_DATASET_PATH}")
        except subprocess.CalledProcessError as e:
            print(f"❌ Error assembling data: {e}")
            print(f"Error output: {e.stderr}")
    else:
        print(f"✅ Training dataset already exists at {TRAIN_DATASET_PATH}")
else:
    print("⚠️  Processed data not found.")
    print("Please check:")
    print("1. The dataset zip was extracted correctly in Step 5")
    print("2. The zip file contains processed data files")
    print("3. If you only have raw data, you may need to process it first (Step 6a)")
    print("4. Or manually specify the processed data path")
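
Before moving on, a quick look at the assembled file can confirm that it is not empty. The sketch below assumes the file is a JSON array of objects, which this notebook does not state explicitly; if the structure differs, the summary simply reports the top-level type. `summarize_records` is a hypothetical helper.

```python
import json
import os


def summarize_records(records, preview_keys=5):
    """Describe a loaded JSON value: its type, record count, and a few field names."""
    if not isinstance(records, list):
        return {"type": type(records).__name__, "count": None, "keys": []}
    first = records[0] if records else {}
    keys = sorted(first.keys())[:preview_keys] if isinstance(first, dict) else []
    return {"type": "list", "count": len(records), "keys": keys}


path = "train/datasets/nl2sql_standard_train.json"  # TRAIN_DATASET_PATH from this step
if os.path.exists(path):
    with open(path, encoding="utf-8") as f:
        print(summarize_records(json.load(f)))
```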

## Step 7: Configure Training Parameters

Set up your training configuration. Adjust these parameters based on your GPU memory and requirements.

In [None]:
# Training Configuration
# ⚠️ Adjust these parameters based on your GPU memory

TRAINING_CONFIG = {
    # Experiment ID
    "expr_id": "nl2sql_3b_colab",
    
    # Model path (set in Step 4; falls back to the default if Step 4 was skipped or found no model)
    "model_path": globals().get("MODEL_PATH") or "train/model/Qwen/Qwen2.5-Coder-3B-Instruct",
    
    # Dataset path
    "data_path": "train/datasets/nl2sql_standard_train.json",
    
    # Output directory
    "output_dir": "train/output/dense/nl2sql_3b_colab/",
    
    # Training hyperparameters
    "epochs": 5,
    "learning_rate": 2e-6,
    "weight_decay": 0.1,
    "max_length": 10240,  # Reduce if OOM: 8192 or 4096
    
    # LoRA configuration (recommended for Colab)
    "use_lora": True,
    "lora_r": 512,
    "lora_alpha": 512,  # Usually same as lora_r
    
    # Batch configuration (adjust for your GPU)
    "batch_size": 1,  # Start with 1, increase if memory allows
    "gradient_accumulation_steps": 4,  # Effective batch = batch_size * grad_accum * num_gpus
    
    # Other settings
    "save_steps": 500,
    "group_by_length": True,
    "shuffle": True,
    "use_flash_attention": True,
    "bf16": True,
}

print("📋 Training Configuration:")
for key, value in TRAINING_CONFIG.items():
    print(f"  {key}: {value}")

print("\n💡 Tips:")
print("  - If you get OOM (Out of Memory) errors:")
print("    * Reduce batch_size to 1")
print("    * Reduce max_length to 8192 or 4096")
print("    * Increase gradient_accumulation_steps")
print("  - For faster training:")
print("    * Increase batch_size if memory allows")
print("    * Reduce max_length if not needed")
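
The effective batch size formula in the configuration comment (batch_size × grad_accum × num_gpus) also determines roughly how many optimizer steps a run takes, which in turn tells you how often `save_steps` triggers. A worked sketch follows; the dataset size of 10,000 samples is a made-up figure for illustration, and the step count is an approximation because trainers differ in how they round the last partial batch.

```python
import math


def effective_batch_size(batch_size, gradient_accumulation_steps, num_gpus=1):
    """Samples consumed per optimizer step."""
    return batch_size * gradient_accumulation_steps * num_gpus


def optimizer_steps(num_samples, batch_size, gradient_accumulation_steps, epochs, num_gpus=1):
    """Approximate total optimizer steps for a run."""
    per_step = effective_batch_size(batch_size, gradient_accumulation_steps, num_gpus)
    return math.ceil(num_samples / per_step) * epochs


# Values from the configuration above, with a hypothetical dataset of 10,000 samples
per_step = effective_batch_size(batch_size=1, gradient_accumulation_steps=4)
total = optimizer_steps(num_samples=10_000, batch_size=1, gradient_accumulation_steps=4, epochs=5)
print(f"Effective batch size: {per_step}")
print(f"Approximate optimizer steps: {total}")
print(f"Checkpoints written at save_steps=500: {total // 500}")
```

Halving `batch_size` while doubling `gradient_accumulation_steps` leaves the effective batch size unchanged, which is why that trade is the usual response to OOM errors.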

## Step 8: Start Training

Run the training script with your configuration.

In [None]:
import os
import subprocess
import json

# Set training directory
TRAINING_DIR = "/content/XiYan-SQL/XiYan-SQLTraining"
os.chdir(TRAINING_DIR)

# Create an Accelerate launch config that enables DeepSpeed on a single GPU (Colab typically has 1 GPU)
ds_config = {
    "compute_environment": "LOCAL_MACHINE",
    "distributed_type": "DEEPSPEED",
    "deepspeed_config": {
        "gradient_accumulation_steps": TRAINING_CONFIG["gradient_accumulation_steps"],
        "gradient_clipping": 1.0,
        "offload_optimizer_device": "cpu",  # Offload optimizer state to CPU to save GPU memory
        "offload_param_device": "none",  # Parameter offload applies only to ZeRO stage 3
        "zero3_init_flag": False,
        "zero3_save_16bit_model": False,
        "zero_stage": 2,  # Use ZeRO-2 for efficiency
        "bf16": {
            "enabled": True
        }
    },
    "machine_rank": 0,
    "main_process_ip": None,
    "main_process_port": None,
    "num_machines": 1,
    "num_processes": 1,  # Single GPU in Colab
    "rdzv_backend": "static",
    "same_network": True,
    "tpu_env": [],
    "tpu_use_cluster": False,
    "tpu_use_sudo": False,
    "use_cpu": False
}

# Save DeepSpeed config
os.makedirs("train/config", exist_ok=True)
ds_config_path = "train/config/colab_zero2.json"
with open(ds_config_path, 'w') as f:
    json.dump(ds_config, f, indent=2)

print("🚀 Starting training...")
print(f"📁 Model: {TRAINING_CONFIG['model_path']}")
print(f"📊 Dataset: {TRAINING_CONFIG['data_path']}")
print(f"💾 Output: {TRAINING_CONFIG['output_dir']}")
print("\n⏳ This may take several hours depending on dataset size...")
print("\n" + "="*60)

# Build training command
cmd = [
    "accelerate", "launch",
    "--config_file", ds_config_path,
    "--num_processes", "1",
    "train/sft4xiyan.py",
    "--save_only_model", "True",
    "--resume", "False",
    "--model_name_or_path", TRAINING_CONFIG["model_path"],
    "--data_path", TRAINING_CONFIG["data_path"],
    "--output_dir", TRAINING_CONFIG["output_dir"],
    "--num_train_epochs", str(TRAINING_CONFIG["epochs"]),
    "--per_device_train_batch_size", str(TRAINING_CONFIG["batch_size"]),
    "--gradient_accumulation_steps", str(TRAINING_CONFIG["gradient_accumulation_steps"]),
    "--save_strategy", "steps",
    "--save_steps", str(TRAINING_CONFIG["save_steps"]),
    "--save_total_limit", "3",  # Keep only last 3 checkpoints
    "--learning_rate", str(TRAINING_CONFIG["learning_rate"]),
    "--weight_decay", str(TRAINING_CONFIG["weight_decay"]),
    "--adam_beta2", "0.95",
    "--warmup_ratio", "0.1",
    "--lr_scheduler_type", "cosine",
    "--logging_steps", "10",
    "--report_to", "none",
    "--model_max_length", str(TRAINING_CONFIG["max_length"]),
    "--lazy_preprocess", "False",
    "--gradient_checkpointing", "True",
    "--predict_with_generate", "True",
    "--include_inputs_for_metrics", "True",
    "--use_lora", str(TRAINING_CONFIG["use_lora"]),
    "--lora_r", str(TRAINING_CONFIG["lora_r"]),
    "--lora_alpha", str(TRAINING_CONFIG["lora_alpha"]),
    "--do_shuffle", str(TRAINING_CONFIG["shuffle"]),
    "--torch_compile", "False",
    "--group_by_length", str(TRAINING_CONFIG["group_by_length"]),
    "--model_type", "auto",
    "--use_flash_attention", str(TRAINING_CONFIG["use_flash_attention"]),
    "--bf16",
    "--expr_id", TRAINING_CONFIG["expr_id"]
]

# Run training
try:
    result = subprocess.run(
        cmd,
        cwd=TRAINING_DIR,
        check=False  # Don't raise on error, we'll check return code
    )
    
    if result.returncode == 0:
        print("\n" + "="*60)
        print("✅ Training completed successfully!")
        print(f"📁 Model saved to: {TRAINING_CONFIG['output_dir']}")
    else:
        print("\n" + "="*60)
        print(f"❌ Training failed with return code {result.returncode}")
        print("\nCommon issues:")
        print("  - Out of Memory (OOM): Reduce batch_size or max_length")
        print("  - Model not found: Check MODEL_PATH in Step 4")
        print("  - Dataset not found: Check data_path in Step 6")
except Exception as e:
    print(f"\n❌ Error during training: {e}")
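
With `--save_strategy steps` and `--save_total_limit 3`, the output directory should hold at most three checkpoint folders at any time. The sketch below picks the most recent one. It assumes the training script follows the Hugging Face Trainer naming convention `checkpoint-<step>`, which this notebook does not confirm; `latest_checkpoint_name` is a hypothetical helper.

```python
import os
import re

CHECKPOINT_PATTERN = re.compile(r"^checkpoint-(\d+)$")


def latest_checkpoint_name(names):
    """Return the checkpoint folder name with the highest step number, or None."""
    steps = []
    for name in names:
        match = CHECKPOINT_PATTERN.match(name)
        if match:
            steps.append((int(match.group(1)), name))
    return max(steps)[1] if steps else None


output_dir = "train/output/dense/nl2sql_3b_colab/"  # output_dir from Step 7
if os.path.isdir(output_dir):
    print("Latest checkpoint:", latest_checkpoint_name(os.listdir(output_dir)))
```

Comparing step numbers as integers matters here: a plain string sort would rank `checkpoint-500` above `checkpoint-1500`.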

## Step 9: Save Trained Model to Google Drive (Optional)

After training completes, save your model to Google Drive for future use.

In [None]:
import shutil
import os

# Path to trained model
TRAINED_MODEL_PATH = TRAINING_CONFIG["output_dir"]

# Destination in Google Drive
# ⚠️ UPDATE THIS: Where you want to save the trained model
DRIVE_SAVE_PATH = "/content/drive/MyDrive/trained_models/nl2sql_3b_colab"

if os.path.exists(TRAINED_MODEL_PATH):
    print("📥 Copying trained model to Google Drive...")
    print(f"   From: {TRAINED_MODEL_PATH}")
    print(f"   To: {DRIVE_SAVE_PATH}")
    
    # Create parent directory
    os.makedirs(os.path.dirname(DRIVE_SAVE_PATH), exist_ok=True)
    
    # Copy model
    if os.path.exists(DRIVE_SAVE_PATH):
        shutil.rmtree(DRIVE_SAVE_PATH)
    
    shutil.copytree(TRAINED_MODEL_PATH, DRIVE_SAVE_PATH)
    print(f"\n✅ Model saved to Google Drive: {DRIVE_SAVE_PATH}")
else:
    print(f"⚠️  Trained model not found at {TRAINED_MODEL_PATH}")
    print("Make sure training completed successfully in Step 8.")
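
Copies to a mounted Drive occasionally end up incomplete, so it can be worth comparing the source and destination afterwards. This sketch compares relative file paths and sizes only (not checksums); the helper names are hypothetical.

```python
import os


def file_manifest(root):
    """Map each file's path relative to root onto its size in bytes."""
    manifest = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for filename in filenames:
            full_path = os.path.join(dirpath, filename)
            manifest[os.path.relpath(full_path, root)] = os.path.getsize(full_path)
    return manifest


def copy_differences(source_manifest, dest_manifest):
    """Return source files that are missing from the destination or differ in size."""
    return sorted(
        path for path, size in source_manifest.items() if dest_manifest.get(path) != size
    )


source = "train/output/dense/nl2sql_3b_colab/"  # output_dir from Step 7
dest = "/content/drive/MyDrive/trained_models/nl2sql_3b_colab"  # DRIVE_SAVE_PATH from this step
if os.path.isdir(source) and os.path.isdir(dest):
    problems = copy_differences(file_manifest(source), file_manifest(dest))
    print("Files missing or truncated in Drive:", problems if problems else "none")
```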

## Troubleshooting

### Out of Memory (OOM) Errors
- Reduce `batch_size` to 1
- Reduce `max_length` to 8192 or 4096
- Increase `gradient_accumulation_steps` to maintain effective batch size
- The DeepSpeed config already uses CPU offloading, which helps

### Model Not Found
- Check that `MODEL_DRIVE_PATH` in Step 4 is correct
- Verify the model folder exists in Google Drive
- Ensure the model folder contains all required files (config.json, tokenizer files, etc.)

### Dataset Not Found
- Check that dataset paths in Step 5 are correct
- Verify files exist in Google Drive
- If processing raw data, ensure `db_conn.json` exists

### Training Too Slow
- Colab free tier has limited GPU time
- Consider using Colab Pro for longer training sessions
- Reduce dataset size for testing (set `sample_num` in dataset config)
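
For a quick smoke test, the dataset config created in Step 6b can be copied with a smaller `sample_num`. The sketch below reuses the config structure from that step, where `-1` means "use all samples"; how `data_assembler.py` picks the subset is not covered in this notebook, so treat the exact sampling behaviour as an assumption. `with_sample_limit` is a hypothetical helper.

```python
import copy
import json


def with_sample_limit(config, sample_num):
    """Return a copy of a dataset config with sample_num set on every entry."""
    limited = copy.deepcopy(config)
    for entry in limited.values():
        entry["sample_num"] = sample_num
    return limited


# Same structure as the config written in Step 6b
base_config = {
    "train_data": {
        "data_path": "data/data_warehouse/train/processed_data/train_nl2sqlite.json",
        "sample_num": -1,
        "task_name": "nl2sqlite",
        "data_aug": False,
    }
}

debug_config = with_sample_limit(base_config, 200)
print(json.dumps(debug_config, indent=2))
```

Write the result to a separate config file and assemble it to a different `--save_path`, because Step 6b skips assembly when the target dataset file already exists.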

### Connection Issues
- Colab sessions may disconnect after inactivity
- Save checkpoints frequently (lower `save_steps`) and copy them to Google Drive; `nohup` only helps while the runtime itself stays allocated
- Consider running training in multiple sessions if needed
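
One way to limit the damage from a disconnect is to copy finished checkpoints to Drive as they appear. The sketch below copies any `checkpoint-*` folder that is not yet in the backup location. It assumes Hugging Face Trainer-style folder names and that a checkpoint folder is complete by the time it is copied, which is not guaranteed while training is still writing. The function name is hypothetical and the demo uses a temporary directory instead of real paths.

```python
import os
import shutil
import tempfile


def sync_new_checkpoints(output_dir, backup_dir):
    """Copy checkpoint-* folders missing from backup_dir; return the names copied."""
    if not os.path.isdir(output_dir):
        return []
    os.makedirs(backup_dir, exist_ok=True)
    copied = []
    for name in sorted(os.listdir(output_dir)):
        source = os.path.join(output_dir, name)
        target = os.path.join(backup_dir, name)
        if name.startswith("checkpoint-") and os.path.isdir(source) and not os.path.exists(target):
            shutil.copytree(source, target)
            copied.append(name)
    return copied


# Demo on throwaway directories: the second pass finds nothing new to copy
with tempfile.TemporaryDirectory() as tmp:
    demo_output = os.path.join(tmp, "output")
    demo_backup = os.path.join(tmp, "backup")
    os.makedirs(os.path.join(demo_output, "checkpoint-500"))
    first_pass = sync_new_checkpoints(demo_output, demo_backup)
    second_pass = sync_new_checkpoints(demo_output, demo_backup)

print("First pass copied:", first_pass)
print("Second pass copied:", second_pass)
```

In practice the two arguments would be the Step 7 `output_dir` and a folder under `/content/drive/MyDrive`. Because the Step 8 cell blocks until training ends, the sync would have to run from a background thread or between sessions.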