# ArXivCode: Code Encoder Training on Colab

This notebook trains the code encoder with contrastive learning on Google Colab, using the free GPU tier.

## Setup Instructions

1. **Enable GPU**: Runtime → Change runtime type → GPU (T4)
2. **Run cells in order**
3. **Upload data** when prompted in Step 2
4. **Train the model** in Step 4, then **monitor training** progress in Step 5
5. **Download checkpoints** in Step 6

**Estimated time**: 30-60 minutes for 3 epochs on T4 GPU


## Step 1: Setup Environment


In [None]:
# Mount Google Drive (optional - if you want to save to Drive)
from google.colab import drive
drive.mount('/content/drive')


In [None]:
# Clone repository from GitHub
# Replace with your GitHub repo URL
!git clone https://github.com/your-username/arxivcode.git
%cd arxivcode


In [None]:
# Install dependencies
!pip install torch transformers tqdm arxiv PyGithub python-dotenv


In [None]:
# Verify GPU access
import torch
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    print(f"GPU Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.2f} GB")
else:
    print("⚠️  No GPU detected! Enable GPU: Runtime → Change runtime type → GPU")


## Step 2: Upload Data

Upload your `parsed_pairs.json` file. You can either:
- Upload from local machine (Option 1)
- Load from Google Drive (Option 2)


In [None]:
# Option 1: Upload from local machine
from google.colab import files
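import os
import shutil

print("Upload parsed_pairs.json file...")
uploaded = files.upload()

# Create the target directory and move the uploaded file into place.
# Exactly one file is expected; with several, each would overwrite the last.
os.makedirs('data/processed', exist_ok=True)
if len(uploaded) != 1:
    print(f"⚠️  Expected exactly 1 file, got {len(uploaded)}. Re-run this cell.")
else:
    filename = next(iter(uploaded))
    # shutil.move also handles filenames containing spaces
    shutil.move(filename, 'data/processed/parsed_pairs.json')
    print(f"✓ Uploaded {filename} to data/processed/parsed_pairs.json")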


In [None]:
# Option 2: Load from Google Drive (uncomment if using)
# !mkdir -p data/processed
# !cp /content/drive/MyDrive/arxivcode/data/processed/parsed_pairs.json data/processed/
# print("✓ Loaded from Google Drive")


In [None]:
# Verify data file exists
import os
if os.path.exists('data/processed/parsed_pairs.json'):
    import json
    with open('data/processed/parsed_pairs.json') as f:
        data = json.load(f)
    print(f"✓ Data loaded: {len(data)} pairs")
else:
    print("⚠️  Error: parsed_pairs.json not found!")
    print("Please upload the file in the previous cell.")
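
Beyond counting the pairs, it can help to check which fields each record carries before training. The following is a minimal sketch, assuming the file holds a list of dictionaries; the field names `paper_text` and `code` in the sample are hypothetical placeholders, not the actual schema of `parsed_pairs.json`.

```python
from collections import Counter


def summarize_records(records):
    """Count how often each key appears across a list of dict records."""
    key_counts = Counter()
    for record in records:
        # Skip anything that is not a dict instead of raising
        if isinstance(record, dict):
            key_counts.update(record.keys())
    return dict(key_counts)


# Hypothetical sample; the real field names may differ
sample = [
    {"paper_text": "abstract A", "code": "def f(): pass"},
    {"paper_text": "abstract B"},
]
summary = summarize_records(sample)
print(summary)
```

In the notebook, calling `summarize_records(data)` on the loaded file would show whether any record is missing a field that the others have.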


## Step 3: Prepare Data (if not already done)

If you haven't run the parser yet, run the cells below (uncomment the parser command first). Otherwise, skip to Step 4.


In [None]:
# Parse paper-code pairs (only if you have paper_code_pairs.json)
# !python src/embeddings/paper_code_parser.py


In [None]:
# Setup DataLoaders (creates train/val splits)
!python src/embeddings/data_loader_setup.py
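
For orientation, a train/validation split of the kind mentioned above can be sketched as follows. This is an illustration only, not the logic in `data_loader_setup.py`: the function name `split_pairs`, the 10% validation fraction, and the seed are all assumptions.

```python
import random


def split_pairs(pairs, val_fraction=0.1, seed=42):
    """Shuffle indices with a fixed seed and split into (train, val) lists."""
    indices = list(range(len(pairs)))
    # A dedicated Random instance keeps the split reproducible
    # without touching the global random state.
    random.Random(seed).shuffle(indices)
    n_val = int(len(pairs) * val_fraction)
    val = [pairs[i] for i in indices[:n_val]]
    train = [pairs[i] for i in indices[n_val:]]
    return train, val


# Integers stand in for paper-code pairs here
train, val = split_pairs(list(range(100)))
print(len(train), len(val))
```

Fixing the seed means the same records land in the validation set on every run, which keeps validation losses comparable across experiments.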


## Step 4: Train Model

Training takes roughly 30-60 minutes on a Colab T4 GPU. The training script will:
- Load CodeBERT encoders
- Train with InfoNCE loss
- Save best model automatically
- Log progress to console
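
The InfoNCE objective listed above can be illustrated with a small, dependency-free sketch. It is not the code in `train_code_encoder.py`: the temperature of 0.07, the use of cosine similarity, and the one-directional (text-to-code) form are assumptions chosen for illustration. For each text embedding, the matching code embedding at the same index is the positive and every other code embedding in the batch is a negative.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(text_embs, code_embs, temperature=0.07):
    """Mean InfoNCE loss; text_embs[i] is paired with code_embs[i]."""
    losses = []
    for i, text in enumerate(text_embs):
        logits = [cosine(text, code) / temperature for code in code_embs]
        # Log-sum-exp with the max subtracted for numerical stability
        max_logit = max(logits)
        log_denominator = max_logit + math.log(
            sum(math.exp(z - max_logit) for z in logits)
        )
        # -log(softmax probability assigned to the positive pair)
        losses.append(log_denominator - logits[i])
    return sum(losses) / len(losses)


# Toy 2-D embeddings: correctly paired vs. deliberately mismatched
aligned = info_nce([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
shuffled = info_nce([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
print(f"aligned: {aligned:.6f}, shuffled: {shuffled:.4f}")
```

The loss is near zero when each text embedding is closest to its own code embedding and large when the pairs are mismatched. In PyTorch, the same quantity is typically computed by passing the temperature-scaled similarity matrix to `torch.nn.functional.cross_entropy` with the diagonal indices as targets.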


In [None]:
# Train code encoder
# Adjust parameters as needed:
#   --batch_size: 8 (default) or 16 if GPU memory allows
#   --num_epochs: 3 (default) or more for better results
#   --learning_rate: 2e-5 (default)

!python src/embeddings/train_code_encoder.py \
    --json_path data/processed/parsed_pairs.json \
    --batch_size 8 \
    --num_epochs 3 \
    --learning_rate 2e-5 \
    --checkpoint_dir checkpoints/code_encoder


## Step 5: Monitor Training Progress


In [None]:
# View training history
import json
import os

history_path = 'checkpoints/code_encoder/training_history.json'
if os.path.exists(history_path):
    with open(history_path) as f:
        history = json.load(f)
    
    print("Training Progress:")
    print(f"  Epochs completed: {len(history['train_losses'])}")
    print(f"  Train losses: {history['train_losses']}")
    if history['val_losses']:
        print(f"  Val losses: {history['val_losses']}")
    print(f"  Best val loss: {history['best_val_loss']:.4f}")
    
    # Simple plot
    try:
        import matplotlib.pyplot as plt
        plt.figure(figsize=(10, 5))
        plt.plot(history['train_losses'], label='Train Loss')
        if history['val_losses']:
            plt.plot(history['val_losses'], label='Val Loss')
        plt.xlabel('Epoch')
        plt.ylabel('Loss')
        plt.title('Training Progress')
        plt.legend()
        plt.grid(True)
        plt.show()
    except ImportError:
        print("(Install matplotlib for plots: !pip install matplotlib)")
else:
    print("Training history not found yet. Training may still be running.")
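
To see which epoch produced the lowest validation loss, a small helper like the one below can be applied to `history['val_losses']`. This is a sketch; the function name `best_epoch` and the loss values in the example are hypothetical and chosen only to show the shape of the data.

```python
def best_epoch(val_losses):
    """Return (1-based epoch, loss) for the lowest validation loss, or None if empty."""
    if not val_losses:
        return None
    index = min(range(len(val_losses)), key=val_losses.__getitem__)
    return index + 1, val_losses[index]


# Hypothetical history, shaped like the keys read in the cell above
example_history = {
    "train_losses": [2.1, 1.4, 1.0],
    "val_losses": [1.9, 1.5, 1.6],
}
result = best_epoch(example_history["val_losses"])
print(result)
```

A validation loss that rises while the training loss keeps falling, as in this made-up example, is a common sign of overfitting.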


## Step 6: Save Results

Download your trained model checkpoints to your local machine.


In [None]:
# Option 1: Download checkpoints as tar.gz
from google.colab import files
import os

# Compress checkpoints
!tar -czf checkpoints.tar.gz checkpoints/

# Download (the browser handles the actual file transfer)
files.download('checkpoints.tar.gz')
print("✓ Checkpoint download started!")


In [None]:
# Option 2: Save to Google Drive (uncomment if using)
# !cp -r checkpoints /content/drive/MyDrive/arxivcode/
# print("✓ Saved to Google Drive")


## Step 7: Verify Checkpoints

Check that your model was saved correctly.


In [None]:
# List saved checkpoints
import os

checkpoint_dir = 'checkpoints/code_encoder'
if os.path.exists(checkpoint_dir):
    # Use a name other than `files` to avoid shadowing google.colab.files
    saved_files = sorted(os.listdir(checkpoint_dir))
    print("Saved files:")
    for name in saved_files:
        filepath = os.path.join(checkpoint_dir, name)
        size = os.path.getsize(filepath) / (1024 * 1024)  # MB
        print(f"  {name}: {size:.2f} MB")

    if 'best_model.pt' in saved_files:
        print("\n✓ Best model saved successfully!")
    else:
        print("\n⚠️  Best model not found. Check training logs.")
else:
    print("⚠️  Checkpoint directory not found!")
