# LLM Fine-Tuning with LangChain on Google Colab

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/ravidsun/llm-finetune/blob/master/notebooks/llm_finetune_colab.ipynb)

Complete notebook for fine-tuning open-source LLMs (Llama, Qwen, Mistral) using Google Colab.

## Features
- ✅ Works on Colab Free (T4) and Pro+ (A100)
- ✅ QLoRA 4-bit training for memory efficiency
- ✅ LangChain document processing
- ✅ Optional QA generation with Claude
- ✅ Data augmentation
- ✅ Save to Google Drive

## Requirements
- Google account
- GPU runtime (T4 free, A100 with Pro+)
- Hugging Face token (for gated models)

## Estimated Time
- Setup: 5-10 minutes
- Training (1K samples, 7B model): 30-60 minutes on T4

---

## Step 1: Check GPU

In [4]:
%pip install -q torch

# Check GPU availability
!nvidia-smi

import torch
print(f"\nPyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    print(f"VRAM: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")
else:
    print("⚠️ No GPU detected! Go to Runtime → Change runtime type → GPU")

/bin/bash: line 1: nvidia-smi: command not found

PyTorch version: 2.9.0+cpu
CUDA available: False
⚠️ No GPU detected! Go to Runtime → Change runtime type → GPU
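
The Step 7 configuration turns `bf16` off because the T4 lacks native support for it. As a rough sketch, that decision can be derived from the GPU's compute capability, assuming native bf16 requires capability 8.0 (Ampere) or newer. Both helper names below are hypothetical and not part of the repository.

```python
def supports_native_bf16(capability):
    """Return True when a (major, minor) compute capability is 8.0 or newer."""
    major, _minor = capability
    return major >= 8


def detect_bf16_support():
    """Query the first CUDA device; returns None when no GPU is visible."""
    import torch  # imported lazily so the helper above works without PyTorch

    if not torch.cuda.is_available():
        return None
    return supports_native_bf16(torch.cuda.get_device_capability(0))


# Known capabilities: T4 is (7, 5), A100 is (8, 0)
print("T4 bf16:", supports_native_bf16((7, 5)))
print("A100 bf16:", supports_native_bf16((8, 0)))
```

On a GPU runtime, `detect_bf16_support()` reports the same answer for whichever card Colab assigned.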


## Step 2: Install Dependencies

In [None]:
%%capture
# Install core ML libraries (takes ~3-5 minutes)
# Version specifiers are quoted so the shell does not treat ">" as output redirection
!pip install -q -U \
    torch torchvision torchaudio \
    "transformers>=4.40.0" \
    "datasets>=2.18.0" \
    "accelerate>=0.27.0" \
    "peft>=0.10.0" \
    "trl>=0.8.0" \
    "bitsandbytes>=0.43.0" \
    "safetensors>=0.4.0"

# Install LangChain ecosystem
!pip install -q -U \
    "langchain>=0.2.0" \
    "langchain-core>=0.2.0" \
    "langchain-community>=0.2.0" \
    "langchain-text-splitters>=0.2.0"

# Install document processing
!pip install -q -U \
    "pymupdf>=1.24.0" \
    "python-docx>=1.1.0"

# Install CLI tools
!pip install -q -U \
    "typer[all]>=0.9.0" \
    "rich>=13.0.0" \
    "pyyaml>=6.0"

print("✅ Dependencies installed!")

In [None]:
# Verify installation
import transformers
import peft
import trl
import langchain

print(f"Transformers: {transformers.__version__}")
print(f"PEFT: {peft.__version__}")
print(f"TRL: {trl.__version__}")
print(f"LangChain: {langchain.__version__}")
print("\n✅ All libraries ready!")

## Step 3: Clone Repository

In [None]:
# Clone repository
!git clone https://github.com/ravidsun/llm-finetune.git
%cd llm-finetune

# Install package
!pip install -q -e .

# Verify CLI
!python -m finetune_project --help

## Step 4: Mount Google Drive (Recommended)

In [None]:
from google.colab import drive
drive.mount('/content/drive')

# Create directories
!mkdir -p /content/drive/MyDrive/llm-finetune/data
!mkdir -p /content/drive/MyDrive/llm-finetune/output
!mkdir -p /content/drive/MyDrive/llm-finetune/configs

print("✅ Google Drive mounted!")

## Step 5: Authentication

In [None]:
import os
from getpass import getpass

# Hugging Face token (required for gated models and for the upload in Step 13)
print("Get your token from: https://huggingface.co/settings/tokens")
hf_token = getpass("Enter your Hugging Face token: ")
os.environ['HF_TOKEN'] = hf_token

# Log in through the Python API so the token is not passed on a shell command line
from huggingface_hub import login
login(token=hf_token)

print("\n✅ Authenticated with Hugging Face!")

In [None]:
# Optional: Anthropic API key (for QA generation)
use_qa_generation = False  # Set to True if you want QA generation

if use_qa_generation:
    anthropic_key = getpass("Enter your Anthropic API key: ")
    os.environ['ANTHROPIC_API_KEY'] = anthropic_key
    print("✅ Anthropic API configured!")
else:
    print("ℹ️ QA generation disabled")

## Step 6: Prepare Training Data

Choose one of the options below:

### Option A: Upload Your Own Data (Single or Multiple JSONL Files)

In [None]:
from google.colab import files

# Create data directory
!mkdir -p data/input

# Upload files (can select multiple JSONL files)
print("Select your training data files...")
print("💡 You can select multiple .jsonl files to upload!")
print("Expected format: {\"instruction\": \"...\", \"input\": \"...\", \"output\": \"...\"}\n")
uploaded = files.upload()

# Move to data/input directory
file_count = 0
for filename in uploaded.keys():
    if filename.endswith('.jsonl'):
        !mv "{filename}" data/input/
        file_count += 1
        print(f"✅ Moved: {filename}")

print(f"\n📁 Uploaded {file_count} JSONL file(s)")
print(f"📍 Location: data/input/")

# Show file info
print("\n📊 File details:")
!ls -lh data/input/*.jsonl

print("\nTotal training examples:")
!wc -l data/input/*.jsonl

print("\nPreview (first 2 lines from first file):")
# -q suppresses the "==> file <==" headers that head prints when given several files
!head -q -n 2 data/input/*.jsonl | head -n 2

print("\n✅ Data ready!")
print("💡 Note: All .jsonl files in data/input/ will be automatically combined")
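
Before training, it can help to confirm that every uploaded line matches the expected `{"instruction", "input", "output"}` format. The sketch below is one way to do that. `validate_jsonl_lines` is a hypothetical helper, and treating all three keys as mandatory is an assumption, since the project's own loader may be more lenient.

```python
import json

REQUIRED_KEYS = {"instruction", "input", "output"}  # assumed to be mandatory


def validate_jsonl_lines(lines):
    """Return (line_number, problem) tuples; an empty list means no issues found."""
    problems = []
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # ignore blank lines
        try:
            record = json.loads(line)
        except json.JSONDecodeError as exc:
            problems.append((number, f"invalid JSON: {exc.msg}"))
            continue
        if not isinstance(record, dict):
            problems.append((number, "record is not a JSON object"))
            continue
        missing = REQUIRED_KEYS - record.keys()
        if missing:
            problems.append((number, f"missing keys: {sorted(missing)}"))
    return problems


sample = ['{"instruction": "Define overfitting", "input": "", "output": "..."}']
print(validate_jsonl_lines(sample))
```

To check a real file, pass an open file handle, for example `validate_jsonl_lines(open("data/input/train.jsonl", encoding="utf-8"))`, where the file name is a placeholder.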

### Option B: Use Sample Data (for testing)

In [None]:
import json

# Create sample dataset
sample_data = [
    {"instruction": "What is machine learning?", "input": "", "output": "Machine learning is a subset of artificial intelligence that enables systems to learn and improve from experience without being explicitly programmed. It uses algorithms to analyze data, identify patterns, and make decisions with minimal human intervention."},
    {"instruction": "Explain neural networks", "input": "", "output": "Neural networks are computing systems inspired by biological neural networks in animal brains. They consist of interconnected nodes (neurons) organized in layers that process and transform input data to produce outputs. Each connection has a weight that adjusts as learning proceeds."},
    {"instruction": "What is deep learning?", "input": "", "output": "Deep learning is a subset of machine learning that uses neural networks with multiple layers (deep neural networks) to progressively extract higher-level features from raw input. It excels at tasks like image recognition, natural language processing, and speech recognition."},
    {"instruction": "Define supervised learning", "input": "", "output": "Supervised learning is a machine learning approach where the algorithm learns from labeled training data. The model is trained on input-output pairs and learns to map inputs to correct outputs, which it can then apply to new, unseen data."},
    {"instruction": "What is reinforcement learning?", "input": "", "output": "Reinforcement learning is a type of machine learning where an agent learns to make decisions by interacting with an environment. The agent receives rewards or penalties for its actions and learns to maximize cumulative rewards over time through trial and error."},
]

# Save to file
!mkdir -p data
with open('data/sample.jsonl', 'w') as f:
    for item in sample_data:
        f.write(json.dumps(item) + '\n')

print(f"✅ Created sample dataset with {len(sample_data)} examples")
!head -n 2 data/sample.jsonl

### Option C: Download from URL

In [None]:
# Download dataset from URL or multiple URLs
# Example: Download one or more JSONL files

# Single file example:
# dataset_url = "YOUR_DATASET_URL_HERE"
# !mkdir -p data/input
# !wget -O data/input/dataset.jsonl "{dataset_url}"

# Multiple files example:
# urls = [
#     "https://example.com/data1.jsonl",
#     "https://example.com/data2.jsonl",
#     "https://example.com/data3.jsonl"
# ]
# !mkdir -p data/input
# for url in urls:
#     filename = url.split('/')[-1]
#     !wget -O data/input/{filename} "{url}"

# Or from Hugging Face dataset
# from datasets import load_dataset
# dataset = load_dataset("your-username/dataset")
# dataset['train'].to_json('data/input/train.jsonl')

# Verify
# !ls -lh data/input/
# !wc -l data/input/*.jsonl

## Step 7: Create Configuration

In [None]:
# Configuration optimized for Colab T4 (15GB VRAM)
config_yaml = """
model:
  model_name: "Qwen/Qwen2.5-7B-Instruct"  # Or: unsloth/Llama-3.2-3B-Instruct
  lora_rank: 16
  lora_alpha: 32
  lora_dropout: 0.05
  target_modules:
    - q_proj
    - v_proj
    - k_proj
    - o_proj
  use_qlora: true  # IMPORTANT: 4-bit quantization for Colab

data:
  input_path: "data/input"  # All .jsonl files here will be combined
  output_path: "processed_data"
  input_type: "json"  # json, pdf, docx, txt

  langchain:
    enabled: false
    qa_generation_enabled: false

  augmentation:
    enabled: false

training:
  num_epochs: 3
  per_device_train_batch_size: 1
  gradient_accumulation_steps: 16

  learning_rate: 2.0e-4
  warmup_ratio: 0.03
  lr_scheduler_type: "cosine"

  max_seq_length: 512  # Reduced for Colab

  gradient_checkpointing: true
  fp16: false
  bf16: false  # T4 doesn't support bf16

  output_dir: "output"
  save_strategy: "epoch"
  save_total_limit: 2

  logging_steps: 10
  report_to: []

  evaluation_strategy: "no"
"""

# Save config
with open('config.yaml', 'w') as f:
    f.write(config_yaml)

print("✅ Configuration created!")
print("\n💡 Note: If you have multiple JSONL files in data/input/,")
print("   they will all be automatically combined during data preparation.")
print("\nConfiguration preview:")
!cat config.yaml
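
To get a feel for what `lora_rank: 16` and `lora_alpha: 32` mean in practice, the sketch below counts the trainable parameters LoRA adds to the four target modules and computes the usual `alpha / rank` scaling factor. The layer shapes are assumptions about Qwen2.5-7B (hidden size 3584, key/value width 512, 28 layers); verify them against the model's `config.json` before relying on the number.

```python
def lora_param_count(layer_shapes, rank, num_layers):
    """Count trainable LoRA parameters.

    Each adapted linear layer of shape (d_in, d_out) gains two matrices:
    A with rank * d_in entries and B with d_out * rank entries.
    """
    per_block = sum(rank * (d_in + d_out) for d_in, d_out in layer_shapes)
    return per_block * num_layers


def lora_scaling(alpha, rank):
    """Standard LoRA scaling factor applied to the low-rank update."""
    return alpha / rank


# ASSUMED shapes for the attention projections of Qwen2.5-7B
assumed_shapes = [
    (3584, 3584),  # q_proj
    (3584, 512),   # k_proj
    (3584, 512),   # v_proj
    (3584, 3584),  # o_proj
]
trainable = lora_param_count(assumed_shapes, rank=16, num_layers=28)
print(f"Trainable LoRA parameters (under the assumed shapes): {trainable:,}")
print(f"LoRA scaling alpha/rank: {lora_scaling(32, 16)}")
```

Under these assumptions the adapter is a small fraction of the base model, which is why the saved `output` directory stays small compared with the full weights.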

## Step 8: Prepare Data

In [None]:
# Prepare training data
!python -m finetune_project prepare-data --config config.yaml

# Check output
!ls -lh processed_data/
!head -n 2 processed_data/train.jsonl
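
How `prepare-data` merges the files in `data/input` is not shown in this notebook. Purely as an illustration of what "combining" could mean, the sketch below concatenates every `.jsonl` file in a directory into one output file. `combine_jsonl` is a hypothetical helper and may differ from what the project actually does.

```python
from pathlib import Path


def combine_jsonl(input_dir, output_file):
    """Concatenate every .jsonl file in input_dir (sorted by name) into output_file.

    Returns the number of non-blank records written.
    """
    input_dir, output_file = Path(input_dir), Path(output_file)
    # Collect sources first so the output file is never read as an input
    sources = sorted(
        p for p in input_dir.glob("*.jsonl") if p.resolve() != output_file.resolve()
    )
    output_file.parent.mkdir(parents=True, exist_ok=True)
    count = 0
    with output_file.open("w", encoding="utf-8") as out:
        for path in sources:
            with path.open(encoding="utf-8") as src:
                for line in src:
                    if line.strip():
                        # Normalize line endings so files without a trailing newline merge cleanly
                        out.write(line.rstrip("\n") + "\n")
                        count += 1
    return count
```

Calling `combine_jsonl("data/input", "combined/train.jsonl")` would write one merged file; the output path here is a placeholder, not the location the CLI uses.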

## Step 9: Train Model

⏱️ Training time: ~30-60 minutes for 1K samples on T4

In [None]:
# Start training
!python -m finetune_project train --config config.yaml
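
To judge progress from the training log, it helps to know roughly how many optimizer steps to expect. The sketch below works this out from the Step 7 values (batch size 1, 16 accumulation steps, 3 epochs) for a 1K-sample dataset. It assumes a single GPU and rounds each epoch up to whole steps; the exact count depends on how the trainer handles the final partial batch.

```python
import math


def optimizer_steps(num_samples, per_device_batch_size, grad_accum_steps,
                    num_epochs, num_devices=1):
    """Estimate (effective batch size, total optimizer updates)."""
    effective_batch = per_device_batch_size * grad_accum_steps * num_devices
    steps_per_epoch = math.ceil(num_samples / effective_batch)
    return effective_batch, steps_per_epoch * num_epochs


effective_batch, total_steps = optimizer_steps(
    num_samples=1000, per_device_batch_size=1, grad_accum_steps=16, num_epochs=3
)
print(f"Effective batch size: {effective_batch}")
print(f"Estimated optimizer steps: {total_steps}")
```

With `logging_steps: 10`, this estimate suggests a log line roughly every 160 samples processed.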

## Step 10: Save Model to Google Drive

In [None]:
# Copy output to Google Drive
!cp -r output /content/drive/MyDrive/llm-finetune/

print("✅ Model saved to Google Drive: /MyDrive/llm-finetune/output/")
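
A failed `!cp` does not stop the cell, so the success message can print even when nothing was copied. As an alternative sketch, the copy can be done in Python so that any failure raises an exception. `backup_dir` is a hypothetical helper, not part of the repository.

```python
import shutil
from pathlib import Path


def backup_dir(src, dst_root):
    """Copy src into dst_root/<src name>, overwriting files from an earlier backup."""
    src = Path(src)
    dst = Path(dst_root) / src.name
    # dirs_exist_ok (Python 3.8+) lets repeated backups reuse the same target
    shutil.copytree(src, dst, dirs_exist_ok=True)
    return dst
```

For this notebook the call would be `backup_dir("output", "/content/drive/MyDrive/llm-finetune")`, which only returns once the copy has completed.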

## Step 11: Test the Model

In [None]:
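import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import PeftModel

# Model name from config
model_name = "Qwen/Qwen2.5-7B-Instruct"
adapter_path = "output"

# Load tokenizer
print("Loading tokenizer...")
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Load base model in 4-bit via BitsAndBytesConfig
# (passing load_in_4bit directly to from_pretrained is deprecated)
print("Loading base model...")
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,  # fp16 compute, since the T4 lacks bf16
)
base_model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True
)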

# Load LoRA adapter
print("Loading LoRA adapter...")
model = PeftModel.from_pretrained(base_model, adapter_path)

print("\n✅ Model loaded! Ready for testing.")

In [None]:
# Test with a prompt
def generate_response(prompt, max_tokens=256):
    # Format prompt (adjust based on your training format)
    formatted_prompt = f"### Instruction:\n{prompt}\n\n### Response:\n"

    inputs = tokenizer(formatted_prompt, return_tensors="pt").to(model.device)

    outputs = model.generate(
        **inputs,
        max_new_tokens=max_tokens,
        temperature=0.7,
        do_sample=True,
        top_p=0.9,
        repetition_penalty=1.1
    )

    response = tokenizer.decode(outputs[0], skip_special_tokens=True)
    # Extract just the response part
    response = response.split("### Response:")[-1].strip()
    return response

# Test
test_prompts = [
    "What is machine learning?",
    "Explain neural networks",
    "What is deep learning?"
]

for prompt in test_prompts:
    print(f"\n{'='*60}")
    print(f"Prompt: {prompt}")
    print(f"{'='*60}")
    response = generate_response(prompt)
    print(f"Response: {response}\n")
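
The training records also carry an `input` field, which the prompt above ignores. A minimal sketch of a prompt builder that includes it when present is shown below. The `### Input:` section name is an assumption, so match it to whatever template the training data was actually formatted with.

```python
def build_prompt(instruction, input_text=""):
    """Format an instruction, plus optional input, in the '### Instruction / ### Response' layout."""
    parts = [f"### Instruction:\n{instruction}"]
    if input_text.strip():
        # ASSUMED section header; adjust to the template used during training
        parts.append(f"### Input:\n{input_text}")
    parts.append("### Response:\n")
    return "\n\n".join(parts)


print(build_prompt("Summarize the text", "LoRA adds low-rank matrices to frozen weights."))
```

With an empty `input_text`, the result is identical to the `formatted_prompt` string used in `generate_response`.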

## Step 12: Download Model (Optional)

In [None]:
from google.colab import files

# Zip and download
!zip -r my-lora-adapter.zip output/
files.download('my-lora-adapter.zip')

print("✅ Model downloaded!")

## Step 13: Upload to Hugging Face (Optional)

In [None]:
from huggingface_hub import HfApi

# Set your model name
repo_name = "your-username/your-model-name"  # Change this!

# Create the repository if it does not exist yet, then upload
api = HfApi()
api.create_repo(repo_id=repo_name, repo_type="model", exist_ok=True, token=hf_token)
api.upload_folder(
    folder_path="output",
    repo_id=repo_name,
    repo_type="model",
    token=hf_token
)

print(f"✅ Model uploaded to: https://huggingface.co/{repo_name}")

## Troubleshooting

### Out of Memory?
- Set `per_device_train_batch_size` to 1 (already the value in the Step 7 config)
- Reduce `max_seq_length` to 256
- Use smaller model: `unsloth/Llama-3.2-3B-Instruct`
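
A back-of-the-envelope estimate shows why 4-bit quantization and a smaller model both help. The sketch below counts memory for the weights alone, using round parameter counts as assumptions. It ignores quantization overhead, activations, LoRA weights, and optimizer state, so real usage is higher.

```python
def weight_memory_gb(num_params, bits_per_param):
    """Approximate memory for model weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


# Round parameter counts, used here only for illustration
for name, params in [("7B model", 7e9), ("3B model", 3e9)]:
    fp16 = weight_memory_gb(params, 16)
    four_bit = weight_memory_gb(params, 4)
    print(f"{name}: ~{fp16:.1f} GB in fp16, ~{four_bit:.1f} GB in 4-bit")
```

A 7B model in fp16 would nearly fill a T4 before training even starts, while the 4-bit version leaves room for activations and the adapter.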

### Training too slow?
- Increase `per_device_train_batch_size` if VRAM allows
- Reduce `max_seq_length`
- Use Colab Pro+ with A100

### Session disconnected?
- Mount Google Drive and save checkpoints
- Resume from checkpoint using `--resume-from-checkpoint`
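
To resume, you first need the path of the most recent checkpoint. The sketch below finds it, assuming checkpoints are saved as `checkpoint-<step>` subdirectories of the output directory (the Hugging Face `Trainer` naming convention). Whether this project follows that layout is an assumption, and `latest_checkpoint` is a hypothetical helper.

```python
import re
from pathlib import Path


def latest_checkpoint(output_dir):
    """Return the checkpoint-<step> subdirectory with the highest step, or None."""
    best_step, best_path = -1, None
    for path in Path(output_dir).glob("checkpoint-*"):
        match = re.fullmatch(r"checkpoint-(\d+)", path.name)
        if path.is_dir() and match and int(match.group(1)) > best_step:
            best_step, best_path = int(match.group(1)), path
    return best_path


print(latest_checkpoint("output"))
```

The step is compared numerically, so `checkpoint-126` is correctly ranked above `checkpoint-63`, which a plain string sort would get wrong.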

---

## Next Steps

🎉 **Congratulations!** You've fine-tuned your LLM!

- Deploy your model: [PHASE5_DEPLOYMENT.md](https://github.com/ravidsun/llm-finetune/blob/master/docs/PHASE5_DEPLOYMENT.md)
- Share on Hugging Face Hub
- Integrate into your application

## Resources

- [Documentation](https://github.com/ravidsun/llm-finetune/tree/master/docs)
- [GitHub Repo](https://github.com/ravidsun/llm-finetune)
- [Colab Guide](https://github.com/ravidsun/llm-finetune/blob/master/docs/COLAB_GUIDE.md)