# Chapter 6: Fine-Tuning Gemma with QLoRA on Vertex AI

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/ayoisio/genai-on-google-cloud/blob/main/chapter-6/colabs/01_gemma_finetuning.ipynb)

## Overview

This notebook demonstrates how to fine-tune the Gemma 7B-it model using QLoRA (Quantized Low-Rank Adaptation) on Google Colab with Vertex AI. We will train the model to act as a financial analyst, summarizing news excerpts into a specific JSON format.

**Learning Goals:**
- Configure QLoRA for efficient fine-tuning on consumer GPUs
- Fine-tune Gemma 7B using the HuggingFace transformers and PEFT libraries
- Evaluate fine-tuned vs. base model performance
- Save the fine-tuned adapter and merged model to Google Cloud Storage
- Understand production deployment options on GCP

## Prerequisites

- Google Colab with GPU runtime (T4, L4, or A100 recommended)
- Google Cloud Project with Vertex AI API enabled
- HuggingFace account with access to Gemma (free, one-time setup)
- Dataset file (`dataset.jsonl`) for training

> **See also**: [Gemma Model Documentation](https://ai.google.dev/gemma) | [QLoRA Paper](https://arxiv.org/abs/2305.14314)

## 1. Setup and Authentication

First, we'll authenticate with Google Cloud and set up the project.

In [None]:
# Check if running in Colab
import sys

IN_COLAB = 'google.colab' in sys.modules

if IN_COLAB:
    print("Running in Google Colab")
else:
    print("Not running in Google Colab. This notebook is optimized for Colab.")

In [None]:
# Authenticate with Google Cloud
if IN_COLAB:
    from google.colab import auth
    auth.authenticate_user()
    print("✓ Authentication successful!")

In [None]:
# Set GCP Project ID and Region
import os

# Prompt for Project ID
PROJECT_ID = input("Enter your GCP Project ID: ")
REGION = input("Enter your GCP Region (e.g., us-central1): ")

# Set environment variables
os.environ['GOOGLE_CLOUD_PROJECT'] = PROJECT_ID
os.environ['GOOGLE_CLOUD_REGION'] = REGION

print(f"\n✓ Project ID: {PROJECT_ID}")
print(f"✓ Region: {REGION}")

In [None]:
# Configure gcloud
!gcloud config set project {PROJECT_ID}
!gcloud config set ai/region {REGION}

## 1.1. Initialize Vertex AI

Initialize the Vertex AI SDK with your project and region.

In [None]:
# Initialize Vertex AI
from google.cloud import aiplatform

aiplatform.init(project=PROJECT_ID, location=REGION)

print(f"✓ Vertex AI initialized")
print(f"  Project: {PROJECT_ID}")
print(f"  Region: {REGION}")

## 1.2. Enable Required APIs

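Enable Vertex AI and related APIs for your project. The commands below are commented out; uncomment and run them only if these APIs are not already enabled.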

In [None]:
# # Enable required APIs
# print("Enabling required GCP APIs...")
# print("This may take a minute...\n")

# # Enable Vertex AI API
# !gcloud services enable aiplatform.googleapis.com --project={PROJECT_ID}

# # Enable Cloud Storage API
# !gcloud services enable storage.googleapis.com --project={PROJECT_ID}

# # Enable Compute Engine API (for GPU resources)
# !gcloud services enable compute.googleapis.com --project={PROJECT_ID}

# print("\n✓ All required APIs enabled!")

## 2. Check GPU Availability

Make sure you're using a GPU runtime. Go to **Runtime > Change runtime type** and select a GPU (T4, L4, or A100).

In [None]:
# Check GPU availability
import torch

if torch.cuda.is_available():
    gpu_name = torch.cuda.get_device_name(0)
    gpu_memory = torch.cuda.get_device_properties(0).total_memory / 1024**3
    print(f"✓ GPU Available: {gpu_name}")
    print(f"✓ GPU Memory: {gpu_memory:.2f} GB")
else:
    print("⚠️  No GPU detected! Please enable GPU in Runtime > Change runtime type")

## 3. Install Required Libraries

In [None]:
# Install required packages
!pip install -q bitsandbytes transformers peft trl accelerate datasets google-cloud-aiplatform huggingface_hub
print("✓ All packages installed successfully!")

## 4. Upload Dataset

Upload your `dataset.jsonl` file using the file browser on the left, or use the code below.

In [None]:
# Optional: Upload dataset file
if IN_COLAB:
    from google.colab import files
    import os

    if not os.path.exists('dataset.jsonl'):
        print("Please upload your dataset.jsonl file:")
        uploaded = files.upload()
        print("✓ Dataset uploaded successfully!")
    else:
        print("✓ Dataset file found!")

In [None]:
# Verify dataset
import json

dataset_path = "dataset.jsonl"

with open(dataset_path, 'r') as f:
    lines = f.readlines()
    num_examples = len(lines)
    print(f"✓ Dataset loaded: {num_examples} examples")

    # Show first example
    if num_examples > 0:
        first_example = json.loads(lines[0])
        print("\nFirst example:")
        print(first_example['text'][:200] + "...")
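
The check above parses only the first line. A minimal sketch of a fuller validation pass follows, assuming every record should be a JSON object with a non-empty string `text` field (the field the cell above prints). `find_invalid_records` and the sample lines are hypothetical and exist only for illustration.

```python
import json


def find_invalid_records(lines, field="text"):
    """Return (line_number, reason) pairs for records that fail basic checks."""
    problems = []
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            problems.append((number, "blank line"))
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError as error:
            problems.append((number, f"invalid JSON: {error.msg}"))
            continue
        if not isinstance(record, dict):
            problems.append((number, "record is not a JSON object"))
            continue
        value = record.get(field)
        if not isinstance(value, str) or not value.strip():
            problems.append((number, f"missing or empty '{field}' field"))
    return problems


# Illustrative records, not taken from the real dataset
sample_lines = [
    '{"text": "an example training record"}',
    '{"prompt": "no text field"}',
    "not json",
]
for number, reason in find_invalid_records(sample_lines):
    print(f"line {number}: {reason}")
```

In the notebook, the same helper could be called on the `lines` list read above before training starts.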

## 5. Fine-Tuning Configuration and Training

Now we'll configure and start the fine-tuning process using QLoRA.

In [None]:
import torch
from datasets import load_dataset
from peft import LoraConfig, PeftModel, get_peft_model
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    TrainingArguments,
)
from trl import SFTTrainer

print("✓ Libraries imported successfully!")

In [None]:
# Model configuration
# Model weights: downloaded from HuggingFace (authentication happens in section 5.1 below)
# Storage: the adapter and merged model are uploaded to Cloud Storage in section 7

model_name = "google/gemma-7b-it"
output_dir = "./gemma-7b-analyst"
adapter_name = "gemma-7b-analyst-adapter"

print(f"Model: {model_name}")
print(f"Output directory: {output_dir}")
print(f"Adapter name: {adapter_name}")
print()
print("✓ Model will be loaded from HuggingFace")
print("✓ Fine-tuned model will be saved to Cloud Storage for deployment")

### 5.1. Authenticate with HuggingFace for Gemma Access

While we use Vertex AI for deployment, Gemma model weights are accessed via HuggingFace. This is a simple one-time authentication.

In [None]:
# Authenticate with HuggingFace to access Gemma
from huggingface_hub import login

print("=" * 70)
print("HUGGINGFACE AUTHENTICATION")
print("=" * 70)
print()
print("To access Gemma model weights, you need a HuggingFace account.")
print()
print("Steps:")
print("1. Go to https://huggingface.co/google/gemma-7b-it")
print("2. Click 'Agree and access repository' (one-time)")
print("3. Go to https://huggingface.co/settings/tokens")
print("4. Create a new token (Read access is sufficient)")
print("5. Copy and paste it below")
print()
print("=" * 70)
print()

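# getpass hides the token so it is not echoed into the notebook output
from getpass import getpass

HF_TOKEN = getpass("Enter your HuggingFace token: ")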

# Login to HuggingFace
login(token=HF_TOKEN, add_to_git_credential=False)

print("\n✓ Authenticated with HuggingFace!")
print("✓ You now have access to Gemma model")

In [None]:
# Alternative: Use Kaggle instead of HuggingFace (OPTIONAL)
# If you prefer Kaggle over HuggingFace, uncomment and run this cell instead

# import os
# import json
#
# print("KAGGLE AUTHENTICATION (Alternative to HuggingFace)")
# print("Get credentials from: https://www.kaggle.com/settings -> API -> Create New Token")
# print()
#
# KAGGLE_USERNAME = input("Kaggle username: ")
# KAGGLE_KEY = input("Kaggle API key: ")
#
# os.environ['KAGGLE_USERNAME'] = KAGGLE_USERNAME
# os.environ['KAGGLE_KEY'] = KAGGLE_KEY
#
# os.makedirs('/root/.kaggle', exist_ok=True)
# with open('/root/.kaggle/kaggle.json', 'w') as f:
#     json.dump({"username": KAGGLE_USERNAME, "key": KAGGLE_KEY}, f)
#
# !chmod 600 /root/.kaggle/kaggle.json
# print("✓ Kaggle configured!")
#
# # transformers cannot load a "kaggle://" URI directly. Download the weights
# # first (for example with the kagglehub package) and point model_name at
# # the downloaded local directory.

print("SKIP THIS CELL - Use HuggingFace authentication (previous cell) unless you prefer Kaggle")

In [None]:
# 1. QLoRA Configuration (4-bit quantization)
print("Configuring 4-bit quantization...")

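# bfloat16 needs an Ampere or newer GPU (compute capability 8.0+, e.g. L4 or A100);
# fall back to float16 on older GPUs such as the T4
use_bf16 = torch.cuda.is_available() and torch.cuda.get_device_capability(0)[0] >= 8
compute_dtype = torch.bfloat16 if use_bf16 else torch.float16

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=compute_dtype,
    bnb_4bit_use_double_quant=False,
)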

print("✓ Quantization config ready")
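
To see why 4-bit loading matters on a Colab GPU, the following sketch estimates the memory needed for the weights alone at different precisions. It assumes a total of roughly 8.5 billion parameters for Gemma 7B and ignores quantization constants, layers that bitsandbytes leaves unquantized, activations, and optimizer state, so treat the numbers as rough lower bounds. `weight_memory_gib` is a hypothetical helper.

```python
def weight_memory_gib(num_params, bits_per_param):
    """Approximate memory needed to hold the weights alone, in GiB."""
    return num_params * bits_per_param / 8 / 1024**3


# Assumption: about 8.5 billion parameters for Gemma 7B (embeddings included)
NUM_PARAMS = 8.5e9

for label, bits in [("16-bit", 16), ("4-bit NF4", 4)]:
    print(f"{label}: ~{weight_memory_gib(NUM_PARAMS, bits):.1f} GiB")
```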

In [None]:
# 2. Load Base Model & Tokenizer
print(f"Loading base model: {model_name}...")
print("This may take a few minutes (downloading ~14GB)...")
print()

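# Gemma is supported natively by transformers, so trust_remote_code is not needed
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto",
)
model.config.use_cache = False  # the KV cache is not useful during training

print("Loading tokenizer...")
tokenizer = AutoTokenizer.from_pretrained(model_name)
# Gemma ships with a dedicated pad token; only fall back to EOS if it is missing
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token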

print("\n✓ Model and tokenizer loaded successfully!")

In [None]:
# 3. LoRA Configuration
print("Configuring LoRA...")

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
print("\n✓ LoRA adapters applied!")
print("\nTrainable parameters:")
model.print_trainable_parameters()
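
The trainable-parameter count printed above can be sanity-checked by hand: each adapted linear layer gains two small matrices, A (rank x in_features) and B (out_features x rank). The sketch below does that arithmetic under assumed Gemma 7B dimensions (hidden size 3072, 16 attention heads of dimension 256, 28 layers). Those dimensions are assumptions, so confirm them against `model.config` before relying on the result.

```python
def lora_param_count(in_features, out_features, rank):
    """Parameters added by one LoRA pair: A (rank x in) plus B (out x rank)."""
    return rank * (in_features + out_features)


# Assumed Gemma 7B dimensions; check model.config before relying on them
HIDDEN_SIZE = 3072
ATTENTION_DIM = 16 * 256  # attention heads x head dimension
NUM_LAYERS = 28
RANK = 16  # matches r=16 in the LoraConfig above

# (in_features, out_features) for each targeted projection
projections = {
    "q_proj": (HIDDEN_SIZE, ATTENTION_DIM),
    "k_proj": (HIDDEN_SIZE, ATTENTION_DIM),
    "v_proj": (HIDDEN_SIZE, ATTENTION_DIM),
    "o_proj": (ATTENTION_DIM, HIDDEN_SIZE),
}

per_layer = sum(lora_param_count(i, o, RANK) for i, o in projections.values())
total = per_layer * NUM_LAYERS
print(f"LoRA parameters per layer: {per_layer:,}")
print(f"LoRA parameters in total:  {total:,}")
```

If the printed total differs from what `print_trainable_parameters()` reports, the assumed dimensions are the first thing to check.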

In [None]:
# 4. Load Dataset
print("Loading training dataset...")

dataset = load_dataset("json", data_files=dataset_path, split="train")

print(f"✓ Dataset loaded: {len(dataset)} examples")

In [None]:
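# 5. Configure Training Arguments
print("Configuring training arguments...")

# Use bf16 mixed precision on Ampere or newer GPUs (L4, A100) and fp16 otherwise (T4)
use_bf16 = torch.cuda.is_available() and torch.cuda.get_device_capability(0)[0] >= 8

training_args = TrainingArguments(
    output_dir=output_dir,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=1,
    learning_rate=2e-4,
    num_train_epochs=3,
    logging_steps=10,
    fp16=not use_bf16,
    bf16=use_bf16,
    save_strategy="epoch",
    report_to="none",  # Disable wandb/tensorboard for Colab
)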

print("✓ Training arguments configured")
print(f"  - Batch size: {training_args.per_device_train_batch_size}")
print(f"  - Learning rate: {training_args.learning_rate}")
print(f"  - Epochs: {training_args.num_train_epochs}")
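
The batch size, gradient accumulation, and epoch settings together determine how many optimizer updates the run performs. The sketch below is an approximation of that arithmetic, not the Trainer's exact bookkeeping. `optimizer_steps` is a hypothetical helper, and the example uses the 102-example dataset size quoted in this notebook's summary.

```python
import math


def optimizer_steps(num_examples, per_device_batch_size, grad_accum_steps, epochs, num_devices=1):
    """Estimate the effective batch size and number of optimizer updates."""
    effective_batch = per_device_batch_size * grad_accum_steps * num_devices
    steps_per_epoch = math.ceil(num_examples / effective_batch)
    return effective_batch, steps_per_epoch * epochs


effective_batch, total_steps = optimizer_steps(
    num_examples=102,  # dataset size quoted in the summary of this notebook
    per_device_batch_size=4,
    grad_accum_steps=1,
    epochs=3,
)
print(f"Effective batch size: {effective_batch}")
print(f"Optimizer steps: {total_steps}")
```

With `logging_steps=10`, a run this short produces only a handful of loss readings, which is worth keeping in mind when judging convergence.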

In [None]:
# 6. Initialize Trainer
print("Initializing trainer...")

# The SFTTrainer signature has changed across TRL releases, so print the version
import trl
print(f"TRL version: {trl.__version__}")

# The model already has the LoRA adapters applied, so no peft_config is passed.
# SFTTrainer infers the tokenizer from the model and trains on the "text" column.
try:
    trainer = SFTTrainer(
        model=model,
        train_dataset=dataset,
        args=training_args,
    )
except TypeError as e:
    print(f"SFTTrainer rejected these arguments ({e}); falling back to Trainer")
    from transformers import DataCollatorForLanguageModeling, Trainer

    # The plain Trainer needs token IDs, so tokenize the "text" column first.
    # max_length=512 is an assumed cap; adjust it to fit your examples.
    tokenized_dataset = dataset.map(
        lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
        batched=True,
        remove_columns=dataset.column_names,
    )
    data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=tokenized_dataset,
        data_collator=data_collator,
    )

print("✓ Trainer initialized and ready!")

In [None]:
# 7. Start Training
print("="*60)
print("Starting fine-tuning...")
print("This will take some time depending on your GPU and dataset size.")
print("="*60)

trainer.train()

print("\n" + "="*60)
print("✓ Training complete!")
print("="*60)

In [None]:
# 8. Save the Adapter
print(f"Saving adapter to {adapter_name}...")

trainer.model.save_pretrained(adapter_name)
tokenizer.save_pretrained(adapter_name)

print("✓ Adapter saved successfully!")

## 6. Evaluation

Let's compare the base model with our fine-tuned model.

In [None]:
# Load the fine-tuned model
print("Loading fine-tuned model for evaluation...")

base_model_eval = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto"
)

# PeftModel.from_pretrained wraps base_model_eval and injects the LoRA layers into it,
# so the two names share the same weights. Keep the adapter unmerged so that it can be
# switched off below to recover base-model output.
tuned_model = PeftModel.from_pretrained(base_model_eval, adapter_name)
tuned_model.eval()

print("✓ Fine-tuned model loaded!")

In [None]:
# Test prompt
test_prompt = "<s> Analyze the following: 'A new startup, InnovateAI, just raised a $50M Series A round, but has no revenue.'"

print("Test Prompt:")
print(test_prompt)
print("\n" + "="*60)

In [None]:
# Base Model Response
print("\n--- BASE MODEL (Gemma 7B IT) ---\n")

inputs = tokenizer(test_prompt, return_tensors="pt").to(tuned_model.device)
# Disable the adapter to get the base model's answer. Both models decode greedily
# (do_sample=False) so the comparison is deterministic.
with torch.no_grad(), tuned_model.disable_adapter():
    outputs = tuned_model.generate(**inputs, max_new_tokens=150, do_sample=False)
base_response = tokenizer.decode(outputs[0], skip_special_tokens=True)

print(base_response)
print("\n" + "="*60)

In [None]:
# Fine-tuned Model Response
print("\n--- FINE-TUNED MODEL (Financial Analyst) ---\n")

inputs = tokenizer(test_prompt, return_tensors="pt").to(tuned_model.device)
with torch.no_grad():
    outputs = tuned_model.generate(**inputs, max_new_tokens=150, do_sample=False)
tuned_response = tokenizer.decode(outputs[0], skip_special_tokens=True)

print(tuned_response)
print("\n" + "="*60)
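
Because the goal is JSON output, a quick programmatic check is more informative than reading the two responses by eye. The sketch below looks for the first parseable JSON object inside a response string. `extract_first_json` is a hypothetical helper and the sample strings are illustrative, not real model output. It checks only that an object parses, not that it has the fields your dataset defines.

```python
import json


def extract_first_json(text):
    """Return the first JSON object embedded in text, or None if there is none."""
    decoder = json.JSONDecoder()
    start = text.find("{")
    while start != -1:
        try:
            parsed, _ = decoder.raw_decode(text, start)
        except json.JSONDecodeError:
            parsed = None
        if isinstance(parsed, dict):
            return parsed
        start = text.find("{", start + 1)
    return None


# Illustrative strings, not real model output
print(extract_first_json('Analysis: {"sentiment": "mixed"} done'))
print(extract_first_json("No structured output here"))
```

Calling it on `base_response` and `tuned_response` gives a simple pass/fail signal for whether fine-tuning taught the model the output format.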

## 7. Save Adapter to Cloud Storage

Upload the fine-tuned LoRA adapter to Google Cloud Storage for safekeeping and future use.

In [None]:
# Create a Cloud Storage bucket (if needed)
BUCKET_NAME = f"{PROJECT_ID}-gemma-finetuning"
BUCKET_URI = f"gs://{BUCKET_NAME}"

print(f"Bucket: {BUCKET_URI}")

# Create bucket
!gsutil mb -l {REGION} {BUCKET_URI} 2>/dev/null || echo "Bucket already exists or error creating bucket"

In [None]:
# Upload adapter to Cloud Storage
MODEL_GCS_PATH = f"{BUCKET_URI}/models/{adapter_name}/"

print(f"Uploading adapter to {MODEL_GCS_PATH}...")
!gsutil -m cp -r {adapter_name}/* {MODEL_GCS_PATH}

print("✓ Adapter uploaded to Cloud Storage!")
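
Outside a notebook, where the `!gsutil` shell escape is unavailable, the same upload can be done with the `google-cloud-storage` Python client. The sketch below is one way to do it, assuming that package is installed and application default credentials are configured. `plan_uploads` and `upload_directory` are hypothetical helper names.

```python
from pathlib import Path


def plan_uploads(local_dir, gcs_prefix):
    """Map every file under local_dir to an object name under gcs_prefix."""
    root = Path(local_dir)
    prefix = gcs_prefix.strip("/")
    return [
        (str(path), f"{prefix}/{path.relative_to(root).as_posix()}")
        for path in sorted(root.rglob("*"))
        if path.is_file()
    ]


def upload_directory(local_dir, bucket_name, gcs_prefix):
    """Upload a directory tree to Cloud Storage, one object per file."""
    from google.cloud import storage  # imported lazily; requires google-cloud-storage

    bucket = storage.Client().bucket(bucket_name)
    for local_path, blob_name in plan_uploads(local_dir, gcs_prefix):
        bucket.blob(blob_name).upload_from_filename(local_path)
        print(f"Uploaded gs://{bucket_name}/{blob_name}")
```

For the adapter saved above, a call might look like `upload_directory(adapter_name, BUCKET_NAME, f"models/{adapter_name}")`.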

In [None]:
# Verify upload and display adapter location
print("="*70)
print("✓ LoRA Adapter Successfully Saved!")
print("="*70)
print()
print(f"📦 Adapter Location: {MODEL_GCS_PATH}")
print()
print("Your fine-tuned adapter is now stored in Cloud Storage.")
print("You can:")
print("  1. Download it locally (see next section)")
print("  2. Copy it from GCS into other notebooks or applications")
print("  3. Share the GCS path with your team")
print()
print("Example - copy the adapter from GCS, then load it from the local directory:")
print(f"  !gsutil -m cp -r {MODEL_GCS_PATH.rstrip('/')} .")
print(f"  adapter = PeftModel.from_pretrained(base_model, './{adapter_name}')")
print()
print("="*70)

### 7.1. Prepare for Vertex AI Deployment

To deploy to Vertex AI Model Registry and Endpoints, we need to merge the adapter with the base model and prepare it properly.

In [None]:
# Merge adapter with base model for deployment
print("Merging LoRA adapter with base model...")
print("This creates a single model ready for deployment")
print()

# Free up GPU memory first - clear ALL models from previous sections
print("Clearing GPU memory...")
import gc

# Drop every reference that may still hold a model on the GPU. The trainer is
# included because it keeps its own reference to the training model.
for name in ("trainer", "model", "base_model_eval", "tuned_model", "deployment_model"):
    globals().pop(name, None)

# Force garbage collection and clear CUDA cache
gc.collect()
torch.cuda.empty_cache()

# Display GPU memory status
if torch.cuda.is_available():
    allocated = torch.cuda.memory_allocated(0) / 1024**3
    reserved = torch.cuda.memory_reserved(0) / 1024**3
    print(f"✓ GPU memory cleared")
    print(f"  Allocated: {allocated:.2f} GB")
    print(f"  Reserved: {reserved:.2f} GB")
print()

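# Load the base model with the same 4-bit quantization used for training so the merge
# fits in Colab GPU memory. Merging LoRA weights into 4-bit layers needs a recent PEFT
# release and can shift outputs slightly because of rounding.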
print("Loading base model for deployment...")
deployment_model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,  # Use same quantization as training
    device_map="auto",
    trust_remote_code=True,
)

# Load and merge the adapter
print("Loading and merging adapter...")
deployment_model = PeftModel.from_pretrained(deployment_model, adapter_name)
deployment_model = deployment_model.merge_and_unload()

print("✓ Adapter merged with base model!")
print()

# Save merged model for deployment
merged_model_dir = "./gemma-7b-analyst-merged"
print(f"Saving merged model to {merged_model_dir}...")
deployment_model.save_pretrained(merged_model_dir)
tokenizer.save_pretrained(merged_model_dir)

print(f"✓ Merged model saved to: {merged_model_dir}")
print()
print("Note: Model is saved in 4-bit quantized format for efficient deployment")


In [None]:
# Upload merged model to Cloud Storage
MERGED_MODEL_GCS_PATH = f"{BUCKET_URI}/models/gemma-7b-analyst-merged/"

print(f"Uploading merged model to {MERGED_MODEL_GCS_PATH}...")
print("This may take several minutes (the 4-bit checkpoint is a few GB)...")
!gsutil -m cp -r {merged_model_dir}/* {MERGED_MODEL_GCS_PATH}

print("\n✓ Merged model uploaded to Cloud Storage!")

### 7.2. Deployment Options

For serving the fine-tuned HuggingFace model, you have several options:

**Option A: Local/Colab Inference** (Recommended for testing)
- Load the merged model directly in notebooks
- Use for development and testing
- No deployment costs

**Option B: Vertex AI Prediction with Custom Container** (Production)
- Create a Docker container with transformers library
- Deploy to Vertex AI Endpoints
- Requires additional setup (see cells below)

**Option C: Cloud Run** (Serverless)
- Wrap model in FastAPI
- Deploy to Cloud Run
- Auto-scales to zero when not in use

We'll demonstrate Option A below. For production (Option B/C), see the notes at the end.

In [None]:
# Option A: Load Model for Inference (Recommended)
print("="*70)
print("MODEL DEPLOYMENT - OPTION A: Direct Inference")
print("="*70)
print()
print("Your fine-tuned model is saved and ready to use!")
print()
print(f"📦 Model Location: {MERGED_MODEL_GCS_PATH}")
print()
print("To use the model for inference:")
print()
print("# Copy the model from Cloud Storage first (from_pretrained cannot read gs:// URIs)")
print(f"!gsutil -m cp -r {MERGED_MODEL_GCS_PATH.rstrip('/')} .")
print()
print("# Then load it from the local directory")
print("from transformers import AutoModelForCausalLM, AutoTokenizer")
print("import torch")
print()
print(f"model = AutoModelForCausalLM.from_pretrained('{merged_model_dir}', torch_dtype=torch.float16, device_map='auto')")
print(f"tokenizer = AutoTokenizer.from_pretrained('{merged_model_dir}')")
print()
print("="*70)
print()
print("✓ Model is ready for use in notebooks, scripts, or applications!")
print()
print("See the cells below for inference examples and production deployment options.")

### 7.3. Test Your Fine-Tuned Model

Let's test the fine-tuned model with some predictions.

In [None]:
# Test the fine-tuned model with inference
print("Testing fine-tuned model...")
print()

# The deployment_model is already loaded and merged from the previous section
# If you need to reload it, uncomment below:
# deployment_model = AutoModelForCausalLM.from_pretrained(
#     merged_model_dir,
#     torch_dtype=torch.float16,
#     device_map="auto",
# )
# tokenizer = AutoTokenizer.from_pretrained(merged_model_dir)

# Test prompts
test_prompts = [
    "Analyze the following: 'Company XYZ reported Q4 revenue of $2B, up 25% YoY, exceeding analyst expectations.'",
    "Analyze the following: 'Startup ABC raised $100M Series B but laid off 20% of staff.'",
    "Analyze the following: 'TechCorp stock dropped 15% after missing earnings targets by $50M.'"
]

print("="*70)
print("FINE-TUNED MODEL PREDICTIONS")
print("="*70)
print()

for i, prompt in enumerate(test_prompts, 1):
    print(f"Test {i}:")
    print(f"Prompt: {prompt}")
    print()

    # Use the same prefix as the evaluation prompt in section 6
    full_prompt = f"<s> {prompt}"

    # Generate response
    inputs = tokenizer(full_prompt, return_tensors="pt").to(deployment_model.device)
    outputs = deployment_model.generate(
        **inputs,
        max_new_tokens=200,
        temperature=0.7,
        do_sample=True,
        top_p=0.9,
    )
    # Decode only the newly generated tokens so the echoed prompt is dropped reliably
    prompt_length = inputs["input_ids"].shape[1]
    response = tokenizer.decode(outputs[0][prompt_length:], skip_special_tokens=True).strip()

    print(f"Response: {response}")
    print()
    print("-"*70)
    print()

print("✓ Model inference complete!")
print()
print(f"Your model is saved at: {MERGED_MODEL_GCS_PATH}")
print(f"Local copy at: {merged_model_dir}")

### 7.4. Production Deployment Options

For production deployment on GCP, here are your options:

In [None]:
# Production Deployment Guide
print("="*70)
print("PRODUCTION DEPLOYMENT OPTIONS")
print("="*70)
print()

print("Your fine-tuned model is saved and ready for production!")
print(f"Model Location: {MERGED_MODEL_GCS_PATH}")
print()

print("OPTION 1: Vertex AI Workbench (Easiest)")
print("-" * 70)
print("• Create a Vertex AI Workbench notebook instance")
print("• Load model from GCS and serve predictions")
print("• Good for: Internal tools, batch processing")
print("• Cost: ~$0.20/hour for notebook instance")
print()

print("OPTION 2: Cloud Run (Serverless)")
print("-" * 70)
print("• Create a FastAPI wrapper around your model")
print("• Deploy to Cloud Run with GPU support")
print("• Good for: Variable traffic, cost optimization")
print("• Cost: Pay only when serving requests")
print()
print("Example FastAPI app:")
print("""
from fastapi import FastAPI
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

app = FastAPI()
# Copy gs://YOUR-BUCKET/models/gemma-7b-analyst-merged/ into the container first;
# from_pretrained reads local directories, not gs:// URIs
MODEL_DIR = './gemma-7b-analyst-merged'

model = AutoModelForCausalLM.from_pretrained(
    MODEL_DIR,
    torch_dtype=torch.float16,
    device_map='auto'
)
tokenizer = AutoTokenizer.from_pretrained(MODEL_DIR)

@app.post("/predict")
def predict(prompt: str):
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    outputs = model.generate(**inputs, max_new_tokens=200)
    return {"response": tokenizer.decode(outputs[0], skip_special_tokens=True)}
""")
print()

print("OPTION 3: Vertex AI Prediction (Custom Container)")
print("-" * 70)
print("• Build custom Docker container with transformers")
print("• Deploy to Vertex AI Prediction Endpoints")
print("• Good for: Enterprise, SLA requirements")
print("• Cost: ~$0.50-$1.00/hour for T4 GPU")
print()
print("Required files:")
print("  - Dockerfile (with transformers, torch, fastapi)")
print("  - predictor.py (custom prediction handler)")
print("  - requirements.txt")
print()

print("OPTION 4: Vertex AI Model Garden Deploy (Coming Soon)")
print("-" * 70)
print("• Upload to Model Garden for one-click deployment")
print("• Requires TorchServe .mar archive format")
print("• Good for: Standardized deployments")
print()

print("="*70)
print()
print("RECOMMENDED: Start with Vertex AI Workbench for testing,")
print("then move to Cloud Run or Vertex AI Prediction for production.")
print()
print("="*70)
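
For Option 3, a custom container on Vertex AI receives prediction requests as a JSON body with an `instances` list and must answer with a `predictions` list. The sketch below shows what the core of a `predictor.py` handler could look like under that contract. The `prompt` field name, `handle_predict`, and `fake_generate` are hypothetical, and the model call is replaced by a stand-in so the sketch runs without a GPU.

```python
def handle_predict(request_body, generate_fn):
    """Turn a Vertex AI style request into a Vertex AI style response."""
    instances = request_body.get("instances")
    if not isinstance(instances, list) or not instances:
        raise ValueError('request body must contain a non-empty "instances" list')
    predictions = []
    for instance in instances:
        prompt = instance["prompt"]  # hypothetical field name for this sketch
        predictions.append(generate_fn(prompt))
    return {"predictions": predictions}


# Stand-in generator so the sketch runs without a GPU or model weights
def fake_generate(prompt):
    return f"(generated text for: {prompt})"


print(handle_predict({"instances": [{"prompt": "Analyze the following: ..."}]}, fake_generate))
```

In a real container, `generate_fn` would wrap the tokenizer and `model.generate` calls shown earlier, and the handler would sit behind the HTTP route configured for the endpoint.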

### 7.5. Using Your Model from Other Notebooks

Load your fine-tuned model from anywhere:

In [None]:
# Example: Load and use your model from another notebook or script
print("="*70)
print("USING YOUR MODEL FROM OTHER NOTEBOOKS/SCRIPTS")
print("="*70)
print()

code_example = f'''
# Install required packages (bitsandbytes is needed to load the 4-bit checkpoint)
!pip install transformers torch accelerate bitsandbytes

# Import libraries
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# Copy the model from Cloud Storage (from_pretrained cannot read gs:// URIs)
!gsutil -m cp -r {MERGED_MODEL_GCS_PATH.rstrip("/")} .
model_path = "{merged_model_dir}"

print("Loading model from the local copy...")
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    torch_dtype=torch.float16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(model_path)

# Make a prediction
def analyze_financial_news(text):
    prompt = f"<s> Analyze the following: '{{text}}'"
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    # temperature only takes effect when sampling is enabled
    outputs = model.generate(**inputs, max_new_tokens=200, do_sample=True, temperature=0.7)
    response = tokenizer.decode(outputs[0], skip_special_tokens=True)
    return response

# Example usage
result = analyze_financial_news("Company ABC reported record Q4 earnings of $5B.")
print(result)
'''

print(code_example)
print()
print("="*70)
print()
print(f"Model GCS Path: {MERGED_MODEL_GCS_PATH}")
print(f"Model Local Path: {merged_model_dir}")
print()
print("✓ Copy the code above to use your model anywhere!")
print()
print("="*70)

In [None]:
# Clean up GPU memory when done
print("="*70)
print("CLEANUP")
print("="*70)
print()
print("To free up GPU memory when you're done:")
print()

cleanup_code = '''
import gc
import torch

# Delete model from memory
del deployment_model
del tokenizer

# Clear GPU cache
gc.collect()
torch.cuda.empty_cache()

print("✓ GPU memory cleared!")
'''

print(cleanup_code)
print()
print("Note: Your model files are safely stored in Cloud Storage")
print(f"Location: {MERGED_MODEL_GCS_PATH}")
print()
print("="*70)

## 8. Download Model (Optional)

Download the fine-tuned adapter to your local machine.

In [None]:
# Download the adapter
if IN_COLAB:
    import shutil

    # Create a zip file
    shutil.make_archive(adapter_name, 'zip', adapter_name)

    # Download
    from google.colab import files
    files.download(f"{adapter_name}.zip")

    print(f"✓ Downloaded {adapter_name}.zip")

## Summary

Congratulations! You have successfully:

1. ✓ Authenticated with Google Cloud and HuggingFace
2. ✓ Loaded Gemma 7B model from HuggingFace
3. ✓ Configured QLoRA (4-bit quantization + LoRA adapters)
4. ✓ Fine-tuned Gemma on your financial analysis dataset
5. ✓ Evaluated the model's performance
6. ✓ Merged adapter with base model for deployment
7. ✓ Uploaded model to Google Cloud Storage
8. ✓ Tested the fine-tuned model with predictions

**Your Fine-Tuned Model:**
- **Model Name:** gemma-7b-analyst-merged
- **GCS Location:** `gs://{PROJECT_ID}-gemma-finetuning/models/gemma-7b-analyst-merged/`
- **Local Location:** `./gemma-7b-analyst-merged/`
- **Format:** HuggingFace Transformers (4-bit quantized)

**Using Your Model:**

From any Python environment:
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# Copy the model from Cloud Storage first, for example:
#   gsutil -m cp -r gs://{PROJECT_ID}-gemma-finetuning/models/gemma-7b-analyst-merged .
# from_pretrained reads local directories or Hub repository IDs, not gs:// URIs
model_path = "./gemma-7b-analyst-merged"

model = AutoModelForCausalLM.from_pretrained(
    model_path,
    torch_dtype=torch.float16,
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_path)

# Make predictions
prompt = "Analyze the following: 'Your financial news here'"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=200)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

**Production Deployment Options:**

1. **Vertex AI Workbench** (Recommended for testing)
   - Create notebook instance
   - Load model and serve predictions
   - Cost: ~$0.20/hour

2. **Cloud Run** (Serverless, cost-effective)
   - Wrap model in FastAPI
   - Deploy with GPU support
   - Auto-scales to zero
   - Cost: Pay per request

3. **Vertex AI Prediction** (Enterprise)
   - Custom Docker container
   - Managed endpoints with SLAs
   - Cost: ~$0.50-$1.00/hour (T4 GPU)

**Storage Costs:**
- Model in GCS: about $0.02 per GB per month at typical US-region Standard storage rates, so roughly $0.08/month for a 4 GB quantized model
- Adapter only: ~$0.001/month (small adapter files)

**Next Steps:**

1. **Test thoroughly** with your use cases
2. **Choose deployment method** based on your needs:
   - Low traffic → Vertex AI Workbench
   - Variable traffic → Cloud Run
   - High traffic/SLA → Vertex AI Prediction
3. **Set up monitoring** with Cloud Logging
4. **Implement CI/CD** for model updates
5. **Consider** training on larger datasets for production

**Resources Created:**
- LoRA Adapter: `gs://{PROJECT_ID}-gemma-finetuning/models/gemma-7b-analyst-adapter/`
- Merged Model: `gs://{PROJECT_ID}-gemma-finetuning/models/gemma-7b-analyst-merged/`
- Dataset: `dataset.jsonl` (102 examples)

**Need Help?**
- HuggingFace Transformers docs: https://huggingface.co/docs/transformers
- Vertex AI docs: https://cloud.google.com/vertex-ai/docs
- Cloud Run GPU: https://cloud.google.com/run/docs/configuring/services/gpu

🎉 **Your fine-tuned Gemma model is ready for production use!**