# Credit AI - Model Training on Google Colab

This notebook trains the Credit AI machine learning models:
- **DistilBERT Classifier**: 4-class dispute eligibility classification
- **MiniLM Embeddings**: Semantic search and retrieval

**Requirements**:
- Google Colab account (free)
- Runtime: GPU recommended (free tier available)

**Training Time**:
- Classifier: ~3 minutes on GPU, ~17 minutes on CPU
- Embeddings: ~10 seconds on GPU, ~26 seconds on CPU

---

## Step 1: Enable GPU Runtime (Recommended)

For faster training, enable GPU:
1. Click **Runtime** → **Change runtime type**
2. Select **T4 GPU** from Hardware accelerator
3. Click **Save**

You can verify GPU availability:

In [None]:
# Check GPU availability
!nvidia-smi

## Step 2: Clone Repository

Clone the Credit AI repository from GitHub:

In [None]:
# Clone repository
!git clone https://github.com/sevillanosebastianof28-hub/creditaioi.git
%cd creditaioi

## Step 3: Install Dependencies

Install required Python packages for training:

In [None]:
# Install dependencies quietly (this may take 2-3 minutes)
!pip install -q -r services/local-ai/requirements.txt

# Verify installation
import torch
import transformers
from sentence_transformers import SentenceTransformer

print(f"✓ PyTorch version: {torch.__version__}")
print(f"✓ Transformers version: {transformers.__version__}")
print(f"✓ CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"✓ GPU: {torch.cuda.get_device_name(0)}")

## Step 4: Train DistilBERT Classifier

Train the dispute eligibility classifier:
- **Task**: 4-class classification
- **Classes**: eligible, conditionally_eligible, not_eligible, insufficient_information
- **Training examples**: 1,200
- **Expected time**: ~3 minutes on GPU, ~17 minutes on CPU

In [None]:
# Train classifier
import os
os.environ['CONFIG'] = 'services/local-ai/train/configs/model2_classifier.yaml'

!python services/local-ai/train/train_classifier.py
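
The cell above implies that the training script takes its YAML path from the `CONFIG` environment variable. Outside a notebook, the same launch could be sketched with `subprocess`. The helper names `build_training_run` and `run_training` are hypothetical, and the assumption that the script reads `CONFIG` rests only on the cell above.

```python
import os
import subprocess
import sys


def build_training_run(script_path, config_path, base_env=None):
    """Return (command, env) for launching a training script with CONFIG set.

    Hypothetical helper: assumes the script reads its config path from the
    CONFIG environment variable, as the notebook cell implies.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["CONFIG"] = config_path
    command = [sys.executable, script_path]
    return command, env


def run_training(script_path, config_path):
    """Fail early if the config file is missing, then run the script."""
    if not os.path.isfile(config_path):
        raise FileNotFoundError(f"Config not found: {config_path}")
    command, env = build_training_run(script_path, config_path)
    # check=True raises CalledProcessError if the script exits non-zero
    subprocess.run(command, env=env, check=True)
```

Calling `run_training("services/local-ai/train/train_classifier.py", "services/local-ai/train/configs/model2_classifier.yaml")` from the repository root would mirror the cell above, with the added benefit that a mistyped config path fails before any training starts.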

### Verify Classifier Training

Check that the classifier model was created successfully:

In [None]:
# Check classifier model files
!ls -lh models/finetuned/distilbert-eligibility/

# Check metrics
import json
try:
    with open('models/finetuned/distilbert-eligibility/test_metrics.json', 'r') as f:
        metrics = json.load(f)
    print("\n✓ Classifier Training Metrics:")
    for key, value in metrics.items():
        if isinstance(value, float):
            print(f"  {key}: {value:.4f}")
        else:
            print(f"  {key}: {value}")
except Exception as e:
    print(f"⚠ Could not load metrics: {e}")
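
A 4-class classifier returns one raw score (logit) per class, and a softmax turns those scores into probabilities. The sketch below shows that last step in plain Python. The order of `LABELS` is an assumption made for illustration: the real index-to-name mapping depends on how the training script saved the model (for Hugging Face models it is usually recorded as `id2label` in the saved `config.json`), so check it before relying on this order.

```python
import math

# Assumed label order, for illustration only. The authoritative mapping is
# whatever the saved model records (e.g. "id2label" in its config.json).
LABELS = [
    "eligible",
    "conditionally_eligible",
    "not_eligible",
    "insufficient_information",
]


def softmax(logits):
    """Convert raw scores to probabilities (numerically stable form)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict_label(logits, labels=LABELS):
    """Return (label, probability) for the highest-scoring class."""
    if len(logits) != len(labels):
        raise ValueError("expected one logit per class")
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return labels[best], probs[best]
```

A low top probability (for example, close to 0.25 across all four classes) is a hint that the model is unsure about the input.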

## Step 5: Train MiniLM Embeddings

Train the semantic embeddings model:
- **Task**: Semantic similarity for credit domain
- **Training pairs**: 450
- **Expected time**: ~10 seconds on GPU, ~26 seconds on CPU

In [None]:
# Train embeddings
import os  # imported again so this cell also runs on its own after a runtime restart
os.environ['CONFIG'] = 'services/local-ai/train/configs/model3_embeddings.yaml'

!python services/local-ai/train/train_embeddings.py

### Verify Embeddings Training

Check that the embeddings model was created successfully:

In [None]:
# Check embeddings model files
!ls -lh models/finetuned/minilm-embeddings/

# Test embeddings model loading
try:
    from sentence_transformers import SentenceTransformer
    model = SentenceTransformer('models/finetuned/minilm-embeddings')
    print("\n✓ Embeddings model loaded successfully")
    print(f"  Max sequence length: {model.max_seq_length}")

    # Test embedding generation
    test_sentence = "This is a test credit report dispute"
    embedding = model.encode(test_sentence)
    print(f"  Embedding dimension: {len(embedding)}")
    print("\n✓ Successfully generated test embedding")
except Exception as e:
    print(f"⚠ Error loading embeddings model: {e}")
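
Semantic search with an embeddings model usually comes down to comparing a query vector against document vectors and sorting by similarity. The sketch below illustrates this with cosine similarity on toy 3-dimensional vectors. The function names are hypothetical, and nothing here is taken from the Credit AI retrieval code, which may rank results differently.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as similar to nothing
    return dot / (norm_a * norm_b)


def rank_by_similarity(query_vec, doc_vecs):
    """Return (index, score) pairs, most similar document first."""
    scored = [(i, cosine_similarity(query_vec, v)) for i, v in enumerate(doc_vecs)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy vectors stand in for real embeddings here.
docs = [[1.0, 0.0, 0.0], [0.7, 0.7, 0.0], [0.0, 0.0, 1.0]]
ranking = rank_by_similarity([1.0, 0.1, 0.0], docs)
print([index for index, _ in ranking])
```

With the trained model, the vectors would come from `model.encode(...)` instead of the toy lists. For larger collections, `sentence_transformers.util.cos_sim` computes the same similarity in batch.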

## Step 6: Evaluate Models (Optional)

Run the evaluation harness to verify model performance:

In [None]:
# Run evaluation
!python scripts/evaluate_models.py

# Display results
print("\n" + "="*70)
print("EVALUATION RESULTS")
print("="*70)
!cat AI_EVALUATION_REPORT.json

## Step 7: Download Trained Models

Package and download the trained models:

In [None]:
# Create zip archive of trained models
!zip -r trained_models.zip models/finetuned/

# Show archive size
!ls -lh trained_models.zip

print("\n✓ Models packaged successfully!")
print("  Archive: trained_models.zip")
print("  Contains:")
print("    - models/finetuned/distilbert-eligibility/")
print("    - models/finetuned/minilm-embeddings/")
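
The messages above are printed whether or not both model directories actually ended up in the archive. A minimal sketch of a real check, using the standard-library `zipfile` module, follows; the helper name `missing_prefixes` is hypothetical.

```python
import zipfile


def missing_prefixes(zip_path, expected_prefixes):
    """Return the expected directory prefixes that have no files in the archive."""
    with zipfile.ZipFile(zip_path) as archive:
        # Skip bare directory entries so an empty folder still counts as missing
        names = [n for n in archive.namelist() if not n.endswith("/")]
    return [
        prefix
        for prefix in expected_prefixes
        if not any(name.startswith(prefix) for name in names)
    ]


# Directory names taken from the packaging cell above
EXPECTED = [
    "models/finetuned/distilbert-eligibility/",
    "models/finetuned/minilm-embeddings/",
]
```

Running `missing_prefixes("trained_models.zip", EXPECTED)` after the packaging cell should return an empty list. Any prefix it returns points to a training step that did not write its output.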

In [None]:
# Download to your local machine
from google.colab import files
files.download('trained_models.zip')

print("\n✓ Download started!")
print("  Check your browser's download folder for trained_models.zip")

## Next Steps

After downloading the trained models:

1. **Extract the archive**:
   ```bash
   unzip trained_models.zip
   ```

2. **Deploy to production**:
   - Upload to cloud storage (S3, GCS, Azure Blob)
   - Or copy to your production server
   - Update environment variables with model paths

3. **Test the AI service**:
   ```bash
   cd services/local-ai
   ENABLE_LLM=false uvicorn main:app --host 0.0.0.0 --port 8000
   ```

4. **Verify AI features**:
   - Test classifier predictions
   - Test embeddings generation
   - Validate end-to-end workflow

---

## Troubleshooting

### Out of Memory (OOM) Errors
- Enable GPU runtime if not already enabled
- Use minimal configs: `model2_classifier_minimal.yaml`
- Reduce batch size in config files

### Import Errors
- Restart runtime: Runtime → Restart runtime
- Re-run dependency installation cell

### Training Takes Too Long
- Verify GPU is enabled: check `torch.cuda.is_available()`
- Expected times:
  - Classifier: ~3 min (GPU) or ~17 min (CPU)
  - Embeddings: ~10 sec (GPU) or ~26 sec (CPU)

### Model Files Not Found
- Check paths: models should be in `models/finetuned/*/`
- Verify training completed without errors
- Check disk space: Runtime → Manage sessions
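
The path and disk-space checks above can also be run from a code cell. The following is a rough sketch using only the standard library; the helper names `free_gib` and `list_model_dirs` are hypothetical, and the default root assumes the notebook's working directory is the repository root.

```python
import shutil
from pathlib import Path


def free_gib(path="."):
    """Free space, in GiB, on the filesystem that holds `path`."""
    return shutil.disk_usage(path).free / 2**30


def list_model_dirs(root="models/finetuned"):
    """Map each model subdirectory under `root` to its file count."""
    root_path = Path(root)
    if not root_path.is_dir():
        return {}  # nothing trained yet, or wrong working directory
    return {
        child.name: sum(1 for f in child.rglob("*") if f.is_file())
        for child in sorted(root_path.iterdir())
        if child.is_dir()
    }


print(f"Free disk space: {free_gib():.1f} GiB")
print(f"Model directories: {list_model_dirs()}")
```

An empty dictionary means `models/finetuned/` does not exist relative to the current directory, and a file count of 0 means a directory was created but training did not write a model into it.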

---

## Documentation

For more information, see:
- [AI_TRAINING_GUIDE.md](https://github.com/sevillanosebastianof28-hub/creditaioi/blob/main/AI_TRAINING_GUIDE.md)
- [AI_TRAINING_STATUS.md](https://github.com/sevillanosebastianof28-hub/creditaioi/blob/main/AI_TRAINING_STATUS.md)
- [models/README.md](https://github.com/sevillanosebastianof28-hub/creditaioi/blob/main/models/README.md)

---

**Training completed successfully! 🎉**