# Laravel RAG LLM - Complete Pipeline

This notebook contains a complete pipeline for a Laravel RAG (Retrieval-Augmented Generation) LLM built on GPT-2.

## Structure:
1. **Setup & Installation** - Install dependencies and import libraries
2. **Configuration** - Load the configuration
3. **Data Exploration** - Explore the Laravel dataset
4. **Data Processing** - Process data for training
5. **Model Loading** - Load the GPT-2 model
6. **Model Training** - Fine-tune the model (optional)
7. **RAG Setup** - Set up the retrieval system
8. **Inference Testing** - Test the RAG system
9. **Interactive Demo** - Try your own queries
10. **Advanced: Add New Knowledge** - Add entries to the knowledge base

---

**⚡ QUICK START**: To test the RAG system right away, run the following cells in order:
1. Cell #2 (Install dependencies)
2. Cell #3 (Setup & imports)
3. Cell #4 (Load configuration)
4. Cell #7 (Load model)
5. Cell #9 (Setup RAG)
6. Cell #11 or #12 (Interactive demo)

**📚 FULL TUTORIAL**: Follow every cell step by step to understand the whole pipeline.

---

## 1. Setup & Installation

In [None]:
# Install dependencies (run once)
!pip install -q transformers torch datasets pandas numpy tqdm

In [None]:
# Import libraries and setup path
import sys
import os
import json
import torch
from pathlib import Path

# Get notebook directory and project root
notebook_dir = os.getcwd()
project_root = os.path.abspath(os.path.join(notebook_dir, '..'))

# Add src to path if not already there
src_path = os.path.join(project_root, 'src')
if src_path not in sys.path:
    sys.path.insert(0, src_path)

print(f"📁 Project root: {project_root}")
print(f"📁 Notebook directory: {notebook_dir}")
print(f"📁 Source path: {src_path}")

# Check device
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"\n🖥️  Using device: {device}")
print(f"🐍 Python version: {sys.version}")
print(f"🔥 PyTorch version: {torch.__version__}")

## 2. Configuration

In [None]:
# Load configuration
from config_loader import ConfigLoader

# Use absolute path
config_path = os.path.join(project_root, 'configs', 'config.json')
config = ConfigLoader(config_path=config_path)

print("📋 Configuration loaded:")
print(f"  Config file: {config_path}")
print(f"  Model: {config.get('model.name')}")
print(f"  Training epochs: {config.get('training.num_train_epochs')}")
print(f"  Max sequence length: {config.get('training.max_seq_length')}")
print(f"  Batch size: {config.get('training.per_device_train_batch_size')}")
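
The calls above look up nested settings with dotted keys such as `model.name`. The source of `ConfigLoader` is not shown in this notebook, so the following is only a minimal sketch of how such a lookup might work. The class name `DottedConfig` and the sample values are assumptions made for illustration.

```python
import json
from typing import Any


class DottedConfig:
    """Hypothetical stand-in for ConfigLoader; resolves keys like 'model.name'."""

    def __init__(self, data: dict):
        self._data = data

    @classmethod
    def from_file(cls, path: str) -> "DottedConfig":
        # Read a JSON config file into a nested dict.
        with open(path, "r", encoding="utf-8") as f:
            return cls(json.load(f))

    def get(self, key: str, default: Any = None) -> Any:
        # Walk the nested dict one key segment at a time.
        node: Any = self._data
        for part in key.split("."):
            if not isinstance(node, dict) or part not in node:
                return default
            node = node[part]
        return node


# Made-up values for illustration; the real config.json may differ.
demo_config = DottedConfig({"model": {"name": "gpt2"}, "training": {"num_train_epochs": 3}})
print(demo_config.get("model.name"))
print(demo_config.get("training.max_seq_length", 256))
```

Returning a default for a missing key is what lets later cells write `config.get('model.name', 'gpt2')` without first checking that the key exists.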

## 3. Data Exploration

In [None]:
# Load and explore the raw dataset
import pandas as pd

# Load raw QA dataset using absolute path
raw_data_path = os.path.join(project_root, 'data', 'raw', 'laravel_qa_dataset.json')
with open(raw_data_path, 'r', encoding='utf-8') as f:
    raw_data = json.load(f)

# Convert to a DataFrame for easy viewing
df = pd.DataFrame(raw_data)

print("📊 Dataset Statistics:")
print(f"  Dataset file: {raw_data_path}")
print(f"  Total QA pairs: {len(df)}")
print(f"  Categories: {df['category'].unique().tolist()}")
print(f"  Difficulty levels: {df['difficulty'].unique().tolist()}")
print("\n📈 Category distribution:")
print(df['category'].value_counts())

# Show sample
print("\n📝 Sample QA pair:")
sample = raw_data[0]
print(f"Q: {sample['question']}")
print(f"A: {sample['answer'][:200]}...")
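
The exploration cell reads the fields `question`, `answer`, `category`, and `difficulty` from every record, so a record missing one of them would raise a `KeyError` or skew the statistics. Below is a small sketch of a check that could run before building the DataFrame. The helper name `find_invalid_records` and the toy records are hypothetical and not part of the project.

```python
REQUIRED_FIELDS = ("question", "answer", "category", "difficulty")


def find_invalid_records(records):
    """Return (index, missing_fields) for records lacking a required, non-empty field."""
    problems = []
    for index, record in enumerate(records):
        missing = []
        for field in REQUIRED_FIELDS:
            value = record.get(field)
            # Treat absent, None, and whitespace-only values as missing.
            if value is None or not str(value).strip():
                missing.append(field)
        if missing:
            problems.append((index, missing))
    return problems


# Made-up records for illustration only; they are not taken from the real dataset.
toy_records = [
    {"question": "Q1", "answer": "A1", "category": "routing", "difficulty": "beginner"},
    {"question": "Q2", "answer": "", "category": "eloquent"},
]
print(find_invalid_records(toy_records))  # [(1, ['answer', 'difficulty'])]
```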

## 4. Data Processing

In [None]:
# Process data for training
from data_processing import DataProcessor

# Use absolute paths
raw_data_path = os.path.join(project_root, 'data', 'raw', 'laravel_qa_dataset.json')
processed_data_path = os.path.join(project_root, 'data', 'processed', 'training_data.json')

processor = DataProcessor(
    raw_data_path=raw_data_path,
    processed_data_path=processed_data_path
)

# Process and save
processor.process_and_save()

# Load processed data
processed_data = processor.load_processed_data()

print(f"\n✅ Processed {len(processed_data)} training samples")
print("\n📝 Sample training format:")
print(f"Prompt: {processed_data[0]['prompt']}")
print(f"Completion: {processed_data[0]['completion'][:200]}...")
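
Each processed sample exposes a `prompt` and a `completion`. The template that `DataProcessor` actually uses is not shown here, so the sketch below only illustrates one plausible way to turn a QA pair into that shape. The `Question:`/`Answer:` wording and the function name `to_training_sample` are assumptions; `<|endoftext|>` is GPT-2's end-of-text token.

```python
def to_training_sample(qa_pair, eos_token="<|endoftext|>"):
    """Format one QA pair as a prompt/completion dict (hypothetical template)."""
    prompt = f"Question: {qa_pair['question'].strip()}\nAnswer:"
    # The leading space keeps the answer separated from "Answer:" once the
    # two strings are concatenated for causal language-model training.
    completion = f" {qa_pair['answer'].strip()}{eos_token}"
    return {"prompt": prompt, "completion": completion}


# Illustrative input, not a record from the real dataset.
toy_pair = {"question": " What is Artisan? ", "answer": "Laravel's command-line interface."}
toy_sample = to_training_sample(toy_pair)
print(toy_sample["prompt"])
print(toy_sample["completion"])
```

Ending every completion with the end-of-text token gives the model a signal for where an answer stops, which helps keep generation from running on at inference time.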

## 5. Model Loading

In [None]:
# Load GPT-2 model
from model_utils import ModelManager

# Initialize model manager
model_manager = ModelManager(
    model_name=config.get('model.name', 'gpt2'),
    model_path=config.get('model.model_path'),
    device=device
)

# Load model (will try fine-tuned first, fallback to base gpt2)
model_manager.load_model(from_pretrained=True)

print("\n✅ Model loaded and ready!")

## 6. Model Training (Optional)

⚠️ **Warning**: Training takes time and compute resources. Skip this section if:
- You already have a fine-tuned model
- You want to test with the base model first
- Your resources are limited

Uncomment the cell below to train.

In [None]:
# # Training (OPTIONAL - uncomment to train)
# from transformers import GPT2Tokenizer
# 
# print("🚀 Starting training...")
# print("⏰ This may take 10-30 minutes depending on your hardware\n")
# 
# # Create dataset
# train_dataset = processor.create_dataset_for_training(
#     tokenizer=model_manager.tokenizer,
#     max_length=config.get('training.max_seq_length', 256)
# )
# 
# print(f"📊 Training dataset size: {len(train_dataset)}")
# 
# # Train model
# model_manager.train_model(
#     train_dataset=train_dataset,
#     output_dir=config.get('training.output_dir'),
#     num_train_epochs=config.get('training.num_train_epochs'),
#     per_device_train_batch_size=config.get('training.per_device_train_batch_size'),
#     gradient_accumulation_steps=config.get('training.gradient_accumulation_steps'),
#     learning_rate=config.get('training.learning_rate'),
#     warmup_steps=config.get('training.warmup_steps'),
#     save_steps=config.get('training.save_steps'),
#     logging_steps=config.get('training.logging_steps'),
#     save_total_limit=config.get('training.save_total_limit')
# )
# 
# print("\n✅ Training complete!")
# print("📦 Model saved to:", config.get('training.output_dir'))
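
The training call combines `per_device_train_batch_size`, `gradient_accumulation_steps`, and `num_train_epochs`, and those three values together determine how many optimizer updates a run performs. The following sketch works through that arithmetic so that `warmup_steps`, `save_steps`, and `logging_steps` can be chosen sensibly. The function name and all numbers are made up for illustration, and the exact rounding may differ from what the training library does internally.

```python
import math


def training_schedule(num_samples, per_device_batch_size,
                      gradient_accumulation_steps, num_epochs, num_devices=1):
    """Estimate effective batch size and optimizer steps (approximation)."""
    # Samples consumed per optimizer update.
    effective_batch = per_device_batch_size * gradient_accumulation_steps * num_devices
    # Forward/backward passes per epoch on each device.
    batches_per_epoch = math.ceil(num_samples / (per_device_batch_size * num_devices))
    # One optimizer update happens every `gradient_accumulation_steps` batches.
    steps_per_epoch = math.ceil(batches_per_epoch / gradient_accumulation_steps)
    return {
        "effective_batch_size": effective_batch,
        "steps_per_epoch": steps_per_epoch,
        "total_steps": steps_per_epoch * num_epochs,
    }


# Hypothetical numbers: 500 samples, batch size 4, accumulation 4, 3 epochs.
schedule = training_schedule(500, 4, 4, 3)
print(schedule)
```

With these example numbers a run performs 96 updates in total, so a `save_steps` value of 500 would never trigger a checkpoint before training ends.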

## 7. RAG Setup

In [None]:
# Setup RAG retrieval system
from retrieval import RAGRetriever, KnowledgeBase
from model_utils import RAGGenerator

# Use absolute path for knowledge base
kb_path = os.path.join(project_root, 'data', 'knowledge_base', 'local_db.json')

# Initialize retriever
retriever = RAGRetriever(
    kb_path=kb_path,
    use_web_fallback=config.get('retrieval.use_web_fallback', False)
)

print("📚 Knowledge base loaded")
print(f"  KB path: {kb_path}")
retriever.kb.show_all()

# Initialize RAG generator
rag_generator = RAGGenerator(
    model_manager=model_manager,
    retriever=retriever
)

print("\n✅ RAG system ready!")
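
The internals of `RAGRetriever` are not shown in this notebook. To make the retrieval step concrete, here is a minimal keyword-overlap sketch: it scores each knowledge-base entry by Jaccard similarity between the query tokens and the entry's `query` field, and returns nothing when the best score falls below a threshold. The entry layout (`query`/`answer`), the threshold, and the toy entries are all assumptions.

```python
import re


def tokenize(text):
    """Lowercase the text and return its set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(query, entries, min_score=0.2):
    """Return (best_entry, score) by Jaccard overlap, or (None, score) if too weak."""
    query_tokens = tokenize(query)
    best_entry, best_score = None, 0.0
    for entry in entries:
        entry_tokens = tokenize(entry["query"])
        union = query_tokens | entry_tokens
        score = len(query_tokens & entry_tokens) / len(union) if union else 0.0
        if score > best_score:
            best_entry, best_score = entry, score
    if best_score < min_score:
        return None, best_score
    return best_entry, best_score


# Toy knowledge base for illustration; not the contents of local_db.json.
toy_kb = [
    {"query": "cara install laravel",
     "answer": "composer create-project laravel/laravel example-app"},
    {"query": "apa itu eloquent orm",
     "answer": "Eloquent is Laravel's object-relational mapper."},
]
match, score = retrieve("Bagaimana cara install Laravel?", toy_kb)
print(match["answer"], score)
```

The score doubles as a rough confidence value: three of the four distinct tokens overlap here, giving 0.75.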

## 8. Inference Testing

In [None]:
# Test the RAG system with sample queries
test_queries = [
    "Bagaimana cara install Laravel?",
    "Apa itu Eloquent ORM?",
    "Bagaimana cara membuat controller?",
    "Bagaimana cara membuat migration?",
]

print("🧪 Testing RAG System\n" + "="*60 + "\n")

for query in test_queries:
    print(f"❓ Query: {query}")
    
    # Generate response
    result = rag_generator.generate_with_context(
        query=query,
        max_new_tokens=config.get('generation.max_new_tokens', 200),
        temperature=config.get('model.temperature', 0.7)
    )
    
    print(f"📊 Confidence: {result['confidence']:.2f} | Method: {result['method']}")
    print(f"💡 Answer: {result['answer'][:300]}...\n")
    print("-" * 60 + "\n")
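
Each result reports a `confidence` and a `method`, which suggests the generator decides per query whether to lean on retrieved context. How `generate_with_context` makes that decision is not shown, so the sketch below is only one way such a step could look. The threshold, the prompt template, and the method labels `"retrieval"` and `"generation"` are assumptions.

```python
def build_rag_prompt(query, context, confidence, threshold=0.5, max_context_chars=500):
    """Return (prompt, method); include context only when retrieval looks reliable."""
    if context and confidence >= threshold:
        # Trim long context so the prompt leaves room for the generated answer.
        trimmed = context[:max_context_chars]
        return f"Context: {trimmed}\nQuestion: {query}\nAnswer:", "retrieval"
    # Fall back to the bare question when nothing useful was retrieved.
    return f"Question: {query}\nAnswer:", "generation"


grounded_prompt, grounded_method = build_rag_prompt(
    "Apa itu Eloquent ORM?", "Eloquent is Laravel's object-relational mapper.", 0.8
)
fallback_prompt, fallback_method = build_rag_prompt("Apa itu Eloquent ORM?", "", 0.8)
print(grounded_method, fallback_method)
```

Trimming the context matters for GPT-2 in particular, because its context window is limited to 1024 tokens and the prompt shares that budget with the answer.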

## 9. Interactive Demo

Try your own queries!

In [None]:
# Interactive query function
def ask_laravel_question(query: str, show_context: bool = False):
    """
    Ask a Laravel-related question
    
    Args:
        query: Your question
        show_context: Show retrieved context
    """
    print(f"\n{'='*60}")
    print(f"❓ Your Question: {query}")
    print(f"{'='*60}\n")
    
    # Generate response
    result = rag_generator.generate_with_context(
        query=query,
        max_new_tokens=config.get('generation.max_new_tokens', 200),
        temperature=config.get('model.temperature', 0.7)
    )
    
    # Show results
    print(f"📊 Confidence: {result['confidence']:.2f}")
    print(f"🔍 Method: {result['method']}")

    if show_context:
        print("\n📚 Context Retrieved:")
        print(f"{result['context'][:300]}...\n")

    print("\n💡 Answer:")
    print(f"{result['answer']}")
    print(f"\n{'='*60}\n")
    
    return result

# Example usage:
# ask_laravel_question("Bagaimana cara membuat API di Laravel?")
# ask_laravel_question("Apa itu middleware?", show_context=True)

In [None]:
# Try your own questions here!
ask_laravel_question("Bagaimana cara membuat authentication di Laravel?")

In [None]:
# Ask another question
ask_laravel_question("Bagaimana cara validasi form?", show_context=True)

## 10. Advanced: Add New Knowledge

You can add new entries to the knowledge base.

In [None]:
# Add new knowledge entry
def add_knowledge(query: str, answer: str):
    """Add new entry to knowledge base"""
    retriever.kb.add_entry(query, answer)
    print("✅ Added new knowledge entry")
    print(f"Query: {query}")
    print(f"Answer: {answer[:100]}...")

# Example:
# add_knowledge(
#     "cara deploy laravel",
#     "Untuk deploy Laravel: 1) Setup server dengan PHP 8.1+, 2) Clone repository, 3) Run composer install, 4) Setup .env file, 5) Generate key: php artisan key:generate, 6) Run migrations, 7) Configure web server (Nginx/Apache)"
# )
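
Whether `retriever.kb.add_entry` persists new entries to disk is not visible from this notebook. If it only updates memory, additions are lost when the kernel restarts. The sketch below shows one way to persist entries, assuming the knowledge base file is a JSON list of `{"query", "answer"}` objects. That layout and the function name `add_entry_to_file` are hypothetical.

```python
import json
import os
import tempfile


def add_entry_to_file(kb_file, query, answer):
    """Insert or update an entry in a JSON-list knowledge base; return entry count."""
    entries = []
    if os.path.exists(kb_file):
        with open(kb_file, "r", encoding="utf-8") as f:
            entries = json.load(f)
    # Normalize the key so the same question is not stored twice.
    normalized = query.strip().lower()
    for entry in entries:
        if entry["query"] == normalized:
            entry["answer"] = answer
            break
    else:
        entries.append({"query": normalized, "answer": answer})
    with open(kb_file, "w", encoding="utf-8") as f:
        json.dump(entries, f, ensure_ascii=False, indent=2)
    return len(entries)


# Demonstrate against a throwaway file rather than the real knowledge base.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_file = os.path.join(tmp_dir, "demo_kb.json")
    first_count = add_entry_to_file(demo_file, "Cara deploy Laravel", "First answer.")
    second_count = add_entry_to_file(demo_file, "cara deploy laravel", "Updated answer.")
print(first_count, second_count)  # 1 1
```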

## Summary

### ✅ What we built:
1. **Data Pipeline**: Raw data → Processed training data
2. **RAG System**: Knowledge base + Retrieval + GPT-2 Generation
3. **Fine-tuning**: Optional model training on Laravel-specific data
4. **Interactive Interface**: Ask Laravel questions and get AI-powered answers

### 🚀 Next Steps:
1. **Expand Dataset**: Add more Laravel QA pairs to `data/raw/`
2. **Fine-tune**: Train the model on a larger dataset
3. **Improve Retrieval**: Implement semantic search with embeddings
4. **Add Web Interface**: Build a Flask/FastAPI backend and a React frontend
5. **Deploy**: Deploy the model to production
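
As a starting point for the semantic search item above, the core ranking step is cosine similarity between a query vector and one vector per knowledge-base entry. The sketch below uses hand-written three-dimensional vectors purely for illustration. In practice the vectors would come from an embedding model, which this notebook does not include.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 for zero vectors)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_by_similarity(query_vector, entry_vectors, top_k=2):
    """Return the top_k (name, score) pairs, highest similarity first."""
    scored = [(name, cosine_similarity(query_vector, vector))
              for name, vector in entry_vectors.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_k]


# Toy vectors, not real embeddings.
toy_vectors = {
    "install": [1.0, 0.0, 0.0],
    "migration": [0.0, 1.0, 0.0],
    "deploy": [0.7, 0.0, 0.7],
}
ranking = rank_by_similarity([1.0, 0.0, 0.1], toy_vectors)
print([name for name, _ in ranking])  # ['install', 'deploy']
```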

### 📚 Resources:
- [Laravel Documentation](https://laravel.com/docs)
- [Transformers Documentation](https://huggingface.co/docs/transformers)
- [RAG paper (Lewis et al., 2020)](https://arxiv.org/abs/2005.11401)

---

**Happy Coding! 🎉**