# Week 6: Production & Deployment

**Course:** LangChain for AI Applications  
**Week Focus:** Deploy LangChain applications to production at scale.

---

## 🎯 Learning Objectives

By the end of this week, you will:
- Build scalable LangChain REST APIs with FastAPI
- Deploy applications with Docker
- Scale to cloud platforms (AWS, GCP, Azure)
- Handle rate limiting and request queues
- Monitor and maintain production systems
- Implement caching and optimization

## 📊 Real-World Context

**The Challenge:**
- Your support bot works perfectly in development
- Now you need to serve 1000 concurrent users
- Handle peak loads (Black Friday = 10x traffic)
- Keep costs reasonable ($$ per request)
- Maintain 99.9% uptime

**Production Concerns:**
1. **Performance:** Respond in < 2 seconds with 1000 concurrent users
2. **Cost:** Optimize token usage ($$ adds up fast)
3. **Reliability:** Handle failures gracefully
4. **Scalability:** Auto-scale with traffic
5. **Monitoring:** Know what's happening in production
6. **Security:** Protect API, data, credentials

**Solutions:**
- Async chains and FastAPI for performance
- Caching to reduce API calls
- Request queuing for load smoothing
- Circuit breakers for resilience
- Containerization with Docker
- Cloud deployment with auto-scaling
- Comprehensive monitoring and logging
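
Of the solutions listed above, the circuit breaker is the one the rest of this notebook does not demonstrate, so here is a minimal sketch of the idea. The class name, thresholds, and the `call_with_breaker` helper are illustrative assumptions, not part of LangChain or FastAPI. After a few consecutive failures the breaker stops calling the dependency and returns a fallback until a cool-down has passed.

```python
import time


class CircuitBreaker:
    """Stop calling a failing dependency until a cool-down has passed."""

    def __init__(self, max_failures: int = 3, reset_after_s: float = 30.0,
                 clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after_s = reset_after_s
        self.clock = clock      # injectable so the breaker is easy to test
        self.failures = 0
        self.opened_at = None   # None means the circuit is closed

    def allow_request(self) -> bool:
        """Return True if a call may be attempted right now."""
        if self.opened_at is None:
            return True
        # After the cool-down, let a trial call through (half-open state)
        return self.clock() - self.opened_at >= self.reset_after_s

    def record_success(self) -> None:
        self.failures = 0
        self.opened_at = None

    def record_failure(self) -> None:
        self.failures += 1
        if self.failures >= self.max_failures:
            self.opened_at = self.clock()


def call_with_breaker(breaker: CircuitBreaker, func, fallback: str) -> str:
    """Run func() behind the breaker; return fallback if it is open or func fails."""
    if not breaker.allow_request():
        return fallback
    try:
        result = func()
    except Exception:
        breaker.record_failure()
        return fallback
    breaker.record_success()
    return result
```

In a real service, `func` would wrap the LLM call and `fallback` might be a canned "please try again shortly" message, which is cheaper than letting every request wait for a timeout.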

**Business Impact:**
- 📈 Scale: Handle growth without a rewrite
- 💰 Cost: Up to 50% reduction via caching (depends on cache hit rate)
- ⚡ Speed: < 500ms response time
- 🔒 Reliability: 99.9% uptime
- 👀 Visibility: Real-time monitoring
- 🚀 Faster deployments: CI/CD pipelines

In [None]:
from IPython.display import HTML
HTML('''
<style>
.api-box {
    background-color: #e3f2fd;
    border-left: 5px solid #2196f3;
    padding: 15px;
    margin: 20px 0;
    border-radius: 5px;
}
.scale-box {
    background-color: #f3e5f5;
    border-left: 5px solid #9c27b0;
    padding: 15px;
    margin: 20px 0;
    border-radius: 5px;
}
.exercise-box {
    background-color: #fff3cd;
    border-left: 5px solid #ffc107;
    padding: 15px;
    margin: 20px 0;
    border-radius: 5px;
}
</style>
''')

## 🚀 Part 1: Building REST APIs with FastAPI

<div class="api-box">
<strong>FastAPI:</strong> Modern, fast Python web framework for building production-ready APIs.
</div>

### Why FastAPI?

| Feature | FastAPI | Flask | Django |
|---------|---------|-------|--------|
| Speed | ⚡⚡⚡ Fastest | ⚡ Good | ⚡ Good |
| Async | Native | Limited | Limited |
| Validation | Auto | Manual | Manual |
| Docs | Auto | Manual | Manual |
| Learning | Easy | Easy | Steep |

### FastAPI Example

```python
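import time
from typing import Optional

from fastapi import FastAPI, HTTPException
from pydantic import BaseModel

app = FastAPI()

class ChatRequest(BaseModel):
    message: str
    context: Optional[str] = None

class ChatResponse(BaseModel):
    response: str
    latency_ms: float

async def generate_reply(message: str, context: Optional[str]) -> str:
    """Placeholder for your LangChain logic.

    Replace the body with an async call to your chain or model
    and return the reply text.
    """
    return f"Echo: {message}"

@app.post("/chat", response_model=ChatResponse)
async def chat(request: ChatRequest) -> ChatResponse:
    """Process chat message with LangChain."""
    start = time.perf_counter()
    try:
        reply = await generate_reply(request.message, request.context)
    except Exception as exc:
        # Report upstream failures as a 502 instead of an unhandled 500
        raise HTTPException(status_code=502, detail="LLM call failed") from exc

    latency = (time.perf_counter() - start) * 1000
    return ChatResponse(response=reply, latency_ms=latency)

# Run with: uvicorn main:app --reload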
```

In [None]:
# Demonstrate API structure

from typing import Optional, Dict, Any
from dataclasses import dataclass
from datetime import datetime
import time
import asyncio

@dataclass
class APIRequest:
    """Incoming API request."""
    request_id: str
    endpoint: str
    message: str
    timestamp: datetime
    user_id: str

@dataclass
class APIResponse:
    """Outgoing API response."""
    request_id: str
    response: str
    latency_ms: float
    model: str
    tokens_used: int
    cached: bool

class SimpleCache:
    """Simple bounded cache for responses (FIFO eviction, not true LRU)."""
    
    def __init__(self, max_size: int = 100):
        self.cache: Dict[str, str] = {}
        self.max_size = max_size
    
    def get(self, key: str) -> Optional[str]:
        return self.cache.get(key)
    
    def set(self, key: str, value: str):
        # Only evict when adding a new key to a full cache
        if key not in self.cache and len(self.cache) >= self.max_size:
            # Simple FIFO eviction: dicts keep insertion order, so drop the oldest
            self.cache.pop(next(iter(self.cache)))
        self.cache[key] = value
    
    def stats(self) -> Dict[str, int]:
        return {"cached_items": len(self.cache), "max_size": self.max_size}

# Demo: API structure
print("🔌 FASTAPI STRUCTURE DEMO")
print("="*70)

cache = SimpleCache(max_size=5)

# Simulate requests
print("\n📝 Processing Requests:")
print()

# Request 1: Cache miss
req1 = APIRequest(
    request_id="req-001",
    endpoint="/chat",
    message="How do I reset password?",
    timestamp=datetime.now(),
    user_id="user-123"
)

cached = cache.get(req1.message)
if cached is not None:
    print(f"1. {req1.request_id}: CACHE HIT")
    print(f"   Response: {cached}")
    print(f"   Latency: 1ms (cached)")
else:
    print(f"1. {req1.request_id}: CACHE MISS")
    print(f"   Message: {req1.message}")
    resp = "Go to Settings > Security > Change Password"
    cache.set(req1.message, resp)
    print(f"   Response: {resp}")
    print(f"   Latency: 850ms (API call)")
    print(f"   Tokens: 45")

# Request 2: Same question = cache hit
print(f"\n2. req-002: CACHE HIT")
print(f"   Message: {req1.message}")
print(f"   Response: {cache.get(req1.message)}")
print(f"   Latency: 2ms (cached)")
print(f"   Tokens: 0 (SAVED!)")

print(f"\n" + "="*70)
print(f"\n💾 Cache Statistics:")
stats = cache.stats()
print(f"  Items cached: {stats['cached_items']}/{stats['max_size']}")
print(f"  ✅ Benefit: Request 2 was over 400x faster (850ms vs 2ms) and cost-free!")

## 🐳 Part 2: Docker Containerization

### Dockerfile Example

```dockerfile
FROM python:3.10-slim

WORKDIR /app

# Install dependencies
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy application
COPY app/ .

# Expose port
EXPOSE 8000

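# Health check (python:3.10-slim does not ship curl, so use the interpreter;
# assumes the app exposes a /health endpoint)
HEALTHCHECK --interval=30s --timeout=10s \
  CMD python -c "import urllib.request; urllib.request.urlopen('http://localhost:8000/health')" || exit 1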

# Run server
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"]
```

### Build & Run

```bash
# Build image
docker build -t langchain-app:v1 .

# Run container
docker run -p 8000:8000 langchain-app:v1

# Tag for the registry, then push
docker tag langchain-app:v1 myregistry.azurecr.io/langchain-app:v1
docker push myregistry.azurecr.io/langchain-app:v1
```

## ☁️ Part 3: Cloud Deployment

<div class="scale-box">
<strong>Cloud Deployment:</strong> Running containers at scale on managed platforms.
</div>

### Deployment Options

| Platform | Setup | Scaling | Cost | Best For |
|----------|-------|---------|------|----------|
| **AWS ECS** | Medium | Auto | Pay-per-use | High scale |
| **Google Cloud Run** | Easy | Auto | Pay-per-request | Unpredictable traffic |
| **Azure Container Instances** | Medium | Manual | Per-second | Predictable workloads |
| **Heroku** | Very Easy | Auto | Fixed | Rapid prototyping |
| **Kubernetes** | Hard | Auto | Flexible | Enterprise |

### Scaling Strategy

```
Load Balancer
    ↓
[Instance 1] [Instance 2] [Instance 3]
    ↓           ↓           ↓
[Cache] [Cache] [Cache]
    ↓           ↓           ↓
[Queue] [Queue] [Queue]
    ↓           ↓           ↓
           [LLM API]
```
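
The queue layer in the diagram exists so that a burst of traffic does not turn into a burst of upstream LLM calls. One way to sketch that, assuming an asyncio-based service, is to cap concurrency with `asyncio.Semaphore` so that extra requests simply wait their turn. `fake_llm_call` and `run_with_limit` are hypothetical names used only for this illustration.

```python
import asyncio


async def fake_llm_call(prompt: str) -> str:
    """Hypothetical stand-in for a real LLM request."""
    await asyncio.sleep(0.01)
    return f"reply to {prompt}"


async def run_with_limit(prompts, max_concurrent: int):
    """Process all prompts, but allow at most max_concurrent calls in flight."""
    semaphore = asyncio.Semaphore(max_concurrent)
    stats = {"in_flight": 0, "peak": 0}

    async def worker(prompt: str) -> str:
        async with semaphore:  # waits here while the limit is reached
            stats["in_flight"] += 1
            stats["peak"] = max(stats["peak"], stats["in_flight"])
            try:
                return await fake_llm_call(prompt)
            finally:
                stats["in_flight"] -= 1

    # gather() returns replies in the same order as the prompts
    replies = await asyncio.gather(*(worker(p) for p in prompts))
    return replies, stats["peak"]
```

Calling `asyncio.run(run_with_limit(["q1", "q2", "q3", "q4"], max_concurrent=2))` returns all four replies while never having more than two calls in flight. A production setup would more likely use an external queue shared across instances, which this in-process sketch does not cover.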

In [None]:
# Simulate load balancing and scaling

from collections import deque
import statistics

class LoadBalancer:
    """Distribute requests across multiple instances."""
    
    def __init__(self, num_instances: int):
        self.instances = [f"instance-{i}" for i in range(num_instances)]
        self.request_queue = deque()
        self.current_instance = 0
        self.request_count = {inst: 0 for inst in self.instances}
        self.latencies = {inst: [] for inst in self.instances}
    
    def route_request(self, request_id: str) -> str:
        """Route to the next instance in turn (round-robin)."""
        instance = self.instances[self.current_instance]
        self.current_instance = (self.current_instance + 1) % len(self.instances)
        
        self.request_count[instance] += 1
        return instance
    
    def record_latency(self, instance: str, latency_ms: float):
        """Record response latency for monitoring."""
        self.latencies[instance].append(latency_ms)
    
    def should_scale_up(self) -> bool:
        """Check if we should add more instances."""
        # Use the latest sample from every instance, not just instance 0
        latest = [lat[-1] for lat in self.latencies.values() if lat]
        if not latest:
            return False
        
        return statistics.mean(latest) > 1500  # Threshold: 1.5s
    
    def scale_up(self):
        """Add a new instance."""
        new_instance = f"instance-{len(self.instances)}"
        self.instances.append(new_instance)
        self.request_count[new_instance] = 0
        self.latencies[new_instance] = []
        return new_instance
    
    def get_stats(self) -> Dict[str, Any]:
        """Get load balancing statistics."""
        total_requests = sum(self.request_count.values())
        
        stats = {
            "total_instances": len(self.instances),
            "total_requests": total_requests,
            "by_instance": self.request_count.copy(),
        }
        
        # Calculate latency stats
        all_latencies = []
        for lat_list in self.latencies.values():
            all_latencies.extend(lat_list)
        
        if all_latencies:
            stats["avg_latency_ms"] = round(statistics.mean(all_latencies), 1)
            stats["p95_latency_ms"] = round(
                sorted(all_latencies)[int(len(all_latencies) * 0.95)], 1
            )
        
        return stats

# Demo: Load balancing and scaling
print("⚖️  LOAD BALANCING & SCALING DEMO")
print("="*70)

lb = LoadBalancer(num_instances=2)

# Simulate 10 requests
print("\n📊 Handling Incoming Requests:")
print()

for i in range(10):
    instance = lb.route_request(f"req-{i:03d}")
    latency = 800 + (i * 100)  # Increasing latency
    lb.record_latency(instance, latency)
    print(f"Request {i+1:2d} → {instance} (latency: {latency}ms)")
    
    if lb.should_scale_up():
        new_instance = lb.scale_up()
        print(f"  🔺 SCALING UP: Added {new_instance}")

print(f"\n" + "="*70)
print(f"\n📈 LOAD BALANCER STATISTICS:")
stats = lb.get_stats()
for key, value in stats.items():
    if key == "by_instance":
        print(f"  {key}:")
        for inst, count in value.items():
            print(f"    - {inst}: {count} requests")
    else:
        print(f"  {key:20} {value}")

## ✍️ Hands-On Exercises

<div class="exercise-box">
<strong>üéØ Exercise 1: Build FastAPI Server</strong><br><br>
Create a production-ready API:
<ol>
<li>Define request/response models</li>
<li>Implement async handlers</li>
<li>Add error handling and validation</li>
</ol>
</div>

In [None]:
# Exercise 1: Your FastAPI server here!
print("Your production FastAPI server implementation!")

<div class="exercise-box">
<strong>üéØ Exercise 2: Create Dockerfile & Deploy</strong><br><br>
Containerize and deploy:
<ol>
<li>Write Dockerfile with best practices</li>
<li>Build and test locally</li>
<li>Push to Docker registry</li>
</ol>
</div>

In [None]:
# Exercise 2: Your Docker deployment here!
print("Your Dockerfile and deployment script!")

## 📝 Week 6 Project: Production Deployment

**Deploy a complete LangChain application to production with full monitoring.**

### Requirements:

**1. FastAPI Server:**
- `/chat` endpoint (POST)
- `/health` endpoint for monitoring
- Input validation with Pydantic
- Async request handling
- Error handling & retries
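
For the retry requirement, one common approach is exponential backoff with jitter. The sketch below is only a starting point: `retry_async` and its parameters are hypothetical names, and the `sleep` argument is injectable purely to make it testable without waiting.

```python
import asyncio
import random


async def retry_async(func, max_attempts=3, base_delay_s=0.5, max_delay_s=8.0,
                      retry_on=(Exception,), sleep=asyncio.sleep):
    """Call `await func()` until it succeeds or the attempts run out."""
    for attempt in range(1, max_attempts + 1):
        try:
            return await func()
        except retry_on:
            if attempt == max_attempts:
                raise  # out of attempts: let the caller see the last error
            # Delay doubles on each attempt, is capped, and gets jitter
            # so that many clients do not retry at the same instant
            delay = min(max_delay_s, base_delay_s * 2 ** (attempt - 1))
            await sleep(delay * random.uniform(0.5, 1.0))
```

In practice you would narrow `retry_on` to transient errors such as timeouts or rate-limit responses, since retrying a validation error only wastes tokens and time.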

**2. Caching:**
- LRU cache for common queries
- Reduce API calls by 50%+
- Track cache hit rate
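
As a possible starting point for the caching requirement, here is a small LRU cache built on `collections.OrderedDict` that also counts hits and misses. The `LRUCache` class and the `normalize` helper are illustrative names, and the normalization rule (lowercase, collapsed whitespace) is an assumption you may want to change.

```python
from collections import OrderedDict
from typing import Optional


class LRUCache:
    """Bounded cache that evicts the least recently used entry."""

    def __init__(self, max_size: int = 100):
        self.max_size = max_size
        self._items: "OrderedDict[str, str]" = OrderedDict()
        self.hits = 0
        self.misses = 0

    def get(self, key: str) -> Optional[str]:
        if key in self._items:
            self._items.move_to_end(key)  # mark as most recently used
            self.hits += 1
            return self._items[key]
        self.misses += 1
        return None

    def set(self, key: str, value: str) -> None:
        self._items[key] = value
        self._items.move_to_end(key)
        if len(self._items) > self.max_size:
            self._items.popitem(last=False)  # evict least recently used

    def hit_rate(self) -> float:
        total = self.hits + self.misses
        return self.hits / total if total else 0.0


def normalize(message: str) -> str:
    """Hypothetical key normalization so near-identical queries share an entry."""
    return " ".join(message.lower().split())
```

Whether this reaches the 50% target depends entirely on how repetitive your traffic is, so measure `hit_rate()` on real queries before relying on it.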

**3. Docker Setup:**
- Optimized Dockerfile
- Multi-stage builds
- Health checks
- Environment variables

**4. Load Testing:**
- Test with 100+ concurrent users
- Measure response times
- Identify bottlenecks
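
To show the shape a `load_test.py` could take, the sketch below fires a batch of simulated users concurrently and summarizes their latencies. `fake_chat_request` is a hypothetical stand-in that only sleeps; in a real test you would replace it with an HTTP call to your `/chat` endpoint using an async client of your choice.

```python
import asyncio
import statistics
import time


async def fake_chat_request(user_id: int) -> float:
    """Hypothetical stand-in for one call to /chat; returns latency in ms."""
    start = time.perf_counter()
    await asyncio.sleep(0.01 + (user_id % 5) * 0.002)  # simulated server time
    return (time.perf_counter() - start) * 1000


def summarize(latencies_ms):
    """Return count, average, p95, and max latency for a list of samples."""
    ordered = sorted(latencies_ms)
    p95_index = min(len(ordered) - 1, int(len(ordered) * 0.95))
    return {
        "count": len(ordered),
        "avg_ms": round(statistics.mean(ordered), 1),
        "p95_ms": round(ordered[p95_index], 1),
        "max_ms": round(ordered[-1], 1),
    }


async def run_load_test(concurrent_users: int):
    """Start all simulated users at once and collect their latencies."""
    latencies = await asyncio.gather(
        *(fake_chat_request(u) for u in range(concurrent_users))
    )
    return summarize(latencies)
```

Running `asyncio.run(run_load_test(concurrent_users=100))` prints nothing by itself but returns the summary dictionary. Comparing `p95_ms` across different user counts is a simple way to see where latency starts to climb.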

**5. Monitoring:**
- Request/response logging
- Performance metrics
- Error tracking
- Uptime monitoring
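
Before wiring up a full monitoring stack, it can help to see how little is needed for the basic numbers. This sketch keeps a rolling window of request outcomes in memory and reports error rate, p95 latency, and process uptime. `RequestMetrics` is a hypothetical helper, not a replacement for a real metrics system, and its data is lost on restart.

```python
import time
from collections import deque


class RequestMetrics:
    """Keep recent request outcomes in memory for a simple health summary."""

    def __init__(self, window: int = 1000):
        self.samples = deque(maxlen=window)  # (latency_ms, ok) pairs
        self.started_at = time.monotonic()

    def record(self, latency_ms: float, ok: bool) -> None:
        self.samples.append((latency_ms, ok))

    def snapshot(self) -> dict:
        total = len(self.samples)
        errors = sum(1 for _, ok in self.samples if not ok)
        latencies = sorted(ms for ms, _ in self.samples)
        p95 = latencies[min(total - 1, int(total * 0.95))] if total else None
        return {
            "requests": total,
            "error_rate": errors / total if total else 0.0,
            "p95_latency_ms": p95,
            "uptime_s": round(time.monotonic() - self.started_at, 1),
        }
```

A `/health` or metrics endpoint could return `snapshot()` directly, giving the load balancer and your dashboard the same view of the service.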

### Deliverables:
- main.py (FastAPI app)
- requirements.txt (dependencies)
- Dockerfile (containerization)
- docker-compose.yml (local testing)
- load_test.py (performance testing)
- deployment_guide.md (cloud deployment)
- monitoring_dashboard.md (production metrics)

In [None]:
# Week 6 Project Starter

# TODO: Build FastAPI server with async handlers
# TODO: Implement request caching
# TODO: Create Docker setup
# TODO: Write load testing script
# TODO: Set up monitoring
# TODO: Deploy to cloud platform
# TODO: Document deployment process

print("🎯 Your complete production deployment here!")

## 🎓 Key Takeaways

**What you learned this week:**

✅ **REST APIs:**
- FastAPI for high-performance servers
- Async request handling
- Automatic validation & documentation

✅ **Containerization:**
- Docker for reproducible deployments
- Multi-stage builds for optimization
- Health checks for reliability

✅ **Cloud Deployment:**
- Scaling strategies
- Load balancing
- Auto-scaling policies

✅ **Production Operations:**
- Monitoring and logging
- Performance optimization
- Cost management
- Continuous deployment

## 🏆 Capstone: Your Complete LangChain Mastery

**You've now mastered the complete LangChain journey:**

- ✅ Week 1-2: Fundamentals & memory
- ✅ Week 3: Agents & tools
- ✅ Week 4: RAG & embeddings
- ✅ Week 5: Evaluation & debugging
- ✅ Week 6: Production & deployment

**Build your final capstone project:**
- A complete, production-ready LLM application
- Tested, evaluated, and monitored
- Deployed and scaling in the cloud

---

**🎉 Congratulations on completing LangChain Mastery!** You're now ready to build production LLM applications. 🚀