# Chapter 7: Deploying Gen AI Models

Deploying a Generative AI model makes it available to real-world applications such as chatbots and recommendation systems. This chapter covers deploying models on cloud platforms and optimizing inference to reduce latency and cost.

---

## Deploying on Cloud Platforms

Popular cloud platforms for model deployment include:
- **AWS (Amazon Web Services)**: Services like SageMaker for model deployment.
- **GCP (Google Cloud Platform)**: Vertex AI (the successor to AI Platform) for model training and deployment.
- **Azure**: Azure ML for hosting and managing models.

### Example: Deploying on AWS
```bash
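# Step 1: Upload your model artifacts to an S3 bucket.
aws s3 cp model_directory s3://your-bucket-name --recursive

# Step 2: Create a SageMaker endpoint from an existing endpoint configuration.
# The configuration (created with `aws sagemaker create-endpoint-config`) is what
# references the model registered with `aws sagemaker create-model`.
aws sagemaker create-endpoint \
    --endpoint-name your-endpoint-name \
    --endpoint-config-name your-endpoint-config-name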
```
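
Once an endpoint is in service, an application can call it through the SageMaker runtime API. The following is a minimal sketch using boto3's `invoke_endpoint`. The endpoint name is a placeholder, and the JSON payload shape (`inputs` plus `parameters`) is an assumption based on the convention used by Hugging Face inference containers, so adjust it to whatever your model container expects.

```python
import json


def build_payload(prompt: str, max_length: int = 50) -> bytes:
    """Serialize a request body.

    The {"inputs", "parameters"} shape is an assumption; your model
    container may expect a different schema.
    """
    body = {"inputs": prompt, "parameters": {"max_length": max_length}}
    return json.dumps(body).encode("utf-8")


def invoke(endpoint_name: str, prompt: str, max_length: int = 50):
    """Send the prompt to a deployed endpoint and return the decoded JSON."""
    # Imported here so build_payload can be used without the AWS SDK installed.
    import boto3

    runtime = boto3.client("sagemaker-runtime")
    response = runtime.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType="application/json",
        Body=build_payload(prompt, max_length),
    )
    return json.loads(response["Body"].read())


# Usage (requires AWS credentials and a running endpoint):
# print(invoke("your-endpoint-name", "Once upon a time"))
```

Keeping payload construction separate from the network call makes the request format easy to check locally before any AWS resources exist.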

---

## Optimizing Model Inference

Optimizing inference for Generative AI models means reducing latency and computational cost while keeping any loss in accuracy acceptably small.
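
To tell whether an optimization actually reduces latency, it helps to measure it before and after the change. The helper below is a minimal sketch using only the standard library. `measure_latency` is a hypothetical name, and the workload is a stand-in that you would replace with a call to your own model.

```python
import statistics
import time
from typing import Callable


def measure_latency(fn: Callable[[], object], warmup: int = 2, runs: int = 10) -> dict:
    """Time repeated calls to fn and return latency statistics in milliseconds."""
    for _ in range(warmup):
        fn()  # warm-up calls are not timed (caches, lazy initialization)
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000.0)
    return {
        "mean_ms": statistics.mean(samples),
        "median_ms": statistics.median(samples),
        "max_ms": max(samples),
    }


# Stand-in workload; replace with e.g. lambda: generator("Hello", max_length=20)
stats = measure_latency(lambda: sum(i * i for i in range(10_000)))
print(stats)
```

When timing GPU inference with PyTorch, call `torch.cuda.synchronize()` before reading the clock, since CUDA kernels run asynchronously and the timer would otherwise stop too early.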

### Techniques:
1. **Quantization**: Reduces the precision of weights (e.g., from FP32 to INT8).
2. **Model Pruning**: Removes weights, neurons, or layers that contribute little to the output, reducing model size.
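
Both techniques can be illustrated on a plain list of weights. The sketch below assumes the simplest variants: symmetric per-tensor INT8 quantization and magnitude-based pruning. The function names are hypothetical, and frameworks such as PyTorch ship their own, more sophisticated implementations (per-channel scales, calibration, structured pruning).

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantization: map floats to integers in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [round(w / scale) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float values from the integers."""
    return [q * scale for q in quantized]


def prune_by_magnitude(weights, fraction):
    """Zero out the given fraction of weights with the smallest magnitudes."""
    k = int(len(weights) * fraction)
    # Indices of the k smallest-magnitude weights
    smallest = sorted(range(len(weights)), key=lambda i: abs(weights[i]))[:k]
    drop = set(smallest)
    return [0.0 if i in drop else w for i, w in enumerate(weights)]


weights = [0.8, -0.05, 0.31, -1.27, 0.02, 0.6]

q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print("int8 values:", q)
print("max round-trip error:", max_error)

pruned = prune_by_magnitude(weights, fraction=0.5)
print("pruned:", pruned)
```

The round-trip error is bounded by half the scale, which is why quantization usually costs little accuracy when the weights are spread evenly across their range. Pruning trades accuracy for sparsity more directly, so the pruned model is typically re-evaluated (and often fine-tuned) afterwards.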

### Example: Detecting GPU Usage for Optimization
```python
import torch

# Check GPU availability
if torch.cuda.is_available():
    print(f"Running on GPU: {torch.cuda.get_device_name(0)}")
else:
    print("Running on CPU")
```

---

## Code Example: Deploying a Hugging Face Model with FastAPI
```python
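from fastapi import FastAPI
from pydantic import BaseModel
from transformers import pipeline

app = FastAPI()

# Load the text-generation pipeline once at startup, not on every request
generator = pipeline("text-generation", model="gpt2")


class GenerateRequest(BaseModel):
    prompt: str
    max_length: int = 50  # total length in tokens, including the prompt


# A plain `def` lets FastAPI run the blocking model call in a worker thread
# instead of stalling the event loop.
@app.post("/generate/")
def generate(request: GenerateRequest):
    response = generator(request.prompt, max_length=request.max_length)
    return {"generated_text": response[0]["generated_text"]}

# Run the API server
# Command: uvicorn script_name:app --reload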
```

---

## Quiz

1. What is the primary purpose of quantization in model optimization?
   - A. To improve accuracy.
   - B. To reduce computational costs.
   - C. To increase the size of the model.

2. Which command is used to upload a model to an AWS S3 bucket?
   - A. aws cp model s3://bucket-name
   - B. aws s3 cp model_directory s3://bucket-name --recursive
   - C. aws upload model_directory bucket-name

---

### Answers:
1. **B**: To reduce computational costs.
2. **B**: aws s3 cp model_directory s3://bucket-name --recursive

---

## Exercise

### Task:
1. Deploy a basic text-generation service using FastAPI and Hugging Face.
2. Optimize the model for inference by enabling GPU usage.

---

### Example Solution:
```python
import torch
from transformers import pipeline

# Check for GPU and use it if available
device = 0 if torch.cuda.is_available() else -1
generator = pipeline("text-generation", model="gpt2", device=device)

# Use the generator in your FastAPI app as shown in the example above.
```

---
