# 1.1 Getting Started

**Duration**: ~20 minutes

The basics you need before optimizing prompts and costs.

---

## 1. How LLMs Work

[Large Language Models (LLMs)](https://en.wikipedia.org/wiki/Large_language_model) are neural networks trained on large amounts of text to predict the next token, which lets them interpret and generate content. Many modern models are multimodal: they can process images and other inputs as well as text.

```
┌─────────────┐      ┌─────────────┐      ┌─────────────┐
│   PROMPT    │  →   │    MODEL    │  →   │  RESPONSE   │
│  (Input)    │      │  (Claude,   │      │  (Output)   │
│             │      │   Nova)     │      │             │
└─────────────┘      └─────────────┘      └─────────────┘
  "What is the         Processing         "The capital
   capital of           happens            of France
   France?"             here               is Paris."
```

**Key concept**: Models don't read words - they read **tokens**, subword units such as word fragments, short whole words, or punctuation marks.

## 2. Amazon Bedrock for Inference

**Amazon Bedrock** is a fully managed service for building generative AI applications. This workshop focuses on **model inference** - calling LLMs via API.

### Why Bedrock?

| Benefit | Description |
|---------|-------------|
| **Multiple models** | Claude, Nova, Llama, Mistral - one API |
| **No infrastructure** | Fully managed, auto-scales |
| **Pay-per-use** | Only pay for tokens processed |
| **Enterprise ready** | Security, compliance, VPC support |

### Inference Pricing Modes

| Mode | Best For |
|------|----------|
| **On-demand** | Variable workloads, no commitment |
| **Provisioned Throughput** | Predictable high-volume, consistent latency |
| **Batch** | Non-urgent processing, up to 50% cheaper |

This workshop uses **on-demand** inference with the **Converse API**.

## 3. Tokens: The Unit of Everything

**Tokens** are how LLMs process text - and how you're billed.

| Text | Approximate Tokens |
|------|-------------------|
| 1 sentence | ~10-20 tokens |
| 1 paragraph | ~50-100 tokens |
| 1 page | ~300-500 tokens |

**Why tokens matter**:
- **Cost**: You pay per token (input + output)
- **Latency**: More tokens = longer processing; output tokens are generated one at a time, so response length usually affects latency the most
- **Limits**: Models have maximum context windows
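
To make the table above concrete, here is a minimal sketch that estimates a token count from character length. The 4-characters-per-token ratio is an assumption (a common rule of thumb for English text), and `estimate_tokens` is a hypothetical helper, not a Bedrock API. Real counts depend on each model's tokenizer, so treat the `usage` field of an actual response as the source of truth.

```python
import math

# Assumption: roughly 4 characters per token for English text (rule of thumb only).
CHARS_PER_TOKEN = 4


def estimate_tokens(text: str, chars_per_token: float = CHARS_PER_TOKEN) -> int:
    """Return a rough token estimate for `text` (hypothetical helper)."""
    if not text:
        return 0
    return max(1, math.ceil(len(text) / chars_per_token))


sample = "What is the capital of France? Answer in one word."
print(f"{len(sample)} characters ~ {estimate_tokens(sample)} tokens (estimate)")
```

An estimate like this is only useful for quick budgeting before a call is made.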

---

## 4. Setup

Let's connect to Bedrock.

In [None]:
from dotenv import load_dotenv
load_dotenv()

import os
import boto3

# Initialize Bedrock client
bedrock_runtime = boto3.client(
    service_name='bedrock-runtime',
    region_name=os.getenv('AWS_DEFAULT_REGION', 'us-east-1')
)

print("Connected to Amazon Bedrock")

In [None]:
# Using US cross-region inference profile for Claude Haiku 4.5
MODEL_ID = "us.anthropic.claude-haiku-4-5-20251001-v1:0"
print(f"Model: {MODEL_ID}")

## 5. Available Models

Bedrock offers models from multiple providers. Models vary by:
- **Capability**: Simple tasks vs complex reasoning
- **Speed**: Latency per request
- **Cost**: Price per token
- **Context window**: Maximum input + output tokens

### General Guidance

| Task Complexity | Model Tier | Examples |
|-----------------|------------|----------|
| Simple (classification, extraction) | Fast & cheap | Haiku, Nova Lite/Micro |
| Medium (Q&A, summarization) | Balanced | Sonnet, Nova Pro |
| Complex (reasoning, code, research) | Most capable | Opus, Nova Premier |

**Rule of thumb**: Start with cheaper models, upgrade only if quality is insufficient.
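
The rule of thumb can be sketched as a loop over model tiers ordered from cheapest to most capable. Everything named here is hypothetical: `generate` stands in for your inference call, `is_good_enough` for whatever quality check fits your task, and the tier names are placeholders rather than real model IDs.

```python
from typing import Callable, Optional, Sequence, Tuple


def answer_with_escalation(
    prompt: str,
    tiers: Sequence[str],
    generate: Callable[[str, str], str],
    is_good_enough: Callable[[str], bool],
) -> Tuple[Optional[str], Optional[str]]:
    """Try tiers in order (cheapest first) and stop at the first acceptable answer.

    Returns (tier, answer). If no tier passes the check, the last tier's
    answer is returned; if `tiers` is empty, returns (None, None).
    """
    tier_used, answer = None, None
    for tier in tiers:
        tier_used, answer = tier, generate(tier, prompt)
        if is_good_enough(answer):
            break
    return tier_used, answer


# Stand-in implementations so the sketch runs without calling any model.
def fake_generate(tier: str, prompt: str) -> str:
    return "" if tier == "fast" else f"[{tier}] answer"


tier, answer = answer_with_escalation(
    "Summarize this report.",
    ["fast", "balanced", "most-capable"],
    fake_generate,
    is_good_enough=lambda text: len(text) > 0,
)
print(tier, answer)
```

Escalating at request time pays for every failed attempt, so in practice the same comparison is often run once on a sample of prompts to pick a single tier.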

> See [Amazon Bedrock Pricing](https://aws.amazon.com/bedrock/pricing/) for current models and pricing.

In [None]:
# List available models in your region
bedrock = boto3.client('bedrock', region_name=os.getenv('AWS_DEFAULT_REGION', 'us-east-1'))

try:
    response = bedrock.list_foundation_models(byOutputModality='TEXT')
    
    # Group model IDs by provider
    providers = {}
    for model in response['modelSummaries']:
        providers.setdefault(model['providerName'], []).append(model['modelId'])
    
    print("AVAILABLE MODELS IN YOUR REGION")
    print("=" * 60)
    for provider in ['Anthropic', 'Amazon', 'Meta']:
        if provider in providers:
            print(f"\n{provider}:")
            for model_id in providers[provider][:5]:
                print(f"  • {model_id}")
            if len(providers[provider]) > 5:
                print(f"  ... and {len(providers[provider])-5} more")
except Exception as e:
    print(f"Could not list models: {e}")

## 6. Your First Inference Call

Let's call the model using the **Converse API**, which uses one consistent request and response format across models and is the recommended choice for new projects.

In [None]:
prompt = "What is the capital of France? Answer in one word."

response = bedrock_runtime.converse(
    modelId=MODEL_ID,
    messages=[{
        "role": "user",
        "content": [{"text": prompt}]
    }],
    inferenceConfig={"maxTokens": 50}
)

answer = response['output']['message']['content'][0]['text']
usage = response['usage']

print(f"Prompt: \"{prompt}\"")
print(f"Answer: {answer}")
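
The cell above reads `content[0]['text']`, which assumes the first content block is text. A slightly more defensive sketch is shown below; `extract_text` is a hypothetical helper, and `sample_response` is a hand-written stand-in with illustrative values, not real model output.

```python
def extract_text(response: dict) -> str:
    """Concatenate every text block in a Converse-style response dict."""
    blocks = response.get("output", {}).get("message", {}).get("content", [])
    # Skip any block that does not carry a "text" key.
    return "".join(block["text"] for block in blocks if "text" in block)


# Hand-written stand-in shaped like a Converse response (values are illustrative).
sample_response = {
    "output": {"message": {"role": "assistant", "content": [{"text": "Paris"}]}},
    "usage": {"inputTokens": 20, "outputTokens": 4},
}

print(extract_text(sample_response))
```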

### Token Usage

Every inference response includes token counts - this is how you're billed.

| Token Type | Description |
|------------|-------------|
| **Input** | Tokens in your prompt |
| **Output** | Tokens the model generated |

**Why this matters**:
- You pay for both input AND output tokens
- Output tokens typically cost 3-5x as much as input tokens
- Track usage to estimate and control costs

In [None]:
print("Token Usage")
print("-" * 30)
print(f"  Input:  {usage['inputTokens']} tokens")
print(f"  Output: {usage['outputTokens']} tokens")
print(f"  Total:  {usage['inputTokens'] + usage['outputTokens']} tokens")
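
To track usage across many calls rather than one, the `usage` dictionaries can be accumulated. The sketch below is one possible shape: `UsageTracker` is a hypothetical class, and the two sample dictionaries are made-up values that only mirror the `inputTokens` / `outputTokens` keys printed above.

```python
from dataclasses import dataclass


@dataclass
class UsageTracker:
    """Running totals of token usage across calls (hypothetical helper)."""

    input_tokens: int = 0
    output_tokens: int = 0
    calls: int = 0

    def record(self, usage: dict) -> None:
        """Add one response's `usage` dict to the running totals."""
        self.input_tokens += usage.get("inputTokens", 0)
        self.output_tokens += usage.get("outputTokens", 0)
        self.calls += 1

    @property
    def total_tokens(self) -> int:
        return self.input_tokens + self.output_tokens


tracker = UsageTracker()
# Made-up usage values standing in for two real responses.
for sample_usage in ({"inputTokens": 20, "outputTokens": 4},
                     {"inputTokens": 15, "outputTokens": 60}):
    tracker.record(sample_usage)

print(f"{tracker.calls} calls, {tracker.total_tokens} tokens "
      f"({tracker.input_tokens} in / {tracker.output_tokens} out)")
```

In a notebook you would call `tracker.record(response['usage'])` after each `converse` call.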

## 7. Inference Parameters

In the previous call, we used `maxTokens: 50`. Let's explore what this and other parameters do.

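| Parameter | Converse API field | What it does | Range |
|-----------|--------------------|--------------|-------|
| **max_tokens** | `inferenceConfig.maxTokens` | Caps output length | 1 - model max |
| **temperature** | `inferenceConfig.temperature` | Randomness (0 = most deterministic, 1 = most varied) | 0.0 - 1.0 |
| **top_p** | `inferenceConfig.topP` | Nucleus sampling (cumulative probability cutoff) | 0.0 - 1.0 |
| **top_k** | `additionalModelRequestFields` (model-specific) | Limits sampling to the K most likely tokens | 1 - 500 (varies by model) |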
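
As a sketch of how these parameters map onto a Converse request: `maxTokens`, `temperature`, and `topP` go in `inferenceConfig`, while `top_k` is not part of it and is passed as a model-specific field. `build_converse_kwargs` is a hypothetical helper, and the `top_k` field name assumes a model that accepts it (Anthropic Claude models do); check the model's own parameter documentation before relying on it.

```python
from typing import Optional


def build_converse_kwargs(
    model_id: str,
    prompt: str,
    max_tokens: int = 256,
    temperature: Optional[float] = None,
    top_p: Optional[float] = None,
    top_k: Optional[int] = None,
) -> dict:
    """Assemble keyword arguments for `bedrock_runtime.converse` (hypothetical helper)."""
    inference_config = {"maxTokens": max_tokens}
    if temperature is not None:
        inference_config["temperature"] = temperature
    if top_p is not None:
        inference_config["topP"] = top_p

    kwargs = {
        "modelId": model_id,
        "messages": [{"role": "user", "content": [{"text": prompt}]}],
        "inferenceConfig": inference_config,
    }
    if top_k is not None:
        # Assumption: the target model accepts a model-specific `top_k` field.
        kwargs["additionalModelRequestFields"] = {"top_k": top_k}
    return kwargs


kwargs = build_converse_kwargs(
    "us.anthropic.claude-haiku-4-5-20251001-v1:0",
    "Explain what machine learning is.",
    max_tokens=100,
    temperature=0.2,
    top_k=50,
)
print(kwargs["inferenceConfig"], kwargs["additionalModelRequestFields"])
```

The result would be used as `bedrock_runtime.converse(**kwargs)`. Some models may reject certain parameter combinations, so it is safest to set only the ones you need.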

> See [Inference parameters documentation](https://docs.aws.amazon.com/bedrock/latest/userguide/inference-parameters.html) for all options.

### max_tokens Demo

Control output length - useful for cost control and getting concise answers.

In [None]:
prompt = "Explain what machine learning is."
print(f"Prompt: \"{prompt}\"\n")

for max_tok in [20, 50, 150]:
    response = bedrock_runtime.converse(
        modelId=MODEL_ID,
        messages=[{"role": "user", "content": [{"text": prompt}]}],
        inferenceConfig={"maxTokens": max_tok, "temperature": 0}
    )
    output = response['output']['message']['content'][0]['text']
    tokens_used = response['usage']['outputTokens']
    
    print(f"{'─' * 60}")
    print(f"maxTokens={max_tok} → {tokens_used} tokens used")
    print(f"{'─' * 60}")
    print(output)
    print()
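
When `maxTokens` is too low, the output is cut off mid-thought rather than summarized. A Converse response reports why generation stopped in its `stopReason` field, so truncation can be detected in code. In the sketch below, `was_truncated` is a hypothetical helper and the two dictionaries are hand-written stand-ins for real responses.

```python
def was_truncated(response: dict) -> bool:
    """Return True if generation stopped because it hit the maxTokens limit."""
    return response.get("stopReason") == "max_tokens"


# Hand-written stand-ins for Converse responses (only the relevant field is shown).
cut_off = {"stopReason": "max_tokens"}
finished = {"stopReason": "end_turn"}

for name, resp in (("cut_off", cut_off), ("finished", finished)):
    status = "truncated" if was_truncated(resp) else "complete"
    print(f"{name}: {status}")
```

A check like this is a cheap way to decide whether to raise the limit or ask for a shorter answer in the prompt.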

### Temperature Demo

See how temperature affects output consistency.

In [None]:
prompt = "Give me a creative name for a coffee shop. Just the name, nothing else."
print(f"Prompt: \"{prompt}\"\n")

for temp in [0.0, 1.0]:
    label = "DETERMINISTIC" if temp == 0.0 else "CREATIVE"
    print(f"{'─' * 40}")
    print(f"temperature={temp} ({label})")
    print(f"{'─' * 40}")
    for i in range(3):
        response = bedrock_runtime.converse(
            modelId=MODEL_ID,
            messages=[{"role": "user", "content": [{"text": prompt}]}],
            inferenceConfig={"maxTokens": 20, "temperature": temp}
        )
        answer = response['output']['message']['content'][0]['text'].strip()
        print(f"  Run {i+1}: {answer}")
    print()

print("Note: temp=0 usually gives identical or near-identical results (not guaranteed); temp=1 typically varies between runs")
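
To see why temperature changes consistency, it helps to look at what it does to the model's next-token probabilities. The sketch below is a simplified model of sampling, not the actual implementation used by Bedrock or any provider; the logits are made-up scores for three candidate tokens, and both function names are hypothetical.

```python
import math
from typing import Dict, List


def softmax_with_temperature(logits: List[float], temperature: float) -> List[float]:
    """Turn raw scores into probabilities; lower temperature sharpens the distribution."""
    if temperature <= 0:
        # Temperature 0 is usually treated as greedy decoding (pick the top token).
        raise ValueError("temperature must be > 0 in this sketch")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_filter(probs: List[float], top_p: float) -> Dict[int, float]:
    """Keep the smallest set of top tokens whose cumulative probability reaches top_p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}  # renormalize over the kept tokens


logits = [2.0, 1.0, 0.5]  # made-up scores for three candidate tokens
low = softmax_with_temperature(logits, 0.2)
high = softmax_with_temperature(logits, 1.0)
print("temperature=0.2:", [round(p, 3) for p in low])
print("temperature=1.0:", [round(p, 3) for p in high])
print("top_p=0.8 keeps token indices:", sorted(top_p_filter(high, 0.8)))
```

At low temperature nearly all probability lands on the top token, so repeated runs tend to agree; at higher temperature the alternatives keep meaningful probability, so runs diverge.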

## 8. Pricing & Deployment

### Pricing Structure

You pay separately for input and output tokens:

| Token Type | Typical Ratio | Example (Claude Sonnet) |
|------------|---------------|------------------------|
| Input | 1x | $3.00 / 1M tokens |
| Output | 3-5x | $15.00 / 1M tokens |

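**Optimization focus**: Although each output token costs more, production prompts (system instructions, retrieved context, conversation history) are often far longer than the responses, so input tokens often dominate total cost.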
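
The arithmetic behind this can be sketched with the example prices from the table above. The prices are the table's example values and may not match current pricing, `estimate_cost` is a hypothetical helper, and the 4,000 / 300 token split is an assumed request shape (long prompt, short answer) chosen only for illustration.

```python
# Example prices from the table above (USD per 1M tokens); check the pricing page for current values.
INPUT_PRICE_PER_MILLION = 3.00
OUTPUT_PRICE_PER_MILLION = 15.00


def estimate_cost(
    input_tokens: int,
    output_tokens: int,
    input_price: float = INPUT_PRICE_PER_MILLION,
    output_price: float = OUTPUT_PRICE_PER_MILLION,
) -> float:
    """Return the estimated cost in USD for one request (hypothetical helper)."""
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


# Assumed request shape: a long prompt with context, and a short answer.
input_cost = estimate_cost(4000, 0)
output_cost = estimate_cost(0, 300)
total_cost = estimate_cost(4000, 300)

print(f"Input:  ${input_cost:.4f}")
print(f"Output: ${output_cost:.4f}")
print(f"Total:  ${total_cost:.4f}")
```

With this shape the input side costs more than the output side despite the lower per-token price, which is why trimming prompts is often the first optimization to try.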

> See [Amazon Bedrock Pricing](https://aws.amazon.com/bedrock/pricing/)

### Cross-Region Inference (CRIS)

Cross-region inference lets Bedrock route a request to one of several regions, which improves availability and throughput; global profiles can also reduce cost.

| Type | Model ID Prefix | Characteristics |
|------|-----------------|----------|
| **In-Region** | `anthropic.claude-*` | Lowest latency, single region |
| **Geographic** | `us.`, `eu.`, `apac.` | Data stays in geographic area |
| **Global** | `global.` | Up to ~10% savings, best throughput |

> See [Cross-Region Inference documentation](https://docs.aws.amazon.com/bedrock/latest/userguide/cross-region-inference.html)

In [None]:
print("MODEL ID FORMATS")
print("=" * 60)
print(f"{'Type':<12} {'Example Model ID'}")
print("-" * 60)
print(f"{'In-Region':<12} anthropic.claude-sonnet-4-20250514-v1:0")
print(f"{'Geographic':<12} us.anthropic.claude-sonnet-4-20250514-v1:0")
print(f"{'Global':<12} global.anthropic.claude-sonnet-4-20250514-v1:0")
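
Because the routing type is encoded in the model ID prefix, it can be read back from the ID itself. This sketch covers only the prefixes listed in the table above; other geographic prefixes may exist, and `inference_scope` is a hypothetical helper rather than part of any SDK.

```python
# Geographic prefixes listed in the table above; this list is not exhaustive.
GEOGRAPHIC_PREFIXES = ("us.", "eu.", "apac.")


def inference_scope(model_id: str) -> str:
    """Classify a model ID as 'Global', 'Geographic', or 'In-Region' by its prefix."""
    if model_id.startswith("global."):
        return "Global"
    if model_id.startswith(GEOGRAPHIC_PREFIXES):
        return "Geographic"
    return "In-Region"


for model_id in (
    "anthropic.claude-sonnet-4-20250514-v1:0",
    "us.anthropic.claude-sonnet-4-20250514-v1:0",
    "global.anthropic.claude-sonnet-4-20250514-v1:0",
):
    print(f"{inference_scope(model_id):<12} {model_id}")
```

A check like this can be handy in configuration validation, for example to confirm that a workload with data-residency requirements is not pointed at a global profile.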

---

## Summary

| Concept | Key Point |
|---------|----------|
| **LLMs** | Prompt → Model → Response |
| **Tokens** | Unit of processing and billing |
| **Models** | Match complexity to cost |
| **max_tokens** | Limit output length for cost control |
| **temperature** | 0 = consistent, 1 = creative |
| **Pricing** | Input + Output tokens, output costs more |
| **CRIS** | Choose based on compliance and throughput needs |

---

**Next**: [02-optimization-strategy.ipynb](./02-optimization-strategy.ipynb) - Learn techniques to reduce costs and improve performance.