# 09. Production-Ready LLM Deployment and Observability

## 0. Install dependencies

In [2]:
%uv pip install langchain-community~=0.4 langchain-openai~=1.0

Audited 2 packages in 3ms
Note: you may need to restart the kernel to use updated packages.


In [4]:
%uv pip install langchain~=1.0

Audited 1 package in 2ms
Note: you may need to restart the kernel to use updated packages.


In [5]:
%uv pip install python-dotenv~=1.1

Audited 1 package in 1ms
Note: you may need to restart the kernel to use updated packages.


Utility class

In [6]:
import os

import dotenv
from langchain_openai import ChatOpenAI


class Config:
    def __init__(self):
        # By default, load_dotenv doesn't override existing environment variables and looks for a .env file in same directory as python script or searches for it incrementally higher up.
        dotenv_path = dotenv.find_dotenv(usecwd=True)
        if not dotenv_path:
            raise ValueError("No .env file found")
        dotenv.load_dotenv(dotenv_path=dotenv_path)

        api_key = os.getenv("OPENAI_API_KEY")
        if not api_key:
            raise ValueError("OPENAI_API_KEY is not set")

        base_url = os.getenv("OPENAI_API_BASE_URL")
        if not base_url:
            raise ValueError("OPENAI_API_BASE_URL is not set")

        model = os.getenv("OPENAI_MODEL")
        if not model:
            raise ValueError("OPENAI_MODEL is not set")

        self.api_key = api_key
        self.base_url = base_url
        self.model = model

        self.langsmith_api_key = os.getenv("LANGSMITH_API_KEY")

    def new_openai_like(self, **kwargs) -> ChatOpenAI:
        # Reference: https://bailian.console.aliyun.com/?tab=api#/api/?type=model&url=2587654
        # Reference: https://help.aliyun.com/zh/model-studio/models
        # ChatOpenAI docs: https://python.langchain.com/api_reference/openai/chat_models/langchain_openai.chat_models.base.ChatOpenAI.html#langchain_openai.chat_models.base.ChatOpenAI
        return ChatOpenAI(
            api_key=self.api_key, base_url=self.base_url, model=self.model, **kwargs
        )

## Security considerations for LLM applications
## Deploying LLM apps
### Web framework deployment with FastAPI
### Scalable deployment with Ray Serve
#### Building the index
#### Serving the index
#### Running the application
### Deployment considerations for LangChain applications
### LangGraph platform
#### Local development with the LangGraph CLI

### Serverless deployment options
- AWS Lambda: For lightweight LangChain applications, though with limitations on execution time and memory
- Google Cloud Run: Supports containerized LangChain applications with automatic scaling
- Azure Functions: Similar to AWS Lambda but in the Microsoft ecosystem

### UI frameworks
- [Chainlit](https://chainlit.io/)
- [Gradio](https://www.gradio.app/)
- [Streamlit](https://streamlit.io/)
- [Mesop](https://mesop-dev.github.io/mesop/)

### Model Context Protocol

In [8]:
%uv pip install langchain-mcp-adapters~=0.1

Audited 1 package in 4ms
Note: you may need to restart the kernel to use updated packages.


In [10]:
from langchain_mcp_adapters.tools import load_mcp_tools
from langchain.agents import create_agent
from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client

model = Config().new_openai_like()

server_params = StdioServerParameters(
    command="python",
    # Update with the full absolute path to math_server.py
    args=["static/math_server.py"],
)


async def run_agent():
    async with stdio_client(server_params) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()
            tools = await load_mcp_tools(session)
            agent = create_agent(model, tools)
            response = await agent.ainvoke({"messages": "what's (3 + 5) x 12?"})
            response["messages"][-1].pretty_print()

await run_agent()


The result of $(3 + 5) \times 12$ is $96$.


### Infrastructure considerations
#### How to choose your deployment model
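1. Start with data regulation requirements: strict regulation -> on-premises deployment; otherwise consider the cloud.
1. Need for absolute control -> on-premises deployment.
1. Cloud deployment that you operate yourself.
1. Hybrid deployment
    - Sensitive data -> on-premises; non-sensitive data -> cloud services
    - Specialized tasks -> on-premises; general tasks -> cloud services
    - Baseline traffic -> on-premises; peak traffic -> cloud services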
...
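
The hybrid split can be sketched as a small routing function. This is only an illustration: the `Request` fields, the `route` helper, and the 0.8 utilization threshold are assumptions, and the three hybrid rules are combined here in one possible order of precedence.

```python
from dataclasses import dataclass


@dataclass
class Request:
    """Hypothetical request descriptor used only for this sketch."""

    contains_sensitive_data: bool
    is_specialized_task: bool


def route(request: Request, local_utilization: float, peak_threshold: float = 0.8) -> str:
    """Return "on_prem" or "cloud" for a single request."""
    if request.contains_sensitive_data:
        return "on_prem"  # sensitive data stays on-premises
    if request.is_specialized_task:
        return "on_prem"  # specialized models are assumed to be hosted locally
    if local_utilization < peak_threshold:
        return "on_prem"  # baseline traffic is served locally
    return "cloud"  # general traffic overflows to the cloud at peak


print(route(Request(False, False), local_utilization=0.95))  # cloud
```

Sensitive and specialized requests are checked first so that a traffic spike can never push them off-premises.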

#### Model serving infrastructure

- The key to cost-effective LLM deployment is memory optimization. Quantization reduces your models from 16-bit to 8-bit or 4-bit precision, cutting memory usage by 50-75% with minimal quality loss.
- Request batching is equally important: configure your serving layer to automatically group multiple user requests when possible. This can improve throughput by 3-5x.
- Pay attention to the attention key-value (KV) cache. Setting appropriate context length limits and implementing cache expiration strategies prevents memory overflow during long conversations.
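
The 50-75% figure follows directly from bytes per parameter, and the KV cache grows linearly with context length. The arithmetic below is a rough sketch: the 7B parameter count and the layer/head dimensions are hypothetical, and real deployments add overhead for activations and runtime buffers.

```python
def weight_memory_gb(n_params: float, bits: int) -> float:
    """Memory needed for the model weights alone, in GB (1 GB = 1e9 bytes)."""
    return n_params * bits / 8 / 1e9


def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                context_len: int, batch_size: int = 1, bits: int = 16) -> float:
    """KV cache size in GB; the factor 2 covers one key and one value tensor per layer."""
    n_values = 2 * n_layers * n_kv_heads * head_dim * context_len * batch_size
    return n_values * bits / 8 / 1e9


# Hypothetical 7B-parameter model at different precisions
for bits in (16, 8, 4):
    print(f"{bits}-bit weights: {weight_memory_gb(7e9, bits):.1f} GB")

# Hypothetical architecture: 32 layers, 32 KV heads, head dimension 128
print(f"KV cache at 4096 tokens: {kv_cache_gb(32, 32, 128, 4096):.2f} GB")
```

Going from 16-bit to 8-bit halves the weight memory (14.0 GB to 7.0 GB), and 4-bit cuts it by 75% (3.5 GB). Doubling the context length limit doubles the KV cache per conversation.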

In [12]:
%uv pip install langchain-litellm==0.3.2 litellm~=1.78

Audited 2 packages in 4ms
Note: you may need to restart the kernel to use updated packages.


In [13]:
# LiteLLM with LangChain
import os

import dotenv
from langchain_core.prompts import PromptTemplate
from langchain_litellm import ChatLiteLLMRouter
from litellm import Router

dotenv.load_dotenv()

# Configure multiple model deployments with fallbacks
# For why the openai/ prefix is required, see https://docs.litellm.ai/docs/providers/openai_compatible
model_list = [
    {
        "model_name": f"anthropic/{os.environ['ANTHROPIC_MODEL']}",
        "litellm_params": {
            "model": f"anthropic/{os.environ['ANTHROPIC_MODEL_FALLBACK']}",  # Model this deployment actually calls; model_name above is only the alias
            "api_key": os.environ["ANTHROPIC_API_KEY"],
            "api_base": os.environ["ANTHROPIC_BASE_URL"],
        },
    },
    {
        "model_name": f"openai/{os.environ['OPENAI_MODEL']}",
        "litellm_params": {
            "model": f"openai/{os.environ['OPENAI_MODEL_FALLBACK']}",  # Model this deployment actually calls; model_name above is only the alias
            "api_key": os.environ["OPENAI_API_KEY"],
            "api_base": os.environ["OPENAI_API_BASE_URL"],
        },
    },
]

# Setup router with reliability features
router = Router(
    model_list=model_list,
    routing_strategy="usage-based-routing-v2",
    cache_responses=True,  # Enable caching
    num_retries=3,  # Auto-retry failed requests
)

model_name = f"openai/{os.environ['OPENAI_MODEL']}"
# Create LangChain LLM with router
router_llm = ChatLiteLLMRouter(router=router, model_name=model_name)

# Build and invoke a LangChain chain
prompt = PromptTemplate.from_template("Summarize: {text}")
chain = prompt | router_llm
result = chain.invoke({"text": "LiteLLM provides reliability for LLM applications"})
result.pretty_print()



LiteLLM enhances reliability for LLM applications by offering a unified interface to access multiple large language models, enabling seamless model switching, automatic retry logic, and fallback mechanisms. It ensures consistent performance during API failures or high latency by routing requests to alternative models or endpoints, thus improving application resilience and uptime.


## How to observe LLM apps
### Operational metrics for LLM applications

### Tracking responses

In [14]:
%uv pip install pydantic~=2.12

Audited 1 package in 1ms
Note: you may need to restart the kernel to use updated packages.


In [15]:
"""Tracing of agent calls and intermediate results."""

import subprocess
from urllib.parse import urlparse

from langchain.agents import create_agent
from pydantic import HttpUrl


def ping(url: HttpUrl, return_error: bool) -> str:
    """Ping the fully specified url. Must include https:// in the url."""
    # .hostname drops any port or credentials that .netloc would keep
    hostname = urlparse(str(url)).hostname or ""
    completed_process = subprocess.run(
        ["ping", "-c", "1", hostname], capture_output=True, text=True
    )
    output = completed_process.stdout
    if return_error and completed_process.returncode != 0:
        return completed_process.stderr
    return output


llm = Config().new_openai_like()

agent = create_agent(
    model=llm, tools=[ping], system_prompt="You are a helpful assistant"
)

# See https://python.langchain.com/docs/how_to/migrate_agent/#return_intermediate_steps
# The returned result already contains all intermediate execution steps
result = agent.invoke(
    {
        "messages": [
            {
                "role": "user",
                "content": "What's the latency like for https://langchain.com?",
            }
        ]
    }
)
result['messages'][-1].pretty_print()


The latency for https://langchain.com is approximately **55 milliseconds** based on the ping result. This indicates a fairly responsive connection with no packet loss.


Logging guidelines from Ben Auffarth’s work at Chelsea AI Ventures:

- For all requests, track only the request ID, timestamp, token counts, latency, error codes, and endpoint called.
- Sample 5% of non-critical interactions for deeper analysis. For customer service, increase to 15% during the first month after deployment or after major updates.
- For critical use cases (financial advice or healthcare), track complete data for 20% of interactions. Never go below 10% for regulated domains.
- Delete or aggregate data older than 30 days unless compliance requires longer retention. For most applications, keep only aggregate metrics after 90 days.
- Use extraction patterns to remove PII from logged prompts – never store raw user inputs containing email addresses, phone numbers, or account details.

This approach cuts storage requirements by 85-95% while maintaining sufficient data for troubleshooting and analysis.
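
A minimal sketch of such a logging policy, using only the standard library. The tier names, the `build_record` helper, and both regular expressions are assumptions for illustration; the patterns cover only simple email and phone formats and do not detect account details.

```python
import hashlib
import re

# Illustrative patterns only; production PII detection needs broader coverage.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
PHONE = re.compile(r"\+?\d[\d\s().-]{7,}\d")

# Percentage of interactions logged in full, per tier (rates from the guidelines above).
SAMPLE_PERCENT = {"default": 5, "customer_service_rollout": 15, "critical": 20}


def redact(text: str) -> str:
    """Replace email addresses and phone numbers with placeholders."""
    return PHONE.sub("[PHONE]", EMAIL.sub("[EMAIL]", text))


def sampled(request_id: str, tier: str = "default") -> bool:
    """Deterministic sampling: hash the request ID into one of 100 buckets."""
    bucket = int(hashlib.sha256(request_id.encode("utf-8")).hexdigest(), 16) % 100
    return bucket < SAMPLE_PERCENT[tier]


def build_record(request_id: str, timestamp: str, endpoint: str,
                 prompt_tokens: int, completion_tokens: int, latency_ms: float,
                 error_code, prompt: str, tier: str = "default") -> dict:
    """Always keep the cheap metadata; attach the redacted prompt only when sampled."""
    record = {
        "request_id": request_id,
        "timestamp": timestamp,
        "endpoint": endpoint,
        "prompt_tokens": prompt_tokens,
        "completion_tokens": completion_tokens,
        "latency_ms": latency_ms,
        "error_code": error_code,
    }
    if sampled(request_id, tier):
        record["prompt"] = redact(prompt)
    return record


print(redact("Mail bob@example.com or call +1 (555) 123-4567"))
```

Hashing the request ID, rather than drawing a random number, means the same request is always either in or out of the sample, which keeps traces consistent across services.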

### Hallucination detection
1. Retrieval-based validation: comparing the outputs of LLMs against retrieved external content to verify factual claims.
2. LLM-as-judge: a more powerful LLM is used to assess the factual correctness of a response.
3. External knowledge verification: entails cross-referencing model responses against trusted external sources to ensure accuracy.
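
The first two approaches can be combined: hand the judge model both the retrieved context and the answer, and ask for a one-word verdict. The sketch below assumes a `judge` callable that takes a prompt string and returns text; the prompt wording and the `is_grounded` name are illustrative, not a standard API.

```python
from typing import Callable

JUDGE_PROMPT = """You are a strict fact checker.

Context:
{context}

Answer to check:
{answer}

Reply with exactly one word: SUPPORTED or UNSUPPORTED."""


def is_grounded(answer: str, context: str, judge: Callable[[str], str]) -> bool:
    """Ask the judge whether the answer is supported by the retrieved context."""
    verdict = judge(JUDGE_PROMPT.format(context=context, answer=answer))
    # "UNSUPPORTED" does not start with "SUPPORTED", so a prefix check is safe.
    return verdict.strip().upper().startswith("SUPPORTED")


# Stub judge so the sketch runs offline; in practice this would wrap a stronger
# chat model, e.g. judge=lambda p: some_chat_model.invoke(p).content
def always_unsupported(prompt: str) -> str:
    return "UNSUPPORTED"


print(is_grounded("Paris is in Spain.", "Paris is the capital of France.", always_unsupported))
```

Any reply that is not clearly `SUPPORTED` is treated as a failure, so a malformed verdict errs on the side of flagging the answer.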

### Bias detection and monitoring
- `demographic_parity_difference` function from the `Fairlearn` library
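
To make the metric concrete, here is a hand-rolled sketch of what demographic parity difference measures: the gap between the highest and lowest positive-prediction rates across groups. The `parity_difference` helper and the toy data are hypothetical; in practice Fairlearn's `demographic_parity_difference(y_true, y_pred, sensitive_features=...)` would be used instead.

```python
from collections import defaultdict


def parity_difference(y_pred, sensitive_features) -> float:
    """Largest gap in positive-prediction rate between any two groups."""
    totals = defaultdict(int)
    positives = defaultdict(int)
    for pred, group in zip(y_pred, sensitive_features):
        totals[group] += 1
        positives[group] += int(pred == 1)
    rates = {group: positives[group] / totals[group] for group in totals}
    return max(rates.values()) - min(rates.values())


# Toy data: 1 means the model produced a favorable outcome for that user
y_pred = [1, 1, 1, 0, 1, 0, 0, 0]
groups = ["a", "a", "a", "a", "b", "b", "b", "b"]
print(parity_difference(y_pred, groups))  # 0.5
```

A value of 0 means every group receives favorable outcomes at the same rate; here group `a` is at 75% and group `b` at 25%.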

#### LangSmith
### Observability strategy
### Continuous improvement for LLM applications
## Cost management for LangChain applications
Factors that drive costs in LLM applications:
- **Token-based pricing**: Most LLM providers charge per token processed, with separate rates for input tokens (what you send) and output tokens (what the model generates).
- **Output token premium**: Output tokens typically cost 2-5 times more than input tokens. For example, at GPT-4o's launch pricing, input tokens cost $0.005 per 1K tokens, while output tokens cost $0.015 per 1K tokens.
- **Model tier differential**: More capable models command significantly higher prices. For instance, Claude 3 Opus costs substantially more than Claude 3 Sonnet, which is in turn more expensive than Claude 3 Haiku.
- **Context window utilization**: As conversation history grows, the number of input tokens can increase dramatically, driving up costs.
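
A small worked example of these factors, using the GPT-4o rates quoted above. The token counts and the conversation shape (10 turns, 50-token user messages, 150-token replies) are made-up numbers for illustration.

```python
def request_cost(input_tokens: int, output_tokens: int,
                 input_per_1k: float, output_per_1k: float) -> float:
    """Cost of one request in USD under per-1K-token pricing."""
    return input_tokens / 1000 * input_per_1k + output_tokens / 1000 * output_per_1k


def conversation_input_tokens(turns: int, user_tokens: int, reply_tokens: int) -> int:
    """Total input tokens when the full history is resent on every turn."""
    total = 0
    for turn in range(1, turns + 1):
        # Turn k sends k user messages and the k-1 previous replies.
        total += turn * user_tokens + (turn - 1) * reply_tokens
    return total


# One request: 1,200 input tokens and 300 output tokens
print(f"${request_cost(1200, 300, 0.005, 0.015):.4f}")  # $0.0105

# Ten turns with full history versus ten independent 50-token prompts
print(conversation_input_tokens(10, 50, 150), "vs", 10 * 50)  # 9500 vs 500
```

In this example the 300 output tokens account for about 43% of the request cost despite being a fifth of the tokens, and resending history multiplies input tokens by 19.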

### Model selection strategies in LangChain

#### Tiered model selection
**How**: Use a lightweight model to classify a query and select an appropriate model accordingly

In [16]:
import os

import dotenv
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI

dotenv.load_dotenv()

base_url = os.environ["OPENAI_API_BASE_URL"]

# Define models with different capabilities and costs
affordable_model = ChatOpenAI(
    model=os.environ["OPENAI_MODEL_AFFORDABLE"],
    base_url=base_url,
)  # ~10× cheaper than gpt-4o

powerful_model = ChatOpenAI(
    model=os.environ["OPENAI_MODEL_POWERFUL"],
    base_url=base_url,
)  # More capable but more expensive

# Create classifier prompt
classifier_prompt = ChatPromptTemplate.from_template(
    """
Determine if the following query is simple or complex based on these
criteria:
- Simple: factual questions, straightforward tasks, general knowledge
- Complex: multi-step reasoning, nuanced analysis, specialized expertise

Query: {query}

Respond with only one word: "simple" or "complex"
"""
)

# Create the classifier chain
classifier = classifier_prompt | affordable_model | StrOutputParser()


def route_query(query):
    """Route the query to the appropriate model based on complexity."""
    complexity = classifier.invoke({"query": query})

    if "simple" in complexity.lower():
        print(f"Using affordable model for: {query}")
        return affordable_model
    else:
        print(f"Using powerful model for: {query}")
        return powerful_model


# Example usage
def process_query(query):
    model = route_query(query)
    return model.invoke(query)


simple_query = "what is the sum of 2+3"
# print(process_query(simple_query))
process_query(simple_query).pretty_print()

# complex_query = "plan an app serving a billion users"
# print(process_query(complex_query))

Using affordable model for: what is the sum of 2+3

The sum of 2 + 3 is **5**.


#### Cascading model approach
**How**: First attempts a response using a cheaper model and escalates to a stronger one only if the initial output is inadequate.
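
A minimal sketch of the cascade, written against plain callables so it runs without a model. The `looks_adequate` heuristic is a placeholder assumption; a real check might use a confidence score, a validator, or a judge model.

```python
from typing import Callable


def looks_adequate(answer: str) -> bool:
    """Placeholder quality check: non-empty and no obvious hedging."""
    text = answer.strip().lower()
    return bool(text) and "i don't know" not in text and "not sure" not in text


def cascade(query: str,
            cheap: Callable[[str], str],
            strong: Callable[[str], str],
            is_adequate: Callable[[str], bool] = looks_adequate) -> tuple[str, str]:
    """Try the cheap model first; escalate only if its answer fails the check."""
    draft = cheap(query)
    if is_adequate(draft):
        return draft, "cheap"
    return strong(query), "strong"


# Stub models so the sketch runs offline; with LangChain these could be
# lambda q: affordable_model.invoke(q).content and similar.
def cheap_stub(q: str) -> str:
    return "5" if "2+3" in q else "I'm not sure."


def strong_stub(q: str) -> str:
    return "A detailed answer."


print(cascade("what is 2+3", cheap_stub, strong_stub))
print(cascade("design a global payment system", cheap_stub, strong_stub))
```

Unlike tiered selection, the cascade pays for the cheap call even when it escalates, so it works best when most queries pass the adequacy check.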

### Output token optimization

In [18]:
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser

# Initialize the LLM with max_tokens parameter
llm = Config().new_openai_like(max_tokens=150) # Limit to approximately 100-120 words

# Create a prompt template with length guidance
prompt = ChatPromptTemplate.from_messages(
    [
        (
            "system",
            "You are a helpful assistant that provides concise, accurate information. Your responses should be no more than 100 words unless explicitly asked for more detail.",
        ),
        ("human", "{query}"),
    ]
)

# Create a chain
chain = prompt | llm | StrOutputParser()

result = chain.invoke(
    {"query": "write a simple python function checking if an integer is prime"}
)
print(result)

```python
def is_prime(n):
    if n < 2:
        return False
    if n == 2:
        return True
    if n % 2 == 0:
        return False
    for i in range(3, int(n**0.5) + 1, 2):
        if n % i == 0:
            return False
    return True
```


### Other strategies
- Caching
    1. In-memory caching: simple caching that helps reduce costs; appropriate for development environments.
    1. Redis cache: a robust cache appropriate for production, with persistence across application restarts and sharing across multiple instances of your application.
    1. Semantic caching: an advanced approach that reuses responses for semantically similar queries, which can substantially increase cache hit rates.
- Use structured outputs to eliminate unnecessary narrative text.
- Implement token-based context windowing; it is particularly important because it provides predictable cost control.
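
Two of these strategies can be sketched in a few lines. `CachedLLM` and `trim_history` are hypothetical helpers, not LangChain APIs, and the whitespace-based token counter is a crude stand-in for a real tokenizer. LangChain ships its own cache and message-trimming utilities, which this sketch does not use.

```python
import hashlib
from typing import Callable


class CachedLLM:
    """Exact-match in-memory cache around any prompt -> text callable."""

    def __init__(self, llm: Callable[[str], str]):
        self._llm = llm
        self._cache: dict[str, str] = {}
        self.hits = 0

    def __call__(self, prompt: str) -> str:
        key = hashlib.sha256(prompt.encode("utf-8")).hexdigest()
        if key in self._cache:
            self.hits += 1  # served from memory, no tokens billed
            return self._cache[key]
        self._cache[key] = self._llm(prompt)
        return self._cache[key]


def trim_history(messages: list[str], max_tokens: int,
                 count_tokens: Callable[[str], int] = lambda s: len(s.split())) -> list[str]:
    """Keep the most recent messages that fit within the token budget."""
    kept: list[str] = []
    budget = max_tokens
    for message in reversed(messages):
        cost = count_tokens(message)
        if cost > budget:
            break
        kept.append(message)
        budget -= cost
    return list(reversed(kept))


llm = CachedLLM(lambda prompt: prompt.upper())  # stub model
llm("hello")
llm("hello")
print(llm.hits)  # 1

print(trim_history(["a b c", "d e", "f"], max_tokens=3))  # ['d e', 'f']
```

The trimming function caps input tokens per request at `max_tokens`, which is what makes the cost per call predictable regardless of conversation length.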

### Monitoring and cost analysis

LangChain provides callbacks for tracking token usage.

In [19]:
from langchain_community.callbacks.manager import get_openai_callback

llm = Config().new_openai_like()

with get_openai_callback() as cb:
    response = llm.invoke("Explain quantum computing in simple terms")

    print(f"Total Tokens: {cb.total_tokens}")
    print(f"Prompt Tokens: {cb.prompt_tokens}")
    print(f"Completion Tokens: {cb.completion_tokens}")
    print(f"Total Cost (USD): ${cb.total_cost}")

Total Tokens: 522
Prompt Tokens: 15
Completion Tokens: 507
Total Cost (USD): $0.0

The reported cost is $0.0 most likely because `get_openai_callback` prices requests from a built-in table of OpenAI model names. A model served through an OpenAI-compatible endpoint is usually absent from that table, so only the token counts are meaningful here.
