```{contents}
```
## Cost Tracking


**Cost tracking** is the process of **measuring and attributing LLM usage cost** based on:

* Tokens consumed (input + output)
* Model used
* Number of calls
* Retries / fallbacks
* Tools invoked

In LLM systems, cost tracking answers:

* *How much did this request cost?*
* *Which user / feature is expensive?*
* *Where are tokens being wasted?*

Cost tracking is commonly implemented using:

* Callback handlers
* Tracing metadata
* Token usage APIs

LangChain supports this natively through its callback system and LangSmith tracing.

---

### Why Cost Tracking Is Critical

Without cost tracking:

* Bills are unpredictable
* Token waste goes unnoticed
* Scaling becomes risky

With cost tracking:

* Budget control
* Per-user / per-feature accounting
* Cost optimization
* SLA and quota enforcement

---

### What Contributes to LLM Cost

| Factor        | Description                                       |
| ------------- | ------------------------------------------------- |
| Input tokens  | Prompt + retrieved context                        |
| Output tokens | Generated text                                    |
| Model         | Per-token price varies by model (GPT-4 > GPT-3.5) |
| Retries       | Each completed retry is billed again              |
| Fallbacks     | Secondary models add cost                         |
| Streaming     | Same cost, better UX                              |

---

### Architecture View

![Image](https://blog.promptlayer.com/content/images/2024/11/How-a-Prompt-Engineering-Tool-Improves-AI-Model-Performance--24-.png)

![Image](https://miro.medium.com/v2/resize%3Afit%3A1400/1%2AUzR4Qr__-TOsZ63BgLnqww.png)

![Image](https://mintcdn.com/langchain-5e9cc07a/Tf5b6pnNY9Uj6Vtl/langsmith/images/primitives.png?auto=format\&fit=max\&n=Tf5b6pnNY9Uj6Vtl\&q=85\&s=50c5f4d966f8fe4f8ae0be0beaf11bc4)

---



### Built-in Token & Cost Tracking (Callback)

#### Using `get_openai_callback`

```python
from langchain_openai import ChatOpenAI
from langchain_community.callbacks import get_openai_callback

llm = ChatOpenAI()

with get_openai_callback() as cb:
    response = llm.invoke("Explain cost tracking in LLMs")
    
    print("Prompt tokens:", cb.prompt_tokens)
    print("Completion tokens:", cb.completion_tokens)
    print("Total tokens:", cb.total_tokens)
    print("Total cost ($):", cb.total_cost)
```

**Output (example)**

```
Prompt tokens: 23
Completion tokens: 42
Total tokens: 65
Total cost ($): 0.00013
```

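This is the **simplest and most common** approach. Note that `get_openai_callback` prices calls from a table of OpenAI model rates bundled with the library, so models missing from that table are typically reported with a cost of 0 even though tokens are counted.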

---

#### Cost Tracking in a RunnableSequence

```python
from langchain_core.prompts import ChatPromptTemplate

prompt = ChatPromptTemplate.from_template(
    "Explain {topic} in simple terms"
)

chain = prompt | llm

with get_openai_callback() as cb:
    chain.invoke({"topic": "cost tracking"})
    print("Cost:", cb.total_cost)
```

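The callback aggregates usage from **every LLM call made inside the `with` block**, so a chain with several model calls reports one combined total. Non-LLM steps such as prompt formatting add no cost.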

---

#### Cost Tracking with RAG Pipelines

```python
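from langchain_core.runnables import RunnablePassthrough

# `retriever` is assumed to exist already, e.g. vectorstore.as_retriever()
rag_prompt = ChatPromptTemplate.from_template(
    "Answer using only this context:\n{context}\n\nQuestion: {question}"
)

chain = (
    {
        "question": RunnablePassthrough(),  # pass the user question through unchanged
        "context": retriever,               # retrieved documents fill {context}
    }
    | rag_prompt
    | llm
)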

with get_openai_callback() as cb:
    chain.invoke("What is RAG?")
    print("Tokens used:", cb.total_tokens)
    print("Cost:", cb.total_cost)
```

This reveals:

* How much retrieval context inflates prompt tokens
* Why chunk size matters

---

#### Cost Tracking with Retries & Fallbacks

```python
primary = ChatOpenAI(model="gpt-4").with_retry(stop_after_attempt=2)
backup = ChatOpenAI(model="gpt-3.5-turbo")

llm = primary.with_fallbacks([backup])

with get_openai_callback() as cb:
    llm.invoke("Explain retry and fallback")
    print("Total cost with retries/fallbacks:", cb.total_cost)
```

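The total includes:

* Every retry that returned a response
* Fallback model usage

Attempts that fail before returning token usage (for example, a rate-limit error or timeout) add nothing to the total.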
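The accounting behind a retried request can be sketched in plain Python. This is an illustration, not LangChain's internal implementation: the `Attempt` record, the `PRICES` table, and the rates in it are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Attempt:
    model: str
    prompt_tokens: Optional[int]      # None when the call failed before reporting usage
    completion_tokens: Optional[int]


# Hypothetical USD prices per 1M tokens: (input, output).
PRICES = {"primary": (30.0, 60.0), "backup": (0.5, 1.5)}


def attempt_cost(attempt: Attempt) -> float:
    """Cost of one attempt; attempts without usage data contribute nothing."""
    if attempt.prompt_tokens is None or attempt.completion_tokens is None:
        return 0.0
    input_price, output_price = PRICES[attempt.model]
    return (attempt.prompt_tokens * input_price
            + attempt.completion_tokens * output_price) / 1_000_000


def request_cost(attempts: List[Attempt]) -> float:
    """Total cost of one logical request across retries and fallbacks."""
    return sum(attempt_cost(a) for a in attempts)


# Primary fails twice without usage, then the backup answers.
fallback_path = [
    Attempt("primary", None, None),
    Attempt("primary", None, None),
    Attempt("backup", 100, 80),
]
# Primary answers on the first try.
primary_path = [Attempt("primary", 100, 80)]

print(f"fallback path: ${request_cost(fallback_path):.5f}")
print(f"primary path:  ${request_cost(primary_path):.5f}")
```

Keeping one record per attempt, rather than one per request, also makes it easy to see how often the fallback is actually used.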

---

#### Cost Tracking per Request (FastAPI)

```python
from fastapi import FastAPI

app = FastAPI()

@app.post("/chat")
async def chat(q: str):
    with get_openai_callback() as cb:
        answer = await llm.ainvoke(q)
        
        return {
            "answer": answer.content,
            "tokens": cb.total_tokens,
            "cost": cb.total_cost
        }
```

Enables:

* Per-request billing
* User-level quotas
* Usage dashboards
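A user-level quota needs somewhere to accumulate per-request costs. The sketch below is a minimal in-memory version; `UsageLedger` and its methods are hypothetical names, and a real service would persist totals in a database or cache rather than a process-local dict.

```python
class UsageLedger:
    """In-memory per-user cost ledger with a simple quota check."""

    def __init__(self, quota_usd: float) -> None:
        self.quota_usd = quota_usd
        self._spent = {}  # user_id -> accumulated USD

    def record(self, user_id: str, cost_usd: float) -> None:
        """Add the cost of one finished request to the user's running total."""
        self._spent[user_id] = self._spent.get(user_id, 0.0) + cost_usd

    def spent(self, user_id: str) -> float:
        return self._spent.get(user_id, 0.0)

    def over_quota(self, user_id: str) -> bool:
        return self.spent(user_id) >= self.quota_usd


ledger = UsageLedger(quota_usd=0.01)
for cost in (0.004, 0.004, 0.003):  # three requests from the same user
    ledger.record("alice", cost)

print("alice over quota:", ledger.over_quota("alice"))
print("bob over quota:", ledger.over_quota("bob"))
```

In a request handler, one option is to call `ledger.record(user_id, cb.total_cost)` after each response and to check `over_quota` before starting the next call.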

---

#### Cost Tracking via Tracing (Production)

When tracing is enabled:

```bash
export LANGCHAIN_TRACING_V2=true
export LANGCHAIN_API_KEY="<your-langsmith-api-key>"
```

Each trace automatically records:

* Token counts
* Cost per step
* Model usage
* Retry/fallback impact

Traces can be viewed in:

* LangSmith UI
* Cost dashboards

---

### Cost Tracking vs Token Tracking

| Aspect        | Token Tracking | Cost Tracking |
| ------------- | -------------- | ------------- |
| Measures      | Tokens         | Money         |
| Model-aware   | ❌              | ✅             |
| Billing-ready | ❌              | ✅             |
| Optimization  | Partial        | Full          |

Cost = **tokens × model price**, where input and output tokens are usually priced at different rates.
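As a worked example of this formula, the sketch below prices input and output tokens separately. The model names and rates in `PRICES` are hypothetical placeholders; look up current rates from your provider.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Price:
    input_per_million: float   # USD per 1M prompt tokens
    output_per_million: float  # USD per 1M completion tokens


# Hypothetical prices for illustration only.
PRICES = {
    "big-model": Price(input_per_million=30.0, output_per_million=60.0),
    "small-model": Price(input_per_million=0.5, output_per_million=1.5),
}


def estimate_cost(model: str, prompt_tokens: int, completion_tokens: int) -> float:
    """Return the estimated USD cost of a single call."""
    price = PRICES[model]
    return (
        prompt_tokens * price.input_per_million
        + completion_tokens * price.output_per_million
    ) / 1_000_000


# Same token counts, two different models.
for model in PRICES:
    print(f"{model}: ${estimate_cost(model, 23, 42):.7f}")
```

The same 65 tokens cost very different amounts depending on the model, which is why token counts alone are not billing-ready.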

---

### Common Cost Pitfalls

* Large retrieval context
* Overlapping chunks
* Using GPT-4 unnecessarily
* Unlimited retries
* Verbose prompts

---

### Cost Optimization Hooks

Typical hooks:

* Log cost per request
* Enforce max tokens
* Route to cheaper model
* Abort when budget exceeded

Example:

```python
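# `cb` is the handler from an enclosing get_openai_callback() block
if cb.total_cost > 0.01:  # budget in USD
    raise RuntimeError(f"Cost limit exceeded: ${cb.total_cost:.4f}")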
```
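The routing and abort hooks can be combined in one small guard object. This is a sketch under stated assumptions: `BudgetGuard`, `BudgetExceeded`, and the 80% downgrade threshold are hypothetical choices, not part of LangChain.

```python
class BudgetExceeded(RuntimeError):
    """Raised when accumulated spend reaches the configured limit."""


class BudgetGuard:
    def __init__(self, limit_usd: float, downgrade_at: float = 0.8) -> None:
        self.limit_usd = limit_usd
        self.downgrade_at = downgrade_at  # fraction of the budget that triggers routing
        self.spent_usd = 0.0

    def add(self, cost_usd: float) -> None:
        """Record the cost of a finished call."""
        self.spent_usd += cost_usd

    def choose_model(self, preferred: str, cheaper: str) -> str:
        """Pick a model for the next call, or abort if the budget is used up."""
        if self.spent_usd >= self.limit_usd:
            raise BudgetExceeded(
                f"spent ${self.spent_usd:.4f} of ${self.limit_usd:.4f}"
            )
        if self.spent_usd >= self.downgrade_at * self.limit_usd:
            return cheaper
        return preferred


guard = BudgetGuard(limit_usd=1.0)
print(guard.choose_model("gpt-4", "gpt-3.5-turbo"))  # budget untouched

guard.add(0.85)  # past the 80% threshold
print(guard.choose_model("gpt-4", "gpt-3.5-turbo"))

guard.add(0.20)  # past the limit
try:
    guard.choose_model("gpt-4", "gpt-3.5-turbo")
except BudgetExceeded as exc:
    print("aborted:", exc)
```

Checking the budget before each call, rather than after, means a request is refused up front instead of being billed and then discarded.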

---

### Mental Model

Cost tracking is a **meter** attached to every LLM call.

```
LLM runs → tokens counted → cost calculated → budget enforced
```

---

### Key Takeaways

* Cost tracking is essential for production LLM applications
* LangChain provides built-in callbacks for token and cost counts
* The callbacks work across chains, RAG pipelines, retries, and fallbacks
* Tracking enables budgeting, optimization, and governance