---
title: "Taming LLMs"
subtitle: "A Practical Guide to LLM Pitfalls with Open Source Software"
description: | 
  A Practical Guide to LLM Pitfalls with Open Source Software
date: 01/28/2025
author:
  - name: Thársis Souza, Ph.D. 
citation:
  url: https://www.souzatharsis.com/writing/tllms-codedotorg
website:
  repo-url: https://github.com/souzatharsis/writing/
  repo-actions: [source, issue]
repo-actions: true
reference-location: margin
citation-location: margin
highlight-style: github
#cap-location: margin
link-citations: true
bibliography: ../references.bib
editor:
  render-on-save: false
format:
  revealjs: 
    theme: dark
    slide-number: true
    logo: images/taming.ico
    footer: '[tamingllms.com](https://tamingllms.com)'
---


## About Me {auto-animate=true auto-animate-easing="ease-in-out"}

::: {.r-stack}
::: {data-id="box1" style="background: #e83e8c; width: 350px; height: 350px; border-radius: 200px;"}
CS
:::
::: {data-id="box2" style="background: #3fb618; width: 250px; height: 250px; border-radius: 200px;"}
PM
:::
::: {data-id="box3" style="background: #2780e3; width: 150px; height: 150px; border-radius: 200px;"}
Finance
:::
:::

## About Me {auto-animate=true auto-animate-easing="ease-in-out"}

::: {.r-hstack}
::: {style="display: flex; flex-direction: column; align-items: center;"}
::: {data-id="box1" auto-animate-delay="0" style="background: #e83e8c; width: 200px; height: 150px; margin: 10px; display: flex; justify-content: center; align-items: center;"}
**CS**
:::
::: {style="width: 2px; height: 30px; background: #e83e8c;"}
:::
PhD, UCL
:::

::: {style="display: flex; flex-direction: column; align-items: center;"}
::: {data-id="box2" auto-animate-delay="0.1" style="background: #3fb618; width: 200px; height: 150px; margin: 10px; display: flex; justify-content: center; align-items: center;"}
**PM**
:::
::: {style="width: 2px; height: 30px; background: #3fb618;"}
:::
SF
:::

::: {style="display: flex; flex-direction: column; align-items: center;"}
::: {data-id="box3" auto-animate-delay="0.2" style="background: #2780e3; width: 200px; height: 150px; margin: 10px; display: flex; justify-content: center; align-items: center;"}
**Finance**
:::
::: {style="width: 2px; height: 30px; background: #2780e3;"}
:::
Ex-Two Sigma
:::
:::



## Agenda

1. LLM Pitfalls
2. Case Study: Safety & Alignment
3. Discussion

## LLM Pitfalls

![*Samuel Colvin, Pydantic*](images/bad.jpeg){#fig-bad}


## LLM Pitfalls
::: {.r-stack}
::: {.r-hstack style="flex-wrap: wrap; justify-content: center; align-items: center; max-width: 1200px;"}
::: {.fragment style="background: #e83e8c; width: 300px; height: 150px; margin: 10px; padding: 15px; border-radius: 15px; display: flex; justify-content: center; align-items: center;"}
### Testing Complexity
:::
::: {.fragment style="background: #ff7518; width: 300px; height: 150px; margin: 10px; padding: 15px; border-radius: 15px; display: flex; justify-content: center; align-items: center;"}
### Structural (un)Reliability
:::
::: {.fragment style="background: #3fb618; width: 300px; height: 150px; margin: 10px; padding: 15px; border-radius: 15px; display: flex; justify-content: center; align-items: center;"}
### Input Data Issues
:::
::: {.fragment style="background: #2780e3; width: 300px; height: 150px; margin: 10px; padding: 15px; border-radius: 15px; display: flex; justify-content: center; align-items: center;"}
### Safety
:::
::: {.fragment style="background: #9954bb; width: 300px; height: 150px; margin: 10px; padding: 15px; border-radius: 15px; display: flex; justify-content: center; align-items: center;"}
### Alignment
:::
::: {.fragment style="background: #20c997; width: 300px; height: 150px; margin: 10px; padding: 15px; border-radius: 15px; display: flex; justify-content: center; align-items: center;"}
### Vendor <br> Lock-in
:::
::: {.fragment style="background: #6f42c1; width: 1000px; height: 150px; margin: 10px; padding: 15px; border-radius: 15px; display: flex; justify-content: center; align-items: center;"}
### Cost & Performance Optimization
:::
:::
:::

## 

::: {.r-fit-text style="background: #e83e8c; width: 100%; height: 100vh; margin: 0; padding: 15px; display: flex; justify-content: center; align-items: center; text-align: center; align-content: center;"}
### Testing Complexity
:::


## Testing Complexity

LLMs are generative, non-deterministic, and exhibit emergent properties.

![Animation illustrating the "emergent properties" phenomenon in LLMs. From: "Pathways Language Model (PaLM): Scaling to 540 Billion Parameters for Breakthrough Performance"](images/emergent-properties-palm.gif){width="100%"}



## Testing Complexity {.smaller}

| Aspect | Traditional Software Products | LLM-Based Software Products |
|:--------|-------------------|------------------|
| Capability Assessment | Validates specific functionality against requirements | May assess emergent properties like reasoning and creativity |
| Metrics and Measurement | Precisely defined and measurable metrics | Subjective qualities that resist straightforward quantification |
| Dataset Contamination | Uses carefully crafted test cases | Risk of memorized evaluation examples from training |
| Benchmark Evolution | Maintains stable test suites | Continuously evolving benchmarks as capabilities advance |
| Human Evaluation | Mostly automated validation | May require significant human oversight |

## Testing Complexity: Evals Design

![Conceptual overview of evaluating multiple LLM-based applications.](images/conceptual-multi.svg){#fig-conceptual-multi}


## Testing Complexity: Tools {.smaller}

::: panel-tabset

### LightEval

```bash
lighteval accelerate --model_args "pretrained=meta-llama/Llama-3.2-1B-Instruct" --tasks "leaderboard|mmlu:econometrics|0|0" --output_dir="./evals/"
```

### LangSmith

``` {.python}
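# Assumes the `langsmith` package; `evaluate` is aliased to match the call below.
from langsmith.evaluation import evaluate as langsmith_evaluate


def run_evaluation(app, model_name, dataset, evaluators):
    """Evaluate `app` on `dataset`, repeating each example to smooth out non-determinism."""
    results = langsmith_evaluate(
        app,
        data=dataset,
        evaluators=evaluators,
        experiment_prefix=model_name,  # labels the experiment with the model name
        num_repetitions=5,  # run every example 5 times
    )
    return results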
```

### PromptFoo

```yaml
description: Best model eval
prompts:
- file://prompt1.txt
- file://prompt2.txt
- file://prompt3.txt
providers:
- openai:gpt-3.5-turbo
defaultTest:
  assert:
  - type: llm-rubric
    value: 'Evaluate the output based on how detailed it is.  Grade it on a scale
      of 0.0 to 1.0, where:

      Score of 0.1: Not much detail.

      Score of 0.5: Some detail.

      Score of 1.0: Very detailed.

      '
tests: file://tests.csv
```
:::

## Testing Complexity: Tools {.smaller}


![](images/evals_compare.png){width="100%"}



## 

::: {.r-fit-text style="background: #ff7518; width: 100%; height: 100vh; margin: 0; padding: 15px; display: flex; justify-content: center; align-items: center; text-align: center; align-content: center;"}
### Structural (un)Reliability
:::

## Structural (un)Reliability

::: incremental
- Language models excel at generating human-like text but struggle to produce structured output consistently [@tang2024strucbenchlargelanguagemodels; @shorten2024structuredragjsonresponseformatting]
- This limitation poses significant challenges when integrating LLMs into production systems
  - Databases
  - APIs 
  - Other software applications
- Even carefully crafted prompts cannot guarantee consistent structural adherence in LLM responses
:::

## Structural (un)Reliability {.smaller}

But what user needs drive the demand for LLM output constraints? A recent Google Research study [@10.1145/3613905.3650756] explored this question by surveying 51 industry professionals who use LLMs in their work:

::: incremental
- **Improving Developer Efficiency and Workflow**
  - Reducing trial and error in prompt engineering
  - Minimizing post-processing of LLM outputs 
  - Streamlining integration with downstream processes
  - Enhancing quality of synthetic datasets

- **Meeting UI and Product Requirements**
  - Adhering to UI size limitations
  - Ensuring output consistency

- **Enhancing User Trust and Experience**
  - Mitigating hallucinations
  - Driving user adoption
:::

<!-- To constrain LLM output is not just a technical consideration but a fundamental user need, impacting developer efficiency and user experience. -->


## Structural (un)Reliability {.smaller}

Text generation is probabilistic. At each step, the model computes a probability distribution over its entire vocabulary, and the next token is sampled from that distribution.

![Text Generation Process: "Sampling".](images/logit.svg){#fig-logit}
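
## Structural (un)Reliability {.smaller}

The sampling step can be sketched in a few lines of Python. The vocabulary and logits below are made-up toy values, not the output of a real model, and `softmax` and `sample_next_token` are illustrative helpers:

```python
import math
import random


def softmax(logits, temperature=1.0):
    """Convert raw scores (logits) into a probability distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_next_token(vocab, logits, rng, temperature=1.0):
    """Draw one token from the distribution implied by the logits."""
    probs = softmax(logits, temperature)
    return rng.choices(vocab, weights=probs, k=1)[0]


# Toy vocabulary and logits (assumed values for illustration only).
vocab = ["{", "Sure", "Here", "The"]
logits = [2.0, 1.5, 1.0, 0.5]

rng = random.Random(0)
probs = softmax(logits)
samples = [sample_next_token(vocab, logits, rng) for _ in range(10)]
print(dict(zip(vocab, (round(p, 3) for p in probs))))
print(samples)
```

Even when `{` is the most likely first token, other tokens keep non-zero probability, so a prompt asking for JSON can still open with prose.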

## Structural (un)Reliability {.smaller}

This process can be expressed mathematically as:

\begin{equation}
P(X) = P(x_1, x_2, \ldots, x_n) = \prod_{i=1}^n p(x_i|x_{<i})
\end{equation}

where $x_i$ is the token being generated at step $i$ and $x_{<i}$ denotes all preceding tokens.
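
## Structural (un)Reliability {.smaller}

A small numeric sketch of the factorization, using hypothetical conditional probabilities for a four-token sequence:

```python
import math

# Hypothetical values of p(x_i | x_<i) for i = 1..4 (not from a real model).
conditional_probs = [0.5, 0.8, 0.9, 0.25]

# Chain rule: P(X) is the product of the per-token conditionals.
sequence_prob = math.prod(conditional_probs)

# In practice, log-probabilities are summed instead to avoid underflow
# on long sequences.
sequence_logprob = sum(math.log(p) for p in conditional_probs)

print(f"P(X)     = {sequence_prob:.4f}")
print(f"log P(X) = {sequence_logprob:.4f}")
```

Each extra token multiplies in another factor below one, so any specific long output, including a fully well-formed one, has low probability on its own.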

## Structural (un)Reliability {.smaller}

Controlled text generation conditions each generation step on a set of constraints and can be formalized as [@liang2024controllabletextgenerationlarge]:

\begin{equation}
P(X|\color{green}{C}) = P(x_1, x_2, \ldots, x_n|\color{green}{C}) = \prod_{i=1}^n p(x_i|x_{<i}, \color{green}{C})
\end{equation}

Here, $\color{green}{C}$ represents the set of constraints or control conditions that shape the generated output.
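
## Structural (un)Reliability {.smaller}

One way to realize the conditioning on $C$ is to mask disallowed tokens before sampling. The sketch below assumes a toy vocabulary and made-up logits; logit masking is only one possible technique, not necessarily the one used in the cited survey:

```python
import math
import random


def constrained_distribution(vocab, logits, allowed):
    """Renormalize the next-token distribution over the allowed tokens only."""
    if not any(tok in allowed for tok in vocab):
        raise ValueError("constraint set excludes every token in the vocabulary")
    # Tokens outside the constraint set get a logit of -inf, i.e. probability 0.
    masked = [x if tok in allowed else -math.inf for tok, x in zip(vocab, logits)]
    peak = max(masked)
    exps = [math.exp(x - peak) for x in masked]  # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]


# Toy values (assumptions for illustration only).
vocab = ["yes", "no", "maybe", "Sure", "{"]
logits = [1.0, 0.5, 0.2, 2.0, 1.5]
allowed = {"yes", "no"}  # C: a multiple-choice constraint

probs = constrained_distribution(vocab, logits, allowed)
rng = random.Random(0)
token = rng.choices(vocab, weights=probs, k=1)[0]
print(dict(zip(vocab, (round(p, 3) for p in probs))))
print(token)
```

Without the mask, `Sure` would be the most likely token; with it, only `yes` or `no` can be sampled.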

## Structural (un)Reliability {.smaller}

Common constraints ($C$) include:

::: incremental
- **Format Constraints**: Enforcing specific output formats like JSON, XML, or YAML ensures the generated content follows a well-defined structure that can be easily parsed and validated. Format constraints are essential for system integration and data exchange.

- **Multiple Choice Constraints**: Restricting LLM outputs to a predefined set of options helps ensure valid responses and reduces the likelihood of unexpected or invalid outputs. This is particularly useful for classification tasks or when specific categorical responses are required.

- **Static Typing Constraints**: Enforcing data type requirements (strings, integers, booleans, etc.) ensures outputs can be safely processed by downstream systems. Type constraints help prevent runtime errors and improve system reliability.

- **Length Constraints**: Limiting the length of generated content is crucial for UI display, platform requirements (like Twitter's character limit), and maintaining consistent user experience. Length constraints can be applied at the character, word, or token level.

- **Ensuring Output Consistency**: Consistent output length and format are crucial for user experience and UI clarity. Constraints help maintain this consistency and avoid excessive variability in generated text.
:::
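
## Structural (un)Reliability {.smaller}

The constraints above can also be checked after generation. A minimal sketch using only the standard library follows; the field names, allowed values, and length limit are hypothetical:

```python
import json

ALLOWED_SENTIMENTS = {"positive", "negative", "neutral"}  # hypothetical options
MAX_SUMMARY_CHARS = 280  # hypothetical UI limit


def validate_output(raw):
    """Return a list of constraint violations; an empty list means valid."""
    try:
        data = json.loads(raw)  # format constraint: must parse as JSON
    except json.JSONDecodeError:
        return ["format: not valid JSON"]
    if not isinstance(data, dict):
        return ["format: expected a JSON object"]

    errors = []
    sentiment = data.get("sentiment")
    if not isinstance(sentiment, str) or sentiment not in ALLOWED_SENTIMENTS:
        errors.append("choice: sentiment not in allowed set")
    score = data.get("score")
    # bool is a subclass of int in Python, so it is excluded explicitly.
    if not isinstance(score, int) or isinstance(score, bool):
        errors.append("type: score must be an integer")
    summary = data.get("summary")
    if not isinstance(summary, str):
        errors.append("type: summary must be a string")
    elif len(summary) > MAX_SUMMARY_CHARS:
        errors.append("length: summary exceeds limit")
    return errors


good = '{"sentiment": "positive", "score": 4, "summary": "Works well."}'
bad = 'Sure! Here is the JSON: {"sentiment": "great"}'
print(validate_output(good))
print(validate_output(bad))
```

Post-hoc validation only detects violations; the application still needs a retry or repair strategy for outputs that fail.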

## Structural (un)Reliability {.smaller}


```{css}
/*| echo: false */
figcaption {
  margin: auto;
  text-align: center;
}
```

![A common way to solve LLM Structural (un)Reliability.](images/x.png){#fig-x fig-align="center"}

## References

::: {#refs}
:::