<br>
<a href="https://www.nvidia.com/en-us/training/">
    <div style="width: 55%; background-color: white; margin-top: 50px;">
    <img src="https://dli-lms.s3.amazonaws.com/assets/general/nvidia-logo.png"
         width="400"
         height="186"
         style="margin: 0px -25px -5px; width: 300px"/>
</a>
<h1 style="line-height: 1.4;"><font color="#76b900"><b>Building Agentic AI Applications with LLMs</h1>
<h2><b>Notebook 2:</b> Structuring Thoughts and Outputs</h2>
<br>

**Welcome back to the course! This is the second major section of the course, so we hope you're ready to get started!**

The previous section caught you up on the basic agent chat loop and even touched on agentic decomposition. In this section, we will dig deeper into the capabilities of our LLM to figure out what we can really expect from it. Specifically, we'd like to know what it can actually reason about, how well it can consider its inputs, and what that suggests for our ability to interact with an external (or even internal) environment. After all, we will want some of our LLM components to be reliable pieces of software that act as semantic reasoners.

### **Learning Objectives:**
**In this notebook, we will:**

- Investigate our LLM interface and consider what it *seems* to be able to do and how we can try to use it.
- For cases where its performance appears lacking, we will consider why that may be the case, and see if we can't work around it.
- Most importantly, we will see how we can "guarantee" a model to output into a given interface, and what that actually means within the context of semantic reasoning.

<hr><br>

## **Part 1:** Going Beyond Our Turn-Based Agent

If you haven't coded much with LLMs, the previous section may be surprising. If anything, the vagueness with which these systems are able to operate is impressive and the self-corrective behavior is a real game-changer for many conversational applications. With that said, these systems have some inherent weaknesses: 
- They are influenced to think and act by prompt engineering, but they are not "forced" to think in any specific configuration.
- The memory systems are easy to taint with message history that steers the dialog in poor directions, opening up opportunities for subtle but compounding quality degradation.
- The output is inherently natural language, and does not lend itself to be easily connected with regular software systems.

This suggests that we may want to try to lock down our LLM interface and avoid the natural accumulation of state in use cases where multi-turn dialog is not necessary. And when multi-turn dialog is necessary but we still need more control, we may need to restrict and normalize the accumulating context to make sure everything stays in line.

In this notebook, we will limit our discussion to the following types of systems. Though simple to define, they each will expose some interesting mechanisms that can be used inside a larger agent system.

- **An Agent That Has To Think:** If the system can drift as a result of directly responding to an input, then maybe you can force the system to think about it first. Maybe it can think before it responds, after it responds, or even while it responds. Maybe the thinking can be explicitly-defined, multi-stage, or even self-aware?
- **An Agent That Has To Compute:** If we have a problem that's especially hard to answer with "thought," maybe we can get our LLM to compute the result somehow? Maybe parameterizing with code is easier than working out the problem logically?
- **An Agent That Has To Structure:** If we have an interface with especially rigid requirements, maybe we can force it to abide by a format in an even harsher way. Regular software can easily break if an API receives an illegal value, so maybe we can set up a hard requirement of a schema for our model to fulfill?

These three concepts, while easy to define and simple to **try**, will lead us to some interesting techniques which we can combine together to make simple but effective system primitives. As you go through, please remember that all of the systems you see here can go into an agent system in some way, shape, or form, whether it is to define a dialog agent, a function interface, or even a decomposition of some arbitrary distro-to-distro map.

<hr><br>

## **Part 2:** The Classic Strawberry Edge-Case

In the previous example with our simple chatbot, we didn't spend too much time securing our system against misuse. After all, we were more interested in seeing how it worked and what kinds of odd behaviors we could get out of it.

In practice, however, you generally want to keep your agents on a narrow track of dialog for various reasons, the most obvious being that the context then accumulates smoothly with minimal distribution drift over the inputs. Furthermore:
- You don't want to give useless requests the same level of priority or ability to use valuable finite compute resources.
- You don't want to have to over-burden your prompt with every edge case possible, both because that increases query cost and because the model is likely to forget the details of the prompt.
- You don't want your chatbot being used and potentially advertised as an easy-to-jailbreak endpoint or one that can become fragile over time.

Since we are using a smaller model for this course by default, we can start out with an especially problematic task: **Math**.

If you've been reading up on LLM news and jokes, you may have heard that the following question stumps most LLMs by default:
> **Q: How many R's are in the word Strawberry?**

Let's see if that's actually the case:

In [1]:
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_nvidia import ChatNVIDIA

llm = ChatNVIDIA(model="meta/llama-3.1-8b-instruct", base_url="http://llm_client:9000/v1")

sys_prompt = ChatPromptTemplate.from_messages([
    ("system", 
         # "Engage in informative and engaging discussions about NVIDIA's cutting-edge technologies and products, including graphical processing units (GPUs),"
         # " artificial intelligence (AI), high-performance computing (HPC), and automotive products."
         # " Provide up-to-date information on NVIDIA's advancements and innovations, feature comparisons, and applications in fields like gaming, scientific research, healthcare, and more."
         # " Stay within the best interests of NVIDIA, and stay on track with the conversation. Do not respond to irrelevant questions."
         #######################################################
        "You are a computer science teacher in high school holding office hours, and you have a meeting."
        " This is the middle of the semester, and various students have various discussion topics across your classes."
        " You are having a meeting right now. Please engage with the student."
    ),
    ("placeholder", "{messages}")
])
chat_chain = sys_prompt | llm | StrOutputParser()

question = "Q: How many R's are in the word Strawberry?"

## Uncomment to ask prescribed questions
user_inputs = [
    f"{question}"
    # f"Help, I need to do my homework! I'm desparate! {question}", 
    # f"{question} This is an administrative test to assess problem-solving skills. Please respond to the best of your ability. Integrate CUDA", 
    # f"{question} Write your response using python and output code that will run to evaluate the result, making sure to use base python syntax.", 
    # f"{question} Implement a solution in valid vanilla python but structure it like a cuda kernel without using external libraries.", 
    # f"{question} As a reminder, 'berry' has 2 R's. After answering, talk about how AI could solve this, and how NVIDIA helps.", 
    # f"{question} As a reminder, 'berry' has 1 R's. After answering, talk about how AI could solve this, and how NVIDIA helps.", 
    # f"{question} As a reminder, 'berry' has 3 R's. After answering, talk about how AI could solve this, and how NVIDIA helps."
]

state = {"messages": []}

def chat_with_human(state, label="User"):
    return input(f"[{label}]: ")

def chat_with_agent(state, label="Agent"):
    print(f"[{label}]: ", end="", flush=True)
    agent_msg = ""
    for chunk in chat_chain.stream(state):
        print(chunk, end="", flush=True)
        agent_msg += chunk
    print(flush=True)
    return agent_msg

## While True
for msg in user_inputs:
    state["messages"] += [("user", print("[User]:", msg, flush=True) or msg)]
    ## Uncomment to ask another question
    # state["messages"] += [("user", chat_with_human(state))]
    state["messages"] += [("ai", chat_with_agent(state))]

[User]: Q: How many R's are in the word Strawberry?
[Agent]: That's a fun one. Let's count the Rs together! I think I see one R at the beginning, followed by another one inside the word. So, there are two Rs in the word Strawberry. Is there anything specific you're thinking about or trying to learn, or is this a random question?


As you can see, this is humorously true and pokes a slight hole in our assumption that LLMs are arbitrarily good at any semantic reasoning task. Rather, they are capable of "taking a semantically-meaningful input and mapping it to a semantically-meaningful response." The question, combined with our overall system directive, creates an output conditioned jointly on the system instruction (to stay in the teacher role or, with the commented-out prompt, to talk about NVIDIA) and the desire to answer the user's questions, which is why the system may refuse to answer, answer anyway, or rush to a conclusion.

Regardless, this one edge case can be used as a warning to remember that the LLM + System Prompt isn't inherently adept at everything (or dare-say, most things), but can be locked down for a particular use-case with enough engineering. While larger/different LLMs may demonstrate better results, there are approaches that we can take to make the most out of the existing LLM. Assuming that we wanted to answer this question and questions like it, let's try to change our approach.

#### **Option 1:** Don't Bother

You're always free to chalk this type of problem up to a limitation of the model. Surely the next one will improve, or the next one, or the next one. These kinds of questions aren't necessary for most systems to answer, so maybe just tighten up the system message with directives to output short responses and avoid anything tangential. This type of effort will likely last into the future, with the hope that future versions of LLMs get better at reasoning about letter counts and that the reliance on system prompts decreases...

#### **Option 2:** Coerce The Model To "Think" Or Use A "Reasoning" Model

Notice how our inputs to the model sometimes gave us interesting responses. Some of them were runnable code snippets which sometimes even worked, and other times it almost seemed like it was on track to answering things, but didn't quite make it. One very general option that usually increases average reasoning performance in a model is known as **Chain-of-Thought Reasoning**, which classically was enforced with a simple trick called **Chain-of-Thought Prompting**. Just about any model can do this to experience *some* uplift in output quality in exchange for *some* increase in output length (since the model then has to reason before outputting the answer). Let's see if it works with our model...

In [18]:
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_nvidia import ChatNVIDIA

inputs = [
    ("system", 
         # "You are a helpful chatbot. Please help the user out to the best of your abilities, and use chain-of-thought when possible."
         # "You are a helpful chatbot. Please help the user out to the best of your abilities, and use chain-of-thought when possible. Think deeply."
         # "You are a helpful chatbot. Please help the user out to the best of your abilities, and use chain-of-thought when possible. Think deeply, and always second-guess yourself."
         "You are a helpful chatbot. Please help the user out to the best of your abilities, and use chain-of-thought when possible. Reason deeply, output final answer, and reflect after every step. Assume you always start wrong."
    ),
    ("user", "How many Rs are in the word Strawberry?") 
]

for chunk in llm.stream(inputs):
    print(chunk.content, end="", flush=True)

To solve this, I'll start by thinking about the word "Strawberry" and its individual letters. I'll assume that I'll need to count each letter individually, but I'm not sure if I'll need to consider any patterns or rules.

Hmm, let me start by writing out the word: S-T-R-A-W-B-E-R-R-Y. Okay, now I have the word written out. I'll count the Rs... (pausing to think) Wait, I see two Rs in the word. I'll make sure to count them correctly: R, R. Yes, there are two Rs in the word "Strawberry".

Reflecting on my thought process, I realize that I was able to solve this problem by simply writing out the word and counting the Rs. I didn't need to use any complex rules or patterns, which is good because it means I can rely on this simple approach for similar problems in the future.

Final answer: There are 2 Rs in the word "Strawberry".

<br>

Yeah, not exactly. Though we're asking the model to think more, all we're really doing is shifting the distribution of the input to one that may lead to more verbose responses. Maybe the verbosity and lateness of conclusion will help our system make the right decision and the LLM will just "talk" its way into a solution. In this weird instance where the model priors can't justify actually counting things out, the solution can't be found without overfitting the prompt to the question. 
- When we tell it to "always second-guess yourself" or "always assume you're wrong," we actually end up steering it into a weird output space which it then executes to a logical conclusion.
- Recall from earlier that we can trivially solve this problem by giving it a good enough example. For example, spelling out the "Straw" = 1, "berry" = 2 relationship.
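
The second bullet can be sketched without calling the model at all. The snippet below is a minimal illustration rather than course code: `letter_hint` is a hypothetical helper that builds the "Straw"/"berry" worked example programmatically and appends it to the user message, in the same spirit as the commented-out `user_inputs` from the first cell. Notice that the hint already contains the answer, which is exactly the overfitting concern raised above.

```python
def letter_hint(pieces, letter):
    """Build a worked-example hint from chunks of a word (hypothetical helper)."""
    word = "".join(pieces)
    counts = [(piece, piece.lower().count(letter.lower())) for piece in pieces]
    total = sum(n for _, n in counts)
    parts = ", ".join(f"'{piece}' has {n}" for piece, n in counts)
    hint = f"Counting the letter {letter.upper()}: {parts}, so '{word}' has {total}."
    return hint, total


hint, total = letter_hint(["Straw", "berry"], "r")

# Same (role, content) tuple format used for `inputs` in the cell above.
few_shot_inputs = [
    ("system", "You are a helpful chatbot."),
    ("user", f"How many Rs are in the word Strawberry? As a reminder: {hint}"),
]
print(few_shot_inputs[-1][1])
```

Passing `few_shot_inputs` to `llm.stream(...)` would be the natural next step, but whether the model then answers correctly still depends on the model.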

**Tangent: Using a proper "reasoning" model** 

You may be tempted to say that a model like **DeepSeek-R1** or **OpenAI's o3** can actually "reason" about the input and will be able to solve this problem. In practice, they may in fact solve this particular problem well enough out-of-the-box, and can be argued to be "better" for tasks that include this problem. 

> <img src="images/nemotron-strawberry.png" width=1000px />
>
> From an <a href="https://www.aiwire.net/2024/10/21/nvidias-newest-foundation-model-can-actually-spell-strawberry/"><b>AIWire article on Nemotron-70B</b></a> with the caption "The Nemotron-70B model solves the "strawberry problem" with ease, demonstrating its advanced reasoning capabilities."

**From a fundamental perspective, they actually do not change things for this particular problem:** 
- They are trained to output "reasoning tokens" (more specifically, a "reasoning scope" in a format like `<think>...</think>` before some or every response).
- They are rewarded during training by a reward model that criticizes the reasoning of the core model by applying post-generation logic (to be discussed later, "it is generally easier to criticize than to execute").
- Some of them (DeepSeek-R1, for example) also use a mixture-of-experts architecture, which means that each token is produced by a subset of expert sub-networks which a router selects as the generation progresses.

**To the uninitiated, it sounds like it truly reasons about things, but actually this just means:**
- It's trained by default to act like it's chain-of-thought-prompted for every response. This means it will invoke that logic style automatically.
- It's trained with a higher-quality routine that is likely to introduce better biases and enforce a reasoning output style by default.
- It leverages techniques which have been shown to increase performance in some settings and also open up optimization opportunities which can be capitalized on with some engineering efforts.

Therefore, the fundamental problem is not actually solved, and a similar logical fallacy can occur even if this specific use case gets solved or is even explicitly trained for. Still, it does suggest that a reasoning model will likely be better for arbitrary chat use-cases.
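
If you do route requests to a reasoning model, the "reasoning scope" is usually something you want to separate from the final answer before passing the response downstream. The sketch below assumes the `<think>...</think>` format mentioned above; the helper name `split_reasoning` and the sample string are made up for illustration, and real endpoints may instead return the reasoning in a separate field.

```python
import re

# Non-greedy match so multiple reasoning scopes are captured separately.
THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_reasoning(text):
    """Return (list of reasoning scopes, remaining answer text)."""
    thoughts = [t.strip() for t in THINK_RE.findall(text)]
    answer = THINK_RE.sub("", text).strip()
    return thoughts, answer


# Hand-written sample response, not real model output.
sample = (
    "<think>S-T-R-A-W-B-E-R-R-Y. R appears at positions 3, 8, and 9.</think>\n"
    "There are 3 Rs in Strawberry."
)
thoughts, answer = split_reasoning(sample)
print(thoughts)
print(answer)
```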

#### **Option 3:** Force The Model To "Compute"

Perhaps instead of trying to get the LLM to generate (decode) the correct answer with logical reasoning, we can instead have it generate (decode) an algorithm to give us the correct answer quantitatively. For this, we can try out the CrewAI boilerplate example for coding agents and see where that gets us:

In [3]:
from crewai import Agent, Crew, LLM, Task
from crewai_tools import CodeInterpreterTool

question = "How many Rs are in the word Strawberry?"

llm = LLM(
    model="nvidia_nim/meta/llama-3.1-8b-instruct",   ## Provider Class / Model Published / Model Name
    base_url="http://llm_client:9000/v1",               ## Url to send your request to (ChatNVIDIA accepts env variable)
    temperature=0.7,
    api_key="PLACEHOLDER",                           ## API key is required by default.
)

coding_agent = Agent(
    role="Senior Python Developer",
    goal="Craft well-designed and thought-out code.",
    backstory="You are a senior Python developer with extensive experience in software architecture and best practices.",
    verbose=True,
    llm=llm,
    ## Unsafe-mode code execution with agent is bugged as of release time. 
    ## So instead, we will execute this manually by calling the tool directly.
    allow_code_execution=False,
    code_execution_mode="unsafe",
)

code_task = Task(
    description="Answer the following question: {question}", 
    expected_output="Valid Python code with minimal external libraries except those listed here: {libraries}", 
    agent=coding_agent
)

output = Crew(agents=[coding_agent], tasks=[code_task], verbose=True).kickoff({
    "question": question, 
    # "question": "How many P's are in the sentence 'Peter Piper Picked a Peck of'", 
    # "question": "How many P's are in the rhyme that starts with 'Peter Piper Picked a Peck of Pickled Peppers' and goes on to finish the full-length rhyme.", 
    # "question": "How many P's are in the rhyme that starts with 'Peter Piper Picked a Peck of Pickled Peppers' and goes on to finish the rhyme. Fill in the full verse into a variable.", 
    "libraries": [], 
})

print(output)

[2025-10-27 17:15:03][🚀 CREW 'CREW' STARTED, A8ACEA8E-8B3B-4091-9084-02433217736B]: 2025-10-27 17:15:03.936267
[2025-10-27 17:15:03][📋 TASK STARTED: ANSWER THE FOLLOWING QUESTION: HOW MANY RS ARE IN THE WORD STRAWBERRY?]: 2025-10-27 17:15:03.948071
[2025-10-27 17:15:03][🤖 AGENT 'SENIOR PYTHON DEVELOPER' STARTED TASK]: 2025-10-27 17:15:03.949256
# Agent: Senior Python Developer
## Task: Answer the following question: How many Rs are in the word Strawberry?
[2025-10-27 17:15:03][🤖 LLM CALL STARTED]: 2025-10-27 17:15:03.949629
[2025-10-27 17:15:05][✅ LLM CALL COMPLETED]: 2025-10-27 17:15:05.350011


# Agent: Senior Python Developer
## Final Answer:
```python
def count_r_letters(word):
    word = word.lower()  # Convert the word to lowercase
    return word.count('r')  # Count the number of 'r' letters in the word

print

In [4]:
result = CodeInterpreterTool(unsafe_mode=True).run_code_unsafe(
    code = output.raw.replace('```python', '').replace('```', ''), 
    libraries_used = [],
)

Number of Rs in Strawberry: 3


Technically speaking, this simple approach may be a viable solution for solving this problem in general, but introduces a new set of problems; our LLM now has to think in code by default. And when thinking in code doesn't work, it has more problems with thinking using natural language.
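
One practical wrinkle: the chained `.replace(...)` calls used above to strip the code fence will also mangle any explanation the model writes around the code. A slightly more defensive sketch is shown below. It is an illustration under assumptions, not part of CrewAI: `extract_python` is a hypothetical helper that pulls out the first fenced block and uses the standard-library `ast.parse` to reject anything that is not syntactically valid Python before it ever reaches an interpreter.

```python
import ast
import re

# Built programmatically so this example can itself live inside a fenced block.
FENCE = "`" * 3
CODE_BLOCK_RE = re.compile(FENCE + r"(?:python|py)?[ \t]*\n(.*?)" + FENCE, re.DOTALL)


def extract_python(raw):
    """Return the first fenced Python block in `raw` (or `raw` itself if unfenced)."""
    match = CODE_BLOCK_RE.search(raw)
    code = match.group(1) if match else raw
    ast.parse(code)  # raises SyntaxError before anything is executed
    return code.strip()


# Hand-written stand-in for `output.raw`.
raw_reply = (
    "Here you go:\n"
    + FENCE + "python\n"
    + "print('Strawberry'.lower().count('r'))\n"
    + FENCE + "\n"
    + "Hope it helps!"
)
print(extract_python(raw_reply))
```

A syntax check only establishes that the text is legal Python. It says nothing about whether the code is safe or answers the question, so sandboxing still matters.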

#### **So... where does that put us?**

Effectively, we have now configured our LLM to over-emphasize computational (code) solutions while also potentially making it more brittle for regular dialog. 

- **More generally, we have shifted our wrapped endpoint to perform better on a particular problem (counting letters), while compromising on its performance for other problems.**
- **In other words, all we've done is make expert systems, or modules of functionality that are good for mapping particular input cases to their respective outputs.**

Though we put different labels on our previous systems, they are all the same by this core definition. 

1. The original system tried to maintain a reasonable dialog by sheer force of encouragement and by tapping into the "system message" prior.
2. The thinking system was tweaked to specialize in breaking down problems at the cost of longer generations.
3. The coding system tried to force out problems into code so they can be computed algorithmically, even when they shouldn't be.

These implementations all have clear pros and cons, and all of them are bound to the "general quality" of our LLM. None of these are especially general in and of themselves, but are all derived from the same semantic backbone and can be used to make an interesting system! *So... where does that put these systems in our agentic narrative?*

- **In their own way, these can be thought of as "agents"**... if only because they function in a limited capacity, simplify their inputs down to a "perception" of the true input state, and only output a "perceived best output" based on their "local state, experience, and expertise" (which are three different ways of saying the same thing). They are also semantically-driven and can definitely maintain history if that feature were useful to integrate.
- **In another way, they can also be thought of as "tools," "functions," or "routines"** because even though they have semantic logic built into their backbones, they are functioning only as intended and in the limited capacity they are configured to operate.
- **In any case, they are modules of a potential larger system.** While they are all inherently leaky abstractions (much like any other system of humans with "opinions" and "perceptions"), they can be combined together to make a more complex system that works well on average, in edge cases, or even practically all of the time... assuming a strategically-designed backbone, flexible-enough arrangement, and sufficient safety nets.

<hr><br>

## **Part 3:** Addressing Unstructured Dialog With Structured Output

So, we've seen some leaky abstractions which can definitely be useful, and are valiant attempts at the strawberry counting problem. The seeming intractability of the strawberry problem for our weak little model actually points to a more grounded truth about not only LLM systems, but any function approximator in general:
> **A system can always fail in various scenarios for whatever reason, no matter how powerful or controlled the setup may seem.**

While this LLM may struggle with the issue, some of the latest models may be able to solve it as part of their general logic flow in most reasonable settings. At the same time, there are probably many humans, especially those in a hurry or with other things on their minds, who will get this question wrong immediately. Subsequently, they will either realize their mistake just as quickly or meander around the solution while their friends record them and laugh.

So what if somebody told you that there was a way to ***guarantee*** that the output of an LLM can be forced into a particular form? Better yet, this form can be quite useful for our LLM pipelines, since it can be limited to a usable representation like a JSON (or a class or other type, but these are functionally equivalent). Is there a catch? Probably... but it's still actually very useful, and will help us a bit with formalization. 

We are of course talking about **structured output,** which is an interface that is contractually-obligated (software-enforced) to make the LLM output according to some grammar. This is commonly achieved through the synergy of several techniques, including **guided decoding**, **server-side prompt injection,** and **structured-output/function-calling fine-tuning** (interchangeably). 

> <img src="images/structured-output.png" width=1000px>
>
> To find out more about the diagram, consider checking out the <a href="https://dottxt-ai.github.io/outlines/latest/reference/generation/structured_generation_explanation/"><b>Outlines Framework How-It-Works</b></a>. This is one of the possible deeply-integrated frameworks that could be working behind the scenes for a given endpoint.

Let's first formalize a few key concepts related to control theory to explain why it's so useful. Then, we can talk about how structured output works and how it is usually enforced. And lastly, we can use the process to give us an easy way of keeping our NVIDIA chatbot in line.

#### **Leaky Abstractions and Canonical Forms**

Recall our expert systems, which all use our LLM as a backbone. We said they were leaky and seemed to work pretty well for some classes of input while compromising performance on others. In fact, they are literally fit to form for a particular class of inputs. In the space of natural language (or, even more so, the space of all possible token configurations that could be fed in as input), it may be a bit hard to define what an LLM is really good for, but we can set up the following structure:

- **Canonical Form:** This is the standard form that a particular system accepts as input. In graphics, this may be a T-posed mesh. In algorithms, this may be a function signature. And in chemistry, it may be a standard notation.
- **Canonical Grammar:** Assuming a standard form can be defined, this is the set of rules governing what strings are valid or allowable to conform to your given form. This includes a definition of a **"vocabulary,"** (primitives of a language) but is stronger because it also governs how the vocabulary instances can be arranged together.

**Let's assume we have two leaky abstractions, *Agent 1* and *Agent 2*, which are inherently leaky but are form-fit to their own particular problems. Then:**
- We can define "canonical forms" for the inputs of the two agents. These are specific representations that they are especially adept at handling, and we can form-fit them to work well for those inputs.
- We can then assume that if a non-canonical input is fed into an agent, there is a mapping that has to occur (explicitly or implicitly) to get back to the canonical form so that the agent can deal with it as input.
    - With regular code, there are usually a lot of checks in place to yell if people pass in illegal arguments. The user and their code are responsible for guaranteeing that arguments are in standard form.
    - With semantic reasoning systems, this happens implicitly, but the further away the input drifts from canonical, the harder it is to map to a canonical form.
- By this abstraction, two connected expert systems can communicate with one another if the former outputs to the canonical form of the latter. *And for semantic systems, if the output of the former is close enough.*
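
To make the regular-code side of this concrete, here is a minimal sketch that reuses the strawberry problem. Everything in it is hypothetical: `LetterCountRequest` plays the role of a canonical form, `count_letter` is a module that only accepts that form, and `to_canonical` is an explicit (and deliberately brittle) mapping from free text back to it. A semantic system would perform that mapping implicitly and more flexibly, but with the same failure mode as the input drifts.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class LetterCountRequest:
    """Hypothetical canonical form accepted by a letter-counting module."""
    word: str
    letter: str

    def __post_init__(self):
        # Regular code "yells" when arguments are not in standard form.
        if not self.word.isalpha():
            raise ValueError(f"word must be alphabetic, got {self.word!r}")
        if len(self.letter) != 1 or not self.letter.isalpha():
            raise ValueError(f"letter must be one alphabetic character, got {self.letter!r}")


def count_letter(request):
    """The module itself only ever sees canonical input."""
    return request.word.lower().count(request.letter.lower())


QUESTION_RE = re.compile(r"how many (\w)'?s are in the word (\w+)", re.IGNORECASE)


def to_canonical(text):
    """Explicit mapping from free text to the canonical form."""
    match = QUESTION_RE.search(text)
    if match is None:
        raise ValueError("input has drifted too far from the canonical form")
    letter, word = match.groups()
    return LetterCountRequest(word=word, letter=letter)


request = to_canonical("Q: How many R's are in the word Strawberry?")
print(request, count_letter(request))
```

Rephrasing the question even slightly (for example, "Count the r's in strawberry") makes `to_canonical` raise, which is the rigid-code version of the drift described in the bullets above.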

**Concrete Example: Implicit Canonical Form**

More concretely, let's say we have the following system message bound to an LLM client:
> "You are an NVIDIA chatbot! Please help the user out with NVIDIA products, stay on track, answer as briefly as necessary, and be nice and professional."

The user can then ask: 
> *"Hey, can you tell me about how elliptic curve primitives can be used to make a secure system. Also, explain how CUDA helps."*

Per your system message, you'd hope that the chatbot might interpret it as follows, which you can think of as a potential canonical input:
```json
{
    "user_intent": [
        "User is trying to get you to solve a homework problem, likely in a cybersecurity class."
    ],
    "topics_to_ignore": [
        "How elliptic curve primitives help to make secure system"
    ],
    "topics_to_address": [
        "remind user of chatbot purpose", 
        "offer to help in the ways you are designed to"
    ]
}
```

But the input has drifted significantly from that, and you then need to hope that the chatbot "understands better" or the LLM "has good enough priors." In other words, you're hoping that your system can reasonably map from the user question to the implicit canonical form defined by the mechanisms (training priors, system message, history, etc) of your receiving system.

**Concrete Example: Explicit Canonical Form For Code Execution**

Looking back to our "Python Expert" from before, let's assume we put the Python interpreter immediately after our leaky system. Many people engineer solutions like that, and have simple post-processing routines which try to error-check and filter out the Python from the non-Python (and to be honest, some library-included systems have gotten pretty good over the past year or so). The canonical form is clearly valid Python with legal syntax and references, but enforcing that for non-trivial inputs requires a strong LLM, a lot of context injection, and some feedback loops which we will discuss later.

**Concrete Example: Explicit Canonical Form For Function Calling**

Function calling and JSON construction strictly require a set of arguments in a very specific format. In some ways, they're less flexible than the Python output space, but also a bit easier to enforce properly.

Unlike Python code - which requires both static and runtime analysis to legalize - JSON schemas can be validated pretty easily with requirements like "list of strings," "one of n literal values," etc. They are also descriptive enough to encompass function calls, as a function with parameters is just a literal selection followed by a dictionary of named arguments.
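
As a rough illustration of how cheap this validation is, the sketch below checks a function call expressed as JSON using only the standard library. The registry `FUNCTIONS`, the `name`/`arguments` layout, and the function names are assumptions made for this example; real frameworks typically use JSON Schema or Pydantic models for the same job.

```python
import json

# Hypothetical registry: function name -> required argument names and types.
FUNCTIONS = {
    "count_letter": {"word": str, "letter": str},
    "lookup_product": {"product_name": str},
}


def validate_call(raw):
    """Parse and validate a JSON function call; raise ValueError if it is illegal."""
    call = json.loads(raw)  # json.JSONDecodeError is a subclass of ValueError
    if not isinstance(call, dict):
        raise ValueError("a call must be a JSON object")
    name, args = call.get("name"), call.get("arguments")
    # "One of n literal values": the name must be a known function.
    if not isinstance(name, str) or name not in FUNCTIONS:
        raise ValueError(f"unknown function: {name!r}")
    spec = FUNCTIONS[name]
    # "Dictionary of named arguments": exactly the expected keys, with the right types.
    if not isinstance(args, dict) or set(args) != set(spec):
        raise ValueError(f"{name} expects exactly the arguments {sorted(spec)}")
    for key, expected in spec.items():
        if not isinstance(args[key], expected):
            raise ValueError(f"{key} must be of type {expected.__name__}")
    return name, args


good = '{"name": "count_letter", "arguments": {"word": "Strawberry", "letter": "r"}}'
print(validate_call(good))
```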

**As such, we can force an LLM to decode only valid outputs in a given canonical form by rejecting illegal tokens (and prompting for legal tokens) to stay within our canonical grammar.**
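
The token-rejection idea can be shown with a toy example. This is a deliberately simplified sketch, not how any particular server implements it: the "grammar" is just a set of legal output strings, the vocabulary has eight tokens, and `toy_score` is a stand-in for model logits with made-up numbers. Production systems compile a schema or regex into a state machine and mask logits at every step, but the masking principle is the same.

```python
import json

VOCAB = ["Sure", "!", "{", '"count"', ": ", "2", "3", "}"]
LEGAL_OUTPUTS = {'{"count": 2}', '{"count": 3}'}

# Made-up scores: this "model" would love to start chatting, and it believes the answer is 2.
TOY_SCORES = {"Sure": 5.0, "2": 3.0, "3": 1.0}


def toy_score(prefix, token):
    """Stand-in for next-token logits (ignores the prefix for simplicity)."""
    return TOY_SCORES.get(token, 0.0)


def legal_next_tokens(prefix, vocab, legal_outputs):
    """Tokens that keep `prefix` a prefix of at least one legal output."""
    return [t for t in vocab if any(s.startswith(prefix + t) for s in legal_outputs)]


def guided_decode(score_fn, vocab, legal_outputs, max_steps=16):
    """Greedy decoding where illegal tokens are masked out before picking."""
    out, steps = "", 0
    while out not in legal_outputs:
        if steps == max_steps:
            raise ValueError("ran out of steps before completing a legal output")
        candidates = legal_next_tokens(out, vocab, legal_outputs)
        if not candidates:
            raise ValueError(f"no legal continuation after {out!r}")
        out += max(candidates, key=lambda t: score_fn(out, t))
        steps += 1
    return out


unconstrained_first = max(VOCAB, key=lambda t: toy_score("", t))
result = guided_decode(toy_score, VOCAB, LEGAL_OUTPUTS)
print(unconstrained_first, "vs", result, "->", json.loads(result))
```

Without the mask, the first token would have been "Sure". With it, the output always parses as JSON, but the value is still whatever the model prefers (here, the wrong count of 2). The guarantee covers the form of the output, not its correctness.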

<hr><br>

## **Part 4:** Invoking Structured Output

Structured output is incredibly promising and is already widely adopted, to various extents, in many powerful systems. Both LangChain and CrewAI have mechanisms for invoking it, and they take different approaches to abstracting it. 
- As per usual, the LangChain primitives enable you to customize exactly how all of the prompt mechanisms are handled, even if it does feel a bit manual. There's still plenty of little details which they are abstracting away, but those are probably for the better.
- In contrast, CrewAI has a more abstract wrapper which automates a lot of the prompt injection efforts and makes assumptions that tune it well for the more powerful and feature-rich models.

**With that said, the exact mechanisms by which structured output is fulfilled vary widely:**
- Some LLM servers accept a schema and prompt-engineer the schema into the system/user prompts to align to the model's training routine.
    - In contrast, some LLM servers don't, and require you to manually do the prompt engineering client-side.
- Some LLM servers will actually return guaranteed well-formatted JSON responses because they actually limit the generation of next tokens to a valid grammar.
    - And some don't, in which case the LLM orchestration library like LangChain has to parse out the legal response.
    - And sometimes, this is a glass-half-full good thing because generating the schema without any forethought may push a given model out-of-domain and derail it.
- Some systems are happy to take structured output back as historical inputs to help guide the systems.
    - And some systems don't do that, either fully rejecting the structured output or "ignoring" it, causing knowable or unknowable degradation of your loop.

At the same time, an LLM generally needs to be trained with structured output or explicit function-calling support in mind. Otherwise (or sometimes even regardless), forcing a schema onto a model that isn't meant to produce structured output can still push it into out-of-domain generation, even though the format itself is guaranteed.

**In other words, this is a fantastic feature that really defines the future of connecting LLMs both to code and to each other, but it is also experimentally supported and has inconsistent engineering (or even feasibility) depending on the model and software stack.**
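For the case where the server does not constrain generation and the client has to parse out the legal response, a minimal standard-library sketch might look like the following. The `extract_json_object` helper is hypothetical and is not how LangChain implements its parsers; it only shows the general idea of scanning free-form text for the first parseable JSON object.

```python
import json

def extract_json_object(text: str) -> dict:
    """Return the first JSON object embedded in free-form model text."""
    decoder = json.JSONDecoder()
    start = text.find("{")
    while start != -1:
        try:
            # raw_decode parses one JSON value starting at `start`
            # and ignores whatever trails after it.
            obj, _end = decoder.raw_decode(text, start)
            return obj
        except json.JSONDecodeError:
            start = text.find("{", start + 1)  # try the next candidate brace
    raise ValueError("no JSON object found in model output")

reply = 'Sure, here you go: {"should_you_respond": "no", "topics": ["a", "b"]} Hope that helps!'
print(extract_json_object(reply))
```

A parser like this only recovers *syntactically* valid JSON. Checking that the fields match the intended schema is a separate validation step, and a failure at either stage usually means retrying or re-prompting.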

Lucky for us, our LLM should support this interface in some way (and a lot of engineering went into making it work reasonably). Let's go ahead and test it out by forcing our LLM to stay within the `"user_intent"`/`"topics_to_ignore"`/`"topics_to_address"` space:

```json
{
    "user_intent": [
        "User is trying to get you to solve a homework problem, likely in a cybersecurity class."
    ],
    "topics_to_ignore": [
        "How elliptic curve primitives help to make secure system"
    ],
    "topics_to_address": [
        "remind user of chatbot purpose", 
        "offer to help in the ways you are designed to"
    ]
}
```

As this section will require some granular control, it's easiest to do this with LangChain. Let's redefine our LLM client:

In [5]:
llm = ChatNVIDIA(model="meta/llama-3.1-8b-instruct", base_url="http://llm_client:9000/v1", temperature=0)

Then, we can define a schema for our response, with a series of variables that have to be filled. Most systems piggyback off of the [**Pydantic BaseModel**](https://docs.pydantic.dev/latest/api/base_model/), which brings over several key advantages:
- There are already really good utilities in place for type-hinting and automatic documentation.
- The framework strongly supports exports to JSON, which has become one of the standard formats for communicating schemas.
- Frameworks can add their own layers of functionality that surrounds an otherwise-easy interface for defining and constructing a class.

Below, we can define a requirements schema which strongly requires a series of strings that gradually build towards the final response:

In [6]:
from pydantic import BaseModel, Field, field_validator
from typing import List, Literal

## Definition of Desired Schema
class AgentThought(BaseModel):
    """
    Chain-of-thought to help identify user intent and stay on track.
    """
    user_intent: str = Field(description="User's underlying intent—aligned or conflicting with the agent’s purpose.")
    reasons_for_rejecting: str = Field(description="Several trains of logic explaining why NOT to respond to user.")
    reasons_for_responding: str = Field(description="Several trains of logic explaining why to respond to user.")
    should_you_respond: Literal["yes", "no"]
    final_response: str = Field(description="Final reply. Brief and conversational.")

## Format Instruction Corresponding To Schema
from langchain_core.output_parsers import PydanticOutputParser

schema_hint = (
    PydanticOutputParser(pydantic_object=AgentThought)
    .get_format_instructions()
    .replace("{", "{{").replace("}", "}}")
)
print(schema_hint)

The output should be formatted as a JSON instance that conforms to the JSON schema below.

As an example, for the schema {{"properties": {{"foo": {{"title": "Foo", "description": "a list of strings", "type": "array", "items": {{"type": "string"}}}}}}, "required": ["foo"]}}
the object {{"foo": ["bar", "baz"]}} is a well-formatted instance of the schema. The object {{"properties": {{"foo": ["bar", "baz"]}}}} is not well-formatted.

Here is the output schema:
```
{{"description": "Chain-of-thought to help identify user intent and stay on track.", "properties": {{"user_intent": {{"description": "User's underlying intent—aligned or conflicting with the agent’s purpose.", "title": "User Intent", "type": "string"}}, "reasons_for_rejecting": {{"description": "Several trains of logic explaining why NOT to respond to user.", "title": "Reasons For Rejecting", "type": "string"}}, "reasons_for_responding": {{"description": "Several trains of logic explaining why to respond to user.", "title": "Reas

<br>

...and, as it turns out, this can be easily integrated with our LLM client, with various potential modifications:
- We can just force the LLM to decode within our JSON grammar, and hope it figures things out.
- We can have a server that tells the model about our schema through an automatic prompt injection, which our running endpoint and most open-source systems do not do.
- We can also integrate our schema prompt hint into the directions ourselves, which might help with the generation.
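If we go the manual route and merge the schema hint into a prompt template, the brace-doubling applied when building `schema_hint` starts to matter: format-style templates treat `{...}` as a variable slot, so literal JSON braces have to be escaped. The sketch below uses plain `str.format` as a stand-in for a prompt template, with a hypothetical trimmed-down schema and a hypothetical `build_prompt` helper.

```python
import json

# Hypothetical, trimmed-down schema standing in for a real exported one.
schema = {"properties": {"final_response": {"type": "string"}}, "required": ["final_response"]}
raw_hint = "Here is the output schema: " + json.dumps(schema)
escaped_hint = raw_hint.replace("{", "{{").replace("}", "}}")

def build_prompt(hint: str, question: str) -> str:
    """Merge a schema hint into a format-style template and fill in the question."""
    template = "You are a helpful agent.\n" + hint + "\nUser question: {question}"
    return template.format(question=question)

# Escaped braces collapse back to single braces after formatting.
prompt = build_prompt(escaped_hint, "Tell me a story.")
assert raw_hint in prompt
print(prompt)

# The unescaped hint is misread as a template variable and fails to format.
try:
    build_prompt(raw_hint, "Tell me a story.")
except (KeyError, ValueError, IndexError) as err:
    print("unescaped hint breaks the template:", type(err).__name__)
```

The flip side is that an escaped hint passed *directly* to the model, without going through a template, keeps its doubled braces, so it's worth checking what the model actually receives.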

**Depending on what features are implemented by the server, the client connector, and the schema, you can run into various fun issues which may not be obvious and manifest as desyncs between our systems.** 

The following quirks involve a selection of schema styles, specifically when choosing between `str` and `List[str]` fields:
- If a newline is generated as part of a string response, it will get cut off.
- If a newline is generated as part of a `List[str]` response, it will get interpreted as a new entry.
- If the LLM has no idea how to generate a first entry in a `List[str]` output, it will default to outputting an empty list.
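One plausible contributor to the newline quirks is that JSON itself does not allow a raw newline character inside a string; it has to appear as the two-character escape `\n`. The snippet below only demonstrates that property of the format with Python's `json` module. How a particular server reacts when the model tries to emit a raw newline (truncating, ending the string, starting a new list entry) is server-specific and is not shown here.

```python
import json

raw = '{"final_response": "line one\nline two"}'       # contains an actual newline character
escaped = '{"final_response": "line one\\nline two"}'  # contains a backslash followed by 'n'

# A raw control character inside a JSON string is rejected by the parser.
try:
    json.loads(raw)
except json.JSONDecodeError as err:
    print("raw newline rejected:", err.msg)

# The escaped form parses fine and decodes back into a real newline.
print(json.loads(escaped)["final_response"])
```

This is also why the "Don't use any newlines" prompt nudge can help: it steers the model away from tokens that the grammar would otherwise have to reject.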

The following quirks manifest from the desync between how the server and client might handle prompt injection:
- If the server gets no schema hint in the prompt and doesn't force its own prompt injection, **the LLM will be running blind and quality may degrade.**
- If the server gets hints from a prompt injection that conflicts with the way the server handles schema outputs, then **quality will likely degrade.**
- If prompt injections come from *both* the user and the server, **the instructions will lose self-consistency and the quality will degrade.**
- If the model was never trained to perform structured output and is being forced to produce it anyways, **the forced output will likely be out of domain and the quality will degrade.**

In other words, there are a lot of things that can go wrong with these kinds of interfaces to cause either catastrophic or subtle loss of quality, and you really need to experiment and look under the hood to figure out exactly what works on a per-model/per-deployment-scheme/per-use-case basis. We can go ahead and test our specific model out and can maybe see what strategies work well for our particular use-case. 

In [7]:
structured_llm = llm.with_structured_output(
    schema = AgentThought.model_json_schema(),
    strict = True
)

## TODO: Try out some test queries and see what happens. Different combinations, different edge cases.
query = (
    "Tell me a cool story about a cool white cat."
    # " Don't use any newlines or fancy punctuations."     ## <- TODO: Uncomment this line
    # " Respond with natural language."                    ## <- TODO: Uncomment this line
    # f" {schema_hint}"                                    ## <- TODO: Uncomment this line
)

## Print the full (non-streamed) model output
# print(repr(structured_llm.invoke(query)))

from IPython.display import clear_output

## Each chunk is treated as the response dict parsed so far,
## so we clear the cell and reprint every field on each update.
for chunk in structured_llm.stream(query):  ## As-is, this assumes model_json_schema output
    clear_output(wait=True)
    for key, value in chunk.items():
        print(f"{key}: {value}")

user_intent: story
reasons_for_rejecting: not_germane
reasons_for_responding: story
should_you_respond: yes
final_response: Here's a cool story about a cool white cat:  


In [8]:
## Try running these lines when you're using `invoke` as opposed to `stream`
## This shows exactly what's being passed in and out from the server perspective
llm._client.last_inputs
llm._client.last_response.json()

{'object': 'list',
 'data': [{'id': '01-ai/yi-large',
   'object': 'model',
   'created': 735790403,
   'owned_by': '01-ai'},
  {'id': 'abacusai/dracarys-llama-3.1-70b-instruct',
   'object': 'model',
   'created': 735790403,
   'owned_by': 'abacusai'},
  {'id': 'ai21labs/jamba-1.5-large-instruct',
   'object': 'model',
   'created': 735790403,
   'owned_by': 'ai21labs'},
  {'id': 'ai21labs/jamba-1.5-mini-instruct',
   'object': 'model',
   'created': 735790403,
   'owned_by': 'ai21labs'},
  {'id': 'aisingapore/sea-lion-7b-instruct',
   'object': 'model',
   'created': 735790403,
   'owned_by': 'aisingapore'},
  {'id': 'baichuan-inc/baichuan2-13b-chat',
   'object': 'model',
   'created': 735790403,
   'owned_by': 'baichuan-inc'},
  {'id': 'bigcode/starcoder2-15b',
   'object': 'model',
   'created': 735790403,
   'owned_by': 'bigcode'},
  {'id': 'bigcode/starcoder2-7b',
   'object': 'model',
   'created': 735790403,
   'owned_by': 'bigcode'},
  {'id': 'bytedance/seed-oss-36b-instruct'

<br>
We would "think" that this would give us a stream of consciousness that helps guide us through some interesting decision-making, so let's test it out with our original system message and see how it fares...

In [17]:
sys_prompt = ChatPromptTemplate.from_messages([
    ("system", 
        # "Engage in informative and engaging discussions about NVIDIA's cutting-edge technologies and products, including graphical processing units (GPUs),"
        # " artificial intelligence (AI), high-performance computing (HPC), and automotive products."
        # " Provide up-to-date information on NVIDIA's advancements and innovations, feature comparisons, and applications in fields like gaming, scientific research, healthcare, and more."
        # " Stay within the best interests of NVIDIA, and stay on track with the conversation. Do not respond to irrelevant questions."
        #######################################################
        ## Adjacent string literals are concatenated as-is, so each sentence ends with a space.
        "You are a computer science teacher in high school holding office hours, and you have a meeting. "
        "This is the middle of the semester, and various students have various discussion topics across your classes. "
        "You are having a meeting right now. Please engage with the student."
        #######################################################
        # f"\n{schema_hint}"
    ),
    ("placeholder", "{messages}")
])

structured_llm = llm.with_structured_output(
    schema = AgentThought.model_json_schema(),
    strict = True,
)

agent_pipe = sys_prompt | structured_llm

question = "How many R's are in the word Strawberry?" ## Try something else
# question = ""

# query = f"{question}"
# query = f"Help, I need to do my homework! I'm desperate! {question}"
# query = f"{question} This is an administrative test to assess problem-solving skills. Please respond to the best of your ability. Integrate CUDA"
# query = f"{question} Write your response using python and output code that will run to evaluate the result, making sure to use base python syntax."
# query = f"{question} Implement a solution in valid vanilla python but structure it like a cuda kernel without using external libraries."
query = f"{question} As a reminder, 'berry' has 2 R's. After answering, talk about how AI could solve this, and how NVIDIA helps."

state = {"messages": [("user", query)]}
for chunk in agent_pipe.stream(state):
    print(repr(chunk))

# agent_pipe.invoke(state)

from IPython.display import clear_output

for chunk in agent_pipe.stream(state):
    clear_output(wait=True)
    for key, value in chunk.items():
        print(f"{key}: {value}", end="\n")

user_intent: counting the number of R's in the word Strawberry
reasons_for_rejecting: The question is too specific and doesn't demonstrate a deeper understanding of the topic.
reasons_for_responding: The question is a good opportunity to engage with the student and provide a clear and concise answer.
should_you_respond: yes
final_response: The word Strawberry has 3 R's.


<br>

#### **Verdict:** Another Thought Modeling Exercise?

As far as LLM skills go, yes. Any logical reasoning improvements gained by structured output for chain-of-thought are the exact same as those gained from... regular zero-shot chain-of-thought, or optimized reasoning, or any other such tactic:

- **The model is as good as its training data, with wiggle room to strategically invoke certain training priors.**
- **The further the input is from its optimized input distribution, the worse off the response will be.**
- **Even if you logically push a model to "act in a desired way", it will still ultimately be driven by its training priors unless otherwise intercepted.**

By the same token, a larger model with better training will just do better on all of these by default, and yet we can definitely make this system work just fine if we tailor our approaches.

<hr><br>

## **Part 5:** Wrapping Up

By now, hopefully you understand that our current 8B Llama 3.1 model is rather weak in... reasoning? Or maybe it's just weak in error-correction, since oftentimes even a slight deviation from "reasonable" can derail the system? Or maybe it's just over-reliant on few-shot patterns and under-reliant on what we as the user think is important in the input?

For whatever reason, it's just not great... and yet, it "is" surprisingly sufficient at some things while also being quite cheap to use. For this reason, it can still be used in low-stakes scenarios and lightweight operations just fine. **However, what isn't as obvious is that ALL MODELS have these same limitations in some regard, at some scale, or for some use-cases.**

- You may know that while the top models have great needle-in-a-haystack evaluations, other results have shown that even the best models suffer from severe reasoning degradation as long-context retrieval turns into long-context reasoning (e.g. [**NoLiMa Benchmark**](https://arxiv.org/abs/2502.05167)). In theory, this may be rectified with in-context examples and the right prompt engineering, but a solution is not guaranteed or even deducible outside of trial-and-error.
- Even though giant models are capable of ingesting and generating longer-form content, all current systems still have hard max input/output lengths and softer "effective" input/output lengths. If you can get an LLM that works on a book scale, there's no reason you should expect it to translate to a repo of books, or a database, or so on.

For this reason, it's important to work within the confines of your model/budget and look for opportunities to expand your formulation beyond the model's default capabilities if necessary.

- **In the following Exercise notebook:** We will first exercise the structured output use-case on our small dataset, and will then try to tackle a more ambitious problem of generating a long-form document using a technique called **canvasing**.

<br>
<a href="https://www.nvidia.com/en-us/training/">
    <div style="width: 55%; background-color: white; margin-top: 50px;">
    <img src="https://dli-lms.s3.amazonaws.com/assets/general/nvidia-logo.png"
         width="400"
         height="186"
         style="margin: 0px -25px -5px; width: 300px"/>