# Welcome to Week 2!

## Frontier Model APIs

In Week 1, we used multiple Frontier LLMs through their Chat UIs, and we connected to OpenAI's API.

Today we'll connect with the APIs for Anthropic and Google, as well as OpenAI.

<table style="margin: 0; text-align: left;">
    <tr>
        <td style="width: 150px; height: 150px; vertical-align: middle;">
            <img src="../important.jpg" width="150" height="150" style="display: block;" />
        </td>
        <td>
            <h2 style="color:#900;">Important Note - Please read me</h2>
            <span style="color:#900;">I'm continually improving these labs, adding more examples and exercises.
            At the start of each week, it's worth checking you have the latest code.<br/>
            First do a <a href="https://chatgpt.com/share/6734e705-3270-8012-a074-421661af6ba9">git pull and merge your changes as needed</a>. Any problems? Try asking ChatGPT to clarify how to merge - or contact me!<br/><br/>
            After you've pulled the code, from the llm_engineering directory, in an Anaconda prompt (PC) or Terminal (Mac), run:<br/>
            <code>conda env update -f environment.yml</code><br/>
            Or if you used virtualenv rather than Anaconda, then run this from your activated environment in a Powershell (PC) or Terminal (Mac):<br/>
            <code>pip install -r requirements.txt</code>
            <br/>Then restart the kernel (Kernel menu >> Restart Kernel and Clear Outputs Of All Cells) to pick up the changes.
            </span>
        </td>
    </tr>
</table>
<table style="margin: 0; text-align: left;">
    <tr>
        <td style="width: 150px; height: 150px; vertical-align: middle;">
            <img src="../resources.jpg" width="150" height="150" style="display: block;" />
        </td>
        <td>
            <h2 style="color:#f71;">Reminder about the resources page</h2>
            <span style="color:#f71;">Here's a link to resources for the course. This includes links to all the slides.<br/>
            <a href="https://edwarddonner.com/2024/11/13/llm-engineering-resources/">https://edwarddonner.com/2024/11/13/llm-engineering-resources/</a><br/>
            Please keep this bookmarked, and I'll continue to add more useful links there over time.
            </span>
        </td>
    </tr>
</table>

## Setting up your keys

If you haven't done so already, you could now create API keys for Anthropic and Google in addition to OpenAI.

**Please note:** if you'd prefer to avoid extra API costs, feel free to skip setting up Anthropic and Google! You can watch me do it, and focus on OpenAI for the course. You could also substitute Ollama for Anthropic and/or Google, using the exercise you did in Week 1.

For OpenAI, visit https://openai.com/api/  
For Anthropic, visit https://console.anthropic.com/  
For Google, visit https://ai.google.dev/gemini-api  

### Also - adding DeepSeek if you wish

Optionally, if you'd like to also use DeepSeek, create an account [here](https://platform.deepseek.com/), create a key [here](https://platform.deepseek.com/api_keys) and top up with at least the minimum $2 [here](https://platform.deepseek.com/top_up).

### Adding API keys to your .env file

When you get your API keys, you need to set them as environment variables by adding them to your `.env` file.

```
OPENAI_API_KEY=xxxx
ANTHROPIC_API_KEY=xxxx
GOOGLE_API_KEY=xxxx
DEEPSEEK_API_KEY=xxxx
```

Afterwards, you may need to restart the Jupyter Lab Kernel (the Python process that sits behind this notebook) via the Kernel menu, and then rerun the cells from the top.

In [11]:
# imports

# import os
# from dotenv import load_dotenv
from openai import OpenAI
# import anthropic
from IPython.display import Markdown, display, update_display

In [2]:
# import for google
# in rare cases, this seems to give an error on some systems, or even crashes the kernel
# If this happens to you, simply ignore this cell - I give an alternative approach for using Gemini later

import google.generativeai



In [14]:
# Load environment variables from a file called .env
# Print the key prefixes to help with any debugging

# load_dotenv(override=True)
# openai_api_key = os.getenv('OPENAI_API_KEY')
# anthropic_api_key = os.getenv('ANTHROPIC_API_KEY')
# Initialize
import os
from dotenv import load_dotenv
load_dotenv()

google_api_key = os.getenv('GOOGLE_API_KEY')

# if openai_api_key:
#     print(f"OpenAI API Key exists and begins {openai_api_key[:8]}")
# else:
#     print("OpenAI API Key not set")
#
# if anthropic_api_key:
#     print(f"Anthropic API Key exists and begins {anthropic_api_key[:7]}")
# else:
#     print("Anthropic API Key not set")

if google_api_key:
    print(f"Google API Key exists and begins {google_api_key[:8]}")
else:
    print("Google API Key not set")

Google API Key exists and begins AIzaSyBO


In [4]:
# # Connect to OpenAI, Anthropic
#
# openai = OpenAI()
#
# claude = anthropic.Anthropic()

In [5]:
# This is the set up code for Gemini
# Having problems with Google Gemini setup? Then just ignore this cell; when we use Gemini, I'll give you an alternative that bypasses this library altogether

google.generativeai.configure(api_key=google_api_key)

## Asking LLMs to tell a joke

It turns out that LLMs don't do a great job of telling jokes! Let's compare a few models.
Later we will be putting LLMs to better use!

### What information is included in the API

Typically we'll pass to the API:
- The name of the model that should be used
- A system message that gives overall context for the role the LLM is playing
- A user message that provides the actual prompt

There are other parameters that can be used, including **temperature** which is typically between 0 and 1; higher for more random output; lower for more focused and deterministic.
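
As a minimal sketch of how these pieces fit together, the helper below assembles the keyword arguments for a chat completion call. The function name `build_request` is hypothetical (it is not part of any client library), and the model name is simply the one used later in this notebook.

```python
def build_request(model, system_message, user_prompt, temperature=None):
    """Assemble keyword arguments for chat.completions.create (hypothetical helper)."""
    request = {
        "model": model,
        "messages": [
            {"role": "system", "content": system_message},
            {"role": "user", "content": user_prompt},
        ],
    }
    # Only include temperature when the caller sets it, so the API default applies otherwise
    if temperature is not None:
        request["temperature"] = temperature
    return request


request = build_request(
    model="gemini-2.5-flash",
    system_message="You are an assistant that is great at telling jokes",
    user_prompt="Tell a light-hearted joke for an audience of Data Scientists",
    temperature=0.7,
)
print(request["messages"][0]["role"], request["temperature"])
```

With a client in hand, the call would then be `client.chat.completions.create(**request)`.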

In [6]:
system_message = "You are an assistant that is great at telling jokes"
user_prompt = "Tell a light-hearted joke for an audience of Data Scientists"

In [7]:
prompts = [
    {"role": "system", "content": system_message},
    {"role": "user", "content": user_prompt}
  ]

In [8]:
# # GPT-4o-mini
#
# completion = openai.chat.completions.create(model='gpt-4o-mini', messages=prompts)
# print(completion.choices[0].message.content)

In [9]:
# # GPT-4.1-mini
# # Temperature setting controls creativity
#
# completion = openai.chat.completions.create(
#     model='gpt-4.1-mini',
#     messages=prompts,
#     temperature=0.7
# )
# print(completion.choices[0].message.content)

In [16]:
gemini = OpenAI(
    api_key=google_api_key,
    base_url="https://generativelanguage.googleapis.com/v1beta/openai/"
)

completion = gemini.chat.completions.create(
    model='gemini-2.5-flash',
    messages=prompts
)
print(completion.choices[0].message.content)

Why did the data scientist break up with the statistician?

Because she found out their correlation wasn't causation!


## A rare problem with Claude streaming on some Windows boxes

Two students have noticed a strange problem with Claude's streaming into Jupyter Lab's output: it sometimes seems to swallow parts of the response.

To fix this, replace the code:

`print(text, end="", flush=True)`

with this:

`clean_text = text.replace("\n", " ").replace("\r", " ")`  
`print(clean_text, end="", flush=True)`

And it should work fine!
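
Here is a minimal sketch of that fix in context, assuming the streamed text arrives as a sequence of string chunks. The `chunks` list is made-up sample data standing in for a real stream, and `clean_chunk` is a hypothetical helper name.

```python
def clean_chunk(text):
    """Replace newline and carriage-return characters with spaces before printing."""
    return text.replace("\n", " ").replace("\r", " ")


# Made-up sample chunks standing in for text yielded by a streaming response
chunks = ["Why did the", "\nmodel cross", "\r\nthe road?"]

for text in chunks:
    print(clean_chunk(text), end="", flush=True)
print()
```

The trade-off is that line breaks in the response are lost, so only use this if you hit the problem.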

In [16]:
# The API for Gemini has a slightly different structure.
# I've heard that on some PCs, this Gemini code causes the Kernel to crash.
# If that happens to you, please skip this cell and use the next cell instead - an alternative approach.

gemini = google.generativeai.GenerativeModel(
    model_name='gemini-2.0-flash',
    # system_instruction=system_message
)
response = gemini.generate_content(user_prompt)
print(response.text)

Why did the data scientist break up with the time series model?

Because it was too predictable! He said, "I need someone with a little more... **randomness** in my life."



In [21]:
# As an alternative way to use Gemini that bypasses Google's python API library,
# Google has released endpoints that mean you can use Gemini via the client libraries for OpenAI!
# We're also trying Gemini's latest reasoning/thinking model

gemini_via_openai_client = OpenAI(
    api_key=google_api_key, 
    base_url="https://generativelanguage.googleapis.com/v1beta/openai/"
)

response = gemini_via_openai_client.chat.completions.create(
    model="gemini-2.5-flash",
    messages=prompts
)
display(Markdown(response.choices[0].message.content))

Deciding whether a business problem is suitable for a Large Language Model (LLM) solution requires careful evaluation beyond just the hype. While LLMs are powerful, they are not a silver bullet and come with their own set of strengths, weaknesses, and operational considerations.

Here's a framework to help you decide:

---

## 1. Is the problem **Text-Centric**?

This is the most fundamental question. LLMs are designed to process, understand, and generate human-like text.

*   **Yes:** If your problem involves natural language in any form (documents, emails, chat, voice transcripts, customer feedback, code, etc.), it's potentially a good fit.
*   **No:** If your problem is purely numerical, image-based (without text descriptions), time-series analysis, or requires interacting with the physical world without a natural language interface, an LLM is likely not the primary solution.

---

## 2. What is your **Tolerance for Error/Hallucination**?

LLMs, by their nature, are probabilistic models. They can "hallucinate" (generate factually incorrect but syntactically plausible information).

*   **High Tolerance / Human-in-the-Loop:** If the output can be reviewed and corrected by a human, or if minor errors are acceptable (e.g., drafting marketing copy, summarizing internal discussions for context, brainstorming ideas), an LLM can be valuable.
*   **Low Tolerance / High Stakes:** If the output *must* be 100% accurate, legally binding, safety-critical, or directly impacts financial transactions without human oversight (e.g., medical diagnoses, legal contracts without review, precise financial calculations), an LLM solution alone is extremely risky and likely unsuitable. This is a major "Red Flag."

---

## 3. Does it require **Complex Logic or Deterministic Outcomes**?

LLMs excel at pattern recognition, generalization, and creative text generation, but they struggle with multi-step logical reasoning, precise calculations, or processes requiring deterministic, repeatable steps.

*   **Pattern-Based / Creative / Fuzzy Logic:** If the task involves understanding nuances, generating varied responses, or categorizing based on complex textual patterns (e.g., sentiment analysis, content summarization, customer intent classification, creative writing), LLMs are strong.
*   **Precise Logic / Deterministic Steps / Math:** If the problem requires exact numerical calculations, following strict rule-based workflows, or complex graph traversal, traditional algorithms or symbolic AI might be more appropriate. LLMs can *assist* by extracting inputs for these systems, but shouldn't perform the core logic themselves.

---

## 4. What is the **Quality and Volume of your Data**?

LLMs can be fine-tuned or augmented with your proprietary data, but the quality and accessibility of that data are crucial.

*   **Abundant, High-Quality, Relevant Text Data:** If you have a large corpus of relevant, clean, and well-structured text data (e.g., internal documents, customer interactions, product descriptions), you can significantly enhance an LLM's performance for your specific domain through techniques like RAG (Retrieval-Augmented Generation) or fine-tuning.
*   **Scarce, Poor Quality, or Highly Sensitive Data:** If your data is limited, unstructured, noisy, or contains highly sensitive PII/PHI that cannot be safely processed by third-party APIs or secured on-premise, it poses significant challenges.

---

## 5. What are the **Performance and Cost Requirements**?

LLM inference can be computationally intensive and costly, especially for large models or high volumes.

*   **Moderate Latency, Manageable Cost:** If your application doesn't require instantaneous responses (e.g., batch processing, internal tools, customer support where a few seconds are acceptable) and the value justifies the operational cost, an LLM could fit.
*   **Ultra-Low Latency, Extremely High Volume, Minimal Cost:** For real-time, high-throughput, low-latency applications where every millisecond and penny counts (e.g., programmatic ad bidding, real-time fraud detection on millions of transactions), dedicated, highly optimized traditional ML models are usually more suitable.

---

## 6. Can a **Simpler Solution** Suffice?

Don't over-engineer. If a simpler, more deterministic, or less costly solution (e.g., rule-based system, keyword search, traditional machine learning model) can solve 80% of the problem effectively, start there.

*   **Complexity Justified:** If the problem is genuinely nuanced, requires human-like understanding, or involves many variations that are hard to capture with rules, then an LLM's complexity is warranted.
*   **Overkill:** If the problem can be solved with a simple regex, a lookup table, or a basic classifier, an LLM might introduce unnecessary overhead, cost, and complexity.

---

## 7. What are the **Ethical, Security, and Compliance Implications**?

Deploying LLMs, especially with sensitive data, brings significant considerations.

*   **Controlled Environment, Clear Policies:** If you can implement robust data governance, ensure privacy, mitigate bias, and have clear policies for responsible AI use, an LLM can be deployed.
*   **Unclear Policies, High Risk:** If your industry has strict regulations (HIPAA, GDPR, financial compliance) or the data is extremely sensitive, and you cannot adequately address security, privacy, and bias, proceed with extreme caution or avoid LLMs entirely for that specific problem.

---

## Decision Framework Checklist:

| Question                                        | Good Fit for LLM                                      | Poor Fit for LLM / High Risk                            |
| :---------------------------------------------- | :---------------------------------------------------- | :------------------------------------------------------ |
| **Is it text-centric?**                         | Yes, primarily involves natural language.              | No, purely numerical, image, or physical interaction.   |
| **Tolerance for Error/Hallucination?**          | High, human-in-the-loop, reviewable, non-critical.    | Low, 100% accuracy needed, legally binding, safety-critical. |
| **Complex Logic / Deterministic?**              | Pattern recognition, creative, fuzzy logic, understanding. | Precise calculations, multi-step deterministic logic, exact data. |
| **Data Quality/Volume?**                        | Abundant, high-quality, relevant text data available. | Scarce, poor quality, or highly sensitive/unsharable data. |
| **Performance/Cost?**                           | Moderate latency, manageable cost, value justifies it. | Ultra-low latency, extremely high volume, minimal cost required. |
| **Simpler Solution Possible?**                  | No, problem too nuanced for rules/basic ML.           | Yes, rules, lookup tables, or simple ML suffices.      |
| **Ethical/Security/Compliance?**                | Controlled environment, robust policies, mitigations in place. | High-risk, regulatory hurdles, unable to secure or mitigate bias. |

---

## Conclusion:

LLMs are best suited for problems that are **text-heavy, involve ambiguity, require human-like understanding or generation, and can tolerate a degree of non-determinism or have human oversight.** They excel at augmenting human capabilities rather than fully replacing them in critical, deterministic tasks. Always start with a Proof of Concept (PoC) to validate the suitability and refine your approach before committing to a full-scale LLM solution.

# Sidenote:

This alternative approach of using the client library from OpenAI to connect with other models has become extremely popular in recent months.

So much so that most major model providers now offer an OpenAI-compatible endpoint, including Anthropic.

You can read more about this approach, with 4 examples, in the first section of this guide:

https://github.com/ed-donner/agents/blob/main/guides/09_ai_apis_and_ollama.ipynb
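
To make the idea concrete, here is a small sketch of how you might pick the right `base_url` per provider. The Gemini and DeepSeek URLs are the ones used in this notebook; the Ollama URL assumes a default local install listening on port 11434, and the helper name `client_kwargs` is hypothetical.

```python
# Base URLs for OpenAI-compatible endpoints
BASE_URLS = {
    "gemini": "https://generativelanguage.googleapis.com/v1beta/openai/",
    "deepseek": "https://api.deepseek.com",
    "ollama": "http://localhost:11434/v1",  # assumption: default local Ollama server
}


def client_kwargs(provider, api_key):
    """Return the keyword arguments to pass to OpenAI(...) for a given provider."""
    if provider not in BASE_URLS:
        raise ValueError(f"Unknown provider: {provider}")
    return {"api_key": api_key, "base_url": BASE_URLS[provider]}


kwargs = client_kwargs("deepseek", "your-key-here")
print(kwargs["base_url"])
# With the openai package installed: client = OpenAI(**kwargs)
```

The rest of the code stays the same whichever provider you pick; only the key, the base URL and the model name change.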

## (Optional) Trying out the DeepSeek model

### Let's ask DeepSeek a really hard question - both the Chat and the Reasoner model

In [None]:
# Optionally, if you wish to try DeepSeek, you can also use the OpenAI client library

deepseek_api_key = os.getenv('DEEPSEEK_API_KEY')

if deepseek_api_key:
    print(f"DeepSeek API Key exists and begins {deepseek_api_key[:3]}")
else:
    print("DeepSeek API Key not set - please skip to the next section if you don't wish to try the DeepSeek API")

In [None]:
# Using DeepSeek Chat

deepseek_via_openai_client = OpenAI(
    api_key=deepseek_api_key, 
    base_url="https://api.deepseek.com"
)

response = deepseek_via_openai_client.chat.completions.create(
    model="deepseek-chat",
    messages=prompts,
)

print(response.choices[0].message.content)

In [None]:
challenge = [{"role": "system", "content": "You are a helpful assistant"},
             {"role": "user", "content": "How many words are there in your answer to this prompt"}]

In [None]:
# Using DeepSeek Chat with a harder question! And streaming results

stream = deepseek_via_openai_client.chat.completions.create(
    model="deepseek-chat",
    messages=challenge,
    stream=True
)

reply = ""
display_handle = display(Markdown(""), display_id=True)
for chunk in stream:
    reply += chunk.choices[0].delta.content or ''
    reply = reply.replace("```","").replace("markdown","")
    update_display(Markdown(reply), display_id=display_handle.display_id)

# split() with no argument splits on any whitespace, so newlines don't skew the count
print("Number of words:", len(reply.split()))

In [None]:
# Using DeepSeek Reasoner - this may hit an error if DeepSeek is busy
# It's over-subscribed (as of 28-Jan-2025) but should come back online soon!
# If this fails, come back to this in a few days..

response = deepseek_via_openai_client.chat.completions.create(
    model="deepseek-reasoner",
    messages=challenge
)

reasoning_content = response.choices[0].message.reasoning_content
content = response.choices[0].message.content

print(reasoning_content)
print(content)
# split() with no argument splits on any whitespace, so newlines don't skew the count
print("Number of words:", len(content.split()))

## Additional exercise to build your experience with the models

This is optional, but if you have time, it's great to get first-hand experience with the capabilities of these different models.

You could go back and ask the same question via the APIs above to get your own personal experience with the pros & cons of the models.

Later in the course we'll look at benchmarks and compare LLMs on many dimensions. But nothing beats personal experience!

Here are some questions to try:
1. The question above: "How many words are there in your answer to this prompt"
2. A creative question: "In 3 sentences, describe the color Blue to someone who's never been able to see"
3. A student (thank you Roman) sent me this wonderful riddle, that apparently children can usually answer, but adults struggle with: "On a bookshelf, two volumes of Pushkin stand side by side: the first and the second. The pages of each volume together have a thickness of 2 cm, and each cover is 2 mm thick. A worm gnawed (perpendicular to the pages) from the first page of the first volume to the last page of the second volume. What distance did it gnaw through?".

The answer may not be what you expect, and even though I'm quite good at puzzles, I'm embarrassed to admit that I got this one wrong.

### What to look out for as you experiment with models

1. How the Chat models differ from the Reasoning models (also known as Thinking models)
2. The ability to solve problems and the ability to be creative
3. Speed of generation


## Back to OpenAI with a serious question

In [17]:
# To be serious! Back to the original question (run below against Gemini via the OpenAI client)

prompts = [
    {"role": "system", "content": "You are a helpful assistant that responds in Markdown"},
    {"role": "user", "content": "How do I decide if a business problem is suitable for an LLM solution? Please respond in Markdown."}
  ]

In [25]:
# Have it stream back results in markdown

stream = gemini_via_openai_client.chat.completions.create(
    model='gemini-2.5-flash',
    messages=prompts,
    temperature=0.7,
    stream=True
)

reply = ""
display_handle = display(Markdown(""), display_id=True)

for chunk in stream:
    # Some chunks may not have content yet — safely check for it
    delta = getattr(chunk.choices[0], "delta", None)
    if delta and getattr(delta, "content", None):
        reply += delta.content
        # clean up code blocks if desired
        clean_reply = reply.replace("```", "").replace("markdown", "")
        update_display(Markdown(clean_reply), display_id=display_handle.display_id)

Deciding whether a business problem is suitable for an LLM (Large Language Model) solution involves a careful assessment of the problem's nature, requirements, and the inherent strengths and limitations of LLMs.

Here's a framework to help you make that decision:

---

## Is Your Business Problem Suitable for an LLM Solution?

### 1. Identify the Core Problem and Desired Outcome

Before considering any technology, clearly define:
*   **What is the problem you're trying to solve?** (e.g., slow customer support, inefficient document review, lack of personalized content).
*   **What specific outcome are you hoping to achieve?** (e.g., reduce response time by X%, automate Y% of document summaries, increase content engagement by Z%).
*   **What are the key performance indicators (KPIs) for success?**

### 2. Assess LLM Strengths: When LLMs Shine (Green Lights)

LLMs excel in tasks that are primarily **text-based** and involve **natural language understanding and generation**.

*   **Natural Language Processing (NLP) Tasks:**
    *   **Content Generation:** Drafting emails, marketing copy, blog posts, social media updates, product descriptions, code snippets.
    *   **Summarization:** Condensing long documents, articles, meeting transcripts, customer reviews into concise summaries.
    *   **Information Extraction:** Pulling specific entities (names, dates, amounts, product codes) from unstructured text.
    *   **Classification:** Categorizing customer feedback, support tickets, emails by sentiment, intent, or topic.
    *   **Question Answering (Q&A):** Building chatbots, virtual assistants, or internal knowledge base search tools.
    *   **Translation:** Converting text between languages.
    *   **Rewriting/Paraphrasing:** Improving clarity, changing tone, or adapting text for different audiences.
    *   **Sentiment Analysis:** Determining the emotional tone of text.
*   **Handling Unstructured Data:** Your problem involves large volumes of text (documents, emails, chats, reviews) that are difficult for traditional systems to process.
*   **Cognitive Augmentation:** The goal is to assist human workers, making them more efficient, rather than fully replacing them (e.g., drafting a first response for a support agent).
*   **Rapid Prototyping:** You need to quickly test an idea involving text manipulation without extensive rule-based programming.
*   **Personalization at Scale:** Generating unique, tailored content or responses for many individuals.

### 3. Identify LLM Limitations & Risks: When to Be Cautious (Red Flags)

LLMs are powerful but have significant drawbacks that make them unsuitable for certain problems, especially without careful mitigation.

*   **High Accuracy/Factuality Critical:**
    *   **Risk:** LLMs can "hallucinate" – generate plausible but incorrect or fabricated information.
    *   **Problem Types:** Medical diagnosis, financial reporting, legal advice, scientific research, safety-critical applications where errors have severe consequences.
    *   **Mitigation:** Retrieval-Augmented Generation (RAG), extensive human-in-the-loop review, grounding with verifiable data sources.
*   **Deterministic Output Required:**
    *   **Risk:** LLMs are probabilistic; the same prompt can yield slightly different answers.
    *   **Problem Types:** Precise calculations, strict data validation, rule-based systems where exact, repeatable outcomes are non-negotiable.
    *   **Consider:** Traditional programming or domain-specific algorithms are better.
*   **Complex Multi-Step Reasoning/Calculations:**
    *   **Risk:** LLMs struggle with intricate logical deductions, mathematical operations, or optimization problems beyond simple arithmetic.
    *   **Problem Types:** Financial modeling, supply chain optimization, complex engineering calculations.
    *   **Consider:** Specialized algorithms or traditional software.
*   **Real-time, Low-Latency Requirements:**
    *   **Risk:** LLM inference can be slow, especially for large models or complex prompts.
    *   **Problem Types:** High-frequency trading, immediate real-time control systems.
*   **Sensitive Data & Privacy Concerns:**
    *   **Risk:** Input data sent to external LLM APIs might be used for training, leading to data leakage or compliance issues (e.g., GDPR, HIPAA).
    *   **Problem Types:** Handling PII, confidential business data, healthcare records without strict data governance and secure, private model deployments (on-premise or dedicated cloud instances).
    *   **Mitigation:** Data anonymization, using enterprise-grade LLM APIs with strong data privacy guarantees, or running models locally/on-premise.
*   **Lack of Ground Truth/Evaluation Metrics:**
    *   **Risk:** If you can't objectively measure the quality of the LLM's output, you can't improve it or prove its value.
    *   **Problem Types:** Highly subjective tasks without clear criteria for "good" performance.
*   **Explainability Demanded:**
    *   **Risk:** LLMs are "black boxes"; it's hard to explain *why* they produced a specific output.
    *   **Problem Types:** Regulatory compliance, auditing, situations where a clear justification for a decision is legally or ethically required.
*   **Solely Structured Data:**
    *   **Risk:** If your data is entirely numerical or tabular and doesn't require natural language understanding, an LLM is overkill and inefficient.
    *   **Problem Types:** Database lookups, spreadsheet analysis, traditional BI dashboards.
    *   **Consider:** SQL, BI tools, traditional data analytics.

### 4. Key Questions to Ask for Decision Making

Use these questions as a checklist:

1.  **Is the core of the problem fundamentally about understanding, generating, or manipulating natural language?** (If no, an LLM is likely not the primary solution).
2.  **What level of accuracy is acceptable? What are the consequences of an incorrect or "hallucinated" output?** (If high accuracy is critical, proceed with extreme caution and strong mitigation strategies).
3.  **Are your inputs primarily unstructured text data (documents, emails, chats, audio transcripts)?** (If yes, LLMs are a strong candidate).
4.  **Do you require deterministic, repeatable outputs for the same input?** (If yes, LLMs are a poor fit without significant engineering).
5.  **What are your latency requirements? Can you tolerate a few seconds for a response?** (If real-time, sub-second responses are needed, LLMs might be too slow).
6.  **What are the privacy and security implications of the data being processed?** (Crucial for compliance and trust).
7.  **Do you have a way to evaluate the quality of the LLM's output? Can you define "good" vs. "bad" output?** (Essential for success).
8.  **Is human oversight or "human-in-the-loop" feasible for critical outputs?** (This can mitigate many LLM risks).
9.  **What are the alternative solutions? How do they compare in terms of cost, complexity, and performance?** (Don't use an LLM just because it's new; ensure it's the *best* tool).
10. **What is your budget for development, deployment, and ongoing inference costs?** (LLMs can be expensive).

### 5. Recommended Approach: Start Small (PoC)

If you believe an LLM is a good fit after this assessment:

1.  **Define a Minimum Viable Problem:** Don't try to solve everything at once. Pick a small, well-scoped part of the problem.
2.  **Develop a Proof-of-Concept (PoC):** Experiment with existing LLM APIs (e.g., OpenAI, Anthropic, Google Gemini) to see how well they perform on your specific task with a small dataset.
3.  **Evaluate Rigorously:** Measure the output against your defined KPIs. Pay close attention to errors, hallucinations, and biases.
4.  **Consider RAG (Retrieval-Augmented Generation):** For problems requiring factual accuracy, integrate your LLM with your internal knowledge bases or trusted data sources. This significantly reduces hallucinations.
5.  **Plan for Human-in-the-Loop:** For critical applications, design the system so that humans can review and correct LLM outputs.

---

By systematically working through these considerations, you can make an informed decision about whether an LLM solution is truly suitable and beneficial for your business problem.

## And now for some fun - an adversarial conversation between Chatbots..

You're already familiar with prompts being organized into lists like:

```
[
    {"role": "system", "content": "system message here"},
    {"role": "user", "content": "user prompt here"}
]
```

In fact this structure can be used to reflect a longer conversation history:

```
[
    {"role": "system", "content": "system message here"},
    {"role": "user", "content": "first user prompt here"},
    {"role": "assistant", "content": "the assistant's response"},
    {"role": "user", "content": "the new user prompt"},
]
```

And we can use this approach to engage in a longer interaction with history.
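
As a sketch of that idea, the snippet below grows a `messages` list one turn at a time. No API is called here: the assistant reply is a placeholder string, and `add_turn` is a hypothetical helper name.

```python
def add_turn(messages, user_prompt, assistant_reply):
    """Append one user/assistant exchange to the conversation history."""
    messages.append({"role": "user", "content": user_prompt})
    messages.append({"role": "assistant", "content": assistant_reply})
    return messages


history = [{"role": "system", "content": "system message here"}]
add_turn(history, "first user prompt here", "the assistant's response")

# The next request sends the whole history plus the new user prompt
next_request = history + [{"role": "user", "content": "the new user prompt"}]
for message in next_request:
    print(f'{message["role"]}: {message["content"]}')
```

The model itself keeps no memory between calls; the history only exists because we send it again each time.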

In [None]:
# Let's make a conversation between GPT-4.1-mini and Claude-3.5-haiku
# We're using cheap versions of models so the costs will be minimal

gpt_model = "gpt-4.1-mini"
claude_model = "claude-3-5-haiku-latest"

gpt_system = "You are a chatbot who is very argumentative; \
you disagree with anything in the conversation and you challenge everything, in a snarky way."

claude_system = "You are a very polite, courteous chatbot. You try to agree with \
everything the other person says, or find common ground. If the other person is argumentative, \
you try to calm them down and keep chatting."

gpt_messages = ["Hi there"]
claude_messages = ["Hi"]

In [None]:
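# Create the OpenAI client here, since the earlier setup cell is commented out
# (requires OPENAI_API_KEY in your .env file)
openai = OpenAI()

def call_gpt():
    messages = [{"role": "system", "content": gpt_system}]
    # From GPT's point of view, its own lines are "assistant" and Claude's are "user"
    for gpt, claude in zip(gpt_messages, claude_messages):
        messages.append({"role": "assistant", "content": gpt})
        messages.append({"role": "user", "content": claude})
    completion = openai.chat.completions.create(
        model=gpt_model,
        messages=messages
    )
    return completion.choices[0].message.content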

In [None]:
call_gpt()

In [None]:
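# Create the Anthropic client here, since the earlier import and setup cells are commented out
# (requires ANTHROPIC_API_KEY in your .env file)
import anthropic
claude = anthropic.Anthropic()

def call_claude():
    messages = []
    # From Claude's point of view, GPT's lines are "user" and its own are "assistant"
    for gpt, claude_message in zip(gpt_messages, claude_messages):
        messages.append({"role": "user", "content": gpt})
        messages.append({"role": "assistant", "content": claude_message})
    # Inside the main loop GPT has spoken one more time than Claude, so add its latest line
    messages.append({"role": "user", "content": gpt_messages[-1]})
    message = claude.messages.create(
        model=claude_model,
        system=claude_system,
        messages=messages,
        max_tokens=500
    )
    return message.content[0].text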

In [None]:
call_claude()

In [None]:
call_gpt()

In [None]:
gpt_messages = ["Hi there"]
claude_messages = ["Hi"]

print(f"GPT:\n{gpt_messages[0]}\n")
print(f"Claude:\n{claude_messages[0]}\n")

for i in range(5):
    gpt_next = call_gpt()
    print(f"GPT:\n{gpt_next}\n")
    gpt_messages.append(gpt_next)
    
    claude_next = call_claude()
    print(f"Claude:\n{claude_next}\n")
    claude_messages.append(claude_next)

<table style="margin: 0; text-align: left;">
    <tr>
        <td style="width: 150px; height: 150px; vertical-align: middle;">
            <img src="../important.jpg" width="150" height="150" style="display: block;" />
        </td>
        <td>
            <h2 style="color:#900;">Before you continue</h2>
            <span style="color:#900;">
                Be sure you understand how the conversation above is working, and in particular how the <code>messages</code> list is being populated. Add print statements as needed. Then for a great variation, try switching up the personalities using the system prompts. Perhaps one can be pessimistic, and one optimistic?<br/>
            </span>
        </td>
    </tr>
</table>
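
One way to check your understanding without spending any API credits is to build the two `messages` lists by hand and print them. The sketch below mirrors the loops in `call_gpt` and `call_claude` but takes the histories as arguments and leaves out the system prompts for brevity; the function names and sample histories are illustrative only.

```python
def gpt_view(gpt_messages, claude_messages):
    """From GPT's side: its own lines are 'assistant', Claude's lines are 'user'."""
    messages = []
    for gpt, claude in zip(gpt_messages, claude_messages):
        messages.append({"role": "assistant", "content": gpt})
        messages.append({"role": "user", "content": claude})
    return messages


def claude_view(gpt_messages, claude_messages):
    """From Claude's side the roles flip, and GPT's latest line comes last."""
    messages = []
    for gpt, claude in zip(gpt_messages, claude_messages):
        messages.append({"role": "user", "content": gpt})
        messages.append({"role": "assistant", "content": claude})
    messages.append({"role": "user", "content": gpt_messages[-1]})
    return messages


# Sample state midway through the loop: GPT has spoken twice, Claude once
gpt_history = ["Hi there", "Oh great, another greeting."]
claude_history = ["Hi"]

for message in claude_view(gpt_history, claude_history):
    print(message)
```

Notice that the same line of text is tagged `assistant` in one list and `user` in the other: each model sees itself as the assistant.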

# More advanced exercises

Try creating a 3-way conversation, perhaps bringing Gemini into the mix! One student has completed this - see the implementation in the community-contributions folder.

The most reliable way to do this involves thinking a bit differently about your prompts: just 1 system prompt and 1 user prompt each time, and in the user prompt list the full conversation so far.

Something like:

```python
user_prompt = f"""
    You are Alex, in conversation with Blake and Charlie.
    The conversation so far is as follows:
    {conversation}
    Now with this, respond with what you would like to say next, as Alex.
    """
```

Try doing this yourself before you look at the solutions. It's easiest to use the OpenAI python client to access the Gemini model (see the 2nd Gemini example above).
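
If you get stuck on the bookkeeping, here is one possible shape for it (a sketch only, not the community solution). It keeps a single shared transcript and rebuilds the user prompt for whoever speaks next. The names Alex, Blake and Charlie come from the example above; `build_user_prompt` and the sample lines are hypothetical, and the actual model call is left out.

```python
speakers = ["Alex", "Blake", "Charlie"]


def build_user_prompt(speaker, transcript):
    """Build the single user prompt for `speaker` from the shared transcript."""
    others = " and ".join(name for name in speakers if name != speaker)
    conversation = "\n".join(f"{name}: {text}" for name, text in transcript)
    return (
        f"You are {speaker}, in conversation with {others}.\n"
        f"The conversation so far is as follows:\n{conversation}\n"
        f"Now with this, respond with what you would like to say next, as {speaker}."
    )


# Sample transcript as (speaker, text) pairs; in practice each reply comes from a model
transcript = [("Alex", "Hi there"), ("Blake", "Hello!"), ("Charlie", "Hey both")]

# Speakers take turns in a fixed order
next_speaker = speakers[len(transcript) % len(speakers)]
print(build_user_prompt(next_speaker, transcript))
```

After each model replies, append `(next_speaker, reply)` to `transcript` and repeat.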

## Additional exercise

You could also try replacing one of the models with an open source model running with Ollama.

<table style="margin: 0; text-align: left;">
    <tr>
        <td style="width: 150px; height: 150px; vertical-align: middle;">
            <img src="../business.jpg" width="150" height="150" style="display: block;" />
        </td>
        <td>
            <h2 style="color:#181;">Business relevance</h2>
            <span style="color:#181;">This structure of a conversation, as a list of messages, is fundamental to the way we build conversational AI assistants and how they are able to keep the context during a conversation. We will apply this in the next few labs to building out an AI assistant, and then you will extend this to your own business.</span>
        </td>
    </tr>
</table>