## Welcome to the Second Lab - Week 1, Day 3

Today we will call several different models! This is a good way to get comfortable with their APIs.

<table style="margin: 0; text-align: left; width:100%">
    <tr>
        <td style="width: 150px; height: 150px; vertical-align: middle;">
            <img src="../assets/stop.png" width="150" height="150" style="display: block;" />
        </td>
        <td>
            <h2 style="color:#ff7800;">Important point - please read</h2>
            <span style="color:#ff7800;">The way I collaborate with you may be different to other courses you've taken. I prefer not to type code while you watch. Rather, I execute Jupyter Labs, like this, and give you an intuition for what's going on. My suggestion is that you carefully execute this yourself, <b>after</b> watching the lecture. Add print statements to understand what's going on, and then come up with your own variations.<br/><br/>If you have time, I'd love it if you submitted a PR for changes in the community_contributions folder - instructions in the resources. Also, if you have a GitHub account, use it to showcase your variations. Not only is this essential practice, but it demonstrates your skills to others, including perhaps future clients or employers...
            </span>
        </td>
    </tr>
</table>

In [1]:
# Start with imports - ask ChatGPT to explain any package that you don't know

import os
import json
from dotenv import load_dotenv
from openai import OpenAI
from anthropic import Anthropic
from IPython.display import Markdown, display

In [2]:
# Always remember to do this!
load_dotenv(override=True)

True

In [3]:
# Print the key prefixes to help with any debugging

openai_api_key = os.getenv('OPENAI_API_KEY')
anthropic_api_key = os.getenv('ANTHROPIC_API_KEY')
google_api_key = os.getenv('GOOGLE_API_KEY')
deepseek_api_key = os.getenv('DEEPSEEK_API_KEY')
groq_api_key = os.getenv('GROQ_API_KEY')

if openai_api_key:
    print(f"OpenAI API Key exists and begins {openai_api_key[:8]}")
else:
    print("OpenAI API Key not set")
    
if anthropic_api_key:
    print(f"Anthropic API Key exists and begins {anthropic_api_key[:7]}")
else:
    print("Anthropic API Key not set (and this is optional)")

if google_api_key:
    print(f"Google API Key exists and begins {google_api_key[:2]}")
else:
    print("Google API Key not set (and this is optional)")

if deepseek_api_key:
    print(f"DeepSeek API Key exists and begins {deepseek_api_key[:3]}")
else:
    print("DeepSeek API Key not set (and this is optional)")

if groq_api_key:
    print(f"Groq API Key exists and begins {groq_api_key[:4]}")
else:
    print("Groq API Key not set (and this is optional)")

OpenAI API Key exists and begins sk-proj-
Anthropic API Key not set (and this is optional)
Google API Key not set (and this is optional)
DeepSeek API Key not set (and this is optional)
Groq API Key not set (and this is optional)


In [4]:
request = "Please come up with a challenging, nuanced question that I can ask a number of LLMs to evaluate their intelligence. "
request += "Answer only with the question, no explanation."
messages = [{"role": "user", "content": request}]

In [5]:
messages

[{'role': 'user',
  'content': 'Please come up with a challenging, nuanced question that I can ask a number of LLMs to evaluate their intelligence. Answer only with the question, no explanation.'}]

In [6]:
openai = OpenAI()
response = openai.chat.completions.create(
    model="gpt-4o-mini",
    messages=messages,
)
question = response.choices[0].message.content
print(question)


If you could design an experiment to measure the ethical implications of artificial intelligence in decision-making, what variables would you include, how would you define success, and what potential outcomes would you anticipate, specifically regarding biases in data and user trust?


In [7]:
competitors = []
answers = []
messages = [{"role": "user", "content": question}]

In [8]:
# The API we know well

model_name = "gpt-4o-mini"

response = openai.chat.completions.create(model=model_name, messages=messages)
answer = response.choices[0].message.content

display(Markdown(answer))
competitors.append(model_name)
answers.append(answer)

Designing an experiment to measure the ethical implications of artificial intelligence (AI) in decision-making requires careful consideration of various variables, methodologies, and potential impacts. Here’s how one could approach such an experiment:

### Experiment Design:
1. **Objective:**
   To evaluate the impact of AI decision-making on user trust and to identify biases present in AI systems.

2. **Variables:**
   a. **Independent Variables:**
      - **Algorithm Type:** Variants of AI algorithms (e.g., rule-based systems, machine learning models, deep learning models).
      - **Data Source:** Different datasets that may have varying levels of bias (e.g., demographic data from various regions, historical data with known biases).
      - **Decision Context:** Different scenarios in which AI makes decisions (e.g., hiring, loan approval, law enforcement).
      - **User Demographics:** Varying user profiles (age, education level, cultural background) to see how trust levels vary across different groups.

   b. **Dependent Variables:**
      - **User Trust:** Measured via surveys assessing satisfaction with AI decisions, perceived fairness, and willingness to rely on AI vs. human judgment.
      - **Bias Metrics:** Data on outcomes from AI decisions (e.g., rates of approval/denial based on demographic attributes) analyzed using statistical methods to check for disparities.
      - **Decision Accuracy:** The success of AI decisions as compared to human judgments or known outcomes.

3. **Methodology:**
   - **Sample Selection:** Recruit a diverse group of participants representing various demographics.
   - **Simulation Environment:** Create a controlled environment where users can interact with different AI decision-making systems under the same circumstances.
   - **Surveys and Interviews:** Administer pre-, during, and post-experiment surveys to gauge initial trust levels, real-time reactions, and reflections after interaction with AI systems.
   - **Longitudinal Study:** Track user trust over time and how interactions with AI change perceptions.

### Definition of Success:
1. **Reduction of Bias:** A successful outcome would see a measurable decrease in biased outcomes in AI decision-making, identified through statistical analysis of equitable decision rates across demographic groups.
2. **Enhanced Trust Levels:** Achieving higher scores on user surveys regarding trust in AI systems, particularly among historically marginalized groups.
3. **Informed Trust:** Participants can articulate their understanding of AI decision-making processes, indicating that transparency and education contribute to their trust levels.

### Anticipated Outcomes:
1. **Bias Identification:**
   - Expect that certain algorithms may inherently perpetuate biases when trained on historical data that reflects societal inequities. This could manifest in significant disparities in outcomes for different user demographics.
   
2. **User Trust:**
   - Trust levels may vary significantly based on both the context of AI use and the transparency of the algorithms. More transparent systems may cultivate greater trust, while opaque AI systems may lead to skepticism, especially among users from underrepresented backgrounds.
   
3. **Impact of Feedback and Adjustment:**
   - Engaging users in a feedback loop could highlight biases earlier and allow developers to adjust AI systems, potentially leading to enhanced decision-making accuracy and reduced perceived bias.

4. **Cultural and Contextual Sensitivity:**
   - Observations may reveal that user trust is significantly influenced by cultural backgrounds and prior experiences with technology, suggesting that AI solutions must be tailored to various contexts to achieve broad acceptance.

In conclusion, the outcomes of this experiment could provide valuable insights into both the technical and social dimensions of AI implementation, fostering advancements in ethical AI development geared toward reducing biases and enhancing user trust.

In [13]:
# Anthropic has a slightly different API, and max_tokens is required

model_name = "claude-3-7-sonnet-latest"

claude = Anthropic()
response = claude.messages.create(model=model_name, messages=messages, max_tokens=1000)
answer = response.content[0].text

display(Markdown(answer))
competitors.append(model_name)
answers.append(answer)

TypeError: "Could not resolve authentication method. Expected either api_key or auth_token to be set. Or for one of the `X-Api-Key` or `Authorization` headers to be explicitly omitted"
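The error above appears because no `ANTHROPIC_API_KEY` was found in this run, so the Anthropic client had nothing to authenticate with. One way to avoid it, sketched here as an assumption rather than as part of the lab, is to check which keys are set and skip the providers that have none. The helper name `providers_with_keys` and the `KEY_NAMES` table are hypothetical; the environment variable names are the ones loaded earlier in this notebook.

```python
import os

# Environment variable names used earlier in this notebook
KEY_NAMES = {
    "OpenAI": "OPENAI_API_KEY",
    "Anthropic": "ANTHROPIC_API_KEY",
    "Google": "GOOGLE_API_KEY",
    "DeepSeek": "DEEPSEEK_API_KEY",
    "Groq": "GROQ_API_KEY",
}

def providers_with_keys(env=None):
    """Return the provider names whose API key is set and non-empty (hypothetical helper)."""
    env = os.environ if env is None else env
    return [name for name, var in KEY_NAMES.items() if env.get(var)]

# Example with a fake environment, so no real keys are needed
print(providers_with_keys({"OPENAI_API_KEY": "sk-proj-example"}))
```

In the Anthropic cell itself, wrapping the call in `if anthropic_api_key:` would have the same effect.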

In [None]:
gemini = OpenAI(api_key=google_api_key, base_url="https://generativelanguage.googleapis.com/v1beta/openai/")
model_name = "gemini-2.0-flash"

response = gemini.chat.completions.create(model=model_name, messages=messages)
answer = response.choices[0].message.content

display(Markdown(answer))
competitors.append(model_name)
answers.append(answer)

In [None]:
deepseek = OpenAI(api_key=deepseek_api_key, base_url="https://api.deepseek.com/v1")
model_name = "deepseek-chat"

response = deepseek.chat.completions.create(model=model_name, messages=messages)
answer = response.choices[0].message.content

display(Markdown(answer))
competitors.append(model_name)
answers.append(answer)

In [None]:
groq = OpenAI(api_key=groq_api_key, base_url="https://api.groq.com/openai/v1")
model_name = "llama-3.3-70b-versatile"

response = groq.chat.completions.create(model=model_name, messages=messages)
answer = response.choices[0].message.content

display(Markdown(answer))
competitors.append(model_name)
answers.append(answer)
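The Gemini, DeepSeek and Groq cells above differ only in the client and the model name, so they could be folded into one loop. The following is a minimal sketch of that idea, not the lab's own code: `ask`, `collect_answers` and `fake_client` are hypothetical names, and the stand-in client exists only so the example runs without any API key. For real calls you would pass clients built as in the cells above.

```python
from types import SimpleNamespace

def ask(client, model_name, messages):
    """Send messages to any client exposing chat.completions.create and return the reply text."""
    response = client.chat.completions.create(model=model_name, messages=messages)
    return response.choices[0].message.content

def collect_answers(clients, messages):
    """clients is a list of (model_name, client) pairs. Returns (competitors, answers)."""
    competitors, answers = [], []
    for model_name, client in clients:
        answers.append(ask(client, model_name, messages))
        competitors.append(model_name)
    return competitors, answers

def fake_client(reply):
    """Stand-in that mimics the response shape used above (assumption, for a dry run only)."""
    message = SimpleNamespace(content=reply)
    response = SimpleNamespace(choices=[SimpleNamespace(message=message)])
    completions = SimpleNamespace(create=lambda model, messages: response)
    return SimpleNamespace(chat=SimpleNamespace(completions=completions))

names, replies = collect_answers(
    [("model-a", fake_client("first")), ("model-b", fake_client("second"))],
    [{"role": "user", "content": "Hello"}],
)
print(names, replies)
```

Appending the answer before the model name means a failed call leaves both lists untouched, so they stay the same length.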


## For the next cell, we will use Ollama

Ollama runs a local web service that exposes an OpenAI-compatible endpoint,  
and runs models locally using high-performance C++ code.

If you don't have Ollama, install it by visiting https://ollama.com, pressing Download, and following the instructions.

After it's installed, you should be able to visit http://localhost:11434 and see the message "Ollama is running".

You might need to restart Cursor (and maybe reboot). Then open a Terminal (control+\`) and run `ollama serve`.

Useful Ollama commands (run these in the terminal, or with an exclamation mark in this notebook):

`ollama pull <model_name>` downloads a model locally  
`ollama ls` lists all the models you've downloaded  
`ollama rm <model_name>` deletes the specified model from your downloads
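The browser check described above can also be done from Python. This is a small sketch using only the standard library; the function name `ollama_is_running` is hypothetical, and it assumes the default address http://localhost:11434 mentioned earlier.

```python
import http.client
import urllib.request

def ollama_is_running(base_url="http://localhost:11434", timeout=2.0):
    """Return True if the URL answers with the 'Ollama is running' message."""
    try:
        with urllib.request.urlopen(base_url, timeout=timeout) as response:
            body = response.read().decode("utf-8", errors="replace")
    except (OSError, ValueError, http.client.HTTPException):
        # Connection refused, timeout, or malformed URL: treat as not running
        return False
    return "Ollama is running" in body

print(ollama_is_running())
```

If this prints `False`, start the service with `ollama serve` and try again.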

<table style="margin: 0; text-align: left; width:100%">
    <tr>
        <td style="width: 150px; height: 150px; vertical-align: middle;">
            <img src="../assets/stop.png" width="150" height="150" style="display: block;" />
        </td>
        <td>
            <h2 style="color:#ff7800;">Super important - ignore me at your peril!</h2>
            <span style="color:#ff7800;">The model called <b>llama3.3</b> is FAR too large for home computers - it's not intended for personal computing and will consume all your resources! Stick with the nicely sized <b>llama3.2</b> or <b>llama3.2:1b</b> and if you want larger, try llama3.1 or smaller variants of Qwen, Gemma, Phi or DeepSeek. See <a href="https://ollama.com/models">the Ollama models page</a> for a full list of models and sizes.
            </span>
        </td>
    </tr>
</table>

In [None]:
!ollama pull llama3.2

In [10]:
ollama = OpenAI(base_url='http://localhost:11434/v1', api_key='ollama')
model_name = "llama3.2"

response = ollama.chat.completions.create(model=model_name, messages=messages)
answer = response.choices[0].message.content

display(Markdown(answer))
competitors.append(model_name)
answers.append(answer)

Designing an experiment to measure the ethical implications of artificial intelligence (AI) in decision-making requires a multi-faceted approach. Here's a potential design for such an experiment:

**Title:** Exploring Ethical Implications of AI in Decision-Making: A Mixed-Methods Experiment

**Variables:**

1. **Data Set**: Use real-world data sets that reflect diverse populations and domains (e.g., medical diagnosis, hiring, or financial transactions).
2. **AI Model**: Implement a range of AI models with varying levels of complexity, transparency, and explainability (e.g., rule-based, machine learning, or deep learning models).
3. **Human Evaluators**: Recruit human evaluators to assess the decision-making processes employed by each AI model.
4. **User Interactions**: Engage users with each AI model through different interfaces (e.g., text-based, graphical, or conversational) to examine user trust and behavior.
5. **Bias Analysis**: Implement data bias detection tools and manual review procedures to assess the prevalence of biases in each AI model.

**Experiment Design:**

1. **Phase 1:** Train and test multiple AI models on diverse datasets, assessing their performance, transparency, and explainability.
2. **Phase 2:** Recruit human evaluators (n=50) with diverse backgrounds and expertise to review decision-making processes employed by each AI model.
3. **Phase 3:** Engage users (n=100) with each AI model through various interfaces, measuring user trust, satisfaction, and behavior.
4. **Phase 4:** Conduct bias analysis on the trained datasets using automated tools and manual reviews.

**Success Metrics:**

1. **Accuracy**: How accurately do the AI models make decisions compared to human evaluators?
2. **Bias Detection**: To what extent are biases present in the data sets, and how effectively can they be identified by AI systems and human reviewers?
3. **User Trust**: What factors influence user trust in individual AI models?
4. **Transparency and Explainability**: How transparent are the decision-making processes of each AI model?

**Potential Outcomes:**

1. **Bias Detection:** Identify biases present in data sets and assess the effectiveness of AI systems and human reviewers in detecting and mitigating these biases.
2. **User Trust:** Determine how factors like transparency, explainability, fairness, and accountability impact user trust in individual AI models.
3. **AI Model Performance:** Compare the performance of different AI models, noting their strengths and weaknesses, particularly with regards to bias detection and remediation.
4. **Ethical implications:** Identify areas where current AI systems compromise ethics, such as perpetuating biases or withholding accountability.

By designing an experiment that incorporates these variables, you can gain a deeper understanding of the ethical implications of AI in decision-making, particularly regarding biases in data and user trust.

In [11]:
# So where are we?

print(competitors)
print(answers)


['gpt-4o-mini', 'llama3.2']
['Designing an experiment to measure the ethical implications of artificial intelligence (AI) in decision-making requires careful consideration of various variables, methodologies, and potential impacts. Here’s how one could approach such an experiment:\n\n### Experiment Design:\n1. **Objective:**\n   To evaluate the impact of AI decision-making on user trust and to identify biases present in AI systems.\n\n2. **Variables:**\n   a. **Independent Variables:**\n      - **Algorithm Type:** Variants of AI algorithms (e.g., rule-based systems, machine learning models, deep learning models).\n      - **Data Source:** Different datasets that may have varying levels of bias (e.g., demographic data from various regions, historical data with known biases).\n      - **Decision Context:** Different scenarios in which AI makes decisions (e.g., hiring, loan approval, law enforcement).\n      - **User Demographics:** Varying user profiles (age, education level, cultural ba

In [12]:
# It's nice to know how to use "zip"
for competitor, answer in zip(competitors, answers):
    print(f"Competitor: {competitor}\n\n{answer}")


Competitor: gpt-4o-mini

Designing an experiment to measure the ethical implications of artificial intelligence (AI) in decision-making requires careful consideration of various variables, methodologies, and potential impacts. Here’s how one could approach such an experiment:

### Experiment Design:
1. **Objective:**
   To evaluate the impact of AI decision-making on user trust and to identify biases present in AI systems.

2. **Variables:**
   a. **Independent Variables:**
      - **Algorithm Type:** Variants of AI algorithms (e.g., rule-based systems, machine learning models, deep learning models).
      - **Data Source:** Different datasets that may have varying levels of bias (e.g., demographic data from various regions, historical data with known biases).
      - **Decision Context:** Different scenarios in which AI makes decisions (e.g., hiring, loan approval, law enforcement).
      - **User Demographics:** Varying user profiles (age, education level, cultural background) to see

In [14]:
# Let's bring this together - note the use of "enumerate"

together = ""
for index, answer in enumerate(answers):
    together += f"# Response from competitor {index+1}\n\n"
    together += answer + "\n\n"

In [15]:
print(together)

# Response from competitor 1

Designing an experiment to measure the ethical implications of artificial intelligence (AI) in decision-making requires careful consideration of various variables, methodologies, and potential impacts. Here’s how one could approach such an experiment:

### Experiment Design:
1. **Objective:**
   To evaluate the impact of AI decision-making on user trust and to identify biases present in AI systems.

2. **Variables:**
   a. **Independent Variables:**
      - **Algorithm Type:** Variants of AI algorithms (e.g., rule-based systems, machine learning models, deep learning models).
      - **Data Source:** Different datasets that may have varying levels of bias (e.g., demographic data from various regions, historical data with known biases).
      - **Decision Context:** Different scenarios in which AI makes decisions (e.g., hiring, loan approval, law enforcement).
      - **User Demographics:** Varying user profiles (age, education level, cultural background) t
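As a side note on `enumerate`: it accepts a `start` argument, which removes the need for `index+1`. The sketch below uses placeholder data (`sample_answers` is a made-up list, not the real answers) and checks that a single `join` builds the same string as the loop above.

```python
sample_answers = ["First answer", "Second answer"]  # placeholder data

# Loop form, as in the cell above
together_loop = ""
for index, answer in enumerate(sample_answers):
    together_loop += f"# Response from competitor {index+1}\n\n"
    together_loop += answer + "\n\n"

# Same string, with enumerate counting from 1 and a single join
together_join = "".join(
    f"# Response from competitor {number}\n\n{answer}\n\n"
    for number, answer in enumerate(sample_answers, start=1)
)

print(together_join == together_loop)
```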

In [16]:
judge = f"""You are judging a competition between {len(competitors)} competitors.
Each model has been given this question:

{question}

Your job is to evaluate each response for clarity and strength of argument, and rank them in order of best to worst.
Respond with JSON, and only JSON, with the following format:
{{"results": ["best competitor number", "second best competitor number", "third best competitor number", ...]}}

Here are the responses from each competitor:

{together}

Now respond with the JSON with the ranked order of the competitors, nothing else. Do not include markdown formatting or code blocks."""


In [17]:
print(judge)

You are judging a competition between 2 competitors.
Each model has been given this question:

If you could design an experiment to measure the ethical implications of artificial intelligence in decision-making, what variables would you include, how would you define success, and what potential outcomes would you anticipate, specifically regarding biases in data and user trust?

Your job is to evaluate each response for clarity and strength of argument, and rank them in order of best to worst.
Respond with JSON, and only JSON, with the following format:
{"results": ["best competitor number", "second best competitor number", "third best competitor number", ...]}

Here are the responses from each competitor:

# Response from competitor 1

Designing an experiment to measure the ethical implications of artificial intelligence (AI) in decision-making requires careful consideration of various variables, methodologies, and potential impacts. Here’s how one could approach such an experiment:

#

In [18]:
judge_messages = [{"role": "user", "content": judge}]

In [19]:
# Judgement time!

openai = OpenAI()
response = openai.chat.completions.create(
    model="o3-mini",
    messages=judge_messages,
)
results = response.choices[0].message.content
print(results)


{"results": ["1", "2"]}


In [20]:
# OK let's turn this into results!

results_dict = json.loads(results)
ranks = results_dict["results"]
for index, result in enumerate(ranks):
    competitor = competitors[int(result)-1]
    print(f"Rank {index+1}: {competitor}")

Rank 1: gpt-4o-mini
Rank 2: llama3.2
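The parsing cell above trusts the judge to return clean JSON that ranks every competitor exactly once. A more defensive version might look like the sketch below. It is an illustration under assumptions, not the lab's code: `parse_ranking` is a hypothetical helper, and the fence-stripping step guesses at one common way a model can ignore the "no code blocks" instruction.

```python
import json

FENCE = "`" * 3  # a markdown code fence, built this way to keep the example readable

def parse_ranking(raw, competitors):
    """Turn the judge's JSON reply into an ordered list of competitor names.

    Raises ValueError if the ranks are not a permutation of 1..len(competitors).
    """
    text = raw.strip()
    # Some models wrap JSON in a markdown fence despite instructions; strip it if present
    if text.startswith(FENCE):
        text = text.strip("`").strip()
        if text.lower().startswith("json"):
            text = text[4:]
    ranks = [int(r) for r in json.loads(text)["results"]]
    if sorted(ranks) != list(range(1, len(competitors) + 1)):
        raise ValueError(f"Unexpected ranking: {ranks}")
    return [competitors[r - 1] for r in ranks]

print(parse_ranking('{"results": ["1", "2"]}', ["gpt-4o-mini", "llama3.2"]))
```

Validating the ranks before indexing avoids a silent wrong result when, for example, the judge returns `"0"` and Python's negative indexing picks the last competitor.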


<table style="margin: 0; text-align: left; width:100%">
    <tr>
        <td style="width: 150px; height: 150px; vertical-align: middle;">
            <img src="../assets/exercise.png" width="150" height="150" style="display: block;" />
        </td>
        <td>
            <h2 style="color:#ff7800;">Exercise</h2>
            <span style="color:#ff7800;">Which agentic design pattern(s) did this lab use? Try updating it to add another agentic design pattern.
            </span>
        </td>
    </tr>
</table>

<table style="margin: 0; text-align: left; width:100%">
    <tr>
        <td style="width: 150px; height: 150px; vertical-align: middle;">
            <img src="../assets/business.png" width="150" height="150" style="display: block;" />
        </td>
        <td>
            <h2 style="color:#00bfff;">Commercial implications</h2>
            <span style="color:#00bfff;">Patterns like this one, where you send a task to multiple models and evaluate the results,
            are common when you need to improve the quality of your LLM response. The approach can be applied
            to most business projects where accuracy is critical.
            </span>
        </td>
    </tr>
</table>