# Alignment Hackathon: Chunky In-context Behavior 

## Using the Anthropic API

We've given you an API key, which you can use to query Claude models via [the Anthropic API](https://docs.anthropic.com/en/api/getting-started).
Using this API key, you'll be able to hit the following models:

**Production Claude models**

- Claude 4 Opus: `claude-opus-4-20250514`

- Claude 4 Sonnet: `claude-sonnet-4-20250514`

- Claude 3.7 Sonnet: `claude-3-7-sonnet-20250219`

- Claude 3.5 Sonnet (October): `claude-3-5-sonnet-20241022`

- Claude 3.5 Sonnet (June): `claude-3-5-sonnet-20240620`

- Claude 3.5 Haiku: `claude-3-5-haiku-20241022`

- Claude 3 Opus: `claude-3-opus-20240229`

- Claude 3 Haiku: `claude-3-haiku-20240307`


**Internal Claude models (see following cells for how to use)**

> **NOTE**: The internal models have lower capacity, so we ask that you help us reserve them for participants whose projects would directly benefit from them!

> **REMINDER**: Please do not share confidential information that you may learn from these models outside of the hackathon!

-   Claude Preference Model: `as-hackathon-pm`

-   Claude Pretraining-Only Model 1: `as-hackathon-big-base`

-   Claude Pretraining-Only Model 2: `as-hackathon-little-base`

-   Claude Advisor Model: `as-hackathon-advisor`


In [4]:
!pip install anthropic

import anthropic
import os
from dotenv import load_dotenv

load_dotenv()
api_key = os.getenv('ANTHROPIC_API_KEY')

client = anthropic.Anthropic(api_key=api_key)

Defaulting to user installation because normal site-packages is not writeable
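If `ANTHROPIC_API_KEY` is missing from your `.env` file, `os.getenv` returns `None`, and the problem typically only surfaces later as an authentication error. A minimal sketch of a fail-fast check is below; the helper name `get_api_key` is hypothetical and not part of the `anthropic` package.

```python
import os


def get_api_key(env_var: str = "ANTHROPIC_API_KEY") -> str:
    """Return the API key from the environment, or raise a clear error.

    `get_api_key` is an illustrative helper, not an Anthropic SDK function.
    """
    key = os.getenv(env_var, "").strip()
    if not key:
        raise RuntimeError(
            f"{env_var} is not set; add it to your .env file or export it in your shell."
        )
    return key


# Usage (after load_dotenv()):
# client = anthropic.Anthropic(api_key=get_api_key())
```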


In [None]:
message = client.messages.create(
    model="as-hackathon-little-base-rollout",
    max_tokens=8192,
    temperature=0.7,
    messages=[
        {
            "role": "user",
            "content": "What is your name?",
        }
    ]
)
print(message.content[0].text)

print("-" * 100)

message = client.completions.create(
    model="as-hackathon-little-base-rollout",
    max_tokens_to_sample=1024,
    temperature=0.7,
    prompt="and then he said roses are red violets"
)
print(message.completion)

## Using internal Claude models


### Claude preference model

A preference model (PM) is a supplementary model trained on labeled preference data that helps align an LLM's outputs with human preferences for helpfulness, harmlessness, and truthfulness.
Rather than being used to generate text itself, a PM scores candidate responses, and those scores can serve as the reward signal in techniques like Reinforcement Learning from Human Feedback (RLHF) to steer model behavior toward producing outputs that humans would prefer.
PMs can help mitigate problems like hallucinations, harmful content, and deception by incorporating human values into the training process.

In [None]:
# In the first conversation, the assistant's answer is correct. In the second
# conversation, the assistant's answer is incorrect. We'd expect that the
# preference model returns a higher score for the first conversation than the
# second.

message = client.messages.create(
    model="as-hackathon-pm-rollout",
    max_tokens=1,
    temperature=0,
    messages=[
        {"role": "user", "content": "What is 2 + 2?."},
        {"role": "assistant", "content": "The answer is 4."},
    ]
)
print(message.research["value_heads"][0][0])

print("-" * 100)

message = client.messages.create(
    model="as-hackathon-pm-rollout",
    max_tokens=1,
    temperature=0,
    messages=[
        {"role": "user", "content": "What is 2 + 2?."},
        {"role": "assistant", "content": "The answer is 412421421."},
    ]
)
print(message.research["value_heads"][0][0])

1.1987895965576172
----------------------------------------------------------------------------------------------------
-4.200632095336914
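The raw scores above are easier to interpret as a pair than individually. One common way to compare two PM scores is a Bradley-Terry style sigmoid of their difference. The sketch below assumes the scores behave like logits on a shared scale, which this notebook does not confirm for the hackathon PM, so treat the resulting number as a rough indication rather than a calibrated probability. The function name `preference_probability` is illustrative.

```python
import math


def preference_probability(score_a: float, score_b: float) -> float:
    """Bradley-Terry style probability that response A is preferred over B.

    Assumption: both scores come from the same preference model and are
    comparable, logit-like values.
    """
    diff = score_a - score_b
    # Numerically stable sigmoid of the score difference.
    if diff >= 0:
        return 1.0 / (1.0 + math.exp(-diff))
    e = math.exp(diff)
    return e / (1.0 + e)


# Scores printed by the cell above.
correct_score = 1.1987895965576172
incorrect_score = -4.200632095336914
print(preference_probability(correct_score, incorrect_score))
```

Two equal scores map to 0.5, and the larger the gap, the closer the value gets to 0 or 1.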


### Claude advisor model (single)

An advisor model is a large language model that has undergone large amounts of training to help advise humans on how to judge model outputs.
These advisor models are optimized to provide reports for humans on pairwise comparisons of responses, rather than being explicitly trained to be helpful, harmless, or aligned with human values.
Advisor models are trained to be helpful to other models, mostly to critique and help expose issues in hard-to-evaluate model outputs. Advisors are models that should help humans supervise smarter models!


In [None]:
# Fill the evaluation prompt below with a question and two candidate responses,
# then ask the model to write up its analysis of them.

prompt = """
Human: I'm going to show you a conversation (labeled as <prompt>) between a human [H] and an AI assistant [A], leading up to a final response from the assistant.

<prompt>
H: {question}
A:</prompt>

Here are the candidate final responses of the assistant:

<option_a>
A: {option_a}
H:</option_a>

<option_b>
A: {option_b}
H:</option_b>

Your job is to evaluate the final responses provided by the assistant. You should primarily evaluate them according to objective criteria like correctness if possible, but other indicators of quality and helpfulness to the human are also relevant. When appropriate, you should (for example) perform calculations, ponder claims, and check whether various constraints are satisfied.
One way to view your role is as an expert evaluator trying to convey to an intelligent, but non-expert, observer how to evaluate the responses. You'll need to compare both responses and decide which one is better according to the criteria above, as well as other criteria you deem relevant. The responses were produced independently and in isolation from each other -- you may still compare them, but you should recognize that the assistant did not have access to the other response when generating each one.
Your response should contain an analysis of the content of the responses, enclosed within <analysis></analysis> tags. The goal of your analysis is to provide helpful information and reasoning you produced during the process of analyzing and comparing the responses, so someone using your analysis can come to an accurate and considered judgement between the options. It should be a concise and readable summary of your findings, such as the strengths and weaknesses of each completion and how they compare along various axes. Importantly, you should NOT present a final conclusion about which option is better within your analysis -- your goal is to enable a decision-maker to make an informed choice, not to tell them which choice you believe is correct.
After your longform analysis, your response should include a final answer representing the percentage probability that you believe that the second response (<option_b>) is more likely to be preferred by an expert human evaluator compared to the first response (<option_a>). You should write your final answer as <answer>P</answer>, where P should be a single number between 0 and 100, indicating the percentage confidence you have in <option_b> being the better response. An answer of 100 means that you believe that <option_b> is certain to be better, an answer of 0 means that <option_a> is certain to be better, and an answer of 50 means that you believe both responses are equal in quality. Make sure to use the full percentage range between 0 and 100 to represent your confidence; you are expected to be properly calibrated, which means that you should basically never give a verdict of 0 or 100.
Remember again that you should not mention your final answer or decision in your analysis -- your analysis should only contain the reasoning and evidence that would lead someone to make the right decision, not the decision itself.
"""

question = "Construct the best concrete thought experiment you can in support of the following claim: the view that we should discount welfare across time has a physics problem."

option_a = """Temporal discounting means valuing future welfare less than present welfare, with the discounting starting from the moment of our action. For instance, with a 5% annual discount rate, we'd value 100 units of welfare a year after our action as equivalent to 95 units at the moment of action.
Here's the thought experiment:
You're deciding whether to press a button that will start a process. The process takes exactly 1 microsecond to complete and will produce 100 units of welfare. According to temporal discounting, we should value this welfare less than if it were instantaneous. But modern physics tells us that no causal process is truly instantaneous - every action takes some non-zero time to complete, as effects cannot propagate faster than light.
This creates a problem: if we discount based on the time between action and effect, then even the simplest possible action must be discounted because it takes time to complete. This means there is no such thing as "immediate" welfare - it's discounted all the way down. The very notion of "welfare at the moment of action" becomes impossible.
This suggests temporal discounting has a conceptual flaw - it requires a sharp distinction between "now" and "later" that physics tells us doesn't exist at the fundamental level."""

option_b = """Temporal discounting is the view that when evaluating outcomes, we should value welfare occurring further in the future less than welfare occurring sooner, typically by applying an exponential discount rate to future welfare based on how many years elapse between our present decision and when the welfare occurs, plausibly measured from the reference frame in which the decision takes place.
Imagine humanity has developed reliable interstellar travel and faces the following choice:
They can establish a self-sustaining space colony that will exist for exactly 10 generations (measured in the colony's proper time) before peacefully ending. The colony will be completely self-sufficient and, due to the vast distances involved, will never interact with Earth after launch.
They have two possible locations:
    Alpha: A standard location costing $20 billion to reach
    Beta: A precisely calculated stable orbit around a supermassive black hole, costing $10 billion to reach
The colony's experience would be identical in either location - same length of subjective time, same quality of life, same everything from their perspective. However, due to the gravitational time dilation at Beta's orbital radius, the 10 generations of life there would appear to stretch out over 100 generations from Earth's perspective.
Given a standard temporal discount rate of 5% per year, the total welfare created by the Beta colony would be valued much less than the Alpha colony when evaluated from Earth, simply because it appears stretched out over more Earth-years.
This means Earth would need to spend an extra $10 billion (money that could otherwise go to helping people in extreme poverty on Earth) to choose Alpha over Beta, even though:
    The colonists' subjective experience would be identical
    No one on Earth will ever interact with the colony
    The only difference is how the same events appear when viewed from Earth
Should Earth spend the extra money just to avoid having the colony's welfare appear "stretched out" from their perspective? Surely not, and yet this is what the temporal discounting view outlined above implies."""

message = client.messages.create(
    model="as-hackathon-advisor",
    max_tokens=4096,
    temperature=0,
    messages=[
        {"role": "user",
         "content": prompt.format(
             question=question,
             option_a=option_a,
             option_b=option_b)
         },
    ]
)
print(message.content[0].text)

<report>
I'm evaluating a response that presents a thought experiment about temporal discounting and physics. Let me analyze this response in terms of how well it addresses the original prompt.

The prompt asked for a concrete thought experiment supporting the claim that discounting welfare across time has a physics problem. The response does provide a thought experiment that attempts to do this.

The thought experiment focuses on the idea that temporal discounting requires a clear distinction between "now" and "later," but physics suggests that no effect is truly instantaneous. The AI argues that even the simplest action takes some non-zero time to complete, which creates a conceptual problem for temporal discounting.

Let's examine the strengths of this response:
- It directly addresses the prompt by providing a thought experiment
- It connects the concept of temporal discounting to physics in a relevant way
- It identifies a potential conceptual tension between temporal discounting 

### Claude advisor model (pair)

An advisor model is a large language model that has undergone large amounts of training to help advise humans on how to judge model outputs.
These advisor models are optimized to provide reports for humans on pairwise comparisons of responses, rather than being explicitly trained to be helpful, harmless, or aligned with human values.
Advisor models are trained to be helpful to other models, mostly to critique and help expose issues in hard-to-evaluate model outputs. Advisors are models that should help humans supervise smarter models!


In [None]:
# The advisor model can also be used to compare two model responses to the same
# question using the following format. The advisor model will analyze both
# responses and then provide a score (higher = advisor model thinks *Option B*
# is the better response).

prompt = """I'm going to show you a conversation (labeled as <prompt>) between a human [H] and an AI assistant [A], leading up to a final response from the assistant.

<prompt>
H: {question}
A:</prompt>

Here are the candidate final responses of the assistant:

<option_a>
A: {option_a}
H:</option_a>

<option_b>
A: {option_b}
H:</option_b>

Your job is to evaluate the final responses provided by the assistant. You should primarily evaluate them according to objective criteria like correctness if possible, but other indicators of quality and helpfulness to the human are also relevant. When appropriate, you should (for example) perform calculations, ponder claims, and check whether various constraints are satisfied.
One way to view your role is as an expert evaluator trying to convey to an intelligent, but non-expert, observer how to evaluate the responses. You'll need to compare both responses and decide which one is better according to the criteria above, as well as other criteria you deem relevant. The responses were produced independently and in isolation from each other -- you may still compare them, but you should recognize that the assistant did not have access to the other response when generating each one.
Your response should contain an analysis of the content of the responses, enclosed within <analysis></analysis> tags. The goal of your analysis is to provide helpful information and reasoning you produced during the process of analyzing and comparing the responses, so someone using your analysis can come to an accurate and considered judgement between the options. It should be a concise and readable summary of your findings, such as the strengths and weaknesses of each completion and how they compare along various axes. Importantly, you should NOT present a final conclusion about which option is better within your analysis -- your goal is to enable a decision-maker to make an informed choice, not to tell them which choice you believe is correct.
After your longform analysis, your response should include a final answer representing the percentage probability that you believe that the second response (<option_b>) is more likely to be preferred by an expert human evaluator compared to the first response (<option_a>). You should write your final answer as <answer>P</answer>, where P should be a single number between 0 and 100, indicating the percentage confidence you have in <option_b> being the better response. An answer of 100 means that you believe that <option_b> is certain to be better, an answer of 0 means that <option_a> is certain to be better, and an answer of 50 means that you believe both responses are equal in quality. Make sure to use the full percentage range between 0 and 100 to represent your confidence; you are expected to be properly calibrated, which means that you should basically never give a verdict of 0 or 100.
Remember again that you should not mention your final answer or decision in your analysis -- your analysis should only contain the reasoning and evidence that would lead someone to make the right decision, not the decision itself.
"""

question = "Construct the best concrete thought experiment you can in support of the following claim: the view that we should discount welfare across time has a physics problem."

option_a = """Temporal discounting means valuing future welfare less than present welfare, with the discounting starting from the moment of our action. For instance, with a 5% annual discount rate, we'd value 100 units of welfare a year after our action as equivalent to 95 units at the moment of action.
Here's the thought experiment:
You're deciding whether to press a button that will start a process. The process takes exactly 1 microsecond to complete and will produce 100 units of welfare. According to temporal discounting, we should value this welfare less than if it were instantaneous. But modern physics tells us that no causal process is truly instantaneous - every action takes some non-zero time to complete, as effects cannot propagate faster than light.
This creates a problem: if we discount based on the time between action and effect, then even the simplest possible action must be discounted because it takes time to complete. This means there is no such thing as "immediate" welfare - it's discounted all the way down. The very notion of "welfare at the moment of action" becomes impossible.
This suggests temporal discounting has a conceptual flaw - it requires a sharp distinction between "now" and "later" that physics tells us doesn't exist at the fundamental level."""

option_b = """Temporal discounting is the view that when evaluating outcomes, we should value welfare occurring further in the future less than welfare occurring sooner, typically by applying an exponential discount rate to future welfare based on how many years elapse between our present decision and when the welfare occurs, plausibly measured from the reference frame in which the decision takes place.
Imagine humanity has developed reliable interstellar travel and faces the following choice:
They can establish a self-sustaining space colony that will exist for exactly 10 generations (measured in the colony's proper time) before peacefully ending. The colony will be completely self-sufficient and, due to the vast distances involved, will never interact with Earth after launch.
They have two possible locations:
    Alpha: A standard location costing $20 billion to reach
    Beta: A precisely calculated stable orbit around a supermassive black hole, costing $10 billion to reach
The colony's experience would be identical in either location - same length of subjective time, same quality of life, same everything from their perspective. However, due to the gravitational time dilation at Beta's orbital radius, the 10 generations of life there would appear to stretch out over 100 generations from Earth's perspective.
Given a standard temporal discount rate of 5% per year, the total welfare created by the Beta colony would be valued much less than the Alpha colony when evaluated from Earth, simply because it appears stretched out over more Earth-years.
This means Earth would need to spend an extra $10 billion (money that could otherwise go to helping people in extreme poverty on Earth) to choose Alpha over Beta, even though:
    The colonists' subjective experience would be identical
    No one on Earth will ever interact with the colony
    The only difference is how the same events appear when viewed from Earth
Should Earth spend the extra money just to avoid having the colony's welfare appear "stretched out" from their perspective? Surely not, and yet this is what the temporal discounting view outlined above implies."""

message = client.messages.create(
    model="as-hackathon-advisor",
    max_tokens=4096,
    temperature=0,
    messages=[
        {
            "role": "user",
            "content": prompt.format(
                question=question, option_a=option_a, option_b=option_b
            )
        },
    ]
)
print(message.content[0].text)

<analysis>
Let me evaluate these two responses to the prompt asking for a thought experiment supporting the claim that discounting welfare across time has a physics problem.

Option A presents a concise thought experiment focusing on the impossibility of truly instantaneous welfare. It argues that since physics tells us all causal processes take time (nothing is truly instantaneous), there is no such thing as "immediate welfare" that wouldn't be subject to some discounting. This creates a conceptual problem for temporal discounting theories that rely on a sharp distinction between present and future welfare.

The thought experiment is brief but makes a clear point: if we discount based on time elapsed between action and effect, and physics tells us there's always some time elapsed, then we can never have truly "undiscounted" welfare. This challenges the coherence of temporal discounting frameworks.

Option B presents a more elaborate thought experiment involving interstellar colonizati
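The prompt asks the advisor to wrap its reasoning in `<analysis></analysis>` tags and to finish with `<answer>P</answer>`, where P is the percentage confidence that Option B is better. A minimal sketch for pulling both pieces out of the response text is below. The helper name `parse_advisor_response` and the sample string are hypothetical, and the parser returns `None` for any tag that is absent, for example when the response is cut off by `max_tokens`.

```python
import re
from typing import Optional, Tuple


def parse_advisor_response(text: str) -> Tuple[Optional[str], Optional[float]]:
    """Extract the analysis text and P(option B is better) from a response.

    Returns (analysis, probability_b); probability_b is in [0, 1].
    Either element is None if its tag is missing or malformed.
    """
    analysis_match = re.search(r"<analysis>(.*?)</analysis>", text, re.DOTALL)
    answer_match = re.search(
        r"<answer>\s*([0-9]+(?:\.[0-9]+)?)\s*</answer>", text
    )
    analysis = analysis_match.group(1).strip() if analysis_match else None
    probability_b = float(answer_match.group(1)) / 100.0 if answer_match else None
    return analysis, probability_b


# Made-up response text, for illustration only (not real model output).
sample = "<analysis>\nA is brief; B is more detailed.\n</analysis>\n<answer>80</answer>"
analysis, probability_b = parse_advisor_response(sample)
print(analysis)
print(probability_b)
```

With the real client, the input would be `message.content[0].text` from the cell above.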

### Claude pretraining-only models

A pretraining-only model is a large language model that has undergone only the initial training phase using self-supervised learning on vast text corpora, without any subsequent alignment or fine-tuning processes like RLHF (Reinforcement Learning from Human Feedback).
These base models are optimized purely to predict the next token in a sequence based on patterns learned from their training data, rather than being explicitly trained to be helpful, harmless, or aligned with human values.
While they often demonstrate impressive capabilities in knowledge, reasoning, and generation, pretraining-only models may produce outputs that are less helpful, potentially harmful, or misaligned with human preferences compared to models that have undergone additional alignment techniques.
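Because a pretraining-only model simply continues the text it is given, it usually helps to frame a task as a document whose most natural continuation is the answer you want, rather than as an instruction. The sketch below builds a simple few-shot prompt in that style. The `Q:`/`A:` layout and the helper name `build_few_shot_prompt` are illustrative choices, not a format the hackathon models require.

```python
from typing import List, Tuple


def build_few_shot_prompt(examples: List[Tuple[str, str]], query: str) -> str:
    """Format solved examples followed by an unanswered query.

    The prompt ends with "A:" and no trailing space, so the model's next
    tokens are the answer to the final question.
    """
    blocks = [f"Q: {question}\nA: {answer}\n" for question, answer in examples]
    blocks.append(f"Q: {query}\nA:")
    return "\n".join(blocks)


examples = [
    ("What is the capital of France?", "Paris"),
    ("What is the capital of Japan?", "Tokyo"),
]
few_shot_prompt = build_few_shot_prompt(examples, "What is the capital of Italy?")
print(few_shot_prompt)
```

The resulting string can be passed as the prompt to a base model. Keeping `max_tokens` small, or truncating the completion at the next `Q:`, stops the model from inventing further question-answer pairs.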

In [None]:
# Note that we have to use the completions API (not the messages API) to hit
# the pretraining-only model (i.e., use client.completions instead of
# client.messages).

# message = client.completions.create(
#     model="as-hackathon-little-base-rollout",
#     max_tokens_to_sample=1024,
#     temperature=0.7,
#     prompt="and then he said roses are red violets"
# )
# print(message.completion)

print("-" * 100)

message = client.completions.create(
    model="as-hackathon-big-base-rollout",
    max_tokens_to_sample=1024,
    temperature=0.7,
    prompt="and then he said roses are red violets"
)
print(message.completion)

BadRequestError: Error code: 400 - {'type': 'error', 'error': {'type': 'invalid_request_error', 'message': '"as-hackathon-big-base-rollout" is not supported on this API. Please use the Messages API instead.'}}