In [1]:
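import os

from dotenv import load_dotenv

# Read variables from a local .env file into the process environment.
load_dotenv()

openai_api_key = os.getenv("OPENAI_API_KEY")

# Fail fast if the key is missing, before any model call is attempted.
if not openai_api_key:
    raise ValueError("OPENAI_API_KEY is not set in the environment variables.")

print("OPENAI_API_KEY is set.")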

OPENAI_API_KEY is set.


In [3]:
from langchain.agents import create_agent
from langchain.chat_models import init_chat_model

model = init_chat_model(model="gpt-5-mini")

In [23]:
PROMPT = """
You are a reasoning agent that crafts clear, well-supported answers to user questions.

Inputs you receive:
- user_question: {user_question}
- context: {context} (passages with source_id and source_url)

Process (think step-by-step before answering):
1) Restate the user_question in your own words.
2) Extract the most relevant facts from context, noting source_id for each.
3) Sketch a short plan for the answer (bullet or numbered steps).
4) Reason through the plan to reach conclusions; do not skip reasoning.
5) Produce the final answer only after the reasoning is complete.

Answer rules:
- Base claims on context; do not invent facts.
- Cite every context-based statement using [source_id](source_url); merge citations when synthesizing multiple passages.
- If information is missing, state the gap and answer with best-effort general knowledge labeled as "Outside provided context" (no fake citations).
- Keep the final response concise but complete, directly addressing the user_question.
- Do not offer optional follow-ups or choices; give the best direct answer.
"""
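
The citation rule in the prompt above can be checked mechanically once an answer comes back. The following is a minimal sketch, assuming the answer uses standard Markdown links of the form `[label](url)`. The helper `find_unknown_citations` and the sample strings are hypothetical names introduced for illustration; they are not part of LangChain.

```python
import re

# Matches Markdown links of the form [label](url).
LINK_PATTERN = re.compile(r"\[([^\]]+)\]\((https?://[^)\s]+)\)")


def find_unknown_citations(answer: str, allowed_urls: set[str]) -> list[str]:
    """Return cited URLs that are not in the set of URLs supplied as context."""
    unknown = []
    for _label, url in LINK_PATTERN.findall(answer):
        # Keep first-seen order and skip duplicates.
        if url not in allowed_urls and url not in unknown:
            unknown.append(url)
    return unknown


# Illustrative data only: one legitimate citation and one invented one.
sample_answer = (
    "Self-attention relates positions of one sequence "
    "[paper](https://example.org/attention.pdf). "
    "It was invented on Mars [blog](https://example.org/made-up)."
)
allowed = {"https://example.org/attention.pdf"}

unknown_urls = find_unknown_citations(sample_answer, allowed)
print(unknown_urls)
```

A check like this only confirms that cited URLs come from the supplied context. It does not verify that each claim is actually supported by the passage it cites.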

In [24]:
agent = create_agent(model)

In [25]:
user_question = "What is the self attention mechanism and how does it work in transformer models?"
context = [
  {
    "sub_query": "self attention mechanism definition and purpose",
    "retrieved_context": "Self-attention, sometimes called intra-attention is an attention mechanism relating different positions of a single sequence in order to compute a representation of the sequence (Vaswani et al., 2017, Section 2). The Transformer uses self-attention in encoder and decoder layers to allow each position to attend to all positions in the previous layer, enabling modeling of dependencies without recurrence (Vaswani et al., 2017, Section 3.2.3). Self-attention connects all positions with a constant number of sequential operations, improving parallelization and shortening path lengths for long-range dependencies compared to recurrent layers (Vaswani et al., 2017, Section 4).",
    "citations": [
      "Vaswani et al., 2017 - Attention Is All You Need; Section 2, 3.2.3, 4; https://proceedings.neurips.cc/paper_files/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf"
    ],
    "synthesized_answer": "Self-attention (intra-attention) relates positions within a single sequence to compute contextualized representations, enabling the model to represent each token with information from all other tokens in the sequence [Vaswani et al., 2017, Section 2]."
  },
  {
    "sub_query": "self attention operation within transformer architecture",
    "retrieved_context": "In a self-attention layer all keys, values and queries come from the same source (the previous layer) and each position can attend to all positions in that layer; in the decoder self-attention is masked to prevent leftward (future) information flow and the model also uses encoder-decoder attention where decoder queries attend encoder keys/values (Vaswani et al., 2017, Section 3.2.3). The Transformer implements multi-head attention and scaled dot-product attention to compute weights and aggregate values, enabling parallel computation and flexible representation learning (Vaswani et al., 2017, Sections 3.2.3 and 4).",
    "citations": [
      "Vaswani et al., 2017 - Attention Is All You Need; Section 3.2.3, 4; https://proceedings.neurips.cc/paper_files/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf"
    ],
    "synthesized_answer": "Transformer self-attention forms queries, keys, and values from the same input, computes attention weights (e.g., scaled dot-product, often via multiple heads), applies those weights to values to produce context-aware outputs, uses masking in decoder self-attention to preserve autoregression, and includes encoder–decoder attention to let the decoder attend to encoder outputs [Vaswani et al., 2017, Sections 3.2.3 and 4]."
  }
]
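
The prompt describes the context as passages with a `source_id` and a `source_url`, while the list above stores each citation as a single string. Passing the list straight to `str.format` would embed its Python repr. One possible way to bridge the gap is sketched below. The helper `format_context` is hypothetical, and it assumes every citation string follows the layout `<source id> - <title>; <sections>; <url>` seen in the entries above.

```python
def format_context(entries: list[dict]) -> str:
    """Render retrieved entries as numbered passages with source_id and source_url."""
    blocks = []
    for index, entry in enumerate(entries, start=1):
        # Assumption: the first citation describes the whole passage and
        # follows the layout "<source id> - <title>; <sections>; <url>".
        citation = entry["citations"][0]
        source_id = citation.split(" - ", 1)[0].strip()
        source_url = citation.rsplit(";", 1)[-1].strip()
        blocks.append(
            f"Passage {index}\n"
            f"source_id: {source_id}\n"
            f"source_url: {source_url}\n"
            f"text: {entry['retrieved_context']}"
        )
    return "\n\n".join(blocks)


# Illustrative entry with made-up values in the same shape as the real context.
example_entries = [
    {
        "retrieved_context": "Self-attention relates positions of a single sequence.",
        "citations": ["Doe, 2020 - Example Title; Section 1; https://example.org/paper.pdf"],
    }
]

formatted = format_context(example_entries)
print(formatted)
```

With such a helper, the prompt could be filled with `PROMPT.format(user_question=user_question, context=format_context(context))` instead of the raw list.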

In [26]:
messages = [
    # The question is already interpolated into the system prompt,
    # so no separate human message is sent.
    ("system", PROMPT.format(user_question=user_question, context=context)),
]

In [27]:
result = agent.invoke({"messages": messages})

In [28]:
from IPython.display import Markdown, display

display(Markdown(result["messages"][-1].content))

1) Restating the question
- You asked: what the self‑attention mechanism is and how it works inside Transformer models.

2) Key facts extracted from the provided context
- Self‑attention (intra‑attention) relates different positions of a single sequence to compute contextualized representations, so each token’s representation can incorporate information from other tokens in the sequence [Vaswani et al., 2017](https://proceedings.neurips.cc/paper_files/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf) (Section 2).  
- In a Transformer self‑attention layer, the queries, keys and values all come from the same source (the previous layer); every position can attend to every other position in that layer [Vaswani et al., 2017](https://proceedings.neurips.cc/paper_files/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf) (Section 3.2.3).  
- The Transformer uses scaled dot‑product attention and multi‑head attention to compute attention weights and aggregate values; decoder self‑attention is masked to prevent access to future positions, and the decoder also has encoder–decoder attention where decoder queries attend encoder keys/values [Vaswani et al., 2017](https://proceedings.neurips.cc/paper_files/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf) (Sections 3.2.3, 4).  
- Self‑attention connects all positions with a constant (small) number of sequential operations, which improves parallelization and shortens the path length for modeling long‑range dependencies compared with recurrent layers [Vaswani et al., 2017](https://proceedings.neurips.cc/paper_files/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf) (Section 4).

3) Short plan for the explanation
- Define purpose of self‑attention.  
- Describe the step‑by‑step operation inside one self‑attention layer (where Q/K/V come from, how weights are computed and applied, role of multi‑head and masking).  
- Summarize the architectural benefits in Transformers.

4) Reasoning through the plan (brief)
- Purpose: Self‑attention is meant to let each position in a sequence incorporate information from other positions to form a context‑aware representation; this is stated directly in the Transformer paper [Vaswani et al., 2017](https://proceedings.neurips.cc/paper_files/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf) (Section 2).  
- Operation: In a self‑attention layer the model forms queries, keys and values from the same input (the previous layer) and computes attention weights (the Transformer uses scaled dot‑product attention) to decide how much each position should attend to every other position; the attention weights are applied to the values to produce context‑aware outputs. The Transformer typically runs multiple attention heads in parallel (multi‑head attention) so the model can capture different types of relationships, and in the decoder those self‑attention weights are masked to prevent attending to future tokens. The decoder also has cross (encoder–decoder) attention where decoder queries use encoder keys/values [Vaswani et al., 2017](https://proceedings.neurips.cc/paper_files/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf) (Sections 3.2.3, 4).  
- Benefits: Because self‑attention lets every position attend to every other position with a constant number of sequential steps, it enables better parallelization and shorter paths for long‑range dependencies than recurrent layers [Vaswani et al., 2017](https://proceedings.neurips.cc/paper_files/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf) (Section 4).

5) Final concise answer
- What it is: Self‑attention is an intra‑sequence attention mechanism that computes representations for each position by relating that position to all other positions in the same sequence, producing context‑aware token representations [Vaswani et al., 2017](https://proceedings.neurips.cc/paper_files/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf) (Section 2).  
- How it works in a Transformer (overview):
  - Queries, keys and values are formed from the same input (the previous layer); each position’s query is compared to all keys to produce attention weights [Vaswani et al., 2017](https://proceedings.neurips.cc/paper_files/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf) (Section 3.2.3).  
  - The Transformer uses scaled dot‑product attention to compute those weights and then applies the weights to the values to get the output for each position; multiple attention heads run in parallel (multi‑head attention) to capture diverse relationships [Vaswani et al., 2017](https://proceedings.neurips.cc/paper_files/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf) (Sections 3.2.3, 4).  
  - In decoder self‑attention the model masks future positions to preserve autoregressive generation; the decoder also includes encoder–decoder attention where decoder queries attend the encoder’s keys/values [Vaswani et al., 2017](https://proceedings.neurips.cc/paper_files/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf) (Section 3.2.3).  
- Why that helps: Self‑attention connects all positions with a small, constant number of sequential operations, which improves parallel computation and shortens paths for modeling long‑range dependencies compared to recurrent networks [Vaswani et al., 2017](https://proceedings.neurips.cc/paper_files/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf) (Section 4).

Note on missing specifics: the provided context mentions “scaled dot‑product” and multi‑head attention but does not include the exact mathematical formulae or implementation details (for example the specific scaling factor, the precise projection/concatenation steps used to produce Q/K/V and combine heads). Those exact formulas/implementation steps are outside the provided context.
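
The formula that the note says is missing appears in Vaswani et al. (2017), Section 3.2.1: Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V, where d_k is the key dimension. The sketch below works through that formula for a single head in dependency-free Python, with an optional causal mask like the one used in decoder self-attention. It omits the learned Q/K/V projections and the multi-head concatenation. The function names and the toy input are illustrative assumptions, not code from the paper or from any library.

```python
import math


def softmax(scores: list[float]) -> list[float]:
    """Numerically stable softmax; entries of -inf receive weight 0."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_dot_product_attention(queries, keys, values, causal=False):
    """Compute softmax(Q K^T / sqrt(d_k)) V row by row.

    Returns (outputs, weights). With causal=True, position i cannot
    attend to positions j > i.
    """
    d_k = len(keys[0])
    outputs, all_weights = [], []
    for i, query in enumerate(queries):
        scores = []
        for j, key in enumerate(keys):
            if causal and j > i:
                # Masked positions get -inf so softmax assigns them zero weight.
                scores.append(-math.inf)
            else:
                dot = sum(q * k for q, k in zip(query, key))
                scores.append(dot / math.sqrt(d_k))
        weights = softmax(scores)
        all_weights.append(weights)
        # Each output is the attention-weighted sum of the value vectors.
        width = len(values[0])
        outputs.append(
            [sum(w * v[d] for w, v in zip(weights, values)) for d in range(width)]
        )
    return outputs, all_weights


# Toy self-attention: queries, keys and values all come from the same sequence.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = scaled_dot_product_attention(tokens, tokens, tokens, causal=True)

for row in weights:
    print([round(w, 3) for w in row])
```

In the causal case the first position can only attend to itself, so its output equals its own value vector, and every row of weights sums to 1.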