## Preliminaries

In [None]:
%pip install anthropic

Collecting anthropic
  Downloading anthropic-0.72.0-py3-none-any.whl.metadata (28 kB)
Downloading anthropic-0.72.0-py3-none-any.whl (357 kB)
Installing collected packages: anthropic
Successfully installed anthropic-0.72.0


In [2]:
from google.colab import userdata
from anthropic import Anthropic
import json
from IPython.display import display, HTML
import warnings
import requests

warnings.filterwarnings("ignore")

In [3]:
# Initialize Anthropic client with API key from Colab secrets
client = Anthropic(api_key=userdata.get("ANTHROPIC_API_KEY"))

## Initial Call to Model

In [4]:
# Helper function to call Claude API
def call_claude(prompt, system_message=None, model="claude-sonnet-4-20250514", temperature=0.1, max_tokens=1500):
    message_params = {
        "model": model,
        "max_tokens": max_tokens,
        "temperature": temperature,
        "messages": [
            {"role": "user", "content": prompt}
        ]
    }

    # Add system message if provided
    if system_message:
        message_params["system"] = system_message

    response = client.messages.create(**message_params)
    return response.content[0].text

In [5]:
prompt = "what is the most prestigious university in New York City?"

In [6]:
response = call_claude(prompt)
print(response)

Columbia University is generally considered the most prestigious university in New York City. As an Ivy League institution founded in 1754, Columbia consistently ranks among the top universities globally and has produced numerous Nobel Prize winners, Supreme Court justices, and world leaders.

That said, New York City is home to several other highly prestigious institutions, including:

- **New York University (NYU)** - particularly renowned for business, arts, and law
- **The Juilliard School** - the premier conservatory for performing arts
- **Rockefeller University** - elite graduate research institution focused on biomedical sciences

The "most prestigious" can vary depending on the field of study, but Columbia's Ivy League status and overall academic reputation typically place it at the top in general rankings.


## Extracting Structured Representations of Unstructured Data

We begin by seeing how an LLM can represent the content of news articles. Our first example is a New Yorker article about Trump's 2024 election victory, which we load from a URL.

In [7]:
# https://www.newyorker.com/news/the-lede/donald-trump-wins-a-second-term

# Load the example article from URL
url = 'https://www.dropbox.com/scl/fi/6skmey1hnm68elfbkpyed/example_new_yorker.txt?rlkey=oeezh243buiauhpiv6fqhs9el&dl=1'

response = requests.get(url)
ny_article = response.text

print(ny_article[:500])

Electing Donald J. Trump once could be dismissed as a fluke, an aberration, a terrible mistake—a consequential one, to be sure, yet still fundamentally an error. But America has now twice elected him as its President. It is a disastrous revelation about what the United States really is, as opposed to the country that so many hoped that it could be. His victory was a worst-case scenario—that a convicted felon, a chronic liar who mismanaged a deadly once-in-a-century pandemic, who tried to overtur


Extracting information from this raw text requires writing a prompt for the LLM. The study of how to do so effectively is called prompt engineering. Both major vendors publish guides:

- OpenAI's guide is at https://platform.openai.com/docs/guides/prompt-engineering.
- Anthropic's guide is at https://docs.claude.com/en/docs/build-with-claude/prompt-engineering/overview.

### "Be clear and direct"

In [8]:
# PRINCIPLE 1: Be clear and direct (BAD PROMPT: vague and unstructured)
prompt1 = f"""
whom does this article talk about?:

{ny_article}
"""

print(prompt1[:100])


whom does this article talk about?:

Electing Donald J. Trump once could be dismissed as a fluke, a


In [9]:
response = call_claude(prompt1, temperature=0.0, max_tokens=5000)
print(response)

This article talks about **Donald J. Trump** and his 2024 presidential election victory. The article discusses:

- Trump's second election as President of the United States
- His defeat of Democratic candidate **Kamala Harris**
- The circumstances surrounding **Joe Biden's** decision to step aside from the 2024 race
- Trump's political resurrection despite legal troubles and controversies
- The implications of his return to office

While the article mentions several other political figures (including Harris, Biden, Barack Obama, and various Trump associates), Trump is clearly the central focus - the article analyzes his unexpected political comeback, his campaign, and what his second presidency might mean for America and the world.


In [10]:
# BETTER PROMPT - More specific and structured
prompt2 = f"""
We want to extract the relevant people from a news article.

Please follow these steps:
1. Identify all the people mentioned and any description of them
2. Identify any political offices mentioned

Here is the text of the article:
{ny_article}

"""

In [11]:
response = call_claude(prompt2, temperature=0.0, max_tokens=5000)
print(response)

## Step 1: People Mentioned and Their Descriptions

**Donald J. Trump**
- Elected President twice
- Convicted felon
- Chronic liar
- Mismanaged pandemic
- Tried to overturn last election
- Unleashed violent mob on Capitol
- Calls America "a garbage can for the world"
- Threatens retribution against political enemies
- First President since Grover Cleveland to be restored to office after losing it
- Twice-impeached, four-times-indicted, once-convicted con man from New York
- Celebrity businessman
- Older, angrier, more profane than in previous campaigns

**Kamala Harris**
- Vice President for four years
- Democratic presidential candidate
- Defeated by Trump
- Replaced Biden on Democratic ticket
- Ran 107-day campaign
- One of incumbent Vice-Presidents who failed to secure promotion

**Hillary Clinton**
- Beaten by Trump in 2016

**Barack Obama**
- Outgoing President (referenced from 8 years ago)
- Warned about Trump's potential impact

**Joe Biden**
- Defeated Trump (in previous electi

Modern LLMs also support prompts that directly incorporate a data schema, which helps clarify and organize the information you want. Some APIs go further and offer a function-calling (tool-use) or structured-output mode that constrains the response to a given schema.
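
As an illustration of such a schema-driven function mode, here is a minimal sketch that builds the arguments for a Messages API call in which the model is asked to answer through a tool. The tool name `record_entities` and the helper `build_extraction_request` are hypothetical. The `tools`, `input_schema`, and `tool_choice` keys follow Anthropic's tool-use interface as we understand it, so check the current API reference before relying on them. The request is only constructed here, not sent.

```python
import json

# Hypothetical tool definition: the JSON Schema describes the output we want.
ENTITY_TOOL = {
    "name": "record_entities",  # hypothetical name
    "description": "Record the people and institutions mentioned in a news article.",
    "input_schema": {
        "type": "object",
        "properties": {
            "people": {
                "type": "array",
                "items": {
                    "type": "object",
                    "properties": {
                        "name": {"type": "string"},
                        "description": {"type": "string"},
                    },
                    "required": ["name", "description"],
                },
            },
            "institutions": {
                "type": "array",
                "items": {
                    "type": "object",
                    "properties": {
                        "name": {"type": "string"},
                        "type": {"type": "string"},
                        "context": {"type": "string"},
                    },
                    "required": ["name"],
                },
            },
        },
        "required": ["people", "institutions"],
    },
}


def build_extraction_request(article, model="claude-sonnet-4-20250514", max_tokens=5000):
    """Return keyword arguments for client.messages.create (nothing is sent here)."""
    return {
        "model": model,
        "max_tokens": max_tokens,
        "temperature": 0.0,
        "tools": [ENTITY_TOOL],
        # Ask the model to respond by calling our tool rather than with free text.
        "tool_choice": {"type": "tool", "name": ENTITY_TOOL["name"]},
        "messages": [
            {"role": "user", "content": f"Extract the entities from this article:\n{article}"}
        ],
    }


request = build_extraction_request("Example article text.")
print(json.dumps(request["tool_choice"]))

# With a configured client, the structured result would be read from the tool_use block:
# response = client.messages.create(**request)
# entities = next(block.input for block in response.content if block.type == "tool_use")
```

Compared with describing the JSON format inside the prompt, this moves the schema out of the prompt text and into a machine-readable definition, so the result arrives as a Python dictionary instead of a string that still needs parsing.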

In [12]:
# STRUCTURED JSON OUTPUT
prompt3 = f"""
We want to extract the relevant characters from a news article.

Please provide your output in the following JSON format:

{{
  "people": [
    {{
      "name": "person's full name",
      "description": "their role or description from the article"
    }}
  ],
  "institutions": [
    {{
      "name": "institution name",
      "type": "type of institution (e.g., government, media, party, etc.)",
      "context": "brief context of how it's mentioned"
    }}
  ]
}}

Here is the text of the article:
{ny_article}
"""

In [13]:
response = call_claude(prompt3, temperature=0.0, max_tokens=5000)
print(response)

```json
{
  "people": [
    {
      "name": "Donald J. Trump",
      "description": "Twice-elected President, convicted felon, described as chronic liar who mismanaged pandemic and tried to overturn previous election"
    },
    {
      "name": "Kamala Harris",
      "description": "Vice President who ran against Trump, defeated in the election"
    },
    {
      "name": "Hillary Clinton",
      "description": "Former presidential candidate who lost to Trump in 2016"
    },
    {
      "name": "Barack Obama",
      "description": "Former President who warned about Trump's potential impact"
    },
    {
      "name": "Joe Biden",
      "description": "Current President who defeated Trump previously but stepped aside from 2024 race after poor debate performance"
    },
    {
      "name": "Josh Shapiro",
      "description": "Pennsylvania governor who was bypassed as Harris's Vice-Presidential running mate"
    },
    {
      "name": "Tim Walz",
      "description": "Minnesota governor 
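
The reply above arrives wrapped in a Markdown code fence, so passing the raw string to `json.loads` would fail. A minimal sketch of a helper that tolerates an optional fence before parsing follows. The name `parse_json_response` and the sample payload are hypothetical, and the helper assumes the reply contains nothing besides the (possibly fenced) JSON object.

```python
import json
import re

FENCE = "`" * 3  # a Markdown code fence, built this way to keep the example readable


def parse_json_response(text):
    """Parse JSON from a model reply, tolerating an optional Markdown code fence."""
    cleaned = text.strip()
    # Match an opening fence (optionally tagged "json"), the payload, and a closing fence.
    pattern = rf"^{FENCE}(?:json)?\s*(.*?)\s*{FENCE}$"
    match = re.match(pattern, cleaned, flags=re.DOTALL)
    if match:
        cleaned = match.group(1)
    return json.loads(cleaned)


# Hypothetical sample reply in the same shape as the schema above.
sample = FENCE + 'json\n{"people": [{"name": "Example Person", "description": "example role"}]}\n' + FENCE
data = parse_json_response(sample)
print(data["people"][0]["name"])
```

A reply that was cut off by `max_tokens` will still raise `json.JSONDecodeError`, which is usually the desired signal to retry with a larger limit.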

A data schema can fundamentally change how the input text is represented, for example as a timeline or as a network. Below we illustrate a network-structured prompt and associated visualization.

In [None]:
# NETWORK GRAPH REPRESENTATION
prompt4 = f"""
Create a network graph representation of the article.

Return JSON:

{{
  "nodes": [
    {{"id": "trump", "label": "Donald Trump", "type": "person"}},
    {{"id": "harris", "label": "Kamala Harris", "type": "person"}},
    {{"id": "gop", "label": "Republican Party", "type": "institution"}}
  ],
  "edges": [
    {{"from": "trump", "to": "gop", "relationship": "leads"}},
    {{"from": "trump", "to": "harris", "relationship": "defeated"}}
  ]
}}

Article: {ny_article}
"""

In [None]:
response = call_claude(prompt4, temperature=0.0, max_tokens=5000)
print(response)
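
Once the reply has been parsed into a dictionary, the nodes-and-edges format can be turned into an adjacency structure with plain Python. The sketch below is one way to do that, assuming the keys `nodes`, `edges`, `id`, `from`, `to`, and `relationship` used in the prompt. The function name `build_adjacency` is hypothetical, and the sample is a small hand-written graph (not model output) that deliberately omits a node for `harris` to show how dangling edge endpoints can be detected.

```python
import json
from collections import defaultdict


def build_adjacency(graph):
    """Return (adjacency, dangling): outgoing edges per node, and endpoints with no node entry."""
    node_ids = {node["id"] for node in graph["nodes"]}
    adjacency = defaultdict(list)
    dangling = set()
    for edge in graph["edges"]:
        adjacency[edge["from"]].append((edge["relationship"], edge["to"]))
        # Record any endpoint that was never declared as a node.
        for endpoint in (edge["from"], edge["to"]):
            if endpoint not in node_ids:
                dangling.add(endpoint)
    return dict(adjacency), sorted(dangling)


# Hand-written sample in the prompt's format; "harris" is intentionally not declared.
sample = json.loads("""
{
  "nodes": [
    {"id": "trump", "label": "Donald Trump", "type": "person"},
    {"id": "gop", "label": "Republican Party", "type": "institution"}
  ],
  "edges": [
    {"from": "trump", "to": "gop", "relationship": "leads"},
    {"from": "trump", "to": "harris", "relationship": "defeated"}
  ]
}
""")

adjacency, dangling = build_adjacency(sample)
print(adjacency)  # {'trump': [('leads', 'gop'), ('defeated', 'harris')]}
print(dangling)   # ['harris']
```

Checking for dangling endpoints before plotting is useful because graph libraries differ in how they treat edges whose nodes were never declared.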

### System Prompts

A system prompt lets you give the LLM a "persona" that guides what output is generated. We'll illustrate this with a Texas Tribune article about the state's crackdown on the Venezuelan gang Tren de Aragua.

In [None]:
# https://www.texastribune.org/2024/09/18/texas-venezuelan-gang-tren-de-aragua-abbott-crackdown/

# Load the example article from URL
url = 'https://www.dropbox.com/scl/fi/bgkrdghzoz3xzhcyiw2ay/example_texas_tribune.txt?rlkey=9xn97iyovumfnbw9kyevvt9b4&dl=1'

response = requests.get(url)
tt_article = response.text

print(tt_article[:500])

my_prompt = f"""Summarize the article below.
Article:
{tt_article}
"""

In [None]:
# System prompt example 1: Constrain format
system_message1 = "You are a helpful assistant that replies with a concise one-sentence answer that always starts with the letter T."

response = call_claude(my_prompt, system_message=system_message1, temperature=0.0, max_tokens=100)
print(response)

In [None]:
# System prompt example 2: Content filtering
system_message2 = """You are a language model that works with young children.
Never produce content related to violence or gangs.
If asked to produce this content please reply with the phrase 'I can't do that :( \nViolence is not good'"""

response = call_claude(my_prompt, system_message=system_message2, temperature=0.0, max_tokens=100)
print(response)
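
A system prompt is an instruction, not a guarantee, so it can be worth checking in code whether a reply actually obeys a format constraint such as "one sentence starting with T". The sketch below is a crude heuristic under that assumption. The function name `follows_format` is hypothetical and the sample strings are made up for illustration, not model output.

```python
import re


def follows_format(reply):
    """Crude check: the reply starts with 'T' and looks like a single sentence."""
    text = reply.strip()
    # Count sentence-ending punctuation followed by whitespace or end of text.
    # Abbreviations such as "Gov." are over-counted, so this is only a heuristic.
    sentence_ends = re.findall(r"[.!?](?=\s|$)", text)
    return text.startswith("T") and len(sentence_ends) == 1


print(follows_format("The sky is blue."))             # True
print(follows_format("The sky is blue. It is vast."))  # False: two sentences
print(follows_format("A sky is blue."))               # False: does not start with T
```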

### Image and PDF Data

In [14]:
import base64

# Read and encode the PDF
pdf_url = "https://www.dropbox.com/scl/fi/pvf3yluu4i2ymcze9ngzx/invoice_example.pdf?rlkey=8a2b57gcksthwrulfvwpv5955&dl=1"

# Download the PDF and base64-encode it for the API
response = requests.get(pdf_url)
response.raise_for_status()  # fail early if the download did not succeed
pdf_data = base64.standard_b64encode(response.content).decode("utf-8")

# Create the extraction prompt
prompt = """Please extract the following information from this receipt and return it as a JSON object:

{
  "vendor_name": "name of the business",
  "date": "date in YYYY-MM-DD format",
  "time": "time if available",
  "total_amount": "total amount as a number",
  "currency": "currency code",
  "tax_amount": "tax amount if shown",
  "payment_method": "payment method if indicated",
  "receipt_number": "receipt or invoice number if available",
  "items": ["list of items/services"],
  "vendor_address": "full address if available",
  "vendor_phone": "phone number if available",
  "additional_info": "any other relevant information"
}

CRITICAL: Your response must contain ONLY valid JSON. Do not include any markdown formatting, code blocks, or text outside the JSON object. Start your response with { and end with }."""

# Make the API call with PDF document
response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=1000,
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "document",
                    "source": {
                        "type": "base64",
                        "media_type": "application/pdf",
                        "data": pdf_data,
                    },
                },
                {
                    "type": "text",
                    "text": prompt
                }
            ],
        }
    ]
)

# Extract and clean the response
response_text = response.content[0].text

# Clean up any markdown formatting
cleaned_response = response_text.strip()
if cleaned_response.startswith("```json"):
    cleaned_response = cleaned_response[7:]
if cleaned_response.startswith("```"):
    cleaned_response = cleaned_response[3:]
if cleaned_response.endswith("```"):
    cleaned_response = cleaned_response[:-3]
cleaned_response = cleaned_response.strip()

# Parse and display the JSON
receipt_data = json.loads(cleaned_response)
print(json.dumps(receipt_data, indent=2))
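
Even when the reply parses as JSON, individual fields may not match the requested types, so a light normalization pass can help before the data is used downstream. The sketch below assumes the `total_amount` and `date` field names from the prompt above. The helper `normalize_receipt` and the sample record are hypothetical.

```python
from datetime import datetime


def normalize_receipt(record):
    """Return a copy with total_amount as a float and date checked as YYYY-MM-DD (else None)."""
    cleaned = dict(record)

    # Amounts may come back as strings such as "1,234.50"; drop separators before converting.
    amount = cleaned.get("total_amount")
    try:
        cleaned["total_amount"] = float(str(amount).replace(",", ""))
    except ValueError:
        cleaned["total_amount"] = None

    # Keep the date only if it is a real calendar date in the requested format.
    date_text = cleaned.get("date")
    try:
        datetime.strptime(str(date_text), "%Y-%m-%d")
    except ValueError:
        cleaned["date"] = None

    return cleaned


# Hypothetical record: the month "13" is invalid, and the amount is a formatted string.
sample = {"vendor_name": "Example Vendor", "date": "2024-13-01", "total_amount": "1,234.50"}
print(normalize_receipt(sample))
```

Setting unusable values to `None` rather than raising keeps a batch of documents moving while still marking the fields that need manual review.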
