## Preliminaries

In [None]:
!pip install openai

In [None]:
from google.colab import userdata
from openai import OpenAI
import json
from IPython.display import display, HTML
import warnings

warnings.filterwarnings("ignore")

In [None]:
# Initialize the OpenAI client with the API key stored in Colab secrets
api_key = userdata.get("openai_key")
client = OpenAI(api_key=api_key)

## Initial Call to Model

In [None]:
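# Helper function: send one user prompt (plus an optional system message)
# to the Chat Completions API and return the text of the reply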
def call_openai(prompt, system_message=None, model="gpt-4o", temperature=0.1, max_tokens=1500):
    messages = []
    if system_message:
        messages.append({"role": "system", "content": system_message})
    messages.append({"role": "user", "content": prompt})
    
    response = client.chat.completions.create(
        model=model,
        messages=messages,
        temperature=temperature,
        max_tokens=max_tokens
    )
    return response.choices[0].message.content

In [None]:
prompt = "what is the most prestigious university in New York City?"

In [None]:
response = call_openai(prompt)
print(response)

## Extracting Structured Representations of Unstructured Data

We start by looking at how an LLM can represent the content of a news article. Our example is a New Yorker article about Trump's 2024 election victory. The cell below records a URL for the article but, for demonstration, works with a short sample text in its place.

In [None]:
# URL for the example article (placeholder values; not fetched below)
url = 'https://www.dropbox.com/scl/fi/your-file-id/example_text.txt?rlkey=your-key&dl=1'

# For demonstration, using a sample text. Replace with your actual article URL or text.
ny_article = """Donald Trump has won a second term as President of the United States. 
The Republican candidate defeated Vice President Kamala Harris in a closely watched election. 
Trump's victory marks a historic return to the White House after losing the 2020 election to Joe Biden.
The former President campaigned on themes of economic recovery and immigration reform.
Harris, who became the Democratic nominee after President Biden withdrew from the race, 
focused her campaign on protecting democratic institutions and expanding access to healthcare.
The election saw record turnout in several key swing states including Pennsylvania, Georgia, and Arizona.
Trump will be inaugurated as the 47th President of the United States in January 2025."""

print(ny_article[:500])

Extracting information from this raw text requires writing a prompt for the LLM. The practice of doing so effectively is called prompt engineering; see, for example, OpenAI's guide at https://platform.openai.com/docs/guides/prompt-engineering.

### "Be clear and direct"

In [None]:
# PRINCIPLE 1: Be clear and direct (BAD PROMPT: vague, with no guidance on what to extract)
prompt1 = f"""
whom does this article talk about?:

{ny_article}
"""

print(prompt1[:100])

In [None]:
response = call_openai(prompt1, temperature=0.0, max_tokens=5000)
print(response)

In [None]:
# BETTER PROMPT - More specific and structured
prompt2 = f"""
We want to extract the relevant people from a news article.

Please follow these steps:
1. Identify all the people mentioned and any description of them
2. Identify any political offices mentioned

Here is the text of the article:
{ny_article}

"""

In [None]:
response = call_openai(prompt2, temperature=0.0, max_tokens=5000)
print(response)

Modern LLMs also support prompts that directly incorporate a data schema, which helps clarify and organize the information you want. Some APIs also offer a dedicated structured-output or function-calling mode that constrains the reply to a given format.
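As one illustration of such a mode, OpenAI's Chat Completions API accepts a `response_format` argument of `{"type": "json_object"}` (JSON mode), which asks the model to return syntactically valid JSON and requires that the word "JSON" appear somewhere in the messages. The sketch below only assembles the request arguments; the helper name `build_json_mode_request` and the system message wording are assumptions for illustration, not part of this notebook's `call_openai` helper.

```python
def build_json_mode_request(prompt, model="gpt-4o", temperature=0.0):
    """Assemble keyword arguments for client.chat.completions.create in JSON mode.

    Hypothetical helper: it builds the request but does not send it.
    """
    return {
        "model": model,
        "messages": [
            # JSON mode expects the word "JSON" to appear in the messages
            {"role": "system", "content": "Reply with a single JSON object."},
            {"role": "user", "content": prompt},
        ],
        "temperature": temperature,
        "response_format": {"type": "json_object"},
    }

request = build_json_mode_request("List the people mentioned in the article.")
print(request["response_format"])
```

With a configured client, the request could then be sent as `client.chat.completions.create(**request)`. JSON mode constrains the syntax of the reply, not its fields, so the schema still has to be described in the prompt.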

In [None]:
# STRUCTURED JSON OUTPUT
prompt3 = f"""
We want to extract the relevant people and institutions from a news article.

Please provide your output in the following JSON format:

{{
  "people": [
    {{
      "name": "person's full name",
      "description": "their role or description from the article"
    }}
  ],
  "institutions": [
    {{
      "name": "institution name",
      "type": "type of institution (e.g., government, media, party, etc.)",
      "context": "brief context of how it's mentioned"
    }}
  ]
}}

Here is the text of the article:
{ny_article}
"""

In [None]:
response = call_openai(prompt3, temperature=0.0, max_tokens=5000)
print(response)
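A reply to a schema-style prompt is still just text, so it can be worth parsing it and checking it against the requested fields before using it downstream. The following is a minimal sketch under stated assumptions: the helper names `extract_json` and `check_extraction` are hypothetical, and `sample_reply` is a made-up stand-in for a model reply, not actual model output.

```python
import json
import re

def extract_json(text):
    """Parse a JSON object from a reply, tolerating a surrounding Markdown code fence."""
    match = re.search(r"```(?:json)?\s*(.*?)\s*```", text, flags=re.DOTALL)
    payload = match.group(1) if match else text
    return json.loads(payload)

def check_extraction(data):
    """Return a list of problems found when comparing data with the requested schema."""
    expected = {
        "people": ("name", "description"),
        "institutions": ("name", "type", "context"),
    }
    problems = []
    for key, fields in expected.items():
        items = data.get(key)
        if not isinstance(items, list):
            problems.append(f"'{key}' is missing or not a list")
            continue
        for i, item in enumerate(items):
            for field in fields:
                if field not in item:
                    problems.append(f"{key}[{i}] lacks '{field}'")
    return problems

# Hypothetical reply used only to exercise the helpers
sample_reply = '''```json
{"people": [{"name": "Donald Trump", "description": "Republican candidate"}],
 "institutions": [{"name": "White House", "type": "government"}]}
```'''

parsed = extract_json(sample_reply)
print(check_extraction(parsed))  # → ["institutions[0] lacks 'context'"]
```

An empty list from `check_extraction` means every requested field is present; it says nothing about whether the extracted values are accurate.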

The choice of schema can fundamentally change how the input text is represented, for example as a timeline or as a network. Below we illustrate a network-structured prompt and an associated visualization.

In [None]:
!pip install pyvis

In [None]:
# NETWORK GRAPH REPRESENTATION
prompt4 = f"""
Create a network graph representation of the article.

Return JSON:

{{
  "nodes": [
    {{"id": "trump", "label": "Donald Trump", "type": "person"}},
    {{"id": "harris", "label": "Kamala Harris", "type": "person"}},
    {{"id": "gop", "label": "Republican Party", "type": "institution"}}
  ],
  "edges": [
    {{"from": "trump", "to": "gop", "relationship": "leads"}},
    {{"from": "trump", "to": "harris", "relationship": "defeated"}}
  ]
}}

Article: {ny_article}
"""

In [None]:
response = call_openai(prompt4, temperature=0.0, max_tokens=5000)
print(response)
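Before drawing a graph from a reply like this, it can help to confirm that every edge points at a declared node, since graph libraries such as pyvis are expected to reject edges whose endpoints were never added. A minimal sketch, assuming the reply has already been parsed into a dictionary with the `nodes` and `edges` keys requested above; the helper name `find_dangling_edges` and the sample graph are hypothetical.

```python
def find_dangling_edges(graph):
    """Return the edges whose 'from' or 'to' id is not declared in graph['nodes']."""
    node_ids = {node["id"] for node in graph["nodes"]}
    return [
        edge for edge in graph["edges"]
        if edge["from"] not in node_ids or edge["to"] not in node_ids
    ]

# Hypothetical parsed reply with one edge that references an undeclared node
sample_graph = {
    "nodes": [
        {"id": "trump", "label": "Donald Trump", "type": "person"},
        {"id": "gop", "label": "Republican Party", "type": "institution"},
    ],
    "edges": [
        {"from": "trump", "to": "gop", "relationship": "leads"},
        {"from": "trump", "to": "biden", "relationship": "lost to in 2020"},
    ],
}

for edge in find_dangling_edges(sample_graph):
    print("Edge with an undeclared endpoint:", edge)
```

Dangling edges can then be dropped, or the missing nodes added with a default type, depending on how much of the model's output you want to keep.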

In [None]:
import re
from pyvis.network import Network

# Reuse the reply from the previous cell: strip an optional Markdown
# code fence, then parse the remaining text as JSON
json_str = re.sub(r'^```(?:json)?\s*\n|```\s*$', '', response.strip(), flags=re.MULTILINE)
data = json.loads(json_str)

# Color mapping
color_map = {
    'person': '#3498db',
    'institution': '#2ecc71',
    'location': '#f39c12',
    'event': '#e74c3c',
    'policy': '#9b59b6'
}

# Create network
net = Network(
    height='800px', 
    width='100%', 
    directed=True, 
    notebook=True,
    cdn_resources='in_line'
)

# Add nodes
for node in data['nodes']:
    net.add_node(node['id'], label=node['label'], 
                 color=color_map.get(node['type'], '#95a5a6'),
                 title=f"{node['label']} ({node['type']})",
                 size=20)

# Add edges
for edge in data['edges']:
    net.add_edge(edge['from'], edge['to'], 
                 title=edge['relationship'], arrows='to')

# Configure and show
net.toggle_physics(True)
net.show('network.html')

print("✓ Network HTML generated!")

In [None]:
# Display in Colab
from IPython.display import IFrame
IFrame(src='network.html', width='100%', height=800)

### System Prompts

A system prompt lets you give the LLM a "persona" that guides the output it generates. We illustrate this with a sample article about a Venezuelan gang operating in Texas.

In [None]:
# Sample article for demonstration
tt_article = """Texas Governor Greg Abbott announced a crackdown on Tren de Aragua,
a Venezuelan gang that has allegedly been involved in criminal activities across the state.
Law enforcement officials have reported increased gang activity in several major cities.
The gang, which originated in Venezuela, is known for human trafficking and violent crimes.
Abbott has allocated additional resources to state police agencies to combat the threat.
Critics argue that the focus on gang activity may distract from broader immigration reform discussions.
Immigrant advocacy groups have expressed concern about potential profiling of Venezuelan migrants."""

my_prompt = f"""Summarize the article below.
Article:
{tt_article}
"""

In [None]:
# System prompt example 1: Constrain format
system_message1 = "You are a helpful assistant that replies with a concise one-sentence answer that always starts with the letter T."

response = call_openai(my_prompt, system_message=system_message1, temperature=0.0, max_tokens=100)
print(response)
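A system prompt steers the model but may not be followed exactly, so a quick programmatic check of the reply can be useful. The sketch below tests the two constraints from the system message above (one sentence, starting with "T"). The helper name `follows_format` is hypothetical, and the sentence split is a rough heuristic that will miscount abbreviations such as "Gov.".

```python
import re

def follows_format(reply):
    """Check that a reply starts with 'T' and looks like a single sentence."""
    text = reply.strip()
    # Rough heuristic: split after '.', '!' or '?' when followed by whitespace
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text) if s]
    return text.startswith("T") and len(sentences) == 1

# Hypothetical replies used only to exercise the check
print(follows_format("Texas is cracking down on a Venezuelan gang."))
print(follows_format("Abbott announced a crackdown. Critics object."))
```

In the notebook, the same check could be applied to the model's actual reply with `follows_format(response)`.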

In [None]:
# System prompt example 2: Content filtering
system_message2 = """You are a language model that works with young children. 
Never produce content related to violence or gangs. 
If asked to produce this content please reply with the phrase 'I can't do that :( \nViolence is not good'"""

response = call_openai(my_prompt, system_message=system_message2, temperature=0.0, max_tokens=100)
print(response)