## Text input

Available model names are listed at https://platform.openai.com/docs/models

In [4]:
from dotenv import load_dotenv

load_dotenv()

True

In [5]:
from langchain.agents import create_agent

agent = create_agent(
    model='gpt-5-nano',
    system_prompt="You are a science fiction writer. Create a capital city at the user's request.",
)

In [3]:
from langchain.messages import HumanMessage

question = HumanMessage(content=[
    {"type": "text", "text": "What is the capital of The Moon?"}
])

response = agent.invoke(
    {"messages": [question]}
)

print(response['messages'][-1].content)

Capital: Lunaris Prime

Description:
- Location: Perched on the rim of Shackleton Crater in the Moon’s south polar region, where seasonal sunlight can be directed into daylight corridors and ice-harvesting operations occur below.
- Governance: The capital and seat of the Lunar Federation’s government, housing the Lunar Council, the Archives of Asterisms, and ministries for Ice, Energy, Culture, and Science.
- City design: A ring of modular arcologies built into the crater wall, topped with solar towers and glass-domed gardens. Shields of regolith provide radiation protection, while daylight-rings funnel sunlight into public spaces and farms.
- Economy: Ice mining for water and life support, helium-3 energy payloads, biotech research, and polar tourism.
- Culture: A blend of Earth diaspora heritage and AI-assisted daily life, with festivals at the Dawn Monolith and markets in the Poleway districts.
- Notable features: Dawn Monolith (central civic monument), Luma Trams (maglev transit), 

## Image input

In [6]:
from ipywidgets import FileUpload
from IPython.display import display

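# `accept` only filters the file picker; it does not validate the upload (the file below is a JPEG)
uploader = FileUpload(accept='.png', multiple=False)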
display(uploader)

FileUpload(value=(), accept='.png', description='Upload')

In [7]:
print(uploader.value)

({'name': 'pexels-yu-hsiu-chou-451522888-18108149.jpg', 'type': 'image/jpeg', 'size': 1379344, 'content': <memory at 0x0000022F37E62500>, 'last_modified': datetime.datetime(2026, 2, 1, 1, 48, 30, 520000, tzinfo=datetime.timezone.utc)},)


In [8]:
import base64

# Get the first (and only) uploaded file dict
uploaded_file = uploader.value[0]

# This is a memoryview
content_mv = uploaded_file["content"]

# Convert memoryview -> bytes
img_bytes = bytes(content_mv)  # or content_mv.tobytes()

# Now base64 encode
img_b64 = base64.b64encode(img_bytes).decode("utf-8")
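
The upload printed earlier is a JPEG even though the widget was configured with `accept='.png'`, so the MIME type sent to the model is better derived from the data than hard-coded. A minimal sketch, assuming only PNG and JPEG matter; `sniff_image_mime` and `encode_image` are hypothetical helper names, not part of ipywidgets or LangChain:

```python
import base64

# Magic-number prefixes for the two formats handled in this sketch
PNG_MAGIC = b"\x89PNG\r\n\x1a\n"
JPEG_MAGIC = b"\xff\xd8\xff"


def sniff_image_mime(data: bytes) -> str:
    """Guess the MIME type from the leading bytes (PNG and JPEG only)."""
    if data.startswith(PNG_MAGIC):
        return "image/png"
    if data.startswith(JPEG_MAGIC):
        return "image/jpeg"
    raise ValueError("unrecognized image format")


def encode_image(content) -> tuple[str, str]:
    """Accept bytes or a memoryview; return (base64 text, MIME type)."""
    data = bytes(content)
    return base64.b64encode(data).decode("utf-8"), sniff_image_mime(data)


# Stand-in for uploader.value[0]["content"]: a memoryview over JPEG-like bytes
fake_upload = memoryview(b"\xff\xd8\xff\xe0" + b"\x00" * 8)
b64_text, mime = encode_image(fake_upload)
print(mime)
```

In the notebook itself, `uploaded_file["type"]` already carries the browser-reported type, which is usually the simpler option; sniffing is only a fallback when that field is missing or untrusted.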

In [9]:
multimodal_question = HumanMessage(content=[
    {"type": "text", "text": "Tell me about this image"},
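    # Use the MIME type reported by the upload widget (image/jpeg here) rather than a hard-coded one
    {"type": "image", "base64": img_b64, "mime_type": uploaded_file["type"]}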
])

response = agent.invoke(
    {"messages": [multimodal_question]}
)

print(response['messages'][-1].content)

Here’s what I see in the image, plus a little sci‑fi flavor you can use.

What the image shows
- A dense urban core at night, likely a major Japanese city (signs in kanji and Latin letters, a few corporate logos visible).
- A mix of mid‑rise and high‑rise buildings, tightly packed along a central boulevard that’s busy with car lights.
- Neon and LED glow from storefronts, offices, and billboards giving the city a warm, bustling vibe.
- A rooftop helipad in the lower right, suggesting a future‑leaning city with air taxis or private VTOLs.
- The overall mood is cinematic: moody blues of twilight with punctuations of bright amber from street and building lights.

Notable details to spark ideas
- The central avenue looks like a lifeline of commerce and transport, flanked by glass towers and mixed‑use blocks.
- A distinctive glass tower with angled structural patterns stands out, hinting at innovative architecture.
- Signage includes both script and Latin letters, hinting at a globally conn

## Audio input

In [None]:
import sounddevice as sd
from scipy.io.wavfile import write
import base64
import io
import time
from tqdm import tqdm

# Recording settings
duration = 5  # seconds
sample_rate = 44100

print("Recording...")
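# Record as 16-bit PCM; sounddevice defaults to float32, which would produce a float WAV
audio = sd.rec(int(duration * sample_rate), samplerate=sample_rate, channels=1, dtype='int16')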
# Progress bar for the duration
for _ in tqdm(range(duration * 10)):   # update 10× per second
    time.sleep(0.1)
sd.wait()
print("Done.")

# Write WAV to an in-memory buffer
buf = io.BytesIO()
write(buf, sample_rate, audio)
wav_bytes = buf.getvalue()

aud_b64 = base64.b64encode(wav_bytes).decode("utf-8")
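
The in-memory WAV step can be checked without a microphone. A minimal sketch using only the standard library's `wave` module in place of `scipy.io.wavfile.write`; the synthetic 440 Hz tone and the helper name `pcm16_to_wav_b64` are assumptions for illustration:

```python
import base64
import io
import math
import struct
import wave


def pcm16_to_wav_b64(samples, sample_rate):
    """Pack int16 mono samples into a WAV container and base64-encode it."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)          # mono, as in the recording cell
        wav.setsampwidth(2)          # 2 bytes per sample = 16-bit PCM
        wav.setframerate(sample_rate)
        wav.writeframes(struct.pack(f"<{len(samples)}h", *samples))
    return base64.b64encode(buf.getvalue()).decode("utf-8")


sample_rate = 44100
# 0.1 s of a 440 Hz sine tone at half of full scale
tone = [
    int(16383 * math.sin(2 * math.pi * 440 * n / sample_rate))
    for n in range(sample_rate // 10)
]
tone_b64 = pcm16_to_wav_b64(tone, sample_rate)

# Decode again to confirm the container survived the round trip
with wave.open(io.BytesIO(base64.b64decode(tone_b64)), "rb") as wav:
    params = (wav.getnchannels(), wav.getsampwidth(), wav.getframerate(), wav.getnframes())
print(params)
```

Reading the header back this way is a quick sanity check that the bytes being sent are a mono 16-bit WAV at the intended sample rate.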

In [None]:
agent = create_agent(
    model='gpt-4o-audio-preview',
)

multimodal_question = HumanMessage(content=[
    {"type": "text", "text": "Tell me about this audio file"},
    {"type": "audio", "base64": aud_b64, "mime_type": "audio/wav"}
])

response = agent.invoke(
    {"messages": [multimodal_question]}
)

print(response['messages'][-1].content)
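
All three sections send the same message shape: a list of content blocks, each a dict with a `type` key. A small sketch of a builder for that list, using plain dicts so it runs without LangChain; `build_content` is a hypothetical helper, and the block keys simply mirror the ones used in the cells above:

```python
def build_content(text, media_b64=None, media_type=None, mime_type=None):
    """Return a content-block list: one text block plus an optional media block."""
    blocks = [{"type": "text", "text": text}]
    if media_b64 is not None:
        if media_type not in ("image", "audio"):
            raise ValueError("media_type must be 'image' or 'audio'")
        if not mime_type:
            raise ValueError("mime_type is required for media blocks")
        blocks.append({"type": media_type, "base64": media_b64, "mime_type": mime_type})
    return blocks


text_only = build_content("What is the capital of The Moon?")
with_audio = build_content(
    "Tell me about this audio file",
    media_b64="UklGRg==",  # placeholder base64 text, not a real recording
    media_type="audio",
    mime_type="audio/wav",
)
print(with_audio[1]["type"], with_audio[1]["mime_type"])
```

The returned list can be passed as `HumanMessage(content=...)` in the same way as the hand-written lists in the cells above.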