## Text input

Available OpenAI model names are listed at https://platform.openai.com/docs/models

In [1]:
from dotenv import load_dotenv

load_dotenv()

True

In [2]:
from langchain.agents import create_agent

agent = create_agent(
    model='gpt-5-nano',
    system_prompt="You are a science fiction writer, create a capital city at the user's request.",
)

In [4]:
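from langchain.messages import HumanMessage

# Content is a list of typed blocks; a single text block is enough here
question = HumanMessage(content=[
    {"type": "text", "text": "What is the capital of Italy?"}
])

response = agent.invoke(
    {"messages": [question]}
)

# The last message in the returned state is the model's reply
print(response['messages'][-1].content)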

Rome is the capital of Italy.

If you’d like a science-fiction twist, I can design a futuristic capital for Italy—tell me the vibe (orbital megacity, subterranean hive beneath the ancient ruins, floating platforms, etc.) and I’ll create it.
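
The reply printed above is a plain string, but message content can also be a list of typed blocks, like the one passed in as the question. A minimal sketch of a helper that handles both shapes follows. `content_to_text` is a hypothetical name, not a LangChain function, and it assumes text blocks are dicts with `"type": "text"` and a `"text"` key, as in the input above.

```python
def content_to_text(content):
    """Return the text of a message's content (hypothetical helper).

    Assumes content is either a str or a list whose items are plain
    strings or dicts shaped like {"type": "text", "text": "..."}.
    """
    if isinstance(content, str):
        return content
    parts = []
    for block in content:
        if isinstance(block, str):
            parts.append(block)
        elif isinstance(block, dict) and block.get("type") == "text":
            parts.append(block.get("text", ""))
        # Non-text blocks (image, audio, ...) are skipped.
    return "".join(parts)


print(content_to_text("Rome is the capital of Italy."))
print(content_to_text([{"type": "text", "text": "What is the capital of Italy?"}]))
```

In the notebook this would be called as `content_to_text(response['messages'][-1].content)`.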


## Image input

In [5]:
from ipywidgets import FileUpload
from IPython.display import display

uploader = FileUpload(accept='.png', multiple=False)
display(uploader)

FileUpload(value=(), accept='.png', description='Upload')

In [6]:
print(uploader.value)

({'name': 'cityscape-rome-ancient-centre-italy.png', 'type': 'image/png', 'size': 985457, 'content': <memory at 0x0000027B7F1A3C40>, 'last_modified': datetime.datetime(2026, 1, 4, 1, 46, 48, 780000, tzinfo=datetime.timezone.utc)},)


In [7]:
import base64

# Get the first (and only) uploaded file dict
uploaded_file = uploader.value[0]

# This is a memoryview
content_mv = uploaded_file["content"]

# Convert memoryview -> bytes
img_bytes = bytes(content_mv)  # or content_mv.tobytes()

# Now base64 encode
img_b64 = base64.b64encode(img_bytes).decode("utf-8")
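
The conversion steps above can be wrapped into a small helper that goes from the uploaded bytes (or a memoryview) to the image block used in the next cell. This is a sketch under assumptions: `to_image_block` is a hypothetical name, the MIME type is guessed from the file name with the standard-library `mimetypes` module, and the block layout simply mirrors the `{"type": "image", "base64": ..., "mime_type": ...}` shape this notebook uses.

```python
import base64
import mimetypes


def to_image_block(content, filename):
    """Build an image content block from raw bytes (hypothetical helper)."""
    # bytes() accepts bytes, bytearray, and memoryview alike.
    raw = bytes(content)
    # Guess the MIME type from the extension; fall back to PNG (an assumption
    # that matches the uploader's accept='.png' filter).
    mime_type, _ = mimetypes.guess_type(filename)
    return {
        "type": "image",
        "base64": base64.b64encode(raw).decode("utf-8"),
        "mime_type": mime_type or "image/png",
    }


# Stand-in for uploader.value[0]: the 8-byte PNG signature as a memoryview.
fake_upload = {"name": "example.png", "content": memoryview(b"\x89PNG\r\n\x1a\n")}
block = to_image_block(fake_upload["content"], fake_upload["name"])
print(block["mime_type"], block["base64"])
```

With a real upload, the call would be `to_image_block(uploaded_file["content"], uploaded_file["name"])`.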

In [8]:
multimodal_question = HumanMessage(content=[
    {"type": "text", "text": "Tell me about this capital"},
    {"type": "image", "base64": img_b64, "mime_type": "image/png"}
])

response = agent.invoke(
    {"messages": [multimodal_question]}
)

print(response['messages'][-1].content)

From the image you shared, the capital is Roma Prime—a city that wears its ancient stones like a living memory, while its glass and alloy towers push the sky a little higher each year.

What Roma Prime is like
- Name and role: Roma Prime is the political heart of the realm, the seat of the Republic of Aeterna. It blends imperial ceremony with planetary-scale governance, where centuries-old rituals accompany real-time AI councils. The city keeps its oldest ruins as civic monuments, not museum pieces.
- Geography and layout: The core sits on a tidal river delta, ringed by olive-green hills and terraces. The ancient forum sits at the center, surrounded by a grid of elevated avenues and sun-lit piazzas. Over time, new districts have grown outward: the Senate Dome, the Market of Echoes, the Docks of Lumen, and the upward-shooting Arcologies that crown the hills.
- Architecture and atmosphere: You can feel the past in every stone—the stones themselves seem to remember footsteps from a long-v

## Audio input

In [9]:
import sounddevice as sd
from scipy.io.wavfile import write
import base64
import io
import time
from tqdm import tqdm

# Recording settings
duration = 5  # seconds
sample_rate = 44100

print("Recording...")
audio = sd.rec(int(duration * sample_rate), samplerate=sample_rate, channels=1)
# Progress bar for the duration
for _ in tqdm(range(duration * 10)):   # update 10× per second
    time.sleep(0.1)
sd.wait()
print("Done.")

# Write WAV to an in-memory buffer
buf = io.BytesIO()
write(buf, sample_rate, audio)
wav_bytes = buf.getvalue()

aud_b64 = base64.b64encode(wav_bytes).decode("utf-8")

Recording...


100%|██████████| 50/50 [00:05<00:00,  9.82it/s]

Done.
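
When no microphone is available, the same in-memory WAV and base64 steps can be reproduced with the standard library alone. The sketch below is not the recording code above: it synthesizes a one-second 440 Hz tone as 16-bit PCM (the tone, the `make_wav_b64` name, and the 16-bit format are all illustrative assumptions) and then decodes the result to confirm it is a well-formed WAV file.

```python
import base64
import io
import math
import struct
import wave


def make_wav_b64(duration=1.0, sample_rate=44100, freq=440.0):
    """Return a base64-encoded mono 16-bit PCM WAV of a sine tone (hypothetical helper)."""
    n_frames = int(duration * sample_rate)
    # Scale each sample to half of the int16 range to avoid clipping.
    samples = (
        int(0.5 * 32767 * math.sin(2 * math.pi * freq * i / sample_rate))
        for i in range(n_frames)
    )
    pcm = b"".join(struct.pack("<h", s) for s in samples)

    buf = io.BytesIO()
    with wave.open(buf, "wb") as wf:
        wf.setnchannels(1)       # mono, like channels=1 above
        wf.setsampwidth(2)       # 2 bytes per sample = 16-bit
        wf.setframerate(sample_rate)
        wf.writeframes(pcm)
    return base64.b64encode(buf.getvalue()).decode("utf-8")


wav_b64 = make_wav_b64()

# Sanity check: decode and re-open the payload as a WAV file.
decoded = base64.b64decode(wav_b64)
with wave.open(io.BytesIO(decoded), "rb") as wf:
    print(wf.getnchannels(), wf.getsampwidth(), wf.getframerate(), wf.getnframes())
```

Note that `sd.rec` returns float32 samples by default, so the WAV written by `scipy.io.wavfile.write` in the cell above is 32-bit float rather than 16-bit PCM. Passing `dtype='int16'` to `sd.rec` would record 16-bit PCM instead.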





In [10]:
# Rebuild the agent with an audio-capable model (no system prompt this time)
agent = create_agent(
    model='gpt-4o-audio-preview',
)

multimodal_question = HumanMessage(content=[
    {"type": "text", "text": "Tell me about this audio file"},
    {"type": "audio", "base64": aud_b64, "mime_type": "audio/wav"}
])

response = agent.invoke(
    {"messages": [multimodal_question]}
)

print(response['messages'][-1].content)

It sounds like you’ve described an MP3 file using Base64 encoding. Unfortunately, I cannot directly determine the contents or details from just that description.

However, if you provide the actual Base64 data or upload the audio file, I can listen to the content of the audio and give you some information about what’s in it, such as whether it is music, speech, or something else, as well as general observations regarding the content. Please go ahead and provide the file or data, and I’ll do my best to assist!
