<a href="https://colab.research.google.com/github/queerviolet/flex-prompt/blob/main/doc/intro_to_flex_prompt.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# flex[prompt]

In [79]:
#@title (logo)
%%html
<style>
  .logo {
    font-size: 600%;
    text-align: center;
    font-family: serif;
    font-variant-ligatures: no-common-ligatures;
  }
</style>
<div class=logo><i>flex</i>[prompt]</div>

Large language models have a *maximum context window*: a maximum number of tokens they can receive and produce. You may have noticed this:

![Example error from exceeding a model's token limit](https://raw.githubusercontent.com/queerviolet/flex-prompt/main/doc/screenshot-max-content-length.png)


Flex prompt addresses this by fitting your prompt into the model's context window. You provide a flexible prompt template, and flex prompt renders it into model input.

Flex prompt does not handle model execution, but integrates well with execution frameworks like [LangChain](https://www.langchain.com/) and [Haystack](https://haystack.deepset.ai/).

# Quickstart

We'll install `flex-prompt` with the optional `openai` dependencies, since we're using OpenAI models for these examples. This will install `tiktoken`, the OpenAI tokenizer.

In [None]:
!pip install flex-prompt[openai]

Let's also get ourselves a long string to work with:

In [81]:
#@title `WAR_AND_PEACE` = *(text of War and Peace from Project Gutenberg)*
WAR_AND_PEACE_URL = 'https://www.gutenberg.org/cache/epub/2600/pg2600.txt'
from urllib.request import urlopen
with urlopen(WAR_AND_PEACE_URL) as f: WAR_AND_PEACE = f.read().decode('utf-8')

## Rendering directly

In [82]:
from flex_prompt import render, Flex, Expect

rendered = render(
    Flex([
      "Given the text, answer the question.",
      "--Text--",
      Flex([WAR_AND_PEACE], flex_weight=2),
      "--End Text--",
      "Question: What's the title of this text?",
      "Answer:", Expect()
    ], join='\n'),
    model='text-davinci-002',
    # note: we're setting an artificially low token_limit for
    # demonstration purposes. If you omit token_limit, flex_prompt
    # will entirely fill the model's context window.
    token_limit=300)

print(rendered.output)

Given the text, answer the question.
--Text--
﻿The Project Gutenberg eBook of War and Peace
    
This ebook is for the use of anyone anywhere in the United States and
most other parts of the world at no cost and with almost no restrictions
whatsoever. You may copy it, give it away or re-use it under the terms
of the Project Gutenberg License included with this ebook or online
at www.gutenberg.org. If you are not located in the United States,
you will have to check the laws of the country where you are located
before using this eBook.

Title: War and Peace


Author: graf Leo Tolstoy

Translator: Aylmer Maude
        Louise Maude

Release date: April 1, 2001 [eBook
--End Text--
Question: What's the title of this text?
Answer:



Here, we're using flex prompt's `Flex` component and passing it directly to `render`. `Flex` divides the available space in the prompt amongst its children, evenly by default and in proportion to `flex_weight` where one is set, filling the space completely. Neat!

But note that if we *entirely* fill the context window with our prompt, we'll have no more tokens left for the response! That's the role of `Expect`: it's a placeholder which participates in layout but doesn't render any tokens, leaving room for a response.

We can get the rendered prompt string from `rendered.output`. The number of tokens available for the response is `rendered.max_response_tokens`:

In [83]:
rendered.max_response_tokens

89

## `Flexed` components

Rendering directly is fine for a quick example, but in practice you'll probably want prompts which can take parameters. You can inherit from `flex_prompt.Flexed` to define a prompt component whose `content()` is flexed:

In [84]:
from flex_prompt import Flexed, Expect
from dataclasses import dataclass

@dataclass
class Ask(Flexed):
  text: str
  question: str
  answer: str | Expect = Expect()
  instruct: str | None = "Given a text, answer the question."

  flex_join = '\n' # yielded items will be joined by newlines
  def content(self, _ctx):
    if self.instruct:
      yield 'Given the text, answer the question.'
      yield ''
    yield '-- Begin Text --'
    yield Flex([self.text], flex_weight=2)
    yield '-- End Text --'
    yield 'Question: ', self.question
    yield 'Answer: ', self.answer

We can then pass an instance of `Ask` to `render`:

In [85]:
from flex_prompt import render
ask_tolstoy = Ask(text=WAR_AND_PEACE[10000:],
                  question="What character names appear in the text?")
rendering = render(ask_tolstoy, model='gpt-4', token_limit=300)
print(rendering.output)

Given the text, answer the question.

-- Begin Text --
. He went up to Anna Pávlovna,
kissed her hand, presenting to her his bald, scented, and shining head,
and complacently seated himself on the sofa.

“First of all, dear friend, tell me how you are. Set your friend’s
mind at rest,” said he without altering his tone, beneath the
politeness and affected sympathy of which indifference and even irony
could be discerned.

“Can one be well while suffering morally? Can one be calm in times
like these if one has any feeling?” said Anna Pávlovna. “You are
staying the whole evening, I hope?”

“And the fete at the English ambassador’s? Today is Wednesday. I
must put in an appearance there,” said the prince. “My daughter is
coming for me to take me there.”
-- End Text --
Question: What character names appear in the text?
Answer: 


Note that `answer` defaults to `Expect()`, which leaves room for an answer from the LLM. Writing prompts like this lets us use the same component to render both the examples and the active prompt, which simplifies format changes:

In [86]:
@dataclass
class AskWithExamples(Flexed):
  examples: list[tuple[str, str, str]]
  ask: Ask

  flex_join = '\n'
  def content(self, _ctx):
    yield Ask.instruct
    yield ''
    for text, q, a in self.examples:
      yield '**EXAMPLE:**'
      yield Ask(text, q, a, instruct=None)
      yield '**END EXAMPLE**'
    yield Flex([self.ask], flex_weight=2)

examples = [
  ('The triangle is green', 'What color is the triangle?', 'green'),
  ('If you breathe deeply, you will fall asleep.', 'How do you fall asleep?', 'breathe deeply'),
  ('The 5-ht2a receptor mediates gastrointestinal activation',
   'What does the 5-ht3 receptor do?',
   'not answered in the text')
]

ask_tolstoy_w_examples = AskWithExamples(examples, ask_tolstoy)
rendering = render(
    ask_tolstoy_w_examples,
    model='gpt-4',
    token_limit=500)
print(rendering.output)

Given a text, answer the question.

**EXAMPLE:**
-- Begin Text --
The triangle is green
-- End Text --
Question: What color is the triangle?
Answer: green
**END EXAMPLE**
**EXAMPLE:**
-- Begin Text --
If you breathe deeply, you will fall asleep.
-- End Text --
Question: How do you fall asleep?
Answer: breathe deeply
**END EXAMPLE**
**EXAMPLE:**
-- Begin Text --
The 5-ht2a receptor mediates gastrointestinal activation
-- End Text --
Question: What does the 5-ht3 receptor do?
Answer: not answered in the text
**END EXAMPLE**
Given the text, answer the question.

-- Begin Text --
. He went up to Anna Pávlovna,
kissed her hand, presenting to her his bald, scented, and shining head,
and complacently seated himself on the sofa.

“First of all, dear friend, tell me how you are. Set your friend’s
mind at rest,” said he without altering his tone, beneath the
politeness and affected sympathy of which indifference and even irony
could be discerned.

“Can one be well while suffering morall

## Execution

Flex prompt doesn't really care how you execute your prompt. But it does provide basic integration hooks: `render(model=)` accepts strings, LangChain models, and Haystack models. Note that not all models are supported out of the box. You can [register support for new models](#scrollTo=AfIHoOpghKyN) as needed.

In [87]:
#@title (read our OpenAI key from the keychain)
#@markdown We'll use OpenAI's models for these examples.
#@markdown You'll need an OPENAI_API_KEY defined in your
#@markdown [colab secrets](https://medium.com/@parthdasawant/how-to-use-secrets-in-google-colab-450c38e3ec75).
from google.colab import userdata
import os
os.environ['OPENAI_API_KEY'] = userdata.get('OPENAI_API_KEY')

### Using with LangChain

In [None]:
!pip install langchain openai

In [89]:
from langchain.llms import OpenAI

llm = OpenAI()
rendering = render(ask_tolstoy_w_examples, model=llm)
print(llm(rendering.output, max_tokens=rendering.max_response_tokens))

 Prince Vasíli, Anna Pávlovna Schérer, Baron Funke, Emperor Alexander, Novosíltsev, Hardenburg, Haugwitz, Wintzingerode, Vicomte de Mortemart, Abbé Morio, Anatole, Hippolyte, Princess Mary Bolkónskaya, Prince Bolkónski


### Using with Haystack

In [None]:
!pip install farm-haystack

In [91]:
from haystack.nodes import PromptModel, PromptNode

llm = PromptModel(model_name_or_path='text-davinci-002', api_key=os.environ['OPENAI_API_KEY'])
rendering = render(ask_tolstoy_w_examples, model=llm)
print(llm.invoke(rendering.output, max_tokens=rendering.max_response_tokens))

['Anna Pávlovna, le Vicomte de Mortemart, the Abbé Morio, Baron Funke, Anatole, Hippolyte, Lavater, Princess Mary Bolkónskaya, and Prince Bolkónski.']


# `flex_prompt.render`
Flex prompt exports a top-level `render` function which renders an input for a given model. This returns a `Rendering[str]`, whose `output` is the rendered prompt string.

The input can be as simple as a string:

In [92]:
from flex_prompt import render
rendering = render("Q: What are the colors of the rainbow?\nA:", model='text-davinci-002')
rendering.output, rendering.max_response_tokens

('Q: What are the colors of the rainbow?\nA:', 4084)

The rendering has everything we need to call the model, here using LangChain:

In [93]:
from langchain.llms import OpenAI
davinci = OpenAI(model='text-davinci-002')
davinci(rendering.output, max_tokens=rendering.max_response_tokens)

' The colors of the rainbow are red, orange, yellow, green, blue, indigo, and violet.'

For convenience, the `render` function also accepts LangChain and Haystack models directly:



In [94]:
rendering = render("Q: What are the colors of the rainbow?\nA:", model=davinci)
davinci(rendering.output, max_tokens=rendering.max_response_tokens)

' The colors of the rainbow are red, orange, yellow, green, blue, indigo, and violet.'

`render` will render most things you throw at it:

## strings

Rendered strings are cropped to fit. We can see this if we pass an artificially low `token_limit`:

In [95]:
rendering = render("Q: What are the colors of the rainbow?\nA:", model=davinci, token_limit=5)
rendering.output

'Q: What are the'

If the input overflowed, `overflow_token_count` will be non-zero:

In [96]:
rendering.overflow_token_count

8

(You can't do much with this information right now, except know that the prompt overflowed.)
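
The relationship between `token_limit`, the cropped output, `overflow_token_count`, and `max_response_tokens` can be sketched with plain lists. This is only an illustration, not flex prompt's implementation: `fit` is a hypothetical helper, integers stand in for tokens, and both the 13-token prompt length and the 4,097-token window for `text-davinci-002` are assumptions inferred from the numbers printed above (5 kept plus 8 overflowed, and 4097 - 13 = 4084).

```python
from typing import List, Sequence, Tuple


def fit(tokens: Sequence[int], token_limit: int) -> Tuple[List[int], int, int]:
    """Crop tokens to token_limit (hypothetical helper, not flex prompt's API).

    Returns (kept tokens, overflow count, tokens left over for a response).
    """
    kept = list(tokens[:token_limit])
    overflow = len(tokens) - len(kept)
    response_budget = token_limit - len(kept)
    return kept, overflow, response_budget


prompt = list(range(13))  # stand-in for a 13-token prompt (assumed length)

# Plenty of room: nothing is cropped and the rest of the window stays free.
print(fit(prompt, 4097)[1:])  # (0, 4084)

# Artificially low limit: five tokens survive and eight overflow.
print(fit(prompt, 5)[1:])  # (8, 0)
```

In this toy model, a prompt that overflows leaves no response budget at all, which is one reason to reserve space explicitly rather than rely on cropping.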

## lists

Rendering a `list` (or really, any non-`str` `Iterable`) concatenates all the items in that list:

In [97]:
questions = ['1. Colors of the rainbow?\n', '2. Days of the week?\n']
rendering = render(['Answer the following questions:\n', questions], model='gpt-4')
rendering.output

'Answer the following questions:\n1. Colors of the rainbow?\n2. Days of the week?\n'
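
Ignoring token limits, the concatenation of nested lists can be pictured as a recursive flatten in which strings are treated as leaves. A minimal sketch, assuming nothing about flex prompt's internals (`flatten` is a hypothetical helper):

```python
from collections.abc import Iterable
from typing import Any, Iterator


def flatten(item: Any) -> Iterator[str]:
    """Yield the strings inside arbitrarily nested iterables, in order."""
    if isinstance(item, str):
        yield item  # strings are leaves, even though they are iterable
    elif isinstance(item, Iterable):
        for child in item:
            yield from flatten(child)
    else:
        raise TypeError(f"cannot flatten {type(item).__name__}")


questions = ['1. Colors of the rainbow?\n', '2. Days of the week?\n']
nested = ['Answer the following questions:\n', questions]
print(repr(''.join(flatten(nested))))
```

The `str` check has to come first: a string is itself an `Iterable` of one-character strings, so recursing into it would never terminate.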

By default, lists are rendered in `block` mode. This means that partial items which would be cut off simply aren't rendered at all:

In [98]:
one = """
and lo betide, the red sky opened upon us as though the crinkled
hand of the heavens itself was reaching down.
"""
two = """
we were witness to dark and terrible portents, whose nameless
features we could not grasp with our mortal minds
"""
three = """
it was only then, in the moment when cruel stars had long since
wrung us dry, that the chinchillas arrived.
"""

rendering = render([one, two, three], model='gpt-4', token_limit=60)
print(rendering.output)


and lo betide, the red sky opened upon us as though the crinkled
hand of the heavens itself was reaching down.

we were witness to dark and terrible portents, whose nameless
features we could not grasp with our mortal minds



To control this, you can use the `Cat` component explicitly (list rendering uses `Cat` implicitly).
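
The difference between dropping a partial item and clipping it (the `mode='clip'` option described under `Cat` below) can be sketched with a one-character-per-token model, like the `test-len-str` target. `concat` is a hypothetical helper, and stopping at the first item that doesn't fit is an assumption of this sketch rather than documented flex prompt behavior.

```python
from typing import List


def concat(items: List[str], token_limit: int, mode: str = "block") -> str:
    """Concatenate items under a budget, counting one character as one token."""
    if mode not in ("block", "clip"):
        raise ValueError(f"unknown mode: {mode!r}")
    out: List[str] = []
    remaining = token_limit
    for item in items:
        if len(item) <= remaining:
            out.append(item)  # the whole item fits
            remaining -= len(item)
        else:
            if mode == "clip":
                out.append(item[:remaining])  # keep the part that fits
            break  # assumption: nothing is rendered after the first misfit
    return "".join(out)


items = ["aaaa", "bbbb", "cccc"]
print(concat(items, 10))               # aaaabbbb
print(concat(items, 10, mode="clip"))  # aaaabbbbcc
```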

## callables (prompt components)

If you `render` a callable, flex prompt will call it with a rendering context and expect it to return an iterable of rendered items. It's thus convenient to define prompt components as callable dataclasses:

In [99]:
from flex_prompt import Flex, Render, Expect
from dataclasses import dataclass

@dataclass
class Ask:
  """Given a text, answer a question."""
  text: str
  question: str
  def __call__(self, ctx: Render):
    yield Flex([
      'Given the text, answer the question\n\n',
      'Text:\n', self.text, '\n',
      'Question: ', self.question, '\n',
      'Answer:', Expect()
    ])

rendered = render(Ask(text=[one, two, three],
                      question='Where is this text from?'),
                  model='text-davinci-002')
print(rendered.output)

Given the text, answer the question

Text:

and lo betide, the red sky opened upon us as though the crinkled
hand of the heavens itself was reaching down.

we were witness to dark and terrible portents, whose nameless
features we could not grasp with our mortal minds

it was only then, in the moment when cruel stars had long since
wrung us dry, that the chinchillas arrived.

Question: Where is this text from?
Answer:


In the example above, our prompt yields a single `Flex` with all our content. This is pretty common and regrettably ugly. Flex prompt provides a `Flexed` abstract base class to simplify the common case where you just want to throw a bunch of stuff in the prompt and have it show up. To use it, derive from `Flexed` and implement `content()`:

In [100]:
from flex_prompt import Flex, Render, render, Flexed, Expect
from dataclasses import dataclass
from typing import Any

@dataclass
class Summarize(Flexed):
  """Summarize a text."""
  text: Any

  flex_join = '\n' # If it's present in the class, Flexed will pass flex_join
                  # to the inner Flex component
  def content(self, _ctx: Render):
    yield 'Summarize the text.'
    yield 'Text:'
    yield self.text
    yield 'Summary:', Expect()

rendered = render(Summarize([one, two, three]), model='text-davinci-002')
print(rendered.output)
print('expecting', rendered.max_response_tokens, 'tokens')

Summarize the text.
Text:

and lo betide, the red sky opened upon us as though the crinkled
hand of the heavens itself was reaching down.

we were witness to dark and terrible portents, whose nameless
features we could not grasp with our mortal minds

it was only then, in the moment when cruel stars had long since
wrung us dry, that the chinchillas arrived.

Summary:
expecting 4001 tokens


# Included components

Flex prompt comes with a few components included.

## `Flex`

`Flex` divides the available space amongst its children:

In [101]:
from flex_prompt import render, Flex
A = 'A' * 10000
B = 'B' * 10000
C = 'C' * 10000
# the test-len-str model target is a test helper built into
# flex prompt. its tokenizer just returns each character as a token.
print(render(Flex([A, B, C]), model='test-len-str', token_limit=30).output)

AAAAAAAAAABBBBBBBBBBCCCCCCCCCC


You can control how `Flex` divides the space with the `flex_weight` property, set on the child:

In [102]:
for w in range(1, 5):
  rendering = render(
    Flex([A, Flex([B], flex_weight=w), C]),
    model='test-len-str',
    token_limit=30)
  print(rendering.output)

AAAAAAAAAABBBBBBBBBBCCCCCCCCCC
AAAAAAABBBBBBBBBBBBBBBCCCCCCCC
AAAAAABBBBBBBBBBBBBBBBBBCCCCCC
AAAAABBBBBBBBBBBBBBBBBBBBCCCCC


You can specify a `join` argument to make `Flex` join its children while respecting the window size:

In [103]:
rendering = render(
  Flex([A, Flex([B], flex_weight=2), C], join='\n--\n'),
  model='test-len-str',
  token_limit=30)
print(rendering.output)
print('output length:', len(rendering.output))

AAAAA
--
BBBBBBBBBBB
--
CCCCCC
output length: 30


## `Cat`

Flex prompt's `Cat` component concatenates all the iterables you give it, and gives you more control over their rendering than if they were just in a list.

With no arguments, it's equivalent to rendering a list:

In [104]:
from flex_prompt import render, Cat
rendering = render(Cat([one, two, three]), model=davinci, token_limit=40)
print(rendering.output)


and lo betide, the red sky opened upon us as though the crinkled
hand of the heavens itself was reaching down.



If you'd rather clip items which can't be completely rendered, you can specify `mode='clip'`:

In [105]:
from flex_prompt import render, Cat
rendering = render(Cat([one, two, three], mode='clip'), model=davinci, token_limit=40)
print(rendering.output)
print(rendering.overflow_token_count, 'tokens clipped')


and lo betide, the red sky opened upon us as though the crinkled
hand of the heavens itself was reaching down.

we were witness to dark and terrible portents,
14 tokens clipped


`Cat` also lets you specify a `join`er, just as `Flex` does:

In [111]:
from flex_prompt import render, Cat
rendering = render(Cat([one, two, three], mode='clip', join='---'), model=davinci, token_limit=70)
print(rendering.output)


and lo betide, the red sky opened upon us as though the crinkled
hand of the heavens itself was reaching down.
---
we were witness to dark and terrible portents, whose nameless
features we could not grasp with our mortal minds
---
it was only then, in the moment when cruel stars had long


## `Expect`

`Expect` is a layout placeholder. It participates in layout but doesn't produce any tokens, leaving space for model output:

In [128]:
from flex_prompt import render, Flex, Cat, Expect

# without Expect
rendering = render(Flex([
  'What is the sentiment of the following text?',
  'Text:', Cat([one, two, three], mode='clip'),
  'Sentiment:'
], join='\n'), model='gpt-4', token_limit=80)
print('without Expect:')
print(rendering.output)
# note that this may not be exactly zero due to Flex rounding
# and token accounting (see "Token accounting" under Known Issues)
print('available response tokens:', rendering.max_response_tokens)

# with Expect
rendering = render(Flex([
  'What is the sentiment of the following text?',
  'Text:', Cat([one, two, three], mode='clip'),
  'Sentiment:', Expect()
], join='\n'), model='gpt-4', token_limit=80)
print('\n\nwith Expect:')
print(rendering.output)
print('available response tokens:', rendering.max_response_tokens)

without Expect:
What is the sentiment of the following text?
Text:

and lo betide, the red sky opened upon us as though the crinkled
hand of the heavens itself was reaching down.

we were witness to dark and terrible portents, whose nameless
features we could not grasp with our mortal minds

it was only then, in the moment when
Sentiment:
available response tokens: 5


with Expect:
What is the sentiment of the following text?
Text:

and lo betide, the red sky opened upon us as though the crinkled
hand of the heavens itself was reaching down.

we were
Sentiment:

available response tokens: 36


Like all other built-in components, `Expect` takes `flex_weight`:

In [130]:
for w in range(1, 5):
  rendering = render(
    Flex([A, B, C, Expect(flex_weight=w)]),
    model='test-len-str',
    token_limit=30)
  print(rendering.output, f'{rendering.max_response_tokens=}')

AAAAAAABBBBBBBCCCCCCCC rendering.max_response_tokens=8
AAAAAABBBBBBCCCCCC rendering.max_response_tokens=12
AAAAABBBBBCCCCC rendering.max_response_tokens=15
AAAABBBBCCCC rendering.max_response_tokens=18
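
The numbers in the `Flex` and `Expect` examples above can be reproduced with a small allocation rule. This is a sketch that happens to match those printed outputs, not a description of flex prompt's actual layout algorithm. `toy_flex` is a hypothetical helper that counts one character as one token (like `test-len-str`), gives each child its weighted share of the remaining space rounded down, and treats a `None` child as an `Expect`-style placeholder.

```python
from typing import List, Optional, Tuple

Child = Tuple[Optional[str], int]  # (text, weight); text=None is a placeholder


def toy_flex(children: List[Child], token_limit: int,
             join: str = "") -> Tuple[str, int]:
    """Return (output, reserved response tokens). Weights must be positive."""
    # Joiners are paid for up front, one between each pair of children.
    remaining = max(token_limit - len(join) * max(len(children) - 1, 0), 0)
    remaining_weight = sum(weight for _, weight in children)
    parts: List[str] = []
    reserved = 0
    for text, weight in children:
        # Each child gets its weighted share of what is left, rounded down.
        share = remaining * weight // remaining_weight
        remaining -= share
        remaining_weight -= weight
        if text is None:
            reserved += share  # placeholder: takes space, renders nothing
            parts.append("")
        else:
            # Simplification: space a short child doesn't use is not reassigned.
            parts.append(text[:share])
    return join.join(parts), reserved


A, B, C = "A" * 10000, "B" * 10000, "C" * 10000
print(toy_flex([(A, 1), (B, 2), (C, 1)], 30)[0])
print(toy_flex([(A, 1), (B, 1), (C, 1), (None, 4)], 30))  # ('AAAABBBBCCCC', 18)
```

Because each share is computed from the space still remaining, rounding errors are absorbed by later children and the shares always sum to the budget.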


# Getting a model-specific render function

When you call `render(input, model=m)`, flex prompt searches for a render `Target` for `m`. The target finders that perform this search may need to look up model parameters, which could be an expensive operation. If you want to do this search once rather than every time you render, you can call `flex_prompt.target` to get a model-specific renderer:

In [107]:
from flex_prompt import target, Flex, Expect
render = target(davinci)
rendering = render(Ask(text=WAR_AND_PEACE,
                       question='What might a 19th century Russian aristocrat think of this book?'))
print(davinci(rendering.output, max_tokens=rendering.max_response_tokens))

 The aristocrat might think the book is well-written and informative, but they may not agree with Tolstoy's views on war and peace.


This is equivalent to calling `render(model=)` (and is how `render` is implemented internally).

# Integrating new models

When you ask flex prompt to render against a model, it looks up a `Target` for that model by calling a series of target finders. You can register a target finder using `flex_prompt.register_target_finder`:

In [108]:
from flex_prompt import register_target_finder, Target
from flex_prompt.rendering import Str
from typing import Any

class WordTokenizer:
  def encode(self, string):
    return list(self._encode(string))

  def decode(self, tokens):
    return ''.join(tokens)

  def _encode(self, string):
    import re
    start = 0
    for m in re.finditer(r'(\s|\n)+', string):
      space_start, space_end = m.span()
      word = string[start:space_start]
      if word: yield word
      yield string[space_start:space_end]
      start = space_end
    yield string[start:]

@register_target_finder
def find_example_target(model: Any) -> Target | None:
  if model == 'example-target':
    return Target(10, WordTokenizer(), Str)
  elif model == 'example-target-big':
    return Target(100, WordTokenizer(), Str)

In [109]:
from flex_prompt import render
print(render(one, model='example-target').output)
print(render(one, model='example-target-big').output)


and lo betide, the red

and lo betide, the red sky opened upon us as though the crinkled
hand of the heavens itself was reaching down.
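
Judging from this example, a tokenizer passed to `Target` needs an `encode` that returns a list of tokens and a `decode` that turns tokens back into text. Before registering a tokenizer of your own, it can be worth checking that the two round-trip losslessly. The sketch below does that for `SplitTokenizer`, a hypothetical and more compact stand-in for `WordTokenizer`; it runs without flex prompt installed.

```python
import re
from typing import List


class SplitTokenizer:
    """Hypothetical stand-in: runs of whitespace and runs of non-whitespace."""

    def encode(self, string: str) -> List[str]:
        return re.findall(r"\s+|\S+", string)

    def decode(self, tokens: List[str]) -> str:
        return "".join(tokens)


tok = SplitTokenizer()
text = "\nand lo betide, the red sky opened upon us"
tokens = tok.encode(text)
assert tok.decode(tokens) == text  # lossless round trip
print(repr(tok.decode(tokens[:10])))  # '\nand lo betide, the red'
```

Keeping the first ten tokens gives the same text as the 10-token `example-target` output above, since whitespace runs count as tokens too.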



# Known Issues

## Token accounting

Flex prompt operates in token space: when you hand it strings to render, it tokenizes them and then concatenates those token lists. It only finally generates a string when you read the rendering's `.output` property (or convert it to a string, which implicitly does the same thing). Flex prompt's layout engine assumes that `token_count(A + B) = token_count(A) + token_count(B)`.

This makes layout a bit faster, since we avoid repeatedly calling the tokenizer as we concatenate substrings. Unfortunately, it's also incorrect.

Specifically, if adjacent prompt fragments combine into a single token, flex prompt will report a `token_count` which is higher than the actual token count. You can see this by combining individual-character substrings:

In [110]:
from flex_prompt import target
render = target('gpt-4')

rendered = render([char for char in 'hello world']) # ['h', 'e', 'l', ...]
print('initial rendering token count:', rendered.token_count)
# rendered.output will be "hello world", so we're really just
# rendering a single string here:
print('actual token count:', render(rendered.output).token_count)

initial rendering token count: 11
actual token count: 2


Fortunately, this will always over-estimate the token count, so the fundamental guarantee that flex prompt fits prompts into the token window still holds. It is also *probably* a mistake to combine prompt sections in a way that could generate new words (that is, without whitespace), so this is unlikely to have a major impact in practice.

If an absolutely accurate accounting of tokens is important for your use case, you should re-count tokens as in the example above.
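
The over-count is easy to reproduce without any real model. The sketch below uses a toy greedy tokenizer with a two-entry vocabulary; `VOCAB` and `toy_encode` are assumptions made for illustration and have nothing to do with GPT-4's actual vocabulary.

```python
from typing import List

VOCAB = ["hello", " world"]  # toy multi-character tokens (assumed)


def toy_encode(text: str) -> List[str]:
    """Greedy tokenizer: prefer a vocabulary entry, else a single character."""
    tokens: List[str] = []
    i = 0
    while i < len(text):
        match = next((v for v in VOCAB if text.startswith(v, i)), text[i])
        tokens.append(match)
        i += len(match)
    return tokens


fragments = list("hello world")  # ['h', 'e', 'l', ...]
summed = sum(len(toy_encode(f)) for f in fragments)  # count per fragment
actual = len(toy_encode("".join(fragments)))         # count after joining
print(summed, actual)  # 11 2
```

Summing per-fragment counts can only match the true count when no token spans a fragment boundary, which is why re-tokenizing the final string is the reliable check.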