### Data

In [None]:
!curl https://www.gutenberg.org/files/2600/2600-0.txt > ../../data/other/war_and_peace.txt

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 3280k  100 3280k    0     0   667k      0  0:00:04  0:00:04 --:--:--  906k


In [None]:
!head ../../data/other/war_and_peace.txt

The Project Gutenberg eBook of War and Peace, by Leo Tolstoy

This eBook is for the use of anyone anywhere in the United States and
most other parts of the world at no cost and with almost no restrictions
whatsoever. You may copy it, give it away or re-use it under the terms
of the Project Gutenberg License included with this eBook or online at
www.gutenberg.org. If you are not located in the United States, you
will have to check the laws of the country where you are located before
using this eBook.



In [4]:
!wc -w ../../data/other/war_and_peace.txt

wc: ../../data/other/war_and_peace.txt: No such file or directory


In [5]:
!curl https://www.gutenberg.org/files/9296/9296-0.txt > clarissa_complete.txt

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  596k  100  596k    0     0   184k      0  0:00:03  0:00:03 --:--:--  184k


## Applying MapReduce to long-form summarisation

In [1]:
%load_ext autoreload
%autoreload 2

In [17]:
import importlib
import llm_programs
importlib.reload(llm_programs)
from llm_programs import LMFunction, TemplatedFunction, MapReduce, Gemini, DummyLM
from llm_programs.programs.base.prompters import ArgsPrompter
from llm_programs.utils import *

In [18]:
from pathlib import Path
content = Path("../../data/other/war_and_peace.txt").read_text()
# window_size = 10_000
window_size = 1_000
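book_start = content.find("\n\n\n\n\nBOOK ONE")
# str.find returns -1 on a miss, which would silently slice off almost everything
assert book_start != -1, "start-of-book marker not found"
content = content[book_start:]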
print(len(content) // window_size)
inputs = [content[i:i+window_size].strip() for i in range(0, len(content), window_size)]
print(inputs[0])
print(inputs[0].count(' '))


3220
BOOK ONE: 1805





CHAPTER I

“Well, Prince, so Genoa and Lucca are now just family estates of the
Buonapartes. But I warn you, if you don’t tell me that this means war,
if you still try to defend the infamies and horrors perpetrated by that
Antichrist—I really believe he is Antichrist—I will have nothing
more to do with you and you are no longer my friend, no longer my
‘faithful slave,’ as you call yourself! But how do you do? I see I
have frightened you—sit down and tell me all the news.”

It was in July, 1805, and the speaker was the well-known Anna Pávlovna
Schérer, maid of honor and favorite of the Empress Márya Fëdorovna.
With these words she greeted Prince Vasíli Kurágin, a man of high
rank and importance, who was the first to arrive at her reception. Anna
Pávlovna had had a cough for some days. She was, as she said, suffering
from la grippe; grippe being then a new word in St. Petersburg, used
only by the elite.

All her invitations without exception, written in French, a
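
The fixed-width slicing above cuts windows mid-word (the first window ends in a dangling "a"). A minimal sketch of one alternative, assuming it is acceptable for windows to be slightly shorter than `window_size`: break at the last whitespace inside each window. The helper name `chunk_on_whitespace` is hypothetical and not part of `llm_programs`.

```python
def chunk_on_whitespace(text: str, max_chars: int) -> list[str]:
    """Split text into chunks of at most max_chars, cutting at whitespace when possible."""
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + max_chars, len(text))
        if end < len(text):
            # Prefer the last space or newline inside the window; otherwise hard-cut.
            cut = max(text.rfind(" ", start, end), text.rfind("\n", start, end))
            if cut > start:
                end = cut
        chunk = text[start:end].strip()
        if chunk:  # skip windows that were only whitespace
            chunks.append(chunk)
        start = end
    return chunks


sample = "the quick brown fox jumps"
sample_chunks = chunk_on_whitespace(sample, 10)
print(sample_chunks)  # ['the quick', 'brown', 'fox jumps']
```

Because the cut is greedy, chunk lengths vary, so the window count is no longer `len(content) // window_size`.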

In [22]:
template_map = """Summarize the following extract from a book in one or two sentences. Do not add any additional information or context.

BEGIN EXTRACT

{extract}

END EXTRACT
"""

template_reduce = """You will be provided with a list of summaries of sequential extracts of the same book. Your task is to combine them into a single summary that captures the main themes and ideas of the text. The summary must be very brief, only one or two sentences. Do not add any additional information or context.

BEGIN SUMMARIES

{summaries}

END SUMMARIES
"""


engine = Gemini(debug=True)
# engine = DummyLM()
# engine.__call__ = debug_wrap(engine.__call__)

map_fn = LMFunction(prompter=ArgsPrompter(template_map, ['extract']), engine=engine)

# Flatten each summary to one line, join them with blank lines, and fill the reduce template
def JoinerPrompter(inputs):
    inputs_clean = [v.replace('\n', ' ').strip() for v in inputs]
    inputs_joined = '\n\n'.join(inputs_clean)
    return template_reduce.format(summaries=inputs_joined)

reduce_fn = LMFunction(prompter=JoinerPrompter, engine=engine)

mapreduce = MapReduce(map_fn, reduce_fn)
output = mapreduce(inputs[:3])

==== Prompt ==== (n_llm_calls = 1)
Summarize the following extract from a book in one or two sentences. Do not add any additional information or context.

BEGIN EXTRACT

BOOK ONE: 1805





CHAPTER I

“Well, Prince, so Genoa and Lucca are now just family estates of the
Buonapartes. But I warn you, if you don’t tell me that this means war,
if you still try to defend the infamies and horrors perpetrated by that
Antichrist—I really believe he is Antichrist—I will have nothing
more to do with you and you are no longer my friend, no longer my
‘faithful slave,’ as you call yourself! But how do you do? I see I
have frightened you—sit down and tell me all the news.”

It was in July, 1805, and the speaker was the well-known Anna Pávlovna
Schérer, maid of honor and favorite of the Empress Márya Fëdorovna.
With these words she greeted Prince Vasíli Kurágin, a man of high
rank and importance, who was the first to arrive at her reception. Anna
Pávlovna had had a cough for some days. She was, as she

In [30]:
print(output)

In 1805 St. Petersburg, Anna Pávlovna hosts a politically charged reception where anti-Napoleon sentiments and social maneuvering intertwine with ambitions for personal advancement.



In [31]:
printw(output)

In 1805 St. Petersburg, Anna Pávlovna hosts a politically charged reception where anti-Napoleon sentiments and social 
maneuvering intertwine with ambitions for personal advancement.
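
For reference, the map-reduce pattern used above can be sketched in plain Python. This is not the `llm_programs` implementation: `map_reduce`, `fake_summarize` and `fake_combine` are hypothetical stand-ins for the model calls, and the sketch assumes a single reduce call over all mapped outputs, which the library may or may not do.

```python
from typing import Callable, Sequence


def map_reduce(
    map_fn: Callable[[str], str],
    reduce_fn: Callable[[Sequence[str]], str],
    inputs: Sequence[str],
) -> str:
    """Apply map_fn to every input, then combine all results with one reduce_fn call."""
    mapped = [map_fn(x) for x in inputs]
    return reduce_fn(mapped)


def fake_summarize(extract: str) -> str:
    # Stand-in for the LLM "map" call: keep only the first sentence.
    return extract.split(".")[0].strip() + "."


def fake_combine(summaries: Sequence[str]) -> str:
    # Stand-in for the LLM "reduce" call: concatenate the partial summaries.
    return " ".join(summaries)


extracts = [
    "Anna hosts a reception. Guests arrive.",
    "Prince Vasili arrives first. He bows.",
]
result = map_reduce(fake_summarize, fake_combine, extracts)
print(result)  # Anna hosts a reception. Prince Vasili arrives first.
```

Swapping the two stand-ins for `map_fn` and `reduce_fn` gives the same shape as the cell above: one prompt per extract, then one prompt over the collected summaries.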



## Parallel MapReduce

In [33]:
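class JoinerPrompter:
    """Join a list of inputs into one string and fill it into a prompt template."""

    def __init__(self, template, sep='\n\n', rm_newlines=True):
        self.template = template  # must contain an {inputs} placeholder
        self.sep = sep
        self.rm_newlines = rm_newlines

    def __call__(self, **ctx):
        inputs = ctx['inputs']
        if self.rm_newlines:
            # Flatten each item to one line so `sep` is the only delimiter
            inputs = [v.replace('\n', ' ').strip() for v in inputs]
        return self.template.format(inputs=self.sep.join(inputs))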



In [48]:
# fake templates for testing
template_map = 'summarize_extract("{input}")'
template_reduce = 'summarize_summaries({inputs})'

text = 'abcdefghijklmnopqrstuvwxyz' * 2

inputs = [text[i:i+10] for i in range(0, len(text), 10)]

engine = DummyLM(latency=.3)
map_fn = TemplatedFunction(template_map, engine=engine)
reduce_fn = LMFunction(prompter=JoinerPrompter(template_reduce, sep=',', rm_newlines=False), engine=engine)
mapreduce = MapReduce(map_fn, reduce_fn, parallel=True, max_workers=1)
# mapreduce(inputs)

In [49]:
%%timeit
mapreduce = MapReduce(map_fn, reduce_fn, parallel=True, max_workers=1)
mapreduce(inputs)
None

3.3 s ± 327 μs per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [50]:
%%timeit
mapreduce = MapReduce(map_fn, reduce_fn, parallel=True, max_workers=None)
mapreduce(inputs)
None

1.21 s ± 301 μs per loop (mean ± std. dev. of 7 runs, 1 loop each)
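
The two timings above differ by roughly 2.7×. A minimal sketch of why, using only the standard library: when each call is dominated by waiting (here `time.sleep`, standing in for `DummyLM(latency=...)`), a thread pool lets the map calls overlap. Whether `MapReduce(parallel=True)` uses threads internally is an assumption, and `slow_map`, `run_map` and `timed` are hypothetical helpers.

```python
import time
from concurrent.futures import ThreadPoolExecutor

LATENCY = 0.05  # seconds; assumed stand-in for per-call model latency


def slow_map(chunk: str) -> str:
    time.sleep(LATENCY)  # simulate waiting on a remote model
    return chunk.upper()


def run_map(chunks, max_workers):
    # pool.map preserves input order regardless of completion order
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(slow_map, chunks))


def timed(fn, *args):
    start = time.perf_counter()
    out = fn(*args)
    return out, time.perf_counter() - start


chunks = ["abcdefghij", "klmnopqrst", "uvwxyzabcd", "efghijklmn", "opqrstuvwx", "yz"]
serial_out, serial_s = timed(run_map, chunks, 1)
parallel_out, parallel_s = timed(run_map, chunks, len(chunks))
print(f"1 worker: {serial_s:.2f}s, {len(chunks)} workers: {parallel_s:.2f}s")
```

With one worker the six calls run back to back; with six workers they overlap, so the map step takes about one latency period. The reduce step depends on the mapped outputs and cannot overlap with them, which is one reason the end-to-end speedup is smaller than the number of workers.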
