# Chapter 3

## Summarizing a document bigger than the LLM’s context window

In [1]:
with open("./Moby-Dick.txt", 'r', encoding='utf-8') as f:
    moby_dick_book = f.read()

In [2]:
from langchain_openai import ChatOpenAI
from langchain_text_splitters import TokenTextSplitter
from langchain_core.prompts import PromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnableLambda, RunnableParallel
import getpass

In [3]:
OPENAI_API_KEY = getpass.getpass('Enter your OPENAI_API_KEY')

Enter your OPENAI_API_KEY ········


In [4]:
llm = ChatOpenAI(openai_api_key=OPENAI_API_KEY, model_name="gpt-5-nano")

In [5]:
# Split: break the text into chunks of at most 3000 tokens, with 100 tokens of overlap
text_chunks_chain = (
    RunnableLambda(lambda x:
        [
            {
                'chunk': text_chunk,
            }
            for text_chunk in
                TokenTextSplitter(chunk_size=3000, chunk_overlap=100).split_text(x)
        ]
    )
)
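
To make the roles of `chunk_size` and `chunk_overlap` concrete, here is a minimal sketch of a sliding-window split in plain Python. It is not the `TokenTextSplitter` implementation: the helper `split_with_overlap` is hypothetical, and whitespace-separated words stand in for real model tokens, which is an assumption made only to keep the example self-contained.

```python
def split_with_overlap(tokens, chunk_size, chunk_overlap):
    """Split a list of tokens into windows of at most chunk_size tokens.

    Each window starts chunk_size - chunk_overlap tokens after the previous
    one, so consecutive windows share chunk_overlap tokens.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(tokens):
        chunks.append(tokens[start:start + chunk_size])
        # Stop once the current window has reached the end of the input
        if start + chunk_size >= len(tokens):
            break
        start += step
    return chunks


# Assumption: words play the role of tokens in this toy example
words = "one two three four five six seven eight nine ten".split()
chunks = split_with_overlap(words, chunk_size=4, chunk_overlap=1)
for chunk in chunks:
    print(chunk)
```

The overlap means the last token of one chunk reappears at the start of the next, which helps a summary of one chunk keep some context from its neighbor.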

In [6]:
# Map: summarize each chunk independently
summarize_chunk_prompt_template = """
Write a concise summary of the following text, and include the main details.
Text: {chunk}
"""

summarize_chunk_prompt = PromptTemplate.from_template(summarize_chunk_prompt_template)
summarize_chunk_chain = summarize_chunk_prompt | llm

summarize_map_chain = (
    RunnableParallel(
        {
            'summary': summarize_chunk_chain | StrOutputParser()
        }
    )
)

In [7]:
# Reduce: join the chunk summaries and summarize them once more
summarize_summaries_prompt_template = """
Write a concise summary of the following text, which joins several summaries, and include the main details.
Text: {summaries}
"""

summarize_summaries_prompt = PromptTemplate.from_template(summarize_summaries_prompt_template)
summarize_reduce_chain = (
    RunnableLambda(lambda x: 
        {
            'summaries': '\n'.join([i['summary'] for i in x]), 
        })
    | summarize_summaries_prompt 
    | llm 
    | StrOutputParser()
)

In [8]:
map_reduce_chain = (
    text_chunks_chain
    | summarize_map_chain.map()
    | summarize_reduce_chain
)
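
The data flow of the chain above can be traced without calling an LLM. The sketch below is only an illustration: `fake_summarize` is a hypothetical stand-in for the model (it keeps the first three words of its input), and `map_reduce_summary` is a hypothetical helper that mirrors the split, map, and reduce steps with plain Python structures.

```python
def fake_summarize(text):
    # Stand-in for the LLM call (assumption): keep only the first three words
    return " ".join(text.split()[:3])


def map_reduce_summary(chunks, summarize):
    # Map: summarize each chunk on its own; the steps do not depend on each other
    partial = [{"summary": summarize(item["chunk"])} for item in chunks]
    # Reduce: join the partial summaries and summarize the joined text once
    joined = "\n".join(item["summary"] for item in partial)
    return summarize(joined)


demo_chunks = [
    {"chunk": "Call me Ishmael some years ago"},
    {"chunk": "It was a humorously perilous business"},
]
demo_result = map_reduce_summary(demo_chunks, fake_summarize)
print(demo_result)
```

Because each map step only sees its own chunk, the map stage can run in parallel; only the reduce step needs all the partial summaries at once.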

In [9]:
summary = map_reduce_chain.invoke(moby_dick_book)

In [10]:
print(summary)

- Bibliographic basics: Moby-Dick; or The Whale by Herman Melville. Project Gutenberg edition (eBook #2701), English, UTF-8.

- Core aim and mood: Ishmael goes to sea to cure his gloom and rest his mind; the sea is framed as a mysterious, life-siphoning force, and his voyage is presented as part of a fated plan.

- Initial setting: He travels to New Bedford seeking cheap, honest seafaring work; the Spouter Inn in a dim, smoky room introduces a grand, unsettling painting and a wall of weapons, setting a ominous, mythic atmosphere.

- Key characters and fate hints: An imposing harpooneer (Queequeg) is anticipated as a future shipmate; Bulkington, a formidable Southerner, arrives and becomes Ishmael’s potential comrade-in-waiting; the landlord’s odd tales foreshadow dangers and curiosities of whaling life.

- Bed-sharing arrangement and odd lore: The inn warns that Ishmael will share a bed with the harpooneer; the landlord recounts bizarre, ritualistic details about the harpooneer’s recen

## Summarizing across documents

In [11]:
from langchain_community.document_loaders import WikipediaLoader

wikipedia_loader = WikipediaLoader(query="Paestum", load_max_docs=2)
wikipedia_docs = wikipedia_loader.load()

In [12]:
from langchain_community.document_loaders import Docx2txtLoader
from langchain_community.document_loaders import PyPDFLoader
from langchain_community.document_loaders import TextLoader

word_loader = Docx2txtLoader("Paestum/Paestum-Britannica.docx")
word_docs = word_loader.load()

pdf_loader = PyPDFLoader("Paestum/PaestumRevisited.pdf")
pdf_docs = pdf_loader.load()

txt_loader = TextLoader("Paestum/Paestum-Encyclopedia.txt")
txt_docs = txt_loader.load()

In [13]:
all_docs = wikipedia_docs + word_docs + pdf_docs + txt_docs

In [14]:
from langchain_openai import ChatOpenAI
from langchain_core.prompts import PromptTemplate
from langchain_core.output_parsers import StrOutputParser
import getpass

In [15]:
OPENAI_API_KEY = getpass.getpass('Enter your OPENAI_API_KEY')

Enter your OPENAI_API_KEY ········


In [16]:
llm = ChatOpenAI(openai_api_key=OPENAI_API_KEY, model_name="gpt-5-nano")

In [17]:
doc_summary_template = """Write a concise summary of the following text:
{text}
DOC SUMMARY:"""
doc_summary_prompt = PromptTemplate.from_template(doc_summary_template)

doc_summary_chain = doc_summary_prompt | llm

In [18]:
refine_summary_template = """
You must produce a final summary from the current refined summary
which has been generated so far and from the content of an additional document.
This is the current refined summary generated so far: {current_refined_summary}
This is the content of the additional document: {text}
Only use the content of the additional document if it is useful,
otherwise return the current refined summary as it is."""

refine_summary_prompt = PromptTemplate.from_template(refine_summary_template)

refine_chain = refine_summary_prompt | llm | StrOutputParser()

In [19]:
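def refine_summary(docs):
    """Build a summary incrementally, refining it with one document at a time."""
    intermediate_steps = []
    current_refined_summary = ''  # empty before the first document is read
    for doc in docs:
        # Input for this step: the summary so far plus the next document's text
        intermediate_step = {
            "current_refined_summary": current_refined_summary,
            "text": doc.page_content,
        }
        intermediate_steps.append(intermediate_step)

        # Ask the LLM to fold the new document into the running summary
        current_refined_summary = refine_chain.invoke(intermediate_step)

    return {"final_summary": current_refined_summary,
            "intermediate_steps": intermediate_steps}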
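
To see what ends up in `intermediate_steps`, the same loop can be run with a stub in place of the LLM. This is a sketch under stated assumptions: `fake_refine` is a hypothetical stand-in for `refine_chain.invoke` that appends the first word of each new text to the running summary, and `refine_with` is a hypothetical variant of the function above that takes plain strings instead of document objects.

```python
def fake_refine(step):
    # Stand-in for refine_chain.invoke (assumption): append the first word
    # of the new text to the running summary
    first_word = step["text"].split()[0]
    return (step["current_refined_summary"] + " " + first_word).strip()


def refine_with(texts, refine):
    current = ""
    steps = []
    for text in texts:
        # Each step records the summary as it stood before this text was read
        step = {"current_refined_summary": current, "text": text}
        steps.append(step)
        current = refine(step)
    return {"final_summary": current, "intermediate_steps": steps}


demo = refine_with(
    ["Greek colony founded", "Roman town later", "Temples survive today"],
    fake_refine,
)
print(demo["final_summary"])
for step in demo["intermediate_steps"]:
    print(repr(step["current_refined_summary"]), "+", step["text"])
```

Unlike the map-reduce approach, each step here depends on the result of the previous one, so the documents are processed strictly in sequence and their order can influence the final summary.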

In [20]:
full_summary = refine_summary(all_docs)
print(full_summary)

{'final_summary': 'Integrated final summary (synthesizing the current refined summary with the additional document)\n\nPoseidonia/Paestum traces a long arc from a flourishing Greek urban center to a Lucanian-influenced city and finally to a Roman colonial town, with continuity in sacred spaces and intercultural exchange tempered by strategic political remodeling under Rome. The site also preserves remarkable early Greek architecture, most famously the Doric temples whose surviving ruins illuminate Paestum’s initial monumental character.\n\n1) Early core and material culture (7th–5th centuries BCE)\n- Urban and sacred framework: A fortified Greek polis with a four-gated wall, an agora north of the Hera sanctuary, the Temple of Athena, and the two Hera temples (I and II). The surviving group of Doric temples (c. 530–460 BCE) represents Paestum’s early monumental architecture. The Paestum Order—characterized by exaggerated entasis and broad, squat capitals—became a model that later Neo-Cl