# Chapter 4

## Summarizing a document bigger than the LLM’s context window

In [1]:
with open("./Moby-Dick.txt", 'r', encoding='utf-8') as f:
    mobi_dick_book = f.read()

In [2]:
from langchain_text_splitters import TokenTextSplitter
from langchain_core.documents import Document

text_splitter = TokenTextSplitter(chunk_size=3000, chunk_overlap=100)
text_chunks = text_splitter.split_text(mobi_dick_book)
chunk_docs = [Document(page_content=text_chunk, metadata={}) for text_chunk in text_chunks]
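
To make the effect of `chunk_size` and `chunk_overlap` concrete, here is a minimal sketch of overlapping windows over a token list. It assumes a naive whitespace tokenizer rather than the tokenizer `TokenTextSplitter` actually uses, and `split_with_overlap` is a hypothetical helper, not a LangChain function.

```python
def split_with_overlap(tokens, chunk_size, chunk_overlap):
    """Slide a window of chunk_size tokens, stepping by chunk_size - chunk_overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        # Stop once the current window has reached the end of the token list
        if start + chunk_size >= len(tokens):
            break
    return chunks


# Assumption: whitespace splitting stands in for real tokenization
sample_tokens = "Call me Ishmael . Some years ago never mind how long".split()
sample_chunks = split_with_overlap(sample_tokens, chunk_size=5, chunk_overlap=2)
for chunk in sample_chunks:
    print(" ".join(chunk))
```

Each chunk repeats the last two tokens of the previous one, which is the role `chunk_overlap=100` plays in the cell above: context that straddles a chunk boundary is seen by both neighbouring chunks.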

In [3]:
from langchain_openai import ChatOpenAI
from langchain.chains import (
    LLMChain,
    StuffDocumentsChain,
    ReduceDocumentsChain,
    MapReduceDocumentsChain,
)
from langchain_core.prompts import PromptTemplate
import getpass

In [5]:
OPENAI_API_KEY = getpass.getpass('Enter your OPENAI_API_KEY')

Enter your OPENAI_API_KEY ········


In [6]:
model = ChatOpenAI(openai_api_key=OPENAI_API_KEY, model_name="gpt-3.5-turbo")

In [7]:
# Map
map_prompt_template = """
Write a concise summary of the following text, and include the main details.
Text: {chunk}
"""

map_prompt = PromptTemplate(template=map_prompt_template, input_variables=["chunk"])
map_chain = LLMChain(llm=model, prompt=map_prompt)

In [8]:
# Reduce 
reduce_prompt_template = """
Write a concise summary of the following summaries, and include the main details.
Text: {summaries}
"""

reduce_prompt = PromptTemplate(
    template=reduce_prompt_template, input_variables=["summaries"]
)

reduce_chain = LLMChain(llm=model, prompt=reduce_prompt)


In [9]:
combine_documents_chain = StuffDocumentsChain(
    llm_chain=reduce_chain, document_variable_name="summaries"
)

In [10]:
reduce_documents_chain = ReduceDocumentsChain(
    combine_documents_chain=combine_documents_chain,
    collapse_documents_chain=combine_documents_chain,
    # LLM token limit you do not want to exceed
    token_max=4000,
)

In [11]:
# Map-reduce
map_reduce_chain = MapReduceDocumentsChain(
    # Map chain
    llm_chain=map_chain,
    # Reduce chain
    reduce_documents_chain=reduce_documents_chain, 
    # The variable name used in the map chain
    document_variable_name="chunk",
    # Return output of map steps 
    return_intermediate_steps=False,
)
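
The flow configured above can be sketched without an LLM: summarize each chunk independently (map), merge groups of summaries while they exceed a budget (collapse), then produce one final summary (reduce). The sketch below is only an illustration of that pattern. It assumes a stand-in `fake_summarize` that keeps the first few words, counts words rather than tokens, and merges summaries pairwise; none of these helper names or choices come from LangChain.

```python
def fake_summarize(text, max_words=4):
    """Stand-in for an LLM call: keep only the first max_words words."""
    return " ".join(text.split()[:max_words])


def map_reduce_summarize(chunks, budget_words=8):
    # Map: summarize every chunk on its own
    summaries = [fake_summarize(chunk) for chunk in chunks]
    # Collapse: while the summaries are too long to combine in one call,
    # merge them pairwise and summarize each pair again
    while len(summaries) > 1 and sum(len(s.split()) for s in summaries) > budget_words:
        summaries = [
            fake_summarize(" ".join(summaries[i:i + 2]))
            for i in range(0, len(summaries), 2)
        ]
    # Reduce: one final pass over whatever summaries remain
    return fake_summarize(" ".join(summaries), max_words=budget_words)


# Made-up sample chunks, used only to exercise the control flow
demo_chunks = [
    "alpha beta gamma delta epsilon",
    "one two three four five six",
    "red green blue",
    "north south east west up down",
]
demo_summary = map_reduce_summarize(demo_chunks)
print(demo_summary)
```

In the real chain, `token_max` plays the role of `budget_words`: it bounds how much summary text is sent to the model in a single combine call.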

In [12]:
map_reduce_response = map_reduce_chain.invoke(chunk_docs)

In [13]:
print(map_reduce_response['output_text'])

The text is an eBook of the novel "Moby-Dick; or The Whale" by Herman Melville, available for free on Project Gutenberg. The narrator, Ishmael, describes his decision to embark on a whaling voyage, his arrival in New Bedford, and his stay at "The Spouter Inn" where he encounters a mysterious harpooneer named Queequeg. Initially wary of Queequeg's strange behavior, the narrator eventually forms a connection with him, leading to a realization about human understanding and connection. The text explores themes of adventure, mystery, and the contrast between civilized and savage behaviors.


## Summarizing across documents

In [14]:
from langchain_community.document_loaders import WikipediaLoader

wikipedia_loader = WikipediaLoader(query="Paestum", load_max_docs=2)
wikipedia_docs = wikipedia_loader.load()

In [15]:
from langchain_community.document_loaders import Docx2txtLoader
from langchain_community.document_loaders import PyPDFLoader
from langchain_community.document_loaders import TextLoader

word_loader = Docx2txtLoader("Paestum/Paestum-Britannica.docx")
word_docs = word_loader.load()

pdf_loader = PyPDFLoader("Paestum/PaestumRevisited.pdf")
pdf_docs = pdf_loader.load()

txt_loader = TextLoader("Paestum/Paestum-Encyclopedia.txt")
txt_docs = txt_loader.load()

In [16]:
all_docs = wikipedia_docs + word_docs + pdf_docs + txt_docs

In [17]:
from langchain_openai import ChatOpenAI
from langchain.chains import (
    LLMChain,
    load_summarize_chain
)
from langchain_core.prompts import PromptTemplate
import getpass

In [58]:
OPENAI_API_KEY = getpass.getpass('Enter your OPENAI_API_KEY')

Enter your OPENAI_API_KEY ········


In [18]:
model = ChatOpenAI(openai_api_key=OPENAI_API_KEY, model_name="gpt-3.5-turbo")

In [19]:
doc_summary_template = """Write a concise summary of the following text:
{text}
DOC SUMMARY:"""
doc_summary_prompt = PromptTemplate.from_template(doc_summary_template)

refine_summary_template = """
You must produce a final summary from a temporary final summary
that has been generated so far and from the content of an additional document.
This is the temporary final summary generated so far: {existing_answer}
This is the content of the additional document: {text}
Only use the content of the additional document if it is useful, 
otherwise return the temporary final summary as it is."""

refine_summary_prompt = PromptTemplate.from_template(refine_summary_template)

In [20]:
summary_refine_chain = load_summarize_chain(
    llm=model,
    chain_type="refine",
    question_prompt=doc_summary_prompt,
    refine_prompt=refine_summary_prompt,
    return_intermediate_steps=True,
    input_key="input_documents",
    output_key="final_summary",
)
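
The refine strategy can be pictured as a simple fold: summarize the first document, then update the running summary once per additional document. The sketch below illustrates that loop under stated assumptions. `first_sentence` and `append_if_new` are hypothetical stand-ins for the two LLM prompts, the sample texts are made up, and the result keys merely mirror the `intermediate_steps` and `final_summary` names used above.

```python
def refine_summarize(docs, initial_step, refine_step):
    """Summarize the first doc, then fold each later doc into the running summary."""
    if not docs:
        raise ValueError("at least one document is required")
    summary = initial_step(docs[0])
    intermediate_steps = [summary]
    for doc in docs[1:]:
        summary = refine_step(summary, doc)
        intermediate_steps.append(summary)
    return {"intermediate_steps": intermediate_steps, "final_summary": summary}


def first_sentence(text):
    # Stand-in for the initial summary prompt: keep the first sentence only
    return text.split(".")[0].strip() + "."


def append_if_new(existing_answer, text):
    # Stand-in for the refine prompt: add the new document's first sentence
    # only when it is not already part of the running summary
    candidate = first_sentence(text)
    if candidate in existing_answer:
        return existing_answer
    return f"{existing_answer} {candidate}"


# Made-up sample documents
demo_docs = [
    "Paestum was an ancient Greek city. It was founded as Poseidonia.",
    "Paestum was an ancient Greek city. It lies in Campania.",
    "The ruins include three Doric temples. They are well preserved.",
]
demo_result = refine_summarize(demo_docs, first_sentence, append_if_new)
print(demo_result["final_summary"])
```

Because each step depends on the previous summary, the documents are processed sequentially; unlike the map step of map-reduce, the calls cannot run in parallel.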

In [21]:
summary_result = summary_refine_chain.invoke({"input_documents": all_docs}, return_only_outputs=True)

In [22]:
print(summary_result)

{'intermediate_steps': ['Paestum was a major ancient Greek city in Magna Graecia known for its well-preserved Greek temples, city walls, and amphitheatre. Established around 600 BC as Poseidonia, it was later conquered by the Lucanians and Romans. The city declined due to trade route shifts and natural disasters. Today, the ruins of Paestum are located in modern-day Italy and are a popular tourist destination.', "Capaccio Paestum, formerly known as Capaccio, is a town and comune in the province of Salerno in the Campania region of south-western Italy. The ruins of the ancient Greek city of Paestum are located within the borders of the comune. The municipality of Capaccio is a hill town surrounded by a plain where most of the population resides. The nearest airport is Salerno-Pontecagnano, located 35 km away. Notable people from Capaccio include Vincenzo Romano and Michele Siano. The ruins of Paestum are a major tourist destination in modern-day Italy, known for its well-preserved Greek

## Summarizing structured data

In [23]:
watches = [
  {"brand": "Rolex", "model": "Submariner", "dial-size": 41, "dial-color": "black", "material": "steel", "status": "available"},
  {"brand": "Rolex", "model": "Daytona", "dial-size": 40, "dial-color": "black", "material": "steel", "status": "available"},
  {"brand": "Rolex", "model": "Daytona", "dial-size": 40, "dial-color": "white", "material": "gold", "status": "sold_yesterday"},
  {"brand": "Omega", "model": "Speedmaster Moonwatch", "dial-size": 42, "dial-color": "black", "material": "steel", "status": "available"},
  {"brand": "Omega", "model": "Seamaster", "dial-size": 43, "dial-color": "blue", "material": "steel", "status": "sold_yesterday"},
]

In [24]:
from langchain_core.documents import Document

row_docs = [Document(page_content=f"We have {row['status']} a {row['material']} {row['brand']} {row['model']} with a {row['dial-color']} {row['dial-size']}mm dial.", metadata={}) 
            for row in watches]
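
The sentence template above drops raw status codes such as `sold_yesterday` straight into the text. One possible variation, sketched here as an assumption rather than as part of the original notebook, maps each status code to a readable phrase before building the sentence. `STATUS_PHRASES` and `describe_watch` are hypothetical names introduced for illustration.

```python
# Hypothetical mapping from status codes to readable phrases
STATUS_PHRASES = {
    "available": "is available",
    "sold_yesterday": "was sold yesterday",
}


def describe_watch(row):
    """Render one watch record as a plain-English sentence."""
    # Fall back to the raw code with underscores replaced for unknown statuses
    status = STATUS_PHRASES.get(row["status"], row["status"].replace("_", " "))
    return (
        f"A {row['material']} {row['brand']} {row['model']} with a "
        f"{row['dial-color']} {row['dial-size']}mm dial {status}."
    )


sample_row = {
    "brand": "Omega", "model": "Seamaster", "dial-size": 43,
    "dial-color": "blue", "material": "steel", "status": "sold_yesterday",
}
sample_sentence = describe_watch(sample_row)
print(sample_sentence)
```

Sentences produced this way could be wrapped in `Document` objects exactly as in the cell above; whether the cleaner wording changes the model's summary is not something this sketch tests.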

In [25]:
from langchain_openai import ChatOpenAI
from langchain.chains import (
    load_summarize_chain
)
import getpass

In [26]:
OPENAI_API_KEY = getpass.getpass('Enter your OPENAI_API_KEY')

Enter your OPENAI_API_KEY ········


In [27]:
model = ChatOpenAI(openai_api_key=OPENAI_API_KEY, model_name="gpt-3.5-turbo")

In [29]:
structured_data_summary_chain = load_summarize_chain(model, chain_type="stuff")

In [32]:
summary_result = structured_data_summary_chain.invoke(row_docs)
print(summary_result['output_text'])

Available: Steel Rolex Submariner (41mm black dial), Steel Rolex Daytona (40mm black dial), Steel Omega Speedmaster Moonwatch (42mm black dial)

Sold: Gold Rolex Daytona (40mm white dial), Steel Omega Seamaster (43mm blue dial)
