# Question answering over long documents by summarizing them into a table of contents using GPT/Langchain

TLDR: This notebook shows how to use GPT to answer questions about a large document that does not have an index. It takes 3 steps: (1) create a table of contents by summarizing chunks of text; (2) for a given question, ask GPT which summaries in the table of contents are likely to contain the answer; (3) retrieve the full text that was used to generate those summaries and include it in the prompt, so that the answer can draw on the original content (context).

## Motivation

With the GPT API, we can get natural-language answers based on the data crawled by OpenAI. To get answers about private documents, however, we need to either fine-tune the GPT model on those documents or include them in the context of the question. If the document is too large for the context window (maximum 4096 tokens, or roughly 3000 words including the answer), we first need to find the sections of the document that may contain the answer. There are several ways to do this:

### Embeddings

(1) Split the document into chunks.
(2) Embed these text chunks using GPT.
(3) Embed the question and use as context the chunk(s) whose embedding has the highest cosine similarity (that is, the smallest cosine distance) to the question embedding.
See https://platform.openai.com/docs/tutorials/web-qa-embeddings
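
The ranking step of the embedding approach can be sketched in plain Python. This is only an illustration: the 3-dimensional vectors below are made up, and the helper names `cosine_similarity` and `rank_chunks` are hypothetical. In practice the vectors would come from an embedding model.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def rank_chunks(question_vec, chunk_vecs, top_k=2):
    """Return the indices of the top_k chunks most similar to the question."""
    scored = [(cosine_similarity(question_vec, v), i) for i, v in enumerate(chunk_vecs)]
    scored.sort(key=lambda t: t[0], reverse=True)
    return [i for _, i in scored[:top_k]]

# Toy "embeddings", invented for illustration only.
toy_chunk_vecs = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.7, 0.7, 0.0],
]
toy_question_vec = [0.9, 0.1, 0.0]

best = rank_chunks(toy_question_vec, toy_chunk_vecs, top_k=2)
print(best)  # indices of the chunks to use as context
```

The chunks are ranked by similarity, so the best context is the one with the largest score, not the largest distance.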

### Using the table of contents/index (this notebook)

(1) If the data is already organized with a table of contents, go to (2). If not, we can split the full document into chunks and summarize each chunk using an LLM to create a table of contents. This can be done with the chunking and summarization utilities from Langchain.

(2) Once the document is structured into sections (a table of contents), we can ask the LLM to answer the user's question given only the table of contents, and to report the "source" (which section contains the answer). This can be done using QA with sources from Langchain: https://langchain.readthedocs.io/en/latest/modules/indexes/chain_examples/qa_with_sources.html

(3) Then we can refine the answer: we include the full text of the selected section(s) in a second prompt, so that the LLM can answer the user's question from the original content.
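
The three steps can be sketched end to end as follows. This is a minimal sketch, not the notebook's implementation: `answer_with_toc` and the `fake_*` functions are hypothetical stand-ins for the LLM calls, so the example runs without an API key.

```python
import re

def answer_with_toc(question, chunks, summarize, pick_sources, answer):
    """Three-step QA: summarize chunks, pick relevant ones, answer from full text.

    summarize, pick_sources and answer are hypothetical callables standing in
    for LLM calls; they are not Langchain or OpenAI APIs.
    """
    # Step 1: one short summary per chunk forms the table of contents.
    toc = [summarize(chunk) for chunk in chunks]
    # Step 2: ask which table-of-contents entries are relevant to the question.
    picked = pick_sources(question, toc)
    # Step 3: answer using the full text of the selected chunks.
    context = "\n\n".join(chunks[i] for i in picked)
    return answer(question, context), picked

# Trivial stand-ins for the LLM (assumptions for illustration only).
def fake_summarize(chunk):
    return chunk.split(".")[0]  # first sentence as the "summary"

def fake_pick_sources(question, toc):
    words = set(re.findall(r"\w+", question.lower()))
    return [i for i, s in enumerate(toc) if words & set(re.findall(r"\w+", s.lower()))]

def fake_answer(question, context):
    return context  # a real LLM would write an answer from this context

toy_chunks = [
    "Gas prices will fall. Oil reserves are being released.",
    "Intel builds a semiconductor site. It will create 10,000 jobs.",
]
result, used = answer_with_toc(
    "What about the semiconductor industry?",
    toy_chunks, fake_summarize, fake_pick_sources, fake_answer,
)
print(used, result)
```

Note that the final answer can mention details (here, the number of jobs) that appear only in the full chunk and not in its summary, which is the point of step (3).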

### More

For the case where we already have a table of contents and we do not want to use Langchain, see: https://github.com/rmuller-ml/gpt_table_of_contents_doc_qa/blob/main/Search_GPT_API_documentation_with_GPT.ipynb

In [2]:
# Using the state of the union as an example
# Splitting the full text into chunks with small overlap

from langchain.text_splitter import CharacterTextSplitter

with open('../../state_of_the_union.txt') as f:
    state_of_the_union = f.read()

text_splitter = CharacterTextSplitter(
    separator="\n\n",
    chunk_size=1000,
    chunk_overlap=200,
    length_function=len,
)
docs = text_splitter.create_documents([state_of_the_union])
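
To build intuition for what `chunk_size` and `chunk_overlap` mean, here is a simplified pure-Python sketch of splitting on a separator and merging pieces with overlap. It is an approximation written for this explanation, not Langchain's actual implementation, and the function name `split_with_overlap` is hypothetical.

```python
def split_with_overlap(text, separator="\n\n", chunk_size=1000, chunk_overlap=200):
    """Greedily merge separator-delimited pieces into chunks of at most
    chunk_size characters, carrying trailing pieces (up to chunk_overlap
    characters) into the next chunk. A single piece longer than chunk_size
    is kept as one oversized chunk."""
    pieces = [p for p in text.split(separator) if p.strip()]
    chunks, current = [], []
    for piece in pieces:
        candidate = separator.join(current + [piece])
        if current and len(candidate) > chunk_size:
            chunks.append(separator.join(current))
            # Keep trailing pieces whose joined length fits in the overlap budget.
            tail = []
            for prev in reversed(current):
                if len(separator.join([prev] + tail)) > chunk_overlap:
                    break
                tail.insert(0, prev)
            current = tail
        current.append(piece)
    if current:
        chunks.append(separator.join(current))
    return chunks

# Six made-up paragraphs of 42 characters each.
toy_text = "\n\n".join(f"Paragraph {i} " + "x" * 30 for i in range(6))
overlap_chunks = split_with_overlap(toy_text, chunk_size=100, chunk_overlap=50)
for c in overlap_chunks:
    print(len(c), repr(c[:12]))
```

With these toy settings, each chunk holds two paragraphs and repeats the last paragraph of the previous chunk, so a sentence near a chunk boundary still appears with some surrounding context.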

In [5]:
# Number of chunks and example of a chunk

print(f"Number of documents: {len(docs)}")
print("Example [0]")
print(docs[0])

Number of documents: 49
Example [0]
page_content='Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and the Cabinet. Justices of the Supreme Court. My fellow Americans.  \n\nLast year COVID-19 kept us apart. This year we are finally together again. \n\nTonight, we meet as Democrats Republicans and Independents. But most importantly as Americans. \n\nWith a duty to one another to the American people to the Constitution. \n\nAnd with an unwavering resolve that freedom will always triumph over tyranny. \n\nSix days ago, Russia’s Vladimir Putin sought to shake the foundations of the free world thinking he could make it bend to his menacing ways. But he badly miscalculated. \n\nHe thought he could roll into Ukraine and the world would roll over. Instead he met a wall of strength he never imagined. \n\nHe met the Ukrainian people. \n\nFrom President Zelenskyy to every Ukrainian, their fearlessness, their courage, their determination, inspires the w

In [28]:
# Summarizing each chunk using an LLM and limiting each summary to around 10 words

from langchain.prompts import PromptTemplate
from langchain.llms import OpenAI

template = """Write a concise summary of the following:


"{page_content}"


CONCISE SUMMARY WITH MAXIMUM 10 WORDS:

"""

llm = OpenAI(temperature=0.0)
prompt = PromptTemplate(
    input_variables=["page_content"],
    template=template,
)

from langchain.chains import LLMChain
chain = LLMChain(llm=llm, prompt=prompt)

# Run the chain on each chunk, passing only the input variable.
summaries = [chain.run(doc.page_content) for doc in docs]



In [135]:
# The list of summaries can be seen as a table of contents
# The summaries are about 10x shorter than the original text (in characters)

import re
print(f"Total number of characters in the summaries: {sum([len(s) for s in summaries])}")
print(f"Total number of characters in the docs: {sum([len(d.page_content) for d in docs])}")
print("Summaries:")
for i, s in enumerate(summaries):
    s_no_newline = re.sub("\n","",s)
    print(f"{i}:  {s_no_newline}")



Total number of characters in the summaries: 4483
Total number of characters in the docs: 45425
Summaries:
0:  Americans unite to stand against tyranny.
1:  Ukrainians courageously fight against aggression; US stands with them.
2:  NATO formed to secure peace in Europe; US and 29 other nations; US diplomacy, resolve countered Putin's attack.
3:  World holds Russia accountable, enforces sanctions, cutting off access to technology.
4:  U.S. and European allies isolate Russia, seize assets, close airspace.
5:  Russia's economy is suffering, US providing $1B aid to Ukraine, NATO allies defended against Putin.
6:  U.S. defends NATO countries against Russian aggression.
7:  President announces oil release to reduce gas prices.
8:  Democracies are uniting against autocracy, standing with Ukraine.
9:  Putin will never gain Ukrainian hearts; US faced hard years; American Rescue Plan passed to help.
10:  American Rescue Plan passed to help people, created 6.5 million jobs.
11:  Economy created 6

In [38]:
# Adding a unique index for each chunk
from langchain.docstore.document import Document
summary_docs = [Document(page_content=s, metadata={"source": f"{i}-pl"}) for i,s in enumerate(summaries)]

In [137]:
# Use Langchain's QA_with_sources chain to answer the question and tell which summaries contain the answer
from langchain.chains.qa_with_sources import load_qa_with_sources_chain

chain = load_qa_with_sources_chain(OpenAI(temperature=0), chain_type="stuff")
query = "Any news regarding the semiconductor industry?"
intermediate_result = chain({"input_documents": summary_docs, "question": query}, return_only_outputs=True)
print(intermediate_result)

{'output_text': " President Biden announced Intel's $20 billion semiconductor site, creating 10,000 jobs, and Intel to invest $100 billion in US manufacturing.\nSOURCES: 15-pl, 16-pl"}


In [138]:
# Parse the sources and show the chunk contents

from langchain.output_parsers import RegexParser

output_parser = RegexParser(
    regex=r"(.*?)\s*SOURCES: (.*)",
    output_keys=["answer", "source"],
)
sources_str = output_parser.parse(intermediate_result['output_text'])['source']
sources = [int(x.split('-')[0]) for x in sources_str.split(', ')]
for i, s in enumerate(sources):
    print(f"SUMMARY {i}: {summary_docs[s].page_content}")
    print(f"CONTEXT {i}: {docs[s].page_content}")

SUMMARY 0: 

Pass Bipartisan Innovation Act to compete globally, build Intel's $20 billion semiconductor site, create 10,000 jobs.
CONTEXT 0: But to compete for the best jobs of the future, we also need to level the playing field with China and other competitors. 

That’s why it is so important to pass the Bipartisan Innovation Act sitting in Congress that will make record investments in emerging technologies and American manufacturing. 

Let me give you one example of why it’s so important to pass it. 

If you travel 20 miles east of Columbus, Ohio, you’ll find 1,000 empty acres of land. 

It won’t look like much, but if you stop and look closely, you’ll see a “Field of dreams,” the ground on which America’s future will be built. 

This is where Intel, the American company that helped build Silicon Valley, is going to build its $20 billion semiconductor “mega site”. 

Up to eight state-of-the-art factories in one place. 10,000 new good-paying jobs. 

Some of the most sophisticated man
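
The parsing cell above assumes that the sources are separated by exactly `, ` and always carry the `-pl` suffix. A slightly more tolerant parser can be sketched with the standard library alone. The function name `parse_sources` is hypothetical and the example string is made up, modeled on the output format shown earlier.

```python
import re

def parse_sources(output_text, suffix="-pl"):
    """Split an 'answer ... SOURCES: a, b' string into (answer, [int indices])."""
    answer, _, sources_part = output_text.partition("SOURCES:")
    indices = []
    # Accept commas and/or any whitespace between source ids.
    for token in re.split(r"[,\s]+", sources_part.strip()):
        if token.endswith(suffix):
            token = token[: -len(suffix)]
        if token.isdigit():
            indices.append(int(token))
    return answer.strip(), indices

# Made-up example in the same format as the chain output.
example = " Intel plans a $20 billion site.\nSOURCES: 15-pl, 16-pl"
answer, idx = parse_sources(example)
print(idx)
```

If the model returns no `SOURCES:` line at all, this sketch returns an empty list instead of raising, which lets the caller decide how to fall back.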

In [139]:
# Refine the answer by using the original doc chunks (not only summaries) as context

from langchain.chains.question_answering import load_qa_chain

chain = load_qa_chain(OpenAI(temperature=0), chain_type="stuff")
final_result = chain({"input_documents": [docs[s] for s in sources], "question": query}, return_only_outputs=True)
print(final_result['output_text'])

 Intel is planning to build a $20 billion semiconductor "mega site" with up to eight state-of-the-art factories in Ohio, creating 10,000 new jobs. Intel's CEO has also said they are ready to increase their investment from $20 billion to $100 billion.
