Let's summarize text from multiple documents using LangChain, LangGraph, and the Claude 3.5 Sonnet LLM, following the LangChain documentation.

[LangChain Documentation for Summarization](https://python.langchain.com/docs/tutorials/summarization/)

Concepts:

*  Using language models.
*  Using document loaders, specifically the WebBaseLoader to load content from an HTML webpage.


Two ways to summarize or combine multiple documents:
1. Stuff, which simply concatenates documents into a prompt
2. Map-reduce, for larger sets of documents. This splits documents into batches, summarizes those, and then summarizes the summaries.

Use case for stuff: you have a set of documents (PDFs, blogs, customer questions, etc.) that fits in a single prompt, and you want to summarize the content.

Use case for map-reduce: a professor has a 30-page research paper from Google Scholar and only 10 minutes to summarize its content.

Let's install the necessary libraries:

In [14]:
%pip install --upgrade --quiet tiktoken langchain langgraph beautifulsoup4 langchain-community langchain-aws

Enter your AWS credentials and make sure they work.

First, we load our documents. We will use WebBaseLoader to load a blog post:
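One quick way to check the setup before calling the model is to confirm that the usual AWS environment variables are present. The sketch below assumes the credentials are supplied through environment variables; boto3 can also read them from `~/.aws/credentials` or an attached role, which this check would not see. `missing_aws_env_vars` is a hypothetical helper, not part of any library.

```python
import os

# Variable names commonly read by the AWS SDK for Python (boto3).
REQUIRED_AWS_VARS = ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY", "AWS_DEFAULT_REGION")


def missing_aws_env_vars(env=None):
    """Return the required AWS variable names that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_AWS_VARS if not env.get(name)]


missing = missing_aws_env_vars()
if missing:
    print("Missing AWS settings:", ", ".join(missing))
else:
    print("AWS environment variables are set.")
```

An empty result only shows that the variables exist, not that the keys are valid or that the account has access to the model.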

In [None]:
from langchain_community.document_loaders import WebBaseLoader

loader = WebBaseLoader("https://lilianweng.github.io/posts/2023-06-23-agent/")
docs = loader.load()
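Whether the stuff approach is viable depends on whether the loaded text fits in the model's context window. A rough sketch of that check follows. The four-characters-per-token ratio is a common rule of thumb for English text, not the model's real tokenizer, and `estimate_tokens`, `fits_in_context`, and the default limits are illustrative assumptions.

```python
def estimate_tokens(texts, chars_per_token=4.0):
    """Roughly estimate the token count of a list of strings."""
    total_chars = sum(len(text) for text in texts)
    return int(total_chars / chars_per_token)


def fits_in_context(texts, context_window=200_000, reserved_for_output=4_000):
    """Return True if the estimated prompt size leaves room for the reply."""
    return estimate_tokens(texts) <= context_window - reserved_for_output


# Stand-in strings; in the notebook you could pass [d.page_content for d in docs].
sample_pages = ["a" * 4_000, "b" * 8_000]
print(estimate_tokens(sample_pages))   # 3000
print(fits_in_context(sample_pages))   # True
```

If the estimate comes close to the limit, map-reduce is the safer choice, since the heuristic can undercount for code-heavy or non-English text.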

Select your AWS chat model and ensure that your credentials are configured:

In [18]:
# Ensure your AWS credentials are configured

from langchain.chat_models import init_chat_model

llm = init_chat_model("anthropic.claude-3-5-sonnet-20240620-v1:0", model_provider="bedrock_converse")

Part 1: Stuff: summarize in a single LLM call

The chain will take a list of documents, insert them all into a prompt, and pass that prompt to an LLM:

In [12]:
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain_core.prompts import ChatPromptTemplate

# Define prompt
prompt = ChatPromptTemplate.from_messages(
    [("system", "You are a smart and helpful AI assistant that summarizes documents!"),
     ("user", "Summarize the following blog:\n\n{context}")]
)

# Instantiate chain
chain = create_stuff_documents_chain(llm, prompt)

# Invoke chain
result = chain.invoke({"context": docs})
print(result)

Here is a summary of the key points from the blog post on LLM-powered autonomous agents:

1. Overview of agent system components:
- Planning (task decomposition, self-reflection)
- Memory (short-term, long-term)
- Tool use (external APIs and capabilities)

2. Planning techniques:
- Chain of thought reasoning
- Tree of Thoughts for exploring multiple reasoning paths
- Self-reflection to improve from past actions

3. Memory types:
- Short-term: In-context learning within limited context window
- Long-term: External vector stores with fast retrieval

4. Tool use approaches:  
- MRKL system for routing to expert modules
- Fine-tuning LMs to use external APIs (TALM, Toolformer)
- ChatGPT plugins and function calling

5. Case studies:
- Scientific discovery agents (ChemCrow)
- Generative agent simulations (25 AI characters interacting)

6. Proof-of-concept demos:
- AutoGPT 
- GPT-Engineer

7. Key challenges:
- Limited context length
- Difficulties with long-term planning
- Reliability issues
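To make the stuff step concrete, here is a minimal sketch of roughly what happens before the model is called: the text of every document is joined with a separator and substituted into the `{context}` slot of the prompt. `Doc` and `stuff_prompt` are hypothetical stand-ins, not LangChain classes, and the blank-line separator is an assumption about the chain's default.

```python
from dataclasses import dataclass


@dataclass
class Doc:
    """Hypothetical stand-in for a loaded document."""
    page_content: str


def stuff_prompt(docs, template, separator="\n\n"):
    """Join every document's text and place it in the template's {context} slot."""
    context = separator.join(doc.page_content for doc in docs)
    return template.format(context=context)


docs_demo = [Doc("First page."), Doc("Second page.")]
print(stuff_prompt(docs_demo, "Summarize the following blog:\n\n{context}"))
```

Because everything lands in one prompt, the approach costs a single LLM call but only works while the combined text fits in the context window.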

Now that we have summarized a blog post, let's summarize a PDF using a PDF-based loader!

In [16]:
%pip install pypdf

Collecting pypdf
  Downloading pypdf-5.5.0-py3-none-any.whl.metadata (7.2 kB)
Downloading pypdf-5.5.0-py3-none-any.whl (303 kB)
Installing collected packages: pypdf
Successfully installed pypdf-5.5.0


In [17]:
from langchain_community.document_loaders import PyPDFLoader

# Load the PDF; PyPDFLoader returns one Document per page

file_path = "/content/summarize/edfors-et-al-gene.pdf"  # Replace with your PDF path
loader = PyPDFLoader(file_path)

docs = loader.load()

In [19]:
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain_core.prompts import ChatPromptTemplate

# Define prompt
prompt = ChatPromptTemplate.from_messages(
    [("system", "You are a smart and helpful AI assistant that summarizes documents!"),
     ("user", "Summarize the following pdf:\n\n{context}")]
)

# Instantiate chain
chain = create_stuff_documents_chain(llm, prompt)

# Invoke chain
result = chain.invoke({"context": docs})
print(result)

Here is a summary of the key points from the article:

- The study investigated the correlation between mRNA levels and protein levels for a set of 55 genes across 9 human cell lines and 11 human tissues.

- They used targeted proteomics with internal standards to measure absolute protein copy numbers, and compared these to mRNA levels measured by RNA-seq.

- They found that mRNA and protein levels did not correlate well when compared directly. However, introducing a gene-specific RNA-to-protein (RTP) conversion factor significantly improved the correlation.

- The RTP conversion factor was found to be relatively consistent for a given gene across different cell types and tissues, but varied widely between different genes (from hundreds to hundreds of thousands of protein copies per mRNA).

- Using the gene-specific RTP factors allowed protein copy numbers to be predicted from mRNA levels with good accuracy across different samples.

- They developed a histone-based method to normalize

In this notebook I did not use map-reduce, but it is commonly used to summarize many pages of text in a short amount of time (see the use case above).
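For reference, the map-reduce pattern can be sketched in a few lines of plain Python. This is a simplified illustration with a toy summarizer, not the LangChain or LangGraph implementation; `batch`, `map_reduce_summarize`, and `first_sentence` are hypothetical names, and only a single reduce pass is performed.

```python
def batch(items, batch_size):
    """Split a list into consecutive batches of at most batch_size items."""
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]


def map_reduce_summarize(texts, summarize, batch_size=3):
    """Summarize each batch (map), then summarize the batch summaries (reduce)."""
    if not texts:
        return ""
    partial_summaries = [
        summarize("\n\n".join(group)) for group in batch(texts, batch_size)
    ]
    if len(partial_summaries) == 1:
        return partial_summaries[0]
    return summarize("\n\n".join(partial_summaries))


def first_sentence(text):
    """Toy summarizer: keep only the first sentence of the input."""
    return text.split(".")[0].strip() + "."


pages = [
    "Page one intro. More detail.",
    "Page two intro. More.",
    "Page three intro. More.",
    "Page four intro. More.",
]
print(map_reduce_summarize(pages, first_sentence, batch_size=2))
```

In the notebook, `summarize` could instead be a small function that sends the text to `llm` and returns the reply's content. If the combined batch summaries were still too long for one prompt, the reduce step would need to be repeated.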