# Long text summarization using LCEL chains on LangChain with Bedrock APIs

> *This notebook should work well with the **`Data Science 3.0`** kernel in SageMaker Studio*

## Overview
Large documents pose several challenges: the input text might not fit within the model's context length, the model may hallucinate on very long inputs, or processing may fail with out-of-memory errors.

To address these problems, this notebook shows a solution based on chunking and chaining prompts. It uses [LangChain](https://python.langchain.com/docs/get_started/introduction.html), a popular framework for developing applications powered by language models.

In this architecture:

1. A large document (or a single file built by concatenating smaller ones) is loaded.
1. A LangChain utility splits it into multiple smaller chunks (chunking).
1. The first chunk is sent to the model, which returns the corresponding summary.
1. LangChain takes the next chunk, appends it to the returned summary, and sends the combined text to the model as a new request. The process repeats until all chunks are processed.
1. The result is a final summary based on the entire content.
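
The steps above can be sketched in plain Python. This is a minimal illustration of the loop, not LangChain's implementation: `refine_summarize` and `fake_summarize` are hypothetical names, and `fake_summarize` is a stand-in for the real model call that simply keeps the first five words.

```python
from typing import Callable, List

calls: List[str] = []  # records every text sent to the stand-in model


def refine_summarize(chunks: List[str], summarize: Callable[[str], str]) -> str:
    """Fold the chunks into a running summary, one model call per chunk."""
    summary = ""
    for chunk in chunks:
        # The first chunk is sent alone; later chunks are appended to the running summary
        combined = f"{summary}\n\n{chunk}" if summary else chunk
        summary = summarize(combined)
    return summary


def fake_summarize(text: str) -> str:
    """Hypothetical stand-in for an LLM call: keep only the first five words."""
    calls.append(text)
    return " ".join(text.split()[:5])


final_summary = refine_summarize(
    ["alpha beta gamma delta epsilon zeta", "eta theta"], fake_summarize
)
print(final_summary)
```

Each request carries the previous summary plus one new chunk, so the calls are inherently sequential.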

### Use case
This approach can be used to summarize call transcripts, meeting transcripts, books, articles, blog posts, and other relevant content.

### Install the `anthropic` package for counting tokens

In [2]:
%pip install anthropic

Collecting anthropic
  Obtaining dependency information for anthropic from https://files.pythonhosted.org/packages/b0/fc/6e4ebeb2d28a1d2a7a433d4ed15c8adca0c5f719c232c349949d1521f886/anthropic-0.7.2-py3-none-any.whl.metadata
  Downloading anthropic-0.7.2-py3-none-any.whl.metadata (13 kB)
Collecting distro<2,>=1.7.0 (from anthropic)
  Downloading distro-1.8.0-py3-none-any.whl (20 kB)
Downloading anthropic-0.7.2-py3-none-any.whl (807 kB)
Installing collected packages: distro, anthropic
Successfully installed anthropic-0.7.2 distro-1.8.0
Note: you may need to restart the kernel to use updated packages.


### Imports

In [3]:
import json
import os
import sys
import boto3
from langchain.llms import Bedrock


module_path = ".."
sys.path.append(os.path.abspath(module_path))
from utils import bedrock, print_ww

os.environ["BEDROCK_ASSUME_ROLE"] = "arn:aws:iam::776440668689:role/Crossaccountbedrock"  # E.g. "arn:aws:..."


bedrock_runtime = bedrock.get_bedrock_client(
    assumed_role=os.environ.get("BEDROCK_ASSUME_ROLE", None),
    region=os.environ.get("AWS_DEFAULT_REGION", None)
)

model = Bedrock(
    model_id="anthropic.claude-v2", 
    client=bedrock_runtime,
    model_kwargs={'temperature': 0.3}
    )

Create new client
  Using region: us-west-2
  Using role: arn:aws:iam::776440668689:role/Crossaccountbedrock ... successful!
boto3 Bedrock client successfully created!
bedrock-runtime(https://bedrock-runtime.us-west-2.amazonaws.com)


### Load shareholder letter

We will follow a process similar to lab 02 in this summarization section. First, let us load the 2022 Amazon shareholder letter.

In [4]:
shareholder_letter = "./letters/2022-letter.txt"

with open(shareholder_letter, "r") as file:
    letter = file.read()

In [5]:
len(letter.split(' '))

5084

In [6]:
from langchain.text_splitter import RecursiveCharacterTextSplitter
text_splitter = RecursiveCharacterTextSplitter(
    separators=["\n\n", "\n"], chunk_size=4000, chunk_overlap=100
)

docs = text_splitter.create_documents([letter])
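
To make the chunking step concrete, here is a simplified sketch of packing paragraphs into size-limited chunks. It assumes `chunk_size` is measured in characters (the splitter's default length function is `len`, as far as we know) and deliberately ignores `chunk_overlap` and the recursive fallback to finer separators, so it is not LangChain's actual algorithm. `greedy_chunks` is a hypothetical helper name.

```python
from typing import List


def greedy_chunks(text: str, chunk_size: int, separator: str = "\n\n") -> List[str]:
    """Pack separator-delimited pieces into chunks of at most chunk_size characters."""
    chunks: List[str] = []
    current = ""
    for piece in text.split(separator):
        candidate = f"{current}{separator}{piece}" if current else piece
        if len(candidate) <= chunk_size or not current:
            # Still fits, or it is a single oversized piece that is kept whole
            current = candidate
        else:
            # Adding the piece would overflow: close the chunk and start a new one
            chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks


sample = "aaaa\n\nbbbb\n\ncccc\n\ndddd"
sample_chunks = greedy_chunks(sample, chunk_size=10)
print(sample_chunks)
```

With a 10-character limit, the four 4-character paragraphs pack two per chunk.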

In [7]:
num_docs = len(docs)

num_tokens_first_doc = model.get_num_tokens(docs[0].page_content)

print(
    f"Now we have {num_docs} documents and the first one has {num_tokens_first_doc} tokens"
)

Now we have 10 documents and the first one has 439 tokens


In [8]:
from langchain.prompts import PromptTemplate
from langchain.output_parsers import XMLOutputParser
from langchain.schema.output_parser import StrOutputParser


# Used only to generate the XML format instructions embedded in the prompt
xml_parser = XMLOutputParser(tags=['insight'])
str_parser = StrOutputParser()

prompt = PromptTemplate(
    template="""
    
    Human:
    {instructions} : \"{document}\"
    Format help: {format_instructions}.
    Assistant:""",
    input_variables=["instructions","document"],
    partial_variables={"format_instructions": xml_parser.get_format_instructions()},
)

insight_chain = prompt | model | StrOutputParser()
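
Because the chain ends in `StrOutputParser`, each result is a raw string even though the prompt asks for XML. If the individual insights are needed, they can be pulled out afterwards. The sketch below assumes the model wraps each insight in an `<insight>` tag as requested; `extract_insights` is a hypothetical helper, and the `response` string is made up for illustration rather than taken from actual model output.

```python
import re
from typing import List


def extract_insights(response: str) -> List[str]:
    """Return the text of every <insight>...</insight> element in the response."""
    matches = re.findall(r"<insight>(.*?)</insight>", response, flags=re.DOTALL)
    return [m.strip() for m in matches]


# Made-up example of what a response might look like (an assumption, not real output)
response = """ Here are the insights:
<insights>
  <insight>First point</insight>
  <insight>
    Second point
  </insight>
</insights>"""

parsed = extract_insights(response)
print(parsed)
```

A regular expression is used here instead of a strict XML parser because model responses often include prose around the XML.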

In [9]:
len(docs)

10

## Option 1: Manually extract insights, then summarize

In [10]:
%%time
insights = []
for doc in docs:
    insights.append(
        insight_chain.invoke({
            "instructions": "Provide key insights from the following text",
            # Pass the chunk text as a plain string; wrapping it in braces would create a set
            "document": doc.page_content,
        })
    )

CPU times: user 65.3 ms, sys: 9 ms, total: 74.3 ms
Wall time: 1min 54s
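
Almost all of the wall time above is spent waiting on sequential model calls. Since each chunk is processed independently, the calls could in principle run concurrently. The sketch below shows the idea with a thread pool from the standard library; `fake_insight_call` is a hypothetical stand-in for `insight_chain.invoke`, and the worker count is an arbitrary assumption that would have to respect your Bedrock rate limits. LCEL runnables also expose a `batch` method that may serve the same purpose.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List


def fake_insight_call(chunk: str) -> str:
    """Hypothetical stand-in for a model call; sleeps to simulate network latency."""
    time.sleep(0.05)
    return f"insight for: {chunk}"


def map_insights(
    chunks: List[str], call: Callable[[str], str], max_workers: int = 4
) -> List[str]:
    """Run one call per chunk concurrently; results keep the input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(call, chunks))


sample_chunks = [f"chunk {i}" for i in range(8)]
sample_insights = map_insights(sample_chunks, fake_insight_call)
print(sample_insights[0])
```

`pool.map` returns results in the same order as the inputs, so the insights still line up with their chunks.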


In [11]:
str_parser = StrOutputParser()

prompt = PromptTemplate(
    template="""
    
    Human:
    {instructions} : \"{document}\"
    Assistant:""",
    input_variables=["instructions","document"]
)

summary_chain = prompt | model | StrOutputParser()

In [12]:
%%time
print(summary_chain.invoke({
    "instructions": (
        "You will be provided with multiple sets of insights. "
        "Compile and summarize these insights and provide key takeaways in one concise paragraph. "
        "Do not use the original xml tags. Just provide a paragraph with your compiled insights."
    ),
    # Join the insights into a single string; braces here would create a set
    "document": "\n".join(insights),
}))

 Here is a concise paragraph summarizing the key insights from the given XML text:

Amazon is constantly evolving and taking a long-term approach, willing to make tough decisions to position itself for future success. It is expanding into large retail markets like grocery and healthcare to address customer pain points. Amazon is investing heavily in emerging technologies like machine learning, large language models, and satellite broadband to transform customer experiences across all businesses. Despite economic challenges, Amazon sees significant growth potential as key markets shift online and to the cloud. Amazon believes its customer obsession, willingness to reinvent itself, and long-term perspective set it up for continued innovation and leadership.
CPU times: user 10.2 ms, sys: 0 ns, total: 10.2 ms
Wall time: 9.37 s



## Option 2: Use the map-reduce pattern in LangChain

In [13]:
from langchain.chains.summarize import load_summarize_chain
summary_chain = load_summarize_chain(llm=model, chain_type="map_reduce", verbose=False)

In [14]:
%%time
print(summary_chain.run(docs))

Token indices sequence length is longer than the specified maximum sequence length for this model (1250 > 1024). Running this sequence through the model will result in indexing errors


 Here is a concise summary of the key points:

Amazon CEOs remain optimistic about future growth despite economic challenges. They believe constant adaptation, balancing efficiency and long-term innovation is key. Major investments transformed Amazon from an online bookseller to a global ecommerce, cloud computing and devices giant. Though growth has slowed recently, Amazon sees positive long-term potential in areas like AWS, advertising, healthcare, satellite internet. Amazon will keep expanding into new segments by leveraging capabilities and focusing on customers. Leaders highlight major innovations like AI as key to future success across retail, cloud and emerging businesses.
CPU times: user 56.2 ms, sys: 7.22 ms, total: 63.4 ms
Wall time: 1min 36s
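
For intuition, the map-reduce pattern can be sketched in plain Python. This is an illustration under stated assumptions, not the internals of `load_summarize_chain`: `map_reduce_summarize`, `first_three_words`, and the `max_chars` budget are all hypothetical, and the stand-in summarizer just keeps the first three words.

```python
from typing import Callable, List

reduce_calls: List[str] = []  # records every text sent to the stand-in model


def map_reduce_summarize(
    chunks: List[str], summarize: Callable[[str], str], max_chars: int = 200
) -> str:
    """Summarize each chunk, collapse partial summaries if too long, then combine."""
    # Map: one independent summary per chunk
    partials = [summarize(chunk) for chunk in chunks]
    # Collapse: while the joined summaries exceed the budget, summarize them in pairs
    while len("\n".join(partials)) > max_chars and len(partials) > 1:
        partials = [
            summarize("\n".join(partials[i:i + 2]))
            for i in range(0, len(partials), 2)
        ]
    # Reduce: a final call over the remaining partial summaries
    return summarize("\n".join(partials))


def first_three_words(text: str) -> str:
    """Hypothetical stand-in for an LLM call: keep only the first three words."""
    reduce_calls.append(text)
    return " ".join(text.split()[:3])


sample_chunks = ["one two three four", "five six seven eight", "nine ten eleven twelve"]
combined_summary = map_reduce_summarize(sample_chunks, first_three_words)
print(combined_summary)
```

Unlike the sequential approach described in the overview, the map step has no dependency between chunks, which is what makes it a candidate for parallel execution.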
