# Summarization on Custom Dataset with SageMaker Jumpstart and [LangChain](https://python.langchain.com/en/latest/index.html) Library

Reference: https://github.com/gkamradt/langchain-tutorials/tree/main/data_generation


There are two main types of methods for summarizing text: abstractive and extractive.

Abstractive summarization generates a new shorter summary in its own words based on understanding the meaning and concepts of the original text. It analyzes the text using advanced natural language techniques to grasp the key ideas and then expresses those ideas in a summarized form using different words and phrases. This is similar to how humans summarize by reading something and then explaining the main points in their own words.

Extractive summarization works by selecting the most important sentences, phrases or words from the original text to construct a summary. It calculates the weight or importance of each part of the text using algorithms and then chooses the parts with the highest weights to put into the summary. It summarizes by extracting key elements from the text itself rather than interpreting the meaning.

So in short, abstractive summarization rewrites the key ideas in new words while extractive summarization selects the most salient parts of the existing text. Both aim to distill the essence and most significant information from the original document into a condensed summary.
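
To make the extractive idea concrete, here is a minimal sketch that weights each sentence by the average frequency of its words and keeps the top-scoring ones. The function name `extractive_summary` and the frequency-based scoring are assumptions chosen for illustration; they are not part of this notebook's pipeline, which relies on an LLM for abstractive summaries.

```python
import re
from collections import Counter


def extractive_summary(text: str, num_sentences: int = 1) -> str:
    """Return the highest-scoring sentences, kept in their original order."""
    # Naive sentence split on ., ! or ? followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    # Word frequencies over the whole text act as importance weights.
    freqs = Counter(re.findall(r"[a-z']+", text.lower()))

    def score(sentence: str) -> float:
        tokens = re.findall(r"[a-z']+", sentence.lower())
        # Average frequency, so long sentences are not favored automatically.
        return sum(freqs[t] for t in tokens) / max(len(tokens), 1)

    # Rank sentence indices by score, keep the best ones, restore document order.
    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    keep = sorted(ranked[:num_sentences])
    return " ".join(sentences[i] for i in keep)


sample_text = "Cats sleep a lot. Cats eat fish and cats chase mice. Dogs bark."
print(extractive_summary(sample_text, num_sentences=1))
```

Every sentence in the result is copied verbatim from the input, which is the defining property of an extractive method.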

We're going to run through three methods for summarization, from basic prompting on short text to summarizing large documents with the `map_reduce` method. These aren't the only options, so feel free to modify them based on your use case.

**3 Levels Of Summarization:**
1. **Summarize a couple sentences** - Basic Prompt
2. **Summarize a couple paragraphs** - Prompt Templates
3. **Summarize a large document with multiple pages** - Map Reduce

In this notebook we will demonstrate how to use a **Falcon 7b Instruct** model for text summarization using a library of documents as a reference.

**This notebook serves as a template so that you can easily replace the example dataset with your own to build a custom text summarization application. Let's install some dependencies that will be required and initialize some basic variables.**

In [None]:
!pip install --upgrade pip
!pip install --upgrade sagemaker
!pip install langchain
!pip install datasets
!pip install transformers

## Note
You must restart the kernel here for the installations to take effect. After restarting the kernel, run the following cells.

In [None]:
import sagemaker
from sagemaker.session import Session
import boto3
import os

sagemaker_session = Session()
aws_role = sagemaker_session.get_caller_identity_arn()
aws_region = boto3.Session().region_name
sess = sagemaker.Session()

print(f"Region is {aws_region}, Role is {aws_role}")

## Deploy large language model (LLM) in SageMaker JumpStart
---

To better illustrate the idea, let's first deploy the model required for the demo. You can see the list of Falcon models available via JumpStart by running the following code block. You can deploy any of the 7b models on a minimum of `ml.g5.12xlarge` instance type for ideal performance. For 40b we recommend at least a 24xl or higher. In this tutorial, we will deploy the `huggingface-llm-falcon-7b-instruct-bf16` model.

In [None]:
# List all the text generation models available in JumpStart
from sagemaker.jumpstart.notebook_utils import list_jumpstart_models, list_jumpstart_tasks
filter_value = "task == llm"

print("===== Available Models =====")
text_generation_models = list_jumpstart_models(filter=filter_value)
text_generation_models

In [None]:
model_id = 'huggingface-llm-falcon-7b-instruct-bf16'

We will now deploy this model to a SageMaker endpoint for inference.

In [None]:
from sagemaker.jumpstart.model import JumpStartModel

try:
    model = JumpStartModel(model_id=model_id, instance_type="ml.g5.2xlarge")
    predictor = model.deploy()
except Exception as e:
    # Surface the failure instead of continuing without a predictor
    print(str(e))
    raise

In [None]:
endpoint_name = predictor.endpoint_name
region = aws_region

In [None]:
print(f"SageMaker Endpoint with Falcon-7b deployed: {endpoint_name}")

## Summarize a few sentences
---

In [None]:
prompt = """
Given the following text, provide a concise and complete summary.

Text:
Philosophy (from Greek: φιλοσοφία, philosophia, 'love of wisdom') \
is the systematized study of general and fundamental questions, \
such as those about existence, reason, knowledge, values, mind, and language. \
Some sources claim the term was coined by Pythagoras (c. 570 – c. 495 BCE), \
although this theory is disputed by some. Philosophical methods include questioning, \
critical discussion, rational argument, and systematic presentation.

Summary:
"""

To use our model endpoint with LangChain, we wrap it in `langchain.llms.sagemaker_endpoint.SagemakerEndpoint`, which is LangChain's built-in support for SageMaker endpoints.

In [None]:
import json
import re
from langchain import SagemakerEndpoint
from langchain.llms.sagemaker_endpoint import LLMContentHandler
from langchain import PromptTemplate, LLMChain

class ContentHandler(LLMContentHandler):
    content_type = "application/json"
    accepts = "application/json"

    def transform_input(self, prompt: str, model_kwargs: dict) -> bytes:
        input_str = json.dumps({"inputs": prompt,  "parameters": model_kwargs}) 
        return input_str.encode('utf-8')
    
    def transform_output(self, output: bytes) -> str:
        response_json = json.loads(output.read().decode("utf-8"))
        return response_json[0]["generated_text"]

content_handler = ContentHandler()

sm_llm = SagemakerEndpoint(
    endpoint_name=endpoint_name,
    region_name=aws_region,
    model_kwargs={
        "do_sample": True,
        "top_p": 0.9,
        "temperature": 0.5,
        "max_new_tokens": 100,
        "stop": ["<|endoftext|>", "</s>"],
    },
    content_handler=content_handler,
)
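
The request and response shapes used by the content handler above can be checked offline, without calling the endpoint. The sketch below mirrors the handler's JSON layout; the helper names `build_request` and `parse_response` and the hand-written response body are assumptions for illustration only.

```python
import io
import json


def build_request(prompt: str, model_kwargs: dict) -> bytes:
    # Same JSON shape that transform_input sends to the endpoint.
    return json.dumps({"inputs": prompt, "parameters": model_kwargs}).encode("utf-8")


def parse_response(output) -> str:
    # The response body is a file-like stream, so it is read before decoding.
    response_json = json.loads(output.read().decode("utf-8"))
    return response_json[0]["generated_text"]


request_body = build_request("Summarize this text.", {"max_new_tokens": 100})

# Hypothetical response body, hand-written for illustration only.
fake_body = io.BytesIO(json.dumps([{"generated_text": "A short summary."}]).encode("utf-8"))
parsed_summary = parse_response(fake_body)

print(json.loads(request_body))
print(parsed_summary)
```

If your deployed model returns a different response layout, `transform_output` is the place to adapt the parsing.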

In [None]:
num_tokens = sm_llm.get_num_tokens(prompt)
print(f"Our prompt has {num_tokens} tokens")

In [None]:
output = sm_llm(prompt)
print(output)

In [None]:
prompt = """
Given the following text, write a 1 line summary.

Text:
Philosophy (from Greek: φιλοσοφία, philosophia, 'love of wisdom') \
is the systematized study of general and fundamental questions, \
such as those about existence, reason, knowledge, values, mind, and language. \
Some sources claim the term was coined by Pythagoras (c. 570 – c. 495 BCE), \
although this theory is disputed by some. Philosophical methods include questioning, \
critical discussion, rational argument, and systematic presentation.

Summary:
"""

In [None]:
output = sm_llm(prompt)
print(output)

## Summarize a couple paragraphs - Prompt Templates
---

Prompt templates are a great way to dynamically place text within your prompts. They are like [python f-strings](https://realpython.com/python-f-strings/) but specialized for working with language models.
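
As a rough analogy, the placeholder substitution behind a prompt template can be sketched with plain `str.format`. The names `TEMPLATE` and `render_prompt` are hypothetical; LangChain's `PromptTemplate`, used in the cells that follow, adds validation and chain integration on top of this basic idea.

```python
TEMPLATE = """Given the following text, write a short summary.

Text: {essay}
Summary:"""


def render_prompt(essay: str) -> str:
    # Plain str.format fills the {essay} placeholder with the given text.
    return TEMPLATE.format(essay=essay)


rendered = render_prompt("Startups grow by making something people want.")
print(rendered)
```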

We're going to look at two short Paul Graham essays.

In [None]:
from datasets import load_dataset
dataset = load_dataset("chromadb/paul_graham_essay")
essay1 = dataset['data'][0]['document']
essay2 = dataset['data'][1]['document']

essays=[essay1, essay2]
for essay in essays:
    print(essay)
    print("===============")

Next, let's create a prompt template that holds our instructions and a placeholder for the essay. In this example we only ask for a short summary to come back.

In [None]:
template = """
Given the following text, write a short summary.

Text: {essay}
Summary:
"""

prompt = PromptTemplate(
    input_variables=["essay"],
    template=template
)

In [None]:
sm_llm = SagemakerEndpoint(
    endpoint_name=endpoint_name,
    region_name=aws_region,
    model_kwargs={
        "do_sample": True,
        "top_p": 0.9,
        "temperature": 0.8,
        "max_new_tokens": 200,
        "stop": ["<|endoftext|>", "</s>"],
    },
    content_handler=content_handler,
)

for essay in essays:
    summary_prompt = prompt.format(essay=essay)

    num_tokens = sm_llm.get_num_tokens(summary_prompt)
    print(f"--> This prompt + essay has {num_tokens} tokens")

    summary = sm_llm(summary_prompt)

    print(f"Summary: {summary.strip()}")
    print("\n")

## Summarize large text from multiple pages of a document - MapReduce
---

If you have multiple pages you'd like to summarize, you'll likely have a large amount of text and may run into a token limit. Token limits won't always be a problem, but it is good to know how to handle them if you run into the issue.

The chain type "Map Reduce" is a method that helps with this. You first generate a summary of smaller chunks (that fit within the token limit) and then you get a summary of the summaries.
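
The two phases can be sketched in a few lines of plain Python. Here `first_sentence` is a hypothetical stand-in for an LLM call and `map_reduce_summarize` is an illustrative helper, not a LangChain API; the sketch only shows the control flow.

```python
from typing import Callable, List


def first_sentence(text: str) -> str:
    # Hypothetical stand-in for an LLM call: keep only the first sentence.
    return text.split(". ")[0].rstrip(".") + "."


def map_reduce_summarize(chunks: List[str], summarize: Callable[[str], str]) -> str:
    # Map: summarize every chunk independently.
    partial_summaries = [summarize(chunk) for chunk in chunks]
    # Reduce: summarize the concatenation of the partial summaries.
    return summarize(" ".join(partial_summaries))


example_chunks = [
    "Alpha is first. It has details.",
    "Beta is second. More details follow.",
]
print(map_reduce_summarize(example_chunks, first_sentence))
```

With a real model, the map step costs one call per chunk and the reduce step adds at least one more, so the total number of calls grows with document length.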

Check out [this video](https://www.youtube.com/watch?v=f9_BWhCI4Zo) for more information on how chain types work. We will use articles from the PubMed dataset available via HuggingFace `datasets`.

In [None]:
from datasets import load_dataset
dataset = load_dataset("ccdv/pubmed-summarization")
essay = dataset['train'][0]['article']
print(essay)

In [None]:
from langchain.chains.summarize import load_summarize_chain
from langchain.text_splitter import RecursiveCharacterTextSplitter

In [None]:
sm_llm.get_num_tokens(essay)

That's too many, so let's split our text into chunks that fit within the prompt limit. I'm going to use a chunk size of 2,000 characters.

> You can think of tokens as pieces of words used for natural language processing. For English text, **1 token is approximately 4 characters** or 0.75 words. As a point of reference, the collected works of Shakespeare are about 900,000 words or 1.2M tokens.

This means we should expect chunks of roughly 2,000 / 4 = ~500 tokens. This will vary, though, since each body of text or code is different.
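
The rule of thumb can be written as a pair of small helpers. These function names are hypothetical, and the ratios are the approximations quoted above, not the model's actual tokenizer behavior.

```python
def estimate_tokens_from_chars(num_chars: int, chars_per_token: float = 4.0) -> int:
    # Rule of thumb: about 4 characters per token for English text.
    return round(num_chars / chars_per_token)


def estimate_tokens_from_words(num_words: int, words_per_token: float = 0.75) -> int:
    # Rule of thumb: about 0.75 words per token for English text.
    return round(num_words / words_per_token)


print(estimate_tokens_from_chars(2000))
print(estimate_tokens_from_words(900_000))
```

For an exact count against the deployed model, use `sm_llm.get_num_tokens` as in the cells of this notebook.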

In [None]:
text_splitter = RecursiveCharacterTextSplitter(separators=["\n\n", "\n"], chunk_size=2000, chunk_overlap=500)

docs = text_splitter.create_documents([essay])
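
To see what `chunk_size` and `chunk_overlap` mean, here is a simplified fixed-width sliding window. This is an assumption-laden sketch: `RecursiveCharacterTextSplitter` splits on the given separators first, so its chunk boundaries will differ, and `sliding_window_chunks` is a hypothetical helper.

```python
from typing import List


def sliding_window_chunks(text: str, chunk_size: int, chunk_overlap: int) -> List[str]:
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks


placeholder_text = "abcdefghij" * 500  # 5,000 characters of placeholder text
pieces = sliding_window_chunks(placeholder_text, chunk_size=2000, chunk_overlap=500)
print(len(pieces), [len(p) for p in pieces])
```

The overlap repeats the tail of one chunk at the head of the next, which helps keep context that would otherwise be cut at a chunk boundary.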

In [None]:
num_docs = len(docs)

num_tokens_first_doc = sm_llm.get_num_tokens(docs[0].page_content)

print(f"Now we have {num_docs} documents and the first one has {num_tokens_first_doc} tokens")

Great, assuming the number of tokens is similar in the other docs, we should be good to go. Let's use LangChain's [load_summarize_chain](https://python.langchain.com/en/latest/use_cases/summarization.html) method with the `map_reduce` chain type for summarization. We first need to initialize our chain.

Our document is pretty large and has 19 chunks, so let's pick the first few chunks and try to summarize them using LangChain's `load_summarize_chain`.

In [None]:
summary_chain = load_summarize_chain(llm=sm_llm, chain_type='map_reduce',
                                     verbose=True # Set verbose=True if you want to see the prompts being used
                                    )

In [None]:
output = summary_chain.run(docs[:5])

In [None]:
print(output.strip())

---
This summary is a great start, but since we used only part of the text, the result is incomplete. This can be improved with a bit of prompt engineering, but ideally we would like to summarize the whole document. So, let's modify the approach to summarize the entire document and get only the key points as the final summary.

To do this we will use custom prompts (like we did above) to instruct the model on what we need. But this time, instead of using just 5 chunks of the document, we will use all of its chunks with a MapReduce summary chain from LangChain and our Falcon model hosted in SageMaker.

We will summarize the document using the LangChain MapReduce summary chain:

- We will first generate summaries of the smaller chunks (map)
- Then we will generate a narrative using the generated summaries (reduce)
- Then we will use the shortened narrative to generate the final key themes and summary of the document.

In [None]:
from langchain.chains.mapreduce import MapReduceChain
from langchain.text_splitter import CharacterTextSplitter
from langchain.chains import ReduceDocumentsChain, MapReduceDocumentsChain
from langchain.chains.combine_documents.stuff import StuffDocumentsChain

llm = SagemakerEndpoint(
    endpoint_name=endpoint_name,
    region_name=aws_region,
    model_kwargs={
        "do_sample": True,
        "top_p": 0.9,
        "temperature": 0.8,
        "max_new_tokens": 100,
        "stop": ["<|endoftext|>", "</s>"],
    },
    content_handler=content_handler,
)

Let's define the Map chain that will generate a summary of each chunk. In this case, you can see that it is just a regular LLMChain with a simple summary prompt. This is because we simply want to run a summary on each of the individual chunks of text.

In [None]:
# Map
map_template = """Given the following text, write a short summary.

Text: {docs}
Summary: """

map_prompt = PromptTemplate.from_template(map_template)
map_chain = LLMChain(llm=llm, prompt=map_prompt)

We then define the reduce chain. The purpose of this chain is to take all the generated summaries (by the map chain) and generate a single final summary.

In [None]:
# Reduce
reduce_template = """The following is a set of summaries. Take these and distill them into a final, consolidated summary of the main themes.

Text: {doc_summaries}
Summary: """

reduce_prompt = PromptTemplate.from_template(reduce_template)
reduce_chain = LLMChain(llm=llm, prompt=reduce_prompt)

We then define a chain that combines all the summaries generated by the Map chain and passes them to the Reduce chain.

In [None]:
# Takes a list of documents, combines them into a single string, and passes this to an LLMChain
combine_documents_chain = StuffDocumentsChain(
    llm_chain=reduce_chain, document_variable_name="doc_summaries"
)

# Combines and iteratively reduces the mapped documents
reduce_documents_chain = ReduceDocumentsChain(
    # This is final chain that is called.
    combine_documents_chain=combine_documents_chain,
    # If documents exceed context for `StuffDocumentsChain`
    collapse_documents_chain=combine_documents_chain,
    # The maximum number of tokens to group documents into.
    token_max=1000,
)
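
The role of `token_max` can be illustrated with a simplified collapse loop: summaries are grouped so each group stays under the budget, each group is reduced, and the process repeats until everything fits in one final reduce call. This is a sketch under stated assumptions, not LangChain's implementation; `approx_tokens`, `group_by_budget`, `collapse_and_reduce` and `shorten` are hypothetical helpers, and tokens are approximated as 4 characters each.

```python
from typing import Callable, List


def approx_tokens(text: str) -> int:
    # Rough assumption: about 4 characters per token.
    return max(1, len(text) // 4)


def group_by_budget(texts: List[str], token_max: int) -> List[List[str]]:
    groups, current, used = [], [], 0
    for text in texts:
        cost = approx_tokens(text)
        if current and used + cost > token_max:
            groups.append(current)  # current group is full, start a new one
            current, used = [], 0
        current.append(text)
        used += cost
    if current:
        groups.append(current)
    return groups


def collapse_and_reduce(
    summaries: List[str], reduce_fn: Callable[[List[str]], str], token_max: int
) -> str:
    # Keep collapsing until everything fits in a single reduce call.
    while len(summaries) > 1 and sum(approx_tokens(s) for s in summaries) > token_max:
        groups = group_by_budget(summaries, token_max)
        if len(groups) == len(summaries):
            break  # no group holds more than one item, so collapsing cannot help
        summaries = [reduce_fn(group) for group in groups]
    return reduce_fn(summaries)


group_sizes = []


def shorten(group: List[str]) -> str:
    # Hypothetical stand-in for the reduce LLM call: join, then truncate to 200 characters.
    group_sizes.append(len(group))
    return "".join(group)[:200]


chunk_summaries = ["x" * 400 for _ in range(6)]  # six summaries of about 100 tokens each
final_summary = collapse_and_reduce(chunk_summaries, shorten, token_max=250)
print(len(final_summary), group_sizes)
```

A smaller `token_max` means more collapse rounds and more model calls, while a larger one risks exceeding the model's context window in the reduce prompt.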

Finally, we define the overall MapReduceDocumentsChain. This chain takes care of executing all the chains we have defined so far, passing the output(s) from one to the other to generate the final summary. If you want to see each of the steps as they execute, you can pass `verbose=True` in the `map_chain` and `reduce_chain` initializations above. For this exercise we kept the default of False, but feel free to change it and execute.

In [None]:
# Combining documents by mapping a chain over them, then combining results
map_reduce_chain = MapReduceDocumentsChain(
    # Map chain
    llm_chain=map_chain,
    # Reduce chain
    reduce_documents_chain=reduce_documents_chain,
    # The variable name in the llm_chain to put the documents in
    document_variable_name="docs",
    # Return the results of the map steps in the output
    return_intermediate_steps=False,
)

In [None]:
# we have already split our document into chunks previously so we will use it now
print(map_reduce_chain.run(docs))

## Cleanup
---

We have seen how we can deploy a Falcon 7b Instruct model using SageMaker Endpoint and use it with LangChain to perform small text and very large text summarizations. Let's delete the endpoint to avoid incurring additional cost.

In [None]:
predictor.delete_model()
predictor.delete_endpoint()