# Text Summarization

---

This notebook uses the SageMaker Python SDK for Text Summarization, the task of condensing a text into a shorter summary that keeps its most important information. Here, we show how to deploy the pre-trained DistilBART model and use it to summarize text.

---

1. [Set Up](#1.-Set-Up)
2. [Select a model](#2.-Select-a-model)
3. [Retrieve Artifacts & Deploy an Endpoint](#3.-Retrieve-Artifacts-&-Deploy-an-Endpoint)
4. [Query endpoint and parse response](#4.-Query-endpoint-and-parse-response)
5. [Clean up the endpoint](#5.-Clean-up-the-endpoint)

Note: This notebook was tested on an ml.t3.medium instance in Amazon SageMaker Studio with the Python 3 (Data Science) kernel and on an Amazon SageMaker Notebook instance with the conda_python3 kernel.

### 1. Set Up

---
Before executing the notebook, there are some initial steps required for set up. This notebook requires ipywidgets.

---

In [11]:
!pip install ipywidgets==7.0.0 --quiet


#### Permissions and environment variables

---
To host on Amazon SageMaker, we need to set up and authenticate the use of AWS services. Here, we use the execution role associated with the current notebook as the AWS account role with SageMaker access. 

---

In [12]:
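import json

import boto3
import sagemaker

# One SageMaker session serves the whole notebook.
sagemaker_session = sagemaker.Session()
sess = sagemaker_session  # alias kept for any cell that refers to `sess`

# ARN of the identity running this notebook. Inside SageMaker this resolves to the
# execution role attached to the notebook.
aws_role = sagemaker_session.get_caller_identity_arn()
aws_region = boto3.Session().region_name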
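
Hosting needs an IAM role rather than an IAM user, so it can help to check what kind of ARN was returned above before deploying. The helper below is a minimal sketch: the name `is_iam_role_arn` is hypothetical (it is not part of the SageMaker SDK), it only inspects the string, and it assumes the usual `arn:<partition>:iam::<account-id>:role/<role-name>` layout. The ARNs in the example are placeholders.

```python
def is_iam_role_arn(arn: str) -> bool:
    """Return True if `arn` looks like an IAM role ARN (string check only)."""
    parts = arn.split(":", 5)
    # Assumed layout: arn:<partition>:iam::<account-id>:role/<role-name>
    return (
        len(parts) == 6
        and parts[0] == "arn"
        and parts[2] == "iam"
        and parts[5].startswith("role/")
        and len(parts[5]) > len("role/")
    )


# Placeholder ARNs for illustration; the account ID and names are made up.
print(is_iam_role_arn("arn:aws:iam::123456789012:role/service-role/ExampleRole"))
print(is_iam_role_arn("arn:aws:iam::123456789012:user/example-user"))
```

In the notebook this could be called as `is_iam_role_arn(aws_role)` right after the role is retrieved.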

In [14]:
import pandas as pd

df = pd.read_csv('cleaned_publications_v2.csv',lineterminator='\n')
data_tuning = pd.read_csv('publications_tuning.csv',lineterminator='\n')

In [15]:
df

| Unnamed: 0.1 | Unnamed: 0 | Title | Content | Program | Full_Report | Language |
|---|---|---|---|---|---|---|
| 0 | 0 | Policy Brief: Policies and Enabling Environmen... | identifying priority actions for decarbonizing... | | policies and enabling environment to drive pr... | en |
| 1 | 1 | Impact and Highlights from CPI’s Climate Finan... | photo by hendrik cornelissen our 2022 highligh... | Climate Finance Tracking | climate finance tracking program impact and hi... | en |
| 2 | 3 | Challenges of Rural Insurance in the Context o... | promoting the modernization and sustainability... | Brazil Policy Center | promoting the modernization and sustainability... | en |
| 3 | 4 | An Innovative IFI Operating Model for the 21st... | photo by kalen emsley last year saw major cont... | San Giorgio Group | an inno vative ifi operating model for the 21... | en |
| 4 | 7 | Emissions Accounting in Managed Coal Phaseout ... | to meet the global temperature goal of the par... | Climate Finance | 2 emissions accounting in manage... | en |
| ... | ... | ... | ... | ... | ... | ... |
| 372 | 464 | The Role of Government Policy in the Developme... | in late 2010 cpi began a study of the impact o... | Energy Finance | david nelson climate policy ... | en |
| 373 | 465 | PV Industry and Policy in Germany and China | as building-integrated photovoltaic (pv) solut... | Energy Finance | survey of photovoltaic industry and policy... | en |
| 374 | 466 | Review of Low-Carbon Development in China 2010 | china’s 11th five-year plan (2006-2010) set a ... | Energy Finance | review of low carbon development ... | en |
| 375 | 467 | Carbon Pricing Project | to drive low-carbon investment policy framewor... | Climate Finance | ... | en |


### 2. Select a model
***
You can continue with the default model or choose a different one from the dropdown generated later in this section. A complete list of SageMaker pre-trained models is available at [SageMaker pre-trained Models](https://sagemaker.readthedocs.io/en/stable/doc_utils/pretrainedmodels.html#).
***

In [16]:
# We use the DistilBART CNN 6-6 model and the BART-large CNN SAMSum model. For the reasoning
# behind these choices, see the poster presentation of this research.
# Model ID of the SAMSum model: "huggingface-summarization-bart-large-cnn-samsum"
# model_version="*" fetches the latest version of the model.
model_id, model_version = (
    "huggingface-summarization-distilbart-cnn-6-6",
    "*",
)

***
[Optional] Select a different SageMaker pre-trained model. Here, we download the model manifest file from the Built-In Algorithms S3 bucket, keep only the Text Summarization models, and select one of them for inference.
***

In [None]:
from ipywidgets import Dropdown
from sagemaker.jumpstart.notebook_utils import list_jumpstart_models

# Retrieves all Text Summarization models available by SageMaker Built-In Algorithms.
filter_value = "task == summarization"
text_summarization_models = list_jumpstart_models(filter=filter_value)

# display the model-ids in a dropdown to select a model for inference.
model_dropdown = Dropdown(
    options=text_summarization_models,
    value=model_id,
    description="Select a model",
    style={"description_width": "initial"},
    layout={"width": "max-content"},
)
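
The filtering idea can also be illustrated offline, without calling the SDK. The sketch below assumes that model IDs follow a `<framework>-<task>-<model-name>` pattern, as the default `huggingface-summarization-distilbart-cnn-6-6` does. The helper `filter_by_task` and the sample list are hypothetical and only meant to show the shape of the result.

```python
def filter_by_task(model_ids, task):
    """Keep the IDs whose second dash-separated token equals `task`."""
    selected = []
    for model_id in model_ids:
        tokens = model_id.split("-")
        # Assumed pattern: <framework>-<task>-<model-name>
        if len(tokens) > 2 and tokens[1] == task:
            selected.append(model_id)
    return selected


# The first two IDs appear in this notebook; the third is a made-up, non-summarization ID.
sample_ids = [
    "huggingface-summarization-distilbart-cnn-6-6",
    "huggingface-summarization-bart-large-cnn-samsum",
    "examplefw-translation-example-model",
]
print(filter_by_task(sample_ids, "summarization"))
```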

#### Choose a model for Inference

In [None]:
display(model_dropdown)

In [None]:
# model_version="*" fetches the latest version of the model
model_id, model_version = model_dropdown.value, "*"

### 3. Retrieve Artifacts & Deploy an Endpoint

***

Using SageMaker, we can perform inference on the pre-trained model. We start by retrieving the `deploy_image_uri`, `deploy_source_uri`, and `model_uri` for the pre-trained model. To host the pre-trained model, we create an instance of [`sagemaker.model.Model`](https://sagemaker.readthedocs.io/en/stable/api/inference/model.html) and deploy it. 
***

In [17]:
from sagemaker import image_uris, model_uris, script_uris, hyperparameters
from sagemaker.model import Model
from sagemaker.predictor import Predictor
from sagemaker.utils import name_from_base

endpoint_name = name_from_base(f"jumpstart-example-{model_id}")

inference_instance_type = "ml.m4.xlarge"  # alternative: "ml.p2.xlarge"

# Retrieve the inference docker container uri. This is the base HuggingFace container image for the default model above.
deploy_image_uri = image_uris.retrieve(
    region=None,
    framework=None,  # automatically inferred from model_id
    image_scope="inference",
    model_id=model_id,
    model_version=model_version,
    instance_type=inference_instance_type,
)

# Retrieve the inference script uri. This includes all dependencies and scripts for model loading, inference handling etc.
deploy_source_uri = script_uris.retrieve(
    model_id=model_id, model_version=model_version, script_scope="inference"
)


# Retrieve the model uri. This includes the pre-trained model and parameters.
model_uri = model_uris.retrieve(
    model_id=model_id, model_version=model_version, model_scope="inference"
)

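# Hyperparameters noted for a possible fine-tuning run. The deployment below does not use them.
# The dict is named so that it does not shadow the imported `hyperparameters` module.
tuning_hyperparameters = {
    "epochs": 10,
    "learning_rate": 0.001,
}

# Create the SageMaker model instance
model = Model(
    image_uri=deploy_image_uri,
    source_dir=deploy_source_uri,
    model_data=model_uri,
    entry_point="inference.py",  # entry point file in source_dir and present in deploy_source_uri
    role=aws_role,
    predictor_cls=Predictor,
    name=endpoint_name,
)

# NOTE: `sagemaker.model.Model` has no `fit` method, so the call below would raise an
# AttributeError. Fine-tuning requires an estimator (for example `sagemaker.estimator.Estimator`)
# with training data referenced from S3 rather than an in-memory DataFrame. The call is left
# commented out, and the endpoint serves the pre-trained weights as-is.
# model.fit(
#     inputs=data_tuning,
#     job_name="fine-tuning-job",
#     hyperparameters=tuning_hyperparameters,
# )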

# Deploy the model. When deploying through the Model class we pass the Predictor class
# so that the returned object can run inference through the SageMaker API.
model_predictor = model.deploy(
    initial_instance_count=1,
    instance_type=inference_instance_type,
    predictor_cls=Predictor,
    endpoint_name=endpoint_name,
)

--------!

### 4. Query endpoint and parse response

---
Input to the endpoint is any string of text, encoded in `utf-8` and sent with the `application/x-text` content type. Output of the endpoint is a `json` object containing the summarized text.

---

---
Below, we provide some example input text. You can replace it with any text, and the model will summarize it.

---

In [18]:
def query(model_predictor, text):
    """Query the model predictor."""

    encoded_text = text.encode("utf-8")

    query_response = model_predictor.predict(
        encoded_text,
        {
            "ContentType": "application/x-text",
            "Accept": "application/json",
        },
    )
    return query_response


def parse_response(query_response):
    """Parse response and return summary text."""

    model_predictions = json.loads(query_response)
    summary_text = model_predictions["summary_text"]
    return summary_text
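
The request and response shapes can be checked offline before an endpoint exists. The sketch below uses a hypothetical `StubPredictor` in place of the deployed predictor; its canned reply assumes the response is a JSON object with a `summary_text` field, which is the field the parsing code reads.

```python
import json


class StubPredictor:
    """Stand-in for a deployed predictor; returns a canned JSON body."""

    def predict(self, data, initial_args=None):
        # Record what was sent so the request shape can be inspected.
        self.last_request = (data, initial_args)
        # Assumed response shape: a JSON object with a "summary_text" field.
        return json.dumps({"summary_text": "stub summary"}).encode("utf-8")


stub = StubPredictor()
raw = stub.predict(
    "Some long text.".encode("utf-8"),
    {"ContentType": "application/x-text", "Accept": "application/json"},
)
summary = json.loads(raw)["summary_text"]
print(summary)
```

Passing the stub wherever the notebook passes `model_predictor` exercises the encoding and parsing steps without incurring endpoint cost.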

In [21]:
newline, bold, unbold = "\n", "\033[1m", "\033[0m"

input_text = "The tower is 324 metres (1,063 ft) tall, about the same height as an 81-storey building, and the tallest structure in Paris. Its base is square, measuring 125 metres (410 ft) on each side. During its construction, the Eiffel Tower surpassed the Washington Monument to become the tallest man-made structure in the world, a title it held for 41 years until the Chrysler Building in New York City was finished in 1930. It was the first structure to reach a height of 300 metres. Due to the addition of a broadcasting aerial at the top of the tower in 1957, it is now taller than the Chrysler Building by 5.2 metres (17 ft). Excluding transmitters, the Eiffel Tower is the second tallest free-standing structure in France after the Millau Viaduct."

query_response = query(model_predictor, input_text)

summary_text = parse_response(query_response)

print(f"Input text: {input_text}{newline}" f"Summary text: {bold}{summary_text}{unbold}{newline}")

Input text: The tower is 324 metres (1,063 ft) tall, about the same height as an 81-storey building, and the tallest structure in Paris. Its base is square, measuring 125 metres (410 ft) on each side. During its construction, the Eiffel Tower surpassed the Washington Monument to become the tallest man-made structure in the world, a title it held for 41 years until the Chrysler Building in New York City was finished in 1930. It was the first structure to reach a height of 300 metres. Due to the addition of a broadcasting aerial at the top of the tower in 1957, it is now taller than the Chrysler Building by 5.2 metres (17 ft). Excluding transmitters, the Eiffel Tower is the second tallest free-standing structure in France after the Millau Viaduct.
Summary text: The Eiffel Tower is 324 metres (1,063 ft) tall, about the same height as an 81-storey building . It is the second tallest free-standing structure in France after the Millau Viaduct . It was the first structure to reach a heig

In [19]:
def summarize(model_predictor, input_text):
    query_response = query(model_predictor, input_text)
    summary_text = parse_response(query_response)
    return summary_text

In [55]:
df['Full_Report'][0][:4946]

'policies and enabling environment to drive private investments for industrial decarbonization in indiaidentifying priority actions for decarbonizing steel and cement sectors policy briefapril 2023authors yash kashyap dhruba purkayastha2policies and enabling environment to drive private investments for industrial decarbonization in indiaintroductionindustrial emission accounts for about onethird of all global anthropogenic co2 emissions and are expected to grow rapidly with major contribution from developing economies in india the industrial sector is the largest and fastestgrowing energy enduse sector and is expected to be the single largest source of co2 emissions by 2040 decarbonization of industries is one of the most critical issues that needs to be addressed to achieve global climate ambition and india’s target of netzero emissions nze by 2070steel and cement are the most consumed emissionintensive industrial materials and india is the second largest producer of both these materi

In [56]:
summarize(model_predictor,df['Full_Report'][0][:2946])

' policies and enabling environment to drive private investments for industrial decarbonization in indiaidentifying priority actions for decarbonizing steel and cement sectors . The industrial sector is the largest and fastest growing energy enduse sector and is expected to be the single largest source of co2 emissions by 2040 decarbonisation of industries is one of'

In [None]:
df['summaries'] = df['Full_Report'].apply(lambda x: summarize(model_predictor, x[:2946]))
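
The slice `x[:2946]` cuts each report at a fixed character count, which can split the last word. One possible alternative, sketched below under the assumption that words are separated by spaces, is to cut at the last word boundary that fits. The helper name `truncate_at_word` is hypothetical.

```python
def truncate_at_word(text: str, max_chars: int) -> str:
    """Cut `text` to at most `max_chars` characters without splitting a word."""
    if len(text) <= max_chars:
        return text
    cut = text[:max_chars]
    # If the character right after the cut is whitespace, the cut ends on a whole word.
    if text[max_chars].isspace():
        return cut.rstrip()
    # Otherwise drop the partial last word, provided there is an earlier space to cut at.
    head, sep, _ = cut.rpartition(" ")
    return head.rstrip() if sep else cut


print(truncate_at_word("climate finance tracking program", 20))
```

In the cell above, this would replace the slice, for example `summarize(model_predictor, truncate_at_word(x, 2946))`.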

In [None]:
df.to_csv('summaries_v2.csv', index=False)  # index=False avoids an extra "Unnamed: 0" column on reload

In [60]:
df

| Unnamed: 0.1 | Unnamed: 0 | Title | Content | Program | Full_Report | Language | Accuracy: Bleu | Accuracy: Meteor | Information Gain | summaries |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | Policy Brief: Policies and Enabling Environmen... | identifying priority actions for decarbonizing... | | policies and enabling environment to drive pri... | en | 0.004539096 | 0.096885 | 0.320382 | India’s target of netzero emissions nze by 207... |
| 1 | 1 | Impact and Highlights from CPI’s Climate Finan... | photo by hendrik cornelissenour 2022 highlight... | Climate Finance Tracking | climate finance tracking programimpact and hig... | en | 2.949671e-05 | 0.047711 | 0.426435 | climate finance tracking programimpact and hi... |
| 2 | 3 | Challenges of Rural Insurance in the Context o... | promoting the modernization and sustainability... | Brazil Policy Center | promoting the modernization and sustainability... | en | 0.0006697484 | 0.081017 | 0.018343 | The current context of climate change amplifi... |
| 3 | 4 | An Innovative IFI Operating Model for the 21st... | photo by kalen emsleylast year saw major contr... | San Giorgio Group | an innovative ifi operating model for the 21st... | en | 5.545797e-05 | 0.057592 | 0.018343 | An innovative ifi operating model for the 21s... |
| 4 | 7 | Emissions Accounting in Managed Coal Phaseout ... | to meet the global temperature goal of the par... | Climate Finance | 2 emissions accounting in managed coal p... | en | 7.954489e-13 | 0.023373 | 0.018343 | The authors would like to acknowledge and tha... |
| 5 | 8 | Where Does Brazil Stand and Where Is It Headin... | foto cristina leme lopeswhere do we standthe y... | Brazil Policy Center | march 2023action based agenda1where does brazi... | en | 0.7765553 | 0.661083 | 0.018343 | 2022 marks the 10year anniversary of the nati... |
| 6 | 9 | Global Landscape of Renewable Energy Finance 2023 | global investment in energy transition technol... | Climate Finance Tracking | 2023global landscape of renewable energy finan... | en | 3.618109e-10 | 0.028831 | 0.018343 | 2023global landscape of renewable energy fina... |
| 7 | 10 | CCFLA: 2022 Highlights and Impact | the cities climate finance leadership alliance... | Cities Climate Finance Leadership Alliance | 2022 highlights and impact2table of contents01... | en | 0.1369135 | 0.296281 | 0.018343 | 34the cities climate finance leadership allian... |
| 8 | 11 | Smallholders in the Caatinga and the Cerrado: ... | due to the increasingly high global carbon emi... | Brazil Policy Center | smallholders in the caatinga and the cerrado a... | en | 1.52536e-08 | 0.03062 | 0.018343 | smallholders in the caatinga and the cerrado ... |
| 9 | 13 | Guidelines to Assess the Direct and Indirect A... | climate policy initiativepucrio in partnership... | Brazil Policy Center | guidelines to assess the direct and indirect ... | en | 1.515808e-18 | 0.013466 | 0.018343 | guidelines to assess the direct and indirect ... |


### 5. Clean up the endpoint

In [69]:
# Delete the SageMaker endpoint
model_predictor.delete_model()
model_predictor.delete_endpoint()
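
A deployed endpoint keeps incurring charges until it is deleted, so it can be worth attempting both deletions even if the first one fails. The sketch below is one way to do that. The `cleanup` helper and the `StubPredictor` used to demonstrate it are hypothetical; the only assumption about the real predictor is that it exposes the `delete_model` and `delete_endpoint` methods called above.

```python
def cleanup(predictor):
    """Try to delete the model and the endpoint; return errors instead of raising."""
    errors = []
    for step in (predictor.delete_model, predictor.delete_endpoint):
        try:
            step()
        except Exception as exc:  # keep going so the endpoint deletion is still attempted
            errors.append((step.__name__, exc))
    return errors


class StubPredictor:
    """Stand-in predictor whose model deletion fails, for demonstration only."""

    def __init__(self):
        self.calls = []

    def delete_model(self):
        self.calls.append("delete_model")
        raise RuntimeError("simulated failure")

    def delete_endpoint(self):
        self.calls.append("delete_endpoint")


stub = StubPredictor()
errors = cleanup(stub)
print(stub.calls, [name for name, _ in errors])
```

With the real predictor, `cleanup(model_predictor)` would return an empty list when both deletions succeed.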