# Summarizing legal documents with Hugging Face and Amazon SageMaker

In [1]:
# Number of parameters for flan-t5 family: small 80M, base 250M, large 780M, xl 3B, xxl 11B
model_id = "google/flan-t5-small"

# https://huggingface.co/datasets/cnn_dailymail
dataset_name, dataset_version = "cnn_dailymail", "3.0.0"

# Setup

In [2]:
!pip -q install transformers datasets sagemaker --upgrade

In [3]:
!pip -q install widgetsnbextension ipywidgets

In [4]:
import sagemaker

print(sagemaker.__version__)

sess = sagemaker.Session()
bucket = sess.default_bucket()

sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /home/ec2-user/.config/sagemaker/config.yaml
2.221.1


In [5]:
import transformers
import datasets

print(transformers.__version__)
print(datasets.__version__)

4.41.2
2.19.1


# Preprocessing

## Load dataset

In [6]:
from datasets import load_dataset, load_from_disk

dataset = load_dataset(dataset_name, dataset_version)
dataset

DatasetDict({
    train: Dataset({
        features: ['article', 'highlights', 'id'],
        num_rows: 287113
    })
    validation: Dataset({
        features: ['article', 'highlights', 'id'],
        num_rows: 13368
    })
    test: Dataset({
        features: ['article', 'highlights', 'id'],
        num_rows: 11490
    })
})

## Preprocess dataset

In [7]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(model_id)

prefix = "summarize: "
input_max_length = 1024
output_max_length = 128


def preprocess_function(examples):
    inputs = [prefix + doc for doc in examples["article"]]
    model_inputs = tokenizer(inputs, max_length=input_max_length, truncation=True)
    labels = tokenizer(
        text_target=examples["highlights"], max_length=output_max_length, truncation=True
    )
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

In [8]:
tokenized_dataset = dataset.map(
    preprocess_function, batched=True, remove_columns=["article", "highlights", "id"]
)

In [9]:
# tokenized_dataset.save_to_disk(f"billsum-t5-tokenized")

# Upload processed dataset to S3

In [7]:
from datasets.filesystems import S3FileSystem

s3 = S3FileSystem()

s3_prefix = "huggingface/billsum-t5-summarization"
bucket = "pupka"
dataset_input_path = "s3://{}/{}".format(bucket, s3_prefix)
train_input_path = "{}/train".format(dataset_input_path)
valid_input_path = "{}/validation".format(dataset_input_path)

print(dataset_input_path)
print(train_input_path)
print(valid_input_path)

s3://pupka/huggingface/billsum-t5-summarization
s3://pupka/huggingface/billsum-t5-summarization/train
s3://pupka/huggingface/billsum-t5-summarization/validation


In [11]:
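# Note: the test split is written to the validation path, so it is what the training job receives as its "valid" channel.
# tokenized_dataset["train"].save_to_disk(train_input_path, fs=s3)
# tokenized_dataset["test"].save_to_disk(valid_input_path, fs=s3)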

In [12]:
%%sh -s $dataset_input_path
aws s3 ls --recursive $1

2024-05-30 23:40:15  424111416 huggingface/billsum-t5-summarization/train/data-00000-of-00003.arrow
2024-05-30 23:40:29  460139096 huggingface/billsum-t5-summarization/train/data-00001-of-00003.arrow
2024-05-30 23:40:47  453302224 huggingface/billsum-t5-summarization/train/data-00002-of-00003.arrow
2024-05-30 23:41:04       2195 huggingface/billsum-t5-summarization/train/dataset_info.json
2024-05-30 23:41:04        368 huggingface/billsum-t5-summarization/train/state.json
2024-05-30 23:41:04   52967328 huggingface/billsum-t5-summarization/validation/data-00000-of-00001.arrow
2024-05-30 23:41:06       2195 huggingface/billsum-t5-summarization/validation/dataset_info.json
2024-05-30 23:41:06        249 huggingface/billsum-t5-summarization/validation/state.json


In [None]:
train_ds = load_from_disk(train_input_path)
valid_ds = load_from_disk(valid_input_path)

# Fine-tune on SageMaker with a Hugging Face Deep Learning Container

In [13]:
!pygmentize train.py

import argparse
import logging
import os

import evaluate
import numpy as np
from datasets import load_from_disk
from transformers import (
    AutoModelForSeq2SeqLM,
    AutoTokenizer,
    DataCollatorForSeq2Seq,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)

rouge = evaluate.load("rouge")


def compute_met

In [14]:
hyperparameters = {
    "epochs": 1,
    "learning-rate": 1e-6,
    "train-batch-size": 1,
    "eval-batch-size": 8,
    "model-name": model_id,
}
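In SageMaker script mode, each key in the `hyperparameters` dictionary is passed to the entry point as a command-line flag, and each input channel is exposed through an `SM_CHANNEL_<NAME>` environment variable. The sketch below shows how a script could read them with `argparse`. It is an assumption about how `train.py` might be organized, not a copy of it; the `--train-dir`, `--valid-dir`, and `--model-dir` argument names are hypothetical.

```python
import argparse
import os


def parse_args(argv=None):
    """Read hyperparameters and channel locations as a training script might."""
    parser = argparse.ArgumentParser()

    # Flags mirror the keys of the hyperparameters dictionary.
    parser.add_argument("--epochs", type=int, default=1)
    parser.add_argument("--learning-rate", type=float, default=1e-6)
    parser.add_argument("--train-batch-size", type=int, default=1)
    parser.add_argument("--eval-batch-size", type=int, default=8)
    parser.add_argument("--model-name", type=str, default="google/flan-t5-small")

    # Hypothetical argument names; the defaults come from SageMaker environment
    # variables for the "train" and "valid" channels and the model output folder.
    parser.add_argument("--train-dir", default=os.environ.get("SM_CHANNEL_TRAIN"))
    parser.add_argument("--valid-dir", default=os.environ.get("SM_CHANNEL_VALID"))
    parser.add_argument("--model-dir", default=os.environ.get("SM_MODEL_DIR"))

    # parse_known_args ignores any extra flags the platform may add.
    args, _ = parser.parse_known_args(argv)
    return args


args = parse_args(
    ["--epochs", "3", "--learning-rate", "5e-5", "--model-name", "google/flan-t5-base"]
)
print(args.epochs, args.learning_rate, args.train_batch_size, args.model_name)
```

Note that `argparse` turns dashes into underscores, so `learning-rate` becomes `args.learning_rate` inside the script.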

In [36]:
from sagemaker.huggingface import HuggingFace

huggingface_estimator = HuggingFace(
    role=sagemaker.get_execution_role(),
    # Fine-tuning script
    entry_point="train.py",
    dependencies=["requirements.txt"],
    hyperparameters=hyperparameters,
    # Infrastructure
    transformers_version="4.26.0",
    pytorch_version="1.13.1",
    py_version="py39",
    instance_type="ml.p3.16xlarge",
    instance_count=1,
    distribution={"smdistributed": {"dataparallel": {"enabled": True}}},
)

In [None]:
huggingface_estimator.fit({"train": train_input_path, "valid": valid_input_path})

INFO:sagemaker.image_uris:image_uri is not presented, retrieving image_uri based on instance_type, framework etc.
INFO:sagemaker:Creating training-job with name: huggingface-pytorch-training-2024-05-31-00-19-17-900


2024-05-31 00:19:18 Starting - Starting the training job...
2024-05-31 00:19:18 Pending - Training job waiting for capacity............
2024-05-31 00:21:43 Pending - Preparing the instances for training......
2024-05-31 00:22:37 Downloading - Downloading input data......
2024-05-31 00:23:42 Downloading - Downloading the training image...............
2024-05-31 00:26:07 Training - Training image download completed. Training in progress....
bash: cannot set terminal process group (-1): Inappropriate ioctl for device
bash: no job control in this shell
2024-05-31 00:26:39,865 sagemaker-training-toolkit INFO     Imported framework sagemaker_pytorch_container.training
2024-05-31 00:26:39,927 sagemaker-training-toolkit INFO     No Neurons detected (normal if no neurons installed)
2024-05-31 00:26:39,940 sagemaker_pytorch_container.training INFO     Block until all host DNS lookups succeed.
2024-05-31 00:26:39,943 sagemaker_pytorch_container.tra

In [None]:
huggingface_estimator.model_data

's3://sagemaker-us-east-1-381491949871/huggingface-pytorch-training-2024-05-31-00-19-17-900/output/model.tar.gz'

# Deploy on SageMaker with a Hugging Face Deep Learning Container

In [None]:
# huggingface_predictor = huggingface_estimator.deploy(
#     initial_instance_count=1, instance_type="ml.p3.2xlarge"
# )

In [45]:
from sagemaker.serverless.serverless_inference_config import ServerlessInferenceConfig

serverless_config = ServerlessInferenceConfig(
    memory_size_in_mb=2048,
    max_concurrency=1,
)

In [46]:
%%time

huggingface_predictor = huggingface_estimator.deploy(serverless_inference_config=serverless_config)

INFO:sagemaker.image_uris:Defaulting to CPU type when using serverless inference
INFO:sagemaker:Creating model with name: huggingface-pytorch-training-2024-05-31-02-12-02-948
INFO:sagemaker:Creating endpoint-config with name huggingface-pytorch-training-2024-05-31-02-12-02-948
INFO:sagemaker:Creating endpoint with name huggingface-pytorch-training-2024-05-31-02-12-02-948


----!
CPU times: user 77.2 ms, sys: 10.1 ms, total: 87.3 ms
Wall time: 2min 31s


In [55]:
dataset['test'][12]['article']

'(CNN)Seventy years ago, Anne Frank died of typhus in a Nazi concentration camp at the age of 15. Just two weeks after her supposed death on March 31, 1945, the Bergen-Belsen concentration camp where she had been imprisoned was liberated -- timing that showed how close the Jewish diarist had been to surviving the Holocaust. But new research released by the Anne Frank House shows that Anne and her older sister, Margot Frank, died at least a month earlier than previously thought. Researchers re-examined archives of the Red Cross, the International Training Service and the Bergen-Belsen Memorial, along with testimonies of survivors. They concluded that Anne and Margot probably did not survive to March 1945 -- contradicting the date of death which had previously been determined by Dutch authorities. In 1944, Anne and seven others hiding in the Amsterdam secret annex were arrested and sent to the  Auschwitz-Birkenau concentration camp. Anne Frank\'s final entry . That same year, Anne and Ma

In [62]:
test_data['inputs']

['summarize: : (CNN)Seventy years ago, Anne Frank died of typhus in a Nazi concentration camp at the age of 15. Just two weeks after her supposed death on March 31, 1945, the Bergen-Belsen concentration camp where she had been imprisoned was liberated -- timing that showed how close the Jewish diarist had been to surviving the Holocaust. But new research released by the Anne Frank House shows that Anne and her older sister, Margot Frank, died at least a month earlier than previously thought. Researchers re-examined archives of the Red Cross, the International Training Service and the Bergen-Belsen Memorial, along with testimonies of survivors. They concluded that Anne and Margot probably did not survive to March 1945 -- contradicting the date of death which had previously been determined by Dutch authorities. In 1944, Anne and seven others hiding in the Amsterdam secret annex were arrested and sent to the  Auschwitz-Birkenau concentration camp. Anne Frank\'s final entry . That same yea

In [66]:
test_data = {
    "inputs": [
        # prefix already ends with ": ", so adding another colon yields "summarize: : ..."
        f"{prefix}{dataset['test'][12]['article']}",
        f"{prefix}{dataset['test'][13]['article']}",
    ]
}

In [76]:
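# df_test is not defined earlier in this notebook; it is assumed to be a pandas
# DataFrame holding the test split (the "Unnamed: 0" column below suggests a CSV export).
test_samples = df_test.head(10)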

In [77]:
test_samples

| Unnamed: 0 | article | highlights | id |
|---|---|---|---|
| 0 | (CNN)The Palestinian Authority officially beca... | Membership gives the ICC jurisdiction over all... | f001ec5c4704938247d27a44948eebb37ae98d01 |
| 1 | (CNN)Never mind cats having nine lives. A stra... | Theia, a bully breed mix, was apparently hit b... | 230c522854991d053fe98a718b1defa077a8efef |
| 2 | (CNN)If you've been following the news lately,... | Mohammad Javad Zarif has spent more time with ... | 4495ba8f3a340d97a9df1476f8a35502bcce1f69 |
| 3 | (CNN)Five Americans who were monitored for thr... | 17 Americans were exposed to the Ebola virus w... | a38e72fed88684ec8d60dd5856282e999dc8c0ca |
| 4 | (CNN)A Duke student has admitted to hanging a ... | Student is no longer on Duke University campus... | c27cf1b136cc270023de959e7ab24638021bc43f |
| 5 | (CNN)He's a blue chip college basketball recru... | College-bound basketball star asks girl with D... | 1b2cc634e2bfc6f2595260e7ed9b42f77ecbb0ce |
| 6 | (CNN)Governments around the world are using th... | Amnesty's annual death penalty report catalogs... | e2706dce6cf26bc61b082438188fdb6e130d9e40 |
| 7 | (CNN)Andrew Getty, one of the heirs to billion... | Andrew Getty's death appears to be from natura... | 0d3c8c276d079c4c225f034c69aa024cdab7869d |
| 8 | (CNN)Filipinos are being warned to be on guard... | Once a super typhoon, Maysak is now a tropical... | 6222f33c2c79b80be437335eeb3f488509e92cf5 |
| 9 | (CNN)For the first time in eight years, a TV l... | Bob Barker returned to host "The Price Is Righ... | 2bd8ada1de6a7b02f59430cc82045eb8d29cf033 |


In [87]:
test_data = {"inputs": test_samples['article'].apply(lambda x: f"{prefix}{x}").tolist()}
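The payload above sends all ten prefixed articles in a single request. With a serverless endpoint limited to `max_concurrency=1`, it can be more manageable to split the inputs into smaller requests. The helper below is one possible sketch; the name `build_payloads` and the default batch size are assumptions for illustration, not part of the SageMaker SDK.

```python
def build_payloads(articles, prefix="summarize: ", batch_size=4):
    """Split articles into request payloads of at most batch_size inputs each."""
    payloads = []
    for start in range(0, len(articles), batch_size):
        batch = articles[start:start + batch_size]
        # Each payload has the same {"inputs": [...]} shape used for test_data.
        payloads.append({"inputs": [f"{prefix}{text}" for text in batch]})
    return payloads


# Placeholder strings stand in for real articles.
articles = [f"article {i}" for i in range(10)]
payloads = build_payloads(articles, batch_size=4)
print([len(p["inputs"]) for p in payloads])
```

Each payload could then be passed to `huggingface_predictor.predict(...)` in turn and the returned lists concatenated.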

In [4]:
prediction = huggingface_predictor.predict(test_data)
print(prediction)

In [89]:
prediction

[{'generated_text': 'Palestinians signed the Rome Statute in January, paving the way for possible war crimes investigations'},
 {'generated_text': 'Theia, a stray dog, has been buried in a field'},
 {'generated_text': 'Mohammad Javad Zarif is the Iranian foreign minister. He is the Iranian foreign minister'},
 {'generated_text': 'Five Americans who were monitored for three weeks at an Omaha, Nebraska hospital after being exposed to E'},
 {'generated_text': 'Duke student has admitted to hanging a noose made of rope from a tree'},
 {'generated_text': "Trey Moses and Ellie Meredith were invited to Eastern High School's prom."},
 {'generated_text': 'Amnesty International says it is shameful that so many states around the world are '},
 {'generated_text': 'Andrew Getty, 47, appears to have died of natural causes. He was found on his'},
 {'generated_text': 'Tropical storm Maysak is expected to make landfall Sunday morning on the southeastern coast'},
 {'generated_text': 'Bob Barker, who host

In [44]:
huggingface_predictor.delete_endpoint()

INFO:sagemaker:Deleting endpoint configuration with name: huggingface-pytorch-training-2024-05-31-02-07-07-938
INFO:sagemaker:Deleting endpoint with name: huggingface-pytorch-training-2024-05-31-02-07-07-938


In [None]:
import json
from sagemaker.predictor import Predictor

predictor = Predictor(endpoint_name="huggingface-pytorch-training-2024-05-31-02-12-02-948")

sample_1 = '(CNN)Seventy years ago, Anne Frank died of typhus in a Nazi concentration camp at the age of 15. Just two weeks after her supposed death on March 31, 1945, the Bergen-Belsen concentration camp where she had been imprisoned was liberated -- timing that showed how close the Jewish diarist had been to surviving the Holocaust. But new research released by the Anne Frank House shows that Anne and her older sister, Margot Frank, died at least a month earlier than previously thought. Researchers re-examined archives of the Red Cross, the International Training Service and the Bergen-Belsen Memorial, along with testimonies of survivors. They concluded that Anne and Margot probably did not survive to March 1945 -- contradicting the date of death which had previously been determined by Dutch authorities. In 1944, Anne and seven others hiding in the Amsterdam secret annex were arrested and sent to the  Auschwitz-Birkenau concentration camp. Anne Frank\'s final entry . That same year, Anne and Margot were separated from their mother and sent away to work as slave labor at the Bergen-Belsen camp in Germany. Days at the camp were filled with terror and dread, witnesses said. The sisters stayed in a section of the overcrowded camp with no lighting, little water and no latrine. They slept on lice-ridden straw and violent storms shredded the tents, according to the researchers. Like the other prisoners, the sisters endured long hours at roll call. Her classmate, Nannette Blitz, recalled seeing Anne there in December 1944: "She was no more than a skeleton by then. She was wrapped in a blanket; she couldn\'t bear to wear her clothes anymore because they were crawling with lice." Listen to Anne Frank\'s friends describe her concentration camp experience . As the Russians advanced further, the Bergen-Belsen concentration camp became even more crowded, bringing more disease. A deadly typhus outbreak caused thousands to die each day. Typhus is an infectious disease caused by lice that breaks out in places with poor hygiene. The disease causes high fever, chills and skin eruptions. "Because of the lice infesting the bedstraw and her clothes, Anne was exposed to the main carrier of epidemic typhus for an extended period," museum researchers wrote. They concluded that it\'s unlikely the sisters survived until March, because witnesses at the camp said the sisters both had symptoms before February 7. "Most deaths caused by typhus occur around twelve days after the first symptoms appear," wrote  authors Erika Prins and Gertjan Broek. The exact dates of death for Anne and Margot remain unclear. Margot died before Anne. "Anne never gave up hope," said Blitz, her friend. "She was absolutely convinced she would survive." Her diary endures as one of the world\'s most popular books. Read more about Anne Frank\'s cousin, a keeper of her legacy .'

sample_2 = "Nvidia on Sunday unveiled its next generation of artificial intelligence chips to succeed the previous model, which was announced just months earlier in March. Nvidia CEO Jensen Huang announced the new AI chip architecture, dubbed “Rubin,” ahead of the COMPUTEX tech conference in Taipei. Rubin comes months after the March announcement of the upcoming “Blackwell” model, which is still in production and expected to ship to customers later in 2024 Huang’s announcement of Rubin appears to quicken the company’s already-accelerated pace of AI chip advancement. Nvidia has pledged to release new AI chip models on a “one-year rhythm,” as Huang put it on Sunday. The company had previously been operating on a slower two-year update timeline for chips. The turnaround from Blackwell to Rubin was a matter of less than three months, underscoring the competitive frenzy in the AI chip market and Nvidia’s sprint to preserve its dominant spot. AMD and Intel are two major competitors working to catch up, though their gross margins trailed Nvidia’s in the most recent fiscal quarter. Companies like Microsoft , Google and Amazon are also vying for Nvidia’s top spot, even as they are simultaneously some of Nvidia’s biggest patrons. A flurry of startups are also working to enter the space. “Today, we’re at the cusp of a major shift in computing,” Huang said Sunday. “With our innovations in AI and accelerated computing, we’re pushing the boundaries of what’s possible and driving the next wave of technological advancement.” The Rubin chip platform will have new GPUs, the crucial graphic processing technology that helps train and launch AI systems. It will come with other new features like a central processor called “Vera,” though the Sunday announcement did not provide many details."

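# Define the prompts, adding the same "summarize: " prefix that was used during preprocessing
prompt = [f"summarize: {text}" for text in (sample_1, sample_2)]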

# Set up the payload with generation parameters
payload = {
    "inputs": prompt,
    "parameters": {
        "do_sample": True,          # Enable sampling
        "temperature": 0.7,         # Set the creativity of the response
        "top_p": 0.7,               # Use nucleus sampling with cumulative probability of 0.7
        "top_k": 50,                # Limit the number of high probability tokens considered
        "max_length": 256,          # Limit the response to 256 tokens
        "repetition_penalty": 1.03, # Slightly discourage repetition
    }
}

# Convert the dictionary to JSON and encode it to bytes
data = json.dumps(payload)
encoded_data = data.encode('utf-8')

# Send the request and receive the response
response = predictor.predict(encoded_data, initial_args={"ContentType": "application/json"})

# The default Predictor returns raw bytes; decode the JSON and print each generated text
for item in json.loads(response):
    print(item["generated_text"])
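The generation parameters in the payload (`temperature`, `top_k`, `top_p`) all reshape the next-token distribution before a token is sampled. The sketch below illustrates the idea on a toy vocabulary in pure Python. It is a simplified illustration, not the code that runs inside the endpoint; the function name `filter_distribution` and the toy logits are made up for this example.

```python
import math


def filter_distribution(logits, temperature=0.7, top_k=50, top_p=0.7):
    """Apply temperature, then top-k, then top-p filtering to toy logits."""
    # Temperature below 1 sharpens the distribution; above 1 flattens it.
    scaled = {tok: value / temperature for tok, value in logits.items()}

    # Numerically stable softmax over the scaled logits.
    peak = max(scaled.values())
    weights = {tok: math.exp(value - peak) for tok, value in scaled.items()}
    total = sum(weights.values())
    probs = {tok: w / total for tok, w in weights.items()}

    # Top-k: keep only the k most probable tokens.
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
    ranked_total = sum(p for _, p in ranked)

    # Top-p: keep the smallest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for tok, p in ranked:
        kept.append((tok, p))
        cumulative += p / ranked_total
        if cumulative >= top_p:
            break

    # Renormalize so the surviving tokens form a valid distribution.
    kept_total = sum(p for _, p in kept)
    return {tok: p / kept_total for tok, p in kept}


toy_logits = {"a": 2.0, "b": 1.0, "c": 0.0, "d": -1.0}
result = filter_distribution(toy_logits, temperature=1.0, top_k=3, top_p=0.8)
print(sorted(result))
```

With `do_sample` enabled, a token is drawn at random from the filtered distribution, so repeated requests with the same input can return different summaries.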