# BART News article highlights

The aim of this project is to refresh NLP methods and to explore ways of improving a pre-trained BART model so that it gives shorter highlights of news articles.

The dataset used for this project is [CNN_DailyMail](https://huggingface.co/datasets/abisee/cnn_dailymail) from Hugging Face Datasets.

In [1]:
# load imports

import huggingface
import torch
from transformers import (
    AutoModelForSeq2SeqLM,
    AutoTokenizer,
    BartForConditionalGeneration,
    DataCollatorForSeq2Seq,
    Trainer,
    TrainingArguments,
)
import time
import datasets
from datasets import load_dataset
from sklearn.model_selection import train_test_split
import evaluate

from tqdm import tqdm

import random
import pandas as pd
from IPython.display import display, HTML




#### Load Dataset

In [2]:
dataset = load_dataset("abisee/cnn_dailymail", "1.0.0")
dataset

DatasetDict({
    train: Dataset({
        features: ['article', 'highlights', 'id'],
        num_rows: 287113
    })
    validation: Dataset({
        features: ['article', 'highlights', 'id'],
        num_rows: 13368
    })
    test: Dataset({
        features: ['article', 'highlights', 'id'],
        num_rows: 11490
    })
})

#### Check Dataset

In [3]:
def show_random_elements(dataset, num_examples=2):
    picks = []
    for _ in range(num_examples):
        pick = random.randint(0, len(dataset)-1)
        while pick in picks:
            pick = random.randint(0, len(dataset)-1)
        picks.append(pick)
    
    df = pd.DataFrame(dataset[picks])
    for column, typ in dataset.features.items():
        if isinstance(typ, datasets.ClassLabel):
            df[column] = df[column].transform(lambda i: typ.names[i])
    display(HTML(df.to_html()))


show_random_elements(dataset["train"])

Unnamed: 0,article,highlights,id
0,"In the Arizona desert the first of Qantas's aged fleet of Boeing 767s jets are landing ...for the last time. It's called the 'great aeroplane graveyard' and like thousands of decommissioned military and passenger craft before them, it's where Qantas's reliable but costly 767 'workhorses' will retire. In the dust of a former airforce base in Victorville, California. 'Picture hot winds and shifting desert sands, prairy dogs and rattlesnakes - and all you can hear is the hum of thousands of rusting propellers turning in the wind,' says 60 Minutes Presenter Charles Wooley, who recently took a ticket on an empty 767 to it's final resting place at the Southern California Logistics Airport. 'Let's just say I wouldn't recommend it as a tourist destination - it's a very eery place.' Scroll down for video . Aeroplane graveyard: Qantas's aged fleet of Boeing 767s retiring to a former airforce base in Victorville, California, USA. A birds eye view of the Victorville 'boneyard' for old planes in California, USA. Qantas announced plans for all nine of its remaining 767 fleet to be retired by the end of 2014, nine months earlier than originally planned. The jets have already outlived their predicted 2010 used-by date and are now too costly to keep in the air. In their place, Qantas have chosen the more fuel-efficient 787 Dreamliners. Television Presenter Charles Wooley enjoyed the cabin to himself on one of the last 787 flights, named 'Charlie 1' by the crew. 'It was a bit like the Twilight Zone, I was half-expecting to see a monkey appear on the wing,' he said. 'Actually it was an all-female crew, it was symbolic since the making of the Boeing 767 in the 1980s coincided with the time women really began flying passenger planes.' 'When we landed, I met all my old friends out there. I must have been on those jets more than a hundred times over the years with 60 minutes,' says Wooley. 'These were masterpieces of 20th century engineering.' What will happen to them now? 
Engineers will 'rip the guts out of some', others will be wrapped up to keep their engines intact. 'One guy out there says to me: ""Well Charlie, some might be picked up by a billionaire, but I'd say most will end up as beer cans,'"" says Wooley. Scroll down for video . The Flying Kangaroos in Arizona now look much like any other old jet, with the bounding marsupials on their tails painted out in red. Wooley says it was hard to film at the graveyard because 'Boeing shareholders don't like seeing all these mothballs sitting around.' 'Airlines, like TV companies these days, are never too far away from bankruptcy - but it's all about fuel, cheaper planes that are much more economical to continue to fly,' he says. And it's not all bad news for the airline industry. According to Wooley, Boeing are producing 42 new Dreamliner craft at their plant in Seattle every month. Presenter Charles Wooley and the crew . Between 1941 to 1992, the Southern California Logistics Airport at Victorville was used as a frontline United States Air Force base . Old commercial aircraft sit in the sand at the Southern California Logistics Airport in Victorville, California . Television Presenter Charles Wooley and 'Charlie One', the Qantas Boeing 737 he accompanied to the Arizona 'graveyard' In Australia, the final Qantas Boeing 767 flight will run from Melbourne to Sydney on December 27 - but residents of other Australian cities still have one last chance to grab a ride on the big bird between November and December this year. Join Charles Wooley onboard when 'Plane Graveyard' airs on 60 Minutes this Sunday, 21 September on Channel Nine.","Qantas's nine reliable but costly 767 'workhorses' are being retired to the Arizona desert . The airline is replacing them with the more fuel efficient 787 Dreamliners . Once on the ground, engineers strip the planes for parts . The fleet will be decommissioned by the end of 2014 .",e6d26a100b784f725032e9a4247365016a909718
1,"By . Larisa Brown . PUBLISHED: . 20:26 EST, 1 October 2013 . | . UPDATED: . 20:33 EST, 1 October 2013 . In days gone by, they were mainly the preserve of Del Boy-style wheeler-dealers. But now the middle-classes are snapping up fake designer handbags, watches and DVDs. More than half of all consumers have admitted buying counterfeit goods, according to a recent study. The research was conducted by consultants from PwC who wanted to explore public attitudes towards the growth in industry of counterfeiting. Buying fake goods used to be preserve of Del Boy-style wheeler-dealers but has now 'gone mainstream' according to a new survey on attitudes to the counterfeiting industry . They concluded fake goods ‘have gone mainstream’, leaving manufacturers of the genuine articles out of pocket. While 90 per cent of the public said they considered buying fake goods ‘morally wrong’, it did not stop them acquiring a knock-off watch or Gucci handbag. In total, 53 per cent of consumers admitted they had bought counterfeit goods – many blaming the fact they could not afford the real thing. Young people were the worst offenders, with almost two thirds buying fakes such as jewellery and designer clothes. Affluent professional and managerial classes were just as likely to buy fakes as unskilled workers and those on benefits. Londoners and people from Northern Ireland were found to be the worst offenders, with Scots the least likely to give in to temptation. More than half of all consumers have admitted buying counterfeit goods, such as DVDs (file picture) Fake designer watches and bags were also popular buys as people cannot afford the real thing (file picture) Mark James, head of anti-counterfeiting . at PwC, told The Times newspaper technological advances and more . fragmented supply chains were behind the fake goods flooding Britain. He said: ‘Just as it has never been easier to start your own business, it has also never been easier to start your own counterfeiting business. 
‘Not surprisingly, the study found that most people go to the internet to buy fakes, where things are harder to track.’ One of the key drivers behind fake designer clothing and footwear was identified as the rise in the number of websites claiming to offer ‘discount’ or ‘cheap’ luxury goods. Shoppers are even advised of the best sellers of counterfeit goods on some of the sites. Mr James added that the British public seemed to be unaware that buying the occasional fake could do any harm. He said: ‘Consumers are just not thinking of it in those terms. Even when they are asked about deterrents, being caught and the morality of it come behind safety, the fear of losing bank details and the product not being delivered. ‘But there is a cost. Companies are having their reputations, their brands and their revenues stolen and that has a knock-on effect on jobs.’ The Government estimates that counterfeiting costs the economy £1.3 billion a year.",Survey by PwC found more than half of people say they bought fake goods . But 90 per cent said they considered buying fake goods ‘morally wrong' Londoners and people from Northern Ireland worst offenders .,94ce507b264bb3abf06354de4662abe24887cc79


These examples show that the news articles are summarised fairly succinctly, giving just the main highlights of each.
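To put a rough number on how succinct the highlights are, word and sentence counts can be compared between highlights and articles. The sketch below is illustrative only: `text_stats` is a hypothetical helper, and its sentence splitting is a simple heuristic that will over-count abbreviations such as "Capt.".

```python
import re


def text_stats(text):
    """Return (word_count, approx_sentence_count) for a piece of text."""
    # Words: runs of letters, digits or underscores (rough; "don't" counts as two).
    words = re.findall(r"\w+", text)
    # Sentences: split on '.', '!' or '?' followed by whitespace or end of string.
    sentences = [s for s in re.split(r"[.!?](?:\s+|$)", text) if s.strip()]
    return len(words), len(sentences)


sample_highlight = (
    "Survey by PwC found more than half of people say they bought fake goods . "
    "Londoners and people from Northern Ireland worst offenders ."
)
word_count, sentence_count = text_stats(sample_highlight)
print(f"words={word_count}, sentences={sentence_count}")
```

Applied to a slice such as `dataset["train"][:1000]["highlights"]`, the same helper would give an average highlight length to compare against the model's outputs later on.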

#### Preprocess and Tokenize Dataset

In [4]:
# Load model and tokenizer
model = BartForConditionalGeneration.from_pretrained("facebook/bart-large-cnn")
tokenizer = AutoTokenizer.from_pretrained("facebook/bart-large-cnn")



In [17]:
def preprocess_function(examples):
    '''Function to tokenize all of the articles and labels in the dataset in advance'''
    inputs = [doc for doc in examples["article"]]
    
    model_inputs = tokenizer(inputs, max_length=1024, truncation=True)

    # Setup the tokenizer for targets
    labels = tokenizer(text_target=examples["highlights"], max_length=128, truncation=True)

    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

    

In [18]:
dataset

DatasetDict({
    train: Dataset({
        features: ['article', 'highlights', 'id'],
        num_rows: 287113
    })
    validation: Dataset({
        features: ['article', 'highlights', 'id'],
        num_rows: 13368
    })
    test: Dataset({
        features: ['article', 'highlights', 'id'],
        num_rows: 11490
    })
})

In [19]:
# test preprocess function is tokenizing correctly
original_article = dataset['train'][:2]['article']

tokenized_example = preprocess_function(dataset['train'][:2])
detokenized_example = tokenizer.decode(tokenized_example['input_ids'][0], skip_special_tokens=True)

print(150 * "-", "\n Original article:\n")
print(original_article)
print("\n\n", 150 * "-", "\n Detokenized article:\n")
print(detokenized_example)

------------------------------------------------------------------------------------------------------------------------------------------------------ 
 Original article:

['LONDON, England (Reuters) -- Harry Potter star Daniel Radcliffe gains access to a reported £20 million ($41.1 million) fortune as he turns 18 on Monday, but he insists the money won\'t cast a spell on him. Daniel Radcliffe as Harry Potter in "Harry Potter and the Order of the Phoenix" To the disappointment of gossip columnists around the world, the young actor says he has no plans to fritter his cash away on fast cars, drink and celebrity parties. "I don\'t plan to be one of those people who, as soon as they turn 18, suddenly buy themselves a massive sports car collection or something similar," he told an Australian interviewer earlier this month. "I don\'t think I\'ll be particularly extravagant. "The things I like buying are things that cost about 10 pounds -- books and CDs and DVDs." At 18, Radcliffe will be abl

#### Tokenize and preprocess the whole dataset

In [20]:

tokenized_datasets = dataset.map(preprocess_function, batched=True)


Map: 100%|██████████| 287113/287113 [04:02<00:00, 1184.30 examples/s]
Map: 100%|██████████| 13368/13368 [00:11<00:00, 1131.48 examples/s]
Map: 100%|██████████| 11490/11490 [00:10<00:00, 1132.79 examples/s]


#### Test out pre-trained BART

In [21]:
# Testing out the pre-trained BART without any additional prompt information

example_indices = [20, 50]
dash_line = 100 * '-'


for i, index in enumerate(example_indices):
    article = dataset['test'][index]['article']
    summary = dataset['test'][index]['highlights']

    
    inputs = tokenizer(article, max_length=1024, truncation=True, return_tensors='pt')
    output = tokenizer.decode(
        model.generate(
            inputs["input_ids"], 
            max_new_tokens=250,
        )[0], 
        skip_special_tokens=True
    )
    
    print(dash_line)
    print('Example ', i + 1)
    print(dash_line)
    print(f'INPUT PROMPT:\n{article}')
    print(dash_line)
    print(f'Human labelled highlight:\n{summary}')
    print(dash_line)
    print(f'Pre-Trained BART model generation - WITHOUT PROMPT ENGINEERING:\n{output}\n')

----------------------------------------------------------------------------------------------------
Example  1
----------------------------------------------------------------------------------------------------
INPUT PROMPT:
Norfolk, Virginia (CNN)The second mate of the Houston Express probably couldn't believe what he was seeing. Hundreds of miles from land there was a small boat nearby. At first it looked abandoned. It was in bad shape, listing to one side. The crew of the 1,000-foot long container ship thought it was a yacht that had wrecked. Incredibly, as they got closer, they saw there was a man on it, signaling for help. "He was moving, walking around, waving to us and in surprisingly good condition," Capt. Thomas Grenz told CNN by phone Friday. That man, Louis Jordan, 37, had an amazing story. He'd been drifting on the 35-foot Pearson sailboat for more than two months since leaving Conway, South Carolina, to fish in the ocean. Just a few days into his trip, a storm capsized h

In [22]:
# Set up evaluation metric 
import evaluate
rouge = evaluate.load("rouge")


### First we should get some baseline metrics for the original model

In [23]:
# simple prefix

example_indices = [20, 50]

for i, index in enumerate(example_indices):
    article = dataset['test'][index]['article']
    summary = dataset['test'][index]['highlights']

    prefix = 'Summarize the article in one or two sentences? :  '
    
    inputs = tokenizer(prefix + article, max_length=1024, truncation=True, return_tensors='pt')
    output = tokenizer.decode(
        model.generate(
            inputs["input_ids"], 
            max_new_tokens=250,
        )[0], 
        skip_special_tokens=True
    )
    
    print(dash_line)
    print('Example ', i + 1)
    print(dash_line)
    print(f'INPUT PROMPT:\n{article}')
    print(dash_line)
    print(f'Human generated highlight:\n{summary}')
    print(dash_line)
    print(f'Pre-Trained BART GENERATION - WITH SIMPLE PROMPT ENGINEERING:\n{output}\n')

----------------------------------------------------------------------------------------------------
Example  1
----------------------------------------------------------------------------------------------------
INPUT PROMPT:
Norfolk, Virginia (CNN)The second mate of the Houston Express probably couldn't believe what he was seeing. Hundreds of miles from land there was a small boat nearby. At first it looked abandoned. It was in bad shape, listing to one side. The crew of the 1,000-foot long container ship thought it was a yacht that had wrecked. Incredibly, as they got closer, they saw there was a man on it, signaling for help. "He was moving, walking around, waving to us and in surprisingly good condition," Capt. Thomas Grenz told CNN by phone Friday. That man, Louis Jordan, 37, had an amazing story. He'd been drifting on the 35-foot Pearson sailboat for more than two months since leaving Conway, South Carolina, to fish in the ocean. Just a few days into his trip, a storm capsized h

#### Define model evaluation function

In [24]:
def evaluate_model_performance(model, mode="instruction", example_index = 125):

    # Subset of dataset for evaluation
    articles = dataset['test'][0:100]['article']
    highlights = dataset['test'][0:100]['highlights']
    
    model_summaries = []
    
    for i, article in enumerate(tqdm(articles, desc="Evaluating")):
        
        # Building the prompt based on the mode
        if mode == "zero shot":
            # just asking for a summary
            prompt = f"Summarize in one or two sentences the following article:\n\n{article}\n\nSummary:"
        
        elif mode == 'one shot' and example_index is not None:
            # Provide one example and then ask for a summary
            example_article = dataset['test'][example_index]['article']
            example_highlight = dataset['test'][example_index]['highlights']

            prompt = f"""
Task: Summarise the following new article in one or two sentences
Article:
            
{article}
"""
            # showing the model an example in the prompt (note: because inputs are so long this might not be as effective)    
            prompt += 'Below are examples of article highlights'
            
            prompt += f"""
Example Article:
{example_article}
            
Example Summary:
{example_highlight}
"""        
            
        elif mode == "instruction":
            # using a task specific instruction
            prompt = f"""
Summarize in one or two sentences the following article:      
{article}
            
Summary:
            """
        
        else:
            raise ValueError(f"Unsupported mode: {mode}")
        
        input_ids = tokenizer(prompt, return_tensors="pt", truncation=True, max_length=1024).input_ids

        model_outputs = model.generate(input_ids=input_ids, max_new_tokens=200, num_beams=5)
        model_text_output = tokenizer.decode(model_outputs[0], skip_special_tokens=True)
        model_summaries.append(model_text_output)
    
    zipped_summaries = list(zip(highlights, model_summaries))
    df = pd.DataFrame(zipped_summaries, columns=['Highlights', 'Model_summaries'])
    #print(df.head(5))

    # Compute ROUGE scores
    model_results = rouge.compute(
        predictions=model_summaries,
        references=highlights[0:len(model_summaries)],
        use_aggregator=True,
        use_stemmer=True,
    )
    
    return model_results


In [25]:
evaluate_model_performance(model, mode='zero shot')

Evaluating: 100%|██████████| 100/100 [11:19<00:00,  6.79s/it]


{'rouge1': np.float64(0.34357522962453174),
 'rouge2': np.float64(0.1398480893584535),
 'rougeL': np.float64(0.24936714584583836),
 'rougeLsum': np.float64(0.2494629813146576)}

The original model struggles to shorten its summaries below three or four sentences. The CNN_DailyMail dataset contains highlights that summarise each article in one or two sentences.

Even with some basic prompt engineering, the model still struggles to reduce the length further. Fine-tuning might improve this, which is what we will try next.
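One possible contributor to the long outputs is the checkpoint's own generation defaults. The training log further down lists `min_length: 56`, `max_length: 142` and `length_penalty: 2.0` as non-default generation parameters for this model; a minimum length of 56 tokens blocks the end-of-sequence token until that length is reached, which makes one-sentence highlights unlikely whatever the prompt says. The sketch below overrides those defaults at generation time. The helper names and the specific values (`min_length=10`, `max_new_tokens=60`, `length_penalty=1.0`) are assumptions chosen for illustration, not tuned settings.

```python
def short_highlight_kwargs(min_length=10, max_new_tokens=60,
                           length_penalty=1.0, num_beams=4):
    """Keyword arguments for `model.generate` that permit shorter outputs.

    The values are illustrative assumptions, not tuned settings.
    """
    return {
        "min_length": min_length,
        "max_new_tokens": max_new_tokens,
        "length_penalty": length_penalty,
        "num_beams": num_beams,
        "no_repeat_ngram_size": 3,
    }


def generate_short_highlight(model, tokenizer, article, **overrides):
    """Summarise one article, overriding the checkpoint's length defaults."""
    inputs = tokenizer(article, max_length=1024, truncation=True, return_tensors="pt")
    kwargs = {**short_highlight_kwargs(), **overrides}
    output_ids = model.generate(
        inputs["input_ids"],
        attention_mask=inputs["attention_mask"],
        **kwargs,
    )
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```

A call such as `generate_short_highlight(model, tokenizer, dataset['test'][20]['article'])` would then show whether the length settings, rather than the prompt, are what keeps the summaries long.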

### Basic instruction fine-tuning

A basic prefix is added before the article to give the model some additional context about the task.

With the zero-shot prompt, the original BART model gets a ROUGE-1 score of about 0.34 on the first 100 test articles, which shows moderate performance.
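ROUGE-1 measures unigram overlap between a generated summary and the reference highlight. The sketch below is a simplified version for intuition only: it lowercases and splits on whitespace, whereas the `rouge` metric loaded through `evaluate` uses its own tokenisation and optional stemming, so the numbers will not match exactly. `rouge1_f1` is a hypothetical helper name.

```python
from collections import Counter


def rouge1_f1(prediction, reference):
    """Unigram-overlap F1 between two strings (simplified ROUGE-1)."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        return 0.0
    # Multiset intersection counts each shared word at most as often as it
    # appears in both texts.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


reference = "qantas retires its 767 fleet"
short_pred = "qantas retires its 767 fleet"
long_pred = "qantas retires its 767 fleet to a desert graveyard in california this year"

print(rouge1_f1(short_pred, reference))
print(rouge1_f1(long_pred, reference))
```

The second prediction contains every reference word but still scores lower, because precision falls as the prediction grows. Summaries that run longer than the reference highlights are therefore penalised even when their content is right.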

In [26]:
dataset['test']

Dataset({
    features: ['article', 'highlights', 'id'],
    num_rows: 11490
})

In [27]:
tokenized_datasets

DatasetDict({
    train: Dataset({
        features: ['article', 'highlights', 'id', 'input_ids', 'attention_mask', 'labels'],
        num_rows: 287113
    })
    validation: Dataset({
        features: ['article', 'highlights', 'id', 'input_ids', 'attention_mask', 'labels'],
        num_rows: 13368
    })
    test: Dataset({
        features: ['article', 'highlights', 'id', 'input_ids', 'attention_mask', 'labels'],
        num_rows: 11490
    })
})

### Reduce training dataset size to decrease training time

In [28]:
# reduce the dataset to 1/25th of its size by keeping every 25th example
tokenized_dataset_reduced = tokenized_datasets.filter(lambda example, index: index % 25 == 0, with_indices=True)

Filter: 100%|██████████| 287113/287113 [03:35<00:00, 1334.12 examples/s]
Filter: 100%|██████████| 13368/13368 [00:09<00:00, 1407.32 examples/s]
Filter: 100%|██████████| 11490/11490 [00:08<00:00, 1392.78 examples/s]


In [29]:
tokenized_dataset_reduced

DatasetDict({
    train: Dataset({
        features: ['article', 'highlights', 'id', 'input_ids', 'attention_mask', 'labels'],
        num_rows: 11485
    })
    validation: Dataset({
        features: ['article', 'highlights', 'id', 'input_ids', 'attention_mask', 'labels'],
        num_rows: 535
    })
    test: Dataset({
        features: ['article', 'highlights', 'id', 'input_ids', 'attention_mask', 'labels'],
        num_rows: 460
    })
})
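The reduced row counts above are what keeping every 25th index would produce, which can be checked with plain arithmetic. The sketch also shows an index-based alternative to `filter`: `Dataset.select` takes a list or range of indices and is likely to be faster here because no Python function is evaluated per row. `every_nth_indices` and `subsample` are hypothetical helper names.

```python
def every_nth_indices(num_rows, step=25):
    """Indices 0, step, 2*step, ... that are below num_rows."""
    return range(0, num_rows, step)


def subsample(ds, step=25):
    """Keep every `step`-th row of a single `datasets.Dataset` split."""
    return ds.select(every_nth_indices(len(ds), step))


# Row counts of the full splits, copied from the dataset printout.
split_sizes = {"train": 287113, "validation": 13368, "test": 11490}
reduced_sizes = {name: len(every_nth_indices(n)) for name, n in split_sizes.items()}
print(reduced_sizes)
```

Because `select` is a method of an individual split, a `DatasetDict` would need it applied per split, for example `{name: subsample(split) for name, split in tokenized_datasets.items()}`.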

In [30]:
def print_number_of_trainable_model_parameters(model):
    trainable_model_params = 0
    all_model_params = 0
    for _, param in model.named_parameters():
        all_model_params += param.numel()
        if param.requires_grad:
            trainable_model_params += param.numel()
    return f"trainable model parameters: {trainable_model_params}\nall model parameters: {all_model_params}\npercentage of trainable model parameters: {100 * trainable_model_params / all_model_params:.2f}%"

print(print_number_of_trainable_model_parameters(model))

trainable model parameters: 406290432
all model parameters: 406290432
percentage of trainable model parameters: 100.00%


In [31]:
model

BartForConditionalGeneration(
  (model): BartModel(
    (shared): BartScaledWordEmbedding(50264, 1024, padding_idx=1)
    (encoder): BartEncoder(
      (embed_tokens): BartScaledWordEmbedding(50264, 1024, padding_idx=1)
      (embed_positions): BartLearnedPositionalEmbedding(1026, 1024)
      (layers): ModuleList(
        (0-11): 12 x BartEncoderLayer(
          (self_attn): BartSdpaAttention(
            (k_proj): Linear(in_features=1024, out_features=1024, bias=True)
            (v_proj): Linear(in_features=1024, out_features=1024, bias=True)
            (q_proj): Linear(in_features=1024, out_features=1024, bias=True)
            (out_proj): Linear(in_features=1024, out_features=1024, bias=True)
          )
          (self_attn_layer_norm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
          (activation_fn): GELUActivation()
          (fc1): Linear(in_features=1024, out_features=4096, bias=True)
          (fc2): Linear(in_features=4096, out_features=1024, bias=True)
    

In [32]:
# Load model and tokenizer
model = BartForConditionalGeneration.from_pretrained("facebook/bart-large-cnn")
tokenizer = AutoTokenizer.from_pretrained("facebook/bart-large-cnn")

In [33]:
# Freeze all encoder layers except the last 2
for i, layer in enumerate(model.model.encoder.layers):
    if i < len(model.model.encoder.layers) - 2:
        for param in layer.parameters():
            param.requires_grad = False

# Freeze all decoder layers except the last 2
for i, layer in enumerate(model.model.decoder.layers):
    if i < len(model.model.decoder.layers) - 2:
        for param in layer.parameters():
            param.requires_grad = False


In [34]:
print(print_number_of_trainable_model_parameters(model))

trainable model parameters: 112361472
all model parameters: 406290432
percentage of trainable model parameters: 27.66%
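The 27.66% figure can be reproduced by hand, which also shows where the trainable parameters sit. The layer shapes below are taken from the module printout above. The decoder layer layout (self-attention, cross-attention, feed-forward, three layer norms) is not visible in the truncated printout and is assumed here from the standard BART architecture.

```python
D_MODEL, D_FFN = 1024, 4096
VOCAB, POSITIONS = 50264, 1026
N_LAYERS, N_UNFROZEN = 12, 2


def linear(n_in, n_out):
    """Parameter count of a Linear layer with bias."""
    return n_in * n_out + n_out


attention = 4 * linear(D_MODEL, D_MODEL)          # k, v, q and out projections
layer_norm = 2 * D_MODEL                          # weight and bias
feed_forward = linear(D_MODEL, D_FFN) + linear(D_FFN, D_MODEL)

encoder_layer = attention + feed_forward + 2 * layer_norm
# Assumed decoder layout: self-attention + cross-attention + FFN + 3 layer norms.
decoder_layer = 2 * attention + feed_forward + 3 * layer_norm

# Parameters outside the layer stacks: shared token embedding, two learned
# positional embeddings, and one embedding layer norm each for encoder and decoder.
outside_layers = VOCAB * D_MODEL + 2 * POSITIONS * D_MODEL + 2 * layer_norm

total = N_LAYERS * (encoder_layer + decoder_layer) + outside_layers
trainable = N_UNFROZEN * (encoder_layer + decoder_layer) + outside_layers

print(total, trainable, f"{100 * trainable / total:.2f}%")
```

The freezing loops only touch `encoder.layers` and `decoder.layers`, so the embeddings stay trainable: roughly 53.6M of the 112M trainable parameters are outside the last two layers. If the intention is to train only the top layers, freezing `model.model.shared` and the positional embeddings as well would be one option to consider.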


In [35]:
output_dir = f'./article-highlight-training-{str(int(time.time()))}'


training_args = TrainingArguments(
    output_dir=output_dir,
    learning_rate=1e-5,  
    per_device_train_batch_size=4,  
    gradient_accumulation_steps=2,  
    num_train_epochs=3, 
    warmup_steps=50, 
    weight_decay=0.01,
    logging_steps=10,
    evaluation_strategy="steps",
    eval_steps=500,
    save_steps=1000,
    save_total_limit=6,
    load_best_model_at_end=True,
    report_to="none"
)

data_collator = DataCollatorForSeq2Seq(tokenizer, model=model)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset_reduced['train'],
    eval_dataset=tokenized_dataset_reduced['validation'],
    data_collator=data_collator 
)




In [36]:
trainer.train()



Step,Training Loss,Validation Loss
500,1.154,1.481511
1000,1.2836,1.47407
1500,1.1609,1.484079
2000,1.1915,1.476055
2500,1.1418,1.479574
3000,1.0447,1.493982
3500,1.2139,1.49637
4000,1.1238,1.491308


Non-default generation parameters: {'max_length': 142, 'min_length': 56, 'early_stopping': True, 'num_beams': 4, 'length_penalty': 2.0, 'no_repeat_ngram_size': 3, 'forced_bos_token_id': 0, 'forced_eos_token_id': 2}

TrainOutput(global_step=4308, training_loss=1.1983859448269107, metrics={'train_runtime': 14591.2478, 'train_samples_per_second': 2.361, 'train_steps_per_second': 0.295, 'total_flos': 7.176828881040998e+16, 'train_loss': 1.1983859448269107, 'epoch': 3.0})
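The `global_step=4308` in the output follows from the training arguments and the reduced training set size. The sketch below mirrors, as far as I understand it, how the `Trainer` derives the step count on a single device; other library versions may round slightly differently. `total_optimizer_steps` is a hypothetical helper.

```python
import math


def total_optimizer_steps(num_examples, per_device_batch_size,
                          grad_accum_steps, num_epochs, num_devices=1):
    """Approximate number of optimizer updates for a training run."""
    batches_per_epoch = math.ceil(num_examples / (per_device_batch_size * num_devices))
    updates_per_epoch = max(batches_per_epoch // grad_accum_steps, 1)
    return math.ceil(updates_per_epoch * num_epochs)


# 11485 training rows, batch size 4, gradient accumulation 2, 3 epochs.
steps = total_optimizer_steps(11485, 4, 2, 3)
effective_batch_size = 4 * 2
print(steps, effective_batch_size)
```

With an effective batch size of 8, each epoch is 1436 updates, so `warmup_steps=50` covers only about 1% of training and `eval_steps=500` gives the eight evaluation rows in the log.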

## Test out finetuned model

In [40]:
# Load finetuned model
finetuned_model = BartForConditionalGeneration.from_pretrained(r"C:\Users\tom_r\Desktop\Generative-AI\LLM_Projects\article-highlight-training-1727169079\checkpoint-4000")
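Validation loss was logged every 500 steps, but checkpoints were only written every 1000 steps (`save_steps=1000`). A small sketch for picking the saved checkpoint with the lowest validation loss; the values are copied from the training log, and `eval_log` and `SAVE_STEPS` are illustrative names.

```python
# Validation loss per evaluation step, copied from the training log.
eval_log = {
    500: 1.481511, 1000: 1.47407, 1500: 1.484079, 2000: 1.476055,
    2500: 1.479574, 3000: 1.493982, 3500: 1.49637, 4000: 1.491308,
}
SAVE_STEPS = 1000

# Only steps that coincide with a checkpoint save can be reloaded.
saved = {step: loss for step, loss in eval_log.items() if step % SAVE_STEPS == 0}
best_step = min(saved, key=saved.get)
print(f"checkpoint-{best_step}", saved[best_step])
```

On this log the lowest validation loss among saved checkpoints is at step 1000 rather than step 4000, so evaluating that checkpoint as well may be worthwhile. Since `load_best_model_at_end=True` was set, `trainer.state.best_model_checkpoint` should also report the path the `Trainer` considered best, provided the trainer object is still in memory.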


In [41]:
evaluate_model_performance(finetuned_model, mode='one shot')

Evaluating: 100%|██████████| 100/100 [13:24<00:00,  8.05s/it]


{'rouge1': np.float64(0.3077913615461613),
 'rouge2': np.float64(0.1184709166564583),
 'rougeL': np.float64(0.22146676590365294),
 'rougeLsum': np.float64(0.2209549607578961)}

In [133]:
example_indices = [20, 50, 75, 100, 200, 250, 300, 400]

for i, index in enumerate(example_indices):
    article = dataset['test'][index]['article']
    summary = dataset['test'][index]['highlights']

    prefix = 'Summarize in one or two sentences the following article: '
    
    inputs = tokenizer(prefix + article, max_length=512, truncation=True, return_tensors='pt')
    output = tokenizer.decode(
        finetuned_model.generate(
            inputs["input_ids"], 
            max_new_tokens=250,
        )[0], 
        skip_special_tokens=True
    )
    
    print(dash_line)
    print('Example ', i + 1)
    print(dash_line)
    print(f'INPUT PROMPT:\n{article}')
    print(dash_line)
    print(f'BASELINE HUMAN SUMMARY:\n{summary}')
    print(dash_line)
    print(f'Finetuned BART model:\n{output}\n')

----------------------------------------------------------------------------------------------------
Example  1
----------------------------------------------------------------------------------------------------
INPUT PROMPT:
Norfolk, Virginia (CNN)The second mate of the Houston Express probably couldn't believe what he was seeing. Hundreds of miles from land there was a small boat nearby. At first it looked abandoned. It was in bad shape, listing to one side. The crew of the 1,000-foot long container ship thought it was a yacht that had wrecked. Incredibly, as they got closer, they saw there was a man on it, signaling for help. "He was moving, walking around, waving to us and in surprisingly good condition," Capt. Thomas Grenz told CNN by phone Friday. That man, Louis Jordan, 37, had an amazing story. He'd been drifting on the 35-foot Pearson sailboat for more than two months since leaving Conway, South Carolina, to fish in the ocean. Just a few days into his trip, a storm capsized h

### Multi-shot inference

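Fine-tuning has shown little improvement in shortening the summaries to highlight length. ROUGE-1 on the 100-article test subset was 0.34 for the base model with the zero-shot prompt and 0.31 for the fine-tuned model with the one-shot prompt, although the two runs used different prompt modes and so are not directly comparable. Showing the model some examples in the prompt might help.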
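One caveat with few-shot prompts here is the 1024-token input limit used in the tokenizer calls throughout this notebook: anything beyond it is truncated, and news articles are long. The sketch below estimates how much of each prompt section survives, assuming truncation removes tokens from the end of the prompt. `prompt_token_budget` is a hypothetical helper, and the whitespace-based counter is only a stand-in for a real tokenizer.

```python
def prompt_token_budget(sections, count_tokens, max_tokens=1024):
    """Report how many tokens of each prompt section survive truncation.

    `sections` is an ordered list of (name, text) pairs; truncation is
    assumed to cut from the end of the prompt.
    """
    remaining = max_tokens
    report = []
    for name, text in sections:
        needed = count_tokens(text)
        kept = min(needed, remaining)
        remaining -= kept
        report.append((name, needed, kept))
    return report


def whitespace_count(text):
    """Stand-in token counter; a real tokenizer usually yields more tokens."""
    return len(text.split())


# Synthetic lengths chosen for illustration only.
sections = [
    ("target article", "word " * 900),
    ("example article", "word " * 700),
    ("example highlight", "word " * 50),
]
budget = prompt_token_budget(sections, whitespace_count)
for name, needed, kept in budget:
    print(f"{name}: needs {needed}, keeps {kept}")
```

With the real tokenizer, `count_tokens` could be `lambda t: len(tokenizer(t, add_special_tokens=False)["input_ids"])`. If the example highlight keeps zero tokens, the model never sees it, so placing short examples before the target article or shortening the example article may be needed for the examples to have any effect.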


In [144]:
# Reload base pretrained model and tokenizer
model = BartForConditionalGeneration.from_pretrained("facebook/bart-large-cnn")
tokenizer = AutoTokenizer.from_pretrained("facebook/bart-large-cnn")



In [145]:
tokenized_dataset_reduced

DatasetDict({
    train: Dataset({
        features: ['article', 'highlights', 'id', 'input_ids', 'attention_mask', 'labels'],
        num_rows: 11485
    })
    validation: Dataset({
        features: ['article', 'highlights', 'id', 'input_ids', 'attention_mask', 'labels'],
        num_rows: 535
    })
    test: Dataset({
        features: ['article', 'highlights', 'id', 'input_ids', 'attention_mask', 'labels'],
        num_rows: 460
    })
})

In [195]:
def make_prompt(example_indices_full, index_to_summarize):

    # The target article that the model should summarise goes first in the prompt
    article = dataset['test'][index_to_summarize]['article']
    
    prompt = f"""
Task: Summarise the following new article in one or two sentences
Article:

{article}
"""
    
    # Examples to show the LLM in the prompt    
    prompt += 'Below are examples of article highlights'
    for i, index in enumerate(example_indices_full):
        article = dataset['test'][index]['article']
        highlight = dataset['test'][index]['highlights']
        
        prompt += f"""

        
Example: {i + 1}
Article:
    
{article}
    
Summarise the article in one or two sentences? 
{highlight}


"""

        
    return prompt

Would the model respond better to the US spelling "summarize"?

In [324]:
example_indices_full = [125]
index_to_summarize = 200

one_shot_prompt = make_prompt(example_indices_full, index_to_summarize)

print(one_shot_prompt)


Task: Summarise the following new article in one or two sentences
Article:

(CNN)Former New England Patriots star Aaron Hernandez will need to keep his lawyers even after being convicted of murder and other charges in the death of Odin Lloyd. The 25-year-old potentially faces three more trials -- one criminal and two civil actions. Next up is another murder trial in which he is accused of killing two men and wounding another person near a Boston nightclub in July 2012. Prosecutors have said Hernandez fatally shot Daniel de Abreu and Safiro Furtado when he fired into their 2003 BMW.  Another passenger was wounded and two others were uninjured. Hernandez pleaded not guilty at his arraignment. The trial was originally slated for May 28, but Jake Wark, spokesman for the Suffolk County District Attorney's Office, said Wednesday the trial had been postponed and no new date had been set. "We expect to select a new court date in the coming days and then set the amended trial track. The Suffol

In [228]:
# There is just 1 example shown in the prompt (one shot)

highlight = dataset['test'][index_to_summarize]['highlights']

inputs = tokenizer(one_shot_prompt, return_tensors='pt', max_length=1024, truncation=True)
output = tokenizer.decode(
    model.generate(
        inputs["input_ids"],
        max_new_tokens=200,
        temperature=1,  # Default value; has no effect unless do_sample=True
        num_beams=5,  # Use beam search to improve output quality
    )[0], 
    skip_special_tokens=True
)

print(dash_line)
print(f'Baseline human highlights:\n{highlight}\n')
print(dash_line)
print(f'Model generated highlights - ONE SHOT:\n{output}')

----------------------------------------------------------------------------------------------------
Baseline human highlights:
Aaron Hernandez has been found guilty in Odin Lloyd's death, but his troubles are not over . He also faces murder charges in Suffolk County, Massachusetts, but trial was postponed . In addition, Hernandez will face two civil lawsuits; one is in relation to Suffolk County case .

----------------------------------------------------------------------------------------------------
Model generated highlights - ONE SHOT:
Former New England Patriots star Aaron Hernandez will need to keep his lawyers even after being convicted of murder and other charges in the death of Odin Lloyd. The 25-year-old potentially faces three more trials -- one criminal and two civil actions. The families of de Abreu and Furtado filed civil suits against Hernandez.


### Test few-shot inference

In [230]:
example_indices_full = [40, 80, 120]
example_index_to_summarize = 250

few_shot_prompt = make_prompt(example_indices_full, example_index_to_summarize)

print(few_shot_prompt)


Task: Summarise the following new article in one or two sentences
Article:

(CNN)A measles outbreak that affected more than 130 Californians since December is over, the California Department of Public Health declared Friday. It has been 42 days since the last known case of B3 strain of measles, the equivalent of two successive incubation periods, said Dr. Karen Smith, director of the health department. The department said in its latest update that 131 people came down with the B3 strain, and five who had a different genotype than the outbreak strain. Of the 131 cases, the state was able to obtain the vaccination status for 81 patients. Of the 81, 70% were unvaccinated. "Prompt investigation of cases, interviewing hundreds of contacts of infected people, vaccinating hundreds of at risk people, and increasing awareness among health care providers about measles, helped to control this outbreak," Smith said. The outbreak began with dozens of visitors to two Disney theme parks in the state

In [233]:
summary = dataset['test'][example_index_to_summarize]['highlights']

inputs = tokenizer(few_shot_prompt, return_tensors='pt', max_length=1024, truncation=True)
output = tokenizer.decode(
    model.generate(
        inputs["input_ids"],
        max_new_tokens=200,
        temperature=0.5, 
        do_sample=True,
        num_beams=5, 
    )[0], 
    skip_special_tokens=True
)
print(dash_line)
print(f'BASELINE HUMAN SUMMARY:\n{summary}\n')
print(dash_line)
print(f'MODEL GENERATION - FEW SHOT:\n{output}')

----------------------------------------------------------------------------------------------------
BASELINE HUMAN SUMMARY:
Officials say 131 Californians were affected by one strain, five by other strains . About 70% of the people who could show health records were unvaccinated . Outbreak began in December among visitors to two Disney theme parks .

----------------------------------------------------------------------------------------------------
MODEL GENERATION - FEW SHOT:
A measles outbreak that affected more than 130 Californians since December is over. Of the 131 cases, the state was able to obtain the vaccination status for 81 patients. A high temperature of 63.5 degrees Fahrenheit was recorded on the northern tip of the Antarctica Peninsula. The World Meteorological Organization will make the final determination.


The model has blended the separate articles in the prompt into a single set of highlights: the last two sentences are about Antarctic temperatures, not the measles outbreak.

In [236]:
inputs

{'input_ids': tensor([[    0, 50118, 47744,  ...,  2950, 10405,     2]]), 'attention_mask': tensor([[1, 1, 1,  ..., 1, 1, 1]])}

In [238]:
tokenizer.decode(inputs['input_ids'][0])

'<s>\nTask: Summarise the following new article in one or two sentences\nArticle:\n\n(CNN)A measles outbreak that affected more than 130 Californians since December is over, the California Department of Public Health declared Friday. It has been 42 days since the last known case of B3 strain of measles, the equivalent of two successive incubation periods, said Dr. Karen Smith, director of the health department. The department said in its latest update that 131 people came down with the B3 strain, and five who had a different genotype than the outbreak strain. Of the 131 cases, the state was able to obtain the vaccination status for 81 patients. Of the 81, 70% were unvaccinated. "Prompt investigation of cases, interviewing hundreds of contacts of infected people, vaccinating hundreds of at risk people, and increasing awareness among health care providers about measles, helped to control this outbreak," Smith said. The outbreak began with dozens of visitors to two Disney theme parks in t


**Few-shot inference is not practical for this use case because the input articles are long. When more than one article is included as an example, the prompt exceeds the 1024-token limit and is truncated, which can confuse the model's output.**
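One way to catch this before generating is to count tokens and keep only the examples that fit next to the target article. The sketch below is a minimal illustration, not part of the original notebook: `fit_examples` and `whitespace_count` are hypothetical helpers, and the whitespace counter is only a rough stand-in for the real tokenizer.

```python
from typing import Callable, List


def whitespace_count(text: str) -> int:
    """Rough stand-in for a tokenizer: counts whitespace-separated words."""
    return len(text.split())


def fit_examples(
    examples: List[str],
    target: str,
    count_tokens: Callable[[str], int],
    max_length: int = 1024,
) -> List[str]:
    """Return the leading examples that fit in the prompt alongside the target.

    The target article is budgeted first, so it is never the part that
    gets squeezed out by the examples.
    """
    budget = max_length - count_tokens(target)
    kept = []
    for example in examples:
        cost = count_tokens(example)
        if cost > budget:
            break  # this example would push the prompt past the limit
        kept.append(example)
        budget -= cost
    return kept


# Three 400-word examples and a 300-word target: only one example fits in 1024.
examples = ["a " * 400, "b " * 400, "c " * 400]
print(len(fit_examples(examples, "t " * 300, whitespace_count)))
```

With the notebook's tokenizer, `count_tokens` could be `lambda text: len(tokenizer(text).input_ids)`, which counts real tokens rather than words.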

In [331]:
def evaluate_model_performance(model, mode="instruction", example_index = 125):

    # Subset of dataset for evaluation
    articles = dataset['test'][0:100]['article']
    highlights = dataset['test'][0:100]['highlights']
    
    model_summaries = []
    
    for i, article in enumerate(tqdm(articles, desc="Evaluating")):
        
        # Building the prompt based on the mode
        if mode == "zero shot":
            # just asking for a summary
            prompt = f"Summarize in one or two sentences the following article:\n\n{article}\n\nSummary:"
        
        elif mode == 'one-shot' and example_index is not None:
            # Provide one example and then ask for a summary
            example_article = dataset['test'][example_index]['article']
            example_highlight = dataset['test'][example_index]['highlights']

            prompt = f"""
Task: Summarise the following new article in one or two sentences
Article:
            
{article}
"""
    
            # Examples to show the LLM in the prompt    
            prompt += 'Below are examples of article highlights'
            
            prompt += f"""
Example Article:
{example_article}
            
Example Summary:
{example_highlight}
"""        
        elif mode == "instruction":
            # using a task specific instruction
            prompt = f"""
Summarize in one or two sentences the following article:      
{article}
            
Summary:
            """
        
        else:
            raise ValueError(f"Unsupported mode: {mode}")
        
        input_ids = tokenizer(prompt, return_tensors="pt", truncation=True, max_length=1024).input_ids

        model_outputs = model.generate(input_ids=input_ids, max_new_tokens=200)
        model_text_output = tokenizer.decode(model_outputs[0], skip_special_tokens=True)
        model_summaries.append(model_text_output)
    
    zipped_summaries = list(zip(highlights, model_summaries))
    df = pd.DataFrame(zipped_summaries, columns=['Highlights', 'Model_summaries'])
    #print(df.head(5))

    # Compute ROUGE scores
    model_results = rouge.compute(
        predictions=model_summaries,
        references=highlights[0:len(model_summaries)],
        use_aggregator=True,
        use_stemmer=True,
    )
    
    return model_results



In [332]:
evaluate_model_performance(finetuned_model, mode='instruction')

Evaluating: 100%|██████████| 100/100 [12:17<00:00,  7.38s/it]


{'rouge1': np.float64(0.35532632398771835),
 'rouge2': np.float64(0.15325439791500092),
 'rougeL': np.float64(0.25842654908955126),
 'rougeLsum': np.float64(0.25813768296344974)}

In [333]:
evaluate_model_performance(finetuned_model, mode='zero shot')

Evaluating: 100%|██████████| 100/100 [11:58<00:00,  7.19s/it]


{'rouge1': np.float64(0.34752412903503105),
 'rouge2': np.float64(0.15065534163583266),
 'rougeL': np.float64(0.2547654630707121),
 'rougeLsum': np.float64(0.2539430180069777)}

In [339]:
evaluate_model_performance(finetuned_model, mode='one-shot')

Evaluating: 100%|██████████| 100/100 [14:04<00:00,  8.45s/it]


{'rouge1': np.float64(0.30400882513393646),
 'rouge2': np.float64(0.12213599205442993),
 'rougeL': np.float64(0.22488490391312452),
 'rougeLsum': np.float64(0.22446823181942485)}

### Experimenting with increasing num_beams

In [341]:
def evaluate_model_performance(model, mode="instruction", example_index = 125):

    # Subset of dataset for evaluation
    articles = dataset['test'][0:100]['article']
    highlights = dataset['test'][0:100]['highlights']
    
    model_summaries = []
    
    for i, article in enumerate(tqdm(articles, desc="Evaluating")):
        
        # Building the prompt based on the mode
        if mode == "zero shot":
            # just asking for a summary
            prompt = f"Summarize in one or two sentences the following article:\n\n{article}\n\nSummary:"
        
        elif mode == 'one-shot' and example_index is not None:
            # Provide one example and then ask for a summary
            example_article = dataset['test'][example_index]['article']
            example_highlight = dataset['test'][example_index]['highlights']

            prompt = f"""
Task: Summarise the following new article in one or two sentences
Article:
            
{article}
"""
    
            # Examples to show the LLM in the prompt    
            prompt += 'Below are examples of article highlights'
            
            prompt += f"""
Example Article:
{example_article}
            
Example Summary:
{example_highlight}
"""        
        elif mode == "instruction":
            # using a task specific instruction
            prompt = f"""
Summarize in one or two sentences the following article:      
{article}
            
Summary:
            """
        
        else:
            raise ValueError(f"Unsupported mode: {mode}")
        
        input_ids = tokenizer(prompt, return_tensors="pt", truncation=True, max_length=1024).input_ids

        model_outputs = model.generate(input_ids=input_ids, max_new_tokens=200, num_beams=5)
        model_text_output = tokenizer.decode(model_outputs[0], skip_special_tokens=True)
        model_summaries.append(model_text_output)
    
    zipped_summaries = list(zip(highlights, model_summaries))
    df = pd.DataFrame(zipped_summaries, columns=['Highlights', 'Model_summaries'])
    #print(df.head(5))

    # Compute ROUGE scores
    model_results = rouge.compute(
        predictions=model_summaries,
        references=highlights[0:len(model_summaries)],
        use_aggregator=True,
        use_stemmer=True,
    )
    
    return model_results



In [336]:
evaluate_model_performance(finetuned_model, mode='instruction')

Evaluating: 100%|██████████| 100/100 [12:42<00:00,  7.63s/it]


{'rouge1': np.float64(0.3556716864767809),
 'rouge2': np.float64(0.15790472378035086),
 'rougeL': np.float64(0.2630275608596195),
 'rougeLsum': np.float64(0.26219474345675675)}

In [337]:
evaluate_model_performance(finetuned_model, mode='zero shot')

Evaluating: 100%|██████████| 100/100 [12:28<00:00,  7.49s/it]


{'rouge1': np.float64(0.34629334938693523),
 'rouge2': np.float64(0.150107507250075),
 'rougeL': np.float64(0.2532882990861981),
 'rougeLsum': np.float64(0.252820692762342)}

In [342]:
evaluate_model_performance(finetuned_model, mode='one-shot')

Evaluating: 100%|██████████| 100/100 [14:17<00:00,  8.58s/it]


{'rouge1': np.float64(0.30400882513393646),
 'rouge2': np.float64(0.12213599205442993),
 'rougeL': np.float64(0.22488490391312452),
 'rougeLsum': np.float64(0.22446823181942485)}
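To make the two sets of runs easier to compare, the reported scores can be put side by side. The sketch below copies the values from the outputs above, rounded to four decimals; `rouge_delta` is a hypothetical helper and is not part of the notebook.

```python
# Scores reported above, rounded to four decimal places.
default_decoding = {
    "instruction": {"rouge1": 0.3553, "rouge2": 0.1533, "rougeL": 0.2584},
    "zero shot": {"rouge1": 0.3475, "rouge2": 0.1507, "rougeL": 0.2548},
    "one-shot": {"rouge1": 0.3040, "rouge2": 0.1221, "rougeL": 0.2249},
}
beam_search_5 = {
    "instruction": {"rouge1": 0.3557, "rouge2": 0.1579, "rougeL": 0.2630},
    "zero shot": {"rouge1": 0.3463, "rouge2": 0.1501, "rougeL": 0.2533},
    "one-shot": {"rouge1": 0.3040, "rouge2": 0.1221, "rougeL": 0.2249},
}


def rouge_delta(before, after):
    """Per-mode, per-metric difference (after - before), rounded to 4 decimals."""
    return {
        mode: {
            metric: round(after[mode][metric] - before[mode][metric], 4)
            for metric in before[mode]
        }
        for mode in before
    }


deltas = rouge_delta(default_decoding, beam_search_5)
for mode, change in deltas.items():
    print(mode, change)
```

On this 100-article subset every change is below 0.005, so the effect of `num_beams=5` appears small here, and the one-shot scores are identical in both reported runs.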

### Final experiment: fine-tune on a larger subset of the dataset

I will also try to increase the input sequence length beyond the 512 tokens used in the original training run, so that longer articles are not truncated.
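A longer input length would be set at tokenization time. The following is a minimal sketch under stated assumptions, not the notebook's actual preprocessing: `make_preprocess_fn` is a hypothetical helper, the 128-token target length is an illustrative guess, and 1024 is chosen because standard BART checkpoints accept at most 1024 input positions.

```python
def make_preprocess_fn(tokenizer, max_input_length=1024, max_target_length=128):
    """Build a batched preprocessing function suitable for Dataset.map.

    Hypothetical helper: the two length limits are assumptions, not values
    taken from the original training run.
    """

    def preprocess(batch):
        # Tokenize the articles, truncating only past the longer limit.
        model_inputs = tokenizer(
            batch["article"], max_length=max_input_length, truncation=True
        )
        # Tokenize the reference highlights as the decoder targets.
        labels = tokenizer(
            text_target=batch["highlights"],
            max_length=max_target_length,
            truncation=True,
        )
        model_inputs["labels"] = labels["input_ids"]
        return model_inputs

    return preprocess
```

It could then be applied with something like `dataset.map(make_preprocess_fn(tokenizer), batched=True)`. Doubling the input length increases memory use during training, so the batch size may need to be reduced.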

In [344]:
# Keep every 10th example, reducing each split to 1/10th of its original size
tokenized_dataset_reducedBy10 = tokenized_datasets.filter(lambda example, index: index % 10 == 0, with_indices=True)
tokenized_dataset_reducedBy10

Filter: 100%|██████████| 287113/287113 [02:33<00:00, 1867.74 examples/s]
Filter: 100%|██████████| 13368/13368 [00:06<00:00, 1919.61 examples/s]
Filter: 100%|██████████| 11490/11490 [00:05<00:00, 1958.65 examples/s]


DatasetDict({
    train: Dataset({
        features: ['article', 'highlights', 'id', 'input_ids', 'attention_mask', 'labels'],
        num_rows: 28712
    })
    validation: Dataset({
        features: ['article', 'highlights', 'id', 'input_ids', 'attention_mask', 'labels'],
        num_rows: 1337
    })
    test: Dataset({
        features: ['article', 'highlights', 'id', 'input_ids', 'attention_mask', 'labels'],
        num_rows: 1149
    })
})