In [65]:
import torch
from transformers import (
    AutoModelForSeq2SeqLM,
    AutoTokenizer,
    BartForConditionalGeneration,
    DataCollatorForSeq2Seq,
    GenerationConfig,
    Trainer,
    TrainingArguments,
)
import time
import datasets
from datasets import load_dataset
from sklearn.model_selection import train_test_split

from tqdm import tqdm
import pickle


import random
import pandas as pd
from IPython.display import display, HTML


#### Load Dataset

In [4]:
dataset = load_dataset("abisee/cnn_dailymail", "1.0.0")
dataset

DatasetDict({
    train: Dataset({
        features: ['article', 'highlights', 'id'],
        num_rows: 287113
    })
    validation: Dataset({
        features: ['article', 'highlights', 'id'],
        num_rows: 13368
    })
    test: Dataset({
        features: ['article', 'highlights', 'id'],
        num_rows: 11490
    })
})

#### Check Dataset

In [5]:
def show_random_elements(dataset, num_examples=2):
    picks = []
    for _ in range(num_examples):
        pick = random.randint(0, len(dataset)-1)
        while pick in picks:
            pick = random.randint(0, len(dataset)-1)
        picks.append(pick)
    
    df = pd.DataFrame(dataset[picks])
    for column, typ in dataset.features.items():
        if isinstance(typ, datasets.ClassLabel):
            df[column] = df[column].transform(lambda i: typ.names[i])
    display(HTML(df.to_html()))


show_random_elements(dataset["train"])

Unnamed: 0,article,highlights,id
0,"Ignorance really isn't bliss. But there are times when the lack of knowledge and expertise about a subject or place can actually serve to clear the mind and lead to some clarity and honesty in the debate on even the most complex matters. I'm certainly no expert on Ukraine. I'm not sure before this crisis that I could even name all of the countries that share its borders. But watching the ticktock of the debate on the issue this past week, I'm stunned by the lack of perspective and breathlessness in the discussion. Sadly, I've grown accustomed to the partisanship. It has become a permanent fixture of our analytical and policy landscape. But here are several things about the recent coverage and discussion on Ukraine that even my lack of expertise won't allow me to accept: . 1. We're back in the Cold War . Clearly, none of the resets have worked with Russian President Vladimir Putin. Whether it was President George W. Bush looking into his eyes and seeing his soul or Bob Gates finding a stone-cold killer there, Putin isn't Stalin, Khrushchev or Brezhnev. That's hardly a shocker. Nor are we still in that unique period when two superpowers with contrasting ideological systems under the threat of nuclear war clashed and fought by proxy from one end of the globe to the other. There's no doubt that the United States and Russia have major differences. But the issue is no longer ideological. Russian capitalism is here to stay, state-controlled and monitored though it may be. And what ideology exists has more to do with asserting Russian national interests than anything Marx or Lenin would have recognized. And in at least one respect, that's too bad. At least during much of the Cold War, from the 1970s on, there were rules, do's and don'ts that prevented situations like Ukraine. Are you in Crimea? Share your story with CNN iReport. We'll continue to struggle with Putin, to be sure. 
But the world's too small, the Europeans too dependent on Russia, and the realities of global interdependence too deep to imagine hitting the rewind button and turning the planet into an arena of conflict and competition. Would it make for a good video game? Yes. 2. Putin is Hitler . In the past week, I've heard people I admire and respect talk about Crimea as Munich and Putin as Hitler. Twain wrote that history doesn't repeat; it rhymes. But those rhythmic patterns aren't evident here, either. When we can't think of intelligent parallels in analyzing nations who do things America cannot abide, it seems we're drawn irresistibly to the Hitler trope. The same thing happens with Iran. And while I don't for a moment condone the vicious Israel-baiting and hating of the mullahcracy in Tehran (and the anti-Semitism, too) to bring up Hitler not only trivializes the monstrosity of the evil and the magnitude of the crimes in his time, it imposes unrealistic challenges in ours. The unique challenge of Hitler demanded that he be stopped and the Nazi regime destroyed. We don't have to like the Putin government in Russia, or Supreme Leader Ali Khamenei's in Iran, to recognize that the magnitude of the threat is different. To compare them to Hitler is to urge the United States into a game that we don't want to play and can't win. As best I can figure, Putin is a clever and easily riled Russian nationalist who presides over what remains of an empire whose time has come and gone. He lives in reality, not in some megalomaniacal world. But he is prepared to assert Russia's interests in spheres where it matters, and to block the West's intrusion into those areas as best he can. Russia is his ""ideology."" And on Ukraine, history and proximity give him cards to play. This man isn't a fanatic. Money, pleasure and power are too important to him. 
Any leader who is willing to be photographed shirtless on a horse, like some cover of Men's Health magazine, isn't going to shoot himself in the head or take cyanide in a bunker. This guy is way too hip (Russian style) and attached to the good life to be Hitler. And given Russia's own suffering at the hands of the Nazis, saying he is just makes matters worse. 3. It's all Obama's fault . President Barack Obama was never the catastrophic incompetent or Satan's finger on earth that his worst critics imagined nor the redeemer, savior, or great President that his most avid acolytes wanted. And yet the notion that Obama, through weak and feckless foreign policy, was responsible for Putin's move into Ukraine strains credulity to the breaking point. This urban legend that because of Benghazi and the ""red line"" affair in Syria, Putin was compelled to do something in Ukraine that he wouldn't have done had Obama acted differently, is absurd. The administration's foreign policy has often resembled a blend between a Marx Brothers movie and the Three Stooges. But on this one the charge is absurd, as is the notion that somehow Obama could have stopped him. When the Soviet Union invaded Hungary in 1956, there was no U.S. military response; ditto in 1968 when the Soviets put down Prague Spring in Czechoslovakia. Sometimes, geography really is destiny. Russia believed its vital interests in Ukraine were threatened and it had the means, will, and proximity to act on them. And it's about time we faced up to it. 4. Bombing Syria would have saved Ukraine . This notion that Obama's opponents have latched onto is, of course, unknowable. There are no rewind buttons in history. Counterfactuals are prime talking and debating points because they cannot be proven one way or the other. 
But to argue that launching cruise missiles at Syrian military targets somehow would have deterred Putin from acting on what he perceived to be a Russian vital interest, or emboldened the Europeans to stand tougher against him, really is off base. Syria and Ukraine are like apples and oranges the President's detractors insist on putting in the same basket. Even if Obama thought the U.S. had vital interests that justified an attack on Syria, it is likely it would not have altered Putin's policy toward Ukraine. The country perceived to be in Russia's zone of influence and manipulation was drifting westward. And Putin was determined to stop it. 5. Ukraine can have a 'Hollywood' ending . Are there good guys and bad guys in the Ukraine-Russian drama? Sure there are. We have courageous Ukrainian patriots who died in the Maidan for the dignity and freedom they believed in; corrupt and ruthless government officials who were willing to use force against their own citizens; Russian provocateurs eager to stir up trouble; extremist Ukrainian nationalists who are hardly democrats; and a Russian strongman who hosted the Olympics one week and invaded the territory of a sovereign country the next. I suspect that the Ukrainian Spring -- if that's what it is -- may turn out better than its Arab counterpart. But we have to be real. Ukraine may be fractious and troubled for some time to come. Below the morality play there is intense factionalism; regional differences; scores to settle; Russian manipulation; and a tendency to avoid the kind of compromise that would lead to real power sharing and good governance. We like Hollywood endings. But real democratization depends less on a friendly U.S. or EU hand than on the emergence of genuine leaders who are prepared to rise above factional affinities and see a vision for the country as a whole. 
It also depends on institutions that reflect popular will and some mechanism for accommodating differences peacefully without resorting to violence. There are no easy or happy endings here. And we can only make matters worse, as Henry Kissinger suggested recently, by trying to turn the Ukraine crisis into a Russia vs. the West (or worse, the U.S.) tug-of-war.","Aaron Miller says even those with little knowledge of Ukraine should spot the myths we've heard . He says Obama's foreign policy isn't to blame for what Putin did in Crimea . Miller says this doesn't represent a new Cold War, nor is Putin equivalent to Hitler . He says the outcome may be acceptable eventually but don't expect a Hollywood ending .",b462bafbd5f2f13fc214da313ad9d9fc90212bbd
1,"ARNOLD, Missouri (CNN) -- On his 100th day in office, President Obama said Wednesday that he was ""pleased with the progress we've made but not satisfied."" Obama marked his 100th day in office Wednesday with a town hall meeting and later a news conference. ""I've come back to report to you, the American people, that we have begun to pick ourselves up and dust ourselves off, and we've begun the work of remaking America,"" the president said at a town hall meeting in a high school gymnasium in Arnold, a St. Louis suburb. ""I'm confident in the future, but not I'm not content with the present,"" he said. ""You know the progress comes from hard choices and hard work, not miracles. I'm not a miracle worker,"" he said. Obama acknowledged challenges of ""unprecedented size and scope,"" including the recession. These challenges, he said, could not be met with ""half measures."" ""They demand action that is bold and sustained. They call on us to clear away the wreckage of a painful recession, But also, at the same time, lay the building blocks for a new prosperity. And that's the work that we've begun over these first 100 days,"" he said. He responded to critics who say he is trying to do too much as he works to address the recession as well as health care, energy and education. ""There's no mystery to what we've done; the priorities that we've acted upon were the things that we said we'd do during the campaign,"" he said, prompting loud applause. The president made an opening statement that lasted about 20 minutes before taking questions from the audience. The last question was from a fourth-grade girl who asked about the administration's environmental policies. Later Wednesday, Obama will hold a prime-time news conference in the East Room of the White House. Leading up to the date, White House aides had labeled the 100th day as a ""Hallmark"" holiday. 
""They don't mean anything,"" quipped one aide, ""but you have to observe them."" More than six in 10 Americans approve of the job Obama is doing as president, a recent poll of polls shows. According to a CNN Poll of Polls compiled early Wednesday, 63 percent say they approve of how Obama is handling his duties. CNN's Ed Henry contributed to this report.","""We've begun the work of remaking America,"" he says in Missouri . Obama warns that progress comes from ""hard work, not miracles"" He will hold a prime-time news conference later Wednesday .",70d29e642bc29b28c288363d41fa608493ef47e8


These examples show that each news article is summarised fairly succinctly, giving just its main highlights.
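
To put a rough number on "succinct", one could compare word counts of articles and highlights. The sketch below is an illustration only: `compression_ratio` is a hypothetical helper that is not part of this notebook, the inputs are toy strings, and whitespace word counts are a crude proxy for length.

```python
def compression_ratio(articles, highlights):
    """Mean highlight length divided by mean article length, in whitespace words."""
    article_words = [len(a.split()) for a in articles]
    highlight_words = [len(h.split()) for h in highlights]
    mean_article = sum(article_words) / len(article_words)
    mean_highlight = sum(highlight_words) / len(highlight_words)
    return mean_highlight / mean_article


# Toy strings standing in for dataset rows (not real CNN/DailyMail text)
toy_articles = ["one two three four five six seven eight", "a b c d"]
toy_highlights = ["one two", "a"]
ratio = compression_ratio(toy_articles, toy_highlights)
print(f"highlights are {ratio:.0%} of article length")
```

On the real data the same helper could be called on a slice, for example `dataset["train"][:1000]["article"]` and `dataset["train"][:1000]["highlights"]`.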

#### Pre-process Dataset

In [6]:
# Load model and tokenizer
model = BartForConditionalGeneration.from_pretrained("facebook/bart-large-cnn")
tokenizer = AutoTokenizer.from_pretrained("facebook/bart-large-cnn")



In [7]:
print(tokenizer(text_target=["Hello, this one sentence!", "This is another sentence."]))

{'input_ids': [[0, 31414, 6, 42, 65, 3645, 328, 2], [0, 713, 16, 277, 3645, 4, 2]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1]]}


In [8]:

def preprocess_function(examples):
    # Articles are passed to the tokenizer as-is, without an instruction prefix
    inputs = examples["article"]
    
    model_inputs = tokenizer(inputs, max_length=512, truncation=True)

    # Setup the tokenizer for targets
    labels = tokenizer(text_target=examples["highlights"], max_length=128, truncation=True)

    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

tokenized_datasets = dataset.map(preprocess_function, batched=True)

    

In [9]:
preprocess_function(dataset['train'][:2])

{'input_ids': [[0, 574, 4524, 6, 1156, 36, 1251, 43, 480, 3268, 10997, 999, 3028, 7312, 20152, 3077, 899, 7, 10, 431, 984, 844, 153, 1358, 4006, 4, 134, 153, 43, 13016, 25, 37, 4072, 504, 15, 302, 6, 53, 37, 9838, 5, 418, 351, 75, 2471, 10, 8921, 15, 123, 4, 3028, 7312, 20152, 25, 3268, 10997, 11, 22, 29345, 10997, 8, 5, 9729, 9, 5, 5524, 113, 598, 5, 10208, 9, 20445, 6730, 1952, 198, 5, 232, 6, 5, 664, 2701, 161, 37, 34, 117, 708, 7, 856, 3961, 1334, 39, 1055, 409, 15, 1769, 1677, 6, 4076, 8, 6794, 1799, 4, 22, 100, 218, 75, 563, 7, 28, 65, 9, 167, 82, 54, 6, 25, 1010, 25, 51, 1004, 504, 6, 6017, 907, 1235, 10, 2232, 1612, 512, 2783, 50, 402, 1122, 60, 37, 174, 41, 2059, 33242, 656, 42, 353, 4, 22, 100, 218, 75, 206, 38, 581, 28, 1605, 31879, 4, 22, 133, 383, 38, 101, 2159, 32, 383, 14, 701, 59, 158, 2697, 480, 2799, 8, 32570, 8, 37206, 72, 497, 504, 6, 7312, 20152, 40, 28, 441, 7, 23104, 11, 10, 10297, 6, 907, 10, 4076, 11, 10, 8881, 50, 192, 5, 8444, 822, 22, 40534, 523, 35, 4657, 3





In [11]:
sentence = "Does the tokenizer work I wonder?"

sentence_encoded = tokenizer(sentence, return_tensors='pt')

sentence_decoded = tokenizer.decode(
        sentence_encoded["input_ids"][0], 
        skip_special_tokens=True
    )

print('ENCODED SENTENCE:')
print(sentence_encoded["input_ids"][0])
print('\nDECODED SENTENCE:')
print(sentence_decoded)

ENCODED SENTENCE:
tensor([    0, 27847,     5, 19233,  6315,   173,    38,  5170,   116,     2])

DECODED SENTENCE:
Does the tokenizer work I wonder?


It does.

#### Test out pre-trained BART

In [13]:
example_indices = [20, 50]
dash_line = 100 * '-'


for i, index in enumerate(example_indices):
    dialogue = dataset['test'][index]['article']
    summary = dataset['test'][index]['highlights']

    # No instruction prefix in this cell: the article is passed to the model as-is
    
    inputs = tokenizer(dialogue, max_length=512, truncation=True, return_tensors='pt')
    output = tokenizer.decode(
        model.generate(
            inputs["input_ids"], 
            max_new_tokens=250,
        )[0], 
        skip_special_tokens=True
    )
    
    print(dash_line)
    print('Example ', i + 1)
    print(dash_line)
    print(f'INPUT PROMPT:\n{dialogue}')
    print(dash_line)
    print(f'BASELINE HUMAN SUMMARY:\n{summary}')
    print(dash_line)
    print(f'Pre-Trained BART GENERATION - WITHOUT PROMPT ENGINEERING:\n{output}\n')

----------------------------------------------------------------------------------------------------
Example  1
----------------------------------------------------------------------------------------------------
INPUT PROMPT:
Norfolk, Virginia (CNN)The second mate of the Houston Express probably couldn't believe what he was seeing. Hundreds of miles from land there was a small boat nearby. At first it looked abandoned. It was in bad shape, listing to one side. The crew of the 1,000-foot long container ship thought it was a yacht that had wrecked. Incredibly, as they got closer, they saw there was a man on it, signaling for help. "He was moving, walking around, waving to us and in surprisingly good condition," Capt. Thomas Grenz told CNN by phone Friday. That man, Louis Jordan, 37, had an amazing story. He'd been drifting on the 35-foot Pearson sailboat for more than two months since leaving Conway, South Carolina, to fish in the ocean. Just a few days into his trip, a storm capsized h

In [157]:
example_indices = [20, 50]

for i, index in enumerate(example_indices):
    dialogue = dataset['test'][index]['article']
    summary = dataset['test'][index]['highlights']

    prefix = 'Summarize the article in one or two sentences? :  '
    
    inputs = tokenizer(prefix + dialogue, max_length=512, truncation=True, return_tensors='pt')
    output = tokenizer.decode(
        model.generate(
            inputs["input_ids"], 
            max_new_tokens=250,
        )[0], 
        skip_special_tokens=True
    )
    
    print(dash_line)
    print('Example ', i + 1)
    print(dash_line)
    print(f'INPUT PROMPT:\n{dialogue}')
    print(dash_line)
    print(f'BASELINE HUMAN SUMMARY:\n{summary}')
    print(dash_line)
    print(f'Pre-Trained BART GENERATION - WITH PROMPT ENGINEERING:\n{output}\n')

----------------------------------------------------------------------------------------------------
Example  1
----------------------------------------------------------------------------------------------------
INPUT PROMPT:
Norfolk, Virginia (CNN)The second mate of the Houston Express probably couldn't believe what he was seeing. Hundreds of miles from land there was a small boat nearby. At first it looked abandoned. It was in bad shape, listing to one side. The crew of the 1,000-foot long container ship thought it was a yacht that had wrecked. Incredibly, as they got closer, they saw there was a man on it, signaling for help. "He was moving, walking around, waving to us and in surprisingly good condition," Capt. Thomas Grenz told CNN by phone Friday. That man, Louis Jordan, 37, had an amazing story. He'd been drifting on the 35-foot Pearson sailboat for more than two months since leaving Conway, South Carolina, to fish in the ocean. Just a few days into his trip, a storm capsized h

The original model struggles to produce summaries shorter than three or four sentences, whereas the CNN/DailyMail highlights summarise each article in one or two sentences.

Even with some basic prompt engineering the model still struggles to shorten its output. Fine-tuning might improve this, so that is what we will try next.

### First we should get some baseline metrics for the original model

In [32]:
# Set up evaluation metric 
import evaluate
rouge = evaluate.load("rouge")


In [54]:
# Select the first 100 test examples as an evaluation subset

articles = dataset['test'][0:100]['article']
highlights = dataset['test'][0:100]['highlights']

original_model_summaries = []

for article in articles:
    prompt = f"""
    'Summarize in one or two sentences the following article: '

    {article}

    Summary: """
    
    input_ids = tokenizer(prompt, return_tensors="pt", truncation=True, max_length=512).input_ids

    # Generate summary
    original_model_outputs = model.generate(input_ids=input_ids, generation_config=GenerationConfig(max_new_tokens=200))
    original_model_text_output = tokenizer.decode(original_model_outputs[0], skip_special_tokens=True)
    original_model_summaries.append(original_model_text_output)

    
zipped_summaries = list(zip(highlights, original_model_summaries))
 
df = pd.DataFrame(zipped_summaries, columns = ['Highlights', 'original_model_summaries'])
df


100
100


Unnamed: 0,Highlights,original_model_summaries
0,Membership gives the ICC jurisdiction over all...,Palestiniania. The move was made official. The...
1,"Theia, a bully breed mix, was apparently hit b...",A. The dog was hit by a car. The dog was burie...
2,Mohammad Javad Zarif has spent more time with ...,Moh. He is Iran's foreign minister. He is know...
3,17 Americans were exposed to the Ebola virus w...,The. The five were exposed to Ebola in Sierra ...
4,Student is no longer on Duke University campus...,The. The student admitted to hanging the noose...
...,...,...
95,"Manuscript of ""American Pie"" lyrics is sold to...",Don. Don McLean's song is 44-year. Christie's ...
96,Letourneau Fualaau had a sexual relationship w...,Mary. Mary Kay Kay Kay Letourneau Letour-Letou...
97,"Don McLean's ""American Pie"" lyrics auctioned f...",The. The song is one of the most dissected in ...
98,Mindy Kaling's brother Vijay Chokalingam prete...,Vay Chokalingam applied to St. Chokalingam pos...


In [59]:
print(df['original_model_summaries'])

0     Palestiniania. The move was made official. The...
1     A. The dog was hit by a car. The dog was burie...
2     Moh. He is Iran's foreign minister. He is know...
3     The. The five were exposed to Ebola in Sierra ...
4     The. The student admitted to hanging the noose...
                            ...                        
95    Don. Don McLean's song is 44-year. Christie's ...
96    Mary. Mary Kay Kay Kay Letourneau Letour-Letou...
97    The. The song is one of the most dissected in ...
98    Vay Chokalingam applied to St. Chokalingam pos...
99    T,. Travolt: "I've been so happy with my (Scie...
Name: original_model_summaries, Length: 100, dtype: object


In [55]:
original_model_results = rouge.compute(
    predictions=original_model_summaries,
    references=highlights[0:len(original_model_summaries)],
    use_aggregator=True,
    use_stemmer=True,
)



print('ORIGINAL MODEL:')
print(original_model_results)


ORIGINAL MODEL:
{'rouge1': np.float64(0.19090482977059542), 'rouge2': np.float64(0.0748375366841095), 'rougeL': np.float64(0.15159647802571558), 'rougeLsum': np.float64(0.15146151773990352)}
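
For intuition about what the `rouge1` number measures, here is a deliberately simplified unigram-overlap F1. It is a sketch, not the implementation behind `evaluate.load("rouge")`: it lowercases and splits on whitespace, whereas the library applies its own tokenization and, with `use_stemmer=True`, stemming, so the values will not match exactly. The function name `rouge1_f1` and the two example sentences are made up for illustration.

```python
from collections import Counter


def rouge1_f1(prediction, reference):
    """Unigram-overlap F1 between two strings (lowercased, whitespace tokens)."""
    pred_counts = Counter(prediction.lower().split())
    ref_counts = Counter(reference.lower().split())
    # Counter intersection clips each word to the smaller of its two counts
    overlap = sum((pred_counts & ref_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


score = rouge1_f1("the dog was hit by a car", "a dog was hit by a car and buried")
print(round(score, 3))  # 0.75
```

Here six of the seven predicted words also occur in the nine-word reference, giving precision 6/7, recall 6/9 and an F1 of 0.75. Read this way, a `rouge1` of about 0.19 means the generated summaries share relatively few words with the reference highlights.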


In [24]:
dataset['test']

Dataset({
    features: ['article', 'highlights', 'id'],
    num_rows: 11490
})

In [25]:
tokenized_datasets

DatasetDict({
    train: Dataset({
        features: ['article', 'highlights', 'id', 'input_ids', 'attention_mask', 'labels'],
        num_rows: 287113
    })
    validation: Dataset({
        features: ['article', 'highlights', 'id', 'input_ids', 'attention_mask', 'labels'],
        num_rows: 13368
    })
    test: Dataset({
        features: ['article', 'highlights', 'id', 'input_ids', 'attention_mask', 'labels'],
        num_rows: 11490
    })
})

In [119]:
tokenized_dataset_reduced = tokenized_datasets.filter(lambda example, index: index % 25 == 0, with_indices=True)

Filter: 100%|██████████| 287113/287113 [02:30<00:00, 1910.13 examples/s]
Filter: 100%|██████████| 13368/13368 [00:06<00:00, 1995.72 examples/s]
Filter: 100%|██████████| 11490/11490 [00:05<00:00, 1954.81 examples/s]


In [120]:
tokenized_dataset_reduced

DatasetDict({
    train: Dataset({
        features: ['article', 'highlights', 'id', 'input_ids', 'attention_mask', 'labels'],
        num_rows: 11485
    })
    validation: Dataset({
        features: ['article', 'highlights', 'id', 'input_ids', 'attention_mask', 'labels'],
        num_rows: 535
    })
    test: Dataset({
        features: ['article', 'highlights', 'id', 'input_ids', 'attention_mask', 'labels'],
        num_rows: 460
    })
})
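
The reduced split sizes follow directly from keeping every 25th row. A small check of that arithmetic, using the row counts printed earlier (the helper name `kept_indices` is hypothetical):

```python
# Row counts copied from the DatasetDict printout
SPLIT_SIZES = {"train": 287113, "validation": 13368, "test": 11490}
STRIDE = 25


def kept_indices(num_rows, stride=STRIDE):
    """Indices that pass the `index % stride == 0` filter."""
    return range(0, num_rows, stride)


reduced_sizes = {name: len(kept_indices(n)) for name, n in SPLIT_SIZES.items()}
print(reduced_sizes)  # {'train': 11485, 'validation': 535, 'test': 460}
```

Since `Dataset.select` accepts a collection of indices, something like `tokenized_datasets["train"].select(kept_indices(len(tokenized_datasets["train"])))` should pick the same rows without calling a Python lambda on every example. That variant was not run in this notebook.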

In [121]:
def print_number_of_trainable_model_parameters(model):
    # Count all parameters and the subset that still requires gradients
    trainable_model_params = 0
    all_model_params = 0
    for _, param in model.named_parameters():
        all_model_params += param.numel()
        if param.requires_grad:
            trainable_model_params += param.numel()
    return (
        f"trainable model parameters: {trainable_model_params}\n"
        f"all model parameters: {all_model_params}\n"
        f"percentage of trainable model parameters: "
        f"{100 * trainable_model_params / all_model_params:.2f}%"
    )

print(print_number_of_trainable_model_parameters(model))

trainable model parameters: 112361472
all model parameters: 406290432
percentage of trainable model parameters: 27.66%


In [122]:
model

BartForConditionalGeneration(
  (model): BartModel(
    (shared): BartScaledWordEmbedding(50264, 1024, padding_idx=1)
    (encoder): BartEncoder(
      (embed_tokens): BartScaledWordEmbedding(50264, 1024, padding_idx=1)
      (embed_positions): BartLearnedPositionalEmbedding(1026, 1024)
      (layers): ModuleList(
        (0-11): 12 x BartEncoderLayer(
          (self_attn): BartSdpaAttention(
            (k_proj): Linear(in_features=1024, out_features=1024, bias=True)
            (v_proj): Linear(in_features=1024, out_features=1024, bias=True)
            (q_proj): Linear(in_features=1024, out_features=1024, bias=True)
            (out_proj): Linear(in_features=1024, out_features=1024, bias=True)
          )
          (self_attn_layer_norm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
          (activation_fn): GELUActivation()
          (fc1): Linear(in_features=1024, out_features=4096, bias=True)
          (fc2): Linear(in_features=4096, out_features=1024, bias=True)
    

In [123]:
# Load model and tokenizer
model = BartForConditionalGeneration.from_pretrained("facebook/bart-large-cnn")
tokenizer = AutoTokenizer.from_pretrained("facebook/bart-large-cnn")



In [124]:
# Freeze all encoder layers except the last 2
for i, layer in enumerate(model.model.encoder.layers):
    if i < len(model.model.encoder.layers) - 2:
        for param in layer.parameters():
            param.requires_grad = False

# Freeze all decoder layers except the last 2
for i, layer in enumerate(model.model.decoder.layers):
    if i < len(model.model.decoder.layers) - 2:
        for param in layer.parameters():
            param.requires_grad = False


In [125]:
print(print_number_of_trainable_model_parameters(model))

trainable model parameters: 112361472
all model parameters: 406290432
percentage of trainable model parameters: 27.66%
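
The 27.66% figure can be reproduced by hand from the layer shapes in the model printout. The sketch below assumes the standard BART-large layout: the decoder layers are cut off in the printout above, so their structure (an encoder-style layer plus cross-attention and one extra LayerNorm) is an assumption here. Because the freezing loops only touch `encoder.layers` and `decoder.layers`, the embeddings remain trainable and account for a large share of the total.

```python
D_MODEL, D_FFN, VOCAB, POSITIONS, NUM_LAYERS = 1024, 4096, 50264, 1026, 12


def linear(n_in, n_out):
    return n_in * n_out + n_out  # weight matrix plus bias


layer_norm = 2 * D_MODEL                   # scale and shift vectors
attention = 4 * linear(D_MODEL, D_MODEL)   # q, k, v and output projections
feed_forward = linear(D_MODEL, D_FFN) + linear(D_FFN, D_MODEL)

encoder_layer = attention + layer_norm + feed_forward + layer_norm
# Assumed: a decoder layer adds cross-attention and one more LayerNorm
decoder_layer = encoder_layer + attention + layer_norm

# Shared token embedding, plus positional embeddings and an embedding
# LayerNorm for each of the two stacks; none of these are frozen
outside_layers = VOCAB * D_MODEL + 2 * (POSITIONS * D_MODEL + layer_norm)

all_params = outside_layers + NUM_LAYERS * (encoder_layer + decoder_layer)
trainable = outside_layers + 2 * (encoder_layer + decoder_layer)  # last 2 layers per stack
print(all_params, trainable, f"{100 * trainable / all_params:.2f}%")
```

Both totals agree with the printed values (406290432 and 112361472). Roughly half of the trainable parameters sit in the shared token embedding, so freezing `model.model.shared` as well would be one way to shrink the trainable set further.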


In [126]:
output_dir = f'./article-highlight-training-{str(int(time.time()))}'


training_args = TrainingArguments(
    output_dir=output_dir,
    learning_rate=1e-5,  # small learning rate for fine-tuning
    per_device_train_batch_size=4,  # small batch size to fit in memory
    gradient_accumulation_steps=2,  # effective batch size of 4 * 2 = 8 per device
    num_train_epochs=3,
    warmup_steps=50,
    weight_decay=0.01,
    logging_steps=10,
    evaluation_strategy="steps",  # named eval_strategy in newer transformers releases
    eval_steps=500,
    save_steps=1000,  # a multiple of eval_steps, as load_best_model_at_end requires
    save_total_limit=3,
    load_best_model_at_end=True,
    report_to="none"
)

data_collator = DataCollatorForSeq2Seq(tokenizer, model=model)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset_reduced['train'],
    eval_dataset=tokenized_dataset_reduced['validation'],
    data_collator=data_collator 
)




In [127]:
trainer.train()

Step,Training Loss,Validation Loss
500,1.3319,1.575234
1000,1.4648,1.563635
1500,1.3253,1.573067
2000,1.3465,1.579925
2500,1.326,1.585557
3000,1.1981,1.588271
3500,1.4017,1.588427
4000,1.2795,1.592092


Non-default generation parameters: {'max_length': 142, 'min_length': 56, 'early_stopping': True, 'num_beams': 4, 'length_penalty': 2.0, 'no_repeat_ngram_size': 3, 'forced_bos_token_id': 0, 'forced_eos_token_id': 2}

TrainOutput(global_step=4308, training_loss=1.3684610853398853, metrics={'train_runtime': 7019.9207, 'train_samples_per_second': 4.908, 'train_steps_per_second': 0.614, 'total_flos': 3.732646362434765e+16, 'train_loss': 1.3684610853398853, 'epoch': 3.0})
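
The step count and the validation curve can be sanity-checked with a few lines of arithmetic. This sketch assumes a single device and one optimizer update per `gradient_accumulation_steps` batches. The validation losses are copied from the log above.

```python
import math

train_rows, batch_size, grad_accum, epochs = 11485, 4, 2, 3

batches_per_epoch = math.ceil(train_rows / batch_size)  # 2872 batches
updates_per_epoch = batches_per_epoch // grad_accum      # 1436 optimizer updates
total_updates = updates_per_epoch * epochs

# (step, validation loss) pairs copied from the training log
validation_log = [
    (500, 1.575234), (1000, 1.563635), (1500, 1.573067), (2000, 1.579925),
    (2500, 1.585557), (3000, 1.588271), (3500, 1.588427), (4000, 1.592092),
]
best_step, best_loss = min(validation_log, key=lambda pair: pair[1])
print(total_updates, best_step, best_loss)
```

The computed total of 4308 matches `global_step` in the `TrainOutput`. The validation loss is lowest at step 1000 and drifts upward afterwards, so by this measure later checkpoints are not better than the early one.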

## Test out finetuned model

In [131]:
# Load finetuned model
finetuned_model = BartForConditionalGeneration.from_pretrained(r"C:\Users\tom_r\Desktop\Generative-AI\LLM_Projects\article-highlight-training-1726671662\checkpoint-4000")


In [133]:
example_indices = [20, 50, 75, 100, 200, 250, 300, 400]

for i, index in enumerate(example_indices):
    dialogue = dataset['test'][index]['article']
    summary = dataset['test'][index]['highlights']

    prefix = 'Summarize in one or two sentences the following article: '
    
    inputs = tokenizer(prefix + dialogue, max_length=512, truncation=True, return_tensors='pt')
    output = tokenizer.decode(
        finetuned_model.generate(
            inputs["input_ids"], 
            max_new_tokens=250,
        )[0], 
        skip_special_tokens=True
    )
    
    print(dash_line)
    print('Example ', i + 1)
    print(dash_line)
    print(f'INPUT PROMPT:\n{dialogue}')
    print(dash_line)
    print(f'BASELINE HUMAN SUMMARY:\n{summary}')
    print(dash_line)
    print(f'Finetuned BART model:\n{output}\n')

----------------------------------------------------------------------------------------------------
Example  1
----------------------------------------------------------------------------------------------------
INPUT PROMPT:
Norfolk, Virginia (CNN)The second mate of the Houston Express probably couldn't believe what he was seeing. Hundreds of miles from land there was a small boat nearby. At first it looked abandoned. It was in bad shape, listing to one side. The crew of the 1,000-foot long container ship thought it was a yacht that had wrecked. Incredibly, as they got closer, they saw there was a man on it, signaling for help. "He was moving, walking around, waving to us and in surprisingly good condition," Capt. Thomas Grenz told CNN by phone Friday. That man, Louis Jordan, 37, had an amazing story. He'd been drifting on the 35-foot Pearson sailboat for more than two months since leaving Conway, South Carolina, to fish in the ocean. Just a few days into his trip, a storm capsized h

### Multi-shot inference

Fine-tuning has shown little improvement in shortening the summaries to highlight length. Including a few worked examples in the prompt might help the model.
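
Decoding settings may be another factor. The generation parameters logged during training show that this checkpoint ships with `min_length` 56 and `length_penalty` 2.0, which push beam search towards longer outputs regardless of the prompt. The sketch below only builds a dictionary of overrides. The helper `short_summary_kwargs` and its default numbers are hypothetical and untuned, and the effect on summary length was not tested in this notebook.

```python
# Values logged for facebook/bart-large-cnn in the training output
CHECKPOINT_DEFAULTS = {
    "max_length": 142,
    "min_length": 56,
    "num_beams": 4,
    "length_penalty": 2.0,
    "no_repeat_ngram_size": 3,
}


def short_summary_kwargs(min_length=10, max_new_tokens=60, length_penalty=1.0):
    """Hypothetical overrides aimed at shorter outputs; the numbers are guesses."""
    overrides = dict(CHECKPOINT_DEFAULTS)
    overrides.pop("max_length")  # replaced by max_new_tokens below
    overrides.update(
        min_length=min_length,
        max_new_tokens=max_new_tokens,
        length_penalty=length_penalty,
    )
    return overrides


generation_kwargs = short_summary_kwargs()
print(generation_kwargs)
# Intended use (not run here):
# model.generate(inputs["input_ids"], **generation_kwargs)
```

All keys are standard `generate` arguments. Lowering `min_length` removes the floor of 56 tokens, which on its own already corresponds to roughly two or three sentences.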


In [144]:
# Load model and tokenizer
model = BartForConditionalGeneration.from_pretrained("facebook/bart-large-cnn")
tokenizer = AutoTokenizer.from_pretrained("facebook/bart-large-cnn")



In [145]:
tokenized_dataset_reduced

DatasetDict({
    train: Dataset({
        features: ['article', 'highlights', 'id', 'input_ids', 'attention_mask', 'labels'],
        num_rows: 11485
    })
    validation: Dataset({
        features: ['article', 'highlights', 'id', 'input_ids', 'attention_mask', 'labels'],
        num_rows: 535
    })
    test: Dataset({
        features: ['article', 'highlights', 'id', 'input_ids', 'attention_mask', 'labels'],
        num_rows: 460
    })
})

In [146]:
def make_prompt(example_indices_full, example_index_to_summarize):
    prompt = ''
    for index in example_indices_full:
        article = dataset['test'][index]['article']
        highlight = dataset['test'][index]['highlights']
        
        prompt += f"""
Article:

{article}

Summarize the article in one or two sentences? 
{highlight}


"""
    
    article = dataset['test'][example_index_to_summarize]['article']
    
    prompt += f"""
Article:

{article}

Summarize the article in one or two sentences?
"""
        
    return prompt

In [147]:
example_indices_full = [40]
example_index_to_summarize = 200

one_shot_prompt = make_prompt(example_indices_full, example_index_to_summarize)

print(one_shot_prompt)


Article:

(CNN)A high temperature of 63.5 degrees Fahrenheit might sound like a pleasant day in early spring -- unless you're in Antarctica. The chilly continent recorded the temperature (15.5 degrees Celsius) on March 24, possibly the highest ever recorded on Antarctica, according to the Weather Underground. The temperature was recorded at Argentina's Esperanza Base on the northern tip of the Antarctica Peninsula, according to CNN affiliate WTNH. (Note to map lovers: The Argentine base is not geographically part of the South American continent.) The World Meteorological Organization, a specialized United Nations agency, is in the process of setting up an international ad-hoc committee of about 10 blue-ribbon climatologists and meteorologists to begin collecting relevant evidence, said Randy Cerveny, the agency's lead rapporteur of weather and climate extremes and Arizona State University professor of geographical sciences. The committee will examine the equipment used to measure the 

In [152]:
highlight = dataset['test'][example_index_to_summarize]['highlights']

inputs = tokenizer(one_shot_prompt, return_tensors='pt')
output = tokenizer.decode(
    model.generate(
        inputs["input_ids"],
        max_new_tokens=200,
    )[0], 
    skip_special_tokens=True
)

print(dash_line)
print(f'BASELINE HUMAN SUMMARY:\n{highlight}\n')
print(dash_line)
print(f'MODEL GENERATION - ONE SHOT:\n{output}')

----------------------------------------------------------------------------------------------------
BASELINE HUMAN SUMMARY:
Aaron Hernandez has been found guilty in Odin Lloyd's death, but his troubles are not over . He also faces murder charges in Suffolk County, Massachusetts, but trial was postponed . In addition, Hernandez will face two civil lawsuits; one is in relation to Suffolk County case .

----------------------------------------------------------------------------------------------------
MODEL GENERATION - ONE SHOT:
The temperature was recorded at Argentina's Esperanza Base on the northern tip of the Antarctica Peninsula. The World Meteorological Organization is in the process of setting up an international ad-hoc committee of about 10 blue-ribbon climatologists and meteorologists. The committee will examine the equipment used to measure the temperature, whether it was in good working order.
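
The generation above summarizes the Antarctica example article, while the human summary concerns Aaron Hernandez, so the model did not treat the first article as a demonstration. A plausible reason is that `facebook/bart-large-cnn` is a summarization model rather than an instruction-following one, so it reads the whole prompt as a single document. The sketch below is one rough way to flag this automatically by measuring word overlap; the helper names `content_words` and `overlap` are hypothetical, and the strings are short excerpts copied from the outputs above.

```python
import re

def content_words(text):
    """Return the lowercased alphabetic words longer than three characters."""
    return {w for w in re.findall(r"[a-z]+", text.lower()) if len(w) > 3}

def overlap(summary, reference):
    """Fraction of the summary's content words that also occur in the reference."""
    summary_words = content_words(summary)
    if not summary_words:
        return 0.0
    return len(summary_words & content_words(reference)) / len(summary_words)

# Excerpts copied from the cell outputs above.
model_output = ("The temperature was recorded at Argentina's Esperanza Base "
                "on the northern tip of the Antarctica Peninsula.")
example_article = ("The temperature was recorded at Argentina's Esperanza Base "
                   "on the northern tip of the Antarctica Peninsula, "
                   "according to CNN affiliate WTNH.")
target_highlight = ("Aaron Hernandez has been found guilty in Odin Lloyd's death, "
                    "but his troubles are not over .")

print(overlap(model_output, example_article))   # high: output copies the example
print(overlap(model_output, target_highlight))  # low: output ignores the target
```

In the notebook, the same check could be run against the full example and target articles instead of these excerpts.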


In [153]:
example_indices_full = [40, 80, 120]
example_index_to_summarize = 200

few_shot_prompt = make_prompt(example_indices_full, example_index_to_summarize)

print(few_shot_prompt)


Article:

(CNN)A high temperature of 63.5 degrees Fahrenheit might sound like a pleasant day in early spring -- unless you're in Antarctica. The chilly continent recorded the temperature (15.5 degrees Celsius) on March 24, possibly the highest ever recorded on Antarctica, according to the Weather Underground. The temperature was recorded at Argentina's Esperanza Base on the northern tip of the Antarctica Peninsula, according to CNN affiliate WTNH. (Note to map lovers: The Argentine base is not geographically part of the South American continent.) The World Meteorological Organization, a specialized United Nations agency, is in the process of setting up an international ad-hoc committee of about 10 blue-ribbon climatologists and meteorologists to begin collecting relevant evidence, said Randy Cerveny, the agency's lead rapporteur of weather and climate extremes and Arizona State University professor of geographical sciences. The committee will examine the equipment used to measure the 

In [155]:
summary = dataset['test'][example_index_to_summarize]['highlights']

inputs = tokenizer(few_shot_prompt, return_tensors='pt')
output = tokenizer.decode(
    model.generate(
        inputs["input_ids"],
        max_new_tokens=50,
    )[0], 
    skip_special_tokens=True
)

print(dash_line)
print(f'BASELINE HUMAN SUMMARY:\n{summary}\n')
print(dash_line)
print(f'MODEL GENERATION - FEW SHOT:\n{output}')

IndexError: index out of range in self
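
This `IndexError` most likely comes from the position-embedding lookup: `facebook/bart-large-cnn` supports inputs of up to 1024 tokens, and a prompt with three full articles plus the target article exceeds that. Passing `truncation=True, max_length=1024` to the tokenizer would avoid the crash, but by default truncation cuts from the end of the text, which is exactly where the article to summarize sits. The sketch below shows an alternative under stated assumptions: reserve room for the target article first, then keep only as many examples as fit. The names `count_tokens`, `format_shot`, and `build_budgeted_prompt` are hypothetical, and the whitespace counter is a stand-in for `len(tokenizer(text)["input_ids"])`.

```python
def count_tokens(text):
    # Stand-in counter (assumption): whitespace-separated words.
    # With the notebook's tokenizer, use len(tokenizer(text)["input_ids"]).
    return len(text.split())

def format_shot(article, highlight=None):
    # Mirrors the layout of make_prompt; the target article has no highlight.
    block = f"\nArticle:\n\n{article}\n\nSummarize the article in one or two sentences?\n"
    if highlight is not None:
        block += f"{highlight}\n\n\n"
    return block

def build_budgeted_prompt(shots, target_article, max_tokens, counter=count_tokens):
    """Keep as many leading shots as fit next to the target article."""
    target_block = format_shot(target_article)
    remaining = max_tokens - counter(target_block)  # reserve the target first
    kept = []
    for article, highlight in shots:
        block = format_shot(article, highlight)
        cost = counter(block)
        if cost > remaining:
            break  # this shot would push the prompt over the budget
        kept.append(block)
        remaining -= cost
    return "".join(kept) + target_block, len(kept)

# Toy data: three 40-word "articles" and a 40-word target, budget of 150 tokens.
shots = [("a " * 40, "short a"), ("b " * 40, "short b"), ("c " * 40, "short c")]
prompt, n_kept = build_budgeted_prompt(shots, "d " * 40, max_tokens=150)
print(n_kept, count_tokens(prompt))
```

If the target article alone exceeds the budget, no examples are kept and the article itself would still need to be truncated before generation.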