1. datasets: Converts data that has already been loaded (for example, a pandas DataFrame) into a Dataset object that the transformers library can work with. We use the transformers library to fine-tune our model. The datasets library can also load datasets directly from Hugging Face.

2. bitsandbytes: model quantization

3. accelerate: bitsandbytes quantization depends on accelerate

4. peft: applying LoRA

In [6]:
!pip install opendatasets datasets transformers peft accelerate bitsandbytes --upgrade --quiet

In [8]:
# Download the dataset directly from Kaggle
import opendatasets as od
od.download("https://www.kaggle.com/datasets/gowrishankarp/newspaper-text-summarization-cnn-dailymail")

Skipping, found downloaded files in "./newspaper-text-summarization-cnn-dailymail" (use force=True to force download)


In [9]:
import os
import torch
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    TrainingArguments,
    pipeline,
    logging,
    GenerationConfig,
    Trainer,
)
from peft import LoraConfig, get_peft_model
import pandas as pd
from datasets import Dataset
import re

In [10]:
# load_in_4bit=True stores the weights in 4 bits instead of a higher precision
# (e.g., 16- or 32-bit floating point), which reduces the memory needed to load the model.
# "nf4" selects the 4-bit NormalFloat quantization data type.
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
)

tokenizer = AutoTokenizer.from_pretrained("bigscience/bloom-1b1")
model = AutoModelForCausalLM.from_pretrained("bigscience/bloom-1b1", quantization_config=quant_config)

`low_cpu_mem_usage` was None, now set to True since model is quantized.
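
To give a feel for what 4-bit loading saves, here is a rough back-of-the-envelope sketch of weight memory at different precisions. The parameter count is borrowed from the "all params" figure that `print_trainable_parameters()` reports later in this notebook; the helper name `weight_memory_gib` is made up for illustration. Real memory use will likely be higher, since this ignores quantization constants, activations, and any layers that stay in higher precision.

```python
# Rough weight-memory estimate at different precisions (weights only).
# PARAM_COUNT comes from the "all params" output later in this notebook;
# it includes the small LoRA adapter, so treat it as approximate.
PARAM_COUNT = 1_067_673_600

BITS_PER_PARAM = {"fp32": 32, "fp16": 16, "int8": 8, "nf4": 4}

def weight_memory_gib(param_count: int, bits: int) -> float:
    """Memory needed to store the weights alone, in GiB."""
    return param_count * bits / 8 / 1024**3

for name, bits in BITS_PER_PARAM.items():
    print(f"{name:>5}: {weight_memory_gib(PARAM_COUNT, bits):.2f} GiB")
```

Under these assumptions the 4-bit weights take one eighth of the space of 32-bit weights.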


In [11]:
# Only the train split is used, and only a 10,000-row sample of it, because more
# data requires more compute and time.
# Only the text (article) and the summary (highlights) columns are needed.
train_df = pd.read_csv("/content/newspaper-text-summarization-cnn-dailymail/cnn_dailymail/train.csv")[["article", "highlights"]]
train_df = train_df.sample(10000)

In [12]:
train_df.head()

Unnamed: 0,article,highlights
266348,"By . Sophie Jane Evans . For most owners, the ...",Amusing photos show missing pet signs put up b...
2112,"Nairobi, Kenya (CNN) -- Vote counting began he...","NEW: ""The work, so far, is very good,"" says el..."
247612,Britain has been damaged by . ‘unthinking clim...,Energy Secretary Ed Davey used floods crisis t...
216792,In their fresh-faced modelling heyday they wer...,Claudia Schiffer and Helena Christensen were b...
58291,Ravel Morrison has been cleared of threatening...,Ravel Morrison has been cleared of harassing h...


In [13]:
# This function is explained in the full fine-tuning notebook
def filter_text(text):
  text = text.lower()
  text = re.sub('[^A-Za-z0-9]+', ' ', text)
  return text

train_df["article"] = train_df["article"].apply(filter_text)
train_df["highlights"] = train_df["highlights"].apply(filter_text)

In [14]:
train_df.head()

Unnamed: 0,article,highlights
266348,by sophie jane evans for most owners the thoug...,amusing photos show missing pet signs put up b...
2112,nairobi kenya cnn vote counting began here mon...,new the work so far is very good says election...
247612,britain has been damaged by unthinking climate...,energy secretary ed davey used floods crisis t...
216792,in their fresh faced modelling heyday they wer...,claudia schiffer and helena christensen were b...
58291,ravel morrison has been cleared of threatening...,ravel morrison has been cleared of harassing h...
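
The effect of `filter_text` is easiest to see on a few short strings. The sketch below repeats the function so it runs on its own; the sample strings are illustrative fragments, not rows taken from the dataset.

```python
import re

def filter_text(text):
    # Same cleaning as in the cell above: lowercase, then replace every run
    # of non-alphanumeric characters with a single space.
    text = text.lower()
    text = re.sub('[^A-Za-z0-9]+', ' ', text)
    return text

# Illustrative inputs (assumed for this example)
samples = [
    "By . Sophie Jane Evans . For most owners, the",
    "PM2.5 should not exceed 5 micrograms",
    "the club's bosses",
]
for s in samples:
    print(repr(filter_text(s)))
```

Note that the cleaning also splits decimals and possessives ("PM2.5" becomes "pm2 5", "club's" becomes "club s"), which is why the cleaned articles and the generated summaries contain fragments like these.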


In [15]:
# Creating prompts and storing them in a column called final_statement
train_df["final_statement"] = ""
for idx, row in train_df.iterrows():
    train_df.at[idx, "final_statement"] = "Summarize the following article.\n\n" + str(row["article"]) + "\nSummary:\n" + str(row["highlights"])

# Keeping only the final_statement column
train_df = train_df[["final_statement"]]
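
The same prompt template is needed twice in this notebook: once with the summary attached (training) and once without it (inference). A minimal sketch of a shared helper is shown below; `build_prompt`, `PROMPT_HEADER`, and `SUMMARY_MARKER` are hypothetical names introduced here, and the sample text is made up.

```python
PROMPT_HEADER = "Summarize the following article.\n\n"
SUMMARY_MARKER = "\nSummary:\n"

def build_prompt(article, summary=None):
    """Return a training prompt (with summary) or an inference prompt (without)."""
    prompt = PROMPT_HEADER + str(article) + SUMMARY_MARKER
    if summary is not None:
        prompt += str(summary)
    return prompt

# Made-up example text
train_example = build_prompt("some cleaned article text", "a short summary")
infer_example = build_prompt("some cleaned article text")

print(train_example)
# Splitting on the marker recovers the summary part
print(train_example.split(SUMMARY_MARKER)[1])
```

Keeping the marker in one place helps ensure that the string used to split the generated text matches the one the model saw during training.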

In [16]:
train_df.sample()

Unnamed: 0,final_statement
101684,Summarize the following article.\n\nby daily m...


In [17]:
print(train_df["final_statement"].iloc[9])

Summarize the following article.

 cnn real madrid coach jose mourinho has revealed that the spanish club s bosses have relented and will seek to sign him a new striker after all mourinho had initially been told that he could not replace gonzalo higuain who could miss the rest of this season after having back surgery france international karim benzema has been filling in for the argentine but mourinho wants more options as real seek to stop barcelona retaining the spanish title and also to win the european champions league for the first time since 2002 i am sticking to my beliefs i think it would be good to have another forward for what is left of the season mourinho told madrid s website ahead of sunday s trip to bottom club almeria the club is looking into the possibility of signing a forward which pleases me to no end real have been linked with a move to bring veteran dutch striker ruud van nistelrooy back to the santiago bernabeu and the player s german club hamburg confirmed the l



---


Example Output

The resulting train_tokenized_datasets contains tokenized inputs and labels ready for model training. Here’s a simplified example of what one entry might look like after tokenization:

{
    'input_ids': tensor([[    12,     34,     56, ...,    50256,  50256,  50256]]),
    'labels': tensor([[    12,     34,     56, ...,    50256,  50256,  50256]])
}

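In this example, 50256 stands for the EOS token id, which is also used for padding. (50256 is the EOS id of the GPT-2 tokenizer; the actual id depends on the tokenizer, so it will be different for bigscience/bloom-1b1.)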

Summary of Why Use pad_token = eos_token

By setting the padding token to the EOS token, we ensure:

- Consistent handling of sequence endings and padding.
- Avoidance of issues with undefined or unknown tokens.
- Compatibility with models expecting specific token handling (like treating padding as end-of-sequence).
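
The padding step can be sketched in plain Python. The helper `pad_and_label` below is hypothetical and only mimics what `padding="max_length"` does for one sequence. It also shows a common variation that this notebook does not use: replacing the labels at padded positions with -100, the default `ignore_index` of PyTorch's cross-entropy loss, so that padding does not contribute to the loss. The mask is built from the sequence length rather than the token id, because the pad id and the EOS id are the same here.

```python
IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss skips by default

def pad_and_label(token_ids, max_length, pad_id, mask_padding=True):
    """Truncate/pad token_ids to max_length and build matching labels."""
    ids = list(token_ids)[:max_length]
    n_real = len(ids)
    n_pad = max_length - n_real
    input_ids = ids + [pad_id] * n_pad
    attention_mask = [1] * n_real + [0] * n_pad
    if mask_padding:
        # Variation (assumption, not what the notebook does): ignore padding in the loss
        labels = ids + [IGNORE_INDEX] * n_pad
    else:
        # What the notebook does: labels are identical to input_ids
        labels = list(input_ids)
    return input_ids, attention_mask, labels

EOS_ID = 50256  # placeholder id taken from the simplified example above
input_ids, attention_mask, labels = pad_and_label([12, 34, 56], 6, EOS_ID)
print(input_ids)       # [12, 34, 56, 50256, 50256, 50256]
print(attention_mask)  # [1, 1, 1, 0, 0, 0]
print(labels)          # [12, 34, 56, -100, -100, -100]
```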


---



**During fine-tuning, the transformers library expects both inputs and outputs (labels).**

For decoder-only models (e.g., GPT, Llama, BLOOM), the labels are the same token ids as the inputs: the model is trained to predict each next token of the sequence.
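
Using the same ids for inputs and labels works for a causal (decoder-only) language model because the targets are simply the inputs shifted by one position; in Hugging Face causal LM models this shift happens inside the model when `labels` is passed. The toy sketch below makes the pairing explicit. The function name and the word-level "tokens" are made up for illustration.

```python
def next_token_pairs(token_ids):
    """Pair each position with the token the model should predict next."""
    inputs = token_ids[:-1]   # positions 0 .. n-2 serve as context
    targets = token_ids[1:]   # positions 1 .. n-1 are the prediction targets
    return list(zip(inputs, targets))

# Word-level stand-ins for token ids (illustrative only)
tokens = ["Summarize", "the", "article", "Summary:", "short", "text"]
for current, target in next_token_pairs(tokens):
    print(f"after {current!r} predict {target!r}")
```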


---



The remove_columns=train_data.column_names argument in the map function is used to remove the original columns from the dataset after the tokenization process. Here's why this is important and beneficial:

When you apply the map function to a Dataset object, you often create new columns or modify existing ones. In this case, tokenize_function adds two new columns, input_ids and labels. The original columns (here final_statement, plus any index column carried over from the DataFrame) are no longer needed once the data has been tokenized.


---



In [18]:
# Reuse the EOS token as the padding token
tokenizer.pad_token = tokenizer.eos_token

def tokenize_function(example):
    # Tokenize once: pad or truncate every prompt to exactly 250 tokens
    tokens = tokenizer(example["final_statement"], padding="max_length", max_length=250, truncation=True)
    example["input_ids"] = tokens["input_ids"]
    # For causal LM fine-tuning the labels are a copy of the input ids
    example["labels"] = [ids[:] for ids in tokens["input_ids"]]
    return example

# Convert your DataFrame into a Dataset object
train_data = Dataset.from_pandas(train_df)

# Apply the tokenize function and drop the original columns
train_tokenized_datasets = train_data.map(tokenize_function, batched=True, remove_columns=train_data.column_names)


In [19]:
print(tokenizer.decode(train_tokenized_datasets[8]["input_ids"], skip_special_tokens = True))

Summarize the following article.

her husband remains in a secure hospital receiving treatment for a mental breakdown and drugs withdrawal by daily mail reporter published 14 35 est 17 july 2012 updated 03 45 est 18 july 2012 billionaire hans kristian rausing s in laws have vowed to stand by him as he s charged with preventing the lawful and decent burial of his wife eva insisting they love the tetra pak heir unconditionally after initially arresting him on suspicion of murder scotland yard announced yesterday that rausing 49 had been charged with preventing eva rausing s burial after it emerged he may have lived with his wife s body in their 70 million belgravia mansion for a week or more after she died of a suspected drugs overdose but mrs rausing s father tom kemeny is refusing to blame his son in law in a statement he has paid tribute to a beautiful generous and fun daughter wife mother sister and aunt and described mr rausing as his son grim discovery police found the badly decomp

In [20]:
peft_params = LoraConfig(
    lora_alpha=16,
    lora_dropout=0.1,
    r=16,
    bias="none",
    task_type="CAUSAL_LM",
)

# get_peft_model adds the LoRA layers to the original model.
# The original model parameters get requires_grad = False (frozen weights),
# while the LoRA weights get requires_grad = True (they are updated during fine-tuning).
peft_model = get_peft_model(model, peft_params)
peft_model.print_trainable_parameters()

trainable params: 2,359,296 || all params: 1,067,673,600 || trainable%: 0.22097539922313336
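
The trainable-parameter count above can be reproduced by hand. The sketch below assumes the BLOOM-1b1 dimensions (hidden size 1536, 24 layers) and assumes that LoRA is attached to the fused `query_key_value` projection of each layer; none of these are stated in this notebook, so treat them as assumptions that happen to match the printed figure.

```python
# Assumed BLOOM-1b1 dimensions (not stated in the notebook)
HIDDEN_SIZE = 1536
NUM_LAYERS = 24
# Assumed LoRA target: the fused query_key_value projection, hidden -> 3 * hidden
IN_FEATURES = HIDDEN_SIZE
OUT_FEATURES = 3 * HIDDEN_SIZE

R = 16                      # r from peft_params
LORA_ALPHA = 16             # lora_alpha from peft_params
ALL_PARAMS = 1_067_673_600  # "all params" from the printed output

def lora_params_per_layer(in_features, out_features, r):
    # LoRA adds two matrices per adapted layer: A (r x in) and B (out x r)
    return r * in_features + out_features * r

trainable = NUM_LAYERS * lora_params_per_layer(IN_FEATURES, OUT_FEATURES, R)
print(trainable)                     # 2359296
print(100 * trainable / ALL_PARAMS)  # percentage of trainable parameters
print(LORA_ALPHA / R)                # LoRA scaling factor: 1.0
```

With `lora_alpha` equal to `r`, the standard LoRA scaling factor `lora_alpha / r` is 1, so the adapter update is added to the frozen weights without extra rescaling.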


In [21]:
training_args = TrainingArguments(
    output_dir = './model_checkpoints',
    save_total_limit = 1,
    auto_find_batch_size = True,
    learning_rate = 1e-3,
    num_train_epochs = 1,
)

# Only the LoRA layers are trainable, so the Trainer needs no LoRA-specific
# settings; passing the PEFT model is enough.
trainer = Trainer(
    model = peft_model,
    args = training_args,
    train_dataset = train_tokenized_datasets,
)

trainer.train()

trainer.model.save_pretrained('./final_model')
tokenizer.save_pretrained('./final_model')



Step,Training Loss
500,3.3096
1000,3.2582
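
The two logged steps are consistent with a simple step count. The sketch below assumes the Trainer defaults of a per-device batch size of 8 and a logging interval of 500 steps, a single device, no gradient accumulation, and that `auto_find_batch_size` did not have to reduce the batch size; none of this is printed by the notebook.

```python
import math

NUM_EXAMPLES = 10_000   # rows sampled into train_df
NUM_EPOCHS = 1          # num_train_epochs
BATCH_SIZE = 8          # assumed: Trainer default per-device batch size
LOGGING_STEPS = 500     # assumed: Trainer default logging interval

steps_per_epoch = math.ceil(NUM_EXAMPLES / BATCH_SIZE)
total_steps = steps_per_epoch * NUM_EPOCHS
logged_steps = list(range(LOGGING_STEPS, total_steps + 1, LOGGING_STEPS))

print(total_steps)   # 1250
print(logged_steps)  # [500, 1000]
```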




('./final_model/tokenizer_config.json',
 './final_model/special_tokens_map.json',
 './final_model/tokenizer.json')

max_new_tokens sets the maximum number of tokens the model may generate beyond the prompt. Here the prompt contains the article and the generated tokens form the summary, so increasing max_new_tokens allows a longer summary.

In [22]:
news_article = """
All but one of the 100 cities with the world’s worst air pollution last year were in Asia, according to a new report, with the climate crisis playing a pivotal role in bad air quality that is risking the health of billions of people worldwide.

The vast majority of these cities — 83 — were in India and all exceeded the World Health Organization’s air quality guidelines by more than 10 times, according to the report by IQAir, which tracks air quality worldwide.

The study looked specifically at fine particulate matter, or PM2.5, which is the tiniest pollutant but also the most dangerous. Only 9% of more than 7,800 cities analyzed globally recorded air quality that met WHO’s standard, which says average annual levels of PM2.5 should not exceed 5 micrograms per cubic meter.

“We see that in every part of our lives that air pollution has an impact,” said IQAir Global CEO Frank Hammes. “And it typically, in some of the most polluted countries, is likely shaving off anywhere between three to six years of people’s lives. And then before that will lead to many years of suffering that are entirely preventable if there’s better air quality.”

"""

# Build the same prompt format used for training, leaving the summary empty
filtered_news_article = "Summarize the following article.\n\n" + filter_text(news_article) + "\nSummary:\n"
# truncation=True makes the 250-token limit explicit; .to(model.device) moves the inputs to the model's device
tokenized_news_article = tokenizer(filtered_news_article, max_length=250, truncation=True, return_tensors="pt").to(model.device)
# Generate through the PEFT model so the fine-tuned LoRA weights are used
output = peft_model.generate(**tokenized_news_article, max_new_tokens=100)
print(output.shape)
summary = tokenizer.decode(output[0], skip_special_tokens=True)


torch.Size([1, 332])
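
The printed shape covers the prompt and the generated tokens together. With max_new_tokens = 100, a length of 332 suggests a prompt of 232 tokens (inferred as 332 - 100, assuming generation did not stop early). The toy loop below is a hypothetical stand-in for `generate`, not the library's implementation; it only illustrates how the cap on new tokens determines the output length.

```python
def toy_generate(prompt_ids, next_token_fn, max_new_tokens, eos_id=None):
    """Append one token at a time, up to max_new_tokens new tokens."""
    output = list(prompt_ids)
    for _ in range(max_new_tokens):
        token = next_token_fn(output)
        output.append(token)
        if eos_id is not None and token == eos_id:
            break  # generation may stop early at end-of-sequence
    return output

def always_seven(ids):
    # Hypothetical stand-in for a model: always emits token 7
    return 7

prompt = list(range(232))  # 232 is inferred from the printed shape, not printed itself
out = toy_generate(prompt, always_seven, max_new_tokens=100)
print(len(out))                # 332
print(len(out) - len(prompt))  # 100
```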


In [23]:
print(summary.split("\nSummary:\n")[1])

the report shows that the majority of cities in the world have air quality that is more than 10 times higher than the world health organization s guidelines the study looked specifically at fine particulate matter or pm2 5 which is the tiniest pollutant but also the most dangerous the report shows that the majority of cities in the world have air quality that is more than 10 times higher than the world health organization s guidelines the study looked specifically at fine particulate matter or pm2 5 which is the t
