**Goal: Fine-tune a pre-trained language model to generate high-quality abstractive summaries using the SAMSum dataset.**

**Dataset: The SAMSum dataset consists of conversational dialogues and corresponding concise summaries. It is particularly challenging due to the informal and context-rich nature of the dialogues, requiring models to generate accurate summaries.**


# **Step 1: Import Libraries**

In [1]:
from transformers import T5ForConditionalGeneration, T5Tokenizer, Trainer, TrainingArguments
import pandas as pd


# **Step 2: Loading the Dataset**
> **I am going to load the dataset and explore its structure**


In [2]:
# Load dataset files
train_data = pd.read_csv("/kaggle/input/samsum-dataset-text-summarization/samsum-train.csv")
val_data = pd.read_csv("/kaggle/input/samsum-dataset-text-summarization/samsum-validation.csv")
test_data = pd.read_csv("/kaggle/input/samsum-dataset-text-summarization/samsum-test.csv")

# Check a sample
train_data.head()


Unnamed: 0,id,dialogue,summary
0,13818513,Amanda: I baked cookies. Do you want some?\r\...,Amanda baked cookies and will bring Jerry some...
1,13728867,Olivia: Who are you voting for in this electio...,Olivia and Olivier are voting for liberals in ...
2,13681000,"Tim: Hi, what's up?\r\nKim: Bad mood tbh, I wa...",Kim may try the pomodoro technique recommended...
3,13730747,"Edward: Rachel, I think I'm in ove with Bella....",Edward thinks he is in love with Bella. Rachel...
4,13728094,Sam: hey overheard rick say something\r\nSam:...,"Sam is confused, because he overheard Rick com..."


# Observations
* **Dialogue:** Each dialogue includes speaker labels (e.g., "Amanda:" and "Sam:"). The dialogues span multiple lines (separated by `\r\n`) and follow a natural conversational flow, which is why the preprocessing step removes the line breaks and extra whitespace.
* **Summary:** Summaries are concise and relevant to the dialogues. They capture the core information in one or two sentences.
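
To make the dialogue structure concrete, here is a minimal sketch that splits a dialogue into (speaker, utterance) turns. The helper `split_turns` and the toy dialogue are my own illustrations, not part of the dataset or the pipeline below.

```python
from typing import List, Tuple

def split_turns(dialogue: str) -> List[Tuple[str, str]]:
    """Split a raw dialogue into (speaker, utterance) pairs (hypothetical helper)."""
    turns = []
    # splitlines() treats "\r\n" as a single line break
    for line in dialogue.splitlines():
        line = line.strip()
        if not line:
            continue
        # The speaker label is everything before the first colon
        speaker, sep, utterance = line.partition(":")
        if sep:
            turns.append((speaker.strip(), utterance.strip()))
        else:
            # No colon found: keep the line with an empty speaker label
            turns.append(("", line))
    return turns

# Made-up dialogue in the same "Speaker: text\r\n" layout as the dataset
toy_dialogue = "Ann: Lunch at 1?\r\nBob: Sure, see you then :)\r\nAnn: Great"
turns = split_turns(toy_dialogue)
for speaker, utterance in turns:
    print(f"{speaker!r} said {utterance!r}")
```

The fine-tuning pipeline itself does not need the turns split out; it feeds the whole dialogue to the model as one string.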


In [3]:
# To reduce training time, use only a random sample of the data
train_data = train_data.sample(n=4000,random_state=42).reset_index(drop=True)
val_data = val_data.sample(n=500, random_state=42).reset_index(drop=True)

# **Step 3: Preparing the Data (Preprocessing & Tokenization)**
> Preprocessing cleans the dialogues and summaries so they can be tokenized and truncated consistently.
> I will use the T5-small model (`t5-small`), as it's well-suited for summarization tasks:


In [4]:
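# Drop rows with null values and reset the index so rows stay contiguously numbered
train_data = train_data.dropna().reset_index(drop=True)
val_data = val_data.dropna().reset_index(drop=True)
test_data = test_data.dropna().reset_index(drop=True)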

In [5]:
import re
# Preprocessing function to clean the text
def preprocess_text(text):
    text = re.sub(r'\r\n', ' ', text)  # Remove carriage returns and line breaks
    text = re.sub(r'\s+', ' ', text)  # Remove extra spaces
    text = re.sub(r'<.*?>', '', text)  # Remove any XML tags
    text = text.strip().lower()  # Strip and convert to lower case
    return text

# Apply preprocessing to both dialogue and summary columns
train_data["dialogue"] = train_data["dialogue"].apply(preprocess_text)
train_data["summary"] = train_data["summary"].apply(preprocess_text)

val_data["dialogue"] = val_data["dialogue"].apply(preprocess_text)
val_data["summary"] = val_data["summary"].apply(preprocess_text)

test_data["dialogue"] = test_data["dialogue"].apply(preprocess_text)
test_data["summary"] = test_data["summary"].apply(preprocess_text)
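
As a quick illustration of what `preprocess_text` does, the sketch below runs it on a made-up raw string. The input text and the `<file_other>` placeholder tag are assumptions for illustration; `preprocess_text_tags_first` is a hypothetical variant, not the function used in this notebook.

```python
import re

def preprocess_text(text):
    # Same steps, in the same order, as the function above
    text = re.sub(r'\r\n', ' ', text)   # line breaks -> spaces
    text = re.sub(r'\s+', ' ', text)    # collapse whitespace runs
    text = re.sub(r'<.*?>', '', text)   # drop tag-like placeholders
    return text.strip().lower()

def preprocess_text_tags_first(text):
    # Hypothetical variant: remove tags before collapsing whitespace
    text = re.sub(r'<.*?>', '', text)
    text = re.sub(r'\s+', ' ', text)    # also covers "\r\n"
    return text.strip().lower()

raw = "Violet: Hi!\r\nViolet: <file_other>\r\nClaire: Thanks :)"
cleaned = preprocess_text(raw)
cleaned_variant = preprocess_text_tags_first(raw)
print(repr(cleaned))          # 'violet: hi! violet:  claire: thanks :)'
print(repr(cleaned_variant))  # 'violet: hi! violet: claire: thanks :)'
```

Because tags are removed after whitespace is collapsed, a removed tag can leave a double space behind. The variant avoids that, but I have not retrained with it, so treat it as an option rather than the method used here.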


In [6]:
# Check for missing (NaN) values after preprocessing; note that isnull() does not flag empty strings
print(f"Empty dialogues in train data: {train_data['dialogue'].isnull().sum()}")
print(f"Empty summaries in train data: {train_data['summary'].isnull().sum()}")
print(f"Empty dialogues in validation data: {val_data['dialogue'].isnull().sum()}")
print(f"Empty summaries in validation data: {val_data['summary'].isnull().sum()}")
print(f"Empty dialogues in test data: {test_data['dialogue'].isnull().sum()}")
print(f"Empty summaries in test data: {test_data['summary'].isnull().sum()}")


Empty dialogues in train data: 0
Empty summaries in train data: 0
Empty dialogues in validation data: 0
Empty summaries in validation data: 0
Empty dialogues in test data: 0
Empty summaries in test data: 0


In [7]:
# Inspect the first sample from the cleaned data
print("Sample Dialogue (Train):", train_data.loc[0, "dialogue"])
print("Sample Summary (Train):", train_data.loc[0, "summary"])


Sample Dialogue (Train): violet: hi! i came across this austin's article and i thought that you might find it interesting violet:  claire: hi! :) thanks, but i've already read it. :) claire: but thanks for thinking about me :)
Sample Summary (Train): violet sent claire austin's article.


In [8]:
# Check the number of rows in each dataset
print(f"Training data size: {len(train_data)}")
print(f"Validation data size: {len(val_data)}")
print(f"Test data size: {len(test_data)}")


Training data size: 4000
Validation data size: 500
Test data size: 819


In [9]:
# Load the tokenizer
tokenizer = T5Tokenizer.from_pretrained('t5-small')

You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565


In [10]:
# Tokenization function (applied to one DataFrame row at a time)
def tokenize_data(examples):
    # Tokenize the dialogue (model input) and the summary (target)
    inputs = tokenizer(examples["dialogue"], padding="max_length", truncation=True, max_length=512)
    targets = tokenizer(examples["summary"], padding="max_length", truncation=True, max_length=150)
    # Use the summary token ids as the training labels
    inputs["labels"] = targets["input_ids"]
    return inputs

# Tokenize train, validation, and test datasets
train_dataset = train_data.apply(tokenize_data, axis=1)
val_dataset = val_data.apply(tokenize_data, axis=1)
test_dataset = test_data.apply(tokenize_data, axis=1)

# Check a tokenized sample
train_dataset[0]


{'input_ids': [25208, 10, 7102, 55, 3, 23, 764, 640, 48, 403, 17, 77, 31, 7, 1108, 11, 3, 23, 816, 24, 25, 429, 253, 34, 1477, 25208, 10, 3, 7997, 15, 10, 7102, 55, 3, 10, 61, 2049, 6, 68, 3, 23, 31, 162, 641, 608, 34, 5, 3, 10, 61, 3, 7997, 15, 10, 68, 2049, 21, 1631, 81, 140, 3, 10, 61, 1, 0, 0, 0, 0, 0, 0, 0, 0, ... (output truncated)
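
Since the labels are padded to `max_length` with the pad token id (0, as the zeros in the output above show), the padded positions also count toward the training loss. A common alternative in Hugging Face seq2seq setups is to replace pad ids in the labels with -100, which PyTorch's cross-entropy loss ignores by default. The sketch below shows the idea on toy ids; `mask_pad_labels` is a hypothetical helper, and this notebook does not apply it.

```python
IGNORE_INDEX = -100  # default ignore_index of PyTorch's cross-entropy loss

def mask_pad_labels(label_ids, pad_token_id=0):
    """Replace pad ids with IGNORE_INDEX so the loss skips them (hypothetical helper)."""
    return [tok if tok != pad_token_id else IGNORE_INDEX for tok in label_ids]

# Toy label ids: three real tokens (ending in EOS id 1) followed by padding
toy_labels = [25208, 1622, 1, 0, 0, 0]
masked = mask_pad_labels(toy_labels)
print(masked)  # [25208, 1622, 1, -100, -100, -100]
```

Inside `tokenize_data` this would amount to something like `inputs["labels"] = mask_pad_labels(targets["input_ids"])`. The loss values reported later in this notebook were produced without masking, so they would not be directly comparable after such a change.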

# **Step 4: Fine-Tune the T5 Transformer**



In [11]:
# Load the pre-trained model
model = T5ForConditionalGeneration.from_pretrained('t5-small')

# **Define Training Arguments**

In [12]:
training_args = TrainingArguments(
    output_dir="./results",          # output directory for checkpoints
    num_train_epochs=6,              # number of training epochs
    per_device_train_batch_size=8,   # batch size per device during training
    per_device_eval_batch_size=8,    # batch size for evaluation
    warmup_steps=500,                # number of warmup steps for learning rate scheduler
    weight_decay=0.01,               # strength of weight decay
    logging_dir="./logs",            # directory for storing logs
    logging_steps=50,                # how often to log training info
    save_steps=500,                  # how often to save a model checkpoint
    eval_steps=50,                   # evaluation interval in steps (not used while eval_strategy="epoch")
    eval_strategy="epoch",           # run evaluation at the end of every epoch
    report_to="none"
)
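
To see what these arguments imply, here is a rough sketch of the step count and the learning-rate schedule. It assumes a single device, no gradient accumulation, and the Trainer defaults as I understand them (a base learning rate of 5e-5 with a linear schedule), since the notebook does not set those explicitly. The function `linear_warmup_decay` is my own re-implementation for illustration, not a library call.

```python
import math

num_samples = 4000   # training rows after sampling
batch_size = 8       # per_device_train_batch_size
epochs = 6           # num_train_epochs
warmup_steps = 500   # warmup_steps
base_lr = 5e-5       # assumed default learning rate (not set in the notebook)

steps_per_epoch = math.ceil(num_samples / batch_size)
total_steps = steps_per_epoch * epochs

def linear_warmup_decay(step, base_lr, warmup_steps, total_steps):
    """Linear warmup to base_lr, then linear decay to zero (illustrative)."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)

print("steps per epoch:", steps_per_epoch)  # 500
print("total steps:", total_steps)          # 3000
for step in (0, 250, 500, 1750, 3000):
    lr = linear_warmup_decay(step, base_lr, warmup_steps, total_steps)
    print(f"step {step:>4}: lr = {lr:.2e}")
```

Under these assumptions the warmup covers the entire first epoch (500 of 3,000 steps), and `save_steps=500` writes one checkpoint per epoch.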



# **Step 5: Trainer Setup & Training the Model**

In [13]:
# Setup the trainer
trainer = Trainer(
    model=model, 
    args=training_args, 
    train_dataset=train_dataset, 
    eval_dataset=val_dataset
)

# Print details for verification
print("Training configuration set up successfully!")
print("Output directory:", training_args.output_dir)
print("Number of training epochs:", training_args.num_train_epochs)

Training configuration set up successfully!
Output directory: ./results
Number of training epochs: 6


In [14]:
# Start the training process
trainer.train()

Epoch,Training Loss,Validation Loss
1,0.4246,0.38031
2,0.3784,0.359477
3,0.3711,0.354424
4,0.3631,0.349908
5,0.35,0.348632
6,0.3534,0.34883


TrainOutput(global_step=3000, training_loss=0.9062748317718505, metrics={'train_runtime': 697.3754, 'train_samples_per_second': 34.415, 'train_steps_per_second': 4.302, 'total_flos': 3248203235328000.0, 'train_loss': 0.9062748317718505, 'epoch': 6.0})
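
As a sanity check, the throughput figures in `TrainOutput` can be reproduced from the run's basic numbers. This is a small arithmetic sketch; the variable names are mine, and the values are copied from the output above and from the earlier cells.

```python
train_runtime = 697.3754  # seconds, from TrainOutput
global_step = 3000        # from TrainOutput
num_samples = 4000        # training rows after sampling
epochs = 6                # num_train_epochs

samples_per_second = num_samples * epochs / train_runtime
steps_per_second = global_step / train_runtime
samples_per_step = num_samples * epochs / global_step

print(f"samples/s: {samples_per_second:.3f}")   # 34.415
print(f"steps/s:   {steps_per_second:.3f}")     # 4.302
print(f"samples per step: {samples_per_step}")  # 8.0
```

The 8 samples per step match `per_device_train_batch_size=8`, which is consistent with training on a single device. The overall `training_loss` of about 0.906 is much higher than the per-epoch values in the table, most likely because it averages over all steps, including the early ones where the loss was still high.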

# **Step 6: Save and load model**

In [15]:
model.save_pretrained("./final_model")
tokenizer.save_pretrained("./final_model")


('./final_model/tokenizer_config.json',
 './final_model/special_tokens_map.json',
 './final_model/spiece.model',
 './final_model/added_tokens.json')

In [19]:
# Load the saved model and tokenizer
model = T5ForConditionalGeneration.from_pretrained("./final_model")
tokenizer = T5Tokenizer.from_pretrained("./final_model")

**Test the summarization system**

In [20]:
device = model.device  # Get the device the model is on

def summarize_dialogue(dialogue):
    dialogue = preprocess_text(dialogue)  
    inputs = tokenizer(dialogue, return_tensors="pt", truncation=True, padding="max_length", max_length=512)
    
    # Move input tensors to the same device as the model
    inputs = {key: value.to(device) for key, value in inputs.items()}

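    # Generate summary with beam search; pass the attention mask so padded positions are ignored
    outputs = model.generate(
        inputs["input_ids"],
        attention_mask=inputs["attention_mask"],
        max_length=150,
        num_beams=4,
        early_stopping=True
    )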
    
    # Decode the generated summary
    summary = tokenizer.decode(outputs[0], skip_special_tokens=True)
    return summary

In [24]:
sample_dialogue = """
Sarah: "I’ve been thinking about moving to a new city for a while now. I feel like I need a change of scenery, you know? I’ve been stuck in this routine for so long."
Jessica: "I get that. A fresh start can be really refreshing. Have you thought about where you might want to go?"
Sarah: "I’ve been looking into a few places. New York is obviously on my list. The energy there is unmatched, but I’m also considering somewhere a bit quieter, like Portland or Austin. I like the idea of being in a city with a strong arts scene, but I don’t want it to be too fast-paced."
Jessica: "I think both Portland and Austin sound like great options. Portland has such a laid-back vibe, and it’s surrounded by nature. You can easily go hiking or take weekend trips to the coast. Austin, on the other hand, is known for its tech scene and live music, so you’d never be bored."
Sarah: "Yeah, that’s true. I just want a place where I feel inspired and can grow both personally and professionally. I’m also trying to figure out if I can find a job that fits what I want to do."
Jessica: "That’s a big part of it too. Finding the right balance between work and lifestyle is key. Have you looked into remote jobs? That could give you more flexibility in terms of location."
Sarah: "I have, actually! I’ve been exploring some remote opportunities in the tech industry. It’s definitely something I’m leaning toward."
Jessica: "That sounds perfect for you. It’ll give you the freedom to live wherever you want and still have a great career."
Sarah: "Exactly. I think I’m just about ready to make the move. Now I just need to start narrowing down my options and figuring out the logistics."
Jessica: "You’ve got this! I’m sure whatever city you choose will be a great fit."
"""
summary = summarize_dialogue(sample_dialogue)
print("Summary:", summary)


Summary: sarah has been thinking about moving to a new city for a while now. jessica is considering a place quieter, like portland and austin.


In [25]:
sample_dialogue = """
Tom: "I’ve been struggling with managing my time lately. Work has been really demanding, and I’ve been finding it hard to balance everything else in my life. Do you ever feel like that?"
Rachel: "Oh, absolutely. It’s tough, especially when there are so many things pulling you in different directions. What do you usually do to manage it all?"
Tom: "I try to stay organized, but sometimes it feels like there’s just too much to do. I use a calendar and set reminders, but I still miss deadlines or feel like I’m not putting enough time into the things that matter outside of work."
Rachel: "I think it’s easy to get caught up in work. I used to have the same problem, but I’ve been trying to set clearer boundaries between work and personal time. For instance, I no longer check my work emails after 7 PM."
Tom: "That’s a good idea. I think I need to start doing that. The problem is, there’s always something that needs to be done, so it’s hard to just let go sometimes."
Rachel: "I get it. But if you keep pushing yourself without taking breaks, it’ll burn you out. I also started incorporating little things throughout the day to recharge, like going for a walk or reading a chapter of a book. It helps clear my mind."
Tom: "I haven’t been taking enough breaks, honestly. I just go from task to task, and it feels like I’m constantly running on empty. Maybe I should try some of the things you’re doing."
Rachel: "Definitely! And another thing that’s helped me is learning to say ‘no’ to things that I don’t have the energy for. It’s hard at first, but it’s freeing."
Tom: "Yeah, I think I need to work on that too. I always feel guilty about not helping out or taking on extra work, but I guess it’s important to prioritize my own well-being."
Rachel: "Exactly. You can’t be your best for others if you’re not taking care of yourself first. It’s all about finding that balance."
Tom: "Thanks for the advice, Rachel. I’m definitely going to try some of these strategies."
Rachel: "You’re welcome! I’m sure you’ll feel a lot better once you start making those changes."
"""
summary = summarize_dialogue(sample_dialogue)
print("Summary:", summary)


Summary: tom has been struggling with managing his time lately. rachel has been trying to balance everything else in her life. rachel has started incorporating little things throughout the day to recharge, like going for a walk or reading a book.


In [26]:
sample_dialogue = """
Mark: "I’ve been really into learning about personal finance lately. I think it’s something I should have paid more attention to earlier in life, you know?"
Lucas: "I totally agree. I didn’t really start understanding how important it is until a few years ago. Have you been reading up on budgeting or investing?"
Mark: "A bit of both. I’ve been using an app to track my spending and see where I can cut back, but I’m also really interested in investing for the long term. I’ve been trying to get into stocks and maybe even real estate."
Lucas: "That’s great. A lot of people don’t start thinking about investing until later on, so it’s good that you’re getting ahead of it. What have you learned about stocks so far?"
Mark: "Well, I’ve been reading about the stock market and trying to understand different strategies, like value investing and growth investing. I’m also starting small with index funds to minimize risk."
Lucas: "That’s smart. Index funds are a great way to get exposure to the market without having to pick individual stocks. I’ve been doing the same thing, but I’m also trying to learn more about cryptocurrency."
Mark: "I’ve heard a lot about crypto, but I’m a little cautious. It seems so volatile, and I don’t want to risk too much, especially when I’m just starting."
Lucas: "That’s understandable. Crypto is definitely risky, but it can also be a great opportunity for growth if you’re able to handle the ups and downs. I think the key is only investing what you’re willing to lose."
Mark: "I agree. I’ll probably stay conservative for now and stick to traditional investments, but it’s good to know more about crypto in case I decide to dive in later."
Lucas: "Exactly. Just keep learning and making informed decisions. And don’t forget about the importance of saving too—having an emergency fund is key."
Mark: "Definitely. I’m working on building that up as well. I want to have a solid financial foundation before making any big moves."
Lucas: "Sounds like you’re on the right track. Keep it up!"
"""
summary = summarize_dialogue(sample_dialogue)
print("Summary:", summary)

Summary: lucas has been learning about personal finance lately. lucas has been using an app to track his spending and see where he can cut back. lucas is starting small with index funds to minimize risk.
