# Introduction 

This notebook presents a comprehensive guide to fine-tuning Facebook's BART (Bidirectional and Auto-Regressive Transformers) model for the task of summarizing chat conversations. 

The notebook is structured to provide a seamless and educative experience in applying advanced NLP techniques for practical applications.

The model fine-tuning utilizes three distinct datasets:

1. **DialogSum Dataset:** A specialized dataset for dialogue summarization, offering diverse conversational examples.

2. **SAMSUM Dataset by Samsung:** This dataset comprises scripted chat conversations with associated human-written summaries, providing a rich ground for training and validating summarization models.

3. **Custom Dataset:** A personally curated dataset, designed to include a variety of chat styles and topics, ensuring robustness and versatility in the model's performance.


In [1]:
# Importing necessary libraries
import json
import pandas as pd
import random
from transformers import BartTokenizer, BartForConditionalGeneration, Trainer, TrainingArguments
from torch.utils.data import Dataset
import torch


We install the `rouge_score` package, which is needed to compute ROUGE scores for the summaries generated by the model.

In [8]:
# !pip install rouge_score

# 1. DialogSum dataset

### Defining functions to read and clean the dataset

**Note**

- To enhance the model's applicability to real-world scenarios, the original DialogSum dataset was modified by replacing generic placeholders 'Person1' and 'Person2' with a diverse list of names. 
- This adaptation ensures that the model is trained on data more representative of actual conversation patterns, thereby improving its practical utility and performance in real-world applications.
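
The replacement described above can be illustrated on a single string. This is a minimal sketch, not the notebook's implementation: `replace_placeholders` is a hypothetical helper, and the single regex pass and the de-duplicated name pool are assumptions chosen so that two speakers never receive the same name.

```python
import random
import re

def replace_placeholders(text, names, rng=random):
    """Swap '#Person1#', '#Person2#', ... for distinct names in a single pass."""
    # Collect the distinct placeholders in order of first appearance
    placeholders = list(dict.fromkeys(re.findall(r"#Person\d+#", text)))
    # De-duplicate the name pool so two speakers never share a name
    pool = sorted(set(names))
    chosen = rng.sample(pool, len(placeholders))
    mapping = dict(zip(placeholders, chosen))
    replaced = re.sub(r"#Person\d+#", lambda m: mapping[m.group(0)], text)
    return replaced, mapping

# Made-up example dialogue; a seeded generator keeps the result reproducible
example = "#Person1#: Hi!\n#Person2#: Hello, #Person1#."
replaced, mapping = replace_placeholders(
    example, ["Alice", "Bob", "Alice", "Cara"], random.Random(0)
)
print(replaced)
```

The returned `mapping` would have to be applied to the matching summary as well, so that dialogue and summary refer to the same names.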

In [9]:
# Function to read JSONL files and convert them to pandas DataFrame
def read_jsonl_to_dataframe(file_path):
    """
    Reads a JSONL file and converts it into a pandas DataFrame.
    
    Parameters:
    file_path (str): Path to the JSONL file to be read.

    Returns:
    pandas.DataFrame: DataFrame containing the data from the JSONL file.
    """
    data = []
    with open(file_path, 'r') as file:
        for line in file:
            data.append(json.loads(line))
    
    return pd.DataFrame(data)

# Function to replace placeholder names in the DataFrame with random names from a list
def replace_names(df, names_list):
    """
    Replaces placeholders with random names from the names_list in the DataFrame.

    Parameters:
    df (pandas.DataFrame): DataFrame where names need to be replaced.
    names_list (list): List of names to be used for replacement.

    Returns:
    pandas.DataFrame: Updated DataFrame with names replaced.
    """
    for index, row in df.iterrows():
        # Randomly select two names
        name1, name2 = random.sample(names_list, 2)

        # Replace placeholders in dialogue and summary
        df.at[index, 'dialogue'] = row['dialogue'].replace('#Person1#', name1).replace('#Person2#', name2)
        df.at[index, 'summary'] = row['summary'].replace('#Person1#', name1).replace('#Person2#', name2)
        
    # Clean up the DataFrame by dropping unnecessary columns and duplicates
    df.drop(columns = ['fname', 'topic'], inplace = True)
    df.dropna(inplace = True, axis = 1)
    df.drop_duplicates(inplace = True)
    return df

# List of names to be used for replacement
names_list = [
    'Alice', 'Bob', 'Charlie', 'Diana', 'Edward', 'Fiona', 'George', 'Hannah', 'Ian', 'Julia', 'Kevin', 'Laura',
    'Megan', 'Nathan', 'Olivia', 'Peter', 'Quincy', 'Rachel', 'Samuel', 'Tina', 'Umar', 'Violet', 'William', 'Xena',
    'Yasmin', 'Zachary', 'Amelia', 'Brian', 'Carmen', 'David', 'Elena', 'Frank', 'Grace', 'Henry', 'Isla', 'Jack',
    'Kara', 'Leo', 'Maya', 'Nolan', 'Ophelia', 'Pablo', 'Queenie', 'Raj', 'Sara', 'Tom', 'Ursula', 'Victor', 'Wendy',
    'Xander', 'Yolanda', 'Zane', 'Anita', 'Blake', 'Claire', 'Derek', 'Eve', 'Felix', 'Giselle', 'Harold', 'Ivy', 'Jasper', 'Kylie', 'Liam', 'Monica', 'Nigel', 'Opal', 'Preston', 'Quinn',
    'Rosa', 'Sebastian', 'Tracy', 'Ulysses', 'Valerie', 'Winston', 'Xiomara', 'Yvette', 'Zelda', 'Aaron', 'Brianna',
    'Cody', 'Danielle', 'Ethan', 'Farrah', 'Gavin', 'Hazel', 'Isaac', 'Jocelyn', 'Kyle', 'Luna', 'Miles', 'Nadia',
    'Orlando', 'Penelope', 'Quincy', 'Rebecca', 'Shane', 'Tara', 'Ursula', 'Vance', 'Whitney', 'Xavier', 'Yasmine',
    'Zach', 'Aurora', 'Brandon', 'Celeste'
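]

# 'Quincy' and 'Ursula' appear twice above; remove repeats (order preserved) so that
# random.sample can never pick the same name for both speakers
names_list = list(dict.fromkeys(names_list))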
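
`read_jsonl_to_dataframe` above passes every line to `json.loads`, which raises on a blank line. The sketch below shows one possible tolerant reader that uses only the standard library. `iter_jsonl` is a hypothetical helper, and the sample records are made up for illustration.

```python
import io
import json

def iter_jsonl(stream):
    """Yield one parsed object per non-empty line of a JSONL text stream."""
    for line_number, line in enumerate(stream, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines, e.g. an empty line at the end of the file
        try:
            yield json.loads(line)
        except json.JSONDecodeError as err:
            # Report which line is malformed instead of failing without context
            raise ValueError(f"line {line_number}: invalid JSON ({err.msg})") from err

# Made-up two-record sample with a blank line in between
sample = io.StringIO(
    '{"dialogue": "A: hi", "summary": "A greets."}\n'
    '\n'
    '{"dialogue": "B: bye", "summary": "B leaves."}\n'
)
records = list(iter_jsonl(sample))
print(len(records))
```

The resulting list of dictionaries can be handed to `pd.DataFrame(records)` exactly as in the function above.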

### Training set of DialogSum

In [10]:
# Path to JSONL file
train_file_path = 'dialogsum.train.jsonl'

# Reading the JSONL file and creating a DataFrame
train_df1 = read_jsonl_to_dataframe(train_file_path)

# Replacing names in the DataFrame
train_df1 = replace_names(train_df1, names_list)

# Displaying the first few rows of the DataFrame with replaced names
train_df1

Unnamed: 0,dialogue,summary
0,"Valerie: Hi, Mr. Smith. I'm Doctor Hawkins. Wh...","Mr. Smith's getting a check-up, and Doctor Haw..."
1,"Queenie: Hello Mrs. Parker, how have you been?...",Mrs Parker takes Ricky for his vaccines. Dr. P...
2,"Elena: Excuse me, did you see a set of keys?\n...",Elena's looking for a set of keys and asks for...
3,Hazel: Why didn't you tell me you had a girlfr...,Hazel's angry because Monica didn't tell Hazel...
4,"Ursula: Watsup, ladies! Y'll looking'fine toni...",Malik invites Nikki to dance. Nikki agrees if ...
...,...,...
12455,Ethan: Excuse me. You are Mr. Green from Manch...,Tan Ling picks Mr. Green up who is easily reco...
12456,Victor: Mister Ewing said we should show up at...,Victor and Fiona plan to take the underground ...
12457,Ulysses: How can I help you today?\nKara: I wo...,Kara rents a small car for 5 days with the hel...
12458,Umar: You look a bit unhappy today. What's up?...,Celeste's mom lost her job. Celeste hopes mom ...


### Validation set of DialogSum

In [11]:
# Path to your JSONL file
valid_file_path = 'dialogsum.dev.jsonl'

# Read JSONL file and create DataFrame
valid_df1 = read_jsonl_to_dataframe(valid_file_path)

# Replace names in the DataFrame
valid_df1 = replace_names(valid_df1, names_list)

# Display the first few rows of the DataFrame with replaced names
valid_df1

Unnamed: 0,dialogue,summary
0,"Zane: Hello, how are you doing today?\nUlysses...",Ulysses has trouble breathing. The doctor asks...
1,Liam: Hey Jimmy. Let's go workout later today....,Liam invites Jimmy to go workout and persuades...
2,Luna: I need to stop eating such unhealthy foo...,"Luna plans to stop eating unhealthy foods, and..."
3,Zane: Do you believe in UFOs?\nClaire: Of cour...,Claire believes in UFOs and can see them in dr...
4,Bob: Did you go to school today?\nRebecca: Of ...,Bob didn't go to school today. Rebecca wants t...
...,...,...
495,"Megan: Now that it's the new year, I've decide...",Megan decides to stop smoking and come out of ...
496,"Frank: You married Joe, didn't you? \nKara: Jo...",Frank thought Kara married Joe. Kara denies.
497,Victor: How can I help you mam?\nGeorge: I was...,George's car makes noises. Victor thinks it ne...
498,"Pablo: Hello, Amazon's customer service. How c...",Kara calls Amazon's customer service because o...


### Test set of DialogSum

In [6]:
# Path to your JSONL file
test_file_path = '/kaggle/input/dialogue-chat/dialogsum.test.jsonl'

# Read JSONL file and create DataFrame
test_df1 = read_jsonl_to_dataframe(test_file_path)

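# Names are not replaced here: replace_names expects 'summary' and 'topic' columns,
# but the test file provides 'summary1'-'summary3' and 'topic1'-'topic3' instead
# test_df1 = replace_names(test_df1, names_list)

# Display the DataFrame (the '#Person1#'/'#Person2#' placeholders are still present)
test_df1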

Unnamed: 0,fname,dialogue,summary1,topic1,summary2,topic2,summary3,topic3
0,test_0,"#Person1#: Ms. Dawson, I need you to take a di...",Ms. Dawson helps #Person1# to write a memo to ...,communication method,In order to prevent employees from wasting tim...,company policy,Ms. Dawson takes a dictation for #Person1# abo...,dictation
1,test_1,#Person1#: You're finally here! What took so l...,#Person2# arrives late because of traffic jam....,public transportation,#Person2# decides to follow #Person1#'s sugges...,transportation,#Person2# complains to #Person1# about the tra...,discuss transportation
2,test_2,"#Person1#: Kate, you never believe what's happ...",#Person1# tells Kate that Masha and Hero get d...,divorce,#Person1# tells Kate that Masha and Hero are g...,divorce,#Person1# and Kate talk about the divorce betw...,discuss divorce
3,test_3,"#Person1#: Happy Birthday, this is for you, Br...",#Person1# and Brian are at the birthday party ...,birthday party,#Person1# attends Brian's birthday party. Bria...,birthday party,#Person1# has a dance with Brian at Brian's bi...,birthday party
4,test_4,#Person1#: This Olympic park is so big!\n#Pers...,#Person1# is surprised at the Olympic Stadium'...,Olympic Stadium,#Person2# shows #Person1# around the construct...,sports stadium,#Person2# introduces the Olympic Stadium's fin...,Olympic Stadium
...,...,...,...,...,...,...,...,...
495,test_495,"#Person1#: Hey, Charlie, do you want to come t...",Jack invites Charlie to play a new video game ...,video game invitation,Jack asks Charlie to come over and play the ne...,\n\nentertainment activity schedule,Jack invites Charlie to play video games after...,play game
496,test_496,#Person1#: How did you get interested in count...,#Person2# explains to #Person1# about how #Per...,conversation about interest,#Person2# shares #Person2#'s career in the pas...,work experience,#Person2# tells #Person1# about #Person2#'s ow...,country music
497,test_497,"#Person1#: Excuse me, Alice, I've never used t...",Alice guides #Person1# to use the washing mach...,campus conversation,#Person1# asks Alice how to use the washing ma...,campus life conversation,#Person1# doesn't know how to use the washing ...,clothes washing
498,test_498,#Person1#: Matthew? Hi!\n#Person2#: Steve! Hav...,Steve is looking for a new place to live and M...,house renting,Matthew and Steve meet after a long time. Stev...,finding a house,Steve has been looking for a place to live. Ma...,find a house


# 2. SAMSUM Dataset

### Defining function to clean the dataset

In [7]:
def cleaning(df):
    # Count and print the number of duplicate rows in the DataFrame
    dup = df.duplicated().sum()
    print('Number of duplicates: ', dup)
    
    # If there are duplicates, remove them and print confirmation
    if dup > 0:
        df.drop_duplicates(inplace=True)
        print('Duplicate values removed.')
    
    # Remove the 'id' column from the DataFrame
    df.drop('id', axis=1, inplace=True)
    
    # Calculate and print the number of missing values in each column
    missing = df.isna().sum()
    print('Number of missing values: ', missing)
    
    # Remove rows with missing values from the DataFrame
    df.dropna(inplace=True)

    # Return the cleaned DataFrame
    return df

### Reading the dataset files

In [8]:
# Load the training data from the SAMSUM dataset for chat summarization
# This dataset is used for training the model to summarize chat conversations
train_df2 = pd.read_json('/kaggle/input/samsum-dataset-for-chat-summarization/train.json')

# Load the validation data from the SAMSUM dataset
# Validation data is used to tune the model's hyperparameters and prevent overfitting
valid_df2 = pd.read_json('/kaggle/input/samsum-dataset-for-chat-summarization/val.json')

# Load the test data from the SAMSUM dataset
# Test data is used for evaluating the model's performance on unseen data
test_df2 = pd.read_json('/kaggle/input/samsum-dataset-for-chat-summarization/test.json')

### Cleaning SAMSUM Training set

In [9]:
cleaning(train_df2)

Number of duplicates:  0
Number of missing values:  summary     0
dialogue    0
dtype: int64


Unnamed: 0,summary,dialogue
0,Amanda baked cookies and will bring Jerry some...,Amanda: I baked cookies. Do you want some?\r\...
1,Olivia and Olivier are voting for liberals in ...,Olivia: Who are you voting for in this electio...
2,Kim may try the pomodoro technique recommended...,"Tim: Hi, what's up?\r\nKim: Bad mood tbh, I wa..."
3,Edward thinks he is in love with Bella. Rachel...,"Edward: Rachel, I think I'm in ove with Bella...."
4,"Sam is confused, because he overheard Rick com...",Sam: hey overheard rick say something\r\nSam:...
...,...,...
14727,Romeo is trying to get Greta to add him to her...,Romeo: You are on my ‘People you may know’ lis...
14728,Theresa is at work. She gets free food and fre...,Theresa: <file_photo>\r\nTheresa: <file_photo>...
14729,Japan is going to hunt whales again. Island an...,John: Every day some bad news. Japan will hunt...
14730,Celia couldn't make it to the afternoon with t...,Jennifer: Dear Celia! How are you doing?\r\nJe...


### Cleaning SAMSUM Valid set

In [10]:
cleaning(valid_df2)

Number of duplicates:  0
Number of missing values:  summary     0
dialogue    0
dtype: int64


Unnamed: 0,summary,dialogue
0,A will go to the animal shelter tomorrow to ge...,"A: Hi Tom, are you busy tomorrow’s afternoon?\..."
1,Emma and Rob love the advent calendar. Lauren ...,Emma: I’ve just fallen in love with this adven...
2,Madison is pregnant but she doesn't want to ta...,Jackie: Madison is pregnant\r\nJackie: but she...
3,Marla found a pair of boxers under her bed.,Marla: <file_photo>\r\nMarla: look what I foun...
4,Robert wants Fred to send him the address of t...,Robert: Hey give me the address of this music ...
...,...,...
813,Carla's date for graduation is on June 4th. Di...,Carla: I've got it...\r\nDiego: what?\r\nCarla...
814,Bev is going on the school trip with her son. ...,"Gita: Hello, this is Beti's Mum Gita, I wanted..."
815,Greg cheated on Julia. He apologises to her. R...,"Julia: Greg just texted me\r\nRobert: ugh, del..."
816,Marry broke her nail and has a party tomorrow....,"Marry: I broke my nail ;(\r\nTina: oh, no!\r\n..."


### Cleaning SAMSUM Test set

In [11]:
cleaning(test_df2)

Number of duplicates:  0
Number of missing values:  summary     0
dialogue    0
dtype: int64


Unnamed: 0,summary,dialogue
0,Hannah needs Betty's number but Amanda doesn't...,"Hannah: Hey, do you have Betty's number?\nAman..."
1,Eric and Rob are going to watch a stand-up on ...,Eric: MACHINE!\r\nRob: That's so gr8!\r\nEric:...
2,Lenny can't decide which trousers to buy. Bob ...,"Lenny: Babe, can you help me with something?\r..."
3,Emma will be home soon and she will let Will k...,"Will: hey babe, what do you want for dinner to..."
4,Jane is in Warsaw. Ollie and Jane has a party....,"Ollie: Hi , are you in Warsaw\r\nJane: yes, ju..."
...,...,...
814,Benjamin didn't come to see a basketball game ...,Alex: Were you able to attend Friday night's b...
815,The audition starts at 7.30 P.M. in Antena 3.,Jamilla: remember that the audition starts at ...
816,"Marta sent a file accidentally,","Marta: <file_gif>\r\nMarta: Sorry girls, I cli..."
817,There was a meet-and-greet with James Charles ...,Cora: Have you heard how much fuss British med...
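
The previews above show dialogues that separate turns with `\r\n` as well as with `\n`. The notebook trains on the text as it is. Should uniform separators be wanted, a minimal sketch could look like the following; `normalize_turns` is a hypothetical helper and the sample string is made up.

```python
def normalize_turns(dialogue):
    """Convert Windows-style and bare carriage-return line endings to a single newline."""
    return dialogue.replace("\r\n", "\n").replace("\r", "\n")

# Made-up dialogue that mixes both separator styles
sample = "A: hello\r\nB: hi\nA: bye"
turns = normalize_turns(sample).split("\n")
print(turns)
```

On a DataFrame this could be applied with `df['dialogue'].map(normalize_turns)`. Whether it affects the quality of the summaries was not tested here.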


# 3. Custom Dataset

### Reading the dataset

In [12]:
# Open the file 'dialogue.txt' in read mode
with open('/kaggle/input/chat-conversations-with-summary/dialogue.txt', 'r') as file:
    # Read all lines from the file
    lines = file.readlines()

# Initialize an empty list to store the data
data = []

# Iterate over each line in the file
for line in lines:
    # Strip leading and trailing whitespace and split the line into columns based on '~'
    columns = line.strip().split('~')
    # Append the processed line to the data list
    data.append(columns)

# Create a DataFrame from the list of data
df = pd.DataFrame(data)

# Exclude the first row which might be headers or unwanted data
df = df[1:]

# Rename the first and second columns to 'dialogue' and 'summary' respectively
df.rename(columns={0 : 'dialogue', 1 : 'summary'}, inplace=True)

# Reset the index of the DataFrame to make it start from 0
df.reset_index(drop=True, inplace=True)

# Rearrange the columns so that 'summary' comes first, followed by 'dialogue'
df = df[['summary', 'dialogue']]

# Display the DataFrame
df

Unnamed: 0,summary,dialogue
0,Alice is checking if Bob has finished the repo...,"Alice: Hey, have you finished the report? Bob:..."
1,"John and Emma discuss a new project, and Emma ...",John: Did you hear about the new project? Emma...
2,"Mike loses his keys, but Sara suggests checkin...",Mike: I can't find my keys anywhere. Sara: Did...
3,"Liam and Sophie confirm their movie plans, wit...",Liam: Are we still on for the movie tonight? S...
4,"David and Nina discuss the weather forecast, d...",David: Have you seen the weather forecast for ...
...,...,...
904,Ravi tells Mei about his culinary experiments ...,Ravi: I've been experimenting with cooking dif...
905,Sophie discusses her project to clean up the l...,Sophie: I'm working on a project to clean up o...
906,Neil shares his interest in historical novels ...,Neil: I've been exploring historical novels re...
907,Grace mentions her new blog on sustainable liv...,Grace: I started a blog about sustainable livi...
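
The parsing step above splits every line on each `~`, so a `~` inside a dialogue would produce extra columns. The sketch below is one possible stricter variant under two assumptions that the notebook does not state: the last `~` on a line separates dialogue from summary, and the first line is a header. `parse_tilde_lines` is a hypothetical helper.

```python
def parse_tilde_lines(lines, skip_header=True):
    """Split 'dialogue~summary' lines into dicts, skipping blank or malformed rows."""
    rows = []
    for i, line in enumerate(lines):
        if skip_header and i == 0:
            continue  # assumed header row, mirroring df[1:] in the cell above
        line = line.strip()
        if not line:
            continue
        # Split on the LAST '~' only, so a '~' inside the dialogue stays in the dialogue
        dialogue, sep, summary = line.rpartition("~")
        if not sep or not dialogue or not summary:
            continue  # no separator, or one side is empty
        rows.append({"dialogue": dialogue, "summary": summary})
    return rows

# Made-up lines: a header, a valid row containing an extra '~', a blank line, a broken row
lines = [
    "dialogue~summary",
    "A: hi ~ there B: hello~A greets B.",
    "",
    "broken line",
]
rows = parse_tilde_lines(lines)
print(rows)
```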


### Combining the training and validation sets to enlarge the training data; the SAMSum test set is used for evaluation

In [13]:
# Concatenating the DataFrames
train_df = pd.concat([train_df1, train_df2, valid_df1, valid_df2, df])

# Resetting the index
train_df.reset_index(drop=True, inplace=True)

# Displaying the result
train_df

Unnamed: 0,dialogue,summary
0,"Ursula: Hi, Mr. Smith. I'm Doctor Hawkins. Why...","Mr. Smith's getting a check-up, and Doctor Haw..."
1,"Queenie: Hello Mrs. Parker, how have you been?...",Mrs Parker takes Ricky for his vaccines. Dr. P...
2,"Bob: Excuse me, did you see a set of keys?\nSe...",Bob's looking for a set of keys and asks for S...
3,Zelda: Why didn't you tell me you had a girlfr...,Zelda's angry because Victor didn't tell Zelda...
4,"Samuel: Watsup, ladies! Y'll looking'fine toni...",Malik invites Nikki to dance. Nikki agrees if ...
...,...,...
29414,Ravi: I've been experimenting with cooking dif...,Ravi tells Mei about his culinary experiments ...
29415,Sophie: I'm working on a project to clean up o...,Sophie discusses her project to clean up the l...
29416,Neil: I've been exploring historical novels re...,Neil shares his interest in historical novels ...
29417,Grace: I started a blog about sustainable livi...,Grace mentions her new blog on sustainable liv...


In [14]:
class SummaryDataset(Dataset):
    # Initialize the dataset with a tokenizer, data, and maximum token length
    def __init__(self, tokenizer, data, max_length=512):
        self.tokenizer = tokenizer  # Tokenizer for encoding text
        self.data = data            # Data containing dialogues and summaries
        self.max_length = max_length # Maximum length of tokens

    # Return the number of items in the dataset
    def __len__(self):
        return len(self.data)

    # Retrieve an item from the dataset by index
    def __getitem__(self, idx):
        item = self.data.iloc[idx]  # Get the row at the specified index
        dialogue = item['dialogue'] # Extract dialogue from the row
        summary = item['summary']   # Extract summary from the row

        # Encode the dialogue as input data for the model
        source = self.tokenizer.encode_plus(
            dialogue, 
            max_length=self.max_length, 
            padding='max_length', 
            return_tensors='pt', 
            truncation=True
        )

        # Encode the summary as target data for the model
        target = self.tokenizer.encode_plus(
            summary, 
            max_length=self.max_length, 
            padding='max_length', 
            return_tensors='pt', 
            truncation=True
        )

        # Return a dictionary containing input_ids, attention_mask, labels, and the original summary text
        return {
            'input_ids': source['input_ids'].flatten(),
            'attention_mask': source['attention_mask'].flatten(),
            'labels': target['input_ids'].flatten(),
            'summary': summary 
        }
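
In the class above, the labels are padded to `max_length` with the tokenizer's padding token, so padded positions contribute to the training loss. A common alternative is to set those label positions to `-100`, the value that PyTorch's cross-entropy loss ignores by default. The sketch below shows the idea on a plain list. `mask_padding` is a hypothetical helper, the token ids are made up, and the pad id of `1` is an assumption that should be read from `tokenizer.pad_token_id` in practice.

```python
IGNORE_INDEX = -100  # label value skipped by PyTorch's cross-entropy loss by default

def mask_padding(label_ids, pad_token_id):
    """Replace padding positions in a label sequence with IGNORE_INDEX."""
    return [IGNORE_INDEX if token == pad_token_id else token for token in label_ids]

# Made-up label ids; 1 is assumed to be the padding id
labels = [0, 713, 16, 2, 1, 1, 1]
masked = mask_padding(labels, pad_token_id=1)
print(masked)
```

This notebook does not apply such masking, so the losses it reports include the padded positions. Whether masking would change the ROUGE scores reported further down was not tested.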

In [15]:
from transformers import BartTokenizer, BartForConditionalGeneration

# Initialize the tokenizer for BART
# 'facebook/bart-base' is a pretrained model identifier
# The tokenizer is responsible for converting text input into tokens that the model can understand
tokenizer = BartTokenizer.from_pretrained('facebook/bart-base')

# Initialize the BART model for conditional generation
# This model is used for tasks like summarization where the output is conditional on the input text
# The model is loaded with pretrained weights from 'facebook/bart-base'
model = BartForConditionalGeneration.from_pretrained('facebook/bart-base')

Downloading (…)olve/main/vocab.json:   0%|          | 0.00/899k [00:00<?, ?B/s]

Downloading (…)olve/main/merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

Downloading (…)/main/tokenizer.json:   0%|          | 0.00/1.36M [00:00<?, ?B/s]

Downloading (…)lve/main/config.json:   0%|          | 0.00/1.72k [00:00<?, ?B/s]

Downloading model.safetensors:   0%|          | 0.00/558M [00:00<?, ?B/s]

In [16]:
# Creating an instance of the SummaryDataset class for training data
# It uses the tokenizer to process the training data (train_df) 
# for model input
train_dataset = SummaryDataset(tokenizer, train_df)

# Creating an instance of the SummaryDataset class for evaluation data
# The SAMSum test set (test_df2) is used here, because both validation sets
# were merged into the training data above
valid_dataset = SummaryDataset(tokenizer, test_df2)

# Fine-tuning the model

In [17]:
from transformers import TrainingArguments

# Define training arguments for the model
training_args = TrainingArguments(
    output_dir='./results',          # Directory to save model output and checkpoints
    num_train_epochs=2,              # Number of epochs to train the model
    per_device_train_batch_size=8,   # Batch size per device during training
    per_device_eval_batch_size=8,    # Batch size for evaluation
    warmup_steps=500,                # Number of warmup steps for learning rate scheduler
    weight_decay=0.01,               # Weight decay for regularization
    logging_dir='./logs',            # Directory to save logs
    logging_steps=10,                # Log metrics every specified number of steps
    evaluation_strategy="epoch",     # Evaluation is done at the end of each epoch
    report_to='none'                 # Disables reporting to logging integrations (e.g., TensorBoard, WandB)
)
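
The effect of `warmup_steps=500` can be sketched numerically. The code below assumes the `TrainingArguments` defaults that the cell above leaves unchanged (a base learning rate of 5e-5 and a linear schedule) and takes the total of 3678 optimizer steps from the training output of this notebook. `linear_warmup_lr` is a hypothetical helper written for illustration, not a library function.

```python
def linear_warmup_lr(step, base_lr=5e-5, warmup_steps=500, total_steps=3678):
    """Learning rate at an optimizer step: linear warmup, then linear decay to zero."""
    if step < warmup_steps:
        # Warmup phase: ramp linearly from 0 up to base_lr
        return base_lr * step / max(1, warmup_steps)
    # Decay phase: fall linearly from base_lr to 0 at total_steps
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)

for step in (0, 250, 500, 2000, 3678):
    print(step, linear_warmup_lr(step))
```

With these numbers the warmup covers roughly the first 14% of training (500 of 3678 steps).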

In [18]:
# Initializing the Trainer object
trainer = Trainer(
    model=model,             # The model to be trained (e.g., our BART model)
    args=training_args,      # Training arguments specifying training parameters like learning rate, batch size, etc.
    train_dataset=train_dataset,  # The dataset to be used for training the model
    eval_dataset=valid_dataset    # The dataset to be used for evaluating the model during training
)

# Starting the training process
trainer.train()



Epoch,Training Loss,Validation Loss
1,0.0995,0.086062
2,0.0834,0.081724




TrainOutput(global_step=3678, training_loss=0.4837404892229399, metrics={'train_runtime': 5644.8729, 'train_samples_per_second': 10.423, 'train_steps_per_second': 0.652, 'total_flos': 1.793783686496256e+16, 'train_loss': 0.4837404892229399, 'epoch': 2.0})
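
The reported `global_step=3678` can be checked against the dataset size. The combined training set has 29,418 rows, and the sketch below computes how many optimizer steps two epochs would take for a given number of devices. The device count is not stated in this notebook: the figure matches two devices at a per-device batch size of 8, which is an inference from the numbers and not a documented fact. `total_optimizer_steps` is a hypothetical helper.

```python
import math

def total_optimizer_steps(num_examples, per_device_batch_size, num_devices, epochs):
    """Optimizer steps when each step consumes one batch per device."""
    effective_batch = per_device_batch_size * num_devices
    steps_per_epoch = math.ceil(num_examples / effective_batch)
    return steps_per_epoch * epochs

one_device = total_optimizer_steps(29418, 8, 1, 2)
two_devices = total_optimizer_steps(29418, 8, 2, 2)
print(one_device, two_devices)
```

The reported throughput is consistent with this reading: 10.423 samples per second over 5644.87 seconds is about 58,836 samples, which is two passes over 29,418 rows.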

# Model Evaluation using ROUGE Score

In [19]:
from datasets import load_metric
from torch.utils.data import DataLoader

# Load the ROUGE metric for evaluation
rouge = load_metric('rouge')

def generate_summaries(model, tokenizer, dataset, batch_size=8):
    """
    Generate summaries using the provided model and tokenizer on the given dataset.

    Args:
        model: The trained summarization model.
        tokenizer: Tokenizer associated with the model.
        dataset: Dataset for which summaries need to be generated.
        batch_size: Number of data samples to process in each batch.

    Returns:
        summaries: Generated summaries by the model.
        references: Actual summaries from the dataset for comparison.
    """
    # Set model to evaluation mode
    model.eval()
    summaries = []    # List to store generated summaries
    references = []   # List to store actual summaries

    # Create a DataLoader for batch processing
    dataloader = DataLoader(dataset, batch_size=batch_size)

    # Disable gradient calculations for efficiency
    with torch.no_grad():
        for batch in dataloader:
            # Move input data to the same device as the model
            input_ids = batch['input_ids'].to(model.device)
            attention_mask = batch['attention_mask'].to(model.device)

            # Generate summaries with the model
            outputs = model.generate(input_ids, attention_mask=attention_mask, max_length=2048, num_beams=2)
            batch_summaries = [tokenizer.decode(ids, skip_special_tokens=True) for ids in outputs]

            # Append generated and actual summaries to the respective lists
            summaries.extend(batch_summaries)
            references.extend(batch['summary'])

    return summaries, references

# Generate summaries for the validation dataset
generated_summaries, actual_summaries = generate_summaries(model, tokenizer, valid_dataset, batch_size=8)

# Compute and print the ROUGE score for evaluation
rouge_score = rouge.compute(predictions=generated_summaries, references=actual_summaries)
print(rouge_score)

Downloading builder script:   0%|          | 0.00/2.16k [00:00<?, ?B/s]

{'rouge1': AggregateScore(low=Score(precision=0.5203167404161397, recall=0.454729290212834, fmeasure=0.4632587632485201), mid=Score(precision=0.5354479435805708, recall=0.46890642945251565, fmeasure=0.47534285466245074), high=Score(precision=0.5502350562403443, recall=0.4824408056039646, fmeasure=0.48742275414871145)), 'rouge2': AggregateScore(low=Score(precision=0.25078879710302293, recall=0.21609817018001448, fmeasure=0.22050781070275272), mid=Score(precision=0.26568774609184886, recall=0.2292050456292191, fmeasure=0.2331994917519589), high=Score(precision=0.2808212845633841, recall=0.24289826444917545, fmeasure=0.24596704167812236)), 'rougeL': AggregateScore(low=Score(precision=0.43189569859182625, recall=0.3784438094396151, fmeasure=0.38432840225294457), mid=Score(precision=0.4465577989373004, recall=0.39079956181403797, fmeasure=0.39648919210982875), high=Score(precision=0.46139814838987636, recall=0.4039602038430952, fmeasure=0.4090468688096221)), 'rougeLsum': AggregateScore(low=
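
To make the ROUGE-1 and ROUGE-2 numbers above easier to read, the following sketch computes n-gram precision, recall and F-measure for a single prediction and reference pair. It is a simplified illustration and not the implementation behind the `rouge` metric: tokenization here is a plain lowercase whitespace split, and there is no bootstrap aggregation into low/mid/high values. `rouge_n` and the example sentences are hypothetical.

```python
from collections import Counter

def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n(prediction, reference, n=1):
    """Return (precision, recall, f_measure) of n-gram overlap for one pair."""
    pred = ngrams(prediction.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    # Clipped overlap: each n-gram counts at most as often as it occurs in both texts
    overlap = sum((pred & ref).values())
    precision = overlap / max(1, sum(pred.values()))
    recall = overlap / max(1, sum(ref.values()))
    if precision + recall == 0:
        return 0.0, 0.0, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)

p1, r1, f1 = rouge_n("the cat sat", "the cat sat on the mat", n=1)
p2, r2, f2 = rouge_n("the cat sat", "the cat sat on the mat", n=2)
print(p1, r1, f1)
print(p2, r2, f2)
```

Precision is the share of generated n-grams that appear in the reference, and recall is the share of reference n-grams that appear in the generated summary.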

### Displaying the ROUGE scores to better understand the results

In [21]:
rouge_scores = {
    'rouge1': {
        'low': {'precision': 0.5203, 'recall': 0.4547, 'fmeasure': 0.4632},
        'mid': {'precision': 0.5354, 'recall': 0.4689, 'fmeasure': 0.4753},
        'high': {'precision': 0.5502, 'recall': 0.4824, 'fmeasure': 0.4874}
    },
    'rouge2': {
        'low': {'precision': 0.2507, 'recall': 0.2160, 'fmeasure': 0.2205},
        'mid': {'precision': 0.2656, 'recall': 0.2292, 'fmeasure': 0.2331},
        'high': {'precision': 0.2808, 'recall': 0.2428, 'fmeasure': 0.2459}
    },
    'rougeL': {
        'low': {'precision': 0.4318, 'recall': 0.3784, 'fmeasure': 0.3843},
        'mid': {'precision': 0.4465, 'recall': 0.3907, 'fmeasure': 0.3964},
        'high': {'precision': 0.4613, 'recall': 0.4039, 'fmeasure': 0.4090}
    },
    'rougeLsum': {
        'low': {'precision': 0.4324, 'recall': 0.3770, 'fmeasure': 0.3830},
        'mid': {'precision': 0.4463, 'recall': 0.3903, 'fmeasure': 0.3960},
        'high': {'precision': 0.4616, 'recall': 0.4031, 'fmeasure': 0.4075}
    }
}

# Convert the nested dictionary into a Pandas DataFrame
scores = pd.DataFrame.from_dict({(i, j): rouge_scores[i][j] 
                            for i in rouge_scores.keys() 
                            for j in rouge_scores[i].keys()},
                            orient='index')

# Set column names for readability
scores.columns = ['Precision', 'Recall', 'F-Measure']

# Display the DataFrame
scores

| Metric | Bound | Precision | Recall | F-Measure |
|---|---|---|---|---|
| rouge1 | low | 0.5203 | 0.4547 | 0.4632 |
| rouge1 | mid | 0.5354 | 0.4689 | 0.4753 |
| rouge1 | high | 0.5502 | 0.4824 | 0.4874 |
| rouge2 | low | 0.2507 | 0.2160 | 0.2205 |
| rouge2 | mid | 0.2656 | 0.2292 | 0.2331 |
| rouge2 | high | 0.2808 | 0.2428 | 0.2459 |
| rougeL | low | 0.4318 | 0.3784 | 0.3843 |
| rougeL | mid | 0.4465 | 0.3907 | 0.3964 |
| rougeL | high | 0.4613 | 0.4039 | 0.4090 |
| rougeLsum | low | 0.4324 | 0.3770 | 0.3830 |
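
The ROUGE-L rows are based on the longest common subsequence (LCS) between a generated summary and its reference. The sketch below shows that computation for one pair. It is a simplified illustration with whitespace tokenization, not the code used by the `rouge` metric, and `lcs_length`, `rouge_l` and the example sentences are hypothetical.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)            # extend a common subsequence
            else:
                curr.append(max(prev[j], curr[j - 1]))  # carry over the best so far
        prev = curr
    return prev[-1]

def rouge_l(prediction, reference):
    """Return (precision, recall, f_measure) based on the LCS of one pair."""
    pred = prediction.lower().split()
    ref = reference.lower().split()
    lcs = lcs_length(pred, ref)
    precision = lcs / max(1, len(pred))
    recall = lcs / max(1, len(ref))
    if precision + recall == 0:
        return 0.0, 0.0, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)

precision, recall, f_measure = rouge_l("the cat sat on mat", "the cat is on the mat")
print(precision, recall, f_measure)
```

Unlike ROUGE-2, the matched words do not have to be adjacent; they only have to appear in the same order in both texts.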


### Testing the model on a conversation entered by the user

In [22]:
# Check if CUDA (GPU support) is available and choose the device accordingly
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Move the model to the chosen device
model = model.to(device)

In [23]:
def summarize_text(text, max_length=1024):
    """
    Generates a summary for the given text using the fine-tuned model.

    Args:
        text (str): The text to be summarized.
        max_length (int): The maximum number of input tokens. Longer inputs are truncated,
            because BART accepts at most 1024 positions.

    Returns:
        str: The generated summary of the input text.
    """
    # Encode the input text using the tokenizer. The 'pt' indicates PyTorch tensors.
    # truncation=True is required for max_length to take effect
    inputs = tokenizer.encode(text, return_tensors="pt", max_length=max_length, truncation=True)
    
    # Move the encoded text to the same device as the model (e.g., GPU or CPU)
    inputs = inputs.to(device)

    # Generate summary IDs with the model. num_beams controls the beam search width.
    # early_stopping is set to False for a thorough search, though it can be set to True for faster results.
    summary_ids = model.generate(inputs, max_length=2000, num_beams=30, early_stopping=False)

    # Decode the generated IDs back to text, skipping special tokens like padding or EOS.
    summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)

    # Return the generated summary
    return summary

In [31]:
# Prompt the user to enter text for summarization
text = input('Enter the text: ')
print()

# Call the summarize_text function to generate a summary of the input text
summary = summarize_text(text)

# Print the generated summary
print(summary)

Enter the text: Web Developer (You): Hey, I just launched a new website with some exciting features. Would you like to check it out? Machine Learning Enthusiast: That sounds interesting! I'd love to see how you've integrated machine learning into it. Computer Science Student: Speaking of machine learning, have you heard about the latest breakthroughs in natural language processing? Science Enthusiast: Yes, I've been following those developments closely. It's amazing how AI is transforming language understanding. Mathematics Enthusiast: Absolutely! The mathematical foundations of deep learning play a crucial role in these advancements. News Enthusiast: By the way, did you catch the latest headlines? There's a lot happening in the world right now. Web Developer (You): I did! In fact, my website can recommend personalized news articles based on user preferences. Clinical Medical Assistant: That's impressive! Speaking of recommendations, have you worked on any projects related to healthcar

## Conclusion

In this notebook, I fine-tuned the BART Base model for summarizing chat conversations. The model showed promising performance, especially in capturing the essence of dialogues. The findings revealed the model's strengths and areas of improvement as follows:

1. **Consistency in Capturing Key Points:** ROUGE-1 measures unigram (single-word) overlap with the reference summaries. The mid estimates, a precision of 0.5354 and a recall of 0.4689 (upper bounds of the confidence interval: 0.5502 and 0.4824), indicate that roughly half of the words in the generated summaries also appear in the references. This suggests that the model usually captures the key points of a conversation.


2. **Complex Relationships and Nuances:** ROUGE-2 measures bigram (word-pair) overlap, a stricter test of whether the summaries reproduce the phrasing of the references. The mid estimates are a precision of 0.2656 and a recall of 0.2292 (upper bounds: 0.2808 and 0.2428). These scores are lower than ROUGE-1, as expected for a stricter metric, and leave room for improvement in capturing finer details.


3. **Summary Structure and Relevance:** ROUGE-L and ROUGE-Lsum are based on the longest common subsequence between the generated and reference summaries. The mid estimates, a precision of about 0.4465 and a recall of about 0.3907 (upper bounds: about 0.4613 and 0.4039), indicate that the generated summaries often preserve the word order of the references and not only their vocabulary.



- While the model shows effectiveness in summarizing chat conversations, there is room for improvement, particularly in capturing more intricate details and subtleties, as suggested by the ROUGE-2 scores.

I welcome any feedback or questions in the comments and am open to collaborations on similar projects. For further discussions or networking opportunities, feel free to connect with me on [LinkedIn](https://www.linkedin.com/in/farneet-singh-6b155b208/).