# Fine Tuning Transformer for Summary Generation

***Abstractive Summary***: The network creates new sentences to encapsulate maximum gist of the article and generates that as output. The sentences in the summary may or may not be contained in the article. 

- **Data**:
	- We are using the BBC News Summary dataset available at [Kaggle](https://www.kaggle.com/datasets/pariza/bbc-news-summary)
	- This dataset for extractive text summarization has four hundred and seventeen political news articles of BBC from 2004 to 2005 in the News Articles folder.

- **Language Model Used**:   
    - ***T5*** is an encoder-decoder model pre-trained on a multi-task mixture of unsupervised and supervised tasks and for which each task is converted into a text-to-text format. T5 works well on a variety of tasks out-of-the-box by prepending a different prefix to the input corresponding to each task.


## Installing Libraries

In [1]:
!pip install transformers -q
!pip install rouge-score

You should consider upgrading via the '/opt/conda/bin/python3.7 -m pip install --upgrade pip' command.[0m
Collecting rouge-score
  Downloading rouge_score-0.1.2.tar.gz (17 kB)
Building wheels for collected packages: rouge-score
  Building wheel for rouge-score (setup.py) ... [?25ldone
[?25h  Created wheel for rouge-score: filename=rouge_score-0.1.2-py3-none-any.whl size=24954 sha256=2d6e61e0eadf1bc7b12045086866f19dfaee9d3115d33d9b20feb10426a3dd62
  Stored in directory: /root/.cache/pip/wheels/84/ac/6b/38096e3c5bf1dc87911e3585875e21a3ac610348e740409c76
Successfully built rouge-score
Installing collected packages: rouge-score
Successfully installed rouge-score-0.1.2
You should consider upgrading via the '/opt/conda/bin/python3.7 -m pip install --upgrade pip' command.[0m


In [2]:
import nltk
import seaborn as sns 
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import warnings
import os

import torch
import torch.nn.functional as F
from torch.utils.data import Dataset, DataLoader, RandomSampler, SequentialSampler

# Importing the T5 modules from huggingface/transformers
from transformers import T5Tokenizer, T5ForConditionalGeneration

### Sentence Similarity

In [3]:
def sentence_similarity(sent1,sent2,embed):  
    A = embed([sent1])[0]
    B = embed([sent2])[0]
    return 1 - (np.dot(A,B)/(np.linalg.norm(A)*np.linalg.norm(B)))

### BLEU Score

In [4]:
# hypothesis = Summarized_Text
# reference = actual_summary
# BLEUscore = nltk.translate.bleu_score.sentence_bleu([reference], hypothesis)
# print(f"BLEUscore : {BLEUscore}")

### Similarity Score

In [5]:
# print(f"Senetence Similarity Score : {sentence_similarity(Summarized_Text, actual_summary, embed)}")

In [6]:
paths = os.listdir('../input/bbc-news-summary/BBC News Summary/News Articles')
articles_path = '../input/bbc-news-summary/BBC News Summary/News Articles/'
summaries_path = '../input/bbc-news-summary/BBC News Summary/Summaries/'

articles = []
summaries = []
file_arr = []

for path in paths:
    files = os.listdir(articles_path + path)
    for file in files:
        article_file_path = articles_path + path + '/' + file
        summary_file_path = summaries_path + path + '/' + file
        try:
            with open (article_file_path,'r') as f:
                articles.append('.'.join([line.rstrip() for line in f.readlines()]))
            with open (summary_file_path,'r') as f:
                summaries.append('.'.join([line.rstrip() for line in f.readlines()]))
            file_arr.append(path + '/' + file)
        except:
            pass

In [7]:
pd.set_option('display.max_colwidth', 200)
data = pd.DataFrame({'path':file_arr,'article': articles,'summary':summaries})
data.head(2)

Unnamed: 0,path,article,summary
0,politics/361.txt,Budget to set scene for election..Gordon Brown will seek to put the economy at the centre of Labour's bid for a third term in power when he delivers his ninth Budget at 1230 GMT. He is expected to...,"- Increase in the stamp duty threshold from £60,000 - A freeze on petrol duty - An extension of tax credit scheme for poorer families - Possible help for pensioners The stamp duty threshold rise i..."
1,politics/245.txt,"Army chiefs in regiments decision..Military chiefs are expected to meet to make a final decision on the future of Scotland's Army regiments...A committee of the Army Board, which is made up of the...","""They are very much not for the good and will destroy Scotland's regiments by moulding them into a single super regiment which will lead to severe recruitment problems, a loss of local connections..."


In [8]:
data = data.rename(columns = {'article':'text', 'summary':'ctext'})
data.head(2)

Unnamed: 0,path,text,ctext
0,politics/361.txt,Budget to set scene for election..Gordon Brown will seek to put the economy at the centre of Labour's bid for a third term in power when he delivers his ninth Budget at 1230 GMT. He is expected to...,"- Increase in the stamp duty threshold from £60,000 - A freeze on petrol duty - An extension of tax credit scheme for poorer families - Possible help for pensioners The stamp duty threshold rise i..."
1,politics/245.txt,"Army chiefs in regiments decision..Military chiefs are expected to meet to make a final decision on the future of Scotland's Army regiments...A committee of the Army Board, which is made up of the...","""They are very much not for the good and will destroy Scotland's regiments by moulding them into a single super regiment which will lead to severe recruitment problems, a loss of local connections..."


In [9]:
# Checking out the GPU we have access to. 
!nvidia-smi

Sat Apr  6 06:16:04 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.129.03             Driver Version: 535.129.03   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  Tesla P100-PCIE-16GB           Off | 00000000:00:04.0 Off |                    0 |
| N/A   38C    P0              27W / 250W |      0MiB / 16384MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                    

In [10]:
#Setting up the device for GPU usage
from torch import cuda
device = 'cuda' if cuda.is_available() else 'cpu'

<a id='section02'></a>
### Preparing the Dataset for data processing: Class

We will start with creation of Dataset class - This defines how the text is pre-processed before sending it to the neural network. This dataset will be used the the Dataloader method that will feed  the data in batches to the neural network for suitable training and processing. 
Dataset and Dataloader are constructs of the PyTorch library for defining and controlling the data pre-processing and its passage to neural network.

#### *CustomDataset* Dataset Class
- This class is defined to accept the Dataframe as input and generate tokenized output that is used by the **T5** model for training. 
- We are using the **T5** tokenizer to tokenize the data in the `text` and `ctext` column of the dataframe. 
- The tokenizer uses the ` batch_encode_plus` method to perform tokenization and generate the necessary outputs, namely: `source_id`, `source_mask` from the actual text and `target_id` and `target_mask` from the summary text.
- The *CustomDataset* class is used to create 2 datasets, for training and for validation.
- *Training Dataset* is used to fine tune the model: **80% of the original data**
- *Validation Dataset* is used to evaluate the performance of the model. The model has not seen this data during training. 


#### Creating a custom dataset for reading the dataframe and loading it into the dataloader to pass it to the neural network at a later stage for finetuning the model and to prepare it for predictions

In [11]:
class CustomDataset(Dataset):

    def __init__(self, dataframe, tokenizer, source_len, summ_len):
        self.tokenizer = tokenizer
        self.data = dataframe
        self.source_len = source_len
        self.summ_len = summ_len
        self.text = self.data.text
        self.ctext = self.data.ctext

    def __len__(self):
        return len(self.text)

    def __getitem__(self, index):
        ctext = str(self.ctext[index])
        ctext = ' '.join(ctext.split())

        text = str(self.text[index])
        text = ' '.join(text.split())

        source = self.tokenizer.batch_encode_plus([ctext], max_length= self.source_len, pad_to_max_length=True,return_tensors='pt')
        target = self.tokenizer.batch_encode_plus([text], max_length= self.summ_len, pad_to_max_length=True,return_tensors='pt')

        source_ids = source['input_ids'].squeeze()
        source_mask = source['attention_mask'].squeeze()
        target_ids = target['input_ids'].squeeze()
        target_mask = target['attention_mask'].squeeze()

        return {
            'source_ids': source_ids.to(dtype=torch.long), 
            'source_mask': source_mask.to(dtype=torch.long), 
            'target_ids': target_ids.to(dtype=torch.long),
            'target_ids_y': target_ids.to(dtype=torch.long)
        }

<a id='section03'></a>
### Fine Tuning the Model: Function

Here we define a training function that trains the model on the training dataset created above, specified number of times (EPOCH), An epoch defines how many times the complete data will be passed through the network. 

Following events happen to fine tune the neural network:
- The epoch, tokenizer, model, device details, testing_ dataloader and optimizer are passed to the `train ()` when its called from the `main()`
- The dataloader passes data to the model based on the batch size.
- `language_model_labels` are calculated from the `target_ids` also, `source_id` and `attention_mask` are extracted.
- The model outputs first element gives the loss for the forward pass. 
- Loss value is used to optimize the weights of the neurons in the network.
- After every 500 steps the loss value is printed in the console.

#### Creating the training function. This will be called in the main function. It is run depending on the epoch value. The model is put into train mode and then we enumerate over the training loader and passed to the defined network 


In [12]:
# Creating the training function. This will be called in the main function. It is run depending on the epoch value.
# The model is put into train mode and then we wnumerate over the training loader and passed to the defined network 

def train(epoch, tokenizer, model, device, loader, optimizer):
    model.train()
    for _,data in enumerate(loader, 0):
        y = data['target_ids'].to(device, dtype = torch.long)
        y_ids = y[:, :-1].contiguous()
        lm_labels = y[:, 1:].clone().detach()
        lm_labels[y[:, 1:] == tokenizer.pad_token_id] = -100
        ids = data['source_ids'].to(device, dtype = torch.long)
        mask = data['source_mask'].to(device, dtype = torch.long)

        outputs = model(input_ids = ids, attention_mask = mask, decoder_input_ids=y_ids, lm_labels=lm_labels)
        loss = outputs[0]
        
        if _%500==0:
            print(f'Epoch: {epoch}, Loss:  {loss.item()}')
        
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

<a id='section04'></a>
### Validating the Model Performance: Function

During the validation stage we pass the unseen data(Testing Dataset), trained model, tokenizer and device details to the function to perform the validation run. This step generates new summary for dataset that it has not seen during the training session. 

This unseen data is the 20% which was seperated during the Dataset creation stage. 
During the validation stage the weights of the model are not updated. We use the generate method for generating new text for the summary. 


In [13]:
def validate(epoch, tokenizer, model, device, loader):
    model.eval()
    predictions = []
    actuals = []
    with torch.no_grad():
        for _, data in enumerate(loader, 0):
            y = data['target_ids'].to(device, dtype = torch.long)
            ids = data['source_ids'].to(device, dtype = torch.long)
            mask = data['source_mask'].to(device, dtype = torch.long)

            generated_ids = model.generate(
                input_ids = ids,
                attention_mask = mask, 
                max_length=150, 
                num_beams=2,
                repetition_penalty=2.5, 
                length_penalty=1.0, 
                early_stopping=True
                )
            preds = [tokenizer.decode(g, skip_special_tokens=True, clean_up_tokenization_spaces=True) for g in generated_ids]
            target = [tokenizer.decode(t, skip_special_tokens=True, clean_up_tokenization_spaces=True)for t in y]
            if _%100==0:
                print(f'Completed {_}')

            predictions.extend(preds)
            actuals.extend(target)
    return predictions, actuals

<a id='section503'></a>
#### Creation of Dataset and Dataloader

* The updated dataframe is divided into 80-20 ratio for test and validation. 
* Both the data-frames are passed to the `CustomerDataset` class for tokenization of the new articles and their summaries.
* The tokenization is done using the length parameters passed to the class.
* Train and Validation parameters are defined and passed to the `pytorch Dataloader contstruct` to create `train` and `validation` data loaders.
* These dataloaders will be passed to `train()` and `validate()` respectively for training and validation action.
* The shape of datasets is printed in the console.


<a id='section504'></a>
#### Neural Network and Optimizer

* In this stage we define the model and optimizer that will be used for training and to update the weights of the network. 
* We are using the `t5-base` transformer model for our project. You can read about the `T5 model` and its features above. 
* We use the `T5ForConditionalGeneration.from_pretrained("t5-base")` commad to define our model. The `T5ForConditionalGeneration` adds a Language Model head to our `T5 model`. The Language Model head allows us to generate text based on the training of `T5 model`.
* We are using the `Adam` optimizer for our project. This has been a standard for all our tutorials and is something that can be changed updated to see how different optimizer perform with different learning rates. 
* There is also a scope for doing more with Optimizer such a decay, momentum to dynamically update the Learning rate and other parameters. All those concepts have been kept out of scope for these tutorials. 


<a id='section505'></a>
#### Training Model

* We call the `train()` with all the necessary parameters.
* Loss at every 500th step is printed on the console.


<a id='section506'></a>
#### Validation and generation of Summary

* After the training is completed, the validation step is initiated.
* As defined in the validation function, the model weights are not updated. We use the fine tuned model to generate new summaries based on the article text.
* An output is printed on the console giving a count of how many steps are complete after every 100th step. 
* The original summary and generated summary are converted into a list and returned to the main function. 
* Both the lists are used to create the final dataframe with 2 columns **Generated Summary** and **Actual Summary**
* The dataframe is saved as a csv file in the local drive.
* A qualitative analysis can be done with the Dataframe. 

In [14]:
TRAIN_BATCH_SIZE = 2    # input batch size for training (default: 64)
VALID_BATCH_SIZE = 2    # input batch size for testing (default: 1000)
TRAIN_EPOCHS = 3      # number of epochs to train (default: 10)
VAL_EPOCHS = 1 
LEARNING_RATE = 1e-4    # learning rate (default: 0.01)
SEED = 42               # random seed (default: 42)
MAX_LEN = 512
SUMMARY_LEN = 150 

In [15]:
# Set random seeds and deterministic pytorch for reproducibility
torch.manual_seed(SEED) # pytorch random seed
np.random.seed(SEED) # numpy random seed
torch.backends.cudnn.deterministic = True

In [16]:
# tokenzier for encoding the text
tokenizer = T5Tokenizer.from_pretrained("t5-base")

HBox(children=(FloatProgress(value=0.0, description='Downloading', max=791656.0, style=ProgressStyle(descripti…




In [17]:
# Adding the summarzie text in front of the text. 
# This is to format the dataset similar to how T5 model was trained for summarization task. 
df = data[['text','ctext']]
df.ctext = 'summarize: ' + df.ctext
print(df.head())

                                                                                                                                                                                                      text  \
0  Budget to set scene for election..Gordon Brown will seek to put the economy at the centre of Labour's bid for a third term in power when he delivers his ninth Budget at 1230 GMT. He is expected to...   
1  Army chiefs in regiments decision..Military chiefs are expected to meet to make a final decision on the future of Scotland's Army regiments...A committee of the Army Board, which is made up of the...   
2  Howard denies split over ID cards..Michael Howard has denied his shadow cabinet was split over its decision to back controversial Labour plans to introduce ID cards...The Tory leader said his fron...   
3  Observers to monitor UK election..Ministers will invite international observers to check the forthcoming UK general election is fairly run...The move comes amid claims the p

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self[name] = value


In [18]:
# Creation of Dataset and Dataloader
# Defining the train size. So 80% of the data will be used for training and the rest will be used for validation. 
train_size = 0.8
train_dataset = df.sample(frac=train_size, random_state = SEED).reset_index(drop=True)
val_dataset = df.drop(train_dataset.index).reset_index(drop=True)

In [19]:
print("FULL Dataset: {}".format(df.shape))
print("TRAIN Dataset: {}".format(train_dataset.shape))
print("TEST Dataset: {}".format(val_dataset.shape))

FULL Dataset: (2224, 2)
TRAIN Dataset: (1779, 2)
TEST Dataset: (445, 2)


In [20]:
# Creating the Training and Validation dataset for further creation of Dataloader
training_set = CustomDataset(train_dataset, tokenizer, MAX_LEN, SUMMARY_LEN)
val_set = CustomDataset(val_dataset, tokenizer, MAX_LEN, SUMMARY_LEN)

In [21]:
# Defining the parameters for creation of dataloaders
train_params = {
    'batch_size': TRAIN_BATCH_SIZE,
    'shuffle': True,
    'num_workers': 0
}

val_params = {
    'batch_size': VALID_BATCH_SIZE,
    'shuffle': False,
    'num_workers': 0
    }

In [22]:
# Creation of Dataloaders for testing and validation. This will be used down for training and validation stage for the model.
training_loader = DataLoader(training_set, **train_params)
val_loader = DataLoader(val_set, **val_params)

In [23]:
# Defining the model. We are using t5-base model and added a Language model layer on top for generation of Summary. 
# Further this model is sent to device (GPU/TPU) for using the hardware.
model = T5ForConditionalGeneration.from_pretrained("t5-base")
model = model.to(device)

HBox(children=(FloatProgress(value=0.0, description='Downloading', max=1199.0, style=ProgressStyle(description…




HBox(children=(FloatProgress(value=0.0, description='Downloading', max=891691430.0, style=ProgressStyle(descri…




In [24]:
# Defining the optimizer that will be used to tune the weights of the network in the training session. 
optimizer = torch.optim.Adam(params =  model.parameters(), lr=LEARNING_RATE)

In [25]:
# Training loop
print('Initiating Fine-Tuning for the model on our dataset')

for epoch in range(TRAIN_EPOCHS):
    train(epoch, tokenizer, model, device, training_loader, optimizer)

Initiating Fine-Tuning for the model on our dataset
Epoch: 0, Loss:  9.133647918701172
Epoch: 0, Loss:  2.6999082565307617
Epoch: 1, Loss:  1.5452290773391724
Epoch: 1, Loss:  1.6306101083755493
Epoch: 2, Loss:  1.0321205854415894
Epoch: 2, Loss:  1.3978021144866943


In [26]:
# Validation loop and saving the resulting file with predictions and acutals in a dataframe.
# Saving the dataframe as predictions.csv
print('Now generating summaries on our fine tuned model for the validation dataset and saving it in a dataframe')
for epoch in range(VAL_EPOCHS):
    predictions, actuals = validate(epoch, tokenizer, model, device, val_loader)
    final_df = pd.DataFrame({'Generated Text':predictions,'Actual Text':actuals})
    final_df.to_csv('predictions.csv')
    print('Output Files generated for review')


Now generating summaries on our fine tuned model for the validation dataset and saving it in a dataframe
Completed 0
Completed 100
Completed 200
Output Files generated for review


### Checking Prediction Files

In [27]:
df = pd.read_csv('/kaggle/working/predictions.csv')
df.head()

Unnamed: 0.1,Unnamed: 0,Generated Text,Actual Text
0,0,"The £10bn Crossrail transport plan, backed by business groups and is getting the go-ahead this month. It says that UK Treasury has allocated £7.5 billion ($13.99 Billion) for it; talks with busine...","Crossrail link 'to get go-ahead'..The £10bn Crossrail transport plan, backed by business groups, is to get the go-ahead this month, according to The Mail on Sunday...It says the UK Treasury has al..."
1,1,"The majority owner of embattled Russian oil firm Yukos has sued the Russia's government for $28.3bn (£15.2m). Group Menatep, whose shares are held by 51% of its unit Yugansky, says this was illega...",Yukos owner sues Russia for $28bn..The majority owner of embattled Russian oil firm Yukos has sued the Russian government for $28.3bn (£15.2bn)...The Kremlin last year seized and sold Yukos' main ...
2,2,"A US government claim accusing the country's biggest tobacco companies of covering upthe effects and smoking has been thrown out by an appeal court...In its case, it claimed tobacco firms had ille...",Court rejects $280bn tobacco case..A US government claim accusing the country's biggest tobacco companies of covering up the effects of smoking has been thrown out by an appeal court...The demand ...
3,3,"Ireland's economic -miracle"" is enjoying another wind, with 55% growth forecast for 2005...The ISEQ index of leading shares closed up 23 points to 661.89 on Thursday and fuelled by strong expansio...","Irish markets reach all-time high..Irish shares have risen to a record high, with investors persuaded to buy into the market by low inflation and strong growth forecasts...The ISEQ index of leadin..."
4,4,"The Pensions Policy Institute (PPI) said life expectancy for unskilled professional men has been understated...Life expectancy at birth is 71 years, and the PPI says, will be 81 to 80. But if meas...","Pension hitch for long-living men..Male life expectancy is much higher than originally estimated, leading pension researchers have said...The Pensions Policy Institute (PPI) said life expectancy f..."


In [42]:
print(f"ARTICLE : \n{df['Actual Text'].iloc[0]}\n\nSUMMARY : \n{df['Generated Text'].iloc[0]}")

ARTICLE : 
Crossrail link 'to get go-ahead'..The £10bn Crossrail transport plan, backed by business groups, is to get the go-ahead this month, according to The Mail on Sunday...It says the UK Treasury has allocated £7.5bn ($13.99bn) for the project and that talks with business groups on raising the rest will begin shortly. The much delayed Crossrail Link Bill would provide for a fast cross-London rail link. The paper says it will go before the House of Commons on 23 February...A second reading could follow on 16 or 17 March. "We've always said we are going to introduce a hybrid Bill for Crossrail in the Spring and

SUMMARY : 
The £10bn Crossrail transport plan, backed by business groups and is getting the go-ahead this month. It says that UK Treasury has allocated £7.5 billion ($13.99 Billion) for it; talks with businesses on raising more will begin shortly...The bill, which includes rail links to London, is expected in early next year. The Mail on Sunday's Financial Post said the £7.5

In [43]:
print(f"ARTICLE : \n{df['Actual Text'].iloc[10]}\n\nSUMMARY : \n{df['Generated Text'].iloc[10]}")

ARTICLE : 
Bank holds interest rate at 4.75%..The Bank of England has left interest rates on hold again at 4.75%, in a widely-predicted move...Rates went up five times from November 2003 - as the bank sought to cool the housing market and consumer debt - but have remained unchanged since August. Recent data has indicated a slowdown in manufacturing and consumer spending, as well as in mortgage approvals. And retail sales disappointed over Christmas, with analysts putting the drop down to less consumer confidence...Rising interest rates and the accompanying slowdown in the housing market have knocked consumers' optimism, causing a sharp fall in demand for expensive goods, according to a report earlier

SUMMARY : 
The Bank of England has left interest rates on hold again at 4.75%, in an widely-predicted move...Rates went up five times from November 2003 and as the bank sought to cool down consumer debt. It also said there was evidence that manufacturers' confidence may be weakening becau

In [44]:
print(f"ARTICLE : \n{df['Actual Text'].iloc[53]}\n\nSUMMARY : \n{df['Generated Text'].iloc[53]}")

ARTICLE : 
Man Utd to open books to Glazer..Manchester United's board has agreed to give US tycoon Malcolm Glazer access to its books...Earlier this month, Mr Glazer presented the board with detailed proposals on an offer to buy the football club. In a statement, the club said it would allow Mr Glazer "limited due diligence" to give him the opportunity to take the proposal on to a formal bid. But it said it continued to oppose Mr Glazer's plans, calling his assumptions "aggressive" and his plan "damaging". Many of Manchester United's supporters own shares in the club, and the fan-based group Shareholders United is strongly opposed

SUMMARY : 
Manchester United's board has agreed to give US tycoon Malcolm Glazer access and its books...Earlier this month, Mrglazer presented the boards with detailed proposals on an offer for buyout of Manchester City. The fan-based group Shareholder’Sued is strongly opposed by any takeover by Mr Glazer." In a statement in January, the club said it would a

In [28]:
def generate_summary(input_text, tokenizer, model, device, max_length=150):
    # Preprocess the input text using the tokenizer
    input_text = 'summarize: ' + input_text
    inputs = tokenizer.batch_encode_plus([input_text], max_length=max_length, pad_to_max_length=True, return_tensors='pt')

    # Move input tensors to the appropriate device
    input_ids = inputs['input_ids'].to(device)
    attention_mask = inputs['attention_mask'].to(device)

    # Generate summary
    with torch.no_grad():
        generated_ids = model.generate(
            input_ids = input_ids,
            attention_mask = attention_mask,
            max_length = max_length,
            num_beams = 2,
            repetition_penalty = 2.5,
            length_penalty = 1.0,
            early_stopping = True
        )

    # Decode the generated summary
    generated_summary = [tokenizer.decode(g, skip_special_tokens=True, clean_up_tokenization_spaces=True) for g in generated_ids]

    return generated_summary[0]


In [36]:
tokenizer = T5Tokenizer.from_pretrained("t5-base")

### Generating Summaries

In [37]:
text = "A recent episode of Shark Tank India saw a lucrative deal that involved all five sharks on the panel — \
Anupam Mittal, Aman Gupta, Azhar Iqubal, Namita Thapar and Radhika Gupta. Two co-founders representing their biomaterial \
science company Canvaloop, which specialises in producing alternative fibres that can reduce the environmental toll left by \
cotton and synthetic, asked for Rs 1 crore in exchange of 1.33% equity, valuing the company at Rs 75 crore.The founders claimed \
that their alternative fibres, which are created from their ‘zero-waste proprietary technology’, consumes 82% less energy, 87% \
less carbon emissions, and 99% less water than synthetic and cotton manufacturing. They said that their claims have been independently \
verified by a third-party organisation, and it was only after this that they started getting business from global brands such as Levi’s."

summary = generate_summary(text, tokenizer, model, device)
print("Generated Summary:", summary)

Generated Summary: A recent episode of Shark Tank India saw a lucrative deal that involved all five sharks on the panel — Anupam Mittal, Aman Gupte and Azhar Iqubal. Radhiki Gap was one of the co-founder members of Canvaloop, which specialise in producing alternative fibre to reduce environmental pollution left by cotton synthetic...The two co-founded companies asked for Rs 1 crore in exchanged with 1.33% equity, valuing the company at Rs 75 lakh (£1bn). The founder said their alternative fiber is created from its ‘zero waste proprietary technology’; it consume 82% less energy, 81% more carbon dioxide than conventional fuel used as fuel or diesel fuel


In [38]:
text = " The Indian government has cleared the supply of several essential commodities to the Maldives, including items such as rice, wheat and onions whose exports are currently banned, amid a downturn in relations between the two sides.\
The government allowed the export of these commodities for 2024-25 under a bilateral mechanism at the request of the Maldivian government, the Indian high commission in Male said in a statement on Friday. The approved quantities are also the highest since the mechanism was put in place in 1981.\
The clearance for the exports comes at a time when ties between India and the Maldives are at a low, especially after the election last year of President Mohamed Muizzu, who has sought to end the Indian archipelago’s dependence on India in strategic sectors. Muizzu has also moved the Maldives closer to China."

summary = generate_summary(text, tokenizer, model, device)
print("Generated Summary:", summary)

Generated Summary: India has cleared the supply of several essential commodities in The Maldives, including items such as rice and wheat onions whose exports are currently banned. The clearance comes at an age when relations between India'State Department and the Maldivian government are at a low, especially after last year’ election by President Mohamed Muiszu, who is now in power to replace Mujizawa. The approved quantities were also highest since it was put into place back then...The clearance comes amid ties among India and Pakistan which have been hit hard during the past year with the release of a series denial letter from Prime Minister Narendra Modal on Monday. The government allowed for the Exports 2024-


In [39]:
text = "In response to reports showing that 36% of graduates at IIT Bombay are yet to secure jobs this placement season, the institute has released data from an exit survey conducted among graduating students in the academic year 2022-23.\
According to the survey result shared by the institute, only 6.1% of graduates are looking for jobs, while the majority of students, accounting for 57.1%, secured jobs through IIT Bombay's placement process. Additionally, 12.2% of students opted to pursue higher degrees, while 8.3% chose careers in public service.\
Furthermore, a breakdown of employment preferences among students reveals that 10.9% secured jobs outside of IIT Bombay, with 1.6% venturing into startup ventures. A small percentage, 4.3%, remain undecided about their career paths."

summary = generate_summary(text, tokenizer, model, device)
print("Generated Summary:", summary)

Generated Summary: In response to reports showing that 36% of graduates at IIT Bombay are yet able secure jobs this placement season, the institute has released data from an exit survey conducted among graduating students in academic year 2022-23. Accordingly, only 6.1% and majority were looking for work, while the majority (57%) secured employment through my university's placement process...The institute also released data on how many students left school after graduation, with one in five opting out of IIM Bombeghta University. In response there is no mention of public service careers as well as higher education. Additionally, 12.2% have chosen to pursue higher degrees, whil 8.3% chose careers within private sector or government. The


In [33]:
save_directory = "./fine_tuned_model"


if not os.path.exists(save_directory):
    os.makedirs(save_directory)

# Saving the fine-tuned model to the specified directory
model.save_pretrained(save_directory)

# Saving the tokenizer as well, if needed
tokenizer.save_pretrained(save_directory)

print("Fine-tuned model and tokenizer saved successfully to:", save_directory)

Fine-tuned model and tokenizer saved successfully to: ./fine_tuned_model
