# Part 1 : Transformers
## Task 1
For the task we will make use of [Bloomberg Quint news dataset](https://data.world/crawlfeeds/bloomberg-quint-news-dataset). It's basically a collection of news articles scrapped from Bloomberg Quint. Columns include the title, description, short descrition, author and so on. However, for our task of summarization, we are only interested in the description and short_description (which will act as summary). More info on the data-set is shown below.

In [None]:
!pip install transformers rouge_score nltk



In [None]:
import pandas as pd
from transformers import BartTokenizer, BartForConditionalGeneration, get_linear_schedule_with_warmup
import torch
from torch.utils.data import DataLoader, Dataset
from torch.optim import AdamW
from tqdm.auto import tqdm
from sklearn.model_selection import train_test_split
from transformers import pipeline
from nltk.translate.bleu_score import sentence_bleu
from rouge_score import rouge_scorer

In [None]:
df = pd.read_json('https://query.data.world/s/6kftujtbmao4u67jg4byu3wchm5t4k?dws=00000')
df.head()

Unnamed: 0,url,title,short_description,author,date_created,date_modified,category,raw_description,description,publisher,scraped_at
0,https://www.bloombergquint.com/markets/all-you...,All You Need To Know Going Into Trade On Septe...,"Stocks in the news, big brokerage calls of the...",Darshan A Nakhwa,"23 Sep 2021, 7:05 AM IST","23 Sep 2021, 7:05 AM IST",Markets,"<div class=""story-element story-element-text"">...",Asian stocks were steady early Thursday after ...,https://www.bloombergquint.com/,2021-09-24 00:01:25
1,https://www.bloombergquint.com/business/bridge...,"Bridgestone CEO Backs Safe Tokyo Olympics, Dia...","Bridgestone CEO Backs Safe Tokyo Olympics, Dia...",Shiho Takezawa &,"23 Apr 2021, 5:35 AM IST","23 Apr 2021, 6:35 AM IST",Business,"<div class=""story-element story-element-text"">...",Bridgestone Corp. will support the Tokyo Olym...,https://www.bloombergquint.com/,2021-09-24 00:01:26
2,https://www.bloombergquint.com/markets/stocks-...,"Stocks To Watch: HCL Tech, Cyient, M&M Financi...",Here are the stocks to watch in trade today...,BQ Desk,"23 Apr 2021, 7:29 AM IST","23 Apr 2021, 7:29 AM IST",Markets,"<div class=""story-element story-element-text"">...",Indian equity benchmarks reversed losses made ...,https://www.bloombergquint.com/,2021-09-24 00:01:26
3,https://www.bloombergquint.com/research-report...,Localised Lockdowns Cannot But Impinge On Econ...,Localised Lockdowns Cannot But Impinge On Econ...,Nirmal Bang Institutional Research,"26 Apr 2021, 7:58 AM IST","26 Apr 2021, 7:58 AM IST",Research Reports,"<div class=""story-element story-element-text"">...","Nirmal Bang Report, We assess the ‘state of af...",https://www.bloombergquint.com/,2021-09-24 00:01:26
4,https://www.bloombergquint.com/business/cp-rai...,CP Rail Wins Regulator Exemption From Tougher ...,CP Rail Wins Regulator Exemption From Tougher ...,Thomas Black,"24 Apr 2021, 7:08 AM IST","24 Apr 2021, 7:44 AM IST",Business,"<div class=""story-element story-element-text"">...",Canadian Pacific Railway Ltd. won a petition ...,https://www.bloombergquint.com/,2021-09-24 00:01:27


In [None]:
print(df.isnull().sum())
df.dropna(inplace=True)  # dropping rows with missing values

url                     0
title                   0
short_description       0
author               2912
date_created         2912
date_modified        2912
category                0
raw_description         0
description             0
publisher               0
scraped_at              0
dtype: int64


In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4106 entries, 0 to 6660
Data columns (total 11 columns):
 #   Column             Non-Null Count  Dtype         
---  ------             --------------  -----         
 0   url                4106 non-null   object        
 1   title              4106 non-null   object        
 2   short_description  4106 non-null   object        
 3   author             4106 non-null   object        
 4   date_created       4106 non-null   object        
 5   date_modified      4106 non-null   object        
 6   category           4106 non-null   object        
 7   raw_description    4106 non-null   object        
 8   description        4106 non-null   object        
 9   publisher          4106 non-null   object        
 10  scraped_at         4106 non-null   datetime64[ns]
dtypes: datetime64[ns](1), object(10)
memory usage: 384.9+ KB


Just a view of the text and summaries which belong in the 'short_description' and 'description' columns.

In [None]:
df_desc = df[['short_description', 'description']]
df_desc.head()

Unnamed: 0,short_description,description
0,"Stocks in the news, big brokerage calls of the...",Asian stocks were steady early Thursday after ...
1,"Bridgestone CEO Backs Safe Tokyo Olympics, Dia...",Bridgestone Corp. will support the Tokyo Olym...
2,Here are the stocks to watch in trade today...,Indian equity benchmarks reversed losses made ...
3,Localised Lockdowns Cannot But Impinge On Econ...,"Nirmal Bang Report, We assess the ‘state of af..."
4,CP Rail Wins Regulator Exemption From Tougher ...,Canadian Pacific Railway Ltd. won a petition ...


In [None]:
avg_words_short_description = df['short_description'].apply(lambda x: len(x.split())).mean()
avg_words_description = df['description'].apply(lambda x: len(x.split())).mean()
# Print the averages
print(f"Average number of words in 'short_description': {avg_words_short_description}")
print(f"Average number of words in 'description': {avg_words_description}")

Average number of words in 'short_description': 11.269849001461276
Average number of words in 'description': 480.05820750121774


We will now split it in 90-10 ratio.

In [None]:
X_train, X_test, y_train, y_test = train_test_split(df['description'], df['short_description'], test_size=0.1, random_state=42)
X_train = X_train.tolist()
X_test = X_test.tolist()
y_train = y_train.tolist()
y_test = y_test.tolist()
print(len(X_train), len(X_test), len(y_train), len(y_test))

3695 411 3695 411


In [None]:
for i in range(3):
    print(f"Text: {X_train[i]}")
    print(f"Summary: {y_train[i]}")

Text: A district-wise mapping of India’s vaccination coverage against Covid-19 has shown significant disparity so far, even as the country tries to accelerate its inoculation drive and curb a deadly second wave of the virus., Several districts in Uttar Pradesh, Tamil Nadu and the North-Eastern states have only vaccinated less than 5% of their population, according to an analysis by Credit Suisse. That’s in contrast to districts in states like Rajasthan, Gujarat, Uttarakhand, Himachal Pradesh and Kerala that have already administered shots to more than 10% of the population., Credit Suisse said that while there has been no visible divergence in the fatalities reported by these districts, that could show up in the next 2-3 weeks., A pace of vaccinating around 35-40 lakh people daily in India has been considered sustainable to reduce the spread, it noted. However, in the last five days the pace has slowed down to 25-30 lakh a day, Credit Suisse said., Even at this pace, about 40% of the p

From the huggingface site :
Implementation Notes
- Bart doesn’t use token_type_ids for sequence classification. Use BartTokenizer or encode() to get the proper splitting.
- The forward pass of BartModel will create the decoder_input_ids if they are not passed. This is different than some other modeling APIs. A typical use case of this feature is mask filling.
- Model predictions are intended to be identical to the original implementation when forced_bos_token_id=0. This only works, however, if the string you pass to fairseq.encode starts with a space.
- generate() should be used for conditional generation tasks like summarization, see the example in that docstrings.
- Models that load the facebook/bart-large-cnn weights will not have a mask_token_id, or be able to perform mask-filling tasks.

In [None]:
tokenizer = BartTokenizer.from_pretrained('facebook/bart-large-cnn')
model = BartForConditionalGeneration.from_pretrained('facebook/bart-large-cnn')

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


In [None]:
class NewsDataset(Dataset):
    def __init__(self, tokenizer, texts, summaries, max_length=512):
        self.tokenizer = tokenizer
        self.texts = texts
        self.summaries = summaries
        self.max_length = max_length

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
        text = self.texts[idx]
        summary = self.summaries[idx]

        text_encoding = tokenizer(text, return_tensors='pt', max_length=self.max_length, padding='max_length', truncation=True)
        summary_encoding = tokenizer(summary, return_tensors='pt', max_length=self.max_length, padding='max_length', truncation=True)

        return {
            'input_ids': text_encoding['input_ids'].flatten(),
            'attention_mask': text_encoding['attention_mask'].flatten(),
            'labels': summary_encoding['input_ids'].flatten()
        }

# methods for calcualting the rouge and bleu score. This is little bit complex, becasue we need to generate the summaries, and comapre the n-grams between this and actual text.
def calculate_scores(actual_summaries, generated_summaries):
    bleu_scores = []
    rouge_scores = {'rouge1': [], 'rouge2': [], 'rougeL': []}

    scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'], use_stemmer=True)

    for actual, generated in zip(actual_summaries, generated_summaries):
        reference = actual.split()
        candidate = generated.split()
        score = sentence_bleu([reference], candidate, weights=(0.25, 0.25, 0.25, 0.25))
        bleu_scores.append(score) # bleu score

        scores = scorer.score(actual, generated)
        rouge_scores['rouge1'].append(scores['rouge1'].fmeasure)
        rouge_scores['rouge2'].append(scores['rouge2'].fmeasure)
        rouge_scores['rougeL'].append(scores['rougeL'].fmeasure)

    avg_bleu = sum(bleu_scores) / len(bleu_scores)
    avg_rouge1 = sum(rouge_scores['rouge1']) / len(rouge_scores['rouge1'])
    avg_rouge2 = sum(rouge_scores['rouge2']) / len(rouge_scores['rouge2'])
    avg_rougeL = sum(rouge_scores['rougeL']) / len(rouge_scores['rougeL'])

    return avg_bleu, avg_rouge1, avg_rouge2, avg_rougeL

def generate_summaries(texts, model, tokenizer, device, max_length=1024, min_length=40):
    summaries = []
    for text in texts:
        # Encode the text, ensuring that it's truncated to the maximum length the model can handle
        inputs = tokenizer.encode("summarize: " + text, return_tensors="pt", max_length=max_length, truncation=True)
        inputs = inputs.to(device)

        # Generate summary with the model
        summary_ids = model.generate(inputs, max_length=min_length, min_length=min_length, length_penalty=2.0, num_beams=4, early_stopping=True)
        summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)

        summaries.append(summary)
    return summaries

# for some reason the actual implementation from huggingface was throwing errors.
# def generate_summaries(texts, model, tokenizer, device):
#     summarizer = pipeline("summarization", model=model, tokenizer=tokenizer, device=0 if device.type == 'cuda' else -1)
#     summaries = summarizer(texts, max_length=150, min_length=40, do_sample=False)
#     return [summary['summary_text'] for summary in summaries]

For hyper-parameters we will use a batch size of 4 to make sure GPU RAM doesn't get filled too much. We make the Dataloader pipelines through torch APIs. We use Adam optimizer with a learning rate of 5e-5. Each epoch takes about 10 mins in V100 GPU in paid colab. So for now, we will run only 10 epochs. But decent results can be expected only in the range of 100s

In [None]:
batch_size = 4
train_dataset = NewsDataset(tokenizer, X_train, y_train)
train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
optimizer = AdamW(model.parameters(), lr=5e-5)
num_epochs = 10

In [None]:
bleu_scores_epochs, rouge1_scores_epochs, rouge2_scores_epochs, rougeL_scores_epochs = [], [], [], []
best_bleu = 0.0

for epoch in range(num_epochs):
    model.train()
    loop = tqdm(train_loader, leave=True)
    for batch in loop:
        optimizer.zero_grad()
        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        labels = batch['labels'].to(device)

        # Forward pass
        outputs = model(input_ids, attention_mask=attention_mask, labels=labels)
        loss = outputs.loss

        # Backward pass
        loss.backward()
        optimizer.step()

        loop.set_description(f'Epoch {epoch}')
        loop.set_postfix(loss=loss.item())
    subset_X_test = X_test[:20]  # For example, use the first 20 samples from the test set
    subset_y_test = y_test[:20]

    generated_summaries = generate_summaries(subset_X_test, model, tokenizer, device)
    bleu, rouge1, rouge2, rougeL = calculate_scores(subset_y_test, generated_summaries)

    bleu_scores_epochs.append(bleu)
    rouge1_scores_epochs.append(rouge1)
    rouge2_scores_epochs.append(rouge2)
    rougeL_scores_epochs.append(rougeL)

    print(f"Epoch {epoch} - BLEU: {bleu}, ROUGE-1: {rouge1}, ROUGE-2: {rouge2}, ROUGE-L: {rougeL}")
    if bleu > best_bleu:
        best_bleu = bleu  # Update the best BLEU score
        # Define your checkpoint path
        checkpoint_path = f"bart_best_checkpoint_epoch_{epoch}.pt"
        torch.save({
            'epoch': epoch,
            'model_state_dict': model.state_dict(),
            'optimizer_state_dict': optimizer.state_dict(),
            'loss': loss,
            'bleu': bleu,
            'rouge1': rouge1,
            'rouge2': rouge2,
            'rougeL': rougeL
        }, checkpoint_path)
        print(f"Checkpoint saved to {checkpoint_path} with improved BLEU score.")

  0%|          | 0/924 [00:00<?, ?it/s]

The hypothesis contains 0 counts of 2-gram overlaps.
Therefore the BLEU score evaluates to 0, independently of
how many N-gram overlaps of lower order it contains.
Consider using lower n-gram order or use SmoothingFunction()
The hypothesis contains 0 counts of 3-gram overlaps.
Therefore the BLEU score evaluates to 0, independently of
how many N-gram overlaps of lower order it contains.
Consider using lower n-gram order or use SmoothingFunction()
The hypothesis contains 0 counts of 4-gram overlaps.
Therefore the BLEU score evaluates to 0, independently of
how many N-gram overlaps of lower order it contains.
Consider using lower n-gram order or use SmoothingFunction()


Epoch 0 - BLEU: 1.0735534912967036e-232, ROUGE-1: 0.05242340247135644, ROUGE-2: 0.0, ROUGE-L: 0.05242340247135644
Checkpoint saved to bart_best_checkpoint_epoch_0.pt with improved BLEU score.


  0%|          | 0/924 [00:00<?, ?it/s]

Epoch 1 - BLEU: 2.2409973599540725e-233, ROUGE-1: 0.042019035459437935, ROUGE-2: 0.0, ROUGE-L: 0.042019035459437935


  0%|          | 0/924 [00:00<?, ?it/s]

Epoch 2 - BLEU: 2.9159727936827646e-233, ROUGE-1: 0.03068172568172568, ROUGE-2: 0.0, ROUGE-L: 0.03068172568172568


  0%|          | 0/924 [00:00<?, ?it/s]

Epoch 3 - BLEU: 1.9814701025268694e-232, ROUGE-1: 0.055731386757601586, ROUGE-2: 0.0, ROUGE-L: 0.0466216559087403
Checkpoint saved to bart_best_checkpoint_epoch_3.pt with improved BLEU score.


  0%|          | 0/924 [00:00<?, ?it/s]

Epoch 4 - BLEU: 2.3821329134862613e-232, ROUGE-1: 0.06853485724690658, ROUGE-2: 0.0034482758620689646, ROUGE-L: 0.06076359624983914
Checkpoint saved to bart_best_checkpoint_epoch_4.pt with improved BLEU score.


  0%|          | 0/924 [00:00<?, ?it/s]

Epoch 5 - BLEU: 1.2665438683084556e-232, ROUGE-1: 0.03451315789473684, ROUGE-2: 0.0, ROUGE-L: 0.03451315789473684


  0%|          | 0/924 [00:00<?, ?it/s]

Epoch 6 - BLEU: 8.553352456753039e-233, ROUGE-1: 0.024683666570076306, ROUGE-2: 0.0, ROUGE-L: 0.024683666570076306


  0%|          | 0/924 [00:00<?, ?it/s]

Epoch 7 - BLEU: 1.3976369793078114e-232, ROUGE-1: 0.04356351236146633, ROUGE-2: 0.0, ROUGE-L: 0.03731351236146633


  0%|          | 0/924 [00:00<?, ?it/s]

Epoch 8 - BLEU: 3.250867188473171e-233, ROUGE-1: 0.04190106541150184, ROUGE-2: 0.0, ROUGE-L: 0.038675258959888936


  0%|          | 0/924 [00:00<?, ?it/s]

Epoch 9 - BLEU: 1.4879532713336404e-232, ROUGE-1: 0.018642877440831408, ROUGE-2: 0.0, ROUGE-L: 0.018642877440831408


In [None]:
# If needed summarise can be generated and observed.
# generated_summaries = generate_summaries(subset_X_test, model, tokenizer, device)
# for actual, generated in zip(subset_y_test, generated_summaries):
#     print(f"Actual Summary: {actual}")
#     print(f"Generated Summary: {generated}")
#     print("-" * 50)