In [1]:
!pip install transformers[sentencepiece] datasets sacrebleu rouge_score py7zr -q

In [2]:
!nvidia-smi

Fri Apr  5 14:23:50 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  Tesla T4                       Off | 00000000:00:04.0 Off |                    0 |
| N/A   60C    P8              10W /  70W |      0MiB / 15360MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                    

In [3]:
from transformers import pipeline, set_seed

import matplotlib.pyplot as plt
import pandas as pd
from datasets import load_dataset, load_metric

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

import nltk
from nltk.tokenize import sent_tokenize

from tqdm import tqdm
import torch

nltk.download("punkt")


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [4]:
device = "cuda" if torch.cuda.is_available() else "cpu"
device

'cuda'

In [5]:
model_ckpt = "google/pegasus-cnn_dailymail"


In [6]:
tokenizer = AutoTokenizer.from_pretrained(model_ckpt)  # download the tokenizer from the Hugging Face Hub

model_pegasus = AutoModelForSeq2SeqLM.from_pretrained(model_ckpt).to(device)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.
Some weights of PegasusForConditionalGeneration were not initialized from the model checkpoint at google/pegasus-cnn_dailymail and are newly initialized: ['model.decoder.embed_positions.weight', 'model.encoder.embed_positions.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [7]:
dataset_samsum = load_dataset("samsum")  # load the SAMSum dialogue summarization dataset

In [8]:
dataset_samsum

DatasetDict({
    train: Dataset({
        features: ['id', 'dialogue', 'summary'],
        num_rows: 14732
    })
    test: Dataset({
        features: ['id', 'dialogue', 'summary'],
        num_rows: 819
    })
    validation: Dataset({
        features: ['id', 'dialogue', 'summary'],
        num_rows: 818
    })
})

In [9]:
split_lengths = [len(dataset_samsum[split]) for split in dataset_samsum]
split_lengths

[14732, 819, 818]

In [10]:
print(f"features: {dataset_samsum['train'].column_names}")

features: ['id', 'dialogue', 'summary']


In [11]:
print("\nDialogue")
print(dataset_samsum["test"][1]["dialogue"])
print("\nSummary")
print(dataset_samsum["test"][1]["summary"])


Dialogue
Eric: MACHINE!
Rob: That's so gr8!
Eric: I know! And shows how Americans see Russian ;)
Rob: And it's really funny!
Eric: I know! I especially like the train part!
Rob: Hahaha! No one talks to the machine like that!
Eric: Is this his only stand-up?
Rob: Idk. I'll check.
Eric: Sure.
Rob: Turns out no! There are some of his stand-ups on youtube.
Eric: Gr8! I'll watch them now!
Rob: Me too!
Eric: MACHINE!
Rob: MACHINE!
Eric: TTYL?
Rob: Sure :)

Summary
Eric and Rob are going to watch a stand-up on youtube.


In [12]:
pipe=pipeline('summarization',model=model_ckpt)

Some weights of PegasusForConditionalGeneration were not initialized from the model checkpoint at google/pegasus-cnn_dailymail and are newly initialized: ['model.decoder.embed_positions.weight', 'model.encoder.embed_positions.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [13]:
dialogue=dataset_samsum['test'][0]['dialogue']

In [14]:
pipeout=pipe(dialogue)
pipeout

Your max_length is set to 128, but your input_length is only 122. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=61)


[{'summary_text': "Amanda: Ask Larry Amanda: He called her last time we were at the park together .<n>Hannah: I'd rather you texted him .<n>Amanda: Just text him ."}]

In [15]:
print(pipeout[0]['summary_text'].replace(" .<n>", ".\n"))

Amanda: Ask Larry Amanda: He called her last time we were at the park together.
Hannah: I'd rather you texted him.
Amanda: Just text him .
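
The replace call above fixes the lines that end in " .<n>" but leaves the last sentence with a space before its period. A minimal sketch of a more general cleanup follows, assuming "<n>" is the only newline marker this checkpoint emits; the helper name tidy_summary is hypothetical and not part of transformers.

```python
import re


def tidy_summary(raw: str, newline_token: str = "<n>") -> str:
    """Turn the model's newline marker into real line breaks and drop
    the stray space before sentence-ending punctuation."""
    lines = raw.split(newline_token)
    cleaned = [re.sub(r"\s+([.!?])", r"\1", line.strip()) for line in lines]
    return "\n".join(cleaned)


# Raw pipeline output copied from the cell above.
raw = (
    "Amanda: Ask Larry Amanda: He called her last time we were at the park "
    "together .<n>Hannah: I'd rather you texted him .<n>Amanda: Just text him ."
)
print(tidy_summary(raw))
```

With this helper every line, including the last one, ends in a period with no space before it.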


We will pass the data batch-wise: the dataset is tokenized in batches, and training then runs over all of the training data and reports the final training and validation loss.

In [16]:
def convert_examples_to_features(example_batch):
    # Tokenize the dialogues (model inputs), truncating to 1024 tokens.
    input_encodings = tokenizer(example_batch['dialogue'], max_length=1024, truncation=True)

    # Tokenize the reference summaries as targets; text_target replaces the
    # deprecated tokenizer.as_target_tokenizer() context manager.
    target_encodings = tokenizer(text_target=example_batch['summary'], max_length=128, truncation=True)

    return {
        'input_ids': input_encodings['input_ids'],
        'attention_mask': input_encodings['attention_mask'],
        'labels': target_encodings['input_ids']
    }
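
To make the batch-wise processing concrete: with batched=True, the function above receives a dict of lists (one list per column), not a single example, and the columns it returns are added next to the existing ones. The sketch below imitates that in pure Python. batched_map and toy_features are hypothetical stand-ins for Dataset.map and the Pegasus tokenizer, not the library implementation.

```python
from typing import Callable, Dict, List

Batch = Dict[str, List]


def batched_map(rows: List[dict], fn: Callable[[Batch], Batch],
                batch_size: int = 2) -> List[dict]:
    """Pure-Python stand-in for a batched map: call fn on dict-of-lists batches."""
    out = []
    for start in range(0, len(rows), batch_size):
        chunk = rows[start:start + batch_size]
        # Column-oriented view of the chunk: {"dialogue": [...], "summary": [...]}
        batch = {key: [row[key] for row in chunk] for key in chunk[0]}
        new_cols = fn(batch)
        for i, row in enumerate(chunk):
            # New columns are stored next to the original ones.
            out.append({**row, **{k: v[i] for k, v in new_cols.items()}})
    return out


def toy_features(batch: Batch) -> Batch:
    # Hypothetical whitespace "tokenizer" used only for illustration.
    return {"n_words": [len(text.split()) for text in batch["dialogue"]]}


rows = [
    {"dialogue": "A: hi\nB: hello there", "summary": "A and B greet."},
    {"dialogue": "A: bye", "summary": "A leaves."},
    {"dialogue": "A: see you at six", "summary": "They meet at six."},
]
result = batched_map(rows, toy_features)
print([r["n_words"] for r in result])  # [5, 2, 5]
```

The same pattern explains why the mapped dataset still contains id, dialogue and summary alongside input_ids, attention_mask and labels.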

In [17]:
dataset_samsum_pt = dataset_samsum.map(convert_examples_to_features, batched = True)





In [18]:
dataset_samsum_pt['train'][0]

{'id': '13818513',
 'dialogue': "Amanda: I baked  cookies. Do you want some?\r\nJerry: Sure!\r\nAmanda: I'll bring you tomorrow :-)",
 'summary': 'Amanda baked cookies and will bring Jerry some tomorrow.',
 'input_ids': [12195,
  151,
  125,
  7091,
  3659,
  107,
  842,
  119,
  245,
  181,
  152,
  10508,
  151,
  7435,
  147,
  12195,
  151,
  125,
  131,
  267,
  650,
  119,
  3469,
  29344,
  1],
 'attention_mask': [1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1],
 'labels': [12195, 7091, 3659, 111, 138, 650, 10508, 181, 3469, 107, 1]}

In [19]:
from transformers import DataCollatorForSeq2Seq

seq2seq_data_collator = DataCollatorForSeq2Seq(tokenizer, model=model_pegasus)
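
The tokenized examples have different lengths, so the collator has to pad each batch before it can become a tensor. Below is a rough pure-Python sketch of that padding step. It assumes a pad token id of 0 for Pegasus and uses -100 for label padding, which PyTorch's cross-entropy loss ignores by default. pad_batch is a hypothetical helper; the real collator also returns tensors and, when given the model, prepares decoder inputs.

```python
from typing import Dict, List


def pad_batch(features: List[Dict[str, List[int]]],
              pad_token_id: int = 0,          # assumed Pegasus pad id
              label_pad_token_id: int = -100) -> Dict[str, List[List[int]]]:
    """Right-pad every field to the longest sequence in the batch."""
    max_in = max(len(f["input_ids"]) for f in features)
    max_lab = max(len(f["labels"]) for f in features)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for f in features:
        n_in = max_in - len(f["input_ids"])
        n_lab = max_lab - len(f["labels"])
        batch["input_ids"].append(f["input_ids"] + [pad_token_id] * n_in)
        # Padded positions get attention mask 0 so the encoder skips them.
        batch["attention_mask"].append(f["attention_mask"] + [0] * n_in)
        # -100 labels are skipped by the loss, so padding adds no loss.
        batch["labels"].append(f["labels"] + [label_pad_token_id] * n_lab)
    return batch


# Shortened, illustrative token ids (not full dataset rows).
features = [
    {"input_ids": [12195, 151, 125, 1], "attention_mask": [1, 1, 1, 1],
     "labels": [12195, 7091, 1]},
    {"input_ids": [10508, 1], "attention_mask": [1, 1], "labels": [1]},
]
batch = pad_batch(features)
print(batch["input_ids"][1])  # [10508, 1, 0, 0]
print(batch["labels"][1])     # [1, -100, -100]
```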


In [20]:
!pip install accelerate -U




In [21]:
!pip install transformers[integrations]



In [22]:
from transformers import TrainingArguments, Trainer, EarlyStoppingCallback

In [23]:


trainer_args = TrainingArguments(
    output_dir='pegasus-samsum', num_train_epochs=1, warmup_steps=500,
    per_device_train_batch_size=1, per_device_eval_batch_size=1,  # batch sizes per device for training and evaluation
    weight_decay=0.01, logging_steps=10,  # weight decay penalizes large weights
    evaluation_strategy='steps', eval_steps=500, save_steps=1e6,  # save_steps is far larger than the run, so no intermediate checkpoints are written
    gradient_accumulation_steps=16  # accumulate gradients over 16 batches before each weight update, simulating a batch size of 16
)

In [24]:
trainer = Trainer(model=model_pegasus, args=trainer_args,
                  tokenizer=tokenizer, data_collator=seq2seq_data_collator,
                  train_dataset=dataset_samsum_pt["train"],
                  eval_dataset=dataset_samsum_pt["validation"])




In [25]:
trainer.train()

Step  Training Loss  Validation Loss
500   1.6599         1.483296


TrainOutput(global_step=920, training_loss=1.8251974468645842, metrics={'train_runtime': 2940.2079, 'train_samples_per_second': 5.011, 'train_steps_per_second': 0.313, 'total_flos': 5528248038285312.0, 'train_loss': 1.8251974468645842, 'epoch': 1.0})
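
The global_step=920 in the output above follows from the batch settings. The arithmetic below is a sketch that assumes a single device (the one T4 shown earlier) and that the Trainer counts optimizer updates as batches per epoch floor-divided by the accumulation steps, which is consistent with the logged value.

```python
import math

num_train_examples = 14732           # size of the train split reported above
per_device_train_batch_size = 1
gradient_accumulation_steps = 16
num_train_epochs = 1
num_devices = 1                      # assumption: a single GPU

# Number of mini-batches the dataloader yields per epoch.
batches_per_epoch = math.ceil(
    num_train_examples / (per_device_train_batch_size * num_devices)
)
# One optimizer update happens every gradient_accumulation_steps batches.
updates_per_epoch = max(batches_per_epoch // gradient_accumulation_steps, 1)
total_updates = math.ceil(num_train_epochs * updates_per_epoch)
effective_batch_size = (
    per_device_train_batch_size * num_devices * gradient_accumulation_steps
)

print(effective_batch_size)  # 16
print(total_updates)         # 920
```

With only 920 updates, eval_steps=500 gives a single evaluation (at step 500), and warmup_steps=500 means the learning rate is still warming up for more than half of the run.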

In [26]:
model_pegasus.save_pretrained("pegasus-samsum-model")

Non-default generation parameters: {'max_length': 128, 'min_length': 32, 'num_beams': 8, 'length_penalty': 0.8, 'forced_eos_token_id': 1}


In [27]:
tokenizer.save_pretrained("tokenizer")

('tokenizer/tokenizer_config.json',
 'tokenizer/special_tokens_map.json',
 'tokenizer/spiece.model',
 'tokenizer/added_tokens.json',
 'tokenizer/tokenizer.json')

In [28]:
sample_text = dataset_samsum["test"][0]["dialogue"]

reference = dataset_samsum["test"][0]["summary"]


gen_kwargs = {"length_penalty": 0.8, "num_beams":8, "max_length": 128}

pipe = pipeline("summarization", model="pegasus-samsum-model",tokenizer=tokenizer)
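
As an illustration of what length_penalty does during beam search: the score of a finished hypothesis is its summed token log-probability divided by its length raised to length_penalty. The sketch below uses two made-up hypotheses, so the numbers are hypothetical; only the formula reflects how beam search ranks finished candidates.

```python
def beam_score(sum_logprobs: float, length: int, length_penalty: float) -> float:
    """Length-normalized score of a finished beam hypothesis."""
    return sum_logprobs / (length ** length_penalty)


# Two hypothetical hypotheses: (sum of token log-probs, length in tokens).
short = (-6.0, 10)
long_ = (-10.0, 30)

for penalty in (0.0, 0.8, 1.0):
    short_score = beam_score(*short, penalty)
    long_score = beam_score(*long_, penalty)
    winner = "short" if short_score > long_score else "long"
    print(f"length_penalty={penalty}: short={short_score:.3f} "
          f"long={long_score:.3f} -> {winner}")
```

Because log-probabilities are negative, a larger exponent favors longer outputs: in this toy case the short hypothesis wins with a penalty of 0.0 and the long one wins at 0.8 and 1.0.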


In [29]:
print("Dialogue:")
print(sample_text)


print("\nReference Summary:")
print(reference)


print("\nModel Summary:")
print(pipe(sample_text, **gen_kwargs)[0]["summary_text"])

Your max_length is set to 128, but your input_length is only 122. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=61)


Dialogue:
Hannah: Hey, do you have Betty's number?
Amanda: Lemme check
Hannah: <file_gif>
Amanda: Sorry, can't find it.
Amanda: Ask Larry
Amanda: He called her last time we were at the park together
Hannah: I don't know him well
Hannah: <file_gif>
Amanda: Don't be shy, he's very nice
Hannah: If you say so..
Hannah: I'd rather you texted him
Amanda: Just text him 🙂
Hannah: Urgh.. Alright
Hannah: Bye
Amanda: Bye bye

Reference Summary:
Hannah needs Betty's number but Amanda doesn't have it. She needs to contact Larry.

Model Summary:
Amanda can't find Betty's number. Larry called Betty last time they were at the park together. Hannah wants Amanda to text Larry. Amanda will text Larry.
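
To put a rough number on how close the model summary is to the reference, one option is a unigram-overlap F1. This is a simplified stand-in in the spirit of ROUGE-1, not the rouge_score package installed earlier (it does no stemming and uses a naive tokenizer); the function name unigram_f1 is hypothetical.

```python
import re
from collections import Counter


def unigram_f1(reference: str, candidate: str) -> float:
    """F1 over overlapping lowercase word counts (simplified ROUGE-1 style)."""
    ref = Counter(re.findall(r"[a-z0-9']+", reference.lower()))
    cand = Counter(re.findall(r"[a-z0-9']+", candidate.lower()))
    # Counter intersection keeps the minimum count of each shared word.
    overlap = sum((ref & cand).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


reference = ("Hannah needs Betty's number but Amanda doesn't have it. "
             "She needs to contact Larry.")
model_summary = ("Amanda can't find Betty's number. Larry called Betty last time "
                 "they were at the park together. Hannah wants Amanda to text "
                 "Larry. Amanda will text Larry.")
print(f"unigram F1: {unigram_f1(reference, model_summary):.3f}")
```

A score computed on one example is only a sanity check; a proper evaluation would average a standard ROUGE implementation over the whole test split.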


In [30]:
sample_text="""summarize:Trudy: Hey, so I’m having a party at my place next weekend. Do you want to come?

Ruth: Sure! That sounds like fun. Who else is coming?

Trudy: Let’s see. I think it’s going to be Jerome, Talia, Anna, Juan, Celeste, Michelle and possibly Jamie. It’s not really going to be a party, more like a small get-together. I’m cooking dinner, and we can just hang out.

Ruth: What time should I be there?

Trudy: Oh, anytime between 6 and 7 would be fine.

Ruth: Can I bring anything?

Trudy: Oh, don’t worry about it. I have everything covered.

Ruth: Can I at least bring a bottle of wine?

Trudy: Well, I’m not going to say no to wine. I’m sure that would be appreciated.

Ruth: I’ll do that, then. Thanks for inviting me. """

In [31]:
print(pipe(sample_text, **gen_kwargs)[0]["summary_text"])

Trudy is having a party at her place next weekend. Ruth wants to come. Trudy will bring a bottle of wine. Ruth will do that, then.


In [33]:
sample_text="""summarize:Rohit: You look bit down. What’s the matter?

Mahesh: (Sighs) Nothing much.

Rohit: Looks like something isn’t right.

Mahesh: Ya. It’s at the job front. You know that the telecom industry is going through a rough patch because of falling prices and shrinking margins. These factors along with consolidation in the industry is threatening the stability of our jobs. And even if the job remains, career growth isn’t exciting.

Rohit: I know. I’ve been reading about some of these issues about your industry in the newspapers. So have you thought of any plan?

Mahesh: I’ve been thinking about it for a while, but haven’t concretized anything so far.

Rohit: What have you been thinking, if you can share?

Mahesh: Well, I’ve been thinking of switching to an industry that has at least few decades of growth left.

Rohit: That’s the right approach, but you need to reskill yourself for the industry you’re targeting.

Mahesh: I realize that, and I’ve been leaning toward digital marketing because in that industry I can carry over some of my skills from the current job. Another reason for this inclination is that digital marketing requires far less hardcore technical skills, which will make it relatively easier for me to acquire new skills.

Rohit: Your choice makes sense. So are you thinking of making the transition in near future?

Mahesh: Not immediately. I need to keep the job, as I’ve EMIs to pay. I’m 80-90 percent sure I’ll go with digital marketing as the industry to reskill in, but in the next 2-3 weeks I’ll take more opinions on other options, after all I wouldn’t want to change the industry again. And once I finalize the industry, I’ll explore different options to reskill while keeping my current job.

Rohit: Sounds like a plan. If you need I can put you in touch with few friends who can help you finalize your future industry.

Mahesh: That will be awesome. Thanks so much.

Rohit: You’re welcome."""

In [34]:
print(pipe(sample_text, **gen_kwargs)[0]["summary_text"])

Mahesh's job in the telecom industry is going through a rough patch because of falling prices and shrinking margins. Mahesh has been thinking of switching to an industry that has at least few decades of growth left. Mahesh has been leaning toward digital marketing because in that industry he can carry over some of his skills from the current job.


In [35]:
sample_text="""summarize:Rohit: How is your preparation for the exam going on?

Mahesh: Not too bad, overall. I’m worried about English and chemistry, though. How is yours going on?

Rohit: Mine is alright. I’m also finding chemistry to be bit challenging because of its vast syllabus and too much memorization in organic chemistry.

Mahesh: Organic chemistry has been a problem for me too. Can we study chemistry together, at least the organic part?

Rohit: Sure. I think it’s a good idea. Can you help me with English though?

Mahesh: Yes, I can. Where exactly in English you’re facing problem?

Rohit: Thanks. Prepositions and reading comprehension are the main problem areas for me.

Mahesh: As far as prepositions are concerned, I can help you in understanding the rules. But for reading comprehension, you need to put in lots of practice to get better at it.

Rohit: OK. Will do. How’s your preparation going on for other subjects?

Mahesh: Other subjects are more or less on track. Economics, however, seems to have an inexhaustible syllabus and I don’t think I’ll have enough time to revise the subject.

Rohit: Thankfully, I don’t have Economics. But, yes, I’m also struggling to get enough time for revision. Anyway, we’ve to manage in whatever time we have.

Mahesh: That’s right. OK, enough of talk. Let’s get back to study. All the best for your next exam.

Rohit: Thanks. All the best to you as well."""

In [36]:
print(pipe(sample_text, **gen_kwargs)[0]["summary_text"])

Mahesh and Rohit are preparing for the exam together. Mahesh will help Rohit with English. He will help him with prepositions and reading comprehension. Mahesh will help Rohit with Economics.


In [37]:
sample_text="""summarize:"""

In [38]:
print(pipe(sample_text, **gen_kwargs)[0]["summary_text"])

Your max_length is set to 128, but your input_length is only 3. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=1)


summarize what you need to know about the situation in which you are currently living and what you need to do to make sure that you are prepared for the situation in which you are currently living.


## Uploading to Hugging Face

In [46]:
from transformers import PegasusForConditionalGeneration, PegasusTokenizer

# Load your fine-tuned model and tokenizer
model_pegasus = PegasusForConditionalGeneration.from_pretrained("/content/pegasus-samsum-model")

tokenizer = PegasusTokenizer.from_pretrained("/content/tokenizer")

# Upload the model and tokenizer directly to the Hugging Face model hub
model_pegasus.push_to_hub("rishitau/pegasus-samsum-model", use_auth_token="")
tokenizer.push_to_hub("rishitau/pegasus-samsum-model", use_auth_token="")


Non-default generation parameters: {'max_length': 128, 'min_length': 32, 'num_beams': 8, 'length_penalty': 0.8, 'forced_eos_token_id': 1}


CommitInfo(commit_url='https://huggingface.co/rishitau/pegasus-samsum-model/commit/b4ae878bd0162c6fe00dd2362e9cd3f690834dbe', commit_message='Upload tokenizer', commit_description='', oid='b4ae878bd0162c6fe00dd2362e9cd3f690834dbe', pr_url=None, pr_revision=None, pr_num=None)

## Reloading the model from Hugging Face

In [47]:
from transformers import PegasusForConditionalGeneration, PegasusTokenizer

# Load your model and tokenizer from the Hugging Face model hub
model_name = "rishitau/pegasus-samsum-model"

model = PegasusForConditionalGeneration.from_pretrained(model_name)
tokenizer = PegasusTokenizer.from_pretrained(model_name)



In [50]:
# Define your input text
input_text = """summarize:Jeff, I’m going to the supermarket. Do you want to come with me?
I think the supermarket is closed now.
Oh, When does it close?
It closes at 7:00 on Sundays.
That’s too bad.
Don’t worry, we can go tomorrow morning. It opens at 8:00.
Alright. What do you want to do now?
Lets take a walk for a half an hour. My sister will get here at about 8:30PM and then we can all go out to dinner.
Where does she live?
She lives in San Francisco.
How long has she lived there?
I think she’s lived there for about 10 years.
That’s a long time. Where did she live before that?
San Diego."""

# Tokenize the input text
input_ids = tokenizer.encode(input_text, return_tensors="pt", max_length=1024, truncation=True)

# Generate summary
summary_ids = model.generate(input_ids)

# Decode the summary
summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)

print("Input text:", input_text)
print("\n")
print("Generated summary:", summary)


Input text: summarize:Jeff, I’m going to the supermarket. Do you want to come with me? 
I think the supermarket is closed now.
Oh, When does it close?
It closes at 7:00 on Sundays.
That’s too bad. 
Don’t worry, we can go tomorrow morning. It opens at 8:00.
Alright. What do you want to do now? 
Lets take a walk for a half an hour. My sister will get here at about 8:30PM and then we can all go out to dinner.
Where does she live? 
She lives in San Francisco.
How long has she lived there? 
I think she’s lived there for about 10 years.
That’s a long time. Where did she live before that? 
San Diego.


Generated summary: Jeff and his sister will go to the supermarket tomorrow morning. They will take a walk for half an hour and then go out to dinner. Jeff's sister lives in San Francisco.
