<a href="https://colab.research.google.com/github/rahulgundala007/NLP_text_summarization/blob/main/GPT_2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
pip install transformers datasets torch




In [2]:
pip install transformers[torch]



In [3]:
pip install accelerate -U



In [4]:
from transformers import GPT2Tokenizer, GPT2LMHeadModel, Trainer, TrainingArguments
from datasets import load_dataset
import torch

# Load the dataset
dataset = load_dataset("Rahulgundala007/ToSdataset2")  # Update the dataset name

# Initialize tokenizer and model
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # Set the eos_token as the pad_token
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.config.pad_token_id = tokenizer.pad_token_id  # Update the model's pad_token_id

# Preprocess the dataset: concatenate the input and the summary with a special token in between
def preprocess_data(examples):
    input_texts = ["summarize: " + doc for doc in examples['ToS_Detail']]
    target_texts = [doc + tokenizer.eos_token for doc in examples['ToS_Summary']]
    model_inputs = tokenizer(input_texts, max_length=1024, truncation=True, padding="max_length")
    with tokenizer.as_target_tokenizer():
        labels = tokenizer(target_texts, max_length=1024, truncation=True, padding="max_length")
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

# Tokenize the dataset
tokenized_dataset = dataset.map(preprocess_data, batched=True)

# Define training arguments
training_args = TrainingArguments(
    output_dir="./results",
    evaluation_strategy="epoch",
    learning_rate=5e-5,
    per_device_train_batch_size=2,
    weight_decay=0.01,
    save_total_limit=3,
    num_train_epochs=3,
    logging_dir='./logs',
    logging_steps=10,
)

# Initialize the Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset['train'],
    eval_dataset=tokenized_dataset['test']
)

# Start training
trainer.train()


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


Epoch,Training Loss,Validation Loss
1,0.6091,0.262968
2,0.6682,0.250444
3,0.7634,0.245041


TrainOutput(global_step=99, training_loss=1.004616419474284, metrics={'train_runtime': 73.7097, 'train_samples_per_second': 2.646, 'train_steps_per_second': 1.343, 'total_flos': 101903892480000.0, 'train_loss': 1.004616419474284, 'epoch': 3.0})

In [5]:
# Save the model and tokenizer manually
model_path = "/content/drive/MyDrive/SavedModel/GPT-2"
tokenizer_path = "/content/drive/MyDrive/SavedModel/GPT-2"

model.save_pretrained(model_path)
tokenizer.save_pretrained(tokenizer_path)

('/content/drive/MyDrive/SavedModel/GPT-2/tokenizer_config.json',
 '/content/drive/MyDrive/SavedModel/GPT-2/special_tokens_map.json',
 '/content/drive/MyDrive/SavedModel/GPT-2/vocab.json',
 '/content/drive/MyDrive/SavedModel/GPT-2/merges.txt',
 '/content/drive/MyDrive/SavedModel/GPT-2/added_tokens.json')

In [7]:
pip install PyPDF2 transformers sacrebleu


Collecting PyPDF2
  Downloading pypdf2-3.0.1-py3-none-any.whl (232 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m232.6/232.6 kB[0m [31m3.6 MB/s[0m eta [36m0:00:00[0m
Collecting sacrebleu
  Downloading sacrebleu-2.4.2-py3-none-any.whl (106 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m106.7/106.7 kB[0m [31m14.8 MB/s[0m eta [36m0:00:00[0m
Collecting portalocker (from sacrebleu)
  Downloading portalocker-2.8.2-py3-none-any.whl (17 kB)
Collecting colorama (from sacrebleu)
  Downloading colorama-0.4.6-py2.py3-none-any.whl (25 kB)
Installing collected packages: PyPDF2, portalocker, colorama, sacrebleu
Successfully installed PyPDF2-3.0.1 colorama-0.4.6 portalocker-2.8.2 sacrebleu-2.4.2


In [8]:
import sacrebleu

def compute_bleu(generated_summary, reference_summary):
    bleu_score = sacrebleu.corpus_bleu([generated_summary], [[reference_summary]])
    return bleu_score.score

In [9]:
pdf_path = '/content/drive/MyDrive/SavedModel/ToS_document/samsung.pdf'
pdf_text = extract_text_from_pdf(pdf_path)
generated_summary = generate_summary(pdf_text)
print("Generated Summary:", generated_summary)

# You must provide this
reference_summary = "Certainly! Here's the text without quotes: The following are the terms and conditions that govern the sale and performance of Samsung Business Services, which include the provision of services directly from Samsung and through authorized resellers. The term Services refers to each component of the Samsung enterprise service offering, as described in a service guide (each, a service order), with specific terms for each component. These terms and conditions apply to all enterprise purchases of services from Samsung and reseller purchases of services through Samsung's authorized reseller, with final pricing and sales terms between the reseller and the purchaser determined by the seller. The Services and Deliverables are intended for internal business use unless permitted by Samsung in writing. Here are the key points to keep in mind: 1. Service Description: This section describes the services and deliverables offered by Samsung. 2. Payment Terms: These terms are determined by Samsung, with specific payment terms for services purchased directly from the company. 3. Availability: The terms used in this section are specific to the service guide or Order, with the meanings defined by the specific Order. 4. Service Availability: Each Samsung Service Guide or Order describes: (i) Available service, and (ii) Deliverables. 5. Terms Not Specified: Capitalized terms used but not defined in the terms and Conditions shall have the meanings as set forth in the applicable Order. 6. Service Terms: Here's a summary of the key terms used within the terms, including service descriptions, payment terms, and terms for service purchases, including reseller terms and terms applicable to Samsung's service offerings. 7. Business Services: The sale and performance of Samsung Services is governed by these terms, which are agreed between you and Samsung in your own capacity, between the entity for whose benefit you act, and Samsung Electronics America, Inc. (Samsung), which is responsible for service provision and service management. 8. Business Use: The Services are meant for internal use, with limited external use for external use. 9. Payment Terms: The payment terms that apply to you if Services are purchased through a reseller differ from those for directly purchased from Samsung. 10. Services Availability: Includes the terms for individual purchases, with detailed descriptions of the services provided by Samsung within the Service Guide and Order. 11. Services: Includes a detailed description of each component, including pricing, delivery, and use terms, applicable to all users, within the scope of the service agreement."
bleu_score = compute_bleu(generated_summary, reference_summary)
print(f"BLEU score: {bleu_score}")


NameError: name 'extract_text_from_pdf' is not defined