
In [1]:
# Install the necessary libraries
!pip install transformers datasets

Collecting datasets
  Downloading datasets-3.0.1-py3-none-any.whl.metadata (20 kB)
Collecting dill<0.3.9,>=0.3.0 (from datasets)
  Downloading dill-0.3.8-py3-none-any.whl.metadata (10 kB)
Collecting xxhash (from datasets)
  Downloading xxhash-3.5.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (12 kB)
Collecting multiprocess (from datasets)
  Downloading multiprocess-0.70.17-py310-none-any.whl.metadata (7.2 kB)
INFO: pip is looking at multiple versions of multiprocess to determine which version is compatible with other requirements. This could take a while.
  Downloading multiprocess-0.70.16-py310-none-any.whl.metadata (7.2 kB)
Downloading datasets-3.0.1-py3-none-any.whl (471 kB)
Downloading dill-0.3.8-py3-none-any.whl (116 kB)

In [29]:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
import torch

# Define the model name and load the tokenizer and model
model_name = "google/flan-t5-small"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# Set up the device: use the GPU when one is available, otherwise the CPU
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)

# Print out which device we're using (GPU or CPU)
print(device)

cuda


In [30]:
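def summarize(text):
  # Tokenize the text (truncated to 512 tokens) and move the tensors to the model's device
  inputs = tokenizer(text, return_tensors="pt", max_length=512, truncation=True).to(device)
  # Generate up to 128 tokens with 4-beam search
  summary_ids = model.generate(inputs["input_ids"], max_length=128, num_beams=4, early_stopping=True)
  # Decode the best beam back into text, dropping special tokens such as padding
  return tokenizer.decode(summary_ids[0], skip_special_tokens=True)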

In [31]:
# Define a sample input text for summarization
input_text = """
Subject: Virtual Field Trip Collaboration
Hey Michael, I hope you're doing well! I've been thinking a lot about the ideas we discussed at the conference and I think I have an idea for our collaboration project. What if we created a virtual field trip that takes students on a journey through different cities around the world, exploring the impacts of climate change on urban environments? I think this could be a great way to combine our expertise in environmental and urban geography and create an engaging learning experience for our students. Let me know what you think! Best, Sarah
"""

# Summarize the sample text using the pre-trained model (without fine-tuning)
pre_finetuned_summary = summarize(input_text)
print("Summary before fine-tuning:", pre_finetuned_summary)

Summary before fine-tuning: Hi Michael, I hope you're doing well! I've been thinking a lot about the ideas we discussed at the conference and I think I have an idea for our collaboration project.


In [32]:
from datasets import load_dataset

# Load the argilla/FinePersonas-Conversations-Email-Summaries dataset, which contains emails and summaries
dataset = load_dataset("argilla/FinePersonas-Conversations-Email-Summaries", split="train")

In [33]:
# Split the dataset into training and testing subsets
dataset_split = dataset.train_test_split(test_size=0.1)

# Keep only 1% of the training split for faster iteration during development
small_train_dataset = dataset_split['train'].train_test_split(test_size=0.99)['train']
eval_dataset = dataset_split['test']

In [35]:
# Preprocess the datasets
def preprocess_function(examples):
  # Extract the emails (inputs) and their reference summaries (targets)
  inputs = examples['email']
  labels = examples['summary']

  # Tokenize the emails with padding and truncation to a max length of 512
  model_inputs = tokenizer(inputs, max_length=512, padding="max_length", truncation=True)

  # Tokenize the summaries as targets; text_target replaces the deprecated as_target_tokenizer()
  labels = tokenizer(text_target=labels, max_length=128, padding="max_length", truncation=True)

  # Return plain lists: map() stores them in Arrow, and the Trainer moves each batch to the device
  model_inputs["labels"] = labels["input_ids"]
  return model_inputs
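
Because the summaries are padded to a fixed length, the pad positions in the labels count toward the training loss. A common variation, which this notebook does not apply, is to replace pad token IDs in the labels with -100, the index that PyTorch's cross-entropy loss ignores by default. The sketch below shows the idea on plain Python lists. It assumes a pad ID of 0 (the value T5 tokenizers normally use; check `tokenizer.pad_token_id`), and `mask_label_padding` is a hypothetical helper name, not part of any library.

```python
IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss


def mask_label_padding(label_ids, pad_token_id=0):
    """Replace pad token IDs with IGNORE_INDEX in a batch of label sequences."""
    return [
        [IGNORE_INDEX if token_id == pad_token_id else token_id for token_id in sequence]
        for sequence in label_ids
    ]


# Two made-up label sequences, padded to length 6 with the assumed pad ID 0
batch = [[571, 18, 1, 0, 0, 0], [42, 7, 99, 3, 1, 0]]
masked = mask_label_padding(batch)
print(masked)
```

The same idea would apply to the tokenized summaries before they are stored under `"labels"`. Whether it changes the final summaries noticeably for this model is untested here.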

In [36]:
# Small training dataset tokenized
tokenized_train_dataset = small_train_dataset.map(preprocess_function, batched=True)

# Evaluation dataset tokenized
tokenized_eval_dataset = eval_dataset.map(preprocess_function, batched=True)

# # Optional: return the tokenized columns as PyTorch tensors. This does not move
# # data to the GPU; the Trainer handles device placement for each batch.
# tokenized_train_dataset.set_format(type='torch', columns=['input_ids', 'attention_mask', 'labels'])
# tokenized_eval_dataset.set_format(type='torch', columns=['input_ids', 'attention_mask', 'labels'])

# # Print the first example from the tokenized training dataset to verify preprocessing
# print(tokenized_train_dataset[0])

Map:   0%|          | 0/3272 [00:00<?, ? examples/s]



Map:   0%|          | 0/36359 [00:00<?, ? examples/s]
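
The two progress bars show that the evaluation split (36359 examples) is far larger than the reduced training split (3272 examples). The short calculation below puts a number on that imbalance. It assumes the evaluation batch size of 8 set in the training arguments and a single device; the variable names are illustrative only.

```python
import math

train_examples = 3272   # total shown by the first Map progress bar
eval_examples = 36359   # total shown by the second Map progress bar
eval_batch_size = 8     # per_device_eval_batch_size, assuming one device

# How many times larger the evaluation split is than the training split
eval_to_train_ratio = eval_examples / train_examples

# Number of forward passes needed for one full evaluation run
eval_batches_per_pass = math.ceil(eval_examples / eval_batch_size)

print(f"eval/train ratio: {eval_to_train_ratio:.1f}")
print(f"eval batches per pass: {eval_batches_per_pass}")
```

If evaluating every epoch turns out to dominate the run, evaluating on a subset (for example one taken with `Dataset.select`) is one option to consider.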

In [37]:
from transformers import Seq2SeqTrainingArguments

# Define training arguments
training_args = Seq2SeqTrainingArguments(
    output_dir='./results',
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    weight_decay=0.01,
    save_total_limit=3,
    num_train_epochs=3,
    predict_with_generate=True,
    logging_dir="./logs"
)



In [38]:
from transformers import Seq2SeqTrainer

# Create the trainer object
trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train_dataset,
    eval_dataset=tokenized_eval_dataset,
    tokenizer=tokenizer
)

trainer.train()

Epoch  Training Loss  Validation Loss
1      No log         1.71444
2      4.542500       0.938695
3      1.441900       0.814255


TrainOutput(global_step=1227, training_loss=2.650965783104434, metrics={'train_runtime': 2212.3703, 'train_samples_per_second': 4.437, 'train_steps_per_second': 0.555, 'total_flos': 1824701194174464.0, 'train_loss': 2.650965783104434, 'epoch': 3.0})
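
The figures in `TrainOutput` can be cross-checked against the dataset size and training arguments. The sketch below assumes a single device, no gradient accumulation, and the `Trainer` default of logging the training loss every 500 steps (`logging_steps` is not overridden in this notebook). Under those assumptions it also suggests why epoch 1 shows "No log".

```python
import math

train_examples = 3272       # size of the reduced training split
batch_size = 8              # per_device_train_batch_size
epochs = 3                  # num_train_epochs
train_runtime = 2212.3703   # seconds, from TrainOutput
logging_steps = 500         # assumed Trainer default

steps_per_epoch = math.ceil(train_examples / batch_size)
total_steps = steps_per_epoch * epochs

# Throughput figures as reported in the metrics dictionary
samples_per_second = train_examples * epochs / train_runtime
steps_per_second = total_steps / train_runtime

# Epoch in which the first training-loss log (step 500) falls
first_logged_epoch = math.ceil(logging_steps / steps_per_epoch)

print(steps_per_epoch, total_steps)
print(round(samples_per_second, 3), round(steps_per_second, 3))
print(first_logged_epoch)
```

With 409 steps per epoch, the first training loss is logged at step 500, which falls in epoch 2, so no training loss is available yet when epoch 1 is evaluated.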

In [43]:
model.save_pretrained("./results")
tokenizer.save_pretrained("./results")

('./results/tokenizer_config.json',
 './results/special_tokens_map.json',
 './results/spiece.model',
 './results/added_tokens.json',
 './results/tokenizer.json')

In [39]:
def summarize(text):
  # Tokenize the input text and move it to the correct device
  inputs = tokenizer(text, return_tensors="pt", max_length=512, truncation=True).to(device)

  # Generate the summary using the fine-tuned model
  summary_ids = model.generate(inputs["input_ids"], max_length=128, num_beams=4, early_stopping=True)

  # Decode the generated summary back into text and return it
  return tokenizer.decode(summary_ids[0], skip_special_tokens=True)

# Summarize the same sample email, this time with the fine-tuned model
input_text = """
Subject: Virtual Field Trip Collaboration
Hey Michael, I hope you're doing well! I've been thinking a lot about the ideas we discussed at the conference and I think I have an idea for our collaboration project. What if we created a virtual field trip that takes students on a journey through different cities around the world, exploring the impacts of climate change on urban environments? I think this could be a great way to combine our expertise in environmental and urban geography and create an engaging learning experience for our students. Let me know what you think! Best, Sarah
"""
print(summarize(input_text))


Sarah suggests creating a virtual field trip that takes students on a journey through different cities around the world exploring the impacts of climate change on urban environments. Sarah suggests creating a virtual field trip that takes students on a journey through different cities around the world, exploring the impacts of climate change on urban environments. Sarah suggests creating a virtual field trip that takes students on a journey through different cities around the world, exploring the impacts of climate change on urban environments.
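
The fine-tuned summary above repeats nearly the same sentence three times. One simple way to quantify that kind of looping is the share of word n-grams that duplicate an earlier n-gram. The sketch below is a pure-Python illustration; `repeated_ngram_ratio` is a hypothetical helper, and the sample sentences are shortened stand-ins rather than actual model output.

```python
def repeated_ngram_ratio(text, n=3):
    """Return the fraction of word n-grams that duplicate an earlier n-gram."""
    words = text.lower().split()
    ngrams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    if not ngrams:
        return 0.0
    return 1 - len(set(ngrams)) / len(ngrams)


# A single sentence versus the same sentence repeated three times
clean = "Sarah suggests creating a virtual field trip about climate change."
looped = " ".join([clean] * 3)

print(repeated_ngram_ratio(clean))
print(round(repeated_ngram_ratio(looped), 2))
```

On the generation side, `model.generate` accepts `no_repeat_ngram_size` and `repetition_penalty`, which are commonly used to suppress such loops. Whether they help for this particular model has not been tested here.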


In [40]:
!pip install huggingface_hub



In [41]:
from huggingface_hub import notebook_login

notebook_login()


In [44]:
model.save_pretrained("./flan-t5-finetuned")
tokenizer.save_pretrained("./flan-t5-finetuned")

('./flan-t5-finetuned/tokenizer_config.json',
 './flan-t5-finetuned/special_tokens_map.json',
 './flan-t5-finetuned/spiece.model',
 './flan-t5-finetuned/added_tokens.json',
 './flan-t5-finetuned/tokenizer.json')

In [45]:
model.push_to_hub("sahilc/email-summary-gen")
tokenizer.push_to_hub("sahilc/email-summary-gen")

README.md:   0%|          | 0.00/28.0 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/308M [00:00<?, ?B/s]

spiece.model:   0%|          | 0.00/792k [00:00<?, ?B/s]

CommitInfo(commit_url='https://huggingface.co/sahilc/email-summary-gen/commit/dedd70cbd30fde60b53826fe273280f9075eb57b', commit_message='Upload tokenizer', commit_description='', oid='dedd70cbd30fde60b53826fe273280f9075eb57b', pr_url=None, pr_revision=None, pr_num=None)

In [52]:
from transformers import pipeline

# Load the model from Hugging Face Hub
summarizer = pipeline("summarization", model="sahilc/email-summary-gen")

# Example usage
text = """
Dear Walmart Sales Team,

I hope this email finds you well. I am writing to place an order for several items required for an upcoming office event. We have compiled a list of 15 products that we would like to purchase, ranging from office supplies to tech equipment. Each item is specified with the desired quantity and model details to ensure accuracy in fulfilling the order.

The first set of items includes printers, ink cartridges, and other related office essentials. We need an HP OfficeJet Pro 8025e and a Canon PIXMA TR4520 printer, along with the corresponding ink cartridges for each. Additionally, we require several packs of Post-it notes, Sharpie markers, and a set of Epson EcoTank ink for our office machines. These will help us stay prepared for daily tasks.

Furthermore, we are looking to purchase two Acer Aspire 5 Slim Laptops and one Microsoft Surface Pro 7 for our team members. Along with these, we will need three Logitech MK270 wireless keyboard and mouse combos for smoother operations. A paper shredder and reams of copy paper are also on the list to ensure we maintain confidentiality and have sufficient resources for printing.

The final few items include durable packaging tape for shipping, Brother toner cartridges, and Duracell AA batteries. These essentials will help us manage office logistics effectively, ensuring smooth workflow during the busy event season.

Please confirm the availability of all items mentioned and provide us with a quote, including taxes, shipping fees, and an estimated delivery date. If any items are out of stock, we would appreciate recommendations for suitable substitutes. Kindly inform us if there are any discounts or promotions applicable to our bulk purchase.

We look forward to hearing from you and appreciate your timely response. Should you need further clarification, feel free to reach out to me directly. Thank you for your assistance in processing this order.

Best regards,
John Doe
"""
summary = summarizer(text)
print(summary)

Hardware accelerator e.g. GPU is available in the environment, but no `device` argument is passed to the `Pipeline` object. Model will be on CPU.


[{'summary_text': 'John Doe is writing to place an order for 15 products for an upcoming office event. He needs HP OfficeJet Pro 8025e and Canon PIXMA TR4520 printer, along with the corresponding ink cartridges for each. Additionally, he needs several packs of Post-it notes, Sharpie markers, and a set of Epson EcoTank ink for office machines. He also needs three Logitech MK270 wireless keyboard and mouse combos for smoother operations.'}]
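
The warning printed above says the pipeline ran on the CPU because no `device` argument was passed. A minimal sketch of passing one follows, assuming the integer convention of `0` for the first GPU and `-1` for the CPU. `pick_pipeline_device` and `load_summarizer` are hypothetical helper names, and `load_summarizer` is only defined here, not called, since calling it downloads the model.

```python
def pick_pipeline_device():
    """Return 0 (first GPU) when CUDA is usable, otherwise -1 (CPU)."""
    try:
        import torch
    except ImportError:
        return -1
    return 0 if torch.cuda.is_available() else -1


def load_summarizer(model_id="sahilc/email-summary-gen"):
    """Build the summarization pipeline on the selected device."""
    # Imported lazily so the device helper can be used without transformers installed
    from transformers import pipeline
    return pipeline("summarization", model=model_id, device=pick_pipeline_device())


print(pick_pipeline_device())
```

Calling `summarizer = load_summarizer()` and then `summarizer(text)` would mirror the cell above, with the model placed on the GPU when one is available.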
