
In [None]:
# 1 - Import required libraries
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

In [None]:
# 2 - Initialize tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("t5-base")
model = AutoModelForSeq2SeqLM.from_pretrained("t5-base")

In [None]:
def summarize_text(text):
  # Step 2: Tokenize the input text
  # Prepend T5's "summarize: " task prefix plus an extra directive ("Focus on key impacts and industries:").
  # T5 was trained with the task prefix; the extra directive is treated as ordinary input text and may have little effect.
  # Tokenize the text, convert it to PyTorch tensors, and truncate it to a maximum length of 512 tokens if necessary.
  inputs = tokenizer.encode(
      "summarize: Focus on key impacts and industries: " + text,
      return_tensors="pt",
      max_length=512,
      truncation=True
  )

  # Step 3: Generate a summary
  # Generate a summary from the model using specific parameters:
  # max_length: Cap the summary at 40 tokens to keep it concise.
  # min_length: Ensure the summary is at least 10 tokens long.
  # length_penalty: Exponent applied to the sequence length when scoring beams;
  #   values above 0 favor longer outputs, values below 0 favor shorter ones, so 3.5 leans toward longer summaries within max_length.
  # num_beams: Use beam search with 5 beams for higher-quality text generation.
  # early_stopping: Stop beam search as soon as num_beams finished candidates are available.
  outputs = model.generate(
      inputs,
      max_length=40,
      min_length=10,
      length_penalty=3.5,
      num_beams=5,
      early_stopping=True
  )

  # Step 4: Decode the model output
  # Decode the generated tokens back into human-readable text and remove special tokens (e.g., <pad>, </s>).
  summary = tokenizer.decode(outputs[0], skip_special_tokens=True)

  # Step 5: Post-process the summary
  # Deduplicate sentences in the summary to improve clarity and readability.
  unique_sentences = list(dict.fromkeys(summary.split(". ")))
  return ". ".join(unique_sentences)

In [None]:
if __name__ == "__main__":
  # Step 6: Define the input text
  # Provide a sample text to summarize.
  sample_text = (
      "Artificial intelligence is a rapidly growing field that involves the creation of "
      "intelligent machines capable of performing tasks that typically require human intelligence. "
      "It is being used in various industries, including healthcare, finance, and transportation, "
      "to improve efficiency and solve complex problems."
  )

  # Step 7: Generate and print the summary
  # Call the summarize_text function with the sample text and print the original text and its summary.
  print("Original Text:")
  print(sample_text)
  print("\nSummary:")
  print(summarize_text(sample_text))

############
Explain the code lines below.
# 2 - Initialize tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("t5-base")
model = AutoModelForSeq2SeqLM.from_pretrained("t5-base")
Sure! These lines initialize the tokenizer and the model:

tokenizer = AutoTokenizer.from_pretrained("t5-base"): This line loads a pre-trained tokenizer for the "t5-base" model. The tokenizer is responsible for converting raw text into a format that the model can understand (e.g., breaking text into tokens and converting them into numerical IDs).
model = AutoModelForSeq2SeqLM.from_pretrained("t5-base"): This line loads the pre-trained "t5-base" model. AutoModelForSeq2SeqLM indicates that it's a sequence-to-sequence language model, which is suitable for tasks like summarization, translation, etc.
############
Why has "t5-base" been provided as a parameter in the code line below? What if I don't provide a parameter? tokenizer = AutoTokenizer.from_pretrained("t5-base")
The parameter "t5-base" specifies the particular pre-trained model whose tokenizer you want to load. Different models have different tokenization schemes, vocabulary, and special tokens, so it's crucial to use the tokenizer that corresponds to the model you intend to use.

If you don't provide a parameter (e.g., AutoTokenizer.from_pretrained()), the call raises a TypeError because the required pretrained_model_name_or_path argument is missing, so the function has no way of knowing which tokenizer to load. The argument must be either a model identifier on the Hugging Face Hub (such as "t5-base") or a path to a local directory containing the tokenizer's configuration and vocabulary files.

############
show me the list of all possible pre-trained models and their respective tokenizers, in table format.
A comprehensive list of all possible pre-trained models and their respective tokenizers is extensive and constantly updated on the Hugging Face Model Hub. It's not practical to list them all in a static table here.

However, you can easily explore all available models and their tokenizers directly on the Hugging Face Model Hub:

Go to [redacted link]
You can use the filters on the left sidebar to narrow down your search, for example, by task (like 'summarization' or 'text-generation'), library ('Transformers'), or language.
Each model page will specify its identifier (e.g., "t5-base", "bert-base-uncased") which you can then use with AutoTokenizer.from_pretrained() and AutoModelForSeq2SeqLM.from_pretrained() (or AutoModel.from_pretrained() depending on the model architecture).
This method ensures you always get the most current and complete list of available models and their documentation.

############
explain method summarize_text
Certainly! The summarize_text function takes an input text and generates a concise summary using the pre-trained T5 model. Here's a breakdown of what each step within the function does:

Step 2: Tokenize the input text

inputs = tokenizer.encode(...): This line prepends the prompt "summarize: Focus on key impacts and industries: " to the input text and uses the tokenizer to convert the combined text into numerical input IDs. "summarize: " is the task prefix T5 was trained with for summarization; the rest of the directive is ordinary input text, and since t5-base is not instruction-tuned, it may have little effect. return_tensors="pt" returns PyTorch tensors, and max_length=512 with truncation=True cuts the input off at 512 tokens, the tokenizer's configured maximum for t5-base.
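One way to check whether truncation would actually cut anything off is to compare the untruncated token count with the limit. The helper below is a minimal sketch: input_token_report is a hypothetical name, and it assumes only that the tokenizer object has an encode(text) method returning a list of token IDs, as the Hugging Face tokenizer loaded in this notebook does.

```python
def input_token_report(tokenizer, text, prefix="summarize: ", max_length=512):
    """Return (tokens_seen_by_model, was_truncated) for prefix + text.

    Hypothetical helper: `tokenizer` is any object whose encode(str)
    returns a list of token IDs.
    """
    # Encode without truncation to learn the full length of the input.
    full_ids = tokenizer.encode(prefix + text)
    was_truncated = len(full_ids) > max_length
    # With truncation enabled, the model would see at most max_length tokens.
    return min(len(full_ids), max_length), was_truncated
```

Calling input_token_report(tokenizer, sample_text) with the notebook's tokenizer should report no truncation for the short sample paragraph. For documents beyond the limit, the text past the cut-off never reaches the model, so it cannot appear in the summary.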
Step 3: Generate a summary

outputs = model.generate(...): This is where the actual summary generation happens. The model takes the tokenized inputs and generates a summary based on several parameters:
max_length=40: The summary will have a maximum of 40 tokens.
min_length=10: The summary will have a minimum of 10 tokens.
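length_penalty=3.5: An exponent applied to the sequence length when beam candidates are scored. Values above 0 favor longer sequences and values below 0 favor shorter ones, so 3.5 pushes toward longer summaries (still capped by max_length) rather than encouraging brevity.
num_beams=5: Uses beam search with 5 beams, which keeps the five most probable partial summaries at each step instead of committing to a single one.
early_stopping=True: Beam search stops as soon as num_beams complete candidates have been found, rather than continuing to look for better ones.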
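The effect of length_penalty is easier to see with numbers. The sketch below is a simplified version of the scoring rule described in the Transformers documentation (total log-probability divided by length raised to length_penalty); the two candidate beams and their scores are invented for illustration, and the real beam search has more moving parts.

```python
def beam_score(sum_logprobs, length, length_penalty):
    """Length-normalized beam score: total log-probability / length ** penalty."""
    return sum_logprobs / (length ** length_penalty)

# Two made-up finished beams: (total log-probability, length in tokens).
short = (-8.0, 10)
long_ = (-12.0, 20)

def preferred(length_penalty):
    """Return which of the two made-up beams scores higher."""
    short_score = beam_score(*short, length_penalty)
    long_score = beam_score(*long_, length_penalty)
    return "short" if short_score > long_score else "long"

for lp in (-1.0, 0.0, 1.0, 3.5):
    print(lp, preferred(lp))
```

With these numbers the short beam wins for -1.0 and 0.0, and the long beam wins for 1.0 and 3.5. Log-probabilities are negative, so dividing by a larger power of the length moves long candidates closer to zero, which is why a positive penalty favors them.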
Step 4: Decode the model output

summary = tokenizer.decode(outputs[0], skip_special_tokens=True): After the model generates numerical outputs, this line uses the tokenizer to convert these numerical IDs back into human-readable text. outputs[0] refers to the first (and in this case, only) generated summary, and skip_special_tokens=True removes special tokens like "<pad>" or "</s>" (T5's end-of-sequence token) that are internal to the model.
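To make the role of skip_special_tokens concrete, here is a toy sketch of decoding. TOY_VOCAB and toy_decode are hypothetical stand-ins: T5's real tokenizer uses a much larger SentencePiece subword vocabulary, and only the IDs 0 for "<pad>" and 1 for "</s>" mirror T5's actual conventions.

```python
# Toy vocabulary for illustration only, not T5's real vocabulary.
TOY_VOCAB = {0: "<pad>", 1: "</s>", 2: "AI", 3: "improves", 4: "efficiency"}
SPECIAL_IDS = {0, 1}

def toy_decode(ids, skip_special_tokens=False):
    """Map token IDs back to text, optionally dropping special tokens."""
    kept = [i for i in ids if not (skip_special_tokens and i in SPECIAL_IDS)]
    return " ".join(TOY_VOCAB[i] for i in kept)

# T5 generation starts from the pad token and ends with </s>.
generated = [0, 2, 3, 4, 1]
print(toy_decode(generated))                            # <pad> AI improves efficiency </s>
print(toy_decode(generated, skip_special_tokens=True))  # AI improves efficiency
```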
Step 5: Post-process the summary

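unique_sentences = list(dict.fromkeys(summary.split(". "))) and return ". ".join(unique_sentences): These two lines aim to improve the summary's readability by removing duplicate sentences. The first splits the summary on ". " and uses dict.fromkeys() to keep only the first occurrence of each sentence while preserving order; the second joins the remaining sentences back into a single string. One limitation: the final sentence keeps its trailing period after the split, so a repeat of an earlier sentence in last position is not recognized as a duplicate.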
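A slightly more robust variant of the deduplication step is sketched below. dedupe_sentences is a hypothetical replacement, not part of the notebook; it compares sentences without their trailing punctuation and ignores case, so a repeated sentence is caught even when it is the last one. It still splits naively on punctuation, so abbreviations such as "e.g." would be treated as sentence ends.

```python
import re

def dedupe_sentences(summary):
    """Drop repeated sentences, keeping first occurrences in order."""
    # Split after sentence-ending punctuation that is followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", summary.strip())
    seen = set()
    unique = []
    for sentence in sentences:
        # Compare case-insensitively and without trailing punctuation.
        key = sentence.rstrip(".!?").strip().lower()
        if key and key not in seen:
            seen.add(key)
            unique.append(sentence)
    return " ".join(unique)

print(dedupe_sentences("AI is growing. It helps finance. AI is growing."))
# AI is growing. It helps finance.
```

For the same input, the notebook's split(". ") approach returns the text unchanged, because the last piece is "AI is growing." with its period and so differs from the first piece "AI is growing".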