Exploring Hugging Face Pretrained Models for Text Generation, Summarization & Translation

## Installing Dependencies

In [None]:
!pip install transformers   # Installs the Hugging Face Transformers library



In [None]:
!pip install transformers datasets

# This will install the transformers and datasets libraries which are commonly used for working with LLMs on Hugging Face.

Collecting datasets
  Downloading datasets-2.15.0-py3-none-any.whl (521 kB)
Collecting pyarrow-hotfix (from datasets)
  Downloading pyarrow_hotfix-0.6-py3-none-any.whl (7.9 kB)
Collecting dill<0.3.8,>=0.3.0 (from datasets)
  Downloading dill-0.3.7-py3-none-any.whl (115 kB)
Collecting multiprocess (from datasets)
  Downloading multiprocess-0.70.15-py310-none-any.whl (134 kB)
Installing collected packages: pyarrow-hotfix, dill, multiprocess, datasets
Successfully installed datasets-2.15.0 dill-0.3.7 multiprocess-0.70.15 pyarrow-hotfix-0.6


After installing the dependencies, restart the runtime before continuing.

# **1. Text Generation**

## Text Generation using GPT-2 Model

- Let's use the example code from the GPT-2 model card on Hugging Face to explore how the model works.

- You can use this model directly with a pipeline for text generation. Since the generation relies on some randomness, we set a seed for reproducibility:


In [None]:
from transformers import pipeline, set_seed

generator = pipeline('text-generation', model='gpt2')
set_seed(40)

generator("Hello, I'm a language model,", max_length=30, num_return_sequences=5)


config.json:   0%|          | 0.00/665 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/548M [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/124 [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/1.04M [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.36M [00:00<?, ?B/s]

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': "Hello, I'm a language model, so it's good to know which syntax it means—but in general I think these things don't match up"},
 {'generated_text': "Hello, I'm a language model, I'm not a model of any language like this.\n\nBut, that's the point. We can"},
 {'generated_text': "Hello, I'm a language model, I'm a language model, please listen to my brain so I can write some code so I can do stuff"},
 {'generated_text': "Hello, I'm a language model, not a language model. I'm trying to learn some new languages when I get out of school. And I"},
 {'generated_text': "Hello, I'm a language model, and I've always wanted to help out the community with some of these types of topics and learn more from them"}]
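
The effect of the seed can be illustrated without loading a model. The sketch below is a toy stand-in, not GPT-2: the word list and weights are invented, and `sample_continuations` is a hypothetical helper. It only shows that fixing the seed before sampling makes the random draws repeatable, which is the role `set_seed` plays above.

```python
import random

# Toy next-word distribution; the words and weights are made up for illustration.
NEXT_WORDS = ["and", "so", "but", "not", "please"]
WEIGHTS = [0.35, 0.25, 0.2, 0.1, 0.1]

def sample_continuations(seed, num_return_sequences=5):
    """Draw `num_return_sequences` words after fixing the random seed."""
    # A private generator stands in for the global seeding done by set_seed.
    rng = random.Random(seed)
    return rng.choices(NEXT_WORDS, weights=WEIGHTS, k=num_return_sequences)

first = sample_continuations(seed=40)
second = sample_continuations(seed=40)
print(first == second)  # True: same seed, same draws
```

Re-running the generation cell with the same seed, model, and library versions should likewise reproduce the same samples.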

## Limitations and bias

In [None]:
from transformers import pipeline, set_seed
generator = pipeline('text-generation', model='gpt2')
set_seed(42)
generator("The White man worked as a", max_length=10, num_return_sequences=10)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'The White man worked as a clerk for the bank'},
 {'generated_text': 'The White man worked as a taxi driver for 25'},
 {'generated_text': 'The White man worked as a construction worker for 18'},
 {'generated_text': 'The White man worked as a lab technician for a'},
 {'generated_text': 'The White man worked as a carpenter in his'},
 {'generated_text': 'The White man worked as a "blacksmith"'},
 {'generated_text': 'The White man worked as a "liver-'},
 {'generated_text': 'The White man worked as a banker, with his'},
 {'generated_text': 'The White man worked as a trucker for a'},
 {'generated_text': 'The White man worked as a bartender for several years'}]

In [None]:
generator("The Black man worked as a", max_length=10, num_return_sequences=10)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'The Black man worked as a librarian for the'},
 {'generated_text': 'The Black man worked as a waitress at a place'},
 {'generated_text': 'The Black man worked as a clerk in the local'},
 {'generated_text': 'The Black man worked as a cook at the factory'},
 {'generated_text': 'The Black man worked as a maid in a church'},
 {'generated_text': 'The Black man worked as a carpenter, when'},
 {'generated_text': 'The Black man worked as a security guard for the'},
 {'generated_text': 'The Black man worked as a waitress at a hotel'},
 {'generated_text': 'The Black man worked as a waiter and barista'},
 {'generated_text': 'The Black man worked as a security guard and told'}]
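
To compare the two sets of outputs side by side, it helps to drop the echoed prompt and keep only what the model added. The helper below is a minimal sketch; `continuations` is a hypothetical name, and the two inputs are single samples copied from the outputs above. Ten samples per prompt are far too few to measure bias, so this only makes the comparison easier to read.

```python
def continuations(outputs, prompt):
    """Return the text each sample added after the prompt."""
    results = []
    for item in outputs:
        text = item["generated_text"]
        # The pipeline echoes the prompt, so drop it when present.
        if text.startswith(prompt):
            text = text[len(prompt):]
        results.append(text.strip())
    return results

# One sample from each run above, in the pipeline's output format.
white = [{"generated_text": "The White man worked as a clerk for the bank"}]
black = [{"generated_text": "The Black man worked as a librarian for the"}]

print(continuations(white, "The White man worked as a"))  # ['clerk for the bank']
print(continuations(black, "The Black man worked as a"))  # ['librarian for the']
```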

## Text Generation using GPT2-XL Model

**Important**
- Do not run the GPT2-XL cell if your system has less than 12 GB of RAM.
- The GPT2-XL weights alone are about 6.4 GB (the size of the `model.safetensors` download).


In [None]:
# Let's use the GPT2-XL model and see the difference in output

In [None]:
from transformers import pipeline, set_seed
generator = pipeline('text-generation', model='gpt2-xl')
set_seed(42)
generator("Hello, I'm a language model,", max_length=30, num_return_sequences=5)


config.json:   0%|          | 0.00/689 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/6.43G [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/124 [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/1.04M [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.36M [00:00<?, ?B/s]

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': "Hello, I'm a language model, but what I do you need to know isn't that hard. But if you want to understand us, you"},
 {'generated_text': "Hello, I'm a language model, this is my first commit and I'd like to get some feedback to see if I understand this commit.\n"},
 {'generated_text': "Hello, I'm a language model, and I'll guide you on your journey!\n\nLet's get to it.\n\nBefore we start"},
 {'generated_text': 'Hello, I\'m a language model, not a developer." If everything you\'re learning about code is through books, you\'ll never get to know about'},
 {'generated_text': 'Hello, I\'m a language model, please tell me what you think!" – I started out on this track, and now I am doing a lot'}]

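**Comparing the GPT-2 and GPT2-XL results**

- Both models generate mostly grammatical text, although individual samples can drift or contradict themselves (for example, "I'm a language model, not a language model").
- Both runs are capped at `max_length=30`, so the samples are of similar length; the difference is in content rather than size.
- The GPT2-XL samples appear to stay closer to the prompt and add more relevant context.
- The GPT-2 samples more often change the topic or introduce unrelated ideas.
- The two runs use different seeds (40 and 42), so this is an informal comparison rather than a controlled one.

***In this small sample the larger model gives more coherent results, at the cost of much more memory.***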

# Text Generation with GPT-2

In [None]:
# Import libraries
from transformers import pipeline

# Define model and prompt
model_name = "gpt2"
prompt = "Once upon a time, in a land far, far away..."

# Create text generation pipeline
generator = pipeline("text-generation", model=model_name)

# Generate text
generated_text = generator(prompt, max_length=500)

# Print the generated text
print(generated_text)


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': "Once upon a time, in a land far, far away... a small town is besieged with enemies from all sides. Who is standing by the side? And with a single bullet...the bullet is directed downwards. I'm guessing one's name is... [pause] Hmm... well, that's just it... but when you've been waiting for those bullets, you know exactly what you should do. Let's see if those bullets hurt you or not, then I'll let you know if we're able to get the location when we leave...\n\n\nOkay, first, you're going to need to know how much space the tower will fill... and it's gonna cost two to three tons...\n\n\nI'll show you a simple way. Imagine the top and bottom are completely flooded with water... and they're in a place with nothing but rocks and gravel. This could be covered with buildings for a few months, I think. After that though... you can see the tower itself being flooded with water... and you can't get inside the tower without it being flooded. I'll tell you to go in the tower t

# **2. Text Summarization & Translation**

#### **SentencePiece Library:**

- SentencePiece is a language-independent subword tokenizer library from Google. It splits raw text into subword pieces (using the BPE or unigram algorithms) and manages the resulting vocabulary, which keeps vocabularies small and lets a model handle rare words and multiple languages. The `T5Tokenizer` used below depends on it, which is why it is installed here.
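
To give a feel for subword segmentation, here is a toy sketch. It is not SentencePiece's algorithm (which learns its vocabulary from data with BPE or a unigram language model); it is a greedy longest-match split over a hand-written vocabulary. The function `segment` and the contents of `TOY_VOCAB` are assumptions made up for this illustration.

```python
def segment(word, vocab):
    """Greedy longest-match split of `word` into pieces from `vocab`."""
    pieces = []
    start = 0
    while start < len(word):
        # Try the longest candidate first, shrinking until one is in the vocab.
        for end in range(len(word), start, -1):
            if word[start:end] in vocab:
                pieces.append(word[start:end])
                start = end
                break
        else:
            # No known piece matched: fall back to a single character.
            pieces.append(word[start])
            start += 1
    return pieces

# Hand-written toy vocabulary (hypothetical, not a learned one).
TOY_VOCAB = {"summar", "ization", "translat", "ion"}

print(segment("summarization", TOY_VOCAB))  # ['summar', 'ization']
print(segment("translation", TOY_VOCAB))    # ['translat', 'ion']
```

The point of the sketch is that a word missing from the vocabulary can still be represented as a sequence of known pieces, so the model never sees a truly unknown token.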


In [None]:
!pip install sentencepiece

import sentencepiece

print(sentencepiece.__version__)


0.1.99


After installing SentencePiece, restart the runtime before continuing.

#### Flan-T5
- T5, short for "Text-to-Text Transfer Transformer," is a Transformer-based encoder-decoder language model developed by Google. It casts every NLP task (translation, summarization, classification, and so on) as text in, text out, with the task named in a prefix on the input. Flan-T5 is a version of T5 that has been further fine-tuned on a large collection of instruction-style tasks.
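
The text-to-text idea can be sketched as a small prompt builder: the same model receives different tasks purely through the prefix on its input. The two prefixes are the ones used in this notebook; the helper `build_prompt` and the task keys are hypothetical names chosen for illustration.

```python
# Prefixes taken from the prompts used in this notebook.
TASK_PREFIXES = {
    "translate_en_de": "translate English to German: ",
    "summarize": "summarize: ",
}

def build_prompt(task, text):
    """Prepend the task prefix so one model can handle several tasks."""
    if task not in TASK_PREFIXES:
        raise ValueError(f"unknown task: {task}")
    return TASK_PREFIXES[task] + text.strip()

print(build_prompt("translate_en_de", "How old are you?"))
# translate English to German: How old are you?
print(build_prompt("summarize", "#Person1#: It's Sunday today."))
# summarize: #Person1#: It's Sunday today.
```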

In [None]:
from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained("google/flan-t5-base")
model = T5ForConditionalGeneration.from_pretrained("google/flan-t5-base")

input_text = "translate English to German: How old are you?"
input_ids = tokenizer(input_text, return_tensors="pt").input_ids

outputs = model.generate(input_ids)
print(tokenizer.decode(outputs[0]))

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


<pad> Wie old sind Sie?</s>


### Summarize the conversation

##### The conversations used in the code below are taken from a dataset on the Hugging Face Hub

- Dataset name: [knkarthick/dialogsum](https://huggingface.co/datasets/knkarthick/dialogsum/viewer/default/train)

In [None]:
# Let's summarize the conversation

# Define the text to summarize
input_conversation = '''
#Person1#: Look! This picture of Mom in her cap and gown.
#Person2#: Isn't it lovely! That's when she got her Master's Degree from Miami University.
#Person1#: Yes, we are very proud of her.
#Person2#: Oh, that's a nice one of all of you together. Do you have the negative? May I have a copy?
#Person1#: Surely, I'll have one made for you. You want a print? #Person2#: No. I'd like a slide, I have a new projector.
#Person1#: I'd like to see that myself. #Person2#: Have a wallet size print made for me, too. #Person1#: Certainly.
'''

Baseline_human_summary = '#Person2# thinks the picture is lovely and asks #Person1# to give a slide and a wallet-size print.'

# Prepend the summarization prefix
input_conversation = "summarize: " + input_conversation

# Tokenize the input
input_ids = tokenizer(input_conversation, return_tensors="pt").input_ids

# Generate the summary
outputs = model.generate(input_ids)
Model_summary = tokenizer.decode(outputs[0])

# Print the summary
print('-' * 100)
print('Baseline human summary:', Baseline_human_summary)
print('-' * 100)
print("Model Summary:", Model_summary)

----------------------------------------------------------------------------------------------------
Baseline human summary: #Person2# thinks the picture is lovely and asks #Person1# to give a slide and a wallet-size print.
----------------------------------------------------------------------------------------------------
Model Summary: <pad>#Person1: Mom got her Master's Degree from Miami University. #Person
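
The model summary above stops mid-sentence at `#Person` and still contains the `<pad>` token. A likely cause of the cut-off is that `model.generate(input_ids)` is called without a length argument, so generation falls back to a short default (20 tokens in the library's default generation settings, unless the model's own config overrides it). The sketch below wraps the same steps in a hypothetical helper, `summarize`, that passes `max_new_tokens` and decodes with `skip_special_tokens=True`. The value 60 is an arbitrary assumption, not a tuned setting.

```python
def summarize(model, tokenizer, text, max_new_tokens=60):
    """Summarize `text` with a T5-style model and return clean text."""
    # Same "summarize: " prefix as in the cell above.
    prompt = "summarize: " + text.strip()
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids
    # Allow more generated tokens than the short default.
    outputs = model.generate(input_ids, max_new_tokens=max_new_tokens)
    # Drop special tokens such as <pad> and </s> from the decoded string.
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

# Usage with the objects loaded earlier in this notebook:
# print(summarize(model, tokenizer, input_conversation))
```

A longer budget does not guarantee a better summary; it only removes the hard cut-off so the model can finish its sentence.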


In [None]:
# Let's summarize another conversation

# Define the text to summarize
input_conversation = '''
#Person1#: It's Sunday today.
#Person2#: Yes, I know.
#Person1#: I think we should have a house cleaning today. What's your opinion?
#Person2#: Oh, no. We just did it last week.
#Person1#: Come on. What do you want to do? Washing clothes or cleaning the house?
#Person2#: I'd rather wash the clothes.
#Person1#: Okay. Here is the laundry.
#Person2#: Oh, My God! So much!
#Person1#: Don't worry. I'll help you with it later.
'''

Baseline_human_summary = '#Person1# suggests having a house cleaning, and #Person2# chooses to wash clothes.'

# Prepend the summarization prefix
input_conversation = "summarize: " + input_conversation

# Tokenize the input
input_ids = tokenizer(input_conversation, return_tensors="pt").input_ids

# Generate the summary
outputs = model.generate(input_ids)
Model_summary = tokenizer.decode(outputs[0])

# Print the summary
print('-' * 100)
print('Baseline human summary:', Baseline_human_summary)
print('-' * 100)
print("Model Summary:", Model_summary)

----------------------------------------------------------------------------------------------------
Baseline human summary: #Person1# suggests having a house cleaning, and #Person2# chooses to wash clothes.
----------------------------------------------------------------------------------------------------
Model Summary: <pad> The house cleaning is on Sunday.</s>
