
# Text generation example

This is a brief example of how to run text generation with a causal language model and `pipeline`.

Install the [transformers](https://huggingface.co/docs/transformers/index) Python package. It is used to load the model and tokenizer and to run generation.

In [1]:
!pip install --quiet transformers

Import the `AutoTokenizer` and `AutoModelForCausalLM` classes and the `pipeline` function. The two classes load tokenizers and generative models from the [Hugging Face model repository](https://huggingface.co/models), and `pipeline` wraps a tokenizer and a model for convenience.

In [2]:
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline

Load a generative model and its tokenizer. You can substitute any other generative model name here (e.g. [other TurkuNLP GPT-3 models](https://huggingface.co/models?sort=downloads&search=turkunlp%2Fgpt3)), but note that Colab may not have enough memory to run the larger models.

In [3]:
MODEL_NAME = 'TurkuNLP/gpt3-finnish-large'

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)


Instantiate a text generation pipeline using the tokenizer and model.

In [4]:
pipe = pipeline(
    'text-generation',
    model=model,
    tokenizer=tokenizer,
    device=model.device
)

We can now call the pipeline with a text prompt; it will take care of tokenizing, encoding, generation, and decoding:

In [5]:
output = pipe('Terve, miten menee?', max_new_tokens=25)

print(output)

[{'generated_text': 'Terve, miten menee?”\n”Hyvin, kiitos.”\n”Kiva kuulla.”\n”Kuule, minulla on sinulle asiaa.”\n'}]


Print only the generated text:

In [6]:
print(output[0]['generated_text'])

Terve, miten menee?”
”Hyvin, kiitos.”
”Kiva kuulla.”
”Kuule, minulla on sinulle asiaa.”
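
Note that `generated_text` starts with the prompt itself. Below is a minimal sketch of separating the continuation from the prompt, using the output of the first pipeline call above. The helper name `strip_prompt` is hypothetical and not part of `transformers`. The text-generation pipeline also accepts `return_full_text=False`, which should have a similar effect.

```python
def strip_prompt(prompt, outputs):
    """Return only the newly generated continuation of each pipeline output.

    `outputs` is assumed to have the shape shown above:
    a list of dicts with a 'generated_text' key.
    """
    continuations = []
    for item in outputs:
        text = item['generated_text']
        # By default the pipeline returns prompt + continuation,
        # so drop the prompt when it is present as a prefix.
        if text.startswith(prompt):
            text = text[len(prompt):]
        continuations.append(text)
    return continuations


# Prompt and output copied from the first pipeline call above.
prompt = 'Terve, miten menee?'
output = [{'generated_text': 'Terve, miten menee?”\n”Hyvin, kiitos.”\n”Kiva kuulla.”\n”Kuule, minulla on sinulle asiaa.”\n'}]

print(strip_prompt(prompt, output)[0])
```

The prefix check keeps the helper safe to call on output that already lacks the prompt.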



We can also call the pipeline with any arguments that the model's `generate` method supports. For details on text generation using `transformers`, see e.g. [this tutorial](https://huggingface.co/blog/how-to-generate).

The following example enables sampling and sets a high `temperature` to generate more chaotic output:

In [7]:
output = pipe(
    'Terve, miten menee?',
    do_sample=True,
    temperature=10.0,
    max_new_tokens=25
)

print(output[0]['generated_text'])

Terve, miten menee? Mikä meininki tänä kauniina kesäkuisella, toivottavasti aika lämpöiseksi yltyvän loppukesän perjantaina tai viimeistään koko viikonlopun mittaisen työviikon lähestyessä?
Mitä
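
To see why a high `temperature` makes the output more chaotic, it helps to look at what temperature does to the next-token distribution: the logits are divided by the temperature before the softmax, so large values flatten the distribution and make unlikely tokens almost as probable as likely ones. The following is a minimal sketch in plain Python, not the `transformers` implementation. The function name and the logit values are hypothetical.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Convert logits to probabilities after dividing them by the temperature."""
    scaled = [x / temperature for x in logits]
    # Subtract the maximum for numerical stability.
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits for four candidate next tokens.
logits = [4.0, 2.0, 1.0, 0.0]

for t in (0.5, 1.0, 10.0):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

At `temperature=10.0` the four probabilities end up close to each other, so sampling picks among them almost uniformly.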


In [9]:
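# Load the text classification pipeline.
# NOTE: 'distilbert-base-uncased' is a base checkpoint without a fine-tuned
# classification head, so the head is randomly initialized (see the warning
# in the output) and the labels and scores below are not meaningful.
# A checkpoint fine-tuned for classification, e.g.
# 'distilbert-base-uncased-finetuned-sst-2-english', would give usable predictions.
text_classifier = pipeline("text-classification", model="distilbert-base-uncased")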

# Test cases for text classification
text_test_cases = [
    "This movie was amazing! I loved every moment of it.",
    "The product arrived broken and unusable. Very disappointed.",
    "I'm not sure about this book. It was okay, but not great.",
]

print("Text Classification Results:")
for text in text_test_cases:
    result = text_classifier(text)
    print(f"Text: {text}")
    print("Label:", result[0]['label'])
    print("Score:", result[0]['score'])
    print()

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Text Classification Results:
Text: This movie was amazing! I loved every moment of it.
Label: LABEL_0
Score: 0.536824107170105

Text: The product arrived broken and unusable. Very disappointed.
Label: LABEL_0
Score: 0.5232463479042053

Text: I'm not sure about this book. It was okay, but not great.
Label: LABEL_0
Score: 0.5233213305473328
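
All three inputs receive `LABEL_0` with a score just above 0.5, which is what the warning above predicts: a newly initialized classification head is essentially guessing. Below is a minimal sketch of a sanity check for this situation. The helper `looks_like_chance` and its tolerance are hypothetical, and the check assumes a two-label head.

```python
def looks_like_chance(scores, num_labels=2, tolerance=0.05):
    """Heuristic: True if every top score is within `tolerance` of 1 / num_labels."""
    chance = 1.0 / num_labels
    return all(abs(score - chance) <= tolerance for score in scores)


# Top-label scores copied from the output above.
scores = [0.536824107170105, 0.5232463479042053, 0.5233213305473328]

print(looks_like_chance(scores))  # True
```

A check like this is only a rough signal; a fine-tuned model can also produce scores near 0.5 on genuinely ambiguous inputs.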



In [10]:


# Load the question answering pipeline
question_answerer = pipeline("question-answering", model="distilbert-base-uncased-distilled-squad", tokenizer="distilbert-base-uncased")

# Test cases for question answering
context = "The capital of France is Paris. It is known as the City of Light."
question_test_cases = [
    "What is the capital of France?",
    "What is Paris known as?",
    "Where is the Eiffel Tower located?",
]

print("Question Answering Results:")
for question in question_test_cases:
    result = question_answerer(question=question, context=context)
    print(f"Question: {question}")
    print("Answer:", result['answer'])
    print("Score:", result['score'])
    print()


Question Answering Results:
Question: What is the capital of France?
Answer: Paris
Score: 0.9846238493919373

Question: What is Paris known as?
Answer: the City of Light
Score: 0.5043127536773682

Question: Where is the Eiffel Tower located?
Answer: Paris
Score: 0.8830112814903259
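
An extractive question-answering model selects a span of the context as its answer. This is why the third question is answered with `Paris`, and with a high score, even though the context never mentions the Eiffel Tower. A minimal sketch that confirms each answer is a literal slice of the context; the helper `locate_span` is hypothetical and the answers are copied from the output above.

```python
context = "The capital of France is Paris. It is known as the City of Light."

# Answers copied from the output above.
answers = ["Paris", "the City of Light", "Paris"]


def locate_span(context, answer):
    """Return (start, end) character offsets of `answer` in `context`, or None."""
    start = context.find(answer)
    if start == -1:
        return None
    return start, start + len(answer)


for answer in answers:
    start, end = locate_span(context, answer)
    # Every answer is a literal slice of the context.
    print(answer, (start, end), context[start:end] == answer)

# The third question asks about something the context never mentions.
print("Eiffel" in context)
```

A high score therefore does not guarantee that the context actually contains the answer to the question.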



In [14]:
from transformers import pipeline

# Load the translation pipeline
translator = pipeline("translation", model="Helsinki-NLP/opus-mt-en-fr")

# Test cases for translation
translation_test_cases = [
    "Hello, how are you?",
    "What time is it?",
    "I love to travel and explore new places.",
]

print("Translation Results:")
for sentence in translation_test_cases:
    result = translator(sentence)
    print(f"English: {sentence}")
    print("French:", result[0]['translation_text'])
    print()


Translation Results:
English: Hello, how are you?
French: Bonjour, comment allez-vous ?

English: What time is it?
French: Quelle heure est-il ?

English: I love to travel and explore new places.
French: J'adore voyager et explorer de nouveaux endroits.



In [12]:
# Load the summarization pipeline
summarizer = pipeline("summarization", model="t5-small", tokenizer="t5-small")

# Test cases for summarization
text_test_cases = [
    "Abraham Lincoln was an American statesman and lawyer who served as the 16th president of the United States from 1861 until his assassination in 1865. Lincoln led the nation through the American Civil War, its bloodiest war and its greatest moral, constitutional, and political crisis. He preserved the Union, abolished slavery, strengthened the federal government, and modernized the U.S. economy.",
    "The Industrial Revolution was the transition to new manufacturing processes in Europe and the United States, in the period from about 1760 to sometime between 1820 and 1840. This transition included going from hand production methods to machines, new chemical manufacturing and iron production processes, the increasing use of steam power and water power, the development of machine tools and the rise of the mechanized factory system.",
    "Artificial intelligence (AI) is intelligence demonstrated by machines, unlike the natural intelligence displayed by humans and animals, which involves consciousness and emotionality. The distinction between the former and the latter categories is often revealed by the acronym chosen. 'Strong' AI is usually labelled as AGI (Artificial General Intelligence) while attempts to emulate 'natural' intelligence have been called ABI (Artificial Biological Intelligence). Leading AI textbooks define the field as the study of 'intelligent agents': any device that perceives its environment and takes actions that maximize its chance of successfully achieving its goals."
]

print("Summarization Results:")
for text in text_test_cases:
    result = summarizer(text, max_length=50, min_length=20, do_sample=False)
    print(f"Original Text: {text}")
    print("Summary:", result[0]['summary_text'])
    print()



Summarization Results:
Original Text: Abraham Lincoln was an American statesman and lawyer who served as the 16th president of the United States from 1861 until his assassination in 1865. Lincoln led the nation through the American Civil War, its bloodiest war and its greatest moral, constitutional, and political crisis. He preserved the Union, abolished slavery, strengthened the federal government, and modernized the U.S. economy.
Summary: Abraham Lincoln served as the 16th president of the united states from 1861 until his assassination in 1865 . he led the nation through the American Civil War, its bloodiest war and its greatest moral, constitutional

Original Text: The Industrial Revolution was the transition to new manufacturing processes in Europe and the United States, in the period from about 1760 to sometime between 1820 and 1840. This transition included going from hand production methods to machines, new chemical manufacturing and iron production processes, the increasing us