In [38]:
# Example call of the pipeline function
# to perform sentiment analysis on a text

from transformers import pipeline

nlp = pipeline("sentiment-analysis")
sequence = "Hugging Face Inc. is a company based in New York City. Its headquarters are in DUMBO, therefore very close to the Manhattan Bridge."
print(nlp(sequence))

No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.


[{'label': 'NEGATIVE', 'score': 0.9249897599220276}]
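
The warning above recommends naming the model and revision explicitly. A minimal sketch of how that could look, reusing the checkpoint and revision printed in the log. The names `PINNED_KWARGS` and `build_classifier` are hypothetical helpers introduced for illustration, and it is assumed that the short revision hash from the log resolves on the Hub.

```python
# Pin the checkpoint and revision reported in the warning above, so the
# pipeline does not silently change if the library's default model changes.
PINNED_KWARGS = {
    "task": "sentiment-analysis",
    "model": "distilbert/distilbert-base-uncased-finetuned-sst-2-english",
    "revision": "af0f99b",  # taken from the log; assumed to resolve on the Hub
}


def build_classifier():
    """Return a sentiment-analysis pipeline with an explicit model and revision.

    The import is kept inside the function so that defining the helper
    does not load the library or download any weights.
    """
    from transformers import pipeline

    return pipeline(**PINNED_KWARGS)
```

Calling `build_classifier()` downloads the pinned checkpoint on first use; the returned object is then called on a string in the same way as `nlp(sequence)` above.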


In [13]:
# Example of zero-shot classification
from transformers import pipeline

classifier = pipeline("zero-shot-classification")
classifier(
    "This is a course about the Transformers library",
    candidate_labels=["education", "politics", "business"],
)

No model was supplied, defaulted to facebook/bart-large-mnli and revision c626438 (https://huggingface.co/facebook/bart-large-mnli).
Using a pipeline without specifying a model name and revision in production is not recommended.


{'sequence': 'This is a course about the Transformers library',
 'labels': ['education', 'business', 'politics'],
 'scores': [0.8445992469787598, 0.11197395622730255, 0.043426770716905594]}
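
The result above is a dictionary whose `labels` and `scores` lists are aligned and sorted from most to least likely. A small sketch of how it might be read, with the dictionary copied from the output above rather than recomputed; the variable names are illustrative only.

```python
# Result copied verbatim from the zero-shot classification output above.
result = {
    "sequence": "This is a course about the Transformers library",
    "labels": ["education", "business", "politics"],
    "scores": [0.8445992469787598, 0.11197395622730255, 0.043426770716905594],
}

# Pair each label with its score; the lists are index-aligned.
ranked = list(zip(result["labels"], result["scores"]))

# The first entry is the most likely label.
top_label, top_score = ranked[0]

# In the default single-label mode the scores behave like a probability
# distribution over the candidate labels, so they add up to roughly 1.
total = sum(result["scores"])

for label, score in ranked:
    print(f"{label:<10} {score:.3f}")
print("top label:", top_label)
```

Note that the order of `labels` in the result follows the scores, not the order in which `candidate_labels` was passed.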

In [37]:
# example for text generation with an explicit model
from transformers import pipeline

generator = pipeline("text-generation", model="distilgpt2")
generator(
    "In this course, we will teach you how to",
    max_length=30,
    num_return_sequences=2,
)

Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'In this course, we will teach you how to write your own language. This course will be aimed at teaching you how to use languages that represent the'},
 {'generated_text': 'In this course, we will teach you how to run a successful run like Run a Flugger on a flat screen. First of all, let'}]

In [36]:
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load the tokenizer and model
checkpoint = "gpt2"
print("Initializing tokenizer and model...")
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint)
print("Initialization complete.\n")

# Set the padding token if it is not already set (GPT-2 ships without one)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token  # Set EOS token as padding token
    print("Padding token set to EOS token.\n")

# Define sample prompts
prompts = [
    "I've been waiting for a HuggingFace course my whole life,",
    "I hate this so much,"
]
print("Processing the following prompts:")
for prompt in prompts:
    print(f"- {prompt}")
print()

# Tokenize inputs
print("Tokenizing inputs...")
inputs = tokenizer(prompts, padding=True, truncation=True, return_tensors="pt")
print("Tokens (input IDs):", inputs["input_ids"])
print("Attention masks:", inputs["attention_mask"])
print("Tokenization complete.\n")

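# Generate text
# Note: only input_ids are passed, so generate() receives no attention mask,
# and the tokenizer pads on the right by default. Both trigger the warnings
# shown in the output; for a decoder-only model, left padding plus an explicit
# attention_mask is the usual remedy.
print("Generating text...")
generated_outputs = model.generate(inputs["input_ids"], max_length=50)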
print("Tokens (output IDs):", generated_outputs)
print("Generation complete.\n")

# Decode and display the outputs
print("Decoding outputs...")
generated_texts = [tokenizer.decode(output, skip_special_tokens=True) for output in generated_outputs]
print("Decoded texts:")
for i, text in enumerate(generated_texts, 1):
    print(f"Completion to Prompt {i}:\n{text}\n")


Initializing tokenizer and model...


The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


Initialization complete.

Padding token set to EOS token.

Processing the following prompts:
- I've been waiting for a HuggingFace course my whole life,
- I hate this so much,

Tokenizing inputs...
Tokens (input IDs): tensor([[   40,  1053,   587,  4953,   329,   257, 12905,  2667, 32388,  1781,
           616,  2187,  1204,    11],
        [   40,  5465,   428,   523,   881,    11, 50256, 50256, 50256, 50256,
         50256, 50256, 50256, 50256]])
Attention masks: tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]])
Tokenization complete.
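
The tensors above show right padding: the shorter prompt is followed by eight copies of token 50256, and its attention mask ends in zeros. The sketch below reproduces that layout in plain Python and contrasts it with left padding, which is what the warning asks for with decoder-only models. `pad_batch` is a hypothetical helper written for illustration, not part of the Transformers API.

```python
from typing import List, Tuple


def pad_batch(
    sequences: List[List[int]], pad_id: int, side: str = "right"
) -> Tuple[List[List[int]], List[List[int]]]:
    """Pad token-ID lists to a common length and build matching attention masks."""
    if side not in ("left", "right"):
        raise ValueError("side must be 'left' or 'right'")
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        padding = [pad_id] * (max_len - len(seq))
        real_mask = [1] * len(seq)      # 1 marks a real token
        pad_mask = [0] * len(padding)   # 0 marks a padding position
        if side == "left":
            input_ids.append(padding + seq)
            attention_mask.append(pad_mask + real_mask)
        else:
            input_ids.append(seq + padding)
            attention_mask.append(real_mask + pad_mask)
    return input_ids, attention_mask


PAD_ID = 50256  # GPT-2's EOS token ID, reused as the padding token above

# Token IDs of the two prompts, copied from the tokenizer output above.
batch = [
    [40, 1053, 587, 4953, 329, 257, 12905, 2667, 32388, 1781, 616, 2187, 1204, 11],
    [40, 5465, 428, 523, 881, 11],
]

right_ids, right_mask = pad_batch(batch, PAD_ID, side="right")
left_ids, left_mask = pad_batch(batch, PAD_ID, side="left")

# With right padding the last position of the short prompt is a pad token,
# so generation would continue from padding rather than from the prompt.
print(right_ids[1][-1])  # 50256
# With left padding the last position is the prompt's final real token.
print(left_ids[1][-1])   # 11
```

With the real tokenizer, the corresponding change would be to set `padding_side="left"` and to pass `attention_mask=inputs["attention_mask"]` to `model.generate`, as the warnings in the output suggest.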

Generating text...
Tokens (output IDs): tensor([[   40,  1053,   587,  4953,   329,   257, 12905,  2667, 32388,  1781,
           616,  2187,  1204,    11,   290,   314,  1101,   523,  9675,   314,
           750,    13,   314,  1101,   523,  3772,   284,   307,  1498,   284,
          2648,   616,  1998,   351,   345,    13,   198,   198,    40,  1101,
           523,  3772,   284,   307,  1498,  