In [2]:
from transformers import pipeline
classifier = pipeline('sentiment-analysis')

No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.

from transformers import pipeline: This imports the pipeline function from the transformers library. pipeline is a high-level API that lets you apply pre-trained models to common NLP tasks without handling tokenization, model loading, or post-processing yourself.

classifier = pipeline('sentiment-analysis'): This line creates a pipeline object for sentiment analysis. The argument 'sentiment-analysis' names the task; because no model is given, the pipeline falls back to a default checkpoint for that task (here distilbert/distilbert-base-uncased-finetuned-sst-2-english, as the message above shows) and downloads it from the Hugging Face model hub if it is not already cached locally. After this line runs, classifier holds the sentiment analysis pipeline, which can be called directly on text inputs.
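
The warning printed above recommends naming the model and revision explicitly. A minimal sketch of that, reusing the model name and revision hash from the warning itself; the `PINNED` dict and the `build_classifier` helper are illustrative names, not part of the library:

```python
# Model name and revision copied from the warning message above.
PINNED = {
    "task": "sentiment-analysis",
    "model": "distilbert/distilbert-base-uncased-finetuned-sst-2-english",
    "revision": "af0f99b",
}


def build_classifier(config=PINNED):
    """Create a sentiment pipeline pinned to a fixed model and revision (hypothetical helper)."""
    # Imported lazily so that defining the helper does not load the library.
    from transformers import pipeline

    return pipeline(config["task"], model=config["model"], revision=config["revision"])
```

Calling `classifier = build_classifier()` should then load the same checkpoint on every run instead of whatever default the installed library version picks.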

In [3]:
classifier('The pizza is not that great.')

[{'label': 'NEGATIVE', 'score': 0.9997463822364807}]
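
The `score` is a probability rather than a raw model output. A rough sketch of that last step, assuming the pipeline applies a softmax over the model's two output logits (its usual behaviour for single-label classification); the logit values and the `ID2LABEL` order are assumptions made for illustration:

```python
import math

ID2LABEL = {0: "NEGATIVE", 1: "POSITIVE"}  # assumed label order


def softmax(logits):
    """Convert raw logits into probabilities that sum to 1."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]  # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]


def to_prediction(logits):
    """Pick the most probable label, mimicking the pipeline's output format."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return {"label": ID2LABEL[best], "score": probs[best]}


example_logits = [4.3, -4.0]  # hypothetical values, not taken from the real model
prediction = to_prediction(example_logits)
print(prediction["label"])
```

The wider the gap between the two logits, the closer the winning score gets to 1.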

In [4]:
results = classifier(["We are very happy to show you the 🤗 Transformers library.",
           "We hope you don't hate it."])
for result in results:
    print(f"label: {result['label']}, with score: {round(result['score'], 4)}")

label: POSITIVE, with score: 0.9998
label: NEGATIVE, with score: 0.5309
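
The second score, 0.5309, is barely above 0.5, so the model is close to undecided on that sentence. One possible way to surface such cases, sketched with the two results above (scores rounded as printed); the `flag_uncertain` name and the 0.75 cutoff are arbitrary choices for illustration:

```python
def flag_uncertain(results, threshold=0.75):
    """Return the indices of predictions whose score falls below the threshold."""
    return [i for i, r in enumerate(results) if r["score"] < threshold]


# Results copied from the output above.
results = [
    {"label": "POSITIVE", "score": 0.9998},
    {"label": "NEGATIVE", "score": 0.5309},
]
uncertain = flag_uncertain(results)
print(uncertain)  # [1]
```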


In [5]:
classifier = pipeline('sentiment-analysis', model="nlptown/bert-base-multilingual-uncased-sentiment")

In [6]:
classifier("Esperamos que no lo odie.")

[{'label': '3 stars', 'score': 0.33688193559646606}]
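
This model predicts a star rating rather than a POSITIVE/NEGATIVE label. If a numeric rating is more convenient downstream, the label string can be parsed; a small sketch, assuming every label follows the `'<n> star(s)'` pattern seen in the output (the `stars_from_label` helper is hypothetical):

```python
def stars_from_label(label):
    """Extract the leading integer from a label such as '3 stars'."""
    return int(label.split()[0])


result = {"label": "3 stars", "score": 0.33688193559646606}  # copied from the output above
print(stars_from_label(result["label"]))  # 3
```

With a score of about 0.34 for the winning class, most of the probability mass is spread over the other ratings, so this prediction is a weak one.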

This classifier can now handle text in English, French, Dutch, German, Italian and Spanish. You can also replace the model name with a local folder where you have saved a pretrained model (see below), or pass a model object and its associated tokenizer.

We will need two classes for this. The first is AutoTokenizer, which we will use to download and instantiate the tokenizer associated with the model we picked. The second is AutoModelForSequenceClassification (or TFAutoModelForSequenceClassification if you are using TensorFlow, as in this notebook), which we will use to download the model itself. Note that if we were using the library for another task, the model class would change. The task summary tutorial summarizes which class is used for which task.

In [7]:
from transformers import AutoTokenizer, TFAutoModelForSequenceClassification

AutoTokenizer:

Tokenization is the process of breaking down a piece of text into smaller units, typically words or subwords, called tokens. Tokenization is a crucial preprocessing step in NLP tasks.
AutoTokenizer automatically selects the appropriate tokenizer for a given pre-trained model architecture. Different pre-trained models may require different tokenization methods.
Using AutoTokenizer ensures that you use the correct tokenizer without having to manually specify it, which can be error-prone and tedious.
This class helps maintain consistency across different models and simplifies the process of switching between models without changing code.

TFAutoModelForSequenceClassification:

This class is used for sequence classification tasks in TensorFlow. Sequence classification involves assigning a label or category to a sequence of tokens, such as a sentence or a paragraph.
TFAutoModelForSequenceClassification automatically selects the appropriate model class for sequence classification (e.g., BERT, RoBERTa, DistilBERT) based on the architecture declared by the checkpoint you load.
It loads the pre-trained model along with the necessary architecture and weights, allowing you to perform sequence classification without having to manually download, configure, and load the model.
Using this class simplifies the process of using pre-trained models for sequence classification tasks in TensorFlow, making it easier to experiment with different models and architectures.
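
The idea behind the `Auto` classes can be pictured as a lookup from the checkpoint's declared model type to a concrete class. The toy below only illustrates that dispatch pattern; the registry, the class names and the `auto_from_config` function are invented for this sketch and are not the library's internals:

```python
class ToyBertTokenizer:
    name = "bert"


class ToyRobertaTokenizer:
    name = "roberta"


# Hypothetical registry: model type -> tokenizer class.
TOKENIZER_REGISTRY = {
    "bert": ToyBertTokenizer,
    "roberta": ToyRobertaTokenizer,
}


def auto_from_config(config):
    """Instantiate the class registered for the config's model type."""
    model_type = config["model_type"]
    try:
        cls = TOKENIZER_REGISTRY[model_type]
    except KeyError:
        raise ValueError(f"Unsupported model type: {model_type!r}") from None
    return cls()


toy_tokenizer = auto_from_config({"model_type": "bert"})
print(type(toy_tokenizer).__name__)  # ToyBertTokenizer
```

Switching models then means changing only the configuration that is passed in, which is the convenience the `Auto` classes aim for.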

In [8]:
model_name = "nlptown/bert-base-multilingual-uncased-sentiment"
# This model only exists in PyTorch, so we use the `from_pt` flag to import that model in TensorFlow.
model = TFAutoModelForSequenceClassification.from_pretrained(model_name, from_pt=True)
tokenizer = AutoTokenizer.from_pretrained(model_name)
classifier = pipeline('sentiment-analysis', model=model, tokenizer=tokenizer)

All PyTorch model weights were used when initializing TFBertForSequenceClassification.

All the weights of TFBertForSequenceClassification were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFBertForSequenceClassification for predictions without further training.


In [9]:
classifier("I am a good boy")

[{'label': '4 stars', 'score': 0.42292681336402893}]

# Using the tokenizer

In [10]:
inputs = tokenizer("We are very happy to show you the 🤗 Transformers library.")

This returns a dictionary mapping strings to lists of ints. It contains the ids of the tokens, as mentioned before, but also additional arguments that will be useful to the model. Here, for instance, we also have an attention mask, which tells the model which positions hold real tokens and which are padding:

In [11]:
print(inputs)

{'input_ids': [101, 11312, 10320, 12495, 19308, 10114, 11391, 10855, 10103, 100, 58263, 13299, 119, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
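
In this output every entry of `attention_mask` is 1 because a single sentence needs no padding. The mask becomes informative when sequences of different lengths are batched together. A minimal pure-Python sketch of that idea, assuming a padding id of 0; the `pad_batch` helper and the two short sequences are illustrative, and in practice the tokenizer can do this itself when called with `padding=True`:

```python
def pad_batch(sequences, pad_id=0):
    """Pad token-id lists to equal length and build matching attention masks."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        attention_mask.append([1] * len(seq) + [0] * n_pad)  # 1 = real token, 0 = padding
    return {"input_ids": input_ids, "attention_mask": attention_mask}


batch = pad_batch([
    [101, 11312, 10320, 12495, 102],  # hypothetical longer sequence
    [101, 11312, 102],                # hypothetical shorter sequence
])
print(batch["attention_mask"])  # [[1, 1, 1, 1, 1], [1, 1, 1, 0, 0]]
```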


In [12]:
# from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# t5_qa_model = AutoModelForSeq2SeqLM.from_pretrained("google/t5-11b-ssm-tqa")
# t5_tok = AutoTokenizer.from_pretrained("google/t5-11b-ssm-tqa")

# input_ids = t5_tok("When was Franklin D. Roosevelt born?", return_tensors="pt").input_ids
# gen_output = t5_qa_model.generate(input_ids)[0]
# print(t5_tok.decode(gen_output, skip_special_tokens=True))