In [1]:
# Import the necessary classes and functions
from transformers import pipeline, AutoModelForSequenceClassification, AutoTokenizer

# Behind The Pipeline
My task is to explore three topics: the introduction to HF Transformers, Behind the Pipeline, and Models. The introduction contains no code, so I only implement the pipeline and model examples.

Here is a high-level overview of the pipeline example:
1. **Introduction**: Hugging Face Transformers provides a high-level `pipeline` API. This example uses it for two tasks: sentiment analysis and named entity recognition.

2. **Pipeline Initialization**: `sentiment_classifier = pipeline("sentiment-analysis")` initializes a sentiment analysis pipeline. Because no model is named, the library falls back to a default pre-trained sentiment model.

3. **Input Text**: `text1 = "Hugging Face Transformers simplifies NLP workflows."` sets the text whose sentiment we want to determine.

4. **Pipeline Execution**: `result = sentiment_classifier(text1)` runs the input text through the pipeline and stores the output in `result`.

5. **Result Analysis**: The result is a list of dictionaries, so `result[0]['label']` gives the sentiment label (e.g., 'POSITIVE' or 'NEGATIVE'), which is printed together with the input text.

In [4]:
# Sentiment Analysis Pipeline
sentiment_classifier = pipeline("sentiment-analysis")         # Initialize the sentiment analysis pipeline
text1 = "Hugging Face Transformers simplifies NLP workflows."  # Example text
result = sentiment_classifier(text1)                           # Run the example text through the pipeline

# Named Entity Recognition (NER) Pipeline
# Every pipeline follows the same code structure as the first example (sentiment analysis)
ner_classifier = pipeline("ner")
text2 = "Hugging Face, based in New York, is a leading company in natural language processing."
entities = ner_classifier(text2)

No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
No model was supplied, defaulted to dbmdz/bert-large-cased-finetuned-conll03-english and revision f2482bf (https://huggingface.co/dbmdz/bert-large-cased-finetuned-conll03-english).
Using a pipeline without specifying a model name and revision in production is not recommended.


Some weights of the model checkpoint at dbmdz/bert-large-cased-finetuned-conll03-english were not used when initializing BertForTokenClassification: ['bert.pooler.dense.weight', 'bert.pooler.dense.bias']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).





In [8]:
# Print the result
print(f"Sentiment Analysis Result for \n'{text1}': \n{result[0]['label']}")
print(f"\nNamed Entity Recognition Result for \n'{text2}': \n{entities}")

Sentiment Analysis Result for 
'Hugging Face Transformers simplifies NLP workflows.': 
POSITIVE

Named Entity Recognition Result for 
'Hugging Face, based in New York, is a leading company in natural language processing.': 
[{'entity': 'I-ORG', 'score': 0.9992798, 'index': 1, 'word': 'Hu', 'start': 0, 'end': 2}, {'entity': 'I-ORG', 'score': 0.9822902, 'index': 2, 'word': '##gging', 'start': 2, 'end': 7}, {'entity': 'I-ORG', 'score': 0.99642724, 'index': 3, 'word': 'Face', 'start': 8, 'end': 12}, {'entity': 'I-LOC', 'score': 0.9989936, 'index': 7, 'word': 'New', 'start': 23, 'end': 26}, {'entity': 'I-LOC', 'score': 0.9986708, 'index': 8, 'word': 'York', 'start': 27, 'end': 31}]
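
The NER output above is token-level: "Hugging" is split into the word pieces `Hu` and `##gging`, and "New York" comes back as two separate entries. A minimal sketch of how such predictions could be grouped into whole entities is shown below. The helper `merge_wordpieces` is hypothetical (it is not part of Transformers), and it assumes that adjacent tokens with the same entity type belong to the same entity. The library's NER pipeline also accepts an `aggregation_strategy` argument intended for this kind of grouping; the sketch does not use it and only illustrates the idea on the data printed above.

```python
# Token-level NER predictions copied from the output above.
text2 = "Hugging Face, based in New York, is a leading company in natural language processing."
entities = [
    {"entity": "I-ORG", "score": 0.9992798, "index": 1, "word": "Hu", "start": 0, "end": 2},
    {"entity": "I-ORG", "score": 0.9822902, "index": 2, "word": "##gging", "start": 2, "end": 7},
    {"entity": "I-ORG", "score": 0.99642724, "index": 3, "word": "Face", "start": 8, "end": 12},
    {"entity": "I-LOC", "score": 0.9989936, "index": 7, "word": "New", "start": 23, "end": 26},
    {"entity": "I-LOC", "score": 0.9986708, "index": 8, "word": "York", "start": 27, "end": 31},
]


def merge_wordpieces(text, tokens):
    """Group consecutive tokens of the same entity type (hypothetical helper)."""
    merged = []
    for tok in tokens:
        label = tok["entity"].split("-", 1)[-1]  # "I-ORG" -> "ORG"
        prev = merged[-1] if merged else None
        if (
            prev is not None
            and prev["label"] == label
            and tok["index"] == prev["last_index"] + 1
        ):
            # Same entity type and the next token in the sequence: extend the span.
            prev["end"] = tok["end"]
            prev["last_index"] = tok["index"]
            prev["scores"].append(tok["score"])
        else:
            # Otherwise start a new entity span.
            merged.append({
                "label": label,
                "start": tok["start"],
                "end": tok["end"],
                "last_index": tok["index"],
                "scores": [tok["score"]],
            })
    # Recover the surface text from character offsets and average the token scores.
    return [
        {
            "entity_group": m["label"],
            "word": text[m["start"]:m["end"]],
            "score": sum(m["scores"]) / len(m["scores"]),
        }
        for m in merged
    ]


groups = merge_wordpieces(text2, entities)
for g in groups:
    print(f"{g['entity_group']}: {g['word']}")
# ORG: Hugging Face
# LOC: New York
```

Using the character offsets (`start`, `end`) to slice the original text avoids having to strip the `##` prefixes by hand.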


Key aspects of the pipeline:

1. **High-Level Pipeline API**: The pipeline function abstracts the underlying complexities of loading the pre-trained model, tokenizing the input, making predictions, and post-processing the results. It simplifies the entire process into a single line of code.

2. **Convenient Interface**: By using the pipeline, you don't need to manually handle model loading, tokenization, and other details. The library takes care of these tasks, allowing you to focus on the specific NLP task you're interested in.

3. **Result Structure**: The result obtained from the pipeline is typically a list of dictionaries. In the sentiment analysis example, we access the first item in the list (`result[0]`) and then retrieve the `'label'` field to get the sentiment label.

Overall, the pipeline provides a streamlined and user-friendly interface for applying pre-trained models to various NLP tasks. It enhances productivity by simplifying the workflow and reducing the amount of boilerplate code needed for common NLP tasks.
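
The log output earlier warns that using a pipeline without specifying a model name and revision is not recommended in production. A minimal sketch of pinning both is shown below. The model name and revision are the defaults reported in that log; the helper name `build_sentiment_classifier` is hypothetical, and the import is placed inside the function only so that defining the helper does not require the library to be loaded.

```python
# Model name and revision taken from the "No model was supplied" log message above.
MODEL_NAME = "distilbert-base-uncased-finetuned-sst-2-english"
MODEL_REVISION = "af0f99b"


def build_sentiment_classifier():
    """Create a sentiment pipeline with an explicit model and revision (hypothetical helper)."""
    from transformers import pipeline  # imported lazily; downloads happen on first call

    return pipeline("sentiment-analysis", model=MODEL_NAME, revision=MODEL_REVISION)
```

Calling `build_sentiment_classifier()` and passing it `text1` should then behave like the unpinned pipeline above, but without depending on whatever default the installed library version happens to choose.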

# Models Example

Here is an explanation of the example:

1. **Introduction**: Hugging Face Transformers supports various pre-trained models for different NLP tasks. This example loads one of them directly instead of going through `pipeline`.

2. **Model Specification**: `model_name = "nlptown/bert-base-multilingual-uncased-sentiment"` specifies the name of the pre-trained model to be used. In this case, it's a BERT-based model fine-tuned for sentiment analysis.

3. **Tokenizer and Model Loading**: `tokenizer = AutoTokenizer.from_pretrained(model_name)` and `model = AutoModelForSequenceClassification.from_pretrained(model_name)` load the tokenizer and the model that belong to the specified model name.

4. **Input Text**: `text = "Hugging Face Transformers is incredibly versatile and user-friendly."` defines the input text for which we want to perform sentiment analysis.

5. **Tokenization**: `inputs = tokenizer(text, return_tensors="pt", truncation=True)` tokenizes the input text with the loaded tokenizer and prepares the inputs for the model. The `return_tensors="pt"` option ensures that PyTorch tensors are returned.

6. **Model Inference**: `outputs = model(**inputs)` performs a forward pass through the model using the tokenized input. The predictions are stored in the `outputs` variable.

7. **Print Results**: The script prints the model name, the tokenized input, and the model output (logits).

In summary, this example demonstrates the process of loading a pre-trained sequence classification model, tokenizing an input text, and obtaining model predictions. The Hugging Face Transformers library simplifies these tasks by providing pre-trained models and a consistent interface for working with them.

In [9]:
# Specify the pre-trained model name
model_name = "nlptown/bert-base-multilingual-uncased-sentiment"

# Load the tokenizer and model from HF Transformers
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Define the input text
text = "Hugging Face Transformers is incredibly versatile and user-friendly."

# Tokenize the input text and prepare inputs for the model
inputs = tokenizer(text, return_tensors="pt", truncation=True)

# Forward pass through the model to obtain predictions
outputs = model(**inputs)


In [10]:
# Print the model information, tokenized input, and model output
print(f"Model: {model_name}")
print("Tokenized Input:", inputs)
print("Model Output:", outputs.logits)

Model: nlptown/bert-base-multilingual-uncased-sentiment
Tokenized Input: {'input_ids': tensor([[  101, 18368, 68635, 12828, 58263, 10127, 13565, 83887, 28634, 89916,
         42600, 10110, 24934,   118, 35751,   119,   102]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}
Model Output: tensor([[-2.5640, -2.6340, -1.2598,  1.8034,  3.6703]],
       grad_fn=<AddmmBackward0>)
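
The logits above are raw, unnormalized scores, one per class. A minimal sketch of turning them into probabilities with a softmax and picking the most likely class is shown below. It uses plain Python on the values copied from the output rather than the tensor itself. The label names are an assumption (index 0 = "1 star" up to index 4 = "5 stars"); the authoritative mapping should be read from `model.config.id2label`.

```python
import math

# Logits copied from the model output above.
logits = [-2.5640, -2.6340, -1.2598, 1.8034, 3.6703]

# Assumed label order; check model.config.id2label for the real mapping.
labels = ["1 star", "2 stars", "3 stars", "4 stars", "5 stars"]


def softmax(values):
    """Convert raw scores into probabilities that sum to 1."""
    shift = max(values)  # subtract the maximum for numerical stability
    exps = [math.exp(v - shift) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


probs = softmax(logits)
best = max(range(len(probs)), key=probs.__getitem__)  # index of the largest probability

for label, p in zip(labels, probs):
    print(f"{label}: {p:.3f}")
print("Predicted:", labels[best])
```

Under the assumed label order, the largest logit (3.6703, the last entry) corresponds to the highest rating, which matches the clearly positive input sentence.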
