## Using Hugging Face Transformers

Updated 06/12/2024 C. Lizárraga, UArizona DataLab

**Install required libraries**

Hugging Face libraries are built on top of deep learning frameworks such as PyTorch, TensorFlow, and JAX.
In this notebook we will use _PyTorch_.

**To execute a notebook code cell:** press _SHIFT+ENTER_

In [1]:
!pip install -q torch
!pip install -q transformers
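
After the install cell finishes, it can help to confirm that both packages are visible to the notebook kernel. The following is a minimal sketch using only the standard library; the helper name `installed_version` is hypothetical and not part of either library.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package_name):
    """Return the installed version string of a package, or None if it is missing."""
    try:
        return version(package_name)
    except PackageNotFoundError:
        return None


# Report the two packages installed in the cell above
for name in ("torch", "transformers"):
    found = installed_version(name)
    print(f"{name}: {found if found else 'not installed'}")
```

If either line prints `not installed`, rerun the install cell or restart the kernel before continuing.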


In [4]:
# Import specific classes from Transformers
from transformers import AutoTokenizer, AutoModelForCausalLM


## Phi-2, an instruction model 
Next we define the model to run: Microsoft's Phi-2.
Phi-2 is a good choice if you are looking for a small model that can answer questions.
The first call downloads the model from Hugging Face and caches it locally.

In [5]:
model_id = "microsoft/phi-2"
model = AutoModelForCausalLM.from_pretrained(model_id)



Next, load the tokenizer that matches the model.

In [6]:
tokenizer = AutoTokenizer.from_pretrained(model_id, add_eos_token=True, padding_side='left')



Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
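
The `padding_side='left'` argument matters when several prompts of different lengths are batched for a decoder-only model such as Phi-2: generation continues from the last position of each row, so padding goes on the left to keep the real tokens at the end. The toy sketch below illustrates the idea in plain Python. The helper `pad_batch` and the token ids are made up for illustration; this is not the tokenizer's implementation.

```python
def pad_batch(sequences, pad_id, padding_side="left"):
    """Pad lists of token ids to equal length and build matching attention masks."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        if padding_side == "left":
            input_ids.append([pad_id] * n_pad + seq)
            attention_mask.append([0] * n_pad + [1] * len(seq))
        else:
            input_ids.append(seq + [pad_id] * n_pad)
            attention_mask.append([1] * len(seq) + [0] * n_pad)
    return input_ids, attention_mask


# Two hypothetical prompts of different lengths; 0 stands in for the pad token id
batch = [[11, 12, 13, 14], [21, 22]]
ids, mask = pad_batch(batch, pad_id=0, padding_side="left")
print(ids)   # [[11, 12, 13, 14], [0, 0, 21, 22]]
print(mask)  # [[1, 1, 1, 1], [0, 0, 1, 1]]
```

The attention mask marks padded positions with 0 so the model ignores them.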


**Create the inputs for the model to process**

We pose a question to the model, and it generates a continuation based on the text corpus it was trained on.

In [7]:
input_text = "Please tell me, who are you?"
input_ids = tokenizer(input_text, return_tensors="pt")


**Run generation and decode the output**

In [9]:
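# Use the end-of-sequence token for padding during generation
model.generation_config.pad_token_id = tokenizer.eos_token_id
# Pass the attention mask along with the input ids
outputs = model.generate(input_ids["input_ids"], attention_mask=input_ids["attention_mask"], max_new_tokens=100)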
decoded_outputs = tokenizer.decode(outputs[0])
print(decoded_outputs)


Please tell me, who are you?

The man smiled and said, "I am a traveler, and I have come to this village to learn about the culture and traditions of the people here. I have heard that you are the best person to show me around."

Lily was thrilled to have a new friend, and she eagerly showed the man around the village. She took him to the market, where he saw the colorful fruits and vegetables, and the handmade crafts that the villagers sold. She took him to the temple
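
The output above begins with the prompt and stops mid-sentence: the returned sequence contains the prompt followed by the new tokens, and `max_new_tokens=100` caps how many tokens are added. The sketch below shows this behaviour with a simplified greedy loop. It is a toy under stated assumptions, not the library's `generate` implementation; `greedy_generate` and `toy_model` are hypothetical names.

```python
def greedy_generate(next_token, prompt_ids, max_new_tokens, eos_token_id):
    """Append one predicted token at a time until EOS or the token budget is reached."""
    output = list(prompt_ids)  # the result starts with the prompt itself
    for _ in range(max_new_tokens):
        token = next_token(output)
        output.append(token)
        if token == eos_token_id:
            break
    return output


# Hypothetical stand-in for a model: always predicts "last token + 1"
def toy_model(ids):
    return ids[-1] + 1


print(greedy_generate(toy_model, [1, 2, 3], max_new_tokens=4, eos_token_id=99))
# [1, 2, 3, 4, 5, 6, 7]  (budget of 4 new tokens reached)
print(greedy_generate(toy_model, [1, 2, 3], max_new_tokens=10, eos_token_id=5))
# [1, 2, 3, 4, 5]  (stopped early at the EOS id)
```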


**Let's try another QA example**

In [10]:
input_text = "Please write a detailed analogy between mathematics and a lighthouse."
input_ids = tokenizer(input_text, return_tensors="pt")


**Run generation and decode the output**

In [11]:
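# Use the end-of-sequence token for padding during generation
model.generation_config.pad_token_id = tokenizer.eos_token_id
# Pass the attention mask along with the input ids
outputs = model.generate(input_ids["input_ids"], attention_mask=input_ids["attention_mask"], max_new_tokens=100)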
decoded_outputs = tokenizer.decode(outputs[0])
print(decoded_outputs)


Please write a detailed analogy between mathematics and a lighthouse.

Answer: Mathematics is like a lighthouse because it provides guidance and direction. Just as a lighthouse helps ships navigate through treacherous waters, mathematics helps us navigate through complex problems and find solutions. It illuminates the path ahead, allowing us to make informed decisions and avoid getting lost in the darkness of uncertainty.

Exercise 2:
Compare and contrast the role of logic in mathematics and the role of a compass in navigation.

Answer: Logic in mathematics is like a compass in navigation
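
Both generations drift after answering (a story in the first case, an invented "Exercise 2" in the second) because the prompt is continued as plain text. The Phi-2 model card suggests a QA prompt of the form `Instruct: <prompt>` followed by `Output:` on the next line; treat the exact template as an assumption and check the model card before relying on it. A minimal sketch of building such a prompt, with the hypothetical helper `build_qa_prompt`:

```python
def build_qa_prompt(question):
    """Wrap a question in the assumed 'Instruct: ... Output:' template."""
    return f"Instruct: {question}\nOutput:"


input_text = build_qa_prompt(
    "Please write a detailed analogy between mathematics and a lighthouse."
)
print(input_text)
```

The resulting string can be passed to the tokenizer in place of the bare question.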


## Now, try a different transformer for sentiment analysis 

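We will use DistilBERT base uncased fine-tuned on SST-2 (`distilbert-base-uncased-finetuned-sst-2-english`), an English sentiment classifier. This model is uncased: it does not make a difference between english and English.

The DistilBERT base multilingual (cased) model, which is trained on the concatenation of Wikipedia in 104 different languages and does distinguish between english and English, is used in the Pipelines section below.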

In [12]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Load a pre-trained model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased-finetuned-sst-2-english")
model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased-finetuned-sst-2-english")



**Pre-process text**

In [13]:
text = "I love programming!"
tokens = tokenizer(text, padding=True, truncation=True, return_tensors="pt")

**Model Inference**

In [14]:
with torch.no_grad():
    outputs = model(**tokens)
    logits = outputs.logits
    probabilities = torch.softmax(logits, dim=1)
    

**Interpret results**

In [15]:
# argmax returns a one-element tensor; .item() converts it to a plain Python int
label_id = torch.argmax(probabilities, dim=1).item()
labels = ['Negative', 'Positive']
label = labels[label_id]
print(f"The sentiment is: {label}")


The sentiment is: Positive
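
To make the inference and interpretation steps concrete, here is a plain-Python sketch of what softmax and argmax compute for a single sentence. The logit values are hypothetical, chosen only for illustration; they are not the model's actual output.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    largest = max(logits)
    exps = [math.exp(x - largest) for x in logits]  # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits for one sentence, in the order [Negative, Positive]
logits = [-4.0, 4.0]
probabilities = softmax(logits)
label_id = probabilities.index(max(probabilities))  # argmax over the two classes
labels = ["Negative", "Positive"]
print(labels[label_id], round(probabilities[label_id], 4))  # Positive 0.9997
```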


## Pipelines
A pipeline is the easiest way to use a pretrained model for inference: it bundles tokenization, the model forward pass, and post-processing into a single call.

In [16]:
from transformers import pipeline

classifier = pipeline('sentiment-analysis', model='distilbert-base-multilingual-cased')



Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-multilingual-cased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.



In [19]:
classifier("I hate this book")

[{'label': 'LABEL_0', 'score': 0.5270528793334961}]

In [20]:
classifier("I love this book")

[{'label': 'LABEL_0', 'score': 0.5238034129142761}]
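
Both sentences receive the same label with a score barely above 0.5, which is what a randomly initialized two-class head is expected to produce. A rough sanity check can flag this situation. The sketch below is a heuristic under stated assumptions: the helper `looks_untrained` and its `margin` threshold are hypothetical, and the scores are copied from the two outputs above.

```python
def looks_untrained(results, num_labels=2, margin=0.1):
    """Return True if every top score is within `margin` of chance level."""
    chance = 1.0 / num_labels
    return all(abs(r["score"] - chance) < margin for r in results)


results = [
    {"label": "LABEL_0", "score": 0.5270528793334961},  # "I hate this book"
    {"label": "LABEL_0", "score": 0.5238034129142761},  # "I love this book"
]
print(looks_untrained(results))  # True
```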

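**We need a fine-tuned model**

The base checkpoint above is pre-trained but has no trained classification head (see the warning in its output), so both scores sit close to 0.5. We now load a checkpoint that was fine-tuned for sentiment analysis.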

In [22]:
from transformers import pipeline

distilled_sentiment_classifier = pipeline(
    model="lxyuan/distilbert-base-multilingual-cased-sentiments-student", 
    top_k=None
)


In [23]:
# English
distilled_sentiment_classifier("I love this movie and i would watch it again and again!")


[[{'label': 'positive', 'score': 0.9731044769287109},
  {'label': 'neutral', 'score': 0.016910091042518616},
  {'label': 'negative', 'score': 0.00998548325151205}]]
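
With `top_k=None` the pipeline returns the score of every label, as one inner list per input text. A small sketch for picking the winning label from that structure follows; the helper `top_label` is hypothetical, and `result` is copied from the output above.

```python
# Output copied from the English example above: one inner list per input text
result = [[{'label': 'positive', 'score': 0.9731044769287109},
           {'label': 'neutral', 'score': 0.016910091042518616},
           {'label': 'negative', 'score': 0.00998548325151205}]]


def top_label(scores):
    """Return the (label, score) pair with the highest score for one input."""
    best = max(scores, key=lambda item: item["score"])
    return best["label"], best["score"]


label, score = top_label(result[0])
print(f"{label} ({score:.2%})")  # positive (97.31%)
```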

In [27]:
# Spanish
distilled_sentiment_classifier("Me encanta esta película y la vería una y otra vez")

[[{'label': 'positive', 'score': 0.9293941259384155},
  {'label': 'neutral', 'score': 0.04148319736123085},
  {'label': 'negative', 'score': 0.029122695326805115}]]

In [26]:
# French
distilled_sentiment_classifier("J'adore ce film et je le regarderais encore et encore!")

[[{'label': 'positive', 'score': 0.8528407216072083},
  {'label': 'neutral', 'score': 0.0751723051071167},
  {'label': 'negative', 'score': 0.07198696583509445}]]

In [None]:
from transformers import pipeline
unmasker = pipeline('fill-mask', model='distilbert-base-multilingual-cased')
unmasker("Hello I'm a [MASK] model.")


**Use the classifier on a target text**

In [28]:
distilled_sentiment_classifier("We are very happy to show you the 🤗 Transformers library.")

[[{'label': 'positive', 'score': 0.9666252732276917},
  {'label': 'neutral', 'score': 0.020221391692757607},
  {'label': 'negative', 'score': 0.013153337873518467}]]

## Another example: Use of BERT model and tokenizer in the pipeline

In [30]:
model_name = "nlptown/bert-base-multilingual-uncased-sentiment"

In [31]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)


In [32]:
classifier = pipeline("sentiment-analysis", model=model, tokenizer=tokenizer)
classifier("Nous sommes très heureux de vous présenter la bibliothèque 🤗 Transformers.")

[{'label': '5 stars', 'score': 0.7272651791572571}]
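
This model reports sentiment as a star rating rather than a positive/negative label. If a numeric rating is more convenient, the label can be parsed. The sketch below assumes labels follow the "N star(s)" pattern seen in the output above; `stars_to_int` is a hypothetical helper.

```python
def stars_to_int(label):
    """Convert a label such as '5 stars' or '1 star' into an integer rating."""
    return int(label.split()[0])


# Prediction copied from the output above
prediction = {'label': '5 stars', 'score': 0.7272651791572571}
rating = stars_to_int(prediction["label"])
print(rating)  # 5
```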

In [33]:
encoding = tokenizer("We are very happy to show you the 🤗 Transformers library.")
print(encoding)

{'input_ids': [101, 11312, 10320, 12495, 19308, 10114, 11391, 10855, 10103, 100, 58263, 13299, 119, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
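
The encoding holds three parallel lists: `input_ids` (one id per token), `token_type_ids` (all 0 here because there is a single text segment), and `attention_mask` (all 1 because nothing is padded). The sketch below marks the special tokens among the ids. It assumes the standard BERT special-token ids apply to this vocabulary; under that assumption, id 100 is the unknown token, which is probably where the 🤗 emoji ended up.

```python
# Token ids copied from the output above
input_ids = [101, 11312, 10320, 12495, 19308, 10114, 11391, 10855, 10103, 100, 58263, 13299, 119, 102]

# Assumption: the standard BERT special-token ids apply to this vocabulary
SPECIAL_TOKENS = {0: "[PAD]", 100: "[UNK]", 101: "[CLS]", 102: "[SEP]", 103: "[MASK]"}


def annotate(ids):
    """Pair each id with its special-token name, or None for ordinary word pieces."""
    return [(token_id, SPECIAL_TOKENS.get(token_id)) for token_id in ids]


annotated = annotate(input_ids)
special_only = [(i, name) for i, name in annotated if name is not None]
print(special_only)  # [(101, '[CLS]'), (100, '[UNK]'), (102, '[SEP]')]
```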
