## Hugging Face Transformers


In [1]:
from transformers import pipeline




In [2]:
sentiment_classifier = pipeline("sentiment-analysis")


No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision 714eb0f (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.






All PyTorch model weights were used when initializing TFDistilBertForSequenceClassification.

All the weights of TFDistilBertForSequenceClassification were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFDistilBertForSequenceClassification for predictions without further training.


In [3]:
sentiment_classifier("I'm so excited to be learning about large language models")


[{'label': 'POSITIVE', 'score': 0.9997096657752991}]

In [4]:
ner = pipeline("ner", model="dslim/bert-base-NER")


To support symlinks on Windows, you either need to activate Developer Mode or to run Python as an administrator. In order to activate developer mode, see this article: https://docs.microsoft.com/en-us/windows/apps/get-started/enable-your-device-for-development
All PyTorch model weights were used when initializing TFBertForTokenClassification.

All the weights of TFBertForTokenClassification were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFBertForTokenClassification for predictions without further training.
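The `ner` pipeline above is created but never called. Token-classification checkpoints such as `dslim/bert-base-NER` tag each token with BIO-style labels (`B-PER`, `I-PER`, `B-LOC`, and so on). As a rough sketch of how such token-level tags can be merged into entity spans, the snippet below works on a hand-written list of (token, tag) pairs. Both the pairs and the `group_entities` helper are illustrative assumptions; they are not model output and not part of the library.

```python
def group_entities(tagged_tokens):
    """Merge BIO-tagged tokens into (entity_text, entity_type) spans."""
    entities = []
    current_words, current_type = [], None
    for token, tag in tagged_tokens:
        if tag.startswith("B-"):
            # A new entity begins; flush any entity in progress.
            if current_words:
                entities.append((" ".join(current_words), current_type))
            current_words, current_type = [token], tag[2:]
        elif tag.startswith("I-") and current_type == tag[2:]:
            # Continuation of the current entity.
            current_words.append(token)
        else:
            # "O" tag (or an inconsistent I- tag): close the current entity.
            if current_words:
                entities.append((" ".join(current_words), current_type))
            current_words, current_type = [], None
    if current_words:
        entities.append((" ".join(current_words), current_type))
    return entities


# Hypothetical (token, tag) pairs, written by hand for illustration only.
tagged = [
    ("Ada", "B-PER"),
    ("Lovelace", "I-PER"),
    ("lived", "O"),
    ("in", "O"),
    ("London", "B-LOC"),
]
entities = group_entities(tagged)
print(entities)
```

In practice the pipeline can do this merging itself: passing `aggregation_strategy="simple"` to `pipeline("ner", ...)` should return grouped entities rather than per-token tags.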


In [5]:
zeroshot_classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")


Some weights of the PyTorch model were not used when initializing the TF 2.0 model TFBartForSequenceClassification: ['model.encoder.version', 'model.decoder.version']
- This IS expected if you are initializing TFBartForSequenceClassification from a PyTorch model trained on another task or with another architecture (e.g. initializing a TFBertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFBartForSequenceClassification from a PyTorch model that you expect to be exactly identical (e.g. initializing a TFBertForSequenceClassification model from a BertForSequenceClassification model).
All the weights of TFBartForSequenceClassification were initialized from the PyTorch m

In [6]:
sequence_to_classify = "one day I will see the world"
candidate_labels = ['travel', 'cooking', 'dancing']

In [7]:
zeroshot_classifier(sequence_to_classify, candidate_labels)


{'sequence': 'one day I will see the world',
 'labels': ['travel', 'dancing', 'cooking'],
 'scores': [0.9938651323318481, 0.0032737741712480783, 0.0028610252775251865]}
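The zero-shot result is a plain dictionary whose `labels` and `scores` lists are aligned by position and sorted from most to least likely. A minimal sketch of reading it, using the exact values printed above (the variable names `ranked`, `top_label`, and `total` are my own):

```python
# The dictionary returned by the zero-shot pipeline in the cell above.
result = {
    "sequence": "one day I will see the world",
    "labels": ["travel", "dancing", "cooking"],
    "scores": [0.9938651323318481, 0.0032737741712480783, 0.0028610252775251865],
}

# Pair each label with its score and sort explicitly, rather than
# relying on the order in which the pipeline returned them.
ranked = sorted(zip(result["labels"], result["scores"]), key=lambda pair: pair[1], reverse=True)
top_label, top_score = ranked[0]

# With the default single-label setting the scores behave like a
# probability distribution over the candidate labels.
total = sum(result["scores"])

print(top_label, round(top_score, 4))
print(round(total, 4))
```

The scores sum to roughly 1 here because the candidate labels are treated as mutually exclusive by default. If several labels can apply at once, the pipeline's `multi_label=True` option scores each label independently, and the sum is then no longer constrained.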

## Pre-trained Tokenizers


In [8]:
from transformers import AutoTokenizer


In [9]:
model = "bert-base-uncased"


In [10]:
tokenizer = AutoTokenizer.from_pretrained(model)




In [11]:
sentence = "I'm so excited to be learning about large language models"


In [12]:
input_ids = tokenizer(sentence)
print(input_ids)

{'input_ids': [101, 1045, 1005, 1049, 2061, 7568, 2000, 2022, 4083, 2055, 2312, 2653, 4275, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}


In [13]:
tokens = tokenizer.tokenize(sentence)


In [14]:
print(tokens)


['i', "'", 'm', 'so', 'excited', 'to', 'be', 'learning', 'about', 'large', 'language', 'models']


In [15]:
token_ids = tokenizer.convert_tokens_to_ids(tokens)


In [16]:
print(token_ids)


[1045, 1005, 1049, 2061, 7568, 2000, 2022, 4083, 2055, 2312, 2653, 4275]


In [17]:

decoded_ids = tokenizer.decode(token_ids)
print(decoded_ids)

i ' m so excited to be learning about large language models


In [18]:
tokenizer.decode(101)


'[CLS]'

In [19]:
tokenizer.decode(102)


'[SEP]'
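Comparing the outputs above shows what calling the tokenizer directly adds on top of `tokenize` plus `convert_tokens_to_ids`: the same ids, wrapped in `[CLS]` (101) at the start and `[SEP]` (102) at the end. The sketch below checks this with the values printed in this notebook; it is plain Python and does not call the library.

```python
CLS_ID, SEP_ID = 101, 102  # decoded above as '[CLS]' and '[SEP]'

# Output of tokenizer.convert_tokens_to_ids(tokens) from the cells above.
token_ids = [1045, 1005, 1049, 2061, 7568, 2000, 2022, 4083, 2055, 2312, 2653, 4275]

# 'input_ids' from tokenizer(sentence) in the cells above.
full_input_ids = [101, 1045, 1005, 1049, 2061, 7568, 2000, 2022, 4083, 2055, 2312, 2653, 4275, 102]

# BERT-style single-sentence layout: [CLS] tokens... [SEP]
rebuilt = [CLS_ID] + token_ids + [SEP_ID]
print(rebuilt == full_input_ids)

# One attention-mask entry per id; all ones because nothing is padded.
attention_mask = [1] * len(rebuilt)
print(len(attention_mask))
```

This also explains why decoding `token_ids` gave lowercase text with no special tokens: `bert-base-uncased` lowercases its input, and the special tokens were never part of that list.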

In [20]:
model2 = "xlnet-base-cased"


In [21]:
tokenizer2 = AutoTokenizer.from_pretrained(model2)




In [22]:
input_ids = tokenizer2(sentence)


In [23]:
print(input_ids)


{'input_ids': [35, 26, 98, 102, 5564, 22, 39, 1899, 75, 392, 1243, 2626, 4, 3], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}


In [24]:
tokens = tokenizer2.tokenize(sentence)
print(tokens)

['▁I', "'", 'm', '▁so', '▁excited', '▁to', '▁be', '▁learning', '▁about', '▁large', '▁language', '▁models']


In [25]:
token_ids = tokenizer2.convert_tokens_to_ids(tokens)
print(token_ids)

[35, 26, 98, 102, 5564, 22, 39, 1899, 75, 392, 1243, 2626]


In [26]:
tokenizer2.decode(4)


'<sep>'

In [27]:
tokenizer2.decode(3)


'<cls>'
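XLNet's tokenizer differs from BERT's in two visible ways: the special tokens go at the end of the sequence (`<sep>` then `<cls>`), and word boundaries are marked with a leading `▁` instead of being implied. A small sketch using only the values printed above; the string-joining step is a simplification of what `decode` does, not the library's implementation.

```python
SEP_ID, CLS_ID = 4, 3  # decoded above as '<sep>' and '<cls>'

# Output of tokenizer2.tokenize(sentence) and convert_tokens_to_ids above.
tokens = ['▁I', "'", 'm', '▁so', '▁excited', '▁to', '▁be', '▁learning',
          '▁about', '▁large', '▁language', '▁models']
token_ids = [35, 26, 98, 102, 5564, 22, 39, 1899, 75, 392, 1243, 2626]

# 'input_ids' from tokenizer2(sentence) above.
full_input_ids = [35, 26, 98, 102, 5564, 22, 39, 1899, 75, 392, 1243, 2626, 4, 3]

# XLNet-style single-sentence layout: tokens... <sep> <cls>
rebuilt = token_ids + [SEP_ID, CLS_ID]
print(rebuilt == full_input_ids)

# Simplified detokenization: concatenate pieces, then turn each
# word-boundary marker into a space.
text = "".join(tokens).replace("▁", " ").strip()
print(text)
```

Note that id 102 means `[SEP]` in the BERT vocabulary but an ordinary word piece here, so ids are only meaningful together with the tokenizer that produced them.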

## Hugging Face and PyTorch/TensorFlow


Collecting torch
  Downloading torch-2.5.1-cp312-cp312-win_amd64.whl.metadata (28 kB)
Collecting networkx (from torch)
  Downloading networkx-3.4.2-py3-none-any.whl.metadata (6.3 kB)
Collecting sympy==1.13.1 (from torch)
  Downloading sympy-1.13.1-py3-none-any.whl.metadata (12 kB)
Collecting mpmath<1.4,>=1.1.0 (from sympy==1.13.1->torch)
  Downloading mpmath-1.3.0-py3-none-any.whl.metadata (8.6 kB)
Downloading torch-2.5.1-cp312-cp312-win_amd64.whl (203.0 MB)

In [30]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

In [31]:

print(sentence)
print(input_ids)

I'm so excited to be learning about large language models
{'input_ids': [35, 26, 98, 102, 5564, 22, 39, 1899, 75, 392, 1243, 2626, 4, 3], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}


In [32]:
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased-finetuned-sst-2-english")




In [33]:
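# return_tensors="pt" needs PyTorch to be visible to transformers. If torch was
# installed after transformers was first imported, restarting the kernel and
# re-running the notebook is likely required before this cell succeeds.
input_ids_pt = tokenizer(sentence, return_tensors="pt")
print(input_ids_pt)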

ImportError: Unable to convert output to PyTorch tensors format, PyTorch is not installed.

In [34]:
model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased-finetuned-sst-2-english")


ImportError: 
AutoModelForSequenceClassification requires the PyTorch library but it was not found in your environment.
However, we were able to find a TensorFlow installation. TensorFlow classes begin
with "TF", but are otherwise identically named to our PyTorch classes. This
means that the TF equivalent of the class you tried to import would be "TFAutoModelForSequenceClassification".
If you want to use TensorFlow, please use TF classes instead!

If you really do want to use PyTorch please go to
https://pytorch.org/get-started/locally/ and follow the instructions that
match your environment.
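The two errors above come from the environment rather than from the code: the library could not see PyTorch when it was loaded, even though a TensorFlow installation was found. One way to make a notebook robust to this is to check which backend is installed before choosing a `return_tensors` value. The helper below is a sketch using only the standard library; `pick_tensor_format` is a hypothetical name, while `"pt"`, `"tf"`, and `"np"` are the format strings the tokenizer accepts.

```python
import importlib.util


def pick_tensor_format():
    """Return a return_tensors value that matches an installed backend."""
    # find_spec returns None when a top-level package cannot be found.
    if importlib.util.find_spec("torch") is not None:
        return "pt"
    if importlib.util.find_spec("tensorflow") is not None:
        return "tf"
    # NumPy arrays as a fallback when neither framework is present.
    return "np"


tensor_format = pick_tensor_format()
print(tensor_format)
```

This check only reports what is installed right now. Since the library appears to detect its backends when it is first imported, a package installed in the middle of a session will probably still need a kernel restart before `return_tensors="pt"` works.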


In [35]:
with torch.no_grad():
    logits = model(**input_ids_pt).logits

predicted_class_id = logits.argmax().item()
model.config.id2label[predicted_class_id]

NameError: name 'input_ids_pt' is not defined
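Because the cell above did not run, it may help to see what it is meant to compute. The model returns one raw logit per class; `argmax` picks the largest, and `id2label` maps that index to a name. Applying softmax to the same logits gives a score like the one the `sentiment-analysis` pipeline reported earlier. The sketch below reproduces those steps in plain Python. The logit values are made up for illustration, and the `id2label` mapping is what I assume this checkpoint's config contains.

```python
import math

# Assumed label mapping for the SST-2 sentiment checkpoint.
id2label = {0: "NEGATIVE", 1: "POSITIVE"}

# Made-up logits for one sentence, one value per class.
logits = [-4.0, 4.1]


def softmax(values):
    """Convert raw logits into probabilities that sum to 1."""
    shift = max(values)  # subtract the max for numerical stability
    exps = [math.exp(v - shift) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


probs = softmax(logits)

# Equivalent of logits.argmax().item(): index of the largest value.
predicted_class_id = max(range(len(probs)), key=probs.__getitem__)
label = id2label[predicted_class_id]

print({"label": label, "score": probs[predicted_class_id]})
```

Softmax does not change which class wins, so taking `argmax` over the logits directly, as the original cell does, gives the same label without the extra step.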

## Saving and loading models

In [None]:
model_directory = "my_saved_models"

In [None]:
tokenizer.save_pretrained(model_directory)


In [None]:
model.save_pretrained(model_directory)


In [None]:
my_tokenizer = AutoTokenizer.from_pretrained(model_directory)


In [None]:
my_model = AutoModelForSequenceClassification.from_pretrained(model_directory)
