# Hugging Face

### In the transformers library, a pipeline is a high-level abstraction that allows you to perform various tasks using pre-trained models with minimal code. The pipeline API simplifies the process of loading models, tokenizers, and other components required for specific natural language processing (NLP) tasks. It abstracts away much of the complexity, allowing you to focus on the task itself rather than the details of the underlying model architecture or data processing.

In [2]:
pip install transformers

Collecting transformers
  Downloading transformers-4.44.0-py3-none-any.whl (9.5 MB)
Note: you may need to restart the kernel to use updated packages.
Collecting tokenizers<0.20,>=0.19
  Downloading tokenizers-0.19.1-cp39-none-win_amd64.whl (2.2 MB)
Collecting huggingface-hub<1.0,>=0.23.2
  Downloading huggingface_hub-0.24.5-py3-none-any.whl (417 kB)
Collecting safetensors>=0.4.1
  Downloading safetensors-0.4.4-cp39-none-win_amd64.whl (286 kB)
Collecting fsspec>=2023.5.0
  Downloading fsspec-2024.6.1-py3-none-any.whl (177 kB)
Installing collected packages: fsspec, huggingface-hub, tokenizers, safetensors, transformers
  Attempting uninstall: fsspec
    Found existing installation: fsspec 2022.2.0
    Uninstalling fsspec-2022.2.0:
      Successfully uninstalled fsspec-2022.2.0
Successfully installed fsspec-2024.6.1 huggingface-hub-0.24.5 safetensors-0.4.4 tokenizers-0.19.1 transformers-4.44.0


In [9]:
from transformers import pipeline



In [5]:
pip install tf-keras

Collecting tf-keras
  Downloading tf_keras-2.17.0-py3-none-any.whl (1.7 MB)
Installing collected packages: tf-keras
Successfully installed tf-keras-2.17.0
Note: you may need to restart the kernel to use updated packages.


## Text Classification: Classify text into categories.

In [10]:
sentiment_classifier = pipeline("sentiment-analysis")

No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.







In [11]:
sentiment_classifier("I'm so excited to be learning about large language models")

[{'label': 'POSITIVE', 'score': 0.9997096657752991}]

## Named Entity Recognition (NER): Identify entities in a text.

In [12]:
ner = pipeline("ner", model = "dslim/bert-base-NER") # entity recognition

Some weights of the model checkpoint at dslim/bert-base-NER were not used when initializing BertForTokenClassification: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [13]:
ner("Her name is Anna and she works in New York City for Morgan Stanley")

[{'entity': 'B-PER',
  'score': 0.9954881,
  'index': 4,
  'word': 'Anna',
  'start': 12,
  'end': 16},
 {'entity': 'B-LOC',
  'score': 0.99960667,
  'index': 9,
  'word': 'New',
  'start': 34,
  'end': 37},
 {'entity': 'I-LOC',
  'score': 0.9993955,
  'index': 10,
  'word': 'York',
  'start': 38,
  'end': 42},
 {'entity': 'I-LOC',
  'score': 0.9995803,
  'index': 11,
  'word': 'City',
  'start': 43,
  'end': 47},
 {'entity': 'B-ORG',
  'score': 0.9957462,
  'index': 13,
  'word': 'Morgan',
  'start': 52,
  'end': 58},
 {'entity': 'I-ORG',
  'score': 0.9979346,
  'index': 14,
  'word': 'Stanley',
  'start': 59,
  'end': 66}]
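
The output above is per token: `B-` marks the first token of an entity and `I-` a continuation, so "New York City" arrives as three separate entries. A minimal sketch of merging them into whole entities, using the predictions copied from the output above (scores omitted); `group_entities` is a hypothetical helper, not part of transformers. The pipeline also accepts an `aggregation_strategy` argument (for example `"simple"`) that does this kind of grouping for you.

```python
# Token-level predictions copied from the NER output above (scores omitted).
text = "Her name is Anna and she works in New York City for Morgan Stanley"
predictions = [
    {"entity": "B-PER", "word": "Anna", "start": 12, "end": 16},
    {"entity": "B-LOC", "word": "New", "start": 34, "end": 37},
    {"entity": "I-LOC", "word": "York", "start": 38, "end": 42},
    {"entity": "I-LOC", "word": "City", "start": 43, "end": 47},
    {"entity": "B-ORG", "word": "Morgan", "start": 52, "end": 58},
    {"entity": "I-ORG", "word": "Stanley", "start": 59, "end": 66},
]


def group_entities(predictions, text):
    """Merge B-/I- tagged tokens into (entity type, surface text) pairs."""
    spans = []  # each span is [entity_type, start_offset, end_offset]
    for pred in predictions:
        prefix, entity_type = pred["entity"].split("-", 1)
        continues_span = bool(
            prefix == "I" and spans and spans[-1][0] == entity_type
        )
        if continues_span:
            spans[-1][2] = pred["end"]  # extend the current span
        else:
            spans.append([entity_type, pred["start"], pred["end"]])
    # Slice the original text so spacing between tokens is preserved.
    return [(etype, text[start:end]) for etype, start, end in spans]


grouped = group_entities(predictions, text)
print(grouped)
```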

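## Zero-Shot Classification: Classify text against candidate labels without task-specific training.

In [16]: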
zeroshot_classifier = pipeline("zero-shot-classification", model = "facebook/bart-large-mnli")

In [17]:
sequence_to_classify = "one day I will see the world"
candidate_labels = ['travel', 'cooking', 'dancing']

In [18]:
zeroshot_classifier(sequence_to_classify, candidate_labels)

{'sequence': 'one day I will see the world',
 'labels': ['travel', 'dancing', 'cooking'],
 'scores': [0.9938651323318481, 0.0032737599685788155, 0.0028610422741621733]}
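
A short sketch of reading this result, using the dictionary copied from the output above. The labels come back sorted by descending score, so they are in a different order from `candidate_labels`. In the default single-label mode the scores sum to roughly 1. The variable names below are illustrative only.

```python
# Result dictionary copied from the zero-shot output above.
result = {
    "sequence": "one day I will see the world",
    "labels": ["travel", "dancing", "cooking"],
    "scores": [0.9938651323318481, 0.0032737599685788155, 0.0028610422741621733],
}

# Pair each label with its score; the two lists are aligned by position.
label_scores = dict(zip(result["labels"], result["scores"]))

# The highest-scoring label is the predicted class.
top_label = max(label_scores, key=label_scores.get)

# In single-label mode the scores behave like a probability distribution.
total = sum(result["scores"])

print(top_label, round(total, 6))
```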

# Pre-trained Tokenizers

In [19]:
from transformers import AutoTokenizer

In [20]:
model = 'bert-base-uncased'

In [21]:
tokenizer = AutoTokenizer.from_pretrained(model)

In [22]:
sentence = "I'm so excited to be learning about large language models"

In [23]:
input_ids = tokenizer(sentence)

In [24]:
print(input_ids)

{'input_ids': [101, 1045, 1005, 1049, 2061, 7568, 2000, 2022, 4083, 2055, 2312, 2653, 4275, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}


In [25]:
tokens = tokenizer.tokenize(sentence)

In [26]:
print(tokens)

['i', "'", 'm', 'so', 'excited', 'to', 'be', 'learning', 'about', 'large', 'language', 'models']
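
For this sentence, the tokens above can be reproduced by lowercasing the text and splitting off punctuation. The sketch below does only that, as a rough illustration of the pre-tokenization step. It is not the actual BERT tokenizer, which additionally splits words that are missing from its vocabulary into subword pieces.

```python
import re

sentence = "I'm so excited to be learning about large language models"

# Lowercase (the model is "uncased"), then keep runs of word characters
# and individual punctuation marks as separate tokens.
simple_tokens = re.findall(r"\w+|[^\w\s]", sentence.lower())

print(simple_tokens)
```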


In [27]:
token_ids = tokenizer.convert_tokens_to_ids(tokens)

In [28]:
print(token_ids)

[1045, 1005, 1049, 2061, 7568, 2000, 2022, 4083, 2055, 2312, 2653, 4275]


In [29]:
decoded_ids = tokenizer.decode(token_ids)
print(decoded_ids)

i'm so excited to be learning about large language models
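
Conceptually, `convert_tokens_to_ids` is a vocabulary lookup and `decode` is the reverse lookup followed by joining the tokens back into text. A minimal sketch with a toy vocabulary built from the tokens and IDs printed above. The single `replace` call is a stand-in (an assumption that only fits this sentence) for the clean-up rules the real tokenizer applies.

```python
# Tokens and IDs copied from the outputs above.
tokens = ['i', "'", 'm', 'so', 'excited', 'to', 'be', 'learning',
          'about', 'large', 'language', 'models']
token_ids = [1045, 1005, 1049, 2061, 7568, 2000, 2022, 4083,
             2055, 2312, 2653, 4275]

# Toy vocabulary: token -> ID, and the inverse mapping for decoding.
vocab = dict(zip(tokens, token_ids))
inverse_vocab = {idx: tok for tok, idx in vocab.items()}

# Encoding is a lookup per token.
encoded = [vocab[tok] for tok in tokens]

# Decoding maps IDs back to tokens and joins them with spaces;
# the replace() re-attaches the apostrophe for this example only.
decoded = " ".join(inverse_vocab[idx] for idx in encoded).replace(" ' ", "'")

print(encoded)
print(decoded)
```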


In [30]:
tokenizer.decode(101)

'[CLS]'

In [31]:
tokenizer.decode(102)

'[SEP]'

In [32]:
model2 = "xlnet-base-cased"

In [33]:
tokenizer2 = AutoTokenizer.from_pretrained(model2)

In [34]:
input_ids = tokenizer2(sentence)

In [35]:
print(input_ids)

{'input_ids': [35, 26, 98, 102, 5564, 22, 39, 1899, 75, 392, 1243, 2626, 4, 3], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}


In [36]:
tokens = tokenizer2.tokenize(sentence)
print(tokens)

['▁I', "'", 'm', '▁so', '▁excited', '▁to', '▁be', '▁learning', '▁about', '▁large', '▁language', '▁models']


In [37]:
token_ids = tokenizer2.convert_tokens_to_ids(tokens)
print(token_ids)

[35, 26, 98, 102, 5564, 22, 39, 1899, 75, 392, 1243, 2626]


In [38]:
tokenizer2.decode(4)

'<sep>'

In [39]:
tokenizer2.decode(3)

'<cls>'
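
Comparing the two tokenizers' outputs above shows that they place their special tokens differently: BERT wraps the sequence as `[CLS] ... [SEP]`, while XLNet appends `<sep> <cls>` at the end. A small sketch that rebuilds both `input_ids` lists from the plain token IDs printed earlier. `add_special_tokens` is a hypothetical helper written for illustration, not a transformers function.

```python
def add_special_tokens(token_ids, cls_id, sep_id, cls_first):
    """Attach CLS/SEP IDs in either the BERT or the XLNet layout."""
    if cls_first:
        return [cls_id] + token_ids + [sep_id]  # BERT: [CLS] tokens [SEP]
    return token_ids + [sep_id, cls_id]         # XLNet: tokens <sep> <cls>


# Plain token IDs copied from the outputs above.
bert_ids = [1045, 1005, 1049, 2061, 7568, 2000, 2022, 4083,
            2055, 2312, 2653, 4275]
xlnet_ids = [35, 26, 98, 102, 5564, 22, 39, 1899, 75, 392, 1243, 2626]

bert_full = add_special_tokens(bert_ids, cls_id=101, sep_id=102, cls_first=True)
xlnet_full = add_special_tokens(xlnet_ids, cls_id=3, sep_id=4, cls_first=False)

print(bert_full)
print(xlnet_full)
```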

# What are SEP and CLS tokens?

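### SEP is a separator token and CLS is a classification token. SEP separates different segments of text. In BERT, [CLS] comes at the beginning of the input and [SEP] at the end; in XLNet, as the output above shows, <sep> and <cls> are both appended at the end.

### The mask token is used in masked language modelling: it replaces a token in the text, leaving a blank for the model to fill in.

### The padding token is used to make all inputs in a batch the same length. Truncation shortens longer inputs to a specified maximum length.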
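
A minimal sketch of padding and truncation in plain Python, using the BERT token IDs from earlier. It assumes the padding token has ID 0 (as in `bert-base-uncased`) and pads on the right; `encode_fixed_length` is a hypothetical helper. With the real tokenizer, the same effect is requested through the `padding`, `truncation` and `max_length` arguments.

```python
CLS_ID, SEP_ID = 101, 102  # BERT special-token IDs, as decoded earlier
PAD_ID = 0                 # assumption: [PAD] has ID 0 in bert-base-uncased


def encode_fixed_length(token_ids, max_length):
    """Truncate, add [CLS]/[SEP], then right-pad to exactly max_length."""
    content = token_ids[: max_length - 2]  # leave room for [CLS] and [SEP]
    input_ids = [CLS_ID] + content + [SEP_ID]
    attention_mask = [1] * len(input_ids)  # 1 = real token
    padding = max_length - len(input_ids)
    return {
        "input_ids": input_ids + [PAD_ID] * padding,
        "attention_mask": attention_mask + [0] * padding,  # 0 = padding
    }


token_ids = [1045, 1005, 1049, 2061, 7568, 2000, 2022, 4083,
             2055, 2312, 2653, 4275]

padded = encode_fixed_length(token_ids, max_length=16)    # shorter input: padded
truncated = encode_fixed_length(token_ids, max_length=8)  # longer input: truncated

print(padded)
print(truncated)
```

The attention mask tells the model which positions are padding so that they can be ignored.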

# Hugging Face and PyTorch/TensorFlow

In [1]:
pip install torch torchvision torchaudio

Note: you may need to restart the kernel to use updated packages.


In [40]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

In [41]:
print(sentence)
print(input_ids)

I'm so excited to be learning about large language models
{'input_ids': [35, 26, 98, 102, 5564, 22, 39, 1899, 75, 392, 1243, 2626, 4, 3], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}


In [42]:
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased-finetuned-sst-2-english")

In [43]:
# The PyPI package for PyTorch is named `torch`; `pip install pytorch` fails to build.
# PyTorch was already installed above with `pip install torch torchvision torchaudio`.
pip install torch

In [44]:
input_ids_pt = tokenizer(sentence, return_tensors="pt")  # return the input IDs as PyTorch tensors
print(input_ids_pt)

{'input_ids': tensor([[ 101, 1045, 1005, 1049, 2061, 7568, 2000, 2022, 4083, 2055, 2312, 2653,
         4275,  102]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}


In [45]:
model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased-finetuned-sst-2-english")

To support symlinks on Windows, you either need to activate Developer Mode or to run Python as an administrator. In order to see activate developer mode, see this article: https://docs.microsoft.com/en-us/windows/apps/get-started/enable-your-device-for-development


In [46]:
with torch.no_grad():
    logits = model(**input_ids_pt).logits

predicted_class_id = logits.argmax().item()
model.config.id2label[predicted_class_id]

'POSITIVE'
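
The cell above does by hand what the sentiment pipeline did earlier: the model returns one logit per class, the largest logit gives the predicted class, and a softmax turns the logits into scores like the one the pipeline reported. A plain-Python sketch of that post-processing follows. The logit values are hypothetical, and `id2label` here is an assumed copy of the checkpoint's `model.config.id2label`. Note that `logits.argmax()` without a dimension only works for a single sentence; for a batch, something like `logits.argmax(dim=-1)` would be needed.

```python
import math

# Hypothetical logits for one sentence, ordered [NEGATIVE, POSITIVE].
logits = [-4.0, 4.2]

# Assumption: mirrors model.config.id2label for the SST-2 checkpoint.
id2label = {0: "NEGATIVE", 1: "POSITIVE"}


def softmax(values):
    """Convert logits to probabilities (shifted by the max for stability)."""
    largest = max(values)
    exps = [math.exp(v - largest) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


probabilities = softmax(logits)

# argmax: index of the largest logit.
predicted_class_id = max(range(len(logits)), key=lambda i: logits[i])
predicted_label = id2label[predicted_class_id]

print(predicted_label, probabilities[predicted_class_id])
```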

In [51]:
model_directory = "my_saved_models"

In [52]:
tokenizer.save_pretrained(model_directory)

('my_saved_models\\tokenizer_config.json',
 'my_saved_models\\special_tokens_map.json',
 'my_saved_models\\vocab.txt',
 'my_saved_models\\added_tokens.json',
 'my_saved_models\\tokenizer.json')

In [53]:
my_tokenizer = AutoTokenizer.from_pretrained(model_directory)

In [55]:
model.save_pretrained(model_directory)  # save the model weights and config so they can be reloaded
my_model = AutoModelForSequenceClassification.from_pretrained(model_directory)