<a id = 'top'></a>

#  A quick-start guide to BERT-like models with Hugging Face Transformers
  * A. [What is BERT?](#introBERT)
  * B. [Pipelines](#pipelines)    
      * 1. [Sentiment Classification](#sentimentClass)
      * 2. [Token Classification (Named Entity Recognition, Part-of-Speech tagging)](#tokenClass)
      * 3. [Question-Answering](#questionAnswer)
      * 4. [Masked Language Modeling](#MLModel)
      * 5. [Translation](#translation)
      


  [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/datasci-w266/2024-spring-main/blob/master/materials/walkthrough_notebooks/bert_as_black_box/HuggingFaceThreeWays_1_Pipelines.ipynb)

Hugging Face is a company that offers a library called Transformers as well as pre-trained models geared toward a variety of tasks.  We are going to explore several ways of working with these models at a very high level.  In later classes, once we have covered how a transformer works, we'll come back and look at them in more depth.  This tutorial looks at the Hugging Face library at the same level as Keras rather than at the lower level of TensorFlow.  We'll take full advantage of a number of high-level abstractions they've created to make their models easier to use.

Note that Hugging Face supports both TensorFlow and an alternative framework called PyTorch.  The default framework for Hugging Face is PyTorch, and many of the models are also ported to TensorFlow.  When using Hugging Face, pay attention to which version you're using.  When the model class is a TensorFlow one, its name usually begins with TF, as in TFBert or TFDistilBert.  If there is no TF at the beginning of the class name, it is generally a PyTorch class.
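
The naming rule above can be written as a one-line check. This is only a sketch of the convention, not something the library provides: `framework_for` is a hypothetical helper, and the rule is a heuristic that assumes you are choosing between TensorFlow and PyTorch classes only.

```python
def framework_for(class_name: str) -> str:
    """Guess the backend from a model class name using the TF-prefix convention."""
    # Heuristic only: a leading "TF" marks a TensorFlow class, anything else is
    # treated as PyTorch (other backends are deliberately ignored here).
    return "TensorFlow" if class_name.startswith("TF") else "PyTorch"


for name in ["TFBertModel", "BertModel", "TFDistilBertModel", "DistilBertModel"]:
    print(f"{name:>18} -> {framework_for(name)}")
```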


---

This directory covers three different ways of using the Hugging Face library.  They are kept separate because the classes and abstractions each one relies on are incompatible with each other.

[Return to Top](#top)
 <a id = 'introBERT'></a>
# What is BERT?
This notebook leverages one of a variety of BERT models.  BERT models can be described in terms of three parts.  The first part is a component named a transformer.  These can grow to be quite large.  The second part is the pre-training the model has already received on language data.  The third part is the set of tasks it is geared toward performing.  Different models use different size transformers and may be optimized for different languages and different tasks.  For example, CamemBERT is trained on French text and SciBERT is trained on scientific journal articles.  You'll want to make sure you use a model appropriate to your language and task.

---

The [HuggingFace web site](https://huggingface.co/transformers) offers an interesting set of resources.  Their [model documentation](https://huggingface.co/transformers/model_summary.html) provides an excellent explanation of transformers as well as the growing variety of models they offer (see the left hand navigation column).  In addition, their collection of [notebooks](https://huggingface.co/transformers/notebooks.html) is a valuable set of examples.  Their [blog](https://huggingface.co/blog) has some interesting and useful posts about how transformers of all varieties work. Finally, they offer [an excellent set of tutorials](https://huggingface.co/docs/transformers/index) as well as a Quick Start Guide with videos on the use of the full Hugging Face family of resources.

---

One word of caution: this is a rapidly evolving resource, and as a result you can often run into bugs.  They usually get fixed eventually, but things may be buggy for a while.

In [None]:
!pip install -q transformers

In [None]:
!pip install -q xformers

[Return to Top](#top)
 <a id = 'pipelines'></a>
# Pipelines

In [None]:
from __future__ import print_function
import ipywidgets as widgets
from transformers import pipeline

The pipeline interface provides a very abstract and simple API that allows you to experiment with several different NLP tasks without having to do any training at all.  These can be useful if you have a limited set of tests or experiments you want to try.  Some of the supported tasks include:


*   Sentiment Classification
*   Token Classification (Named Entity Recognition, Part-of-Speech tagging)
*   Question-Answering
*   Masked Language Modeling
*   Translation


In its simplest and most abstract form, we will use two commands.  First, instantiate a pipeline object and specify the task.  Second, feed the pipeline the appropriate input and get an answer.

[Return to Top](#top)
 <a id = 'sentimentClass'></a>
### Sentiment Analysis

Sentiment analysis takes sentences as input and classifies each one into either two categories (positive and negative) or three categories (positive, negative, and neutral), depending on the sentiment expressed in the sentence.

In [None]:
nlp_sentence_classif = pipeline(task='sentiment-analysis') # you aren't required to specify a model as it is configured with a default
nlp_sentence_classif('This NLP stuff is very cool !') #a very positive statement should yield a high positive score

No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.


[{'label': 'POSITIVE', 'score': 0.9998658895492554}]
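
The warning in the output above recommends naming the model and revision explicitly. A minimal sketch of doing so follows, assuming the `model` and `revision` keyword arguments of `pipeline` and reusing the model name and revision hash printed in that warning. `PINNED` and `build_classifier` are hypothetical names introduced here for illustration.

```python
# Model name and revision copied from the warning printed by the default pipeline.
PINNED = {
    "task": "sentiment-analysis",
    "model": "distilbert-base-uncased-finetuned-sst-2-english",
    "revision": "af0f99b",
}


def build_classifier():
    """Create the sentiment pipeline with an explicit model and revision."""
    # Imported inside the function so the settings above can be inspected
    # without loading the library or downloading anything.
    from transformers import pipeline

    return pipeline(**PINNED)


# Usage (downloads the model on first call):
# nlp_sentence_classif = build_classifier()
# nlp_sentence_classif('This NLP stuff is very cool !')
```

Pinning both values means a later change to the library's default model should not silently change your results.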

These two lines of code hide a lot of what's happening under the hood.  The input text is converted into tokens that the underlying model can understand.  The model is called with that set of tokens.  The result is converted back into a label or text that can be understood by a user.
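
To make those three stages concrete, here is a toy sketch in plain Python. Nothing in it is the library's actual code: the vocabulary, the scoring rule in `toy_model`, and the label order in `ID2LABEL` are all assumptions chosen for illustration. Only the overall shape (tokenize, run the model, post-process) mirrors the description above.

```python
import math

# Hypothetical vocabulary and label order, for illustration only.
TOY_VOCAB = {"[UNK]": 0, "this": 1, "nlp": 2, "stuff": 3,
             "is": 4, "very": 5, "cool": 6, "!": 7}
ID2LABEL = {0: "NEGATIVE", 1: "POSITIVE"}


def tokenize(text):
    """Stage 1: turn text into integer token ids (lowercase, whitespace split)."""
    return [TOY_VOCAB.get(tok, TOY_VOCAB["[UNK]"]) for tok in text.lower().split()]


def toy_model(token_ids):
    """Stage 2: stand-in for the network; returns one raw score (logit) per label."""
    # Made-up rule: each occurrence of "cool" adds 2.0 to the positive logit.
    return [0.0, 2.0 * token_ids.count(TOY_VOCAB["cool"])]


def postprocess(logits):
    """Stage 3: softmax the logits, then report the best label and its probability."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    probs = [e / sum(exps) for e in exps]
    best = max(range(len(probs)), key=probs.__getitem__)
    return {"label": ID2LABEL[best], "score": probs[best]}


result = postprocess(toy_model(tokenize("This NLP stuff is very cool !")))
print(result)
```

The returned dictionary has the same `label` and `score` keys as the pipeline output, which is the part a user actually sees.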


[Return to Top](#top)
<a id = 'tokenClass'></a>
### Token Classification

Token classification is a task where each token is assigned a label (i.e., classified).  For example, in part-of-speech tagging you might assign a label of article, noun, adjective, preposition, verb, or other to each token.  Named entity recognition assigns a label to each token in the token stream to identify the "entities" mentioned in the text, such as organizations and locations.

In [None]:
nlp_token_class = pipeline(task='ner')
nlp_token_class('The iSchool is a part of UC Berkeley in the state of California.')

No model was supplied, defaulted to dbmdz/bert-large-cased-finetuned-conll03-english and revision f2482bf (https://huggingface.co/dbmdz/bert-large-cased-finetuned-conll03-english).
Using a pipeline without specifying a model name and revision in production is not recommended.


Some weights of the model checkpoint at dbmdz/bert-large-cased-finetuned-conll03-english were not used when initializing BertForTokenClassification: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


[{'entity': 'I-ORG',
  'score': 0.98272264,
  'index': 2,
  'word': 'i',
  'start': 4,
  'end': 5},
 {'entity': 'I-ORG',
  'score': 0.9763638,
  'index': 3,
  'word': '##S',
  'start': 5,
  'end': 6},
 {'entity': 'I-ORG',
  'score': 0.9488824,
  'index': 4,
  'word': '##cho',
  'start': 6,
  'end': 9},
 {'entity': 'I-ORG',
  'score': 0.9534883,
  'index': 5,
  'word': '##ol',
  'start': 9,
  'end': 11},
 {'entity': 'I-ORG',
  'score': 0.9975151,
  'index': 10,
  'word': 'UC',
  'start': 25,
  'end': 27},
 {'entity': 'I-ORG',
  'score': 0.9888736,
  'index': 11,
  'word': 'Berkeley',
  'start': 28,
  'end': 36},
 {'entity': 'I-LOC',
  'score': 0.99658054,
  'index': 16,
  'word': 'California',
  'start': 53,
  'end': 63}]
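
The output above labels word pieces such as `i`, `##S`, `##cho`, and `##ol` separately. One way to read it is to merge neighbouring pieces into whole entities. The sketch below does that in plain Python using the `index`, `start`, and `end` fields copied from the output (scores omitted). `group_entities` is a hypothetical helper written for this walkthrough, and the merge rule (same label and consecutive token index) is an assumption. The library also offers an `aggregation_strategy` argument on the `ner` pipeline for this purpose, which may group tokens differently.

```python
TEXT = "The iSchool is a part of UC Berkeley in the state of California."

# Fields copied from the pipeline output above; scores and word pieces omitted.
TOKENS = [
    {"entity": "I-ORG", "index": 2, "start": 4, "end": 5},
    {"entity": "I-ORG", "index": 3, "start": 5, "end": 6},
    {"entity": "I-ORG", "index": 4, "start": 6, "end": 9},
    {"entity": "I-ORG", "index": 5, "start": 9, "end": 11},
    {"entity": "I-ORG", "index": 10, "start": 25, "end": 27},
    {"entity": "I-ORG", "index": 11, "start": 28, "end": 36},
    {"entity": "I-LOC", "index": 16, "start": 53, "end": 63},
]


def group_entities(tokens, text):
    """Merge tokens that share a label and have consecutive token indices."""
    groups = []
    for tok in tokens:
        label = tok["entity"].split("-")[-1]  # "I-ORG" -> "ORG"
        last = groups[-1] if groups else None
        if last and last["label"] == label and tok["index"] == last["last_index"] + 1:
            # Extend the current entity to cover this token.
            last["end"] = tok["end"]
            last["last_index"] = tok["index"]
        else:
            groups.append({"label": label, "start": tok["start"],
                           "end": tok["end"], "last_index": tok["index"]})
    # Recover each entity's surface text from the character offsets.
    return [{"entity_group": g["label"], "word": text[g["start"]:g["end"]]}
            for g in groups]


entities = group_entities(TOKENS, TEXT)
for entity in entities:
    print(entity)
```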

[Return to Top](#top)
<a id = 'questionAnswer'></a>
### Question Answering

The question answering task tries to identify the answer to a question within a context paragraph that is fed into the system along with the question.  One formulation seeks to do this by tagging the tokens in the context paragraph as being outside the answer span or inside the answer span.  As noted, the question answering task requires two inputs:


*   The context paragraph
*   The question to be answered



In [None]:
nlp_question_answer = pipeline(task='question-answering')
nlp_question_answer(context='The iSchool is a part of UC Berkeley in the state of California.', question='In which state is the iSchool located ?')

No model was supplied, defaulted to distilbert-base-cased-distilled-squad and revision 626af31 (https://huggingface.co/distilbert-base-cased-distilled-squad).
Using a pipeline without specifying a model name and revision in production is not recommended.


{'score': 0.9833861589431763, 'start': 53, 'end': 63, 'answer': 'California'}
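
The `start` and `end` values in the result above are character offsets into the context string, so the answer can be recovered by slicing. A short check using the values copied from that output (the bracket highlighting is just an illustration):

```python
context = 'The iSchool is a part of UC Berkeley in the state of California.'

# Result dictionary copied from the pipeline output above.
result = {'score': 0.9833861589431763, 'start': 53, 'end': 63, 'answer': 'California'}

# Slicing the context with the offsets reproduces the answer text.
span = context[result['start']:result['end']]
print(span)

# Mark the answer in place to see where it sits in the paragraph.
highlighted = context[:result['start']] + '[' + span + ']' + context[result['end']:]
print(highlighted)
```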

[Return to Top](#top)
<a id = 'MLModel'></a>
### Masked Language Modeling

The masked language modeling task is like a fill-in-the-blank test.  You provide a sentence but you "mask" a word.  The model then provides a set of candidate answers: words that could fill in the blank, arranged in order from highest to lowest probability.

In [None]:
nlp_mlm = pipeline(task='fill-mask', model='distilroberta-base')
nlp_mlm('UC Berkeley is located in ' + nlp_mlm.tokenizer.mask_token)


Some weights of the model checkpoint at distilroberta-base were not used when initializing RobertaForMaskedLM: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
- This IS expected if you are initializing RobertaForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


[{'score': 0.3897515833377838,
  'token': 10817,
  'token_str': ' Berkeley',
  'sequence': 'UC Berkeley is located in Berkeley'},
 {'score': 0.10624530166387558,
  'token': 5147,
  'token_str': ' Oakland',
  'sequence': 'UC Berkeley is located in Oakland'},
 {'score': 0.08518118411302567,
  'token': 886,
  'token_str': ' California',
  'sequence': 'UC Berkeley is located in California'},
 {'score': 0.03185350447893143,
  'token': 20738,
  'token_str': ' Irvine',
  'sequence': 'UC Berkeley is located in Irvine'},
 {'score': 0.02619953267276287,
  'token': 7759,
  'token_str': ' Sacramento',
  'sequence': 'UC Berkeley is located in Sacramento'}]
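
Two details of the output above are easy to miss. The `token_str` values carry a leading space, which appears to come from how this tokenizer marks the start of a word, and the five scores do not sum to 1 because they are probabilities over the whole vocabulary, not just over the candidates shown. The sketch below checks both using values copied from that output (the `token` and `sequence` fields are omitted).

```python
# Scores and token strings copied from the fill-mask output above.
candidates = [
    {'score': 0.3897515833377838, 'token_str': ' Berkeley'},
    {'score': 0.10624530166387558, 'token_str': ' Oakland'},
    {'score': 0.08518118411302567, 'token_str': ' California'},
    {'score': 0.03185350447893143, 'token_str': ' Irvine'},
    {'score': 0.02619953267276287, 'token_str': ' Sacramento'},
]

# Strip the leading space to get clean candidate words.
words = [c['token_str'].strip() for c in candidates]

# Confirm the candidates arrive ordered from most to least probable.
is_sorted = all(a['score'] >= b['score'] for a, b in zip(candidates, candidates[1:]))

# Probability mass covered by the five candidates shown.
top5_mass = sum(c['score'] for c in candidates)

print(words)
print(f"sorted: {is_sorted}, top-5 probability mass: {top5_mass:.3f}")
```

Roughly a third of the probability mass lies on words outside the top five, which is a useful reminder that the list is a ranking and not an exhaustive set of answers.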

[Return to Top](#top)
<a id = 'translation'></a>
### Translation

The translation pipeline used here takes a sentence in English as input and emits a translation in the specified language, in this case French.  With this model, the pipeline supports only a very limited set of source and target languages.  To translate between other languages, you need a model trained on those languages, whether a pre-trained one or one you train yourself.

In [None]:
translator = pipeline(task='translation_en_to_fr', model='t5-base')
translator("I love studying NLP in the MIDS program .")
#translator("J'aime bien etudier la NLP dans le programme MIDS .")

For now, this behavior is kept to avoid breaking backwards compatibility when padding/encoding with `truncation is True`.
- Be aware that you SHOULD NOT rely on t5-base automatically truncating your input to 512 when padding/encoding.
- If you want to encode/pad to sequences longer than 512 you can either instantiate this tokenizer with `model_max_length` or pass `max_length` when encoding/padding.


[{'translation_text': "J'aime étudier la NLP dans le programme MIDS ."}]
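
The task name `translation_en_to_fr` encodes the source and target languages. As far as we understand it, T5 checkpoints handle translation by prepending a plain-text instruction such as "translate English to French: " to the input, and the pipeline adds that prefix for you. The sketch below rebuilds that prompt in plain Python. `t5_prompt` and `LANGUAGE_NAMES` are hypothetical names, and both the exact prefix wording and the list of language codes are assumptions, not values read from the library.

```python
# Assumed mapping from language codes to the names used in the prompt.
LANGUAGE_NAMES = {"en": "English", "fr": "French", "de": "German", "ro": "Romanian"}


def t5_prompt(task: str, text: str) -> str:
    """Build a T5-style translation prompt from a task name like 'translation_en_to_fr'."""
    parts = task.split("_")
    if len(parts) != 4 or parts[0] != "translation" or parts[2] != "to":
        raise ValueError(f"not a translation task name: {task!r}")
    source, target = parts[1], parts[3]
    return f"translate {LANGUAGE_NAMES[source]} to {LANGUAGE_NAMES[target]}: {text}"


prompt = t5_prompt("translation_en_to_fr", "I love studying NLP in the MIDS program .")
print(prompt)
```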