This notebook uses Hugging Face Transformers to try out well-known NLP tasks:
- Sequence Classification
- Extractive Question Answering
- Named Entity Recognition (NER)
- Text Summarization

In [46]:
! pip install transformers datasets



## **Using Transformers for Sequence Classification**

In [47]:
from transformers import pipeline

classifier = pipeline("sentiment-analysis")

result = classifier(" plants and cat respond to sunlight")[0]
print(f"label: {result['label']}, with score: {round(result['score'], 4)}")

No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english)


label: POSITIVE, with score: 0.9973


In [48]:
result = classifier("The interview was long")[0]
print(f"label: {result['label']}, with score: {round(result['score'], 4)}")

label: NEGATIVE, with score: 0.9928


In [49]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased-finetuned-mrpc")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-cased-finetuned-mrpc")

classes = ["not paraphrase", "is paraphrase"]

sequence_0 = "Orange juice sold in stores is filled with sugar"
sequence_1 = "We will go to the moon when you are ready"
sequence_2 = "You can do the split on the moon if you like"

# The tokenizer automatically adds the model-specific special tokens (e.g. [CLS] and [SEP]) to
# the sequence pair and computes the attention mask.
paraphrase = tokenizer(sequence_0, sequence_2, return_tensors="pt")
not_paraphrase = tokenizer(sequence_0, sequence_1, return_tensors="pt")

paraphrase_classification_logits = model(**paraphrase).logits
not_paraphrase_classification_logits = model(**not_paraphrase).logits

paraphrase_results = torch.softmax(paraphrase_classification_logits, dim=1).tolist()[0]
not_paraphrase_results = torch.softmax(not_paraphrase_classification_logits, dim=1).tolist()[0]

# sequence_0 vs. sequence_2: the sentences are unrelated, so expect a low paraphrase score
for i in range(len(classes)):
    print(f"{classes[i]}: {int(round(paraphrase_results[i] * 100))}%")

not paraphrase: 94%
is paraphrase: 6%


In [50]:
# Should not be paraphrase
for i in range(len(classes)):
    print(f"{classes[i]}: {int(round(not_paraphrase_results[i] * 100))}%")

not paraphrase: 96%
is paraphrase: 4%
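The softmax step that turns the model's two logits into the percentages printed above can be sketched in plain Python. This is only an illustration: the `softmax` helper and the logit values are assumptions chosen for the example, not numbers produced by the model.

```python
import math

def softmax(logits):
    # Subtract the max for numerical stability; the result is unchanged.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

classes = ["not paraphrase", "is paraphrase"]
logits = [2.0, -1.0]  # hypothetical logits, not taken from the model

probs = softmax(logits)
for name, p in zip(classes, probs):
    print(f"{name}: {int(round(p * 100))}%")
```

The probabilities always sum to 1, so a high "not paraphrase" score necessarily means a low "is paraphrase" score.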


In [51]:
from transformers import AutoTokenizer, TFAutoModelForSequenceClassification
import tensorflow as tf

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased-finetuned-mrpc")
model = TFAutoModelForSequenceClassification.from_pretrained("bert-base-cased-finetuned-mrpc")

classes = ["not paraphrase", "is paraphrase"]

sequence_0 = "Orange juice sold in stores is filled with sugar"
sequence_1 = "We will go to the moon when you are ready"
sequence_2 = "You can do the split on the moon if you like"

# The tokenizer automatically adds the model-specific special tokens (e.g. [CLS] and [SEP]) to
# the sequence pair and computes the attention mask.
paraphrase = tokenizer(sequence_0, sequence_2, return_tensors="tf")
not_paraphrase = tokenizer(sequence_0, sequence_1, return_tensors="tf")

paraphrase_classification_logits = model(paraphrase).logits
not_paraphrase_classification_logits = model(not_paraphrase).logits

paraphrase_results = tf.nn.softmax(paraphrase_classification_logits, axis=1).numpy()[0]
not_paraphrase_results = tf.nn.softmax(not_paraphrase_classification_logits, axis=1).numpy()[0]

# sequence_0 vs. sequence_2: the sentences are unrelated, so expect a low paraphrase score
for i in range(len(classes)):
    print(f"{classes[i]}: {int(round(paraphrase_results[i] * 100))}%")

Some layers from the model checkpoint at bert-base-cased-finetuned-mrpc were not used when initializing TFBertForSequenceClassification: ['dropout_183']
- This IS expected if you are initializing TFBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
All the layers of TFBertForSequenceClassification were initialized from the model checkpoint at bert-base-cased-finetuned-mrpc.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFBertForSequenceClassification for predictions without further training.


not paraphrase: 94%
is paraphrase: 6%


In [52]:
# Should not be paraphrase
for i in range(len(classes)):
    print(f"{classes[i]}: {int(round(not_paraphrase_results[i] * 100))}%")

not paraphrase: 96%
is paraphrase: 4%


## **Extractive Question Answering**

In [53]:
from transformers import pipeline

question_answerer = pipeline("question-answering")

context = r"""
It is harder for the average Twitter user to reach influencer status than it is
 for those on the more visual social media channels, such as Instagram and YouTube. 
 There is less of a marketplace for non-celebrity Twitter influencers to match up with 
 companies outside the platforms. One of the reasons for this is because Twitter has 
 its own advertising program, Promoted Tweets, where advertisers can simply pay to 
 have their own tweets reach a wider audience than just their company’s follower
 """

No model was supplied, defaulted to distilbert-base-cased-distilled-squad (https://huggingface.co/distilbert-base-cased-distilled-squad)


The pipeline returns the answer extracted from the text, a confidence score, and "start" and "end" values, which are the
character positions of the extracted answer in the context.

In [54]:
result = question_answerer(question="What is extractive question answering?", context=context)
print(
    f"Answer: '{result['answer']}', score: {round(result['score'], 4)}, start: {result['start']}, end: {result['end']}"
)

Answer: 'harder for the average Twitter user to reach influencer status', score: 0.0683, start: 7, end: 69
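The `start` and `end` values are character offsets into `context`, so slicing the context with them recovers the answer. The sketch below checks this with the first line of the context and the values reported above; the `result` dict is rebuilt by hand here rather than returned by the pipeline.

```python
# First line of the context used above (the raw string begins with a newline).
context = "\nIt is harder for the average Twitter user to reach influencer status than it is"

# Values copied from the printed pipeline output; the score is omitted.
result = {
    "answer": "harder for the average Twitter user to reach influencer status",
    "start": 7,
    "end": 69,
}

# Slice the context with the reported character offsets.
span = context[result["start"]:result["end"]]
print(span == result["answer"])
```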


In [55]:
result = question_answerer(question="What is a good example of a question answering dataset?", context=context)
print(
    f"Answer: '{result['answer']}', score: {round(result['score'], 4)}, start: {result['start']}, end: {result['end']}"
)

Answer: 'Instagram', score: 0.4855, start: 142, end: 151


In [56]:
from transformers import AutoTokenizer, AutoModelForQuestionAnswering
import torch

tokenizer = AutoTokenizer.from_pretrained("bert-large-uncased-whole-word-masking-finetuned-squad")
model = AutoModelForQuestionAnswering.from_pretrained("bert-large-uncased-whole-word-masking-finetuned-squad")

text = r"""
Are you someone who finds it difficult to strike up a conversation with someone you just met? 
People nowadays are not used to having face-to-face conversations anymore.
 The world has gone digital, and people are content with chatting online.
  However, meeting and conversing with people face-to-face can help build 
  rapport and business connections that can help you attain success.
   Let these 10 ways to have a better conversation guide you and 
   help you live your life to the fulles
"""

questions = [
    "How do people have conversations nowadays?",
    "How can you build business connection with people?",
    "How many ways can you have better conversation with people?",
]

for question in questions:
    inputs = tokenizer(question, text, add_special_tokens=True, return_tensors="pt")
    input_ids = inputs["input_ids"].tolist()[0]

    outputs = model(**inputs)
    answer_start_scores = outputs.start_logits
    answer_end_scores = outputs.end_logits

    # Get the most likely beginning of answer with the argmax of the score
    answer_start = torch.argmax(answer_start_scores)
    # Get the most likely end of answer with the argmax of the score
    answer_end = torch.argmax(answer_end_scores) + 1

    answer = tokenizer.convert_tokens_to_string(
        tokenizer.convert_ids_to_tokens(input_ids[answer_start:answer_end])
    )

    print(f"Question: {question}")
    print(f"Answer: {answer}")

Question: How do people have conversations nowadays?
Answer: face - to - face conversations anymore. the world has gone digital, and people are content with chatting online
Question: How can you build business connection with people?
Answer: face - to - face
Question: How many ways can you have better conversation with people?
Answer: 10
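The span selection in the loop above (argmax over the start scores, argmax over the end scores plus one, then a slice over the tokens) can be sketched without a model. The tokens and scores below are hypothetical values picked for illustration, and `extract_span` is a helper name introduced here.

```python
def extract_span(tokens, start_scores, end_scores):
    # Index of the highest start score.
    start = max(range(len(start_scores)), key=start_scores.__getitem__)
    # Index of the highest end score, plus one so the slice includes that token.
    end = max(range(len(end_scores)), key=end_scores.__getitem__) + 1
    return tokens[start:end]

# Hypothetical tokens and scores, not produced by the model.
tokens = ["[CLS]", "how", "many", "ways", "?", "[SEP]",
          "let", "these", "10", "ways", "guide", "you", "[SEP]"]
start_scores = [0.1, 0.2, 0.1, 0.3, 0.0, 0.1, 0.4, 0.5, 6.2, 1.0, 0.2, 0.1, 0.0]
end_scores = [0.0, 0.1, 0.1, 0.2, 0.0, 0.1, 0.2, 0.3, 5.9, 1.4, 0.3, 0.1, 0.1]

answer_tokens = extract_span(tokens, start_scores, end_scores)
print(" ".join(answer_tokens))
```

Because the two argmax calls are independent, the best end index can fall before the best start index, in which case the slice is empty. This sketch, like the loop above, does not guard against that.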


In [57]:
from transformers import AutoTokenizer, TFAutoModelForQuestionAnswering
import tensorflow as tf

tokenizer = AutoTokenizer.from_pretrained("bert-large-uncased-whole-word-masking-finetuned-squad")
model = TFAutoModelForQuestionAnswering.from_pretrained("bert-large-uncased-whole-word-masking-finetuned-squad")


for question in questions:
    inputs = tokenizer(question, text, add_special_tokens=True, return_tensors="tf")
    input_ids = inputs["input_ids"].numpy()[0]

    outputs = model(inputs)
    answer_start_scores = outputs.start_logits
    answer_end_scores = outputs.end_logits

    # Get the most likely beginning of answer with the argmax of the score
    answer_start = tf.argmax(answer_start_scores, axis=1).numpy()[0]
    # Get the most likely end of answer with the argmax of the score
    answer_end = tf.argmax(answer_end_scores, axis=1).numpy()[0] + 1

    answer = tokenizer.convert_tokens_to_string(
        tokenizer.convert_ids_to_tokens(input_ids[answer_start:answer_end])
    )

    print(f"Question: {question}")
    print(f"Answer: {answer}")

All model checkpoint layers were used when initializing TFBertForQuestionAnswering.

All the layers of TFBertForQuestionAnswering were initialized from the model checkpoint at bert-large-uncased-whole-word-masking-finetuned-squad.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFBertForQuestionAnswering for predictions without further training.


Question: How do people have conversations nowadays?
Answer: face - to - face conversations anymore. the world has gone digital, and people are content with chatting online
Question: How can you build business connection with people?
Answer: face - to - face
Question: How many ways can you have better conversation with people?
Answer: 10


## **Named Entity Recognition**

In [64]:
from transformers import pipeline

ner_pipe = pipeline("ner")

sequence = """Twelve year old Percy Jackson is on the most
 dangerous quest of his life. With the help of a satyr and a
  daughter of Athena, Percy must journey across the United States 
to catch a thief who has stolen the original weapon of mass destruction — Zeus’ master bolt."""

No model was supplied, defaulted to dbmdz/bert-large-cased-finetuned-conll03-english (https://huggingface.co/dbmdz/bert-large-cased-finetuned-conll03-english)


The pipeline outputs every token identified as an entity. The default model tags each token with one of 9 classes:
O (outside any entity) plus B- and I- variants of MISC, PER, ORG, and LOC. Here are the results:

In [65]:
for entity in ner_pipe(sequence):
    print(entity)

{'entity': 'I-PER', 'score': 0.9992218, 'index': 4, 'word': 'Percy', 'start': 16, 'end': 21}
{'entity': 'I-PER', 'score': 0.9995484, 'index': 5, 'word': 'Jackson', 'start': 22, 'end': 29}
{'entity': 'I-PER', 'score': 0.9721522, 'index': 27, 'word': 'Athena', 'start': 120, 'end': 126}
{'entity': 'I-PER', 'score': 0.9992083, 'index': 29, 'word': 'Percy', 'start': 128, 'end': 133}
{'entity': 'I-LOC', 'score': 0.9990934, 'index': 34, 'word': 'United', 'start': 158, 'end': 164}
{'entity': 'I-LOC', 'score': 0.99896526, 'index': 35, 'word': 'States', 'start': 165, 'end': 171}
{'entity': 'I-PER', 'score': 0.9847956, 'index': 50, 'word': 'Zeus', 'start': 247, 'end': 251}


In [66]:
from transformers import AutoModelForTokenClassification, AutoTokenizer
import torch

model = AutoModelForTokenClassification.from_pretrained("dbmdz/bert-large-cased-finetuned-conll03-english")
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")


sequence = (
    "Twelve year old Percy Jackson is on the most dangerous quest of his life. "
    "With the help of a satyr and a daughter of Athena, "
    "Percy must journey across the United States to catch a thief who has stolen the original weapon of mass destruction — Zeus’ master bolt."
)

inputs = tokenizer(sequence, return_tensors="pt")
tokens = inputs.tokens()

outputs = model(**inputs).logits
predictions = torch.argmax(outputs, dim=2)

In [67]:
from transformers import TFAutoModelForTokenClassification, AutoTokenizer
import tensorflow as tf

model = TFAutoModelForTokenClassification.from_pretrained("dbmdz/bert-large-cased-finetuned-conll03-english")
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")


sequence = (
    "Twelve year old Percy Jackson is on the most dangerous quest of his life. "
    "With the help of a satyr and a daughter of Athena, "
    "Percy must journey across the United States to catch a thief who has stolen the original weapon of mass destruction — Zeus’ master bolt."
)

inputs = tokenizer(sequence, return_tensors="tf")
tokens = inputs.tokens()

outputs = model(**inputs)[0]
predictions = tf.argmax(outputs, axis=2)

Some layers from the model checkpoint at dbmdz/bert-large-cased-finetuned-conll03-english were not used when initializing TFBertForTokenClassification: ['dropout_147']
- This IS expected if you are initializing TFBertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFBertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
All the layers of TFBertForTokenClassification were initialized from the model checkpoint at dbmdz/bert-large-cased-finetuned-conll03-english.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFBertForTokenClassification for predictions without further training.


In [68]:
for token, prediction in zip(tokens, predictions[0].numpy()):
    print((token, model.config.id2label[prediction]))

('[CLS]', 'O')
('Twelve', 'O')
('year', 'O')
('old', 'O')
('Percy', 'I-PER')
('Jackson', 'I-PER')
('is', 'O')
('on', 'O')
('the', 'O')
('most', 'O')
('dangerous', 'O')
('quest', 'O')
('of', 'O')
('his', 'O')
('life', 'O')
('.', 'O')
('With', 'O')
('the', 'O')
('help', 'O')
('of', 'O')
('a', 'O')
('sat', 'O')
('##yr', 'O')
('and', 'O')
('a', 'O')
('daughter', 'O')
('of', 'O')
('Athena', 'I-PER')
(',', 'O')
('Percy', 'I-PER')
('must', 'O')
('journey', 'O')
('across', 'O')
('the', 'O')
('United', 'I-LOC')
('States', 'I-LOC')
('to', 'O')
('catch', 'O')
('a', 'O')
('thief', 'O')
('who', 'O')
('has', 'O')
('stolen', 'O')
('the', 'O')
('original', 'O')
('weapon', 'O')
('of', 'O')
('mass', 'O')
('destruction', 'O')
('—', 'O')
('Zeus', 'I-PER')
('’', 'O')
('master', 'O')
('bolt', 'O')
('.', 'O')
('[SEP]', 'O')
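The listing above comes from taking the argmax of each token's logits and looking the winning index up in `model.config.id2label`. A plain-Python sketch of that lookup is shown below. The three-label `id2label` mapping and the logit values are hypothetical and much smaller than the model's real ones.

```python
# Hypothetical label mapping; the real model defines 9 labels.
id2label = {0: "O", 1: "I-PER", 2: "I-LOC"}

tokens = ["Percy", "must", "journey", "across", "the", "United", "States"]
# One row of made-up logits per token, one column per label.
logits = [
    [0.2, 4.1, 0.1],
    [3.9, 0.3, 0.2],
    [4.2, 0.1, 0.1],
    [4.0, 0.2, 0.4],
    [4.5, 0.1, 0.3],
    [0.3, 0.2, 3.8],
    [0.4, 0.1, 3.6],
]

def argmax(row):
    # Index of the largest value in the row.
    return max(range(len(row)), key=row.__getitem__)

labels = [id2label[argmax(row)] for row in logits]
for token, label in zip(tokens, labels):
    print((token, label))
```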


## **Summarization**

In [69]:
from transformers import pipeline

summarizer = pipeline("summarization")

ARTICLE = """ All year the half-bloods have been preparing for battle against the Titans,
 knowing the odds of victory are grim. Kronos’s army is stronger than ever,
  and with every god and half-blood he recruits, the evil Titan’s power only grows.
While the Olympians struggle to contain the rampaging monster Typhon, 
Kronos begins his advance on New York City, where Mount Olympus stands
 virtually unguarded. Now it’s up to Percy Jackson and an army 
 of young demigods to stop the Lord of Time. In this momentous 
 final book in the New York Times best-selling Percy Jackson 
 and the Olympians series, the long-awaited prophecy surrounding
  Percy’s sixteenth birthday unfolds. And as the battle for
  Western civilization rages on the streets of Manhattan,
Percy faces a terrifying suspicion that he may be fighting against his own fate"""

No model was supplied, defaulted to sshleifer/distilbart-cnn-12-6 (https://huggingface.co/sshleifer/distilbart-cnn-12-6)



In [70]:
print(summarizer(ARTICLE, max_length=130, min_length=30, do_sample=False))

[{'summary_text': ' This is the final book in the New York Times best-selling Percy Jackson and the Olympians series . The long-awaited prophecy surrounding Percy Jackson’s sixteenth birthday unfolds .'}]


In [71]:
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model = AutoModelForSeq2SeqLM.from_pretrained("t5-base")
tokenizer = AutoTokenizer.from_pretrained("t5-base")

# T5 uses a max_length of 512 so we cut the article to 512 tokens.
inputs = tokenizer("summarize: " + ARTICLE, return_tensors="pt", max_length=512, truncation=True)
outputs = model.generate(
    inputs["input_ids"], max_length=150, min_length=40, length_penalty=2.0, num_beams=4, early_stopping=True
)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))


For now, this behavior is kept to avoid breaking backwards compatibility when padding/encoding with `truncation is True`.
- Be aware that you SHOULD NOT rely on t5-base automatically truncating your input to 512 when padding/encoding.
- If you want to encode/pad to sequences longer than 512 you can either instantiate this tokenizer with `model_max_length` or pass `max_length` when encoding/padding.


percy Jackson and an army of young demigods must stop the Lord of Time. the long-awaited prophecy surrounding percy’s sixteenth birthday unfolds.
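The input preparation for T5 has two parts: prepend the `summarize: ` task prefix, then cut the token sequence to the maximum length. The sketch below illustrates the idea with whitespace splitting as a stand-in for the real tokenizer. `build_t5_input` is a hypothetical helper, and the real tokenizer uses subword tokens and also appends an end-of-sequence token.

```python
def build_t5_input(article, max_length=512, prefix="summarize: "):
    # Whitespace splitting is only a stand-in for T5's subword tokenizer.
    tokens = (prefix + article).split()
    # Keep at most max_length tokens, dropping the tail of the article.
    return tokens[:max_length]

sample = "a b c d e"  # hypothetical short article
print(build_t5_input(sample, max_length=3))
```

Truncating from the end means that anything past the limit never reaches the model, so long articles are summarized from their opening portion only.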


In [72]:
from transformers import TFAutoModelForSeq2SeqLM, AutoTokenizer

model = TFAutoModelForSeq2SeqLM.from_pretrained("t5-base")
tokenizer = AutoTokenizer.from_pretrained("t5-base")

# T5 uses a max_length of 512 so we cut the article to 512 tokens.
inputs = tokenizer("summarize: " + ARTICLE, return_tensors="tf", max_length=512)
outputs = model.generate(
    inputs["input_ids"], max_length=150, min_length=40, length_penalty=2.0, num_beams=4, early_stopping=True
)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))


All model checkpoint layers were used when initializing TFT5ForConditionalGeneration.

All the layers of TFT5ForConditionalGeneration were initialized from the model checkpoint at t5-base.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFT5ForConditionalGeneration for predictions without further training.
For now, this behavior is kept to avoid breaking backwards compatibility when padding/encoding with `truncation is True`.
- Be aware that you SHOULD NOT rely on t5-base automatically truncating your input to 512 when padding/encoding.
- If you want to encode/pad to sequences longer than 512 you can either instantiate this tokenizer with `model_max_length` or pass `max_length` when encoding/padding.
Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of s

percy Jackson and an army of young demigods must stop the Lord of Time. the long-awaited prophecy surrounding percy’s sixteenth birthday unfolds.
