# Hugging Face Introduction
* https://www.youtube.com/watch?v=tkGIF4CFoV4&list=PLjy4p-07OYzuy_lHcRW8lPTLPTTOmUpmi&index=53
* https://github.com/jeffheaton/app_deep_learning/blob/main/t81_558_class_11_2_py_huggingface.ipynb

Transformers have become a mainstay of natural language processing. This module will examine the Hugging Face Python library for natural language processing, bringing together pretrained transformers, data sets, tokenizers, and other elements. Through the Hugging Face API, you can quickly begin using sentiment analysis, entity recognition, language translation, summarization, and text generation.

In [1]:
!pip install transformers
!pip install transformers[sentencepiece]



# Sentiment Analysis
Sentiment analysis uses natural language processing, text analysis, computational linguistics, and biometrics to identify the tone of written text. Passages of written text can be classified into simple binary states of positive or negative tone. More advanced sentiment analysis might classify text into additional categories: sadness, joy, love, anger, fear, or surprise.

In [3]:
from urllib.request import urlopen

# Read the sample text: a poem
URL = "https://data.heatonresearch.com/data/t81-558/"\
    "datasets/sonnet_18.txt"
f = urlopen(URL)
text = f.read().decode("utf-8")
print(text)

Sonnet 18 original text
William Shakespeare

Shall I compare thee to a summer's day?
Thou art more lovely and more temperate:
Rough winds do shake the darling buds of May,
And summer's lease hath all too short a date:
Sometime too hot the eye of heaven shines,
And often is his gold complexion dimm'd;
And every fair from fair sometime declines,
By chance or nature's changing course untrimm'd;
But thy eternal summer shall not fade
Nor lose possession of that fair thou owest;
Nor shall Death brag thou wander'st in his shade,
When in eternal lines to time thou growest:
So long as men can breathe or eyes can see,
So long lives this and this gives life to thee.


Usually, you have to preprocess text into embeddings or other vector forms before presenting it to a neural network.
Hugging Face provides a pipeline that greatly simplifies this process. The pipeline allows you to pass regular Python strings to the transformers and get standard Python values back.


We begin by loading a text-classification model. We do not specify the exact model wanted, so Hugging Face automatically chooses a network from the Hugging Face hub named:
* distilbert-base-uncased-finetuned-sst-2-english


To specify the model to use, pass the model parameter, such as:
```
pipe = pipeline(model="roberta-large-mnli")
```

The following code loads a model pipeline and a model for sentiment analysis.

In [4]:
import pandas as pd
from transformers import pipeline

classifier = pipeline("text-classification")

No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
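The log message above recommends naming both the model and its revision. A minimal sketch of doing so follows, assuming the `revision` argument of `pipeline` accepts the short hash printed in the log; the constant names and the `build_classifier` helper are illustrative and not part of the original notebook.

```python
# Values copied from the log message above. Pinning them should keep
# results stable if the library's default model ever changes.
MODEL_NAME = "distilbert/distilbert-base-uncased-finetuned-sst-2-english"
MODEL_REVISION = "af0f99b"


def build_classifier():
    """Create a text-classification pipeline pinned to one model revision."""
    # Imported inside the function so the constants above can be reused
    # without loading the transformers library.
    from transformers import pipeline

    return pipeline(
        "text-classification",
        model=MODEL_NAME,
        revision=MODEL_REVISION,
    )
```

Calling `classifier = build_classifier()` downloads the pinned model on first use, after which `classifier` can be used in the same way as in the cells of this section.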



We can now display the sentiment analysis results with a Pandas dataframe.

In [5]:
outputs = classifier(text)
pd.DataFrame(outputs)

| | label | score |
|---|---|---|
| 0 | POSITIVE | 0.984666 |


In [8]:
text_2 = """'I'm sad, because it is rainny."""
outputs_2 = classifier(text_2)
pd.DataFrame(outputs_2)

| | label | score |
|---|---|---|
| 0 | NEGATIVE | 0.988715 |


In [10]:
text_3 = """My friend asked me about his private trouble. He fell in love with young lady. but he has wife.
So He should forget that young lady. When I told him that thing, his face became pale."""
outputs_3 = classifier(text_3)
pd.DataFrame(outputs_3)

| | label | score |
|---|---|---|
| 0 | NEGATIVE | 0.999144 |
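Each call to the classifier returns a list with one dictionary per input, holding a `label` and a `score`. The sketch below shows one way to work with those plain Python values without Pandas. The `results` list is a stand-in copied from the three outputs above so that the code runs without downloading a model, and `describe` is a hypothetical helper.

```python
# Stand-in for the classifier outputs shown above (one dict per input text).
results = [
    {"label": "POSITIVE", "score": 0.984666},  # Sonnet 18
    {"label": "NEGATIVE", "score": 0.988715},  # text_2
    {"label": "NEGATIVE", "score": 0.999144},  # text_3
]


def describe(result):
    """Format one pipeline result as 'LABEL (confidence%)'."""
    return f"{result['label']} ({result['score']:.1%})"


summary = [describe(r) for r in results]
print(summary)
```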


# Entity Tagging
Entity tagging is the process that takes source text and finds parts of that text that represent entities, such as one of the following:
* Location (LOC)
* Organization (ORG)
* Person (PER)
* Miscellaneous (MISC)

The following code requests a named entity recognizer (`ner`) and processes the specified text.

In [11]:
text2 = "Abraham Lincoln was a president who lived in the United States."

tagger = pipeline("ner", aggregation_strategy="simple")

No model was supplied, defaulted to dbmdz/bert-large-cased-finetuned-conll03-english and revision f2482bf (https://huggingface.co/dbmdz/bert-large-cased-finetuned-conll03-english).
Using a pipeline without specifying a model name and revision in production is not recommended.



Some weights of the model checkpoint at dbmdz/bert-large-cased-finetuned-conll03-english were not used when initializing BertForTokenClassification: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).



In [12]:
outputs = tagger(text2)
pd.DataFrame(outputs)

| | entity_group | score | word | start | end |
|---|---|---|---|---|---|
| 0 | PER | 0.998893 | Abraham Lincoln | 0 | 15 |
| 1 | LOC | 0.999651 | United States | 49 | 62 |
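The `start` and `end` columns are character offsets into the input string, so slicing the text with them recovers each entity. The following sketch checks this with the values from the table above; the `entities` list is a stand-in for `tagger(text2)` so the code runs without the model, and `entity_spans` is a hypothetical helper.

```python
text2 = "Abraham Lincoln was a president who lived in the United States."

# Stand-in for tagger(text2): values copied from the table above.
entities = [
    {"entity_group": "PER", "word": "Abraham Lincoln", "start": 0, "end": 15},
    {"entity_group": "LOC", "word": "United States", "start": 49, "end": 62},
]


def entity_spans(text, entities):
    """Return (entity_group, substring) pairs cut from text by offset."""
    return [(e["entity_group"], text[e["start"]:e["end"]]) for e in entities]


spans = entity_spans(text2, entities)
print(spans)
```

Offsets like these are useful when you want to highlight or replace entities in the original string rather than work with the extracted words alone.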


# Question Answering
Another common task for NLP is question answering from a reference text. We load such a model with the following code.

In [14]:
reader = pipeline("question-answering")
question = "What now shall fade?"

No model was supplied, defaulted to distilbert/distilbert-base-cased-distilled-squad and revision 626af31 (https://huggingface.co/distilbert/distilbert-base-cased-distilled-squad).
Using a pipeline without specifying a model name and revision in production is not recommended.



For this example, we pose the question "What now shall fade?" to the model, using Sonnet 18 as the context. The model answers "eternal summer", the phrase from the line "But thy eternal summer shall not fade".

In [15]:
outputs = reader(question=question, context=text)
pd.DataFrame([outputs])

| | score | start | end | answer |
|---|---|---|---|---|
| 0 | 0.471141 | 414 | 428 | eternal summer |


# Language Translation

In [16]:
translator = pipeline("translation_en_to_de",
                      model="Helsinki-NLP/opus-mt-en-de")




The following code translates Sonnet 18 from English into German.

In [17]:
outputs = translator(text, clean_up_tokenization_spaces=True,
                     min_length=100)
print(outputs[0]['translation_text'])

Sonnet 18 Originaltext William Shakespeare Soll ich dich mit einem Sommertag vergleichen? Du bist schöner und gemäßigter: Raue Winde schütteln die lieblichen Knospen des Mai, Und der Sommervertrag hat zu kurz ein Datum: Irgendwann zu heiß das Auge des Himmels leuchtet, Und oft ist sein Gold Teint dimm'd; Und jede faire von Fair irgendwann sinkt, Durch Zufall oder die Natur wechselnden Kurs untrimm'd; Aber dein ewiger Sommer wird nicht verblassen noch verlieren Besitz von dem Schönen du schuld; noch wird der Tod prahlen du wandert in seinem Schatten, Wenn in ewigen Linien zur Zeit wachsen: So lange die Menschen atmen oder Augen sehen können, So lange lebt dies und dies gibt dir Leben.
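The translation above comes back as a single paragraph, so the poem's line breaks are lost. One possible workaround, sketched below, is to translate line by line and rejoin the results. `translate_by_line` is a hypothetical helper; it assumes only that `translator` is a callable returning a list whose first item has a `translation_text` key, as in the cell above. Translating lines in isolation may cost the model some context, so treat this as an illustration rather than a recommendation.

```python
def translate_by_line(translator, text):
    """Translate each non-empty line separately, keeping the line layout."""
    translated = []
    for line in text.splitlines():
        if line.strip():
            # Same access pattern as outputs[0]['translation_text'] above.
            translated.append(translator(line)[0]["translation_text"])
        else:
            # Keep blank lines without calling the model.
            translated.append("")
    return "\n".join(translated)
```

With the pipeline from this section, the call would be `print(translate_by_line(translator, text))`.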


# Summarization

In [18]:
text2 = """
An apple is an edible fruit produced by an apple tree (Malus domestica).
Apple trees are cultivated worldwide and are the most widely grown species
in the genus Malus. The tree originated in Central Asia, where its wild
ancestor, Malus sieversii, is still found today. Apples have been grown
for thousands of years in Asia and Europe and were brought to North America
by European colonists. Apples have religious and mythological significance
in many cultures, including Norse, Greek, and European Christian tradition.
"""

summarizer = pipeline("summarization")

No model was supplied, defaulted to sshleifer/distilbart-cnn-12-6 and revision a4f8f3e (https://huggingface.co/sshleifer/distilbart-cnn-12-6).
Using a pipeline without specifying a model name and revision in production is not recommended.



In [19]:
outputs = summarizer(text2, max_length=45,
                     clean_up_tokenization_spaces=True)
print(outputs[0]['summary_text'])

Your min_length=56 must be inferior than your max_length=45.


 An apple is an edible fruit produced by an apple tree (Malus domestica) Apple trees are cultivated worldwide and are the most widely grown species in the genus Malus. Apples have religious and mythological
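The warning above appears because the call sets `max_length=45` while `min_length` is left at 56, a value that is not set in the call and presumably comes from the model's defaults. A minimal sketch of passing both limits explicitly follows. The `summarize` helper and its default `min_length=10` are illustrative choices, not values from the original notebook.

```python
def summarize(summarizer, text, max_length=45, min_length=10):
    """Call a summarization pipeline with consistent length limits."""
    if min_length >= max_length:
        raise ValueError(
            f"min_length={min_length} must be smaller than "
            f"max_length={max_length}"
        )
    outputs = summarizer(
        text,
        min_length=min_length,
        max_length=max_length,
        clean_up_tokenization_spaces=True,
    )
    return outputs[0]["summary_text"]
```

With the pipeline from this section, `print(summarize(summarizer, text2))` should run without the length warning, though the summary itself may differ from the one shown above.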


# Text Generation
Finally, text generation allows us to take an input text and request the pretrained neural network to continue that text.

In [20]:
generator = pipeline("text-generation")

No model was supplied, defaulted to openai-community/gpt2 and revision 6c0e608 (https://huggingface.co/openai-community/gpt2).
Using a pipeline without specifying a model name and revision in production is not recommended.



In [21]:
outputs = generator(text, max_length=400)
print(outputs[0]['generated_text'])

Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Sonnet 18 original text
William Shakespeare

Shall I compare thee to a summer's day?
Thou art more lovely and more temperate:
Rough winds do shake the darling buds of May,
And summer's lease hath all too short a date:
Sometime too hot the eye of heaven shines,
And often is his gold complexion dimm'd;
And every fair from fair sometime declines,
By chance or nature's changing course untrimm'd;
But thy eternal summer shall not fade
Nor lose possession of that fair thou owest;
Nor shall Death brag thou wander'st in his shade,
When in eternal lines to time thou growest:
So long as men can breathe or eyes can see,
So long lives this and this gives life to thee. Augustine Augustine Augustine Augustine Augustine Augustine Augustine Augustine Augustine Augustine Augustine Augustine Augustine Augustine Augustine Augustine Augustine Augustine Augustine Augustine Augustine Augustine Augustine Augustine Augustine Augustine Augustine Augustine Augustine Augustine Augustine Augustine 
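As the output above shows, `generated_text` starts with the prompt and then appends the continuation. When only the new text is wanted, the prompt can be removed with plain string handling, as in the sketch below; `continuation_only` is a hypothetical helper, and the short `prompt` and `generated` strings are stand-ins for the full sonnet and model output. The text-generation pipeline also has a `return_full_text` parameter that should have a similar effect when set to `False`; it is not used in the original notebook.

```python
def continuation_only(prompt, generated_text):
    """Return the part of generated_text that follows the prompt."""
    if generated_text.startswith(prompt):
        return generated_text[len(prompt):]
    # Fall back to the full text if the prompt was altered or truncated.
    return generated_text


# Stand-in values: the last line of the sonnet plus a short continuation.
prompt = "So long lives this and this gives life to thee."
generated = prompt + " Augustine Augustine"
print(repr(continuation_only(prompt, generated)))
```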

In [22]:
print(text)

Sonnet 18 original text
William Shakespeare

Shall I compare thee to a summer's day?
Thou art more lovely and more temperate:
Rough winds do shake the darling buds of May,
And summer's lease hath all too short a date:
Sometime too hot the eye of heaven shines,
And often is his gold complexion dimm'd;
And every fair from fair sometime declines,
By chance or nature's changing course untrimm'd;
But thy eternal summer shall not fade
Nor lose possession of that fair thou owest;
Nor shall Death brag thou wander'st in his shade,
When in eternal lines to time thou growest:
So long as men can breathe or eyes can see,
So long lives this and this gives life to thee.
