<a href="https://colab.research.google.com/github/vicentegilso/textpreprocesing/blob/main/Hugging_Face_first_contact.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Hugging Face - First Contact 🤗

In [None]:
#@markdown [Run me] Install `transformers` package
!pip install transformers datasets > /dev/null

Import the Hugging Face libraries.

In [None]:
from transformers import pipeline
import datasets

## `transformers` pipelines 

Pipelines are the simplest way to use models for inference. Each pipeline is an object that abstracts away most of the library's complex code and offers a simple API dedicated to one task, such as Named Entity Recognition, Masked Language Modeling, Sentiment Analysis, Feature Extraction or Question Answering. [Reference](https://huggingface.co/transformers/main_classes/pipelines.html?#pipelines)

A pipeline hides the complexity of managing a **model** and a **tokenizer**.
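
To make the idea concrete, the sketch below mimics the three stages a pipeline wraps (tokenize, run the model, post-process the scores) with toy stand-ins. Everything in it is hypothetical: `toy_tokenize`, `toy_model`, the tiny vocabulary and the hand-written scoring are illustrations only and are not part of `transformers`.

```python
import math

# Hypothetical toy vocabulary; a real tokenizer ships a learned one.
VOCAB = {"[UNK]": 0, "great": 1, "awful": 2, "movie": 3}
LABELS = ["NEGATIVE", "POSITIVE"]

def toy_tokenize(text):
    """Stage 1: turn raw text into integer token ids."""
    return [VOCAB.get(word, VOCAB["[UNK]"]) for word in text.lower().split()]

def toy_model(token_ids):
    """Stage 2: map token ids to one raw score (logit) per label.

    A real model is a neural network; this stand-in just counts words.
    """
    positive = float(token_ids.count(VOCAB["great"]))
    negative = float(token_ids.count(VOCAB["awful"]))
    return [negative, positive]

def softmax(logits):
    """Stage 3 helper: convert logits into probabilities that sum to 1."""
    exps = [math.exp(x - max(logits)) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def toy_pipeline(text):
    """Chain the stages and return a pipeline-style result dict."""
    probs = softmax(toy_model(toy_tokenize(text)))
    best = max(range(len(probs)), key=probs.__getitem__)
    return {"label": LABELS[best], "score": probs[best]}

result = toy_pipeline("great movie")
print(result["label"], round(result["score"], 3))
```

A real pipeline follows the same shape, but loads a pretrained tokenizer and model instead of these stand-ins.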


### Question answering

In [None]:
text = "A pisco sour is a cocktail typical of South American cuisine. The drink's name is a combination of the word pisco, which is its base liquor, and the term sour, in reference to sour citrus juice and sweetener components. Chile and Peru both claim the pisco sour as their national drink, and each asserts exclusive ownership of both pisco and the cocktail. " #@param {type: "string"}
question = "Which countries claim pisco as their national drink?"  #@param {type: "string"}

In [None]:
qa = pipeline("question-answering")
answer = qa(question=question, context=text)

print(question, answer["answer"])

Which countries claim pisco as their national drink? Chile and Peru


In [None]:
answer

{'answer': 'Chile and Peru',
 'end': 234,
 'score': 0.9972382187843323,
 'start': 220}
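
The `start` and `end` fields are character offsets into the context, so slicing the context with them recovers the answer. The sketch below checks this with the values printed above. Here `context` is the passage shortened after the answer span, and `answer` is copied by hand from the output rather than produced by running the model.

```python
# Beginning of the passage used above, shortened after the answer span.
context = (
    "A pisco sour is a cocktail typical of South American cuisine. "
    "The drink's name is a combination of the word pisco, which is its "
    "base liquor, and the term sour, in reference to sour citrus juice "
    "and sweetener components. Chile and Peru both claim the pisco sour "
    "as their national drink."
)

# Result dict copied from the notebook output above (not recomputed here).
answer = {"answer": "Chile and Peru", "start": 220, "end": 234,
          "score": 0.9972382187843323}

# `end` is exclusive, like a normal Python slice bound.
span = context[answer["start"]:answer["end"]]
print(span)
```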

### Named Entity Recognition

In [None]:
text = "My name is Guillem and I live in Lleida" #@param {type: "string"}


In [None]:
ner = pipeline("ner", aggregation_strategy="simple")
pred = ner(text)

for token in pred:
  print(token["word"], f'({token["entity_group"]})')

Guillem (PER)
Lleida (LOC)


### Summarization

In [None]:
text = """
Homer Jay Simpson is a fictional character and one of the main protagonists of the American animated sitcom The Simpsons. He is voiced by Dan Castellaneta and first appeared on television, along with the rest of his family, in The Tracey Ullman Show short "Good Night" on April 19, 1987. Homer was created and designed by cartoonist Matt Groening while he was waiting in the lobby of James L. Brooks' office. Groening had been called to pitch a series of shorts based on his comic strip Life in Hell but instead decided to create a new set of characters. He named the character after his father, Homer Groening. After appearing for three seasons on The Tracey Ullman Show, the Simpson family got their own series on Fox, which debuted December 17, 1989.

As patriarch of the eponymous family, Homer and his wife Marge have three children: Bart, Lisa and Maggie. As the family's provider, he works at the Springfield Nuclear Power Plant as safety inspector. Homer embodies many American working class stereotypes: he is obese, immature, outspoken, aggressive, balding, lazy, ignorant, unprofessional, and addicted to beer, junk food and watching television. However, he is fundamentally a good man and is staunchly protective of his family, especially when they need him the most. Despite the suburban blue-collar routine of his life, he has had a number of remarkable experiences, including going to space, climbing the tallest mountain in Springfield by himself, fighting former President George H. W. Bush and winning a Grammy Award as a member of a barbershop quartet
"""

In [None]:
summarizer = pipeline("summarization")
[summary] = summarizer(text, clean_up_tokenization_spaces=True)
print(summary["summary_text"].replace(".", ".\n"))


 Homer was created and designed by cartoonist Matt Groening.
 He is voiced by Dan Castellaneta and first appeared on television, along with the rest of his family, in The Tracey Ullman Show short "Good Night" on April 19, 1987.
 Homer embodies many American working class stereotypes: he is obese, immature, outspoken, aggressive, balding, lazy, ignorant and unprofessional.



### SPAM classification

In [None]:
classifier = pipeline(
  "text-classification", 
  model="mrm8488/bert-tiny-finetuned-sms-spam-detection",
  tokenizer="mrm8488/bert-tiny-finetuned-sms-spam-detection")

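# The model emits generic label names; this maps them to readable ones.
label_mapping = {"LABEL_1": "SPAM", "LABEL_0": "LEGIT"}

[pred] = classifier("Camera - You are awarded a SiPix Digital Camera! call 09061221066 fromm landline. Delivery within 28 days.")
# pred["score"] is the confidence of the predicted label, pred["label"].
print(f'Spam probability {pred["score"]:.3f}')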


Spam probability 0.902
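
A text-classification pipeline reports the score of the label it predicted, so the number printed above is a spam probability only if the predicted label is the spam one. The helper below is a minimal sketch of making that explicit, assuming a binary classifier whose two scores sum to 1. `spam_probability` is a hypothetical name, and the example predictions are made up rather than model output.

```python
# Mapping taken from the cell above.
label_mapping = {"LABEL_1": "SPAM", "LABEL_0": "LEGIT"}

def spam_probability(pred):
    """Return P(spam) from a {'label': ..., 'score': ...} prediction.

    Assumes exactly two classes, so P(spam) = 1 - P(legit).
    """
    if label_mapping[pred["label"]] == "SPAM":
        return pred["score"]
    return 1.0 - pred["score"]

# Made-up predictions for illustration only.
print(round(spam_probability({"label": "LABEL_1", "score": 0.9}), 3))
print(round(spam_probability({"label": "LABEL_0", "score": 0.75}), 3))
```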


## 🤗 Datasets 💾

🤗 Datasets is a lightweight and extensible library for sharing and accessing datasets and evaluation metrics for Natural Language Processing (NLP).





In [None]:
ds_lists = datasets.list_datasets()
print("# Available datasets:", len(ds_lists))
print(', '.join(ds_lists))

# Available datasets: 985
acronym_identification, ade_corpus_v2, adversarial_qa, aeslc, afrikaans_ner_corpus, ag_news, ai2_arc, air_dialogue, ajgt_twitter_ar, allegro_reviews, allocine, alt, amazon_polarity, amazon_reviews_multi, amazon_us_reviews, ambig_qa, amttl, anli, app_reviews, aqua_rat, aquamuse, ar_cov19, ar_res_reviews, ar_sarcasm, arabic_billion_words, arabic_pos_dialect, arabic_speech_corpus, arcd, arsentd_lev, art, arxiv_dataset, ascent_kb, aslg_pc12, asnq, asset, assin, assin2, atomic, autshumato, babi_qa, banking77, bbaw_egyptian, bbc_hindi_nli, bc2gm_corpus, best2009, bianet, bible_para, big_patent, billsum, bing_coronavirus_query_set, biomrc, blended_skill_talk, blimp, blog_authorship_corpus, bn_hate_speech, bookcorpus, bookcorpusopen, boolq, bprec, break_data, brwac, bsd_ja_en, bswac, c3, c4, cail2018, caner, capes, catalonia_independence, cawac, cbt, cc100, cc_news, ccaligned_multilingual, cdsc, cdt, cfq, chr_en, cifar10, cifar100, circa, civil_comments, clickbait_new

### Loading dataset

Look for a dataset name [here](https://huggingface.co/datasets) and pass it as the argument of the `load_dataset` function, for example `sms_spam`.

Additionally, you can visually explore the dataset [here](https://huggingface.co/datasets/viewer/).

In [None]:
ds = datasets.load_dataset("sms_spam")
ds


Downloading and preparing dataset sms_spam/plain_text (download: 198.65 KiB, generated: 509.53 KiB, post-processed: Unknown size, total: 708.17 KiB) to /root/.cache/huggingface/datasets/sms_spam/plain_text/1.0.0/53f051d3b5f62d99d61792c91acefe4f1577ad3e4c216fb0ad39e30b9f20019c...


Dataset sms_spam downloaded and prepared to /root/.cache/huggingface/datasets/sms_spam/plain_text/1.0.0/53f051d3b5f62d99d61792c91acefe4f1577ad3e4c216fb0ad39e30b9f20019c. Subsequent calls will reuse this data.


DatasetDict({
    train: Dataset({
        features: ['label', 'sms'],
        num_rows: 5574
    })
})

In [None]:
ds["train"].features

{'label': ClassLabel(num_classes=2, names=['ham', 'spam'], names_file=None, id=None),
 'sms': Value(dtype='string', id=None)}

In [None]:
ds["train"][2]

{'label': 1,
 'sms': "Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's\n"}
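
The integer `label` refers to a position in the `names` list of the `ClassLabel` feature, so `1` means `spam`. The sketch below reproduces that lookup with a plain list. `names` and `example` are copied by hand from the outputs above (the message text is shortened), and `label_name` is a hypothetical helper, not a 🤗 Datasets function.

```python
# Class names in the order reported by ds["train"].features["label"].
names = ["ham", "spam"]

# Row copied from the output above, with the message shortened.
example = {"label": 1, "sms": "Free entry in 2 a wkly comp to win FA Cup final tkts"}

def label_name(label_id):
    """Translate an integer class id into its readable name."""
    return names[label_id]

print(label_name(example["label"]))
```

With the real dataset, `ds["train"].features["label"].int2str(1)` performs the same lookup.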

Remember that creating a good train and test split is crucial for training and evaluating a model.

With 🤗 Datasets you can create the splits directly when downloading the dataset by passing the `split` argument of `load_dataset`, using the "Slicing API".

Some datasets expose several predefined splits (e.g. train and test).

```python
# ag_news ships both a train and a test split
train_ds, test_ds = datasets.load_dataset('ag_news', split=['train', 'test'])
```

On the other hand, some datasets have only a single predefined split, for instance the `sms_spam` dataset.

```python
datasets.load_dataset("sms_spam", split=["train", "test"])

# Fails with the following exception:
# Unknown split "test". Should be one of ['train'].
```

With the slicing API we can define splits:

In [None]:
train_ds = datasets.load_dataset("sms_spam", split="train[:80%]")
test_ds = datasets.load_dataset("sms_spam", split="train[80%:]")

print("Train dataset samples:", len(train_ds))
print("Test dataset samples:", len(test_ds))

Reusing dataset sms_spam (/root/.cache/huggingface/datasets/sms_spam/plain_text/1.0.0/53f051d3b5f62d99d61792c91acefe4f1577ad3e4c216fb0ad39e30b9f20019c)
Reusing dataset sms_spam (/root/.cache/huggingface/datasets/sms_spam/plain_text/1.0.0/53f051d3b5f62d99d61792c91acefe4f1577ad3e4c216fb0ad39e30b9f20019c)


Train dataset samples: 4459
Test dataset samples: 1115
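
The counts above follow from simple arithmetic on the 5574 rows of the `train` split. The sketch below reproduces them, assuming the percentage boundary is rounded to the nearest row index. That assumption matches the printed counts, but the exact rounding rule is worth checking in the 🤗 Datasets documentation. `split_sizes` is a hypothetical helper.

```python
def split_sizes(num_rows, percent):
    """Return (first, rest) sizes for the slices [:percent%] and [percent%:]."""
    boundary = round(num_rows * percent / 100)  # assumed rounding rule
    return boundary, num_rows - boundary

train_size, test_size = split_sizes(5574, 80)
print(train_size, test_size)
```

Percentage slices take contiguous rows in the stored order, so the rows are not shuffled before splitting.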


### Tokenize a dataset

Tokenization turns each SMS into a sequence of integer token IDs that the model can digest. 🤗 Transformers provides a set of tokenizers that carry out this step automatically.

In [None]:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')

def encode(example):
  return tokenizer(example["sms"], 
                   max_length=128, 
                   truncation=True,
                   padding="max_length")

train_ds = train_ds.map(encode, batched=True, batch_size=32)
test_ds = test_ds.map(encode, batched=True, batch_size=32)
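
To see what `truncation=True` and `padding="max_length"` do, the sketch below applies the same two steps by hand to a list of token ids and builds the matching attention mask. It is a simplified illustration: `pad_and_truncate` is a hypothetical helper, the ids are a short toy input, and a real BERT tokenizer also keeps the final `[SEP]` token when it truncates, which this sketch does not.

```python
PAD_ID = 0  # id of [PAD] in bert-base-uncased

def pad_and_truncate(token_ids, max_length):
    """Cut or pad token_ids to exactly max_length and build the attention mask."""
    ids = token_ids[:max_length]       # truncation
    mask = [1] * len(ids)              # 1 marks a real token
    padding = max_length - len(ids)
    # 0 in the mask marks padding the model should ignore.
    return ids + [PAD_ID] * padding, mask + [0] * padding

# 101 and 102 are the [CLS] and [SEP] ids in bert-base-uncased.
input_ids, attention_mask = pad_and_truncate([101, 2175, 2127, 102], max_length=8)
print(input_ids)
print(attention_mask)
```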


In [None]:
train_ds[0]

{'attention_mask': [1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0],
 'input_ids': [101, 2175, 2127, 18414, 17583, 2391, 1010, 4689, 1012, 1012,
  2800, 2069, 1999, 11829, 2483, 1050, 2307, 2088, 2474, 1041,
  28305, 1012, 1012, 1012, 25022, 2638, 2045, 2288, 26297, 28194,
  1012, 1012, 1012, 102,
  0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
 