<a href="https://colab.research.google.com/github/rahiakela/transformers-research-and-practice/blob/main/mastering-transformers/02-hands-on-introduction-to-subject/1_hands_on_introduction_to_subject.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
!pip -q install transformers
!pip -q install datasets

## Working with language models and tokenizers

In [2]:
from transformers import BertTokenizer, BertModel, pipeline
from datasets import load_dataset

from datasets import list_datasets, list_metrics

import pandas as pd

from pprint import pprint

In [None]:
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

In [4]:
text = "Using Transformer is easy!"
tokenizer(text)

{'input_ids': [101, 2478, 10938, 2121, 2003, 3733, 999, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1]}
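To see why a four-word sentence yields eight IDs, here is a minimal sketch of greedy longest-match sub-word splitting (the WordPiece idea) in plain Python. It is not the library's implementation: `TOY_VOCAB`, `split_words`, `wordpiece`, and `encode` are hypothetical names, the IDs are copied from the output above, and the token strings paired with them are assumed for illustration.

```python
# Toy vocabulary. The IDs come from the tokenizer output above; the token
# strings paired with them are an assumption made for illustration.
TOY_VOCAB = {
    "[UNK]": 100, "[CLS]": 101, "[SEP]": 102,
    "using": 2478, "transform": 10938, "##er": 2121,
    "is": 2003, "easy": 3733, "!": 999,
}


def split_words(text):
    """Lowercase the text, split on whitespace, and isolate punctuation."""
    words, current = [], ""
    for ch in text.lower():
        if ch.isalnum():
            current += ch
            continue
        if current:
            words.append(current)
            current = ""
        if not ch.isspace():
            words.append(ch)
    if current:
        words.append(current)
    return words


def wordpiece(word, vocab):
    """Greedy longest-match-first split of one word into sub-word pieces."""
    pieces, start = [], 0
    while start < len(word):
        end, piece = len(word), None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return ["[UNK]"]  # no piece matched: the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces


def encode(text, vocab=TOY_VOCAB):
    """Wrap the pieces in [CLS] ... [SEP] and build the three parallel lists."""
    tokens = ["[CLS]"]
    for word in split_words(text):
        tokens.extend(wordpiece(word, vocab))
    tokens.append("[SEP]")
    return {
        "tokens": tokens,
        "input_ids": [vocab[t] for t in tokens],
        "token_type_ids": [0] * len(tokens),   # single segment
        "attention_mask": [1] * len(tokens),   # no padding
    }


encoded = encode("Using Transformer is easy!")
print(encoded["tokens"])
print(encoded["input_ids"])  # [101, 2478, 10938, 2121, 2003, 3733, 999, 102]
```

Under these assumptions the sketch reproduces the eight IDs: two special tokens, two pieces for the word that is not in the toy vocabulary as a whole, and one ID for each remaining word and the punctuation mark.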

In [None]:
encoded_input = tokenizer(text, return_tensors="pt")
encoded_input

{'input_ids': tensor([[  101,  2478, 10938,  2121,  2003,  3733,   999,   102]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1]])}
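The attention mask above is all ones because a single sequence needs no padding. The sketch below, in plain Python rather than the tokenizer's own code, shows how the mask separates real tokens from padding once sequences of different lengths share a batch. `pad_batch` is a hypothetical helper, and `short_seq` is a made-up shorter sequence built from IDs that appear in the output above.

```python
PAD_ID = 0  # ID of the [PAD] token in the bert-base-uncased vocabulary


def pad_batch(sequences, pad_id=PAD_ID):
    """Right-pad every sequence to the longest one and build matching masks."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore.
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


long_seq = [101, 2478, 10938, 2121, 2003, 3733, 999, 102]  # from the output above
short_seq = [101, 3733, 999, 102]                          # hypothetical shorter input

batch = pad_batch([long_seq, short_seq])
print(batch["input_ids"][1])       # [101, 3733, 999, 102, 0, 0, 0, 0]
print(batch["attention_mask"][1])  # [1, 1, 1, 1, 0, 0, 0, 0]
```

With the real tokenizer, passing a list of texts together with `padding=True` produces the same kind of rectangular batch.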

In [None]:
model = BertModel.from_pretrained("bert-base-uncased")

output = model(**encoded_input)
output

In [None]:
len(output)

2

In [None]:
unmasker = pipeline("fill-mask", model="bert-base-uncased")
unmasker("The man worked as a [MASK].")

In [None]:
pd.DataFrame(unmasker("The man worked as a [MASK]."))

| | sequence | score | token | token_str |
|---|---|---|---|---|
| 0 | the man worked as a carpenter. | 0.097476 | 10533 | carpenter |
| 1 | the man worked as a waiter. | 0.052383 | 15610 | waiter |
| 2 | the man worked as a barber. | 0.049627 | 13362 | barber |
| 3 | the man worked as a mechanic. | 0.037886 | 15893 | mechanic |
| 4 | the man worked as a salesman. | 0.037681 | 18968 | salesman |


## Working with community-provided models

In [None]:
classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

In [None]:
sequence_to_classify = "I am going to france."
candidate_labels = ["travel", "cooking", "dancing"]

In [None]:
classifier(sequence_to_classify, candidate_labels)

{'labels': ['travel', 'dancing', 'cooking'],
 'scores': [0.9866883754730225, 0.007197572384029627, 0.006114050280302763],
 'sequence': 'I am going to france.'}

In [None]:
pd.DataFrame(classifier(sequence_to_classify, candidate_labels))

| | sequence | labels | scores |
|---|---|---|---|
| 0 | I am going to france. | travel | 0.986688 |
| 1 | I am going to france. | dancing | 0.007198 |
| 2 | I am going to france. | cooking | 0.006114 |
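The scores above sum to 1 because, in its default single-label mode, the zero-shot pipeline turns each candidate label into a hypothesis (the default template is `"This example is {}."`), asks the NLI model how strongly the sequence entails it, and normalises the entailment logits with a softmax across the labels. A minimal sketch of that last step follows. The logits are made up for illustration and `rank_labels` is a hypothetical helper, not part of the library.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def rank_labels(sequence, candidate_labels, entailment_logits,
                template="This example is {}."):
    """Build one hypothesis per label and rank labels by softmaxed logits."""
    hypotheses = [template.format(label) for label in candidate_labels]
    scores = softmax(entailment_logits)
    ranked = sorted(zip(candidate_labels, scores),
                    key=lambda pair: pair[1], reverse=True)
    return {
        "sequence": sequence,
        "hypotheses": hypotheses,
        "labels": [label for label, _ in ranked],
        "scores": [score for _, score in ranked],
    }


# Hypothetical entailment logits for travel, cooking, dancing (not model output).
hypothetical_logits = [3.0, -2.1, -1.9]
result = rank_labels("I am going to france.",
                     ["travel", "cooking", "dancing"],
                     hypothetical_logits)
print(result["hypotheses"][0])  # This example is travel.
print(result["labels"])         # ['travel', 'dancing', 'cooking']
```

Because the softmax runs across labels, adding or removing a candidate label changes every score. This is worth keeping in mind when comparing runs with different label sets.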


In [None]:
sequence_to_classify = "I love to reading books."
candidate_labels = ["travel", "cooking", "dancing", "reading"]
pd.DataFrame(classifier(sequence_to_classify, candidate_labels))

| | sequence | labels | scores |
|---|---|---|---|
| 0 | I love to reading books. | reading | 0.989049 |
| 1 | I love to reading books. | travel | 0.004614 |
| 2 | I love to reading books. | cooking | 0.003654 |
| 3 | I love to reading books. | dancing | 0.002683 |


## Working with benchmarks and datasets

The datasets library provides an efficient way to load, process, and share datasets with the community through the Hugging Face hub. As with TensorFlow Datasets, it downloads a dataset from its original host on request, caches it locally, and loads it dynamically.

In [6]:
cola = load_dataset("glue", "cola")
cola["train"][25:28]

Reusing dataset glue (/root/.cache/huggingface/datasets/glue/cola/1.0.0/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad)



{'idx': [25, 26, 27],
 'label': [0, 0, 1],
 'sentence': ['Harry coughed himself.',
  'Harry coughed us into a fit.',
  'Bill followed the road into the forest.']}
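Note that the slice returns a dictionary of lists rather than a list of row dictionaries. The library stores data column by column (it is backed by Apache Arrow tables), so a slice cuts every column at once. The plain-Python sketch below imitates that behaviour; `ColumnarDataset` is a hypothetical class written for illustration and holds only the three rows shown above.

```python
class ColumnarDataset:
    """Tiny column-oriented container that mimics how indexing behaves."""

    def __init__(self, columns):
        lengths = {len(values) for values in columns.values()}
        if len(lengths) > 1:
            raise ValueError("all columns must have the same length")
        self._columns = columns

    def __len__(self):
        return len(next(iter(self._columns.values()), []))

    def __getitem__(self, key):
        if isinstance(key, str):
            # A column name returns the whole column.
            return self._columns[key]
        # An int returns one value per column; a slice returns one list per column.
        return {name: values[key] for name, values in self._columns.items()}


toy = ColumnarDataset({
    "idx": [25, 26, 27],
    "label": [0, 0, 1],
    "sentence": [
        "Harry coughed himself.",
        "Harry coughed us into a fit.",
        "Bill followed the road into the forest.",
    ],
})

print(toy[0:2])      # dict of lists, one list per column
print(toy[2])        # dict of scalars for a single row
print(toy["label"])  # [0, 0, 1]
```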

At the time of writing, the hub lists 1,703 NLP datasets and 28 metrics covering diverse tasks.

In [7]:
all_d = list_datasets()
metrics = list_metrics()

print(f"{len(all_d)} datasets and {len(metrics)} metrics exist in the hub\n")
pprint(all_d[:20], compact=True)
pprint(metrics, compact=True)

1703 datasets and 28 metrics exist in the hub

['acronym_identification', 'ade_corpus_v2', 'adversarial_qa', 'aeslc',
 'afrikaans_ner_corpus', 'ag_news', 'ai2_arc', 'air_dialogue',
 'ajgt_twitter_ar', 'allegro_reviews', 'allocine', 'alt', 'amazon_polarity',
 'amazon_reviews_multi', 'amazon_us_reviews', 'ambig_qa', 'ami', 'amttl',
 'anli', 'app_reviews']
['accuracy', 'bertscore', 'bleu', 'bleurt', 'cer', 'comet', 'coval', 'cuad',
 'f1', 'gleu', 'glue', 'indic_glue', 'matthews_correlation', 'meteor',
 'pearsonr', 'precision', 'recall', 'rouge', 'sacrebleu', 'sari', 'seqeval',
 'spearmanr', 'squad', 'squad_v2', 'super_glue', 'wer', 'wiki_split', 'xnli']


By default, load_dataset returns a DatasetDict object that holds several Dataset instances, one per split. When a split is selected with the split='...' argument, we get a single Dataset instance instead.

For example, the CoLA dataset comes as a DatasetDict with three splits: train, validation, and test.

Let's see the structure of the CoLA dataset object.

In [8]:
cola = load_dataset("glue", "cola")
cola

Reusing dataset glue (/root/.cache/huggingface/datasets/glue/cola/1.0.0/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad)



DatasetDict({
    train: Dataset({
        features: ['sentence', 'label', 'idx'],
        num_rows: 8551
    })
    validation: Dataset({
        features: ['sentence', 'label', 'idx'],
        num_rows: 1043
    })
    test: Dataset({
        features: ['sentence', 'label', 'idx'],
        num_rows: 1063
    })
})

In [12]:
cola["train"]

Dataset({
    features: ['sentence', 'label', 'idx'],
    num_rows: 8551
})

In [11]:
cola["train"][12]

{'idx': 12, 'label': 1, 'sentence': 'Bill rolled out of the room.'}

In [13]:
cola["validation"]

Dataset({
    features: ['sentence', 'label', 'idx'],
    num_rows: 1043
})

In [15]:
cola["validation"][68]

{'idx': 68,
 'label': 0,
 'sentence': 'Which report that John was incompetent did he submit?'}

The dataset object also carries metadata that can be helpful: split, description, citation, homepage, license, and info.

In [16]:
print("1#", cola["train"].description)
print("2#", cola["train"].citation)
print("3#", cola["train"].homepage)
print("4#", cola["train"].license)

1# GLUE, the General Language Understanding Evaluation benchmark
(https://gluebenchmark.com/) is a collection of resources for training,
evaluating, and analyzing natural language understanding systems.


2# @article{warstadt2018neural,
  title={Neural Network Acceptability Judgments},
  author={Warstadt, Alex and Singh, Amanpreet and Bowman, Samuel R},
  journal={arXiv preprint arXiv:1805.12471},
  year={2018}
}
@inproceedings{wang2019glue,
  title={{GLUE}: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding},
  author={Wang, Alex and Singh, Amanpreet and Michael, Julian and Hill, Felix and Levy, Omer and Bowman, Samuel R.},
  note={In the Proceedings of ICLR.},
  year={2019}
}

3# https://nyu-mll.github.io/CoLA/
4# 


XTREME is another popular cross-lingual benchmark that we already discussed.

Let's pick the MLQA example from the XTREME set. MLQA is a subset of the XTREME benchmark designed for assessing the performance of cross-lingual QA models.

It includes about 5,000 extractive QA instances in the SQuAD format across seven languages: English, German, Arabic, Hindi, Vietnamese, Spanish, and Simplified Chinese.

In [18]:
en_de = load_dataset("xtreme", "MLQA.en.de")
en_de

Reusing dataset xtreme (/root/.cache/huggingface/datasets/xtreme/MLQA.en.de/1.0.0/fb182342ff5c7a211ebf678cde070463acd29524b30b87f8f38c617948c2826a)



DatasetDict({
    test: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 4517
    })
    validation: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 512
    })
})

It can be more convenient to view the examples as a pandas DataFrame.

In [19]:
pd.DataFrame(en_de["test"][0:4])

| | id | title | context | question | answers |
|---|---|---|---|---|---|
| 0 | 037e8929e7e4d2f949ffbabd10f0f860499ff7c9 | Cell culture | An established or immortalized cell line has a... | Woraus besteht die Linie? | {'answer_start': [31], 'text': ['cell']} |
| 1 | 4b36724f3cbde7c287bde512ff09194cbba7f932 | Cell culture | The 19th-century English physiologist Sydney R... | Wann hat Roux etwas von seiner Medullarplatte ... | {'answer_start': [232], 'text': ['1885']} |
| 2 | 13e58403df16d88b0e2c665953e89575704942d4 | TRIPS Agreement | After the Uruguay round, the GATT became the b... | Was muss ratifiziert werden, wenn ein Land ger... | {'answer_start': [131], 'text': ['TRIPS']} |
| 3 | d23b5372af1de9425a4ae313c01eb80764c910d8 | TRIPS Agreement | Since TRIPS came into force, it has been subje... | Welche Teile der Welt kritisierten das TRIPS a... | {'answer_start': [67], 'text': ['developing co... |
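In the SQuAD format, each `answer_start` value is a character offset into `context`, and the matching `text` entry is the span that begins there. The sketch below checks this on the first row, using only the visible prefix of its truncated context, which happens to be long enough to cover the answer. `extract_answers` is a hypothetical helper written for illustration.

```python
def extract_answers(context, answers):
    """Slice each answer span out of the context and check it against 'text'."""
    spans = []
    for start, text in zip(answers["answer_start"], answers["text"]):
        span = context[start:start + len(text)]
        spans.append((span, span == text))
    return spans


# Visible prefix of the first row's context; the remainder is truncated in the
# display above and is deliberately left out here.
context_prefix = "An established or immortalized cell line has a"
answers = {"answer_start": [31], "text": ["cell"]}

spans = extract_answers(context_prefix, answers)
print(spans)  # [('cell', True)]
```

In this configuration the question is in German while the context and the answer span are in English, which is what makes the task cross-lingual.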


### Data manipulation with the datasets library