<a href="https://colab.research.google.com/github/rahiakela/transformers-research-and-practice/blob/main/mastering-transformers/02-hands-on-introduction-to-subject/1_hands_on_introduction_to_subject.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
!pip -q install transformers
!pip -q install datasets

## Working with language models and tokenizers

In [2]:
from transformers import BertTokenizer, BertModel, pipeline
from datasets import load_dataset

from datasets import list_datasets, list_metrics

import pandas as pd

from pprint import pprint

In [None]:
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

In [4]:
text = "Using Transformer is easy!"
tokenizer(text)

{'input_ids': [101, 2478, 10938, 2121, 2003, 3733, 999, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1]}

In [5]:
encoded_input = tokenizer(text, return_tensors="pt")
encoded_input

{'input_ids': tensor([[  101,  2478, 10938,  2121,  2003,  3733,   999,   102]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1]])}

In [None]:
model = BertModel.from_pretrained("bert-base-uncased")

output = model(**encoded_input)
output

In [7]:
len(output)

2

In [8]:
unmasker = pipeline("fill-mask", model="bert-base-uncased")
unmasker("The man worked as a [MASK].")

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForMaskedLM: ['cls.seq_relationship.weight', 'cls.seq_relationship.bias']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


[{'score': 0.09747546911239624,
  'sequence': 'the man worked as a carpenter.',
  'token': 10533,
  'token_str': 'carpenter'},
 {'score': 0.052383411675691605,
  'sequence': 'the man worked as a waiter.',
  'token': 15610,
  'token_str': 'waiter'},
 {'score': 0.04962698742747307,
  'sequence': 'the man worked as a barber.',
  'token': 13362,
  'token_str': 'barber'},
 {'score': 0.037886083126068115,
  'sequence': 'the man worked as a mechanic.',
  'token': 15893,
  'token_str': 'mechanic'},
 {'score': 0.037680838257074356,
  'sequence': 'the man worked as a salesman.',
  'token': 18968,
  'token_str': 'salesman'}]

In [9]:
pd.DataFrame(unmasker("The man worked as a [MASK]."))

| | sequence | score | token | token_str |
|---|---|---|---|---|
| 0 | the man worked as a carpenter. | 0.097475 | 10533 | carpenter |
| 1 | the man worked as a waiter. | 0.052383 | 15610 | waiter |
| 2 | the man worked as a barber. | 0.049627 | 13362 | barber |
| 3 | the man worked as a mechanic. | 0.037886 | 15893 | mechanic |
| 4 | the man worked as a salesman. | 0.037681 | 18968 | salesman |


## Working with community-provided models

In [None]:
classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

In [11]:
sequence_to_classify = "I am going to france."
candidate_labels = ["travel", "cooking", "dancing"]

In [12]:
classifier(sequence_to_classify, candidate_labels)

{'labels': ['travel', 'dancing', 'cooking'],
 'scores': [0.9866883754730225, 0.007197582628577948, 0.006114034913480282],
 'sequence': 'I am going to france.'}

In [13]:
pd.DataFrame(classifier(sequence_to_classify, candidate_labels))

| | sequence | labels | scores |
|---|---|---|---|
| 0 | I am going to france. | travel | 0.986688 |
| 1 | I am going to france. | dancing | 0.007198 |
| 2 | I am going to france. | cooking | 0.006114 |


In [14]:
sequence_to_classify = "I love to reading books."
candidate_labels = ["travel", "cooking", "dancing", "reading"]
pd.DataFrame(classifier(sequence_to_classify, candidate_labels))

| | sequence | labels | scores |
|---|---|---|---|
| 0 | I love to reading books. | reading | 0.989049 |
| 1 | I love to reading books. | travel | 0.004614 |
| 2 | I love to reading books. | cooking | 0.003654 |
| 3 | I love to reading books. | dancing | 0.002683 |


## Working with benchmarks and datasets

The `datasets` library provides an efficient utility for loading, processing, and sharing datasets with the community through the Hugging Face hub. As with TensorFlow Datasets, it makes it easy to download, cache, and dynamically load datasets directly from the original host on request.

In [None]:
cola = load_dataset("glue", "cola")

In [46]:
cola["train"][25:28]

{'idx': [25, 26, 27],
 'label': [0, 0, 1],
 'sentence': ['Harry coughed himself.',
  'Harry coughed us into a fit.',
  'Bill followed the road into the forest.']}

The hub hosts well over a thousand datasets and dozens of metrics for diverse tasks. The counts keep growing, so the following cell queries them directly.

In [16]:
all_d = list_datasets()
metrics = list_metrics()

print(f"{len(all_d)} datasets and {len(metrics)} metrics exist in the hub\n")
pprint(all_d[:20], compact=True)
pprint(metrics, compact=True)

1889 datasets and 33 metrics exist in the hub

['acronym_identification', 'ade_corpus_v2', 'adversarial_qa', 'aeslc',
 'afrikaans_ner_corpus', 'ag_news', 'ai2_arc', 'air_dialogue',
 'ajgt_twitter_ar', 'allegro_reviews', 'allocine', 'alt', 'amazon_polarity',
 'amazon_reviews_multi', 'amazon_us_reviews', 'ambig_qa', 'ami', 'amttl',
 'anli', 'app_reviews']
['accuracy', 'bertscore', 'bleu', 'bleurt', 'cer', 'chrf', 'code_eval', 'comet',
 'competition_math', 'coval', 'cuad', 'f1', 'gleu', 'glue', 'google_bleu',
 'indic_glue', 'matthews_correlation', 'meteor', 'pearsonr', 'precision',
 'recall', 'rouge', 'sacrebleu', 'sari', 'seqeval', 'spearmanr', 'squad',
 'squad_v2', 'super_glue', 'ter', 'wer', 'wiki_split', 'xnli']


By default, load_dataset() returns a DatasetDict object that holds several Dataset instances, one per split. When a split is selected (split='...'), we get a single Dataset instance instead.

For example, the CoLA dataset comes as a DatasetDict with three splits: train, validation, and test.

Let's see the structure of the CoLA dataset object.

In [None]:
cola = load_dataset("glue", "cola")

In [43]:
cola

DatasetDict({
    train: Dataset({
        features: ['sentence', 'label', 'idx'],
        num_rows: 8551
    })
    validation: Dataset({
        features: ['sentence', 'label', 'idx'],
        num_rows: 1043
    })
    test: Dataset({
        features: ['sentence', 'label', 'idx'],
        num_rows: 1063
    })
})

In [18]:
cola["train"]

Dataset({
    features: ['sentence', 'label', 'idx'],
    num_rows: 8551
})

In [19]:
cola["train"][12]

{'idx': 12, 'label': 1, 'sentence': 'Bill rolled out of the room.'}

In [20]:
cola["validation"]

Dataset({
    features: ['sentence', 'label', 'idx'],
    num_rows: 1043
})

In [21]:
cola["validation"][68]

{'idx': 68,
 'label': 0,
 'sentence': 'Which report that John was incompetent did he submit?'}

The dataset object also carries metadata that can be helpful: split, description, citation, homepage, license, and info.

In [22]:
print("1#", cola["train"].description)
print("2#", cola["train"].citation)
print("3#", cola["train"].homepage)
print("4#", cola["train"].license)

1# GLUE, the General Language Understanding Evaluation benchmark
(https://gluebenchmark.com/) is a collection of resources for training,
evaluating, and analyzing natural language understanding systems.


2# @article{warstadt2018neural,
  title={Neural Network Acceptability Judgments},
  author={Warstadt, Alex and Singh, Amanpreet and Bowman, Samuel R},
  journal={arXiv preprint arXiv:1805.12471},
  year={2018}
}
@inproceedings{wang2019glue,
  title={{GLUE}: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding},
  author={Wang, Alex and Singh, Amanpreet and Michael, Julian and Hill, Felix and Levy, Omer and Bowman, Samuel R.},
  note={In the Proceedings of ICLR.},
  year={2019}
}

3# https://nyu-mll.github.io/CoLA/
4# 


XTREME is another popular cross-lingual benchmark that we already discussed.

Let's pick the MLQA example from the XTREME set. MLQA is a subset of the XTREME benchmark that is designed for assessing the performance of cross-lingual QA models.

It includes about 5,000 extractive QA instances in the SQuAD format across seven languages: English, German, Arabic, Hindi, Vietnamese, Spanish, and Simplified Chinese.

In [None]:
en_de = load_dataset("xtreme", "MLQA.en.de")

In [42]:
en_de

DatasetDict({
    test: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 4517
    })
    validation: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 512
    })
})

It could be more convenient to view it within a pandas DataFrame.

In [24]:
pd.DataFrame(en_de["test"][0:4])

| | id | title | context | question | answers |
|---|---|---|---|---|---|
| 0 | 037e8929e7e4d2f949ffbabd10f0f860499ff7c9 | Cell culture | An established or immortalized cell line has a... | Woraus besteht die Linie? | {'answer_start': [31], 'text': ['cell']} |
| 1 | 4b36724f3cbde7c287bde512ff09194cbba7f932 | Cell culture | The 19th-century English physiologist Sydney R... | Wann hat Roux etwas von seiner Medullarplatte ... | {'answer_start': [232], 'text': ['1885']} |
| 2 | 13e58403df16d88b0e2c665953e89575704942d4 | TRIPS Agreement | After the Uruguay round, the GATT became the b... | Was muss ratifiziert werden, wenn ein Land ger... | {'answer_start': [131], 'text': ['TRIPS']} |
| 3 | d23b5372af1de9425a4ae313c01eb80764c910d8 | TRIPS Agreement | Since TRIPS came into force, it has been subje... | Welche Teile der Welt kritisierten das TRIPS a... | {'answer_start': [67], 'text': ['developing co... |


### Data manipulation with the datasets library

A dataset usually comes as a dictionary of subsets (splits), and the split parameter decides which subset(s), or which portion of a subset, is loaded. If split is left at its default of None, load_dataset() returns a dataset dictionary of all subsets (train, test, validation, or any other combination). If the split parameter is specified, it returns a single dataset rather than a dictionary.

In [25]:
cola_train = load_dataset("glue", "cola", split="train")
cola_train

Reusing dataset glue (/root/.cache/huggingface/datasets/glue/cola/1.0.0/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad)


Dataset({
    features: ['sentence', 'label', 'idx'],
    num_rows: 8551
})

We can get a mixture of train and validation subsets.

In [26]:
# The first 300 examples of train and the last 30 examples of validation
cola_sel = load_dataset("glue", "cola", split="train[:300]+validation[-30:]")
cola_sel

Reusing dataset glue (/root/.cache/huggingface/datasets/glue/cola/1.0.0/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad)


Dataset({
    features: ['sentence', 'label', 'idx'],
    num_rows: 330
})

In [27]:
# The first 100 examples from train and validation
cola_sel = load_dataset("glue", "cola", split="train[:100]+validation[:100]")
cola_sel

Reusing dataset glue (/root/.cache/huggingface/datasets/glue/cola/1.0.0/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad)


Dataset({
    features: ['sentence', 'label', 'idx'],
    num_rows: 200
})

In [28]:
# The first 50% of train and the last 30% of validation
cola_sel = load_dataset("glue", "cola", split="train[:50%]+validation[-30%:]")
cola_sel

Reusing dataset glue (/root/.cache/huggingface/datasets/glue/cola/1.0.0/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad)


Dataset({
    features: ['sentence', 'label', 'idx'],
    num_rows: 4589
})

In [29]:
# The first 20% of train and the examples in the slice [30:50] from validation
cola_sel = load_dataset("glue", "cola", split="train[:20%]+validation[30:50]")
cola_sel

Reusing dataset glue (/root/.cache/huggingface/datasets/glue/cola/1.0.0/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad)


Dataset({
    features: ['sentence', 'label', 'idx'],
    num_rows: 1730
})
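
The row counts reported for the percent-based selections follow from simple slice arithmetic. The sketch below is a plain-Python illustration, not the library's code: `pct_rows` is a hypothetical helper, and it assumes that percent boundaries are rounded to the nearest row, which is consistent with the counts shown in the outputs (4589 and 1730).

```python
def pct_rows(num_rows, start_pct=None, stop_pct=None):
    """Count the rows selected by a percent slice such as [:50%] or [-30%:].

    Assumption: each percent boundary is rounded to the nearest row index.
    """
    def to_abs(pct, default):
        if pct is None:
            return default
        boundary = round(pct * num_rows / 100)
        # Negative boundaries count back from the end, as in Python slicing.
        return boundary + num_rows if boundary < 0 else boundary

    start = to_abs(start_pct, 0)
    stop = to_abs(stop_pct, num_rows)
    return max(stop - start, 0)


# Split sizes taken from the CoLA DatasetDict printed earlier.
TRAIN_ROWS, VALIDATION_ROWS = 8551, 1043

# "train[:50%]+validation[-30%:]"
half_and_tail = pct_rows(TRAIN_ROWS, stop_pct=50) + pct_rows(VALIDATION_ROWS, start_pct=-30)
print(half_and_tail)  # 4589

# "train[:20%]+validation[30:50]" (the absolute slice contributes 20 rows)
fifth_and_slice = pct_rows(TRAIN_ROWS, stop_pct=20) + (50 - 30)
print(fifth_and_slice)  # 1730
```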

### Sorting, indexing, and shuffling

In [30]:
cola_sel.sort("label")["label"][:15]

[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

In [31]:
cola_sel.sort("label")["label"][-15:]

Loading cached sorted indices for dataset at /root/.cache/huggingface/datasets/glue/cola/1.0.0/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad/cache-e336eeb823488e8a.arrow


[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]

We are already familiar with Python slicing notation. Likewise, we can access several rows of a dataset using slice notation or a list of indices.

In [32]:
cola_sel[6, 19, 44]

{'idx': [6, 19, 44],
 'label': [1, 1, 1],
 'sentence': ['Fred watered the plants flat.',
  'The professor talked us into a stupor.',
  'The trolley rumbled through the tunnel.']}
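
Note that a multi-row selection comes back as a dictionary of columns rather than a list of rows. If row-wise records are more convenient, the columns can be transposed with plain Python. This is a minimal sketch that reuses the output above; the variable names are illustrative.

```python
# Column-oriented result, copied from the output of the multi-row selection.
batch = {
    "idx": [6, 19, 44],
    "label": [1, 1, 1],
    "sentence": [
        "Fred watered the plants flat.",
        "The professor talked us into a stupor.",
        "The trolley rumbled through the tunnel.",
    ],
}

# Transpose the columns into one dictionary per row.
rows = [dict(zip(batch, values)) for values in zip(*batch.values())]

for row in rows:
    print(row)
```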

In [33]:
# We shuffle the dataset as follows
cola_sel.shuffle(seed=42)[2:5]

{'idx': [683, 435, 548],
 'label': [1, 0, 1],
 'sentence': ['Mary will play the violin soon.',
  'I explained to fix the sink.',
  'The party lasted till midnight.']}

### Caching and reusability

Cached files let us load large datasets through memory mapping with a fast backend, provided the dataset fits on the drive. This caching also saves the results of operations on the drive so that they can be reused.

In [34]:
cola_sel.cache_files

[{'filename': '/root/.cache/huggingface/datasets/glue/cola/1.0.0/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad/glue-train.arrow'},
 {'filename': '/root/.cache/huggingface/datasets/glue/cola/1.0.0/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad/glue-validation.arrow'}]

### Dataset filter and map function

We might want to work with a specific selection of a dataset. For instance, we can retrieve only the sentences in the cola dataset that include the term kick, as shown in the following execution.

The datasets.Dataset.filter() function takes a predicate, here an anonymous function defined with the lambda keyword, and returns the examples for which it is true. We first reload a larger selection to filter:

In [35]:
# 100% of train and the last 30% of validation
cola_sel = load_dataset("glue", "cola", split="train[:100%]+validation[-30%:]")
cola_sel

Reusing dataset glue (/root/.cache/huggingface/datasets/glue/cola/1.0.0/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad)


Dataset({
    features: ['sentence', 'label', 'idx'],
    num_rows: 8864
})

In [36]:
cola_sel.filter(lambda s: "kick" in s["sentence"])["sentence"][:3]

  0%|          | 0/9 [00:00<?, ?ba/s]

['Jill kicked the ball from home plate to third base.',
 'Fred kicked the ball under the porch.',
 'Fred kicked the ball behind the tree.']
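
The predicate `"kick" in s["sentence"]` is a substring test, which is why every returned sentence contains "kicked". The plain-Python sketch below contrasts it with a whole-word match; the regular expression is an illustrative alternative, not something the original notebook uses.

```python
import re

# Two sentences from the filter output plus one non-matching CoLA sentence.
sentences = [
    "Jill kicked the ball from home plate to third base.",
    "Fred kicked the ball under the porch.",
    "Bill rolled out of the room.",
]

# Substring test, as in the lambda passed to filter().
substring_hits = [s for s in sentences if "kick" in s]

# Whole-word test: matches "kick" only when it stands alone as a word.
word_hits = [s for s in sentences if re.search(r"\bkick\b", s)]

print(len(substring_hits), len(word_hits))
```

Either predicate can be passed to filter() in the same way, as long as it takes an example dictionary and returns a Boolean.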

The following filter keeps only the positive (acceptable) examples from the set:

In [37]:
cola_sel.filter(lambda s: s["label"]==1)["sentence"][:3]

  0%|          | 0/9 [00:00<?, ?ba/s]

["Our friends won't buy this analysis, let alone the next one we propose.",
 "One more pseudo generalization and I'm giving up.",
 "One more pseudo generalization or I'm giving up."]

In some cases, we might not know the integer code of a class label. The str2int() method of the label feature looks it up by name:

In [41]:
cola_sel.filter(lambda s: s["label"]==cola_sel.features["label"].str2int("acceptable"))["sentence"][:3]

Loading cached processed dataset at /root/.cache/huggingface/datasets/glue/cola/1.0.0/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad/cache-804ec7c8ce72b3ab.arrow


["Our friends won't buy this analysis, let alone the next one we propose.",
 "One more pseudo generalization and I'm giving up.",
 "One more pseudo generalization or I'm giving up."]

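Conceptually, the label feature keeps an ordered list of class names, and the integer code is the position of a name in that list. The following is a plain-Python sketch of that idea, not the library's implementation. The name order is an assumption, although the identical filter outputs above confirm that "acceptable" maps to 1.

```python
# Assumed class names for CoLA, ordered by integer code.
label_names = ["unacceptable", "acceptable"]


def str2int(name):
    """Return the integer code of a class name (its position in the list)."""
    return label_names.index(name)


def int2str(code):
    """Return the class name for an integer code."""
    return label_names[code]


print(str2int("acceptable"), int2str(0))
```
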
### Processing data with the map function

The `datasets.Dataset.map()` function iterates over the dataset, applying a
processing function to each example in the set, and modifies the content of the examples.

In [40]:
cola_new = cola_sel.map(lambda e: {"len": len(e["sentence"])})
pd.DataFrame(cola_new[0:3])

Loading cached processed dataset at /root/.cache/huggingface/datasets/glue/cola/1.0.0/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad/cache-1aab51ec77e54904.arrow


| | sentence | label | idx | len |
|---|---|---|---|---|
| 0 | Our friends won't buy this analysis, let alone... | 1 | 0 | 71 |
| 1 | One more pseudo generalization and I'm giving up. | 1 | 1 | 49 |
| 2 | One more pseudo generalization or I'm giving up. | 1 | 2 | 48 |


As another example, the following piece of code cuts each sentence after 20 characters. We do not create a new feature, but instead update the content of the sentence feature.

In [None]:
cola_cut = cola_new.map(lambda e: {"sentence": e["sentence"][:20] + "_"})

In [49]:
pd.DataFrame(cola_cut[:5])

| | sentence | label | idx | len |
|---|---|---|---|---|
| 0 | Our friends won't bu_ | 1 | 0 | 71 |
| 1 | One more pseudo gene_ | 1 | 1 | 49 |
| 2 | One more pseudo gene_ | 1 | 2 | 48 |
| 3 | The more we study ve_ | 1 | 3 | 46 |
| 4 | Day by day the facts_ | 1 | 4 | 41 |
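
Both map calls follow the same rule: the dictionary returned by the function is merged into each example, so a new key adds a feature and an existing key overwrites it. The sketch below mimics that behavior on plain dictionaries. `map_examples` is a hypothetical helper for illustration, not part of the datasets library.

```python
def map_examples(examples, function):
    """Merge the dictionary returned by `function` into a copy of each example."""
    return [{**example, **function(example)} for example in examples]


# Two CoLA rows taken from the outputs above.
examples = [
    {"sentence": "One more pseudo generalization and I'm giving up.", "label": 1, "idx": 1},
    {"sentence": "One more pseudo generalization or I'm giving up.", "label": 1, "idx": 2},
]

# A new key ("len") adds a feature.
with_len = map_examples(examples, lambda e: {"len": len(e["sentence"])})

# An existing key ("sentence") is overwritten; "len" keeps the original length.
cut = map_examples(with_len, lambda e: {"sentence": e["sentence"][:20] + "_"})

print(with_len[0]["len"], cut[0]["sentence"])
```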


## Working with local files

To load a dataset from local files in a Comma-Separated Values (CSV), Text (TXT), or
JavaScript Object Notation (JSON) format, we pass the file type (csv, text, or json)
to the generic load_dataset() loading script, as shown in the following code snippet.

In [60]:
%%shell

wget -q https://github.com/PacktPublishing/Mastering-Transformers/raw/main/CH02/data/a.csv
wget -q https://github.com/PacktPublishing/Mastering-Transformers/raw/main/CH02/data/b.csv
wget -q https://github.com/PacktPublishing/Mastering-Transformers/raw/main/CH02/data/c.csv



In [59]:
from datasets import load_dataset

In [None]:
data1 = load_dataset("csv", data_files="a.csv", delimiter="\t")
data2 = load_dataset("csv", data_files=["a.csv", "b.csv", "c.csv"], delimiter="\t")
data3 = load_dataset("csv", data_files={"train": ["c.csv", "b.csv"], "test": ["c.csv"]}, delimiter="\t")
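
Although the files carry a .csv extension, the calls above pass delimiter="\t", so the columns are separated by tabs. The sketch below shows the effect of the delimiter with Python's standard csv module. The sample content is hypothetical (a header plus one row shaped like the sentence and label features); it is not the actual content of a.csv.

```python
import csv
import io

# Hypothetical tab-separated content with a header row.
raw = "sentence\tlabel\nan extremely unpleasant film . \t0\n"

# With the tab delimiter, each line splits into two fields.
tab_rows = list(csv.DictReader(io.StringIO(raw), delimiter="\t"))

# With the default comma delimiter, each line stays a single field.
comma_reader = csv.DictReader(io.StringIO(raw))
comma_fields = comma_reader.fieldnames

print(tab_rows[0])
print(comma_fields)
```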

## Preparing a dataset for model training

Let's start with the tokenization process. Each model has its own tokenizer, which is trained before the actual language model.

To use a tokenizer, we need the Transformers library installed. The following example loads the tokenizer of the pretrained distilbert-base-uncased model. We use map and an anonymous function with lambda to apply the tokenizer to each split in data3.

If batched is set to True in the map function, it passes a batch of examples to the tokenizer function. The batch_size value, 1000 by default, is the number of examples per batch passed to the function. If batch_size is None, the whole dataset is passed as a single batch.
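
To make the batching rule concrete, the following plain-Python sketch splits a column-oriented dataset into batches the way a batched map would hand them to the function. `iter_batches` is a hypothetical helper, not the library's code, and the 2500-row toy column is made up for illustration.

```python
def iter_batches(columns, batch_size=1000):
    """Yield dictionaries of column slices, one per batch.

    A batch_size of None (or a non-positive value) yields the whole
    dataset as a single batch.
    """
    num_rows = len(next(iter(columns.values())))
    if batch_size is None or batch_size <= 0:
        batch_size = max(num_rows, 1)
    for start in range(0, num_rows, batch_size):
        yield {name: values[start:start + batch_size] for name, values in columns.items()}


# Toy dataset with a single column of 2500 placeholder sentences.
columns = {"sentence": [f"sentence {i}" for i in range(2500)]}

default_sizes = [len(b["sentence"]) for b in iter_batches(columns)]
single_batch_sizes = [len(b["sentence"]) for b in iter_batches(columns, batch_size=None)]

print(default_sizes)
print(single_batch_sizes)
```

Because the function receives lists of values, a tokenizer can process each batch in one call, which is usually faster than tokenizing one example at a time.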

In [53]:
from transformers import DistilBertTokenizer

In [None]:
tokenizer = DistilBertTokenizer.from_pretrained("distilbert-base-uncased")

In [None]:
encoded_data3 = data3.map(lambda e: tokenizer(e["sentence"], padding=True, truncation=True, max_length=12), batched=True, batch_size=1000)

In [64]:
data3

DatasetDict({
    train: Dataset({
        features: ['sentence', 'label'],
        num_rows: 200
    })
    test: Dataset({
        features: ['sentence', 'label'],
        num_rows: 100
    })
})

In [65]:
encoded_data3

DatasetDict({
    train: Dataset({
        features: ['attention_mask', 'input_ids', 'label', 'sentence'],
        num_rows: 200
    })
    test: Dataset({
        features: ['attention_mask', 'input_ids', 'label', 'sentence'],
        num_rows: 100
    })
})

In [66]:
pprint(encoded_data3["test"][12])

{'attention_mask': [1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0],
 'input_ids': [101, 2019, 5186, 16010, 2143, 1012, 102, 0, 0, 0, 0, 0],
 'label': 0,
 'sentence': 'an extremely unpleasant film . '}
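
In the output above, the zeros at the end of input_ids are padding, and attention_mask marks real tokens with 1 and padding with 0. The sketch below reproduces that layout in plain Python. It is an illustration of padding to the longest sequence in a batch, not the tokenizer's implementation: `pad_to_longest` is a hypothetical helper, the pad ID of 0 is read off the output, and the second sequence is a placeholder standing in for a sentence that fills all 12 positions.

```python
def pad_to_longest(batch_ids, pad_id=0):
    """Pad every sequence to the longest one and build matching attention masks."""
    width = max(len(ids) for ids in batch_ids)
    input_ids = [ids + [pad_id] * (width - len(ids)) for ids in batch_ids]
    attention_mask = [[1] * len(ids) + [0] * (width - len(ids)) for ids in batch_ids]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Token IDs of 'an extremely unpleasant film . ' from the output above.
short = [101, 2019, 5186, 16010, 2143, 1012, 102]
# Placeholder 12-ID sequence (made-up IDs) representing the longest example.
longest = [101] + [1000] * 10 + [102]

padded = pad_to_longest([short, longest])

print(padded["input_ids"][0])
print(padded["attention_mask"][0])
```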
