# Fastai & AdaptNLP
Code example showing how to use the AdaptNLP library. We follow an example from the HuggingFace online course and train a model based on the `Bert` architecture on the MRPC dataset.

Sources:

* https://novetta.github.io/adaptnlp
* https://github.com/novetta/adaptnlp


## Required installations
`AdaptNLP` builds on a few base libraries, such as `transformers`, `datasets` and `fastai`.

In [1]:
!pip install adaptnlp -U
# Alternative: install the development version directly from GitHub
#!pip install git+https://github.com/novetta/adaptnlp@dev -U

Collecting adaptnlp
  Downloading adaptnlp-0.3.2-py3-none-any.whl (62 kB)
Collecting fastcore>=1.3.21
  Downloading fastcore-1.3.26-py3-none-any.whl (56 kB)

In [2]:
!pip install nbdev

Collecting nbdev
  Downloading nbdev-1.1.22-py3-none-any.whl (46 kB)
Collecting fastrelease
  Downloading fastrelease-0.1.12-py3-none-any.whl (14 kB)
Collecting ghapi
  Downloading ghapi-0.1.19-py3-none-any.whl (51 kB)
Installing collected packages: ghapi, fastrelease, nbdev
Successfully installed fastrelease-0.1.12 ghapi-0.1.19 nbdev-1.1.22
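
After installation it can be useful to confirm which versions actually ended up in the environment. The following is a minimal sketch using only the standard library; the helper name `installed_versions` is hypothetical and not part of AdaptNLP.

```python
from importlib import metadata

def installed_versions(packages):
    """Map each package name to its installed version, or None if it is missing."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions

# Package names as they appear in the text and install cells above
report = installed_versions(["adaptnlp", "transformers", "datasets", "fastai", "nbdev"])
for name, version in report.items():
    print(f"{name:<14}{version or 'not installed'}")
```

A missing entry usually means the install cell has to be re-run or the runtime restarted.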


## Choosing a suitable model

In [6]:
from adaptnlp import HFModelHub, HF_TASKS

In [7]:
hub = HFModelHub()

In [8]:
models = hub.search_model_by_task(HF_TASKS.TEXT_CLASSIFICATION)

In [9]:
models[:10]

[Model Name: distilbert-base-uncased-finetuned-sst-2-english, Tasks: [text-classification],
 Model Name: roberta-base-openai-detector, Tasks: [text-classification],
 Model Name: roberta-large-mnli, Tasks: [text-classification],
 Model Name: roberta-large-openai-detector, Tasks: [text-classification]]

In [13]:
models = hub.search_model_by_task(
    task=HF_TASKS.TEXT_CLASSIFICATION
)

In [14]:
models[:5]

[Model Name: distilbert-base-uncased-finetuned-sst-2-english, Tasks: [text-classification],
 Model Name: roberta-base-openai-detector, Tasks: [text-classification],
 Model Name: roberta-large-mnli, Tasks: [text-classification],
 Model Name: roberta-large-openai-detector, Tasks: [text-classification]]

In [15]:
model = models[0]
model

Model Name: distilbert-base-uncased-finetuned-sst-2-english, Tasks: [text-classification]
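
Taking `models[0]` simply picks the first search hit. If a specific architecture is wanted, the result list can be filtered by name instead. The sketch below assumes each result exposes a `name` attribute; `find_models` is a hypothetical helper, and the `SimpleNamespace` objects are stand-ins for the real search results (their names are copied from the output above).

```python
from types import SimpleNamespace

def find_models(models, keyword):
    """Return the models whose name contains `keyword` (case-insensitive)."""
    keyword = keyword.lower()
    return [m for m in models if keyword in m.name.lower()]

# Stand-ins for the search results; the names are copied from the output above
demo_models = [
    SimpleNamespace(name="distilbert-base-uncased-finetuned-sst-2-english"),
    SimpleNamespace(name="roberta-base-openai-detector"),
    SimpleNamespace(name="roberta-large-mnli"),
    SimpleNamespace(name="roberta-large-openai-detector"),
]

roberta_models = find_models(demo_models, "roberta")
print([m.name for m in roberta_models])
```

With the real `models` list, the same call would be `find_models(models, "roberta")`.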

## Building the dataset

In [3]:
from fastai.data.external import URLs, untar_data

In [4]:
data_path = untar_data(URLs.IMDB_SAMPLE)

In [5]:
data_path.ls()

(#1) [Path('/root/.fastai/data/imdb_sample/texts.csv')]

In [16]:
from adaptnlp import SequenceClassificationDatasets

In [18]:
# nbverbose is not installed in this environment, so use show_doc from nbdev (installed above)
from nbdev.showdoc import *
show_doc(SequenceClassificationDatasets.from_csvs)

We write a tokenizer function that accepts the following parameters:

* `item`
* `tokenizer`
* `tokenize_kwargs`

In [None]:
def tok_func(item, tokenizer, tokenize_kwargs):
    # Tokenize both MRPC sentences together as a single sentence pair
    return tokenizer(item['sentence1'], item['sentence2'], **tokenize_kwargs)
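
To see how the three parameters fit together without downloading a model, the function can be called with a stand-in tokenizer. This is only a sketch: `toy_tokenizer` is hypothetical, splits on whitespace, and merely mimics the `[CLS] A [SEP] B [SEP]` layout that BERT uses for sentence pairs. It also returns the keyword arguments it received, so the forwarding of `tokenize_kwargs` becomes visible.

```python
def tok_func(item, tokenizer, tokenize_kwargs):
    # Same function as above: tokenize both sentences as one pair
    return tokenizer(item['sentence1'], item['sentence2'], **tokenize_kwargs)

def toy_tokenizer(text_a, text_b, **kwargs):
    """Hypothetical stand-in for a HuggingFace tokenizer (whitespace split only)."""
    tokens = ['[CLS]'] + text_a.split() + ['[SEP]'] + text_b.split() + ['[SEP]']
    return {'tokens': tokens, 'received_kwargs': kwargs}

# One made-up item with the same keys as an MRPC row
item = {'sentence1': 'the dow was up', 'sentence2': 'the dow rose', 'idx': 0}
encoded = tok_func(item, toy_tokenizer, {'max_length': 64, 'padding': True})

print(encoded['tokens'])
print(encoded['received_kwargs'])
```

A real tokenizer returns `input_ids`, `attention_mask` and related fields instead of the plain token list shown here.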

In [None]:
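from nbdev.showdoc import *
# Module path taken from the source link in the documentation output below
from adaptnlp.training.core import TaskDatasets
show_doc(TaskDatasets)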

`class TaskDatasets` ([source](https://github.com/novetta/adaptnlp/tree/master/adaptnlp/training/core.py#L168))

> `TaskDatasets(train_dset, valid_dset, tokenizer_name:str=None, tokenize:bool=True, tokenize_func:callable=None, tokenize_kwargs:dict={}, auto_kwargs:dict={}, remove_cols:Union[str, List[str]]=None, label_keys:list=['labels'])`

A set of datasets for a particular task, with a simple API.

Note: This is the base API, `items` should be a set of regular text and model-ready labels,
      including label or one-hot encoding being applied.

In [None]:
remove_cols=['sentence1', 'sentence2', 'idx']
tokenize_kwargs = {'max_length':64, 'padding':True}

In [None]:
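from datasets import load_dataset

# MRPC from the GLUE benchmark (matches the cache path in the output below)
raw_datasets = load_dataset('glue', 'mrpc')
# Checkpoint name inferred from the tuner log further down
model_name = 'bert-base-uncased'

dsets = TaskDatasets(
    raw_datasets['train'], raw_datasets['validation'],
    tokenizer_name=model_name,
    tokenize_kwargs=tokenize_kwargs,
    tokenize_func=tok_func,
    remove_cols=remove_cols
)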

Loading cached processed dataset at /root/.cache/huggingface/datasets/glue/mrpc/1.0.0/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad/cache-10e15044f80459ba.arrow






In [None]:
from transformers import DataCollatorWithPadding

In [None]:
dls = dsets.dataloaders(
    batch_size=8, 
    collate_fn=DataCollatorWithPadding(tokenizer=dsets.tokenizer)
)
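
`DataCollatorWithPadding` pads each batch dynamically to the length of its longest sequence. The following pure-Python sketch illustrates that idea only; it is not the library implementation. The helper `pad_batch` is hypothetical, the token IDs are illustrative, and a padding ID of 0 is assumed.

```python
def pad_batch(features, pad_token_id=0):
    """Pad every sequence in the batch to the length of the longest one."""
    max_len = max(len(f['input_ids']) for f in features)
    batch = {'input_ids': [], 'attention_mask': []}
    for f in features:
        n_pad = max_len - len(f['input_ids'])
        batch['input_ids'].append(f['input_ids'] + [pad_token_id] * n_pad)
        # 1 marks real tokens, 0 marks padding
        batch['attention_mask'].append([1] * len(f['input_ids']) + [0] * n_pad)
    return batch

# Two illustrative sequences of different length
features = [
    {'input_ids': [101, 2023, 102]},
    {'input_ids': [101, 2023, 2003, 1037, 102]},
]
batch = pad_batch(features)
print(batch['input_ids'])
print(batch['attention_mask'])
```

Padding per batch rather than to a global maximum keeps short batches small, which tends to save compute.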

In [None]:
dls.show_batch(n=4)

| | Input | Labels |
|---|---|---|
| 0 | in his female disguise, the real estate heir used the name dorothy ciner, a childhood friend. in his female disguise, he used the name dorothy ciner, a childhood friend, and rented an apartment in galveston. | |
| 1 | of the 23. 5 million high - speed lines, 16. 3 million provided advanced services, which the fcc defines as speeds exceeding 200 kbps in both directions. a total of 16. 3 million lines provided advanced services, those services at speeds exceeding 200 kbps in both directions. | |
| 2 | sirius recently began carrying national public radio, a deal pooh - poohed by xm because it doesn't include popular shows like " all things considered " and " morning edition. " sirius carries national public radio, although it doesn't include popular shows such as " all things considered " and " morning edition. " | |
| 3 | in late morning trading, the dow was up 13. 88, or 0. 2 percent, at 9, 002. 93, having shed 2. 3 percent last week. in early trading, the dow jones industrial average was down 39. 94, or 0. 4 percent, at 8, 945. 50, having slipped 3. 61 points monday. | |


## Fine-tuning
For fine-tuning we use the `SequenceClassificationTuner`.

In [None]:
from adaptnlp import SequenceClassificationTuner, Strategy

In [None]:
show_doc(SequenceClassificationTuner)

`class SequenceClassificationTuner` ([source](https://github.com/novetta/adaptnlp/tree/master/adaptnlp/training/sequence_classification.py#L222))

> `SequenceClassificationTuner(dls:DataLoaders, model_name:str, tokenizer=None, loss_func=CrossEntropyLoss(), metrics=[<function accuracy at 0x7fdfcacfecb0>, <fastai.metrics.AccumMetric object at 0x7fdfcac78090>], opt_func=Adam, additional_cbs=None, expose_fastai_api=False, num_classes:int=None, **kwargs)` :: `AdaptiveTuner`

An `AdaptiveTuner` with good defaults for Sequence Classification tasks

**Valid kwargs and defaults:**
  - `lr`:float = 0.001
  - `splitter`:function = `trainable_params`
  - `cbs`:list = None
  - `path`:Path = None
  - `model_dir`:Path = 'models'
  - `wd`:float = None
  - `wd_bn_bias`:bool = False
  - `train_bn`:bool = True
  - `moms`: tuple(float) = (0.95, 0.85, 0.95)

In [None]:
# Use the same checkpoint name as for the tokenizer (the log below refers to bert-base-uncased)
tuner = SequenceClassificationTuner(dls, model_name, num_classes=2)

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForSequenceClassification: ['cls.seq_relationship.bias', 'cls.predictions.bias', 'cls.predictions.decoder.weight', 'cls.seq_relationship.weight', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at

TypeError: ignored

In [None]:
show_doc(SequenceClassificationTuner.tune)

In [None]:
tuner.tune(3, 5e-5, strategy=Strategy.OneCycle)
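
The call above trains for 3 epochs with a peak learning rate of `5e-5` under a one-cycle policy. As a rough illustration of what such a schedule looks like, here is a sketch of a cosine-annealed one-cycle curve. The default values (`div=25`, `div_final=1e5`, `pct_start=0.25`) follow fastai's `fit_one_cycle` defaults as I understand them; that `Strategy.OneCycle` maps onto exactly this schedule is an assumption, and both function names are hypothetical.

```python
import math

def cos_anneal(start, end, pos):
    """Cosine interpolation from `start` (pos=0) to `end` (pos=1)."""
    return start + (1 + math.cos(math.pi * (1 - pos))) * (end - start) / 2

def one_cycle_lr(pct, lr_max, div=25.0, div_final=1e5, pct_start=0.25):
    """Learning rate at training progress `pct` in [0, 1]."""
    if pct < pct_start:
        # Warm-up phase: rise from lr_max / div to lr_max
        return cos_anneal(lr_max / div, lr_max, pct / pct_start)
    # Annealing phase: fall from lr_max to lr_max / div_final
    return cos_anneal(lr_max, lr_max / div_final, (pct - pct_start) / (1 - pct_start))

for pct in (0.0, 0.25, 0.5, 1.0):
    print(f"{pct:.2f} -> {one_cycle_lr(pct, 5e-5):.2e}")
```

The learning rate starts low, peaks after a quarter of training, and then decays towards almost zero, which is why the value passed to `tune` acts as a maximum rather than a constant.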

In [None]:
tuner.save('fine_tuned_model')