# DeepPavlov: Transfer Learning with BERT
### Valerie Nayak
### November 18, 2019

Notebook from TowardsDataScience

Today we will cover the following tasks:
* classification
* tagging (Named Entity Recognition)

## BERT input representation
Text preprocessing for BERT relies on splitting text into subtokens (WordPieces). BERT then internally represents each subtoken as the sum of three vectors:
* subtoken embedding
* segment embedding
* position embedding
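
To make the two steps above concrete, here is a minimal pure-Python sketch. It assumes a toy vocabulary, a greedy longest-match-first split (the usual WordPiece strategy), and small random vectors in place of learned embeddings. The names `VOCAB`, `wordpiece`, `tokenize`, and `input_vectors` are illustrative and are not part of DeepPavlov or BERT.

```python
import random

# Toy vocabulary (an assumption for illustration); real BERT vocabularies
# hold tens of thousands of entries.
VOCAB = {"[CLS]", "[SEP]", "[UNK]", "play", "##ing", "##ed", "the", "game"}


def wordpiece(word, vocab=VOCAB, unk="[UNK]"):
    """Greedily split one word into the longest subtokens found in vocab."""
    pieces, start = [], 0
    while start < len(word):
        end, piece = len(word), None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # marks a word continuation
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:  # no subtoken matches: the whole word is unknown
            return [unk]
        pieces.append(piece)
        start = end
    return pieces


def tokenize(text):
    """Whitespace-split the text, then add the special tokens BERT expects."""
    tokens = ["[CLS]"]
    for word in text.split():
        tokens.extend(wordpiece(word))
    tokens.append("[SEP]")
    return tokens


DIM = 4  # toy embedding size
rng = random.Random(0)


def random_vector():
    return [rng.uniform(-1.0, 1.0) for _ in range(DIM)]


# Random stand-ins for the three learned embedding tables.
token_emb = {tok: random_vector() for tok in sorted(VOCAB)}
segment_emb = [random_vector(), random_vector()]
position_emb = [random_vector() for _ in range(16)]


def input_vectors(tokens, segment_id=0):
    """Sum subtoken, segment and position embeddings element-wise."""
    return [
        [t + s + p for t, s, p in zip(token_emb[tok], segment_emb[segment_id], position_emb[i])]
        for i, tok in enumerate(tokens)
    ]


tokens = tokenize("playing the game")
print(tokens)  # ['[CLS]', 'play', '##ing', 'the', 'game', '[SEP]']
vectors = input_vectors(tokens)
print(len(vectors), len(vectors[0]))  # 6 4
```

Each subtoken ends up with one vector of size `DIM`, and the same subtoken at a different position or in a different segment gets a different vector.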

<img src="https://github.com/deepmipt/dp_tutorials/blob/master/img/BERT_input.png?raw=1" width="75%" />

## BERT for text classification
To use BERT for a text classification task, we can train just one dense layer on top of the output of the last BERT Transformer layer for the special `[CLS]` token.
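
As a rough sketch of what such a classification head computes, the snippet below applies a single dense layer followed by a softmax to a made-up `[CLS]` vector. The weights, the three-dimensional vector, and the helper name `dense_softmax` are all assumptions for illustration; in the real model the `[CLS]` vector has 768 dimensions and the weights are learned.

```python
import math


def dense_softmax(cls_vector, weights, bias):
    """One dense layer followed by a softmax over the class logits."""
    logits = [
        sum(w * x for w, x in zip(row, cls_vector)) + b
        for row, b in zip(weights, bias)
    ]
    highest = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - highest) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


LABELS = ["Not Insult", "Insult"]

# Toy values standing in for the [CLS] output and the learned parameters.
cls_vector = [0.5, -1.0, 2.0]
weights = [[0.1, 0.2, -0.3],    # one row of weights per class
           [-0.1, -0.2, 0.3]]
bias = [0.0, 0.0]

probs = dense_softmax(cls_vector, weights, bias)
predicted = LABELS[probs.index(max(probs))]
print(predicted, probs)
```

Only `weights` and `bias` belong to the new layer; everything that produces `cls_vector` comes from the pre-trained BERT model.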

<img src="https://github.com/deepmipt/dp_tutorials/blob/master/img/BERT_classification.png?raw=1" width="75%" />

Install the DeepPavlov library:

In [None]:
! pip install deeppavlov

Install the requirements for the BERT-based classification model trained to detect insults in [Social Commentary](https://www.kaggle.com/c/detecting-insults-in-social-commentary):

In [2]:
! python -m deeppavlov install insults_kaggle_bert

email-validator not installed, email fields will be treated as str.
To install, run: pip install email-validator
2019-11-18 15:41:15.934 INFO in 'deeppavlov.core.common.file'['file'] at line 30: Interpreting 'insults_kaggle_bert' as '/usr/local/lib/python3.6/dist-packages/deeppavlov/configs/classifiers/insults_kaggle_bert.json'
Collecting tensorflow==1.14.0
  Downloading https://files.pythonhosted.org/packages/de/f0/96fb2e0412ae9692dbf400e5b04432885f677ad6241c088ccc5fe7724d69/tensorflow-1.14.0-cp36-cp36m-manylinux1_x86_64.whl (109.2MB)
     |████████████████████████████████| 109.2MB 101kB/s 
Collecting tensorboard<1.15.0,>=1.14.0
  Downloading https://files.pythonhosted.org/packages/91/2d/2ed263449a078cd9c8a9ba50ebd50123adf1f8cfbea1492f9084169b89d9/tensorboard-1.14.0-py3-none-any.whl (3.1MB)
     |████████████████████████████████| 3.2MB 38.6MB/s 
Collecting tensorflow-estimator<1.15.0rc0,>=1.14.0rc0
  Downloading https://files.pythonhosted.org/packages/3c/d5/218

Download the pre-trained model and interact with it from the CLI:

In [0]:
! python -m deeppavlov interact -d insults_kaggle_bert

email-validator not installed, email fields will be treated as str.
To install, run: pip install email-validator
2019-11-18 15:42:02.410 INFO in 'deeppavlov.core.common.file'['file'] at line 30: Interpreting 'insults_kaggle_bert' as '/usr/local/lib/python3.6/dist-packages/deeppavlov/configs/classifiers/insults_kaggle_bert.json'
2019-11-18 15:42:03.505 INFO in 'deeppavlov.core.data.utils'['utils'] at line 64: Downloading from http://files.deeppavlov.ai/deeppavlov_data/bert/cased_L-12_H-768_A-12.zip?config=insults_kaggle_bert to /root/.deeppavlov/downloads/cased_L-12_H-768_A-12.zip
100% 404M/404M [01:36<00:00, 4.20MB/s]
2019-11-18 15:43:39.720 INFO in 'deeppavlov.core.data.utils'['utils'] at line 216: Extracting /root/.deeppavlov/downloads/cased_L-12_H-768_A-12.zip archive into /root/.deeppavlov/downloads/bert_models
2019-11-18 15:43:44.100 INFO in 'deeppavlov.core.data.utils'['utils'] at line 64: Downloading from http://files.deeppavlov.ai/deeppavlov_data/classifiers/insults_kaggle_v3.t

Interact with the text classification model through the DeepPavlov Python API:

In [0]:
from deeppavlov import build_model, configs

model = build_model(configs.classifiers.insults_kaggle_bert, download=False) # download=True if model is not downloaded yet

In [0]:
model(['hey, how are you?', 'You are so stupid!'])

['Not Insult', 'Insult']

### Data format for classification

Let's check the training data for the insults classification model. We can get the data path from the `dataset_reader` section of the model configuration file.

In [0]:
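import json
from pprint import pprint

# Load the model configuration; the context manager closes the file afterwards
with open(configs.classifiers.insults_kaggle_bert) as config_file:
    model_config = json.load(config_file)

pprint(model_config['dataset_reader'])
pprint(model_config['metadata']['variables'])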

There are three .csv files:

In [0]:
! ls ~/.deeppavlov/downloads/insults_data/

test.csv  train.csv  valid.csv


In [0]:
! head ~/.deeppavlov/downloads/insults_data/train.csv

If you want to train the model on your own data, you need to create a configuration file, set `data_path` to the folder containing train.csv, valid.csv, and test.csv, and change `MODEL_PATH` to the location where the trained model should be saved. Details are in the [documentation](http://docs.deeppavlov.ai/en/master/features/models/classifiers.html#how-to-train-on-other-datasets).
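
The configuration change described above could look roughly like the sketch below. It assumes that `data_path` sits inside the `dataset_reader` section and that `MODEL_PATH` sits inside `metadata.variables`, the two sections printed earlier. The helper `customize_config`, the stand-in dictionary `toy_config`, and all paths are hypothetical, so check the keys against your actual configuration file.

```python
import copy
import json


def customize_config(config, data_path, model_path):
    """Return a copy of the config pointing at new data and a new model folder."""
    new_config = copy.deepcopy(config)  # leave the original config untouched
    # Assumed key locations; verify them in the real configuration file.
    new_config["dataset_reader"]["data_path"] = data_path
    new_config["metadata"]["variables"]["MODEL_PATH"] = model_path
    return new_config


# Hypothetical stand-in for the dictionary loaded from the real config file.
toy_config = {
    "dataset_reader": {"data_path": "old/data/"},
    "metadata": {"variables": {"MODEL_PATH": "old/model/"}},
}

my_config = customize_config(toy_config, "my_data/", "my_model/")
print(json.dumps(my_config, indent=2))
```

With the real `model_config` in place of `toy_config`, the resulting dictionary can be passed to `train_model`, or written to a JSON file and used from the CLI.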

Train the model from the CLI:
```
! python -m deeppavlov train config_name
```
or in Python:
```
from deeppavlov import train_model
model = train_model(model_config)
```

## BERT for tagging (Named Entity Recognition)

The BERT model can be used for tagging tasks such as Named Entity Recognition and Part of Speech tagging.
We train only one dense layer on top of the output of the last BERT Transformer layer for each token. You can optionally add a CRF layer on top of the dense layer, as in the most common tagging architecture, BiLSTM + CRF.

Named Entity Recognition:

For example, we want to extract persons' and organizations' names from the text. Then for the input text:

    Yan Goodfellow works for Google Brain

a NER model needs to provide the following sequence of tags:

    B-PER I-PER    O     O   B-ORG  I-ORG

Here the *B-* and *I-* prefixes stand for the beginning and the inside of an entity, while *O* stands for outside any entity (no tag). Markup with this prefix scheme is called *BIO markup*. It makes it possible to distinguish consecutive entities of the same type.
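
To illustrate how BIO tags map back to entities, here is a small sketch that groups tagged tokens into spans. The function name `bio_to_entities` is made up for this example, and the sketch simply ignores an *I-* tag that does not continue an entity of the same type, which is one of several possible conventions.

```python
def bio_to_entities(tokens, tags):
    """Group tokens into (entity_type, text) pairs according to BIO tags."""
    entities = []
    current_type, current_tokens = None, []

    def close_entity():
        if current_tokens:
            entities.append((current_type, " ".join(current_tokens)))

    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A B- tag always starts a new entity, even right after another one.
            close_entity()
            current_type, current_tokens = tag[2:], [token]
        elif tag.startswith("I-") and current_type == tag[2:]:
            current_tokens.append(token)
        else:
            # "O" or an I- tag with nothing to continue: treated as no entity.
            close_entity()
            current_type, current_tokens = None, []
    close_entity()
    return entities


tokens = "Yan Goodfellow works for Google Brain".split()
tags = ["B-PER", "I-PER", "O", "O", "B-ORG", "I-ORG"]
print(bio_to_entities(tokens, tags))
# [('PER', 'Yan Goodfellow'), ('ORG', 'Google Brain')]
```

Because every entity starts with its own *B-* tag, two organizations that appear side by side (tagged `B-ORG B-ORG`) come out as two separate entities rather than one.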

Here is how input is preprocessed for tagging:

<img src="https://github.com/deepmipt/dp_tutorials/blob/master/img/BERT_NER.png?raw=1" width="75%" />

In [0]:
! python -m deeppavlov interact ner_ontonotes_bert -d

Data for the Named Entity Recognition task is usually stored in CoNLL files.
A typical CoNLL file with NER data contains lines with pairs of tokens (words or punctuation symbols) and tags, separated by whitespace. In many cases additional information, such as POS tags, is included between the token and the tag. Different documents are separated by lines **starting** with the **-DOCSTART-** token. Different sentences are separated by an empty line. Example:

    -DOCSTART- -X- -X- O

    EU NNP B-NP B-ORG
    rejects VBZ B-VP O
    German JJ B-NP B-MISC
    call NN I-NP O
    to TO B-VP O
    boycott VB I-VP O
    British JJ B-NP B-MISC
    lamb NN I-NP O
    . . O O

    Peter NNP B-NP B-PER
    Blackburn NNP I-NP I-PER
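
A file in this layout can be read with a few lines of Python. The sketch below is not DeepPavlov's own `dataset_reader`. It assumes that the first column is the token and the last column is the NER tag, and the name `parse_conll` is chosen only for this example.

```python
def parse_conll(lines):
    """Parse CoNLL lines into documents -> sentences -> (tokens, tags)."""
    documents, sentences, tokens, tags = [], [], [], []

    def end_sentence():
        if tokens:
            sentences.append((list(tokens), list(tags)))
            tokens.clear()
            tags.clear()

    def end_document():
        nonlocal sentences
        end_sentence()
        if sentences:
            documents.append(sentences)
            sentences = []

    for line in lines:
        line = line.strip()
        if line.startswith("-DOCSTART-"):
            end_document()   # a new document begins
        elif not line:
            end_sentence()   # an empty line closes the current sentence
        else:
            fields = line.split()
            tokens.append(fields[0])   # first column: the token
            tags.append(fields[-1])    # last column: the NER tag
    end_document()
    return documents


SAMPLE = """-DOCSTART- -X- -X- O

EU NNP B-NP B-ORG
rejects VBZ B-VP O
German JJ B-NP B-MISC
call NN I-NP O
to TO B-VP O
boycott VB I-VP O
British JJ B-NP B-MISC
lamb NN I-NP O
. . O O

Peter NNP B-NP B-PER
Blackburn NNP I-NP I-PER
"""

documents = parse_conll(SAMPLE.splitlines())
for sentence_tokens, sentence_tags in documents[0]:
    print(list(zip(sentence_tokens, sentence_tags)))
```

For the example above this yields one document with two sentences, and the columns between the token and the NER tag are skipped.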
    
    
If you want to train the model on your own data, you can convert it to this CoNLL format or implement your own version of `dataset_reader`. As with the classification task, the model can be trained from the CLI:
```
! python -m deeppavlov train config_name
```
or in Python:
```
from deeppavlov import train_model
model = train_model(model_config)
```