# Pre-train ELECTRA
In this section, we will train ELECTRA from scratch with TensorFlow using scripts provided by ELECTRA’s authors in google-research/electra. Then we will convert the model to PyTorch’s checkpoint, which can be easily fine-tuned on downstream tasks using Hugging Face’s transformers library.

### Setup
!pip install tensorflow-gpu==1.15

!pip install transformers==2.8.0

!pip install -U tensorboard

!git clone https://github.com/google-research/electra.git


In [2]:
import os, json

from datetime import datetime
from transformers import AutoTokenizer

### **Data**

We will pre-train ELECTRA on a Portuguese movie subtitle dataset retrieved from OpenSubtitles concatened with Wikipedia in Portuguese and BrWac corpus. This dataset is 5.4 GB in size and we will train on a small subset of ~30 MB for presentation.

In [3]:
# We will pre-train ELECTRA on a Portuguese Wikipedia dataset retrieved from https://dumps.wikimedia.org/. This dataset is 1.8 GB in size.
DATA_DIR = 'data/'
MODEL_DIR = DATA_DIR + 'models/'
MODEL_NAME = 'electranez-small'
VOCAB_FILE = DATA_DIR + 'vocab/vocab.txt'
TEXT_DIR = DATA_DIR + 'txt/'
TFRECORDS_DIR = DATA_DIR + 'pretrain_tfrecords/'

if not os.path.exists(VOCAB_FILE):
    print('Vocab file not found!')
    
if not os.path.exists(TEXT_DIR + 'text.txt'):
    print('Corpus not found!')
    
if not os.path.exists(MODEL_DIR):
    os.makedirs(MODEL_DIR, mode=0o777)
    
if not os.path.exists(TFRECORDS_DIR):
    os.makedirs(TFRECORDS_DIR, mode=0o777)

### **Start Training**
We use run_pretraining.py to pre-train an ELECTRA model.

This takes slightly over 4 days on a Tesla V100 GPU. However, the model should achieve decent results after 200k steps (10 hours of training on the v100 GPU).

To customize the training, create a .json file containing the hyperparameters. Please refer configure_pretraining.py for default values of all hyperparameters.

Below, we set the hyperparameters to train the model for only 100 steps.

**We use build_pretraining_dataset.py to create a pre-training dataset from a dump of raw text.**

In [4]:
start_time = datetime.now()

!python3 ../../google/electra/build_pretraining_dataset.py \
  --corpus-dir $DATA_DIR/txt/ \
  --vocab-file $VOCAB_DIR/vocab.txt \
  --output-dir $DATA_DIR/pretrain_tfrecords/ \
  --max-seq-length 128 \
  --blanks-separate-docs False \
  --no-lower-case \
  --num-processes 10

print('Duration: {}'.format(datetime.now() - start_time))

Traceback (most recent call last):
  File "../../google/electra/build_pretraining_dataset.py", line 230, in <module>
    main()
  File "../../google/electra/build_pretraining_dataset.py", line 216, in main
    utils.rmkdir(args.output_dir)
  File "/home/modanez/Documentos/AmbienteDeTrabalho/Desenvolvimento/python/google/electra/util/utils.py", line 65, in rmkdir
    mkdir(path)
  File "/home/modanez/Documentos/AmbienteDeTrabalho/Desenvolvimento/python/google/electra/util/utils.py", line 55, in mkdir
    tf.io.gfile.makedirs(path)
  File "/home/modanez/anaconda3/envs/electra/lib/python3.7/site-packages/tensorflow_core/python/lib/io/file_io.py", line 453, in recursive_create_dir_v2
    pywrap_tensorflow.RecursivelyCreateDir(compat.as_bytes(path))
tensorflow.python.framework.errors_impl.PermissionDeniedError: /pretrain_tfrecords; Permission denied
Duration: 0:00:01.286837


In [4]:
hparams = {
    'do_train': 'true',
    'do_eval': 'false',
    'model_size': 'small',
    'do_lower_case': 'false',
    'vocab_size': 52000,
    'num_train_steps': 10000,
    'save_checkpoints_steps': 1000,
    'train_batch_size': 36,
    'electra_objective' :  True
}

with open(DATA_DIR + 'hparams_36.json', 'w') as f:
    json.dump(hparams, f)

In [6]:
# Let’s start training:
start_time = datetime.now()

!python3 ../../google/electra/run_pretraining.py --data-dir $DATA_DIR --model-name $MODEL_NAME --hparams $DATA_DIR'hparams_36.json'

print('Duration: {}'.format(datetime.now() - start_time))

Config:
debug False
disallow_correct False
disc_weight 50.0
do_eval false
do_lower_case false
do_train true
electra_objective True
electric_objective False
embedding_size 128
eval_batch_size 128
gcp_project None
gen_weight 1.0
generator_hidden_size 0.25
generator_layers 1.0
iterations_per_loop 200
keep_checkpoint_max 5
learning_rate 0.0005
lr_decay_power 1.0
mask_prob 0.15
max_predictions_per_seq 19
max_seq_length 128
model_dir data/models/electranez-small
model_hparam_overrides {}
model_name electranez-small
model_size small
num_eval_steps 100
num_tpu_cores 1
num_train_steps 10000
num_warmup_steps 10000
pretrain_tfrecords data/pretrain_tfrecords/pretrain_data.tfrecord*
results_pkl data/models/electranez-small/results/unsup_results.pkl
results_txt data/models/electranez-small/results/unsup_results.txt
save_checkpoints_steps 1000
temperature 1.0
tpu_job_name None
tpu_name None
tpu_zone None
train_batch_size 36
two_tower_generator False
uniform_generator False
untied_generator True
untie

If you are training on a virtual machine, run the following lines on the terminal to moniter the training process with TensorBoard.

In [2]:
%load_ext tensorboard

### **Convert Tensorflow checkpoints to PyTorch format**

Hugging Face has a tool to convert Tensorflow checkpoints to PyTorch. However, this tool has yet been updated for ELECTRA. Fortunately, I found a GitHub repo by [@lonePatient](https://github.com/lonePatient/electra_pytorch.git) that can help us with this task.


In [12]:
config = {
  'vocab_size': 119547,
  'embedding_size': 128,
  'hidden_size': 256,
  'num_hidden_layers': 12,
  'num_attention_heads': 4,
  'intermediate_size': 1024,
  'generator_size': '0.25',
  'hidden_act': 'gelu',
  'hidden_dropout_prob': 0.1,
  'attention_probs_dropout_prob': 0.1,
  'max_position_embeddings': 512,
  'type_vocab_size': 2,
  'initializer_range': 0.02
}

with open(MODEL_DIR + '/config.json', 'w') as f:
    json.dump(config, f)

In [15]:
print(MODEL_DIR)

data/models/electranez-small/


In [21]:
!python ../../diversos/electra_pytorch/convert_electra_tf_checkpoint_to_pytorch.py \
    --tf_checkpoint_path=$MODEL_DIR/electranez-small/ \
    --electra_config_file=$MODEL_DIR/config.json \
    --pytorch_dump_path=$MODEL_DIR/pytorch_model.bin

INFO:model.configuration_utils:loading configuration file data/models/electranez-small//config.json
INFO:model.configuration_utils:Model config {
  "attention_probs_dropout_prob": 0.1,
  "embedding_size": 128,
  "finetuning_task": null,
  "generator_size": "0.25",
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 256,
  "initializer_range": 0.02,
  "intermediate_size": 1024,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "num_attention_heads": 4,
  "num_hidden_layers": 12,
  "num_labels": 2,
  "output_attentions": false,
  "output_hidden_states": false,
  "pruned_heads": {},
  "torchscript": false,
  "type_vocab_size": 2,
  "vocab_size": 119547
}

INFO:model.modeling_electra:Converting TensorFlow checkpoint from /home/modanez/Documentos/AmbienteDeTrabalho/Desenvolvimento/python/modanez/electra/data/models/electranez-small
INFO:model.modeling_electra:Loading TF weight discriminator_predictions/dense/bias with shape [256]
INFO:model.modeling_electra:L

### **Use ELECTRA with transformers**

After converting the model checkpoint to PyTorch format, we can start to use our pre-trained ELECTRA model on downstream tasks with the transformers library.

In [2]:
from transformers import ElectraForPreTraining, ElectraTokenizerFast

In [9]:
discriminator = ElectraForPreTraining.from_pretrained(MODEL_DIR+'/electranez-small/')
tokenizer = ElectraTokenizerFast.from_pretrained(DATA_DIR, do_lower_case=False)

In [10]:
sentence = 'os pássaros estão cantando' # The birds are singing
fake_sentence = 'os pássaros estão conversando' # The birds are speaking 

fake_tokens = tokenizer.tokenize(fake_sentence, add_special_tokens=True)
fake_inputs = tokenizer.encode(fake_sentence, return_tensors='pt')
discriminator_outputs = discriminator(fake_inputs)
predictions = discriminator_outputs[0] > 0

[print('%7s' % token, end='') for token in fake_tokens]
print('\n')
[print('%7s' % int(prediction), end='') for prediction in predictions.tolist()];

  [CLS]     os passar   ##os   esta    ##oconversa  ##ndo  [SEP]

      1      0      0      0      0      0      0      0      1

### **Build Pretraining Dataset**

We will use the tokenizer of bert-base-multilingual-cased to process Portuguese texts.

In [None]:
# Save the pretrained WordPiece tokenizer to get vocab.txt
tokenizer = AutoTokenizer.from_pretrained('bert-base-multilingual-cased')
tokenizer.save_pretrained(DATA_DIR)

**To train a small ELECTRA model for 1 million steps, run:**

In [None]:
!python3 ../../google/electra/run_pretraining.py --data-dir $DATA_DIR --model-name $MODEL_NAME