# Pre-train ELECTRA
In this section, we train ELECTRA from scratch with TensorFlow using the scripts provided by ELECTRA’s authors in google-research/electra. We then convert the model to a PyTorch checkpoint, which can be easily fine-tuned on downstream tasks with Hugging Face’s transformers library.

## Setup
- !pip install tensorflow-gpu==1.15
- !pip install transformers==2.8.0
- !pip install -U tensorboard
- !git clone https://github.com/google-research/electra.git


In [None]:
import os, json, time

from transformers import AutoTokenizer

### **Data**

We will pre-train ELECTRA on a Portuguese movie subtitle dataset retrieved from OpenSubtitles, concatenated with the Portuguese Wikipedia and the BrWaC corpus. The full dataset is 5.4 GB; for this presentation we train on a small subset of about 30 MB.

In [None]:
import pandas as pd

from tokenizers.processors import BertProcessing
from tokenizers import ByteLevelBPETokenizer, BertWordPieceTokenizer

In [None]:
# Paths and sizes for the Portuguese corpus: the full dataset is 5.4 GB,
# and we train on a small subset of ~30 MB for presentation.
DATA_DIR = 'data/' #@param {type: "string"}
TRAIN_SIZE = 1000000 #@param {type:"integer"}
TOKEN_DIR = 'data/vocab/'
MODEL_NAME = 'electranez-small' #@param {type: "string"}
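
The cell that extracts the ~30 MB subset is not shown in this notebook. A minimal sketch of one way to do it, assuming the full corpus is a single UTF-8 text file with one sentence per line: copy the first `n_lines` non-empty lines into the directory that is later passed as `--corpus-dir`. The function name `write_subset` and the file names in the usage comment are hypothetical.

```python
import os
from itertools import islice


def write_subset(src_path, dst_path, n_lines):
    """Copy the first n_lines non-empty lines of src_path to dst_path.

    Returns the number of lines written, which is smaller than n_lines
    when the source file is shorter.
    """
    dst_dir = os.path.dirname(dst_path)
    if dst_dir:
        os.makedirs(dst_dir, exist_ok=True)
    written = 0
    with open(src_path, encoding="utf-8") as src, \
            open(dst_path, "w", encoding="utf-8") as dst:
        non_empty = (line for line in src if line.strip())
        for line in islice(non_empty, n_lines):
            # Make sure every copied line ends with a newline.
            dst.write(line if line.endswith("\n") else line + "\n")
            written += 1
    return written


# Hypothetical usage; the source file name is an assumption:
# write_subset(DATA_DIR + "corpus.txt", DATA_DIR + "txt/train.txt", TRAIN_SIZE)
```

Reading the file lazily keeps memory use flat even for the full 5.4 GB corpus.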

### **Build Pretraining Dataset**

We will use the tokenizer of bert-base-multilingual-cased to process Portuguese texts.

In [None]:
# Save the pretrained WordPiece tokenizer to get vocab.txt
tokenizer = AutoTokenizer.from_pretrained('bert-base-multilingual-cased')
tokenizer.save_pretrained(DATA_DIR)
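
The saved `vocab.txt` is a WordPiece vocabulary. As a rough illustration of how such a vocabulary is applied to a single word, here is a simplified sketch of greedy longest-match-first splitting. It is not the library's implementation, which also handles text normalization and punctuation; the function name and the toy vocabulary are made up for this example.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Split one word into WordPiece tokens by greedy longest-match-first."""
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Shrink the candidate substring until it is found in the vocabulary.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation-piece prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk_token]  # no piece matches: the whole word is unknown
        tokens.append(piece)
        start = end
    return tokens


# Toy vocabulary for illustration only; the real vocab.txt is far larger.
toy_vocab = {"cant", "##ando", "##a", "p", "##ás", "##saro", "##s"}
print(wordpiece_tokenize("cantando", toy_vocab))
print(wordpiece_tokenize("pássaros", toy_vocab))
```

Because the multilingual vocabulary is shared across many languages, Portuguese words are often split into several pieces like this.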

**We use build_pretraining_dataset.py to create a pre-training dataset from a dump of raw text.**

In [None]:
!python3 ../../google/electra/build_pretraining_dataset.py \
  --corpus-dir data/txt/ \
  --vocab-file data/electranez/vocab.txt \
  --output-dir data/electranez/pretrain_tfrecords/ \
  --max-seq-length 128 \
  --blanks-separate-docs False \
  --no-lower-case \
  --num-processes 5
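
One thing worth checking in the command above is how `--blanks-separate-docs False` is parsed. If the script declares that flag with `type=bool` in argparse (an assumption to verify in the script's source), the string `"False"` is converted with `bool("False")`, which is `True`, because any non-empty string is truthy. The parser below is a hypothetical stand-in, not the script's actual code.

```python
import argparse

# Hypothetical stand-in for the script's parser; whether the real flag is
# declared with type=bool is an assumption to verify in the source.
parser = argparse.ArgumentParser()
parser.add_argument("--blanks-separate-docs", default=True, type=bool)

as_false = parser.parse_args(["--blanks-separate-docs", "False"])
as_empty = parser.parse_args(["--blanks-separate-docs", ""])

print(as_false.blanks_separate_docs)  # result of bool("False")
print(as_empty.blanks_separate_docs)  # result of bool("")
```

Under that assumption, only an empty string would actually turn the option off.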

### **Start Training**
We use run_pretraining.py to pre-train an ELECTRA model.

To train a small ELECTRA model for 1 million steps, run:

In [None]:
MODEL_DIR = DATA_DIR + 'electranez/'

In [None]:
!python3 ../../google/electra/run_pretraining.py --data-dir $MODEL_DIR --model-name electranez-small

This takes slightly over 4 days on a Tesla V100 GPU. However, the model should achieve decent results after 200k steps (10 hours of training on the V100 GPU).

To customize the training, create a .json file containing the hyperparameters. Please refer to configure_pretraining.py for the default values of all hyperparameters.

Below, we set the hyperparameters to train the model for only 10,000 steps, saving a checkpoint every 1,000 steps.

In [None]:
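# Use Python booleans so that json.dump writes JSON true/false.
# A string such as "false" is non-empty and therefore truthy in Python.
hparams = {
    "do_train": True,
    "do_eval": False,
    "model_size": "small",
    "do_lower_case": False,
    "vocab_size": 119547,
    "num_train_steps": 10000,
    "save_checkpoints_steps": 1000,
    "train_batch_size": 32,
    "electra_objective": True
}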

with open("hparams.json", "w") as f:
    json.dump(hparams, f)
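
Before launching a long run, it can be worth checking that no boolean hyperparameter was written as a string. If the training script uses these values directly in conditions (an assumption about its internals), a value such as `"false"` would behave as true. The helper name `find_string_booleans` and the example values are hypothetical.

```python
import json


def find_string_booleans(hparams):
    """Return the keys whose value is a string that looks like a boolean."""
    return sorted(
        key for key, value in hparams.items()
        if isinstance(value, str) and value.strip().lower() in {"true", "false"}
    )


# Example values for illustration only.
example = json.loads(
    '{"do_train": true, "do_eval": "false", "model_size": "small"}'
)
print(find_string_booleans(example))  # keys that need fixing
print(bool("false"))                  # truthiness of a non-empty string
```

The same check can be run on the contents of `hparams.json` after loading it with `json.load`.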

In [None]:
# Let’s start training:
start = time.time()

!python3 ../../google/electra/run_pretraining.py --data-dir data/electranez/ --model-name $MODEL_NAME --hparams "hparams.json"

print(time.time()-start)

If you are training on a virtual machine, you can monitor the training process with TensorBoard from a terminal by pointing it at the model's output directory.

### Convert TensorFlow checkpoints to PyTorch format

Hugging Face has a tool to convert TensorFlow checkpoints to PyTorch. However, this tool has not yet been updated for ELECTRA. Fortunately, I found a GitHub repo by [@lonePatient](https://github.com/lonePatient/electra_pytorch.git) that can help us with this task.


In [None]:
MODEL_DIR = 'data/electranez'

In [None]:
config = {
  "vocab_size": 119547,
  "embedding_size": 128,
  "hidden_size": 256,
  "num_hidden_layers": 12,
  "num_attention_heads": 4,
  "intermediate_size": 1024,
  "generator_size":"0.25",
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "attention_probs_dropout_prob": 0.1,
  "max_position_embeddings": 512,
  "type_vocab_size": 2,
  "initializer_range": 0.02
}

with open(MODEL_DIR + "/config.json", "w") as f:
    json.dump(config, f)
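
A few sizes follow directly from this configuration and are easy to sanity-check before converting the checkpoint. The sketch below derives the per-head dimension, the feed-forward expansion ratio, and the size of the token embedding table. The helper name `check_config` is hypothetical, and the divisibility rule reflects how BERT-style attention splits the hidden vector evenly across heads.

```python
def check_config(config):
    """Derive a few sizes from an ELECTRA-style config dict and validate them."""
    hidden = config["hidden_size"]
    heads = config["num_attention_heads"]
    if hidden % heads != 0:
        raise ValueError("hidden_size must be divisible by num_attention_heads")
    return {
        "head_size": hidden // heads,
        "ffn_ratio": config["intermediate_size"] / hidden,
        # Parameters in the token embedding table only.
        "embedding_params": config["vocab_size"] * config["embedding_size"],
    }


# Values copied from the config cell above.
small = {
    "vocab_size": 119547,
    "embedding_size": 128,
    "hidden_size": 256,
    "num_attention_heads": 4,
    "intermediate_size": 1024,
}
print(check_config(small))
```

With a multilingual vocabulary of this size, the embedding table alone accounts for about 15.3 million parameters, so it is a large share of a small model.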

In [None]:
print(MODEL_DIR+'/models/electranez-small/')

In [None]:
!python ../../diversos/electra_pytorch/convert_electra_tf_checkpoint_to_pytorch.py \
    --tf_checkpoint_path=$MODEL_DIR/models/electranez-small/ \
    --electra_config_file=$MODEL_DIR/config.json \
    --pytorch_dump_path=$MODEL_DIR/pytorch_model.bin

### Use ELECTRA with transformers

After converting the model checkpoint to PyTorch format, we can start to use our pre-trained ELECTRA model on downstream tasks with the transformers library.

In [None]:
import torch
from transformers import ElectraForPreTraining, ElectraTokenizerFast

In [None]:
discriminator = ElectraForPreTraining.from_pretrained(MODEL_DIR)
tokenizer = ElectraTokenizerFast.from_pretrained(DATA_DIR, do_lower_case=False)

In [None]:
sentence = "Os pássaros estão cantando"  # The birds are singing
fake_sentence = "Os pássaros estão trabalhando"  # The birds are working

fake_tokens = tokenizer.tokenize(fake_sentence, add_special_tokens=True)
fake_inputs = tokenizer.encode(fake_sentence, return_tensors="pt")
discriminator_outputs = discriminator(fake_inputs)
# A positive logit means the discriminator flags the token as replaced.
# flatten() drops the batch dimension if the library version keeps it.
predictions = (discriminator_outputs[0] > 0).flatten()

# Print each token above its prediction (1 = replaced, 0 = original).
print("".join("%7s" % token for token in fake_tokens))
print()
print("".join("%7s" % int(prediction) for prediction in predictions.tolist()))
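
The cell above thresholds the discriminator's raw logits at 0. That is equivalent to thresholding the sigmoid probability at 0.5, since sigmoid(0) = 0.5 and the sigmoid is increasing. A small self-contained sketch with made-up logits (not real model output, and with whole words standing in for the actual word pieces):

```python
import math


def sigmoid(x):
    """Map a logit to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


# Hypothetical tokens and logits, one logit per token.
tokens = ["[CLS]", "Os", "pássaros", "estão", "trabalhando", "[SEP]"]
logits = [-4.1, -3.2, -2.7, -1.9, 1.3, -4.0]

by_logit = [int(z > 0) for z in logits]
by_prob = [int(sigmoid(z) > 0.5) for z in logits]
assert by_logit == by_prob  # both rules flag the same tokens

# Print each token with its flag (1 = judged replaced, 0 = judged original).
for token, flag in zip(tokens, by_logit):
    print("%12s %d" % (token, flag))
```

A model trained for only a few thousand steps on a small subset may not flag the replaced word reliably, so the real output can differ from this idealized pattern.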