## ALBERT Fine-tuning Tutorial
In this tutorial, we walk through using a state-of-the-art (SOTA) transformer model, ALBERT. To learn more about the model, see the [ALBERT](https://arxiv.org/abs/1909.11942) paper.

The code can be found in PR [#203](https://github.com/JuliaText/TextAnalysis.jl/pull/203).

We are going to use the following libraries in this tutorial:
- TextAnalysis.ALBERT
- WordTokenizers
- Transformers and Flux


In [2]:
using TextAnalysis
using TextAnalysis.ALBERT # this is where our model resides

Let's check out the model versions available in `PretrainedTransformer`.

In [4]:
subtypes(ALBERT.PretrainedTransformer)

2-element Array{Any,1}:
 TextAnalysis.ALBERT.ALBERT_V1
 TextAnalysis.ALBERT.ALBERT_V2

To list the model sizes available for a given version:

In [5]:
model_version(TextAnalysis.ALBERT.ALBERT_V1)

4-element Array{String,1}:
 "albert_base_v1"
 "albert_large_v1"
 "albert_xlarge_v1"
 "albert_xxlarge_v1"

Before moving forward, let us look at the basic steps involved in using any transformer.

### For preprocessing
- Tokenize the input data and prepare other inputs such as the attention mask, which lets the model ignore padded positions.
- Convert tokens to input ID sequences.
- Pad the IDs to a fixed length.
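
The padding and masking steps above can be sketched in plain Julia. This is only an illustration: `pad_and_mask` is a hypothetical helper, not part of `WordTokenizers` or `TextAnalysis`, and the padding ID of `1` and the example ID values are assumptions.

```julia
# Hypothetical helper (not a library function): pad variable-length ID
# sequences to a common length and build a mask that is 1 for real tokens
# and 0 for padding.
function pad_and_mask(seqs::Vector{Vector{Int}}; pad_id::Int = 1)
    max_len = maximum(length, seqs)
    ids  = fill(pad_id, max_len, length(seqs))    # (seq_len, batch)
    mask = zeros(Float32, max_len, length(seqs))  # same layout as ids
    for (j, seq) in enumerate(seqs)
        ids[1:length(seq), j]  .= seq
        mask[1:length(seq), j] .= 1f0
    end
    return ids, mask
end

# Two made-up ID sequences of different lengths.
ids, mask = pad_and_mask([[2, 15, 8, 3], [2, 42, 3]])
```

Here the second sequence is one token shorter, so its last row holds the padding ID and its mask entry is zero.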

### For modelling
- Load the model and feed in the input ID sequences (in batches sized to the available memory).
- Get the output of the last hidden layer.
- The last hidden layer holds the sequence representation embedding at index 1 (the `[CLS]` position).
- These embeddings can be used as inputs to other machine learning or deep learning models.
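
As a minimal sketch of the third step, the snippet below uses a random array as a stand-in for the last hidden layer. The `(hidden_size, seq_len, batch_size)` layout and the sizes 768, 16 and 4 are assumptions chosen to match the indexing used later in this tutorial.

```julia
# Stand-in for the transformer output: (hidden_size, seq_len, batch_size).
hidden = rand(Float32, 768, 16, 4)

# Take the embedding at the first sequence position for every example.
cls_embedding = hidden[:, 1, :]

# One 768-dimensional vector per example in the batch.
size(cls_embedding)
```

The resulting matrix has one column per example and can be passed to a downstream classifier.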


`WordTokenizers` will handle the preprocessing part,
and `TextAnalysis` will handle the modelling.

In [6]:
transformer = ALBERT.from_pretrained("albert_base_v1") # here we are using the base model of version 1

This program has requested access to the data dependency albert_base_v1.
which is not currently installed. It can be installed automatically, and you will not see this message again.

sentencepiece albert vocabulary file by google research .
Website: https://github.com/google-research/albert
Author: Google Research
Licence: Apache License 2.0
albert base version1 of size ~500mb download.



Do you want to download the dataset from https://drive.google.com/uc?export=download&id=1RKggDgmlJrSRsx7Ro2eR2hTNuMmzyUJ7 to "/home/iamtejas/.julia/datadeps/albert_base_v1"?
[y/n]
stdin> y


┌ Info: Downloading
│   source = https://doc-00-3g-docs.googleusercontent.com/docs/securesc/ha0ro937gcuc7l7deffksulhg5h7mbp1/h8ejgoll8vvi3skb2dmd7ntrm80frbea/1595875500000/15884229709856900679/*/1RKggDgmlJrSRsx7Ro2eR2hTNuMmzyUJ7?e=download
│   dest = /home/iamtejas/.julia/datadeps/albert_base_v1/albert_base_v1.bson
│   progress = NaN
│   time_taken = 5.01 s
│   time_remaining = NaN s
│   average_speed = 6.788 MiB/s
│   downloaded = 33.981 MiB
│   remaining = ∞ B
│   total = ∞ B
└ @ HTTP /home/iamtejas/.julia/packages/HTTP/BOJmV/src/download.jl:119

3-element Array{Any,1}:
 CompositeEmbedding(tok = Embed(128), segment = Embed(128), pe = PositionEmbedding(128, max_len=512), postprocessor = Positionwise(LayerNorm(128), Dropout(0.1)))
 albert(layers=12, head=12, head_size=64, pwffn_size=3072, size=768)
 (pooler = Dense(768, 768, tanh), masklm = (transform = Chain(Dense(768, 128, gelu), LayerNorm(128)), output_bias = Float32[-5.345022, 2.1769698, -7.144285, -9.102521, -8.083536, 0.56541324, 1.2000155, 1.4699979, 1.5557922, 1.9452884  …  -0.6403663, -0.9401073, -1.0888876, -0.9298268, -0.64744073, -0.47156653, -0.81416136, -0.87479985, -0.8785063, -0.5505797]), nextsentence = Chain(Dense(768, 2), logsoftmax))

### Tokenizer

In [7]:
using WordTokenizers

For more detail on the tokenizer, refer to the following [blog](https://tejasvaidhyadev.github.io/blog/Hey-Albert).

In [8]:
spm = load(ALBERT_V1, 1) # because we are using the base model of version 1

This program has requested access to the data dependency albert_large_v1_30k-clean.vocab.
which is not currently installed. It can be installed automatically, and you will not see this message again.

sentencepiece albert vocabulary file by google research .
Website: https://github.com/google-research/albert
Author: Google Research
Licence: Apache License 2.0
 albert large version1 of size ~800kb download.



Do you want to download the dataset from https://raw.githubusercontent.com/tejasvaidhyadev/ALBERT.jl/master/src/Vocabs/albert_large_v1_30k-clean.vocab to "/home/iamtejas/.julia/datadeps/albert_large_v1_30k-clean.vocab"?
[y/n]
stdin> y


┌ Info: Downloading
│   source = https://raw.githubusercontent.com/tejasvaidhyadev/ALBERT.jl/master/src/Vocabs/albert_large_v1_30k-clean.vocab
│   dest = /home/iamtejas/.julia/datadeps/albert_large_v1_30k-clean.vocab/albert_large_v1_30k-clean.vocab
│   progress = 1.0
│   time_taken = 0.17 s
│   time_remaining = 0.0 s
│   average_speed = 3.154 MiB/s
│   downloaded = 536.127 KiB
│   remaining = 0 bytes
│   total = 536.127 KiB
└ @ HTTP /home/iamtejas/.julia/packages/HTTP/BOJmV/src/download.jl:119


WordTokenizers.SentencePieceModel(Dict("▁shots" => (-11.2373, 7281),"▁ordered" => (-9.84973, 1906),"▁doubtful" => (-12.7799, 22569),"▁glancing" => (-11.6676, 10426),"▁disrespect" => (-13.13, 26682),"▁without" => (-8.34227, 367),"▁pol" => (-10.7694, 4828),"chem" => (-12.3713, 17661),"▁1947," => (-11.7544, 11199),"▁kw" => (-10.4402, 3511)…), 2)

We will use the data loader available in [`Transformers`](https://github.com/chengchingwen/Transformers.jl).

We use the QNLI dataset.

In [11]:
using Transformers.Datasets
using Transformers.Datasets.GLUE
using Transformers.Basic
task = GLUE.QNLI()
datas = dataset(Train, task)

(Channel{String}(sz_max:0,sz_curr:1), Channel{String}(sz_max:0,sz_curr:0), Channel{String}(sz_max:0,sz_curr:0))

In [35]:
using Flux: onehotbatch

Basic preprocessing function (the APIs are still a work in progress)

In [28]:
makesentence(s1, s2) = ["[CLS]"; s1; "[SEP]"; s2; "[SEP]"]

function preprocess(training_batch)
    ids = []
    sents = []
    for i in 1:length(training_batch[1])
        # tokenize both sentences of the pair
        sent1 = tokenizer(spm, training_batch[1][i])
        sent2 = tokenizer(spm, training_batch[2][i])
        tokens = makesentence(sent1, sent2)
        push!(sents, tokens)
        push!(ids, ids_from_tokens(spm, tokens))
    end
    mask = getmask(convert(Array{Array{String,1},1}, sents)) # a better API is in progress

    # pad the ID sequences with 1 and stack them into a matrix
    E = Flux.batchseq(ids, 1)
    E = Flux.stack(E, 1)

    # segment IDs: 1 up to and including the first [SEP], 2 afterwards
    segment = fill!(similar(E), 1)
    for (i, tokens) ∈ enumerate(sents)
        j = findfirst(isequal("[SEP]"), tokens)
        if j !== nothing
            @view(segment[j+1:end, i]) .= 2
        end
    end

    data = (tok = E, segment = segment)
    labels = get_labels(task)
    label = onehotbatch(training_batch[3], labels)
    return (data, label, mask)
end

preprocess (generic function with 1 method)
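
To make the segment logic in `preprocess` easier to follow, here is a small standalone sketch for a single sentence pair. The token strings are made up for illustration and are not real tokenizer output.

```julia
makesentence(s1, s2) = ["[CLS]"; s1; "[SEP]"; s2; "[SEP]"]

# Hypothetical tokens for a question and its candidate answer sentence.
tokens = makesentence(["▁where", "▁is", "▁it"], ["▁over", "▁there"])

# Segment 1 covers everything up to and including the first [SEP];
# segment 2 covers the rest.
segment = fill(1, length(tokens))
j = findfirst(isequal("[SEP]"), tokens)
if j !== nothing
    segment[j+1:end] .= 2
end
```

With these tokens the first `[SEP]` sits at position 5, so positions 6 to 8 are marked as the second segment.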

Let's define the loss function.

In [14]:
using Flux
using Flux: gradient
import Flux.Optimise: update!

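labels = get_labels(task) # class labels of the task; sets the classifier's output size

clf = Flux.Chain(
    Flux.Dropout(0.1),
    Flux.Dense(768, length(labels)), Flux.logsoftmax
)

ps = params(transformer[1]) # note: only the embedding parameters are collected here
opt = ADAM(1e-4)

# define the loss
function loss(data, label, mask=nothing)
    e = transformer[1](data)  # embeddings
    t = transformer[2](e)     # transformer output
    # pool the representation at the first position and classify it
    l = logcrossentropy(label, clf(transformer[3].pooler(t[:,1,:])))
    return l
end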

loss (generic function with 2 methods)
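
For intuition about the quantity the loss returns, the following plain-Julia sketch computes a log cross-entropy from one-hot labels and log-probabilities. It is not the implementation used by `Transformers` or Flux. `logsoftmax_cols` and `logxent` are hypothetical names, and the logits are made up.

```julia
# Numerically stable log-softmax over the class dimension (rows).
function logsoftmax_cols(x::AbstractMatrix)
    m = maximum(x, dims = 1)
    return x .- m .- log.(sum(exp.(x .- m), dims = 1))
end

# Mean negative log-probability of the true class, given one-hot labels.
logxent(onehot, logp) = -sum(onehot .* logp) / size(logp, 2)

logits = Float32[2.0 0.5; 0.1 1.5]   # (classes, batch), made-up values
onehot = Float32[1 0; 0 1]           # true class of each example
l = logxent(onehot, logsoftmax_cols(logits))

# A two-class model that assigns equal scores to both classes gives log(2).
chance = logxent(onehot, logsoftmax_cols(zeros(Float32, 2, 2)))
```

The chance-level value of log(2) ≈ 0.693 is a useful reference point when reading losses for a two-class task such as QNLI.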

In [31]:
for i ∈ 1:20 # 20 training iterations
    batch = get_batch(datas, 4)
    data_batch, label_batch, mask = preprocess(batch)
    l = loss(data_batch, label_batch, mask)
    @show l
    # the loss must be evaluated inside the closure so that gradients are tracked
    grad = gradient(() -> loss(data_batch, label_batch, mask), ps)
    update!(opt, ps, grad)
end

l = 0.8853355f0
l = 0.57006735f0
l = 0.9809218f0
l = 0.5881124f0
l = 0.78463817f0
l = 0.76752764f0
l = 0.7264092f0
l = 0.7885215f0
l = 0.5286734f0
l = 0.64378977f0
l = 0.7589431f0
l = 0.87304103f0
l = 0.7476368f0
l = 0.7716043f0
l = 0.6841873f0
l = 0.7801976f0
l = 0.5601203f0
l = 0.6203372f0
l = 0.6522941f0
l = 0.6564876f0
