## ALBERT
An upgrade to BERT that advances the state-of-the-art performance on 12 NLP tasks
 
The success of ALBERT demonstrates the importance of identifying the aspects of a model that give rise to powerful contextual representations. By focusing improvement efforts on these aspects of the model architecture, it is possible to greatly improve both the model efficiency and performance on a wide range of NLP tasks

## WHY ALBERT?
> An ALBERT configuration similar to BERT-large has 18x fewer parameters and can be trained about 1.7x faster.

ALBERT is a “lite” version of Google’s 2018 NLU pretraining method BERT. It has fewer parameter than BERT

In this notebook we are going to extract contextualised wordembedding by ALBERT and learing about classifer available for pretraining and finetuning

## Julia- Flux ALBERT Model
It very easy and similar to any of the other Flux layer for training 

In [2]:
using TextAnalysis

~ *ignore all the warning as TextAnalysis is checked out for developement*

In [3]:
using TextAnalysis.ALBERT # it is where our model reside

#### we are going to use DataDeps for handling download of pretrained model of ALBERT
- For now we are directly laoding 
- other pretrained Weights can be found [here](https://drive.google.com/drive/u/1/folders/1HHTlS_jBYRE4cG0elITEH7fAkiNmrEgz)

In [265]:
using BSON: @save, @load
@load "/home/iamtejas/Downloads/albert_base_v1.bson.tfbson" config weights vocab
transformer = TextAnalysis.ALBERT.load_pretrainedalbert(config, weights)

3-element Array{Any,1}:
 CompositeEmbedding(tok = Embed(128), segment = Embed(128), pe = PositionEmbedding(128, max_len=512), postprocessor = Positionwise(LayerNorm(128), Dropout(0.1)))
 TextAnalysis.ALBERT.albert_transformer(Dense(128, 768), TextAnalysis.ALBERT.ALGroup(Stack(Transformer(head=12, head_size=64, pwffn_size=3072, size=768)), Dropout(0.1)), 12, 1, 1)
 (pooler = Dense(768, 768, tanh), masklm = (transform = Chain(Dense(768, 128, gelu), LayerNorm(128)), output_bias = Float32[-5.345022, 2.1769698, -7.144285, -9.102521, -8.083536, 0.56541324, 1.2000155, 1.4699979, 1.5557922, 1.9452884  …  -0.6403663, -0.9401073, -1.0888876, -0.9298268, -0.64744073, -0.47156653, -0.81416136, -0.87479985, -0.8785063, -0.5505797]), nextsentence = Chain(Dense(768, 2), logsoftmax))

#### Todo 
better output repesentation

In [5]:
using WordTokenizers #we have albert_tokenizer residing in WordTokenizer 

For demo we are taking only 3 sentence

In [267]:
sample1 = "God is Great! I won a lottery."
sample2 = "If all their conversations in the three months he had been coming to the diner were put together, it was doubtful that they would make a respectable paragraph."
sample3 = "She had the job she had planned for the last three years."
sample = [sample1,sample2,sample3]

3-element Array{String,1}:
 "God is Great! I won a lottery."
 "If all their conversations in the three months he had been coming to the diner were put together, it was doubtful that they would make a respectable paragraph."
 "She had the job she had planned for the last three years."

#### loading of tokenizer form all the available  since we are using base_V1 

In [268]:
spm = load(ALBERT_V1,"albert_base_v1_30k-clean.vocab")

WordTokenizers.Sentencepiecemodel(["<pad>", "<unk>", "[CLS]", "[SEP]", "[MASK]", "(", ")", "\"", "-", "."  …  "_archivist", "_obverse", "error", "_tyrion", "_addictive", "_veneto", "_colloquial", "agog", "_deficiencies", "_eloquent"], [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0  …  -13.5298, -13.5298, -13.5298, -13.5299, -13.5299, -13.53, -13.5313, -13.5318, -13.5323, -13.5323])

### Preprocessing

In [269]:
s1 = ids_from_tokens(spm, tokenizer(spm,sample[1]))
s2 = ids_from_tokens(spm, tokenizer(spm,sample[2]))
s3 = ids_from_tokens(spm, tokenizer(spm,sample[3]))
E = Flux.batchseq([s1,s2,s3],1)
E = Flux.stack(E,1)

32×3 Array{Int64,2}:
   14     14    14
    2      2     2
 5649    411   439
   26     66    42
   14     67    15
    2  13528  1206
  100     20    40
  722     15    42
  188    133  2036
   14    819    27
    2     25    15
  231     42   237
   22     75   133
    ⋮         
    1     16     1
    1     33     1
    1     24     1
    1  22569     1
    1     31     1
    1     60     1
    1     84     1
    1    234     1
    1     22     1
    1  22740     1
    1  20600     1
    1     10     1

In [271]:
seg_indices = ones(Int, size(E)...)

32×3 Array{Int64,2}:
 1  1  1
 1  1  1
 1  1  1
 1  1  1
 1  1  1
 1  1  1
 1  1  1
 1  1  1
 1  1  1
 1  1  1
 1  1  1
 1  1  1
 1  1  1
 ⋮     
 1  1  1
 1  1  1
 1  1  1
 1  1  1
 1  1  1
 1  1  1
 1  1  1
 1  1  1
 1  1  1
 1  1  1
 1  1  1
 1  1  1

We know input embedding requires both segment and token indices

the `embedding` itself handle position and addtion operation

In [278]:
embedding = transformer[1]
emb = embedding(tok=E, segment=seg_indices)

128×32×3 Array{Float32,3}:
[:, :, 1] =
  0.215637     0.974126    0.923233   …  -0.953364   -1.23193    -1.45834
  0.382527    -0.440211   -0.493132      -0.450212   -0.497128   -0.664383
 -1.06308     -0.596149    1.38881       -0.893396   -0.817059   -0.753805
  0.0192553   -0.606222   -0.0422194      0.213562    0.366818    0.655723
 -1.73208     -0.217243    0.329404       0.720658    0.467273    0.371431
  1.44428      0.352459   -0.307714   …   0.729       0.681432    0.574053
 -2.87975      1.02795    -1.55634       -0.0406037  -0.331129   -0.634261
  0.888469     2.01966     0.139051      -1.54702    -1.5215     -1.46764
  0.423823     0.10119    -1.7348        -1.40969    -0.998497   -0.439296
  0.519439     0.50028     1.78553        1.24271     0.801729    0.306854
 -2.73993     -2.15839    -0.950032   …  -0.726077   -1.00341    -1.15917
 -0.739749    -0.58749     0.697434       2.10209     1.90288     1.72213
 -2.06217     -1.16219     1.14314        1.72633     1.4998     

**Above we have embedding corresponding to input indices**

*lets pass the embedding through AlbertTranformer to get contextualised_embedding*

#### voilà we got Contextualised embedding

In [279]:
contextualised_embedding = transformer[2](emb)

768×32×3 Array{Float32,3}:
[:, :, 1] =
  0.304567    0.405941    0.326642   …   0.527699    0.53062     0.535387
  0.765856    1.05468     0.581975       0.902422    0.923356    0.910047
  0.760675    0.440686    1.81601        0.665885    0.615429    0.596886
 -0.0437517  -0.464514   -0.133041      -0.0119426   0.0575484   0.0867782
  0.438816   -0.0251722   0.686429       0.80559     0.805196    0.780889
  0.0296612  -0.685908    0.21051    …  -0.175856   -0.189034   -0.196193
  0.646657    1.67699     0.259575       0.684573    0.667937    0.646987
  0.267509    1.71205     0.686641       0.999966    1.0456      1.06525
 -0.0601687  -0.354132   -0.886672      -0.617012   -0.610717   -0.594557
 -0.364813   -0.159958   -0.996861      -0.187835   -0.205887   -0.215373
  0.988467    1.11282     1.00583    …   0.78229     0.908326    0.959077
  0.0731863  -0.432719    0.150405       0.585053    0.49542     0.441913
  0.15807    -0.490142   -0.405144       0.355488    0.33691     0.322694

In [280]:
cls = transformer[3][1] #pooler layer 


Dense(768, 768, tanh)

In [281]:
cls = transformer[3][2] #for mlm  tasked

(transform = Chain(Dense(768, 128, gelu), LayerNorm(128)), output_bias = Float32[-5.345022, 2.1769698, -7.144285, -9.102521, -8.083536, 0.56541324, 1.2000155, 1.4699979, 1.5557922, 1.9452884  …  -0.6403663, -0.9401073, -1.0888876, -0.9298268, -0.64744073, -0.47156653, -0.81416136, -0.87479985, -0.8785063, -0.5505797])

In [282]:
cls = transformer[3][3] #for Sentence order prediction

Chain(Dense(768, 2), logsoftmax)

In [283]:
using Transformers.Basic:@toNd
transformer[3].pooler(contextualised_embedding[:,1,:]) #mostly pooler layer is applied on [cls] token

768×3 Array{Float32,2}:
 -0.164622   -0.244687   -0.337835
 -0.109313   -0.238238    0.0731121
 -0.922526   -0.99792    -0.46698
 -0.403019   -0.863416    0.702999
  0.439029    0.97862    -0.107664
 -0.852264   -0.900703   -0.614105
  0.364179   -0.373978    0.0256299
 -0.508897   -0.47372     0.164483
  0.536143    0.0399534   0.775943
  0.970573    0.768918    0.205684
  0.999793    0.998327    0.999979
  0.791798   -0.33849     0.438395
 -0.503646    0.227776    0.245557
  ⋮                      
  0.757302    0.637661    0.859677
 -0.819503   -0.998739   -0.814534
 -0.505278   -0.654982   -0.670874
 -0.994383    0.0932321  -0.308428
 -0.6843     -0.984353   -0.602701
  0.751023    0.684121    0.818537
 -0.997563   -0.990052   -0.98816
 -0.405217   -0.750618   -0.819593
  0.53443     0.589601    0.67484
  0.973679    0.97328     0.951646
 -0.880809   -0.918532    0.0197378
 -0.0609629  -0.814388   -0.487664

Lets see our tokenized sentence

In [284]:
s1 =  tokenizer(spm,sample[1])
s2 =  tokenizer(spm,sample[2])
s3 =  tokenizer(spm,sample[3])
s =[s1,s2,s3]

3-element Array{Array{String,1},1}:
 ["_", "G", "od", "_is", "_", "G", "re", "at", "!", "_", "I", "_won", "_a", "_lottery", "."]
 ["_", "I", "f", "_all", "_their", "_conversations", "_in", "_the", "_three", "_months"  …  "_was", "_doubtful", "_that", "_they", "_would", "_make", "_a", "_respectable", "_paragraph", "."]
 ["_", "S", "he", "_had", "_the", "_job", "_she", "_had", "_planned", "_for", "_the", "_last", "_three", "_years", "."]

In [285]:
using Transformers
masks = Transformers.Basic.getmask(s) # we can directly use getmask function of Transformers

1×32×3 Array{Float32,3}:
[:, :, 1] =
 1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  …  0.0  0.0  0.0  0.0  0.0  0.0  0.0

[:, :, 2] =
 1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  …  1.0  1.0  1.0  1.0  1.0  1.0  1.0

[:, :, 3] =
 1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  …  0.0  0.0  0.0  0.0  0.0  0.0  0.0

In [286]:
contextualised_embedding = transformer[2](emb, masks) #contextualised_embedding with masks

768×32×3 Array{Float32,3}:
[:, :, 1] =
 -0.428328   -0.0777661  -0.221679    …   0.0   0.0   0.0   0.0   0.0   0.0
 -0.569262    0.571153   -0.281473        0.0   0.0   0.0   0.0   0.0   0.0
  0.0193166   0.0451773   2.74292         0.0   0.0   0.0   0.0   0.0   0.0
 -0.978615   -1.16123    -0.574968       -0.0  -0.0  -0.0  -0.0  -0.0  -0.0
  0.201378   -0.735267    0.0164461       0.0   0.0   0.0   0.0   0.0   0.0
 -0.322908   -0.379362   -0.813322    …  -0.0  -0.0  -0.0  -0.0  -0.0  -0.0
  0.660336    2.02064     0.649449        0.0   0.0   0.0   0.0   0.0   0.0
 -0.91708     0.392511   -0.667269       -0.0  -0.0  -0.0  -0.0  -0.0  -0.0
 -0.254425   -0.426472   -1.18446        -0.0  -0.0  -0.0  -0.0  -0.0  -0.0
  0.617348    1.62654     0.141215        0.0   0.0   0.0   0.0   0.0   0.0
  1.55256     2.01011     1.37082     …   0.0   0.0   0.0   0.0   0.0   0.0
  0.0662555  -0.670731    0.121412       -0.0  -0.0  -0.0  -0.0  -0.0  -0.0
  0.127238   -0.539187   -0.00406911     -0.0  -0

### Wait
It is just like an other Flux layers/ Structure means we can easily train it !

In [287]:
using Flux: params
params(transformer) #parameters can be updated easily 

Params([Float32[1.4228282, 1.3738428, 1.4611361, 1.4545419, 1.362615, 1.4506888, 1.378717, 1.5180601, 1.4845772, 1.3975797  …  1.4648936, 1.0347399, 1.3678652, 1.4229372, 1.4521085, 1.5046557, 1.5158035, 1.4426574, 1.4960195, 1.3573524], Float32[-0.053622514, -0.16088338, 0.22308606, 0.048916165, -0.0039820396, 0.099344134, -0.0748515, 0.039203927, 0.12206718, 0.05795675  …  -0.27271414, 0.8688461, -0.04732636, -0.18536331, -0.16637102, 0.1740389, -0.294258, -0.13245566, -0.00337506, -0.345267], Float32[-0.051017728 0.086519726 … -0.11926416 0.07093989; -0.056381047 0.022605542 … -0.11318762 -0.11180934; … ; -0.1064435 -0.05314829 … 0.087794125 0.1817395; -0.063876376 -0.0543424 … 0.22770554 -0.03343155], Float32[0.0027943964 -0.0062411674; -0.01266096 0.017194653; … ; 0.0003534697 0.0020992558; 0.044360843 -0.0007803185], Float32[-0.027650086 0.004228708 … -0.022506572 -0.05371121; -0.058056954 -0.009886336 … -0.007817211 0.05937399; … ; -0.061029218 -0.050291896 … 0.047393326 0.08027

Lets take some datasets present in Transformers

In [288]:
using Transformers.Datasets
using Transformers.Datasets.GLUE

task = GLUE.QNLI()
datas = dataset(Train, task)
training_batch=get_batch(datas, 100) #here 100 correspond to no. of sentence for train we output

3-element Array{Array{String,1},1}:
 ["When did the third Digimon series begin?", "Which missile batteries often have individual launchers several kilometres from one another?", "What two things does Popper argue Tarski's theory involves in an evaluation of truth?", "What is the name of the village 9 miles north of Calafat where the Ottoman forces attacked the Russians?", "What famous palace is located in London?", "When is the term 'German dialects' used in regard to the German language?", "What was the name of the island the English traded to the Dutch in return for New Amsterdam?", "How were the Portuguese expelled from Myanmar?", "What does the word 'customer' properly apply to?", "What did Arsenal consider the yellow and blue colors to be after losing a FA Cup final wearing red and white?"  …  "Which of Calatrava's creations contains an IMAX theater?", "What is Seattle's average December temperature?", "Bell learned to accurately read lips even without knowing what?", "What is Okl

In [289]:
using Flux: onehotbatch
makesentence(s1, s2) = ["[CLS]"; s1; "[SEP]"; s2; "[SEP]"]
function preprocess(training_batch)
ids =[]
sent = []
for i in 11:35
    sent1 = tokenizer(spm,training_batch[1][i])
    sent2 = tokenizer(spm,training_batch[2][i])
    id = makesentence(sent1,sent2)
    push!(sent, id)
    push!(ids,ids_from_tokens(spm,id))
end
E = Flux.batchseq(ids,1)
E = Flux.stack(E,1)
segment = fill!(similar(E), 1)
    for (i, sent) ∈ enumerate(sent)
      j = findfirst(isequal("[SEP]"), sent)
      if j !== nothing
        @view(segment[j+1:end, i]) .= 2
      end
end
segment
data = (tok = E,segment = segment)
labels = get_labels(task)
label = onehotbatch(training_batch[3][11:35], labels)
return(data,label)
end

preprocess (generic function with 1 method)

In [149]:
data,label = preprocess(training_batch)

((tok = [3 3 … 3 3; 14 14 … 14 14; … ; 1 1 … 1 1; 1 1 … 1 1], segment = [1 1 … 1 1; 1 1 … 1 1; … ; 2 2 … 2 2; 2 2 … 2 2]), Bool[0 0 … 0 0; 1 1 … 1 1])

In [48]:
E = Flux.batchseq(ids,1)
E = Flux.stack(E,1)

134×25 Array{Int64,2}:
    3     3     3     3      3      3  …     3     3     3      3      3
   14    14    14    14     14     14       14    14    14     14     14
    2     2     2     2      2      2        2     2     2      2      2
 1808  1808  6776   253   1823   3582     5609   104  6776   6776   6776
 5069    24   108  3871    414    213       47    99    26    145   1249
   20   889   128  1207    700  20270  …  5092   429    15     15     24
   14    29   369    17     27    415     1538    24   982  10481   2745
   23    15   566  1690  13926     93     1705    15    31   2167     21
    2    14  2663  4841     16     45       61   325  3807     29     58
  140     2  1995    51     99   6757        4   614    21     22  10955
 4186    59    61  1031    963     61  …    14    36   925    287     16
   14  9745     4  6271     93      4        2    15   114    393   3994
    2    14    14    21   4432     14      104  1475    41    141   6128
    ⋮                       

In [54]:
segment = fill!(similar(E), 1)
    for (i, sent) ∈ enumerate(sent)
      j = findfirst(isequal("[SEP]"), sent)
      if j !== nothing
        @view(segment[j+1:end, i]) .= 2
      end
end
segment
data = (tok = E,segment = segment)

134×25 Array{Int64,2}:
 1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1
 1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1
 1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1
 1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1
 1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1
 1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1
 1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1
 1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1
 1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1
 1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1
 1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  2  1  1  1  1
 1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  2  1  1  1  1
 1  1  2  1  1  2  1  2  1  1  1  1  1  1  1  1  1  1  1  1  2  1  1  1  1
 ⋮

In [139]:
using Flux: onehotbatch
labels = get_labels(task)
label = onehotbatch(training_batch[3][11:35], labels)


2×25 Flux.OneHotMatrix{Array{Flux.OneHotVector,1}}:
 0  0  0  1  0  0  0  1  0  0  0  1  0  1  0  0  1  0  0  1  1  0  1  0  0
 1  1  1  0  1  1  1  0  1  1  1  0  1  0  1  1  0  1  1  0  0  1  0  1  1

In [254]:
using Flux
using Flux: gradient
import Flux.Optimise: update!

clf = Flux.Chain(
    Flux.Dropout(0.1),
    Flux.Dense(768, length(labels)), Flux.logsoftmax
)

ps = params(transformer)
opt = ADAM(1e-4)
#define the loss
function loss(data, label, mask=nothing)
    e = transformer[1](data)
    t = transformer[2](e)
    l = logcrossentropy( label,
         clf(
            transformer[3].pooler(
                t[:,1,:]
            )
        )
    )
    return l
end

loss (generic function with 2 methods)

In [264]:
transformer[1]

CompositeEmbedding(tok = Embed(128), segment = Embed(128), pe = PositionEmbedding(128, max_len=512), postprocessor = Positionwise(LayerNorm(128), Dropout(0.1)))

In [260]:
clf1 = Flux.Chain(
    Flux.Dropout(0.1),
    Flux.Dense(768, length(labels)), Flux.logsoftmax
)
clf2 = Flux.Chain(
    Flux.Dropout(0.1),
    Flux.Dense(768, length(labels)), Flux.logsoftmax
)

Chain(Dropout(0.1), Dense(768, 2), logsoftmax)

In [290]:
for i ∈ 1:24 # 24/2 training step, just for illustration
data_batch = (tok = data.tok[:,i:i+10],segment= data.segment[:,i:i+10])
label_batch = label[:,i:i+10]
l = loss(data_batch, label_batch)
i=i+2
@show l
  grad = gradient(()->l, ps)
  update!(opt, ps, grad)
end

l = 2.0030386f0
l = 1.7268517f0
l = 1.6604753f0
l = 1.5049707f0
l = 1.8035961f0
l = 1.8366543f0
l = 1.6482123f0
l = 1.7435868f0
l = 1.906287f0
l = 1.6901565f0
l = 1.3915592f0
l = 1.3359941f0
l = 1.3426436f0
l = 1.3728523f0
l = 1.6148944f0


BoundsError: BoundsError: attempt to access 134×25 Array{Int64,2} at index [Base.Slice(Base.OneTo(134)), 16:26]