## ALBERT
An upgrade to BERT that advances the state-of-the-art performance on 12 NLP tasks
 
The success of ALBERT demonstrates the importance of identifying the aspects of a model that give rise to powerful contextual representations. By focusing improvement efforts on these aspects of the model architecture, it is possible to greatly improve both model efficiency and performance on a wide range of NLP tasks.

## WHY ALBERT?
> An ALBERT configuration similar to BERT-large has 18x fewer parameters and can be trained about 1.7x faster.

ALBERT is a “lite” version of BERT, the NLU pretraining method Google introduced in 2018. It has far fewer parameters than BERT.

In this notebook we extract contextualised word embeddings with ALBERT and look at the classifier heads available for pretraining and fine-tuning.

## Julia-Flux ALBERT Model
The model is easy to use and behaves like any other Flux layer during training.

In [2]:
using TextAnalysis

~ *ignore all the warnings, as TextAnalysis is checked out for development*

In [3]:
using TextAnalysis.ALBERT # this is where our model resides

#### We are going to use DataDeps to handle downloading the pretrained ALBERT model
- For now we load the weights directly from a local file
- Other pretrained weights can be found [here](https://drive.google.com/drive/u/1/folders/1HHTlS_jBYRE4cG0elITEH7fAkiNmrEgz)

In [4]:
using BSON: @save, @load
@load "/home/iamtejas/Downloads/albert_base_v1.bson.tfbson" config weights vocab
transformer = TextAnalysis.ALBERT.load_pretrainedalbert(config, weights)

3-element Array{Any,1}:
 CompositeEmbedding(tok = Embed(128), segment = Embed(128), pe = PositionEmbedding(128, max_len=512), postprocessor = Positionwise(LayerNorm(128), Dropout(0.1)))
 TextAnalysis.ALBERT.albert_transformer(Dense(128, 768), TextAnalysis.ALBERT.ALGroup(Stack(Transformer(head=12, head_size=64, pwffn_size=3072, size=768)), Dropout(0.1)), 12, 1, 1)
 (pooler = Dense(768, 768, tanh), masklm = (transform = Chain(Dense(768, 128, gelu), LayerNorm(128)), output_bias = Float32[-5.345022, 2.1769698, -7.144285, -9.102521, -8.083536, 0.56541324, 1.2000155, 1.4699979, 1.5557922, 1.9452884  …  -0.6403663, -0.9401073, -1.0888876, -0.9298268, -0.64744073, -0.47156653, -0.81416136, -0.87479985, -0.8785063, -0.5505797]), nextsentence = Chain(Dense(768, 2), logsoftmax))

#### Todo
better output representation
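
The printed model above shows a 128-dimensional token embedding followed by a `Dense(128, 768)` projection. A rough sketch of why this factorization saves parameters is given below. It counts only the token-embedding weights, and `vocab_size = 30_000` is an assumed value for illustration (check `length(vocab)` for the model you loaded).

```julia
# Token-embedding parameter count with and without factorization.
# vocab_size is an ASSUMED value for illustration; embed_size and
# hidden_size are read off the printed model (Embed(128), Dense(128, 768)).
vocab_size  = 30_000
embed_size  = 128
hidden_size = 768

# One big table mapping every token straight to the hidden size.
direct = vocab_size * hidden_size

# A small table plus a projection matrix (bias ignored).
factorized = vocab_size * embed_size + embed_size * hidden_size

println("direct:     ", direct)
println("factorized: ", factorized)
println("ratio:      ", round(direct / factorized; digits = 2))
```

This covers only one of the sources of the overall parameter reduction; it says nothing about the transformer layers themselves.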

In [5]:
using WordTokenizers # the ALBERT tokenizer lives in WordTokenizers

For this demo we use only three sentences.

In [6]:
sample1 = "God is Great! I won a lottery."
sample2 = "If all their conversations in the three months he had been coming to the diner were put together, it was doubtful that they would make a respectable paragraph."
sample3 = "She had the job she had planned for the last three years."
sample = [sample1,sample2,sample3]

3-element Array{String,1}:
 "God is Great! I won a lottery."
 "If all their conversations in the three months he had been coming to the diner were put together, it was doubtful that they would make a respectable paragraph."
 "She had the job she had planned for the last three years."

#### Loading the tokenizer: of all the available models we pick ALBERT_V1, since we are using base_V1

In [7]:
spm = load(ALBERT_V1)

WordTokenizers.SentencePieceModel(["<pad>", "<unk>", "[CLS]", "[SEP]", "[MASK]", "(", ")", "\"", "-", "."  …  "_archivist", "_obverse", "error", "_tyrion", "_addictive", "_veneto", "_colloquial", "agog", "_deficiencies", "_eloquent"], [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0  …  -13.5298, -13.5298, -13.5298, -13.5299, -13.5299, -13.53, -13.5313, -13.5318, -13.5323, -13.5323], 2)

### Preprocessing

In [None]:
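using Flux # needed for Flux.batchseq and Flux.stack
# tokenize each sentence and convert the tokens to vocabulary ids
s1 = ids_from_tokens(spm, tokenizer(spm,sample[1]))
s2 = ids_from_tokens(spm, tokenizer(spm,sample[2]))
s3 = ids_from_tokens(spm, tokenizer(spm,sample[3]))
E = Flux.batchseq([s1,s2,s3],1) # pad every sequence to the same length with id 1
E = Flux.stack(E,1) # seq_len × batch matrix of token ids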
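
The padding step in the cell above can be pictured with a small sketch in plain Julia. `pad_batch` is a hypothetical helper written for illustration, not the Flux implementation; it assumes, as in the cell above, that id 1 is the padding id and that the result is laid out as seq_len × batch.

```julia
# Pad variable-length id sequences into a seq_len × batch matrix.
# `pad_batch` is a hypothetical helper, not part of Flux.
function pad_batch(seqs::Vector{Vector{Int}}, pad::Int)
    maxlen = maximum(length, seqs)
    out = fill(pad, maxlen, length(seqs))   # start with padding everywhere
    for (j, s) in enumerate(seqs)
        out[1:length(s), j] = s             # copy the real ids into column j
    end
    return out
end

batch = pad_batch([[5, 6, 7], [8]], 1)
println(size(batch))
```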

In [None]:
seg_indices = ones(Int, size(E)...)

The input embedding requires both token and segment indices.

The `embedding` layer itself handles the position embeddings and the addition of all three.
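
As a minimal sketch of that addition, assuming simple lookup tables with random values, the three embeddings can be summed like this. The table names and tiny sizes are made up for illustration, and the LayerNorm and Dropout post-processing shown in the printed model is left out.

```julia
using Random
Random.seed!(0)

# Tiny, made-up sizes; the real model uses 128-dimensional embeddings.
hidden, vocab, nseg, maxlen = 4, 10, 2, 8
tok_table = randn(Float32, hidden, vocab)    # one column per token id
seg_table = randn(Float32, hidden, nseg)     # one column per segment id
pos_table = randn(Float32, hidden, maxlen)   # one column per position

# Look up columns for a seq_len × batch index matrix,
# giving a hidden × seq_len × batch array.
lookup(table, idx) = reshape(table[:, vec(idx)], size(table, 1), size(idx)...)

function composite_embedding(tok, seg)
    seq_len = size(tok, 1)
    pos = pos_table[:, 1:seq_len]            # hidden × seq_len
    # pos broadcasts over the batch dimension
    return lookup(tok_table, tok) .+ lookup(seg_table, seg) .+ pos
end

tok = [1 2; 3 4; 5 6]                        # seq_len = 3, batch = 2
seg = ones(Int, size(tok))
emb_sketch = composite_embedding(tok, seg)
println(size(emb_sketch))
```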

In [None]:
embedding = transformer[1]
emb = embedding(tok=E, segment=seg_indices)

**Above we have the embeddings corresponding to the input indices**

*Let's pass the embeddings through the ALBERT transformer to get the contextualised embeddings*

#### Voilà, we get the contextualised embeddings

In [None]:
contextualised_embedding = transformer[2](emb)

In [None]:
cls = transformer[3][1] # pooler layer


In [None]:
cls = transformer[3][2] # head for the masked language modelling (MLM) task

In [None]:
cls = transformer[3][3] # head for sentence order prediction

In [None]:
using Transformers.Basic:@toNd
transformer[3].pooler(contextualised_embedding[:,1,:]) # the pooler is usually applied to the [CLS] token (first position)

Let's look at our tokenized sentences
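
What the pooler call does can be sketched in plain Julia, under the assumption that it is a dense layer with a `tanh` activation (as the printed `Dense(768, 768, tanh)` suggests). The weights `W`, `b` and the array `h` below are random stand-ins, not the pretrained values.

```julia
using Random
Random.seed!(1)

hidden, seq_len, batch = 6, 4, 3
h = randn(Float32, hidden, seq_len, batch)   # stand-in for contextualised embeddings
W = randn(Float32, hidden, hidden)           # hypothetical pooler weights
b = zeros(Float32, hidden)                   # hypothetical pooler bias

cls_vectors = h[:, 1, :]                     # first position of every sequence, hidden × batch
pooled = tanh.(W * cls_vectors .+ b)         # one pooled vector per sequence

println(size(pooled))
```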

In [None]:
s1 =  tokenizer(spm,sample[1])
s2 =  tokenizer(spm,sample[2])
s3 =  tokenizer(spm,sample[3])
s =[s1,s2,s3]

In [None]:
using Transformers
masks = Transformers.Basic.getmask(s) # we can use the getmask function from Transformers directly

In [None]:
contextualised_embedding = transformer[2](emb, masks) # contextualised embeddings computed with masks
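
The idea behind the mask is to mark which positions hold real tokens and which are padding. A minimal sketch in plain Julia follows. `length_mask` is a hypothetical helper, not the `getmask` implementation, and the `1 × seq_len × batch` layout is an assumption made for illustration.

```julia
# Build a 0/1 mask from sequence lengths: 1 for real tokens, 0 for padding.
# `length_mask` is a hypothetical helper, not part of Transformers.jl.
function length_mask(lengths::Vector{Int})
    maxlen = maximum(lengths)
    mask = zeros(Float32, 1, maxlen, length(lengths))
    for (j, n) in enumerate(lengths)
        mask[1, 1:n, j] .= 1                 # the first n positions are real tokens
    end
    return mask
end

mask_sketch = length_mask([2, 3])
println(size(mask_sketch))
```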

### Wait
It is just like any other Flux layer/structure, which means we can easily train it!

In [8]:
using Flux: params
params(transformer) #parameters can be updated easily 

Params([Float32[1.4228282, 1.3738428, 1.4611361, 1.4545419, 1.362615, 1.4506888, 1.378717, 1.5180601, 1.4845772, 1.3975797  …  1.4648936, 1.0347399, 1.3678652, 1.4229372, 1.4521085, 1.5046557, 1.5158035, 1.4426574, 1.4960195, 1.3573524], Float32[-0.053622514, -0.16088338, 0.22308606, 0.048916165, -0.0039820396, 0.099344134, -0.0748515, 0.039203927, 0.12206718, 0.05795675  …  -0.27271414, 0.8688461, -0.04732636, -0.18536331, -0.16637102, 0.1740389, -0.294258, -0.13245566, -0.00337506, -0.345267], Float32[-0.051017728 0.086519726 … -0.11926416 0.07093989; -0.056381047 0.022605542 … -0.11318762 -0.11180934; … ; -0.1064435 -0.05314829 … 0.087794125 0.1817395; -0.063876376 -0.0543424 … 0.22770554 -0.03343155], Float32[0.0027943964 -0.0062411674; -0.01266096 0.017194653; … ; 0.0003534697 0.0020992558; 0.044360843 -0.0007803185], Float32[-0.027650086 0.004228708 … -0.022506572 -0.05371121; -0.058056954 -0.009886336 … -0.007817211 0.05937399; … ; -0.061029218 -0.050291896 … 0.047393326 0.08027

Let's take one of the datasets available in Transformers

In [9]:
using Transformers.Datasets
using Transformers.Datasets.GLUE

task = GLUE.QNLI()
datas = dataset(Train, task)
training_batch = get_batch(datas, 100) # 100 is the number of examples requested for the batch

3-element Array{Array{String,1},1}:
 ["When did the third Digimon series begin?", "Which missile batteries often have individual launchers several kilometres from one another?", "What two things does Popper argue Tarski's theory involves in an evaluation of truth?", "What is the name of the village 9 miles north of Calafat where the Ottoman forces attacked the Russians?", "What famous palace is located in London?", "When is the term 'German dialects' used in regard to the German language?", "What was the name of the island the English traded to the Dutch in return for New Amsterdam?", "How were the Portuguese expelled from Myanmar?", "What does the word 'customer' properly apply to?", "What did Arsenal consider the yellow and blue colors to be after losing a FA Cup final wearing red and white?"  …  "Which of Calatrava's creations contains an IMAX theater?", "What is Seattle's average December temperature?", "Bell learned to accurately read lips even without knowing what?", "What is Okl

In [None]:
training_batch[1]

In [10]:
using Flux: onehotbatch
makesentence(s1, s2) = ["[CLS]"; s1; "[SEP]"; s2; "[SEP]"]
function preprocess(training_batch)
    ids = []
    sents = []
    # visit every example in the batch
    for i in 1:length(training_batch[1])
        sent1 = tokenizer(spm, training_batch[1][i])
        sent2 = tokenizer(spm, training_batch[2][i])
        tokens = makesentence(sent1, sent2)
        push!(sents, tokens)
        push!(ids, ids_from_tokens(spm, tokens))
    end
    E = Flux.batchseq(ids, 1)
    E = Flux.stack(E, 1)
    # segment id 1 up to and including the first [SEP], 2 afterwards
    segment = fill!(similar(E), 1)
    for (i, tokens) ∈ enumerate(sents)
        j = findfirst(isequal("[SEP]"), tokens)
        if j !== nothing
            @view(segment[j+1:end, i]) .= 2
        end
    end
    data = (tok = E, segment = segment)
    labels = get_labels(task)
    label = onehotbatch(training_batch[3], labels)
    return (data, label)
end

preprocess (generic function with 1 method)
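
To make the segment assignment in `preprocess` concrete, here is a small self-contained sketch on hand-written tokens. `makepair` and `segment_ids` are hypothetical names used only here, and the tokens are made up rather than produced by the SentencePiece tokenizer.

```julia
# Join two token lists in the [CLS] a [SEP] b [SEP] layout.
makepair(a, b) = ["[CLS]"; a; "[SEP]"; b; "[SEP]"]

# Segment 1 up to and including the first [SEP], segment 2 afterwards.
function segment_ids(tokens)
    seg = ones(Int, length(tokens))
    j = findfirst(isequal("[SEP]"), tokens)
    if j !== nothing
        seg[j+1:end] .= 2
    end
    return seg
end

tokens = makepair(["what", "fuel"], ["propane"])
println(tokens)
println(segment_ids(tokens))
```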

In [None]:
training_batch[3][1:100]

In [11]:
using Transformers.Basic

In [12]:
using Flux: onehotbatch
labels = get_labels(task)
label = onehotbatch(training_batch[3][11:35], labels)


2×25 Flux.OneHotMatrix{Array{Flux.OneHotVector,1}}:
 0  0  0  1  0  0  0  1  0  0  0  1  0  1  0  0  1  0  0  1  1  0  1  0  0
 1  1  1  0  1  1  1  0  1  1  1  0  1  0  1  1  0  1  1  0  0  1  0  1  1
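
The matrix above has one row per label and one column per example. A plain-Julia sketch of the same encoding is shown below. `onehot_matrix` is a hypothetical helper, not Flux's `onehotbatch`, and the label order used here is an assumption.

```julia
# One row per label, one column per example; true where they match.
# `onehot_matrix` is a hypothetical helper, not part of Flux.
onehot_matrix(ys, labels) = [l == y for l in labels, y in ys]

label_names = ["entailment", "not_entailment"]   # assumed label order
ys = ["not_entailment", "not_entailment", "entailment", "not_entailment"]
encoded = onehot_matrix(ys, label_names)
println(size(encoded))
```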

In [13]:
using Flux
using Flux: gradient
import Flux.Optimise: update!

clf = Flux.Chain(
    Flux.Dropout(0.1),
    Flux.Dense(768, length(labels)), Flux.logsoftmax
)

ps = params(transformer, clf) # train the classifier head together with ALBERT
opt = ADAM(1e-4)
# define the loss: embed, encode, pool the [CLS] position, classify
function loss(data, label, mask=nothing)
    e = transformer[1](data)
    t = transformer[2](e)
    l = logcrossentropy(label, clf(transformer[3].pooler(t[:,1,:])))
    return l
end

loss (generic function with 2 methods)
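
The loss combines a log-softmax output with one-hot labels. A minimal plain-Julia sketch of that computation is given below. The function names are hypothetical, and averaging over the batch is an assumption about the convention, not a statement of how `logcrossentropy` is implemented in Transformers.

```julia
# Column-wise log-softmax, shifted by the maximum for numerical stability.
function logsoftmax_cols(x)
    m = maximum(x; dims = 1)
    return x .- m .- log.(sum(exp.(x .- m); dims = 1))
end

# Cross-entropy between one-hot labels q and log-probabilities logp,
# averaged over the batch (columns). The averaging is an assumption.
logcrossentropy_sketch(q, logp) = -sum(q .* logp) / size(q, 2)

logits = zeros(Float32, 2, 3)                # 2 classes, 3 examples, uniform scores
q = Float32[1 0 1; 0 1 0]                    # one-hot labels
loss_value = logcrossentropy_sketch(q, logsoftmax_cols(logits))
println(loss_value)
```

With uniform scores over two classes every example contributes `log(2)`, which is a handy sanity check for an untrained classifier.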

In [14]:
transformer[1]

CompositeEmbedding(tok = Embed(128), segment = Embed(128), pe = PositionEmbedding(128, max_len=512), postprocessor = Positionwise(LayerNorm(128), Dropout(0.1)))

In [15]:
clf1 = Flux.Chain(
    Flux.Dropout(0.1),
    Flux.Dense(768, length(labels)), Flux.logsoftmax
)
clf2 = Flux.Chain(
    Flux.Dropout(0.1),
    Flux.Dense(768, length(labels)), Flux.logsoftmax
)

Chain(Dropout(0.1), Dense(768, 2), logsoftmax)

You can see the batch inputs and the loss below.

In [17]:
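for i ∈ 1:10 # 10 training steps, each on a fresh batch of 4 examples
    data_batch = get_batch(datas, 4)
    println(data_batch)
    data_batch, label_batch = preprocess(data_batch)
    l = loss(data_batch, label_batch)
    @show l
    # the loss must be evaluated inside the closure so gradients flow to ps
    grad = gradient(() -> loss(data_batch, label_batch), ps)
    update!(opt, ps, grad)
end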

[["When was the Yale Herald established?", "What is the name of the largest denomination of the Presbyterian Church in America?", "What fuel is used for the torch?", "What is the term used to describe the \"Right to collect revenue\"?"], ["Newspapers include the Yale Daily News, which was first published in 1878, and the weekly Yale Herald, which was first published in 1986.", "The nation's largest Presbyterian denomination, the Presbyterian Church (U.S.A.) – PC (USA) – can trace their heritage back to the original PCUSA, as can the Presbyterian Church in America (PCA), the Orthodox Presbyterian Church (OPC), the Bible Presbyterian Church (BPC), the Cumberland Presbyterian Church (CPC), the Cumberland Presbyterian Church in America the Evangelical Presbyterian Church (EPC) and the Evangelical Covenant Order of Presbyterians (ECO).", "The torch is fueled by cans of propane.", "The development of New Imperialism saw the conquest of nearly all eastern hemisphere territories by colonial po

l = 0.31616497f0
[["What do the Orthodox believe Mary remained to be before and after she gave birth to Christ?", "What allies did Nasser meet at the Academy?", "Neo-classical music emerged during what era?", "When John Hunyadi died, which province was left in chaos?"], ["She is also proclaimed as the \"Lady of the Angels\".", "At the academy, he met Abdel Hakim Amer and Anwar Sadat, both of whom became important aides during his presidency.", "The high-modern era saw the emergence of neo-classical and serial music.", "In an extremely unusual event for the Middle Ages, Hunyadi's son, Matthias, was elected as King of Hungary by the nobility."], ["not_entailment", "not_entailment", "entailment", "not_entailment"]]
l = 1.1544688f0


In [18]:
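for i ∈ 1:10 # 10 more training steps, each on a fresh batch of 4 examples
    data_batch = get_batch(datas, 4)
    println(data_batch)
    data_batch, label_batch = preprocess(data_batch)
    l = loss(data_batch, label_batch)
    @show l
    # the loss must be evaluated inside the closure so gradients flow to ps
    grad = gradient(() -> loss(data_batch, label_batch), ps)
    update!(opt, ps, grad)
end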

[["When was the only vice presidential debate held at Washington University?", "What other royal figure strongly influenced Frederick William III's decision to go to war with France?", "Who did the Cubs send to the New York Yankees for minor leaguer Corey Black?", "Who still gives salmon to the abbey today?"], ["The university hosted the only 2008 vice presidential debate, between Republican Sarah Palin and Democrat Joe Biden, on October 2, 2008, also at the Washington University Athletic Complex.", "At the insistence of his court, especially his wife Queen Louise, Frederick William III decided to challenge the French domination of Central Europe by going to war.", "Three days later, the Cubs sent Alfonso Soriano to the New York Yankees for minor leaguer Corey Black.", "In the present era, the Fishmonger's Company still gives a salmon every year."], ["entailment", "entailment", "entailment", "entailment"]]
l = 0.19738148f0
[["What years do many historians consider Atlantic City's golde

l = 0.8726212f0
[["Who termed the slogan \"la Ciudad de la Esperanza?\"", "What is the main material used to build the cellar in the basement of Main Building?", "In millimeters, how much precipitation does New York receive a year?", "From what dialect is Hindi descended?"], ["This motto was quickly adopted as a city nickname, but has faded since the new motto Capital en Movimiento (\"Capital in Movement\") was adopted by the administration headed by Marcelo Ebrard, though the latter is not treated as often as a nickname in media.", "The entire vaulted brick structure of the cellar was encased in steel and concrete and relocated nine feet to the west and nearly 19 feet (5.8 m) deeper in 1949, when construction was resumed at the site after World War II.", "Hurricanes and tropical storms are rare in the New York area, but are not unheard of and always have the potential to strike the area.", "Sanskrit has greatly influenced the languages of India that grew from its vocabulary and gramma