<a href="https://colab.research.google.com/github/sugarme/nb/blob/master/transformer/bert-mask-lm.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# BERT - Mask Language Model

This notebook uses:

1. [GoTch - PyTorch C++ APIs Go binding](https://github.com/sugarme/gotch)
2. [Transformer](https://github.com/sugarme/transformer)
3. [GopherNotes - Jupyter Notebook Go kernel](https://github.com/gopherdata/gophernotes)

<hr style="border: 2pt solid blue"> </hr>

## Install Go kernel - GopherNotes (Google Colab Only)

- Save a copy to your Google Drive
- Change Runtime type to use GPU if needed (Runtime/Change runtime type/Hardware accelerator/GPU)


In [None]:
# Run this cell once, the first time, using the Python runtime
!add-apt-repository ppa:longsleep/golang-backports -y > /dev/null
!apt update > /dev/null 
!apt install golang-go > /dev/null
%env GOPATH=/root/go
!go get -u github.com/gopherdata/gophernotes
!cp ~/go/bin/gophernotes /usr/bin/
!mkdir /usr/local/share/jupyter/kernels/gophernotes
!cp ~/go/src/github.com/gopherdata/gophernotes/kernel/* \
       /usr/local/share/jupyter/kernels/gophernotes
# Then refresh the browser; the notebook will now use the gophernotes kernel, and the later cells are written in Go

**Note**: refresh (reload) the browser after this step!

<hr style="border:1px solid red"> </hr>

## Install PyTorch C++ APIs and Go binding - GoTch

NOTE: `ldconfig` from the current GLIBC version (2.27) is broken when linking the Libtorch library.

See issue: https://discuss.pytorch.org/libtorch-c-so-files-truncated-error-when-ldconfig/46404/6

Google Colab default settings:
```bash
LD_LIBRARY_PATH=/usr/lib64-nvidia
LIBRARY_PATH=/usr/local/cuda/lib64/stubs
```
As a workaround, we copy the contents of `libtorch/lib` directly into those paths.

In [None]:
$wget -q --show-progress --progress=bar:force:noscroll -O /tmp/libtorch-cxx11-abi-shared-with-deps-1.7.0%2Bcu101.zip https://download.pytorch.org/libtorch/cu101/libtorch-cxx11-abi-shared-with-deps-1.7.0%2Bcu101.zip
$unzip -qq /tmp/libtorch-cxx11-abi-shared-with-deps-1.7.0%2Bcu101.zip -d /usr/local
$unzip -qq -j /tmp/libtorch-cxx11-abi-shared-with-deps-1.7.0%2Bcu101.zip libtorch/lib/* -d /usr/lib64-nvidia/
$unzip -qq -j /tmp/libtorch-cxx11-abi-shared-with-deps-1.7.0%2Bcu101.zip libtorch/lib/* -d /usr/local/cuda/lib64/stubs/

In [None]:
import("os")
// All entries must be absolute paths (note the leading slash on each one)
os.Setenv("CPATH", "/usr/local/libtorch/lib:/usr/local/libtorch/include:/usr/local/libtorch/include/torch/csrc/api/include")

In [None]:
$rm -f -- go.mod
$go mod init github.com/sugarme/playgo
$go get github.com/sugarme/gotch@v0.3.2

In [None]:
import(
    "fmt"

    "github.com/sugarme/gotch"
    ts "github.com/sugarme/gotch/tensor"
) 

## ...and we are ready to Go! Thank you for using GoTch!

<hr style="border:2pt solid blue"> </hr>

# BERT For Masked Language Model

In [1]:
$wget -q --show-progress --progress=bar:force:noscroll -O bert-base-uncased-config.json https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-uncased-config.json
$wget -q --show-progress --progress=bar:force:noscroll -O bert-base-uncased-model.gt https://cdn.huggingface.co/bert-base-uncased-rust_model.ot
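
Before building the model, it can be useful to check that the downloaded config file is valid JSON and to glance at a few of its fields. The sketch below is independent of the `bert` package and uses only `encoding/json`. The struct `configSummary` and the helper `parseSummary` are hypothetical names introduced here, covering only three of the keys found in the Hugging Face BERT config file (`vocab_size`, `hidden_size`, `num_hidden_layers`).

```go
package main

import (
	"encoding/json"
	"fmt"
	"os"
)

// configSummary is a hypothetical struct holding a small subset of the
// fields in a Hugging Face BERT config file.
type configSummary struct {
	VocabSize       int `json:"vocab_size"`
	HiddenSize      int `json:"hidden_size"`
	NumHiddenLayers int `json:"num_hidden_layers"`
}

// parseSummary decodes the selected fields and ignores every other key.
func parseSummary(data []byte) (configSummary, error) {
	var c configSummary
	err := json.Unmarshal(data, &c)
	return c, err
}

func main() {
	data, err := os.ReadFile("bert-base-uncased-config.json")
	if err != nil {
		fmt.Println("could not read config:", err)
		return
	}
	cfg, err := parseSummary(data)
	if err != nil {
		fmt.Println("could not parse config:", err)
		return
	}
	fmt.Printf("vocab=%d hidden=%d layers=%d\n",
		cfg.VocabSize, cfg.HiddenSize, cfg.NumHiddenLayers)
}
```

A parse error here usually means the download was incomplete or returned an error page instead of JSON.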



In [None]:
import(
    "fmt"
    
    ts "github.com/sugarme/gotch/tensor"
    "github.com/sugarme/gotch/nn"
    "github.com/sugarme/gotch"
    "github.com/sugarme/tokenizer"
    "github.com/sugarme/tokenizer/pretrained"
    "github.com/sugarme/transformer"
    "github.com/sugarme/transformer/bert"
)

In [None]:
device := gotch.CPU
vs := nn.NewVarStore(device)
// load BERT config
config, err := bert.ConfigFromFile("bert-base-uncased-config.json")
if err != nil{ fmt.Print(err)}
model := bert.NewBertForMaskedLM(vs.Root(), config)
// load weights for BERT masked language model
err = vs.Load("bert-base-uncased-model.gt") // reuse err declared above
if err != nil{fmt.Print(err)}
// load pretrained bert-base-uncased tokenizer
tk := pretrained.BertBaseUncased()

In [None]:
// Input sample
sentence := "Remi is 6 years old and he goes to [MASK] every day."
// Encode the input
enc, err := tk.EncodeSingle(sentence, true)
if err != nil{fmt.Print(err)}
var tokInput []int64
for _, id := range enc.Ids{
    tokInput = append(tokInput, int64(id))
}
// Create input tensors from token Ids
// NOTE: the BERT model expects a batch of samples, so the single sample is stacked along a new batch dimension.
tokTensors := []ts.Tensor{*ts.TensorFrom(tokInput)} 
inputTs, err := ts.Stack(tokTensors, 0)
if err != nil{fmt.Print(err)}
input, err := inputTs.To(device, true)
if err != nil { fmt.Print(err) }

// Forward through the model
var output *ts.Tensor
ts.NoGrad(func(){
    output, _, _ = model.ForwardT(input, ts.None, ts.None, ts.None, ts.None, ts.None, ts.None, false)
})

// Get first sample
output1, err := output.Get(0)
if err != nil{fmt.Print(err)}
// Get the prediction scores for the [MASK] token (position 11) - NOTE: the encoding includes the added special tokens: [CLS] sentence [SEP]
values, err := output1.Get(11)
if err != nil{ fmt.Print(err) }
// Get the index of the highest score
am, err := values.Argmax([]int64{0}, false, false)
if err != nil{ fmt.Print(err) }
id := am.Int64Values()[0]
// Lookup token from vocab
word, ok := tk.IdToToken(int(id))
if !ok { fmt.Printf("Token not found for input Id: %v\n", id)}
fmt.Printf("Tokens: %q\n", enc.Tokens)
fmt.Printf("Input: %v - Output: %v\n", sentence, word)
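
The cell above hardcodes position 11 for the `[MASK]` token and keeps only the single best candidate. As a rough sketch of two possible refinements, the standalone program below looks up the mask position from a token slice and ranks the `k` highest scores. It uses plain Go slices rather than GoTch tensors; the helper names `maskIndex` and `topK`, the sample tokens, and the sample scores are all made up for illustration.

```go
package main

import (
	"fmt"
	"sort"
)

// maskIndex returns the position of the first "[MASK]" token, or -1 if absent.
func maskIndex(tokens []string) int {
	for i, tok := range tokens {
		if tok == "[MASK]" {
			return i
		}
	}
	return -1
}

// topK returns the indices of the k largest scores, highest first.
func topK(scores []float64, k int) []int {
	idx := make([]int, len(scores))
	for i := range idx {
		idx[i] = i
	}
	// Sort indices by descending score; ties keep their original order.
	sort.SliceStable(idx, func(a, b int) bool {
		return scores[idx[a]] > scores[idx[b]]
	})
	if k < 0 {
		k = 0
	}
	if k > len(idx) {
		k = len(idx)
	}
	return idx[:k]
}

func main() {
	// Hypothetical token sequence; in the notebook this would be enc.Tokens.
	tokens := []string{"[CLS]", "he", "goes", "to", "[MASK]", "every", "day", ".", "[SEP]"}
	fmt.Println("mask position:", maskIndex(tokens))

	// Hypothetical scores over a tiny four-entry vocabulary.
	scores := []float64{0.1, 2.5, -1.0, 1.7}
	fmt.Println("top-2 ids:", topK(scores, 2))
}
```

In the notebook, the position returned by `maskIndex(enc.Tokens)` could replace the literal `11`, provided the sentence contains exactly one `[MASK]` token.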