<a href="https://colab.research.google.com/github/sugarme/nb/blob/master/tokenizer/bpe.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Tokenizer - BPE model

This notebook uses:

1. [GoTch - PyTorch C++ APIs Go binding](https://github.com/sugarme/gotch)
2. [Tokenizer](https://github.com/sugarme/tokenizer)
3. [GopherNotes - Jupyter Notebook Go kernel](https://github.com/gopherdata/gophernotes)

<hr style="border: 2pt solid blue"> </hr>

## Install Go kernel - GopherNotes (Google Colab Only)

- Save a copy to your Google Drive
- Change Runtime type to use GPU if needed (Runtime/Change runtime type/Hardware accelerator/GPU)


In [None]:
# Run this cell once, the first time, using the default Python runtime.
!add-apt-repository ppa:longsleep/golang-backports -y > /dev/null
!apt update > /dev/null 
!apt install golang-go > /dev/null
%env GOPATH=/root/go
!go get -u github.com/gopherdata/gophernotes
!cp ~/go/bin/gophernotes /usr/bin/
!mkdir /usr/local/share/jupyter/kernels/gophernotes
!cp ~/go/src/github.com/gopherdata/gophernotes/kernel/* \
       /usr/local/share/jupyter/kernels/gophernotes
# Then refresh the browser; the notebook will now use the gophernotes kernel. All later cells are Go.

**Note**: refresh (reload) the browser after this step!

<hr style="border:1px solid red"> </hr>

## Install PyTorch C++ APIs and Go binding - GoTch

NOTE: `ldconfig` (GLIBC 2.27, the current version) is broken when linking the Libtorch library.

See issue: https://discuss.pytorch.org/libtorch-c-so-files-truncated-error-when-ldconfig/46404/6

Google Colab default settings:
```bash
LD_LIBRARY_PATH=/usr/lib64-nvidia
LIBRARY_PATH=/usr/local/cuda/lib64/stubs
```
As a workaround, we copy the contents of `libtorch/lib` directly into those paths.

In [None]:
$wget -q --show-progress --progress=bar:force:noscroll -O /tmp/libtorch-cxx11-abi-shared-with-deps-1.7.0%2Bcu101.zip https://download.pytorch.org/libtorch/cu101/libtorch-cxx11-abi-shared-with-deps-1.7.0%2Bcu101.zip
$unzip -qq /tmp/libtorch-cxx11-abi-shared-with-deps-1.7.0%2Bcu101.zip -d /usr/local
$unzip -qq -j /tmp/libtorch-cxx11-abi-shared-with-deps-1.7.0%2Bcu101.zip libtorch/lib/* -d /usr/lib64-nvidia/
$unzip -qq -j /tmp/libtorch-cxx11-abi-shared-with-deps-1.7.0%2Bcu101.zip libtorch/lib/* -d /usr/local/cuda/lib64/stubs/

In [None]:
import("os")
os.Setenv("CPATH", "/usr/local/libtorch/lib:/usr/local/libtorch/include:/usr/local/libtorch/include/torch/csrc/api/include")

In [None]:
$rm -f -- go.mod
$go mod init github.com/sugarme/playgo
$go get github.com/sugarme/gotch@v0.3.2

In [None]:
import(
    "fmt"

    "github.com/sugarme/gotch"
    ts "github.com/sugarme/gotch/tensor"
) 

## ...and we are ready to Go! Thank you for using GoTch!

<hr style="border:2pt solid blue"> </hr>

In [3]:
import(
    "fmt"
    
    "github.com/sugarme/tokenizer/pretrained"
)

In [17]:
tk := pretrained.BertBaseUncased()

input := "Here is what we are going to encode."

e, err := tk.EncodeSingle(input)
if err != nil {
    panic(err) // stop here if encoding failed; e would not be usable
}

fmt.Printf("Ids:\t\t%v\n", e.Ids)
fmt.Printf("TypeIds:\t%v\n", e.TypeIds)
fmt.Printf("Tokens:\t\t%q\n", e.Tokens)
fmt.Printf("Offsets:\t%v\n", e.Offsets)
fmt.Printf("Overflowing:\t%v\n", e.Overflowing)

Ids:		[2182 2003 2054 2057 2024 2183 2000 4372 16044 1012]
TypeIds:	[0 0 0 0 0 0 0 0 0 0]
Tokens:		["here" "is" "what" "we" "are" "going" "to" "en" "##code" "."]
Offsets:	[[0 4] [5 7] [8 12] [13 15] [16 19] [20 25] [26 28] [29 31] [31 35] [35 36]]
Overflowing:	[]


16 <nil>