# COMP541 - LAB
## LSTM Named Entity Tagger

Named-entity recognition (NER) (also known as entity identification, entity chunking and entity extraction) is a subtask of information extraction that seeks to locate and classify named entities in text into pre-defined categories such as the names of persons, organizations, locations, expressions of times, quantities, monetary values, percentages, etc.

Most research on NER systems has been structured as taking an unannotated block of text, such as the following **example**:

**INPUT:** Jim bought 300 shares of Acme Corp. in 2006.

And producing an annotated block of text that highlights the names of entities:

**OUTPUT:** [Jim]Person bought 300 shares of [Acme Corp.]Organization in [2006]Time.

In this example, a person name consisting of one token, a two-token company name and a temporal expression have been detected and classified.(Wikipedia)

Your task in this lab is to implement named entity LSTM based tagger which uses an LSTM to extract features from the input sentence, which are then passed through a multi-layer perceptron to predict
the tag of the word. Finally, train that model on [WikiNER](https://github.com/neulab/dynet-benchmark/tree/master/data/tags) dataset.

Firstly we import required packages

In [5]:
using Pkg; for p in ["IterTools", "Knet","ArgParse", "CUDA"]; Pkg.add(p); end
using Printf, Dates, Random, CUDA, Knet, ArgParse, Test, Base.Iterators, IterTools

STDOUT = Base.stdout

import Knet: train!
include(joinpath(Knet.dir(), "data", "wikiner.jl"))
_atype = CUDA.functional() ? KnetArray{Float32} : Array{Float32}

@info "Adding required packages and importing WikiNER dataset"

[32m[1m  Resolving[22m[39m package versions...
[32m[1mNo Changes[22m[39m to `/scratch/users/vaydingul20/.julia/environments/v1.5/Project.toml`
[32m[1mNo Changes[22m[39m to `/scratch/users/vaydingul20/.julia/environments/v1.5/Manifest.toml`
[32m[1m  Resolving[22m[39m package versions...
[32m[1mNo Changes[22m[39m to `/scratch/users/vaydingul20/.julia/environments/v1.5/Project.toml`
[32m[1mNo Changes[22m[39m to `/scratch/users/vaydingul20/.julia/environments/v1.5/Manifest.toml`
[32m[1m  Resolving[22m[39m package versions...
[32m[1mNo Changes[22m[39m to `/scratch/users/vaydingul20/.julia/environments/v1.5/Project.toml`
[32m[1mNo Changes[22m[39m to `/scratch/users/vaydingul20/.julia/environments/v1.5/Manifest.toml`
[32m[1m  Resolving[22m[39m package versions...
[32m[1mNo Changes[22m[39m to `/scratch/users/vaydingul20/.julia/environments/v1.5/Project.toml`
[32m[1mNo Changes[22m[39m to `/scratch/users/vaydingul20/.julia/environments/v1.5/Manifes

## Prepare samples for the network
Your first task is to prepare instances for the network. We're given with the tokens (words and tags) and we need to make them understandable by our neural network. For this purpose, we build vocabularies (for both words and tags) and construct vocabulary to index dictionaries by using those vocabularies (w2i and t2i, word2index, tag2index). Then, we convert words and tags to indices with the usage of our dictionaries.

```julia
julia> show_instance() # show instance in not implemented in Knet, it is a hypothetical procedure
Inputs sentence:
Sent-> That inscribed in the genealogical records of his family is Jiang Zhoutai .
NERs-> O    O         O  O   O            O       O  O   O      O  I-PER I-PER   O

Timesteps:
Time step 1 ---> Inputs: That
                 Outputs: O
Time step 2 ---> Inputs: inscribed
                 Outputs:O
Time step 3 ---> Inputs: in
                 Outputs: O
Time step 4 ---> Inputs: the
                 Outputs: O
Time step 5 ---> Inputs: genealogical
                 Outputs: O
Time step 6 ---> Inputs: records  .
                 Outputs: O
Time step 7 ---> Inputs: of
                 Outputs: O
Time step 8 ---> Inputs: his
                 Outputs: O
Time step 9 ---> Inputs: family
                 Outputs: O
Time step 10 --->Inputs: is
                 Outputs: O
Time step 11 ---> Inputs: Jiang
                  Outputs: I-PER
Time step 12 ---> Inputs: Zhoutai
                  Outputs: I-PER
Time step 13 ---> Inputs: .
                  Outputs: O
```

Our input and output arrays should be integers instead of texts.

In this step, you need to implement `make_instance` function
instance is a list of tuples. Each tuple contains a word and the corresponding tag as string.
You need to convert them into indices using word to index (w2i) and tag to index (t2i)

In [6]:
"""
    make_instance(instance, w2i, t2i)

Return tuple of two sequences containing inputs and the corresponding outputs respectively.

This function does this by converting each input unit in the instance into its corresponding value in w2i, and does the same for output units using t2i.
"""
function make_instance(instance, w2i, t2i, unk=UNK)
    input = Int[]
    output = Int[]
    # Your code here
    for ins in instance
       
        # If the word does not exists, add ´_UNK_´
        push!(input, get(w2i, ins[1], w2i[unk]))
        # It the tag does not exists add ´0´
        push!(output, get(t2i, ins[2], 0))
        
    end
        
    
    return input, output
end


"""
   make_instances(data, w2i, t2i)

Iterate over `data` and Return `words` and `tags`
"""
function make_instances(data, w2i, t2i)
    words = []; tags = []
    for k = 1:length(data)
        this_words, this_tags = make_instance(data[k], w2i, t2i)
        push!(words, this_words)
        push!(tags, this_tags)
    end
    return words, tags
end

@info "Testing instances"
data = WikiNERData();
dev = make_instances(data.dev, data.w2i, data.t2i);
@test dev[1][2][3] == 11110
@test size.(dev) == ((1696,), (1696,))

┌ Info: Testing instances
└ @ Main In[6]:39


[32m[1mTest Passed[22m[39m

### WikiNERProcessed
This struct contains processed data (e.g words and tags are indices)
and necessary variables to prepare minibatches.
WikiNERProcessed struct works as a data iterator, which will you implement in the next step.

In [7]:
mutable struct WikiNERProcessed
    words
    tags
    batchsize
    ninstances
    shuffled
end

"""
   WikiNERProcessed(instances, w2i, t2i; batchsize=16, shuffled=true)

Return a WikiNERProcessed object with the given instances
"""
function WikiNERProcessed(instances, w2i, t2i; batchsize=16, shuffled=true)
    words, tags = make_instances(instances, w2i, t2i)
    ninstances = length(words)
    return WikiNERProcessed(words, tags, batchsize, ninstances, shuffled)
end

@info "WikiNERProcessed"
devdata = WikiNERProcessed(data.dev, data.w2i, data.t2i; shuffled=false);
@test devdata.words[1][1] == 8653
@test length(devdata.words) == 1696

┌ Info: WikiNERProcessed
└ @ Main In[7]:20


[32m[1mTest Passed[22m[39m

### WikiNERProcessed Iterator
Please note that this function returns tuple of two tuples.

The first one contains a data batch with words as an input for our model, and tags as the corresponding output, and batchsizes of this batch.
Since you will use the RNN callable object in your model.
It supports variable length instances in its input.
However, you need to prepare your input such as the RNN object can work on it. See `batchSizes` option of the RNN object using `@doc RNN` and Look up `zeros`, `sortperm`, `min`

In [8]:
"""
    iterate(d::WikiNERProcessed[, state])

Iterate over `d::WikiNERProcessed` object. If `state` is missing, it's the beginning
of the whole iteration process.
"""
function Base.iterate(d::WikiNERProcessed, state=ifelse(d.shuffled, randperm(d.ninstances), 1:d.ninstances))
    # Your code here
    # Initialization of the array sturctures
    words = Int[]
    tags = Int[]
    batchsizes = Int[]
    
    # Get sizes of each sequence
    sequence_sizes = getindex.(size.(d.words[state[1:d.batchsize]]), 1)
    # Determine the sequence having the highest length
    max_sequence_size = maximum(sequence_sizes)
    # Sort the sequences and get indexes
    size_ixs = sortperm(sequence_sizes)[end:-1:1] # Descending order sequence size indexes
    
    # Batch construction scheme that is explained in the COMP541 forum
    for k in 1:max_sequence_size
        cnt = 0
        for m in size_ixs
            if (k <= sequence_sizes[m])
                push!(words, d.words[state[m]][k])
                push!(tags, d.tags[state[m]][k])
                cnt += 1
            else
                continue
            end
           
        end
        push!(batchsizes, cnt)
    end
    
    # Next state
    new_state = state[1+d.batchsize:end]
    # Iteration ending criteria
    remain = length(new_state)
    residue = min(d.batchsize, remain)
    residue < d.batchsize && return nothing
    
    return ((words, tags, batchsizes), new_state)
end

Base.IteratorSize(::Type{WikiNERProcessed}) = Base.SizeUnknown()
Base.IteratorEltype(::Type{WikiNERProcessed}) = Base.HasEltype()

@info "Testing WikiNERProcessed Iterator"
((words, tags, batchsizes), new_state) = iterate(devdata);
@test length.((words, tags, batchsizes)) == (397, 397, 55)

counter = 1;
for ddd in devdata
    counter+=1;
end

@test new_state == 17:1696

┌ Info: Testing WikiNERProcessed Iterator
└ @ Main In[8]:43


[32m[1mTest Passed[22m[39m

## Model Components implementation

### Embedding layer
This layer maps each vocabulary to its corresponding vector using its Int id. It works with mini-batches.

In [9]:
"""
    Embedding(vocabsize::Int, embedsize::Int, atype=_atype, scale=0.01)

Create a Embedding layer and initialize its weight. Initial weight parameters are
sampled from normal distribution scaled by a `scale` factor.

# Examples
```julia-repl
julia> embed = Embedding(100, 25);

julia> x = rand(1:10, 10);

julia> embed(x); # forward call
```
"""
mutable struct Embedding
    w # weight
end

function Embedding(vocabsize::Int, embedsize::Int, atype=_atype, scale=0.01)
    w = Param(convert(atype, scale*randn(embedsize, vocabsize)));
    return Embedding(w)
end


function (l::Embedding)(x)
    l.w[:, x]
end

@info "Testing embedding layer"
Random.seed!(1)
embed = Embedding(100, 25);
x = rand(1:25, 12, 32);
@test size(embed(x)) == (25, 12, 32)
@test sum(embed(x)) ≈ -4.231335f0

┌ Info: Testing embedding layer
└ @ Main In[9]:30


[32m[1mTest Passed[22m[39m

### Linear layer

In [10]:
"""
    Linear(inputsize, outputsize; atype=Array{Float64}, scale::Float64=0.1)

Create a linear layer with its weight and bias. Initial weight parameters are
sampled from normal distribution scaled by a `scale` factor. Initial bias
values are zeros.

# Examples
```julia-repl
julia> layer = Linear(50, 10);

julia> x = rand(2, 50);

julia> layer(x); # forward call
```
"""
mutable struct Linear
    w # weight
    b # bias

    function Linear(inputsize, outputsize; atype=_atype, scale::Float64=0.01)
        w = Param(convert(atype, scale*randn(outputsize, inputsize)));
        b = Param(convert(atype, zeros(outputsize)));
        new(w, b)
    end
end

function (l::Linear)(x)
    l.w * x .+ l.b
end

@info "Testing linear layer"
Random.seed!(1)
lin = Linear(100, 200);
x = _atype(randn(100, 32));
@test size(lin(x)) == (200, 32)
@test sum(lin(x)) ≈ -3.8317218f0

┌ Info: Testing linear layer
└ @ Main In[10]:32


[32m[1mTest Passed[22m[39m

### Hidden layer

In [11]:
"""
    Hidden(inputsize, outputsize, fun=relu, atype=_atype, scale=0.1)

Create a hidden layer with its weight and bias and activation function. Initial weight parameters are
sampled from normal distribution scaled by a `scale` factor. Initial bias
values are zeros.

# Examples
```julia-repl
julia> layer = Hidden(100, 200);

julia> x = rand(100, 5);

julia> layer(x); # forward call
```
"""
mutable struct Hidden
    w # weight
    b # bias
    fun # non-linear activation function like relu or tanh

    function Hidden(inputsize, outputsize, fun=relu, atype=_atype, scale=0.1)
        w = Param(convert(atype, scale*randn(outputsize, inputsize)));
        b = Param(convert(atype, zeros(outputsize)));
        new(w, b, fun)
    end
end

function (l::Hidden)(x)
    l.fun.(l.w * x .+ l.b)
end

@info "Testing hidden layer"
Random.seed!(1)
hid = Hidden(200, 256);
x = _atype(randn(200, 32));
@test size(hid(x)) == (256, 32)
@test sum(hid(x)) ≈ 4635.545f0

┌ Info: Testing hidden layer
└ @ Main In[11]:33


[32m[1mTest Passed[22m[39m

### NER Tagger model

Our model consists of four layers. Size of their outputs are as the following:
* **(T)** - Input
* **(E, T)** - Embedding
* **(RNN, T)** - RNN
* **(H, T)** - Hidden
* **(NTags, T)** - Projection

In [12]:
mutable struct NERTagger
    embed::Embedding
    rnn::RNN
    hidden::Hidden
    projection::Linear
end

function NERTagger(no_words, no_tags, embed_size, rnn_hidden_size, mlp_hidden_size, atype=_atype)
    embed = Embedding(no_words, embed_size, atype)
    rnn = RNN(embed_size, rnn_hidden_size)
    hidden = Hidden(rnn_hidden_size, mlp_hidden_size)
    projection = Linear(mlp_hidden_size, no_tags; atype = atype)
    return NERTagger(embed, rnn, hidden, projection)
end

function (m::NERTagger)(x; batchsizes=nothing)
    m.projection(m.hidden(m.rnn(m.embed(x); batchSizes = batchsizes)))
end

@info "Testing forward pass of NERTagger"
Random.seed!(1)
nwords, ntags = length(data.w2i), data.ntags
model = NERTagger(nwords, ntags, 128, 50, 32)

output = model(words; batchsizes=batchsizes)
@test size(output) == (9, 397)
@test sum(output) == 1.512398f0

┌ Info: Testing forward pass of NERTagger
└ @ Main In[12]:20


[91m[1mTest Failed[22m[39m at [39m[1mIn[12]:27[22m
  Expression: sum(output) == 1.512398f0
   Evaluated: 1.5123978f0 == 1.512398f0


LoadError: There was an error during testing

Now you will implement loss function for your model.
Firstly get your probabilities from your model.
Then calculate the loss function for average per token. You can use `nll` for this purpose.

In [13]:
function (m::NERTagger)(x, ygold, batchsizes, average=true)
   nll(m(x; batchsizes = batchsizes), ygold; average = average)
end

@info "Testing loss function of NERTagger"
Random.seed!(1)
nwords, ntags = length(data.w2i), data.ntags
model = NERTagger(nwords, ntags, 128, 50, 32)
l = model(words, tags, batchsizes)

@test l ≈ 2.1969666f0

┌ Info: Testing loss function of NERTagger
└ @ Main In[13]:5


[32m[1mTest Passed[22m[39m

### Loss for a whole dataset

Define a `loss(model, data)` which returns a `(Σloss, Nloss)` pair if `average=false` and
a `Σloss/Nloss` average if `average=true` for a whole dataset. Assume that `data` is an
iterator of `(words, gold_tags, batchsizes)` such as `WikiNERProcessed` and `model(x,y;average)` is a model like
`NERTagger` that computes loss on a single `(x,y)` pair.

In [14]:
"""
    loss(model::NERTagger, data; average=true)

Return overall loss of model on data.
"""
function loss(model::NERTagger, data; average=true)
    l = 0
    n = 0
    
    for (x, y, b) in data
        loss,number = model(x,y,b,false)
        l += loss
        n += number
    end
    average && return l / n
    return l, n
end
@info "Testing loss function"
Random.seed!(1)
@test loss(model, devdata) ≈ 2.196791f0
@test loss(model, devdata; average=false) == (85688.48f0, 39007)


┌ Info: Testing loss function
└ @ Main In[14]:18


[91m[1mTest Failed[22m[39m at [39m[1mIn[14]:21[22m
  Expression: loss(model, devdata; average = false) == (85688.48f0, 39007)
   Evaluated: (85075.47f0, 38728) == (85688.48f0, 39007)


LoadError: There was an error during testing

### Question
Why are we getting such value for loss? is this expected and why?

Write your answer here:

- Due to the nature of the negative log likelihood, one can expect that the loss value in the beginning should be $ln(n_{classes})$, since all parameters are randomly initialized. In our case, one can expect that $loss = ln(9) \approx 2.196$

### Accuracy metric
This function will be the metric which will evaluate our model's performance.

You will iterate over the given `data` object, predicting each instance and adding number of correctly predicted tokens to `ncorrect` and the number of tokens to `ntokens`.

possible helpful procedures: `argmax`, `vec`

In [15]:
"""
    accuracy(model::NERTagger, data, i2t)

Return accuracy of tags given a model and dataset
"""
function accuracy(model::NERTagger, data, i2t)
    ncorrect = 0
    ntokens = 0

    for (x, ygold, batchsizes) in data
        scores = model(x; batchsizes=batchsizes)
        # Your answer here
        ntokens += length(ygold)
        ypred = map( y -> y[1], argmax(scores, dims=1))
        ncorrect += sum(ygold .== ypred')
        # Your answer here
    end

    return ncorrect / ntokens
end

@info "Testing accuracy function"
@test accuracy(model, devdata, data.i2t) ≈ 0.1758145973799

┌ Info: Testing accuracy function
└ @ Main In[15]:23


[91m[1mTest Failed[22m[39m at [39m[1mIn[15]:24[22m
  Expression: accuracy(model, devdata, data.i2t) ≈ 0.1758145973799
   Evaluated: 0.17628072712249535 ≈ 0.1758145973799


LoadError: There was an error during testing

The following function can be used to train our model. trn is the training data, dev is used to determine the best model, tst... can be zero or more small test datasets for loss reporting. It returns the model that does best on dev.

In [16]:
"""
    train!(model, trn, dev, tst...)

Train `model` on `trn` data with Adam optimizer and Return the best performing model on `dev` data.
"""
function train!(model, trn, dev, tst...)
    bestmodel, bestloss = deepcopy(model), loss(model, dev)
    progress!(adam(model, trn), steps=1000) do y
        losses = [ loss(model, d) for d in (dev,tst...) ]
        if losses[1] < bestloss
            bestmodel, bestloss = deepcopy(model), losses[1]
        end
        return (losses...,)
    end
    return bestmodel
end

Knet.Train20.train!

## Training the model
Here we train our model for 10 epochs using the previous procedure. You can try to fiddle with the hyperparameters i.e. (embed_size, hidden_size, epochs, etc..) to get better loss on dev set. You should get a value of dev loss around `0.26`.

In [17]:
@info "Training NERTagger model"
@info "Seeding random number generator"
Random.seed!(1)

@info "Loading data"
data = WikiNERData();
dtrn = WikiNERProcessed(data.trn, data.w2i, data.t2i);
ddev = WikiNERProcessed(data.dev, data.w2i, data.t2i; shuffled=false);
epochs = 1; @show epochs
ctrn = [ b for b in dtrn ]
trnx10 = collect(flatten(shuffle!(ctrn) for i in 1:epochs))
trn20 = ctrn[1:20]
dev = [ b for b in ddev ]

@info "Initializing model"

@show nwords
@show ntags
embed_size = 128; @show embed_size
rnn_size = 50; @show rnn_size
hidden_size = 32; @show hidden_size
model = NERTagger(nwords, ntags, embed_size, rnn_size, hidden_size)

┌ Info: Training NERTagger model
└ @ Main In[17]:1
┌ Info: Seeding random number generator
└ @ Main In[17]:2
┌ Info: Loading data
└ @ Main In[17]:5


epochs = 1
nwords = 28484
ntags = 9
embed_size = 128
rnn_size = 50
hidden_size = 32


┌ Info: Initializing model
└ @ Main In[17]:15


NERTagger(Embedding(P(KnetArray{Float32,2}(128,28484))), LSTM(input=128,hidden=50), Hidden(P(KnetArray{Float32,2}(32,50)), P(KnetArray{Float32,1}(32)), Knet.Ops20.relu), Linear(P(KnetArray{Float32,2}(9,32)), P(KnetArray{Float32,1}(9))))

In [18]:
#Uncomment this to train the model (one epoch should take around 2 mins on gpu):
model = train!(model, trnx10, dev, trn20)

┣████████████████████┫ [100.00%, 8883/8883, 01:38/01:38, 90.34i/s] (0.2469082f0, 0.11470737f0))


NERTagger(Embedding(P(KnetArray{Float32,2}(128,28484))), LSTM(input=128,hidden=50), Hidden(P(KnetArray{Float32,2}(32,50)), P(KnetArray{Float32,1}(32)), Knet.Ops20.relu), Linear(P(KnetArray{Float32,2}(9,32)), P(KnetArray{Float32,1}(9))))

## Evaluation of the best model
**Expected Values**
- Development loss = 0.25991333
- Development accuracy = 0.9176301689440357
- Training loss = 0.11450425
- Training accuracy = 0.9643299845754065

In [20]:
@info "Evaluating the model"

dloss = loss(model, ddev)
tloss = loss(model, dtrn)
dacc = accuracy(model, ddev, data.i2t)
tacc = accuracy(model, dtrn, data.i2t)

println("Development loss = ", dloss)
println("Development accuracy = ", dacc)
println("Training loss = ", tloss)
println("Training accuracy = ", tacc)

┌ Info: Evaluating the model
└ @ Main In[20]:1


Development loss = 0.2469082
Development accuracy = 0.9191799215038216
Training loss = 0.111691974
Training accuracy = 0.9652775832749825


---

*This notebook was generated using [Literate.jl](https://github.com/fredrikekre/Literate.jl).*