# M2 project : NER tagging / Zong-You Ke 



This project aims to implement a NER tagger with PyTorch. We will be using the English CoNLL 2003 data set.

Data download & description
--------

In [None]:
from urllib.request import urlretrieve
urlretrieve('https://raw.githubusercontent.com/pranabsarkar/Conll_task/master/conll-2003/eng.train','eng.train')
urlretrieve('https://raw.githubusercontent.com/pranabsarkar/Conll_task/master/conll-2003/eng.testa','eng.testa')

# Print the beginning of the training set
with open('eng.train') as istream:
  for idx, line in enumerate(istream):
    print(line.strip())
    if idx >= 20:
      break


-DOCSTART- -X- -X- O

EU NNP I-NP I-ORG
rejects VBZ I-VP O
German JJ I-NP I-MISC
call NN I-NP O
to TO I-VP O
boycott VB I-VP O
British JJ I-NP I-MISC
lamb NN I-NP O
. . O O

Peter NNP I-NP I-PER
Blackburn NNP I-NP I-PER

BRUSSELS NNP I-NP I-LOC
1996-08-22 CD I-NP O

The DT I-NP O
European NNP I-NP I-ORG
Commission NNP I-NP I-ORG


The CONLL 2003 dataset encodes each token on a single line followed by its annotation. A token line is a quadruple:

> (token, POS tag, chunk tag, named entity tag)

A named entity tagger aims to predict the named entity annotations given the raw tokens. The NER tags follow the IOB convention.
* **I** stands for **Inside** and is used to flag tokens that are part of a named entity.
* **B** stands for **Begin** and is used to flag a token starting a new entity when the preceding token is already part of an entity.
* **O** stands for **Outside** and is used to flag tokens that are not part of a named entity.

The I and B tags are followed by a specifier. For instance, I-PER means that the named entity refers to a person, and I-ORG means that the entity refers to an organisation.
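
To make the convention concrete, here is a minimal sketch that groups a tagged sentence into entity spans. The helper name `iob_to_spans` is hypothetical and not part of the assignment; it assumes the tagging scheme visible in the data above, where a `B-` tag only appears when an entity directly follows another entity.

```python
def iob_to_spans(tags):
    """Group IOB tags into (label, start, end) spans; `end` is exclusive."""
    spans = []
    start, label = None, None
    for i, tag in enumerate(tags):
        prefix, _, kind = tag.partition('-')
        # An open entity continues only with an I- tag of the same type.
        continues = prefix == 'I' and kind == label
        if label is not None and not continues:
            spans.append((label, start, i))
            start, label = None, None
        if tag != 'O' and label is None:
            start, label = i, kind
    if label is not None:
        spans.append((label, start, len(tags)))
    return spans

# NER tags of "EU rejects German call to boycott British lamb ."
tags = ['I-ORG', 'O', 'I-MISC', 'O', 'O', 'O', 'I-MISC', 'O', 'O']
print(iob_to_spans(tags))  # [('ORG', 0, 1), ('MISC', 2, 3), ('MISC', 6, 7)]
```

Spans of this kind are also what entity-level evaluation would compare, whereas the accuracy reported later in this notebook is computed per token.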

Sentences are separated by a blank line. The train file is `eng.train`, the dev file is `eng.testa`. I will evaluate your work with a test file unknown to you. To do this, I will change the content of the dev file.



First exercise : data preprocessing (1pts)
---


Using the CoNLL 2003 train file, you will:

* Extract an input vocabulary and create two maps: one mapping tokens to integers and a second mapping integers to tokens (see the pdf notes)
* Include elements in the input vocabulary for padding and for unknown words
* Extract an output vocabulary (the set of NER tags) and return two maps mapping tags to integers and vice versa.

These functionalities should be implemented in a function with signature `vocabulary(filename)` that returns the two maps.

In [None]:
# Batch encoding for input symbols
import torch
import torch.nn as nn

def vocabulary(filename,input_vocab,padding='<pad>',unknown='<unk>'):
    #input_vocab selects the column to index:
    #'input' (tokens), 'pos' (POS tags), any other value (NE tags)
    #the two optional symbols (padding, unknown) are added to the vocabulary
    #if their value is not None

    ###########################
    idx2sym = {}
    sym2idx = {}

    if input_vocab == 'input':  # token
      column = 0
    elif input_vocab == 'pos':  # POS tag
      column = 1
    else:                       # NE tag
      column = -1

    with open(filename) as istream:
      for line in istream:
        tags = line.strip()
        # skip blank lines and document separators
        if tags != "" and not tags.startswith('-DOCSTART-'):
          sym = tags.split()[column]
          if sym not in sym2idx:
            sym2idx[sym] = len(sym2idx)
            idx2sym[sym2idx[sym]] = sym

    # padding and unknown symbols are added once, after the whole file has been read
    for special in (padding,unknown):
      if special is not None and special not in sym2idx:
        sym2idx[special] = len(sym2idx)
        idx2sym[sym2idx[special]] = special

    return idx2sym, sym2idx
    ###########################

In [None]:
def char_vocabulary(filename,padding='<pad>',unknown='<unk>'):

    idx2sym = {}
    sym2idx = {}
    idx = 0
    
    with open(filename) as istream:
      for line in istream:
        tags = line.strip()
        if tags != "" and not tags.startswith('-DOCSTART-'):
          token= tags.split()[0]
          for char in token:
            if char not in sym2idx.keys():
              sym2idx[char] = idx
              idx2sym[idx] = char
              idx +=1

    # padding
    sym2idx[padding] = idx
    idx2sym[idx] = padding
    # unknown
    sym2idx[unknown] = idx+1
    idx2sym[idx+1] = unknown

    return idx2sym, sym2idx

Now we implement three functions: 

* One that performs padding 
* The second will encode a sequence of tokens (or a sequence of tags) as integers
* The third will decode a sequence of symbols from integers to strings

At test time, some tokens might not belong to the vocabulary. Ensure that your encoding function does not crash in this case.


In [None]:
def pad_sequence(sequence,pad_size,pad_token):
    #returns a new list with additional pad tokens if needed
    #(the input list is left untouched, so the stored dataset is never modified)
    return sequence + [pad_token] * max(0,pad_size-len(sequence))

def code_sequence(sequence,coding_map,unk_token=None):
    #takes a list of strings and returns a list of integers
    code_seq = []
    for elem in sequence:
      if elem not in coding_map.keys(): # UNK
        code_seq.append(coding_map[unk_token])
      else: 
        code_seq.append(coding_map[elem])

    return code_seq

def decode_sequence(sequence,decoding_map):
    #takes a list of integers and returns a list of strings 
    decode_seq = []
    for elem in sequence:
      decode_seq.append(decoding_map[elem])

    return decode_seq
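
As a quick sanity check of the intended behaviour, the sketch below runs a toy round trip with a hand-made vocabulary. It re-implements the three helpers in compact form under the hypothetical names `pad`, `encode` and `decode` so that it runs on its own; the toy vocabulary is an assumption made for illustration.

```python
def pad(seq, size, pad_token):
    # Return a new list; the caller's list is not modified.
    return seq + [pad_token] * max(0, size - len(seq))

def encode(seq, sym2idx, unk_token):
    # Symbols missing from the map fall back to the unknown index.
    unk = sym2idx[unk_token]
    return [sym2idx.get(sym, unk) for sym in seq]

def decode(codes, idx2sym):
    return [idx2sym[c] for c in codes]

sym2idx = {'EU': 0, 'rejects': 1, '<pad>': 2, '<unk>': 3}
idx2sym = {i: s for s, i in sym2idx.items()}

sentence = ['EU', 'rejects', 'lamb']   # 'lamb' is out of vocabulary
codes = encode(pad(sentence, 5, '<pad>'), sym2idx, '<unk>')
print(codes)                   # [0, 1, 3, 2, 2]
print(decode(codes, idx2sym))  # ['EU', 'rejects', '<unk>', '<pad>', '<pad>']
```

Note that decoding is lossy for out-of-vocabulary tokens: `lamb` comes back as `<unk>`.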

We then add two more functions for processing characters:

In [None]:
import itertools

def pad_char(input, pad_token='<pad>'): 
    """
    Splits tokens into characters and pads each character sequence
    to the length of the longest token in the batch.
    
    Takes a list of sentences (lists of tokens)
    and returns, for each sentence, a list of padded character lists.
    """
    ls_flatten = list(itertools.chain.from_iterable(input))
    pad_size_char = len(max(ls_flatten, key=len))

    sentences = []

    for sentence in input:
      all_tokens = []
      for token in sentence:
          if token != pad_token:
            chars = list(token) + [pad_token] * (pad_size_char-len(token))
          else:
            # a padding token becomes a full row of padding characters
            chars = [pad_token] * pad_size_char
          all_tokens.append(chars)
      sentences.append(all_tokens)
    return sentences

def code_char(input,encodingmap,unk_token='<unk>'):
    """
    Encodes a list of sentences, each a list of character lists,
    and returns the same structure with character indices.
    Characters missing from the map are encoded with the index of unk_token,
    so every token keeps its padded length.
    """
    unk_idx = encodingmap[unk_token]
    all_sents = []

    for sentence in input:
      all_tokens = []
      for token in sentence:
        tok_codes = [encodingmap.get(c,unk_idx) for c in token]
        all_tokens.append(tok_codes)
      all_sents.append(all_tokens)

    return all_sents

Second exercise: data generator (4pts)
------

In this second exercise, we will write a mini-batch generator. 
This is a class in charge of generating randomized batches of data from the dataset. We start by implementing two functions for reading the text file.


In [None]:
def read_conll_tokens(conllfilename):
    """
    Reads a CoNLL 2003 file and returns a list of sentences.
    A sentence is a list of strings (tokens)
    """
    tokens = []
    sentences = []

    with open(conllfilename) as istream:
      for line in istream:
        tags = line.strip()
        if not tags.startswith('-DOCSTART-'):
          if tags !="":
              tokens.append(tags.split()[0])
          else:
              if len(tokens) > 0: sentences.append(tokens)
              tokens = []

    # keep the last sentence when the file does not end with a blank line
    if len(tokens) > 0: sentences.append(tokens)
    return sentences


def read_conll_tags(conllfilename, tag_type):
    """
    Reads a CoNLL 2003 file and returns a list of sentences.
    A sentence is a list of strings
    (NER tags if tag_type is true, POS tags otherwise)
    """
    all_tags = []
    sentences = []

    with open(conllfilename) as istream:
      for line in istream:
        tags = line.strip()
        if not tags.startswith('-DOCSTART-'):
          if tags !="":
            if tag_type: # NER tags
              all_tags.append(tags.split()[-1])
            else:        # POS tags
              all_tags.append(tags.split()[1])
          else:
              if len(all_tags) > 0: sentences.append(all_tags)
              all_tags = []

    # keep the last sentence when the file does not end with a blank line
    if len(all_tags) > 0: sentences.append(all_tags)
    return sentences



Now we implement the class. You will rely on the helper functions designed above in order to fill in the blanks in the constructor. 

In [None]:
import torch
import torch.nn as nn
from random import shuffle

class DataGenerator:

        #Reuse all relevant helper functions defined above to solve the problems
        def __init__(self,conllfilename, parentgenerator = None, pad_token='<pad>',unk_token='<unk>'):

              if parentgenerator is not None: #Reuse the encodings of the parent if specified
                  self.pad_token      = parentgenerator.pad_token
                  self.unk_token      = parentgenerator.unk_token
                  self.input_sym2idx  = parentgenerator.input_sym2idx 
                  self.input_idx2sym  = parentgenerator.input_idx2sym 

                  self.char_sym2idx   = parentgenerator.char_sym2idx
                  self.char_idx2sym   = parentgenerator.char_idx2sym

                  self.tag_sym2idx  = parentgenerator.tag_sym2idx 
                  self.tag_idx2sym  = parentgenerator.tag_idx2sym 

                  self.output_sym2idx = parentgenerator.output_sym2idx 
                  self.output_idx2sym = parentgenerator.output_idx2sym  
              else:                           #Creates new encodings
                  self.pad_token = pad_token
                  self.unk_token = unk_token

                  #Create encoding maps from datafile 
                  self.input_idx2sym,self.input_sym2idx     = vocabulary(conllfilename,"input",pad_token,unk_token)
                  self.char_idx2sym,self.char_sym2idx       = char_vocabulary(conllfilename,pad_token,unk_token)
                  self.tag_idx2sym,self.tag_sym2idx         = vocabulary(conllfilename,"pos",pad_token,unk_token)
                  self.output_idx2sym,self.output_sym2idx   = vocabulary(conllfilename,"output",pad_token,unk_token)

              #Store the conll dataset with sentence structure (a list of lists of strings) in the following fields 
              self.Xtokens = read_conll_tokens(conllfilename)
              self.Ytokens = read_conll_tags(conllfilename, True)
              self.Ttokens = read_conll_tags(conllfilename, False)
              #######################

        def generate_batches(self,batch_size,conv_mode=False):

              #This is an example generator function yielding one batch after another
              #Batches are lists of lists
              
              assert(len(self.Xtokens) == len(self.Ytokens))
              assert(len(self.Ytokens) == len(self.Ttokens))
              
              N     = len(self.Xtokens)
              idxes = list(range(N))

              #Data ordering (try to explain why these 2 lines make sense...)
              shuffle(idxes)
              idxes.sort(key=lambda idx: len(self.Xtokens[idx]))

              #batch generation
              bstart = 0
              while bstart < N:
                 bend        = min(bstart+batch_size,N)
                 batch_idxes = idxes[bstart:bend] 
                 batch_len   = max(len(self.Xtokens[idx]) for idx in batch_idxes)              

                 seqX = [ pad_sequence(self.Xtokens[idx],batch_len,self.pad_token) for idx in batch_idxes]
                 seqY = [ pad_sequence(self.Ytokens[idx],batch_len,self.pad_token) for idx in batch_idxes]
                 seqT = [ pad_sequence(self.Ttokens[idx],batch_len,self.pad_token) for idx in batch_idxes]
                 
                 # In convolutive mode, we will do character preprocessing
                 if conv_mode:
                  seqX = code_char(pad_char(seqX), self.char_sym2idx)
                 else:
                  seqX = [ code_sequence(seq,self.input_sym2idx,self.unk_token) for seq in seqX]
                
                 seqY  = [ code_sequence(seq,self.output_sym2idx,self.unk_token) for seq in seqY]
                 seqT  = [ code_sequence(seq,self.tag_sym2idx,self.unk_token) for seq in seqT]
                 
                 assert(len(seqX) == len(seqY))
                 assert(len(seqY) == len(seqT))
                 yield (seqX,seqY,seqT)
                 bstart += batch_size
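
The two data-ordering lines in `generate_batches` can be understood with a small experiment. Shuffling first and then sorting by length (Python's sort is stable) puts sentences of similar length in the same batch, which reduces padding, while the order among sentences of equal length still changes from one epoch to the next. The sketch below counts pad tokens with and without the sort. The function name `count_padding` and the synthetic sentence lengths are assumptions made for illustration.

```python
from random import Random

def count_padding(lengths, order, batch_size):
    """Number of pad tokens needed when batches follow `order`."""
    pads = 0
    for start in range(0, len(order), batch_size):
        batch = [lengths[i] for i in order[start:start + batch_size]]
        pads += sum(max(batch) - n for n in batch)
    return pads

rng = Random(0)
lengths = [rng.randint(1, 50) for _ in range(1000)]  # synthetic sentence lengths

order = list(range(len(lengths)))
rng.shuffle(order)
pads_shuffled = count_padding(lengths, order, 16)

order.sort(key=lambda i: lengths[i])  # stable: ties keep their shuffled order
pads_sorted = count_padding(lengths, order, 16)

print(pads_shuffled, pads_sorted)
```

One side effect worth keeping in mind is that batches are then yielded from the shortest sentences to the longest, so the batch order itself is not random.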

Third exercise : implement the tagger (5pts)
---------------
This is the core exercise. There are three main tasks:
* Implement parameter allocation. This implies allocating the embedding layer, the LSTM (or bi-LSTM) layer and the Linear Layer.
* Implement the forward method. This method expects a tensor encoding the input and outputs a tensor of predictions
* Implement the train method 

The evaluation (`validate`) method is given and cannot be modified. But it can be used as a source of inspiration for implementing the train method.

In [None]:
class CharConvolution(nn.Module):
      def __init__(self,windowK,chars_vocab_size,input_embedding_size,output_embedding_size,padding_idx = None,device='cpu'):

          super(CharConvolution, self).__init__()
          self.input_embedding_size = input_embedding_size
          self.output_embedding_size = output_embedding_size
          self.conv1d = nn.Conv1d(in_channels=self.input_embedding_size, out_channels=self.output_embedding_size, kernel_size=windowK)
          self.embeddings = nn.Embedding(chars_vocab_size, input_embedding_size)
          self.conv1d     = self.conv1d.to(device)
          self.embeddings = self.embeddings.to(device)

      def forward(self,Xinput):
   
          batch_size = Xinput.shape[0]
          seq_length = Xinput.shape[1] 
          word_length = Xinput.shape[2] 
          emb = self.embeddings(Xinput)

          emb_view = emb.view(emb.shape[0],emb.shape[1]*emb.shape[2],emb.shape[3]) 
          emb= torch.transpose(emb_view, 1, 2)  # batch,char_emb,nb_all_char
          
          # filtering
          conv_emb = self.conv1d(emb) 
          
          # pooling
          pool = nn.MaxPool1d(conv_emb.shape[-1]-seq_length+1, stride=1)
          emb_of_word = pool(conv_emb) 

          emb_of_word_transp = torch.transpose(emb_of_word, 1, 2)
          emb_of_word_transp_view = emb_of_word_transp.view(batch_size,seq_length,self.output_embedding_size) 
  
          return  emb_of_word_transp_view


In [None]:
import torch.optim as optim

class NERtagger(nn.Module):

      def __init__(self,traingenerator,embedding_size,char_embedding_size,tag_embedding_size,hidden_size,bidirectional,window_size=2,dropout=0,num_layers=1,device='cpu'):
        super(NERtagger, self).__init__()        
        self.embedding_size      = embedding_size
        self.char_embedding_size = char_embedding_size
        self.tag_embedding_size  = tag_embedding_size
        self.hidden_size         = hidden_size
        self.window_size         = window_size
        self.allocate_params(traingenerator,device,bidirectional,dropout,num_layers) 

      def load(self,filename):
        self.load_state_dict(torch.load(filename))

      def allocate_params(self,datagenerator,device,bidirectional,dropout,num_layers):
        
        #create fields for nn Layers
        #########################
        bi_factor = 1
        if bidirectional: bi_factor = 2

        invocab_size    = len(datagenerator.input_idx2sym)
        outvocab_size   = len(datagenerator.output_idx2sym)
        inchar_size     = len(datagenerator.char_sym2idx)
        tag_size        = len(datagenerator.tag_idx2sym)

        pad_index       = datagenerator.input_sym2idx[datagenerator.pad_token]
        pad_tag_index   = datagenerator.tag_sym2idx[datagenerator.pad_token]
        self.embeddings = nn.Embedding(invocab_size,self.embedding_size,padding_idx=pad_index)
        self.tag_embeddings  = nn.Embedding(tag_size,self.tag_embedding_size,padding_idx=pad_tag_index)   

        self.charE      = CharConvolution(self.window_size,inchar_size,self.char_embedding_size,self.embedding_size,device=device)    
        self.lstm       = nn.LSTM(self.embedding_size,self.hidden_size,batch_first=True, bidirectional=bidirectional,dropout=dropout,num_layers=num_layers)
        self.lstm_tag   = nn.LSTM(self.embedding_size+self.tag_embedding_size,self.hidden_size,batch_first=True, bidirectional=bidirectional,dropout=dropout,num_layers=num_layers)
        
        self.linear_out      = nn.Linear(self.hidden_size*bi_factor,outvocab_size)
        self.embeddings      = self.embeddings.to(device)
        self.tag_embeddings  = self.tag_embeddings.to(device)
        self.lstm            = self.lstm.to(device)
        self.lstm_tag        = self.lstm_tag.to(device)
        self.linear_out      = self.linear_out.to(device)
        #########################

      def forward(self,Xinput,Tinput=None,conv_mode=False,tag_mode=False):
        
        if tag_mode and Tinput is not None:
          Xinput = self.embeddings(Xinput) # batch,seqlen,features
          Tinput = self.tag_embeddings(Tinput)

          # dim=0 stacks the POS embeddings along the batch axis: 2*batch,seqlen,features
          # (this requires embedding_size == tag_embedding_size)
          concat = torch.cat((Xinput, Tinput),0)
          output, (h,c) = self.lstm(concat)

          # dim=2 appends each POS embedding to the embedding of its own token:
          # batch,seqlen,features+tag_features
          #concat = torch.cat((Xinput, Tinput),2)
          #output, (h,c) = self.lstm_tag(concat)

        elif conv_mode:
          Xinput = self.charE(Xinput)
          output, (h,c) = self.lstm(Xinput)

        else: 
          Xinput = self.embeddings(Xinput) # batch,seqlen,features
          output, (h,c) = self.lstm(Xinput)
          
        return self.linear_out(output)

      def train(self,traingenerator,validgenerator,epochs,batch_sizes,conv_mode=False,tag_mode=False,device='cpu',learning_rate=0.001): 

        ############################
        #once implemented, it is strongly advised to run this method on a GPU
        ############################

        self.minloss = 10000000 #the min loss found so far on validation data
        
        device = torch.device(device)
        pad_index = traingenerator.output_sym2idx[traingenerator.pad_token]
        loss_fnc = nn.CrossEntropyLoss(ignore_index=pad_index)
        optimizer = torch.optim.Adam(self.parameters(), lr=learning_rate)

        for epoch in range(epochs):
          # reset the statistics so that the means printed below cover the current epoch only
          batch_accurracies = []
          batch_losses = []

          for (seqX,seqY,seqT) in traingenerator.generate_batches(batch_sizes,conv_mode):  
              X = torch.LongTensor(seqX).to(device)
              Y = torch.LongTensor(seqY).to(device) 
              T = torch.LongTensor(seqT).to(device) 
              
              Yhat = self.forward(X,T,conv_mode,tag_mode)

              batch_size,seq_len = Y.shape
              Yhat = Yhat.view(batch_size*seq_len,-1)  
              Y    = Y.view(batch_size*seq_len)

              loss = loss_fnc(Yhat, Y)
              batch_losses.append(loss.item())
              optimizer.zero_grad() # clear the gradients accumulated by the previous batch
              loss.backward()
              optimizer.step()

              mask    = (Y != pad_index)
              Yargmax = torch.argmax(Yhat,dim=1)
              correct = torch.sum((Yargmax == Y) * mask)
              total   = torch.sum(mask)
              batch_accurracies.append(float(correct)/float(total))

          L = len(batch_losses)
          train_loss = sum(batch_losses)/L

          print('Epoch %d:'%(epoch))
          print('[train] mean loss = %f | mean accuracy = %f'%(train_loss,sum(batch_accurracies)/L))
          self.validate(validgenerator, batch_sizes, device=device, save_min_model=True)

      def validate(self,datagenerator,batch_size,device='cpu',save_min_model=False):
          
          batch_accurracies = []
          batch_losses      = []

          device = torch.device(device)
          pad_index = datagenerator.output_sym2idx[datagenerator.pad_token]
          loss_fnc  = nn.CrossEntropyLoss(ignore_index=pad_index)

          for (seqX,seqY,_) in datagenerator.generate_batches(batch_size,False):
                with torch.no_grad():   
                  X = torch.LongTensor(seqX).to(device)
                  Y = torch.LongTensor(seqY).to(device)
                
                  Yhat = self.forward(X)

                  #Flattening and loss computation
                  batch_size,seq_len = Y.shape
                  Yhat = Yhat.view(batch_size*seq_len,-1)
                  Y    = Y.view(batch_size*seq_len)
                  loss = loss_fnc(Yhat,Y)
                  batch_losses.append(loss.item())

                  #Accurracy computation
                  mask    = (Y != pad_index)
                  Yargmax = torch.argmax(Yhat,dim=1)
                  correct = torch.sum((Yargmax == Y) * mask)
                  total   = torch.sum(mask)
                  batch_accurracies.append(float(correct)/float(total))

          L = len(batch_losses)                  
          valid_loss = sum(batch_losses)/L

          if save_min_model and valid_loss < self.minloss:
            self.minloss = valid_loss
            torch.save(self.state_dict(), 'tagger_params.pt')
          
          print('[valid] mean loss = %f | mean accurracy = %f'%(valid_loss,sum(batch_accurracies)/L))

The main program is the following. You are expected to add code for searching for hyperparameters that maximise the validation score.

In [None]:
trainset = DataGenerator('eng.train')
validset = DataGenerator('eng.testa',parentgenerator = trainset)
tagger   = NERtagger(trainset,64,16,64,256,bidirectional=True,dropout=0.4,num_layers=1,device='cuda')  
                  # tuple of four numbers: word_emb size, char_emb size, tag size, hid size
tagger.train(trainset,validset,20,16,conv_mode=False,tag_mode=False,device='cuda',learning_rate=0.0001) 
                  # tuple of two numbers: epoch, batch size
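
The hyperparameter search requested above could be organised as a plain grid search. The sketch below is only one possible layout: `grid_search` and the callback `train_and_score` are hypothetical names, and since `validate` in this notebook prints its scores rather than returning them, the scorer here is a placeholder function so that the example runs on its own.

```python
from itertools import product

def grid_search(grid, train_and_score):
    """Try every combination in `grid` and return the best (score, config)."""
    names = sorted(grid)
    best_score, best_config = float('-inf'), None
    for values in product(*(grid[name] for name in names)):
        config = dict(zip(names, values))
        score = train_and_score(config)
        if score > best_score:
            best_score, best_config = score, config
    return best_score, best_config

# Placeholder scorer (assumption): in the notebook this would build a NERtagger
# from `config`, train it, and return its validation accuracy.
def dummy_score(config):
    return -abs(config['hidden_size'] - 256) - abs(config['embedding_size'] - 64)

grid = {
    'embedding_size': [32, 64, 128],
    'hidden_size': [128, 256, 512],
    'learning_rate': [0.001, 0.0001],
}
best_score, best_config = grid_search(grid, dummy_score)
print(best_score, best_config)
```

The number of runs grows multiplicatively with each added hyperparameter, so in practice the grid has to stay small or be replaced by random sampling.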


Fourth exercise : improve the tagger (10pts)
----------

This exercise is relatively free. You may add improvements to the basic tagger.
Note that I expect that improving the management of unknown words and of subword units is key on this task. You may wish to:
* Find a way to learn a word embedding for unknown words (word dropout)
* Use a BiLSTM rather than a simple LSTM
* Use part of speech tags embeddings as additional inputs
* Integrate your convolutional word embedding module into the tagger
* ...
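
For the first item, a minimal sketch of word dropout is given below. The motivation is that the vocabulary built in the first exercise contains every training token, so `<unk>` never occurs during training and its embedding is never updated. Randomly replacing training tokens by `<unk>` gives that embedding something to learn from. The function name `word_dropout` and the rate used in the example are assumptions, not part of the notebook.

```python
import random

def word_dropout(tokens, p, unk_token='<unk>', pad_token='<pad>', rng=random):
    """Replace each real token by `unk_token` with probability `p`."""
    return [unk_token if tok != pad_token and rng.random() < p else tok
            for tok in tokens]

rng = random.Random(0)
sentence = ['EU', 'rejects', 'German', 'call', '<pad>', '<pad>']
print(word_dropout(sentence, 0.3, rng=rng))
```

It would be applied to the training batches only, before the tokens are encoded as integers; padding tokens are left untouched.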

Describe your improvements below and point out the name(s) of the function(s)
where they are implemented. 


# Results

I implemented the following four methods that may help increase accuracy:

1. Search for better hyperparameters (including dropout rate)
2. Apply bidirectional LSTM
3. Use POS tags 
4. Add convolutional module

Below I will describe these methods and results one by one.

### 1. Hyperparameters

All sizes were tested with powers of 2.

* **optimizer**: fixed at **Adam** since it lets the model learn much faster than SGD.

* **batch size**: fixed at **16** since larger batches make accuracy drop, because the parameters are then updated too rarely.

* **word embedding size**: fixed at **64**. Between 32, 64 and larger sizes, no significant improvement in accuracy could be observed. However, the training time increases with the embedding size.

* **hidden size**: fixed at **256** as a compromise. With one LSTM layer and an embedding size of 64, larger hidden sizes generally yield an accuracy above 0.93, but **the maximum it can reach is around 0.94**. However, the larger the hidden size, the longer the training takes.

* **learning rate**: fixed at **0.0001** since it yields the best results. When it is higher (>0.0001), dev accuracy drops. When it is lower, the model learns too slowly.

* **dropout rate**: **0.4** is preferred. I tested 0.2, 0.3, 0.4 and 0.5 and found no significant differences between the first three. Conversely, 0.5 did not easily yield an accuracy above 0.93.

* **number of layers**: **1** is preferred since larger numbers not only make training longer but also bring no significant improvement, if not worse results.

* **epochs**: fixed at **20**. After multiple test runs, I came to the conclusion that the overall pattern of an accuracy curve can be observed within 20 epochs.
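
One caveat about the dropout rate: `nn.LSTM` applies its `dropout` argument to the outputs of every layer except the last one, so with `num_layers=1` the setting does not change the network (PyTorch emits a warning in that case). The differences between dropout rates observed above may therefore reflect run-to-run variance. The small check below illustrates this on a toy LSTM; the sizes are arbitrary assumptions.

```python
import warnings
import torch
import torch.nn as nn

torch.manual_seed(0)
with warnings.catch_warnings():
    warnings.simplefilter('ignore')  # silence the dropout warning for one layer
    lstm = nn.LSTM(8, 16, batch_first=True, dropout=0.4, num_layers=1)
lstm.train()  # dropout would only be active in training mode

x = torch.randn(2, 5, 8)
out1, _ = lstm(x)
out2, _ = lstm(x)
# Two passes over the same input are identical: no dropout was applied.
print(torch.equal(out1, out2))
```

To regularise a single-layer tagger, a separate `nn.Dropout` placed on the embeddings or on the LSTM output would be one option.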


### 2. BiLSTM

* I toggled the `bidirectional` argument of `nn.LSTM`. I found that the LSTM runs slightly faster than the BiLSTM, but its accuracy was not as consistently high.
* Therefore, **BiLSTM is preferred**, which could yield a dev accuracy as high as 0.94.



### 3. POS tags

* I extracted the POS tags for each token and sent them out as **Tinput/T** with the data generator. In `forward`, I created a special case, "tag mode", to determine whether POS tags are taken into consideration and whether the LSTM should process them.

* In the positive case, POS tags are first converted to embeddings by the `tag_embeddings` layer and then combined with the token embeddings (`Xinput`). There are two options: concatenating along axis 0, which stacks the POS embeddings after the token embeddings along the batch axis and makes the batch twice as large; or concatenating along axis 2, which appends each POS embedding to the embedding of its own token, yielding longer feature vectors and requiring an LSTM with a larger input size (defined as `self.lstm_tag`).

* **Results**: accuracy became terrible, below 0.50, and could drop as low as around 0.30. Varying other hyperparameters did not help much. Do the poor results have anything to do with the concatenation methods?
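
A possible explanation, offered as a hypothesis rather than a verified diagnosis, can be read off the tensor shapes. The sketch below uses toy sizes (assumptions) to compare the two concatenation axes.

```python
import torch

batch, seqlen, emb_size, tag_size = 4, 7, 64, 64
X = torch.randn(batch, seqlen, emb_size)  # token embeddings
T = torch.randn(batch, seqlen, tag_size)  # POS-tag embeddings

stacked = torch.cat((X, T), 0)  # along the batch axis
joined = torch.cat((X, T), 2)   # along the feature axis

print(stacked.shape)  # torch.Size([8, 7, 64])
print(joined.shape)   # torch.Size([4, 7, 128])
```

With axis 0 the LSTM output has twice as many rows as `Y`, so after the flattening step in `train` the predictions no longer line up one-to-one with the gold tags. Axis 2 keeps one row per token. In addition, `validate` calls `self.forward(X)` without the POS tags, so the dev score is always computed on the plain token path, whatever happened during training.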




### 4. Convolutional module

* **Usage**: placed before LSTM, this module serves as a fine-tuner for token embeddings. By slicing up tokens into characters, as well as filtering and pooling character embeddings, I wish to find the best embeddings for each token.


* **Architecture**: 

 1. I created a **class CharConvolution** which defines the conv1d and maxpool1d layers of the convolutional module. Its forward function receives tensors of character indices and transforms each character into an embedding in order to determine the best embedding for each token. 

 2. To generate adequate tensor input, I added three functions, namely **char_vocabulary**, **pad_char** and **code_char**, which help slice up tokens into characters and build a character index lookup table. When conv_mode is True, the data generator will turn all token sequences into character sequences and perform the same padding and encoding operations as with normal token sequences.

 3. An option of conv mode is added within the forward function of class NERtagger. 
 
* **Results**: Results are generally much worse (below 0.70) than the pure BiLSTM with tuned hyperparameters. Varying the character embedding size did not help at all. I wonder what went wrong with my convolutional pipeline.
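
Two things in the code above may contribute, stated here as hypotheses. First, `CharConvolution.forward` flattens all the characters of a sentence into one sequence, so convolution windows cross word boundaries and each pooling window covers most of the sentence rather than a single word. Second, `validate` calls `generate_batches(batch_size,False)` and `self.forward(X)`, so the dev score is computed on the word-embedding path, which is never trained in convolutional mode. For the first point, a minimal sketch of a module that convolves and pools within each word is shown below. The class name `WordCharCNN` and all sizes are assumptions, and this is not the module used for the results reported here.

```python
import torch
import torch.nn as nn

class WordCharCNN(nn.Module):
    """Character CNN that convolves and pools within each word separately."""

    def __init__(self, chars_vocab_size, char_emb_size, out_size, window=2, padding_idx=None):
        super().__init__()
        self.embeddings = nn.Embedding(chars_vocab_size, char_emb_size, padding_idx=padding_idx)
        self.conv1d = nn.Conv1d(char_emb_size, out_size, kernel_size=window)

    def forward(self, Xinput):
        batch, seqlen, wordlen = Xinput.shape
        emb = self.embeddings(Xinput)                # batch,seqlen,wordlen,char_emb
        emb = emb.view(batch * seqlen, wordlen, -1)  # one row per word
        emb = emb.transpose(1, 2)                    # batch*seqlen,char_emb,wordlen
        conv = self.conv1d(emb)                      # batch*seqlen,out,wordlen-window+1
        pooled, _ = conv.max(dim=2)                  # max over the positions of one word
        return pooled.view(batch, seqlen, -1)        # batch,seqlen,out

X = torch.randint(0, 30, (4, 7, 12))  # toy batch: 4 sentences, 7 words, 12 characters
module = WordCharCNN(chars_vocab_size=30, char_emb_size=16, out_size=64)
print(module(X).shape)  # torch.Size([4, 7, 64])
```

Treating every word as its own row of the convolution input keeps the windows inside word boundaries, and the max over the last axis yields exactly one vector per word.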
