# Named Entity Recognition on Stack Overflow

Name: Changheng Liou, Zhiying Cui

NetID: cl5533, zc2191

## Introduction

With growing interest in studying text that mixes natural language and source code, named entity recognition (NER) for the computer programming domain has become a promising application.

Although large amounts of programming text are readily available on the Internet, there is still a lack of fundamental natural language processing (NLP) techniques for identifying code tokens or software-related named entities that appear within natural language sentences.
Another challenge for NER in the computer programming domain is that named entities are often ambiguous: it is hard to tell whether a word refers to a technical programming concept or is used in its everyday sense. For example, "list" can refer to a data structure, but it is also used as a variable name. Resolving such mentions often relies implicitly on the accompanying code snippets.

[Tabassum et al.](https://arxiv.org/abs/2005.01634) conducted a comprehensive study in this area. They introduced a new StackOverflow NER corpus for the social computer programming domain, which consists of 15,372 sentences annotated with 20 fine-grained entity types. They also proposed SoftNER, a named entity recognizer that incorporates a context-independent code token classifier with corpus-level features to improve a BERT-based tagging model. Their evaluation showed that SoftNER identifies software-related named entities better than existing models such as fine-tuned BERT and BiLSTM-CRF.

## Goal of This Work

- Identify named entities in software-related texts and classify them into 20 entity types.
- Evaluate the SoftNER model's performance on other sources of technical articles, such as Leetcode.

## Development Environment

- Framework: PyTorch
- Server: Jupyter server on Google Cloud
- Programming environment

In [1]:
import sys

print(sys.executable)
print(sys.version)
print(sys.version_info)

/opt/conda/envs/py36/bin/python
3.6.11 | packaged by conda-forge | (default, Aug  5 2020, 20:09:42) 
[GCC 7.5.0]
sys.version_info(major=3, minor=6, micro=11, releaselevel='final', serial=0)


- Directory of the project

In [2]:
import os

os.chdir('/root/Project/')
!tree -L 1

.
├── data
├── Huggingface_SoftNER
├── leetcode-discuss.txt
├── main.ipynb
├── StackOverflowNER
└── zipFiles

4 directories, 2 files


![dirProject](./notebook_resources/dirProject.png)

- Directory of the SoftNER model

In [3]:
os.chdir('/root/Project/StackOverflowNER')
!tree -L 2

.
├── code
│   ├── Attentive_BiLSTM
│   ├── BERT_NER
│   ├── DataReader
│   ├── Readme.md
│   └── SOTokenizer
├── License
├── Readme.md
└── resources
    ├── annotated_ner_data
    └── pretrained_word_vectors

8 directories, 3 files


![dirSoftNER](./notebook_resources/dirSoftNER.png)

## Annotated StackOverflow Corpus

### Construction of StackOverflow NER corpus

The authors introduced a new StackOverflow NER corpus. They:
- Selected **1,237** question-answer threads from the StackOverflow 10-year archive (from September 2008 to March 2018).
- Manually annotated them with 20 types of entities.
    - 8 code entities: CLASS, VARIABLE, IN LINE CODE, FUNCTION, LIBRARY, VALUE, DATA TYPE, HTML XML TAG
    - 12 natural language entities: APPLICATION, UI ELEMENT, LANGUAGE, DATA STRUCTURE, ALGORITHM, FILE TYPE, FILE NAME, VERSION, DEVICE, OS, WEBSITE, USER NAME
- Relied on human annotators rather than on the code markup supplied by StackOverflow users, because users mark only a small fraction of code-related entities and sometimes enclose text that is not actually code.


### StackOverflow tokenizer - SOTokenizer

The authors implemented a custom tokenizer, `SOTokenizer`, for the social computer programming domain because existing tokenizers, such as the CMU Twokenizer, Stanford TweetTokenizer, and NLTK Twitter tokenizer, all split code tokens incorrectly. For example:

```
txScope.complete() => ["txScope", ".", "complete", "(", ")"]
std::condition_variable => ["std", ":", ":", "condition_variable"]
math.h => ["math", ".", "h"]
<html> => ["<", "html", ">"]
a == b => ["a", "=", "=", "b"]
```
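
The splits listed above can be reproduced with a generic word-and-punctuation tokenizer. The sketch below is illustrative only: the regular expression is a stand-in assumption, not the actual rule set of the CMU, Stanford, or NLTK tokenizers, and `naive_tokenize` is a hypothetical helper.

```python
import re

# Hypothetical stand-in for a general-purpose tokenizer: runs of word
# characters become one token, every other non-space character is its own token.
NAIVE_PATTERN = re.compile(r"\w+|[^\w\s]")

def naive_tokenize(text):
    """Split text into word runs and single punctuation characters."""
    return NAIVE_PATTERN.findall(text)

for snippet in ["txScope.complete()", "std::condition_variable", "math.h", "<html>", "a == b"]:
    print(snippet, "=>", naive_tokenize(snippet))
```

Because the pattern knows nothing about operators such as `::` or `==`, it breaks every code token at its punctuation, which is the behavior a code-aware tokenizer has to avoid.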

Tokenizing sentences with `SOTokenizer` gives the following results:

In [4]:
""" SOTokenizer """

os.chdir('/root/Project/StackOverflowNER/code/SOTokenizer')

import stokenizer

# example 1 - code snippets
sentence = 'std::condition_variable'
tokens = stokenizer.tokenize(sentence)
print(sentence, "\ntokens: ", tokens, '\n')

# example 2 - sentences from StackOverflow
sentence = 'I do think that the request I send to my API should be more like {post=>{"kind"=>"GGG"}} and not {"kind"=>"GGG"}.'
tokens = stokenizer.tokenize(sentence)
print(sentence, "\ntokens: ", tokens, '\n')

# example 3 - sentences from Leetcode (markdown format)
sentence = '**Basic idea:** If we start from ```sx,sy```, it will be hard to find ```tx, ty```. If we start from ```tx,ty```, we can find only one path to go back to ```sx, sy```. I cut down one by one at first and I got TLE. So I came up with remainder.  **First line:** if 2 target points are still bigger than 2 starting point, we reduce target points. **Second line:** check if we reduce target points to (x, y+kx) or (x+ky, y)  **Time complexity** I will say ```O(logN)``` where ```N = max(tx,ty)```.  **C++:** ```cpp     bool reachingPoints(int sx, int sy, int tx, int ty) {         while (sx < tx && sy < ty)             if (tx < ty) ty %= tx;             else tx %= ty;         return sx == tx && sy <= ty && (ty - sy) % sx == 0 ||                sy == ty && sx <= tx && (tx - sx) % sy == 0;     }'
tokens = stokenizer.tokenize(sentence)
print(sentence, "\ntokens: ", tokens, '\n')


std::condition_variable 
tokens:  ['std::condition_variable'] 

I do think that the request I send to my API should be more like {post=>{"kind"=>"GGG"}} and not {"kind"=>"GGG"}. 
tokens:  ['I', 'do', 'think', 'that', 'the', 'request', 'I', 'send', 'to', 'my', 'API', 'should', 'be', 'more', 'like', ' { post=> { "kind"=>"GGG" }  } ', 'and', 'not', ' { "kind"=>"GGG" } ', '.'] 

**Basic idea:** If we start from ```sx,sy```, it will be hard to find ```tx, ty```. If we start from ```tx,ty```, we can find only one path to go back to ```sx, sy```. I cut down one by one at first and I got TLE. So I came up with remainder.  **First line:** if 2 target points are still bigger than 2 starting point, we reduce target points. **Second line:** check if we reduce target points to (x, y+kx) or (x+ky, y)  **Time complexity** I will say ```O(logN)``` where ```N = max(tx,ty)```.  **C++:** ```cpp     bool reachingPoints(int sx, int sy, int tx, int ty) {         while (sx < tx && sy < ty)             if (tx

### Load annotated files

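The authors provide a function for reading the annotated dataset.
- Read the dataset using `loader_so.py` from `DataReader`.
- By default, the `loader_so_text` function merges the fine-grained labels below into coarser ones: the six function, class, and variable labels collapse into three, and "Organization" is folded into "Website".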

```
"Library_Function" -> "Function"
"Function_Name" -> "Function"

"Class_Name" -> "Class"
"Library_Class" -> "Class"

"Library_Variable" -> "Variable"
"Variable_Name" -> "Variable"

"Website" -> "Website"
"Organization" -> "Website"
```

Argument settings:
- `merge_tag=False`: skip the merging step.
- `replace_low_freq_tags=False`: skip the conversion step. By default, the `loader_so_text` function converts the 5 low-frequency entities to "O".
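
As a rough illustration of what merging and low-frequency conversion do to a single tag, consider the sketch below. It is not the code from `loader_so.py`: `MERGE_MAP` copies the mapping listed above, `normalize_tag` is a hypothetical helper, the `B-`/`I-` tag format is an assumption, and the set of low-frequency labels is left to the caller because the real list of five is not shown here.

```python
# Mapping copied from the list above (fine-grained label -> merged label).
MERGE_MAP = {
    "Library_Function": "Function",
    "Function_Name": "Function",
    "Class_Name": "Class",
    "Library_Class": "Class",
    "Library_Variable": "Variable",
    "Variable_Name": "Variable",
    "Organization": "Website",
}

def normalize_tag(tag, merge_tag=True, low_freq_labels=frozenset()):
    """Apply label merging and low-frequency conversion to one BIO tag."""
    if tag == "O":
        return tag
    prefix, label = tag.split("-", 1)      # e.g. "B-Library_Function"
    if label in low_freq_labels:           # low-frequency entities become "O"
        return "O"
    if merge_tag:
        label = MERGE_MAP.get(label, label)
    return f"{prefix}-{label}"

print(normalize_tag("B-Library_Function"))
print(normalize_tag("B-Library_Function", merge_tag=False))
```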

In [5]:
""" Load annotated data """

os.chdir('/root/Project/StackOverflowNER/code/DataReader')

import loader_so

# dir of dataset
path_to_file = "../../resources/annotated_ner_data/StackOverflow/train.txt"

# merge entities (default)
all_sentences = loader_so.loader_so_text(path_to_file)

# skip merging
all_sentences_no_merge = loader_so.loader_so_text(path_to_file, merge_tag=False)

# skip conversion
all_sentences_no_conversion = loader_so.loader_so_text(path_to_file, replace_low_freq_tags= False)

print("\nTraining dataset (preview first 5 lines):")
for i in range(5):
    print(all_sentences[i], '\n')
    print(all_sentences_no_merge[i], '\n')
    print(all_sentences_no_conversion[i], '\n')
    print('\n')


------------------------------------------------------------
Number of questions in  ../../resources/annotated_ner_data/StackOverflow/train_merged_labels.txt  :  741
Number of answers in  ../../resources/annotated_ner_data/StackOverflow/train_merged_labels.txt  :  897
Number of sentences in  ../../resources/annotated_ner_data/StackOverflow/train_merged_labels.txt  :  9263
Max len sentences has 92 words
------------------------------------------------------------
------------------------------------------------------------
Number of questions in  ../../resources/annotated_ner_data/StackOverflow/train_merged_labels.txt  :  741
Number of answers in  ../../resources/annotated_ner_data/StackOverflow/train_merged_labels.txt  :  897
Number of sentences in  ../../resources/annotated_ner_data/StackOverflow/train_merged_labels.txt  :  9263
Max len sentences has 92 words
------------------------------------------------------------
------------------------------------------------------------
Numbe

## SoftNER Model Architecture

1. **Input embedding layer**: Extract contextualized embeddings from the BERT model and two new domain-specific embeddings for each word in the input sentence.
2. **Embedding attention layer**: Combine the three word embeddings using an attention network.
3. **Linear-CRF layer**: Predict the entity type of each word using the attentive word representations from the previous layer.

![modelArchitecture](./notebook_resources/modelArchitecture.png)

The SoftNER model and the two auxiliary models are trained separately. The two standalone modules can be trained by following the instructions in [Running BERT NER Model](https://github.com/jeniyat/StackOverflowNER/tree/master/code#running-bert-ner-model). Note that the modified version of the transformers library is available [here](https://github.com/jeniyat/Huggingface_SoftNER).

### Input embeddings

- **BERT**, which transforms a token into contextual vector representation.
- **Code Recognizer**, which represents if a word can be part of a code entity regardless of context.
- **Entity Segmenter**, which predicts whether a word is part of any named entity in the given sentence.

#### In-domain Word Embeddings (BERT)

Text from Wikipedia is not well suited to the computer programming context. The authors therefore pre-trained three in-domain word embeddings on 152 million sentences from the [StackOverflow 10-year archive](https://archive.org/details/stackexchange): BERT (BERTOverflow), ELMo (ELMoVerflow), and GloVe (GloVerflow).
Their results showed that the NER model with BERT identified entities better than the models with the other two embeddings, so we chose BERT as our in-domain word embedding.

The pretrained BERT is saved in `/root/Project/StackOverflowNER/resources/pretrained_word_vectors/BERT/`

<!-- BERT has 12 layers, hidden size 768, self-attention heads 12, total parameters 110M -->

#### Context-independent Code Recognition (Code Recognizer)

The code recognition module estimates how likely a word is to be a code token without considering any contextual information. The Code Recognizer is a binary classifier, implemented as follows:

- Input features: unigram word and 6-gram character probabilities from two language models, one trained on the Gigaword corpus and one trained on all the code snippets in the StackOverflow corpus.
- Transform each n-gram probability into a k-dimensional vector using Gaussian binning, which can improve the performance of neural models that use numeric features.
- Feed the vectorized features into a linear layer and concatenate the output with pretrained FastText character-level embeddings.
- Pass the result through another hidden layer with sigmoid activation, and classify the word as a code token if the output probability is greater than 0.5.
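
The Gaussian binning step can be sketched as follows. This is one common formulation, written under assumptions: the value range, the number of bins, the use of the bin spacing as the Gaussian width, and the final normalization are illustrative choices, and `gaussian_binning` is a hypothetical helper rather than the repository's implementation.

```python
import math

def gaussian_binning(value, low=0.0, high=1.0, num_bins=10):
    """Map a scalar feature to a num_bins-dimensional soft one-hot vector."""
    step = (high - low) / num_bins
    centers = [low + (j + 0.5) * step for j in range(num_bins)]
    value = min(max(value, low), high)     # clamp so at least one bin is close
    sigma = step                           # assumed width: one bin spacing
    weights = [math.exp(-((value - c) ** 2) / (2 * sigma ** 2)) for c in centers]
    total = sum(weights)
    return [w / total for w in weights]    # normalize so the entries sum to 1

vector = gaussian_binning(0.25, num_bins=4)
print([round(v, 3) for v in vector])
```

Compared with feeding the raw probability into the network, the binned vector gives nearby values similar representations while still letting the model treat different ranges of the feature differently.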

The Code Recognizer lives in `/root/Project/StackOverflowNER/code/BERT_NER/`. Because the source code is long, we show only the training function here to illustrate the implementation, and we ran the training process on the server.

**Parameters**
- `LR=0.0015`: learning rate 
- `epochs=70`: epochs
- `word_dim=300`: embedding dimension, the same as weights dimension in fastText
- `hidden_layer_1_dim=30`: dimension of hidden layer

In [None]:
""" Function of training Code Recognizer """

def train_ctc_model(train_file, test_file):
    
    # NOTE: the arguments are overwritten with the default paths from the configuration
    train_file = parameters_ctc['train_file']
    test_file = parameters_ctc['test_file']
    
    # extract features from the two language models trained on Gigaword and StackOverflow
    features = Features(RESOURCES)
    train_tokens, train_features, train_labels = features.get_features(train_file, True)
    test_tokens, test_features, test_labels = features.get_features(test_file, False)
    
    # build the vocabulary and load the pretrained fastText embeddings
    vocab_size, word_to_id, id_to_word, word_to_vec = get_word_dict_pre_embeds(train_file, test_file)
    train_ids, test_ids = get_train_test_word_id(train_file, test_file,  word_to_id)
    
    # initialize the embedding matrix randomly, then copy in the fastText vector of every known word
    word_embeds = np.random.uniform(-np.sqrt(0.06), np.sqrt(0.06), (vocab_size, parameters_ctc['word_dim']))
    for word in word_to_vec:
        word_embeds[word_to_id[word]]=word_to_vec[word]
    
    # build the classifier from the feature size, number of classes, vocabulary size and embeddings
    ctc_classifier = NeuralClassifier(len(train_features[0]), max(train_labels) + 1, vocab_size, word_embeds)
    ctc_classifier.to(device)
    
    # Adam optimizer and a step-decay learning-rate scheduler
    optimizer = torch.optim.Adam(ctc_classifier.parameters(), lr=parameters_ctc["LR"])
    step_lr_scheduler = lr_scheduler.StepLR(optimizer, step_size=5, gamma=0.8)
    
    # prepare dataset
    train_x = Variable(torch.FloatTensor(train_features).to(device))
    train_x_words = Variable(torch.LongTensor(train_ids).to(device))
    train_y = Variable(torch.LongTensor(train_labels).to(device))

    test_x = Variable(torch.FloatTensor(test_features).to(device))
    test_x_words = Variable(torch.LongTensor(test_ids).to(device))
    test_y = Variable(torch.LongTensor(test_labels).to(device))

    # training
    for epoch in range(parameters_ctc['epochs']):
        loss = ctc_classifier.CrossEntropy(train_features, train_x_words, train_y)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        train_scores,  train_preds = ctc_classifier(train_features, train_x_words)
        test_scores, test_preds = ctc_classifier(test_features, test_x_words)
        
        eval(test_preds, test_labels, "test")

    return ctc_classifier, vocab_size, word_to_id, id_to_word, word_to_vec, features

In [6]:
""" Train Code Recognizer """

os.chdir('/root/Project/StackOverflowNER/code/BERT_NER/')

from utils_ctc import *
from utils_ctc.prediction_ctc import *

# training dataset & test dataset
train_file = parameters_ctc['train_file']
test_file = parameters_ctc['test_file']

# train the Code Recognizer
ctc_classifier, vocab_size, word_to_id, id_to_word, word_to_vec, features = train_ctc_model(train_file, test_file)




[0.428365932464601, 0.19701098799705508, 0.29119153594970726, 0.01899447679519639, 0.002]
-------------------- test --------------------
P:  54.4444  R: 18.5606  F:  27.6836
              precision    recall  f1-score   support

           0       0.76      0.94      0.84       736
           1       0.54      0.19      0.28       264

    accuracy                           0.74      1000
   macro avg       0.65      0.56      0.56      1000
weighted avg       0.71      0.74      0.69      1000

-------------------------------------------------
-------------------- test --------------------
P:  89.2857  R: 9.4697  F:  17.1233
              precision    recall  f1-score   support

           0       0.75      1.00      0.86       736
           1       0.89      0.09      0.17       264

    accuracy                           0.76      1000
   macro avg       0.82      0.55      0.51      1000
weighted avg       0.79      0.76      0.68      1000

---------------------------------------

-------------------- test --------------------
P:  73.7624  R: 56.4394  F:  63.9485
              precision    recall  f1-score   support

           0       0.86      0.93      0.89       736
           1       0.74      0.56      0.64       264

    accuracy                           0.83      1000
   macro avg       0.80      0.75      0.76      1000
weighted avg       0.82      0.83      0.82      1000

-------------------------------------------------
-------------------- test --------------------
P:  71.875  R: 60.9848  F:  65.9836
              precision    recall  f1-score   support

           0       0.87      0.91      0.89       736
           1       0.72      0.61      0.66       264

    accuracy                           0.83      1000
   macro avg       0.79      0.76      0.78      1000
weighted avg       0.83      0.83      0.83      1000

-------------------------------------------------
-------------------- test --------------------
P:  72.0833  R: 65.5303  F:  68.

-------------------- test --------------------
P:  72.5086  R: 79.9242  F:  76.036
              precision    recall  f1-score   support

           0       0.93      0.89      0.91       736
           1       0.73      0.80      0.76       264

    accuracy                           0.87      1000
   macro avg       0.83      0.85      0.83      1000
weighted avg       0.87      0.87      0.87      1000

-------------------------------------------------
-------------------- test --------------------
P:  73.2639  R: 79.9242  F:  76.4493
              precision    recall  f1-score   support

           0       0.93      0.90      0.91       736
           1       0.73      0.80      0.76       264

    accuracy                           0.87      1000
   macro avg       0.83      0.85      0.84      1000
weighted avg       0.87      0.87      0.87      1000

-------------------------------------------------
-------------------- test --------------------
P:  73.6842  R: 79.5455  F:  76.

-------------------- test --------------------
P:  77.3381  R: 81.4394  F:  79.3358
              precision    recall  f1-score   support

           0       0.93      0.91      0.92       736
           1       0.77      0.81      0.79       264

    accuracy                           0.89      1000
   macro avg       0.85      0.86      0.86      1000
weighted avg       0.89      0.89      0.89      1000

-------------------------------------------------
-------------------- test --------------------
P:  77.3381  R: 81.4394  F:  79.3358
              precision    recall  f1-score   support

           0       0.93      0.91      0.92       736
           1       0.77      0.81      0.79       264

    accuracy                           0.89      1000
   macro avg       0.85      0.86      0.86      1000
weighted avg       0.89      0.89      0.89      1000

-------------------------------------------------
-------------------- test --------------------
P:  77.4194  R: 81.8182  F:  79

In the training results above, our Code Recognizer reaches a precision of 78.44, a recall of 79.92, and an $F_1$ score of 79.14. The paper reports a precision of 78.43, a recall of 83.33, and an $F_1$ score of 80.80. Our precision matches the paper, while our recall and $F_1$ score are lower by about 3.4 and 1.7 points, so our results agree reasonably well with the authors'.
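
To make the logged metrics easier to read, the sketch below recomputes precision, recall, and $F_1$ for the code class from raw counts. The counts (215 true positives, 63 false positives, 49 false negatives) are not printed in the log; they are back-computed here from the logged line `P:  77.3381  R: 81.4394  F:  79.3358` and the 264 positive test tokens, so treat them as an inference. `precision_recall_f1` is a hypothetical helper.

```python
def precision_recall_f1(tp, fp, fn):
    """Return precision, recall, and F1 (as percentages) for one class."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return 100 * precision, 100 * recall, 100 * f1

# Counts inferred from the log above, not reported by the training script.
p, r, f = precision_recall_f1(tp=215, fp=63, fn=49)
print(f"P: {p:.4f}  R: {r:.4f}  F: {f:.4f}")
```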

![paperCodeRecogizer](./notebook_resources/paperCodeRecogizer.png)


#### Entity segmentation (Entity Segmenter)

The segmentation task refers to identifying entity spans without assigning entity categories. The Entity Segmenter model trained on the annotated StackOverflow corpus achieves 90.41% precision on the validation dataset. The Entity Segmenter concatenates the BERT embeddings with two hand-crafted features:

- **Word frequency**, which represents the word occurrence count in the training set. In the given StackOverflow corpus, code and non-code tokens have average frequencies of 1.47 and 7.41, respectively. Ambiguous tokens that can be either code or non-code entities, such as "windows", have a much higher average frequency of 92.57.

- **Code markdown**, which indicates whether the given token appears inside a `⟨code⟩` markdown in the StackOverflow post. This signal is noisy because users do not always enclose inline code in a `⟨code⟩` tag and sometimes use the tag to highlight non-code text.


The segmentation model follows the simple BERT fine-tuning architecture except for the input, where BERT embeddings are concatenated with 100-dimensional code markdown and 10-dimensional word frequency features. 
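
The word frequency feature can be sketched with a plain counter over the training tokens. The tiny corpus below is made up for illustration, and `build_frequency_table` is a hypothetical helper; how the repository turns the resulting counts into a 10-dimensional feature is not shown here.

```python
from collections import Counter

def build_frequency_table(sentences):
    """Count how often each token occurs in the tokenized training sentences."""
    counts = Counter()
    for tokens in sentences:
        counts.update(tokens)
    return counts

# Made-up training sentences, already tokenized.
train_sentences = [
    ["open", "the", "file", "on", "windows"],
    ["windows", "closes", "the", "socket"],
]
freq = build_frequency_table(train_sentences)
print(freq["windows"], freq["the"], freq["unseen_token"])
```

A `Counter` returns 0 for tokens it has never seen, which is a convenient default for words that appear only at test time.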

**Note that** some essential files (such as word frequency) and the pretrained model are missing in the folder provided by the authors.

### Embedding-Level Attention

For each word $w_i$, there are 3 embeddings 
- BERT ($w_{i1}$)
- Code recognizer ($w_{i2}$)
- Entity Segmenter ($w_{i3}$)

The embedding-level attention $\alpha_{it}$ ($t \in \{1, 2, 3\}$) captures each embedding's contribution to the meaning of the word. The parameters that produce it ($W_e$, $b_e$, and $u_e$) are learned during training.

To compute $\alpha_{it}$, we pass the input embeddings through a bidirectional GRU and generate their corresponding hidden representations $h_{it} = \overleftrightarrow{\mathrm{GRU}}(w_{it})$.

These vectors are then passed through a non-linear layer, which outputs $u_{it} = \tanh(W_e h_{it} + b_e)$.

The embedding-level context vector $u_e$ is randomly initialized and updated during the training process. It is combined with the projected hidden representations $u_{it}$ using a softmax function to obtain the weight of each embedding:
$$
  \alpha_{it} = \frac{\exp(u_{it}^T u_e)}{\sum_{t'} \exp(u_{it'}^T u_e)}
$$

Finally, we create the word vector as a weighted sum of the information from the different embeddings:
$$
  \mathrm{word}_i = \sum_t \alpha_{it} h_{it}
$$
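
As a small worked example of the two formulas above, the sketch below computes the attention weights and the pooled word vector for three toy embeddings using only the standard library. The vectors are invented, and `attention_pool` is a hypothetical helper that takes the already-computed $h_{it}$ and $u_{it}$ as inputs.

```python
import math

def attention_pool(hidden, projected, context):
    """Return (alphas, word_vector) for one word.

    hidden:    list of h_it vectors, one per embedding
    projected: list of u_it vectors, one per embedding
    context:   the context vector u_e
    """
    scores = [sum(u * c for u, c in zip(u_it, context)) for u_it in projected]
    top = max(scores)                                  # subtract max for stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    alphas = [e / total for e in exps]                 # softmax over the embeddings
    dim = len(hidden[0])
    word = [sum(a * h[k] for a, h in zip(alphas, hidden)) for k in range(dim)]
    return alphas, word

hidden = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
projected = [[0.5, 0.5], [0.5, 0.5], [0.5, 0.5]]       # equal scores -> equal weights
alphas, word = attention_pool(hidden, projected, context=[1.0, 1.0])
print(alphas, word)
```

With identical scores the three embeddings receive equal weight, so the word vector is simply their average; during training the learned parameters shift these weights toward the more informative embedding.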


![embeding](./notebook_resources/embeddingLevel.png)


The result is then fed into a linear-CRF layer, which predicts the entity category for each word based on the BIO tagging scheme.
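
For readers unfamiliar with BIO tagging, the sketch below converts entity spans into per-token tags. The sentence and its annotations are invented for illustration, and `spans_to_bio` is a hypothetical helper, not part of the repository.

```python
def spans_to_bio(num_tokens, spans):
    """Convert (start, end, label) spans, end exclusive, into BIO tags."""
    tags = ["O"] * num_tokens
    for start, end, label in spans:
        tags[start] = f"B-{label}"             # first token of the entity
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"             # remaining tokens of the entity
    return tags

tokens = ["I", "use", "Visual", "Studio", "on", "Windows"]
spans = [(2, 4, "Application"), (5, 6, "OS")]  # invented annotations
for token, tag in zip(tokens, spans_to_bio(len(tokens), spans)):
    print(token, tag)
```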

In [None]:
import os

os.chdir('/root/Project/StackOverflowNER/code/Attentive_BiLSTM')

import torch
import torch.autograd as autograd
from torch.autograd import Variable

import utils_so as utils 
from utils_so import *

from config_so import parameters

torch.backends.cudnn.deterministic = True
torch.manual_seed(parameters["seed"])

from allennlp.modules.elmo import Elmo, batch_to_ids
from allennlp.commands.elmo import ElmoEmbedder

from HAN import *

In [None]:
from utils_so import *

class Embedded_Attention(nn.Module):
  def __init__(self):
    super(Embedded_Attention, self).__init__()
    
    self.max_len = 3
    self.input_dim = 1824
    self.hidden_dim = 150
    self.bidirectional = True
    self.drop_out_rate = 0.5 
    self.context_vector_size = [parameters['embedding_context_vecotr_size'], 1]
    self.drop = nn.Dropout(p=self.drop_out_rate)
    self.word_GRU = nn.GRU(input_size=self.input_dim, 
                           hidden_size=self.hidden_dim,
                           bidirectional=self.bidirectional,
                           batch_first=True)
    self.w_proj = nn.Linear(in_features=2*self.hidden_dim ,out_features=2*self.hidden_dim)
    self.w_context_vector = nn.Parameter(torch.randn(self.context_vector_size).float())
    self.softmax = nn.Softmax(dim=1)
    init_gru(self.word_GRU)

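  def forward(self,x):  
    # x is expected to be [batch, max_len, input_dim] (batch_first=True)
    x, _ = self.word_GRU(x)                  # h_it: bidirectional GRU states
    Hw = torch.tanh(self.w_proj(x))          # u_it = tanh(W_e h_it + b_e)
    # alpha_it: softmax of u_it^T u_e over the embedding axis (dim=1)
    w_score = self.softmax(Hw.matmul(self.w_context_vector))
    x = x.mul(w_score)                       # weight each h_it by alpha_it
    x = torch.sum(x, dim=1)                  # word_i = sum_t alpha_it h_it
    return x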

### Current configuration
- `LR`: learning rate
- `epochs`: number of epochs to train
- `lower`: lowercase all inputs
- `zeros`: replace all digits by 0
- `char_dim`: Character embedding dimension
- `char_lstm_dim`: Char LSTM hidden layer size
- `char_bidirect`: Use bidirectional LSTM for chars
- `word_dim`: Token embedding dimension
- `word_lstm_dim`: Token LSTM hidden layer size
- `word_bidirect`: Use bidirectional LSTM for words
- `dropout`: Dropout on the input
- `char_mode`: char_CNN or char_LSTM
- `use_elmo`: whether or not to use ELMo
- `use_elmo_w_char`: whether or not to use ELMo with char embeds
- `use_freq_vector`: whether or not to use the word frequency
- `freq_mapper_bin_count`: how many bins to use in Gaussian binning of the frequency vector
- `freq_mapper_bin_width`: the width of each bin for the Gaussian binning of the frequency vector
- `use_segmentation_vector`: whether or not to use the code_pred vector
- `use_han`: whether or not to use the Hierarchical Attention Network


In [None]:
for k in [
    'LR', 'epochs', 'lower', 'zeros', 
    'char_dim', 'char_lstm_dim', 'char_bidirect', 'word_dim', 
    'word_lstm_dim', 'word_bidirect', 'dropout', 'char_mode', 
    'use_elmo', 'use_elmo_w_char', 'use_freq_vector', 'freq_mapper_bin_count', 
    'freq_mapper_bin_width', 'use_segmentation_vector', 'use_han']:
    print(k, parameters[k])

Two special tags, `<START>` and `<STOP>`, are defined to mark the beginning and end of each tag sequence.

In [None]:
START_TAG = '<START>'
STOP_TAG = '<STOP>'

Helper functions for convenience

In [None]:
def to_scalar(var):
    return var.view(-1).data.tolist()[0]

def argmax(vec):
    _, idx = torch.max(vec, 1)
    return to_scalar(idx)

def prepare_sequence(seq, to_ix):
    idxs = [to_ix[w] for w in seq]
    tensor = torch.LongTensor(idxs)
    return Variable(tensor)

def log_sum_exp(vec):
    # vec 2D: 1 * tagset_size
    max_score = vec[0, argmax(vec)]
    max_score_broadcast = max_score.view(1, -1).expand(1, vec.size()[1])
    return max_score + \
        torch.log(torch.sum(torch.exp(vec - max_score_broadcast)))
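
The `log_sum_exp` helper uses the standard max-subtraction trick to avoid overflow in `exp`. The standard-library sketch below shows the same idea on a plain list; `stable_log_sum_exp` is a hypothetical name used only for this illustration.

```python
import math

def stable_log_sum_exp(values):
    """Compute log(sum(exp(v))) without overflowing for large v."""
    top = max(values)
    return top + math.log(sum(math.exp(v - top) for v in values))

# A naive math.exp(1000.0) would raise OverflowError; the shifted form is fine.
print(stable_log_sum_exp([1000.0, 1000.0]))
```

Subtracting the maximum leaves the result unchanged mathematically, because the factor `exp(top)` is pulled out of the sum and added back as `top` after the logarithm.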

In [None]:
def _align_word(input_matrix, word_pos_list=[1]):
    """
        To change presentation of the batch.
    Args:
        input_matrix (autograd.Variable):  The presentation of the word in the sentence. [sentence_num, sentence_embedding]
        word_pos_list (list): The list contains the position of the word in the current sentence.
        
    Returns:
        new_matrix (torch.FloatTensor): The aligned matrix, and its each row is one sentence in the passage.
                                        [passage_num, max_len, embedding_size]
    """
    embedding_size = input_matrix.shape[-1]      # To get the embedding size of the sentence
    number_of_words = len(word_pos_list)                  # To get the number of the sentences
    max_len=1
    new_matrix = autograd.Variable(torch.zeros(number_of_words, max_len, embedding_size))
    init_index = 0
    for index, length in enumerate(word_pos_list):
        end_index = init_index + length
        # temp_matrix
        temp_matrix = input_matrix[init_index:end_index, :]      # To get one passage sentence embedding.
        if temp_matrix.shape[0] > max_len:
            temp_matrix = temp_matrix[:max_len]
        new_matrix[index, -length:, :] = temp_matrix
        # advance init_index to the start of the next chunk of the input matrix
        init_index = end_index
    return new_matrix

In [None]:
class BiLSTM_CRF(nn.Module):
    def __init__(self, vocab_size, tag_to_ix, embedding_dim, freq_embed_dim, 
                 seg_pred_embed_dim, hidden_dim, char_lstm_dim=25,
                 char_to_ix=None, pre_word_embeds=None, word_freq_embeds=None, word_ner_pred_embeds=None, 
                 word_seg_pred_embeds=None, word_markdown_embeds=None, word_ctc_pred_embeds=None, char_embedding_dim=25, use_gpu=False,
                 n_cap=None, cap_embedding_dim=None, use_crf=True, char_mode='CNN'):
        super(BiLSTM_CRF, self).__init__()
        self.use_gpu = use_gpu
        self.embedding_dim = embedding_dim
        self.hidden_dim = hidden_dim
        self.vocab_size = vocab_size
        self.tag_to_ix = tag_to_ix
        self.n_cap = n_cap
        self.cap_embedding_dim = cap_embedding_dim
        self.use_crf = use_crf
        self.tagset_size = len(tag_to_ix)
        self.out_channels = char_lstm_dim
        self.char_mode = char_mode
        self.embed_attn = Embedded_Attention()
        self.word_attn = Word_Attn()        
        self.use_elmo = parameters['use_elmo']

        if self.use_elmo:
            options_file = parameters["elmo_options"]
            weight_file = parameters["elmo_weight"]

            self.elmo = Elmo(options_file, weight_file, 2, dropout=0)
            self.elmo_2 = ElmoEmbedder(options_file, weight_file)
        
        print('char_mode: %s, out_channels: %d, hidden_dim: %d, ' % (char_mode, char_lstm_dim, hidden_dim))
        if parameters['use_han']:
            self.lstm=nn.LSTM(300, hidden_dim, bidirectional=True)
        else:
            self.lstm=nn.LSTM(1824, hidden_dim, bidirectional=True)
        
        if self.use_elmo:
            
            if parameters['use_elmo_w_char']:
                if self.n_cap and self.cap_embedding_dim:
                    self.cap_embeds = nn.Embedding(self.n_cap, self.cap_embedding_dim)
                    init_embedding(self.cap_embeds.weight)

                if char_embedding_dim is not None:
                    self.char_lstm_dim = char_lstm_dim
                    self.char_embeds = nn.Embedding(len(char_to_ix), char_embedding_dim)
                    init_embedding(self.char_embeds.weight)

                    if self.char_mode == 'LSTM':
                        self.char_lstm = nn.LSTM(char_embedding_dim, char_lstm_dim, num_layers=1, bidirectional=True)
                        init_lstm(self.char_lstm)
                    if self.char_mode == 'CNN':
                        self.char_cnn3 = nn.Conv2d(in_channels=1, out_channels=self.out_channels, kernel_size=(3, char_embedding_dim), padding=(2,0))
        else:
            if self.n_cap and self.cap_embedding_dim:
                self.cap_embeds = nn.Embedding(self.n_cap, self.cap_embedding_dim)
                init_embedding(self.cap_embeds.weight)

            if char_embedding_dim is not None:
                self.char_lstm_dim = char_lstm_dim
                self.char_embeds = nn.Embedding(len(char_to_ix), char_embedding_dim)
                init_embedding(self.char_embeds.weight)
                
                if self.char_mode == 'LSTM':
                    self.char_lstm = nn.LSTM(char_embedding_dim, char_lstm_dim, num_layers=1, bidirectional=True)
                    init_lstm(self.char_lstm)
                if self.char_mode == 'CNN':
                    self.char_cnn3 = nn.Conv2d(in_channels=1, out_channels=self.out_channels, kernel_size=(3, char_embedding_dim), padding=(2,0))

            self.word_embeds = nn.Embedding(vocab_size, embedding_dim)
            if pre_word_embeds is not None:
                self.pre_word_embeds = True
                self.word_embeds.weight = nn.Parameter(torch.FloatTensor(pre_word_embeds))
            else:
                self.pre_word_embeds = False
        
        self.dropout = nn.Dropout(parameters['dropout'])

        #----------adding frequency embedding--------
        self.freq_embeds = nn.Embedding(vocab_size, freq_embed_dim)
        if word_freq_embeds is not None:
            self.word_freq_embeds = True
            self.freq_embeds.weight = nn.Parameter(torch.FloatTensor(word_freq_embeds))
        else:
            self.word_freq_embeds = False
        
        #----------adding segmentation embedding--------
        self.seg_embeds = nn.Embedding(parameters['segmentation_count'], parameters['segmentation_dim'])
        if word_seg_pred_embeds is not None:
            self.use_seg_pred_embed = True
            self.seg_embeds.weight = nn.Parameter(torch.FloatTensor(word_seg_pred_embeds))
        else:
            self.use_seg_pred_embed = False
        
        #----------adding ctc prediction embedding--------
        self.ctc_pred_embeds = nn.Embedding(parameters['code_recognizer_count'], parameters['code_recognizer_dim'])
        if word_ctc_pred_embeds is not None:
            self.use_ctc_pred_embed = True
            self.ctc_pred_embeds.weight = nn.Parameter(torch.FloatTensor(word_ctc_pred_embeds))
        else:
            self.use_ctc_pred_embed = False
        
        init_lstm(self.lstm)
        # init_lstm(self.lstm_2)
        self.hw_trans = nn.Linear(self.out_channels, self.out_channels)
        self.hw_gate = nn.Linear(self.out_channels, self.out_channels)
        self.h2_h1 = nn.Linear(hidden_dim*2, hidden_dim)
        self.tanh = nn.Tanh()
        self.hidden2tag = nn.Linear(hidden_dim*2, self.tagset_size)
        init_linear(self.h2_h1)
        init_linear(self.hidden2tag)
        init_linear(self.hw_gate)
        init_linear(self.hw_trans)

        if self.use_crf:
            self.transitions = nn.Parameter(
                torch.zeros(self.tagset_size, self.tagset_size))
            self.transitions.data[tag_to_ix[START_TAG], :] = -10000
            self.transitions.data[:, tag_to_ix[STOP_TAG]] = -10000
    
    def _score_sentence(self, feats, tags):
        # tags is ground_truth, a list of ints, length is len(sentence)
        # feats is a 2D tensor, len(sentence) * tagset_size
        r = torch.LongTensor(range(feats.size()[0]))
        if self.use_gpu:
            r = r.cuda()
            pad_start_tags = torch.cat([torch.cuda.LongTensor([self.tag_to_ix[START_TAG]]), tags])
            pad_stop_tags = torch.cat([tags, torch.cuda.LongTensor([self.tag_to_ix[STOP_TAG]])])
        else:
            pad_start_tags = torch.cat([torch.LongTensor([self.tag_to_ix[START_TAG]]), tags])
            pad_stop_tags = torch.cat([tags, torch.LongTensor([self.tag_to_ix[STOP_TAG]])])

        score = torch.sum(self.transitions[pad_stop_tags, pad_start_tags]) + torch.sum(feats[r, tags])

        return score    

    def get_char_embedding(self, sentence, chars2, caps, chars2_length, d):
        if self.char_mode == 'LSTM':
            chars_embeds = self.char_embeds(chars2).transpose(0, 1)
            packed = torch.nn.utils.rnn.pack_padded_sequence(chars_embeds, chars2_length)
            lstm_out, _ = self.char_lstm(packed)
            outputs, output_lengths = torch.nn.utils.rnn.pad_packed_sequence(lstm_out)
            outputs = outputs.transpose(0, 1)
            chars_embeds_temp = Variable(torch.FloatTensor(torch.zeros((outputs.size(0), outputs.size(2)))))
            if self.use_gpu:
                chars_embeds_temp = chars_embeds_temp.cuda()
            for i, index in enumerate(output_lengths):
                chars_embeds_temp[i] = torch.cat((outputs[i, index-1, :self.char_lstm_dim], outputs[i, 0, self.char_lstm_dim:]))
            chars_embeds = chars_embeds_temp.clone()
            for i in range(chars_embeds.size(0)):
                chars_embeds[d[i]] = chars_embeds_temp[i]

        if self.char_mode == 'CNN':
            chars_embeds = self.char_embeds(chars2).unsqueeze(1)
            chars_cnn_out3 = nn.functional.relu(self.char_cnn3(chars_embeds))   
            chars_embeds = nn.functional.max_pool2d(
                chars_cnn_out3,
                kernel_size=(chars_cnn_out3.size(2), 1)
            ).view(chars_cnn_out3.size(0), self.out_channels)
        return chars_embeds

    def _get_lstm_features_w_elmo_and_char(self, sentence_words, sentence,markdown, chars2, caps, chars2_length, d):
        if self.char_mode == 'LSTM':
            # self.char_lstm_hidden = self.init_lstm_hidden(dim=self.char_lstm_dim, bidirection=True, batchsize=chars2.size(0))
            chars_embeds = self.char_embeds(chars2).transpose(0, 1)
            packed = torch.nn.utils.rnn.pack_padded_sequence(chars_embeds, chars2_length)
            lstm_out, _ = self.char_lstm(packed)
            outputs, output_lengths = torch.nn.utils.rnn.pad_packed_sequence(lstm_out)
            outputs = outputs.transpose(0, 1)
            chars_embeds_temp = Variable(torch.FloatTensor(torch.zeros((outputs.size(0), outputs.size(2)))))
            if self.use_gpu:
                chars_embeds_temp = chars_embeds_temp.cuda()
            for i, index in enumerate(output_lengths):
                chars_embeds_temp[i] = torch.cat((outputs[i, index-1, :self.char_lstm_dim], outputs[i, 0, self.char_lstm_dim:]))
            chars_embeds = chars_embeds_temp.clone()
            for i in range(chars_embeds.size(0)):
                chars_embeds[d[i]] = chars_embeds_temp[i]

        if self.char_mode == 'CNN':
            chars_embeds = self.char_embeds(chars2).unsqueeze(1)
            # chars_cnn_out3 = self.char_cnn3(chars_embeds)
            chars_cnn_out3 = nn.functional.relu(self.char_cnn3(chars_embeds))   
            chars_embeds = nn.functional.max_pool2d(chars_cnn_out3,
                                                 kernel_size=(chars_cnn_out3.size(2), 1)).view(chars_cnn_out3.size(0), self.out_channels)
        
        character_ids = batch_to_ids([sentence_words])
        if self.use_gpu:
            character_ids = character_ids.cuda()
        embeddings = self.elmo(character_ids)
        embeddings = embeddings['elmo_representations'][0]
        embeds = embeddings[0]
        
        if self.use_gpu:
            embeds=embeds.cuda()
        
        embeds = torch.cat((embeds, chars_embeds), 1)
        embeds = embeds.unsqueeze(1)
        # print(embeds.size())
        embeds = self.dropout(embeds)
        lstm_out, _ = self.lstm(embeds)
        lstm_out = lstm_out.view(len(sentence_words), self.hidden_dim*2)
        lstm_out = self.dropout(lstm_out)
        lstm_feats = self.hidden2tag(lstm_out)
        return lstm_feats

    def apply_attention(self, elmo_embeds, seg_embeds, ctc_embeds):
        word_tensor_list = []
        word_pos_list=[]
        sent_len =elmo_embeds.size()[0]

        for index in range(sent_len):
            elmo_rep = elmo_embeds[index]
            ctc_rep = ctc_embeds[index]
            seg_rep = seg_embeds[index]
            comb_rep = torch.cat((elmo_rep, ctc_rep, seg_rep)).view(1, 1, -1)
            attentive_rep=self.embed_attn(comb_rep)
            word_tensor_list.append(attentive_rep)
            word_pos_list.append(index+1)

        word_tensor =  torch.stack(word_tensor_list)
        if self.use_gpu:
            word_tensor=word_tensor.cuda()
        
        x = _align_word(word_tensor, word_pos_list)
        if self.use_gpu:
            x=x.cuda()
        y = self.word_attn(x)
        return y


    def _get_lstm_features_w_elmo(self, sentence_words, sentence, seg_pred, ctc_pred):
        character_ids = batch_to_ids([sentence_words])
        if self.use_gpu:
            character_ids = character_ids.cuda()
        embeddings = self.elmo(character_ids)
        embeddings = embeddings['elmo_representations'][0]
        embeds = embeddings[0]
        
        if self.use_gpu:
            embeds=embeds.cuda()
            elmo_embeds=embeds.cuda()
        else:
            elmo_embeds=embeds

        if parameters['use_freq_vector']:
            frequency_embeddings = self.freq_embeds(sentence)
            # concatenate along the feature dimension (dim=1), as for the other feature vectors
            embeds = torch.cat((embeds, frequency_embeddings), 1)
        
        if parameters['use_segmentation_vector'] :
            segment_embeddings = self.seg_embeds(seg_pred)
            embeds = torch.cat((embeds, segment_embeddings), 1)
        
        if parameters['use_code_recognizer_vector']:
            ctc_pred_embeddings = self.ctc_pred_embeds(ctc_pred)
            embeds = torch.cat((embeds, ctc_pred_embeddings), 1)
        
        if parameters['use_han']:
            attentive_word_embeds = self.apply_attention(elmo_embeds, segment_embeddings, ctc_pred_embeddings)
            embeds=attentive_word_embeds
        else:
            embeds = embeds.unsqueeze(1)
        
        embeds = self.dropout(embeds)
        

        # embeds = self.dropout(merged_embeds)
        lstm_out, _ = self.lstm(embeds)
        #lstm_out, _ = self.lstm_2(lstm_out)
        lstm_out = lstm_out.view(len(sentence_words), self.hidden_dim*2)
        lstm_out = self.dropout(lstm_out)
        lstm_feats = self.hidden2tag(lstm_out)
        return lstm_feats
    
    def _get_lstm_features(self, sentence, markdown, chars2, caps, chars2_length, d):

        if self.char_mode == 'LSTM':
            # self.char_lstm_hidden = self.init_lstm_hidden(dim=self.char_lstm_dim, bidirection=True, batchsize=chars2.size(0))
            chars_embeds = self.char_embeds(chars2).transpose(0, 1)
            packed = torch.nn.utils.rnn.pack_padded_sequence(chars_embeds, chars2_length)
            lstm_out, _ = self.char_lstm(packed)
            outputs, output_lengths = torch.nn.utils.rnn.pad_packed_sequence(lstm_out)
            outputs = outputs.transpose(0, 1)
            chars_embeds_temp = Variable(torch.FloatTensor(torch.zeros((outputs.size(0), outputs.size(2)))))
            if self.use_gpu:
                chars_embeds_temp = chars_embeds_temp.cuda()
            for i, index in enumerate(output_lengths):
                chars_embeds_temp[i] = torch.cat((outputs[i, index-1, :self.char_lstm_dim], outputs[i, 0, self.char_lstm_dim:]))
            chars_embeds = chars_embeds_temp.clone()
            for i in range(chars_embeds.size(0)):
                chars_embeds[d[i]] = chars_embeds_temp[i]

        if self.char_mode == 'CNN':
            chars_embeds = self.char_embeds(chars2).unsqueeze(1)
            # chars_cnn_out3 = self.char_cnn3(chars_embeds)
            chars_cnn_out3 = nn.functional.relu(self.char_cnn3(chars_embeds))   
            chars_embeds = nn.functional.max_pool2d(chars_cnn_out3,
                                                 kernel_size=(chars_cnn_out3.size(2), 1)).view(chars_cnn_out3.size(0), self.out_channels)

        # t = self.hw_gate(chars_embeds)
        # g = nn.functional.sigmoid(t)
        # h = nn.functional.relu(self.hw_trans(chars_embeds))
        # chars_embeds = g * h + (1 - g) * chars_embeds

        embeds = self.word_embeds(sentence)
        
        if self.n_cap and self.cap_embedding_dim:
            cap_embedding = self.cap_embeds(caps)

        if self.n_cap and self.cap_embedding_dim:
            embeds = torch.cat((embeds, chars_embeds, cap_embedding), 1)
        else:
            embeds = torch.cat((embeds, chars_embeds), 1)

        if parameters['use_freq_vector']:
            frequency_embeddings = self.freq_embeds(sentence)
            embeds = torch.cat((embeds, frequency_embeddings), 1)
        if parameters['use_markdown_vector'] :
            markdown_embeddings = self.markdown_embeds(markdown)
            embeds = torch.cat((embeds, markdown_embeddings), 1)
        
        embeds = embeds.unsqueeze(1)  # add a batch dimension of size 1: (seq_len, 1, feature_dim)
        embeds = self.dropout(embeds)
        lstm_out, _ = self.lstm(embeds)
        lstm_out = lstm_out.view(len(sentence), self.hidden_dim*2)
        lstm_out = self.dropout(lstm_out)
        lstm_feats = self.hidden2tag(lstm_out)
        return lstm_feats

    def _forward_alg(self, feats):
        # calculate in log domain
        # feats is len(sentence) * tagset_size
        # initialize alpha with a Tensor with values all equal to -10000.
        init_alphas = torch.Tensor(1, self.tagset_size).fill_(-10000.)
        init_alphas[0][self.tag_to_ix[START_TAG]] = 0.
        forward_var = autograd.Variable(init_alphas)
        if self.use_gpu:
            forward_var = forward_var.cuda()
        for feat in feats:
            emit_score = feat.view(-1, 1)
            tag_var = forward_var + self.transitions + emit_score
            max_tag_var, _ = torch.max(tag_var, dim=1)
            tag_var = tag_var - max_tag_var.view(-1, 1)
            forward_var = max_tag_var + torch.log(torch.sum(torch.exp(tag_var), dim=1)).view(1, -1)
        terminal_var = (forward_var + self.transitions[self.tag_to_ix[STOP_TAG]]).view(1, -1)
        alpha = log_sum_exp(terminal_var)
        # Z(x)
        return alpha

    def viterbi_decode(self, feats):
        backpointers = []
        # analogous to forward
        init_vvars = torch.Tensor(1, self.tagset_size).fill_(-10000.)
        init_vvars[0][self.tag_to_ix[START_TAG]] = 0.0
        forward_var = Variable(init_vvars)
        if self.use_gpu:
            forward_var = forward_var.cuda()
        for feat in feats:
            next_tag_var = forward_var.view(1, -1).expand(self.tagset_size, self.tagset_size) + self.transitions
            _, bptrs_t = torch.max(next_tag_var, dim=1)
            bptrs_t = bptrs_t.squeeze().data.cpu().numpy()
            next_tag_var = next_tag_var.data.cpu().numpy()
            viterbivars_t = next_tag_var[range(len(bptrs_t)), bptrs_t]
            viterbivars_t = Variable(torch.FloatTensor(viterbivars_t))
            if self.use_gpu:
                viterbivars_t = viterbivars_t.cuda()
            forward_var = viterbivars_t + feat
            backpointers.append(bptrs_t)

        terminal_var = forward_var + self.transitions[self.tag_to_ix[STOP_TAG]]
        terminal_var.data[self.tag_to_ix[STOP_TAG]] = -10000.
        terminal_var.data[self.tag_to_ix[START_TAG]] = -10000.
        best_tag_id = argmax(terminal_var.unsqueeze(0))
        path_score = terminal_var[best_tag_id]
        best_path = [best_tag_id]
        for bptrs_t in reversed(backpointers):
            best_tag_id = bptrs_t[best_tag_id]
            best_path.append(best_tag_id)
        start = best_path.pop()
        assert start == self.tag_to_ix[START_TAG]
        best_path.reverse()
        return path_score, best_path

    def neg_log_likelihood(self, sentence_tokens, sentence, sentence_seg_preds,sentence_ctc_preds, tags, chars2, caps, chars2_length, d):
        # sentence, tags is a list of ints
        # features is a 2D tensor, len(sentence) * self.tagset_size
        feats = self._get_lstm_features_w_elmo(sentence_tokens, sentence, sentence_seg_preds, sentence_ctc_preds)

#         if self.use_elmo:
#             if parameters['use_elmo_w_char']:
#                 feats = self._get_lstm_features_w_elmo_and_char(sentence_tokens, sentence, sentence_markdowns, chars2, caps, chars2_length, d)
#             else:
#                 feats = self._get_lstm_features_w_elmo(sentence_tokens, sentence, sentence_seg_preds, sentence_ctc_preds)
#         else:
#             feats = self._get_lstm_features(sentence,sentence_markdowns, chars2, caps, chars2_length, d)

        if self.use_crf:
            forward_score = self._forward_alg(feats)
            gold_score = self._score_sentence(feats, tags)
            return forward_score - gold_score
        else:
            tags = Variable(tags)
            scores = nn.functional.cross_entropy(feats, tags)
            return scores


    def forward(self,sentence_tokens, sentence, sentence_seg_preds, sentence_ctc_preds, chars, caps, chars2_length, d):
        feats = self._get_lstm_features_w_elmo(sentence_tokens,  sentence, sentence_seg_preds,sentence_ctc_preds)

        # if self.use_elmo:
        #     if parameters['use_elmo_w_char']:
        #         feats = self._get_lstm_features_w_elmo_and_char(sentence_tokens, sentence,sentence_markdowns, chars, caps, chars2_length, d)
        #     else:
        #         feats = self._get_lstm_features_w_elmo(sentence_tokens,  sentence, sentence_markdowns,sentence_ctc_preds)
            
        # else:
        #     feats = self._get_lstm_features(sentence,sentence_markdowns, chars, caps, chars2_length, d)
        # viterbi to get tag_seq
        if self.use_crf:
            score, tag_seq = self.viterbi_decode(feats)
        else:
            score, tag_seq = torch.max(feats, 1)
            tag_seq = list(tag_seq.cpu().data)

        return score, tag_seq
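
The CRF methods above (`_score_sentence`, `_forward_alg`, `viterbi_decode`) can be hard to follow in tensor form. The following is a minimal pure-Python sketch of the same idea on a toy problem, not the notebook's implementation. It assumes the same `transitions[to][from]` layout and the same `-10000` constant; the tag indices, transition values and emission scores are made up for illustration.

```python
import math

START, STOP = 2, 3   # toy indices for the boundary tags (assumption)
NEG = -10000.0       # same large negative constant as in the class above


def log_sum_exp(values):
    """Numerically stable log(sum(exp(v))) for a list of floats."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def path_score(feats, trans, tags):
    """Score of one tag path: trans[to][from] plus the emission at each token."""
    score, prev = 0.0, START
    for feat, tag in zip(feats, tags):
        score += trans[tag][prev] + feat[tag]
        prev = tag
    return score + trans[STOP][prev]


def log_partition(feats, trans):
    """Forward algorithm: log of the summed exp-scores over all tag paths."""
    n = len(trans)
    alpha = [NEG] * n
    alpha[START] = 0.0
    for feat in feats:
        alpha = [feat[t] + log_sum_exp([alpha[p] + trans[t][p] for p in range(n)])
                 for t in range(n)]
    return log_sum_exp([alpha[p] + trans[STOP][p] for p in range(n)])


def viterbi(feats, trans):
    """Best-scoring tag path: the forward recursion with max instead of sum."""
    n = len(trans)
    delta = [NEG] * n
    delta[START] = 0.0
    backpointers = []
    for feat in feats:
        best_prev = [max(range(n), key=lambda p, t=t: delta[p] + trans[t][p])
                     for t in range(n)]
        delta = [delta[best_prev[t]] + trans[t][best_prev[t]] + feat[t]
                 for t in range(n)]
        backpointers.append(best_prev)
    last = max(range(n), key=lambda p: delta[p] + trans[STOP][p])
    best_score = delta[last] + trans[STOP][last]
    path = [last]
    for best_prev in reversed(backpointers):
        path.append(best_prev[path[-1]])
    path.pop()        # the last element traced back is START
    path.reverse()
    return best_score, path


# Toy model: two real tags (0 and 1) plus START and STOP.
trans = [
    # from: 0     1     START STOP
    [0.5, -0.3, 0.2, NEG],    # to tag 0
    [-0.4, 0.6, -0.1, NEG],   # to tag 1
    [NEG, NEG, NEG, NEG],     # to START (forbidden)
    [0.1, 0.3, 0.0, NEG],     # to STOP
]
feats = [
    [1.0, 0.2, 0.0, 0.0],
    [0.1, 0.9, 0.0, 0.0],
    [0.3, 0.4, 0.0, 0.0],
]
gold = [0, 1, 1]

# Negative log-likelihood, as in neg_log_likelihood: log Z(x) minus the gold path score.
nll = log_partition(feats, trans) - path_score(feats, trans, gold)
best_score, best_path = viterbi(feats, trans)
print(nll, best_score, best_path)
```

Training minimizes `nll`, which pushes the gold path's score up relative to all other paths, while decoding only needs the single best path from `viterbi`.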

### Prepare the data for training

In [None]:
def prepare_train_set_dev_data():
    lower = parameters['lower']
    zeros = parameters['zeros']
    tag_scheme = parameters['tag_scheme']
    
    # create the frequency vector
    if parameters['use_freq_vector']:
        create_frequecny_vector()
    
    # create the ctc_dict
    ctc_pred_dict = {}
    for line in open(parameters["ctc_pred"]):
        if line.strip() == "":
            continue
        line_values = line.strip().split("\t")
        word, ctc_pred = line_values[0], line_values[-1]
        ctc_pred_dict[word] = ctc_pred
    print("completed ctc predictions reading ")

    # prepare the training data
    # merge labels and select category specific entities 
    input_train_file=utils.Merge_Label(parameters["train"])
    
    Sort_Entity_by_Count(input_train_file,parameters["sorted_entity_list_file_name"])

    with open(parameters["sorted_entity_list_file_name"], encoding='utf-8') as f:
        sorted_entity_list = json.load(f)
    set_of_selected_tags=[]

    entity_category_code=parameters["entity_category_code"]
    entity_category_human_language=parameters["entity_category_human_language"]

    set_of_selected_tags.extend(sorted_entity_list[0:-6])
    if parameters['entity_category'] == 'code':
        for entity in entity_category_human_language:
            if entity in entity_category_human_language and entity in set_of_selected_tags:
                set_of_selected_tags.remove(entity)
    
    if parameters['entity_category'] == 'human_lang':
        for entity in entity_category_code:
            if entity in entity_category_code and entity in set_of_selected_tags:
                set_of_selected_tags.remove(entity)
        if 'Algorithm' not in set_of_selected_tags:
            set_of_selected_tags.append('Algorithm')
    
    if parameters['entity_category'] == 'all':
        if 'Algorithm' not in set_of_selected_tags:
            set_of_selected_tags.append('Algorithm')
    
    print("set of entities: ", set_of_selected_tags)

    merge_tags=parameters['merge_tags']
    train_sentences = loader.load_sentences_so_w_pred(parameters["train"], parameters["train_pred"], lower, zeros, merge_tags, set_of_selected_tags)
    if parameters["mode"] == "dev":
        dev_sentences = loader.load_sentences_so_w_pred(parameters["dev"], parameters["dev_pred"], lower, zeros, merge_tags, set_of_selected_tags)
        test_sentences = dev_sentences
    elif parameters["mode"] == "test":
        dev_sentences = loader.load_sentences_so_w_pred(parameters["test"], parameters["test_pred"],lower, zeros, merge_tags, set_of_selected_tags)
        test_sentences = dev_sentences
    
    loader.update_tag_scheme(train_sentences, tag_scheme)
    loader.update_tag_scheme(dev_sentences, tag_scheme)
    loader.update_tag_scheme(test_sentences, tag_scheme)
    
    dico_words_train = loader.word_mapping(train_sentences, lower)[0]
    dico_chars, char_to_id, id_to_char = loader.char_mapping(train_sentences)
    dico_tags, tag_to_id, id_to_tag = loader.tag_mapping(train_sentences)    
    
    #------------------------------------------------------------------------------------------------------------
    #------------- based on parameter settings (should be set by command-line arguments) ----------------------
    #------------- load pretrained word embeddings --------------------------------------------------------------
    #------------------------------------------------------------------------------------------------------------
    if parameters['all_emb']:
        all_dev_test_words=[w[0][0] for w in dev_sentences+test_sentences]
    else:
        all_dev_test_words = []

    if parameters['use_pre_emb']:
        dico_words, word_to_id, id_to_word = loader.augment_with_pretrained(
            dico_words_train.copy(),
            parameters['pre_emb'],
            all_dev_test_words
        )
    else:
        dico_words = dico_words_train
        word_to_id, id_to_word = loader.create_mapping(dico_words_train.copy())

    train_data = loader.prepare_dataset(train_sentences, word_to_id, char_to_id, tag_to_id, ctc_pred_dict, lower)
    dev_data = loader.prepare_dataset(dev_sentences, word_to_id, char_to_id, tag_to_id, ctc_pred_dict, lower)
    test_data = loader.prepare_dataset(test_sentences, word_to_id, char_to_id, tag_to_id,ctc_pred_dict, lower)

    all_freq_embed={}
    for line in open(parameters['freq_vector_file'], encoding='utf-8'):
        s = line.strip().split()
        if len(s) == parameters['freq_dim'] + 1:
            all_freq_embed[s[0]] = np.array([float(i) for i in s[1:]])
        else:
            print("freq dim mismatch: ","required: ", parameters['freq_dim'], "given: ",len(s)-1)
    
    freq_embeds = np.random.uniform(-np.sqrt(0.06), np.sqrt(0.06), (len(word_to_id), parameters['freq_dim']))
    for w in word_to_id:
        if w in all_freq_embed:
            freq_embeds[word_to_id[w]] = all_freq_embed[w]
        elif w.lower() in all_freq_embed:
            freq_embeds[word_to_id[w]] = all_freq_embed[w.lower()]
    
    all_word_embeds = {}
    if parameters['use_pre_emb']:
        for i, line in enumerate(codecs.open(parameters['pre_emb'] , 'r', 'utf-8')):
            s = line.strip().split()
            if len(s) == parameters['word_dim'] + 1:
                all_word_embeds[s[0]] = np.array([float(i) for i in s[1:]])
    
    word_embeds = np.random.uniform(-np.sqrt(0.06), np.sqrt(0.06), (len(word_to_id), parameters['word_dim']))
    seg_pred_embeds = np.random.uniform(-np.sqrt(0.06), np.sqrt(0.06), (parameters['segmentation_count'] , parameters['segmentation_dim']))
    ctc_pred_embeds = np.random.uniform(-np.sqrt(0.06), np.sqrt(0.06), (parameters['code_recognizer_count'], parameters['code_recognizer_dim']))
    if parameters['use_pre_emb']:
        for w in word_to_id:
            if w in all_word_embeds:
                word_embeds[word_to_id[w]] = all_word_embeds[w]
            elif w.lower() in all_word_embeds:
                word_embeds[word_to_id[w]] = all_word_embeds[w.lower()]
    
    return train_data, dev_data, test_data, word_to_id, id_to_word, tag_to_id, id_to_tag, char_to_id, id_to_char, word_embeds, freq_embeds, seg_pred_embeds, ctc_pred_embeds
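
The embedding tables built at the end of `prepare_train_set_dev_data` all follow one pattern: start from small uniform noise in the range ±sqrt(0.06), then overwrite the rows of words found in a pre-trained table, falling back to the lowercased word. A minimal sketch of that pattern is shown below. It uses plain Python lists instead of NumPy so that it runs without dependencies, and the helper name `build_embedding_matrix` and the toy vocabulary are hypothetical.

```python
import math
import random


def build_embedding_matrix(word_to_id, pretrained, dim, seed=0):
    """Return one row per vocabulary id; unknown words keep small uniform noise."""
    rng = random.Random(seed)
    bound = math.sqrt(0.06)
    matrix = [[rng.uniform(-bound, bound) for _ in range(dim)]
              for _ in range(len(word_to_id))]
    for word, idx in word_to_id.items():
        if word in pretrained:
            matrix[idx] = list(pretrained[word])
        elif word.lower() in pretrained:
            # fall back to the lowercased form, as the notebook does
            matrix[idx] = list(pretrained[word.lower()])
    return matrix


# Toy vocabulary and pre-trained vectors (made up for illustration).
word_to_id = {"<UNK>": 0, "List": 1, "python": 2, "foo_bar": 3}
pretrained = {"list": [0.1, 0.2, 0.3], "python": [0.4, 0.5, 0.6]}
matrix = build_embedding_matrix(word_to_id, pretrained, dim=3)

print(matrix[1])  # row for "List", copied from the lowercased entry "list"
print(matrix[3])  # row for "foo_bar", left at its random initialization
```

Words without a pre-trained vector, such as identifiers that appear only in code, keep their random rows and are learned during training.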

In [None]:
train_data, dev_data, test_data, word_to_id, id_to_word, tag_to_id, id_to_tag, char_to_id, id_to_char, word_embeds, freq_embeds, seg_pred_embeds, ctc_pred_embeds = prepare_train_set_dev_data()
model = BiLSTM_CRF(
    vocab_size=len(word_to_id),
    tag_to_ix=tag_to_id,
    embedding_dim=parameters['word_dim'],
    freq_embed_dim=parameters['freq_dim'], 
    seg_pred_embed_dim=parameters['segmentation_dim'],
    hidden_dim=parameters['word_lstm_dim'],
    use_gpu=parameters['use_gpu'],
    char_to_ix=char_to_id,
    pre_word_embeds=word_embeds,
    word_freq_embeds=freq_embeds,
    word_seg_pred_embeds=seg_pred_embeds,
    word_ctc_pred_embeds=ctc_pred_embeds,
    use_crf=parameters['crf'],
    char_mode=parameters['char_mode'],
)

In [None]:
from torch.optim import lr_scheduler
from torch.autograd import Variable
import time

if parameters['use_gpu']:
    print("GPU ID = ", parameters["gpu_id"])
    torch.cuda.set_device(parameters["gpu_id"])
    model.cuda()
learning_rate = parameters["LR"]
optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate, momentum=0.9)
step_lr_scheduler = lr_scheduler.StepLR(optimizer, step_size=5, gamma=0.8)
t = time.time()

# train model
char_embed_dict = {}
losses = []
loss = 0.0
best_dev_F = -1.0
best_test_F = -1.0
best_train_F = -1.0
all_F = [[0, 0, 0]]
plot_every = 10
eval_every = 20
count = 0

model.train(True)
start = time.time()
# what a data instance looks like:
#'str_words': ['Trial', 'and', 'error', 'seems', 'a', 'very', 'dumb', '(', 'and', 'annoying', ')', 'approach', 'to', 'solve', 'this', 'problem', '.'], 
#'words':    [1, 9, 76, 179, 7, 215, 1, 26, 9, 1, 29, 332, 4, 310, 15, 64, 3], 
#'markdown': [0, 0, 0, 0, 0, 0, 0,  0, 0, 0,  0, 0, 0, 0,  0,  0, 0],
#'chars': [[26, 8, 5, 4, 10], [4, 6, 11], [1, 8, 8, 3, 8], [7, 1, 1, 14, 7], [4], [22, 1, 8, 17], [11, 13, 14, 21], [35], [4, 6, 11], [4, 6, 6, 3, 17, 5, 6, 16], [34], [4, 15, 15, 8, 3, 4, 12, 9], [2, 3], [7, 3, 10, 22, 1], [2, 9, 5, 7], [15, 8, 3, 21, 10, 1, 14], [20]], 
#'caps': [2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 
#'tags': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'handcrafted': [28052, 28053, 28054, 28055, 28056, 28057, 28058, 28059, 28060, 28061, 28062, 28063, 28064, 28065, 28066, 28067, 28068]
for epoch in range(1, parameters["epochs"] + 1):
    print("---------epoch count: ", epoch)
    for i, index in enumerate(np.random.permutation(len(train_data))):

        tr = time.time()
        count += 1
        data = train_data[index]
        
        model.zero_grad()
        
        sentence_in = data['words']
        sentence_tokens=data['str_words']
        sentence_seg_preds = data['seg_pred']
        sentence_ctc_preds = data['ctc_pred']
        
        tags = data['tags']
        chars2 = data['chars']

        sentence_in = Variable(torch.LongTensor(sentence_in))
        sentence_seg_preds = Variable(torch.LongTensor(sentence_seg_preds))
        sentence_ctc_preds = Variable(torch.LongTensor(sentence_ctc_preds))

        ######### char lstm
        if parameters['char_mode'] == 'LSTM':
            chars2_sorted = sorted(chars2, key=lambda p: len(p), reverse=True)
            d = {}
            for i, ci in enumerate(chars2):
                for j, cj in enumerate(chars2_sorted):
                    if ci == cj and not j in d and not i in d.values():
                        d[j] = i
                        continue
            chars2_length = [len(c) for c in chars2_sorted]
            char_maxl = max(chars2_length)
            chars2_mask = np.zeros((len(chars2_sorted), char_maxl), dtype='int')
            for i, c in enumerate(chars2_sorted):
                chars2_mask[i, :chars2_length[i]] = c
            chars2_mask = Variable(torch.LongTensor(chars2_mask))
        
        # ######## char cnn
        if parameters['char_mode'] == 'CNN':
            d = {}
            chars2_length = [len(c) for c in chars2]
            char_maxl = max(chars2_length)
            chars2_mask = np.zeros((len(chars2_length), char_maxl), dtype='int')
            for i, c in enumerate(chars2):
                chars2_mask[i, :chars2_length[i]] = c
            chars2_mask = Variable(torch.LongTensor(chars2_mask))
        
        targets = torch.LongTensor(tags)
        caps = Variable(torch.LongTensor(data['caps']))
        if parameters['use_gpu']:
            neg_log_likelihood = model.neg_log_likelihood(sentence_tokens, sentence_in.cuda(), sentence_seg_preds.cuda(),sentence_ctc_preds.cuda(),  targets.cuda(), chars2_mask.cuda(), caps.cuda(), chars2_length, d)
        else:
            neg_log_likelihood = model.neg_log_likelihood(sentence_tokens,sentence_in,sentence_seg_preds,sentence_ctc_preds,  targets, chars2_mask, caps, chars2_length, d)
        
        loss += neg_log_likelihood.data.item() / len(data['words']) #JT : data[0]> data.item()
        neg_log_likelihood.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), 5.0) #JT : clip_grad_norm > clip_grad_norm_
        optimizer.step()
        if count % len(train_data) == 0:
            utils.adjust_learning_rate(optimizer, lr=learning_rate/(1+0.05*count/len(train_data)))
        
    # JT: evaluate after 1 epoch
    model.train(False)

    best_train_F, new_train_F, _ = evaluating(model, train_data,  best_train_F,  epoch, "train")

    if parameters["mode"]=="dev":
        phase_name="dev"
    else:
        phase_name="test"

    best_dev_F, new_dev_F, save = evaluating(model, dev_data, best_dev_F, epoch, phase_name) 
    if save:
        torch.save(model, model_name)
    best_test_F, new_test_F = 0, 0
    all_F.append([new_train_F, new_dev_F, new_test_F])
    step_lr_scheduler.step()
    
    #-------------------------------------------------------------------------------------------------
    #--------------------- save model for each epoch, after finding the optimal epoch ----------------
    #--------------------- save model from last epoch only -------------------------------------------
    #-------------------------------------------------------------------------------------------------
    PATH=parameters["models_path"]+"/model_epoch."+str(epoch)
    torch.save(model, PATH)
    model.train(True)
    end = time.time()
    time_in_this_epoch = end - start
    print("time in this epoch: ", time_in_this_epoch, "secs")
    start=end

print("total time elapsed for training: ", time.time() - t)
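
In the character-LSTM branch of the training loop, the words of a sentence are sorted by length (longest first) because `pack_padded_sequence` is given the lengths in that order, and the dictionary `d` records how to put the per-word outputs back into sentence order. The sketch below shows one way to build the same mapping with an index sort instead of the nested comparison loops. It is an alternative formulation under that reading of the code, and the helper names are hypothetical.

```python
def sort_with_restore_map(chars):
    """Sort sequences longest-first; map each sorted position to its original index."""
    order = sorted(range(len(chars)), key=lambda i: len(chars[i]), reverse=True)
    chars_sorted = [chars[i] for i in order]
    restore = {sorted_pos: orig_pos for sorted_pos, orig_pos in enumerate(order)}
    return chars_sorted, restore


def pad(sequences, pad_id=0):
    """Right-pad every sequence with pad_id to the length of the longest one."""
    width = max(len(s) for s in sequences)
    return [s + [pad_id] * (width - len(s)) for s in sequences]


# Character ids of a four-word toy sentence (two words are identical).
chars = [[4, 6, 11], [26, 8, 5, 4, 10], [4], [4, 6, 11]]
chars_sorted, restore = sort_with_restore_map(chars)
padded = pad(chars_sorted)

# After the LSTM, row i of the sorted batch belongs to word restore[i].
restored = [None] * len(chars)
for sorted_pos, row in enumerate(chars_sorted):
    restored[restore[sorted_pos]] = row

print(restore)
print(restored == chars)
```

The `restore` dictionary plays the role of `d` in `get_char_embedding`, where `chars_embeds[d[i]] = chars_embeds_temp[i]` undoes the sort.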

## Evaluation

Although the Entity Segmenter module is missing, as mentioned above, the authors have preprocessed the dataset and included the predictions of the Code Recognizer and the Entity Segmenter in the source code. The details are shown below.

![dataset](./notebook_resources/dataset.png)

Thus, we can still train the SoftNER model on those datasets. Training requires 100 epochs, and each epoch takes more than 4 hours on our server, so a full run would take more than 400 hours. We therefore trained for only 3 epochs to observe the results.

The training log and the concise per-epoch results are as follows:

In [7]:
os.chdir('/root/Project/StackOverflowNER/code/Attentive_BiLSTM/')
!cat running.log

nohup: ignoring input
sh: 1: ./evaluation/conlleval: Permission denied
sh: 1: ./evaluation/conlleval: Permission denied
sh: 1: ./evaluation/conlleval: Permission denied
completed ctc predictions reading 
set of entities:  ['Class', 'Application', 'Variable', 'User_Interface_Element', 'Code_Block', 'Function', 'Language', 'Library', 'Value', 'Data_Structure', 'Data_Type', 'File_Type', 'File_Name', 'Version', 'HTML_XML_Tag', 'Device', 'Operating_System', 'Website', 'Algorithm']
------------------------------------------------------------
Number of questions in  ../../resources/annotated_ner_data/StackOverflow/train_merged_labels.txt  :  741
Number of answers in  ../../resources/annotated_ner_data/StackOverflow/train_merged_labels.txt  :  897
Number of sentences in  ../../resources/annotated_ner_data/StackOverflow/train_merged_labels.txt  :  9263
Number of sentences after merging :  9263
Max len sentences has 92 words
-----------------------------------------------------------

In [8]:
!cat perf_per_epoch_2020-10-19_9911_1.txt

test: epoch: 1 P: 70.58 R: 70.22 F1: 70.4
test: epoch: 2 P: 74.94 R: 74.81 F1: 74.87
test: epoch: 3 P: 73.19 R: 72.73 F1: 72.96
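
As a quick sanity check on the log above, F1 is the harmonic mean of precision and recall, F1 = 2PR / (P + R). The sketch below recomputes it from the rounded P and R values. Note that `conlleval` derives its scores from raw entity counts, so agreement to two decimals is expected here but is not guaranteed in general.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (both given in percent)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (epoch, P, R, F1) copied from the per-epoch log above.
reported = [
    (1, 70.58, 70.22, 70.4),
    (2, 74.94, 74.81, 74.87),
    (3, 73.19, 72.73, 72.96),
]

for epoch, p, r, f1 in reported:
    print(epoch, round(f1_score(p, r), 2), f1)
```

The recomputed values match the logged F1 for all three epochs, and F1 peaks at epoch 2 before dropping slightly at epoch 3.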


For comparison with our model after 3 epochs of training, the results reported in the paper are shown below:

![paperSoftNER](./notebook_resources/paperSoftNER.png)


## Evaluate The SoftNER Model's Performance on Leetcode Sentences


<!-- Something -->



The following example shows how to use SoftNER to identify named entities in an article.

In [None]:
""" Identify Named Entities  """

os.chdir('/root/Project/StackOverflowNER/code/BERT_NER/')

from utils_preprocess import *
from utils_preprocess.format_markdown import *
from utils_preprocess.anntoconll import *

import glob

from utils_ctc import *
from utils_ctc.prediction_ctc import *

import softner_segmenter_preditct_from_file
import softner_ner_predict_from_file

from E2E_SoftNER import read_file, merge_all_conll_files, create_segmenter_input, create_ner_input


# input file
input_file = "xml_filted_body.txt"
base_temp_dir = "temp_files/"
standoff_folder = "temp_files/standoff_files/"
# dir of code recognizer
conll_folder = "temp_files/conll_files/"
conll_file = "temp_files/conll_format_txt.txt"
# dir of entity segmenter
segmenter_input_file = "temp_files/segmenter_ip.txt"
segmenter_output_file = "temp_files/segemeter_preds.txt"
# input features & SoftNER prediction 
ner_input_file = "temp_files/ner_ip.txt"
ner_output_file = "ner_preds.txt"

if not os.path.exists(base_temp_dir): os.makedirs(base_temp_dir)
if not os.path.exists(standoff_folder): os.makedirs(standoff_folder)
if not os.path.exists(conll_folder): os.makedirs(conll_folder)

# read sentences and tokenize the sentences
read_file(input_file, standoff_folder)

# Code Recognizer
convert_standoff_to_conll(standoff_folder, conll_folder)
merge_all_conll_files(conll_folder, conll_file)
create_segmenter_input(conll_file, segmenter_input_file, ctc_classifier, vocab_size, word_to_id, id_to_word, word_to_vec, features)

# Entity Segmenter
softner_segmenter_preditct_from_file.predict_segments(segmenter_input_file, segmenter_output_file)
create_ner_input(segmenter_output_file, ner_input_file, ctc_classifier, vocab_size, word_to_id, id_to_word, word_to_vec, features)

# recognize named entities
softner_ner_predict_from_file.predict_entities(ner_input_file, ner_output_file)
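
Once token-level predictions are available, they usually need to be grouped into entity spans. The sketch below assumes the predictions can be read as parallel lists of tokens and BIO tags (for example `B-Application`, `I-Application`, `O`); the exact layout of `ner_preds.txt` is not shown here, so both this format and the helper `bio_to_spans` are assumptions for illustration.

```python
def bio_to_spans(tokens, tags):
    """Group BIO tags into (entity_type, start, end_exclusive, text) tuples."""
    spans = []
    start, etype = None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # trailing "O" flushes the last span
        inside = tag.startswith("I-") and etype == tag[2:]
        if start is not None and not inside:
            spans.append((etype, start, i, " ".join(tokens[start:i])))
            start, etype = None, None
        # A "B-" tag always opens a span; a stray "I-" tag opens one too.
        if tag.startswith("B-") or (tag.startswith("I-") and start is None):
            start, etype = i, tag[2:]
    return spans


# Toy sentence with hand-written tags (not model output).
tokens = ["Open", "Visual", "Studio", "Code", "and", "call", "foo()", "."]
tags = ["O", "B-Application", "I-Application", "I-Application",
        "O", "O", "B-Function", "O"]

for span in bio_to_spans(tokens, tags):
    print(span)
```

Treating a stray `I-` tag as the start of a new span is a lenient design choice; a stricter variant could discard such tokens instead.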


## Challenges and Solutions

<!-- challenges & solutions -->

- The dataset is larger than 30 GB, and quota is limited when running code on Colab with a mounted Google Drive.
    - We set up an instance on Google Cloud
    - We built a Jupyter server on it
- Bugs in the source code
    - Encoding problems when reading or writing files
    - Problems when running the code in Jupyter
    - A typo
    - Incompatible module versions
- Several essential files are missing from the original folder
    - We contacted the author multiple times. Thank you Jeniya!
- The instructions for setting up the model are not clear
    - We resolved this through code review
## Contribution

<!-- Acknowledgement -->

**Changheng Liou**
- Set up a Jupyter server that connects to the instance on Google Cloud
- Crawl sentences from Leetcode discussions
- Analyze the SoftNER model architecture
- Code review
- Responsible for presenting our work in the final presentation
- Complete the report

**Zhiying Cui**
- Provide a server in Google Cloud and set up the environment
- Debug the source code
- Analyze the auxiliary models
- Code review
- Responsible for answering questions in the final presentation
- Complete the report


## References

1. Jeniya Tabassum et al., [Code and Named Entity Recognition in StackOverflow](https://arxiv.org/abs/2005.01634)
2. Jacob Devlin et al., [BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding](https://arxiv.org/abs/1810.04805)
3. [StackOverflowNER](https://github.com/jeniyat/StackOverflowNER)
4. [Pre-trained BERTOverflow](https://huggingface.co/jeniya/BERTOverflow#)
5. [BERT](https://huggingface.co/transformers/model_doc/bert.html)
6. [transformers](https://github.com/huggingface/transformers)
7. [fastText](https://github.com/facebookresearch/fastText)
