## Challenges
- The authors implement a custom tokenizer because existing tokenizers (CMU Twokenizer, Stanford TweetTokenizer, and the NLTK Twitter tokenizer) all mistakenly split code, for example:
```
txScope.complete() => ["txScope", ".", "complete", "(", ")"]
std::condition_variable => ["std", ":", ":", "condition_variable"]
math.h => ["math", ".", "h"]
<html> => ["<", "html", ">"]
a == b => ["a", "=", "=", "b"]
```
- Ambiguity: many words can refer either to a technical programming concept or to a common English word, such as "go", "spring", "while", "if", "select".
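
The tokenization problem above can be illustrated with a small regex-based sketch. This is not the authors' tokenizer; the pattern list and the `tokenize` function name are assumptions made for illustration. It only shows how the failing examples could be kept as single tokens.

```python
import re

# Hypothetical pattern list, ordered from most to least specific.
_TOKEN_RE = re.compile(
    r"""
    </?[A-Za-z][\w-]*>                                    # HTML/XML tags such as <html>
  | [A-Za-z_]\w*(?:(?:\.|::|->)[A-Za-z_]\w*)*(?:\(\))?    # dotted or scoped identifiers, optional ()
  | ==|!=|<=|>=|&&|\|\||::|->                             # multi-character operators
  | \w+                                                   # remaining words and numbers
  | [^\w\s]                                               # any other single symbol
    """,
    re.VERBOSE,
)


def tokenize(text):
    """Split text into tokens while keeping common code fragments intact."""
    return _TOKEN_RE.findall(text)


for sample in ["txScope.complete()", "std::condition_variable", "math.h", "<html>", "a == b"]:
    print(sample, "=>", tokenize(sample))
```

A sketch like this over-merges ordinary prose that lacks a space after a period (for example `end.Next`), which is one reason a real tokenizer needs more rules than shown here.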

### Model architecture
1. **Input embedding layer**: Extract contextualized embeddings from the $BERT_{base}$ model and two new domain-specific embeddings for each word in the input sentence.
2. **Embedding attention layer**: Combine the three word embeddings using an attention network.
3. **Linear-CRF layer**: Predict the entity type of each word using the attentive word representations from the previous layer.
![model.png](./notebook_resources/model.png)

## Input embeddings

1. **Code Recognizer**, which represents whether a word can be part of a code entity regardless of context.
2. **Entity Segmenter**, which predicts whether a word is part of any named entity in the given sentence.

#### In-domain Word Embeddings
Wikipedia text is unsuitable for technical contexts, so the StackOverflow 10-year archive is used for this task instead.
**BERT (BERTOverflow)**, **ELMo (ELMoVerflow)**, and **GloVe (GloVerflow)** embeddings are trained on this archive.

#### Context-independent Code Recognition
Some words are ambiguous even without context: `list` can be either a common English word or a code snippet, while `listing` is unlikely to be a code snippet.
Thus, a binary classifier (the code recognition module) predicts whether a word can be code, regardless of its context.

- The input features include unigram word and 6-gram character probabilities from two language models, trained respectively on the **Gigaword corpus** and on all the code snippets in the **StackOverflow 10-year archive**.
- **FastText word embeddings** are pre-trained on these code snippets; a word vector is represented as the sum of its character n-grams.
- Each n-gram probability is first transformed into a k-dimensional vector using Gaussian binning. The vectorized features are then fed into a linear layer, and the output is concatenated with the FastText character-level embeddings.
- The result is passed through another hidden layer with sigmoid activation; the word is classified as code if the output probability is greater than 0.5.
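
The pipeline in the list above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the bin centers are assumed to be evenly spaced over `[low, high]`, the Gaussian width defaults to the bin spacing, probabilities are assumed to lie in `[0, 1]`, and `params` is a hypothetical dictionary of already-trained weights.

```python
import math


def gaussian_binning(value, low, high, k=10, width=None):
    """Map a scalar to k Gaussian responses around evenly spaced bin centers."""
    step = (high - low) / (k - 1)
    sigma = width if width is not None else step  # assumed default width
    centers = [low + j * step for j in range(k)]
    return [math.exp(-((value - c) ** 2) / (2 * sigma ** 2)) for c in centers]


def linear(x, weights, bias):
    """Dense layer: one output per weight row."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def is_code(ngram_probs, fasttext_vec, params, k=10):
    """Return (decision, probability) for one word, independent of context."""
    feats = []
    for p in ngram_probs:
        feats.extend(gaussian_binning(p, 0.0, 1.0, k))  # vectorize each probability
    hidden = linear(feats, params["W1"], params["b1"])  # first linear layer
    joined = hidden + list(fasttext_vec)                # concatenate with FastText vector
    prob = sigmoid(linear(joined, params["W2"], params["b2"])[0])
    return prob > 0.5, prob
```

Compared with a hard one-hot bin, Gaussian binning gives neighbouring bins a non-zero response, so nearby probability values produce similar vectors.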

#### Entity segmentation
- **Word frequency**: the word occurrence count in the training set.

In the given context (StackOverflow), code and non-code entities have average frequencies of 1.47 and 7.41, respectively. Ambiguous tokens that can be either code or non-code entities, such as "windows", have a much higher average frequency of 92.57.
- **Code markdown**: whether the given token appears inside a ⟨code⟩ markdown in the StackOverflow post.

This signal is noisy, as users do not always enclose inline code in a ⟨code⟩ tag and sometimes use the tag to highlight non-code text.
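
As a rough sketch of how these two features could be collected, the snippet below counts token occurrences in a training set and gathers the tokens found inside `<code>` tags of a post. The function names, the whitespace splitting, and the type-level (rather than position-level) code flag are simplifying assumptions, not details from the paper.

```python
import re
from collections import Counter

# Matches the contents of <code>...</code> spans in a post body.
CODE_SPAN = re.compile(r"<code>(.*?)</code>", re.DOTALL)


def word_frequencies(training_sentences):
    """Count how often each token occurs in the tokenized training sentences."""
    return Counter(tok for sent in training_sentences for tok in sent)


def code_markdown_tokens(post_html):
    """Collect whitespace-separated tokens that appear inside <code> tags."""
    tokens = set()
    for span in CODE_SPAN.findall(post_html):
        tokens.update(span.split())
    return tokens


def segmenter_features(token, freqs, code_tokens):
    """Build the two hand-crafted features for one token."""
    return {"frequency": freqs.get(token, 0), "in_code_markdown": token in code_tokens}


freqs = word_frequencies([["use", "list"], ["a", "list"]])
code_tokens = code_markdown_tokens("Call <code>foo.bar()</code> on windows")
print(segmenter_features("list", freqs, code_tokens))
print(segmenter_features("foo.bar()", freqs, code_tokens))
```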

#### Embedding-Level Attention
For each word $w_i$, there are three embeddings: BERT ($w_{i1}$), Code Recognizer ($w_{i2}$), and Entity Segmenter ($w_{i3}$). The embedding-level attention $\alpha_{it}$ ($t \in \{1, 2, 3\}$) captures each embedding's contribution to the meaning of the word.

To compute $\alpha_{it}$, we pass the input embeddings through a bidirectional GRU and generate their corresponding hidden representations $h_{it} = \overleftrightarrow{GRU}(w_{it})$.

These vectors are then passed through a non-linear layer, which outputs $u_{it} = \tanh(W_e h_{it} + b_e)$.

$u_e$ is an embedding-level context vector, randomly initialized and updated during the training process.

This context vector is combined with the hidden embedding representation using a softmax function to obtain the weight of each embedding:
$$
  \alpha_{it} = \frac{\exp(u_{it}^\top u_e)}{\sum_{t'} \exp(u_{it'}^\top u_e)}
$$

Finally, we create the word vector as a weighted sum of the information from the different embeddings:
$$
  word_i = \sum_t \alpha_{it}h_{it}
$$
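
The attention computation above can be traced numerically with a small sketch. The GRU step is omitted here: the hidden vectors $h_{it}$ are taken as given inputs, and `W_e`, `b_e`, and `u_e` are toy values rather than trained parameters. The function name `embedding_attention` is an assumption for illustration.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def embedding_attention(h, W_e, b_e, u_e):
    """Compute attention weights over embeddings and the combined word vector.

    h   : list of hidden vectors h_it, one per embedding t
    W_e : weight matrix (list of rows), b_e : bias vector, u_e : context vector
    """
    # u_it = tanh(W_e h_it + b_e)
    u = [[math.tanh(dot(row, h_t) + b) for row, b in zip(W_e, b_e)] for h_t in h]
    # softmax over the scores u_it^T u_e (shifted by the max for stability)
    scores = [dot(u_t, u_e) for u_t in u]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    alpha = [e / total for e in exps]
    # word_i = sum_t alpha_it * h_it
    word = [sum(a * h_t[d] for a, h_t in zip(alpha, h)) for d in range(len(h[0]))]
    return alpha, word


h = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # toy hidden vectors for the 3 embeddings
alpha, word = embedding_attention(h, [[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0], [1.0, 1.0])
print(alpha, word)
```

With these toy values the third embedding aligns best with the context vector, so it receives the largest weight.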

The result is then fed into a linear-CRF layer, which predicts the entity category for each word based on the BIO tagging scheme.
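
To make the BIO output concrete, the sketch below groups per-token tags into entity spans. The tag strings (`B-APPLICATION`, `I-APPLICATION`, and so on) and the example sentence are illustrative assumptions; the exact label spelling in the released dataset may differ.

```python
def bio_to_spans(tokens, tags):
    """Group BIO-tagged tokens into (entity_type, text) spans."""
    spans = []
    current_type, current_tokens = None, []

    def flush():
        if current_tokens:
            spans.append((current_type, " ".join(current_tokens)))

    for token, tag in zip(tokens, tags):
        starts_new = tag.startswith("B-") or (
            tag.startswith("I-") and tag[2:] != current_type
        )
        if starts_new:
            # A B- tag (or a stray I- tag of another type) opens a new span.
            flush()
            current_type, current_tokens = tag[2:], [token]
        elif tag.startswith("I-"):
            current_tokens.append(token)  # continue the open span
        else:
            # "O": the token is outside any entity.
            flush()
            current_type, current_tokens = None, []
    flush()
    return spans


tokens = ["Open", "Visual", "Studio", "and", "call", "txScope.complete()"]
tags = ["O", "B-APPLICATION", "I-APPLICATION", "O", "O", "B-FUNCTION"]
print(bio_to_spans(tokens, tags))
```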

## Training

The SoftNER model and the two auxiliary models are trained separately. The segmentation model follows the simple BERT fine-tuning architecture except for the input, where BERT embeddings are concatenated with 100-dimensional code markdown and 10-dimensional word frequency features.

The number of bins for Gaussian vectorization is $k = 10$.

## Data
- [Pretrained BERTOverflow](https://github.com/lanwuwei/BERTOverflow)
- [BertOverflow weights and words](https://drive.google.com/drive/folders/1z4zXexpYU10QNlpcSA_UPfMb2V34zHHO)
### 20 entities
#### 8 code entities
- CLASS
- VARIABLE
- IN LINE CODE
- FUNCTION
- LIBRARY
- VALUE
- DATA TYPE
- HTML XML TAG

#### 12 natural language entities
- APPLICATION
- UI ELEMENT
- LANGUAGE
- DATA STRUCTURE
- ALGORITHM
- FILE TYPE
- FILE NAME
- VERSION
- DEVICE
- OS
- WEBSITE
- USER NAME