# III. Weakly Supervised Named Entity Recognition (NER)

We'll use the public [BioCreative V Chemical Disease Relation](https://biocreative.bioinformatics.udel.edu/tasks/biocreative-v/track-3-cdr/) (BC5CDR) dataset, focusing on Chemical entities. 

See `../applications/BC5CDR/` for the complete labeling function set used in our paper. 

### Installation Instructions

- Trove requires access to the [Unified Medical Language System (UMLS)](https://www.nlm.nih.gov/research/umls/licensedcontent/umlsknowledgesources.html), which is freely available after signing up for an account with the National Library of Medicine. See the notebook `1_Installing_the_UMLS` for detailed instructions on downloading and installing the UMLS.
- Unzip the preprocessed BioCreative V CDR chemical dataset, `bc5cdr.zip`.

In [1]:
%load_ext autoreload
%autoreload 2

import sys
sys.path.insert(0,'../../trove')


## A. Load Unlabeled Data & Define Entity Classes

### 1. Load Preprocessed Documents
This notebook assumes documents have already been preprocessed for sentence boundary detection and dumped into JSON format. See `preprocessing/README.md` and `2_NLP_Preprocessing.ipynb` for details.


In [2]:
%%time
import transformers
from trove.dataloaders import load_json_dataset

tokenizer = transformers.BertTokenizer.from_pretrained('bert-base-cased', do_lower_case=False)

data_dir = "data/bc5cdr/"
dataset = {
    split : load_json_dataset(f'{data_dir}/{split}.cdr.chemical.json', tokenizer)
    for split in ['train', 'dev', 'test']
}

Tagged Entities: 5203


Tokenization Error: Token is not a head token Annotation[Chemical](Cl|1240-1242) 19692487
Tokenization Error: Token is not a head token Annotation[Chemical](Cl|1579-1581) 15075188
Errors: Span Alignment: 2/5347 (0.0%)


Tagged Entities: 5345
Tagged Entities: 5385
CPU times: user 26.5 s, sys: 491 ms, total: 27 s
Wall time: 28.8 s


### 2. Define Entity Categories
In popular biomedical annotators such as [NCBO BioPortal](https://bioportal.bioontology.org/annotator), we configure the annotator by selecting a set of semantic categories which define our entity class and a corresponding set of ontologies mapped to those types.  

Trove uses a similar style of interface in API form. For `CHEMICAL` tagging, we define an entity class consisting of [UMLS Semantic Network](https://semanticnetwork.nlm.nih.gov/) types mapped to $\{-1,0,1\}$ (where -1 is _abstain_). The semantic network defines 127 concept categories called _Semantic Types_ (e.g., Disease or Syndrome, Medical Device), which are mappable to 15 coarser-grained _Semantic Groups_ (e.g., Anatomy, Chemicals & Drugs, Disorders).

We use the _Chemicals & Drugs_ (CHEM) semantic group as the basis of our positive class label $1$, abstaining on some categories (e.g., Gene or Genome) that do not match the definition of chemical as outlined in the BC5CDR annotation guidelines. Non-chemical STYs define our negative class label $0$.

In [5]:
import pandas as pd

# load the chemical entity definition
entity_def = pd.read_csv('data/chemical_semantic_types.tsv', sep='\t')
class_map = {row.TUI:row.LABEL for row in entity_def.itertuples() if row.LABEL != -1}
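
To make the mapping concrete, here is a minimal sketch of how an entity definition file could be turned into a class map. The TSV content is a toy stand-in, not the contents of `data/chemical_semantic_types.tsv`: the `NAME` column and the labels assigned to these four semantic types are assumptions for illustration. It uses the standard library `csv` module instead of pandas so that it runs on its own.

```python
import csv
import io

# Toy entity definition (illustrative rows only; NAME is an assumed column).
TOY_TSV = (
    "TUI\tNAME\tLABEL\n"
    "T109\tOrganic Chemical\t1\n"
    "T121\tPharmacologic Substance\t1\n"
    "T047\tDisease or Syndrome\t0\n"
    "T028\tGene or Genome\t-1\n"
)

rows = list(csv.DictReader(io.StringIO(TOY_TSV), delimiter="\t"))

# Keep positive (1) and negative (0) types; drop abstains (-1).
toy_class_map = {
    row["TUI"]: int(row["LABEL"]) for row in rows if int(row["LABEL"]) != -1
}
print(toy_class_map)
```

Types labeled -1 never enter the map, so any term that carries only those types contributes no vote at all.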


## B. Load Ontology Labeling Sources
### 1. Unified Medical Language System (UMLS) Metathesaurus
The UMLS Metathesaurus is a convenient source for deriving labels, since it provides over 200 source vocabularies (terminologies) with consistent entity categorization provided by the UMLS Semantic Network.

The first time this is run, Trove requires access to the UMLS installation zip file.


In [6]:
from trove.labelers.umls import UMLS

# setup defaults
UMLS.config(
    cache_root = "~/.trove/umls2020AB",
    backend = 'pandas'
)

if not UMLS.is_initalized():
    print('Please initialize the UMLS before running this notebook.')
    

We apply some minimal preprocessing to each source vocabulary's term set, as outlined in the Trove paper. The most important settings are:
- `SmartLowercase()`, a string matching heuristic for preserving likely abbreviations and acronyms
- `min_char_len`, `filter_rgx`, filters for terms that are single characters or numbers  

Other choices are largely for speed purposes, such as restricting the max token length used for string matching. 
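
As a rough illustration of what these settings do, the sketch below filters a handful of raw terms. The number regex is the same `filter_rgx` used in this notebook. Everything else is hypothetical: `keep_term` and `toy_smart_lowercase` are names made up here, and the rule "keep short all-caps terms, lowercase the rest" is only an assumed approximation of `SmartLowercase()`, whose actual heuristic may differ.

```python
import re

# Same number filter as the `filter_rgx` setting in this notebook.
NUMBER_RGX = re.compile(r'^[-+]*[0-9]+([.][0-9]+)*$')


def toy_smart_lowercase(term, max_abbrv_len=5):
    """Hypothetical stand-in for SmartLowercase: keep short all-caps terms."""
    if term.isupper() and len(term) <= max_abbrv_len:
        return term
    return term.lower()


def keep_term(term, min_char_len=2, max_tok_len=8, stopwords=frozenset()):
    """Return True if a term survives the length, stopword, and number filters."""
    if len(term) < min_char_len:
        return False
    if len(term.split()) > max_tok_len:
        return False
    if term in stopwords:
        return False
    if NUMBER_RGX.match(term):
        return False
    return True


raw_terms = ["Aspirin", "DNA", "5", "10.5", "K", "the"]
kept = [toy_smart_lowercase(t) for t in raw_terms if keep_term(t, stopwords={"the"})]
print(kept)
```

Single characters and plain numbers are dropped because they match far too many tokens to be useful dictionary entries.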


In [7]:
%%time
from trove.labelers.umls import UMLS
from trove.transforms import SmartLowercase

# english stopwords
with open('data/stopwords.txt', 'r') as fp:
    stopwords = set(fp.read().splitlines())
stopwords = stopwords.union(set([t[0].upper() + t[1:] for t in stopwords]))

# options for filtering terms
config = {
    "type_mapping"  : "TUI",  # TUI = semantic types, CUI = concept ids
    'min_char_len'  : 2,
    'max_tok_len'   : 8,
    'min_dict_size' : 500,
    'stopwords'     : stopwords,
    'transforms'    : [SmartLowercase()],
    'languages'     : {"ENG"},
    'filter_sabs'   : {"SNOMEDCT_VET"},
    'filter_rgx'    : r'''^[-+]*[0-9]+([.][0-9]+)*$'''  # filter numbers
}

umls = UMLS(**config)


CPU times: user 1min 24s, sys: 7.11 s, total: 1min 31s
Wall time: 1min 26s


In [8]:
%%time
import numpy as np

def map_entity_classes(dictionary, class_map):
    """
    Given a dictionary, create the term entity class probabilities
    """
    k = len([y for y in set(class_map.values()) if y != -1])
    ontology = {}
    for term in dictionary:
        proba = np.zeros(shape=k).astype(np.float32)
        for cls in dictionary[term]:
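            # ignore abstains
            idx = class_map[cls] if cls in class_map else -1
            if idx != -1:
                # label 1 (positive) -> column 0; label 0 (negative) -> column -1,
                # i.e., the last column, via negative indexing
                proba[idx - 1] += 1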
        # don't include terms that don't map to any classes
        if np.sum(proba) > 0:
            ontology[term] = proba / np.sum(proba)
    return ontology

# These are the top 4 ontologies as ranked by term overlap with the BC5CDR training set
terminologies = ['CHV', 'SNOMEDCT_US', 'NCI', 'MSH']

ontologies = {
    sab : map_entity_classes(umls.terminologies[sab], class_map)
    for sab in terminologies
}


CPU times: user 50.9 s, sys: 1.07 s, total: 52 s
Wall time: 52.4 s
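
To see what the class probabilities look like, here is a pure-Python mirror of `map_entity_classes` applied to a toy dictionary. The terms and their semantic types are invented for illustration, and `toy_map_entity_classes` is a hypothetical name; the counting logic follows the function above, including the `idx - 1` indexing.

```python
def toy_map_entity_classes(dictionary, class_map, k=2):
    """Pure-Python mirror of map_entity_classes for small examples."""
    ontology = {}
    for term, types in dictionary.items():
        counts = [0.0] * k
        for sty in types:
            label = class_map.get(sty, -1)
            if label != -1:
                # label 1 -> column 0, label 0 -> column -1 (the last column)
                counts[label - 1] += 1
        total = sum(counts)
        if total > 0:
            ontology[term] = [c / total for c in counts]
    return ontology


toy_class_map = {"T109": 1, "T121": 1, "T047": 0}
toy_dictionary = {
    "aspirin": {"T109", "T121"},         # only positive types
    "ambiguous term": {"T109", "T047"},  # one positive, one negative type
    "some gene": {"T028"},               # only an abstained type
}

toy_ontology = toy_map_entity_classes(toy_dictionary, toy_class_map)
print(toy_ontology)
```

A term whose types all map to abstain is left out entirely, while a term with mixed types gets a soft vote split across both classes.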


In [9]:
%%time

# create dictionaries for our Schwartz-Hearst abbreviation detection labelers
positive, negative = set(), set()

for sab in umls.terminologies:
    for term in umls.terminologies[sab]:
        for tui in umls.terminologies[sab][term]:
            if tui in class_map and class_map[tui] == 1:
                positive.add(term)
            elif tui in class_map and class_map[tui] == 0:
                negative.add(term)


CPU times: user 8.6 s, sys: 550 ms, total: 9.15 s
Wall time: 9.19 s


### 2. Additional Ontologies

We also want to utilize non-UMLS ontologies. External databases typically don't include rich mappings to Semantic Network types, so we treat each one as an ontology/dictionary mapping to a single class label.

- ChEBI Database
- CTD


In [32]:
from chebi import ChebiDatabase
from ctd import CtdDatabase

config = {
    'min_char_len'  : 2,
    'max_tok_len'   : 8,
    'min_dict_size' : 1,
    'stopwords'     : stopwords,
    'transforms'    : [SmartLowercase()],
    'languages'     : None,
    'filter_sources': None,
    'filter_rgx'    : r'''^[-+]*[0-9]+([.][0-9]+)*$'''  # filter numbers
}

chebi = ChebiDatabase(cache_path="~/.trove/chebi/", **config)
ctd = CtdDatabase(cache_path="~/.trove/", **config)

downloading ftp://ftp.ebi.ac.uk/pub/databases/chebi/Flat_file_tab_delimited/names.tsv.gz


CTD_diseases.csv.gz: 1.66MB [00:00, 8.39MB/s]                           
CTD_chemicals.csv.gz: 9.30MB [00:00, 11.4MB/s]                            


### 3. ADAM Biomedical Abbreviations

In [33]:
# TBD

## C. Create Sequence Labeling Functions
### 1. Guideline Labeling Functions

Annotation guidelines -- the instructions provided to domain experts when labeling training data -- can have a big impact on the generalizability of named entity classifiers. These instructions include seemingly simple choices such as whether to include determiners in entity spans ("the XXX") or more complex tagging choices like not labeling negated mentions of drugs. These choices are baked into the dataset and expensive to change.

With weak supervision, many of these annotation assumptions can be encoded as labeling functions, making training set changes faster, more flexible, and lower cost. For our `Chemical` labeling functions, we use the instructions provided [here](https://biocreative.bioinformatics.udel.edu/media/store/files/2015/bc5_CDR_data_guidelines.pdf) (pages 5-6) to create small dictionaries encoding some of these guidelines. Note that these can be easily expanded on, and in some cases complex rules (e.g., not annotating polypeptides with more than 15 amino acids) can be coupled with richer structured resources to create more sophisticated rules.

We also find it useful to include labeling functions that exclude numbers and punctuation tokens, another common flag in online biomedical annotators.


In [34]:
from trove.labelers.labeling import (
    OntologyLabelingFunction,
    DictionaryLabelingFunction, 
    RegexEachLabelingFunction
)

# load our guideline dictionaries
df = pd.read_csv('data/bc5cdr_guidelines.tsv', sep='\t')
guidelines = {
    t:np.array([1.,0.]) if y==1 else np.array([0.,1.]) 
    for t,y in zip(df.TERM, df.LABEL)
}

# use guideline negative examples as an additional stopword list
guideline_stopwords = {t:2 for t in df[df.LABEL==0].TERM}
stopwords = {t:2 for t in stopwords}

guideline_lfs = [
    OntologyLabelingFunction('guidelines', guidelines),
    DictionaryLabelingFunction('stopwords', stopwords, 2),
    DictionaryLabelingFunction('punctuation', set('!"#$%&*+,./:;<=>?@[\\]^_`{|}~'), 2),
    RegexEachLabelingFunction('numbers', [r'''^[-]*[1-9]+[0-9]*([.][0-9]+)*$'''], 2)
]
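
For intuition, the sketch below shows one way a dictionary labeling function and a per-token regex labeling function could produce votes for a single sentence. This is not Trove's implementation: `toy_dictionary_lf`, `toy_number_lf`, and the greedy longest-match strategy are assumptions made for illustration. The output format, a mapping from token index to label, follows how `create_word_lf_mat` reads labeling function output later in this notebook, and the number regex is the one used above.

```python
import re

# Same pattern as the 'numbers' labeling function above.
NUMBER_RGX = re.compile(r'^[-]*[1-9]+[0-9]*([.][0-9]+)*$')


def toy_dictionary_lf(words, terms, label, max_tok_len=8):
    """Greedy longest-match dictionary tagger (hypothetical sketch)."""
    labels = {}
    i = 0
    while i < len(words):
        matched = 0
        # Try the longest n-gram starting at position i first.
        for n in range(min(max_tok_len, len(words) - i), 0, -1):
            if " ".join(words[i:i + n]).lower() in terms:
                matched = n
                break
        if matched:
            for j in range(i, i + matched):
                labels[j] = label
            i += matched
        else:
            i += 1
    return labels


def toy_number_lf(words, label=2):
    """Vote negative (2) on every token that looks like a number."""
    return {i: label for i, w in enumerate(words) if NUMBER_RGX.match(w)}


words = "Patients received 20 mg of valproic acid daily .".split()
chem_votes = toy_dictionary_lf(words, {"valproic acid"}, label=1)
num_votes = toy_number_lf(words)
print(chem_votes, num_votes)
```

Tokens that receive no entry in the returned mapping are abstains for that labeling function.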


### 2. Semantic Type Labeling Functions

The bulk of our supervision comes from structured medical ontologies. 

In [35]:
%%time

ontology_lfs = [
    OntologyLabelingFunction(
        f'UMLS_{name}', 
        ontologies[name], 
        stopwords=guideline_stopwords 
    )
    for name in ontologies
]


CPU times: user 18 s, sys: 222 ms, total: 18.2 s
Wall time: 18.2 s


In [36]:
ext_ontology_lfs = [
    DictionaryLabelingFunction('CHEBI', chebi.terms(), 1, 
                               stopwords=guideline_stopwords),   
    DictionaryLabelingFunction('CTD_chemical', ctd.get_source_terms('chemical'), 1, 
                               stopwords=guideline_stopwords),
    DictionaryLabelingFunction('CTD_disease', ctd.get_source_terms('disease'), 2, 
                               stopwords=guideline_stopwords)
]


### 3. SynSet Labeling Functions

For biomedical concepts, abbreviations and acronyms (more generally "short forms") are a large source of ambiguity. 
These can be ambiguous to human readers as well, so authors of PubMed abstracts typically define ambiguous terms when they are introduced in text. We can take advantage of this redundancy to both handle ambiguous mentions and identify out-of-ontology short forms using classic text mining techniques such as the [Schwartz-Hearst algorithm](https://psb.stanford.edu/psb-online/proceedings/psb03/schwartz.pdf).

In [38]:
from trove.labelers.abbreviations import SchwartzHearstLabelingFunction

ontology_lfs += [
    SchwartzHearstLabelingFunction('UMLS_schwartz_hearst_1', positive, 1, stopwords=guideline_stopwords),
    SchwartzHearstLabelingFunction('UMLS_schwartz_hearst_2', negative, 2)
]
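
The following is a simplified sketch of the idea behind these labelers, not the code in `SchwartzHearstLabelingFunction`. It finds parenthesized short forms, matches their characters right to left against the preceding words (the matching step of the Schwartz-Hearst algorithm), and labels the short form when the recovered long form is in a dictionary. The short-form regex, the function names, and the toy dictionary are assumptions for illustration.

```python
import re

# Assumed short-form pattern: 2-10 characters in parentheses, starting with a letter.
SHORT_FORM_RGX = re.compile(r'\(([A-Za-z][A-Za-z0-9-]{1,9})\)')


def find_best_long_form(short_form, candidate):
    """Match short-form characters right to left against the candidate text."""
    s_idx = len(short_form) - 1
    l_idx = len(candidate) - 1
    while s_idx >= 0:
        c = short_form[s_idx].lower()
        if not c.isalnum():
            s_idx -= 1
            continue
        # Move left until the character matches; the first short-form
        # character must also sit at the start of a word.
        while ((l_idx >= 0 and candidate[l_idx].lower() != c) or
               (s_idx == 0 and l_idx > 0 and candidate[l_idx - 1].isalnum())):
            l_idx -= 1
        if l_idx < 0:
            return None
        l_idx -= 1
        s_idx -= 1
    start = candidate.rfind(" ", 0, l_idx + 1) + 1
    return candidate[start:]


def extract_abbreviations(sentence):
    """Return (short form, long form) pairs defined as 'long form (SF)'."""
    pairs = []
    for m in SHORT_FORM_RGX.finditer(sentence):
        short_form = m.group(1)
        # Search window from the paper: min(|SF| + 5, |SF| * 2) preceding words.
        window = min(len(short_form) + 5, len(short_form) * 2)
        candidate = " ".join(sentence[:m.start()].split()[-window:])
        long_form = find_best_long_form(short_form, candidate)
        if long_form:
            pairs.append((short_form, long_form))
    return pairs


positive_terms = {"methotrexate"}  # toy stand-in for the `positive` set
sentence = "Patients were treated with methotrexate (MTX) for six weeks ."
pairs = extract_abbreviations(sentence)
labeled = {sf: 1 for sf, lf in pairs if lf.lower() in positive_terms}
print(pairs, labeled)
```

Once a short form is linked to a known long form, later mentions of the short form in the same document can inherit its label even if the short form itself is missing from every ontology.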

### 4. Task-specific Labeling Functions

Ontology-based labeling functions can do surprisingly well on their own, but we can get more performance gains by adding custom labeling functions. For this demo, we focus on simple rules that are easy to create via data exploration, but any existing rule-based model can be transformed into a labeling function.

In [39]:
import re
from trove.labelers.labeling import RegexLabelingFunction

task_specific_lfs = []

# We noticed parentheses were causing errors so this labeling function 
# identifies negative examples, e.g. (n=100), (10%)
parens_rgxs = [
    r'''[(](p|n)\s*([><=]+|(less|great)(er)*)|(ml|mg|kg|g|(year|day|month)[s]*)[)]|[(][0-9]+[%][)]'''
]
# case insensitive 
parens_rgxs = [re.compile(rgx, re.I) for rgx in parens_rgxs]
task_specific_lfs.append(RegexLabelingFunction('LF_parentheses', parens_rgxs, 2))
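
Before adding a rule like this, it helps to check what the pattern actually matches. The sketch below runs the same regex, case-insensitively, over a few invented strings; how `RegexLabelingFunction` turns matches into token labels is not shown here.

```python
import re

# The pattern from the cell above, compiled case-insensitively.
PARENS_RGX = re.compile(
    r'''[(](p|n)\s*([><=]+|(less|great)(er)*)|(ml|mg|kg|g|(year|day|month)[s]*)[)]|[(][0-9]+[%][)]''',
    re.I,
)

examples = ["(n=100)", "(p < 0.05)", "(10%)", "(5 mg)", "(MTX)", "(morphine)", "(imaging)"]
hits = {text: bool(PARENS_RGX.search(text)) for text in examples}

for text, hit in hits.items():
    print(f"{text:12s} {'match' if hit else 'no match'}")
```

Note that the bare `g` unit alternative also fires on any word ending in "g" directly before a closing parenthesis, such as `(imaging)`, so the pattern may need tightening before it is enabled.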


In [40]:
lfs = guideline_lfs + ontology_lfs + ext_ontology_lfs #+ task_specific_lfs 

## D. Construct the Label Matrix $\Lambda$
### 1. Apply Sequence Labeling Functions

In [41]:
%%time
import itertools
from trove.labelers.core import SequenceLabelingServer

X_sents = [
    dataset['train'].sentences,
    dataset['dev'].sentences,
    dataset['test'].sentences,
]

labeler = SequenceLabelingServer(num_workers=4)
L_sents = labeler.apply(lfs, X_sents)


Parallel(n_jobs=4)
auto block size=3495
Partitioned into 4 blocks, [3494 3495] sizes
CPU times: user 21.2 s, sys: 8.09 s, total: 29.3 s
Wall time: 1min 18s


In [42]:
import itertools

splits = ['train', 'dev', 'test']
tag2idx = {'O':2, 'I-Chemical':1}

X_words = [
    np.array(list(itertools.chain.from_iterable([s.words for s in X_sents[i]]))) 
    for i,name in enumerate(splits)
]

X_seq_lens = [
    np.array([len(s.words) for s in X_sents[i]])
    for i,name in enumerate(splits)
]

X_doc_seq_lens = [  
    np.array([len(doc.sentences) for doc in dataset[name].documents]) 
    for i,name in enumerate(splits)
]

Y_words = [
    [dataset['train'].tagged(i)[-1] for i in range(len(dataset['train']))],
    [dataset['dev'].tagged(i)[-1] for i in range(len(dataset['dev']))],
    [dataset['test'].tagged(i)[-1] for i in range(len(dataset['test']))],
]

Y_words[0] = np.array([tag2idx[t] for t in list(itertools.chain.from_iterable(Y_words[0]))])
Y_words[1] = np.array([tag2idx[t] for t in list(itertools.chain.from_iterable(Y_words[1]))])
Y_words[2] = np.array([tag2idx[t] for t in list(itertools.chain.from_iterable(Y_words[2]))])
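
The cell above flattens sentence-level data into one long word sequence per split while keeping the sentence lengths. A toy example (invented sentences, plain lists instead of NumPy arrays) shows why both pieces are kept: the lengths are enough to recover the sentence boundaries.

```python
import itertools

toy_sentences = [["Aspirin", "reduces", "fever", "."], ["No", "effect", "."]]
toy_tags = [["I-Chemical", "O", "O", "O"], ["O", "O", "O"]]
tag2idx = {"O": 2, "I-Chemical": 1}

# Flatten words and gold tags into single sequences.
words = list(itertools.chain.from_iterable(toy_sentences))
seq_lens = [len(s) for s in toy_sentences]
y = [tag2idx[t] for t in itertools.chain.from_iterable(toy_tags)]

# Sentence boundaries can be rebuilt from the cumulative lengths.
offsets = [0] + list(itertools.accumulate(seq_lens))
rebuilt = [words[a:b] for a, b in zip(offsets, offsets[1:])]

print(len(words), seq_lens, y)
```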


### 2. Build the Label Matrix

In [43]:
%%time
from scipy.sparse import dok_matrix, vstack, csr_matrix

def create_word_lf_mat(Xs, Ls, num_lfs):
    """
    Create word-level LF matrix from LFs indexed by sentence/word
    0 words X lfs
    1 words X lfs
    2 words X lfs
    ...
    
    """
    Yws = []
    for sent_i in range(len(Xs)):
        ys = dok_matrix((len(Xs[sent_i].words), num_lfs))
        for lf_i in range(num_lfs):
            for word_i,y in Ls[sent_i][lf_i].items():
                ys[word_i, lf_i] = y
        Yws.append(ys)
    return csr_matrix(vstack(Yws))

L_words = [
    create_word_lf_mat(X_sents[0], L_sents[0], len(lfs)),
    create_word_lf_mat(X_sents[1], L_sents[1], len(lfs)),
    create_word_lf_mat(X_sents[2], L_sents[2], len(lfs)),
]


CPU times: user 46.8 s, sys: 685 ms, total: 47.5 s
Wall time: 48.4 s
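
Here is a small dense version of the same construction, using nested lists instead of SciPy sparse matrices. The sentences and labeling function outputs are invented, and `toy_word_lf_matrix` is a hypothetical name. Entries that no labeling function sets remain 0, which is why 0 acts as the abstain value at this stage.

```python
def toy_word_lf_matrix(sentences, lf_outputs, num_lfs):
    """Stack per-sentence (words x LFs) blocks into one word-level matrix."""
    rows = []
    for words, sent_out in zip(sentences, lf_outputs):
        block = [[0] * num_lfs for _ in words]  # 0 = abstain
        for lf_i in range(num_lfs):
            for word_i, y in sent_out[lf_i].items():
                block[word_i][lf_i] = y
        rows.extend(block)
    return rows


sentences = [["Aspirin", "reduces", "fever", "."], ["No", "effect", "."]]
# lf_outputs[sentence][lf] = {word index: label}
lf_outputs = [
    [{0: 1}, {3: 2}],  # sentence 0: LF0 tags "Aspirin", LF1 tags "."
    [{}, {2: 2}],      # sentence 1: LF0 abstains, LF1 tags "."
]

L_toy = toy_word_lf_matrix(sentences, lf_outputs, num_lfs=2)
for row in L_toy:
    print(row)
```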


### 3. Inspect Labeling Function Performance
Here we use the standard metrics displayed for Data Programming / Snorkel. 

In [44]:
from trove.metrics.analysis import lf_summary

lf_summary(L_words[0], Y=Y_words[0], lf_names=[lf.name for lf in lfs])


| LF | j | Polarity | Coverage% | Overlaps% | Conflicts% | Coverage | Correct | Incorrect | Emp. Acc. |
|---|---|---|---|---|---|---|---|---|---|
| guidelines | 0 | [1.0, 2.0] | 0.006085 | 0.004745 | 0.001539 | 704 | 678 | 26 | 0.963068 |
| stopwords | 1 | 2 | 0.282796 | 0.021618 | 0.00083 | 32717 | 32649 | 68 | 0.997922 |
| punctuation | 2 | 2 | 0.099489 | 0.004279 | 0.000251 | 11510 | 11425 | 85 | 0.992615 |
| numbers | 3 | 2 | 0.035387 | 0.002809 | 0.001737 | 4094 | 3790 | 304 | 0.925745 |
| UMLS_CHV | 4 | [1.0, 2.0] | 0.352145 | 0.339888 | 0.017555 | 40740 | 39696 | 1044 | 0.974374 |
| UMLS_SNOMEDCT_US | 5 | [1.0, 2.0] | 0.334633 | 0.329749 | 0.018039 | 38714 | 37829 | 885 | 0.97714 |
| UMLS_NCI | 6 | [1.0, 2.0] | 0.397032 | 0.351687 | 0.020477 | 45933 | 45115 | 818 | 0.982191 |
| UMLS_MSH | 7 | [1.0, 2.0] | 0.181172 | 0.17959 | 0.011392 | 20960 | 20427 | 533 | 0.974571 |
| UMLS_schwartz_hearst_1 | 8 | 1 | 0.006033 | 0.006033 | 0.003371 | 698 | 649 | 49 | 0.929799 |
| UMLS_schwartz_hearst_2 | 9 | 2 | 0.01313 | 0.01313 | 0.004892 | 1519 | 1207 | 312 | 0.794602 |
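
For readers new to these metrics, the sketch below computes them on a tiny invented label matrix. The definitions follow the usual Snorkel conventions as we understand them (coverage: fraction of words the LF labels; overlaps: labeled words where at least one other LF also votes; conflicts: labeled words where another LF votes differently; empirical accuracy: correct votes divided by all non-abstain votes). `toy_lf_summary` is a hypothetical helper, not `lf_summary` itself.

```python
def toy_lf_summary(L, Y, abstain=0):
    """Per-LF coverage, overlaps, conflicts, and empirical accuracy."""
    n, m = len(L), len(L[0])
    summary = []
    for j in range(m):
        covered = overlaps = conflicts = correct = 0
        for row, y in zip(L, Y):
            if row[j] == abstain:
                continue
            covered += 1
            others = [v for k, v in enumerate(row) if k != j and v != abstain]
            overlaps += bool(others)
            conflicts += any(v != row[j] for v in others)
            correct += row[j] == y
        summary.append({
            "coverage": covered / n,
            "overlaps": overlaps / n,
            "conflicts": conflicts / n,
            "emp_acc": correct / covered if covered else 0.0,
        })
    return summary


# 4 words x 2 LFs; labels: 1 = chemical, 2 = not chemical, 0 = abstain
L_toy = [[1, 1], [2, 1], [0, 2], [2, 0]]
Y_toy = [1, 2, 2, 2]
stats = toy_lf_summary(L_toy, Y_toy)
for j, s in enumerate(stats):
    print(j, s)
```

The same arithmetic applies to the summary above: the `guidelines` row has 678 correct votes out of 704, which gives the reported empirical accuracy of about 0.963.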


## E. Train the Label Model

In [45]:
# Trove uses a different internal mapping for labeling function abstains
def convert_label_matrix(L):
    # abstain is -1
    # negative is 0
    L = L.toarray().copy()
    L[L == 0] = -1
    L[L == 2] = 0
    return L

L_words_hat = [
    convert_label_matrix(L_words[0]),
    convert_label_matrix(L_words[1]),
    convert_label_matrix(L_words[2])
]

Y_words_hat = [
    np.array([0 if y == 2 else 1 for y in Y_words[0]]),
    np.array([0 if y == 2 else 1 for y in Y_words[1]]),
    np.array([0 if y == 2 else 1 for y in Y_words[2]])
]
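
A toy example makes the label conversion explicit. The version below uses a lookup table on plain lists; `toy_convert` is a hypothetical name, and the input matrix is invented. In the in-place NumPy version above the order of the two assignments matters: abstains (0) must be moved to -1 before negatives (2) are moved to 0, otherwise the two would collapse into one value.

```python
def toy_convert(L):
    """Map sparse-matrix labels to the label model convention."""
    # 0 (abstain) -> -1, 1 (chemical) -> 1, 2 (not chemical) -> 0
    mapping = {0: -1, 1: 1, 2: 0}
    return [[mapping[v] for v in row] for row in L]


L_toy = [[1, 0], [0, 2], [2, 2]]
L_toy_hat = toy_convert(L_toy)
print(L_toy_hat)
```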


In [46]:
import functools
from trove.models.model_search import grid_search
from snorkel.labeling.model.label_model import LabelModel

np.random.seed(1234)

n = L_words_hat[0].shape[0]

param_grid = {
    'lr': [0.01, 0.005, 0.001, 0.0001],
    'l2': [0.001, 0.0001],
    'n_epochs': [50, 100, 200, 600, 700, 1000],
    'prec_init': [0.6, 0.7, 0.8, 0.9],
    'optimizer': ["adamax"], 
    'lr_scheduler': ['constant'],
}

model_class_init = {
    'cardinality': 2, 
    'verbose': True
}

n_model_search = 25
num_hyperparams = functools.reduce(lambda x,y:x*y, [len(x) for x in param_grid.values()])
print("Hyperparameter Search Space:", num_hyperparams)


L_train      = L_words_hat[0]
Y_train      = Y_words_hat[0]
L_dev        = L_words_hat[1]
Y_dev        = Y_words_hat[1]

label_model, best_config = grid_search(LabelModel, 
                                       model_class_init, 
                                       param_grid,
                                       train = (L_train, Y_train, X_seq_lens[0]),
                                       dev = (L_dev, Y_dev, X_seq_lens[1]),
                                       n_model_search=n_model_search, 
                                       val_metric='f1', 
                                       seq_eval=True,
                                       seed=1234,
                                       tag_fmt_ckpnt='IO')

Hyperparameter Search Space: 192
Using SEQUENCE dev checkpointing
Using IO dev checkpointing
Grid search over 25 configs
[0] Label Model
[1] Label Model
[2] Label Model
[3] Label Model
[4] Label Model
[5] Label Model
[6] Label Model
[7] Label Model
[8] Label Model
{'lr': 0.0001, 'l2': 0.0001, 'n_epochs': 600, 'prec_init': 0.6, 'optimizer': 'adamax', 'lr_scheduler': 'constant'}
[TRAIN] accuracy: 97.94 | precision: 83.47 | recall: 85.63 | f1: 84.53
[DEV]   accuracy: 98.17 | precision: 85.67 | recall: 87.97 | f1: 86.81
----------------------------------------------------------------------------------------
[9] Label Model
[10] Label Model
[11] Label Model
[12] Label Model
[13] Label Model
[14] Label Model
[15] Label Model
[16] Label Model
[17] Label Model
[18] Label Model
[19] Label Model
[20] Label Model
[21] Label Model
[22] Label Model
[23] Label Model
[24] Label Model
BEST
{'lr': 0.0001, 'l2': 0.0001, 'n_epochs': 600, 'prec_init': 0.6, 'optimizer': 'adamax', 'lr_scheduler': 'constant'}
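
The search space of 192 configurations is the product of the option counts in `param_grid`, and only 25 of them are evaluated. The sketch below enumerates the grid and draws 25 configurations with a fixed seed. Sampling without replacement via `random.Random.sample` is an assumption made for illustration; Trove's `grid_search` may select configurations differently.

```python
import itertools
import random

param_grid = {
    'lr': [0.01, 0.005, 0.001, 0.0001],
    'l2': [0.001, 0.0001],
    'n_epochs': [50, 100, 200, 600, 700, 1000],
    'prec_init': [0.6, 0.7, 0.8, 0.9],
    'optimizer': ["adamax"],
    'lr_scheduler': ['constant'],
}

# Enumerate every combination of hyperparameter values.
keys = list(param_grid)
all_configs = [
    dict(zip(keys, values)) for values in itertools.product(*param_grid.values())
]

# Assumed selection strategy: a seeded random subset without replacement.
rng = random.Random(1234)
sampled = rng.sample(all_configs, 25)

print(len(all_configs), len(sampled))
```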

In [47]:
from trove.metrics import eval_label_model

print("BIO Tag Format")
for i in range(3):
    eval_label_model(label_model, L_words_hat[i], Y_words_hat[i], X_seq_lens[i])
    print('-' * 80)


BIO Tag Format
[Label Model]   accuracy: 97.94 | precision: 83.47 | recall: 85.63 | f1: 84.53
[Majority Vote] accuracy: 97.53 | precision: 75.69 | recall: 83.49 | f1: 79.40
--------------------------------------------------------------------------------
[Label Model]   accuracy: 98.17 | precision: 85.67 | recall: 87.97 | f1: 86.81
[Majority Vote] accuracy: 97.79 | precision: 78.05 | recall: 86.08 | f1: 81.87
--------------------------------------------------------------------------------
[Label Model]   accuracy: 98.38 | precision: 84.95 | recall: 87.72 | f1: 86.31
[Majority Vote] accuracy: 97.82 | precision: 75.51 | recall: 85.50 | f1: 80.20
--------------------------------------------------------------------------------
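
The majority vote rows above are the baseline the label model is compared against. A minimal sketch of that baseline with token-level precision, recall, and F1 follows. The tie-breaking rule (fall back to the negative class) and the helper names are assumptions, the data is invented, and scoring per token is a simplification of the sequence-level evaluation used in this notebook.

```python
from collections import Counter


def majority_vote(row, abstain=-1, default=0):
    """Most common non-abstain vote; ties and all-abstain rows get `default`."""
    votes = Counter(v for v in row if v != abstain)
    if not votes:
        return default
    top = votes.most_common()
    if len(top) > 1 and top[0][1] == top[1][1]:
        return default
    return top[0][0]


def token_prf(y_true, y_pred, positive=1):
    """Token-level precision, recall, and F1 for the positive class."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# 5 words x 3 LFs; 1 = chemical, 0 = not chemical, -1 = abstain
L_toy = [[1, 1, -1], [1, 0, 0], [-1, -1, -1], [1, 0, -1], [1, -1, -1]]
Y_toy = [1, 0, 0, 1, 1]

preds = [majority_vote(row) for row in L_toy]
precision, recall, f1 = token_prf(Y_toy, preds)
print(preds, precision, recall, f1)
```

Majority vote weights every labeling function equally, whereas the label model estimates per-function accuracies, which is one plausible reason for the precision gap in the results above.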


## F. Export Probabilistic CoNLL

In [None]:
#
# TBD
#