This notebook guides you through training and validating the ChemPred model. It uses the Python package sciNER, developed specifically for scientific named entity recognition. Because the package was still under development at the time of writing, some APIs may have changed since. The following code was tested with the package version pushed to the branch `chempred-pub`.

In [None]:
# Import all the necessary packages

import operator as op
import re
from itertools import chain, repeat, starmap

import numpy as np
from fn import F

from sciner import util, intervals
from sciner.corpora import corpus, chemdner
from sciner.preprocessing import encoding, preprocessing, sampling, parsing
from sciner.util import oldmap


We now need to define a mapping to encode CHEMDNER annotations as positive integers. Any unmentioned class will be ignored by the annotation utilities. As per the publication, we ignore the `IDENTIFIER` class. We also set the limits for sentence lengths (`nsteps`) and token lengths (`charlen`). We set these limits so that only 15 sentences exceed `nsteps` and no tokens exceed `charlen`. You can choose whether to discard or trim sentences/words exceeding the limit. We used the latter option (trimming).

In [None]:
mapping = corpus.parse_mapping(
    ["ABBREVIATION:1",
     "FAMILY:2",
     "FORMULA:3",
     "MULTIPLE:4",
     "TRIVIAL:5",
     "SYSTEMATIC:6"]
)

## Limits
nsteps = 200
charlen = 30
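
The difference between trimming and discarding over-long sentences can be illustrated with a minimal sketch. The helper `pad_or_trim` below is hypothetical and is not part of sciNER; it only mimics the general idea of joining variable-length sequences into a fixed-width array with a boolean mask, assuming zero-padding.

```python
import numpy as np


def pad_or_trim(sequences, length, trim=True):
    """Hypothetical helper (not the sciNER API): join integer sequences
    into an (n, length) array plus a boolean mask of real positions.

    trim=True cuts over-long sequences down to `length`;
    trim=False discards them instead.
    """
    kept = [seq[:length] for seq in sequences if trim or len(seq) <= length]
    data = np.zeros((len(kept), length), dtype=np.int32)
    mask = np.zeros((len(kept), length), dtype=bool)
    for i, seq in enumerate(kept):
        data[i, :len(seq)] = seq
        mask[i, :len(seq)] = True
    return data, mask


sentences = [[1, 2, 3], [4, 5, 6, 7, 8, 9], [10]]
trimmed, trimmed_mask = pad_or_trim(sentences, length=4, trim=True)
discarded, discarded_mask = pad_or_trim(sentences, length=4, trim=False)
```

With trimming, all three toy sentences survive and the long one loses its tail; with discarding, the long sentence is dropped altogether.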

Here we parse and sample the abstracts, stored in `sciner.corpora.corpus.Abstract` objects along with their annotations and sentence borders. 

In [None]:
def process_abstracts(tokeniser, abstracts, mapping=None):
            
    def flatten(arr):
        def f(x):
            pos = x.nonzero()[-1]
            return np.random.choice(pos[pos > 0]) if pos.any() else 0
        return np.apply_along_axis(f, 1, arr)
    
    flat_abstracts = map(corpus.flatten_abstract, abstracts)
    ids, srcs, texts, annotations, borders = zip(*chain.from_iterable(flat_abstracts))
    # parse texts and sample tokens within sentences
    parsed_texts = list(map(tokeniser, texts))
    samples = list(starmap(sampling.sample_sentences, zip(borders, parsed_texts)))
    tokens = (F(map, F(map, intervals.unload) >> F(map, list)) >> list)(samples)
    # make annotations if necessary
    if mapping is not None:
        nlabels = len(set(mapping.values()) | {0})
        anno_encoder = F(encoding.encode_annotation, mapping)
        border_encoder = F(encoding.encode_annotation, mapping, start_only=True)
        enc_annotations = list(starmap(anno_encoder, zip(annotations, map(len, texts))))
        enc_borders = list(starmap(border_encoder, zip(annotations, map(len, texts))))
        sample_annotations = [[flatten(preprocessing.annotate_sample(nlabels, anno, s)) for s in samples_]
                              for anno, samples_ in zip(enc_annotations, samples)]
        entity_borders = [[flatten(preprocessing.annotate_sample(nlabels, b_anno, s)) for s in samples_]
                           for b_anno, samples_ in zip(enc_borders, samples)]
    else:
        sample_annotations = repeat(repeat(None))
        entity_borders = repeat(repeat(None))
    return zip(*util.flatzip([ids, srcs], [samples, tokens, sample_annotations, entity_borders]))


def join_nested(arrays, nsteps, nfeatures, trim=True):
    joined_features = (F(map, F(util.join, length=nfeatures, trim=trim)) >> (map, op.itemgetter(0)) >> list)(arrays)
    return util.join(joined_features, nsteps, trim=trim)
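
To see what the nested `flatten` helper does, it can be run on a toy array. The input layout below (one row per token, one column per label, column 0 meaning "no annotation") is an assumption about what `preprocessing.annotate_sample` returns, not something the notebook confirms.

```python
import numpy as np


def flatten(arr):
    # Same logic as the nested helper inside process_abstracts
    def f(x):
        pos = x.nonzero()[-1]  # indices of the non-zero columns in this row
        # pick one positive label at random; fall back to 0 (no annotation)
        return np.random.choice(pos[pos > 0]) if pos.any() else 0
    return np.apply_along_axis(f, 1, arr)


# Toy input (assumed layout): rows are tokens, columns are label ids
token_labels = np.array([
    [1, 0, 0, 0],  # only the background column is set
    [0, 0, 1, 0],  # label 2
    [0, 1, 0, 1],  # labels 1 and 3 overlap on this token
    [0, 0, 0, 0],  # nothing set at all
])
flat = flatten(token_labels)
```

Each token ends up with exactly one integer label. When several positive labels overlap on a token, one of them is chosen at random, so repeated runs may differ on such tokens.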

Separate the data into training, validation and test datasets. As per the publication, we used 10% of the training and development CHEMDNER datasets for in-training validation. You can optionally save the data by uncommenting the `joblib.dump` lines at the end of the cell (this requires `import joblib`).

In [None]:
abstracts1 = (
list(chemdner.align_abstracts(
    chemdner.parse_abstracts("chemdner_corpus/training.abstracts.txt"), 
    chemdner.parse_annotations("chemdner_corpus/training.annotations.txt"),
    chemdner.parse_borders("chemdner_corpus/training.borders.tsv")))
    +
list(chemdner.align_abstracts(
    chemdner.parse_abstracts("chemdner_corpus/development.abstracts.txt"), 
    chemdner.parse_annotations("chemdner_corpus/development.annotations.txt"),
    chemdner.parse_borders("chemdner_corpus/development.borders.tsv")))
)
abstracts2 = list(chemdner.align_abstracts(
    chemdner.parse_abstracts("chemdner_corpus/testing.abstracts.txt"), 
    chemdner.parse_annotations("chemdner_corpus/testing.annotations.txt"),
    chemdner.parse_borders("chemdner_corpus/testing.borders.tsv")))

valsplit = 0.1
ntrain = int(len(abstracts1) * (1 - valsplit))


abstracts_train = util.oldmap(F(util.oldmap, tuple), abstracts1[:ntrain])
abstracts_val = util.oldmap(F(util.oldmap, tuple), abstracts1[ntrain:])
abstracts_test = util.oldmap(F(util.oldmap, tuple), abstracts2)

# joblib.dump(abstracts_train, "abstracts_train.joblib", 1)
# joblib.dump(abstracts_val, "abstracts_val.joblib", 1)
# joblib.dump(abstracts_test, "abstracts_test.joblib", 1)
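
Note that the split above is positional: the last 10% of the combined list becomes the validation set, with no shuffling. A minimal sketch with placeholder items (not real abstracts) shows the arithmetic:

```python
valsplit = 0.1
documents = [f"doc{i}" for i in range(20)]  # placeholder items, not real abstracts

# Same arithmetic as in the cell above
ntrain = int(len(documents) * (1 - valsplit))
train, val = documents[:ntrain], documents[ntrain:]
```

Here 18 items go to training and the final 2 to validation. If the order of the source files is not random, shuffling before slicing may be worth considering.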

Here we define the tokeniser and the transform. Since we replaced all numeric sequences with a special token `<NUMERIC>` to better train the GloVe embeddings, we now need to pass the same transform to the `WordEncoder`. `<unk>` is the standard identifier of GloVe's out-of-vocabulary (OOV) vector. We also "train" the character encoder, which simply means building the set of all characters in the corpus.

In [None]:
tokeniser = F(parsing.tokenise, [re.compile(r"\w+|[^\s\w]")])
transform = F(parsing.transform, [(parsing.numeric, "<NUMERIC>")])

texts = chain.from_iterable(oldmap(lambda x: x[0][1:], abstracts_train) + 
                            oldmap(lambda x: x[0][1:], abstracts_val) + 
                            oldmap(lambda x: x[0][1:], abstracts_test))

word_encoder = encoding.WordEncoder("embeddings-numeric/vectors-300.txt", "<unk>", transform)
char_encoder = encoding.CharEncoder("\n".join(texts))
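
The effect of the token pattern and the numeric transform can be sketched with the standard `re` module alone. The functions `tokenise` and `transform` below are simplified stand-ins that work on plain strings; `NUMERIC` is an assumed pattern standing in for `parsing.numeric`, whose actual definition is not shown in this notebook.

```python
import re

TOKEN = re.compile(r"\w+|[^\s\w]")  # same pattern as passed to the tokeniser
NUMERIC = re.compile(r"\d+")        # assumption: stand-in for parsing.numeric


def tokenise(text):
    """Split text into runs of word characters and single punctuation marks."""
    return TOKEN.findall(text)


def transform(token):
    """Replace purely numeric tokens with the special <NUMERIC> token."""
    return "<NUMERIC>" if NUMERIC.fullmatch(token) else token


tokens = tokenise("Heated Fe(OH)3 at 250 C.")
normalised = [transform(t) for t in tokens]
```

The pattern splits `Fe(OH)3` into `Fe`, `(`, `OH`, `)` and `3`, so brackets and trailing digits become separate tokens. Under the assumed numeric pattern, `3` and `250` both map to `<NUMERIC>`.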

The `process_abstracts` utility we've defined above returns aligned tuples of text identifiers (PMIDs), sources (abstract's title or body), samples (intervals corresponding to sentence boundaries), token strings (`ws`), token entity-part annotations and token entity-beginning annotations.

In [None]:
ids, srcs, samples, ws, w_anno, b_anno = process_abstracts(tokeniser, abstracts_train, mapping)
ids_val, srcs_val, samples_val, ws_val, w_anno_val, b_anno_val = process_abstracts(tokeniser, abstracts_val, mapping)
ids_test, srcs_test, samples_test, ws_test, w_anno_test, b_anno_test = process_abstracts(tokeniser, abstracts_test, mapping)

We then encode and join the samples.

In [None]:
encode_words = (F(map, F(word_encoder.encode, vectors=True)) >> list 
                >> F(util.join, length=nsteps, trim=True))
encode_chars = (F(map, char_encoder.encode) >> list 
                >> F(join_nested, nsteps=nsteps, nfeatures=charlen))

encoded_words, word_mask = encode_words(ws)
encoded_characters, char_mask = encode_chars(ws)
word_annotations, anno_mask = util.join(w_anno, nsteps, trim=True)
border_annotations, border_mask = util.join(b_anno, nsteps, trim=True)
prob_masks = word_mask.astype(np.float32)

encoded_words_val, word_mask_val = encode_words(ws_val)
encoded_characters_val, char_mask_val = encode_chars(ws_val)
word_annotations_val, anno_mask_val = util.join(w_anno_val, nsteps, trim=True)
border_annotations_val, border_mask_val = util.join(b_anno_val, nsteps, trim=True)
prob_masks_val = word_mask_val.astype(np.float32)
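
The character input is a nested structure: each sentence holds up to `nsteps` tokens, and each token holds up to `charlen` character codes. The sketch below shows one way such an array could be built. `build_char_index` and `encode_sentence` are hypothetical helpers, not the `CharEncoder` or `join_nested` implementations, and they assume that code 0 is reserved for padding.

```python
import numpy as np


def build_char_index(text):
    """Map every distinct character to a positive integer; 0 is kept for padding."""
    return {ch: i for i, ch in enumerate(sorted(set(text)), start=1)}


def encode_sentence(tokens, char_index, nsteps, charlen):
    """Encode one sentence as an (nsteps, charlen) integer array, trimming overflow."""
    encoded = np.zeros((nsteps, charlen), dtype=np.int32)
    for i, token in enumerate(tokens[:nsteps]):
        # unknown characters fall back to 0 in this sketch
        codes = [char_index.get(ch, 0) for ch in token[:charlen]]
        encoded[i, :len(codes)] = codes
    return encoded


char_index = build_char_index("NaCl in water")
sentence = encode_sentence(["NaCl", "in", "water"], char_index, nsteps=5, charlen=4)
```

Stacking such arrays for all sentences gives the `(nsamples, nsteps, charlen)` shape expected by the `characters` input of the model.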

You can peek into a sample:

In [None]:
list(zip(ws[12], word_annotations[12], border_annotations[12]))

Here we define the model. If you have several GPUs and only want to activate one, you can uncomment the first two lines and specify a GPU id. 

In [None]:
# import os
# os.environ["CUDA_VISIBLE_DEVICES"] = "1"

from keras import layers, models, optimizers
from sklearn import metrics

from sciner.models import build
from sciner.models.metrics import Validator

wordemb_dim = 300
charemb_dim = 50
units = 30
layer = layers.GRU


masks = layers.Input((nsteps, 1), name="masks", dtype="float32")

wordemb = layers.Input((nsteps, wordemb_dim), name="wordemb")
wordcnn = build.cnn([200, 250], 2, [0.3, None], name_template="wordcnn{}")(wordemb)
wordcnn = layers.multiply([wordcnn, masks])
wordcnn = layers.Masking(0.0)(wordcnn)

characters = layers.Input((nsteps, charlen), dtype="int32", name="characters")
charemb = build.char_embeddings(len(char_encoder), nsteps, charemb_dim, units, 0.3, 0.3, mask=True, layer=layer)(characters) 
charcnn = build.cnn([200, 250], 2, [0.3, None], name_template="charcnn{}")(charemb)
charcnn = layers.multiply([charcnn, masks])
charcnn = layers.Masking(0.0)(charcnn)

merged = layers.concatenate([wordcnn, charcnn], axis=-1)
rnn = build.rnn([150, 150], 0.1, 0.1, bidirectional="concat", layer=layer)(merged)

rnn_runs = build.rnn([150], 0.1, 0.1, bidirectional="concat", layer=layer)(rnn)
output_runs = layers.Dense(1, activation="sigmoid")(rnn_runs)

rnn_borders = layers.multiply([output_runs, rnn])
rnn_borders = build.rnn([150], 0.1, 0.1, bidirectional="concat", layer=layer)(rnn_borders)
output_borders = layers.Dense(1, activation="sigmoid")(rnn_borders)


model = models.Model([wordemb, characters, masks], [output_runs, output_borders])
model.compile(optimizer=optimizers.Adam(clipvalue=1.0), loss="binary_crossentropy",
              sample_weight_mode="temporal")
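
Two element-wise multiplications in the model deserve a closer look: multiplying the CNN output by `masks` before `Masking(0.0)`, and multiplying the shared RNN output by `output_runs` before the border head. The NumPy sketch below mimics both with toy arrays. The shapes and values are illustrative assumptions, not taken from the trained model.

```python
import numpy as np

rng = np.random.default_rng(0)
features = rng.normal(size=(1, 4, 3))             # (batch, steps, features), e.g. a CNN output
masks = np.array([[[1.0], [1.0], [0.0], [0.0]]])  # (batch, steps, 1); last two steps are padding

# Broadcasting over the feature axis zeroes every feature of a padded step
masked = features * masks

# Keras' Masking(0.0) skips a time step when all of its features equal 0.0
skipped = np.all(masked == 0.0, axis=-1)

# The border head sees the shared representation scaled by the run probability
run_prob = np.array([[[0.9], [0.1], [0.5], [0.5]]])  # toy sigmoid outputs of the runs head
gated = run_prob * features
```

The convolution can produce non-zero activations at padded positions, so zeroing them explicitly is what lets the `Masking` layer recognise those steps. The second product acts as a soft gate: steps that the runs head considers unlikely to be part of an entity contribute less to the border prediction.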

Finally, we define validation callbacks to checkpoint the model and save weights upon improvements in the F1-score. Since the model has two outputs solving different, albeit related, objectives, it is better to monitor their improvements separately.

In [None]:
inputs = [encoded_words, encoded_characters, prob_masks[:,:,None]]
output_runs = np.clip(word_annotations, 0, 1)[:,:,None]
output_borders = np.clip(border_annotations, 0, 1)[:,:,None]
inputs_val = [encoded_words_val, encoded_characters_val, prob_masks_val[:,:,None]]
output_runs_val = np.clip(word_annotations_val, 0, 1).flatten()
output_borders_val = np.clip(border_annotations_val, 0, 1).flatten()

scores = {"precision": F(metrics.precision_score, average="binary", labels=[1]),
          "recall": F(metrics.recall_score, average="binary", labels=[1]),
          "f1": F(metrics.f1_score, average="binary", labels=[1])}

! mkdir -p trainlogs
logfile = open("trainlogs/log.txt", "w")
f1_runs = Validator(inputs_val, output_runs_val, 100, scores, lambda x: (x[0] > 0.5).astype(int).flatten(), "f1",
                    prefix="trainlogs/runs",
                    stream=logfile)
f1_borders = Validator(inputs_val, output_borders_val, 100, scores, lambda x: (x[1] > 0.5).astype(int).flatten(), "f1",
                       prefix="trainlogs/borders",
                       stream=logfile)
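
The scores monitored by the validators are standard binary precision, recall and F1 computed on thresholded, flattened predictions. A minimal NumPy sketch of that computation follows. `binary_scores` is a hypothetical helper written for illustration; the notebook itself relies on `sklearn.metrics`.

```python
import numpy as np


def binary_scores(y_true, y_prob, threshold=0.5):
    """Threshold probabilities, flatten, and compute precision, recall and F1."""
    y_pred = (np.asarray(y_prob) > threshold).astype(int).flatten()
    y_true = np.asarray(y_true).flatten()
    tp = int(((y_pred == 1) & (y_true == 1)).sum())
    fp = int(((y_pred == 1) & (y_true == 0)).sum())
    fn = int(((y_pred == 0) & (y_true == 1)).sum())
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


scores_example = binary_scores([1, 0, 1, 1, 0], [0.9, 0.8, 0.4, 0.7, 0.1])
```

In this toy case there are two true positives, one false positive and one false negative, so precision, recall and F1 all equal 2/3.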

In [None]:
model.fit(inputs, [output_runs, output_borders],
          verbose=1, epochs=50, batch_size=32,
          initial_epoch=0,
          callbacks=[f1_runs, f1_borders])