Here we guide you through loading and using the CHEMDNER model pack. First, some imports. Note that the first two lines only matter with a CUDA-enabled TensorFlow installation: they force it to run in CPU-only mode. If you would rather use a GPU, specify a GPU device number instead, e.g. `os.environ['CUDA_VISIBLE_DEVICES'] = '0'`. If you don't have a GPU or are using a CPU-only TensorFlow build, you can remove these lines altogether.

In [1]:
import os
os.environ['CUDA_VISIBLE_DEVICES'] = ''

from scilk.collections.chemdner import loaders
from scilk.corpora import corpus
from scilk.corpora.chemdner import parse_abstracts
from scilk.util import intervals

Using TensorFlow backend.
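
The device choice described above can be wrapped in a small helper. This is only a sketch: `select_device` is a hypothetical name, not part of scilk, and it assumes the variable is set before TensorFlow is imported, since changing it afterwards is not expected to take effect.

```python
import os

def select_device(gpu_id=None):
    """Hide all GPUs when gpu_id is None, otherwise expose only that device.

    Hypothetical helper for illustration; call it before importing TensorFlow.
    """
    os.environ['CUDA_VISIBLE_DEVICES'] = '' if gpu_id is None else str(gpu_id)
    return os.environ['CUDA_VISIBLE_DEVICES']

select_device()       # CPU-only mode, as in the cell above
# select_device(0)    # expose only the first GPU instead
```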


We then need to specify the paths to the serialised data and load the tokeniser and the NER model.

In [2]:
root = 'chemdner-collection'  # the directory unpacked from chemdner-collection.tgz

tokeniser_data = {
    'tokeniser_weights': f'{root}/tokeniser-weights.hdf5',
    'charmap': f'{root}/charmap.joblib'
}
detector_data = {
    'embeddings': f'{root}/vectors.txt.gz',
    'charmap': f'{root}/charmap.joblib',
    'ner_weights': f'{root}/ner-weights.hdf5'
}

tokeniser = loaders.load_tokeniser(tokeniser_data)
detector = loaders.load_detector(detector_data)
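
Since the loaders depend on several files being present, it may help to check the paths before loading. The following is a minimal sketch using only the standard library; `missing_files` is a hypothetical helper, not part of scilk, and the example dictionary simply mirrors the paths used above.

```python
from pathlib import Path

def missing_files(paths):
    """Return the keys of `paths` whose files do not exist on disk."""
    return [name for name, path in paths.items() if not Path(path).is_file()]

# Example dictionary mirroring the tokeniser paths above
example = {
    'tokeniser_weights': 'chemdner-collection/tokeniser-weights.hdf5',
    'charmap': 'chemdner-collection/charmap.joblib',
}

missing = missing_files(example)
if missing:
    print('Missing files:', ', '.join(missing))
```

Running the same check on `tokeniser_data` and `detector_data` before calling the loaders would point to a misplaced archive or a wrong `root` early on.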

Here we use the CHEMDNER corpus, but you can work with any other texts.

In [3]:
texts = [ab.body for ab in parse_abstracts('data/chemdner_corpus/testing.abstracts.txt')]
tokenised = tokeniser(texts[:100])  # only tokenise the first 100 texts to save time
annotations = detector(tokenised)  # note: you must pass texts preprocessed by the bundled tokeniser

In [4]:
annotations[:5]

[[Interval(start=94, stop=113, data=['poly(vinyl', 'alcohol)']),
  Interval(start=115, stop=118, data=PVA),
  Interval(start=243, stop=246, data=PVA),
  Interval(start=485, stop=488, data=PVA),
  Interval(start=670, stop=673, data=PVA),
  Interval(start=703, stop=708, data=thiol),
  Interval(start=761, stop=764, data=PVA),
  Interval(start=820, stop=823, data=PVA),
  Interval(start=1154, stop=1157, data=PVA),
  Interval(start=1301, stop=1304, data=PVA),
  Interval(start=1369, stop=1372, data=PVA)],
 [Interval(start=667, stop=684, data=estrone-3-sulfate),
  Interval(start=686, stop=691, data=E-3-S),
  Interval(start=697, stop=713, data=['taurocholic', 'acid'])],
 [Interval(start=294, stop=295, data=C),
  Interval(start=444, stop=452, data=tyrosine)],
 [Interval(start=42, stop=46, data=DOPC),
  Interval(start=92, stop=103, data=octadecanol),
  Interval(start=113, stop=120, data=Au(111)),
  Interval(start=214, stop=225, data=octadecanol),
  Interval(start=278, stop=289, data=octadecanol),

Note that the NER model returns annotations as `Interval` objects (defined in `scilk.util.intervals`), each representing a character range within the source text that corresponds to a detected named entity. Finally, we can extract the text covered by these intervals.

In [5]:
entities = [intervals.extract(text, ivs) for text, ivs in zip(texts, annotations)]

In [6]:
entities[:5]

[['poly(vinyl alcohol)',
  'PVA',
  'PVA',
  'PVA',
  'PVA',
  'thiol',
  'PVA',
  'PVA',
  'PVA',
  'PVA',
  'PVA'],
 ['estrone-3-sulfate', 'E-3-S', 'taurocholic acid'],
 ['C', 'tyrosine'],
 ['DOPC',
  'octadecanol',
  'Au(111)',
  'octadecanol',
  'octadecanol',
  'BODIPY',
  'octadecanol'],
 ['GSE', 'GSE', 'phenolic', 'GSE phenolic', 'H']]
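
The relationship between the intervals and the extracted strings can be pictured as plain string slicing. The sketch below is not scilk's implementation: the `Interval` namedtuple and `extract_spans` are hypothetical stand-ins, and it assumes half-open character ranges, which is consistent with the output above (for instance `start=115, stop=118` for the three-character `PVA`). The sample sentence is made up for illustration.

```python
from collections import namedtuple

# Hypothetical stand-in for scilk's Interval, for illustration only
Interval = namedtuple('Interval', ['start', 'stop', 'data'])

def extract_spans(text, ivs):
    """Slice `text` at each half-open [start, stop) character range."""
    return [text[iv.start:iv.stop] for iv in ivs]

# Made-up sample sentence with hand-computed offsets
text = 'Films of poly(vinyl alcohol) (PVA) were prepared.'
ivs = [
    Interval(start=9, stop=28, data=['poly(vinyl', 'alcohol)']),
    Interval(start=30, stop=33, data='PVA'),
]

print(extract_spans(text, ivs))  # ['poly(vinyl alcohol)', 'PVA']
```

Because the offsets index into the original text rather than the tokenised one, multi-token entities such as `poly(vinyl alcohol)` come back with their original spacing.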