# II. NLP Preprocessing


See `preprocessing/README.md` for details on the clinical text preprocessing used in our manuscript. We also recommend [scispaCy](https://allenai.github.io/scispacy/) as a high-quality NLP preprocessing framework for biomedical text. Trove assumes documents are encoded in a JSON format, which is internally transformed into sentences and documents for labeling function use.

This tutorial uses the _BioCreative V Chemical-Disease Relation (CDR) Task Corpus_, which is freely available for download from http://www.biocreative.org/media/store/files/2016/CDR_Data.zip
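
The corpus is distributed in PubTator format (among others), where each document begins with `PMID|t|title` and `PMID|a|abstract` lines followed by tab-separated annotation lines. Below is a minimal sketch for reading that layout into plain-text documents. The function name `read_pubtator`, the abbreviated sample lines, and the choice to join title and abstract with a single space are assumptions made for illustration; none of this is part of Trove.

```python
def read_pubtator(lines):
    """Return {pmid: text} built from PubTator title/abstract lines.

    Annotation lines (tab-separated) are ignored in this sketch.
    """
    chunks = {}
    for line in lines:
        parts = line.rstrip("\n").split("|", 2)
        # Title/abstract lines look like "PMID|t|..." or "PMID|a|..."
        if len(parts) == 3 and parts[0].isdigit() and parts[1] in ("t", "a"):
            pmid, _, text = parts
            chunks.setdefault(pmid, []).append(text)
    # Assumption: title and abstract are joined by a single space
    return {pmid: " ".join(texts) for pmid, texts in chunks.items()}


# Abbreviated, illustrative sample in the PubTator layout
sample = [
    "227508|t|Naloxone reverses the antihypertensive effect of clonidine.\n",
    "227508|a|Naloxone alone did not affect either blood pressure or heart rate.\n",
    "227508\t0\t8\tNaloxone\tChemical\tD009270\n",
]
docs = read_pubtator(sample)
```

The resulting strings can be passed to `nlp(...)` one document at a time.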
    

In [29]:
import re
from collections import defaultdict

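def parse_doc(doc, disable=None, keep_whitespace=False):
    """
    Given a parsed spaCy document, yield one dictionary of lists per sentence,
    with one list per field (words, character offsets, optional NLP tags).
    Tags for components named in `disable` are skipped; if `disable` is empty
    or None, all optional tags (NER, parser, tagger, lemmatizer) are skipped.
    """
    disable = {"ner", "parser", "tagger", "lemmatizer"} if not disable else disable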
    for position, sent in enumerate(doc.sents):
        parts = defaultdict(list)

        # character offsets of newlines, relative to the sentence text
        parts['newlines'] = [m.start() for m in re.finditer(r'\n', sent.text)]

        for token in sent:

            if not keep_whitespace and not token.text.strip():
                continue

            parts['words'].append(token.text)
            parts['abs_char_offsets'].append(token.idx)

            # optional NLP tags
            if "lemmatizer" not in disable:
                parts['lemmas'].append(token.lemma_)
            if "tagger" not in disable:
                parts['pos_tags'].append(token.tag_)
            if "ner" not in disable:
                parts['ner_tags'].append(
                    token.ent_type_ if token.ent_type_ else 'O'
                )
            if "parser" not in disable:
                # 1-based position of the head within the sentence; 0 marks the root
                head_idx = 0 if token.head.i == token.i else \
                    token.head.i - sent[0].i + 1
                parts['dep_parents'].append(head_idx)
                parts['dep_labels'].append(token.dep_)

        # sentence is all whitespace
        if not parts['words']:
            continue

        parts['i'] = position
        yield parts
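
The `dep_parents` values are sentence-relative: each token stores the 1-based position of its head within the sentence, and the root stores 0. The sketch below reproduces that arithmetic on plain integers, without spaCy, so the convention is easy to check. The helper `to_sentence_heads` and the toy indices are hypothetical and only illustrate the mapping.

```python
def to_sentence_heads(token_indices, head_indices):
    """Map document-level head indices to 1-based sentence positions.

    token_indices: document-level index of each token in one sentence
    head_indices:  document-level index of each token's head
    A token that is its own head is the root and maps to 0.
    """
    start = token_indices[0]
    return [
        0 if head == tok else head - start + 1
        for tok, head in zip(token_indices, head_indices)
    ]


# Toy sentence occupying document positions 10..13, rooted at position 11
tokens = [10, 11, 12, 13]
heads = [11, 11, 13, 11]
sentence_heads = to_sentence_heads(tokens, heads)
print(sentence_heads)  # [2, 0, 4, 2]
```

Positions are counted over all tokens in the sentence, so when whitespace tokens are dropped from `words` the head positions may not line up one-to-one with that list.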

In [30]:
import scispacy
import spacy

# Load the small scispaCy biomedical model
nlp = spacy.load("en_core_sci_sm")
# Title and abstract of PubMed article 227508
text = """Naloxone reverses the antihypertensive effect of clonidine. In unanesthetized, spontaneously hypertensive rats the decrease in blood pressure and heart rate produced by intravenous clonidine, 5 to 20 micrograms/kg, was inhibited or reversed by nalozone, 0.2 to 2 mg/kg. The hypotensive effect of 100 mg/kg alpha-methyldopa was also partially reversed by naloxone. Naloxone alone did not affect either blood pressure or heart rate. In brain membranes from spontaneously hypertensive rats clonidine, 10(-8) to 10(-5) M, did not influence stereoselective binding of [3H]-naloxone (8 nM), and naloxone, 10(-8) to 10(-4) M, did not influence clonidine-suppressible binding of [3H]-dihydroergocryptine (1 nM). These findings indicate that in spontaneously hypertensive rats the effects of central alpha-adrenoceptor stimulation involve activation of opiate receptors. As naloxone and clonidine do not appear to interact with the same receptor site, the observed functional antagonism suggests the release of an endogenous opiate by clonidine or alpha-methyldopa and the possible role of the opiate in the central control of sympathetic tone."""
doc = nlp(text)


In [31]:
import json

sents = list(parse_doc(doc, disable=['lemmatizer', 'ner']))
data = json.dumps({
    'name': str(227508),
    'metadata': {},
    'sentences': sents
})
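
Because each sentence keeps `words` and `abs_char_offsets`, a serialized record can be checked against the source text after a round trip through JSON. The following is a minimal sketch using a hand-written record in the same shape as `data`, restricted to the title sentence. The `check_offsets` helper is hypothetical and not part of Trove.

```python
import json


def check_offsets(record, text):
    """Return True if every word sits at its recorded offset in `text`."""
    for sent in record["sentences"]:
        for word, start in zip(sent["words"], sent["abs_char_offsets"]):
            if text[start:start + len(word)] != word:
                return False
    return True


text = "Naloxone reverses the antihypertensive effect of clonidine."
record = {
    "name": "227508",
    "metadata": {},
    "sentences": [{
        "words": ["Naloxone", "reverses", "the", "antihypertensive",
                  "effect", "of", "clonidine", "."],
        "abs_char_offsets": [0, 9, 18, 22, 39, 46, 49, 58],
        "i": 0,
    }],
}

# Serialize and parse back, as a consumer of the JSON file would
restored = json.loads(json.dumps(record))
offsets_ok = check_offsets(restored, text)
```

A failed check usually points to text that was modified between parsing and serialization, for example by stripping or re-encoding.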