In [None]:
!pip install medspacy > /dev/null
!pip install https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.4.0/en_core_sci_sm-0.4.0.tar.gz > /dev/null


RegEx is really useful and very powerful when you are looking for something very specific in a document.

What if you want to extract more general concepts from text? You *could* define a set of really complex RegEx patterns to find all of the possible variations of all of the things you are interested in finding, but we saw how complicated it can get even for just one piece of data like the Gleason score.

For more general use cases, we need more generalizable tools, and that is where Natural Language Processing (NLP) comes in. NLP is a branch of computer science, and a subset of artificial intelligence, that tries to make sense of plain spoken or written language. Natural language is much messier than non-natural languages (like Python and other programming languages), which have very clearly defined rules and syntax.

---

You can think of NLP as a series of steps performed in sequence, each of which might depend on the output of the one before, that together transform natural language into structured data. The entire process is referred to as a *pipeline*, and each individual step is referred to as a "pipe" component (at least it is in the NLP library we'll be using).
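
To make the idea of a pipeline concrete, here is a minimal sketch in plain Python. It is not how spaCy is implemented; the names `toy_sentencize`, `toy_tokenize` and `run_toy_pipeline` are made up for illustration, and the RegEx rules are far cruder than what a real library uses. The point is only that each pipe takes the document, adds something to it, and hands it on to the next pipe.

```python
import re

def toy_sentencize(doc):
    # Split the raw text after sentence-ending punctuation (a crude rule).
    pieces = re.split(r"(?<=[.!?])\s+", doc["text"])
    doc["sentences"] = [p.strip() for p in pieces if p.strip()]
    return doc

def toy_tokenize(doc):
    # Depends on the output of toy_sentencize: one token list per sentence.
    doc["tokens"] = [re.findall(r"\w+|[^\w\s]", s) for s in doc["sentences"]]
    return doc

def run_toy_pipeline(text, pipes):
    # Pass the document through each pipe, in order.
    doc = {"text": text}
    for pipe in pipes:
        doc = pipe(doc)
    return doc

toy_doc = run_toy_pipeline(
    "No acute fracture. Mild effusion is present.",
    [toy_sentencize, toy_tokenize],
)
print(toy_doc["sentences"])  # ['No acute fracture.', 'Mild effusion is present.']
print(toy_doc["tokens"][0])  # ['No', 'acute', 'fracture', '.']
```

Swapping the order of the two pipes would fail, because `toy_tokenize` needs the `"sentences"` key that `toy_sentencize` creates. Real pipelines have the same kind of ordering constraints.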

---

For example, common initial pipes in a radiology-focused NLP pipeline might be to:
* ***Sectionize***: split a document up into *sections* (Technique, Indication, Findings, Impression)
* ***Sentencize***: split sections up into *sentences*
* ***Tokenize***: split sentences up into *tokens* (words, numbers, punctuation)
* ***POS Tagging***: identify parts-of-speech of *tokens*
* ***Lemmatize***: transform variants into a common root word ("is", "was", and "were" all become "be" after lemmatization)
* ***Chunk***: group tokens into phrases based on POS and other token information
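* ***Dependency Parse***: identify grammatical relationships between tokens and phrases (which noun is the subject of which verb, which adjective modifies which noun). Later steps build on these relationships, for example to determine the scope of a negation phrase or to resolve pronouns to their reference token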
* ***Named Entity Recognition (NER)***: Detect and label named entities, or categories of tokens that are of interest. In general NLP, these might be "Person", "CompanyName", "Country", "PhoneNumber", etc. In radiology NLP, these might be "Anatomy", "Disease", "Severity". Which NEs are recognizable depends on the *Language Model* your pipeline is trained for.
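
Two of the steps above, sectionizing and lemmatizing, are easy to approximate with tools we have already seen. The sketch below is a toy, not what spaCy or medspaCy do internally: the header list, the lemma lookup table, and the sample report are all made up for illustration.

```python
import re

# Hypothetical section headers and lemma table, just for this example.
SECTION_HEADERS = ("TECHNIQUE", "INDICATION", "FINDINGS", "IMPRESSION")
TOY_LEMMAS = {"is": "be", "are": "be", "was": "be", "were": "be"}

def toy_sectionize(report):
    # Split on headers at the start of a line. Because the header is in a
    # capturing group, re.split keeps it: [preamble, header, body, header, body, ...]
    pattern = r"^(%s):" % "|".join(SECTION_HEADERS)
    parts = re.split(pattern, report, flags=re.MULTILINE)
    return {header: body.strip() for header, body in zip(parts[1::2], parts[2::2])}

def toy_lemmatize(tokens):
    # Look each lowercased token up in the table; unknown tokens pass through.
    return [TOY_LEMMAS.get(t.lower(), t.lower()) for t in tokens]

sample_report = """INDICATION: Right upper quadrant pain.
FINDINGS: Gallstones are present. The gallbladder wall is normal.
IMPRESSION: Cholelithiasis."""

report_sections = toy_sectionize(sample_report)
print(report_sections["IMPRESSION"])  # Cholelithiasis.
print(toy_lemmatize(["Gallstones", "are", "present"]))  # ['gallstones', 'be', 'present']
```

Notice that the lookup table leaves "gallstones" alone, while a real lemmatizer would reduce it to "gallstone". Covering every word by hand is exactly the kind of work a trained language model saves you.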

---

I know it seems really complicated, and it really is. I have a pretty superficial understanding of it, but even just with that you can do some cool stuff.

And fortunately, like most things in Python, you don't have to build it from scratch - there are powerful libraries that abstract away all of the complexity for you and lower the barrier to getting started.

For Python-based NLP, the dominant framework is called spaCy: https://spacy.io/, and there is a toolkit built on top of it specifically for clinical NLP called medspaCy: https://spacy.io/universe/project/medspacy

We can simply import spacy and medspacy into our programming session and immediately take advantage of their functions.

In [None]:
import pprint
import spacy
import medspacy
from medspacy.context import ConTextComponent
from medspacy.visualization import visualize_dep, visualize_ent
from spacy import displacy

incl_scispacy_umls_linker = False

# For linking to UMLS: not required for NER and context annotations, only if the CUIs are needed.
# Adds significantly to load time and doc processing time.
if incl_scispacy_umls_linker:
    !pip install scispacy
    from scispacy.linking import EntityLinker

import warnings
warnings.filterwarnings('ignore')

pp = pprint.PrettyPrinter(indent=2)

In [None]:
# Initialize an nlp pipeline based on the small version of the 
# English language core scientific model "en_core_sci_sm"
nlp = spacy.load("en_core_sci_sm")
if incl_scispacy_umls_linker:
    # Add this pipeline component to get UMLS CUIs annotated
    nlp.add_pipe("scispacy_linker", config={"resolve_abbreviations": True, "linker_name": "umls"})

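# Add the ConText pipeline component to our model
# This will determine whether NEs are negated, hypothetical, uncertain
# add_pipe returns the component it added, so `context` is the same
# object that will process our documents (not a separate, unused copy)
context = nlp.add_pipe("medspacy_context")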
print(nlp.pipe_names)

In [None]:
text = """IMPRESSION: 
1. Cholelithiasis with positive sonographic Murphy sign. However gallbladder is 
nondistended without wall thickening. Equivocal for acute cholecystitis.
2. Mild intra and extrahepatic ductal dilatation. Correlate with LFTs.
If obstructive pattern, consider MRCP."""
doc = nlp(text)
sentences = [{"idx":sent_id, "start":sent.start_char, "end":sent.end_char} for sent_id, sent in enumerate(doc.sents)]
results = []
for ent in doc.ents:
    if incl_scispacy_umls_linker:
        try:
            cuis = ent._.kb_ents[0][0]
        except IndexError:
            cuis = None
    else:
        cuis = None
    result = {
        "concept": ent,
        "start": ent.start_char,
        "end": ent.end_char,
        "cui": cuis,
        "is_negated": ent._.is_negated,
        "is_uncertain": ent._.is_uncertain,
        "is_conditional": ent._.is_hypothetical,
        "is_historic": ent._.is_historical,
        "subject": "family" if ent._.is_family else "patient",
        "sentence": [sentence["idx"] for sentence in sentences if sentence["start"]<= ent.start_char and sentence["end"]>=ent.end_char][0]
    }
    results.append(result)
    if incl_scispacy_umls_linker:
        linker = nlp.get_pipe("scispacy_linker")
        for umls_ent in ent._.kb_ents:
            print(linker.kb.cui_to_entity[umls_ent[0]])

visualize_ent(doc)

In [None]:
for r in results:
    if r['concept'].text == "acute cholecystitis":
        pp.pprint(r)

---

Try changing "Equivocal for acute cholecystitis." to:
* "Negative for acute cholecystitis." => check is_negated
* "Possible acute cholecystitis." => check is_uncertain

and see if the ConText component can recognize that the first variant is negated and the second one is uncertain.
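
If you are wondering how a component can decide this, the core idea is a table of trigger phrases, each tied to a category, that modify the concepts near them. Here is a deliberately oversimplified sketch of that idea in plain Python. It is not medspaCy's implementation: the function name `toy_context_flags` and the four trigger phrases are made up for illustration, and it only looks for triggers that appear before the concept in the same sentence.

```python
# Hypothetical trigger table: phrase -> flag it switches on.
TOY_TRIGGERS = {
    "negative for": "is_negated",
    "no evidence of": "is_negated",
    "possible": "is_uncertain",
    "equivocal for": "is_uncertain",
}

def toy_context_flags(sentence, concept):
    flags = {"is_negated": False, "is_uncertain": False}
    lowered = sentence.lower()
    concept_pos = lowered.find(concept.lower())
    if concept_pos == -1:
        return flags  # concept not mentioned in this sentence
    for phrase, flag in TOY_TRIGGERS.items():
        trigger_pos = lowered.find(phrase)
        # Only triggers that come before the concept modify it here.
        if trigger_pos != -1 and trigger_pos < concept_pos:
            flags[flag] = True
    return flags

for sentence in (
    "Equivocal for acute cholecystitis.",
    "Negative for acute cholecystitis.",
    "Possible acute cholecystitis.",
):
    print(sentence, toy_context_flags(sentence, "acute cholecystitis"))
```

Plain substring matching like this breaks quickly (it would find "possible" inside "impossible", for instance), which is why the real component works on tokens and tracks how far each trigger's influence extends.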

---

It is not too hard to find phrases that are not included in the ConText algorithm's rule set. Here are the phrases it is set up to match by default:

In [None]:
# Sort a copy of the rules by category, leaving the component's own rule list untouched
context_rules = sorted(context.rules, key=lambda rule: rule.category)
for rule in context_rules:
    print(f"{rule.category.rjust(20)}: {rule.literal}")

In [None]:
results

---

As I mentioned, NLP is much more complex than what we are taking advantage of here. A linguist, or language scientist, could extract much more nuanced information from the NLP annotations. Here's a visualization of the dependency relationships and parts-of-speech of the various tokens in our test document, to give you some idea of what sorts of things you'd have to account for to build an NLP system that truly understands what the text is trying to convey. 

In [None]:
options = {"compact": True, "bg": "#19A974",
           "color": "white", "font": "Avenir"}
displacy.render(doc, style="dep", options=options)