#1) Get abstracts
#2) Write pattern matcher
#3) See what comes out
#4) Write pattern -> triple
#4.5) Test performance on the causaly-small dataset (binary classifier)
#5) See what comes out
#6) Try biochemical corpus

In [22]:
%load_ext autoreload
%autoreload 2
%reload_ext autoreload

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [46]:
import xmltodict  #xmltodict.parse('data/sample_pubmed20n0004.xml')
from gzip import GzipFile
import pubmed_parser

import scispacy
import spacy
from scispacy.abbreviation import AbbreviationDetector
#from spacy.pipeline import merge_entities
from scispacy.linking import EntityLinker
#from .. import utils
from scify.nlp import show_tabs, visualise_subtrees, visualise_doc, check_for_non_trees, get_lemma, add_matches, match_texts

In [47]:
import en_ner_bc5cdr_md, en_core_sci_md, en_ner_craft_md

In [48]:
# This line takes a while, because we have to download ~1GB of data
# and load a large JSON file (the knowledge base). Be patient!
# Thankfully it should be faster after the first time you use it, because
# the downloads are cached. --- But it still takes forever!


# NOTE: The resolve_abbreviations parameter is optional, and requires that
# the AbbreviationDetector pipe has already been added to the pipeline. Adding
# the AbbreviationDetector pipe and setting resolve_abbreviations to True means
# that linking will only be performed on the long form of abbreviations.

linker = EntityLinker(resolve_abbreviations=True, name="umls")



In [52]:
from spacy.pipeline import merge_entities

nlp = spacy.load("en_core_sci_md")
text = """Myeloid derived suppressor cells (MDSC) are immature myeloid cells with immunosuppressive activity."""
abbreviation_pipe = AbbreviationDetector(nlp)

nlp.add_pipe(abbreviation_pipe)
nlp.add_pipe(linker)
nlp.add_pipe(merge_entities)

# If you get a warning here, upgrade the scispacy models (0.24 -> 0.25).

In [62]:
pubmed_abstracts = pubmed_parser.parse_medline_xml("../data/pubmed/pubmed20n1015.xml")
abstr = [article["abstract"] for article in pubmed_abstracts]

In [63]:
len(abstr)

756

In [64]:
import json
patterns = [
    "prevented|nsubj|START_ENTITY prevented|dobj|END_ENTITY",
    "causes|nsubj|START_ENTITY causes|dobj|END_ENTITY"
]

matcher = add_matches(nlp.vocab, patterns)
matched_abstracts = match_texts(matcher, abstr[:200], nlp)
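
The pattern strings above look like whitespace-separated `word|dependency|slot` fields, one group per dependency edge. `add_matches` comes from `scify.nlp` and its internals are not shown in this notebook, so the following is only a minimal sketch of how such a string could be split into its parts. `PatternEdge` and `parse_pattern` are hypothetical names, not part of that library.

```python
from typing import List, NamedTuple


class PatternEdge(NamedTuple):
    """One edge of a pattern: head word, dependency label, entity slot."""
    head: str
    dep: str
    slot: str


def parse_pattern(pattern: str) -> List[PatternEdge]:
    """Split 'head|dep|SLOT head|dep|SLOT' into a list of edges.

    Assumption: edges are separated by whitespace and fields by '|',
    as in the notebook's two example patterns.
    """
    edges = []
    for chunk in pattern.split():
        fields = chunk.split("|")
        if len(fields) != 3:
            raise ValueError(f"expected 3 '|'-separated fields, got {chunk!r}")
        edges.append(PatternEdge(*fields))
    return edges


edges = parse_pattern("causes|nsubj|START_ENTITY causes|dobj|END_ENTITY")
for edge in edges:
    print(edge)
```

Each edge shares the same head word here, which suggests the pattern describes a verb with one subject slot and one object slot.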

In [65]:
nlp_NER = en_ner_bc5cdr_md.load()
matcher = add_matches(nlp_NER.vocab, patterns)
matched_abstracts2 = match_texts(matcher, abstr[:200], nlp_NER)

In [66]:
# Uncomment token_pattern["ENT_TYPE"] = {"NOT_IN": [""]} in construct_pattern() to require entity tokens;
# nlp_NER then stops matching, because the matched tokens are not entities.
matched_abstracts, "WITH NER -->", matched_abstracts2

({'causes|nsubj|START_ENTITY causes|dobj|END_ENTITY': [{'doc_idx': 14,
    'span': 'Spinal cord injury (SCI) can cause loss of',
    'sents': [Spinal cord injury (SCI) can cause loss of mobility in the limbs, and no drugs, surgical procedures, or rehabilitation strategies provide a complete cure.],
    'matches': [[5, 0, 6]],
    'sent_ents': [[Spinal cord injury,
      SCI,
      loss of,
      limbs,
      drugs,
      surgical procedures,
      rehabilitation,
      cure]]},
   {'doc_idx': 165,
    'span': 'misdiagnosis can cause undue stress',
    'sents': [Such misdiagnosis can cause undue stress on the patient and their families.],
    'matches': [[264, 262, 265]],
    'sent_ents': [[misdiagnosis, undue stress, patient, families]]}],
  'prevented|nsubj|START_ENTITY prevented|dobj|END_ENTITY': [{'doc_idx': 82,
    'span': 'Pretreatment of cells with antioxidants ascorbic acid and beta-mercaptoethanol prevented these NEO212-induced effects',
    'sents': [Pretreatment of cells with

The NER model doesn't match the patterns!?

In [71]:
# Each entity is linked to UMLS with a score
# (currently just char-3gram matching).

# for umls_ent in entity._.kb_ents:
#     print(linker.kb.cui_to_entity[umls_ent[0]])


## NER
The NER model correctly identifies Ca2 and histamine as chemicals, but it recognises far fewer entities for the pattern matcher to anchor on.

In [69]:
ex1 = "The subsequent exposure of the pretreated cells to Ca2 causes increased release of histamine and degradation of methylated phospholipids."
ex2 = "Ca2 causes histamine"

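# Only ex2 is expected to match once the ENT_TYPE constraint is enabled;
# without it (as in the output below), both examples match.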
match_texts(matcher, [ex1, ex2], nlp_NER)

{'causes|nsubj|START_ENTITY causes|dobj|END_ENTITY': [{'doc_idx': 0,
   'span': 'exposure of the pretreated cells to Ca2 causes increased release',
   'sents': [The subsequent exposure of the pretreated cells to Ca2 causes increased release of histamine and degradation of methylated phospholipids.],
   'matches': [[9, 2, 11]],
   'sent_ents': [[Ca2, histamine]]},
  {'doc_idx': 1,
   'span': 'Ca2 causes histamine',
   'sents': [Ca2 causes histamine],
   'matches': [[1, 0, 2]],
   'sent_ents': [[Ca2, histamine]]}]}
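
Step #4 of the plan is to turn a pattern match into a triple. Judging only from the output above, each entry in `matches` looks like `[verb_index, subject_index, object_index]` into the document's tokens (for `Ca2 causes histamine` it is `[1, 0, 2]`). Under that assumption, a minimal sketch of the conversion could look like this. `match_to_triple` is a hypothetical helper, and plain strings stand in for spaCy tokens.

```python
from typing import Sequence, Tuple


def match_to_triple(tokens: Sequence[str], match: Sequence[int]) -> Tuple[str, str, str]:
    """Turn [verb_idx, subj_idx, obj_idx] into (subject, predicate, object).

    The index order is an assumption read off the notebook output,
    not a documented contract of match_texts.
    """
    verb_idx, subj_idx, obj_idx = match
    return tokens[subj_idx], tokens[verb_idx], tokens[obj_idx]


# Plain-string stand-ins for the tokens of the two example sentences.
ex2_tokens = ["Ca2", "causes", "histamine"]
ex1_tokens = ("The subsequent exposure of the pretreated cells to Ca2 "
              "causes increased release of histamine").split()

triples = [
    match_to_triple(ex2_tokens, [1, 0, 2]),
    match_to_triple(ex1_tokens, [9, 2, 11]),
]
print(triples)
```

For the longer sentence the triple is built from the head nouns `exposure` and `release`, not from `Ca2` and `histamine`, so a later step would still have to map those heads onto the entities of interest.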

In [70]:
[(ent, ent.label_) for ent in nlp_NER(ex1).ents], ' --Versus (UNNAMED) ENTITY RECOGNITION -->', [(ent, ent.label_) for ent in nlp(ex1).ents]

([(Ca2, 'CHEMICAL'), (histamine, 'CHEMICAL')],
 ' --Versus (UNNAMED) ENTITY RECOGNITION -->',
 [(exposure of, 'ENTITY'),
  (pretreated cells, 'ENTITY'),
  (Ca2, 'ENTITY'),
  (increased, 'ENTITY'),
  (release, 'ENTITY'),
  (histamine, 'ENTITY'),
  (degradation, 'ENTITY'),
  (methylated phospholipids, 'ENTITY')])
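
This comparison suggests why requiring an entity type on the matched tokens removes most matches: the dependency pattern lands on head nouns such as `exposure` and `release`, while the NER model labels only `Ca2` and `histamine`. Below is a small sketch of that check using plain token indices instead of spaCy objects. `endpoints_are_entities` is a hypothetical helper, and the `[verb, subject, object]` index layout is assumed from the match output earlier in the notebook.

```python
from typing import Sequence, Set


def endpoints_are_entities(match: Sequence[int], entity_tokens: Set[int]) -> bool:
    """Return True if both the subject and the object token lie inside an entity.

    `match` is assumed to be [verb_idx, subj_idx, obj_idx]; `entity_tokens`
    is the set of token indices covered by any recognised entity.
    """
    _, subj_idx, obj_idx = match
    return subj_idx in entity_tokens and obj_idx in entity_tokens


# ex1: NER labels only "Ca2" (token 8) and "histamine" (token 13),
# but the pattern matched "exposure" (2) and "release" (11).
ex1_kept = endpoints_are_entities([9, 2, 11], {8, 13})

# ex2: "Ca2" (0) and "histamine" (2) are entities and are also the matched tokens.
ex2_kept = endpoints_are_entities([1, 0, 2], {0, 2})

print(ex1_kept, ex2_kept)
```

With the general `en_core_sci_md` model nearly every noun phrase is an `ENTITY`, so the same check would pass far more often there.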

When you add matcher rules, you can also define an on_match callback function as the second argument of Matcher.add. This is often useful if you want to trigger specific actions – for example, do one thing if a COLOR match is found, and something else for a PRODUCT match.

If you want to solve this even more elegantly, you might also want to look into combining your matcher with a custom pipeline component or custom attributes. For example, you could write a simple component that's run automatically when you call nlp() on your text, finds the matches, and sets a Doc._.relations or Token._.is_color attribute. The docs have a few examples of this that should help you get started.
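
A minimal sketch of the `on_match` idea described above, assuming spaCy 2.2.2 or later, where `Matcher.add` accepts a list of patterns and the callback as the `on_match` keyword argument (older versions take it positionally as the second argument). A blank English pipeline is used so no model download is needed; the rule name `CAUSES` and the callback `collect_causes` are made up for illustration.

```python
import spacy
from spacy.matcher import Matcher

# Tokenizer-only pipeline: enough for token-level patterns.
nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)

found = []  # texts of the spans that triggered the callback


def collect_causes(matcher, doc, i, matches):
    """Called once per match; `matches[i]` is the match being handled."""
    match_id, start, end = matches[i]
    found.append(doc[start:end].text)


# One single-token pattern: any casing of the word "causes".
matcher.add("CAUSES", [[{"LOWER": "causes"}]], on_match=collect_causes)

doc = nlp("Ca2 causes histamine")
matches = matcher(doc)
print(found)
```

The same callback could instead write to a custom attribute such as `Doc._.relations`, which is the direction the note above points to.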

In [25]:
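# Distribution of dependency labels across the parsed abstracts

from collections import Counter
dep_counts = Counter()
# nlp.pipe processes the texts in batches, which is faster than calling nlp() per abstract
for doc in nlp.pipe(abstr[:2000]):
    for token in doc:
        dep_counts[token.dep_] += 1
dep_counts.most_common(30)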