# Structured Information Extraction with `DDLite`

### Gene-Phenotype Relation Extraction Demonstration

In this demo, we'll build a scalable information extraction system based on machine learning and statistical inference, using a simple Jupyter Notebook interface.

We will attempt to extract _gene-phenotype relation mentions_ from the scientific literature; that is, any phrase expressing a causal relationship between a mutation in some gene and a phenotype (symptom), such as:

> "A mutation in gene A is known to cause phenotype B"

For each such phrase, we want to extract a tuple `(A,B)` for our knowledge base. We'll start by loading the `ddlite` library:

In [4]:
%load_ext autoreload
%autoreload 2
%matplotlib inline
import matplotlib
matplotlib.rcParams['figure.figsize'] = (18,6)

import cPickle, os, sys
sys.path.insert(1, os.path.join(sys.path[0], '..'))

from ddlite import *


# Candidate Extraction

The first stage in our pipeline is to preprocess the set of input documents we will use, and extract a set of _candidates_ from them.

In [12]:
dp = DocParser('g-p-demo/deaf1_abstract.txt', TextParser())
sents = dp.parseDocSentences()
for doc in dp.parseDocs(): print doc

Recently, we identified in two individuals with intellectual disability (ID) different de novo mutations in DEAF1, which encodes a transcription factor with an important role in embryonic development. To ascertain whether these mutations in DEAF1 are causative for the ID phenotype, we performed targeted resequencing of DEAF1 in an additional cohort of over 2,300 individuals with unexplained ID and identified two additional individuals with de novo mutations in this gene. All four individuals had severe ID with severely affected speech development, and three showed severe behavioral problems. DEAF1 is highly expressed in the CNS, especially during early embryonic development. All four mutations were missense mutations affecting the SAND domain of DEAF1. Altered DEAF1 harboring any of the four amino acid changes showed impaired transcriptional regulation of the DEAF1 promoter. Moreover, behavioral studies in mice with a conditional knockout of Deaf1 in the brain showed memory deficits an

In [36]:
# Schema is: ENSEMBL_ID | NAME | TYPE (refseq, canonical, non-canonical)
genes = [line.rstrip().split('\t')[1] for line in open('g-p-demo/dicts/ensembl_genes.tsv')]
# Keep only gene names longer than 3 characters
genes = filter(lambda g : len(g) > 3, genes)

# Schema is: HPO_ID | NAME | TYPE (exact, lemma)
phenos = [line.rstrip().split('\t')[1] for line in open('g-p-demo/dicts/pheno_terms.tsv')]

GM = DictionaryMatch('G', genes)
PM = DictionaryMatch('P', phenos)
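
To make the idea of dictionary matching concrete, here is a minimal sketch of how a list of terms could be matched against a tokenized sentence. It is not `ddlite`'s implementation: `match_dictionary` is a hypothetical helper, and the whitespace tokenization and case-insensitive comparison are assumptions made for illustration.

```python
def match_dictionary(tokens, terms, ignore_case=True):
    """Return (start, end) token spans whose text equals a dictionary term.

    Hypothetical helper for illustration; not part of ddlite.
    """
    norm = (lambda s: s.lower()) if ignore_case else (lambda s: s)
    term_set = set(norm(t) for t in terms)
    # Longest term, measured in whitespace-separated words
    max_len = max(len(t.split()) for t in terms) if terms else 0
    spans = []
    for start in range(len(tokens)):
        for n in range(1, max_len + 1):
            end = start + n
            if end > len(tokens):
                break
            # Compare the n-token window against the dictionary
            if norm(" ".join(tokens[start:end])) in term_set:
                spans.append((start, end))
    return spans

# Toy sentence and toy dictionaries (assumed, not taken from the demo files)
tokens = "Mutations in DEAF1 cause intellectual disability".split()
gene_spans = match_dictionary(tokens, ["DEAF1"])
pheno_spans = match_dictionary(tokens, ["intellectual disability"])
```

Each span is a half-open range of token indices, so multi-word phenotype terms are handled the same way as single-token gene names.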

In [37]:
R = Relations(sents, GM, PM)
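
The internals of `Relations` are not shown in this notebook. One plausible reading, sketched below purely as an assumption, is that every gene match in a sentence is paired with every phenotype match in the same sentence to form a candidate relation mention. The helper `candidate_pairs` and the span format are hypothetical.

```python
from itertools import product

def candidate_pairs(gene_spans, pheno_spans):
    """Pair each gene span with each phenotype span from one sentence.

    Spans are half-open (start, end) token ranges. Overlapping pairs are
    skipped. Hypothetical helper; not part of ddlite.
    """
    pairs = []
    for g, p in product(gene_spans, pheno_spans):
        overlaps = g[0] < p[1] and p[0] < g[1]
        if not overlaps:
            pairs.append((g, p))
    return pairs

# Two gene mentions and one phenotype mention in a toy sentence
pairs = candidate_pairs([(2, 3), (9, 10)], [(4, 6)])
```

Under this assumption, a sentence with two gene mentions and one phenotype mention yields two candidates, and deciding which of them express a true causal relation is left to the later stages of the pipeline.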

In [50]:
R[4].render()

In [41]:
CM = CandidateModel(R)

In [42]:
CM

<ddlite.CandidateModel instance at 0x1112686c8>

In [45]:
R._get_features().keys()

['INV_LEMMA:BETWEEN-MENTION-and-MENTION[mutation]',
 'DEP_LABEL|LEMMA:BETWEEN-MENTION-and-MENTION[nsubj|study]',
 'LEMMA:BETWEEN-MENTION-and-MENTION[study knockout]',
 'LEMMA:PARENTS-OF-BETWEEN-MENTION-and-MENTION[None]',
 'LEMMA:SEQ-BETWEEN[deficit and]',
 'LEMMA:SEQ-BETWEEN[and increase]',
 'INV_DEP_LABEL|LEMMA:BETWEEN-MENTION-and-MENTION[ROOT|mutation nsubj|mutation acl|affect]',
 'LEMMA:SEQ-BETWEEN[show memory deficit]',
 'INV_LEMMA:FILTER-BY(pos=NN):BETWEEN-MENTION-and-MENTION[mutation mutation]',
 'INV_LEMMA:SEQ-BETWEEN[result of impaired]',
 'LEMMA:SEQ-BETWEEN[a conditional]',
 'INV_LEMMA:SEQ-BETWEEN[mutation]',
 'DEP_LABEL|LEMMA:BETWEEN-MENTION-and-MENTION[nsubj|study nmod|knockout]',
 'INV_LEMMA:SEQ-BETWEEN[-rrb- different de]',
 'LEMMA:SEQ-BETWEEN[with a conditional]',
 'INV_LEMMA:BETWEEN-MENTION-and-MENTION[id likely result]',
 'INV_LEMMA:SEQ-BETWEEN[domain of]',
 'INV_DEP_LABEL:BETWEEN-MENTION-and-MENTION[acl dobj]',
 'INV_LEMMA:SEQ-BETWEEN[impaired transcriptional regulati