# DeepDiveLite (DDL) Demo

In [1]:
%load_ext autoreload
%autoreload 2

import os, sys, re, cPickle
import numpy as np  # used below for np.random.choice
from ddlite import *

## Raw input -> Sentences

As a first stage we load a set of documents as raw strings:

In [2]:
docs = list(DocParser('raw/').parseDocs())

Here's an example document:

In [3]:
np.random.choice(docs[:20])

'Congenital myasthenic syndrome with tubular aggregates caused by GFPT1 mutations'

Next, we transform these document strings into a list of DDL `Sentence` objects. We use the `SentenceParser.parse` method to parse the documents; by default it also performs a variety of NLP preprocessing steps (lemmatization, part-of-speech tagging, and dependency parsing, as the output below shows). `SentenceParser` is also wrapped by `DocParser`, so if we wanted to parse everything we could call `DocParser.parseDocSentences`. Since parsing and preprocessing is probably the slowest part of the process, we'll parse just one example document and pick a single sentence from it.

In [4]:
sp = SentenceParser()
for doc in docs:
  if 'Although the BMPR-II tail is not involved' in doc:
    sents = list(sp.parse(doc))
    break

sent = sents[4]
print sent

Sentence(words=['Although', 'the', 'BMPR-II', 'tail', 'is', 'not', 'involved', 'in', 'BMP', 'signaling', 'via', 'Smad', 'proteins', ',', 'mutations', 'truncating', 'this', 'domain', 'are', 'present', 'in', 'patients', 'with', 'primary', 'pulmonary', 'hypertension', '-LRB-', 'PPH', '-RRB-', '.'], lemmas=['although', 'the', 'bmpr-ii', 'tail', 'be', 'not', 'involve', 'in', 'bmp', 'signaling', 'via', 'smad', 'protein', ',', 'mutation', 'truncating', 'this', 'domain', 'be', 'present', 'in', 'patient', 'with', 'primary', 'pulmonary', 'hypertension', '-lrb-', 'pph', '-rrb-', '.'], poses=['IN', 'DT', 'NN', 'NN', 'VBZ', 'RB', 'VBN', 'IN', 'NN', 'NN', 'IN', 'NN', 'NNS', ',', 'NNS', 'JJ', 'DT', 'NN', 'VBP', 'JJ', 'IN', 'NNS', 'IN', 'JJ', 'JJ', 'NN', '-LRB-', 'NN', '-RRB-', '.'], dep_parents=[7, 4, 4, 7, 7, 7, 20, 10, 10, 7, 13, 13, 7, 20, 18, 18, 18, 20, 20, 0, 22, 20, 26, 26, 26, 22, 28, 26, 28, 20], dep_labels=['mark', 'det', 'compound', 'nsubjpass', 'auxpass', 'neg', 'advcl', 'case', 'compound
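
To make the `dep_parents` field concrete, here is a minimal sketch that walks from one token up to the root of the dependency parse. It assumes that `dep_parents` holds 1-based parent positions with `0` marking the root, which is consistent with the output above but is not confirmed by the DDL documentation. `ToySentence` and `path_to_root` are hypothetical names introduced for illustration; only the `words` and `dep_parents` values are copied from the printed sentence.

```python
from collections import namedtuple

# Hypothetical stand-in for the DDL Sentence object; only two fields are modeled.
ToySentence = namedtuple('ToySentence', ['words', 'dep_parents'])

toy_sent = ToySentence(
    words=['Although', 'the', 'BMPR-II', 'tail', 'is', 'not', 'involved', 'in',
           'BMP', 'signaling', 'via', 'Smad', 'proteins', ',', 'mutations',
           'truncating', 'this', 'domain', 'are', 'present', 'in', 'patients',
           'with', 'primary', 'pulmonary', 'hypertension', '-LRB-', 'PPH',
           '-RRB-', '.'],
    dep_parents=[7, 4, 4, 7, 7, 7, 20, 10, 10, 7, 13, 13, 7, 20, 18, 18, 18,
                 20, 20, 0, 22, 20, 26, 26, 26, 22, 28, 26, 28, 20])


def path_to_root(sentence, idx):
    """Return the words on the path from token `idx` (0-based) to the root."""
    path = []
    seen = set()
    while True:
        if idx in seen:
            raise ValueError('cycle in dep_parents')
        seen.add(idx)
        path.append(sentence.words[idx])
        parent = sentence.dep_parents[idx]  # assumed 1-based, 0 = root
        if parent == 0:
            return path
        idx = parent - 1


# Token 27 is 'PPH'.
print(' -> '.join(path_to_root(toy_sent, 27)))
# PPH -> hypertension -> patients -> present
```

Under this reading, `present` is the root of the sentence, and the phenotype mention hangs off `patients`.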

## Candidate Extraction

First, we load dictionaries of gene and phenotype names; these are the entities that we want to extract relations over:

In [5]:
# Schema is: ENSEMBL_ID | NAME | TYPE (refseq, canonical, non-canonical)
genes = [line.rstrip().split('\t')[1] for line in open('dicts/ensembl_genes.tsv')]
# Keep only gene names longer than 3 characters
genes = [g for g in genes if len(g) > 3]

# Schema is: HPO_ID | NAME | TYPE (exact, lemma)
phenos = [line.rstrip().split('\t')[1] for line in open('dicts/pheno_terms.tsv')]

Next, we define the type of relation we want to look for.  To do this, we'll define a DDL `Relations` operator, which is built from two `Entity`-type operators:

In [6]:
rels = Relations(
    DictionaryMatch('G', genes, ignore_case=False),
    DictionaryMatch('P', phenos), 
    [sent])
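
As a rough illustration of what a dictionary-based entity operator does, the sketch below matches n-grams of a tokenized sentence against a dictionary and then pairs every gene match with every phenotype match to form relation candidates. This is not DDLite's implementation: `dictionary_match`, its `max_ngram` parameter, and the toy word lists are all hypothetical, and the real `DictionaryMatch` may normalize or match text differently.

```python
from itertools import product


def dictionary_match(words, dictionary, max_ngram=3, ignore_case=True):
    """Return a list of token-index lists whose joined text is in `dictionary`."""
    norm = (lambda s: s.lower()) if ignore_case else (lambda s: s)
    entries = set(norm(d) for d in dictionary)
    matches = []
    for start in range(len(words)):
        for n in range(1, max_ngram + 1):
            end = start + n
            if end > len(words):
                break
            phrase = norm(' '.join(words[start:end]))
            if phrase in entries:
                matches.append(list(range(start, end)))
    return matches


# Toy data, made up for this example.
words = ['mutations', 'in', 'BMPR2', 'are', 'present', 'in', 'patients',
         'with', 'primary', 'pulmonary', 'hypertension']
toy_genes = ['BMPR2']
toy_phenos = ['pulmonary hypertension', 'primary pulmonary hypertension']

gene_spans = dictionary_match(words, toy_genes, ignore_case=False)
pheno_spans = dictionary_match(words, toy_phenos)

# Every (gene span, phenotype span) pair is one relation candidate.
candidates = list(product(gene_spans, pheno_spans))
print(candidates)
# [([2], [8, 9, 10]), ([2], [9, 10])]
```

Overlapping dictionary entries yield overlapping matches, so a single sentence can produce several candidates for what a reader would consider one relation.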


We can also render a visualization of a relation and its context:

In [7]:
rels[1].render()

## Distant Supervision

We can create **_rule functions_** using a variety of helper attributes and tools from both `ddlite` and `treedlib`. **These functions must return values $\in\{-1,0,1\}$**, where $1$ and $-1$ are positive and negative votes and $0$ means the rule abstains.

In [8]:
def rule_1(r):
    return 1 if 'mutation' in r.lemmas else 0

def rule_2(r):
    return 1 if re.search(r'{{G}}.*in patients with.*{{P}}', r.tagged_sent) else 0

def rule_3(r):
    return 1 if len(r.e2_idxs) > 1 else -1

rules = [rule_1, rule_2, rule_3]

In [9]:
rels.apply_rules(rules, clear=True)
rels.rules

<2x3 sparse matrix of type '<type 'numpy.float64'>'
	with 6 stored elements in LInked List format>
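
The rule matrix has one row per candidate relation and one column per rule. A minimal sketch of how such a matrix could be built is shown below, using plain lists instead of a sparse matrix. The `apply_rules` helper and the dictionary-based toy candidates are hypothetical; they only mimic the `lemmas`, `tagged_sent`, and `e2_idxs` attributes used by the rules above and do not reproduce what `rels.apply_rules` does internally.

```python
import re


def apply_rules(candidates, rules):
    """Return a len(candidates) x len(rules) matrix of votes in {-1, 0, 1}."""
    matrix = []
    for cand in candidates:
        row = []
        for rule in rules:
            vote = rule(cand)
            if vote not in (-1, 0, 1):
                raise ValueError('%s returned %r' % (rule.__name__, vote))
            row.append(vote)
        matrix.append(row)
    return matrix


# Toy rules mirroring rule_1, rule_2, rule_3, but over plain dicts.
def toy_rule_1(c):
    return 1 if 'mutation' in c['lemmas'] else 0


def toy_rule_2(c):
    pattern = r'\{\{G\}\}.*in patients with.*\{\{P\}\}'
    return 1 if re.search(pattern, c['tagged_sent']) else 0


def toy_rule_3(c):
    return 1 if len(c['e2_idxs']) > 1 else -1


# Toy candidates, made up for this example.
toy_candidates = [
    {'lemmas': ['mutation', 'in', 'gene', 'be', 'present'],
     'tagged_sent': 'mutations in {{G}} are present in patients with primary {{P}}',
     'e2_idxs': [9, 10]},
    {'lemmas': ['the', 'phenotype', 'be', 'unrelated', 'to', 'gene'],
     'tagged_sent': 'the {{P}} was unrelated to {{G}}',
     'e2_idxs': [1]},
]

votes = apply_rules(toy_candidates, [toy_rule_1, toy_rule_2, toy_rule_3])
print(votes)
# [[1, 1, 1], [0, 0, -1]]
```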

In [10]:
rels.get_rule_priority_vote_accuracy([1,1])

1.0
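
The notebook does not define "priority vote". One plausible reading, assumed here and not confirmed by the source, is that each candidate takes the label of the first rule (in list order) that does not abstain. The sketch below implements that reading and scores it against ground-truth labels; `priority_vote` and `accuracy` are hypothetical helpers, and the vote matrix is toy data.

```python
def priority_vote(rule_matrix):
    """Label each row with its first non-zero vote, or 0 if all rules abstain."""
    labels = []
    for row in rule_matrix:
        label = 0
        for vote in row:
            if vote != 0:
                label = vote
                break
        labels.append(label)
    return labels


def accuracy(predicted, ground_truth):
    """Fraction of positions where the predicted label equals the true label."""
    if len(predicted) != len(ground_truth):
        raise ValueError('length mismatch')
    if not ground_truth:
        raise ValueError('no labels to score')
    correct = sum(1 for p, g in zip(predicted, ground_truth) if p == g)
    return correct / float(len(ground_truth))


# Toy vote matrix: 2 candidates x 3 rules.
toy_votes = [[1, 1, 1],
             [0, 1, -1]]

toy_labels = priority_vote(toy_votes)
print(toy_labels)                    # [1, 1]
print(accuracy(toy_labels, [1, 1]))  # 1.0
```

In the second row the last rule votes `-1`, but under this reading the earlier positive vote takes priority.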

## Feature Extraction

Feature extraction is push-button, although custom treedlib feature sets can be passed in as well:

In [11]:
rels.extract_features()
rels.feats

<2x102 sparse matrix of type '<type 'numpy.float64'>'
	with 191 stored elements in LInked List format>

## Learning

Here we use a very simple learning method and implementation:

In [12]:
rels.learn_feats_and_weights(sample=False, verbose=True, holdout=0)


Learning epoch =  0	100	200	300	400	
Learning epoch =  500	600	700	800	900	

In [13]:
rels.get_predicted()

array([ 1.,  1.])

In [14]:
rels.get_classification_accuracy([1,1])

1.0
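
The notebook does not say which method `learn_feats_and_weights` uses. As a generic illustration of learning feature weights over many epochs, the sketch below trains a logistic regression with stochastic gradient updates. This is a swapped-in technique chosen for simplicity, not DDLite's implementation; every name in it and the toy feature matrix are hypothetical.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def dot(w, x):
    return sum(wj * xj for wj, xj in zip(w, x))


def train_logreg(X, y, epochs=1000, lr=0.1):
    """Fit weights by per-example gradient ascent on the log-likelihood.

    `y` holds labels in {-1, 1}; they are mapped to {0, 1} internally.
    """
    w = [0.0] * len(X[0])
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            target = 1.0 if yi == 1 else 0.0
            err = target - sigmoid(dot(w, xi))
            w = [wj + lr * err * xj for wj, xj in zip(w, xi)]
    return w


def predict(X, w):
    """Return labels in {-1, 1} by thresholding the probability at 0.5."""
    return [1 if sigmoid(dot(w, xi)) >= 0.5 else -1 for xi in X]


# Toy data: first column is a bias feature; the second decides the label.
X = [[1, 1, 0],
     [1, 0, 1],
     [1, 1, 1],
     [1, 0, 0]]
y = [1, -1, 1, -1]

weights = train_logreg(X, y)
print(predict(X, weights))
# [1, -1, 1, -1]
```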

## Error Analysis

Now, let's look at a sample of extractions using [Mindtagger](http://deepdive.stanford.edu/labeling).  We can use a shorthand to create a Mindtagger task and launch it right from the notebook:

In [15]:
rels.open_mindtagger(num_sample=20, width='100%', height=1000)

Making sure MindTagger is installed. Hang on!


We can also retrieve the tags collected in Mindtagger with the following shorthand:

In [16]:
tags = rels.get_mindtagger_tags()
print tags

[{u'is_correct': False, u'ext_id': 0}, {u'is_correct': True, u'ext_id': 1}]


From the tags, we can get a precision estimate as follows:

In [17]:
tagged_exts = [tag for tag in tags if u'is_correct' in tag]
num_correct = sum(1 for tag in tagged_exts if tag[u'is_correct'])
"precision = %3.f%%" % (100 * num_correct * 1.0 / len(tagged_exts))

'precision =  50%'