# DeepDiveLite (DDL) Demo

In [1]:
%load_ext autoreload
%autoreload 2

import os, sys, re, random, cPickle
from ddlite import *

## Raw input -> Sentences

As a first stage we load a set of documents as raw strings:

In [2]:
docs = list(DocParser('raw/').parseDocs())

Here's an example document:

In [3]:
random.choice(docs[:20])

'Retinal degeneration 6 (rd6): a new mouse model for human retinitis punctata albescens.'

Next, we transform these document strings into a _list of lists_ of DDL `Sentence` objects, one inner list per document. The `SentenceParser.parse` method parses each document and, by default, also runs a range of NLP preprocessing steps. Because parsing and preprocessing are probably the slowest part of the process, we save the processed `Sentence` objects to disk and reload them on later runs:

In [4]:
%time
# Load previously processed Sentence objects if a cache file exists
sents_pickle_file = 'saved_sents.pkl'
if os.path.exists(sents_pickle_file):
    with open(sents_pickle_file, 'rb') as f:
        sents = cPickle.load(f)
else:
    # Otherwise parse the first 20 documents and cache the result
    parser = SentenceParser()
    sents = [list(parser.parse(doc)) for doc in docs[:20]]
    with open(sents_pickle_file, 'wb') as f:
        cPickle.dump(sents, f)

CPU times: user 2 µs, sys: 1e+03 ns, total: 3 µs
Wall time: 5.01 µs
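
The load-or-compute pattern in the cell above can be factored into a small helper. The following is a minimal sketch, not part of `ddlite`: `load_or_compute` and `fake_parse` are hypothetical names, and the sketch uses the standard `pickle` module (available in both Python 2 and 3) rather than `cPickle`.

```python
import os
import pickle
import tempfile


def load_or_compute(path, compute):
    """Return the object pickled at `path`, computing and saving it first if absent."""
    if os.path.exists(path):
        with open(path, 'rb') as f:
            return pickle.load(f)
    result = compute()
    with open(path, 'wb') as f:
        pickle.dump(result, f)
    return result


# Hypothetical stand-in for the expensive parsing step; records each real run.
calls = []

def fake_parse():
    calls.append(1)
    return [['a', 'sentence'], ['another', 'one']]


cache_path = os.path.join(tempfile.mkdtemp(), 'saved_sents.pkl')
first = load_or_compute(cache_path, fake_parse)   # computes and writes the cache
second = load_or_compute(cache_path, fake_parse)  # reads the cache instead
```

Passing the expensive step as a callable means it only runs on a cache miss.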


In [5]:
random.choice(sents)

[Sentence(words=['HFE', 'gene', 'mutation', ',', 'C282Y', 'causing', 'hereditary', 'hemochromatosis', 'in', 'Caucasian', 'is', 'extremely', 'rare', 'in', 'Korean', 'population', '.'], lemmas=['hfe', 'gene', 'mutation', ',', 'c282y', 'cause', 'hereditary', 'hemochromatosis', 'in', 'Caucasian', 'be', 'extremely', 'rare', 'in', 'korean', 'population', '.'], poses=['NN', 'NN', 'NN', ',', 'NN', 'VBG', 'JJ', 'NN', 'IN', 'NNP', 'VBZ', 'RB', 'JJ', 'IN', 'JJ', 'NN', '.'], dep_parents=[3, 3, 13, 3, 3, 5, 8, 6, 10, 6, 13, 13, 0, 16, 16, 13, 13], dep_labels=['compound', 'compound', 'nsubj', 'punct', 'appos', 'acl', 'amod', 'dobj', 'case', 'nmod', 'cop', 'advmod', 'ROOT', 'case', 'amod', 'nmod', 'punct']),
 Sentence(words=['Hereditary', 'hemochromatosis', '-LRB-', 'HFE', '-RRB-', ',', 'which', 'affects', '1', 'in', '400', 'and', 'has', 'an', 'estimated', 'carrier', 'frequency', 'of', '1', 'in', '10', 'individuals', 'in', 'Western', 'population', ',', 'results', 'in', 'multiple', 'organ', 'damage', 

For now, we'll pick a single sentence to work with:

In [6]:
sent = sents[15][4]; sent

Sentence(words=['Although', 'the', 'BMPR-II', 'tail', 'is', 'not', 'involved', 'in', 'BMP', 'signaling', 'via', 'Smad', 'proteins', ',', 'mutations', 'truncating', 'this', 'domain', 'are', 'present', 'in', 'patients', 'with', 'primary', 'pulmonary', 'hypertension', '-LRB-', 'PPH', '-RRB-', '.'], lemmas=['although', 'the', 'bmpr-ii', 'tail', 'be', 'not', 'involve', 'in', 'bmp', 'signaling', 'via', 'smad', 'protein', ',', 'mutation', 'truncating', 'this', 'domain', 'be', 'present', 'in', 'patient', 'with', 'primary', 'pulmonary', 'hypertension', '-lrb-', 'pph', '-rrb-', '.'], poses=['IN', 'DT', 'NN', 'NN', 'VBZ', 'RB', 'VBN', 'IN', 'NN', 'NN', 'IN', 'NN', 'NNS', ',', 'NNS', 'JJ', 'DT', 'NN', 'VBP', 'JJ', 'IN', 'NNS', 'IN', 'JJ', 'JJ', 'NN', '-LRB-', 'NN', '-RRB-', '.'], dep_parents=[7, 4, 4, 7, 7, 7, 20, 10, 10, 7, 13, 13, 7, 20, 18, 18, 18, 20, 20, 0, 22, 20, 26, 26, 26, 22, 28, 26, 28, 20], dep_labels=['mark', 'det', 'compound', 'nsubjpass', 'auxpass', 'neg', 'advcl', 'case', 'compound

## Candidate Extraction

First, we load dictionaries of gene and phenotype names; these are the entities that we want to extract relations over:

In [7]:
# Schema is: ENSEMBL_ID | NAME | TYPE (refseq, canonical, non-canonical)
genes = [line.rstrip().split('\t')[1] for line in open('dicts/ensembl_genes.tsv')]
genes = filter(lambda g : len(g) > 3, genes)

# Schema is: HPO_ID | NAME | TYPE (exact, lemma)
phenos = [line.rstrip().split('\t')[1] for line in open('dicts/pheno_terms.tsv')]
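
The dictionary files are read as tab-separated rows with the name in the second column. As a rough sketch of the same step with explicit handling of blank or malformed rows, assuming the `ID | NAME | TYPE` schema noted in the comments above (the helper name `load_names` and the sample rows are invented for illustration):

```python
import io


def load_names(lines, min_len=0):
    """Collect the NAME column (second tab-separated field) from dictionary lines."""
    names = []
    for line in lines:
        fields = line.rstrip('\n').split('\t')
        if len(fields) < 2:
            continue  # skip blank or malformed rows
        name = fields[1]
        if len(name) > min_len:  # same length filter as applied to `genes` above
            names.append(name)
    return names


# Invented sample rows; real files would be opened with open(path) instead.
sample = io.StringIO(
    u"ID0001\tBMPR2\tcanonical\n"
    u"ID0002\tHFE\tcanonical\n"
    u"\n"
    u"ID0003\tSMAD4\trefseq\n"
)
gene_names = load_names(sample, min_len=3)
```

With `min_len=3`, names of three characters or fewer (such as `HFE` here) are dropped, which mirrors the `len(g) > 3` filter applied to `genes`.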

Next, we define the type of relation we want to look for.  To do this, we'll define a DDL `Relations` operator, which is built from two `Entity`-type operators:

In [8]:
rels = Relations(
    DictionaryMatch('G', genes, ignore_case=False),
    DictionaryMatch('P', phenos), 
    [sent])
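
To make the idea of dictionary-based candidate extraction concrete, here is a minimal pure-Python sketch. It is not the `ddlite` implementation: the function `dictionary_match`, its span format, and the shortened example sentence are assumptions made for illustration. It finds the longest dictionary phrase starting at each token position, then pairs every gene span with every phenotype span.

```python
def dictionary_match(words, dictionary, ignore_case=True):
    """Return (start, end) token spans whose joined text is a dictionary entry."""
    norm = (lambda s: s.lower()) if ignore_case else (lambda s: s)
    entries = set(norm(e) for e in dictionary)
    max_len = max([len(e.split()) for e in entries] or [0])
    spans = []
    for start in range(len(words)):
        # At each start position, try the longest phrase first and keep one match.
        for length in range(min(max_len, len(words) - start), 0, -1):
            phrase = norm(' '.join(words[start:start + length]))
            if phrase in entries:
                spans.append((start, start + length))
                break
    return spans


words = ['mutations', 'in', 'BMPR-II', 'are', 'present', 'in', 'patients',
         'with', 'primary', 'pulmonary', 'hypertension']
gene_spans = dictionary_match(words, ['BMPR-II'], ignore_case=False)
pheno_spans = dictionary_match(
    words, ['primary pulmonary hypertension', 'hypertension'])

# Every (gene span, phenotype span) pair is a candidate relation.
candidates = [(g, p) for g in gene_spans for p in pheno_spans]
```

Note that matches starting at different positions may overlap, so a multi-word phenotype and a shorter term inside it can both produce candidates.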


We can also render a visualization of the relations and their contexts:

In [9]:
rels.relations[1].render()

## Distant Supervision

We can create **_rule functions_** using a variety of helper attributes and tools from both `ddlite` and `treedlib`. **Each function must return a value in $\{-1, 0, 1\}$.**

In [10]:
def rule_1(r):
    return 1 if 'mutation' in r.lemmas else 0

def rule_2(r):
    return 1 if re.search(r'{{G}}.*in patients with.*{{P}}', r.tagged_sent) else 0

def rule_3(r):
    return 1 if len(r.e2_idxs) > 1 else -1

rules = [rule_1, rule_2, rule_3]

In [11]:
rels.apply_rules(rules)
rels.rules

array([[ 1.,  1.],
       [ 0.,  1.],
       [-1.,  1.]])
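
The output above has one row per rule and one column per candidate relation. A minimal sketch of how such a matrix could be assembled is shown below; it is an illustration only, with candidates modeled as plain dictionaries and `apply_rules`, `has_mutation`, and `multiword_pheno` as hypothetical names rather than `ddlite` API.

```python
def apply_rules(rules, candidates):
    """Build a rules-by-candidates matrix of votes, each in {-1, 0, 1}."""
    matrix = []
    for rule in rules:
        row = []
        for cand in candidates:
            vote = rule(cand)
            if vote not in (-1, 0, 1):
                raise ValueError('%s returned %r' % (rule.__name__, vote))
            row.append(vote)
        matrix.append(row)
    return matrix


def has_mutation(c):
    return 1 if 'mutation' in c['lemmas'] else 0


def multiword_pheno(c):
    return 1 if len(c['e2_idxs']) > 1 else -1


# Invented candidates: lemmas of the sentence plus token indexes of entity 2.
candidates = [
    {'lemmas': ['mutation', 'in', 'patient'], 'e2_idxs': [10]},
    {'lemmas': ['patient', 'with'], 'e2_idxs': [8, 9, 10]},
]
matrix = apply_rules([has_mutation, multiword_pheno], candidates)
```

Validating each vote as it is collected surfaces a misbehaving rule immediately instead of letting an out-of-range value reach the learning step.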

In [12]:
rels.get_rule_priority_vote_accuracy([1,1])

1.0
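
The notebook does not define "rule priority vote". One reading that is consistent with the outputs above is that each candidate takes the first non-zero vote in rule order, and the result is compared against the supplied ground truth. The sketch below implements that assumed interpretation; `priority_vote` and `accuracy` are hypothetical helpers, not `ddlite` functions.

```python
def priority_vote(matrix):
    """For each column, return the first non-zero vote in rule order, or 0 if none."""
    n_cands = len(matrix[0]) if matrix else 0
    votes = []
    for j in range(n_cands):
        vote = 0
        for row in matrix:
            if row[j] != 0:
                vote = row[j]
                break
        votes.append(vote)
    return votes


def accuracy(predicted, truth):
    """Fraction of positions where the prediction equals the ground truth."""
    correct = sum(1 for p, t in zip(predicted, truth) if p == t)
    return float(correct) / len(truth)


# Same values as the rule matrix shown earlier (rows = rules, columns = relations).
rule_matrix = [[1, 1], [0, 1], [-1, 1]]
votes = priority_vote(rule_matrix)
acc = accuracy(votes, [1, 1])
```

Under this reading, the first relation is labeled by `rule_1` even though `rule_3` disagrees, because earlier rules take precedence.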

## Feature Extraction

Feature extraction is push-button, although custom `treedlib` feature sets can be passed in as well:

In [13]:
rels.extract_features()
rels.feats

<102x2 sparse matrix of type '<type 'numpy.float64'>'
	with 191 stored elements in Compressed Sparse Row format>

## Learning

Here we use a very simple learning method and implementation:

In [14]:
rels.learn_feats_and_weights(sample=True, verbose=True, holdout=0)

Learning epoch = 0
Learning epoch = 100


  return np.log(float(p) / (1-p))


Learning epoch = 200
Learning epoch = 300
Learning epoch = 400
Learning epoch = 500
Learning epoch = 600
Learning epoch = 700
Learning epoch = 800
Learning epoch = 900
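
The stray `return np.log(float(p) / (1-p))` line in the output above appears to be the tail of a runtime warning. The expression is the log-odds of `p`, which is undefined when `p` is exactly 0 or 1. A common way to avoid this is to clip `p` before taking the log. The sketch below shows that technique; the function `safe_logit` and the `eps` value are assumptions, not part of `ddlite`.

```python
import math


def safe_logit(p, eps=1e-6):
    """Log-odds of p, with p clipped to [eps, 1 - eps] to avoid division by zero."""
    p = min(max(float(p), eps), 1.0 - eps)
    return math.log(p / (1.0 - p))


midpoint = safe_logit(0.5)   # log(1) for an uninformative probability
upper = safe_logit(1.0)      # finite, because p is clipped to 1 - eps
lower = safe_logit(0.0)      # finite, because p is clipped to eps
```

Clipping bounds the magnitude of the result, so a fully confident estimate no longer produces an infinite value.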


In [15]:
rels.get_predicted()

array([ 1.,  1.])

In [16]:
rels.get_classification_accuracy([1,1])

1.0

## Error Analysis

Now, let's look at a sample of extractions using [Mindtagger](http://deepdive.stanford.edu/labeling).  We can use a shorthand to create a Mindtagger task and launch it right from the notebook:

In [17]:
rels.open_mindtagger(num_sample=20, width='100%', height=600)

We can also retrieve the tags collected in Mindtagger with the following shorthand:

In [18]:
tags = rels.get_mindtagger_tags(); tags

[{u'foo': True, u'is_correct': False, u'sent_id': 0},
 {u'is_correct': True, u'sent_id': 1}]

From the tags, we can get a precision estimate as follows:

In [19]:
num_correct = sum(1 for tag in tags if tag[u'is_correct'])
"precision = %3.f%%" % (100 * num_correct * 1.0 / len(tags))

'precision =  50%'
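
The one-liner above raises an error if no tags have been collected yet or if an item lacks the `is_correct` key. A slightly more defensive sketch follows; `precision_from_tags` is a hypothetical helper, and skipping unjudged items is an assumption about the desired behavior.

```python
def precision_from_tags(tags, key='is_correct'):
    """Fraction of judged items marked correct; None when nothing has been judged."""
    judged = [tag for tag in tags if key in tag]
    if not judged:
        return None
    return sum(1 for tag in judged if tag[key]) / float(len(judged))


# Same tags as in the output above.
tags = [{'foo': True, 'is_correct': False, 'sent_id': 0},
        {'is_correct': True, 'sent_id': 1}]
estimate = precision_from_tags(tags)
```

Returning `None` for an empty sample keeps "no estimate yet" distinct from a precision of zero.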