# Tagging genes with ddlite

### Introduction
In this example **ddlite** app, we'll build a gene tagger from scratch. Here's why we developed ddlite:

* To provide a lighter-weight interface to structured information extraction for new DeepDive users
* To help advanced DeepDive users rapidly develop and prototype applications and distant supervision rules
* To investigate DeepDive's data programming approach to building inference systems

This example is centered around the second item. Domain-specific tagging systems take months or years to develop. They use hand-crafted model circuitry and accurate, hand-labeled training data. We're going to try to build a pretty good one in a few minutes with none of those things. The generalized extraction and learning utilities provided by ddlite will allow us to turn a sampling of article abstracts and some basic domain knowledge into an automated tagging system. Here's the pipeline we'll follow:

1. Obtain and parse input data (relevant article abstracts from PubMed)
2. Extract candidates for tagging
3. Generate features
4. Write distant supervision rules
5. Learn the tagging model
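
To make the pipeline concrete before we touch ddlite, here is a toy, plain-Python sketch of steps 2 through 4 on a single sentence. Nothing below is ddlite's API: the regular expressions, the two-entry gene dictionary, and the function names are all made up for illustration, and step 5 (learning) is left out.

```python
import re

# Toy stand-ins for steps 2-4; every name here is hypothetical, not ddlite's API.
SENTENCE = "Mutations in BCS1L cause mitochondrial complex III deficiency ."
KNOWN_GENES = {"BCS1L", "BRCA1"}                   # tiny made-up gene dictionary
CANDIDATE_RE = re.compile(r"^[A-Z][A-Z0-9]{2,}$")  # crude "looks like a gene symbol"
ROMAN_RE = re.compile(r"^[IVX]+$")                 # crude Roman-numeral check


def extract_candidates(tokens):
    """Step 2: indices of tokens that look like gene symbols."""
    return [i for i, tok in enumerate(tokens) if CANDIDATE_RE.match(tok)]


def featurize(tokens, i):
    """Step 3: a couple of simple features for the token at index i."""
    prev_word = tokens[i - 1].lower() if i > 0 else "<start>"
    return {"has_digit": any(c.isdigit() for c in tokens[i]),
            "prev_word": prev_word}


def label(tokens, i):
    """Step 4: one noisy rule: 1 (gene), -1 (not a gene), 0 (abstain)."""
    if tokens[i] in KNOWN_GENES:
        return 1
    if ROMAN_RE.match(tokens[i]):
        return -1
    return 0


tokens = SENTENCE.split()
candidates = extract_candidates(tokens)
features = [featurize(tokens, i) for i in candidates]
labels = [label(tokens, i) for i in candidates]
```

In this toy run the crude pattern picks up both `BCS1L` and `III` as candidates, and the rule then votes for the first and against the second. A permissive extractor paired with noisy labeling rules is the kind of division of labor that distant supervision relies on.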

Let's get to it.

In [1]:
%load_ext autoreload
%autoreload 2

from ddlite import *

### Processing the input data
We already downloaded the raw HTML for ??? gene-related article pages from PubMed using the **pubmed_gene_html.py** script. These can be found in the **data** folder. We can use ddlite's **DocParser** to read in the article text. There's a general HTML parser which finds visible text, but we can do better by writing a more specific version to just grab the abstract text.

In [2]:
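class PubMedAbstractParser(HTMLParser):
    # Override the text filter: keep a text node only if its parent
    # element is <abstracttext>, so everything outside the abstract is dropped
    def _cleaner(self, s):
        return s.parent.name == 'abstracttext'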

dp = DocParser('gene_tag_example/data/', PubMedAbstractParser())
docs = list(dp.parseDocs())
print docs[0]

Mutations in BCS1L, a respiratory chain complex III assembly chaperone, constitute a major cause of mitochondrial complex III deficiency and are associated with GRACILE and Bjrnstad syndromes. Here we describe a 4-year-old infant with hyperlactacidemia, mild liver dysfunction, hypotonia, growth and psychomotor retardation, dysmorphic features and mitochondrial complex III deficiency. Respiratory chain enzyme activities showed an isolated complex III defect in muscle and fibroblasts. Sequencing and polymerase chain reaction-restriction fragment length polymorphism (PCR-RFLP) analysis revealed a novel homozygous BCS1L mutation, c.148A>G, which caused a p.T50A substitution at an evolutionarily conserved BCS1L region. The severity of the complex III enzyme defect correlated with decreased amounts of BCS1L and respiratory chain complex III in the affected tissues. Our findings support a pathogenic role for the novel BCS1L mutation in a patient with a singular clinical phenotype.
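
For intuition about what the **_cleaner** filter is doing, here is a rough standard-library analogue that keeps only text whose immediate parent is an `<abstracttext>` element. It is a sketch, not ddlite's implementation: `AbstractTextExtractor`, `extract_abstract`, and the sample HTML are hypothetical, and it targets Python 3's `html.parser` rather than the Python 2 environment used in this notebook.

```python
from html.parser import HTMLParser as StdHTMLParser

# Elements that never get a closing tag, so they must not go on the stack.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class AbstractTextExtractor(StdHTMLParser):
    """Collect text that sits directly inside <abstracttext> elements."""

    def __init__(self):
        super().__init__()
        self._stack = []   # currently open tags, innermost last
        self.chunks = []   # text pieces found inside <abstracttext>

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self._stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; ignore stray closing tags.
        if tag in self._stack:
            while self._stack.pop() != tag:
                pass

    def handle_data(self, data):
        # Same test as _cleaner: is the immediate parent <abstracttext>?
        text = data.strip()
        if text and self._stack and self._stack[-1] == "abstracttext":
            self.chunks.append(text)


def extract_abstract(html):
    """Return the abstract text of one page as a single string."""
    parser = AbstractTextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


# Made-up page; the parser lowercases tag names, so <AbstractText> matches.
sample = ("<html><head><title>PubMed</title></head><body>"
          "<div class='nav'>Search</div>"
          "<AbstractText>Mutations in BCS1L cause complex III deficiency."
          "</AbstractText></body></html>")
abstract = extract_abstract(sample)
```

The page title and navigation text are skipped because their parents are `title` and `div`; only the abstract sentence survives. Like the **_cleaner** check, this looks at the immediate parent only, so text nested in inline markup inside the abstract would be missed.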


Now we'll use CoreNLP via ddlite's **SentenceParser** to parse each sentence. **DocParser** can drive this step too, so the **parseDocs** call above wasn't strictly necessary. We'll materialize the parsed sentences as lists now for easy access.

In [3]:
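# Drop the reference to the raw document text; we work with parsed sentences from here on
docs = None
# One list of parsed sentences per document
sents = [list(d_sents) for d_sents in dp.parseDocSentences()]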
print sents[0][0]

Sentence(words=['Mutations', 'in', 'BCS1L', ',', 'a', 'respiratory', 'chain', 'complex', 'III', 'assembly', 'chaperone', ',', 'constitute', 'a', 'major', 'cause', 'of', 'mitochondrial', 'complex', 'III', 'deficiency', 'and', 'are', 'associated', 'with', 'GRACILE', 'and', 'Bjrnstad', 'syndromes', '.'], lemmas=['mutation', 'in', 'bcs1l', ',', 'a', 'respiratory', 'chain', 'complex', 'iii', 'assembly', 'chaperone', ',', 'constitute', 'a', 'major', 'cause', 'of', 'mitochondrial', 'complex', 'iii', 'deficiency', 'and', 'be', 'associate', 'with', 'gracile', 'and', 'bjrnstad', 'syndrome', '.'], poses=['NNS', 'IN', 'NN', ',', 'DT', 'JJ', 'NN', 'NN', 'CD', 'NN', 'NN', ',', 'VBP', 'DT', 'JJ', 'NN', 'IN', 'JJ', 'NN', 'CD', 'NN', 'CC', 'VBP', 'VBN', 'IN', 'NN', 'CC', 'NN', 'NNS', '.'], dep_parents=[13, 3, 1, 3, 11, 11, 11, 11, 11, 11, 3, 3, 0, 16, 16, 13, 21, 21, 21, 21, 16, 13, 24, 13, 26, 24, 26, 29, 26, 13], dep_labels=['nsubj', 'case', 'nmod', 'punct', 'det', 'amod', 'compound', 'compound', 'nu