# Tagging genes with ddlite: candidate extraction

## Introduction
In this example **ddlite** app, we'll build a gene tagger from scratch. Domain-specific tagging systems typically take months or years to develop, because they rely on hand-crafted model circuitry and accurate, hand-labeled training data. We'll start to build a pretty good one in a few minutes with none of those things. The generalized extraction and learning utilities provided by ddlite will allow us to turn a sampling of article abstracts and some basic domain knowledge into an automated tagging system.

Specifically, we want an accurate tagger for genes in academic articles. We have comprehensive dictionaries of genes, but applying a simple matching rule might yield a lot of false positives. For example, "p53" might get tagged as a gene even when it refers to a page number. Our goal is to use distant supervision to improve precision.

Here's the pipeline we'll follow:

1. Obtain and parse input data (relevant article abstracts from PubMed)
2. Extract candidates for tagging
3. Generate features
4. Create a test set
5. Write labeling functions
6. Learn the tagging model
7. Iterate on labeling functions

Parts 1 and 2 are covered in this notebook, and parts 3 through 7 are covered in `GeneTaggerExample_Learning.ipynb`. Let's get to it.

In [1]:
%load_ext autoreload
%autoreload 2

import cPickle, os, sys
sys.path.insert(1, os.path.join(sys.path[0], '..'))

from ddlite import *

## Processing the input data
We already downloaded the raw HTML for 150 gene-related article pages from PubMed using the `pubmed_gene_html.py` script. These can be found in the `data` folder. We can use ddlite's `DocParser` to read in the article text. The general-purpose `HTMLReader` finds all visible text, but we can do better by writing a more specific subclass that grabs only the abstract text.

In [2]:
class PubMedAbstractReader(HTMLReader):
    def _cleaner(self, s):
        return (s.parent.name == 'abstracttext')

dp = DocParser('gene_tag_example/data/', PubMedAbstractReader())
docs = list(dp.readDocs())
print docs[0]

('7802348.html', 'Thrombin is a serine protease able to evoke biological responses from a variety of cells, including platelets, endothelial cells, fibroblasts and smooth muscle cells. The structure of the thrombin receptor present in the human megakaryoblastic cell line and in hamster fibroblasts has recently been deduced by expression in the Xenopus laevis oocyte. The cloned receptor is a new member of the seven transmembrane domain receptor family that interacts with G proteins. A large amino-terminal extracellular extension has a cleavage site for thrombin (Leu Asp Pro Arg/Ser Phe Leu Leu,/representing the cleavage site). Thrombin cleaves at this site, unmasking a new amino terminus, that functions like a ligand, binding to an as yet undefined site and eliciting receptor activation. Peptides similar to a new amino terminus created after cleavage are able to mimic thrombin cellular effects. These agonist peptides are used to analyse the role of the cloned receptor in the thrombin-sp
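To make the filtering idea concrete without depending on ddlite, here is a minimal sketch that keeps only the text inside `<abstracttext>` elements, using Python 3's standard `html.parser` module. The class `AbstractTextExtractor`, the helper `extract_abstract`, and the sample HTML are hypothetical and not part of ddlite; `HTMLReader` may work quite differently internally, and unlike the `_cleaner` above, this sketch also keeps text nested deeper inside the element.

```python
from html.parser import HTMLParser  # Python 3 standard library


class AbstractTextExtractor(HTMLParser):
    """Collect only the text that sits inside <abstracttext> elements."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.depth = 0      # > 0 while we are inside an <abstracttext> element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == 'abstracttext':   # HTMLParser lowercases tag names
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == 'abstracttext' and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth > 0:
            self.chunks.append(data)


def extract_abstract(html):
    """Return the abstract text of an HTML page with whitespace normalized."""
    parser = AbstractTextExtractor()
    parser.feed(html)
    parser.close()
    return ' '.join(' '.join(parser.chunks).split())


# Hypothetical page: only the middle element should survive
sample_html = (
    '<html><body><h1>Page title</h1>'
    '<AbstractText>Thrombin is a serine protease.</AbstractText>'
    '<p>Share this article</p></body></html>'
)
print(extract_abstract(sample_html))
```

The page title and the trailing widget text are dropped, which is the same effect the custom reader aims for on the PubMed pages.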

Now we'll use CoreNLP via ddlite's `SentenceParser` to parse each sentence. `DocParser` can handle this too through `parseDocSentences()`, so we didn't really need the `readDocs()` call above except to inspect the raw text. Parsing can take a little while, so if the example has already been run, we'll reload the saved result instead.

In [3]:
docs = None

pkl_f = 'gene_tag_example/gene_tag_saved_sents_v5.pkl'
try:
    with open(pkl_f, 'rb') as f:
        sents = cPickle.load(f)
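except (IOError, EOFError, cPickle.UnpicklingError):
    # No usable cache: parse with CoreNLP and save the result
    %time sents = dp.parseDocSentences()
    # Pickles are binary data, so write in binary mode
    with open(pkl_f, 'wb') as f: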
        cPickle.dump(sents, f)

print sents[0]

CPU times: user 8.35 s, sys: 56 ms, total: 8.4 s
Wall time: 53.5 s
Sentence(words=[u'Thrombin', u'is', u'a', u'serine', u'protease', u'able', u'to', u'evoke', u'biological', u'responses', u'from', u'a', u'variety', u'of', u'cells', u',', u'including', u'platelets', u',', u'endothelial', u'cells', u',', u'fibroblasts', u'and', u'smooth', u'muscle', u'cells', u'.'], lemmas=[u'thrombin', u'be', u'a', u'serine', u'protease', u'able', u'to', u'evoke', u'biological', u'response', u'from', u'a', u'variety', u'of', u'cell', u',', u'include', u'platelet', u',', u'endothelial', u'cell', u',', u'fibroblast', u'and', u'smooth', u'muscle', u'cell', u'.'], poses=[u'NN', u'VBZ', u'DT', u'NN', u'NN', u'JJ', u'TO', u'VB', u'JJ', u'NNS', u'IN', u'DT', u'NN', u'IN', u'NNS', u',', u'VBG', u'NNS', u',', u'JJ', u'NNS', u',', u'NNS', u'CC', u'VB', u'NN', u'NNS', u'.'], dep_parents=[5, 5, 5, 5, 0, 5, 8, 6, 10, 8, 13, 13, 10, 15, 13, 15, 18, 15, 18, 21, 18, 18, 18, 18, 27, 27, 18, 5], dep_labels=[u'nsubj', u'c
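The reload-or-parse logic in the cell above is a general caching pattern. Here is a minimal standalone sketch, assuming Python 3's `pickle` in place of `cPickle`; `load_or_compute` and `slow_parse` are hypothetical names, with `slow_parse` standing in for the expensive CoreNLP call.

```python
import os
import pickle
import tempfile


def load_or_compute(path, compute):
    """Return the cached result at `path`, computing and saving it if absent."""
    try:
        with open(path, 'rb') as f:
            return pickle.load(f)
    except (IOError, EOFError, pickle.UnpicklingError):
        # Missing or unreadable cache: recompute and write it in binary mode
        result = compute()
        with open(path, 'wb') as f:
            pickle.dump(result, f)
        return result


calls = []


def slow_parse():
    # Hypothetical stand-in for an expensive call such as dp.parseDocSentences()
    calls.append(1)
    return [['Thrombin', 'is', 'a', 'serine', 'protease', '.']]


cache_path = os.path.join(tempfile.mkdtemp(), 'sents.pkl')
first = load_or_compute(cache_path, slow_parse)    # computes and writes the cache
second = load_or_compute(cache_path, slow_parse)   # reads the cache instead
print(first == second, len(calls))
```

The second call returns the same data without invoking `slow_parse` again, so `calls` records a single invocation.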

## Extracting candidates with matchers
Extracting candidates for mentions (or relations) in ddlite is done with `Matcher` objects. First, we'll use a `DictionaryMatch`, since we have access to a pretty comprehensive gene dictionary. Let's load it in and create the matcher.

In [4]:
# Schema is: ENSEMBL_ID | NAME | TYPE (refseq, canonical, non-canonical)
with open('gene_tag_example/dicts/ensembl_genes.tsv') as f:
    genes = [line.rstrip().split('\t')[1] for line in f]
# Keep only names with at least 3 characters
genes = [g for g in genes if len(g) > 2]

gene_dm = DictionaryMatch(label='GeneName', dictionary=genes, ignore_case=False)
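Conceptually, dictionary matching checks every n-gram of a sentence against a set of known names. The sketch below illustrates that idea in plain Python; `dictionary_matches`, its `max_ngram` parameter, and the toy dictionary are assumptions made for illustration and do not reflect how `DictionaryMatch` is actually implemented.

```python
def dictionary_matches(words, dictionary, max_ngram=3, ignore_case=False):
    """Return (start, end, text) for every n-gram of `words` found in `dictionary`."""
    norm = (lambda s: s.lower()) if ignore_case else (lambda s: s)
    entries = set(norm(d) for d in dictionary)
    found = []
    for start in range(len(words)):
        for end in range(start + 1, min(start + max_ngram, len(words)) + 1):
            text = ' '.join(words[start:end])
            if norm(text) in entries:
                found.append((start, end, text))
    return found


# Hypothetical miniature dictionary; the real one is read from ensembl_genes.tsv
toy_genes = ['TP53', 'BRCA1', 'p53', 'F2R', 'AR']
toy_genes = [g for g in toy_genes if len(g) > 2]   # same length filter as above

words = 'Loss of BRCA1 and p53 is discussed on p53 of the review'.split()
for match in dictionary_matches(words, toy_genes):
    print(match)
```

Both occurrences of "p53" are returned, including the one that is a page reference. This is the kind of false positive from the introduction that the learning step is meant to remove.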

The dictionary match should provide fairly high recall, but we may still miss some candidates. We know that gene names are nouns and are often written in all uppercase. Let's use ddlite's *compositional* matcher operations to capture this. First, we'll write a matcher to find all nouns using the part-of-speech tags. Then, we'll use a filter to find uppercase sequences. Finally, we'll use a filter to make sure each match has at least 3 characters. We pass `noun_regex` to `up_regex`, and `up_regex` to `multi_regex`, to compose them with each other.

In [5]:
noun_regex = RegexNgramMatch(label='Nouns', regex_pattern=r'[A-Z]?NN[A-Z]?', ignore_case=True, match_attrib='poses')
up_regex = RegexFilterAll(noun_regex, label='Upper', regex_pattern=r'[A-Z]+([0-9]+)?([A-Z]+)?([0-9]+)?$', ignore_case=False, match_attrib='words')
multi_regex = RegexFilterAll(up_regex, label='Multi', regex_pattern=r'[a-z0-9]{3,}', ignore_case=True)
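The composition can be pictured as a base matcher whose output is narrowed by successive filters. The following is a rough sketch of that idea with the standard `re` module, not ddlite's implementation: it assumes that "filter all" means every token in a span must match, it simplifies the noun pattern to a plain `NN` prefix, and the functions `noun_spans` and `filter_all`, the sentence, and its tags are all hypothetical.

```python
import re

NOUN_TAG = re.compile(r'NN')                               # NN, NNS, NNP, NNPS
UPPER = re.compile(r'[A-Z]+([0-9]+)?([A-Z]+)?([0-9]+)?$')  # pattern from the cell above
LONG_ENOUGH = re.compile(r'[a-z0-9]{3,}', re.IGNORECASE)   # pattern from the cell above


def noun_spans(poses, max_ngram=2):
    """Base matcher: spans of consecutive tokens that all carry a noun tag."""
    spans = []
    for start in range(len(poses)):
        for end in range(start + 1, min(start + max_ngram, len(poses)) + 1):
            if all(NOUN_TAG.match(p) for p in poses[start:end]):
                spans.append((start, end))
    return spans


def filter_all(spans, words, pattern):
    """Filter: keep a span only if every token in it matches `pattern`."""
    return [(s, e) for (s, e) in spans
            if all(pattern.match(w) for w in words[s:e])]


# Hypothetical sentence with hand-written part-of-speech tags
words = ['The', 'cloned', 'PAR1', 'receptor', 'binds', 'G', 'proteins',
         'in', 'HEK293', 'cells', '.']
poses = ['DT', 'VBN', 'NN', 'NN', 'VBZ', 'NN', 'NNS',
         'IN', 'NN', 'NNS', '.']

nouns = noun_spans(poses)                       # all noun spans
upper = filter_all(nouns, words, UPPER)         # ... that are uppercase
multi = filter_all(upper, words, LONG_ENOUGH)   # ... with at least 3 characters
print([' '.join(words[s:e]) for (s, e) in multi])
```

Each stage only removes spans: "G" passes the uppercase filter but is dropped by the length filter, leaving "PAR1" and "HEK293". The latter is a cell line rather than a gene, which shows why these rules favor recall over precision.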

Since we want matches from both the dictionary and the uppercase noun-phrase matcher we just built, we'll use a `Union` object to create a single matcher that returns candidates from either one.

In [6]:
CE = Union(gene_dm, multi_regex)
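In effect, a union of matchers yields every candidate that at least one of them produces. A minimal sketch of that behavior on plain tuples follows, assuming that identical spans found by both matchers should be reported once; the `union` function and the `(sentence_id, start, end)` span format are hypothetical and say nothing about how ddlite's `Union` handles duplicates.

```python
def union(*candidate_lists):
    """Merge candidate spans from several matchers, dropping exact duplicates."""
    seen = set()
    merged = []
    for candidates in candidate_lists:
        for span in candidates:
            if span not in seen:
                seen.add(span)
                merged.append(span)
    return sorted(merged)


# Hypothetical (sentence_id, start, end) spans from two matchers
from_dictionary = [(0, 2, 3), (0, 8, 9)]
from_regex = [(0, 2, 3), (1, 4, 5)]
print(union(from_dictionary, from_regex))
```

The span `(0, 2, 3)` is found by both matchers but appears once in the merged list.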

## Creating the candidates
We'll use our unioned candidate extractor to extract our candidate entities from the sentences into an `Entities` object. Using both matchers together will provide very high recall, but may have poor precision. In the next demo notebook (`GeneTaggerExample_Learning.ipynb`), we'll write distant supervision rules and learn a model to improve precision.

In [7]:
E = Entities(sents, CE)

We can visualize contexts for our extractions too. This may help in writing labeling functions in `GeneTaggerExample_Learning.ipynb`.

In [8]:
E[0].render()

Finally, we'll pickle the extracted candidates from our `Entities` object for use in `GeneTaggerExample_Learning.ipynb`.

In [9]:
E.dump_candidates('gene_tag_example/gene_tag_saved_entities_v6.pkl')