## Test Run 2: Extract Meaningful Span

This notebook continues from Test Run 1, but defines *Span* more meaningfully. We want to achieve that, 1 *Sentence* = 1+ *Spans* = 1+ *Clauses*. After the default `Ngram()` class, we customize Matcher to exclude any span with comma.

Note that in this test run, we will initiate `SnorkelSession()` without re-create `Document` or `Sentence`, which has been created in test run 1.


In [1]:
from IPython.core.display import display, HTML
display(HTML("<style>.container { width:85% !important; }</style>"))

In [2]:
%load_ext autoreload
%autoreload 2
%matplotlib inline
import os

from snorkel import SnorkelSession
from snorkel.parser.spacy_parser import Spacy
from snorkel.parser import CorpusParser
from snorkel.models import Document, Sentence

session = SnorkelSession()
print("Documents:", session.query(Document).count())
print("Sentences:", session.query(Sentence).count())

sents = session.query(Sentence).all()
n_max_corpus=0
for sent in sents:
    n_max_corpus=max(n_max_corpus,len(sent.words))

print("The longest sentence has "+str(n_max_corpus)+" tokens.")

from snorkel.models import Document
from util import number_of_people

docs = session.query(Document).order_by(Document.name).all()

train_sents = set()
dev_sents   = set()
test_sents  = set()

for i, doc in enumerate(docs):
    for s in doc.sentences:
        if i % 10 == 8:
            dev_sents.add(s)
        elif i % 10 == 9:
            test_sents.add(s)
        else:
            train_sents.add(s)


Documents: 1000
Sentences: 4487
The longest sentence has 101 tokens.


In test run 1, we have tried using **single** `DictionaryMatcher`. Now it's time to explore more, e.g. combining **multiple** `DictionaryMatchers`. 

Fortunately, we could combine **multiple** `DictionaryMatchers` into what-we-named-as **compound** `Matchers` through:
1. `Union(Matcher)` (takes the union of candidate sets returned by child matcher), 
2. `Intersection(Matcher)`, or 
3. `Concat(Matcher)` (candidates which are the concatenation of adjacent matches from child matchers).

In [3]:
from snorkel.models import candidate_subclass
from snorkel.candidates import Ngrams, CandidateExtractor
from snorkel.matchers import *

Background = candidate_subclass('Background', ['background_cue'])
ngrams = Ngrams(n_max=n_max_corpus) # we define the maximum n value as n_max_corpus

from IPython.display import Markdown, display
def printmd(string):
    display(Markdown(string))
def extract_and_display(matcher):
    for i, sents in enumerate([train_sents, dev_sents, test_sents]):
        %time matcher.apply(sents, split=i)
        printmd("**Split "+str(i)+" - number of candidates extracted: "+str(session.query(Background).filter(Background.split == i).count())+"**\n\n")
    cands = session.query(Background).filter(Background.split == 0).all()
    document_list=list()
    for i in range(7): # to print all cands, range(len(cands))
        printmd("**"+str(i)+"/"+str(len(cands))+" Candidate/Span:**\t`"+str(cands[i].background_cue)+"`")
        printmd("**Its parent Sentence's text:**\t"+str(cands[i].get_parent().text))
        printmd("**Its parent Document's text:**\t"+str(cands[i].get_parent().get_parent().__dict__))
        print() 
        
    return cands
    

In [4]:
# Compound Matcher Example 1 (Dict-based): 
non_comma_matcher=DictionaryMatch(d=[','],longest_match_only=True,reverse=True)  
dict_background_matcher=DictionaryMatch(d=['previous','motivated','recent','widely'],longest_match_only=True) 
non_comma_dict_background_matcher=CandidateExtractor(Background, [ngrams], [Intersection(non_comma_matcher,dict_background_matcher)])
extract_and_display(non_comma_dict_background_matcher)

Clearing existing...
Running UDF...

CPU times: user 9.9 s, sys: 226 ms, total: 10.1 s
Wall time: 10.5 s


**Split 0 - number of candidates extracted: 106**



Clearing existing...
Running UDF...

CPU times: user 1.64 s, sys: 122 ms, total: 1.76 s
Wall time: 1.98 s


**Split 1 - number of candidates extracted: 11**



Clearing existing...
Running UDF...

CPU times: user 1.6 s, sys: 90.1 ms, total: 1.69 s
Wall time: 1.72 s


**Split 2 - number of candidates extracted: 9**



**0/106 Candidate/Span:**	`Span("b'While previous studies demonstrate the feasibility of using ACT-R to model collective cognition'", sentence=4273, chars=[0,94], words=[0,14])`

**Its parent Sentence's text:**	While previous studies demonstrate the feasibility of using ACT-R to model collective cognition, as well as sensemaking processes at the individual level, the 


**Its parent Document's text:**	{'_sa_instance_state': <sqlalchemy.orm.state.InstanceState object at 0x116c959b0>, 'name': '5834868425ff05a97b00df37', 'stable_id': '5834868425ff05a97b00df37::document:0:0', 'meta': {'file_name': '70kpaper_061418_cleaned.tsv'}, 'id': 713, 'type': 'document', 'sentences': [Sentence(Document 5834868425ff05a97b00df37,0,b'Cognitive social simulations, enabled by cognitive architectures (such as ACT-R), are particularly well-suited for advancing our understanding of socially-distributed and socially-situated cognition.'), Sentence(Document 5834868425ff05a97b00df37,1,b'As a result, multi-agent simulations featuring the use of ACT-R agents may be important in improving our understanding of the factors that influence collective sensemaking.'), Sentence(Document 5834868425ff05a97b00df37,2,b'While previous studies demonstrate the feasibility of using ACT-R to model collective cognition, as well as sensemaking processes at the individual level, the \n')]}




**1/106 Candidate/Span:**	`Span("b'Our approach relies on recent learning theory results for distribution regression'", sentence=2307, chars=[0,80], words=[0,10])`

**Its parent Sentence's text:**	Our approach relies on recent learning theory results for distribution regression, using kernel 


**Its parent Document's text:**	{'_sa_instance_state': <sqlalchemy.orm.state.InstanceState object at 0x116c53e48>, 'name': '5834868425ff05a97b00d9ba', 'stable_id': '5834868425ff05a97b00d9ba::document:0:0', 'meta': {'file_name': '70kpaper_061418_cleaned.tsv'}, 'id': 270, 'type': 'document', 'sentences': [Sentence(Document 5834868425ff05a97b00d9ba,0,b"Abstract We present a new solution to the``ecological inference''problem, of learning individual-level associations from aggregate data."), Sentence(Document 5834868425ff05a97b00d9ba,1,b'This problem has a long history and has attracted much attention, debate, claims that it is unsolvable, and purported solutions.'), Sentence(Document 5834868425ff05a97b00d9ba,2,b'Unlike other ecological inference techniques, our method makes use of unlabeled individual-level data by embedding the distribution over these predictors into a vector in Hilbert space.'), Sentence(Document 5834868425ff05a97b00d9ba,3,b'Our approach relies on recent learning theory results for distribution regression, using kernel \n')]}




**2/106 Candidate/Span:**	`Span("b'Abstract A recent trend (especially in electronic commerce) is higher levels of expressiveness in the mechanisms that mediate interactions such as auctions'", sentence=1776, chars=[0,154], words=[0,23])`

**Its parent Sentence's text:**	Abstract A recent trend (especially in electronic commerce) is higher levels of expressiveness in the mechanisms that mediate interactions such as auctions, exchanges, catalog offers, voting systems, matching of peers, and so on.

**Its parent Document's text:**	{'_sa_instance_state': <sqlalchemy.orm.state.InstanceState object at 0x116bcecc0>, 'name': '5834868425ff05a97b00caf4', 'stable_id': '5834868425ff05a97b00caf4::document:0:0', 'meta': {'file_name': '70kpaper_061418_cleaned.tsv'}, 'id': 153, 'type': 'document', 'sentences': [Sentence(Document 5834868425ff05a97b00caf4,0,b'Abstract A recent trend (especially in electronic commerce) is higher levels of expressiveness in the mechanisms that mediate interactions such as auctions, exchanges, catalog offers, voting systems, matching of peers, and so on.'), Sentence(Document 5834868425ff05a97b00caf4,1,b'Participants can express their preferences in drastically greater detail than ever before.'), Sentence(Document 5834868425ff05a97b00caf4,2,b'In many cases this trend is fueled by modern algorithms for winner determination that can handle the richer inputs.\n')]}




**3/106 Candidate/Span:**	`Span("b'Previous work with the programming language Lollimon demonstrates the expressive power of logic programming with linear logic in describing algorithms that have imperative elements or that must repeatedly make mutually exclusive \n'", sentence=5441, chars=[0,229], words=[0,31])`

**Its parent Sentence's text:**	Previous work with the programming language Lollimon demonstrates the expressive power of logic programming with linear logic in describing algorithms that have imperative elements or that must repeatedly make mutually exclusive 


**Its parent Document's text:**	{'_sa_instance_state': <sqlalchemy.orm.state.InstanceState object at 0x11350c2e8>, 'name': '5834868425ff05a97b00c4e3', 'stable_id': '5834868425ff05a97b00c4e3::document:0:0', 'meta': {'file_name': '70kpaper_061418_cleaned.tsv'}, 'id': 990, 'type': 'document', 'sentences': [Sentence(Document 5834868425ff05a97b00c4e3,0,b'Abstract Bottom-up logic programming can be used to declaratively specify many algorithms in a succinct and natural way, and McAllester and Ganzinger have shown that it is possible to define a cost semantics that enables reasoning about the running time of algorithms written as inference rules.'), Sentence(Document 5834868425ff05a97b00c4e3,1,b'Previous work with the programming language Lollimon demonstrates the expressive power of logic programming with linear logic in describing algorithms that have imperative elements or that must repeatedly make mutually exclusive \n')]}




**4/106 Candidate/Span:**	`Span("b'Our system is faster than previous work by over an order of magnitude and it is capable of dealing with hundreds of millions of documents and thousands of topics.'", sentence=1728, chars=[0,161], words=[0,29])`

**Its parent Sentence's text:**	Our system is faster than previous work by over an order of magnitude and it is capable of dealing with hundreds of millions of documents and thousands of topics.

**Its parent Document's text:**	{'_sa_instance_state': <sqlalchemy.orm.state.InstanceState object at 0x116c7e3c8>, 'name': '5834868425ff05a97b00deb2', 'stable_id': '5834868425ff05a97b00deb2::document:0:0', 'meta': {'file_name': '70kpaper_061418_cleaned.tsv'}, 'id': 142, 'type': 'document', 'sentences': [Sentence(Document 5834868425ff05a97b00deb2,0,b'Abstract This paper describes a high performance sampling architecture for inference of latent topic models on a cluster of workstations.'), Sentence(Document 5834868425ff05a97b00deb2,1,b'Our system is faster than previous work by over an order of magnitude and it is capable of dealing with hundreds of millions of documents and thousands of topics.'), Sentence(Document 5834868425ff05a97b00deb2,2,b'The algorithm relies on a novel communication structure, namely the use of a distributed (key, value) storage for synchronizing the sampler state between computers.'), Sentence(Document 5834868425ff05a97b00deb2,3,b'Our architecture entirely obviates the need for separate \n')]}




**5/106 Candidate/Span:**	`Span("b'Abstract Motivated by a radically new peer review system that the National Science Foundation recently experimented with'", sentence=3658, chars=[0,119], words=[0,16])`

**Its parent Sentence's text:**	Abstract Motivated by a radically new peer review system that the National Science Foundation recently experimented with, we study peer review systems in which proposals are reviewed by PIs who have submitted proposals themselves.

**Its parent Document's text:**	{'_sa_instance_state': <sqlalchemy.orm.state.InstanceState object at 0x116ba5278>, 'name': '5834868425ff05a97b00c420', 'stable_id': '5834868425ff05a97b00c420::document:0:0', 'meta': {'file_name': '70kpaper_061418_cleaned.tsv'}, 'id': 573, 'type': 'document', 'sentences': [Sentence(Document 5834868425ff05a97b00c420,0,b'Abstract Motivated by a radically new peer review system that the National Science Foundation recently experimented with, we study peer review systems in which proposals are reviewed by PIs who have submitted proposals themselves.'), Sentence(Document 5834868425ff05a97b00c420,1,b'An (m, k)-selection mechanism asks each PI to review m proposals, and uses these reviews to select (at most) k proposals.'), Sentence(Document 5834868425ff05a97b00c420,2,b"We are interested in impartial mechanisms, which guarantee that the ratings given by a PI to others' proposals do not affect the likelihood of the PI's own proposal \n")]}




**6/106 Candidate/Span:**	`Span("b'Abstract: Sparse representation technique has been widely used in various areas of computer vision over the last decades.'", sentence=5179, chars=[0,120], words=[0,19])`

**Its parent Sentence's text:**	Abstract: Sparse representation technique has been widely used in various areas of computer vision over the last decades.

**Its parent Document's text:**	{'_sa_instance_state': <sqlalchemy.orm.state.InstanceState object at 0x116cc19b0>, 'name': '5834868525ff05a97b00f4b4', 'stable_id': '5834868525ff05a97b00f4b4::document:0:0', 'meta': {'file_name': '70kpaper_061418_cleaned.tsv'}, 'id': 928, 'type': 'document', 'sentences': [Sentence(Document 5834868525ff05a97b00f4b4,0,b'Abstract: Sparse representation technique has been widely used in various areas of computer vision over the last decades.'), Sentence(Document 5834868525ff05a97b00f4b4,1,b'Unfortunately, in the current formulations, there are no explicit relationship between the learned dictionary and the original data.'), Sentence(Document 5834868525ff05a97b00f4b4,2,b'By tracing back and connecting sparse representation with the K-means algorithm, a novel variation scheme termed as self-explanatory convex sparse representation (SCSR) has been proposed in this paper.'), Sentence(Document 5834868525ff05a97b00f4b4,3,b'To be specific, the basis vectors of the dictionary are refined as convex \n')]}




[Background(Span("b'While previous studies demonstrate the feasibility of using ACT-R to model collective cognition'", sentence=4273, chars=[0,94], words=[0,14])),
 Background(Span("b'Our approach relies on recent learning theory results for distribution regression'", sentence=2307, chars=[0,80], words=[0,10])),
 Background(Span("b'Abstract A recent trend (especially in electronic commerce) is higher levels of expressiveness in the mechanisms that mediate interactions such as auctions'", sentence=1776, chars=[0,154], words=[0,23])),
 Background(Span("b'Previous work with the programming language Lollimon demonstrates the expressive power of logic programming with linear logic in describing algorithms that have imperative elements or that must repeatedly make mutually exclusive \n'", sentence=5441, chars=[0,229], words=[0,31])),
 Background(Span("b'Our system is faster than previous work by over an order of magnitude and it is capable of dealing with hundreds of millions of documents an

Next we move to Compound Matcher 2, esp. `RegexMatch`, to extract *Spans* that starts with "to"/"for"/"by". 

In [5]:
# Compound Matcher Example 2 (Regex-based):
# E.g. to match either begining To/By/For or middle to/by/or, 
# we could use "(^(?:(To)|(By)|(For)) .*$)|(^.+ (?:(to)|(by)|(for)) .*$)"

non_comma_matcher=DictionaryMatch(d=[','],longest_match_only=True,reverse=True)  
regex_background_matcher=RegexMatchSpan(rgx="(^(?:(To)|(By)|(For)) .*$)|(^.+ (?:(to)|(by)|(for)) .+$)",longest_match_only=True)  
non_comma_regex_background_matcher=CandidateExtractor(Background, [ngrams], [Intersection(non_comma_matcher,regex_background_matcher)]) # why isn't it children = [non_comma_matcher,regex_background_matcher]
extract_and_display(non_comma_regex_background_matcher)

Clearing existing...
Running UDF...

CPU times: user 11.2 s, sys: 227 ms, total: 11.4 s
Wall time: 11.5 s


**Split 0 - number of candidates extracted: 2042**



Clearing existing...
Running UDF...

CPU times: user 1.84 s, sys: 125 ms, total: 1.96 s
Wall time: 1.95 s


**Split 1 - number of candidates extracted: 239**



Clearing existing...
Running UDF...

CPU times: user 1.7 s, sys: 108 ms, total: 1.81 s
Wall time: 1.75 s


**Split 2 - number of candidates extracted: 236**



**0/2042 Candidate/Span:**	`Span("b'The transformations can improve cache locality of the loop traversal or enable other optimizations that have been impossible before due to bad data dependencies.'", sentence=1206, chars=[0,160], words=[0,24])`

**Its parent Sentence's text:**	The transformations can improve cache locality of the loop traversal or enable other optimizations that have been impossible before due to bad data dependencies.

**Its parent Document's text:**	{'_sa_instance_state': <sqlalchemy.orm.state.InstanceState object at 0x116b68e48>, 'name': '5834868425ff05a97b00c2b3', 'stable_id': '5834868425ff05a97b00c2b3::document:0:0', 'meta': {'file_name': '70kpaper_061418_cleaned.tsv'}, 'id': 43, 'type': 'document', 'sentences': [Sentence(Document 5834868425ff05a97b00c2b3,0,b'In this lecture we consider loop transformations that can be used for cache optimization.'), Sentence(Document 5834868425ff05a97b00c2b3,1,b'The transformations can improve cache locality of the loop traversal or enable other optimizations that have been impossible before due to bad data dependencies.'), Sentence(Document 5834868425ff05a97b00c2b3,2,b'The same loop transformations are needed for loop parallelization and vectorization.\n')]}




**1/2042 Candidate/Span:**	`Span("b'For our logic'", sentence=3099, chars=[0,12], words=[0,2])`

**Its parent Sentence's text:**	For our logic, we introduce a calculus over real arithmetic with discrete induction and a new 


**Its parent Document's text:**	{'_sa_instance_state': <sqlalchemy.orm.state.InstanceState object at 0x116b68278>, 'name': '5834868425ff05a97b00c29a', 'stable_id': '5834868425ff05a97b00c29a::document:0:0', 'meta': {'file_name': '70kpaper_061418_cleaned.tsv'}, 'id': 440, 'type': 'document', 'sentences': [Sentence(Document 5834868425ff05a97b00c29a,0,b'Synopsis We generalise dynamic logic to a logic for differential-algebraic programs, ie, discrete programs augmented with first-order differential-algebraic formulas as continuous evolution constraints in addition to first-order discrete jump formulas.'), Sentence(Document 5834868425ff05a97b00c29a,1,b'These programs characterise interacting discrete and continuous dynamics of hybrid systems elegantly and uniformly, including systems with disturbance and differential-algebraic equations.'), Sentence(Document 5834868425ff05a97b00c29a,2,b'For our logic, we introduce a calculus over real arithmetic with discrete induction and a new \n')]}




**2/2042 Candidate/Span:**	`Span("b'For example'", sentence=5019, chars=[0,10], words=[0,1])`

**Its parent Sentence's text:**	For example, the IEEE 802.11 wireless LAN standard power-saving mode (PSM) makes access points buffer data destined for a client and uses a periodic beacon to inform 


**Its parent Document's text:**	{'_sa_instance_state': <sqlalchemy.orm.state.InstanceState object at 0x116bd9e10>, 'name': '5834868425ff05a97b00cb43', 'stable_id': '5834868425ff05a97b00cb43::document:0:0', 'meta': {'file_name': '70kpaper_061418_cleaned.tsv'}, 'id': 891, 'type': 'document', 'sentences': [Sentence(Document 5834868425ff05a97b00cb43,0,b'Abstract A significant contributor to the total energy consumed on many battery-powered mobile devices is the wireless network interface card (NIC).'), Sentence(Document 5834868425ff05a97b00cb43,1,b'Therefore, minimizing the energy consumption of the NIC can result in significant battery life improvement for a mobile device.'), Sentence(Document 5834868425ff05a97b00cb43,2,b'One way of achieving this is to transition the NIC to a lower-power sleep mode when possible'), Sentence(Document 5834868425ff05a97b00cb43,3,b'For example, the IEEE 802.11 wireless LAN standard power-saving mode (PSM) makes access points buffer data destined for a client and uses a periodic beacon to inform \n')]}




**3/2042 Candidate/Span:**	`Span("b'the IEEE 802.11 wireless LAN standard power-saving mode (PSM) makes access points buffer data destined for a client and uses a periodic beacon to inform \n'", sentence=5019, chars=[13,166], words=[3,32])`

**Its parent Sentence's text:**	For example, the IEEE 802.11 wireless LAN standard power-saving mode (PSM) makes access points buffer data destined for a client and uses a periodic beacon to inform 


**Its parent Document's text:**	{'_sa_instance_state': <sqlalchemy.orm.state.InstanceState object at 0x116bd9e10>, 'name': '5834868425ff05a97b00cb43', 'stable_id': '5834868425ff05a97b00cb43::document:0:0', 'meta': {'file_name': '70kpaper_061418_cleaned.tsv'}, 'id': 891, 'type': 'document', 'sentences': [Sentence(Document 5834868425ff05a97b00cb43,0,b'Abstract A significant contributor to the total energy consumed on many battery-powered mobile devices is the wireless network interface card (NIC).'), Sentence(Document 5834868425ff05a97b00cb43,1,b'Therefore, minimizing the energy consumption of the NIC can result in significant battery life improvement for a mobile device.'), Sentence(Document 5834868425ff05a97b00cb43,2,b'One way of achieving this is to transition the NIC to a lower-power sleep mode when possible'), Sentence(Document 5834868425ff05a97b00cb43,3,b'For example, the IEEE 802.11 wireless LAN standard power-saving mode (PSM) makes access points buffer data destined for a client and uses a periodic beacon to inform \n')]}




**4/2042 Candidate/Span:**	`Span("b'While previous studies demonstrate the feasibility of using ACT-R to model collective cognition'", sentence=4273, chars=[0,94], words=[0,14])`

**Its parent Sentence's text:**	While previous studies demonstrate the feasibility of using ACT-R to model collective cognition, as well as sensemaking processes at the individual level, the 


**Its parent Document's text:**	{'_sa_instance_state': <sqlalchemy.orm.state.InstanceState object at 0x116c959b0>, 'name': '5834868425ff05a97b00df37', 'stable_id': '5834868425ff05a97b00df37::document:0:0', 'meta': {'file_name': '70kpaper_061418_cleaned.tsv'}, 'id': 713, 'type': 'document', 'sentences': [Sentence(Document 5834868425ff05a97b00df37,0,b'Cognitive social simulations, enabled by cognitive architectures (such as ACT-R), are particularly well-suited for advancing our understanding of socially-distributed and socially-situated cognition.'), Sentence(Document 5834868425ff05a97b00df37,1,b'As a result, multi-agent simulations featuring the use of ACT-R agents may be important in improving our understanding of the factors that influence collective sensemaking.'), Sentence(Document 5834868425ff05a97b00df37,2,b'While previous studies demonstrate the feasibility of using ACT-R to model collective cognition, as well as sensemaking processes at the individual level, the \n')]}




**5/2042 Candidate/Span:**	`Span("b'Our approach relies on recent learning theory results for distribution regression'", sentence=2307, chars=[0,80], words=[0,10])`

**Its parent Sentence's text:**	Our approach relies on recent learning theory results for distribution regression, using kernel 


**Its parent Document's text:**	{'_sa_instance_state': <sqlalchemy.orm.state.InstanceState object at 0x116c53e48>, 'name': '5834868425ff05a97b00d9ba', 'stable_id': '5834868425ff05a97b00d9ba::document:0:0', 'meta': {'file_name': '70kpaper_061418_cleaned.tsv'}, 'id': 270, 'type': 'document', 'sentences': [Sentence(Document 5834868425ff05a97b00d9ba,0,b"Abstract We present a new solution to the``ecological inference''problem, of learning individual-level associations from aggregate data."), Sentence(Document 5834868425ff05a97b00d9ba,1,b'This problem has a long history and has attracted much attention, debate, claims that it is unsolvable, and purported solutions.'), Sentence(Document 5834868425ff05a97b00d9ba,2,b'Unlike other ecological inference techniques, our method makes use of unlabeled individual-level data by embedding the distribution over these predictors into a vector in Hilbert space.'), Sentence(Document 5834868425ff05a97b00d9ba,3,b'Our approach relies on recent learning theory results for distribution regression, using kernel \n')]}




**6/2042 Candidate/Span:**	`Span("b'most work is based on assumptions which are not suited to large scale \n'", sentence=5297, chars=[79,149], words=[13,26])`

**Its parent Sentence's text:**	Although much related work has been done on efficient delivery of information, most work is based on assumptions which are not suited to large scale 


**Its parent Document's text:**	{'_sa_instance_state': <sqlalchemy.orm.state.InstanceState object at 0x116ca0710>, 'name': '5834868425ff05a97b00e0b4', 'stable_id': '5834868425ff05a97b00e0b4::document:0:0', 'meta': {'file_name': '70kpaper_061418_cleaned.tsv'}, 'id': 957, 'type': 'document', 'sentences': [Sentence(Document 5834868425ff05a97b00e0b4,0,b'Abstract: Effective communication among agents in large teams is crucial because the members share a common goal but only have a partial views of the environment.'), Sentence(Document 5834868425ff05a97b00e0b4,1,b'Information sharing is difficult in a large team because, a team member may have a piece of valuable information but not know who needs the information, since it is infeasible to know what each other agent is doing.'), Sentence(Document 5834868425ff05a97b00e0b4,2,b'Although much related work has been done on efficient delivery of information, most work is based on assumptions which are not suited to large scale \n')]}




[Background(Span("b'The transformations can improve cache locality of the loop traversal or enable other optimizations that have been impossible before due to bad data dependencies.'", sentence=1206, chars=[0,160], words=[0,24])),
 Background(Span("b'For our logic'", sentence=3099, chars=[0,12], words=[0,2])),
 Background(Span("b'For example'", sentence=5019, chars=[0,10], words=[0,1])),
 Background(Span("b'the IEEE 802.11 wireless LAN standard power-saving mode (PSM) makes access points buffer data destined for a client and uses a periodic beacon to inform \n'", sentence=5019, chars=[13,166], words=[3,32])),
 Background(Span("b'While previous studies demonstrate the feasibility of using ACT-R to model collective cognition'", sentence=4273, chars=[0,94], words=[0,14])),
 Background(Span("b'Our approach relies on recent learning theory results for distribution regression'", sentence=2307, chars=[0,80], words=[0,10])),
 Background(Span("b'most work is based on assumptions which are not s

In [6]:
# Compound Matcher Example 3 (Regex-based):
# E.g. to match either begining To/By/For or middle to/by/or, BUT ALSO excludes "according to", "to have", etc. 

non_comma_matcher=DictionaryMatch(d=[','],longest_match_only=True,reverse=True)  
regex_background_matcher=RegexMatchSpan(rgx="(^(?:(To)|(By)|(For)) .*$)|(^(?!(according)|(According))+ (?:(to)|(by)|(for)) (?!(have))+$)",longest_match_only=True,ignore_case=False)  
non_comma_regex_background_matcher=CandidateExtractor(Background, [ngrams], [Intersection(non_comma_matcher,regex_background_matcher)])
extract_and_display(non_comma_regex_background_matcher)

Clearing existing...
Running UDF...

CPU times: user 9.84 s, sys: 289 ms, total: 10.1 s
Wall time: 10.3 s


**Split 0 - number of candidates extracted: 122**



Clearing existing...
Running UDF...

CPU times: user 1.43 s, sys: 74.7 ms, total: 1.5 s
Wall time: 1.5 s


**Split 1 - number of candidates extracted: 13**



Clearing existing...
Running UDF...

CPU times: user 1.46 s, sys: 78.8 ms, total: 1.54 s
Wall time: 1.51 s


**Split 2 - number of candidates extracted: 15**



**0/122 Candidate/Span:**	`Span("b'For our logic'", sentence=3099, chars=[0,12], words=[0,2])`

**Its parent Sentence's text:**	For our logic, we introduce a calculus over real arithmetic with discrete induction and a new 


**Its parent Document's text:**	{'_sa_instance_state': <sqlalchemy.orm.state.InstanceState object at 0x116b68278>, 'name': '5834868425ff05a97b00c29a', 'stable_id': '5834868425ff05a97b00c29a::document:0:0', 'meta': {'file_name': '70kpaper_061418_cleaned.tsv'}, 'id': 440, 'type': 'document', 'sentences': [Sentence(Document 5834868425ff05a97b00c29a,0,b'Synopsis We generalise dynamic logic to a logic for differential-algebraic programs, ie, discrete programs augmented with first-order differential-algebraic formulas as continuous evolution constraints in addition to first-order discrete jump formulas.'), Sentence(Document 5834868425ff05a97b00c29a,1,b'These programs characterise interacting discrete and continuous dynamics of hybrid systems elegantly and uniformly, including systems with disturbance and differential-algebraic equations.'), Sentence(Document 5834868425ff05a97b00c29a,2,b'For our logic, we introduce a calculus over real arithmetic with discrete induction and a new \n')]}




**1/122 Candidate/Span:**	`Span("b'For example'", sentence=5019, chars=[0,10], words=[0,1])`

**Its parent Sentence's text:**	For example, the IEEE 802.11 wireless LAN standard power-saving mode (PSM) makes access points buffer data destined for a client and uses a periodic beacon to inform 


**Its parent Document's text:**	{'_sa_instance_state': <sqlalchemy.orm.state.InstanceState object at 0x116bd9e10>, 'name': '5834868425ff05a97b00cb43', 'stable_id': '5834868425ff05a97b00cb43::document:0:0', 'meta': {'file_name': '70kpaper_061418_cleaned.tsv'}, 'id': 891, 'type': 'document', 'sentences': [Sentence(Document 5834868425ff05a97b00cb43,0,b'Abstract A significant contributor to the total energy consumed on many battery-powered mobile devices is the wireless network interface card (NIC).'), Sentence(Document 5834868425ff05a97b00cb43,1,b'Therefore, minimizing the energy consumption of the NIC can result in significant battery life improvement for a mobile device.'), Sentence(Document 5834868425ff05a97b00cb43,2,b'One way of achieving this is to transition the NIC to a lower-power sleep mode when possible'), Sentence(Document 5834868425ff05a97b00cb43,3,b'For example, the IEEE 802.11 wireless LAN standard power-saving mode (PSM) makes access points buffer data destined for a client and uses a periodic beacon to inform \n')]}




**2/122 Candidate/Span:**	`Span("b'To explore this hypothesis'", sentence=4149, chars=[0,25], words=[0,3])`

**Its parent Sentence's text:**	To explore this hypothesis, here we profiled the expression of 16 genes at eight developmental stages.

**Its parent Document's text:**	{'_sa_instance_state': <sqlalchemy.orm.state.InstanceState object at 0x1135b36d8>, 'name': '5834868425ff05a97b00c54b', 'stable_id': '5834868425ff05a97b00c54b::document:0:0', 'meta': {'file_name': '70kpaper_061418_cleaned.tsv'}, 'id': 685, 'type': 'document', 'sentences': [Sentence(Document 5834868425ff05a97b00c54b,0,b'ABSTRACT In our companion study (Jarvis et al.[2013] J Comp Neurol.'), Sentence(Document 5834868425ff05a97b00c54b,1,b'doi:'), Sentence(Document 5834868425ff05a97b00c54b,2,b'10.1002/cne.'), Sentence(Document 5834868425ff05a97b00c54b,3,b'23404)'), Sentence(Document 5834868425ff05a97b00c54b,4,b'we used quantitative brain molecular profiling to discover that distinct subdivisions in the avian pallium above and below the ventricle and the associated mesopallium lamina have similar molecular profiles, leading to a hypothesis that they may form as continuous subdivisions around the lateral ventricle.'), Sentence(Document 5834868425ff05a97b00c54b,5,b'To explore this hypothesis, here we profiled the expression of 16 genes at eight developmental stages.'), Sentence(Document 5834868425ff05a97b00c54b,6,b'The genes included those that \n')]}




**3/122 Candidate/Span:**	`Span("b'For example'", sentence=3893, chars=[0,10], words=[0,1])`

**Its parent Sentence's text:**	For example, for visualization tools we evaluated the suitability of the level of abstraction and 


**Its parent Document's text:**	{'_sa_instance_state': <sqlalchemy.orm.state.InstanceState object at 0x116c8a1d0>, 'name': '5834868425ff05a97b00def4', 'stable_id': '5834868425ff05a97b00def4::document:0:0', 'meta': {'file_name': '70kpaper_061418_cleaned.tsv'}, 'id': 624, 'type': 'document', 'sentences': [Sentence(Document 5834868425ff05a97b00def4,0,b'Abstract This chapter discusses a set of co-ordination tools (the Continuous Co-ordination (CC) tool suite that includes Ariadne, Workspace Activity Viewer (WAV), Lighthouse, Palant\xc3\xadr, and YANCEES) and details of our evaluation framework for these tools.'), Sentence(Document 5834868425ff05a97b00def4,1,b'Specifically, we discuss how we assessed the usefulness and the usability of these tools within the context of a predefined evaluation framework called DESMET DESMET.'), Sentence(Document 5834868425ff05a97b00def4,2,b'For example, for visualization tools we evaluated the suitability of the level of abstraction and \n')]}




**4/122 Candidate/Span:**	`Span("b'For this task'", sentence=1616, chars=[0,12], words=[0,2])`

**Its parent Sentence's text:**	For this task, we show that depth sensors are particularly informative for extracting near-field interactions of the camera wearer with his/her environment.

**Its parent Document's text:**	{'_sa_instance_state': <sqlalchemy.orm.state.InstanceState object at 0x116b8fc50>, 'name': '5834868425ff05a97b00c3b1', 'stable_id': '5834868425ff05a97b00c3b1::document:0:0', 'meta': {'file_name': '70kpaper_061418_cleaned.tsv'}, 'id': 118, 'type': 'document', 'sentences': [Sentence(Document 5834868425ff05a97b00c3b1,0,b'Abstract: We focus on the task of everyday hand pose estimation from egocentric viewpoints.'), Sentence(Document 5834868425ff05a97b00c3b1,1,b'For this task, we show that depth sensors are particularly informative for extracting near-field interactions of the camera wearer with his/her environment.'), Sentence(Document 5834868425ff05a97b00c3b1,2,b'Despite the recent advances in full-body pose estimation using Kinect-like sensors, reliable monocular hand pose estimation in RGB-D images is still an unsolved problem.'), Sentence(Document 5834868425ff05a97b00c3b1,3,b'The problem is considerably exacerbated when analyzing hands performing daily activities from a first-person \n')]}




**5/122 Candidate/Span:**	`Span("b'To many people'", sentence=4580, chars=[0,13], words=[0,2])`

**Its parent Sentence's text:**	To many people, the Web and the Internet are synonymous.

**Its parent Document's text:**	{'_sa_instance_state': <sqlalchemy.orm.state.InstanceState object at 0x116bfb828>, 'name': '5834868425ff05a97b00ccaf', 'stable_id': '5834868425ff05a97b00ccaf::document:0:0', 'meta': {'file_name': '70kpaper_061418_cleaned.tsv'}, 'id': 785, 'type': 'document', 'sentences': [Sentence(Document 5834868425ff05a97b00ccaf,0,b'Abstract Over the last few years, the World-Wide Web [2] has risen to dominance as a mechanism for wide-area information access.'), Sentence(Document 5834868425ff05a97b00ccaf,1,b'Each day brings new reports of the growth of the Web, and this trend shows no signs of abating any time soon.'), Sentence(Document 5834868425ff05a97b00ccaf,2,b'To many people, the Web and the Internet are synonymous.'), Sentence(Document 5834868425ff05a97b00ccaf,3,b'Unfortunately, success has exposed many limitations of the Web such as its tendency to overload the network and servers, its limited ability to control access to sensitive data, its lack of mechanisms for data consistency, and its susceptibility \n')]}




**6/122 Candidate/Span:**	`Span("b'For traffic with bandwidth guarantees'", sentence=1188, chars=[0,36], words=[0,4])`

**Its parent Sentence's text:**	For traffic with bandwidth guarantees, we found that several routing algorithms that favor paths with fewer hops perform well.

**Its parent Document's text:**	{'_sa_instance_state': <sqlalchemy.orm.state.InstanceState object at 0x116c535c0>, 'name': '5834868425ff05a97b00d955', 'stable_id': '5834868425ff05a97b00d955::document:0:0', 'meta': {'file_name': '70kpaper_061418_cleaned.tsv'}, 'id': 39, 'type': 'document', 'sentences': [Sentence(Document 5834868425ff05a97b00d955,0,b'Abstract Quality-of-Service (QoS) routing tries to select a path that satisfies a set of QoS constraints, while also achieving overall network resource efficiency.'), Sentence(Document 5834868425ff05a97b00d955,1,b'We present initial results on QoS path selection for traffic requiring bandwidth and delay guarantees.'), Sentence(Document 5834868425ff05a97b00d955,2,b'For traffic with bandwidth guarantees, we found that several routing algorithms that favor paths with fewer hops perform well.'), Sentence(Document 5834868425ff05a97b00d955,3,b'For traffic with delay guarantees, we show that for a broad class of WFQ-like scheduling algorithms, the problem of finding a path satisfying bandwidth, delay, \n')]}




[Background(Span("b'For our logic'", sentence=3099, chars=[0,12], words=[0,2])),
 Background(Span("b'For example'", sentence=5019, chars=[0,10], words=[0,1])),
 Background(Span("b'To explore this hypothesis'", sentence=4149, chars=[0,25], words=[0,3])),
 Background(Span("b'For example'", sentence=3893, chars=[0,10], words=[0,1])),
 Background(Span("b'For this task'", sentence=1616, chars=[0,12], words=[0,2])),
 Background(Span("b'To many people'", sentence=4580, chars=[0,13], words=[0,2])),
 Background(Span("b'For traffic with bandwidth guarantees'", sentence=1188, chars=[0,36], words=[0,4])),
 Background(Span("b'To communicate with a service provider'", sentence=2518, chars=[0,37], words=[0,5])),
 Background(Span("b'To accomplish biologically meaningful gene selection from microarray data'", sentence=5348, chars=[0,72], words=[0,8])),
 Background(Span("b'For traffic with delay guarantees'", sentence=1189, chars=[0,32], words=[0,4])),
 Background(Span("b'To computationally model discou

The above three simple compound matchers help us familiarize with Snorkel syntax, next we will carve the matchers more carefully. We look back to the 70K paper data, and identified the following signal phrases. These phrases then will become our labelling function. 

1. Background:    `"basic problem", "a key issue", "how do", "Existing", "how [adj.] are XXX, Researchers", "Increasingly", "over the last few years", "Recent trends", "have been", "has been", "Little is known", "One major goal", "advances in"`

2. Mechanism:    `"grounded in", "draw inspiration from"`

3. Method:    `"propose", "This paper/we develops/adopts/describes/presents/examines/extends/studies/validated/conducted", "empirical research", "our focus is"`

4. Finding:    `"We find/found that", "We report", "results/experiment show that", "results suggest", "demonstrate(s)/reveal(s)/outperform(s/ed)", `

Also perhaps we could have a reverse matcher (`Spam Filter`) that filters out lecture/book abstract: `"This book", "this lecture"`

In [7]:
# Compound Matcher Example 4 (Regex-based):
# comma-separated + To/By/For or middle to/by/or + background-related dict words

regex_background_matcher=RegexMatchSpan(rgx="(^(?:(To)|(By)|(For)) .*$)|(^(?!(according)|(According))+ (?:(in)|(to)|(by)|(for)) (?!(have))+$)",longest_match_only=True,ignore_case=False)  
dict_background_matcher2=DictionaryMatch(d=['previous','motivated','recent','widely','increasingly','existing','researchers','advances','trends'],ignore_case=True,longest_match_only=True) 
intersected_background_matcher=CandidateExtractor(Background, [ngrams], [Intersection(non_comma_matcher,regex_background_matcher,dict_background_matcher2)])
cands=extract_and_display(intersected_background_matcher)



Clearing existing...
Running UDF...

CPU times: user 10.1 s, sys: 287 ms, total: 10.4 s
Wall time: 10.8 s


**Split 0 - number of candidates extracted: 0**



Clearing existing...
Running UDF...

CPU times: user 1.6 s, sys: 82.6 ms, total: 1.68 s
Wall time: 1.67 s


**Split 1 - number of candidates extracted: 0**



Clearing existing...
Running UDF...

CPU times: user 1.45 s, sys: 96.7 ms, total: 1.54 s
Wall time: 1.54 s


**Split 2 - number of candidates extracted: 0**



IndexError: list index out of range


### Stopped here as of Mon 06/25 1:36 am. More to come ... ###

## TODO List