## Test Run 2: Extract Meaningful Span

This notebook continues from Test Run 1 but defines a *Span* more meaningfully. The goal is for each *Sentence* to yield one or more *Spans*, each corresponding to a *Clause* (1 *Sentence* = 1+ *Spans* = 1+ *Clauses*). Starting from the default `Ngrams` class, we customize the matcher to exclude any span that contains a comma.
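To make the "no comma inside a span" idea concrete, here is a minimal sketch in plain Python, independent of Snorkel. The helper `comma_free_spans` is a hypothetical name introduced only for illustration; it is not part of the notebook or of Snorkel. It returns inclusive word ranges, the same convention as the `words=[start,end]` field in the `Span` output further down.

```python
def comma_free_spans(words):
    """Return inclusive (start, end) word ranges of maximal comma-free runs.

    Hypothetical helper for illustration only; not a Snorkel function.
    """
    spans = []
    start = None
    for i, word in enumerate(words):
        if word == ",":
            # A comma closes the current run, if one is open
            if start is not None:
                spans.append((start, i - 1))
                start = None
        elif start is None:
            # First non-comma token after a comma (or at sentence start)
            start = i
    if start is not None:
        spans.append((start, len(words) - 1))
    return spans


words = ["In", "previous", "work", ",", "we", "developed", "a", "technique", "."]
spans = comma_free_spans(words)
print(spans)
```

Each range is a clause-like chunk of the sentence. The notebook itself obtains such spans differently, by filtering `Ngrams` candidates through matchers.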

Note that in this test run we initialize `SnorkelSession()` without re-creating the `Document` or `Sentence` objects, which were already created in Test Run 1.


In [1]:
%load_ext autoreload
%autoreload 2
%matplotlib inline
import os

from snorkel import SnorkelSession
from snorkel.parser.spacy_parser import Spacy
from snorkel.parser import CorpusParser
from snorkel.models import Document, Sentence

session = SnorkelSession()
print("Documents:", session.query(Document).count())
print("Sentences:", session.query(Sentence).count())

sents = session.query(Sentence).all()

# Length (in tokens) of the longest sentence; used later as n_max for Ngrams
n_max_corpus = 0
for sent in sents:
    n_max_corpus = max(n_max_corpus, len(sent.words))

print("The longest sentence has " + str(n_max_corpus) + " tokens.")

from snorkel.models import Document
from util import number_of_people

docs = session.query(Document).order_by(Document.name).all()

train_sents = set()
dev_sents   = set()
test_sents  = set()

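# Split by document: of every 10 documents (ordered by name),
# index 8 goes to dev, index 9 to test, and the rest to train
for i, doc in enumerate(docs):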
    for s in doc.sentences:
        if i % 10 == 8:
            dev_sents.add(s)
        elif i % 10 == 9:
            test_sents.add(s)
        else:
            train_sents.add(s)


Documents: 1000
Sentences: 4487
The longest sentence has 101 tokens.


In Test Run 1, we used a **single** `DictionaryMatch` matcher. Now we explore combining **multiple** `DictionaryMatch` matchers.

Fortunately, we can combine them with `Union(Matcher)` (takes the union of the candidate sets returned by the child matchers), `Intersection(Matcher)` (takes their intersection), or `Concat(Matcher)` (yields candidates that are the concatenation of adjacent matches from the child matchers).
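The three combinators can be pictured with plain Python sets. The sketch below is not Snorkel's implementation: it assumes each child matcher has already produced a set of inclusive `(start, end)` word ranges, and the function names `union`, `intersection`, and `concat` are hypothetical stand-ins chosen for illustration.

```python
def union(*match_sets):
    """Spans accepted by at least one child matcher."""
    result = set()
    for matches in match_sets:
        result |= matches
    return result


def intersection(*match_sets):
    """Spans accepted by every child matcher."""
    return set.intersection(*match_sets)


def concat(left, right):
    """Join a left match with a right match that starts on the next word."""
    return {
        (l_start, r_end)
        for (l_start, l_end) in left
        for (r_start, r_end) in right
        if r_start == l_end + 1
    }


# Toy match sets standing in for the output of two child matchers
a = {(0, 2), (4, 8)}
b = {(4, 8), (9, 10)}

print(union(a, b))
print(intersection(a, b))
print(concat(a, b))
```

In this toy picture, the cell that follows uses the intersection case: a span is kept only if both the comma filter and the background-cue matcher accept it.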

In [4]:
from snorkel.models import candidate_subclass
from snorkel.candidates import Ngrams, CandidateExtractor
from snorkel.matchers import *

Background = candidate_subclass('Background', ['background_cue'])

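ngrams = Ngrams(n_max=n_max_corpus)  # n-grams up to the length of the longest sentence

# reverse=True inverts the dictionary match; used here to filter out comma spans
non_comma_matcher = DictionaryMatch(d=[','], longest_match_only=True, reverse=True)
background_matcher = DictionaryMatch(d=['previous', 'motivated', 'recent', 'widely'], longest_match_only=True)

# Keep only spans accepted by both matchers (despite its name, this object is a CandidateExtractor)
non_comma_background_matcher = CandidateExtractor(Background, [ngrams], [Intersection(non_comma_matcher, background_matcher)])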

In [5]:
# Use a distinct loop variable so the earlier `sents` list is not overwritten
for i, split_sents in enumerate([train_sents, dev_sents, test_sents]):
    %time non_comma_background_matcher.apply(split_sents, split=i)
    print("Split "+str(i)+" - number of candidates extracted:", session.query(Background).filter(Background.split == i).count(),"\n\n")

Clearing existing...
Running UDF...

CPU times: user 8.79 s, sys: 284 ms, total: 9.08 s
Wall time: 9.18 s
Split 0 - number of candidates extracted: 106 


Clearing existing...
Running UDF...

CPU times: user 1.42 s, sys: 144 ms, total: 1.57 s
Wall time: 1.57 s
Split 1 - number of candidates extracted: 11 


Clearing existing...
Running UDF...

CPU times: user 1.47 s, sys: 148 ms, total: 1.62 s
Wall time: 1.63 s
Split 2 - number of candidates extracted: 9 




In [7]:
from IPython.display import Markdown, display
def printmd(string):
    display(Markdown(string))

cands = session.query(Background).filter(Background.split == 0).all()

# Show each training candidate next to the sentence it was extracted from
for i, cand in enumerate(cands):
    printmd("**{}/{} Candidate/Span:**\t`{}`".format(i, len(cands), cand.background_cue))
    printmd("**Its parent Sentence's text:**\t{}".format(cand.get_parent().text))
    print()

**0/106 Candidate/Span:**	`Span("b'we investigate how to use machine learning to add reactivity to a previously \n'", sentence=2476, chars=[15,92], words=[4,17])`

**Its parent Sentence's text:**	In this paper, we investigate how to use machine learning to add reactivity to a previously 





**1/106 Candidate/Span:**	`Span("b'issues of material embodiment and environmental embedding have emerged as important areas of research attention over recent decades.'", sentence=4295, chars=[33,164], words=[7,25])`

**Its parent Sentence's text:**	Within the sciences of the mind, issues of material embodiment and environmental embedding have emerged as important areas of research attention over recent decades.




**2/106 Candidate/Span:**	`Span("b'and three approaches are presented for distinguishing amongst objects from a previously-encountered set: The first approach is entirely geometric in nature and borrows ideas from "\n'", sentence=4197, chars=[32,212], words=[6,35])`

**Its parent Sentence's text:**	The goal is object recognition, and three approaches are presented for distinguishing amongst objects from a previously-encountered set: The first approach is entirely geometric in nature and borrows ideas from "





**3/106 Candidate/Span:**	`Span("b'at a recent social gathering'", sentence=4350, chars=[50,77], words=[11,15])`

**Its parent Sentence's text:**	Marian ran a straw poll on that question at ICSE, at a recent social gathering, 





**4/106 Candidate/Span:**	`Span("b'Previous Work: Highway Control'", sentence=1229, chars=[0,29], words=[0,4])`

**Its parent Sentence's text:**	Previous Work: Highway Control




**5/106 Candidate/Span:**	`Span("b'recent research has investigated \n'", sentence=2575, chars=[41,74], words=[6,10])`

**Its parent Sentence's text:**	To circumvent this impossibility result, recent research has investigated 





**6/106 Candidate/Span:**	`Span("b'Abstract: Switched LANs are become more widely used because they can provide a higher bandwidth than LANs based on shared media.'", sentence=1211, chars=[0,127], words=[0,22])`

**Its parent Sentence's text:**	Abstract: Switched LANs are become more widely used because they can provide a higher bandwidth than LANs based on shared media.




**7/106 Candidate/Span:**	`Span("b'as the authors have learned in recent years'", sentence=2126, chars=[21,63], words=[3,10])`

**Its parent Sentence's text:**	Continuous lattices, as the authors have learned in recent years, exhibit a variety of different aspects, some are lattice theoretical, some are topological, some belong to topological algebra and some to category theory




**8/106 Candidate/Span:**	`Span("b'Network-awareness is a recent attempt to bridge the gap between the realities of networks and the demands of applications.'", sentence=4814, chars=[0,121], words=[0,21])`

**Its parent Sentence's text:**	Network-awareness is a recent attempt to bridge the gap between the realities of networks and the demands of applications.




**9/106 Candidate/Span:**	`Span("b'ABSTRACT Motivated by many practical applications that have to compute in the presence of uncertainty'", sentence=3160, chars=[0,100], words=[0,14])`

**Its parent Sentence's text:**	ABSTRACT Motivated by many practical applications that have to compute in the presence of uncertainty, we propose a monadic probabilistic language based upon the mathematical notion of sampling function.




**10/106 Candidate/Span:**	`Span("b'The Urban Challenge represents a technological leap beyond the previous Grand Challenges.'", sentence=3143, chars=[0,88], words=[0,12])`

**Its parent Sentence's text:**	The Urban Challenge represents a technological leap beyond the previous Grand Challenges.




**11/106 Candidate/Span:**	`Span("b'It is based on a recently discovered connection between homotopy the-ory and type theory.'", sentence=4352, chars=[0,88], words=[0,16])`

**Its parent Sentence's text:**	It is based on a recently discovered connection between homotopy the-ory and type theory.




**12/106 Candidate/Span:**	`Span("b'Mechanism design is the art of designing the rules of the game so that the agents are motivated to report their preferences truthfully'", sentence=3177, chars=[0,133], words=[0,22])`

**Its parent Sentence's text:**	Mechanism design is the art of designing the rules of the game so that the agents are motivated to report their preferences truthfully, and a desirable outcome is chosen.




**13/106 Candidate/Span:**	`Span("b'simple threshold-based methods remain the most widely deployed and most popular approach among practitioners.'", sentence=2300, chars=[98,206], words=[14,30])`

**Its parent Sentence's text:**	Despite the proliferation of detection and containment techniques in the worm defense literature, simple threshold-based methods remain the most widely deployed and most popular approach among practitioners.




**14/106 Candidate/Span:**	`Span("b'Abstract Previous research on automated mechanism design (proposed in UAI-02) assumed that the outcome space was flatly represented'", sentence=1767, chars=[0,130], words=[0,19])`

**Its parent Sentence's text:**	Abstract Previous research on automated mechanism design (proposed in UAI-02) assumed that the outcome space was flatly represented, which makes that work inapplicable if the outcome space is exponential, as it is, for example, in multi-item auctions.




**15/106 Candidate/Span:**	`Span("b'there is a box-modality [\xce\xb1] and a diamond modality\xe2\x8c\xa9 \xce\xb1\xe2\x8c\xaa. PDL was developed from first-order dynamic logic by Fischer-Ladner [FL79] and has become popular recently [GW09].'", sentence=1163, chars=[20,188], words=[5,43])`

**Its parent Sentence's text:**	For each program α, there is a box-modality [α] and a diamond modality〈 α〉. PDL was developed from first-order dynamic logic by Fischer-Ladner [FL79] and has become popular recently [GW09].




**16/106 Candidate/Span:**	`Span("b'Previous approaches such as ridge regression'", sentence=3807, chars=[0,43], words=[0,5])`

**Its parent Sentence's text:**	Previous approaches such as ridge regression, support vector methods, and regularization networks are included as special cases.




**17/106 Candidate/Span:**	`Span("b'This paper is motivated by a simple question regarding the diagnosis of such attacks-is it possible to establish attack-causality through network-level monitoring'", sentence=2270, chars=[0,161], words=[0,27])`

**Its parent Sentence's text:**	This paper is motivated by a simple question regarding the diagnosis of such attacks-is it possible to establish attack-causality through network-level monitoring, without relying on signatures and attack-specific properties?




**18/106 Candidate/Span:**	`Span("b'has motivated recent efforts in improving the quality of Internet video.'", sentence=5341, chars=[115,186], words=[19,30])`

**Its parent Sentence's text:**	Abstract The key role that video quality plays in impacting user engagement, and consequently providers' revenues, has motivated recent efforts in improving the quality of Internet video.




**19/106 Candidate/Span:**	`Span("b'Previous work with the previous generation of mobile phones has shown that such an assumption is false.'", sentence=1814, chars=[0,102], words=[0,17])`

**Its parent Sentence's text:**	Previous work with the previous generation of mobile phones has shown that such an assumption is false.




**20/106 Candidate/Span:**	`Span("b'Abstract Despite the recent trend of increasingly large datasets for object detection'", sentence=4830, chars=[0,84], words=[0,11])`

**Its parent Sentence's text:**	Abstract Despite the recent trend of increasingly large datasets for object detection, there still exist many classes with few training examples.




**21/106 Candidate/Span:**	`Span("b'We recently proposed an approach---called automated mechanism design---where a mechanism is computed for the preference \n'", sentence=3178, chars=[0,120], words=[0,19])`

**Its parent Sentence's text:**	We recently proposed an approach---called automated mechanism design---where a mechanism is computed for the preference 





**22/106 Candidate/Span:**	`Span("b'and experiment with a setup motivated by the behavior of real-world distributed computation networks'", sentence=1266, chars=[21,120], words=[5,20])`

**Its parent Sentence's text:**	We discuss, analyze, and experiment with a setup motivated by the behavior of real-world distributed computation networks, where the machines are differently slow at different time.




**23/106 Candidate/Span:**	`Span("b'The recent spate of security issues and allegations of \xe2\x80\x9clost votes\xe2\x80\x9d in the US demonstrates the inadequacy of the standards used to evaluate our election systems.'", sentence=2481, chars=[0,160], words=[0,28])`

**Its parent Sentence's text:**	The recent spate of security issues and allegations of “lost votes” in the US demonstrates the inadequacy of the standards used to evaluate our election systems.




**24/106 Candidate/Span:**	`Span("b'we develop a logically motivated theory of parametric polymorphism'", sentence=3939, chars=[13,78], words=[4,12])`

**Its parent Sentence's text:**	To this end, we develop a logically motivated theory of parametric polymorphism, reminiscent of the Girard-Reynolds polymorphic λ-calculus, but casted in the setting of concurrent processes.




**25/106 Candidate/Span:**	`Span("b'Recently'", sentence=3237, chars=[0,7], words=[0,0])`

**Its parent Sentence's text:**	Recently, type systems based on constructive modal logic have been proposed as an expressive \nbasis for run-time code generation [DP96,WLPD98], partial evaluation [Dav96], and general \nmeta-programming [MTBS99,DPS97].




**26/106 Candidate/Span:**	`Span("b'In previous work'", sentence=5091, chars=[0,15], words=[0,2])`

**Its parent Sentence's text:**	In previous work, we developed a reliable navigation technique that uses partially observable Markov models to represent metric, actuator and sensor uncertainties.




**27/106 Candidate/Span:**	`Span("b'While recent work in network verification has made giant strides to reduce this effort'", sentence=4155, chars=[0,85], words=[0,13])`

**Its parent Sentence's text:**	While recent work in network verification has made giant strides to reduce this effort, they focus on simple reachability properties and cannot handle context-dependent policies (eg, how many connections has a host spawned) that operators realize using stateful network functions (NFs).




**28/106 Candidate/Span:**	`Span("b'The robot design extends upon previous prototypes of HeartLander'", sentence=4718, chars=[0,63], words=[0,8])`

**Its parent Sentence's text:**	The robot design extends upon previous prototypes of HeartLander, a miniature mobile robot that moves in an inchworm-like fashion.




**29/106 Candidate/Span:**	`Span("b'Machine learning develops intelligent computer systems that are able to generalizefrom previously seen examples.'", sentence=3747, chars=[0,111], words=[0,14])`

**Its parent Sentence's text:**	Machine learning develops intelligent computer systems that are able to generalizefrom previously seen examples.




**30/106 Candidate/Span:**	`Span("b'Abstract: The recent proliferation of richly structured probabilistic models raises the question of how to automatically determine an appropriate model for a dataset.'", sentence=2713, chars=[0,165], words=[0,24])`

**Its parent Sentence's text:**	Abstract: The recent proliferation of richly structured probabilistic models raises the question of how to automatically determine an appropriate model for a dataset.




**31/106 Candidate/Span:**	`Span("b'We investigate this question for a space of matrix decomposition models which can express a variety of widely used models from unsupervised learning.'", sentence=2714, chars=[0,148], words=[0,23])`

**Its parent Sentence's text:**	We investigate this question for a space of matrix decomposition models which can express a variety of widely used models from unsupervised learning.




**32/106 Candidate/Span:**	`Span("b'As described in the previous chapters'", sentence=4705, chars=[0,36], words=[0,5])`

**Its parent Sentence's text:**	As described in the previous chapters, some of the most successful work has included road following using CCD cameras [8.10] and cross country navigation using a scanning laser range finder [8.7].




**33/106 Candidate/Span:**	`Span("b'Abstract Recently'", sentence=2006, chars=[0,16], words=[0,1])`

**Its parent Sentence's text:**	Abstract Recently, there has been increasing deployment of distributed infrastructure and applications such as Akamai, PlanetLab and DHTs.




**34/106 Candidate/Span:**	`Span("b'Recent proposals have advocated the use of consolidation of \n'", sentence=2205, chars=[0,60], words=[0,9])`

**Its parent Sentence's text:**	Recent proposals have advocated the use of consolidation of 





**35/106 Candidate/Span:**	`Span("b'<\\nPrevious; Next >; Home > Conferences_Events > SCFORUM > 6.'", sentence=2902, chars=[0,60], words=[0,13])`

**Its parent Sentence's text:**	<\nPrevious; Next >; Home > Conferences_Events > SCFORUM > 6.




**36/106 Candidate/Span:**	`Span("b'Abstract Recent studies have shown a range of co-residency side channels that can be used to extract private information from cloud clients.'", sentence=2272, chars=[0,139], words=[0,24])`

**Its parent Sentence's text:**	Abstract Recent studies have shown a range of co-residency side channels that can be used to extract private information from cloud clients.




**37/106 Candidate/Span:**	`Span("b"Abstract Recent trends have exposed three key problems in today's operating systems."", sentence=1722, chars=[0,83], words=[0,13])`

**Its parent Sentence's text:**	Abstract Recent trends have exposed three key problems in today's operating systems.




**38/106 Candidate/Span:**	`Span("b'There has been recently a reawaking of interest in many aspects of realizability interpretations---especially as regards semantics of type theories for constructive reasoning and semantics of programming languages.'", sentence=4300, chars=[0,213], words=[0,30])`

**Its parent Sentence's text:**	There has been recently a reawaking of interest in many aspects of realizability interpretations---especially as regards semantics of type theories for constructive reasoning and semantics of programming languages.




**39/106 Candidate/Span:**	`Span("b'Abstract Recent biological advances with protein microarrays allow high-throughput antibody response screens against the entire proteome of a disease pathogen.'", sentence=3381, chars=[0,158], words=[0,22])`

**Its parent Sentence's text:**	Abstract Recent biological advances with protein microarrays allow high-throughput antibody response screens against the entire proteome of a disease pathogen.




**40/106 Candidate/Span:**	`Span("b'Previous RTA conferences took place in Dijon (1985)'", sentence=3558, chars=[0,50], words=[0,9])`

**Its parent Sentence's text:**	Previous RTA conferences took place in Dijon (1985), Bordeaux (1987), Chapel Hill (1989), Como (1991), Montreal (1993), Kaiserslautern (1995), New Brunswick (1996), Sitges (1997), Tsukuba (1998), 





**41/106 Candidate/Span:**	`Span("b'Abstract Information-Centric Networking (ICN) has seen a significant resurgence in recent years.'", sentence=4921, chars=[0,95], words=[0,16])`

**Its parent Sentence's text:**	Abstract Information-Centric Networking (ICN) has seen a significant resurgence in recent years.




**42/106 Candidate/Span:**	`Span("b'not addressed by previous monitoring services.'", sentence=2008, chars=[94,139], words=[17,23])`

**Its parent Sentence's text:**	In this setting, a key challenge to high availability is the presence of correlated failures, not addressed by previous monitoring services.




**43/106 Candidate/Span:**	`Span("b'Recent results make a promising case for augmenting an oversubscribed network with reconfigurable inter-rack wireless or optical links.'", sentence=2288, chars=[0,134], words=[0,20])`

**Its parent Sentence's text:**	Recent results make a promising case for augmenting an oversubscribed network with reconfigurable inter-rack wireless or optical links.




**44/106 Candidate/Span:**	`Span("b'we present recent results with using range from radio for mobile robot localization.'", sentence=4458, chars=[15,98], words=[4,17])`

**Its parent Sentence's text:**	In this paper, we present recent results with using range from radio for mobile robot localization.




**45/106 Candidate/Span:**	`Span("b'The presentation from the previous lecture would have been slightly simpler if we had presupposed an explicit conjunction form for programs where each inference rule has exactly one premiss.'", sentence=4850, chars=[0,189], words=[0,29])`

**Its parent Sentence's text:**	The presentation from the previous lecture would have been slightly simpler if we had presupposed an explicit conjunction form for programs where each inference rule has exactly one premiss.




**46/106 Candidate/Span:**	`Span("b'the data mining method of association rule has been widely used in various aspects in the field of traditional Chinese medicine'", sentence=4936, chars=[72,198], words=[15,35])`

**Its parent Sentence's text:**	As one of the most active research methods in the field of data mining, the data mining method of association rule has been widely used in various aspects in the field of traditional Chinese medicine, which makes the large database of traditional Chinese medicine information can be utilized effectively, and promotes the development of traditional Chinese medicine modernization.




**47/106 Candidate/Span:**	`Span("b'Abstract We present recent advances from our efforts in increasing coverage'", sentence=2885, chars=[0,74], words=[0,10])`

**Its parent Sentence's text:**	Abstract We present recent advances from our efforts in increasing coverage, robustness, generality and speed of JANUS, CMU's speech-to-speech translation system.




**48/106 Candidate/Span:**	`Span("b'has been presented previously'", sentence=4722, chars=[202,230], words=[33,36])`

**Its parent Sentence's text:**	Abstract: A technique for steering of flexible bevel-tipped needles through tissue, providing proportional control of trajectory curvature by means of duty-cycled rotation or spinning during insertion, has been presented previously, and tested in vitro in gelatin samples.




**49/106 Candidate/Span:**	`Span("b'Recently'", sentence=1771, chars=[0,7], words=[0,0])`

**Its parent Sentence's text:**	Recently, there has been considerable interest in the use of Model Checking for Systems Biology.




**50/106 Candidate/Span:**	`Span("b'As described in our previous work'", sentence=3885, chars=[0,32], words=[0,5])`

**Its parent Sentence's text:**	As described in our previous work, we developed Ariadne, a dependency plug-in infrastructure for the Eclipse IDE.




**51/106 Candidate/Span:**	`Span("b'Abstract: Recent work in machine learning has significantly benefited semantic extraction tasks in computer vision'", sentence=5218, chars=[0,113], words=[0,15])`

**Its parent Sentence's text:**	Abstract: Recent work in machine learning has significantly benefited semantic extraction tasks in computer vision, particularly for object recognition and image retrieval.




**52/106 Candidate/Span:**	`Span("b'Abstract: Active physiological tremor compensation instruments have been under research and development recently.'", sentence=4785, chars=[0,112], words=[0,14])`

**Its parent Sentence's text:**	Abstract: Active physiological tremor compensation instruments have been under research and development recently.




**53/106 Candidate/Span:**	`Span("b'previous negotiations are a source of heuristic \n'", sentence=4770, chars=[13,61], words=[3,10])`

**Its parent Sentence's text:**	In addition, previous negotiations are a source of heuristic 





**54/106 Candidate/Span:**	`Span("b'In previous work we have shown how range readings from radio tags placed in the environment can be used to localize a robot.'", sentence=4459, chars=[0,123], words=[0,23])`

**Its parent Sentence's text:**	In previous work we have shown how range readings from radio tags placed in the environment can be used to localize a robot.




**55/106 Candidate/Span:**	`Span("b'ACT-R is one of the most widely used cognitive architectures'", sentence=4315, chars=[0,59], words=[0,11])`

**Its parent Sentence's text:**	ACT-R is one of the most widely used cognitive architectures, and it has been used to model hundreds of phenomena described in the cognitive psychology literature.




**56/106 Candidate/Span:**	`Span("b'In previous work we have developed an optimal method D* to plan paths when the environment is not known ahead of time'", sentence=4492, chars=[0,116], words=[0,22])`

**Its parent Sentence's text:**	In previous work we have developed an optimal method D* to plan paths when the environment is not known ahead of time, but, rather is discovered as the robot moves around.




**57/106 Candidate/Span:**	`Span("b'Coordinated Sampling (cSamp) is a recent proposal for improving the flow monitoring capabilities of ISPs to address these demands.'", sentence=4142, chars=[0,129], words=[0,21])`

**Its parent Sentence's text:**	Coordinated Sampling (cSamp) is a recent proposal for improving the flow monitoring capabilities of ISPs to address these demands.




**58/106 Candidate/Span:**	`Span("b'Session types are widely accepted as an expressive discipline for structuring communications in concurrent and distributed systems.'", sentence=1404, chars=[0,130], words=[0,17])`

**Its parent Sentence's text:**	Session types are widely accepted as an expressive discipline for structuring communications in concurrent and distributed systems.




**59/106 Candidate/Span:**	`Span("b'We have extended previous work to consider robustness.'", sentence=4460, chars=[0,53], words=[0,8])`

**Its parent Sentence's text:**	We have extended previous work to consider robustness.




**60/106 Candidate/Span:**	`Span("b'discriminative model for such context-based action recognition building on recent techniques for learning large-scale discriminative models.'", sentence=4127, chars=[22,161], words=[5,25])`

**Its parent Sentence's text:**	We develop a unified, discriminative model for such context-based action recognition building on recent techniques for learning large-scale discriminative models.




**61/106 Candidate/Span:**	`Span("b'Previously'", sentence=3383, chars=[0,9], words=[0,0])`

**Its parent Sentence's text:**	Previously, some basic machine learning classification algorithms have been used to evaluate the diagnostic power of noted antigens, but the issue of which algorithms perform best on this type of data is 





**62/106 Candidate/Span:**	`Span("b'In our previous work'", sentence=3197, chars=[0,19], words=[0,3])`

**Its parent Sentence's text:**	In our previous work, we have automatically generated hints for logic tutoring by constructing a Markov Decision Process (MDP) that holds and rates historical student work for automatic selection of the best prior cases for hint generation.




**63/106 Candidate/Span:**	`Span("b'Our recently proposed framework of deploying packet level redundancy elimination (RE) on network elements (eg'", sentence=1694, chars=[0,108], words=[0,17])`

**Its parent Sentence's text:**	Our recently proposed framework of deploying packet level redundancy elimination (RE) on network elements (eg, routers)([2],[3]) can serve as an effective way to improve network efficiency without expensive upgrades.




**64/106 Candidate/Span:**	`Span("b'the previous components must become interoperable with the \n'", sentence=1969, chars=[30,89], words=[7,15])`

**Its parent Sentence's text:**	As a result of these changes, the previous components must become interoperable with the 





**65/106 Candidate/Span:**	`Span("b'Abstract Much recent attention'", sentence=3778, chars=[0,29], words=[0,3])`

**Its parent Sentence's text:**	Abstract Much recent attention, both experimental and theoretical, has been focussed on classification algorithms which produce voted combinations of classifiers.




**66/106 Candidate/Span:**	`Span("b'A recent trend towards greater automation of earthmoving machines'", sentence=2180, chars=[0,64], words=[0,8])`

**Its parent Sentence's text:**	A recent trend towards greater automation of earthmoving machines, such as backhoes, loaders, and dozers, reflects a larger movement in the construction industry to improve productivity, efficiency, and safety.




**67/106 Candidate/Span:**	`Span("b'we present a new method for doing object recognition using tactile force sensors that makes use of recent work on \xe2\x80\x9ctactile appearance\xe2\x80\x9d to describe objects by the spatially-varying appearance characteristics of their surface texture.'", sentence=4192, chars=[23,254], words=[5,43])`

**Its parent Sentence's text:**	Abstract In this work, we present a new method for doing object recognition using tactile force sensors that makes use of recent work on “tactile appearance” to describe objects by the spatially-varying appearance characteristics of their surface texture.




**68/106 Candidate/Span:**	`Span("b'In recent years'", sentence=4972, chars=[0,14], words=[0,2])`

**Its parent Sentence's text:**	In recent years, industry has developed a variety of standards, eg SOAP, WSDL, UDDI, BPEL for web services discovery, description, and distributed execution over the Web.




**69/106 Candidate/Span:**	`Span("b'Our approach relies on recent learning theory results for distribution regression'", sentence=2307, chars=[0,80], words=[0,10])`

**Its parent Sentence's text:**	Our approach relies on recent learning theory results for distribution regression, using kernel 





**70/106 Candidate/Span:**	`Span("b'Recent theoretical work has shown that the impressive generalization performance of algorithms like AdaBoost can be attributed to the classifier having large margins on the training data.'", sentence=3779, chars=[0,186], words=[0,27])`

**Its parent Sentence's text:**	Recent theoretical work has shown that the impressive generalization performance of algorithms like AdaBoost can be attributed to the classifier having large margins on the training data.




**71/106 Candidate/Span:**	`Span("b'in recent years an increasing amount of reserch has been devoted to machine learning on relational data with more complex structure.'", sentence=2291, chars=[50,181], words=[11,32])`

**Its parent Sentence's text:**	Due to the needs of many real-world applications, in recent years an increasing amount of reserch has been devoted to machine learning on relational data with more complex structure.




**72/106 Candidate/Span:**	`Span("b'This paper builds off of recent work on rapidly exponentially stabilizing control Lyapunov functions (RES-CLF) and control Lyapunov function based quadratic programs (CLFQP) for underactuated hybrid systems.'", sentence=3535, chars=[0,206], words=[0,33])`

**Its parent Sentence's text:**	This paper builds off of recent work on rapidly exponentially stabilizing control Lyapunov functions (RES-CLF) and control Lyapunov function based quadratic programs (CLFQP) for underactuated hybrid systems.




**73/106 Candidate/Span:**	`Span("b'This claim is supported by recent results in music vs. speech classification'", sentence=5220, chars=[0,75], words=[0,11])`

**Its parent Sentence's text:**	This claim is supported by recent results in music vs. speech classification, structure from sound, robust music identification and sound object recognition.




**74/106 Candidate/Span:**	`Span("b'Abstract Recent studies have investigated how a team of mobile sensors can cope with real world constraints'", sentence=4064, chars=[0,106], words=[0,16])`

**Its parent Sentence's text:**	Abstract Recent studies have investigated how a team of mobile sensors can cope with real world constraints, such as uncertainty in the reward functions, dynamically appearing and disappearing targets, technology failures end changes in the environment conditions.




**75/106 Candidate/Span:**	`Span("b'While previous studies demonstrate the feasibility of using ACT-R to model collective cognition'", sentence=4273, chars=[0,94], words=[0,14])`

**Its parent Sentence's text:**	While previous studies demonstrate the feasibility of using ACT-R to model collective cognition, as well as sensemaking processes at the individual level, the 





**76/106 Candidate/Span:**	`Span("b'Previous work with the programming language Lollimon demonstrates the expressive power of logic programming with linear logic in describing algorithms that have imperative elements or that must repeatedly make mutually exclusive \n'", sentence=5441, chars=[0,229], words=[0,31])`

**Its parent Sentence's text:**	Previous work with the programming language Lollimon demonstrates the expressive power of logic programming with linear logic in describing algorithms that have imperative elements or that must repeatedly make mutually exclusive 





**77/106 Candidate/Span:**	`Span("b'From the graph we compute strategies using previous work \n'", sentence=2366, chars=[0,57], words=[0,9])`

**Its parent Sentence's text:**	From the graph we compute strategies using previous work 





**78/106 Candidate/Span:**	`Span("b'Abstract Convolutional neural nets (CNNs) have demonstrated remarkable performance in recent history.'", sentence=4022, chars=[0,100], words=[0,14])`

**Its parent Sentence's text:**	Abstract Convolutional neural nets (CNNs) have demonstrated remarkable performance in recent history.




**79/106 Candidate/Span:**	`Span("b'each Mars Exploration Rover (MER) affords better obstacle avoidance than does the previous version.'", sentence=4895, chars=[0,98], words=[0,16])`

**Its parent Sentence's text:**	each Mars Exploration Rover (MER) affords better obstacle avoidance than does the previous version.




**80/106 Candidate/Span:**	`Span("b'Abstract We review progress in a recent line of research that provides a concurrent computational interpretation of (intuitionistic) linear logic.'", sentence=3732, chars=[0,145], words=[0,22])`

**Its parent Sentence's text:**	Abstract We review progress in a recent line of research that provides a concurrent computational interpretation of (intuitionistic) linear logic.




**81/106 Candidate/Span:**	`Span("b'several participants presented their recent research'", sentence=3324, chars=[20,71], words=[4,9])`

**Its parent Sentence's text:**	During the seminar, several participants presented their recent research, and ongoing work and open problems were discussed.




**82/106 Candidate/Span:**	`Span("b'Our system is faster than previous work by over an order of magnitude and it is capable of dealing with hundreds of millions of documents and thousands of topics.'", sentence=1728, chars=[0,161], words=[0,29])`

**Its parent Sentence's text:**	Our system is faster than previous work by over an order of magnitude and it is capable of dealing with hundreds of millions of documents and thousands of topics.




**83/106 Candidate/Span:**	`Span("b'we make use of a recently developed fine-grained taxonomy of human-object grasps.'", sentence=1887, chars=[10,90], words=[4,20])`

**Its parent Sentence's text:**	To do so, we make use of a recently developed fine-grained taxonomy of human-object grasps.




**84/106 Candidate/Span:**	`Span("b'Abstract: Sparse representation technique has been widely used in various areas of computer vision over the last decades.'", sentence=5179, chars=[0,120], words=[0,19])`

**Its parent Sentence's text:**	Abstract: Sparse representation technique has been widely used in various areas of computer vision over the last decades.




**85/106 Candidate/Span:**	`Span("b'Previous versions of the robot have been remotely actuated through push-pull wires'", sentence=3012, chars=[0,81], words=[0,13])`

**Its parent Sentence's text:**	Previous versions of the robot have been remotely actuated through push-pull wires, while visual feedback was provided by fiber optic transmission.




**86/106 Candidate/Span:**	`Span("b'Despite the recent advances in full-body pose estimation using Kinect-like sensors'", sentence=1617, chars=[0,81], words=[0,14])`

**Its parent Sentence's text:**	Despite the recent advances in full-body pose estimation using Kinect-like sensors, reliable monocular hand pose estimation in RGB-D images is still an unsolved problem.




**87/106 Candidate/Span:**	`Span("b'Abstract A recent trend (especially in electronic commerce) is higher levels of expressiveness in the mechanisms that mediate interactions such as auctions'", sentence=1776, chars=[0,154], words=[0,23])`

**Its parent Sentence's text:**	Abstract A recent trend (especially in electronic commerce) is higher levels of expressiveness in the mechanisms that mediate interactions such as auctions, exchanges, catalog offers, voting systems, matching of peers, and so on.




**88/106 Candidate/Span:**	`Span("b'Abstract Motivated by a radically new peer review system that the National Science Foundation recently experimented with'", sentence=3658, chars=[0,119], words=[0,16])`

**Its parent Sentence's text:**	Abstract Motivated by a radically new peer review system that the National Science Foundation recently experimented with, we study peer review systems in which proposals are reviewed by PIs who have submitted proposals themselves.




**89/106 Candidate/Span:**	`Span("b'and the authors have previously presented a fragment of ordered linear logic suitable to a forward-chaining operational semantics.'", sentence=2969, chars=[157,286], words=[26,46])`

**Its parent Sentence's text:**	In previous work, Polakow presented ordered linear logic and gave a backward-chaining operational semantics semantics for the uniform fragment of the logic, and the authors have previously presented a fragment of ordered linear logic suitable to a forward-chaining operational semantics.




**90/106 Candidate/Span:**	`Span("b'In previous work'", sentence=2969, chars=[0,15], words=[0,2])`

**Its parent Sentence's text:**	In previous work, Polakow presented ordered linear logic and gave a backward-chaining operational semantics semantics for the uniform fragment of the logic, and the authors have previously presented a fragment of ordered linear logic suitable to a forward-chaining operational semantics.




**91/106 Candidate/Span:**	`Span("b'Kidney exchanges represent a truly fielded example of technology at the intersection of artificial intelligence and economics\xe2\x80\x94an example that is both recent and an active research area.'", sentence=3478, chars=[0,184], words=[0,29])`

**Its parent Sentence's text:**	Kidney exchanges represent a truly fielded example of technology at the intersection of artificial intelligence and economics—an example that is both recent and an active research area.




**92/106 Candidate/Span:**	`Span("b'Recent research efforts have created tools that automatically localize the problem to a small number of potential culprits'", sentence=4670, chars=[0,121], words=[0,17])`

**Its parent Sentence's text:**	Recent research efforts have created tools that automatically localize the problem to a small number of potential culprits, but research is needed to understand what visualization techniques work best for helping distributed systems developers understand 





**93/106 Candidate/Span:**	`Span("b'Our results are for a class of games that generalizes the only previously known class of imperfect-recall abstractions where any results had been obtained.'", sentence=1585, chars=[0,154], words=[0,26])`

**Its parent Sentence's text:**	Our results are for a class of games that generalizes the only previously known class of imperfect-recall abstractions where any results had been obtained.




**94/106 Candidate/Span:**	`Span("b'BackgrOuNd Considerable attention has been given recently to the online offerings such as edX and Coursera by prestigious universities such as MIT'", sentence=4347, chars=[0,145], words=[0,21])`

**Its parent Sentence's text:**	BackgrOuNd Considerable attention has been given recently to the online offerings such as edX and Coursera by prestigious universities such as MIT, Stanford, and Harvard.




**95/106 Candidate/Span:**	`Span("b'We show that none of the previous coalition structure generation algorithms can establish any \n'", sentence=1806, chars=[0,94], words=[0,14])`

**Its parent Sentence's text:**	We show that none of the previous coalition structure generation algorithms can establish any 





**96/106 Candidate/Span:**	`Span("b'We show that none of the previous coalition structure generation algorithms can establish any bound because \n'", sentence=2621, chars=[0,108], words=[0,16])`

**Its parent Sentence's text:**	We show that none of the previous coalition structure generation algorithms can establish any bound because 





**97/106 Candidate/Span:**	`Span("b'There have been important recent game-theoretic analyses of spiteful bidding assuming all agents are equally spiteful.'", sentence=3993, chars=[0,117], words=[0,18])`

**Its parent Sentence's text:**	There have been important recent game-theoretic analyses of spiteful bidding assuming all agents are equally spiteful.




**98/106 Candidate/Span:**	`Span("b'Abstract Automated abstraction algorithms for sequential imperfect information games have recently emerged as a key component in developing competitive game theory-based agents.'", sentence=4594, chars=[0,176], words=[0,24])`

**Its parent Sentence's text:**	Abstract Automated abstraction algorithms for sequential imperfect information games have recently emerged as a key component in developing competitive game theory-based agents.




**99/106 Candidate/Span:**	`Span("b'until quite recently'", sentence=3419, chars=[36,55], words=[8,10])`

**Its parent Sentence's text:**	In our opinion, the literature has, until quite recently, placed too much emphasis on probabilistic inference machinery, while paying 





**100/106 Candidate/Span:**	`Span("b'In recent years'", sentence=4992, chars=[0,14], words=[0,2])`

**Its parent Sentence's text:**	In recent years, research attention in the software engineering community has shifted from process management and workflow tools that aim to plan for all coordination activity and eventualities before development begins to a new generation of more flexible tools that saturate the developer's workspace with information at varying degrees of granularity and in different visual, and often interactive, representations.




**101/106 Candidate/Span:**	`Span("b'motivated by modeling of conflict scenarios in societies with multiple ethno-\n'", sentence=4103, chars=[15,92], words=[4,15])`

**Its parent Sentence's text:**	In this paper, motivated by modeling of conflict scenarios in societies with multiple ethno-





**102/106 Candidate/Span:**	`Span("b'The variable circular plot (VCP) method for censusing animals is used widely to estimate the abundance of forest birds'", sentence=5124, chars=[0,117], words=[0,20])`

**Its parent Sentence's text:**	The variable circular plot (VCP) method for censusing animals is used widely to estimate the abundance of forest birds, despite a lack of effective methods for analyzing such data.




**103/106 Candidate/Span:**	`Span("b'In this lecture we combine ideas from the previous two lectures'", sentence=2622, chars=[0,62], words=[0,10])`

**Its parent Sentence's text:**	In this lecture we combine ideas from the previous two lectures, linear monadic logic programming and higher-order abstract syntax, to present a specification technique for programming languages we call substructural operational semantics.




**104/106 Candidate/Span:**	`Span("b'As sketched in the previous lecture'", sentence=1077, chars=[0,34], words=[0,5])`

**Its parent Sentence's text:**	As sketched in the previous lecture, this an important piece in the general technique of verifying properties of logic programs by reasoning about proof terms.





**105/106 Candidate/Span:**	`Span("b'Abstract\xe2\x80\x94Session types are widely accepted as a useful expressive discipline for structuring communications in concurrent and distributed systems.'", sentence=3217, chars=[0,145], words=[0,20])`

**Its parent Sentence's text:**	Abstract—Session types are widely accepted as a useful expressive discipline for structuring communications in concurrent and distributed systems.




**(Solved) Question 1:** The `CandidateExtractor` extracts only spans of length 1, and each sentence is matched only once, by a single span (try searching for "sentence=3993", for example). The matched sentences are the right ones, but we want longer spans, e.g. part of a sentence or the whole sentence. 

`Span("b'previous'", sentence=4705, chars=[20,27], words=[4,4])`

**Answer 1:** Override `DictionaryMatch._f()`.
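
To make the target behaviour concrete, here is a minimal standalone sketch of the filtering we want: split a sentence at commas and keep only the comma-free chunks that contain a dictionary keyword. It does not call Snorkel; `KEYWORDS`, `clause_spans`, and `matches` are hypothetical names introduced for illustration, and the keyword list is an assumption rather than the dictionary actually used in this notebook. Character offsets are inclusive, mirroring the `chars=[start,end]` values in the output above.

```python
import re

# Hypothetical keyword list (assumption); the real dictionaries live elsewhere in the notebook.
KEYWORDS = {"recent", "recently", "previous", "previously"}

def clause_spans(sentence):
    """Yield (char_start, char_end, text) for each comma-free chunk, offsets inclusive."""
    for m in re.finditer(r"[^,]+", sentence):
        chunk = m.group()
        stripped = chunk.strip()
        if not stripped:
            continue
        # Skip the whitespace that follows the preceding comma
        start = m.start() + (len(chunk) - len(chunk.lstrip()))
        yield start, start + len(stripped) - 1, stripped

def matches(span_text, keywords=KEYWORDS):
    """True if any alphabetic token of the span is a keyword."""
    tokens = re.findall(r"[A-Za-z]+", span_text.lower())
    return any(tok in keywords for tok in tokens)

sentence = ("During the seminar, several participants presented their recent "
            "research, and ongoing work and open problems were discussed.")
kept = [span for span in clause_spans(sentence) if matches(span[2])]
print(kept)
```

On this sentence the only chunk kept is the middle clause, with the same offsets as candidate 81/106 above.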

=====================================================================================

**(Solved) Question 2:** Several `Span`s can correspond to the same sentence, e.g. three `Span`s correspond to `sentence=490`. They were all included even after we set `longest_match_only=True`, because these spans all share the longest length.  

**Answer 2:** Perhaps try either (1) filtering by first appearance of the document ID, or (2) setting max_length to the length of the longest sentence.
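
Option (1) could look roughly like the sketch below. It is an assumption-laden illustration, not Snorkel code: `SpanRecord` and `first_span_per_sentence` are hypothetical names, and the sketch keys on the sentence ID (not the document ID) because the duplicates in question share a sentence. The sample records reuse the offsets printed above for `sentence=2969` and `sentence=3478`.

```python
from collections import namedtuple

# Hypothetical lightweight stand-in for a Snorkel Span (assumption)
SpanRecord = namedtuple("SpanRecord", ["sentence_id", "char_start", "char_end"])

def first_span_per_sentence(spans):
    """Keep only the first span encountered for each sentence ID, preserving order."""
    seen = set()
    kept = []
    for span in spans:
        if span.sentence_id not in seen:
            seen.add(span.sentence_id)
            kept.append(span)
    return kept

spans = [
    SpanRecord(2969, 157, 286),
    SpanRecord(2969, 0, 15),    # second span from the same sentence, dropped
    SpanRecord(3478, 0, 184),
]
deduped = first_span_per_sentence(spans)
print(deduped)
```

Which span counts as "first" depends on the order in which candidates are returned, so sorting by `char_start` beforehand may be preferable if the leftmost clause is the one to keep.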


### Stopped here as of Wed 06/19 9:39 pm. More to come ... ###

## TODO List

1. Re-format the 2K gold-label papers and load them as the gold standard. Question: are there any recommended approaches for evaluating segmentation quality, e.g. word overlap? 
2. Our current segmentation strictly segments at sentence boundaries. Is that erroneous? 
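
For the word-overlap idea in item 1, one possible starting point is token-level precision, recall, F1, and Jaccard between a predicted span and a gold span. The sketch below assumes both spans come from the same sentence and are given as inclusive word ranges, like the `words=[start,end]` values above. `token_overlap_scores` is a hypothetical helper, and the gold range in the example is made up for illustration.

```python
def token_overlap_scores(pred, gold):
    """Compare two inclusive (word_start, word_end) ranges from the same sentence."""
    pred_tokens = set(range(pred[0], pred[1] + 1))
    gold_tokens = set(range(gold[0], gold[1] + 1))
    inter = len(pred_tokens & gold_tokens)
    union = len(pred_tokens | gold_tokens)
    precision = inter / len(pred_tokens)
    recall = inter / len(gold_tokens)
    # F1 is defined as 0 when the two ranges do not overlap at all
    f1 = 2 * precision * recall / (precision + recall) if inter else 0.0
    return {"precision": precision, "recall": recall, "f1": f1, "jaccard": inter / union}

# Predicted words=[4,9] against a hypothetical gold range words=[4,11]
scores = token_overlap_scores((4, 9), (4, 11))
print(scores)
```

Averaging these scores over all gold spans would give a corpus-level number, though whether that is the most appropriate segmentation metric remains an open question.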



Thanks for reading! 

Some debugging notes to remember (safe to skip). 

```
python -m spacy download en
```

Current issue: the parser does not split sentences *at periods*, so the sentence count is significantly lower than expected! 
Potential fix: https://github.com/explosion/spaCy/issues/93
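
A rough cross-check for the sentence count is to compare it against a naive punctuation-based splitter. The sketch below is deliberately crude and is not how spaCy segments sentences: it splits after `.`, `!`, or `?` followed by whitespace, so abbreviations such as "vs." will be over-split. `naive_sentences` is a hypothetical helper.

```python
import re

def naive_sentences(text):
    """Split text after '.', '!' or '?' when followed by whitespace (crude heuristic)."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

sample = "Hello, world. Here are two sentences."
split = naive_sentences(sample)
print(len(split), split)
```

If the parser's count is far below the count from this heuristic, that suggests sentence boundary detection is not running at all, rather than merely being conservative.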

======= More debugging logs (optional, can be skipped) ======
~~~~
Xins-MacBook-Pro:~ xin$ source activate snorkel
(snorkel) Xins-MacBook-Pro:~ xin$ python
Python 3.6.4 |Anaconda custom (64-bit)| (default, Jan 16 2018, 12:04:33) 
[GCC 4.2.1 Compatible Clang 4.0.1 (tags/RELEASE_401/final)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import spacey
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ModuleNotFoundError: No module named 'spacey'
>>> import spacy
>>> spacy.load('en')
<spacy.en.English object at 0x1080e1da0>
>>> model=spacy.load('en')
>>> docs=model.tokenizer('Hello, world. Here are two sentences.')
>>> for sent in docs.sents:
...     pritn(sent.text)
... 
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "spacy/tokens/doc.pyx", line 439, in __get__ (spacy/tokens/doc.cpp:9808)
ValueError: Sentence boundary detection requires the dependency parse, which requires data to be installed. For more info, see the documentation: 
https://spacy.io/docs/usage

>>> for sent in docs.sents:
...     print(sent.text)
... 
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "spacy/tokens/doc.pyx", line 439, in __get__ (spacy/tokens/doc.cpp:9808)
ValueError: Sentence boundary detection requires the dependency parse, which requires data to be installed. For more info, see the documentation: 
https://spacy.io/docs/usage

>>> from spacy.en import English
>>> nlp = English()
>>> doc = nlp(raw_text)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
NameError: name 'raw_text' is not defined
>>> raw_text='Hello, world. Here are two sentences.'
>>> doc = nlp(raw_text)
>>> sentences = [sent.string.strip() for sent in doc.sents]
>>> sentences
['Hello, world.', 'Here are two sentences.']
>>> model(raw_text)
Hello, world. Here are two sentences.
>>> docs=model(raw_text)
>>> docs.sents
<generator object at 0x14ad31948>
>>> docs=model(raw_text)
>>> sentences = [sent.string.strip() for sent in docs.sents]
>>> sentences
['Hello, world.', 'Here are two sentences.']
>>> 
~~~~
