## Test Run 2: Extract Meaningful Span

This notebook continues from Test Run 1 but defines a *Span* more meaningfully. The goal is that each *Sentence* breaks down into one or more *Spans*, and each *Span* corresponds to one *Clause*. Starting from the default `Ngrams` class, we customize the matcher to exclude any span that contains a comma.

Note that in this test run we start a `SnorkelSession()` without re-creating the `Document` and `Sentence` objects, which were already created in Test Run 1.
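
To make the span definition concrete, here is a minimal plain-Python sketch (not Snorkel code) that cuts one sentence from the corpus into comma-free spans. The helper name `comma_free_spans` is hypothetical, and splitting on the comma character is only a rough approximation of clause boundaries.

```python
def comma_free_spans(sentence):
    """Split a sentence on commas and return the non-empty, stripped pieces."""
    return [piece.strip() for piece in sentence.split(",") if piece.strip()]


sentence = ("In previous work, we developed a reliable navigation technique "
            "that uses partially observable Markov models to represent metric, "
            "actuator and sensor uncertainties.")

spans = comma_free_spans(sentence)
for span in spans:
    print(repr(span))
```

Here one sentence yields three spans, none of which contains a comma.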


In [1]:
%load_ext autoreload
%autoreload 2
%matplotlib inline
import os

from snorkel import SnorkelSession
from snorkel.parser.spacy_parser import Spacy
from snorkel.parser import CorpusParser
from snorkel.models import Document, Sentence

session = SnorkelSession()
print("Documents:", session.query(Document).count())
print("Sentences:", session.query(Sentence).count())

sents = session.query(Sentence).all()
# Longest sentence length in tokens; used later as n_max for Ngrams.
n_max_corpus = max((len(sent.words) for sent in sents), default=0)

print("The longest sentence has "+str(n_max_corpus)+" tokens.")

# Split the documents (ordered by name) into train/dev/test by document index.

docs = session.query(Document).order_by(Document.name).all()

train_sents = set()
dev_sents   = set()
test_sents  = set()

for i, doc in enumerate(docs):
    for s in doc.sentences:
        if i % 10 == 8:
            dev_sents.add(s)
        elif i % 10 == 9:
            test_sents.add(s)
        else:
            train_sents.add(s)


Documents: 1000
Sentences: 4487
The longest sentence has 101 tokens.
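
The split rule in the cell above can be checked in isolation. The following sketch applies the same modulo rule to plain integer indices, assuming the 1000 documents reported in the output; `assign_split` is a hypothetical helper, not part of Snorkel.

```python
from collections import Counter


def assign_split(doc_index):
    """Mirror the modulo rule used above: 8 -> dev, 9 -> test, otherwise train."""
    remainder = doc_index % 10
    if remainder == 8:
        return "dev"
    if remainder == 9:
        return "test"
    return "train"


counts = Counter(assign_split(i) for i in range(1000))
print(counts["train"], counts["dev"], counts["test"])  # 800 100 100
```

Because the documents are ordered by name before the split, the assignment is reproducible across runs. The sentence-level proportions can deviate from 80/10/10, since documents contain different numbers of sentences.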


In Test Run 1 we used a **single** `DictionaryMatch`. Now we explore combining **multiple** matchers.

Snorkel lets us combine several matchers into what we call a **compound** matcher through:
1. `Union` (takes the union of the candidate sets returned by the child matchers),
2. `Intersection` (keeps only the candidates accepted by every child matcher), or
3. `Concat` (candidates that are the concatenation of adjacent matches from the child matchers).
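
The three combinators can be illustrated with a plain-Python analogy in which a matcher is simply a function from a token list to `True`/`False`. This is a sketch of the semantics only: `union`, `intersection`, `concat`, and the small example matchers are hypothetical helpers, not Snorkel's classes.

```python
def union(*matchers):
    """Accept a span if at least one child matcher accepts it."""
    return lambda span: any(m(span) for m in matchers)


def intersection(*matchers):
    """Accept a span only if every child matcher accepts it."""
    return lambda span: all(m(span) for m in matchers)


def concat(left, right):
    """Accept a span that splits into two adjacent non-empty parts,
    the first accepted by `left` and the second by `right`."""
    return lambda span: any(left(span[:k]) and right(span[k:])
                            for k in range(1, len(span)))


def has_cue(span):
    return any(tok in {"previous", "recent"} for tok in span)


def no_comma(span):
    return "," not in span


def is_in(span):
    return span == ["In"]


def is_previous_work(span):
    return span == ["previous", "work"]


span = ["In", "previous", "work"]
print(union(has_cue, is_in)(span))                  # True
print(intersection(has_cue, no_comma)(span))        # True
print(intersection(has_cue, no_comma)(span + [","]))  # False
print(concat(is_in, is_previous_work)(span))        # True
```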

In [2]:
from snorkel.models import candidate_subclass
from snorkel.candidates import Ngrams, CandidateExtractor
from snorkel.matchers import *

Background = candidate_subclass('Background', ['background_cue'])
ngrams = Ngrams(n_max=n_max_corpus) # we define the maximum n value as n_max_corpus

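# Compound Matcher Example 1 (dictionary-based):
# reverse=True inverts the dictionary test; this matcher is intended to reject spans with a comma.
non_comma_matcher = DictionaryMatch(d=[','], longest_match_only=True, reverse=True)
# Cue words that suggest a background statement.
dict_background_matcher = DictionaryMatch(d=['previous', 'motivated', 'recent', 'widely'], longest_match_only=True)
# Despite its name, this object is a CandidateExtractor that applies the intersection of both matchers.
non_comma_dict_background_matcher = CandidateExtractor(Background, [ngrams], [Intersection(non_comma_matcher, dict_background_matcher)])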
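
As a rough mental model of what this extractor is meant to produce, the sketch below enumerates every n-gram of a short token list, keeps those that contain a cue word and no comma, and then keeps only the longest survivors. It mimics the behaviour visible in the outputs, not Snorkel's internals; `ngrams`, `keep`, and `longest_only` are hypothetical helpers, and the token list is an assumed tokenization of the opening of one corpus sentence.

```python
CUES = {"previous", "motivated", "recent", "widely"}


def ngrams(tokens, n_max):
    """Yield (start, end) word offsets (end exclusive) of every n-gram up to n_max tokens."""
    for start in range(len(tokens)):
        for end in range(start + 1, min(start + n_max, len(tokens)) + 1):
            yield start, end


def keep(tokens, start, end):
    """Keep spans that contain a cue word and no comma token."""
    window = tokens[start:end]
    return "," not in window and any(tok.lower() in CUES for tok in window)


def longest_only(spans):
    """Drop every span that is contained in a different kept span."""
    return [s for s in spans
            if not any(o != s and o[0] <= s[0] and s[1] <= o[1] for o in spans)]


tokens = ["In", "previous", "work", ",", "we", "developed", "a", "reliable",
          "navigation", "technique"]
all_spans = list(ngrams(tokens, n_max=len(tokens)))
kept = longest_only([s for s in all_spans if keep(tokens, *s)])

print(len(all_spans))                            # 55
print([" ".join(tokens[a:b]) for a, b in kept])  # ['In previous work']
```

In this sketch a sentence of n tokens yields n(n+1)/2 contiguous n-grams when `n_max` equals the sentence length, which would be 5151 for the 101-token sentence reported above. That gives a sense of why extraction takes several seconds per split.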

In [5]:
from IPython.display import Markdown, display
def printmd(string):
    display(Markdown(string))

for i, sents in enumerate([train_sents, dev_sents, test_sents]):
    %time non_comma_dict_background_matcher.apply(sents, split=i)
    printmd("**Split "+str(i)+" - number of candidates extracted: "+str(session.query(Background).filter(Background.split == i).count())+"**\n\n")
            

Clearing existing...
Running UDF...

CPU times: user 7.92 s, sys: 208 ms, total: 8.13 s
Wall time: 8.02 s


**Split 0 - number of candidates extracted: 106**



Clearing existing...
Running UDF...

CPU times: user 1.13 s, sys: 86.5 ms, total: 1.21 s
Wall time: 1.17 s


**Split 1 - number of candidates extracted: 11**



Clearing existing...
Running UDF...

CPU times: user 1.14 s, sys: 89.6 ms, total: 1.23 s
Wall time: 1.18 s


**Split 2 - number of candidates extracted: 9**



In [6]:
cands = session.query(Background).filter(Background.split == 0).all()
for i in range(7):  # use range(len(cands)) to print all candidates
    printmd("**"+str(i)+"/"+str(len(cands))+" Candidate/Span:**\t`"+str(cands[i].background_cue)+"`")
    printmd("**Its parent Sentence's text:**\t"+str(cands[i].get_parent().text))
    printmd("**Its parent Document's text:**\t"+str(cands[i].get_parent().get_parent().__dict__))
    print()
    

**0/106 Candidate/Span:**	`Span("b'simple threshold-based methods remain the most widely deployed and most popular approach among practitioners.'", sentence=2300, chars=[98,206], words=[14,30])`

**Its parent Sentence's text:**	Despite the proliferation of detection and containment techniques in the worm defense literature, simple threshold-based methods remain the most widely deployed and most popular approach among practitioners.

**Its parent Document's text:**	{'_sa_instance_state': <sqlalchemy.orm.state.InstanceState object at 0x115abf4a8>, 'name': '5834868425ff05a97b00ce09', 'stable_id': '5834868425ff05a97b00ce09::document:0:0', 'meta': {'file_name': '70kpaper_061418_cleaned.tsv'}, 'id': 269, 'type': 'document', 'sentences': [Sentence(Document 5834868425ff05a97b00ce09,0,b'Abstract:'), Sentence(Document 5834868425ff05a97b00ce09,1,b'Despite the proliferation of detection and containment techniques in the worm defense literature, simple threshold-based methods remain the most widely deployed and most popular approach among practitioners.'), Sentence(Document 5834868425ff05a97b00ce09,2,b'This popularity arises out of the simplistic appeal, ease of use, and independence from attack-specific properties such as scanning strategies and signatures.'), Sentence(Document 5834868425ff05a97b00ce09,3,b'However, such approaches have known limitations: they either fail to detect low-rate attacks or incur very high false positive rates.'), Sentence(Document 5834868425ff05a97b00ce09,4,b'We propose a multi-\n')]}




**1/106 Candidate/Span:**	`Span("b'Abstract Despite the recent trend of increasingly large datasets for object detection'", sentence=4830, chars=[0,84], words=[0,11])`

**Its parent Sentence's text:**	Abstract Despite the recent trend of increasingly large datasets for object detection, there still exist many classes with few training examples.

**Its parent Document's text:**	{'_sa_instance_state': <sqlalchemy.orm.state.InstanceState object at 0x115acbe80>, 'name': '5834868425ff05a97b00d0e7', 'stable_id': '5834868425ff05a97b00d0e7::document:0:0', 'meta': {'file_name': '70kpaper_061418_cleaned.tsv'}, 'id': 845, 'type': 'document', 'sentences': [Sentence(Document 5834868425ff05a97b00d0e7,0,b'Abstract Despite the recent trend of increasingly large datasets for object detection, there still exist many classes with few training examples.'), Sentence(Document 5834868425ff05a97b00d0e7,1,b'To overcome this lack of training data for certain classes, we propose a novel way of augmenting the training data for each class by borrowing and transforming examples from other classes.'), Sentence(Document 5834868425ff05a97b00d0e7,2,b'Our model learns which training instances from other classes to borrow and how to transform the borrowed examples so that they become more similar to instances from the target class.'), Sentence(Document 5834868425ff05a97b00d0e7,3,b'Our experimental results \n')]}




**2/106 Candidate/Span:**	`Span("b'In previous work'", sentence=5091, chars=[0,15], words=[0,2])`

**Its parent Sentence's text:**	In previous work, we developed a reliable navigation technique that uses partially observable Markov models to represent metric, actuator and sensor uncertainties.

**Its parent Document's text:**	{'_sa_instance_state': <sqlalchemy.orm.state.InstanceState object at 0x115aec2b0>, 'name': '5834868425ff05a97b00d663', 'stable_id': '5834868425ff05a97b00d663::document:0:0', 'meta': {'file_name': '70kpaper_061418_cleaned.tsv'}, 'id': 908, 'type': 'document', 'sentences': [Sentence(Document 5834868425ff05a97b00d663,0,b'Abstract: Navigation methods for office delivery robots need to take various sources of uncertainty into account in order to get robust performance.'), Sentence(Document 5834868425ff05a97b00d663,1,b'In previous work, we developed a reliable navigation technique that uses partially observable Markov models to represent metric, actuator and sensor uncertainties.'), Sentence(Document 5834868425ff05a97b00d663,2,b"This paper describes an algorithm that adjusts the probabilities of the initial Markov model by passively observing the robot's interactions with its environment."), Sentence(Document 5834868425ff05a97b00d663,3,b'The learned probabilities more accurately reflect the actual uncertainties \n')]}




**3/106 Candidate/Span:**	`Span("b'The robot design extends upon previous prototypes of HeartLander'", sentence=4718, chars=[0,63], words=[0,8])`

**Its parent Sentence's text:**	The robot design extends upon previous prototypes of HeartLander, a miniature mobile robot that moves in an inchworm-like fashion.

**Its parent Document's text:**	{'_sa_instance_state': <sqlalchemy.orm.state.InstanceState object at 0x1122240b8>, 'name': '5834868425ff05a97b00c8cf', 'stable_id': '5834868425ff05a97b00c8cf::document:0:0', 'meta': {'file_name': '70kpaper_061418_cleaned.tsv'}, 'id': 820, 'type': 'document', 'sentences': [Sentence(Document 5834868425ff05a97b00c8cf,0,b'Abstract\xe2\x80\x94'), Sentence(Document 5834868425ff05a97b00c8cf,1,b'This paper describes the development and construction of a mobile robot driven by miniature ultrasonic piezoelectric motors for minimally invasive cardiac therapy.'), Sentence(Document 5834868425ff05a97b00c8cf,2,b'The robot design extends upon previous prototypes of HeartLander, a miniature mobile robot that moves in an inchworm-like fashion.'), Sentence(Document 5834868425ff05a97b00c8cf,3,b'Construction of the system included motor selection, body design, and development of the control system.'), Sentence(Document 5834868425ff05a97b00c8cf,4,b'The robotic design was developed as a proof of concept to demonstrate mobility on the cardiac surface.'), Sentence(Document 5834868425ff05a97b00c8cf,5,b'This paper presents the \n')]}




**4/106 Candidate/Span:**	`Span("b'has motivated recent efforts in improving the quality of Internet video.'", sentence=5341, chars=[115,186], words=[19,30])`

**Its parent Sentence's text:**	Abstract The key role that video quality plays in impacting user engagement, and consequently providers' revenues, has motivated recent efforts in improving the quality of Internet video.

**Its parent Document's text:**	{'_sa_instance_state': <sqlalchemy.orm.state.InstanceState object at 0x115ab5fd0>, 'name': '5834868425ff05a97b00cdf8', 'stable_id': '5834868425ff05a97b00cdf8::document:0:0', 'meta': {'file_name': '70kpaper_061418_cleaned.tsv'}, 'id': 966, 'type': 'document', 'sentences': [Sentence(Document 5834868425ff05a97b00cdf8,0,b"Abstract The key role that video quality plays in impacting user engagement, and consequently providers' revenues, has motivated recent efforts in improving the quality of Internet video."), Sentence(Document 5834868425ff05a97b00cdf8,1,b'This includes work on adaptive bitrate selection, multi-CDN optimization, and global control plane architectures.'), Sentence(Document 5834868425ff05a97b00cdf8,2,b'Before we embark on deploying these designs, we need to first understand the nature of video of quality problems to see if this complexity is necessary, and if simpler approaches can yield comparable benefits.'), Sentence(Document 5834868425ff05a97b00cdf8,3,b'To this end, this \n')]}




**5/106 Candidate/Span:**	`Span("b'It is based on a recently discovered connection between homotopy the-ory and type theory.'", sentence=4352, chars=[0,88], words=[0,16])`

**Its parent Sentence's text:**	It is based on a recently discovered connection between homotopy the-ory and type theory.

**Its parent Document's text:**	{'_sa_instance_state': <sqlalchemy.orm.state.InstanceState object at 0x115a9e400>, 'name': '5834868425ff05a97b00cbc8', 'stable_id': '5834868425ff05a97b00cbc8::document:0:0', 'meta': {'file_name': '70kpaper_061418_cleaned.tsv'}, 'id': 733, 'type': 'document', 'sentences': [Sentence(Document 5834868425ff05a97b00cbc8,0,b'Homotopy type theory is a new branch of mathematics that combines aspects of several different fields in a surprising way.'), Sentence(Document 5834868425ff05a97b00cbc8,1,b'It is based on a recently discovered connection between homotopy the-ory and type theory.'), Sentence(Document 5834868425ff05a97b00cbc8,2,b'Homotopy theory is an outgrowth of algebraic topology and homological algebra, with relationships to higher category theory; while type theory is a branch of mathematical logic and theoretical computer science.'), Sentence(Document 5834868425ff05a97b00cbc8,3,b'Although the connections between the two are currently the focus of intense investigation, it is increasingly clear that \n')]}




**6/106 Candidate/Span:**	`Span("b'Abstract: The recent proliferation of richly structured probabilistic models raises the question of how to automatically determine an appropriate model for a dataset.'", sentence=2713, chars=[0,165], words=[0,24])`

**Its parent Sentence's text:**	Abstract: The recent proliferation of richly structured probabilistic models raises the question of how to automatically determine an appropriate model for a dataset.

**Its parent Document's text:**	{'_sa_instance_state': <sqlalchemy.orm.state.InstanceState object at 0x115acbd30>, 'name': '5834868425ff05a97b00d0d6', 'stable_id': '5834868425ff05a97b00d0d6::document:0:0', 'meta': {'file_name': '70kpaper_061418_cleaned.tsv'}, 'id': 362, 'type': 'document', 'sentences': [Sentence(Document 5834868425ff05a97b00d0d6,0,b'Abstract: The recent proliferation of richly structured probabilistic models raises the question of how to automatically determine an appropriate model for a dataset.'), Sentence(Document 5834868425ff05a97b00d0d6,1,b'We investigate this question for a space of matrix decomposition models which can express a variety of widely used models from unsupervised learning.'), Sentence(Document 5834868425ff05a97b00d0d6,2,b'To enable model selection, we organize these models into a context-free grammar which generates a wide variety of structures through the compositional application of a few simple rules.'), Sentence(Document 5834868425ff05a97b00d0d6,3,b'We use our grammar to generically and \n')]}




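Next we move to `RegexMatchSpan` to extract *Spans* that either start with "To"/"By"/"For" or contain "to"/"by"/"for" in the middle. Both extractors write to the same `Background` candidate class, so applying this one first clears the candidates from Example 1 (note the "Clearing existing..." message in the output).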
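
To see which spans the pattern in the next cell accepts, it can be tried with Python's `re` module on a few strings taken from the corpus. This sketch compiles the pattern without flags; whether Snorkel's matcher adds flags such as case-insensitivity is not shown in this notebook, so treat the results as illustrative.

```python
import re

# Same pattern as in the notebook cell, compiled with no flags (an assumption).
PATTERN = re.compile(r"(^(?:(To)|(By)|(For)) .*$)|(^.+ (?:(to)|(by)|(for)) .+$)")

examples = [
    "To overcome this lack of training data for certain classes",
    "as well as to build realistic",
    "In previous work",
    "both tools to",
]

results = {text: bool(PATTERN.match(text)) for text in examples}
for text, matched in results.items():
    print(matched, "|", text)
```

The first string matches the "starts with To" alternative and the second matches the "contains to" alternative. The last two do not match: one has no cue word at all, and the other ends in "to" with nothing after it, which the trailing ` .+$` rejects.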

In [7]:
# Compound Matcher Example 2 (regex-based):
# Match spans that either begin with To/By/For or contain to/by/for in the middle,
# using "(^(?:(To)|(By)|(For)) .*$)|(^.+ (?:(to)|(by)|(for)) .+$)".

non_comma_matcher=DictionaryMatch(d=[','],longest_match_only=True,reverse=True)  
regex_background_matcher=RegexMatchSpan(rgx="(^(?:(To)|(By)|(For)) .*$)|(^.+ (?:(to)|(by)|(for)) .+$)",longest_match_only=True)  
non_comma_regex_background_matcher=CandidateExtractor(Background, [ngrams], [Intersection(non_comma_matcher,regex_background_matcher)])

for i, sents in enumerate([train_sents, dev_sents, test_sents]):
    %time non_comma_regex_background_matcher.apply(sents, split=i)
    printmd("**Split "+str(i)+" - number of candidates extracted: "+str(session.query(Background).filter(Background.split == i).count())+"**\n\n")
 


Clearing existing...
Running UDF...

CPU times: user 9.37 s, sys: 254 ms, total: 9.63 s
Wall time: 9.5 s


**Split 0 - number of candidates extracted: 2042**



Clearing existing...
Running UDF...

CPU times: user 1.36 s, sys: 153 ms, total: 1.52 s
Wall time: 1.41 s


**Split 1 - number of candidates extracted: 239**



Clearing existing...
Running UDF...

CPU times: user 1.4 s, sys: 150 ms, total: 1.55 s
Wall time: 1.44 s


**Split 2 - number of candidates extracted: 236**



In [8]:
cands = session.query(Background).filter(Background.split == 0).all()
for i in range(7):  # use range(len(cands)) to print all candidates
    printmd("**"+str(i)+"/"+str(len(cands))+" Candidate/Span:**\t`"+str(cands[i].background_cue)+"`")
    printmd("**Its parent Sentence's text:**\t"+str(cands[i].get_parent().text))
    printmd("**Its parent Document's text:**\t"+str(cands[i].get_parent().get_parent().__dict__))
    print()

**0/2042 Candidate/Span:**	`Span("b'This is particularly challenging in time critical domains where goal rewards decrease over time and for tightly coupled coordination where multiple robots must work together on each goal.'", sentence=2782, chars=[0,186], words=[0,28])`

**Its parent Sentence's text:**	This is particularly challenging in time critical domains where goal rewards decrease over time and for tightly coupled coordination where multiple robots must work together on each goal.

**Its parent Document's text:**	{'_sa_instance_state': <sqlalchemy.orm.state.InstanceState object at 0x115af76a0>, 'name': '5834868425ff05a97b00d8ca', 'stable_id': '5834868425ff05a97b00d8ca::document:0:0', 'meta': {'file_name': '70kpaper_061418_cleaned.tsv'}, 'id': 374, 'type': 'document', 'sentences': [Sentence(Document 5834868425ff05a97b00d8ca,0,b'Abstract Enabling multiple robots to work together as a team is a difficult problem.'), Sentence(Document 5834868425ff05a97b00d8ca,1,b'Robots must decide amongst themselves who should work on which goals and at what time each goal should be achieved.'), Sentence(Document 5834868425ff05a97b00d8ca,2,b'Since the team is situated in some physical environment, the robots must consider travel time in these decisions.'), Sentence(Document 5834868425ff05a97b00d8ca,3,b'This is particularly challenging in time critical domains where goal rewards decrease over time and for tightly coupled coordination where multiple robots must work together on each goal.'), Sentence(Document 5834868425ff05a97b00d8ca,4,b'Further complications arise when \n')]}




**1/2042 Candidate/Span:**	`Span("b'repeatable testbeds for mobile software and systems.'", sentence=3740, chars=[146,197], words=[25,32])`

**Its parent Sentence's text:**	This RFC argues that mobile network tracing provides both tools to improve our understanding of wireless channels, as well as to build realistic, repeatable testbeds for mobile software and systems.

**Its parent Document's text:**	{'_sa_instance_state': <sqlalchemy.orm.state.InstanceState object at 0x115a7d400>, 'name': '5834868425ff05a97b00ca70', 'stable_id': '5834868425ff05a97b00ca70::document:0:0', 'meta': {'file_name': '70kpaper_061418_cleaned.tsv'}, 'id': 588, 'type': 'document', 'sentences': [Sentence(Document 5834868425ff05a97b00ca70,0,b'Abstract Mobile networks are both poorly understood and difficult to experiment with.'), Sentence(Document 5834868425ff05a97b00ca70,1,b'This RFC argues that mobile network tracing provides both tools to improve our understanding of wireless channels, as well as to build realistic, repeatable testbeds for mobile software and systems.'), Sentence(Document 5834868425ff05a97b00ca70,2,b'The RFC is a status report on our work tracing mobile networks.'), Sentence(Document 5834868425ff05a97b00ca70,3,b'Our goal is to begin discussion on a standard format for mobile network tracing as well as a testbed for mobile systems research.'), Sentence(Document 5834868425ff05a97b00ca70,4,b'We present our format for collecting mobile network traces, and \n')]}




**2/2042 Candidate/Span:**	`Span("b'This RFC argues that mobile network tracing provides both tools to improve our understanding of wireless channels'", sentence=3740, chars=[0,112], words=[0,16])`

**Its parent Sentence's text:**	This RFC argues that mobile network tracing provides both tools to improve our understanding of wireless channels, as well as to build realistic, repeatable testbeds for mobile software and systems.

**Its parent Document's text:**	{'_sa_instance_state': <sqlalchemy.orm.state.InstanceState object at 0x115a7d400>, 'name': '5834868425ff05a97b00ca70', 'stable_id': '5834868425ff05a97b00ca70::document:0:0', 'meta': {'file_name': '70kpaper_061418_cleaned.tsv'}, 'id': 588, 'type': 'document', 'sentences': [Sentence(Document 5834868425ff05a97b00ca70,0,b'Abstract Mobile networks are both poorly understood and difficult to experiment with.'), Sentence(Document 5834868425ff05a97b00ca70,1,b'This RFC argues that mobile network tracing provides both tools to improve our understanding of wireless channels, as well as to build realistic, repeatable testbeds for mobile software and systems.'), Sentence(Document 5834868425ff05a97b00ca70,2,b'The RFC is a status report on our work tracing mobile networks.'), Sentence(Document 5834868425ff05a97b00ca70,3,b'Our goal is to begin discussion on a standard format for mobile network tracing as well as a testbed for mobile systems research.'), Sentence(Document 5834868425ff05a97b00ca70,4,b'We present our format for collecting mobile network traces, and \n')]}




**3/2042 Candidate/Span:**	`Span("b'as well as to build realistic'", sentence=3740, chars=[115,143], words=[18,23])`

**Its parent Sentence's text:**	This RFC argues that mobile network tracing provides both tools to improve our understanding of wireless channels, as well as to build realistic, repeatable testbeds for mobile software and systems.

**Its parent Document's text:**	{'_sa_instance_state': <sqlalchemy.orm.state.InstanceState object at 0x115a7d400>, 'name': '5834868425ff05a97b00ca70', 'stable_id': '5834868425ff05a97b00ca70::document:0:0', 'meta': {'file_name': '70kpaper_061418_cleaned.tsv'}, 'id': 588, 'type': 'document', 'sentences': [Sentence(Document 5834868425ff05a97b00ca70,0,b'Abstract Mobile networks are both poorly understood and difficult to experiment with.'), Sentence(Document 5834868425ff05a97b00ca70,1,b'This RFC argues that mobile network tracing provides both tools to improve our understanding of wireless channels, as well as to build realistic, repeatable testbeds for mobile software and systems.'), Sentence(Document 5834868425ff05a97b00ca70,2,b'The RFC is a status report on our work tracing mobile networks.'), Sentence(Document 5834868425ff05a97b00ca70,3,b'Our goal is to begin discussion on a standard format for mobile network tracing as well as a testbed for mobile systems research.'), Sentence(Document 5834868425ff05a97b00ca70,4,b'We present our format for collecting mobile network traces, and \n')]}




**4/2042 Candidate/Span:**	`Span("b'Lecture 5 on Dynamical Systems & Dynamic Axioms investigated dynamic axioms for dynamical systems'", sentence=3453, chars=[0,96], words=[0,13])`

**Its parent Sentence's text:**	Lecture 5 on Dynamical Systems & Dynamic Axioms investigated dynamic axioms for dynamical systems, ie axioms in differential dynamic logic (dL) that characterize operators of the dynamical systems that dL describes by hybrid programs in terms of structurally simpler dL formulas.

**Its parent Document's text:**	{'_sa_instance_state': <sqlalchemy.orm.state.InstanceState object at 0x115a0da58>, 'name': '5834868425ff05a97b00c280', 'stable_id': '5834868425ff05a97b00c280::document:0:0', 'meta': {'file_name': '70kpaper_061418_cleaned.tsv'}, 'id': 523, 'type': 'document', 'sentences': [Sentence(Document 5834868425ff05a97b00c280,0,b'Lecture 5 on Dynamical Systems & Dynamic Axioms investigated dynamic axioms for dynamical systems, ie axioms in differential dynamic logic (dL) that characterize operators of the dynamical systems that dL describes by hybrid programs in terms of structurally simpler dL formulas.'), Sentence(Document 5834868425ff05a97b00c280,1,b'All it takes to understand the bigger system, thus, is to apply the axiom and investigate the smaller remainders.'), Sentence(Document 5834868425ff05a97b00c280,2,b'That lecture did not quite show all important axioms yet, but it still revealed enough to prove a property of a bouncing ball.'), Sentence(Document 5834868425ff05a97b00c280,3,b"Yet, there's more to \n")]}




**5/2042 Candidate/Span:**	`Span("b'ie axioms in differential dynamic logic (dL) that characterize operators of the dynamical systems that dL describes by hybrid programs in terms of structurally simpler dL formulas.'", sentence=3453, chars=[99,278], words=[15,44])`

**Its parent Sentence's text:**	Lecture 5 on Dynamical Systems & Dynamic Axioms investigated dynamic axioms for dynamical systems, ie axioms in differential dynamic logic (dL) that characterize operators of the dynamical systems that dL describes by hybrid programs in terms of structurally simpler dL formulas.

**Its parent Document's text:**	{'_sa_instance_state': <sqlalchemy.orm.state.InstanceState object at 0x115a0da58>, 'name': '5834868425ff05a97b00c280', 'stable_id': '5834868425ff05a97b00c280::document:0:0', 'meta': {'file_name': '70kpaper_061418_cleaned.tsv'}, 'id': 523, 'type': 'document', 'sentences': [Sentence(Document 5834868425ff05a97b00c280,0,b'Lecture 5 on Dynamical Systems & Dynamic Axioms investigated dynamic axioms for dynamical systems, ie axioms in differential dynamic logic (dL) that characterize operators of the dynamical systems that dL describes by hybrid programs in terms of structurally simpler dL formulas.'), Sentence(Document 5834868425ff05a97b00c280,1,b'All it takes to understand the bigger system, thus, is to apply the axiom and investigate the smaller remainders.'), Sentence(Document 5834868425ff05a97b00c280,2,b'That lecture did not quite show all important axioms yet, but it still revealed enough to prove a property of a bouncing ball.'), Sentence(Document 5834868425ff05a97b00c280,3,b"Yet, there's more to \n")]}




**6/2042 Candidate/Span:**	`Span("b"The work presented here captures progress on an initial empirical evaluation of how well the current VANE system is able to reproduce a real autonomy system's perception performance."", sentence=1494, chars=[0,181], words=[0,29])`

**Its parent Sentence's text:**	The work presented here captures progress on an initial empirical evaluation of how well the current VANE system is able to reproduce a real autonomy system's perception performance.

**Its parent Document's text:**	{'_sa_instance_state': <sqlalchemy.orm.state.InstanceState object at 0x115a2a908>, 'name': '5834868425ff05a97b00c345', 'stable_id': '5834868425ff05a97b00c345::document:0:0', 'meta': {'file_name': '70kpaper_061418_cleaned.tsv'}, 'id': 100, 'type': 'document', 'sentences': [Sentence(Document 5834868425ff05a97b00c345,0,b"Abstract: The US Army Corps of Engineers'(USACE)"), Sentence(Document 5834868425ff05a97b00c345,1,b'Virtual Autonomous Navigation Environment (VANE) is a physics-based, multi-scale numerical testbed designed to quantitatively and accurately predict sensor and autonomous system performance in a simulation environment.'), Sentence(Document 5834868425ff05a97b00c345,2,b"The work presented here captures progress on an initial empirical evaluation of how well the current VANE system is able to reproduce a real autonomy system's perception performance."), Sentence(Document 5834868425ff05a97b00c345,3,b'Findings will directly guide continuing development of \n')]}





### Stopped here as of Mon 06/25 1:36 am. More to come ... ###

## TODO List

Thanks for reading! 

Some debugging notes for future reference (can be skipped). The spaCy English model is downloaded with:

```
python -m spacy download en
```

Current issue: the parser does not split sentences at periods, so the sentence count is significantly lower than expected.
Potential fix: https://github.com/explosion/spaCy/issues/93
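
One rough way to quantify the gap is to compare the parser's sentence count with a naive punctuation-based count. The sketch below is only a sanity check under simple assumptions: `naive_sentence_count` is a hypothetical helper, and it will overcount on abbreviations such as "e.g." or "et al.".

```python
import re


def naive_sentence_count(text):
    """Count chunks separated by whitespace that follows '.', '!' or '?'."""
    pieces = re.split(r"(?<=[.!?])\s+", text.strip())
    return len([p for p in pieces if p])


count = naive_sentence_count("Hello, world. Here are two sentences.")
print(count)  # 2
```

If the parser reports far fewer sentences than this estimate for the same text, that points to the sentence segmentation step rather than to the data.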

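Additional debugging log (optional reading). It shows that calling only `model.tokenizer(...)` cannot provide `.sents`, because sentence boundary detection requires the dependency parse, whereas running the full pipeline (`nlp(raw_text)` or `model(raw_text)`) splits the two sentences correctly: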
~~~~
Xins-MacBook-Pro:~ xin$ source activate snorkel
(snorkel) Xins-MacBook-Pro:~ xin$ python
Python 3.6.4 |Anaconda custom (64-bit)| (default, Jan 16 2018, 12:04:33) 
[GCC 4.2.1 Compatible Clang 4.0.1 (tags/RELEASE_401/final)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import spacey
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ModuleNotFoundError: No module named 'spacey'
>>> import spacy
>>> spacy.load('en')
<spacy.en.English object at 0x1080e1da0>
>>> model=spacy.load('en')
>>> docs=model.tokenizer('Hello, world. Here are two sentences.')
>>> for sent in docs.sents:
...     pritn(sent.text)
... 
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "spacy/tokens/doc.pyx", line 439, in __get__ (spacy/tokens/doc.cpp:9808)
ValueError: Sentence boundary detection requires the dependency parse, which requires data to be installed. For more info, see the documentation: 
https://spacy.io/docs/usage

>>> for sent in docs.sents:
...     print(sent.text)
... 
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "spacy/tokens/doc.pyx", line 439, in __get__ (spacy/tokens/doc.cpp:9808)
ValueError: Sentence boundary detection requires the dependency parse, which requires data to be installed. For more info, see the documentation: 
https://spacy.io/docs/usage

>>> from spacy.en import English
>>> nlp = English()
>>> doc = nlp(raw_text)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
NameError: name 'raw_text' is not defined
>>> raw_text='Hello, world. Here are two sentences.'
>>> doc = nlp(raw_text)
>>> sentences = [sent.string.strip() for sent in doc.sents]
>>> sentences
['Hello, world.', 'Here are two sentences.']
>>> model(raw_text)
Hello, world. Here are two sentences.
>>> docs=model(raw_text)
>>> docs.sents
<generator object at 0x14ad31948>
>>> docs=model(raw_text)
>>> sentences = [sent.string.strip() for sent in docs.sents]
>>> sentences
['Hello, world.', 'Here are two sentences.']
>>> 
~~~~
