## Test Run 2: Extract Meaningful Span

This notebook continues from Test Run 1 but defines a *Span* more meaningfully. The goal is that 1 *Sentence* = 1+ *Spans* = 1+ *Clauses*. Starting from the candidate space produced by the default `Ngrams()` class, we customize the matcher to exclude any span that contains a comma.
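
To make the target segmentation concrete, here is a minimal, library-free sketch of "one sentence = one or more comma-free spans". It assumes whitespace-tokenized input, and `comma_free_spans` is a hypothetical helper, not part of Snorkel. The notebook itself reaches these spans by matching over n-grams rather than by splitting directly.

```python
def comma_free_spans(tokens):
    """Return maximal runs of consecutive tokens that contain no comma.

    Each run is a (start, end) pair of inclusive token indices.
    """
    spans, start = [], None
    for i, tok in enumerate(tokens):
        if tok == ",":
            # A comma closes the current run, if one is open.
            if start is not None:
                spans.append((start, i - 1))
                start = None
        elif start is None:
            # First non-comma token after a comma (or at the beginning).
            start = i
    if start is not None:
        spans.append((start, len(tokens) - 1))
    return spans


# Hypothetical, pre-tokenized example sentence.
tokens = "While recent work has made strides , they focus on simple properties .".split()
for start, end in comma_free_spans(tokens):
    print((start, end), " ".join(tokens[start:end + 1]))
```

Each printed span corresponds roughly to one clause of the sentence, which is the granularity this test run aims for.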

Note that in this test run we initialize `SnorkelSession()` without re-creating the `Document` and `Sentence` objects, which were already created in Test Run 1.


In [1]:
%load_ext autoreload
%autoreload 2
%matplotlib inline
import os

from snorkel import SnorkelSession
from snorkel.parser.spacy_parser import Spacy
from snorkel.parser import CorpusParser
from snorkel.models import Document, Sentence

session = SnorkelSession()
print("Documents:", session.query(Document).count())
print("Sentences:", session.query(Sentence).count())

sents = session.query(Sentence).all()
# Length (in tokens) of the longest sentence; used later as the n-gram upper bound.
n_max_corpus = max((len(sent.words) for sent in sents), default=0)

print("The longest sentence has " + str(n_max_corpus) + " tokens.")

# Split sentences into train/dev/test by their document's position in the name-sorted list
# (`Document` is already imported above).

docs = session.query(Document).order_by(Document.name).all()

train_sents = set()
dev_sents   = set()
test_sents  = set()

for i, doc in enumerate(docs):
    for s in doc.sentences:
        if i % 10 == 8:
            dev_sents.add(s)
        elif i % 10 == 9:
            test_sents.add(s)
        else:
            train_sents.add(s)


Documents: 1000
Sentences: 4487
The longest sentence has 101 tokens.
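
The split rule in the cell above works on whole documents, so all sentences of a document land in the same split. The sketch below replays only the index arithmetic, without a Snorkel session, to show the resulting proportions. `split_for` is a hypothetical helper; the split ids 0/1/2 follow the train/dev/test order used later in `extract_and_display`.

```python
from collections import Counter


def split_for(doc_index):
    """Map a document's position in the name-sorted list to a split id."""
    if doc_index % 10 == 8:
        return 1  # dev
    if doc_index % 10 == 9:
        return 2  # test
    return 0      # train


# 1000 is the document count reported by the cell output above.
counts = Counter(split_for(i) for i in range(1000))
print(counts)
```

With 1000 documents this is an 80/10/10 split at the document level; the sentence-level proportions can differ because documents have different numbers of sentences.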


In Test Run 1, we tried a **single** `DictionaryMatch`. Now it is time to explore further, e.g. by combining **multiple** matchers.

Fortunately, multiple matchers can be combined into what we call **compound** matchers through:
1. `Union(Matcher)` (takes the union of the candidate sets returned by the child matchers),
2. `Intersection(Matcher)` (takes the intersection of those candidate sets), or
3. `Concat(Matcher)` (candidates that are the concatenation of adjacent matches from the child matchers).
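
The three combinators in the list above can be pictured as plain set operations over spans. The sketch below is a simplified reading of those descriptions, not Snorkel's implementation: spans are modeled as `(start, end)` word-index pairs with an inclusive end, and `union`, `intersection`, and `concat` are hypothetical stand-ins for the real matcher classes.

```python
# Spans are (start, end) word-index pairs, end inclusive.

def union(*candidate_sets):
    """Spans returned by at least one child."""
    return set().union(*candidate_sets)


def intersection(*candidate_sets):
    """Spans returned by every child."""
    return set.intersection(*map(set, candidate_sets))


def concat(left, right):
    """Spans formed by a left match immediately followed by a right match."""
    return {
        (l_start, r_end)
        for (l_start, l_end) in left
        for (r_start, r_end) in right
        if r_start == l_end + 1  # right match starts right after the left one
    }


# Hypothetical outputs of two child matchers on the same sentence.
a = {(0, 1), (3, 4)}
b = {(2, 2), (3, 4)}

print("union:", sorted(union(a, b)))
print("intersection:", sorted(intersection(a, b)))
print("concat:", sorted(concat(a, b)))
```

In this notebook only the intersection is used: a span becomes a candidate when both the comma-excluding matcher and the cue matcher accept it.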

In [2]:
from snorkel.models import candidate_subclass
from snorkel.candidates import Ngrams, CandidateExtractor
from snorkel.matchers import *

Background = candidate_subclass('Background', ['background_cue'])
ngrams = Ngrams(n_max=n_max_corpus) # we define the maximum n value as n_max_corpus

from IPython.display import Markdown, display
def printmd(string):
    display(Markdown(string))
def extract_and_display(extractor, n_show=7):
    """Run a CandidateExtractor on every split, then display the first few training candidates."""
    for i, sents in enumerate([train_sents, dev_sents, test_sents]):
        %time extractor.apply(sents, split=i)
        n_cands = session.query(Background).filter(Background.split == i).count()
        printmd("**Split "+str(i)+" - number of candidates extracted: "+str(n_cands)+"**\n\n")
    cands = session.query(Background).filter(Background.split == 0).all()
    # Guard against splits with fewer than n_show candidates; pass n_show=len(cands) to print all.
    for i in range(min(n_show, len(cands))):
        printmd("**"+str(i)+"/"+str(len(cands))+" Candidate/Span:**\t`"+str(cands[i].background_cue)+"`")
        printmd("**Its parent Sentence's text:**\t"+str(cands[i].get_parent().text))
        printmd("**Its parent Document's text:**\t"+str(cands[i].get_parent().get_parent().__dict__))
        print()
    

In [3]:
# Compound Matcher Example 1 (Dict-based): 
non_comma_matcher=DictionaryMatch(d=[','],longest_match_only=True,reverse=True)  
dict_background_matcher=DictionaryMatch(d=['previous','motivated','recent','widely'],longest_match_only=True) 
non_comma_dict_background_matcher=CandidateExtractor(Background, [ngrams], [Intersection(non_comma_matcher,dict_background_matcher)])
extract_and_display(non_comma_dict_background_matcher)

Clearing existing...
Running UDF...

CPU times: user 9 s, sys: 320 ms, total: 9.32 s
Wall time: 9.29 s


**Split 0 - number of candidates extracted: 106**



Clearing existing...
Running UDF...

CPU times: user 1.27 s, sys: 110 ms, total: 1.38 s
Wall time: 1.36 s


**Split 1 - number of candidates extracted: 11**



Clearing existing...
Running UDF...

CPU times: user 1.3 s, sys: 115 ms, total: 1.42 s
Wall time: 1.41 s


**Split 2 - number of candidates extracted: 9**



**0/106 Candidate/Span:**	`Span("b'While recent work in network verification has made giant strides to reduce this effort'", sentence=4155, chars=[0,85], words=[0,13])`

**Its parent Sentence's text:**	While recent work in network verification has made giant strides to reduce this effort, they focus on simple reachability properties and cannot handle context-dependent policies (eg, how many connections has a host spawned) that operators realize using stateful network functions (NFs).

**Its parent Document's text:**	{'_sa_instance_state': <sqlalchemy.orm.state.InstanceState object at 0x115732240>, 'name': '5834868425ff05a97b00cb7f', 'stable_id': '5834868425ff05a97b00cb7f::document:0:0', 'meta': {'file_name': '70kpaper_061418_cleaned.tsv'}, 'id': 687, 'type': 'document', 'sentences': [Sentence(Document 5834868425ff05a97b00cb7f,0,b'Abstract: Network operators today spend significant manual effort in ensuring and checking that the network meets their intended policies.'), Sentence(Document 5834868425ff05a97b00cb7f,1,b'While recent work in network verification has made giant strides to reduce this effort, they focus on simple reachability properties and cannot handle context-dependent policies (eg, how many connections has a host spawned) that operators realize using stateful network functions (NFs).'), Sentence(Document 5834868425ff05a97b00cb7f,2,b'Together, these introduce new expressiveness and scalability challenges that fall outside the scope of existing network \n')]}




**1/106 Candidate/Span:**	`Span("b'Abstract Recent studies have shown a range of co-residency side channels that can be used to extract private information from cloud clients.'", sentence=2272, chars=[0,139], words=[0,24])`

**Its parent Sentence's text:**	Abstract Recent studies have shown a range of co-residency side channels that can be used to extract private information from cloud clients.

**Its parent Document's text:**	{'_sa_instance_state': <sqlalchemy.orm.state.InstanceState object at 0x115754e10>, 'name': '5834868425ff05a97b00cdf0', 'stable_id': '5834868425ff05a97b00cdf0::document:0:0', 'meta': {'file_name': '70kpaper_061418_cleaned.tsv'}, 'id': 262, 'type': 'document', 'sentences': [Sentence(Document 5834868425ff05a97b00cdf0,0,b'Abstract Recent studies have shown a range of co-residency side channels that can be used to extract private information from cloud clients.'), Sentence(Document 5834868425ff05a97b00cdf0,1,b'Unfortunately, addressing these side channels often requires detailed attack-specific fixes that require significant modifications to hardware, client virtual machines (VM), or hypervisors.'), Sentence(Document 5834868425ff05a97b00cdf0,2,b'Furthermore, these solutions cannot be generalized to future side channels.'), Sentence(Document 5834868425ff05a97b00cdf0,3,b'Barring extreme solutions such as single tenancy which sacrifices the multiplexing benefits of cloud computing, such side channels will \n')]}




**2/106 Candidate/Span:**	`Span("b'As described in the previous chapters'", sentence=4705, chars=[0,36], words=[0,5])`

**Its parent Sentence's text:**	As described in the previous chapters, some of the most successful work has included road following using CCD cameras [8.10] and cross country navigation using a scanning laser range finder [8.7].

**Its parent Document's text:**	{'_sa_instance_state': <sqlalchemy.orm.state.InstanceState object at 0x115797a90>, 'name': '5834868425ff05a97b00d8e4', 'stable_id': '5834868425ff05a97b00d8e4::document:0:0', 'meta': {'file_name': '70kpaper_061418_cleaned.tsv'}, 'id': 817, 'type': 'document', 'sentences': [Sentence(Document 5834868425ff05a97b00d8e4,0,b'As described in the previous chapters, some of the most successful work has included road following using CCD cameras [8.10] and cross country navigation using a scanning laser range finder [8.7].'), Sentence(Document 5834868425ff05a97b00d8e4,1,b'The long term goal of each of these projects has been to engineer a vehicle capable of performing fully autonomous tasks in a given environment.'), Sentence(Document 5834868425ff05a97b00d8e4,2,b'But current systems have significant limitations.'), Sentence(Document 5834868425ff05a97b00d8e4,3,b'A single CCD camera is sufficient for road following, but not for extended road driving tasks; a laser range finder is sufficient for obstacle avoidance\n')]}




**3/106 Candidate/Span:**	`Span("b'Recent proposals have advocated the use of consolidation of \n'", sentence=2205, chars=[0,60], words=[0,9])`

**Its parent Sentence's text:**	Recent proposals have advocated the use of consolidation of 


**Its parent Document's text:**	{'_sa_instance_state': <sqlalchemy.orm.state.InstanceState object at 0x115749198>, 'name': '5834868425ff05a97b00cc77', 'stable_id': '5834868425ff05a97b00cc77::document:0:0', 'meta': {'file_name': '70kpaper_061418_cleaned.tsv'}, 'id': 246, 'type': 'document', 'sentences': [Sentence(Document 5834868425ff05a97b00cc77,0,b'Abstract Modern offices are crowded with personal computers.'), Sentence(Document 5834868425ff05a97b00cc77,1,b'While studies have shown these to be idle most of the time, they remain powered, consuming up to 60&percnt; of their peak power.'), Sentence(Document 5834868425ff05a97b00cc77,2,b'Hardware-based solutions engendered by PC vendors (eg, low-power states, Wake-on-LAN) have proved unsuccessful because, in spite of user inactivity, these machines often need to remain network active in support of background applications that maintain network presence.'), Sentence(Document 5834868425ff05a97b00cc77,3,b'Recent proposals have advocated the use of consolidation of \n')]}




**4/106 Candidate/Span:**	`Span("b'<\\nPrevious; Next >; Home > Conferences_Events > SCFORUM > 6.'", sentence=2902, chars=[0,60], words=[0,13])`

**Its parent Sentence's text:**	<\nPrevious; Next >; Home > Conferences_Events > SCFORUM > 6.

**Its parent Document's text:**	{'_sa_instance_state': <sqlalchemy.orm.state.InstanceState object at 0x11578d6a0>, 'name': '5834868425ff05a97b00d782', 'stable_id': '5834868425ff05a97b00d782::document:0:0', 'meta': {'file_name': '70kpaper_061418_cleaned.tsv'}, 'id': 399, 'type': 'document', 'sentences': [Sentence(Document 5834868425ff05a97b00d782,0,b'Home; Search; Browse Collections; My Account; About; DC Network Digital Commons Network\xe2\x84\xa2.\\nSkip to main content logo.'), Sentence(Document 5834868425ff05a97b00d782,1,b'Research Showcase @'), Sentence(Document 5834868425ff05a97b00d782,2,b'CMU.'), Sentence(Document 5834868425ff05a97b00d782,3,b'Home; About; FAQ; Policies; My Account.'), Sentence(Document 5834868425ff05a97b00d782,4,b'<\\nPrevious; Next >; Home > Conferences_Events > SCFORUM > 6.'), Sentence(Document 5834868425ff05a97b00d782,5,b'Scholarly Communications Forum.\\nAuthors.'), Sentence(Document 5834868425ff05a97b00d782,6,b'John Ockerbloom N. Balakrishnan, Indian Institute of Science, Bangalore Ismail Serageldin,\\nBibliotheca Alexandrina Michael I. Shamos, Carnegie Mellon University.'), Sentence(Document 5834868425ff05a97b00d782,7,b'Title.'), Sentence(Document 5834868425ff05a97b00d782,8,b'Forum on International\\nInitiatives on Copyright.'), Sentence(Document 5834868425ff05a97b00d782,9,b'Date of Original Version.'), Sentence(Document 5834868425ff05a97b00d782,10,b'11-2  \\n\n')]}




**5/106 Candidate/Span:**	`Span("b'Abstract Recently'", sentence=2006, chars=[0,16], words=[0,1])`

**Its parent Sentence's text:**	Abstract Recently, there has been increasing deployment of distributed infrastructure and applications such as Akamai, PlanetLab and DHTs.

**Its parent Document's text:**	{'_sa_instance_state': <sqlalchemy.orm.state.InstanceState object at 0x115728eb8>, 'name': '5834868425ff05a97b00cb6b', 'stable_id': '5834868425ff05a97b00cb6b::document:0:0', 'meta': {'file_name': '70kpaper_061418_cleaned.tsv'}, 'id': 203, 'type': 'document', 'sentences': [Sentence(Document 5834868425ff05a97b00cb6b,0,b'Abstract Recently, there has been increasing deployment of distributed infrastructure and applications such as Akamai, PlanetLab and DHTs.'), Sentence(Document 5834868425ff05a97b00cb6b,1,b'A highly-available monitoring service for these systems is vital to ensuring their smooth operation.'), Sentence(Document 5834868425ff05a97b00cb6b,2,b'In this setting, a key challenge to high availability is the presence of correlated failures, not addressed by previous monitoring services.'), Sentence(Document 5834868425ff05a97b00cb6b,3,b'This paper presents our approach to achieving high availability in IRISLOG, a customizable wide-area monitoring service.'), Sentence(Document 5834868425ff05a97b00cb6b,4,b'IRISLOG incorporates a replication design \n')]}




**6/106 Candidate/Span:**	`Span("b'Abstract Recent biological advances with protein microarrays allow high-throughput antibody response screens against the entire proteome of a disease pathogen.'", sentence=3381, chars=[0,158], words=[0,22])`

**Its parent Sentence's text:**	Abstract Recent biological advances with protein microarrays allow high-throughput antibody response screens against the entire proteome of a disease pathogen.

**Its parent Document's text:**	{'_sa_instance_state': <sqlalchemy.orm.state.InstanceState object at 0x1156d4860>, 'name': '5834868425ff05a97b00c37d', 'stable_id': '5834868425ff05a97b00c37d::document:0:0', 'meta': {'file_name': '70kpaper_061418_cleaned.tsv'}, 'id': 505, 'type': 'document', 'sentences': [Sentence(Document 5834868425ff05a97b00c37d,0,b'Abstract Recent biological advances with protein microarrays allow high-throughput antibody response screens against the entire proteome of a disease pathogen.'), Sentence(Document 5834868425ff05a97b00c37d,1,b'cDNA microarray analysis techniques have been applied in normalization and identifying highly significant antigens in antibody response across conditions.'), Sentence(Document 5834868425ff05a97b00c37d,2,b'Previously, some basic machine learning classification algorithms have been used to evaluate the diagnostic power of noted antigens, but the issue of which algorithms perform best on this type of data is \n')]}




Next we move to Compound Matcher Example 2, which uses `RegexMatchSpan` to extract *Spans* that either start with "To"/"By"/"For" or contain "to"/"by"/"for" in the middle.
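
Before running the extractor, the regular expression from the next cell can be tried on a few strings with Python's standard `re` module. This is only a sketch of what the pattern itself accepts: it is applied case-sensitively here, `is_cue_span` is a hypothetical helper, and how `RegexMatchSpan` applies the pattern internally (for example its case handling) is not assumed.

```python
import re

# Same pattern as in the cell below: a leading To/By/For followed by a space,
# or a to/by/for in the middle with text on both sides.
PATTERN = re.compile(r"(^(?:(To)|(By)|(For)) .*$)|(^.+ (?:(to)|(by)|(for)) .+$)")


def is_cue_span(text):
    """True if the whole span text is accepted by the pattern."""
    return PATTERN.search(text) is not None


examples = [
    "To reduce this effort",          # leading "To"
    "When compared to other replay",  # "to" in the middle
    "Abstract Recently",              # no cue word at all
    "glucose detection by",           # "by" at the very end, nothing after it
    "to answer a query",              # lowercase "to" at the start, nothing before it
]
for text in examples:
    print(is_cue_span(text), repr(text))
```

The last two examples show the effect of requiring text on both sides of a mid-span cue: a cue word at the very end of a span, or a lowercase cue at the very start, is not accepted by this pattern when matched case-sensitively.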

In [4]:
# Compound Matcher Example 2 (Regex-based):
# E.g. to match either a beginning To/By/For or a middle to/by/for,
# we could use "(^(?:(To)|(By)|(For)) .*$)|(^.+ (?:(to)|(by)|(for)) .+$)"

non_comma_matcher=DictionaryMatch(d=[','],longest_match_only=True,reverse=True)  
regex_background_matcher=RegexMatchSpan(rgx="(^(?:(To)|(By)|(For)) .*$)|(^.+ (?:(to)|(by)|(for)) .+$)",longest_match_only=True)  
non_comma_regex_background_matcher=CandidateExtractor(Background, [ngrams], [Intersection(non_comma_matcher,regex_background_matcher)])
extract_and_display(non_comma_regex_background_matcher)

Clearing existing...
Running UDF...

CPU times: user 12.5 s, sys: 435 ms, total: 13 s
Wall time: 13.1 s


**Split 0 - number of candidates extracted: 2042**



Clearing existing...
Running UDF...

CPU times: user 1.74 s, sys: 205 ms, total: 1.95 s
Wall time: 1.85 s


**Split 1 - number of candidates extracted: 239**



Clearing existing...
Running UDF...

CPU times: user 1.73 s, sys: 194 ms, total: 1.93 s
Wall time: 1.83 s


**Split 2 - number of candidates extracted: 236**



**0/2042 Candidate/Span:**	`Span("b'whose distribution can be estimated by collecting statistics.'", sentence=4031, chars=[238,298], words=[42,50])`

**Its parent Sentence's text:**	Abstract With the explosive growth of the intemet, autonomous agents will increasingly need strategies for efficiently retrieving information, The time an agent (or server) takes to answer a query issued to it is often a random variable, whose distribution can be estimated by collecting statistics.

**Its parent Document's text:**	{'_sa_instance_state': <sqlalchemy.orm.state.InstanceState object at 0x1157d9898>, 'name': '5834868425ff05a97b00df0a', 'stable_id': '5834868425ff05a97b00df0a::document:0:0', 'meta': {'file_name': '70kpaper_061418_cleaned.tsv'}, 'id': 658, 'type': 'document', 'sentences': [Sentence(Document 5834868425ff05a97b00df0a,0,b'Abstract With the explosive growth of the intemet, autonomous agents will increasingly need strategies for efficiently retrieving information, The time an agent (or server) takes to answer a query issued to it is often a random variable, whose distribution can be estimated by collecting statistics.'), Sentence(Document 5834868425ff05a97b00df0a,1,b'When an agent A sends a query to another agent B and the query has not completed in some period of time, agent A faces the dilemma of whether to continue waiting or reissue the query (to agent B or to a different one).'), Sentence(Document 5834868425ff05a97b00df0a,2,b'When some information is \n')]}




**1/2042 Candidate/Span:**	`Span("b'The time an agent (or server) takes to answer a query issued to it is often a random variable'", sentence=4031, chars=[143,235], words=[20,40])`

**Its parent Sentence's text:**	Abstract With the explosive growth of the intemet, autonomous agents will increasingly need strategies for efficiently retrieving information, The time an agent (or server) takes to answer a query issued to it is often a random variable, whose distribution can be estimated by collecting statistics.

**Its parent Document's text:**	{'_sa_instance_state': <sqlalchemy.orm.state.InstanceState object at 0x1157d9898>, 'name': '5834868425ff05a97b00df0a', 'stable_id': '5834868425ff05a97b00df0a::document:0:0', 'meta': {'file_name': '70kpaper_061418_cleaned.tsv'}, 'id': 658, 'type': 'document', 'sentences': [Sentence(Document 5834868425ff05a97b00df0a,0,b'Abstract With the explosive growth of the intemet, autonomous agents will increasingly need strategies for efficiently retrieving information, The time an agent (or server) takes to answer a query issued to it is often a random variable, whose distribution can be estimated by collecting statistics.'), Sentence(Document 5834868425ff05a97b00df0a,1,b'When an agent A sends a query to another agent B and the query has not completed in some period of time, agent A faces the dilemma of whether to continue waiting or reissue the query (to agent B or to a different one).'), Sentence(Document 5834868425ff05a97b00df0a,2,b'When some information is \n')]}




**2/2042 Candidate/Span:**	`Span("b'autonomous agents will increasingly need strategies for efficiently retrieving information'", sentence=4031, chars=[51,140], words=[9,18])`

**Its parent Sentence's text:**	Abstract With the explosive growth of the intemet, autonomous agents will increasingly need strategies for efficiently retrieving information, The time an agent (or server) takes to answer a query issued to it is often a random variable, whose distribution can be estimated by collecting statistics.

**Its parent Document's text:**	{'_sa_instance_state': <sqlalchemy.orm.state.InstanceState object at 0x1157d9898>, 'name': '5834868425ff05a97b00df0a', 'stable_id': '5834868425ff05a97b00df0a::document:0:0', 'meta': {'file_name': '70kpaper_061418_cleaned.tsv'}, 'id': 658, 'type': 'document', 'sentences': [Sentence(Document 5834868425ff05a97b00df0a,0,b'Abstract With the explosive growth of the intemet, autonomous agents will increasingly need strategies for efficiently retrieving information, The time an agent (or server) takes to answer a query issued to it is often a random variable, whose distribution can be estimated by collecting statistics.'), Sentence(Document 5834868425ff05a97b00df0a,1,b'When an agent A sends a query to another agent B and the query has not completed in some period of time, agent A faces the dilemma of whether to continue waiting or reissue the query (to agent B or to a different one).'), Sentence(Document 5834868425ff05a97b00df0a,2,b'When some information is \n')]}




**3/2042 Candidate/Span:**	`Span("b'When compared to other replay'", sentence=4704, chars=[0,28], words=[0,4])`

**Its parent Sentence's text:**	When compared to other replay mechanisms,//TRACE offers 


**Its parent Document's text:**	{'_sa_instance_state': <sqlalchemy.orm.state.InstanceState object at 0x11576a8d0>, 'name': '5834868425ff05a97b00d0a4', 'stable_id': '5834868425ff05a97b00d0a4::document:0:0', 'meta': {'file_name': '70kpaper_061418_cleaned.tsv'}, 'id': 816, 'type': 'document', 'sentences': [Sentence(Document 5834868425ff05a97b00d0a4,0,b'Abstract//TRACE 1 is a new approach for extracting and replaying traces of parallel applications to recreate their I/O behavior.'), Sentence(Document 5834868425ff05a97b00d0a4,1,b'Its tracing engine automatically discovers inter-node data dependencies and inter-'), Sentence(Document 5834868425ff05a97b00d0a4,2,b'I/O compute times for each node (process) in an application.'), Sentence(Document 5834868425ff05a97b00d0a4,3,b'This information is reflected in per-node annotated I/O traces.'), Sentence(Document 5834868425ff05a97b00d0a4,4,b'Such annotation allows a parallel replayer to closely mimic the behavior of a traced application across a variety of storage systems.'), Sentence(Document 5834868425ff05a97b00d0a4,5,b'When compared to other replay mechanisms,//TRACE offers \n')]}




**4/2042 Candidate/Span:**	`Span("b'adapting to local traffic and exceptional situations as necessary. \n'", sentence=3147, chars=[63,130], words=[10,20])`

**Its parent Sentence's text:**	A behavioral layer executes the route through the environment, adapting to local traffic and exceptional situations as necessary. 


**Its parent Document's text:**	{'_sa_instance_state': <sqlalchemy.orm.state.InstanceState object at 0x1157ae550>, 'name': '5834868425ff05a97b00dc55', 'stable_id': '5834868425ff05a97b00dc55::document:0:0', 'meta': {'file_name': '70kpaper_061418_cleaned.tsv'}, 'id': 449, 'type': 'document', 'sentences': [Sentence(Document 5834868425ff05a97b00dc55,0,b'Executive Summary'), Sentence(Document 5834868425ff05a97b00dc55,1,b'The Urban Challenge represents a technological leap beyond the previous Grand Challenges.'), Sentence(Document 5834868425ff05a97b00dc55,2,b'The challenge encompasses three primary behaviors: driving on roads, handling intersections and maneuvering in zones.'), Sentence(Document 5834868425ff05a97b00dc55,3,b'In implementing urban driving we have decomposed the problem into five components.'), Sentence(Document 5834868425ff05a97b00dc55,4,b'Mission Planning determines an efficient route through an urban network of roads.'), Sentence(Document 5834868425ff05a97b00dc55,5,b'A behavioral layer executes the route through the environment, adapting to local traffic and exceptional situations as necessary. \n')]}




**5/2042 Candidate/Span:**	`Span("b'Here we intend to describe in more detail his remarkable career as a publisher of mathematical books.'", sentence=5245, chars=[0,100], words=[0,17])`

**Its parent Sentence's text:**	Here we intend to describe in more detail his remarkable career as a publisher of mathematical books.

**Its parent Document's text:**	{'_sa_instance_state': <sqlalchemy.orm.state.InstanceState object at 0x11573d588>, 'name': '5834868425ff05a97b00cbcb', 'stable_id': '5834868425ff05a97b00cbcb::document:0:0', 'meta': {'file_name': '70kpaper_061418_cleaned.tsv'}, 'id': 945, 'type': 'document', 'sentences': [Sentence(Document 5834868425ff05a97b00cbcb,0,b'Klaus Peters as Mathematical Publisher'), Sentence(Document 5834868425ff05a97b00cbcb,1,b'This piece is a supplement to a biographical note in the December 2014 issue of the Notices dealing with the highly respected publisher of scientific books, Dr. Klaus Peters.'), Sentence(Document 5834868425ff05a97b00cbcb,2,b'Here we intend to describe in more detail his remarkable career as a publisher of mathematical books.'), Sentence(Document 5834868425ff05a97b00cbcb,3,b'After his doctorate in complex analysis in 1962 from the University of Erlangen, Klaus served as assistant professor at Erlangen for two years.'), Sentence(Document 5834868425ff05a97b00cbcb,4,b'Then he was invited by Springer Verlag to be its first in-house mathematics editor. \n')]}




**6/2042 Candidate/Span:**	`Span("b'The GOD immobilized Pt nanowires were used for application in glucose detection by \n'", sentence=3687, chars=[0,83], words=[0,13])`

**Its parent Sentence's text:**	The GOD immobilized Pt nanowires were used for application in glucose detection by 


**Its parent Document's text:**	{'_sa_instance_state': <sqlalchemy.orm.state.InstanceState object at 0x1157ce0b8>, 'name': '5834868425ff05a97b00de86', 'stable_id': '5834868425ff05a97b00de86::document:0:0', 'meta': {'file_name': '70kpaper_061418_cleaned.tsv'}, 'id': 578, 'type': 'document', 'sentences': [Sentence(Document 5834868425ff05a97b00de86,0,b'Abstract In this work, the surface of platinum (Pt) nanowires was modified by using several chemicals, including a compound of gelatin gel with SiO 2, polyvinyl alcohol (PVA) with Prussian blue (PB) mediator and cysteamine self-assembled monolayers (SAM).'), Sentence(Document 5834868425ff05a97b00de86,1,b'Then, glucose oxidase (GOD) enzyme was immobilized on the modified surfaces of Pt nanowire electrodes by using techniques of electrochemical adsorption and chemical binding.'), Sentence(Document 5834868425ff05a97b00de86,2,b'The GOD immobilized Pt nanowires were used for application in glucose detection by \n')]}





### Stopped here as of Mon 06/25 1:36 am. More to come ... ###

## TODO List

Thanks for reading! 

Some debugging notes to remember (safe to skip).

```
python -m spacy download en
```

Current issue: the parser does not split sentences *at periods*, so the sentence count is significantly lower than expected!
Potential fix: https://github.com/explosion/spaCy/issues/93

======= More debugging logs (not essential; safe to skip) ======
~~~~
Xins-MacBook-Pro:~ xin$ source activate snorkel
(snorkel) Xins-MacBook-Pro:~ xin$ python
Python 3.6.4 |Anaconda custom (64-bit)| (default, Jan 16 2018, 12:04:33) 
[GCC 4.2.1 Compatible Clang 4.0.1 (tags/RELEASE_401/final)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import spacey
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ModuleNotFoundError: No module named 'spacey'
>>> import spacy
>>> spacy.load('en')
<spacy.en.English object at 0x1080e1da0>
>>> model=spacy.load('en')
>>> docs=model.tokenizer('Hello, world. Here are two sentences.')
>>> for sent in docs.sents:
...     pritn(sent.text)
... 
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "spacy/tokens/doc.pyx", line 439, in __get__ (spacy/tokens/doc.cpp:9808)
ValueError: Sentence boundary detection requires the dependency parse, which requires data to be installed. For more info, see the documentation: 
https://spacy.io/docs/usage

>>> for sent in docs.sents:
...     print(sent.text)
... 
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "spacy/tokens/doc.pyx", line 439, in __get__ (spacy/tokens/doc.cpp:9808)
ValueError: Sentence boundary detection requires the dependency parse, which requires data to be installed. For more info, see the documentation: 
https://spacy.io/docs/usage

>>> from spacy.en import English
>>> nlp = English()
>>> doc = nlp(raw_text)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
NameError: name 'raw_text' is not defined
>>> raw_text='Hello, world. Here are two sentences.'
>>> doc = nlp(raw_text)
>>> sentences = [sent.string.strip() for sent in doc.sents]
>>> sentences
['Hello, world.', 'Here are two sentences.']
>>> model(raw_text)
Hello, world. Here are two sentences.
>>> docs=model(raw_text)
>>> docs.sents
<generator object at 0x14ad31948>
>>> docs=model(raw_text)
>>> sentences = [sent.string.strip() for sent in docs.sents]
>>> sentences
['Hello, world.', 'Here are two sentences.']
>>> 
~~~~
