## Test Run 3: Refine Labelling Function

This notebook continues from Test Run 2, but addresses the following issues:

1. Load more papers than the toy examples used in Test Run 2.
2. Complete/refine labelling functions.
3. Load two evaluation datasets and calculate overlap: (1) 5 system/model/method papers + 5 empirical papers; (2) 20K labelled paper dataset.



In [13]:
from IPython.display import display, HTML  # IPython.core.display is deprecated for these imports
display(HTML("<style>.container { width:85% !important; }</style>"))
debug_mode=1 # if not, debug_mode=0

In [1]:
%load_ext autoreload
%autoreload 2
%matplotlib inline
import os

from snorkel import SnorkelSession
from snorkel.parser.spacy_parser import Spacy
from snorkel.parser import CorpusParser
from snorkel.models import Document, Sentence

session = SnorkelSession()
print("Documents:", session.query(Document).count())
print("Sentences:", session.query(Sentence).count())

sents = session.query(Sentence).all()
n_max_corpus=0
for sent in sents:
    n_max_corpus=max(n_max_corpus,len(sent.words))

print("The longest sentence has "+str(n_max_corpus)+" tokens.")

from snorkel.models import Document
from util import number_of_people

docs = session.query(Document).all()

train_sents = set()
dev_sents   = set()
test_sents  = set()

for i, doc in enumerate(docs):
    for s in doc.sentences:
        if i % 10 == 8:
            dev_sents.add(s)
        elif i % 10 == 9:
            test_sents.add(s)
        else:
            train_sents.add(s)


Documents: 5550
Sentences: 19472
The longest sentence has 157 tokens.
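
The split in the cell above is deterministic and made per document, so every sentence of a document lands in the same split. A minimal standalone sketch of that rule follows. The helper name `assign_split` is hypothetical and not part of the notebook or of Snorkel; the document count is the one printed above.

```python
from collections import Counter

def assign_split(doc_index):
    """Return 0 (train), 1 (dev) or 2 (test) for a document's position in the query result."""
    remainder = doc_index % 10
    if remainder == 8:
        return 1  # dev
    if remainder == 9:
        return 2  # test
    return 0      # train

# Count how many of the 5550 documents fall into each split.
split_counts = Counter(assign_split(i) for i in range(5550))
print(dict(split_counts))
```

This gives roughly an 80/10/10 train/dev/test partition at the document level. The sentence-level proportions can differ, since documents vary in length.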


In [2]:
from snorkel.models import candidate_subclass
from snorkel.candidates import Ngrams, CandidateExtractor
from snorkel.matchers import *

ngrams = Ngrams(n_max=n_max_corpus) # we define the maximum n value as n_max_corpus
document_breakdown_map=dict() # maps doc_id into a dict of ["Background", "Purpose", "Mechanism", "Method", "Finding"]
from IPython.display import Markdown, display
def printmd(string):
    display(Markdown(string))
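def extract_and_display(matcher,candidate_class,candidate_class_name,document_breakdown_map=None,selected_split=0):  # split over train/dev/test but returns only train set
    if document_breakdown_map is None:  # avoid a TypeError in the membership test below when no map is passed
        document_breakdown_map = dict()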
    for (i, sents) in ([(0,train_sents), (1,dev_sents), (2,test_sents)] if selected_split==0 else [(1,dev_sents)]):
        %time matcher.apply(sents, split=i)
        printmd("**Split "+str(i)+" - number of candidates extracted: "+str(session.query(candidate_class).filter(candidate_class.split == i).count())+"**\n\n")
    cands = session.query(candidate_class).filter(candidate_class.split == selected_split).all()
    for i in range(min(4,len(cands))): # to print all cands, range(len(cands))
        printmd("**"+str(i)+"/"+str(len(cands))+" Candidate/Span:**\t`"+str(cands[i])+"`")
        printmd("**Its parent Sentence's text:**\t"+str(cands[i].get_parent().text))
        printmd("**Its parent Document's text:**\t"+str(cands[i].get_parent().get_parent().__dict__))
        print() 
        
    for cand in cands:
        doc_name=cand.get_parent().get_parent().name
        if doc_name not in document_breakdown_map:
            document_breakdown_map[doc_name]=dict()
        if candidate_class_name not in document_breakdown_map[doc_name]:
            document_breakdown_map[doc_name][candidate_class_name]=[]
        document_breakdown_map[doc_name][candidate_class_name]+=[cand]
        
    return cands
    

We want to differentiate ``Background`` from ``Purpose``. Recall our ``Background`` definition from the Google Doc:

``Contains words that indicate prior work (e.g., “traditionally”, “researchers have…”), and then following sentence/span starts with some variant of “In this paper, we introduce…” (Exploring adjacency relationship through helper function in Snorkel)``
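
The adjacency part of this definition can be sketched in plain Python, independently of Snorkel. The cue list reuses the words from the definition and from the `Background` dictionary matcher; the function name `background_indices` and the exact opener pattern are assumptions made for illustration, not the notebook's implementation.

```python
import re

# Cue words that signal prior work (taken from the definition above).
PRIOR_WORK_CUES = ("traditionally", "researchers", "previous work")

# Assumed pattern for "some variant of 'In this paper, we introduce...'".
PURPOSE_OPENER = re.compile(r"^\s*in this (paper|study)\b", re.IGNORECASE)

def background_indices(sentences):
    """Return indices of sentences that contain a prior-work cue and are
    immediately followed by a sentence that opens like a purpose statement."""
    hits = []
    for i in range(len(sentences) - 1):
        current = sentences[i].lower()
        has_cue = any(cue in current for cue in PRIOR_WORK_CUES)
        if has_cue and PURPOSE_OPENER.search(sentences[i + 1]):
            hits.append(i)
    return hits

# An abstract from the corpus, split into sentences.
abstract = [
    "The visual estimation of the rank of a matrix has eluded researchers across a myriad of disciplines many years.",
    "In this paper, we demonstrate the successful visual estimation of a matrix's rank by treating it as a classification problem.",
]
print(background_indices(abstract))
```

Only the first sentence qualifies here: it mentions "researchers" and is followed by an "In this paper" opener.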


In [3]:
# remove if necessary 
# session.query(Background).all()
import snorkel.models.candidate as candidate

def del_defined_candidate_class(class_name):
    print("Existing", candidate.candidate_subclasses)
    del(candidate.candidate_subclasses[class_name])
    print("After deletion", candidate.candidate_subclasses)
    
## Usage:    
# del_defined_candidate_class('Purpose')


The next few cells implement our ``Span Matching`` phase: we define a `Span` matcher for each of the 5 segments and store the extracted candidates centrally in a hashmap indexed by document ID. This lets us group the 5 segments back into documents.

Following the ``Span Matching`` phase comes a ``Document Aggregation`` phase, in which adjacency relationships and/or waterfall models are enforced through multiple criteria. For example, the following criteria prevent non-abstracts (e.g. proceedings cover letters, tutorial introductions) or poorly structured abstracts from being included:

1. Only documents that have all 5 segments are considered valid.
2. Only documents that have at least 4 segments are considered valid.
3. Only documents that have at least 3 segments are considered valid.
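
The segment-count criteria can be sketched as a filter over a map shaped like `document_breakdown_map` (document name to a dict of segment name to candidate list). The function name `valid_documents` and the toy document names below are hypothetical; real entries would hold Snorkel candidates rather than strings.

```python
SEGMENTS = ("Background", "Purpose", "Mechanism", "Method", "Finding")

def valid_documents(breakdown_map, min_segments):
    """Return the names of documents with at least `min_segments` non-empty segments."""
    valid = []
    for doc_name, segments in breakdown_map.items():
        n_present = sum(1 for name in SEGMENTS if segments.get(name))
        if n_present >= min_segments:
            valid.append(doc_name)
    return sorted(valid)

# Toy map with placeholder candidates (strings stand in for Span candidates).
toy_map = {
    "doc_a": {name: ["span"] for name in SEGMENTS},                         # all 5 segments
    "doc_b": {"Purpose": ["span"], "Mechanism": ["span"],
              "Method": ["span"], "Finding": ["span"]},                     # 4 segments
    "doc_c": {"Purpose": ["span"], "Finding": ["span"]},                    # 2 segments
}

for threshold in (5, 4, 3):
    print(threshold, valid_documents(toy_map, threshold))
```

Lowering the threshold trades precision for coverage: more abstracts are kept, but fewer of them are guaranteed to follow the full five-part structure.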

In [4]:
# Compound Matcher for Background: 
Background = candidate_subclass('Background', ['background_cue'])
non_comma_matcher=DictionaryMatch(d=[','],longest_match_only=True,reverse=True)  
transition_word=DictionaryMatch(d=['while','unlike','despite'],longest_match_only=True) 
transition_prev_work=DictionaryMatch(d=['previous','earlier','past'],longest_match_only=True) 
dict_background_matcher=DictionaryMatch(d=['previous work','traditionally','researchers'],longest_match_only=True) 
excluded_dict_background_matcher=DictionaryMatch(d=['we','unlike','our'],longest_match_only=True,reverse=True) 
non_comma_dict_background_matcher=CandidateExtractor(Background, [ngrams], [Intersection(non_comma_matcher,Union(dict_background_matcher,Intersection(transition_word,transition_prev_work)),excluded_dict_background_matcher)])
background_cands=extract_and_display(non_comma_dict_background_matcher,Background,"Background",document_breakdown_map)


Clearing existing...
Running UDF...

CPU times: user 1min 1s, sys: 1.22 s, total: 1min 2s
Wall time: 1min 7s


**Split 0 - number of candidates extracted: 135**



Clearing existing...
Running UDF...

CPU times: user 8.44 s, sys: 407 ms, total: 8.85 s
Wall time: 9.19 s


**Split 1 - number of candidates extracted: 15**



Clearing existing...
Running UDF...

CPU times: user 8.94 s, sys: 474 ms, total: 9.41 s
Wall time: 9.89 s


**Split 2 - number of candidates extracted: 13**



**0/135 Candidate/Span:**	`Background(Span("b'the use of inflectional morphology and speech-perception abilities in children with SLI traditionally have employed synthetic speech stimuli.'", sentence=9740, chars=[82,222], words=[9,29]))`

**Its parent Sentence's text:**	With Specific Language Impairments Studies investigating the relationship between the use of inflectional morphology and speech-perception abilities in children with SLI traditionally have employed synthetic speech stimuli.

**Its parent Document's text:**	{'_sa_instance_state': <sqlalchemy.orm.state.InstanceState object at 0x10ce20b70>, 'name': '5834868625ff05a97b012283', 'stable_id': '5834868625ff05a97b012283::document:0:0', 'meta': {'file_name': '70kpaper_061418_cleaned_noBookLecture.tsv'}, 'id': 1151, 'type': 'document', 'sentences': [Sentence(Document 5834868625ff05a97b012283,0,b'http://jslhr.'), Sentence(Document 5834868625ff05a97b012283,1,b'pubs.'), Sentence(Document 5834868625ff05a97b012283,2,b'asha.'), Sentence(Document 5834868625ff05a97b012283,3,b'org/article.'), Sentence(Document 5834868625ff05a97b012283,4,b'aspx?'), Sentence(Document 5834868625ff05a97b012283,5,b'articleid='), Sentence(Document 5834868625ff05a97b012283,6,b'1780928 Grammatical Morphology and Perception of Synthetic and Natural Speech in Children'), Sentence(Document 5834868625ff05a97b012283,7,b'With Specific Language Impairments Studies investigating the relationship between the use of inflectional morphology and speech-perception abilities in children with SLI traditionally have employed synthetic speech stimuli.'), Sentence(Document 5834868625ff05a97b012283,8,b'The purpose of this study was to replicate the findings reported in Leonard, McGregor, and Allen (1992) with an older group of children with SLI and to...\n')]}




**1/135 Candidate/Span:**	`Background(Span("b'The purpose of this workshop is to bring together researchers interested in the construction and analysis of'", sentence=19011, chars=[0,107], words=[0,16]))`

**Its parent Sentence's text:**	The purpose of this workshop is to bring together researchers interested in the construction and analysis of Web Scale multimedia datasets and resources.

**Its parent Document's text:**	{'_sa_instance_state': <sqlalchemy.orm.state.InstanceState object at 0x1190bd1d0>, 'name': '5834868825ff05a97b017964', 'stable_id': '5834868825ff05a97b017964::document:0:0', 'meta': {'file_name': '70kpaper_061418_cleaned_noBookLecture.tsv'}, 'id': 3825, 'type': 'document', 'sentences': [Sentence(Document 5834868825ff05a97b017964,0,b'The purpose of this workshop is to bring together researchers interested in the construction and analysis of Web Scale multimedia datasets and resources.'), Sentence(Document 5834868825ff05a97b017964,1,b'The Workshop will provide a forum to consolidate key factors related to research on very large scale multimedia dataset such as the construction of dataset, creation of ground truth, sharing and extension of such resources in terms of ground truth, features, algorithms and tools etc.'), Sentence(Document 5834868825ff05a97b017964,2,b'The Workshop will discuss and formulate action plan towards these goals.\n')]}




**2/135 Candidate/Span:**	`Background(Span("b'and other early-stage researchers.'", sentence=22031, chars=[119,152], words=[18,24]))`

**Its parent Sentence's text:**	Figure 1 shows the trainees, which included engineering graduate students and postdoctoral fellows, medical residents, and other early-stage researchers.

**Its parent Document's text:**	{'_sa_instance_state': <sqlalchemy.orm.state.InstanceState object at 0x11923a320>, 'name': '59367c4fc50f90cdd3acca1f', 'stable_id': '59367c4fc50f90cdd3acca1f::document:0:0', 'meta': {'file_name': '70kpaper_061418_cleaned_noBookLecture.tsv'}, 'id': 4691, 'type': 'document', 'sentences': [Sentence(Document 59367c4fc50f90cdd3acca1f,0,b'Surgical Robotics was held in Pittsburgh, Pennsylvania, 21\xe2\x80\x9325 July 2014.'), Sentence(Document 59367c4fc50f90cdd3acca1f,1,b'The school was held at the Robotics Institute of Carnegie Mellon University and at Allegheny General Hospital (part of Allegheny Health Network) and drew 62 trainees from six countries.'), Sentence(Document 59367c4fc50f90cdd3acca1f,2,b'Figure 1 shows the trainees, which included engineering graduate students and postdoctoral fellows, medical residents, and other early-stage researchers.'), Sentence(Document 59367c4fc50f90cdd3acca1f,3,b'An international group of leading engineering and surgical faculty members was recruited to serve as instructors.\n')]}




**3/135 Candidate/Span:**	`Background(Span("b'Previous work by the authors [1] demonstrated that logistic regression can be a fast and accurate data mining tool for life sciences datasets'", sentence=12297, chars=[0,140], words=[0,24]))`

**Its parent Sentence's text:**	Previous work by the authors [1] demonstrated that logistic regression can be a fast and accurate data mining tool for life sciences datasets, competitive with modern tools like support vector machines and balltree based K-NN.

**Its parent Document's text:**	{'_sa_instance_state': <sqlalchemy.orm.state.InstanceState object at 0x118034c18>, 'name': '5834868625ff05a97b01355f', 'stable_id': '5834868625ff05a97b01355f::document:0:0', 'meta': {'file_name': '70kpaper_061418_cleaned_noBookLecture.tsv'}, 'id': 1896, 'type': 'document', 'sentences': [Sentence(Document 5834868625ff05a97b01355f,0,b'Previous work by the authors [1] demonstrated that logistic regression can be a fast and accurate data mining tool for life sciences datasets, competitive with modern tools like support vector machines and balltree based K-NN.'), Sentence(Document 5834868625ff05a97b01355f,1,b'This paper has two objectives.'), Sentence(Document 5834868625ff05a97b01355f,2,b'The first objective is a serious empirical comparison of logistic regression to several classical and modern learners on a variety of learning tasks.'), Sentence(Document 5834868625ff05a97b01355f,3,b'The second is to describe our use of conjugate gradient inside an iteratively re-weighted least squares fitting procedure.\n')]}
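
The compound `Background` matcher can be read as a boolean predicate over a candidate span. The sketch below is a plain-Python model of that predicate, not Snorkel code. It assumes case-insensitive substring tests, which is inferred from the extracted spans above (for example, spans stop short of "between" and "Web", both of which contain "we"); the helper names are hypothetical.

```python
def contains_any(text, phrases):
    """Case-insensitive substring test against a list of phrases."""
    lowered = text.lower()
    return any(phrase in lowered for phrase in phrases)

def is_background_span(span_text):
    """Model of Intersection(non_comma, Union(cue, Intersection(transition, prev_work)), excluded)."""
    no_comma = "," not in span_text
    has_cue = contains_any(span_text, ("previous work", "traditionally", "researchers"))
    has_contrast = (contains_any(span_text, ("while", "unlike", "despite"))
                    and contains_any(span_text, ("previous", "earlier", "past")))
    not_excluded = not contains_any(span_text, ("we", "unlike", "our"))
    return no_comma and (has_cue or has_contrast) and not_excluded

print(is_background_span("and other early-stage researchers."))
print(is_background_span("the relationship between the use of morphology traditionally"))
```

Under this reading, the second span fails only because "between" contains the excluded substring "we", and a span containing "unlike" can never pass, since "unlike" appears in both the transition list and the exclusion list. Both effects may be worth revisiting when refining the matcher.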




In [5]:
# Compound Matcher for Purpose:
Purpose=candidate_subclass('Purpose',['purpose_cue'])

transition_regex_matcher=RegexMatchSpan(rgx=r"((^|\s)however.*$)|((^|\s)but(?!(also))*$)",longest_match_only=True)  # Correction: purpose; raw string keeps \s intact
excluded_dict_purpose_matcher=SentenceMatch(d=['but also','but without','but sometimes'],longest_match_only=True,reverse=True)  # the parent sentence shall not include "but also"
transition_matcher=Intersection(transition_regex_matcher,excluded_dict_purpose_matcher)

comparative_degree_matcher=Intersection(RegexMatchSpan(rgx="(.*more.*than.*$)|(.*er than.*$)",longest_match_only=True),transition_prev_work)  # Correction: purpose 
other_regex_matcher=RegexMatchSpan(rgx="(.*extend.*$)|(.*offer.*$)",longest_match_only=True)

dict_purpose_matcher=DictionaryMatch(d=['in this paper','in the paper',' that can ','in this study','to examine','we examine','to investigate','implications'],longest_match_only=True) 

# Unit test
# non_comma_dict_purpose_matcher=CandidateExtractor(Purpose, [ngrams], [Intersection(non_comma_matcher,other_regex_matcher)])

non_comma_dict_purpose_matcher=CandidateExtractor(Purpose, [ngrams], [Intersection(non_comma_matcher,Union(comparative_degree_matcher,other_regex_matcher,dict_purpose_matcher,transition_matcher))]) #,intersection(excluded_dict_purpose_matcher,transition_regex_matcher)])
purpose_cands=extract_and_display(non_comma_dict_purpose_matcher,Purpose,"Purpose",document_breakdown_map)

Clearing existing...
Running UDF...

CPU times: user 1min 19s, sys: 836 ms, total: 1min 20s
Wall time: 1min 22s


**Split 0 - number of candidates extracted: 1341**



Clearing existing...
Running UDF...

CPU times: user 10.2 s, sys: 298 ms, total: 10.5 s
Wall time: 10.3 s


**Split 1 - number of candidates extracted: 203**



Clearing existing...
Running UDF...

CPU times: user 9.78 s, sys: 286 ms, total: 10.1 s
Wall time: 9.88 s


**Split 2 - number of candidates extracted: 160**



**0/1341 Candidate/Span:**	`Purpose(Span("b'In this paper'", sentence=11122, chars=[0,12], words=[0,2]))`

**Its parent Sentence's text:**	In this paper, we demonstrate the successful visual estimation of a matrix's rank by treating it as a classification problem.

**Its parent Document's text:**	{'_sa_instance_state': <sqlalchemy.orm.state.InstanceState object at 0x10dc5d828>, 'name': '5834868625ff05a97b013195', 'stable_id': '5834868625ff05a97b013195::document:0:0', 'meta': {'file_name': '70kpaper_061418_cleaned_noBookLecture.tsv'}, 'id': 1554, 'type': 'document', 'sentences': [Sentence(Document 5834868625ff05a97b013195,0,b'Abstract\xe2\x80\x94'), Sentence(Document 5834868625ff05a97b013195,1,b'The visual estimation of the rank of a matrix has eluded researchers across a myriad of disciplines many years.'), Sentence(Document 5834868625ff05a97b013195,2,b"In this paper, we demonstrate the successful visual estimation of a matrix's rank by treating it as a classification problem."), Sentence(Document 5834868625ff05a97b013195,3,b'When tested on a dataset of tens-of-thousands of colormapped matrices of varying ranks, we not only achieve state-of-the-art performance, but also distressingly high performance on an absolute basis.\n')]}




**1/1341 Candidate/Span:**	`Purpose(Span("b'but'", sentence=23444, chars=[301,303], words=[53,53]))`

**Its parent Sentence's text:**	As a technologist, working in the field of storage systems, and employed by a company that develops and sells a wide range of computer systems and components, I've learned that the biggest impediments to technology transfer---including better OS support for specific applications---are not technical, but economic.

**Its parent Document's text:**	{'_sa_instance_state': <sqlalchemy.orm.state.InstanceState object at 0x1192ed128>, 'name': '5938b81ac50f90cdd3add894', 'stable_id': '5938b81ac50f90cdd3add894::document:0:0', 'meta': {'file_name': '70kpaper_061418_cleaned_noBookLecture.tsv'}, 'id': 5096, 'type': 'document', 'sentences': [Sentence(Document 5938b81ac50f90cdd3add894,0,b"As a technologist, working in the field of storage systems, and employed by a company that develops and sells a wide range of computer systems and components, I've learned that the biggest impediments to technology transfer---including better OS support for specific applications---are not technical, but economic."), Sentence(Document 5938b81ac50f90cdd3add894,1,b'This paper summarizes some of these issues, and considers how best we might respond to them.\n')]}




**2/1341 Candidate/Span:**	`Purpose(Span("b'The implications for composition'", sentence=19794, chars=[0,31], words=[0,3]))`

**Its parent Sentence's text:**	The implications for composition, performance, and future research are discussed.


**Its parent Document's text:**	{'_sa_instance_state': <sqlalchemy.orm.state.InstanceState object at 0x11911e908>, 'name': '58cc9dcec50f90cdd3881f53', 'stable_id': '58cc9dcec50f90cdd3881f53::document:0:0', 'meta': {'file_name': '70kpaper_061418_cleaned_noBookLecture.tsv'}, 'id': 4052, 'type': 'document', 'sentences': [Sentence(Document 58cc9dcec50f90cdd3881f53,0,b'Traditional notation has led to the continuation of a traditional music approach in which scores are static descriptions to be realized by performers.'), Sentence(Document 58cc9dcec50f90cdd3881f53,1,b'Borrowing programming concepts from computer science leads to score-like descriptions with the capability of expressing dynamic processes and interaction with performers.'), Sentence(Document 58cc9dcec50f90cdd3881f53,2,b'The implications for composition, performance, and future research are discussed.\n')]}




**3/1341 Candidate/Span:**	`Purpose(Span("b'In this paper'", sentence=21553, chars=[0,12], words=[0,2]))`

**Its parent Sentence's text:**	In this paper, we study the group anomaly detection problem.

**Its parent Document's text:**	{'_sa_instance_state': <sqlalchemy.orm.state.InstanceState object at 0x1191fc278>, 'name': '58f78f35c50f90cdd398bede', 'stable_id': '58f78f35c50f90cdd398bede::document:0:0', 'meta': {'file_name': '70kpaper_061418_cleaned_noBookLecture.tsv'}, 'id': 4546, 'type': 'document', 'sentences': [Sentence(Document 58f78f35c50f90cdd398bede,0,b'An important task in exploring and analyzing real-world data sets is to detect unusual and interesting phenomena.'), Sentence(Document 58f78f35c50f90cdd398bede,1,b'In this paper, we study the group anomaly detection problem.'), Sentence(Document 58f78f35c50f90cdd398bede,2,b'Unlike traditional anomaly detection research that focuses on data points, our goal is to discover anomalous aggregated behaviors of groups of points.'), Sentence(Document 58f78f35c50f90cdd398bede,3,b'For this purpose, we propose the Flexible Genre Model (FGM).'), Sentence(Document 58f78f35c50f90cdd398bede,4,b'FGM is designed to characterize data groups at both the point level and the group level so as to detect various types of group anomalies.\n')]}
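
The transition part of the `Purpose` matcher pairs a "however"/"but" cue with a sentence-level exclusion list. A simplified standalone sketch of that idea is shown below. It uses a reduced regex rather than the one in the cell above, and the function name `has_purpose_transition` is hypothetical.

```python
import re

# Simplified cue pattern: "however" or "but" as a whole word.
TRANSITION = re.compile(r"(^|\s)(however|but)\b", re.IGNORECASE)

# Sentences containing any of these phrases are rejected outright.
EXCLUDED_PHRASES = ("but also", "but without", "but sometimes")

def has_purpose_transition(sentence):
    """True if the sentence has a transition cue and none of the excluded phrases."""
    lowered = sentence.lower()
    if any(phrase in lowered for phrase in EXCLUDED_PHRASES):
        return False
    return TRANSITION.search(sentence) is not None

examples = [
    "The biggest impediments to technology transfer are not technical, but economic.",
    "We not only achieve state-of-the-art performance, but also distressingly high performance on an absolute basis.",
]
for sentence in examples:
    print(has_purpose_transition(sentence), "-", sentence)
```

The first sentence mirrors candidate 1 above, which shows that a bare "but" cue can fire on sentences that do not state a purpose. The exclusion list removes only the "but also" style of false positive.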




In [6]:
# Compound Matcher for Mechanism: 
Mechanism = candidate_subclass('Mechanism', ['mechanism_cue']) 
dict_mechanism_matcher=DictionaryMatch(d=['introduce','introduces','propose','proposes','we propose','we develop','approach'],longest_match_only=True) 
non_comma_dict_mechanism_matcher=CandidateExtractor(Mechanism, [ngrams], [Intersection(non_comma_matcher,dict_mechanism_matcher)])
mechanism_cands=extract_and_display(non_comma_dict_mechanism_matcher,Mechanism,"Mechanism",document_breakdown_map)

Clearing existing...
Running UDF...

CPU times: user 44.5 s, sys: 514 ms, total: 45 s
Wall time: 45.2 s


**Split 0 - number of candidates extracted: 1247**



Clearing existing...
Running UDF...

CPU times: user 5.77 s, sys: 257 ms, total: 6.03 s
Wall time: 5.86 s


**Split 1 - number of candidates extracted: 169**



Clearing existing...
Running UDF...

CPU times: user 6.25 s, sys: 277 ms, total: 6.53 s
Wall time: 6.36 s


**Split 2 - number of candidates extracted: 157**



**0/1247 Candidate/Span:**	`Mechanism(Span("b'Results show that badges alleviated the negative impact on help seeking introduced by up-and downvoting.\n'", sentence=10396, chars=[0,104], words=[0,18]))`

**Its parent Sentence's text:**	Results show that badges alleviated the negative impact on help seeking introduced by up-and downvoting.


**Its parent Document's text:**	{'_sa_instance_state': <sqlalchemy.orm.state.InstanceState object at 0x10dc43c18>, 'name': '5834868525ff05a97b00f679', 'stable_id': '5834868525ff05a97b00f679::document:0:0', 'meta': {'file_name': '70kpaper_061418_cleaned_noBookLecture.tsv'}, 'id': 1344, 'type': 'document', 'sentences': [Sentence(Document 5834868525ff05a97b00f679,0,b'Through the lens of Expectancy Value Theory, we examine the effect of help giver badges, information about helper expertise, and up-and downvoting on help seeking in a MOOC discussion forum.'), Sentence(Document 5834868525ff05a97b00f679,1,b'Results show that badges alleviated the negative impact on help seeking introduced by up-and downvoting.\n')]}




**1/1247 Candidate/Span:**	`Mechanism(Span("b'we propose functional causes as a refined definition of causality with several desirable properties.\n'", sentence=18222, chars=[79,179], words=[10,25]))`

**Its parent Sentence's text:**	After identifying some undesirable characteristics of the original definition, we propose functional causes as a refined definition of causality with several desirable properties.


**Its parent Document's text:**	{'_sa_instance_state': <sqlalchemy.orm.state.InstanceState object at 0x11905f128>, 'name': '5834868825ff05a97b01737f', 'stable_id': '5834868825ff05a97b01737f::document:0:0', 'meta': {'file_name': '70kpaper_061418_cleaned_noBookLecture.tsv'}, 'id': 3608, 'type': 'document', 'sentences': [Sentence(Document 5834868825ff05a97b01737f,0,b'In this paper, we propose causality as a unified framework to explain query answers and non-answers, thus generalizing and extending several previously proposed approaches of provenance and missing query result explanations.'), Sentence(Document 5834868825ff05a97b01737f,1,b'We develop our framework starting from the well-studied definition of actual causes by Halpern and Pearl.'), Sentence(Document 5834868825ff05a97b01737f,2,b'After identifying some undesirable characteristics of the original definition, we propose functional causes as a refined definition of causality with several desirable properties.\n')]}




**2/1247 Candidate/Span:**	`Mechanism(Span("b'we developed the \xe2\x80\x9cExploration and Discovery Information Evaluation\xe2\x80\x9d(EDIE) framework to assist with the development of the Rosetta system and its components.'", sentence=9862, chars=[86,241], words=[15,38]))`

**Its parent Sentence's text:**	As part of our participation in the development of evaluation tools for GALE systems, we developed the “Exploration and Discovery Information Evaluation”(EDIE) framework to assist with the development of the Rosetta system and its components.

**Its parent Document's text:**	{'_sa_instance_state': <sqlalchemy.orm.state.InstanceState object at 0x10c36c2e8>, 'name': '5834868525ff05a97b0102e7', 'stable_id': '5834868525ff05a97b0102e7::document:0:0', 'meta': {'file_name': '70kpaper_061418_cleaned_noBookLecture.tsv'}, 'id': 1186, 'type': 'document', 'sentences': [Sentence(Document 5834868525ff05a97b0102e7,0,b'As part of our participation in the development of evaluation tools for GALE systems, we developed the \xe2\x80\x9cExploration and Discovery Information Evaluation\xe2\x80\x9d(EDIE) framework to assist with the development of the Rosetta system and its components.'), Sentence(Document 5834868525ff05a97b0102e7,1,b'Along with the series of experiments testing the Rosetta system, we also developed associated evaluation me-thodologies and metrics, which make EDIE a complete framework for user-centered utility evaluation.\n')]}




**3/1247 Candidate/Span:**	`Mechanism(Span("b'Previous approaches either compute optical flow or temporal differences on video frame pairs with various assumptions about stabilization.'", sentence=7080, chars=[0,137], words=[0,18]))`

**Its parent Sentence's text:**	Previous approaches either compute optical flow or temporal differences on video frame pairs with various assumptions about stabilization.

**Its parent Document's text:**	{'_sa_instance_state': <sqlalchemy.orm.state.InstanceState object at 0x10cd5a780>, 'name': '5834868425ff05a97b00c3b8', 'stable_id': '5834868425ff05a97b00c3b8::document:0:0', 'meta': {'file_name': '70kpaper_061418_cleaned_noBookLecture.tsv'}, 'id': 425, 'type': 'document', 'sentences': [Sentence(Document 5834868425ff05a97b00c3b8,0,b'We describe novel but simple motion features for the problem of detecting objects in video sequences.'), Sentence(Document 5834868425ff05a97b00c3b8,1,b'Previous approaches either compute optical flow or temporal differences on video frame pairs with various assumptions about stabilization.'), Sentence(Document 5834868425ff05a97b00c3b8,2,b'We describe a combined approach that uses coarse-scale flow and fine-scale temporal difference features.'), Sentence(Document 5834868425ff05a97b00c3b8,3,b'Our approach performs weak motion stabilization by factoring out camera motion and coarse object motion while preserving nonrigid motions that serve as useful cues for recognition.\n')]}




In [7]:
# TODO: Compound Matcher for Method: 
Method = candidate_subclass('Method', ['method_cue'])

dict_method_matcher=DictionaryMatch(d=['dataset','benchmark','experiment ','experiments',"empirical","participant","survey"," conduct"," analyze"],longest_match_only=True) 

non_comma_dict_method_matcher=CandidateExtractor(Method, [ngrams], [Intersection(non_comma_matcher,dict_method_matcher)])
method_cands=extract_and_display(non_comma_dict_method_matcher,Method,"Method",document_breakdown_map)


Clearing existing...
Running UDF...

CPU times: user 40.9 s, sys: 380 ms, total: 41.3 s
Wall time: 41.1 s


**Split 0 - number of candidates extracted: 668**



Clearing existing...
Running UDF...

CPU times: user 5.57 s, sys: 232 ms, total: 5.81 s
Wall time: 5.65 s


**Split 1 - number of candidates extracted: 81**



Clearing existing...
Running UDF...

CPU times: user 6.01 s, sys: 254 ms, total: 6.26 s
Wall time: 6.08 s


**Split 2 - number of candidates extracted: 84**



**0/668 Candidate/Span:**	`Method(Span("b'Our experiments on the Reuters-21578 benchmark show that boosting is not effective in improving the performance of the base classifiers on common categories.\n'", sentence=9574, chars=[0,157], words=[0,24]))`

**Its parent Sentence's text:**	Our experiments on the Reuters-21578 benchmark show that boosting is not effective in improving the performance of the base classifiers on common categories.


**Its parent Document's text:**	{'_sa_instance_state': <sqlalchemy.orm.state.InstanceState object at 0x10ca6f780>, 'name': '5834868625ff05a97b011069', 'stable_id': '5834868625ff05a97b011069::document:0:0', 'meta': {'file_name': '70kpaper_061418_cleaned_noBookLecture.tsv'}, 'id': 1121, 'type': 'document', 'sentences': [Sentence(Document 5834868625ff05a97b011069,0,b'This paper studies the effects of boosting in the context of different classification methods for text categorization, including Decision Trees, Naive Bayes, Support Vector Machines (SVMs) and a Rocchio-style classifier.'), Sentence(Document 5834868625ff05a97b011069,1,b'We identify the inductive biases of each classifier and explore how boosting, as an error-driven resampling mechanism, reacts to those biases.'), Sentence(Document 5834868625ff05a97b011069,2,b'Our experiments on the Reuters-21578 benchmark show that boosting is not effective in improving the performance of the base classifiers on common categories.\n')]}




**1/668 Candidate/Span:**	`Method(Span("b'we can test a wide variety of control strategies over the course of a single experiment and can customize the assistance to the user.\n'", sentence=23851, chars=[20,153], words=[4,29]))`

**Its parent Sentence's text:**	With this hardware, we can test a wide variety of control strategies over the course of a single experiment and can customize the assistance to the user.


**Its parent Document's text:**	{'_sa_instance_state': <sqlalchemy.orm.state.InstanceState object at 0x1193231d0>, 'name': '5947fc92c50f90cdd3b48c11', 'stable_id': '5947fc92c50f90cdd3b48c11::document:0:0', 'meta': {'file_name': '70kpaper_061418_cleaned_noBookLecture.tsv'}, 'id': 5217, 'type': 'document', 'sentences': [Sentence(Document 5947fc92c50f90cdd3b48c11,0,b'While exoskeleton assistance strategies have been explored for decades, devices that can reduce the metabolic cost of walking are relatively new [1].'), Sentence(Document 5947fc92c50f90cdd3b48c11,1,b'Many of the current strategies use autonomous hardware and controllers specific to the hardware rather than the subject.'), Sentence(Document 5947fc92c50f90cdd3b48c11,2,b'Our lab uses an emulator system with lightweight end-effectors [2] and off-board motors and control hardware [3].'), Sentence(Document 5947fc92c50f90cdd3b48c11,3,b'With this hardware, we can test a wide variety of control strategies over the course of a single experiment and can customize the assistance to the user.\n')]}




**2/668 Candidate/Span:**	`Method(Span("b'near to the state of the art on both benchmarks.\n'", sentence=23564, chars=[145,193], words=[22,33]))`

**Its parent Sentence's text:**	Our algorithm is faster than traditional phrasestructure parsing and achieves 90.4% English parsing accuracy and 82.4% Chinese parsing accuracy, near to the state of the art on both benchmarks.


**Its parent Document's text:**	{'_sa_instance_state': <sqlalchemy.orm.state.InstanceState object at 0x1192f8860>, 'name': '5938cd22c50f90cdd3adde8a', 'stable_id': '5938cd22c50f90cdd3adde8a::document:0:0', 'meta': {'file_name': '70kpaper_061418_cleaned_noBookLecture.tsv'}, 'id': 5131, 'type': 'document', 'sentences': [Sentence(Document 5938cd22c50f90cdd3adde8a,0,b'We present a new algorithm for transforming dependency parse trees into phrase-structure parse trees.'), Sentence(Document 5938cd22c50f90cdd3adde8a,1,b'We cast the problem as structured prediction and learn a statistical model.'), Sentence(Document 5938cd22c50f90cdd3adde8a,2,b'Our algorithm is faster than traditional phrasestructure parsing and achieves 90.4% English parsing accuracy and 82.4% Chinese parsing accuracy, near to the state of the art on both benchmarks.\n')]}




**3/668 Candidate/Span:**	`Method(Span("b'Experiments consider variants of the approach and illustrate its strengths and weaknesses.\n'", sentence=23507, chars=[0,90], words=[0,13]))`

**Its parent Sentence's text:**	Experiments consider variants of the approach and illustrate its strengths and weaknesses.


**Its parent Document's text:**	{'_sa_instance_state': <sqlalchemy.orm.state.InstanceState object at 0x1192edcf8>, 'name': '5938c845c50f90cdd3addd09', 'stable_id': '5938c845c50f90cdd3addd09::document:0:0', 'meta': {'file_name': '70kpaper_061418_cleaned_noBookLecture.tsv'}, 'id': 5114, 'type': 'document', 'sentences': [Sentence(Document 5938c845c50f90cdd3addd09,0,b'We show that discourse structure, as defined by Rhetorical Structure Theory and provided by an existing discourse parser, benefits text categorization.'), Sentence(Document 5938c845c50f90cdd3addd09,1,b'Our approach uses a recursive neural network and a newly proposed attention mechanism to compute a representation of the text that focuses on salient content, from the perspective of both RST and the task.'), Sentence(Document 5938c845c50f90cdd3addd09,2,b'Experiments consider variants of the approach and illustrate its strengths and weaknesses.\n')]}




In [8]:
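# TODO: Compound Matcher for Finding
Finding = candidate_subclass('Finding', ['finding_cue'])

# Cue words and phrases that tend to signal a reported finding.
finding_cues=['show that','shows that','found','indicate','results','performance','find']
dict_finding_matcher=DictionaryMatch(d=finding_cues,longest_match_only=True)

# Keep only spans accepted by both non_comma_matcher and the cue dictionary.
# Note: despite its name, non_comma_dict_finding_matcher is a CandidateExtractor.
non_comma_dict_finding_matcher=CandidateExtractor(Finding, [ngrams], [Intersection(non_comma_matcher,dict_finding_matcher)])
finding_cands=extract_and_display(non_comma_dict_finding_matcher,Finding,"Finding",document_breakdown_map)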

Clearing existing...
Running UDF...

CPU times: user 41.1 s, sys: 383 ms, total: 41.5 s
Wall time: 41.3 s


**Split 0 - number of candidates extracted: 1537**



Clearing existing...
Running UDF...

CPU times: user 5.9 s, sys: 258 ms, total: 6.16 s
Wall time: 5.98 s


**Split 1 - number of candidates extracted: 176**



Clearing existing...
Running UDF...

CPU times: user 6.41 s, sys: 291 ms, total: 6.7 s
Wall time: 6.5 s


**Split 2 - number of candidates extracted: 216**



**0/1537 Candidate/Span:**	`Finding(Span("b'Our experiments on the Reuters-21578 benchmark show that boosting is not effective in improving the performance of the base classifiers on common categories.\n'", sentence=9574, chars=[0,157], words=[0,24]))`

**Its parent Sentence's text:**	Our experiments on the Reuters-21578 benchmark show that boosting is not effective in improving the performance of the base classifiers on common categories.


**Its parent Document's text:**	{'_sa_instance_state': <sqlalchemy.orm.state.InstanceState object at 0x10ca6f780>, 'name': '5834868625ff05a97b011069', 'stable_id': '5834868625ff05a97b011069::document:0:0', 'meta': {'file_name': '70kpaper_061418_cleaned_noBookLecture.tsv'}, 'id': 1121, 'type': 'document', 'sentences': [Sentence(Document 5834868625ff05a97b011069,0,b'This paper studies the effects of boosting in the context of different classification methods for text categorization, including Decision Trees, Naive Bayes, Support Vector Machines (SVMs) and a Rocchio-style classifier.'), Sentence(Document 5834868625ff05a97b011069,1,b'We identify the inductive biases of each classifier and explore how boosting, as an error-driven resampling mechanism, reacts to those biases.'), Sentence(Document 5834868625ff05a97b011069,2,b'Our experiments on the Reuters-21578 benchmark show that boosting is not effective in improving the performance of the base classifiers on common categories.\n')]}




**1/1537 Candidate/Span:**	`Finding(Span("b'We find the minimax rate of convergence in Hausdorff distance for estimating a manifold M of dimension d embedded in RD given a noisy sample from the manifold.'", sentence=25003, chars=[0,158], words=[0,28]))`

**Its parent Sentence's text:**	We find the minimax rate of convergence in Hausdorff distance for estimating a manifold M of dimension d embedded in RD given a noisy sample from the manifold.

**Its parent Document's text:**	{'_sa_instance_state': <sqlalchemy.orm.state.InstanceState object at 0x1193abcf8>, 'name': '59c54d78c50f90cdd3fb9691', 'stable_id': '59c54d78c50f90cdd3fb9691::document:0:0', 'meta': {'file_name': '70kpaper_061418_cleaned_noBookLecture.tsv'}, 'id': 5546, 'type': 'document', 'sentences': [Sentence(Document 59c54d78c50f90cdd3fb9691,0,b'We find the minimax rate of convergence in Hausdorff distance for estimating a manifold M of dimension d embedded in RD given a noisy sample from the manifold.'), Sentence(Document 59c54d78c50f90cdd3fb9691,1,b'We assume that the manifold satisfies a smoothness condition and that the noise distribution has compact support.'), Sentence(Document 59c54d78c50f90cdd3fb9691,2,b'We show that the optimal rate of convergence is n\xe2\x88\x92 2/(2+ d).'), Sentence(Document 59c54d78c50f90cdd3fb9691,3,b'Thus, the minimax rate depends only on the dimension of the manifold, not on the dimension of the space in which M is embedded.\n')]}




**2/1537 Candidate/Span:**	`Finding(Span("b'the author has tried to show that it is possible to put a complete semester-long course in machine-held form.'", sentence=5668, chars=[12,120], words=[3,26]))`

**Its parent Sentence's text:**	Since 1989, the author has tried to show that it is possible to put a complete semester-long course in machine-held form.

**Its parent Document's text:**	{'_sa_instance_state': <sqlalchemy.orm.state.InstanceState object at 0x10c062198>, 'name': '5834868425ff05a97b00ce2b', 'stable_id': '5834868425ff05a97b00ce2b::document:0:0', 'meta': {'file_name': '70kpaper_061418_cleaned_noBookLecture.tsv'}, 'id': 32, 'type': 'document', 'sentences': [Sentence(Document 5834868425ff05a97b00ce2b,0,b'Since 1989, the author has tried to show that it is possible to put a complete semester-long course in machine-held form.'), Sentence(Document 5834868425ff05a97b00ce2b,1,b'Several examples have been carried out (in Mathematica), and the paper reports on the experience, on the problems encountered, and on some suggestions for future developments.\n')]}




**3/1537 Candidate/Span:**	`Finding(Span("b'Analysis on the test results about the SDH network and analysis shows that the combination protection is more efficient than both MSP protection and UPSR protection.\n'", sentence=6435, chars=[0,165], words=[0,27]))`

**Its parent Sentence's text:**	Analysis on the test results about the SDH network and analysis shows that the combination protection is more efficient than both MSP protection and UPSR protection.


**Its parent Document's text:**	{'_sa_instance_state': <sqlalchemy.orm.state.InstanceState object at 0x10c878a20>, 'name': '5834868525ff05a97b00edb5', 'stable_id': '5834868525ff05a97b00edb5::document:0:0', 'meta': {'file_name': '70kpaper_061418_cleaned_noBookLecture.tsv'}, 'id': 237, 'type': 'document', 'sentences': [Sentence(Document 5834868525ff05a97b00edb5,0,b'The SDH protection for Ethernet ring, bandwidth adjustment of LCAS and the transmission delay of virtual concatenation are analyzed.'), Sentence(Document 5834868525ff05a97b00edb5,1,b'Combination protection is suggested based on the advantages of the above two protection methods.'), Sentence(Document 5834868525ff05a97b00edb5,2,b'Analysis on the test results about the SDH network and analysis shows that the combination protection is more efficient than both MSP protection and UPSR protection.\n')]}




Next is the ``Document Aggregation`` phase. First, we show a few example documents to illustrate what ``document_breakdown_map`` looks like and how the extracted segments are aggregated per document.

In [30]:
# Show the first five documents: the extracted segments, then the full abstract.
n_docs=len(document_breakdown_map)
for ind,docid in enumerate(document_breakdown_map):
    segments=document_breakdown_map[docid]
    print("\n\nDocument "+str(ind)+"/"+str(n_docs))
    print("Extracted segments\n\n",docid,segments,"\n\n")
    # Any candidate leads back to its document: candidate -> Sentence -> Document.
    first_candidate=segments[list(segments.keys())[0]][0]
    print("Complete abstract\n\n",first_candidate.get_parent().get_parent().sentences)
    if ind>3:
        break



Document 0/2702
Extracted segments

 5834868625ff05a97b012283 {'Background': [Background(Span("b'the use of inflectional morphology and speech-perception abilities in children with SLI traditionally have employed synthetic speech stimuli.'", sentence=9740, chars=[82,222], words=[9,29]))], 'Finding': [Finding(Span("b'The purpose of this study was to replicate the findings reported in Leonard'", sentence=9741, chars=[0,74], words=[0,12]))]} 


Complete abstract

 [Sentence(Document 5834868625ff05a97b012283,0,b'http://jslhr.'), Sentence(Document 5834868625ff05a97b012283,1,b'pubs.'), Sentence(Document 5834868625ff05a97b012283,2,b'asha.'), Sentence(Document 5834868625ff05a97b012283,3,b'org/article.'), Sentence(Document 5834868625ff05a97b012283,4,b'aspx?'), Sentence(Document 5834868625ff05a97b012283,5,b'articleid='), Sentence(Document 5834868625ff05a97b012283,6,b'1780928 Grammatical Morphology and Perception of Synthetic and Natural Speech in Children'), Sentence(Document 5834868625ff05a97
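The nesting printed above (document id, then segment name, then a list of extracted candidates) can be illustrated with a minimal sketch in plain Python. It does not use Snorkel: the `extracted` records, the document ids, and the helper `aggregate_by_document` are hypothetical stand-ins, and plain strings take the place of candidate objects.

```python
from collections import defaultdict

# Hypothetical stand-ins for extracted candidates: (doc_id, segment_name, span_text).
extracted = [
    ("doc-a", "Background", "prior studies used synthetic speech"),
    ("doc-a", "Finding", "results show a clear improvement"),
    ("doc-b", "Method", "we conducted a survey"),
    ("doc-a", "Finding", "we found no effect of age"),
]

def aggregate_by_document(records):
    """Group span texts by document id, then by segment name."""
    breakdown = defaultdict(lambda: defaultdict(list))
    for doc_id, segment, span_text in records:
        breakdown[doc_id][segment].append(span_text)
    # Convert to plain dicts so the result prints like a regular nested dict.
    return {doc_id: dict(segments) for doc_id, segments in breakdown.items()}

breakdown_map = aggregate_by_document(extracted)
for doc_id, segments in breakdown_map.items():
    print(doc_id, segments)
```

A document appears in the map only if at least one segment was extracted from it, and a segment name appears under a document only if it has at least one span.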

Then we count the number of documents that contain exactly N segments, where N=1, 2, 3, 4, 5.

In [102]:
def get_document_text(doc):
    # Join the sentence texts of a Document into one string.
    return " ".join(str(sentence.text) for sentence in doc.sentences)

for n in [5,4,3,2,1]:
    print("Below are one or two document examples that contain exactly "+str(n)+" segments\n")
    showed_examples=0
    count=0
    for docid,segments in document_breakdown_map.items():
        if len(segments)==n:
            count+=1
            if showed_examples<2:
                showed_examples+=1
                # candidate -> Sentence -> Document
                first_candidate=segments[list(segments.keys())[0]][0]
                print(docid,": ", get_document_text(first_candidate.get_parent().get_parent()))
                print("Extracted segments\n\n",docid,segments,"\n\n")
    print("Total count is "+str(count)+"\n\n=================\n")
        

Below are one or two document examples that contain exactly 5 segments

5834868625ff05a97b013016 :  The ability to argue is essential in many aspects of life, but traditional face-to-face tutoring approaches do not scale up well. A solution for this dilemma may be computer-supported argumentation (CSA). The evaluation of CSA approaches in different domains has led to mixed results. To gain insights into the challenges and future prospects of CSA we conducted a survey among teachers, researchers, and system developers. Our investigation points to optimism regarding the potential success and importance of CSA.

Extracted segments

 5834868625ff05a97b013016 {'Background': [Background(Span("b'researchers'", sentence=14022, chars=[103,113], words=[18,18]))], 'Purpose': [Purpose(Span("b'but'", sentence=14019, chars=[59,61], words=[12,12]))], 'Mechanism': [Mechanism(Span("b'but traditional face-to-face tutoring approaches do not scale up well.'", sentence=14019, chars=[59,128], words=[12,26])
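The per-N counts can also be computed in a single pass. The following is a minimal sketch, assuming only that the map has the nested shape shown above (document id, then segment name, then a list of spans). The `toy_map` data and the helper `segment_count_histogram` are hypothetical and are not part of the notebook.

```python
from collections import Counter

def segment_count_histogram(breakdown_map):
    """Map N to the number of documents with exactly N distinct segment types."""
    return Counter(len(segments) for segments in breakdown_map.values())

# Hypothetical toy map with the same nesting as document_breakdown_map.
toy_map = {
    "doc-a": {"Background": ["..."], "Finding": ["..."]},
    "doc-b": {"Method": ["..."]},
    "doc-c": {"Purpose": ["..."], "Finding": ["..."]},
}

histogram = segment_count_histogram(toy_map)
for n in [5, 4, 3, 2, 1]:
    # A Counter returns 0 for values of N that never occur.
    print("Documents with exactly", n, "segments:", histogram[n])
```

Note that this counts distinct segment types per document, not the number of spans: a document with three `Finding` spans and nothing else still counts as N=1.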

Then we visualize the extracted Spans as highlighted text within the Document.

Color notation: <b style="color:orange;">Background</b> <b style="color:pink;">Purpose</b> <b style="color:green;">Mechanism</b> <b style="color:purple;">Method</b> <b style="color:blue;">Finding</b>

In [107]:
color_mapping={"Background":"orange","Purpose":"pink","Mechanism":"green","Method":"purple","Finding":"blue"}
docid='5834868625ff05a97b013016'
this_document=document_breakdown_map[docid][list(document_breakdown_map[docid].keys())[0]][0].get_parent().get_parent()
document_text=get_document_text(this_document)

from IPython.display import Markdown, display
def printmd(string):
    display(Markdown(string))

for segment in document_breakdown_map[docid]:
    for candidate in document_breakdown_map[docid][segment]:
        # Pull the Span out of the candidate; index 6 relies on the attribute order of the candidate object.
        this_span=candidate.__dict__[list(candidate.__dict__)[6]]
        span_sentence_text=this_span.sentence.text
        # char_end is inclusive, hence the +1.
        span_text=str(span_sentence_text[this_span.char_start:(this_span.char_end+1)]).strip()
        # str.replace colors every occurrence of the span text; a span that overlaps
        # an already colored one no longer matches and is left uncolored.
        document_text=document_text.replace(span_text,"<font color='"+color_mapping[segment]+"'>"+span_text+"</font>")

printmd(document_text)
    
    

The ability to argue is essential in many aspects of life, <font color='pink'>but</font> traditional face-to-face tutoring approaches do not scale up well. A solution for this dilemma may be computer-supported argumentation (CSA). <font color='blue'>The evaluation of CSA approaches in different domains has led to mixed results.</font> <font color='purple'>To gain insights into the challenges and future prospects of CSA we conducted a survey among teachers</font>, <font color='orange'>researchers</font>, and system developers. Our investigation points to optimism regarding the potential success and importance of CSA.
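In the rendered text, the Mechanism span is not colored: it starts with the word "but", which had already been wrapped in a Purpose tag, so the later string replacement no longer found a match. One possible alternative is to highlight by character offsets and resolve overlaps explicitly. The sketch below is a plain-Python illustration under stated assumptions: the offsets are relative to the whole text (Snorkel span offsets are relative to the sentence), the end offset is inclusive, and the rule that the longer span wins is an arbitrary choice. The helpers `locate` and `highlight` are hypothetical.

```python
def locate(text, phrase, color):
    """Return (start, end, color) for the first occurrence of phrase; end is inclusive."""
    start = text.index(phrase)
    return (start, start + len(phrase) - 1, color)

def highlight(text, spans):
    """Wrap non-overlapping character ranges of text in colored <font> tags.

    When two spans overlap, the longer one is kept and the shorter one is dropped.
    """
    # Visit the longest spans first so that they win over shorter overlapping ones.
    by_length = sorted(spans, key=lambda s: s[1] - s[0], reverse=True)
    kept = []
    for start, end, color in by_length:
        if all(end < k_start or start > k_end for k_start, k_end, _ in kept):
            kept.append((start, end, color))
    # Rebuild the text from left to right, inserting tags around the kept spans.
    pieces, cursor = [], 0
    for start, end, color in sorted(kept):
        pieces.append(text[cursor:start])
        pieces.append("<font color='" + color + "'>" + text[start:end + 1] + "</font>")
        cursor = end + 1
    pieces.append(text[cursor:])
    return "".join(pieces)

text = "Tutoring is useful, but it does not scale."
spans = [
    locate(text, "but", "pink"),
    locate(text, "but it does not scale.", "green"),
    locate(text, "Tutoring", "orange"),
]
highlighted = highlight(text, spans)
print(highlighted)
```

Because tags are inserted by position instead of by string search, a repeated phrase elsewhere in the text is not colored by accident.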
