## Test Run 3: Refine Labelling Function

This notebook continues from Test Run 2, but addresses the following issues:

1. Load more papers than the toy examples used in Test Run 2.
2. Complete/refine labelling functions.
3. Load two evaluation datasets and calculate overlap: (1) 5 system/model/method papers + 5 empirical papers; (2) 20K labelled paper dataset.



In [1]:
from IPython.core.display import display, HTML
display(HTML("<style>.container { width:85% !important; }</style>"))
debug_mode=1 # if not, debug_mode=0

In [2]:
%load_ext autoreload
%autoreload 2
%matplotlib inline
import os

from snorkel import SnorkelSession
from snorkel.parser.spacy_parser import Spacy
from snorkel.parser import CorpusParser
from snorkel.models import Document, Sentence

session = SnorkelSession()
print("Documents:", session.query(Document).count())
print("Sentences:", session.query(Sentence).count())

sents = session.query(Sentence).all()
n_max_corpus=0
for sent in sents:
    n_max_corpus=max(n_max_corpus,len(sent.words))

print("The longest sentence has "+str(n_max_corpus)+" tokens.")

from snorkel.models import Document
from util import number_of_people

docs = session.query(Document).all()

train_sents = set()
dev_sents   = set()
test_sents  = set()

for i, doc in enumerate(docs):
    for s in doc.sentences:
        if i % 10 == 8 and "cscw18"!=doc.name[:6]:
            dev_sents.add(s)
#         elif i % 10 == 9:
        elif "cscw18"==doc.name[:6]:  # use the 10 CSCW'18 annotation-guideline examples as the test set, replacing the earlier 10% test split
            test_sents.add(s)
        elif "cscw18"!=doc.name[:6]:
            train_sents.add(s)
            
print(len(test_sents))


Documents: 5560
Sentences: 19540
The longest sentence has 157 tokens.
68


In [6]:
from snorkel.models import candidate_subclass
from snorkel.candidates import Ngrams, CandidateExtractor
from snorkel.matchers import *

ngrams = Ngrams(n_max=n_max_corpus) # we define the maximum n value as n_max_corpus
document_breakdown_map=dict() # maps doc_id into a dict of ["Background", "Purpose", "Mechanism", "Method", "Finding"]
from IPython.display import Markdown, display
def printmd(string):
    display(Markdown(string))
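def extract_and_display(matcher,candidate_class,candidate_class_name,document_breakdown_map=None,selected_split=0):  # extracts candidates and returns those of selected_split
    if document_breakdown_map is None:  # avoid a TypeError in the membership test below when no map is passed
        document_breakdown_map=dict()
    for (i, sents) in ([(0,train_sents), (1,dev_sents), (2,test_sents)] if selected_split==0 else ([(2,test_sents)] if selected_split==2 else [(1,dev_sents)])):  # 0 runs all three splits; 2 runs test only; anything else runs dev only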
        %time matcher.apply(sents, split=i)
        printmd("**Split "+str(i)+" - number of candidates extracted: "+str(session.query(candidate_class).filter(candidate_class.split == i).count())+"**\n\n")
    cands = session.query(candidate_class).filter(candidate_class.split == selected_split).all()
    for i in range(min(4,len(cands))): # to print all cands, range(len(cands))
        printmd("**"+str(i)+"/"+str(len(cands))+" Candidate/Span:**\t`"+str(cands[i])+"`")
        printmd("**Its parent Sentence's text:**\t"+str(cands[i].get_parent().text))
        printmd("**Its parent Document's text:**\t"+str(cands[i].get_parent().get_parent().__dict__))
        print() 
        
    for cand in cands:
        doc_name=cand.get_parent().get_parent().name
        if doc_name not in document_breakdown_map:
            document_breakdown_map[doc_name]=dict()
        if candidate_class_name not in document_breakdown_map[doc_name]:
            document_breakdown_map[doc_name][candidate_class_name]=[]
        document_breakdown_map[doc_name][candidate_class_name]+=[cand]
        
    return cands
    

We want to differentiate ``Background`` from ``Purpose``. Recall our ``Background`` definition from the Google Doc:

``Contains words that indicate prior work (e.g., “traditionally”, “researchers have…”), and then following sentence/span starts with some variant of “In this paper, we introduce…” (Exploring adjacency relationship through helper function in Snorkel)``
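
The adjacency idea in this definition can be sketched in plain Python, independent of Snorkel. The cue list, the regular expression, and the helper names (`has_background_cue`, `background_indices`) are assumptions made for illustration, and the toy abstract is invented; none of this is the matcher defined in the cells below.

```python
import re

# Hypothetical cue list, loosely based on the definition above.
BACKGROUND_CUES = ("traditionally", "previous work", "researchers have")
# Hypothetical pattern for "In this paper, we ..." style openers.
PURPOSE_OPENER = re.compile(r"^\s*in this (paper|study|work),? we\b", re.IGNORECASE)


def has_background_cue(sentence):
    """Return True if the sentence contains any prior-work cue phrase."""
    lowered = sentence.lower()
    return any(cue in lowered for cue in BACKGROUND_CUES)


def background_indices(sentences):
    """Indices of sentences with a background cue that are directly
    followed by a sentence starting with a purpose opener."""
    hits = []
    for i in range(len(sentences) - 1):
        if has_background_cue(sentences[i]) and PURPOSE_OPENER.match(sentences[i + 1]):
            hits.append(i)
    return hits


# Invented toy abstract, used only to exercise the helper.
abstract = [
    "Linguistic labels are commonly used in different applications.",
    "Researchers have argued that new labels must be empirically evaluated.",
    "In this paper, we explain the process of selecting labels.",
]
print(background_indices(abstract))
```

Only the second sentence qualifies here, because it carries a cue and its successor opens with "In this paper, we".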


In [7]:
# remove if necessary 
# session.query(Background).all()
import snorkel.models.candidate as candidate

def del_defined_candidate_class(class_name):
    print("Existing", candidate.candidate_subclasses)
    del(candidate.candidate_subclasses[class_name])
    print("After deletion", candidate.candidate_subclasses)
    
## Usage:    
# del_defined_candidate_class('Purpose')


The next few cells describe our ``Span Matching`` phase: we define the `Span` matchers for each of the 5 segments and store the extracted candidates centrally in a document-id-indexed hashmap. This helps us group the 5 segments back into documents.

Following this ``Span Matching`` phase, we will have a ``Document Aggregation`` phase, where adjacency relationships and/or waterfall models are enforced through multiple criteria. For example, the following criteria prevent non-abstracts (e.g. proceedings cover letters, tutorial introductions, etc.) or less well-structured abstracts from being included.

1. Only those documents that have all 5 segments are considered valid
2. Only those documents that have at least 4 segments are considered valid
3. Only those documents that have at least 3 segments are considered valid
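
The segment-count criteria above could be applied with a filter along the following lines. This is a minimal sketch: `valid_documents` and `toy_map` are hypothetical names, and placeholder strings stand in for Snorkel candidates.

```python
SEGMENT_NAMES = ("Background", "Purpose", "Mechanism", "Method", "Finding")


def valid_documents(breakdown_map, min_segments):
    """Keep documents whose number of distinct, non-empty segment types
    is at least min_segments."""
    return {
        doc_name: segments
        for doc_name, segments in breakdown_map.items()
        if sum(1 for name in SEGMENT_NAMES if segments.get(name)) >= min_segments
    }


# Toy map shaped like {doc_name: {segment_name: [candidates]}}.
toy_map = {
    "doc_a": {name: ["span"] for name in SEGMENT_NAMES},
    "doc_b": {"Purpose": ["span"], "Mechanism": ["span"], "Method": ["span"]},
    "doc_c": {"Purpose": ["span"]},
}

for threshold in (5, 4, 3):
    print(threshold, sorted(valid_documents(toy_map, threshold)))
```

Lowering the threshold trades precision for coverage: requiring all 5 segments keeps only `doc_a` in this toy map, while a threshold of 3 also admits `doc_b`.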

In [8]:
# Compound Matcher for Background: 
Background = candidate_subclass('Background', ['background_cue'])
non_comma_matcher=DictionaryMatch(d=[','],longest_match_only=True,reverse=True)  
transition_word=DictionaryMatch(d=['while','unlike','despite'],longest_match_only=True) 
transition_prev_work=DictionaryMatch(d=['previous','earlier','past'],longest_match_only=True) 
dict_background_matcher=DictionaryMatch(d=['previous work','traditionally','researchers'],longest_match_only=True) 
excluded_dict_background_matcher=DictionaryMatch(d=['we','unlike','our'],longest_match_only=True,reverse=True) 
non_comma_dict_background_matcher=CandidateExtractor(Background, [ngrams], [Intersection(non_comma_matcher,Union(dict_background_matcher,Intersection(transition_word,transition_prev_work)),excluded_dict_background_matcher)])
background_cands=extract_and_display(non_comma_dict_background_matcher,Background,"Background",document_breakdown_map)
print(document_breakdown_map)

Clearing existing...
Running UDF...

CPU times: user 1min 1s, sys: 959 ms, total: 1min 1s
Wall time: 1min 2s


**Split 0 - number of candidates extracted: 148**



Clearing existing...
Running UDF...

CPU times: user 8.08 s, sys: 411 ms, total: 8.49 s
Wall time: 8.58 s


**Split 1 - number of candidates extracted: 15**



Clearing existing...
Running UDF...

CPU times: user 472 ms, sys: 70.1 ms, total: 542 ms
Wall time: 521 ms


**Split 2 - number of candidates extracted: 0**



**0/148 Candidate/Span:**	`Background(Span("b'recognized the need for a visualization tool that would allow researchers to examine and evaluate specific word correspondences generated by a translation system.'", sentence=24132, chars=[83,244], words=[13,36]))`

**Its parent Sentence's text:**	While developing a suite of tools for statistical machine translation research, we recognized the need for a visualization tool that would allow researchers to examine and evaluate specific word correspondences generated by a translation system.

**Its parent Document's text:**	{'_sa_instance_state': <sqlalchemy.orm.state.InstanceState object at 0x11d9a19b0>, 'name': '59398333c50f90cdd3ae2761', 'stable_id': '59398333c50f90cdd3ae2761::document:0:0', 'meta': {'file_name': '70kpaper_061418_cleaned_noBookLecture.tsv'}, 'id': 5295, 'type': 'document', 'sentences': [Sentence(Document 59398333c50f90cdd3ae2761,0,b'While developing a suite of tools for statistical machine translation research, we recognized the need for a visualization tool that would allow researchers to examine and evaluate specific word correspondences generated by a translation system.'), Sentence(Document 59398333c50f90cdd3ae2761,1,b'We developed Cairo to fill this need.'), Sentence(Document 59398333c50f90cdd3ae2761,2,b'Cairo is a free, open-source, portable, user-friendly, GUI-driven program written in Java that provides a visual representation of word correspondences between bilingual pairs of sentences, as well as relevant translation model parameters.\n')]}




**1/148 Candidate/Span:**	`Background(Span("b'Researchers in psychometrics argue that before adding new labels to applications'", sentence=12535, chars=[0,79], words=[0,10]))`

**Its parent Sentence's text:**	Researchers in psychometrics argue that before adding new labels to applications, the labels must be empirically evaluated.

**Its parent Document's text:**	{'_sa_instance_state': <sqlalchemy.orm.state.InstanceState object at 0x112329978>, 'name': '5834868725ff05a97b014e76', 'stable_id': '5834868725ff05a97b014e76::document:0:0', 'meta': {'file_name': '70kpaper_061418_cleaned_noBookLecture.tsv'}, 'id': 1958, 'type': 'document', 'sentences': [Sentence(Document 5834868725ff05a97b014e76,0,b'Linguistic labels such as high, medium, and low are commonly used in different applications.'), Sentence(Document 5834868725ff05a97b014e76,1,b'Researchers in psychometrics argue that before adding new labels to applications, the labels must be empirically evaluated.'), Sentence(Document 5834868725ff05a97b014e76,2,b'In this paper, we explain the process of selecting labels for a security assessment application.'), Sentence(Document 5834868725ff05a97b014e76,3,b'We also show how we evaluate the labels empirically using a sample population from Amazon Mechanical Turk users.\n')]}




**2/148 Candidate/Span:**	`Background(Span("b'Previous work explored semantic concepts for content analysis to assist retrieval.'", sentence=12041, chars=[0,81], words=[0,11]))`

**Its parent Sentence's text:**	Previous work explored semantic concepts for content analysis to assist retrieval.

**Its parent Document's text:**	{'_sa_instance_state': <sqlalchemy.orm.state.InstanceState object at 0x1122f1c18>, 'name': '5834868625ff05a97b013640', 'stable_id': '5834868625ff05a97b013640::document:0:0', 'meta': {'file_name': '70kpaper_061418_cleaned_noBookLecture.tsv'}, 'id': 1818, 'type': 'document', 'sentences': [Sentence(Document 5834868625ff05a97b013640,0,b'Huge amount of videos on the Internet have rare textual information, which makes video retrieval challenging given a text query.'), Sentence(Document 5834868625ff05a97b013640,1,b'Previous work explored semantic concepts for content analysis to assist retrieval.'), Sentence(Document 5834868625ff05a97b013640,2,b"However, the human-defined concepts might fail to cover the data and there is a potential gap between these concepts and the semantics expected from user's query."), Sentence(Document 5834868625ff05a97b013640,3,b'Also, building a corpus is expensive and time-consuming.'), Sentence(Document 5834868625ff05a97b013640,4,b'To address these issues, we propose a semi-automatic framework to discover the semantic concepts.\n')]}




**3/148 Candidate/Span:**	`Background(Span("b'Because researchers embed successful ubicomp projects in rich real-world contexts that can touch many aspects of life'", sentence=10746, chars=[0,116], words=[0,18]))`

**Its parent Sentence's text:**	Because researchers embed successful ubicomp projects in rich real-world contexts that can touch many aspects of life, they've made multidisciplinary teams the norm rather than the exception.

**Its parent Document's text:**	{'_sa_instance_state': <sqlalchemy.orm.state.InstanceState object at 0x11c631eb8>, 'name': '5834868625ff05a97b011a7a', 'stable_id': '5834868625ff05a97b011a7a::document:0:0', 'meta': {'file_name': '70kpaper_061418_cleaned_noBookLecture.tsv'}, 'id': 1438, 'type': 'document', 'sentences': [Sentence(Document 5834868625ff05a97b011a7a,0,b"Because researchers embed successful ubicomp projects in rich real-world contexts that can touch many aspects of life, they've made multidisciplinary teams the norm rather than the exception."), Sentence(Document 5834868625ff05a97b011a7a,1,b'However, overcoming boundaries between various disciplines is a significant challenge and in many cases represents a key factor for successful development.'), Sentence(Document 5834868625ff05a97b011a7a,2,b'Problem-solving approaches differ radically, and finding common ground for assessing results can be difficult.\n')]}




In [9]:
# Compound Matcher for Purpose:
Purpose=candidate_subclass('Purpose',['purpose_cue'])

transition_regex_matcher=RegexMatchSpan(rgx=r"((^|\s)however.*$)|((^|\s)but(?!(also))*$)",longest_match_only=True)  # Correction: purpose 
excluded_dict_purpose_matcher=SentenceMatch(d=['but also','but without','but sometimes'],longest_match_only=True,reverse=True)  # the parent sentence shall not include "but also"
transition_matcher=Intersection(transition_regex_matcher,excluded_dict_purpose_matcher)

comparative_degree_matcher=Intersection(RegexMatchSpan(rgx="(.*more.*than.*$)|(.*er than.*$)",longest_match_only=True),transition_prev_work)  # Correction: purpose 
other_regex_matcher=RegexMatchSpan(rgx="(.*extend.*$)|(.*offer.*$)",longest_match_only=True)

dict_purpose_matcher=DictionaryMatch(d=['in this paper','in the paper',' that can ','in this study','to examine','we examine','to investigate','implications'],longest_match_only=True) 

# Unit test
# non_comma_dict_purpose_matcher=CandidateExtractor(Purpose, [ngrams], [Intersection(non_comma_matcher,other_regex_matcher)])

non_comma_dict_purpose_matcher=CandidateExtractor(Purpose, [ngrams], [Intersection(non_comma_matcher,Union(comparative_degree_matcher,other_regex_matcher,dict_purpose_matcher,transition_matcher))]) #,intersection(excluded_dict_purpose_matcher,transition_regex_matcher)])
purpose_cands=extract_and_display(non_comma_dict_purpose_matcher,Purpose,"Purpose",document_breakdown_map)
print(document_breakdown_map)

Clearing existing...
Running UDF...

CPU times: user 1min 37s, sys: 1.68 s, total: 1min 38s
Wall time: 1min 41s


**Split 0 - number of candidates extracted: 1501**



Clearing existing...
Running UDF...

CPU times: user 11.5 s, sys: 477 ms, total: 12 s
Wall time: 12.1 s


**Split 1 - number of candidates extracted: 203**



Clearing existing...
Running UDF...

CPU times: user 757 ms, sys: 118 ms, total: 876 ms
Wall time: 865 ms


**Split 2 - number of candidates extracted: 10**



**0/1501 Candidate/Span:**	`Purpose(Span("b'In this paper'", sentence=21545, chars=[0,12], words=[0,2]))`

**Its parent Sentence's text:**	In this paper, we describe several experiments with image understanding algorithms that were developed to aid remote visual inspection, in enhancing and recognizing surface cracks and corrosion from the live imagery of an aircraft surface.


**Its parent Document's text:**	{'_sa_instance_state': <sqlalchemy.orm.state.InstanceState object at 0x11d85b278>, 'name': '58cb6b8ec50f90cdd3875b14', 'stable_id': '58cb6b8ec50f90cdd3875b14::document:0:0', 'meta': {'file_name': '70kpaper_061418_cleaned_noBookLecture.tsv'}, 'id': 4540, 'type': 'document', 'sentences': [Sentence(Document 58cb6b8ec50f90cdd3875b14,0,b'Visual inspection is, by far, the most widely used method in aircraft surface inspection.'), Sentence(Document 58cb6b8ec50f90cdd3875b14,1,b'We are currently developing a prototype remote visual inspection system, designed to facilitate testing the hypothesized feasibility and advantages of remote visual inspection of aircraft surfaces.'), Sentence(Document 58cb6b8ec50f90cdd3875b14,2,b'In this paper, we describe several experiments with image understanding algorithms that were developed to aid remote visual inspection, in enhancing and recognizing surface cracks and corrosion from the live imagery of an aircraft surface.\n')]}




**1/1501 Candidate/Span:**	`Purpose(Span("b'\xe2\x80\x94here we examine shrinkage toward diagonality.\n'", sentence=11574, chars=[0,46], words=[0,8]))`

**Its parent Sentence's text:**	—here we examine shrinkage toward diagonality.


**Its parent Document's text:**	{'_sa_instance_state': <sqlalchemy.orm.state.InstanceState object at 0x1122c40f0>, 'name': '5834868625ff05a97b0122ab', 'stable_id': '5834868625ff05a97b0122ab::document:0:0', 'meta': {'file_name': '70kpaper_061418_cleaned_noBookLecture.tsv'}, 'id': 1681, 'type': 'document', 'sentences': [Sentence(Document 5834868625ff05a97b0122ab,0,b'The problem of estimating a covariance matrix in small samples has been considered by several authors following early work by Stein.'), Sentence(Document 5834868625ff05a97b0122ab,1,b'This problem can be especially important in hierarchical models where the standard errors of fixed and random effects depend on estimation of the covariance matrix of the distribution of the random effects.'), Sentence(Document 5834868625ff05a97b0122ab,2,b'We propose a set of hierarchical priors (HPs) for the covariance matrix that produce posterior shrinkage toward a specified structure'), Sentence(Document 5834868625ff05a97b0122ab,3,b'\xe2\x80\x94here we examine shrinkage toward diagonality.\n')]}




**2/1501 Candidate/Span:**	`Purpose(Span("b'which extend organisational boundaries.'", sentence=6712, chars=[119,157], words=[19,23]))`

**Its parent Sentence's text:**	The quest for scalability, reliability and cost reduction has led to the development of massively distributed systems, which extend organisational boundaries.

**Its parent Document's text:**	{'_sa_instance_state': <sqlalchemy.orm.state.InstanceState object at 0x111488240>, 'name': '5834868425ff05a97b00e0fe', 'stable_id': '5834868425ff05a97b00e0fe::document:0:0', 'meta': {'file_name': '70kpaper_061418_cleaned_noBookLecture.tsv'}, 'id': 315, 'type': 'document', 'sentences': [Sentence(Document 5834868425ff05a97b00e0fe,0,b'Modern computing paradigms have frequently adopted concepts from distributed systems.'), Sentence(Document 5834868425ff05a97b00e0fe,1,b'The quest for scalability, reliability and cost reduction has led to the development of massively distributed systems, which extend organisational boundaries.'), Sentence(Document 5834868425ff05a97b00e0fe,2,b'Voluntary computing environments (such as BOINC), Grids (such as EGEE and Globus), and more recently Cloud Computing (both open source and commercial) have established themselves as a range of distributed systems.\n')]}




**3/1501 Candidate/Span:**	`Purpose(Span("b'however'", sentence=12099, chars=[5,11], words=[2,2]))`

**Its parent Sentence's text:**	Few, however, allow new interfaces to be created from scratch because they do not provide a means of demonstrating when a recorded macro should be invoked.

**Its parent Document's text:**	{'_sa_instance_state': <sqlalchemy.orm.state.InstanceState object at 0x1122fb828>, 'name': '5834868725ff05a97b013a8a', 'stable_id': '5834868725ff05a97b013a8a::document:0:0', 'meta': {'file_name': '70kpaper_061418_cleaned_noBookLecture.tsv'}, 'id': 1836, 'type': 'document', 'sentences': [Sentence(Document 5834868725ff05a97b013a8a,0,b'Many programming by demonstration (PBD) systems elaborate on the idea of macro recording, and they allow users to extend existing applications.'), Sentence(Document 5834868725ff05a97b013a8a,1,b'Few, however, allow new interfaces to be created from scratch because they do not provide a means of demonstrating when a recorded macro should be invoked.'), Sentence(Document 5834868725ff05a97b013a8a,2,b'This paper discusses stimulus-response systems that allow both the when (stimulus/event) and the what (response macro) to be demonstrated.\n')]}




In [10]:
# Compound Matcher for Mechanism: 
Mechanism = candidate_subclass('Mechanism', ['mechanism_cue']) 
dict_mechanism_matcher=DictionaryMatch(d=['introduce','introduces','propose','proposes','we propose','we develop','approach'],longest_match_only=True) 
non_comma_dict_mechanism_matcher=CandidateExtractor(Mechanism, [ngrams], [Intersection(non_comma_matcher,dict_mechanism_matcher)])
mechanism_cands=extract_and_display(non_comma_dict_mechanism_matcher,Mechanism,"Mechanism",document_breakdown_map)

Clearing existing...
Running UDF...

CPU times: user 54.2 s, sys: 1 s, total: 55.2 s
Wall time: 56.5 s


**Split 0 - number of candidates extracted: 1404**



Clearing existing...
Running UDF...

CPU times: user 7.54 s, sys: 482 ms, total: 8.02 s
Wall time: 8.23 s


**Split 1 - number of candidates extracted: 169**



Clearing existing...
Running UDF...

CPU times: user 504 ms, sys: 136 ms, total: 640 ms
Wall time: 730 ms


**Split 2 - number of candidates extracted: 10**



**0/1404 Candidate/Span:**	`Mechanism(Span("b'We describe a Machine Translation (MT) approach that is specifically designed to enable rapid development of MT for languages with limited amounts of online resources.'", sentence=17491, chars=[0,166], words=[0,27]))`

**Its parent Sentence's text:**	We describe a Machine Translation (MT) approach that is specifically designed to enable rapid development of MT for languages with limited amounts of online resources.

**Its parent Document's text:**	{'_sa_instance_state': <sqlalchemy.orm.state.InstanceState object at 0x112113908>, 'name': '58ccadd5c50f90cdd3882822', 'stable_id': '58ccadd5c50f90cdd3882822::document:0:0', 'meta': {'file_name': '70kpaper_061418_cleaned_noBookLecture.tsv'}, 'id': 3398, 'type': 'document', 'sentences': [Sentence(Document 58ccadd5c50f90cdd3882822,0,b'We describe a Machine Translation (MT) approach that is specifically designed to enable rapid development of MT for languages with limited amounts of online resources.'), Sentence(Document 58ccadd5c50f90cdd3882822,1,b'Our approach assumes the availability of a small number of bi-lingual speakers of the two languages, but these need not be linguistic experts.'), Sentence(Document 58ccadd5c50f90cdd3882822,2,b'The bi-lingual speakers create a comparatively small corpus of word aligned phrases and sentences (on the order of magnitude of a few thousand sentence pairs) using a specially designed elicitation tool.\n')]}




**1/1404 Candidate/Span:**	`Mechanism(Span("b'Our approach introduces a moment-matching technique for trigonometric functions to approximate the Expectation Propagation messages efficiently.\n'", sentence=13390, chars=[0,144], words=[0,19]))`

**Its parent Sentence's text:**	Our approach introduces a moment-matching technique for trigonometric functions to approximate the Expectation Propagation messages efficiently.


**Its parent Document's text:**	{'_sa_instance_state': <sqlalchemy.orm.state.InstanceState object at 0x11c635ac8>, 'name': '5834868625ff05a97b012ac9', 'stable_id': '5834868625ff05a97b012ac9::document:0:0', 'meta': {'file_name': '70kpaper_061418_cleaned_noBookLecture.tsv'}, 'id': 2200, 'type': 'document', 'sentences': [Sentence(Document 5834868625ff05a97b012ac9,0,b'The von Mises model encodes a multivariate circular distribution as an undirected probabilistic graphical model.'), Sentence(Document 5834868625ff05a97b012ac9,1,b'Presently, the only algorithm for performing inference in the model is Gibbs sampling, which becomes inefficient for large graphs.'), Sentence(Document 5834868625ff05a97b012ac9,2,b'To address this issue, we introduce an Expectation Propagation based algorithm for performing inference in the von Mises graphical model.'), Sentence(Document 5834868625ff05a97b012ac9,3,b'Our approach introduces a moment-matching technique for trigonometric functions to approximate the Expectation Propagation messages efficiently.\n')]}




**2/1404 Candidate/Span:**	`Mechanism(Span("b'Two new architectures are proposed for designing physical network nodes in packet-switched structures.'", sentence=21079, chars=[0,101], words=[0,15]))`

**Its parent Sentence's text:**	Two new architectures are proposed for designing physical network nodes in packet-switched structures.

**Its parent Document's text:**	{'_sa_instance_state': <sqlalchemy.orm.state.InstanceState object at 0x11d81b518>, 'name': '58c94b36c50f90cdd385f6ad', 'stable_id': '58c94b36c50f90cdd385f6ad::document:0:0', 'meta': {'file_name': '70kpaper_061418_cleaned_noBookLecture.tsv'}, 'id': 4400, 'type': 'document', 'sentences': [Sentence(Document 58c94b36c50f90cdd385f6ad,0,b'Two new architectures are proposed for designing physical network nodes in packet-switched structures.'), Sentence(Document 58c94b36c50f90cdd385f6ad,1,b'They allow transparent optical packet networking and are based on the association of various subsystems, which have previously been proposed, and demonstrated.'), Sentence(Document 58c94b36c50f90cdd385f6ad,2,b'These elements are mainly based on the application of nonlinear behavior in semiconductor optical amplifiers.\n')]}




**3/1404 Candidate/Span:**	`Mechanism(Span("b'an approach to abstract syntax with binding structure.'", sentence=21019, chars=[124,177], words=[23,31]))`

**Its parent Sentence's text:**	A recent SIGACT News Logic column, guest-written by James Cheney DOI= 10.1145/1107523.1107537, discussed nominal logic [1], an approach to abstract syntax with binding structure.

**Its parent Document's text:**	{'_sa_instance_state': <sqlalchemy.orm.state.InstanceState object at 0x11d8109b0>, 'name': '58d08b9cc50f90cdd38a4f1d', 'stable_id': '58d08b9cc50f90cdd38a4f1d::document:0:0', 'meta': {'file_name': '70kpaper_061418_cleaned_noBookLecture.tsv'}, 'id': 4383, 'type': 'document', 'sentences': [Sentence(Document 58d08b9cc50f90cdd38a4f1d,0,b'A recent SIGACT News Logic column, guest-written by James Cheney DOI= 10.1145/1107523.1107537, discussed nominal logic [1], an approach to abstract syntax with binding structure.'), Sentence(Document 58d08b9cc50f90cdd38a4f1d,1,b'In addition to providing a worthy tutorial on nominal logic, that column leveled five criticisms at higher-order abstract syntax, an alternative approach for dealing with binding structure in abstract syntax.'), Sentence(Document 58d08b9cc50f90cdd38a4f1d,2,b'We argue below that three of those criticisms are factually inaccurate, and the other two are misguided.\n')]}




In [None]:
# TODO: Compound Matcher for Method: 
Method = candidate_subclass('Method', ['method_cue'])

dict_method_matcher=DictionaryMatch(d=['dataset','benchmark','experiment ','experiments',"empirical","participant","survey"," conduct"," analyze"],longest_match_only=True) 

non_comma_dict_method_matcher=CandidateExtractor(Method, [ngrams], [Intersection(non_comma_matcher,dict_method_matcher)])
method_cands=extract_and_display(non_comma_dict_method_matcher,Method,"Method",document_breakdown_map)


Clearing existing...
Running UDF...

CPU times: user 1min 2s, sys: 1.56 s, total: 1min 3s
Wall time: 1min 6s


**Split 0 - number of candidates extracted: 752**



Clearing existing...
Running UDF...

CPU times: user 7.53 s, sys: 456 ms, total: 7.99 s
Wall time: 8.11 s


**Split 1 - number of candidates extracted: 81**



Clearing existing...
Running UDF...

CPU times: user 459 ms, sys: 92.3 ms, total: 551 ms
Wall time: 542 ms


**Split 2 - number of candidates extracted: 10**



**0/752 Candidate/Span:**	`Method(Span("b'we describe several experiments with image understanding algorithms that were developed to aid remote visual inspection'", sentence=21545, chars=[15,133], words=[4,19]))`

**Its parent Sentence's text:**	In this paper, we describe several experiments with image understanding algorithms that were developed to aid remote visual inspection, in enhancing and recognizing surface cracks and corrosion from the live imagery of an aircraft surface.


**Its parent Document's text:**	{'_sa_instance_state': <sqlalchemy.orm.state.InstanceState object at 0x11d85b278>, 'name': '58cb6b8ec50f90cdd3875b14', 'stable_id': '58cb6b8ec50f90cdd3875b14::document:0:0', 'meta': {'file_name': '70kpaper_061418_cleaned_noBookLecture.tsv'}, 'id': 4540, 'type': 'document', 'sentences': [Sentence(Document 58cb6b8ec50f90cdd3875b14,0,b'Visual inspection is, by far, the most widely used method in aircraft surface inspection.'), Sentence(Document 58cb6b8ec50f90cdd3875b14,1,b'We are currently developing a prototype remote visual inspection system, designed to facilitate testing the hypothesized feasibility and advantages of remote visual inspection of aircraft surfaces.'), Sentence(Document 58cb6b8ec50f90cdd3875b14,2,b'In this paper, we describe several experiments with image understanding algorithms that were developed to aid remote visual inspection, in enhancing and recognizing surface cracks and corrosion from the live imagery of an aircraft surface.\n')]}




**1/752 Candidate/Span:**	`Method(Span("b'it would still take almost a day to analyze the average sentence.'", sentence=24074, chars=[79,143], words=[14,26]))`

**Its parent Sentence's text:**	If every combination could be accurately judged in one thousandth of a second, it would still take almost a day to analyze the average sentence.

**Its parent Document's text:**	{'_sa_instance_state': <sqlalchemy.orm.state.InstanceState object at 0x11d999cf8>, 'name': '5942b359c50f90cdd3b1f54d', 'stable_id': '5942b359c50f90cdd3b1f54d::document:0:0', 'meta': {'file_name': '70kpaper_061418_cleaned_noBookLecture.tsv'}, 'id': 5276, 'type': 'document', 'sentences': [Sentence(Document 5942b359c50f90cdd3b1f54d,0,b'Fifty six million, six hundred eighty seven thousand, and forty.'), Sentence(Document 5942b359c50f90cdd3b1f54d,1,b'A big number, to be sure.'), Sentence(Document 5942b359c50f90cdd3b1f54d,2,b'This is the number of possible semantic analyses for an average sized sentence in the Mikrokosmos Machine Translation project.'), Sentence(Document 5942b359c50f90cdd3b1f54d,3,b'Complex sentences have gone past the trillions.'), Sentence(Document 5942b359c50f90cdd3b1f54d,4,b'If every combination could be accurately judged in one thousandth of a second, it would still take almost a day to analyze the average sentence.'), Sentence(Document 5942b359c50f90cdd3b1f54d,5,b'And you can forget about the hard ones.'), Sentence(Document 5942b359c50f90cdd3b1f54d,6,b'And yet, understanding natural language sentences is intuitively not an exponential a air.\n')]}




**2/752 Candidate/Span:**	`Method(Span("b'the robot played like the rest of the participants and'", sentence=16575, chars=[18,71], words=[4,13]))`

**Its parent Sentence's text:**	In one condition, the robot played like the rest of the participants and, in the other, the robot moderated the game.

**Its parent Document's text:**	{'_sa_instance_state': <sqlalchemy.orm.state.InstanceState object at 0x11d619eb8>, 'name': '5834868825ff05a97b016ad8', 'stable_id': '5834868825ff05a97b016ad8::document:0:0', 'meta': {'file_name': '70kpaper_061418_cleaned_noBookLecture.tsv'}, 'id': 3142, 'type': 'document', 'sentences': [Sentence(Document 5834868825ff05a97b016ad8,0,b'We present initial findings from an experiment in which participants played Mafia, an established role-playing game, with our robot.'), Sentence(Document 5834868825ff05a97b016ad8,1,b'In one condition, the robot played like the rest of the participants and, in the other, the robot moderated the game.'), Sentence(Document 5834868825ff05a97b016ad8,2,b"We discuss general aspects of the interaction, participants' perceptions, and the potential of this scenario for studying group spatial behavior from robotic platforms.\n")]}




**3/752 Candidate/Span:**	`Method(Span("b'along with experimental results from an empirical evaluation of the completed system.\n'", sentence=20724, chars=[70,155], words=[11,24]))`

**Its parent Sentence's text:**	The design and implementation of the diagnostic system are presented, along with experimental results from an empirical evaluation of the completed system.


**Its parent Document's text:**	{'_sa_instance_state': <sqlalchemy.orm.state.InstanceState object at 0x11d7e6f98>, 'name': '58ce2d95c50f90cdd388f04f', 'stable_id': '58ce2d95c50f90cdd388f04f::document:0:0', 'meta': {'file_name': '70kpaper_061418_cleaned_noBookLecture.tsv'}, 'id': 4296, 'type': 'document', 'sentences': [Sentence(Document 58ce2d95c50f90cdd388f04f,0,b'This paper presents a grammar diagnostic system for controlled language checking.'), Sentence(Document 58ce2d95c50f90cdd388f04f,1,b'The implemented diagnostics were designed to address the most difficult rewrites for authors, based on an empirical analysis of log files containing over 180,000 sentences.'), Sentence(Document 58ce2d95c50f90cdd388f04f,2,b'The design and implementation of the diagnostic system are presented, along with experimental results from an empirical evaluation of the completed system.\n')]}




In [None]:
# TODO: Compound Matcher for Finding: 
Finding = candidate_subclass('Finding', ['finding_cue'])

dict_finding_matcher=DictionaryMatch(d=['show that','shows that','found','indicate','results','performance','find'],longest_match_only=True) 

non_comma_dict_finding_matcher=CandidateExtractor(Finding, [ngrams], [Intersection(non_comma_matcher,dict_finding_matcher)])
finding_cands=extract_and_display(non_comma_dict_finding_matcher,Finding,"Finding",document_breakdown_map)

Clearing existing...
Running UDF...
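
As a rough illustration of what the cue-phrase extraction above appears to be aiming for, here is a minimal plain-Python sketch. It does not use Snorkel and is not the behavior of `DictionaryMatch` or `Intersection`. It assumes the intent is "keep comma-free clauses that contain a finding cue", and the helper names `comma_free_clauses` and `finding_clauses` are hypothetical.

```python
import re

# Cue phrases copied from the DictionaryMatch call above.
FINDING_CUES = ['show that', 'shows that', 'found', 'indicate',
                'results', 'performance', 'find']


def comma_free_clauses(sentence):
    """Split a sentence on commas and return the stripped, non-empty pieces."""
    return [piece.strip() for piece in sentence.split(',') if piece.strip()]


def finding_clauses(sentence, cues=FINDING_CUES):
    """Return the comma-free clauses that contain at least one cue phrase.

    Assumption of this sketch: matching is case-insensitive and anchored on
    word boundaries, so 'find' does not fire inside 'findings' and
    'indicate' does not fire inside 'indicates'.
    """
    pattern = re.compile(
        r'\b(?:' + '|'.join(re.escape(cue) for cue in cues) + r')\b',
        re.IGNORECASE)
    return [clause for clause in comma_free_clauses(sentence)
            if pattern.search(clause)]


with_cue = ("The evaluation of CSA approaches in different domains "
            "has led to mixed results.")
without_cue = ("The ability to argue is essential in many aspects of life, "
               "but traditional face-to-face tutoring approaches do not scale up well.")

print(finding_clauses(with_cue))     # one clause, it contains 'results'
print(finding_clauses(without_cue))  # no clause contains a cue
```

Whether whole-word matching is the right choice depends on the cue list: with these boundaries, inflected forms such as "indicates" would need their own dictionary entries.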

Next is the ``Document Aggregation`` phase. First we show a few example documents to illustrate what ``document_breakdown_map`` looks like and how we aggregate.

In [16]:
for ind,docid in enumerate(document_breakdown_map.keys()):
    print("\n\nDocument "+str(ind)+"/"+str(len(document_breakdown_map.keys())))
    print("Extracted segments\n\n",docid,document_breakdown_map[docid],"\n\n")
    print("Complete abstract\n\n",document_breakdown_map[docid][list(document_breakdown_map[docid].keys())[0]][0].get_parent().get_parent().sentences)
    if ind>2:
        break

Then we count the number of documents that contain exactly N segments, where N=1, 2, 3, 4, 5.

In [18]:
def get_document_text(doc):
    text_string=" ".join([str(sentence.text) for sentence in doc.sentences])
    return text_string


def rank_by_matched_segments(document_breakdown_map):
    for n in [5,4,3,2,1]:
        print("Below are one or two document examples that contain exactly "+str(n)+" segments\n")
        showed_examples=0
        count=0
        for ind,docid in enumerate(document_breakdown_map.keys()):
            if len(document_breakdown_map[docid].keys())==n:
                count+=1
                if showed_examples<2:
                    showed_examples+=1
                    print(docid,": ", get_document_text(document_breakdown_map[docid][list(document_breakdown_map[docid].keys())[0]][0].get_parent().get_parent()))
                    print("Extracted segments\n\n",docid,document_breakdown_map[docid],"\n\n")
        print("Total count is "+str(count)+"\n\n=================\n")
        
rank_by_matched_segments(document_breakdown_map)
        

{}
Below are one or two document examples that contain exactly 5 segments

Total count is 0


Below are one or two document examples that contain exactly 4 segments

Total count is 0


Below are one or two document examples that contain exactly 3 segments

Total count is 0


Below are one or two document examples that contain exactly 2 segments

Total count is 0


Below are one or two document examples that contain exactly 1 segments

Total count is 0
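
The per-N counts can also be computed in a single pass. The sketch below assumes only that the map has the shape `doc_id -> {segment name -> list of candidates}`. The toy data and the helper name `segment_count_histogram` are hypothetical, with strings standing in for Snorkel candidates.

```python
from collections import Counter


def segment_count_histogram(breakdown_map):
    """Map 'number of distinct segment types' to 'number of documents'."""
    return Counter(len(segments) for segments in breakdown_map.values())


# Hypothetical toy data: doc id -> {segment name -> list of candidates}.
toy_map = {
    "doc_a": {"Background": ["span1"], "Purpose": ["span2"]},
    "doc_b": {"Finding": ["span3"]},
    "doc_c": {"Method": ["span4"], "Finding": ["span5", "span6"]},
}

histogram = segment_count_histogram(toy_map)
for n in [5, 4, 3, 2, 1]:
    # Counter returns 0 for segment counts that never occur.
    print(n, "segment(s):", histogram[n], "document(s)")
```

An empty map would give a count of zero for every N, which is a quick way to spot that no candidates were collected before this step ran.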




Then we visualize the extracted Spans as highlighted text on the Document.

Color notation: <b style="color:orange;">Background</b> <b style="color:pink;">Purpose</b> <b style="color:green;">Mechanism</b> <b style="color:purple;">Method</b> <b style="color:blue;">Findings</b>

In [107]:
from IPython.display import Markdown, display
def printmd(string):
    display(Markdown(string))
    
    
def print_colored_text(docid,document_breakdown_map=document_breakdown_map):
    
    color_mapping={"Background":"orange","Purpose":"pink","Mechanism":"green","Method":"purple","Finding":"blue"}
    this_document=document_breakdown_map[docid][list(document_breakdown_map[docid].keys())[0]][0].get_parent().get_parent()
    document_text=get_document_text(this_document)

    added_segment=[]
    for segment in document_breakdown_map[docid]:
        spans=document_breakdown_map[docid][segment]
        for span in spans:
            # Use the current candidate rather than always the first one in the list.
            # NOTE: index 6 relies on the key order of the candidate's __dict__.
            this_span=span.__dict__[list(span.__dict__)[6]]
            span_sentence_text=this_span.sentence.text
            span_text=str(span_sentence_text[this_span.char_start:(this_span.char_end+1)])
            document_text=document_text.replace(span_text.strip(),"<font color='"+color_mapping[segment]+"'>"+span_text.strip()+"</font>")

    printmd(document_text)

docid='5834868625ff05a97b013016'
print_colored_text(docid)   


The ability to argue is essential in many aspects of life, <font color='pink'>but</font> traditional face-to-face tutoring approaches do not scale up well. A solution for this dilemma may be computer-supported argumentation (CSA). <font color='blue'>The evaluation of CSA approaches in different domains has led to mixed results.</font> <font color='purple'>To gain insights into the challenges and future prospects of CSA we conducted a survey among teachers</font>, <font color='orange'>researchers</font>, and system developers. Our investigation points to optimism regarding the potential success and importance of CSA.
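
One thing to watch with `str.replace`-based highlighting is that it colors every occurrence of the span text, not only the extracted one. A possible alternative, sketched below under stated assumptions, wraps text by character offsets instead. The function `highlight` and its `(start, end, label)` tuples are hypothetical. Offsets here are document-relative with an exclusive end, whereas the Span offsets used above are sentence-relative with an inclusive `char_end`, so a conversion step would be needed.

```python
def highlight(text, spans, colors):
    """Wrap labelled character ranges of `text` in <font> tags.

    spans  : list of (start, end, label) tuples, `end` exclusive
    colors : dict mapping label -> color name
    A span that overlaps one already emitted is skipped.
    """
    pieces = []
    cursor = 0
    for start, end, label in sorted(spans):
        if start < cursor:  # overlaps a span already emitted
            continue
        pieces.append(text[cursor:start])
        pieces.append("<font color='%s'>%s</font>" % (colors[label], text[start:end]))
        cursor = end
    pieces.append(text[cursor:])
    return "".join(pieces)


colors = {"Purpose": "pink", "Finding": "blue"}
text = "We found A. We found B."
spans = [(12, 20, "Finding")]  # only the second "We found"

highlighted = highlight(text, spans, colors)
print(highlighted)
```

Here only the second "We found" is colored, while `text.replace("We found", ...)` would color both.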


In [None]:
test_doc_breakdown_map={}
test_background_cands=extract_and_display(non_comma_dict_background_matcher,Background,"Background",document_breakdown_map=test_doc_breakdown_map,selected_split=2)
test_purpose_cands=extract_and_display(non_comma_dict_purpose_matcher,Purpose,"Purpose",document_breakdown_map=test_doc_breakdown_map,selected_split=2)
test_mechanism_cands=extract_and_display(non_comma_dict_mechanism_matcher,Mechanism,"Mechanism",document_breakdown_map=test_doc_breakdown_map,selected_split=2)
test_method_cands=extract_and_display(non_comma_dict_method_matcher,Method,"Method",document_breakdown_map=test_doc_breakdown_map,selected_split=2)
test_finding_cands=extract_and_display(non_comma_dict_finding_matcher,Finding,"Finding",document_breakdown_map=test_doc_breakdown_map,selected_split=2)

print_colored_text('cscw18social',document_breakdown_map=test_doc_breakdown_map)