## Test Run 1: Load AbstractNet Dataset: 70K unlabeled (and 2K labeled) Abstracts into DB

This notebook loads the dataset and creates labeled *candidates* through labeling functions. Feel free to document or raise any confusion that comes up during the test run. For example, are the following two consistent: the **(segment, label) pair** that we want to have, and the **candidates** that we instruct Snorkel to extract?

Before anything else, please make sure that you have followed the project-level ``README.md`` and installed all Python dependencies, e.g. ``tika``.

We filtered out null abstracts from `ClydeDB.csv` ([AbstractSegmentationCrowdNLP Git repo](https://github.com/zhoujieli/AbstractSegmentationCrowdNLP.git)), resulting in 48,914 valid ones out of 56,851 total abstracts. The 48,914 abstracts are saved to `data/70kpaper.tsv`.
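
The filtering step above can be sketched roughly as follows. This is a minimal illustration, not the script that was actually used: the column names (`id`, `abstract`), the set of null markers, and the helper name `filter_valid_abstracts` are all assumptions, and the sample data is a tiny in-memory stand-in for `ClydeDB.csv`.

```python
import csv
import io

def filter_valid_abstracts(rows, abstract_field="abstract"):
    """Keep rows whose abstract is present and is not blank or a null marker."""
    null_markers = {"", "null", "none", "nan"}  # assumed markers, adjust to the real data
    valid = []
    for row in rows:
        text = (row.get(abstract_field) or "").strip()
        if text.lower() not in null_markers:
            valid.append(row)
    return valid

# Tiny in-memory stand-in for ClydeDB.csv (column names are hypothetical).
sample_csv = io.StringIO(
    "id,abstract\n"
    "1,Recent research has shown promising results.\n"
    "2,\n"
    "3,NULL\n"
    "4,We propose a new method.\n"
)
rows = list(csv.DictReader(sample_csv))
valid_rows = filter_valid_abstracts(rows)
print(len(valid_rows), "valid out of", len(rows))
```

On the real file, the same two counts should correspond to the valid and total figures quoted above.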



In this section, we preprocess documents by parsing them into *contexts*. *Candidates* are then extracted from these *contexts*; each candidate is an *instance* of one of four segment types: *background*, *mechanism*, *method*, or *findings*.

In [1]:
%load_ext autoreload
%autoreload 2
%matplotlib inline
import os

from snorkel import SnorkelSession
session = SnorkelSession()

# Set how many documents to process for automatic testing; you can safely ignore this.
n_docs = 500 if 'CI' in os.environ else 100  # change 100 to 60,000 for the real dataset

from snorkel.parser import TSVDocPreprocessor

doc_preprocessor = TSVDocPreprocessor('data/70kpaper_061418_cleaned.tsv', encoding="utf-8", max_docs=n_docs)

Next, we gather statistics on the number of documents and sentences, as shown below. Loading ~60K papers could take 5-8 minutes (watch the progress bar; exceptions may also occur). The following code parses documents into sentences by period, averaging 4.49 sentences per document. Earlier I spent a few hours debugging a hidden formatting error that confused spaCy: when converting the raw data from .csv to .tsv, make sure fields are written *without* leading or trailing quotes.
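
One way to avoid the quoting problem is to write the TSV lines by hand rather than letting a CSV writer add quotes. The sketch below assumes a two-column layout (document name, then text) and uses hypothetical helper names `clean_field` and `write_tsv`; it is not the conversion script used for this dataset.

```python
import io

def clean_field(text):
    """Strip one pair of wrapping quotes and collapse tabs/newlines inside the text."""
    text = text.strip()
    if len(text) >= 2 and text[0] == text[-1] and text[0] in "\"'":
        text = text[1:-1]
    # Tabs or newlines inside the abstract would break the one-row-per-document layout.
    return " ".join(text.split())

def write_tsv(records, handle):
    """Write (doc_name, abstract) pairs as plain tab-separated lines with no quoting."""
    for doc_name, abstract in records:
        handle.write(f"{doc_name}\t{clean_field(abstract)}\n")

buffer = io.StringIO()
write_tsv([("doc1", '"Session types are\twidely accepted."')], buffer)
print(repr(buffer.getvalue()))
```

Writing to a real file only requires replacing the `io.StringIO()` buffer with `open(path, "w", encoding="utf-8")`.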


In [2]:
from snorkel.parser.spacy_parser import Spacy
from snorkel.parser import CorpusParser


corpus_parser = CorpusParser(parser=Spacy())
%time corpus_parser.apply(doc_preprocessor, count=n_docs)


from snorkel.models import Document, Sentence

print("Documents:", session.query(Document).count())
print("Sentences:", session.query(Sentence).count())

Clearing existing...
Running UDF...

CPU times: user 3.97 s, sys: 369 ms, total: 4.34 s
Wall time: 3.15 s
Documents: 100
Sentences: 495


Next we extract `candidates` by defining the specific `CandidateExtractor` for abstract segmentation. We take *Background* as an example and return to the other segments, i.e., *mechanism*, *method*, and *findings*, later.

Some more explanation based on my understanding: 
    
1. `Candidates` are defined as a class that contains 1+ `Span` objects within one `Sentence` context.  
    
2. `Span(s)` correspond to conceptual categories in text like people or disease names. 
    
In the intro tutorial example, the `candidate` represents a possible `Spouse` mention `(Barack Obama, Michelle Obama)`. As readers, we know this mention is true from external knowledge and from the keyword `wedding` occurring later in the sentence. (References: (1) section `Writing a basic CandidateExtractor` in [Intro_tutorial_1](../intro/Intro_Tutorial_1.ipynb); (2) section `Candidate Member Functions and Variables` in [Workshop_1_Snorkel_API](../workshop/Workshop_1_Snorkel_API.ipynb))
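
To make the relationship between a sentence, a span, and a candidate concrete, here is a simplified stand-in written in plain Python. `SimpleSpan` and `SimpleCandidate` are hypothetical names, not Snorkel classes. The inclusive end offset is an assumption inferred from the printed output further down, where a 30-character sentence is reported as `chars=[0,29]`.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SimpleSpan:
    """A character range inside one sentence; char_end is inclusive (assumed)."""
    sentence_text: str
    char_start: int
    char_end: int

    def get_span(self):
        # Inclusive end offset, hence the + 1 when slicing.
        return self.sentence_text[self.char_start:self.char_end + 1]

@dataclass(frozen=True)
class SimpleCandidate:
    """A candidate holding a single span, mirroring a one-argument candidate type."""
    background_cue: SimpleSpan
    split: int = 0  # 0 = train, 1 = dev, 2 = test

sentence = "Previous Work: Highway Control"
whole = SimpleSpan(sentence, 0, 29)
cue_only = SimpleSpan(sentence, 0, 7)
candidate = SimpleCandidate(background_cue=whole)
print(candidate.background_cue.get_span())
print(cue_only.get_span())
```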


+ Background: 
  - "Recent research ... ", 
  - "... have/has been widely ...", 
  - "How ... ?" (and as the first sentence), 
  - "Previous work...", 
  - "Motivated by...", 
  - "The success of ...", etc.
+ Mechanism:
  - something
  - some other pattern
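
The *Background* cues listed above could be expressed as regular expressions along these lines. This is only a sketch: anchoring the cues that are listed without a leading "..." at the start of the sentence is an assumption, and `looks_like_background` is a hypothetical helper rather than part of Snorkel.

```python
import re

# One pattern per cue; cues written without a leading "..." are anchored at the start.
BACKGROUND_PATTERNS = [
    re.compile(r"^recent research\b", re.IGNORECASE),
    re.compile(r"\b(?:have|has) been widely\b", re.IGNORECASE),
    re.compile(r"^previous work\b", re.IGNORECASE),
    re.compile(r"^motivated by\b", re.IGNORECASE),
    re.compile(r"^the success of\b", re.IGNORECASE),
]

def looks_like_background(sentence, is_first_sentence=False):
    """Return True if the sentence matches one of the background cues."""
    text = sentence.strip()
    # "How ... ?" only counts when it is the first sentence of the abstract.
    if is_first_sentence and text.lower().startswith("how ") and text.endswith("?"):
        return True
    return any(pattern.search(text) for pattern in BACKGROUND_PATTERNS)

print(looks_like_background("Session types have been widely accepted."))
print(looks_like_background("How do users choose passwords?", is_first_sentence=True))
print(looks_like_background("We propose a new algorithm."))
```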

We define `CandidateExtractor` as a wrapper around a `CandidateSpace` (e.g. `Ngrams` is one type of `CandidateSpace`) and a `Matcher` (e.g. `DictionaryMatch`, `PersonMatcher`). Please make sure that `longest_match_only=True`, since this gives us the longest span that contains dictionary words. (Reference: source code [candidates.py](../../snorkel/candidates.py) and [matchers.py](../../snorkel/matchers.py))
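
The interplay between the n-gram space, the matcher, and longest-match filtering can be illustrated without Snorkel. The sketch below is not Snorkel's implementation: it assumes n-grams are enumerated longest first, that the matcher accepts any n-gram containing a dictionary word, and that a match is dropped when it is nested inside an already kept span. All function names are hypothetical.

```python
def word_ngrams(tokens, n_max):
    """Yield (start, end) token index pairs (inclusive), longest n-grams first."""
    for n in range(min(n_max, len(tokens)), 0, -1):
        for start in range(len(tokens) - n + 1):
            yield start, start + n - 1

def contains_cue(tokens, start, end, cues):
    """True if any token in the n-gram is a dictionary word (case-insensitive)."""
    return any(tok.lower().strip(".,:;?!") in cues for tok in tokens[start:end + 1])

def extract_longest(tokens, cues, n_max=30):
    """Keep matching n-grams that are not nested inside an already kept span."""
    kept = []
    for start, end in word_ngrams(tokens, n_max):
        if not contains_cue(tokens, start, end, cues):
            continue
        nested = any(s <= start and end <= e for s, e in kept)
        if not nested:
            kept.append((start, end))
    return kept

cues = {"previous", "motivated", "recent", "widely"}
tokens = "Previous Work : Highway Control".split()
print(extract_longest(tokens, cues))            # whole sentence is the longest match
print(extract_longest(tokens, cues, n_max=2))   # a smaller n_max yields a shorter span
```

With a small `n_max`, several equally long, overlapping spans from one sentence can all survive the nesting check, which is the behaviour discussed in Question 2 further down.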

In [3]:
from snorkel.models import candidate_subclass
from snorkel.candidates import Ngrams, CandidateExtractor
from snorkel.matchers import PersonMatcher, DictionaryMatch

Background = candidate_subclass('Background', ['background_cue'])

ngrams = Ngrams(n_max=30)
# Start simple: any n-gram that matches the dictionary is a *background* candidate!
dict_matcher = DictionaryMatch(d=['previous', 'motivated', 'recent', 'widely'], longest_match_only=True)
cand_extractor = CandidateExtractor(Background, [ngrams], [dict_matcher])

Now we apply the defined `CandidateExtractor` to all `Sentences` in the collection (split 80/10/10 into train/dev/test by document index).
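
The split rule in the cell below assigns documents by their index modulo 10. Checked in isolation, it gives an 80/10/10 split; `assign_split` is a hypothetical helper that only mirrors that rule.

```python
from collections import Counter

def assign_split(doc_index):
    """Return 0 (train), 1 (dev), or 2 (test) for a document's position in the sorted list."""
    remainder = doc_index % 10
    if remainder == 8:
        return 1  # dev
    if remainder == 9:
        return 2  # test
    return 0      # train

counts = Counter(assign_split(i) for i in range(100))
print(counts[0], counts[1], counts[2])
```

Because the assignment depends only on the document's position in the name-sorted list, all sentences of a document land in the same split.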

In [4]:
from snorkel.models import Document
from util import number_of_people

docs = session.query(Document).order_by(Document.name).all()

train_sents = set()
dev_sents   = set()
test_sents  = set()

for i, doc in enumerate(docs):
    for s in doc.sentences:
        if number_of_people(s) <= 5:
            if i % 10 == 8:
                dev_sents.add(s)
            elif i % 10 == 9:
                test_sents.add(s)
            else:
                train_sents.add(s)
                
for i, sents in enumerate([train_sents, dev_sents, test_sents]):
    %time cand_extractor.apply(sents, split=i)
    print("Number of candidates extracted:", session.query(Background).filter(Background.split == i).count(),"\n\n")
                



Clearing existing...
Running UDF...
[=                                       ] 0%within here apply
()
within here apply
()
within here apply
()
within here apply
()
[=                                       ] 1%within here apply
()
within here apply
()
within here apply
()
within here apply
()
[=                                       ] 2%within here apply
()
within here apply
()
within here apply
()
Previous Work: Highway Control
set()
[]

Previous Work: Highway
{(0, 29)}
[True]

Previous Work:
{(0, 29)}
[True]

Previous Work
{(0, 29)}
[True]

Previous
{(0, 29)}
[True]

within here apply
()
[==                                      ] 3%within here apply
()
within here apply
()
within here apply
()
within here apply
()
[==                                      ] 4%within here apply
()
within here apply
()
within here apply
()
within here apply
()
within here apply
()
[===                                     ] 5%within here apply
()
within here apply
()
within here apply
()
within here appl

[===                                     ] 6%within here apply
()
within here apply
()
within here apply
()
within here apply
()
[===                                     ] 7%within here apply
()
within here apply
()
within here apply
()
within here apply
()
[====                                    ] 8%within here apply
()
within here apply
()
within here apply
()
within here apply
()
[====                                    ] 9%within here apply
()
within here apply
()
within here apply
()
In the previous lectures we have considered a programming language C0 with pointers and memory and array allocation.
set()
[]

In the previous lectures we have considered a programming language C0 with pointers and memory and array allocation
{(0, 115)}
[True]

the previous lectures we have considered a programming language C0 with pointers and memory and array allocation.
{(0, 115)}
[True]

In the previous lectures we have considered a programming language C0 with pointers and memory and array
{(0, 

As sketched in the previous lecture, this an important piece
{(0, 159)}
[True]

sketched in the previous lecture, this an important piece in
{(0, 159)}
[True]

in the previous lecture, this an important piece in the
{(0, 159)}
[True]

the previous lecture, this an important piece in the general
{(0, 159)}
[True]

previous lecture, this an important piece in the general technique
{(0, 159)}
[True]

As sketched in the previous lecture, this an important
{(0, 159)}
[True]

sketched in the previous lecture, this an important piece
{(0, 159)}
[True]

in the previous lecture, this an important piece in
{(0, 159)}
[True]

the previous lecture, this an important piece in the
{(0, 159)}
[True]

previous lecture, this an important piece in the general
{(0, 159)}
[True]

As sketched in the previous lecture, this an
{(0, 159)}
[True]

sketched in the previous lecture, this an important
{(0, 159)}
[True]

in the previous lecture, this an important piece
{(0, 159)}
[True]

the previous lecture, this

, in our recent research we have created software that can solve arbitrary automated mechanism design instances using a mixed integer/
{(3, 177), (11, 178), (0, 176)}
[True, True, True]

in our recent research we have created software that can solve arbitrary automated mechanism design instances using a mixed integer/linear
{(3, 177), (11, 178), (0, 176)}
[True, True, True]

our recent research we have created software that can solve arbitrary automated mechanism design instances using a mixed integer/linear program
{(3, 177), (11, 178), (0, 176)}
[True, True, True]

recent research we have created software that can solve arbitrary automated mechanism design instances using a mixed integer/linear program solver
{(3, 177), (11, 178), (0, 176)}
[True, True, True]

In contrast, in our recent research we have created software that can solve arbitrary automated mechanism design instances using a
{(3, 177), (11, 178), (0, 176)}
[False, False, True]

contrast, in our recent research we have c

, in our recent
{(3, 177), (11, 178), (0, 176)}
[True, True, True]

in our recent research
{(3, 177), (11, 178), (0, 176)}
[True, True, True]

our recent research we
{(3, 177), (11, 178), (0, 176)}
[True, True, True]

recent research we have
{(3, 177), (11, 178), (0, 176)}
[True, True, True]

in our recent
{(3, 177), (11, 178), (0, 176)}
[True, True, True]

our recent research
{(3, 177), (11, 178), (0, 176)}
[True, True, True]

recent research we
{(3, 177), (11, 178), (0, 176)}
[True, True, True]

our recent
{(3, 177), (11, 178), (0, 176)}
[True, True, True]

recent research
{(3, 177), (11, 178), (0, 176)}
[True, True, True]

recent
{(3, 177), (11, 178), (0, 176)}
[True, True, True]

setup motivated by the behavior of real-world distributed computation networks, where the machines are differently slow
{(0, 174), (10, 180), (3, 179)}
[True, True, True]

motivated by the behavior of real-world distributed computation networks, where the machines are differently slow at
{(0, 174), (10, 180), (3, 179)}
[True, True, True]

We discuss, analyze, and experiment with a setup motivated by the behavior of real-world
{(0, 174), (10, 180), (3, 179)}
[True, False, False]

discuss, analyze, and experiment with a setup motivated by the behavior of real-world distributed
{(0, 174), (10, 180), (3, 179)}
[True, False, True]

, analyze, and experiment with a setup motivated by the behavior of real-world distributed computation
{(0, 174), (10, 180), (3, 179)}
[True, True, True]

analyze, and experiment with a setup motivated by the behavior of real-world distributed computation networks
{(0, 174), (10, 180), (3, 179)}
[True, True, True]

, and experiment with a setup motivated by the b


motivated by
{(0, 174), (10, 180), (3, 179)}
[True, True, True]

motivated
{(0, 174), (10, 180), (3, 179)}
[True, True, True]

[True, False, True, True, False]

popular recently
{(45, 186), (35, 180), (48, 188), (46, 187), (44, 182)}
[True, True, True, True, True]

recently [
{(45, 186), (35, 180), (48, 188), (46, 187), (44, 182)}
[True, False, True, True, True]

recently
{(45, 186), (35, 180), (48, 188), (46, 187), (44, 182)}
[True, True, True, True, True]

within here apply
()
()
within here apply
()
Session types are widely accepted as an expressive discipline for structuring communications in concurrent and distributed systems.
set()
[]

Session types are widely accepted as an expressive discipline for structuring communications in concurrent and distributed systems
{(0, 130)}
[True]

types are widely accepted as an expressive discipline for structuring communications in concurrent and distributed systems.
{(0, 130)}
[True]

Session types are widely accepted as an expressive discipline for structuring communications in concurrent and distributed
{(0, 130)}
[True]

types are widely accepted as an expressive

within here apply
()
within here apply
()
within here apply
()
()
within here apply
()
within here apply
()
()

CPU times: user 1.83 s, sys: 989 ms, total: 2.82 s
Wall time: 2.74 s
Number of candidates extracted: 16 


Clearing existing...
Running UDF...
[==                                      ] 2%within here apply
()
[===                                     ] 5%within here apply
()
[====                                    ] 7%within here apply
()
[=====                                   ] 10%within here apply
()

CPU times: user 272 ms, sys: 166 ms, total: 437 ms
Wall time: 434 ms
Number of candidates extracted: 0 


Clearing existing...
Running UDF...
[=                                       ] 2%within here apply
()
[==                                      ] 4%within here apply
()
[===                                     ] 6%within here apply
()
[====                                

Let's take a look at a few of the extracted `Candidates`. Since we used `DictionaryMatch`, every `Candidate` contains at least one word from the set \['previous','motivated','recent','widely'\].

In [5]:
cands = session.query(Background).filter(Background.split == 0).all()

for i in range(len(cands)):
    print("The Candidate/Span "+str(i)+"/"+str(len(cands))+":\t"+str(cands[i].background_cue))
    print("This Candidate/Span's parent Sentence's text:\t"+str(cands[i].get_parent().text))
    print()

The Candidate/Span 0/16:	Span("b'Previous Work: Highway Control'", sentence=329, chars=[0,29], words=[0,4])
This Candidate/Span's parent Sentence's text:	Previous Work: Highway Control

The Candidate/Span 1/16:	Span("b'Abstract: Switched LANs are become more widely used because they can provide a higher bandwidth than LANs based on shared media.'", sentence=311, chars=[0,127], words=[0,22])
This Candidate/Span's parent Sentence's text:	Abstract: Switched LANs are become more widely used because they can provide a higher bandwidth than LANs based on shared media.

The Candidate/Span 2/16:	Span("b'In the previous lectures we have considered a programming language C0 with pointers and memory and array allocation.'", sentence=295, chars=[0,115], words=[0,18])
This Candidate/Span's parent Sentence's text:	In the previous lectures we have considered a programming language C0 with pointers and memory and array allocation.

The Candidate/Span 3/16:	Span("b'As sketched in the previous lecture, 

**(Solved) Question 1:** The `CandidateExtractor` extracts only spans of length 1, and each sentence is matched only once, with a single span (try searching for "sentence=3993", for example). The matched sentences are good, but we would still want longer spans, e.g. half of a sentence or the whole sentence.

`Span("b'previous'", sentence=4705, chars=[20,27], words=[4,4])`

**Answer 1:** Override `DictionaryMatch._f()`.

=====================================================================================

**Question 2:** Several `Span`s correspond to the same sentence. They were all included despite `longest_match_only=True`, because these spans all have the same length.

**Answer 2:** Perhaps try either (1) filtering by the first appearance of each document ID, or (2) setting max_length to the length of the longest sentence.
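
A variant of option (1) is to post-filter the extracted spans so that each sentence keeps only one of them. The sketch below keys on the sentence ID rather than the document ID and keeps the longest span, with the first one seen winning ties. It works on plain tuples, and `longest_span_per_sentence` is a hypothetical helper, not a Snorkel function.

```python
def longest_span_per_sentence(spans):
    """Map each sentence ID to its longest (char_start, char_end) span.

    spans: iterable of (sentence_id, char_start, char_end) with inclusive ends.
    On ties, the span seen first is kept.
    """
    best = {}
    for sentence_id, start, end in spans:
        length = end - start + 1
        current = best.get(sentence_id)
        if current is None or length > current[1] - current[0] + 1:
            best[sentence_id] = (start, end)
    return best

spans = [
    (329, 0, 29),
    (295, 0, 114),
    (295, 3, 115),
    (295, 0, 115),
]
print(longest_span_per_sentence(spans))
```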

In [1]:



# Get info of sentence
# for sent in (docs[0].sentences):
#     print(sent.__dict__.keys())
#     print(sent.text)
#     print(sent.pos_tags)
#     print(sent.ner_tags)
    

# sents=session.query(Sentence).order_by(Sentence.name).all()
# print(sents[0])

In [2]:
from snorkel.models import candidate_subclass

Background = candidate_subclass('Background', ['background_cue'])


from snorkel.candidates import Ngrams, CandidateExtractor
from snorkel.matchers import PersonMatcher, DictionaryMatch

ngrams         = Ngrams(n_max=30)
# Same dictionary as in the extractor defined earlier in this notebook.
dict_matcher   = DictionaryMatch(d=['previous', 'motivated', 'recent', 'widely'], longest_match_only=True)
cand_extractor = CandidateExtractor(Background, [ngrams], [dict_matcher])

Thanks for reading! 

Some debugging notes at the very end (can be ignored).

```
python -m spacy download en
```

Current issue: the parser does not split sentences *by periods*. The sentence count is significantly lower than expected!
Potential fix: https://github.com/explosion/spaCy/issues/93
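
To quantify how far the parsed sentence count is from a period-based expectation, a naive splitter can serve as a rough baseline. This is only a sanity check under the assumption that sentences end in `.`, `!`, or `?` followed by whitespace; it will over-split on abbreviations, and `naive_sentences` is a hypothetical helper, not a replacement for the spaCy parser.

```python
import re

# Split on whitespace that follows sentence-ending punctuation.
_BOUNDARY = re.compile(r"(?<=[.!?])\s+")

def naive_sentences(text):
    """Return a rough list of sentences using punctuation-based splitting."""
    return [part for part in _BOUNDARY.split(text.strip()) if part]

raw_text = "Hello, world. Here are two sentences."
print(naive_sentences(raw_text))
print(len(naive_sentences(raw_text)))
```

Comparing `len(naive_sentences(doc_text))` with the number of `Sentence` rows stored for the same document shows which documents the parser under-segments.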

======= Some more debugging log here (not necessary, could skip reading) ======
~~~~
Xins-MacBook-Pro:~ xin$ source activate snorkel
(snorkel) Xins-MacBook-Pro:~ xin$ python
Python 3.6.4 |Anaconda custom (64-bit)| (default, Jan 16 2018, 12:04:33) 
[GCC 4.2.1 Compatible Clang 4.0.1 (tags/RELEASE_401/final)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import spacey
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ModuleNotFoundError: No module named 'spacey'
>>> import spacy
>>> spacy.load('en')
<spacy.en.English object at 0x1080e1da0>
>>> model=spacy.load('en')
>>> docs=model.tokenizer('Hello, world. Here are two sentences.')
>>> for sent in docs.sents:
...     pritn(sent.text)
... 
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "spacy/tokens/doc.pyx", line 439, in __get__ (spacy/tokens/doc.cpp:9808)
ValueError: Sentence boundary detection requires the dependency parse, which requires data to be installed. For more info, see the documentation: 
https://spacy.io/docs/usage

>>> for sent in docs.sents:
...     print(sent.text)
... 
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "spacy/tokens/doc.pyx", line 439, in __get__ (spacy/tokens/doc.cpp:9808)
ValueError: Sentence boundary detection requires the dependency parse, which requires data to be installed. For more info, see the documentation: 
https://spacy.io/docs/usage

>>> from spacy.en import English
>>> nlp = English()
>>> doc = nlp(raw_text)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
NameError: name 'raw_text' is not defined
>>> raw_text='Hello, world. Here are two sentences.'
>>> doc = nlp(raw_text)
>>> sentences = [sent.string.strip() for sent in doc.sents]
>>> sentences
['Hello, world.', 'Here are two sentences.']
>>> model(raw_text)
Hello, world. Here are two sentences.
>>> docs=model(raw_text)
>>> docs.sents
<generator object at 0x14ad31948>
>>> docs=model(raw_text)
>>> sentences = [sent.string.strip() for sent in docs.sents]
>>> sentences
['Hello, world.', 'Here are two sentences.']
>>> 
~~~~
