## Test Run 1: Load AbstractNet Dataset: 70K unlabeled (and 2K labeled) Abstracts into DB

This notebook loads the dataset and creates labeled *candidates* using labeling functions. Before starting, please make sure that you have followed the project-level ``README.md`` and installed all Python dependencies, e.g. ``tika``.

We filtered out null abstracts from `ClydeDB.csv` ([AbstractSegmentationCrowdNLP Git repo](https://github.com/zhoujieli/AbstractSegmentationCrowdNLP.git)), resulting in 48,914 valid ones out of 56,851 total abstracts. The 48,914 abstracts are saved to `data/70kpaper.tsv`.
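
The filtering step can be sketched roughly as below. This is only an illustration, not the script that produced the numbers above: the column names `id` and `abstract` and the set of strings treated as null are assumptions, since the layout of `ClydeDB.csv` is not shown in this notebook.

```python
import csv
import io

# Assumed markers for a missing abstract; adjust to the real data.
NULL_MARKERS = {"", "null", "none", "nan", "n/a"}


def filter_valid_abstracts(csv_file, text_column="abstract"):
    """Return (valid_rows, total_count), dropping rows whose abstract is null."""
    valid_rows = []
    total = 0
    for row in csv.DictReader(csv_file):
        total += 1
        text = (row.get(text_column) or "").strip()
        if text.lower() not in NULL_MARKERS:
            valid_rows.append(row)
    return valid_rows, total


# Tiny in-memory stand-in for the real CSV file (hypothetical rows).
sample = io.StringIO(
    "id,abstract\n"
    "1,We study sentence segmentation.\n"
    "2,\n"
    "3,NULL\n"
    '4,"Recent research, broadly speaking, has grown."\n'
)
rows, total = filter_valid_abstracts(sample)
print(f"{len(rows)} valid out of {total} total abstracts")
```

On the real file, the two printed counts would correspond to the valid and total figures reported above.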



In this section, we preprocess the documents by parsing them into *contexts*. *Candidates* are then extracted from the *contexts*; each candidate is an *instance* of one of four segment types: *background*, *mechanism*, *method*, or *findings*.

In [1]:
%load_ext autoreload
%autoreload 2
%matplotlib inline
import os

from snorkel import SnorkelSession
session = SnorkelSession()

# Here we just set how many documents to process for automatic testing; you can safely ignore this!
n_docs = 500 if 'CI' in os.environ else 1000  # use 60,000 for the real dataset

from snorkel.parser import TSVDocPreprocessor

doc_preprocessor = TSVDocPreprocessor('data/70kpaper_061418_cleaned.tsv', encoding="utf-8", max_docs=n_docs)

The cell below reports the number of documents and sentences. Loading ~60K papers can take 5-8 minutes (watch the progress bar; exceptions may also occur). The code parses each document into sentences at periods, averaging 4.49 sentences per document. Earlier I spent a few hours debugging a hidden formatting error that confuses spaCy: when converting the raw data from .csv to .tsv, make sure the fields are written *without* leading and trailing quotes.
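
One way to avoid the stray quotes is to let Python's `csv` module read the CSV (it removes the quotes that wrap a field) and then write the tab-separated lines by hand. The sketch below assumes a two-column layout (document name, abstract text) and that the preprocessor expects one document per line in the form `name<TAB>text`; both are assumptions, and the function name `csv_to_tsv` is hypothetical.

```python
import csv
import io


def csv_to_tsv(csv_file, tsv_file):
    """Convert (name, text) CSV rows into unquoted 'name<TAB>text' lines."""
    written = 0
    for name, text in csv.reader(csv_file):
        # csv.reader has already removed the quotes that wrapped the field.
        # Collapse tabs and newlines so each document is one line with one tab.
        name = " ".join(name.split())
        text = " ".join(text.split())
        if text:
            tsv_file.write(f"{name}\t{text}\n")
            written += 1
    return written


# In-memory stand-ins for the real input and output files (hypothetical rows).
source = io.StringIO(
    'doc1,"An observation system, for example, includes\nan imaging system."\n'
    "doc2,Plain abstract without commas.\n"
)
target = io.StringIO()
count = csv_to_tsv(source, target)
print(target.getvalue())
```

Writing the lines manually, rather than through `csv.writer`, keeps the output free of any quoting or escaping that the downstream parser would otherwise see as part of the text.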


In [2]:
from snorkel.parser.spacy_parser import Spacy
from snorkel.parser import CorpusParser


corpus_parser = CorpusParser(parser=Spacy())
%time corpus_parser.apply(doc_preprocessor, count=n_docs)


from snorkel.models import Document, Sentence

print("Documents:", session.query(Document).count())
print("Sentences:", session.query(Sentence).count())

Clearing existing...
Running UDF...

CPU times: user 32.8 s, sys: 1.27 s, total: 34 s
Wall time: 22.9 s
Documents: 1000
Sentences: 4487


Next, we extract *candidates* by matching a few patterns for abstract segmentation. We take *Background* as an example and will come back to the other segments later.
+ Background: 
  - "Recent research ... ", 
  - "... have/has been widely ...", 
  - "How ... ?" (and as the first sentence), 
  - "Previous work...", 
  - "Motivated by...", 
  - "The success of ...", etc.
+ Mechanism:
  - something
  - some other pattern
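
The Background patterns listed above could be expressed as regular expressions along the following lines. This is a plain-Python sketch, not Snorkel's labeling-function API; anchoring most phrases to the start of the sentence is an assumption about what the "..." in each pattern means, and the function name `looks_like_background` is hypothetical.

```python
import re

# Patterns transcribed from the Background list above.
BACKGROUND_PATTERNS = [
    re.compile(r"^recent research\b", re.IGNORECASE),
    re.compile(r"\b(?:have|has) been widely\b", re.IGNORECASE),
    re.compile(r"^previous work\b", re.IGNORECASE),
    re.compile(r"^motivated by\b", re.IGNORECASE),
    re.compile(r"^the success of\b", re.IGNORECASE),
]
# "How ... ?" only counts when it is the first sentence of the abstract.
QUESTION_PATTERN = re.compile(r"^how\b.*\?$", re.IGNORECASE)


def looks_like_background(sentence, position):
    """Return True if the sentence matches one of the Background patterns.

    `position` is the zero-based index of the sentence within its abstract.
    """
    text = sentence.strip()
    if position == 0 and QUESTION_PATTERN.search(text):
        return True
    return any(p.search(text) for p in BACKGROUND_PATTERNS)


# Hypothetical three-sentence abstract.
abstract = [
    "How do readers skim abstracts?",
    "Neural models have been widely adopted for this task.",
    "We propose a new segmentation method.",
]
flags = [looks_like_background(s, i) for i, s in enumerate(abstract)]
print(flags)
```

A function like this could later be wrapped so that it returns a vote or abstains for each candidate, which is the shape a labeling function usually takes.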

In [9]:
from snorkel.models import Document

docs = session.query(Document).order_by(Document.name).all()

for sent in (docs[0].sentences):
    print(sent.__dict__.keys())
    print(sent.text)
    print(sent.pos_tags)
    print(sent.ner_tags)
    

# sents=session.query(Sentence).order_by(Sentence.name).all()
# print(sents[0])



dict_keys(['_sa_instance_state', 'char_offsets', 'text', 'document_id', 'dep_labels', 'entity_types', 'stable_id', 'dep_parents', 'pos_tags', 'abs_char_offsets', 'words', 'position', 'id', 'entity_cids', 'type', 'ner_tags', 'lemmas', 'document'])
"An observation system for viewing light-sensitive tissue includes an illumination system configured to illuminate the light-sensitive tissue, an imaging system configured to image at least a portion of the light-sensitive tissue upon being illuminated by the illumination system, and an image display system in communication with the imaging system to display an image of the

In [None]:
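from snorkel.models import candidate_subclass

# One-argument candidate type for Background spans.
Background = candidate_subclass('Background', ['background'])


from snorkel.candidates import Ngrams, CandidateExtractor
from snorkel.matchers import DictionaryMatch

# Seed phrases taken from the Background patterns listed above (an assumed starting dictionary).
background_phrases = ['recent research', 'previous work', 'motivated by', 'the success of']

ngrams         = Ngrams(n_max=30)
# DictionaryMatch needs a dictionary of phrases passed as d=...
dict_matcher   = DictionaryMatch(d=background_phrases, ignore_case=True)
cand_extractor = CandidateExtractor(Background, [ngrams], [dict_matcher])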

Thanks for reading! 

Some debugging notes at the very end (can be ignored).

```
python -m spacy download en
```

Current issue: the parser does not split sentences *by periods*. The sentence count is significantly lower than expected!
Potential fix: https://github.com/explosion/spaCy/issues/93
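
To judge whether the sentence count really is too low, a naive period-based splitter can serve as a rough baseline to compare against. The sketch below is not what the parser does internally; it is a deliberately simple regular-expression split, and it will over-split on abbreviations such as "e.g." when they are followed by a capitalized word.

```python
import re

# Split at whitespace that follows ., ! or ? and precedes an uppercase
# letter, a digit, a quote, or an opening parenthesis.
_BOUNDARY = re.compile(r"(?<=[.!?])\s+(?=[A-Z0-9\"'(])")


def naive_sentences(text):
    """Split text on sentence-final punctuation; a rough baseline only."""
    return [s.strip() for s in _BOUNDARY.split(text.strip()) if s.strip()]


raw_text = "Hello, world. Here are two sentences."
print(naive_sentences(raw_text))
```

Running `len(naive_sentences(text))` over a sample of abstracts and comparing the total against the `Sentence` count in the database gives a quick sanity check on how far off the parser is.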

======= More debugging logs (optional reading) ======
~~~~
Xins-MacBook-Pro:~ xin$ source activate snorkel
(snorkel) Xins-MacBook-Pro:~ xin$ python
Python 3.6.4 |Anaconda custom (64-bit)| (default, Jan 16 2018, 12:04:33) 
[GCC 4.2.1 Compatible Clang 4.0.1 (tags/RELEASE_401/final)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import spacey
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ModuleNotFoundError: No module named 'spacey'
>>> import spacy
>>> spacy.load('en')
<spacy.en.English object at 0x1080e1da0>
>>> model=spacy.load('en')
>>> docs=model.tokenizer('Hello, world. Here are two sentences.')
>>> for sent in docs.sents:
...     pritn(sent.text)
... 
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "spacy/tokens/doc.pyx", line 439, in __get__ (spacy/tokens/doc.cpp:9808)
ValueError: Sentence boundary detection requires the dependency parse, which requires data to be installed. For more info, see the documentation: 
https://spacy.io/docs/usage

>>> for sent in docs.sents:
...     print(sent.text)
... 
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "spacy/tokens/doc.pyx", line 439, in __get__ (spacy/tokens/doc.cpp:9808)
ValueError: Sentence boundary detection requires the dependency parse, which requires data to be installed. For more info, see the documentation: 
https://spacy.io/docs/usage

>>> from spacy.en import English
>>> nlp = English()
>>> doc = nlp(raw_text)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
NameError: name 'raw_text' is not defined
>>> raw_text='Hello, world. Here are two sentences.'
>>> doc = nlp(raw_text)
>>> sentences = [sent.string.strip() for sent in doc.sents]
>>> sentences
['Hello, world.', 'Here are two sentences.']
>>> model(raw_text)
Hello, world. Here are two sentences.
>>> docs=model(raw_text)
>>> docs.sents
<generator object at 0x14ad31948>
>>> docs=model(raw_text)
>>> sentences = [sent.string.strip() for sent in docs.sents]
>>> sentences
['Hello, world.', 'Here are two sentences.']
>>> 
~~~~
