## Test Run 1: Load AbstractNet Dataset: 70K unlabeled (and 2K labeled) Abstracts into DB

This notebook loads the dataset and creates labeled *candidates* using labeling functions. Before you begin, make sure you have followed the project-level ``README.md`` and installed all Python dependencies, e.g. ``tika``.

We filtered out null abstracts from `ClydeDB.csv` ([AbstractSegmentationCrowdNLP Git repo](https://github.com/zhoujieli/AbstractSegmentationCrowdNLP.git)), leaving 48,914 valid abstracts out of 56,851 in total. These 48,914 abstracts are saved to `data/70kpaper.tsv`.
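The filtering step itself is not shown in this notebook, so the following is only a minimal sketch of how it could be done with the standard library. The column names `id` and `abstract`, and the choice to treat the strings `null`, `nan`, and `none` as missing values, are assumptions for illustration and not taken from `ClydeDB.csv`. The output uses one `doc_name<TAB>doc_text` record per line, which is the layout `TSVDocPreprocessor` reads.

```python
import csv
import io


def filter_abstracts(csv_file, tsv_file, id_col="id", text_col="abstract"):
    """Copy rows with a non-empty abstract to a two-column TSV.

    id_col and text_col are hypothetical column names; adjust them to the
    real CSV header. Returns the tuple (kept, total).
    """
    kept = total = 0
    for row in csv.DictReader(csv_file):
        total += 1
        text = (row.get(text_col) or "").strip()
        # Assumption: these literal strings also mark a missing abstract.
        if not text or text.lower() in {"null", "nan", "none"}:
            continue
        # Collapse tabs and newlines so each record stays on a single line.
        text = " ".join(text.split())
        tsv_file.write(f"{row[id_col]}\t{text}\n")
        kept += 1
    return kept, total


# Small in-memory demonstration with made-up rows.
sample = io.StringIO('id,abstract\n1,"Recent research shows\tX."\n2,\n3,NULL\n')
out = io.StringIO()
print(filter_abstracts(sample, out))  # (1, 3)
```

With real files, the same function could be called with `open("ClydeDB.csv", newline="", encoding="utf-8")` as input and `open("data/70kpaper.tsv", "w", encoding="utf-8")` as output.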



In this section, we preprocess documents by parsing them into *contexts*. *Candidates* are then extracted from the *contexts*; each candidate is an *instance* of one of four segment types: *background*, *mechanism*, *method*, or *findings*.

In [1]:
%load_ext autoreload
%autoreload 2
%matplotlib inline
import os

from snorkel import SnorkelSession
session = SnorkelSession()

# Number of documents to process. The CI branch is only for automatic testing; you can safely ignore it.
n_docs = 500 if 'CI' in os.environ else 230  # 60,000 for the real dataset

from snorkel.parser import TSVDocPreprocessor

doc_preprocessor = TSVDocPreprocessor('data/70kpaper.tsv', encoding="utf-8", max_docs=n_docs)

Next, parse the corpus and print the number of documents and sentences, as below. Loading ~60K papers can take 5-8 minutes (watch the progress bar), and parsing may also raise an exception.

In [3]:
from snorkel.parser.spacy_parser import Spacy
from snorkel.parser import CorpusParser


corpus_parser = CorpusParser(parser=Spacy())
%time corpus_parser.apply(doc_preprocessor, count=n_docs)


from snorkel.models import Document, Sentence

print("Documents:", session.query(Document).count())
print("Sentences:", session.query(Sentence).count())

Clearing existing...
Running UDF...

CPU times: user 2.66 s, sys: 449 ms, total: 3.11 s
Wall time: 3.15 s
Documents: 230
Sentences: 399


Next, we extract *candidates* by matching a few patterns for abstract segmentation:
+ Background:
  - "Recent research ..."
  - "... have/has been widely ..."
  - "How ... ?" (when it is the first sentence)
  - "Previous work..."
  - "Motivated by..."
  - "The success of ...", etc.
+ Mechanism:
  - something
  - some other pattern
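To make the Background patterns concrete, here is a minimal sketch of how they might be encoded as regular expressions over plain sentence text. The regexes and the helper name `looks_like_background` are assumptions for illustration, not the notebook's actual labeling functions, and the Mechanism patterns are omitted because they are not specified above.

```python
import re

# Sentence-initial cues from the Background list (an assumed encoding).
_START_CUES = re.compile(
    r"^\s*(?:recent research|previous work|motivated by|the success of)\b",
    re.IGNORECASE,
)
# "... have/has been widely ..." may appear anywhere in the sentence.
_WIDELY = re.compile(r"\b(?:have|has) been widely\b", re.IGNORECASE)
# "How ... ?" only counts when it is the first sentence of the abstract.
_HOW_QUESTION = re.compile(r"^\s*how\b.*\?\s*$", re.IGNORECASE | re.DOTALL)


def looks_like_background(sentence, position):
    """Return True if the sentence matches one of the Background cues.

    position is the 0-based index of the sentence within its abstract.
    """
    if _START_CUES.search(sentence) or _WIDELY.search(sentence):
        return True
    return position == 0 and bool(_HOW_QUESTION.search(sentence))
```

A helper like this operates on raw strings, so it could be called on the text of each parsed sentence together with its index in the abstract. Wrapping it in a Snorkel labeling function is left out here because the candidate schema is not shown in this section.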

In [7]:
from snorkel.models import Document
from util import number_of_people

docs = session.query(Document).order_by(Document.name).all()

print(type(docs[0]))

<class 'snorkel.models.context.Document'>
