Test for `convert_annotations_to_spacy`, which converts rendered, annotated bibliographies into spaCy docs.

> Note: `convert_annotations_to_spacy` requires `node processManuscript.js` to be running in the `dataset-creation` directory.

## Convert

In [1]:
!PYTHONPATH=.. python -m convert_annotations_to_spacy --help

Usage: python -m convert_annotations_to_spacy [OPTIONS] CROSSREF_DIR

  Sends crossref files to CSL processor and convert rendered annotated
  biblioraphies into Spacy DocBin files Dependency:     node
  processManuscript.js     needs to be started, for additional details see
  dataset-creation/README.md Args:     crossref_dir: a directory with files
  downloaded from crossref, see dataset-creation/crossref/crossrefDownload.py
  output_dir: where to put *.docbin files     parts: how many docbin files to
  be generated     parallel: how manu processes to use. You might want to
  launch more               than one dataset-creation/processManuscript.js
  node process, i.e. in k8s cluster,               if you need high
  parallelism

Arguments:
  CROSSREF_DIR  [required]

Options:
  --output-dir PATH               [default: train.ref]
  --parts INTEGER                 [default: 100]
  --parallel INTEGER              [default: 2]
  --install-completion [bash|zsh|fish|powershell|pwsh]
     

In [2]:
!rm -rf out && mkdir out
!PYTHONPATH=.. python -m convert_annotations_to_spacy crossref --output-dir out --parts 2

CSL processor supports 1747 styles
references.0.spacy:   0%|                                 | 0/3 [00:00<?, ?it/s]CSL processor supports 1747 styles
references.1.spacy: 100%|█████████████████████████| 2/2 [00:06<00:00,  3.48s/it]
references.0.spacy: 100%|█████████████████████████| 3/3 [00:08<00:00,  2.67s/it]
convert:done: out/references.0.spacy
convert:done: out/references.1.spacy
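
A quick sanity check on the conversion step can be sketched as follows. It assumes the output files are always named `references.<i>.spacy` with `i` running from `0` to `parts - 1`, which is what the log above shows for `--parts 2` but is not confirmed for other values. The helper names `expected_outputs` and `missing_outputs` are hypothetical and not part of the converter.

```python
from pathlib import Path


def expected_outputs(output_dir, parts):
    """Paths the converter is assumed to write: references.<i>.spacy."""
    return [Path(output_dir) / f"references.{i}.spacy" for i in range(parts)]


def missing_outputs(output_dir, parts):
    """Return the expected files that do not exist in output_dir."""
    return [p for p in expected_outputs(output_dir, parts) if not p.is_file()]


# Same directory and part count as the cell above; an empty list
# means every expected file is present.
print(missing_outputs("out", 2))
```

An empty list suggests that each part was written; any path that is listed points to a part that did not complete.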


## Load

In [3]:
import spacy
from spacy.tokens import Doc, DocBin

from spacy import displacy


In [4]:
db = DocBin()
db.from_disk("out/references.0.spacy")

nlp = spacy.blank("en")

# document lengths, in tokens
print([len(doc) for doc in db.get_docs(nlp.vocab)])


[304, 252, 276, 356, 293, 289, 222, 342, 233, 224, 262, 275, 265, 269, 263, 220, 202, 227, 366, 90, 250, 305, 201, 232, 250, 182, 97, 252, 209]
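
To see what the loading step relies on without running the converter, the round trip can be sketched in memory. The tokens and span labels below are made up for illustration; only the `"bib"` span group name is taken from this notebook. The sketch assumes spaCy v3, where `DocBin` serializes span groups along with the tokens.

```python
import spacy
from spacy.tokens import Doc, DocBin, Span

nlp = spacy.blank("en")

# Hand-built document; words and labels are illustrative only.
words = ["Georghiades", "KM", ".", "The", "Ottawa", "Convention", "."]
doc = Doc(nlp.vocab, words=words)
doc.spans["bib"] = [
    Span(doc, 0, 1, label="family"),
    Span(doc, 1, 2, label="given"),
    Span(doc, 3, 6, label="title"),
]

# Serialize to bytes and read back, mirroring to_disk / from_disk.
data = DocBin(docs=[doc]).to_bytes()
restored = list(DocBin().from_bytes(data).get_docs(nlp.vocab))

labels = [(span.label_, span.text) for span in restored[0].spans["bib"]]
print(len(restored[0]), labels)
```

The restored document carries the same tokens and the same labeled spans, which is why a blank `en` pipeline is enough to read the converter's output.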


In [5]:
# To work with only the first document instead:
# doc = list(db.get_docs(nlp.vocab))[0]
for _doc in db.get_docs(nlp.vocab):
    displacy.render(_doc, style="ent")

In [6]:
# `_doc` is still bound to the last document from the loop above
[(span.label_, span.text) for span in _doc.spans["bib"]]

[('given', 'JS'),
 ('family', 'Kun Deng'),
 ('family', 'Mehta'),
 ('given', 'PG'),
 ('family', 'Meyn'),
 ('author', 'JS, Kun Deng, Mehta PG, Meyn'),
 ('title', 'Model reduction for reduced order estimation in traffic models'),
 ('url', 'http://dx.doi.org/10.1109/acc.2008.4586609'),
 ('year', '2008'),
 ('issued', '(2008)'),
 ('bib',
  '[67]Niedbalski JS, Kun Deng, Mehta PG, Meyn S. Model reduction for reduced order estimation in traffic models, http://dx.doi.org/10.1109/acc.2008.4586609, (2008).\n'),
 ('given', 'KM'),
 ('author', 'KM'),
 ('title', 'The Ottawa Convention'),
 ('container-title', 'International Relations'),
 ('page', '51–70'),
 ('year', '1998'),
 ('issued', '(1998)'),
 ('bib',
  '[68]Georghiades KM. The Ottawa Convention. International Relations. 14(3): 51–70 (1998).\n'),
 ('family', '-Cabillic'),
 ('given', 'R'),
 ('family', 'Farhoud'),
 ('given', 'A'),
 ('family', 'Sure'),
 ('given', 'U'),
 ('family', 'Heinze'),
 ('given', 'S'),
 ('family', 'Henzel'),
 ('given', 'M'),
 (
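
The flat list of `(label, text)` pairs is easier to inspect when grouped per reference. The sketch below assumes that each reference's field spans are listed before the `bib` span that covers the whole reference, as in the output above; that ordering is an observation from this run, not a documented guarantee. `group_by_reference` is a hypothetical helper, and the sample pairs are copied from reference [68] above.

```python
# (label, text) pairs copied from the output above, reference [68].
pairs = [
    ("given", "KM"),
    ("author", "KM"),
    ("title", "The Ottawa Convention"),
    ("container-title", "International Relations"),
    ("page", "51–70"),
    ("year", "1998"),
    ("issued", "(1998)"),
    ("bib", "[68]Georghiades KM. The Ottawa Convention. "
            "International Relations. 14(3): 51–70 (1998).\n"),
]


def group_by_reference(pairs):
    """Collect field spans into one record per closing 'bib' span."""
    records, fields = [], []
    for label, text in pairs:
        if label == "bib":
            records.append({"text": text.strip(), "fields": fields})
            fields = []
        else:
            fields.append((label, text))
    return records


records = group_by_reference(pairs)
for record in records:
    print(record["text"])
    for label, text in record["fields"]:
        print(f"  {label}: {text}")
```

Grouping this way makes annotation gaps easier to spot, such as a reference whose fields contain a `given` span but no `family` span.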

In [7]:
_doc[0].is_sent_start

True