Test for `convert_annotations_to_spacy`, which converts annotated bibliographies to spaCy docs.

It depends on the CSL processor in `dataset-creation/processManuscript.js`.


## Convert

In [1]:
!PYTHONPATH=.. python -m convert_annotations_to_spacy --help

Usage: python -m convert_annotations_to_spacy [OPTIONS] CROSSREF_DIR

  Sends crossref files to CSL processor and convert rendered annotated
  biblioraphies into Spacy DocBin files Dependency:     node
  processManuscript.js     needs to be started, for additional details see
  dataset-creation/README.md Args:     crossref_dir: a directory with files
  downloaded from crossref, see dataset-creation/crossref/crossrefDownload.py
  output_dir: where to put *.docbin files     parts: how many docbin files to
  be generated     parallel: how manu processes to use. You might want to
  launch more               than one dataset-creation/processManuscript.js
  node process, i.e. in k8s cluster,               if you need high
  parallelism

Arguments:
  CROSSREF_DIR  [required]

Options:
  --output-dir PATH               [default: train.ref]
  --parts INTEGER                 [default: 100]
  --parallel INTEGER              [default: 2]
  --install-completion [bash|zsh|fish|po

In [2]:
!rm -rf out && mkdir out
!PYTHONPATH=.. python -m convert_annotations_to_spacy crossref --output-dir out --parts 2

CSL processor: /Users/vdaviden/Code/GIANT/dataset-creation/processManuscript.js, pid: 76610
CSL processor: /Users/vdaviden/Code/GIANT/dataset-creation/processManuscript.js, pid: 76611
starting at http://localhost:3000
starting at http://localhost:3001
CSL processor supports 2473 styles
CSL processor supports 2473 styles
references.1.spacy: 100%|█████████████████████████| 2/2 [00:02<00:00,  1.40s/it]
references.0.spacy: 100%|█████████████████████████| 3/3 [00:03<00:00,  1.23s/it]
convert:done: out/references.0.spacy
convert:done: out/references.1.spacy


## Load

In [3]:
import spacy
from spacy.tokens import Doc, DocBin

from spacy import displacy


In [4]:
!ls 

build_dataset.ipynb                out
convert_annotations_to_spacy.ipynb rendered_biblioraphy.json
crossref                           test-model.ipynb


In [5]:
db = DocBin()
db.from_disk("out/references.0.spacy")

nlp = spacy.blank("en")

# Document lengths, in tokens
print([len(doc) for doc in db.get_docs(nlp.vocab)])


[330, 306, 195, 151, 237, 267, 394, 287, 468, 312, 334, 321, 227, 309, 380, 243, 363, 289, 197, 382, 274, 324, 390, 287, 347, 198, 220, 360, 225, 147, 143]


In [6]:
# doc = list(db.get_docs(nlp.vocab))[0]
# Render every doc; after the loop `_doc` refers to the last one and is reused below.
for _doc in db.get_docs(nlp.vocab):
    displacy.render(_doc, style="ent")

In [7]:
[(span.label_, span.text) for span in _doc.spans["bib"]]

[('title', 'Title Page'),
 ('year', '1990'),
 ('issued', '1990'),
 ('container-title', 'Obstetrics and Gynecology Clinics of North America'),
 ('publisher', 'Elsevier BV'),
 ('bib',
  'Title Page. (1990) . Obstetrics and Gynecology Clinics of North America, 17, i. Elsevier BV.'),
 ('year', '2019'),
 ('issued', '2019'),
 ('title', 'Synaphea nexosa: Butcher, R'),
 ('publisher', 'IUCN'),
 ('container-title', 'IUCN Red List of Threatened Species'),
 ('url', '/iucn.uk.2020-'),
 ('bib',
  '(2019) Synaphea nexosa: Butcher, R.. IUCN. IUCN Red List of Threatened Species. &lt;http://dx.doi.org/10.2305/iucn.uk.2020-3.rlts.t118502389a121863300.en.&gt;\n'),
 ('family', 'M. R. F'),
 ('author', 'M. R. F'),
 ('year', '1975'),
 ('issued', '1975'),
 ('title', 'UNFAIR DISMISSAL'),
 ('container-title', 'Industrial Law Journal'),
 ('page', '114–116'),
 ('publisher', 'Oxford University Press (OUP)'),
 ('bib',
  'M. R. F.. (1975) UNFAIR DISMISSAL. Industrial Law Journal, 4, 114–116. Oxford University Press (

In [8]:
_doc[0].is_sent_start

True

### Augment

#### Problem to solve:

References can span multiple lines. For example, see https://libhelp.ncl.ac.uk/faq/183908:
>"In order to create a hanging indent on the second line of each reference in your bibliography (i.e. in order to get your author names to stand out)"

#### Solution:
   
Implement a [data augmenter](https://spacy.io/usage/training#data-augmentation) that inserts "\n" followed by a hanging indent.

Adding this extra whitespace token changes the tokenization; see the note in the spaCy docs:
>" Note that if your data augmentation strategy involves changing the tokenization (for instance, removing or adding tokens) and your training examples include token-based annotations like the dependency parse or entity labels, you’ll need to take care to adjust the Example object so its annotations match and remain valid."

In [9]:

from spacy.training import Example
example = Example(_doc, _doc)
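
The spaCy note quoted above says that token-based annotations must be adjusted when tokens are added. As a minimal plain-Python sketch of what that adjustment means (no spaCy required), the snippet below shifts token-offset spans after one token is inserted. The helper `shift_spans` and the `(label, start, end)` tuple layout are illustrative assumptions, not part of spaCy or of this repository.

```python
from typing import List, Tuple

# Hypothetical span layout: (label, start_token, end_token), end exclusive.
Span = Tuple[str, int, int]


def shift_spans(spans: List[Span], pos: int) -> List[Span]:
    """Return spans adjusted for a single token inserted at index `pos`."""
    shifted = []
    for label, start, end in spans:
        if start >= pos:
            # The whole span sits at or after the insertion point: move it right.
            start, end = start + 1, end + 1
        elif end > pos:
            # The insertion falls strictly inside the span: the span grows by one.
            end += 1
        shifted.append((label, start, end))
    return shifted


# Simplified token list for one reference (an assumption for illustration).
tokens = ["Title", "Page", ".", "(", "1990", ")"]
spans = [("title", 0, 2), ("year", 4, 5)]

# Insert a hanging indent token after the title's trailing period.
tokens.insert(3, "\n  ")
spans = shift_spans(spans, 3)
```

Spans that end before the insertion point are untouched, while later spans move one token to the right, so `tokens[start:end]` still selects the same words.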

In [10]:
from pprint import pprint
example.to_dict().keys()


dict_keys(['doc_annotation', 'token_annotation'])

In [11]:
example.to_dict()["doc_annotation"].keys()

dict_keys(['cats', 'entities', 'links'])

In [12]:

example.to_dict()["token_annotation"].keys()

dict_keys(['ORTH', 'SPACY', 'TAG', 'LEMMA', 'POS', 'MORPH', 'HEAD', 'DEP', 'SENT_START'])

In [13]:
_d = nlp("test\n  a")
Example(_d,_d).to_dict()


{'doc_annotation': {'cats': {}, 'entities': ['O', 'O', 'O'], 'links': {}},
 'token_annotation': {'ORTH': ['test', '\n  ', 'a'],
  'SPACY': [False, False, False],
  'TAG': ['', '', ''],
  'LEMMA': ['', '', ''],
  'POS': ['', '', ''],
  'MORPH': ['', '', ''],
  'HEAD': [0, 1, 2],
  'DEP': ['', '', ''],
  'SENT_START': [1, 0, 0]}}

In [14]:
from spacy.tokens import Doc
from spacy.training import Example

def insert_hanging_indent(example_dict, pos, orth="\n  "):
    """Insert a whitespace token `orth` at token index `pos`, in place.

    `example_dict` is the output of `Example.to_dict()`. Every per-token list
    gets one new entry so that all annotations stay aligned.
    """
    token_annotation = example_dict["token_annotation"]

    # Tag the new token as outside any entity ("O").
    example_dict["doc_annotation"]["entities"].insert(pos, "O")
    token_annotation["ORTH"].insert(pos, orth)
    # No trailing space before the inserted token, and none after it.
    if pos > 0:
        token_annotation["SPACY"][pos - 1] = False
    token_annotation["SPACY"].insert(pos, False)
    # Empty values for the linguistic attributes of the new token.
    for key in ("TAG", "LEMMA", "POS", "MORPH", "DEP"):
        token_annotation[key].insert(pos, "")
    token_annotation["SENT_START"].insert(pos, 0)
    # Make every token its own head; this discards any existing dependency heads.
    token_annotation["HEAD"] = list(range(len(token_annotation["ORTH"])))


_d = nlp("test a")
example_dict = Example(_d, _d).to_dict()
insert_hanging_indent(example_dict, 1)
example_dict

{'doc_annotation': {'cats': {}, 'entities': ['O', 'O', 'O'], 'links': {}},
 'token_annotation': {'ORTH': ['test', '\n  ', 'a'],
  'SPACY': [False, False, False],
  'TAG': ['', '', ''],
  'LEMMA': ['', '', ''],
  'POS': ['', '', ''],
  'MORPH': ['', '', ''],
  'HEAD': [0, 1, 2],
  'DEP': ['', '', ''],
  'SENT_START': [1, 0, 0]}}
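
A quick way to sanity-check the result is to rebuild the text from the `ORTH` and `SPACY` lists. The sketch below does this in plain Python on dictionaries shaped like the output above; `rebuild_text` is a hypothetical helper introduced here for illustration, and the "before" dictionary assumes `nlp("test a")` yields `SPACY == [True, False]`.

```python
def rebuild_text(token_annotation):
    """Join tokens, appending a space wherever the SPACY flag is True."""
    return "".join(
        orth + (" " if space else "")
        for orth, space in zip(token_annotation["ORTH"], token_annotation["SPACY"])
    )


# Token annotation before the insertion (assumed shape for "test a").
before = {"ORTH": ["test", "a"], "SPACY": [True, False]}
# Token annotation after the insertion, copied from the output above.
after = {"ORTH": ["test", "\n  ", "a"], "SPACY": [False, False, False]}

text_before = rebuild_text(before)
text_after = rebuild_text(after)
```

Clearing the `SPACY` flag of the preceding token is what keeps a stray space from appearing before the line break.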

In [15]:
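example = Example(_doc, _doc)
example_dict = example.to_dict()

# Walk the spans back to front so that, assuming the span group is in document
# order, earlier token offsets stay valid after each insertion.
for span in reversed(_doc.spans["bib"]):
    if span.label_ in ["title", "author"]:
        # Skip the run of punctuation that follows the span.
        pos = span.end
        for i in range(pos, len(_doc)):
            print(span, _doc[i], _doc[i].is_punct)
            if _doc[i].is_punct:
                pos += 1
            else:
                break

        # "|" is used as a visible marker here instead of the default "\n  ".
        insert_hanging_indent(example_dict, pos, "|")

# Rebuild a Doc from the edited tokens and reattach the annotations.
doc = Doc(nlp.vocab, example_dict["token_annotation"]["ORTH"], example_dict["token_annotation"]["SPACY"])
reference_doc = Example.from_dict(doc, example_dict).reference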

Radiographic visualization of patellar tendon grafts for the reconstruction of the anterior cruciate ligament . True
Radiographic visualization of patellar tendon grafts for the reconstruction of the anterior cruciate ligament Arthroscopy False
Vaquero, J., Vidal, C., and Cubillo, A .. True
Vaquero, J., Vidal, C., and Cubillo, A ( True
Vaquero, J., Vidal, C., and Cubillo, A 1997 False
συνεκβάλλω . True
συνεκβάλλω ( True
συνεκβάλλω no False
UNFAIR DISMISSAL . True
UNFAIR DISMISSAL Industrial False
M. R. F .. True
M. R. F ( True
M. R. F 1975 False
Synaphea nexosa: Butcher, R .. True
Synaphea nexosa: Butcher, R IUCN False
Title Page . True
Title Page ( True
Title Page 1990 False
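
The position logic traced in the log above can be sketched on a plain token list, without spaCy. This is an illustration under stated assumptions: the token list is a simplified guess at how one reference is tokenized, `indent_position` is a hypothetical helper, and `string.punctuation` stands in for spaCy's `is_punct`.

```python
import string

PUNCT = frozenset(string.punctuation)


def indent_position(tokens, span_end):
    """Return the index just past the punctuation run that follows a span."""
    pos = span_end
    # A token counts as punctuation if it is non-empty and all-punctuation.
    while pos < len(tokens) and tokens[pos] and all(ch in PUNCT for ch in tokens[pos]):
        pos += 1
    return pos


# Simplified tokens for one reference (assumed, not taken from the DocBin).
tokens = ["M.", "R.", "F", "..", "(", "1975", ")",
          "UNFAIR", "DISMISSAL", ".", "Industrial"]
span_ends = [3, 9]  # author ends at token 3, title ends at token 9

# Insert from the last span to the first so earlier offsets are not shifted.
for end in sorted(span_ends, reverse=True):
    tokens.insert(indent_position(tokens, end), "|")
```

Because "(" counts as punctuation, the marker after the author lands between the opening parenthesis and the year, which matches the `M. R. F ( True` and `M. R. F 1975 False` lines in the log.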


In [16]:
displacy.render(reference_doc, style="ent")