Test for `convert_annotations_to_spacy`, which converts bibliographies to spaCy docs.

> Note: `convert_annotations_to_spacy` requires `node processManuscript.js` to be running in the `dataset-creation` directory.

## Convert

In [1]:
!PYTHONPATH=.. python -m convert_annotations_to_spacy --help

Usage: python -m convert_annotations_to_spacy [OPTIONS] CROSSREF_DIR

  Sends crossref files to CSL processor and convert rendered annotated
  biblioraphies into Spacy DocBin files Dependency:     node
  processManuscript.js     needs to be started, for additional details see
  dataset-creation/README.md Args:     crossref_dir: a directory with files
  downloaded from crossref, see dataset-creation/crossref/crossrefDownload.py
  output_dir: where to put *.docbin files     parts: how many docbin files to
  be generated     parallel: how manu processes to use. You might want to
  launch more               than one dataset-creation/processManuscript.js
  node process, i.e. in k8s cluster,               if you need high
  parallelism

Arguments:
  CROSSREF_DIR  [required]

Options:
  --output-dir PATH               [default: train.ref]
  --parts INTEGER                 [default: 100]
  --parallel INTEGER              [default: 2]
  --install-completion [bash|zsh|fish|po

In [2]:
!rm -rf out && mkdir out
!PYTHONPATH=.. python -m convert_annotations_to_spacy crossref --output-dir out --parts 2

CSL processor supports 1747 styles
CSL processor supports 1747 styles
references.1.spacy: 100%|█████████████████████████| 2/2 [00:04<00:00,  2.41s/it]
references.0.spacy: 100%|█████████████████████████| 3/3 [00:05<00:00,  1.95s/it]
convert:done: out/references.0.spacy
convert:done: out/references.1.spacy


## Load

In [3]:
import spacy
from spacy.tokens import Doc, DocBin

from spacy import displacy


In [4]:
db = DocBin()
db.from_disk("out/references.0.spacy")

nlp = spacy.blank("en")

# length of each doc, in tokens
print([len(doc) for doc in db.get_docs(nlp.vocab)])


[365, 170, 89, 139, 264, 182, 148, 351, 299, 284, 238, 191, 207, 164, 153, 285, 205, 319, 207, 267, 162, 160, 197, 528, 110, 155, 892, 113, 219]


In [5]:
# doc = list(db.get_docs(nlp.vocab))[0]
for _doc in db.get_docs(nlp.vocab):
    displacy.render(_doc, style="ent") 

In [6]:
[(span.label_, span.text) for span in _doc.spans["bib"]]

[('family', 'Georgeou'),
 ('given', 'N'),
 ('family', 'Engel'),
 ('author', 'Georgeou N, Engel'),
 ('year', '2011'),
 ('issued', '2011'),
 ('title',
  'The Impact of Neoliberalism and New Managerialism on Development Volunteering: An Australian Case Study'),
 ('container-title', 'Australian Journal of Political Science'),
 ('bib',
  'Georgeou N, Engel S. 2011. The Impact of Neoliberalism and New Managerialism on Development Volunteering: An Australian Case Study. Australian Journal of Political Science. 46:297–311.\n'),
 ('title', 'Nobelist Studied Life Particles'),
 ('year', '1957'),
 ('issued', '1957'),
 ('container-title', 'The Science News-Letter'),
 ('bib',
  'Nobelist Studied Life Particles. 1957. . The Science News-Letter. 72:293.\n'),
 ('family', 'Yamada'),
 ('given', 'Y'),
 ('family', 'Ito'),
 ('given', 'S'),
 ('family', 'Kayamori'),
 ('given', 'R'),
 ('family', 'Shibata'),
 ('author', 'Yamada Y, Ito S, Kayamori R, Shibata'),
 ('year', '1979'),
 ('issued', '1979'),
 ('title',


In [7]:
_doc[0].is_sent_start

True

### Augment

#### Problem to solve:

References can span multiple lines. For example, see https://libhelp.ncl.ac.uk/faq/183908 :
>"In order to create a hanging indent on the second line of each reference in your bibliography (i.e. in order to get your author names to stand out)"

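To make the target layout concrete, here is a minimal sketch that wraps one of the references from this notebook with a hanging indent, using only the standard library. The helper name `hanging_indent`, the width of 60 characters and the two-space indent are assumptions chosen for illustration; real bibliographies vary in both.

```python
import textwrap


def hanging_indent(reference: str, width: int = 60, indent: str = "  ") -> str:
    """Wrap one reference so that every line after the first is indented."""
    return textwrap.fill(reference, width=width, subsequent_indent=indent)


reference = (
    "Georgeou N, Engel S. 2011. The Impact of Neoliberalism and New "
    "Managerialism on Development Volunteering: An Australian Case Study. "
    "Australian Journal of Political Science. 46:297–311."
)
wrapped = hanging_indent(reference)
print(wrapped)
```

Each continuation line starts with a newline followed by the indent, which is the kind of whitespace the training data should contain.
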
#### Solution:

Implement a [data augmenter](https://spacy.io/usage/training#data-augmentation) that inserts "\n" followed by a hanging indent.

#### Inserting the extra whitespace changes the tokenization; see this note from the spaCy docs:
>" Note that if your data augmentation strategy involves changing the tokenization (for instance, removing or adding tokens) and your training examples include token-based annotations like the dependency parse or entity labels, you’ll need to take care to adjust the Example object so its annotations match and remain valid."

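One consequence of the note above: any annotation stored as token offsets has to move when a token is inserted. The sketch below shows the bookkeeping on plain tuples, without spaCy. The `shift_spans` helper and the `(start, end, label)` representation (end exclusive) are hypothetical and only illustrate the idea; they are not part of the conversion code.

```python
from typing import List, Tuple

# (start, end, label) in token offsets, end exclusive
Span = Tuple[int, int, str]


def shift_spans(spans: List[Span], pos: int, n_inserted: int = 1) -> List[Span]:
    """Adjust spans for n_inserted tokens inserted before token index pos."""
    shifted = []
    for start, end, label in spans:
        # Spans that start at or after pos move right as a whole.
        new_start = start + n_inserted if start >= pos else start
        # Spans that extend past pos grow or move; spans ending at pos are untouched.
        new_end = end + n_inserted if end > pos else end
        shifted.append((new_start, new_end, label))
    return shifted


# Tokens: ["Smith", "J", ".", "2011", "."]; one token is inserted at index 3.
spans = [(0, 2, "author"), (3, 4, "year"), (0, 5, "bib")]
print(shift_spans(spans, pos=3))
# → [(0, 2, 'author'), (4, 5, 'year'), (0, 6, 'bib')]
```

A span that ends exactly at the insert position is left unchanged, so a line break placed right after an author or title does not become part of it.
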
In [8]:

from spacy.training import Example
example = Example(_doc, _doc)

In [9]:
from pprint import pprint
example.to_dict().keys()


dict_keys(['doc_annotation', 'token_annotation'])

In [10]:
example.to_dict()["doc_annotation"].keys()

dict_keys(['cats', 'entities', 'links'])

In [11]:

example.to_dict()["token_annotation"].keys()

dict_keys(['ORTH', 'SPACY', 'TAG', 'LEMMA', 'POS', 'MORPH', 'HEAD', 'DEP', 'SENT_START'])

In [12]:
_d = nlp("test\n  a")
Example(_d,_d).to_dict()


{'doc_annotation': {'cats': {}, 'entities': ['O', 'O', 'O'], 'links': {}},
 'token_annotation': {'ORTH': ['test', '\n  ', 'a'],
  'SPACY': [False, False, False],
  'TAG': ['', '', ''],
  'LEMMA': ['', '', ''],
  'POS': ['', '', ''],
  'MORPH': ['', '', ''],
  'HEAD': [0, 1, 2],
  'DEP': ['', '', ''],
  'SENT_START': [1, 0, 0]}}

In [13]:
from spacy.tokens import Doc
from spacy.training import Example


def insert_hanging_indent(example_dict, pos, orth="\n  "):
    """Insert a whitespace token `orth` at token index `pos`, modifying example_dict in place."""
    token_annotation = example_dict["token_annotation"]

    # Mark the new token as outside any entity.
    example_dict["doc_annotation"]["entities"].insert(pos, "O")
    token_annotation["ORTH"].insert(pos, orth)
    # Drop the trailing space of the previous token so the line break follows it directly.
    if pos > 0:
        token_annotation["SPACY"][pos - 1] = False
    token_annotation["SPACY"].insert(pos, False)
    # The remaining per-token attributes are empty for the inserted token.
    for key in ("TAG", "LEMMA", "POS", "MORPH", "DEP"):
        token_annotation[key].insert(pos, "")
    token_annotation["SENT_START"].insert(pos, 0)
    # Rebuild HEAD so that every token points to itself, as in the unparsed docs above.
    token_annotation["HEAD"] = list(range(len(token_annotation["ORTH"])))


_d = nlp("test a")
example_dict = Example(_d, _d).to_dict()
insert_hanging_indent(example_dict, 1)
example_dict

{'doc_annotation': {'cats': {}, 'entities': ['O', 'O', 'O'], 'links': {}},
 'token_annotation': {'ORTH': ['test', '\n  ', 'a'],
  'SPACY': [False, False, False],
  'TAG': ['', '', ''],
  'LEMMA': ['', '', ''],
  'POS': ['', '', ''],
  'MORPH': ['', '', ''],
  'HEAD': [0, 1, 2],
  'DEP': ['', '', ''],
  'SENT_START': [1, 0, 0]}}
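
Because the insert touches every per-token list, a quick length check helps catch a forgotten key before the dict is turned back into an `Example`. This is a minimal sketch that works on the plain dict shown above; the helper name `check_token_annotation` is an assumption, not part of spaCy.

```python
def check_token_annotation(example_dict: dict) -> int:
    """Return the token count if all per-token lists agree, else raise ValueError."""
    lengths = {
        key: len(values)
        for key, values in example_dict["token_annotation"].items()
    }
    lengths["entities"] = len(example_dict["doc_annotation"]["entities"])
    if len(set(lengths.values())) != 1:
        raise ValueError(f"Misaligned annotations: {lengths}")
    return lengths["ORTH"]


# Same structure as the output above, after inserting the hanging indent.
sample = {
    "doc_annotation": {"cats": {}, "entities": ["O", "O", "O"], "links": {}},
    "token_annotation": {
        "ORTH": ["test", "\n  ", "a"],
        "SPACY": [False, False, False],
        "TAG": ["", "", ""],
        "LEMMA": ["", "", ""],
        "POS": ["", "", ""],
        "MORPH": ["", "", ""],
        "HEAD": [0, 1, 2],
        "DEP": ["", "", ""],
        "SENT_START": [1, 0, 0],
    },
}
print(check_token_annotation(sample))
# → 3
```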

In [14]:
example = Example(_doc, _doc) 
example_dict = example.to_dict()

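# Walk the spans in reverse so that inserts at later positions
# do not shift the positions of spans still to be processed.
for span in reversed(_doc.spans["bib"]):
    if span.label_ in ["title", "author"]:
        # Move the insert position past any punctuation that follows the span.
        pos = span.end
        for i in range(pos, len(_doc)):
            print(span, _doc[i], _doc[i].is_punct)
            if _doc[i].is_punct:
                pos += 1
            else:
                break

        # Insert "|" instead of the default "\n  " so the new token is easy to spot when rendered.
        insert_hanging_indent(example_dict, pos, "|")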

doc = Doc(nlp.vocab, example_dict["token_annotation"]["ORTH"], example_dict["token_annotation"]["SPACY"])
reference_doc = Example.from_dict(doc, example_dict).reference

Government by Judiciary . True
Government by Judiciary The False
Hockett HC, Boudin LB . True
Hockett HC, Boudin LB 1933 False
Torsade De Pointes Triggered by High Grade Fever in Patient with LQT2 Syndrome . True
Torsade De Pointes Triggered by High Grade Fever in Patient with LQT2 Syndrome J False
Shibata R, Ashizawa N, Komiya N, Fukae S, Nakao K, Seto S, Maemura K. False
Lolita: a Janus text . True
Lolita: a Janus text 1995 False
Removal of antibiotic rifampicin from aqueous media by advanced electrochemical oxidation: Role of electrode materials, electrolytes and real water matrices . True
Removal of antibiotic rifampicin from aqueous media by advanced electrochemical oxidation: Role of electrode materials, electrolytes and real water matrices Electrochimica False
Brito LRD, Ganiyu SO, dos Santos EV, Oturan MA, Martínez-Huitle CA . True
Brito LRD, Ganiyu SO, dos Santos EV, Oturan MA, Martínez-Huitle CA 2021 False
The Effects of Vitamin D2on the Somatostatin and Calcitonin Concentrat
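
The position search in the cell above can be factored into a small function that is easy to test on its own. The sketch below works on a list of token strings and treats a token as punctuation only if all its characters are in `string.punctuation`; this ASCII-only check is an assumption and is narrower than spaCy's `is_punct`. The name `skip_trailing_punct` is hypothetical.

```python
import string
from typing import Sequence


def skip_trailing_punct(tokens: Sequence[str], end: int) -> int:
    """Return the first index at or after `end` whose token is not punctuation."""
    pos = end
    while (
        pos < len(tokens)
        and tokens[pos]
        and all(ch in string.punctuation for ch in tokens[pos])
    ):
        pos += 1
    return pos


tokens = ["Lolita", ":", "a", "Janus", "text", ".", "1995", "."]
# The title span covers tokens 0..4, so it ends at index 5 (exclusive).
print(skip_trailing_punct(tokens, 5))
# → 6
```

The returned index is where the line-break token would go: after the full stop that closes the title and before the year.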

In [15]:
displacy.render(reference_doc, style="ent")