Test for `convert_annotations_to_spacy`, which converts rendered, annotated bibliographies into spaCy `DocBin` files.

It depends on `dataset-creation/processManuscript.js`.


## Convert

In [1]:
!PYTHONPATH=.. python -m convert_annotations_to_spacy --help

Usage: python -m convert_annotations_to_spacy [OPTIONS] CROSSREF_DIR

  Sends crossref files to CSL processor and convert rendered annotated
  biblioraphies into Spacy DocBin files Dependency:     node
  processManuscript.js     needs to be started, for additional details see
  dataset-creation/README.md Args:     crossref_dir: a directory with files
  downloaded from crossref, see dataset-creation/crossref/crossrefDownload.py
  output_dir: where to put *.docbin files     parts: how many docbin files to
  be generated     parallel: how manu processes to use. You might want to
  launch more               than one dataset-creation/processManuscript.js
  node process, i.e. in k8s cluster,               if you need high
  parallelism

Arguments:
  CROSSREF_DIR  [required]

Options:
  --output-dir PATH               [default: train.ref]
  --parts INTEGER                 [default: 100]
  --parallel INTEGER              [default: 2]
  --install-completion [bash|zsh|fish|po

In [2]:
!rm -rf out && mkdir out
!PYTHONPATH=.. python -m convert_annotations_to_spacy crossref --output-dir out --parts 2

CSL processor: /Users/vdaviden/Code/GIANT/dataset-creation/processManuscript.js, pid: 81541
CSL processor: /Users/vdaviden/Code/GIANT/dataset-creation/processManuscript.js, pid: 81542
starting at http://localhost:3000
starting at http://localhost:3001
CSL processor supports 2473 styles
CSL processor supports 2473 styles
references.1.spacy: 100%|█████████████████████████| 2/2 [00:01<00:00,  1.01it/s]
references.0.spacy: 100%|█████████████████████████| 3/3 [00:02<00:00,  1.06it/s]
convert:done: out/references.0.spacy
convert:done: out/references.1.spacy


## Load

In [3]:
import spacy
from spacy.tokens import Doc, DocBin

from spacy import displacy


In [4]:
!ls 

build_dataset.ipynb                out
convert_annotations_to_spacy.ipynb rendered_biblioraphy.json
crossref                           test-model.ipynb


In [5]:
db = DocBin()
db.from_disk("out/references.0.spacy")

nlp = spacy.blank("en")

# Document lengths, in tokens
print([len(doc) for doc in db.get_docs(nlp.vocab)])


[256, 224, 196, 206, 160, 139, 154, 238, 411, 398, 214, 227, 222, 192, 260, 247, 357, 370, 259, 426, 316, 207, 189, 181, 151, 157, 273, 188, 225, 257, 127]


In [6]:
# Render the entities of every doc; `_doc` keeps the last doc for the cells below.
for _doc in db.get_docs(nlp.vocab):
    displacy.render(_doc, style="ent")

In [7]:
[(span.label_, span.text) for span in _doc.spans["bib"]]

[('family', 'SILVEIRA'),
 ('given', 'E. K. P. D.'),
 ('author', 'SILVEIRA, E. K. P. D.'),
 ('year', '1975'),
 ('issued', '1975'),
 ('title', 'The management of Caribbean and Amazonian manatees in captivity'),
 ('container-title', 'International Zoo Yearbook'),
 ('page', '223–226'),
 ('bib',
  'SILVEIRA, E. K. P. D. (1975) The management of Caribbean and Amazonian manatees in captivity. International Zoo Yearbook 15, (1) 223–226. https://doi.org/10.1111/j.1748-1090.1975.tb01405.x'),
 ('year', '1979'),
 ('issued', '1979'),
 ('container-title', 'Japanese Journal of Radiological Technology'),
 ('page', '679'),
 ('bib',
  ' (1979) Japanese Journal of Radiological Technology 34, (5) 679. https://doi.org/10.6009/jjrt.kj00003104804'),
 ('family', 'Grossman'),
 ('given', 'N.'),
 ('author', 'Grossman, N.'),
 ('year', '2018'),
 ('issued', '2018'),
 ('title', 'Drones and Terrorism'),
 ('publisher', 'I.B.Tauris'),
 ('bib',
  'Grossman, N. (2018) Drones and Terrorism. I.B.Tauris https://doi.org/10.5

In [8]:
_doc[0].is_sent_start

True

### Augment

#### Problem to solve:

References can span multiple lines. For example, see https://libhelp.ncl.ac.uk/faq/183908:
>"In order to create a hanging indent on the second line of each reference in your bibliography (i.e. in order to get your author names to stand out)"
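
To make the layout concrete, here is a minimal sketch that wraps one of the references from this notebook with a hanging indent. It uses the standard-library `textwrap` module purely for illustration; the width of 60 and the two-space indent are assumptions, not values taken from the dataset.

```python
import textwrap

# One reference from the output above, shortened to drop the DOI.
reference = (
    "SILVEIRA, E. K. P. D. (1975) The management of Caribbean and "
    "Amazonian manatees in captivity. International Zoo Yearbook 15, (1) 223–226."
)

# `subsequent_indent` indents every line except the first: a hanging indent.
wrapped = textwrap.fill(reference, width=60, subsequent_indent="  ")
print(wrapped)
```

Every line after the first starts with the indent, which is the pattern the augmenter needs to reproduce at token level.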

#### Solution:
   
Implement a [data augmenter](https://spacy.io/usage/training#data-augmentation) that adds "\n" followed by a hanging indent.
   
#### Adding whitespace changes the tokenization; see this note from the spaCy docs:
>" Note that if your data augmentation strategy involves changing the tokenization (for instance, removing or adding tokens) and your training examples include token-based annotations like the dependency parse or entity labels, you’ll need to take care to adjust the Example object so its annotations match and remain valid."
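
The adjustment the note describes can be sketched in plain Python, without spaCy. Inserting one token at index `pos` shifts every token-indexed annotation at or after that position. The helper name `shift_span` and the token list are hypothetical and only illustrate the bookkeeping.

```python
def shift_span(start, end, pos):
    """Return the token span [start, end) after one token is inserted at index `pos`."""
    if pos <= start:
        return start + 1, end + 1  # inserted before the span: the whole span moves right
    if pos < end:
        return start, end + 1      # inserted inside the span: the span grows by one token
    return start, end              # inserted at or after the end: the span is unchanged


# Hypothetical tokenization of "Grossman, N. (2018)".
tokens = ["Grossman", ",", "N.", "(", "2018", ")"]
author = (0, 3)
year = (4, 5)

# Insert a line-break token at index 4, right after the opening parenthesis.
print(shift_span(*author, 4))  # → (0, 3)
print(shift_span(*year, 4))    # → (5, 6)
```

Inserting only between annotated spans, as in the first two cases, avoids ever splitting an annotation.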

In [9]:

from spacy.training import Example
example = Example(_doc, _doc)

In [10]:
from pprint import pprint
example.to_dict().keys()


dict_keys(['doc_annotation', 'token_annotation'])

In [11]:
example.to_dict()["doc_annotation"].keys()

dict_keys(['cats', 'entities', 'links'])

In [12]:

example.to_dict()["token_annotation"].keys()

dict_keys(['ORTH', 'SPACY', 'TAG', 'LEMMA', 'POS', 'MORPH', 'HEAD', 'DEP', 'SENT_START'])

In [13]:
_d = nlp("test\n  a")
Example(_d,_d).to_dict()


{'doc_annotation': {'cats': {}, 'entities': ['O', 'O', 'O'], 'links': {}},
 'token_annotation': {'ORTH': ['test', '\n  ', 'a'],
  'SPACY': [False, False, False],
  'TAG': ['', '', ''],
  'LEMMA': ['', '', ''],
  'POS': ['', '', ''],
  'MORPH': ['', '', ''],
  'HEAD': [0, 1, 2],
  'DEP': ['', '', ''],
  'SENT_START': [1, 0, 0]}}

In [14]:
from spacy.tokens import Doc
from spacy.training import Example

def insert_hanging_indent(example_dict, pos, orth="\n  "):
    """Insert the token `orth` at token index `pos`, modifying `example_dict` in place.

    Every per-token list is extended so that it stays aligned with ORTH.
    """
    token_annotation = example_dict["token_annotation"]

    # The inserted token carries no entity label.
    example_dict["doc_annotation"]["entities"].insert(pos, "O")
    token_annotation["ORTH"].insert(pos, orth)
    if pos > 0:
        # The previous token is now followed directly by `orth`, not by a space.
        token_annotation["SPACY"][pos - 1] = False
    token_annotation["SPACY"].insert(pos, False)
    for key in ("TAG", "LEMMA", "POS", "MORPH", "DEP"):
        token_annotation[key].insert(pos, "")
    token_annotation["SENT_START"].insert(pos, 0)
    # Rebuild HEAD so that every token is its own head, as in the output above.
    token_annotation["HEAD"] = list(range(len(token_annotation["ORTH"])))


_d = nlp("test a")
example_dict = Example(_d, _d).to_dict()
insert_hanging_indent(example_dict, 1)
example_dict

{'doc_annotation': {'cats': {}, 'entities': ['O', 'O', 'O'], 'links': {}},
 'token_annotation': {'ORTH': ['test', '\n  ', 'a'],
  'SPACY': [False, False, False],
  'TAG': ['', '', ''],
  'LEMMA': ['', '', ''],
  'POS': ['', '', ''],
  'MORPH': ['', '', ''],
  'HEAD': [0, 1, 2],
  'DEP': ['', '', ''],
  'SENT_START': [1, 0, 0]}}
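
One way to sanity-check the result is to confirm that every per-token list still has the same length as `ORTH`. The sketch below runs on a literal copy of the dict printed above, so it does not need spaCy; `token_lists_aligned` is a hypothetical helper name, not part of spaCy.

```python
def token_lists_aligned(example_dict):
    """Return True if every per-token list has the same length as ORTH."""
    token_annotation = example_dict["token_annotation"]
    n_tokens = len(token_annotation["ORTH"])
    per_token_lists = list(token_annotation.values())
    # The entity tags live under doc_annotation but are also one per token.
    per_token_lists.append(example_dict["doc_annotation"]["entities"])
    return all(len(values) == n_tokens for values in per_token_lists)


# Literal copy of the dict shown above.
augmented = {
    "doc_annotation": {"cats": {}, "entities": ["O", "O", "O"], "links": {}},
    "token_annotation": {
        "ORTH": ["test", "\n  ", "a"],
        "SPACY": [False, False, False],
        "TAG": ["", "", ""],
        "LEMMA": ["", "", ""],
        "POS": ["", "", ""],
        "MORPH": ["", "", ""],
        "HEAD": [0, 1, 2],
        "DEP": ["", "", ""],
        "SENT_START": [1, 0, 0],
    },
}

print(token_lists_aligned(augmented))  # → True
```

A check like this catches a forgotten key early, before `Example.from_dict` is called on the modified dict.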

In [15]:
example = Example(_doc, _doc) 
example_dict = example.to_dict()

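# Walk the spans back to front so that an insertion never shifts a position
# that still has to be processed.
for span in reversed(_doc.spans["bib"]):
    if span.label_ in ("title", "author"):
        # Start right after the span and skip any punctuation that follows it.
        pos = span.end
        for i in range(pos, len(_doc)):
            print(span, _doc[i], _doc[i].is_punct)
            if _doc[i].is_punct:
                pos += 1
            else:
                break

        # "|" is a visible stand-in for the hanging indent.
        insert_hanging_indent(example_dict, pos, "|")

# Rebuild a Doc from the modified tokens and attach the shifted annotations.
doc = Doc(
    nlp.vocab,
    words=example_dict["token_annotation"]["ORTH"],
    spaces=example_dict["token_annotation"]["SPACY"],
)
reference_doc = Example.from_dict(doc, example_dict).reference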

Image of the Month—Quiz Case . True
Image of the Month—Quiz Case Archives False
McKee, T. I. ( True
McKee, T. I. 2012 False
ChemInform Abstract: FERROMAGNETISM IN COPPER(II) OXYDIACETATE HEMIHYDRATE . True
ChemInform Abstract: FERROMAGNETISM IN COPPER(II) OXYDIACETATE HEMIHYDRATE Chemischer False
CORVAN, P. J., ESTES, W. E., WELLER, R. R., and HATFIELD, W. E. ( True
CORVAN, P. J., ESTES, W. E., WELLER, R. R., and HATFIELD, W. E. 1980 False
Drones and Terrorism . True
Drones and Terrorism I.B.Tauris False
Grossman, N. ( True
Grossman, N. 2018 False
The management of Caribbean and Amazonian manatees in captivity . True
The management of Caribbean and Amazonian manatees in captivity International False
SILVEIRA, E. K. P. D. ( True
SILVEIRA, E. K. P. D. 1975 False
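
The position search traced above can be isolated as a small pure function. This is a sketch under assumptions: it works on a plain list of token strings, and it approximates spaCy's `Token.is_punct` with a Unicode category check, which may not match spaCy in every case. The names `is_punct` and `skip_trailing_punct` are hypothetical.

```python
import unicodedata


def is_punct(text):
    """Approximate Token.is_punct: True if every character is Unicode punctuation."""
    return bool(text) and all(unicodedata.category(ch).startswith("P") for ch in text)


def skip_trailing_punct(tokens, pos):
    """Advance `pos` past any punctuation tokens that start at index `pos`."""
    while pos < len(tokens) and is_punct(tokens[pos]):
        pos += 1
    return pos


# Hypothetical tokenization; the title span ends at index 3.
tokens = ["Drones", "and", "Terrorism", ".", "I.B.Tauris"]
print(skip_trailing_punct(tokens, 3))  # → 4
```

The insert position lands after the period and before the publisher, matching the `True` then `False` pairs in the trace.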


In [16]:
displacy.render(reference_doc, style="ent")