Task:
`convert_annotations_to_spacy` converts bibliographies to spaCy docs.
However, a single doc can be very large.

The idea is to split large bibliographies into smaller parts in order to:
 - better control RAM usage
 - build a more robust model that can 'recover' if parsing starts from an intermediate token

 >Note: `convert_annotations_to_spacy` requires `node processManuscript.js` to be running in the `dataset-creation` dir

## Convert

In [2]:
!PYTHONPATH=.. python -m convert_annotations_to_spacy --help

Usage: python -m convert_annotations_to_spacy [OPTIONS] CROSSREF_DIR

Arguments:
  CROSSREF_DIR  [required]

Options:
  --output-dir PATH               [default: .]
  --parts INTEGER                 [default: 100]
  --parallel INTEGER              [default: 2]
  --install-completion [bash|zsh|fish|powershell|pwsh]
                                  Install completion for the specified shell.
  --show-completion [bash|zsh|fish|powershell|pwsh]
                                  Show completion for the specified shell, to
                                  copy it or customize the installation.
  --help                          Show this message and exit.


In [24]:
!rm -rf out && mkdir out
!PYTHONPATH=.. python -m convert_annotations_to_spacy crossref --output-dir out --parts 2

CSL processor supports 1747 styles
CSL processor supports 1747 styles
references.0.docbin:   0%|                                | 0/3 [00:00<?, ?it/s]a problem for style  journal-of-political-philosophy
a problem for style  mercatus-center
references.0.docbin:  33%|████████                | 1/3 [00:03<00:07,  3.97s/it]a problem for style  u-schylku-starozytnosci
references.1.docbin: 100%|████████████████████████| 2/2 [00:06<00:00,  3.36s/it]
references.0.docbin: 100%|████████████████████████| 3/3 [00:09<00:00,  3.07s/it]
convert:done: out/references.0.docbin
convert:done: out/references.1.docbin


## Load

In [25]:
import spacy
from spacy.tokens import Doc, DocBin

from spacy import displacy


In [26]:
db = DocBin()
db.from_disk("out/references.0.docbin")

nlp = spacy.blank("en")

# length of each doc, in tokens
print([len(doc) for doc in db.get_docs(nlp.vocab)])


[4397, 3723, 4246, 3672, 3545, 3438, 3377, 3919, 2678, 1493, 3135, 2965, 3021, 2638, 3606, 2096, 4109, 2882, 2345, 3749, 2851, 2492, 3323, 2910]


## Split up

In [83]:
MAX_TOKENS = 512

def split_doc(doc: Doc, max_tokens: int = MAX_TOKENS):
    """Yield chunks of `doc` of roughly `max_tokens` tokens, cutting only outside "bib" spans."""
    # Indices of tokens that are not inside an annotated span
    free = set(range(len(doc)))
    for span in doc.spans["bib"]:
        free -= set(range(span.start, span.end))

    def new_doc(start, end):
        return doc[start:end].as_doc(copy_user_data=True)

    start = 0
    for t in doc:
        # Cut just before a free token, so that no span is split in two
        if t.i - start >= max_tokens and t.i in free:
            yield new_doc(start, t.i)
            start = t.i

    # Emit the remaining tokens, if any
    if start < len(doc):
        yield new_doc(start, len(doc))
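
The boundary rule can be checked in isolation, without spaCy. The following is a minimal sketch, not part of the notebook: `chunk_boundaries` is a hypothetical helper, and spans are assumed to be plain `(start, end)` token ranges with an exclusive end, as in `doc.spans["bib"]`. It cuts only before a token that lies outside every span, so a chunk may exceed `max_tokens` when a span straddles the limit.

```python
from typing import Iterable, List, Tuple


def chunk_boundaries(
    n_tokens: int, spans: Iterable[Tuple[int, int]], max_tokens: int
) -> List[Tuple[int, int]]:
    """Return (start, end) token ranges that never cut through a span."""
    # Token indices that are not covered by any span
    free = set(range(n_tokens))
    for span_start, span_end in spans:
        free -= set(range(span_start, span_end))

    chunks = []
    start = 0
    for i in range(n_tokens):
        # Cut just before a free token once the chunk is long enough
        if i - start >= max_tokens and i in free:
            chunks.append((start, i))
            start = i

    # Remaining tokens form the last chunk
    if start < n_tokens:
        chunks.append((start, n_tokens))
    return chunks


# The span (2, 6) straddles the limit of 4, so the cut is delayed to token 6
print(chunk_boundaries(10, [(2, 6)], 4))  # [(0, 6), (6, 10)]
```

Because consecutive ranges share their boundary index, every token ends up in exactly one chunk.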
            

        
        
    


In [82]:
# Take only the first doc instead of materializing the whole list
doc = next(iter(db.get_docs(nlp.vocab)))
print(len(doc))
for _doc in split_doc(doc):
    displacy.render(_doc, style="ent") 
    


4397


In [30]:
displacy.render(doc, style="ent")