Skip to content

BioStor articles marked up using Journal Archiving Tag Set

Notifications You must be signed in to change notification settings

yezhifei001/biostor-jats

 
 

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

31 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

biostor-jats

Experiments in marking up BioStor articles up using Journal Archiving Tag Set (JATS; formerly NLM DTD).

Idea is to create JATS marked-up XML for BioStor articles. Initially simply article-level metadata and links to page scans, we then add content based on analysis of the page scans and associated DjVu and ABBYY XML.

BioStor provides starting point in form of archive with article metadata in JATS XML format, images (B&W and original), and DjVu and ABBYY OCR XML for article pages.

The scripts here then extract text to create hOCR files, which can then be used to generate a PDF with searchable text, and extract figures. Here is a live example.

Scripts

PHP scripts in the tools directory are used to extract and add content to the base provided by BioStor.

djvu2html converts DjVu XML to HTML including hOCR tags, see The hOCR Embedded OCR Workflow and Output Format.

php tools/djvu2html.php examples/65706

abby_pictures extracts picture and table blocks from ABBYY OCR, extracts corresponding part of image and puts these in the "figures" folder. It analyses the colours in the images to decide whether to use the B&W or original image for the figure, then adds links to the JATS XML.

php tools/abbyy_pictures.php examples/65706

hocr2pdf extracts text from the HTML files, combines it with the page scans and uses FPF to generate a PDF with searchable text.

php tools/hocr2pdf.php examples/65706

Stylesheet

XSLT style sheets are used to display the article.

About

BioStor articles marked up using Journal Archiving Tag Set

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages

  • HTML 82.9%
  • PHP 15.9%
  • Other 1.2%