define and implement a reader-based API for documents #9
At the moment, the document processing API is a push/sax-based API (
The current model requires the entire document to be processed into a normalised form (currently a list of text nodes). To support more advanced processing (e.g. joining all text spans from a block into a group to be read), a DOM-like model would need to be created, adding overhead and complicating the model.
The DOM-like model is a poor fit when the document is read in one pass (e.g. from the command line): the whole DOM has to be created before it can be read, instead of reading as you go. In graphical environments that show the text in a document view, the content is also held twice, once in the DOM and once in the view.
With a pull model, the command-line interface can feed the reader directly into the text-to-speech engine. The engine can then do context-based actions for special content, like changing the voice or pitch. For the graphical environment, the data can be stored in the native text-buffer representation and that can itself be exposed to the voice as a document reader API.
To get there:
i. Make the tts engine take a |document_reader| and process that in a loop.
ii. Make cainteoir-gtk parse the document to a GTK text buffer.
iii. Make cainteoir-gtk implement a GTK text buffer to document_reader bridge.
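As a rough illustration of step i, the pull loop could look like the sketch below. The `read()` method, the public `text` field, the `span_reader` class and the `speak` function are all assumptions made for this example, not the project's actual API; `speak` only collects the text a real engine would synthesise.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical pull-style interface: the consumer calls read() until it
// returns false, and inspects `text` after each successful call.
struct document_reader
{
    std::string text; // content is always UTF-8

    virtual ~document_reader() = default;
    virtual bool read() = 0;
};

// Toy reader over an in-memory list of text spans (stand-in for a parser).
class span_reader : public document_reader
{
public:
    explicit span_reader(std::vector<std::string> spans)
        : spans_(std::move(spans))
    {
    }

    bool read() override
    {
        if (next_ == spans_.size()) return false;
        text = spans_[next_++];
        return true;
    }
private:
    std::vector<std::string> spans_;
    std::size_t next_ = 0;
};

// Stand-in for the TTS engine: pulls spans one at a time and joins them
// with spaces instead of sending them to a synthesiser.
std::string speak(document_reader &reader)
{
    std::string spoken;
    while (reader.read())
    {
        if (!spoken.empty()) spoken += ' ';
        spoken += reader.text;
    }
    return spoken;
}
```

Because the engine only sees the `document_reader` interface, the same loop would work whether the reader is backed by a file parser or by a GTK text buffer bridge.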
There are several cases where one document format is dependent on the processing of other document formats:
This can be done by the parent document reader class having an active document context and delegating to that if available:
Each document reader will then have two constructors:
This will work for nested contexts (e.g. opf => xhtml => svg => rdfxml), but each read will be slower because it is delegated through every layer.
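A minimal sketch of the delegating approach, assuming a hypothetical `read_self` hook and `set_context` helper (neither name comes from the project). The `node` and `toy_reader` types only simulate nested documents in memory.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <string>
#include <utility>
#include <vector>

struct document_reader
{
    std::string text;

    virtual ~document_reader() = default;

    // Delegate to the active nested context while it has data, then
    // resume reading this document.
    bool read()
    {
        for (;;)
        {
            if (context_)
            {
                if (context_->read())
                {
                    text = context_->text; // copied up through every layer
                    return true;
                }
                context_.reset(); // nested document exhausted
            }
            if (!read_self()) return false;
            if (!context_) return true; // read_self produced text itself
            // Otherwise read_self opened a nested context: loop to delegate.
        }
    }
protected:
    // Hypothetical hook: returns false at the end of the document; otherwise
    // either sets `text` or opens a nested context via set_context.
    virtual bool read_self() = 0;

    void set_context(std::unique_ptr<document_reader> context)
    {
        context_ = std::move(context);
    }
private:
    std::unique_ptr<document_reader> context_;
};

// Toy document: a node is either text or a nested document (children).
struct node
{
    std::string text;
    std::vector<node> children;
};

class toy_reader : public document_reader
{
public:
    explicit toy_reader(std::vector<node> nodes) : nodes_(std::move(nodes)) {}
protected:
    bool read_self() override
    {
        if (next_ == nodes_.size()) return false;
        const node &n = nodes_[next_++];
        if (n.children.empty())
            text = n.text;
        else
            set_context(std::make_unique<toy_reader>(n.children));
        return true;
    }
private:
    std::vector<node> nodes_;
    std::size_t next_ = 0;
};

// Helper for the example: join everything the reader yields with spaces.
std::string read_all(document_reader &reader)
{
    std::string out;
    while (reader.read())
    {
        if (!out.empty()) out += ' ';
        out += reader.text;
    }
    return out;
}
```

The cost mentioned above is visible in `read`: text from the innermost document is copied once per enclosing layer on its way out.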
A modified approach would be to have:
NOTE: This stack should be handled by a wrapper around a document_reader, with the wrapper itself implementing the document_reader interface and forwarding to
When a document_reader generates a switch_context event, it creates the document_reader object for the new context.
This could also be done without a stack -- consumers of the document_reader recurse into themselves with the new context, which will be more efficient.
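Both variants can be sketched side by side. Everything here is illustrative: the `event_type` values, the `context` member, `stacked_reader` and `read_recursive` are assumed names, and `toy_reader` simulates nested documents in memory.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <string>
#include <utility>
#include <vector>

enum class event_type { text, switch_context };

struct document_reader
{
    event_type type = event_type::text;
    std::string text;
    std::shared_ptr<document_reader> context; // set on switch_context

    virtual ~document_reader() = default;
    virtual bool read() = 0;
};

struct node
{
    std::string text;
    std::vector<node> children; // non-empty means a nested document
};

class toy_reader : public document_reader
{
public:
    explicit toy_reader(std::vector<node> nodes) : nodes_(std::move(nodes)) {}

    bool read() override
    {
        if (next_ == nodes_.size()) return false;
        const node &n = nodes_[next_++];
        if (n.children.empty())
        {
            type = event_type::text;
            text = n.text;
        }
        else
        {
            // The reader creates the reader object for the new context.
            type = event_type::switch_context;
            context = std::make_shared<toy_reader>(n.children);
        }
        return true;
    }
private:
    std::vector<node> nodes_;
    std::size_t next_ = 0;
};

// Stack variant: a wrapper that is itself a document_reader and forwards
// to whichever reader is on top of the stack.
class stacked_reader : public document_reader
{
public:
    explicit stacked_reader(std::shared_ptr<document_reader> root)
    {
        stack_.push_back(std::move(root));
    }

    bool read() override
    {
        while (!stack_.empty())
        {
            std::shared_ptr<document_reader> top = stack_.back();
            if (!top->read()) { stack_.pop_back(); continue; }
            if (top->type == event_type::switch_context)
            {
                stack_.push_back(top->context);
                continue;
            }
            type = event_type::text;
            text = top->text;
            return true;
        }
        return false;
    }
private:
    std::vector<std::shared_ptr<document_reader>> stack_;
};

// Stack-free variant: the consumer recurses into itself with the new context.
void read_recursive(document_reader &reader, std::string &out)
{
    while (reader.read())
    {
        if (reader.type == event_type::switch_context)
        {
            read_recursive(*reader.context, out);
        }
        else
        {
            if (!out.empty()) out += ' ';
            out += reader.text;
        }
    }
}
```

With the stack, a read goes through one forwarding layer regardless of nesting depth; with recursion, the consumer reads the innermost reader directly and no text is copied between readers.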
To avoid code duplication and to offer decent performance, the document_reader API should look something like this:
This allows the data access to be inlined by the compiler and the implementations to focus on implementing the read method and not duplicating the data access code.
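One possible shape for such an interface, as a sketch only: the accessor names, the `event_type` enum and the `line_reader` implementation are assumptions for illustration. The point is that the accessors are non-virtual and defined in the class body, while `read` is the only virtual call.

```cpp
#include <cassert>
#include <string>
#include <utility>

enum class event_type { none, text };

class document_reader
{
public:
    // Non-virtual accessors defined in the class body are implicitly
    // inline, so the compiler can inline them at the call site.
    event_type type() const { return type_; }
    const std::string &text() const { return text_; }

    virtual ~document_reader() = default;

    // The only method an implementation has to provide.
    virtual bool read() = 0;
protected:
    // Implementations fill these in from read().
    event_type type_ = event_type::none;
    std::string text_;
};

// Toy implementation: yields one text event per line of a buffer.
class line_reader : public document_reader
{
public:
    explicit line_reader(std::string data) : data_(std::move(data)) {}

    bool read() override
    {
        if (pos_ >= data_.size()) return false;
        std::string::size_type end = data_.find('\n', pos_);
        if (end == std::string::npos) end = data_.size();
        type_ = event_type::text;
        text_ = data_.substr(pos_, end - pos_);
        pos_ = end + 1; // skip the newline
        return true;
    }
private:
    std::string data_;
    std::string::size_type pos_ = 0;
};
```

`line_reader` contains no accessor code of its own; it only advances its position and updates the shared fields.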
There can also be an xml_document_reader that handles the xml::reader and associated constructors for xml-based documents.
A document consists of:
The character encoding may change over the lifetime of the document. The application is encoding-neutral because the document content is always returned as UTF-8 encoded data. The conversion is done when required.
The document metadata is exposed to the application as an RDF graph. There are two types of metadata:
This distinction allows efficient metadata extraction when doing things like parsing the documents in the document library.
Primary metadata covers only the head section of an HTML document, only the OPF file in an ePub document, and only the mimetype headers in an email document.
The reader-based API will then look something like:
with the character encoding being used internally and the content type exposed in the primary metadata.
If the document is not recognised, a null document reader is returned.
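The null-reader convention could be sketched as follows. The `open_document` name, its signature and the detection rule (reject empty data or data containing a NUL byte) are all invented for this example, and the `content_type` out-parameter is only a stand-in for the content type that would be recorded in the RDF primary metadata.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>

struct document_reader
{
    std::string text; // always UTF-8

    virtual ~document_reader() = default;
    virtual bool read() = 0;
};

// Toy reader that yields the whole buffer as a single text span.
class plain_text_reader : public document_reader
{
public:
    explicit plain_text_reader(std::string data) : data_(std::move(data)) {}

    bool read() override
    {
        if (done_) return false;
        text = data_;
        done_ = true;
        return true;
    }
private:
    std::string data_;
    bool done_ = false;
};

// Hypothetical factory. Returns a null pointer when the content is not
// recognised; otherwise reports the detected content type and returns a
// reader for it.
std::shared_ptr<document_reader>
open_document(const std::string &data, std::string &content_type)
{
    // Toy detection rule for this sketch only.
    if (data.empty() || data.find('\0') != std::string::npos)
        return nullptr;

    content_type = "text/plain";
    return std::make_shared<plain_text_reader>(data);
}
```

Callers check the returned pointer before reading, so an unrecognised document is an ordinary result and does not need an exception.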