define and implement a reader-based API for documents #9
Comments
There are several cases where one document format is dependent on the processing of other document formats:
This can be done by the parent document reader class having an active document context and delegating to that if available:
Each document reader will then have two constructors:
This will work for nested contexts (e.g. opf => xhtml => svg => rdfxml) but will be slower in delegating through all the layers. A modified approach would be to have:
Then:
NOTE: This stack should be handled by a wrapper around a document_reader, with the wrapper itself implementing the document_reader interface and forwarding to …. When a document_reader generates a switch_context event, it creates the document_reader object for the new context. This could also be done without a stack: consumers of the document_reader recurse into themselves with the new context, which will be more efficient.
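As an illustration of the stack-based wrapper described in the note, here is a minimal sketch. Everything in it is an assumption made for the example rather than the project's actual API: the `nested` member stands in for the switch_context event, and `script_reader` and `stacked_reader` are hypothetical names.

```cpp
#include <cstddef>
#include <memory>
#include <stack>
#include <string>
#include <utility>
#include <vector>

// Hypothetical minimal reader interface (not the project's real one).
struct document_reader
{
    virtual ~document_reader() = default;
    virtual bool read() = 0;                  // false at end of document
    std::string text;                         // text of the current event
    std::unique_ptr<document_reader> nested;  // non-null models a switch_context event
};

// Hypothetical reader that replays a fixed script of text and nested readers.
class script_reader : public document_reader
{
public:
    void add_text(std::string t) { mItems.push_back(item{std::move(t), nullptr}); }
    void add_nested(std::unique_ptr<document_reader> r) { mItems.push_back(item{std::string(), std::move(r)}); }

    bool read() override
    {
        if (mNext >= mItems.size()) return false;
        item &i = mItems[mNext++];
        text = i.text;
        nested = std::move(i.nested);  // null for plain text items
        return true;
    }
private:
    struct item { std::string text; std::unique_ptr<document_reader> nested; };
    std::vector<item> mItems;
    std::size_t mNext = 0;
};

// Wrapper that implements the reader interface and forwards to the top of a stack.
class stacked_reader : public document_reader
{
public:
    explicit stacked_reader(std::unique_ptr<document_reader> root) { mStack.push(std::move(root)); }

    bool read() override
    {
        while (!mStack.empty())
        {
            document_reader &top = *mStack.top();
            if (!top.read()) { mStack.pop(); continue; }                        // context finished
            if (top.nested) { mStack.push(std::move(top.nested)); continue; }  // enter new context
            text = top.text;
            return true;
        }
        return false;
    }
private:
    std::stack<std::unique_ptr<document_reader>> mStack;
};
```

With this shape the consumer sees one flat stream of events, however deeply the contexts nest, and each `read()` call forwards only to the reader on top of the stack.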
To avoid code duplication and to offer decent performance, the document_reader API should look something like this:
This allows the data access to be inlined by the compiler and the implementations to focus on implementing the read method and not duplicating the data access code. There can also be an xml_document_reader that handles the xml::reader and associated constructors for xml-based documents.
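The split between inlined data access and a virtual read method could look roughly like the sketch below. The member names (`type()`, `text()`, `set()`), the `event_type` enum, and the `lines_reader` implementation are all assumptions for illustration; the API in the original comment is not reproduced here.

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical event kinds for the example.
enum class event_type { none, text };

class document_reader
{
public:
    virtual ~document_reader() = default;

    // The only virtual call: advance to the next event, false at end of document.
    virtual bool read() = 0;

    // Non-virtual accessors defined in the class body, so the compiler can inline them.
    event_type type() const { return mType; }
    const std::string &text() const { return mText; }

protected:
    // Implementations publish the current event through this helper.
    void set(event_type type, std::string text)
    {
        mType = type;
        mText = std::move(text);
    }

private:
    event_type mType = event_type::none;
    std::string mText;
};

// Hypothetical implementation: emits each stored line as a text event.
class lines_reader : public document_reader
{
public:
    explicit lines_reader(std::vector<std::string> lines) : mLines(std::move(lines)) {}

    bool read() override
    {
        if (mNext >= mLines.size()) return false;
        set(event_type::text, mLines[mNext++]);
        return true;
    }

private:
    std::vector<std::string> mLines;
    std::size_t mNext = 0;
};
```

The derived class contains only the `read` logic; the accessor code lives once in the base class.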
A document consists of:
The character encoding may change over the lifetime of the document. The application is encoding neutral, as the document content is always returned as UTF-8 encoded data. The conversion is done when required. The document metadata is exposed to the application as an RDF graph. There are two types of metadata:
This distinction allows efficient metadata extraction when doing things like parsing the documents in the document library. Primary metadata covers only the head section of an HTML document, only the OPF file in an ePub document, and only the mimetype headers in an email document.
The reader-based API will then look something like:
with the character encoding being used internally and the content type exposed in the primary metadata. The
If the document is not recognised, a null document reader is returned.
This has now been implemented, supporting all existing document formats. The old parseDocument API has been removed.
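A caller has to check for the null reader before use. The sketch below shows that contract only. Detecting the format from the file extension, the `std::shared_ptr` return type, and the `text_reader` placeholder are assumptions for the example, not how the project recognises documents.

```cpp
#include <memory>
#include <string>

// Hypothetical minimal reader interface.
struct document_reader
{
    virtual ~document_reader() = default;
    virtual bool read() = 0;
};

// Placeholder reader for the example; it reports an empty document.
struct text_reader : document_reader
{
    bool read() override { return false; }
};

// Helper: true if s ends with suffix.
static bool ends_with(const std::string &s, const std::string &suffix)
{
    return s.size() >= suffix.size() &&
           s.compare(s.size() - suffix.size(), suffix.size(), suffix) == 0;
}

// Returns a reader for recognised documents, or a null pointer otherwise.
// Assumption: recognition by extension, chosen only to keep the sketch short.
std::shared_ptr<document_reader> createDocumentReader(const std::string &filename)
{
    if (ends_with(filename, ".txt"))
        return std::make_shared<text_reader>();
    return nullptr;  // document not recognised
}

// Example caller: counts events, treating an unrecognised document as an error (-1).
int countEvents(const std::string &filename)
{
    auto reader = createDocumentReader(filename);
    if (!reader) return -1;
    int events = 0;
    while (reader->read()) ++events;
    return events;
}
```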
At the moment, the document processing API is a push/SAX-based API (|parseDocument| and |document_events|). This works, but is not very flexible.
The current model requires the entire document to be processed into a normalised form (currently a list of text nodes). In order to support more advanced processing (e.g. joining all text spans from a block into a group to be read), a DOM-like model would need to be created, adding overhead and complicating the model.
The DOM-like model has problems when reading the document in one go (e.g. from the command-line) where it requires the DOM to be created and then read, instead of reading as you go. Also, for graphical environments that show the text in a document view, there is a duplication of presentation.
With a pull model, the command-line interface can feed the reader directly into the text-to-speech engine. The engine can then do context-based actions for special content, like changing the voice or pitch. For the graphical environment, the data can be stored in the native text-buffer representation and that can itself be exposed to the voice as a document reader API.
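The pull loop described above might be sketched as follows. The `tts_engine` type is a stand-in that records what it would speak, and the `context` enum, the pitch values, and `list_reader` are hypothetical names chosen for the example.

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical content kinds the engine can react to.
enum class context { paragraph, heading };

// Hypothetical minimal reader interface.
struct document_reader
{
    virtual ~document_reader() = default;
    virtual bool read() = 0;
    context ctx = context::paragraph;
    std::string text;
};

// Stand-in engine: records "pitch:text" instead of producing audio.
struct tts_engine
{
    int pitch = 50;
    std::vector<std::string> spoken;
    void set_pitch(int p) { pitch = p; }
    void speak(const std::string &t) { spoken.push_back(std::to_string(pitch) + ":" + t); }
};

// Hypothetical reader over an in-memory list of (context, text) pairs.
class list_reader : public document_reader
{
public:
    explicit list_reader(std::vector<std::pair<context, std::string>> items)
        : mItems(std::move(items)) {}

    bool read() override
    {
        if (mNext >= mItems.size()) return false;
        ctx = mItems[mNext].first;
        text = mItems[mNext].second;
        ++mNext;
        return true;
    }
private:
    std::vector<std::pair<context, std::string>> mItems;
    std::size_t mNext = 0;
};

// The consumer pulls events as it goes; no intermediate document model is built.
void read_aloud(document_reader &reader, tts_engine &engine)
{
    while (reader.read())
    {
        // Context-based action: raise the pitch for headings (values are arbitrary).
        engine.set_pitch(reader.ctx == context::heading ? 70 : 50);
        engine.speak(reader.text);
    }
}
```

Because `read_aloud` depends only on the reader interface, the same loop could in principle be driven by a file-backed reader or by a bridge over a GUI text buffer.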
To get there:
document_reader createDocumentReader(string filename);
… |parseDocument| to the document reader API.
i. Make the tts engine take a |document_reader| and process that in a loop.
ii. Make cainteoir-gtk parse the document to a GTK text buffer.
iii. Make cainteoir-gtk implement a GTK text buffer to document_reader bridge.
5. When all consumers have been converted, remove the old |parseDocument| and any associated code and interfaces.