define and implement a reader-based API for documents #9

Closed
rhdunn opened this Issue Sep 23, 2011 · 5 comments

rhdunn commented Sep 23, 2011

At the moment, the document processing API is a push/sax-based API (parseDocument and document_events). This works, but is not very flexible.

The current model requires the entire document to be processed into a normalised form (currently a list of text nodes). To support more advanced processing (e.g. joining all text spans from a block into a group to be read), a DOM-like model would need to be created, adding overhead and complicating the model.

A DOM-like model is also a poor fit when reading the document in one go (e.g. from the command line), because the whole DOM must be created before it can be read, instead of reading as you go. For graphical environments that show the text in a document view, the DOM duplicates what the view already holds.

With a pull model, the command-line interface can feed the reader directly into the text-to-speech engine. The engine can then do context-based actions for special content, like changing the voice or pitch. For the graphical environment, the data can be stored in the native text-buffer representation and that can itself be exposed to the voice as a document reader API.

To get there:

  1. Define a document reader API -- document_reader createDocumentReader(string filename);
  2. Implement a pull-to-push bridge that takes a document reader and generates document events.
  3. Convert each document parser to expose the document reader API and hook it up to the pull-to-push bridge.
  4. When all the document parsers are converted, move all consumers of parseDocument to the document reader API:
       i. Make the tts engine take a document_reader and process that in a loop.
       ii. Make cainteoir-gtk parse the document to a GTK text buffer.
       iii. Make cainteoir-gtk implement a GTK text buffer to document_reader bridge.
  5. When all consumers have been converted, remove the old parseDocument and any associated code and interfaces.
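
Step 2 is the piece that lets the parsers be converted one at a time, so here is a minimal sketch of what such a bridge could look like. The interfaces are simplified assumptions: document_events is reduced to a single text callback, std::string stands in for the real text buffer type, and vector_reader and collecting_events are hypothetical helpers that exist only to exercise the bridge.

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Assumed pull interface: read() advances to the next node and returns
// false once the document is exhausted.
struct document_reader
{
    virtual ~document_reader() {}
    virtual bool read() = 0;
    const std::string &text() const { return mText; }
protected:
    std::string mText; // stand-in for the real text buffer type
};

// Assumed push interface, reduced to a single callback for this sketch.
struct document_events
{
    virtual ~document_events() {}
    virtual void text(const std::string &aText) = 0;
};

// The bridge: drain the reader and replay each node as an event.
inline void pull_to_push(document_reader &reader, document_events &events)
{
    while (reader.read())
        events.text(reader.text());
}

// Hypothetical reader over a fixed list of text nodes (demonstration only).
struct vector_reader : public document_reader
{
    explicit vector_reader(std::vector<std::string> aNodes)
        : mNodes(std::move(aNodes)), mNext(0) {}

    bool read() override
    {
        if (mNext == mNodes.size()) return false;
        mText = mNodes[mNext++];
        return true;
    }
private:
    std::vector<std::string> mNodes;
    std::size_t mNext;
};

// Hypothetical consumer that records the events it receives.
struct collecting_events : public document_events
{
    std::vector<std::string> seen;
    void text(const std::string &aText) override { seen.push_back(aText); }
};
```

With this shape, existing consumers of the event API keep working against any converted parser, and they can be moved to the reader one by one in step 4.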

rhdunn Jan 6, 2012

Owner

There are several cases where processing one document format depends on processing other document formats:

  • opf processes the sub-documents (separate files) referenced in the spine and toc;
  • (x)html processes svg and mathml sub-nodes (on the same document);
  • smil, ssml, pdf, swf and other documents contain embedded RDF/XML metadata (on the same document).

This can be done by the parent document reader class having an active document context and delegating to that if available:

struct ssml_reader : public document_reader
{
  std::shared_ptr<document_reader> context;
  xml::reader *reader;

  bool read()
  {
    if (context) {
      if (context->read()) return true;
      context.reset();
    }
    if (*reader == xmlns::rdf("RDF")) { // compare the current node, not the pointer
      context = std::make_shared<rdfxml_reader>(reader);
    }
    ...
  }
};

Each document reader will then have two constructors:

reader(xml::reader *aReader)
  -- use the specified xml::reader to parse the xml data
  -- the owner of the xml::reader has checked the root node

reader(const std::shared_ptr<cainteoir::buffer> &aData)
  -- create a new xml::reader and check the root xml node

This will work for nested contexts (e.g. opf => xhtml => svg => rdfxml), but every read() call is delegated through each layer in turn, which makes deeply nested content slower to read.

A modified approach would be to have:

std::stack<std::shared_ptr<document_reader>> context;

Then:

if (!context.empty()) {
  if (context.top()->read()) {
    if (context.top()->nodeType() == document_reader::switch_context) {
      context.push(context.top()->activeContext());
      return context.top()->read();
    }
    return true;
  }
  context.pop();
}

NOTE: This stack should be handled by a wrapper around a document_reader, with the wrapper itself implementing the document_reader interface and forwarding to context.top(). The create_document_reader function will create the document reader and attach it to the wrapper, returning the wrapper object.

When a document_reader generates a switch_context event, it creates the document_reader object for the new context.
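
A rough sketch of the wrapper described in the NOTE, under several assumptions: the node_type enum, the text() accessor and the default activeContext() are guesses at the interface, and toy_reader is a hypothetical helper used only to drive the wrapper. Unlike the fragment above, this version loops so that the parent resumes when a nested reader turns out to be empty.

```cpp
#include <cstddef>
#include <memory>
#include <stack>
#include <string>
#include <utility>
#include <vector>

// Assumed interface; only the members this sketch needs.
struct document_reader
{
    enum node_type { text_node, switch_context };
    virtual ~document_reader() {}
    virtual bool read() = 0;
    virtual node_type nodeType() const = 0;
    virtual const std::string &text() const = 0;
    virtual std::shared_ptr<document_reader> activeContext() { return {}; }
};

// Wrapper that owns the context stack and forwards to context.top().
// nodeType() and text() are only valid after read() has returned true.
struct context_stack_reader : public document_reader
{
    explicit context_stack_reader(std::shared_ptr<document_reader> aRoot)
    { context.push(std::move(aRoot)); }

    bool read() override
    {
        while (!context.empty()) {
            if (!context.top()->read()) {
                context.pop(); // nested reader exhausted: resume the parent
                continue;
            }
            if (context.top()->nodeType() != switch_context)
                return true;
            // The switch_context node itself is hidden from the consumer.
            std::shared_ptr<document_reader> child = context.top()->activeContext();
            if (child) context.push(child);
        }
        return false;
    }
    node_type nodeType() const override { return context.top()->nodeType(); }
    const std::string &text() const override { return context.top()->text(); }
private:
    std::stack<std::shared_ptr<document_reader>> context;
};

// Hypothetical reader: an item with a nested reader is a switch_context node.
struct toy_reader : public document_reader
{
    struct item { std::string text; std::shared_ptr<document_reader> nested; };

    explicit toy_reader(std::vector<item> aItems)
        : mItems(std::move(aItems)), mNext(0), mCurrent(0) {}

    bool read() override
    {
        if (mNext == mItems.size()) return false;
        mCurrent = mNext++;
        return true;
    }
    node_type nodeType() const override
    { return mItems[mCurrent].nested ? switch_context : text_node; }
    const std::string &text() const override { return mItems[mCurrent].text; }
    std::shared_ptr<document_reader> activeContext() override
    { return mItems[mCurrent].nested; }
private:
    std::vector<item> mItems;
    std::size_t mNext, mCurrent;
};
```

Each read() touches only the top of the stack, so the cost no longer grows with nesting depth the way the delegating version does.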

This could also be done without a stack -- consumers of the document_reader recurse into themselves with the new context, which will be more efficient.

rhdunn Jan 6, 2012

Owner

To avoid code duplication and to offer decent performance, the document_reader API should look something like this:

struct document_reader
{
    const cainteoir::rope &text() const { return mText; }
    ...

    virtual bool read() = 0;
    virtual ~document_reader() {}
protected:
    cainteoir::rope mText;
    ...
};

This allows the data access to be inlined by the compiler and the implementations to focus on implementing the read method and not duplicating the data access code.

There can also be an xml_document_reader that handles the xml::reader and associated constructors for xml-based documents.
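
As an illustration only, an xml_document_reader along these lines could cover both constructor cases from the earlier comment: borrowing a reader from a parent document, or creating and owning one. In this sketch xml::reader is a stand-in type, std::string replaces both cainteoir::rope and the cainteoir::buffer argument, and ownsReader() and whole_text_reader are hypothetical names added so the example can be exercised.

```cpp
#include <memory>
#include <string>
#include <utility>

namespace xml {
// Stand-in for the real xml::reader; it only holds the raw data here.
struct reader
{
    explicit reader(std::string aData) : data(std::move(aData)) {}
    std::string data;
};
}

struct document_reader
{
    const std::string &text() const { return mText; } // non-virtual, inlinable

    virtual bool read() = 0;
    virtual ~document_reader() {}
protected:
    std::string mText; // the original uses cainteoir::rope
};

// Assumed shared base for xml-based formats.
struct xml_document_reader : public document_reader
{
    // Nested use: the parent owns aReader and has already checked the root node.
    explicit xml_document_reader(xml::reader *aReader)
        : mReader(aReader) {}

    // Top-level use: create and own a new reader over the data.
    explicit xml_document_reader(const std::string &aData)
        : mOwned(new xml::reader(aData)), mReader(mOwned.get()) {}

    bool ownsReader() const { return static_cast<bool>(mOwned); }
protected:
    std::unique_ptr<xml::reader> mOwned; // declared first so it is initialised first
    xml::reader *mReader;
};

// Hypothetical concrete reader: yields the whole input as one text node.
struct whole_text_reader : public xml_document_reader
{
    using xml_document_reader::xml_document_reader;

    bool read() override
    {
        if (mDone) return false;
        mText = mReader->data;
        mDone = true;
        return true;
    }
private:
    bool mDone = false;
};
```

Format-specific readers then implement only read(); the text accessor and the reader ownership live in the two base classes.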

rhdunn Apr 28, 2012

Owner

A document consists of:

  1. the content/mime type of the document;
  2. the character encoding the document is encoded as;
  3. metadata about the document;
  4. the content of the document.

The character encoding may change over the lifetime of the document. The application is encoding neutral as the document content is always returned as UTF-8 encoded data. The conversion is done when required.

The document metadata is exposed to the application as an RDF graph. There are two types of metadata:

  1. primary metadata -- this is metadata that is easily and quickly accessed, that is used to describe the document (title, author, description);
  2. supplementary metadata -- this is metadata that is found in the entire document.

This distinction allows efficient metadata extraction when doing things like parsing the documents in the document library.

Primary metadata covers only the head section of an HTML document, only the OPF file in an ePub document, and only the mimetype headers in an email document.

rhdunn Apr 28, 2012

Owner

The reader-based API will then look something like:

std::shared_ptr<document_reader>
createDocumentReader(const char *filename, rdf::graph &primaryMetadata);

with the character encoding being used internally and the content type exposed in the primary metadata.

The createDocumentReader call will read up to the end of the primary metadata:

  1. for email documents this is to the end of the mimetype headers;
  2. for HTML this is to the end of the head node;
  3. for ePub, this is after reading the OPF file.

If the document is not recognised, a null document reader is returned.
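
From the caller's side, the contract above might be exercised as in the following sketch. Everything besides the createDocumentReader signature is an assumption: rdf::graph is replaced by a flat string map, the "content-type" and "title" keys are made up, and the toy factory recognises documents by a ".txt" suffix and returns placeholder content instead of reading a file.

```cpp
#include <map>
#include <memory>
#include <string>
#include <utility>

namespace rdf {
// Stand-in for the real rdf::graph: a flat property -> value map.
typedef std::map<std::string, std::string> graph;
}

struct document_reader
{
    const std::string &text() const { return mText; }
    virtual bool read() = 0;
    virtual ~document_reader() {}
protected:
    std::string mText;
};

// Hypothetical reader that yields its content as a single text node.
struct text_reader : public document_reader
{
    explicit text_reader(std::string aContent) : mContent(std::move(aContent)) {}

    bool read() override
    {
        if (mDone) return false;
        mText = mContent;
        mDone = true;
        return true;
    }
private:
    std::string mContent;
    bool mDone = false;
};

// Toy factory: fills in the primary metadata before returning, and returns
// a null reader when the document is not recognised.
std::shared_ptr<document_reader>
createDocumentReader(const char *filename, rdf::graph &primaryMetadata)
{
    const std::string name(filename);
    const std::string suffix(".txt");
    if (name.size() < suffix.size() ||
        name.compare(name.size() - suffix.size(), suffix.size(), suffix) != 0)
        return {}; // not recognised

    primaryMetadata["content-type"] = "text/plain"; // hypothetical key
    primaryMetadata["title"] = name;                // hypothetical key
    return std::make_shared<text_reader>("(content of " + name + ")");
}
```

A document library scan can then call createDocumentReader, keep the metadata, and drop the reader without ever calling read().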

rhdunn May 18, 2012

Owner

This has now been implemented, supporting all existing document formats. The old parseDocument API has been removed.

@rhdunn rhdunn closed this May 18, 2012
