Rework section 3.1, document types #604

Merged
merged 4 commits into from Nov 1, 2018
141 changes: 119 additions & 22 deletions xproc/src/main/xml/specification.xml
@@ -364,21 +364,53 @@ defined properties.</para>
<section xml:id="document-types">
<title>Document Types</title>

<para>From an XProc perspective, there are two kinds of documents: XML
documents and non-XML documents. Non-XML documents can be further subdivided
into text documents and binary documents. Text documents are called
out specially because they can be easily represented inline within a
pipeline.</para>
<para>XProc 3.0 has been designed to make it possible to process any kind of
document. Each document has a representation in the <biblioref linkend="xpath-datamodel"/>.
This is necessary so that any kind of document can be passed as an argument to XPath functions,
such as <tag>p:document-properties</tag>.
Practically speaking, there are five kinds of documents:</para>

<orderedlist>
<listitem>
<para><link linkend="xml-documents">XML documents</link>
</para>
</listitem>
<listitem>
<para><link linkend="html-documents">HTML documents</link>
</para>
</listitem>
<listitem>
<para><link linkend="text-documents">Text documents</link>
</para>
</listitem>
<listitem>
<para><link linkend="json-documents">JSON documents</link>
</para>
</listitem>
<listitem>
<para><link linkend="other-documents">Other documents</link>
</para>
</listitem>
</orderedlist>

<section xml:id="xml-documents">
<title>XML Documents</title>

<para>Representations of XML documents are instances of
the <biblioref linkend="xpath-datamodel"/>. They are identified by
an XML media type. <termdef xml:id="dt-XML-media-type">The
<para>Representations of XML documents are general instances of the XDM.
They are documents that contain a mixture
of other node types (elements, text, comments, and processing
instructions). This definition is intentionally broader than the
definition of a well-formed XML document because it is often
convenient for intermediate stages in a pipeline to produce
more-or-less arbitrary fragments of XML that can be combined together
by later stages.
XML documents are identified by an XML media type.
<termdef xml:id="dt-XML-media-type">The
“<literal>application/xml</literal>” and “<literal>text/xml</literal>”
media types and all media types of the form
“<literal>application/<replaceable>something</replaceable>+xml</literal>”
(except for “<literal>application/xhtml+xml</literal>”, which is explicitly
an <glossterm>HTML media type</glossterm>)
are <firstterm baseform="XML media type">XML media types</firstterm>.
</termdef>
</para>
@@ -395,34 +427,99 @@ in an XProc pipeline is <glossterm>implementation-defined</glossterm>.</impl>
</para>
</section>

<section xml:id="html-documents">
<title>HTML Documents</title>

<para>Representations of HTML documents are general instances of the XDM.
Within XProc, they are
<xref linkend="xml-documents"/>.
HTML documents are identified by an HTML media type.
<termdef xml:id="dt-HTML-media-type">The
“<literal>text/html</literal>” and “<literal>application/xhtml+xml</literal>”
media types
are <firstterm baseform="HTML media type">HTML media types</firstterm>.
</termdef>
</para>

<para>The distinction between XML documents and HTML documents is apparent
in two places:</para>

<orderedlist>
<listitem>
<para>When an HTML document is <emphasis>parsed</emphasis>, for example when it
is the result of querying a web service or is loaded from a file on disk, an
HTML parser <rfc2119>must</rfc2119> be used. An HTML parser will construct a
balanced tree even if the HTML document would not be
well-formed XML if it were parsed by an XML parser. An HTML parser may also
add elements not found in the original (for example, table body elements inside tables).
</para>
<note xml:id="inline-html">
<para>The HTML parsing rules only apply when the content is parsed. HTML content
in an unencoded <tag>p:inline</tag> must be well-formed XML (because it is literally
in the pipeline) and will not be transformed in any way.</para>
</note>
</listitem>
<listitem>
<para>When an HTML document is serialized, the HTML serializer
(see <xref linkend="xml-serialization-31"/>) may be used by default.
</para>
</listitem>
</orderedlist>
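
<para>As a non-normative illustration of the note on inline HTML, the following
sketch places HTML content directly in a <tag>p:inline</tag>. It assumes the
<literal>content-type</literal> attribute and step vocabulary of the XProc 3.0
drafts; the pipeline itself is hypothetical. Because the markup is literally part
of the pipeline document, the line break has to be written as
<literal>&lt;br/&gt;</literal> rather than in the unclosed form that an HTML
parser would accept.</para>

<programlisting language="xml"><![CDATA[<p:declare-step xmlns:p="http://www.w3.org/ns/xproc" version="3.0">
  <p:output port="result"/>

  <!-- Pass the inline document through unchanged. -->
  <p:identity>
    <p:with-input port="source">
      <!-- The content is read by the XML parser together with the pipeline,
           so it must be well-formed XML; no HTML parser is involved. -->
      <p:inline content-type="text/html">
        <html xmlns="http://www.w3.org/1999/xhtml">
          <head><title>Inline HTML</title></head>
          <body>
            <p>First line<br/>second line</p>
          </body>
        </html>
      </p:inline>
    </p:with-input>
  </p:identity>
</p:declare-step>]]></programlisting>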
</section>

<section xml:id="text-documents">
<title>Text Documents</title>

<para>Text documents are non-XML documents. A text document is represented by
a single text node wrapped in a document node as instances of the
<biblioref linkend="xpath-datamodel"/>.</para>

<para>Text documents are identified by a text media type.
<para>Representations of text documents are XDM documents that contain a single text node.
Text documents are identified by a
text media type.
<termdef xml:id="dt-text-media-type">Media types of the form
“<literal>text/<replaceable>something</replaceable></literal>”
are <firstterm baseform="text media type">text media types</firstterm> with the
exception of “<literal>text/xml</literal>”, which is an XML media type,
and “<literal>text/html</literal>”, which is an HTML media type.
</termdef>
</para>
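
<para>A minimal, non-normative sketch of a text document supplied inline,
assuming the <literal>content-type</literal> attribute on <tag>p:inline</tag>
from the XProc 3.0 drafts. Under the definition above, the document that appears
on the <literal>result</literal> port would be a document node containing a
single text node.</para>

<programlisting language="xml"><![CDATA[<p:declare-step xmlns:p="http://www.w3.org/ns/xproc" version="3.0">
  <p:output port="result"/>

  <p:identity>
    <p:with-input port="source">
      <!-- A text media type: the content is one text node, with no markup. -->
      <p:inline content-type="text/plain">Hello, world.</p:inline>
    </p:with-input>
  </p:identity>
</p:declare-step>]]></programlisting>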
</section>

<section xml:id="non-xml-documents">
<title>Non-XML Documents</title>

<para><impl>Representations of non-XML documents are
<glossterm>implementation-dependent</glossterm>.</impl>
They are identified by media types that are not
<glossterm baseform="XML media type">XML media types</glossterm>.
<section xml:id="json-documents">
<title>JSON Documents</title>

<para>Representations of JSON documents are instances of the XDM.
They are maps, arrays, or
atomic values.
JSON documents are identified by a
JSON media type.
<termdef xml:id="dt-JSON-media-type">The
“<literal>application/json</literal>”
media type and all media types of the form
“<literal>application/<replaceable>something</replaceable>+json</literal>”
are <firstterm baseform="JSON media type">JSON media types</firstterm>.
</termdef>
</para>
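
<para>The following non-normative sketch shows one way a JSON document might
enter a pipeline. It assumes that a <tag>p:inline</tag> with a JSON media type
has its text content parsed as JSON, which this section does not itself state.
Under that assumption, the resulting representation would be an XDM array of
four strings rather than a tree of nodes.</para>

<programlisting language="xml"><![CDATA[<p:declare-step xmlns:p="http://www.w3.org/ns/xproc" version="3.0">
  <p:output port="result"/>

  <p:identity>
    <p:with-input port="source">
      <!-- Assumption: the JSON media type causes this text to be parsed
           as JSON, yielding an array rather than a text node. -->
      <p:inline content-type="application/json">["xml", "html", "text", "json"]</p:inline>
    </p:with-input>
  </p:identity>
</p:declare-step>]]></programlisting>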

<para>Implementors are free to optimize by storing them in convenient
formats, caching them on disk, etc.</para>
<note xml:id="json-documents-xdm" role="editorial">
<title>Editorial Note</title>
<para>This definition doesn’t say that JSON documents are represented by
<emphasis>document nodes</emphasis> that contain something (because document nodes
can’t contain maps or arrays). Can we get away with that?</para>
</note>
</section>

<section xml:id="other-documents">
<title>Other Documents</title>

<para>Representations of other kinds of documents are empty XDM documents.
<impl>The <emphasis>underlying</emphasis> representations of other
kinds of documents are
<glossterm>implementation-defined</glossterm>.</impl>
Other kinds of documents are identified by media types that are not
<glossterm baseform="XML media type">XML media types</glossterm>,
<glossterm baseform="HTML media type">HTML media types</glossterm>,
<glossterm baseform="text media type">text media types</glossterm>,
or
<glossterm baseform="JSON media type">JSON media types</glossterm>.
</para>
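
<para>The media type definitions in this section can be read as a classification
procedure. The following XSLT 3.0 function is a non-normative sketch of that
reading. The function name <literal>ex:document-kind</literal> and its namespace
are hypothetical, and the sketch assumes that the argument is a bare media type
with no parameters (such as <literal>charset</literal>). HTML media types are
tested first because “<literal>application/xhtml+xml</literal>” would otherwise
match the general “<literal>+xml</literal>” pattern.</para>

<programlisting language="xml"><![CDATA[<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                xmlns:xs="http://www.w3.org/2001/XMLSchema"
                xmlns:ex="http://example.com/ns/sketch"
                version="3.0"
                exclude-result-prefixes="xs ex">

  <!-- Returns 'html', 'xml', 'json', 'text', or 'other' for a media type. -->
  <xsl:function name="ex:document-kind" as="xs:string">
    <xsl:param name="media-type" as="xs:string"/>
    <!-- Media type names are case-insensitive, so compare in lower case. -->
    <xsl:variable name="mt" select="lower-case(normalize-space($media-type))"/>
    <xsl:choose>
      <!-- HTML media types, including the explicit XHTML exception. -->
      <xsl:when test="$mt = ('text/html', 'application/xhtml+xml')">
        <xsl:sequence select="'html'"/>
      </xsl:when>
      <!-- XML media types: the two named types plus application/*+xml. -->
      <xsl:when test="$mt = ('application/xml', 'text/xml')
                      or matches($mt, '^application/[^/;]+\+xml$')">
        <xsl:sequence select="'xml'"/>
      </xsl:when>
      <!-- JSON media types: application/json plus application/*+json. -->
      <xsl:when test="$mt = 'application/json'
                      or matches($mt, '^application/[^/;]+\+json$')">
        <xsl:sequence select="'json'"/>
      </xsl:when>
      <!-- Remaining text/* types; text/xml and text/html were handled above. -->
      <xsl:when test="matches($mt, '^text/[^/;]+$')">
        <xsl:sequence select="'text'"/>
      </xsl:when>
      <!-- Everything else falls into the "other documents" category. -->
      <xsl:otherwise>
        <xsl:sequence select="'other'"/>
      </xsl:otherwise>
    </xsl:choose>
  </xsl:function>

</xsl:stylesheet>]]></programlisting>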
</section>
</section>
