xproc · ndw · Nov 1, 2018 · Oct 31, 2018 · Oct 31, 2018 · Oct 31, 2018
@@ -364,21 +364,53 @@ defined properties.</para>
 <section xml:id="document-types">
 <title>Document Types</title>
 
-<para>From an XProc perspective, there are two kinds of documents: XML
-documents and non-XML documents. Non-XML documents can be further subdivided
-into text documents and binary documents. Text documents are called
-out specially because they can be easily represented inline within a
-pipeline.</para>
+<para>XProc 3.0 has been designed to make it possible to process any kind of
+document. Each document has a representation in the <biblioref linkend="xpath-datamodel"/>.
+This is necessary so that any kind of document can be passed as an argument to XPath functions,
+such as <function>p:document-properties</function>.
+Practically speaking, there are five kinds of documents:</para>
+
+<orderedlist>
+<listitem>
+<para><link linkend="xml-documents">XML documents</link>
+</para>
+</listitem>
+<listitem>
+<para><link linkend="html-documents">HTML documents</link>
+</para>
+</listitem>
+<listitem>
+<para><link linkend="text-documents">Text documents</link>
+</para>
+</listitem>
+<listitem>
+<para><link linkend="json-documents">JSON documents</link>
+</para>
+</listitem>
+<listitem>
+<para><link linkend="other-documents">Other documents</link>
+</para>
+</listitem>
+</orderedlist>
 
 <section xml:id="xml-documents">
 <title>XML Documents</title>
 
-<para>Representations of XML documents are instances of
-the <biblioref linkend="xpath-datamodel"/>. They are identified by
-an XML media type. <termdef xml:id="dt-XML-media-type">The
+<para>Representations of XML documents are general instances of the XDM.
+They are document nodes that contain a mixture
+of other node types (elements, text, comments, and processing
+instructions). This definition is intentionally broader than the
+definition of a well-formed XML document because it is often
+convenient for intermediate stages in a pipeline to produce
+more-or-less arbitrary fragments of XML that can be combined together
+by later stages.
+XML documents are identified by an XML media type.
+<termdef xml:id="dt-XML-media-type">The
-“<literal>application/xml</literal>” and “<literal>text/xml</literal>
+“<literal>application/xml</literal>” and “<literal>text/xml</literal>”
 media types and all media types of the form
 “<literal>application/<replaceable>something</replaceable>+xml</literal>”
+(except for “<literal>application/xhtml+xml</literal>”, which is explicitly
+an <glossterm>HTML media type</glossterm>)
 are <firstterm baseform="XML media type">XML media types</firstterm>.
 </termdef>
 </para>
@@ -395,34 +427,99 @@ in an XProc pipeline is <glossterm>implementation-defined</glossterm>.</impl>
 </para>
 </section>
 
+<section xml:id="html-documents">
+<title>HTML Documents</title>
+
+<para>Representations of HTML documents are general instances of the XDM.
+Within XProc, they are treated as
+<link linkend="xml-documents">XML documents</link>.
+HTML documents are identified by an HTML media type.
+<termdef xml:id="dt-HTML-media-type">The
+“<literal>text/html</literal>” and “<literal>application/xhtml+xml</literal>”
+media types
+are <firstterm baseform="HTML media type">HTML media types</firstterm>.
+</termdef>
+</para>
+
+<para>The distinction between XML documents and HTML documents is apparent
+in two places:</para>
+
+<orderedlist>
+<listitem>
+<para>When an HTML document is <emphasis>parsed</emphasis>, for example when it
+is the result of querying a web service or is loaded from a file on disk, an
+HTML parser <rfc2119>must</rfc2119> be used. An HTML parser will construct a
+balanced tree even if the HTML document would not be seen as
+well-formed XML if it were parsed by an XML parser. An HTML parser may also
+add elements not found in the original (for example, table body elements inside tables).
+</para>
+<note xml:id="inline-html">
+<para>The HTML parsing rules only apply when the content is parsed. HTML content
+in an unencoded <tag>p:inline</tag> must be well-formed XML (because it is literally
+in the pipeline) and will not be transformed in any way.</para>
+</note>
+</listitem>
+<listitem>
+<para>When an HTML document is <emphasis>serialized</emphasis>, the HTML
+serializer (see <xref linkend="xml-serialization-31"/>) may be used by default.
+</para>
+</listitem>
+</orderedlist>
+</section>
+
 <section xml:id="text-documents">
 <title>Text Documents</title>
 
-<para>Text documents are non-XML documents. A text document is represented by
-a single text node wrapped in a document node as instances of the
-<biblioref linkend="xpath-datamodel"/>.</para>
-
-<para>Text documents are identified by a text media type.
+<para>Representations of text documents are XDM documents that contain a single text node.
+Text documents are identified by a
+text media type.
 <termdef xml:id="dt-text-media-type">Media types of the form
 “<literal>text/<replaceable>something</replaceable></literal>”
 are <firstterm baseform="text media type">text media types</firstterm> with the
-exception of “<literal>text/xml</literal>” which is an XML media type.
+exception of “<literal>text/xml</literal>”, which is an XML media type,
+and “<literal>text/html</literal>”, which is an HTML media type.
 </termdef>
 </para>
 </section>
 
-<section xml:id="non-xml-documents">
-<title>Non-XML Documents</title>
-
-<para><impl>Representations of non-XML documents are
-are <glossterm>implementation-dependent</glossterm>.</impl>
-They are identified by media types that are not
-<glossterm baseform="XML media type">XML media types</glossterm>.
+<section xml:id="json-documents">
+<title>JSON Documents</title>
+
+<para>Representations of JSON documents are instances of the XDM.
+They are maps, arrays, or
+atomic values.
+JSON documents are identified by a
+JSON media type.
+<termdef xml:id="dt-JSON-media-type">The
+“<literal>application/json</literal>”
+media type and all media types of the form
+“<literal>application/<replaceable>something</replaceable>+json</literal>”
+are <firstterm baseform="JSON media type">JSON media types</firstterm>.
+</termdef>
 </para>
 
-<para>Implementors are free to optimize by storing them in convenient
-formats, caching them on disk, etc.</para>
+<note xml:id="json-documents-xdm" role="editorial">
+<title>Editorial Note</title>
+<para>This definition doesn’t say that JSON documents are represented by
+<emphasis>document nodes</emphasis> that contain something (because document nodes
+can’t contain maps or arrays). Can we get away with that?</para>
+</note>
+</section>
+
+<section xml:id="other-documents">
+<title>Other Documents</title>
 
+<para>Representations of other kinds of documents are empty XDM documents.
+<impl>The <emphasis>underlying</emphasis> representations of other
+kinds of documents are
+<glossterm>implementation-defined</glossterm>.</impl>
+Other kinds of documents are identified by media types that are not
+<glossterm baseform="XML media type">XML media types</glossterm>,
+<glossterm baseform="HTML media type">HTML media types</glossterm>,
+<glossterm baseform="text media type">text media types</glossterm>,
+or
+<glossterm baseform="JSON media type">JSON media types</glossterm>.
+</para>
 </section>
 </section>
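
Illustrative note (not part of the patch): the media type rules introduced by this change can be read as a small classifier. The sketch below is one possible reading, not the specification's algorithm. The function name classify_media_type is hypothetical, and the stripping of parameters such as "; charset=utf-8" and the case-insensitive comparison are assumptions that the text above does not state. HTML is tested before XML because "application/xhtml+xml" would otherwise match the "+xml" pattern.

```python
import re


def classify_media_type(media_type: str) -> str:
    """Return 'xml', 'html', 'text', 'json', or 'other' for a media type.

    Hypothetical helper; follows the literal wording of the five
    document-type definitions in the patch.
    """
    # Assumption: ignore parameters and compare case-insensitively.
    base = media_type.split(";", 1)[0].strip().lower()

    # HTML media types are listed explicitly and take precedence over
    # the XML and text patterns they would otherwise match.
    if base in ("text/html", "application/xhtml+xml"):
        return "html"

    # application/xml, text/xml, and application/<something>+xml.
    if base in ("application/xml", "text/xml") or re.fullmatch(
        r"application/[^/]+\+xml", base
    ):
        return "xml"

    # application/json and application/<something>+json.
    if base == "application/json" or re.fullmatch(
        r"application/[^/]+\+json", base
    ):
        return "json"

    # Any remaining text/<something> is a text media type.
    if re.fullmatch(r"text/[^/]+", base):
        return "text"

    # Everything else falls under "other documents".
    return "other"
```

Under this literal reading, a type such as "image/svg+xml" is classified as "other", because the XML definition only covers types of the form "application/something+xml".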