Commit

Merge f3d6bdb into 83d1423
Omikhleia committed Jul 11, 2023
2 parents 83d1423 + f3d6bdb commit ab9ead0
Showing 5 changed files with 209 additions and 50 deletions.
176 changes: 176 additions & 0 deletions documentation/c11-inputoutput.sil
@@ -0,0 +1,176 @@
\begin{document}
\chapter{Designing Inputters & Outputters}

Let’s dabble further into SILE’s internals.
As mentioned earlier in this manual, SILE relies on “input handlers” to parse content and construct an abstract syntax tree (AST) which can then be interpreted and rendered.
The actual rendering relies on an “output backend” to generate a result in the expected target format.

\center{\img[src=documentation/fig-input-to-output.png, width=99%lw]}

The standard distribution includes “inputters” (as we call them in brief) for the SIL language and its XML flavor,\footnote{%
Actually, SILE preloads \em{three} inputters: SIL, XML, and also one for Lua scripts.
} but SILE is not tied to supporting these formats \em{only.}
Adding another input format is just a matter of implementing the corresponding inputter.
This is exactly what third-party modules adding “native” support for Markdown, Djot, and other markup languages achieve.
This chapter will give you a high-level overview of the process.

As for “outputter” backends, most users are likely interested in the one responsible for PDF output.
The standard distribution includes a few other backends: text-only output, debug output (mostly used internally by non-regression tests), and a few experimental ones.

\section{Designing an input handler}

Inputters usually live in an \code{inputters/} subdirectory of the directory containing your first input file, of your current working directory, or of your SILE path.

\subsection{Initial boilerplate}

A minimum working inputter inherits from the \autodoc:package{base} inputter.
We need to declare the name of our new inputter, its priority order, and (at least) two methods.

When a file or string is processed, SILE looks for the first inputter claiming to know this format.
Inputters are sorted according to their priority order, an integer value.
For instance,
\begin{itemize}
\item{The XML inputter has a priority of 2.}
\item{The SIL inputter has a priority of 50.}
\end{itemize}

In this tutorial example, we are going to use a priority of 2.
Please note that this value might not be appropriate for your input format, depending on how content can be analyzed to determine whether it is in that format.
At some point, you will have to consider in which order the various inputters need to be evaluated.

We will return to this topic below.
For now, let’s start with a file \code{inputters/myformat.lua} with the following content.

\begin[type=autodoc:codeblock]{raw}
local base = require("inputters.base")

local inputter = pl.class(base)
inputter._name = "myformat"
inputter.order = 2

function inputter.appropriate (round, filename, _)
  -- We will later change it.
  return false
end

function inputter:parse (doc)
  -- We will later change it: for now, return an empty tree.
  local tree = {}
  return tree
end

return inputter
\end{raw}

You have written your very first inputter, or more precisely the minimal \em{boilerplate} code for it.
One possible way to use it would be to load it from the command line, before processing some file in the supported format:

\begin[type=autodoc:codeblock]{raw}
sile -u inputters.myformat somefile.xy
\end{raw}

However, this will not work yet.
We must now do some real work.

\subsection{Content appropriation}

What we first need is to tell SILE how to choose our inputter when it is given a file in our input format.
The \code{appropriate()} method of our inputter is responsible for providing the corresponding logic.
It is a static method (so it does not have a \code{self} argument), and it takes up to three arguments:
\begin{itemize}
\item{the round, an integer between 1 and 3.}
\item{the filename if we are processing a file (so \code{nil} in case we are processing some string directly, for instance via a raw command handler).}
\item{the textual content (of the file or string being processed).}
\end{itemize}

It is expected to return a boolean value, \code{true} if this handler is appropriate and \code{false} otherwise.

Earlier, we said that inputters were checked in their priority order.
That was not the whole story.
Let’s add another piece to our puzzle: inputters are indeed checked in order, but in three successive rounds:
\begin{itemize}
\item{Round 1 expects the filename to be checked: for instance, we could base our decision on recognized file extensions.}
\item{Round 2 expects the content string to be checked: for instance, we could base our decision on some “magic” sequence of characters occurring early in the document (or any other content inspection strategy).}
\item{Round 3 expects the content to be successfully parsed.}
\end{itemize}
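
The three rounds can be pictured with a small stand-alone model, runnable with a plain Lua interpreter.
It is only a sketch and not SILE’s actual implementation: the two toy inputters, their detection rules, and the \code{detect} function are hypothetical, and we assume that lower priority values are consulted first within each round.

\begin[type=autodoc:codeblock]{raw}
-- Simplified model of format detection; NOT SILE's actual code.
-- Two toy inputters with hypothetical names and detection rules.
local inputters = {
  { _name = "beta", order = 50,
    appropriate = function (round, filename, doc)
      if round == 1 then
        return filename ~= nil and filename:match("%.beta$") ~= nil
      elseif round == 2 then
        return doc:sub(1, 5) == "#beta"
      end
      return false
    end },
  { _name = "alpha", order = 2,
    appropriate = function (round, filename, _)
      if round == 1 then
        return filename ~= nil and filename:match("%.alpha$") ~= nil
      end
      return false
    end },
}

-- Assumption: lower order values are consulted first.
table.sort(inputters, function (a, b) return a.order < b.order end)

-- Returns the name of the first inputter claiming the content,
-- and the round in which it did so (or nil if nobody claims it).
local function detect (filename, doc)
  for round = 1, 3 do
    for _, candidate in ipairs(inputters) do
      if candidate.appropriate(round, filename, doc) then
        return candidate._name, round
      end
    end
  end
  return nil
end

print(detect("notes.beta", ""))       -- claimed on the filename, in round 1
print(detect(nil, "#beta some text")) -- claimed on the content, in round 2
\end{raw}

In this model, every inputter gets a chance to claim the filename before any of them inspects the content, which is why the priority order matters separately within each round.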

For instance, say you are designing an inputter for HTML.
The \em{appropriation} logic might look as follows.

\begin[type=autodoc:codeblock]{raw}
function inputter.appropriate (round, filename, doc)
  if round == 1 then
    -- The filename is nil when processing a string.
    -- In a Lua pattern, "%." matches a literal dot.
    return filename ~= nil and filename:match("%.html$") ~= nil
  elseif round == 2 then
    local sniff = doc:sub(1, 100)
    local promising = sniff:match("<!DOCTYPE html>")
      or sniff:match("<html>") or sniff:match("<html ")
    return promising ~= nil
  end
  return false
end
\end{raw}

Here, to keep the example simple, we decided not to implement round 3, which would require an actual HTML parser capable of intercepting syntax errors.
This is clearly outside the scope of this tutorial.\footnote{The third round is also the most “expensive” in terms of computing, so clever optimizations might be needed, but we are not going to consider that topic here.}
You should nevertheless now have the basics for understanding how existing inputters are supposed to perform format detection.
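
Such detection logic can be exercised outside SILE with a few assertions, as in the following sketch.
The \code{inputter} table is a mere stand-in for the real class.
The function body follows the logic above, with the dot of the file extension escaped and a guard for a missing filename.
The sample filenames and contents are made up for illustration.

\begin[type=autodoc:codeblock]{raw}
-- Stand-in for the class normally built with pl.class(base).
local inputter = {}

function inputter.appropriate (round, filename, doc)
  if round == 1 then
    return filename ~= nil and filename:match("%.html$") ~= nil
  elseif round == 2 then
    local sniff = doc:sub(1, 100)
    local promising = sniff:match("<!DOCTYPE html>")
      or sniff:match("<html>") or sniff:match("<html ")
    return promising ~= nil
  end
  return false
end

-- Round 1 only looks at the filename.
assert(inputter.appropriate(1, "index.html", "") == true)
assert(inputter.appropriate(1, "index.htmlx", "") == false)
assert(inputter.appropriate(1, nil, "<html>") == false)
-- Round 2 only looks at the first characters of the content.
assert(inputter.appropriate(2, nil, "<html lang=\"en\">") == true)
assert(inputter.appropriate(2, nil, "Just some text") == false)
-- Round 3 is not implemented in this example.
assert(inputter.appropriate(3, "index.html", "<html>") == false)
print("all detection checks passed")
\end{raw}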

\subsection{Content parsing}

Once SILE finds an inputter appropriating the content, it invokes its \code{parse()} method.
Ultimately, this method must return a SILE document tree.
So this is where your task really takes off.
You have to parse the document, build a SILE abstract syntax tree, and wrap it into a document.
The general structure will likely look as follows, but the details strongly depend on the input language you are going to support.

\begin[type=autodoc:codeblock]{raw}
function inputter:parse (doc)
  local ast = myOwnFormatToAST(doc) -- parse doc and build a SILE AST
  local tree = {{
    ast,
    command = "document",
    options = { ... },
  }}
  return tree
end
\end{raw}

For the sake of a better illustration, we are going to pretend that our input format uses square brackets to mark italics.
Let us say that is all there is to it, and go for a naive and very low-level solution.

\begin[type=autodoc:codeblock]{raw}
function inputter:parse (doc)
  local ast = {}
  for token in SU.gtoke(doc, "%[[^]]*%]") do
    if token.string then
      ast[#ast+1] = token.string
    else
      -- bracketed content
      local inside = token.separator:sub(2, #token.separator - 1)
      ast[#ast+1] = {
        [1] = inside,
        command = "em",
        id = "command",
        -- our naive logic does not keep track of positions in the input stream
        lno = 0, col = 0, pos = 0
      }
    end
  end
  local tree = {{
    ast,
    command = "document",
  }}
  return tree
end
\end{raw}
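
To see what kind of AST this logic produces, here is a stand-alone approximation that runs with a plain Lua interpreter, outside SILE.
The \code{parseBrackets} function is a hypothetical stand-in written for this illustration: it uses the same Lua pattern but relies on \code{string.find} instead of SILE’s utilities.

\begin[type=autodoc:codeblock]{raw}
-- Hypothetical stand-alone approximation; it does not use SILE's SU.gtoke.
local function parseBrackets (doc)
  local ast = {}
  local position = 1
  while position <= #doc do
    local first, last = doc:find("%[[^]]*%]", position)
    if not first then
      -- no more bracketed content: keep the remaining text as-is
      ast[#ast+1] = doc:sub(position)
      break
    end
    if first > position then
      -- plain text before the opening bracket
      ast[#ast+1] = doc:sub(position, first - 1)
    end
    ast[#ast+1] = {
      [1] = doc:sub(first + 1, last - 1), -- content without the brackets
      command = "em",
      id = "command",
      lno = 0, col = 0, pos = 0
    }
    position = last + 1
  end
  return ast
end

local ast = parseBrackets("Hello [world], again.")
for index, node in ipairs(ast) do
  if type(node) == "table" then
    print(index, "command " .. node.command, node[1])
  else
    print(index, "text", node)
  end
end
\end{raw}

With this sample input, the AST consists of three nodes: a string, an \code{em} command node wrapping the bracketed word, and another string.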

Of course, real input formats need more than that, such as parsing a complex grammar with LPEG or other tools.
SILE also provides some helpers to facilitate AST-related operations.
Again, we just kept it as simple as possible here, so as to describe the concepts and the general workflow and get you started.
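
As a hint of what a grammar-based approach might look like, the same toy format can be expressed with LPEG.
This is only a sketch under a few assumptions: the \code{lpeg} module must be available to your Lua interpreter, the \code{emphasis} helper and the rule names are hypothetical, and, unlike the previous logic, parsing simply stops at an unmatched opening bracket.

\begin[type=autodoc:codeblock]{raw}
local lpeg = require("lpeg")
local P, C, Ct = lpeg.P, lpeg.C, lpeg.Ct

-- Hypothetical helper turning the captured text into an "em" command node.
local function emphasis (text)
  return {
    [1] = text,
    command = "em",
    id = "command",
    -- positions are not tracked in this sketch
    lno = 0, col = 0, pos = 0
  }
end

-- "[", any characters but "]" (captured and wrapped), then "]"
local bracketed = P("[") * (C((1 - P("]"))^0) / emphasis) * P("]")
-- one or more characters other than "[", captured as a string
local plain = C((1 - P("["))^1)
-- a sequence of both, collected into a table
local grammar = Ct((bracketed + plain)^0)

local ast = grammar:match("Hello [world], again.")
print(#ast, ast[1], ast[2].command, ast[2][1], ast[3])
\end{raw}

The resulting table has the same shape as the one built by hand earlier, so it can be wrapped into a document node in the same way.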

\subsection{Inputter options}

In the preceding sections, we explained how to implement a simple input handler, with just a few methods being overridden.
The other default methods from the base inputter class still apply.
In particular, options passed to the \autodoc:command{\include} command are passed on to our inputter instance and are available in \code{self.options}.
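
As an illustration, a \code{parse()} method might consult such an option to adapt its behavior.
In the following stand-alone sketch, the \code{emphasis} option is hypothetical, and the \code{inputter} table merely stands in for a real instance, whose \code{self.options} would be populated by SILE.

\begin[type=autodoc:codeblock]{raw}
-- Stand-in for an inputter instance; within SILE, self.options is set for us.
local inputter = { options = { emphasis = "strong" } }

function inputter:parse (doc)
  -- "emphasis" is a hypothetical option naming the command to apply,
  -- falling back to "em" when it is not provided.
  local command = self.options.emphasis or "em"
  local ast = {
    {
      [1] = doc,
      command = command,
      id = "command",
      lno = 0, col = 0, pos = 0
    }
  }
  return {{ ast, command = "document" }}
end

local tree = inputter:parse("Hello")
print(tree[1].command, tree[1][1][1].command)
\end{raw}

Under these assumptions, a user would select the behavior by passing \code{emphasis=strong} among the options of \autodoc:command{\include}.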

\end{document}
50 changes: 0 additions & 50 deletions documentation/developers.sil

This file was deleted.

32 changes: 32 additions & 0 deletions documentation/fig-input-to-output.dot
@@ -0,0 +1,32 @@
digraph G {
  rankdir="LR"
  input [shape=note]
  output [shape=note]

  input -> inputter

  subgraph cluster_0 {
    style=rounded;
    color=lightgrey;
    shape=note;
    #node [style=filled,color=white];

    label = "processing";

    command[label="Command\nprocessing"]
    typesetter[label="Typesetter"]
    paragraphing[label="Hyphenation\n&\nLine-breaking"]
    pagebreaking[label="Page-breaking"]
    frame[label="Frame abstraction"]

    command -> typesetter
    typesetter -> frame
    typesetter -> paragraphing
    paragraphing -> pagebreaking
  }

  inputter -> command
  pagebreaking -> outputter
  outputter -> output
}

Binary file added documentation/fig-input-to-output.png
1 change: 1 addition & 0 deletions documentation/sile.sil
@@ -60,6 +60,7 @@ Didier Willis\break
% Developers' guide
\include[src=documentation/c09-concepts.sil]
\include[src=documentation/c10-classdesign.sil]
\include[src=documentation/c11-inputoutput.sil]
\include[src=documentation/c11-xmlproc.sil]
\include[src=documentation/c12-tricks.sil]
\end{document}
