<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
<title>SRFI 109: Extended string quasi-literals</title>
<meta name="viewport" content="width=device-width, initial-scale=1" />
<link rel="stylesheet" href="/srfi.css" type="text/css" />
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<style type="text/css">
div.title h1 { font-size: small; color: blue }
div.title { font-size: xx-large; color: blue; font-weight: bold }
h1 { font-size: x-large; color: blue }
h2 { font-size: large; color: blue }
/* So var inside pre gets same font as var in paragraphs. */
var { font-family: monospace; }
em.non-terminal { }
em.non-termina-def { }
code.literal { font-style: normal; }
code.literal:before { content: "“" }
code.literal:after { content: "”" }
</style>
</head>
<body>
<div class="title">
<H1>Title</H1>
Extended string quasi-literals
</div>
<H1>Author</H1>
<p>Per Bothner
<code><a href="mailto:per@bothner.com"><per@bothner.com></a></code></p>
<h1 id="status">Status</h1>
<p>
This SRFI is currently in ``final'' status. To see an explanation of
each status that a SRFI can hold, see <a
href="http://srfi.schemers.org/srfi-process.html">here</a>.
To provide input on this SRFI, please
<a href="mailto:srfi minus 109 at srfi dot schemers dot org">mail to
<code><srfi minus 109 at srfi dot schemers dot org></code></a>. See
<a href="../srfi-list-subscribe.html">instructions here</a> to
subscribe to the list. You can access previous messages via
<a href="mail-archive/maillist.html">the archive of the mailing list</a>.
You can access
post-finalization messages via
<a href="http://srfi.schemers.org/srfi-109/post-mail-archive/maillist.html">
the archive of the mailing list</a>.
</p>
<ul>
<li>Received: <a href="http://srfi.schemers.org/srfi-109/srfi-109-1.1.html">2012-11-03</a></li>
<li>Revised: <a href="http://srfi.schemers.org/srfi-109/srfi-109-1.2.html">2013-02-04</a></li>
<li>Revised: <a href="http://srfi.schemers.org/srfi-109/srfi-109-1.3.html">2013-03-26</a></li>
<li>Revised: <a href="http://srfi.schemers.org/srfi-109/srfi-109-1.5.html">2013-04-19</a></li>
<li>Revised: <a href="http://srfi.schemers.org/srfi-109/srfi-109-1.6.html">2013-05-25</a></li>
<li>Finalized: <a href="http://srfi.schemers.org/srfi-109/srfi-109-1.8.html">2013-06-21</a></li>
<li>Draft: 2012-11-10 - 2013-01-10</li>
</ul>
<H1>Abstract</H1>
<p>
This specifies a reader extension for extended string quasi-literals,
including nicer multi-line strings, and enclosed unquoted expressions.
<p>
This proposal is related to
<a href="../srfi-108/srfi-108.html">SRFI-108 (named quasi-literal constructors)</a> and <a href="../srfi-107/srfi-107.html">SRFI-107 (XML reader syntax)</a>,
as they share quite a bit of syntax.
<h1>Rationale</h1>
<p>This proposal addresses several related problems
with string literals.
<h2>Multi-line string literals</h2>
<p>Standard Scheme string literals are awkward for multi-line strings.
One problem is that the same delimiter (double-quote) is used for both
the start and end of the string. This is error-prone and not robust:
adding or removing a single character changes the meaning of the entire
rest of the program.
A related problem is that if the delimiter appears in the string it
needs to be quoted using an escape character, which can become hard to read.
If we have distinct start and end delimiters, then we only
need to escape <q>unbalanced</q> use of the delimiters.
<p>
A common solution is a
<a href="http://en.wikipedia.org/wiki/Here_document"><q>here document</q></a>,
where distinct multi-character start and end delimiters are used.
For example the <a href="http://en.wikipedia.org/wiki/Unix_shell">Unix shell</a>
uses <code><<</code> followed by an arbitrary token
as the start delimiter, and then the same token as the end delimiter:
<pre>
tr a-z A-Z <<END_TEXT
one two three
uno dos tres
END_TEXT
</pre>
<p>
This proposal uses just <code>&{</code> and <code>}</code>
as the default start and end delimiters, respectively:
<pre>
(string-upcase &{
one two three
uno dos tres
})
</pre>
<h2>Enclosed (unquoted) expressions</h2>
<p>Commonly one wants to construct a string as a concatenation of
literal text and evaluated expressions.
Using explicit string concatenation (Scheme <code>string-append</code>
or Java's <code>+</code> operator)
is verbose and can be error-prone.
Using <code>format</code> is an alternative, but it is also a bit verbose.
Worse, the format specifier and expression it controls
are non-adjacent, which is awkward and error-prone.
It is nicer to be able to use
<a href="http://en.wikipedia.org/wiki/Variable_interpolation">variable interpolation</a>, as in Unix shells:
<pre>
echo "Hello ${name}!"
</pre>
<p>
This proposal uses the syntax:
<pre>
&{Hello &[name]!}
</pre>
<p>
Note that <span><code class="literal">&</code></span> is used both
as part of the prefix <code class="literal">&{</code> to mark the entire string, and as an escape character within the string.
See the discussion in
<a href="../srfi-108/srfi-108.html#delimiter-options">SRFI-108 (delimiter options)</a>.
<h2>Template processing</h2>
<p>
Going one step further, a
<a href="http://en.wikipedia.org/wiki/Template_processor">template processor</a>
has many uses.
Examples include <a href="http://brl.sourceforge.net/">BRL</a>
and <a href="http://en.wikipedia.org/wiki/JavaServer_Pages">JSP</a>,
which are both used to generate web pages.
<p>
The simple solution is to allow general Scheme expressions in substitutions:
<pre>
&{Hello &[(string-capitalize name)]!}
</pre>
<p>
You can also leave out the square brackets when the expression is
a parenthesized expression:
<pre>
&{Hello &(string-capitalize name)!}
</pre>
<p>
Note that this syntax for unquoted expressions matches that used in
<a href="../srfi-107/srfi-107.html">SRFI-107 (XML reader syntax)</a>.
<h2>Indentation and line-endings</h2>
<p>By default there is a one-to-one mapping between
whitespace in the literal and the resulting string
(except that <var>line-ending</var> is normalized to the newline character),
but it is often convenient (or at least prettier)
for them to be different.
<p>
You can of course easily add extra newline characters beyond those in
the literal:
<pre>
&{a&newline;b} ⟹ "a\nb"
</pre>
<p>
Conversely, the <dfn>line-continuation marker</dfn>
<code class="literal">&-</code> is used to suppress a newline:
<pre>
&{abc&-
 def} ⟹ "abc def"
</pre>
<p>The marker also suppresses any <i>intraline whitespace</i> between
the <code class="literal">&-</code> and the newline,
but it does <em>not</em> suppress <i>intraline whitespace</i>
following the newline.
In the latter respect it differs from the <code class="literal">\</code>
at the end of a line in an R6RS string literal.
<p>
Suppressing initial whitespace is more generally useful than
just for continuation lines. For example it is important for properly
indenting source code to match the program structure.
The <dfn>indentation marker</dfn> <code class="literal">&|</code>
is used to mark the end of insignificant initial whitespace,
typically to indent strings inside a function.
The <code class="literal">&|</code> characters and all the preceding
whitespace are removed:
<pre>
(display (string-upcase &{
&|one two three
&|uno dos tres
}) out)
</pre>
<p>
As a matter of style, all of the indentation markers should line up.
An implementation may warn if indentation is inconsistent.
It is an error if there are any non-whitespace characters
between the previous newline and the indentation marker.
It is also an error to write an indentation marker
before the first newline in the literal.
<p>
One does not normally want an initial newline in a multi-line string.
However, as in the above example, the natural way to write this
is with the left brace on the previous line - otherwise either
the source is <q>wrongly</q> indented, or the matching columns
in the result don't line up in the source.
For that reason <code class="literal">&|</code>
also suppresses an initial newline.
Specifically, the initial left brace may be followed by
optional (invisible) intraline-whitespace, then a newline,
then optional intraline-whitespace (the indentation), and
finally the indentation marker <code class="literal">&|</code>;
in that case all of this is removed from the output.
Otherwise the <code class="literal">&|</code> only removes
initial intraline-whitespace on the same line (and itself).
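<p>
As an illustration of these rules (a sketch, not part of the specification),
the <code>display</code> example above should denote the same string as
the standard literal below. The binding of <code>out</code> to the current
output port is an assumption made only so the fragment stands alone:
<pre>
;; Assumption: out is some output port; the current one is used here.
(define out (current-output-port))
;; The initial newline and each indentation marker (with the whitespace
;; before it) are dropped; the newline ending each line is kept.
(display (string-upcase "one two three\nuno dos tres\n") out)
</pre>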
<p>
However, traditionally there should be a <em>final</em> newline
in a multi-line string. So the following styles are suggested.
If the text is at top-level, or more generally,
the closing brace is in the first column, then write it like this:
<pre>
(define help-message &{
&|This is the first of 2 lines.
&|This last line is followed by a final newline.
})
</pre>
<p>
When the text is nested so that the closing brace should not
be in the left column, you can use an extra indentation marker,
like this:
<pre>
(display
(string-upcase &{
&|This is the first of 2 lines.
&|This last line is followed by a final newline.
&|})
out)
</pre>
<p>Note that the above has 3 indentation markers, but the
resulting string has 2 lines and a total of 2 newline characters,
because the first indentation marker suppresses the initial newline.
<p>
If you do not want to end the final line with a newline,
you can either use a line-continuation marker,
or end the line with the closing brace:
<pre>
(display (string-upcase &{
&|This is the first of 2 lines.
&|This last line is not followed by a final newline.}) out)
</pre>
<!--
<p>An idea - perhaps not sufficiently useful - allows prefixed line numbers.
<pre>
(display (string-upcase &|{
1: |one two three
2: |uno dos tres
}) out)
</pre>
-->
<!--
FIXME - see later ...
It might be useful to allow comments at the end of each line.
For example, this facility could be used for line numbers:
<pre>
(display (string-upcase &{
&|one two three &;; 1
&|uno dos tres &;; 2
}) out)
</pre>
<p>
Here <span><code>&;</code></span> can be followed by horizontal whitespace
or comments; both the <span><code>&;</code></span> and the whitespace and comments
are ignored. (It is an error if there is anything else following.)
This is useful to add per-line comments. It is also useful to indicate
that the line ends with whitespace; adding <span><code>&;</code></span> after the
included whitespace makes that clear.
-->
<h2>Embedded comments</h2>
<p>For long strings it may be useful to embed comments, even
though this is redundant since it could be done using enclosed expressions:
<pre>
&{preamble &[#|ignore this part|#] postamble}
</pre>
<p>
However, this seems clumsy, so this specification has a comment syntax:
<pre>
&{preamble &#|ignore this part|# postamble}
</pre>
<p>
For example for line numbers:
<pre>
(display (string-upcase &{
&|&#|line 1|#one two
&|&#|line 2|# three
&|&#|line 3|#uno dos tres
}) out)
</pre>
<p>
(It is tempting to allow comments before a
<code class="literal">&|</code> indentation marker,
but it entails more complexity than seems justified.)
<h2>Character escapes</h2>
<p>
We support the standard XML syntax for character references,
using either decimal or hexadecimal values.
The following string has two instances of the ASCII escape character,
as either decimal 27 or hex <code>1B</code>:
<pre>
&{&#27;&#x1B;}
</pre>
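<p>
For comparison, here is a sketch of the same two-character string built
with standard procedures. The name <code>two-escapes</code> is only
illustrative:
<pre>
;; Both character references denote code point 27 (ESC).
(define two-escapes
  (string (integer->char 27) (integer->char #x1B)))

(string-length two-escapes)                 ; 2
(char=? (string-ref two-escapes 0)
        (string-ref two-escapes 1))         ; #t
</pre>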
<!--<p><b>Design note:</b> Note we use <code>#&</code>
to introduce a literal, and <code>#</code> for a character escape. This could
be confusing, but we assume numeric character escapes will be rare.-->
<p>
You can also use the pre-defined XML entity names:
<pre>
&{&amp; &lt; &gt; &quot; &apos;} ⟹ "& < > \" '"
</pre>
<p>
In addition, <code>&lbrace;</code> <code>&rbrace;</code>
can be used for left and right curly brace:
<pre>
&{&rbrace;_&lbrace;} ⟹ "}_{"
</pre>
<p>
Note that these are only needed for unbalanced braces:
<pre>
&{A left brace '{' followed by a right brace '}' is ok.}
⟹ "A left brace '{' followed by a right brace '}' is ok."
</pre>
<p>
An implementation <em>must</em> support the character names <code>amp</code>,
<code>lt</code>, <code>gt</code>, <code>quot</code>,
<code>apos</code>, <code>lbrace</code>, and <code>rbrace</code>.
An implementation <em>should</em> support
<a href="http://www.w3.org/2003/entities/2007/w3centities-f.ent">the standard XML entity names</a>
(though resource-limited or non-Unicode-based implementations
are not required to). For example:
<pre>
&{L&aelig;rdals&oslash;yri}
⟹ "Lærdalsøyri"
</pre>
<p>
An implementation <em>should</em> also support the standard
R7RS character names <code>null</code>, <code>alarm</code>,
<code>backspace</code>, <code>tab</code>, <code>newline</code>,
<code>return</code>, <code>escape</code>, <code>space</code>,
and <code>delete</code>. For example:
<pre>
&{&escape;&space;}
</pre>
<p>
The reader translates the entity reference
<code>&<var>name</var>;</code>
to the variable reference <code>$entity$:<var>name</var></code>.
Therefore user-defined entity names are possible:
<pre>
(define $entity$:crnl "\r\n")
&{&crnl;} ⟹ "\r\n"
</pre>
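<p>
Under the same translation, the required names could be provided as
ordinary variable definitions. The following is only a sketch, assuming
each entity is bound to a string as in the <code>crnl</code> example;
an implementation is free to predefine them some other way:
<pre>
;; Hypothetical definitions of the names an implementation must support.
(define $entity$:amp "&")
(define $entity$:lt "<")
(define $entity$:gt ">")
(define $entity$:quot "\"")
(define $entity$:apos "'")
(define $entity$:lbrace "{")
(define $entity$:rbrace "}")

;; The pieces of the earlier unbalanced-brace example, joined by hand:
(string-append $entity$:rbrace "_" $entity$:lbrace)   ; "}_{"
</pre>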
<p>
<!--
<p><b>Discussion:</b> Instead of
<code>&lcurly; &rcurly; &lsquare; &rsquare;</code>
would other names be better? For example:
<code>&lbrace; &rbrace; &lbracket; &rbracket;</code>.
If we support slash-forms perhaps we don't need them,
but instead one could write:
<code>&\{ &\} &\[ &\]</code>.
-->
<h2>Possible extensions</h2>
<p>
This section discusses some ideas that seem worthwhile,
but need more thought, so are deferred for now.
<h3 id="special-characters">Special characters</h3>
<p>
Only the characters <code>'{'</code>, <code>'}'</code>, and
<code>'&'</code> are reserved and thus need special escaping.
Braces only need escaping when unbalanced, which is likely
to be rare in both text and quoted programs, thus the only
real problem is <code class="literal">&</code>.
A common solution in other languages is doubling.
That is one could read <code>&&</code> as
a single <code>&</code>. However, doubling is not otherwise
used in Scheme, so it may not be worth adding as a special case.
<p>
It might be convenient to support standard string single-character slash
escapes in some form. For example:
<pre>
&{Hello!&\r&\n} ⟹ "Hello!\r\n"
</pre>
<p>
Maybe not really needed, since one could just write:
<pre>
&{Hello!&["\r\n"]}
</pre>
<h3 id="formatting">Formatting</h3>
<p>
Many Scheme implementations use <a href="http://srfi.schemers.org/srfi-48/srfi-48.html"><code>format</code></a> for
finer-grained control of the output. A problem with <code>format</code>
is that the association between format specifiers and data expressions
is positional, which is hard-to-read and error-prone.
A better solution places the specifier adjacent to the data expression:
<pre>
&{The response was &~,2f(* 100.0 (/ responses total))%.}
</pre>
<p>
The reader would map this to:
<pre>
($string$ "The response was " ($format$ "~,2f" (* 100.0 (/ responses total))) "%.")
</pre>
<p>A simple definition of <code>$format$</code>:
<pre>
(define ($format$ fmt . args) (apply format #f fmt args))
</pre>
<p>Implementations that support
<a href="http://en.wikipedia.org/wiki/Printf_format_string"><code>printf</code>-style formatting</a> can also optionally support those:
<pre>
&{The response was &%.2f(* 100.0 (/ responses total))%.}
</pre>
<p>
This would be read as:
<pre>
($string$ "The response was " ($sprintf$ "%.2f" (* 100.0 (/ responses total))) "%.")
</pre>
<p>
(The JavaFX Script language provided similar functionality.)
<h3>Internationalized strings</h3>
<p>
Internationalization refers to a framework so that
text messages can be emitted in multiple (human) languages,
depending on the user's preferred locale.
See <a href="http://srfi.schemers.org/srfi-29/srfi-29.html">SRFI-29</a>.
Strings that may need to be translated are marked specially.
For the sake of discussion we can use the prefix <code>^</code>
followed by a <var>key</var>:
<pre>
&^hello{Hello!}
</pre>
<p>
Here the key is the string <code>hello</code>. At runtime this key
is combined with the <q>current language</q> to produce a translated string.
If no translation is found, then the string in the literal <code>Hello!</code>
is used.
<p>
If there is no explicit key, the string is used as the key.
In the following, <code>"Hello!"</code> is used as the key.
<pre>
&^{Hello!}
</pre>
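<p>
The lookup with fallback described above might be sketched as follows.
The table layout and the names <code>translations</code> and
<code>translate</code> are hypothetical; this proposal does not define
a translation database:
<pre>
;; Hypothetical translation table: (key . ((language . text) ...)).
(define translations
  '(("hello" . (("fr" . "Bonjour !") ("de" . "Hallo!")))))

;; Return the translation of key for language,
;; or default (the text in the literal) if none is found.
(define (translate key language default)
  (let* ((entry (assoc key translations))
         (hit (and entry (assoc language (cdr entry)))))
    (if hit (cdr hit) default)))

(translate "hello" "fr" "Hello!")   ; "Bonjour !"
(translate "hello" "ja" "Hello!")   ; "Hello!"
;; With no explicit key, the text itself serves as the key:
(translate "Hello!" "fr" "Hello!")  ; "Hello!" (no such key in this table)
</pre>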
<h3>Complex formats and internationalization</h3>
<p>
A simple implementation of <code>$format$</code> as
a call to the <code>format</code> function
does not handle format specifiers that change the
argument order.
These are primarily useful for localizing messages,
since one might want to change argument order when translating
from one language to another. Consider this warning message:
<pre>
&^{'&[partition]' has only &[avail] bytes free.}
</pre>
<p>
A translation might want to re-order the arguments, as if it were:
<pre>
&^{Only &[avail] bytes free on '&[partition]'.}
</pre>
<p>
That could be done if the translation database provides
for a format that re-orders the arguments,
perhaps using the tilde-asterisk format specifier forms.
For example (to pick some hypothetical translation database syntax):
<pre>
"'&[]' has only &[] bytes free." => "Only &~1@*~d[] bytes free on '&~0@*~s[]'."
</pre>
<p>
It follows that we can't use a one-to-one translation from
a format-specifier (<code>$format$</code>) to a call to the
<code>format</code> function. Instead we need to work with a
single format string constructed from the entire text to be localized.
This complicates the implementation.
The basic algorithm should be something like:
<ol>
<li>
Construct a <var>text-part</var> by taking the literal text,
format specifiers, and expanded entity-references.
Leave out all the enclosed expressions.
The exact translation format is still to be specified,
but one idea is to represent each enclosed expression
by <code class="literal">&[]</code> if there is no format-specifier,
and <code class="literal">&[<var>specifier</var>]</code> if there is one.
<li>
If translation is specified, create a <var>translation-key</var>:
Either use an explicit <var>translation-key</var> given in the quasi-literal, or use
the <var>text-part</var> as an implicit <var>translation-key</var>
(GNU <code>gettext</code>-style).
Look for a translation in the translation database.
If one is found, use that as the translated <var>text-part</var>;
otherwise use <var>text-part</var> as-is.
<li>Convert the <var>text-part</var> to a format-string
by escaping stand-alone <code class="literal">~</code> characters.
Replace each <code class="literal">&[]</code>
by <code class="literal">~a</code>,
and each <code class="literal">&[<var>specifier</var>]</code>
by the <code class="literal"><var>specifier</var></code>.
<li>Invoke <code>format</code> with the resulting format string and
the enclosed expressions as the arguments.
</ol>
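<p>
Step 3 of the outline above might be sketched as follows.
The procedure name <code>text-part->format-string</code> is hypothetical,
and the sketch assumes the placeholder notation suggested in step 1,
with no <code class="literal">]</code> inside a specifier:
<pre>
;; Sketch of step 3: convert a text-part to a format string.
;; Assumes enclosed expressions appear as "&[]" or "&[specifier]".
(define (text-part->format-string text)
  (let ((n (string-length text)))
    (let loop ((i 0) (acc '()))
      (cond
        ((>= i n)
         (apply string-append (reverse acc)))
        ;; "&[" starts a placeholder; scan for the closing "]".
        ((and (char=? (string-ref text i) #\&)
              (< (+ i 1) n)
              (char=? (string-ref text (+ i 1)) #\[))
         (let scan ((j (+ i 2)))
           (cond
             ((>= j n) (error "unterminated placeholder" text))
             ((char=? (string-ref text j) #\])
              (let ((spec (substring text (+ i 2) j)))
                ;; An empty placeholder becomes ~a; otherwise keep the specifier.
                (loop (+ j 1)
                      (cons (if (string=? spec "") "~a" spec) acc))))
             (else (scan (+ j 1))))))
        ;; A stand-alone tilde is doubled so that format prints it literally.
        ((char=? (string-ref text i) #\~)
         (loop (+ i 1) (cons "~~" acc)))
        (else
         (loop (+ i 1) (cons (string (string-ref text i)) acc)))))))

(text-part->format-string "'&[]' has only &[~d] bytes free.")
;; "'~a' has only ~d bytes free."
</pre>
<p>
The resulting string would then be passed to <code>format</code>
together with the enclosed expressions, as in step 4.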
<!--
<p>
This procedure can be optimized at compile time if there is
no localizarion or format specifiers. Unfortunately, if
translation is required, then we basically have to convert
the <code>$quasi-string$</code> back to a <var>text-part</var> string,
translate it, and then parse the translated string.
Only the first part can be done at compile-time.
This unparsing and reparsing is hard to avoid
as long as the translation mappings are in text form, and anything
else seems difficult to work with.
-->
<!--
<p>
Hence the actual translation is more complex:
<pre>
<code class="literal">(format #t (string-concat</code> <b>TrToFormat[</b><var class="non-terminal">form</var><b>]</b>...<code class="literal">)</code> <b>TrToFormatArgs[</b><var class="non-terminal">form</var><b>]</b>...<code class="literal">)</code>
</pre>
where:
<pre>
<b>TrToFormat[</b>"<var>string-literal</var>"<b>]</b>
⟾ "<var>string-literal</var>" <i>except with % doubled</i>
<b>TrToFormatArgs[</b>"string-literal"<b>]</b>
⟾ #|nothing|#
<b>TrToFormat[</b>($entity-reference$ name)<b>]</b>
⟾ mapped-named
<b>TrToFormatArgs[</b>($entity-reference$ name)<b>]</b>
⟾ #|nothing|#
<b>TrToFormat[</b>($format$ <var class="non-terminal">format</var> <var class="non-terminal">expression</var> ...)<b>]</b>
⟾ <code>"</code><var class="non-terminal">format</var><code>"</code>
<b>TrToFormatArgs[</b>($format$ format <var class="non-terminal">expression</var> ...)<b>]</b>
⟾ <var class="non-terminal">expression</var> ...
<b>TrToFormat[</b>other-form<b>]</b>
⟾ %a
<b>TrToFormatArgs[</b>other-form<b>]</b>
⟾ other-form
</pre>
Note that the $entity-reference$ invocation after resolving to
a string becomes part of the format, so it must be
resolveable at compile-time. ((Or call string-concat to build format string.))
<pre>
($format$ format expression ...)
</pre>
When standalone, this is implemented as:
<pre>
(format #t format expression ...)
</pre>
-->
<h3 id="user-defined-end-token">User-defined end token</h3>
<p>
Many languages, including the Bourne shell,
allow for a user-defined end token.
We could allow this as an option following a marker
character - for example <code class="literal">!</code>:
</p>
<pre>
(string-upcase &!END-TEXT{
one two three
uno dos tres
}!END-TEXT)
</pre>
<h3 id="splicing">Splicing of lists and vectors</h3>
<p>
Sometimes you want to insert all the values of a vector or list
in an enclosed-part.
I.e. you want to <q>splice</q> the elements of the list/vector
into the result string. This is similar to
the splicing of a list in quasi-quotation. It seems reasonable
to use the same prefix character <code class="literal">@</code>.
Thus:
<pre>
(define exp (list e1 e2 ... en))
&{_&[@exp]_}
</pre>
<p>
should be equivalent to:
<pre>
&{_&[e1 e2 ... en]_}
</pre>
<p>
This can be implemented using the <code class="literal">~{</code>
<code class="literal">~}</code> iteration format specifiers from
Common Lisp, if the implementation supports those:
<pre>
&{_&~{~a~}[exp]_}
</pre>
<!--
<h2>Translation to S-expressions</h2>
<p>
This specification could leave it implementation-defined how
the above syntax is implemented. Instead, following Scheme
tradition (including old-fashioned quasi-quotation), we specify
a mapping performed by the Scheme reader into a simpler S-expression format.
This allows quotation to be well-defined, and makes it easier
for various tools (and macros) to process this syntax.
For example:
<pre>
&{Hello &[name]!}
</pre>
is read as if it were:
<pre>
($string$ "Hello " $<<$ name $>>$ "!")
</pre>
We assume a predefined macro named <code>$string$</code>
which concatenates the various pieces together. The dollar-signs
are used to reduce name conflicts - they indicate <code>$string$</code>
is a special keyword which would normally not be redefined by the user.
<p>
Delimiting an enclosed expression inside a <code>$<<$</code>...<code>$>>$</code> pair
is mostly redundant, but can be important if we add support for
format specifiers: Enclosed expressions should be formatted
as if with a <code>~a</code> format specifiers, while literal
text should be part of the format string. The distinction
matters if we support format specifiers that re-order the arguments
(move the argument cursors), which is useful for internationalization.
-->
<h1>Specification</h1>
<h2>Syntax</h2>
<pre>
<var class="non-terminal-def">expression</var> ::= ...
| <var>extended-string-literal</var>
</pre>
<pre>
<var class="non-terminal-def" id="extended-string-literal-def">extended-string-literal</var> ::= <code class="literal">&{</code> <var class="non-terminal">initial-ignored</var>? <var>string-literal-part</var><sup>*</sup> <code class="literal">}</code>
<var class="non-terminal-def" id="string-literal-part-def">string-literal-part</var> ::=
<i>any character except </i><code>&</code><i>, </i><code>{</code> <i>or</i> <code>}</code>
| <code class="literal">{</code> <var class="non-terminal">string-literal-part</var><sup>*</sup> <code class="literal">}</code>
| <var class="non-terminal">char-ref</var>
| <var class="non-terminal">entity-ref</var>
| <var class="non-terminal">special-escape</var>
| <var>enclosed-part</var>
<var class="non-terminal-def">char-ref</var> ::=
<code class="literal">&#</code> <var class="non-terminal">digit<sup>+</sup></var> <code class="literal">;</code>
| <code class="literal">&#x</code> <var class="non-terminal">hex-digit<sup>+</sup></var> <code class="literal">;</code>
<var class="non-terminal-def">entity-ref</var> ::=
<code class="literal">&</code> <var class="non-terminal">char-or-entity-name</var> <code class="literal">;</code>
<var class="non-terminal-def">char-or-entity-name</var> ::= <var>tagname</var>
<var class="non-terminal-def">initial-ignored</var> ::=
<var class="non-terminal">intraline-whitespace</var> <var class="non-terminal">line-ending</var> <var class="non-terminal">intraline-whitespace</var> <code class="literal">&|</code>
<var class="non-terminal-def">special-escape</var> ::=
<var class="non-terminal">intraline-whitespace</var> <code class="literal">&|</code>
| <code class="literal">&</code> <var class="non-terminal">nested-comment</var>
| <code class="literal">&-</code> <var class="non-terminal">intraline-whitespace</var> <var class="non-terminal">line-ending</var>
<var class="non-terminal-def">enclosed-part</var> ::=
<code class="literal">&</code> <var class="non-terminal">enclosed-modifier</var> <code class="literal">[</code> <var>expression<sup>*</sup></var> <code class="literal">]</code>
| <code class="literal">&</code> <var class="non-terminal">enclosed-modifier</var> <code class="literal">(</code> <var>expression</var><sup>+</sup> <code class="literal">)</code>
</pre>
<pre>
<var class="non-terminal-def" id="tagname-def">tagname</var> ::= <var class="non-terminal">tagname-initial</var> <var class="non-terminal">tagname-subsequent</var>*
<var class="non-terminal-def">tagname-initial</var> ::= <var class="non-terminal">letter</var>
<var class="non-terminal-def">tagname-subsequent</var> ::= <var class="non-terminal">tagname-initial</var> | <var class="non-terminal">digit</var> | <code class="literal">-</code> (hyphen) | <code class="literal">_</code> (underscore) | <code class="literal">.</code> (period)
</pre>
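<p>
The <var class="non-terminal">tagname</var> grammar above can be checked
with a small predicate. This is only a sketch: the procedure names are
hypothetical, and <var class="non-terminal">letter</var> and
<var class="non-terminal">digit</var> are assumed to be the ASCII ranges:
<pre>
;; Assumption: letter is a-z or A-Z, digit is 0-9.
(define (ascii-letter? c)
  (or (and (char>=? c #\a) (char<=? c #\z))
      (and (char>=? c #\A) (char<=? c #\Z))))

(define (ascii-digit? c)
  (and (char>=? c #\0) (char<=? c #\9)))

;; tagname-subsequent: a letter, a digit, hyphen, underscore or period.
(define (tagname-subsequent? c)
  (or (ascii-letter? c)
      (ascii-digit? c)
      (memv c '(#\- #\_ #\.))))

;; tagname: one tagname-initial followed by zero or more tagname-subsequent.
(define (tagname? s)
  (and (> (string-length s) 0)
       (ascii-letter? (string-ref s 0))
       (let loop ((i 1))
         (or (= i (string-length s))
             (and (tagname-subsequent? (string-ref s i))
                  (loop (+ i 1)))))))

(tagname? "lbrace")    ; #t
(tagname? "x-1.y_z")   ; #t
(tagname? "1abc")      ; #f
(tagname? "a|b")       ; #f
</pre>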
<p>
If we allowed <var class="non-terminal">tagname</var> to be an
arbitrary Scheme identifier there would be parsing difficulties.
One problem is that we use <code class="literal">&|</code> to skip
indentation, but R7RS identifier syntax uses <code class="literal">|</code>
as a delimiter for symbols with special characters.
Another conflict is if an implementation uses
<code class="literal">&~</code> or
<code class="literal">&%</code> to indicate format specifiers,
since these are allowed as R7RS identifier <var>initial</var> characters.
<p>
An implementation <em>may</em> extend
<var class="non-terminal">tagname</var>
to match <var class="non-terminal">Name</var> as
defined by the <a href="http://www.w3.org/TR/xml11/">XML 1.1 specification</a>.
<!--
<var class="non-terminal">NameStartChar</var> and
<var class="non-terminal">NameChar</var> are
defined in the <a href="http://www.w3.org/TR/xml11/">XML 1.1 specification</a>.
(<em>Note:</em>
<p>
As a matter of style, it is recommended that a <var>tagname</var>
consist of a letter, followed by zero or more letters, digits,
hyphens (<code>#\x2d</code>), or underscores (<code>#\x5f</code>).
<pre>
<var class="non-terminal-def">tagname-initial</var> ::= <var class="non-terminal">NameStartChar</var>
<var class="non-terminal-def">tagname-subsequent</var> ::= <var class="non-terminal">NameChar</var>
</pre>
-->
<!--Cowan suggested: "the cname must be a valid Scheme identifier (according to
the implementation's definition) which consists solely of characters
with Unicode general categories Lu, Ll, Lt, Lm, Lo, Mn, Mc, Me, Nd, Nl,
or No (i.e. letters, combining marks, and numbers only)."
"XML special-cases hyphen, underscore, dot, and U+00B7,
the middle dot (which cannot be initial)"
-->
<p>
The following are defined by R6RS: <var class="non-terminal">nested-comment</var>,
<var class="non-terminal">intraline-whitespace</var>,
<var class="non-terminal">line-ending</var>,
<var class="non-terminal">letter</var>,
<var class="non-terminal">digit</var>,
and <var class="non-terminal">hex-digit</var>.
<pre>
<var class="non-terminal-def" id="enclosed-modifier-def">enclosed-modifier</var> ::= <i>empty</i>
</pre>
<p>
An <var class="non-terminal">enclosed-modifier</var> is normally empty.
However, implementations or future extensions may support non-empty modifiers.
For example, Kawa supports both <code>format</code>-style
and <code>printf</code>-style specifiers, so the syntax is:
<pre>
<var class="non-terminal-def">enclosed-modifier</var> ::= <i>empty</i>
| <code class="literal">~</code> <var class="non-terminal">format-specifier-after-tilde</var> (optional feature)
| <code class="literal">%</code> <var class="non-terminal">format-specifier-after-percent</var> (optional feature)
</pre>
<!--
<pre>
<var class="non-terminal-def">extended-datum-literal</var> ::= <code class="literal">#&</code><var>enclosed-datum</var>
<var class="non-terminal-def">enclosed-datum</var> ::= <var>enclosed-cmd</var>?<code class="literal">{</code><var>expression ...</var><code class="literal">}</code>?<code class="literal">[</code><var>enclosed-text</var><code class="literal">}</code>?
</pre>
-->
<h2 id="specification-translation">Translation</h2>
<p>
When the Scheme reader reads an <var class="non-terminal">extended-string-literal</var>
it returns a list whose first element is the symbol <code>$string$</code>,
and whose remaining elements are the translations of the string-literal parts.
The literal content (including each
<var class="non-terminal">char-ref</var> but excluding each
<var class="non-terminal">entity-ref</var>) is translated to
literal strings.
An <var class="non-terminal">entity-ref</var>
<code>&<var>ename</var>;</code> is translated to a
symbol <code>$entity$:<var>ename</var></code>.
Enclosed expressions are prefixed by a <code class="literal">$<<$</code>
symbol, and followed by a <code class="literal">$>>$</code> symbol.
<p>
The translation is defined by a conceptual
<q>read-time re-write function</q> <b>Tr</b>
which maps an <var class="non-terminal">extended-string-literal</var>
in the input stream to an equivalent <code>$string$</code> list - which
is then (conceptually) re-read. (A real reader would generate
S-expression forms directly, but this way we can express the
translation more concisely.)
<pre>
<b>Tr[</b><code class="literal">&{</code> <var class="non-terminal">initial-ignored</var>? <var class="non-terminal">content-segment</var><sup>*</sup> <code class="literal">}</code><b>]</b>
⟾ <code class="literal">($string$</code> <b>TrContent[</b><var class="non-terminal">content-segment</var><b>]</b><sup>*</sup> <code class="literal">)</code>
</pre>
<p>
Each <q>segment</q> corresponds to a
<var class="non-terminal">string-literal-part</var> in the syntax,
except that a run of multiple plain characters and
<var class="non-terminal">char-ref</var>s is combined into a single
string literal. In addition, the <var class="non-terminal">special-escape</var>
forms are dropped without appearing in the result.
<pre>
<b>TrContent[</b>simple-text<sup>+</sup><b>]</b>
⟾ <code class="literal">"</code><b> TrText[</b>simple-text<b>]</b><sup>+</sup> <code class="literal">"</code>
<b>TrText[</b><i>any character except </i><code class="literal">&</code><i>,</i> <code class="literal">\</code><i>, line-ending, or the final (unbalanced)</i> <code class="literal">}</code><b>]</b>
⟾ <i>that character as-is</i>
<b>TrText[</b><var>line-ending</var><b>]</b>
⟾ <code class="literal">\n</code>
<b>TrText[</b><code class="literal">\</code><b>]</b>
⟾ <code class="literal">\\</code>
<b>TrText[</b><code class="literal">&#x</code> <var class="non-terminal">hex-digit</var><sup>+</sup> <code class="literal">;</code><b>]</b>
⟾ <code class="literal">\x</code> <var class="non-terminal">hex-digit</var><sup>+</sup> <code class="literal">;</code>
<b>TrText[</b><code class="literal">&#</code> <var class="non-terminal">digit</var><sup>+</sup> <code class="literal">;</code><b>]</b>
⟾ <code class="literal">\x</code> <i>corresponding hex-digits</i> <code class="literal">;</code>
<b>TrText[</b><code class="literal">&</code> <var class="non-terminal">nested-comment</var><b>]</b>
⟾ <code class="literal"></code>
<b>TrText[</b><var class="non-terminal">intraline-whitespace</var> <code class="literal">&|</code><b>]</b>
⟾ <code class="literal"></code>
<b>TrText[</b><code class="literal">&-</code> <var class="non-terminal">intraline-whitespace</var> <var class="non-terminal">line-ending</var><b>]</b>
⟾ <code class="literal"></code>
</pre>
<p>
Translations for the other segment kinds are straightforward:
<pre>
<b>TrContent[</b><code class="literal">&</code><var class="non-terminal">ename</var><code class="literal">;</code><b>]</b>
⟾ <code class="literal">$entity$:</code><var class="non-terminal">ename</var>
<b>TrContent[</b><code class="literal">&(</code> <var class="non-terminal">expression</var><sup>+</sup> <code class="literal">)</code><b>]</b>
⟾ <code class="literal">$<<$ (</code> <var class="non-terminal">expression</var><sup>+</sup> <code class="literal">) $>>$</code>
<b>TrContent[</b><code class="literal">&[</code> <var class="non-terminal">expression</var><sup>*</sup> <code class="literal">]</code><b>]</b>
⟾ <code class="literal">$<<$</code> <var class="non-terminal">expression</var><sup>*</sup> <code class="literal">$>>$</code>
</pre>
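<p>
As an illustration only, applying the rules above by hand to a few small literals would be expected to give the following results. The identifier <code>name</code> is an arbitrary placeholder, not something defined by this specification.
<pre>
&amp;{Hello &amp;[name]!}
  ⟾ ($string$ "Hello " $&lt;&lt;$ name $&gt;&gt;$ "!")
&amp;{Sum: &amp;(+ 1 2)}
  ⟾ ($string$ "Sum: " $&lt;&lt;$ (+ 1 2) $&gt;&gt;$)
&amp;{a &amp;lt; b}
  ⟾ ($string$ "a " $entity$:lt " b")
&amp;{caf&amp;#xe9;}
  ⟾ ($string$ "caf\xe9;")
</pre>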
<p>The following are optional and/or for a future specification:
<pre>
<b>TrContent[</b><code class="literal">&~</code> <var class="non-terminal">format</var> <code class="literal">(</code> <var class="non-terminal">expression</var><sup>+</sup> <code class="literal">)</code><b>]</b>
⟾ <code class="literal">($format$ "</code> <var class="non-terminal">format</var> <code class="literal">" (</code> <var class="non-terminal">expression</var><sup>+</sup> <code class="literal">))</code>
<b>TrContent[</b><code class="literal">&~</code> <var class="non-terminal">format</var> <code class="literal">[</code> <var class="non-terminal">expression</var><sup>+</sup> <code class="literal">]</code><b>]</b>
⟾ <code class="literal">($format$ "</code><var class="non-terminal">format</var><code class="literal">"</code> <var class="non-terminal">expression</var><sup>+</sup> <code class="literal">)</code>
<b>TrContent[</b><code class="literal">&%</code> <var class="non-terminal">format</var> <code class="literal">(</code> <var class="non-terminal">expression</var><sup>+</sup> <code class="literal">)</code><b>]</b>
⟾ <code class="literal">($sprintf$ "</code> <var class="non-terminal">format</var> <code class="literal">" (</code> <var class="non-terminal">expression</var><sup>+</sup> <code class="literal">))</code>
<b>TrContent[</b><code class="literal">&%</code> <var class="non-terminal">format</var> <code class="literal">[</code> <var class="non-terminal">expression</var><sup>+</sup> <code class="literal">]</code><b>]</b>
⟾ <code class="literal">($sprintf$ "</code><var class="non-terminal">format</var><code class="literal">"</code> <var class="non-terminal">expression</var><sup>+</sup> <code class="literal">)</code>
</pre>
<h2>Implementing the translated forms</h2>
The reader translation:
<pre>
($string$ <var class="non-terminal">form</var> ...)
</pre>
evaluates approximately to an immutable string created by
concatenating each <var class="non-terminal">form</var>.
A basic implementation could be:
<pre>
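;; Concatenate the displayed form of each argument,
;; skipping the $<<$ and $>>$ delimiter objects.
(define ($string$ . args)
  (let ((port (open-output-string)))
    (for-each
     (lambda (arg)
       (if (and (not (eq? arg $<<$)) (not (eq? arg $>>$)))
           (display arg port)))
     args)
    (get-output-string port)))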
</pre>
<p>
The string created by a <code>$string$</code> form is immutable,
and need not have a unique identity. E.g. if the operands
are constant then an implementation is allowed to constant-fold
the expression to a string literal.
<p>
In addition, <code>$<<$</code> and <code>$>>$</code> are
bound to unique objects, distinct from each other and from all other objects
(as determined by <code>eq?</code>). These bindings
should preferably be non-assignable if an implementation has
a mechanism for that (for example using identifier macros).
<pre>
(define $<<$ (make-string 0))
(define $>>$ (make-string 0))
</pre>
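<p>
As a small usage sketch, assuming the basic <code>$string$</code> procedure and the two delimiter definitions above are in scope, the translated forms can be evaluated directly. The variable <code>name</code> is hypothetical and is defined only for this illustration.
<pre>
;; Hand-written reader output for &amp;{Hello &amp;[name]!}
(define name "World")   ; hypothetical variable, for illustration only
($string$ "Hello " $&lt;&lt;$ name $&gt;&gt;$ "!")   ; evaluates to "Hello World!"
($string$ "Sum: " $&lt;&lt;$ (+ 1 2) $&gt;&gt;$)       ; evaluates to "Sum: 3"
</pre>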
<p>
Note that R6RS and R7RS-draft allow <code>eq?</code> to return <code>#t</code>
for distinct calls to <code>(make-string 0)</code>.
An implementation that does so needs to initialize <code>$<<$</code>
and <code>$>>$</code> some other way.
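<p>
One possible alternative, offered as a sketch rather than a required technique, is to bind each delimiter to a freshly allocated pair. This relies on <code>list</code> returning newly allocated storage that is distinct from every existing object.
<pre>
;; Each call to list allocates a new pair, so the two objects are
;; distinct from each other and from any other object under eq?.
(define $&lt;&lt;$ (list '$&lt;&lt;$))
(define $&gt;&gt;$ (list '$&gt;&gt;$))
</pre>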
<p>
If <code>$format$</code> is supported, a minimal implementation is:
<pre>
(define-syntax $format$
  (syntax-rules ()
    (($format$ fmt arg ...)
     (format #f fmt arg ...))))
</pre>
<h1>Implementation</h1>
<p>
Since this specification changes the reader format, and there
is no standard Scheme way to do that, there is no portable implementation.
However, this specification is being implemented in
<a href="http://www.gnu.org/software/kawa/">Kawa</a>.
(Check out the
<a href="http://www.gnu.org/software/kawa/Getting-Kawa.html">development version using Subversion</a>.)
<p>
A more sophisticated implementation of the <code>$string$</code> macro,
which maps to a single <code>format</code> call, is
at the time of writing
in <a href="http://sourceware.org/viewvc/kawa/trunk/kawa/lib/syntax.scm?view=co">syntax.scm</a>.
<h2>Test suite</h2>
There is a test suite in the
<a href="http://sourceware.org/viewvc/kawa/trunk/testsuite/srfi-109-test.scm?view=co">Kawa source tree</a>.
There are also <a href="http://sourceware.org/viewvc/kawa/trunk/testsuite/bad-srfi-109.scm?view=co">tests of malformed literals</a>.
<h1>Copyright</h1>
<p>
Copyright (C) Per Bothner 2013</p>
<p>
Permission is hereby granted, free of charge, to any person obtaining
a copy of this software and associated documentation files (the
"Software"), to deal in the Software without restriction, including
without limitation the rights to use, copy, modify, merge, publish,
distribute, sublicense, and/or sell copies of the Software, and to
permit persons to whom the Software is furnished to do so, subject to
the following conditions:</p>
<p>
The above copyright notice and this permission notice shall be
included in all copies or substantial portions of the Software.</p>
<p>
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE
LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION
OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION
WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.</p>
<hr />
<address>Author: <a href="mailto:per@bothner.com">Per Bothner</a></address>
<address>Editor: <a href="mailto:srfi-editors at srfi dot schemers dot org">
Mike Sperber</a></address>
</body>
</html>