<!DOCTYPE html>
<html lang="en">
<head>
<title>Unicode in XML and other Markup Languages</title>
<meta charset="utf-8"/>
<script src="https://www.w3.org/Tools/respec/respec-w3c-common" class="remove"></script>
<script class="remove">
var respecConfig = {
// specification status (e.g. WD, LCWD, WG-NOTE, etc.). If in doubt use ED.
specStatus: "ED",
publishDate: "2017-07-08",
previousPublishDate: "2016-05-03",
previousMaturity: "WG-NOTE",
noRecTrack: true,
shortName: "unicode-xml",
copyrightStart: "1999",
edDraftURI: "https://w3c.github.io/unicode-xml/",
// if this is a LCWD, uncomment and set the end of its review period
// lcEnd: "2009-08-05",
// editors, add as many as you like
// only "name" is required
authors: [
{ name: "Martin Dürst", mailto: "duerst@it.aoyama.ac.jp",
company: "Invited Expert" },
{ name: "Asmus Freytag", mailto: "asmus@unicode.org",
company: "Unicode Consortium" },
],
editors: [
{ name: "Addison Phillips", mailto: "addison@amazon.com",
company: "Invited Expert" },
],
wg: "Internationalization Working Group",
wgURI: "https://www.w3.org/International/core/",
wgPublicList: "www-international",
bugTracker: { new: "https://github.com/w3c/unicode-xml/issues",
open: "https://github.com/w3c/unicode-xml/issues" } ,
otherLinks: [
{
key: "Github",
data: [
{
value: "repository",
href: "https://github.com/w3c/unicode-xml"
}
]
}
],
// URI of the patent status for this WG, for Rec-track documents
// !!!! IMPORTANT !!!!
// This is important for Rec-track documents, do not copy a patent URI from a random
// document unless you know what you're doing. If in doubt ask your friendly neighbourhood
// Team Contact.
wgPatentURI: "https://www.w3.org/2004/01/pp-impl/32113/status",
// !!!! IMPORTANT !!!! MAKE THE ABOVE BLINK IN YOUR HEAD
};
</script>
<link rel="stylesheet" href="local.css" type="text/css" />
<style type="text/css">
.unicode {
font-style: normal
}
.unicode:link {
color: #FF0000;
background-color: #FFFFFF
}
.unicode:visited {
color: #808080;
background-color: #FFFFFF
}
.unicode:active {
color: #0000FF;
background-color: #FFFFFF
}
em.unicode {
font-style: normal
}
ins {
background-color: #FF6;
}
.deprecation-box {
background-color: #FF9999;
border: 8px solid red;
padding: 1em;
width: 95%;
position: center;
}
</style>
</head>
<body>
<div id="abstract">
<p>This document contains guidelines on the use of the Unicode Standard in
conjunction with markup languages such as XML.</p>
</div>
<div id="sotd">
<div class="deprecation-box">
<h3>This document has been withdrawn</h3>
<p>Many of the materials in this document are stale and out of date;
the W3C is maintaining this version solely as a historical reference.
This document was originally produced as a joint publication between
the W3C and the <a href="http://www.unicode.org">Unicode
Consortium</a>. In 2016, Unicode withdrew publication as a Unicode
Technical Report. </p>
</div>
<div class="note">
<p style="font-weight: bold; font-size: 120%">Sending comments on this
document</p>
<p>If you wish to make comments regarding this document, please raise them as <a href="https://github.com/w3c/unicode-xml/issues" style="font-size: 120%;">GitHub issues</a>.
Only send comments by email if you are unable to raise issues on GitHub (see links below).
All comments are welcome.</p>
<p>To make it easier to track comments, please raise separate issues or emails for each comment, and point to the section you are commenting on using a URL for the dated version of the document.</p>
</div>
</div>
<section id="Introduction">
<h2>Introduction</h2>
<p>The Unicode Standard [<a href="#Unicode">Unicode</a>] defines the
universal character set. Its primary goal is to provide an unambiguous
encoding of the content of plain text, ultimately covering all languages in
the world, but also major text-based notational systems for science,
technology, music, and scholarship.</p>
<p>Currently in its <a href="#Unicode">sixth major version</a>, Unicode
contains a large number of characters covering most of the currently used
scripts in the world. It also contains additional characters for
interoperability with older character encodings, and characters with
control-like functions included primarily for reasons of providing
unambiguous interpretation of plain text. Unicode provides specifications for
use of all of these characters.</p>
<p>For document and data interchange, the Internet and the World Wide Web
make extensive use of marked-up text such as HTML
and XML. In many instances, markup provides the same, or
essentially similar features to those provided by format characters in the
Unicode Standard for use in plain text. Compatibility characters are another
special category provided by Unicode. While there may be valid
reasons to support these characters and their specifications in plain text,
their use in marked-up text can conflict with the rules of the markup
language. Formatting characters are discussed in Section 3, <a
href="#Suitable">Characters not Suitable for Use With Markup</a> and
Section 4, <a href="#Format">Format Characters Suitable for Use With
Markup</a>, compatibility characters in Section 5, <a
href="#Compatibility">Characters with Compatibility Mappings</a>.
Section 6 briefly discusses noncharacters, and Section 7 is devoted to white
space.</p>
<p>The interaction between character
encoding and methods of escaping characters in markup is discussed in the
Character Model for the World Wide Web [<a href="#Charmod">Charmod</a>].</p>
<p>The issues of using Unicode characters with marked-up text depend to some
degree on the rules of the markup language in question and the set of
elements it contains. In a narrow sense, this document concerns itself only
with XML, and to some extent HTML. However, much of the general information
presented here should be useful in a broader context, including some page
layout languages.</p>
<p class="note">Many of the recommendations of this
report depend on the availability of particular markup or styling. Where
possible, appropriate DTDs or Schemas should be used or designed to make
such markup or styling available, or the DTDs or Schemas used should be
appropriately extended. The current version of this document makes no
specific recommendations for the design of DTDs or Schemas, or for the use
of particular DTDs or Schemas, but the information presented here may be
useful to designers of DTDs and Schemas, and to people selecting DTDs or
Schemas for their applications. </p>
<p class="note">The recommendations of this report do not apply in the case
of XML used for blind data transport and similar cases.</p>
<section id="Notation">
<h3>Notation</h3>
<p>This report uses XML [<a href="#xml10">XML</a>] as a prominent and general
example of markup. The XML namespace notation [<a
href="#Namespace">Namespace</a>] is used to indicate that a certain element
is taken from a specific markup language. As an example, the prefix 'xhtml:'
indicates that this element is taken from [<a href="#XHTML">XHTML</a>]. This
means that the examples containing the namespace prefix 'xhtml:' are assumed
to include a namespace declaration of xmlns:xhtml="..." </p>
<p>Characters are denoted using the notation used in the Unicode Standard,
that is, an optional U+ followed by their hexadecimal number, using at least
4 digits, such as "U+1234" or "U+10FFFD". In XML or HTML this could be
expressed as "&amp;#x1234;" or "&amp;#x10FFFD;".</p>
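<p>The two notations can be derived mechanically from a character. The following Python sketch is for illustration only; the function names <code>u_plus</code> and <code>char_ref</code> are hypothetical and not part of any specification.</p>

```python
def u_plus(ch: str) -> str:
    """Return the Unicode Standard notation, padded to at least 4 hex digits."""
    return f"U+{ord(ch):04X}"


def char_ref(ch: str) -> str:
    """Return a hexadecimal numeric character reference for XML or HTML."""
    return f"&#x{ord(ch):X};"


print(u_plus("\u1234"), char_ref("\u1234"))
print(u_plus("\U0010FFFD"), char_ref("\U0010FFFD"))
```

<p>Padding to four digits only affects code points below U+1000; supplementary code points keep their natural five or six digits.</p>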
</section>
</section>
<section id="General">
<h2>General Considerations</h2>
<p>There are several general points to consider when looking at the
interaction between character encoding and markup. </p>
<ul>
<li>Linearity of text vs. hierarchy of markup structure</li>
<li>Overlap of control codes and markup semantics</li>
<li>Markup vs. Styling</li>
<li>Coincidence of semantic markup and functions </li>
<li>Extensibility of markup</li>
</ul>
<section id="Linearity">
<h3>Linearity versus Structure</h3>
<p>Encoding text as a sequence of characters without further
information leads to a linear sequence, commonly called plain text. Character
follows character, without any particular structure. Markup, on the other
hand, defines a hierarchical structure for the text or data. In the case of
XML and most other, similar markup languages, the markup defines a tree
structure. While this tree structure is linearized for transmission in the
XML document, once the document has been parsed, the tree is available
directly.</p>
<p>Operations that are easy to perform on trees are often
difficult to perform on linear sequences and vice versa. By separating
functionality between character encoding and markup appropriately, the
architecture becomes simpler, more powerful and longer-lasting.</p>
<p>In particular, operations on hierarchical structures can
easily make sure that information is kept in context. Attributes assigned to
parts of a document are moved together with the associated part of the
document. Assigning an attribute to a part of a document limits the scope of
the attribute to that part of the document. Performing the same operations on
linear sequences of characters using control codes to set attributes and to
delimit their scope requires much more work and is error prone. Locating the
start or end of a span of text of the same attribute requires scanning
backwards and forwards for the embedded delimiter or control code. Moving or
editing text often results in mismatched control codes, so that an attribute
might suddenly apply to text it was not intended for.</p>
</section>
<section id="Overlap">
<h3>Overlap of Control Code and Markup Semantics</h3>
<p>When markup is not available, plain text may require control
characters. This is usually the case where plain text must contain some
scoping or attribute information in order to be legible, i.e. to be
able to transmit the same content between originator and receiver. Many of
these control characters have direct equivalents in particular markup
languages, since markup handles these concerns efficiently. If both
characters and their markup equivalents may be present in the same text, the
question of priority is raised. Therefore it is important to identify and
resolve these ambiguities at the time markup is first applied.</p>
</section>
<section id="Markup">
<h3>Markup and Styling</h3>
<p>Besides the basic character encoding and text markup there is
a third contributor to text functionality, namely styling. Markup is
concerned with the logical structure of the text or data, e.g. to
indicate sections, subsections, and headers in a document, or to indicate the
various fields of an address record. Styling is used to present the
information in various ways, e.g. in different fonts, different type
styles (italic, bold), different colors, etc. Some character codes do
not encode a generic character, but a styled character. Where these
characters are used, styling information is frozen, i.e. it is no
longer possible to alter the appearance of the text by applying style
information. However, there are many examples where a historically free
stylistic variation has over time become a semantic distinction that is
properly encoded as plain text. Sometimes, what is a free variation in some
contexts, implies strict semantic differentiation in others. In all such
instances, altering the appearance of the text by styling information would
irreparably alter the content of the text. This is of particular concern with
mathematical notation or systems for phonetic and phonemic transcription
which make extensive semantic use of styles on a character by character
basis.</p>
</section>
<section id="Coincidence">
<h3>Coincidence of Markup and Functions</h3>
<p>Dealing with various functionalities on the markup level has
the additional advantage that in most cases, text portions that need some
particular attribute (or styling) are actually those text portions identified
by markup. A paragraph may be in French, a citation may need a bidi
embedding, a keyword may be in italics, a list number may be circled, and so
on. This makes it very efficient to associate those attributes with
markup.</p>
<p>However, where local or point-like functionality is needed,
markup is <em>not</em> very efficient and its main benefit, easy manipulation
of scope, is not required. On the contrary, the intrusion of markup in the
middle of words can make search or sort operations more difficult. For these
cases expressing the information as character codes is not only a viable, but
often the preferred alternative, which needs to be considered in the design
of markup languages.</p>
</section>
<section id="Extensibility">
<h3>Extensibility of Markup</h3>
<p>Character encoding works with a range of integers used as
character codes. This is extremely efficient, but has some limitations.
Markup, on the other hand, is much more extensible. Using technologies such
as XML Namespaces [<a href="#Namespace">Namespace</a>] and their application
in schema languages like [<a href="#XMLSchema">XML Schema</a>], various
vocabularies can be mixed.</p>
</section>
<section id="Suitability">
<h3>Suitability of Characters in Markup</h3>
<p>The suitability of a particular character for markup depends on its status
in the Unicode Standard, the nature of its behavior in text and the
availability of equivalent markup. Many format characters that are needed for
advanced plain text are not suitable for use with markup. <a
href="#Suitable">Section 3</a> gives a list and detailed descriptions.
However, not all format characters are unsuitable for use with markup. <a
href="#Format">Section 4</a> provides a list of format characters that are
suitable for use with markup and gives some discussion about their use. In
addition to format characters, the Unicode Standard also has compatibility
characters, some of which may be replaceable by suitable markup. These
characters are discussed in <a href="#Compatibility">Section 5</a>.</p>
</section>
</section>
<section id="Suitable">
<h2>Characters not Suitable for Use With Markup</h2>
<p>There are characters which are unsuitable in the context of markup in
XML/HTML and whose use is discouraged, because one or more of the following
conditions apply:</p>
<ul>
<li>They are deprecated in the Unicode Standard.</li>
<li>They are unsupportable without additional data.</li>
<li>They are difficult to handle because they are stateful.</li>
<li>They are better handled by markup.</li>
<li>They are undesirable because of conflict with equivalent markup.</li>
</ul>
<p><a href="#Charlist">Section 3.1</a> provides a list of such characters.
Sections <a href="#Line">3.2</a> through <a href="#OtherDeprecated">3.10</a> discuss in more detail the following points for the discouraged
characters.</p>
<ul>
<li>Short description of semantics</li>
<li>Reason for inclusion in Unicode</li>
<li>Specific problems when used with markup</li>
<li>Other areas where problems may occur (e.g. plain text)</li>
<li>What kind of markup to use instead</li>
<li>What to do if detected in a particular context</li>
</ul>
<section id="Charlist">
<h3>Table of Characters not Suitable for Use With Markup</h3>
<p>The following table contains the characters currently considered not
suitable for use with markup in XML or HTML. (See however the note in the <a href="#Introduction">Introduction</a>.) They
may also be unsuitable for other markup or page layout languages. For
determining possible conflict this report uses the markup available in
HTML.</p>
<figure>
<figcaption>Characters not suitable for use with markup</figcaption>
<table>
<tbody>
<tr>
<th><p>Codepoints</p></th>
<th><p>Names/Description</p></th>
<th><p>Short Comment</p></th>
</tr>
<tr>
<td>U+0340..U+0341</td>
<td>Clones of grave and acute</td>
<td>Deprecated in Unicode</td>
</tr>
<tr>
<td>U+17A3, U+17D3</td>
<td>Obsolete characters for Khmer</td>
<td>Deprecated in Unicode</td>
</tr>
<tr>
<td>U+2028..U+2029</td>
<td>Line and paragraph separator</td>
<td>Use &lt;xhtml:br /&gt;,
&lt;xhtml:p&gt;&lt;/xhtml:p&gt;, or equivalent</td>
</tr>
<tr>
<td>U+202A..U+202E</td>
<td>BIDI embedding controls <br />
(LRE, RLE, LRO, RLO, PDF)</td>
<td>Strongly discouraged in [<a
href="#html4.01">HTML4.01</a>]</td>
</tr>
<tr>
<td>U+206A..U+206B</td>
<td>Activate/Inhibit Symmetric swapping</td>
<td>Deprecated in Unicode</td>
</tr>
<tr>
<td>U+206C..U+206D</td>
<td>Activate/Inhibit Arabic form shaping</td>
<td>Deprecated in Unicode</td>
</tr>
<tr>
<td>U+206E..U+206F</td>
<td>Activate/Inhibit National digit shapes</td>
<td>Deprecated in Unicode</td>
</tr>
<tr>
<td>U+FFF9..U+FFFB</td>
<td>Interlinear annotation characters</td>
<td>Use ruby markup [<a href="#Ruby">Ruby</a>]</td>
</tr>
<tr>
<td rowspan="2">U+FEFF</td>
<td>as ZWNBSP</td>
<td>Use U+2060 Word Joiner instead</td>
</tr>
<tr>
<td>as Byte Order Mark</td>
<td>Use only at the start of a file, not as part of
markup</td>
</tr>
<tr>
<td>U+FFFC</td>
<td>Object replacement character</td>
<td>Use markup, e.g. HTML &lt;object&gt; or HTML
&lt;img&gt;</td>
</tr>
<tr>
<td>U+1D173..U+1D17A</td>
<td>Scoping for Musical Notation</td>
<td>Use an appropriate markup language</td>
</tr>
<tr>
<td>U+E0000..U+E007F</td>
<td>Language Tag code points </td>
<td>Use xhtml:lang or xml:lang</td>
</tr>
</tbody>
</table>
</figure>
<p>Except for Line and Paragraph Separator, or the Byte Order Mark, it is
acceptable for browsers and similar user agents to ignore the presence of
discouraged characters in HTML or XML. It is up to authoring tools to ensure
proper conversion between these characters and equivalent markup where it
exists.</p>
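<p>An authoring tool could locate the characters listed in the table with a simple scan. The sketch below is one possible approach, not a normative algorithm; the names <code>DISCOURAGED_RANGES</code> and <code>find_discouraged</code> are hypothetical, the comment strings are paraphrased from the table, and the choice to skip a leading U+FEFF (treating it as a file signature) is an assumption of this example.</p>

```python
# Code point ranges taken from the table above: (first, last, short comment).
DISCOURAGED_RANGES = [
    (0x0340, 0x0341, "deprecated in Unicode"),
    (0x17A3, 0x17A3, "deprecated in Unicode"),
    (0x17D3, 0x17D3, "deprecated in Unicode"),
    (0x2028, 0x2029, "use line or paragraph markup"),
    (0x202A, 0x202E, "use bidi markup"),
    (0x206A, 0x206F, "deprecated in Unicode"),
    (0xFEFF, 0xFEFF, "use U+2060, or keep only as a file signature"),
    (0xFFF9, 0xFFFB, "use ruby markup"),
    (0xFFFC, 0xFFFC, "use object or image markup"),
    (0x1D173, 0x1D17A, "use an appropriate markup language"),
    (0xE0000, 0xE007F, "use a language attribute"),
]


def find_discouraged(text):
    """Return (index, 'U+XXXX', comment) for each discouraged character."""
    hits = []
    for index, ch in enumerate(text):
        code_point = ord(ch)
        # Assumption: a U+FEFF at the very start is a byte order mark.
        if index == 0 and code_point == 0xFEFF:
            continue
        for first, last, comment in DISCOURAGED_RANGES:
            if code_point in range(first, last + 1):
                hits.append((index, f"U+{code_point:04X}", comment))
                break
    return hits


for hit in find_discouraged("\ufeffline one\u2028line two\ufffc"):
    print(hit)
```

<p>What to do with each hit (ignore, convert to markup, or warn) depends on the context, as the following sections describe.</p>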
</section>
<section id="Line">
<h3>Line and Paragraph Separator, U+2028..U+2029</h3>
<p><em>Short description</em>: The line and paragraph separator provide
unambiguous means to denote hard line breaks and paragraph delimiters in
plain text.</p>
<p><em>Reason for inclusion</em>: These characters were introduced into the
Unicode Standard to overcome the ambiguous and widely divergent use of
control codes for this purpose. See Section
5.8, Newline Guidelines, in [<a href="#Unicode">Unicode</a>].</p>
<p><em>Problems when used in markup</em>: Including these characters in
markup text does not work where it would duplicate the existing markup
commands for delimiting paragraphs and lines.</p>
<p><em>Problems with other uses</em>: They can also be
problematic when used in plain text, because legacy data is usually converted
code point for code point into Unicode and all receivers of Unicode plain
text effectively have to be able to interpret the existing use of control
codes for this purpose. As a result, fewer Unicode implementations support
these characters than would otherwise be the case.</p>
<p><em>Replacement markup</em>: In HTML, use &lt;xhtml:br /&gt; instead of
U+2028 and surround paragraphs by &lt;xhtml:p&gt; and &lt;/xhtml:p&gt;
instead of separating them with U+2029.</p>
<p><em>What to do if detected</em>: In a browser context, treat as white
space, or ignore. When received in an editing context, replace the character
by the corresponding markup. </p>
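<p>In an editing context, the replacement described above might look like the following Python sketch. It assumes the input is plain text (so markup-significant characters are escaped), uses the <code>xhtml:</code> prefix from this report's notation, and drops empty paragraphs; the function name <code>separators_to_markup</code> is hypothetical.</p>

```python
from html import escape

LINE_SEPARATOR = "\u2028"
PARAGRAPH_SEPARATOR = "\u2029"


def separators_to_markup(plain_text: str) -> str:
    """Replace U+2028 with a line break element and U+2029 with paragraphs."""
    rendered = []
    for paragraph in plain_text.split(PARAGRAPH_SEPARATOR):
        if not paragraph:
            continue  # e.g. after a trailing U+2029
        # Escape each line so that '&' and '<' in the text stay literal.
        lines = [escape(line, quote=False)
                 for line in paragraph.split(LINE_SEPARATOR)]
        rendered.append("<xhtml:p>" + "<xhtml:br />".join(lines) + "</xhtml:p>")
    return "\n".join(rendered)


print(separators_to_markup("first line\u2028second line\u2029next paragraph"))
```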
</section>
<section id="Bidi">
<h3>Bidi Embedding Controls (LRE, RLE, LRO, RLO, PDF), U+202A..U+202E</h3>
<p><em>Short description</em>: The bidi embedding controls are required to
supplement the Unicode Bidirectional Algorithm in plain text.</p>
<p><em>Reason for inclusion</em>: The Unicode Bidirectional algorithm
unambiguously resolves the display direction for bidirectional text. It does
so by assigning all characters directional categories and then resolving
these in context. In a number of circumstances this <em>implicit</em> method
does not produce satisfactory results and embedding controls are
needed to ensure that sender and receiver agree on the display direction for
a given text. See Unicode Standard Annex #9, The Bidirectional Algorithm <a
href="#UTR9">[UAX 9]</a>.</p>
<p><em>Problems when used in markup</em>: These characters duplicate
available markup, which is better suited to handle the stateful nature of
their effect. </p>
<p><em>Problems with other uses</em>: The embedding controls introduce a
state into the plain text, which must be maintained when editing or
displaying the text. Processes that are modifying the text without being
aware of this state may inadvertently affect the rendering of large portions
of the text, for example by removing a PDF.</p>
<p><em>Replacement markup</em>: The following table gives the replacement
markup:<br />
</p>
<table>
<tbody>
<tr>
<td><b>Unicode</b></td>
<td><b>Equivalent markup</b></td>
<td><b>Comment</b></td>
</tr>
<tr>
<td><p>RLO</p></td>
<td>&lt;xhtml:bdo dir="rtl"&gt;</td>
<td> </td>
</tr>
<tr>
<td><p>LRO</p></td>
<td>&lt;xhtml:bdo dir="ltr"&gt;</td>
<td> </td>
</tr>
<tr>
<td>PDF</td>
<td>&lt;/xhtml:bdo&gt;</td>
<td>when used to terminate RLO or LRO only, otherwise
ignore</td>
</tr>
<tr>
<td>RLE</td>
<td>dir = "rtl"</td>
<td>attribute on block or inline element</td>
</tr>
<tr>
<td>LRE</td>
<td>dir = "ltr"</td>
<td>attribute on block or inline element</td>
</tr>
</tbody>
</table>
<p>For details on bidi markup, please see Section 8.2 of HTML [<a
href="#HTML4.0-8.2">HTML 4.0-8.2</a>]. The text of HTML 4.0 gives this
recommendation: </p>
<blockquote>
<p><strong>Using HTML directionality markup with Unicode
characters.</strong> Authors and designers of authoring software should be
aware that conflicts can arise if the <a
href="https://www.w3.org/TR/html401/struct/dirlang.html#adef-dir"
class="noxref"><samp class="ainst">dir</samp></a> attribute is used on
inline elements (including <a
href="https://www.w3.org/TR/html401/struct/dirlang.html#edef-BDO"
class="noxref"><samp class="einst">BDO</samp></a>) concurrently with the
corresponding <a href="#Unicode"
class="normref">[UNICODE]</a> formatting characters. Preferably one or the
other should be used exclusively. The markup method offers a better
guarantee of document structural integrity and alleviates some problems
when editing bidirectional HTML text with a simple text editor, but some
software may be more apt at using the <a href="#Unicode"
class="normref">[UNICODE]</a> characters. If both methods are used, great
care should be exercised to insure proper nesting of markup and directional
embedding or override, otherwise, rendering results are undefined.</p>
</blockquote>
<p>This document goes beyond HTML and recommends that <em>only</em> the markup
should be used.</p>
<p class="note"> The interpretation of how to handle directionality markup
for block level elements differs in different versions of [<a
href="#CSS">CSS</a>].</p>
<p><em>What to do if detected</em>: In a browser context, ignore. When
received in an editing context, replace the characters by the appropriate
markup. </p>
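<p>The mapping in the table above can be applied with a small stack, as in the following Python sketch. Several choices here are assumptions of the example rather than recommendations of this report: RLE and LRE are mapped to a <code>dir</code> attribute on an <code>xhtml:span</code> element, an unmatched PDF is dropped, and controls left open at the end of the text are closed automatically. The sketch also ignores paragraph boundaries and the embedding depth limit of the Unicode Bidirectional Algorithm. The name <code>bidi_controls_to_markup</code> is hypothetical.</p>

```python
from html import escape

# Opening controls and the (element, direction) used to replace them.
OPENERS = {
    "\u202A": ("xhtml:span", "ltr"),  # LRE
    "\u202B": ("xhtml:span", "rtl"),  # RLE
    "\u202D": ("xhtml:bdo", "ltr"),   # LRO
    "\u202E": ("xhtml:bdo", "rtl"),   # RLO
}
PDF = "\u202C"


def bidi_controls_to_markup(text: str) -> str:
    """Replace bidi embedding and override controls with nested elements."""
    out = []
    open_elements = []  # stack of element names awaiting an end tag
    for ch in text:
        if ch in OPENERS:
            name, direction = OPENERS[ch]
            out.append(f'<{name} dir="{direction}">')
            open_elements.append(name)
        elif ch == PDF:
            if open_elements:
                out.append(f"</{open_elements.pop()}>")
            # An unmatched PDF has no effect and is dropped.
        else:
            out.append(escape(ch, quote=False))
    while open_elements:  # close anything the text left open
        out.append(f"</{open_elements.pop()}>")
    return "".join(out)


print(bidi_controls_to_markup("abc \u202eDEF\u202c ghi"))
```

<p>Because the stack guarantees proper nesting, the output cannot contain the mismatched scopes that arise when control codes are edited by hand.</p>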
</section>
<section id="Deprecated">
<h3>Deprecated Formatting Characters, U+206A..U+206F</h3>
<p><em>Short description</em>: These characters are deprecated. They were
originally intended to allow explicit activation of contextual shaping,
numeric digit rendering and symmetric swapping.</p>
<p><em>Reason for inclusion</em>: These characters were retained from draft
versions of ISO 10646.</p>
<p><em>Problems when used in markup</em>: The processing model for these
characters is not supported in markup.</p>
<p><em>Problems with other uses</em>: The Unicode Standard requires that
symmetric swapping, contextual shaping, and alternate digit shapes are
enabled by default and no longer supports inhibiting any of them by use of
these character codes. The most likely effect of their occurrence in
generated text would be that of a 'garbage' character.</p>
<p><em>Conversion for use with markup</em>: Apply the appropriate conversion
to bring the data stream in line with the Unicode text model for
bidirectional text and cursively-connected scripts.</p>
<p><em>What to do if detected</em>: When received by a browser as part of
marked up text, they may be ignored. When received in an editing context,
they may be removed, possibly with a warning. Alternatively, an appropriate
conversion from the legacy text model may be provided. This will most likely
be limited to applications directly interfacing with and knowledgeable of the
particular legacy implementation that inspired these characters.</p>
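<p>For the editing case, removal with a count that can drive a warning could be sketched as follows in Python. The name <code>strip_deprecated_format</code> is hypothetical, and the sketch does not attempt any conversion from a legacy text model.</p>

```python
import re

# U+206A..U+206F: the six deprecated formatting characters.
DEPRECATED_FORMAT = re.compile("[\u206A-\u206F]")


def strip_deprecated_format(text: str):
    """Return (cleaned_text, number_of_characters_removed)."""
    return DEPRECATED_FORMAT.subn("", text)


cleaned, removed = strip_deprecated_format("abc\u206adef\u206f")
if removed:
    print(f"warning: removed {removed} deprecated formatting character(s)")
```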
</section>
<section id="BOM">
<h3>Byte Order Mark, ZWNBSP, U+FEFF</h3>
<p><em>Short description</em>: U+FEFF has two functions. It is formally known
as <span class="uname">zero width no-break space</span> (ZWNBSP), and can act as a word joiner, but its primary use is as <em>byte
order mark (BOM)</em>, to indicate in a file signature at the start of a file
that a file is in a particular Unicode encoding form and of a particular byte
order. Using U+FEFF as a word joiner in new data is deprecated as of [<a
href="#Unicode32">Unicode3.2</a>] in favor of U+2060 <span
class="uname">word joiner</span> (WJ). The use as byte
order mark remains unaffected.</p>
<p><em>Reason for inclusion</em>: Originally included in Unicode for the sole
purpose of indicating byte order or use in file signatures, the character
acquired the ZWNBSP semantics as part of the merger between ISO/IEC 10646 and
Unicode. When used as a byte order mark the character is placed at the
beginning of a file. If a recipient views it as FEFF then the byte orders
of sender and receiver match. If the recipient views it as FFFE (a
noncharacter code point) then the sender used opposite byte order from the
recipient, and the recipient needs to invert the byte order or refuse to read
the file. When used as a ZWNBSP the character is intended to prevent breaks
between adjacent characters. This function is now provided by U+2060 <span
class="uname">word joiner</span> (WJ) making it
unnecessary to insert U+FEFF in the middle of a file. For more information
see Chapter 16 of [<a href="#Unicode">Unicode</a>].</p>
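<p>The byte order check described above can be illustrated with the following Python sketch. It assumes the data is already known to be UTF-16 (the UTF-32 little-endian signature also begins with the bytes FF FE, so this test alone cannot identify the encoding form); the function name <code>utf16_byte_order</code> is hypothetical.</p>

```python
import codecs


def utf16_byte_order(data: bytes):
    """Report the byte order signalled by a leading U+FEFF, if any."""
    if data.startswith(codecs.BOM_UTF16_BE):  # bytes FE FF
        return "big-endian"
    if data.startswith(codecs.BOM_UTF16_LE):  # bytes FF FE
        return "little-endian"
    return None  # no signature present


print(utf16_byte_order(b"\xfe\xff\x00h\x00i"))
print(utf16_byte_order(b"\xff\xfeh\x00i\x00"))
```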
<p><em>Problems when used in markup</em>: Using U+FEFF as ZWNBSP makes it
impossible to distinguish it from the case where a byte order mark was left
in the middle of a file inadvertently due to incorrect splicing. U+FEFF can
and in some cases (XML encoded in UTF-16) must be used at the start of a file
containing markup, but as a signature, this is not part of actual markup or
marked-up content. Some older versions of browsers and parsers may not
correctly recognize U+FEFF at the start of a file encoded in UTF-8. For
details of how U+FEFF participates in encoding detection of XML files, see
Appendix F of <a href="#xml10">[XML 1.0]</a>. </p>
<p><em>Problems with other uses</em>: The use of U+FEFF as ZWNBSP is
also problematic in plain text, and has been deprecated for that
purpose in favor of U+2060 <span class="uname">word
joiner</span>. The use of U+FEFF in file signatures to indicate byte order is
the only recommended use of this character.</p>
<p><em>Replacement markup</em>: None. In locations other than the beginning
of a text file, U+FEFF can be removed or replaced by U+2060 in an editing
environment.</p>
<p><em>What to do if detected</em>: When received by a browser as part of
marked-up text, treat depending on location. At the start of an external
entity, treat as byte order mark (i.e. as part of the character encoding, not
as part of the parsed character stream, see e.g. Section 4.3.3 of <a
href="#xml10">[XML 1.0]</a>). Otherwise, assume it is older data using it as
ZWNBSP. When receiving plain text in an editing environment, editors may take
one or more of several actions: replace ZWNBSP in the middle of a file with
WJ or issue a warning to the user.</p>
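<p>An editor could combine both actions, as in this Python sketch. It assumes the text has already been decoded, treats a leading U+FEFF as a signature that is not part of the character stream, and replaces every other occurrence with U+2060; the name <code>normalize_feff</code> is hypothetical.</p>

```python
BOM = "\uFEFF"
WORD_JOINER = "\u2060"


def normalize_feff(text: str):
    """Return (cleaned_text, warnings) for decoded text."""
    # A leading U+FEFF is a byte order mark, not content.
    body = text[1:] if text.startswith(BOM) else text
    count = body.count(BOM)
    cleaned = body.replace(BOM, WORD_JOINER)
    warnings = []
    if count:
        warnings.append(f"replaced {count} U+FEFF with U+2060 WORD JOINER")
    return cleaned, warnings


cleaned, warnings = normalize_feff("\ufeffkey\ufeffvalue")
print(repr(cleaned), warnings)
```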
</section>
<section id="Interlinear">
<h3>Interlinear Annotation Characters, U+FFF9..U+FFFB</h3>
<p><em>Short description</em>: The interlinear annotation characters are used
to delimit interlinear annotations in certain circumstances. They are
intended to provide text anchors and delimiters for interlinear annotation
for in-process use and are not intended for interchange.</p>
<p><em>Reason for inclusion</em>: The interlinear annotation characters were
included in Unicode only in order to reserve code points for very frequent
application-internal use. The interlinear annotation characters are used to
delimit interlinear annotations in contexts where other delimiters are not
available, and where non-textual means exist to carry formatting information.
Many text-processing applications store the text and the associated markup
(or in some cases styling information) of a document in separate structures.
The actual text is kept in a single linear structure; additional information
is kept separately with pointers to the appropriate text positions. This is
called out-of-band information. The overall implementation makes sure that
these two structures are kept in sync. If the text contains interlinear
annotations, it is extremely helpful for implementations to have delimiters
in the text itself, even though delimiters are not otherwise used for style
markup. With this method, and unlike the case of the object replacement
character, all textual information can remain in the standard text stream,
but any additional formatting information is kept separately. In addition,
the Interlinear Annotation Anchor serves as a placeholder for formatting
information for the whole annotation object, the same way a paragraph mark
can be a placeholder to attach paragraph formatting information.</p>
<p><em>Problems when used in markup</em>: Including interlinear annotation
characters in marked-up text does not work because the additional formatting
information (how to position the annotation,...) is not available.</p>
<p><em>Problems with other uses</em>: The interlinear annotation characters
are also problematic when used in plain text, and are not intended for that
purpose. In particular, on older display systems that simply ignore or
replace the Interlinear Annotation Characters, the meaning of the text may be
changed.</p>
<p><em>Replacement markup</em>: The markup to be used in place of the
Interlinear Annotation Characters depends on the formatting and nature of the
interlinear annotation in question. For ruby, please see [<a
href="#Ruby">Ruby</a>].</p>
<p><em>What to do if detected</em>: When received by a browser as part of
marked-up text, they may be ignored. When receiving plain text in an editing
environment, editors may take one or more of several actions: remove U+FFF9
together with all characters from U+FFFA through the following U+FFFB;
ignore U+FFF9 and turn U+FFFA and U+FFFB into "[" and "]" respectively, or
into similar characters; issue a warning to the user; or tentatively convert
into appropriate ruby markup for further editing and formatting by the
user.</p>
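<p>Two of the editor actions listed above, removal and conversion to brackets,
might look like this in Python. This is a sketch under assumptions: the
function names are hypothetical, and only simple, non-nested annotations of the
form anchor, annotated text, separator, annotation text, terminator are
handled. Stray or nested controls are left untouched.</p>

```python
import re

ANCHOR, SEPARATOR, TERMINATOR = "\ufff9", "\ufffa", "\ufffb"

# Group 1 is the annotated text, group 2 the annotation text.
# Assumption: neither part contains further annotation controls.
_ANNOTATION = re.compile(
    ANCHOR + "([^\ufff9-\ufffb]*)" + SEPARATOR + "([^\ufff9-\ufffb]*)" + TERMINATOR
)


def strip_annotations(text: str) -> str:
    """Keep the annotated text; drop the annotation and all three controls."""
    return _ANNOTATION.sub(r"\1", text)


def bracket_annotations(text: str) -> str:
    """Keep both parts, showing the annotation in square brackets."""
    return _ANNOTATION.sub(r"\1[\2]", text)
```

<p>Conversion to ruby markup would follow the same pattern, with the two
captured groups placed into the appropriate ruby elements.</p>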
</section>
<section id="Object">
<h3>Object Replacement Character, U+FFFC</h3>
<p><em>Short description</em>: The object replacement character is used to
stand in place of an object (e.g. an image) included in a text.</p>
<p><em>Reason for inclusion</em>: The object replacement character was
included in Unicode only in order to reserve a codepoint for a very frequent
application-internal use. Many text-processing applications store the text
and the associated markup (or in some cases styling information) of a
document in separate structures. The actual text is kept in a single linear
structure; additional information is kept separately with pointers to the
appropriate text positions. The overall implementation makes sure that these
two structures are kept in sync. If the text contains objects such as images,
it is extremely helpful for implementations to have a sentinel in the text
itself; any additional information is kept separately.</p>
<p><em>Problems when used in markup</em>: Including an object replacement
character in markup text does not work because the additional information
(what object to include,...) is not available.</p>
<p><em>Problems with other uses</em>: The object replacement character is
also problematic when used in plain text, because there is no way in plain
text to provide the actual object information or a reference to it.</p>
<p><em>Replacement markup</em>: The markup to be used in place of the Object
Replacement Character depends on the object in question and the markup
context it is used in. Typical cases are &lt;xhtml:img src='...' /&gt;,
&lt;xhtml:object ...&gt;, or &lt;html:applet ...&gt;. These constructs allow
providing all additional information needed to identify and use the object in
question.</p>
<p><em>What to do if detected</em>: Browsers may ignore this character. When
received in an editing context, if the actual object is accessible, editors
may either replace the character by the appropriate markup for that object,
or otherwise remove it, ideally providing a warning.</p>
</section>
<section id="Musical">
<h3>Musical Controls, U+1D173..U+1D17A</h3>
<p><em>Short description</em>: A series of characters for controlling scope
in musical notation.</p>
<p><em>Reason for inclusion</em>: These characters designate the start and
end of common musical constructs. Full musical layout depends on additional
information, for example pitch, that cannot be encoded using Unicode.
However, many musical symbols may be depicted in isolation (and without
assigning pitch) as part of a textual discussion of music. Plain text use of
Unicode characters is primarily intended for this latter purpose. The scoping
operators can be used to support limited renderings of beams, slurs, phrases,
etc. in this context. However, in the context of markup languages, musical
scoring calls for a dedicated markup language (analogous to MathML) which
would be expected to contain markup for these constructs.</p>
<p><em>Problems when used in markup</em>: These characters duplicate
information that can in principle be expressed in markup.</p>
<p><em>Problems with other uses</em>: Their special code range allows them to
be easily filtered, but applications that do not expect them will treat them
as garbage characters.</p>
<p><em>Replacement markup</em>: Replace with equivalent markup if
available.</p>
<p><em>What to do if detected</em>: Browsers may ignore these characters.
When received in an editing context, editors may remove or replace them by
equivalent markup.</p>
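<p>Because the musical controls occupy one contiguous range, removing them is
straightforward. The following Python sketch shows one way to do it; the
function name is hypothetical, and replacement by equivalent markup is not
attempted here.</p>

```python
# U+1D173..U+1D17A: begin/end beam, tie, slur and phrase.
MUSICAL_CONTROLS = range(0x1D173, 0x1D17A + 1)


def strip_musical_controls(text: str) -> str:
    """Remove the musical scoping controls, leaving the symbols themselves."""
    return "".join(ch for ch in text if ord(ch) not in MUSICAL_CONTROLS)
```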
</section>
<section id="Language">
<h3>Language Tag Characters, U+E0000..U+E007F</h3>
<p><em>Short description</em>: A series of characters for expressing language
tags, based on existing standards for language tags using the rules in
Chapter 16 of [<a href="#Unicode">Unicode</a>].</p>
<p><em>Reason for inclusion</em>: These characters allow in-band language
tagging in situations where full markup is not available, while allowing easy
filtering by applications that do not support them. They were included solely
for the benefit of Internet protocols, such as ACAP, that require a
standard mechanism for marking language in UTF-8 strings, and to avoid the
use of other tagging schemes that rely on specific details of the encoding
form used.</p>
<p><em>Problems when used in markup</em>: These characters duplicate
information that can be expressed in markup.</p>
<p><em>Problems with other uses</em>: Their special code range allows them to
be easily filtered, but applications that do not expect them will treat them
as garbage characters.</p>
<p><em>Replacement markup</em>: Replace with equivalent language markup. XML
and XHTML have the xml:lang attribute. HTML has the lang attribute. These
attributes follow different scoping rules than the tag characters, therefore
this replacement will generally not be a simple 1:1 substitution.</p>
<p><em>What to do if detected</em>: Browsers may ignore these characters.
When received in an editing context, editors may remove or replace them by
equivalent markup.</p>
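<p>As a first step towards replacing tag characters with language markup, an
editor needs to separate them from the visible text and recover the language
tags they spell out. The Python sketch below illustrates this. It relies on
the fact that each tag character is the corresponding ASCII character offset
by 0xE0000. The function name and the shape of the return value are
assumptions made for illustration, and malformed tag sequences are not
validated.</p>

```python
from typing import Optional

LANGUAGE_TAG = 0xE0001               # U+E0001 introduces a language tag
CANCEL_TAG = 0xE007F                 # U+E007F cancels the tag in effect
TAG_BASE = 0xE0000                   # tag character = TAG_BASE + ASCII code
TAG_TEXT = range(0xE0020, 0xE007F)   # tag counterparts of U+0020..U+007E
TAG_BLOCK = range(0xE0000, 0xE0080)  # whole range covered by this section


def split_language_tags(text: str) -> tuple[str, list[tuple[int, Optional[str]]]]:
    """Separate tag characters from the visible text (hypothetical helper).

    Returns the text without tag characters and a list of
    (offset in the cleaned text, language tag) pairs. A tag of None
    records a CANCEL TAG at that offset.
    """
    plain: list[str] = []
    tags: list[list] = []
    for ch in text:
        cp = ord(ch)
        if cp == LANGUAGE_TAG:
            tags.append([len(plain), ""])
        elif cp == CANCEL_TAG:
            tags.append([len(plain), None])
        elif cp in TAG_TEXT:
            # Tag text only has meaning after LANGUAGE TAG; otherwise drop it.
            if tags and tags[-1][1] is not None:
                tags[-1][1] += chr(cp - TAG_BASE)
        elif cp in TAG_BLOCK:
            continue  # any other code point in the range is simply dropped
        else:
            plain.append(ch)
    return "".join(plain), [(offset, tag) for offset, tag in tags]
```

<p>Turning the recovered offsets into lang or xml:lang attributes is left to
the editor, since the scoping rules differ as described above.</p>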
</section>
<section id="OtherDeprecated">
<h3>Other Characters Deprecated in Unicode</h3>
<p><em>Short description</em>: The Unicode Character Database [<a
href="#UnicodeData">UnicodeData</a>] lists all characters that have been
deprecated in [<a href="#Unicode">Unicode</a>]. This list may grow (slowly)
over time. Deprecated characters remain valid characters forever, but their
use is strongly discouraged. Deprecation of characters is applied only in
exceptional circumstances. It is never the result of historical changes of a
writing system: characters no longer in current, modern use are retained in
Unicode, as they are needed for the representation of historical
documents.</p>
<p><em>Reason for inclusion</em>: Usually, characters that are deprecated
were never needed, but were inadvertently added to the Unicode Standard,
perhaps based on incomplete information available at the time of encoding.</p>
<p><em>Problems when used in markup</em>: Except where noted elsewhere in
this document, their presence in markup presents the same problems as in
plain text, usually that of an unnecessary duplicate encoding.</p>
<p><em>Problems with other uses</em>: Depends on the character and the reason
for its deprecation. For more information see [<a
href="#Unicode">Unicode</a>].</p>
<p><em>Conversion for use with markup</em>: For deprecated characters not
discussed elsewhere in this document, see the relevant descriptions of those
characters in [<a href="#Unicode">Unicode</a>] for information on the
recommended alternatives.</p>
<p><em>What to do if detected</em>: Unless a specific recommendation is
given elsewhere, deprecated characters are not ignored; where possible, in an
editing environment, a preferred alternate encoding may be substituted.</p>
</section>
</section>
<section id="Format">
<h2>Format Characters Suitable for Use with Markup</h2>
<p>The following table contains format characters that do not exhibit the
problems discussed at the start of <a href="#Suitable">Section 3</a>. Despite
their apparent relation to or similarity with characters in table <a
href="#Charlist">3.1</a>, they are considered suitable for use with markup.
It is not acceptable for user agents to ignore the characters in table 4.1.
For a description of these characters see [<a
href="#Unicode">Unicode</a>].</p>
<figure>
<figcaption>Some characters that affect text format but are suitable for use with markup</figcaption>
<table>
<tbody>
<tr>
<th><p>Code points</p>
</th>
<th><p>Names/Description</p>
</th>
<th><p>Short Comment</p>
</th>
</tr>
<tr>
<td>U+00A0</td>
<td>No-break Space</td>
<td>Line break control</td>
</tr>
<tr>
<td>U+00AD</td>
<td>Soft Hyphen</td>
<td>Line break control</td>
</tr>
<tr>
<td>U+034F</td>
<td>Combining Grapheme Joiner</td>
<td>Used in sorting</td>
</tr>
<tr>
<td>U+0600</td>
<td>Arabic Number Sign</td>
<td>Subtending mark</td>
</tr>
<tr>
<td>U+0601</td>
<td>Arabic Sign Sanah</td>
<td>Subtending mark</td>
</tr>
<tr>
<td>U+0602</td>
<td>Arabic Footnote Marker</td>
<td>Subtending mark</td>
</tr>
<tr>
<td>U+0603</td>
<td>Arabic Sign Safha</td>
<td>Subtending mark</td>
</tr>
<tr>
<td>U+06DD</td>
<td>Arabic End of Ayah</td>
<td>Enclosing mark</td>
</tr>
<tr>
<td>U+070F</td>
<td>Syriac Abbreviation Mark (SAM)</td>
<td>Supertending mark</td>
</tr>
<tr>
<td>U+0F0C</td>
<td>Tibetan Mark Delimiter Tsheg Bstar</td>
<td>Non-breaking form of 0F0B</td>
</tr>
<tr>
<td>U+115F..U+1160</td>
<td>Hangul Jamo Fillers</td>
<td>Filler</td>
</tr>
<tr>
<td>U+180B..U+180E</td>
<td>Mongolian Variation Selectors (FVS1..FVS3), Mongolian
Vowel Separator</td>
<td>Required for Mongolian</td>
</tr>
<tr>
<td>U+200B</td>
<td>Zero-width Space</td>
<td>Line break control</td>
</tr>
<tr>
<td>U+200C..U+200D</td>
<td>Zero-width Join Controls (ZWJ and ZWNJ)</td>
<td>Required for Persian and many Indic scripts, among others</td>
</tr>
<tr>
<td>U+200E..U+200F</td>
<td>Implicit Directional Marks (LRM and RLM)</td>
<td>LRM and RLM are allowed</td>
</tr>
<tr>
<td>U+2011</td>
<td>Non-breaking Hyphen</td>
<td>Line break control</td>
</tr>
<tr>
<td>U+202F</td>
<td>Narrow No-break Space</td>
<td>Line break control/Mongolian</td>
</tr>
<tr>
<td>U+2044</td>
<td>Fraction Slash</td>
<td>Or use markup (MathML)</td>
</tr>
<tr>
<td>U+2060</td>
<td>Word Joiner</td>
<td>Use for that purpose instead of U+FEFF ZWNBSP</td>
</tr>
<tr>
<td>U+2061..U+2064</td>
<td>Invisible Mathematical Operators</td>
<td>Mathematical use</td>
</tr>
<tr>
<td>U+2FF0..U+2FFB</td>
<td>Ideographic Description Characters</td>
<td>Graphic characters (not controls)</td>
</tr>
<tr>
<td>U+303E</td>
<td>Ideographic Variation Indicator</td>
<td>Graphic character (not a control)</td>
</tr>
<tr>
<td>U+FFA0</td>
<td>Halfwidth Hangul Filler</td>
<td>Filler, not generally required</td>
</tr>
<tr>
<td>U+FE00..U+FE0F</td>
<td>Variation Selectors</td>
<td>Modify graphic characters</td>
</tr>
<tr>
<td>U+E0100..U+E01EF</td>
<td>Variation Selectors</td>
<td>Modify graphic characters</td>
</tr>
</tbody>
</table>
</figure>
<p>The following subsections briefly discuss some of the characters from the
above list, particularly those that affect more than their immediately
adjacent neighbors. Please see the Unicode Standard [<a
href="#Unicode">Unicode</a>] for full details.</p>
<section id="Subtending">
<h3>Subtending Marks</h3>
<p>Subtending marks are needed to represent a common feature in the Arabic
and Syriac scripts where a mark can be placed below a range of characters,
for example below a sequence of digits, to indicate a year. The Syriac
abbreviation mark is placed above a series of characters, making it
technically a supertending mark, and the <span
class="uname">ARABIC END OF AYAH</span> is an enclosing
mark. In the character stream, a subtending mark precedes the affected
characters. The end of the affected range of characters is defined implicitly,
usually by the first non-alphanumeric character. </p>
<p>Unlike subtending marks, the scope of combining enclosing
marks, such as <span
style="text-transform: uppercase; font-variant: small-caps;">combining
enclosing circle</span>, is limited to the preceding default grapheme
cluster. For details on grapheme clusters see Unicode Standard Annex #29:
"Text Boundaries" [<a href="#UAX29">UAX 29</a>].</p>
<p>There is currently no markup that can represent the
scoping and layout functions defined by these characters, so they cannot be
substituted. It is unresolved to what degree intervening markup affects the
scope of these marks.</p>
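<p>The implicit scope of a prefixed mark can be approximated in code. The
Python sketch below is an illustration only: it treats the run of alphanumeric
characters directly after the mark as the affected range, which mirrors the
"first non-alphanumeric character" rule of thumb given above, and it ignores
intervening markup altogether. The function name and the use of
str.isalnum() as the alphanumeric test are assumptions.</p>

```python
# Prefixed marks from table 4.1: U+0600..U+0603, U+06DD and U+070F.
PREFIXED_MARKS = {"\u0600", "\u0601", "\u0602", "\u0603", "\u06dd", "\u070f"}


def mark_scopes(text: str) -> list[tuple[str, str]]:
    """Return (mark, affected run) pairs for each prefixed mark in text.

    Approximation: the affected run is the sequence of alphanumeric
    characters that directly follows the mark.
    """
    scopes = []
    for i, ch in enumerate(text):
        if ch in PREFIXED_MARKS:
            run = []
            for following in text[i + 1:]:
                if not following.isalnum():
                    break  # scope ends at the first non-alphanumeric character
                run.append(following)
            scopes.append((ch, "".join(run)))
    return scopes
```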
</section>
<section id="Fraction">
<h3>Fraction Slash</h3>
<p>The fraction slash is used between sequences of decimal
digits to form fractions. Whether the resulting fraction has a horizontal or
diagonal fraction line is unspecified. The fallback is to leave the digits
unchanged and display a regular slash. To separate a whole number from a
following fraction, as in 1¾, the use of <span
class="uname">U+2009 THIN SPACE</span> is recommended.</p>
<p>For better control of fractions the use of [<a
href="#MathML">MathML</a>] is suggested where appropriate.</p>
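<p>The character sequence described above, and its fallback display, can be
sketched in Python as follows. The function names are hypothetical, and how
the fraction is actually rendered remains up to the display system.</p>

```python
FRACTION_SLASH = "\u2044"  # U+2044 FRACTION SLASH
THIN_SPACE = "\u2009"      # U+2009 THIN SPACE


def mixed_fraction(whole: int, numerator: int, denominator: int) -> str:
    """Encode e.g. 1 3/4 as digits, THIN SPACE, digits, FRACTION SLASH, digits."""
    return f"{whole}{THIN_SPACE}{numerator}{FRACTION_SLASH}{denominator}"


def fallback_display(text: str) -> str:
    """Fallback: leave the digits unchanged and show a regular slash."""
    return text.replace(FRACTION_SLASH, "/")
```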
</section>
<section id="Variation">
<h3>Variation Selectors</h3>
<p>A variation selector is intended to select a specific variant form (or
range of variant forms) of the base character it follows. For a variation
selector to have an effect it must immediately follow its base character.
Only pre-determined combinations of selected base characters and specific
variation selectors have a defined effect. All other combinations are
ill-formed and are to be ignored. The list of standardized combinations is
documented in the Unicode Character Database, see [<a
href="#Variants">Variants</a>]. In addition to the 256 generic variation
selectors, there are 3 Mongolian <em>free variation selectors</em>. They
function in all other ways like variation selectors, except they only apply
to base characters from the Mongolian script. Since Mongolian, like Arabic,