/
808.txt
103 lines (73 loc) · 3.66 KB
/
808.txt
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
* Introduction
This document describes how to validate files with respect to its [[encoding]]
in an implementation that decodes a file for the purpose of conformance checking.
[EG[
An example of such implementation is perl-web-encodings
<https://github.com/manakai/perl-web-encodings>.
]EG]
* Terminology
This document depends on the [CITE[Infra Standard]].
The terms
[DFN[byte]],
[DFN[byte string]],
[DFN[code point]], and
[DFN[string]]
are defined by the [CITE[Infra Standard]].
The terms
[DFN[encoding]],
[DFN[encode]],
[DFN[decode]],
[DFN[decoder]],
[DFN[[CODE[fatal]]]],
[DFN[error]],
[DFN[Big5]],
and
[DFN[Shift_JIS]]
are defined by the [CITE[Encoding Standard]].
The term [DFN[BOM]] is defined by the [CITE[Unicode Standard]].
The term [DFN[[VAR[encoding]] string]], where [VAR[encoding]] is an [[encoding]],
represents a [[byte string]] which is intended to be [[decoded][decode]]
by [VAR[encoding]]'s [[decoder]].
[DFN[Bytes for [VAR[code point]] in [VAR[encoding]]]],
where [VAR[code point]] is a [[code point]] and [VAR[encoding]] is an [[encoding]],
are the result of [[encoding][encode]] [VAR[code point]] in [VAR[encoding]].
A [[code point]] [VAR[char]] is [DFN[encodable]] in [[encoding]] [VAR[encoding]]
if [[encoding][encode]] a [[string]] [VAR[char]] in [VAR[encoding]] with
[VAR[error mode]] [CODE[fatal]] would not result in [[error]].
An [VAR[encoding]] string [VAR[string]] is [DFN[in error]] if
[[decoding][decode]] [VAR[string]] with [[encoding]] [VAR[encoding]] and
[VAR[error mode]] [CODE[fatal]] would result in [[error]].
* General rules
An [VAR[encoding]] string [MUST[MUST NOT]] be [[in error]].
[[BOM]] [SHOULD[SHOULD NOT]] be used.
;; It is not roundtrippable and it makes any encoding metadata ignored.
* Big5
A [[Big5]] string is discouraged to contain bytes for a [[code point]]
in [[Big5]] which is not [[encodable]] in [[Big5]].
;; In other words, use of HKSCS extensions are discouraged, as they are not
roundtrippable.
A [[Big5]] string is discouraged to contain [[bytes][byte]] [VAR[bytes]] which
is [[decoded][decode]] to [[code point]] [VAR[char]] in [[Big5]]
if bytes for [VAR[char]] in [[Big5]] is not equal to [VAR[bytes]].
;; In other words, when there are multiple byte representations for a [[code point]],
non-canonical representations are discouraged, as they are not roundtrippable.
* Shift_JIS
A [[Shift_JIS]] string is discouraged to contain bytes for a [[code point]]
in [[Shift_JIS]] which is not [[encodable]] in [[Shift_JIS]].
;; In other words, use of EUDCs are discouraged, as they are not
roundtrippable and in fact not interoperable at all.
A [[Shift_JIS]] string is discouraged to contain [[bytes][byte]] [VAR[bytes]] which
is [[decoded][decode]] to [[code point]] [VAR[char]] in [[Shift_JIS]]
if bytes for [VAR[char]] in [[Shift_JIS]] is not equal to [VAR[bytes]].
;; In other words, when there are multiple byte representations for a [[code point]],
non-canonical representations are discouraged, as they are not roundtrippable.
* EUC-JP
An [[EUC-JP]] string is discouraged to contain bytes for a [[code point]]
in [[EUC-JP]] which is not [[encodable]] in [[EUC-JP]].
;; In other words, use of JIS X 0212 characters are discouraged, as they are not
roundtrippable.
An [[EUC-JP]] string is discouraged to contain [[bytes][byte]] [VAR[bytes]] which
is [[decoded][decode]] to [[code point]] [VAR[char]] in [[EUC-JP]]
if bytes for [VAR[char]] in [[EUC-JP]] is not equal to [VAR[bytes]].
;; In other words, when there are multiple byte representations for a [[code point]],
non-canonical representations are discouraged, as they are not roundtrippable.