-
Notifications
You must be signed in to change notification settings - Fork 4
/
566.txt
178 lines (141 loc) · 7.73 KB
/
566.txt
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
[10] [DFN[[RUBYB[サロゲートペア]@en[surrogate pair]]]]は、
[[UTF-16]] で2つの[[16ビット符号単位]]を組み合わせて1つの[[Unicode符号位置]]を表すものです。
[11] [[サロゲートペア]]で使う[[16ビット符号単位]]に相当する[[符号位置]]は、
([[符号化方式]]に関わらず) 使わないことになっています。
* 仕様書
[REFS[
- [26]
[CITE[[[The Unicode Standard]], Version 13.0 - ch23.pdf]], [TIME[2020-03-09T17:53:52.000Z]], [TIME[2020-12-15T09:20:09.803Z]] <https://www.unicode.org/versions/latest/ch23.pdf#G24089>
]REFS]
* 符号位置
[27]
[DFN[[RUBYB[サロゲート領域][surrogates area]]]]の[[符号位置]]
[DFN[[CODE[U+D800]]]] - [DFN[[CODE[U+DFFF]]]]
は、
[[UTF-16]]
で
[CODE[U+10000]] - [CODE[U+10FFFF]]
を記述するために2つ1組で使う[[16ビット符号単位]]、
[DFN[[RUBYB[サロゲート][surrogate]]]]です。
組のことを[DFN[[RUBYB[サロゲートペア][surrogate pair]]]]といいます。
[SRC[>>26]]
-
[28]
[DFN[high-surrogate code point]]
[CODE[U+D800]] - [DFN[[CODE[U+DBFF]]]]
の1024個の[[符号位置]]は、
[[サロゲートペア]]の1つ目に使います。
[SRC[>>26]]
-- [31]
[DFN[private-use high-surrogate code points]]
[DFN[[CODE[U+DB80]]]] - [CODE[U+DFFF]]
[SRC[>>26]]
は、
[[PUP]]
の[[サロゲートペア]]の1つ目に使います。
-
[29]
[DFN[low-surrogate code point]]
[DFN[[CODE[U+DC00]]]] - [CODE[U+DFFF]]
の1024個の[[符号位置]]は、
[[サロゲートペア]]の2つ目に使います。
[SRC[>>26]]
[FIG(railroad)[ [30] [[サロゲートペア]]
= =
== [[high surrogate]]
== [[low surrogate]]
]FIG]
[35]
[[Unicode]]
の[[ブロック]]名としての
[[High Surrogates]]
は、
[[High Private Use Surrogates]]
と重ならない
[DFN[[CODE[U+DB7F]]]]
までに限定されています。
[REFS[
- [33] [DFN[[[High Surrogates]]]]
符号位置の一覧
<https://chars.suikawiki.org/set/%24unicode%3ABlock%3AHigh_Surrogates>
- [32] [DFN[[[High Private Use Surrogates]]]]
符号位置の一覧
<https://chars.suikawiki.org/set/%24unicode%3ABlock%3AHigh_Private_Use_Surrogates>
- [34] [DFN[[[Low Surrogates]]]]
符号位置の一覧
<https://chars.suikawiki.org/set/%24unicode%3ABlock%3ALow_Surrogates>
]REFS]
* 符号化方式
[12] 次の[[符号化方式]]で[[サロゲートペア]]が使われました。
[FIG(short list)[
- [[UTF-16]]
- [[CESU-8]]
- [[UTF-32S]]
]FIG]
* 歴史
[1]
[CITE@en[I'm not a Klingon : UTF-16, UTF-8 & UTF-32 update to conform with Unicode 5.0's security concerns.]] ([TIME[2007-07-27 23:26:23 +09:00]] 版) <http://blogs.msdn.com/shawnste/archive/2007/07/23/utf-16-utf-8-utf-32-update-to-conform-with-unicode-5-0-s-security-concerns.aspx>
[2] [CITE@en[Web Applications 1.0 r7084 Make WebSocket silently convert isolated surrogated to U+FFFD rather than throwing an exception. This will result in data corruption when a user types in astral-plane characters that get truncated by naiive script half-way through, rather than crashing the application.]]
( ([TIME[2012-05-03 05:06:00 +09:00]] 版))
<http://html5.org/tools/web-apps-tracker?from=7083&to=7084>
[3] [CITE@en[Notifications API: minor change]]
( ([[Anne van Kesteren]] 著, [TIME[2012-11-30 00:03:40 +09:00]] 版))
<http://lists.w3.org/Archives/Public/public-web-notification/2012Nov/0010.html>
[4] [CITE[IRC logs: freenode / #whatwg / 20130915]]
( ([TIME[2013-09-16 22:06:01 +09:00]] 版))
<http://krijnhoetmer.nl/irc-logs/whatwg/20130915#l-241>
[5] [CITE[IRC logs: freenode / #whatwg / 20101111]]
( ([TIME[2010-11-19 23:02:05 +09:00]] 版))
<http://krijnhoetmer.nl/irc-logs/whatwg/20101111#l-579>
[6] [CITE@en[Web Applications 1.0 r6184 Try to clean up the stuff about Unicode characters.]]
( ([TIME[2011-06-04 04:40:00 +09:00]] 版))
<http://html5.org/tools/web-apps-tracker?from=6183&to=6184>
[7] [CITE[IRC logs: freenode / #whatwg / 20140329]]
( ([TIME[2014-03-31 13:06:30 +09:00]] 版))
<http://krijnhoetmer.nl/irc-logs/whatwg/20140329>
[8] [CITE[IRC logs: freenode / #whatwg / 20140516]]
( ([TIME[2014-05-21 16:29:26 +09:00]] 版))
<http://krijnhoetmer.nl/irc-logs/whatwg/20140516>
[9] [CITE@en[''''''[''''''CSSWG'''''']'''''' Minutes Seoul F2F 2014-05-19 Part V: Counter Styles, CSS Formatting for Books, Font Load Events, Future F2F Meetings, CSS Syntax - Unpaired Surrogates, MQ Listener]]
( ([[Dael Jackson]] 著, [TIME[2014-06-09 08:42:57 +09:00]] 版))
<http://lists.w3.org/Archives/Public/www-style/2014Jun/0060.html>
[13] [CITE@en[Define JavaScript string and scalar value string]]
([[annevk]]著, [TIME[2017-03-27 16:01:49 +09:00]])
<https://github.com/whatwg/infra/commit/f1be763cfba23d2fc780b35403074c599e69616e>
[14] [CITE@en['''['''c''']''' (2) Disallow surrogates in the input stream; make the syntax sect…]]
([[Hixie]]著, [TIME[2013-09-14 06:27:11 +09:00]])
<https://github.com/whatwg/html/commit/6dfaa1a826fae1dd50695710498434d201e543f6>
[15] [CITE@en['''['''''']''' (0) Catch unpaired surrogates before trying to convert them to UTF-8.]]
([[Hixie]]著, [TIME[2009-06-13 07:33:53 +09:00]])
<https://github.com/whatwg/html/commit/53f640d41e2aadfde9cada86d3046d5912ecc818>
[16] [CITE@en['''['''ct''']''' (2) Make surrogates in UTF-8 and character references turn into …]]
([[Hixie]]著, [TIME[2009-09-16 18:22:01 +09:00]])
<https://github.com/whatwg/html/commit/6db21943d024e774d2aa52573981c130767034e9>
[17] [CITE@en['''['''t''']''' (0) Remove the requirement that the parser deal with raw surrogat…]]
([[Hixie]]著, [TIME[2011-02-09 09:29:12 +09:00]])
<https://github.com/whatwg/html/commit/3accfd8a1893d91cb3cdbae62b6d8980e456dda6>
[18] [CITE@en['''['''giow''']''' (0) Fix the UTF-8 decoder error handling to handle a few error…]]
([[Hixie]]著, [TIME[2011-03-04 11:56:49 +09:00]])
<https://github.com/whatwg/html/commit/74e3b6cb761ee8a79b3a1a44d029c128fd0a201f>
[19] [CITE@en['''['''giow''']''' (0) Unpaired surrogates should throw an exception in close, li…]]
([[Hixie]]著, [TIME[2011-06-22 08:01:17 +09:00]])
<https://github.com/whatwg/html/commit/226e15ebd3d557a67bedcfc043e165d24e4182c1>
[20] [CITE@en['''['''giow''']''' (1) Make WebSocket silently convert isolated surrogated to U+F…]]
([[Hixie]]著, [TIME[2012-05-03 05:06:23 +09:00]])
<https://github.com/whatwg/html/commit/a817b04f4c262645ef996a5176b4a3f0a3a11928>
[21] [CITE@en['''['''c''']''' (2) Disallow surrogates in the input stream; make the syntax sect…]]
([[Hixie]]著, [TIME[2013-09-14 06:27:11 +09:00]])
<https://github.com/whatwg/html/commit/6dfaa1a826fae1dd50695710498434d201e543f6>
[22] [CITE@en['''['''cssom''']''' Add IDL `CSSOMString`, typedef of either USVString or DOMString]]
([[SimonSapin]]著, [TIME[2017-04-21 12:28:09 +09:00]])
<https://github.com/w3c/csswg-drafts/commit/830ae19ffd9a6fa6eb60aa21549d334cb18fb706>
[23] [[サロゲートペア]]の ([[UTF-16]] の) 支持者は、
不支持者に対して、「[[サロゲートペア]]の処理は[[結合文字]]の処理より簡単だ、
どのみち[[結合文字]]の処理は必要だから大した問題ではない」
と返すのが (十数年にわたってw) 定番となっています。
[24] [[文字符号化]]のレイヤーと[[文字]]の処理のレイヤーが混ざっているのは20世紀のソフトウェア開発技法だと思うんですがねぇ。
([[シフトJIS]]は[[半角]]と[[全角]]が[[バイト数]]と一致するから表示処理が楽だ、
みたいなのと同じでしょう。)
[25] [CITE@en['''['''css-syntax''']''' Remove 'code point' and 'surrogate code point' in favor …]]
([[tabatkins]]著, [TIME[2017-06-10 03:59:51 +09:00]])
<https://github.com/w3c/csswg-drafts/commit/320a990184a331057a56a17cdf627fee81bdc5d3>