@@ -16,9 +16,9 @@ In the browser:
<script src="cptable.js"></script>
<script src="cputils.js"></script>

- The complete set of codepages is large because of some Double Byte Character Set
+ The complete set of codepages is large due to some Double Byte Character Set
encodings. A much smaller file that just includes SBCS codepages is provided in
- this repo (`sbcs.js`).
+ this repo (`sbcs.js`), as well as a file for other projects (`cpexcel.js`).

If you know which codepages you need, you can include individual scripts for
each codepage. The individual files are provided in the `bits/` directory.
@@ -43,7 +43,7 @@ In node:

## Usage

- The codepages are indexed by number. To get the unicode character for a given
+ The codepages are indexed by number. To get the unicode character for a given
codepoint, use the `dec` property:

var unicode_cp10000_255 = cptable[10000].dec[255]; // ˇ
@@ -57,117 +57,187 @@ There are a few utilities that deal with strings and buffers:
var 汇总 = cptable.utils.decode(936, [0xbb,0xe3,0xd7,0xdc]);
var buf = cptable.utils.encode(936, 汇总);

+ `cptable.utils.encode(CP, data, ofmt)` accepts a String or Array of characters
+ and returns a representation controlled by `ofmt`:
+
+ - Default output is a Buffer (or Array) of bytes (integers between 0 and 255).
+ - If `ofmt == 'str'`, return a String where `o.charCodeAt(i)` is the ith byte
+ - If `ofmt == 'arr'`, return an Array of bytes
+
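The relationship between the three output formats can be sketched without the library. The helper `formatBytes` below is hypothetical (it is not part of `cptable.utils`); it only illustrates how one sequence of byte values could be presented in each `ofmt` representation, and it assumes `Buffer.from` is available when running under node.

```javascript
// Hypothetical helper, not part of cptable.utils: presents one byte
// sequence in each of the three ofmt representations described above.
function formatBytes(bytes, ofmt) {
  if (ofmt === 'str') {
    // String whose i-th char code equals the i-th byte
    return bytes.map(function (b) { return String.fromCharCode(b); }).join('');
  }
  if (ofmt === 'arr') {
    // Plain Array of integers between 0 and 255
    return bytes.slice();
  }
  // Default: a Buffer when one is available (node), otherwise an Array
  return typeof Buffer !== 'undefined' ? Buffer.from(bytes) : bytes.slice();
}

// The codepage 936 bytes used in the decode example
var gbk = [0xbb, 0xe3, 0xd7, 0xdc];
var asStr = formatBytes(gbk, 'str');
var asArr = formatBytes(gbk, 'arr');
var asBuf = formatBytes(gbk);
```

The `'str'` form is convenient where binary strings are expected, while the `'arr'` form avoids any dependency on Buffer.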
## Building the script

This script uses [voc](npm.im/voc). The script to build the codepage tables and
the JS source is `codepage.md`, so building is as simple as `voc codepage.md`.

- ## Supported Codepages
-
- The standard Windows codepages are supported:
-
- - 1250 Windows Central Europe
- - 1251 Windows Cyrillic
- - 1252 Windows Latin I
- - 1253 Windows Green
- - 1254 Windows Turkish
- - 1255 Windows Hebrew
- - 1256 Windows Arabic
- - 1257 Windows Baltic
- - 1258 Windows Vietnam
- - 874 Windows Thai
-
- The full collection of `ISO-8859` codepages are also supported. The East-Asian
- Double Byte Character Sets are also supported:
-
- - 932 Japanese Shift-JIS
- - 936 Simplified Chinese GBK
- - 949 Korean
- - 950 Traditional Chinese Big5
-
- The complete list of supported codepages can be found in the file `pages.csv`.
+ ## Generated Codepages
+
+ The complete list of hardcoded codepages can be found in the file `pages.csv`.
+
+ Some codepages are easier to implement algorithmically. Since these are
+ hardcoded in utils, there is no corresponding entry (they are "magic").
+
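As an illustration of why a "magic" codepage needs no table, here is a minimal sketch of codepage 1200 (UTF-16, little endian). The function names `encodeUtf16le` and `decodeUtf16le` are hypothetical and are not taken from the library's source; the sketch only shows that each 16-bit code unit maps directly to two bytes.

```javascript
// Sketch only: UTF-16LE needs no lookup table because every UTF-16 code
// unit is written as two bytes, low byte first. Hypothetical function
// names, not the library's internals.
function encodeUtf16le(str) {
  var out = [];
  for (var i = 0; i < str.length; ++i) {
    var unit = str.charCodeAt(i); // 16-bit code unit
    out.push(unit & 0xff);        // low byte
    out.push(unit >> 8);          // high byte
  }
  return out;
}

function decodeUtf16le(bytes) {
  var out = '';
  // Read two bytes at a time; a trailing odd byte is ignored
  for (var i = 0; i + 1 < bytes.length; i += 2) {
    out += String.fromCharCode(bytes[i] | (bytes[i + 1] << 8));
  }
  return out;
}

var bytes1200 = encodeUtf16le('汇总');
var roundTrip = decodeUtf16le(bytes1200);
```

A table-driven codepage would instead need one entry per byte value, which is the cost the algorithmic approach avoids.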
+ | CP# | Information | Description |
+ | --: | ----------- | ----------- |
+ | 37 | unicode.org | IBM EBCDIC US-Canada |
+ | 437 | unicode.org | OEM United States |
+ | 500 | unicode.org | IBM EBCDIC International |
+ | 708 | MakeEncoding.cs | Arabic (ASMO 708) |
+ | 720 | MakeEncoding.cs | Arabic (Transparent ASMO); Arabic (DOS) |
+ | 737 | unicode.org | OEM Greek (formerly 437G); Greek (DOS) |
+ | 775 | unicode.org | OEM Baltic; Baltic (DOS) |
+ | 850 | unicode.org | OEM Multilingual Latin 1; Western European (DOS) |
+ | 852 | unicode.org | OEM Latin 2; Central European (DOS) |
+ | 855 | unicode.org | OEM Cyrillic (primarily Russian) |
+ | 857 | unicode.org | OEM Turkish; Turkish (DOS) |
+ | 858 | MakeEncoding.cs | OEM Multilingual Latin 1 + Euro symbol |
+ | 860 | unicode.org | OEM Portuguese; Portuguese (DOS) |
+ | 861 | unicode.org | OEM Icelandic; Icelandic (DOS) |
+ | 862 | unicode.org | OEM Hebrew; Hebrew (DOS) |
+ | 863 | unicode.org | OEM French Canadian; French Canadian (DOS) |
+ | 864 | unicode.org | OEM Arabic; Arabic (864) |
+ | 865 | unicode.org | OEM Nordic; Nordic (DOS) |
+ | 866 | unicode.org | OEM Russian; Cyrillic (DOS) |
+ | 869 | unicode.org | OEM Modern Greek; Greek, Modern (DOS) |
+ | 870 | MakeEncoding.cs | IBM EBCDIC Multilingual/ROECE (Latin 2) |
+ | 874 | unicode.org | Windows Thai |
+ | 875 | unicode.org | IBM EBCDIC Greek Modern |
+ | 932 | unicode.org | Japanese Shift-JIS |
+ | 936 | unicode.org | Simplified Chinese GBK |
+ | 949 | unicode.org | Korean |
+ | 950 | unicode.org | Traditional Chinese Big5 |
+ | 1026 | unicode.org | IBM EBCDIC Turkish (Latin 5) |
+ | 1047 | MakeEncoding.cs | IBM EBCDIC Latin 1/Open System |
+ | 1140 | MakeEncoding.cs | IBM EBCDIC US-Canada (037 + Euro symbol) |
+ | 1141 | MakeEncoding.cs | IBM EBCDIC Germany (20273 + Euro symbol) |
+ | 1142 | MakeEncoding.cs | IBM EBCDIC Denmark-Norway (20277 + Euro symbol) |
+ | 1143 | MakeEncoding.cs | IBM EBCDIC Finland-Sweden (20278 + Euro symbol) |
+ | 1144 | MakeEncoding.cs | IBM EBCDIC Italy (20280 + Euro symbol) |
+ | 1145 | MakeEncoding.cs | IBM EBCDIC Latin America-Spain (20284 + Euro symbol) |
+ | 1146 | MakeEncoding.cs | IBM EBCDIC United Kingdom (20285 + Euro symbol) |
+ | 1147 | MakeEncoding.cs | IBM EBCDIC France (20297 + Euro symbol) |
+ | 1148 | MakeEncoding.cs | IBM EBCDIC International (500 + Euro symbol) |
+ | 1149 | MakeEncoding.cs | IBM EBCDIC Icelandic (20871 + Euro symbol) |
+ | 1200 | magic | Unicode UTF-16, little endian (BMP of ISO 10646) |
+ | 1201 | magic | Unicode UTF-16, big endian |
+ | 1250 | unicode.org | Windows Central Europe |
+ | 1251 | unicode.org | Windows Cyrillic |
+ | 1252 | unicode.org | Windows Latin I |
+ | 1253 | unicode.org | Windows Greek |
+ | 1254 | unicode.org | Windows Turkish |
+ | 1255 | unicode.org | Windows Hebrew |
+ | 1256 | unicode.org | Windows Arabic |
+ | 1257 | unicode.org | Windows Baltic |
+ | 1258 | unicode.org | Windows Vietnam |
+ | 1361 | MakeEncoding.cs | Korean (Johab) |
+ | 10000 | unicode.org | MAC Roman |
+ | 10001 | MakeEncoding.cs | Japanese (Mac) |
+ | 10002 | MakeEncoding.cs | MAC Traditional Chinese (Big5) |
+ | 10003 | MakeEncoding.cs | Korean (Mac) |
+ | 10004 | MakeEncoding.cs | Arabic (Mac) |
+ | 10005 | MakeEncoding.cs | Hebrew (Mac) |
+ | 10006 | unicode.org | Greek (Mac) |
+ | 10007 | unicode.org | Cyrillic (Mac) |
+ | 10008 | MakeEncoding.cs | MAC Simplified Chinese (GB 2312) |
+ | 10010 | MakeEncoding.cs | Romanian (Mac) |
+ | 10017 | MakeEncoding.cs | Ukrainian (Mac) |
+ | 10021 | MakeEncoding.cs | Thai (Mac) |
+ | 10029 | unicode.org | MAC Latin 2 (Central European) |
+ | 10079 | unicode.org | Icelandic (Mac) |
+ | 10081 | unicode.org | Turkish (Mac) |
+ | 10082 | MakeEncoding.cs | Croatian (Mac) |
+ | 12000 | magic | Unicode UTF-32, little endian byte order |
+ | 12001 | magic | Unicode UTF-32, big endian byte order |
+ | 20000 | MakeEncoding.cs | CNS Taiwan (Chinese Traditional) |
+ | 20001 | MakeEncoding.cs | TCA Taiwan |
+ | 20002 | MakeEncoding.cs | Eten Taiwan (Chinese Traditional) |
+ | 20003 | MakeEncoding.cs | IBM5550 Taiwan |
+ | 20004 | MakeEncoding.cs | TeleText Taiwan |
+ | 20005 | MakeEncoding.cs | Wang Taiwan |
+ | 20105 | MakeEncoding.cs | Western European IA5 (IRV International Alphabet 5) 7-bit |
+ | 20106 | MakeEncoding.cs | IA5 German (7-bit) |
+ | 20107 | MakeEncoding.cs | IA5 Swedish (7-bit) |
+ | 20108 | MakeEncoding.cs | IA5 Norwegian (7-bit) |
+ | 20127 | magic | US-ASCII (7-bit) |
+ | 20261 | MakeEncoding.cs | T.61 |
+ | 20269 | MakeEncoding.cs | ISO 6937 Non-Spacing Accent |
+ | 20273 | MakeEncoding.cs | IBM EBCDIC Germany |
+ | 20277 | MakeEncoding.cs | IBM EBCDIC Denmark-Norway |
+ | 20278 | MakeEncoding.cs | IBM EBCDIC Finland-Sweden |
+ | 20280 | MakeEncoding.cs | IBM EBCDIC Italy |
+ | 20284 | MakeEncoding.cs | IBM EBCDIC Latin America-Spain |
+ | 20285 | MakeEncoding.cs | IBM EBCDIC United Kingdom |
+ | 20290 | MakeEncoding.cs | IBM EBCDIC Japanese Katakana Extended |
+ | 20297 | MakeEncoding.cs | IBM EBCDIC France |
+ | 20420 | MakeEncoding.cs | IBM EBCDIC Arabic |
+ | 20423 | MakeEncoding.cs | IBM EBCDIC Greek |
+ | 20424 | MakeEncoding.cs | IBM EBCDIC Hebrew |
+ | 20833 | MakeEncoding.cs | IBM EBCDIC Korean Extended |
+ | 20838 | MakeEncoding.cs | IBM EBCDIC Thai |
+ | 20866 | MakeEncoding.cs | Russian Cyrillic (KOI8-R) |
+ | 20871 | MakeEncoding.cs | IBM EBCDIC Icelandic |
+ | 20880 | MakeEncoding.cs | IBM EBCDIC Cyrillic Russian |
+ | 20905 | MakeEncoding.cs | IBM EBCDIC Turkish |
+ | 20924 | MakeEncoding.cs | IBM EBCDIC Latin 1/Open System (1047 + Euro symbol) |
+ | 20932 | MakeEncoding.cs | Japanese (JIS 0208-1990 and 0212-1990) |
+ | 20936 | MakeEncoding.cs | Simplified Chinese (GB2312-80) |
+ | 20949 | MakeEncoding.cs | Korean Wansung |
+ | 21025 | MakeEncoding.cs | IBM EBCDIC Cyrillic Serbian-Bulgarian |
+ | 21866 | MakeEncoding.cs | Ukrainian Cyrillic (KOI8-U) |
+ | 28591 | unicode.org | ISO 8859-1 Latin 1 (Western European) |
+ | 28592 | unicode.org | ISO 8859-2 Latin 2 (Central European) |
+ | 28593 | unicode.org | ISO 8859-3 Latin 3 |
+ | 28594 | unicode.org | ISO 8859-4 Baltic |
+ | 28595 | unicode.org | ISO 8859-5 Cyrillic |
+ | 28596 | unicode.org | ISO 8859-6 Arabic |
+ | 28597 | unicode.org | ISO 8859-7 Greek |
+ | 28598 | unicode.org | ISO 8859-8 Hebrew (ISO-Visual) |
+ | 28599 | unicode.org | ISO 8859-9 Turkish |
+ | 28600 | unicode.org | ISO 8859-10 Latin 6 |
+ | 28601 | unicode.org | ISO 8859-11 Latin (Thai) |
+ | 28603 | unicode.org | ISO 8859-13 Latin 7 (Estonian) |
+ | 28604 | unicode.org | ISO 8859-14 Latin 8 (Celtic) |
+ | 28605 | unicode.org | ISO 8859-15 Latin 9 |
+ | 28606 | unicode.org | ISO 8859-16 Latin 10 |
+ | 29001 | MakeEncoding.cs | Europa 3 |
+ | 38598 | MakeEncoding.cs | ISO 8859-8 Hebrew (ISO-Logical) |
+ | 50220 | MakeEncoding.cs | ISO 2022 JIS Japanese with no halfwidth Katakana |
+ | 50221 | MakeEncoding.cs | ISO 2022 JIS Japanese with halfwidth Katakana |
+ | 50222 | MakeEncoding.cs | ISO 2022 Japanese JIS X 0201-1989 (1 byte Kana-SO/SI) |
+ | 50225 | MakeEncoding.cs | ISO 2022 Korean |
+ | 50227 | MakeEncoding.cs | ISO 2022 Simplified Chinese |
+ | 51932 | MakeEncoding.cs | EUC Japanese |
+ | 51936 | MakeEncoding.cs | EUC Simplified Chinese |
+ | 51949 | MakeEncoding.cs | EUC Korean |
+ | 52936 | MakeEncoding.cs | HZ-GB2312 Simplified Chinese |
+ | 54936 | MakeEncoding.cs | GB18030 Simplified Chinese (4 byte) |
+ | 57002 | MakeEncoding.cs | ISCII Devanagari |
+ | 57003 | MakeEncoding.cs | ISCII Bengali |
+ | 57004 | MakeEncoding.cs | ISCII Tamil |
+ | 57005 | MakeEncoding.cs | ISCII Telugu |
+ | 57006 | MakeEncoding.cs | ISCII Assamese |
+ | 57007 | MakeEncoding.cs | ISCII Oriya |
+ | 57008 | MakeEncoding.cs | ISCII Kannada |
+ | 57009 | MakeEncoding.cs | ISCII Malayalam |
+ | 57010 | MakeEncoding.cs | ISCII Gujarati |
+ | 57011 | MakeEncoding.cs | ISCII Punjabi |
+ | 65000 | magic | Unicode (UTF-7) |
+ | 65001 | magic | Unicode (UTF-8) |
+
+ Note that MakeEncoding.cs deviates from unicode.org for some codepages. In the
+ case of direct conflicts, unicode.org takes precedence. In cases where the
+ unicode.org listing does not prescribe a value, the MakeEncoding.cs value is used.
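
The precedence rule can be sketched with made-up data. The function `mergeTables` and both sample tables below are hypothetical and are not taken from the build script; each table maps a byte value to a Unicode code point, with `undefined` standing for "no value prescribed".

```javascript
// Sketch of the stated precedence rule using invented sample data.
// mergeTables is a hypothetical name, not part of the build script.
function mergeTables(unicodeOrg, makeEncoding) {
  var merged = [];
  for (var i = 0; i < 256; ++i) {
    if (unicodeOrg[i] !== undefined) {
      merged[i] = unicodeOrg[i];     // unicode.org takes precedence
    } else if (makeEncoding[i] !== undefined) {
      merged[i] = makeEncoding[i];   // fall back to MakeEncoding.cs
    }
  }
  return merged;
}

var fromUnicodeOrg = [];
fromUnicodeOrg[0x41] = 0x0041;   // both sources agree
fromUnicodeOrg[0x80] = 0x20ac;   // direct conflict: this value is kept

var fromMakeEncoding = [];
fromMakeEncoding[0x41] = 0x0041;
fromMakeEncoding[0x80] = 0x0080; // loses to the unicode.org value
fromMakeEncoding[0x81] = 0x0081; // only MakeEncoding.cs prescribes a value

var merged = mergeTables(fromUnicodeOrg, fromMakeEncoding);
```

Byte values that neither source defines stay undefined in the merged table.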

## Missing Codepages

The following codepages are not implemented. Normative references may not be
available in all cases. Furthermore, other software packages are known to hack
- certain codepages (for example, Mozilla treats ASMO-708 as an alias of Arabic
+ certain codepages (for example, Mozilla treats ASMO-708 as an alias of Arabic
ISO-8859-6 when in fact there are many differences), so all implementations
*should* be cleanroom when possible.

- 709 Arabic (ASMO-449+, BCON V4)
- 710 Arabic - Transparent Arabic
- - 870 IBM EBCDIC Multilingual/ROECE (Latin 2); IBM EBCDIC Multilingual Latin 2
- - 1047 IBM EBCDIC Latin 1/Open System
- - 1140 IBM EBCDIC US-Canada (037 + Euro symbol); IBM EBCDIC (US-Canada-Euro)
- - 1141 IBM EBCDIC Germany (20273 + Euro symbol); IBM EBCDIC (Germany-Euro)
- - 1142 IBM EBCDIC Denmark-Norway (20277 + Euro symbol); IBM EBCDIC (Denmark-Norway-Euro)
- - 1143 IBM EBCDIC Finland-Sweden (20278 + Euro symbol); IBM EBCDIC (Finland-Sweden-Euro)
- - 1144 IBM EBCDIC Italy (20280 + Euro symbol); IBM EBCDIC (Italy-Euro)
- - 1145 IBM EBCDIC Latin America-Spain (20284 + Euro symbol); IBM EBCDIC (Spain-Euro)
- - 1146 IBM EBCDIC United Kingdom (20285 + Euro symbol); IBM EBCDIC (UK-Euro)
- - 1147 IBM EBCDIC France (20297 + Euro symbol); IBM EBCDIC (France-Euro)
- - 1148 IBM EBCDIC International (500 + Euro symbol); IBM EBCDIC (International-Euro)
- - 1149 IBM EBCDIC Icelandic (20871 + Euro symbol); IBM EBCDIC (Icelandic-Euro)
- - 1200 Unicode UTF-16, little endian byte order (BMP of ISO 10646); available only to managed applications
- - 1201 Unicode UTF-16, big endian byte order; available only to managed applications
- - 1361 Korean (Johab)
- - 10001 Japanese (Mac)
- - 10002 MAC Traditional Chinese (Big5); Chinese Traditional (Mac)
- - 10003 Korean (Mac)
- - 10004 Arabic (Mac)
- - 10005 Hebrew (Mac)
- - 10008 MAC Simplified Chinese (GB 2312); Chinese Simplified (Mac)
- - 10010 Romanian (Mac)
- - 10017 Ukrainian (Mac)
- - 10021 Thai (Mac)
- - 10082 Croatian (Mac)
- - 12000 Unicode UTF-32, little endian byte order; available only to managed applications
- - 12001 Unicode UTF-32, big endian byte order; available only to managed applications
- - 20000 CNS Taiwan; Chinese Traditional (CNS)
- - 20001 TCA Taiwan
- - 20002 Eten Taiwan; Chinese Traditional (Eten)
- - 20003 IBM5550 Taiwan
- - 20004 TeleText Taiwan
- - 20005 Wang Taiwan
- - 20105 IA5 (IRV International Alphabet No. 5, 7-bit); Western European (IA5)
- - 20106 IA5 German (7-bit)
- - 20107 IA5 Swedish (7-bit)
- - 20108 IA5 Norwegian (7-bit)
- - 20127 US-ASCII (7-bit)
- - 20261 T.61
- - 20269 ISO 6937 Non-Spacing Accent
- - 20273 IBM EBCDIC Germany
- - 20277 IBM EBCDIC Denmark-Norway
- - 20278 IBM EBCDIC Finland-Sweden
- - 20280 IBM EBCDIC Italy
- - 20284 IBM EBCDIC Latin America-Spain
- - 20285 IBM EBCDIC United Kingdom
- - 20290 IBM EBCDIC Japanese Katakana Extended
- - 20297 IBM EBCDIC France
- - 20420 IBM EBCDIC Arabic
- - 20423 IBM EBCDIC Greek
- - 20424 IBM EBCDIC Hebrew
- - 20833 IBM EBCDIC Korean Extended
- - 20838 IBM EBCDIC Thai
- - 20866 Russian (KOI8-R); Cyrillic (KOI8-R)
- - 20871 IBM EBCDIC Icelandic
- - 20880 IBM EBCDIC Cyrillic Russian
- - 20905 IBM EBCDIC Turkish
- - 20924 IBM EBCDIC Latin 1/Open System (1047 + Euro symbol)
- - 20932 Japanese (JIS 0208-1990 and 0212-1990)
- - 20936 Simplified Chinese (GB2312); Chinese Simplified (GB2312-80)
- - 20949 Korean Wansung
- - 21025 IBM EBCDIC Cyrillic Serbian-Bulgarian
- 21027 (deprecated) <-- is this necessary?
- - 21866 Ukrainian (KOI8-U); Cyrillic (KOI8-U)
- - 29001 Europa 3
- - 38598 ISO 8859-8 Hebrew; Hebrew (ISO-Logical)
- - 50220 ISO 2022 Japanese with no halfwidth Katakana; Japanese (JIS)
- - 50221 ISO 2022 Japanese with halfwidth Katakana; Japanese (JIS-Allow 1 byte Kana)
- - 50222 ISO 2022 Japanese JIS X 0201-1989; Japanese (JIS-Allow 1 byte Kana - SO/SI)
- - 50225 ISO 2022 Korean
- - 50227 ISO 2022 Simplified Chinese; Chinese Simplified (ISO 2022)
- 50229 ISO 2022 Traditional Chinese
- 50930 EBCDIC Japanese (Katakana) Extended
- 50931 EBCDIC US-Canada and Japanese
@@ -176,26 +246,16 @@ ISO-8859-6 when in fact there are many differences), so all implementations
- 50936 EBCDIC Simplified Chinese
- 50937 EBCDIC US-Canada and Traditional Chinese
- 50939 EBCDIC Japanese (Latin) Extended and Japanese
- - 51932 EUC Japanese
- - 51936 EUC Simplified Chinese; Chinese Simplified (EUC)
- - 51949 EUC Korean
- 51950 EUC Traditional Chinese
- - 52936 HZ-GB2312 Simplified Chinese; Chinese Simplified (HZ)
- - 54936 Windows XP and later: GB18030 Simplified Chinese (4 byte); Chinese Simplified (GB18030)
- - 57002 ISCII Devanagari
- - 57003 ISCII Bengali
- - 57004 ISCII Tamil
- - 57005 ISCII Telugu
- - 57006 ISCII Assamese
- - 57007 ISCII Oriya
- - 57008 ISCII Kannada
- - 57009 ISCII Malayalam
- - 57010 ISCII Gujarati
- - 57011 ISCII Punjabi
- - 65000 Unicode (UTF-7)

## Sources

- [Unicode Consortium Public Mappings](http://www.unicode.org/Public/MAPPINGS/)
- [Code Page Enumeration](http://msdn.microsoft.com/en-us/library/cc195051.aspx)
- - [Code Page Identifiers](http://msdn.microsoft.com/en-us/library/windows/desktop/dd317756(v=vs.85).aspx)
+ - [Code Page Identifiers](http://msdn.microsoft.com/en-us/library/windows/desktop/dd317756.aspx)
+
+ ## Badges
+
+ [![githalytics.com alpha](https://cruel-carlota.pagodabox.com/afa29a5e8495a01059ee5b353f9042cb "githalytics.com")](http://githalytics.com/SheetJS/js-codepage)
+ [![Build Status](https://travis-ci.org/SheetJS/js-codepage.svg?branch=master)](https://travis-ci.org/SheetJS/js-codepage)
+ [![Coverage Status](https://coveralls.io/repos/SheetJS/js-codepage/badge.png)](https://coveralls.io/r/SheetJS/js-codepage)