<blockquote> 1. <a href="http://en.wikipedia.org/wiki/Utf8" title="UTF-8">UTF-8</a> -&gt; <a href="http://en.wikipedia.org/wiki/Utf8" title="UTF-8">http://en.wikipedia.org/wiki/Utf8</a><br> 2. <a href="http://en.wikipedia.org/wiki/CJK_Unified_Ideographs" title="CJK Unified Ideographs">CJK Unified Ideographs</a> </blockquote>

<h4>1. Unicode CJK</h4>
<p>I am looking for the Chinese character set. Its characters fall within the <a href="http://en.wikipedia.org/wiki/CJK_Unified_Ideographs" title="CJK Unified Ideographs">CJK</a> range.</p>
<p>The basic block named CJK Unified Ideographs (4E00–9FFF) contains 20,941 basic Chinese characters in the range U+4E00 through U+9FCC. The charts:</p>
<p><a title="List of CJK Unified Ideographs, part 1 of 4" href="http://en.wikipedia.org/wiki/List_of_CJK_Unified_Ideographs,_part_1_of_4">4E00-62FF</a>, <a title="List of CJK Unified Ideographs, part 2 of 4" href="http://en.wikipedia.org/wiki/List_of_CJK_Unified_Ideographs,_part_2_of_4">6300-77FF</a>, <a title="List of CJK Unified Ideographs, part 3 of 4" href="http://en.wikipedia.org/wiki/List_of_CJK_Unified_Ideographs,_part_3_of_4">7800-8CFF</a>, <a title="List of CJK Unified Ideographs, part 4 of 4" href="http://en.wikipedia.org/wiki/List_of_CJK_Unified_Ideographs,_part_4_of_4">8D00-9FFF</a>.</p>
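<p>The range check described above can be sketched in a few lines of JavaScript. This is only a minimal sketch: the function name <code>isCjkUnifiedIdeograph</code> is hypothetical, and it tests the whole basic block (U+4E00 through U+9FFF) without consulting any extension blocks.</p>

```javascript
// Hypothetical helper: true if the first code point of `ch` lies in the
// basic CJK Unified Ideographs block (U+4E00 through U+9FFF).
function isCjkUnifiedIdeograph(ch) {
  const cp = ch.codePointAt(0);
  return cp !== undefined && cp >= 0x4e00 && cp <= 0x9fff;
}

console.log(isCjkUnifiedIdeograph("大")); // U+5927, inside the block
console.log(isCjkUnifiedIdeograph("A"));  // U+0041, outside the block
```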
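<h4>2. UTF-8 &lt;-&gt; Unicode table</h4>
<p>What I need to do is translate the Unicode code point (right side) into the corresponding 3-byte UTF-8 character (left side). Each cell shows the start of a Unicode range in hexadecimal and, in bold, the decimal value of the first UTF-8 byte for that range.</p>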
<table>
  <tr style="color:red; font-weight:bolder">
    <th>UTF-8 (3 bytes)</th>
    <th colspan="16">Unicode (16 bits, in hexadecimal)</th>
  </tr>
  <tr>
    <th>&nbsp;<br>
      3-byte<br>
      E_<br>
      &nbsp;</th>
    <td bgcolor="#EEEEEE"><small>Indic</small><br>
      <small>0800*</small><br>
      <i><b>224</b></i></td>
    <td bgcolor="#EEEEEE"><small>Misc.</small><br>
      <small>1000</small><br>
      <i><b>225</b></i></td>
    <td bgcolor="#EEEEEE"><small>Symbol</small><br>
      <small>2000</small><br>
      <i><b>226</b></i></td>
    <td bgcolor="#EEEEEE"><small><a href="http://en.wikipedia.org/wiki/Kana" title="Kana">Kana</a><br>
      <a href="http://en.wikipedia.org/wiki/CJK_Unified_Ideographs" title="CJK Unified Ideographs">CJK</a></small><br>
      <small>3000</small><br>
      <i><b>227</b></i></td>
    <td bgcolor="#EEEEEE"><small><a href="http://en.wikipedia.org/wiki/CJK_Unified_Ideographs" title="CJK Unified Ideographs">CJK</a></small><br>
      <small>4000</small><br>
      <i><b>228</b></i></td>
    <td bgcolor="#EEEEEE"><small><a href="http://en.wikipedia.org/wiki/CJK_Unified_Ideographs" title="CJK Unified Ideographs">CJK</a></small><br>
      <small>5000</small><br>
      <i><b>229</b></i></td>
    <td bgcolor="#EEEEEE"><small><a href="http://en.wikipedia.org/wiki/CJK_Unified_Ideographs" title="CJK Unified Ideographs">CJK</a></small><br>
      <small>6000</small><br>
      <i><b>230</b></i></td>
    <td bgcolor="#EEEEEE"><small><a href="http://en.wikipedia.org/wiki/CJK_Unified_Ideographs" title="CJK Unified Ideographs">CJK</a></small><br>
      <small>7000</small><br>
      <i><b>231</b></i></td>
    <td bgcolor="#EEEEEE"><small><a href="http://en.wikipedia.org/wiki/CJK_Unified_Ideographs" title="CJK Unified Ideographs">CJK</a></small><br>
      <small>8000</small><br>
      <i><b>232</b></i></td>
    <td bgcolor="#EEEEEE"><small><a href="http://en.wikipedia.org/wiki/CJK_Unified_Ideographs" title="CJK Unified Ideographs">CJK</a></small><br>
      <small>9000</small><br>
      <i><b>233</b></i></td>
    <td bgcolor="#EEEEEE"><small>Asian</small><br>
      <small>A000</small><br>
      <i><b>234</b></i></td>
    <td bgcolor="#EEEEEE"><small><a href="http://en.wikipedia.org/wiki/Hangul" title="Hangul">Hangul</a></small><br>
      <small>B000</small><br>
      <i><b>235</b></i></td>
    <td bgcolor="#EEEEEE"><small><a href="http://en.wikipedia.org/wiki/Hangul" title="Hangul">Hangul</a></small><br>
      <small>C000</small><br>
      <i><b>236</b></i></td>
    <td bgcolor="#EEEEEE"><small><a href="http://en.wikipedia.org/wiki/Hangul" title="Hangul">Hangul</a><br>
      Surr</small><br>
      <small>D000</small><br>
      <i><b>237</b></i></td>
    <td bgcolor="#EEEEEE"><small><a href="http://en.wikipedia.org/wiki/Private_Use_Area" title="Private Use Area" class="mw-redirect">Priv Use</a></small><br>
      <small>E000</small><br>
      <i><b>238</b></i></td>
    <td bgcolor="#EEEEEE"><small>Forms</small><br>
      <small>F000</small><br>
      <i><b>239</b></i></td>
  </tr>
</table>
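<p>The bold decimal values in the table follow from the 3-byte pattern: the first byte is the marker <code>1110</code> followed by the top four bits of the code point. A minimal sketch, assuming a hypothetical helper named <code>leadByte3</code>:</p>

```javascript
// Hypothetical helper: first UTF-8 byte for a code point in U+0800..U+FFFF.
function leadByte3(cp) {
  if (cp < 0x0800 || cp > 0xffff) {
    throw new RangeError("code point is not in the 3-byte range");
  }
  // 1110xxxx: the marker 0xE0 plus the top 4 bits of the code point.
  return 0xe0 | (cp >> 12);
}

console.log(leadByte3(0x5927)); // 229, the 5000 column of the table
console.log(leadByte3(0x5927).toString(16).toUpperCase()); // E5
```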
<h4>3. Unicode -&gt; UTF-8 conversion formula</h4>
<p>For the CJK set, each 16-bit Unicode character is encoded as 3 bytes of UTF-8 (the highlighted row).</p>
<table class="wikitable">
  <tbody>
    <tr>
      <th>Bits</th>
      <th>Last code point</th>
      <th>Byte 1</th>
      <th>Byte 2</th>
      <th>Byte 3</th>
      <th>Byte 4</th>
      <th>Byte 5</th>
      <th>Byte 6</th>
    </tr>
    <tr>
      <th>&nbsp;&nbsp;7</th>
      <td>U+007F</td>
      <td><code>0xxxxxxx</code></td>
    </tr>
    <tr>
      <th>11</th>
      <td>U+07FF</td>
      <td><code>110xxxxx</code></td>
      <td><code>10xxxxxx</code></td>
    </tr>
    <tr style="color:red; size:150%">
      <th><strong>16</strong></th>
      <td><strong>U+FFFF</strong></td>
      <td><strong><code>1110xxxx</code></strong></td>
      <td><strong><code>10xxxxxx</code></strong></td>
      <td><strong><code>10xxxxxx</code></strong></td>
    </tr>
    <tr>
      <th>21</th>
      <td>U+1FFFFF</td>
      <td><code>11110xxx</code></td>
      <td><code>10xxxxxx</code></td>
      <td><code>10xxxxxx</code></td>
      <td><code>10xxxxxx</code></td>
    </tr>
    <tr>
      <th>26</th>
      <td>U+3FFFFFF</td>
      <td><code>111110xx</code></td>
      <td><code>10xxxxxx</code></td>
      <td><code>10xxxxxx</code></td>
      <td><code>10xxxxxx</code></td>
      <td><code>10xxxxxx</code></td>
    </tr>
    <tr>
      <th>31</th>
      <td>U+7FFFFFFF</td>
      <td><code>1111110x</code></td>
      <td><code>10xxxxxx</code></td>
      <td><code>10xxxxxx</code></td>
      <td><code>10xxxxxx</code></td>
      <td><code>10xxxxxx</code></td>
      <td><code>10xxxxxx</code></td>
    </tr>
  </tbody>
</table>
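<p>The first three rows of the table can be turned into code directly. The following is a sketch under stated assumptions, not a complete encoder: <code>encodeUtf8</code> is a hypothetical name, it only covers code points up to U+FFFF, and it does not reject surrogate code points.</p>

```javascript
// Hypothetical helper: encode one code point (up to U+FFFF) as UTF-8 bytes,
// following the first three rows of the table. Surrogates are not rejected.
function encodeUtf8(cp) {
  if (!Number.isInteger(cp) || cp < 0 || cp > 0xffff) {
    throw new RangeError("this sketch only handles U+0000..U+FFFF");
  }
  if (cp <= 0x7f) {
    return [cp];                   // 0xxxxxxx
  }
  if (cp <= 0x7ff) {
    return [
      0xc0 | (cp >> 6),            // 110xxxxx
      0x80 | (cp & 0x3f),          // 10xxxxxx
    ];
  }
  return [
    0xe0 | (cp >> 12),             // 1110xxxx
    0x80 | ((cp >> 6) & 0x3f),     // 10xxxxxx
    0x80 | (cp & 0x3f),            // 10xxxxxx
  ];
}

const hex = encodeUtf8(0x5927).map((b) => b.toString(16).toUpperCase());
console.log(hex.join(" ")); // E5 A4 A7
```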

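<h4>4. Example</h4>
<p>For the Chinese character '大' (U+5927), the steps to convert Unicode to UTF-8 are as follows:</p>
<pre>
According to the Unicode-to-UTF-8 encoding rules, Chinese characters use a 3-byte sequence,
so apply the 3-byte conversion formula:
0800 - FFFF
1110xxxx 10xxxxxx 10xxxxxx
The 16 bits marked x are filled with the corresponding bits of the Unicode code point.
0x5927 in binary is 0101 1001 0010 0111
Filling these bits into the x positions above gives
11100101 10100100 10100111
In hexadecimal this is E5 A4 A7
To verify:
Enter javascript:alert(encodeURI('大').replace(/%/g,'')) in the browser address bar and press Enter.
</pre>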
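<p>The same steps can be traced outside the browser, for example in Node.js. This is a minimal sketch: <code>utf8HexViaEncodeURI</code> is a hypothetical wrapper around the <code>encodeURI</code> trick from the verification step, and the bit filling is done on strings purely to mirror the steps above.</p>

```javascript
// Hypothetical helper: hex UTF-8 bytes of a string, using the same
// encodeURI trick as the address-bar check. Only meaningful for non-ASCII
// text, because encodeURI leaves most ASCII characters unescaped.
function utf8HexViaEncodeURI(text) {
  return encodeURI(text).replace(/%/g, "");
}

// Fill the 16 code-point bits into 1110xxxx 10xxxxxx 10xxxxxx.
const bits = (0x5927).toString(2).padStart(16, "0");
const filled = [
  "1110" + bits.slice(0, 4),
  "10" + bits.slice(4, 10),
  "10" + bits.slice(10, 16),
];
const manual = filled
  .map((b) => parseInt(b, 2).toString(16).toUpperCase())
  .join("");

console.log(filled.join(" "));                   // 11100101 10100100 10100111
console.log(manual, utf8HexViaEncodeURI("大"));  // E5A4A7 E5A4A7
```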
</body></html>