tlemburg/fold-to-ascii-js

 
 

fold-to-ascii-js

A JavaScript port of the Apache Lucene ASCII Folding Filter. It converts alphabetic, numeric, and symbolic Unicode characters that lie outside the 128-character ASCII range (the "Basic Latin" Unicode block) into their ASCII equivalents, where such equivalents exist.

Sources

This is a straightforward port of the very extensive switch/case statement found in http://svn.apache.org/repos/asf/lucene/java/tags/lucene_solr_4_5_1/lucene/analysis/common/src/java/org/apache/lucene/analysis/miscellaneous/ASCIIFoldingFilter.java

The function that determines character codes is taken from a code example on MDN (https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/String/charCodeAt#Example.3A_Fixing_charCodeAt_to_handle_non-Basic-Multilingual-Plane_characters_if_their_presence_earlier_in_the_string_is_unknown).
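As a rough illustration of what such a function has to do, the sketch below reads the code point that starts at a given UTF-16 index and combines surrogate pairs by hand. The name codePointAtIndex and the -1 return convention are assumptions made for this example; it is not the code used in this project. Modern engines also provide the built-in String.prototype.codePointAt.

```javascript
// Hypothetical helper (not the project's own function): returns the code
// point starting at UTF-16 index `idx`, or -1 when `idx` points at the low
// surrogate of a pair that begins at `idx - 1`.
function codePointAtIndex(str, idx) {
  var code = str.charCodeAt(idx);

  // High surrogate: combine with the following low surrogate, if present.
  if (code >= 0xD800 && code <= 0xDBFF) {
    var low = str.charCodeAt(idx + 1);
    if (low >= 0xDC00 && low <= 0xDFFF) {
      return (code - 0xD800) * 0x400 + (low - 0xDC00) + 0x10000;
    }
    return code; // unpaired high surrogate, returned unchanged
  }

  // Low surrogate preceded by a high surrogate: second half of a pair.
  if (code >= 0xDC00 && code <= 0xDFFF) {
    var prev = str.charCodeAt(idx - 1);
    if (prev >= 0xD800 && prev <= 0xDBFF) {
      return -1;
    }
  }

  return code;
}

codePointAtIndex("\u0142", 0);       // 322 (ł, U+0142)
codePointAtIndex("\uD83D\uDE00", 0); // 128512 (U+1F600)
codePointAtIndex("\uD83D\uDE00", 1); // -1 (second half of the pair)
```

A caller that walks a string index by index can skip every position for which the helper returns -1.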

Non-Basic-Multilingual-Plane characters

The function can already handle characters outside the Basic Multilingual Plane, which UTF-16 encodes as surrogate pairs. This is meant to support future extensions of the replacement table with such characters. At the moment, however, none of them has a replacement.
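To make the idea of such an extension concrete, here is a minimal sketch of a lookup table keyed by numeric code point that holds one non-BMP entry. The table, the function name foldWithTable, and in particular the U+1D400 mapping are invented for illustration; the project currently defines no replacement for any non-BMP character.

```javascript
// Hypothetical replacement table keyed by numeric code point.
var sketchTable = {};
sketchTable[0x0142] = "l";  // ł -> l (a BMP entry, as listed under Replacement Patterns)
sketchTable[0x1D400] = "A"; // MATHEMATICAL BOLD CAPITAL A -> A (invented entry)

function foldWithTable(str) {
  var out = "";
  // for...of walks the string by code point, so a surrogate pair
  // arrives as a single two-unit string.
  for (var ch of str) {
    var cp = ch.codePointAt(0);
    out += Object.prototype.hasOwnProperty.call(sketchTable, cp)
      ? sketchTable[cp]
      : ch;
  }
  return out;
}

foldWithTable("\uD835\uDC00\u0142"); // "Al"
```

Keying the table by code point rather than by UTF-16 code unit means BMP and non-BMP entries can share one lookup path.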

Documentation

Configuration

The replaceUnmapped variable controls what happens to non-ASCII characters that have no entry in the replacement table: they are either replaced or left untouched.

If replaceUnmapped is true, defaultString defines the replacement string used for all such characters.
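The effect of the two settings can be sketched as follows. This is only an illustration under stated assumptions: foldSketch, miniTable, and the options object are hypothetical, since the text above says only that replaceUnmapped and defaultString are variables and does not show how they are set.

```javascript
// Tiny excerpt standing in for the full replacement table.
var miniTable = {
  "\u00F3": "o", // ó
  "\u0142": "l", // ł
  "\u00DF": "ss" // ß
};

// `options` is a hypothetical way of passing the two settings.
function foldSketch(str, options) {
  var out = "";
  for (var ch of str) {
    if (ch.codePointAt(0) < 128) {
      out += ch; // ASCII passes through unchanged
    } else if (Object.prototype.hasOwnProperty.call(miniTable, ch)) {
      out += miniTable[ch]; // mapped character
    } else {
      // unmapped non-ASCII character
      out += options.replaceUnmapped ? options.defaultString : ch;
    }
  }
  return out;
}

var input = "P\u00F3\u0142noc \u65E5"; // "Północ 日"
foldSketch(input, { replaceUnmapped: false, defaultString: "_" }); // "Polnoc 日"
foldSketch(input, { replaceUnmapped: true, defaultString: "_" });  // "Polnoc _"
```

Setting defaultString to the empty string would, in this sketch, simply drop every unmapped character.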

Usage

// As a standalone function:
foldToASCII("Północ"); //=> "Polnoc"

// As a method on a string:
var x = "Północ";
x.foldToASCII(); //=> "Polnoc"
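The two call styles shown above can coexist when the method form simply delegates to the standalone function. The sketch below shows one way this could be wired up; it is an assumption about the general technique, not a description of how this project does it. foldStub and foldStubbed are placeholder names chosen so that they do not clash with the real foldToASCII.

```javascript
// Stand-in for the real function so that the sketch runs on its own;
// it only knows the two characters needed for the example.
function foldStub(str) {
  return str.replace(/\u00F3/g, "o").replace(/\u0142/g, "l");
}

// Non-enumerable method that forwards the string it is called on.
Object.defineProperty(String.prototype, "foldStubbed", {
  value: function () {
    return foldStub(String(this));
  },
  writable: true,
  configurable: true,
  enumerable: false
});

foldStub("P\u00F3\u0142noc");      // "Polnoc"
"P\u00F3\u0142noc".foldStubbed(); // "Polnoc"
```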

Tests

All replacements are covered by QUnit tests. See https://github.com/mplatt/fold-to-ascii-js/blob/master/test/test.html

FAQ

Why is character x being replaced with y and not with z?

Characters cannot be mapped to replacements unambiguously, because the expected replacement depends on the language. For example, a user from France might expect ü to be replaced with u, while a user from Germany expects ue. The replacements featured here are kept as general as possible.
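When a language-specific convention is needed, one option is to apply it before the general fold. The sketch below does this for German umlauts. The names germanPreFold and generalFoldStub are hypothetical, and generalFoldStub is only a stand-in (it strips combining marks after NFD normalization) so that the example runs without this library.

```javascript
// German convention: umlauts become vowel + "e".
var germanPairs = {
  "\u00E4": "ae", "\u00F6": "oe", "\u00FC": "ue", // ä ö ü
  "\u00C4": "Ae", "\u00D6": "Oe", "\u00DC": "Ue"  // Ä Ö Ü
};

function germanPreFold(str) {
  return str.replace(/[\u00E4\u00F6\u00FC\u00C4\u00D6\u00DC]/g, function (ch) {
    return germanPairs[ch];
  });
}

// Stand-in for a general-purpose fold, NOT this library's function.
function generalFoldStub(str) {
  return str.normalize("NFD").replace(/[\u0300-\u036F]/g, "");
}

generalFoldStub("M\u00FCller");                // "Muller"
generalFoldStub(germanPreFold("M\u00FCller")); // "Mueller"
```

Because the pre-fold step emits plain ASCII, the general fold that follows leaves its output untouched.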

Replacement Patterns

Character(s) Replacement
À Á Â Ã Ä Å Ā Ă Ą Ə Ǎ Ǟ Ǡ Ǻ Ȁ Ȃ Ȧ Ⱥ ᴀ Ḁ Ạ Ả Ấ Ầ Ẩ Ẫ Ậ Ắ Ằ Ẳ Ẵ Ặ Ⓐ A A
à á â ã ä å ā ă ą ǎ ǟ ǡ ǻ ȁ ȃ ȧ ɐ ə ɚ ᶏ ᶕ ḁ ẚ ạ ả ấ ầ ẩ ẫ ậ ắ ằ ẳ ẵ ặ ₐ ₔ ⓐ ⱥ Ɐ a a
AA
Æ Ǣ Ǽ ᴁ AE
AO
AU
Ꜹ Ꜻ AV
AY
(a)
aa
æ ǣ ǽ ᴂ ae
ao
au
ꜹ ꜻ av
ay
Ɓ Ƃ Ƀ ʙ ᴃ Ḃ Ḅ Ḇ Ⓑ B B
ƀ ƃ ɓ ᵬ ᶀ ḃ ḅ ḇ ⓑ b b
(b)
Ç Ć Ĉ Ċ Č Ƈ Ȼ ʗ ᴄ Ḉ Ⓒ C C
ç ć ĉ ċ č ƈ ȼ ɕ ḉ ↄ ⓒ Ꜿ ꜿ c c
(c)
Ð Ď Đ Ɖ Ɗ Ƌ ᴅ ᴆ Ḋ Ḍ Ḏ Ḑ Ḓ Ⓓ Ꝺ D D
ð ď đ ƌ ȡ ɖ ɗ ᵭ ᶁ ᶑ ḋ ḍ ḏ ḑ ḓ ⓓ ꝺ d d
DŽ DZ DZ
Dž Dz Dz
(d)
ȸ db
dž dz ʣ ʥ dz
È É Ê Ë Ē Ĕ Ė Ę Ě Ǝ Ɛ Ȅ Ȇ Ȩ Ɇ ᴇ Ḕ Ḗ Ḙ Ḛ Ḝ Ẹ Ẻ Ẽ Ế Ề Ể Ễ Ệ Ⓔ ⱻ E E
è é ê ë ē ĕ ė ę ě ǝ ȅ ȇ ȩ ɇ ɘ ɛ ɜ ɝ ɞ ʚ ᴈ ᶒ ᶓ ᶔ ḕ ḗ ḙ ḛ ḝ ẹ ẻ ẽ ế ề ể ễ ệ ₑ ⓔ ⱸ e e
(e)
Ƒ Ḟ Ⓕ ꜰ Ꝼ ꟻ F F
ƒ ᵮ ᶂ ḟ ẛ ⓕ ꝼ f f
(f)
ff
ffi
ffl
fi
fl
Ĝ Ğ Ġ Ģ Ɠ Ǥ ǥ Ǧ ǧ Ǵ ɢ ʛ Ḡ Ⓖ Ᵹ Ꝿ G G
ĝ ğ ġ ģ ǵ ɠ ɡ ᵷ ᵹ ᶃ ḡ ⓖ ꝿ g g
(g)
Ĥ Ħ Ȟ ʜ Ḣ Ḥ Ḧ Ḩ Ḫ Ⓗ Ⱨ Ⱶ H H
ĥ ħ ȟ ɥ ɦ ʮ ʯ ḣ ḥ ḧ ḩ ḫ ẖ ⓗ ⱨ ⱶ h h
Ƕ HV
(h)
ƕ hv
Ì Í Î Ï Ĩ Ī Ĭ Į İ Ɩ Ɨ Ǐ Ȉ Ȋ ɪ ᵻ Ḭ Ḯ Ỉ Ị Ⓘ ꟾ I I
ì í î ï ĩ ī ĭ į ı ǐ ȉ ȋ ɨ ᴉ ᵢ ᵼ ᶖ ḭ ḯ ỉ ị ⁱ ⓘ i i
IJ IJ
(i)
ij ij
Ĵ Ɉ ᴊ Ⓙ J J
ĵ ǰ ȷ ɉ ɟ ʄ ʝ ⓙ ⱼ j j
(j)
Ķ Ƙ Ǩ ᴋ Ḱ Ḳ Ḵ Ⓚ Ⱪ Ꝁ Ꝃ Ꝅ K K
ķ ƙ ǩ ʞ ᶄ ḱ ḳ ḵ ⓚ ⱪ ꝁ ꝃ ꝅ k k
(k)
Ĺ Ļ Ľ Ŀ Ł Ƚ ʟ ᴌ Ḷ Ḹ Ḻ Ḽ Ⓛ Ⱡ Ɫ Ꝇ Ꝉ Ꞁ L L
ĺ ļ ľ ŀ ł ƚ ȴ ɫ ɬ ɭ ᶅ ḷ ḹ ḻ ḽ ⓛ ⱡ ꝇ ꝉ ꞁ l l
LJ LJ
LL
Lj Lj
(l)
lj lj
ll
ʪ ls
ʫ lz
Ɯ ᴍ Ḿ Ṁ Ṃ Ⓜ Ɱ ꟽ ꟿ M M
ɯ ɰ ɱ ᵯ ᶆ ḿ ṁ ṃ ⓜ m m
(m)
Ñ Ń Ņ Ň Ŋ Ɲ Ǹ Ƞ ɴ ᴎ Ṅ Ṇ Ṉ Ṋ Ⓝ N N
ñ ń ņ ň ʼn ŋ ƞ ǹ ȵ ɲ ɳ ᵰ ᶇ ṅ ṇ ṉ ṋ ⁿ ⓝ n n
NJ NJ
Nj Nj
(n)
nj nj
Ò Ó Ô Õ Ö Ø Ō Ŏ Ő Ɔ Ɵ Ơ Ǒ Ǫ Ǭ Ǿ Ȍ Ȏ Ȫ Ȭ Ȯ Ȱ ᴏ ᴐ Ṍ Ṏ Ṑ Ṓ Ọ Ỏ Ố Ồ Ổ Ỗ Ộ Ớ Ờ Ở Ỡ Ợ Ⓞ Ꝋ Ꝍ O O
ò ó ô õ ö ø ō ŏ ő ơ ǒ ǫ ǭ ǿ ȍ ȏ ȫ ȭ ȯ ȱ ɔ ɵ ᴖ ᴗ ᶗ ṍ ṏ ṑ ṓ ọ ỏ ố ồ ổ ỗ ộ ớ ờ ở ỡ ợ ₒ ⓞ ⱺ ꝋ ꝍ o o
Œ ɶ OE
OO
Ȣ ᴕ OU
(o)
œ ᴔ oe
oo
ȣ ou
Ƥ ᴘ Ṕ Ṗ Ⓟ Ᵽ Ꝑ Ꝓ Ꝕ P P
ƥ ᵱ ᵽ ᶈ ṕ ṗ ⓟ ꝑ ꝓ ꝕ ꟼ p p
(p)
Ɋ Ⓠ Ꝗ Ꝙ Q Q
ĸ ɋ ʠ ⓠ ꝗ ꝙ q q
(q)
ȹ qp
Ŕ Ŗ Ř Ȑ Ȓ Ɍ ʀ ʁ ᴙ ᴚ Ṙ Ṛ Ṝ Ṟ Ⓡ Ɽ Ꝛ Ꞃ R R
ŕ ŗ ř ȑ ȓ ɍ ɼ ɽ ɾ ɿ ᵣ ᵲ ᵳ ᶉ ṙ ṛ ṝ ṟ ⓡ ꝛ ꞃ r r
(r)
Ś Ŝ Ş Š Ș Ṡ Ṣ Ṥ Ṧ Ṩ Ⓢ ꜱ ꞅ S S
ś ŝ ş š ſ ș ȿ ʂ ᵴ ᶊ ṡ ṣ ṥ ṧ ṩ ẜ ẝ ⓢ Ꞅ s s
SS
(s)
ß ss
st
Ţ Ť Ŧ Ƭ Ʈ Ț Ⱦ ᴛ Ṫ Ṭ Ṯ Ṱ Ⓣ Ꞇ T T
ţ ť ŧ ƫ ƭ ț ȶ ʇ ʈ ᵵ ṫ ṭ ṯ ṱ ẗ ⓣ ⱦ t t
Þ Ꝧ TH
TZ
(t)
ʨ tc
þ ᵺ ꝧ th
ʦ ts
tz
Ù Ú Û Ü Ũ Ū Ŭ Ů Ű Ų Ư Ǔ Ǖ Ǘ Ǚ Ǜ Ȕ Ȗ Ʉ ᴜ ᵾ Ṳ Ṵ Ṷ Ṹ Ṻ Ụ Ủ Ứ Ừ Ử Ữ Ự Ⓤ U U
ù ú û ü ũ ū ŭ ů ű ų ư ǔ ǖ ǘ ǚ ǜ ȕ ȗ ʉ ᵤ ᶙ ṳ ṵ ṷ ṹ ṻ ụ ủ ứ ừ ử ữ ự ⓤ u u
(u)
ue
Ʋ Ʌ ᴠ Ṽ Ṿ Ỽ Ⓥ Ꝟ Ꝩ V V
ʋ ʌ ᵥ ᶌ ṽ ṿ ⓥ ⱱ ⱴ ꝟ v v
VY
(v)
vy
Ŵ Ƿ ᴡ Ẁ Ẃ Ẅ Ẇ Ẉ Ⓦ Ⱳ W W
ŵ ƿ ʍ ẁ ẃ ẅ ẇ ẉ ẘ ⓦ ⱳ w w
(w)
Ẋ Ẍ Ⓧ X X
ᶍ ẋ ẍ ₓ ⓧ x x
(x)
Ý Ŷ Ÿ Ƴ Ȳ Ɏ ʏ Ẏ Ỳ Ỵ Ỷ Ỹ Ỿ Ⓨ Y Y
ý ÿ ŷ ƴ ȳ ɏ ʎ ẏ ẙ ỳ ỵ ỷ ỹ ỿ ⓨ y y
(y)
Ź Ż Ž Ƶ Ȝ Ȥ ᴢ Ẑ Ẓ Ẕ Ⓩ Ⱬ Ꝣ Z Z
ź ż ž ƶ ȝ ȥ ɀ ʐ ʑ ᵶ ᶎ ẑ ẓ ẕ ⓩ ⱬ ꝣ z z
(z)
⁰ ₀ ⓪ ⓿ 0 0
¹ ₁ ① ⓵ ❶ ➀ ➊ 1 1
1.
(1)
² ₂ ② ⓶ ❷ ➁ ➋ 2 2
2.
(2)
³ ₃ ③ ⓷ ❸ ➂ ➌ 3 3
3.
(3)
⁴ ₄ ④ ⓸ ❹ ➃ ➍ 4 4
4.
(4)
⁵ ₅ ⑤ ⓹ ❺ ➄ ➎ 5 5
5.
(5)
⁶ ₆ ⑥ ⓺ ❻ ➅ ➏ 6 6
6.
(6)
⁷ ₇ ⑦ ⓻ ❼ ➆ ➐ 7 7
7.
(7)
⁸ ₈ ⑧ ⓼ ❽ ➇ ➑ 8 8
8.
(8)
⁹ ₉ ⑨ ⓽ ❾ ➈ ➒ 9 9
9.
(9)
⑩ ⓾ ❿ ➉ ➓ 10
10.
(10)
⑪ ⓫ 11
11.
(11)
⑫ ⓬ 12
12.
(12)
⑬ ⓭ 13
13.
(13)
⑭ ⓮ 14
14.
(14)
⑮ ⓯ 15
15.
(15)
⑯ ⓰ 16
16.
(16)
⑰ ⓱ 17
17.
(17)
⑱ ⓲ 18
18.
(18)
⑲ ⓳ 19
19.
(19)
⑳ ⓴ 20
20.
(20)
« » “ ” „ ″ ‶ ❝ ❞ ❮ ❯ " "
‘ ’ ‚ ‛ ′ ‵ ‹ › ❛ ❜ ' '
‐ ‑ ‒ – — ⁻ ₋ - -
⁅ ❲ [ [
⁆ ❳ ] ]
⁽ ₍ ❨ ❪ ( (
((
⁾ ₎ ❩ ❫ ) )
))
❬ ❰ < <
❭ ❱ > >
❴ { {
❵ } }
⁺ ₊ + +
⁼ ₌ = =
!
!!
!?
#
$
⁒ % %
&
⁎ * *
,
.
⁄ / /
:
⁏ ; ;
?
??
?!
@
\
‸ ^ ^
_ _
⁓ ~ ~
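As a small worked example, a few rows of the table above can be expanded into a lookup and applied to a string. The helper names buildLookup and foldWithRows are assumptions made for this sketch, and the project itself uses a switch/case statement rather than this data layout. The example also shows that a replacement can be longer than the character it replaces, so the folded string may be longer than the input.

```javascript
// A few rows copied from the table: [source characters, replacement].
var rows = [
  ["\u00C6 \u01E2 \u01FC \u1D01", "AE"], // Æ Ǣ Ǽ ᴁ
  ["\u00DF", "ss"],                      // ß
  ["\u2460 \u24F5 \u2776", "1"],         // ① ⓵ ❶
  ["\u2013 \u2014", "-"]                 // en dash, em dash
];

// Expand every row into one lookup entry per source character.
function buildLookup(tableRows) {
  var lookup = {};
  tableRows.forEach(function (row) {
    row[0].split(" ").forEach(function (ch) {
      lookup[ch] = row[1];
    });
  });
  return lookup;
}

function foldWithRows(str, lookup) {
  var out = "";
  for (var ch of str) {
    out += Object.prototype.hasOwnProperty.call(lookup, ch) ? lookup[ch] : ch;
  }
  return out;
}

var lookup = buildLookup(rows);
foldWithRows("Stra\u00DFe \u2013 \u00C6on \u2460", lookup); // "Strasse - AEon 1"
```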
