Ubase

Ubase is a command-line program and an OCaml library for removing diacritics (accents, etc.) from Latin letters in UTF8 strings. For instance, "é" -> "e", "🅴" -> "E", etc. There is also a corresponding JavaScript library and executable.

It should work for all UTF8 strings, regardless of their normalization form (NFC, NFD, NFKC or NFKD).

Please don't use this library to store your strings without accents! On the contrary, store them in full UTF8 encoding, and use this library to simplify searching and comparison.

Ubase OCaml library

Example

let nfc = "V\197\169 Ng\225\187\141c Phan";;
let nfd = "Vu\204\131 Ngo\204\163c Phan";;

print_endline nfc;;
Vũ Ngọc Phan

print_endline nfd;;
Vũ Ngọc Phan

Ubase.from_utf8 nfc;;
- : string = "Vu Ngoc Phan"

Ubase.from_utf8 nfd;;
- : string = "Vu Ngoc Phan"

Usage

val from_utf8 : ?malformed:string -> ?strip:string -> string -> string
(** Remove all diacritics on Latin letters from a standard string containing
   UTF8 text. Any malformed UTF8 will be replaced by the [malformed] parameter
   (by default "?"). If the optional parameter [strip] is present, all
   non-ASCII, non-Latin unicode characters will be replaced by the [strip]
   string (which can be empty). If both [malformed] and [strip] contain only
   ASCII characters, then the result of [from_utf8] is guaranteed to
   contain only ASCII characters. *)
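The optional parameters can be tried out with a short sketch like the following. It assumes the ubase library is loaded (for instance inside dune utop); the sample strings are illustrative only, and no output is claimed for the malformed case.

```ocaml
(* Assumes the ubase library is available, e.g. in `dune utop`. *)

(* Default behaviour: diacritics on Latin letters are removed. *)
let plain = Ubase.from_utf8 "V\197\169 Ng\225\187\141c Phan"
(* plain = "Vu Ngoc Phan", as in the example above *)

(* The byte "\255" can never appear in valid UTF8, so it is malformed
   input and gets replaced by the [malformed] string. *)
let repaired = Ubase.from_utf8 ~malformed:"_" "caf\195\169 \255"

(* With [strip], non-ASCII, non-Latin characters are replaced too
   (here by the empty string). *)
let ascii_only = Ubase.from_utf8 ~malformed:"?" ~strip:"" "Déjà vu 日本"

(* Helper (not part of Ubase): true if every byte is below 128. *)
let is_ascii s = String.for_all (fun c -> Char.code c < 128) s

let () =
  print_endline plain;
  print_endline repaired;
  print_endline ascii_only;
  (* Both [malformed] and [strip] are ASCII, so the result must be ASCII. *)
  assert (is_ascii ascii_only)
```

Passing both malformed and strip as ASCII strings is the way to obtain a result that can safely be used where only ASCII is accepted.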

If your accented string is encoded in ISO 8859-1 (Latin-1), you first have to convert it to UTF8 using isolatin_to_utf8 mystring.
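As a small illustration of that two-step conversion, here is a sketch that assumes isolatin_to_utf8 is exposed by the Ubase module with type string -> string; the sample string is made up for the example.

```ocaml
(* Assumes the ubase library is available, e.g. in `dune utop`. *)

(* In ISO 8859-1, byte 233 is "é" and byte 224 is "à".
   Taken as UTF8, this string would be malformed. *)
let latin1 = "d\233j\224 vu"

(* Step 1: re-encode as UTF8. Step 2: remove the diacritics. *)
let utf8 = Ubase.isolatin_to_utf8 latin1
let base = Ubase.from_utf8 utf8

let () = print_endline base
```

Calling from_utf8 directly on the Latin-1 bytes would instead treat the accented bytes as malformed UTF8 and replace them with the malformed string.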

Install

ubase is available on opam:

opam install ubase

That's it!

If you prefer to build a local version, download the repository, move into the ubase directory, and run:

dune build
opam install .

Ubase versions >= 10 have no dependencies apart from OCaml >= 4.14. Previous versions depend on uutf but work with older versions of OCaml.

Quick test

Before installing

From the ubase directory:

dune utop

From the command line

Once you have installed the library, you can execute the ubase program from a terminal.

Doc

Documentation and API are available here.

Manually building the docs, from the ubase directory:

dune build @doc
firefox ./_build/default/_doc/_html/ubase/Ubase/index.html

Using Ubase for accent-insensitive searching

Have a look at Ufind, a small search engine based on Ubase.
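For simple cases, accent-insensitive comparison can be sketched directly on top of from_utf8. The helpers below (key, same_base, contains) are hypothetical names written for this example; they are not part of Ubase or Ufind, and the substring search is deliberately naive.

```ocaml
(* Assumes the ubase library is available, e.g. in `dune utop`. *)

(* Search key: diacritics removed, then ASCII letters lowercased. *)
let key s = String.lowercase_ascii (Ubase.from_utf8 s)

(* Two strings match if their search keys are equal. *)
let same_base a b = String.equal (key a) (key b)

(* Naive substring test performed on the search keys. *)
let contains ~needle haystack =
  let n = key needle and h = key haystack in
  let ln = String.length n and lh = String.length h in
  let rec go i = i + ln <= lh && (String.sub h i ln = n || go (i + 1)) in
  go 0

(* The two encodings of the same name from the example section. *)
let nfc = "V\197\169 Ng\225\187\141c Phan"
let nfd = "Vu\204\131 Ngo\204\163c Phan"

let () =
  Printf.printf "%b\n" (same_base nfc nfd);
  Printf.printf "%b\n" (contains ~needle:"ngoc" nfc)
```

The original strings stay stored in full UTF8; only the keys computed at comparison time lose their accents.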

UTF8 coverage

Ubase covers more than 2000 UTF8 characters and should be quite complete. File an issue if some character is not properly 'basified'.

The `ubase` program

If you installed the library, the ubase program is automatically installed. If you don't need the library, you may directly download the binary from the Releases page, or here:

You can execute the ubase program from a terminal. Its usage is straightforward:

$ ubase Déjà vu !
Deja vu !

$ ubase "Bøǹĵöůɍ"
Bonjour

$ ubase Anh xin lỗi các em bé vì đã đề tặng cuốn sách này cho một ông người lớn.
Anh xin loi cac em be vi da de tang cuon sach nay cho mot ong nguoi lon.
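A similar command-line filter can be sketched in a few lines of OCaml. This is not the source of the ubase program, only an illustration under the assumption that the arguments are joined with single spaces before conversion; basify_args is a hypothetical helper name.

```ocaml
(* Assumes the ubase library is available when compiling. *)

(* Join the arguments with spaces and remove the diacritics. *)
let basify_args args = Ubase.from_utf8 (String.concat " " args)

let () =
  (* Sys.argv.(0) is the program name, so skip it. *)
  let args = List.tl (Array.to_list Sys.argv) in
  print_endline (basify_args args)
```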
