aisbergg/go-unidecode

go-unidecode

ASCII transliterations of Unicode text for Go. Unicode characters are mapped to ASCII characters based on their phonetic representation, for example: André → Andre, 北京 → Bei Jing.

Inspired by python-unidecode.

Installation

go get -u github.com/aisbergg/go-unidecode

Install CLI tool:

$ go install github.com/aisbergg/go-unidecode/cmd/unidecode@latest

$ unidecode 北京kožušček
Bei Jing kozuscek

$ cat file.txt | unidecode -e replace -r "#" -

Usage

package main

import (
	"fmt"
	"strings"

	"github.com/aisbergg/go-unidecode/pkg/unidecode"
)

func main() {
	//
	// General Usage
	//

	s := "abc 北京kožušček"
	d, _ := unidecode.Unidecode(s, unidecode.Ignore)
	fmt.Println(d)
	// Output: abc Bei Jing kozuscek

	s = "北京"
	b, _ := unidecode.UnidecodeBytes([]byte(s), unidecode.Ignore)
	fmt.Println(string(b))
	// Output: Bei Jing

	//
	// Error Handling
	//

	// return an error if a character has no ASCII transliteration
	s = "⁐"
	_, err := unidecode.Unidecode(s, unidecode.Strict)
	fmt.Println(err)
	// Output: no replacement found for character ⁐ in position 0

	// preserve characters that have no ASCII transliteration
	d, _ = unidecode.Unidecode(s, unidecode.Preserve)
	fmt.Println(d)
	// Output: ⁐

	// replace characters that have no ASCII transliteration with the specified replacement text
	d, _ = unidecode.Unidecode(s, unidecode.Replace, "?")
	fmt.Println(d)
	// Output: ?

	//
	// Append to an existing buffer to avoid allocations while transliterating
	//

	s = "kožušček"
	buf := make([]byte, 0, len(s)+len(s)/3)
	b, _ = unidecode.Append(buf, s, unidecode.Ignore)
	fmt.Println(string(b))
	// Output: kozuscek

	//
	// Writing to an io.Writer
	//

	bld := strings.Builder{}
	w := unidecode.NewWriter(&bld, unidecode.Ignore)
	w.WriteString(s)
	fmt.Println(bld.String())
	// Output: kozuscek
}

Benchmark

The source code for the benchmarks is located in the benchmarks directory.

cpu: Intel(R) Core(TM) i7-8550U CPU @ 1.80GHz
BenchmarkAisberggUnidecode-4         	   34971	     32703 ns/op	    6144 B/op	       1 allocs/op
BenchmarkAisberggUnidecodeAppend-4   	   38949	     30046 ns/op	       0 B/op	       0 allocs/op
BenchmarkAisberggUnidecodeWriter-4   	   27589	     43437 ns/op	   23981 B/op	       0 allocs/op
BenchmarkFiamUnidecode-4             	     949	   1211890 ns/op	 4305247 B/op	    2335 allocs/op
BenchmarkMozillazgUnidecode-4        	   10000	    102804 ns/op	  107960 B/op	     608 allocs/op
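
The 0 B/op and 0 allocs/op reported for the Append variant are what one would expect from an API that writes into a caller-supplied buffer. The sketch below illustrates that pattern with a toy function and measures it with `testing.AllocsPerRun` from the standard library. `appendASCII` and `toyTable` are hypothetical stand-ins, not the library's code, and the measurement says nothing about the real package.

```go
package main

import (
	"fmt"
	"testing"
	"unicode/utf8"
)

// toyTable is a tiny illustrative table, not the library's data.
var toyTable = map[rune]string{'ž': "z", 'š': "s", 'č': "c"}

// appendASCII appends a transliteration of s to dst and returns the
// extended slice. Characters missing from toyTable are dropped.
func appendASCII(dst []byte, s string) []byte {
	for _, r := range s {
		if r < utf8.RuneSelf { // already ASCII
			dst = append(dst, byte(r))
			continue
		}
		dst = append(dst, toyTable[r]...)
	}
	return dst
}

func main() {
	s := "kožušček"
	// Allocate once up front; the capacity is large enough for the output.
	buf := make([]byte, 0, len(s))

	// Reuse the same backing array on every run by resetting the length.
	allocs := testing.AllocsPerRun(1000, func() {
		buf = appendASCII(buf[:0], s)
	})

	// Print the transliterated text and the measured allocations per run.
	fmt.Println(string(buf), allocs)
}
```

As long as the buffer's capacity covers the output, `append` never has to grow the slice, so the loop itself should not allocate.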

Contributing

If you have suggestions, want to file a bug report, or want to contribute to this project in some other way, please read the contribution guidelines.

And don't forget to give this project a star 🌟! Thanks again!

License

Distributed under the MIT License. See LICENSE for more information.

Contact

André Lehmann

Acknowledgments

I needed an up-to-date and efficient library for transliterating Unicode characters. I looked at mozillazg/go-unidecode, but it didn't deliver what I was looking for. So I took it upon myself to build my own library, using the transliteration tables from the Python library avian2/unidecode. A big thanks to all the contributors of avian2/unidecode!
