go-uucode

go-uucode is a small Go Unicode segmentation and width package inspired by Jacob Sandlund's excellent uucode. Jacob's Zig implementation does the hard architectural work here: generated Unicode tables, compact property rows, and a fast staged lookup strategy. This package ports that table-first approach to Go.

It provides:

extended grapheme cluster iteration over Go strings
UAX #14 line break opportunity iteration over Go strings
grapheme-aware terminal cell width with StringWidth
typed lookup APIs for generated Unicode category, break, binary, emoji, width, and case properties
no runtime UCD parser, cache, or fallback path

Usage

package main

import (
	"fmt"

	"github.com/rockorager/go-uucode"
)

func main() {
	s := "👩🏽‍🚀🇨🇭A\u0300"

	// Walk extended grapheme clusters; Start and End are byte offsets into s.
	it := uucode.NewGraphemeIterator(s)
	for {
		g, ok := it.Next()
		if !ok {
			break
		}
		fmt.Printf("%q [%d:%d]\n", s[g.Start:g.End], g.Start, g.End)
	}

	text := "hello, 世界\nnext"
	// Walk UAX #14 line break opportunities, printing each segment with its Break value.
	lines := uucode.NewLineIterator(text)
	for {
		seg, ok := lines.Next()
		if !ok {
			break
		}
		fmt.Printf("%q %s\n", text[seg.Start:seg.End], seg.Break)
	}

	fmt.Println(uucode.StringWidth("ò👨🏻‍❤️‍👨🏿_"))
	// Typed property lookups on single runes.
	fmt.Println(uucode.IsLetter('界'), uucode.WordBreak('A'), uucode.LineBreak(' '))
}

Benchmarks

Benchmarks below were run on an Apple M4 Max with Go 1.26.1. Both libraries reported 0 B/op and 0 allocs/op.

Public API benchmark	go-uucode ns/op	uniseg ns/op	Speedup
Grapheme ASCII	361.6	3326	9.20x
Grapheme Combining	254.6	1810	7.11x
Grapheme Emoji	184.7	1863	10.09x
Grapheme Mixed	255.5	2452	9.60x
Width ASCII	33.75	489.2	14.49x
Width Combining	286.6	331.2	1.16x
Width Emoji	217.3	444.1	2.04x
Width Mixed	250.9	500.1	1.99x

Predicate APIs are benchmarked against Go's unicode package on a rotating 32-rune corpus. These are speed comparisons against the public stdlib APIs only, not like-for-like data comparisons: the local Go toolchain reports unicode.Version == "15.0.0" while go-uucode ships Unicode 17 data. Each row below shows the mean of the six benchmark subcases:

Predicate benchmark	go-uucode ns/op	stdlib ns/op	Speedup
IsUpper	1.67	5.73	3.42x
IsLower	1.68	5.77	3.44x
IsTitle	2.63	2.26	0.86x
IsLetter	1.75	6.50	3.72x
IsNumber	1.69	5.24	3.11x
IsDigit	1.68	4.68	2.78x
IsMark	1.75	6.52	3.72x
IsPunct	2.52	6.24	2.47x
IsSymbol	2.54	6.31	2.48x
IsGraphic	2.56	22.23	8.69x
IsPrint	2.62	22.26	8.50x
IsControl	0.37	0.68	1.83x
IsSpace	2.66	3.33	1.25x
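
For readers who want to reproduce the shape of the predicate benchmarks, here is a minimal sketch of a rotating 32-rune corpus benchmark for the stdlib side. The corpus contents, the at helper, and benchIsLetter are assumptions made for illustration; the repository's actual corpus and subcases are not shown in this README. The go-uucode side would presumably call uucode.IsLetter in place of unicode.IsLetter.

```go
package main

import (
	"fmt"
	"testing"
	"unicode"
)

// corpus is a hypothetical 32-rune mix of ASCII, Latin-1, Greek, Cyrillic,
// Hebrew, Arabic, CJK, symbols, marks, and emoji. It is not the corpus used
// for the published numbers.
var corpus = [32]rune{
	'A', 'z', '0', '9', ' ', '\n', '_', '-',
	'\u00E9', '\u00DF', '\u03A9', '\u044F', '\u0436', '\u05D0', '\u0628', '\u3042',
	'\u30A2', '\u754C', '\u4E16', '\uD55C', '\U0001D49C', '\u20AC', '\u00B1', '\u00BD',
	'\u0663', '\u0300', '\u200D', '\U0001F600', '\u01C5', '\u2167', '\u00AB', '\t',
}

// sink keeps the compiler from discarding the predicate call.
var sink bool

// at rotates through the corpus; the length is a power of two, so a mask
// replaces the modulo.
func at(i int) rune { return corpus[i&31] }

// benchIsLetter measures one predicate call per iteration.
func benchIsLetter(b *testing.B) {
	for i := 0; i < b.N; i++ {
		sink = unicode.IsLetter(at(i))
	}
}

func main() {
	// testing.Benchmark runs the function outside of "go test".
	fmt.Println(testing.Benchmark(benchIsLetter))
}
```

Rotating through a mixed corpus keeps a single hot rune from dominating the measurement, which matters when comparing a table lookup against the stdlib's range-table search.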

Generated binary property APIs are benchmarked against unicode.Is with the matching stdlib range table on a property-focused 32-rune corpus:

Binary property benchmark	go-uucode ns/op	stdlib ns/op	Speedup
IsASCIIHexDigit	2.44	2.58	1.06x
IsHexDigit	2.45	3.45	1.41x
IsDash	2.47	3.83	1.55x
IsDiacritic	2.45	5.25	2.14x
IsQuotationMark	2.44	3.85	1.57x
IsPatternSyntax	2.45	4.40	1.80x
IsPatternWhiteSpace	2.44	3.31	1.36x
IsVariationSelector	2.42	3.04	1.26x
IsNoncharacter	2.44	3.13	1.28x
IsUnifiedIdeograph	2.47	2.95	1.20x

Simple case mapping APIs are benchmarked against the matching unicode functions on a case-focused 32-rune corpus:

Case mapping benchmark	go-uucode ns/op	stdlib ns/op	Speedup
ToUpper	2.01	6.86	3.42x
ToLower	1.92	6.83	3.56x
ToTitle	1.92	6.85	3.57x
SimpleFold	1.93	6.37	3.30x

String case folding is benchmarked against strings.EqualFold:

EqualFold benchmark	go-uucode ns/op	stdlib ns/op	Speedup
ASCII equal	12.38	13.00	1.05x
ASCII miss	8.92	9.02	1.01x
Kelvin	9.80	9.29	0.95x
Greek sigma	18.72	28.79	1.54x
Mixed Unicode	31.94	41.27	1.29x
Length miss	2.68	2.63	0.98x
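
The case names in the table hint at the kinds of inputs involved. The sketch below uses only strings.EqualFold from the stdlib to show what such cases might look like; the input strings are assumptions chosen for illustration, not the benchmark's actual inputs, and the Mixed Unicode case is omitted because its content is not described.

```go
package main

import (
	"fmt"
	"strings"
)

// foldCase pairs a label with two strings to compare under simple case folding.
type foldCase struct{ name, a, b string }

// foldCases are hypothetical inputs named after the benchmark rows.
var foldCases = []foldCase{
	{"ASCII equal", "Hello, World", "hELLO, wORLD"},
	{"ASCII miss", "Hello, World", "Hello, Wordl"},
	// U+212A KELVIN SIGN folds together with 'K' and 'k'.
	{"Kelvin", "kelvin", "\u212Aelvin"},
	// Capital sigma U+03A3 folds together with final sigma U+03C2.
	{"Greek sigma", "\u039F\u0394\u039F\u03A3", "\u03BF\u03B4\u03BF\u03C2"},
	{"Length miss", "short", "shorter"},
}

// equalByName evaluates every case with strings.EqualFold.
func equalByName() map[string]bool {
	out := make(map[string]bool, len(foldCases))
	for _, c := range foldCases {
		out[c.name] = strings.EqualFold(c.a, c.b)
	}
	return out
}

func main() {
	res := equalByName()
	for _, c := range foldCases {
		fmt.Printf("%-12s %v\n", c.name, res[c.name])
	}
}
```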

Run the package benchmarks:

go test -run '^$' -bench . -benchmem

The comparison against github.com/rivo/uniseg lives in a separate nested module so uniseg is not a dependency of this package:

cd bench/uniseg
go test -run '^$' -bench . -benchmem

Generated Tables

The package ships Unicode 17 source files and generates packed runtime tables. The hot path uses three stages:

runtimeStage1 indexes 256-code-point blocks by cp >> 8
runtimeStage2 indexes the low byte within deduplicated blocks
runtimeStage3 stores deduplicated packed property rows
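
The three stages above can be sketched as follows. This is a toy illustration of the general staged-lookup technique, not the generated code: the row type, the table contents, and the lookup function are assumptions, and the data covers only ASCII letters and U+4E00..U+9FFF.

```go
package main

import "fmt"

// row is a hypothetical property row with two toy fields.
type row struct {
	width  uint8 // terminal cells
	letter bool
}

// stage3 holds deduplicated rows (toy data, not real Unicode tables).
var stage3 = []row{
	{width: 1, letter: false}, // 0: default
	{width: 1, letter: true},  // 1: narrow letter
	{width: 2, letter: true},  // 2: wide letter
}

var (
	stage1 [0x1100]uint8  // block number for each cp >> 8
	stage2 [3 * 256]uint8 // stage3 index for each (block, low byte)
)

func init() {
	// Block 0 stays all zeros: every code point in it maps to the default row.
	// Block 1 describes U+0000..U+00FF; only ASCII letters are marked here.
	for cp := 'A'; cp <= 'Z'; cp++ {
		stage2[1<<8|int(cp)] = 1
		stage2[1<<8|int(cp+'a'-'A')] = 1
	}
	// Block 2 is uniformly "wide letter".
	for i := 0; i < 256; i++ {
		stage2[2<<8|i] = 2
	}
	stage1[0x00] = 1
	// U+4E00..U+9FFF: 82 high bytes share the single deduplicated block 2.
	for hi := 0x4E; hi <= 0x9F; hi++ {
		stage1[hi] = 2
	}
}

// lookup resolves a code point in two array reads plus one row read.
func lookup(cp rune) row {
	if cp < 0 || cp > 0x10FFFF {
		return stage3[0]
	}
	block := int(stage1[cp>>8])
	return stage3[stage2[block<<8|int(cp&0xFF)]]
}

func main() {
	fmt.Println(lookup('A'), lookup('界'), lookup(' '))
}
```

Deduplication is where the space saving comes from: in this toy, 82 high bytes point at one shared 256-entry block, and every unlisted high byte points at the default block.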

Regenerate after changing UCD files or generator logic:

go generate ./...

The generated runtime rows store compact fields for grapheme segmentation, terminal-width calculation, general category predicates, word/sentence/line break properties, East Asian Width, PropList binary properties, simple case mapping, simple case folding, and emoji properties used by the public lookup functions.
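
To make "compact fields" concrete, here is a minimal sketch of bit-packing a few properties into one integer row. The 16-bit layout, field names, and accessor methods are hypothetical; the real generated rows carry more fields and their layout is defined by the generator.

```go
package main

import "fmt"

// packedRow is a hypothetical 16-bit layout:
//
//	bits 0-4  grapheme break class (up to 32 values)
//	bits 5-6  terminal width (0, 1, or 2)
//	bit  7    letter flag
//	bit  8    emoji flag
type packedRow uint16

const (
	breakMask  = 0x1F
	widthShift = 5
	widthMask  = 0x3
	letterFlag = 1 << 7
	emojiFlag  = 1 << 8
)

// pack builds a row from its fields, masking each to its bit budget.
func pack(breakClass, width uint8, letter, emoji bool) packedRow {
	r := packedRow(breakClass&breakMask) | packedRow(width&widthMask)<<widthShift
	if letter {
		r |= letterFlag
	}
	if emoji {
		r |= emojiFlag
	}
	return r
}

// Accessors decode one field each with a shift and a mask.
func (r packedRow) breakClass() uint8 { return uint8(r & breakMask) }
func (r packedRow) width() uint8      { return uint8(r>>widthShift) & widthMask }
func (r packedRow) isLetter() bool    { return r&letterFlag != 0 }
func (r packedRow) isEmoji() bool     { return r&emojiFlag != 0 }

func main() {
	r := pack(3, 2, false, true)
	fmt.Println(r.breakClass(), r.width(), r.isLetter(), r.isEmoji())
}
```

With a layout like this, one table lookup yields every property at once, and each public predicate reduces to a mask on the row it already fetched.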

Attribution

The design is based on jacobsandlund/uucode. If you are interested in the original implementation, its Unicode table generation strategy, or a Zig library for this problem space, start there.
