go-utf8

Package utf8 implements encoding and decoding of UTF-8, for the Go programming language.

This package is meant to be a replacement for Go's built-in "unicode/utf8" package.

Documentation

Online documentation, which includes examples, can be found at: http://godoc.org/sourcecode.social/reiver/go-utf8

Reading a Single UTF-8 Character

This is the simplest way of reading a single UTF-8 character.

var reader io.Reader

// ...

r, n, err := utf8.ReadRune(reader)
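
For a sense of what reading one character from an io.Reader involves, here is a minimal sketch that uses only Go's standard library. It is not this package's implementation. The names readRuneSketch and firstRune are hypothetical, and "unicode/utf8" below is the built-in package, not go-utf8.

```go
package main

import (
	"errors"
	"fmt"
	"io"
	"strings"
	"unicode/utf8" // the standard library package, used here only for decoding
)

// readRuneSketch is a hypothetical helper: it reads exactly one UTF-8
// encoded code point from r and returns the rune and the bytes consumed.
func readRuneSketch(r io.Reader) (rune, int, error) {
	var buf [utf8.UTFMax]byte

	// The leading byte tells us how long the whole sequence is.
	if _, err := io.ReadFull(r, buf[:1]); err != nil {
		return utf8.RuneError, 0, err
	}

	var size int
	b := buf[0]
	switch {
	case b&0x80 == 0x00: // 0xxxxxxx
		size = 1
	case b&0xE0 == 0xC0: // 110xxxxx
		size = 2
	case b&0xF0 == 0xE0: // 1110xxxx
		size = 3
	case b&0xF8 == 0xF0: // 11110xxx
		size = 4
	default:
		return utf8.RuneError, 1, errors.New("invalid UTF-8 leading byte")
	}

	// Read the continuation bytes, if any.
	n, err := io.ReadFull(r, buf[1:size])
	if err != nil {
		return utf8.RuneError, 1 + n, err
	}

	// Let the standard library validate and decode the collected bytes.
	decoded, width := utf8.DecodeRune(buf[:size])
	if decoded == utf8.RuneError && width <= 1 {
		return utf8.RuneError, size, errors.New("invalid UTF-8 sequence")
	}
	return decoded, size, nil
}

// firstRune is a hypothetical convenience wrapper for trying the sketch on a string.
func firstRune(s string) (rune, int, error) {
	return readRuneSketch(strings.NewReader(s))
}

func main() {
	r, n, err := firstRune("≡ is three bytes")
	fmt.Println(string(r), n, err)
}
```

The sketch reads byte by byte from the underlying reader, so it never consumes more input than the one character it returns.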

Writing a Single UTF-8 Character

This is the simplest way of writing a single UTF-8 character.

var writer io.Writer

// ...

var r rune

// ...

n, err := utf8.WriteRune(writer, r)
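
For comparison, here is roughly what the same task looks like with only Go's standard library. This is a sketch, not this package's implementation. The names writeRuneSketch and encodeToString are hypothetical, and "unicode/utf8" below is the built-in package.

```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"unicode/utf8" // the standard library package
)

// writeRuneSketch is a hypothetical helper: it encodes r as UTF-8 and
// writes the resulting bytes to w, returning the number of bytes written.
func writeRuneSketch(w io.Writer, r rune) (int, error) {
	var buf [utf8.UTFMax]byte
	// EncodeRune writes U+FFFD instead if r is not a valid code point.
	size := utf8.EncodeRune(buf[:], r)
	return w.Write(buf[:size])
}

// encodeToString is a hypothetical wrapper that captures the written bytes.
func encodeToString(r rune) (string, int, error) {
	var out bytes.Buffer
	n, err := writeRuneSketch(&out, r)
	return out.String(), n, err
}

func main() {
	s, n, err := encodeToString('🙂')
	fmt.Printf("% x (%d bytes) err=%v\n", s, n, err)
}
```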

io.RuneReader

This is how you can create an io.RuneReader:

var reader io.Reader

// ...

var runeReader io.RuneReader = utf8.NewRuneReader(reader)

// ...

r, n, err := runeReader.ReadRune()
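
A typical way to consume an io.RuneReader is to loop until io.EOF. The sketch below is written against the interface only, so it should work with any implementation. Here bufio.Reader from the standard library is a stand-in for the value returned by NewRuneReader, and the names countRunes and countRunesIn are hypothetical.

```go
package main

import (
	"bufio"
	"fmt"
	"io"
	"strings"
)

// countRunes drains rr and reports how many runes and how many bytes it produced.
func countRunes(rr io.RuneReader) (int, int, error) {
	runes, size := 0, 0
	for {
		_, n, err := rr.ReadRune()
		if err == io.EOF {
			return runes, size, nil // a clean end of input is not an error here
		}
		if err != nil {
			return runes, size, err
		}
		runes++
		size += n
	}
}

// countRunesIn is a hypothetical wrapper that uses bufio.Reader as the io.RuneReader.
func countRunesIn(s string) (int, int, error) {
	return countRunes(bufio.NewReader(strings.NewReader(s)))
}

func main() {
	runes, size, err := countRunesIn("A¡≡🙂")
	fmt.Println(runes, size, err)
}
```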

io.RuneScanner

This is how you can create an io.RuneScanner:

var reader io.Reader

// ...

var runeScanner io.RuneScanner = utf8.NewRuneScanner(reader)

// ...

r, n, err := runeScanner.ReadRune()

// ...

err = runeScanner.UnreadRune()
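
ReadRune followed by UnreadRune is the usual way to peek at the next character without consuming it, which is handy when writing parsers. The sketch below is written against the io.RuneScanner interface only. strings.Reader from the standard library is a stand-in for the value returned by NewRuneScanner, and the names peekRune and peekThenRead are hypothetical.

```go
package main

import (
	"fmt"
	"io"
	"strings"
)

// peekRune returns the next rune from rs without consuming it.
func peekRune(rs io.RuneScanner) (rune, error) {
	r, _, err := rs.ReadRune()
	if err != nil {
		return 0, err
	}
	// Push the rune back so the next ReadRune returns it again.
	if err := rs.UnreadRune(); err != nil {
		return 0, err
	}
	return r, nil
}

// peekThenRead is a hypothetical wrapper: it peeks once, then reads once,
// so the caller can confirm both calls see the same rune.
func peekThenRead(s string) (peeked, read rune, err error) {
	rs := strings.NewReader(s)
	if peeked, err = peekRune(rs); err != nil {
		return 0, 0, err
	}
	if read, _, err = rs.ReadRune(); err != nil {
		return 0, 0, err
	}
	return peeked, read, nil
}

func main() {
	peeked, read, err := peekThenRead("۵x")
	fmt.Println(string(peeked), string(read), err)
}
```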

UTF-8

UTF-8 is a variable-length encoding of Unicode. The encoding of a single Unicode code point is from 1 to 4 bytes long.

Some examples of UTF-8 encoding of Unicode code points are:

UTF-8 byte 1	UTF-8 byte 2	UTF-8 byte 3	UTF-8 byte 4	value	code point	decimal	binary	name
`0b0,1000001`				A	U+0041	65	`0b0000,0000,0100,0001`	LATIN CAPITAL LETTER A
`0b0,1110010`				r	U+0072	114	`0b0000,0000,0111,0010`	LATIN SMALL LETTER R
`0b110,00010`	`0b10,100001`			¡	U+00A1	161	`0b0000,0000,1010,0001`	INVERTED EXCLAMATION MARK
`0b110,11011`	`0b10,110101`			۵	U+06F5	1781	`0b0000,0110,1111,0101`	EXTENDED ARABIC-INDIC DIGIT FIVE
`0b1110,0010`	`0b10,000000`	`0b10,110001`		‱	U+2031	8241	`0b0010,0000,0011,0001`	PER TEN THOUSAND SIGN
`0b1110,0010`	`0b10,001001`	`0b10,100001`		≡	U+2261	8801	`0b0010,0010,0110,0001`	IDENTICAL TO
`0b11110,000`	`0b10,010000`	`0b10,001111`	`0b10,010101`	𐏕	U+000103D5	66517	`0b0001,0000,0011,1101,0101`	OLD PERSIAN NUMBER HUNDRED
`0b11110,000`	`0b10,011111`	`0b10,011001`	`0b10,000010`	🙂	U+0001F642	128578	`0b0001,1111,0110,0100,0010`	SLIGHTLY SMILING FACE
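
The byte values in the table above can be reproduced with a few lines of Go. This sketch relies only on Go's built-in string conversion, which produces UTF-8. The name binaryBytes is hypothetical.

```go
package main

import (
	"fmt"
	"strings"
)

// binaryBytes encodes r as UTF-8 and formats each byte as 8 binary digits.
func binaryBytes(r rune) string {
	encoded := []byte(string(r)) // Go strings hold UTF-8, so this yields the encoding
	parts := make([]string, len(encoded))
	for i, b := range encoded {
		parts[i] = fmt.Sprintf("%08b", b)
	}
	return strings.Join(parts, " ")
}

func main() {
	for _, r := range []rune{'A', 'r', '¡', '۵', '‱', '≡', '𐏕', '🙂'} {
		fmt.Printf("U+%04X\t%d\t%s\n", r, r, binaryBytes(r))
	}
}
```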

UTF-8 Versus ASCII

UTF-8 was (partially) designed to be backwards compatible with 7-bit ASCII.

Thus, all 7-bit ASCII is valid UTF-8.
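
A quick way to see this is to check that every 7-bit ASCII byte decodes, as a one-byte UTF-8 sequence, to the code point with the same value. The sketch uses Go's built-in "unicode/utf8" package for the decoding. The names isASCII and checkAllASCII are hypothetical.

```go
package main

import (
	"fmt"
	"unicode/utf8" // the standard library package
)

// isASCII reports whether every byte in b has its high bit clear.
func isASCII(b []byte) bool {
	for _, c := range b {
		if c >= 0x80 {
			return false
		}
	}
	return true
}

// checkAllASCII reports whether each of the 128 ASCII bytes decodes
// as a one-byte UTF-8 sequence to the code point with the same value.
func checkAllASCII() bool {
	for i := 0; i < 128; i++ {
		r, size := utf8.DecodeRune([]byte{byte(i)})
		if r != rune(i) || size != 1 {
			return false
		}
	}
	return true
}

func main() {
	ascii := []byte("Hello, world!")
	fmt.Println(isASCII(ascii), utf8.Valid(ascii))

	// The converse does not hold: valid UTF-8 need not be ASCII.
	nonASCII := []byte("¡Hola!")
	fmt.Println(isASCII(nonASCII), utf8.Valid(nonASCII))

	fmt.Println(checkAllASCII())
}
```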

UTF-8 Encoding

At least as of 2003, Unicode fits into 21 bits, and thus UTF-8 was designed to support at most 21 bits of information.

This is done as described in the following table:

# of bytes	# bits for code point	1st code point	last code point	byte 1	byte 2	byte 3	byte 4
1	7	U+000000	U+00007F	`0xxxxxxx`
2	11	U+000080	U+0007FF	`110xxxxx`	`10xxxxxx`
3	16	U+000800	U+00FFFF	`1110xxxx`	`10xxxxxx`	`10xxxxxx`
4	21	U+010000	U+10FFFF	`11110xxx`	`10xxxxxx`	`10xxxxxx`	`10xxxxxx`
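
The table above translates fairly directly into code. The following is a minimal sketch of an encoder that follows it, not this package's implementation. The name encodeSketch is hypothetical, and for brevity it does not reject the surrogate range U+D800 to U+DFFF, which a strict encoder would.

```go
package main

import (
	"errors"
	"fmt"
)

// encodeSketch is a hypothetical encoder: it picks the row of the table
// that fits cp, then spreads the code point's bits over the x positions.
func encodeSketch(cp uint32) ([]byte, error) {
	switch {
	case cp <= 0x7F: // 7 bits: 0xxxxxxx
		return []byte{byte(cp)}, nil
	case cp <= 0x7FF: // 11 bits: 110xxxxx 10xxxxxx
		return []byte{
			0xC0 | byte(cp>>6),
			0x80 | byte(cp&0x3F),
		}, nil
	case cp <= 0xFFFF: // 16 bits: 1110xxxx 10xxxxxx 10xxxxxx
		return []byte{
			0xE0 | byte(cp>>12),
			0x80 | byte((cp>>6)&0x3F),
			0x80 | byte(cp&0x3F),
		}, nil
	case cp <= 0x10FFFF: // 21 bits: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
		return []byte{
			0xF0 | byte(cp>>18),
			0x80 | byte((cp>>12)&0x3F),
			0x80 | byte((cp>>6)&0x3F),
			0x80 | byte(cp&0x3F),
		}, nil
	}
	return nil, errors.New("code point beyond U+10FFFF")
}

func main() {
	for _, cp := range []uint32{0x41, 0xA1, 0x2261, 0x1F642} {
		b, err := encodeSketch(cp)
		// Go's own string conversion serves as an independent check.
		same := string(b) == string(rune(cp))
		fmt.Printf("U+%04X  % x  matches built-in: %v  err: %v\n", cp, b, same, err)
	}
}
```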
