Reed-Solomon Erasure Code engine in Go, capable of more than 10 GB/s per core




  1. Reed-Solomon Erasure Code engine in pure Go (based on Intel ISA-L and Klaus Post's reedsolomon)
  2. Fast: more than 10 GB/s per physical core


To get the package, use the standard:

go get


See the associated GoDoc



  1. All architectures are supported
  2. Go 1.11+ (for AVX512)


  1. Coding over GF(2^8)
  2. Primitive polynomial: x^8 + x^4 + x^3 + x^2 + 1 (0x1d)
  3. mathtool/gentbls.go: generates the primitive polynomial and its log table, exp table, multiplication table, inverse table, etc. It is a good place to learn how the Galois field works
  4. mathtool/cntinverse.go: calculates how many inverse matrices there are for a given RS code configuration
  5. A Cauchy matrix is used as the generator matrix
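To make the field arithmetic above concrete, here is a minimal sketch of building the exp/log tables for the primitive polynomial 0x1d (reduction byte of x^8 + x^4 + x^3 + x^2 + 1), in the spirit of mathtool/gentbls.go. The function names are illustrative, not this library's API; with the tables in hand, a field multiplication is two lookups and one addition.

```go
package main

import "fmt"

// exp/log tables for GF(2^8) with primitive polynomial
// x^8 + x^4 + x^3 + x^2 + 1 (reduction byte 0x1d).
var expTbl [512]byte
var logTbl [256]byte

func init() {
	x := byte(1)
	for i := 0; i < 255; i++ {
		expTbl[i] = x
		logTbl[x] = byte(i)
		carry := x&0x80 != 0
		x <<= 1 // multiply by the generator element 2 (i.e. x)
		if carry {
			x ^= 0x1d // reduce modulo the primitive polynomial
		}
	}
	for i := 255; i < 512; i++ {
		expTbl[i] = expTbl[i-255] // duplicate so logTbl[a]+logTbl[b] never wraps
	}
}

// gfMul multiplies two field elements via the tables.
func gfMul(a, b byte) byte {
	if a == 0 || b == 0 {
		return 0
	}
	return expTbl[int(logTbl[a])+int(logTbl[b])]
}

// gfInv returns the multiplicative inverse of a nonzero element.
func gfInv(a byte) byte {
	return expTbl[255-int(logTbl[a])]
}

func main() {
	fmt.Println(gfMul(3, 7))        // (x+1)(x^2+x+1) = x^3+1 = 9
	fmt.Println(gfMul(7, gfInv(7))) // a * a^-1 = 1
}
```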

Why so fast?

These three parts cost the most time:

  1. looking up Galois-field tables
  2. reading/writing memory
  3. calculating the inverse matrix during reconstruction

SIMD solves no.1.

Cache-friendly code helps with no.2 & no.3. In addition, a sync.Map caches inverse matrices, which saves about 1000 ns whenever the same matrix is needed again.
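The inverse-matrix cache idea can be sketched as below: key a sync.Map by the pattern of surviving rows, so reconstructing after the same loss pattern skips the inversion entirely. The names and key encoding here are illustrative assumptions, not this library's actual internals.

```go
package main

import (
	"fmt"
	"sync"
)

// inverseCache maps a survivor-rows bitmap to a precomputed inverse matrix.
var inverseCache sync.Map

// cachedInverse returns the cached matrix for key, or computes and stores it.
func cachedInverse(key uint64, compute func() []byte) []byte {
	if v, ok := inverseCache.Load(key); ok {
		return v.([]byte) // hit: no inversion needed
	}
	m := compute()
	inverseCache.Store(key, m)
	return m
}

func main() {
	calls := 0
	compute := func() []byte {
		calls++ // stands in for the expensive matrix inversion
		return []byte{1, 0, 0, 1}
	}
	cachedInverse(0b1011, compute)
	cachedInverse(0b1011, compute) // same loss pattern: served from cache
	fmt.Println(calls)
}
```

Note that sync.Map fits this access pattern well: entries are written once per loss pattern and then read many times.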


Performance depends mainly on:

  1. CPU instruction extension (AVX512 or AVX2)
  2. number of data/parity vects
  3. unit size of calculation (see rs.go)
  4. size of shards
  5. speed of memory (much time is spent reading/writing memory, :D)
  6. performance of the CPU
  7. the way the library is used (reuse memory)

Keep in mind that benchmark results differ substantially from encoding/decoding in practice.

In benchmark loops, the CPU cache helps a lot. In practice, you must reuse memory to get performance close to the benchmark figures.
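One common way to follow the "reuse memory" advice is to keep shard buffers in a sync.Pool instead of allocating fresh slices on every encode call. A minimal sketch, where the shard size is an arbitrary example rather than a value from this library:

```go
package main

import (
	"fmt"
	"sync"
)

const shardSize = 4096 // example shard size, not a library constant

// shardPool hands out reusable shard buffers.
var shardPool = sync.Pool{
	New: func() interface{} { return make([]byte, shardSize) },
}

func main() {
	buf := shardPool.Get().([]byte)
	// ... fill buf with data and run the encode step on it ...
	shardPool.Put(buf) // hand the buffer back instead of leaving it for GC

	reused := shardPool.Get().([]byte) // likely the same backing array
	fmt.Println(len(reused))
}
```

This keeps hot buffers in cache and avoids GC pressure, which is exactly why real-world throughput approaches the benchmark numbers only when memory is reused.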

Example of performance on my AWS c5d.large (Intel(R) Xeon(R) Platinum 8124M CPU @ 3.00GHz), with DataCnt = 10 and ParityCnt = 4:


| Vector size | AVX512 (MB/s) | AVX2 (MB/s) |
|-------------|---------------|-------------|
| 4KB         | 12775         | 9174        |
| 64KB        | 11618         | 8964        |
| 1MB         | 7918          | 6820        |

Links & Thanks