Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

I found excelize(golang) is 2~3 times slower than js-xlsx #439

Closed
xtutu opened this issue Jul 19, 2019 · 8 comments
Closed

I found excelize(golang) is 2~3 times slower than js-xlsx #439

xtutu opened this issue Jul 19, 2019 · 8 comments

Comments

@xtutu
Copy link

xtutu commented Jul 19, 2019

Platform: Windows 10
go version go1.12.4 windows/amd64
node version v8.11.1


image


Maybe there is some problem with my usage?

code is here:

package main
import (
	"fmt"
	"time"
    "github.com/360EntSecGroup-Skylar/excelize"
)
func main() {
	excelPath := "data.xlsx"

	timeStart := time.Now().UTC().UnixNano() / 1e6
	times := 5
	for i :=0 ; i < times; i++ {
		f, _ := excelize.OpenFile(excelPath)
		_, _ = f.GetRows("main")
	}

	timeFinished := time.Now().UTC().UnixNano() / 1e6
	fmt.Println(timeFinished - timeStart)
}
import * as XLSX from 'xlsx';
function run() {
  let currentMilliSec = Date.now()
  let times = 5

  let excelPath = "data.xlsx"
  for (let i = 0; i < times; i++) {
    var workbook = XLSX.readFile(excelPath);
    let sheet = workbook.Sheets["main"];

    // let ref: string = sheet['!ref'];
    // console.log(ref);
    // let xx = ref.split(':')
    // console.log(xx[0], xx[1]);
    // let start = XLSX.utils.decode_cell(xx[0]);
    // console.log(start.c, start.r, XLSX.utils.encode_cell(start));
    // let end = XLSX.utils.decode_cell(xx[1]);
    // console.log(end.c, end.r, XLSX.utils.encode_cell(end));

    // let starRowIdx = 0
    // for (let r = 0; r <= end.r; r++) {
    //   let cellName = XLSX.utils.encode_cell({ c: 0, r: r });
    //   let cell = sheet[cellName]
    // }
  }

  let after = Date.now()
  console.log("js cost: ", after - currentMilliSec, "ms")
}

run()
@mlh758
Copy link
Contributor

mlh758 commented Jul 20, 2019

Does worksbook.Sheets["main"] actually parse and read each row? GetRows does, it looks like you have the actual reading code commented out on the JS tests there.

@xtutu
Copy link
Author

xtutu commented Jul 22, 2019

Does worksbook.Sheets["main"] actually parse and read each row?
Yes.

Actually

var workbook = XLSX.readFile(excelPath); //  this code is already finished parse.

Because even I uncomment the original code, Time Cost Not Changed !

@mlh758

@mlh758
Copy link
Contributor

mlh758 commented Jul 23, 2019

I’ll have to try the benchmarks out. We’ve had a few passes to improve the performance of the library but there is probably plenty of work that could be done as a community. Using Go benchmarks and the profiler could help highlight the slowest areas.

Also, many Node libraries are hooked up to some pretty efficient C libraries under the hood.

@mlh758
Copy link
Contributor

mlh758 commented Jul 31, 2019

I had a pretty big excel file on hand so I went ahead and tried things out, here are my benchmark files:

package main

import (
	"fmt"
	"time"

	"github.com/360EntSecGroup-Skylar/excelize"
)

func main() {
	excelPath := "data.xlsx"

	timeStart := time.Now()
	times := 5
	for i := 0; i < times; i++ {
		f, _ := excelize.OpenFile(excelPath)
		rows, _ := f.Rows("DTAs")
		for rows.Next() {
			row := rows.Columns()
			for i := range row {
				if i >= 0 {
					continue
				}
			}
		}
	}

	fmt.Println(time.Since(timeStart).Seconds())
}
const XLSX = require('xlsx');

function run() {
  let currentMilliSec = Date.now()
  let times = 5

  let excelPath = "data.xlsx"
  for (let i = 0; i < times; i++) {
    var workbook = XLSX.readFile(excelPath);
    let sheet = workbook.Sheets["DTAs"];
  }

  let after = Date.now()
  console.log("js cost: ", after - currentMilliSec, "ms")
}

run()

The go example ran in 17 seconds, the js example ran in 20 seconds, which puts them at roughly even. If I use GetRows like you did in your benchmarks the Go code runs up to 25 seconds. Seems like Rows is much more efficient to use.

@mlh758
Copy link
Contributor

mlh758 commented Jul 31, 2019

I just noticed I ran the benchmarks against 1.4.1. If I switch the import to "github.com/360EntSecGroup-Skylar/excelize/v2" and run against 2.0.1 the times don't change much though.

@mlh758
Copy link
Contributor

mlh758 commented Jul 31, 2019

For fun I moved this to a benchmark:

package main

import (
	"testing"

	"github.com/360EntSecGroup-Skylar/excelize/v2"
)

const excelPath = "data.xlsx"

func BenchmarkMain(b *testing.B) {
	for i := 0; i < b.N; i++ {
		f, _ := excelize.OpenFile(excelPath)
		rows, _ := f.Rows("DTAs")
		for rows.Next() {
			row, _ := rows.Columns()
			for i := range row {
				if i >= 0 {
					continue
				}
			}
		}
	}
}

Then ran it with the profilers enabled:

go test -bench=BenchmarkMain -benchmem -memprofile memprofile.out -cpuprofile profile.out

CPU profile results:

Showing top 10 nodes out of 177
      flat  flat%   sum%        cum   cum%
     720ms 19.51% 19.51%      720ms 19.51%  runtime.memmove
     290ms  7.86% 27.37%      290ms  7.86%  runtime.(*mspan).refillAllocCache
     230ms  6.23% 33.60%      230ms  6.23%  runtime.(*mspan).init
     160ms  4.34% 37.94%      350ms  9.49%  encoding/xml.(*printer).marshalAttr
     160ms  4.34% 42.28%      160ms  4.34%  runtime.memclrNoHeapPointers
     120ms  3.25% 45.53%     2210ms 59.89%  encoding/xml.(*Decoder).unmarshal
     110ms  2.98% 48.51%      200ms  5.42%  runtime.scanobject
     100ms  2.71% 51.22%     1120ms 30.35%  runtime.mallocgc
      90ms  2.44% 53.66%      150ms  4.07%  encoding/xml.(*Decoder).getc
      80ms  2.17% 55.83%       80ms  2.17%  runtime.bulkBarrierPreWrite

Memory:

Showing top 10 nodes out of 56
      flat  flat%   sum%        cum   cum%
  247.02MB 20.59% 20.59%   252.52MB 21.05%  encoding/xml.(*Decoder).rawToken
  198.97MB 16.59% 37.18%   198.97MB 16.59%  reflect.unsafe_NewArray
  123.88MB 10.33% 47.50%   123.88MB 10.33%  bytes.makeSlice
  106.01MB  8.84% 56.34%   138.01MB 11.50%  encoding/xml.(*printer).marshalAttr
   92.50MB  7.71% 64.05%   345.03MB 28.76%  encoding/xml.(*Decoder).Token
   83.51MB  6.96% 71.01%    83.51MB  6.96%  reflect.Zero
   72.56MB  6.05% 77.06%    72.56MB  6.05%  bytes.Replace
   68.20MB  5.69% 82.74%    84.20MB  7.02%  github.com/360EntSecGroup-Skylar/excelize/v2.checkRow
   64.50MB  5.38% 88.12%   729.01MB 60.77%  encoding/xml.(*Decoder).unmarshal
      33MB  2.75% 90.87%       33MB  2.75%  reflect.packEface

It looks like the vast majority of our time is spent the standard library's xml decoder.

@mlh758
Copy link
Contributor

mlh758 commented Jul 31, 2019

cpu

For visual reference

@mlh758
Copy link
Contributor

mlh758 commented Aug 1, 2019

@xtutu sent me the data file he was using for testing. When I use Rows() on it I get a segfault from rows.go line 106. Not sure what's going on there yet, but we almost certainly have a bug.

Edit: Nope, typo on my part. Wasn't checking the errors since it was a benchmark. Library is fine.

@mlh758 mlh758 mentioned this issue Aug 4, 2019
10 tasks
@xuri xuri closed this as completed in ac91ca0 Aug 5, 2019
xuri added a commit that referenced this issue Aug 5, 2019
nullfy pushed a commit to nullfy/excelize that referenced this issue Oct 23, 2020
We were parsing the whole sheet twice since the
sheet reader already reads in all the rows.

getTotalRowsCols function is unused after these changes
so it has been deleted as well.

Closes qax-os#439
nullfy pushed a commit to nullfy/excelize that referenced this issue Oct 23, 2020
nullfy pushed a commit to nullfy/excelize that referenced this issue Oct 23, 2020
Instead of re-encoding the full sheet to change the namespaces
in the encoded bytes, read the sheet again and do the byte
replacements there.

In this case, file access ends up being more performant than
marshaling the sheet back to XML.

In the SharedStrings test, ensure the strings are actually read.

Fix qax-os#439
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants