optimize calls to number_of_columns() in SheetReader._iterate_columns() #25

ashkulz · 2016-12-20T11:40:26Z

This can be costly for certain implementations e.g. pyexcel-xlsx (which internally uses openpyxl) where it is computed on-the-fly.

It isn't possible to make the fix in pyexcel-xlsx as there is the possibility of it changing due to modifications via the API. One does have to make the (rather reasonable) assumption that the number of columns doesn't change during execution of to_array.

codecov-io · 2016-12-20T11:55:45Z

Current coverage is 99.00% (diff: 100%)

Merging #25 into master will increase coverage by <.01%

@@             master        #25   diff @@
==========================================
  Files            31         31          
  Lines          2517       2518     +1   
  Methods           0          0          
  Messages          0          0          
  Branches          0          0          
==========================================
+ Hits           2492       2493     +1   
  Misses           25         25          
  Partials          0          0

Powered by Codecov. Last update 7bcfb4e...f95361b

This can be costly for certain implementations e.g. pyexcel-xlsx (which internally uses openpyxl) where it is computed on-the-fly.

chfw · 2016-12-20T13:14:32Z

Your pull request is appreciated. However, I failed to see the logical difference although the file were changed.

At the moment, pyexcel-xlsx use max_column to get the number of columns. Are you suggesting a different function to be used?

In order to support columns computed on the fly, here is an example in ods where _iterate_columns was overriden.

I think the interfaces in Sheet.py could be revised further to make it clearer. Do you have any suggestions?

ashkulz · 2016-12-20T13:30:53Z

When pyexcel-xlsx calls max_column, it triggers a dynamic computation which can take a lot of time ... in case you have 10000 rows, it gets called 10000 times instead of 1, which is quite wasteful. All the change does is to pre-compute it so that it doesn't have to be done every time.

In case _iterate_columns is re-implemented in a derived class, then the ncols parameter can be ignored.

ashkulz · 2016-12-20T13:37:29Z

As a comparison, in 0.2.x the commit advarisk/pyexcel-io@eb3bfbc87aee20ac72b39f3971650c4d34608914 feels very clear and logical (which is what I'm actually using in production). I don't think that the API with _iterate_columns is perfect; but don't really have an alternate suggestion ☹️

chfw · 2016-12-20T15:56:44Z

I see what you meant now. max_column is computed always hence your change will cache the value which could improve performance.

Let me do a bit research around it.

ashkulz · 2016-12-20T16:06:07Z

In case you're open to releasing 0.2.4.1 with the above change, I'll submit a PR for that too ... but there is no 0.2.x branch, which is what I've done in the fork.

chfw · 2016-12-20T17:56:57Z

have create branch 0.2.x and will release it while I review all plugins and make necessary changes in 0.3.0 and its plugins.

chfw · 2016-12-20T23:44:07Z

please evaluate latest pyexcel-xlsx for performance improvement.

optimize calls to number_of_columns() in SheetReader._iterate_columns()

f95361b

This can be costly for certain implementations e.g. pyexcel-xlsx (which internally uses openpyxl) where it is computed on-the-fly.

chfw mentioned this pull request Dec 20, 2016

performance concern over sheet.max_column pyexcel/pyexcel-xlsx#13

Closed

ashkulz mentioned this pull request Dec 20, 2016

optimize calls to number_of_columns() in SheetReader.to_array() #26

Merged

chfw closed this Dec 20, 2016

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

optimize calls to number_of_columns() in SheetReader._iterate_columns() #25

optimize calls to number_of_columns() in SheetReader._iterate_columns() #25

ashkulz commented Dec 20, 2016

codecov-io commented Dec 20, 2016 •

edited

Loading

chfw commented Dec 20, 2016

ashkulz commented Dec 20, 2016

ashkulz commented Dec 20, 2016

chfw commented Dec 20, 2016

ashkulz commented Dec 20, 2016

chfw commented Dec 20, 2016

chfw commented Dec 20, 2016

optimize calls to number_of_columns() in SheetReader._iterate_columns() #25

optimize calls to number_of_columns() in SheetReader._iterate_columns() #25

Conversation

ashkulz commented Dec 20, 2016

codecov-io commented Dec 20, 2016 • edited Loading

Current coverage is 99.00% (diff: 100%)

chfw commented Dec 20, 2016

ashkulz commented Dec 20, 2016

ashkulz commented Dec 20, 2016

chfw commented Dec 20, 2016

ashkulz commented Dec 20, 2016

chfw commented Dec 20, 2016

chfw commented Dec 20, 2016

codecov-io commented Dec 20, 2016 •

edited

Loading