Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

optimize calls to number_of_columns() in SheetReader._iterate_columns() #25

Closed
wants to merge 1 commit into from
Closed

Conversation

ashkulz
Copy link

@ashkulz ashkulz commented Dec 20, 2016

This can be costly for certain implementations e.g. pyexcel-xlsx (which internally uses openpyxl) where it is computed on-the-fly.

It isn't possible to make the fix in pyexcel-xlsx as there is the possibility of it changing due to modifications via the API. One does have to make the (rather reasonable) assumption that the number of columns doesn't change during execution of to_array.

@codecov-io
Copy link

codecov-io commented Dec 20, 2016

Current coverage is 99.00% (diff: 100%)

Merging #25 into master will increase coverage by <.01%

@@             master        #25   diff @@
==========================================
  Files            31         31          
  Lines          2517       2518     +1   
  Methods           0          0          
  Messages          0          0          
  Branches          0          0          
==========================================
+ Hits           2492       2493     +1   
  Misses           25         25          
  Partials          0          0          

Powered by Codecov. Last update 7bcfb4e...f95361b

This can be costly for certain implementations e.g. pyexcel-xlsx (which
internally uses openpyxl) where it is computed on-the-fly.
@chfw
Copy link
Member

chfw commented Dec 20, 2016

Your pull request is appreciated. However, I failed to see the logical difference although the file were changed.

At the moment, pyexcel-xlsx use max_column to get the number of columns. Are you suggesting a different function to be used?

In order to support columns computed on the fly, here is an example in ods where _iterate_columns was overriden.

I think the interfaces in Sheet.py could be revised further to make it clearer. Do you have any suggestions?

@ashkulz
Copy link
Author

ashkulz commented Dec 20, 2016

When pyexcel-xlsx calls max_column, it triggers a dynamic computation which can take a lot of time ... in case you have 10000 rows, it gets called 10000 times instead of 1, which is quite wasteful. All the change does is to pre-compute it so that it doesn't have to be done every time.

In case _iterate_columns is re-implemented in a derived class, then the ncols parameter can be ignored.

@ashkulz
Copy link
Author

ashkulz commented Dec 20, 2016

As a comparison, in 0.2.x the commit advarisk/pyexcel-io@eb3bfbc87aee20ac72b39f3971650c4d34608914 feels very clear and logical (which is what I'm actually using in production). I don't think that the API with _iterate_columns is perfect; but don't really have an alternate suggestion ☹️

@chfw
Copy link
Member

chfw commented Dec 20, 2016

I see what you meant now. max_column is computed always hence your change will cache the value which could improve performance.

Let me do a bit research around it.

@ashkulz
Copy link
Author

ashkulz commented Dec 20, 2016

In case you're open to releasing 0.2.4.1 with the above change, I'll submit a PR for that too ... but there is no 0.2.x branch, which is what I've done in the fork.

@chfw
Copy link
Member

chfw commented Dec 20, 2016

have create branch 0.2.x and will release it while I review all plugins and make necessary changes in 0.3.0 and its plugins.

@chfw
Copy link
Member

chfw commented Dec 20, 2016

please evaluate latest pyexcel-xlsx for performance improvement.

@chfw chfw closed this Dec 20, 2016
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants