Add indices #345
Open
jakirkham wants to merge 2 commits into pytoolz:master from jakirkham:add_indices
Not sure what you had in mind for an example, but does this help? If not, do you have some other ideas of what you might like to see?
I guess I am not used to using index access inside for loops. Normally people just loop over the values directly, and in numpy you don't want to be doing a bunch of scalar accesses like this. To help me understand, can you explain some real code you have written that uses this?
I'll try. 😄
So in some cases I have binary data that I need to split up into smaller blocks, work on in separate processes, and potentially combine results from at different stages. This data is normally on disk and may be a single file or split across multiple files. In these cases, I need an index for each block that I will work with. While I suppose one could compute a single index for each block, it makes the code much harder to reason about, and it is already somewhat complex code (e.g. it adds halos to data blocks, slices out halos afterwards, etc.). Being able to have indices like this makes it easier to reason about these cases and handle arbitrary dimensions. Not to mention that stitching the pieces together becomes much more straightforward.
Hopefully that makes sense.
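To make the block use case above concrete, here is a minimal sketch of one way per-block indices for an arbitrary number of dimensions might be turned into slices. The helper name `block_slices` and its arguments are hypothetical and are not the function proposed in this PR; halo handling is left out.

```python
from itertools import product


def block_slices(shape, block_shape):
    """Yield (block_index, slices) for every block tiling an array of `shape`.

    Hypothetical helper for illustration only; not part of toolz.
    """
    # Number of blocks along each axis, rounding up so partial edge blocks are kept.
    counts = [-(-n // b) for n, b in zip(shape, block_shape)]
    # One index tuple per block, covering every axis at once.
    for block_index in product(*(range(c) for c in counts)):
        slices = tuple(
            slice(i * b, min((i + 1) * b, n))
            for i, b, n in zip(block_index, block_shape, shape)
        )
        yield block_index, slices


blocks = list(block_slices((4, 5), (2, 3)))
```

Each block keeps one index per axis rather than a single flattened index, so the same tuple can be used both to cut a block out and to place its result back when stitching.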
I think I understand, thanks for clarifying! Looking through some of my numpy code I see there are places where I could have used something like this; however, I realized that this is in numpy as `numpy.indices`. I wonder if I would want this when working with normal lists/tuples where numpy was not available. If we are going the route of allowing more functions into toolz but selectively curating the top-level namespace, then I would be +1 on adding this, but -0 on putting it in the top level. This is because I think it is not immediately obvious when this is the right function to use over just standard looping or slice indexing, so it is more "advanced" than other functions in toolz.
Sounds reasonable. I'm ok with not including it in the main namespace.
Yeah, `numpy.indices` is pretty different from this. Instead of doing something like this, it creates a massive array such that each index combination is specified. This ends up being pretty expensive for large arrays. We can actually do much better if we note that much of this information is redundant and we are willing to part with having it in one big array. For most use cases, these are safe assumptions. Following them we get something like this. For decent-sized arrays, it is not unreasonable to see an order of magnitude, or potentially a few orders of magnitude, speedup by following this strategy.*
Even if we do need a full array with all combinations like `numpy.indices`, we can pack the result from the xnumpy function linked above into an array and still cut down the creation time to roughly half.*
* My benchmarking is still rather primitive at this point, but it does seem reliable thus far.
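As a rough illustration of the redundancy described above, the sketch below builds one small index array per axis and compares it with the dense result of `numpy.indices`. The helper `sparse_indices` is hypothetical and is not the linked xnumpy function; no timings are claimed here.

```python
import numpy as np


def sparse_indices(shape):
    """Return one small index array per axis, shaped so they broadcast together.

    Hypothetical helper for illustration only; not the linked function.
    """
    ndim = len(shape)
    grids = []
    for axis, n in enumerate(shape):
        # Length n along `axis`, length 1 along every other axis.
        view_shape = [1] * ndim
        view_shape[axis] = n
        grids.append(np.arange(n).reshape(view_shape))
    return tuple(grids)


shape = (3, 4)
dense = np.indices(shape)       # shape (2, 3, 4): 24 stored integers
sparse = sparse_indices(shape)  # shapes (3, 1) and (1, 4): 7 stored integers

# Broadcasting the per-axis pieces recovers the dense array when it is needed.
rebuilt = np.stack([np.broadcast_to(g, shape) for g in sparse])
```

The dense form stores `ndim * prod(shape)` integers while the per-axis form stores only `sum(shape)`, which is where the savings come from. Recent NumPy releases also accept `sparse=True` in `numpy.indices`, which returns a similar tuple of broadcastable arrays.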