37 changes: 32 additions & 5 deletions toolz/itertoolz.py
@@ -9,11 +9,12 @@
from toolz.utils import no_default


-__all__ = ('remove', 'accumulate', 'groupby', 'merge_sorted', 'interleave',
-           'unique', 'isiterable', 'isdistinct', 'take', 'drop', 'take_nth',
-           'first', 'second', 'nth', 'last', 'get', 'concat', 'concatv',
-           'mapcat', 'cons', 'interpose', 'frequencies', 'reduceby', 'iterate',
-           'sliding_window', 'partition', 'partition_all', 'count', 'pluck',
+__all__ = ('remove', 'accumulate', 'groupby', 'indices', 'merge_sorted',
+           'interleave', 'unique', 'isiterable', 'isdistinct', 'take',
+           'drop', 'take_nth', 'first', 'second', 'nth', 'last', 'get',
+           'concat', 'concatv', 'mapcat', 'cons', 'interpose',
+           'frequencies', 'reduceby', 'iterate', 'sliding_window',
+           'partition', 'partition_all', 'count', 'pluck',
            'join', 'tail', 'diff', 'topk', 'peek', 'random_sample')


@@ -97,6 +98,32 @@ def groupby(key, seq):
     return rv


+def indices(*sizes):
+    """ Iterates over a length/shape.
+
+    >>> list(indices(3, 2))
+    [(0, 0), (0, 1), (1, 0), (1, 1), (2, 0), (2, 1)]
+
+    This can help nicely index an array.
+
+    >>> l = [[1, 2],
+    ...      [3, 4],
+    ...      [5, 6]]
+
+    >>> for i, j in indices(3, 2):
+    ...     print("l[%i][%i] = %i" % (i, j, l[i][j]))
+    l[0][0] = 1
+    l[0][1] = 2
+    l[1][0] = 3
+    l[1][1] = 4
+    l[2][0] = 5
+    l[2][1] = 6
Contributor Author

Not sure what you had in mind for an example, but does this help? If not, do you have some other ideas of what you might like to see?

Contributor

I guess I am not used to using index access inside for loops. Normally people just loop over the values directly, and in numpy you don't want to be doing a bunch of scalar accesses like this. To help me understand, can you describe some real code you have written that uses this?

Contributor Author

I'll try. 😄

So in some cases I have binary data that I need to split up into smaller blocks, work on in separate processes, and potentially combine results from at different stages. This data is normally on disk and may be a single file or split across multiple files. In these cases, I need an index for each block that I will work with. While I suppose one could compute a single index for each block, that makes the code much harder to reason about, and it is already somewhat complex code (e.g. it adds halos to data blocks, slices out halos afterwards, etc.). Being able to have indices like this makes it easier to reason about these cases and handle arbitrary dimensions. Not to mention, stitching the pieces together becomes much more straightforward.

Hopefully that makes sense.
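
To make the block use case above concrete, here is a minimal sketch, not the author's actual code. The `block_slices` helper and the shapes are hypothetical and chosen only for illustration; `indices` is the one-liner from this diff.

```python
import itertools


def indices(*sizes):
    # Same one-liner as the function proposed in this diff.
    return itertools.product(*map(range, sizes))


def block_slices(shape, block_shape):
    """Yield (block_index, slices) for every block tiling `shape`.

    Hypothetical helper for illustration only; it is not part of toolz.
    """
    # Number of blocks along each axis, rounding up so a ragged
    # final block is still covered.
    counts = [-(-n // b) for n, b in zip(shape, block_shape)]
    for idx in indices(*counts):
        yield idx, tuple(
            slice(i * b, min((i + 1) * b, n))
            for i, b, n in zip(idx, block_shape, shape)
        )


# A 5 x 4 array split into 2 x 2 blocks gives a 3 x 2 grid of blocks.
blocks = list(block_slices((5, 4), (2, 2)))
```

Each block keeps an N-dimensional index such as `(2, 1)`, so the same code handles any number of dimensions and the block's position is available when stitching results back together.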

Contributor

I think I understand, thanks for clarifying! Looking through some of my numpy code, I see there are places where I could have used something like this; however, I realized that this is in numpy as numpy.indices. I wonder if I would want this when working with normal lists/tuples where numpy was not available. If we are going the route of allowing more functions into toolz but selectively curating the top-level namespace, then I would be +1 on adding this, but -0 on putting it in the top level. This is because I think it is not immediately obvious when this is the right function to use over just standard looping or slice indexing, so it is more "advanced" than other functions in toolz.

Contributor Author

Sounds reasonable. I'm ok with not including it in the main namespace.


Yeah, numpy.indices is pretty different from this. Instead of yielding index tuples one at a time, it creates a massive array in which every index combination is spelled out. This ends up being pretty expensive for large arrays.
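
As a rough illustration of the difference in cost, the sketch below uses only the standard library. The shape is an arbitrary assumption, and the dense-grid size is computed by arithmetic rather than by allocating anything.

```python
import itertools


def indices(*sizes):
    # Same one-liner as the function proposed in this diff.
    return itertools.product(*map(range, sizes))


shape = (1000, 1000, 1000)

# The iterator yields tuples on demand. itertools.product copies each
# range into a tuple up front, so the setup cost grows with the sum of
# the sizes, not with their product.
first_three = list(itertools.islice(indices(*shape), 3))

# A dense grid of the kind numpy.indices returns has shape
# (len(shape),) + shape, i.e. one integer per axis per grid point.
dense_entries = len(shape)
for n in shape:
    dense_entries *= n
```

Here `first_three` is available immediately, while a dense grid for the same shape would need `dense_entries` (three billion) integers.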

We can actually do much better if we note that much of this information is redundant and we are willing to part with having it in one big array. For most use cases, these are safe assumptions. Following them, we get something like this. For decent-sized arrays, it is not unreasonable to see a speedup of an order of magnitude, or potentially a few orders of magnitude, by following this strategy.*

Even if we do need a full array with all combinations like numpy.indices, we can pack the result from the xnumpy function linked above into an array and still cut the creation time roughly in half.*

* My benchmarking is still rather primitive at this point, but it does seem reliable thus far.


"""

+    return itertools.product(*map(range, sizes))


 def merge_sorted(*seqs, **kwargs):
     """ Merge and sort a collection of sorted collections

27 changes: 26 additions & 1 deletion toolz/tests/test_itertoolz.py
@@ -4,7 +4,7 @@
from functools import partial
from random import Random
from pickle import dumps, loads
-from toolz.itertoolz import (remove, groupby, merge_sorted,
+from toolz.itertoolz import (remove, groupby, indices, merge_sorted,
                              concat, concatv, interleave, unique,
                              isiterable, getter,
                              mapcat, isdistinct, first, second,
@@ -52,6 +52,31 @@ def test_groupby():
     assert groupby(iseven, [1, 2, 3, 4]) == {True: [2, 4], False: [1, 3]}


+def test_indices():
+    assert list(indices(0)) == []
+    assert list(indices(0, 5)) == []
+    assert list(indices(5, 0)) == []
+
+    assert list(indices(5)) == [(0,),
+                                (1,),
+                                (2,),
+                                (3,),
+                                (4,)]
+
+    assert list(indices(1, 5)) == [(0, 0,),
+                                   (0, 1,),
+                                   (0, 2,),
+                                   (0, 3,),
+                                   (0, 4,)]
+
+    assert list(indices(3, 2)) == [(0, 0),
+                                   (0, 1),
+                                   (1, 0),
+                                   (1, 1),
+                                   (2, 0),
+                                   (2, 1)]
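
Two behaviours the tests above do not exercise follow directly from `itertools.product` and may be worth knowing. The sketch below is a standalone check, not part of the PR: calling `indices` with no sizes yields a single empty tuple, and the result is a one-pass iterator.

```python
import itertools


def indices(*sizes):
    # Same one-liner as the function proposed in this diff.
    return itertools.product(*map(range, sizes))


# With no sizes, itertools.product() yields exactly one empty tuple,
# which matches the single index of a zero-dimensional shape.
no_sizes = list(indices())

# The result is an iterator, so it can only be consumed once.
it = indices(2)
first_pass = list(it)
second_pass = list(it)
```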


 def test_groupby_non_callable():
     assert groupby(0, [(1, 2), (1, 3), (2, 2), (2, 4)]) == \
         {1: [(1, 2), (1, 3)],