Add a more memory-efficient RangeIndex-sort of thing to avoid large arange(N) indexes in some cases #939

Closed
wesm opened this issue Mar 18, 2012 · 13 comments · Fixed by #11892
Labels
API Design · Enhancement · Indexing (related to indexing on series/frames, not to indexes themselves) · Performance (memory or execution speed performance)

Comments

@wesm
Member

wesm commented Mar 18, 2012

No description provided.

@hayd
Contributor

hayd commented Sep 2, 2014

Here's @jtratner's tree for this https://github.com/jtratner/pandas/tree/add-range-index (based off Wes').

(I keep struggling to find it. Perhaps I'll rebase and PR, would be interesting to experiment with this.)

@immerrr
Contributor

immerrr commented Sep 2, 2014

And there's likely some code that can be salvaged from BlockPlacement class I've added when refactoring block managers.

@jreback jreback modified the milestones: 0.16, 0.15.0 Sep 23, 2014
@jreback jreback modified the milestones: 0.16, 0.15.1 Oct 7, 2014
@jreback jreback modified the milestones: 0.16.0, Next Major Release Mar 6, 2015
@ARF1

ARF1 commented Apr 20, 2015

I am interested in RangeIndexes as well. Does anybody know what the state of @jtratner's tree is? Is there anything I can do to help push this towards a PR?

@jreback
Contributor

jreback commented Apr 20, 2015

it's in a reasonable state
would welcome an update PR for it

@hayd
Contributor

hayd commented Apr 20, 2015

I linked to it above, it's here https://github.com/jtratner/pandas/tree/add-range-index

IIRC it ought to rebase pretty cleanly (it's mainly new files). +1 to a resurrection!

@ARF1

ARF1 commented Apr 21, 2015

Ok, I rebased jtratner's tree: https://github.com/ARF1/pandas/tree/range_index

Of course it was too much to hope it would pass all the tests. One can always dream...

Having not worked on pandas internals before I am probably not the best person to move this forward efficiently. If somebody else feels like they are better suited I would be happy to contribute individual fixes to their tree.

If there are no takers, I will give it a shot myself. Though I will probably need a fair amount of hand-holding to get started.

Running the already written tests, it appears the Index api has changed. Is that possible? Can anybody point me to the PR / issue to help me understand how to adapt the code appropriately?

pandas/core/range.py#L61-L76 has the following segment in its __init__() which seems no longer possible:

self = np.array([], dtype='int64').view(RangeIndex)
...
self.left = left

As I understand this, RangeIndex being a subclass of Index used to be a ndarray subclass but is no longer, right?

@jreback
Contributor

jreback commented Apr 21, 2015

Yes, Index has not been a subclass of ndarray since 0.15.0. You can look at the recently merged CategoricalIndex for the API.
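
For readers following along, the point above can be checked with a short sketch. It assumes a pandas release from 0.15.0 onwards; the variable names are illustrative only and not taken from the branch under discussion.

```python
import numpy as np
import pandas as pd

idx = pd.Index([1, 2, 3])

# Since 0.15.0 an Index wraps its data rather than inheriting from ndarray,
# so the old `.view(RangeIndex)` trick on a raw array no longer applies.
is_ndarray_subclass = isinstance(idx, np.ndarray)

# The underlying data can still be obtained as an ndarray when needed.
values = np.asarray(idx)

print(is_ndarray_subclass, type(values).__name__)
```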

@jorisvandenbossche
Member

@ARF1 The easiest thing to do is probably to just open a pull request based on your branch. Then it is easier to comment on the code and give feedback.

@ARF1

ARF1 commented Apr 21, 2015

@jorisvandenbossche Ok, I did not want to "pollute" the PR list with unfinished code. The rebased tree is available as PR #9961.

@jreback Thanks. I will take a look and see if I can find my way around...

@jreback jreback modified the milestones: 0.17.0, Next Major Release Apr 24, 2015
@jreback jreback modified the milestones: Next Major Release, 0.17.0 Aug 15, 2015
@jreback jreback mentioned this issue Dec 23, 2015
2 tasks
jreback pushed a commit to jreback/pandas that referenced this issue Dec 24, 2015
`RangeIndex(1, 10, 2)` is a memory saving alternative to
`Index(np.arange(1, 10, 2))`; cf. pandas-dev#939.

This re-implementation is compatible with the current `Index()` api and is a
drop-in replacement for `Int64Index()`. It automatically converts to
Int64Index() when required by operations.

At present the type is preserved for only a small number of operations
(e.g. slicing and inner, left and right joins). Most other operations
trigger creation of an equivalent Int64Index (or at least an equivalent numpy
array) and fall back to its implementation.
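
A minimal sketch of the memory saving and of type preservation under slicing, assuming a reasonably recent pandas release. The size `n` and the variable names are arbitrary choices for illustration, not part of the commit.

```python
import numpy as np
import pandas as pd

n = 1_000_000  # arbitrary size, chosen only for illustration

lazy = pd.RangeIndex(0, n, 1)
dense = pd.Index(np.arange(n, dtype=np.int64))

# Reported footprint: the range stores start/stop/step, the dense index
# stores one 8-byte integer per label.
lazy_bytes = lazy.nbytes
dense_bytes = dense.nbytes

# Both indexes carry the same labels.
same_labels = lazy.equals(dense)

# Slicing is one of the operations that keeps the RangeIndex type.
sliced = lazy[10:20]

print(lazy_bytes, dense_bytes, same_labels, type(sliced).__name__)
```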

This PR also extends the functionality of the `Index()` constructor to allow
creation of `RangeIndexes()` with
```
Index(20)
Index(2, 20)
Index(0, 20, 2)
```
in analogy to
```
range(20)
range(2, 20)
range(0, 20, 2)
```
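
The analogy with `range` can be sketched using the explicit `RangeIndex` constructor. This sketch deliberately avoids the `Index(...)` shorthand shown above and makes no assumption about whether that shorthand is available in any given release.

```python
import pandas as pd

# Argument tuples accepted by both range() and RangeIndex()
cases = [(20,), (2, 20), (0, 20, 2)]

# For each case, compare the labels of the RangeIndex with the built-in range.
matches = [list(pd.RangeIndex(*args)) == list(range(*args)) for args in cases]

print(matches)
```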

restore Index() fastpath precedence

Various fixes suggested by @jreback and @shoyer

Cache a private Int64Index object the first time it or its values are required.
Restore Index(5) as error. Restore its test. Allow Index(0, 5) and Index(0, 5, 1).
Make RangeIndex immutable. See start, stop, step properties.
In test_constructor(): check class, attributes (possibly including dtype).
In test_copy(): check that copy is not identical (but equal) to the existing.
In test_duplicates(): Assert is_unique and has_duplicates return correct values.
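
The properties these tests check can be sketched roughly as follows. This is not the test code from the commit; it is an illustration against a recent pandas release, with illustrative variable names.

```python
import pandas as pd

idx = pd.RangeIndex(0, 5, 1)

# A copy is a distinct object that still compares equal.
dup = idx.copy()
copy_is_new_object = dup is not idx
copy_is_equal = dup.equals(idx)

# A range with a non-zero step can never contain repeated labels.
unique = idx.is_unique
has_dups = idx.has_duplicates

# start/stop/step are read-only properties, so assignment is rejected.
try:
    idx.start = 10
    start_is_read_only = False
except AttributeError:
    start_is_read_only = True

print(copy_is_new_object, copy_is_equal, unique, has_dups, start_is_read_only)
```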

fix slicing

fix view

Set RangeIndex as default index
* enh: set RangeIndex as default index
* fix: pandas.io.packers: encode() and decode() for RangeIndex
* enh: array argument pass-through
* fix: reindex
* fix: use _default_index() in pandas.core.frame.extract_index()
* fix: pandas.core.index.Index._is()
* fix: add RangeIndex to ABCIndexClass
* fix: use _default_index() in _get_names_from_index()
* fix: pytables tests
* fix: MultiIndex.get_level_values()
* fix: RangeIndex._shallow_copy()
* fix: null-size RangeIndex equals() comparison
* enh: make RangeIndex.is_unique immutable
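
The effect of making RangeIndex the default can be observed from user code. A small sketch, assuming a pandas release that includes this change; the data is made up for illustration.

```python
import pandas as pd

# No index is passed, so pandas has to supply a default one.
df = pd.DataFrame({"a": [10, 20, 30]})
ser = pd.Series([1.5, 2.5])

# Dropping a custom index also falls back to the default index.
reset = df.set_index("a").reset_index(drop=True)

print(type(df.index).__name__, type(ser.index).__name__, type(reset.index).__name__)
```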

enh: various performance optimizations

 * optimize argsort()
 * optimize tolist()
 * comment clean-up
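
Why `argsort()` and `tolist()` lend themselves to optimization: for a range, the sort order is known from the sign of the step alone, and the labels can be produced directly from the range. The sketch below only checks the observable results against NumPy and does not assert how pandas computes them internally.

```python
import numpy as np
import pandas as pd

ascending = pd.RangeIndex(0, 10, 2)    # labels 0, 2, 4, 6, 8
descending = pd.RangeIndex(10, 0, -2)  # labels 10, 8, 6, 4, 2

# Positive step: already sorted. Negative step: sorted in reverse.
asc_order = ascending.argsort()
desc_order = descending.argsort()

# The labels as a plain Python list.
as_list = ascending.tolist()

print(asc_order, desc_order, as_list)
```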
jreback pushed a commit to jreback/pandas that referenced this issue Jan 13, 2016
jreback pushed a commit to jreback/pandas that referenced this issue Jan 16, 2016