New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add implementation for DOK. #85

Merged
merged 12 commits into from Jan 22, 2018

Conversation

Projects
None yet
2 participants
@hameerabbasi
Collaborator

hameerabbasi commented Jan 21, 2018

I've made a small DOK implementation. So far, it supports:

  • Setting items with integers and slices, including setting with arrays.
  • Getting single items only.
  • Converting to/from COO with COO(DOK)/DOK(COO)
  • Has a todense() for debugging.

I haven't added tests or docs (yet) or tested performance, but it seems to work fine.

Edit: xref #15

@hameerabbasi

This comment has been minimized.

Collaborator

hameerabbasi commented Jan 21, 2018

You can do, for example,

x[1:3, 1:3] = 5
x[2, 2] = 3
x[1:3, 1:3] = [4, 6]
x[1:3, 1:3] = [[4, 5], [6, 7]]

And all of them should work. :)

@mrocklin

Some small comments. This could obviously use some tests.

item = normalize_index(item, self.shape)
for i in item:
if not isinstance(i, Integral):

This comment has been minimized.

@mrocklin

mrocklin Jan 21, 2018

Collaborator

perhaps if not all(isinstance(i, Integral) for i in item):

' when getting an item.')
if not len(item) == self.ndim:
raise IndexError('Can only get single elements.')

This comment has been minimized.

@mrocklin

mrocklin Jan 21, 2018

Collaborator

Might want to include the item in the error message with "expected key of length %d, got %s" % (self.ndim, str(item))

def nnz(self):
return len(self.dict)
def __getitem__(self, item):

This comment has been minimized.

@mrocklin

mrocklin Jan 21, 2018

Collaborator

I would use key or index for this variable name

if not len(item) == self.ndim:
raise IndexError('Can only get single elements.')
if item in self.dict.keys():

This comment has been minimized.

@mrocklin

mrocklin Jan 21, 2018

Collaborator

You want if item in self.dict. This is much faster. The .keys() result is either a list (python 2) or iterator (python 3), each of which have linear time membership checks.

result = np.zeros(self.shape, dtype=self.dtype)
for c, d in six.iteritems(self.dict):
result[c] = d

This comment has been minimized.

@mrocklin

mrocklin Jan 21, 2018

Collaborator

You might be able to do something like result[self.dict.keys()] = self.dict.values() here

This comment has been minimized.

@hameerabbasi

hameerabbasi Jan 21, 2018

Collaborator

This works in Python 2, but not 3 (where these are iterators). In addition, it'll take up more memory for the resulting lists even if we were to do this.

@hameerabbasi

This comment has been minimized.

Collaborator

hameerabbasi commented Jan 21, 2018

Tests added, and a few bugs caught and fixed. :)

Will do docs soon.

@hameerabbasi

This comment has been minimized.

Collaborator

hameerabbasi commented Jan 22, 2018

I think this is ready for a final review and merge.

Keep in mind we can keep adding features later. Right now, we just need an implementation. :)

You can get started by defining the shape (and optionally, datatype) of the
:obj:`DOK` array. If you do not specify a dtype, it is inferred from the first
element.

This comment has been minimized.

@mrocklin

mrocklin Jan 22, 2018

Collaborator

This seems somewhat dangerous to me. Is this what scipy.sparse does?

I thought that this was a fail case:

In [1]: from sparse import DOK, COO

In [2]: dok = DOK((5, 5))

In [3]: dok[1, 1] = 5

In [4]: dok.dtype
Out[4]: dtype('int64')

In [5]: dok[2, 2] = 3.2

In [6]: dok
Out[6]: <DOK: shape=(5, 5), dtype=int64, nnz=2>

In [7]: dok.todense()
Out[7]: 
array([[0, 0, 0, 0, 0],
       [0, 5, 0, 0, 0],
       [0, 0, 3, 0, 0],
       [0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0]])

But then I tried the same thing with numpy and got the same behavior :/

In [8]: import numpy as np

In [9]: x = np.array([1, 2, 3])

In [10]: x
Out[10]: array([1, 2, 3])

In [11]: x.dtype
Out[11]: dtype('int64')

In [12]: x[2] = 3.2

In [13]: x
Out[13]: array([1, 2, 3])

This comment has been minimized.

@hameerabbasi

hameerabbasi Jan 22, 2018

Collaborator

I'm mimicking Numpy semantics here. Numpy doesn't change the data type of the whole array for just one element, it has to be decided beforehand.

This comment has been minimized.

@mrocklin

mrocklin Jan 22, 2018

Collaborator

Yeah, I agree with that behavior. My concern is that they might not realize that they are selecting a dtype, and then have the rest of their data silently truncated. Consider the following example:

dok = DOK((5, 5))

data = {(1, 1): 1, (2, 2), 2.5, (3, 3): 3.5}
for k, v in data.items():
    dok[k] = v

The first entry defines the dtype as int, but then the subsequent values are silently truncated.

It looks like scipy defaults dtype to np.float64. We might want to do the same.

In [1]: from scipy.sparse import dok_matrix

In [2]: dok_matrix((5, 5))
Out[2]: 
<5x5 sparse matrix of type '<class 'numpy.float64'>'
	with 0 stored elements in Dictionary Of Keys format>

This comment has been minimized.

@hameerabbasi

hameerabbasi Jan 22, 2018

Collaborator

Looks like Numpy does, too:

import numpy as np
x = np.zeros((5, 5))
x.dtype
Out[4]: dtype('float64')
y = np.array([3, 1.5])
y.dtype
Out[6]: dtype('float64')

I could use, for example, numpy.result_type on the initial data.

d = {0: x, 1: y}
np.result_type(*d.values())
Out[11]: dtype('float64')

This comment has been minimized.

@hameerabbasi

hameerabbasi Jan 22, 2018

Collaborator

I've fixed this now. It defaults to float64, or to the maximum dtype of values if it is provided.

.. code-block:: python
s = DOK((6, 5, 2), [dtype=np.float64])

This comment has been minimized.

@mrocklin

mrocklin Jan 22, 2018

Collaborator

I expect naive users to copy-paste this code and be confused why it fails

This comment has been minimized.

@hameerabbasi

hameerabbasi Jan 22, 2018

Collaborator

I'll change that real quick.

.. code-block:: python
s2 = COO(s)

This comment has been minimized.

@mrocklin

mrocklin Jan 22, 2018

Collaborator

Should we also have a to_coo method? I reached for this first.

This comment has been minimized.

@hameerabbasi

hameerabbasi Jan 22, 2018

Collaborator

I'll add to_coo and from_coo. Doesn't hurt, and they're tiny.

pass
class DOK:

This comment has been minimized.

@mrocklin

mrocklin Jan 22, 2018

Collaborator

I recommend using class DOK(object) for as long as we support Python 2

This comment has been minimized.

@hameerabbasi

hameerabbasi Jan 22, 2018

Collaborator

Tests succeed with Python 2.7... Which version of Python has issues with this?

This comment has been minimized.

@mrocklin

mrocklin Jan 22, 2018

Collaborator

This is just general cleanliness in Python 2. class Foo: creates an "old style class" which very few people expect these days. People have grown used to new style classes (class Foo(object)) and their class methods. These are also the default in Python 3.

See https://stackoverflow.com/questions/54867/what-is-the-difference-between-old-style-and-new-style-classes-in-python

[0, 0, 0, 0, 0]])
"""
def __init__(self, shape, values=None, dtype=None):
"""

This comment has been minimized.

@mrocklin

mrocklin Jan 22, 2018

Collaborator

I recommend only including the docstring on the main class and removing it from the __init__ method. People tend to check for docstrings like DOK? or help(DOK) rather than help(DOK.__init__). Also, the two versions here will almost certainly grow out of sync.

This comment has been minimized.

@hameerabbasi

hameerabbasi Jan 22, 2018

Collaborator

That's how we're doing it in COO. I tend to keep the class docstring somewhat general (like an overview) and the __init__ docstring specific to constructing objects.

This comment has been minimized.

@mrocklin

mrocklin Jan 22, 2018

Collaborator

I recommend merging the docstring in __init__ up to the class-level docstring. I find that people rarely look at the __init__ docstring in practice.

<DOK: shape=(5, 5, 5), dtype=int64, nnz=1>
"""
from .coo import COO
self.dict = {}

This comment has been minimized.

@mrocklin

mrocklin Jan 22, 2018

Collaborator

It seems odd to me to use the name of a core collection as a variable name. I'm not sure I have a better solution though. Maybe .data?

This comment has been minimized.

@hameerabbasi

hameerabbasi Jan 22, 2018

Collaborator

.data sounds good to me. I thought of that at first, but wanted to make sure users don't confuse it with COO.data

This comment has been minimized.

@mrocklin

mrocklin Jan 22, 2018

Collaborator

Happy with either. I don't have strong thoughts here.

@hameerabbasi

This comment has been minimized.

Collaborator

hameerabbasi commented Jan 22, 2018

Another round of reviews, maybe? :)

hameerabbasi added some commits Jan 22, 2018

if not len(data):
self.dtype = np.dtype('float64')
else:
self.dtype = np.result_type(*map(lambda x: np.asarray(x).dtype, data.values()))

This comment has been minimized.

@hameerabbasi

hameerabbasi Jan 22, 2018

Collaborator

I had to do map here as otherwise it would not work on dicts of int.

@hameerabbasi

This comment has been minimized.

Collaborator

hameerabbasi commented Jan 22, 2018

And another round! 💯

@mrocklin

This comment has been minimized.

Collaborator

mrocklin commented Jan 22, 2018

+1 from me

@hameerabbasi hameerabbasi merged commit 9441369 into pydata:master Jan 22, 2018

1 check passed

continuous-integration/travis-ci/pr The Travis CI build passed
Details

@hameerabbasi hameerabbasi deleted the hameerabbasi:dok branch Jan 22, 2018

hameerabbasi added a commit to hameerabbasi/sparse that referenced this pull request Feb 27, 2018

Add implementation for DOK. (pydata#85)
* Add initial implementation for DOK.

* Lots of tests and a few bugfixes.

* Start writing docs for DOK.

* Complete the API docs. Moving on to user manual.

* Complete documentation.

* Implemented changes proposed by @mrocklin.

* Add missing RST files.

* Remove undocumented __init__.

* Remove non-working code from docs.

* Fix dtype issue + tests.

* Updated docs.

* Less memory usage on Python 2 by using itervalues.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment