Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add implementation for DOK. #85

Merged
merged 12 commits into from Jan 22, 2018
Merged

Add implementation for DOK. #85

merged 12 commits into from Jan 22, 2018

Conversation

hameerabbasi
Copy link
Collaborator

@hameerabbasi hameerabbasi commented Jan 21, 2018

I've made a small DOK implementation. So far, it supports:

  • Setting items with integers and slices, including setting with arrays.
  • Getting single items only.
  • Converting to/from COO with COO(DOK)/DOK(COO)
  • Has a todense() for debugging.

I haven't added tests or docs (yet) or tested performance, but it seems to work fine.

Edit: xref #15

@hameerabbasi
Copy link
Collaborator Author

You can do, for example,

x[1:3, 1:3] = 5
x[2, 2] = 3
x[1:3, 1:3] = [4, 6]
x[1:3, 1:3] = [[4, 5], [6, 7]]

And all of them should work. :)

Copy link
Collaborator

@mrocklin mrocklin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some small comments. This could obviously use some tests.

sparse/dok.py Outdated
item = normalize_index(item, self.shape)

for i in item:
if not isinstance(i, Integral):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

perhaps if not all(isinstance(i, Integral) for i in item):

sparse/dok.py Outdated
' when getting an item.')

if not len(item) == self.ndim:
raise IndexError('Can only get single elements.')
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might want to include the item in the error message with "expected key of length %d, got %s" % (self.ndim, str(item))

sparse/dok.py Outdated
def nnz(self):
return len(self.dict)

def __getitem__(self, item):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would use key or index for this variable name

sparse/dok.py Outdated
if not len(item) == self.ndim:
raise IndexError('Can only get single elements.')

if item in self.dict.keys():
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You want if item in self.dict. This is much faster. The .keys() result is either a list (python 2) or iterator (python 3), each of which have linear time membership checks.

sparse/dok.py Outdated
result = np.zeros(self.shape, dtype=self.dtype)

for c, d in six.iteritems(self.dict):
result[c] = d
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You might be able to do something like result[self.dict.keys()] = self.dict.values() here

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This works in Python 2, but not 3 (where these are iterators). In addition, it'll take up more memory for the resulting lists even if we were to do this.

@hameerabbasi
Copy link
Collaborator Author

Tests added, and a few bugs caught and fixed. :)

Will do docs soon.

@hameerabbasi
Copy link
Collaborator Author

I think this is ready for a final review and merge.

Keep in mind we can keep adding features later. Right now, we just need an implementation. :)


You can get started by defining the shape (and optionally, datatype) of the
:obj:`DOK` array. If you do not specify a dtype, it is inferred from the first
element.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems somewhat dangerous to me. Is this what scipy.sparse does?

I thought that this was a fail case:

In [1]: from sparse import DOK, COO

In [2]: dok = DOK((5, 5))

In [3]: dok[1, 1] = 5

In [4]: dok.dtype
Out[4]: dtype('int64')

In [5]: dok[2, 2] = 3.2

In [6]: dok
Out[6]: <DOK: shape=(5, 5), dtype=int64, nnz=2>

In [7]: dok.todense()
Out[7]: 
array([[0, 0, 0, 0, 0],
       [0, 5, 0, 0, 0],
       [0, 0, 3, 0, 0],
       [0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0]])

But then I tried the same thing with numpy and got the same behavior :/

In [8]: import numpy as np

In [9]: x = np.array([1, 2, 3])

In [10]: x
Out[10]: array([1, 2, 3])

In [11]: x.dtype
Out[11]: dtype('int64')

In [12]: x[2] = 3.2

In [13]: x
Out[13]: array([1, 2, 3])

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm mimicking Numpy semantics here. Numpy doesn't change the data type of the whole array for just one element, it has to be decided beforehand.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I agree with that behavior. My concern is that they might not realize that they are selecting a dtype, and then have the rest of their data silently truncated. Consider the following example:

dok = DOK((5, 5))

data = {(1, 1): 1, (2, 2), 2.5, (3, 3): 3.5}
for k, v in data.items():
    dok[k] = v

The first entry defines the dtype as int, but then the subsequent values are silently truncated.

It looks like scipy defaults dtype to np.float64. We might want to do the same.

In [1]: from scipy.sparse import dok_matrix

In [2]: dok_matrix((5, 5))
Out[2]: 
<5x5 sparse matrix of type '<class 'numpy.float64'>'
	with 0 stored elements in Dictionary Of Keys format>

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like Numpy does, too:

import numpy as np
x = np.zeros((5, 5))
x.dtype
Out[4]: dtype('float64')
y = np.array([3, 1.5])
y.dtype
Out[6]: dtype('float64')

I could use, for example, numpy.result_type on the initial data.

d = {0: x, 1: y}
np.result_type(*d.values())
Out[11]: dtype('float64')

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've fixed this now. It defaults to float64, or to the maximum dtype of values if it is provided.


.. code-block:: python

s = DOK((6, 5, 2), [dtype=np.float64])
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I expect naive users to copy-paste this code and be confused why it fails

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll change that real quick.


.. code-block:: python

s2 = COO(s)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we also have a to_coo method? I reached for this first.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll add to_coo and from_coo. Doesn't hurt, and they're tiny.

sparse/dok.py Outdated
pass


class DOK:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I recommend using class DOK(object) for as long as we support Python 2

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tests succeed with Python 2.7... Which version of Python has issues with this?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is just general cleanliness in Python 2. class Foo: creates an "old style class" which very few people expect these days. People have grown used to new style classes (class Foo(object)) and their class methods. These are also the default in Python 3.

See https://stackoverflow.com/questions/54867/what-is-the-difference-between-old-style-and-new-style-classes-in-python

sparse/dok.py Outdated
[0, 0, 0, 0, 0]])
"""
def __init__(self, shape, values=None, dtype=None):
"""
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I recommend only including the docstring on the main class and removing it from the __init__ method. People tend to check for docstrings like DOK? or help(DOK) rather than help(DOK.__init__). Also, the two versions here will almost certainly grow out of sync.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's how we're doing it in COO. I tend to keep the class docstring somewhat general (like an overview) and the __init__ docstring specific to constructing objects.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I recommend merging the docstring in __init__ up to the class-level docstring. I find that people rarely look at the __init__ docstring in practice.

sparse/dok.py Outdated
<DOK: shape=(5, 5, 5), dtype=int64, nnz=1>
"""
from .coo import COO
self.dict = {}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems odd to me to use the name of a core collection as a variable name. I'm not sure I have a better solution though. Maybe .data?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

.data sounds good to me. I thought of that at first, but wanted to make sure users don't confuse it with COO.data

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Happy with either. I don't have strong thoughts here.

@hameerabbasi
Copy link
Collaborator Author

Another round of reviews, maybe? :)

sparse/dok.py Outdated
if not len(data):
self.dtype = np.dtype('float64')
else:
self.dtype = np.result_type(*map(lambda x: np.asarray(x).dtype, data.values()))
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had to do map here as otherwise it would not work on dicts of int.

@hameerabbasi
Copy link
Collaborator Author

And another round! 💯

@mrocklin
Copy link
Collaborator

+1 from me

@hameerabbasi hameerabbasi merged commit 9441369 into pydata:master Jan 22, 2018
@hameerabbasi hameerabbasi deleted the dok branch January 22, 2018 19:39
hameerabbasi added a commit to hameerabbasi/sparse that referenced this pull request Feb 27, 2018
* Add initial implementation for DOK.

* Lots of tests and a few bugfixes.

* Start writing docs for DOK.

* Complete the API docs. Moving on to user manual.

* Complete documentation.

* Implemented changes proposed by @mrocklin.

* Add missing RST files.

* Remove undocumented __init__.

* Remove non-working code from docs.

* Fix dtype issue + tests.

* Updated docs.

* Less memory usage on Python 2 by using itervalues.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants