Add implementation for DOK. #85
Conversation
You can do, for example,
And all of them should work. :) |
Some small comments. This could obviously use some tests. |
item = normalize_index(item, self.shape) | ||
|
||
for i in item: | ||
if not isinstance(i, Integral): |
mrocklin
Jan 21, 2018
Collaborator
perhaps if not all(isinstance(i, Integral) for i in item):
perhaps if not all(isinstance(i, Integral) for i in item):
' when getting an item.') | ||
|
||
if not len(item) == self.ndim: | ||
raise IndexError('Can only get single elements.') |
mrocklin
Jan 21, 2018
Collaborator
Might want to include the item in the error message with "expected key of length %d, got %s" % (self.ndim, str(item))
Might want to include the item in the error message with "expected key of length %d, got %s" % (self.ndim, str(item))
def nnz(self): | ||
return len(self.dict) | ||
|
||
def __getitem__(self, item): |
mrocklin
Jan 21, 2018
Collaborator
I would use key
or index
for this variable name
I would use key
or index
for this variable name
if not len(item) == self.ndim: | ||
raise IndexError('Can only get single elements.') | ||
|
||
if item in self.dict.keys(): |
mrocklin
Jan 21, 2018
Collaborator
You want if item in self.dict
. This is much faster. The .keys()
result is either a list (python 2) or iterator (python 3), each of which have linear time membership checks.
You want if item in self.dict
. This is much faster. The .keys()
result is either a list (python 2) or iterator (python 3), each of which have linear time membership checks.
result = np.zeros(self.shape, dtype=self.dtype) | ||
|
||
for c, d in six.iteritems(self.dict): | ||
result[c] = d |
mrocklin
Jan 21, 2018
Collaborator
You might be able to do something like result[self.dict.keys()] = self.dict.values()
here
You might be able to do something like result[self.dict.keys()] = self.dict.values()
here
hameerabbasi
Jan 21, 2018
Author
Collaborator
This works in Python 2, but not 3 (where these are iterators). In addition, it'll take up more memory for the resulting lists even if we were to do this.
This works in Python 2, but not 3 (where these are iterators). In addition, it'll take up more memory for the resulting lists even if we were to do this.
Tests added, and a few bugs caught and fixed. :) Will do docs soon. |
I think this is ready for a final review and merge. Keep in mind we can keep adding features later. Right now, we just need an implementation. :) |
|
||
You can get started by defining the shape (and optionally, datatype) of the | ||
:obj:`DOK` array. If you do not specify a dtype, it is inferred from the first | ||
element. |
mrocklin
Jan 22, 2018
Collaborator
This seems somewhat dangerous to me. Is this what scipy.sparse does?
I thought that this was a fail case:
In [1]: from sparse import DOK, COO
In [2]: dok = DOK((5, 5))
In [3]: dok[1, 1] = 5
In [4]: dok.dtype
Out[4]: dtype('int64')
In [5]: dok[2, 2] = 3.2
In [6]: dok
Out[6]: <DOK: shape=(5, 5), dtype=int64, nnz=2>
In [7]: dok.todense()
Out[7]:
array([[0, 0, 0, 0, 0],
[0, 5, 0, 0, 0],
[0, 0, 3, 0, 0],
[0, 0, 0, 0, 0],
[0, 0, 0, 0, 0]])
But then I tried the same thing with numpy and got the same behavior :/
In [8]: import numpy as np
In [9]: x = np.array([1, 2, 3])
In [10]: x
Out[10]: array([1, 2, 3])
In [11]: x.dtype
Out[11]: dtype('int64')
In [12]: x[2] = 3.2
In [13]: x
Out[13]: array([1, 2, 3])
This seems somewhat dangerous to me. Is this what scipy.sparse does?
I thought that this was a fail case:
In [1]: from sparse import DOK, COO
In [2]: dok = DOK((5, 5))
In [3]: dok[1, 1] = 5
In [4]: dok.dtype
Out[4]: dtype('int64')
In [5]: dok[2, 2] = 3.2
In [6]: dok
Out[6]: <DOK: shape=(5, 5), dtype=int64, nnz=2>
In [7]: dok.todense()
Out[7]:
array([[0, 0, 0, 0, 0],
[0, 5, 0, 0, 0],
[0, 0, 3, 0, 0],
[0, 0, 0, 0, 0],
[0, 0, 0, 0, 0]])
But then I tried the same thing with numpy and got the same behavior :/
In [8]: import numpy as np
In [9]: x = np.array([1, 2, 3])
In [10]: x
Out[10]: array([1, 2, 3])
In [11]: x.dtype
Out[11]: dtype('int64')
In [12]: x[2] = 3.2
In [13]: x
Out[13]: array([1, 2, 3])
hameerabbasi
Jan 22, 2018
Author
Collaborator
I'm mimicking Numpy semantics here. Numpy doesn't change the data type of the whole array for just one element, it has to be decided beforehand.
I'm mimicking Numpy semantics here. Numpy doesn't change the data type of the whole array for just one element, it has to be decided beforehand.
mrocklin
Jan 22, 2018
Collaborator
Yeah, I agree with that behavior. My concern is that they might not realize that they are selecting a dtype, and then have the rest of their data silently truncated. Consider the following example:
dok = DOK((5, 5))
data = {(1, 1): 1, (2, 2), 2.5, (3, 3): 3.5}
for k, v in data.items():
dok[k] = v
The first entry defines the dtype as int, but then the subsequent values are silently truncated.
It looks like scipy defaults dtype to np.float64
. We might want to do the same.
In [1]: from scipy.sparse import dok_matrix
In [2]: dok_matrix((5, 5))
Out[2]:
<5x5 sparse matrix of type '<class 'numpy.float64'>'
with 0 stored elements in Dictionary Of Keys format>
Yeah, I agree with that behavior. My concern is that they might not realize that they are selecting a dtype, and then have the rest of their data silently truncated. Consider the following example:
dok = DOK((5, 5))
data = {(1, 1): 1, (2, 2), 2.5, (3, 3): 3.5}
for k, v in data.items():
dok[k] = v
The first entry defines the dtype as int, but then the subsequent values are silently truncated.
It looks like scipy defaults dtype to np.float64
. We might want to do the same.
In [1]: from scipy.sparse import dok_matrix
In [2]: dok_matrix((5, 5))
Out[2]:
<5x5 sparse matrix of type '<class 'numpy.float64'>'
with 0 stored elements in Dictionary Of Keys format>
hameerabbasi
Jan 22, 2018
Author
Collaborator
Looks like Numpy does, too:
import numpy as np
x = np.zeros((5, 5))
x.dtype
Out[4]: dtype('float64')
y = np.array([3, 1.5])
y.dtype
Out[6]: dtype('float64')
I could use, for example, numpy.result_type on the initial data.
d = {0: x, 1: y}
np.result_type(*d.values())
Out[11]: dtype('float64')
Looks like Numpy does, too:
import numpy as np
x = np.zeros((5, 5))
x.dtype
Out[4]: dtype('float64')
y = np.array([3, 1.5])
y.dtype
Out[6]: dtype('float64')
I could use, for example, numpy.result_type on the initial data.
d = {0: x, 1: y}
np.result_type(*d.values())
Out[11]: dtype('float64')
hameerabbasi
Jan 22, 2018
Author
Collaborator
I've fixed this now. It defaults to float64
, or to the maximum dtype of values
if it is provided.
I've fixed this now. It defaults to float64
, or to the maximum dtype of values
if it is provided.
|
||
.. code-block:: python | ||
s = DOK((6, 5, 2), [dtype=np.float64]) |
mrocklin
Jan 22, 2018
Collaborator
I expect naive users to copy-paste this code and be confused why it fails
I expect naive users to copy-paste this code and be confused why it fails
hameerabbasi
Jan 22, 2018
Author
Collaborator
I'll change that real quick.
I'll change that real quick.
|
||
.. code-block:: python | ||
s2 = COO(s) |
mrocklin
Jan 22, 2018
Collaborator
Should we also have a to_coo
method? I reached for this first.
Should we also have a to_coo
method? I reached for this first.
hameerabbasi
Jan 22, 2018
Author
Collaborator
I'll add to_coo
and from_coo
. Doesn't hurt, and they're tiny.
I'll add to_coo
and from_coo
. Doesn't hurt, and they're tiny.
pass | ||
|
||
|
||
class DOK: |
mrocklin
Jan 22, 2018
Collaborator
I recommend using class DOK(object)
for as long as we support Python 2
I recommend using class DOK(object)
for as long as we support Python 2
hameerabbasi
Jan 22, 2018
Author
Collaborator
Tests succeed with Python 2.7... Which version of Python has issues with this?
Tests succeed with Python 2.7... Which version of Python has issues with this?
mrocklin
Jan 22, 2018
Collaborator
This is just general cleanliness in Python 2. class Foo:
creates an "old style class" which very few people expect these days. People have grown used to new style classes (class Foo(object)
) and their class methods. These are also the default in Python 3.
This is just general cleanliness in Python 2. class Foo:
creates an "old style class" which very few people expect these days. People have grown used to new style classes (class Foo(object)
) and their class methods. These are also the default in Python 3.
[0, 0, 0, 0, 0]]) | ||
""" | ||
def __init__(self, shape, values=None, dtype=None): | ||
""" |
mrocklin
Jan 22, 2018
Collaborator
I recommend only including the docstring on the main class and removing it from the __init__
method. People tend to check for docstrings like DOK?
or help(DOK)
rather than help(DOK.__init__)
. Also, the two versions here will almost certainly grow out of sync.
I recommend only including the docstring on the main class and removing it from the __init__
method. People tend to check for docstrings like DOK?
or help(DOK)
rather than help(DOK.__init__)
. Also, the two versions here will almost certainly grow out of sync.
hameerabbasi
Jan 22, 2018
Author
Collaborator
That's how we're doing it in COO
. I tend to keep the class docstring somewhat general (like an overview) and the __init__
docstring specific to constructing objects.
That's how we're doing it in COO
. I tend to keep the class docstring somewhat general (like an overview) and the __init__
docstring specific to constructing objects.
mrocklin
Jan 22, 2018
Collaborator
I recommend merging the docstring in __init__
up to the class-level docstring. I find that people rarely look at the __init__
docstring in practice.
I recommend merging the docstring in __init__
up to the class-level docstring. I find that people rarely look at the __init__
docstring in practice.
<DOK: shape=(5, 5, 5), dtype=int64, nnz=1> | ||
""" | ||
from .coo import COO | ||
self.dict = {} |
mrocklin
Jan 22, 2018
Collaborator
It seems odd to me to use the name of a core collection as a variable name. I'm not sure I have a better solution though. Maybe .data
?
It seems odd to me to use the name of a core collection as a variable name. I'm not sure I have a better solution though. Maybe .data
?
hameerabbasi
Jan 22, 2018
Author
Collaborator
.data
sounds good to me. I thought of that at first, but wanted to make sure users don't confuse it with COO.data
.data
sounds good to me. I thought of that at first, but wanted to make sure users don't confuse it with COO.data
mrocklin
Jan 22, 2018
Collaborator
Happy with either. I don't have strong thoughts here.
Happy with either. I don't have strong thoughts here.
Another round of reviews, maybe? :) |
if not len(data): | ||
self.dtype = np.dtype('float64') | ||
else: | ||
self.dtype = np.result_type(*map(lambda x: np.asarray(x).dtype, data.values())) |
hameerabbasi
Jan 22, 2018
Author
Collaborator
I had to do map
here as otherwise it would not work on dicts
of int
.
I had to do map
here as otherwise it would not work on dicts
of int
.
And another round! |
+1 from me |
* Add initial implementation for DOK. * Lots of tests and a few bugfixes. * Start writing docs for DOK. * Complete the API docs. Moving on to user manual. * Complete documentation. * Implemented changes proposed by @mrocklin. * Add missing RST files. * Remove undocumented __init__. * Remove non-working code from docs. * Fix dtype issue + tests. * Updated docs. * Less memory usage on Python 2 by using itervalues.
I've made a small DOK implementation. So far, it supports:
COO(DOK)
/DOK(COO)
todense()
for debugging.I haven't added tests or docs (yet) or tested performance, but it seems to work fine.
Edit: xref #15