[do not merge] W internal df representation #513
Conversation
Co-authored-by: Martin Fleischmann <martin@martinfleischmann.net>
Interestingly, this mostly works... there's a small issue in the distance weights not returning the correct number (though the neighbors themselves are correct), and a small issue in the indexing for some contiguity tests, but I think the logic is pretty square.
This is a lot of very good exploratory work. But there are also things that arguably could go in now (the cached-property changes), and that would maybe make it easier to focus on the df-representation issues/changes.
Cool. Yeah, I could try to split some of this up into chunks, but there might be a handful of snags there too. Currently we do some clever stuff like silently passing objects between the cache to avoid rebuilding in a newly instantiated object, so those conventions wouldn't work with "proper" cached properties. Adopting cached properties will also force the library to be Python 3.8+ (functools.cached_property landed in 3.8), but that's still in the supported window, so it shouldn't be much of an issue.
Alright, I've removed all the unnecessary changes so the diff is easier. Just about all the material changes are in weights.py. The idea is to take a stab at the dataframe representation sketched out for the geographs that we might use in tandem with #427. Goals:
I haven't touched any of the spint stuff, and hardly any of the existing adjlist stuff, but the vast majority of the test suite is passing (or the failures are immaterial, e.g. the wrong signature of a sparse class, but the appropriate format nonetheless). One thing that is definitely going wrong is that KNN is returning too many neighbors, but I wanted to get some feedback on this approach before digging into it.
I think we're pretty close, actually... IIRC, one of the last 'real' issues was getting the tree-cutting back in place so that KNN weights work, but I'm pretty sure @sjsrey took care of that. At a high level, I think there are a handful of decisions we need to make about backwards compatibility, and whether we do a full deprecation when we adopt the new structure.
np.testing.assert_array_equal(sparse.data, [1, 1, 1, 1, 0])
np.testing.assert_array_equal(sparse.row, [0, 1, 1, 2, 3])
np.testing.assert_array_equal(sparse.col, [1, 0, 2, 1, 3])
sparse = self.w_islands.to_sparse("bsr")
self.assertIsInstance(sparse, scipy.sparse._arrays.bsr_array)
self.assertIsInstance(sparse, scipy.sparse._bsr.bsr_matrix)
I think the current way to indicate this is scipy.sparse.issparse(sparse), then sparse.format == "bsr". They're moving away from the matrix classes, so we should try to keep with the array stuff.
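A format-agnostic check along those lines might look like the following sketch; it works for both the legacy matrix classes and the newer array classes in recent SciPy, without reaching into private modules:

```python
import numpy as np
from scipy import sparse

# build a small BSR-format sparse array
a = sparse.bsr_array(np.eye(4))

# issparse() + .format covers both matrix and array classes,
# with no isinstance() against scipy.sparse._bsr internals
assert sparse.issparse(a)
assert a.format == "bsr"
```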
def _reset(self):
    """Reset properties."""
    self._cache = {}

def sort_neighbors(self, ids=None, inplace=False):
I think that, if we support this, we should minimise its use.
I note that this is used a lot in the tests. I think what should be done is to implement __equals__ and the set operations, so that W1 == W2 looks at whether the graphs are isomorphic, W1 is W2 looks at the exact ordering of the data structures, W1 < W2 checks for subgraphs... etc. I am happy to implement this.
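A minimal sketch of those comparison semantics on a toy object backed by a neighbors dict (ToyW and its layout are hypothetical stand-ins, not libpysal's W; full graph isomorphism, which allows relabelling, would be stronger than the label-preserving equality shown here):

```python
class ToyW:
    """Toy stand-in for a weights object (hypothetical, not libpysal's W)."""

    def __init__(self, neighbors):
        self.neighbors = neighbors  # {id: list of neighbor ids}

    def __eq__(self, other):
        # same labels and same neighbor *sets*, regardless of the
        # order the neighbors happen to be stored in
        if set(self.neighbors) != set(other.neighbors):
            return False
        return all(
            set(v) == set(other.neighbors[k])
            for k, v in self.neighbors.items()
        )

w1 = ToyW({"a": ["b", "c"], "b": ["a"], "c": ["a"]})
w2 = ToyW({"a": ["c", "b"], "b": ["a"], "c": ["a"]})
assert w1 == w2       # same graph, different storage order
assert w1 is not w2   # identity still distinguishes the objects
```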
I vote we toss it. It doesn't mean anything in the new structure, so keeping it around for legacy is misleading at best.
100% on the equals method, that sounds great.
I think the way I've done things here, the sort is required to guarantee the ordering of the rows/cols when we convert to a sparse matrix. IIRC, without sorting, there's no way to know what order the neighbors come in (second level of the multiindex), so the values for nonbinary weights can be incorrect (even though the neighbor set is still correct). If there's a better way to do that, we can ditch the sort entirely; otherwise we may still need it privately.
Conceptually there's obviously no reason to 'sort' the graph, so I don't think we need to expose it publicly (and if you really needed to, you could grab the adjlist and sort yourself). Either way, the equals method you're outlining is the proper way to test for ==.
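A sketch of that private sort on a (focal, neighbor) MultiIndex series; the layout here is my assumption about the adjacency table, not the PR's exact code:

```python
import pandas as pd
from scipy import sparse

# adjacency as a (focal, neighbor) MultiIndex series, in arbitrary input order
adj = pd.Series(
    [0.5, 1.0, 0.5, 1.0],
    index=pd.MultiIndex.from_tuples(
        [("b", "c"), ("a", "b"), ("c", "b"), ("b", "a")],
        names=["focal", "neighbor"],
    ),
)

# sorting privately before conversion makes row/col/data line up
# deterministically, whatever order the input arrived in
adj = adj.sort_index()
ids = sorted(set(adj.index.get_level_values("focal")))
pos = {k: i for i, k in enumerate(ids)}
rows = [pos[f] for f in adj.index.get_level_values("focal")]
cols = [pos[n] for n in adj.index.get_level_values("neighbor")]
mat = sparse.coo_array(
    (adj.to_numpy(), (rows, cols)), shape=(len(ids), len(ids))
)
```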
Hence, not supporting any sorting whatsoever will make sure that the input order is never broken, so long as we can avoid pandas's predilection for it... pysal/libpysal@geographs/libpysal/weights/experimental/base.py#L216
Yeah, but I think I also ran into something strange where I needed to be explicit. I think islands can also induce some weirdness the way I did it. If you have an empty neighbor set in the input dict, the island(s) have to be inserted back in manually, or something like that.
Right. For the "fresh" implementation, I think we should go with a self-weight of zero for islands, and provide .islands as a convenience method, not an expression of state. We have .sparse, .neighbors/.weights, and .adjacency that we want to keep "aligned" with input order, and we've agreed that .adjacency is the canonical state of the graph.
To achieve this, we can deal with islands directly in our core data structure, or as state in our W object that is referenced every time we work with .adjacency.
I see three options. I prefer self weight of zero. It's simpler, and consistent across all expressions of the graph.
A: self weight of zero
We might use islands w/ a self-weight of zero. Many weights formats already support self-loops, adjacency lists use this representation now in libpysal, and we need to start testing this in our stats.... This "induces" an edge for islands, yes, but this edge:
- does not affect any of the linear algebra/csgraph results
- has a very minor effect on their speed
- ensures the edge table stores all observations explicitly
- means every derivative of that edge table (.sparse, .neighbors/.weights) will store all observations explicitly without special casing.
We get round-tripping, internal consistency, and downstream consistency for free.
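A raw-SciPy sketch of why the zero self-weight is numerically inert (the data mirrors the island test earlier in the thread):

```python
import numpy as np
from scipy import sparse

# three connected observations plus one island (index 3) stored as
# an explicit zero self-weight
rows = [0, 1, 1, 2, 3]
cols = [1, 0, 2, 1, 3]
data = [1.0, 1.0, 1.0, 1.0, 0.0]
w = sparse.coo_array((data, (rows, cols)), shape=(4, 4))

# the zero edge contributes nothing to a spatial lag...
x = np.array([1.0, 2.0, 3.0, 4.0])
lag = w @ x  # island's lag is 0.0

# ...but the edge table still stores every observation explicitly
assert w.shape == (4, 4)
```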
B: data structure sentinel
The other way to represent this in the data structure is to use a custom "sentinel" value that depends on the container type: {island: set()} or {island: []} for the neighbors dict (which makes sense for that expression of the graph), a self-weight of numpy.nan in the adjacency table (like "missing data"), and sizing the sparse array as (n, n) when constructing it. This has the same drawbacks as above, but looks more "normal" in each data structure. We'd then also have to be aware downstream that:
- any sparse array produced may omit observations from sparse.nonzero(). We have to check every time that every element of numpy.arange(n) is somewhere in sparse.nonzero().
- any neighbor dictionary cannot index into neighbors[i][j]. Every indexing into [] can raise an error.
- we have to keep track of each sentinel value for output types.
This also makes it possible to round-trip, but is slightly harder because of the implicit handling in .sparse.
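A quick sketch of the first drawback: with a sentinel-sized array, the island never shows up in nonzero(), so every consumer has to re-derive it (raw SciPy, hypothetical data):

```python
import numpy as np
from scipy import sparse

# sentinel-style island: the array is sized (4, 4) but node 3 has
# no stored entry at all
rows = [0, 1, 1, 2]
cols = [1, 0, 2, 1]
w = sparse.coo_array(([1.0] * 4, (rows, cols)), shape=(4, 4))

# node 3 is absent from nonzero(); downstream code has to check
present = set(np.unique(np.concatenate(w.nonzero())))
missing = set(range(w.shape[0])) - present
# missing == {3}: the island must be recovered out-of-band
```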
C: state in the W object
If we separate islands from the data structure and deal with them as object state, then we always have to refer to that state when using the data structure. This is challenging, for example, because:
- we can't assume that .sparse, .adjacency, or .neighbors is a complete representation of the graph W
- we can't ensure that .sparse is aligned with inputs without W
Round-tripping is impossible unless you implement islands in the data structure (options A, B).
I'm strongly in favour of option A, as it ensures we keep everything in one place. One of the core issues of the old W is that we have way too many objects that we need to keep in sync. Let's simplify this as much as we can.
+1 on self-weight of 0.
I think we're unanimous here, then. I think the question for the fresh implementation is what happens when a user creates a W with islands that doesn't follow that convention.
The workaround I went for here handles the case where a user instantiates a W from dicts, but the neighbor dict has an empty list. In prior versions, that gives you a W with an island. But if you try to parse that dict into a multiindex series, the empty list turns into a NaN and gets dropped (so I insert the self-loops back in manually). In the new version, we'd raise an error?
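The pandas behaviour described here can be seen directly (a minimal demonstration, not the PR's code):

```python
import pandas as pd

neighbors = {"a": ["b"], "b": ["a"], "c": []}  # "c" is an island

# exploding the dict turns the empty list into NaN...
s = pd.Series(neighbors).explode()
assert pd.isna(s.loc["c"])

# ...and a dropna() silently removes the island from the table
assert "c" not in s.dropna().index
```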
New builders will represent islands with the zero self-weight, so W should never "know" about it. We could write a _sanitize_islands() function or something to check for and correct this in the dict-formatted inputs?
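A sketch of what that helper might look like; the name comes from the comment above, but the signature and body are my assumptions:

```python
def _sanitize_islands(neighbors, weights):
    """Rewrite empty neighbor lists (the legacy island encoding) as
    explicit zero self-weights, so downstream builders never see islands."""
    neighbors = {k: list(v) for k, v in neighbors.items()}
    weights = {k: list(v) for k, v in weights.items()}
    for k in neighbors:
        if not neighbors[k]:  # empty list: the old encoding of an island
            neighbors[k] = [k]
            weights[k] = [0.0]
    return neighbors, weights

n, w = _sanitize_islands(
    {"a": ["b"], "b": ["a"], "c": []},
    {"a": [1.0], "b": [1.0], "c": []},
)
# "c" now carries an explicit self-loop with weight zero
assert n["c"] == ["c"] and w["c"] == [0.0]
```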
resolved by Graph :)
In a similar vein to #427, I wanted to play around with different internal representations. This is a bit different from 427, as I'm trying to mimic the current API. Not really intended to be merged, but opening here for discussion.
I also wanted to swap out the old _cache for cached properties.