
[do not merge] W internal df representation #513

Closed · wants to merge 39 commits
Conversation

knaaptime (Member)

In a similar vein to #427, I wanted to play around with different internal representations. This is a bit different from 427, as I'm trying to mimic the current API. Not really intended to be merged, but opening here for discussion.

I also wanted to swap out the old _cache for cached properties.

knaaptime (Member, Author) commented Jan 24, 2023

Interestingly, this mostly works... there's a small issue in the distance weights not returning the correct number (but the neighbors themselves are correct), and a small issue in the indexing for some contiguity tests, but I think the logic is pretty square.

@knaaptime knaaptime changed the title [do not merge] internal df representation [do not merge] W internal df representation Jan 24, 2023
sjsrey (Member) commented Jan 25, 2023

This is a lot of very good exploratory work.

But there are also things that arguably could go in now (the cached-property changes), which would maybe make it easier to focus on the df-representation issues/changes.

knaaptime (Member, Author)

Cool. Yeah, I could try and split some of this up into chunks, but there might be a handful of snags there too. Currently we do some clever stuff like silently passing objects between the caches to avoid rebuilding in a newly instantiated object, so those conventions wouldn't work with "proper" cached properties.

Adopting cached properties will also force the library to be Python 3.8+ (the version that introduced functools.cached_property), but that's still in the supported window, so it shouldn't be much of an issue.
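
For illustration, a minimal sketch (not the actual libpysal code, and `cardinalities` here is just a stand-in property) of the two patterns being compared: the old style manages a `_cache` dict that `_reset()` can clear after a mutation, while `functools.cached_property` (3.8+) stashes the value on the instance at first access and never recomputes it:

```python
from functools import cached_property


class OldStyle:
    """Caching managed by hand in a _cache dict."""

    def __init__(self, neighbors):
        self.neighbors = neighbors
        self._cache = {}

    @property
    def cardinalities(self):
        if "cardinalities" not in self._cache:
            self._cache["cardinalities"] = {
                k: len(v) for k, v in self.neighbors.items()
            }
        return self._cache["cardinalities"]

    def _reset(self):
        """Clear the cache after the weights mutate."""
        self._cache = {}


class NewStyle:
    """Caching delegated to functools.cached_property."""

    def __init__(self, neighbors):
        self.neighbors = neighbors

    @cached_property
    def cardinalities(self):
        # computed once and stored on the instance; mutating
        # self.neighbors afterwards will NOT refresh this value
        return {k: len(v) for k, v in self.neighbors.items()}
```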

knaaptime (Member, Author) commented Jan 29, 2023

Alright, I've removed all the unnecessary changes so the diff is easier to read. Just about all the material changes are in weights.py.

The idea is to take a stab at the dataframe representation sketched out for the geographs, which we might use in tandem with #427.

Goals (see the sketch after this list):

  • weights represented by a multiindex pandas dataframe
    • (tries to bridge the gap between the multiindex+sparse representation and the adjacency-table representation in the discussion)
  • conversion between formats is handled by pandas.sparse, so the matrix algebra still happens through sparse (instead of, e.g., groupby/apply for the lag)
    • this means ensuring the sorting of the matrix before/after conversions, which is handled by controlling the sorting of the adjlist multiindex
    • lets us sort the weights (e.g. to re-align the matrix representation with an input), or rename them easily
    • the multiindex weights series can go directly to sparse (or from it)
  • weights transforms are handled by pandas transforms
    • makes the code simpler than apply in this case, and apparently transform can now use numba under the hood?
    • stored as additional columns (as necessary), one of which is the "active" weights column, similar to the active geometry column in a geodataframe
  • existing attributes like the neighbors/weights dicts and their getters/setters are preserved as properties, but can be rewritten easily to ensure im/mutability of the underlying graph object one way or the other
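
A minimal sketch of the idea (illustrative only, not the PR code; binary weights for brevity): the neighbors/weights dicts become a MultiIndex Series, row-standardisation is a groupby transform, and `pandas.sparse` hands the series to scipy for the matrix algebra:

```python
import numpy as np
import pandas as pd

neighbors = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
weights = {"a": [1.0], "b": [1.0, 1.0], "c": [1.0]}

# adjacency as a MultiIndex Series: (focal, neighbor) -> weight
index = pd.MultiIndex.from_tuples(
    [(f, n) for f, ns in neighbors.items() for n in ns],
    names=["focal", "neighbor"],
)
adjacency = pd.Series(
    np.concatenate([np.asarray(w) for w in weights.values()]),
    index=index,
    name="weight",
)

# a weights transform ("R" = row-standardise) via a pandas transform;
# could be stored alongside the original as another column if needed
adjacency_r = adjacency / adjacency.groupby(level="focal").transform("sum")

# the MultiIndex series goes directly to scipy sparse via pandas.sparse;
# sort_labels controls the row/col ordering of the resulting matrix
w_sparse, rows, cols = (
    adjacency_r.astype("Sparse[float]")
    .sparse.to_coo(
        row_levels=["focal"], column_levels=["neighbor"], sort_labels=True
    )
)

# matrix algebra still happens through sparse, e.g. the spatial lag
y = np.array([1.0, 2.0, 3.0])
lag = w_sparse @ y
```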

I haven't touched any of the spint stuff, and hardly any of the existing adjlist stuff, but the vast majority of the test suite is passing (or the failures are immaterial, e.g. the wrong signature of a sparse class, but the appropriate format nonetheless). One thing that is definitely going wrong is that KNN is returning too many neighbors, but I wanted to get some feedback on this approach before digging into it.

@martinfleis martinfleis changed the base branch from master to main February 27, 2023 08:44
@knaaptime knaaptime mentioned this pull request May 25, 2023
knaaptime (Member, Author) commented May 25, 2023

I think we're pretty close, actually... IIRC, one of the last 'real' issues was getting the tree-cutting back in place so that KNN weights work, but I'm pretty sure @sjsrey took care of that. At a high level, I think there are a handful of decisions we need to make about backwards compatibility, and whether we do a full deprecation when we adopt the new structure.

  • do we want to make a decision on mutability of weights?
    • related: some of the caching works a bit differently in #154 (bump supported Python versions and correct lat2SW doctest), since we're using cached_property instead of managing the cache ourselves. This could lead to unexpected behavior: cached_property stashes the attribute when first computed (so it's effectively immutable), but since weights are currently mutable, the old/current version recomputes the cache when the W changes underneath
    • there are a few methods/properties that I kludged in just for consistency, but that we should probably remove to be explicit (like w.id_order)
  • there's one last thing when round-tripping a W through sparse. I can't remember it off the top of my head, but I remember thinking it's probably trivial. I think the symmetrize method needs a look, but that's all that comes to mind

np.testing.assert_array_equal(sparse.data, [1, 1, 1, 1, 0])
np.testing.assert_array_equal(sparse.row, [0, 1, 1, 2, 3])
np.testing.assert_array_equal(sparse.col, [1, 0, 2, 1, 3])
sparse = self.w_islands.to_sparse("bsr")
self.assertIsInstance(sparse, scipy.sparse._arrays.bsr_array)
self.assertIsInstance(sparse, scipy.sparse._bsr.bsr_matrix)
Member

I think the current way to indicate this is scipy.sparse.issparse(sparse), then sparse.format == "bsr". They're moving away from the matrix classes, so we should try to stick with the array stuff.
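
For instance, a quick sketch of that check:

```python
import numpy as np
from scipy import sparse

a = sparse.bsr_array(np.eye(3))
assert sparse.issparse(a)   # agnostic to matrix vs. array classes
assert a.format == "bsr"    # checks the specific storage format
```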

def _reset(self):
    """Reset properties."""
    self._cache = {}

def sort_neighbors(self, ids=None, inplace=False):
Member

I think that, if we support this, we should minimise its use.

I note that this is used a lot in the tests. I think what should be done is to implement __eq__ and the set operations, so that W1 == W2 looks at whether the graphs are isomorphic, W1 is W2 looks at the exact ordering of the data structures, W1 < W2 checks for subgraphs, etc. I am happy to implement this (a rough sketch of the idea follows).
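
A rough sketch of those semantics (hypothetical class; a label-preserving edge-set comparison standing in for a full isomorphism check):

```python
class W:
    def __init__(self, adjacency):
        # adjacency: dict mapping focal id -> {neighbor id: weight}
        self.adjacency = adjacency

    def _edges(self):
        return {
            (f, n, w)
            for f, ns in self.adjacency.items()
            for n, w in ns.items()
        }

    def __eq__(self, other):
        # same labelled edges, regardless of storage order
        return self._edges() == other._edges()

    def __lt__(self, other):
        # proper subgraph: every edge of self appears in other
        return self._edges() < other._edges()
```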

knaaptime (Member, Author)

I vote we toss it. It doesn't mean anything in the new structure, so keeping it around for legacy is misleading at best.

100% on the equals method; that sounds great.

knaaptime (Member, Author)

I think, the way I've done things here, the sort is required to guarantee the ordering of the rows/cols when we convert to a sparse matrix. IIRC, without sorting there's no way to know what order the neighbors come in (the second level of the multiindex), so the values for non-binary weights can be incorrect (even though the neighbor set is still correct). If there's a better way to do that, we can ditch the sort entirely; otherwise we may still need it privately.

Conceptually there's obviously no reason to 'sort' the graph, so I don't think we need to expose it publicly (and if you really needed to, you could grab the adjlist and sort it yourself). Either way, the equals method you're outlining is the proper way to test for ==.

Member

Hence, not supporting any sorting whatsoever will make sure that the input order is never broken, so long as we can avoid pandas's predilection for it... pysal/libpysal@geographs/libpysal/weights/experimental/base.py#L216

knaaptime (Member, Author)

Yeah, but I think I also ran into something strange where I needed to be explicit. I think islands can also induce some weirdness the way I did it. If you have an empty neighbor set in the input dict, the island(s) have to be inserted back in manually, or something like that.

ljwolf (Member) commented May 26, 2023

Right. For the "fresh" implementation, I think we should go with a self-weight of zero for islands, and provide .islands as a convenience method, not an expression of state. We have .sparse, .neighbors/.weights, and .adjacency that we want to keep "aligned" with input order, and we've agreed that .adjacency is the canonical state of the graph.

To achieve this, we can deal with islands directly in our core data structure or as state in our W object that is referenced every time we work with .adjacency.

I see three options. I prefer self weight of zero. It's simpler, and consistent across all expressions of the graph.

A: self weight of zero

We might use islands with a self-weight of zero. Many weights formats already support self-loops, adjacency lists use this representation now in libpysal, and we need to start testing this in our stats... This "induces" an edge for islands, yes, but this edge:

  • does not affect any of the linear algebra/csgraph results
  • has a very minor effect on their speed
  • ensures the edge table stores all observations explicitly
  • every derivative of that edge table (.sparse, .neighbors/.weights) will store all observations explicitly without special casing.

We get round-tripping, internal consistency, and downstream consistency for free (see the sketch below).
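
As a quick illustration (mirroring the w_islands test context quoted above, where observation 3 is an island with an explicit zero self-loop), the zero edge contributes nothing to the lag:

```python
import numpy as np
from scipy import sparse

row = np.array([0, 1, 1, 2, 3])
col = np.array([1, 0, 2, 1, 3])
data = np.array([1.0, 1.0, 1.0, 1.0, 0.0])  # (3, 3) is the zero self-loop
w = sparse.coo_array((data, (row, col)), shape=(4, 4))

y = np.array([2.0, 4.0, 6.0, 8.0])
print(w @ y)  # [4. 8. 4. 0.] -- the island's lag is zero, nothing breaks
```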

B: data structure sentinel

The other way to represent this in the data structure is to use a custom "sentinel" value that depends on the container type: {island: set()} or {island: []} for the neighbors dict (which makes sense for that expression of the graph), a self-weight of numpy.nan in the adjacency table (like "missing data"), and sizing the sparse array as (n, n) when constructing it. This has the same drawbacks as above, but looks more "normal" in each data structure. We'd then also have to be aware, downstream, that:

  • any sparse array produced may omit observations from sparse.nonzero(). We have to check every time that every element of numpy.arange(n) is somewhere in sparse.nonzero().
  • any neighbor dictionary cannot be safely indexed into as neighbors[i][j]; every indexing into [] can raise an error.
  • we have to keep track of each sentinel value for output types.

This also makes it possible to round-trip, but it is slightly harder because of the implicit handling in .sparse.

C: state in the W object

If we separate islands from the data structure and deal with them as object state, then we always have to refer to that state when using the data structure. This is challenging, for example, because:

  • we can't assume that .sparse, .adjacency, or .neighbors is a complete representation of the graph W
  • we can't ensure that .sparse is aligned with inputs without W

Round-tripping is impossible unless you implement islands in the data structure (options A, B).

Member

I'm strongly in favour of option A, as it ensures we keep everything in one place. One of the core issues of the old W is that we have way too many objects that we need to keep in sync. Let's simplify this as much as we can.

Member

+1 on self-weight of 0.

knaaptime (Member, Author)

I think we're unanimous here, then. I think the question for the fresh implementation is what happens when a user creates a W with islands that doesn't follow that convention.

The workaround I went for here is for when a user instantiates a W from dicts, but the neighbor dict has an empty list. In prior versions, that gives you a W with an island. But if you try to parse that dict into a multiindex series, the empty list turns into a NaN and gets dropped (so I insert the self-loops back in manually; see the sketch below). In the new version, we'd raise an error?
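
The snag in miniature (a hypothetical reconstruction, not the PR code): the island's empty list simply produces no (focal, neighbor) pairs, so it has to be re-inserted as a zero self-loop:

```python
import pandas as pd

neighbors = {"a": ["b"], "b": ["a"], "c": []}  # "c" is an island

pairs = [(f, n) for f, ns in neighbors.items() for n in ns]
adjacency = pd.Series(
    1.0, index=pd.MultiIndex.from_tuples(pairs, names=["focal", "neighbor"])
)
# "c" has vanished from the series entirely, so insert it back manually
islands = set(neighbors) - set(adjacency.index.get_level_values("focal"))
self_loops = pd.Series(
    0.0,
    index=pd.MultiIndex.from_tuples(
        [(i, i) for i in islands], names=["focal", "neighbor"]
    ),
)
adjacency = pd.concat([adjacency, self_loops]).sort_index()
```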

Member

New builders will represent islands with the zero self-weight, so W should never "know" about it. We could write a _sanitize_islands() function or something to check for and correct this in the dict-formatted inputs?
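
One possible shape for that helper (hypothetical sketch, dict-formatted inputs only):

```python
def _sanitize_islands(neighbors, weights):
    """Replace empty neighbor sets with a zero self-weight (in place)."""
    for ids, nbrs in neighbors.items():
        if len(nbrs) == 0:
            neighbors[ids] = [ids]
            weights[ids] = [0.0]
    return neighbors, weights


# e.g.: {"c": []} becomes {"c": ["c"]} with weights {"c": [0.0]}
```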

knaaptime (Member, Author)

resolved by Graph :)

@knaaptime knaaptime closed this Jan 17, 2024