
[do not merge] W internal df representation #513

Closed · wants to merge 39 commits
Conversation

knaaptime (Member)

In a similar vein to #427, I wanted to play around with different internal representations. This is a bit different from 427, as I'm trying to mimic the current API. Not really intended to be merged, but opening here for discussion.

I also wanted to swap out the old _cache for cached properties.

knaaptime (Member, Author) commented Jan 24, 2023

Interestingly, this mostly works... there's a small issue in the distance weights not returning the correct number (but the neighbors themselves are correct), and a small issue in the indexing for some contiguity tests, but I think the logic is pretty square.

@knaaptime knaaptime changed the title [do not merge] internal df representation [do not merge] W internal df representation Jan 24, 2023
sjsrey (Member) commented Jan 25, 2023

This is a lot of very good exploratory work.

But there are also things that arguably could go in now (the cached-property changes), which would maybe make it easier to focus on the df-representation issues/changes.

knaaptime (Member, Author)

Cool. Yeah, I could try and split some of this up into chunks, but there might be a handful of snags there too. Currently we do some clever stuff like silently passing objects between the caches to avoid rebuilding in a newly instantiated object, so those conventions wouldn't work with "proper" cached properties.

Adopting cached properties will also force the library to be Python 3.8+ (the version that introduced functools.cached_property), but that's still in the supported window, so it shouldn't be much of an issue.
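
For illustration, a minimal sketch (not the actual libpysal code, and `cardinalities` here is just a stand-in property) of the two patterns being compared: the old style manages a `_cache` dict that `_reset()` can clear after a mutation, while `functools.cached_property` (3.8+) stashes the value on the instance at first access and never recomputes it:

```python
from functools import cached_property


class OldStyle:
    """Caching managed by hand in a _cache dict."""

    def __init__(self, neighbors):
        self.neighbors = neighbors
        self._cache = {}

    @property
    def cardinalities(self):
        if "cardinalities" not in self._cache:
            self._cache["cardinalities"] = {
                k: len(v) for k, v in self.neighbors.items()
            }
        return self._cache["cardinalities"]

    def _reset(self):
        """Clear the cache after the weights mutate."""
        self._cache = {}


class NewStyle:
    """Caching delegated to functools.cached_property."""

    def __init__(self, neighbors):
        self.neighbors = neighbors

    @cached_property
    def cardinalities(self):
        # computed once and stored on the instance; mutating
        # self.neighbors afterwards will NOT refresh this value
        return {k: len(v) for k, v in self.neighbors.items()}
```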

knaaptime (Member, Author) commented Jan 29, 2023

Alright, I've removed all the unnecessary changes so the diff is easier to read. Just about all the material changes are in weights.py.

The idea is to take a stab at the dataframe representation sketched out for the geographs, which we might use in tandem with #427.

Goals (see the sketch after this list):

  • weights represented by a multiindex pandas dataframe
    • (tries to bridge the gap between the multiindex+sparse representation and the adjacency-table representation in the discussion)
  • conversion between formats is handled by pandas.sparse, so the matrix algebra still happens through sparse (instead of, e.g., groupby/apply for the lag)
    • this means ensuring the sorting of the matrix before/after conversions, which is handled by controlling the sorting of the adjlist multiindex
    • lets us sort the weights (e.g. to re-align the matrix representation with an input), or rename them easily
    • the multiindex weights series can go directly to sparse (or from it)
  • weights transforms are handled by pandas transforms
    • makes the code simpler than apply in this case, and apparently transform can now use numba under the hood?
    • stored as additional columns (as necessary), one of which is the "active" weights column, similar to the active geometry column in a geodataframe
  • existing attributes like the neighbors/weights dicts and their getters/setters are preserved as properties, but can be rewritten easily to ensure im/mutability of the underlying graph object one way or the other
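
A minimal sketch of the idea (illustrative only, not the PR code; binary weights for brevity): the neighbors/weights dicts become a MultiIndex Series, row-standardisation is a groupby transform, and `pandas.sparse` hands the series to scipy for the matrix algebra:

```python
import numpy as np
import pandas as pd

neighbors = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
weights = {"a": [1.0], "b": [1.0, 1.0], "c": [1.0]}

# adjacency as a MultiIndex Series: (focal, neighbor) -> weight
index = pd.MultiIndex.from_tuples(
    [(f, n) for f, ns in neighbors.items() for n in ns],
    names=["focal", "neighbor"],
)
adjacency = pd.Series(
    np.concatenate([np.asarray(w) for w in weights.values()]),
    index=index,
    name="weight",
)

# a weights transform ("R" = row-standardise) via a pandas transform;
# could be stored alongside the original as another column if needed
adjacency_r = adjacency / adjacency.groupby(level="focal").transform("sum")

# the MultiIndex series goes directly to scipy sparse via pandas.sparse;
# sort_labels controls the row/col ordering of the resulting matrix
w_sparse, rows, cols = (
    adjacency_r.astype("Sparse[float]")
    .sparse.to_coo(
        row_levels=["focal"], column_levels=["neighbor"], sort_labels=True
    )
)

# matrix algebra still happens through sparse, e.g. the spatial lag
y = np.array([1.0, 2.0, 3.0])
lag = w_sparse @ y
```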

I haven't touched any of the spint stuff, and hardly any of the existing adjlist stuff, but the vast majority of the test suite is passing (or the failures are immaterial, e.g. the wrong signature of a sparse class, but the appropriate format nonetheless). One thing that is definitely going wrong is that KNN is returning too many neighbors, but I wanted to get some feedback on this approach before digging into it.

@martinfleis martinfleis changed the base branch from master to main February 27, 2023 08:44
@knaaptime knaaptime mentioned this pull request May 25, 2023
knaaptime (Member, Author) commented May 25, 2023

I think we're pretty close, actually... IIRC, one of the last 'real' issues was getting the tree-cutting back in place so that KNN weights work, but I'm pretty sure @sjsrey took care of that. At a high level, I think there are a handful of decisions we need to make about backwards compatibility, and whether we do a full deprecation when we adopt the new structure.

  • do we want to make a decision on mutability of weights?
    • related: some of the caching works a bit differently in #154 (bump supported Python versions and correct lat2SW doctest), since we're using cached_property instead of managing the cache ourselves. This could lead to unexpected behavior: cached_property stashes the attribute when first computed (so it's effectively immutable), but since weights are currently mutable, the old/current version recomputes the cache when the W changes underneath
    • there are a few methods/properties that I kludged in just for consistency, but that we should probably remove to be explicit (like w.id_order)
  • there's one last thing when round-tripping a W through sparse. I can't remember it off the top of my head, but I remember thinking it's probably trivial. I think the symmetrize method needs a look, but that's all that comes to mind

np.testing.assert_array_equal(sparse.data, [1, 1, 1, 1, 0])
np.testing.assert_array_equal(sparse.row, [0, 1, 1, 2, 3])
np.testing.assert_array_equal(sparse.col, [1, 0, 2, 1, 3])
sparse = self.w_islands.to_sparse("bsr")
self.assertIsInstance(sparse, scipy.sparse._arrays.bsr_array)
self.assertIsInstance(sparse, scipy.sparse._bsr.bsr_matrix)
Member

I think the current way to indicate this is scipy.sparse.issparse(sparse), then sparse.format == "bsr". They're moving away from the matrix classes, so we should try to stick with the array stuff.
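
For instance, a quick sketch of that check:

```python
import numpy as np
from scipy import sparse

a = sparse.bsr_array(np.eye(3))
assert sparse.issparse(a)   # agnostic to matrix vs. array classes
assert a.format == "bsr"    # checks the specific storage format
```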

def _reset(self):
    """Reset properties."""
    self._cache = {}

def sort_neighbors(self, ids=None, inplace=False):
Member

I think that, if we support this, we should minimise its use.

I note that this is used a lot in the tests. I think what should be done is to implement __eq__ and the set operations, so that W1 == W2 looks at whether the graphs are isomorphic, W1 is W2 looks at the exact ordering of the data structures, W1 < W2 checks for subgraphs, etc. I am happy to implement this (a rough sketch of the idea follows).
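
A rough sketch of those semantics (hypothetical class; a label-preserving edge-set comparison standing in for a full isomorphism check):

```python
class W:
    def __init__(self, adjacency):
        # adjacency: dict mapping focal id -> {neighbor id: weight}
        self.adjacency = adjacency

    def _edges(self):
        return {
            (f, n, w)
            for f, ns in self.adjacency.items()
            for n, w in ns.items()
        }

    def __eq__(self, other):
        # same labelled edges, regardless of storage order
        return self._edges() == other._edges()

    def __lt__(self, other):
        # proper subgraph: every edge of self appears in other
        return self._edges() < other._edges()
```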

knaaptime (Member, Author)

I vote we toss it. It doesn't mean anything in the new structure, so keeping it around for legacy is misleading at best.

100% on the equals method; that sounds great.

knaaptime (Member, Author)

I think, the way I've done things here, the sort is required to guarantee the ordering of the rows/cols when we convert to a sparse matrix. IIRC, without sorting there's no way to know what order the neighbors come in (the second level of the multiindex), so the values for non-binary weights can be incorrect (even though the neighbor set is still correct). If there's a better way to do that, we can ditch the sort entirely; otherwise we may still need it privately.

Conceptually there's obviously no reason to 'sort' the graph, so I don't think we need to expose it publicly (and if you really needed to, you could grab the adjlist and sort it yourself). Either way, the equals method you're outlining is the proper way to test for ==.

Member

Hence, not supporting any sorting whatsoever will make sure that the input order is never broken, so long as we can avoid pandas's predilection for it... pysal/libpysal@geographs/libpysal/weights/experimental/base.py#L216

knaaptime (Member, Author)

Yeah, but I think I also ran into something strange where I needed to be explicit. I think islands can also induce some weirdness the way I did it. If you have an empty neighbor set in the input dict, the island(s) have to be inserted back in manually, or something like that.

ljwolf (Member) commented May 26, 2023

Right. For the "fresh" implementation, I think we should go with a self-weight of zero for islands, and provide .islands as a convenience method, not an expression of state. We have .sparse, .neighbors/.weights, and .adjacency that we want to keep "aligned" with input order, and we've agreed that .adjacency is the canonical state of the graph.

To achieve this, we can deal with islands directly in our core data structure or as state in our W object that is referenced every time we work with .adjacency.

I see three options. I prefer self weight of zero. It's simpler, and consistent across all expressions of the graph.

A: self weight of zero

We might use islands with a self-weight of zero. Many weights formats already support self-loops, adjacency lists use this representation now in libpysal, and we need to start testing this in our stats... This "induces" an edge for islands, yes, but this edge:

  • does not affect any of the linear algebra/csgraph results
  • has a very minor effect on their speed
  • ensures the edge table stores all observations explicitly
  • every derivative of that edge table (.sparse, .neighbors/.weights) will store all observations explicitly without special casing.

We get round-tripping, internal consistency, and downstream consistency for free (see the sketch below).
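
As a quick illustration (mirroring the w_islands test context quoted above, where observation 3 is an island with an explicit zero self-loop), the zero edge contributes nothing to the lag:

```python
import numpy as np
from scipy import sparse

row = np.array([0, 1, 1, 2, 3])
col = np.array([1, 0, 2, 1, 3])
data = np.array([1.0, 1.0, 1.0, 1.0, 0.0])  # (3, 3) is the zero self-loop
w = sparse.coo_array((data, (row, col)), shape=(4, 4))

y = np.array([2.0, 4.0, 6.0, 8.0])
print(w @ y)  # [4. 8. 4. 0.] -- the island's lag is zero, nothing breaks
```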

B: data structure sentinel

The other way to represent this in the data structure is to use a custom "sentinel" value that depends on the container type: {island: set()} or {island: []} for the neighbors dict (which makes sense for that expression of the graph), a self-weight of numpy.nan in the adjacency table (like "missing data"), and sizing the sparse array as (n, n) when constructing it. This has the same drawbacks as above, but looks more "normal" in each data structure. We'd then also have to be aware, downstream, that:

  • any sparse array produced may omit observations from sparse.nonzero(). We have to check every time that every element of numpy.arange(n) is somewhere in sparse.nonzero().
  • any neighbor dictionary cannot be safely indexed into as neighbors[i][j]; every indexing into [] can raise an error.
  • we have to keep track of each sentinel value for output types.

This also makes it possible to round-trip, but it is slightly harder because of the implicit handling in .sparse.

C: state in the W object

If we separate islands from the data structure and deal with them as object state, then we always have to refer to that state when using the data structure. This is challenging, for example, because:

  • we can't assume that .sparse, .adjacency, or .neighbors is a complete representation of the graph W
  • we can't ensure that .sparse is aligned with inputs without W

Round-tripping is impossible unless you implement islands in the data structure (options A, B).

Member

I'm strongly in favour of option A, as it ensures we keep everything in one place. One of the core issues of the old W is that we have way too many objects that we need to keep in sync. Let's simplify this as much as we can.

Member

+1 on self-weight of 0.

knaaptime (Member, Author)

I think we're unanimous here, then. I think the question for the fresh implementation is what happens when a user creates a W with islands that doesn't follow that convention.

The workaround I went for here is for when a user instantiates a W from dicts, but the neighbor dict has an empty list. In prior versions, that gives you a W with an island. But if you try to parse that dict into a multiindex series, the empty list turns into a NaN and gets dropped (so I insert the self-loops back in manually; see the sketch below). In the new version, we'd raise an error?
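
The snag in miniature (a hypothetical reconstruction, not the PR code): the island's empty list simply produces no (focal, neighbor) pairs, so it has to be re-inserted as a zero self-loop:

```python
import pandas as pd

neighbors = {"a": ["b"], "b": ["a"], "c": []}  # "c" is an island

pairs = [(f, n) for f, ns in neighbors.items() for n in ns]
adjacency = pd.Series(
    1.0, index=pd.MultiIndex.from_tuples(pairs, names=["focal", "neighbor"])
)
# "c" has vanished from the series entirely, so insert it back manually
islands = set(neighbors) - set(adjacency.index.get_level_values("focal"))
self_loops = pd.Series(
    0.0,
    index=pd.MultiIndex.from_tuples(
        [(i, i) for i in islands], names=["focal", "neighbor"]
    ),
)
adjacency = pd.concat([adjacency, self_loops]).sort_index()
```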

Member

New builders will represent islands with the zero self-weight, so W should never "know" about it. We could write a _sanitize_islands() function or something to check for and correct this in the dict-formatted inputs?
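
One possible shape for that helper (hypothetical sketch, dict-formatted inputs only):

```python
def _sanitize_islands(neighbors, weights):
    """Replace empty neighbor sets with a zero self-weight (in place)."""
    for ids, nbrs in neighbors.items():
        if len(nbrs) == 0:
            neighbors[ids] = [ids]
            weights[ids] = [0.0]
    return neighbors, weights


# e.g.: {"c": []} becomes {"c": ["c"]} with weights {"c": [0.0]}
```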

knaaptime (Member, Author)

resolved by Graph :)

@knaaptime knaaptime closed this Jan 17, 2024