Ensure `Graph.sparse` is robust enough #687
Is there a reproducible example we can use to test against? Round-tripping was a core design goal...
Yes, here:

```python
In [1]: import pandas as pd
   ...: from libpysal.graph import Graph

In [2]: path = "https://github.com/Urban-Analytics-Technology-Platform/demoland-engine/raw/3fba672d08d7d09bdb692c35594064dbcdd69662/data/tyne_and_wear/matrix.parquet"

In [3]: adjacency_pyarrow = pd.read_parquet(path, engine="pyarrow")["weight"]
   ...: adjacency_fastparquet = pd.read_parquet(path, engine="fastparquet")["weight"]

In [4]: pd.testing.assert_series_equal(adjacency_pyarrow, adjacency_fastparquet)

In [5]: graph_pyarrow = Graph(adjacency_pyarrow, "r", is_sorted=True)
   ...: graph_fastparquet = Graph(adjacency_fastparquet, "r", is_sorted=True)

In [6]: graph_pyarrow.equals(graph_fastparquet)
Out[6]: True

In [7]: sparse_pyarrow = graph_pyarrow.sparse
   ...: sparse_fastparquet = graph_fastparquet.sparse

In [8]: (sparse_pyarrow == sparse_fastparquet).todense().all()
Out[8]: False

In [9]: sparse_pyarrow.todense()[0][0]  # <- this is correct
Out[9]: 0.0

In [10]: sparse_fastparquet.todense()[0][0]  # <- this is wrong
Out[10]: 0.007518796992481203
```

I haven't looked into the pandas source code, but my guess is that it is because:

```python
In [11]: (adjacency_pyarrow.index.codes[1] == adjacency_fastparquet.index.codes[1]).all()
Out[11]: False

In [12]: adjacency_fastparquet.index.codes
Out[12]:
[array([   0,    0,    0, ..., 3794, 3794, 3794]),
 array([   0,    1,    2, ..., 3792, 2300, 3406])]

In [13]: adjacency_pyarrow.index.codes
Out[13]: FrozenList([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...], [3, 4, 14, 26, 28, 55, 59, 61, 78, 100, 104, 108, 115, 129, 131, 132, 138, 139, 150, 153, 155, 157, 158, 159, 160, 166, 198, 199, 204, 243, 245, 288, 290, 292, 295, 297, 298, 300, 302, 304, 430, 457, 461, 464, 465, 556, 559, 560, 561, 562, 563, 564, 565, 566, 568, 569, 601, 604, 607, 623, 625, 628, 629, 658, 688, 689, 690, 691, 692, 693, 696, 743, 744, 747, 752, 757, 761, 765, 767, 787, 841, 849, 853, 857, 864, 906, 916, 923, 924, 1120, 1124, 1164, 1282, 1291, 1353, 1357, 1565, 1566, 1572, 1625, ...]])
```

Therefore, my trust in pandas conversion to sparse is a bit lower than it was before, because everything else looks exactly the same. E.g.:

```python
In [14]: (graph_pyarrow.unique_ids == graph_fastparquet.unique_ids).all()
Out[14]: True
```
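The effect above is reproducible without parquet in the picture at all. A minimal sketch (labels and weights are made up for illustration): two MultiIndexes can compare equal label by label while storing different internal codes, and feeding those raw codes straight into a COO constructor then yields different matrices.

```python
import numpy as np
import pandas as pd
from scipy import sparse

# mi1 lets pandas factorize the labels itself; mi2 holds the same labels
# in the same order, but with the levels stored in a different internal
# order, so the codes differ.
mi1 = pd.MultiIndex.from_arrays([["a", "a", "b"], ["x", "y", "x"]])
mi2 = pd.MultiIndex(levels=[["b", "a"], ["y", "x"]],
                    codes=[[1, 1, 0], [1, 0, 1]])

print(mi1.equals(mi2))                       # True: same labels, same order
print((mi1.codes[0] == mi2.codes[0]).all())  # False: internal codes differ

# Using the raw codes as coordinates produces two different matrices
# from what is, label-wise, the same data.
weights = np.array([1.0, 2.0, 3.0])
m1 = sparse.coo_array(
    (weights, (np.asarray(mi1.codes[0]), np.asarray(mi1.codes[1]))), shape=(2, 2)
)
m2 = sparse.coo_array(
    (weights, (np.asarray(mi2.codes[0]), np.asarray(mi2.codes[1]))), shape=(2, 2)
)
print((m1.todense() == m2.todense()).all())  # False
```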
The sparse representation seems to be inherently dependent on the way the MultiIndex was originally built, which is something we can't always control, hence we should figure out a more robust way of conversion, imo.
Well, it seems like the conversion to sparse works OK, but the creation of the MultiIndex is different between the two readers? Once you're moving from dataframe to sparse, the parquet backend is irrelevant, so there's something different about the way the index gets built. If we don't depend on the index, then we need to depend on the read order of the rows (since we default to a range index when no id column is available), so can we guarantee the rows always come back in the same order between the two?
I'm game to move to whatever works best, but I'm not sure we have the real source of the problem yet. Even with the current W architecture, we're dependent on pandas indices of some kind, so if those don't come back in the same order after IO, then we're in trouble. Even if you had your own graph ordering that doesn't come from a range index, we expect the order is preserved after IO... this is also why GAL and GWT exist in the first place. In the early days you couldn't guarantee ordering or index placement, so those files ended up defining a strict spec. I raise that only to say that graphs are special, and we might be forced to make some stricter decisions about how they are stored. If this does end up being an IO issue, I wonder if there's something we could do in the parquet metadata that's specific to a graph? We could also reset the index on write so that you just have regular columns, and make those a multiindex on read?
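The reset-the-index-on-write idea can be sketched without parquet involved at all; the `focal`/`neighbor`/`weight` names below are illustrative, not the library's confirmed schema. Flattening to plain columns on write and rebuilding the MultiIndex on read means the codes are recomputed from the label values rather than carried over from whatever the reader produced:

```python
import pandas as pd

# a toy adjacency series; the index and value names are illustrative
adjacency = pd.Series(
    [0.5, 0.5, 1.0],
    index=pd.MultiIndex.from_arrays(
        [["a", "a", "b"], ["b", "c", "a"]], names=["focal", "neighbor"]
    ),
    name="weight",
)

# on write: flatten the MultiIndex into ordinary columns
flat = adjacency.reset_index()  # columns: focal, neighbor, weight

# on read: rebuild the MultiIndex from the columns; set_index recomputes
# the codes from the label values, independent of any parquet engine
restored = flat.set_index(["focal", "neighbor"])["weight"]
print(restored.equals(adjacency))  # True
```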
But also... even if your Graph is static, if you redo your analysis and your geodataframe comes back reordered from fastparquet, it won't be aligned with the data any longer, so I think we need to raise the complexities of using a stored Graph.
No no, this is a misunderstanding. The order of the index is equal in both cases. Fastparquet does not reorder it. What is affected is some underlying pandas representation that influences the conversion to sparse. It could also be a bug in pandas we just found; I need to look further. But I'd just be more comfortable if I knew how the conversion to sparse works and had full control over it, which is not the case with the current method.
Using this custom code returns exactly the same behaviour as above:

```python
return sparse.coo_array(
    (
        self._adjacency,
        (
            self._adjacency.index.codes[0],
            self._adjacency.index.codes[1],
        ),
    ),
    shape=(self.n, self.n),
)
```

What fixes it though, is passing

My suggestion is to expand the docstring of

Note that the issue shown above was not caused by libpysal's functionality (e.g. our IO) but by my attempt to go around it, to be able to deserialise a graph in WebAssembly, which does not have pyarrow but has fastparquet. It seems that the way of doing that is to trigger sorting.
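One way to take full control of the conversion, sketched below, is to skip `MultiIndex.codes` entirely and map the labels to positions ourselves. The `adjacency_to_coo` helper is hypothetical, and it assumes every node appears at least once on the focal level, which need not hold for isolates in general:

```python
import numpy as np
import pandas as pd
from scipy import sparse

def adjacency_to_coo(adjacency):
    """Build a COO array from a (focal, neighbor) -> weight Series
    without touching the MultiIndex's internal codes."""
    # order of first appearance on the focal level defines the positions
    ids = adjacency.index.get_level_values(0).unique()
    positions = pd.Series(np.arange(len(ids)), index=ids)
    rows = positions[adjacency.index.get_level_values(0)].to_numpy()
    cols = positions[adjacency.index.get_level_values(1)].to_numpy()
    n = len(ids)
    return sparse.coo_array((adjacency.to_numpy(), (rows, cols)), shape=(n, n))

# toy adjacency: a->a = 1, a->b = 2, b->a = 3
adjacency = pd.Series(
    [1.0, 2.0, 3.0],
    index=pd.MultiIndex.from_arrays([["a", "a", "b"], ["a", "b", "a"]]),
)
print(adjacency_to_coo(adjacency).todense())
# [[1. 2.]
#  [3. 0.]]
```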
I've been recently reading some adjacencies with `fastparquet` instead of `pyarrow` and noticed that the resulting sparse array is different even though the Series are considered equal. The reason is that the MultiIndex is created differently, leading to different underlying `MultiIndex.codes`, which, I suppose, somehow changes the conversion to sparse. We need to revise this and ensure that the way pandas creates sparse is robust enough and always yields a properly sorted array, or deploy our own way of generating sparse (which we had at some point; I wrote it, so it will be in the commit history).

This is mostly a note to myself to check this before Graph leaves the experimental zone.
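If the eventual fix is to normalize rather than to rewrite the conversion, one option, sketched here as an assumption rather than a confirmed approach, is to rebuild the MultiIndex from its materialized labels before converting, so the codes are refactorized the same way regardless of which reader built the index:

```python
import pandas as pd

def normalize_index(series):
    # rebuild the MultiIndex from its label values; the resulting codes
    # then depend only on the labels, not on how the index was constructed
    arrays = [series.index.get_level_values(i) for i in range(series.index.nlevels)]
    out = series.copy()
    out.index = pd.MultiIndex.from_arrays(arrays, names=series.index.names)
    return out

# an index constructed with "scrambled" internal codes, mimicking what a
# different parquet engine might hand back
scrambled = pd.MultiIndex(levels=[["b", "a"], ["y", "x"]],
                          codes=[[1, 1, 0], [1, 0, 1]])
s = pd.Series([1.0, 2.0, 3.0], index=scrambled)

normalized = normalize_index(s)
print(list(normalized.index.codes[0]))  # [0, 0, 1] -- refactorized
print(normalized.equals(s))             # True -- labels unchanged
```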