Ensure `Graph.sparse` is robust enough #687
Is there a reproducible example we can use to test against? Round-tripping was a core design goal...
Yes, here:

```python
In [1]: import pandas as pd
   ...: from libpysal.graph import Graph

In [2]: path = "https://github.com/Urban-Analytics-Technology-Platform/demoland-engine/raw/3fba672d08d7d09bdb692c35594064dbcdd69662/data/tyne_and_wear/matrix.parquet"

In [3]: adjacency_pyarrow = pd.read_parquet(path, engine="pyarrow")["weight"]
   ...: adjacency_fastparquet = pd.read_parquet(path, engine="fastparquet")["weight"]

In [4]: pd.testing.assert_series_equal(adjacency_pyarrow, adjacency_fastparquet)

In [5]: graph_pyarrow = Graph(adjacency_pyarrow, "r", is_sorted=True)
   ...: graph_fastparquet = Graph(adjacency_fastparquet, "r", is_sorted=True)

In [6]: graph_pyarrow.equals(graph_fastparquet)
Out[6]: True

In [7]: sparse_pyarrow = graph_pyarrow.sparse
   ...: sparse_fastparquet = graph_fastparquet.sparse

In [8]: (sparse_pyarrow == sparse_fastparquet).todense().all()
Out[8]: False

In [9]: sparse_pyarrow.todense()[0][0]  # <- this is correct
Out[9]: 0.0

In [10]: sparse_fastparquet.todense()[0][0]  # <- this is wrong
Out[10]: 0.007518796992481203
```

I haven't looked into the pandas source code, but my guess is that it is because:

```python
In [11]: (adjacency_pyarrow.index.codes[1] == adjacency_fastparquet.index.codes[1]).all()
Out[11]: False

In [12]: adjacency_fastparquet.index.codes
Out[12]:
[array([   0,    0,    0, ..., 3794, 3794, 3794]),
 array([   0,    1,    2, ..., 3792, 2300, 3406])]

In [13]: adjacency_pyarrow.index.codes
Out[13]: FrozenList([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...], [3, 4, 14, 26, 28, 55, 59, 61, 78, 100, 104, 108, 115, 129, 131, 132, 138, 139, 150, 153, 155, 157, 158, 159, 160, 166, 198, 199, 204, 243, 245, 288, 290, 292, 295, 297, 298, 300, 302, 304, 430, 457, 461, 464, 465, 556, 559, 560, 561, 562, 563, 564, 565, 566, 568, 569, 601, 604, 607, 623, 625, 628, 629, 658, 688, 689, 690, 691, 692, 693, 696, 743, 744, 747, 752, 757, 761, 765, 767, 787, 841, 849, 853, 857, 864, 906, 916, 923, 924, 1120, 1124, 1164, 1282, 1291, 1353, 1357, 1565, 1566, 1572, 1625, ...]])
```

Therefore, my trust in pandas conversion to sparse is a bit lower than it was before, because everything else looks exactly the same. E.g.:

```python
In [14]: (graph_pyarrow.unique_ids == graph_fastparquet.unique_ids).all()
Out[14]: True
```
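The effect above is reproducible without parquet in the picture at all. A minimal sketch (labels and weights are made up for illustration): two MultiIndexes can compare equal label by label while storing different internal codes, and feeding those raw codes straight into a COO constructor then yields different matrices.

```python
import numpy as np
import pandas as pd
from scipy import sparse

# mi1 lets pandas factorize the labels itself; mi2 holds the same labels
# in the same order, but with the levels stored in a different internal
# order, so the codes differ.
mi1 = pd.MultiIndex.from_arrays([["a", "a", "b"], ["x", "y", "x"]])
mi2 = pd.MultiIndex(levels=[["b", "a"], ["y", "x"]],
                    codes=[[1, 1, 0], [1, 0, 1]])

print(mi1.equals(mi2))                       # True: same labels, same order
print((mi1.codes[0] == mi2.codes[0]).all())  # False: internal codes differ

# Using the raw codes as coordinates produces two different matrices
# from what is, label-wise, the same data.
weights = np.array([1.0, 2.0, 3.0])
m1 = sparse.coo_array(
    (weights, (np.asarray(mi1.codes[0]), np.asarray(mi1.codes[1]))), shape=(2, 2)
)
m2 = sparse.coo_array(
    (weights, (np.asarray(mi2.codes[0]), np.asarray(mi2.codes[1]))), shape=(2, 2)
)
print((m1.todense() == m2.todense()).all())  # False
```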
The sparse representation seems to be inherently dependent on the way the MultiIndex was originally built, which is something we can't always control, hence we should figure out a more robust way of conversion, imo.
Well, it seems like the conversion to sparse works OK, but the creation of the MultiIndex is different between the two readers? Once you're moving from dataframe to sparse, the parquet backend is irrelevant, so there's something different about the way the index gets built. If we don't depend on the index, then we need to depend on the read order of the rows (since we default to a range index when no id column is available), so can we guarantee the rows always come back in the same order between the two?
I'm game to move to whatever works best, but I'm not sure we have the real source of the problem yet. Even with the current W architecture, we're dependent on pandas indices of some kind, so if those don't come back in the same order after IO, then we're in trouble. Even if you had your own graph ordering that doesn't come from a range index, we expect the order is preserved after IO... this is also why GAL and GWT exist in the first place. In the early days you couldn't guarantee ordering or index placement, so those files ended up defining a strict spec. I raise that only to say that graphs are special, and we might be forced to make some stricter decisions about how they are stored. If this does end up being an IO issue, I wonder if there's something we could do in the parquet metadata that's specific to a graph? We could also reset the index on write so that you just have regular columns, and make those a multiindex on read?
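The reset-the-index-on-write idea can be sketched without parquet involved at all; the `focal`/`neighbor`/`weight` names below are illustrative, not the library's confirmed schema. Flattening to plain columns on write and rebuilding the MultiIndex on read means the codes are recomputed from the label values rather than carried over from whatever the reader produced:

```python
import pandas as pd

# a toy adjacency series; the index and value names are illustrative
adjacency = pd.Series(
    [0.5, 0.5, 1.0],
    index=pd.MultiIndex.from_arrays(
        [["a", "a", "b"], ["b", "c", "a"]], names=["focal", "neighbor"]
    ),
    name="weight",
)

# on write: flatten the MultiIndex into ordinary columns
flat = adjacency.reset_index()  # columns: focal, neighbor, weight

# on read: rebuild the MultiIndex from the columns; set_index recomputes
# the codes from the label values, independent of any parquet engine
restored = flat.set_index(["focal", "neighbor"])["weight"]
print(restored.equals(adjacency))  # True
```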
But also... even if your Graph is static, if you redo your analysis and your geodataframe comes back reordered from fastparquet, it won't be aligned with the data any longer, so I think we need to raise the complexities of using a stored Graph.
No no, this is a misunderstanding. The order of the index is equal in both cases. Fastparquet does not reorder it. What is affected is some underlying pandas representation that influences the conversion to sparse. It could also be a bug in pandas we just found; I need to look further. But I'd just be more comfortable if I knew how the conversion to sparse works and had full control over it, which is not the case with the current method.
Using this custom code returns exactly the same behaviour as above:

```python
return sparse.coo_array(
    (
        self._adjacency,
        (
            self._adjacency.index.codes[0],
            self._adjacency.index.codes[1],
        ),
    ),
    shape=(self.n, self.n),
)
```

What fixes it though, is passing

My suggestion is to expand the docstring of

Note that the issue shown above was not caused by libpysal's functionality (e.g. our IO) but by my attempt to go around it, to be able to deserialise a graph in WebAssembly, which does not have pyarrow but has fastparquet. It seems that the way of doing that is to trigger sorting.
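One way to take full control of the conversion, sketched below, is to skip `MultiIndex.codes` entirely and map the labels to positions ourselves. The `adjacency_to_coo` helper is hypothetical, and it assumes every node appears at least once on the focal level, which need not hold for isolates in general:

```python
import numpy as np
import pandas as pd
from scipy import sparse

def adjacency_to_coo(adjacency):
    """Build a COO array from a (focal, neighbor) -> weight Series
    without touching the MultiIndex's internal codes."""
    # order of first appearance on the focal level defines the positions
    ids = adjacency.index.get_level_values(0).unique()
    positions = pd.Series(np.arange(len(ids)), index=ids)
    rows = positions[adjacency.index.get_level_values(0)].to_numpy()
    cols = positions[adjacency.index.get_level_values(1)].to_numpy()
    n = len(ids)
    return sparse.coo_array((adjacency.to_numpy(), (rows, cols)), shape=(n, n))

# toy adjacency: a->a = 1, a->b = 2, b->a = 3
adjacency = pd.Series(
    [1.0, 2.0, 3.0],
    index=pd.MultiIndex.from_arrays([["a", "a", "b"], ["a", "b", "a"]]),
)
print(adjacency_to_coo(adjacency).todense())
# [[1. 2.]
#  [3. 0.]]
```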
I've been recently reading some adjacencies with `fastparquet` instead of `pyarrow` and noticed that the resulting sparse array is different even though the Series are considered equal. The reason is that the MultiIndex is created differently, leading to different underlying `MultiIndex.codes`, which, I suppose, somehow changes the conversion to sparse. We need to revise this and ensure that the way pandas creates sparse is robust enough and always yields a properly sorted array, or deploy our own way of generating sparse (which we had at some point; I wrote it, so it will be in the commit history).

This is mostly a note to myself to check this before Graph leaves the experimental zone.
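If the eventual fix is to normalize rather than to rewrite the conversion, one option, sketched here as an assumption rather than a confirmed approach, is to rebuild the MultiIndex from its materialized labels before converting, so the codes are refactorized the same way regardless of which reader built the index:

```python
import pandas as pd

def normalize_index(series):
    # rebuild the MultiIndex from its label values; the resulting codes
    # then depend only on the labels, not on how the index was constructed
    arrays = [series.index.get_level_values(i) for i in range(series.index.nlevels)]
    out = series.copy()
    out.index = pd.MultiIndex.from_arrays(arrays, names=series.index.names)
    return out

# an index constructed with "scrambled" internal codes, mimicking what a
# different parquet engine might hand back
scrambled = pd.MultiIndex(levels=[["b", "a"], ["y", "x"]],
                          codes=[[1, 1, 0], [1, 0, 1]])
s = pd.Series([1.0, 2.0, 3.0], index=scrambled)

normalized = normalize_index(s)
print(list(normalized.index.codes[0]))  # [0, 0, 1] -- refactorized
print(normalized.equals(s))             # True -- labels unchanged
```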