
Intercept Coincident points #548

Merged
merged 37 commits into pysal:geographs on Aug 25, 2023

Conversation

ljwolf
Member

@ljwolf ljwolf commented Aug 10, 2023

This is my start at intercepting coincident points. I still need to

  • implement the jitter solution
  • re-order the new_adj table according to the derived original ids
  • (possibly) implement these checks in _triangulation.py as a wrapper function? I think we could do this easily by intercepting the input & output of the wrapped function.
  • implement for knn weights
  • tests

I will hold off on implementing these for _distance.py until @martinfleis's changes land and we decide if the decorator strategy is the way to go.
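The decorator strategy could look roughly like the sketch below: collapse coincident points to unique locations, triangulate those, then re-expand the adjacency onto the original ids. Every name here (`intercept_coincident`, `fake_triangulator`) is a hypothetical illustration, not libpysal code:

```python
import functools
import numpy

def intercept_coincident(func):
    """Sketch: dedupe coincident points around the wrapped triangulator."""
    @functools.wraps(func)
    def wrapper(coordinates, **kwargs):
        coordinates = numpy.asarray(coordinates)
        unique, inverse = numpy.unique(coordinates, axis=0, return_inverse=True)
        inverse = inverse.ravel()  # guard against numpy 2.0's shaped inverse
        heads, tails = func(unique, **kwargs)
        # re-expand: each original point inherits the links of its unique
        # representative (a fuller version would also link duplicates together)
        new_heads, new_tails = [], []
        for i, u in enumerate(inverse):
            for h, t in zip(heads, tails):
                if h == u:
                    for j in numpy.flatnonzero(inverse == t):
                        new_heads.append(i)
                        new_tails.append(j)
        return numpy.asarray(new_heads), numpy.asarray(new_tails)
    return wrapper

@intercept_coincident
def fake_triangulator(coords):
    # stand-in for _delaunay: link every unique point to every other one
    n = len(coords)
    return numpy.nonzero(~numpy.eye(n, dtype=bool))
```

With two coincident points at the origin, the wrapper triangulates the two unique locations and both duplicates inherit the resulting links.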

@ljwolf
Member Author

ljwolf commented Aug 11, 2023

I've implemented the jitter solution using displacements in a circle with a radius equal to the resolution of the input dtype (if a float), or the resolution of float32 (1e-6) if the inputs are integers or float16. This should be the smallest possible movement required to de-dupe points.

After some simulations, I don't see any improvement in the agreement of _delaunay() for smaller displacements (down to 1e-22): for anything at or below this resolution value, the agreement between any two jittered tessellations is about 68%, which is not very stable.
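The displacement scheme described above might be sketched like this (a hypothetical illustration, not the actual libpysal code; `numpy.finfo(numpy.float32).resolution` is 1e-6, matching the fallback radius mentioned):

```python
import numpy

def jitter(coordinates, seed=None):
    """Sketch (not libpysal's implementation): displace each point uniformly
    within a circle whose radius is the resolution of the input dtype if it
    is float32/float64, falling back to float32's resolution (1e-6) for
    integer or float16 input."""
    coordinates = numpy.asarray(coordinates)
    is_wide_float = (
        numpy.issubdtype(coordinates.dtype, numpy.floating)
        and coordinates.dtype.itemsize >= 4
    )
    radius = numpy.finfo(coordinates.dtype if is_wide_float else numpy.float32).resolution
    rng = numpy.random.default_rng(seed)
    angles = rng.uniform(0, 2 * numpy.pi, len(coordinates))
    # sqrt makes the displacement uniform over the disc, not clustered at the center
    radii = radius * numpy.sqrt(rng.uniform(0, 1, len(coordinates)))
    offsets = numpy.column_stack((radii * numpy.cos(angles), radii * numpy.sin(angles)))
    return coordinates + offsets
```

For float64 input, every point moves by at most 1e-15, so two coincident points become distinct with probability one while everything else is numerically unchanged.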

import geopandas, geodatasets, numpy
from libpysal.graph._triangulation import _delaunay
from tqdm import trange

gdf = (
    geopandas.read_file(geodatasets.get_path("geoda groceries"))
    .explode(index_parts=False)
    .reset_index(drop=True)
)

coincident = geopandas.pd.concat((gdf.head(5), gdf.head(2), gdf.head(1)), axis=0).reset_index(drop=True)
coincident.index = (
    coincident.index.astype(str).str.rjust(3, '0') 
    + '-' 
    + coincident.Chain.str.replace(" ", "").str.replace("/", "")
    )

jgraph = _delaunay(coincident.geometry, coincident='jitter')

def iou(a,b, normalized=True):
    n_together = len(a.intersection(b))
    if normalized:
        return n_together / len(a.union(b))
    else:
        return n_together

reps = [set(zip(*jgraph[0:2]))] + [set(zip(*_delaunay(coincident.geometry, coincident='jitter')[0:2])) for _ in trange(999)]
sims_lt = [iou(reps[i], reps[j]) for i in trange(1000) for j in range(i+1)]
sims = numpy.empty((1000,1000))
sims[numpy.tril_indices(1000)] = sims_lt
sims += sims.T
numpy.fill_diagonal(sims, 1)

[Figure 1: pairwise agreement (IoU) between the 1000 jittered tessellations]

@ljwolf ljwolf changed the title [WIP] start intercepting coincident points Intercept Coincident points Aug 11, 2023
@ljwolf ljwolf added the graph label Aug 11, 2023
@codecov

codecov bot commented Aug 11, 2023

Codecov Report

Merging #548 (ca5ec9a) into geographs (e579820) will decrease coverage by 0.1%.
The diff coverage is 56.5%.


@@             Coverage Diff             @@
##           geographs    #548     +/-   ##
===========================================
- Coverage       80.8%   80.7%   -0.1%     
===========================================
  Files            125     125             
  Lines          14351   14405     +54     
===========================================
+ Hits           11595   11628     +33     
- Misses          2756    2777     +21     
Files Changed Coverage Δ
libpysal/graph/_utils.py 41.0% <35.7%> (-1.7%) ⬇️
libpysal/graph/_triangulation.py 91.8% <62.5%> (-7.4%) ⬇️
libpysal/graph/base.py 97.9% <100.0%> (ø)
libpysal/graph/tests/test_base.py 100.0% <100.0%> (ø)
libpysal/graph/tests/test_builders.py 100.0% <100.0%> (ø)

... and 1 file with indirect coverage changes

@ljwolf
Member Author

ljwolf commented Aug 11, 2023

Great that we're doing kernel as a thin class. This makes basically all of _triangulation.py possible to implement as just building a sparse distance matrix. I have very good feelings about this API design!

ljwolf and others added 4 commits August 25, 2023 16:41
Co-authored-by: James Gaboardi <jgaboardi@gmail.com>
@martinfleis
Member

Just for reference, this is my solution to sorting.

df = pd.DataFrame({"focal": heads, "neighbor": tails})
# df = df.sample(frac=1).reset_index(drop=True)
sorted_index = (
    df.applymap(coincident.index.unique().tolist().index)  # label -> position in the unique index
    .sort_values(["focal", "neighbor"])
    .index
)
return heads[sorted_index], tails[sorted_index], weights[sorted_index]
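For reference, a vectorized equivalent that avoids per-element applymap (which is deprecated in recent pandas) could use categorical codes; the `order`, `heads`, `tails`, and `weights` below are toy stand-ins, not the PR's data:

```python
import numpy
import pandas as pd

order = ["b", "a", "c"]  # stand-in for coincident.index.unique()
heads = numpy.array(["a", "c", "b", "a"])
tails = numpy.array(["b", "a", "a", "c"])
weights = numpy.array([1.0, 2.0, 3.0, 4.0])

# map each label to its position in `order`, then lexsort by (focal, neighbor)
focal_codes = pd.Categorical(heads, categories=order).codes
neighbor_codes = pd.Categorical(tails, categories=order).codes
sorted_index = numpy.lexsort((neighbor_codes, focal_codes))
heads, tails, weights = heads[sorted_index], tails[sorted_index], weights[sorted_index]
```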

@ljwolf ljwolf merged commit 3c686b7 into pysal:geographs Aug 25, 2023
0 of 7 checks passed
@martinfleis
Member

martinfleis commented Aug 25, 2023

@ljwolf have you intentionally removed all the kernel functionality from triangulation here? it is all in the decorator...

@ljwolf
Member Author

ljwolf commented Aug 25, 2023

it's all in the decorator

Yep. I think a similar approach would be useful for KNN/distance band. Rather than doing knn truncation in kernel for precomputed weights, we make knn and distance band similar to the triangulators: ways to build a generic distance matrix, which then gets passed to kernel for weighting.

it keeps the behaviour very consistent, and reduces the amount of unique class/constructor-based code.
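That pipeline (distance builder feeding a kernel weighter) can be sketched with scipy; the function names and signatures here are hypothetical illustrations, not the libpysal API:

```python
import numpy
from scipy import sparse
from scipy.spatial import cKDTree

def knn_distances(coordinates, k=2):
    """Hypothetical 'distance builder' step: a sparse matrix holding the
    distance from each point to its k nearest neighbours."""
    coordinates = numpy.asarray(coordinates)
    tree = cKDTree(coordinates)
    dist, idx = tree.query(coordinates, k=k + 1)  # k+1 because each point matches itself
    n = len(coordinates)
    rows = numpy.repeat(numpy.arange(n), k)
    return sparse.csr_array(
        (dist[:, 1:].ravel(), (rows, idx[:, 1:].ravel())), shape=(n, n)
    )

def gaussian_kernel(distance_matrix, bandwidth=1.0):
    """Hypothetical weighting step: apply a kernel to the stored distances,
    leaving the sparsity structure untouched."""
    weighted = distance_matrix.copy()
    weighted.data = numpy.exp(-0.5 * (weighted.data / bandwidth) ** 2)
    return weighted

# the two steps compose: any distance builder can feed any kernel
coords = numpy.array([[0.0, 0.0], [1.0, 0.0], [3.0, 0.0], [6.0, 0.0]])
weights = gaussian_kernel(knn_distances(coords, k=1), bandwidth=1.0)
```

The same `gaussian_kernel` step would accept a precomputed sparse distance matrix directly, which is the consistency the comment above is after.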

@knaaptime
Member

that's ideal because it allows

Graph.build_kernel(gdf, threshold=2000, kernel='gaussian') and

Graph.from_sparse(distance_matrix).build_kernel(threshold=2000, kernel='gaussian'),

where distance_matrix is a travel time matrix from routingpy or pandana

@ljwolf
Member Author

ljwolf commented Aug 25, 2023

Indeed, already can with _kernel(distance_matrix, metric="precomputed", kernel="gaussian") so I think Graph.build_kernel(distance_matrix, metric="precomputed", kernel="gaussian") would work.

The simplification would be that kernel wouldn't take a "k" argument, because that is a knn-style operation, so Graph.build_kernel(distance_matrix, k=5, kernel="gaussian") becomes Graph.build_knn(distance_matrix, k=5, kernel="gaussian")

@martinfleis
Member

I am not sure if I am a big fan of this decorator approach. It does not result in the most legible code.

Graph.from_sparse(distance_matrix).build_kernel(threshold=2000, kernel='gaussian')

This would result in an inconsistent API. build_kernel builds a Graph from an external object, not self. What you are looking for is transform(), which should support a custom kernel. What we have there now is a subset of kernels anyway.

@knaaptime
Member

got it, so the appropriate way is

Graph.build_kernel(distance_matrix, metric="precomputed", threshold=2000, kernel="gaussian")

equivalent to

Graph.build_kernel(gdf, threshold=2000, kernel="gaussian")

building kernel from the sparse versus building kernel from the implied distances in the gdf.coordinates

@martinfleis
Member

Yes, that is how it currently works. But I can also imagine

Graph.from_sparse(distance_matrix).transform(transformation="kernel", threshold=2000, kernel='gaussian')

@knaaptime
Member

I think we need a from_adjacency. Currently the only way to build from a pandas adjlist is to give that directly to the init.

@martinfleis
Member

I think we need a from_adjacency. Currently the only way to build from a pandas adjlist is to give that directly to the init.

or using from_arrays(df['focal'], df['neighbor'], df['weight']) but feel free to write from_adjacency as well. That surely won't hurt.

@knaaptime
Member

or using from_arrays(df['focal'], df['neighbor'], df['weight'])

that assumes they are sorted the same way

@knaaptime
Member

(the within-focal sorting of the neighbors may be different for each focal, so you won't get the right sparse)
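A tiny illustration of that pitfall (toy data, not the PR's): the same adjacency listed with a different within-focal neighbor order disagrees positionally, but sorting on (focal, neighbor) canonicalizes it:

```python
import pandas as pd

# the same adjacency, with neighbors of focal "a" listed in a different order
adj1 = pd.DataFrame(
    {"focal": ["a", "a", "b"], "neighbor": ["b", "c", "a"], "weight": [1.0, 2.0, 3.0]}
)
adj2 = pd.DataFrame(
    {"focal": ["a", "a", "b"], "neighbor": ["c", "b", "a"], "weight": [2.0, 1.0, 3.0]}
)

# naive positional comparison of the weight columns disagrees...
assert not adj1["weight"].equals(adj2["weight"])

# ...but canonical (focal, neighbor) sorting recovers the same adjacency
canon1 = adj1.sort_values(["focal", "neighbor"]).reset_index(drop=True)
canon2 = adj2.sort_values(["focal", "neighbor"]).reset_index(drop=True)
assert canon1.equals(canon2)
```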

@martinfleis
Member

True. Then the constructor method may be a wise thing to do.

martinfleis added a commit that referenced this pull request Aug 25, 2023
@knaaptime
Member

knaaptime commented Aug 25, 2023

update: looks like I was using old geographs code. This mostly works now (see the updated gist; the preview refuses to update, so click the actual link)

https://gist.github.com/knaaptime/60bcf942508a47f6028ab1bfd1f3fa26

The only thing is that the user has to know how to reshape the adjlist into a matrix and reindex the rows/cols.
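That manual reshaping might look something like the following (hypothetical data and labels; `pivot` + `reindex` + `fillna` is one way to go from an adjlist to a dense matrix aligned on a single label order):

```python
import numpy
import pandas as pd

# hypothetical travel-time adjacency list, e.g. from routingpy or pandana
adj = pd.DataFrame({
    "origin":      ["b", "a", "a", "c", "b", "c"],
    "destination": ["a", "b", "c", "a", "c", "b"],
    "cost":        [10.0, 10.0, 25.0, 25.0, 15.0, 15.0],
})

labels = ["a", "b", "c"]  # the row/column order the Graph expects
distance_matrix = (
    adj.pivot(index="origin", columns="destination", values="cost")
       .reindex(index=labels, columns=labels)  # align both axes to one order
       .fillna(0.0)                            # missing pairs -> no edge
       .to_numpy()
)
```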

@martinfleis
Member

I think it is a good idea to have a constructor where we do this sorting ourselves. We just need to be clear that this happens, as we won't have the original df as a reference as with the build_* constructors.

@knaaptime
Member

knaaptime commented Aug 25, 2023

I'm happy to add a _from_adjlist constructor, because I think we should have one anyway, and it can include the reindexing logic, which should be pretty costless.

But this still doesn't solve the problem raised above. To get a kernel weight based on precomputed network distances, a user still has to manually convert an adjacency to a matrix. Technically you can do the signature

Graph.build_kernel(distance_matrix, metric="precomputed", threshold=2000, kernel="gaussian"), but it's hard for us to help you get an adjlist into the right distance_matrix unless we allow something like Martin's transform

Graph.from_sparse(distance_matrix).transform(transformation="kernel", threshold=2000, kernel='gaussian'). I'm not sure I love that convention, but I don't have a better idea at the moment.

With the current API, it's hard for us to know that an adjlist that needs to be reindexed has been passed.

@knaaptime
Member

I think it is a good idea to have a constructor where we do this sorting ourselves. We just need to be clear that this happens, as we won't have the original df as a reference as with the build_* constructors.

As of this commit, we're always reordering the inner index (we take the row index as given and reorder the column index in the sparse), right? So I think that's just a stipulation of the Graph in general?

@martinfleis
Member

That function is not used yet. I need to look into that and figure out if that's actually true in all cases. I think it is but don't want to say it for sure before testing.

@knaaptime
Member

OK, let me know what you find. After working through this in #513, I ended up pretty confident that we always need to reorder columns to match row order.
