error in AbsoluteClustering with geographic coordinate system #169
Thank you, Eli! I was having trouble replicating the API. Apologies if my jargon is imprecise, and thanks!
can you share the specific error you're hitting? it's good to keep in mind, though, that the function is just a convenience. Probably the "right" thing to do is to set the CRS with a bit more intention by finding an appropriate one to match your study area using a site like epsg.io. (though with that said, your use-case of looping through study areas is exactly what i was thinking of when i requested the function be added to geopandas in the first place :P)
This is likely a helpful reference: https://pyproj4.github.io/pyproj/stable/examples.html#find-utm-crs-by-latitude-and-longitude
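Roughly, following that example (the point coordinates here are just placeholders):

```python
from pyproj import CRS
from pyproj.aoi import AreaOfInterest
from pyproj.database import query_utm_crs_info

# find candidate UTM CRSs covering a point of interest
utm_crs_list = query_utm_crs_info(
    datum_name="WGS 84",
    area_of_interest=AreaOfInterest(
        west_lon_degree=-93.58,
        south_lat_degree=42.03,
        east_lon_degree=-93.58,
        north_lat_degree=42.03,
    ),
)
utm_crs = CRS.from_epsg(utm_crs_list[0].code)
```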
I would be careful using UTM projection for area: https://en.wikipedia.org/wiki/Mercator_projection
If you want the area, I would recommend using an equal area projection. EPSG:6933 is a potential candidate for an equal area projection you can use globally.
You might also consider using geodesic area: https://pyproj4.github.io/pyproj/stable/examples.html#geodesic-area
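For example, roughly following that page (the polygon coordinates are placeholders):

```python
from pyproj import Geod
from shapely.geometry import Polygon

# geodesic area works directly on lon/lat coordinates, no projection needed
geod = Geod(ellps="WGS84")
poly = Polygon([(-93.6, 42.0), (-93.5, 42.0), (-93.5, 42.1), (-93.6, 42.1)])
area, perimeter = geod.geometry_area_perimeter(poly)  # signed area in m^2
print(abs(area))
```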
that's why we tag the experts :) Thanks again.
and that's why you should set the CRS "intentionally". With our current code, you'll get a much closer estimate of the segregation statistic using UTM than geographic coords, but it still might not be correct. I was just looking at the geodesic area from the first link; looks pretty reasonable to adopt that approach under the hood here. I also think i remember @martinfleis considering an enhancement to use geodesic area under the hood in geopandas. Maybe the best thing to do would be to include two arguments: (1) force a projection, or (2) compute geodesic area under the hood.
It is super inefficient at the moment compared to pygeos, so it won't happen anytime soon. The best option here is to wrap S2 as sf does, but that would require a pygeos-like interface to S2. Reproject ;).
Thank you both for the help and suggestions. In my case, I am working at the municipality level, so I don't think the Mercator distortions are much of an issue as long as they affect all areas in a municipality to a similar extent, but I might be wrong on this. Since I am new to using geopandas and pysal/segregation, would you suggest starting clean with a new environment and proceeding with an installation from conda-forge for both? Should I start with geopandas or any other dependency first and then pysal/segregation? At the moment, I am working with jupyter in VSCode, and I am not sure whether the required packages/extensions might create some issues with geopandas and/or pysal/segregation.
done, we'll go with strategy 1 (force a projection) and leave it to the user to ensure that the one they've chosen is reasonable. @federicoricca if you start a new environment, then installing segregation from conda-forge should pull in geopandas and the rest of the dependencies for you.
Ok, so it took a bit more trouble than expected, but I am up and running now. I believe there are some conflicts between jupyter and the latest geopandas (fiona in particular), so I had to follow a specific order to avoid any errors on import: from a new environment, install jupyter first, and necessarily following that, geopandas/segregation from conda-forge. I believe the same would work for geopandas alone. Also, it does not work to just let the VSCode jupyter extension install the required packages; jupyter has to be installed manually and first, before geopandas/segregation. I am not sure why. I will try to replicate the API examples reprojecting, as well as apply that method to my data, and will let you know asap how it goes. Thanks again everyone! Just for reference, I am attaching the .txt of my finally-working environment: geo.txt
@knaaptime, I tried to replicate the API for SpatialProximity and AbsoluteClustering, with mixed results. For SpatialProximity, it is possible to recover the same statistic as the API by not reprojecting the data.
Reprojecting gives no error but a slightly different result.
Things get uglier for AbsoluteClustering. Without projecting, the computation raises errors and the result is off compared to the API. With the same projection as before, it gives no error but a negative index. In neither case is it possible to replicate the API results.
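For reference, the pattern I am running looks roughly like this (the file path and column names are placeholders for my data):

```python
import geopandas as gpd
from segregation.spatial import AbsoluteClustering, SpatialProximity

gdf = gpd.read_file("municipality_tracts.gpkg")  # placeholder path
projected = gdf.to_crs(gdf.estimate_utm_crs())   # reproject to a local UTM CRS

sp = SpatialProximity(projected, group_pop_var="group_pop",
                      total_pop_var="total_pop")
acl = AbsoluteClustering(projected, group_pop_var="group_pop",
                         total_pop_var="total_pop")
print(sp.statistic, acl.statistic)
```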
I will try to replicate the examples notebook in the repo and see if I get similar inconsistencies. Any idea what the issue might be here? Reprojecting does not seem to make it work either; the ACL should not be negative, as far as I understand. Thanks!
I could swear I was able to reproduce the API example before i emailed you back, but i can confirm i'm definitely seeing the same results as you now. I also see this show up in the new examples for version 2.0. We'll need to take a closer look at AbsoluteClustering and RelativeClustering; SpatialProximity looks like it's behaving as it should.
I remember that in the paper we developed (https://link.springer.com/article/10.1007/s42001-019-00059-3), Table 3 compares the values of our indices against OasisR. However, if I'm not mistaken, I developed Absolute Clustering after writing this section of the paper, so I didn't make the comparison. It would be cool to make this comparison again and compare ACL.
i think the issue has to do with the way the distance matrix is calculated.
@renanxcortes take a look at the new results at the bottom. This makes a small change in the clustering indices to use a libpysal weights object to calculate the distance matrix instead of using scikit.
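i.e., something along these lines (a sketch; `gdf` is assumed to be a projected GeoDataFrame):

```python
import numpy as np
from libpysal.weights import DistanceBand
from scipy.spatial.distance import pdist

coords = np.column_stack((gdf.centroid.x, gdf.centroid.y))
dmax = pdist(coords).max()  # max pairwise distance, so every pair is in-band

# binary=False returns continuous weights as a function of distance
# rather than 0/1 neighbor indicators
w = DistanceBand(coords, threshold=dmax, binary=False)
```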
Got it! Thanks! Also, I think these changes may also affect Distance Decay Isolation/Exposure (which also use scikit's distance calculations).
(Although, there is an issue in OasisR that I raised here: cran/OasisR@cc3681d)
@renanxcortes take another look at that notebook now that the distance calculations all use libpysal.W objects. In particular, dxInteraction and dxIsolation are now essentially equivalent to their aspatial counterparts. With exponential decay on tract-level data, "nearby" observations get discounted so quickly they barely have an effect (the difference is out at something like the 8th decimal), though if you use an inappropriate projection like 4326 you can see the function is working like it should. My guess is Massey and Denton used raw geographic coordinates to calculate DPxx in this table, which is why they get a reasonable difference from xPx (mislabelled in the table).
I was having a look at your modifications to the ACL. If I understand correctly, the ACL is now using a simple inverse-square decay rather than the exponential decay in Massey and Denton? Also, I was just wondering: if the DistanceBand binary is set to false, does the threshold really matter? I noticed you set it to the max distance in the code. Thanks!
ok, i spent some time with both massey/denton and the code today. I think a big part of the confusion comes from their description of the c matrix. they draw an analogy to a contiguity matrix (i.e. where cells describe affinity between i and j), but they actually compute a distance matrix, where cells describe dissimilarity between i and j, so we need to use 1-c to get the proper weights.
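for reference, the proximity version the formulas expect looks something like this (a sketch, assuming a projected `gdf`; Massey & Denton set the "own" distance to d_ii = (0.6 * a_i) ** 0.5):

```python
import numpy as np
from scipy.spatial.distance import cdist

coords = np.column_stack((gdf.centroid.x, gdf.centroid.y))
d = cdist(coords, coords)                             # pairwise distances
np.fill_diagonal(d, (0.6 * gdf.area.values) ** 0.5)   # Massey & Denton d_ii
c = np.exp(-d)  # proximity: near pairs get weights near 1, far pairs near 0
```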
in the most recent commit, i set it back to exponential decay
threshold is required if binary is false
this wasn't quite right; there was an error in my code earlier. i'm confident in these new numbers.
I was just playing around with distances and Spatial Proximities, and indeed it does look like, with exponential decay, the discount is just too extreme for NumPy to consider any c_{ij} different from 0. Distances are all given in meters; perhaps using kilometers instead would allow for a more meaningful weighting?
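A quick illustration of what I mean (values in the comments are approximate):

```python
import numpy as np

# with planar distances in meters, exp(-d) underflows to exactly 0.0
np.exp(-np.array([500.0, 5_000.0, 50_000.0]))  # -> [7.1e-218, 0.0, 0.0]

# the same distances expressed in kilometers keep meaningful weights
np.exp(-np.array([0.5, 5.0, 50.0]))            # -> [0.61, 6.7e-3, 1.9e-22]
```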
c_{ij} should be a proximity matrix, not a distance matrix as it's described in the paper. OasisR also operationalizes it (incorrectly) as a distance matrix.
Compare that with our estimates using the same dataset (on the 2.0 branch): notice OasisR gives a negative estimate for relative clustering, and its estimates for interaction and dxinteraction are the same.
i should also note the performance difference between us and OasisR is extreme :)
Thanks! Yes, I agree about the interpretation of C as a proximity matrix. However, my concern was mostly about the numerical limitations of exponential decay, which is inherently based on distance in 2.0 as well. The issue is that when coordinates are defined in meters, the values of distances and areas are simply too large and are all translated into literal zeros by np.exp(-d). I carefully re-read that spatial section in Massey and Denton, and there is one mention of *miles*. I believe measuring at a kilometric scale would give more meaningful weights. I did try the C proximity matrix from your new 2.0 SpatialProximity, and the weights assigned to nearby units are very small, even for contiguous units. I hope this helps!
agree, the exponential decay is fast (especially when you include all observations in the study area). Above, we get a difference between Interaction and DxInteraction... at the third decimal. an alternative would be to use one of the generalized spatial indices in the new 2.0 framework. Two benefits: you can choose exactly how you want to represent space by passing a libpysal.W, and you can see the influence of space for your chosen index, e.g. by looking at the multiscalar profile.
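e.g., any libpysal.W would do as the spatial representation (a sketch; `gdf` is your projected GeoDataFrame):

```python
from libpysal.weights import Kernel, Queen

# simple contiguity: neighbors share a border or vertex
w_queen = Queen.from_dataframe(gdf)

# smooth distance decay: adaptive gaussian kernel over the 10 nearest neighbors
w_kernel = Kernel.from_dataframe(gdf, function="gaussian", k=10, fixed=False)
```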
I have been experimenting with the ACL on my own data, and it shows quite inconsistent behavior. I think this has more to do with the ACL itself; the updated code is perfect. Using the same exponential decay in meters, as Massey and Denton suggest, the discounting is so steep that in a lot of instances the index equals 1. Using hundreds of meters or km, the exponential decay is a bit smoother, but that allows the index to be negative when the X group is highly segregated and a lot of units display x = 0. Have you ever noticed anything similar? I am not entirely sure how they can guarantee that the ACL will belong to [0, 1). They never explicitly discuss how to deal with units with zero group population. Perhaps the ACL is meant to be used considering only units with x > 0? Thank you for any suggestion/comment.
Cool! And with the new framework, the user can set any kind of "weights", which can comprise contiguity, exponential distance, etc.
@federicoricca can you double check using the latest commit in the 2.0 branch? (sorry, i thought i pushed that a couple days ago) I included a row-standardization line that i think will make a difference for you. Here's a histogram of stats for 20 different MSAs. You can see it does skew toward 0, but i'm not getting any negative values.
On a related note, the haversine function requires radians as arguments, and I think the current implementation here (and likely elsewhere) is passing degrees?
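For instance, with scikit-learn's implementation (the coordinates below are placeholders):

```python
import numpy as np
from sklearn.metrics.pairwise import haversine_distances

# haversine_distances expects [lat, lon] in *radians*; raw degrees pass
# through silently and produce wrong distances
coords_deg = np.array([[42.03, -93.58], [41.95, -93.47]])
d = haversine_distances(np.radians(coords_deg)) * 6_371_000  # meters
```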
Following up on @knaaptime's investigation of the c matrix: for the "own" distance, the code seems to differ from Massey and Denton, who have d_ii = (0.6 * a_i) ** 0.5.
in 2.0, i've opted to remove the haversine option, because we often need area anyway. Going forward, we operate exclusively on planar geoms and warn the user otherwise.
Sorry about that, my bad; I wasn't using the new version and found a typo. Thanks!
Great catch! Also, in the paragraph above that, White (1983) clearly states the use of distances in miles, while most likely the CRS will give distances in meters, with quite a different exponential decay, provided Massey and Denton follow White in that respect too.
I think we can mark this resolved as of #161
Definitely, and thank you for all the help!
awesome. thanks for raising the issue that led to all this digging! resolved by #161
discovered by @federicoricca
The root of the issue is that while we allow haversine distance when a user's data are stored in geographic coordinates, we still need the unit's area, and we're not accounting for that at the moment.
I think the best way to resolve this is to check for geographic coordinates when the index is initialized, raise an error saying it only works on a projected CRS, and provide a flag for forcing it through in case we've incorrectly identified a geographic CRS. Actually, it might be worth considering that approach elsewhere too.
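Something like this, roughly (a sketch of the proposed check; the names here are hypothetical):

```python
import geopandas as gpd

def _validate_projected(gdf: gpd.GeoDataFrame, force: bool = False) -> None:
    # refuse geographic coordinates unless the user explicitly overrides,
    # e.g. when the CRS has been misidentified
    if gdf.crs is not None and gdf.crs.is_geographic and not force:
        raise ValueError(
            "This index requires a projected CRS. Reproject with "
            "`gdf.to_crs(...)`, or pass `force=True` to proceed anyway."
        )
```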