Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Multiple closest street network segments with get_network_id #124

Open
ghost opened this issue Nov 29, 2019 · 6 comments
Open

Multiple closest street network segments with get_network_id #124

ghost opened this issue Nov 29, 2019 · 6 comments
Labels
later Waiting for something else to happen before

Comments

@ghost
Copy link

ghost commented Nov 29, 2019

I was looking at the get_network_id function and found an issue.

During the nearest distance calculation

d = INFTY
for h in hits:
    new_d = p.distance(right.geometry.iloc[h])
    if d >= new_d:
        d = new_d
        nid = right[network_id].iloc[h]

information is lost if there are two geometries which have exactly the same distance from each other. I think this should at least be mentioned if that happens.

I stumbled over this problem by trying to improve the speed of the for loop by using pandas.DataFrame.apply and instead of pandas.DataFrame.iterrows. For the ´apply` I implemented two different functions, and that is where I found the issue.

My solution, which seems to be a bit faster for bigger dataframes looks like this:

def get_network_id(left, right, network_id, min_size=100, old_func=True):

    INFTY = 1000000000000
    MIN_SIZE = min_size
  
    left = left.copy()
    right = right.copy()

    buildings_c = left.copy()
    buildings_c["geometry"] = buildings_c.centroid  # make centroids
    
    # this creates the pbox from the previous code
    buildings_c['bounding_box'] = buildings_c.buffer(MIN_SIZE, cap_style=3).bounds.values.tolist()

    # Already implemented version, just as a function
    def f_old(x):
        print("Generating rtree...")
        idx = right.sindex
        d = INFTY
        nid = None
        hits = list(idx.intersection(x['bounding_box']))
        
        for h in hits:
            new_d = x['geometry'].distance(right.geometry.iloc[h])
            if d >= new_d:
                d = new_d
                nid = right[network_id].iloc[h]
        if nid is None:
            return np.nan
        else:
            return nid
    
    # New way, by using dictionaries
    def f_new(x):
        idx = right.sindex
        distances = {h:x['geometry'].distance(right.geometry.iloc[h]) for h in (idx.intersection(x['bounding_box']))}
        if distances:
            nid = right[network_id].iloc[min(distances, key=distances.get)]
            return nid
        else:
            return np.nan


    if old_func:
        series = buildings_c.apply(f_old, axis=1)
    else:
        series = buildings_c.apply(f_new, axis=1)


    return series

I was worried using the function with dictionaries was not working because the results were different than the old way, but when looking at the ´distances´ dict, I saw that it sometimes occurs that building centroids have exactly the same distance to the streets. Which makes sense, but that information should not be dropped.

@martinfleis
Copy link
Member

Hi @kvnkrmr,
thanks for the report. You are right about it. Which data did you use? It seems to be a very corner case which will not happen very often on the real world data. Anyway, it should at least raise a warning that this situation happened and which geometries were involved.

How faster is .apply in this case? I did a few benchmarks on different functions earlier and found out only a very minor difference. But I did not test this one. I did not focus on performance yet, a development was and still is focused on functionality primarily (this one especially, it is not a very polished code I must admit :)).

If you'd like to make a PR to tackle the warning and/or faster implementation, I'd be more than happy.

@ghost
Copy link
Author

ghost commented Dec 4, 2019

Hi @martinfleis,
I used the following data: From Geofabrik, I downloaded saarland-latest-free.shp.zip. From that zip file, I imported

gis_osm_roads_free_1.shp

as the roads and

gis_osm_buildings_a_free_1.shp

as the buildings. For this whole dataset, .apply is 1s faster. Not much, but I am trying to find out how the different functions scale with either more buildings or more roads. I will try to post my timing code and make a PR sometime in this week.

@martinfleis
Copy link
Member

Thanks! That is a huge dataset. I haven't tried as it shows 700 hours on my machine at the moment 🤣. I assume the difference of 1s is basically equal to nothing with geodataframes as large as these. What was the total time?

Testing all three versions using built-in bubenec dataset (which is tiny), the difference is minimal, sometimes even .apply being a bit slower than existing approach.

I am curious about the benchmarks on the larger data.

One idea regarding the cause of this issue - as it is highly unlikely that two different street segments will be at the exactly the same distance from building centroid, I assume that those affected segments are, in fact, duplicated. Which happens sometimes in OSM. Anyway, we should print a warning with ids explaining what happened.

Looking forward for timings and PR! Thanks!

@martinfleis
Copy link
Member

@kvnkrmr just checking the state of this. I want to release 0.1.1 with a few bugfixes, so I just want to understand if I should wait for this or leave for the next one. Thanks!

@ghost
Copy link
Author

ghost commented Dec 14, 2019

@martinfleis Hi, I couldn't find the time for this in the last week, so leave it for the next one. Sorry!

@martinfleis martinfleis added this to the 0.1.2 milestone Dec 15, 2019
@martinfleis martinfleis modified the milestones: 0.1.2, 0.2.0 Jan 8, 2020
@martinfleis
Copy link
Member

Just an update on this. GeoPandas will include sjoin_nearest, which will be then used within momepy, so this issue is now on hold to avoid duplication of work across packages.

@martinfleis martinfleis removed this from the 0.2.0 milestone Jan 5, 2021
@martinfleis martinfleis added the later Waiting for something else to happen before label Jan 5, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
later Waiting for something else to happen before
Projects
None yet
Development

No branches or pull requests

1 participant