
osm loader -- support for very large queries #41

Closed
knaaptime opened this issue May 20, 2015 · 8 comments

@knaaptime

Right now, very large OSM queries fail because Python runs out of memory before the network object can be written to disk.

Would it be possible to add an option in the OSM loader to stream the Overpass request to disk, allowing for really big networks?

@jiffyclub
Member

If you have a network that large won't you be unable to load it into memory anyway, even if it does make it to disk?

@knaaptime
Author

Well, the final network shouldn't be that big.

Before the OSM loader was added, @fscottfoti built me a network that covers the state of MD (an HDF5 file of about 50 MB). I want to create a similar network from scratch using the OSM loader, but when I try, e.g.:

from pandana.loaders import osm

# the result needs to be bound to `network` for the later calls to work
network = osm.network_from_bbox(37.8856, -79.4872, 39.7905, -74.9852,
                                network_type='walk', two_way=True)
lcn = network.low_connectivity_nodes(10000, 10, imp_name='distance')

network.save_hdf5('input/osmnetwork.h5', rm_nodes=lcn)

Python eats up all of the machine's RAM, then quits with a memory error. The same thing happens on an Amazon server with 80 GB of RAM. There's no way the network is actually that big. Is it possible that something weird is happening during the Overpass query?

Maybe something like this? http://stackoverflow.com/questions/16694907/how-to-download-large-file-in-python-with-requests-py
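
A minimal sketch of the streaming approach the linked answer describes, assuming the raw Overpass response is written straight to disk (the endpoint, query string, and filename here are placeholders for illustration, not the exact ones pandana uses):

import requests

# Placeholder Overpass endpoint and query.
url = 'http://overpass-api.de/api/interpreter'
query = '[out:json];...'

# stream=True keeps the response body out of memory; iter_content
# then copies it to disk in fixed-size chunks.
response = requests.post(url, data=query, stream=True)
response.raise_for_status()
with open('overpass_response.json', 'wb') as f:
    for chunk in response.iter_content(chunk_size=1024 * 1024):
        f.write(chunk)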

@jiffyclub
Member

The overpass request response is JSON so there's really no way to stream it. You have to have the entire document in order to parse it.

When I get a chance to work on this the first step will be to figure out exactly what's using all the memory, whether it's the query data or some subsequent step.
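
As a sketch of how that could be narrowed down, using the standard-library tracemalloc module (Python 3; on the Python 2.7 setup shown in the traceback below, memory_profiler would be the analogous tool, and run_suspect_step is a hypothetical stand-in for whichever stage is being measured):

import tracemalloc

tracemalloc.start()

# Run the stage under suspicion, e.g. the Overpass download or the
# DataFrame construction (hypothetical placeholder call).
result = run_suspect_step()

# Report the ten largest allocation sites after the stage completes.
snapshot = tracemalloc.take_snapshot()
for stat in snapshot.statistics('lineno')[:10]:
    print(stat)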

@fscottfoti
Contributor

BTW, I tried this out. First I tried a network covering the Bay Area: all nine counties, out roughly to Sacramento. It ran fine in only about 5 GB of memory and about 24 minutes. So that was great.

So I tried @knaaptime's query and did get a memory error on a 32 GB machine. The place it errored was very strange, though: in the pandas from_records call?

Traceback (most recent call last):
  File "go.py", line 3, in <module>
    network = osm.network_from_bbox(37.8856, -79.4872, 39.7905, -74.9852)
  File "/home/ubuntu/pandana/pandana/loaders/osm.py", line 312, in network_from_bbox
    lat_min, lng_min, lat_max, lng_max, network_type)
  File "/home/ubuntu/pandana/pandana/loaders/osm.py", line 202, in ways_in_bbox
    lat_min, lng_min, lat_max, lng_max, network_type=network_type)))
  File "/home/ubuntu/pandana/pandana/loaders/osm.py", line 178, in parse_network_osm_query
    pd.DataFrame.from_records(nodes, index='id'),
  File "/home/ubuntu/anaconda/lib/python2.7/site-packages/pandas/core/frame.py", line 888, in from_records
    columns)
  File "/home/ubuntu/anaconda/lib/python2.7/site-packages/pandas/core/frame.py", line 4808, in _arrays_to_mgr
    return create_block_manager_from_arrays(arrays, arr_names, axes)
  File "/home/ubuntu/anaconda/lib/python2.7/site-packages/pandas/core/internals.py", line 3555, in create_block_manager_from_arrays
    blocks = form_blocks(arrays, names, axes)
  File "/home/ubuntu/anaconda/lib/python2.7/site-packages/pandas/core/internals.py", line 3645, in form_blocks
    object_items, np.object_)
  File "/home/ubuntu/anaconda/lib/python2.7/site-packages/pandas/core/internals.py", line 3677, in _simple_blockify
    values, placement = _stack_arrays(tuples, dtype)
  File "/home/ubuntu/anaconda/lib/python2.7/site-packages/pandas/core/internals.py", line 3741, in _stack_arrays
    stacked = np.empty(shape, dtype=dtype)
MemoryError
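
For context, the failure here is in pandas consolidating the per-record arrays into blocks, which needs one more full-size allocation on top of the already-parsed records. A hedged sketch of a lower-memory alternative, assuming nodes is the parsed list of Overpass node dicts with 'id', 'lat', and 'lon' keys (an assumption about the parser's intermediate format):

import numpy as np
import pandas as pd

# Build each column as a typed array up front instead of handing
# from_records a list of dicts; this avoids the intermediate object
# arrays that pandas would otherwise have to re-stack into blocks.
n = len(nodes)
ids = np.fromiter((node['id'] for node in nodes), dtype=np.int64, count=n)
lats = np.fromiter((node['lat'] for node in nodes), dtype=np.float64, count=n)
lons = np.fromiter((node['lon'] for node in nodes), dtype=np.float64, count=n)

nodes_df = pd.DataFrame({'lat': lats, 'lon': lons},
                        index=pd.Index(ids, name='id'))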

@Eh2406

Eh2406 commented Mar 10, 2017

Note: the newish https://github.com/UDST/osmnet may be relevant here.

@sablanchard
Collaborator

Thanks @Eh2406, that is correct: the new OSMnet package fixes the issue with large bounding box queries: https://github.com/UDST/osmnet. The PR to replace the functions inside Pandana is still pending but will be merged soon, so in the meantime anyone can use OSMnet to extract the nodes and edges from OSM and then place them inside a Pandana network object. Will keep this issue open until the PR is merged and ready.
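
As a sketch of that interim workflow, assuming OSMnet's network_from_bbox returns node and edge tables with x/y, from/to, and distance columns (column names per the OSMnet docs; treat them as assumptions here):

import pandana
from osmnet.load import network_from_bbox

# Download and process the OSM street network with OSMnet instead of
# the built-in pandana loader.
nodes, edges = network_from_bbox(lat_min=37.8856, lng_min=-79.4872,
                                 lat_max=39.7905, lng_max=-74.9852,
                                 network_type='walk')

# Hand the resulting tables to a pandana network object.
network = pandana.Network(nodes['x'], nodes['y'],
                          edges['from'], edges['to'],
                          edges[['distance']])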

@knaaptime
Author

Thanks for the heads up. I noticed OSMnet when I saw UrbanAccess was released, and figured it would solve this issue.

I saw the PR get merged on Friday, so I ran the query again using OSMnet. It finished in about an hour on my current-generation MacBook Pro:

Downloaded OSM network data within bounding box from Overpass API in 40 request(s) and 907.23 seconds
657946 duplicate records removed. Took 160.33 seconds
Returning OSM data with 8,415,342 nodes and 774,600 ways...
Edge node pairs completed. Took 2,394.01 seconds
Returning processed graph with 986,809 nodes and 1,316,755 edges...
Completed OSM data download and Pandana node and edge table creation in 3,599.82 seconds

All looks good. Thanks for your great work.

sablanchard added a commit that referenced this issue Mar 14, 2017
@sablanchard
Collaborator

Fixed with #63
