


Vector Zonal Stats: Colab crashing due to RAM consumption #71

Closed
Tracked by #42
alronlam opened this issue Jul 4, 2022 · 6 comments

Comments

alronlam (Contributor) commented Jul 4, 2022

Colab notebook for testing:
https://colab.research.google.com/drive/147HWUgaBztsZuBPrI_HTckBrz_vl9l1l#scrollTo=wvLenjgDUgod

Scenario

  • Created AOI grid tiles for the Subang Regency in Indonesia (~36k 250 m x 250 m grid tiles)
  • Tried to get the average population density per tile using HRSL vector data (the CSV file is around 1.8 GB)

Error
Colab crashes due to exceeding the RAM limit.
[Screenshot: gw_vzs_hrsl]

Just creating this issue to check if there are straightforward ways to optimize. Otherwise, are there workarounds for handling relatively large vector datasets like this (e.g., something like the chunked-read sketch below)?
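
For context, here's a minimal sketch of the kind of chunked, filter-early workaround I have in mind. The file paths and the latitude/longitude column names are placeholder assumptions, not the actual HRSL schema.

```python
import geopandas as gpd
import pandas as pd

# Hypothetical file containing the ~36k Subang grid tiles.
aoi = gpd.read_file("subang_grid_tiles.geojson")
minx, miny, maxx, maxy = aoi.total_bounds

filtered_chunks = []
for chunk in pd.read_csv("hrsl_indonesia.csv", chunksize=1_000_000):
    # Drop points outside the AOI bounding box early, before they
    # can accumulate in memory.
    mask = (
        chunk["longitude"].between(minx, maxx)
        & chunk["latitude"].between(miny, maxy)
    )
    filtered_chunks.append(chunk[mask])

points = pd.concat(filtered_chunks, ignore_index=True)
hrsl_gdf = gpd.GeoDataFrame(
    points,
    geometry=gpd.points_from_xy(points["longitude"], points["latitude"]),
    crs="EPSG:4326",
)
```

This only helps when the AOI is much smaller than the dataset's footprint, which is the case here for Subang vs. all of Indonesia.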

tm-kah-alforja mentioned this issue Jul 4, 2022
butchtm (Collaborator) commented Jul 4, 2022

Hi @alronlam
I'm trying to replicate the issue, but your Colab notebook is missing the dataset 'indonesia-osm-pois-2020.csv'. Could you provide a link to a copy of the dataset we could download?

butchtm (Collaborator) commented Jul 4, 2022

Hi @alronlam, the Ookla dataset 'indonesia-ookla-2020-q1-fixed.csv' is missing as well.

alronlam (Contributor, Author) commented Jul 4, 2022

Oh hi @butchtm, the link to the GDrive folder for these files is in the topmost part of the notebook.

alronlam (Contributor, Author) commented Jul 4, 2022

Additional detail: I think I also ran into this RAM issue when aligning with the raw Ookla dataset: https://registry.opendata.aws/speedtest-global-performance/

I tried utilizing the latest fixed-line data from Ookla.

My workaround was to utilize an older, filtered version of the data that covered Indonesia only (because the raw data is for the whole world).

So I guess one principle here is that we should always filter feature datasets as much as possible before aligning them to the AOIs, to avoid such issues (see the sketch below). But in the case of HRSL, the data is already for Indonesia alone. I'm not sure what else we can do to make it work for such big datasets (some kind of parallel processing?). Or in these cases, are we forced to use other tools like BQ?
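
To make the filter-first principle concrete, here's a sketch of clipping the global Ookla tiles to an Indonesia bounding box at read time. The local file path and the bbox values are illustrative assumptions.

```python
import geopandas as gpd

# Rough lon/lat bounding box covering Indonesia (illustrative values).
indonesia_bbox = (95.0, -11.0, 141.0, 6.0)

# The bbox filter is applied while reading, so the worldwide dataset
# is never fully loaded into memory.
ookla = gpd.read_file(
    "gps_fixed_tiles.shp",  # hypothetical local copy of the Ookla tiles
    bbox=indonesia_bbox,
)
```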

butchtm (Collaborator) commented Jul 4, 2022

Hi @alronlam, I'm trying to see if I can just convert the HRSL data (the 1.8 GB CSV file) to a GeoJSON file and load it as such, but even that is already crashing Colab. Colab might not be ideal for working with production-sized datasets; it's better suited for learning and exploring the modules.
I'll try to replicate the problem on a beefier machine (my laptop :-)) to find a way to do this (one idea sketched below).
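
Here's a rough sketch of that idea: instead of one giant GeoJSON, stream the CSV and write compressed Parquet partitions that downstream tools can read lazily. The paths and chunk size are assumptions for illustration, and this needs pyarrow installed.

```python
import pathlib

import pandas as pd

# Hypothetical output directory for the partitioned copy.
pathlib.Path("hrsl_parts").mkdir(exist_ok=True)

# Convert the 1.8 GB CSV to Parquet one chunk at a time, so the whole
# file never has to sit in RAM at once. "hrsl_indonesia.csv" and the
# chunk size are placeholder assumptions.
for i, chunk in enumerate(pd.read_csv("hrsl_indonesia.csv", chunksize=1_000_000)):
    chunk.to_parquet(f"hrsl_parts/part_{i:04d}.parquet", index=False)
```

Tools like pyarrow.dataset (or dask) can then scan hrsl_parts/ lazily instead of materializing the whole table.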

tm-kah-alforja (Collaborator) commented

Low prio; for further discussion.

thinkingmachines locked and limited conversation to collaborators Aug 11, 2022
tm-kah-alforja converted this issue into discussion #124 Aug 11, 2022

This issue was moved to a discussion.

You can continue the conversation there. Go to discussion →
