


Vector Zonal Stats: Colab crashing due to RAM consumption #71

Closed
Tracked by #42
alronlam opened this issue Jul 4, 2022 · 6 comments

Comments

alronlam (Contributor) commented Jul 4, 2022

Colab notebook for testing:
https://colab.research.google.com/drive/147HWUgaBztsZuBPrI_HTckBrz_vl9l1l#scrollTo=wvLenjgDUgod

Scenario

  • Created AOI grid tiles for the Subang Regency in Indonesia (~36k 250 m x 250 m grid tiles)
  • Tried to get the average population density per tile using HRSL vector data (the CSV file is around 1.8 GB)

Error
Colab crashes due to exceeding the RAM limit.
[Screenshot: gw_vzs_hrsl]

Just creating this issue to check if there are straightforward ways to optimize. Otherwise, are there workarounds for handling relatively large vector datasets like this (e.g., something like the chunked-read sketch below)?
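
For context, here's a minimal sketch of the kind of chunked, filter-early workaround I have in mind. The file paths and the latitude/longitude column names are placeholder assumptions, not the actual HRSL schema.

```python
import geopandas as gpd
import pandas as pd

# Hypothetical file containing the ~36k Subang grid tiles.
aoi = gpd.read_file("subang_grid_tiles.geojson")
minx, miny, maxx, maxy = aoi.total_bounds

filtered_chunks = []
for chunk in pd.read_csv("hrsl_indonesia.csv", chunksize=1_000_000):
    # Drop points outside the AOI bounding box early, before they
    # can accumulate in memory.
    mask = (
        chunk["longitude"].between(minx, maxx)
        & chunk["latitude"].between(miny, maxy)
    )
    filtered_chunks.append(chunk[mask])

points = pd.concat(filtered_chunks, ignore_index=True)
hrsl_gdf = gpd.GeoDataFrame(
    points,
    geometry=gpd.points_from_xy(points["longitude"], points["latitude"]),
    crs="EPSG:4326",
)
```

This only helps when the AOI is much smaller than the dataset's footprint, which is the case here for Subang vs. all of Indonesia.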

tm-kah-alforja mentioned this issue Jul 4, 2022
butchtm (Collaborator) commented Jul 4, 2022

Hi @alronlam
I'm trying to replicate the issue, but your Colab notebook is missing the dataset 'indonesia-osm-pois-2020.csv'. Could you provide a link to a copy of the dataset we could download?

butchtm (Collaborator) commented Jul 4, 2022

Hi @alronlam, the Ookla dataset 'indonesia-ookla-2020-q1-fixed.csv' is missing as well.

alronlam (Contributor, Author) commented Jul 4, 2022

Oh hi @butchtm, the link to the GDrive folder for these files is in the topmost part of the notebook.

alronlam (Contributor, Author) commented Jul 4, 2022

Additional detail: I think I also ran into this RAM issue when aligning with the raw Ookla dataset: https://registry.opendata.aws/speedtest-global-performance/

I tried utilizing the latest fixed-line data from Ookla.

My workaround was to utilize an older, filtered version of the data that covered Indonesia only (because the raw data is for the whole world).

So I guess one principle here is that we should always filter feature datasets as much as possible before aligning them to the AOIs, to avoid such issues (see the sketch below). But in the case of HRSL, the data is already for Indonesia alone. I'm not sure what else we can do to make it work for such big datasets (some kind of parallel processing?). Or in these cases, are we forced to use other tools like BQ?
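
To make the filter-first principle concrete, here's a sketch of clipping the global Ookla tiles to an Indonesia bounding box at read time. The local file path and the bbox values are illustrative assumptions.

```python
import geopandas as gpd

# Rough lon/lat bounding box covering Indonesia (illustrative values).
indonesia_bbox = (95.0, -11.0, 141.0, 6.0)

# The bbox filter is applied while reading, so the worldwide dataset
# is never fully loaded into memory.
ookla = gpd.read_file(
    "gps_fixed_tiles.shp",  # hypothetical local copy of the Ookla tiles
    bbox=indonesia_bbox,
)
```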

butchtm (Collaborator) commented Jul 4, 2022

Hi @alronlam, I'm trying to see if I can just convert the HRSL data (the 1.8 GB CSV file) to a GeoJSON file and load it as such, but even that is already crashing Colab. Colab might not be ideal for working with production-sized datasets; it's better suited for learning and exploring the modules.
I'll try to replicate the problem on a beefier machine (my laptop :-)) to find a way to do this (one idea sketched below).
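
Here's a rough sketch of that idea: instead of one giant GeoJSON, stream the CSV and write compressed Parquet partitions that downstream tools can read lazily. The paths and chunk size are assumptions for illustration, and this needs pyarrow installed.

```python
import pathlib

import pandas as pd

# Hypothetical output directory for the partitioned copy.
pathlib.Path("hrsl_parts").mkdir(exist_ok=True)

# Convert the 1.8 GB CSV to Parquet one chunk at a time, so the whole
# file never has to sit in RAM at once. "hrsl_indonesia.csv" and the
# chunk size are placeholder assumptions.
for i, chunk in enumerate(pd.read_csv("hrsl_indonesia.csv", chunksize=1_000_000)):
    chunk.to_parquet(f"hrsl_parts/part_{i:04d}.parquet", index=False)
```

Tools like pyarrow.dataset (or dask) can then scan hrsl_parts/ lazily instead of materializing the whole table.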

tm-kah-alforja (Collaborator) commented

Low prio; for further discussion.

thinkingmachines locked and limited conversation to collaborators Aug 11, 2022
tm-kah-alforja converted this issue into discussion #124 Aug 11, 2022

This issue was moved to a discussion.

You can continue the conversation there. Go to discussion →
