Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

✨ Read GeoTIFF files from remote urls via object_store #5

Merged
merged 12 commits into from
Mar 4, 2024

Conversation

weiji14
Copy link
Owner

@weiji14 weiji14 commented Mar 1, 2024

Enable GeoTIFFs stored on remote urls (e.g. http, s3, azure, gcp, etc) to be read into numpy arrays.

Uses the object_store crate, which pulls data via an async API. Note that only the network read is async, the TIFF decoding (using the tiff crate) is still synchronous.

Usage (in Python):

import numpy as np
from cog3pio import read_geotiff

array: np.ndarray = read_geotiff(
    path="https://github.com/pka/georaster/raw/v0.1.0/data/tiff/float32.tif"
)
assert array.shape == (20, 20)
assert array.dtype == "float32"

TODO:

  • Add failing unit test reading a GeoTIFF via HTTP
  • Use object_store's parse_url function to handle remote URL reads
  • Ensure that local relative filepaths (e.g. data/float32.tif) work alongside remote URLs (e.g. https://somewhere.com/float32.tif)
  • Refactor to remove .unwrap() statements
  • Document in main README.md on new usage

Other things done in this PR out of necessity:

  • Move to compiling on manylinux_2_28, therefore dropping compilation on i686 (x86) targets on Linux

References:

A generic object store interface for uniformly interacting with AWS S3, Google Cloud Storage, Azure Blob Storage and local files!

Also added the url crate.
An event-driven, non-blocking I/O platform for writing asynchronous I/O backed applications!
Doing test driven development by trying to load a remote GeoTIFF file via http into a numpy array.
@weiji14 weiji14 added the feature New feature or request label Mar 1, 2024
@weiji14 weiji14 added this to the 0.1.0 milestone Mar 1, 2024
@weiji14 weiji14 self-assigned this Mar 1, 2024
Allow passing remote URLs to the read_geotiff_py function. URLs are handled via object_store's parse_url function. Data is read asynchronously into an in-memory buffer (within a tokio runtime), and passed into the synchronous io::geotiff::read_geotiff function. Also added a short example in the Python docstring on how this can be used.
src/lib.rs Outdated
Comment on lines 56 to 61
// Parse URL
let url = Url::parse(path).expect(&format!("Cannot parse path: {path}"));
let (store, location) = parse_url(&url).expect(&format!("Cannot parse url: {url}"));

// Return cursor to in-memory buffer
let result = store.get(&location).await.unwrap();
Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Need to find a way here to let object_store read from local filepaths like /tmp/float32.tif or data/float32.tif. Currently at commit 33faafb, one would need to prepend the filepath with file:// for the read to work, otherwise there will be an error like pyo3_runtime.PanicException: Cannot parse path: /tmp/tmp6ip79h4_/float32.tif: RelativeUrlWithoutBase (see https://github.com/weiji14/cog3pio/actions/runs/8115987634/job/22185036928?pr=5#step:6:89).

Maybe add some logic to determine the URL scheme, and if it is local, prepend the url with file://?

Copy link
Owner Author

@weiji14 weiji14 Mar 1, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh wait, RelativeUrlWithoutBase is raised on Url::parse, not ObjectStore 😅 Edit: Should be handled in f9607f3

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great, need to figure out how to handle Windows paths - https://github.com/weiji14/cog3pio/actions/runs/8118215740/job/22192052940?pr=5#step:6:95:

E       pyo3_runtime.PanicException: Cannot parse url: c:\Users\RUNNER~1\AppData\Local\Temp\tmpv5aci1cm\float32.tif: Generic { store: "URL", source: Unrecognised { url: Url { scheme: "c", cannot_be_a_base: true, username: "", password: None, host: None, port: None, path: "\\Users\\RUNNER~1\\AppData\\Local\\Temp\\tmpv5aci1cm\\float32.tif", query: None, fragment: None } } }

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, figured out that there's a Url::from_file_path function that handles Windows paths (and also prepends file:// to local filepaths). Implemented in commit 3c8a604

Cargo.toml Outdated
Comment on lines 14 to 19
numpy = "0.20.0"
object_store = { version = "0.9.0", features = ["http"] }
pyo3 = { version = "0.20.3", features = ["abi3-py310", "extension-module"] }
tiff = "0.9.1"
tokio = "1.36.0"
url = "2.5.0"
Copy link
Owner Author

@weiji14 weiji14 Mar 1, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably need to hide some of the Python-specific crates behind a feature flag (if the pure Rust parts will be published as a crate). Hoping that most of the Rust code can be upstreamed though. See PyO3/pyo3#2271 for reference.

Handle the RelativeUrlWithoutBase parsing error raised when local filepaths are given without `file://` in front. Also moved the parse url logic out of the async block.
Copy link

codspeed-hq bot commented Mar 1, 2024

CodSpeed Performance Report

Merging #5 will degrade performances by 33.24%

Comparing object_store (5c412da) with main (0c37e32)

Summary

❌ 1 (👁 1) regressions

🆕 1 new benchmarks

Benchmarks breakdown

Benchmark main object_store Change
👁 test_read_geotiff_local 423 µs 633.6 µs -33.24%
🆕 test_read_geotiff_remote N/A 48.9 ms N/A

Resolve `warning: ring@0.17.8:  #error "ARM assembler must define __ARM_ARCH"` on aarch64. Xref PyO3/maturin-action#222
Getting an error like `/usr/bin/python3: No module named pip` when using the quay.io/pypa/manylinux_2_28_x86_64:latest docker image, so trying with ghcr.io/rust-cross/manylinux_2_28-cross instead. Note that the aarch64/armv7/s390x/ppc64le targets use this rust-cross image by default.
Use built-in function at https://docs.rs/url/2.5.0/url/struct.Url.html#method.from_file_path to add the file:// prefix instead of the DIY method at f9607f3. Hoping that this would handle Windows paths too. Also changed .expect to .map_err to raise PyValueError instead of panicking.
Add instructions on how to install cog3pio in Rust or Python (need to pull development version from git for now), and sample code to read a GeoTIFF file from a http url. Included some tips on pinning the specific commit hash, and mentioned that the current crate/library only supports reading single-band float32 GeoTIFF files. Also checked the box that reading from remote storage is implemented.
Gracefully handle errors when retrieving the GeoTIFF data in the async block, using turbofish operator as suggested by https://rust-lang.github.io/async-book/07_workarounds/02_err_in_async_blocks.html. Fixed some errors missing the format! macro, and added unit tests to catch parsing and decoding errors.

In the read_geotiff function, changed the function to return a TiffError instead of ShapeError, so that ? can be used in the decoding portions.
Copy the Rust code from 15b0dbc to src/lib.rs as a doctest. Need to change `crate-type = ["rlib"]` in Cargo.toml so that `cargo test --doc` works, but will configure that in CI later.
Comment on lines -39 to +94
// Open TIFF file from path
let file = File::open(path).expect("Cannot find GeoTIFF file");
// Parse URL into ObjectStore and path
let file_or_url = match Url::from_file_path(path) {
// Parse local filepath
Ok(filepath) => filepath,
// Parse remote URL
Err(_) => Url::parse(path)
.map_err(|_| PyValueError::new_err(format!("Cannot parse path: {path}")))?,
};
let (store, location) = parse_url(&file_or_url)
.map_err(|_| PyValueError::new_err(format!("Cannot parse url: {file_or_url}")))?;
Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CodSpeed Performance Report

Merging #5 will degrade performances by 32.21%

Comparing object_store (33f4a73) with main (0c37e32)

Summary

❌ 1 (👁 1) regressions

🆕 1 new benchmarks

Benchmarks breakdown

Benchmark main object_store Change
👁 test_read_geotiff_local 423 µs 624 µs -32.21%
🆕 test_read_geotiff_remote N/A 49 ms N/A

Acknowledging the slowdown in reading local GeoTIFF files as acceptable for now since it is only ~200µs slower, but could revisit this next time if we should have separate branches for remote files (read using object_store) and local files (read using std::fs::File).

@weiji14 weiji14 marked this pull request as ready for review March 4, 2024 03:32
@weiji14 weiji14 enabled auto-merge (squash) March 4, 2024 03:33
@weiji14 weiji14 merged commit fefa543 into main Mar 4, 2024
13 checks passed
@weiji14 weiji14 deleted the object_store branch March 4, 2024 03:54
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

1 participant