554 changes: 538 additions & 16 deletions Cargo.lock

Large diffs are not rendered by default.

6 changes: 6 additions & 0 deletions Cargo.toml
@@ -75,6 +75,7 @@ async-compat = "0.2.5"
async-fs = "2.2.0"
async-stream = "0.3.6"
async-trait = "0.1.89"
bincode = "2"
bindgen = "0.72.0"
bit-vec = "0.8.0"
bitvec = "1.0.1"
@@ -116,13 +117,18 @@ dyn-hash = "1.0.0"
enum-iterator = "2.0.0"
enum-map = "2.7.3"
erased-serde = "0.4"
fastbloom = "0.14.0"
fastlanes = "0.5"
flatbuffers = "25.2.10"
fsst-rs = "0.5.5"
futures = { version = "0.3.31", default-features = false }
fuzzy-matcher = "0.3"
geo = "0.31.0"
geo-types = "0.7.18"
geozero = "0.14.0"
glob = "0.3.2"
goldenfile = "1"
h3o = "0.9.3"
half = { version = "2.6", features = ["std", "num-traits"] }
hashbrown = "0.16.0"
humansize = "2.1.3"
2 changes: 2 additions & 0 deletions encodings/zstd/src/array.rs
@@ -306,6 +306,8 @@ impl ZstdArray {
(Some(ByteBuffer::from(dict)), compressor)
};

// TODO(aduffy): dictionary training

let mut frame_metas = vec![];
let mut frames = vec![];
for i in 0..n_frames {
Binary file added presentation/adam.png
Binary file added presentation/blog.png
Binary file added presentation/geozero.png
Binary file added presentation/h3.png
131 changes: 131 additions & 0 deletions presentation/hackweek.md
@@ -0,0 +1,131 @@
---
title: Putting Vortex on the Map
sub_title: adding geospatial features to Vortex (again)
author: Andrew Duffy
---

# Background

* Vortex puts pluggability front and center
* Layouts
* Encodings
* Expressions
* It would be fun to apply Vortex to new domains where we
  don't have many prebuilt demos
* I like mapping data



<!-- end_slide -->

# Background: Spatial data

*All of this is about vector data, not raster.*

* Encode _geometry_ alongside some set of _properties_
* Query patterns: filtered scans (`ST_Contains`, `ST_Intersects`), aggregates
* Different encoding schemes; GeoParquet uses WKB to stay compact

![wkb](wkb.png)

* Lean on Rust geo ecosystem

![geozero](geozero.png)

<!-- end_slide -->


# Background: The Dataset

* Microsoft OpenBuildings

![blog](blog.png)

* Global coverage


<!-- end_slide -->

# Plan: Indexing

* Implement a new `GeoLayout`, like ZonedLayout but for spatial indexing

* For each 8K row chunk, build a bloom filter of *H3 Cell IDs*

![h3](h3.png)

* Can treat cell IDs like `u64` and insert into bloom filter

<!-- end_slide -->

# Plan: Layouts


* Implement an `ST_Contains` expression and pruning for it in our new layout
* Implement new strategy to write compact files with the custom index structure

```rust
/// Make a strategy which has special handling for DType::Binary chunks named "geometry".
fn make_rtree_strategy() -> Arc<dyn LayoutStrategy> {
let validity = Arc::new(FlatLayoutStrategy::default());
let fallback = WriteStrategyBuilder::new()
.with_compressor(CompactCompressor::default())
.build();

// override the handling of the "geometry" column
let leaf_writers = HashMap::from_iter([(
FieldPath::from_name(FieldName::from("geometry")),
geometry_writer(),
)]);

Arc::new(PathStrategy::new(leaf_writers, validity, fallback))
}
```

* (Bonus: built a new strategy that lets you override the writer for leaf columns by field path)

<!-- end_slide -->

# How'd we do?

Source GeoParquet file:

> 1.2 GB

Vortex File with H3 Bloom Filter Index:

> 0.9 GB!

**~22% smaller!**

<!-- end_slide -->

## Follow-up work

* Improve write performance
* Add BBOX for coarse filtering
* Implement more geospatial functions with pushdown
* `to_geojson(col("geom"))`
* `ST_Area`, `ST_Intersect`, see [geodatafusion](https://github.com/datafusion-contrib/geodatafusion)
* Integrate into SedonaDB

![sedonadb](sedonadb.png)

<!-- end_slide -->

# What People are Saying

![pmarca](pmarca.png)

<!-- pause -->

![roon](roon.png)

<!-- pause -->

![adam](adam.png)


<!-- end_slide -->

# Demo
Binary file added presentation/pmarca.png
Binary file added presentation/qgis_fail.png
Binary file added presentation/roon.png
Binary file added presentation/sedonadb.png
Binary file added presentation/trump.png
Binary file added presentation/wkb.png
3 changes: 3 additions & 0 deletions vortex-array/Cargo.toml
@@ -37,6 +37,9 @@ enum-iterator = { workspace = true }
enum-map = { workspace = true }
flatbuffers = { workspace = true }
futures = { workspace = true, features = ["alloc", "async-await", "std"] }
geo = { workspace = true }
geo-types = { workspace = true }
geozero = { workspace = true, features = ["with-wkb"] }
getrandom_v03 = { workspace = true }
goldenfile = { workspace = true, optional = true }
humansize = { workspace = true }
4 changes: 4 additions & 0 deletions vortex-array/src/expr/exprs/geospatial/mod.rs
@@ -0,0 +1,4 @@
// SPDX-License-Identifier: Apache-2.0
// SPDX-FileCopyrightText: Copyright the Vortex contributors

pub mod st_contains;
214 changes: 214 additions & 0 deletions vortex-array/src/expr/exprs/geospatial/st_contains.rs
@@ -0,0 +1,214 @@
// SPDX-License-Identifier: Apache-2.0
// SPDX-FileCopyrightText: Copyright the Vortex contributors

//! An implementation of an ST_Contains expression type.

use std::fmt::Formatter;

use geo::Centroid;
use geo::Contains;
use geo_types::Geometry;
use geozero::GeozeroGeometry;
use geozero::geo_types::GeoWriter;
use geozero::wkb;
use vortex_buffer::BitBuffer;
use vortex_dtype::DType;
use vortex_dtype::Nullability;
use vortex_error::VortexResult;
use vortex_error::vortex_ensure;
use vortex_scalar::Scalar;

use crate::Array;
use crate::ArrayRef;
use crate::IntoArray;
use crate::ToCanonical;
use crate::accessor::ArrayAccessor;
use crate::arrays::BoolArray;
use crate::arrays::ConstantArray;
use crate::expr::ChildName;
use crate::expr::ExprId;
use crate::expr::ExpressionView;
use crate::expr::Literal;
use crate::expr::VTable;
use crate::expr::traversal::Node;
use crate::vtable::ValidityHelper;

pub struct STContains;

impl VTable for STContains {
type Instance = ();

fn id(&self) -> ExprId {
ExprId::from("vortex.geo.contains")
}

fn validate(&self, expr: &ExpressionView<Self>) -> VortexResult<()> {
vortex_ensure!(
expr.children_count() == 2,
"ST_Contains expression must have exactly 2 children"
);

let _lhs = &expr.children()[0];
let _rhs = &expr.children()[1];

// TODO(aduffy): do other checks on the lhs/rhs

Ok(())
}

fn child_name(&self, _instance: &Self::Instance, child_idx: usize) -> ChildName {
match child_idx {
0 => ChildName::new_ref("geom_a"),
1 => ChildName::new_ref("geom_b"),
_ => unreachable!("child_name called with invalid child_idx"),
}
}

fn fmt_sql(&self, expr: &ExpressionView<Self>, f: &mut Formatter<'_>) -> std::fmt::Result {
let lhs = &expr.children()[0];
let rhs = &expr.children()[1];

write!(f, "ST_CONTAINS(")?;
lhs.fmt_sql(f)?;
write!(f, ", ")?;
rhs.fmt_sql(f)?;
write!(f, ")")
}

fn return_dtype(&self, expr: &ExpressionView<Self>, scope: &DType) -> VortexResult<DType> {
let lhs = &expr.children()[0];
let rhs = &expr.children()[1];
let nullability =
lhs.return_dtype(scope)?.nullability() | rhs.return_dtype(scope)?.nullability();
Ok(DType::Bool(nullability))
}

fn evaluate(&self, expr: &ExpressionView<Self>, scope: &ArrayRef) -> VortexResult<ArrayRef> {
let lhs = &expr.children()[0];
let rhs = &expr.children()[1];

match (lhs.as_opt::<Literal>(), rhs.as_opt::<Literal>()) {
(Some(l), Some(r)) => {
// Both are literals
let len = scope.len();

let l_v = l.data().as_binary().value();
let r_v = r.data().as_binary().value();
let constant = match (l_v, r_v) {
(Some(wkb_l), Some(wkb_r)) => {
let geom_l = parse_wkb(&wkb_l);
let geom_r = parse_wkb(&wkb_r);
Scalar::bool(geom_l.contains(&geom_r), Nullability::NonNullable)
}
_ => Scalar::null(DType::Bool(Nullability::Nullable)),
};

Ok(ConstantArray::new(constant, len).into_array())
}
(Some(l), None) => {
// lhs is literal, rhs is an array that we need to iterate over.
let rhs = rhs.evaluate(scope)?;
let len = rhs.len();

let Some(wkb_l) = l.data().as_binary().value() else {
return Ok(ConstantArray::new(
Scalar::null(DType::Bool(Nullability::Nullable)),
len,
)
.into_array());
};

let geom_l = parse_wkb(&wkb_l);

let rhs = rhs.to_varbinview();
let validity = rhs.validity().clone();

rhs.with_iterator(|iter| {
let matches = iter
.map(|rhs_value| match rhs_value {
None => false,
Some(wkb_r) => {
let geom_r = parse_wkb(wkb_r);
geom_l.contains(&geom_r)
}
})
.collect::<BitBuffer>();

Ok(BoolArray::from_bit_buffer(matches, validity).into_array())
})
}
(None, Some(r)) => {
// rhs is literal, lhs is an array that we need to iterate over
let lhs = lhs.evaluate(scope)?;
let len = lhs.len();

let Some(wkb_r) = r.data().as_binary().value() else {
return Ok(ConstantArray::new(
Scalar::null(DType::Bool(Nullability::Nullable)),
len,
)
.into_array());
};

let geom_r = parse_wkb(&wkb_r);

let lhs = lhs.to_varbinview();
let validity = lhs.validity().clone();
lhs.with_iterator(|iter| {
let matches = iter
.map(|v| match v {
None => false,
Some(wkb_l) => {
let geom_l = parse_wkb(wkb_l);
geom_l.contains(&geom_r)
}
})
.collect::<BitBuffer>();

Ok(BoolArray::from_bit_buffer(matches, validity).into_array())
})
}
(None, None) => {
// lhs and rhs are both arrays, we need to zip/iterate them both.
let lhs = lhs.evaluate(scope)?.to_varbinview();
let rhs = rhs.evaluate(scope)?.to_varbinview();

// And the validities together.
let validity = lhs.validity().clone().and(rhs.validity().clone());

vortex_ensure!(
lhs.len() == rhs.len(),
"ST_Contains child arrays must have equal length"
);
let len = rhs.len();

// TODO(aduffy): hoist validity checking
let matches = BitBuffer::collect_bool(len, |index| {
if lhs.is_invalid(index) || rhs.is_invalid(index) {
return false;
}

let l_v = lhs.bytes_at(index);
let r_v = rhs.bytes_at(index);

let geom_l = parse_wkb(&l_v);
let geom_r = parse_wkb(&r_v);

geom_l.contains(&geom_r)
});

Ok(BoolArray::from_bit_buffer(matches, validity).into_array())
}
}
}
}

fn parse_wkb(wkb: &[u8]) -> Geometry {
let mut writer = GeoWriter::new();
wkb::Wkb(wkb)
.process_geom(&mut writer)
.expect("wkb parsing failed")
writer.take_geometry().expect("wkb should yield geometry")
}