# Indexing and selecting data

DataArrays and Datasets indexed by an `xvec.GeometryIndex` support Xarray's standard indexing, slicing and selection on non-geometric dimensions, plus spatial indexing options on geometric dimensions. To make the example more interesting, create a Dataset of trips between individual taxi zones in New York City in [January 2022](https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page).

In [1]:
import datetime

import geopandas as gpd
import pandas as pd
import xarray as xr
import xvec

from shapely import Point, box

You can index the data by payment type, date, hour of the day, origin zone and destination zone. For each combination, you can check the trip count, mean trip distance, mean fare amount and mean tip amount.[^sparse]

[^sparse]: {-} It may be better to create the data as sparse arrays, but those do not support all indexing methods, so it is better to use dense arrays in this example.

In [2]:
# Load the data
trips = pd.read_parquet(
    "https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2022-01.parquet"
)  # 33MB
zones = gpd.read_file(
    "https://d37ci6vzurychx.cloudfront.net/misc/taxi_zones.zip"
)  # 1MB
lookup = pd.read_csv("https://d37ci6vzurychx.cloudfront.net/misc/taxi+_zone_lookup.csv")

# create variables for day and hour
trips["date"] = trips.tpep_pickup_datetime.dt.date
trips["hour"] = trips.tpep_pickup_datetime.dt.hour

# use groupby over five columns to create a multi-indexed DataFrame with aggregations
# and create a Dataset backed by dense arrays (from_dataframe defaults to sparse=False)
taxi_trips = xr.Dataset.from_dataframe(
    trips[  # filter only trips with known locations
        trips.PULocationID.isin(zones.LocationID)
        & trips.DOLocationID.isin(zones.LocationID)
    ]
    .groupby(["payment_type", "date", "hour", "PULocationID", "DOLocationID"])
    .agg(
        {
            "trip_distance": "mean",
            "VendorID": "count",
            "tip_amount": "mean",
            "fare_amount": "mean",
        }
    ),
)

# Replace int codes with labels
taxi_trips["payment_type"] = [
    "Credit card",
    "Cash",
    "No charge",
    "Dispute",
    "Unknown",
    "Voided trip",
]

# create linkable geometry variable
taxi_zones = (
    lookup.merge(
        zones.dissolve("LocationID")[["zone", "geometry"]],
        left_on="Zone",
        right_on="zone",
        how="left",
    )
    .set_index("LocationID")
    .geometry
)
# replace location IDs with actual geometries
taxi_trips["PULocationID"] = taxi_zones.loc[taxi_trips.PULocationID].values
taxi_trips["DOLocationID"] = taxi_zones.loc[taxi_trips.DOLocationID].values

# rename
taxi_trips = taxi_trips.rename(
    {"PULocationID": "origin", "DOLocationID": "destination", "VendorID": "trips_count"}
)

# assign GeometryIndex
taxi_trips = taxi_trips.xvec.set_geom_indexes(["origin", "destination"], crs=zones.crs)
taxi_trips

The resulting Dataset has two dimensions, `origin` and `destination`, that are each backed by a `GeometryIndex`.

In [3]:
taxi_trips.xindexes

Indexes:
    payment_type  PandasIndex
    date          PandasIndex
    hour          PandasIndex
    origin        GeometryIndex (crs=EPSG:2263)
    destination   GeometryIndex (crs=EPSG:2263)
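
To see how a grouped, multi-indexed DataFrame turns into Dataset dimensions, the following minimal sketch repeats the same `groupby` followed by `xr.Dataset.from_dataframe` pattern on a small made-up table. The values and the `toy`, `grouped` and `toy_ds` names are illustrative assumptions, not part of the taxi data. Combinations that never occur in the table end up as `NaN` in the dense result.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Made-up trips: three rows, two payment types, two hours.
toy = pd.DataFrame(
    {
        "payment_type": [1, 1, 2],
        "hour": [8, 9, 8],
        "trip_distance": [1.0, 3.0, 2.0],
        "VendorID": [1, 2, 1],
    }
)

# Grouping by two columns yields a DataFrame with a two-level MultiIndex.
grouped = toy.groupby(["payment_type", "hour"]).agg(
    {"trip_distance": "mean", "VendorID": "count"}
)

# Each index level becomes a dimension of the (dense) Dataset.
toy_ds = xr.Dataset.from_dataframe(grouped)

# The (payment_type=2, hour=9) combination has no rows, so it is NaN.
missing = toy_ds["trip_distance"].sel(payment_type=2, hour=9).item()
print(dict(toy_ds.sizes), missing)
```

With five grouping columns, as in the taxi example, the same mechanism produces a five-dimensional Dataset, which is why most of its cells are empty.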

## Selection by geometry

### Geometry as a label

You can select data based on geometry as with any other index, treating it as a label.

In [4]:
taxi_trips.sel(destination=[zones.geometry[0], zones.geometry[3]])

### Nearest

Alternatively, you can select based on the nearest neighbor[^crs].

[^crs]: {-} Remember that all geometries, those in the index and those in the query, must use the same Coordinate Reference System.

In [5]:
taxi_trips.sel(
    date=datetime.datetime(2022, 1, 28),
    hour=12,
    origin=[Point(1064321, 211194), Point(988669, 207721)],
    destination=[Point(998142, 191215), Point(1010116, 42998)],
    method="nearest",
)
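
Conceptually, nearest-neighbor selection maps every query geometry to the closest geometry in the index. The sketch below illustrates that lookup directly with `shapely.STRtree.nearest` on three made-up squares. It illustrates the idea and is not a copy of Xvec's implementation; the `squares` and `queries` names are hypothetical.

```python
from shapely import Point, STRtree, box

# Three made-up index geometries standing in for taxi zones.
squares = [box(0, 0, 1, 1), box(1, 0, 2, 1), box(5, 5, 6, 6)]
tree = STRtree(squares)

# The first point lies inside square 0, the second is closest to square 2.
queries = [Point(0.2, 0.2), Point(4, 4)]

# For an array of inputs, nearest() returns one tree index per query geometry.
nearest_idx = tree.nearest(queries)
print(nearest_idx)
```

Distances are computed in the units of the coordinates, so the query points and the indexed geometries need to share one CRS for the result to be meaningful.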

### Spatial query

You can also select data spatially by passing a single geometry to `sel()` and giving a spatial predicate as the `method`:

In [6]:
taxi_trips.sel(origin=box(998142, 191215, 1024321, 211194), method="intersects")

A spatial query through the `sel()` method with a predicate other than `"nearest"` accepts only a scalar geometry as input. If you want to query using an array of geometries, use the `.xvec.query()` method instead.

In [7]:
taxi_trips.xvec.query(
    "origin", [Point(1064321, 211194), Point(1064321, 211194).buffer(500)]
)

`.xvec.query()` is a wrapper around `shapely.STRtree`. Without a predicate, it returns the subset of the original object where the bounding box of each input geometry intersects the bounding box of a geometry in the `GeometryIndex`. If a predicate is provided, the tree geometries are first queried based on the bounding box of the input geometry and then filtered to those that meet the predicate when comparing the input geometry to the tree geometry: `predicate(geometry, index_geometry)`.

In [8]:
taxi_trips.xvec.query(
    "origin",
    [Point(1064321, 211194), Point(1064321, 211194).buffer(500)],
    predicate="within",
)
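
The difference between a bounding-box query and a predicate query is easiest to see on the underlying tree. The following sketch uses `shapely.STRtree.query` directly, with made-up geometries chosen so that one point falls inside a triangle's bounding box but outside the triangle itself. The `index_geoms` and `queries` names are hypothetical.

```python
from shapely import Point, Polygon, STRtree, box

index_geoms = [
    box(0, 0, 2, 2),
    Polygon([(3, 0), (5, 0), (5, 2)]),  # triangle with bounding box (3, 0, 5, 2)
    box(10, 10, 12, 12),
]
tree = STRtree(index_geoms)

# Point 0 is inside the first box; point 1 is inside the triangle's
# bounding box but outside the triangle itself.
queries = [Point(1, 1), Point(3.5, 1.5)]

# For array input, query() returns a (2, n) array:
# row 0 holds input indices, row 1 holds tree indices.
bbox_hits = tree.query(queries)
within_hits = tree.query(queries, predicate="within")

bbox_pairs = set(zip(bbox_hits[0].tolist(), bbox_hits[1].tolist()))
within_pairs = set(zip(within_hits[0].tolist(), within_hits[1].tolist()))
print(bbox_pairs, within_pairs)
```

The bounding-box query matches both points, while `predicate="within"` keeps only the pair where the input geometry really lies within the tree geometry.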

Since multiple query geometries may match the same index geometry, the method returns duplicated observations by default. Pass `unique=True` to drop the duplicates.

In [9]:
taxi_trips.xvec.query(
    "origin",
    [Point(1064321, 211194), Point(1064321, 211194).buffer(500)],
    predicate="within",
    unique=True,
)
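
Where the duplicates come from can be shown on the tree level. In this sketch, a point and a buffer around it both lie within the same made-up square, so the same tree index is reported twice. Deduplicating with `np.unique` is used here as an assumed stand-in for what `unique=True` achieves, not as a description of Xvec's internals.

```python
import numpy as np
from shapely import Point, STRtree, box

index_geoms = [box(0, 0, 2, 2), box(5, 5, 7, 7)]
tree = STRtree(index_geoms)

# Two query geometries that both fall within the first square.
point = Point(1, 1)
queries = [point, point.buffer(0.5)]

# Row 0: input indices, row 1: matching tree indices.
_, tree_idx = tree.query(queries, predicate="within")

# The same tree geometry is matched once per query geometry.
unique_idx = np.unique(tree_idx)
print(tree_idx, unique_idx)
```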

With the `"dwithin"` predicate, which searches for geometries within a set distance of the input, pass the `distance` argument as well.

In [10]:
taxi_trips.xvec.query(
    "origin",
    [Point(1064321, 211194), Point(1064321, 211194).buffer(500)],
    predicate="dwithin",
    unique=True,
    distance=5000,
)
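
To get a feel for how `distance` changes the result, here is a small sketch of the same predicate on `shapely.STRtree`, using made-up squares spaced along the x axis. It assumes a Shapely 2 build with GEOS 3.10 or newer, which the `"dwithin"` predicate requires; the geometry names are hypothetical.

```python
from shapely import Point, STRtree, box

# Squares at increasing distance from the query point.
index_geoms = [box(0, 0, 1, 1), box(3, 0, 4, 1), box(10, 0, 11, 1)]
tree = STRtree(index_geoms)

query_point = Point(1.5, 0.5)  # 0.5 from square 0, 1.5 from square 1, 8.5 from square 2

# For a scalar input, query() returns a 1D array of tree indices.
close_idx = tree.query(query_point, predicate="dwithin", distance=1)
wider_idx = tree.query(query_point, predicate="dwithin", distance=2)
print(sorted(close_idx.tolist()), sorted(wider_idx.tolist()))
```

The distance is expressed in the units of the CRS, so `distance=5000` in the taxi example means 5000 feet in EPSG:2263.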