# Retrieve a Eurobis dataset
Imagine you are interested in occurrence data, for instance this zooplankton dataset on [Eurobis](https://www.eurobis.org/imis?module=dataset&dasid=4687). You can also find this dataset via these portals:
- [IPT](https://www.vliz.be/nl/imis?module=dataset&dasid=4685&printversion=1&dropIMIStitle=1)
- [EMODnet](https://emodnet.ec.europa.eu/geonetwork/srv/eng/catalog.search#/metadata/6d617269-6e65-696e-666f-000000004687)
- [LifeWatch](https://rshiny.lifewatch.be/flowcam-data/)

This tutorial illustrates how to access this data via the DTO (Digital Twin Ocean).
## 0. Setup environment
#### Requirements

In [10]:
packages = ["contextily",
            "pandas",
            "geopandas",
            "matplotlib",
            "pyarrow",
            "pystac_client"]

#### Install packages

In [11]:
for package in packages:
    %pip install {package}


[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m25.2[0m[39;49m -> [0m[32;49m25.3[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m
Note: you may need to restart the kernel to use updated packages.
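
Reinstalling every package on each run is slow. A minimal sketch, assuming each package's import name matches its entry in `packages` (true for the six listed above, but not in general), that lists only the packages still missing. The helper name `missing_packages` is hypothetical:

```python
import importlib.util


def missing_packages(names):
    """Return the names that cannot be imported in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


packages = ["contextily", "pandas", "geopandas",
            "matplotlib", "pyarrow", "pystac_client"]

# Only the names printed here would still need `%pip install`.
print(missing_packages(packages))
```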

[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m25.2[0m[39;49m -> [0m[32;49m25.3[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m
Note: you may need to restart the kernel to use updated packages.

[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m25.2[0m[39;49m -> [0m[32;49m25.3[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m
Note: you may need to restart the kernel to use updated packages.

[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A n

#### Load packages

In [12]:
import pyarrow.parquet as pq
import pyarrow.fs
import pyarrow.dataset as ds
import pyarrow.compute as pc
import pandas as pd
import geopandas as gpd
import matplotlib.pyplot as plt
import contextily as ctx
from shapely.geometry import Point
import pystac_client

## 1. Find the occurrence dataset
As of 2025-03-07, the current occurrence dataset is:
- s3.waw3-1.cloudferro.com/emodnet/biology/eurobis_occurrence_data/eurobis_occurrences_geoparquet_2024-10-01.parquet
<br><br>
The dataset is updated from time to time and its URL changes with it, so it is best to search the STAC catalog to make sure we have the latest version.
<br><br>
We use pystac_client to query the STAC; read the docs [here](https://pystac-client.readthedocs.io/en/stable/index.html).
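
The file names in this notebook carry a date stamp, which gives a quick way to check whether a hard-coded URL is older than another one. A minimal sketch, assuming the `YYYY-MM-DD.parquet` suffix is a stable naming convention (this is inferred from the two URLs shown in this notebook, not confirmed by the catalog). `release_date` is a hypothetical helper:

```python
import re
from datetime import date

# Assumed convention: the file name ends in "<YYYY-MM-DD>.parquet".
DATE_SUFFIX = re.compile(r"(\d{4})-(\d{2})-(\d{2})\.parquet$")


def release_date(url):
    """Extract the date stamp at the end of a parquet URL, or None if absent."""
    match = DATE_SUFFIX.search(url)
    if match is None:
        return None
    year, month, day = map(int, match.groups())
    return date(year, month, day)


old = ("s3.waw3-1.cloudferro.com/emodnet/biology/eurobis_occurrence_data/"
       "eurobis_occurrences_geoparquet_2024-10-01.parquet")
new = ("https://s3.waw3-1.cloudferro.com/emodnet/emodnet_biology/12639/"
       "marine_biodiversity_observations_2025-09-22.parquet")

print(release_date(old) < release_date(new))  # True
```
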
#### 1.1 Connect to the STAC catalog

In [13]:
url = 'https://catalog.dive.edito.eu'
client = pystac_client.Client.open(url)
print(client)

<Client id=root>


Explore the different collections

In [14]:
collections = list(client.get_collections())
print(len(collections))

442


#### 1.2 Search collections with occurrence data

In [18]:
variable = "emodnet-occurrence_data"

items = []
for collection in collections:
    if variable in collection.id:
        print(collection.id)
        # Collect every item of each matching collection
        items.extend(collection.get_items())
            
print(items)

emodnet-occurrence_data
[<Item id=426b87c6-409d-5dfd-9522-1be4e30b4b01>]


This results in one dataset that contains occurrence data.
#### 1.3 Find the data URL
Each item has assets; we want the parquet asset, so we extract its URL.

In [19]:
for item in items:
    for key, value in item.assets.items():
        print(f"{key}: {value}")
        print("-"*25)
        if key == "parquet":
            occurrence_data = value.href

xml: <Asset href=https://emodnet.ec.europa.eu/geonetwork/srv/api/records/6d617269-6e65-696e-666f-000000001510/formatters/xml>
-------------------------
csw: <Asset href=https://emodnet.ec.europa.eu/geonetwork/emodnet/eng/csw?request=GetRecordById&service=CSW&version=2.0.2&elementSetName=full&id=6d617269-6e65-696e-666f-000000001510>
-------------------------
wms: <Asset href=https://geo.vliz.be/geoserver/wms?SERVICE=WMS&REQUEST=GetMap&LAYERS=Dataportal:eurobis_rasters-obisenv&VERSION=1.1.1&CRS=CRS:84&BBOX=-180,-90,180,90&WIDTH=800&HEIGHT=600&FORMAT=image/png>
-------------------------
parquet: <Asset href=https://s3.waw3-1.cloudferro.com/emodnet/emodnet_biology/12639/marine_biodiversity_observations_2025-09-22.parquet>
-------------------------
parquet-datalab-data-explorer: <Asset href=https://datalab.dive.edito.eu/data-explorer?source=https://s3.waw3-1.cloudferro.com/emodnet/emodnet_biology/12639/marine_biodiversity_observations_2025-09-22.parquet>
-------------------------


In [20]:
print(occurrence_data)

https://s3.waw3-1.cloudferro.com/emodnet/emodnet_biology/12639/marine_biodiversity_observations_2025-09-22.parquet
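
Looking up the asset by its key works as long as the key is literally `parquet`. A slightly more defensive sketch, assuming the assets have first been reduced to a plain mapping from key to href (for example with `{k: v.href for k, v in item.assets.items()}`), falls back to the file extension of the URL path. `find_parquet_href` is a hypothetical helper:

```python
from urllib.parse import urlparse


def find_parquet_href(hrefs):
    """Return the href of the parquet asset from a {key: href} mapping."""
    # Prefer the exact key used by the catalog entry shown above.
    if "parquet" in hrefs:
        return hrefs["parquet"]
    # Otherwise fall back to the first href whose *path* ends in ".parquet".
    # Checking the path (not the whole URL) skips viewer links that only
    # mention the file in their query string.
    for href in hrefs.values():
        if urlparse(href).path.lower().endswith(".parquet"):
            return href
    raise KeyError("no parquet asset found")


hrefs = {
    "parquet-datalab-data-explorer": "https://datalab.dive.edito.eu/data-explorer?source=https://s3.waw3-1.cloudferro.com/emodnet/emodnet_biology/12639/marine_biodiversity_observations_2025-09-22.parquet",
    "parquet": "https://s3.waw3-1.cloudferro.com/emodnet/emodnet_biology/12639/marine_biodiversity_observations_2025-09-22.parquet",
}
print(find_parquet_href(hrefs))
```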


## 2. Open the occurrences dataset
#### Connect to the S3 object storage
To connect to the dataset, you need the following information:

- Host: the base URL
- Bucket: the first part of the path after the host
- Key: the rest of the path

This information is all contained in the URI. It can be extracted manually or automatically.
#### Manual resource specification

In [21]:
host = "s3.waw3-1.cloudferro.com"
bucket_name = "emodnet"
key = "biology/eurobis_occurrence_data/eurobis_occurrences_geoparquet_2024-10-01.parquet"

#### Automated resource specification

In [22]:
from urllib.parse import urlparse
parsed_url = urlparse(occurrence_data)
host = parsed_url.hostname
bucket_name = parsed_url.path.split('/')[1]
key = '/'.join(parsed_url.path.split('/')[2:])

In [23]:
print("host =", host)
print("bucket_name =", bucket_name)
print("key =", key)

host = s3.waw3-1.cloudferro.com
bucket_name = emodnet
key = emodnet_biology/12639/marine_biodiversity_observations_2025-09-22.parquet
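
The parsing above relies on the URL having a scheme: `urlparse` returns no hostname for a scheme-less string such as the one quoted in section 1. A minimal sketch of a helper that handles both forms; `split_s3_url` is a hypothetical name and the `https://` default is an assumption:

```python
from urllib.parse import urlparse


def split_s3_url(url):
    """Split an S3-style HTTP URL into (host, bucket, key)."""
    # urlparse only recognises the host when a scheme is present.
    if "://" not in url:
        url = "https://" + url
    parsed = urlparse(url)
    # The path looks like "/<bucket>/<key...>"; split off the bucket once.
    bucket, _, key = parsed.path.lstrip("/").partition("/")
    if not parsed.hostname or not bucket or not key:
        raise ValueError(f"cannot split {url!r} into host, bucket and key")
    return parsed.hostname, bucket, key


print(split_s3_url(
    "s3.waw3-1.cloudferro.com/emodnet/biology/eurobis_occurrence_data/"
    "eurobis_occurrences_geoparquet_2024-10-01.parquet"
))
```

The result matches the values entered by hand in the manual resource specification.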


In [24]:
s3 = pyarrow.fs.S3FileSystem(endpoint_override=host, anonymous=True)
s3_path = f"{bucket_name}/{key}"

#### Open file and inspect the schema
Open the dataset. Note that this does not load the data: the entire dataset contains far too many records to load at once, so we will filter and load in a later step.

In [25]:
dataset = ds.dataset(s3_path, filesystem=s3, format="parquet")
print(dataset.schema)

id: int64
datelastmodified: timestamp[us, tz=UTC]
datasetid: int32
institutioncode: string
collectioncode: string
eventid: string
observationdate: timestamp[us, tz=UTC]
season: string
yearcollected: int16
startyearcollected: int16
endyearcollected: int16
monthcollected: int16
startmonthcollected: int16
endmonthcollected: int16
daycollected: int16
startdaycollected: int16
enddaycollected: int16
timeofday: double
starttimeofday: double
endtimeofday: double
timezone: string
continentocean: string
country: string
stateprovince: string
county: string
collectornumber: string
fieldnumber: string
longitude: double
startlongitude: double
endlongitude: double
latitude: double
startlatitude: double
endlatitude: double
coordinateprecision: double
verbatimpositiondetail: string
minimumdepth: double
maximumdepth: double
occurrenceid: string
scientificname: string
aphiaid: int32
taxonrank: int16
rankname: string
scientificname_accepted: string
scientificnameauthor: string
aphiaidaccepted: int32
kingd

## 2. Filter the data
We will filter for copepods (AphiaID 1080) near the Flemish coast (latitude 51-51.5; longitude 2.5-3.3).

In [None]:
columns_needed = ["aphiaid", "latitude", "longitude"]
filtered_table = dataset.to_table(
    columns=columns_needed,
    filter=(
        (pc.field("aphiaid") == 1080) &
        (pc.field("latitude") >= 51) &
        (pc.field("latitude") <= 51.5) &
        (pc.field("longitude") >= 2.5) &
        (pc.field("longitude") <= 3.3)
    )
)

#### Convert to Pandas DataFrame

In [None]:
df = filtered_table.to_pandas()

Check how many records this DataFrame contains.

In [None]:
print(f"{len(df)=}")

Print some records to see the data.

In [None]:
for line in df.head(10).itertuples():
    print(line)

## 3. Plotting the data
Let's visualize the data for a better understanding.

In [None]:
gdf = gpd.GeoDataFrame(
    df,
    geometry=gpd.points_from_xy(df.longitude, df.latitude),
    crs="EPSG:4326"  # WGS84 coordinate system
)
fig, ax = plt.subplots(figsize=(10, 10))
gdf = gdf.to_crs(epsg=3857)  # Reproject to Web Mercator for compatibility with contextily
gdf.plot(ax=ax, color="blue", markersize=10, alpha=0.6, label="Occurrences")
ctx.add_basemap(ax, source=ctx.providers.OpenStreetMap.Mapnik, zoom=10)
ax.set_title("Map of Filtered AphiaID Occurrences with Background Map")
# After reprojection the axes are in Web Mercator metres, not degrees
ax.set_xlabel("Easting (m, EPSG:3857)")
ax.set_ylabel("Northing (m, EPSG:3857)")
plt.legend()
plt.show()
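
After `to_crs(epsg=3857)` the plot coordinates are in Web Mercator metres rather than degrees. A minimal sketch of the spherical Mercator formula behind that projection, for converting a single longitude/latitude pair by hand; `lonlat_to_web_mercator` is a hypothetical helper, not part of geopandas:

```python
import math

EARTH_RADIUS_M = 6378137.0  # WGS84 semi-major axis, used by EPSG:3857


def lonlat_to_web_mercator(lon_deg, lat_deg):
    """Convert WGS84 degrees to Web Mercator (EPSG:3857) metres."""
    x = EARTH_RADIUS_M * math.radians(lon_deg)
    y = EARTH_RADIUS_M * math.log(
        math.tan(math.pi / 4 + math.radians(lat_deg) / 2)
    )
    return x, y


# Corners of the bounding box used in the filter step.
print(lonlat_to_web_mercator(2.5, 51.0))
print(lonlat_to_web_mercator(3.3, 51.5))
```

This is handy for setting axis limits with `ax.set_xlim` and `ax.set_ylim` when the bounding box is known in degrees.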

## 4. Conclusion
This notebook demonstrated how to reach the Eurobis data via the EDITO data lake. All occurrence datasets are merged into a single parquet file in the data lake.