# Retrieve a CMEMS dataset

Imagine you used to work with this ocean colour product from CMEMS:
https://data.marine.copernicus.eu/product/OCEANCOLOUR_NWS_BGC_HR_L4_NRT_009_209/description

Now you want to move to the DTO facilities, and you are wondering how to find that dataset there.


### 0. Set up the environment

#### Requirements

In [1]:
packages = ['pystac_client',
            'copernicusmarine',
            'xarray',
            'requests',
            'aiohttp']

#### Install packages

In [2]:
for package in packages:
    !pip install {package} > /dev/null 2>&1

#### Load packages

In [3]:
for package in packages:
    exec(f'import {package}')



<br>

### 1. Open the STAC catalog
Using `pystac_client`, you can connect to the STAC catalog.

In [4]:
url = 'https://catalog.dive.edito.eu'
client = pystac_client.Client.open(url)
print(client)

<Client id=catalogs>


<br>

### 2. Load collections
A STAC catalog organizes its content into collections, which are a good way to explore the available datasets.

In [5]:
collections = list(client.get_collections())

Let's see how many collections there are:

In [6]:
print(f"number of collections: {len(collections)}")

number of collections: 4845


<br>

### 3. Query collections
We will loop over the collections and filter on a variable defined in CMEMS, such as "Mass concentration of chlorophyll a in sea water". Note that spaces must be replaced with underscores, since spaces are not accepted.
#### 3.1 Filter on variable

In [7]:
variable = "chlorophyll_a"
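The underscore convention can be sketched as a small helper. `to_collection_fragment` is a hypothetical name introduced here for illustration; it is not part of `pystac_client` or `copernicusmarine`, and it assumes the collection IDs are lower-case with words joined by underscores.

```python
def to_collection_fragment(name: str) -> str:
    """Turn a human-readable variable name into an ID-style fragment."""
    # Lower-case the name, split on any whitespace, and join with underscores.
    return "_".join(name.lower().split())


# Assumption: the full CMEMS variable name is used as the search string.
variable_full = to_collection_fragment(
    "Mass concentration of chlorophyll a in sea water"
)
print(variable_full)
```

A longer fragment like this one narrows the match to fewer collections than the short `"chlorophyll_a"` string used in this notebook.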

In [8]:
for collection in collections:
    if variable in collection.id:
        print(collection.id)

emodnet-deepest_values_of_water_body_chlorophyll_a
climate_forecast-mass_concentration_of_chlorophyll_a_in_sea_floor_sediment
climate_forecast-mass_concentration_of_chlorophyll_a_in_sea_ice
climate_forecast-mass_concentration_of_chlorophyll_a_in_sea_water
climate_forecast-mass_concentration_of_divinyl_chlorophyll_a_in_sea_water
climate_forecast-mass_concentration_of_monovinyl_chlorophyll_a_in_sea_water
climate_forecast-mass_fraction_of_chlorophyll_a_in_sea_water
emodnet-water_body_chlorophyll_a
emodnet-water_body_chlorophyll_a_deepest
emodnet-water_body_chlorophyll_a_masked_using_relative_error_threshold_0.5
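
To get a quick overview of where the matching collections come from, the IDs can be grouped by the text before the first hyphen. This is a minimal sketch on the IDs printed above; treating that prefix as the "source" of a collection is an assumption based on these IDs, not a documented rule of the catalog.

```python
from collections import Counter

# Collection IDs copied from the output above.
matching_ids = [
    "emodnet-deepest_values_of_water_body_chlorophyll_a",
    "climate_forecast-mass_concentration_of_chlorophyll_a_in_sea_floor_sediment",
    "climate_forecast-mass_concentration_of_chlorophyll_a_in_sea_ice",
    "climate_forecast-mass_concentration_of_chlorophyll_a_in_sea_water",
    "climate_forecast-mass_concentration_of_divinyl_chlorophyll_a_in_sea_water",
    "climate_forecast-mass_concentration_of_monovinyl_chlorophyll_a_in_sea_water",
    "climate_forecast-mass_fraction_of_chlorophyll_a_in_sea_water",
    "emodnet-water_body_chlorophyll_a",
    "emodnet-water_body_chlorophyll_a_deepest",
    "emodnet-water_body_chlorophyll_a_masked_using_relative_error_threshold_0.5",
]

# Split each ID once at the first "-" and count the prefixes.
sources = Counter(cid.split("-", 1)[0] for cid in matching_ids)
for source, count in sources.items():
    print(source, count)
```

In the notebook itself, `matching_ids` could be built from `collection.id` in the loop above instead of being typed out.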


#### 3.2 Retrieve products
Get all the products from these collections.

In [9]:
products = []
for collection in collections:
    if variable in collection.id:
        # get_items() yields every item (product) in the collection
        products.extend(collection.get_items())

Let's see how many products that gives:

In [10]:
print(f"number of products: {len(products)}")

number of products: 419


#### 3.3 Filter the products 
On the CMEMS webpage, the product ID is defined as "OCEANCOLOUR_NWS_BGC_HR_L4_NRT_009_209".

In [11]:
product_id = "OCEANCOLOUR_NWS_BGC_HR_L4_NRT_009_209"

In [12]:
for collection in collections:
    if variable in collection.id:
        for i, item in enumerate(collection.get_items()):
            for asset_key, asset in item.assets.items():
                if product_id in asset.href:
                    print(i, asset.href)

140 https://s3.waw3-1.cloudferro.com/mdl-arco-geo-045/arco/OCEANCOLOUR_NWS_BGC_HR_L4_NRT_009_209/cmems_obs_oc_nws_bgc_tur-spm-chl_nrt_l4-hr-mosaic_P1D-m_202107/geoChunked.zarr
140 https://datalab.dive.edito.eu/data-explorer?source=https://s3.waw3-1.cloudferro.com/mdl-arco-geo-045/arco/OCEANCOLOUR_NWS_BGC_HR_L4_NRT_009_209/cmems_obs_oc_nws_bgc_tur-spm-chl_nrt_l4-hr-mosaic_P1D-m_202107/geoChunked.zarr
141 https://s3.waw3-1.cloudferro.com/mdl-arco-time-045/arco/OCEANCOLOUR_NWS_BGC_HR_L4_NRT_009_209/cmems_obs_oc_nws_bgc_tur-spm-chl_nrt_l4-hr-mosaic_P1D-m_202107/timeChunked.zarr
141 https://datalab.dive.edito.eu/data-explorer?source=https://s3.waw3-1.cloudferro.com/mdl-arco-time-045/arco/OCEANCOLOUR_NWS_BGC_HR_L4_NRT_009_209/cmems_obs_oc_nws_bgc_tur-spm-chl_nrt_l4-hr-mosaic_P1D-m_202107/timeChunked.zarr
142 https://wmts.marine.copernicus.eu/teroWmts/OCEANCOLOUR_NWS_BGC_HR_L4_NRT_009_209/cmems_obs_oc_nws_bgc_tur-spm-chl_nrt_l4-hr-mosaic_P1D-m_202107
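
The printed hrefs mix direct Zarr links with data-explorer and WMTS links. A minimal sketch of how the direct Zarr links could be separated out, using only the standard library on the hrefs copied from the output above. `direct_zarr_hrefs` is a hypothetical helper name, and the rule "URL path ends with `.zarr`" is an assumption that holds for these five hrefs.

```python
from urllib.parse import urlparse

base = "OCEANCOLOUR_NWS_BGC_HR_L4_NRT_009_209/cmems_obs_oc_nws_bgc_tur-spm-chl_nrt_l4-hr-mosaic_P1D-m_202107"
geo = f"https://s3.waw3-1.cloudferro.com/mdl-arco-geo-045/arco/{base}/geoChunked.zarr"
time_ = f"https://s3.waw3-1.cloudferro.com/mdl-arco-time-045/arco/{base}/timeChunked.zarr"

# Asset hrefs as printed above, in the same order.
hrefs = [
    geo,
    f"https://datalab.dive.edito.eu/data-explorer?source={geo}",
    time_,
    f"https://datalab.dive.edito.eu/data-explorer?source={time_}",
    f"https://wmts.marine.copernicus.eu/teroWmts/{base}",
]


def direct_zarr_hrefs(hrefs):
    """Keep hrefs whose URL path (query string excluded) ends with '.zarr'."""
    # For the data-explorer links the Zarr URL sits in the query string,
    # so their path is just '/data-explorer' and they are filtered out.
    return [h for h in hrefs if urlparse(h).path.endswith(".zarr")]


zarr_hrefs = direct_zarr_hrefs(hrefs)
for h in zarr_hrefs:
    print(h.rsplit("/", 1)[-1])
```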


<br>

### 4. Open the Zarr store
Select one of the Zarr stores and inspect it. In this example we will continue with the geo-chunked Zarr:

In [13]:
my_zarr = "https://s3.waw3-1.cloudferro.com/mdl-arco-geo-045/arco/OCEANCOLOUR_NWS_BGC_HR_L4_NRT_009_209/cmems_obs_oc_nws_bgc_tur-spm-chl_nrt_l4-hr-mosaic_P1D-m_202107/geoChunked.zarr"

In [14]:
from copernicusmarine.core_functions import custom_open_zarr

In [15]:
ds = custom_open_zarr.open_zarr(my_zarr)
print(ds)

<xarray.Dataset> Size: 4TB
Dimensions:    (time: 1702, latitude: 15120, longitude: 14311)
Coordinates:
  * latitude   (latitude) float64 121kB 48.0 48.0 48.0 48.0 ... 62.0 62.0 62.0
  * longitude  (longitude) float64 114kB -12.0 -12.0 -12.0 ... 13.0 13.0 13.0
  * time       (time) datetime64[ns] 14kB 2020-01-04 2020-01-05 ... 2024-08-31
Data variables:
    CHL        (time, latitude, longitude) float32 1TB dask.array<chunksize=(230, 64, 64), meta=np.ndarray>
    SPM        (time, latitude, longitude) float32 1TB dask.array<chunksize=(230, 64, 64), meta=np.ndarray>
    TUR        (time, latitude, longitude) float32 1TB dask.array<chunksize=(230, 64, 64), meta=np.ndarray>
Attributes: (12/46)
    Conventions:                CF-1.7
    TileSize:                   945:1192
    cmd_data_type:              Grid
    cmems_dataset:              cmems_obs_oc_nws_bgc_tur-spm-chl_nrt_l4-hr-mo...
    cmems_product_id:           OCEANCOLOUR_NWS_BGC_HR_L4_NRT_009_209
    contact:                    h

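Notice the size: 4TB. In CMEMS this product consists of monthly NetCDF files; in the ARCO format they have been merged into a single Zarr dataset, which is very handy for slicing. The data variables are lazy dask arrays, so the data itself has not been downloaded at this point.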
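
The reported size can be checked with a quick calculation from the dimensions and dtype in the summary above. This is only a back-of-the-envelope sketch of the uncompressed in-memory size; the size of the compressed store on disk is not shown in the output and is not estimated here.

```python
# Dimensions and dtype taken from the dataset summary above.
n_time, n_lat, n_lon = 1702, 15120, 14311
bytes_per_value = 4   # float32
n_variables = 3       # CHL, SPM, TUR

per_variable_bytes = n_time * n_lat * n_lon * bytes_per_value
total_bytes = per_variable_bytes * n_variables

print(f"per variable:  {per_variable_bytes / 1e12:.2f} TB")
print(f"all variables: {total_bytes / 1e12:.2f} TB")
```

This gives roughly 1.47 TB per variable and 4.42 TB in total, consistent with the rounded `1TB` and `4TB` figures in the summary.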

<br>

### 5. Slice data and plot
Slice the dataset based on its extent in space and time.

In [16]:
lat_min, lat_max = (50, 53) 
lon_min, lon_max = (2, 5)
time_index = 0

Execute the slicing.

In [17]:
subset = ds.isel(time=time_index).sel(latitude=slice(lat_min, lat_max), longitude=slice(lon_min, lon_max))['CHL']
print(subset)

<xarray.DataArray 'CHL' (latitude: 3240, longitude: 1717)> Size: 22MB
dask.array<getitem, shape=(3240, 1717), dtype=float32, chunksize=(64, 64), chunktype=numpy.ndarray>
Coordinates:
  * latitude   (latitude) float64 26kB 50.0 50.0 50.0 50.0 ... 53.0 53.0 53.0
  * longitude  (longitude) float64 14kB 2.001 2.003 2.005 ... 4.996 4.997 4.999
    time       datetime64[ns] 8B 2020-01-04
Attributes:
    long_name:      Chlorophyll-a concentration derived from MSI L2R using HR...
    standard_name:  mass_concentration_of_chlorophyll_a_in_sea_water
    units:          mg m-3
    valid_min:      0.0
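
As a sanity check, the shape of the slice can be estimated from the grid resolution. The sketch below assumes a regular grid and uses the rounded coordinate bounds shown in the dataset summary (48 to 62 degrees latitude, -12 to 13 degrees longitude), so it is an approximation rather than an exact rule.

```python
# Full-grid values taken from the dataset summary in section 4.
lat_span_deg, n_lat = 62.0 - 48.0, 15120
lon_span_deg, n_lon = 13.0 - (-12.0), 14311

# Requested box: 50..53 degrees latitude and 2..5 degrees longitude.
box_lat_deg, box_lon_deg = 53 - 50, 5 - 2

# Number of grid points = box width / grid spacing.
expected_lat = round(box_lat_deg * n_lat / lat_span_deg)
expected_lon = round(box_lon_deg * n_lon / lon_span_deg)
size_mb = expected_lat * expected_lon * 4 / 1e6   # float32 = 4 bytes

print(expected_lat, expected_lon, round(size_mb, 1))
```

The estimate reproduces the 3240 x 1717 shape and the roughly 22 MB size reported for the slice, which suggests the bounding box was applied as intended.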


Plot the slice.

In [None]:
subset.plot(cmap='viridis', vmin=0, vmax=30)