<a href="https://colab.research.google.com/github/quentinf00/my_ocb/blob/main/Demo_OCB_EDITO_Modellab.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Demo of the oceanbench ecosystem components:

- **Datachallenges:** reproducible, configured pipelines for loading data and computing metrics
- **Pipelines:** sequences of processing steps
- **Modules:** units of processing

Pipelines and modules are installable, configurable and documented.

# Installation
-  Download repo
-  Install conda dependencies

In [1]:
!pip install --quiet condacolab
import condacolab
condacolab.install_mambaforge()

⏬ Downloading https://github.com/conda-forge/miniforge/releases/download/23.11.0-0/Mambaforge-23.11.0-0-Linux-x86_64.sh...
📦 Installing...
📌 Adjusting configuration...
🩹 Patching environment...
⏲ Done in 0:00:13
🔁 Restarting kernel...


In [1]:
!git clone https://github.com/quentinf00/my_ocb.git

fatal: destination path 'my_ocb' already exists and is not an empty directory.


In [2]:
%cd my_ocb

/content/my_ocb


In [None]:
!git pull

In [4]:
!mamba env update -q -f env.yaml -n base


  Pinned packages:

  - python 3.10.*
  - python_abi 3.10.* *cp310*
  - cuda-version 12.*


Preparing transaction: ...working... done
Verifying transaction: ...working... done
Executing transaction: ...working... 

done
Installing pip dependencies: ...working... done


# Demo Data challenge SSH Mapping OSE (2021)

In [5]:
%cd datachallenges/dc_ose_2021

/content/my_ocb/datachallenges/dc_ose_2021


# Install pipelines

In [6]:
# This data challenge has a single pipeline so far for computing lambda x
!cat pipelines.txt

qf_alongtrack_lambdax_from_map @ git+https://github.com/quentinf00/my_ocb.git#egg=qf_alongtrack_lambdax_from_map&subdirectory=pipelines/qf_alongtrack_lambdax_from_map
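
The line above is a pip direct reference: a package name, then `@`, then a VCS URL whose fragment carries the `subdirectory` that holds the package inside the repository. As a rough illustration using only the standard library, it can be taken apart as follows. The helper name `split_requirement` is made up for this sketch and is not part of pip.

```python
from urllib.parse import parse_qs, urlsplit

line = (
    "qf_alongtrack_lambdax_from_map @ "
    "git+https://github.com/quentinf00/my_ocb.git"
    "#egg=qf_alongtrack_lambdax_from_map"
    "&subdirectory=pipelines/qf_alongtrack_lambdax_from_map"
)


def split_requirement(requirement):
    """Split 'name @ vcs+url#key=value&...' into its parts (hypothetical helper)."""
    name, _, reference = requirement.partition(" @ ")
    parts = urlsplit(reference)
    # The scheme looks like 'git+https': version control system, then transport.
    vcs, _, transport = parts.scheme.partition("+")
    options = {key: values[0] for key, values in parse_qs(parts.fragment).items()}
    return {
        "name": name.strip(),
        "vcs": vcs,
        "repo": f"{transport}://{parts.netloc}{parts.path}",
        "subdirectory": options.get("subdirectory"),
    }


info = split_requirement(line)
print(info)
```

Because every pipeline lives in its own subdirectory of the same repository, adding another pipeline to a data challenge would amount to adding one more line of this shape to `pipelines.txt`.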


In [7]:
!pip install -q -r pipelines.txt

  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Installing backend dependencies ... done
  Preparing metadata (pyproject.toml) ... done
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Installing backend dependencies ... done
  Preparing metadata (pyproject.toml) ... done
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Installing backend dependencies ... done
  Preparing metadata (pyproject.toml) ... done
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Installing backend dependencies ... done
  Preparing metadata (pyproject.toml) ... done
  Installing build dependencies ... done
  Getting requirements to build wheel ... 

## Display the pipeline configuration for this data challenge

In [28]:
# The key "to_run" contains the sequence of steps to run
# The key "stages" contains the configuration of each pipeline step
!cat stage_configs.yaml

_target_: qf_alongtrack_lambdax_from_map.run_pipeline
to_run:
- dl_tracks
- filter_and_merge
- interp_on_track
- lambdax
stages:
  method: default
  dl_tracks:
    _target_: dz_download_ssh_tracks.run
    _partial_: true
    sat: c2
    download_dir: data/downloads/${.sat}
    min_time: '2017-01-01'
    max_time: '2017-12-31'
    filters:
    - '*2017*'
    _skip_val: false
  filter_and_merge:
    _target_: qf_filter_merge_daily_ssh_tracks.run
    _partial_: true
    input_dir: ${..dl_tracks.download_dir}
    output_path: data/prepared/${..dl_tracks.sat}.nc
    min_lon: -65.0
    max_lon: -55.0
    min_lat: 33.0
    max_lat: 43.0
    min_time: '2017-01-01'
    max_time: '2018-01-01'
    _skip_val: false
  interp_on_track:
    _target_: qf_interp_grid_on_track.run
    _partial_: true
    track_path: ${..filter_and_merge.output_path}
    grid_path: data/method_outputs/${..method}.nc
    grid_var: ???
    output_path: data/method_outputs/${..method}_on_track.nc
    _skip_val: false
  lamb
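
The `${.sat}` and `${..dl_tracks.download_dir}` entries in this configuration are relative interpolations in the Hydra/OmegaConf style: one leading dot refers to the mapping that contains the key, and each additional dot moves one level up. The sketch below is a simplified standard-library re-implementation for illustration only, not the configuration library's code. The `resolve` helper and the trimmed `config` dictionary are assumptions of this example.

```python
import re

# Trimmed copy of the configuration shown above (only the keys needed here).
config = {
    "stages": {
        "method": "default",
        "dl_tracks": {"sat": "c2", "download_dir": "data/downloads/${.sat}"},
        "filter_and_merge": {
            "input_dir": "${..dl_tracks.download_dir}",
            "output_path": "data/prepared/${..dl_tracks.sat}.nc",
        },
        "interp_on_track": {"grid_path": "data/method_outputs/${..method}.nc"},
    }
}

INTERPOLATION = re.compile(r"\$\{(\.+)([^}]+)\}")


def lookup(root, path):
    """Follow a tuple of keys down the nested dictionaries."""
    node = root
    for key in path:
        node = node[key]
    return node


def resolve(root, path):
    """Return the value at `path`, expanding relative interpolations (sketch only)."""
    value = lookup(root, path)
    if not isinstance(value, str):
        return value

    def expand(match):
        dots, rest = match.groups()
        # One dot: the mapping holding this key. Each extra dot: one level higher.
        base = path[: len(path) - len(dots)]
        return str(resolve(root, base + tuple(rest.split("."))))

    return INTERPOLATION.sub(expand, value)


print(resolve(config, ("stages", "dl_tracks", "download_dir")))
print(resolve(config, ("stages", "filter_and_merge", "input_dir")))
print(resolve(config, ("stages", "interp_on_track", "grid_path")))
```

This is why overriding a single value such as `stages.method` is enough to redirect every path that depends on it.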

## Visualize processing steps

In [24]:
# In terms of stage names
!dvc dag 'compute_lambdax@0'

+----------------------+                   +-----------------+   
| filter_and_merge_ref |                   | method_output@0 |   
+----------------------+******             +-----------------+   
            *                 ******                *            
            *                       *******         *            
            *                              ****     *            
            **                            +-------------------+  
              ****                        | interp_on_track@0 |  
                  ***                     +-------------------+  
                     ***                 ****                    
                        ****          ***                        
                            **      **                           
                      +-------------------+                      
                      | compute_lambdax@0 |                      
                      +-------------------+                      

In [25]:
# In terms of data dependencies
!dvc dag 'compute_lambdax@0' --out

                                    +---------------------+                                                               +---------------------------------+
                                    | data/prepared/c2.nc |*********                                                      | data/method_outputs/4dvarnet.nc |
                                    +---------------------+**       ****************                                      +---------------------------------+
                                ****                         *****                  *****************                                       *
                           *****                                  *****                              ****************                       *
                      *****                                            *****                                         *********              *
                   ***                                                      *****                   
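
The stage graph in the first diagram above fixes the order in which stages can run: a stage is only runnable once everything it depends on has been produced. As a small sketch of that idea, the edges read off the diagram can be fed to the standard library's `graphlib` (Python 3.9+). This only illustrates the ordering and does not call DVC.

```python
from graphlib import TopologicalSorter

# Each stage maps to the set of stages it depends on (edges read off the diagram).
dependencies = {
    "filter_and_merge_ref": set(),
    "method_output@0": set(),
    "interp_on_track@0": {"filter_and_merge_ref", "method_output@0"},
    "compute_lambdax@0": {"filter_and_merge_ref", "interp_on_track@0"},
}

# static_order() yields every stage after all of its dependencies.
order = list(TopologicalSorter(dependencies).static_order())
print(order)
```

The two stages without dependencies may come out in either order, but `compute_lambdax@0` is always last.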

## Download some data associated with the data challenge

In [26]:
# Fetch the reference along-track data from CryoSat-2 (c2)
!dvc pull data/prepared/c2.nc

Collecting          |1.00 [00:00, 27.8entry/s]
Fetching
Fetching from https: 100% 1/1 [00:01<00:00,  1.49s/file]
Fetching
Building workspace index          |4.00 [00:00, 9

In [27]:
import xarray as xr
xr.open_dataset('data/prepared/c2.nc')

## See benchmarked methods

In [29]:
# A method has a name, a URL to a NetCDF file, and the name of the variable holding the field
!cat methods.yaml

methods:
- name: 4dvarnet
  var: rec_ssh
  url: https://s3.eu-west-2.wasabisys.com/oceanbench-data-registry/dvc/a5/2381e9409cb7c6cf9be980bda9aced
- name: miost
  var: ssh
  url: https://s3.eu-west-2.wasabisys.com/oceanbench-data-registry/dvc/4f/014481eed0088eb9a0cf329ebf045b
- name: bfn
  var: ssh
  url: https://s3.eu-west-2.wasabisys.com/oceanbench-data-registry/dvc/29/6781b126d905e98b82dac9bcecf57e
- name: duacs
  var: ssh
  url: https://s3.eu-west-2.wasabisys.com/oceanbench-data-registry/dvc/95/7d74696fbf5f2b6d0c528757951b8a
- name: dymost
  var: ssh
  url: https://s3.eu-west-2.wasabisys.com/oceanbench-data-registry/dvc/1f/6fc60bba2ef471ff845b4cdcc18f6a
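
Stage names such as `compute_lambdax@0` carry the position of the method in this list. Assuming the data challenge's `dvc.yaml` (not shown in this notebook) generates its per-method stages with a `foreach` over `methods`, DVC appends a zero-based index to the stage name when the list items are mappings. A small sketch of that numbering, where `stage_names` and the `new_method` entry are hypothetical:

```python
# Method names in the order they appear in methods.yaml.
method_names = ["4dvarnet", "miost", "bfn", "duacs", "dymost"]


def stage_names(stage, names):
    """Map each method name to '<stage>@<zero-based index>' (illustrative helper)."""
    return {name: f"{stage}@{index}" for index, name in enumerate(names)}


stages = stage_names("compute_lambdax", method_names)
print(stages)

# A sixth entry appended to the list (placeholder name) would get index 5.
extended = stage_names("compute_lambdax", method_names + ["new_method"])
print(extended["new_method"])
```

Since the index depends on the position in the file, appending new methods at the end keeps the existing stage names stable.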



## Add a new method

In [62]:
# Before the new method is added, there is no stage with index 5 (the five methods above are indexed 0 to 4)
!dvc dag --out compute_lambdax@5

ERROR: 'compute_lambdax@5' does not exist as an output or a stage name in 'dvc.yaml': Stage 'compute_lambdax@5' not found inside 'dvc.yaml' file

In [32]:
%%bash
echo "
- name: musti
  var: ssh
  url: https://s3.eu-west-2.wasabisys.com/oceanbench-data-registry/dvc/0d/43f1639d5d21324bea07f1fd4cdc9d
" >> methods.yaml

In [36]:
# New steps have been added to the graph
!dvc dag --out compute_lambdax@5

                                +---------------------+                                                           +------------------------------+
                                | data/prepared/c2.nc |*******                                                    | data/method_outputs/musti.nc |
                                +---------------------+       ****************                                    +------------------------------+
                            ****                       *****                  ***************                                     *
                       *****                                *****                            ****************                     *
                   ****                                          ****                                        ********             *
                ***                                                  *****                                    +---------------------------------------+
           

In [41]:
# Compute the metrics
!dvc freeze fetch_reference_data # we do not want to download the raw tracks again
!dvc repro -k --pull compute_lambdax@5 # take some time to read the output below to understand what is happening

Modifying stage 'fetch_reference_data' in 'dvc.yaml'
Stage 'filter_and_merge_ref' didn't change, skipping
Running stage 'method_output@5':
> wget https://s3.eu-west-2.wasabisys.com/oceanbench-data-registry/dvc/0d/43f1639d5d21324bea07f1fd4cdc9d -nc -O 'data/method_outputs/musti.nc'
--2024-02-22 12:47:22--  https://s3.eu-west-2.wasabisys.com/oceanbench-data-registry/dvc/0d/43f1639d5d21324bea07f1fd4cdc9d
Resolving s3.eu-west-2.wasabisys.com (s3.eu-west-2.wasabisys.com)... 130.117.185.102, 130.117.185.100, 130.117.185.103, ...
Connecting to s3.eu-west-2.wasabisys.com (s3.eu-west-2.wasabisys.com)|130.117.185.102|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 292016649 (278M) [binary/octet-stream]
Saving to: ‘data/method_outputs/musti.nc’


2024-02-22 12:47:40 (17.3 MB/s) - ‘data/method_outputs/musti.nc’ saved [292016649/292016649]

Updating lock file 'dvc.lock'

Running stage 'interp_on_track@5':
> qf_alongtrack_lambdax_from_map -cd . -cn stage_configs 'to_run=[in

Summary:
- the stage 'filter_and_merge_ref', which creates the reference data, didn't change, so it wasn't rerun
- the stage 'method_output@5' downloaded the method's output map
- the stage 'interp_on_track@5' interpolated the map onto the reference track
- the stage 'compute_lambdax@5' computed lambdax

Both 'compute_lambdax' and 'interp_on_track' run the command `qf_alongtrack_lambdax_from_map`, which is the **pipeline** installed earlier; the "Pipelines are" section below gives more details.
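
The skip/run decisions in the summary come from comparing the current state of each stage's dependencies with what was recorded at the previous run (the `dvc.lock` file mentioned in the log). The toy sketch below mimics that idea with content hashes kept in a dictionary. It is a deliberately simplified illustration, not DVC's implementation, and `run_if_changed` and the in-memory `lock` are inventions of this example.

```python
import hashlib


def digest(payload: bytes) -> str:
    """Content hash used to detect changes."""
    return hashlib.sha256(payload).hexdigest()


def run_if_changed(stage, deps, lock, action):
    """Run `action` only if the dependency hashes differ from the recorded ones."""
    current = {name: digest(content) for name, content in deps.items()}
    if lock.get(stage) == current:
        return "skipped"
    action()
    lock[stage] = current  # remember what this run was based on
    return "ran"


lock = {}   # stands in for the lock file
calls = []  # records each time the stage actually runs
deps = {"data/prepared/c2.nc": b"reference tracks"}

first = run_if_changed("interp_on_track", deps, lock, lambda: calls.append("interp"))
second = run_if_changed("interp_on_track", deps, lock, lambda: calls.append("interp"))

# Changing a dependency invalidates the recorded state, so the stage runs again.
deps["data/prepared/c2.nc"] = b"reference tracks, updated"
third = run_if_changed("interp_on_track", deps, lock, lambda: calls.append("interp"))

print(first, second, third)
```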

In [42]:
!cat data/metrics/lambdax_musti.json

{"lambdax": 63.80801481059697}

## Generate leaderboard

In [49]:
# Fetch the results of other methods
!dvc pull -q --allow-missing compute_lambdax
# Display the metrics
!dvc metrics show data/metrics/lambdax*.json

Path                                lambdax
data/metrics/lambdax_4dvarnet.json  48.75652
data/metrics/lambdax_bfn.json       112.75704
data/metrics/lambdax_duacs.json     16.36009
data/metrics/lambdax_miost.json     90.97104
data/metrics/lambdax_musti.json     63.80801

*Nota Bene: the values do not look coherent yet, but fixing them will make a nice demonstration of how to recompute the metrics for all the methods.*
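
The same metric files can also be read from Python to build a sorted table. The sketch below assumes each file holds a single `lambdax` key, as in `lambdax_musti.json` above, and that a smaller effective resolution ranks higher; that ordering is an assumption of this example, and given the caveat on the values the ranking is only illustrative. `load_leaderboard` is a hypothetical helper, demonstrated on a temporary directory filled with the values displayed above.

```python
import json
import tempfile
from pathlib import Path


def load_leaderboard(metrics_dir):
    """Return (method, lambdax) pairs sorted by increasing lambdax."""
    rows = []
    for path in Path(metrics_dir).glob("lambdax_*.json"):
        method = path.stem.removeprefix("lambdax_")
        rows.append((method, json.loads(path.read_text())["lambdax"]))
    return sorted(rows, key=lambda row: row[1])


# Values copied from the `dvc metrics show` output.
displayed = {
    "4dvarnet": 48.75652,
    "bfn": 112.75704,
    "duacs": 16.36009,
    "miost": 90.97104,
    "musti": 63.80801,
}

with tempfile.TemporaryDirectory() as tmp:
    for method, value in displayed.items():
        Path(tmp, f"lambdax_{method}.json").write_text(json.dumps({"lambdax": value}))
    leaderboard = load_leaderboard(tmp)

for method, value in leaderboard:
    print(f"{method:10s} {value:10.5f}")
```

In the notebook itself, calling `load_leaderboard("data/metrics")` would read the real files.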

# Pipelines are:
- pip installable
- runnable as a CLI


Example here: pipeline to compute lambda_x from a method output

In [8]:
# Help from the CLI
!qf_alongtrack_lambdax_from_map --help


    Stages:
        dl_tracks:
    Download the SSH reprocessed tracks of a given satellite from copernicus marine store (requires cmems credentials)
    more info with: `dz_download_ssh_tracks --help`

        filter_and_merge:
    Filter the input files with the given ranges and merge them into a single file 
    more info with: `qf_filter_merge_daily_ssh_tracks --help`

        interp_on_track:
    Interpolates the input grid data on the input alongtrack data
    more info with: `qf_interp_grid_on_track --help`

        lambdax:
    Compute effective resolution lambda_x on the track geometry
    more info with: `alongtrack_lambdax --help`

== Configuration groups ==
Compose your configuration from those groups (group=option)



== Config ==
Override anything in the config (foo.bar=value)

_target_: qf_alongtrack_lambdax_from_map.run_pipeline
to_run:
- dl_tracks
- filter_and_merge
- interp_on_track
- lambdax
stages:
  method: default
  dl_tracks:
    _target_: dz_download_ssh_tracks

## Run the first stage from the pipeline: downloading raw data

In [55]:
# We launch the dl_tracks stage and override the filter value to download a single file (Jan 1st, 2017)
# CMEMS credentials are requested
!qf_alongtrack_lambdax_from_map to_run=['dl_tracks'] stages.dl_tracks.filters='[*20170101*]'

[2024-02-22 13:03:05,005][dz_download_ssh_tracks][INFO] - Starting
username: qfebvre1
password: 
Fetching catalog: 100% 4/4 [00:32<00:00,  8.11s/it]
INFO - 2024-02-22T13:03:54Z - Dataset version was not specified, the latest one was selected: "202112"
[2024-02-22 13:03:54,166][copernicus_marine_root_logger][INFO] - Dataset version was not specified, the latest one was selected: "202112"
INFO - 2024-02-22T13:03:54Z - Dataset part was not specified, the first one was selected: "default"
[2024-02-22 13:03:54,167][copernicus_marine_root_logger][INFO] - Dataset part was not specified, the first one was selected: "default"
INFO - 2024-02-22T13:03:54Z - Service was not specified, the default one was selected: "original-files"
[2024-02-22 13:03:54,167][copernicus_marine_root_logger][INFO] - Service was not specified, the default one was selected: "original-files"
INFO - 2024-02-22T13:03:54Z - Downloading using service original-files...
[2024-02-22 13:03:54,168][copernicus_marine_root_logger][I

In [56]:
import pathlib
import xarray as xr
p = next(pathlib.Path('data/downloads/c2').glob('**/*.nc'))
print(str(p))
xr.open_dataset(p)

data/downloads/c2/SEALEVEL_GLO_PHY_L3_MY_008_062/cmems_obs-sl_glo_phy-ssh_my_c2-l3-duacs_PT1S_202112/2017/01/dt_global_c2_phy_l3_20170101_20210603.nc


## Run the last stage: manually recompute lambdax

In [57]:
!rm data/metrics/lambdax_musti.json
!qf_alongtrack_lambdax_from_map to_run=['lambdax'] stages.method='musti'
!cat data/metrics/lambdax_musti.json

{"lambdax": 63.80801481059697}

## Inside the pipeline: Modules

### Exploring the config with `--cfg job -p`

In [58]:
# Print the config of the pipeline's first stage: the `_target_` field points to a MODULE
!qf_alongtrack_lambdax_from_map --cfg job -p stages.dl_tracks

# @package stages.dl_tracks
_target_: dz_download_ssh_tracks.run
_partial_: true
sat: c2
download_dir: data/downloads/${.sat}
min_time: '2017-01-01'
max_time: '2017-12-31'
filters:
- '*2017*'
_skip_val: false

In [59]:
# Get details about the first stage's module
!dz_download_ssh_tracks --help


 Download the SSH reprocessed tracks of a given satellite from copernicus marine store (requires cmems credentials)
To specify the files to be downloaded, two options:
- specify min_time and max_time parameters and the months encompassing the period
will be downloaded
- specify a list of filters in the form "*YYYY*" or "*YYYYMM*" or "*YYYYMMDD*" to
have more fine grained control

Pipeline description: 
    Download the SSH reprocessed tracks of a given satellite from copernicus marine store (requires cmems credentials)

Input description: None
    

Output description:
    
    Daily netcdf ordered by folder with for a given satellite
    Requirements:
      - download_dir points to a directory
      - download_dir contains netcdf files
    

Returns:
    None

== Configuration groups ==
Compose your configuration from those groups (group=option)



== Config ==
Override anything in the config (foo.bar=value)

_target_: dz_download_ssh_tracks.run
sat: c2
download_dir: data/downloads/$
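
The `_target_` and `_partial_` keys seen in the stage configuration follow the Hydra instantiation convention: `_target_` is a dotted path to a callable, the other keys become its keyword arguments, and `_partial_: true` asks for a partially applied callable instead of an immediate call. The sketch below re-creates that behaviour with the standard library only. It is a simplified illustration, not Hydra's implementation. `build` is a hypothetical helper, `datetime.timedelta` is a stand-in target, and dropping every underscore-prefixed key (such as `_skip_val`) is a simplification of this example.

```python
import importlib
from functools import partial


def build(config):
    """Resolve `_target_` to a callable and bind the other keys as keyword arguments."""
    module_name, _, attribute = config["_target_"].rpartition(".")
    target = getattr(importlib.import_module(module_name), attribute)
    # In this sketch, keys starting with an underscore are treated as metadata.
    kwargs = {key: value for key, value in config.items() if not key.startswith("_")}
    if config.get("_partial_", False):
        return partial(target, **kwargs)  # called later, possibly with extra arguments
    return target(**kwargs)


# Stand-in target from the standard library.
stage = build({"_target_": "datetime.timedelta", "_partial_": True, "days": 1})
delayed = stage(hours=12)
print(delayed)

# Without `_partial_`, the target is called immediately.
immediate = build({"_target_": "datetime.timedelta", "days": 2})
print(immediate)
```

Deferring the call this way lets a pipeline prepare all of its stages from the configuration and then run only those listed in `to_run`.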

## Use a module directly to download some data

In [60]:
# Download the data for Jun 1st to Jun 9th of the alg satellite
# Note: run `copernicusmarine login` in order to avoid typing your credentials each time
!dz_download_ssh_tracks sat=alg filters=['*2017060*']

[2024-02-22 13:12:53,495][dz_download_ssh_tracks][INFO] - Starting
username: qfebvre1
password: 
INFO - 2024-02-22T13:13:03Z - Dataset version was not specified, the latest one was selected: "202112"
[2024-02-22 13:13:03,573][copernicus_marine_root_logger][INFO] - Dataset version was not specified, the latest one was selected: "202112"
INFO - 2024-02-22T13:13:03Z - Dataset part was not specified, the first one was selected: "default"
[2024-02-22 13:13:03,574][copernicus_marine_root_logger][INFO] - Dataset part was not specified, the first one was selected: "default"
INFO - 2024-02-22T13:13:03Z - Service was not specified, the default one was selected: "original-files"
[2024-02-22 13:13:03,574][copernicus_marine_root_logger][INFO] - Service was not specified, the default one was selected: "original-files"
INFO - 2024-02-22T13:13:03Z - Downloading using service original-files...
[2024-02-22 13:13:03,574][copernicus_marine_root_logger][INFO] - Downloading using service original-files...
1

In [61]:
import pathlib
import xarray as xr
ps = [*pathlib.Path('data/downloads/alg').glob('**/*.nc')]
print(f'{len(ps)=}')
print(str(ps[0]))
xr.open_dataset(ps[0])

len(ps)=9
data/downloads/alg/SEALEVEL_GLO_PHY_L3_MY_008_062/cmems_obs-sl_glo_phy-ssh_my_alg-l3-duacs_PT1S_202112/2017/06/dt_global_alg_phy_l3_20170603_20210603.nc
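
The nine files are consistent with the `*2017060*` filter selecting June 1st to June 9th. Assuming the filter is matched glob-style against the file name (the module's actual matching logic is not shown here), the selection can be reproduced with `fnmatch` on made-up file names that follow the naming pattern above; the `20210603` suffix is reused for every name purely for illustration.

```python
from datetime import date, timedelta
from fnmatch import fnmatch

# Hypothetical file names for every day of June 2017.
days = [date(2017, 6, 1) + timedelta(days=offset) for offset in range(30)]
names = [f"dt_global_alg_phy_l3_{day:%Y%m%d}_20210603.nc" for day in days]

# '*2017060*' only matches days 01 to 09, since day 10 onwards contains '2017061' or later.
selected = [name for name in names if fnmatch(name, "*2017060*")]
print(len(selected))
print(selected[0])
print(selected[-1])
```

A broader pattern such as `*201706*` would select the whole month under the same assumption.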
