# Quick start

## Installation

The Python module can be directly installed from [PyPI](https://pypi.org/project/anta-database/) with:

    pip install anta_database

Note that this module was developed for Python versions >=3.11. It is also recommended to install it in a dedicated, fresh Python environment to avoid dependency issues.


This Python module is designed to query, filter, and visualize data stored in an AntADatabase folder. While the AntADatabase is not yet publicly available, you can contact me for access to test the tool.
In the meantime, this guide gives an overview of the tool's features. This Jupyter Notebook can be downloaded directly (top bar) and run locally, provided you have downloaded or compiled the AntADatabase folder.


## Initializing the database

Once the AntADatabase folder is stored on your machine, you can import and initialize the `Database` class, which provides SQL query and filter functions as well as quick visualization functions.

In [None]:
# Initialize the database and create the SQL table (required only once or when adding new datasets)
from anta_database import Database

db = Database('/home/anthe/documents/data/isochrones/AntADatabase/', index_database=True)

Note: the `index_database=True` argument creates a SQLite table (AntADatabase.db) that indexes all datasets in the AntADatabase folder. This ensures that future queries reflect the data that are actually present in your copy of the AntADatabase, so you can download either the entire AntADatabase or only parts of it. You only need to set `index_database=True` once, and again whenever you add new datasets to the folder.
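
Because the index is a regular SQLite file, it can also be inspected outside of this module with Python's built-in `sqlite3`. The sketch below lists the tables of a database. It runs on a throwaway in-memory database with a made-up `datasets` table, since the actual schema of AntADatabase.db is not described in this guide. To look at your own index, connect to the path of your AntADatabase.db file instead of `':memory:'`.

```python
import sqlite3

def list_tables(conn):
    """Return the names of all tables in an open SQLite connection."""
    rows = conn.execute(
        "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name"
    ).fetchall()
    return [name for (name,) in rows]

# Stand-in database: the table and column names below are hypothetical
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE datasets (dataset TEXT, flight_id TEXT, age INTEGER)")

tables = list_tables(conn)
print(tables)
conn.close()
```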

In [None]:
# For subsequent uses (no need to re-index unless new datasets are added)
db = Database('/home/anthe/documents/data/isochrones/AntADatabase/')

## Querying the database
This section provides examples for querying the database using the `query()` and `filter_out()` functions.
These functions allow you to browse and filter data based on various criteria such as datasets, institutes, projects, regions, layer ages and more.

Call `query()` without arguments for an overview of all the data:

In [None]:
db.query()

`query()` takes one or multiple arguments. Each argument accepts a string, or a list of strings for multiple selections.  
You can query by dataset, institute, project, acquisition year, age, region, IMBIE basin, variable, or flight ID.

### **Parameters**
| Parameter         | Description                                                                                     | Example Values                     |
|-------------------|-------------------------------------------------------------------------------------------------|-------------------------------------|
| `dataset`         | Name of the dataset(s) of interest.                                                             | `'Cavitte_2020'`, `['Franke_2025', 'Winter_2018']` |
| `institute`       | Institute(s) that produced the data.                                                           | `'BAS'`, `['AWI', 'NASA']`            |
| `project`         | Project(s) under which the data were collected.                                                | `'OIB'`                             |
| `acquisition_year`| Year(s) in which the radar data were acquired. Supports ranges and inequalities.               | `'2000-2010'`, `'<1990'`, `'2005'` |
| `age`             | Age(s) in years before present of the layer(s) of interest. Supports ranges and inequalities.   | `'10000'`, `'37500-38500'`, `'>90000'` |
| `region`          | Region(s) of interest (e.g., `'EAIS'`, `'WAIS'`).                                              | `'EAIS'`                            |
| `IMBIE_basin`     | IMBIE basin(s) of interest (e.g., `'G-H'`, `'Ap-B'`).                                           | `'G-H'`                             |
| `var`             | Variable(s) of interest. Possible values: `'IRH_DEPTH'`, `'IRH_NUM'`, `'ICE_THK'`, `'BED_ELEV'`, `'SURF_ELEV'`.     | `'ICE_THK'`                        |
| `flight_id`       | ID of a particular flight line. Supports `'%'` wildcards.                                      | `'DC_LDC_DIVIDE'`, `'%WSB%'`       |

Note that `age` and `acquisition_year` accept inequalities and ranges:
 - `'<'` selects values smaller than X: `age='<50000'` means all layers younger than 50000 years, and `acquisition_year='<1990'` means all data acquired before 1990
 - `'>'` selects values larger than X: `age='>50000'` means all layers older than 50000 years
 - `'-'` selects a range: `acquisition_year='2000-2010'` means all data acquired between 2000 and 2010
 - `'<='` and `'>='` work as you would expect
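
To make these semantics concrete, here is a minimal sketch of how such a specification string can be read. The `matches()` helper is hypothetical and is not the module's implementation. In particular, it assumes that `'-'` ranges include both end points, which this guide does not state.

```python
import re

def matches(value, spec):
    """Return True if a numeric value satisfies a range specification string.

    Assumption: 'a-b' ranges include both end points.
    """
    spec = spec.strip()
    # Inequalities: '<X', '<=X', '>X', '>=X'
    m = re.fullmatch(r"(<=|>=|<|>)\s*(\d+)", spec)
    if m:
        op, bound = m.group(1), int(m.group(2))
        return {
            "<": value < bound,
            "<=": value <= bound,
            ">": value > bound,
            ">=": value >= bound,
        }[op]
    # Ranges: 'A-B'
    m = re.fullmatch(r"(\d+)\s*-\s*(\d+)", spec)
    if m:
        low, high = int(m.group(1)), int(m.group(2))
        return low <= value <= high
    # Plain value: exact match
    return value == int(spec)

ages = [10000, 38100, 50000, 91000]  # made-up layer ages in years
younger = [a for a in ages if matches(a, "<50000")]
around_38ka = [a for a in ages if matches(a, "37500-38500")]
oldest = [a for a in ages if matches(a, ">90000")]
print(younger, around_38ka, oldest)
```

Note that a strict `'<50000'` leaves out a layer dated at exactly 50000 years, whereas `'<=50000'` would include it.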

In addition, all arguments support pattern matching with the `'%'` wildcard:
 - `flight_id='OIA%'` means all data whose flight ID starts with OIA
 - `dataset='%2025'` means all datasets whose name ends with 2025 (i.e., published in 2025)
 - `flight_id='%WSB%'` means all data with WSB anywhere in the flight ID
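
The `'%'` wildcard behaves like the one used in SQL `LIKE` patterns, where `%` matches any sequence of characters (including none). The sketch below reproduces the three kinds of pattern with Python's built-in `sqlite3`. The `flights` table and the flight IDs in it are made up for illustration, and the sketch assumes, without this guide confirming it, that the module passes such patterns on to a `LIKE` comparison.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE flights (flight_id TEXT)")  # hypothetical table
conn.executemany(
    "INSERT INTO flights VALUES (?)",
    [("OIA_JKB2n_X45a",), ("DC_LDC_DIVIDE",), ("ICP4_WSB_01",), ("WSB_OIA_02",)],
)

def match(pattern):
    """Return all flight IDs matching a LIKE pattern, sorted alphabetically."""
    rows = conn.execute(
        "SELECT flight_id FROM flights WHERE flight_id LIKE ? ORDER BY flight_id",
        (pattern,),
    ).fetchall()
    return [fid for (fid,) in rows]

starts_with_oia = match("OIA%")     # prefix match
contains_wsb = match("%WSB%")       # match anywhere in the ID
ends_with_divide = match("%DIVIDE") # suffix match
print(starts_with_oia, contains_wsb, ends_with_divide)
```

The ID `WSB_OIA_02` contains OIA but does not start with it, so `'OIA%'` does not select it. Use `'%OIA%'` to match OIA anywhere in the ID.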

In [None]:
#Examples of queries:
db.query(dataset='Cavitte_2020') # all data from Cavitte et al. 2020
db.query(institute='BAS') # all data that was acquired by BAS
db.query(project='OIB') # all data that was acquired during OIB campaigns
db.query(age='38100') # all datasets with the 38.1ka isochrone
db.query(var='ICE_THK') # all datasets with ICE_THK variable
db.query(IMBIE_basin='G-H') # all flight lines that cross the G-H basin
db.query(flight_id='DC_LDC_DIVIDE') # all layers with the flight ID DC_LDC_DIVIDE
db.query(flight_id='%WSB%') # all flight lines with WSB in the flight ID
db.query(dataset=['Franke_2025', 'Winter_2018'], age='38100') # example of multiple criteria

The `filter_out()` function pre-filters the data: whatever matches its arguments is excluded from all subsequent queries, until the filters are reset.

In [None]:
%%capture
db.filter_out(acquisition_year='<1990')  # filter out all data acquired before 1990
db.query() # now all queries will exclude data acquired before 1990

db.filter_out()  # reset all filters to include all data again
db.query()

## Visualization

Pass the results of a query to the plotting functions. The currently implemented plotting functions are:
- `plot.dataset()`: plots the locations of the data, with a different color for each dataset
- `plot.institute()`: plots the locations of the data, with a different color for each institute
- `plot.var()`: color-coded scatter plot of the variable of interest
- `plot.flight_id()`: color-coded flight IDs, useful for identifying specific flight lines of interest
- `plot.transect_1D()`: plots the depths of the IRHs and the bed along a single flight line

In a Jupyter Notebook, use `%matplotlib qt` or `%matplotlib widget`, depending on your IDE, to switch to an interactive matplotlib backend that lets you zoom and pan.
Use `%matplotlib inline` (the default) to render the figure in the notebook.

### Plot datasets

In [None]:
# %matplotlib widget
%matplotlib inline
results = db.query(IMBIE_basin='G-H')
db.plot.dataset(results,
                title='IRH data crossing the G-H IMBIE basin',
                xlim=(-2000, -500), # set the plot extent in km
                ylim=(-1000, 250),
                marker_size=1, # adjust the size of the markers
                )

Note: every flight line in the database is associated with one or several IMBIE basins, depending on whether it crosses one or multiple basins. The plot above shows all flight lines that have traced IRH data and cross the G-H basin at some point.

### Plot variables
Example of plotting the IRH depth of a specific layer found across multiple datasets. Here we select a narrow range of ages, which can plausibly be attributed to the same layer (dated with different methods, for instance). A warning is raised in this case, but it can be ignored:

In [None]:
results = db.query(age='37500-38500', var='IRH_DEPTH')
db.plot.var(results, title='AntArchitecture 38ka isochrone depth',
            downsampling_factor=10, # downsample the data by a factor of n: little visual difference, but much lighter to plot
            xlim=(-500, 2400),
            ylim=(-2200, 2200),
            scale_factor=0.7, # adjust the size of the plot
            marker_size=1.2,
            # save='AntA_38ka_depth.png'
            )

The plot above shows the absolute IRH depth, i.e., the depth below the ice surface as it was traced. It is often more informative to look at the IRH fractional depth: the depth of a layer relative to the ice thickness, in percent (IRH_DEPTH/ICE_THK*100). The fractional depth is not stored in the database, to reduce disk usage; it is computed when needed instead. For this, use the option `fraction_depth=True`:

In [None]:
# %matplotlib widget
%matplotlib inline
results = db.query(age=['37600', '38000', '38100', '38200', '38500'], var='IRH_DEPTH')
db.plot.var(results,
            fraction_depth=True,
            title='AntArchitecture 38ka isochrone fractional depth',
            downsampling_factor=10, # downsample the data by a factor of n: little visual difference, but much lighter to plot
            xlim=(-500, 2400),
            ylim=(-2200, 2200),
            scale_factor=0.7, # adjust the size of the plot
            marker_size=1.2,
            # save='AntA_38ka_depth.png'
            )

Note that if the variable ICE_THK is not present in a dataset, the fractional depth cannot be computed. In that case, NaN values are generated instead, without any warning. So if you get a blank plot, check whether the dataset you queried contains ICE_THK with `db.query(dataset='Author_YYYY')`.
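
The fractional depth and its NaN behaviour can be illustrated with a few made-up values. This is a minimal sketch of the formula IRH_DEPTH/ICE_THK*100 in plain Python, not the module's own code. The `fractional_depth()` helper and the sample numbers are hypothetical, and returning NaN for a zero ice thickness is a choice of this sketch.

```python
import math

def fractional_depth(irh_depth, ice_thk):
    """Layer depth as a percentage of the ice thickness, point by point.

    Returns NaN wherever either input is NaN or the ice thickness is zero.
    """
    out = []
    for depth, thickness in zip(irh_depth, ice_thk):
        if math.isnan(depth) or math.isnan(thickness) or thickness == 0:
            out.append(math.nan)
        else:
            out.append(depth / thickness * 100)
    return out

irh_depth = [500.0, 1500.0, math.nan, 900.0]  # metres below the surface (made-up)
ice_thk = [2000.0, 3000.0, 2500.0, math.nan]  # metres (made-up)

frac = fractional_depth(irh_depth, ice_thk)

# A result made only of NaN explains a blank plot: no point has both variables
all_missing = all(math.isnan(value) for value in frac)
print(frac, all_missing)
```

With NumPy arrays or pandas columns, the same quantity is simply `irh_depth / ice_thk * 100`, and NaN values propagate in the same way.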

### Plot IRH density

The IRH_NUM variable shows the number of traced isochrones (layers) per data point:

In [None]:
results = db.query(var='IRH_NUM')
db.plot.var(results, title='IRH Density over the Antarctic Ice Sheet',
                downsampling_factor=100,
                scale_factor=1,
                marker_size=1.2,
                )

### Plot flight IDs

This plot helps to identify a specific flight line. Once you have located a flight line of interest on the 2D map, you can plot the traced IRHs along that transect (see below):

In [None]:
results = db.query(dataset='Winter%', flight_id=['EPICA%'])
db.plot.flight_id(results, title='Winter et al. 2018 - EPICA',
                xlim=(-500, 1000),
                ylim=(1000, 2200),
                marker_size=1.2,
                )

### Plot layer depths along transect

In [None]:
results = db.query(dataset='Cavitte_2020', flight_id='DC_LDC_DIVIDE')
db.plot.transect_1D(results)

Use the elevation=True option to plot the transect in absolute elevation above sea level:

In [None]:
db.plot.transect_1D(results, elevation=True)

Other possible arguments for the plot methods:
- cmap: provide your colormap of choice (as LinearSegmentedColormap). Tip: 'import colormaps as cmaps' for a large choice of colormaps (see the [Colormaps docs](https://pratiman-91.github.io/colormaps/))
- vmin and vmax: set the minimum and maximum values for the colorbar.

## BEDMAP

The AntADatabase now contains all the BEDMAP (1, 2 and 3) data. This is useful to see the whole extent of the existing radar data, or to reconnect the IRH datasets to BEDMAP in order to get the Bed Elevation or Ice Thickness when those are not included.

By default, BEDMAP is not shown in the query, since it is not an IRH dataset. To include it, initialize the Database with the option:

In [None]:
db = Database('/home/anthe/documents/data/isochrones/AntADatabase/', include_BEDMAP=True)

## Get files from the database

You may want to make your own plots or further process the data after querying the database. One option is to get the list of the files from your query and open them individually with either xarray or h5py. For this, use the get_files() function:

In [None]:
import xarray as xr
import matplotlib.pyplot as plt
results = db.query(dataset='Cavitte_2020', flight_id='DC_LDC_DIVIDE')
file_list = db.get_files(results)
f = file_list[0]
ds = xr.open_dataset(f, engine='h5netcdf')
print(ds)
plt.plot(ds.Distance, ds.IRH_DEPTH)
plt.show()

xarray provides a nice interface for interacting with the data. This works well when dealing with a single file, for example to quickly plot all the layers (see above). However, xarray adds some overhead, which becomes noticeable when reading many files. h5py, on the other hand, reads the underlying data directly, which is much more efficient:

In [None]:
import h5py
import matplotlib.pyplot as plt
results = db.query(dataset='Cavitte_2020')
file_list = db.get_files(results)
for f in file_list:
    with h5py.File(f, 'r') as ds:
        plt.scatter(ds['PSX'][:], ds['PSY'][:], c=ds['ICE_THK'][:], s=1)
plt.show()

## Generate data from the database

Note: This part could be developed further in the future if there is a need. For now, I am designing a separate Python module for constraining the ice sheet model I use, which is tailored to this database and to other parallel processing libraries. However, the [Model-comparison](https://antoinehermant.github.io/anta_database/pism_example.html) section already gives some bits of code about it.

The `data_generator()` function reads the query and yields dataframes for later use. It uses h5py to read all the data efficiently and creates pandas dataframes that include all the variables and ages from the query. Columns for IRH_DEPTH are named after the age. In short, it reads the data with h5py as shown above and restructures it a bit, which can make layers easier to manage than with h5py dimensions.
Here is a quick example of how this can be used to compute the mean layer depth:

In [None]:
results = db.query(age=['37600', '38000', '38100', '38200', '38500'], var='IRH_DEPTH')
lazy_dfs = db.data_generator(results)

import numpy as np
mean_depth_trs = []
min_depth = float('inf')
max_depth = float('-inf')
for df, md in lazy_dfs:
    depth_values = df[md['age']].values
    mean_depth_trs.append(np.nanmean(depth_values))
    min_depth = min(min_depth, np.nanmin(depth_values))
    max_depth = max(max_depth, np.nanmax(depth_values))


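# Mean of the per-DataFrame means: each yielded DataFrame counts equally
mean_depth = np.nanmean(mean_depth_trs)
# Spread of the per-DataFrame means (sample standard deviation)
std_dev = np.nanstd(mean_depth_trs, ddof=1)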
print(f"The mean depth of the 38ka isochrone across East Antarctica is {round(mean_depth, 2)} m ranging from {round(min_depth, 2)} m to {round(max_depth, 2)} m.")

Note that `data_generator()` yields a simple pandas DataFrame for each flight line, containing all queried variables and ages (where they exist). The IRH depth of each layer is stored in a column named after its age (so the IRH_DEPTH of the layer 38000 is `df['38000']`). As shown above, you can use the metadata (`md`) of the current dataframe (`df`) to get its age (`md['age']`).
Furthermore, if you need the fractional depth, you can either compute it yourself from ICE_THK (which then has to be included in the query) or use the corresponding option: `db.data_generator(results, fraction_depth=True)`. With this option, the layer columns contain the IRH fractional depth instead of the absolute IRH_DEPTH.
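
The generator pattern described above, and the way the means are aggregated, can be sketched without the database. In the example below, `fake_data_generator()` is a hypothetical stand-in that yields one `(data, metadata)` pair per flight line, with plain dictionaries in place of pandas DataFrames and made-up depths. It also shows that averaging the per-line means, as in the example above, is not the same as averaging over all data points.

```python
import math

def fake_data_generator():
    """Hypothetical stand-in for db.data_generator(): yields (data, metadata) pairs."""
    lines = [
        ({"38100": [1000.0, 1100.0, math.nan]}, {"age": "38100", "flight_id": "LINE_A"}),
        ({"38000": [2000.0]}, {"age": "38000", "flight_id": "LINE_B"}),
    ]
    for data, md in lines:
        yield data, md  # only one flight line is handed out at a time

def nanmean(values):
    """Mean of the values that are not NaN (NaN if none are valid)."""
    valid = [v for v in values if not math.isnan(v)]
    return sum(valid) / len(valid) if valid else math.nan

line_means = []
all_points = []
for data, md in fake_data_generator():
    depths = data[md["age"]]  # the depth column is named after the layer age
    line_means.append(nanmean(depths))
    all_points.extend(depths)

mean_of_lines = nanmean(line_means)   # each flight line counts equally
mean_of_points = nanmean(all_points)  # each data point counts equally
print(mean_of_lines, mean_of_points)
```

Which of the two averages is appropriate depends on the question: the first gives short and long flight lines the same weight, the second favours densely sampled lines.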

# Pro tips

The Database methods always keep the last query in memory. This means that you do not have to pass the query results explicitly.
Example:

In [None]:
db.query(var='IRH_NUM')
# db.plot.var() # This is equivalent to the overview plot above. Note that without downsampling it is very heavy to plot.

import numpy as np
mean_density_trs = []
min_density = float('inf')
max_density = float('-inf')
for df, md in db.data_generator():    # without explicitly passing results, it uses the last query.
    density_values = df['IRH_NUM']
    mean_density_trs.append(np.mean(density_values))
    min_density = min(min_density, min(density_values))
    max_density = max(max_density, max(density_values))

mean_density = np.mean(mean_density_trs)
print(f"The average number of picked layers per data point in the AntADatabase is {int(mean_density)}, ranging from {int(min_density)} to {int(max_density)}. ")