# Exploring Hydrological and Environmental Measurements in Bay of Biscay and Adjacent Waters with `mlcroissant`
This notebook demonstrates how to load, explore, and process the FAIR² dataset using the `mlcroissant` library. 

### Dataset Source
The dataset uses a Croissant schema accessible at <https://sen.science/doi/10.71728/senscience.h3h6-tjan/fair2.json>.

In [None]:
# Install mlcroissant. If it is already installed in your environment, comment out the line below:
!pip install mlcroissant

## 1. Data Loading
Load metadata and records from the dataset using `mlcroissant`.

In [None]:
import mlcroissant as mlc
import pandas as pd

# The Croissant schema URL for FAIR²
croissant_url = "https://sen.science/doi/10.71728/senscience.h3h6-tjan/fair2.json"

# Load the dataset
dataset = mlc.Dataset(croissant_url)
metadata = dataset.metadata

print(f"{metadata.name}: {metadata.description}")

## 2. Data Overview
List all available record sets together with each field's `@id` and label.

In [None]:
# List all record sets and their fields, referencing each by its @id
print("Available record sets (@id):")
record_sets = []
# Record sets live on the metadata object; nodes expose attributes, not dict keys
for record_set in dataset.metadata.record_sets:
    # uuid is the identifier that dataset.records() expects (the @id in Croissant 1.0)
    print(f"\nRecord Set: {record_set.uuid}")
    record_sets.append(record_set.uuid)
    print("  Fields:")
    for field in record_set.fields:
        # name serves as the field's human-readable label
        print(f"    - {field.uuid} {field.name or ''}")
    print()

## 3. Data Extraction
Load data from all discovered record sets into Pandas DataFrames for further processing.
We use `@id` for referencing each record set.

In [None]:
# Load all record sets data to DataFrames
dataframes = {}

# Iterate over the record set ids discovered in the previous step
for record_set_id in record_sets:
    try:
        records = list(dataset.records(record_set=record_set_id))
        if records:
            df = pd.DataFrame(records)
            dataframes[record_set_id] = df
            print(f"Loaded {len(df)} records for {record_set_id}")
            print(f"Columns: {df.columns.tolist()}")
            display(df.head())
        else:
            print(f"No records found for {record_set_id}")
    except Exception as e:
        print(f"Error loading {record_set_id}: {e}")
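
Depending on the `mlcroissant` version and the field's data type, text values in the yielded records may arrive as `bytes` rather than `str`, which makes them awkward to display or group. The sketch below shows one way to decode them before building a DataFrame. `decode_record` is a hypothetical helper (not part of `mlcroissant`), the UTF-8 encoding is an assumption, and the toy record's key names are made up for illustration.

```python
def decode_record(record: dict, encoding: str = "utf-8") -> dict:
    """Return a copy of `record` with any bytes values decoded to str."""
    return {
        key: value.decode(encoding, errors="replace") if isinstance(value, bytes) else value
        for key, value in record.items()
    }


# Toy record standing in for one item yielded by dataset.records(...);
# the key names are hypothetical and not taken from the actual dataset.
toy_record = {"stations/name": b"A1", "stations/temperature": 12.5}
clean_record = decode_record(toy_record)
print(clean_record)
```

If decoding is needed, the helper could be applied while collecting records, for example `pd.DataFrame([decode_record(r) for r in records])`.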

## 4. Exploratory Data Analysis (EDA)
Let's perform some EDA steps for one of the record sets with tabular numeric data.

We'll select one record set with available data and pick a numeric field (by its `@id`) to:
- Filter records by a threshold,
- Normalize the field,
- Group by a categorical field, if available.

In [None]:
# Select a record set with loaded data for EDA
import numpy as np
# You may pick a record_set_id from previous outputs. For demonstration:
eda_record_set_id = None
for k, v in dataframes.items():
    if not v.select_dtypes(include=np.number).empty:
        eda_record_set_id = k
        break
if eda_record_set_id is None:
    print("No numeric record set found. Adjust and rerun.")
else:
    df = dataframes[eda_record_set_id]
    numeric_fields = df.select_dtypes(include=np.number).columns.tolist()
    if not numeric_fields:
        print("No numeric columns found in selected record set.")
    else:
        # Use the first numeric field as demo
        numeric_field = numeric_fields[0]
        print(f"Analyzing field: {numeric_field}\n")

        # Filter records above a demo threshold: the field mean, or 10 if the mean is not positive
        field_mean = df[numeric_field].mean()
        threshold = field_mean if field_mean > 0 else 10
        filtered_df = df[df[numeric_field] > threshold].copy()
        print(f"Filtered records with {numeric_field} > {threshold:.2f}:")
        display(filtered_df.head())

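        # Normalize column (z-score against the full record set's mean and std)
        mean = df[numeric_field].mean()
        std = df[numeric_field].std()
        # Avoid division by zero; std is NaN when fewer than two values are present
        if pd.isna(std) or std == 0:
            filtered_df[f"{numeric_field}_normalized"] = 0.0
        else:
            filtered_df[f"{numeric_field}_normalized"] = (filtered_df[numeric_field] - mean) / std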

        print(f"\nNormalized {numeric_field} for filtered records:")
        display(filtered_df[[numeric_field, f"{numeric_field}_normalized"]].head())

        # Attempt to group by a categorical field
        cat_fields = df.select_dtypes(exclude=np.number).columns
        group_field = None
        for f in cat_fields:
            if filtered_df[f].nunique() > 1 and filtered_df[f].nunique() < 20:
                group_field = f
                break

        if group_field:
            grouped_df = filtered_df.groupby(group_field)[numeric_field].mean().reset_index()
            print(f"\nGrouped mean {numeric_field} by {group_field}:")
            display(grouped_df.head())
        else:
            print("No suitable categorical group field found for grouping.")
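
The normalization in the cell above is a z-score that relies on `Series.std()`, which defaults to the sample standard deviation (`ddof=1`), whereas `numpy.std` defaults to the population form (`ddof=0`). The minimal sketch below uses made-up toy values, not data from this dataset, to show how the two conventions differ, so you can choose the one that suits your analysis.

```python
import numpy as np
import pandas as pd

# Toy values chosen for easy arithmetic: mean is 5, population std is 2.
values = pd.Series([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

sample_std = values.std()                   # pandas default: ddof=1
population_std = np.std(values.to_numpy())  # NumPy default: ddof=0

# z-scores under each convention
z_sample = (values - values.mean()) / sample_std
z_population = (values - values.mean()) / population_std

print(f"sample std: {sample_std:.4f}, population std: {population_std:.4f}")
```

For large record sets the two results are nearly identical; the difference matters mainly for small groups.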

## 5. Visualization
Let's visualize the distribution of the selected numeric field using a histogram and, if grouped, a bar chart of the group means.

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

if eda_record_set_id is not None and numeric_fields:
    fig, ax = plt.subplots(1, 2, figsize=(14, 6))
    sns.histplot(df[numeric_field].dropna(), bins=30, kde=True, ax=ax[0], color='skyblue')
    ax[0].set_title(f"Distribution of {numeric_field}")
    ax[0].set_xlabel(numeric_field)

    if 'grouped_df' in locals() and group_field is not None:
        sns.barplot(data=grouped_df, x=group_field, y=numeric_field, ax=ax[1])
        ax[1].set_title(f"Mean {numeric_field} by {group_field}")
        # Rotate tick labels without resetting them (avoids a fixed-locator warning)
        ax[1].tick_params(axis='x', labelrotation=45)
    else:
        ax[1].axis('off')

    plt.tight_layout()
    plt.show()

## 6. Conclusion
In this notebook, we demonstrated how to use `mlcroissant` to load and analyze a FAIR² dataset with multiple record sets directly from a Croissant schema URL.

- We listed the available record sets and their fields by `@id`.
- We loaded each record set into a Pandas DataFrame.
- For a selected record set, we filtered a numeric field, normalized it, computed group means, and plotted the results.

You can adapt these steps to other record sets and fields by referencing their `@id` as shown in the notebook.