# Hydrological and Environmental Measurements in Bay of Biscay and Adjacent Waters Exploration with `mlcroissant`
This notebook demonstrates how to load, explore, and analyze the FAIR² dataset (hydrological and environmental measurements in the Bay of Biscay and adjacent waters) using the `mlcroissant` library.

### Dataset Source
The dataset is described using a Croissant schema and is available at the following URL:

```
https://sen.science/doi/10.71728/senscience.h3h6-tjan/fair2.json
```

We will walk step-by-step through the dataset structure, extraction, and analysis process.

In [None]:
# Ensure mlcroissant is installed in your current Jupyter environment
!pip install --quiet mlcroissant

## 1. Data Loading

We start by loading metadata from the Croissant schema URL. The `mlcroissant` library provides convenient tooling for accessing both metadata and records defined in the Croissant description.

In [None]:
import mlcroissant as mlc
import pandas as pd

# Define the Croissant schema URL
croissant_url = 'https://sen.science/doi/10.71728/senscience.h3h6-tjan/fair2.json'

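# Load the dataset metadata (the JSON-LD location is passed via the `jsonld` argument)
dataset = mlc.Dataset(jsonld=croissant_url)

# Display basic dataset information
print(f"Dataset Name: {dataset.metadata.name}")
print(f"Description: {dataset.metadata.description}")
# mlcroissant exposes metadata as snake_case attributes (date_published, not datePublished)
print(f"Published: {dataset.metadata.date_published}")
print(f"License: {dataset.metadata.license}")
# keywords may be absent, so fall back to an empty list before joining
print(f"Keywords: {', '.join(dataset.metadata.keywords or [])}")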

## 2. Data Overview

Next, we list the record sets available in the dataset and inspect their fields. *All record sets and fields are referenced **exclusively by their `@id`** values, consistent with the Croissant specification.*

In [None]:
# List all record sets (tables) defined in the dataset
record_sets = list(dataset.metadata.record_sets)

if not record_sets:
    print('No record sets found in metadata.')
else:
    print(f"Found {len(record_sets)} record sets:\n")
    for rec in record_sets:
        # RecordSet is an object, not a dict: read the @id from its `id` attribute
        print(f"@id: {rec.id}")
        print(f"  Name: {rec.name or '<no name>'}")
        print(f"  Description: {rec.description or ''}")
        # List out fields by @id
        print(f"  Fields: {[fld.id for fld in rec.fields]}")
        print()
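
As a cross-check on what `mlcroissant` reports, the `@id` values can also be read directly from the Croissant JSON-LD. The sketch below assumes the usual Croissant layout: a top-level `recordSet` list whose entries each carry a `field` list. The `sample` document and the helper name `list_record_set_ids` are made up for illustration and are not taken from this dataset.

```python
import json


def list_record_set_ids(croissant_doc):
    """Map each record set @id to the list of @id values of its fields."""
    result = {}
    for record_set in croissant_doc.get("recordSet", []):
        fields = record_set.get("field", [])
        result[record_set.get("@id")] = [f.get("@id") for f in fields]
    return result


# Hypothetical, heavily trimmed Croissant document (NOT the real schema).
sample = json.loads("""
{
  "@type": "sc:Dataset",
  "name": "example",
  "recordSet": [
    {
      "@type": "cr:RecordSet",
      "@id": "stations",
      "field": [
        {"@type": "cr:Field", "@id": "stations/station_id"},
        {"@type": "cr:Field", "@id": "stations/latitude"}
      ]
    }
  ]
}
""")

for rs_id, field_ids in list_record_set_ids(sample).items():
    print(rs_id, field_ids)
```

To apply this to the real schema, the JSON document at the URL above would be downloaded and passed to the helper in place of `sample`.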

**Example record preview:**

We preview a few rows of one record set by referencing its `@id`. (Set `example_record_set_id` below to the `@id` you want to explore.)

In [None]:
# Choose a record set of interest by its @id (replace this with a valid one from previous output)
example_record_set_id = ''  # e.g. 'sen:water_chemistry_records' (Replace with a valid @id)

if example_record_set_id:
    print(f'Displaying a few example records for record set: {example_record_set_id}')
    for i, record in enumerate(dataset.records(record_set=example_record_set_id)):
        print(f'Row {i+1}: {record}')
        if i >= 2:
            break
else:
    print('Please update example_record_set_id with a valid @id from the previous cell output.')
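
The same "first few rows" preview can be written with `itertools.islice`, which stops reading the iterator after `n` items instead of relying on a manual `break`. This is a minimal sketch: `preview` and `fake_records` are hypothetical names, and the generator stands in for the iterator returned by `dataset.records(...)`.

```python
from itertools import islice


def preview(records, n=3):
    """Return the first n items of an iterable, reading no further than needed."""
    return list(islice(records, n))


# Stand-in generator; in the notebook this would be
# dataset.records(record_set=example_record_set_id).
def fake_records():
    for i in range(1_000_000):
        yield {"row": i}


for i, record in enumerate(preview(fake_records()), start=1):
    print(f"Row {i}: {record}")
```

Because only `n` items are pulled from the generator, the preview stays cheap even for large record sets.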

## 3. Data Extraction

Let's extract data from all or selected record sets into pandas DataFrames for further processing. Make sure to use the correct record set `@id`s.

In [None]:
# List record set @id's to extract (add from output of section 2)
record_set_ids = []  # Example: ['sen:water_chemistry_records', 'sen:species_abundance_records']

dataframes = {}
for rec_id in record_set_ids:
    print(f"Loading records for record set: {rec_id} ...")
    records = list(dataset.records(record_set=rec_id))
    dataframes[rec_id] = pd.DataFrame(records)

# Display column names and the first rows of the first loaded record set
if record_set_ids:
    first_df = dataframes[record_set_ids[0]]
    print(f"\nColumns for '{record_set_ids[0]}':\n", first_df.columns.tolist())
    # A bare expression inside an if-block is not rendered by Jupyter, so print it explicitly
    print(first_df.head())
else:
    print('Please specify one or more valid record_set @id values in record_set_ids to extract data.')
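
Depending on the `mlcroissant` version and the field types, text values may come back as `bytes` rather than `str`, which makes later filtering and grouping awkward. The sketch below shows one way to decode such columns after building a DataFrame. The helper `decode_bytes_columns` and the sample rows (including the column names) are hypothetical and only illustrate the shape of the problem.

```python
import pandas as pd


def decode_bytes_columns(df, encoding="utf-8"):
    """Return a copy of df in which bytes values are decoded to str."""
    out = df.copy()
    for col in out.columns:
        # Only object-typed columns can hold bytes values.
        if out[col].dtype == object:
            out[col] = out[col].map(
                lambda v: v.decode(encoding) if isinstance(v, bytes) else v
            )
    return out


# Hypothetical rows shaped like records from a Croissant record set.
raw = pd.DataFrame([
    {"stations/name": b"A1", "stations/depth_m": 12.5},
    {"stations/name": b"B2", "stations/depth_m": 40.0},
])
clean = decode_bytes_columns(raw)
print(clean["stations/name"].tolist())
```

The helper returns a copy, so the frames stored in `dataframes` are left untouched unless you reassign them.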

## 4. Exploratory Data Analysis (EDA)

This section applies three basic steps: filtering rows above a threshold, z-score normalizing the remaining values, and optionally grouping them by a categorical field. This workflow helps prepare and understand the data before modeling or deeper statistical analysis.

In [None]:
# Choose record set and fields (by @id) to analyze (update with your column/field @id)
record_set_id = ''  # Example: 'sen:water_chemistry_records'
numeric_field_id = ''  # Example: 'sen:nitrate_concentration'
group_field_id = ''    # Example: 'sen:site_name'

# Perform analysis only if everything specified
if record_set_id and numeric_field_id and record_set_id in dataframes:
    df = dataframes[record_set_id]
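    # Convert to numeric; values that cannot be parsed become NaN
    df[numeric_field_id] = pd.to_numeric(df[numeric_field_id], errors='coerce')
    # Arbitrary example threshold; adjust it to the units of the chosen field
    threshold = 10
    # NaN values fail the comparison, so they are dropped here as well
    filtered_df = df[df[numeric_field_id] > threshold].copy()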
    print(f"Filtered records with '{numeric_field_id}' > {threshold}:")
    print(filtered_df.head())

    filtered_df[f"{numeric_field_id}_normalized"] = (
        (filtered_df[numeric_field_id] - filtered_df[numeric_field_id].mean()) / filtered_df[numeric_field_id].std()
    )
    print(f"\nNormalized '{numeric_field_id}' for filtered records:")
    print(filtered_df[[numeric_field_id, f"{numeric_field_id}_normalized"]].head())

    # (Optional) Group by
    if group_field_id and group_field_id in filtered_df.columns:
        grouped_df = filtered_df.groupby(group_field_id)[numeric_field_id].mean().reset_index()
        print(f"\nGrouped data by '{group_field_id}':")
        print(grouped_df.head())
else:
    print("Please set valid values for record_set_id, numeric_field_id, and ensure data is loaded in dataframes.")
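
The normalization used above is a z-score, i.e. each value minus the mean, divided by the standard deviation. It breaks down when the filtered subset has one row or identical values, because the standard deviation is then undefined or zero. The sketch below guards against that case on a small synthetic table. The helper `zscore` and the data are hypothetical and not drawn from the dataset.

```python
import pandas as pd


def zscore(series):
    """Standardize a numeric Series; all NaN if the std is zero or undefined."""
    std = series.std()  # sample standard deviation (ddof=1), the pandas default
    if pd.isna(std) or std == 0:
        return pd.Series(float("nan"), index=series.index)
    return (series - series.mean()) / std


# Hypothetical measurements; column names are illustrative only.
toy = pd.DataFrame({
    "site": ["A", "A", "B", "B"],
    "value": ["12", "14", "16", "n/a"],
})
toy["value"] = pd.to_numeric(toy["value"], errors="coerce")  # "n/a" becomes NaN
valid = toy.dropna(subset=["value"]).copy()
valid["value_z"] = zscore(valid["value"])
print(valid)
print(valid.groupby("site")["value"].mean())
```

Note that the mean and standard deviation are computed on the rows that survive filtering, so the resulting z-scores describe the filtered subset rather than the full record set.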

## 5. Visualization

Let's visualize the distribution of a numeric field with matplotlib. (Update variable names to match your field `@id`s.)

In [None]:
import matplotlib.pyplot as plt

# Sample plot: histogram for a selected numeric field
if record_set_id and numeric_field_id and record_set_id in dataframes:
    df = dataframes[record_set_id]
    df[numeric_field_id] = pd.to_numeric(df[numeric_field_id], errors='coerce')
    plt.figure(figsize=(7,4))
    df[numeric_field_id].hist(bins=30)
    plt.xlabel(numeric_field_id)
    plt.ylabel('Count')
    plt.title(f"Distribution of {numeric_field_id}")
    plt.show()
else:
    print('Please load data and set valid record_set_id and numeric_field_id for visualization.')

## 6. Conclusion

This notebook illustrated:
- Loading and understanding Croissant-powered dataset metadata and structure using `mlcroissant`
- Accessing and extracting data by `@id`
- Performing preliminary EDA with field filtering and normalization
- Visualizing numeric distributions for further insight

**Tip:** Refer to the Croissant schema to identify `@id`s for record sets, fields, and columns. Using `mlcroissant`, you can efficiently explore large environmental or scientific datasets with complex structures.