# 13.02 - QA - Physico-Chemical, Biological, and Environmental Measurements Exploration with `mlcroissant`
This notebook guides users in loading and exploring the FAIR^2 marine monitoring dataset using the `mlcroissant` library. It follows the Croissant specification for data interoperability and transparency, referencing all entities by their `@id`.

### Dataset Source
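The dataset is described by a Croissant JSON-LD file at the schema URL `https://sen.science/doi/10.71728/senscience.bg74-tkzq/fair2.json`, which the Data Loading cell assigns to `croissant_url`.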

In [None]:
# Install mlcroissant for working with Croissant datasets
!pip install mlcroissant

## 1. Data Loading
Load metadata and records from the dataset using `mlcroissant`.

In [None]:
import mlcroissant as mlc
import pandas as pd

# Define the Croissant schema URL
croissant_url = 'https://sen.science/doi/10.71728/senscience.bg74-tkzq/fair2.json'

# Load the dataset metadata
dataset = mlc.Dataset(jsonld=croissant_url)

# Print dataset name and description
dataset_name = dataset.metadata.name
dataset_description = dataset.metadata.description
print(f"{dataset_name}: {dataset_description}")

## 2. Data Overview
Review available record sets, fields, and their IDs. All entities are referenced by their `@id`.

In [None]:
# List all available record sets by @id.
# mlcroissant exposes record sets as objects on dataset.metadata, not as dicts;
# in mlcroissant >= 1.0 the JSON-LD @id is available as the `.id` attribute.
record_sets = dataset.metadata.record_sets

print("Available record sets:")
for rs in record_sets:
    print(f"- RecordSet @id: {rs.id}, name: {rs.name}")

# List fields and columns for each record set
for rs in record_sets:
    print(f"\nFields for RecordSet @id: {rs.id}:")
    for field in rs.fields:
        print(f"  - Field @id: {field.id}, name: {field.name}, dataType: {field.data_types}")
    # `columns` is not part of the core Croissant RecordSet model, so read it defensively
    columns = getattr(rs, 'columns', None) or []
    if columns:
        print("  Columns:")
        for column in columns:
            print(f"    - Column @id: {getattr(column, 'id', 'N/A')}, name: {getattr(column, 'name', 'N/A')}")
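
The `@id`-based overview can also be read straight from the Croissant JSON-LD, without going through `mlcroissant` objects. The sketch below assumes the standard Croissant layout (`recordSet` entries that each hold a `field` list) and runs on a small hypothetical document, `SAMPLE_JSONLD`, which is not the content of the real `fair2.json`. The helper name `list_record_sets` is illustrative only.

```python
import json

# Hypothetical miniature Croissant document, for illustration only;
# it does not reproduce the real fair2.json file.
SAMPLE_JSONLD = json.loads("""
{
  "@type": "sc:Dataset",
  "name": "toy-marine-monitoring",
  "recordSet": [
    {
      "@type": "cr:RecordSet",
      "@id": "stations",
      "name": "stations",
      "field": [
        {"@type": "cr:Field", "@id": "stations/station_id",
         "name": "station_id", "dataType": "sc:Text"},
        {"@type": "cr:Field", "@id": "stations/temperature",
         "name": "temperature", "dataType": "sc:Float"}
      ]
    }
  ]
}
""")


def list_record_sets(doc):
    """Map each record set @id to a list of (field @id, dataType) pairs."""
    overview = {}
    for rs in doc.get("recordSet", []):
        fields = [(f["@id"], f.get("dataType", "N/A")) for f in rs.get("field", [])]
        overview[rs["@id"]] = fields
    return overview


overview = list_record_sets(SAMPLE_JSONLD)
for rs_id, fields in overview.items():
    print(f"RecordSet @id: {rs_id}")
    for field_id, data_type in fields:
        print(f"  - Field @id: {field_id}, dataType: {data_type}")
```

To apply the same helper to the real dataset, the JSON-LD would first have to be downloaded and parsed with `json.load`; that step is left out here to keep the sketch offline.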

## 3. Data Extraction
Load data from specific record sets into DataFrames for analysis. Use the record set and field `@id`s from the overview.

In [None]:
# Extract data from each record set using their @id
df_dict = {}

# For demonstration, load the first two record sets
example_record_set_ids = [rs.id for rs in dataset.metadata.record_sets[:2]]

for rs_id in example_record_set_ids:
    print(f"\nLoading records for RecordSet @id: {rs_id}")
    records = list(dataset.records(record_set=rs_id))
    df = pd.DataFrame(records)
    df_dict[rs_id] = df
    print(f"Columns for {rs_id}: {df.columns.tolist()}")
    print(df.head())
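
Depending on the schema, `mlcroissant` may yield text values as `bytes` rather than `str`, which makes printed DataFrames harder to read. The sketch below shows one way to decode such values before building a DataFrame. The helper `decode_record` and the sample records are hypothetical, and whether this dataset actually returns `bytes` is an assumption to check against the output of the cell above.

```python
def decode_record(record, encoding="utf-8"):
    """Return a copy of `record` with bytes values decoded to str; other values pass through."""
    return {
        key: value.decode(encoding) if isinstance(value, bytes) else value
        for key, value in record.items()
    }


# Hypothetical records shaped like the dicts yielded by dataset.records(...)
raw_records = [
    {"stations/station_id": b"ST-01", "stations/temperature": 12.5},
    {"stations/station_id": b"ST-02", "stations/temperature": 9.8},
]

clean_records = [decode_record(r) for r in raw_records]
print(clean_records[0])
```

In the extraction loop, the same helper could be applied as `pd.DataFrame(decode_record(r) for r in dataset.records(record_set=rs_id))`.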

## 4. Exploratory Data Analysis (EDA)
Apply common processing steps: filter records by a threshold, normalize a numeric field (z-score), and compute group means. All fields are referenced by their `@id`.

In [None]:
# Choose a record set and numeric field by @id for demonstration

# Select the first loaded record set
example_rs_id = example_record_set_ids[0]
df = df_dict[example_rs_id]

# Identify numeric fields from the loaded data. This assumes the DataFrame
# columns are keyed by field @id (as printed in the extraction step), so pandas
# dtypes give a check that does not depend on how the schema spells its dataType.
numeric_fields = df.select_dtypes(include='number').columns.tolist()

print(f"Numeric fields for RecordSet {example_rs_id}: {numeric_fields}")
if not numeric_fields:
    print("No numeric fields found.")
else:
    numeric_field_id = numeric_fields[0]       # Use first numeric field
    print(f"Using numeric field @id: {numeric_field_id}")

    # Filter where numeric field > threshold (example threshold: 10).
    # .copy() avoids pandas' SettingWithCopyWarning when a column is added below.
    threshold = 10
    filtered_df = df[df[numeric_field_id] > threshold].copy()
    print(f"Filtered records with {numeric_field_id} > {threshold}:")
    print(filtered_df.head())

    # Normalize the numeric column (z-score: subtract the mean, divide by the std)
    col = filtered_df[numeric_field_id]
    filtered_df[f"{numeric_field_id}_normalized"] = (col - col.mean()) / col.std()
    print(f"Normalized {numeric_field_id} for filtered records:")
    print(filtered_df[[numeric_field_id, f"{numeric_field_id}_normalized"]].head())

    # Choose a grouping field (categorical); for demo use the first non-numeric column
    group_fields = [c for c in df.columns if c not in numeric_fields]
    if group_fields:
        group_field_id = group_fields[0]
        grouped_df = filtered_df.groupby(group_field_id)[numeric_field_id].mean().reset_index()
        print(f"Grouped mean of {numeric_field_id} by {group_field_id}:")
        print(grouped_df.head())
    else:
        print("No suitable group field found.")
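
The normalization in the cell above is a z-score: each value has the mean subtracted and is divided by the standard deviation. As a minimal worked sketch, the same arithmetic is shown below with only the standard library. The function name `zscore` and the sample values are hypothetical. `statistics.stdev` computes the sample standard deviation (dividing by n - 1), which matches the default of pandas' `Series.std()`.

```python
import statistics


def zscore(values):
    """Return z-scores of `values` using the sample standard deviation (n - 1)."""
    mean = statistics.mean(values)
    std = statistics.stdev(values)  # requires at least two values
    if std == 0:
        raise ValueError("cannot normalize: all values are identical")
    return [(v - mean) / std for v in values]


# Hypothetical measurements that passed the `> threshold` filter
values = [12.0, 15.0, 18.0, 21.0]
normalized = zscore(values)
print(normalized)
```

After normalization the values have mean 0 and sample standard deviation 1, which is worth checking whenever the filter leaves only a handful of records.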

## 5. Visualization
Visualize data distributions or relationships between fields in the dataset.

In [None]:
import matplotlib.pyplot as plt

# Visualize the distribution of the numeric field for the filtered DataFrame
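# Guard against filtered_df being undefined if the EDA cell found nothing to filter
if numeric_fields and 'filtered_df' in locals() and numeric_field_id in filtered_df.columns: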
    plt.figure(figsize=(8, 4))
    filtered_df[numeric_field_id].hist(bins=30)
    plt.xlabel(numeric_field_id)
    plt.ylabel('Count')
    plt.title(f'Distribution of {numeric_field_id} (> {threshold}) in RecordSet {example_rs_id}')
    plt.show()

    # If grouped_df exists, visualize the group means
    if 'grouped_df' in locals():
        plt.figure(figsize=(8, 4))
        plt.bar(grouped_df[group_field_id], grouped_df[numeric_field_id])
        plt.xlabel(group_field_id)
        plt.ylabel(f"Mean {numeric_field_id}")
        plt.title(f"Mean {numeric_field_id} by {group_field_id}")
        plt.xticks(rotation=45)
        plt.tight_layout()
        plt.show()
else:
    print("No numeric field available for visualization.")

## 6. Conclusion
This notebook walks through loading, exploring, and processing the FAIR^2 marine monitoring dataset with the `mlcroissant` library. Referencing record sets and fields by their `@id` keeps each step traceable to the Croissant metadata. The loaded DataFrames can be analyzed further for scientific and environmental management applications.