# Long-Term Marine Environmental Monitoring Data from Basque Country Estuaries and Coasts, 1995–2014: Exploration with `mlcroissant`
This notebook provides a step-by-step guide for loading and exploring the FAIR^2 marine dataset using the `mlcroissant` library.

### Dataset Source
The dataset source is provided via a Croissant schema URL:

`https://sen.science/doi/10.71728/senscience.bx5w-rs5c/fair2.json`


In [None]:
# Install `mlcroissant` and the plotting libraries used later into the running kernel's environment
%pip install mlcroissant pandas matplotlib seaborn --quiet

## 1. Data Loading
Load metadata and records from the dataset using `mlcroissant`.

In [None]:
import mlcroissant as mlc
import pandas as pd
import json

# Define the dataset Croissant schema URL
croissant_url = 'https://sen.science/doi/10.71728/senscience.bx5w-rs5c/fair2.json'

# Load the dataset metadata
dataset = mlc.Dataset(croissant_url)
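metadata = dataset.metadata.to_json()  # plain dict in JSON-LD form
# Use .get() so a missing optional key prints 'n/a' instead of raising KeyError
print(f"{metadata.get('name', 'n/a')} (ID: {metadata.get('@id', 'n/a')})\n")
print(metadata.get('description', 'n/a'))
print(f"\nPublished: {metadata.get('datePublished', 'n/a')}")
print(f"Spatial Coverage: {metadata.get('spatialCoverage', 'n/a')}")
print(f"Temporal Coverage: {metadata.get('temporalCoverage', 'n/a')}")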

## 2. Data Overview
Review available record sets, fields, and their `@id`s.

In [None]:
# Get the record sets from the dataset metadata
record_sets = metadata.get('recordSet', [])
if not record_sets:
    print("No record sets found in the dataset metadata.")
else:
    print(f"Found {len(record_sets)} record sets. Listing their @id and fields:")
    for rs in record_sets:
        rs_id = rs.get('@id', '')
        print(f"\nRecord set @id: {rs_id}")
        fields = rs.get('field', [])
        if fields:
            field_ids = [f.get('@id', '') for f in fields]
            print(f"Fields @id: {field_ids}")
        else:
            print("No fields found in this record set.")
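
The record-set listing above can also be flattened into a small table, which is easier to scan when a dataset has many fields. The sketch below is only an illustration: `summarize_record_sets` is a hypothetical helper, and `example_metadata` is a made-up fragment that mimics the `recordSet` / `field` / `@id` layout used in this notebook, not the real dataset metadata.

```python
import pandas as pd

def summarize_record_sets(metadata: dict) -> pd.DataFrame:
    """Return one row per (record set, field) pair found in a metadata dict."""
    rows = []
    for rs in metadata.get("recordSet", []):
        for field in rs.get("field", []):
            rows.append({
                "record_set_id": rs.get("@id", ""),
                "field_id": field.get("@id", ""),
                "data_type": field.get("dataType", ""),
            })
    # Fix the column order so the result has the same shape even when empty
    return pd.DataFrame(rows, columns=["record_set_id", "field_id", "data_type"])

# Hypothetical metadata fragment, for illustration only
example_metadata = {
    "recordSet": [
        {"@id": "stations", "field": [
            {"@id": "stations/station_id", "dataType": "sc:Text"},
            {"@id": "stations/depth_m", "dataType": "sc:Float"},
        ]},
        {"@id": "empty_set"},  # a record set without fields contributes no rows
    ]
}

summary = summarize_record_sets(example_metadata)
print(summary)
```

In the notebook itself, the same helper could be called as `summarize_record_sets(metadata)` once the metadata cell has run.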

## 3. Data Extraction
Load data from each record set into a DataFrame for analysis. All references use `@id` values as in the FAIR^2 dataset.

In [None]:
# Build a list of record set @id values, skipping any record set without one
record_set_ids = [rs['@id'] for rs in record_sets if '@id' in rs]

dataframes = {}
for record_set_id in record_set_ids:
    try:
        records = list(dataset.records(record_set=record_set_id))
        df = pd.DataFrame(records)
        if not df.empty:
            dataframes[record_set_id] = df
            print(f"Loaded DataFrame for record set {record_set_id}: {df.shape[0]} rows, {df.shape[1]} columns.")
        else:
            print(f"Record set {record_set_id} yielded an empty DataFrame.")
    except Exception as e:
        print(f"Failed to load records for {record_set_id}: {e}")

# Show columns from the first available DataFrame
if dataframes:
    first_rs_id = list(dataframes.keys())[0]
    print(f"\nColumns in record set {first_rs_id}:")
    print(dataframes[first_rs_id].columns.tolist())
    # A bare expression inside an if-block is not displayed by the notebook, so print it
    print(dataframes[first_rs_id].head())
else:
    print("No data available from any record set.")
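
Depending on the `mlcroissant` version and the field data types, text values in the extracted records may arrive as `bytes` rather than `str`, which makes filtering and grouping awkward. The following is a minimal sketch of one way to normalize them; `decode_bytes_columns` is a hypothetical helper and the example column names are invented for illustration.

```python
import pandas as pd

def decode_bytes_columns(df: pd.DataFrame, encoding: str = "utf-8") -> pd.DataFrame:
    """Return a copy of df with bytes values decoded to str; other values are unchanged."""
    out = df.copy()
    for col in out.columns:
        # Only object-dtype columns can hold raw bytes values
        if out[col].dtype == object:
            out[col] = out[col].map(
                lambda v: v.decode(encoding) if isinstance(v, bytes) else v
            )
    return out

# Invented example data, for illustration only
example = pd.DataFrame({
    "stations/name": [b"station_a", b"station_b"],
    "stations/depth_m": [12.5, 30.0],
})

decoded = decode_bytes_columns(example)
print(decoded)
```

The helper works on a copy, so the DataFrames stored in `dataframes` stay untouched unless the result is assigned back, for example `dataframes[first_rs_id] = decode_bytes_columns(dataframes[first_rs_id])`.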

## 4. Exploratory Data Analysis (EDA)
Apply common data processing steps, such as filtering records based on specific criteria, normalizing numeric fields, and grouping data. 
All references to fields and columns use their `@id`. 

Let's perform example processing on the first DataFrame loaded.

In [None]:
# EDA for the first available record set
import numpy as np

if dataframes:
    df = dataframes[first_rs_id]
    # Find a numeric column by heuristics or list
    numeric_cols = df.select_dtypes(include=[np.number]).columns.tolist()
    
    if numeric_cols:
        numeric_field_id = numeric_cols[0]
        print(f"Using numeric field: {numeric_field_id}")
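        # Use the column mean as an arbitrary threshold for demonstration
        threshold = df[numeric_field_id].mean()
        # .copy() gives an independent DataFrame, so adding a column below does not trigger SettingWithCopyWarning
        filtered_df = df[df[numeric_field_id] > threshold].copy()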
        print(f"Filtered records where {numeric_field_id} > mean ({threshold:.2f}): {filtered_df.shape[0]} rows")

        # Normalization (Z-score)
        filtered_df[f"{numeric_field_id}_normalized"] = (
            filtered_df[numeric_field_id] - filtered_df[numeric_field_id].mean()
        ) / filtered_df[numeric_field_id].std()
        print(filtered_df[[numeric_field_id, f"{numeric_field_id}_normalized"]].head())

        # Attempt grouping by the first non-numeric column (numeric columns make poor grouping keys)
        potential_group_fields = [c for c in df.columns if c not in numeric_cols]
        if potential_group_fields:
            group_field_id = potential_group_fields[0]
            print(f"Grouping by field: {group_field_id}")
            grouped_df = filtered_df.groupby(group_field_id)[numeric_field_id].mean().reset_index()
            print(grouped_df.head())
        else:
            print("No grouping fields found.")
    else:
        print("No numeric fields found for EDA.")
else:
    print("No data loaded; cannot perform EDA.")
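
The filter, normalize, and group steps above can be tried on a toy table before running them on real records. This is a sketch under stated assumptions: `zscore` and `pick_group_column` are hypothetical helpers, the `max_unique=20` cutoff is an arbitrary choice, and the `site` / `value` data are invented.

```python
import pandas as pd

def zscore(series: pd.Series) -> pd.Series:
    """Standardize a numeric Series using the sample standard deviation (ddof=1)."""
    std = series.std()
    if pd.isna(std) or std == 0:
        # Constant or single-value input: avoid dividing by zero or NaN
        return series * 0.0
    return (series - series.mean()) / std

def pick_group_column(df: pd.DataFrame, exclude: str, max_unique: int = 20):
    """Return the first column (other than `exclude`) with few distinct values, or None."""
    for col in df.columns:
        if col == exclude:
            continue
        try:
            n_unique = df[col].nunique()
        except TypeError:  # unhashable values such as lists or dicts
            continue
        if 1 < n_unique <= max_unique:
            return col
    return None

# Invented toy data, for illustration only
toy = pd.DataFrame({"site": ["a", "a", "b", "b"], "value": [1.0, 3.0, 5.0, 7.0]})

# Filter on a copy so that adding a column does not touch the original frame
above_mean = toy[toy["value"] > toy["value"].mean()].copy()
above_mean["value_z"] = zscore(above_mean["value"])

group_col = pick_group_column(toy, exclude="value")
group_means = toy.groupby(group_col)["value"].mean()
print(above_mean)
print(group_means)
```

Choosing a grouping column by its number of distinct values avoids grouping on identifiers or free-text columns, where nearly every row would form its own group.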

## 5. Visualization
Visualize data distributions or relationships between fields using the extracted DataFrame.

In [None]:
# Import plotting libraries
import matplotlib.pyplot as plt
import seaborn as sns

if dataframes and numeric_cols:
    plt.figure(figsize=(8, 4))
    sns.histplot(df[numeric_field_id].dropna(), bins=30, kde=True)
    plt.title(f"Distribution of {numeric_field_id} in record set {first_rs_id}")
    plt.xlabel(numeric_field_id)
    plt.ylabel("Frequency")
    plt.show()

    # Scatter plot between first two numeric fields if available
    if len(numeric_cols) > 1:
        plt.figure(figsize=(8, 6))
        sns.scatterplot(data=df, x=numeric_cols[0], y=numeric_cols[1])
        plt.title(f"Scatter plot: {numeric_cols[0]} vs {numeric_cols[1]} in {first_rs_id}")
        plt.xlabel(numeric_cols[0])
        plt.ylabel(numeric_cols[1])
        plt.show()
else:
    print("Visualization skipped: no numeric data loaded.")

## 6. Conclusion
This notebook demonstrated how to load, explore, and analyze marine environmental monitoring data from the Basque Country using `mlcroissant`.
Key steps:
- Load metadata and records with schema references by `@id`
- Review record sets and fields using their unique `@id`s
- Extract tabular data for processing
- Filter, normalize, and group data using column `@id`s
- Visualize distributions and relationships

You can continue by selecting specific record sets and field `@id`s for advanced domain analyses.