# [UPSIDE - QA TEST] - Spatiotemporal Water Quality Exploration with `mlcroissant`
This notebook provides a template for loading and exploring a dataset using the `mlcroissant` library.

### Dataset Source
The dataset source is provided via a Croissant schema URL.

In [None]:
# Ensure `mlcroissant` library is installed
!pip install mlcroissant

## 1. Data Loading
Load metadata and records from the dataset using `mlcroissant`.

In [None]:
import mlcroissant as mlc
import pandas as pd

# Define the dataset URL
url = 'https://sen.science/doi/10.71728/senscience.9xzh-z4vm/fair2.json'

# Load the dataset metadata
dataset = mlc.Dataset(url)
metadata = dataset.metadata
print(f"{metadata.name}: {metadata.description}")

## 2. Data Overview
Review available record sets and their IDs.

In [None]:
# Overview of record sets: each entry is a RecordSet object, so use attribute access
for record_set in dataset.metadata.record_sets:
    print(f"Record Set: {record_set.name}, ID: {record_set.uuid}")

## 3. Data Extraction
Load data from a specific record set into a DataFrame for analysis.

In [None]:
# Extract data from a single record set
record_set_id = '<insert_record_set_id>'  # Replace with actual RecordSet ID
records = list(dataset.records(record_set=record_set_id))
dataframe = pd.DataFrame(records)

print(dataframe.columns.tolist())
dataframe.head()
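
Depending on the `mlcroissant` version and the field types declared in the schema, text values may come back as `bytes` rather than `str`. The following is a minimal sketch of decoding them before building the DataFrame. The helper `decode_record` and the sample records (`station`, `temperature`) are hypothetical stand-ins, not part of the dataset or of the `mlcroissant` API.

```python
def decode_record(record, encoding="utf-8"):
    """Return a copy of one record with any bytes values decoded to str."""
    return {
        key: value.decode(encoding) if isinstance(value, bytes) else value
        for key, value in record.items()
    }


# Hypothetical records standing in for the output of dataset.records(...).
sample_records = [
    {"station": b"ST01", "temperature": 14.2},
    {"station": b"ST02", "temperature": 15.1},
]

decoded_records = [decode_record(record) for record in sample_records]
print(decoded_records)
```

In the notebook, the same list comprehension could be applied to `records` before calling `pd.DataFrame(...)`. Non-bytes values pass through unchanged.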

## 4. Exploratory Data Analysis (EDA)
Apply common data processing steps: filter by a threshold, normalize a numeric field, and aggregate by a grouping field.

In [None]:
# EDA Operations
# Example operations: filtering, normalizing, grouping
numeric_field = '<insert_numeric_field>'  # Replace with actual numeric field name
threshold = 10
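# Take an explicit copy so the new column added below does not trigger SettingWithCopyWarning
filtered_df = dataframe[dataframe[numeric_field] > threshold].copy()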
print(f"Filtered records with {numeric_field} > {threshold}:")
print(filtered_df.head())

filtered_df[f"{numeric_field}_normalized"] = (
    (filtered_df[numeric_field] - filtered_df[numeric_field].mean()) /
    filtered_df[numeric_field].std()
)
print(f"Normalized {numeric_field} for filtered records:")
print(filtered_df[[numeric_field, f"{numeric_field}_normalized"]].head())

group_field = '<insert_group_field>'  # Example grouping field
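if group_field in filtered_df.columns:
    # numeric_only=True skips text columns, which would otherwise raise in pandas 2.x
    grouped_df = filtered_df.groupby(group_field).mean(numeric_only=True)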
    print(f"Grouped data by {group_field}:")
    print(grouped_df.head())
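
To make the filter-then-normalize step concrete, here is a small worked sketch in plain Python on made-up readings (the values are hypothetical, not taken from the dataset). It mirrors the cell above: the z-score is computed from the mean and standard deviation of the filtered subset only, and `statistics.stdev` uses the sample standard deviation, which is also the default for pandas `Series.std()`.

```python
from statistics import mean, stdev

# Hypothetical readings; not taken from the actual dataset.
readings = [8.0, 12.0, 15.0, 18.0]
threshold = 10

# Step 1: keep only readings above the threshold.
kept = [value for value in readings if value > threshold]

# Step 2: z-score normalization using statistics of the filtered subset.
mu = mean(kept)
sigma = stdev(kept)  # sample standard deviation (divides by n - 1)
z_scores = [(value - mu) / sigma for value in kept]

print(kept)      # [12.0, 15.0, 18.0]
print(z_scores)  # [-1.0, 0.0, 1.0]
```

Because the statistics come from the filtered subset, changing `threshold` also changes the normalized values. Normalizing before filtering would give different results.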

## 5. Conclusion
Summarize key findings and observations from the dataset exploration.

Exploring the Bay of Biscay water quality dataset with `mlcroissant` helps reveal temporal and spatial patterns in water quality metrics, as well as potential biases in the readings. The dataset's structured format supports both initial exploration and more in-depth analysis of environmental trends.