# [UPSIDE - QA TEST] - Spatiotemporal Water Quality Exploration with `mlcroissant`
This notebook shows how to load and explore the [UPSIDE - QA TEST] dataset, which covers water quality, contaminant levels, and biodiversity metrics in the Bay of Biscay, using the `mlcroissant` library.

### Dataset Source
The dataset is described by a Croissant JSON-LD metadata file, available at the URL defined in the Data Loading section.

In [None]:
# Ensure `mlcroissant` and the other libraries used in this notebook are installed
!pip install mlcroissant pandas matplotlib

## 1. Data Loading
Load metadata and records from the dataset using `mlcroissant`.

In [None]:
import mlcroissant as mlc
import pandas as pd

# Define the dataset URL
url = 'https://sen.science/doi/10.71728/senscience.9xzh-z4vm/fair2.json'

# Load the dataset metadata
dataset = mlc.Dataset(url)
metadata = dataset.metadata
print(f"{metadata.name}: {metadata.description}")

## 2. Data Overview
Review available record sets, fields, and their IDs.

In [None]:
# List available record sets (mlcroissant exposes them as RecordSet objects, not dicts)
record_sets = metadata.record_sets
for rs in record_sets:
    print(f"Record Set ID: {rs.uuid} - {rs.name}")

## 3. Data Extraction
Load data from a specific record set into a DataFrame for analysis. Use the record set and field `@id`s from the overview.

In [None]:
# Extract data from one record set (here, the first one listed in the metadata)
selected_record_set_id = dataset.metadata.record_sets[0].uuid
records = list(dataset.records(record_set=selected_record_set_id))
df = pd.DataFrame(records)

print(df.columns.tolist())
df.head()
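
Text fields in the extracted records may arrive as `bytes` rather than `str`, which makes later filtering and grouping awkward. The sketch below is one way to decode them before building the DataFrame. The helper name `decode_record` and the sample field names are hypothetical, and the sample values are synthetic rather than taken from the dataset.

```python
from typing import Any


def decode_record(record: dict[str, Any], encoding: str = "utf-8") -> dict[str, Any]:
    """Return a copy of `record` with any bytes values decoded to str."""
    return {
        key: value.decode(encoding) if isinstance(value, bytes) else value
        for key, value in record.items()
    }


# Synthetic record for illustration only; the field names are hypothetical.
sample = {"station/name": b"Arcachon", "station/depth_m": 12.5}
decoded = decode_record(sample)
print(decoded)
```

With the `records` list from the cell above, the same helper could be applied as `pd.DataFrame([decode_record(r) for r in records])`.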

## 4. Exploratory Data Analysis (EDA)
Apply common data processing steps to prepare the data for further analysis: filter records by a threshold, normalize a numeric field, and group records by a key attribute. Depending on the fields involved, removing outliers or transforming skewed distributions may also be useful.

In [None]:
# Select a numeric field for analysis
numeric_field = df.columns[1]  # Replace with actual numeric field ID
threshold = 10
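# Copy the filtered rows so that adding a column later does not trigger SettingWithCopyWarning
filtered_df = df[df[numeric_field] > threshold].copy()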
print(f"Filtered records with {numeric_field} > {threshold}:")
print(filtered_df.head())

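# Normalize numeric field (z-score: subtract the mean, divide by the standard deviation)
field_mean = filtered_df[numeric_field].mean()
field_std = filtered_df[numeric_field].std()
filtered_df[f"{numeric_field}_normalized"] = (filtered_df[numeric_field] - field_mean) / field_std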
print(f"Normalized {numeric_field} for filtered records:")
print(filtered_df[[numeric_field, f"{numeric_field}_normalized"]].head())

# Group by another field and average the numeric columns within each group
group_field = df.columns[2]  # Replace with actual group field ID
# numeric_only=True skips text columns, which would otherwise raise a TypeError
grouped_df = filtered_df.groupby(group_field).mean(numeric_only=True)
print(f"Grouped data by {group_field}:")
print(grouped_df.head())
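
Outlier removal is mentioned above but not demonstrated. A minimal sketch follows, assuming the common interquartile-range (IQR) rule is appropriate for the field. The helper name `remove_iqr_outliers`, the column name `nitrate_mg_l`, and the values are all hypothetical and synthetic, not taken from the dataset.

```python
import pandas as pd


def remove_iqr_outliers(frame: pd.DataFrame, column: str, k: float = 1.5) -> pd.DataFrame:
    """Keep rows whose `column` value lies within [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = frame[column].quantile(0.25)
    q3 = frame[column].quantile(0.75)
    iqr = q3 - q1
    mask = frame[column].between(q1 - k * iqr, q3 + k * iqr)
    return frame[mask].copy()


# Synthetic values for illustration only; not taken from the dataset.
demo = pd.DataFrame({"nitrate_mg_l": [1.0, 1.2, 0.9, 1.1, 1.3, 50.0]})
cleaned = remove_iqr_outliers(demo, "nitrate_mg_l")
print(cleaned)
```

The multiplier `k = 1.5` is a conventional default, not a property of this dataset. For contaminant measurements, extreme values can be genuine events, so it is worth inspecting the dropped rows before discarding them.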

## 5. Visualization
Visualize data distributions or relationships between fields in the dataset.

In [None]:
import matplotlib.pyplot as plt

# Plot distribution of a numeric field
plt.figure(figsize=(10, 6))
plt.hist(df[numeric_field].dropna(), bins=50, alpha=0.7)  # drop NaN values, which break automatic binning
plt.title(f"Distribution of {numeric_field}")
plt.xlabel(numeric_field)
plt.ylabel("Frequency")
plt.show()

## 6. Conclusion
In this notebook, we used the `mlcroissant` library to load and explore the [UPSIDE - QA TEST] dataset, which contains spatiotemporal data related to water quality in the Bay of Biscay. We listed its record sets, extracted one into a pandas DataFrame, filtered and normalized a numeric field, grouped records by a second field, and plotted a distribution. These steps give an overview of the dataset's structure and content and provide a starting point for further analysis.