# Marine Biodiversity and Environmental Data Exploration with `mlcroissant`
This notebook provides a guide for loading and exploring a dataset using the `mlcroissant` library.

### Dataset Source
The dataset is described by a Croissant metadata file (JSON-LD). Its URL is defined in the Data Loading section below.

In [None]:
# Ensure `mlcroissant` library is installed
!pip install mlcroissant

## 1. Data Loading
Load metadata and records from the dataset using `mlcroissant`.

In [None]:
import mlcroissant as mlc
import pandas as pd

# Define the dataset URL
url = 'https://cdn.dev.senscience.cloud/portals/10.82843/pm80-mh77/fair2.json'

# Load the dataset metadata
dataset = mlc.Dataset(url)
metadata = dataset.metadata.to_json()
print(f"{metadata['name']}: {metadata['description']}")

## 2. Data Overview
Review available record sets, fields, and their IDs.

In [None]:
# List every record set in the dataset, together with its fields
record_sets = metadata.get('recordSet', [])
for record_set in record_sets:
    print(f"Record Set ID: {record_set['@id']}")
    # `description` is optional in Croissant metadata, so avoid a KeyError
    print(f"Description: {record_set.get('description', '(no description)')}")
    for field in record_set.get('field', []):
        print(f"  Field ID: {field['@id']}, Name: {field.get('name', '(unnamed)')}")
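For datasets with many fields, the printed overview can be hard to scan. As a minimal sketch, the same information can be flattened into a DataFrame with one row per field. The helper name `fields_table` and the `sample_metadata` dictionary are hypothetical; the sample only imitates the shape of `dataset.metadata.to_json()` and does not reflect the real dataset.

```python
import pandas as pd

def fields_table(metadata: dict) -> pd.DataFrame:
    """Flatten Croissant-style metadata into one row per (record set, field)."""
    columns = ["record_set_id", "field_id", "field_name", "data_type"]
    rows = []
    for record_set in metadata.get("recordSet", []):
        for field in record_set.get("field", []):
            rows.append({
                "record_set_id": record_set.get("@id"),
                "field_id": field.get("@id"),
                "field_name": field.get("name"),
                "data_type": field.get("dataType"),
            })
    return pd.DataFrame(rows, columns=columns)

# Hypothetical metadata, shaped like the dictionary returned by to_json()
sample_metadata = {
    "recordSet": [
        {
            "@id": "WATER.csv",
            "field": [
                {"@id": "WATER.csv/siteid", "name": "siteid", "dataType": "sc:Text"},
                {"@id": "WATER.csv/temperature", "name": "temperature", "dataType": "sc:Float"},
            ],
        }
    ]
}

overview = fields_table(sample_metadata)
print(overview)
```

In the notebook, `fields_table(metadata)` would be called on the real metadata dictionary. The resulting table makes it easier to copy the exact `@id` values needed in the extraction step.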

## 3. Data Extraction
Load data from a specific record set into a DataFrame for analysis. Use the record set and field `@id`s from the overview.

In [None]:
# Extract the records of one record set into a DataFrame
record_set_id = 'https://senscience.ai/frontiers/borja/WATER.csv'  # Example record set ID; replace with an `@id` printed in the overview
records = list(dataset.records(record_set=record_set_id))
df = pd.DataFrame(records)

print(df.columns.tolist())
df.head()
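Text values in the extracted records may arrive as `bytes` rather than `str`, depending on the `mlcroissant` version and the declared field types. This is an assumption to check against `df.head()`, not a guarantee. If it applies, a small helper along these lines can decode them before analysis. The name `decode_bytes_columns` and the sample frame are hypothetical.

```python
import pandas as pd

def decode_bytes_columns(df: pd.DataFrame, encoding: str = "utf-8") -> pd.DataFrame:
    """Return a copy of `df` in which every `bytes` value is decoded to `str`."""
    decoded = df.copy()
    for column in decoded.columns:
        # Only object-typed columns can hold raw bytes values
        if decoded[column].dtype == object:
            decoded[column] = decoded[column].map(
                lambda value: value.decode(encoding) if isinstance(value, bytes) else value
            )
    return decoded

# Hypothetical sample imitating a record set with one text and one numeric field
sample = pd.DataFrame({"siteid": [b"A1", b"B2"], "temperature": [12.5, 14.0]})
clean = decode_bytes_columns(sample)
print(clean)
```

Working on a copy leaves the original DataFrame untouched, so the raw records stay available for comparison.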

## 4. Exploratory Data Analysis (EDA)
Apply common processing steps to the extracted DataFrame. The cell below filters records by a threshold on a numeric field, standardizes that field to z-scores, and averages the numeric columns per site. Other common preparation steps include removing outliers and transforming skewed distributions.

In [None]:
# Select a numeric field for analysis
numeric_field = 'https://api.dev.senscience.cloud/frontiers/8704cfcf-0667-4eaf-b0b1-679597df683f-16/1cf446bb-deb7-4b51-8958-0561d8c9668c'

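# Filter records with a threshold
threshold = 10
# .copy() makes filtered_df independent of df, so adding a column later does not raise SettingWithCopyWarning
filtered_df = df[df[numeric_field] > threshold].copy()
print(f"Filtered records with {numeric_field} > {threshold}:")
print(filtered_df.head())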

# Normalize the numeric field to z-scores (zero mean, unit standard deviation)
normalized_field = f"{numeric_field}_normalized"
values = filtered_df[numeric_field]
filtered_df[normalized_field] = (values - values.mean()) / values.std()
print(f"Normalized {numeric_field} for filtered records:")
print(filtered_df[[numeric_field, normalized_field]].head())

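# Group by another field and average the numeric columns per group
group_field = 'siteid'
if group_field in filtered_df.columns:
    # numeric_only=True skips text columns, which would otherwise raise a TypeError in pandas 2.x
    grouped_df = filtered_df.groupby(group_field).mean(numeric_only=True)
    print(f"Grouped data by {group_field}:")
    print(grouped_df.head())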
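Outlier removal is a common preparation step before normalization, because extreme values inflate the standard deviation used for z-scores. The following is a minimal sketch of one widely used rule, the interquartile range (IQR) fence. The helper name `remove_iqr_outliers`, the multiplier `k = 1.5`, and the sample values are illustrative assumptions, not part of the dataset.

```python
import pandas as pd

def remove_iqr_outliers(df: pd.DataFrame, column: str, k: float = 1.5) -> pd.DataFrame:
    """Keep rows whose `column` value lies within [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = df[column].quantile(0.25)
    q3 = df[column].quantile(0.75)
    iqr = q3 - q1
    # between() is inclusive on both ends; rows with missing values are dropped
    within = df[column].between(q1 - k * iqr, q3 + k * iqr)
    return df[within].copy()

# Hypothetical sample with one obvious outlier
sample = pd.DataFrame({"value": [10.0, 11.0, 12.0, 13.0, 14.0, 100.0]})
trimmed = remove_iqr_outliers(sample, "value")
print(trimmed["value"].tolist())
```

In this sample the value 100.0 falls outside the fences and is dropped, while the other five rows are kept. In the notebook, the helper could be applied as `remove_iqr_outliers(filtered_df, numeric_field)` before the normalization step.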

## 5. Visualization
Visualize data distributions or relationships between fields in the dataset.

In [None]:
import matplotlib.pyplot as plt

# Plotting the distribution of the normalized values
plt.figure(figsize=(10, 6))
filtered_df[f"{numeric_field}_normalized"].hist(bins=50)
plt.title('Distribution of Normalized Parameter Values')
plt.xlabel('Normalized Parameter Value')
plt.ylabel('Frequency')
plt.show()

## 6. Conclusion
In this notebook, we loaded a dataset's Croissant metadata with the `mlcroissant` library, listed its record sets and fields, and extracted one record set into a DataFrame. We then performed basic exploratory data analysis: threshold filtering, z-score normalization, per-site aggregation, and a histogram of the normalized values. These steps provide a foundation for further in-depth research and data modeling.