# Marine Biodiversity and Environmental Data Exploration with `mlcroissant`
This notebook is a template for loading and exploring the Marine Biodiversity and Environmental Data dataset with the `mlcroissant` library.

### Dataset Source
The dataset is described by a Croissant (JSON-LD) metadata file, which `mlcroissant` reads from the URL defined in the loading step.

In [None]:
# Ensure `mlcroissant` library is installed
!pip install mlcroissant

## 1. Data Loading
Load metadata and records from the dataset using `mlcroissant`.

In [None]:
import mlcroissant as mlc
import pandas as pd

# Define the dataset URL
url = 'https://cdn.dev.senscience.cloud/portals/10.82843/pm80-mh77/fair2.json'

# Load the dataset metadata
dataset = mlc.Dataset(url)
metadata = dataset.metadata.to_json()
print(f"{metadata['name']}: {metadata['description']}")

## 2. Data Overview
Review available record sets, fields, and their IDs.

In [None]:
# List all record sets and the fields they contain
for record_set in dataset.metadata.record_sets:
    print(f"Record Set: {record_set.uuid} - {record_set.description}")
    for field in record_set.fields:
        print(f"  Field: {field.uuid}")

## 3. Data Extraction
Load each record set of interest into a pandas DataFrame for analysis, using the record set `@id`s listed in the overview.

In [None]:
# Record set IDs to extract (taken from the overview above)
record_sets = [
    'https://senscience.ai/frontiers/borja/WATER.csv',
    'https://senscience.ai/frontiers/borja/SEDIMENTS.csv'
]
dataframes = {}

# Materialize each record set as a DataFrame, keyed by its record set ID
for record_set in record_sets:
    records = list(dataset.records(record_set=record_set))
    dataframes[record_set] = pd.DataFrame(records)

# Inspect the columns and first rows of the WATER record set
water_df = dataframes['https://senscience.ai/frontiers/borja/WATER.csv']
print(water_df.columns.tolist())
water_df.head()
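
Depending on the `mlcroissant` version and on how the Croissant file names its fields, the extracted DataFrame may contain `bytes` values and column names prefixed with the record set name. That is an assumption about this dataset, not something the output above confirms. The sketch below shows one way to tidy such a frame; `tidy_records` is a hypothetical helper and the example frame is synthetic.

```python
import pandas as pd


def tidy_records(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: decode bytes cells and shorten prefixed column names."""
    out = df.copy()
    for col in out.columns:
        # Only object columns can hold raw bytes values
        if out[col].dtype == object:
            out[col] = out[col].map(
                lambda v: v.decode("utf-8") if isinstance(v, bytes) else v
            )
    # Keep only the part after the last "/" (assumed prefix separator)
    out.columns = [str(c).rsplit("/", 1)[-1] for c in out.columns]
    return out


# Synthetic stand-in for an extracted record set (not real data)
example = pd.DataFrame({
    "WATER.csv/siteid": [b"A1", b"B2"],
    "WATER.csv/maximumdepthinmeters": [12.5, 30.0],
})
tidy = tidy_records(example)
print(tidy.columns.tolist())
print(tidy.head())
```

If the columns already match the field names used later in this notebook (for example `maximumdepthinmeters`), this step is unnecessary. Note that shortening names can produce duplicates when two prefixes share a field name.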

## 4. Exploratory Data Analysis (EDA)
Apply common data processing steps: filter records on a criterion, normalize a numeric field, and group records by a key attribute. The same pattern extends to removing outliers or transforming skewed distributions to prepare the data for further analysis.

In [None]:
# Select a numeric field for analysis
numeric_field = 'maximumdepthinmeters'
water_df = dataframes['https://senscience.ai/frontiers/borja/WATER.csv']

# Keep records above the threshold; copy so later assignments do not
# trigger pandas' SettingWithCopyWarning
threshold = 10
filtered_df = water_df[water_df[numeric_field] > threshold].copy()
print(f"Filtered records with {numeric_field} > {threshold}:")
print(filtered_df.head())

# Z-score normalization: subtract the mean, divide by the standard deviation
filtered_df[f"{numeric_field}_normalized"] = (
    filtered_df[numeric_field] - filtered_df[numeric_field].mean()
) / filtered_df[numeric_field].std()
print(f"Normalized {numeric_field} for filtered records:")
print(filtered_df[[numeric_field, f"{numeric_field}_normalized"]].head())

# Average the numeric columns per site (non-numeric columns are skipped)
group_field = 'siteid'
if group_field in filtered_df.columns:
    grouped_df = filtered_df.groupby(group_field).mean(numeric_only=True)
    print(f"Grouped data by {group_field}:")
    print(grouped_df.head())
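
The section description also mentions removing outliers and transforming distributions. A minimal sketch of both follows, assuming the interquartile-range (IQR) rule with the conventional multiplier of 1.5 and a `log1p` transform are suitable for the field. Both are illustrative choices rather than methods prescribed by the dataset. `remove_iqr_outliers` is a hypothetical helper and the demo values are synthetic.

```python
import numpy as np
import pandas as pd


def remove_iqr_outliers(df: pd.DataFrame, column: str, k: float = 1.5) -> pd.DataFrame:
    """Hypothetical helper: keep rows whose value lies within k * IQR of the quartiles."""
    q1 = df[column].quantile(0.25)
    q3 = df[column].quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    # between() is inclusive and evaluates to False for missing values,
    # so rows with NaN in `column` are dropped as well
    return df[df[column].between(lower, upper)].copy()


# Synthetic depths with one extreme value (not real data)
demo = pd.DataFrame({"maximumdepthinmeters": [11.0, 12.0, 13.0, 14.0, 15.0, 400.0]})
cleaned = remove_iqr_outliers(demo, "maximumdepthinmeters")

# log1p compresses a right-skewed, non-negative field
cleaned["depth_log1p"] = np.log1p(cleaned["maximumdepthinmeters"])
print(cleaned)
```

To apply this to the notebook data, pass the WATER DataFrame and `numeric_field` in place of `demo` and the literal column name.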

## 5. Visualization
Visualize data distributions or relationships between fields in the dataset.

In [None]:
import matplotlib.pyplot as plt

# Visualize the distribution of maximum depth
plt.figure(figsize=(10, 6))
# Drop missing values first; NaNs make the histogram range undefined
depths = dataframes['https://senscience.ai/frontiers/borja/WATER.csv'][numeric_field].dropna()
plt.hist(depths, bins=50, color='c', edgecolor='k')
plt.title('Distribution of Maximum Sampling Depth in Meters')
plt.xlabel('Maximum Depth (m)')
plt.ylabel('Frequency')
plt.show()
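
For relationships between fields, a correlation table over the numeric columns can indicate which pairs are worth plotting. The sketch below uses a synthetic frame. The `temperature` column is hypothetical and is not confirmed to exist in this dataset; only `maximumdepthinmeters` and `siteid` appear earlier in the notebook.

```python
import pandas as pd

# Synthetic stand-in for the WATER record set (not real data)
demo = pd.DataFrame({
    "maximumdepthinmeters": [5.0, 10.0, 20.0, 40.0],
    "temperature": [18.0, 16.0, 12.0, 4.0],  # hypothetical field
    "siteid": ["A", "A", "B", "B"],
})

# Pearson correlation between numeric columns only; text columns are excluded
corr = demo.select_dtypes(include="number").corr()
print(corr.round(2))
```

Pairs with a strong positive or negative coefficient are natural candidates for a scatter plot with `plt.scatter`.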

## 6. Conclusion
Summarize key findings and observations from the dataset exploration.

In this notebook, we explored the Marine Biodiversity and Environmental Data using the `mlcroissant` library. We successfully loaded the dataset, reviewed its structure, and extracted specific record sets for detailed analysis.

We filtered the WATER records by maximum sampling depth, z-score normalized that field, and averaged the numeric fields per site. The histogram shows how maximum sampling depths are distributed across all WATER records; it does not break the distribution down by site.

This exploration serves as a foundation for more advanced analyses, such as assessing environmental trends or the impact of various factors on marine biodiversity.