# Marine Biodiversity and Environmental Data Exploration with `mlcroissant`
This notebook provides an example of loading and exploring the Marine Biodiversity and Environmental Data dataset using the `mlcroissant` library.

### Dataset Source
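The dataset is described by a Croissant metadata file (JSON-LD). `mlcroissant` reads that file from the URL defined in the data loading step and uses it to locate the underlying files.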

In [None]:
# Ensure `mlcroissant` library is installed
!pip install mlcroissant

## 1. Data Loading
Load the dataset's Croissant metadata with `mlcroissant` and print its name and description.

In [None]:
import mlcroissant as mlc
import pandas as pd

# Define the dataset URL
url = 'https://dev.senscience.cloud/portal/10.82843/pm80-mh77/fair2.json'

# Load the dataset metadata
dataset = mlc.Dataset(jsonld=url)
metadata = dataset.metadata
print(f"{metadata.name}: {metadata.description}")

## 2. Data Overview
Review available record sets, fields, and their IDs.

In [None]:
# Display record sets and field IDs.
# `metadata.record_sets` is a list of RecordSet objects; `uuid` is the
# identifier that `dataset.records(record_set=...)` expects.
for record_set in metadata.record_sets:
    print(f"Record Set ID: {record_set.uuid}")
    for field in record_set.fields:
        print(f"  Field ID: {field.uuid}")

## 3. Data Extraction
Load each record set into its own DataFrame, keyed by the record set `@id`. Use the record set and field `@id`s printed in the overview.

In [None]:
# Extract data from each record set
record_sets_ids = [
    'https://senscience.ai/frontiers/borja/README.csv',
    'https://senscience.ai/frontiers/borja/METADATA.csv',
    'https://senscience.ai/frontiers/borja/WATER.csv'
]
dataframes = {}

for record_set_id in record_sets_ids:
    records = list(dataset.records(record_set=record_set_id))
    dataframes[record_set_id] = pd.DataFrame(records)

# Display columns of a chosen record set
print(dataframes['https://senscience.ai/frontiers/borja/METADATA.csv'].columns.tolist())
dataframes['https://senscience.ai/frontiers/borja/METADATA.csv'].head()
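
Depending on the library version and the field's declared type, `mlcroissant` can return text values as Python `bytes` rather than `str`, which makes string comparisons and grouping awkward. The sketch below shows one way to decode such columns. The helper name `decode_bytes_columns` and the sample frame are illustrative assumptions, not part of the dataset or of `mlcroissant`.

```python
import pandas as pd


def decode_bytes_columns(df: pd.DataFrame, encoding: str = "utf-8") -> pd.DataFrame:
    """Return a copy of df in which bytes values are decoded to str."""
    out = df.copy()
    for col in out.columns:
        # Only object columns can hold bytes values
        if out[col].dtype == object:
            out[col] = out[col].map(
                lambda v: v.decode(encoding) if isinstance(v, bytes) else v
            )
    return out


# Hypothetical sample that mimics records returned by dataset.records(...)
sample = pd.DataFrame({"siteid": [b"A1", b"B2"], "decimallatitude": [43.4, 43.3]})
decoded = decode_bytes_columns(sample)
print(decoded["siteid"].tolist())
```

In this notebook the helper could be applied inside the loading loop, for example `dataframes[record_set_id] = decode_bytes_columns(pd.DataFrame(records))`.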

## 4. Exploratory Data Analysis (EDA)
Apply common data processing steps to the METADATA record set: keep the records whose latitude exceeds a threshold, z-score normalize the latitude of those records, and group them by site ID. Other steps, such as removing outliers, transforming distributions, or categorizing values, follow the same pattern.

In [None]:
# Select a numeric field for analysis
numeric_field = 'https://senscience.ai/frontiers/borja/METADATA.csv/decimallatitude'

threshold = 43.35
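metadata_df = dataframes['https://senscience.ai/frontiers/borja/METADATA.csv']
# .copy() makes filtered_df independent of metadata_df, so adding columns
# later does not trigger pandas' SettingWithCopyWarning
filtered_df = metadata_df[metadata_df[numeric_field] > threshold].copy()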
print(f"Filtered records with {numeric_field} > {threshold}:")
print(filtered_df.head())

# Normalize the numeric field for the filtered records
filtered_df[f"{numeric_field}_normalized"] = (
    filtered_df[numeric_field] - filtered_df[numeric_field].mean()
) / filtered_df[numeric_field].std()
print(f"Normalized {numeric_field} for filtered records:")
print(filtered_df[[numeric_field, f"{numeric_field}_normalized"]].head())

# Group the filtered records by site ID and average the numeric columns
group_field = 'https://senscience.ai/frontiers/borja/METADATA.csv/siteid'
if group_field in filtered_df.columns:
    # numeric_only=True skips text columns, which raise a TypeError in pandas 2.x
    grouped_df = filtered_df.groupby(group_field).mean(numeric_only=True)
    print(f"Grouped data by {group_field}:")
    print(grouped_df.head())
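
Outlier removal and categorization are not covered by the cell above. A minimal sketch of both, on made-up values, is given here. The helper `remove_outliers_iqr`, the column name `lat`, and the band labels are hypothetical, and the 1.5 × IQR rule is just one common convention, not something this dataset prescribes.

```python
import pandas as pd


def remove_outliers_iqr(df: pd.DataFrame, column: str, k: float = 1.5) -> pd.DataFrame:
    """Keep rows whose value in `column` lies within [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = df[column].quantile(0.25)
    q3 = df[column].quantile(0.75)
    iqr = q3 - q1
    mask = df[column].between(q1 - k * iqr, q3 + k * iqr)
    return df[mask].copy()


# Made-up latitudes; 60.00 is a deliberate outlier
sample = pd.DataFrame({"lat": [43.30, 43.35, 43.40, 43.45, 60.00]})
cleaned = remove_outliers_iqr(sample, "lat")

# Categorize the remaining values into two equal-width latitude bands
cleaned["lat_band"] = pd.cut(cleaned["lat"], bins=2, labels=["south", "north"])
print(cleaned)
```

To use this on the notebook's data, pass `filtered_df` and `numeric_field` in place of the sample frame and column name.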

## 5. Visualization
Plot a histogram of latitude for the filtered records to see how the sampling sites are distributed.

In [None]:
import matplotlib.pyplot as plt

# Visualize the distribution of a numeric field
plt.hist(
    filtered_df[numeric_field],
    bins=30,
    alpha=0.7,
    color='blue',
    label='Latitude'  # short legend label instead of the full field ID
)
plt.xlabel('Latitude')
plt.ylabel('Frequency')
plt.title('Distribution of Latitude in Filtered Records')
plt.legend()
plt.show()

## 6. Conclusion
In this notebook, we explored the Marine Biodiversity and Environmental Data dataset using the `mlcroissant` library. We loaded the metadata, listed the available record sets and fields, extracted each record set into a DataFrame, and ran a short exploratory analysis on the METADATA record set. Filtering on latitude (greater than 43.35), z-score normalizing the result, and grouping by site ID prepared the sampling-site coordinates for comparison, and the histogram shows how the filtered sites are distributed by latitude. The same workflow can be extended to the other record sets and to more complex data processing or machine learning tasks.