# Bay of Biscay Marine Monitoring Dataset Exploration with `mlcroissant`
This notebook provides a guide for loading and exploring the Bay of Biscay Marine Monitoring dataset using the `mlcroissant` library.

### Dataset Source
The dataset source is provided via a Croissant schema URL.

In [None]:
# Ensure `mlcroissant` library is installed
!pip install mlcroissant

## 1. Data Loading
Load metadata and records from the dataset using `mlcroissant`.

In [None]:
import mlcroissant as mlc
import pandas as pd

# Define the dataset URL
url = 'http://127.0.0.1:8000/api/portals/10.82843/mfcq-0459/fair2.json'

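# Load the dataset from its Croissant JSON-LD description
dataset = mlc.Dataset(jsonld=url)
# `dataset.metadata` is a Metadata object; convert it to a dict for key-based access
metadata = dataset.metadata.to_json()
print(f"{metadata['name']}: {metadata.get('description', '')}")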
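
Optional metadata keys such as `description` may be absent from a Croissant file. As a minimal sketch, the hypothetical helper `summarize_metadata` below builds the same one-line summary with fallbacks. It assumes the metadata is available as a plain JSON-LD dict (for example, the result of `dataset.metadata.to_json()`); the sample dict is made up for illustration.

```python
def summarize_metadata(metadata: dict, max_len: int = 80) -> str:
    """Return 'name: description', tolerating missing keys and long text."""
    name = metadata.get("name") or "(unnamed dataset)"
    description = (metadata.get("description") or "(no description)").strip()
    if len(description) > max_len:
        # Truncate long descriptions so the summary stays on one line
        description = description[: max_len - 3].rstrip() + "..."
    return f"{name}: {description}"


# Made-up metadata for illustration only (not the real dataset description)
toy_metadata = {"name": "Example dataset"}
print(summarize_metadata(toy_metadata))
```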

## 2. Data Overview
Review available record sets, fields, and their IDs.

In [None]:
# Display record sets in the dataset
for record_set in metadata['recordSet']:
    print(f"Record Set ID: {record_set['@id']}")
    for field in record_set['field']:
        print(f"  Field ID: {field['@id']} - Name: {field['name']}")
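
Later cells refer to record sets and fields by their `@id`, so a lookup from field name to `@id` can save some copying and pasting. The sketch below assumes the metadata is a plain JSON-LD dict with the `recordSet` / `field` / `@id` / `name` keys used above; `index_fields` and the toy metadata are hypothetical and only illustrate the shape.

```python
def index_fields(metadata: dict) -> dict:
    """Map each record set @id to a {field name: field @id} dict."""
    index = {}
    for record_set in metadata.get("recordSet", []):
        fields = record_set.get("field", [])
        # Fall back to the @id when a field has no human-readable name
        index[record_set["@id"]] = {f.get("name", f["@id"]): f["@id"] for f in fields}
    return index


# Made-up structure for illustration only
toy_metadata = {
    "recordSet": [
        {
            "@id": "rs-1",
            "field": [
                {"@id": "rs-1/temp", "name": "temperature"},
                {"@id": "rs-1/depth", "name": "depth"},
            ],
        },
        {"@id": "rs-2"},  # a record set without fields
    ]
}
field_index = index_fields(toy_metadata)
print(field_index["rs-1"]["temperature"])
```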

## 3. Data Extraction
Load each record set into its own DataFrame for analysis. Use the record set and field `@id`s listed in the overview.

In [None]:
# Extract data from each record set
record_sets = [
    'http://127.0.0.1:8000/sense/test-colab-1/41f02852-29a6-47b1-b685-eada6124f217',
    'http://127.0.0.1:8000/sense/test-colab-1/55882445-6fea-4ca9-93d3-648aaaf1f58c',
    'http://127.0.0.1:8000/sense/test-colab-1/7cc073b8-d338-4419-b96f-dfbe0d79e674'
]
dataframes = {}

for record_set in record_sets:
    records = list(dataset.records(record_set=record_set))
    dataframes[record_set] = pd.DataFrame(records)

# Display columns of the first record set
print(dataframes[record_sets[0]].columns.tolist())
dataframes[record_sets[0]].head()
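
Depending on the declared field types, text values in the records may arrive as `bytes` rather than `str`, which makes filtering and grouping awkward. The following is a small sketch of decoding them before building the DataFrame; `decode_record` is a hypothetical helper and the toy records are invented for illustration.

```python
def decode_record(record: dict, encoding: str = "utf-8") -> dict:
    """Return a copy of `record` with bytes values decoded to str."""
    return {
        key: value.decode(encoding, errors="replace") if isinstance(value, bytes) else value
        for key, value in record.items()
    }


# Invented records for illustration only
toy_records = [
    {"station": b"A1", "temperature": 12.5},
    {"station": b"B2", "temperature": 9.8},
]
decoded = [decode_record(r) for r in toy_records]
print(decoded[0])
```

In the loop above, the same helper could be applied to each record before it is passed to `pd.DataFrame`.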

## 4. Exploratory Data Analysis (EDA)
Apply common data processing steps: filter records against a threshold, z-score normalize a numeric field, and group the result by a second field.

In [None]:
# Select a numeric field for analysis from the first record set
numeric_field = 'http://127.0.0.1:8000/sense/test-colab-1/dfceb0a3-15c5-40b8-b4d5-af793acfd2cc'

# Keep only rows where the numeric field exceeds the threshold
threshold = 10
df = dataframes[record_sets[0]]
# .copy() makes the filtered frame independent, so adding a column below
# does not trigger pandas' SettingWithCopyWarning
filtered_df = df[df[numeric_field] > threshold].copy()
print(f"Filtered records with {numeric_field} > {threshold}:")
print(filtered_df.head())

# Z-score normalize the field within the filtered records
filtered_df[f"{numeric_field}_normalized"] = (filtered_df[numeric_field] - filtered_df[numeric_field].mean()) / filtered_df[numeric_field].std()
print(f"Normalized {numeric_field} for filtered records:")
print(filtered_df[[numeric_field, f"{numeric_field}_normalized"]].head())

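# Group the filtered records by a second field and average the numeric columns
group_field = 'http://127.0.0.1:8000/sense/test-colab-1/4c17d92b-7ddc-4b74-8e83-703e1bf784c1'
if group_field in dataframes[record_sets[0]].columns:
    # numeric_only=True skips non-numeric columns, which would otherwise raise a TypeError in pandas 2.x
    grouped_df = filtered_df.groupby(group_field).mean(numeric_only=True)
    print(f"Grouped data by {group_field}:")
    print(grouped_df.head())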
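
The inline normalization divides by the standard deviation, which is zero for a constant column and undefined for a single row. A guarded variant might look like the sketch below; `zscore` is a hypothetical helper, the toy series is invented, and returning zeros in the degenerate case is one possible convention rather than the only one.

```python
import pandas as pd


def zscore(series: pd.Series) -> pd.Series:
    """Z-score normalize a numeric Series using the sample standard deviation (ddof=1)."""
    std = series.std()
    if pd.isna(std) or std == 0:
        # Constant or single-value column: there is no spread to scale by,
        # so return zeros instead of dividing by zero or NaN
        return pd.Series(0.0, index=series.index, name=series.name)
    return (series - series.mean()) / std


# Invented values for illustration only
toy = pd.Series([10.0, 12.0, 14.0], name="temperature")
normalized = zscore(toy)
print(normalized.tolist())
```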

## 5. Visualization
Visualize data distributions or relationships between fields in the dataset.

In [None]:
import matplotlib.pyplot as plt

# Histogram of the numeric field (drop missing values, which plt.hist cannot bin)
plt.hist(dataframes[record_sets[0]][numeric_field].dropna(), bins=50, alpha=0.7)
plt.title('Distribution of Parameter Values')
plt.xlabel('Parameter Value')
plt.ylabel('Frequency')
plt.show()

## 6. Conclusion
Summarize key findings and observations from the dataset exploration.

In this exploration of the Bay of Biscay Marine Monitoring dataset, we loaded the Croissant metadata, listed the record sets and their fields, and loaded each record set into a DataFrame. We then filtered one numeric field against a threshold, z-score normalized it, grouped the filtered records by a second field, and plotted the field's distribution as a histogram. These steps prepare the data for further ecological and environmental analysis.