# Explore the Marine Biodiversity and Environmental Data Package with `mlcroissant`

## Introduction
This dataset from **Borja et al.** provides 38 years of environmental monitoring data (1995–2023), tracking 130 environmental variables across water, sediments, and biota in the Basque Country’s coastal areas. Developed with the Basque Water Agency (URA), it enables analysis of human impacts and supports ecosystem management.

Learn more:
- Data Package doi: [10.71728/r1rj-f947](https://sen.science/doi/10.71728/r1rj-f947)
- Frontiers Data Article doi: [10.3389/focsu.2024.1528837](https://doi.org/10.3389/focsu.2024.1528837)

As a FAIR² Data Package, it ensures accessibility, interoperability, and AI-readiness, supporting research and policy aligned with European directives. FAIR² datasets follow the MLCommons **Croissant** 🥐 format for machine learning datasets. See the [MLCommons Croissant Format Specification](https://docs.mlcommons.org/croissant/docs/croissant-spec.html).

This notebook provides a step-by-step guide for loading the dataset using the `mlcroissant` Python library.

### Install and import required libraries

In [None]:
# Install mlcroissant with pip (uncomment the lines below to run them)
# !sudo apt-get install python3-dev graphviz libgraphviz-dev pkg-config
# !pip install mlcroissant[dev]

In [None]:
# To install it directly from the github repository
# !pip install "git+https://github.com/${GITHUB_REPOSITORY:-mlcommons/croissant}.git@${GITHUB_HEAD_REF:-main}#subdirectory=python/mlcroissant&egg=mlcroissant[dev]"

In [None]:
import mlcroissant as mlc
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

from tabulate import tabulate
from IPython.display import Markdown

## 1. Data Loading
Load metadata and records from the dataset using `mlcroissant` and the URL of the FAIR² Data Package.

In [None]:
# Provide the dataset URL
url = 'https://sen.science/doi/10.71728/r1rj-f947/fair2.json'

# Load the dataset metadata
dataset = mlc.Dataset(url)
metadata = dataset.metadata.to_json()
print(f"{metadata['name']}: {metadata['description']}")

## 2. Data Overview

In the **Croissant** format, a RecordSet represents a structured collection of records, where each record is a granular dataset unit (e.g., an image, text file, or table row). It defines the structure of these records using a set of fields, such as the columns in a table or sheet, as seen in this example.
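
As a minimal sketch of that structure, the snippet below walks a Croissant-style metadata dictionary and lists each RecordSet with its fields. The `sample_metadata` dictionary and the `list_record_sets` helper are hypothetical stand-ins written for illustration; with the real dataset, the same keys would be read from the loaded `metadata` object, assuming it exposes `recordSet` and `field` entries as described in the Croissant specification.

```python
# Hypothetical, hand-written metadata that mimics the Croissant JSON-LD layout.
sample_metadata = {
    "name": "example-dataset",
    "recordSet": [
        {
            "@id": "record-sets/FISH",
            "description": "Fish abundance records",
            "field": [
                {"@id": "record-sets/FISH/taxaname", "dataType": "sc:Text"},
                {"@id": "record-sets/FISH/parameter_value", "dataType": "sc:Float"},
            ],
        },
        {"@id": "record-sets/EMPTY", "description": "Record set without fields"},
    ],
}


def list_record_sets(metadata):
    """Map each RecordSet @id to the list of its field @ids."""
    overview = {}
    for record_set in metadata.get("recordSet", []):
        fields = record_set.get("field", [])
        overview[record_set["@id"]] = [field["@id"] for field in fields]
    return overview


overview = list_record_sets(sample_metadata)
for record_set_id, field_ids in overview.items():
    print(record_set_id, "->", field_ids)
```

Using `.get(..., [])` keeps the helper tolerant of record sets that declare no fields.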

### 2.1 Review available RecordSets

In [None]:
# Format the list column as a Markdown-compatible string
def format_list_column(row):
    if isinstance(row, list):
        return "\n".join(f"- {item}" for item in row)  # Bullet point list
    return str(row)

In [None]:
# List all the record sets available in the dataset
df = pd.DataFrame(metadata["recordSet"])
columns_to_keep = {
    "@id": "Record Set ID",
    "description": "Description"
}
df = df[list(columns_to_keep.keys())]
df = df.rename(columns=columns_to_keep)

# Convert DataFrame to Markdown table
markdown_table = tabulate(df, headers="keys", tablefmt="pipe", showindex=False)

# Render the table as Markdown in Jupyter
display(Markdown(markdown_table))

## 3. Data Extraction

### 3.1 Load data from specific record sets into DataFrames

In this example we focus on the records that describe the biodiversity of the sampled areas (phytoplankton, invertebrates, macroalgae, and fish). To load them, we use the RecordSet `@id`s listed in the overview.

In [None]:
record_set_ids = [
    'https://sen.science/10.71728/r1rj-f947/record-sets/PHYTOPLANKTON',
    'https://sen.science/10.71728/r1rj-f947/record-sets/INVERTEBRATES',
    'https://sen.science/10.71728/r1rj-f947/record-sets/MACROALGAE',
    'https://sen.science/10.71728/r1rj-f947/record-sets/FISH'
]

dataframes = {
    record_set_id: pd.DataFrame(list(dataset.records(record_set=record_set_id)))
    for record_set_id in record_set_ids
}

In [None]:
prefix = "https://sen.science/doi/10.71728/r1rj-f947/"

for name, df in dataframes.items():
    df.rename(columns=lambda x: x.replace(prefix, "").split("/")[-1], inplace=True)
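
To see what the renaming expression does, the sketch below applies it to a few column names. The example names are assumptions about how fully qualified field identifiers might look; the actual names come from the loaded records, and `short_name` is a hypothetical helper that wraps the same expression.

```python
prefix = "https://sen.science/doi/10.71728/r1rj-f947/"


def short_name(column):
    """Drop the dataset prefix (if present) and keep the last path segment."""
    return column.replace(prefix, "").split("/")[-1]


# Hypothetical column names, for illustration only.
examples = [
    prefix + "record-sets/FISH/taxaname",
    "record-sets/FISH/parameter_value",
    "datecollected",
]
short_names = [short_name(column) for column in examples]
print(short_names)
```

Because only the last path segment is kept, two fields that end in the same segment would collapse to the same column name, so it may be worth checking `df.columns.is_unique` after renaming.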

In [None]:
# Display the first rows of each dataframe
for name, df in dataframes.items():
    display(Markdown(f"#### {name}"))
    display(df.head())
    display(Markdown("---"))

From the preview, we can see that the tables contain population data for various species, categorized by taxonomic name (taxa). For phytoplankton, invertebrates, and fish, the dataset records species abundance, while macroalgae are measured in terms of coverage percentage.

## 4. Exploratory Data Analysis (EDA)

To grasp the dataset’s key characteristics, identify patterns, and detect anomalies, we begin with Exploratory Data Analysis (EDA).

### 4.1 Identify missing values
Check each dataframe for columns that contain missing values.

In [None]:
for name, df in dataframes.items():
    missing_columns = df.columns[df.isnull().any()].tolist()
    if missing_columns:
        print(f"Dataframe '{name}' has missing values in columns: {missing_columns}")
    else:
        print(f"Dataframe '{name}' has no missing values.")
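
Knowing which columns contain gaps is often not enough; the share of missing values per column helps decide whether a column is still usable. The sketch below shows one way to compute it on a toy dataframe with made-up values. The `missing_share` helper is hypothetical and not part of the original notebook.

```python
import pandas as pd

# Toy data standing in for one record set; values are made up.
toy = pd.DataFrame({
    "taxaname": ["A", "B", None, "A"],
    "parameter_value": [1.0, None, 3.0, None],
    "station": ["S1", "S1", "S2", "S2"],
})


def missing_share(df):
    """Return the fraction of missing values per column, largest first."""
    share = df.isnull().mean()
    return share[share > 0].sort_values(ascending=False)


report = missing_share(toy)
print(report)
```

The same helper could be called inside the loop over `dataframes` to print one report per record set.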

### 4.2 Summary statistics

As seen in Section 3, the preview showed that the dataset captures biodiversity in the sampled regions of the Bay of Biscay over the years. This includes species abundance, measured as the number of individual specimens or, in the case of algae, as surface coverage. Given this, it is useful to examine the statistical summaries of these measurements for each dataframe.

In [None]:
summary_stats_dict = {}
for name, df in dataframes.items():
    summary_stats = df.groupby('taxaname').agg({
        'parameter_value': ['mean', 'std', 'min', 'max', 'count']
    }).reset_index()
    summary_stats.columns = ['Taxa', 'Mean', 'Std', 'Min', 'Max', 'Count']
    
    # Order by count
    summary_stats = summary_stats.sort_values(by='Count', ascending=False)
    summary_stats_dict[name] = summary_stats
    
    display(Markdown(f"### Summary Statistics for {name.split('/')[-1].split('.csv')[0]} {df['parameter'][0]}"))
    display(summary_stats)
    display(Markdown("---"))

### Summary of Tables: Mean and Taxa Diversity

The dataset comprises multiple tables, each representing different categories of marine biodiversity data, including phytoplankton, invertebrates, macroalgae, and fish. Below is a summary focusing on the mean values and taxa diversity for each category:

1. **Phytoplankton**:
    - **Number of Unique Species**: 505
    - **Mean Abundance**: The dataset records the abundance of various phytoplankton species, with mean values calculated for each species. The mean abundance provides insights into the average population size of each species across different sampling sites and times.

2. **Invertebrates**:
    - **Number of Unique Species**: 1493
    - **Mean Abundance**: Invertebrate data includes a wide range of species with varying mean abundance values. The mean values help identify the most and least common invertebrate species in the sampled areas.

3. **Macroalgae**:
    - **Number of Unique Species**: 306
    - **Mean Coverage**: Unlike other categories, macroalgae are measured in terms of coverage percentage. The mean coverage values indicate the average surface area occupied by each macroalgae species, providing insights into their distribution and dominance in the ecosystem.

4. **Fish**:
    - **Number of Unique Species**: 132
    - **Mean Abundance**: Fish data includes species abundance measured in individual counts. The mean values highlight the average number of individuals per species, helping to understand the population dynamics of different fish species.

Overall, the dataset reveals significant diversity in marine species across different categories, with invertebrates showing the highest number of unique species. The mean values for each category provide valuable information on the average population sizes and coverage, aiding in the analysis of biodiversity trends and ecosystem health.
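
The taxa counts quoted above correspond to the number of distinct `taxaname` values in each table. As a minimal sketch, the snippet below computes such counts on toy dataframes with made-up values; applied to the real data, the same comprehension could iterate over `dataframes` instead of the hypothetical `toy_frames`.

```python
import pandas as pd

# Toy frames keyed by hypothetical record set names; values are made up.
toy_frames = {
    "FISH": pd.DataFrame({"taxaname": ["a", "b", "a"]}),
    "MACROALGAE": pd.DataFrame({"taxaname": ["x", "y", "z", "z"]}),
}

# Count distinct taxa per record set and sort from most to least diverse.
taxa_counts = pd.Series(
    {name: frame["taxaname"].nunique() for name, frame in toy_frames.items()},
    name="unique_taxa",
).sort_values(ascending=False)
print(taxa_counts)
```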

#### 4.2.1 Example: Exploring the Abundance of the FISH Category

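We can take a look at the 10 most sampled fish taxa.

In [None]:
# Select the 10 most sampled taxa from the FISH record set.
# Look the summary up by record set ID instead of relying on the variable left over from the previous loop.
fish_id = record_set_ids[-1]  # FISH is the last entry in record_set_ids
top_10_taxa = summary_stats_dict[fish_id].head(10)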

In [None]:
# Plot the sample count (bars, left axis) and the mean parameter value (line, right axis) for the top 10 taxa
fish_df = dataframes[record_set_ids[-1]]  # FISH dataframe, used for the parameter name and unit
plt.figure(figsize=(22, 6))
ax = sns.barplot(y='Count', x='Taxa', data=top_10_taxa, palette='viridis')
ax.set_ylabel('Samples')
ax.set_xlabel('Taxa')
ax.tick_params(axis='x', labelrotation=45)  # rotate labels on the primary axes; the twin's x-axis is hidden
ax2 = ax.twinx()
sns.lineplot(y='Mean', x='Taxa', data=top_10_taxa, ax=ax2, color='red', marker='o')
ax2.set_ylabel(f"Mean {fish_df['parameter'][0]} ({fish_df['parameter_standardunit'][0]})")
plt.title(f"Number of Samples of the First 10 Taxa vs Mean {fish_df['parameter'][0]}")
plt.show()

The figure above compares the number of samples with the mean abundance for the 10 most sampled taxa in the FISH record set. The bars show the number of samples for each taxon, and the line shows its mean abundance. The x-axis lists the taxa names, the left y-axis gives the number of samples, and the right y-axis gives the mean abundance in individual counts (ind). The plot shows that the average abundance of a taxon is not always proportional to the number of samples taken. For instance, Gobius niger is less abundant on average than Crangon crangon, even though both were sampled almost the same number of times.

## 5. Analysis of the Biodiversity Over the Years

In this section we focus on how biodiversity richness has evolved over the years. See Figure 3 of the FAIR² Data Article.

In [None]:
x_column = 'datecollected'
new_x_column = "year_collected"

def x_transformation(df):
    # keep only the year
    return pd.to_datetime(df[x_column], format='%Y-%m-%d').dt.year

for name, df in dataframes.items():    
    # execute transformations
    df[new_x_column] = x_transformation(df)

In [None]:
# Plot the change in species richness (number of unique taxa) over the years
plt.figure(figsize=(14, 8))

for name, df in dataframes.items():
    unique_taxa_per_year = df.groupby('year_collected')['taxaname'].nunique().reset_index()
    plt.plot(unique_taxa_per_year['year_collected'], unique_taxa_per_year['taxaname'], marker='o', label=name.split('/')[-1])

plt.xlabel('Year Collected')
plt.ylabel('Number of Unique Taxa')
plt.title('Number of Unique Taxa per Year Collected')
plt.legend()
plt.grid(True)
plt.show()

The figure above illustrates the trend in the number of unique taxa collected each year from 1989 to 2023. Each line represents a different category of marine biodiversity data, including phytoplankton, invertebrates, macroalgae, and fish. The x-axis shows the years of data collection, while the y-axis indicates the number of unique taxa identified in each year. The plot highlights the changes in biodiversity richness over time, providing insights into the temporal dynamics of species diversity in the sampled regions. The overall trend shows fluctuations in the number of unique taxa, reflecting variations in environmental conditions and sampling efforts across different years.
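
The values behind such a plot can also be inspected as a table. The sketch below builds a year-by-category richness table from toy dataframes with made-up values. The column names `datecollected` and `taxaname` follow the notebook, while `toy_frames` and `richness_table` are hypothetical names introduced here.

```python
import pandas as pd

# Toy records with made-up values; column names follow the notebook.
toy_frames = {
    "FISH": pd.DataFrame({
        "datecollected": ["2001-05-01", "2001-06-01", "2002-05-01"],
        "taxaname": ["a", "b", "a"],
    }),
    "MACROALGAE": pd.DataFrame({
        "datecollected": ["2001-07-01", "2002-07-01", "2002-08-01"],
        "taxaname": ["x", "x", "y"],
    }),
}

richness = {}
for name, frame in toy_frames.items():
    years = pd.to_datetime(frame["datecollected"], format="%Y-%m-%d").dt.year
    # Number of distinct taxa observed in each year.
    richness[name] = frame.groupby(years)["taxaname"].nunique()

# One row per year, one column per record set.
richness_table = pd.DataFrame(richness).rename_axis("year_collected")
print(richness_table)
```

A table in this shape makes it easier to spot years in which a category was not sampled at all, which would appear as missing entries rather than as a drop in richness.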

## 6. Observations


1. **Dataset Overview**:
    - The dataset provides extensive long-term monitoring data from the Basque Country, covering various environmental variables across water, sediments, and biota.
    - The dataset includes multiple record sets, each representing different categories of marine biodiversity data, such as phytoplankton, invertebrates, macroalgae, and fish.

2. **Data Completeness**:
    - The dataframes for each record set were checked for missing values. Most dataframes had no missing values, indicating a high level of data completeness.

3. **Summary Statistics**:
    - Summary statistics were calculated for each record set, focusing on species abundance and coverage.
    - The dataset revealed significant diversity in marine species, with invertebrates showing the highest number of unique species (1493), followed by phytoplankton (505), macroalgae (306), and fish (132).

4. **Top 10 Most Sampled Fish Taxa**:
    - The top 10 most sampled fish taxa were identified, with Carcinus maenas being the most sampled species.
    - A plot illustrating the relationship between the number of samples and the mean abundance of these taxa showed that the average abundance of species is not always proportional to the number of samples taken.

5. **Biodiversity Trends Over the Years**:
    - The number of unique taxa collected each year from 1989 to 2023 was analyzed.
    - The plot showed fluctuations in the number of unique taxa over time, reflecting variations in environmental conditions and sampling efforts across different years.


Overall, the dataset reveals valuable information on marine biodiversity trends and ecosystem health in the Basque Country's coastal areas. The analysis highlights the richness and diversity of species, as well as temporal dynamics and potential correlations within the data.

## Conclusion
In this notebook, we successfully explored the Marine Biodiversity and Environmental Data Package using the `mlcroissant` library. We began by loading the dataset and reviewing its metadata, followed by extracting specific record sets into dataframes for detailed analysis. Through exploratory data analysis (EDA), we identified missing values, calculated summary statistics, and visualized the abundance of various taxa. Our analysis revealed significant biodiversity in the Basque Country's coastal areas, with invertebrates showing the highest number of unique species. We also examined the temporal trends in species diversity, highlighting fluctuations over the years. The insights gained from this dataset provide valuable information for understanding marine biodiversity trends and ecosystem health, supporting research and policy aligned with environmental management and conservation efforts.