# Exploring Marine Biodiversity Observations in Basque Country Estuaries and Coasts with `mlcroissant`
This notebook shows how to load and explore the Marine Biodiversity Observations dataset with the `mlcroissant` library.

### Dataset Source
The dataset is described by a Croissant metadata file (JSON-LD), which is referenced by URL in the data loading step.

In [None]:
# Install `mlcroissant` and the other libraries used in this notebook
!pip install mlcroissant pandas matplotlib seaborn

## 1. Data Loading
Load the dataset metadata from the Croissant URL using `mlcroissant`. The records themselves are read later, in the extraction step.

In [None]:
import mlcroissant as mlc
import pandas as pd

# Croissant metadata URL (127.0.0.1 is a local server; adjust it for your environment)
url = 'http://127.0.0.1:8000/api/portals/10.82843/ecsh-pr64/fair2.json'

# Load the dataset metadata
dataset = mlc.Dataset(jsonld=url)
metadata = dataset.metadata
print(f"{metadata.name}: {metadata.description}")

## 2. Data Overview
Review available record sets, fields, and their IDs.

In [None]:
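# `metadata.record_sets` and `record_set.fields` hold mlcroissant node objects,
# so values are read as attributes (`.uuid`, `.name`) rather than dictionary keys.
for record_set in metadata.record_sets:
    print(f"RecordSet: {record_set.uuid}")
    for field in record_set.fields:
        print(f"  Field: {field.uuid} - {field.name}")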
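As a cross-check, the same overview can be read from the raw Croissant JSON-LD, where record sets sit under the `recordSet` key and their fields under `field`. The sketch below is a minimal illustration, not part of `mlcroissant`: the helper name `list_record_sets` and the trimmed sample document are hypothetical, and it assumes the JSON-LD has already been parsed into a dictionary.

```python
import json


def list_record_sets(croissant: dict) -> list:
    """Return (record set @id, [(field @id, field name), ...]) pairs."""
    overview = []
    for record_set in croissant.get("recordSet", []):
        fields = [
            (field.get("@id", ""), field.get("name", ""))
            for field in record_set.get("field", [])
        ]
        overview.append((record_set.get("@id", ""), fields))
    return overview


# Hypothetical, heavily trimmed Croissant document used only for illustration.
sample = json.loads("""
{
  "name": "example",
  "recordSet": [
    {
      "@id": "macroalgae",
      "field": [
        {"@id": "macroalgae/siteid", "name": "siteid"},
        {"@id": "macroalgae/parameter_value", "name": "parameter_value"}
      ]
    }
  ]
}
""")

for record_set_id, fields in list_record_sets(sample):
    print(f"RecordSet: {record_set_id}")
    for field_id, field_name in fields:
        print(f"  Field: {field_id} - {field_name}")
```

Reading the raw file this way does not validate it, so it is best treated as a quick inspection aid alongside the `mlcroissant` objects.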


## 3. Data Extraction
Load each record set into its own DataFrame for analysis, using the record set `@id`s listed in the overview.

In [None]:
# Extract data from each record set
record_sets = [
    'http://127.0.0.1:8000/sense/lnerwgerreegr/84aeff68-0da5-4fe8-9610-7a636702b25f',  # Macroalgae
    'http://127.0.0.1:8000/sense/lnerwgerreegr/e1521e79-aa2b-47b2-9016-01f946e88870',  # Phytoplankton
    'http://127.0.0.1:8000/sense/lnerwgerreegr/e91e4a97-5a2b-4e7a-923a-81ec8208510f'   # Invertebrates
]
dataframes = {}

for record_set in record_sets:
    records = list(dataset.records(record_set=record_set))
    dataframes[record_set] = pd.DataFrame(records)

# Display columns for a specific record set
selected_record_set = record_sets[0]  # Macroalgae
print(dataframes[selected_record_set].columns.tolist())
dataframes[selected_record_set].head()
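Depending on the `mlcroissant` version and the field data types, text values may come back as `bytes`, and record keys may carry a prefix ending in `/` before the field name. Both points are assumptions about the output, so check `columns.tolist()` first. If they hold, a small cleaning step before building the DataFrame could look like the sketch below; `clean_record` and the sample record are hypothetical.

```python
def clean_record(record: dict) -> dict:
    """Decode bytes values and keep only the last path segment of each key."""
    cleaned = {}
    for key, value in record.items():
        # 'record_set/field' -> 'field'; keys without '/' are left unchanged.
        short_key = str(key).rsplit("/", 1)[-1]
        if isinstance(value, bytes):
            value = value.decode("utf-8", errors="replace")
        cleaned[short_key] = value
    return cleaned


# Hypothetical record, shaped like one item yielded by dataset.records(...).
raw = {"macroalgae/siteid": b"site-A", "macroalgae/parameter_value": 12.5}
cleaned = clean_record(raw)
print(cleaned)
```

In the extraction loop, this would replace `pd.DataFrame(records)` with `pd.DataFrame([clean_record(r) for r in records])`. Shortening keys can create collisions if two fields share a final segment, so it is worth comparing the column count before and after.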

## 4. Exploratory Data Analysis (EDA)
Apply common processing steps to the selected record set: keep the records whose numeric value exceeds a threshold, standardize that value (z-score), and average the numeric columns per site. Further steps, such as removing outliers or transforming skewed distributions, can be added in the same way to prepare the data for later analysis.

In [None]:
# Select a numeric field for analysis
numeric_field = 'parameter_value'

# Keep records above the threshold; .copy() makes an independent DataFrame,
# so adding a column below does not raise a SettingWithCopyWarning
threshold = 10
df = dataframes[selected_record_set]
filtered_df = df[df[numeric_field] > threshold].copy()
print(f"Filtered records with {numeric_field} > {threshold}:")
print(filtered_df.head())

# Standardize the field (z-score: subtract the mean, divide by the standard deviation)
filtered_df[f"{numeric_field}_normalized"] = (filtered_df[numeric_field] - filtered_df[numeric_field].mean()) / filtered_df[numeric_field].std()
print(f"Normalized {numeric_field} for filtered records:")
print(filtered_df[[numeric_field, f"{numeric_field}_normalized"]].head())

# Average the numeric columns per site; numeric_only=True skips text columns
group_field = 'siteid'
if group_field in filtered_df.columns:
    grouped_df = filtered_df.groupby(group_field).mean(numeric_only=True)
    print(f"Grouped data by {group_field}:")
    print(grouped_df.head())
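Outlier removal is mentioned above but not shown. One common option, offered here as a hedged sketch rather than the method used for this dataset, is the interquartile range (IQR) rule: keep values within `k` times the IQR of the first and third quartiles. The helper name `remove_iqr_outliers`, the default `k = 1.5`, and the toy values are assumptions for illustration.

```python
import pandas as pd


def remove_iqr_outliers(df: pd.DataFrame, column: str, k: float = 1.5) -> pd.DataFrame:
    """Return a copy of df without rows whose value lies outside the IQR fences."""
    # Coerce to numbers; values that cannot be parsed become NaN and are dropped.
    values = pd.to_numeric(df[column], errors="coerce")
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    mask = values.between(q1 - k * iqr, q3 + k * iqr)
    return df[mask].copy()


# Toy data (not taken from the dataset) with one obvious outlier.
toy = pd.DataFrame(
    {
        "siteid": ["a", "a", "b", "b", "b"],
        "parameter_value": [11, 12, 13, 14, 500],
    }
)
trimmed = remove_iqr_outliers(toy, "parameter_value")
print(trimmed)
```

Applied to this notebook, the call would be `remove_iqr_outliers(filtered_df, numeric_field)`. Whether extreme values are errors or genuine ecological signals depends on the parameter, so the fences should be chosen with the measurement in mind.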

## 5. Visualization
Visualize data distributions or relationships between fields in the dataset.

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

# Visualize the distribution of a numeric field
plt.figure(figsize=(10, 6))
sns.histplot(filtered_df[numeric_field], bins=30, kde=True)
plt.title('Distribution of Parameter Value')
plt.xlabel('Parameter Value')
plt.ylabel('Frequency')
plt.show()

# Visualize relationship between numeric field and siteid
plt.figure(figsize=(12, 8))
sns.boxplot(data=filtered_df, x=group_field, y=numeric_field)
plt.xticks(rotation=90)
plt.title('Boxplot of Parameter Value by Site')
plt.xlabel('Site ID')
plt.ylabel('Parameter Value')
plt.show()

## 6. Conclusion
In this exploration, we loaded the Marine Biodiversity Observations dataset with `mlcroissant`, reviewed its record sets and fields, extracted the records into DataFrames, and ran an exploratory analysis of the parameter values. The histogram shows how the filtered values are distributed, and the boxplot shows how they vary from site to site.

Because the record sets and fields are described in the Croissant metadata, the same loading steps can be reused for other datasets published in this format, which keeps data handling in ecological studies short and repeatable.