# Iris Species Measurement Dataset Exploration with `mlcroissant`
This notebook provides a template for loading and exploring a dataset using the `mlcroissant` library.

### Dataset Source
The dataset source is provided via a Croissant schema URL.

In [None]:
# Ensure `mlcroissant` library is installed
!pip install mlcroissant

## 1. Data Loading
Load metadata and records from the dataset using `mlcroissant`.

In [None]:
import mlcroissant as mlc
import pandas as pd

# Define the dataset URL
url = 'https://sen.science/doi/10.71728/senscience.19b2-8dj2/fair2.json'

# Load the dataset metadata
dataset = mlc.Dataset(url)
metadata = dataset.metadata
print(f"{metadata.name}: {metadata.description}")

## 2. Data Overview
Review available record sets, fields, and their IDs.

In [None]:
# Assume the record set `@id` is known from metadata exploration or provided details
record_set_id = 'https://api.app.sen.science/frontiers/4b114e29-1038-40e3-9c41-b54fcb344883/ed5c80cc-97ae-46a3-a063-e1e13bfb620c'

# Print available fields in the record set
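# `metadata.record_sets` is a list, so look the record set up by its `@id`
# (attribute names as exposed by recent mlcroissant releases)
record_set_metadata = next(rs for rs in dataset.metadata.record_sets if rs.id == record_set_id)
fields = record_set_metadata.fields
print("Fields available in the record set:")
for field in fields:
    print(f"{field.id}: {field.description}")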
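
If the record set `@id` is not known in advance, it can be discovered by walking the metadata. The sketch below assumes the metadata object exposes `record_sets`, and that each record set has `id` and `fields`, as recent `mlcroissant` releases do. The `summarize_record_sets` helper and the `demo_metadata` stand-in are hypothetical and exist only so the example runs offline; in the notebook you would pass `dataset.metadata` instead.

```python
from types import SimpleNamespace


def summarize_record_sets(metadata):
    """Return {record_set_id: [(field_id, description), ...]} for a metadata object."""
    summary = {}
    for record_set in metadata.record_sets:
        summary[record_set.id] = [
            (field.id, field.description) for field in record_set.fields
        ]
    return summary


# Hypothetical stand-in for `dataset.metadata`; IDs and descriptions are made up.
demo_metadata = SimpleNamespace(
    record_sets=[
        SimpleNamespace(
            id="iris",
            fields=[
                SimpleNamespace(id="iris/sepal_length", description="Sepal length in cm"),
                SimpleNamespace(id="iris/species", description="Species name"),
            ],
        )
    ]
)

for rs_id, rs_fields in summarize_record_sets(demo_metadata).items():
    print(rs_id)
    for field_id, description in rs_fields:
        print(f"  {field_id}: {description}")
```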

## 3. Data Extraction
Load data from a specific record set into a DataFrame for analysis. Use the record set and field `@id`s from the overview.

In [None]:
# Extract data from the specific record set
records = list(dataset.records(record_set=record_set_id))
df = pd.DataFrame(records)

print("Columns in extracted DataFrame:")
print(df.columns.tolist())
df.head()
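
Text fields in extracted records may arrive as `bytes` rather than `str`, which makes grouping and display awkward. The following is a minimal sketch of decoding them after building the DataFrame; `decode_bytes_columns` and the demo records are hypothetical, and UTF-8 is an assumed encoding.

```python
import pandas as pd


def decode_bytes_columns(frame: pd.DataFrame, encoding: str = "utf-8") -> pd.DataFrame:
    """Return a copy where bytes values in object-dtype columns are decoded to str."""
    decoded = frame.copy()
    for column in decoded.columns:
        if decoded[column].dtype == object:
            decoded[column] = decoded[column].map(
                lambda value: value.decode(encoding) if isinstance(value, bytes) else value
            )
    return decoded


# Hypothetical records shaped like the dictionaries yielded by `dataset.records(...)`.
demo_records = [
    {"iris/species": b"setosa", "iris/sepal_length": 5.1},
    {"iris/species": b"virginica", "iris/sepal_length": 6.3},
]
demo_df = decode_bytes_columns(pd.DataFrame(demo_records))
print(demo_df)
```

Numeric columns pass through unchanged, so the helper can be applied to the whole DataFrame before analysis.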

## 4. Exploratory Data Analysis (EDA)
Apply common processing steps: filter records by a threshold, normalize a numeric field to z-scores, and aggregate by a categorical field.

In [None]:
# Select a numeric field for analysis; this ID is a placeholder, so replace it with a name from df.columns
numeric_field = 'https://example.org/iris/height'

# The following line assumes the column exists and is numeric
threshold = 10
filtered_df = df[df[numeric_field] > threshold].copy()  # copy so new columns can be added without SettingWithCopyWarning
print(f"Filtered records where {numeric_field} > {threshold}:")
print(filtered_df.head())

# Normalizing the selected numeric field
filtered_df[f"{numeric_field}_normalized"] = (filtered_df[numeric_field] - filtered_df[numeric_field].mean()) / filtered_df[numeric_field].std()
print(f"Normalized {numeric_field} for filtered records:")
print(filtered_df[[numeric_field, f"{numeric_field}_normalized"]].head())

# Grouping by species or another categorical field
group_field = 'https://example.org/iris/species'
if group_field in df.columns:
    grouped_df = filtered_df.groupby(group_field).mean(numeric_only=True)  # skip non-numeric columns
    print(f"Grouped data by {group_field}:")
    print(grouped_df.head())
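
To check the filter, normalize, and group steps independently of the real dataset, the same pipeline can be run on a small hand-made table. This is a sketch on hypothetical data: the `species` and `height` columns and their values are placeholders, not the dataset's field IDs or measurements.

```python
import pandas as pd

# Hypothetical measurements used only to illustrate the three steps.
toy_df = pd.DataFrame(
    {
        "species": ["setosa", "setosa", "virginica", "virginica", "virginica"],
        "height": [8.0, 12.0, 14.0, 16.0, 18.0],
    }
)

# 1. Filter: keep rows above the threshold; .copy() makes the result independent of toy_df.
kept = toy_df[toy_df["height"] > 10].copy()

# 2. Normalize: z-score using the sample standard deviation (pandas default, ddof=1).
kept["height_normalized"] = (kept["height"] - kept["height"].mean()) / kept["height"].std()

# 3. Group: mean of the numeric columns per species.
per_species = kept.groupby("species").mean(numeric_only=True)
print(per_species)
```

Because the z-scores are computed after filtering, they have zero mean and unit standard deviation over the filtered rows only, not over the full table.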

## 5. Visualization
Visualize data distributions or relationships between fields in the dataset.

In [None]:
import matplotlib.pyplot as plt

# Plot a histogram of the normalized field
plt.figure(figsize=(10, 6))
plt.hist(filtered_df[f"{numeric_field}_normalized"], bins=30, alpha=0.7, color='blue')
plt.title('Distribution of Normalized Field')
plt.xlabel('Normalized Value')
plt.ylabel('Frequency')
plt.grid(True)
plt.show()
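
`plt.hist` bins its input with `numpy.histogram`, so the counts behind the plot can be inspected directly. The sketch below uses hypothetical values in place of the notebook's normalized column and three bins instead of 30 to keep the output readable.

```python
import numpy as np

# Hypothetical normalized values standing in for the notebook's normalized column.
values = np.array([-1.5, -0.5, -0.4, 0.0, 0.3, 0.6, 1.5])

# np.histogram splits [min, max] into equal-width bins and counts the values in each.
# Every bin includes its left edge; only the last bin also includes its right edge.
counts, edges = np.histogram(values, bins=3)

for left, right, count in zip(edges[:-1], edges[1:], counts):
    print(f"{left:+.1f} to {right:+.1f}: {count}")
```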

## 6. Conclusion
Summarize key findings and observations from the dataset exploration.

- The dataset contains Iris species measurements, accessed through its Croissant metadata.
- Filtering, normalization, and per-species grouping give a first look at how measurements vary across species.
- Further exploration could include more sophisticated statistical tests or machine learning analysis.