# Boreal Rodent Behavior Observational Data (2020) Exploration with `mlcroissant`
This notebook provides a template for loading and exploring a dataset using the `mlcroissant` library.

### Dataset Source
The dataset source is provided via a Croissant schema URL.

In [None]:
# Ensure `mlcroissant` library is installed
!pip install mlcroissant

## 1. Data Loading
Load metadata and records from the dataset using `mlcroissant`.

In [None]:
import mlcroissant as mlc
import pandas as pd
from pprint import pprint

# Define the dataset URL
url = 'https://sen.science/doi/10.71728/senscience.5zsz-gnaq/fair2.json'

# Load the dataset metadata
dataset = mlc.Dataset(url)
dataset_metadata = dataset.metadata
pprint(vars(dataset_metadata))

## 2. Data Overview
Review available record sets, fields, and their IDs.

In [None]:
# Display information about each record set available in the dataset.
# Record sets and fields are mlcroissant objects, not dicts, so use attribute access.
record_sets = dataset_metadata.record_sets
for record_set in record_sets:
    print(f"Record set ID: {record_set.uuid}")
    print("Fields:")
    for field in record_set.fields:
        print(f"  - {field.uuid} : {field.description or 'No description'}")
    print("")

## 3. Data Extraction
Load data from a specific record set into a DataFrame for analysis. Use the record set and field `@id`s from the overview.

In [None]:
# Extract the records of one record set (use an ID printed in the overview)
record_set_id = 'https://api.app.sen.science/frontiers/165e6501-f85b-438d-8044-a63e68cd0f4e/c3f289a7-f59b-4b56-b9cd-fef41436eeda'
records = list(dataset.records(record_set=record_set_id))
df = pd.DataFrame(records)

# Display the column names and the first few rows
print(df.columns.tolist())
df.head()
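
Depending on the `mlcroissant` version and the declared field types, text values may arrive as `bytes` rather than `str`, and column names may carry the record set identifier as a prefix. The sketch below shows one way to tidy both, using a small hand-made DataFrame in place of real records. The helper name `tidy_records` and the sample column names are hypothetical and are not part of the dataset or of `mlcroissant`.

```python
import pandas as pd

def tidy_records(frame: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with bytes values decoded and column-name prefixes dropped."""
    tidy = frame.copy()
    for column in tidy.columns:
        # Decode only values that are actually bytes; leave everything else untouched.
        tidy[column] = tidy[column].map(
            lambda value: value.decode("utf-8") if isinstance(value, bytes) else value
        )
    # Keep the part after the last '/', e.g. 'observations/species_type' -> 'species_type'.
    # Assumption: the shortened names stay unique; check this before relying on it.
    tidy.columns = [str(column).rsplit("/", 1)[-1] for column in tidy.columns]
    return tidy

# Hand-made stand-in for records returned by dataset.records(...)
sample = pd.DataFrame({
    "observations/species_type": [b"vole", b"lemming"],
    "observations/encounters_count": [12, 7],
})
tidy_sample = tidy_records(sample)
print(tidy_sample)
```

If the real columns already contain plain strings and short names, this step is unnecessary and `df` can be used as is.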

## 4. Exploratory Data Analysis (EDA)
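Apply common data processing steps: filter records by a threshold, normalize a numeric field, and aggregate records by a categorical field. The field names below are placeholders; replace them with column names printed in the previous step.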

In [None]:
# Select a numeric field for analysis
numeric_field = 'encounters_count'  # Hypothetical numeric field

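# Keep only records above the threshold; .copy() avoids pandas' SettingWithCopyWarning
threshold = 10
filtered_df = df[df[numeric_field] > threshold].copy()
print(f"Filtered records with {numeric_field} > {threshold}:")
print(filtered_df.head())

# Standardize the field to z-scores (zero mean, unit sample standard deviation)
filtered_df[f"{numeric_field}_normalized"] = (filtered_df[numeric_field] - filtered_df[numeric_field].mean()) / filtered_df[numeric_field].std()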
print(f"Normalized {numeric_field} for filtered records:")
print(filtered_df[[numeric_field, f"{numeric_field}_normalized"]].head())

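# Average the numeric columns within each group
group_field = 'species_type'  # Hypothetical group field
if group_field in filtered_df.columns:
    # numeric_only=True skips text columns, which would otherwise raise in pandas 2.x
    grouped_df = filtered_df.groupby(group_field).mean(numeric_only=True)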
    print(f"Grouped data by {group_field}:")
    print(grouped_df.head())
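
The filter, normalize, and group steps can also be tried without downloading the dataset. The following is a minimal sketch on toy data. The helper `filter_and_zscore` and the toy values are made up for illustration, and the column names mirror the hypothetical ones used above. It adds a guard for the case where the standard deviation is zero or undefined (for example, when a single row survives the filter), which the plain formula does not handle.

```python
import pandas as pd

def filter_and_zscore(frame: pd.DataFrame, field: str, threshold: float) -> pd.DataFrame:
    """Keep rows where `field` exceeds `threshold` and add a z-score column."""
    kept = frame[frame[field] > threshold].copy()
    mean = kept[field].mean()
    std = kept[field].std()  # sample standard deviation (ddof=1)
    if pd.isna(std) or std == 0:
        # A z-score is undefined here; fall back to 0.0 instead of NaN or inf.
        kept[f"{field}_normalized"] = 0.0
    else:
        kept[f"{field}_normalized"] = (kept[field] - mean) / std
    return kept

# Toy data standing in for the real records
toy = pd.DataFrame({
    "species_type": ["vole", "vole", "lemming", "lemming", "shrew"],
    "encounters_count": [12, 18, 15, 4, 30],
})
result = filter_and_zscore(toy, "encounters_count", threshold=10)
group_means = result.groupby("species_type")["encounters_count"].mean()
print(result)
print(group_means)
```

Note that the z-scores are computed after filtering, so they describe each record relative to the filtered subset rather than the full dataset.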

## 5. Visualization
Visualize data distributions or relationships between fields in the dataset.

In [None]:
import matplotlib.pyplot as plt

# Visualize normalized encounters count
plt.figure(figsize=(10, 6))
plt.hist(filtered_df[f"{numeric_field}_normalized"], bins=30, alpha=0.7, label=numeric_field)
plt.title('Distribution of Normalized Encounters Count')
plt.xlabel('Normalized Count')
plt.ylabel('Frequency')
plt.legend()
plt.show()
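
To inspect the binning numerically, for instance in an environment without a display, the same counts that a histogram plot draws can be computed with `numpy.histogram`. This is a small sketch on made-up values; with real data, one would pass the normalized column instead.

```python
import numpy as np

# Made-up normalized values standing in for the real column
values = np.array([-1.2, -0.4, -0.3, 0.1, 0.2, 0.2, 0.9, 1.7])

# counts[i] is the number of values between edges[i] and edges[i + 1]
counts, edges = np.histogram(values, bins=4)
for left, right, count in zip(edges[:-1], edges[1:], counts):
    # Bins are half-open except the last one, which also includes its right edge
    print(f"{left:+.2f} to {right:+.2f}: {count}")
```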

## 6. Conclusion
Summarize key findings and observations from the dataset exploration.

In this notebook, we explored the `Boreal Rodent Behavior Observational Data (2020)` dataset, described its structure, and performed basic data manipulations and visualizations. We filtered records by a numeric threshold, normalized a numeric field, and plotted its distribution, all of which are common first steps in understanding a dataset. Referring to record sets and fields by their `@id` makes it explicit which data each step used, which helps others reproduce the analysis.