# Annual Registrations and Characteristics of Dogs in Zurich, 2014-2024: Exploration with `mlcroissant`
This notebook provides a template for loading and exploring a dataset using the `mlcroissant` library.

### Dataset Source
The dataset source is provided via a Croissant schema URL.

In [None]:
# Ensure `mlcroissant` library is installed
!pip install mlcroissant

## 1. Data Loading
Load metadata and records from the dataset using `mlcroissant`.

In [None]:
import mlcroissant as mlc
import pandas as pd

# Define the dataset URL
url = 'https://sen.science/doi/10.71728/senscience.dtj2-hwyp/fair2.json'

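# Load the dataset and convert its metadata to a plain dict;
# `dataset.metadata` is a Metadata object and cannot be indexed with ['name']
dataset = mlc.Dataset(jsonld=url)
metadata = dataset.metadata.to_json()
print(f"{metadata['name']}: {metadata['description']}")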

## 2. Data Overview
Review available record sets, fields, and their IDs.

In [None]:
# Fetch and print record sets
record_sets = metadata['recordSet']
for record_set in record_sets:
    print(f"Record Set ID: {record_set['@id']}")
    fields = record_set.get('field', [])
    for field in fields:
        print(f"  Field ID: {field['@id']}, Name: {field['name']}")
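
The traversal above can also be wrapped in a small helper that returns the field IDs per record set, which is handy when choosing IDs for the extraction step. The following is a minimal sketch on a hypothetical metadata dict: `sample_metadata`, its IDs, and the helper name `list_field_ids` are illustrative assumptions, not values from the actual dataset.

```python
# Hypothetical metadata in the JSON-LD shape the loop above expects;
# the IDs and names are illustrative, not taken from the real dataset.
sample_metadata = {
    "name": "example",
    "recordSet": [
        {
            "@id": "dogs",
            "field": [
                {"@id": "dogs/age", "name": "age"},
                {"@id": "dogs/breed", "name": "breed"},
            ],
        },
        {"@id": "empty_set"},  # a record set without a "field" key
    ],
}


def list_field_ids(metadata: dict) -> dict[str, list[str]]:
    """Map each record set @id to the list of its field @ids."""
    return {
        record_set["@id"]: [field["@id"] for field in record_set.get("field", [])]
        for record_set in metadata.get("recordSet", [])
    }


field_index = list_field_ids(sample_metadata)
print(field_index)
```

Using `.get(..., [])` at both levels keeps the helper from raising a `KeyError` when a record set declares no fields.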

## 3. Data Extraction
Load data from a specific record set into a DataFrame for analysis. Use the record set and field `@id`s from the overview.

In [None]:
from collections import defaultdict

# Example record set ID from previous step
example_record_set_id = record_sets[0]['@id']  # Substitute with actual ID

# Extract the selected record set into a DataFrame, keyed by its ID
dataframes = defaultdict(pd.DataFrame)

records = list(dataset.records(record_set=example_record_set_id))
dataframes[example_record_set_id] = pd.DataFrame(records)

print(dataframes[example_record_set_id].columns.tolist())
dataframes[example_record_set_id].head()
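
Depending on the `mlcroissant` version and the schema, text values in the records may arrive as `bytes`, and column names may carry the record set ID as a prefix. The sketch below shows one way to tidy a record before building the DataFrame. The helper `clean_record` and the sample record are hypothetical and assume a `<record_set_id>/<field>` key layout.

```python
def clean_record(record: dict, record_set_id: str) -> dict:
    """Decode bytes values as UTF-8 and drop a leading '<record_set_id>/' from keys."""
    prefix = f"{record_set_id}/"
    cleaned = {}
    for key, value in record.items():
        short_key = key[len(prefix):] if key.startswith(prefix) else key
        if isinstance(value, bytes):
            value = value.decode("utf-8", errors="replace")
        cleaned[short_key] = value
    return cleaned


# Hypothetical record; real keys and values depend on the dataset schema.
raw = {"dogs/breed": b"Labrador", "dogs/age": 12}
cleaned_record = clean_record(raw, "dogs")
print(cleaned_record)
```

In the notebook this could be applied as `pd.DataFrame([clean_record(r, example_record_set_id) for r in records])`, if the printed column list shows prefixed names or byte strings.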

## 4. Exploratory Data Analysis (EDA)
Apply common data processing steps: filter records by a threshold, standardize a numeric field, and group records by a categorical field.

In [None]:
# 'dog_age' is a placeholder for a numeric column; check the column list printed
# in the extraction step, since actual names depend on the dataset schema
numeric_field_id = 'dog_age'  # Substitute with actual field ID

if numeric_field_id in dataframes[example_record_set_id].columns:
    threshold = 10
    filtered_df = dataframes[example_record_set_id][
        dataframes[example_record_set_id][numeric_field_id] > threshold
    ].copy()  # copy so the new column below is not written to a slice of the original
    print(f"Filtered records with {numeric_field_id} > {threshold}:")
    print(filtered_df.head())
    
    # Standardize the numeric field (z-score: subtract the mean, divide by the sample std)
    filtered_df[f"{numeric_field_id}_normalized"] = (
        (filtered_df[numeric_field_id] - filtered_df[numeric_field_id].mean()) /
        filtered_df[numeric_field_id].std()
    )
    print(f"Normalized {numeric_field_id} for filtered records:")
    print(filtered_df[[numeric_field_id, f"{numeric_field_id}_normalized"]].head())
    
    # Group by 'breed' (assuming this is a valid field ID)
    group_field_id = 'breed'  # Substitute with actual field ID
    if group_field_id in filtered_df.columns:
        # numeric_only=True avoids a TypeError on non-numeric columns in pandas >= 2.0
        grouped_df = filtered_df.groupby(group_field_id).mean(numeric_only=True)
        print(f"Grouped data by {group_field_id}:")
        print(grouped_df.head())
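
The normalization step above is a z-score: each value has the mean subtracted and is divided by the standard deviation. pandas `Series.std()` uses the sample standard deviation (an `n - 1` denominator) by default, which matches `statistics.stdev` in the standard library. The following worked example uses hypothetical ages, not values from the dataset, to show the arithmetic.

```python
import statistics

ages = [11, 12, 13, 14, 15]  # hypothetical values above the threshold of 10

mean_age = statistics.mean(ages)
std_age = statistics.stdev(ages)  # sample standard deviation (n - 1 denominator)

# Z-score: distance from the mean, measured in standard deviations
z_scores = [(age - mean_age) / std_age for age in ages]
print([round(z, 3) for z in z_scores])
```

After standardization the values have mean 0 and sample standard deviation 1. Note that the statistics here are computed on the filtered subset only, as in the cell above, so the z-scores describe a value's position among the filtered records rather than in the full dataset.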

## 5. Visualization
Visualize data distributions or relationships between fields in the dataset.

In [None]:
import matplotlib.pyplot as plt

# Example visualization: Plot histogram of the numeric field
if numeric_field_id in dataframes[example_record_set_id].columns:
    plt.figure(figsize=(10, 6))
    plt.hist(dataframes[example_record_set_id][numeric_field_id].dropna(), bins=30, alpha=0.7)
    plt.title(f'Histogram of {numeric_field_id}')
    plt.xlabel(numeric_field_id)
    plt.ylabel('Frequency')
    plt.show()

## 6. Conclusion
Summarize key findings and observations from the dataset exploration.

Using `mlcroissant`, we loaded the Annual Registrations and Characteristics of Dogs in Zurich dataset and walked through a template analysis of dog demographics: filtering on dog age, standardizing the numeric values for comparison, and grouping by breed. The filtering, standardization, and grouping steps run only once the placeholder field IDs are replaced with column names that exist in the dataset. Further visualizations can be added for deeper exploration.