# Exploring Demographic and Breed Data of Registered Dogs in Zurich (2014-2024) with `mlcroissant`
This notebook provides a template for loading and exploring a dataset using the `mlcroissant` library.

### Dataset Source
The dataset is described by a Croissant metadata file (JSON-LD), referenced by the URL defined in the data loading cell.

In [None]:
# Ensure `mlcroissant` library is installed
!pip install mlcroissant

## 1. Data Loading
Load metadata and records from the dataset using `mlcroissant`.

In [None]:
import mlcroissant as mlc
import pandas as pd

# Define the dataset Croissant schema URL
croissant_url = 'https://sen.science/doi/10.71728/senscience.834f-x4hv/fair2.json'

# Load the dataset metadata
dataset = mlc.Dataset(croissant_url)
print(f"Dataset loaded. Metadata @id: {dataset.metadata.id}")
print(f"Title: {dataset.metadata.name}")
print(f"Description: {dataset.metadata.description}")

## 2. Data Overview
Review available record sets, fields, and their IDs.

In [None]:
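# List all record sets and their fields using their @id, name, and description.
# Record sets are exposed on dataset.metadata rather than on the Dataset object itself.
record_sets = getattr(dataset.metadata, 'record_sets', None) or []
if record_sets:
    print("Available Record Sets:")
    for rs in record_sets: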
        print(f"- Record Set @id: {rs.id}")
        print(f"  Name: {getattr(rs, 'name', None)}")
        print(f"  Description: {getattr(rs, 'description', None)}")
        if hasattr(rs, 'fields') and rs.fields:
            print("  Fields:")
            for field in rs.fields:
                print(f"    - Field @id: {field.id} | Name: {getattr(field, 'name', None)} | Data type: {getattr(field, 'data_type', None)}")
        print("")
else:
    print("No record sets found in the dataset schema.")

## 3. Data Extraction
Load data from a specific record set into a DataFrame for analysis. Use the record set and field `@id`s from the overview.

In [None]:
# Gather all record set @ids (record sets are exposed on dataset.metadata)
record_set_ids = [rs.id for rs in (getattr(dataset.metadata, 'record_sets', None) or [])]
if not record_set_ids:
    print("Dataset has no record sets.")

dataframes = {}

for record_set_id in record_set_ids:
    try:
        records = list(dataset.records(record_set=record_set_id))
        if records:
            df = pd.DataFrame(records)
            dataframes[record_set_id] = df
            print(f"Loaded DataFrame for Record Set @id: {record_set_id}")
            print(f"Columns: {df.columns.tolist()}")
            print(df.head(2))
        else:
            print(f"Record set @id {record_set_id} contains no records.")
    except Exception as e:
        print(f"Could not load records for record set @id {record_set_id}: {e}")

# For demonstration, choose the first non-empty record set
non_empty_record_set_id = None
for k, v in dataframes.items():
    if not v.empty:
        non_empty_record_set_id = k
        break

if non_empty_record_set_id:
    print(f"Example DataFrame columns for record set @id {non_empty_record_set_id}:")
    print(dataframes[non_empty_record_set_id].columns.tolist())
    display(dataframes[non_empty_record_set_id].head())
else:
    print("No non-empty DataFrames found in record sets.")
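
Depending on the `mlcroissant` version and the declared field types, text values in the loaded records may arrive as `bytes` rather than `str`, which makes grouping and plotting labels harder to read. The following is a minimal sketch of a decoding step, not part of the original notebook; the helper name `decode_bytes_columns` and the sample column names are hypothetical.

```python
import pandas as pd


def decode_bytes_columns(df: pd.DataFrame, encoding: str = "utf-8") -> pd.DataFrame:
    """Return a copy of df in which bytes values in object columns are decoded to str."""
    out = df.copy()
    for col in out.columns:
        if out[col].dtype == object:
            # Decode only bytes values; leave None and other objects untouched
            out[col] = out[col].map(
                lambda v: v.decode(encoding, errors="replace") if isinstance(v, bytes) else v
            )
    return out


# Hypothetical sample standing in for a loaded record set
sample = pd.DataFrame(
    {"breed": [b"Labrador", b"Chihuahua", None], "birth_year": [2015, 2019, 2020]}
)
decoded = decode_bytes_columns(sample)
print(decoded)
```

In this notebook the helper could be applied to each entry of `dataframes` before the analysis cells run.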

## 4. Exploratory Data Analysis (EDA)
Apply common data processing steps to the first non-empty record set. The cell below picks the first numeric column, filters records above a threshold, z-score normalizes the filtered values, and groups them by the first categorical column to compare group means. Further steps such as removing outliers or transforming skewed distributions can be added in the same way.
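
Outlier removal and distribution transforms are not performed by the EDA cell in this section. As one possible approach, the sketch below drops values outside the conventional 1.5 × IQR fences and then applies a `log1p` transform. The helper `remove_iqr_outliers`, the toy data, and the column name `age` are assumptions made for illustration only.

```python
import numpy as np
import pandas as pd


def remove_iqr_outliers(df: pd.DataFrame, column: str, k: float = 1.5) -> pd.DataFrame:
    """Keep rows whose value lies within [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = df[column].quantile(0.25)
    q3 = df[column].quantile(0.75)
    iqr = q3 - q1
    return df[df[column].between(q1 - k * iqr, q3 + k * iqr)]


# Toy data with one implausible value (hypothetical column name)
toy = pd.DataFrame({"age": [1, 2, 3, 4, 5, 6, 7, 8, 9, 120]})

cleaned = remove_iqr_outliers(toy, "age")
# log1p compresses a right-skewed distribution and is defined at zero
cleaned = cleaned.assign(age_log=np.log1p(cleaned["age"]))
print(cleaned)
```

The IQR rule is only one convention; a domain-specific range check may be more appropriate for fields such as birth years.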

In [None]:
# --- EDA on the main record set ---
if not non_empty_record_set_id:
    print("No data available for EDA.")
else:
    df = dataframes[non_empty_record_set_id]
    print(f"Analyzing DataFrame for Record Set @id: {non_empty_record_set_id}")
    
    # Identify numeric fields by data type (e.g., 'birthyear', 'age', or similar columns)
    numeric_col_candidates = df.select_dtypes(include=['number']).columns.tolist()
    if not numeric_col_candidates:
        print("No numeric fields found for analysis.")
    else:
        numeric_field = numeric_col_candidates[0]
        print(f"Numeric field selected for filtering and normalization: {numeric_field}")
        
        # Choose a threshold: for a birth-year column keep records after 2018, otherwise use the column mean
        is_birth_year = 'birth' in str(numeric_field).lower()
        threshold = 2018 if is_birth_year else df[numeric_field].mean()
        filtered_df = df[df[numeric_field] > threshold]
        print(f"Filtered records with {numeric_field} > {threshold}:")
        print(filtered_df.head())
        
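        # Z-score normalization computed on the filtered subset
        # (yields NaN if the subset has fewer than two rows or zero variance)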
        filtered_df = filtered_df.copy()
        filtered_df[f"{numeric_field}_normalized"] = (filtered_df[numeric_field] - filtered_df[numeric_field].mean()) / filtered_df[numeric_field].std()
        print(f"Normalized {numeric_field} for filtered records:")
        print(filtered_df[[numeric_field, f"{numeric_field}_normalized"]].head())
        
        # Select a group field (for dog data, could be 'breed', 'gender', etc.), fallback to any object dtype
        potential_group_fields = [c for c in df.columns if df[c].dtype == 'object' and c != numeric_field]
        if potential_group_fields:
            group_field = potential_group_fields[0]
            print(f"Grouping by field: {group_field}")
            grouped_df = filtered_df.groupby(group_field)[numeric_field].mean().reset_index().sort_values(numeric_field, ascending=False)
            print(f"Grouped data by {group_field} and calculated mean of {numeric_field}:")
            print(grouped_df.head())
        else:
            print("No suitable object/categorical column for grouping found.")

## 5. Visualization
Visualize data distributions or relationships between fields in the dataset.
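
For a relationship between two fields, one option is to build a cross-tabulation of counts first and plot it afterwards. The sketch below uses `pandas.crosstab` on toy data; the column names `sex` and `registration_year` are hypothetical and would need to be replaced with field names from the record set overview.

```python
import pandas as pd

# Toy data standing in for two fields of a record set (hypothetical names)
toy = pd.DataFrame(
    {
        "sex": ["m", "f", "m", "m", "f"],
        "registration_year": [2014, 2014, 2015, 2015, 2015],
    }
)

# Rows: categories of the first field; columns: values of the second; cells: record counts
table = pd.crosstab(toy["sex"], toy["registration_year"])
print(table)
```

The resulting table can be passed to `sns.heatmap(table, annot=True)` to display the counts as a heatmap.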

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

if not non_empty_record_set_id:
    print("No data available for visualization.")
else:
    if not numeric_col_candidates:
        print("No numeric field present for plotting.")
    else:
        # Histogram of numeric field
        plt.figure(figsize=(8,5))
        sns.histplot(df[numeric_field].dropna(), bins=30, kde=True)
        plt.title(f"Distribution of {numeric_field}")
        plt.xlabel(numeric_field)
        plt.ylabel("Frequency")
        plt.show()
        
        # If grouped_df exists from EDA, show barplot
        if 'grouped_df' in locals():
            top_n = grouped_df.head(10)
            plt.figure(figsize=(10,5))
            sns.barplot(x=top_n[group_field], y=top_n[numeric_field])
            plt.title(f"Top 10 Groups by Mean {numeric_field}")
            plt.xlabel(group_field)
            plt.ylabel(f"Mean {numeric_field}")
            plt.xticks(rotation=45, ha='right')
            plt.show()

## 6. Conclusion
Key takeaways from this exploration:

- In this notebook, we demonstrated how to load a Croissant-structured dataset using `mlcroissant`, inspect its metadata, explore record sets and their fields via their `@id`, extract data into pandas DataFrames, and perform initial exploratory analysis and visualization.
- You can extend this notebook by further exploring additional record sets, performing deeper statistical analyses, or building machine learning models using the cleaned dataset.
- For reproducibility, refer to record sets and fields by their `@id`s when manipulating or documenting them.
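
One way to keep `@id`s traceable while still working with readable column labels is to store an explicit mapping from each column `@id` to a short display name. The sketch below assumes column `@id`s shaped like `record_set/field`, which may not match the actual schema; the helper `short_column_names` and the sample ids are hypothetical.

```python
import pandas as pd


def short_column_names(df: pd.DataFrame) -> dict:
    """Map each column @id to its last path segment, keeping the full id on collisions."""
    mapping = {}
    seen = set()
    for col in df.columns:
        short = str(col).rsplit("/", 1)[-1]
        if short in seen:
            # Fall back to the full @id so display names stay unique
            short = str(col)
        seen.add(short)
        mapping[col] = short
    return mapping


# Hypothetical column @ids
frame = pd.DataFrame(columns=["dogs/breed", "dogs/birth_year", "owners/breed"])
id_to_label = short_column_names(frame)
display_frame = frame.rename(columns=id_to_label)
print(id_to_label)
```

Keeping `id_to_label` alongside the renamed DataFrame allows any result to be traced back to the original field `@id`.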