# Exploring Demographic and Breed Data of Registered Dogs in Zurich (2014-2024) with `mlcroissant`
This notebook provides a structured walkthrough for loading and exploring the *Demographic and Breed Data of Registered Dogs in Zurich (2014-2024)* dataset using the `mlcroissant` library.

### Dataset Source
The dataset is described by a Croissant schema and accessible at:

`https://sen.science/doi/10.71728/senscience.834f-x4hv/fair2.json`


In [None]:
# Install `mlcroissant` and the other libraries used in this notebook
%pip install mlcroissant pandas matplotlib seaborn

## 1. Data Loading
Load dataset metadata and records using `mlcroissant`.

In [None]:
import mlcroissant as mlc
import pandas as pd

# Define the dataset URL
croissant_url = 'https://sen.science/doi/10.71728/senscience.834f-x4hv/fair2.json'

# Load the dataset metadata
dataset = mlc.Dataset(croissant_url)
# The metadata object behaves like an attribute holder, not a dict
metadata = dataset.metadata
print(f"{metadata.name}: {metadata.description}")

## 2. Data Overview
List every record set and its fields by `@id`. The extraction step in the next section selects data using these identifiers.
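
For orientation, a Croissant file is JSON-LD in which record sets appear under `recordSet` and their fields under `field`, each carrying an `@id`. The sketch below walks a hand-written, hypothetical fragment using only the standard library. The `dogs` record set, its field ids, and the helper `list_field_ids` are invented for illustration and are not taken from the Zurich metadata.

```python
import json

# Hypothetical, hand-written fragment in the shape of a Croissant JSON-LD file
croissant_fragment = json.loads("""
{
  "name": "example",
  "recordSet": [
    {
      "@id": "dogs",
      "name": "dogs",
      "field": [
        {"@id": "dogs/birth_year", "name": "birth_year"},
        {"@id": "dogs/breed", "name": "breed"}
      ]
    }
  ]
}
""")

def list_field_ids(jsonld):
    """Map each record set @id to the list of its field @ids."""
    record_sets = jsonld.get("recordSet", [])
    if isinstance(record_sets, dict):  # a single item may appear as a dict
        record_sets = [record_sets]
    result = {}
    for rs in record_sets:
        fields = rs.get("field", [])
        if isinstance(fields, dict):
            fields = [fields]
        result[rs["@id"]] = [f["@id"] for f in fields]
    return result

field_ids = list_field_ids(croissant_fragment)
print(field_ids)
```

The cell that follows performs the equivalent listing on the real metadata through `mlcroissant`.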


In [None]:
# Record sets live on the metadata object and are objects, not dicts
record_sets = list(dataset.metadata.record_sets)
print(f"Found {len(record_sets)} record set(s):")

def node_id(node):
    # Croissant 1.0 nodes expose `id`; fall back to `name` if it is missing
    return getattr(node, 'id', None) or node.name

# For each record set, list its fields (@id)
record_set_ids = []
for rs in record_sets:
    record_set_id = node_id(rs)
    record_set_ids.append(record_set_id)
    print(f"- RecordSet @id: {record_set_id}; name: {rs.name}")
    print(f"  Fields for {record_set_id}:")
    for field in rs.fields:
        print(f"    - Field @id: {node_id(field)}; name: {field.name}")

## 3. Data Extraction
Load records from each available record set into a Pandas DataFrame for analysis.

All extraction is done by specifying record set and field `@id`s as referenced above.
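
Depending on the `mlcroissant` version and the field data types, text values in the yielded records may arrive as `bytes` rather than `str`. The following is a minimal sketch of decoding them before building the DataFrame. The records are made up, and the keys `dogs/breed` and `dogs/birth_year` as well as the helper `decode_record` are hypothetical.

```python
def decode_record(record, encoding="utf-8"):
    """Return a copy of the record with bytes values decoded to str."""
    return {
        key: value.decode(encoding) if isinstance(value, bytes) else value
        for key, value in record.items()
    }

# Made-up records in the shape of dicts keyed by field @id
raw_rows = [
    {"dogs/breed": b"Labrador Retriever", "dogs/birth_year": 2018},
    {"dogs/breed": "Chihuahua", "dogs/birth_year": 2020},
]
rows = [decode_record(r) for r in raw_rows]
print(rows)
```

In the loading cell, the same helper could wrap each record, for example `rows = [decode_record(r) for r in dataset.records(record_set=record_set_id)]`.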

In [None]:
# Load records for each record set
dataframes = {}
for record_set_id in record_set_ids:
    print(f"Loading records from record set: {record_set_id}")
    rows = list(dataset.records(record_set=record_set_id))
    df = pd.DataFrame(rows)
    dataframes[record_set_id] = df
    print(f"Columns for {record_set_id}: {df.columns.tolist()}")
    print(df.head(), "\n")

# For convenience in EDA, pick the first record set as primary
main_record_set = record_set_ids[0] if record_set_ids else None
if main_record_set:
    print(f"Main record set selected: {main_record_set}")
    print(dataframes[main_record_set].head())

## 4. Exploratory Data Analysis (EDA)
Apply common data processing operations: filter, normalize, group by. Use `@id`s as column names where possible.

Typical columns might include numeric fields such as 'BirthYear' or categorical fields such as 'Breed'.
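
Here "normalize" means a z-score: subtract the mean and divide by the standard deviation. The sketch below applies the filter and the z-score to a few made-up birth years using only the standard library. `statistics.stdev` is the sample standard deviation, which is also what pandas' `Series.std()` computes by default (`ddof=1`).

```python
import statistics

# Made-up birth years, for illustration only
birth_years = [2012, 2016, 2017, 2019, 2021, 2023]
threshold = 2015

# Filter: keep dogs born after the threshold
recent = [y for y in birth_years if y > threshold]

# Normalize: z-score using the sample standard deviation
mean = statistics.mean(recent)
std = statistics.stdev(recent)
z_scores = [(y - mean) / std for y in recent]

for year, z in zip(recent, z_scores):
    print(f"{year}: {z:+.3f}")
```

After this transformation the values have mean 0 and sample standard deviation 1, so they express how far each birth year lies from the average of the filtered subset.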

In [None]:
# For this example, we make the following assumptions based on likely field names in the main record set:
# You should replace these values with actual @id strings printed above if different.
# Suppose '@id' of the birth year field is 'https://api.app.sen.science/frontiers/7280980/birthYear'
# and the breed field is 'https://api.app.sen.science/frontiers/7280980/breed'

df = dataframes[main_record_set]
# Try to auto-detect the birth year and breed columns from their names (@id)
birth_year_field = None
breed_field = None
for col in df.columns:
    lowercase_col = str(col).lower()
    # Prefer a column mentioning 'birth'; a bare 'year' match is only a fallback,
    # since it could also be a registration or reference year
    if 'birth' in lowercase_col:
        birth_year_field = col
    elif 'year' in lowercase_col and birth_year_field is None:
        birth_year_field = col
    if 'breed' in lowercase_col and breed_field is None:
        breed_field = col

if birth_year_field is not None:
    print(f"Numeric field selected for EDA (birth year): {birth_year_field}")
else:
    # As fallback, take the first numeric column (raises IndexError if there is none)
    birth_year_field = df.select_dtypes('number').columns[0]
    print(f"Fallback numeric field used: {birth_year_field}")
if breed_field is not None:
    print(f"Categorical/group field selected for EDA (breed): {breed_field}")
else:
    # As fallback, take the first object column (raises IndexError if there is none)
    breed_field = df.select_dtypes('object').columns[0]
    print(f"Fallback categorical field used: {breed_field}")

# Filter for dogs born after 2015, as example
threshold = 2015
filtered_df = df[df[birth_year_field] > threshold]
print(f"Filtered records with {birth_year_field} > {threshold}:")
print(filtered_df.head())

# Normalize the birth year to z-scores: (value - mean) / standard deviation
filtered_df = filtered_df.copy()  # Avoid SettingWithCopyWarning
years = filtered_df[birth_year_field]
filtered_df[f"{birth_year_field}_normalized"] = (years - years.mean()) / years.std()
print(f"Normalized {birth_year_field} for filtered records:")
print(filtered_df[[birth_year_field, f"{birth_year_field}_normalized"]].head())

# Group by breed and compute average and count
if breed_field in df.columns:
    grouped = filtered_df.groupby(breed_field)[birth_year_field].agg(['mean', 'count']).sort_values('count', ascending=False)
    print("Grouped (breed) statistics:")
    print(grouped.head())

## 5. Visualization
Visualize data distributions and relationships between fields using matplotlib and seaborn.

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

# Histogram of birth years (for filtered data)
plt.figure(figsize=(8, 5))
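# discrete=True centers one bar on each integer year instead of splitting years across bins
sns.histplot(filtered_df[birth_year_field], discrete=True, kde=True)
plt.title('Distribution of Birth Years of Registered Dogs (Born After 2015)')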
plt.xlabel('Birth Year')
plt.ylabel('Count')
plt.show()

# Top 10 breeds by count in the filtered dataset
if breed_field in filtered_df.columns:
    top_breeds = filtered_df[breed_field].value_counts().head(10)
    plt.figure(figsize=(10, 6))
    sns.barplot(y=top_breeds.index, x=top_breeds.values, orient='h')
    plt.title('Top 10 Dog Breeds (Dogs Born After 2015)')
    plt.xlabel('Number of Registered Dogs')
    plt.ylabel('Breed')
    plt.tight_layout()
    plt.show()

## 6. Conclusion
This exploration demonstrated end-to-end loading, inspection, and exploration of the Zurich dog registry dataset using the `mlcroissant` Python library.

* Referencing record sets and fields by their `@id`s, as the Croissant standard defines them, keeps the workflow reproducible and unambiguous.
* We filtered the dataset for dogs born after 2015, normalized the birth year data, and identified the most common breeds within that subset.
* These data preparation steps provide the foundation for further statistical analysis, machine learning, or policy research.

Further directions may include: exploring demographic correlations, longitudinal analysis, or spatial mapping of breeds if geo-data columns are available.