# Exploring Demographic and Canine Attributes in Zurich's Dog Registration Data with `mlcroissant`

This notebook provides a template for loading and exploring a dataset using the `mlcroissant` library.

### Dataset Source
The dataset source is provided via a Croissant schema URL.

In [None]:
# Install `mlcroissant` and the libraries used in later cells
!pip install mlcroissant pandas matplotlib seaborn

## 1. Data Loading
Load the dataset's Croissant metadata using `mlcroissant`. Records are read later, in the extraction step.

In [None]:
import mlcroissant as mlc
import pandas as pd

# Define the dataset URL
url = 'https://sen.science/doi/10.71728/senscience.cwm4-z9n3/fair2.json'

# Load the dataset metadata
dataset = mlc.Dataset(url)
metadata = dataset.metadata
print(f"{metadata.name}: {metadata.description}")

## 2. Data Overview
Review available record sets, fields, and their IDs.

In [None]:
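# Inspect the record sets available in the dataset.
# `metadata` is an mlcroissant object, so convert it to its JSON-LD dictionary
# to read the `recordSet` entries and their `@id`s.
record_sets = metadata.to_json().get("recordSet", [])
print("Record Sets:")
for record_set in record_sets:
    print("-", record_set["@id"], record_set.get("name", ""))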
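The overview also calls for reviewing fields and their IDs. A minimal sketch of how that could be done on a Croissant-style JSON-LD dictionary follows. `summarize_record_sets` and `toy_metadata` are hypothetical names and data made up for illustration; the sketch assumes each entry under `recordSet` carries a `field` list, as in the Croissant format.

```python
from typing import Any


def summarize_record_sets(jsonld: dict[str, Any]) -> dict[str, list[str]]:
    """Map each record set @id to the @ids of its fields."""
    summary: dict[str, list[str]] = {}
    for record_set in jsonld.get("recordSet", []):
        # Fall back to the name when no @id is present.
        rs_id = record_set.get("@id") or record_set.get("name", "<unnamed>")
        fields = record_set.get("field", [])
        summary[rs_id] = [f.get("@id") or f.get("name", "<unnamed>") for f in fields]
    return summary


# Hypothetical, hand-written metadata used only to exercise the helper.
toy_metadata = {
    "name": "toy-dogs",
    "recordSet": [
        {
            "@id": "dogs",
            "name": "dogs",
            "field": [
                {"@id": "dogs/birthYear", "name": "birthYear"},
                {"@id": "dogs/ownerId", "name": "ownerId"},
            ],
        }
    ],
}

summary = summarize_record_sets(toy_metadata)
for rs_id, field_ids in summary.items():
    print(rs_id, field_ids)
```

On the real dataset, the same helper could presumably be applied to the dictionary returned by `metadata.to_json()`; the actual record set and field IDs will differ from the toy values here.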

## 3. Data Extraction
Load every record set into its own DataFrame for analysis, keyed by the record set `@id` from the overview.

In [None]:
# Extract data from each record set
dataframes = {}

for record_set in record_sets:
    record_set_id = record_set['@id']
    records = list(dataset.records(record_set=record_set_id))
    dataframes[record_set_id] = pd.DataFrame(records)
    print(f"Columns for {record_set_id}: {dataframes[record_set_id].columns.tolist()}")

# Display first few rows of one of the dataframes as an example
example_record_set_id = record_sets[0]['@id']
dataframes[example_record_set_id].head()
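Depending on the `mlcroissant` version and the dataset, records may come back with text values as `bytes` and with column names prefixed by the record set ID. The sketch below shows one way to tidy such records before building a DataFrame. It is an assumption-laden illustration: `clean_record`, the `dogs/` prefix, and the toy records are all hypothetical and not taken from the Zurich dataset.

```python
from typing import Any


def clean_record(record: dict[Any, Any], prefix: str = "") -> dict[str, Any]:
    """Decode bytes keys/values to str and drop a leading prefix from keys."""
    cleaned: dict[str, Any] = {}
    for key, value in record.items():
        key = key.decode("utf-8") if isinstance(key, bytes) else str(key)
        if prefix and key.startswith(prefix):
            key = key[len(prefix):]
        if isinstance(value, bytes):
            value = value.decode("utf-8", errors="replace")
        cleaned[key] = value
    return cleaned


# Hypothetical records, shaped like what a record set might yield.
toy_records = [
    {"dogs/birthYear": 2015, "dogs/breed": b"Labrador"},
    {"dogs/birthYear": 2009, "dogs/breed": b"Dachshund"},
]

cleaned_records = [clean_record(r, prefix="dogs/") for r in toy_records]
print(cleaned_records)
```

With real data, the cleaned list could be passed to `pd.DataFrame(...)` in place of the raw records, so that later column checks such as `"birthYear" in df.columns` have a chance of matching. Inspect the printed column names first to see whether any cleaning is needed at all.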

## 4. Exploratory Data Analysis (EDA)
Apply common data processing steps: filter records by a threshold, z-score normalize a numeric field, and aggregate the result by a grouping field.

In [None]:
# Select a numeric field for analysis
numeric_field = "birthYear"

example_record_set_id = record_sets[0]['@id']

# Check if the numeric field exists in the DataFrame
if numeric_field in dataframes[example_record_set_id].columns:
    
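    # Filter records above a threshold; .copy() avoids pandas' SettingWithCopyWarning
    # when a column is added to the filtered frame below
    threshold = 2010
    source_df = dataframes[example_record_set_id]
    filtered_df = source_df[source_df[numeric_field] > threshold].copy()
    print(f"Filtered records with {numeric_field} > {threshold}:")
    print(filtered_df.head())

    # Z-score normalization: subtract the mean and divide by the standard deviation
    filtered_df[f"{numeric_field}_normalized"] = (filtered_df[numeric_field] - filtered_df[numeric_field].mean()) / filtered_df[numeric_field].std()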
    print(f"Normalized {numeric_field} for filtered records:")
    print(filtered_df[[numeric_field, f"{numeric_field}_normalized"]].head())

    # Aggregate numeric columns by a grouping field (assumes `ownerId` may exist)
    group_field = "ownerId"
    if group_field in filtered_df.columns:
        # numeric_only=True skips text columns, which would otherwise raise in pandas 2.x
        grouped_df = filtered_df.groupby(group_field).mean(numeric_only=True)
        print(f"Grouped data by {group_field}:")
        print(grouped_df.head())
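To make the normalization step concrete, here is a small worked example of the same calculation using only the standard library. The birth years are hypothetical values chosen for illustration, not data from the registry. The sample standard deviation is used because pandas' `Series.std()` defaults to `ddof=1`.

```python
import statistics

# Hypothetical birth years, standing in for the filtered column.
birth_years = [2011, 2013, 2015, 2018, 2020]

mean = statistics.mean(birth_years)
# Sample standard deviation (n - 1 denominator), matching pandas' default ddof=1.
std = statistics.stdev(birth_years)

# Each z-score says how many standard deviations a value lies from the mean.
z_scores = [(year - mean) / std for year in birth_years]

for year, z in zip(birth_years, z_scores):
    print(f"{year}: {z:+.3f}")
```

After this transformation the values have mean 0 and sample standard deviation 1, so the normalized column shows the shape of the distribution rather than the absolute years.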

## 5. Visualization
Visualize data distributions or relationships between fields in the dataset.

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

# Plot distribution of the normalized birth years
plt.figure(figsize=(10, 6))
sns.histplot(filtered_df[f'{numeric_field}_normalized'], bins=30, kde=True)
plt.title('Distribution of Normalized Birth Year')
plt.xlabel('Normalized Birth Year')
plt.ylabel('Frequency')
plt.show()

## 6. Conclusion
Summarize key findings and observations from the dataset exploration.

- The dataset covers dog registrations in Zurich and includes demographic information alongside attributes of the dogs.
- The filtering, normalization, and histogram steps above show how the distribution of dog birth years can be examined.
- Further exploration could involve analyzing breed data or geographic distributions.