# Withings Real World Wearables Aggregated Dataset (2023-2024) Exploration with `mlcroissant`
This notebook provides a template for loading and exploring a dataset using the `mlcroissant` library.

### Dataset Source
The dataset source is provided via a Croissant schema URL.

In [None]:
# Ensure `mlcroissant` library is installed
!pip install mlcroissant

## 1. Data Loading
Load metadata and records from the dataset using `mlcroissant`.

In [None]:
import mlcroissant as mlc
import pandas as pd

# Define the dataset URL
url = 'https://sen.science/doi/10.71728/senscience.gae5-90ty/fair2.json'

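# Load the dataset metadata. `dataset.metadata` is a Metadata object,
# so convert it to a plain dict before indexing by key.
dataset = mlc.Dataset(url)
metadata = dataset.metadata.to_json()
print(f"{metadata['name']}: {metadata['description']}")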

## 2. Data Overview
Review available record sets, fields, and their IDs.

In [None]:
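# Convert the Metadata object to a dict so record sets can be indexed by key
record_sets = dataset.metadata.to_json()['recordSet']

for record_set in record_sets:
    print(f"RecordSet: {record_set['@id']}")
    fields = record_set.get('field', [])
    for field in fields:
        # Some fields (e.g. those with sub-fields) may not declare a dataType
        print(f"  - Field: {field['@id']} ({field.get('dataType', 'n/a')})")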

## 3. Data Extraction
Load data from a specific record set into a DataFrame for analysis. Use the record set and field `@id`s from the overview.

In [None]:
# Extract data from first record set
record_set_id = record_sets[0]['@id']
records = list(dataset.records(record_set=record_set_id))
df = pd.DataFrame(records)
print(f"Columns in {record_set_id}:", df.columns.tolist())
df.head()
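
Depending on the `mlcroissant` version, text values can come back as `bytes`, and column names can carry the record set prefix (for example `recordset/field`). The sketch below shows one way to tidy a DataFrame in that situation. The helper names `decode_bytes_columns` and `short_column_names`, and the toy column names, are illustrative assumptions and not part of `mlcroissant` or of this dataset.

```python
import pandas as pd


def decode_bytes_columns(frame: pd.DataFrame, encoding: str = "utf-8") -> pd.DataFrame:
    """Return a copy of `frame` with any bytes values decoded to str."""
    out = frame.copy()
    for col in out.columns:
        # Only object columns can hold raw bytes values
        if out[col].dtype == object:
            out[col] = out[col].map(
                lambda v: v.decode(encoding) if isinstance(v, bytes) else v
            )
    return out


def short_column_names(frame: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with 'recordset/field' style names reduced to 'field'."""
    # Note: this can produce duplicate names if two prefixes share a field name
    return frame.rename(columns=lambda c: str(c).split("/")[-1])


# Toy frame standing in for `df`; the column names here are hypothetical
toy = pd.DataFrame({"measurements/id": [b"a1", b"b2"], "measurements/sbp": [118, 131]})
decoded = decode_bytes_columns(toy)
renamed = short_column_names(decoded)
print(renamed)
```

With the real data, the same two calls could be applied to `df` before selecting fields by their short names.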

## 4. Exploratory Data Analysis (EDA)
Apply common data processing steps to prepare the data for further analysis. The cell below filters records by a threshold, standardizes a numeric field (z-score), and groups the result by a key attribute. The field names it uses are assumptions; replace them with names from the overview. Further steps, such as removing outliers or transforming skewed distributions, follow the same pattern.

In [None]:
# Select a numeric field for analysis
# We assume 'sbp' (systolic blood pressure) is a relevant numeric field
numeric_field = 'sbp'

threshold = 120
# Take an explicit copy so later column assignments do not trigger SettingWithCopyWarning
filtered_df = df[df[numeric_field] > threshold].copy()
print(f"Filtered records with {numeric_field} > {threshold}:")
print(filtered_df.head())

filtered_df[f"{numeric_field}_normalized"] = (filtered_df[numeric_field] - filtered_df[numeric_field].mean()) / filtered_df[numeric_field].std()
print(f"Normalized {numeric_field} for filtered records:")
print(filtered_df[[numeric_field, f"{numeric_field}_normalized"]].head())

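# Assuming 'age' is a field to group data
group_field = 'age'
if group_field in filtered_df.columns:
    # numeric_only=True skips non-numeric columns, which would otherwise raise in pandas 2.x
    grouped_df = filtered_df.groupby(group_field).mean(numeric_only=True)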
    print(f"Grouped data by {group_field}:")
    print(grouped_df.head())
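
Outlier removal, mentioned above as a typical preparation step, can be sketched with the interquartile range (IQR) rule. This is a minimal illustration on toy values, not a method prescribed by the dataset. The helper name `remove_iqr_outliers` and the multiplier `k=1.5` are assumptions chosen for the example.

```python
import pandas as pd


def remove_iqr_outliers(frame: pd.DataFrame, column: str, k: float = 1.5) -> pd.DataFrame:
    """Keep rows whose `column` value lies within [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = frame[column].quantile(0.25)
    q3 = frame[column].quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    # between() is inclusive on both ends by default
    mask = frame[column].between(lower, upper)
    return frame.loc[mask].copy()


# Toy values standing in for the assumed 'sbp' field; 300 is an obvious outlier
toy = pd.DataFrame({"sbp": [110, 115, 120, 125, 130, 300]})
cleaned = remove_iqr_outliers(toy, "sbp")
print(cleaned["sbp"].tolist())
```

On the real data, the same helper could be called as `remove_iqr_outliers(filtered_df, numeric_field)`, provided that column is numeric.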

## 5. Visualization
Visualize data distributions or relationships between fields in the dataset.

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

plt.figure(figsize=(10, 6))
sns.histplot(data=filtered_df, x=numeric_field, kde=True)
plt.title(f'Distribution of {numeric_field} (Filtered)')
plt.xlabel(numeric_field)
plt.ylabel('Count')
plt.show()

## 6. Conclusion
Summarize key findings and observations from the dataset exploration.

This exploration loaded the dataset's Croissant metadata, listed its record sets and fields, extracted one record set into a DataFrame, and applied basic filtering, normalization, grouping, and a histogram to an assumed numeric field. This preliminary analysis serves as a foundation for more detailed analysis and model building in subsequent steps.