# Explore the MomCare Data Package with `mlcroissant`

## Introduction

This notebook explores the MomCare Data Package, a longitudinal maternal health dataset from value-based digital care in Tanzania. Using `mlcroissant`, a Python library for working with datasets in the MLCommons Croissant format, we load, analyze, and visualize key aspects of this FAIR² data resource.


Learn more:
- Data Package doi: [10.71728/senscience.tyq3-fnvr](https://sen.science/doi/10.71728/senscience.tyq3-fnvr)
- Frontiers Data Article doi: [xx.xxxxx/xxxxxx]()

As a FAIR² Data Package, it ensures accessibility, interoperability, and AI-readiness, supporting research and policy aligned with European directives. FAIR² datasets follow the MLCommons **Croissant** 🥐 format for machine learning datasets. See the [MLCommons Croissant Format Specification](https://docs.mlcommons.org/croissant/docs/croissant-spec.html).

Through a series of steps, we will:

- Load the dataset and explore its structure using `mlcroissant`.
- Examine the available record sets, which represent different facets of the maternal care journey, such as encounters, patients, and locations.
- Extract data from specific record sets into pandas DataFrames for detailed analysis.
- Visualize patient demographics, including age distribution across different cohorts.
- Construct and visualize a patient-clinic network to understand patient engagement and the flow of care across different clinics.


This exploration aims to provide insights into the MomCare dataset, demonstrating how FAIR² data principles and tools like `mlcroissant` can facilitate reproducible research and data-driven decision-making in maternal health.

### Install and import required libraries

In [None]:
# Install system dependencies, then install mlcroissant from source (the `read-fhir` branch of a fork)
!sudo apt-get install python3-dev graphviz libgraphviz-dev pkg-config
!pip install "git+https://github.com/crisely09/croissant.git@read-fhir#subdirectory=python/mlcroissant&egg=mlcroissant[dev]"

In [None]:
import mlcroissant as mlc
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

from tabulate import tabulate
from IPython.display import Markdown, display

## 1. Data Loading
Load metadata and records from the dataset using `mlcroissant` and the URL of the FAIR² Data Package.

In [None]:
# Provide the dataset URL
url = 'https://sen.science/doi/10.71728/senscience.tyq3-fnvr/fair2.json'

# Load the dataset metadata
dataset = mlc.Dataset(url)
metadata = dataset.metadata.to_json()
print(f"{metadata['name']}: {metadata['description']}")

## 2. Data Overview

In the **Croissant** format, a RecordSet represents a structured collection of records, where each record is a granular dataset unit (e.g., an image, text file, or table row). It defines the structure of these records using a set of fields, such as the columns in a table or sheet, as seen in this example.
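
As a rough illustration of this structure, the sketch below walks a Croissant-style metadata dictionary and lists each RecordSet with its field names. The `example_metadata` dictionary and the `summarize_record_sets` helper are hypothetical stand-ins and not part of `mlcroissant`; the `recordSet` and `field` keys follow the Croissant JSON-LD layout, and real metadata will carry many more properties.

```python
def summarize_record_sets(metadata: dict) -> dict[str, list[str]]:
    """Map each RecordSet identifier to the names of its fields."""
    summary = {}
    for record_set in metadata.get("recordSet", []):
        # Prefer the JSON-LD identifier, fall back to the human-readable name.
        record_set_id = record_set.get("@id") or record_set.get("name", "<unnamed>")
        fields = record_set.get("field", [])
        summary[record_set_id] = [
            f.get("name") or f.get("@id", "<unnamed>") for f in fields
        ]
    return summary


# Hypothetical, heavily trimmed metadata used only for illustration.
example_metadata = {
    "name": "example-dataset",
    "recordSet": [
        {
            "@id": "patient",
            "field": [{"@id": "patient/birthDate", "name": "birthDate"}],
        },
        {"@id": "episodeOfCare", "field": []},
    ],
}

for record_set_id, field_names in summarize_record_sets(example_metadata).items():
    print(record_set_id, field_names)
```

The same helper could be pointed at the `metadata` dictionary loaded earlier, assuming its record sets expose a `field` list.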

### 2.1 Review available RecordSets

In [None]:
# Format the list column as a Markdown-compatible string
def format_list_column(row):
    if isinstance(row, list):
        return "\n".join(f"- {item}" for item in row)  # Bullet point list
    return str(row)

In [None]:
# List all the record sets available in the dataset
df = pd.DataFrame(metadata["recordSet"])
columns_to_keep = {
    "@id": "Record Set ID",
    "description": "Description"
}
df = df[list(columns_to_keep.keys())]
df = df.rename(columns=columns_to_keep)

# Convert DataFrame to Markdown table
markdown_table = tabulate(df, headers="keys", tablefmt="pipe", showindex=False)

# Render the table as Markdown in Jupyter
display(Markdown(markdown_table))

## 3. Data Extraction

### 3.1 Load data from each record set into a DataFrame for analysis

In [None]:
# Generate a dataframe for each record set in the metadata
record_set_ids = df["Record Set ID"].tolist()

dataframes = {
    record_set_id: pd.DataFrame(list(dataset.records(record_set=record_set_id)))
    for record_set_id in record_set_ids
}

In [None]:
prefix = "https://sen.science/core/"

# Shorten column names: drop the namespace prefix and keep only the last path segment
for name, frame in dataframes.items():
    frame.rename(columns=lambda x: x.replace(prefix, "").split("/")[-1], inplace=True)
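
The renaming rule in the cell above keeps only the last path segment of each column name, so two columns that differ only in their parent path would end up with the same label. The sketch below applies the same rule to a few hypothetical column names and reports any collisions; the names are invented for illustration and may not match the actual MomCare fields.

```python
from collections import Counter

PREFIX = "https://sen.science/core/"


def shorten(column: str) -> str:
    """Apply the notebook's rule: drop the prefix, keep the last path segment."""
    return column.replace(PREFIX, "").split("/")[-1]


def find_collisions(columns: list[str]) -> list[str]:
    """Return shortened names that more than one original column maps to."""
    counts = Counter(shorten(c) for c in columns)
    return sorted(name for name, n in counts.items() if n > 1)


# Hypothetical column names, for illustration only.
columns = [
    "https://sen.science/core/record-sets/patient/birthDate",
    "https://sen.science/core/record-sets/patient/identifier_value",
    "https://sen.science/core/record-sets/patient/contact/identifier_value",
]

print([shorten(c) for c in columns])
print(find_collisions(columns))
```

Running a check like this before renaming in place is one way to confirm that no columns are silently merged under the same name.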

In [None]:
# Display the first rows of each dataframe
for name, df in dataframes.items():
    display(Markdown(f"#### {name}"))
    display(df.head())
    display(Markdown("---"))

## 🧪 Patient Demographics and Engagement

In [None]:
# Select the episode-of-care and patient DataFrames
episode_df = dataframes["https://sen.science/core/record-sets/episodeOfCare"]
episode_df["patient_id"] = episode_df["patient_reference"].apply(lambda x: x.decode() if isinstance(x, bytes) else x)
# Ensure period_start is decoded from bytes if necessary before extracting date
episode_df["period_start"] = episode_df["period_start"].apply(lambda x: x.decode() if isinstance(x, bytes) else x)
episode_df["start_date"] = pd.to_datetime(episode_df["period_start"].str.extract(r"(?:b')?(.*?)'?\Z", expand=False), errors="coerce")

patient_df = dataframes["https://sen.science/core/record-sets/patient"]
# Ensure patient_id is decoded from bytes if necessary
patient_df["patient_id"] = patient_df["identifier_value"].apply(lambda x: x.decode() if isinstance(x, bytes) else x)
# Decode bytes to string if necessary for birthDate and org_name columns
def decode_if_bytes(val):
    if isinstance(val, bytes):
        return val.decode(errors="ignore")
    return val

patient_df["birthDate"] = patient_df["birthDate"].apply(decode_if_bytes)
patient_df["org_name"] = patient_df["contact_organization_display"].apply(decode_if_bytes)

# Convert columns used for merging to string type to handle potential lists
episode_df["patient_id"] = episode_df["patient_id"].astype(str)
patient_df["patient_id"] = patient_df["patient_id"].astype(str)

# Define cohort windows
cohorts = [
    ("Cohort 1", "2020-11-01", "2021-02-28"),
    ("Cohort 2", "2021-03-01", "2021-06-30"),
    ("Cohort 3", "2021-07-01", "2021-10-31"),
    ("Cohort 4", "2021-11-01", "2022-02-28"),
    ("Cohort 5", "2022-03-01", "2022-06-30"),
    ("Cohort 6", "2022-07-01", "2022-10-31"),
    ("Cohort 7", "2022-11-01", "2023-02-28"),
    ("Cohort 8", "2023-03-01", "2023-06-30"),
    ("Cohort 9", "2023-07-01", "2023-10-31")
]

def assign_cohort(date):
    if pd.isnull(date): return "Unassigned"
    for label, start, end in cohorts:
        if pd.to_datetime(start) <= date <= pd.to_datetime(end):
            return label
    return "Unassigned"

episode_df["cohort"] = episode_df["start_date"].apply(assign_cohort)

# Merge to create cohort map
cohort_map = episode_df[["patient_id", "cohort", "start_date"]].merge(
    patient_df[["patient_id", "birthDate", "org_name"]], on="patient_id", how="left"
)
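
The cohort windows in the cell above use inclusive end dates, which `pd.to_datetime` reads as midnight. If `period_start` carries a time of day, an episode starting on the afternoon of a window's last day would fall outside every window. The sketch below is one possible alternative, not the method used in the notebook: it assumes the windows are contiguous and treats each as half-open, running from its start date up to the next start date. `COHORT_STARTS`, `END_EXCLUSIVE`, and `assign_cohort_half_open` are hypothetical names, and timezone-naive timestamps are assumed.

```python
import pandas as pd

# Start dates of the nine cohorts; each cohort is assumed to run until the
# next one starts, and the last one until END_EXCLUSIVE.
COHORT_STARTS = pd.to_datetime([
    "2020-11-01", "2021-03-01", "2021-07-01", "2021-11-01", "2022-03-01",
    "2022-07-01", "2022-11-01", "2023-03-01", "2023-07-01",
])
END_EXCLUSIVE = pd.Timestamp("2023-11-01")


def assign_cohort_half_open(date) -> str:
    """Assign a cohort label using half-open windows [start, next_start)."""
    if pd.isnull(date) or date < COHORT_STARTS[0] or date >= END_EXCLUSIVE:
        return "Unassigned"
    # searchsorted(side="right") counts how many start dates are <= date.
    position = COHORT_STARTS.searchsorted(date, side="right")
    return f"Cohort {position}"


# Illustrative timestamps: an afternoon on a window's last day, a window's
# first day, a missing value, and a date after the last window.
dates = pd.Series([
    pd.Timestamp("2021-02-28 14:30"),
    pd.Timestamp("2021-03-01"),
    pd.NaT,
    pd.Timestamp("2024-01-01"),
])
print(dates.apply(assign_cohort_half_open).tolist())
```

If `period_start` only ever contains dates without a time component, both approaches assign the same cohorts.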

In [None]:
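# Age distribution by cohort
# Age is taken at the start of the episode of care (difference in calendar years),
# so the result does not depend on the date the notebook is run
birth_year = pd.to_datetime(cohort_map['birthDate'], errors='coerce').dt.year
cohort_map['age'] = cohort_map['start_date'].dt.year - birth_year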
sns.histplot(data=cohort_map, x='age', bins=20, hue='cohort', multiple='stack')
plt.title('Age Distribution by Cohort')
plt.xlabel('Age')
plt.ylabel('Number of Patients')
plt.tight_layout()
plt.show()

The histogram shows the age distribution of patients across different cohorts. Each colored bar represents a cohort, and the height of the stacked bars indicates the number of patients in each age group within that cohort.

From the figure, we can observe:

- **Overall age distribution:** The majority of patients appear to be in their 20s and 30s, which aligns with a population receiving maternal care.
- **Cohort variations:** While the general age distribution is similar across cohorts, there might be slight shifts or variations in the peak age groups for different cohorts. This could suggest changes in the patient population over time or differences in how cohorts were defined or captured.
- **Cohort size:** The height of the stacked bars also gives an indication of the relative size of each cohort in terms of patient numbers. Some cohorts appear to have more patients than others.

This visualization helps us understand the demographic characteristics of the patient population within each cohort and how age is distributed over time.
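
The visual impressions above can be backed by a small numeric summary. The sketch below uses a hypothetical stand-in for `cohort_map` with invented values; it counts each patient once per cohort and reports the cohort size together with the median, minimum, and maximum age.

```python
import pandas as pd

# Hypothetical stand-in for `cohort_map`; the values are invented for illustration.
example = pd.DataFrame({
    "patient_id": ["p1", "p2", "p3", "p4", "p4"],
    "cohort": ["Cohort 1", "Cohort 1", "Cohort 2", "Cohort 2", "Cohort 2"],
    "age": [24, 31, 28, 35, 35],
})

# Count each patient once per cohort so repeated episodes do not inflate the totals.
per_patient = example.drop_duplicates(subset=["patient_id", "cohort"])

# Named aggregation: one output column per keyword.
summary = per_patient.groupby("cohort")["age"].agg(
    patients="count", median_age="median", min_age="min", max_age="max"
)
print(summary)
```

Applied to the real `cohort_map`, a table like this would make it easier to judge whether the differences between cohorts are more than visual noise.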

## 🧭 Patient–Clinic Network Visualization

In [None]:
# Build an interactive Plotly network of patients and the 10 clinics with the most patients

import plotly.graph_objects as go
import networkx as nx

# Count unique patients per clinic (a patient with several episodes is counted once)
clinic_patient_counts = (
    cohort_map.groupby('org_name')['patient_id'].nunique()
    .sort_values(ascending=False)
    .reset_index(name='patient_count')
)

# Get the top 10 clinics
top_clinics = clinic_patient_counts.head(10)['org_name'].tolist()

# Filter cohort_map to include only patients from top clinics
cohort_map_top_clinics = cohort_map[cohort_map['org_name'].isin(top_clinics)].copy()

# Create a graph
G = nx.Graph()

# Add nodes for clinics (using original names from the top 10 list for node names)
G.add_nodes_from(top_clinics, node_type='clinic')

# Add nodes for patients
unique_patients = cohort_map_top_clinics['patient_id'].unique()
G.add_nodes_from(unique_patients, node_type='patient')

# Add edges between patients and their respective clinics
for index, row in cohort_map_top_clinics.iterrows():
    G.add_edge(row['patient_id'], row['org_name'])

# Define positions for nodes
# Using a layout like spring_layout can help visualize the network structure
pos = nx.spring_layout(G, seed=42) # for reproducible layout

# Create edge trace
edge_x = []
edge_y = []
for edge in G.edges():
    x0, y0 = pos[edge[0]]
    x1, y1 = pos[edge[1]]
    edge_x.append(x0)
    edge_x.append(x1)
    edge_x.append(None) # Add None to separate lines
    edge_y.append(y0)
    edge_y.append(y1)
    edge_y.append(None)

edge_trace = go.Scatter(
    x=edge_x, y=edge_y,
    line=dict(width=0.5, color='#888'),
    hoverinfo='none',
    mode='lines')

# Create node trace
node_x = []
node_y = []
node_text = []
node_color = []
node_size = []

for node in G.nodes():
    x, y = pos[node]
    node_x.append(x)
    node_y.append(y)

    if G.nodes[node]['node_type'] == 'clinic':
        # Find the patient count for this clinic
        clinic_name = node
        patient_count = clinic_patient_counts[clinic_patient_counts['org_name'] == clinic_name]['patient_count'].iloc[0]
        node_text.append(f'Clinic: {node}<br>Patients: {patient_count}')
        node_color.append('red') # Clinics in red
        # Scale size based on patient count, adjust scaling factor as needed
        node_size.append(patient_count * 0.5) # Adjust multiplier for visual clarity
    else:
        # For patients
        patient_id = node
        patient_info = cohort_map_top_clinics[cohort_map_top_clinics['patient_id'] == patient_id].iloc[0]
        node_text.append(f'Patient ID: {patient_id}<br>Clinic: {patient_info["org_name"]}<br>Age: {patient_info["age"]}')
        node_color.append('blue') # Patients in blue
        node_size.append(5) # Smaller size for patients


node_trace = go.Scatter(
    x=node_x, y=node_y,
    mode='markers',
    hoverinfo='text',
    text=node_text,
    marker=dict(
        showscale=False,
        colorscale='YlGnBu',
        reversescale=True,
        color=node_color,
        size=node_size,
        line_width=2))

# Create figure
fig = go.Figure(data=[edge_trace, node_trace],
                layout=go.Layout(
                    title=dict(text='<br>Patient-Clinic Network (Top 10 Clinics)', font=dict(size=16)),
                    showlegend=False,
                    hovermode='closest',
                    margin=dict(b=20,l=5,r=5,t=40),
                    annotations=[ dict(
                        text="Network of patients connected to the top 10 clinics by patient count",
                        showarrow=False,
                        xref="paper", yref="paper",
                        x=0.005, y=-0.002 ) ],
                    xaxis=dict(showgrid=False, zeroline=False, showticklabels=False),
                    yaxis=dict(showgrid=False, zeroline=False, showticklabels=False))
                )

fig.show()


The network graph visualizes the connections between patients (blue nodes) and the top 10 clinics by patient count (red nodes). An edge links a patient to the clinic named in that patient's `contact_organization_display` field; it is not derived from individual encounters.

From the figure, we can observe:

- **Network structure:** The graph is bipartite: edges only connect patients to clinics. The density of connections around a clinic node reflects the number of unique patients linked to that clinic.
- **Clinic centrality:** The red clinic nodes with more edges are the clinics with the larger numbers of unique patients among the top 10.
- **Patient connections:** Each blue patient node is attached to the clinic recorded for that patient. A patient appears under more than one clinic only if the patient data lists more than one organization for the same identifier.
- **Connectivity:** Because edges come from the patient record rather than from visits, the graph shows how patients are distributed across clinics, not how they move between them.

This visualization provides insight into which clinics have the most patients and how patients are distributed across these top clinics.
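
To check these readings numerically, the sketch below counts distinct patients per clinic and lists any patients linked to more than one clinic. The `edges` list is a hypothetical stand-in for the `(patient_id, org_name)` pairs in `cohort_map_top_clinics`, and only the standard library is used.

```python
from collections import defaultdict

# Hypothetical (patient_id, clinic) pairs standing in for the graph's edges.
edges = [
    ("p1", "Clinic A"),
    ("p2", "Clinic A"),
    ("p2", "Clinic A"),  # a repeated episode for the same patient and clinic
    ("p3", "Clinic B"),
]

patients_by_clinic = defaultdict(set)
clinics_by_patient = defaultdict(set)
for patient_id, clinic in edges:
    patients_by_clinic[clinic].add(patient_id)
    clinics_by_patient[patient_id].add(clinic)

# Clinic degree: number of distinct patients attached to each clinic.
clinic_degree = {
    clinic: len(patients) for clinic, patients in patients_by_clinic.items()
}

# Patients attached to more than one clinic.
multi_clinic_patients = sorted(
    patient for patient, clinics in clinics_by_patient.items() if len(clinics) > 1
)

print(clinic_degree)
print(multi_clinic_patients)
```

An empty list of multi-clinic patients would indicate that every patient node in the graph has exactly one edge.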

## Discussion

Based on the analysis performed in this notebook, we can discuss the following key findings:

*   **Patient Demographics:** Summarize the age distribution by cohort and any notable trends or variations observed.
*   **Patient-Clinic Network:** Discuss the structure of the network, the most central clinics (those with the highest patient counts), and any insights into patient movement or engagement.

Consider the implications of these findings for maternal health care in Tanzania and how the FAIR² nature of the dataset and the use of tools like `mlcroissant` facilitated this analysis.

## Conclusion

In this notebook, we loaded, analyzed, and visualized key aspects of the MomCare Data Package using `mlcroissant`. We explored patient demographics by cohort and visualized the patient-clinic network for the top 10 clinics.

This analysis demonstrated the value of using FAIR² data and tools like `mlcroissant` for exploring complex health datasets. The insights gained from this initial exploration can serve as a foundation for further research into maternal health outcomes, care patterns, and the impact of digital health interventions in low-resource settings.

Future work could involve deeper analysis of specific patient pathways, investigating the correlation between clinic characteristics and patient outcomes, or applying machine learning models to predict key health indicators using this dataset.