# Table of Content

- [Table of Content](#table-of-content)


# 0-General
[Back to Table of Content](#table-of-content)
# Early-Life Immune Activation and House Dust Allergy in the PRINCE Cohort

## Introduction
[Back to Table of Content](#table-of-content)

Clinical observations in pediatrics suggest that early infections or immune activation may be followed by the later development of allergic diseases, including house dust mite allergy. Whether this pattern reflects a genuine etiological link, shared vulnerability factors, or reporting bias requires systematic investigation using longitudinal data.

The present analysis is based on the **PRINCE Study (Prenatal Identification of Children’s Health)**, a prospective cohort conducted at the Universitäres Perinatalzentrum des Universitätsklinikums Hamburg-Eppendorf. Families are followed from the first trimester of pregnancy through birth and into early childhood, with repeated assessments at predefined time points. Data include detailed prenatal information, perinatal characteristics, parental allergy history, and repeated measures on child infections, symptoms, and clinically diagnosed allergic conditions.

In this context, early-life immune activation is operationalized using questionnaire and clinical data on infection burden and related symptoms (for example frequency of respiratory infections, fever episodes, gastrointestinal infections) across several developmental windows. House dust allergy is captured through items on house dust mite sensitization and, where available, physician-diagnosed allergy. This notebook documents how raw PRINCE data are transformed into analysis-ready variables to examine the association between early immune activation and later house dust allergy.


## Objective
[Back to Table of Content](#table-of-content)

The primary goal of this notebook is to construct and analyze an observational dataset that allows us to test whether early-life immune activation is associated with an increased risk of house dust allergy in childhood.

More specifically, we aim to:

- derive exposure metrics for early infections and immune-related events across prenatal and postnatal time windows  
- define outcome variables for house dust allergy based on questionnaire items and physician-diagnosed allergy  
- identify and integrate key potential confounders such as child sex, parental allergy history, maternal prepartum BMI, birth characteristics (for example birth weight, gestational age), and other relevant covariates available in the PRINCE data  
- estimate associations between early immune activation and house dust allergy using appropriate regression models and sensitivity analyses

The notebook focuses on transparent variable construction and analytic decisions to facilitate later manuscript preparation and reproducibility.

## Methodological Role
[Back to Table of Content](#table-of-content)

This notebook serves as the central documentation of the data preparation and analytic strategy for the PRINCE house dust allergy project. It links the longitudinal REDCap structure of the cohort to the specific research question on early immune activation and allergy risk.

Key methodological components include:

- **Handling of longitudinal structure**  
  PRINCE data are collected at multiple prenatal and postnatal time points, for example  
  - prenatal ultrasound visits: `us1_arm_1` (first trimester), `us3_arm_1` (second trimester), `us6_arm_1` (third trimester)  
  - birth and follow-up visits: `z1_arm_1` (birth), `z2_arm_1` (6 months), `z3_arm_1` (12 months), `z4_arm_1` (24 months), `z5_arm_1` (36 months), `z6_arm_1` (48 months), `z7_arm_1` (60 months)  

  The notebook reshapes these event-specific records into analysis-ready formats, typically one row per child with derived summary measures for predefined developmental windows.

- **Exposure definition (early immune activation)**  
  Early-life infection and immune activation are operationalized using infection counts and symptom frequencies (for example `timesgrippe`, `timeslunginflammation`, `timesangina`, `timesbronchitis`, `timespseudocroup`, and, where exported, `timesfever`, `timesdiarrhea` and related items) recorded at relevant postnatal events. The export loaded in the setup section contains only the first five of these variables. Where useful, composite indices or categorized burden scores are derived.

- **Outcome definition (house dust allergy)**  
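  House dust allergy is defined using the `housedust` checkbox item, which REDCap exports as one indicator column per option (`housedust___1`, `housedust___2`, `housedust___0`, `housedust____99`), together with related items on house dust mite exposure and physician diagnosis and the item on sneezing in response to house dust or mites (`sneezemites`). The meaning of each option code is taken from the data dictionary. Priority is given to physician-diagnosed outcomes where available.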

- **Confounder selection and construction**  
  Potential confounders are identified based on prior knowledge and data availability, including:  
  - child-level characteristics (for example `gender_child`, birth weight `bc_weight`, gestational age in weeks `gestational_age_weeks`)  
  - parental allergy variables (for example `allergic`, `allergic_to`, `allergicdad`, `allergic_todad`) as proxies for atopic background  
  - maternal characteristics such as prepartum BMI (`bmi_mother_prepartum`) and other relevant prenatal factors  

  These covariates are cleaned, recoded, and where necessary collapsed into analytically meaningful categories.

- **Association modelling**  
  The notebook specifies and fits regression models to estimate the association between early immune activation and house dust allergy, adjusting for the selected confounders. Depending on data structure and model fit, this may include logistic regression for binary outcomes and additional sensitivity analyses with alternative exposure definitions. The emphasis is on transparent reporting of model specifications and diagnostics.

Overall, the notebook provides a reproducible bridge from raw REDCap exports to interpretable effect estimates for the research question at hand.
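
As a rough illustration of the reshaping and exposure steps described above, the following sketch collapses a long-format table (one row per child and event) into one row per child and sums infection counts over a developmental window. The toy data, the window definition (`z2_arm_1` and `z3_arm_1` as "first year"), the helper `summarise_window`, and the `burden_y1` name are illustrative assumptions, not the cohort's final definitions.

```python
import numpy as np
import pandas as pd

# Toy long-format data mimicking the REDCap export: one row per child and event.
# All values are made up for illustration.
long_df = pd.DataFrame({
    "princeid": [1, 1, 1, 2, 2, 2, 3, 3, 3],
    "redcap_event_name": ["z2_arm_1", "z3_arm_1", "z4_arm_1"] * 3,
    "timesbronchitis": [0, 2, 1, np.nan, 1, 0, np.nan, np.nan, 2],
    "timesangina": [1, 0, 0, 0, np.nan, 3, np.nan, np.nan, 0],
})

infection_cols = ["timesbronchitis", "timesangina"]

# Assumed window: the 6- and 12-month follow-ups make up the first year.
first_year_events = ["z2_arm_1", "z3_arm_1"]


def summarise_window(df, events, cols, label):
    """Return one row per child with infection counts summed over `events`.

    min_count=1 keeps a sum missing when a child has no observed value in
    the window, instead of silently reporting 0.
    """
    window = df[df["redcap_event_name"].isin(events)]
    per_child = window.groupby("princeid")[cols].sum(min_count=1)
    out = per_child.add_suffix(f"_{label}")
    # Composite burden: total across infection types within the window.
    out[f"burden_{label}"] = per_child.sum(axis=1, min_count=1)
    return out


wide = summarise_window(long_df, first_year_events, infection_cols, "y1")
print(wide)
```

Child 3 has no observed counts in the assumed window, so its burden stays missing rather than becoming zero. Whether partially observed windows (child 2) should be summed or set to missing is an analytic decision that this sketch does not settle.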


## Analysis steps
[Back to Table of Content](#table-of-content)

1. **Data import and basic cleaning**  
   Load REDCap exports, harmonize variable names, and restrict to relevant events and participants.

2. **Restructuring longitudinal data**  
   Reshape event-specific records (`us*`, `z*`) into analysis-ready datasets with one row per child and derived variables for prenatal and postnatal windows.

3. **Exposure construction**  
   Derive early immune activation indicators from infection and symptom counts (for example number of respiratory or gastrointestinal infections, fever episodes) and define categorized or continuous exposure metrics.

4. **Outcome and confounder coding**  
   Define house dust allergy outcomes using `housedust`, `sneezemites`, and physician-diagnosed allergy items. Clean and code confounders such as child sex, parental allergy history, maternal prepartum BMI, birth weight, and gestational age.

5. **Descriptive and diagnostic analyses**  
   Summarize distributions of exposure, outcome, and confounders, inspect missingness patterns, and explore crude associations.

6. **Multivariable association models**  
   Fit adjusted regression models to estimate the association between early immune activation and house dust allergy, report effect estimates, confidence intervals, and key diagnostics.

7. **Sensitivity analyses**  
   Explore robustness to alternative exposure definitions, confounder sets, and missing-data handling strategies.
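
The outcome coding and crude association steps above can be sketched as follows, using synthetic data only. The assumptions are stated in the comments: that `housedust___1 == 1` marks a reported house dust allergy, that three or more infections count as high burden, and that `burden_y1`, `high_burden`, `housedust_allergy`, and `crude_odds_ratio` are hypothetical names. The confidence interval uses the standard Woolf (log) method.

```python
import numpy as np
import pandas as pd

# Synthetic one-row-per-child table; values are made up for illustration.
children = pd.DataFrame({
    "burden_y1": [0, 1, 5, 4, 0, 6, 2, 3, 1, 7, 0, 5, 2],
    "housedust___1": [0, 0, 1, 1, 0, 1, 1, 0, 0, 1, 0, 0, np.nan],
})

# Assumption: option code 1 marks a reported house dust allergy.
# Children with a missing item stay missing instead of being coded 0.
raw = children["housedust___1"]
children["housedust_allergy"] = raw.eq(1).astype(float).where(raw.notna())

# Assumed cut-off: three or more infections count as high burden.
children["high_burden"] = (children["burden_y1"] >= 3).astype(int)


def crude_odds_ratio(df, exposure, outcome):
    """Unadjusted odds ratio with a 95% Woolf confidence interval.

    Rows with a missing exposure or outcome are dropped. A zero cell in
    the 2x2 table would make the estimate undefined; this sketch does
    not handle that case.
    """
    complete = df[[exposure, outcome]].dropna().astype(int)
    table = pd.crosstab(complete[exposure], complete[outcome])
    table = table.reindex(index=[0, 1], columns=[0, 1], fill_value=0)
    a, b = table.loc[1, 1], table.loc[1, 0]  # exposed: cases, non-cases
    c, d = table.loc[0, 1], table.loc[0, 0]  # unexposed: cases, non-cases
    odds_ratio = (a * d) / (b * c)
    se_log = np.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    low, high = np.exp(np.log(odds_ratio) + np.array([-1.96, 1.96]) * se_log)
    return odds_ratio, low, high


odds_ratio, ci_low, ci_high = crude_odds_ratio(
    children, "high_burden", "housedust_allergy"
)
print(f"Crude OR = {odds_ratio:.2f} (95% CI {ci_low:.2f} to {ci_high:.2f})")
```

With only twelve complete synthetic records the interval is very wide, which is the expected behaviour for sparse 2x2 tables. The adjusted models in step 6 would replace this crude estimate with a regression that includes the selected confounders.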


## Acknowledgements
[Back to Table of Content](#table-of-content)

Statistical analysis and data preparation were conducted by **Dr. Steven Ngandeu Schepanski**, who also oversaw the development of this notebook.


# Setup: paths, packages, and data import
[Back to Table of Content](#table-of-content)

In [4]:
import os
import io

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# If you want basic statistical modelling later:
# import statsmodels.api as sm
# import statsmodels.formula.api as smf

# For password-protected Excel files
# Make sure you have installed this first in your environment:
# pip install msoffcrypto-tool
import msoffcrypto

In [5]:
# Set working directory
project_root = "/Users/stevenschepanski/Documents/04_ANALYSIS/HouseDust_UKE/"
os.chdir(project_root)

print(f"Working directory set to: {os.getcwd()}")

Working directory set to: /Users/stevenschepanski/Documents/04_ANALYSIS/HouseDust_UKE


In [6]:
# Define Excel file and password
excel_path = os.path.join(
    project_root,
    "data",
    "20251125_StevenSchepanski_Allergy_Housedust_5J.xlsx"
)

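# Do not hardcode the password in the notebook. Read it from the environment
# (PRINCE_EXCEL_PASSWORD is an assumed variable name) or prompt for it.
from getpass import getpass

excel_password = os.environ.get("PRINCE_EXCEL_PASSWORD") or getpass("Excel password: ")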

In [7]:
# Decrypt the Excel file into memory
decrypted = io.BytesIO()

with open(excel_path, "rb") as f:
    office_file = msoffcrypto.OfficeFile(f)
    office_file.load_key(password=excel_password)
    office_file.decrypt(decrypted)

# Always rewind the buffer before reading
decrypted.seek(0)

0

In [8]:
# Read the sheets into pandas DataFrames
sheet_data = "PRINCEStudie-20251125StevenSche"
sheet_dict = "PRINCEStudie_DataDictionary_202"

prince_data = pd.read_excel(decrypted, sheet_name=sheet_data)

# Rewind again before reading second sheet
decrypted.seek(0)
prince_dict = pd.read_excel(decrypted, sheet_name=sheet_dict)
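
The data dictionary is what connects coded values and checkbox columns in the export to their labels. The sketch below shows one way to parse the choices column and to derive checkbox column names. It assumes the dictionary uses the standard REDCap column headers (`Variable / Field Name`, `Choices, Calculations, OR Slider Labels`), which should be checked against `prince_dict.columns`. The toy dictionary, its placeholder labels, and the helpers `parse_choices` and `checkbox_column` are hypothetical.

```python
import pandas as pd

# Toy dictionary with assumed standard REDCap headers and placeholder labels.
toy_dict = pd.DataFrame({
    "Variable / Field Name": ["gender_child", "housedust"],
    "Field Type": ["radio", "checkbox"],
    "Choices, Calculations, OR Slider Labels": [
        "1, option X | 2, option Y",
        "1, option A | 2, option B | 0, option C | -99, option D",
    ],
})


def parse_choices(raw):
    """Turn 'code, label | code, label' into a {code: label} dict."""
    choices = {}
    if pd.isna(raw):
        return choices
    for part in str(raw).split("|"):
        # Split on the first comma only, because labels may contain commas.
        code, _, label = part.strip().partition(",")
        if label:
            choices[code.strip()] = label.strip()
    return choices


def checkbox_column(field, code):
    """Export column name for one checkbox option.

    A minus sign in the code becomes an underscore, which matches names
    such as housedust____99 in the export.
    """
    return f"{field}___{str(code).replace('-', '_')}"


lookup = {
    row["Variable / Field Name"]: parse_choices(
        row["Choices, Calculations, OR Slider Labels"]
    )
    for _, row in toy_dict.iterrows()
}

housedust_columns = {
    checkbox_column("housedust", code): label
    for code, label in lookup["housedust"].items()
}
print(housedust_columns)
```
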

In [9]:
# Quick sanity checks
print("Data shape (main sheet):", prince_data.shape)
print("Data dictionary shape:", prince_dict.shape)

print("\nFirst rows of the main data:")
display(prince_data.head())

print("\nColumns in main data:")
print(prince_data.columns.tolist())

Data shape (main sheet): (7401, 31)
Data dictionary shape: (23, 18)

First rows of the main data:


Unnamed: 0,princeid,redcap_event_name,timesgrippe,timeslunginflammation,timesangina,timesbronchitis,timespseudocroup,hayfever2,housedust___1,housedust___2,...,allergic_to___4,allergic_to___5,allergic_to____99,allergicdad,allergic_todad___1,allergic_todad___2,allergic_todad___3,allergic_todad___4,allergic_todad___5,allergic_todad____99
0,65001027,us1_arm_1,,,,,,,,,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
1,65001027,us3_arm_1,,,,,,,,,...,,,1.0,,,,,,,1.0
2,65001027,us6_arm_1,,,,,,,,,...,,,1.0,,,,,,,1.0
3,65001027,z1_arm_1,,,,,,,,,...,,,1.0,,,,,,,1.0
4,65001027,z2_arm_1,,,,,,,,,...,,,1.0,,,,,,,1.0



Columns in main data:
['princeid', 'redcap_event_name', 'timesgrippe', 'timeslunginflammation', 'timesangina', 'timesbronchitis', 'timespseudocroup', 'hayfever2', 'housedust___1', 'housedust___2', 'housedust___0', 'housedust____99', 'sneezemites', 'gestational_age_weeks', 'bc_weight', 'gender_child', 'bmi_mother_prepartum', 'allergic', 'allergic_to___1', 'allergic_to___2', 'allergic_to___3', 'allergic_to___4', 'allergic_to___5', 'allergic_to____99', 'allergicdad', 'allergic_todad___1', 'allergic_todad___2', 'allergic_todad___3', 'allergic_todad___4', 'allergic_todad___5', 'allergic_todad____99']
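
The first rows are mostly empty because each variable is only recorded at some events. A quick way to see which variables are available at which event is to count non-missing values per event, sketched here on a toy frame with the same layout. The values and the helper name `coverage_by_event` are illustrative assumptions; in the notebook the function would be applied to `prince_data`.

```python
import numpy as np
import pandas as pd

# Toy frame with the same long layout as the export; values are made up.
toy = pd.DataFrame({
    "princeid": [1, 1, 1, 2, 2, 2],
    "redcap_event_name": ["us1_arm_1", "z2_arm_1", "z7_arm_1"] * 2,
    "bmi_mother_prepartum": [22.5, np.nan, np.nan, 27.1, np.nan, np.nan],
    "timesbronchitis": [np.nan, 1, 0, np.nan, np.nan, 2],
    "sneezemites": [np.nan, np.nan, 0, np.nan, np.nan, 1],
})


def coverage_by_event(df, id_col="princeid", event_col="redcap_event_name"):
    """Count non-missing values per variable (rows) and event (columns)."""
    value_cols = [c for c in df.columns if c not in (id_col, event_col)]
    return df.groupby(event_col)[value_cols].count().T


coverage = coverage_by_event(toy)
print(coverage)
```

A table like this makes it easier to decide which events feed each exposure window and at which event the outcome and the confounders should be read.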


# 1-Data Inspection and Cleaning
[Back to Table of Content](#table-of-content)