# Household composition and place of birth in Our Future Health

## Purpose
This notebook extracts participant-reported household composition and place-of-birth information from the Our Future Health (OFH) baseline questionnaire and summarises their distributions within the cohort.

## Outputs
- `outputs/intermediate/questionnaire_diag_fields_metadata.csv`  
  Metadata describing questionnaire fields related to household size, household relatedness, and place of birth.
- `outputs/raw/questionnaire_diag_fields_raw_values_query.sql`  
  SQL query used to extract raw questionnaire responses.
- In-memory summary tables reporting:
  - Percentage distribution of household size categories
  - Counts and proportions of household relatedness categories (including a derived “live alone” category)
  - Counts and percentages of place-of-birth responses

## Relationship to manuscript
Results from this notebook are used to generate **Figure 2** in the main text, which describes the sociodemographic composition of the Our Future Health cohort.

## Data and access notes
Analyses use restricted Our Future Health data accessed within the OFH Trusted Research Environment under approved study permissions. Outputs are limited to aggregated, non-disclosive summary statistics, in accordance with OFH Safe Output requirements.

## Notes
Household size is binned into prespecified categories for reporting. Participants reporting a household size of one are classified as “live alone” in household relatedness summaries. Percentages for place of birth are calculated among participants providing non-missing responses.


In [None]:
# Import packages
import dxpy
import subprocess
import numpy as np
import pandas as pd
from pyspark.sql import SparkSession

# Local imports
import phenofhy

# Optionally silence logging (uncomment the call below to turn it off)
import logging
# logging.disable(logging.CRITICAL)

### Initialize Spark

In [None]:
spark = SparkSession.builder \
    .appName("Phenotype Analysis") \
    .config("spark.sql.execution.arrow.pyspark.enabled", "true") \
    .config("spark.kryoserializer.buffer.max", "128") \
    .getOrCreate()

### Load and preprocess data

In [None]:
# Write the field metadata, generate the extraction SQL, then load the raw responses
phenofhy.load.field_list(
    fields=[
        "participant.birth_month",
        "participant.birth_year",
        "participant.registration_month",
        "participant.registration_year",
        "participant.pid",
        "questionnaire.birth_place_1_1",
        "questionnaire.housing_people_1_1",
        "questionnaire.housing_people_relate_1_m",
        # Uncomment both fields to run the "Employment status" section below
        # "questionnaire.work_status_1_m",
        # "questionnaire.work_status_2_m"
    ],
    output_file="outputs/intermediate/questionnaire_diag_fields_metadata.csv"
)

phenofhy.extract.fields(
    input_file="outputs/intermediate/questionnaire_diag_fields_metadata.csv",
    output_file="outputs/raw/questionnaire_diag_fields_raw_values_query.sql", 
    cohort_key="FULL_SAMPLE_ID", 
    sql_only=True
)

raw_questionnaire_df = phenofhy.extract.sql_to_pandas(
    "outputs/raw/questionnaire_diag_fields_raw_values_query.sql"
)

pheno_df = phenofhy.process.participant_fields(raw_questionnaire_df)

In [None]:
questionnaire_df = phenofhy.process.questionnaire_fields(pheno_df, derive='auto')

### Analysis

#### Birth place

In [None]:
# Percentage of each response among participants with a non-missing answer
questionnaire_df['questionnaire.birth_place_1_1'].value_counts() / questionnaire_df['questionnaire.birth_place_1_1'].notnull().sum() * 100

In [None]:
questionnaire_df['questionnaire.birth_place_1_1'].value_counts()
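
The percentage calculation above can be checked on toy data. This is a minimal sketch, assuming made-up response labels rather than OFH values; it shows that `value_counts(normalize=True)` gives the same result as dividing counts by the number of non-missing responses, because `value_counts` drops missing values by default.

```python
import numpy as np
import pandas as pd

# Toy responses (hypothetical labels, not OFH data); NaN marks a missing answer
birth_place = pd.Series(["England", "England", "Elsewhere", np.nan, "England"])

# value_counts drops NaN by default, so normalize=True divides by
# the number of non-missing responses only
birth_place_pct = birth_place.value_counts(normalize=True).mul(100)

# Same result using the count / non-null formulation from the cell above
manual_pct = birth_place.value_counts() / birth_place.notnull().sum() * 100

print(birth_place_pct)
```

Either form excludes the missing response from the denominator, which matches the note that place-of-birth percentages are calculated among participants providing non-missing responses.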

#### Housing relatedness

In [None]:
# Add "Live alone" as a category of housing_people_relate_1_m for one-person households
trait = "questionnaire.housing_people_relate_1_m"
live_alone_mask = questionnaire_df["questionnaire.housing_people_1_1"] == 1
questionnaire_df.loc[live_alone_mask, trait] = "Live alone"

# Compute counts and proportions per category
res = phenofhy.calculate.summary(
    questionnaire_df,
    traits=[trait],
    granularity="categorical",
)
res['categorical']
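
The "Live alone" override can be illustrated on a small toy frame. This is a sketch only: the column names `household_size` and `relatedness` are hypothetical stand-ins for the questionnaire fields, and plain pandas `value_counts` stands in for `phenofhy.calculate.summary`, whose internals are not shown in this notebook.

```python
import numpy as np
import pandas as pd

# Toy data; column names are hypothetical stand-ins for the OFH fields
toy = pd.DataFrame({
    "household_size": [1, 2, 1, 4, np.nan],
    "relatedness": [np.nan, "Partner", "Other", "Children", np.nan],
})

# A missing household size compares as False, so those rows are left untouched
live_alone_mask = toy["household_size"] == 1

# Overwrite relatedness for one-person households, whatever was recorded before
toy.loc[live_alone_mask, "relatedness"] = "Live alone"

relatedness_counts = toy["relatedness"].value_counts()
print(relatedness_counts)
```

Note that the assignment replaces any existing relatedness response for one-person households (the "Other" row above), so the derived category takes precedence over the reported one.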

#### Household size

In [None]:
# Column for convenience
col = "questionnaire.housing_people_1_1"

# Define bin edges and labels; with right=True each bin is (lower, upper]
bins = [0, 1, 2, 3, 4, 9, np.inf]   # upper edges: (4, 9] is '5-9', (9, inf] is '10+'
labels = ["1", "2", "3", "4", "5-9", "10+"]

# Create bucket column
questionnaire_df["household_bucket"] = pd.cut(
    questionnaire_df[col],
    bins=bins,
    labels=labels,
    right=True,          # 1 goes to '1', 2 to '2', ..., 10 and above to '10+'
    include_lowest=True  # also places a value of 0 in the '1' bucket
)

# Compute percentages
pct_df = (
    questionnaire_df["household_bucket"]
    .value_counts(normalize=True, dropna=True)
    .sort_index()
    .mul(100)
    .round(2)
    .rename("percentage")
    .reset_index()
)

print(pct_df)
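
Because `pd.cut` with `right=True` closes each interval on its upper edge, it is worth checking how boundary values are assigned. The sketch below uses toy household sizes and edges chosen so that 9 falls in "5-9" and 10 falls in "10+"; the values are illustrative, not OFH data.

```python
import numpy as np
import pandas as pd

# Toy household sizes, including the boundary values 9 and 10 and a missing value
sizes = pd.Series([1, 2, 4, 5, 9, 10, 12, np.nan])

# Right-closed intervals: (0, 1], (1, 2], (2, 3], (3, 4], (4, 9], (9, inf]
bins = [0, 1, 2, 3, 4, 9, np.inf]
labels = ["1", "2", "3", "4", "5-9", "10+"]

buckets = pd.cut(sizes, bins=bins, labels=labels, right=True)

# Side-by-side view of each size and the bucket it was assigned to
check = pd.DataFrame({"size": sizes, "bucket": buckets})
print(check)
```

With an upper edge of 10 instead of 9, a household of exactly 10 would land in the "5-9" bucket, so the edge has to sit at the last value the label is meant to include.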

#### Employment status

In [None]:
# Map response codes from questionnaire versions 1 and 2 to shared labels
map_work_status = {
    -3: "Prefer not to answer",
    -7: "None of the above",
    -5: "Unemployed",     # from version 2
     5: "Unemployed",     # from version 1
     1: "In paid employment or self-employed",
     2: "Retired",
     3: "Looking after home/family",
     4: "Unable to work",
     6: "Doing unpaid/voluntary work",
     7: "Student",
     8: "On paid leave",
     9: "Unpaid carer",
}

def map_cell(x):
    """Map one cell (scalar code or multi-response collection) to a list of unique labels."""
    # Missing values map to an empty list
    if x is None or (isinstance(x, float) and np.isnan(x)):
        return []

    # if multi-response array/list
    if isinstance(x, (list, tuple, set, np.ndarray)):
        vals = list(x)
    else:
        vals = [x]

    out = []
    for v in vals:
        # convert v to int if possible
        try:
            v_key = int(v)
        except Exception:
            v_key = v

        out.append(map_work_status.get(v_key, str(v)))

    # dedupe while preserving order
    return list(dict.fromkeys(out))


v1 = "questionnaire.work_status_1_m"
v2 = "questionnaire.work_status_2_m"

df = questionnaire_df.copy()

df["_v1_list"] = df[v1].apply(map_cell)
df["_v2_list"] = df[v2].apply(map_cell)

# Combined multi-response list (union of both)
df["work_status_combined"] = df.apply(
    lambda r: list(dict.fromkeys(r["_v1_list"] + r["_v2_list"])),
    axis=1
)

In [None]:
res = phenofhy.calculate.summary(
    df,
    traits=["work_status_combined"],
    granularity="categorical"
)

res["categorical"]
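
For a multi-response column such as `work_status_combined`, each participant can contribute to several categories. One way to tabulate this in plain pandas is sketched below on toy data; it is an illustrative alternative and not necessarily how `phenofhy.calculate.summary` handles list-valued columns. The `pid` values and responses are made up.

```python
import pandas as pd

# Toy multi-response data (hypothetical participants, not OFH data)
toy = pd.DataFrame({
    "pid": [1, 2, 3, 4],
    "work_status_combined": [
        ["Retired"],
        ["Student", "In paid employment or self-employed"],
        [],  # no response recorded
        ["Retired", "Unpaid carer"],
    ],
})

# One row per selected option; empty lists become NaN and are dropped
exploded = toy["work_status_combined"].explode().dropna()
counts = exploded.value_counts()

# Denominator: participants who selected at least one option
answered = toy["work_status_combined"].str.len().gt(0).sum()
pct_of_respondents = counts.div(answered).mul(100).round(2)

print(pct_of_respondents)
```

Because participants can select more than one option, these percentages are per respondent and can sum to more than 100.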