# Prevalence of self-reported health conditions in Our Future Health and comparator populations

## Purpose
This notebook extracts self-reported health condition data from the Our Future Health (OFH) baseline questionnaire and computes prevalence estimates for common conditions. These estimates are prepared for comparison with corresponding measures from the UK Biobank, the Health Survey for England 2021, and the Global Burden of Disease (GBD) 2021 study.

## Outputs
- Intermediate metadata and SQL query files used to extract self-reported diagnosis fields from the OFH baseline questionnaire.
- Aggregated prevalence counts and proportions for common self-reported health conditions in OFH.
- In-memory summary tables structured for cross-cohort comparison with:
  - UK Biobank self-reported health conditions
  - Health Survey for England 2021 estimates
  - UK-level GBD 2021 prevalence estimates

## Relationship to manuscript
Outputs from this notebook are used to generate **Figure 3** in the main text and to populate:
- **Supplementary Table 5** (*Most common self-reported health conditions in Our Future Health, the UK Biobank, and Health Survey England 2021*)
- **Supplementary Table 6** (*Most common self-reported conditions in Our Future Health compared with UK Biobank*)
- **Supplementary Table 7** (*Most common self-reported conditions in Our Future Health compared with UK 2021 Global Burden of Disease estimates*)

## Data and access notes
Analyses use restricted Our Future Health data accessed within the OFH Trusted Research Environment under approved study permissions. Outputs are limited to aggregated, non-disclosive summary statistics and are intended for descriptive cross-cohort comparison rather than inference, in accordance with OFH Safe Output requirements.
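
To illustrate what "aggregated, non-disclosive" can mean in practice, the following is a minimal sketch of small-count suppression applied to a prevalence table before export. The threshold `MIN_CELL_COUNT`, the helper `suppress_small_counts`, and the toy values are all hypothetical and chosen only for illustration; the applicable OFH Safe Output rules determine the real threshold and any rounding requirements.

```python
import pandas as pd

# Hypothetical threshold for illustration only; the applicable output-checking
# rules define the real value.
MIN_CELL_COUNT = 10


def suppress_small_counts(table, count_col="count", value_cols=("prevalence_all",)):
    """Return a copy with small counts and their derived values set to NaN."""
    out = table.copy()
    cols = [count_col, *value_cols]
    # Cast to float so that suppressed cells can hold NaN.
    out[cols] = out[cols].astype(float)
    out.loc[out[count_col] < MIN_CELL_COUNT, cols] = float("nan")
    return out


# Toy table with made-up values (not OFH data).
toy = pd.DataFrame({
    "trait": ["condition_a", "condition_b"],
    "count": [250, 4],
    "prevalence_all": [0.25, 0.004],
})
safe = suppress_small_counts(toy)
```

The helper works on a copy, so the unsuppressed table stays available inside the Trusted Research Environment while only `safe` would be considered for export.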

## Notes
Condition prevalence estimates are based on lifetime self-reported diagnoses collected at baseline.

## Set up environment

In [None]:
# Import packages
import dxpy
import shlex
import subprocess
import numpy as np
import pandas as pd
import pyspark
from pyspark.sql import SparkSession

# Import phenofhy
import phenofhy

### Initialize Spark

In [None]:
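# Start (or reuse) a Spark session; Arrow speeds up Spark-to-pandas conversion.
spark = (
    SparkSession.builder
    .appName("Phenotype Analysis")
    .config("spark.sql.execution.arrow.pyspark.enabled", "true")
    .config("spark.kryoserializer.buffer.max", "128m")  # 128 MiB, unit made explicit
    .getOrCreate()
)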

### Load files

In [None]:
files = [
    "table_s6_questionnaire_diag_fields.csv",
    "table_s7_questionnaire_diag_fields.csv",
]

In [None]:
phenofhy.utils.download_files([
    (str(phenofhy.utils.find_latest_dx_file_id(f)), f"inputs/{f}")
    for f in files
])

In [None]:
# Read each downloaded CSV into a DataFrame, keyed by file name without the extension
pheno_dfs = {f.replace('.csv', ''): pd.read_csv(f'./inputs/{f}') for f in files}

### Table S5

In [None]:
phenofhy.load.field_list(
    fields=[
        "questionnaire.diag_1_m",
        "questionnaire.diag_2_m",
        "participant.birth_month",
        "participant.birth_year",
        "participant.registration_month",
        "participant.registration_year",
        "participant.demog_sex_2_1",
        "participant.demog_sex_1_1",
    ],
    output_file="outputs/intermediate/questionnaire_diag_fields_metadata.csv"
)

phenofhy.extract.fields(
    input_file="outputs/intermediate/questionnaire_diag_fields_metadata.csv",
    output_file="outputs/raw/questionnaire_diag_fields_raw_values_query.sql",
    cohort_key="FULL_SAMPLE_ID",
    sql_only=True
)

raw_diag_df = phenofhy.extract.sql_to_pandas(
    "outputs/raw/questionnaire_diag_fields_raw_values_query.sql"
)

diag_df = phenofhy.process.participant_fields(raw_diag_df)
diag_df = phenofhy.process.questionnaire_fields(diag_df, derive=False)

# Keyword and list form match the other prevalence calls in this notebook
prev = phenofhy.calculate.prevalence(
    df=diag_df,
    denominators=['nonmissing'],
)

### Table S6

In [None]:
phenofhy.load.field_list(
    fields='inputs/table_s6_questionnaire_diag_fields.csv',
    output_file="outputs/intermediate/table_s6_questionnaire_diag_fields_metadata.csv"
)

phenofhy.extract.fields(
    input_file="outputs/intermediate/table_s6_questionnaire_diag_fields_metadata.csv",
    output_file="outputs/raw/table_s6_questionnaire_diag_fields_raw_values_query.sql",
    cohort_key="FULL_SAMPLE_ID",
    sql_only=True
)

raw_s6_diag_df = phenofhy.extract.sql_to_pandas(
    "outputs/raw/table_s6_questionnaire_diag_fields_raw_values_query.sql")

parc_s6_df = phenofhy.process.participant_fields(raw_s6_diag_df)
diag_s6_df = phenofhy.process.questionnaire_fields(parc_s6_df, derive='auto')
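
The age filters applied later rely on `derived.age_at_registration`. How phenofhy derives this column is not shown in this notebook; the following is a minimal sketch of one plausible derivation from the month and year fields, assuming these are stored as numeric values and that age is counted in completed years at month-level precision. The helper name `age_at_registration` and the toy values are hypothetical.

```python
import pandas as pd


def age_at_registration(df):
    """Approximate age in completed years from month/year fields (hypothetical helper)."""
    months_elapsed = (
        (df["participant.registration_year"] - df["participant.birth_year"]) * 12
        + (df["participant.registration_month"] - df["participant.birth_month"])
    )
    # Floor division converts elapsed months to completed years.
    return months_elapsed // 12


# Toy records with made-up values (not OFH data).
example = pd.DataFrame({
    "participant.birth_year": [1980, 1955, 2000],
    "participant.birth_month": [6, 1, 12],
    "participant.registration_year": [2023, 2024, 2023],
    "participant.registration_month": [5, 1, 12],
})
example["derived.age_at_registration"] = age_at_registration(example)
```

Without a day of birth, anyone registering in their birth month is treated as having already had their birthday, so ages near a band boundary (for example 39 versus 40) are approximate under this assumption.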

##### Full sample

In [None]:
prev_s6_all = phenofhy.calculate.prevalence(
    df=diag_s6_df,
    denominators=['all', 'nonmissing'],
)
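
The call above requests two denominators, `'all'` and `'nonmissing'`. Their exact semantics inside phenofhy are not documented in this notebook; one plausible reading, sketched here in plain pandas, is that `'all'` divides by every participant in the frame while `'nonmissing'` divides only by participants with a recorded answer for that condition. The function `prevalence_sketch`, the 0/1 coding of conditions, and the `prevalence_nonmissing` column name are assumptions made for illustration.

```python
import pandas as pd


def prevalence_sketch(df, traits):
    """Count cases per trait and divide by two alternative denominators."""
    rows = []
    for trait in traits:
        col = df[trait]
        count = int((col == 1).sum())          # participants reporting the condition
        n_all = len(col)                       # everyone in the frame
        n_nonmissing = int(col.notna().sum())  # only those with a recorded answer
        rows.append({
            "trait": trait,
            "count": count,
            "prevalence_all": count / n_all if n_all else float("nan"),
            "prevalence_nonmissing": (
                count / n_nonmissing if n_nonmissing else float("nan")
            ),
        })
    return pd.DataFrame(rows)


# Toy data with made-up values (not OFH data); None marks a missing answer.
toy = pd.DataFrame({
    "asthma": [1, 0, None, 1],
    "diabetes": [0, 0, 0, 1],
})
result = prevalence_sketch(toy, ["asthma", "diabetes"])
```

The two estimates coincide when no answers are missing (as for `diabetes` here) and diverge as missingness grows, which is why reporting both can be informative.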

###### Remove extra columns when exporting

- Inflammatory bowel disease

In [None]:
prev_s6_all.sort_values(by='prevalence_all', ascending=False).head()

##### Aged 40-69

In [None]:
# Restrict to participants aged 40-69 at registration (40 <= age < 70)
age_s6 = diag_s6_df['derived.age_at_registration']
prev_s6_40_69 = phenofhy.calculate.trait_prevalence_using_grouped(
    df=diag_s6_df.loc[(age_s6 >= 40) & (age_s6 < 70)],
    denominators=['all', 'nonmissing'],
)

In [None]:
prev_s6_40_69.loc[prev_s6_40_69['trait'].isin(
    pheno_dfs['table_s6_questionnaire_diag_fields']['trait'])].sort_values(
    by='trait', ascending=True)[['trait','count','prevalence_all']]

### Table S7

In [None]:
phenofhy.load.field_list(
    fields='inputs/table_s7_questionnaire_diag_fields.csv',
    output_file="outputs/intermediate/table_s7_questionnaire_diag_fields_metadata.csv"
)

phenofhy.extract.fields(
    input_file="outputs/intermediate/table_s7_questionnaire_diag_fields_metadata.csv",
    output_file="outputs/raw/table_s7_questionnaire_diag_fields_raw_values_query.sql",
    cohort_key="FULL_SAMPLE_ID",
    sql_only=True
)

raw_s7_diag_df = phenofhy.extract.sql_to_pandas(
    "outputs/raw/table_s7_questionnaire_diag_fields_raw_values_query.sql")

parc_s7_df = phenofhy.process.participant_fields(raw_s7_diag_df)
diag_s7_df = phenofhy.process.questionnaire_fields(parc_s7_df, derive='auto')

In [None]:
# Restrict to participants aged 20 or over at registration
prev_s7_plus = phenofhy.calculate.prevalence(
    df=diag_s7_df.loc[diag_s7_df['derived.age_at_registration'] >= 20],
    denominators=['all', 'nonmissing'],
)

In [None]:
prev_s7_plus.loc[prev_s7_plus['trait'].isin(
    pheno_dfs['table_s7_questionnaire_diag_fields']['trait'])].sort_values(
    by='trait', ascending=True)[['trait','count','prevalence_all']]
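
For the cross-cohort tables, the OFH estimates need to be placed alongside comparator values for the same conditions. The following is a minimal sketch of that step using a pandas merge on `trait`. The comparator table, its `prevalence_comparator` column, and all numbers are hypothetical; the real comparator values come from the published sources named in the Purpose section.

```python
import pandas as pd

# Toy OFH-style estimates with made-up values (not OFH data).
ofh = pd.DataFrame({
    "trait": ["asthma", "diabetes", "migraine"],
    "prevalence_all": [0.12, 0.06, 0.10],
})

# Hypothetical comparator table with made-up values.
comparator = pd.DataFrame({
    "trait": ["asthma", "diabetes", "eczema"],
    "prevalence_comparator": [0.10, 0.08, 0.05],
})

# An outer join keeps conditions found in only one source, and the indicator
# column records where each row came from; validate guards against duplicates.
comparison = ofh.merge(
    comparator, on="trait", how="outer", indicator=True, validate="one_to_one"
)
comparison["abs_diff"] = (
    comparison["prevalence_all"] - comparison["prevalence_comparator"]
)
comparison["ratio"] = (
    comparison["prevalence_all"] / comparison["prevalence_comparator"]
)

# Conditions that could not be matched across sources.
unmatched = comparison.loc[comparison["_merge"] != "both", "trait"].tolist()
```

Listing `unmatched` explicitly makes label mismatches between cohorts visible rather than silently dropping rows, which an inner join would do.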

### Upload results

In [None]:
# Upload an entire folder, including its subfolders
phenofhy.utils.upload_folders([
    ("phenofhy/", "applets/phenofhy"),
])