# Extraction and prevalence calculation of self-reported health conditions (OFH)

## Purpose
This notebook extracts selected self-reported health condition fields from the Our Future Health (OFH) baseline questionnaire, processes participant-level responses, and computes prevalence estimates for the most common conditions in the cohort.

## Outputs
- `outputs/intermediate/questionnaire_diag_fields_metadata.csv`  
  Metadata describing questionnaire diagnosis fields included in the analysis.
- `outputs/raw/questionnaire_diag_fields_raw_values_query.sql`  
  SQL query used to extract raw questionnaire responses.
- `outputs/intermediate/questionnaire_diag_sub_fields_metadata.csv`  
  Metadata for sub-fields associated with selected questionnaire diagnoses.
- `outputs/raw/questionnaire_diag_sub_fields_raw_values_query.sql`  
  SQL query used to extract raw sub-field responses.
- In-memory pandas DataFrames containing processed questionnaire responses and prevalence estimates used for downstream tabulation.

## Relationship to manuscript
Results from this notebook are used to populate **Supplementary Table 5** (*Most common self-reported health conditions in Our Future Health, the UK Biobank, and Health Survey England 2021*).

## Data and access notes
Analyses use restricted Our Future Health data accessed within the OFH Trusted Research Environment under approved study permissions. Outputs are limited to aggregated, non-disclosive summary statistics in accordance with OFH Safe Output policies.

## Notes
Prevalence estimates are calculated using participant-level questionnaire data without derivation of composite phenotypes. Only fields explicitly listed in the input metadata files are included.

## Set up environment

In [None]:
# Import packages
import dxpy
import shlex
import subprocess
import numpy as np
import pandas as pd
import pyspark
from pyspark.sql import SparkSession

# Import phenofhy
import phenofhy

### Initialize Spark

In [None]:
spark = SparkSession.builder \
    .appName("Phenotype Analysis") \
    .config("spark.sql.execution.arrow.pyspark.enabled", "true") \
    .config("spark.kryoserializer.buffer.max", "128") \
    .getOrCreate()

### Load data

In [None]:
files = [
    "table_s5_questionnaire_diag_fields.csv",
    "table_s5_other_codes.csv",
]

phenofhy.utils.download_files([
    (str(phenofhy.utils.find_latest_dx_file_id(f)), f"inputs/{f}")
    for f in files
])

pheno_df = pd.read_csv(f'./inputs/{files[0]}')
codes_df = pd.read_csv(f'./inputs/{files[1]}')

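# Build "entity.coding_name" identifiers.
# The first 9 rows of the input file are the main diagnosis fields;
# the remaining rows are their associated sub-fields.
traits = (pheno_df.iloc[:9, :]["entity"] + "." + pheno_df.iloc[:9, :]["coding_name"]).tolist()
sub_traits = (pheno_df.iloc[9:, :]["entity"] + "." + pheno_df.iloc[9:, :]["coding_name"]).tolist()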

metadata_dfs = phenofhy.load.metadata()

phenofhy.load.field_list(
    fields=traits,
    output_file="outputs/intermediate/questionnaire_diag_fields_metadata.csv",
)

phenofhy.extract.fields(
    input_file="outputs/intermediate/questionnaire_diag_fields_metadata.csv",
    output_file="outputs/raw/questionnaire_diag_fields_raw_values_query.sql", 
    cohort_key="FULL_SAMPLE_ID", 
    sql_only=True
)

raw_questionnaire_df = phenofhy.extract.sql_to_pandas(
    "outputs/raw/questionnaire_diag_fields_raw_values_query.sql"
)

questionnaire_df = phenofhy.process.participant_fields(raw_questionnaire_df)
questionnaire_df = phenofhy.process.questionnaire_fields(questionnaire_df, derive=False)

phenofhy.load.field_list(
    fields=sub_traits,
    output_file="outputs/intermediate/questionnaire_diag_sub_fields_metadata.csv",
)

phenofhy.extract.fields(
    input_file="outputs/intermediate/questionnaire_diag_sub_fields_metadata.csv",
    output_file="outputs/raw/questionnaire_diag_sub_fields_raw_values_query.sql", 
    cohort_key="FULL_SAMPLE_ID", 
    sql_only=True
)

raw_sub_questionnaire_df = phenofhy.extract.sql_to_pandas(
    "outputs/raw/questionnaire_diag_sub_fields_raw_values_query.sql"
)

sub_questionnaire_df = phenofhy.process.participant_fields(raw_sub_questionnaire_df)
sub_questionnaire_df = phenofhy.process.questionnaire_fields(sub_questionnaire_df, derive=False)
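
The field identifiers passed to `phenofhy.load.field_list` above are built by joining the `entity` and `coding_name` columns with a dot, with the leading rows of the input file treated as main fields and the rest as sub-fields. A minimal sketch of that step on a toy table follows; the entity and field names, and the helper `build_field_names`, are hypothetical and are not OFH field names or part of phenofhy.

```python
import pandas as pd

# Hypothetical stand-in for table_s5_questionnaire_diag_fields.csv
toy_fields = pd.DataFrame({
    "entity": ["questionnaire"] * 4,
    "coding_name": ["diag_a", "diag_b", "diag_a_sub", "diag_b_sub"],
})

n_main = 2  # the notebook uses 9; 2 keeps the toy example small


def build_field_names(df):
    """Join entity and coding_name into 'entity.coding_name' strings."""
    return (df["entity"] + "." + df["coding_name"]).tolist()


toy_traits = build_field_names(toy_fields.iloc[:n_main])
toy_sub_traits = build_field_names(toy_fields.iloc[n_main:])
```

Splitting by row position only works while the input file keeps main fields first, so the row order of the CSV matters.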

#### Compute prevalence for the top 10

In [None]:
questionnaire_prev = phenofhy.calculate.prevalence(
    df=questionnaire_df,
    codings=metadata_dfs["codings"],
    traits=traits
)

questionnaire_prev.loc[
    questionnaire_prev['coding_name'].isin(traits[-2:]),
    ['meaning', 'count', 'prevalence'],
]
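
The internals of `phenofhy.calculate.prevalence` are not shown in this notebook, so the following is only an illustrative sketch of one way a per-response prevalence table with `meaning`, `count`, `denominator`, and `prevalence` columns could be computed. The toy data, the helper `toy_prevalence`, and the choice of denominator (participants with a non-missing answer) are all assumptions and may differ from what phenofhy does.

```python
import pandas as pd

# Hypothetical participant-level responses; None marks a missing answer
toy_responses = pd.DataFrame({
    "pid": [1, 2, 3, 4, 5],
    "diag_a": ["Yes", "No", "Yes", None, "No"],
})


def toy_prevalence(df, column):
    """Count each response and divide by the number of non-missing answers."""
    answered = df[column].dropna()
    counts = (
        answered.value_counts()
        .rename_axis("meaning")
        .reset_index(name="count")
    )
    # Assumption: the denominator is the number of participants who answered
    counts["denominator"] = len(answered)
    counts["prevalence"] = counts["count"] / counts["denominator"]
    return counts


toy_prev = toy_prevalence(toy_responses, "diag_a")
```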

#### Compute prevalence of "Other"/"None of the above" responses per diagnosis category for the top 10

In [None]:
sub_questionnaire_prev = phenofhy.calculate.prevalence(
    df=sub_questionnaire_df,
    codings=metadata_dfs["codings"],
    traits=sub_traits
)

In [None]:
# 1) select rows whose meaning contains "none of the above", "other", or "another"
#    as whole words (case-insensitive)
pattern = r'\bnone of the above\b|\bother\b|\banother\b'
mask = sub_questionnaire_prev['meaning'].astype(str).str.contains(pattern, case=False, regex=True)

# 2) group by trait (and coding_name) and sum counts; take denominator (assumed constant per group)
summary = (
    sub_questionnaire_prev.loc[mask]
    .groupby(['trait', 'coding_name'], as_index=False)
    .agg(count=('count', 'sum'),
         denominator=('denominator', 'first'))
)

# 3) recompute prevalence from summed counts
summary['prevalence'] = summary['count'] / summary['denominator']

# result
summary[['trait', 'count', 'prevalence']]
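
To make the filtering and aggregation above easier to check, here is the same logic applied to a small made-up prevalence table. The traits, labels, and counts are hypothetical; only the column names and the regular expression are taken from the cell above.

```python
import pandas as pd

# Hypothetical prevalence table with the columns the cell above relies on
toy_sub_prev = pd.DataFrame({
    "trait": ["heart", "heart", "heart", "cancer", "cancer"],
    "coding_name": ["heart_sub"] * 3 + ["cancer_sub"] * 2,
    "meaning": [
        "Heart attack",
        "Other heart condition",
        "None of the above",
        "Another type of cancer",
        "Breast cancer",
    ],
    "count": [10, 4, 6, 3, 7],
    "denominator": [100, 100, 100, 50, 50],
})

# Whole-word match, case-insensitive; "another" needs its own alternative
# because \bother\b does not match inside "another"
pattern = r"\bnone of the above\b|\bother\b|\banother\b"
toy_mask = toy_sub_prev["meaning"].str.contains(pattern, case=False, regex=True)

# Sum the matching counts per trait; the denominator is constant within a trait
toy_summary = (
    toy_sub_prev.loc[toy_mask]
    .groupby(["trait", "coding_name"], as_index=False)
    .agg(count=("count", "sum"), denominator=("denominator", "first"))
)
toy_summary["prevalence"] = toy_summary["count"] / toy_summary["denominator"]
```

Summing counts assumes a participant cannot select more than one of the matching responses within a trait; if they can, the summed count overstates the number of distinct participants.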