# Correlation structure of medication use categories in Our Future Health

## Purpose
This notebook derives summary indicators of medication use in the Our Future Health (OFH) cohort and computes pairwise correlations among medication-domain flags and aggregated medication usage–pattern variables.

## Outputs
- Derived medication-domain indicators representing use within specific pharmacological or physiological categories.
- Aggregated medication usage–pattern variables summarising medication burden and diversity of use.
- Pairwise correlation matrices computed for:
  - All OFH participants
  - Participants stratified by age group
- In-memory correlation tables structured for visualisation and tabulation.

## Relationship to manuscript
Outputs from this notebook are used to populate **Supplementary Table S12** (*Pairwise correlations (φ, Pearson, and point–biserial) among medication use categories and summary measures for all Our Future Health participants*).

## Data and access notes
Analyses use restricted Our Future Health data accessed within the OFH Trusted Research Environment under approved study permissions. All outputs are aggregated, non-disclosive summary statistics and comply with OFH Safe Output requirements.

## Notes
Medication-domain variables are binary indicators of regular use within defined categories, while summary measures capture overall medication burden and cross-system use. All coefficients are computed as Pearson’s correlation; the interpretation depends on the variable types. For two binary variables it equals the φ coefficient, for a binary and a continuous variable it equals the point–biserial correlation, and for two continuous variables it is the ordinary Pearson coefficient.
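
The relationship between Pearson’s correlation and the φ and point–biserial coefficients can be checked numerically. The sketch below uses simulated data only; the helpers `pearson`, `phi_from_table` and `point_biserial` are hypothetical illustrations and are not part of `phenofhy`.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

# Simulated data (not OFH data): two binary flags and one continuous measure
a = rng.integers(0, 2, size=n)
b = np.where(rng.random(n) < 0.3, 1 - a, a)  # copy of `a` with roughly 30% of values flipped
x = 2.0 * a + rng.normal(size=n)

def pearson(u, v):
    """Pearson correlation between two 1-D arrays."""
    return np.corrcoef(u, v)[0, 1]

def phi_from_table(u, v):
    """Phi coefficient computed from the 2x2 contingency table of two binary arrays."""
    n11 = float(np.sum((u == 1) & (v == 1)))
    n10 = float(np.sum((u == 1) & (v == 0)))
    n01 = float(np.sum((u == 0) & (v == 1)))
    n00 = float(np.sum((u == 0) & (v == 0)))
    denom = np.sqrt((n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00))
    return (n11 * n00 - n10 * n01) / denom

def point_biserial(flag, values):
    """Point-biserial correlation from group means and the population standard deviation."""
    m1 = values[flag == 1].mean()
    m0 = values[flag == 0].mean()
    p = flag.mean()
    return (m1 - m0) / values.std() * np.sqrt(p * (1 - p))

# Both comparisons should hold up to floating-point error
phi_matches = np.isclose(pearson(a, b), phi_from_table(a, b))
pb_matches = np.isclose(pearson(a, x), point_biserial(a, x))
print(phi_matches, pb_matches)
```

Because the three coefficients coincide with Pearson’s correlation on the appropriate variable types, a single Pearson matrix can cover binary and continuous variables together.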

## Setup environment

In [None]:
# Import packages
from pyspark.sql import SparkSession
import pandas as pd
import subprocess
import dxpy
import numpy as np

# Local imports
import phenofhy.calculate as calc
import phenofhy

### Initialize Spark cluster

In [None]:
spark = SparkSession.builder \
    .appName("Phenotype Analysis") \
    .config("spark.sql.execution.arrow.pyspark.enabled", "true") \
    .config("spark.kryoserializer.buffer.max", "128m") \
    .getOrCreate()

## Table S12

### Extract data

In [None]:
phenofhy.load.field_list(
    fields=[
        "participant.birth_month",
        "participant.birth_year",
        "participant.registration_month",
        "participant.registration_year",
        "participant.demog_sex_2_1",
        "participant.demog_sex_1_1",
        "participant.pid",
        "questionnaire.demog_weight_1_1",
        "questionnaire.medicat_1_m",
        "questionnaire.diag_2_m"
    ],
    output_file="outputs/intermediate/meds_diag_fields_metadata.csv"
)

phenofhy.extract.fields(
    input_file="outputs/intermediate/meds_diag_fields_metadata.csv",
    output_file="outputs/raw/meds_fields_raw_values_query.sql",
    cohort_key="FULL_SAMPLE_ID",
    sql_only=True
)

raw_diag_df = phenofhy.extract.sql_to_pandas(
    "outputs/raw/meds_fields_raw_values_query.sql"
)

p_df = phenofhy.process.participant_fields(raw_diag_df)
meds_df = phenofhy.process.questionnaire_fields(p_df, derive='auto')

### Analysis

In [None]:
# Derive medication usage-pattern summaries
res, summary = phenofhy.calculate.medication_summary(meds_df, return_summary=True)

In [None]:
# Select the variables that go into the correlation matrix (computed with phi_corr in a later cell)
# all raw domain flags
base_meds = [c for c in res.columns if isinstance(c, str) and c.startswith("derived.medicates_")]

# grouped system-level flags  
grouped = [
    "derived.cardiometabolic_use",
    "derived.mental_pain_use",
    "derived.immune_inflam_use",
    "derived.supplemental_use",
]

# core summary variables  
core_summaries = [
    "derived.num_meds_domains",       # continuous burden
    "derived.multi_med_system_use",   # >=2 systems used (binary)
    "derived.polypharmacy_flag",
    "derived.any_meds_flag",
    "derived.prop_systems_used"
]
# build final lists but only keep columns that actually exist in `res`
vars_all = [v for v in (base_meds + grouped + core_summaries) if v in res.columns]

# variables to use in the main correlation panels (publication set)
# (prefer grouped system flags + the core summaries; exclude raw base_meds)
vars_pub = [v for v in (grouped + core_summaries) if v in res.columns]

# quick sanity prints
print("Keeping (all available):", vars_all)
print("Publication matrix variables:", vars_pub)

In [None]:
# compute full matrix once (phi_corr returns a DataFrame)
phi_full = calc.phi_corr(res, vars_for_heatmap=vars_all)

In [None]:
# slice publication panel
phi_pub = phi_full.loc[vars_pub, vars_pub]
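
The pattern above computes the full matrix once and then slices the publication subset by label. The implementation of `calc.phi_corr` is not shown in this notebook, so the sketch below substitutes pandas' `DataFrame.corr` as a stand-in and uses toy data with hypothetical column names. It only illustrates that slicing the full pairwise matrix gives the same values as computing the correlation on the subset directly.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 400

# Toy flags with hypothetical column names (not the derived OFH variables)
toy = pd.DataFrame({
    "flag_a": rng.integers(0, 2, size=n),
    "flag_b": rng.integers(0, 2, size=n),
    "flag_c": rng.integers(0, 2, size=n),
})
toy["n_flags"] = toy[["flag_a", "flag_b", "flag_c"]].sum(axis=1)

all_vars = list(toy.columns)
pub_vars = ["flag_a", "n_flags"]

# Full pairwise Pearson matrix (stand-in for calc.phi_corr)
full = toy[all_vars].astype(float).corr(method="pearson")

# Label-based slice of the publication subset
pub = full.loc[pub_vars, pub_vars]

# Same subset computed directly, for comparison
direct = toy[pub_vars].astype(float).corr(method="pearson")
print(np.allclose(pub.values, direct.values))
```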

#### Display

In [None]:
phi = phi_full.copy()

# build mask for the upper triangle (including diagonal)
mask = np.triu(np.ones(phi.shape, dtype=bool), k=0)

# replace upper-triangle cells with "-"
phi_masked = phi.mask(mask, other="-")

# now phi_masked is a DataFrame showing only the lower triangle (below the diagonal)
phi_masked
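
Replacing cells with the string "-" turns the affected columns into object dtype, which is fine for display but awkward for further numeric work. As a possible variation (an assumption, not the approach used in this notebook), the upper triangle can be masked with NaN so that the remaining values stay numeric. The matrix below is a made-up 3x3 example.

```python
import numpy as np
import pandas as pd

# Made-up symmetric correlation matrix for illustration
corr = pd.DataFrame(
    [[1.0, 0.2, 0.5],
     [0.2, 1.0, 0.1],
     [0.5, 0.1, 1.0]],
    index=["a", "b", "c"],
    columns=["a", "b", "c"],
)

# True on and above the diagonal (k=0 includes the diagonal itself)
upper = np.triu(np.ones(corr.shape, dtype=bool), k=0)

# Masked cells become NaN, so the lower triangle remains float-typed
lower_only = corr.mask(upper)
print(lower_only)
```

When exporting, `to_csv` accepts an `na_rep` argument, so passing `na_rep="-"` would write dashes to the file while keeping the in-memory table numeric.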

In [None]:
phi_masked.to_csv('phi_masked.csv')
phi_full.to_csv('phi_full.csv')

### By age group

In [None]:
res_age = res.loc[res['derived.age_group']=='60+']  # change the label to select a different age group

In [None]:
# all raw domain flags  
base_meds = [c for c in res_age.columns if isinstance(c, str) and c.startswith("derived.medicates_")]

# grouped system-level flags  
grouped = [
    "derived.cardiometabolic_use",
    "derived.mental_pain_use",
    "derived.immune_inflam_use",
    "derived.supplemental_use",
]

# core summary variables  
core_summaries = [
    "derived.num_meds_domains",       # continuous burden
    "derived.multi_med_system_use",   # >=2 systems used (binary)
    "derived.polypharmacy_flag",
    "derived.any_meds_flag",
    "derived.prop_systems_used"
]
# build final lists but only keep columns that actually exist in `res_age`
vars_all = [v for v in (base_meds + grouped + core_summaries) if v in res_age.columns]

# variables to use in the main correlation panels (publication set)
# (prefer grouped system flags + the core summaries; exclude raw base_meds)
vars_pub = [v for v in (grouped + core_summaries) if v in res_age.columns]

In [None]:
# compute full matrix once (phi_corr returns a DataFrame)
phi_full = calc.phi_corr(res_age, vars_for_heatmap=vars_all)

In [None]:
phi = phi_full.copy()

# build mask for the upper triangle (including diagonal)
mask = np.triu(np.ones(phi.shape, dtype=bool), k=0)

# replace upper-triangle cells with "-"
phi_masked = phi.mask(mask, other="-")

# now phi_masked is a DataFrame showing only the lower triangle (below the diagonal)
phi_masked

In [None]:
# use a distinct file name so the all-participants export is not overwritten
phi_masked.to_csv('phi_masked_age_60plus.csv')
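
The age-stratified cells handle one group at a time by editing the filter value. A loop over groups avoids the manual edit and keeps one matrix per stratum. The sketch below is a minimal illustration under stated assumptions: it uses toy data, hypothetical column names and age-group labels (only `'60+'` appears in this notebook), and pandas' `DataFrame.corr` as a stand-in for `calc.phi_corr`.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
n = 600

# Toy data; the age-group labels other than "60+" are hypothetical
toy = pd.DataFrame({
    "age_group": rng.choice(["18-39", "40-59", "60+"], size=n),
    "flag_a": rng.integers(0, 2, size=n),
    "flag_b": rng.integers(0, 2, size=n),
})
toy["n_flags"] = toy["flag_a"] + toy["flag_b"]

variables = ["flag_a", "flag_b", "n_flags"]

# One Pearson matrix per age group, keyed by the group label
corr_by_age = {}
for label, subset in toy.groupby("age_group"):
    corr_by_age[label] = subset[variables].astype(float).corr(method="pearson")

for label, matrix in corr_by_age.items():
    print(label, matrix.shape)
```

Keeping the results in a dictionary keyed by age group also makes it straightforward to derive a distinct output file name for each stratum.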

### Uploads

In [None]:
# Upload entire folders
phenofhy.utils.upload_folders([
    ("phenofhy/", "applets/phenofhy"),
])