# Prevalence of medication use by category, age group, and sex in Our Future Health

## Purpose
This notebook extracts self-reported medication-use data from the Our Future Health (OFH) baseline questionnaire and summarises the prevalence of medication use across medication categories, stratified by age group and sex.

## Outputs
- Intermediate metadata and SQL query files used to extract self-reported medication-use fields from the OFH baseline questionnaire.
- Sample size counts by age group and sex.
- Aggregated counts and proportions of medication use by:
  - Medication category
  - Age group (18–29, 30–59, 60+)
  - Sex (female, male, and overall)
- In-memory summary tables structured for tabulation and reporting.

## Relationship to manuscript
Outputs from this notebook are used to populate **Supplementary Table S8** (*Prevalence of medication use by medication category and age group*).

## Data and access notes
Analyses use restricted Our Future Health data accessed within the OFH Trusted Research Environment under approved study permissions. Outputs are limited to aggregated, non-disclosive summary statistics and comply with OFH Safe Output requirements.
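
The notebook itself does not show how outputs are checked before export. As a minimal sketch of one common disclosure-control step, small cells in a count/proportion table can be masked before a table leaves the environment. The threshold `MIN_CELL_COUNT`, the helper `mask_small_counts`, and the toy table are all hypothetical and are not taken from OFH guidance; the applicable Safe Output rules define the real requirement.

```python
import pandas as pd

# Hypothetical threshold for illustration only; use the value required by
# the applicable Safe Output rules.
MIN_CELL_COUNT = 10

def mask_small_counts(table, count_col="count", prop_col="proportion",
                      threshold=MIN_CELL_COUNT):
    """Return a copy with counts below `threshold` (and their proportions) set to NaN."""
    out = table.copy()
    small = out[count_col] < threshold
    # Cast to float so that NaN can be stored alongside numeric values
    out[count_col] = out[count_col].astype("float").mask(small)
    out[prop_col] = out[prop_col].astype("float").mask(small)
    return out

# Toy table (made-up numbers)
toy = pd.DataFrame({
    "coding_name": ["A", "B", "C"],
    "count": [250, 4, 46],
    "proportion": [0.833, 0.013, 0.153],
})
masked = mask_small_counts(toy)
```

Masking the proportion together with the count matters, because a published proportion and a known denominator would otherwise allow the small count to be recovered.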

## Notes
Medication-use variables are derived from self-reported questions on regular medication use. Summaries are reported for all participants and separately by age group and sex. Non-informative response categories (e.g. “do not know”, “prefer not to answer”, “none of the above”) are excluded from reported prevalence estimates.
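
To make the exclusion concrete, here is a small sketch on toy data. It assumes a multi-select question stored as one list of selected options per participant, and it assumes the denominator is all respondents, with non-informative options dropped only from the reported rows. Neither assumption is confirmed by this notebook, and the option names other than the three non-informative ones are invented.

```python
import pandas as pd

NON_INFORMATIVE = ["Do not know", "None of the above", "Prefer not to answer"]

# Toy multi-select responses: one list of selected options per participant
responses = pd.Series([
    ["Antidepressants"],
    ["Contraceptives", "Antidepressants"],
    ["None of the above"],
    ["Prefer not to answer"],
    ["Contraceptives"],
])

# Denominator: every respondent, including those giving non-informative answers
n_respondents = len(responses)

# One row per selected option, then count how often each option was chosen
counts = responses.explode().value_counts()
prevalence = (
    counts.rename("count").to_frame()
    .assign(proportion=lambda d: d["count"] / n_respondents)
)

# Drop non-informative options from the reported table only
reported = prevalence.loc[~prevalence.index.isin(NON_INFORMATIVE)].sort_index()
```

Under these assumptions the non-informative rows disappear from the table but still contribute to the denominator, so the reported proportions need not sum to one.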


## Setup environment

In [None]:
# Import packages
from pyspark.sql import SparkSession
import pandas as pd
import subprocess
import dxpy
import numpy as np

# Local imports
import phenofhy

### Initialize Spark cluster

In [None]:
spark = SparkSession.builder \
    .appName("Phenotype Analysis") \
    .config("spark.sql.execution.arrow.pyspark.enabled", "true") \
    .config("spark.kryoserializer.buffer.max", "128m") \
    .getOrCreate()

## Table S8

In [None]:
med_traits = [
    "questionnaire.medicat_1_m",
    "questionnaire.medicat_repro_contracept_1_m",
    "questionnaire.medicat_psych_antidepr_1_m",
    "questionnaire.medicat_psych_antipsych_1_m",
]

phenofhy.load.field_list(
    fields=[
        "participant.birth_month",
        "participant.birth_year",
        "participant.registration_month",
        "participant.registration_year",
        "participant.demog_sex_2_1",
        "participant.demog_sex_1_1",
        "participant.pid",
        "questionnaire.demog_weight_1_1",
    ]+med_traits,
    output_file="outputs/intermediate/meds_diag_fields_metadata.csv"
)

phenofhy.extract.fields(
    input_file="outputs/intermediate/meds_diag_fields_metadata.csv",
    output_file="outputs/raw/meds_fields_raw_values_query.sql",
    cohort_key="FULL_SAMPLE_ID",
    sql_only=True
)

raw_diag_df = phenofhy.extract.sql_to_pandas(
    "outputs/raw/meds_fields_raw_values_query.sql"
)

diag_df = phenofhy.process.participant_fields(raw_diag_df)
diag_df = phenofhy.process.questionnaire_fields(diag_df, derive='auto')
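
The cells below rely on a `derived.age_group` column in the processed data. How `phenofhy` computes it is not shown in this notebook. The following is a minimal sketch of one plausible derivation from the birth and registration month/year fields requested above, assuming age is taken at registration. The function `age_group_at_registration` and the short column names are hypothetical.

```python
import pandas as pd

def age_group_at_registration(df):
    """Band age at registration into 18-29, 30-59 and 60+ (assumed derivation)."""
    # Whole-year age; subtract one year if the birth month has not yet been
    # reached in the registration year (day of month is not available).
    age = df["registration_year"] - df["birth_year"]
    age = age - (df["registration_month"] < df["birth_month"]).astype(int)
    return pd.cut(
        age,
        bins=[18, 30, 60, float("inf")],
        right=False,  # intervals are [18, 30), [30, 60), [60, inf)
        labels=["18-29", "30-59", "60+"],
    )

# Toy data (made-up values)
toy = pd.DataFrame({
    "birth_year": [2000, 1994, 1960, 1964],
    "birth_month": [6, 1, 12, 3],
    "registration_year": [2023, 2024, 2023, 2024],
    "registration_month": [5, 1, 11, 3],
})
toy["age_group"] = age_group_at_registration(toy)
```

With `right=False` the bin edges are inclusive on the left, so ages 30 and 60 fall into the older band, matching the 18-29, 30-59 and 60+ labels.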

#### Sample sizes

In [None]:
_df = diag_df
pid = "participant.pid" if "participant.pid" in _df.columns else None
age_bins = ["18-29","30-59","60+"]

def n_in(df, mask):
    """Count unique participants (or rows, if there is no pid column) where mask is True."""
    if pid:
        return int(df.loc[mask, pid].dropna().nunique())
    return int(df.loc[mask].shape[0])

samples = {
  "all": _df,
  "male": _df[_df.get("derived.sex")==1],
  "female": _df[_df.get("derived.sex")==2],
}

res = []
for name, d in samples.items():
    res.append({"sample": name, "all": n_in(d, d.index==d.index)})
    for a in age_bins:
        res[-1][a] = n_in(d, d.get("derived.age_group")==a)
wide = pd.DataFrame(res)[["sample","all"]+age_bins]
wide

#### All age groups

In [None]:
# --- category-level summaries ---
cat_df = pd.concat([
    phenofhy.calculate.summary(diag_df, traits=med_traits[:1], stratify=None, granularity="category")["categorical"].assign(sample="whole"),
    phenofhy.calculate.summary(diag_df, traits=med_traits[:1], stratify="derived.age_group", granularity="category")["categorical"]
])

# --- variable-level (aggregate) summaries ---
var_df = pd.concat([
    phenofhy.calculate.summary(diag_df, traits=med_traits[1:], stratify=None, granularity="variable")["categorical"].assign(sample="whole"),
    phenofhy.calculate.summary(diag_df, traits=med_traits[1:], stratify="derived.age_group", granularity="variable")["categorical"]
])

# --- combine ---
combined = pd.concat([var_df, cat_df], ignore_index=True)

# --- pivot wide (whole first) ---
order = ["whole", "18-29", "30-59", "60+"]
combined["sample"] = pd.Categorical(combined["sample"], categories=order, ordered=True)

wide_all = (
    combined.pivot_table(
        index=["trait", "coding_name"],
        columns="sample",
        values=["count", "proportion"],
        aggfunc="first"
    )
    .sort_index(
        axis=1,
        key=lambda mi: [order.index(c) if c in order else len(order)
                        for c in mi.get_level_values(-1)]
    )
    .reset_index()
)

wide_all.loc[~wide_all['trait'].isin(
    ['Do not know', 'None of the above', 'Prefer not to answer'])].sort_values(by='trait')
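
The long-to-wide step above can be hard to follow without data. The sketch below reproduces the same shape on a toy long table, but swaps the `sort_index(key=...)` call for an explicit `reindex` on the columns, which is an alternative way to fix the order and not the method used in this notebook. The trait name and numbers are invented.

```python
import pandas as pd

order = ["whole", "18-29", "30-59", "60+"]

# Toy long-format summary: one row per (trait, coding_name, sample)
long_df = pd.DataFrame({
    "trait": ["med_x"] * 4,
    "coding_name": ["Yes"] * 4,
    "sample": ["60+", "whole", "18-29", "30-59"],
    "count": [40, 100, 25, 35],
    "proportion": [0.40, 0.25, 0.10, 0.30],
})

# Wide table with two column levels: statistic, then sample
wide = long_df.pivot_table(
    index=["trait", "coding_name"],
    columns="sample",
    values=["count", "proportion"],
    aggfunc="first",
)

# Fix the column order explicitly: all counts, then all proportions,
# each in the order given by `order`
target = pd.MultiIndex.from_product([["count", "proportion"], order])
wide = wide.reindex(columns=target)
```

One consequence of `reindex` is that a sample missing from the data still appears as an all-NaN column, which keeps the table layout stable across strata.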

#### Male/female participants

In [None]:
# derived.sex coding: 1 = male, 2 = female
for sex in [1, 2]:
    sex_df = diag_df.loc[diag_df['derived.sex']==sex]

    # --- category-level summaries ---
    cat_df = pd.concat([
        phenofhy.calculate.summary(sex_df, traits=med_traits[:1], stratify=None, granularity="category")["categorical"].assign(sample="whole"),
        phenofhy.calculate.summary(sex_df, traits=med_traits[:1], stratify="derived.age_group", granularity="category")["categorical"]
    ])

    # --- variable-level (aggregate) summaries ---
    var_df = pd.concat([
        phenofhy.calculate.summary(sex_df, traits=med_traits[1:], stratify=None, granularity="variable")["categorical"].assign(sample="whole"),
        phenofhy.calculate.summary(sex_df, traits=med_traits[1:], stratify="derived.age_group", granularity="variable")["categorical"]
    ])

    # --- combine and pivot ---
    combined = pd.concat([var_df, cat_df], ignore_index=True)
    order = ["whole", "18-29", "30-59", "60+"]
    combined["sample"] = pd.Categorical(combined["sample"], categories=order, ordered=True)

    wide_sex = (
        combined.pivot_table(
            index=["trait", "coding_name"],
            columns="sample",
            values=["count", "proportion"],
            aggfunc="first"
        )
        .sort_index(
            axis=1,
            key=lambda mi: [order.index(c) if c in order else len(order)
                            for c in mi.get_level_values(-1)]
        )
        .reset_index()
    )

    # Drop non-informative response categories, keeping the filtered result
    wide_sex = wide_sex.loc[~wide_sex['trait'].isin(
        ['Do not know', 'None of the above', 'Prefer not to answer']
    )].sort_values(by='trait')

    display(wide_sex)

### Uploads

In [None]:
# Upload an entire directory of folders
phenofhy.utils.upload_folders([
    ("phenofhy/", "applets/phenofhy"),
])