# Age- and sex-specific cancer prevalence and crude rates in Our Future Health

## Purpose
This notebook extracts self-reported cancer diagnoses from the Our Future Health (OFH) baseline questionnaire and computes age- and sex-specific cancer prevalence counts and crude rates (per 100,000 participants) among OFH participants.
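
As a minimal sketch of the crude-rate arithmetic described above: the rate is the case count divided by the number of participants in the same age and sex stratum, scaled to 100,000. The function name `crude_rate_per_100k` is hypothetical and is not part of phenofhy. Rounding to one decimal place and returning 0.0 for an empty stratum mirror the analysis cell later in this notebook.

```python
def crude_rate_per_100k(count: int, population: int) -> float:
    """Crude rate per 100,000: cases / stratum size * 100,000."""
    if population <= 0:
        # Empty stratum: report 0.0 rather than dividing by zero
        return 0.0
    return round(count / population * 100_000, 1)


# Illustrative numbers only, not OFH results
example_rate = crude_rate_per_100k(count=25, population=10_000)
print(example_rate)
```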

## Outputs
- `outputs/intermediate/questionnaire_diag_fields_metadata.csv`  
  Metadata describing questionnaire diagnosis fields used for cancer case definitions.
- `outputs/raw/questionnaire_diag_fields_raw_values_query.sql`  
  SQL query used to extract raw questionnaire diagnosis responses.
- `results.csv`  
  Aggregated age- and sex-specific cancer counts and crude rates (per 100,000) for selected cancer groupings, structured for direct use in Supplementary Table S11.

## Relationship to manuscript
Outputs from this notebook are used to populate **Supplementary Table S11** (*Comparison of age- and sex-specific cancer prevalence and crude rate (per 100,000) among Our Future Health participants and the population of England (2022)*).

## Data and access notes
Analyses use restricted Our Future Health data accessed within the OFH Trusted Research Environment under approved study permissions. The analysis is restricted to participants aged 20 years or older who registered before June 2024; this registration cutoff excludes clinics in Scotland, so the sample corresponds to participants registered in England. All outputs are aggregated, non-disclosive summary statistics, consistent with OFH Safe Output requirements.
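
The eligibility rules and age banding can be illustrated with plain pandas on toy data. This is a sketch under stated assumptions, not the phenofhy implementation: the column names are invented, and the registration cutoff is treated here as exclusive (strictly before 1 June 2024), which the notebook does not confirm.

```python
import pandas as pd

# Toy participant table; column names are illustrative, not OFH field names
toy = pd.DataFrame({
    "age": [19, 20, 24, 25, 89, 90, 101],
    "registration_date": pd.to_datetime([
        "2023-01-15", "2023-05-01", "2024-05-31", "2024-06-01",
        "2022-11-30", "2023-08-08", "2024-07-20",
    ]),
})

# Restrict to ages 20+ and registrations before 1 June 2024 (assumed exclusive)
cutoff = pd.Timestamp("2024-06-01")
eligible = toy[(toy["age"] >= 20) & (toy["registration_date"] < cutoff)].copy()

# Five-year bands 20–24 ... 85–89, then an open-ended 90+ band
edges = list(range(20, 95, 5)) + [float("inf")]
labels = [f"{lo}–{lo + 4}" for lo in range(20, 90, 5)] + ["90+"]

# right=False makes each band left-closed: [20, 25), [25, 30), ...
eligible["age_group"] = pd.cut(eligible["age"], bins=edges, labels=labels, right=False)
print(eligible)
```

With `right=False`, a participant aged exactly 25 falls in the 25–29 band rather than 20–24, which matches the way the band labels read.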

## Notes
Cancer groupings are derived from self-reported questionnaire diagnosis fields.
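
A minimal sketch of how such groupings can be built by pattern matching on response labels, in the same spirit as the analysis cell below. The labels and counts here are hypothetical and are not OFH categories or results.

```python
import pandas as pd

# Hypothetical response labels and counts; not real OFH categories or results
toy_counts = pd.DataFrame({
    "meaning": ["Breast cancer", "Colon cancer", "Rectal cancer",
                "Prostate cancer", "Skin cancer"],
    "count": [40, 12, 8, 30, 15],
})

# Grouping name -> regex applied to the lower-cased label
patterns = {
    "Breast": "breast",
    "Colon/rectal": "colon|rectal",
    "Prostate": "prostate",
    "Lung or bronchial": "lung|bronchial",
}

labels_lower = toy_counts["meaning"].str.lower()
grouped = {
    name: int(toy_counts.loc[labels_lower.str.contains(pat, regex=True), "count"].sum())
    for name, pat in patterns.items()
}
print(grouped)
```

A grouping with no matching label (here, "Lung or bronchial") sums to zero instead of raising an error, so every grouping still yields a row.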


## Setup env

In [None]:
# Import packages
import dxpy
import shlex
import subprocess
import numpy as np
import pandas as pd
import pyspark
from pyspark.sql import SparkSession

# Import phenofhy
import phenofhy

### Initialize Spark

In [None]:
spark = SparkSession.builder \
    .appName("Phenotype Analysis") \
    .config("spark.sql.execution.arrow.pyspark.enabled", "true") \
    .config("spark.kryoserializer.buffer.max", "128") \
    .getOrCreate()

### Table S11

#### Extraction

In [None]:
phenofhy.load.field_list(
    fields=[
        "participant.birth_month",
        "participant.birth_year",
        "participant.registration_month",
        "participant.registration_year",
        "participant.demog_sex_2_1",
        "participant.demog_sex_1_1",
        "participant.pid",
        "questionnaire.diag_1_m",
        "questionnaire.diag_2_m",
        "questionnaire.diag_cancer_1_m",
    ],
    output_file="outputs/intermediate/questionnaire_diag_fields_metadata.csv"
)

phenofhy.extract.fields(
    input_file="outputs/intermediate/questionnaire_diag_fields_metadata.csv",
    output_file="outputs/raw/questionnaire_diag_fields_raw_values_query.sql",
    cohort_key="FULL_SAMPLE_ID",
    sql_only=True
)

raw_diag_df = phenofhy.extract.sql_to_pandas(
    "outputs/raw/questionnaire_diag_fields_raw_values_query.sql"
)

p_df = phenofhy.process.participant_fields(
    raw_diag_df,
    derive="auto",
    min_age=20,
    age_group_bins=[20, 25, 30, 35, 40, 45, 50, 55, 60, 65, 70, 75, 80, 85, 90, float("inf")],
    age_group_labels=[
        "20–24", "25–29", "30–34", "35–39", "40–44",
        "45–49", "50–54", "55–59", "60–64", "65–69",
        "70–74", "75–79", "80–84", "85–89", "90+",
    ],
    # Keep registrations before June 2024 (i.e., excludes Scotland clinics)
    extra_ranges={"derived.registration_date": (pd.Timestamp.min, pd.Timestamp("2024-06-01"))},
)

can_df = phenofhy.process.questionnaire_fields(p_df)

#### Analysis

In [None]:
# compact version: builds item/sex/age rows with counts + rate per 100k (subgroup denominator)
pairs = [("All cancers combined", None), ("Breast", "breast"), ("Colon/rectal", "colon|rectal"),
         ("Prostate", "prostate"), ("Lung or bronchial", "lung|bronchial")]

sex_specs = [([2], "Female"), ([1], "Male"), (list(can_df["derived.sex"].dropna().unique()), "All")]

# age group ordering (preserve categorical ordering if present)
# isinstance check replaces pd.api.types.is_categorical_dtype, which is deprecated in recent pandas
if isinstance(can_df["derived.age_group"].dtype, pd.CategoricalDtype):
    age_groups = list(can_df["derived.age_group"].cat.categories)
else:
    age_groups = sorted(can_df["derived.age_group"].dropna().unique(), key=lambda x: str(x))

rows = []
for vals, sex_label in sex_specs:
    for ag in age_groups:
        sub = can_df.loc[can_df["derived.sex"].isin(vals) & (can_df["derived.age_group"] == ag)]
        pop = len(sub)  # subgroup size, used as the crude-rate denominator
        prev = phenofhy.calculate.prevalence(
            df=sub,
            traits=['questionnaire.diag_1_m', 'questionnaire.diag_2_m', 'questionnaire.diag_cancer_1_m'],
            denominator="nonmissing",
        )
        # All cancers combined: sum of the counts for diag_1_m code 5 and diag_2_m code 3
        # (this equals an OR over participants only if nobody is counted under both codes)
        total_all = (
            prev.loc[(prev['trait'] == 'diag_1_m') & (prev['code'].astype(str) == '5'), 'count'].sum()
            + prev.loc[(prev['trait'] == 'diag_2_m') & (prev['code'].astype(str) == '3'), 'count'].sum()
        )
        # cancer-type rows, with a lower-cased label column "m" for pattern matching
        cancer_rows = prev.loc[prev['trait'] == 'diag_cancer_1_m'].copy()
        cancer_rows['m'] = cancer_rows['meaning'].astype(str).str.lower()

        # build item rows (All cancers uses total_all; others use pattern match)
        for item_name, pat in pairs:
            cnt = int(total_all) if pat is None else int(cancer_rows[cancer_rows['m'].str.contains(pat, regex=True)]['count'].sum())
            rate = round((cnt / pop * 100_000) if pop > 0 else 0.0, 1)
            rows.append({"item": item_name, "sex": sex_label, "age group": ag, "count": cnt, "rate": rate})

result_df = pd.DataFrame(rows)[["item", "sex", "age group", "count", "rate"]]

# enforce requested ordering
item_order = ["All cancers combined", "Breast", "Colon/rectal", "Lung or bronchial", "Prostate"]
sex_order = ["Female", "Male", "All"]
age_groups = list(dict.fromkeys(age_groups))  # dedupe preserve order

result_df["item"] = pd.Categorical(result_df["item"], categories=item_order, ordered=True)
result_df["sex"] = pd.Categorical(result_df["sex"], categories=sex_order, ordered=True)
result_df["age group"] = pd.Categorical(result_df["age group"], categories=age_groups, ordered=True)

result_df = result_df.sort_values(["item", "sex", "age group"]).reset_index(drop=True)

# Save results (index omitted so the CSV contains only the table columns)
result_df.to_csv('results.csv', index=False)
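
The "All cancers combined" count in the cell above adds two code-level counts. The sketch below, on toy data with an assumed list-of-codes layout (not the OFH schema), shows how that sum compares with a participant-level union when one participant reports both codes. Whether this can occur in the OFH questionnaire is not established here.

```python
import pandas as pd

# Toy multi-select responses stored as lists of codes (assumed layout)
toy = pd.DataFrame({
    "pid": [1, 2, 3, 4],
    "diag_1_m": [[5], [5, 2], [1], []],
    "diag_2_m": [[3], [], [3], [4]],
})

# Per-participant flags for the two codes of interest
has_code_1 = toy["diag_1_m"].apply(lambda codes: 5 in codes)
has_code_2 = toy["diag_2_m"].apply(lambda codes: 3 in codes)

# Adding the two counts includes participant 1 twice
summed_count = int(has_code_1.sum() + has_code_2.sum())
# A logical OR counts each participant at most once
union_count = int((has_code_1 | has_code_2).sum())
print(summed_count, union_count)
```

If the two codes are mutually exclusive by questionnaire design, the two numbers coincide and the summed count is safe to use.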

### Upload results

In [None]:
# Upload the listed folders and their contents
phenofhy.utils.upload_folders([
    ("phenofhy/", "applets/phenofhy"),
])