# Prostate Cancer Analysis: A Step-by-Step Study Tool

## Introduction
This notebook reproduces an analysis of prostate cancer incidence and staging trends in England (2013-2022). It is a study tool that shows students how to process real-world health data in R.

## Prerequisites
1.  **R**: Ensure R is installed on your system.
2.  **Jupyter**: Ensure Jupyter Notebook is installed.
3.  **IRkernel**: You need the R kernel for Jupyter. Run `install.packages('IRkernel'); IRkernel::installspec()` in your R console if not already installed.
4.  **Libraries**: We use the `tidyverse` suite (includes `dplyr` and `ggplot2`).
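
A quick way to confirm the prerequisite packages are present before running the notebook is sketched below. It only reports what is missing and installs nothing. The `required_pkgs` vector and the message wording are illustrative choices, not part of the original analysis.

```r
# Packages this notebook relies on (IRkernel is only needed by Jupyter itself)
required_pkgs <- c("tidyverse", "IRkernel")

# requireNamespace() returns TRUE/FALSE without attaching the package
is_installed <- vapply(required_pkgs, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- required_pkgs[!is_installed]

if (length(missing_pkgs) > 0) {
  message("Missing packages: ", paste(missing_pkgs, collapse = ", "),
          ". Install them with install.packages().")
} else {
  message("All required packages are installed.")
}
```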

## Data Setup
1.  Open the **NHS Get Data Out** portal: [https://nhsd-ndrs.shinyapps.io/get_data_out/](https://nhsd-ndrs.shinyapps.io/get_data_out/)
2.  Download the **"GDO_data_wide.csv"** file from the portal.
3.  Place the file in a folder named `data` in the parent directory of this notebook (i.e., `../data/GDO_data_wide.csv`).

In [None]:
# Step 1: Load Libraries
library(tidyverse)

# Check if file exists
data_path <- "../data/GDO_data_wide.csv"
if (!file.exists(data_path)) {
  stop("Data file not found! Please download GDO_data_wide.csv and place it in the '../data/' directory.")
}

# Load Data
df <- read.csv(data_path, stringsAsFactors = FALSE)
cat("Data loaded successfully. Rows:", nrow(df), "\n")

## Step 2: Data Cleaning & Filtering
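The GDO dataset contains one row per combination of strata (cancer site, region, gender, age group, stage, and year), including aggregate rows such as "All" ages and "All" stages. Summing across rows would therefore double-count cases, so we filter for a single, specific cohort.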

**Target Cohort:**
*   **Cancer Site:** Prostate
*   **Region:** All England
*   **Gender:** Male
*   **Age:** All Ages (aggregated)
*   **Stage:** All Stages (for incidence trend)
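
The double-counting risk can be illustrated with a small toy table. The counts below are invented for illustration and are not taken from the GDO file. Only the `Stage` labels mirror those used in this notebook.

```r
# Hypothetical one-year extract: an aggregate "All" row plus its component stages
toy <- data.frame(
  Stage     = c("All", "Stage localised", "Stage metastatic", "Stage unknown"),
  Incidence = c(100, 60, 25, 15),
  stringsAsFactors = FALSE
)

# Naive total: adds the aggregate row on top of its own components
naive_total <- sum(toy$Incidence)

# Correct total: keep only the aggregate row
correct_total <- toy$Incidence[toy$Stage == "All"]

cat("Naive total:", naive_total, "| Correct total:", correct_total, "\n")
```

In this toy case the naive sum is exactly twice the true count, which is why each filter in the next cell pins every stratum to a single value.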

In [None]:
# Filter for National Incidence Trend
incidence_data <- df %>%
  filter(Cancer.Site == "Prostate",
         Region == "All England",
         Gender == "Male",
         Age == "All",
         Stage == "All",
         nchar(Year) == 4) %>% # Filter out rolling averages (e.g., "2013-2015")
  mutate(Year = as.numeric(Year),
         Rate = as.numeric(Age.Gender.Standardised.Incidence.Rate))

head(incidence_data)

## Step 3: Visualizing Incidence Trends
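We plot the **Age-Standardized Rate (ASR)** rather than raw counts so that changes in the population's age structure do not distort the trend. We also annotate two key events: the 2018 "Fry & Turnbull Effect" (a rise in public awareness after Stephen Fry and Bill Turnbull disclosed their prostate cancer diagnoses) and the 2020 COVID-19 impact.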
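
To see why standardization matters, here is a minimal sketch of direct age standardization, one common approach. The age bands, rates, populations, and standard weights are all hypothetical, and this notebook does not specify how the published GDO rate is derived, so treat this as an illustration of the idea only.

```r
# Hypothetical age-specific rates per 100,000 (young, middle, old), identical in both years
age_specific_rate <- c(10, 200, 800)

# Hypothetical populations: the later year has a larger share of older people
pop_early <- c(500000, 300000, 200000)
pop_late  <- c(400000, 300000, 300000)

# Hypothetical standard population weights (fixed across years)
std_weights <- c(0.5, 0.3, 0.2)

# Crude rate weights each age band by the actual population
crude_rate <- function(rate, pop) sum(rate * pop) / sum(pop)

# Directly standardized rate weights each age band by the fixed standard population
standardised_rate <- function(rate, weights) sum(rate * weights) / sum(weights)

crude_early <- crude_rate(age_specific_rate, pop_early)
crude_late  <- crude_rate(age_specific_rate, pop_late)
asr_early   <- standardised_rate(age_specific_rate, std_weights)
asr_late    <- standardised_rate(age_specific_rate, std_weights)

cat("Crude:", crude_early, "->", crude_late, "\n")
cat("Standardized:", asr_early, "->", asr_late, "\n")
```

The crude rate rises purely because the toy population ages, while the standardized rate stays flat because the underlying age-specific risk did not change.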

In [None]:
ggplot(incidence_data, aes(x = Year, y = Rate)) +
  geom_line(color = "#2c3e50", linewidth = 1) +
  geom_point(color = "#2c3e50", size = 3) +
  scale_x_continuous(breaks = unique(incidence_data$Year)) +
  scale_y_continuous(limits = c(0, 250)) +
  labs(title = "Age-Standardized Incidence Rate of Prostate Cancer (England)",
       subtitle = "Impact of 'Fry & Turnbull Effect' (2018) and COVID-19 (2020)",
       x = "Diagnosis Year",
       y = "Rate (per 100,000)",
       caption = paste0("Source: NHS Get Data Out | Extracted: ", Sys.Date())) +
  
  # Annotations
  annotate("text", x = 2018, y = 235, label = "Fry & Turnbull Effect\n(Public Awareness)", 
           vjust = 0, fontface = "bold", size = 3.5, color = "#333333") +
  annotate("segment", x = 2018, xend = 2018, y = 230, yend = 215, 
           arrow = arrow(length = unit(0.2, "cm")), color = "#333333") +
  
  annotate("text", x = 2020, y = 125, label = "COVID-19 Impact", 
           vjust = 1, fontface = "bold", size = 3.5, color = "#333333") +
  annotate("segment", x = 2020, xend = 2020, y = 130, yend = 145, 
           arrow = arrow(length = unit(0.2, "cm")), color = "#333333") +
  
  theme_minimal(base_size = 14)

## Step 4: Stage Shift Analysis
To investigate whether missed diagnoses led to more advanced cancers, we analyze the proportion of each stage over time. We include "Stage unknown" to check data completeness, because a changing share of unstaged cases could mimic or mask a real shift between stages.
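
The proportion calculation and the completeness check can be sketched on a toy table first. The counts are invented for illustration, and base R is used here only so the sketch runs without extra packages. The notebook itself uses `dplyr` for the same step.

```r
# Hypothetical counts by year and stage
toy <- data.frame(
  Year  = c(2019, 2019, 2019, 2020, 2020, 2020),
  Stage = rep(c("Stage localised", "Stage metastatic", "Stage unknown"), times = 2),
  Count = c(50, 30, 20, 40, 35, 25),
  stringsAsFactors = FALSE
)

# ave() returns the per-year total aligned to each row, so this divides
# every count by the total for its own year
toy$Proportion <- toy$Count / ave(toy$Count, toy$Year, FUN = sum)

# Sanity check: proportions should sum to 1 within each year
year_totals <- tapply(toy$Proportion, toy$Year, sum)

# Completeness check: share of cases with no recorded stage, per year
unknown_share <- toy[toy$Stage == "Stage unknown", c("Year", "Proportion")]

print(year_totals)
print(unknown_share)
```

If the unknown share moves noticeably between years, apparent changes in the other stages should be read with caution.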

In [None]:
# Filter for Stage Analysis
target_stages <- c("Stage localised", "Stage locally advanced", "Stage metastatic", "Stage unknown")

stage_data <- df %>%
  filter(Cancer.Site == "Prostate",
         Region == "All England",
         Gender == "Male",
         Age == "All",
         Stage %in% target_stages,
         nchar(Year) == 4) %>%
  mutate(Year = as.numeric(Year),
         Incidence = as.numeric(Incidence))

# Set Factor Levels for Plot Order
stage_data$Stage <- factor(stage_data$Stage, 
                           levels = c("Stage unknown", "Stage metastatic", "Stage locally advanced", "Stage localised"))

# Calculate Proportions
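stage_props <- stage_data %>%
  group_by(Year, Stage) %>%
  summarise(Count = sum(Incidence), .groups = "drop") %>%
  group_by(Year) %>%
  mutate(Proportion = Count / sum(Count)) %>% # Share of each stage within its year
  ungroup() # Drop grouping so later steps operate on the whole table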

# Plot 100% Stacked Bar Chart
# geom_col() draws bars from precomputed values; position = "fill" rescales each year to 100%
ggplot(stage_props, aes(x = Year, y = Proportion, fill = Stage)) +
  geom_col(position = "fill", width = 0.7) +
  scale_y_continuous(labels = scales::percent) +
  scale_x_continuous(breaks = unique(stage_props$Year)) +
  scale_fill_manual(values = c("Stage localised" = "#BDD6EE",
                               "Stage locally advanced" = "#5B9BD5",
                               "Stage metastatic" = "#1F4E79",
                               "Stage unknown" = "#D9D9D9")) +
  labs(title = "Stage Distribution of Prostate Cancer",
       subtitle = "Proportion of New Diagnoses by Stage",
       x = "Diagnosis Year",
       y = "Proportion",
       fill = "Stage") +
  theme_minimal(base_size = 14)