vignettes/khsmisc.Rmd

---
title: "Sample analysis workflow using `khsmisc` and TCGA `BLCA`"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Sample analysis workflow using `khsmisc` and TCGA `BLCA`}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

## Installation

The `khsmisc` package can be installed from [GitHub](https://github.com/stopsack/khsmisc):

```r
# if "remotes" library is missing, install it first:
#   install.packages("remotes")
remotes::install_github("stopsack/khsmisc", build_vignettes = TRUE)
```

To access the documentation after installing the package, run
```r
help("khsmisc")
```

or

```r
vignette("khsmisc")
```


## Setting up a new RMarkdown

After successfully installing `khsmisc`:

(@) **Create an RMarkdown file**

    RStudio: *File* > *New File* > *R Markdown...* > *From Template* > *khsmisc RMarkdown template*

(@) **Customize the YAML header**

   + Change the `title:` and `author:` fields. 
   + The date will automatically be updated when compiling the document. 
   + Note that this type of code, "YAML," is sensitive to exact identation. Every indentation is done by exactly two spaces.

(@) **Load the package**
    
    The template provides code that will load a number of packages via the [khsverse meta-package](https://stopsack.github.io/khsverse):

```{r, eval = FALSE}
library(khsverse)
```

   + To just load the khsmisc package by itself, replace `library(khsverse)` by `library(khsmisc)`. 
   + The template also includes code to set graphics output to SVG format. If PNG output is preferred or the svglite package is not installed, that part can be deleted.


(@) **Knit the RMarkdown a first time**

    The RMarkdown is ready to be compiled using the *Knit* button. An HTML document should open that has nothing but a title and the startup messages of `khsverse` about all the packages it loads. If an an error message is shown instead, perhaps one of the packages was not installed properly? 
    
    Learn more about the syntax of RMarkdown in the [RStudio cheatsheets](http://www.rstudio.com/resources/cheatsheets/).


```{r quietload, include = FALSE, message = FALSE}
library(khsmisc)
library(tidyverse)  # Data handling
library(readxl)     # Reading Excel files
library(labelled)   # Variable labels
library(cowplot)    # Plotting theme
```

## Data handling

### Load the TCGA `BLCA` dataset

Participant and tumor data from the Cancer Genome Atlas Bladder Cancer "cohort" (`BLCA`) will be used in their published 2017 version. Insert a new R chunk into the markdown (toolbar: *Insert* > *R*), and add code for reading a tab-separated file ("TSV"). The code retrieves the file `data_clinical_patient.txt`, which can be downloaded as part of the [cBioPortal TCGA-BLCA tarball (276 MB)](http://download.cbioportal.org/blca_tcga_pub_2017.tar.gz). A copy of just the "clinical" dataset is available in the `extdata` directory of the khsmisc package. The first four lines of the TSV file are skipped because they contain meta-data.
     
```{r}
clinical <- read_tsv(file = system.file("extdata", "data_clinical_patient.txt", 
                                        package = "khsmisc", mustWork = TRUE),
                     skip = 4)
```

An inventory of the dataset:

```{r}
varlist(clinical) %>%
  print(n = 15)  # show the first 15 lines of output
```

Select and rename variables of interest:

```{r}
clinical <- clinical %>%
  select(id         = PATIENT_ID, 
         sex        = SEX,
         race       = RACE,
         ethnicity  = ETHNICITY,
         height     = HEIGHT,
         weight     = WEIGHT,
         smoke      = TOBACCO_SMOKING_HISTORY_INDICATOR,
         agedx      = AGE, 
         dxyear     = INITIAL_PATHOLOGIC_DX_YEAR,
         c_tstage   = CLIN_T_STAGE,
         p_stage    = AJCC_PATHOLOGIC_TUMOR_STAGE,
         p_tstage   = AJCC_TUMOR_PATHOLOGIC_PT, 
         p_nstage   = AJCC_NODES_PATHOLOGIC_PN,
         mstage     = AJCC_METASTASIS_PATHOLOGIC_PM,
         grade      = GRADE,
         histology  = HISTOLOGICAL_SUBTYPE,
         os_status  = OS_STATUS, 
         os_mos     = OS_MONTHS, 
         dfs_status = DFS_STATUS,
         dfs_mos    = DFS_MONTHS)
```

### Load Taylor *et al.* aneuploidy calls

Additional exposure data on tumors of the same study participants comes in the form of derived tumor aneuploidy scores as per Taylor, ..., Meyerson ([Cancer Cell 2018;33:676–689](https://doi.org/10.1016/j.ccell.2018.03.007)).

    
Next, load the Excel file provided as their [Supplementary Table 2 via a HTTPS URL](https://www.cell.com/cms/10.1016/j.ccell.2018.03.007/attachment/2d887978-0a9c-4e90-af00-66eb5948673b/mmc2.xlsx), show dataset inventory, and select/rename variables.

The following code can no longer be excuted because the journal blocks simple downloads. Instead, manually open the link above in the browser.
    
```{r, eval = FALSE}
### THIS CHUNK IS NOT BEING EXECUTED
# Generate temporary path
temporary_path <- tempfile(fileext = ".xlsx")
# Download Taylor suppl. table 2 to that temporary path
download.file(url = "https://www.cell.com/cms/10.1016/j.ccell.2018.03.007/attachment/2d887978-0a9c-4e90-af00-66eb5948673b/mmc2.xlsx", 
              destfile = temporary_path)
# Print temporary file location
print(paste("Taylor et al. suppl. table 2 is temporarily available locally at",
      temporary_path))
# Read Excel file from temporary path:
taylor <- read_xlsx(path = temporary_path, 
                    skip = 1)  # skip the first line
# --end workaround--
```

This example will continue to use **SIMULATED** data as follows:

```{r sim_taylor}
# Simulated aneuploidy data for demonstration purposes:
set.seed(123)
taylor <- clinical %>%
  select(id) %>%
  mutate(
    id = paste0(id, "-01"),  # make simulated IDs match Taylor et al.'s
    purity = rnorm(n = n(), 
                   mean = 0.6,
                   sd = 0.2),
    ascore = rnorm(n = n(),
                   mean = 12,
                   sd = 9),
    ascore = case_when(ascore < 0 ~ 0,
                       ascore > 40 ~ 40,
                       TRUE ~ ascore))

varlist(taylor)
```


### Merging datasets

Left-join the Taylor *et al.* aneuploidy scores to the clinical data, *i.e.*, keep all observations from `clinical` and those that match from `taylor`. Because IDs in `taylor` are tumor IDs, they contain an extra suffix that we need to remove before merging. We also need to check that this procedure did not introduce duplicates.
    
```{r}
taylor <- taylor %>%
  mutate(id = str_sub(string = id, start = 1, end = -4))

combined <- clinical %>%
  left_join(taylor, by = "id")

sum(duplicated(combined$id))  # Check for duplicates. Should return 0.
```
   

### Recoding variables

Consistently code categorical variables as a `factor` and continuous variables as `numeric`:

```{r}
combined <- combined %>%
  mutate(
    across(.cols = c(sex, race, ethnicity, smoke,
                     c_tstage, p_stage, p_tstage, p_nstage, mstage, 
                     grade, histology, os_status, dfs_status), 
           .fns = factor),
    across(.cols = c(height, weight, agedx, dxyear, os_mos, dfs_mos),
           .fns = as.numeric))
```

The warning messages are expected, as some numeric variables contain non-numeric values.


Next, recode various labels of missing data in categorical variables to a consistent missing type, using `fct_collapse`:

```{r}
# Tabulate some examples
combined %>% 
  select(race, dfs_status, histology) %>%
  map(.f = table, 
      useNA = "always")  # always show NA ("missing") levels

# Collapse all actually missings to R's missing
combined <- combined %>%
  mutate(
    across(.cols = where(is.factor), 
           .fns = ~fct_collapse(., 
                                NULL = c("[Not Available]", 
                                         "[Discrepancy]", 
                                         "Indeterminate"))))

# Revisit the examples after recoding
combined %>% 
  select(race, dfs_status, histology) %>%
  map(.f = table, 
      useNA = "always")  # always show NA ("missing") levels
```


Recode event indicators to the numeric values that `survival` functions require:

```{r}
combined <- combined %>%
  mutate(
    dfs_event = case_when(
      dfs_status == "Recurred/Progressed" ~ 1,
      dfs_status == "DiseaseFree"         ~ 0),
    os_event  = case_when(
      os_status  == "DECEASED"            ~ 1,
      os_status  == "LIVING"              ~ 0))
```


Recode race categories to make them more readable:

```{r}
# Before:
combined %>%
  count(race)

combined <- combined %>% 
         # Change from all uppercase:
  mutate(
    race      = factor(str_to_title(race)),
    ethnicity = factor(str_to_title(ethnicity)),
    # Make "Black" as short as other categories
    race = fct_recode(race, Black = "Black Or African American"),
    # Make "White" the reference category because of sample size
    race = fct_relevel(race, "White"))

# After:
combined %>%
  count(race)
```

Assign meaningful labels to the `smoke` variable, based on inspection of other smoking-related variables in the full dataset:

```{r}
combined <- combined %>%
  mutate(
    smoke = factor(
      smoke, 
      levels = 1:5, 
      labels = c("Never", "Current", 
                 "Former, quit >15 y", "Former, quit <15 y", 
                 "Former")),
    # Combine all "formers":
    smoke3 = fct_collapse(
      smoke, 
      Former = c("Former", 
                 "Former, quit >15 y", 
                 "Former, quit <15 y")),
    # Change order of levels, starting with "never" as the reference:
    smoke3 = fct_relevel(
      smoke3, 
      "Never", "Former", "Current"))
```


### Creating derived variables

```{r}
combined <- combined %>%
  mutate(bmidx = weight / ((height/100)^2)) %>%
  select(-height, -weight)
```

### Categorizing variables

Category boundaries are informed by results from code to be run below.

```{r}
combined <- combined %>%
  mutate(
    ascore_cat = cut(
      ascore, 
      breaks = c(0, 5, 10, 20, max(ascore, na.rm = TRUE)), 
      include.lowest = TRUE))
```

### Labeling variables

```{r}
combined <- combined %>%
  set_variable_labels(
    sex        = "Sex",
    race       = "Self-reported race",
    ethnicity  = "Ethnicity",
    smoke      = "Smoking status at diangosis",
    smoke3     = "Smoking status at diagnosis",
    agedx      = "Age at diagnosis",
    dxyear     = "Calendar year of initial diagnosis",
    bmidx      = "Body mass index at diagnosis",
    p_stage    = "AJCC pathologic stage",
    mstage     = "Metastases at diagnosis",
    grade      = "Histologic grade",
    histology  = "Histologic subtype",
    ascore     = "Aneuploidy score",
    ascore_cat = "Aneuploidy score",
    purity     = "DNA tumor purity by ABSOLUTE")
```


## Data description

### Contingency tables

```{r}
combined %>%
  tabulate_rowcol(smoke3, race) %>%  # make contingency table
  mygt()  # format
```

### Table of distributional statistics

Exposure and confounders

```{r}
combined %>%
  # make descriptive table:
  tsummary(race, agedx, bmidx, dxyear, purity, by = race) %>%  
  select(-mean, -sd, -sum) %>%  # remove undesired statistics
  mutate(across(
    .cols = where(is.numeric), 
    .fns = ~round(x = .x, digits = 2))) %>%  # round
  mygt()  # format
```

Outcome

```{r}
combined %>%
  tsummary(starts_with("ascore")) %>%  # make descriptive table
  select(-mean, -sd, -sum) %>%  # remove undesired statistics
  mygt()  # format
```


### Plots of the outcome

```{r}
combined %>%
  ggplot(mapping = aes(x = ascore)) +
  geom_histogram(binwidth = 2) +  # each bar = 2 units of "ascore"
  theme_minimal_grid()
```

Is the outcome reasonably normally distributed? Show a quantile--quantile plot before and after log transformation.

```{r}
combined %>% 
  ggplot(aes(sample = ascore)) + 
  stat_qq() + 
  stat_qq_line() +
  labs(y = "Sample, untransformed") +
  theme_minimal_grid()

combined %>% 
  ggplot(aes(sample = log(ascore + 0.1))) + 
  stat_qq() + 
  stat_qq_line() +
  labs(y = "Sample, log(x + 0.1)-transformed") +
  theme_minimal_grid()
```


## Main results

### Study population: Applying and documenting exclusions

Because the main exposure is self-reported race, we will have to exclude participants with missing race from the analytical population.

```{r}
# All participants:
nrow(combined)

# Exclude with missing race:
analytical <- combined %>%
  filter(!is.na(race)) %>%
  copy_labels_from(from = combined)

# Analytical population:
nrow(analytical)
```

Note that the dataset contains participants with tumors that were metastatic at diagnosis (M1):
```{r}
analytical %>%
  count(mstage)
```

In the analytical dataset, does everyone have an aneuploidy score?

```{r}
analytical %>%
  mutate(missing_ascore = is.na(ascore)) %>%
  tabulate_rowcol(missing_ascore, race) %>%
  mygt()
```

### Figure 1: Box-whiskers/dot plots

```{r}
analytical %>%
  stripplot(x = race, 
            y = ascore)
```

**Restricted to high grade, non-missing aneuploidy scores and histology, and then color code by histology**
```{r}
analytical %>%
  filter(grade == "High Grade") %>%
  filter(!is.na(ascore) & !is.na(histology)) %>%
  stripplot(x = race, 
            y = ascore, 
            color = histology) +
  # change color palette:
  scale_color_viridis_d(option = "cividis", 
                        end = 0.8, 
                        direction = -1) +
  # add label:
  labs(color = "Histology")
```