# Clean PDTC Portal Metadata to Consolidate Disease Type - Set to NA For Normals
## Date: 2022-02-28
## Author: Jeffrey Grover
**Purpose:** With the rest of the cleaning complete, one remaining task is to make the `disease_type` column more consistent. Without input from the contributing centers or a controlled vocabulary, it will be difficult to catch every edge case, so I will retain the detailed values alongside the simplified ones. Disease information is also set to NA for normals.

### Load libraries

In [1]:
library(tidyverse)

── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──

✔ ggplot2 3.3.5     ✔ purrr   0.3.4
✔ tibble  3.1.6     ✔ dplyr   1.0.7
✔ tidyr   1.1.4     ✔ stringr 1.4.0
✔ readr   2.1.1     ✔ forcats 0.5.1

── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()



### Load metadata

In [2]:
pdtc_metadata <- read_csv('../2022-02-25_pdtc_portal_metadata_remove_duplicates/2022-02-25_pdtc_portal_metadata_deduplicated.csv') %>%
    rename('index' = '...1')

head(pdtc_metadata)

New names:
* `` -> ...1

Rows: 2134 Columns: 26

── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (24): contributor, file_name, patient_id, model_id, tumor_id, case_id, s...
dbl  (2): ...1, paired_end

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.



index,contributor,file_name,patient_id,model_id,tumor_id,case_id,sample_id,sample_type,experimental_strategy,⋯,access_level,data_category,data_format,data_type,platform,is_ffpe,passage,body_location,disease_type,comment
<dbl>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,⋯,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>
1,BCM,34_S29_L005_R1_001.fastq.gz,24483,BCM-4913,4913,,BCM-4913,PDX,WES,⋯,Controlled,Raw Sequencing Data,fastq.gz,Raw Reads,Illumina,Unknown,Unknown,Breast,Invasive breast carcinoma,
2,BCM,34_S29_L005_R2_001.fastq.gz,24483,BCM-4913,4913,,BCM-4913,PDX,WES,⋯,Controlled,Raw Sequencing Data,fastq.gz,Raw Reads,Illumina,Unknown,Unknown,Breast,Invasive breast carcinoma,
3,BCM,35_S30_L005_R1_001.fastq.gz,24490,BCM-4400,4400,,BCM-4400,PDX,WES,⋯,Controlled,Raw Sequencing Data,fastq.gz,Raw Reads,Illumina,Unknown,Unknown,Breast,"Infiltrating duct carcinoma, NOS",
4,BCM,35_S30_L005_R2_001.fastq.gz,24490,BCM-4400,4400,,BCM-4400,PDX,WES,⋯,Controlled,Raw Sequencing Data,fastq.gz,Raw Reads,Illumina,Unknown,Unknown,Breast,"Infiltrating duct carcinoma, NOS",
5,BCM,36_S31_L005_R1_001.fastq.gz,24455,BCM-2277,2277,,BCM-2277,PDX,WES,⋯,Controlled,Raw Sequencing Data,fastq.gz,Raw Reads,Illumina,Unknown,Unknown,Breast,Invasive breast carcinoma,
6,BCM,36_S31_L005_R2_001.fastq.gz,24455,BCM-2277,2277,,BCM-2277,PDX,WES,⋯,Controlled,Raw Sequencing Data,fastq.gz,Raw Reads,Illumina,Unknown,Unknown,Breast,Invasive breast carcinoma,


### Investigate the `disease_type` column

In [3]:
# What are the unique entries in that column, and how many are there?

disease_types <- pdtc_metadata$disease_type %>% table() %>% as_tibble() %>% rename('disease_type' = '.', 'freq' = 'n')

disease_types

disease_type,freq
<chr>,<int>
Adenocarcinoma - colon,12
bladder urothelial carcinoma,6
Blood,84
"Bone Neoplasm, Chondrosarcoma (10008737)",8
Breast Ascites,8
"Breast cancer, NOS",6
Breast Invasive Ductal (Skin Met),6
Breast Invasive Ductal (Spine Met),4
Breast Invasive Ductal Carcinoma,48
Breast Invasive Lobular (Brain Met),6
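
As an aside, the same tally can be produced with `dplyr::count()`, which returns a tibble directly and, unlike `table()` with its defaults, keeps `NA` as its own row. A minimal sketch on a made-up `toy` tibble (hypothetical data, not the real metadata):

```r
library(dplyr)
library(tibble)

# Hypothetical toy data standing in for pdtc_metadata
toy <- tibble(disease_type = c('Blood', 'Blood', 'Breast Ascites', NA))

# One row per distinct value, most frequent first; NA is counted as well
disease_counts <- toy %>% count(disease_type, name = 'freq', sort = TRUE)

# table() drops NA by default, so it sees one fewer category here
n_table_categories <- length(table(toy$disease_type))

print(disease_counts)
```

This matters later in the notebook, once normals have their disease information set to NA and would otherwise disappear from the tally.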


Many of these are highly specific, but some are obviously the same disease written differently and can be merged. For the portal, I think it would be best to display the `body_location` field rather than this `disease_type`.

### Do some simple replacements
The plan is to harmonize the information as much as possible and then create a simplified disease column. That column will differ from `body_location`, which may record a metastatic site rather than reflect the disease itself.

In [4]:
# Apply the disease_type replacements that can be done quickly
# In general, reformat all entries as: tissue/location - disease - other info

pdtc_metadata_cleaned <- pdtc_metadata %>%
    # colon
    mutate(disease_type = str_replace(disease_type, '(Adenocarcinoma - colon)|(Colon adenocarcinoma)', 'Colon - Adenocarcinoma')) %>%
    mutate(disease_type = str_replace(disease_type, '(Gastrointestinal Neoplasm, Colon - Colon Adenocarcinoma \\(10009951\\))|(Gastrointestinal Neoplasm, Colon - Adenocarcinoma \\(10009951\\))', 'Colon - Gastrointestinal Neoplasm - Adenocarcinoma (10009951)')) %>%
    mutate(disease_type = str_replace(disease_type, 'Gastrointestinal Neoplasm, Colorectal cancer, NOS \\(10010029\\)', 'Colon - Gastrointestinal Neoplasm - Colorectal cancer - NOS (10010029)')) %>%
    # bladder
    mutate(disease_type = str_replace(disease_type, '(Urothelial Bladder Cancer)|(urothelial\\/bladder ca)|(Urothelial/bladder cancer, NOS)', 'Bladder - Urothelial Cancer - NOS')) %>%
    mutate(disease_type = str_replace(disease_type, '(bladder urothelial carcinoma)|(Urothelial carcinoma)', 'Bladder - Urothelial Carcinoma')) %>%
    mutate(disease_type = str_replace(disease_type, '(Urothelial Tract Neoplasm, Urothelial/bladder cancer, NOS \\(10018192\\))|(Urothelial Tract Neoplasm, Bladder - Urothelial Cancer - NOS \\(10018192\\))', 'Bladder - Urothelial Tract Neoplasm - Urothelial Cancer - NOS (10018192)')) %>%
    mutate(disease_type = str_replace(disease_type, 'Myoinvasive bladder cancer', 'Bladder - Myoinvasive Cancer')) %>%
    mutate(disease_type = str_replace(disease_type, 'Transitional cell carcinoma', 'Bladder - Transitional Cell Carcinoma')) %>%
    mutate(disease_type = str_replace(disease_type, 'Urothelial Tract Neoplasm, Transitional cell car. - uroth. \\(10044409\\)', 'Bladder - Urothelial Tract Neoplasm - Transitional Cell Carcinoma (10044409)')) %>%
    # bone
    mutate(disease_type = str_replace(disease_type, 'Bone Neoplasm, Chondrosarcoma \\(10008737\\)', 'Bone - Neoplasm - Chondrosarcoma (10008737)')) %>%
    # breast
    mutate(disease_type = str_replace(disease_type, 'Breast Ascites', 'Breast - Ascites')) %>%
    mutate(disease_type = str_replace(disease_type, '^Invasive breast carcinoma$', 'Breast - Invasive Carcinoma')) %>%
    mutate(disease_type = str_replace(disease_type, '^Invasive ductal carcinoma$', 'Breast - Invasive Ductal Carcinoma')) %>%
    mutate(disease_type = str_replace(disease_type, '^Breast cancer, NOS$', 'Breast - Breast Cancer - NOS')) %>%
    mutate(disease_type = str_replace(disease_type, '^Breast Invasive', 'Breast - Invasive')) %>%
    mutate(disease_type = str_replace(disease_type, 'Breast Neoplasm, Invasive breast carcinoma \\(10006190\\)', 'Breast - Neoplasm - Invasive breast carcinoma (10006190)')) %>%
    mutate(disease_type = str_replace(disease_type, 'Breast Pleural Effusion', 'Breast - Pleural Effusion')) %>%
    mutate(disease_type = str_replace(disease_type, 'Breast Poorly Differentiated', 'Breast - Poorly Differentiated')) %>%
    mutate(disease_type = str_replace(disease_type, 'Infiltrating duct carcinoma, NOS', 'Breast - Infiltrating duct carcinoma - NOS')) %>%
    # pancreas
    mutate(disease_type = str_replace(disease_type, 'Gastrointestinal Neoplasm, Adenocarcinoma - pancreas \\(10052747\\)', 'Pancreas - Gastrointestinal Neoplasm - Adenocarcinoma (10052747)')) %>%
    # rectum
    mutate(disease_type = str_replace(disease_type, 'Gastrointestinal Neoplasm, Adenocarcinoma - rectum \\(10038045\\)', 'Rectum - Gastrointestinal Neoplasm - Adenocarcinoma (10038045)')) %>%
    # gall bladder
    mutate(disease_type = str_replace(disease_type, 'Gastrointestinal Neoplasm, Gall bladder carcinoma \\(adeno\\) \\(10017618\\)', 'Gall Bladder - Gastrointestinal Neoplasm - Adenocarcinoma (10017618)')) %>%
    # gastrointestinal
    mutate(disease_type = str_replace(disease_type, 'Gastrointestinal Neoplasm, Gastrointestinal cancer, NOS \\(10017986\\)', 'Gastrointestinal - Neoplasm - Gastrointestinal cancer - NOS (10017986)')) %>%
    # head and neck
    mutate(disease_type = str_replace(disease_type, 'Head and Neck Neoplasm, H & N squamous cell car., NOS \\(10060121\\)', 'Head and Neck - Neoplasm - Squamous Cell Carcinoma - NOS (10060121)')) %>%
    mutate(disease_type = str_replace(disease_type, 'Head and Neck Neoplasm, Pharyngeal squam. cell carcinoma \\(10034819\\)', 'Head and Neck - Neoplasm - Pharyngeal Squamous Cell Carcinoma (10034819)')) %>%
    # kidney
    mutate(disease_type = str_replace(disease_type, 'Kidney Neoplasm, RCC, clear cell adenocarcinoma \\(10009251\\)', 'Kidney - Neoplasm - RCC - Clear Cell Adenocarcinoma (10009251)')) %>%
    mutate(disease_type = str_replace(disease_type, 'Kidney Neoplasm, Renal cell carcinoma, NOS \\(10038415\\)', 'Kidney - Neoplasm - Renal Cell Carcinoma - NOS (10038415)')) %>%
    # lung
    mutate(disease_type = str_replace(disease_type, '(^Lung )|(^Lung, )', 'Lung - ')) %>%
    mutate(disease_type = str_replace(disease_type, ', Lung', ' -')) %>%
    mutate(disease_type = str_replace(disease_type, 'adenocarcinoma \\(10025032\\)', 'Adenocarcinoma (10025032)')) %>%
    mutate(disease_type = str_replace(disease_type, 'cancer, NOS \\(10025065\\)', 'NOS (10025065)')) %>%
    mutate(disease_type = str_replace(disease_type, 'cell lung cancer', 'Cell Cancer')) %>%
    mutate(disease_type = str_replace(disease_type, 'Squamous cell lung carcinoma', 'Squamous Cell Carcinoma')) %>%
    mutate(disease_type = str_replace(disease_type, '^Neuroendocrinal$', 'Lung - Neuroendocrinal')) %>%  # Consistent with MDACC models of the same type
    # skin
    mutate(disease_type = str_replace(disease_type, 'Melanoma', 'Skin - Melanoma')) %>%
    mutate(disease_type = str_replace(disease_type, 'Skin Neoplasm, Skin - Melanoma \\(10053571\\)', 'Skin - Neoplasm - Melanoma (10053571)')) %>%
    mutate(disease_type = str_replace(disease_type, 'Skin Neoplasm, Merkel cell tumor \\(10029266\\)', 'Skin - Neoplasm - Merkel Cell Tumor (10029266)')) %>%
    # ovary
    mutate(disease_type = str_replace(disease_type, 'Reproductive System Neoplasm, Female, Ovarian cancer, NOS \\(10033272\\)', 'Ovary - Reproductive System Neoplasm - Female - Ovarian Cancer - NOS (10033272)'))
    
    
    

pdtc_metadata_cleaned$disease_type %>% table() %>% as_tibble() %>% rename('disease_type' = '.', 'freq' = 'n')

disease_type,freq
<chr>,<int>
Bladder - Myoinvasive Cancer,6
Bladder - Transitional Cell Carcinoma,6
Bladder - Urothelial Cancer - NOS,16
Bladder - Urothelial Carcinoma,36
Bladder - Urothelial Tract Neoplasm - Transitional Cell Carcinoma (10044409),18
Bladder - Urothelial Tract Neoplasm - Urothelial Cancer - NOS (10018192),24
Blood,84
Bone - Neoplasm - Chondrosarcoma (10008737),8
Breast - Ascites,8
Breast - Breast Cancer - NOS,6
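
The long chain of `mutate()` calls above could also be written as a single lookup table. This is only a sketch of that alternative, not what the notebook ran: `str_replace_all()` accepts a named character vector of `pattern = replacement` pairs and applies them in order. The `toy` tibble and the `replacements` vector are hypothetical, and the three patterns are copied from the cell above with `^...$` anchors added.

```r
library(dplyr)
library(stringr)
library(tibble)

# Hypothetical toy data with a few of the raw values seen above
toy <- tibble(disease_type = c(
    'Adenocarcinoma - colon',
    'Colon adenocarcinoma',
    'Breast Ascites',
    'Bone Neoplasm, Chondrosarcoma (10008737)'
))

# Names are regular expressions, values are the harmonized strings.
# Anchoring with ^ and $ keeps a short pattern from matching inside a longer entry.
replacements <- c(
    '^(Adenocarcinoma - colon|Colon adenocarcinoma)$' = 'Colon - Adenocarcinoma',
    '^Breast Ascites$' = 'Breast - Ascites',
    '^Bone Neoplasm, Chondrosarcoma \\(10008737\\)$' = 'Bone - Neoplasm - Chondrosarcoma (10008737)'
)

toy_cleaned <- toy %>%
    mutate(disease_type = str_replace_all(disease_type, replacements))

print(toy_cleaned)
```

Keeping the mapping in one vector makes it easier to review with the centers, and it could be moved to a CSV if a controlled vocabulary is ever agreed on.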


In [5]:
# Fix a few more that aren't as standardized

pdtc_metadata_cleaned <- pdtc_metadata_cleaned %>% mutate(disease_type = str_replace(disease_type, 'Soft Tissue Neoplasm, Malig. periph. nerve sheath tum. \\(10026667\\)', 'Soft Tissue - Neoplasm - Malignant Peripheral Nerve Sheath Tumor (10026667)')) %>%
    mutate(disease_type = str_replace(disease_type, 'Soft Tissue Neoplasm, Non-Rhabdo. soft tissue sarcoma \\(10039494\\)', 'Soft Tissue - Neoplasm - Non-Rhabdomyosarcoma Soft Tissue Sarcoma (10039494)')) %>%
    mutate(disease_type = str_replace(disease_type, 'Soft Tissue Neoplasm, Rhabdomyosarcoma, NOS \\(10039024\\)', 'Soft Tissue - Neoplasm - Rhabdomyosarcoma - NOS (10039024)')) %>%
    mutate(disease_type = str_replace(disease_type, 'Soft Tissue Neoplasm, Synovial sarcoma \\(10042866\\)', 'Soft Tissue - Neoplasm - Synovial Sarcoma (10042866)'))


pdtc_metadata_cleaned$disease_type %>% table() %>% as_tibble() %>% rename('disease_type' = '.', 'freq' = 'n')

disease_type,freq
<chr>,<int>
Bladder - Myoinvasive Cancer,6
Bladder - Transitional Cell Carcinoma,6
Bladder - Urothelial Cancer - NOS,16
Bladder - Urothelial Carcinoma,36
Bladder - Urothelial Tract Neoplasm - Transitional Cell Carcinoma (10044409),18
Bladder - Urothelial Tract Neoplasm - Urothelial Cancer - NOS (10018192),24
Blood,84
Bone - Neoplasm - Chondrosarcoma (10008737),8
Breast - Ascites,8
Breast - Breast Cancer - NOS,6
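
Since the displayed table is truncated, one way to spot entries that still do not follow the `tissue - disease - other info` format is to list the values that lack the ` - ` separator. A minimal sketch on hypothetical toy data (the real column would be `pdtc_metadata_cleaned$disease_type`):

```r
library(dplyr)
library(stringr)
library(tibble)

# Hypothetical toy data: two harmonized values, two that are not, and one NA
toy <- tibble(disease_type = c(
    'Bladder - Urothelial Carcinoma',
    'Blood',
    'Unknown',
    'Soft Tissue - Neoplasm - Synovial Sarcoma (10042866)',
    NA
))

# Keep non-missing values without the ' - ' separator and tally them
unharmonized <- toy %>%
    filter(!is.na(disease_type), !str_detect(disease_type, ' - ')) %>%
    count(disease_type, name = 'freq')

print(unharmonized)
```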


In [6]:
# Fix body location for those two MDACC samples

pdtc_metadata_cleaned <- pdtc_metadata_cleaned %>% mutate(body_location = ifelse(disease_type == 'Lung - Neuroendocrinal', 'Lung', body_location))

In [7]:
# Normals and blood normals should have disease type == NA

pdtc_metadata_cleaned <- pdtc_metadata_cleaned %>% mutate(disease_type = ifelse(sample_type == 'Normal', NA, disease_type))
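
One thing to keep in mind with `ifelse(sample_type == 'Normal', ...)`: if `sample_type` itself is `NA`, the comparison is `NA` and the result for that row becomes `NA` too. Whether the real metadata has missing `sample_type` values is not checked here, so the following is only a sketch on hypothetical toy data, comparing `==` with `%in%` (which treats `NA` as a non-match):

```r
library(dplyr)
library(tibble)

# Hypothetical toy data: a normal, a PDX, and a row with missing sample_type
toy <- tibble(
    sample_type = c('Normal', 'PDX', NA),
    disease_type = c('Blood', 'Breast - Ascites', 'Lung - Neuroendocrinal')
)

toy_na <- toy %>%
    mutate(
        # == propagates NA, so the third row loses its disease_type
        with_equals = ifelse(sample_type == 'Normal', NA, disease_type),
        # %in% returns FALSE for NA, so the third row is left untouched
        with_in = if_else(sample_type %in% 'Normal', NA_character_, disease_type)
    )

print(toy_na)
```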

### Create a new simplified disease column
Call it `disease_type` and change the current column to `disease_detail`.

In [8]:
pdtc_metadata_cleaned <- pdtc_metadata_cleaned %>% rename('disease_detail' = 'disease_type') %>%
    separate(disease_detail, into = c('disease_type'), sep = ' - ', remove = FALSE, extra = 'drop') %>%
    mutate(disease_type = str_c(disease_type, ' Cancer'))

head(pdtc_metadata_cleaned)

index,contributor,file_name,patient_id,model_id,tumor_id,case_id,sample_id,sample_type,experimental_strategy,⋯,data_category,data_format,data_type,platform,is_ffpe,passage,body_location,disease_detail,disease_type,comment
<dbl>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,⋯,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>
1,BCM,34_S29_L005_R1_001.fastq.gz,24483,BCM-4913,4913,,BCM-4913,PDX,WES,⋯,Raw Sequencing Data,fastq.gz,Raw Reads,Illumina,Unknown,Unknown,Breast,Breast - Invasive Carcinoma,Breast Cancer,
2,BCM,34_S29_L005_R2_001.fastq.gz,24483,BCM-4913,4913,,BCM-4913,PDX,WES,⋯,Raw Sequencing Data,fastq.gz,Raw Reads,Illumina,Unknown,Unknown,Breast,Breast - Invasive Carcinoma,Breast Cancer,
3,BCM,35_S30_L005_R1_001.fastq.gz,24490,BCM-4400,4400,,BCM-4400,PDX,WES,⋯,Raw Sequencing Data,fastq.gz,Raw Reads,Illumina,Unknown,Unknown,Breast,Breast - Infiltrating duct carcinoma - NOS,Breast Cancer,
4,BCM,35_S30_L005_R2_001.fastq.gz,24490,BCM-4400,4400,,BCM-4400,PDX,WES,⋯,Raw Sequencing Data,fastq.gz,Raw Reads,Illumina,Unknown,Unknown,Breast,Breast - Infiltrating duct carcinoma - NOS,Breast Cancer,
5,BCM,36_S31_L005_R1_001.fastq.gz,24455,BCM-2277,2277,,BCM-2277,PDX,WES,⋯,Raw Sequencing Data,fastq.gz,Raw Reads,Illumina,Unknown,Unknown,Breast,Breast - Invasive Carcinoma,Breast Cancer,
6,BCM,36_S31_L005_R2_001.fastq.gz,24455,BCM-2277,2277,,BCM-2277,PDX,WES,⋯,Raw Sequencing Data,fastq.gz,Raw Reads,Illumina,Unknown,Unknown,Breast,Breast - Invasive Carcinoma,Breast Cancer,
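
To make the behavior of this step explicit, here is the same `separate()` and `str_c()` combination on a few hypothetical values. It keeps only the text before the first ` - `, appends ` Cancer`, leaves `NA` as `NA`, and turns an entry with no separator such as `Unknown` into `Unknown Cancer`:

```r
library(dplyr)
library(tidyr)
library(stringr)
library(tibble)

# Hypothetical toy data covering the cases of interest
toy <- tibble(disease_detail = c(
    'Breast - Invasive Carcinoma',
    'Head and Neck - Neoplasm - Squamous Cell Carcinoma - NOS (10060121)',
    'Unknown',
    NA
))

toy_simplified <- toy %>%
    # Keep the first piece only; extra = 'drop' silently discards the rest
    separate(disease_detail, into = c('disease_type'), sep = ' - ', remove = FALSE, extra = 'drop') %>%
    # str_c() returns NA when any input is NA, so normals stay NA
    mutate(disease_type = str_c(disease_type, ' Cancer'))

print(toy_simplified)
```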


In [9]:
# Some of the derived disease names are ungrammatical (e.g. 'Pancreas Cancer')

pdtc_metadata_cleaned$disease_type %>% table()

.
         Bladder Cancer             Bone Cancer           Breast Cancer 
                     96                       6                     498 
           Colon Cancer     Gall Bladder Cancer Gastrointestinal Cancer 
                    154                       6                      16 
   Head and Neck Cancer           Kidney Cancer             Lung Cancer 
                     16                      44                     292 
           Ovary Cancer         Pancreas Cancer           Rectum Cancer 
                      6                     130                      18 
            Skin Cancer      Soft Tissue Cancer          Unknown Cancer 
                    394                      42                      24 

In [10]:
# Clean these for grammatical and readability reasons

pdtc_metadata_cleaned <- pdtc_metadata_cleaned %>% mutate(disease_type = str_replace(disease_type, 'Pancreas Cancer', 'Pancreatic Cancer')) %>%
    mutate(disease_type = str_replace(disease_type, 'Rectum Cancer', 'Rectal Cancer')) %>%
    mutate(disease_type = str_replace(disease_type, 'Unknown Cancer', 'Unknown')) %>%
    mutate(disease_type = str_replace(disease_type, 'Ovary Cancer', 'Ovarian Cancer'))

pdtc_metadata_cleaned$disease_type %>% table()

.
         Bladder Cancer             Bone Cancer           Breast Cancer 
                     96                       6                     498 
           Colon Cancer     Gall Bladder Cancer Gastrointestinal Cancer 
                    154                       6                      16 
   Head and Neck Cancer           Kidney Cancer             Lung Cancer 
                     16                      44                     292 
         Ovarian Cancer       Pancreatic Cancer           Rectal Cancer 
                      6                     130                      18 
            Skin Cancer      Soft Tissue Cancer                 Unknown 
                    394                      42                      24 

### Fix a few remaining samples
There are a handful of samples from BCM that we know are breast cancer: `body_location` is `Breast` and `sample_type == 'PDX' | sample_type == 'Tumor'`.

In [11]:
# Set disease_detail and disease_type for these samples

pdtc_metadata_cleaned <- pdtc_metadata_cleaned %>% mutate(
    disease_detail = ifelse((sample_type == 'PDX' | sample_type == 'Tumor') & contributor == 'BCM' & body_location == 'Breast', 'Breast - Breast Cancer - NOS', disease_detail),
    disease_type = ifelse((sample_type == 'PDX' | sample_type == 'Tumor') & contributor == 'BCM' & body_location == 'Breast', 'Breast Cancer', disease_type)
    )
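
Note that the condition above matches every BCM breast PDX/Tumor row, including rows that already carry a more specific `disease_detail`. If the intent is only to fill rows that have no annotation yet, a guarded version might look like the sketch below. The assumption that "unannotated" means `is.na(disease_detail)` is mine, the helper column `needs_fill` is hypothetical, and the toy data is made up.

```r
library(dplyr)
library(tibble)

# Hypothetical toy data: one unannotated tumor row, one annotated, one normal, one non-BCM
toy <- tibble(
    contributor = c('BCM', 'BCM', 'BCM', 'MDACC'),
    sample_type = c('PDX', 'PDX', 'Normal', 'Tumor'),
    body_location = c('Breast', 'Breast', 'Breast', 'Lung'),
    disease_detail = c(NA, 'Breast - Invasive Carcinoma', NA, 'Lung - Neuroendocrinal'),
    disease_type = c(NA, 'Breast Cancer', NA, 'Lung Cancer')
)

toy_filled <- toy %>%
    mutate(
        # Assumption: only rows with no disease_detail should be filled.
        # %in% is used so that NA in any field counts as a non-match.
        needs_fill = sample_type %in% c('PDX', 'Tumor') &
            contributor %in% 'BCM' &
            body_location %in% 'Breast' &
            is.na(disease_detail),
        disease_detail = if_else(needs_fill, 'Breast - Breast Cancer - NOS', disease_detail),
        disease_type = if_else(needs_fill, 'Breast Cancer', disease_type)
    ) %>%
    select(-needs_fill)

print(toy_filled)
```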

In [12]:
# Quick check that there are no duplicate file names

pdtc_metadata_cleaned$file_name %>% table() %>% as_tibble() %>% filter(n > 1)

.,n
<chr>,<int>


In [13]:
pdtc_metadata_cleaned$file_name %>% length()

In [14]:
pdtc_metadata_cleaned$file_name %>% unique() %>% length()
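
The two length checks above can also be turned into an assertion that stops the notebook if a duplicate ever appears. A minimal base R sketch, using a hypothetical `file_names` vector in place of `pdtc_metadata_cleaned$file_name`:

```r
# Hypothetical file names standing in for the real file_name column
file_names <- c('s1_R1.fastq.gz', 's1_R2.fastq.gz', 's2_R1.fastq.gz')

n_total <- length(file_names)
n_unique <- length(unique(file_names))

# Names that occur more than once (empty when everything is unique)
duplicated_names <- unique(file_names[duplicated(file_names)])

# Halt with an error if the counts differ
stopifnot(n_total == n_unique)
```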

In [15]:
# Fix index name

colnames(pdtc_metadata_cleaned)[1] <- ''

In [16]:
# Export

write_csv(pdtc_metadata_cleaned, '2022-03-02_pdtc_files_metadata_cleaned.csv')