# Summarize PDTC Metadata
## Date: 2022-02-15
## Author: Jeffrey Grover
**Purpose:** Summarize the improvement in the PDTC Metadata.

### Load libraries

In [11]:
library(tidyverse)

### Load the metadata
Now that the metadata has been fixed, what does the missing-data situation look like?

In [12]:
pdtc_metadata <- read_csv('../2022-02-09_pdxnet_pdtc_pdmr_portal_ids/2022-02-15_pdtc_metadata_cleaned.csv')

nrow(pdtc_metadata)
head(pdtc_metadata)

New names:
* `` -> ...1

[1mRows: [22m[34m2180[39m [1mColumns: [22m[34m26[39m

[36m──[39m [1mColumn specification[22m [36m──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────[39m
[1mDelimiter:[22m ","
[31mchr[39m (24): contributor, file_name, patient_id, model_id, tumor_id, case_id, s...
[32mdbl[39m  (2): ...1, paired_end


[36mℹ[39m Use [30m[47m[30m[47m`spec()`[47m[30m[49m[39m to retrieve the full column specification for this data.
[36mℹ[39m Specify the column types or set [30m[47m[30m[47m`show_col_types = FALSE`[47m[30m[49m[39m to quiet this message.



...1,contributor,file_name,patient_id,model_id,tumor_id,case_id,sample_id,sample_type,experimental_strategy,⋯,access_level,data_category,data_format,data_type,platform,is_ffpe,passage,body_location,disease_type,comment
<dbl>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,⋯,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>
1,BCM,LZML_37246_NoIndex_L00X_R1_001.fastq.gz,24452,BCM-0002,2011012-2,,BCM-0002,PDX,RNA-Seq,⋯,Controlled,Raw Sequencing Data,fastq.gz,Raw Reads,Illumina,False,Unknown,Breast,Invasive breast carcinoma,
2,BCM,LZML_37246_NoIndex_L00X_R2_001.fastq.gz,24452,BCM-0002,2011012-2,,BCM-0002,PDX,RNA-Seq,⋯,Controlled,Raw Sequencing Data,fastq.gz,Raw Reads,Illumina,False,Unknown,Breast,Invasive breast carcinoma,
3,BCM,ML6065_GATCAGCG_L00X_R1_001.fastq.gz,2219,BCM-0046,2011012-46,,BCM-0046,PDX,RNA-Seq,⋯,Controlled,Raw Sequencing Data,fastq.gz,Raw Reads,Illumina,False,Unknown,Breast,Invasive breast carcinoma,
4,BCM,ML6065_GATCAGCG_L00X_R2_001.fastq.gz,2219,BCM-0046,2011012-46,,BCM-0046,PDX,RNA-Seq,⋯,Controlled,Raw Sequencing Data,fastq.gz,Raw Reads,Illumina,False,Unknown,Breast,Invasive breast carcinoma,
5,BCM,ML6061_CAGATCTG_L00X_R1_001.fastq.gz,24561,BCM-0104,2011012-104,,BCM-0104,PDX,RNA-Seq,⋯,Controlled,Raw Sequencing Data,fastq.gz,Raw Reads,Illumina,False,Unknown,Breast,Invasive breast carcinoma,
6,BCM,ML6061_CAGATCTG_L00X_R2_001.fastq.gz,24561,BCM-0104,2011012-104,,BCM-0104,PDX,RNA-Seq,⋯,Controlled,Raw Sequencing Data,fastq.gz,Raw Reads,Illumina,False,Unknown,Breast,Invasive breast carcinoma,
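
The `New names` message suggests the CSV was written with an unnamed index column, and readr prints the column specification on every load. A minimal sketch of a quieter load, assuming inline toy CSV text rather than the real file (`csv_text` and `toy_metadata` are hypothetical names for illustration). `show_col_types = FALSE` is the option the message itself recommends:

```r
library(readr)
library(dplyr)

# Toy CSV with an empty first header, standing in for the real file
csv_text <- ",contributor,file_name\n1,BCM,a.fastq.gz\n2,HCI,b.fastq.gz\n"

toy_metadata <- read_csv(I(csv_text), show_col_types = FALSE) %>%
    # Drop the auto-named index column if it is present
    select(-any_of("...1"))

names(toy_metadata)
```

The index column is dropped only in the toy object here. Whether to drop it from the real table depends on whether anything downstream relies on it.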


In [13]:
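# Define function to summarize missing fields.
# A value counts as missing if it is NA, 'Unknown', or '?'.
# is.na() is checked first, so NA cells evaluate to TRUE rather than NA.

summarize_missing_fields <- function(df) {
    colSums(is.na(df) | (df == 'Unknown') | (df == '?')) %>% as.data.frame()
}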

In [14]:
# How many missing values (NA, 'Unknown', or '?') are there in each column?


summarize_missing_fields(pdtc_metadata)

field,n_missing
,<dbl>
...1,0
contributor,0
file_name,0
patient_id,26
model_id,870
tumor_id,1422
case_id,602
sample_id,0
sample_type,0
experimental_strategy,0


Note: Some of these *should* have NAs in them. For example, "passage" is not meaningful for primary tumor samples.
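
Because the helper above lumps NA, "Unknown", and "?" together, it cannot separate fields that do not apply from fields that are genuinely unknown. A minimal base-R sketch of counting the two kinds separately, assuming a toy data frame (`toy` and `count_missing_by_kind` are hypothetical names, not part of the original analysis):

```r
# Toy data standing in for pdtc_metadata (values are made up)
toy <- data.frame(
    sample_type = c("PDX", "Tumor", "Normal"),
    passage     = c("Unknown", NA, NA),
    model_id    = c("M1", NA, "?"),
    stringsAsFactors = FALSE
)

# Count NA separately from the 'Unknown' / '?' placeholders
count_missing_by_kind <- function(df) {
    data.frame(
        field     = names(df),
        n_na      = unname(vapply(df, function(x) sum(is.na(x)), integer(1))),
        n_unknown = unname(vapply(df, function(x) sum(x %in% c("Unknown", "?")), integer(1))),
        stringsAsFactors = FALSE
    )
}

count_missing_by_kind(toy)
```

`%in%` never returns NA, so NA cells are not double counted in `n_unknown`.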

In [5]:
# Break down by PDTC

pdtc_metadata <- pdtc_metadata %>% group_split(contributor)
names(pdtc_metadata) <- c('BCM', 'HCI', 'MDACC', 'UC Davis', 'WISTAR', 'WUSTL')

pdtc_metadata %>% map(head, 1)

...1,contributor,file_name,patient_id,model_id,tumor_id,case_id,sample_id,sample_type,experimental_strategy,⋯,access_level,data_category,data_format,data_type,platform,is_ffpe,passage,body_location,disease_type,comment
<dbl>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,⋯,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>
1,BCM,LZML_37246_NoIndex_L00X_R1_001.fastq.gz,24452,BCM-0002,2011012-2,,BCM-0002,PDX,RNA-Seq,⋯,Controlled,Raw Sequencing Data,fastq.gz,Raw Reads,Illumina,False,Unknown,Breast,Invasive breast carcinoma,

...1,contributor,file_name,patient_id,model_id,tumor_id,case_id,sample_id,sample_type,experimental_strategy,⋯,access_level,data_category,data_format,data_type,platform,is_ffpe,passage,body_location,disease_type,comment
<dbl>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,⋯,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>
209,HCI,14990X12_HCI001_patientnormal_1.txt.gz,HCI001,,,HCI001,14990X12,Normal,WES,⋯,Controlled,Raw Sequencing Data,fastq.gz,Raw Reads,Illumina HiSeq,Unknown,,Breast,Breast Poorly Differentiated,

...1,contributor,file_name,patient_id,model_id,tumor_id,case_id,sample_id,sample_type,experimental_strategy,⋯,access_level,data_category,data_format,data_type,platform,is_ffpe,passage,body_location,disease_type,comment
<dbl>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,⋯,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>
455,MDACC,BFang-LungPDX-MS11DNA-TC211W.R1.fastq.gz,TC211,,,TC211,TC211W,Normal,WES,⋯,Controlled,Raw Sequencing Data,fastq.gz,Raw Reads,Illumina Hiseq,Unknown,,Blood,Blood,

...1,contributor,file_name,patient_id,model_id,tumor_id,case_id,sample_id,sample_type,experimental_strategy,⋯,access_level,data_category,data_format,data_type,platform,is_ffpe,passage,body_location,disease_type,comment
<dbl>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,⋯,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>
873,UC Davis,BL0269F404_P0_GES10_166_ATCACG_L004_R1_ALL.fastq,BL0269F,BL0269F,BL0269F,BL0269F,BL0269F,Tumor,RNA-Seq,⋯,Controlled,Raw Sequencing Data,fastq,Raw Reads,Illumina,Unknown,P0,Bladder,bladder urothelial carcinoma,

...1,contributor,file_name,patient_id,model_id,tumor_id,case_id,sample_id,sample_type,experimental_strategy,⋯,access_level,data_category,data_format,data_type,platform,is_ffpe,passage,body_location,disease_type,comment
<dbl>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,⋯,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>
921,WISTAR,wm3907n_R1.fastq.gz,WM3907,wm3907n,,,wm3907n,Tumor,WES,⋯,Open Access,Raw Sequencing Data,fastq.gz,Raw Reads,Illumina,False,,Skin,Melanoma,

...1,contributor,file_name,patient_id,model_id,tumor_id,case_id,sample_id,sample_type,experimental_strategy,⋯,access_level,data_category,data_format,data_type,platform,is_ffpe,passage,body_location,disease_type,comment
<dbl>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,⋯,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>
1303,WUSTL,172845-121-T.r1.fastq.gz,PDMR-172845,172845-121-T_SL,121-T,PDMR-172845,172845-121-T_SL,PDX,RNA-Seq,⋯,Open,Raw Sequencing Data,fastq.gz,Raw Reads,NovaSeq S4,Unknown,P3,Digestive/Gastrointestinal,Adenocarcinoma - colon,
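
The hard-coded vector of names in the cell above relies on `group_split()` returning the groups in exactly that order. A sketch of taking the names from the data instead, using dplyr's `group_keys()` on toy data (`toy`, `grouped`, and `by_pdtc` are hypothetical names):

```r
library(dplyr)

# Toy data standing in for pdtc_metadata (values are made up)
toy <- tibble::tibble(
    contributor = c("WUSTL", "BCM", "HCI", "BCM"),
    file_name   = c("a.fastq.gz", "b.fastq.gz", "c.fastq.gz", "d.fastq.gz")
)

grouped <- toy %>% group_by(contributor)
by_pdtc <- grouped %>% group_split()

# group_keys() lists the groups in the same order group_split() returns them
names(by_pdtc) <- group_keys(grouped)$contributor

names(by_pdtc)
```

This keeps the names correct if a contributor is added or removed later.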


In [6]:
# Break down the missing metadata fields by PDTC

pdtc_metadata %>% map(summarize_missing_fields)

$BCM
field,n_missing
,<dbl>
...1,0
contributor,0
file_name,0
patient_id,2
model_id,0
tumor_id,16
case_id,208
sample_id,0
sample_type,0
experimental_strategy,0

$HCI
field,n_missing
,<dbl>
...1,0
contributor,0
file_name,0
patient_id,0
model_id,246
tumor_id,246
case_id,0
sample_id,0
sample_type,0
experimental_strategy,0

$MDACC
field,n_missing
,<dbl>
...1,0
contributor,0
file_name,0
patient_id,12
model_id,264
tumor_id,418
case_id,12
sample_id,0
sample_type,0
experimental_strategy,0

$`UC Davis`
field,n_missing
,<dbl>
...1,0
contributor,0
file_name,0
patient_id,0
model_id,0
tumor_id,0
case_id,0
sample_id,0
sample_type,0
experimental_strategy,0

$WISTAR
field,n_missing
,<dbl>
...1,0
contributor,0
file_name,0
patient_id,12
model_id,0
tumor_id,382
case_id,382
sample_id,0
sample_type,0
experimental_strategy,0

$WUSTL
field,n_missing
,<dbl>
...1,0
contributor,0
file_name,0
patient_id,0
model_id,360
tumor_id,360
case_id,0
sample_id,0
sample_type,0
experimental_strategy,0


### How many missing metadata fields were there originally?

In [7]:
# Load the original metadata

original_pdtc_metadata <- read_csv('../2022-02-09_pdxnet_pdtc_pdmr_portal_ids/2022-02-09_pdxnet_portal_pdtc_files.csv')

nrow(original_pdtc_metadata)
head(original_pdtc_metadata)

New names:
* `` -> ...1

[1mRows: [22m[34m2822[39m [1mColumns: [22m[34m29[39m

[36m──[39m [1mColumn specification[22m [36m──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────[39m
[1mDelimiter:[22m ","
[31mchr[39m (23): contributor, file_name, patient_id, model_id, tumor_id, case_id, s...
[32mdbl[39m  (4): ...1, paired_end, created_datetime, file_size
[33mlgl[39m  (2): is_ffpe, investigation


[36mℹ[39m Use [30m[47m[30m[47m`spec()`[47m[30m[49m[39m to retrieve the full column specification for this data.
[36mℹ[39m Specify the column types or set [30m[47m[30m[47m`show_col_types = FALSE`[47m[30m[49m[39m to quiet this message.



...1,contributor,file_name,patient_id,model_id,tumor_id,case_id,sample_id,sample_type,experimental_strategy,⋯,data_format,data_type,platform,is_ffpe,file_size,passage,disease_type,comment,investigation,disease_detail
<dbl>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,⋯,<chr>,<chr>,<chr>,<lgl>,<dbl>,<chr>,<chr>,<chr>,<lgl>,<chr>
1,BCM,LZML_37246_NoIndex_L00X_R1_001.fastq.gz,24452,BCM-0002,2011012-2,,BCM-0002,PDX,RNA-Seq,⋯,FASTQ.GZ,Raw Reads,Illumina,False,20656964441,,Breast,,,Invasive breast carcinoma
2,BCM,LZML_37246_NoIndex_L00X_R2_001.fastq.gz,24452,BCM-0002,2011012-2,,BCM-0002,PDX,RNA-Seq,⋯,FASTQ.GZ,Raw Reads,Illumina,False,21093643386,,Breast,,,Invasive breast carcinoma
3,BCM,ML6065_GATCAGCG_L00X_R1_001.fastq.gz,2219,BCM-0046,2011012-46,,BCM-0046,PDX,RNA-Seq,⋯,FASTQ.GZ,Raw Reads,Illumina,False,22027304876,,Breast,,,Invasive breast carcinoma
4,BCM,ML6065_GATCAGCG_L00X_R2_001.fastq.gz,2219,BCM-0046,2011012-46,,BCM-0046,PDX,RNA-Seq,⋯,FASTQ.GZ,Raw Reads,Illumina,False,22027304876,,Breast,,,Invasive breast carcinoma
5,BCM,ML6061_CAGATCTG_L00X_R1_001.fastq.gz,24561,BCM-0104,2011012-104,,BCM-0104,PDX,RNA-Seq,⋯,FASTQ.GZ,Raw Reads,Illumina,False,17927987899,,Breast,,,Invasive breast carcinoma
6,BCM,ML6061_CAGATCTG_L00X_R2_001.fastq.gz,24561,BCM-0104,2011012-104,,BCM-0104,PDX,RNA-Seq,⋯,FASTQ.GZ,Raw Reads,Illumina,False,18002109303,,Breast,,,Invasive breast carcinoma


In [8]:
# How many missing values were there originally?

summarize_missing_fields(original_pdtc_metadata)

field,n_missing
,<dbl>
...1,0
contributor,0
file_name,0
patient_id,26
model_id,1515
tumor_id,2067
case_id,602
sample_id,0
sample_type,0
experimental_strategy,0


In [9]:
# Break down by PDTC

original_pdtc_metadata <- original_pdtc_metadata %>% group_split(contributor)
names(original_pdtc_metadata) <- c('BCM', 'HCI', 'MDACC', 'UC Davis', 'WISTAR', 'WUSTL')

original_pdtc_metadata %>% map(head, 1)

...1,contributor,file_name,patient_id,model_id,tumor_id,case_id,sample_id,sample_type,experimental_strategy,⋯,data_format,data_type,platform,is_ffpe,file_size,passage,disease_type,comment,investigation,disease_detail
<dbl>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,⋯,<chr>,<chr>,<chr>,<lgl>,<dbl>,<chr>,<chr>,<chr>,<lgl>,<chr>
1,BCM,LZML_37246_NoIndex_L00X_R1_001.fastq.gz,24452,BCM-0002,2011012-2,,BCM-0002,PDX,RNA-Seq,⋯,FASTQ.GZ,Raw Reads,Illumina,False,20656964441,,Breast,,,Invasive breast carcinoma

...1,contributor,file_name,patient_id,model_id,tumor_id,case_id,sample_id,sample_type,experimental_strategy,⋯,data_format,data_type,platform,is_ffpe,file_size,passage,disease_type,comment,investigation,disease_detail
<dbl>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,⋯,<chr>,<chr>,<chr>,<lgl>,<dbl>,<chr>,<chr>,<chr>,<lgl>,<chr>
209,HCI,14990X12_HCI001_patientnormal_1.txt.gz,HCI001,,,HCI001,14990X12,Normal,WES,⋯,,,Illumina HiSeq,,,,Breast,,,Breast Poorly Differentiated

...1,contributor,file_name,patient_id,model_id,tumor_id,case_id,sample_id,sample_type,experimental_strategy,⋯,data_format,data_type,platform,is_ffpe,file_size,passage,disease_type,comment,investigation,disease_detail
<dbl>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,⋯,<chr>,<chr>,<chr>,<lgl>,<dbl>,<chr>,<chr>,<chr>,<lgl>,<chr>
1100,MDACC,BFang-LungPDX-MS11DNA-TC211W.R1.fastq.gz,TC211,,,TC211,TC211W,Normal,WES,⋯,,,Illumina Hiseq,,,,Blood,,,Blood

...1,contributor,file_name,patient_id,model_id,tumor_id,case_id,sample_id,sample_type,experimental_strategy,⋯,data_format,data_type,platform,is_ffpe,file_size,passage,disease_type,comment,investigation,disease_detail
<dbl>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,⋯,<chr>,<chr>,<chr>,<lgl>,<dbl>,<chr>,<chr>,<chr>,<lgl>,<chr>
1518,UC Davis,BL0269F404_P0_GES10_166_ATCACG_L004_R1_ALL.fastq,BL0269F,BL0269F,BL0269F,BL0269F,BL0269F,Tumor,RNA-Seq,⋯,fastq,raw data,Illumina,,,P0,Bladder,,,bladder urothelial carcinoma

...1,contributor,file_name,patient_id,model_id,tumor_id,case_id,sample_id,sample_type,experimental_strategy,⋯,data_format,data_type,platform,is_ffpe,file_size,passage,disease_type,comment,investigation,disease_detail
<dbl>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,⋯,<chr>,<chr>,<chr>,<lgl>,<dbl>,<chr>,<chr>,<chr>,<lgl>,<chr>
1566,WISTAR,wm3907n_R1.fastq.gz,WM3907,wm3907n,,,wm3907n,Tumor,WES,⋯,fastq.gz,Raw Reads,Illumina,False,5439234,Tumor,Unknown,,,Unknown

...1,contributor,file_name,patient_id,model_id,tumor_id,case_id,sample_id,sample_type,experimental_strategy,⋯,data_format,data_type,platform,is_ffpe,file_size,passage,disease_type,comment,investigation,disease_detail
<dbl>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,⋯,<chr>,<chr>,<chr>,<lgl>,<dbl>,<chr>,<chr>,<chr>,<lgl>,<chr>
1948,WUSTL,172845-121-T_SL.exome.cram,PDMR-172845,172845-121-T_SL,121-T,PDMR-172845,172845-121-T_SL,PDX,WES,⋯,cram,Aligned Reads,NovaSeq S4,,1681781885,P3,Unknown,,,Unknown


In [10]:
# Get the number of missing metadata fields from the original metadata

original_pdtc_metadata %>% map(summarize_missing_fields)

$BCM
field,n_missing
,<dbl>
...1,0
contributor,0
file_name,0
patient_id,2
model_id,0
tumor_id,16
case_id,208
sample_id,0
sample_type,0
experimental_strategy,0

$HCI
field,n_missing
,<dbl>
...1,0
contributor,0
file_name,0
patient_id,0
model_id,891
tumor_id,891
case_id,0
sample_id,0
sample_type,0
experimental_strategy,0

$MDACC
field,n_missing
,<dbl>
...1,0
contributor,0
file_name,0
patient_id,12
model_id,264
tumor_id,418
case_id,12
sample_id,0
sample_type,0
experimental_strategy,0

$`UC Davis`
field,n_missing
,<dbl>
...1,0
contributor,0
file_name,0
patient_id,0
model_id,0
tumor_id,0
case_id,0
sample_id,0
sample_type,0
experimental_strategy,0

$WISTAR
field,n_missing
,<dbl>
...1,0
contributor,0
file_name,0
patient_id,12
model_id,0
tumor_id,382
case_id,382
sample_id,0
sample_type,0
experimental_strategy,0

$WUSTL
field,n_missing
,<dbl>
...1,0
contributor,0
file_name,0
patient_id,0
model_id,360
tumor_id,360
case_id,0
sample_id,0
sample_type,0
experimental_strategy,0
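
The original and cleaned tables have different row counts (2822 vs 2180), so raw missing counts are not directly comparable between them. A minimal base-R sketch of a side-by-side comparison using missing fractions instead, on toy data (`missing_fraction`, `original_toy`, and `cleaned_toy` are hypothetical names and values, not the real tables):

```r
# Fraction of values per column that are NA, 'Unknown', or '?'
missing_fraction <- function(df) {
    vapply(df, function(x) mean(is.na(x) | x %in% c("Unknown", "?")), numeric(1))
}

# Toy stand-ins for the original and cleaned tables
original_toy <- data.frame(
    model_id = c(NA, NA, "M1", "M2"),
    passage  = c(NA, "Unknown", "P1", "?"),
    stringsAsFactors = FALSE
)
cleaned_toy <- data.frame(
    model_id = c(NA, "M1", "M2"),
    passage  = c("Unknown", "P1", "P2"),
    stringsAsFactors = FALSE
)

# Compare only the fields present in both tables
shared <- intersect(names(original_toy), names(cleaned_toy))
comparison <- data.frame(
    field    = shared,
    original = unname(missing_fraction(original_toy)[shared]),
    cleaned  = unname(missing_fraction(cleaned_toy)[shared]),
    stringsAsFactors = FALSE
)

comparison
```

Restricting to shared fields matters here because columns were renamed and removed between the two versions.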


At this point, only 80 entries are missing disease information (versus more than 500 originally), and 26 are missing `capture_kit` information. Passage remains problematic. Samples missing gender/sex information are probably ones where the patient did not provide it. In addition, all unmerged lanes and non-FASTQ files have been removed and replaced with their merged or FASTQ-converted counterparts. Unnecessary metadata fields such as `created_datetime` and `file_size` were removed. `disease_type` was renamed to `body_location`, and `disease_detail` was renamed to `disease_type`, to match the PDMR fields.
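
The column changes described above can be sketched with dplyr. This is an illustration on a one-row toy table, not the code that produced the cleaned file (`toy` and `renamed` are hypothetical names):

```r
library(dplyr)

# Toy row using the original column layout (file_size value is made up)
toy <- tibble::tibble(
    file_name        = "a.fastq.gz",
    disease_type     = "Breast",
    disease_detail   = "Invasive breast carcinoma",
    created_datetime = 1,
    file_size        = 100
)

renamed <- toy %>%
    # Drop the fields that are not needed
    select(-created_datetime, -file_size) %>%
    # Rename in two steps so the two uses of disease_type never collide
    rename(body_location = disease_type) %>%
    rename(disease_type = disease_detail)

names(renamed)
```

The order of the two `rename()` calls matters: `disease_type` has to be freed up before `disease_detail` can take its name.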

As before, library prep kits for RNA-Seq samples are also recorded in the `capture_kit` column.

Other changes include using "Unknown" for missing data in most fields and NA where a field does not apply. For example, primary tumors get NA for passage and model ID, while samples without a known `disease_type` get "Unknown". All files were set to "Controlled" access unless the PDTC's metadata indicated they were already public.
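
A minimal sketch of the "Unknown" versus NA convention, assuming a hypothetical logical column `is_primary_tumor`. The notebook does not say how primary tumors were identified, so that column and the toy values are assumptions for illustration only:

```r
library(dplyr)

# Toy data; is_primary_tumor is a hypothetical flag, not a real metadata field
toy <- tibble::tibble(
    is_primary_tumor = c(FALSE, TRUE, FALSE),
    passage          = c("P3", "Tumor", NA),
    disease_type     = c("Melanoma", NA, NA)
)

recoded <- toy %>%
    mutate(
        passage = case_when(
            is_primary_tumor ~ NA_character_,  # passage does not apply
            is.na(passage)   ~ "Unknown",      # applies, but not recorded
            TRUE             ~ passage
        ),
        # disease_type applies to every sample, so a gap is always "Unknown"
        disease_type = coalesce(disease_type, "Unknown")
    )

recoded
```

Keeping the two cases distinct is what lets a later summary tell "not applicable" apart from "still needs to be tracked down".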

Many other missing fields were filled in from model information, from metadata included in the center's dbGaP submission, or in some cases from the metadata on the CGC.

This metadata can now also be applied to all of these files on the CGC. Doing so will also reveal any files on the CGC that are not included in this list.
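
Finding files that are on the CGC but absent from this list amounts to an anti-join on file name. A sketch assuming a hypothetical `cgc_files` table with a `file_name` column. How that listing is obtained from the CGC is not covered here, and all values are made up:

```r
library(dplyr)

# Toy stand-ins: the cleaned metadata and a hypothetical CGC file listing
metadata_toy <- tibble::tibble(
    file_name   = c("a.fastq.gz", "b.fastq.gz"),
    contributor = c("BCM", "BCM")
)
cgc_files <- tibble::tibble(
    file_name = c("a.fastq.gz", "b.fastq.gz", "c.bam")
)

# Files on the CGC with no matching metadata row
not_in_metadata <- anti_join(cgc_files, metadata_toy, by = "file_name")

# Metadata rows with no matching file on the CGC
not_on_cgc <- anti_join(metadata_toy, cgc_files, by = "file_name")

not_in_metadata
```

Running the join in both directions also catches metadata rows that point at files no longer present.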

At this time, I don't think it will be possible to fill in the remaining missing information without going back to the individual PDTCs.