# Check PDXNet Portal Metadata
## Date: 2022-02-09
## Author: Jeffrey Grover
**Purpose:** Check for the amount of missing metadata from the PDXNet Portal entries.

### Load libraries

In [1]:
library(tidyverse)
library(readxl)

── [1mAttaching packages[22m ───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── tidyverse 1.3.1 ──

[32m✔[39m [34mggplot2[39m 3.3.5     [32m✔[39m [34mpurrr  [39m 0.3.4
[32m✔[39m [34mtibble [39m 3.1.6     [32m✔[39m [34mdplyr  [39m 1.0.7
[32m✔[39m [34mtidyr  [39m 1.1.4     [32m✔[39m [34mstringr[39m 1.4.0
[32m✔[39m [34mreadr  [39m 2.1.1     [32m✔[39m [34mforcats[39m 0.5.1

── [1mConflicts[22m ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
[31m✖[39m [34mdplyr[39m::[32mfilter()[39m masks [34mstats[39m::filter()
[31m✖[39m [34mdplyr[39m::[32mlag()[39m    masks [34mstats[39m::lag()



### Read in the metadata sheets we have
The PDTC models, PDTC sequencing, and PDMR sequencing.

In [2]:
pdtc_models <- read_csv('2022-02-09_pdxnet_portal_pdtc_models.csv')
colnames(pdtc_models)[1] <- 'Index'
nrow(pdtc_models)
colnames(pdtc_models)

head(pdtc_models)

New names:
* `` -> ...1

[1mRows: [22m[34m334[39m [1mColumns: [22m[34m21[39m

[36m──[39m [1mColumn specification[22m [36m──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────[39m
[1mDelimiter:[22m ","
[31mchr[39m  (18): PDXSource, Contributor, ContributorPDX.ID, PDMR.Patient.ID, Gende...
[32mdbl[39m   (2): ...1, CTEP.SDCCode
[34mdttm[39m  (1): Submission


[36mℹ[39m Use [30m[47m[30m[47m`spec()`[47m[30m[49m[39m to retrieve the full column specification for this data.
[36mℹ[39m Specify the column types or set [30m[47m[30m[47m`show_col_types = FALSE`[47m[30m[49m[39m to quiet this message.



Index,PDXSource,Contributor,ContributorPDX.ID,PDMR.Patient.ID,Gender,CTEP.SDCCode,CTEP.SDCDescription,DiagnosisSubtype,Disease.BodyLocation,⋯,Date.ofDiagnosis,Has.KnownMetastaticDisease,Grade.StageInformation,PatientNotes,Molecular.andIHC.Data,Has.Smoked100.Cigarettes,Race,Ethnicity,AdditionalMedicalHistory,Submission
<dbl>,<chr>,<chr>,<chr>,<chr>,<chr>,<dbl>,<chr>,<chr>,<chr>,⋯,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<dttm>
1,PDXNet Consortium Members,MDACC,B8174,K42829,Female,10009951,Adenocarcinoma - colon,adenocarcinoma of sigmoid colon,Digestive/Gastrointestinal,⋯,42790,Yes,Stage,"Tumor Grade/Stage: Stage IV Location of known metastases: Pelvis, omentum, Chest wall (left), Lymph node (left axillary)",APC c.4037C>G p.S1346*; FBXW7 c.832C>T p.R278*; KRAS c.38G>A p.G13D; TP53 c.427G>A p.V143M,No,White,Not Hispanic or Latino,-,2021-09-02
2,PDXNet Consortium Members,MDACC,B8175,K30337,Female,10009951,Adenocarcinoma - colon,adenocarcinoma,Digestive/Gastrointestinal,⋯,42500,Yes,Stage,"Tumor Grade/Stage: Stage IV, poorly differentiated Location of known metastases: Peritoneum, Liver","KRAS/NRAS WT,BRAF mutated, MSI Stable",No,White,Hispanic or Latino,"Family h/o: 2nd degree relative, Uterine cancer",2018-12-19
3,PDXNet Consortium Members,MDACC,B8176,K45526,Female,10009951,Adenocarcinoma - colon,Lynch syndrome; mucinous and signet ring cell adenocarcinoma,Digestive/Gastrointestinal,⋯,42132,Not Reported,TNM (Pathological),"Tumor Grade/Stage: pT1bpN0pM0, poorly differentiated Location of known metastases: Liver, Abdomen","Germline heterozygous MSH2 c388_389del variant (deleterious) Lynch syndrome, MSI-high; KRAS-G12D; Loss of MSH2 and MSH6;",Yes,White,Hispanic or Latino,"Family History: 1st degree relative, GI cancer involving esophagus, liver, bile duct, stomach, pancreas, colon, rectum",2018-12-19
4,PDXNet Consortium Members,MDACC,B8182,K75566,Female,10009951,Adenocarcinoma - colon,poorly differentiated mucinous and signet ring cell adenocarcinoma,Digestive/Gastrointestinal,⋯,42644,Yes,"Stage, TNM",Tumor Grade/Stage: Stage III T4N2M0 Location of known metastases: Liver,-,No,White,Hispanic or Latino,-,2021-09-02
5,PDXNet Consortium Members,MDACC,B8183,K83548,Male,10009951,Adenocarcinoma - colon,adenocarcinoma,Digestive/Gastrointestinal,⋯,42184,Yes,Stage,"Tumor Grade/Stage: Stage IV, moderately differentiated Location of known metastases: Liver, Duodenum, Pelvis",MSI-Stable; BRAF mutant; KRAS wild type,Yes,Not Provided,Not Provided,-,2018-12-19
6,PDXNet Consortium Members,MDACC,B8207,K49395,Male,10009951,Adenocarcinoma - colon,Sigmoid,Digestive/Gastrointestinal,⋯,41537,Yes,"Stage, TNM","Tumor Grade/Stage: Stage IVB, TxNxM1b, Moderately differentiated Location of known metastases: Liver, Lung","Mutations present in: MDM4, NOTCH1, RB1, TP53, APC",Yes,White,Not Hispanic or Latino,No family history of cancer,2021-09-02


In [3]:
pdtc_files <- read_csv('2022-02-09_pdxnet_portal_pdtc_files.csv')
colnames(pdtc_files)[1] <- 'Index'
nrow(pdtc_files)
colnames(pdtc_files)

head(pdtc_files)

New names:
* `` -> ...1

[1mRows: [22m[34m2822[39m [1mColumns: [22m[34m29[39m

[36m──[39m [1mColumn specification[22m [36m──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────[39m
[1mDelimiter:[22m ","
[31mchr[39m (23): contributor, file_name, patient_id, model_id, tumor_id, case_id, s...
[32mdbl[39m  (4): ...1, paired_end, created_datetime, file_size
[33mlgl[39m  (2): is_ffpe, investigation


[36mℹ[39m Use [30m[47m[30m[47m`spec()`[47m[30m[49m[39m to retrieve the full column specification for this data.
[36mℹ[39m Specify the column types or set [30m[47m[30m[47m`show_col_types = FALSE`[47m[30m[49m[39m to quiet this message.



Index,contributor,file_name,patient_id,model_id,tumor_id,case_id,sample_id,sample_type,experimental_strategy,⋯,data_format,data_type,platform,is_ffpe,file_size,passage,disease_type,comment,investigation,disease_detail
<dbl>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,⋯,<chr>,<chr>,<chr>,<lgl>,<dbl>,<chr>,<chr>,<chr>,<lgl>,<chr>
1,BCM,LZML_37246_NoIndex_L00X_R1_001.fastq.gz,24452,BCM-0002,2011012-2,,BCM-0002,PDX,RNA-Seq,⋯,FASTQ.GZ,Raw Reads,Illumina,False,20656964441,,Breast,,,Invasive breast carcinoma
2,BCM,LZML_37246_NoIndex_L00X_R2_001.fastq.gz,24452,BCM-0002,2011012-2,,BCM-0002,PDX,RNA-Seq,⋯,FASTQ.GZ,Raw Reads,Illumina,False,21093643386,,Breast,,,Invasive breast carcinoma
3,BCM,ML6065_GATCAGCG_L00X_R1_001.fastq.gz,2219,BCM-0046,2011012-46,,BCM-0046,PDX,RNA-Seq,⋯,FASTQ.GZ,Raw Reads,Illumina,False,22027304876,,Breast,,,Invasive breast carcinoma
4,BCM,ML6065_GATCAGCG_L00X_R2_001.fastq.gz,2219,BCM-0046,2011012-46,,BCM-0046,PDX,RNA-Seq,⋯,FASTQ.GZ,Raw Reads,Illumina,False,22027304876,,Breast,,,Invasive breast carcinoma
5,BCM,ML6061_CAGATCTG_L00X_R1_001.fastq.gz,24561,BCM-0104,2011012-104,,BCM-0104,PDX,RNA-Seq,⋯,FASTQ.GZ,Raw Reads,Illumina,False,17927987899,,Breast,,,Invasive breast carcinoma
6,BCM,ML6061_CAGATCTG_L00X_R2_001.fastq.gz,24561,BCM-0104,2011012-104,,BCM-0104,PDX,RNA-Seq,⋯,FASTQ.GZ,Raw Reads,Illumina,False,18002109303,,Breast,,,Invasive breast carcinoma


In [4]:
pdmr_files <- read_csv('2022-02-09_pdxnet_portal_pdmr_files.csv')
colnames(pdmr_files)[1] <- 'Index'
nrow(pdmr_files)
colnames(pdmr_files)

head(pdmr_files)

New names:
* `` -> ...1

[1mRows: [22m[34m14506[39m [1mColumns: [22m[34m22[39m

[36m──[39m [1mColumn specification[22m [36m──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────[39m
[1mDelimiter:[22m ","
[31mchr[39m (19): contributor, name, project, Patient.ID, gender, disease_type, Capt...
[32mdbl[39m  (3): ...1, paired_end, age_at_diagnosis


[36mℹ[39m Use [30m[47m[30m[47m`spec()`[47m[30m[49m[39m to retrieve the full column specification for this data.
[36mℹ[39m Specify the column types or set [30m[47m[30m[47m`show_col_types = FALSE`[47m[30m[49m[39m to quiet this message.



Index,contributor,name,project,Patient.ID,gender,disease_type,Capture.assembly,sample_id,PDMR.Version,⋯,Passage,paired_end,sample_type,platform,age_at_diagnosis,Capture.kit,case_id,Specimen.ID,experimental_strategy,id
<dbl>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,⋯,<chr>,<dbl>,<chr>,<chr>,<dbl>,<chr>,<chr>,<chr>,<chr>,<chr>
1,NCI,253994~v2.0.1.10.0~germlineWES.R1.FASTQ.gz,jeffrey.grover/pdxnet-pdmr-datapool,253994,Male,Colorectal cancer - NOS,hg19,253994,2_0_1_10_0,⋯,,1,normal_germline,Illumina HiSeq,50,SureSelect Human All Exon V5,253994,,WES,5f43cd0be4b0bf4ad141bd5e
2,NCI,174316~266-R~AD7I14~v1.2~WES.R1.FASTQ.gz,jeffrey.grover/pdxnet-pdmr-datapool,174316,Male,Squamous cell carcinoma - skin,hg19,174316--266-R--AD7I14,1_2,⋯,P1,1,PDX,Illumina HiSeq,83,SureSelect Human All Exon V5,174316,266-R,WES,5f43cd0be4b0bf4ad141bd27
3,NCI,345371~062-R~P27PQ6~v1.2~RNASEQ.R2.FASTQ.gz,jeffrey.grover/pdxnet-pdmr-datapool,345371,Female,Salivary gland cancer,hg19,345371--062-R--P27PQ6,1_2,⋯,P1,2,PDX,Illumina HiSeq,65,TruSeq RNA Access,345371,062-R,RNASEQ,5f43cd0be4b0bf4ad141bcde
4,NCI,417821~307-R~TO3~v1.2~WES.R2.FASTQ.gz,jeffrey.grover/pdxnet-pdmr-datapool,417821,Male,Squamous cell lung carcinoma,hg19,417821--307-R--TO3,1_2,⋯,P0,2,PDX,Illumina HiSeq,67,SureSelect Human All Exon V5,417821,307-R,WES,5f43cd0be4b0bf4ad141bce0
5,NCI,594176~295-R~AL8~v1.2~WES.R2.FASTQ.gz,jeffrey.grover/pdxnet-pdmr-datapool,594176,Male,Osteosarcoma,hg19,594176--295-R--AL8,1_2,⋯,P0,2,PDX,Illumina HiSeq,72,SureSelect Human All Exon V5,594176,295-R,WES,5f43cd0be4b0bf4ad141bcdf
6,NCI,283228~195-R~A04VW5UV9~v1.2~RNASEQ.R2.FASTQ.gz,jeffrey.grover/pdxnet-pdmr-datapool,283228,Male,Melanoma,hg19,283228--195-R--A04VW5UV9,1_2,⋯,P2,2,PDX,Illumina HiSeq,48,TruSeq RNA Access,283228,195-R,RNASEQ,5f43cd0be4b0bf4ad141bcd8


### Get information on the missing metadata

In [5]:
pdtc_files_missing_gender <- pdtc_files %>% filter(is.na(gender))
pdtc_files_missing_gender %>% nrow()

In [6]:
pdtc_files_missing_paired_end <- pdtc_files %>% filter(is.na(paired_end))
pdtc_files_missing_paired_end %>% nrow()

In [7]:
pdtc_files_missing_capture_kit <- pdtc_files %>% filter((is.na(capture_kit) | (capture_kit == 'Unknown')))
pdtc_files_missing_capture_kit %>% nrow()

In [8]:
pdtc_files_missing_capture_assembly <- pdtc_files %>% filter(is.na(capture_assembly))
pdtc_files_missing_capture_assembly %>% nrow()

In [9]:
pdtc_files_missing_access_level <- pdtc_files %>% filter(is.na(access_level))
pdtc_files_missing_access_level %>% nrow()

In [10]:
pdtc_files_missing_created_datetime <- pdtc_files %>% filter(is.na(created_datetime))
pdtc_files_missing_created_datetime %>% nrow()

In [11]:
pdtc_files_missing_data_category <- pdtc_files %>% filter(is.na(data_category))
pdtc_files_missing_data_category %>% nrow()

In [12]:
pdtc_files_missing_data_format <- pdtc_files %>% filter(is.na(data_format))
pdtc_files_missing_data_format %>% nrow()

In [13]:
pdtc_files_missing_platform <- pdtc_files %>% filter(is.na(platform))
pdtc_files_missing_platform %>% nrow()

In [14]:
pdtc_files_missing_ffpe <- pdtc_files %>% filter(is.na(is_ffpe))
pdtc_files_missing_ffpe %>% nrow()

In [15]:
pdtc_files_missing_file_size <- pdtc_files %>% filter(is.na(file_size))
pdtc_files_missing_file_size %>% nrow()

In [16]:
pdtc_files_missing_passage <- pdtc_files %>% filter((is.na(passage) | (passage == '?')))
pdtc_files_missing_passage %>% nrow()

In [17]:
pdtc_files_missing_disease_type <- pdtc_files %>% filter((is.na(disease_type) | (disease_type == 'Unknown')))
pdtc_files_missing_disease_type %>% nrow()

In [18]:
pdtc_files_missing_disease_detail <- pdtc_files %>% filter((is.na(disease_detail) | (disease_detail == 'Unknown')))
pdtc_files_missing_disease_detail %>% nrow()

### Starting with cram files that shouldn't be in there

In [19]:
# There are some CRAM files in here for no good reason
# I thought these had been removed?

pdtc_cram_files <- pdtc_files %>% filter(data_format == 'cram')

pdtc_cram_files

Index,contributor,file_name,patient_id,model_id,tumor_id,case_id,sample_id,sample_type,experimental_strategy,⋯,data_format,data_type,platform,is_ffpe,file_size,passage,disease_type,comment,investigation,disease_detail
<dbl>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,⋯,<chr>,<chr>,<chr>,<lgl>,<dbl>,<chr>,<chr>,<chr>,<lgl>,<chr>
1948,WUSTL,172845-121-T_SL.exome.cram,PDMR-172845,172845-121-T_SL,121-T,PDMR-172845,172845-121-T_SL,PDX,WES,⋯,cram,Aligned Reads,NovaSeq S4,,1681781885,P3,Unknown,,,Unknown
1953,WUSTL,625472-104-R_SL.exome.cram,PDMR-625472,625472-104-R_SL,104-R,PDMR-625472,625472-104-R_SL,PDX,WES,⋯,cram,Aligned Reads,NovaSeq S4,,1391091034,P2,Unknown,,,Unknown
1958,WUSTL,BL0293-F563_SL.exome.cram,PDMR-BL0293,BL0293-F563_SL,F563,PDMR-BL0293,BL0293-F563_SL,PDX,WES,⋯,cram,Aligned Reads,NovaSeq S4,,2398860885,P5,Unknown,,,Unknown


These should not be in there, and a quick search through the CGC shows that we have these as fastq files. So, we should replace them with the fastq versions.

In [20]:
# Create a small table to replace these with fastqs by first loading in a small manifest

wustl_cram_converted_fastqs <- read_csv('manifest_20220209_155256_wustl_converted_cram_fastq.csv') %>% mutate(name = basename(name))

wustl_cram_converted_fastqs

[1mRows: [22m[34m6[39m [1mColumns: [22m[34m64[39m

[36m──[39m [1mColumn specification[22m [36m──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────[39m
[1mDelimiter:[22m ","
[31mchr[39m (11): id, name, project, sample_id, sample_type, case_id, Generalized_Sa...
[32mdbl[39m  (2): size, paired_end
[33mlgl[39m (51): trimmed_read_count, p1_1, Is FFPE, passage, prop_ribosomal_bases, ...


[36mℹ[39m Use [30m[47m[30m[47m`spec()`[47m[30m[49m[39m to retrieve the full column specification for this data.
[36mℹ[39m Specify the column types or set [30m[47m[30m[47m`show_col_types = FALSE`[47m[30m[49m[39m to quiet this message.



id,name,size,project,trimmed_read_count,p1_1,Is FFPE,passage,sample_id,prop_ribosomal_bases,⋯,dbgap_upload,model_id,paired_end,__inherit__,update_manifest,species,tumor_id,investigation,prop_20x_cov,experimental_strategy
<chr>,<chr>,<dbl>,<chr>,<lgl>,<lgl>,<lgl>,<lgl>,<chr>,<lgl>,⋯,<lgl>,<lgl>,<dbl>,<lgl>,<lgl>,<lgl>,<lgl>,<lgl>,<lgl>,<chr>
5d891ad1e4b06d0cb4926ad8,172845-121-T.pe_1.fastq.paired.fq,16296437319,pdxnet/pdxnet-datapool,,,,,172845-121-T_SL,,⋯,,,1,,,,,,,WES
5d891ad1e4b06d0cb4926ad6,172845-121-T.pe_2.fastq.paired.fq,16296437319,pdxnet/pdxnet-datapool,,,,,172845-121-T_SL,,⋯,,,2,,,,,,,WES
5d8911f4e4b065d300c6a152,625472-104-R.pe_1.fastq.paired.fq,12612746052,pdxnet/pdxnet-datapool,,,,,625472-104-R_SL,,⋯,,,1,,,,,,,WES
5d8911f4e4b065d300c6a154,625472-104-R.pe_2.fastq.paired.fq,12612746052,pdxnet/pdxnet-datapool,,,,,625472-104-R_SL,,⋯,,,2,,,,,,,WES
5d89259ee4b065d300c6e80b,BL0293-F563.pe_1.fastq.paired.fq,21889170246,pdxnet/pdxnet-datapool,,,,,BL0293-F563_SL,,⋯,,,1,,,,,,,WES
5d89259ee4b065d300c6e80d,BL0293-F563.pe_2.fastq.paired.fq,21889170246,pdxnet/pdxnet-datapool,,,,,BL0293-F563_SL,,⋯,,,2,,,,,,,WES


In [21]:
pdtc_cram_files_converted_fastq <- pdtc_cram_files[c(1, 1, 2, 2, 3, 3), ]

pdtc_cram_files_converted_fastq

Index,contributor,file_name,patient_id,model_id,tumor_id,case_id,sample_id,sample_type,experimental_strategy,⋯,data_format,data_type,platform,is_ffpe,file_size,passage,disease_type,comment,investigation,disease_detail
<dbl>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,⋯,<chr>,<chr>,<chr>,<lgl>,<dbl>,<chr>,<chr>,<chr>,<lgl>,<chr>
1948,WUSTL,172845-121-T_SL.exome.cram,PDMR-172845,172845-121-T_SL,121-T,PDMR-172845,172845-121-T_SL,PDX,WES,⋯,cram,Aligned Reads,NovaSeq S4,,1681781885,P3,Unknown,,,Unknown
1948,WUSTL,172845-121-T_SL.exome.cram,PDMR-172845,172845-121-T_SL,121-T,PDMR-172845,172845-121-T_SL,PDX,WES,⋯,cram,Aligned Reads,NovaSeq S4,,1681781885,P3,Unknown,,,Unknown
1953,WUSTL,625472-104-R_SL.exome.cram,PDMR-625472,625472-104-R_SL,104-R,PDMR-625472,625472-104-R_SL,PDX,WES,⋯,cram,Aligned Reads,NovaSeq S4,,1391091034,P2,Unknown,,,Unknown
1953,WUSTL,625472-104-R_SL.exome.cram,PDMR-625472,625472-104-R_SL,104-R,PDMR-625472,625472-104-R_SL,PDX,WES,⋯,cram,Aligned Reads,NovaSeq S4,,1391091034,P2,Unknown,,,Unknown
1958,WUSTL,BL0293-F563_SL.exome.cram,PDMR-BL0293,BL0293-F563_SL,F563,PDMR-BL0293,BL0293-F563_SL,PDX,WES,⋯,cram,Aligned Reads,NovaSeq S4,,2398860885,P5,Unknown,,,Unknown
1958,WUSTL,BL0293-F563_SL.exome.cram,PDMR-BL0293,BL0293-F563_SL,F563,PDMR-BL0293,BL0293-F563_SL,PDX,WES,⋯,cram,Aligned Reads,NovaSeq S4,,2398860885,P5,Unknown,,,Unknown


In [22]:
# Replace with the fastq files

pdtc_cram_files_converted_fastq$file_name <- wustl_cram_converted_fastqs$name
pdtc_cram_files_converted_fastq$file_size <- wustl_cram_converted_fastqs$size

pdtc_cram_files_converted_fastq$data_category <- 'Raw Sequencing Data'
pdtc_cram_files_converted_fastq$data_format <- 'fastq'
pdtc_cram_files_converted_fastq$data_type <- 'Raw Reads'

pdtc_cram_files_converted_fastq$paired_end <- c(1, 2, 1, 2, 1, 2)

pdtc_cram_files_converted_fastq

Index,contributor,file_name,patient_id,model_id,tumor_id,case_id,sample_id,sample_type,experimental_strategy,⋯,data_format,data_type,platform,is_ffpe,file_size,passage,disease_type,comment,investigation,disease_detail
<dbl>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,⋯,<chr>,<chr>,<chr>,<lgl>,<dbl>,<chr>,<chr>,<chr>,<lgl>,<chr>
1948,WUSTL,172845-121-T.pe_1.fastq.paired.fq,PDMR-172845,172845-121-T_SL,121-T,PDMR-172845,172845-121-T_SL,PDX,WES,⋯,fastq,Raw Reads,NovaSeq S4,,16296437319,P3,Unknown,,,Unknown
1948,WUSTL,172845-121-T.pe_2.fastq.paired.fq,PDMR-172845,172845-121-T_SL,121-T,PDMR-172845,172845-121-T_SL,PDX,WES,⋯,fastq,Raw Reads,NovaSeq S4,,16296437319,P3,Unknown,,,Unknown
1953,WUSTL,625472-104-R.pe_1.fastq.paired.fq,PDMR-625472,625472-104-R_SL,104-R,PDMR-625472,625472-104-R_SL,PDX,WES,⋯,fastq,Raw Reads,NovaSeq S4,,12612746052,P2,Unknown,,,Unknown
1953,WUSTL,625472-104-R.pe_2.fastq.paired.fq,PDMR-625472,625472-104-R_SL,104-R,PDMR-625472,625472-104-R_SL,PDX,WES,⋯,fastq,Raw Reads,NovaSeq S4,,12612746052,P2,Unknown,,,Unknown
1958,WUSTL,BL0293-F563.pe_1.fastq.paired.fq,PDMR-BL0293,BL0293-F563_SL,F563,PDMR-BL0293,BL0293-F563_SL,PDX,WES,⋯,fastq,Raw Reads,NovaSeq S4,,21889170246,P5,Unknown,,,Unknown
1958,WUSTL,BL0293-F563.pe_2.fastq.paired.fq,PDMR-BL0293,BL0293-F563_SL,F563,PDMR-BL0293,BL0293-F563_SL,PDX,WES,⋯,fastq,Raw Reads,NovaSeq S4,,21889170246,P5,Unknown,,,Unknown


In [23]:
# Export

pdtc_cram_files_converted_fastq %>% write_csv('2022-02-09_wustl_cram_fastq_converted_for_portal.csv')

In [24]:
# Put these converted cram files back in

edited_pdtc_files <- pdtc_files %>% filter((data_format != 'cram') | is.na(data_format))

edited_pdtc_files <- bind_rows(edited_pdtc_files, pdtc_cram_files_converted_fastq) %>% arrange('index')

edited_pdtc_files %>% nrow()
head(edited_pdtc_files)

Index,contributor,file_name,patient_id,model_id,tumor_id,case_id,sample_id,sample_type,experimental_strategy,⋯,data_format,data_type,platform,is_ffpe,file_size,passage,disease_type,comment,investigation,disease_detail
<dbl>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,⋯,<chr>,<chr>,<chr>,<lgl>,<dbl>,<chr>,<chr>,<chr>,<lgl>,<chr>
1,BCM,LZML_37246_NoIndex_L00X_R1_001.fastq.gz,24452,BCM-0002,2011012-2,,BCM-0002,PDX,RNA-Seq,⋯,FASTQ.GZ,Raw Reads,Illumina,False,20656964441,,Breast,,,Invasive breast carcinoma
2,BCM,LZML_37246_NoIndex_L00X_R2_001.fastq.gz,24452,BCM-0002,2011012-2,,BCM-0002,PDX,RNA-Seq,⋯,FASTQ.GZ,Raw Reads,Illumina,False,21093643386,,Breast,,,Invasive breast carcinoma
3,BCM,ML6065_GATCAGCG_L00X_R1_001.fastq.gz,2219,BCM-0046,2011012-46,,BCM-0046,PDX,RNA-Seq,⋯,FASTQ.GZ,Raw Reads,Illumina,False,22027304876,,Breast,,,Invasive breast carcinoma
4,BCM,ML6065_GATCAGCG_L00X_R2_001.fastq.gz,2219,BCM-0046,2011012-46,,BCM-0046,PDX,RNA-Seq,⋯,FASTQ.GZ,Raw Reads,Illumina,False,22027304876,,Breast,,,Invasive breast carcinoma
5,BCM,ML6061_CAGATCTG_L00X_R1_001.fastq.gz,24561,BCM-0104,2011012-104,,BCM-0104,PDX,RNA-Seq,⋯,FASTQ.GZ,Raw Reads,Illumina,False,17927987899,,Breast,,,Invasive breast carcinoma
6,BCM,ML6061_CAGATCTG_L00X_R2_001.fastq.gz,24561,BCM-0104,2011012-104,,BCM-0104,PDX,RNA-Seq,⋯,FASTQ.GZ,Raw Reads,Illumina,False,18002109303,,Breast,,,Invasive breast carcinoma


### Remove non-concatenated reads
Replace with concatenated versions.

In [25]:
# Read in the non-concatenated data

non_concat_lanes <- read_excel('Individual_Lane_Files_for_Removal.xlsx')

nrow(non_concat_lanes)
head(non_concat_lanes)

contributor,file_name,patient_id,model_id,tumor_id,case_id,sample_id,sample_type,experimental_strategy,gender,⋯,data_format,data_type,platform,is_ffpe,file_size,passage,disease_type,comment,investigation,disease_detail
<chr>,<chr>,<chr>,<lgl>,<lgl>,<chr>,<chr>,<chr>,<chr>,<chr>,⋯,<lgl>,<lgl>,<chr>,<lgl>,<lgl>,<chr>,<chr>,<lgl>,<lgl>,<chr>
HCI,lane1-index03-TTAGGC-0908263-X_TTAGGC_L001_R1_001.fastq.gz,HCI007,,,HCI007,0908263-X,PDX,WES,Female,⋯,,,Illumina HiSeq,,,?,Breast,,,Breast Pleural Effusion
HCI,lane1-index03-TTAGGC-0908263-X_TTAGGC_L001_R1_002.fastq.gz,HCI007,,,HCI007,0908263-X,PDX,WES,Female,⋯,,,Illumina HiSeq,,,?,Breast,,,Breast Pleural Effusion
HCI,lane1-index03-TTAGGC-0908263-X_TTAGGC_L001_R1_003.fastq.gz,HCI007,,,HCI007,0908263-X,PDX,WES,Female,⋯,,,Illumina HiSeq,,,?,Breast,,,Breast Pleural Effusion
HCI,lane1-index03-TTAGGC-0908263-X_TTAGGC_L001_R1_004.fastq.gz,HCI007,,,HCI007,0908263-X,PDX,WES,Female,⋯,,,Illumina HiSeq,,,?,Breast,,,Breast Pleural Effusion
HCI,lane1-index03-TTAGGC-0908263-X_TTAGGC_L001_R1_005.fastq.gz,HCI007,,,HCI007,0908263-X,PDX,WES,Female,⋯,,,Illumina HiSeq,,,?,Breast,,,Breast Pleural Effusion
HCI,lane1-index03-TTAGGC-0908263-X_TTAGGC_L001_R1_006.fastq.gz,HCI007,,,HCI007,0908263-X,PDX,WES,Female,⋯,,,Illumina HiSeq,,,?,Breast,,,Breast Pleural Effusion


These are all from HCI. And I definitely helped Matt put the right ones in the dbGaP submission.

In [26]:
# How many unique sample_ids are there?

non_concat_lanes$sample_id %>% unique()

In [27]:
# Load in the manifest of the concatenated versions

hci_concat_fastqs <- read_csv('manifest_20220209_165026_hci_concatenated_fastqs.csv') %>% mutate(name = basename(name))

nrow(hci_concat_fastqs)
head(hci_concat_fastqs)

[1mRows: [22m[34m48[39m [1mColumns: [22m[34m64[39m

[36m──[39m [1mColumn specification[22m [36m──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────[39m
[1mDelimiter:[22m ","
[31mchr[39m (17): id, name, project, sample_id, sample_type, RestrictiveIRB, case_id...
[32mdbl[39m  (2): size, paired_end
[33mlgl[39m (45): trimmed_read_count, p1_1, Is FFPE, passage, prop_ribosomal_bases, ...


[36mℹ[39m Use [30m[47m[30m[47m`spec()`[47m[30m[49m[39m to retrieve the full column specification for this data.
[36mℹ[39m Specify the column types or set [30m[47m[30m[47m`show_col_types = FALSE`[47m[30m[49m[39m to quiet this message.



id,name,size,project,trimmed_read_count,p1_1,Is FFPE,passage,sample_id,prop_ribosomal_bases,⋯,dbgap_upload,model_id,paired_end,__inherit__,update_manifest,species,tumor_id,investigation,prop_20x_cov,experimental_strategy
<chr>,<chr>,<dbl>,<chr>,<lgl>,<lgl>,<lgl>,<lgl>,<chr>,<lgl>,⋯,<chr>,<lgl>,<dbl>,<lgl>,<lgl>,<lgl>,<lgl>,<lgl>,<lgl>,<chr>
5d5ffcfce4b011b5291791a8,0908263-X_HCI-007_PDXtumor_R1.fastq.gz,3992597315,pdxnet/pdxnet-datapool,,,,,0908263-X,,⋯,NO,,1,,,,,,,WES
5d5ffcfce4b011b5291791ea,0908263-X_HCI-007_PDXtumor_R2.fastq.gz,3809574763,pdxnet/pdxnet-datapool,,,,,0908263-X,,⋯,NO,,2,,,,,,,WES
5d5ffcfce4b011b52917902b,1007649-T_HCI-008_Patient_tumor_R1.fastq.gz,4687766228,pdxnet/pdxnet-datapool,,,,,1007649-T,,⋯,NO,,1,,,,,,,WES
5d5ffcfce4b011b529179187,1007649-T_HCI-008_Patient_tumor_R2.fastq.gz,4442360214,pdxnet/pdxnet-datapool,,,,,1007649-T,,⋯,NO,,2,,,,,,,WES
5d5ffcfce4b011b5291791dd,1007649-X_HCI-008_PDXtumor_R1.fastq.gz,3846953257,pdxnet/pdxnet-datapool,,,,,1007649-X,,⋯,NO,,1,,,,,,,WES
5d5ffcfce4b011b5291791e4,1007649-X_HCI-008_PDXtumor_R2.fastq.gz,3638143772,pdxnet/pdxnet-datapool,,,,,1007649-X,,⋯,NO,,2,,,,,,,WES


In [28]:
# Create a table of the rest of the metadata that exists for the non-concat fastqs

edited_hci_concat_lanes <- non_concat_lanes[c(1, 1, 25, 25, 53, 53, 75, 75, 117, 117, 157, 157, 183, 183, 209, 209, 233, 233, 253, 253, 273, 273, 295, 295, 327, 327, 351, 351, 381, 381,
                                              413, 413, 439, 439, 469, 469, 503, 503, 533, 533, 563, 563, 580, 580, 616, 616, 654, 654), ]

nrow(edited_hci_concat_lanes)
head(edited_hci_concat_lanes)

contributor,file_name,patient_id,model_id,tumor_id,case_id,sample_id,sample_type,experimental_strategy,gender,⋯,data_format,data_type,platform,is_ffpe,file_size,passage,disease_type,comment,investigation,disease_detail
<chr>,<chr>,<chr>,<lgl>,<lgl>,<chr>,<chr>,<chr>,<chr>,<chr>,⋯,<lgl>,<lgl>,<chr>,<lgl>,<lgl>,<chr>,<chr>,<lgl>,<lgl>,<chr>
HCI,lane1-index03-TTAGGC-0908263-X_TTAGGC_L001_R1_001.fastq.gz,HCI007,,,HCI007,0908263-X,PDX,WES,Female,⋯,,,Illumina HiSeq,,,?,Breast,,,Breast Pleural Effusion
HCI,lane1-index03-TTAGGC-0908263-X_TTAGGC_L001_R1_001.fastq.gz,HCI007,,,HCI007,0908263-X,PDX,WES,Female,⋯,,,Illumina HiSeq,,,?,Breast,,,Breast Pleural Effusion
HCI,lane1-index04-TGACCA-1007649-T_TGACCA_L001_R1_001.fastq.gz,HCI008,,,HCI008,1007649-T,Tumor,WES,Female,⋯,,,Illumina HiSeq,,,,Breast,,,Breast Pleural Effusion
HCI,lane1-index04-TGACCA-1007649-T_TGACCA_L001_R1_001.fastq.gz,HCI008,,,HCI008,1007649-T,Tumor,WES,Female,⋯,,,Illumina HiSeq,,,,Breast,,,Breast Pleural Effusion
HCI,lane1-index05-ACAGTG-1007649-X_ACAGTG_L001_R1_001.fastq.gz,HCI008,,,HCI008,1007649-X,PDX,WES,Female,⋯,,,Illumina HiSeq,,,?,Breast,,,Breast Pleural Effusion
HCI,lane1-index05-ACAGTG-1007649-X_ACAGTG_L001_R1_001.fastq.gz,HCI008,,,HCI008,1007649-X,PDX,WES,Female,⋯,,,Illumina HiSeq,,,?,Breast,,,Breast Pleural Effusion


In [29]:
# Also need the index column

edited_hci_concat_lanes <- edited_hci_concat_lanes %>% left_join(select(pdtc_files, Index, file_name)) %>% select(Index, contributor:disease_detail)

nrow(edited_hci_concat_lanes)
head(edited_hci_concat_lanes)

Joining, by = "file_name"



Index,contributor,file_name,patient_id,model_id,tumor_id,case_id,sample_id,sample_type,experimental_strategy,⋯,data_format,data_type,platform,is_ffpe,file_size,passage,disease_type,comment,investigation,disease_detail
<dbl>,<chr>,<chr>,<chr>,<lgl>,<lgl>,<chr>,<chr>,<chr>,<chr>,⋯,<lgl>,<lgl>,<chr>,<lgl>,<lgl>,<chr>,<chr>,<lgl>,<lgl>,<chr>
519,HCI,lane1-index03-TTAGGC-0908263-X_TTAGGC_L001_R1_001.fastq.gz,HCI007,,,HCI007,0908263-X,PDX,WES,⋯,,,Illumina HiSeq,,,?,Breast,,,Breast Pleural Effusion
519,HCI,lane1-index03-TTAGGC-0908263-X_TTAGGC_L001_R1_001.fastq.gz,HCI007,,,HCI007,0908263-X,PDX,WES,⋯,,,Illumina HiSeq,,,?,Breast,,,Breast Pleural Effusion
933,HCI,lane1-index04-TGACCA-1007649-T_TGACCA_L001_R1_001.fastq.gz,HCI008,,,HCI008,1007649-T,Tumor,WES,⋯,,,Illumina HiSeq,,,,Breast,,,Breast Pleural Effusion
933,HCI,lane1-index04-TGACCA-1007649-T_TGACCA_L001_R1_001.fastq.gz,HCI008,,,HCI008,1007649-T,Tumor,WES,⋯,,,Illumina HiSeq,,,,Breast,,,Breast Pleural Effusion
547,HCI,lane1-index05-ACAGTG-1007649-X_ACAGTG_L001_R1_001.fastq.gz,HCI008,,,HCI008,1007649-X,PDX,WES,⋯,,,Illumina HiSeq,,,?,Breast,,,Breast Pleural Effusion
547,HCI,lane1-index05-ACAGTG-1007649-X_ACAGTG_L001_R1_001.fastq.gz,HCI008,,,HCI008,1007649-X,PDX,WES,⋯,,,Illumina HiSeq,,,?,Breast,,,Breast Pleural Effusion


In [30]:
# Double check that they're all in the same order

all(hci_concat_fastqs$sample_id == edited_hci_concat_lanes$sample_id)

In [31]:
# Merge

edited_hci_concat_lanes$file_name <- hci_concat_fastqs$name
edited_hci_concat_lanes$file_size <- hci_concat_fastqs$size

edited_hci_concat_lanes$data_category <- 'Raw Sequencing Data'
edited_hci_concat_lanes$data_format <- 'fastq.gz'
edited_hci_concat_lanes$data_type <- 'Raw Reads'

nrow(edited_hci_concat_lanes)
head(edited_hci_concat_lanes)

Index,contributor,file_name,patient_id,model_id,tumor_id,case_id,sample_id,sample_type,experimental_strategy,⋯,data_format,data_type,platform,is_ffpe,file_size,passage,disease_type,comment,investigation,disease_detail
<dbl>,<chr>,<chr>,<chr>,<lgl>,<lgl>,<chr>,<chr>,<chr>,<chr>,⋯,<chr>,<chr>,<chr>,<lgl>,<dbl>,<chr>,<chr>,<lgl>,<lgl>,<chr>
519,HCI,0908263-X_HCI-007_PDXtumor_R1.fastq.gz,HCI007,,,HCI007,0908263-X,PDX,WES,⋯,fastq.gz,Raw Reads,Illumina HiSeq,,3992597315,?,Breast,,,Breast Pleural Effusion
519,HCI,0908263-X_HCI-007_PDXtumor_R2.fastq.gz,HCI007,,,HCI007,0908263-X,PDX,WES,⋯,fastq.gz,Raw Reads,Illumina HiSeq,,3809574763,?,Breast,,,Breast Pleural Effusion
933,HCI,1007649-T_HCI-008_Patient_tumor_R1.fastq.gz,HCI008,,,HCI008,1007649-T,Tumor,WES,⋯,fastq.gz,Raw Reads,Illumina HiSeq,,4687766228,,Breast,,,Breast Pleural Effusion
933,HCI,1007649-T_HCI-008_Patient_tumor_R2.fastq.gz,HCI008,,,HCI008,1007649-T,Tumor,WES,⋯,fastq.gz,Raw Reads,Illumina HiSeq,,4442360214,,Breast,,,Breast Pleural Effusion
547,HCI,1007649-X_HCI-008_PDXtumor_R1.fastq.gz,HCI008,,,HCI008,1007649-X,PDX,WES,⋯,fastq.gz,Raw Reads,Illumina HiSeq,,3846953257,?,Breast,,,Breast Pleural Effusion
547,HCI,1007649-X_HCI-008_PDXtumor_R2.fastq.gz,HCI008,,,HCI008,1007649-X,PDX,WES,⋯,fastq.gz,Raw Reads,Illumina HiSeq,,3638143772,?,Breast,,,Breast Pleural Effusion


In [32]:
# Export

edited_hci_concat_lanes %>% write_csv('2022-02-09_hci_concat_lanes_for_portal.csv')

In [33]:
# Add them back into the full dataset

# First remove the existing ones

edited_pdtc_files <- edited_pdtc_files %>% filter(!(file_name %in% non_concat_lanes$file_name))
edited_pdtc_files %>% nrow()

In [34]:
# This number of rows, plus the number of ones which needed removal should add up to 2825
# Which was the orginal amount of rows

2142 + 683

In [35]:
# Put these in and rearrange the rows based on the index

edited_pdtc_files <- bind_rows(edited_pdtc_files, edited_hci_concat_lanes) %>% arrange('index')

edited_pdtc_files %>% nrow()
head(edited_pdtc_files)

Index,contributor,file_name,patient_id,model_id,tumor_id,case_id,sample_id,sample_type,experimental_strategy,⋯,data_format,data_type,platform,is_ffpe,file_size,passage,disease_type,comment,investigation,disease_detail
<dbl>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,⋯,<chr>,<chr>,<chr>,<lgl>,<dbl>,<chr>,<chr>,<chr>,<lgl>,<chr>
1,BCM,LZML_37246_NoIndex_L00X_R1_001.fastq.gz,24452,BCM-0002,2011012-2,,BCM-0002,PDX,RNA-Seq,⋯,FASTQ.GZ,Raw Reads,Illumina,False,20656964441,,Breast,,,Invasive breast carcinoma
2,BCM,LZML_37246_NoIndex_L00X_R2_001.fastq.gz,24452,BCM-0002,2011012-2,,BCM-0002,PDX,RNA-Seq,⋯,FASTQ.GZ,Raw Reads,Illumina,False,21093643386,,Breast,,,Invasive breast carcinoma
3,BCM,ML6065_GATCAGCG_L00X_R1_001.fastq.gz,2219,BCM-0046,2011012-46,,BCM-0046,PDX,RNA-Seq,⋯,FASTQ.GZ,Raw Reads,Illumina,False,22027304876,,Breast,,,Invasive breast carcinoma
4,BCM,ML6065_GATCAGCG_L00X_R2_001.fastq.gz,2219,BCM-0046,2011012-46,,BCM-0046,PDX,RNA-Seq,⋯,FASTQ.GZ,Raw Reads,Illumina,False,22027304876,,Breast,,,Invasive breast carcinoma
5,BCM,ML6061_CAGATCTG_L00X_R1_001.fastq.gz,24561,BCM-0104,2011012-104,,BCM-0104,PDX,RNA-Seq,⋯,FASTQ.GZ,Raw Reads,Illumina,False,17927987899,,Breast,,,Invasive breast carcinoma
6,BCM,ML6061_CAGATCTG_L00X_R2_001.fastq.gz,24561,BCM-0104,2011012-104,,BCM-0104,PDX,RNA-Seq,⋯,FASTQ.GZ,Raw Reads,Illumina,False,18002109303,,Breast,,,Invasive breast carcinoma


In [36]:
# That number of rows should be 2142 + 48

2142 + 48

### Export so far

In [37]:
edited_pdtc_files %>% write_csv('2022-02-09_pdxnet_portal_pdtc_files_removed_cram_nonconcat_lanes.csv')

### Separate the PDTCs into different sheets
They all have their own quirks for how they've recorded things. Also, easier to fix each individually and recombine.

In [38]:
edited_pdtc_files_by_contributor <- edited_pdtc_files %>% group_split(contributor)

edited_pdtc_files_by_contributor %>% map(head, 1)

Index,contributor,file_name,patient_id,model_id,tumor_id,case_id,sample_id,sample_type,experimental_strategy,⋯,data_format,data_type,platform,is_ffpe,file_size,passage,disease_type,comment,investigation,disease_detail
<dbl>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,⋯,<chr>,<chr>,<chr>,<lgl>,<dbl>,<chr>,<chr>,<chr>,<lgl>,<chr>
1,BCM,LZML_37246_NoIndex_L00X_R1_001.fastq.gz,24452,BCM-0002,2011012-2,,BCM-0002,PDX,RNA-Seq,⋯,FASTQ.GZ,Raw Reads,Illumina,False,20656964441,,Breast,,,Invasive breast carcinoma

Index,contributor,file_name,patient_id,model_id,tumor_id,case_id,sample_id,sample_type,experimental_strategy,⋯,data_format,data_type,platform,is_ffpe,file_size,passage,disease_type,comment,investigation,disease_detail
<dbl>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,⋯,<chr>,<chr>,<chr>,<lgl>,<dbl>,<chr>,<chr>,<chr>,<lgl>,<chr>
209,HCI,14990X12_HCI001_patientnormal_1.txt.gz,HCI001,,,HCI001,14990X12,Normal,WES,⋯,,,Illumina HiSeq,,,,Breast,,,Breast Poorly Differentiated

Index,contributor,file_name,patient_id,model_id,tumor_id,case_id,sample_id,sample_type,experimental_strategy,⋯,data_format,data_type,platform,is_ffpe,file_size,passage,disease_type,comment,investigation,disease_detail
<dbl>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,⋯,<chr>,<chr>,<chr>,<lgl>,<dbl>,<chr>,<chr>,<chr>,<lgl>,<chr>
1100,MDACC,BFang-LungPDX-MS11DNA-TC211W.R1.fastq.gz,TC211,,,TC211,TC211W,Normal,WES,⋯,,,Illumina Hiseq,,,,Blood,,,Blood

Index,contributor,file_name,patient_id,model_id,tumor_id,case_id,sample_id,sample_type,experimental_strategy,⋯,data_format,data_type,platform,is_ffpe,file_size,passage,disease_type,comment,investigation,disease_detail
<dbl>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,⋯,<chr>,<chr>,<chr>,<lgl>,<dbl>,<chr>,<chr>,<chr>,<lgl>,<chr>
1518,UC Davis,BL0269F404_P0_GES10_166_ATCACG_L004_R1_ALL.fastq,BL0269F,BL0269F,BL0269F,BL0269F,BL0269F,Tumor,RNA-Seq,⋯,fastq,raw data,Illumina,,,P0,Bladder,,,bladder urothelial carcinoma

Index,contributor,file_name,patient_id,model_id,tumor_id,case_id,sample_id,sample_type,experimental_strategy,⋯,data_format,data_type,platform,is_ffpe,file_size,passage,disease_type,comment,investigation,disease_detail
<dbl>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,⋯,<chr>,<chr>,<chr>,<lgl>,<dbl>,<chr>,<chr>,<chr>,<lgl>,<chr>
1566,WISTAR,wm3907n_R1.fastq.gz,WM3907,wm3907n,,,wm3907n,Tumor,WES,⋯,fastq.gz,Raw Reads,Illumina,False,5439234,Tumor,Unknown,,,Unknown

Index,contributor,file_name,patient_id,model_id,tumor_id,case_id,sample_id,sample_type,experimental_strategy,⋯,data_format,data_type,platform,is_ffpe,file_size,passage,disease_type,comment,investigation,disease_detail
<dbl>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,⋯,<chr>,<chr>,<chr>,<lgl>,<dbl>,<chr>,<chr>,<chr>,<lgl>,<chr>
1949,WUSTL,172845-121-T.r1.fastq.gz,PDMR-172845,172845-121-T_SL,121-T,PDMR-172845,172845-121-T_SL,PDX,RNA-Seq,⋯,fastq.gz,Raw Reads,NovaSeq S4,,4348811841,P3,Unknown,,,Unknown


In [39]:
# Name the dataframes

names(edited_pdtc_files_by_contributor) <- c('BCM', 'HCI', 'MDACC', 'UC Davis', 'WISTAR', 'WUSTL')

edited_pdtc_files_by_contributor %>% map(head, 5)

Index,contributor,file_name,patient_id,model_id,tumor_id,case_id,sample_id,sample_type,experimental_strategy,⋯,data_format,data_type,platform,is_ffpe,file_size,passage,disease_type,comment,investigation,disease_detail
<dbl>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,⋯,<chr>,<chr>,<chr>,<lgl>,<dbl>,<chr>,<chr>,<chr>,<lgl>,<chr>
1,BCM,LZML_37246_NoIndex_L00X_R1_001.fastq.gz,24452,BCM-0002,2011012-2,,BCM-0002,PDX,RNA-Seq,⋯,FASTQ.GZ,Raw Reads,Illumina,False,20656964441,,Breast,,,Invasive breast carcinoma
2,BCM,LZML_37246_NoIndex_L00X_R2_001.fastq.gz,24452,BCM-0002,2011012-2,,BCM-0002,PDX,RNA-Seq,⋯,FASTQ.GZ,Raw Reads,Illumina,False,21093643386,,Breast,,,Invasive breast carcinoma
3,BCM,ML6065_GATCAGCG_L00X_R1_001.fastq.gz,2219,BCM-0046,2011012-46,,BCM-0046,PDX,RNA-Seq,⋯,FASTQ.GZ,Raw Reads,Illumina,False,22027304876,,Breast,,,Invasive breast carcinoma
4,BCM,ML6065_GATCAGCG_L00X_R2_001.fastq.gz,2219,BCM-0046,2011012-46,,BCM-0046,PDX,RNA-Seq,⋯,FASTQ.GZ,Raw Reads,Illumina,False,22027304876,,Breast,,,Invasive breast carcinoma
5,BCM,ML6061_CAGATCTG_L00X_R1_001.fastq.gz,24561,BCM-0104,2011012-104,,BCM-0104,PDX,RNA-Seq,⋯,FASTQ.GZ,Raw Reads,Illumina,False,17927987899,,Breast,,,Invasive breast carcinoma

Index,contributor,file_name,patient_id,model_id,tumor_id,case_id,sample_id,sample_type,experimental_strategy,⋯,data_format,data_type,platform,is_ffpe,file_size,passage,disease_type,comment,investigation,disease_detail
<dbl>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,⋯,<chr>,<chr>,<chr>,<lgl>,<dbl>,<chr>,<chr>,<chr>,<lgl>,<chr>
209,HCI,14990X12_HCI001_patientnormal_1.txt.gz,HCI001,,,HCI001,14990X12,Normal,WES,⋯,,,Illumina HiSeq,,,,Breast,,,Breast Poorly Differentiated
210,HCI,14990X12_HCI001_patientnormal_2.txt.gz,HCI001,,,HCI001,14990X12,Normal,WES,⋯,,,Illumina HiSeq,,,,Breast,,,Breast Poorly Differentiated
211,HCI,0900570-B_HCI-002_Patient_normal_R1.fastq.gz,HCI002,,,HCI002,0900570-B,Normal,WES,⋯,,,Illumina HiSeq,,,,Breast,,,Breast Poorly Differentiated
212,HCI,0900570-B_HCI-002_Patient_normal_R2.fastq.gz,HCI002,,,HCI002,0900570-B,Normal,WES,⋯,,,Illumina HiSeq,,,,Breast,,,Breast Poorly Differentiated
239,HCI,14990X4_HCI004_patientnormalGU_1.txt.gz,HCI004,,,HCI004,14990X4,Normal,WES,⋯,,,Illumina HiSeq,,,,Breast,,,Breast Invasive Ductal Carcinoma

Index,contributor,file_name,patient_id,model_id,tumor_id,case_id,sample_id,sample_type,experimental_strategy,⋯,data_format,data_type,platform,is_ffpe,file_size,passage,disease_type,comment,investigation,disease_detail
<dbl>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,⋯,<chr>,<chr>,<chr>,<lgl>,<dbl>,<chr>,<chr>,<chr>,<lgl>,<chr>
1100,MDACC,BFang-LungPDX-MS11DNA-TC211W.R1.fastq.gz,TC211,,,TC211,TC211W,Normal,WES,⋯,,,Illumina Hiseq,,,,Blood,,,Blood
1101,MDACC,BFang-LungPDX-MS11DNA-TC211W.R2.fastq.gz,TC211,,,TC211,TC211W,Normal,WES,⋯,,,Illumina Hiseq,,,,Blood,,,Blood
1102,MDACC,BFang-LungPDX-MS11DNA-TC219W.R1.fastq.gz,TC219,,,TC219,TC219W,Normal,WES,⋯,,,Illumina Hiseq,,,,Blood,,,Blood
1103,MDACC,BFang-LungPDX-MS11DNA-TC219W.R2.fastq.gz,TC219,,,TC219,TC219W,Normal,WES,⋯,,,Illumina Hiseq,,,,Blood,,,Blood
1104,MDACC,BFang-LungPDX-MS11DNA-TC223W.R1.fastq.gz,TC223,,,TC223,TC223W,Normal,WES,⋯,,,Illumina Hiseq,,,,Blood,,,Blood

Index,contributor,file_name,patient_id,model_id,tumor_id,case_id,sample_id,sample_type,experimental_strategy,⋯,data_format,data_type,platform,is_ffpe,file_size,passage,disease_type,comment,investigation,disease_detail
<dbl>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,⋯,<chr>,<chr>,<chr>,<lgl>,<dbl>,<chr>,<chr>,<chr>,<lgl>,<chr>
1518,UC Davis,BL0269F404_P0_GES10_166_ATCACG_L004_R1_ALL.fastq,BL0269F,BL0269F,BL0269F,BL0269F,BL0269F,Tumor,RNA-Seq,⋯,fastq,raw data,Illumina,,,P0,Bladder,,,bladder urothelial carcinoma
1519,UC Davis,BL0269F404_P0_GES10_166_ATCACG_L004_R2_ALL.fastq,BL0269F,BL0269F,BL0269F,BL0269F,BL0269F,Tumor,RNA-Seq,⋯,fastq,raw data,Illumina,,,P0,Bladder,,,bladder urothelial carcinoma
1520,UC Davis,BL0269F404_P0_GES10_166_ATCACG_L005_R1_ALL.fastq,BL0269F,BL0269F,BL0269F,BL0269F,BL0269F,Tumor,RNA-Seq,⋯,fastq,raw data,Illumina,,,P0,Bladder,,,bladder urothelial carcinoma
1521,UC Davis,BL0269F404_P0_GES10_166_ATCACG_L005_R2_ALL.fastq,BL0269F,BL0269F,BL0269F,BL0269F,BL0269F,Tumor,RNA-Seq,⋯,fastq,raw data,Illumina,,,P0,Bladder,,,bladder urothelial carcinoma
1522,UC Davis,BL0269F404P0_ATCACG_AC1FCKACXX_L003_001.R1.fastq.gz,BL0269F,BL0269F,BL0269F,BL0269F,BL0269F,Tumor,WES,⋯,fastq,raw data,Illumina,,,P0,Bladder,,,bladder urothelial carcinoma

Index,contributor,file_name,patient_id,model_id,tumor_id,case_id,sample_id,sample_type,experimental_strategy,⋯,data_format,data_type,platform,is_ffpe,file_size,passage,disease_type,comment,investigation,disease_detail
<dbl>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,⋯,<chr>,<chr>,<chr>,<lgl>,<dbl>,<chr>,<chr>,<chr>,<lgl>,<chr>
1566,WISTAR,wm3907n_R1.fastq.gz,WM3907,wm3907n,,,wm3907n,Tumor,WES,⋯,fastq.gz,Raw Reads,Illumina,False,5439234,Tumor,Unknown,,,Unknown
1567,WISTAR,wm3907n_R2.fastq.gz,WM3907,wm3907n,,,wm3907n,Tumor,WES,⋯,fastq.gz,Raw Reads,Illumina,False,6415082,Tumor,Unknown,,,Unknown
1568,WISTAR,wm3922n_R1.fastq.gz,WM3922,wm3922n,,,wm3922n,Tumor,WES,⋯,fastq.gz,Raw Reads,Illumina,False,3024602,Tumor,Unknown,,,Unknown
1569,WISTAR,wm3922n_R2.fastq.gz,WM3922,wm3922n,,,wm3922n,Tumor,WES,⋯,fastq.gz,Raw Reads,Illumina,False,3671122,Tumor,Unknown,,,Unknown
1570,WISTAR,wm3926n_R1.fastq.gz,WM3926,wm3926n,,,wm3926n,Tumor,WES,⋯,fastq.gz,Raw Reads,Illumina,False,3679890,Tumor,Unknown,,,Unknown

Index,contributor,file_name,patient_id,model_id,tumor_id,case_id,sample_id,sample_type,experimental_strategy,⋯,data_format,data_type,platform,is_ffpe,file_size,passage,disease_type,comment,investigation,disease_detail
<dbl>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,⋯,<chr>,<chr>,<chr>,<lgl>,<dbl>,<chr>,<chr>,<chr>,<lgl>,<chr>
1949,WUSTL,172845-121-T.r1.fastq.gz,PDMR-172845,172845-121-T_SL,121-T,PDMR-172845,172845-121-T_SL,PDX,RNA-Seq,⋯,fastq.gz,Raw Reads,NovaSeq S4,,4348811841,P3,Unknown,,,Unknown
1950,WUSTL,172845-121-T.r2.fastq.gz,PDMR-172845,172845-121-T_SL,121-T,PDMR-172845,172845-121-T_SL,PDX,RNA-Seq,⋯,fastq.gz,Raw Reads,NovaSeq S4,,4209751331,P3,Unknown,,,Unknown
1951,WUSTL,172845-121-T.singletons.r1.fastq.gz,PDMR-172845,172845-121-T_SL,121-T,PDMR-172845,172845-121-T_SL,PDX,RNA-Seq,⋯,fastq.gz,Raw Reads,NovaSeq S4,,40247394,P3,Unknown,,,Unknown
1952,WUSTL,172845-121-T.singletons.r2.fastq.gz,PDMR-172845,172845-121-T_SL,121-T,PDMR-172845,172845-121-T_SL,PDX,RNA-Seq,⋯,fastq.gz,Raw Reads,NovaSeq S4,,93138489,P3,Unknown,,,Unknown
1954,WUSTL,625472-104-R.r1.fastq.gz,PDMR-625472,625472-104-R_SL,104-R,PDMR-625472,625472-104-R_SL,PDX,RNA-Seq,⋯,fastq.gz,Raw Reads,NovaSeq S4,,3184453176,P2,Unknown,,,Unknown


In [40]:
# Export these as-is so I can take a look at them more easily

for (i in seq_along(edited_pdtc_files_by_contributor)) {
    write_csv(edited_pdtc_files_by_contributor[[i]], str_c('2022-02-10_', names(edited_pdtc_files_by_contributor)[[i]], '_files_pre_edit.csv'))
    }

### Do some reformatting to the models list

In [41]:
# The model IDs have extra crap in the IDs that don't match for BCM some model IDs have a space and some stuff in parenthesis
# This can be removed to match based on model ID

pdtc_models_cleaned <- pdtc_models %>% mutate(ContributorPDX.ID_edit = str_replace(ContributorPDX.ID, '\\s.*$', ''))

nrow(pdtc_models_cleaned)
head(pdtc_models_cleaned)
colnames(pdtc_models_cleaned)

Index,PDXSource,Contributor,ContributorPDX.ID,PDMR.Patient.ID,Gender,CTEP.SDCCode,CTEP.SDCDescription,DiagnosisSubtype,Disease.BodyLocation,⋯,Has.KnownMetastaticDisease,Grade.StageInformation,PatientNotes,Molecular.andIHC.Data,Has.Smoked100.Cigarettes,Race,Ethnicity,AdditionalMedicalHistory,Submission,ContributorPDX.ID_edit
<dbl>,<chr>,<chr>,<chr>,<chr>,<chr>,<dbl>,<chr>,<chr>,<chr>,⋯,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<dttm>,<chr>
1,PDXNet Consortium Members,MDACC,B8174,K42829,Female,10009951,Adenocarcinoma - colon,adenocarcinoma of sigmoid colon,Digestive/Gastrointestinal,⋯,Yes,Stage,"Tumor Grade/Stage: Stage IV Location of known metastases: Pelvis, omentum, Chest wall (left), Lymph node (left axillary)",APC c.4037C>G p.S1346*; FBXW7 c.832C>T p.R278*; KRAS c.38G>A p.G13D; TP53 c.427G>A p.V143M,No,White,Not Hispanic or Latino,-,2021-09-02,B8174
2,PDXNet Consortium Members,MDACC,B8175,K30337,Female,10009951,Adenocarcinoma - colon,adenocarcinoma,Digestive/Gastrointestinal,⋯,Yes,Stage,"Tumor Grade/Stage: Stage IV, poorly differentiated Location of known metastases: Peritoneum, Liver","KRAS/NRAS WT,BRAF mutated, MSI Stable",No,White,Hispanic or Latino,"Family h/o: 2nd degree relative, Uterine cancer",2018-12-19,B8175
3,PDXNet Consortium Members,MDACC,B8176,K45526,Female,10009951,Adenocarcinoma - colon,Lynch syndrome; mucinous and signet ring cell adenocarcinoma,Digestive/Gastrointestinal,⋯,Not Reported,TNM (Pathological),"Tumor Grade/Stage: pT1bpN0pM0, poorly differentiated Location of known metastases: Liver, Abdomen","Germline heterozygous MSH2 c388_389del variant (deleterious) Lynch syndrome, MSI-high; KRAS-G12D; Loss of MSH2 and MSH6;",Yes,White,Hispanic or Latino,"Family History: 1st degree relative, GI cancer involving esophagus, liver, bile duct, stomach, pancreas, colon, rectum",2018-12-19,B8176
4,PDXNet Consortium Members,MDACC,B8182,K75566,Female,10009951,Adenocarcinoma - colon,poorly differentiated mucinous and signet ring cell adenocarcinoma,Digestive/Gastrointestinal,⋯,Yes,"Stage, TNM",Tumor Grade/Stage: Stage III T4N2M0 Location of known metastases: Liver,-,No,White,Hispanic or Latino,-,2021-09-02,B8182
5,PDXNet Consortium Members,MDACC,B8183,K83548,Male,10009951,Adenocarcinoma - colon,adenocarcinoma,Digestive/Gastrointestinal,⋯,Yes,Stage,"Tumor Grade/Stage: Stage IV, moderately differentiated Location of known metastases: Liver, Duodenum, Pelvis",MSI-Stable; BRAF mutant; KRAS wild type,Yes,Not Provided,Not Provided,-,2018-12-19,B8183
6,PDXNet Consortium Members,MDACC,B8207,K49395,Male,10009951,Adenocarcinoma - colon,Sigmoid,Digestive/Gastrointestinal,⋯,Yes,"Stage, TNM","Tumor Grade/Stage: Stage IVB, TxNxM1b, Moderately differentiated Location of known metastases: Liver, Lung","Mutations present in: MDM4, NOTCH1, RB1, TP53, APC",Yes,White,Not Hispanic or Latino,No family history of cancer,2021-09-02,B8207


### Work on BCM's first

In [42]:
edited_pdtc_files_by_contributor[['BCM']] %>% nrow()
edited_pdtc_files_by_contributor[['BCM']] %>% head()
colnames(edited_pdtc_files_by_contributor[['BCM']])

Index,contributor,file_name,patient_id,model_id,tumor_id,case_id,sample_id,sample_type,experimental_strategy,⋯,data_format,data_type,platform,is_ffpe,file_size,passage,disease_type,comment,investigation,disease_detail
<dbl>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,⋯,<chr>,<chr>,<chr>,<lgl>,<dbl>,<chr>,<chr>,<chr>,<lgl>,<chr>
1,BCM,LZML_37246_NoIndex_L00X_R1_001.fastq.gz,24452,BCM-0002,2011012-2,,BCM-0002,PDX,RNA-Seq,⋯,FASTQ.GZ,Raw Reads,Illumina,False,20656964441,,Breast,,,Invasive breast carcinoma
2,BCM,LZML_37246_NoIndex_L00X_R2_001.fastq.gz,24452,BCM-0002,2011012-2,,BCM-0002,PDX,RNA-Seq,⋯,FASTQ.GZ,Raw Reads,Illumina,False,21093643386,,Breast,,,Invasive breast carcinoma
3,BCM,ML6065_GATCAGCG_L00X_R1_001.fastq.gz,2219,BCM-0046,2011012-46,,BCM-0046,PDX,RNA-Seq,⋯,FASTQ.GZ,Raw Reads,Illumina,False,22027304876,,Breast,,,Invasive breast carcinoma
4,BCM,ML6065_GATCAGCG_L00X_R2_001.fastq.gz,2219,BCM-0046,2011012-46,,BCM-0046,PDX,RNA-Seq,⋯,FASTQ.GZ,Raw Reads,Illumina,False,22027304876,,Breast,,,Invasive breast carcinoma
5,BCM,ML6061_CAGATCTG_L00X_R1_001.fastq.gz,24561,BCM-0104,2011012-104,,BCM-0104,PDX,RNA-Seq,⋯,FASTQ.GZ,Raw Reads,Illumina,False,17927987899,,Breast,,,Invasive breast carcinoma
6,BCM,ML6061_CAGATCTG_L00X_R2_001.fastq.gz,24561,BCM-0104,2011012-104,,BCM-0104,PDX,RNA-Seq,⋯,FASTQ.GZ,Raw Reads,Illumina,False,18002109303,,Breast,,,Invasive breast carcinoma


In [43]:
# How many NAs are there for the most important columns?

colSums((is.na(edited_pdtc_files_by_contributor[['BCM']]) | (edited_pdtc_files_by_contributor[['BCM']] == 'Unknown'))) %>% as.data.frame()

Unnamed: 0_level_0,.
Unnamed: 0_level_1,<dbl>
Index,0
contributor,0
file_name,0
patient_id,2
model_id,0
tumor_id,16
case_id,208
sample_id,0
sample_type,0
experimental_strategy,0


Notes on the columns:
- patient_id: Can't be filled from the models
- tumor_id: Can't be filled from the models
- case_id: Unused by BCM
- gender: Can be filled from Model based on model ID
- capture_kit: Unknown from the this metadata - All BCM samples on the CGC used AgilentSureSelect_V6 or WES and the RNAseq kits are unknown (not capture kits, but some centers put their RNAseq kits in here)
- capture_assembly: unused by BCM
- platform: no other place to fill it out, but all should be Illumina
- is_ffpe: unfillable with other sheets
- file_size: easily filled with the CGC metadata, but can do this at the end for all centers
- passage: unfillable with other metadata
- disease_type: fillable with models
- disease_detail: fillable with models

In [44]:
# Merge BCM with the models

edited_bcm_files <- edited_pdtc_files_by_contributor[['BCM']] %>%
    left_join(select(pdtc_models_cleaned, ContributorPDX.ID_edit, Gender, CTEP.SDCDescription, Disease.BodyLocation),
              by = c('model_id' = 'ContributorPDX.ID_edit')
              )

edited_bcm_files %>% head()

Index,contributor,file_name,patient_id,model_id,tumor_id,case_id,sample_id,sample_type,experimental_strategy,⋯,is_ffpe,file_size,passage,disease_type,comment,investigation,disease_detail,Gender,CTEP.SDCDescription,Disease.BodyLocation
<dbl>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,⋯,<lgl>,<dbl>,<chr>,<chr>,<chr>,<lgl>,<chr>,<chr>,<chr>,<chr>
1,BCM,LZML_37246_NoIndex_L00X_R1_001.fastq.gz,24452,BCM-0002,2011012-2,,BCM-0002,PDX,RNA-Seq,⋯,False,20656964441,,Breast,,,Invasive breast carcinoma,Female,Invasive breast carcinoma,Breast
2,BCM,LZML_37246_NoIndex_L00X_R2_001.fastq.gz,24452,BCM-0002,2011012-2,,BCM-0002,PDX,RNA-Seq,⋯,False,21093643386,,Breast,,,Invasive breast carcinoma,Female,Invasive breast carcinoma,Breast
3,BCM,ML6065_GATCAGCG_L00X_R1_001.fastq.gz,2219,BCM-0046,2011012-46,,BCM-0046,PDX,RNA-Seq,⋯,False,22027304876,,Breast,,,Invasive breast carcinoma,Female,Invasive breast carcinoma,Breast
4,BCM,ML6065_GATCAGCG_L00X_R2_001.fastq.gz,2219,BCM-0046,2011012-46,,BCM-0046,PDX,RNA-Seq,⋯,False,22027304876,,Breast,,,Invasive breast carcinoma,Female,Invasive breast carcinoma,Breast
5,BCM,ML6061_CAGATCTG_L00X_R1_001.fastq.gz,24561,BCM-0104,2011012-104,,BCM-0104,PDX,RNA-Seq,⋯,False,17927987899,,Breast,,,Invasive breast carcinoma,Female,Invasive breast carcinoma,Breast
6,BCM,ML6061_CAGATCTG_L00X_R2_001.fastq.gz,24561,BCM-0104,2011012-104,,BCM-0104,PDX,RNA-Seq,⋯,False,18002109303,,Breast,,,Invasive breast carcinoma,Female,Invasive breast carcinoma,Breast


In [45]:
# Replace the disease detail if the models info exists
# Fill in gender
# Set the capture kit

edited_bcm_files <- edited_bcm_files %>% mutate(
    disease_detail = ifelse(!is.na(CTEP.SDCDescription), CTEP.SDCDescription, disease_detail),
    disease_type = ifelse(!is.na(Disease.BodyLocation), Disease.BodyLocation, disease_type),
    gender = ifelse(!is.na(Gender), Gender, gender),
    capture_kit = ifelse(experimental_strategy == 'WES', 'SureSelect Human All Exon V6', capture_kit),
    platform = 'Illumina',
    ) %>%
    select(-Gender, -CTEP.SDCDescription, -Disease.BodyLocation)

head(edited_bcm_files)

Index,contributor,file_name,patient_id,model_id,tumor_id,case_id,sample_id,sample_type,experimental_strategy,⋯,data_format,data_type,platform,is_ffpe,file_size,passage,disease_type,comment,investigation,disease_detail
<dbl>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,⋯,<chr>,<chr>,<chr>,<lgl>,<dbl>,<chr>,<chr>,<chr>,<lgl>,<chr>
1,BCM,LZML_37246_NoIndex_L00X_R1_001.fastq.gz,24452,BCM-0002,2011012-2,,BCM-0002,PDX,RNA-Seq,⋯,FASTQ.GZ,Raw Reads,Illumina,False,20656964441,,Breast,,,Invasive breast carcinoma
2,BCM,LZML_37246_NoIndex_L00X_R2_001.fastq.gz,24452,BCM-0002,2011012-2,,BCM-0002,PDX,RNA-Seq,⋯,FASTQ.GZ,Raw Reads,Illumina,False,21093643386,,Breast,,,Invasive breast carcinoma
3,BCM,ML6065_GATCAGCG_L00X_R1_001.fastq.gz,2219,BCM-0046,2011012-46,,BCM-0046,PDX,RNA-Seq,⋯,FASTQ.GZ,Raw Reads,Illumina,False,22027304876,,Breast,,,Invasive breast carcinoma
4,BCM,ML6065_GATCAGCG_L00X_R2_001.fastq.gz,2219,BCM-0046,2011012-46,,BCM-0046,PDX,RNA-Seq,⋯,FASTQ.GZ,Raw Reads,Illumina,False,22027304876,,Breast,,,Invasive breast carcinoma
5,BCM,ML6061_CAGATCTG_L00X_R1_001.fastq.gz,24561,BCM-0104,2011012-104,,BCM-0104,PDX,RNA-Seq,⋯,FASTQ.GZ,Raw Reads,Illumina,False,17927987899,,Breast,,,Invasive breast carcinoma
6,BCM,ML6061_CAGATCTG_L00X_R2_001.fastq.gz,24561,BCM-0104,2011012-104,,BCM-0104,PDX,RNA-Seq,⋯,FASTQ.GZ,Raw Reads,Illumina,False,18002109303,,Breast,,,Invasive breast carcinoma


In [46]:
# data format to lowercase for consistency

edited_bcm_files <- edited_bcm_files %>%
    mutate(data_format = tolower(data_format))

In [47]:
# How'd we do fixing these?

colSums((is.na(edited_bcm_files) | (edited_bcm_files == 'Unknown'))) %>% as.data.frame()

Unnamed: 0_level_0,.
Unnamed: 0_level_1,<dbl>
Index,0
contributor,0
file_name,0
patient_id,2
model_id,0
tumor_id,16
case_id,208
sample_id,0
sample_type,0
experimental_strategy,0


In [48]:
# Export BCM only

write_csv(edited_bcm_files, '2022-02-11_BCM_files_edited.csv')

### Work on HCI's

In [49]:
edited_pdtc_files_by_contributor[['HCI']] %>% nrow()
edited_pdtc_files_by_contributor[['HCI']] %>% head()
colnames(edited_pdtc_files_by_contributor[['HCI']])

Index,contributor,file_name,patient_id,model_id,tumor_id,case_id,sample_id,sample_type,experimental_strategy,⋯,data_format,data_type,platform,is_ffpe,file_size,passage,disease_type,comment,investigation,disease_detail
<dbl>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,⋯,<chr>,<chr>,<chr>,<lgl>,<dbl>,<chr>,<chr>,<chr>,<lgl>,<chr>
209,HCI,14990X12_HCI001_patientnormal_1.txt.gz,HCI001,,,HCI001,14990X12,Normal,WES,⋯,,,Illumina HiSeq,,,,Breast,,,Breast Poorly Differentiated
210,HCI,14990X12_HCI001_patientnormal_2.txt.gz,HCI001,,,HCI001,14990X12,Normal,WES,⋯,,,Illumina HiSeq,,,,Breast,,,Breast Poorly Differentiated
211,HCI,0900570-B_HCI-002_Patient_normal_R1.fastq.gz,HCI002,,,HCI002,0900570-B,Normal,WES,⋯,,,Illumina HiSeq,,,,Breast,,,Breast Poorly Differentiated
212,HCI,0900570-B_HCI-002_Patient_normal_R2.fastq.gz,HCI002,,,HCI002,0900570-B,Normal,WES,⋯,,,Illumina HiSeq,,,,Breast,,,Breast Poorly Differentiated
239,HCI,14990X4_HCI004_patientnormalGU_1.txt.gz,HCI004,,,HCI004,14990X4,Normal,WES,⋯,,,Illumina HiSeq,,,,Breast,,,Breast Invasive Ductal Carcinoma
240,HCI,14990X4_HCI004_patientnormalGU_2.txt.gz,HCI004,,,HCI004,14990X4,Normal,WES,⋯,,,Illumina HiSeq,,,,Breast,,,Breast Invasive Ductal Carcinoma


In [50]:
# How many NAs are there for the most important columns?

colSums((is.na(edited_pdtc_files_by_contributor[['HCI']]) | (edited_pdtc_files_by_contributor[['HCI']] == 'Unknown') | (edited_pdtc_files_by_contributor[['HCI']] == '?'))) %>% as.data.frame()

Unnamed: 0_level_0,.
Unnamed: 0_level_1,<dbl>
Index,0
contributor,0
file_name,0
patient_id,0
model_id,256
tumor_id,256
case_id,0
sample_id,0
sample_type,0
experimental_strategy,0


Notes on the columns:
- model_id: HCI uses patient ID and case ID as the model identifier
- tumor_id: unused by HCI
- paired_end: All of the NAs have either "file1" or "file2" which is going to represent their paired end status
- capture_kit: Can't fill from any other metadata
- capture_assembly: Can't fill from any other metadata
- is_ffpe: not collected by HCI
- file_size: easily filled with the CGC metadata, but can do this at the end for all centers
- passage: Should be NA for normals, Unknown for PDX without metadata, unfillable with other sheets
- disease_type: Missing ones are HCI017 and TOW18S, neither of which are present in any of our model information or any of the dbGaP submission forms
- disease_detail: Same as above

In [51]:
# Start cleaning HCI

edited_hci_files <- edited_pdtc_files_by_contributor[['HCI']] %>% mutate(
    paired_end = ifelse(str_detect(file_name, 'file1'), 1, ifelse(str_detect(file_name, 'file2'), 2, paired_end)),
    passage = ifelse(sample_type == 'Normal', NA, passage)
    )

In [52]:
# Continue fixing passage

edited_hci_files <- edited_hci_files %>% mutate(
    passage = ifelse(passage == '?', 'Unknown', passage)
)

head(edited_hci_files)

Index,contributor,file_name,patient_id,model_id,tumor_id,case_id,sample_id,sample_type,experimental_strategy,⋯,data_format,data_type,platform,is_ffpe,file_size,passage,disease_type,comment,investigation,disease_detail
<dbl>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,⋯,<chr>,<chr>,<chr>,<lgl>,<dbl>,<chr>,<chr>,<chr>,<lgl>,<chr>
209,HCI,14990X12_HCI001_patientnormal_1.txt.gz,HCI001,,,HCI001,14990X12,Normal,WES,⋯,,,Illumina HiSeq,,,,Breast,,,Breast Poorly Differentiated
210,HCI,14990X12_HCI001_patientnormal_2.txt.gz,HCI001,,,HCI001,14990X12,Normal,WES,⋯,,,Illumina HiSeq,,,,Breast,,,Breast Poorly Differentiated
211,HCI,0900570-B_HCI-002_Patient_normal_R1.fastq.gz,HCI002,,,HCI002,0900570-B,Normal,WES,⋯,,,Illumina HiSeq,,,,Breast,,,Breast Poorly Differentiated
212,HCI,0900570-B_HCI-002_Patient_normal_R2.fastq.gz,HCI002,,,HCI002,0900570-B,Normal,WES,⋯,,,Illumina HiSeq,,,,Breast,,,Breast Poorly Differentiated
239,HCI,14990X4_HCI004_patientnormalGU_1.txt.gz,HCI004,,,HCI004,14990X4,Normal,WES,⋯,,,Illumina HiSeq,,,,Breast,,,Breast Invasive Ductal Carcinoma
240,HCI,14990X4_HCI004_patientnormalGU_2.txt.gz,HCI004,,,HCI004,14990X4,Normal,WES,⋯,,,Illumina HiSeq,,,,Breast,,,Breast Invasive Ductal Carcinoma


I think that's about all that can be done, but HCI is one of the better ones with the least missing disease information.

In [53]:
# Add the data_category, data_format, and data_type
# Remove files that are not fastq.gz or txt.gz, because there are a few in the middle here

edited_hci_files <- edited_hci_files %>% mutate(
    data_category = 'Raw Sequencing Data',
    data_format = 'fastq.gz',
    data_type = 'Raw Reads'
    ) %>% filter((endsWith(edited_hci_files$file_name, 'gz')))

nrow(edited_hci_files)
head(edited_hci_files)

Index,contributor,file_name,patient_id,model_id,tumor_id,case_id,sample_id,sample_type,experimental_strategy,⋯,data_format,data_type,platform,is_ffpe,file_size,passage,disease_type,comment,investigation,disease_detail
<dbl>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,⋯,<chr>,<chr>,<chr>,<lgl>,<dbl>,<chr>,<chr>,<chr>,<lgl>,<chr>
209,HCI,14990X12_HCI001_patientnormal_1.txt.gz,HCI001,,,HCI001,14990X12,Normal,WES,⋯,fastq.gz,Raw Reads,Illumina HiSeq,,,,Breast,,,Breast Poorly Differentiated
210,HCI,14990X12_HCI001_patientnormal_2.txt.gz,HCI001,,,HCI001,14990X12,Normal,WES,⋯,fastq.gz,Raw Reads,Illumina HiSeq,,,,Breast,,,Breast Poorly Differentiated
211,HCI,0900570-B_HCI-002_Patient_normal_R1.fastq.gz,HCI002,,,HCI002,0900570-B,Normal,WES,⋯,fastq.gz,Raw Reads,Illumina HiSeq,,,,Breast,,,Breast Poorly Differentiated
212,HCI,0900570-B_HCI-002_Patient_normal_R2.fastq.gz,HCI002,,,HCI002,0900570-B,Normal,WES,⋯,fastq.gz,Raw Reads,Illumina HiSeq,,,,Breast,,,Breast Poorly Differentiated
239,HCI,14990X4_HCI004_patientnormalGU_1.txt.gz,HCI004,,,HCI004,14990X4,Normal,WES,⋯,fastq.gz,Raw Reads,Illumina HiSeq,,,,Breast,,,Breast Invasive Ductal Carcinoma
240,HCI,14990X4_HCI004_patientnormalGU_2.txt.gz,HCI004,,,HCI004,14990X4,Normal,WES,⋯,fastq.gz,Raw Reads,Illumina HiSeq,,,,Breast,,,Breast Invasive Ductal Carcinoma


In [54]:
# Export HCI only

write_csv(edited_bcm_files, '2022-02-14_HCI_files_edited.csv')

### Clean MDACC
These are pretty comprehensive.

In [55]:
edited_pdtc_files_by_contributor[['MDACC']] %>% nrow()
edited_pdtc_files_by_contributor[['MDACC']] %>% head()
colnames(edited_pdtc_files_by_contributor[['MDACC']])

Index,contributor,file_name,patient_id,model_id,tumor_id,case_id,sample_id,sample_type,experimental_strategy,⋯,data_format,data_type,platform,is_ffpe,file_size,passage,disease_type,comment,investigation,disease_detail
<dbl>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,⋯,<chr>,<chr>,<chr>,<lgl>,<dbl>,<chr>,<chr>,<chr>,<lgl>,<chr>
1100,MDACC,BFang-LungPDX-MS11DNA-TC211W.R1.fastq.gz,TC211,,,TC211,TC211W,Normal,WES,⋯,,,Illumina Hiseq,,,,Blood,,,Blood
1101,MDACC,BFang-LungPDX-MS11DNA-TC211W.R2.fastq.gz,TC211,,,TC211,TC211W,Normal,WES,⋯,,,Illumina Hiseq,,,,Blood,,,Blood
1102,MDACC,BFang-LungPDX-MS11DNA-TC219W.R1.fastq.gz,TC219,,,TC219,TC219W,Normal,WES,⋯,,,Illumina Hiseq,,,,Blood,,,Blood
1103,MDACC,BFang-LungPDX-MS11DNA-TC219W.R2.fastq.gz,TC219,,,TC219,TC219W,Normal,WES,⋯,,,Illumina Hiseq,,,,Blood,,,Blood
1104,MDACC,BFang-LungPDX-MS11DNA-TC223W.R1.fastq.gz,TC223,,,TC223,TC223W,Normal,WES,⋯,,,Illumina Hiseq,,,,Blood,,,Blood
1105,MDACC,BFang-LungPDX-MS11DNA-TC223W.R2.fastq.gz,TC223,,,TC223,TC223W,Normal,WES,⋯,,,Illumina Hiseq,,,,Blood,,,Blood


In [56]:
# How many NAs are there for the most important columns?

colSums((is.na(edited_pdtc_files_by_contributor[['MDACC']]) | (edited_pdtc_files_by_contributor[['MDACC']] == 'Unknown') | (edited_pdtc_files_by_contributor[['MDACC']] == '?'))) %>% as.data.frame()

Unnamed: 0_level_0,.
Unnamed: 0_level_1,<dbl>
Index,0
contributor,0
file_name,0
patient_id,12
model_id,264
tumor_id,418
case_id,12
sample_id,0
sample_type,0
experimental_strategy,0


In [57]:
# How many of patient_id	model_id	tumor_id	case_id are in the models information?

any(c(edited_pdtc_files_by_contributor[['MDACC']]$patient_id, edited_pdtc_files_by_contributor[['MDACC']]$model_id, edited_pdtc_files_by_contributor[['MDACC']]$tumor_id, edited_pdtc_files_by_contributor[['MDACC']]$case_id) %in% pdtc_models_cleaned$ContributorPDX.ID)

None of these are present in the models information.

All I can do here is fill in some general things like the data_category, data_format, and data_type.

In [58]:
# Add some of the missing metadata in

edited_mdacc_files <- edited_pdtc_files_by_contributor[['MDACC']] %>%
    mutate(data_category = 'Raw Sequencing Data',
           data_format = 'fastq.gz',
           data_type = 'Raw Reads'
           )

nrow(edited_mdacc_files)
head(edited_mdacc_files)

Index,contributor,file_name,patient_id,model_id,tumor_id,case_id,sample_id,sample_type,experimental_strategy,⋯,data_format,data_type,platform,is_ffpe,file_size,passage,disease_type,comment,investigation,disease_detail
<dbl>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,⋯,<chr>,<chr>,<chr>,<lgl>,<dbl>,<chr>,<chr>,<chr>,<lgl>,<chr>
1100,MDACC,BFang-LungPDX-MS11DNA-TC211W.R1.fastq.gz,TC211,,,TC211,TC211W,Normal,WES,⋯,fastq.gz,Raw Reads,Illumina Hiseq,,,,Blood,,,Blood
1101,MDACC,BFang-LungPDX-MS11DNA-TC211W.R2.fastq.gz,TC211,,,TC211,TC211W,Normal,WES,⋯,fastq.gz,Raw Reads,Illumina Hiseq,,,,Blood,,,Blood
1102,MDACC,BFang-LungPDX-MS11DNA-TC219W.R1.fastq.gz,TC219,,,TC219,TC219W,Normal,WES,⋯,fastq.gz,Raw Reads,Illumina Hiseq,,,,Blood,,,Blood
1103,MDACC,BFang-LungPDX-MS11DNA-TC219W.R2.fastq.gz,TC219,,,TC219,TC219W,Normal,WES,⋯,fastq.gz,Raw Reads,Illumina Hiseq,,,,Blood,,,Blood
1104,MDACC,BFang-LungPDX-MS11DNA-TC223W.R1.fastq.gz,TC223,,,TC223,TC223W,Normal,WES,⋯,fastq.gz,Raw Reads,Illumina Hiseq,,,,Blood,,,Blood
1105,MDACC,BFang-LungPDX-MS11DNA-TC223W.R2.fastq.gz,TC223,,,TC223,TC223W,Normal,WES,⋯,fastq.gz,Raw Reads,Illumina Hiseq,,,,Blood,,,Blood


In [59]:
# Export MDACC only

write_csv(edited_mdacc_files, '2022-02-14_MDACC_files_edited.csv')

### Work on UC Davis

In [60]:
edited_pdtc_files_by_contributor[['UC Davis']] %>% nrow()
edited_pdtc_files_by_contributor[['UC Davis']] %>% head()
colnames(edited_pdtc_files_by_contributor[['UC Davis']])

Index,contributor,file_name,patient_id,model_id,tumor_id,case_id,sample_id,sample_type,experimental_strategy,⋯,data_format,data_type,platform,is_ffpe,file_size,passage,disease_type,comment,investigation,disease_detail
<dbl>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,⋯,<chr>,<chr>,<chr>,<lgl>,<dbl>,<chr>,<chr>,<chr>,<lgl>,<chr>
1518,UC Davis,BL0269F404_P0_GES10_166_ATCACG_L004_R1_ALL.fastq,BL0269F,BL0269F,BL0269F,BL0269F,BL0269F,Tumor,RNA-Seq,⋯,fastq,raw data,Illumina,,,P0,Bladder,,,bladder urothelial carcinoma
1519,UC Davis,BL0269F404_P0_GES10_166_ATCACG_L004_R2_ALL.fastq,BL0269F,BL0269F,BL0269F,BL0269F,BL0269F,Tumor,RNA-Seq,⋯,fastq,raw data,Illumina,,,P0,Bladder,,,bladder urothelial carcinoma
1520,UC Davis,BL0269F404_P0_GES10_166_ATCACG_L005_R1_ALL.fastq,BL0269F,BL0269F,BL0269F,BL0269F,BL0269F,Tumor,RNA-Seq,⋯,fastq,raw data,Illumina,,,P0,Bladder,,,bladder urothelial carcinoma
1521,UC Davis,BL0269F404_P0_GES10_166_ATCACG_L005_R2_ALL.fastq,BL0269F,BL0269F,BL0269F,BL0269F,BL0269F,Tumor,RNA-Seq,⋯,fastq,raw data,Illumina,,,P0,Bladder,,,bladder urothelial carcinoma
1522,UC Davis,BL0269F404P0_ATCACG_AC1FCKACXX_L003_001.R1.fastq.gz,BL0269F,BL0269F,BL0269F,BL0269F,BL0269F,Tumor,WES,⋯,fastq,raw data,Illumina,,,P0,Bladder,,,bladder urothelial carcinoma
1523,UC Davis,BL0269F404P0_ATCACG_AC1FCKACXX_L003_001.R2.fastq.gz,BL0269F,BL0269F,BL0269F,BL0269F,BL0269F,Tumor,WES,⋯,fastq,raw data,Illumina,,,P0,Bladder,,,bladder urothelial carcinoma


From visual inspection I know that these don't need much cleaning, but there are some things we can make more consistent.

In [61]:
# Add some of the missing metadata in

edited_ucd_files <- edited_pdtc_files_by_contributor[['UC Davis']] %>%
    mutate(data_category = 'Raw Sequencing Data',
           data_format = ifelse(endsWith(file_name, 'gz'), 'fastq.gz', 'fastq'),
           data_type = 'Raw Reads'
           )

nrow(edited_ucd_files)
head(edited_ucd_files)

Index,contributor,file_name,patient_id,model_id,tumor_id,case_id,sample_id,sample_type,experimental_strategy,⋯,data_format,data_type,platform,is_ffpe,file_size,passage,disease_type,comment,investigation,disease_detail
<dbl>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,⋯,<chr>,<chr>,<chr>,<lgl>,<dbl>,<chr>,<chr>,<chr>,<lgl>,<chr>
1518,UC Davis,BL0269F404_P0_GES10_166_ATCACG_L004_R1_ALL.fastq,BL0269F,BL0269F,BL0269F,BL0269F,BL0269F,Tumor,RNA-Seq,⋯,fastq,Raw Reads,Illumina,,,P0,Bladder,,,bladder urothelial carcinoma
1519,UC Davis,BL0269F404_P0_GES10_166_ATCACG_L004_R2_ALL.fastq,BL0269F,BL0269F,BL0269F,BL0269F,BL0269F,Tumor,RNA-Seq,⋯,fastq,Raw Reads,Illumina,,,P0,Bladder,,,bladder urothelial carcinoma
1520,UC Davis,BL0269F404_P0_GES10_166_ATCACG_L005_R1_ALL.fastq,BL0269F,BL0269F,BL0269F,BL0269F,BL0269F,Tumor,RNA-Seq,⋯,fastq,Raw Reads,Illumina,,,P0,Bladder,,,bladder urothelial carcinoma
1521,UC Davis,BL0269F404_P0_GES10_166_ATCACG_L005_R2_ALL.fastq,BL0269F,BL0269F,BL0269F,BL0269F,BL0269F,Tumor,RNA-Seq,⋯,fastq,Raw Reads,Illumina,,,P0,Bladder,,,bladder urothelial carcinoma
1522,UC Davis,BL0269F404P0_ATCACG_AC1FCKACXX_L003_001.R1.fastq.gz,BL0269F,BL0269F,BL0269F,BL0269F,BL0269F,Tumor,WES,⋯,fastq.gz,Raw Reads,Illumina,,,P0,Bladder,,,bladder urothelial carcinoma
1523,UC Davis,BL0269F404P0_ATCACG_AC1FCKACXX_L003_001.R2.fastq.gz,BL0269F,BL0269F,BL0269F,BL0269F,BL0269F,Tumor,WES,⋯,fastq.gz,Raw Reads,Illumina,,,P0,Bladder,,,bladder urothelial carcinoma


In [62]:
# Export UCD only

write_csv(edited_ucd_files, '2022-02-14_UCDavis_files_edited.csv')

### Work on WISTAR

In [63]:
edited_pdtc_files_by_contributor[['WISTAR']] %>% nrow()
edited_pdtc_files_by_contributor[['WISTAR']] %>% head()
colnames(edited_pdtc_files_by_contributor[['WISTAR']])

Index,contributor,file_name,patient_id,model_id,tumor_id,case_id,sample_id,sample_type,experimental_strategy,⋯,data_format,data_type,platform,is_ffpe,file_size,passage,disease_type,comment,investigation,disease_detail
<dbl>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,⋯,<chr>,<chr>,<chr>,<lgl>,<dbl>,<chr>,<chr>,<chr>,<lgl>,<chr>
1566,WISTAR,wm3907n_R1.fastq.gz,WM3907,wm3907n,,,wm3907n,Tumor,WES,⋯,fastq.gz,Raw Reads,Illumina,False,5439234,Tumor,Unknown,,,Unknown
1567,WISTAR,wm3907n_R2.fastq.gz,WM3907,wm3907n,,,wm3907n,Tumor,WES,⋯,fastq.gz,Raw Reads,Illumina,False,6415082,Tumor,Unknown,,,Unknown
1568,WISTAR,wm3922n_R1.fastq.gz,WM3922,wm3922n,,,wm3922n,Tumor,WES,⋯,fastq.gz,Raw Reads,Illumina,False,3024602,Tumor,Unknown,,,Unknown
1569,WISTAR,wm3922n_R2.fastq.gz,WM3922,wm3922n,,,wm3922n,Tumor,WES,⋯,fastq.gz,Raw Reads,Illumina,False,3671122,Tumor,Unknown,,,Unknown
1570,WISTAR,wm3926n_R1.fastq.gz,WM3926,wm3926n,,,wm3926n,Tumor,WES,⋯,fastq.gz,Raw Reads,Illumina,False,3679890,Tumor,Unknown,,,Unknown
1571,WISTAR,wm3926n_R2.fastq.gz,WM3926,wm3926n,,,wm3926n,Tumor,WES,⋯,fastq.gz,Raw Reads,Illumina,False,4094170,Tumor,Unknown,,,Unknown


In [64]:
# How many NAs are there for the most important columns?

colSums((is.na(edited_pdtc_files_by_contributor[['WISTAR']]) | (edited_pdtc_files_by_contributor[['WISTAR']] == 'Unknown') | (edited_pdtc_files_by_contributor[['WISTAR']] == '?'))) %>% as.data.frame()

Unnamed: 0_level_0,.
Unnamed: 0_level_1,<dbl>
Index,0
contributor,0
file_name,0
patient_id,12
model_id,0
tumor_id,382
case_id,382
sample_id,0
sample_type,0
experimental_strategy,0


One thing we can fix very easily, is that all of these are Melanoma models according to the metadata sheet that I've gotten for dbGaP/CDS. The missing patient IDs I looked for in the model metadata, but none of those 12 are in there.

In [65]:
# Load i nthe subject phenotype info from the dbGaP submission to fill over to this

wistar_subject_phenotypes <- read_excel('wistar_subjectphenotype_DS_v2.xlsx')

head(wistar_subject_phenotypes)

SUBJECT_ID,CONSENT,SEX,AGE,DISEASE_DESCRIPTION,DISEASE_CODE
<chr>,<dbl>,<chr>,<chr>,<chr>,<dbl>
WM3907,1,2,62,Melanoma,10053571
WM3922,1,1,57,Melanoma,10053571
WM3926,1,1,47,Melanoma,10053571
WM3934,1,2,75,Melanoma,10053571
WM3936,1,2,70,Melanoma,10053571
WM3939,1,1,54,Melanoma,10053571


In [66]:
# Are all these subject Ids found in the portal metadata?

(edited_pdtc_files_by_contributor[['WISTAR']]$patient_id %>% unique()) %in% wistar_subject_phenotypes$SUBJECT_ID %>% table()

.
FALSE  TRUE 
    1    46 

All but one of them are in there. So, what's that one?

In [67]:
edited_pdtc_files_by_contributor[['WISTAR']] %>% filter(!(patient_id %in% wistar_subject_phenotypes$SUBJECT_ID))

Index,contributor,file_name,patient_id,model_id,tumor_id,case_id,sample_id,sample_type,experimental_strategy,⋯,data_format,data_type,platform,is_ffpe,file_size,passage,disease_type,comment,investigation,disease_detail
<dbl>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,⋯,<chr>,<chr>,<chr>,<lgl>,<dbl>,<chr>,<chr>,<chr>,<lgl>,<chr>
1642,WISTAR,1-AM-0245_S1_R1.fastq.gz,,AM0245,,,AM0245,PDX,WES,⋯,fastq.gz,Raw Reads,Illumina,False,6576631217,,Unknown,,,Unknown
1643,WISTAR,1-AM-0245_S1_R2.fastq.gz,,AM0245,,,AM0245,PDX,WES,⋯,fastq.gz,Raw Reads,Illumina,False,6604368891,,Unknown,,,Unknown
1644,WISTAR,2-CK-2021_S2_R1.fastq.gz,,CK2021,,,CK2021,PDX,WES,⋯,fastq.gz,Raw Reads,Illumina,False,5896023384,,Unknown,,,Unknown
1645,WISTAR,2-CK-2021_S2_R2.fastq.gz,,CK2021,,,CK2021,PDX,WES,⋯,fastq.gz,Raw Reads,Illumina,False,5968313484,,Unknown,,,Unknown
1646,WISTAR,3-CK-0643_S3_R1.fastq.gz,,CK0643,,,CK0643,PDX,WES,⋯,fastq.gz,Raw Reads,Illumina,False,12696484380,,Unknown,,,Unknown
1647,WISTAR,3-CK-0643_S3_R2.fastq.gz,,CK0643,,,CK0643,PDX,WES,⋯,fastq.gz,Raw Reads,Illumina,False,12831443167,,Unknown,,,Unknown
1648,WISTAR,1-AM0245_S1_1.fastq,,AM0245,,,AM0245,PDX,RNA-Seq,⋯,fastq,Raw Reads,Illumina,False,10954394853,,Unknown,,,Unknown
1649,WISTAR,1-AM0245_S1_2.fastq,,AM0245,,,AM0245,PDX,RNA-Seq,⋯,fastq,Raw Reads,Illumina,False,10954834357,,Unknown,,,Unknown
1650,WISTAR,2-CK2021_S2_1.fastq,,CK2021,,,CK2021,PDX,RNA-Seq,⋯,fastq,Raw Reads,Illumina,False,11295504826,,Unknown,,,Unknown
1651,WISTAR,2-CK2021_S2_2.fastq,,CK2021,,,CK2021,PDX,RNA-Seq,⋯,fastq,Raw Reads,Illumina,False,11295091028,,Unknown,,,Unknown


Looks like these are all the ones which are from the files that have no patient ID. No hope of filling those.

In [68]:
# Start with the easy stuff

edited_wistar_files <- edited_pdtc_files_by_contributor[['WISTAR']] %>%
    mutate(data_category = 'Raw Sequencing Data',
           data_format = ifelse(endsWith(file_name, 'gz'), 'fastq.gz', 'fastq'),
           data_type = 'Raw Reads'
           )

In [69]:
# Fill in the disease info for all the others, and make the rest of the fields more consistent

edited_wistar_files <- edited_wistar_files %>% left_join(select(wistar_subject_phenotypes, SUBJECT_ID, SEX, DISEASE_DESCRIPTION), by = c('patient_id' = 'SUBJECT_ID')) %>%
    mutate(gender = str_replace(str_replace(str_replace(SEX, 'UNK', 'Unknown'), '1', 'Male'), '2', 'Female'),
           disease_detail = ifelse(!is.na(DISEASE_DESCRIPTION), DISEASE_DESCRIPTION, 'Unknown'),
           passage = ifelse(((passage == 'Tumor') | (passage == 'tumor')), NA, passage)) %>%
    mutate(disease_type = ifelse(disease_detail == 'Unknown', 'Unknown', 'Skin')) %>%
    select(-SEX, -DISEASE_DESCRIPTION) 

nrow(edited_wistar_files)
head(edited_wistar_files)

Index,contributor,file_name,patient_id,model_id,tumor_id,case_id,sample_id,sample_type,experimental_strategy,⋯,data_format,data_type,platform,is_ffpe,file_size,passage,disease_type,comment,investigation,disease_detail
<dbl>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,⋯,<chr>,<chr>,<chr>,<lgl>,<dbl>,<chr>,<chr>,<chr>,<lgl>,<chr>
1566,WISTAR,wm3907n_R1.fastq.gz,WM3907,wm3907n,,,wm3907n,Tumor,WES,⋯,fastq.gz,Raw Reads,Illumina,False,5439234,,Skin,,,Melanoma
1567,WISTAR,wm3907n_R2.fastq.gz,WM3907,wm3907n,,,wm3907n,Tumor,WES,⋯,fastq.gz,Raw Reads,Illumina,False,6415082,,Skin,,,Melanoma
1568,WISTAR,wm3922n_R1.fastq.gz,WM3922,wm3922n,,,wm3922n,Tumor,WES,⋯,fastq.gz,Raw Reads,Illumina,False,3024602,,Skin,,,Melanoma
1569,WISTAR,wm3922n_R2.fastq.gz,WM3922,wm3922n,,,wm3922n,Tumor,WES,⋯,fastq.gz,Raw Reads,Illumina,False,3671122,,Skin,,,Melanoma
1570,WISTAR,wm3926n_R1.fastq.gz,WM3926,wm3926n,,,wm3926n,Tumor,WES,⋯,fastq.gz,Raw Reads,Illumina,False,3679890,,Skin,,,Melanoma
1571,WISTAR,wm3926n_R2.fastq.gz,WM3926,wm3926n,,,wm3926n,Tumor,WES,⋯,fastq.gz,Raw Reads,Illumina,False,4094170,,Skin,,,Melanoma


In [70]:
# One more small detail, turn NAs in gender into Unknown

edited_wistar_files <- edited_wistar_files %>% mutate(gender = ifelse(is.na(gender), 'Unknown', gender))

In [71]:
# Export wistar only

write_csv(edited_wistar_files, '2022-02-14_Wistar_files_edited.csv')

### Last is WUSTL

In [72]:
edited_pdtc_files_by_contributor[['WUSTL']] %>% nrow()
edited_pdtc_files_by_contributor[['WUSTL']] %>% head()
colnames(edited_pdtc_files_by_contributor[['WUSTL']])

Index,contributor,file_name,patient_id,model_id,tumor_id,case_id,sample_id,sample_type,experimental_strategy,⋯,data_format,data_type,platform,is_ffpe,file_size,passage,disease_type,comment,investigation,disease_detail
<dbl>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,⋯,<chr>,<chr>,<chr>,<lgl>,<dbl>,<chr>,<chr>,<chr>,<lgl>,<chr>
1949,WUSTL,172845-121-T.r1.fastq.gz,PDMR-172845,172845-121-T_SL,121-T,PDMR-172845,172845-121-T_SL,PDX,RNA-Seq,⋯,fastq.gz,Raw Reads,NovaSeq S4,,4348811841,P3,Unknown,,,Unknown
1950,WUSTL,172845-121-T.r2.fastq.gz,PDMR-172845,172845-121-T_SL,121-T,PDMR-172845,172845-121-T_SL,PDX,RNA-Seq,⋯,fastq.gz,Raw Reads,NovaSeq S4,,4209751331,P3,Unknown,,,Unknown
1951,WUSTL,172845-121-T.singletons.r1.fastq.gz,PDMR-172845,172845-121-T_SL,121-T,PDMR-172845,172845-121-T_SL,PDX,RNA-Seq,⋯,fastq.gz,Raw Reads,NovaSeq S4,,40247394,P3,Unknown,,,Unknown
1952,WUSTL,172845-121-T.singletons.r2.fastq.gz,PDMR-172845,172845-121-T_SL,121-T,PDMR-172845,172845-121-T_SL,PDX,RNA-Seq,⋯,fastq.gz,Raw Reads,NovaSeq S4,,93138489,P3,Unknown,,,Unknown
1954,WUSTL,625472-104-R.r1.fastq.gz,PDMR-625472,625472-104-R_SL,104-R,PDMR-625472,625472-104-R_SL,PDX,RNA-Seq,⋯,fastq.gz,Raw Reads,NovaSeq S4,,3184453176,P2,Unknown,,,Unknown
1955,WUSTL,625472-104-R.r2.fastq.gz,PDMR-625472,625472-104-R_SL,104-R,PDMR-625472,625472-104-R_SL,PDX,RNA-Seq,⋯,fastq.gz,Raw Reads,NovaSeq S4,,3105288235,P2,Unknown,,,Unknown


In [73]:
# How many NAs are there for the most important columns?

colSums((is.na(edited_pdtc_files_by_contributor[['WUSTL']]) | (edited_pdtc_files_by_contributor[['WUSTL']] == 'Unknown') | (edited_pdtc_files_by_contributor[['WUSTL']] == '?'))) %>% as.data.frame()

Unnamed: 0_level_0,.
Unnamed: 0_level_1,<dbl>
Index,0
contributor,0
file_name,0
patient_id,0
model_id,360
tumor_id,360
case_id,0
sample_id,0
sample_type,0
experimental_strategy,0


Some of these that are missing disease information are included in the PDMR patient info.

In [74]:
# Read in the PDMR patient info

pdmr_patient_info <- read_csv('pdmr_patientinformation.csv')

nrow(pdmr_patient_info)
head(pdmr_patient_info)

[1mRows: [22m[34m840[39m [1mColumns: [22m[34m7[39m

[36m──[39m [1mColumn specification[22m [36m──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────[39m
[1mDelimiter:[22m ","
[31mchr[39m (6): Patient ID, Gender, CTEP SDCDescription, DiagnosisSubtype, Disease ...
[32mdbl[39m (1): View


[36mℹ[39m Use [30m[47m[30m[47m`spec()`[47m[30m[49m[39m to retrieve the full column specification for this data.
[36mℹ[39m Specify the column types or set [30m[47m[30m[47m`show_col_types = FALSE`[47m[30m[49m[39m to quiet this message.



View,Patient ID,Gender,CTEP SDCDescription,DiagnosisSubtype,Disease BodyLocation,STR ProfileAvail
<dbl>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>
2915,112475,Female,Ovarian epithelial cancer,Papillary Serous Cystadenocarcinoma,Gynecologic,Yes
890,112718,Female,Lung adenocarcinoma,-,Respiratory/Thoracic,Yes
3359,112975,Male,"Urothelial/bladder cancer, NOS","Invasive, high grade Papillary carcinoma of the bladder",Genitourinary,Yes
5273,113646,Female,Invasive breast carcinoma,Ductal,Breast,Yes
2215,114348,Male,Adenocarcinoma - rectum,-,Digestive/Gastrointestinal,Yes
1280,114434,Male,Non-Rhabdo. soft tissue sarcoma,"Spindle cell, solitary fibrous tumor",Musculoskeletal,Yes


In [75]:
# Reformat the wustl data so it can be merged with this patient info to fill those ones in

edited_wustl_files <- edited_pdtc_files_by_contributor[['WUSTL']] %>% mutate(pdmr_patient = ifelse(startsWith(patient_id, 'PDMR'), str_replace(patient_id, 'PDMR-', ''), NA))

nrow(edited_wustl_files)
head(edited_wustl_files)

Index,contributor,file_name,patient_id,model_id,tumor_id,case_id,sample_id,sample_type,experimental_strategy,⋯,data_type,platform,is_ffpe,file_size,passage,disease_type,comment,investigation,disease_detail,pdmr_patient
<dbl>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,⋯,<chr>,<chr>,<lgl>,<dbl>,<chr>,<chr>,<chr>,<lgl>,<chr>,<chr>
1949,WUSTL,172845-121-T.r1.fastq.gz,PDMR-172845,172845-121-T_SL,121-T,PDMR-172845,172845-121-T_SL,PDX,RNA-Seq,⋯,Raw Reads,NovaSeq S4,,4348811841,P3,Unknown,,,Unknown,172845
1950,WUSTL,172845-121-T.r2.fastq.gz,PDMR-172845,172845-121-T_SL,121-T,PDMR-172845,172845-121-T_SL,PDX,RNA-Seq,⋯,Raw Reads,NovaSeq S4,,4209751331,P3,Unknown,,,Unknown,172845
1951,WUSTL,172845-121-T.singletons.r1.fastq.gz,PDMR-172845,172845-121-T_SL,121-T,PDMR-172845,172845-121-T_SL,PDX,RNA-Seq,⋯,Raw Reads,NovaSeq S4,,40247394,P3,Unknown,,,Unknown,172845
1952,WUSTL,172845-121-T.singletons.r2.fastq.gz,PDMR-172845,172845-121-T_SL,121-T,PDMR-172845,172845-121-T_SL,PDX,RNA-Seq,⋯,Raw Reads,NovaSeq S4,,93138489,P3,Unknown,,,Unknown,172845
1954,WUSTL,625472-104-R.r1.fastq.gz,PDMR-625472,625472-104-R_SL,104-R,PDMR-625472,625472-104-R_SL,PDX,RNA-Seq,⋯,Raw Reads,NovaSeq S4,,3184453176,P2,Unknown,,,Unknown,625472
1955,WUSTL,625472-104-R.r2.fastq.gz,PDMR-625472,625472-104-R_SL,104-R,PDMR-625472,625472-104-R_SL,PDX,RNA-Seq,⋯,Raw Reads,NovaSeq S4,,3105288235,P2,Unknown,,,Unknown,625472


In [76]:
# Add the pdmr info to the data frame

edited_wustl_files <- edited_wustl_files %>% left_join(select(pdmr_patient_info, -View, -'STR ProfileAvail', -Gender), by = c('pdmr_patient' = 'Patient ID'))

nrow(edited_wustl_files)
head(edited_wustl_files)

Index,contributor,file_name,patient_id,model_id,tumor_id,case_id,sample_id,sample_type,experimental_strategy,⋯,file_size,passage,disease_type,comment,investigation,disease_detail,pdmr_patient,CTEP SDCDescription,DiagnosisSubtype,Disease BodyLocation
<dbl>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,⋯,<dbl>,<chr>,<chr>,<chr>,<lgl>,<chr>,<chr>,<chr>,<chr>,<chr>
1949,WUSTL,172845-121-T.r1.fastq.gz,PDMR-172845,172845-121-T_SL,121-T,PDMR-172845,172845-121-T_SL,PDX,RNA-Seq,⋯,4348811841,P3,Unknown,,,Unknown,172845,Adenocarcinoma - colon,-,Digestive/Gastrointestinal
1950,WUSTL,172845-121-T.r2.fastq.gz,PDMR-172845,172845-121-T_SL,121-T,PDMR-172845,172845-121-T_SL,PDX,RNA-Seq,⋯,4209751331,P3,Unknown,,,Unknown,172845,Adenocarcinoma - colon,-,Digestive/Gastrointestinal
1951,WUSTL,172845-121-T.singletons.r1.fastq.gz,PDMR-172845,172845-121-T_SL,121-T,PDMR-172845,172845-121-T_SL,PDX,RNA-Seq,⋯,40247394,P3,Unknown,,,Unknown,172845,Adenocarcinoma - colon,-,Digestive/Gastrointestinal
1952,WUSTL,172845-121-T.singletons.r2.fastq.gz,PDMR-172845,172845-121-T_SL,121-T,PDMR-172845,172845-121-T_SL,PDX,RNA-Seq,⋯,93138489,P3,Unknown,,,Unknown,172845,Adenocarcinoma - colon,-,Digestive/Gastrointestinal
1954,WUSTL,625472-104-R.r1.fastq.gz,PDMR-625472,625472-104-R_SL,104-R,PDMR-625472,625472-104-R_SL,PDX,RNA-Seq,⋯,3184453176,P2,Unknown,,,Unknown,625472,Adenocarcinoma - colon,-,Digestive/Gastrointestinal
1955,WUSTL,625472-104-R.r2.fastq.gz,PDMR-625472,625472-104-R_SL,104-R,PDMR-625472,625472-104-R_SL,PDX,RNA-Seq,⋯,3105288235,P2,Unknown,,,Unknown,625472,Adenocarcinoma - colon,-,Digestive/Gastrointestinal


In [77]:
# Fix the disease_type and disease_detail, as well as paired end status

edited_wustl_files <- edited_wustl_files %>% mutate(
    disease_type = ifelse(!is.na(`Disease BodyLocation`), `Disease BodyLocation`, disease_type),
    disease_detail = ifelse(!is.na(`CTEP SDCDescription`), `CTEP SDCDescription`, disease_detail),
    paired_end = ifelse(((paired_end == 'Unknown') | is.na(paired_end)), ifelse(endsWith(file_name, 'r1.fastq.gz'), 1, 2), paired_end)
    ) %>%
    select(Index:disease_detail)

nrow(edited_wustl_files)
head(edited_wustl_files)

Index,contributor,file_name,patient_id,model_id,tumor_id,case_id,sample_id,sample_type,experimental_strategy,⋯,data_format,data_type,platform,is_ffpe,file_size,passage,disease_type,comment,investigation,disease_detail
<dbl>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,⋯,<chr>,<chr>,<chr>,<lgl>,<dbl>,<chr>,<chr>,<chr>,<lgl>,<chr>
1949,WUSTL,172845-121-T.r1.fastq.gz,PDMR-172845,172845-121-T_SL,121-T,PDMR-172845,172845-121-T_SL,PDX,RNA-Seq,⋯,fastq.gz,Raw Reads,NovaSeq S4,,4348811841,P3,Digestive/Gastrointestinal,,,Adenocarcinoma - colon
1950,WUSTL,172845-121-T.r2.fastq.gz,PDMR-172845,172845-121-T_SL,121-T,PDMR-172845,172845-121-T_SL,PDX,RNA-Seq,⋯,fastq.gz,Raw Reads,NovaSeq S4,,4209751331,P3,Digestive/Gastrointestinal,,,Adenocarcinoma - colon
1951,WUSTL,172845-121-T.singletons.r1.fastq.gz,PDMR-172845,172845-121-T_SL,121-T,PDMR-172845,172845-121-T_SL,PDX,RNA-Seq,⋯,fastq.gz,Raw Reads,NovaSeq S4,,40247394,P3,Digestive/Gastrointestinal,,,Adenocarcinoma - colon
1952,WUSTL,172845-121-T.singletons.r2.fastq.gz,PDMR-172845,172845-121-T_SL,121-T,PDMR-172845,172845-121-T_SL,PDX,RNA-Seq,⋯,fastq.gz,Raw Reads,NovaSeq S4,,93138489,P3,Digestive/Gastrointestinal,,,Adenocarcinoma - colon
1954,WUSTL,625472-104-R.r1.fastq.gz,PDMR-625472,625472-104-R_SL,104-R,PDMR-625472,625472-104-R_SL,PDX,RNA-Seq,⋯,fastq.gz,Raw Reads,NovaSeq S4,,3184453176,P2,Digestive/Gastrointestinal,,,Adenocarcinoma - colon
1955,WUSTL,625472-104-R.r2.fastq.gz,PDMR-625472,625472-104-R_SL,104-R,PDMR-625472,625472-104-R_SL,PDX,RNA-Seq,⋯,fastq.gz,Raw Reads,NovaSeq S4,,3105288235,P2,Digestive/Gastrointestinal,,,Adenocarcinoma - colon


In [78]:
# How'd we do fixing these?

colSums((is.na(edited_wustl_files) | (edited_wustl_files == 'Unknown') | (edited_wustl_files == '?'))) %>% as.data.frame()

Unnamed: 0_level_0,.
Unnamed: 0_level_1,<dbl>
Index,0
contributor,0
file_name,0
patient_id,0
model_id,360
tumor_id,360
case_id,0
sample_id,0
sample_type,0
experimental_strategy,0


Definitely filled in many.

In [79]:
# Export wustl only

write_csv(edited_wustl_files, '2022-02-15_wustl_files_edited.csv')

### Merge all into a single list and final cleaning
Will also need to redo the index number on them.

In [80]:
# Bind rows into a single dataframe

merged_edited_pdtc_files <- bind_rows(edited_bcm_files, edited_hci_files, edited_mdacc_files, edited_ucd_files, edited_wistar_files, edited_wustl_files)

nrow(merged_edited_pdtc_files)
head(merged_edited_pdtc_files)

Index,contributor,file_name,patient_id,model_id,tumor_id,case_id,sample_id,sample_type,experimental_strategy,⋯,data_format,data_type,platform,is_ffpe,file_size,passage,disease_type,comment,investigation,disease_detail
<dbl>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,⋯,<chr>,<chr>,<chr>,<lgl>,<dbl>,<chr>,<chr>,<chr>,<lgl>,<chr>
1,BCM,LZML_37246_NoIndex_L00X_R1_001.fastq.gz,24452,BCM-0002,2011012-2,,BCM-0002,PDX,RNA-Seq,⋯,fastq.gz,Raw Reads,Illumina,False,20656964441,,Breast,,,Invasive breast carcinoma
2,BCM,LZML_37246_NoIndex_L00X_R2_001.fastq.gz,24452,BCM-0002,2011012-2,,BCM-0002,PDX,RNA-Seq,⋯,fastq.gz,Raw Reads,Illumina,False,21093643386,,Breast,,,Invasive breast carcinoma
3,BCM,ML6065_GATCAGCG_L00X_R1_001.fastq.gz,2219,BCM-0046,2011012-46,,BCM-0046,PDX,RNA-Seq,⋯,fastq.gz,Raw Reads,Illumina,False,22027304876,,Breast,,,Invasive breast carcinoma
4,BCM,ML6065_GATCAGCG_L00X_R2_001.fastq.gz,2219,BCM-0046,2011012-46,,BCM-0046,PDX,RNA-Seq,⋯,fastq.gz,Raw Reads,Illumina,False,22027304876,,Breast,,,Invasive breast carcinoma
5,BCM,ML6061_CAGATCTG_L00X_R1_001.fastq.gz,24561,BCM-0104,2011012-104,,BCM-0104,PDX,RNA-Seq,⋯,fastq.gz,Raw Reads,Illumina,False,17927987899,,Breast,,,Invasive breast carcinoma
6,BCM,ML6061_CAGATCTG_L00X_R2_001.fastq.gz,24561,BCM-0104,2011012-104,,BCM-0104,PDX,RNA-Seq,⋯,fastq.gz,Raw Reads,Illumina,False,18002109303,,Breast,,,Invasive breast carcinoma


In [81]:
# How'd we do fixing these?

colSums((is.na(merged_edited_pdtc_files) | (merged_edited_pdtc_files == 'Unknown') | (merged_edited_pdtc_files == '?'))) %>% as.data.frame()

Unnamed: 0_level_0,.
Unnamed: 0_level_1,<dbl>
Index,0
contributor,0
file_name,0
patient_id,26
model_id,870
tumor_id,1422
case_id,602
sample_id,0
sample_type,0
experimental_strategy,0


In [82]:
# Give them all a new index

merged_edited_pdtc_files$Index <- seq(1, 2180)

head(merged_edited_pdtc_files)

Index,contributor,file_name,patient_id,model_id,tumor_id,case_id,sample_id,sample_type,experimental_strategy,⋯,data_format,data_type,platform,is_ffpe,file_size,passage,disease_type,comment,investigation,disease_detail
<int>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,⋯,<chr>,<chr>,<chr>,<lgl>,<dbl>,<chr>,<chr>,<chr>,<lgl>,<chr>
1,BCM,LZML_37246_NoIndex_L00X_R1_001.fastq.gz,24452,BCM-0002,2011012-2,,BCM-0002,PDX,RNA-Seq,⋯,fastq.gz,Raw Reads,Illumina,False,20656964441,,Breast,,,Invasive breast carcinoma
2,BCM,LZML_37246_NoIndex_L00X_R2_001.fastq.gz,24452,BCM-0002,2011012-2,,BCM-0002,PDX,RNA-Seq,⋯,fastq.gz,Raw Reads,Illumina,False,21093643386,,Breast,,,Invasive breast carcinoma
3,BCM,ML6065_GATCAGCG_L00X_R1_001.fastq.gz,2219,BCM-0046,2011012-46,,BCM-0046,PDX,RNA-Seq,⋯,fastq.gz,Raw Reads,Illumina,False,22027304876,,Breast,,,Invasive breast carcinoma
4,BCM,ML6065_GATCAGCG_L00X_R2_001.fastq.gz,2219,BCM-0046,2011012-46,,BCM-0046,PDX,RNA-Seq,⋯,fastq.gz,Raw Reads,Illumina,False,22027304876,,Breast,,,Invasive breast carcinoma
5,BCM,ML6061_CAGATCTG_L00X_R1_001.fastq.gz,24561,BCM-0104,2011012-104,,BCM-0104,PDX,RNA-Seq,⋯,fastq.gz,Raw Reads,Illumina,False,17927987899,,Breast,,,Invasive breast carcinoma
6,BCM,ML6061_CAGATCTG_L00X_R2_001.fastq.gz,24561,BCM-0104,2011012-104,,BCM-0104,PDX,RNA-Seq,⋯,fastq.gz,Raw Reads,Illumina,False,18002109303,,Breast,,,Invasive breast carcinoma


In [83]:
# In the passage column it should be "unknown" if sample type is PDX, rather than NA
# Tumors and normals should be NA because that column does not apply

merged_edited_pdtc_files <- merged_edited_pdtc_files %>% mutate(passage = ifelse(((sample_type == 'PDX') & is.na(passage)), 'Unknown', passage))

nrow(merged_edited_pdtc_files)
head(merged_edited_pdtc_files)

Index,contributor,file_name,patient_id,model_id,tumor_id,case_id,sample_id,sample_type,experimental_strategy,⋯,data_format,data_type,platform,is_ffpe,file_size,passage,disease_type,comment,investigation,disease_detail
<int>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,⋯,<chr>,<chr>,<chr>,<lgl>,<dbl>,<chr>,<chr>,<chr>,<lgl>,<chr>
1,BCM,LZML_37246_NoIndex_L00X_R1_001.fastq.gz,24452,BCM-0002,2011012-2,,BCM-0002,PDX,RNA-Seq,⋯,fastq.gz,Raw Reads,Illumina,False,20656964441,Unknown,Breast,,,Invasive breast carcinoma
2,BCM,LZML_37246_NoIndex_L00X_R2_001.fastq.gz,24452,BCM-0002,2011012-2,,BCM-0002,PDX,RNA-Seq,⋯,fastq.gz,Raw Reads,Illumina,False,21093643386,Unknown,Breast,,,Invasive breast carcinoma
3,BCM,ML6065_GATCAGCG_L00X_R1_001.fastq.gz,2219,BCM-0046,2011012-46,,BCM-0046,PDX,RNA-Seq,⋯,fastq.gz,Raw Reads,Illumina,False,22027304876,Unknown,Breast,,,Invasive breast carcinoma
4,BCM,ML6065_GATCAGCG_L00X_R2_001.fastq.gz,2219,BCM-0046,2011012-46,,BCM-0046,PDX,RNA-Seq,⋯,fastq.gz,Raw Reads,Illumina,False,22027304876,Unknown,Breast,,,Invasive breast carcinoma
5,BCM,ML6061_CAGATCTG_L00X_R1_001.fastq.gz,24561,BCM-0104,2011012-104,,BCM-0104,PDX,RNA-Seq,⋯,fastq.gz,Raw Reads,Illumina,False,17927987899,Unknown,Breast,,,Invasive breast carcinoma
6,BCM,ML6061_CAGATCTG_L00X_R2_001.fastq.gz,24561,BCM-0104,2011012-104,,BCM-0104,PDX,RNA-Seq,⋯,fastq.gz,Raw Reads,Illumina,False,18002109303,Unknown,Breast,,,Invasive breast carcinoma


In [84]:
# Remove columns that are not useful

merged_edited_pdtc_files <- merged_edited_pdtc_files %>% select(-file_size, -created_datetime, -investigation)

nrow(merged_edited_pdtc_files)
head(merged_edited_pdtc_files)

Index,contributor,file_name,patient_id,model_id,tumor_id,case_id,sample_id,sample_type,experimental_strategy,⋯,access_level,data_category,data_format,data_type,platform,is_ffpe,passage,disease_type,comment,disease_detail
<int>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,⋯,<chr>,<chr>,<chr>,<chr>,<chr>,<lgl>,<chr>,<chr>,<chr>,<chr>
1,BCM,LZML_37246_NoIndex_L00X_R1_001.fastq.gz,24452,BCM-0002,2011012-2,,BCM-0002,PDX,RNA-Seq,⋯,Controlled,Raw Sequencing Data,fastq.gz,Raw Reads,Illumina,False,Unknown,Breast,,Invasive breast carcinoma
2,BCM,LZML_37246_NoIndex_L00X_R2_001.fastq.gz,24452,BCM-0002,2011012-2,,BCM-0002,PDX,RNA-Seq,⋯,Controlled,Raw Sequencing Data,fastq.gz,Raw Reads,Illumina,False,Unknown,Breast,,Invasive breast carcinoma
3,BCM,ML6065_GATCAGCG_L00X_R1_001.fastq.gz,2219,BCM-0046,2011012-46,,BCM-0046,PDX,RNA-Seq,⋯,Controlled,Raw Sequencing Data,fastq.gz,Raw Reads,Illumina,False,Unknown,Breast,,Invasive breast carcinoma
4,BCM,ML6065_GATCAGCG_L00X_R2_001.fastq.gz,2219,BCM-0046,2011012-46,,BCM-0046,PDX,RNA-Seq,⋯,Controlled,Raw Sequencing Data,fastq.gz,Raw Reads,Illumina,False,Unknown,Breast,,Invasive breast carcinoma
5,BCM,ML6061_CAGATCTG_L00X_R1_001.fastq.gz,24561,BCM-0104,2011012-104,,BCM-0104,PDX,RNA-Seq,⋯,Controlled,Raw Sequencing Data,fastq.gz,Raw Reads,Illumina,False,Unknown,Breast,,Invasive breast carcinoma
6,BCM,ML6061_CAGATCTG_L00X_R2_001.fastq.gz,24561,BCM-0104,2011012-104,,BCM-0104,PDX,RNA-Seq,⋯,Controlled,Raw Sequencing Data,fastq.gz,Raw Reads,Illumina,False,Unknown,Breast,,Invasive breast carcinoma


In [85]:
# Rename the disease_type to and change disease_detail to disease_type for consistency

merged_edited_pdtc_files <- merged_edited_pdtc_files %>% rename('body_location' = 'disease_type') %>% rename('disease_type' = 'disease_detail')

head(merged_edited_pdtc_files)

Index,contributor,file_name,patient_id,model_id,tumor_id,case_id,sample_id,sample_type,experimental_strategy,⋯,access_level,data_category,data_format,data_type,platform,is_ffpe,passage,body_location,comment,disease_type
<int>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,⋯,<chr>,<chr>,<chr>,<chr>,<chr>,<lgl>,<chr>,<chr>,<chr>,<chr>
1,BCM,LZML_37246_NoIndex_L00X_R1_001.fastq.gz,24452,BCM-0002,2011012-2,,BCM-0002,PDX,RNA-Seq,⋯,Controlled,Raw Sequencing Data,fastq.gz,Raw Reads,Illumina,False,Unknown,Breast,,Invasive breast carcinoma
2,BCM,LZML_37246_NoIndex_L00X_R2_001.fastq.gz,24452,BCM-0002,2011012-2,,BCM-0002,PDX,RNA-Seq,⋯,Controlled,Raw Sequencing Data,fastq.gz,Raw Reads,Illumina,False,Unknown,Breast,,Invasive breast carcinoma
3,BCM,ML6065_GATCAGCG_L00X_R1_001.fastq.gz,2219,BCM-0046,2011012-46,,BCM-0046,PDX,RNA-Seq,⋯,Controlled,Raw Sequencing Data,fastq.gz,Raw Reads,Illumina,False,Unknown,Breast,,Invasive breast carcinoma
4,BCM,ML6065_GATCAGCG_L00X_R2_001.fastq.gz,2219,BCM-0046,2011012-46,,BCM-0046,PDX,RNA-Seq,⋯,Controlled,Raw Sequencing Data,fastq.gz,Raw Reads,Illumina,False,Unknown,Breast,,Invasive breast carcinoma
5,BCM,ML6061_CAGATCTG_L00X_R1_001.fastq.gz,24561,BCM-0104,2011012-104,,BCM-0104,PDX,RNA-Seq,⋯,Controlled,Raw Sequencing Data,fastq.gz,Raw Reads,Illumina,False,Unknown,Breast,,Invasive breast carcinoma
6,BCM,ML6061_CAGATCTG_L00X_R2_001.fastq.gz,24561,BCM-0104,2011012-104,,BCM-0104,PDX,RNA-Seq,⋯,Controlled,Raw Sequencing Data,fastq.gz,Raw Reads,Illumina,False,Unknown,Breast,,Invasive breast carcinoma


In [86]:
# Do some final cleanup of the columns to turn things into unknown vs just NA

merged_edited_pdtc_files <- merged_edited_pdtc_files %>% mutate(
    is_ffpe = ifelse(is.na(is_ffpe), 'Unknown', is_ffpe),
    access_level = ifelse(is.na(access_level), 'Controlled', access_level),
    public = ifelse(is.na(public), 'No', public),
    gender = ifelse(is.na(gender), 'Unknown', gender)
    )

head(merged_edited_pdtc_files)

Index,contributor,file_name,patient_id,model_id,tumor_id,case_id,sample_id,sample_type,experimental_strategy,⋯,access_level,data_category,data_format,data_type,platform,is_ffpe,passage,body_location,comment,disease_type
<int>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,⋯,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>
1,BCM,LZML_37246_NoIndex_L00X_R1_001.fastq.gz,24452,BCM-0002,2011012-2,,BCM-0002,PDX,RNA-Seq,⋯,Controlled,Raw Sequencing Data,fastq.gz,Raw Reads,Illumina,False,Unknown,Breast,,Invasive breast carcinoma
2,BCM,LZML_37246_NoIndex_L00X_R2_001.fastq.gz,24452,BCM-0002,2011012-2,,BCM-0002,PDX,RNA-Seq,⋯,Controlled,Raw Sequencing Data,fastq.gz,Raw Reads,Illumina,False,Unknown,Breast,,Invasive breast carcinoma
3,BCM,ML6065_GATCAGCG_L00X_R1_001.fastq.gz,2219,BCM-0046,2011012-46,,BCM-0046,PDX,RNA-Seq,⋯,Controlled,Raw Sequencing Data,fastq.gz,Raw Reads,Illumina,False,Unknown,Breast,,Invasive breast carcinoma
4,BCM,ML6065_GATCAGCG_L00X_R2_001.fastq.gz,2219,BCM-0046,2011012-46,,BCM-0046,PDX,RNA-Seq,⋯,Controlled,Raw Sequencing Data,fastq.gz,Raw Reads,Illumina,False,Unknown,Breast,,Invasive breast carcinoma
5,BCM,ML6061_CAGATCTG_L00X_R1_001.fastq.gz,24561,BCM-0104,2011012-104,,BCM-0104,PDX,RNA-Seq,⋯,Controlled,Raw Sequencing Data,fastq.gz,Raw Reads,Illumina,False,Unknown,Breast,,Invasive breast carcinoma
6,BCM,ML6061_CAGATCTG_L00X_R2_001.fastq.gz,24561,BCM-0104,2011012-104,,BCM-0104,PDX,RNA-Seq,⋯,Controlled,Raw Sequencing Data,fastq.gz,Raw Reads,Illumina,False,Unknown,Breast,,Invasive breast carcinoma


There are some HCI files missing capture_kit, but we have this information on the CGC. I can fill these, and this should be the last of it

In [87]:
# Load the manifest

hci_manifest_pdx_tumor <- read_csv('hci_manifest_20220215_160645.csv') %>%
    mutate(name = basename(name)) %>%
    select(name, capture_kit) %>%
    rename('manifest_capture_kit' = 'capture_kit')

head(hci_manifest_pdx_tumor)

[1mRows: [22m[34m201[39m [1mColumns: [22m[34m64[39m

[36m──[39m [1mColumn specification[22m [36m──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────[39m
[1mDelimiter:[22m ","
[31mchr[39m (19): id, name, project, sample_id, sample_type, RestrictiveIRB, case_id...
[32mdbl[39m  (3): size, platform_unit_id, paired_end
[33mlgl[39m (42): trimmed_read_count, p1_1, Is FFPE, passage, prop_ribosomal_bases, ...


[36mℹ[39m Use [30m[47m[30m[47m`spec()`[47m[30m[49m[39m to retrieve the full column specification for this data.
[36mℹ[39m Specify the column types or set [30m[47m[30m[47m`show_col_types = FALSE`[47m[30m[49m[39m to quiet this message.



name,manifest_capture_kit
<chr>,<chr>
0704112-T_HCI-001_Patient_tumor_R1.fastq.gz,AgilentHumanAllExon50Mb
0704112-T_HCI-001_Patient_tumor_R2.fastq.gz,AgilentHumanAllExon50Mb
0704112-X_HCI-001_PDXtumor_R1.fastq.gz,AgilentHumanAllExon50Mb
0704112-X_HCI-001_PDXtumor_R2.fastq.gz,AgilentHumanAllExon50Mb
0900570-B_HCI-002_Patient_normal_R1.fastq.gz,AgilentHumanAllExon50Mb
0900570-B_HCI-002_Patient_normal_R2.fastq.gz,AgilentHumanAllExon50Mb


In [90]:
# Merge with all the other files and fix capture_kit

merged_edited_pdtc_files <- merged_edited_pdtc_files %>% left_join(hci_manifest_pdx_tumor, by = c('file_name' = 'name')) %>%
    mutate(capture_kit = ifelse(((is.na(capture_kit) | (capture_kit == 'Unknown')) & !is.na(manifest_capture_kit)), manifest_capture_kit, capture_kit)) %>%
    mutate(capture_kit = ifelse(is.na(capture_kit), 'Unknown', capture_kit)) %>%
    select(-manifest_capture_kit)

head(merged_edited_pdtc_files)

Index,contributor,file_name,patient_id,model_id,tumor_id,case_id,sample_id,sample_type,experimental_strategy,⋯,access_level,data_category,data_format,data_type,platform,is_ffpe,passage,body_location,comment,disease_type
<int>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,⋯,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>
1,BCM,LZML_37246_NoIndex_L00X_R1_001.fastq.gz,24452,BCM-0002,2011012-2,,BCM-0002,PDX,RNA-Seq,⋯,Controlled,Raw Sequencing Data,fastq.gz,Raw Reads,Illumina,False,Unknown,Breast,,Invasive breast carcinoma
2,BCM,LZML_37246_NoIndex_L00X_R2_001.fastq.gz,24452,BCM-0002,2011012-2,,BCM-0002,PDX,RNA-Seq,⋯,Controlled,Raw Sequencing Data,fastq.gz,Raw Reads,Illumina,False,Unknown,Breast,,Invasive breast carcinoma
3,BCM,ML6065_GATCAGCG_L00X_R1_001.fastq.gz,2219,BCM-0046,2011012-46,,BCM-0046,PDX,RNA-Seq,⋯,Controlled,Raw Sequencing Data,fastq.gz,Raw Reads,Illumina,False,Unknown,Breast,,Invasive breast carcinoma
4,BCM,ML6065_GATCAGCG_L00X_R2_001.fastq.gz,2219,BCM-0046,2011012-46,,BCM-0046,PDX,RNA-Seq,⋯,Controlled,Raw Sequencing Data,fastq.gz,Raw Reads,Illumina,False,Unknown,Breast,,Invasive breast carcinoma
5,BCM,ML6061_CAGATCTG_L00X_R1_001.fastq.gz,24561,BCM-0104,2011012-104,,BCM-0104,PDX,RNA-Seq,⋯,Controlled,Raw Sequencing Data,fastq.gz,Raw Reads,Illumina,False,Unknown,Breast,,Invasive breast carcinoma
6,BCM,ML6061_CAGATCTG_L00X_R2_001.fastq.gz,24561,BCM-0104,2011012-104,,BCM-0104,PDX,RNA-Seq,⋯,Controlled,Raw Sequencing Data,fastq.gz,Raw Reads,Illumina,False,Unknown,Breast,,Invasive breast carcinoma


In [92]:
# Reorder a few of the columns at the end because it makes more sense to put the body location and disease type columns together

merged_edited_pdtc_files <- merged_edited_pdtc_files %>% select(Index:body_location, disease_type, comment)

In [94]:
# Remove the title of the first column for consistency with the existing data

colnames(merged_edited_pdtc_files)[1] <- ''

merged_edited_pdtc_files %>% head()

Unnamed: 0_level_0,contributor,file_name,patient_id,model_id,tumor_id,case_id,sample_id,sample_type,experimental_strategy,⋯,access_level,data_category,data_format,data_type,platform,is_ffpe,passage,body_location,disease_type,comment
<int>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,⋯,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>
1,BCM,LZML_37246_NoIndex_L00X_R1_001.fastq.gz,24452,BCM-0002,2011012-2,,BCM-0002,PDX,RNA-Seq,⋯,Controlled,Raw Sequencing Data,fastq.gz,Raw Reads,Illumina,False,Unknown,Breast,Invasive breast carcinoma,
2,BCM,LZML_37246_NoIndex_L00X_R2_001.fastq.gz,24452,BCM-0002,2011012-2,,BCM-0002,PDX,RNA-Seq,⋯,Controlled,Raw Sequencing Data,fastq.gz,Raw Reads,Illumina,False,Unknown,Breast,Invasive breast carcinoma,
3,BCM,ML6065_GATCAGCG_L00X_R1_001.fastq.gz,2219,BCM-0046,2011012-46,,BCM-0046,PDX,RNA-Seq,⋯,Controlled,Raw Sequencing Data,fastq.gz,Raw Reads,Illumina,False,Unknown,Breast,Invasive breast carcinoma,
4,BCM,ML6065_GATCAGCG_L00X_R2_001.fastq.gz,2219,BCM-0046,2011012-46,,BCM-0046,PDX,RNA-Seq,⋯,Controlled,Raw Sequencing Data,fastq.gz,Raw Reads,Illumina,False,Unknown,Breast,Invasive breast carcinoma,
5,BCM,ML6061_CAGATCTG_L00X_R1_001.fastq.gz,24561,BCM-0104,2011012-104,,BCM-0104,PDX,RNA-Seq,⋯,Controlled,Raw Sequencing Data,fastq.gz,Raw Reads,Illumina,False,Unknown,Breast,Invasive breast carcinoma,
6,BCM,ML6061_CAGATCTG_L00X_R2_001.fastq.gz,24561,BCM-0104,2011012-104,,BCM-0104,PDX,RNA-Seq,⋯,Controlled,Raw Sequencing Data,fastq.gz,Raw Reads,Illumina,False,Unknown,Breast,Invasive breast carcinoma,


In [95]:
# Export this as a file

merged_edited_pdtc_files %>% write_csv('2022-02-15_pdtc_metadata_cleaned.csv')