# Add dbGaP and CDS Submission Status to WUSTL's File Metadata
## Date: 2022-02-22
## Author: Jeffrey Grover
**Purpose:** Previously, we made copies of files and recorded their submission status as tags. That is a problem: multiple copies make maintaining the metadata in the datapool even harder than it already is, because we don't know which copy is the "main" one that should get metadata applied. So, I'm going to add the dbGaP and CDS submission status as additional metadata fields on the "main" files.

### Load libraries

In [9]:
library(tidyverse)
library(sevenbridges)

### Load the metadata
WUSTL had some files which hadn't been analyzed yet, and those were in a separate folder on the CGC. I added `dbgap_accession` and `cds_status` as metadata fields to those files separately, and put them in the `Data_Submission` folder under `WUSTL`.

There are also a number of files that already existed in the WUSTL folders and may have a second copy in the dbGaP folder, so I don't know which copy should be treated as the "main" one and get the dbGaP and CDS submission metadata. To sort this out, I can compare the lists of file names from the two manifests.

In [4]:
existing_wustl_files <- read_csv('2022-02-22_wustl_datapool_existing_rnaseq_wes_manifest.csv') %>%
    mutate(name = basename(name))

head(existing_wustl_files)

Rows: 434 Columns: 66

── Column specification ────────────────────────────────────────
Delimiter: ","
chr (19): id, name, project, passage, sample_id, sample_type, library_prep_k...
dbl  (2): size, paired_end
lgl (45): trimmed_read_count, p1_1, Is FFPE, prop_ribosomal_bases, Restricti...


ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.



id,name,size,project,trimmed_read_count,p1_1,Is FFPE,passage,sample_id,prop_ribosomal_bases,⋯,dbgap_upload,model_id,paired_end,__inherit__,update_manifest,species,tumor_id,investigation,prop_20x_cov,experimental_strategy
<chr>,<chr>,<dbl>,<chr>,<lgl>,<lgl>,<lgl>,<chr>,<chr>,<lgl>,⋯,<lgl>,<lgl>,<dbl>,<lgl>,<lgl>,<lgl>,<lgl>,<lgl>,<lgl>,<chr>
5d8914ace4b06d0cb492458b,WUPDTC-002-B01.pe_2.fastq.paired.fq,14208130600,pdxnet/pdxnet-datapool,,,,,WUPDTC-002-B01,,⋯,,,2,,,,,,,WES
5d891c1fe4b065d300c6e31c,WUPDTC-002-T01-P12.pe_1.fastq.paired.fq,18071815881,pdxnet/pdxnet-datapool,,,,P12,WUPDTC-002-T01-P12,,⋯,,,1,,,,,,,WES
5d891c1fe4b065d300c6e31b,WUPDTC-002-T01-P12.pe_2.fastq.paired.fq,18071815881,pdxnet/pdxnet-datapool,,,,P12,WUPDTC-002-T01-P12,,⋯,,,2,,,,,,,WES
5d8914bee4b06d0cb4924592,WUPDTC-002-T01.pe_1.fastq.paired.fq,13816487982,pdxnet/pdxnet-datapool,,,,,WUPDTC-002-T01,,⋯,,,1,,,,,,,WES
5d8914bee4b06d0cb4924590,WUPDTC-002-T01.pe_2.fastq.paired.fq,13816487982,pdxnet/pdxnet-datapool,,,,,WUPDTC-002-T01,,⋯,,,2,,,,,,,WES
5d891395e4b065d300c6a1fc,WUPDTC-003-B01.pe_1.fastq.paired.fq,13787659314,pdxnet/pdxnet-datapool,,,,,WUPDTC-003-B01,,⋯,,,1,,,,,,,WES


In [6]:
cds_wustl_files <- read_csv('2022-02-22_wustl_datapool_cds_rnaseq_wes_manifest.csv') %>%
    mutate(name = basename(name))

head(cds_wustl_files)

Rows: 850 Columns: 66

── Column specification ────────────────────────────────────────
Delimiter: ","
chr (10): id, name, project, sample_type, case_id, patient_id, contributor, ...
dbl  (1): size
lgl (55): trimmed_read_count, p1_1, Is FFPE, passage, sample_id, prop_riboso...


ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.



id,name,size,project,trimmed_read_count,p1_1,Is FFPE,passage,sample_id,prop_ribosomal_bases,⋯,dbgap_upload,model_id,paired_end,__inherit__,update_manifest,species,tumor_id,investigation,prop_20x_cov,experimental_strategy
<chr>,<chr>,<dbl>,<chr>,<lgl>,<lgl>,<lgl>,<lgl>,<lgl>,<lgl>,⋯,<lgl>,<chr>,<lgl>,<lgl>,<lgl>,<lgl>,<chr>,<lgl>,<lgl>,<chr>
6053a7661f537262530d81d0,WU-L034-T1.WES.r1.fastq.gz,2643723290,pdxnet/pdxnet-datapool,,,,,,,⋯,,,,,,,,,,WES
6053a7661f537262530d81c2,WU-L034-T1.WES.r2.fastq.gz,2746938346,pdxnet/pdxnet-datapool,,,,,,,⋯,,,,,,,,,,WES
6053a7661f537262530d81c3,WU-L035-N1.WES.r1.fastq.gz,1876415309,pdxnet/pdxnet-datapool,,,,,,,⋯,,,,,,,,,,WES
6053a7661f537262530d81f9,WU-L035-N1.WES.r2.fastq.gz,1886646726,pdxnet/pdxnet-datapool,,,,,,,⋯,,,,,,,,,,WES
6053a7661f537262530d827e,WU-L035-T1-X1P0.WES.r1.fastq.gz,837482288,pdxnet/pdxnet-datapool,,,,,,,⋯,,WU-L035-T1,,,,,WU-L035-T1,,,WES
6053a7661f537262530d824a,WU-L035-T1-X1P0.WES.r2.fastq.gz,859970829,pdxnet/pdxnet-datapool,,,,,,,⋯,,WU-L035-T1,,,,,WU-L035-T1,,,WES
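
Note that the two manifests are guessed with different column types: `passage` and `sample_id`, for example, are `<chr>` in the first table but `<lgl>` in the second, because columns with no values are guessed as logical. This does not affect the comparison on `name`, but it would matter if the tables were ever combined. A minimal sketch of reading with explicit column types, assuming readr 2.0 or later; the `toy_csv` contents are made up for illustration and are not from the real manifests:

```r
library(readr)

# Hypothetical stand-in for a manifest; the real files have 66 columns.
toy_csv <- I("id,name,size,passage,sample_id\nabc123,folder/sample.r1.fastq.gz,100,,\n")

# Read everything as character except `size`, so empty columns are not
# guessed as logical. Supplying col_types also quiets the column message.
toy_manifest <- read_csv(
    toy_csv,
    col_types = cols(.default = col_character(), size = col_double())
)

# Same basename step as in the cells above
toy_manifest$name <- basename(toy_manifest$name)
str(toy_manifest)
```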


### Find files missing from the existing files list

In [8]:
# Files not already in the datapool have not been analyzed
# Create a separate list

new_wustl_files <- cds_wustl_files %>% filter(!(name %in% existing_wustl_files$name))

nrow(new_wustl_files)
head(new_wustl_files)

id,name,size,project,trimmed_read_count,p1_1,Is FFPE,passage,sample_id,prop_ribosomal_bases,⋯,dbgap_upload,model_id,paired_end,__inherit__,update_manifest,species,tumor_id,investigation,prop_20x_cov,experimental_strategy
<chr>,<chr>,<dbl>,<chr>,<lgl>,<lgl>,<lgl>,<lgl>,<lgl>,<lgl>,⋯,<lgl>,<chr>,<lgl>,<lgl>,<lgl>,<lgl>,<chr>,<lgl>,<lgl>,<chr>
6053a7661f537262530d81d0,WU-L034-T1.WES.r1.fastq.gz,2643723290,pdxnet/pdxnet-datapool,,,,,,,⋯,,,,,,,,,,WES
6053a7661f537262530d81c2,WU-L034-T1.WES.r2.fastq.gz,2746938346,pdxnet/pdxnet-datapool,,,,,,,⋯,,,,,,,,,,WES
6053a7661f537262530d81c3,WU-L035-N1.WES.r1.fastq.gz,1876415309,pdxnet/pdxnet-datapool,,,,,,,⋯,,,,,,,,,,WES
6053a7661f537262530d81f9,WU-L035-N1.WES.r2.fastq.gz,1886646726,pdxnet/pdxnet-datapool,,,,,,,⋯,,,,,,,,,,WES
6053a7661f537262530d827e,WU-L035-T1-X1P0.WES.r1.fastq.gz,837482288,pdxnet/pdxnet-datapool,,,,,,,⋯,,WU-L035-T1,,,,,WU-L035-T1,,,WES
6053a7661f537262530d824a,WU-L035-T1-X1P0.WES.r2.fastq.gz,859970829,pdxnet/pdxnet-datapool,,,,,,,⋯,,WU-L035-T1,,,,,WU-L035-T1,,,WES
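
Because the split relies on matching file names only, it may be worth checking that every CDS file falls into exactly one group and that no name appears twice within a manifest. A sketch of such a check on made-up data (the `*_toy` tables are stand-ins, not the real manifests), using dplyr's filtering joins, which give the same split as the `%in%` filter:

```r
library(dplyr)

# Made-up stand-ins for the two manifests (only the columns needed here).
existing_toy <- tibble(id = c("e1", "e2"),
                       name = c("a.fastq.gz", "b.fastq.gz"))
cds_toy <- tibble(id = c("c1", "c2", "c3"),
                  name = c("b.fastq.gz", "c.fastq.gz", "d.fastq.gz"))

# CDS files with no match among the existing files, and those with a match.
new_toy <- anti_join(cds_toy, existing_toy, by = "name")
shared_toy <- semi_join(cds_toy, existing_toy, by = "name")

# Every CDS file should land in exactly one of the two groups.
stopifnot(nrow(new_toy) + nrow(shared_toy) == nrow(cds_toy))

# Matching on basename is only unambiguous if names are unique per manifest.
duplicated_names <- cds_toy %>% count(name) %>% filter(n > 1)
nrow(duplicated_names)
```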


### Move Files Which Haven't Been Analyzed Yet
Moving these into their own folder avoids chaos in the file organization.

In [10]:
# Connect to the platform

sbg_auth <- Auth(from = 'file', profile_name = 'cgc')
sbg_proj <- sbg_auth$project(id = 'pdxnet/pdxnet-datapool')

Authenticating with user configuration file: ~/.sevenbridges/credentials

Authenticating with user profile: cgc



In [27]:
# Get the folder object for the analysis_incomplete folder
# Note: the positional indices depend on the order of each folder listing

analysis_incomplete_folder <- sbg_proj$get_root_folder()$
    list_folder_contents(complete = TRUE)[[5]]$
    list_folder_contents(complete = TRUE)[[6]]$
    list_folder_contents(complete = TRUE)[[1]]

analysis_incomplete_folder
analysis_incomplete_folder$id

== Files ==
id : 62153e7819807d361ecf5e70
name : analysis_incomplete
project : pdxnet/pdxnet-datapool
parent : 620d1ca68b04ac66e6c00d5b
type : folder
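
The positional indices (`[[5]]`, `[[6]]`, `[[1]]`) will point at the wrong folder if anything is added to or removed from the listings. A sketch of a lookup by name instead; `find_by_name` is a hypothetical helper, not part of the sevenbridges package, and it is demonstrated here on plain lists rather than real platform objects. It assumes only that each item exposes a `name` field, as the printed folder object above does:

```r
# Return the single element of `items` whose `name` field equals `target`.
# `items` is anything indexable with [[i]] whose elements expose `$name`.
find_by_name <- function(items, target) {
    hits <- list()
    for (i in seq_along(items)) {
        if (identical(items[[i]]$name, target)) {
            hits[[length(hits) + 1]] <- items[[i]]
        }
    }
    if (length(hits) != 1) {
        stop("Expected exactly one item named '", target,
             "', found ", length(hits))
    }
    hits[[1]]
}

# Plain lists standing in for a folder listing (mock names and ids).
mock_listing <- list(
    list(name = "some_other_folder", id = "mock-id-1"),
    list(name = "analysis_incomplete", id = "mock-id-2")
)

find_by_name(mock_listing, "analysis_incomplete")$id
```

With the real objects, this would presumably be applied to each `list_folder_contents(complete = TRUE)` result in turn; whether the returned list type supports exactly this access pattern has not been checked here.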

In [30]:
# Test folder operations with the first file

# Test setting metadata
sbg_proj$file(id = new_wustl_files$id[1])$set_meta(cds_status = 'uploaded', dbgap_accession = 'phs002432.v1.p1')

# Test moving file
sbg_proj$file(id = new_wustl_files$id[1])$move_to_folder(analysis_incomplete_folder$id)

# Get file info
sbg_proj$file(id = new_wustl_files$id[1])
sbg_proj$file(id = new_wustl_files$id[1])$get_parent_folder()

== Files ==
id : 6053a7661f537262530d81d0
name : WU-L034-T1.WES.r1.fastq.gz
size : 2643723290
project : pdxnet/pdxnet-datapool
created_on : 2021-03-18T19:17:58Z
modified_on : 2022-02-22T21:22:43Z
storage:
  type : PLATFORM
  hosted_on_locations : 1 items
     - aws:us-east-1
tags:
   - phs002305.v1.p1
   - WES
   - 2020_10
   - WUSTL
   - CDS_DONE
metadata:
  contributor : WUSTL
  patient_id : WU-L034
  case_id : WU-L034
  cds_status : uploaded
  experimental_strategy : WES
  sample_type : Human_tumor
  dbgap_accession : phs002432.v1.p1
parent : 62153e7819807d361ecf5e70
type : file

== Files ==
id : 6053a7661f537262530d81d0
name : WU-L034-T1.WES.r1.fastq.gz
size : 2643723290
project : pdxnet/pdxnet-datapool
created_on : 2021-03-18T19:17:58Z
modified_on : 2022-02-22T21:22:43Z
storage:
  type : PLATFORM
  hosted_on_locations : 1 items
     - aws:us-east-1
tags:
   - phs002305.v1.p1
   - WES
   - 2020_10
   - WUSTL
   - CDS_DONE
metadata:
  contributor : WUSTL
  patient_id : WU-L034
  case_id : WU-L034
  cds_status : uploaded
  experimental_strategy : WES
  sample_type : Human_tumor
  dbgap_accession : phs002432.v1.p1
parent : 62153e7819807d361ecf5e70
type : file

== Files ==
id : 62153e7819807d361ecf5e70
name : analysis_incomplete
project : pdxnet/pdxnet-datapool
created_on : 2022-02-22T19:50:16Z
modified_on : 2022-02-22T19:50:16Z
parent : 620d1ca68b04ac66e6c00d5b
type : folder

In [32]:
# Loop to move the files which haven't been analyzed yet and set metadata fields

for (i in seq_along(new_wustl_files$id)) {
    
    # Pause for 5 mins every 200 iterations to avoid API call limits
    if (i %% 200 == 0) {
        Sys.sleep(300)
    }
    
    # Set metadata and move files
    sbg_proj$file(id = new_wustl_files$id[i])$set_meta(cds_status = 'uploaded', dbgap_accession = 'phs002432.v1.p1')
    sbg_proj$file(id = new_wustl_files$id[i])$move_to_folder(analysis_incomplete_folder$id)
}
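
The loop above stops at the first API error, which leaves it unclear which files were already updated. A sketch of a variant that records failures and keeps going; `update_all` and `update_one` are hypothetical names, and the mock updater below stands in for the real `set_meta` and `move_to_folder` calls so the example runs without touching the platform:

```r
# Apply `update_one(id)` to every id, recording failures instead of stopping.
# `update_one` is a hypothetical function wrapping the per-file API calls.
update_all <- function(ids, update_one, pause_every = 200, pause_secs = 300) {
    failed <- character(0)
    for (i in seq_along(ids)) {
        # Same pacing idea as the loop above
        if (i %% pause_every == 0) {
            Sys.sleep(pause_secs)
        }
        ok <- tryCatch({
            update_one(ids[i])
            TRUE
        }, error = function(e) {
            message("Failed on ", ids[i], ": ", conditionMessage(e))
            FALSE
        })
        if (!ok) {
            failed <- c(failed, ids[i])
        }
    }
    failed
}

# Mock updater that fails for one id, to show the behaviour without API calls.
mock_update <- function(id) {
    if (id == "bad") stop("simulated API error")
    invisible(NULL)
}

failed_ids <- update_all(c("a", "bad", "c"), mock_update, pause_secs = 0)
failed_ids
```

The returned ids can then be retried on their own rather than rerunning the whole loop.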

All files that were uploaded to the datapool for the CDS and dbGaP submissions, and that had no copy among the existing WUSTL files, are now in the separate `analysis_incomplete` folder to indicate that they haven't been analyzed.

### Add metadata to all CDS-submitted files which were already on the CGC

In [36]:
# Filter the manifest for files which were submitted to CDS

existing_files_cds_uploaded <- existing_wustl_files %>% filter(name %in% cds_wustl_files$name)

nrow(existing_files_cds_uploaded)
head(existing_files_cds_uploaded)

id,name,size,project,trimmed_read_count,p1_1,Is FFPE,passage,sample_id,prop_ribosomal_bases,⋯,dbgap_upload,model_id,paired_end,__inherit__,update_manifest,species,tumor_id,investigation,prop_20x_cov,experimental_strategy
<chr>,<chr>,<dbl>,<chr>,<lgl>,<lgl>,<lgl>,<chr>,<chr>,<lgl>,⋯,<lgl>,<lgl>,<dbl>,<lgl>,<lgl>,<lgl>,<lgl>,<lgl>,<lgl>,<chr>
5d60005de4b09cae7680875a,WU-0056-T1-X1P2.WES.r1.fastq.gz,2126007246,pdxnet/pdxnet-datapool,,,,P2,WU-0056-T1-X1P2,,⋯,,,1.0,,,,,,,WES
5d60005de4b09cae76808740,WU-0056-T1-X1P2.WES.r2.fastq.gz,2109115670,pdxnet/pdxnet-datapool,,,,P2,WU-0056-T1-X1P2,,⋯,,,2.0,,,,,,,WES
5d60005de4b09cae76808753,WU-0059-T1-X1P1.WES.r1.fastq.gz,2534091484,pdxnet/pdxnet-datapool,,,,P1,WU-0059-T1-X1P1,,⋯,,,1.0,,,,,,,WES
5d60005de4b09cae76808755,WU-0059-T1-X1P1.WES.r2.fastq.gz,2531199502,pdxnet/pdxnet-datapool,,,,P1,WU-0059-T1-X1P1,,⋯,,,2.0,,,,,,,WES
5f4efaf6e4b0bf4a931cc751,WU-0059-T1-X1P2.WES.r1.fastq.gz,3176500221,pdxnet/pdxnet-datapool,,,,,,,⋯,,,,,,,,,,WES
5f4efaf6e4b0bf4a931cc741,WU-0059-T1-X1P2.WES.r2.fastq.gz,3259841516,pdxnet/pdxnet-datapool,,,,,,,⋯,,,,,,,,,,WES


In [37]:
# Add the metadata to the files in-place

for (i in seq_along(existing_files_cds_uploaded$id)) {
    
    # Pause for 5 mins every 200 iterations to avoid API call limits
    if (i %% 200 == 0) {
        Sys.sleep(300)
    }
    
    # Set metadata only; these files stay where they are
    sbg_proj$file(id = existing_files_cds_uploaded$id[i])$set_meta(cds_status = 'uploaded', dbgap_accession = 'phs002432.v1.p1')
}

The copies of those files still in the CDS upload folder on the CGC can now be removed through the GUI. There are 186 RNAseq and 186 WES fastq files (372 in total), all of which match files that were already elsewhere on the CGC, as shown above.
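
The per-strategy counts could be confirmed before deleting anything by tabulating `experimental_strategy`. A sketch on made-up rows; `toy_uploaded` stands in for `existing_files_cds_uploaded`, and the `"RNA-Seq"` label is an assumption, since only `WES` values appear in the output shown in this notebook:

```r
library(dplyr)

# Made-up rows standing in for existing_files_cds_uploaded.
toy_uploaded <- tibble(
    name = c("s1.WES.r1.fastq.gz", "s1.WES.r2.fastq.gz",
             "s1.RNAseq.r1.fastq.gz", "s1.RNAseq.r2.fastq.gz"),
    # "RNA-Seq" is an assumed label, not taken from the real metadata
    experimental_strategy = c("WES", "WES", "RNA-Seq", "RNA-Seq")
)

# Files per experimental strategy, plus the overall total.
per_strategy <- toy_uploaded %>% count(experimental_strategy)
per_strategy
sum(per_strategy$n)
```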