# Acquiring TCGA V 1.0 data from Terra for use with Bioconductor

We use the R 3.6.1, Bioconductor 3.9 environment and call `BiocAnvilTK::build_tcga_mat` for
various tumor and assay types.  Its basic parameters are

- site -- a string defining a tumor site like 'brca' -- see the output of `tcga_sites()` below
- brief_assay -- a string selecting an assay type -- see the output of `gdc_assay_map()` below; use the 'names' of this vector
- assay -- the long name used in the Google buckets; it is unwieldy and is generally not set manually but derived from `brief_assay`
- suffix -- a string to distinguish the folder where text files are localized
- choice -- a number to select among multiple versions of assay texts, when such exist

A call to `build_tcga_mat` returns a matrix with row names given by the feature names in the
files of quantifications (assumed common to all files provided) and column names given
by the long TCGA sample identifiers used in the headers of the files in Google buckets.
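
The shape of that return value can be illustrated with a small base R sketch. This is not the implementation used by `build_tcga_mat`; the folder name, feature names, sample identifiers, and file layout below are all made up for illustration, and only the idea (one file per sample, shared features, sample identifier in the header) comes from the description above.

```r
# Simulate three per-sample quantification files that share feature names.
# All file contents and sample identifiers here are invented for illustration.
work <- file.path(tempdir(), "demo_terra_txt")
dir.create(work, showWarnings = FALSE)
features <- c("geneA|1", "geneB|2", "geneC|3")
samples <- c("TCGA-XX-0001-01A", "TCGA-XX-0002-01A", "TCGA-XX-0003-01A")
for (i in seq_along(samples)) {
  dat <- data.frame(gene_id = features, value = i * c(1, 2, 3))
  names(dat)[2] <- samples[i]  # the sample identifier sits in the file header
  write.table(dat, file.path(work, paste0(samples[i], ".data.txt")),
              sep = "\t", quote = FALSE, row.names = FALSE)
}

# Read every file, check that the features agree, and bind the value columns.
files <- list.files(work, pattern = "\\.data\\.txt$", full.names = TRUE)
tabs <- lapply(files, read.delim, check.names = FALSE, stringsAsFactors = FALSE)
stopifnot(all(vapply(tabs, function(x) identical(x[[1]], tabs[[1]][[1]]), logical(1))))
mat <- do.call(cbind, lapply(tabs, function(x) as.matrix(x[, 2, drop = FALSE])))
rownames(mat) <- tabs[[1]][[1]]
dim(mat)
mat
```

The explicit check that every file lists the same features mirrors the assumption stated above; if it failed, a merge by feature name would be needed instead of a simple column bind.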

## Set up package collection

We need to install the current versions of rapiclient and AnVIL from
Bioconductor's GitHub organization, and BiocAnvilTK from the vjcitn/BiocAnvilTK repository.


In [19]:
dim(installed.packages())

In [20]:
sm = suppressMessages
ss = suppressPackageStartupMessages
avail = rownames(installed.packages())
inst_if_needed = function(pkname, rid) if (!(pkname %in% avail)) BiocManager::install(rid, ask=FALSE)
sm(ss({
    library(BiocManager)
    BiocManager::install("Bioconductor/AnVIL_rapiclient") # could check installed version to condition
    inst_if_needed("AnVIL", "Bioconductor/AnVIL")
    inst_if_needed("BiocAnvilTK", "vjcitn/BiocAnvilTK")
    }))
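
The comment in the cell above notes that installation could be conditioned on the installed version. A minimal base R sketch of such a check follows. The helper `needs_install` and its `min_version` argument are hypothetical and not part of BiocManager or AnVIL, and the version threshold in the commented call is illustrative only.

```r
# Hypothetical helper: TRUE when `pkg` is absent or older than `min_version`.
needs_install <- function(pkg, min_version = "0.0.0") {
  # requireNamespace() returns FALSE, without raising an error, if the package is missing
  if (!requireNamespace(pkg, quietly = TRUE)) return(TRUE)
  utils::packageVersion(pkg) < package_version(min_version)
}

# Illustrative use; the threshold "0.0.28" is an assumption made for this sketch.
# if (needs_install("AnVIL", "0.0.28")) BiocManager::install("Bioconductor/AnVIL", ask = FALSE)
needs_install("stats")
```

Unlike a lookup in `rownames(installed.packages())`, this approach also catches a package that is present but outdated, at the cost of loading its namespace during the check.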

## Background: Use AnVIL package to get tabular information on TCGA

In the following code, we use `avtable` to obtain comprehensive information about
the V1-0 open access TCGA data on glioblastoma multiforme (GBM) in AnVIL.

In [21]:
ss(library(AnVIL))

In [23]:
bigt = avtable("sample", namespace="broad-firecloud-tcga",
   name="TCGA_GBM_OpenAccess_V1-0_DATA")
#datatable(as.data.frame(bigt))
dim(bigt)

This shows that there are 33 columns with information about 1146 samples.  The column names of this
table are unwieldy, but we will establish some abbreviations.

In [24]:
glimpse(bigt)

Observations: 1,146
Variables: 33
$ name                                                                                        <chr> …
$ sample_type                                                                                 <chr> …
$ clin__bio__nationwidechildrens_org__Level_1__biospecimen__clin                              <chr> …
$ snp__genome_wide_snp_6__broad_mit_edu__Level_3__segmented_scna_minus_germline_cnv_hg18__seg <chr> …
$ snp__humanhap550__hudsonalpha_org__Level_3__segmented_cnv__seg                              <chr> …
$ snp__genome_wide_snp_6__broad_mit_edu__Level_3__segmented_scna_hg18__seg                    <chr> …
$ clin__bio__intgen_org__Level_1__clinical__clin                                              <chr> …
$ clin__bio__nationwidechildrens_org__Level_1__clinical__clin

We will keep only a few columns, copy the column that holds normalized gene
expression measures from RSEM-processed RNA-seq into one with a shorter name,
and drop samples for which that column is missing.

In [25]:
okt = bigt %>% 
  mutate(RNASeq2GeneNorm=rnaseqv2__illuminahiseq_rnaseqv2__unc_edu__Level_3__RSEM_genes_normalized__data) %>% 
  select(name, tcga_sample_id, sample_type, RNASeq2GeneNorm)  %>% filter(!is.na(RNASeq2GeneNorm))
head(okt)

name,tcga_sample_id,sample_type,RNASeq2GeneNorm
<chr>,<chr>,<chr>,<chr>
GBM-02-0047-TP,TCGA-02-0047-01,TP,gs://firecloud-tcga-open-access/tcga/dcc/gbm/rnaseqv2__illuminahiseq_rnaseqv2__unc_edu__Level_3__RSEM_genes_normalized__data/unc.edu_GBM.IlluminaHiSeq_RNASeqV2.Level_3.1.2.0/TCGA-02-0047-01A-01R-1849-01.data.txt
GBM-02-0055-TP,TCGA-02-0055-01,TP,gs://firecloud-tcga-open-access/tcga/dcc/gbm/rnaseqv2__illuminahiseq_rnaseqv2__unc_edu__Level_3__RSEM_genes_normalized__data/unc.edu_GBM.IlluminaHiSeq_RNASeqV2.Level_3.1.2.0/TCGA-02-0055-01A-01R-1849-01.data.txt
GBM-02-2483-TP,TCGA-02-2483-01,TP,gs://firecloud-tcga-open-access/tcga/dcc/gbm/rnaseqv2__illuminahiseq_rnaseqv2__unc_edu__Level_3__RSEM_genes_normalized__data/unc.edu_GBM.IlluminaHiSeq_RNASeqV2.Level_3.1.2.0/TCGA-02-2483-01A-01R-1849-01.data.txt
GBM-02-2485-TP,TCGA-02-2485-01,TP,gs://firecloud-tcga-open-access/tcga/dcc/gbm/rnaseqv2__illuminahiseq_rnaseqv2__unc_edu__Level_3__RSEM_genes_normalized__data/unc.edu_GBM.IlluminaHiSeq_RNASeqV2.Level_3.1.2.0/TCGA-02-2485-01A-01R-1849-01.data.txt
GBM-02-2486-TP,TCGA-02-2486-01,TP,gs://firecloud-tcga-open-access/tcga/dcc/gbm/rnaseqv2__illuminahiseq_rnaseqv2__unc_edu__Level_3__RSEM_genes_normalized__data/unc.edu_GBM.IlluminaHiSeq_RNASeqV2.Level_3.1.2.0/TCGA-02-2486-01A-01R-1849-01.data.txt
GBM-06-0125-TP,TCGA-06-0125-01,TP,gs://firecloud-tcga-open-access/tcga/dcc/gbm/rnaseqv2__illuminahiseq_rnaseqv2__unc_edu__Level_3__RSEM_genes_normalized__data/unc.edu_GBM.IlluminaHiSeq_RNASeqV2.Level_3.1.2.0/TCGA-06-0125-01A-01R-1849-01.data.txt


The RNASeq2GeneNorm column in the filtered table above gives the location in Google Cloud Storage of each
participant's RNA-seq expression measures.  We will make use of that information, implicitly, in the next section.
We gain a lot from the convention that the path to the data in Google storage is regimented, with
common patterns for all tumor sites.
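
To make the regimented layout concrete, the following base R sketch splits one of the paths shown above into its components. The component positions and the labels `bucket`, `site`, `assay`, and `barcode` are inferred from this single example and are assumptions, not a documented specification of the bucket layout.

```r
# One path copied from the table above
p <- "gs://firecloud-tcga-open-access/tcga/dcc/gbm/rnaseqv2__illuminahiseq_rnaseqv2__unc_edu__Level_3__RSEM_genes_normalized__data/unc.edu_GBM.IlluminaHiSeq_RNASeqV2.Level_3.1.2.0/TCGA-02-0047-01A-01R-1849-01.data.txt"

# Drop the scheme and split on "/"; positions are inferred from this example
parts <- strsplit(sub("^gs://", "", p), "/", fixed = TRUE)[[1]]
info <- list(
  bucket  = parts[1],
  site    = parts[4],
  assay   = parts[5],
  barcode = sub("\\.data\\.txt$", "", basename(p))
)
str(info)
```

Because the site code and the long assay name appear as plain path components, a path for another tumor site can plausibly be anticipated by substituting those two pieces, which is what makes short abbreviations workable.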

## Build the RSEM gene-level matrix for GBM

We need to know some conventions for building matrices from AnVIL TCGA data.  

In [26]:
ss(library(BiocAnvilTK)) # attach package

We have a small number of functions defined in the BiocAnvilTK package.

In [27]:
ls("package:BiocAnvilTK")

Tumor site codes are fundamental.

In [28]:
length(tcga_sites())
head(tcga_sites())

Assay names are unwieldy, so we use abbreviations: the 'names' of the vector returned below are the
short labels we use to refer to assays, and the values are the long names that spell out each assay's provenance.

In [29]:
gdc_assay_map()
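
The name-to-value lookup can be sketched with an ordinary named character vector. The one-entry vector `assay_map` below is hypothetical; the pairing of `rsem_genes_norm` with the long RSEM name is inferred from the calls and column names used in this document and is not copied from the actual output of `gdc_assay_map()`.

```r
# Hypothetical one-entry map: the name is the brief label, the value the long bucket name.
# The pairing is inferred from this document, not taken from gdc_assay_map() output.
assay_map <- c(
  rsem_genes_norm = "rnaseqv2__illuminahiseq_rnaseqv2__unc_edu__Level_3__RSEM_genes_normalized__data"
)
names(assay_map)                      # brief labels that a caller would type
unname(assay_map["rsem_genes_norm"])  # long name looked up by brief label
```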

A matrix of gene-level quantifications is readily produced using the abbreviations in the `build_tcga_mat` function.

In [33]:
# Remove files localized by a previous run so the build starts from a clean folder
if (dir.exists("gbm_rsem_genes_norm__terra_txt")) unlink("gbm_rsem_genes_norm__terra_txt", recursive=TRUE)

In [34]:
gex = build_tcga_mat("gbm", "rsem_genes_norm")
dim(gex)
gex[1:4,1:4]

,TCGA-02-0047-01A-01R-1849-01,TCGA-02-0055-01A-01R-1849-01,TCGA-02-2483-01A-01R-1849-01,TCGA-02-2485-01A-01R-1849-01
?|100130426,0.0,0.0,0.0,0.0
?|100133144,3.0732,0.0,2.9467,15.9952
?|100134869,6.7611,15.6973,13.9398,14.9571
?|10357,54.7036,31.3945,60.3441,91.8238
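
The column names of the matrix are long aliquot-level barcodes, while the `tcga_sample_id` column of the sample table uses the shorter sample-level form. A small sketch of relating the two follows; it relies on the observation that, for the barcodes shown here, the first 15 characters match the `tcga_sample_id` values, and the commented line applying it to `gex` is a suggestion rather than something BiocAnvilTK does.

```r
# Column names copied from the matrix excerpt above
barcodes <- c("TCGA-02-0047-01A-01R-1849-01", "TCGA-02-0055-01A-01R-1849-01",
              "TCGA-02-2483-01A-01R-1849-01", "TCGA-02-2485-01A-01R-1849-01")

# The first 15 characters hold project, tissue source site, participant, and sample type code
sample_id <- substr(barcodes, 1, 15)
sample_id

# With the real matrix one might then write, for example:
# colnames(gex) <- substr(colnames(gex), 1, 15)
```

Truncation can produce duplicate identifiers when a sample has more than one aliquot, so it is worth checking `anyDuplicated()` on the result before using it for joins.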


In [35]:
sessionInfo()

R version 3.6.1 (2019-07-05)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 18.04.3 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.7.1
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.7.1

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] BiocAnvilTK_0.0.0  AnVIL_0.0.28       dplyr_0.8.3        BiocManager_1.30.9

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.2           formatR_1.7          pillar_1.4.2        
 [4] compiler_3.6.1       futile.logger_1.4.3  futile.options_1.0.1
 [7] pre

In [None]:
# TO DO: the localization can occur in a temporary folder once we have sufficient experience