# Downloading bulk RNA-seq data from TCGA

## Available TCGA datasets:
* TCGA-ACC --> Adrenocortical Carcinoma
* TCGA-BLCA --> Bladder Urothelial Carcinoma
* TCGA-BRCA --> Breast Invasive Carcinoma (done)
* TCGA-CESC --> Cervical Squamous Cell Carcinoma and Endocervical Adenocarcinoma
* TCGA-CHOL --> Cholangiocarcinoma (done)
* TCGA-COAD --> Colon Adenocarcinoma (done)
* TCGA-DLBC --> Lymphoid Neoplasm Diffuse Large B-cell Lymphoma
* TCGA-ESCA --> Esophageal Carcinoma (done)
* TCGA-GBM --> Glioblastoma Multiforme (done)
* TCGA-HNSC --> Head and Neck Squamous Cell Carcinoma (done)
* TCGA-KICH --> Kidney Chromophobe - Adenomas and Adenocarcinomas
* TCGA-KIRC --> Kidney Renal Clear Cell Carcinoma - Adenomas and Adenocarcinomas
* TCGA-KIRP --> Kidney Renal Papillary Cell Carcinoma - Adenomas and Adenocarcinomas
* TCGA-LAML --> Acute Myeloid Leukemia
* TCGA-LGG --> Brain Lower Grade Glioma
* TCGA-LIHC --> Liver Hepatocellular Carcinoma (done)
* TCGA-LUAD --> Lung Adenocarcinoma (done)
* TCGA-LUSC --> Lung Squamous Cell Carcinoma
* TCGA-MESO --> Mesothelioma
* TCGA-OV --> Ovarian Serous Cystadenocarcinoma (done)
* TCGA-PAAD --> Pancreatic Adenocarcinoma (done)
* TCGA-PCPG --> Pheochromocytoma and Paraganglioma
* TCGA-PRAD --> Prostate Adenocarcinoma
* TCGA-READ --> Rectum Adenocarcinoma
* TCGA-SARC --> Sarcoma
* TCGA-SKCM --> Skin Cutaneous Melanoma (done)
* TCGA-STAD --> Stomach Adenocarcinoma (done)
* TCGA-TGCT --> Testicular Germ Cell Tumors
* TCGA-THCA --> Thyroid Carcinoma
* TCGA-THYM --> Thymoma
* TCGA-UCEC --> Uterine Corpus Endometrial Carcinoma
* TCGA-UCS --> Uterine Carcinosarcoma
* TCGA-UVM --> Uveal Melanoma

In [2]:
#set the working directory where the downloaded data will be stored
setwd("/scratch/user/s4436039/TCGA_Data")

In [3]:
library(TCGAbiolinks)
library(SummarizedExperiment)


In [22]:
#get clinical data for the project and list the distinct primary diagnoses
clinical <- GDCquery_clinic("TCGA-LIHC")
unique(clinical$primary_diagnosis)

In [5]:
#get a list of projects available
gdcprojects <- getGDCprojects()

In [23]:
#summarise file and case counts per data category for the project
getProjectSummary('TCGA-LIHC')

| | file_count (int) | case_count (int) | data_category (chr) |
|---|---|---|---|
| 1 | 6070 | 377 | Simple Nucleotide Variation |
| 2 | 3226 | 377 | Sequencing Reads |
| 3 | 1634 | 377 | Biospecimen |
| 4 | 803 | 377 | Clinical |
| 5 | 4197 | 377 | Copy Number Variation |
| 6 | 1698 | 376 | Transcriptome Profiling |
| 7 | 1290 | 377 | DNA Methylation |
| 8 | 184 | 184 | Proteome Profiling |
| 9 | 22 | 11 | Somatic Structural Variation |
| 10 | 1696 | 371 | Structural Variation |


In [24]:
#build a query (GDCquery)
query_TCGA <- GDCquery(project = 'TCGA-LIHC', 
                       data.category = 'Transcriptome Profiling',
                       experimental.strategy = 'RNA-Seq',
                       workflow.type = 'STAR - Counts',
                       access = 'open')

--------------------------------------

o GDCquery: Searching in GDC database

--------------------------------------

Genome of reference: hg38

--------------------------------------------

oo Accessing GDC. This might take a while...

--------------------------------------------

ooo Project: TCGA-LIHC

--------------------

oo Filtering results

--------------------

ooo By access

ooo By experimental.strategy

ooo By workflow.type

----------------

oo Checking data

----------------

ooo Checking if there are duplicated cases

ooo Checking if there are results for the query

-------------------

o Preparing output

-------------------



In [26]:
#download data (GDCdownload)
GDCdownload(query_TCGA)

Downloading data for project TCGA-LIHC

GDCdownload will download 424 files. A total of 1.787394432 GB

The total size of files is big. We will download files in chunks

Downloading chunk 1 of 2 (237 files, size = 999.214194 MB) as Tue_Jan_21_18_01_02_2025_0.tar.gz



Downloading: 230 MB     

Downloading chunk 2 of 2 (187 files, size = 788.180238 MB) as Tue_Jan_21_18_01_02_2025_1.tar.gz



Downloading: 180 MB     

In [27]:
#prepare data (GDCprepare)
tcga_lihc_data <- GDCprepare(query_TCGA, summarizedExperiment = TRUE)



Starting to add information to samples

 => Add clinical information to samples

 => Adding TCGA molecular information from marker papers

 => Information will have prefix 'paper_' 

Available assays in SummarizedExperiment : 
  => unstranded
  => stranded_first
  => stranded_second
  => tpm_unstrand
  => fpkm_unstrand
  => fpkm_uq_unstrand
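
Each assay listed above can be pulled out of the SummarizedExperiment by name. The following is a minimal sketch, assuming a small toy object (`toy_se`, with made-up values) that stands in for `tcga_lihc_data`, because the real object only exists after the download. Only the assay names are taken from the output above.

```r
library(SummarizedExperiment)

# Toy stand-in for tcga_lihc_data: 3 genes x 2 samples, made-up values.
counts <- matrix(c(10L, 0L, 5L, 7L, 3L, 12L), nrow = 3,
                 dimnames = list(c("geneA", "geneB", "geneC"),
                                 c("sample1", "sample2")))
# Second assay with arbitrary numbers, only to show name-based selection.
fake_tpm <- counts / 10

toy_se <- SummarizedExperiment(assays = list(unstranded = counts,
                                             tpm_unstrand = fake_tpm))

# List the assays, then extract one of them as a plain matrix.
assayNames(toy_se)
raw_counts <- assay(toy_se, "unstranded")
dim(raw_counts)
```

With the real object, the equivalent call would be `assay(tcga_lihc_data, "unstranded")`.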



In [28]:
#export the RangedSummarizedExperiment as an .rds file
saveRDS(tcga_lihc_data, file = "TCGA_LIHC.rds")
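
The query, download, prepare and save steps above are the same for every project in the list, so they could be wrapped in one function. This is a minimal sketch under assumptions: `rds_name()` and `download_tcga_counts()` are hypothetical helper names, not part of TCGAbiolinks, and the file naming simply mirrors `TCGA_LIHC.rds`. The query arguments are copied from the cell above. The actual call is left commented out because it needs network access and TCGAbiolinks installed.

```r
# Hypothetical helper: turn a project ID into a file name,
# e.g. "TCGA-LIHC" becomes "TCGA_LIHC.rds".
rds_name <- function(project) {
  paste0(gsub("-", "_", project, fixed = TRUE), ".rds")
}

# Hypothetical wrapper around the steps used above for TCGA-LIHC.
download_tcga_counts <- function(project, out_dir = ".") {
  query <- TCGAbiolinks::GDCquery(project = project,
                                  data.category = "Transcriptome Profiling",
                                  experimental.strategy = "RNA-Seq",
                                  workflow.type = "STAR - Counts",
                                  access = "open")
  TCGAbiolinks::GDCdownload(query)
  se <- TCGAbiolinks::GDCprepare(query, summarizedExperiment = TRUE)
  out_file <- file.path(out_dir, rds_name(project))
  saveRDS(se, file = out_file)
  invisible(out_file)  # return the path without printing it
}

# Example call (not run here):
# download_tcga_counts("TCGA-LUSC")
```

A saved file can be loaded back in a later session with `readRDS()`.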

## Code for moving files (rsync)
```bash
# general form: rsync -azvhP [source] [target]
rsync -azvhP /scratch/user/s4436039/TCGA_datasets/TCGA_SKCM.rds /QRISdata/Q5935/nikita/TCGA-SKCM
rsync -azvhP /QRISdata/Q5935/nikita/TCGA-BRCA/TCGA_BRCA.rds /scratch/user/s4436039/TCGA_Data
rsync -azvhP /QRISdata/Q5935/nikita/TCGA_datasets/TCGA_STAD.rds /scratch/user/s4436039/TCGA_Data
rsync -azvhP /QRISdata/Q5935/nikita/TCGA-OV/TCGA-OV.RDS /scratch/user/s4436039/TCGA_Data
```


