# Ingest Pancan + GTEX

Download the Pancan/TCGA and GTEX clinical labels and the [Kallisto](https://pachterlab.github.io/kallisto/) RNA-seq transcript expression computed by the [Toil recompute](https://xenabrowser.net/datapages/?hub=https://toil.xenahubs.net:443) from [Xena](https://xenabrowser.net), prune to only the high-variance transcripts, and store the result in a single h5 file.

In [1]:
import os
import numpy as np
import pandas as pd

# Switch to a scratch data directory so all paths are local
!mkdir -p ~/data/pancan-gtex
os.chdir(os.path.expanduser("~/data/pancan-gtex"))

## Download Examples

In [2]:
!wget -q -N https://toil.xenahubs.net/download/tcga_Kallisto_tpm.gz
!wget -q -N https://toil.xenahubs.net/download/gtex_Kallisto_tpm.gz

In [3]:
%%time
# Convert to float32, transpose to the ML-style layout (rows = samples),
# and store as HDF5 for significantly faster reading
if not os.path.exists("tcga_Kallisto_tpm.T.fp32.h5"):
    pd.read_table("tcga_Kallisto_tpm.gz", index_col=0, engine='c') \
        .astype(np.float32).T \
        .to_hdf("tcga_Kallisto_tpm.T.fp32.h5", key="expression", mode="w", format="fixed")
if not os.path.exists("gtex_Kallisto_tpm.T.fp32.h5"):
    pd.read_table("gtex_Kallisto_tpm.gz", index_col=0, engine='c') \
        .astype(np.float32).T \
        .to_hdf("gtex_Kallisto_tpm.T.fp32.h5", key="expression", mode="w", format="fixed")

CPU times: user 27 µs, sys: 32 µs, total: 59 µs
Wall time: 68.2 µs
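
The conversion above does two things: it transposes the matrix so that rows are samples, and it casts to float32, which halves the in-memory size relative to the default float64. A minimal sketch on a synthetic matrix illustrates both effects; the transcript and sample names are made up for illustration and do not come from the Xena files.

```python
import numpy as np
import pandas as pd

# Hypothetical transcripts x samples matrix, laid out like the Xena download
rng = np.random.default_rng(0)
raw = pd.DataFrame(
    rng.normal(size=(4, 3)),
    index=["tx_a", "tx_b", "tx_c", "tx_d"],        # made-up transcript ids
    columns=["sample_1", "sample_2", "sample_3"],  # made-up sample ids
)

# Cast to float32 and transpose so rows = samples, columns = transcripts
converted = raw.astype(np.float32).T

print(converted.shape)  # (3, 4)
print(raw.values.nbytes, converted.values.nbytes)  # 96 48
```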


In [4]:
%%time
# Read back the h5 files so we always start from the same dataframes
tcga_samples = pd.read_hdf("tcga_Kallisto_tpm.T.fp32.h5")
gtex_samples = pd.read_hdf("gtex_Kallisto_tpm.T.fp32.h5")

# Make sure they have the exact same set of transcript names
assert tcga_samples.columns.equals(gtex_samples.columns)

CPU times: user 1.52 s, sys: 10.6 s, total: 12.1 s
Wall time: 12.2 s


In [5]:
# Combine into a single dataset
all_samples = pd.concat([tcga_samples, gtex_samples], axis="index")
print("Ingested {} samples with {} features".format(all_samples.shape[0], all_samples.shape[1]))
all_samples.head()

Ingested 18525 samples with 197044 features


sample,ENST00000548312.5,ENST00000527779.1,ENST00000454820.5,ENST00000535093.1,ENST00000346219.7,ENST00000570899.1,ENST00000557761.1,ENST00000625998.2,ENST00000583693.5,ENST00000383738.6,...,ENST00000380620.8,ENST00000548698.5,ENST00000542429.2,ENST00000602837.1,ENST00000422233.5,ENST00000377138.1,ENST00000463473.2,ENST00000380293.3,ENST00000288710.6,ENST00000250055.2
TCGA-E9-A1N3-01,-2.1325,-9.9658,-9.9658,0.821,-9.6932,-3.4428,-9.9658,0.4897,-1.8656,-6.056,...,-4.4744,-9.9658,-9.9658,-9.9658,-1.3686,-9.9658,-5.2782,-9.9658,-1.5226,-1.3758
TCGA-EL-A3ZP-01,-1.2893,-0.1728,-9.9658,-2.066,-1.0375,-2.0306,-9.9658,1.5523,-0.188,-6.8294,...,-2.7537,-9.9658,-9.9658,-3.1521,-9.9658,-9.9658,-3.0967,-3.3687,-4.9041,-1.4238
TCGA-E2-A152-01,-3.2871,-1.7043,-2.9534,-2.4626,-5.7857,-2.7592,-9.9658,-9.9658,-2.4451,-4.5288,...,2.8864,-9.9658,-9.9658,-1.1593,-9.9658,-9.9658,-9.9658,-9.9658,-4.5288,-9.9658
TCGA-66-2734-01,-2.2107,0.8207,-9.9658,-2.5896,-4.7458,-3.2763,-9.9658,-2.5001,-1.0966,-5.769,...,-3.2774,-9.9658,-0.8833,-0.8438,-9.9658,-9.9658,-3.7255,-9.9658,-4.163,4.4371
TCGA-BQ-5885-01,-3.7658,-9.9658,-9.9658,-2.9499,-9.9658,-3.9142,-9.9658,1.5864,-1.29,-9.9658,...,-1.0421,-9.9658,-9.9658,-2.2017,-9.9658,-9.9658,-9.9658,-0.5238,-2.86,-1.6823


In [6]:
# Keep only high-variance features: those whose variance exceeds the
# mean variance by more than two standard deviations
var = all_samples.var()
threshold = var.mean() + 2 * var.std()

pruned_features = var[var > threshold]
pruned_samples = all_samples[pruned_features.index]
print("Filtered features from {} down to {}".format(all_samples.shape[1], pruned_samples.shape[1]))

Filtered features from 197044 down to 6974
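
The filter keeps a feature only when its variance lies more than two standard deviations above the mean of all per-feature variances. A small sketch on synthetic data shows the rule in isolation; the feature names and the inflated column `f7` are assumptions chosen purely for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# 200 synthetic samples x 50 features, each with roughly unit variance
toy = pd.DataFrame(
    rng.normal(size=(200, 50)),
    columns=["f{}".format(i) for i in range(50)],
)
# Inflate one hypothetical feature so that it clearly stands out
toy["f7"] *= 10

# Same rule as the notebook cell: variance > mean + 2 * std of the variances
var = toy.var()
threshold = var.mean() + 2 * var.std()
kept = var[var > threshold].index

print(list(kept))  # only the inflated feature survives
```

Because the threshold is built from the mean and standard deviation of the variances, a handful of extremely variable features can raise it considerably, so it may be worth inspecting `var.describe()` before settling on the cutoff.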


## Download Labels

In [7]:
!wget -q -N https://pancanatlas.xenahubs.net/download/Survival_SupplementalTable_S1_20171025_xena_sp.gz
!wget -q -N https://toil.xenahubs.net/download/TcgaTargetGTEX_phenotype.txt.gz

In [8]:
survival_labels = pd.read_table(
    "Survival_SupplementalTable_S1_20171025_xena_sp.gz", compression="gzip", 
    header=0, sep="\t", encoding="ISO-8859-1", index_col=0, dtype="str").sort_index(axis="index")

tcga_gtex_labels = pd.read_table(
    "TcgaTargetGTEX_phenotype.txt.gz", compression="gzip", 
    header=0, sep="\t", encoding="ISO-8859-1", index_col=0, dtype="str").sort_index(axis="index")

In [9]:
all_labels = pd.merge(tcga_gtex_labels, survival_labels, left_index=True, right_index=True, how="outer").astype('str')
print("Ingested {} labels for {} samples".format(all_labels.shape[1], all_labels.shape[0]))
all_labels.iloc[::all_labels.shape[0]//5]

Ingested 39 labels for 21226 samples


sample,detailed_category,primary disease or tissue,_primary_site,_sample_type,_gender,_study,_PATIENT,cancer type abbreviation,age_at_initial_pathologic_diagnosis,gender,...,residual_tumor,OS,OS.time,DSS,DSS.time,DFI,DFI.time,PFI,PFI.time,Redaction
GTEX-1117F-0226-SM-5GZZ7,Adipose - Subcutaneous,Adipose - Subcutaneous,Adipose Tissue,Normal Tissue,Female,GTEX,,,,,...,,,,,,,,,,
GTEX-QDVN-2126-SM-33HBS,Adipose - Subcutaneous,Adipose - Subcutaneous,Adipose Tissue,Normal Tissue,Male,GTEX,,,,,...,,,,,,,,,,
TARGET-50-PAJNAA-01,Wilms Tumor,Wilms Tumor,Kidney,Primary Solid Tumor,,TARGET,,,,,...,,,,,,,,,,
TCGA-AW-A1PO-01,,,,,,,TCGA-AW-A1PO,UCEC,66.0,FEMALE,...,,0.0,17.0,0.0,17.0,,,0.0,17.0,
TCGA-EI-6508-01,Rectum Adenocarcinoma,Rectum Adenocarcinoma,Rectum,Primary Tumor,Female,TCGA,TCGA-EI-6508,READ,48.0,FEMALE,...,,0.0,636.0,0.0,636.0,,,0.0,636.0,
TCGA-ZX-AA5X-01,Cervical & Endocervical Cancer,Cervical & Endocervical Cancer,Cervix,Primary Tumor,Female,TCGA,TCGA-ZX-AA5X,CESC,64.0,FEMALE,...,,0.0,119.0,0.0,119.0,,,0.0,119.0,
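
The outer merge keeps every sample that appears in either label table, which is why some rows above have phenotype fields but no survival fields and vice versa. One way to see where each row came from is the `indicator` option of `pd.merge`. The sketch below uses two tiny made-up tables (sample ids `S1` to `S3` are hypothetical), not the real label files.

```python
import pandas as pd

# Hypothetical phenotype and survival tables, both indexed by sample id
phenotype = pd.DataFrame(
    {"_primary_site": ["Kidney", "Rectum"]},
    index=pd.Index(["S1", "S2"], name="sample"),
)
survival = pd.DataFrame(
    {"OS": ["0", "1"]},
    index=pd.Index(["S2", "S3"], name="sample"),
)

# indicator=True adds a _merge column recording which side each row came from
merged = pd.merge(
    phenotype, survival,
    left_index=True, right_index=True,
    how="outer", indicator=True,
)

print(merged["_merge"].value_counts())
```

Rows marked `left_only` or `right_only` carry missing values for the other table's columns, so they are the ones that later filtering on those columns would need to handle.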


## Wrangle and Prune

Drop samples that are missing values for the fields we want to train on, and transform field values into training labels.

In [10]:
# Include only labels for samples that we have
pruned_labels = all_labels.loc[all_labels.index.intersection(pruned_samples.index)]
print("Starting with {} labeled sample pairs".format(pruned_labels.shape[0]))

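# Drop samples that are missing labels we plan to classify
# NOTE: all_labels was cast with astype('str') above, which may have turned
# missing values into the string 'nan'; if so, dropna() removes nothing here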
pruned_labels = pruned_labels.dropna(subset=["_primary_site"])
print(pruned_labels.shape[0], "with _primary_site")
pruned_labels = pruned_labels.dropna(subset=["_gender"])
print(pruned_labels.shape[0], "with _gender")

# Some of the cell lines are normal, and in any case it is not clear that they are a reliable signal
pruned_labels = pruned_labels[pruned_labels._sample_type != "Cell Line"]
print(pruned_labels.shape[0], "not Cell Line")

# Generate a Tumor/Normal label
pruned_labels = pruned_labels.dropna(subset=["_sample_type"])
print(pruned_labels.shape[0], "with _sample_type")
pruned_labels["tumor_normal"] = pruned_labels.apply(
    lambda row: "Normal" if row["_sample_type"] in ["Normal Tissue", "Solid Tissue Normal"]
    else "Tumor", axis=1)

print("{} labels after pruning".format(pruned_labels.shape[0]))
pruned_labels.iloc[::pruned_labels.shape[0]//5]

Starting with 18397 labeled sample pairs
18397 with _primary_site
18397 with _gender
17964 not Cell Line
17964 with _sample_type
17964 labels after pruning


sample,detailed_category,primary disease or tissue,_primary_site,_sample_type,_gender,_study,_PATIENT,cancer type abbreviation,age_at_initial_pathologic_diagnosis,gender,...,OS,OS.time,DSS,DSS.time,DFI,DFI.time,PFI,PFI.time,Redaction,tumor_normal
GTEX-1117F-0226-SM-5GZZ7,Adipose - Subcutaneous,Adipose - Subcutaneous,Adipose Tissue,Normal Tissue,Female,GTEX,,,,,...,,,,,,,,,,Normal
GTEX-OHPK-0326-SM-2HMJO,Heart - Left Ventricle,Heart - Left Ventricle,Heart,Normal Tissue,Female,GTEX,,,,,...,,,,,,,,,,Normal
GTEX-ZVT4-1026-SM-57WC4,Breast - Mammary Tissue,Breast - Mammary Tissue,Breast,Normal Tissue,Female,GTEX,,,,,...,,,,,,,,,,Normal
TCGA-BB-7871-01,Head & Neck Squamous Cell Carcinoma,Head & Neck Squamous Cell Carcinoma,Head and Neck region,Primary Tumor,Female,TCGA,TCGA-BB-7871,HNSC,64.0,FEMALE,...,0.0,750.0,0.0,750.0,,,1.0,428.0,,Tumor
TCGA-ET-A3DV-01,Thyroid Carcinoma,Thyroid Carcinoma,Thyroid Gland,Primary Tumor,Female,TCGA,TCGA-ET-A3DV,THCA,68.0,FEMALE,...,0.0,5068.0,0.0,5068.0,0.0,5068.0,0.0,5068.0,,Tumor
TCGA-ZT-A8OM-01,Thymoma,Thymoma,Thymus,Primary Tumor,Female,TCGA,TCGA-ZT-A8OM,THYM,73.0,FEMALE,...,0.0,1398.0,0.0,1398.0,,,0.0,1398.0,,Tumor
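
The `tumor_normal` label is derived from `_sample_type`: the two normal categories map to "Normal" and everything else maps to "Tumor". As a possible alternative to the row-wise `apply`, the same mapping can be sketched in vectorized form with `isin` and `np.where`. The sample ids and the "Metastatic" value below are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical labels table with a few sample types
labels = pd.DataFrame(
    {"_sample_type": ["Normal Tissue", "Primary Tumor",
                      "Solid Tissue Normal", "Metastatic"]},
    index=["S1", "S2", "S3", "S4"],  # made-up sample ids
)

# Sample types treated as normal, as in the notebook cell
normal_types = ["Normal Tissue", "Solid Tissue Normal"]

# Vectorized equivalent of the row-wise lambda
labels["tumor_normal"] = np.where(
    labels["_sample_type"].isin(normal_types), "Normal", "Tumor")

print(labels["tumor_normal"].tolist())
# ['Normal', 'Tumor', 'Normal', 'Tumor']
```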


## Export

Export the full dataset as an h5 file.

In [34]:
%%time
# Include only ids that we have labels for after pruning
sample_ids = pruned_samples.index.intersection(pruned_labels.index)
print("Exporting {} samples".format(len(sample_ids)))

# NOTE: Setting complevel to 9 reduces the size of the resulting h5 file from 3G down to 2.1G
# but increases the read time from 2.79s to 20.8s and the write time from 19.9s to 25m
pruned_samples.loc[sample_ids].sort_index(axis="index").sort_index(axis="columns").to_hdf(
    "pancan-gtex.h5", key="samples", mode="w", format="fixed", complevel=0)
pruned_labels.loc[sample_ids].sort_index(axis="index").sort_index(axis="columns").to_hdf(
    "pancan-gtex.h5", key="labels", mode="a", format="fixed", complevel=0)

Exporting 17964 samples
CPU times: user 1.1 s, sys: 1.5 s, total: 2.61 s
Wall time: 3.19 s
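
The export restricts both tables to the shared sample ids and sorts rows and columns, so that row i of the samples table and row i of the labels table describe the same sample. A minimal sketch of that alignment on toy frames follows; all ids, column names, and values are hypothetical.

```python
import pandas as pd

# Hypothetical expression and label tables with partially overlapping ids
samples = pd.DataFrame(
    {"tx_b": [1.0, 2.0, 3.0], "tx_a": [4.0, 5.0, 6.0]},
    index=["S3", "S1", "S2"],
)
labels = pd.DataFrame(
    {"tumor_normal": ["Tumor", "Normal"]},
    index=["S2", "S3"],
)

# Keep only ids present in both tables, then sort rows and columns
sample_ids = samples.index.intersection(labels.index)
X = samples.loc[sample_ids].sort_index(axis="index").sort_index(axis="columns")
Y = labels.loc[sample_ids].sort_index(axis="index").sort_index(axis="columns")

# Both tables now share the same row order
assert X.index.equals(Y.index)
print(X.index.tolist(), X.columns.tolist())
```

Checking `X.index.equals(Y.index)` after reading the exported file back is a cheap guard against silently misaligned features and labels.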


In [28]:
# Empty the path
# !aws --profile {os.getenv("AWS_PROFILE")} --endpoint {os.getenv("AWS_S3_ENDPOINT")} \
#     s3 rm --recursive s3://stuartlab/hello.txt

In [35]:
# Use the aws cli's rsync-like sync command to push changed files up to PRP S3/CEPH
# !aws --profile {os.getenv("AWS_PROFILE")} --endpoint {os.getenv("AWS_S3_ENDPOINT")} \
#     s3 sync . s3://stuartlab/pancan-gtex --acl public-read
!aws --profile {os.getenv("AWS_PROFILE")} --endpoint {os.getenv("AWS_S3_ENDPOINT")} \
    s3 cp pancan-gtex.h5 s3://stuartlab/pancan-gtex/ --acl public-read

upload: ./pancan-gtex.h5 to s3://stuartlab/pancan-gtex/pancan-gtex.h5


In [36]:
!aws --profile {os.getenv("AWS_PROFILE")} --endpoint {os.getenv("AWS_S3_ENDPOINT")} \
    s3 ls s3://stuartlab/pancan-gtex/

2018-11-23 22:51:12  507550544 pancan-gtex.h5
