# Ingest Treehouse

Download [UCSC Treehouse](https://treehousegenomics.soe.ucsc.edu/public-data/) clinical labels and RNA-Seq expression computed by the [Toil recompute](https://xenabrowser.net/datapages/?hub=https://toil.xenahubs.net:443) from [Xena](https://xenabrowser.net), then wrangle, prune, and store them in a single h5 file for machine learning.

In [1]:
import os
import numpy as np
import pandas as pd

# Switch to a scratch data directory so all paths are local
os.makedirs(os.path.expanduser("~/data/treehouse"), exist_ok=True)
os.chdir(os.path.expanduser("~/data/treehouse"))

## Ingest Samples

In [2]:
# Download raw files from xena
!wget -q -N https://xena.treehouse.gi.ucsc.edu/download/TreehousePEDv8_unique_hugo_log2_tpm_plus_1.2018-07-25.tsv

In [3]:
%%time
# Convert to float32, transpose to the ML convention (rows = samples), and store as HDF5 for significantly faster reading
if not os.path.exists(os.path.expanduser("treehouse.T.fp32.h5")):
    pd.read_csv(os.path.expanduser("TreehousePEDv8_unique_hugo_log2_tpm_plus_1.2018-07-25.tsv"), 
                sep="\t", index_col=0, engine='c') \
        .astype(np.float32).T \
        .to_hdf(os.path.expanduser("treehouse.T.fp32.h5"), "expression", mode="w", format="fixed")

CPU times: user 12min 11s, sys: 40.1 s, total: 12min 51s
Wall time: 12min 51s


In [8]:
# Invert the log2(TPM + 1) transform to get back to TPM, clipping small negative rounding errors to zero
all_samples = pd.read_hdf("treehouse.T.fp32.h5").apply(np.exp2).subtract(1.0).clip(lower=0.0)
print("Ingested {} samples with {} features".format(all_samples.shape[0], all_samples.shape[1]))
all_samples.head()

Ingested 11427 samples with 58581 features


Gene,5S_rRNA,5_8S_rRNA,7SK,A1BG,A1BG-AS1,A1CF,A2M,A2M-AS1,A2ML1,A2ML1-AS1,...,snoU2-30,snoU2_19,snoU83B,snoZ196,snoZ278,snoZ40,snoZ6,snosnR66,uc_338,yR211F11.2
THR15_0330_S01,0.0,0.0,0.0,5.14,1.22,0.0,147.210022,0.25,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.5,0.0
THR29_0776_S01,0.0,0.0,0.06,31.359997,4.03,0.0,39.079998,0.93,0.04,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,5.47,0.0
THR14_1221_S01,0.0,0.0,0.78,2.68,1.73,0.01,28.330002,0.25,0.4,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,3.51,0.15
THR11_0247_S01,0.0,0.0,1.94,8.280001,8.580001,0.14,465.710022,1.05,0.23,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,11.4,0.0
THR08_0162_S01,0.0,0.0,0.0,17.389999,21.900002,0.0,0.73,0.12,0.14,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,49.880005,0.0
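
The downloaded matrix holds log2(TPM + 1), so the cell above recovers TPM as 2**x - 1. The following is a minimal sketch of that round trip on made-up numbers (the `tpm` array is hypothetical and not taken from the Treehouse data); it also shows why clipping at zero is a sensible guard once float32 rounding is involved.

```python
import numpy as np

# Hypothetical TPM values, not taken from the Treehouse data
tpm = np.array([0.0, 0.25, 5.14, 147.21], dtype=np.float32)

# Forward transform as stored in the download: log2(TPM + 1)
log_tpm = np.log2(tpm + 1.0)

# Inverse transform: 2**x - 1, clipped so rounding never yields negative TPM
recovered = np.clip(np.exp2(log_tpm) - 1.0, 0.0, None)

# float32 rounding means the round trip is close rather than exact
print(np.allclose(recovered, tpm, rtol=1e-4, atol=1e-4))
```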


In [9]:
# Check that TPM (transcripts per million) per sample sum to 1M
all_samples.iloc[::all_samples.shape[0]//5].sum(axis=1)

THR15_0330_S01     1.000000e+06
TCGA-VF-A8AE-01    1.000003e+06
TCGA-BR-8080-01    1.000001e+06
TCGA-IB-AAUP-01    1.000001e+06
TCGA-E2-A1IE-01    1.000004e+06
TCGA-BR-8588-01    1.000006e+06
dtype: float64
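
The spot check above eyeballs a handful of rows. A sketch of the same check applied to every row with an explicit tolerance might look like this; the helper name `tpm_sums_ok`, the toy frame, and the relative tolerance of 1e-4 are all assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

def tpm_sums_ok(tpm_frame, rel_tol=1e-4):
    """Return a boolean Series that is True where a row sums to roughly 1e6."""
    row_sums = tpm_frame.sum(axis=1).to_numpy()
    # isclose tests |row_sum - 1e6| <= rel_tol * 1e6
    ok = np.isclose(row_sums, 1e6, rtol=rel_tol, atol=0.0)
    return pd.Series(ok, index=tpm_frame.index)

# Toy data: the first row sums to one million, the second does not
toy = pd.DataFrame(
    [[250000.0, 750000.0], [400000.0, 500000.0]],
    index=["sample_a", "sample_b"], columns=["GENE1", "GENE2"])
check = tpm_sums_ok(toy)
print(check)
```

On the real data, something like `tpm_sums_ok(all_samples).all()` would confirm the property for every sample rather than a sample of six.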

## Prune Features
Reduce the number of features/genes from ~58k down to a few thousand, both to bring the feature-to-sample ratio closer to balance and so that a Colab notebook can download the training data in reasonable time when computing SHAP values against a tissue cohort (see infer.ipynb).

In [13]:
# Genes that appear in any KEGG pathway gene set (http://software.broadinstitute.org/gsea/msigdb/)
# Each GMT line is: set name, description, then tab-separated gene symbols
with open("c2.cp.kegg.v6.2.symbols.gmt") as f:
    kegg_genes = set().union(*[line.strip().split("\t")[2:] for line in f])

# Genes in the COSMIC Cancer Gene Census (https://cancer.sanger.ac.uk/census)
cosmic_genes = set(pd.read_csv("cosmic-26-11-2018.tsv", sep="\t")["Gene Symbol"].values)

# Keep a gene if it is in either set
subset_of_genes = kegg_genes | cosmic_genes

# Drop every column (gene) that is not in the combined set
pruned_samples = all_samples.drop(labels=list(set(all_samples.columns) - subset_of_genes), axis=1)

print("Pruning from {} down to {} features/genes".format(all_samples.shape[1], pruned_samples.shape[1]))

Pruning from 58581 down to 5545 features/genes
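
The `[2:]` slice in the cell above follows from the GMT layout, where each line holds a gene set name, a description or URL, and then the gene symbols, all tab-separated. A small sketch of the same parsing on an in-memory file is shown here; the helper name `genes_from_gmt`, the set names, and the URLs are placeholders rather than real MSigDB entries.

```python
import io

def genes_from_gmt(handle):
    """Collect the union of gene symbols across all gene sets in a GMT file."""
    genes = set()
    for line in handle:
        fields = line.rstrip("\n").split("\t")
        # fields[0] is the set name and fields[1] a description or URL
        genes.update(symbol for symbol in fields[2:] if symbol)
    return genes

# Two made-up gene sets in GMT layout
toy_gmt = io.StringIO(
    "SET_ONE\thttp://example.org/one\tTP53\tBRCA1\n"
    "SET_TWO\thttp://example.org/two\tBRCA1\tEGFR\n")
toy_genes = genes_from_gmt(toy_gmt)
print(sorted(toy_genes))
```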


## Ingest Labels

In [14]:
!wget -q -N https://xena.treehouse.gi.ucsc.edu/download/TreehousePEDv8_clinical_metadata.2018-07-25.tsv

In [15]:
all_labels = pd.read_csv("TreehousePEDv8_clinical_metadata.2018-07-25.tsv",
    header=0, sep="\t", encoding="ISO-8859-1", index_col=0, dtype="str").sort_index(axis="index")

all_labels.iloc[::all_labels.shape[0]//5]

th_sampleid,disease,age_at_dx,pedaya,gender
TARGET-10-PAKSWW-03,acute lymphoblastic leukemia,15.11,"Yes, age < 30 years",male
TCGA-66-2757-01,lung squamous cell carcinoma,65.0,No,female
TCGA-C5-A3HD-01,cervical & endocervical cancer,51.0,No,female
TCGA-EK-A2RC-01,cervical & endocervical cancer,33.0,No,female
TCGA-P8-A5KC-01,pheochromocytoma & paraganglioma,48.0,No,male
THR33_1146_S01,medulloblastoma,8.0,"Yes, age < 30 years",male


## Wrangle and Prune

Drop samples that are missing values for the fields we want to train on, or whose class has too few examples, and transform field values for training.

In [37]:
# Include only labels for samples that we have
pruned_labels = all_labels.loc[all_labels.index.intersection(pruned_samples.index)]
print("Starting with {} labeled sample pairs".format(pruned_labels.shape[0]))

# Drop samples that are missing the label we plan to classify
pruned_labels = pruned_labels.dropna(subset=["disease"])
print(pruned_labels.shape[0], "with disease")

# Drop diseases with 100 or fewer examples
min_examples = 100
counts = pruned_labels.disease.value_counts()
num_with_disease = pruned_labels.shape[0]
pruned_labels = pruned_labels[pruned_labels.disease.isin(counts[counts > min_examples].index)]
print("Dropping {} samples where the disease has <= {} samples".format(
    num_with_disease - pruned_labels.shape[0], min_examples))
print(pruned_labels.shape[0], "with > {} samples of the disease".format(min_examples))

print("{} labels after pruning".format(pruned_labels.shape[0]))
pruned_labels.iloc[::pruned_labels.shape[0]//5]

Starting with 11427 labeled sample pairs
11426 with disease
Dropping 1218 samples where the disease has <= 100 samples
10208 with > 100 samples of the disease
10208 labels after pruning


th_sampleid,disease,age_at_dx,pedaya,gender
TARGET-10-PAKSWW-03,acute lymphoblastic leukemia,15.11,"Yes, age < 30 years",male
TCGA-56-A49D-01,lung squamous cell carcinoma,67.0,No,male
TCGA-BJ-A18Z-01,thyroid carcinoma,58.0,No,male
TCGA-E2-A14P-01,breast invasive carcinoma,79.0,No,female
TCGA-KK-A8IL-01,prostate adenocarcinoma,65.0,No,male
THR33_1145_S01,medulloblastoma,3.0,"Yes, age < 30 years",male
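
The class filter above keeps a disease only when it has more than a minimum number of samples. One way to keep the filter and its log messages from drifting apart is to route everything through a single threshold parameter; the sketch below does that on toy data, where the helper name `drop_rare_classes` and the disease labels are illustrative assumptions.

```python
import pandas as pd

def drop_rare_classes(labels, column, min_count):
    """Keep rows whose class in `column` occurs more than `min_count` times."""
    counts = labels[column].value_counts()
    keep = counts[counts > min_count].index
    return labels[labels[column].isin(keep)]

# Toy labels: three samples of one class, a single sample of another
toy_labels = pd.DataFrame(
    {"disease": ["glioma"] * 3 + ["sarcoma"]},
    index=["s1", "s2", "s3", "s4"])
kept = drop_rare_classes(toy_labels, "disease", min_count=2)
print("Dropped {} samples".format(len(toy_labels) - len(kept)))
```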


## Export

Export the pruned samples and labels to a single h5 file, stored under the keys `samples` and `labels`.

In [25]:
%%time
# Include only ids for which we have both a sample and a label
sample_ids = pruned_samples.index.intersection(pruned_labels.index)
print("Exporting {} samples".format(len(sample_ids)))

# NOTE: Setting complevel to 9 reduces the size of the resulting h5 file from 3G down to 2.1G
# but increases the read time from 2.79s to 20.8s and the write time from 19.9s to 25m
pruned_samples.loc[sample_ids].sort_index(axis="index").sort_index(axis="columns").to_hdf(
    os.path.expanduser("treehouse-pruned.h5"), key="samples", mode="w", format="fixed", complevel=0)
pruned_labels.loc[sample_ids].sort_index(axis="index").sort_index(axis="columns").to_hdf(
    os.path.expanduser("treehouse-pruned.h5"), key="labels", mode="a", format="fixed", complevel=0)

Exporting 10208 samples
CPU times: user 4.11 s, sys: 820 ms, total: 4.93 s
Wall time: 4.92 s


your performance may suffer as PyTables will pickle object types that it cannot
map directly to c-types [inferred_type->mixed,key->block0_values] [items->['age_at_dx', 'disease', 'gender', 'pedaya']]

  return pytables.to_hdf(path_or_buf, key, self, **kwargs)
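
The export depends on the samples and labels sharing the same row ids in the same order, so that row i of one frame describes row i of the other. A minimal sketch of that alignment on toy frames follows; the frame contents and the names `X` and `Y` are made up for illustration.

```python
import pandas as pd

# Toy frames with partly overlapping ids in different orders
toy_samples = pd.DataFrame({"GENE1": [1.0, 2.0, 3.0]}, index=["s3", "s1", "s2"])
toy_labels = pd.DataFrame({"disease": ["glioma", "sarcoma"]}, index=["s1", "s3"])

# Keep only ids present in both, then sort so the rows line up
shared_ids = toy_samples.index.intersection(toy_labels.index)
X = toy_samples.loc[shared_ids].sort_index(axis="index")
Y = toy_labels.loc[shared_ids].sort_index(axis="index")

print(list(X.index), list(Y.index))
```

A check such as `X.index.equals(Y.index)` before training is a cheap way to catch misaligned rows.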
