### Treehouse Outlier

See which cancer-related genes in a specific prospective Treehouse sample are expressed above the 75th percentile of the entire v2 cohort.

Xena exposes a flexible query interface that lets you select both the data you want and the shape in which it is returned.

A higher-level Python library with examples can be found at:

https://github.com/ucscXena/ucsc-xena-server/tree/master/python

Each hub also exposes a console that you can use to develop and test queries directly:

http://toil.xenahubs.net/console.html

Paste:

(query {:select [:name] :from [:dataset] :where [:like :name "%target%"]})

to get a list of all the datasets with 'target' in their name.
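The same query can also be built and sent from Python. The following is only a sketch written for Python 3: the helper names `like_query` and `post_query` are hypothetical, and the `/data/` endpoint with a `text/plain` body and a JSON reply is an assumption about how the hub accepts queries, so check the Python library linked above before relying on it.

```python
import json
from urllib.request import Request, urlopen


def like_query(pattern):
    """Build the dataset-name query shown above for a SQL LIKE pattern."""
    return '(query {:select [:name] :from [:dataset] :where [:like :name "%s"]})' % pattern


def post_query(hub, query):
    """POST a query string to a hub and decode the reply.

    Assumption: the hub accepts queries at <hub>/data/ as text/plain
    and answers with JSON.
    """
    request = Request(hub + "/data/", data=query.encode("utf-8"),
                      headers={"Content-Type": "text/plain"})
    with urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


# Only the query text is built here; nothing is sent over the network
target_query = like_query("%target%")
print(target_query)
```

Calling `post_query(hub, target_query)` against a reachable hub would then be expected to return the matching dataset names, under the assumptions stated above.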

In [1]:
import numpy as np
import pandas as pd
import xena_query as xena

In [2]:
# List the cohorts on this hub and the datasets in the Treehouse cohort
hub = "https://xena.scellucsc.net"
dataset = "treehouse_v2_expression"
print "Cohorts:", xena.all_cohorts(hub)
print "Treehouse Datasets:", xena.datasets_list_in_cohort(hub, "treehouse_v2")

Cohorts: [u'kriegsteinRadialGliaStudy1', u'treehouse_v2', u'quakeBrainGeo1']
Treehouse Datasets: [u'treehouse_v2_clinical', u'treehouse_v2_expression']


In [3]:
# Get a list of all the features in the expression dataset
all_features = pd.DataFrame(xena.dataset_field(hub, dataset), columns=["Gene Symbol"])
print "Total number of features:", len(all_features)
all_features.head()

Total number of features: 27166


Unnamed: 0,Gene Symbol
0,A1BG
1,A1BG-AS1
2,A1CF
3,A2M
4,A2M-AS1


In [4]:
# Get a list of cancer genes as features
cancer_genes = pd.read_table("cancer_genes.tsv")
print "Number of Cancer Genes:", len(cancer_genes)
cancer_genes.head()

Number of Cancer Genes: 602


Unnamed: 0,Gene Symbol,Name,Entrez GeneId,Genome Location,Chr Band,Somatic,Germline,Tumour Types(Somatic),Tumour Types(Germline),Cancer Syndrome,Tissue Type,Molecular Genetics,Role in Cancer,Mutation Types,Translocation Partner,Other Germline Mut,Other Syndrome,Synonyms
0,ABI1,abl-interactor 1,10006,10:26748570-26860863,10p11.2,yes,,AML,,,L,Dom,TSG,T,KMT2A,,,"ABI1,E3B1,ABI-1,SSH3BP1,10006"
1,ABL1,v-abl Abelson murine leukemia viral oncogene h...,25,9:130835447-130885683,9q34.1,yes,,"CML, ALL, T-ALL",,,L,Dom,oncogene,"T, Mis","BCR, ETV6, NUP214",,,"ABL1,p150,ABL,c-ABL,JTK7,bcr/abl,v-abl,P00519,..."
2,ABL2,"c-abl oncogene 2, non-receptor tyrosine kinase",27,1:179107718-179143044,1q24-q25,yes,,AML,,,L,Dom,oncogene,T,ETV6,,,"ABL2,ARG,RP11-177A2_3,ABLL,P42684,ENSG00000143..."
3,ACKR3,atypical chemokine receptor 3,57007,2:-,2q37.3,yes,,lipoma,,,M,Dom,oncogene,T,HMGA2,,,
4,ACSL3,acyl-CoA synthetase long-chain family member 3,2181,2:222908773-222941654,2q36,yes,,prostate,,,E,Dom,,T,ETV1,,,"2181,PRO2194,ACS3,FACL3,O95573,ENSG00000123983..."


In [5]:
# Subset the features to just those from the cancer list
filtered_features = cancer_genes.merge(all_features, how="inner", on="Gene Symbol")
print "Cancer gene features in dataset:", len(filtered_features)
filtered_features.head()

Cancer gene features in dataset: 417


Unnamed: 0,Gene Symbol,Name,Entrez GeneId,Genome Location,Chr Band,Somatic,Germline,Tumour Types(Somatic),Tumour Types(Germline),Cancer Syndrome,Tissue Type,Molecular Genetics,Role in Cancer,Mutation Types,Translocation Partner,Other Germline Mut,Other Syndrome,Synonyms
0,ABL2,"c-abl oncogene 2, non-receptor tyrosine kinase",27,1:179107718-179143044,1q24-q25,yes,,AML,,,L,Dom,oncogene,T,ETV6,,,"ABL2,ARG,RP11-177A2_3,ABLL,P42684,ENSG00000143..."
1,ACKR3,atypical chemokine receptor 3,57007,2:-,2q37.3,yes,,lipoma,,,M,Dom,oncogene,T,HMGA2,,,
2,ACSL3,acyl-CoA synthetase long-chain family member 3,2181,2:222908773-222941654,2q36,yes,,prostate,,,E,Dom,,T,ETV1,,,"2181,PRO2194,ACS3,FACL3,O95573,ENSG00000123983..."
3,ACSL6,acyl-CoA synthetase long-chain family member 6,23305,5:131954234-132011553,5q31.1,yes,,"AML, AEL",,,L,Dom,,T,ETV6,,,"23305,FACL6,LACS2,ACS2,LACS5,KIAA0837,ENSG0000..."
4,ACVR1,"activin A receptor, type I",90,2:157737531-157799493,2q23-q24,yes,,DIPG,,,O,Dom,oncogene,Mis,,yes,Fibrodysplasia ossificans progressiva,"90,SKR1,ACTRI,ALK2,FOP,ACVRLK2,Q04771,ENSG0000..."
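The inner merge silently drops every cancer gene that has no matching feature in the expression dataset (602 genes in the list, 417 after the merge). A minimal sketch, using small made-up tables rather than the real ones, of how a left merge with `indicator=True` can show which genes were dropped:

```python
import pandas as pd

# Toy stand-ins for cancer_genes and all_features (illustrative values only)
toy_cancer_genes = pd.DataFrame({
    "Gene Symbol": ["ABI1", "ABL1", "ABL2", "ACKR3"],
    "Role in Cancer": ["TSG", "oncogene", "oncogene", "oncogene"],
})
toy_features = pd.DataFrame({"Gene Symbol": ["A1BG", "ABL2", "ACKR3"]})

# how="left" keeps every cancer gene; indicator=True adds a "_merge" column
# recording whether each gene was also found in the feature list
checked = toy_cancer_genes.merge(toy_features, how="left",
                                 on="Gene Symbol", indicator=True)
kept = checked[checked["_merge"] == "both"]
missing = checked[checked["_merge"] == "left_only"]

print("kept: %s" % list(kept["Gene Symbol"]))
print("missing: %s" % list(missing["Gene Symbol"]))
```

Inspecting the missing genes is a quick way to tell whether they are absent from the dataset or simply listed under a different symbol.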


In [6]:
# Get a list of all the sample identifiers
samples = pd.DataFrame(xena.dataset_samples(hub, dataset), columns=["Sample ID"])
print "Samples found:", len(samples)
samples.head()

Samples found: 10818


Unnamed: 0,Sample ID
0,icgc/_EGAR00001415737_RNA_PAIRED_ICGC_GBM15_tu...
1,icgc/_EGAR00001415749_RNA_PAIRED_ICGC_GBM24_tu...
2,icgc/_EGAR00001415750_RNA_PAIRED_ICGC_GBM25_tu...
3,icgc/_EGAR00001415751_RNA_PAIRED_ICGC_GBM26_tu...
4,icgc/_EGAR00001415752_RNA_PAIRED_ICGC_GBM27_tu...


In [7]:
# Get expression levels for the filtered gene list for all the samples
sample_ids = list(samples["Sample ID"].values)
gene_symbols = list(filtered_features["Gene Symbol"].values)
expression = pd.DataFrame(xena.dataset_probe_values(hub, dataset, sample_ids, gene_symbols),
                          index=gene_symbols, columns=sample_ids, dtype=np.float32)
expression.head()

Unnamed: 0,icgc/_EGAR00001415737_RNA_PAIRED_ICGC_GBM15_tumor_SN935_0182_B_C2UKHACXXs_131105_1,icgc/_EGAR00001415749_RNA_PAIRED_ICGC_GBM24_tumor_SN935_0169_B_D2335ACXXs_130809_5,icgc/_EGAR00001415750_RNA_PAIRED_ICGC_GBM25_tumor_SN935_0169_B_D2335ACXXs_130809_6,icgc/_EGAR00001415751_RNA_PAIRED_ICGC_GBM26_tumor_SN935_0169_B_D2335ACXXs_130809_7,icgc/_EGAR00001415752_RNA_PAIRED_ICGC_GBM27_tumor_SN935_0182_B_C2UKHACXXs_131105_2,icgc/_EGAR00001415753_RNA_PAIRED_ICGC_GBM28_tumor_SN935_0182_B_C2UKHACXXs_131105_3,icgc/_EGAR00001415754_RNA_PAIRED_ICGC_GBM32_tumor_SN935_0182_B_C2UKHACXXs_131105_5,icgc/_EGAR00001415755_RNA_PAIRED_ICGC_GBM33_tumor_SN935_0182_B_C2UKHACXXs_131105_6,icgc/_EGAR00001415756_RNA_PAIRED_ICGC_GBM34_tumor_SN935_0182_B_C2UKHACXXs_131105_7,icgc/_EGAR00001415757_RNA_PAIRED_ICGC_GBM36_tumor_SN935_0182_B_C2UKHACXXs_131105_8,...,icgc/_EGAZ00001000218_81MAGABXX_3_withJunctionsOnGenome_dupsFlagged,icgc/_EGAZ00001000219_81MAGABXX_4_withJunctionsOnGenome_dupsFlagged,icgc/_EGAZ00001000221_81MAGABXX_6_withJunctionsOnGenome_dupsFlagged,icgc/_EGAZ00001000223_81MAGABXX_8_withJunctionsOnGenome_dupsFlagged,icgc/_EGAZ00001000226_81MAGABXX_1_withJunctionsOnGenome_dupsFlagged,icgc/_EGAZ00001000227_81MAGABXX_2_withJunctionsOnGenome_dupsFlagged,icgc/_EGAZ00001000232_81MK3ABXX_3_withJunctionsOnGenome_dupsFlagged,icgc/_EGAZ00001000233_81MK3ABXX_4_withJunctionsOnGenome_dupsFlagged,icgc/_EGAZ00001000236_81MK3ABXX_7_withJunctionsOnGenome_dupsFlagged,icgc/_EGAZ00001000237_81MK3ABXX_8_withJunctionsOnGenome_dupsFlagged
ABL2,12.5281,13.1017,12.7909,12.7493,12.4276,13.7405,12.7222,12.7274,11.8817,12.8849,...,12.2101,12.3156,11.9421,13.5259,11.6535,11.5598,11.3783,12.9246,11.147,11.92
ACKR3,11.376,10.214,10.9363,10.1263,10.1251,12.8547,10.9015,10.8702,9.32999,10.5003,...,6.98112,6.85113,9.49462,12.7392,6.45319,6.72364,6.09173,6.27627,7.69791,6.54832
ACSL3,12.9593,13.871,13.4312,14.0885,12.152,13.4753,12.4944,14.0798,16.2819,14.2923,...,12.136,12.1734,12.4163,11.6631,11.4909,12.7017,11.5323,11.8799,11.4678,11.2093
ACSL6,9.06482,6.33572,9.63731,6.73563,8.2234,9.74424,7.5038,7.09675,8.02462,4.80642,...,8.94622,6.07243,6.26945,4.03185,4.07634,8.57807,6.22102,3.98123,4.75369,5.67579
ACVR1,10.1955,10.67,10.6494,10.9388,10.7445,11.1561,10.5862,11.0295,10.3881,10.1676,...,10.9649,11.4338,10.7574,11.643,11.8029,11.3756,12.2705,12.3135,9.68646,10.9286
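The cell above builds a matrix with one row per gene and one column per sample. A small sketch of the same construction on made-up numbers, assuming (as the `index` and `columns` arguments above imply) that the values arrive as one list per gene; the sample IDs here are hypothetical:

```python
import numpy as np
import pandas as pd

genes = ["ABL2", "ACKR3"]
sample_ids = ["sample_a", "sample_b", "sample_c"]  # hypothetical IDs

# One inner list per gene, one value per sample (illustrative numbers)
values = [[12.5, 13.1, 12.8],
          [11.4, 10.2, 10.9]]

toy_expression = pd.DataFrame(values, index=genes, columns=sample_ids,
                              dtype=np.float32)
print(toy_expression)
```

With this layout, `toy_expression["sample_b"]` selects one sample across all genes and `toy_expression.loc["ACKR3"]` selects one gene across all samples.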


In [8]:
# Check whether any genes have zero expression in every sample
expression[(expression.T == 0).all()]

Unnamed: 0,icgc/_EGAR00001415737_RNA_PAIRED_ICGC_GBM15_tumor_SN935_0182_B_C2UKHACXXs_131105_1,icgc/_EGAR00001415749_RNA_PAIRED_ICGC_GBM24_tumor_SN935_0169_B_D2335ACXXs_130809_5,icgc/_EGAR00001415750_RNA_PAIRED_ICGC_GBM25_tumor_SN935_0169_B_D2335ACXXs_130809_6,icgc/_EGAR00001415751_RNA_PAIRED_ICGC_GBM26_tumor_SN935_0169_B_D2335ACXXs_130809_7,icgc/_EGAR00001415752_RNA_PAIRED_ICGC_GBM27_tumor_SN935_0182_B_C2UKHACXXs_131105_2,icgc/_EGAR00001415753_RNA_PAIRED_ICGC_GBM28_tumor_SN935_0182_B_C2UKHACXXs_131105_3,icgc/_EGAR00001415754_RNA_PAIRED_ICGC_GBM32_tumor_SN935_0182_B_C2UKHACXXs_131105_5,icgc/_EGAR00001415755_RNA_PAIRED_ICGC_GBM33_tumor_SN935_0182_B_C2UKHACXXs_131105_6,icgc/_EGAR00001415756_RNA_PAIRED_ICGC_GBM34_tumor_SN935_0182_B_C2UKHACXXs_131105_7,icgc/_EGAR00001415757_RNA_PAIRED_ICGC_GBM36_tumor_SN935_0182_B_C2UKHACXXs_131105_8,...,icgc/_EGAZ00001000218_81MAGABXX_3_withJunctionsOnGenome_dupsFlagged,icgc/_EGAZ00001000219_81MAGABXX_4_withJunctionsOnGenome_dupsFlagged,icgc/_EGAZ00001000221_81MAGABXX_6_withJunctionsOnGenome_dupsFlagged,icgc/_EGAZ00001000223_81MAGABXX_8_withJunctionsOnGenome_dupsFlagged,icgc/_EGAZ00001000226_81MAGABXX_1_withJunctionsOnGenome_dupsFlagged,icgc/_EGAZ00001000227_81MAGABXX_2_withJunctionsOnGenome_dupsFlagged,icgc/_EGAZ00001000232_81MK3ABXX_3_withJunctionsOnGenome_dupsFlagged,icgc/_EGAZ00001000233_81MK3ABXX_4_withJunctionsOnGenome_dupsFlagged,icgc/_EGAZ00001000236_81MK3ABXX_7_withJunctionsOnGenome_dupsFlagged,icgc/_EGAZ00001000237_81MK3ABXX_8_withJunctionsOnGenome_dupsFlagged
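An empty result here means no gene is zero across the whole cohort. The check can also be written without the transpose by reducing along the sample axis; a sketch on toy data (gene and sample names are invented):

```python
import pandas as pd

toy = pd.DataFrame([[0.0, 0.0, 0.0],
                    [0.0, 7.1, 0.0],
                    [9.9, 10.4, 11.3]],
                   index=["GENE_A", "GENE_B", "GENE_C"],
                   columns=["s1", "s2", "s3"])

# True for a gene only if its value is zero in every sample (axis=1 = across columns)
all_zero_mask = (toy == 0).all(axis=1)
unexpressed = toy[all_zero_mask]
print(list(unexpressed.index))
```

Only `GENE_A` is selected: `GENE_B` is zero in two samples but not in all of them.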


In [10]:
# Summary statistics per sample (each column summarizes the 417 genes of one sample)
expression.describe()

Unnamed: 0,icgc/_EGAR00001415737_RNA_PAIRED_ICGC_GBM15_tumor_SN935_0182_B_C2UKHACXXs_131105_1,icgc/_EGAR00001415749_RNA_PAIRED_ICGC_GBM24_tumor_SN935_0169_B_D2335ACXXs_130809_5,icgc/_EGAR00001415750_RNA_PAIRED_ICGC_GBM25_tumor_SN935_0169_B_D2335ACXXs_130809_6,icgc/_EGAR00001415751_RNA_PAIRED_ICGC_GBM26_tumor_SN935_0169_B_D2335ACXXs_130809_7,icgc/_EGAR00001415752_RNA_PAIRED_ICGC_GBM27_tumor_SN935_0182_B_C2UKHACXXs_131105_2,icgc/_EGAR00001415753_RNA_PAIRED_ICGC_GBM28_tumor_SN935_0182_B_C2UKHACXXs_131105_3,icgc/_EGAR00001415754_RNA_PAIRED_ICGC_GBM32_tumor_SN935_0182_B_C2UKHACXXs_131105_5,icgc/_EGAR00001415755_RNA_PAIRED_ICGC_GBM33_tumor_SN935_0182_B_C2UKHACXXs_131105_6,icgc/_EGAR00001415756_RNA_PAIRED_ICGC_GBM34_tumor_SN935_0182_B_C2UKHACXXs_131105_7,icgc/_EGAR00001415757_RNA_PAIRED_ICGC_GBM36_tumor_SN935_0182_B_C2UKHACXXs_131105_8,...,icgc/_EGAZ00001000218_81MAGABXX_3_withJunctionsOnGenome_dupsFlagged,icgc/_EGAZ00001000219_81MAGABXX_4_withJunctionsOnGenome_dupsFlagged,icgc/_EGAZ00001000221_81MAGABXX_6_withJunctionsOnGenome_dupsFlagged,icgc/_EGAZ00001000223_81MAGABXX_8_withJunctionsOnGenome_dupsFlagged,icgc/_EGAZ00001000226_81MAGABXX_1_withJunctionsOnGenome_dupsFlagged,icgc/_EGAZ00001000227_81MAGABXX_2_withJunctionsOnGenome_dupsFlagged,icgc/_EGAZ00001000232_81MK3ABXX_3_withJunctionsOnGenome_dupsFlagged,icgc/_EGAZ00001000233_81MK3ABXX_4_withJunctionsOnGenome_dupsFlagged,icgc/_EGAZ00001000236_81MK3ABXX_7_withJunctionsOnGenome_dupsFlagged,icgc/_EGAZ00001000237_81MK3ABXX_8_withJunctionsOnGenome_dupsFlagged
count,417.0,417.0,417.0,417.0,417.0,417.0,417.0,417.0,417.0,417.0,...,417.0,417.0,417.0,417.0,417.0,417.0,417.0,417.0,417.0,417.0
mean,10.182821,10.027308,9.862326,9.939827,9.829427,11.430959,10.370827,10.001285,9.799645,9.787988,...,8.856701,8.808331,8.675457,9.121507,9.101644,9.047162,8.465183,8.703826,9.001608,8.950782
std,3.394402,3.840125,3.971373,3.930997,3.773202,3.146525,3.309657,3.91685,3.982865,3.993498,...,3.846623,3.789008,3.745486,3.600406,3.774234,3.664901,3.808564,3.896501,3.385404,3.809129
min,0.0,0.0,0.0,0.0,0.0,2.86924,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,8.34913,8.52605,8.18834,8.30256,8.14129,9.07462,8.76842,8.43369,7.99503,7.84391,...,6.44966,6.45898,6.43239,6.96906,6.64378,6.84121,5.81161,6.27627,6.97377,6.54832
50%,10.9177,11.1197,10.8977,10.917,10.8551,12.0638,11.148,11.0927,10.7335,10.7525,...,9.76894,9.75288,9.49113,9.85069,10.3481,9.70021,9.21378,9.48678,9.79024,9.85823
75%,12.6284,12.8539,12.7305,12.8122,12.4623,13.9071,12.7356,12.7274,12.5589,12.7711,...,11.9117,11.8331,11.6463,11.7868,11.9015,11.8978,11.4982,11.7693,11.4678,11.7988
max,16.773199,16.0912,16.079901,17.346701,15.5333,17.8955,16.1117,16.346001,16.9088,17.239599,...,15.2757,16.6185,15.6422,16.379601,16.300501,15.791,15.6423,17.018499,16.1054,16.2717


In [13]:
# Summary statistics per gene (transpose so describe() summarizes each gene across samples)
expression.T.describe().T.head()

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
ABL2,10818.0,10.559373,0.894061,0.0,9.9499,10.52025,11.109875,14.5062
ACKR3,10818.0,9.568782,1.875077,0.0,8.4608,9.5802,10.76445,15.5799
ACSL3,10818.0,11.750946,0.98394,0.0,11.126225,11.6835,12.320125,16.2819
ACSL6,10818.0,0.865434,1.488314,0.0,0.0,0.0,1.5876,11.0797
ACVR1,10818.0,10.384686,0.964976,0.0,9.895025,10.4458,10.95345,16.419399
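When only one percentile per gene is needed, `DataFrame.quantile` with `axis=1` avoids computing the full `describe()` table. A sketch on toy data suggesting that both routes give the same 75th-percentile cutoffs (the data are invented):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame([[1.0, 2.0, 3.0, 4.0, 5.0],
                    [10.0, 20.0, 30.0, 40.0, 50.0]],
                   index=["GENE_A", "GENE_B"],
                   columns=["s1", "s2", "s3", "s4", "s5"])

# Route used in this notebook: transpose, describe, transpose back
via_describe = toy.T.describe().T["75%"]

# Direct route: 75th percentile of each row (gene) across its columns (samples)
via_quantile = toy.quantile(0.75, axis=1)

same = np.allclose(via_describe.values, via_quantile.values)
print("cutoffs agree: %s" % same)
```

On a matrix of 417 genes by 10818 samples the direct route does less work, since `describe()` also computes the count, mean, standard deviation and the other percentiles.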


In [18]:
# Find all the genes in a single sample (N of 1) expressed above the cohort's 75th percentile
sample = expression["ckcc/TH03_0016_S01"]
cutoff = expression.T.describe().T["75%"]
up = sample[(sample > cutoff)]
print "Outliers:", len(up)
up.head()

Outliers: 136


ABL2     11.47810
ACSL3    12.49850
ACSL6     7.11436
ACVR1    11.31170
AFF3     11.28830
Name: ckcc/TH03_0016_S01, dtype: float32
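The last step can be wrapped in a small reusable function that also reports how far each gene sits above its cutoff. This is a sketch, not part of the Xena library: the function name `genes_above_quantile`, its parameters and the toy data are all hypothetical.

```python
import pandas as pd


def genes_above_quantile(expression, sample_id, q=0.75):
    """Return genes whose value in one sample exceeds the cohort's q-th quantile.

    expression: DataFrame with genes as rows and samples as columns.
    """
    cutoff = expression.quantile(q, axis=1)   # one cutoff per gene
    sample = expression[sample_id]            # one value per gene
    above = sample > cutoff
    result = pd.DataFrame({"value": sample[above], "cutoff": cutoff[above]})
    result["excess"] = result["value"] - result["cutoff"]
    return result.sort_values("excess", ascending=False)


toy = pd.DataFrame([[1.0, 2.0, 3.0, 4.0, 9.0],
                    [5.0, 5.0, 5.0, 5.0, 5.0],
                    [3.0, 6.0, 4.0, 2.0, 7.0]],
                   index=["GENE_A", "GENE_B", "GENE_C"],
                   columns=["s1", "s2", "s3", "s4", "n_of_1"])
hits = genes_above_quantile(toy, "n_of_1")
print(hits)
```

Note that by definition about a quarter of the cohort lies above the 75th percentile for any gene, so roughly 104 of the 417 genes would be expected to pass this cutoff for a typical sample. Raising `q` (for example to 0.95) gives a stricter notion of outlier.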