# Ingest Gene Expression and Clinical Data from TCGA+TARGET+GTEX and Treehouse

Download gene expression and clinical data from the [UCSC Xena Toil re-compute dataset](https://xenabrowser.net/datapages/?host=https://toil.xenahubs.net) and the [Treehouse Childhood Cancer Initiative](https://xenabrowser.net/datapages/?host=https://treehouse.xenahubs.net), wrangle them, and store them in an hdf5 file for quick loading into machine learning notebooks. The Toil dataset comprises gene expression data for roughly twenty thousand tumor and normal samples, all processed with the same genomics pipeline so that they can be compared with one another. Treehouse contains many of the same TCGA and TARGET samples as Toil, which we can use to verify our conversion. It also includes unique samples (all prefixed with TH or TR), which we can use as a hold-out set.

Each source data set consists of a float vector of gene expression values for each of ~60k genes per sample, normalized as log2(TPM+0.001) for TCGA+TARGET+GTEX or log2(TPM+1.0) for Treehouse. Toil expression is labeled with Ensembl gene ids, whereas Treehouse uses Hugo gene names. Associated with these data is clinical information on each sample, such as type (tumor vs. normal), disease, and primary site (where in the human body the sample came from). We use this information to label the samples normal/0 vs. tumor/1 and to provide additional context for visualizing and interpreting models.
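
As a small worked example of the two normalizations, the sketch below converts a value from the log2(TPM+1) scale to the log2(TPM+0.001) scale. The function names are illustrative and are not part of either dataset's tooling.

```python
import math

def log2_tpm(tpm, pseudocount):
    """Return log2(tpm + pseudocount)."""
    return math.log2(tpm + pseudocount)

def rescale(value, from_pseudocount, to_pseudocount):
    """Convert log2(TPM + from_pseudocount) into log2(TPM + to_pseudocount)."""
    tpm = 2.0 ** value - from_pseudocount
    return math.log2(tpm + to_pseudocount)

# A gene with zero expression is 0.0 on the log2(TPM+1) scale
# and log2(0.001) on the log2(TPM+0.001) scale.
zero_toil = rescale(0.0, from_pseudocount=1.0, to_pseudocount=0.001)
print(round(zero_toil, 4))  # -9.9658
```

The value -9.9658 is the floor that shows up for unexpressed genes in the TCGA+TARGET+GTEX tables.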

In [44]:
import os
import requests
import numpy as np
import pandas as pd
import h5py

if not os.path.exists("data"):
    os.makedirs("data")

## Download TCGA+TARGET+GTEX Expression
Download the expression data files from Xena and save them in an hdf5 file. Each file can take around 30 minutes between the download and the conversion from tsv into a float32 dataframe. We download manually rather than passing a url directly to read_csv, as the latter times out with files of this size.

In [52]:
%%time
if not os.path.exists("data/TcgaTargetGtex_rsem_gene_tpm.gz"):
    print("Downloading TCGA, TARGET and GTEX expression data from UCSC Xena")
    r = requests.get("https://toil.xenahubs.net/download/TcgaTargetGtex_rsem_gene_tpm.gz", stream=True)
    r.raise_for_status()
    with open("data/TcgaTargetGtex_rsem_gene_tpm.gz", "wb") as f:
        for chunk in r.iter_content(chunk_size=32768):
            f.write(chunk)

if not os.path.exists("data/TcgaTargetGtex_rsem_gene_tpm.hd5"):
    print("Converting expression to dataframe and storing in hdf5 file")
    pd.read_csv("data/TcgaTargetGtex_rsem_gene_tpm.gz", sep="\t", index_col=0) \
        .astype(np.float32).to_hdf("data/TcgaTargetGtex_rsem_gene_tpm.hd5", "expression", mode="w", format="fixed")

tcga_target_gtex_expression = pd.read_hdf(
    "data/TcgaTargetGtex_rsem_gene_tpm.hd5", "expression").T.dropna(axis="index").sort_index(axis="columns")
print("tcga_target_gtex_expression: samples={} genes={}".format(*tcga_target_gtex_expression.shape))

tcga_target_gtex_expression: samples=19260 genes=60498
CPU times: user 6.74 s, sys: 7.26 s, total: 14 s
Wall time: 14 s
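
Both expression downloads in this notebook follow the same pattern: skip the transfer if the file is already on disk, otherwise stream it to disk in chunks. A minimal sketch of a reusable helper is shown here. `download_if_missing` is a hypothetical name, and the sketch uses the standard library's `urllib` instead of `requests` so that it has no third-party dependencies.

```python
import os
import shutil
import urllib.request

def download_if_missing(url, path, chunk_size=32768):
    """Stream url to path unless path already exists; return path."""
    if os.path.exists(path):
        return path
    os.makedirs(os.path.dirname(path) or ".", exist_ok=True)
    partial = path + ".part"
    # Write to a temporary name first so an interrupted download
    # is never mistaken for a complete file on the next run.
    with urllib.request.urlopen(url) as response, open(partial, "wb") as f:
        shutil.copyfileobj(response, f, chunk_size)
    os.replace(partial, path)
    return path
```

It could be called as `download_if_missing("https://toil.xenahubs.net/download/TcgaTargetGtex_rsem_gene_tpm.gz", "data/TcgaTargetGtex_rsem_gene_tpm.gz")`.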


In [53]:
tcga_target_gtex_expression.head()

sample,ENSG00000000003.14,ENSG00000000005.5,ENSG00000000419.12,ENSG00000000457.13,ENSG00000000460.16,ENSG00000000938.12,ENSG00000000971.15,ENSG00000001036.13,ENSG00000001084.10,ENSG00000001167.14,...,ENSGR0000263980.5,ENSGR0000264510.5,ENSGR0000264819.5,ENSGR0000265658.5,ENSGR0000270726.5,ENSGR0000275287.4,ENSGR0000276543.4,ENSGR0000277120.4,ENSGR0000280767.2,ENSGR0000281849.2
GTEX-UTHO-1226-SM-3GAEE,3.0056,-9.9658,4.5098,1.7403,1.334,4.1883,8.9373,4.0994,2.9581,3.2526,...,-9.9658,-9.9658,-9.9658,-9.9658,-9.9658,-9.9658,-9.9658,-9.9658,-9.9658,-9.9658
GTEX-146FH-1726-SM-5QGQ2,5.9729,-9.9658,5.0922,3.1572,2.3308,2.0569,4.5632,3.5742,4.4344,3.3633,...,-9.9658,-9.9658,-9.9658,-9.9658,-9.9658,-9.9658,-9.9658,-9.9658,-9.9658,-9.9658
GTEX-QDT8-0126-SM-48TZ1,3.6939,2.4675,4.981,2.7972,1.4808,3.9792,6.8849,4.2419,3.6417,3.4411,...,-9.9658,-9.9658,-9.9658,-9.9658,-9.9658,-9.9658,-9.9658,-9.9658,-9.9658,-9.9658
GTEX-QCQG-1326-SM-48U24,4.9594,-3.458,5.2223,2.7357,1.8564,3.7061,2.0218,5.1559,4.3449,3.5535,...,-9.9658,-9.9658,-9.9658,-9.9658,-9.9658,-9.9658,-9.9658,-9.9658,-9.9658,-9.9658
GTEX-WZTO-2926-SM-3NM9I,2.876,-2.4659,4.7022,1.9034,0.044,1.6558,3.1733,3.1062,3.6657,2.8482,...,-9.9658,-9.9658,-9.9658,-9.9658,-9.9658,-9.9658,-9.9658,-9.9658,-9.9658,-9.9658


## Convert Ensembl to Hugo

Toil's expression values are per Ensembl gene id, and several Ensembl ids can map to the same Hugo gene name. To combine them we convert back into TPM, average (or add?), and then convert back to log2(TPM+0.001). We're using an assembled table from John Vivian @ UCSC here. Another option would be ftp://ftp.ebi.ac.uk/pub/databases/genenames/new/tsv/hgnc_complete_set.txt
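
To make the conversion concrete, here is a toy sketch with made-up numbers: two hypothetical Ensembl ids map to the same Hugo name `GENE_A`, so their values are moved back to TPM, averaged, and re-normalized. The ids, names and values are invented for illustration, and averaging is only one of the two options mentioned above.

```python
import numpy as np
import pandas as pd

# Toy log2(TPM + 0.001) values; ids, names and numbers are invented.
log_expr = pd.DataFrame(
    {"sample_1": np.log2(np.array([4.0, 12.0, 5.0]) + 0.001)},
    index=["ENSG_X.1", "ENSG_Y.1", "ENSG_Z.1"],
)
ensembl_to_hugo = pd.Series(
    ["GENE_A", "GENE_A", "GENE_B"], index=log_expr.index
)

# Back to TPM, relabel rows by Hugo name, average duplicates, re-normalize.
tpm = np.exp2(log_expr) - 0.001
tpm.index = ensembl_to_hugo.reindex(tpm.index).values
hugo_log_expr = np.log2(tpm.groupby(level=0).mean() + 0.001)

print(np.exp2(hugo_log_expr) - 0.001)  # GENE_A is back to about 8 TPM
```

Replacing `.mean()` with `.sum()` would give the additive variant instead.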

In [54]:
if not os.path.exists("data/ensembl_to_hugo.tsv"):
    with open("data/ensembl_to_hugo.tsv", "wb") as f:
        f.write(requests.get("https://github.com/jvivian/docker_tools/blob/master/gencode_hugo_mapping/attrs.tsv?raw=true").content)
ensemble_to_hugo = pd.read_table("data/ensembl_to_hugo.tsv",index_col=0).sort_index(axis="index")

# Remove duplicates
ensemble_to_hugo = ensemble_to_hugo[~ensemble_to_hugo.index.duplicated(keep='first')]
ensemble_to_hugo.head()

geneId,geneName,geneType,geneStatus,transcriptId,transcriptName,transcriptType,transcriptStatus,havanaGeneId,havanaTranscriptId,ccdsId,level,transcriptClass
ENSG00000000003.14,TSPAN6,protein_coding,KNOWN,ENST00000612152.4,TSPAN6-201,protein_coding,KNOWN,OTTHUMG00000022002.1,,CCDS76001.1,3,coding
ENSG00000000005.5,TNMD,protein_coding,KNOWN,ENST00000373031.4,TNMD-001,protein_coding,KNOWN,OTTHUMG00000022001.1,OTTHUMT00000057481.1,CCDS14469.1,2,coding
ENSG00000000419.12,DPM1,protein_coding,KNOWN,ENST00000371582.8,DPM1-005,protein_coding,KNOWN,OTTHUMG00000032742.2,OTTHUMT00000079720.2,,2,coding
ENSG00000000457.13,SCYL3,protein_coding,KNOWN,ENST00000470238.1,SCYL3-004,processed_transcript,KNOWN,OTTHUMG00000035941.4,OTTHUMT00000087552.1,,2,nonCoding
ENSG00000000460.16,C1orf112,protein_coding,KNOWN,ENST00000466580.6,C1orf112-008,processed_transcript,KNOWN,OTTHUMG00000035821.7,OTTHUMT00000087524.1,,2,nonCoding


In [5]:
# Create a new data frame with genes as rows, replacing the Ensembl-based index with Hugo names
# and dropping any gene that has no conversion
tcga_target_gtex_expression_hugo = tcga_target_gtex_expression.T.copy()
tcga_target_gtex_expression_hugo.index = ensemble_to_hugo.reindex(tcga_target_gtex_expression_hugo.index).geneName.values
tcga_target_gtex_expression_hugo = tcga_target_gtex_expression_hugo[tcga_target_gtex_expression_hugo.index.notnull()]
tcga_target_gtex_expression_hugo.head()

gene,GTEX-UTHO-1226-SM-3GAEE,GTEX-146FH-1726-SM-5QGQ2,GTEX-QDT8-0126-SM-48TZ1,GTEX-QCQG-1326-SM-48U24,GTEX-WZTO-2926-SM-3NM9I,GTEX-12WSB-0126-SM-59HJN,GTEX-11VI4-0626-SM-5EQLO,GTEX-T5JC-0526-SM-32PM7,GTEX-RU1J-0426-SM-46MUK,GTEX-1212Z-0226-SM-59HLF,...,TCGA-AB-2965-03,TCGA-AB-2936-03,TCGA-AB-2839-03,TCGA-AB-2879-03,TCGA-AB-2886-03,TCGA-AB-2901-03,TCGA-AB-2862-03,TCGA-AB-2956-03,TCGA-AB-2987-03,TCGA-AB-2868-03
TSPAN6,3.0056,5.9729,3.6939,4.9594,2.876,3.836,3.1028,4.2571,2.3816,4.5838,...,-1.2828,-3.3076,-3.1714,-0.8084,-9.9658,-0.1828,-4.2934,-4.6082,-1.2481,-0.9132
TNMD,-9.9658,-9.9658,2.4675,-3.458,-2.4659,0.6608,-2.8262,3.5073,-2.4659,-2.9324,...,-9.9658,-9.9658,-9.9658,-9.9658,-9.9658,-9.9658,-9.9658,-9.9658,-3.458,-9.9658
DPM1,4.5098,5.0922,4.981,5.2223,4.7022,5.225,4.7814,4.6206,5.1322,4.0471,...,5.251,4.6566,4.7603,4.4156,4.8719,4.8625,4.5361,5.3345,5.1223,4.1285
SCYL3,1.7403,3.1572,2.7972,2.7357,1.9034,3.5851,2.2144,1.7912,2.4753,1.7912,...,2.667,3.9147,3.3435,3.5535,3.5249,3.506,2.1509,2.9929,2.5011,2.9507
C1orf112,1.334,2.3308,1.4808,1.8564,0.044,2.1247,0.6608,1.0363,1.3109,0.1776,...,3.1939,4.3896,3.5174,3.7105,4.8954,4.5681,2.6918,4.0304,2.4985,3.1491


In [6]:
# While we're at it let's verify that the sum of all expression levels for a sample in TPM space sums to 1 million
tcga_target_gtex_expression_hugo[["GTEX-146FH-1726-SM-5QGQ2", "GTEX-WZTO-2926-SM-3NM9I", "TCGA-AB-2965-03"]].apply(np.exp2).apply(lambda x: x - 0.001).sum()

GTEX-146FH-1726-SM-5QGQ2    1.000001e+06
GTEX-WZTO-2926-SM-3NM9I     9.999974e+05
TCGA-AB-2965-03             9.999970e+05
dtype: float64

In [12]:
%%time
# Multiple Ensembl genes map to the same Hugo name. Each of these values has been normalized via log2(TPM+0.001)
# so we convert back into TPM, compute the mean, and re-normalize.
tcga_target_gtex_expression_hugo_tpm = tcga_target_gtex_expression_hugo \
    .apply(np.exp2).subtract(0.001).groupby(level=0).aggregate(np.mean).add(0.001).apply(np.log2)

CPU times: user 1min 35s, sys: 22.3 s, total: 1min 57s
Wall time: 1min 48s


In [15]:
tcga_target_gtex_expression_hugo_tpm.head()

gene,GTEX-UTHO-1226-SM-3GAEE,GTEX-146FH-1726-SM-5QGQ2,GTEX-QDT8-0126-SM-48TZ1,GTEX-QCQG-1326-SM-48U24,GTEX-WZTO-2926-SM-3NM9I,GTEX-12WSB-0126-SM-59HJN,GTEX-11VI4-0626-SM-5EQLO,GTEX-T5JC-0526-SM-32PM7,GTEX-RU1J-0426-SM-46MUK,GTEX-1212Z-0226-SM-59HLF,...,TCGA-AB-2965-03,TCGA-AB-2936-03,TCGA-AB-2839-03,TCGA-AB-2879-03,TCGA-AB-2886-03,TCGA-AB-2901-03,TCGA-AB-2862-03,TCGA-AB-2956-03,TCGA-AB-2987-03,TCGA-AB-2868-03
5S_rRNA,-9.9658,-9.9658,-9.9658,-9.9658,-9.9658,-9.9658,-9.9658,-9.9658,-9.9658,-9.9658,...,-9.9658,-9.9658,-9.9658,0.531436,0.506253,-9.9658,-9.9658,-9.9658,-9.9658,-9.9658
5_8S_rRNA,-9.9658,-9.9658,-9.9658,-9.9658,-9.9658,-9.9658,-9.9658,-9.9658,-9.9658,-9.9658,...,-9.9658,-9.9658,-9.9658,-9.9658,-9.9658,-9.9658,-9.9658,-9.9658,-9.9658,-9.9658
7SK,-9.9658,-9.9658,-9.9658,-9.9658,-9.9658,-4.158432,-9.9658,-9.9658,-9.9658,-9.9658,...,-4.158432,-3.128688,-1.994226,-2.759535,-9.9658,-2.914171,-9.9658,-9.9658,-3.749798,-2.950835
A1BG,3.47,3.2435,4.2921,3.5424,4.1852,3.5621,2.6395,2.3649,2.7029,11.1325,...,0.9115,2.8819,2.6895,1.4547,3.5608,2.6647,4.5367,1.7141,0.9716,2.409
A1BG-AS1,0.044,1.2992,2.0535,1.5165,3.3336,1.5902,1.0711,-0.2845,0.8961,1.4808,...,2.7826,3.4451,3.3003,2.6738,4.4384,3.6737,4.1392,2.3981,1.5465,3.4156


## Download Treehouse Expression

The Treehouse public compendium is in Hugo log2(tpm+1). We need to download it and convert it into log2(tpm+0.001) to match our TCGA+TARGET+GTEX dataset above.

In [16]:
%%time
if not os.path.exists("data/treehouse_public_samples_unique_hugo_log2_tpm_plus_1.2017-09-11.tsv.gz"):
    print("Downloading Treehouse Public Compendium")
    r = requests.get("https://treehouse.xenahubs.net/download/treehouse_public_samples_unique_hugo_log2_tpm_plus_1.2017-09-11.tsv.gz",
                     stream=True)
    r.raise_for_status()
    with open("data/treehouse_public_samples_unique_hugo_log2_tpm_plus_1.2017-09-11.tsv.gz", "wb") as f:
        for chunk in r.iter_content(chunk_size=32768):
            f.write(chunk)

if not os.path.exists("data/treehouse_public_samples_unique_hugo_log2_tpm_plus_1.2017-09-11.hd5"):
    print("Converting expression to dataframe and storing in hdf5 file")
    expression = pd.read_csv("data/treehouse_public_samples_unique_hugo_log2_tpm_plus_1.2017-09-11.tsv.gz", 
                             sep="\t", index_col=0).dropna(axis="index").astype(np.float32).sort_index(axis="index")
    expression.to_hdf("data/treehouse_public_samples_unique_hugo_log2_tpm_plus_1.2017-09-11.hd5", "expression", mode="w", format="fixed")

treehouse_expression = pd.read_hdf("data/treehouse_public_samples_unique_hugo_log2_tpm_plus_1.2017-09-11.hd5", "expression")
print("treehouse_expression: genes={} samples={}".format(*treehouse_expression.shape))

treehouse_expression: genes=58581 samples=11078
CPU times: user 101 ms, sys: 1.14 s, total: 1.24 s
Wall time: 1.24 s


In [20]:
# Check that we don't have any null/nan at this point
assert not tcga_target_gtex_expression_hugo_tpm.isnull().values.any()
assert not treehouse_expression.isnull().values.any()

# Make sure they have identical hugo gene indexes
assert np.array_equal(tcga_target_gtex_expression_hugo_tpm.index, treehouse_expression.index)

In [23]:
# Convert into log2(tpm+0.001)
treehouse_expression_hugo_tpm = treehouse_expression.apply(np.exp2).subtract(1.0).add(0.001).apply(np.log2)

## NOTE and REMINDER

The current public Treehouse compendium was created by combining expression values that map to the same Hugo gene identifier by calculating the mean of their log2(tpm+1) values. As a result, those values will not match perfectly with the same samples in the TCGA+TARGET+GTEX dataset, where the mean is taken in TPM space. The next public compendium from Treehouse will calculate the mean in TPM space. We continue here, but later we need to come back and update this, or calculate things the right way for the TH and TR samples from the raw data.
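
The size of this mismatch can be seen with a two-value example. The sketch below uses made-up TPM values for two ids that share a Hugo name and compares averaging in log2(TPM+1) space with averaging in TPM space.

```python
import math

tpm_values = [1.0, 63.0]  # invented TPMs for two ids sharing one Hugo name

# Mean taken in log2(TPM + 1) space, as in the current Treehouse compendium.
mean_of_logs = sum(math.log2(v + 1.0) for v in tpm_values) / len(tpm_values)

# Mean taken in TPM space, then log-transformed.
log_of_mean = math.log2(sum(tpm_values) / len(tpm_values) + 1.0)

print(mean_of_logs)  # 3.5
print(log_of_mean)   # log2(33), about 5.04
```

Because the logarithm is concave, the log-space mean is never larger than the TPM-space mean, and the gap grows with the spread between the combined values.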

In [25]:
# Check to verify the TCGA+TARGET samples in the Treehouse compendium match TPM wise with our conversions above
sample_id = "TCGA-ZQ-A9CR-01"

np.allclose(tcga_target_gtex_expression_hugo_tpm[sample_id], treehouse_expression_hugo_tpm[sample_id], rtol=1, atol=1)

argmax = (tcga_target_gtex_expression_hugo_tpm[sample_id] - treehouse_expression_hugo_tpm[sample_id]).values.argmax()
gene = tcga_target_gtex_expression_hugo_tpm.index[argmax]
print("Gene with maximum delta:", gene,
      tcga_target_gtex_expression_hugo_tpm[sample_id][gene] - treehouse_expression_hugo_tpm[sample_id][gene])

(tcga_target_gtex_expression_hugo_tpm[sample_id] - treehouse_expression_hugo_tpm[sample_id]).describe()

Gene with maximum delta: Metazoa_SRP 6.039602


count    58581.000000
mean         0.001218
std          0.060920
min         -0.013058
25%         -0.000107
50%         -0.000016
75%         -0.000016
max          6.039602
Name: TCGA-ZQ-A9CR-01, dtype: float64

## Download and Normalize Labels

In [35]:
# Read in the sample labels from Xena, i.e. clinical/phenotype information on each sample
if not os.path.exists("data/TcgaTargetGTEX_phenotype.txt.gz"):
    with open("data/TcgaTargetGTEX_phenotype.txt.gz", "wb") as f:
        f.write(requests.get("https://toil.xenahubs.net/download/TcgaTargetGTEX_phenotype.txt.gz").content)

tcga_target_gtex_labels = pd.read_table(
    "data/TcgaTargetGTEX_phenotype.txt.gz", compression="gzip", 
    header=0, names=["id", "category", "disease", "primary_site", "sample_type", "gender", "study"],
    sep="\t", encoding="ISO-8859-1", index_col=0, dtype="str").sort_index(axis="index")


# Compute and add a tumor/normal column - TCGA and TARGET have some normal samples, GTEX is all normal.
tcga_target_gtex_labels["tumor_normal"] = tcga_target_gtex_labels.apply(
    lambda row: "Normal" if row["sample_type"] in ["Cell Line", "Normal Tissue", "Solid Tissue Normal"]
    else "Tumor", axis=1)
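
The row-wise `apply` above can also be written as a vectorized membership test. This is a sketch on a tiny invented frame, assuming the same three sample types are treated as normal; `NORMAL_SAMPLE_TYPES` and the sample ids are names introduced here for illustration.

```python
import numpy as np
import pandas as pd

NORMAL_SAMPLE_TYPES = ["Cell Line", "Normal Tissue", "Solid Tissue Normal"]

# Invented sample ids; the sample_type values appear in the phenotype table.
labels = pd.DataFrame(
    {"sample_type": ["Normal Tissue", "Primary Tumor", "Solid Tissue Normal"]},
    index=["sample_a", "sample_b", "sample_c"],
)

# isin() tests every row at once; np.where picks the label per row.
labels["tumor_normal"] = np.where(
    labels["sample_type"].isin(NORMAL_SAMPLE_TYPES), "Normal", "Tumor"
)
print(labels["tumor_normal"].tolist())  # ['Normal', 'Tumor', 'Normal']
```

As with the `apply` version, a missing `sample_type` falls through to "Tumor", so null values are worth checking separately.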

In [43]:
# assert np.array_equal(tcga_target_gtex_expression_hugo_tpm.index, tcga_target_gtex_labels.index)
print(tcga_target_gtex_expression_hugo_tpm.shape)
print(tcga_target_gtex_labels.shape)


(58581, 19260)
(19131, 7)


In [39]:
tcga_target_gtex_labels[0:20000:4000].head()

id,category,disease,primary_site,sample_type,gender,study,tumor_normal
GTEX-1117F-0226-SM-5GZZ7,Adipose - Subcutaneous,Adipose - Subcutaneous,Adipose Tissue,Normal Tissue,Female,GTEX,Normal
GTEX-POYW-1226-SM-5LZWQ,Lung,Lung,Lung,Normal Tissue,Male,GTEX,Normal
TARGET-10-PARCHB-03,Acute Lymphoblastic Leukemia,Acute Lymphoblastic Leukemia,White blood cell,Primary Blood Derived Cancer - Peripheral Blood,Male,TARGET,Tumor
TCGA-BF-A5ER-01,Skin Cutaneous Melanoma,Skin Cutaneous Melanoma,Skin,Primary Tumor,Male,TCGA,Tumor
TCGA-FY-A4B4-01,Thyroid Carcinoma,Thyroid Carcinoma,Thyroid Gland,Primary Tumor,Female,TCGA,Tumor


In [37]:
tcga_target_gtex_labels.describe()

statistic,category,disease,primary_site,sample_type,gender,study,tumor_normal
count,19130,19130,19126,19131,18972,19131,19131
unique,93,93,46,17,2,3,2
top,Breast Invasive Carcinoma,Breast Invasive Carcinoma,Brain,Primary Tumor,Male,TCGA,Tumor
freq,1212,1212,1846,9185,10456,10535,10531


In [None]:
# Use the tissue location as the class label for the purposes of stratification
class_attribute = "primary_site"

# Tumor vs. Normal is the binary attribute we'll use to train on
label_attribute = "tumor_normal"

In [32]:
# Sort both axes and write out to hdf5 for fast loading into training notebooks (genes are still rows here; transpose after loading for rows as samples)
tcga_target_gtex_expression_hugo_tpm_sorted = tcga_target_gtex_expression_hugo_tpm.sort_index(axis="index").sort_index(axis="columns")
tcga_target_gtex_expression_hugo_tpm_sorted.to_hdf("data/tcga_target_gtex_expression_hugo_tpm.hd5", "expression", mode="w", format="fixed")

In [33]:
# Sort both axes and write out to hdf5 for fast loading into training notebooks (genes are still rows here; transpose after loading for rows as samples)
treehouse_expression_hugo_tpm_sorted = treehouse_expression_hugo_tpm.sort_index(axis="index").sort_index(axis="columns")
treehouse_expression_hugo_tpm_sorted.to_hdf("data/treehouse_expression_hugo_tpm.hd5", "expression", mode="w", format="fixed")

In [None]:
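# Assumption: X and Y are not defined earlier in this notebook. Here X is taken to be the
# TCGA+TARGET+GTEX expression with samples as rows, and Y the label table.
X = tcga_target_gtex_expression_hugo_tpm.T.sort_index(axis="index")
Y = tcga_target_gtex_labels

# Remove rows where the class is null or the sample is missing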
Y_not_null = Y[pd.notnull(Y[class_attribute])]
intersection = X.index.intersection(Y_not_null.index)
X_clean = X[X.index.isin(intersection)]
Y_clean = Y[Y.index.isin(intersection)]

# Make sure the label and example samples are in the same order
assert(X_clean.index.equals(Y_clean.index))

print(intersection.shape[0], "samples with non-null labels")

In [None]:
# Convert tumor/normal labels to binary 1/0
from sklearn.preprocessing import LabelEncoder
encoder = LabelEncoder()
y_binary = encoder.fit_transform(Y_clean["tumor_normal"])

In [None]:
# Convert classes into numbers
from sklearn.preprocessing import LabelEncoder
encoder = LabelEncoder()
encoder.fit(Y_clean[class_attribute].values)
classes = encoder.transform(Y_clean[class_attribute])
print("Total classes for stratification:", len(set(classes)))
class_labels = encoder.classes_

In [None]:
%%time
# Split into stratified training and test sets based on classes (i.e. tissue type) so that we have equal
# proportions of each tissue type in the train and test sets
from sklearn.model_selection import StratifiedShuffleSplit
split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
for train_index, test_index in split.split(X_clean.values, Y_clean[class_attribute]):
    # Index into the cleaned frames; the split indices refer to X_clean/Y_clean, not X
    X_train, X_test = X_clean.values[train_index], X_clean.values[test_index]
    y_train, y_test = y_binary[train_index], y_binary[test_index]
    classes_train, classes_test = classes[train_index], classes[test_index]
    sample_labels_train, sample_labels_test = X_clean.index[train_index], X_clean.index[test_index]

In [None]:
"""
Write to an h5 file for training (see above for details on each dataset)
"""
with h5py.File("data/tumor_normal.h5", "w") as f:
    f.create_dataset('X_train', X_train.shape, dtype='f')[:] = X_train
    f.create_dataset('X_test', X_test.shape, dtype='f')[:] = X_test
    f.create_dataset('y_train', y_train.shape, dtype='i')[:] = y_train
    f.create_dataset('y_test', y_test.shape, dtype='i')[:] = y_test
    f.create_dataset('classes_train', y_train.shape, dtype='i')[:] = classes_train
    f.create_dataset('classes_test', y_test.shape, dtype='i')[:] = classes_test
    f.create_dataset('features', X_clean.columns.shape, 'S10', 
                     [l.encode("ascii", "ignore") for l in X_clean.columns.values])
    f.create_dataset('labels', (2, 1), 'S10', 
                     [l.encode("ascii", "ignore") for l in ["Normal", "Tumor"]])
    f.create_dataset('class_labels', (len(class_labels), 1), 'S10', 
                     [l.encode("ascii", "ignore") for l in class_labels])

In [None]:
import matplotlib.pyplot as pyplot
pyplot.hist(classes_train, alpha=0.5, label='Train')
pyplot.hist(classes_test, alpha=0.5, label='Test')
pyplot.legend(loc='upper right')
pyplot.title("Class (Primary Site) distribution between train and test")
pyplot.show()