# Tabula Muris Ingest

Ingest, wrangle and store [Tabula Muris](http://tabula-muris.ds.czbiohub.org/) as a [TFRecordDataset](https://docs.google.com/presentation/d/16kHNtQslt-yuJ3w8GIx-eEH6t_AvFeQOchqGRFpAD7U/edit) optimized for multi-GPU distributed machine learning. See the [GitHub repository](https://github.com/czbiohub/tabula-muris) for details on the data.

Note: Initially only the FACS dataset is ingested, as the droplet dataset has significant QC dropout and is missing several tissues.

In [1]:
import os
import re
import glob
import json
import numpy as np
import pandas as pd
import tensorflow as tf

np.random.seed(42)  # reproducibility

## Download

Download the dataset and unpack it into a local directory. This could be /tmp if we only plan to use the S3 copy. I usually have ~/data symlinked to a local scratch disk for speed and to keep home directories on the shared big-memory servers small. For a much larger dataset whose files can be downloaded separately, it would be better to download directly into memory, convert to tfrecord and push to S3.
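
The in-memory route can be sketched with the standard library alone. The example below is a minimal illustration, not what this notebook does: it builds a tiny synthetic archive with the same nesting as the figshare download (an outer zip holding `FACS.zip`, which holds `FACS/<tissue>-counts.csv`) and reads each CSV without touching disk. The function names `iter_counts_csvs` and `make_synthetic_archive`, and the toy gene names and counts, are assumptions made for the example.

```python
import io
import zipfile


def iter_counts_csvs(outer_zip_bytes, inner_name="FACS.zip"):
    """Yield (tissue, csv_text) for each counts CSV in a nested zip held in memory."""
    with zipfile.ZipFile(io.BytesIO(outer_zip_bytes)) as outer:
        inner_bytes = outer.read(inner_name)
    with zipfile.ZipFile(io.BytesIO(inner_bytes)) as inner:
        for name in sorted(inner.namelist()):
            if name.endswith("-counts.csv"):
                # "FACS/Lung-counts.csv" -> "Lung"
                tissue = name.split("/")[-1][: -len("-counts.csv")]
                yield tissue, inner.read(name).decode("utf-8")


def make_synthetic_archive():
    """Build a toy archive with the same nesting as the download (made-up contents)."""
    inner_buf = io.BytesIO()
    with zipfile.ZipFile(inner_buf, "w") as inner:
        inner.writestr("FACS/Lung-counts.csv", ",cell1,cell2\ngeneA,0,3\ngeneB,1,0\n")
        inner.writestr("FACS/Spleen-counts.csv", ",cell3\ngeneA,2\ngeneB,0\n")
    outer_buf = io.BytesIO()
    with zipfile.ZipFile(outer_buf, "w") as outer:
        outer.writestr("FACS.zip", inner_buf.getvalue())
    return outer_buf.getvalue()


archive = make_synthetic_archive()
tables = dict(iter_counts_csvs(archive))
for tissue, text in tables.items():
    # First line of each CSV is the header row of cell names
    print(tissue, len(text.splitlines()) - 1, "genes")
```

With the real download, `outer_zip_bytes` would be the body of the figshare response, and each CSV text could be handed to `pd.read_csv(io.StringIO(text), index_col=0)`.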

In [2]:
# Switch to data directory for storing download and tfrecords
!mkdir -p ~/data/tabula-muris
os.chdir(os.path.expanduser("~/data/tabula-muris"))

In [3]:
# Delete all local files so we start clean
# !rm -rf *

In [4]:
# Download the FACS data set and unzip into per tissue expression csv files 
!wget --no-clobber --output-document data.zip https://ndownloader.figshare.com/articles/5715040/versions/1
!if [ ! -f FACS.zip ]; then unzip data.zip; fi
!if [ ! -d FACS ]; then unzip FACS.zip; fi

--2018-10-10 21:24:21--  https://ndownloader.figshare.com/articles/5715040/versions/1
Resolving ndownloader.figshare.com (ndownloader.figshare.com)... 34.255.241.113, 54.72.206.99, 34.248.163.7, ...
Connecting to ndownloader.figshare.com (ndownloader.figshare.com)|34.255.241.113|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 308503289 (294M) [application/zip]
Saving to: 'data.zip'


2018-10-10 21:24:54 (9.12 MB/s) - 'data.zip' saved [308503289/308503289]

Archive:  data.zip
 extracting: FACS.zip                
 extracting: metadata_FACS.csv       
 extracting: annotations_FACS.csv    
Archive:  FACS.zip
   creating: FACS/
  inflating: FACS/Lung-counts.csv    
  inflating: FACS/Spleen-counts.csv  
  inflating: FACS/Trachea-counts.csv  
  inflating: FACS/Brain_Microglia-counts.csv  
  inflating: FACS/Kidney-counts.csv  
  inflating: FACS/Pancreas-counts.csv  
  inflating: FACS/Skin-counts.csv    
  inflating: FACS/Tongue-counts.csv  
  inflating: FACS/Colon-co

In [5]:
tissues = sorted(re.findall(r"FACS/(.+?)-counts\.csv$", f)[0] for f in glob.glob("FACS/*-counts.csv"))
print("Found {} tissues: {}".format(len(tissues), tissues))

Found 18 tissues: ['Bladder', 'Brain_Microglia', 'Brain_Neurons', 'Colon', 'Fat', 'Heart', 'Kidney', 'Liver', 'Lung', 'Mammary', 'Marrow', 'Muscle', 'Pancreas', 'Skin', 'Spleen', 'Thymus', 'Tongue', 'Trachea']
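
Tissue names double as class labels: each record is later labeled with the tissue's position in this sorted list. A minimal sketch of that mapping, run on a hand-written list of paths instead of a real `glob` result (`label_of` is an illustrative name, not something the notebook defines):

```python
import re

# Hand-written stand-in for glob.glob("FACS/*-counts.csv")
paths = [
    "FACS/Lung-counts.csv",
    "FACS/Brain_Microglia-counts.csv",
    "FACS/Bladder-counts.csv",
]

# Capture everything between "FACS/" and "-counts.csv"
pattern = re.compile(r"FACS/(.+?)-counts\.csv$")
tissues = sorted(pattern.search(p).group(1) for p in paths)

# Integer label for each tissue = its index in the sorted list
label_of = {tissue: i for i, tissue in enumerate(tissues)}
print(label_of)
```

Because the labels depend on the sort order, the `tissues` list saved in `metadata.json` is what lets a consumer turn a label back into a tissue name.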


In [10]:
from sklearn.model_selection import train_test_split

# Serialize one cell as a tf.train.Example; the loop below splits each tissue
# into train and test sets and writes them to separate .tfrecord files
def write_tfrecord(writer, features, labels):
    """Write one cell's expression vector and its tissue label as a single record."""
    writer.write(tf.train.Example(features=tf.train.Features(
        feature={
            "features": tf.train.Feature(float_list=tf.train.FloatList(value=features.values)),
            "labels": tf.train.Feature(int64_list=tf.train.Int64List(value=labels))
        })).SerializeToString())

num_train_samples = 0
num_test_samples = 0
for tissue in tissues:
    print("{}: ".format(tissue), end="")
    # Load expression for a single tissue and transpose from rows=genes to
    # rows=cells, columns=genes, then sort the gene columns by name
    expression = pd.read_csv("FACS/{}-counts.csv".format(tissue), index_col=0) \
        .astype(np.int32).T.sort_index(axis="columns")
    print(expression.shape[0], end=" ")
    
    train, test = train_test_split(expression, test_size=0.2)
    num_train_samples += train.shape[0]
    num_test_samples += test.shape[0]

    # Compress with gzip, which saves significant space because the data is sparse:
    # Bladder: csv (full) = 81M, tfrecord (train) = 118M, tfrecord.gzip (train) = 14M
    options = tf.python_io.TFRecordOptions(
        compression_type=tf.python_io.TFRecordCompressionType.GZIP)

    with tf.python_io.TFRecordWriter("FACS/{}.train.gzip.tfrecord".format(tissue),
                                     options=options) as writer:
        for (_, features) in train.iterrows():
            write_tfrecord(writer, features, [tissues.index(tissue)])

    with tf.python_io.TFRecordWriter("FACS/{}.test.gzip.tfrecord".format(tissue),
                                     options=options) as writer:
        for (_, features) in test.iterrows():
            write_tfrecord(writer, features, [tissues.index(tissue)])
            
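# Save metadata (json is already imported in the first cell). The gene list is taken
# from the last tissue loaded, which assumes every tissue has the same gene columns.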
with open("FACS/metadata.json", "w") as f:
    f.write(json.dumps({
        "tissues": tissues,
        "num_train_samples": num_train_samples,
        "num_test_samples": num_test_samples,
        "genes": expression.columns.tolist()}))

print("\nFinished. \nTotal Training Samples: {}, \nTotal Test Samples: {}".format(
    num_train_samples, num_test_samples))

Bladder: 1638 Brain_Microglia: 4762 Brain_Neurons: 5799 Colon: 4149 Fat: 5862 Heart: 7115 Kidney: 865 Liver: 981 Lung: 1923 Mammary: 2663 Marrow: 5355 Muscle: 2102 Pancreas: 1961 Skin: 2464 Spleen: 1718 Thymus: 1580 Tongue: 1432 Trachea: 1391 
Finished. 
Total Training Samples: 43001, 
Total Test Samples: 10759
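
The totals above follow from the per-tissue counts. With a float `test_size`, scikit-learn's `train_test_split` rounds the test set up and gives the remainder to training. The sketch below reproduces that arithmetic in plain Python from the counts printed above; `split_sizes` is a name made up for this example, and the integer form `(n + 4) // 5` is just a float-free way of writing ceil(0.2 * n).

```python
# Cells per tissue, copied from the output above
cells = {
    "Bladder": 1638, "Brain_Microglia": 4762, "Brain_Neurons": 5799,
    "Colon": 4149, "Fat": 5862, "Heart": 7115, "Kidney": 865,
    "Liver": 981, "Lung": 1923, "Mammary": 2663, "Marrow": 5355,
    "Muscle": 2102, "Pancreas": 1961, "Skin": 2464, "Spleen": 1718,
    "Thymus": 1580, "Tongue": 1432, "Trachea": 1391,
}


def split_sizes(n):
    """Return (n_train, n_test) for a 20% test split with the test set rounded up."""
    n_test = (n + 4) // 5  # ceil(0.2 * n) without floating point
    return n - n_test, n_test


num_train = sum(split_sizes(n)[0] for n in cells.values())
num_test = sum(split_sizes(n)[1] for n in cells.values())
print(num_train, num_test)  # 43001 10759
```

The split is done per tissue, so each tissue keeps roughly the same 80/20 proportion; the smallest tissue, Kidney, contributes only 173 test cells.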


In [7]:
# Delete all s3 files so we start clean
# !aws s3 rm --recursive s3://stuartlab/tabula-muris/ --profile prp --endpoint https://s3.nautilus.optiputer.net

In [11]:
import glob
import boto3
session = boto3.session.Session(profile_name=os.getenv("AWS_PROFILE"))
s3 = session.resource("s3", endpoint_url=os.getenv("AWS_S3_ENDPOINT"))
bucket = s3.Bucket("stuartlab")
s3_prefix = "tabula-muris/"
print("S3 Profile: {} Endpoint: {}".format(os.getenv("AWS_PROFILE"), os.getenv("AWS_S3_ENDPOINT")))

# Upload the json and tfrecord files to S3. TFRecordWriter doesn't appear to support
# writing directly to S3; otherwise we could read, wrangle and write to S3 from memory
for path in glob.glob("FACS/*.tfrecord") + glob.glob("FACS/*.json"):
    print(path)
    bucket.Object(s3_prefix + path).upload_file(path, ExtraArgs={'ACL':'public-read'})
print("Done.")

S3 Profile: prp Endpoint: https://s3.nautilus.optiputer.net
FACS/Bladder.train.gzip.tfrecord
FACS/Bladder.test.gzip.tfrecord
FACS/Brain_Microglia.train.gzip.tfrecord
FACS/Brain_Microglia.test.gzip.tfrecord
FACS/Brain_Neurons.train.gzip.tfrecord
FACS/Brain_Neurons.test.gzip.tfrecord
FACS/Colon.train.gzip.tfrecord
FACS/Colon.test.gzip.tfrecord
FACS/Fat.train.gzip.tfrecord
FACS/Fat.test.gzip.tfrecord
FACS/Heart.train.gzip.tfrecord
FACS/Heart.test.gzip.tfrecord
FACS/Kidney.train.gzip.tfrecord
FACS/Kidney.test.gzip.tfrecord
FACS/Liver.train.gzip.tfrecord
FACS/Liver.test.gzip.tfrecord
FACS/Lung.train.gzip.tfrecord
FACS/Lung.test.gzip.tfrecord
FACS/Mammary.train.gzip.tfrecord
FACS/Mammary.test.gzip.tfrecord
FACS/Marrow.train.gzip.tfrecord
FACS/Marrow.test.gzip.tfrecord
FACS/Muscle.train.gzip.tfrecord
FACS/Muscle.test.gzip.tfrecord
FACS/Pancreas.train.gzip.tfrecord
FACS/Pancreas.test.gzip.tfrecord
FACS/Skin.train.gzip.tfrecord
FACS/Skin.test.gzip.tfrecord
FACS/Spleen.train.gzip.tfrecord
FACS/S
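
Each local path is uploaded under the key `s3_prefix + path`, which is why the files end up under `tabula-muris/FACS/` in the bucket. The sketch below shows that mapping, plus the URL a public-read object would have under path-style addressing. The URL layout is an assumption about how the endpoint is configured, and `object_key` and `object_url` are names made up for this example.

```python
s3_prefix = "tabula-muris/"
bucket_name = "stuartlab"
endpoint = "https://s3.nautilus.optiputer.net"


def object_key(path, prefix=s3_prefix):
    """Key used for a local relative path: the prefix followed by the path."""
    return prefix + path


def object_url(key, endpoint=endpoint, bucket=bucket_name):
    """Path-style URL (assumed layout): <endpoint>/<bucket>/<key>."""
    return "{}/{}/{}".format(endpoint.rstrip("/"), bucket, key)


key = object_key("FACS/Bladder.train.gzip.tfrecord")
print(key)
print(object_url(key))
```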

In [12]:
!aws --profile {os.getenv("AWS_PROFILE")} --endpoint {os.getenv("AWS_S3_ENDPOINT")} \
    s3 ls s3://stuartlab/tabula-muris/FACS/ 

2018-10-10 21:39:41    3583903 Bladder.test.gzip.tfrecord
2018-10-10 21:39:40   13967144 Bladder.train.gzip.tfrecord
2018-10-10 21:39:45    6161311 Brain_Microglia.test.gzip.tfrecord
2018-10-10 21:39:44   24725897 Brain_Microglia.train.gzip.tfrecord
2018-10-10 21:39:47    7041452 Brain_Neurons.test.gzip.tfrecord
2018-10-10 21:39:47   27650697 Brain_Neurons.train.gzip.tfrecord
2018-10-10 21:39:51    8393108 Colon.test.gzip.tfrecord
2018-10-10 21:39:49   34024941 Colon.train.gzip.tfrecord
2018-10-10 21:39:55    9121552 Fat.test.gzip.tfrecord
2018-10-10 21:39:53   36525787 Fat.train.gzip.tfrecord
2018-10-10 21:39:58    8753749 Heart.test.gzip.tfrecord
2018-10-10 21:39:56   33654460 Heart.train.gzip.tfrecord
2018-10-10 21:39:59     642001 Kidney.test.gzip.tfrecord
2018-10-10 21:39:59    2553561 Kidney.train.gzip.tfrecord
2018-10-10 21:40:00    1422611 Liver.test.gzip.tfrecord
2018-10-10 21:39:59    5712921 Liver.train.gzip.tfrecord
2018-10-10 21:40:02    2831570 Lung.test.g