## Description

If your dataset is very large, you can split it into several TFRecords files called shards. Sharding also improves the random shuffling, because the Dataset API only shuffles within a limited buffer of e.g. 1024 elements held in RAM. If you have e.g. 100 TFRecords files and read them in an interleaved or shuffled order, elements from distant parts of the dataset reach the buffer together, so the randomization is much better than for a single TFRecords file.
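
The effect of sharding on shuffling can be illustrated with a small pure-Python simulation. This is only a sketch and does not use the Dataset API: `buffered_shuffle`, `interleave` and `mean_displacement` are hypothetical helpers written for this illustration, and the sizes (100 shards of 64 examples, a buffer of 64 elements, round-robin reading of the shards) are assumptions chosen to keep the run short.

```python
import random

def buffered_shuffle(stream, buffer_size, rng):
    """Emit items the way a fixed-size shuffle buffer would."""
    buffer = []
    for item in stream:
        if len(buffer) < buffer_size:
            buffer.append(item)          # still filling the buffer
        else:
            idx = rng.randrange(buffer_size)
            yield buffer[idx]            # emit a random buffered item ...
            buffer[idx] = item           # ... and replace it with the new one
    rng.shuffle(buffer)                  # input exhausted: drain what is left
    yield from buffer

def interleave(shards):
    """Read one item from each shard in turn until all are exhausted."""
    iterators = [iter(shard) for shard in shards]
    done = object()
    while iterators:
        still_open = []
        for it in iterators:
            item = next(it, done)
            if item is not done:
                yield item
                still_open.append(it)
        iterators = still_open

def mean_displacement(order):
    """Average distance between an item's output position and its original index."""
    return sum(abs(pos - item) for pos, item in enumerate(order)) / len(order)

num_shards, shard_size, buffer_size = 100, 64, 64   # illustrative values only
examples = list(range(num_shards * shard_size))     # item value == original index

rng = random.Random(0)

# Case 1: one big file, read front to back through the shuffle buffer.
single_file = list(buffered_shuffle(examples, buffer_size, rng))

# Case 2: the same examples split into shards that are read round-robin.
shards = [examples[i:i + shard_size] for i in range(0, len(examples), shard_size)]
sharded = list(buffered_shuffle(interleave(shards), buffer_size, rng))

single_disp = mean_displacement(single_file)
sharded_disp = mean_displacement(sharded)
print("single file:", single_disp)
print("100 shards :", sharded_disp)
```

With a single file, an example can only move roughly as far as the buffer is long, whereas with interleaved shards the buffer mixes examples that were originally far apart.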

## Libraries

In [1]:
import h5py
import numpy as np
import os
import sys
import tensorflow as tf


## Load subjects

In [2]:
path = 'D:/mimicdb/180521_ch2_30s/'
filename = path + 'subject.txt'
subject_list = np.loadtxt(filename, delimiter='\t', dtype=np.int16)
subjects = np.unique(subject_list[:,1])
print(len(subjects), "subjects:", subjects)

38 subjects: [211 212 213 216 220 224 225 226 230 237 240 252 259 262 276 281 284 401
 404 408 411 413 417 427 437 438 439 443 446 449 450 452 471 472 476 482
 484 485]


## Load train data

In [3]:
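def loadmat(path, subject_id):
    # Open the HDF5-based .mat file read-only and close it when done.
    with h5py.File(path + 'subject' + subject_id + '.mat', 'r') as f:
        # dataset[()] reads the whole dataset into a numpy-array
        # (the older .value attribute was removed in h5py 3.0).
        s = f['/subject/systolic'][()].T.astype(int)
        d = f['/subject/diastolic'][()].T.astype(int)
        e = f['/subject/ecg'][()].T
        p = f['/subject/ppg'][()].T
        i = f['/subject/index'][()].T.astype(int)
    return s, d, e, p, i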

In [4]:
def print_progress(count, total):
    # Percentage completion.
    pct_complete = float(count) / total

    # Status-message.
    # Note the \r which means the line should overwrite itself.
    msg = "\r- Progress: {0:.1%}".format(pct_complete)

    # Print it.
    sys.stdout.write(msg)
    sys.stdout.flush()

In [5]:
def wrap_int64(value):
    return tf.train.Feature(int64_list=tf.train.Int64List(value=value))

In [6]:
def wrap_float(value):
    return tf.train.Feature(float_list=tf.train.FloatList(value=value))

Function for writing the data along with the labels to TFRecords files. The values are stored uncompressed. For image data this means loading and decoding the images to numpy-arrays and then storing the raw bytes in the TFRecords file, so if the original image-files are compressed, e.g. as jpeg-files, then the TFRecords file may be many times larger than the original image-files.

It is also possible to save the compressed image files directly in the TFRecords file because it can hold any raw bytes. We would then have to decode the compressed images when the TFRecords file is being read later in the `parse()` function below.

In [7]:
def convert(ecg, ppg, systolic, diastolic, indices, out_path):
    print("\nConverting: " + out_path)

    writer = None

    # Number of signals. Used when printing the progress.
    num_signals = len(systolic)
    
    for i, (e, p, s, d, a) in enumerate(zip(ecg, ppg, systolic, diastolic, indices)):
        if (i % 64 == 0):
            if writer:
                writer.close()
            writer = tf.python_io.TFRecordWriter(out_path + "-" + str(i) + ".tfrecords")
        
        # Print the percentage-progress. Counting from 1 avoids a
        # division by zero when there is only a single signal.
        print_progress(count=i + 1, total=num_signals)

        # Create a dict with the data we want to save in the
        # TFRecords file. You can add more relevant data here.
        data = \
            {
                'ecg': wrap_float(e),
                'ppg': wrap_float(p),
                'systolic': wrap_int64(s),
                'diastolic': wrap_int64(d),
                'annotation' : wrap_int64(a)
            }
        
        # Wrap the data as TensorFlow Features.
        feature = tf.train.Features(feature=data)
        
        # Wrap again as a TensorFlow Example.
        example = tf.train.Example(features=feature)
        
        # Serialize the data.
        serialized = example.SerializeToString()
        
        # Write the serialized data to the TFRecords file.
        writer.write(serialized)
    
    writer.close()
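
The file naming used by `convert()` can be sketched without TensorFlow. A new file is started every 64 examples and named after the index of its first example, so the suffixes are 0, 64, 128 and so on. The helpers `shard_paths` and `shard_start` below are hypothetical and only illustrate that scheme; the example counts are made up.

```python
def shard_paths(out_path, num_examples, shard_size=64):
    """List the file names produced when a new file starts every `shard_size` examples."""
    return [out_path + "-" + str(start) + ".tfrecords"
            for start in range(0, num_examples, shard_size)]

def shard_start(path):
    """Recover the index of the first example from a shard file name."""
    stem = path.rsplit(".", 1)[0]          # drop the ".tfrecords" extension
    return int(stem.rsplit("-", 1)[1])     # the number after the last "-"

# 150 examples end up in three files.
paths = shard_paths("s211", num_examples=150)
print(paths)  # ['s211-0.tfrecords', 's211-64.tfrecords', 's211-128.tfrecords']

# The suffixes are not zero-padded, so a plain alphabetical sort does not
# give the writing order; sorting by the parsed start index does.
many = shard_paths("s211", num_examples=1100)
print(sorted(many)[:3])
print(sorted(many, key=shard_start)[:3])
```

If the order of the shards matters when reading them back, sorting by the parsed index (or zero-padding the suffix when writing) is one way to keep it.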

In [8]:
for subject in subjects:
    sid = "%03d" % (subject)
    s, d, e, p, i = loadmat(path, sid)
    record_tfrecords = os.path.join(path, "tfrecords-batches/s" + sid)
    convert(ecg=e, ppg=p, systolic=s, diastolic=d, indices=i, out_path=record_tfrecords)


Converting: D:/mimicdb/180521_ch2_30s/tfrecords-batches/s211
- Progress: 100.0%
Converting: D:/mimicdb/180521_ch2_30s/tfrecords-batches/s212
- Progress: 100.0%
Converting: D:/mimicdb/180521_ch2_30s/tfrecords-batches/s213
- Progress: 100.0%
Converting: D:/mimicdb/180521_ch2_30s/tfrecords-batches/s216
- Progress: 100.0%
Converting: D:/mimicdb/180521_ch2_30s/tfrecords-batches/s220
- Progress: 100.0%
Converting: D:/mimicdb/180521_ch2_30s/tfrecords-batches/s224
- Progress: 100.0%
Converting: D:/mimicdb/180521_ch2_30s/tfrecords-batches/s225
- Progress: 100.0%
Converting: D:/mimicdb/180521_ch2_30s/tfrecords-batches/s226
- Progress: 100.0%
Converting: D:/mimicdb/180521_ch2_30s/tfrecords-batches/s230
- Progress: 100.0%
Converting: D:/mimicdb/180521_ch2_30s/tfrecords-batches/s237
- Progress: 100.0%
Converting: D:/mimicdb/180521_ch2_30s/tfrecords-batches/s240
- Progress: 100.0%
Converting: D:/mimicdb/180521_ch2_30s/tfrecords-batches/s252
- Progress: 100.0%
Converting: D:/mimicdb/180521_ch2_30s/t