# Data developments

For now, I'm using this notebook to explore the dataset and develop the data preprocessing steps.

In [2]:
import os

import numpy as np
import h5py

In [3]:
data_dir = '/project/projectdirs/m3363/www/cosmoUniverse_2019_05_4parE'

In [4]:
!ls $data_dir

21688988  21922619  21997469  22059249	22098324  22309462
21812950  21929749  22021490  22074825	22118427


## Open one h5 file and inspect contents

In [5]:
dfile = os.path.join(data_dir, '21688988', 'univ_ics_2019-03_a10000668.hdf5')

In [6]:
dfile

'/project/projectdirs/m3363/www/cosmoUniverse_2019_05_4parE/21688988/univ_ics_2019-03_a10000668.hdf5'

In [7]:
with h5py.File(dfile, mode='r') as f:
    print(f.keys())
    x = f['full'][:]
    y = f['unitPar'][:]

<KeysViewHDF5 ['full', 'namePar', 'physPar', 'redshifts', 'unitPar']>


In [8]:
x.shape

(512, 512, 512, 4)

In [9]:
x.sum()

536870912
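
As a quick consistency check on the sum above, the arithmetic below compares it with the number of elements in an array of shape `(512, 512, 512, 4)`. The two numbers are copied from the cell outputs; what the agreement implies physically is not stated in this notebook, so treat it only as a sanity check.

```python
# Number of elements in an array of shape (512, 512, 512, 4)
n_elements = 512 * 512 * 512 * 4

# Value of x.sum() copied from the output above
total = 536870912

# The sum equals the element count, so the mean value is 1
mean_value = total / n_elements
print(n_elements, mean_value)
```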

## Splitting universe into cubes

In [10]:
np.split(x, 4)[0].shape

(128, 512, 512, 4)

In [11]:
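def split_universe(x, size):
    """Generate sub-cubes of side `size` from a cubic universe `x`.

    Assumes the three spatial axes have equal length, divisible by `size`.
    """
    # Number of splits along each spatial axis
    n = x.shape[0] // size
    # Loop over each split
    for xi in np.split(x, n, axis=0):
        for xij in np.split(xi, n, axis=1):
            for xijk in np.split(xij, n, axis=2):
                yield xijk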

In [12]:
cube_size = 128
sample_shape = (cube_size, cube_size, cube_size, 4)

In [13]:
# Example loop over splits, verify the sum
total_sum = 0
for i, xi in enumerate(split_universe(x, cube_size)):
    total_sum += xi.sum()
    # Write out a tfrecord here
    #print(i, xi.shape)
    #break

print('Total sum:', total_sum)

Total sum: 536870912
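
The same check can be run on a small synthetic array, which makes it cheap to confirm the number of cubes, their shapes, and their order. This is only a sketch: the toy array and its sizes are made up for illustration, and the reshape-based split is an alternative I am assuming is equivalent, which the last line verifies.

```python
import numpy as np

def split_universe(x, size):
    """Yield sub-cubes of side `size` (same logic as the function above)."""
    n = x.shape[0] // size
    for xi in np.split(x, n, axis=0):
        for xij in np.split(xi, n, axis=1):
            for xijk in np.split(xij, n, axis=2):
                yield xijk

# Hypothetical toy universe: 8^3 voxels with 4 channels
rng = np.random.RandomState(0)
toy = rng.randint(0, 10, size=(8, 8, 8, 4)).astype(np.int16)

size = 4
n = toy.shape[0] // size
cubes = list(split_universe(toy, size))

n_cubes = len(cubes)                      # expect n**3 cubes
cube_shapes = {c.shape for c in cubes}    # expect one unique shape
sum_preserved = sum(int(c.sum()) for c in cubes) == int(toy.sum())

# Alternative: cut each spatial axis into (n, size) blocks with one reshape,
# then move the block indices to the front
blocks = toy.reshape(n, size, n, size, n, size, 4)
blocks = blocks.transpose(0, 2, 4, 1, 3, 5, 6).reshape(-1, size, size, size, 4)
same_order = all(np.array_equal(a, b) for a, b in zip(cubes, blocks))
```

For the real data (`size = 128`, side 512) the same logic would give 4^3 = 64 cubes per universe.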


## Writing to TFRecord

Relevant documentation/examples:

Jan's conversion code: https://bitbucket.org/balewski/cosmoflow/src/2019_TF/IO_Cosmo_TF1_8.py

TF tutorial: https://www.tensorflow.org/tutorials/load_data/tfrecord

Another useful tutorial: https://towardsdatascience.com/working-with-tfrecords-and-tf-train-example-36d111b3ff4d

There are multiple ways to serialize the tensor:
- Convert the array to a byte string with numpy's `tostring` method (`tobytes` is the non-deprecated name), as Jan did, and store it in a `BytesList` feature.
- Possibly use `tf.io.serialize_tensor`, though I'm not sure it applies here. I think the data first needs to be converted to a TF tensor.
- Save the array data as float features.
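
To make the byte-string option concrete, here is a minimal NumPy-only sketch of the round trip. The small `cube` array is a made-up stand-in for one sample. The point it illustrates is that the byte string carries neither shape nor dtype, so both have to be known when reading the data back.

```python
import numpy as np

# Hypothetical small cube standing in for one (128, 128, 128, 4) sample
cube = np.arange(2 * 2 * 2 * 4, dtype=np.int16).reshape(2, 2, 2, 4)

# tobytes() is the current name for the deprecated tostring()
raw = cube.tobytes()

# The byte string has no shape or dtype; both must be supplied to decode it
restored = np.frombuffer(raw, dtype=cube.dtype).reshape(cube.shape)
round_trip_ok = np.array_equal(cube, restored)

# Decoding with the wrong dtype does not fail; it yields a different array
wrong = np.frombuffer(raw, dtype=np.int32)
```

Here `raw` is 64 bytes long (32 values of 2 bytes each), and `wrong` has only 16 elements because each int32 consumes 4 bytes.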

In [14]:
import tensorflow as tf

In [15]:
def _bytes_feature(value):
    """Returns a bytes_list feature from a string / bytes value."""
    # BytesList won't unpack a string from an EagerTensor, so in eager mode
    # the value would first need converting with .numpy():
    #if isinstance(value, type(tf.constant(0))):
    #    value = value.numpy()
    return tf.train.Feature(bytes_list=tf.train.BytesList(value=[value]))

def _float_feature(array):
    """Returns a float_list feature from an array of floats."""
    # Flatten the array, since FloatList stores a flat sequence of values
    if len(array.shape) > 1:
        array = array.flatten()
    return tf.train.Feature(float_list=tf.train.FloatList(value=array))

### Save as float features

In [13]:
tf_example = tf.train.Example(
    features=tf.train.Features(
        feature=dict(x=_float_feature(xi), y=_float_feature(y))))

In [16]:
out_dir = '.'
tfr_file = os.path.join(out_dir, os.path.basename(dfile).replace('.hdf5', '.tfrecord'))

In [34]:
proto_example = tf_example.SerializeToString()

with tf.io.TFRecordWriter(tfr_file) as writer:
    writer.write(proto_example)

In [37]:
feature_description = dict(
    x=tf.io.FixedLenFeature(sample_shape, tf.float32),
    y=tf.io.FixedLenFeature([4], tf.float32)
)

In [38]:
parsed_example = tf.io.parse_single_example(proto_example, features=feature_description)

In [39]:
parsed_example

{'x': <tf.Tensor 'ParseSingleExample/ParseSingleExample:0' shape=(128, 128, 128, 4) dtype=float32>,
 'y': <tf.Tensor 'ParseSingleExample/ParseSingleExample:1' shape=(4,) dtype=float32>}

### Save using numpy tostring

In [42]:
tf_example = tf.train.Example(
    features=tf.train.Features(
        feature=dict(
            x=_bytes_feature(xi.tobytes()),  # tobytes() replaces the deprecated tostring()
            y=_float_feature(y)
        )
    )
)

In [43]:
tfr_file = os.path.join(out_dir, os.path.basename(dfile).replace('.hdf5', '.nps.tfrecord'))

In [44]:
# Write it
proto_example = tf_example.SerializeToString()
with tf.io.TFRecordWriter(tfr_file) as writer:
    writer.write(proto_example)

In [45]:
# Load it back
feature_description = dict(
    x=tf.io.FixedLenFeature([], tf.string),
    y=tf.io.FixedLenFeature([4], tf.float32)
)

In [46]:
parsed_example = tf.io.parse_single_example(proto_example, features=feature_description)

In [47]:
parsed_example

{'x': <tf.Tensor 'ParseSingleExample_1/ParseSingleExample:0' shape=() dtype=string>,
 'y': <tf.Tensor 'ParseSingleExample_1/ParseSingleExample:1' shape=(4,) dtype=float32>}

In [56]:
with tf.Session() as sess:
    # The byte string carries no dtype or shape: decode as int16, then reshape
    _x = tf.reshape(tf.decode_raw(parsed_example['x'], tf.int16), xi.shape).eval()
    print(_x.sum())
    print(_x[0,0].sum(axis=0))

9042665
[184 125 136 145]


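This works! The decoded total and per-channel sums match the values obtained from the float-feature version in the check below.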
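
One practical difference between the two encodings is payload size. The sketch below is back-of-the-envelope arithmetic, assuming the cubes are int16 (as the `tf.int16` in the decode step suggests) and that float features are stored as 4-byte floats. Protobuf framing overhead is ignored.

```python
import numpy as np

sample_shape = (128, 128, 128, 4)
n_values = int(np.prod(sample_shape))

# Raw int16 bytes versus the same values stored as float32
bytes_as_int16 = n_values * np.dtype(np.int16).itemsize
bytes_as_float32 = n_values * np.dtype(np.float32).itemsize

print(n_values)                   # 8388608
print(bytes_as_int16 / 2**20)     # 16.0 (MiB)
print(bytes_as_float32 / 2**20)   # 32.0 (MiB)
```

Under these assumptions the byte-string record is about half the size of the float-feature record for each sample.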

### Check the contents

In [32]:
print(xi.sum())
print(xi[0,0].sum(axis=0))

In [None]:
with tf.Session() as sess:
    print(parsed_example['x'].eval()[0,0,0])
    print(parsed_example['y'].eval())

In [69]:
with tf.Session() as sess:
    _x = parsed_example['x'].eval()
    print(_x.sum())
    print(_x[0,0].sum(axis=0))

9042665.0
[184. 125. 136. 145.]
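
The float-feature path converts the values to float32, so it is worth confirming that this conversion loses nothing. The sketch below assumes the source data are int16 (as used in the `decode_raw` step) and checks every representable int16 value with NumPy alone.

```python
import numpy as np

# Every value an int16 can hold
all_int16 = np.arange(-32768, 32768).astype(np.int16)

# Convert to float32 and back, as the float-feature path effectively does
as_float = all_int16.astype(np.float32)
lossless = np.array_equal(as_float.astype(np.int16), all_int16)
```

float32 represents integers exactly up to a magnitude of 2**24, which covers the full int16 range and also the cube sum printed above (9042665).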
