
Let's assume we have time series data, such as share prices or sunspot activity. You'll find plenty of sample code showing how to use a TensorFlow Dataset to window and batch the series into consecutive x, y values for training a neural network. I'll go through those steps here, looking at the resulting elements of the dataset and the shape of the data at each step.

In [24]:
import tensorflow as tf
series = [1,2,3,4,5] # very simple, dummy time series
WINDOW_SIZE=3

In [25]:
ds = tf.data.Dataset.from_tensor_slices(series)
ds = ds.window(WINDOW_SIZE, shift=1, drop_remainder=False)
for win in ds:
  print(list(win.as_numpy_iterator()))

[1, 2, 3]
[2, 3, 4]
[3, 4, 5]
[4, 5]
[5]


We can use a sliding window on the time series to generate input Xs. If we set drop_remainder to False, the windows taper off when data runs out. 
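
To make the effect of drop_remainder concrete, here is a minimal plain-Python sketch of the same sliding-window logic. The helper name `sliding_windows` is made up for illustration and is not part of TensorFlow; it only mimics the windows printed above.

```python
def sliding_windows(values, size, shift=1, drop_remainder=False):
    """Hypothetical helper: mimic the windows printed above with plain lists."""
    windows = []
    for start in range(0, len(values), shift):
        win = values[start:start + size]
        if drop_remainder and len(win) < size:
            break  # from here on every window would be short, so stop
        windows.append(win)
    return windows

series = [1, 2, 3, 4, 5]
print(sliding_windows(series, 3))                       # includes the tapering [4, 5] and [5]
print(sliding_windows(series, 3, drop_remainder=True))  # full windows only
```

The real `window()` call does the same thing lazily, which matters later when the Dataset has to guess shapes before it has seen any data.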

In [26]:
ds = tf.data.Dataset.from_tensor_slices(series)
ds = ds.window(WINDOW_SIZE, shift=1, drop_remainder=True)
for win in ds:
  print(list(win.as_numpy_iterator()))
print(ds.element_spec)

[1, 2, 3]
[2, 3, 4]
[3, 4, 5]
DatasetSpec(TensorSpec(shape=(), dtype=tf.int32, name=None), TensorShape([]))


To keep all Xs uniform we use drop_remainder=True. Note that the resulting WindowDataset has lost all notion of the shape of the data: each element is itself a small Dataset, which is why the element spec is a DatasetSpec wrapping a scalar TensorSpec.
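
One way to check what the elements of a windowed dataset actually are is to pull out the first one and inspect it. This is a small sketch assuming TensorFlow 2.x with eager execution; the variable names are only for illustration.

```python
import tensorflow as tf

series = [1, 2, 3, 4, 5]
WINDOW_SIZE = 3

ds = tf.data.Dataset.from_tensor_slices(series)
ds = ds.window(WINDOW_SIZE, shift=1, drop_remainder=True)

# Take the first element of the windowed dataset and inspect it.
first_window = next(iter(ds))
is_dataset = isinstance(first_window, tf.data.Dataset)
window_values = [int(v) for v in first_window.as_numpy_iterator()]
print(is_dataset, window_values)
```

Because the element is a nested Dataset of scalars rather than a tensor, there is no tensor shape for the outer Dataset to report at this stage.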

In [27]:
ds = ds.flat_map(lambda w: w.batch(WINDOW_SIZE))
for win in ds:
  print(win)
print(ds.element_spec)

tf.Tensor([1 2 3], shape=(3,), dtype=int32)
tf.Tensor([2 3 4], shape=(3,), dtype=int32)
tf.Tensor([3 4 5], shape=(3,), dtype=int32)
TensorSpec(shape=(None,), dtype=tf.int32, name=None)


Flattening the dataset with a lambda that (again) batches each window results in a somewhat simpler element consisting of a single Tensor(Spec). But the shape of the data is still unknown: the spec reports (None,) even though every printed window has 3 values.

In [29]:
ds = ds.batch(2)
print(ds.element_spec)

TensorSpec(shape=(None, None), dtype=tf.int32, name=None)


A final step is typically batching the data so that the dataset pipeline feeds batches of windows into the model. Note that the shape of the data remains "lost", which is confusing when trying to feed this data into a Dense layer, which needs the size of its last input dimension to be defined.

I was puzzled why the shape of the data was getting lost, as iterating through the elements clearly showed that the dataset could know the shape. But then I realised that the Dataset doesn't "have" the data upfront and only gets it while iterating. The two batch() calls in the pipeline were made without drop_remainder=True, so the Dataset had no way of knowing upfront whether it would end up with partial windows or a partial final batch, and hence no way of knowing the data shape. Note: we know of course that there are no partial windows, but the Dataset does not.
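
The difference is easy to see side by side. The sketch below builds the same pipeline twice and only toggles drop_remainder on the two batch() calls; `make_windows` is a hypothetical helper introduced here for the comparison, and TensorFlow 2.x is assumed.

```python
import tensorflow as tf

series = [1, 2, 3, 4, 5]
WINDOW_SIZE = 3

def make_windows(drop_remainder):
    """Hypothetical helper: window -> flat_map(batch) -> batch, as in the cells above."""
    ds = tf.data.Dataset.from_tensor_slices(series)
    ds = ds.window(WINDOW_SIZE, shift=1, drop_remainder=True)
    ds = ds.flat_map(lambda w: w.batch(WINDOW_SIZE, drop_remainder=drop_remainder))
    return ds.batch(2, drop_remainder=drop_remainder)

unknown_shape = make_windows(drop_remainder=False).element_spec.shape.as_list()
known_shape = make_windows(drop_remainder=True).element_spec.shape.as_list()
print(unknown_shape)  # every dimension is None
print(known_shape)    # static shape: 2 windows per batch, 3 values per window
```

Neither call iterates over the data, so the spec reflects only what the Dataset can promise ahead of time.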

In [30]:
ds = tf.data.Dataset.from_tensor_slices(series)
ds = ds.window(WINDOW_SIZE + 1, shift=1, drop_remainder=True)
ds = ds.flat_map(lambda w: w.batch(WINDOW_SIZE + 1, drop_remainder=True))
print(ds)
ds = ds.batch(2, drop_remainder=True)
print(ds)



<FlatMapDataset shapes: (4,), types: tf.int32>
<BatchDataset shapes: (2, 4), types: tf.int32>


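Adding drop_remainder=True to the two batch() calls resolves the problem and allows the Dataset to know the shape of the data. (The window here is WINDOW_SIZE + 1 long, presumably so that each window can later be split into WINDOW_SIZE inputs and one label.) I found this much easier to feed into the Dense layer of the neural network.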
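
As a rough sketch of that last step, the windows can be split into inputs and a label before batching and then passed to a small model. The split (all but the last value as x, the last value as y), the cast to float32 and the one-layer model are assumptions made for illustration; none of them appear in the cells above.

```python
import tensorflow as tf

series = [1, 2, 3, 4, 5]
WINDOW_SIZE = 3

ds = tf.data.Dataset.from_tensor_slices(series)
ds = ds.window(WINDOW_SIZE + 1, shift=1, drop_remainder=True)
ds = ds.flat_map(lambda w: w.batch(WINDOW_SIZE + 1, drop_remainder=True))
# Assumed split: the first WINDOW_SIZE values are the inputs, the last value is the label.
ds = ds.map(lambda w: (tf.cast(w[:-1], tf.float32), tf.cast(w[-1], tf.float32)))
ds = ds.batch(2, drop_remainder=True)

x_spec, y_spec = ds.element_spec
x_shape = x_spec.shape.as_list()
y_shape = y_spec.shape.as_list()
print(x_shape, y_shape)

# A deliberately tiny model: one Dense unit on top of a fixed-size input.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(WINDOW_SIZE,)),
    tf.keras.layers.Dense(1),
])
x_batch, y_batch = next(iter(ds))
pred_shape = tuple(model(x_batch).shape)
print(pred_shape)
```

One trade-off to keep in mind: drop_remainder=True on the final batch() discards any last batch that is smaller than the batch size, so with small datasets a few windows may never reach the model.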