# Preparing your data: Single acquisition with timing issues

In [1]:
import numpy as np
import xdas as xd

Most instruments produce datasets made of many files, each containing a temporal chunk of the full acquisition. In Xdas, you can virtually concatenate those files into a single virtual dataset that gives seamless access to the entire acquisition as if it were one file.

In the example here, let's focus on several files that have been recorded with NTP synchronization:

In [2]:
!ls data/ntp_sync  # several files with NTP sync

021040.hdf5  021100.hdf5  021120.hdf5  022010.hdf5
021050.hdf5  021110.hdf5  022000.hdf5


We can open all the files from this folder and link them:

In [3]:
da = xd.open_mfdataarray(
    "data/ntp_sync/*.hdf5",  # path with wildcard
    engine="asn",  # do not forget to specify the format of the files
    tolerance=None,  # by default, no tolerance is used to merge the files
)
da

<xdas.DataArray (time: 8750, distance: 50000)>
VirtualStack: 1.6GB (float32)
Coordinates:
  * time (time): 2021-11-11T02:10:40.693 to 2021-11-11T02:20:20.680
  * distance (distance): 0.000 to 204255.953

Xdas only loads the metadata from each file and returns a `DataArray` object. This object has two main attributes:
- `data`: contains the data. Here it is a `VirtualStack` object that points to the different files we opened.
- `coords`: contains the metadata describing how space and time are sampled. Here both dimensions are labeled using `InterpCoordinate` objects, which store the time and space information concisely, including potential gaps and overlaps.
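
The principle of an interpolated coordinate can be sketched with plain NumPy. This is only an illustration of the idea, not Xdas's implementation: the toy axis (ten samples at an 8 ms step, split by one gap) and the variable names are made up for the example.

```python
import numpy as np

# Toy time axis in milliseconds: two regularly sampled segments (8 ms step)
# separated by a gap. Four tie points are enough to describe ten samples.
tie_indices = np.array([0, 4, 5, 9])
tie_values = np.array([0.0, 32.0, 100.0, 132.0])

# Every coordinate value is recovered by linear interpolation between tie points.
indices = np.arange(10)
values = np.interp(indices, tie_indices, tie_values)
print(values)
```

Between indices 4 and 5 the value jumps from 32 ms to 100 ms: the gap is encoded by two adjacent tie points, while each regular segment only needs its first and last sample.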

The advantage of bundling the data with coordinates is that it enables labeled indexing, i.e. selection by coordinate values rather than by indices:

In [4]:
try:
    da.sel(time=slice("2021-11-11T02:10:50", "2021-11-11T02:11:00"))
except ValueError as error:
    print("Oh no the following error occurred:", error)

Oh no the following error occurred: fp must be strictly increasing


The `ValueError: fp must be strictly increasing` is an obscure yet common error in Xdas. It means that the coordinate values are not strictly increasing, and therefore that overlaps are present. Overlaps break the bijection between coordinate indices and coordinate values and must be fixed. Most of the time they originate from bad time synchronization.
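
A toy example shows why an overlap makes label lookup ambiguous. The sketch below uses plain NumPy and made-up timestamps in milliseconds (an 8 ms step, with a second chunk starting 3 ms before the end of the first). It does not reproduce the check Xdas performs internally.

```python
import numpy as np

step = 8  # ms, i.e. 125 Hz
chunk_a = np.arange(0, 40, step)                    # 0, 8, 16, 24, 32
chunk_b = chunk_a[-1] - 3 + np.arange(0, 40, step)  # 29, 37, 45, 53, 61
times = np.concatenate([chunk_a, chunk_b])

# An overlap shows up as a non-positive step between consecutive samples.
is_strictly_increasing = bool(np.all(np.diff(times) > 0))

# Look for every pair of consecutive samples that brackets the queried time.
query = 30
brackets = [i for i in range(len(times) - 1) if times[i] <= query <= times[i + 1]]

print(is_strictly_increasing, brackets)
```

The queried time falls between two different pairs of consecutive samples, so there is no unique index to return for that label.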

We can investigate the temporal discontinuities like this:

In [5]:
da["time"].get_discontinuities()

Unnamed: 0,start_index,end_index,start_value,end_value,delta,type
0,1249,1250,2021-11-11 02:10:50.685000192,2021-11-11 02:10:50.694000128,0 days 00:00:00.008999936,gap
1,2499,2500,2021-11-11 02:11:00.686000128,2021-11-11 02:11:00.683000064,-1 days +23:59:59.996999936,overlap
2,6249,6250,2021-11-11 02:11:30.675000064,2021-11-11 02:20:00.688000000,0 days 00:08:30.012999936,gap


As we can see, two gaps and one overlap were found:
- The first gap has a delta of 9 ms, which is close to the sampling interval (125 Hz, i.e. 8 ms), and is probably due to an NTP resynchronization.
- The overlap has a delta of -3 ms and is also probably due to an NTP resynchronization.
- The second gap has a delta of 8 min 30 s and corresponds to missing data.
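
The deltas in the table can be checked by hand with NumPy datetimes. In the sketch below, the timestamps are copied from the table. The sign-based rule used to label each boundary is an assumption that happens to reproduce the `type` column, not necessarily the rule Xdas applies.

```python
import numpy as np

# (start_value, end_value) pairs copied from the discontinuity table.
boundaries = [
    ("2021-11-11T02:10:50.685000192", "2021-11-11T02:10:50.694000128"),
    ("2021-11-11T02:11:00.686000128", "2021-11-11T02:11:00.683000064"),
    ("2021-11-11T02:11:30.675000064", "2021-11-11T02:20:00.688000000"),
]

deltas = [
    np.datetime64(end, "ns") - np.datetime64(start, "ns")
    for start, end in boundaries
]

# Assumed rule: a negative delta is an overlap, anything else is a gap.
kinds = ["overlap" if delta < np.timedelta64(0, "ns") else "gap" for delta in deltas]

for delta, kind in zip(deltas, kinds):
    print(kind, delta / np.timedelta64(1, "ms"), "ms")
```

The first two deltas are within a few milliseconds of the 8 ms sampling interval, whereas the third one is several orders of magnitude larger.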

To enforce continuity over small timestamp misalignments, a tolerance can be used:

In [6]:
# the `tolerance` can be directly passed to `open_mfdataarray`
tolerance = np.timedelta64(30, "ms")  # usually enough for NTP synchronized experiments
da["time"] = da["time"].simplify(tolerance)
da["time"].get_discontinuities()

Unnamed: 0,start_index,end_index,start_value,end_value,delta,type
0,6249,6250,2021-11-11 02:11:30.675000064,2021-11-11 02:20:00.688,0 days 00:08:30.012999936,gap
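
To see why a 30 ms tolerance removes the two small discontinuities but keeps the large one, the deltas reported before simplification can be compared with the tolerance. This is a simplified stand-in and not the algorithm behind `simplify`: it assumes a boundary is absorbed when its delta deviates from the 8 ms sampling interval by less than the tolerance.

```python
import numpy as np

sampling_interval = np.timedelta64(8, "ms")  # 125 Hz
tolerance = np.timedelta64(30, "ms")

# start_index -> delta, copied from the table obtained before simplification.
deltas = {
    1249: np.timedelta64(8_999_936, "ns"),
    2499: np.timedelta64(-3_000_064, "ns"),
    6249: np.timedelta64(510_012_999_936, "ns"),
}

# Assumed criterion: keep only boundaries whose deviation exceeds the tolerance.
remaining = [
    start_index
    for start_index, delta in deltas.items()
    if abs(delta - sampling_interval) > tolerance
]

print(remaining)
```

Under this assumption only the boundary at index 6249 survives, which is consistent with the single gap left in the table.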


Now only the real gap remains and labeled selection can be done without errors:

In [7]:
da.sel(
    time=slice("2021-11-11T02:10:50", "2021-11-11T02:11:00"),
    distance=slice(20_000, 100_000),  # Do not forget the `slice`
)

<xdas.DataArray (time: 1250, distance: 19583)>
VirtualStack: 93.4MB (float32)
Coordinates:
  * time (time): 2021-11-11T02:10:50.003 to 2021-11-11T02:10:59.993
  * distance (distance): 20001.143 to 99997.544
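
The shape and size of this selection can be cross-checked with a little arithmetic. The sketch below assumes that the distance axis is perfectly regular between its two displayed end values, that label-based slices include both bounds, and that the displayed size is in binary megabytes. All three are assumptions made for the example, although they reproduce the numbers shown.

```python
import math

# Full distance axis as displayed when the files were opened.
n_distance = 50_000
last_distance = 204_255.953
spacing = last_distance / (n_distance - 1)

# First sample at or after 20 000 and last sample at or before 100 000.
first_index = math.ceil(20_000 / spacing)
last_index = math.floor(100_000 / spacing)
n_selected = last_index - first_index + 1

# 1250 time samples of float32 (4 bytes each) per selected channel.
n_time = 1250
size_mib = n_time * n_selected * 4 / 2**20

print(first_index * spacing, last_index * spacing, n_selected, round(size_mib, 1))
```

The selected bounds never fall exactly on 20 000 and 100 000 because the requested labels lie between samples, so the nearest samples inside the interval are used.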

This virtually consolidated dataset can be saved to disk without copying the data: the resulting file only contains pointers to where the data is located.

In [8]:
# Xdas tries to write data virtually by default
da.to_netcdf("outputs/ntp.nc", virtual=True)