# h5py

* Reading in a file
    * Valid modes of this function
    * Drivers
    * Version
    * Filenames
* Closing files
* Groups
    * Creating groups
    * Dictionary similarities
* Datasets
    * Creating datasets
    * Reading and writing datasets
    * Multiple indexing
    * Length and iteration
    * Resizable datasets
    * Empty or null datasets
* Attributes
* Dimension scales
* Strings
* Variable length strings
* Variable length data
* Object names
* Hardlinks
* Softlinks
* References
    * Object References
    * Region References
    * References in a dataset
    * References in an attribute
    * Null References
* Userblock
* Opaque data
* Single Writer Multiple Reader
* Virtual Datasets


## Reading in a file
To open or create a file, use `h5py.File()`.

In [1]:
import h5py

f = h5py.File('VNP46A2.A2022130.h05v05.001.2022138144341.h5', "r")

### Valid modes of this function are:
`r` - For reading only, the file must exist (it is also the default).

`r+` - For reading and writing, the file must exist.

`w` - Used to create a file or truncate if it exists.

`w-` or `x` - For creating a file, the command fails if the file already exists.

`a` - For reading and writing if it already exists, creates a file otherwise. 
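
As a rough sketch of how the modes differ, the snippet below works on a throwaway file in a temporary directory. The file name `modes_demo.h5` and the group names are purely illustrative, and the sketch assumes that `w-` reports an existing file by raising an `OSError` (or a subclass of it).

```python
import os
import tempfile

import h5py

path = os.path.join(tempfile.mkdtemp(), "modes_demo.h5")  # hypothetical file name

# 'w' creates the file (or truncates an existing one)
with h5py.File(path, "w") as f:
    f.create_group("first")

# 'w-' (or 'x') refuses to touch a file that already exists
try:
    h5py.File(path, "w-")
    exclusive_create_failed = False
except OSError:
    exclusive_create_failed = True

# 'a' opens the existing file for reading and writing, keeping its contents
with h5py.File(path, "a") as f:
    f.create_group("second")

# 'r' opens the file read-only
with h5py.File(path, "r") as f:
    names = sorted(f.keys())

print(exclusive_create_failed, names)
```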

### Drivers

In [2]:
# f = h5py.File('myFile.hdf', driver=<driver name>, <driver_kwds>)

f = h5py.File('VNP46A2.A2022130.h05v05.001.2022138144341.h5', driver=None)


HDF5 ships with a variety of different low-level drivers, which map the logical HDF5 address space to different storage mechanisms. You can specify which driver you want to use when the file is opened. 

`None` is the recommended option, as it uses the standard HDF5 driver for the current platform (`H5FD_WINDOWS` on Windows, `H5FD_SEC2` on UNIX).

`core` allows one to store and manipulate the data in memory and optionally write it back out when the file is closed. Using this with an existing file and a reading mode will read the entire file into memory. 

`family` is used to store the file on disk as a series of fixed length chunks. 
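
To illustrate the `core` driver, here is a minimal sketch that keeps a file purely in memory. It assumes the `backing_store` keyword of `h5py.File`, which controls whether the in-memory image is written back to disk on close; the file and dataset names are hypothetical.

```python
import os
import tempfile

import h5py
import numpy as np

path = os.path.join(tempfile.mkdtemp(), "in_memory.h5")  # hypothetical file name

# backing_store=False: changes live in memory only and are discarded on close
with h5py.File(path, "w", driver="core", backing_store=False) as f:
    f["values"] = np.arange(5)
    driver_name = f.driver            # name of the driver in use
    total = int(f["values"][:].sum())

print(driver_name, total)
```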

### Version

In [3]:
f = h5py.File('newfile1.hdf5', "w", libver='earliest')    # most compatible
f = h5py.File('newfile2.hdf5', "w", libver='latest')      # most modern
f = h5py.File('newfile3.hdf5', "w", libver=('earliest', 'v108')) 

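By default, objects are written in the most compatible fashion. The `libver` option of `File()` controls which newer HDF5 features may be used: pass a single version name such as `'earliest'` or `'latest'`, or a `(low, high)` tuple to bound the range. Specific versions can also be named, for example `v108` for HDF5 1.8 and `v110` for HDF5 1.10, as in the third line above.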

### Filenames
Different operating systems (and different file systems) store filenames with different encodings. Additionally, in Python there are at least two different representations of filenames, as encoded bytes or as a Unicode string (str on Python 3).
h5py’s high-level interfaces always return filenames as str, e.g. `File.filename`. h5py accepts filenames as either str or bytes. In most cases, using Unicode (str) paths is preferred (to be used on macOS and Windows), but there are some caveats.

## Closing files
To close a file, either leave the `with h5py.File(...)` block or call `File.close()`. Any groups or datasets belonging to the file are then unusable. Otherwise, if a file object simply goes out of scope in your Python code, the file is only closed once no objects belonging to it remain.

In [4]:
import h5py

with h5py.File('VNP46A2.A2022130.h05v05.001.2022138144341.h5', 'r') as f1:
    ds = f1['HDFEOS']

# ds[0]       # ERROR - can't access the object, because f1 is closed

def get_dataset():
    f2 = h5py.File('VNP46A2.A2022130.h05v05.001.2022138144341.h5', 'r')
    # Return the Dataset object itself; slicing it with [:] here would return a NumPy copy instead
    return f2['HDFEOS']['GRIDS']['VNP_Grid_DNB']['Data Fields']['Latest_High_Quality_Retrieval']

ds = get_dataset()

ds[0]         # OK - f2 is out of scope, but the dataset reference keeps the file open

del ds        # Now the file will be closed
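
The difference between a live handle and data already copied into memory can be sketched as follows. This is an illustration on a hypothetical scratch file, and it relies on the truth value of an h5py object reflecting whether its underlying handle is still valid.

```python
import os
import tempfile

import h5py
import numpy as np

path = os.path.join(tempfile.mkdtemp(), "close_demo.h5")  # hypothetical file name

with h5py.File(path, "w") as f:
    dset = f.create_dataset("values", data=np.arange(3))
    valid_while_open = bool(dset)   # the handle is usable while the file is open
    in_memory = dset[:]             # a NumPy copy, independent of the file

valid_after_close = bool(dset)      # the handle was invalidated when the file closed
print(valid_while_open, valid_after_close, in_memory)
```

Data sliced out before the file closes stays usable, whereas the `Dataset` object itself does not.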

## Groups

Groups are the container mechanism of HDF5 files. They work like dictionaries, with member names as the keys and datasets or other groups as the values. Another way to think of groups is as folders that hold the datasets.

### Creating groups


In [5]:
f = h5py.File('createdExample.hdf5','w')
print(f.name)

print(list(f.keys()))

grp = f.create_group("bar")
print(grp.name)

subgrp = grp.create_group("baz")
print(subgrp.name)

grp2 = f.create_group("/some/long/path")
print(grp2.name)

grp3 = f['/some/long']
print(grp3.name)

print(list(f.keys()))

/
[]
/bar
/bar/baz
/some/long/path
/some/long
['bar', 'some']


### Dictionary similarities
Groups share several features with dictionaries, including `keys()`, `values()`, indexing syntax, and support for iteration.

In [6]:
f = h5py.File('VNP46A2.A2022130.h05v05.001.2022138144341.h5', "r")

print(list(f.keys()))

subgrp = f['HDFEOS']['GRIDS']['VNP_Grid_DNB']['Data Fields']
print(subgrp)
print(subgrp.values())

# missing = subgrp["missing"]   # raises an error

['HDFEOS', 'HDFEOS INFORMATION']
<HDF5 group "/HDFEOS/GRIDS/VNP_Grid_DNB/Data Fields" (7 members)>
ValuesViewHDF5(<HDF5 group "/HDFEOS/GRIDS/VNP_Grid_DNB/Data Fields" (7 members)>)


Objects can be deleted from a group with the `del` statement, for example `del subgrp["name"]`, provided the file was opened in a writable mode.
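
A minimal sketch of deletion on a scratch file (the file, group, and dataset names are hypothetical):

```python
import os
import tempfile

import h5py
import numpy as np

path = os.path.join(tempfile.mkdtemp(), "delete_demo.h5")  # hypothetical file name

with h5py.File(path, "w") as f:
    grp = f.create_group("bar")
    grp.create_dataset("values", data=np.arange(4))

    before = list(grp.keys())
    del grp["values"]            # removes the link named "values" from the group
    after = list(grp.keys())

    del f["bar"]                 # groups are deleted the same way
    root_after = list(f.keys())

print(before, after, root_after)
```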

## Datasets
Datasets are similar to NumPy arrays: they are homogeneous collections of data elements with an immutable data type. They are represented by a thin proxy class which supports NumPy operations like slicing, along with a variety of descriptive attributes: `shape`, `size`, `ndim`, `dtype`, and `nbytes`.

### Creating datasets


In [7]:
f = h5py.File('createdExample1.hdf5','w')
print(f.name)

dset = f.create_dataset("default", (100,))
print(dset)
dset = f.create_dataset("ints", (100,), dtype='i8')
print(dset)

/
<HDF5 dataset "default": shape (100,), type "<f4">
<HDF5 dataset "ints": shape (100,), type "<i8">


New datasets are created using either `Group.create_dataset()` or `Group.require_dataset()`. Existing datasets should be retrieved using the group indexing syntax (`dset = group["name"]`).
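
As a sketch of how `require_dataset()` differs from `create_dataset()`: it creates the dataset on the first call and returns the existing one on later calls, as long as the requested shape and dtype are compatible. The names below are illustrative only.

```python
import os
import tempfile

import h5py

path = os.path.join(tempfile.mkdtemp(), "require_demo.h5")  # hypothetical file name

with h5py.File(path, "w") as f:
    created = f.require_dataset("grid", shape=(100,), dtype="f4")   # not present yet: created
    created[0] = 1.5

    reopened = f.require_dataset("grid", shape=(100,), dtype="f4")  # present: returned as is
    same_object = (created == reopened)
    first_value = float(reopened[0])

    by_index = f["grid"]          # the usual way to retrieve an existing dataset
    by_index_shape = by_index.shape

print(same_object, first_value, by_index_shape)
```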

To initialise a dataset, all you have to do is specify a name, shape, and optionally the data type (defaults to 'f').

You may also initialize the dataset to an existing numpy array by providing the data parameter:

In [8]:
import numpy as np

arr = np.arange(100)
dset = f.create_dataset("init", data=arr)
print(dset)
print(arr)

<HDF5 dataset "init": shape (100,), type "<i8">
[ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47
 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71
 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95
 96 97 98 99]


Keywords shape and dtype may be specified along with data; if so, they will override `data.shape` and `data.dtype`. It’s required that (1) the total number of points in shape match the total number of points in `data.shape`, and that (2) it’s possible to cast `data.dtype` to the requested dtype.
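
The override rule can be sketched like this: a 100-element integer array is stored as a 10 x 10 dataset of 32-bit floats. The element counts match and the cast is possible, so both requirements are met. File and dataset names are hypothetical.

```python
import os
import tempfile

import h5py
import numpy as np

path = os.path.join(tempfile.mkdtemp(), "override_demo.h5")  # hypothetical file name

arr = np.arange(100)              # shape (100,), integer dtype

with h5py.File(path, "w") as f:
    # shape and dtype take precedence over arr.shape and arr.dtype
    dset = f.create_dataset("grid", data=arr, shape=(10, 10), dtype="f4")
    stored_shape = dset.shape
    stored_dtype = dset.dtype
    sample = float(dset[1, 0])    # element 10 of the original flat array

print(stored_shape, stored_dtype, sample)
```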

### Reading and writing datasets

In [9]:
dset = f.create_dataset("MyDataset", (10,10,10), 'f')
print(dset[0,0,0])
print()
print(dset[0,2:10,1:9:3])
print()
print(dset[:,::2,5])
print()
print(dset[0])
print()
print(dset[1,5])
print()
print(dset[0,...])
print()
print(dset[...,6])
print()
print(dset[()])

0.0

[[0. 0. 0.]
 [0. 0. 0.]
 [0. 0. 0.]
 [0. 0. 0.]
 [0. 0. 0.]
 [0. 0. 0.]
 [0. 0. 0.]
 [0. 0. 0.]]

[[0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0.]]

[[0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]]

[0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]

[[0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]]

[[0. 0. 0. 0. 0. 0 ... (output truncated)

The following NumPy slicing arguments are recognized:

- Indices: anything that can be converted to a Python int
- Slices (i.e. `[:]` or `[0:10]`)
- Field names, in the case of compound data
- At most one Ellipsis (...) object
- An empty tuple (`()`) to retrieve all data or scalar data

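For compound data, it is advised to separate field names from the numeric slices, for example `dset.fields("FieldA")[:10]` to read a single field, or `dset[:10]["FieldA"]` to read all fields and select one in NumPy.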
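
A small sketch of both approaches on a compound dataset. The field names `x`, `y`, and `label` are made up for the illustration, and `Dataset.fields()` is assumed to be available (it was added in h5py 3.0).

```python
import os
import tempfile

import h5py
import numpy as np

path = os.path.join(tempfile.mkdtemp(), "compound_demo.h5")  # hypothetical file name

point_dtype = np.dtype([("x", "f4"), ("y", "f4"), ("label", "i4")])  # hypothetical fields
points = np.zeros(5, dtype=point_dtype)
points["x"] = np.arange(5)

with h5py.File(path, "w") as f:
    dset = f.create_dataset("points", data=points)
    x_from_file = dset.fields("x")[:3]   # reads only field "x" of the first three rows
    x_in_numpy = dset[:3]["x"]           # reads every field of three rows, then selects in NumPy

print(x_from_file, x_in_numpy)
```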

For simple slicing, broadcasting is supported:

In [10]:
dset[0,:,:] = np.arange(10)   # Broadcasts to (10,10)
print(dset[()])

[[[0. 1. 2. 3. 4. 5. 6. 7. 8. 9.]
  [0. 1. 2. 3. 4. 5. 6. 7. 8. 9.]
  [0. 1. 2. 3. 4. 5. 6. 7. 8. 9.]
  [0. 1. 2. 3. 4. 5. 6. 7. 8. 9.]
  [0. 1. 2. 3. 4. 5. 6. 7. 8. 9.]
  [0. 1. 2. 3. 4. 5. 6. 7. 8. 9.]
  [0. 1. 2. 3. 4. 5. 6. 7. 8. 9.]
  [0. 1. 2. 3. 4. 5. 6. 7. 8. 9.]
  [0. 1. 2. 3. 4. 5. 6. 7. 8. 9.]
  [0. 1. 2. 3. 4. 5. 6. 7. 8. 9.]]

 [[0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
  [0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
  [0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
  [0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
  [0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
  [0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
  [0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
  [0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
  [0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
  [0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]]

 [[0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
  [0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
  [0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
  [0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
  [0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
  [0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
  [0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
  [0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
  [0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
  [0. 0. 0 ... (output truncated)

### Multiple indexing
Indexing a dataset once loads a NumPy array into memory. If you try to index it twice to write data, you may be surprised that nothing seems to have happened:

In [11]:
f = h5py.File('my_hdf5_file.h5', 'w')
dset = f.create_dataset("test", (2, 2))
dset[0][1] = 3.0      # No effect!
print(dset[0][1])

0.0


The assignment above only modifies the loaded array. It’s equivalent to this:

In [12]:
new_array = dset[0]
new_array[1] = 3.0
print(new_array[1])     # 3.0

print(dset[0][1])       # 0.0

3.0
0.0


To write to the dataset, combine the indexes in a single step:

In [13]:
dset[0, 1] = 3.0
print(dset[0, 1])

3.0


### Length and iteration 

As with NumPy arrays, the `len()` of a dataset is the length of the first axis, and iterating over a dataset iterates over the first axis. However, modifications to the yielded data are not recorded in the file. Resizing a dataset while iterating has undefined results.
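
The following sketch shows `len()`, iteration over the first axis, and the fact that changing a yielded row does not touch the file. Names are illustrative.

```python
import os
import tempfile

import h5py
import numpy as np

path = os.path.join(tempfile.mkdtemp(), "iter_demo.h5")  # hypothetical file name

with h5py.File(path, "w") as f:
    dset = f.create_dataset("grid", data=np.arange(6).reshape(3, 2))
    n_rows = len(dset)            # length of the first axis

    for row in dset:              # each row is a NumPy array read from the file
        row[:] = -1               # modifies the in-memory copy only

    unchanged = dset[:].tolist()  # the stored data is as it was

print(n_rows, unchanged)
```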

### Resizable datasets

Datasets can be resized once created up to a maximum size, by calling `Dataset.resize()`. You specify this maximum size when creating the dataset, via the keyword `maxshape`.
Any (or all) axes may also be marked as “unlimited”, in which case they may be increased up to the HDF5 per-axis limit of 2\*\*64 elements. Indicate these axes using `None`.

In [14]:
dset = f.create_dataset("resizable", (10,10), maxshape=(500, 20))
dset = f.create_dataset("unlimited", (10, 10), maxshape=(None, 10))
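
A sketch of actually resizing such a dataset, using a scratch file with a hypothetical name. It assumes the default fill value of zero for newly added elements.

```python
import os
import tempfile

import h5py

path = os.path.join(tempfile.mkdtemp(), "resize_demo.h5")  # hypothetical file name

with h5py.File(path, "w") as f:
    dset = f.create_dataset("unlimited", (10, 10), maxshape=(None, 10))
    dset[:] = 1.0

    dset.resize((20, 10))                 # grow the unlimited first axis
    grown_shape = dset.shape
    new_row_sum = float(dset[15].sum())   # rows added by the resize hold the fill value

    dset.resize(5, axis=0)                # shrink along a single axis
    shrunk_shape = dset.shape

print(grown_shape, new_row_sum, shrunk_shape)
```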

### Empty or null datasets
Empty (null) datasets and attributes are not the same as an array with a shape of `()`, or a scalar dataspace in HDF5 terms. Instead, an empty dataset has an associated type, no data, and no shape. In h5py, this is represented as either a dataset with shape `None`, or an instance of `h5py.Empty`. Empty datasets and attributes cannot be sliced.

To create an empty attribute, use `h5py.Empty`. Reading an empty attribute returns `h5py.Empty`. 

To create an empty dataset, either pass a dtype but no shape to `create_dataset()`, or set `data` to an instance of `h5py.Empty`.

In [24]:
f.close()
f = h5py.File('my_hdf5_file.h5', 'w')

EmptyDset1 = f.create_dataset("EmptyDataset1", dtype="f")

#EmptyDset2 = f.create_dataset("EmptyDataset2", data=h5py.Empty("f"))

An empty dataset has shape defined as `None`, which is the best way of determining whether a dataset is empty or not. 

In [25]:
print(EmptyDset1)

<HDF5 dataset "EmptyDataset1": shape None, type "<f4">


The dtype of the dataset can be accessed via `<dset>.dtype`.
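
As an illustration of the `h5py.Empty` route for both a dataset and an attribute (file, dataset, and attribute names are hypothetical):

```python
import os
import tempfile

import h5py

path = os.path.join(tempfile.mkdtemp(), "empty_demo.h5")  # hypothetical file name

with h5py.File(path, "w") as f:
    empty_dset = f.create_dataset("EmptyDataset", data=h5py.Empty("f"))
    dset_is_empty = empty_dset.shape is None      # shape None marks an empty dataset
    dset_dtype = empty_dset.dtype                 # the type is still recorded

    f.attrs["EmptyAttr"] = h5py.Empty("f")
    attr_is_empty = isinstance(f.attrs["EmptyAttr"], h5py.Empty)

print(dset_is_empty, dset_dtype, attr_is_empty)
```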

## Attributes
Attributes are small named pieces of data attached directly to Group and Dataset objects. This is the official way to store metadata in HDF5.
Each Group or Dataset has a small proxy object attached to it, at `<obj>.attrs`. Attributes have the following properties:
- They may be created from any scalar or NumPy array
- There is no partial I/O (i.e. slicing); the entire attribute must be read.
- The `.attrs` proxy objects are of class `AttributeManager`, below. This class supports a dictionary-style interface.
- By default, attributes are iterated in alphanumeric order. However, if a group or dataset is created with `track_order=True`, the attribute insertion order is remembered (tracked) in the HDF5 file, and iteration uses that order. The latter is consistent with Python 3.7+ dictionaries. The default `track_order` for all new groups and datasets can be specified globally with `h5py.get_config().track_order`.

`__iter__()` Get an iterator over attribute names.

`__contains__(name)` Determine if attribute name is attached to this object.

`__getitem__(name)` Retrieve an attribute.

`__setitem__(name, val)` Create an attribute, overwriting any existing attribute. The type and shape of the attribute are determined automatically by h5py.

`__delitem__(name)` Delete an attribute. KeyError if it doesn’t exist.

`keys()` Get the names of all attributes attached to this object. Returns set-like object.

`values()` Get the values of all attributes attached to this object. Returns collection or bag-like object.

`items()` Get (name, value) tuples for all attributes attached to this object. Returns collection or set-like object.

`get(name, default=None)` Retrieve name, or default if no such attribute exists.

`get_id(name)` Get the low-level AttrID for the named attribute.

`create(name, data, shape=None, dtype=None)` Create a new attribute, with control over the shape and type. Any existing attribute will be overwritten.
 Parameters: 
- name (String) – Name of the new attribute
- data – Value of the attribute; will be put through numpy.array(data).
- shape (Tuple) – Shape of the attribute. Overrides data.shape if both are given, in which case the total number of points must be unchanged.
- dtype (NumPy dtype) – Data type for the attribute. Overrides data.dtype if both are given.

`modify(name, value)`
Change the value of an attribute while preserving its type and shape. Unlike `AttributeManager.__setitem__()`, if the attribute already exists, only its value will be changed. This can be useful for interacting with externally generated files, where the type and shape must not be altered. If the attribute doesn’t exist, it will be created with a default shape and type. Parameters:
- name (String) – Name of attribute to modify.
- value – New value. Will be put through numpy.array(value).
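
The methods listed above can be exercised with a short sketch like the one below. The dataset and attribute names (`temperature`, `units`, `scale`, `offset`) are invented for the example.

```python
import os
import tempfile

import h5py

path = os.path.join(tempfile.mkdtemp(), "attrs_demo.h5")  # hypothetical file name

with h5py.File(path, "w") as f:
    dset = f.create_dataset("temperature", (10,), dtype="f4")  # hypothetical dataset

    dset.attrs["units"] = "kelvin"               # __setitem__: type and shape inferred
    dset.attrs.create("scale", 2.5, dtype="f8")  # create: explicit control over the dtype
    dset.attrs.modify("scale", 3.0)              # modify: keeps the existing type and shape

    names = sorted(dset.attrs.keys())
    has_units = "units" in dset.attrs            # __contains__
    scale = float(dset.attrs["scale"])           # __getitem__
    missing = dset.attrs.get("offset", 0.0)      # default for an absent attribute

    del dset.attrs["units"]                      # __delitem__
    names_after_delete = list(dset.attrs)        # __iter__ yields attribute names

print(names, has_units, scale, missing, names_after_delete)
```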

## Dimension scales

HDF5 allows the dimensions of data to be labeled. In the example below, the first dimension, which has a length of 4, is labeled “z”; the third dimension (in this case the fastest varying dimension) is labeled “x”; and the second dimension is given no label at all.

In [26]:
f = h5py.File('example123.h5', 'w')
f['data'] = np.ones((4, 3, 2), 'f')

f['data'].dims[0].label = 'z'
f['data'].dims[2].label = 'x'

We can also use HDF5 datasets as dimension scales. Below, we create the x1, x2, y1, and z1 datasets in the same file and mark them as dimension scales:

In [27]:
f['x1'] = [1, 2]
f['x2'] = [1, 1.1]
f['y1'] = [0, 1, 2]
f['z1'] = [0, 1, 4, 9]

f['x1'].make_scale()
f['x2'].make_scale('x2 name')
f['y1'].make_scale('y1 name')
f['z1'].make_scale('z1 name')

When you create a dimension scale, you may provide a name for that scale. In this case, the x1 scale was not given a name, but the others were. Now we can associate these dimension scales with the primary dataset. Note that two dimension scales were associated with the third dimension of data. You can also detach a dimension scale.

In [28]:
f['data'].dims[0].attach_scale(f['z1'])
f['data'].dims[1].attach_scale(f['y1'])
f['data'].dims[2].attach_scale(f['x1'])
f['data'].dims[2].attach_scale(f['x2'])

f['data'].dims[2].detach_scale(f['x2'])

For now, let's assume that both x1 and x2 are still associated with the third dimension of data. You can attach a dimension scale to any number of HDF5 datasets, and you can even attach it to multiple dimensions of a single HDF5 dataset. Now that the dimensions of data have been labeled, and the dimension scales for the various axes have been specified, we have provided much more context with which data can be interpreted. For example, if you want to know the labels for the various dimensions of data:

In [29]:
print([dim.label for dim in f['data'].dims])

['z', '', 'x']


To get the names of the dimension scales associated with the “x” axis, use `keys()`; `items()` and `values()` methods are also provided.

In [30]:
print(f['data'].dims[2].keys())

['']


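The dimension scales themselves can also be accessed by index or by name, for example `f['data'].dims[2][1]` or `f['data'].dims[2]['x2 name']`, both of which return `f['x2']` while that scale is attached. In the cell below the index is commented out because `x2` was detached above: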

In [31]:
print(f['data'].dims[2])#[1]

# or:

print(f['data'].dims[2])#['x2 name'])
# if:
print(True == f['data'].dims[2] == f['x2'])

<"x" dimension 2 of HDF5 dataset at 4602904384>
<"x" dimension 2 of HDF5 dataset at 4602904384>
False


Beware, though, that if you index the dimension scales with a string, the first dimension scale whose name matches the string is the one that is returned. There is no guarantee that the name of a dimension scale is unique.

Nested dimension scales are not permitted: if a dataset has a dimension scale attached to it, converting the dataset to a dimension scale will fail, since the HDF5 specification doesn’t allow this.
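
To see scale lookup with both scales still attached, here is a self-contained sketch on a fresh scratch file. It reuses the dataset and scale names from this section; only the file name is new.

```python
import os
import tempfile

import h5py
import numpy as np

path = os.path.join(tempfile.mkdtemp(), "scales_demo.h5")  # hypothetical file name

with h5py.File(path, "w") as f:
    f["data"] = np.ones((4, 3, 2), "f")
    f["x1"] = [1, 2]
    f["x2"] = [1, 1.1]
    f["x1"].make_scale()              # unnamed scale
    f["x2"].make_scale("x2 name")     # named scale

    x_dim = f["data"].dims[2]
    x_dim.label = "x"
    x_dim.attach_scale(f["x1"])
    x_dim.attach_scale(f["x2"])

    n_scales = len(x_dim)                      # number of scales attached to this dimension
    scale_by_name = x_dim["x2 name"]           # first scale whose name matches
    found_x2 = (scale_by_name == f["x2"])
    x2_values = scale_by_name[:].tolist()

print(n_scales, found_x2, x2_values)
```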

## Strings

String data in HDF5 datasets is read as bytes by default: bytes objects for variable-length strings, or numpy bytes arrays ('S' dtypes) for fixed-length strings. Use `Dataset.asstr()` to retrieve str objects.

Variable-length strings in attributes are read as str objects. These are decoded as UTF-8 with surrogate escaping for unrecognised bytes.
When creating a new dataset or attribute, Python str or bytes objects will be treated as variable-length strings, marked as UTF-8 and ASCII respectively. Numpy bytes arrays ('S' dtypes) make fixed-length strings. You can use `string_dtype()` to explicitly specify any HDF5 string datatype.
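
A brief sketch of the read-side behaviour described above, assuming h5py 3.0 or later (where `Dataset.asstr()` exists and dataset strings are returned as bytes). The file, dataset, and attribute names are hypothetical.

```python
import os
import tempfile

import h5py

path = os.path.join(tempfile.mkdtemp(), "strings_demo.h5")  # hypothetical file name

with h5py.File(path, "w") as f:
    names = f.create_dataset("names", (2,), dtype=h5py.string_dtype(encoding="utf-8"))
    names[0] = "alpha"
    names[1] = "beta"

    raw = names[0]                     # dataset strings are read as bytes by default
    text = names.asstr()[0]            # asstr() decodes to str on read
    all_text = list(names.asstr()[:])

    f.attrs["title"] = "Demo"          # a str becomes a variable-length UTF-8 string
    title = f.attrs["title"]           # attribute strings are read back as str

print(raw, text, all_text, title)
```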

When writing data to an existing dataset or attribute, data passed as bytes is written without checking the encoding. Data passed as Python str objects is encoded as either ASCII or UTF-8, based on the HDF5 datatype. In either case, null bytes ('\x00') in the data will cause an error.

NumPy also has a Unicode type, a UTF-32 fixed-width format (4-byte characters). HDF5 has no support for wide characters. Rather than trying to hack around this and “pretend” to support it, h5py will raise an error if you try to store data of this type.

If you have a non-text blob in a Python byte string (as opposed to ASCII or UTF-8 encoded text, which is fine), you should wrap it in a void type for storage. This will map to the HDF5 OPAQUE datatype, and will prevent your blob from getting mangled by the string machinery.
Here’s an example of how to store binary data in an attribute, and then recover it:

In [32]:
dset = f.create_dataset("test", (5,5,5))

binary_blob = b"Hello\x00Hello\x00"
dset.attrs["attribute_name"] = np.void(binary_blob)
out = dset.attrs["attribute_name"]
binary_blob = out.tobytes()

## Variable length strings
In HDF5, data in variable length (VL) format is stored as arbitrary-length vectors of a base type. In particular, strings are stored C-style in null-terminated buffers. NumPy has no native mechanism to support this. Unfortunately, this is the standard for representing strings in the HDF5 C API, and in many HDF5 applications. 

Thankfully, NumPy has a generic pointer type in the form of the “object” (“O”) dtype. In h5py, variable-length strings are mapped to object arrays. A small amount of metadata attached to an “O” dtype tells h5py that its contents should be converted to VL strings when stored in the file.

Existing VL strings can be read and written to with no additional effort; Python strings and fixed-length NumPy strings can be auto-converted to VL data and stored. 

In [33]:
f = h5py.File('newexample1.hdf5', "w")
dt = h5py.string_dtype(encoding='utf-8')
ds = f.create_dataset('VLDS', (100,100), dtype=dt)
print(ds.dtype.kind)

print(h5py.check_string_dtype(ds.dtype))

O
string_info(encoding='utf-8', length=None)


## Variable length data
Starting with h5py 2.3, variable-length types are not restricted to strings. For example, you can create a “ragged” array of integers. Single elements are read as NumPy arrays and multidimensional selections produce an object array whose members are integer arrays. 

In [36]:
dt = h5py.vlen_dtype(np.dtype('int32'))
dset = f.create_dataset('vlen_int', (100,), dtype=dt)
dset[0] = [1,2,3]
dset[1] = [1,2,3,4,5]

# >>> dset[0]
# array([1, 2, 3], dtype=int32)

# >>> dset[0:2]
# array([array([1, 2, 3], dtype=int32), array([1, 2, 3, 4, 5], dtype=int32)], dtype=object)

We should note that NumPy doesn’t support ragged arrays, and the ‘arrays of arrays’ h5py uses as a workaround are not as convenient or efficient as regular NumPy arrays. 

## Object names
Unicode strings are used exclusively for object names in the file.

You can supply either byte or unicode strings when creating or retrieving objects. If a byte string is supplied, it will be used as-is; Unicode strings will be encoded as UTF-8.

In the file, h5py uses the most-compatible representation; H5T_CSET_ASCII for characters in the ASCII range; H5T_CSET_UTF8 otherwise.

In [37]:
print(f.name)

dset1 = f.create_dataset(b"name", (5,5,5))    # byte string name, used as-is
print(dset1)
dset2 = f.create_dataset("name2", (2,2,2))    # Unicode name, encoded as UTF-8
print(dset2)

/
<HDF5 dataset "name": shape (5, 5, 5), type "<f4">
<HDF5 dataset "name2": shape (2, 2, 2), type "<f4">


## Hardlinks
When an object is assigned to a name in a group, the default for NumPy arrays or other data is to create an HDF5 dataset:

In [38]:
import h5py
import numpy as np

f = h5py.File('example.hdf5','w')

grp = f.create_group("group1")

grp["name"] = 42
out = grp["name"]
print(out)

<HDF5 dataset "name": shape (), type "<i8">


When the object is an existing group/dataset, a new link is made (the dataset is not copied).

In [39]:
grp["other name"] = out
print(grp["other name"])

print(grp["other name"] == grp["name"])

<HDF5 dataset "other name": shape (), type "<i8">
True


## Softlinks 
HDF5 groups can contain a text path instead of a pointer to the object itself. You can create these in h5py by using `h5py.SoftLink`. If the target is removed, they will dangle.

In [40]:
myfile = h5py.File('examplename.hdf5','w')
group = myfile.create_group("somegroup")
myfile["alias"] = h5py.SoftLink('/somegroup')

del myfile['somegroup']
#print(myfile['alias'])       # this would raise a KeyError, because the link now dangles
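
The sketch below inspects a soft link without following it, using `Group.get(..., getlink=True)`, and then shows the dangling case. It works on a scratch file with a hypothetical name.

```python
import os
import tempfile

import h5py

path = os.path.join(tempfile.mkdtemp(), "softlink_demo.h5")  # hypothetical file name

with h5py.File(path, "w") as f:
    f.create_group("somegroup")
    f["alias"] = h5py.SoftLink("/somegroup")

    same_target = (f["alias"] == f["somegroup"])   # the link resolves to the group
    link = f.get("alias", getlink=True)            # the link object itself, not its target
    link_path = link.path

    del f["somegroup"]                             # the soft link now dangles
    try:
        f["alias"]
        dangling_raises = False
    except KeyError:
        dangling_raises = True

print(same_target, link_path, dangling_raises)
```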

## References

### Object References
Every high-level object in h5py has a read-only property `ref`, which when accessed returns a new object reference.

“Dereferencing” these objects is straightforward; use the same syntax as when opening any other object.

In [41]:
myfile = h5py.File('myfile.hdf5', 'w')
grp2 = myfile.create_group("/some/group")
mygroup = myfile['/some/group']
ref = mygroup.ref
print(ref)

mygroup2 = myfile[ref]
print(mygroup2)

<HDF5 object reference>
<HDF5 group "/some/group" (0 members)>


### Region References
Region references always contain a selection. You create them using the dataset property `regionref` and standard NumPy slicing syntax:

In [42]:
myds = myfile.create_dataset('dset', (200,200))
regref = myds.regionref[0:10, 0:5]
print(regref)

<HDF5 region reference>


The reference itself can now be used in place of slicing arguments to the dataset:

In [43]:
subset = myds[regref]

For selections which don’t conform to a regular grid, h5py copies the behavior of NumPy’s fancy indexing, which returns a 1D array. Note that for h5py releases before 2.2, h5py always returns a 1D array.

In addition to storing a selection, region references inherit from object references, and can be used anywhere an object reference is accepted. In this case the object they point to is the dataset used to create them.
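
Both uses of a region reference, as a slicing argument and as an object reference, can be sketched as follows (scratch file, hypothetical file name):

```python
import os
import tempfile

import h5py
import numpy as np

path = os.path.join(tempfile.mkdtemp(), "regref_demo.h5")  # hypothetical file name

with h5py.File(path, "w") as f:
    dset = f.create_dataset("dset", data=np.arange(200 * 200).reshape(200, 200))
    regref = dset.regionref[0:10, 0:5]

    subset = dset[regref]             # the stored selection, read from the dataset
    source = f[regref]                # dereferenced like an object reference
    same_dataset = (source == dset)

print(subset.shape, same_dataset)
```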

### References in a dataset
These dtypes are available from h5py for references and region references:

- `h5py.ref_dtype` - for object references
- `h5py.regionref_dtype` - for region references

To store an array of references, use the appropriate dtype when creating the dataset. You can read from and write to the array as normal:

In [44]:
ref_dataset = myfile.create_dataset("MyRefs", (100,), dtype=h5py.ref_dtype)

ref_dataset[0] = myfile.ref
print(ref_dataset[0])

<HDF5 object reference>


### References in an attribute
Simply assign the reference to a name; h5py will figure it out and store it with the correct type:

In [45]:
myref = myfile.ref
myfile.attrs["Root group reference"] = myref

### Null References
When you create a dataset of reference type, the uninitialized elements are “null” references. h5py uses the truth value of a reference object to indicate whether or not it is null:

In [46]:
print(bool(myfile.ref))

nullref = ref_dataset[50]
print(bool(nullref))

True
False


## Userblock

HDF5 allows the user to insert arbitrary data at the beginning of the file, in a reserved space called the user block. 
The length of the user block must be specified when the file is created. It can be either zero (the default) or a power of two greater than or equal to 512. 
You can specify the size of the user block when creating a new file, via the `userblock_size` keyword to `File`; the userblock size of an open file can likewise be queried through the `File.userblock_size` property.
Modifying the user block on an open file is not supported; this is a limitation of the HDF5 library. However, once the file is closed you are free to read and write data at the start of the file, provided your modifications don’t leave the user block region.
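
A minimal sketch of reserving a user block and writing to it once the HDF5 file is closed. The file name and the header bytes are made up for the illustration.

```python
import os
import tempfile

import h5py
import numpy as np

path = os.path.join(tempfile.mkdtemp(), "userblock_demo.h5")  # hypothetical file name

# Reserve a 512-byte user block when the file is created
with h5py.File(path, "w", userblock_size=512) as f:
    f["values"] = np.arange(3)
    reserved = f.userblock_size

# With the HDF5 file closed, plain file I/O may write inside the first 512 bytes
with open(path, "r+b") as raw:
    raw.write(b"my custom header")    # 16 bytes, well inside the user block

with open(path, "rb") as raw:
    header = raw.read(16)

# The HDF5 content is unaffected
with h5py.File(path, "r") as f:
    values = f["values"][:].tolist()
    reserved_on_reopen = f.userblock_size

print(reserved, header, values)
```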

## Opaque data (Date-time array)
Numpy `datetime64` and `timedelta64` dtypes have no equivalent in HDF5 (the HDF5 time type is broken and deprecated). h5py allows you to store such data with an HDF5 opaque type; it can be read back correctly by h5py, but won’t be interoperable with other tools.
Here’s an example of storing and reading a datetime array:

In [47]:
arr = np.array([np.datetime64('2019-09-22T17:38:30')])
myfile['data'] = arr.astype(h5py.opaque_dtype(arr.dtype))
print(myfile['data'][:])

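AttributeError: module 'h5py' has no attribute 'opaque_dtype'

This error indicates that the installed h5py predates `opaque_dtype`, which was added in h5py 3.0; on a newer release the cell prints the stored datetime array instead.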

##### `h5py.opaque_dtype(dt)`
Return a dtype like the input, tagged to be stored as HDF5 opaque type.

##### `h5py.check_opaque_dtype(dt)`
Return True if the dtype given is tagged to be stored as HDF5 opaque data.

## Single Writer Multiple Reader
Single Writer Multiple Reader (SWMR) allows simple concurrent reading of an HDF5 file while it is being written from another process.

The following basic steps are typically required by writer and reader processes:

- Writer process creates the target file and all groups, datasets and attributes.
- Writer process switches file into SWMR mode.
- Reader process can open the file with `swmr=True`.
- Writer writes and/or appends data to existing datasets (new groups and datasets cannot be created when in SWMR mode).
- Writer regularly flushes the target dataset to make it visible to reader processes.
- Reader refreshes target dataset before reading new meta-data and/or main data.
- Writer eventually completes and closes the file as normal.
- Reader can finish and close file as normal whenever it is convenient.

The following snippet demonstrates a SWMR writer appending to a single dataset:

In [48]:
f = h5py.File("swmr.h5", 'w', libver='latest')
arr = np.array([1,2,3,4])
dset = f.create_dataset("data", chunks=(2,), maxshape=(None,), data=arr)
f.swmr_mode = True
# Now it is safe for the reader to open the swmr.h5 file
for i in range(5):
    new_shape = ((i+1) * len(arr), )
    dset.resize( new_shape )
    dset[i*len(arr):] = arr
    dset.flush()
    # Notify the reader process that new data has been written
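
The reader side of the steps listed above might look like the sketch below. To stay self-contained it runs the writer and the reader one after the other in a single process; in real use the reader would be a separate process polling while the writer is still active. The file name is hypothetical.

```python
import os
import tempfile

import h5py
import numpy as np

path = os.path.join(tempfile.mkdtemp(), "swmr_demo.h5")  # hypothetical file name

# Writer side, condensed from the snippet above
with h5py.File(path, "w", libver="latest") as writer:
    dset = writer.create_dataset("data", chunks=(2,), maxshape=(None,),
                                 data=np.array([1, 2, 3, 4]))
    writer.swmr_mode = True
    dset.resize((8,))
    dset[4:] = [5, 6, 7, 8]
    dset.flush()                     # make the appended data visible to readers

# Reader side: open with swmr=True and refresh before each read
with h5py.File(path, "r", libver="latest", swmr=True) as reader:
    data = reader["data"]
    data.refresh()                   # reload the dataset's current extent and metadata
    latest_shape = data.shape
    latest_values = data[:].tolist()

print(latest_shape, latest_values)
```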

## Virtual datasets
Virtual datasets allow a number of real datasets to be mapped together into a single, sliceable dataset via an interface layer. The mapping can be made ahead of time, before the parent files are written, and is transparent to the parent dataset characteristics (SWMR, chunking, compression etc…). The datasets can be meshed in arbitrary combinations, and even the data type converted.

Once a virtual dataset has been created, it can be read just like any other HDF5 dataset.

##### Warning
Virtual dataset files cannot be opened with versions of the HDF5 library older than 1.10.

To make a virtual dataset using h5py, you need to:

- Create a VirtualLayout object representing the dimensions and data type of the virtual dataset.
- Create a number of VirtualSource objects, representing the datasets the array will be built from. These objects can be created either from an h5py Dataset, or from a filename, dataset name and shape. This can be done even before the source file exists.
- Map slices from the sources into the layout.
- Convert the VirtualLayout object into a virtual dataset in an HDF5 file.

The following snippet creates a virtual dataset to stack together four 1D datasets from separate files into a 2D dataset:

In [50]:
layout = h5py.VirtualLayout(shape=(4, 100), dtype='i4')

for n in range(1, 5):
    filename = "{}.h5".format(n)
    vsource = h5py.VirtualSource(filename, 'data', shape=(100,))
    layout[n - 1] = vsource

# Add virtual dataset to output file
with h5py.File("VDS.h5", 'w', libver='latest') as f:
    f.create_virtual_dataset('data', layout, fillvalue=-5)
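
For a version that can be run end to end, the sketch below first writes the four source files and then reads the virtual dataset back. Everything lives in a temporary directory, and the fill values written to each source (0 to 3) are arbitrary choices for the illustration.

```python
import os
import tempfile

import h5py
import numpy as np

tmpdir = tempfile.mkdtemp()

# Write four source files, each holding a 1D dataset named "data",
# and map each one onto a row of the layout
layout = h5py.VirtualLayout(shape=(4, 100), dtype="i4")
for n in range(4):
    source_path = os.path.join(tmpdir, "{}.h5".format(n))
    with h5py.File(source_path, "w") as src:
        src.create_dataset("data", data=np.full(100, n, dtype="i4"))
    layout[n] = h5py.VirtualSource(source_path, "data", shape=(100,))

# Convert the layout into a virtual dataset in an output file
vds_path = os.path.join(tmpdir, "VDS.h5")
with h5py.File(vds_path, "w", libver="latest") as f:
    f.create_virtual_dataset("data", layout, fillvalue=-5)

# Read it back like any other dataset
with h5py.File(vds_path, "r") as f:
    vds = f["data"]
    is_virtual = vds.is_virtual
    vds_shape = vds.shape
    row_values = [int(vds[n, 0]) for n in range(4)]

print(is_virtual, vds_shape, row_values)
```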

## Review Questions

Create a file. Create a hierarchy of groups and datasets with the following details:

- This
    - one
        - dataset1 (1-100)
    - two
- That
    - one
        - dataset2 (2D)
    - two
    - three
        - varlendataset (a variable length dataset) 

Print dataset2 using iteration.

Create a minimum of 2 attributes for each dataset.