In many experimental sciences, we acquire data from one or more instruments (cameras, photodetectors, other sensors, etc.).
Often this data is encoded in binary numeric types, structures of numeric types, and bit fields.

Here I assume we are dealing with "medium data", which, in [Wes McKinney's words](https://twitter.com/wesmckinn/status/413159516096585729), is:

> Medium Data (n): Not too big for a single machine, but too big to be dumb about.

We can easily and efficiently read this kind of data using the cornerstone of numeric computing in Python: the numpy library. In this post I show the powerful tools numpy
offers to read and "interpret" binary data.

## A binary example

Lately, I had to read data from a new instrument that sends 48-bit "words" to the PC.

The bit layout of the word is:

[placeholder]

Within each word, the byte order goes from byte1 (the first) to byte6 (the last).
The word consists of three numeric fields, `MT`, `mT` and `CH` (unsigned integers with bit widths of 29, 14 and 3 bits respectively), and a few flags (`#B`, `OV`).

For the numeric fields, the bit order in parentheses shows that the most significant bit (MSB) comes first when reading the word from the first to the last bit. The most significant byte therefore comes first as well, a byte order called [big-endian](https://en.wikipedia.org/wiki/Endianness) (often indicated by '>').

This example may seem fairly convoluted, but layouts like this are quite common in data from scientific instruments.
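To make the big-endian convention concrete, here is a minimal sketch that interprets the same four bytes with both byte orders. The byte values are arbitrary and chosen only for illustration; they are not taken from the instrument.

```python
import numpy as np

# Four arbitrary bytes, chosen only for illustration.
raw = bytes([0x12, 0x34, 0x56, 0x78])

# '>' = big-endian: the first byte is the most significant one.
big = int(np.frombuffer(raw, dtype='>u4')[0])
# '<' = little-endian: the first byte is the least significant one.
little = int(np.frombuffer(raw, dtype='<u4')[0])

print(hex(big))     # 0x12345678
print(hex(little))  # 0x78563412
```

The bytes in memory are identical in both cases; only the interpretation changes.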

Let's see how we can decode this data, easily and efficiently, using numpy.

## Numpy dtype magics

This section is a brief digression to introduce numpy's dtype system 
and array "views" of binary buffers. 

One of the most powerful features of numpy is its flexible 
memory model for accessing numeric data in arbitrary layouts.

At the core of the numpy library there is the ndarray object, i.e. 
a multi-dimensional array of a homogeneous data type.

The crucial point is: what does numpy accept as a homogeneous data type?

Since numpy is a numeric library, we can use all the standard numeric types, 
such as integers (signed or unsigned with 8, 16, 32 or 64 bits), floats 
(16, 32 or 64 bits) or even complex numbers (internally a pair of floats).
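As a quick check of these sizes, the following sketch asks numpy how many bytes one element of each type occupies. The selection of type names is just an illustrative subset.

```python
import numpy as np

type_names = ['uint8', 'int16', 'uint32', 'int64',
              'float16', 'float32', 'float64', 'complex128']

# Map each type name to the number of bytes one element occupies.
sizes = {name: np.dtype(name).itemsize for name in type_names}

for name, nbytes in sizes.items():
    print(f'{name:>10}: {nbytes} bytes')
```

Note that `complex128` takes 16 bytes, consistent with it being stored as a pair of 64-bit floats.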

But what if we want to define a more complex structure?

Numpy allows defining structures to be used as array elements by 
defining a custom `dtype` object. For example, to define a structure 
made of an `int16` and a `float16` we write:

In [4]:
import numpy as np

In [61]:
custom_dtype = np.dtype([('time', np.int16), ('humidity', np.float16)])

For general usage of `np.dtype()` see the comprehensive [dtype documentation](http://docs.scipy.org/doc/numpy/reference/arrays.dtypes.html).

We can build an array of such structures by specifying the dtype argument (present on almost all numpy functions returning arrays). For example:

In [62]:
data = np.ones(6, dtype=custom_dtype)
data

array([(1, 1.0), (1, 1.0), (1, 1.0), (1, 1.0), (1, 1.0), (1, 1.0)], 
      dtype=[('time', '<i2'), ('humidity', '<f2')])

Here we defined a 6-element array. Each element has 4 bytes: the first 2 contain a little-endian 16-bit integer (`'<i2'`) and the last 2 a little-endian 16-bit float (`'<f2'`).
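The layout described above can be verified by inspecting the dtype itself. This is a small sketch that recreates the same `custom_dtype` and prints the element size and the byte offset of each field; the `offsets` dictionary is only a helper for this illustration.

```python
import numpy as np

custom_dtype = np.dtype([('time', np.int16), ('humidity', np.float16)])
data = np.ones(6, dtype=custom_dtype)

print(custom_dtype.itemsize)   # 4 bytes per element
print(data.nbytes)             # 24 bytes for the whole array

# dtype.fields maps each field name to a tuple starting with
# (field dtype, byte offset within the element).
offsets = {}
for name in custom_dtype.names:
    field_dtype, offset = custom_dtype.fields[name][:2]
    offsets[name] = offset
    print(name, field_dtype, offset)
```

The `time` field starts at offset 0 and `humidity` at offset 2, with no padding in between.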

Let's see a few examples of what we can do with such an array.

**Scalar to array broadcast:**

In [63]:
data['time'] += 10
data

array([(11, 1.0), (11, 1.0), (11, 1.0), (11, 1.0), (11, 1.0), (11, 1.0)], 
      dtype=[('time', '<i2'), ('humidity', '<f2')])

**Array assignment:**

In [64]:
data['time'] = np.arange(6)*10
data['humidity'] = (np.random.randn(6) + 5)*10
data

array([(0, 50.75), (10, 46.40625), (20, 62.1875), (30, 48.78125),
       (40, 35.53125), (50, 48.78125)], 
      dtype=[('time', '<i2'), ('humidity', '<f2')])

**Indexing and slicing:**

In [72]:
data[3]

(30, 48.78125)

In [65]:
data[1::2]

array([(10, 46.40625), (30, 48.78125), (50, 48.78125)], 
      dtype=[('time', '<i2'), ('humidity', '<f2')])

**View with different dtype:**

In [70]:
custom_dtype_b = np.dtype([('byte1', 'u1'), ('byte2', 'u1'), ('humidity', '<f2')])

In [71]:
data_b = data.view(custom_dtype_b)
data_b

array([(0, 0, 50.75), (10, 0, 46.40625), (20, 0, 62.1875),
       (30, 0, 48.78125), (40, 0, 35.53125), (50, 0, 48.78125)], 
      dtype=[('byte1', 'u1'), ('byte2', 'u1'), ('humidity', '<f2')])

In this last example, `data_b` points to the same memory as `data`, but reinterprets each array element with a different structure (the integer is split into 2 bytes).
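A short sketch can confirm that a view shares memory with the original array: writing through the view changes what the original array sees. The dtypes below mirror the ones used above, except that the byte order is spelled out explicitly (`'<i2'`) so that the result does not depend on the machine; the integer values are arbitrary.

```python
import numpy as np

custom_dtype = np.dtype([('time', '<i2'), ('humidity', '<f2')])
custom_dtype_b = np.dtype([('byte1', 'u1'), ('byte2', 'u1'),
                           ('humidity', '<f2')])

data = np.zeros(3, dtype=custom_dtype)
data['time'] = [1, 258, 513]            # 0x0001, 0x0102, 0x0201

data_b = data.view(custom_dtype_b)      # same buffer, no copy
print(np.shares_memory(data, data_b))   # True

# In a little-endian int16, byte2 is the high byte: clearing it
# through the view modifies the integers seen through `data`.
data_b['byte2'] = 0
print(data['time'].tolist())            # [1, 2, 1]
```

No data is copied at any point, which is what makes views cheap even for large buffers.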

## Let's decode!

Coming back to the previous example, we recognize that the data has `MT` and `mT` aligned at 32-bit and 16-bit boundaries, in big-endian representation.

To access those fields we can simply interpret the binary data with the following dtype:

In [40]:
dtype_fields = np.dtype([('timestamps', '>u4'), ('nanotimes', '>u2')])
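As a minimal sketch of how this dtype can be applied, the example below interprets a small in-memory buffer with `np.frombuffer`. The two 6-byte words are made up for illustration and are not real instrument data.

```python
import numpy as np

dtype_fields = np.dtype([('timestamps', '>u4'), ('nanotimes', '>u2')])

# Two made-up 6-byte words (not real instrument data).
raw = bytearray([0x00, 0x00, 0x00, 0x01, 0x00, 0x0A,
                 0x00, 0x00, 0x01, 0x00, 0x01, 0x00])

# A bytearray is mutable, so the resulting array is writable.
words = np.frombuffer(raw, dtype=dtype_fields)

print(dtype_fields.itemsize)            # 6
print(words['timestamps'].tolist())     # [1, 256]
print(words['nanotimes'].tolist())      # [10, 256]
```

For data stored in a file, `np.fromfile(filename, dtype=dtype_fields)` reads records in the same way, where `filename` stands for a hypothetical path to the raw data.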

We only need to copy the remaining bits (`CH`, `OV`, `#B`) elsewhere and set them to 0 in the original buffer.
For this it is convenient to access the individual bytes, which is easily done by defining a second dtype:

In [86]:
dtype_bytes = np.dtype([('byte%d' % b, 'u1') for b in range(1, 7)])
dtype_bytes

dtype([('byte1', 'u1'), ('byte2', 'u1'), ('byte3', 'u1'), ('byte4', 'u1'), ('byte5', 'u1'), ('byte6', 'u1')])