# Reading NumPy's file format

In the NumPy manual there's a nice description of the [.npy file format](https://numpy.org/devdocs/reference/generated/numpy.lib.format.html#npy-format), with a note under [capabilities](https://numpy.org/devdocs/reference/generated/numpy.lib.format.html#capabilities) that says:

> Is straightforward to reverse engineer.<br>Datasets often live longer than the programs that created them.<br>A competent developer should be able to create a solution in their preferred programming language to read most .npy files that they have been given without much documentation.

So let's look at a numpy file:

In [1]:
import numpy as np
filename = "1.npy"
original_array = np.array([1,2,3,4,5] + list(range(100,900,100)))
np.save(filename, original_array)

with open(filename, 'rb') as f:
    print(f.read())

print(original_array)

b"\x93NUMPY\x01\x00v\x00{'descr': '<i4', 'fortran_order': False, 'shape': (13,), }                                                           \n\x01\x00\x00\x00\x02\x00\x00\x00\x03\x00\x00\x00\x04\x00\x00\x00\x05\x00\x00\x00d\x00\x00\x00\xc8\x00\x00\x00,\x01\x00\x00\x90\x01\x00\x00\xf4\x01\x00\x00X\x02\x00\x00\xbc\x02\x00\x00 \x03\x00\x00"
[  1   2   3   4   5 100 200 300 400 500 600 700 800]


The [documentation](https://numpy.org/devdocs/reference/generated/numpy.lib.format.html#format-version-1-0) tells us that:

> **Format Version 1.0**<br>
> The first 6 bytes are a magic string: exactly \x93NUMPY.<br>
> The next 1 byte is an unsigned byte: the major version number of the file format, e.g. \x01.<br>
> The next 1 byte is an unsigned byte: the minor version number of the file format, e.g. \x00. Note: the version of the file format is not tied to the version of the numpy package.<br>
> The next 2 bytes form a little-endian unsigned short int: the length of the header data HEADER_LEN.<br>
> The next HEADER_LEN bytes form the header data describing the array’s format. It is an ASCII string which contains a Python literal expression of a dictionary. It is terminated by a newline (\n) and padded with spaces (\x20) to make the total of len(magic string) + 2 + len(length) + HEADER_LEN be evenly divisible by 64 for alignment purposes.<br>
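
To make the layout concrete, here is a small sketch that decodes just the first 10 bytes of the file printed above. It uses the standard-library `struct` module rather than anything from NumPy. The format string `"<6sBBH"` is my own reading of the description (6-byte string, two unsigned bytes, one little-endian unsigned short), not something the documentation spells out.

```python
import struct

# First 10 bytes of the file printed above: magic, major, minor, HEADER_LEN.
preamble = b"\x93NUMPY\x01\x00v\x00"

# "<" = little-endian without padding, "6s" = 6-byte string,
# "B" = unsigned byte (twice), "H" = unsigned 2-byte short.
magic, major, minor, header_len = struct.unpack("<6sBBH", preamble)
print(magic, major, minor, header_len)

# The documented alignment rule: preamble plus header is a multiple of 64.
total = len(preamble) + header_len
print(total, total % 64)
```

The `v` in the raw bytes is simply the byte 0x76, so `header_len` comes out as 118, and 10 + 118 = 128 is a multiple of 64, which matches the padding rule quoted above.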


In [2]:
import ast

def read_npy_v1(filename):
    with open(filename, "rb") as f:
        arr = f.read(10)
        magic = arr[:6]
        major = arr[6]  # indexing a bytes object yields an int
        assert major == 1, "this only reads version 1"
        minor = arr[7]
        header_len = int.from_bytes(arr[8:10], "little")
        header_str = f.read(header_len)
        header = ast.literal_eval(header_str.decode("ascii"))
        
        assert magic == b"\x93NUMPY"
        assert (len(arr) + header_len) % 64 == 0
        assert isinstance(header, dict)
        dtype = np.dtype(header["descr"])  # "descr" is a str such as "<i4"; np.dtype() parses it
        fortran_order = header["fortran_order"]
        shape = header["shape"]
        assert isinstance(fortran_order, bool)
        assert isinstance(shape, tuple)
        assert len(shape) == 1, "this only reads 1-D arrays"
        
        # instantiate the array
        array = np.ndarray(shape, dtype=dtype)
        # read the data
        data = f.read()
        # populate the array
        array[:] = np.frombuffer(data, dtype=dtype)
        return array
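
The reader above only accepts 1-D arrays, so `fortran_order` never matters. As a rough sketch of how `shape` and `fortran_order` could be applied for more dimensions, assuming a simple (non-structured) dtype: `reshape_npy_data` is a hypothetical helper of mine, not part of NumPy, and it takes the payload bytes and the parsed header dict as inputs.

```python
import numpy as np

def reshape_npy_data(data, header):
    """Hypothetical helper: turn raw .npy payload bytes into an N-D array."""
    dtype = np.dtype(header["descr"])
    # fortran_order=True means the payload is stored column-major.
    order = "F" if header["fortran_order"] else "C"
    flat = np.frombuffer(data, dtype=dtype)
    # frombuffer returns a read-only view, so copy to get a writable array.
    return flat.reshape(header["shape"], order=order).copy()

# Usage: a 2x3 array serialized in both memory orders.
a = np.arange(6, dtype="<i4").reshape(2, 3)
c_header = {"descr": "<i4", "fortran_order": False, "shape": (2, 3)}
f_header = {"descr": "<i4", "fortran_order": True, "shape": (2, 3)}
print(reshape_npy_data(a.tobytes(order="C"), c_header))
print(reshape_npy_data(a.tobytes(order="F"), f_header))
```

Both calls should give back the same 2x3 array; only the byte order of the payload differs.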

In [3]:
new_array = read_npy_v1(filename)
new_array

array([  1,   2,   3,   4,   5, 100, 200, 300, 400, 500, 600, 700, 800])

In [4]:
assert np.all(original_array == new_array)
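
Going the other way is about as short. Below is a minimal sketch of a writer for the same subset (version 1.0, 1-D, `<i4`), using only the standard library. `write_npy_v1_int32` is a hypothetical name of mine, and the header string simply mimics the one NumPy produced above; I am assuming that key order and spacing are acceptable to other readers.

```python
import struct

def write_npy_v1_int32(filename, values):
    """Hypothetical helper: write a list of ints as a 1-D '<i4' v1.0 .npy file."""
    header = "{'descr': '<i4', 'fortran_order': False, 'shape': (%d,), }" % len(values)
    # The preamble is 10 bytes: 6 (magic) + 1 (major) + 1 (minor) + 2 (HEADER_LEN).
    # Pad with spaces so preamble + header + trailing newline is a multiple of 64.
    padding = -(10 + len(header) + 1) % 64
    header_bytes = (header + " " * padding + "\n").encode("ascii")
    with open(filename, "wb") as f:
        f.write(b"\x93NUMPY\x01\x00")                        # magic + version 1.0
        f.write(struct.pack("<H", len(header_bytes)))        # HEADER_LEN
        f.write(header_bytes)
        f.write(struct.pack("<%di" % len(values), *values))  # little-endian int32 payload

write_npy_v1_int32("2.npy", [1, 2, 3, 4, 5] + list(range(100, 900, 100)))
```

For the 13 values used here the padded header should again be 118 bytes long, so `read_npy_v1("2.npy")` (and, I would expect, `np.load`) should be able to read the result back.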

## Conclusions?

I really appreciate the beautiful simplicity of the file format.

I think the next step for me is to read and write `.npy` files from Nim.

In [5]:
import pathlib
pathlib.Path(filename).unlink()