# Deconstructing Feather

### Getting Started
1. Install flatbuffers using your favorite package manager, or from source:
https://google.github.io/flatbuffers/

2. Download the FlatBuffers schema for the Feather metadata: https://github.com/wesm/feather/blob/master/cpp/src/feather/metadata.fbs

3. Generate Python bindings for the feather metadata:
```bash
flatc -p metadata.fbs
```

### Write a feather file
In R (or Python) write something simple like the iris dataset to a feather file:
```R
library(feather)
data(iris)

write_feather(iris, 'iris_from_r.feather')
```

### Read a feather file (easy-mode)

In [16]:
import feather

In [17]:
iris_df = feather.read_dataframe('iris_from_r.feather')
iris_df.head()

| | Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | Species |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | setosa |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | setosa |


### Read a feather file (hard-mode)

In [19]:
import struct
import numpy as np

from feather_meta.fbs.CTable import CTable
from feather_meta.fbs.CategoryMetadata import CategoryMetadata

In [3]:
# Read the whole file into memory as bytes, closing the handle afterwards.
with open('iris_from_r.feather', 'rb') as fh:
    f = fh.read()

### Parse the metadata

The metadata lives at the end of the file. We use the code generated by `flatc` to parse it. The file looks something like this:

`[data, <metadata>, <metadata_size>, FEA1]`

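Here `metadata_size` is a 4-byte little-endian unsigned integer, and the magic string `FEA1` is also 4 bytes.

In [18]:
# '<I' reads the size as little-endian regardless of the host's byte order.
metadata_size, = struct.unpack('<I', f[-8:-4])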

In [9]:
metadata = CTable.GetRootAsCTable(f, len(f) - 4 - 4 - metadata_size)
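
The footer arithmetic can be checked without a real Feather file. The sketch below builds a synthetic buffer in the same `[data, <metadata>, <metadata_size>, FEA1]` shape and walks it backwards from the end. `build_fake_file` and `split_footer` are hypothetical helpers written for this illustration, the "metadata" is a placeholder byte string rather than a real flatbuffer, and the size field is assumed to be little-endian.

```python
import struct

MAGIC = b"FEA1"

def build_fake_file(data: bytes, metadata: bytes) -> bytes:
    """Assemble a synthetic buffer: data, metadata, 4-byte size, magic."""
    return data + metadata + struct.pack("<I", len(metadata)) + MAGIC

def split_footer(buf: bytes):
    """Return (metadata_start, metadata_bytes) by reading the footer backwards."""
    if buf[-4:] != MAGIC:
        raise ValueError("missing trailing magic bytes")
    # The size field sits just before the trailing magic (assumed little-endian).
    (metadata_size,) = struct.unpack("<I", buf[-8:-4])
    # Same arithmetic as the GetRootAsCTable call above.
    metadata_start = len(buf) - 4 - 4 - metadata_size
    return metadata_start, buf[metadata_start:metadata_start + metadata_size]

fake = build_fake_file(b"\x00" * 32, b"pretend-flatbuffer")
start, meta = split_footer(fake)
print(start, meta)
```

Checking the trailing magic before trusting the size field is a cheap guard against pointing the parser at a file that is not in this format.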

Find the offset of each column in the Feather file:

In [10]:
for i in range(metadata.ColumnsLength()):
    name = metadata.Columns(i).Name()
    loc = metadata.Columns(i).Values()
    offset = loc.Offset()
    print("name: {}, offset: {}".format(name, offset))

name: b'Sepal.Length', offset: 8
name: b'Sepal.Width', offset: 1208
name: b'Petal.Length', offset: 2408
name: b'Petal.Width', offset: 3608
name: b'Species', offset: 4808
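
The gaps between these offsets line up with the data itself: iris has 150 rows, and 150 doubles occupy 1200 bytes. The check below uses only the numbers printed above. Reading this as "numeric columns are stored back to back" is an inference from those numbers, not something the metadata states here.

```python
# Offsets copied from the output above.
offsets = {
    "Sepal.Length": 8,
    "Sepal.Width": 1208,
    "Petal.Length": 2408,
    "Petal.Width": 3608,
    "Species": 4808,
}

n_rows = 150        # iris has 150 observations
double_width = 8    # bytes per float64 value

names = list(offsets)
# Distance from each column's start to the start of the next column.
gaps = {
    names[i]: offsets[names[i + 1]] - offsets[names[i]]
    for i in range(len(names) - 1)
}
for name, gap in gaps.items():
    print(name, gap, gap == n_rows * double_width)
```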


Just for fun, we'll load one column into a NumPy array.

In [14]:
col_metadata = metadata.Columns(0)
col_loc = col_metadata.Values()
sepal_length = np.frombuffer(f[col_loc.Offset():col_loc.Offset() + col_loc.TotalBytes()], dtype=np.double)

In [15]:
sepal_length

array([ 5.1,  4.9,  4.7,  4.6,  5. ,  5.4,  4.6,  5. ,  4.4,  4.9,  5.4,
        4.8,  4.8,  4.3,  5.8,  5.7,  5.4,  5.1,  5.7,  5.1,  5.4,  5.1,
        4.6,  5.1,  4.8,  5. ,  5. ,  5.2,  5.2,  4.7,  4.8,  5.4,  5.2,
        5.5,  4.9,  5. ,  5.5,  4.9,  4.4,  5.1,  5. ,  4.5,  4.4,  5. ,
        5.1,  4.8,  5.1,  4.6,  5.3,  5. ,  7. ,  6.4,  6.9,  5.5,  6.5,
        5.7,  6.3,  4.9,  6.6,  5.2,  5. ,  5.9,  6. ,  6.1,  5.6,  6.7,
        5.6,  5.8,  6.2,  5.6,  5.9,  6.1,  6.3,  6.1,  6.4,  6.6,  6.8,
        6.7,  6. ,  5.7,  5.5,  5.5,  5.8,  6. ,  5.4,  6. ,  6.7,  6.3,
        5.6,  5.5,  5.5,  6.1,  5.8,  5. ,  5.6,  5.7,  5.7,  6.2,  5.1,
        5.7,  6.3,  5.8,  7.1,  6.3,  6.5,  7.6,  4.9,  7.3,  6.7,  7.2,
        6.5,  6.4,  6.8,  5.7,  5.8,  6.4,  6.5,  7.7,  7.7,  6. ,  6.9,
        5.6,  7.7,  6.3,  6.7,  7.2,  6.2,  6.1,  6.4,  7.2,  7.4,  7.9,
        6.4,  6.3,  6.1,  7.7,  6.3,  6.4,  6. ,  6.9,  6.7,  6.9,  5.8,
        6.8,  6.7,  6.7,  6.3,  6.5,  6.2,  5.9])
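
Slicing `f` copies the column's bytes before NumPy sees them. `np.frombuffer` also accepts `offset` and `count` arguments, which let it read directly from the original buffer. A minimal sketch on a synthetic buffer follows; the 8-byte header, the values, and the trailing bytes are made up for illustration.

```python
import numpy as np

values = np.array([5.1, 4.9, 4.7], dtype=np.float64)
buf = b"\x00" * 8 + values.tobytes() + b"trailing-bytes"

offset = 8
total_bytes = values.nbytes  # stands in for col_loc.TotalBytes()

# Same values as np.frombuffer(buf[offset:offset + total_bytes], ...),
# but without the intermediate bytes copy made by slicing.
col = np.frombuffer(
    buf,
    dtype=np.float64,
    count=total_bytes // np.dtype(np.float64).itemsize,
    offset=offset,
)
print(col)
```

The resulting array is a read-only view over `buf`, so call `.copy()` on it if the values need to be modified.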

Categorical columns are a bit more involved. These use integer codes to identify each value, and store a lookup table elsewhere in the feather file.

In [22]:
species_metadata = metadata.Columns(4)
species_loc = species_metadata.Values()
species_int_code = np.frombuffer(f[species_loc.Offset():species_loc.Offset() + species_loc.TotalBytes()], dtype=np.int32)

We can fetch the string values from the lookup table:

In [24]:
# Metadata() returns a generic flatbuffers table; re-initialise it as CategoryMetadata.
species_meta_table = species_metadata.Metadata()
species_levels_metadata = CategoryMetadata()
species_levels_metadata.Init(species_meta_table.Bytes, species_meta_table.Pos)

In [45]:
species_levels_loc = species_levels_metadata.Levels()

In [46]:
str(f[species_levels_loc.Offset() + 16: species_levels_loc.Offset() + species_levels_loc.TotalBytes() - 6])

"b'setosaversicolorvirginica\\x00'"
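
The hard-coded `+ 16` and `- 6` above appear to be specific to this file: 16 would be the size of four int32 offsets (three levels plus one), and the trailing trim removes padding. Assuming the levels are laid out as `n + 1` little-endian int32 offsets followed by the concatenated string bytes and optional padding, with no null bitmap, the strings can be split without magic numbers. The sketch below runs on a synthetic buffer. `decode_utf8_array` is a hypothetical helper, and the layout is an assumption to verify against the Feather format description.

```python
import struct

def decode_utf8_array(raw: bytes, length: int) -> list:
    """Split `raw` into `length` strings using a leading int32 offsets table.

    Assumed layout: (length + 1) little-endian int32 offsets, then the
    concatenated UTF-8 bytes, then optional padding. No null bitmap.
    """
    n_offsets = length + 1
    offsets = struct.unpack_from("<{}i".format(n_offsets), raw, 0)
    data = raw[4 * n_offsets:]
    return [
        data[offsets[i]:offsets[i + 1]].decode("utf-8")
        for i in range(length)
    ]

# Synthetic stand-in for the Species levels buffer.
levels = ["setosa", "versicolor", "virginica"]
payload = "".join(levels).encode("utf-8")
raw = struct.pack("<4i", 0, 6, 16, 25) + payload + b"\x00" * 7

decoded = decode_utf8_array(raw, length=3)
print(decoded)

# Integer codes index into the decoded levels (assuming zero-based codes).
codes = [0, 0, 1, 2]
labels = [decoded[c] for c in codes]
print(labels)
```

On a real file, the number of levels would have to come from the array's metadata rather than being passed in by hand.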