Now that we know how to execute starlab commands and get the output in a format we can use, we need to think about
storing that output in a permanent way. This process (taking objects in memory and writing them out to disk) is called *Serialization*.

I've tidied up the `Story` class from the last notebook and placed it in a module for easy access. Before we can import and use it, though, we have to tell python where to find it.

In [1]:
import sys

sys.path.append("../")
from pystarlab.starlab import Story

I think the code is worth looking at, so I'll use the `%load` magic to put it into a cell at the end of this notebook. For now, though, let's just build a list of commands and create a simulated star cluster. We'll go with a smallish cluster (500 stars) and integrate it for 100 nondimensional time units.

In [2]:
cmds = []

cmds.append(["makeking", "-n", "500", "-w", "5", "-i",  "-u"])
cmds.append(["makemass", "-f", "2", "-l", "0.1", "-u", "20", "-i"])
cmds.append(["scale", "-m", "1", "-e", "-0.25", "-q", "0.5"]) 
cmds.append(["kira", "-t", "100", "-d", "1", "-D", "5", "-n", "10", "-q", "0.5", "-G", "2"])

storylist = Story.from_command_list(cmds)

Including the integration command in our command list gives us a list of stories instead of a single story. The two parameters which govern how many stories are in the list are `-t` (the amount of dynamical time in the integration) and `-D` (the interval between snapshots). In this case, `-t` is 100, and `-D` is 5, which should give us 20 snapshots. Including initial conditions (for t=0) brings the number up to 21. The `Story.from_command_list()` method automatically includes the initial conditions if the end result is a list, so we don't need to do that manually.
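
As a quick sanity check on that arithmetic, here is a small sketch. The helper name `expected_snapshots` is made up for illustration and is not part of `pystarlab`; it also assumes that `-t` is an exact multiple of `-D`.

```python
def expected_snapshots(t_end, snapshot_interval, include_initial=True):
    """Number of stories we expect back from an integration.

    Hypothetical helper: assumes t_end is an exact multiple of
    snapshot_interval, as in the kira command above (-t 100 -D 5).
    """
    n_outputs = int(round(t_end / snapshot_interval))
    return n_outputs + (1 if include_initial else 0)


n_expected = expected_snapshots(100, 5)
print(n_expected)
```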

In [3]:
len(storylist)

21

Our first goal here is to make a permanent archive of this data.

## Simple serialization, two ways.

The simplest serialization we could use is to just write the raw output of the `starlab` commands to a file (or collection of files) in the filesystem. Then, to recreate the `Story` objects, we would have to open and read the files, and convert the strings to `Story` objects. The `Story` class has a method to do this, so it's not a problem.

A second approach would be to store the `Story` objects directly on disk. The standard python way to serialize objects is the [`pickle` module](https://docs.python.org/3.5/library/pickle.html). There are some kinds of things that can't be pickled, but we're not dealing with any of them here, so using `pickle` would be very straightforward.

It seems like JSON would be a third sensible way to approach this, but the `json` library doesn't know what to do with our `Story` class. There are various things we could try to convert `Story` objects to JSON, but it doesn't seem worth it at this point. If we were going to do direct visualization of `Story` objects via the web, it might make sense, but when we get to visualization, we'll only be using a subset of the data. In any case, we should wait until we're ready to take that step before deciding how to do it.
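
To make the JSON problem concrete, here is a rough sketch of what a conversion might look like. `MiniStory` is a hypothetical stand-in that only borrows the attribute names of `Story`, and `story_to_dict` is an illustrative helper, not something the module provides.

```python
import json


class MiniStory:
    """Hypothetical stand-in with the same attribute names as Story."""

    def __init__(self, kind, vals=None, lines=None, subobjects=None):
        self.kind = kind
        self.story_vals = dict(vals or {})
        self.story_lines = list(lines or [])
        self.story_subobjects = list(subobjects or [])


def story_to_dict(story):
    """Recursively convert a story tree into JSON-friendly builtins."""
    return {
        "kind": story.kind,
        "lines": story.story_lines,
        "vals": story.story_vals,
        "subobjects": [story_to_dict(sub) for sub in story.story_subobjects],
    }


root = MiniStory("Particle", vals={"N": "1"},
                 subobjects=[MiniStory("Dynamics", vals={"m": "1"})])

# The json library rejects arbitrary objects...
try:
    json.dumps(root)
    direct_dump_failed = False
except TypeError:
    direct_dump_failed = True

# ...but is happy with the nested dict/list version.
encoded = json.dumps(story_to_dict(root))
decoded = json.loads(encoded)
print(direct_dump_failed, decoded["subobjects"][0]["kind"])
```

Going the other way (rebuilding objects from the decoded dictionaries) would need a matching helper, which is part of why this doesn't seem worth the effort yet.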

Let's compare the two obvious options.

#### Pickle

To use `pickle`, we have to import it first. Then we just dump our list of `Story` objects to the file. The standard extension (inasmuch as there is one) for pickle files is `.pkl`.

The construction here (`with open() ... `) automatically closes the file when we're done with it, and is more concise
(and less error prone) than opening, writing, and closing with unconnected commands would be.

In [4]:
import pickle

with open('kiraout.pkl', 'wb') as outfile:
    pickle.dump(storylist, outfile)
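
Reading a pickle back is just as short. The sketch below uses plain dictionaries as stand-ins for `Story` objects (an assumption, so that it runs without starlab installed) and writes to a temporary directory rather than next to the notebook.

```python
import os
import pickle
import tempfile

# Stand-in data: plain dicts instead of real Story objects.
snapshots = [{"kind": "Particle", "N": "500", "t": str(5 * i)} for i in range(3)]

pickle_path = os.path.join(tempfile.mkdtemp(), "demo.pkl")

with open(pickle_path, "wb") as outfile:
    pickle.dump(snapshots, outfile)

# Note the binary mode ("rb"); pickle files are not text.
with open(pickle_path, "rb") as infile:
    restored = pickle.load(infile)

print(restored == snapshots)
```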

#### Plain text

We don't have to import another module to write text to a file, but we do want to make sure we're using the
right string representation of our `Story` objects, and we want to join the list into a single string.

In [5]:
with open('kira.out', 'w') as outfile:
    outfile.write("".join([str(story) for story in storylist]))

#### gzipped text

If we're going to use text, we might as well use the `gzip` module and compress it. It adds a little computational overhead, but saves us some space. The `encode()` method is necessary to turn the string into a `bytes` object for processing. People coming from python 2 may find this unintuitive (and annoying), but ultimately I think the other benefits of python 3 make it worth it.

In [6]:
import gzip
with gzip.open('kira.out.gz', 'wb') as outfile:
    outfile.write("".join([str(story) for story in storylist]).encode())
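
Reading the compressed file back doesn't require decompressing it by hand. This is a minimal sketch with made-up text standing in for the real `kira` output; opening with mode `"rt"` gives us decoded lines directly.

```python
import gzip
import os
import tempfile

# Stand-in for the joined str(story) output (made-up content).
text = "(Particle\n  N = 2\n)Particle\n" * 3

gz_path = os.path.join(tempfile.mkdtemp(), "demo.out.gz")

with gzip.open(gz_path, "wb") as outfile:
    outfile.write(text.encode())

# Mode "rt" decodes on the fly, so iterating gives str lines,
# just like iterating over a file opened with open(..., "r").
with gzip.open(gz_path, "rt") as infile:
    lines = [line.rstrip("\n") for line in infile]

print(len(lines), lines[0])
```

Since `Story.from_buf()` takes any iterable of lines, it should be possible to hand it lines read this way, although that isn't tested here.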

How do they compare, size-wise?

In [7]:
! ls -l kira*

-rw-rw-r-- 1 cmckay cmckay 4315695 Feb  8 20:38 kira.out
-rw-rw-r-- 1 cmckay cmckay 1293300 Feb  8 20:38 kira.out.gz
-rw-rw-r-- 1 cmckay cmckay 7434322 Feb  8 20:38 kiraout.pkl


Interesting. The pickled version is about 1.7 times the size of the plain text, and the gzipped text (about 30% of the plain text size) is the obvious winner. Since we can trivially recreate the list from the plain text file, pickling doesn't seem to buy us anything.
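
To put numbers on the comparison, we can compute the ratios from the byte counts in the listing above. Nothing here touches the filesystem; the sizes are simply copied from the `ls -l` output.

```python
# Sizes in bytes, copied from the `ls -l` listing above.
sizes = {
    "kira.out": 4315695,
    "kira.out.gz": 1293300,
    "kiraout.pkl": 7434322,
}

plain = sizes["kira.out"]
ratios = {name: size / plain for name, size in sizes.items()}

for name, ratio in sorted(ratios.items(), key=lambda item: item[1]):
    print("%-12s %.2f x plain text" % (name, ratio))
```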

In [8]:
with open('kira.out', 'r') as infile:
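    # strip trailing newlines so Log lines match what from_single_command() stores
    new_story_list = Story.from_buf(line.rstrip() for line in infile)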
    
print(len(new_story_list))

21


## Filesystem vs. Database

Knowing how we're going to serialize the `Story` really only answers half of the question of how we're going to archive and store our data. The other half of the question is, where are we going to put it so that we can find it again?

Ultimately, we'd like to be able to find simulations that satisfy different criteria (all King model simulations with 500 stars, for example). To accomplish this, we'll need some way to connect stories with their metadata. We'll explore this in more detail in the next notebook, but it's worth going through a few options here.

The biggest question, though, is whether to store our data as files in the filesystem, or as entries in a database. There are advantages and disadvantages to each, so it's not immediately obvious which is the better choice. It's the kind of thing that comes up occasionally in discussion groups (see, for example, [this Stack Overflow thread](http://stackoverflow.com/questions/504544/whats-the-best-practice-for-storing-huge-amounts-of-text-into-a-db-or-as-a-fil)), usually without clear resolution.

The main virtue of the filesystem is simplicity, but that's also its main failing. For our application, it's useful to know that:

- eventually we're going to be running lots of simulations in parallel across several different machines
- individual snapshots will typically be smaller than 1.5 MB (2600 stars gives a size of about 1.3 MB per snapshot)
- we will be running ensembles of simulations consisting of 100 runs or so.
- this is a write-once, read-occasionally-for-ETL situation; when we're doing analytics or plotting, we will use slices or subsets of the data rather than these archives.
- our aim is to build an interface to (at the very least) the metadata using Django.

The last point here is probably the most important one; since we're going to be using Django for the web interface, we'll be using PostgreSQL for that, and it makes sense to use it for archiving the data as well.
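
As a rough idea of what "finding simulations by their metadata" could look like, here is a sketch that uses an in-memory SQLite database as a stand-in for PostgreSQL. The table name, column names, and the `fake_archive` helper are all hypothetical; the real schema is a decision for the next notebook.

```python
import gzip
import sqlite3

# In-memory SQLite stands in for PostgreSQL; names below are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE simulation (
        id INTEGER PRIMARY KEY,
        model TEXT NOT NULL,
        n_stars INTEGER NOT NULL,
        snapshots_gz BLOB NOT NULL
    )
    """
)


def fake_archive(label):
    """Made-up placeholder for the gzipped text of a run's stories."""
    return gzip.compress(("(Particle\n  run = %s\n)Particle\n" % label).encode())


runs = [
    ("king", 500, fake_archive("a")),
    ("king", 2600, fake_archive("b")),
    ("plummer", 500, fake_archive("c")),
]
conn.executemany(
    "INSERT INTO simulation (model, n_stars, snapshots_gz) VALUES (?, ?, ?)",
    runs,
)

# "All King model simulations with 500 stars"
rows = conn.execute(
    "SELECT id, snapshots_gz FROM simulation WHERE model = ? AND n_stars = ?",
    ("king", 500),
).fetchall()

matching_ids = [row[0] for row in rows]
first_text = gzip.decompress(rows[0][1]).decode()
print(matching_ids, first_text.splitlines()[1])
conn.close()
```

Keeping the metadata in ordinary columns and the bulky story text in a single compressed column fits the write-once, read-occasionally pattern described in the list above.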

In [None]:
storylist[0]

In [None]:
storylist[0].story_vals

The single value (`N`) just tells us how many stars are in our cluster (or, more specifically, how many `Particle` stories are contained in the story tree below the root). Note that this *won't* be the same as the number of subobjects; we have a few subobjects that aren't `Particle`s, and if we have any combined `Particle`s (such as for binaries), each would show up as a single subobject, but all of its members would be counted in `N`.

In this case, we don't have any binaries or other doubled up `Particle`s, so we're left with 500 `Particle` subobjects and 4 others.
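
The difference between `N` and the number of top-level `Particle` subobjects is easy to see with a toy tree. `MiniStory` below is a hypothetical stand-in that only mimics the `kind` and `story_subobjects` attributes of `Story`, and `count_leaf_particles` is an illustrative helper.

```python
class MiniStory:
    """Hypothetical stand-in with the same attribute names as Story."""

    def __init__(self, kind, subobjects=None):
        self.kind = kind
        self.story_subobjects = list(subobjects or [])


def count_leaf_particles(story):
    """Count Particle stories below `story` that contain no Particles."""
    children = [sub for sub in story.story_subobjects if sub.kind == "Particle"]
    if not children:
        return 1
    return sum(count_leaf_particles(child) for child in children)


def bookkeeping():
    """The four non-Particle subobjects every node carries."""
    return [MiniStory(kind) for kind in ("Log", "Dynamics", "Hydro", "Star")]


single = MiniStory("Particle", bookkeeping())
binary = MiniStory("Particle", bookkeeping() + [
    MiniStory("Particle", bookkeeping()),
    MiniStory("Particle", bookkeeping()),
])
root = MiniStory("Particle", bookkeeping() + [single, binary])

n_top_level = sum(sub.kind == "Particle" for sub in root.story_subobjects)
n_stars = count_leaf_particles(root)
print(n_top_level, n_stars)
```

Here the root has two `Particle` subobjects but three stars, because the binary counts once as a subobject and twice toward the star count.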

The first is the `Log` object. This is valuable for reproducibility and data provenance, but I'm not going to need to extract or modify any values in the near term, so it might as well stay as a string or list of strings.

In [None]:
print(str(storylist[0].story_subobjects[0]))

Next is the root level `Dynamics` object. The most important things here are the center of mass position and velocity, and the system time. We will add some more important fields in subsequent snapshots.

In [None]:
print(str(storylist[0].story_subobjects[1]))

The `Hydro` story isn't relevant to any of the projects we've undertaken so far, so it's empty.

In [None]:
storylist[0].story_subobjects[2]

Actually, the `Star` object isn't, either, but this root level scaling could be useful if we're trying to communicate results in any kind of dimensional units.

In [None]:
print(str(storylist[0].story_subobjects[3]))

After that, it's all `Particle`s. Unfortunately, they aren't sorted in any obvious way. Presumably, they're sorted in some way that's convenient to whatever tool generated that particular snapshot.

In [None]:
print(str(storylist[0].story_subobjects[4]))

In [None]:
print(str(storylist[0].story_subobjects[5]))

### Snapshots from the integration

Once `kira` is running, we add quite a bit more information, and some of it is useful.

If we're looking at the subobjects in the same order, we will start with the `Log`, which now contains information
about cpu time. This could be useful if we want to know which factors affect performance, and by how much.

In [None]:
print(str(storylist[1].story_subobjects[0]))

The `Dynamics` object now has energies, Lagrangian radii, and a modified center of mass (which is also now the basis of the coordinate system).

In [None]:
print(str(storylist[1].story_subobjects[1]))

In [None]:
print(str(storylist[1].story_subobjects[4]))

In [None]:
print(str(storylist[1].story_subobjects[5]))

Another wrinkle: as time passes, we lose some stars. Whatever serialization method we use will need to deal with this gracefully. We could turn it off (by not passing the `-G` option to `kira`) but it would be nice to be able to cope with it if possible.

In [None]:
print(storylist[3].story_vals)

## numpy dtypes

We can make the data representation more compact (and index/sliceable) by using numpy dtypes.

In [None]:
import numpy as np

Let's see how this works by doing the simplest thing that will be useful. For the dynamics story of any particle, we want the mass, position, and velocity.

In [None]:
dynamics_type = np.dtype([('mass', np.float64), ('r', np.float64, (3,)), ('v', np.float64, (3,))])

We can manually construct a variable of this type.

In [None]:
def convert_to_dynamics_type(particle_story):
    """extract the dynamical values from a particle story"""
    dynamics_story = particle_story.story_subobjects[1]
    mass = float(dynamics_story.story_vals['m'])
    r = tuple(map(float, dynamics_story.story_vals['r'].split()))
    v = tuple(map(float, dynamics_story.story_vals['v'].split()))
    
    return (mass, r, v)

In [None]:
convert_to_dynamics_type(storylist[1].story_subobjects[5])

In [None]:
column = np.array([convert_to_dynamics_type(particle) for particle in storylist[1].story_subobjects[4:]], dtype=dynamics_type)

column[1]
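
To show what "index/sliceable" means in practice, here is a self-contained sketch using the same dtype with three made-up particles. The numbers are invented for illustration and don't come from the simulation.

```python
import numpy as np

dynamics_type = np.dtype([('mass', np.float64),
                          ('r', np.float64, (3,)),
                          ('v', np.float64, (3,))])

# Made-up values for three particles: (mass, position, velocity).
records = [
    (0.5, (1.0, 0.0, 0.0), (0.0, 0.1, 0.0)),
    (0.3, (0.0, 2.0, 0.0), (0.2, 0.0, 0.0)),
    (0.2, (0.0, 0.0, 3.0), (0.0, 0.0, 0.3)),
]
demo = np.array(records, dtype=dynamics_type)

total_mass = demo['mass'].sum()            # field access gives a plain array
x_positions = demo['r'][:, 0]              # x component of each position
radii = np.linalg.norm(demo['r'], axis=1)  # distance of each particle from origin
heavy = demo[demo['mass'] > 0.25]          # boolean mask keeps whole records

print(total_mass, x_positions, radii, len(heavy))
```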

The next bit we need to add is sorting a list of `Particle`s by their name/number, and then using that as an index in the resulting array.

In [None]:
def particle_id(particle_story):
    return int(particle_story.story_vals['i'])

particle_id(storylist[1].story_subobjects[5])

In [None]:
sorted_particles = sorted(storylist[1].story_subobjects[4:], key=particle_id)

In [None]:
print(str(sorted_particles[-1]))

Grab the indices. Python uses zero indexing, so we'll have to subtract one.

In [None]:
indices = np.array([particle_id(particle) for particle in sorted_particles]) - 1

In the example we ran above, we have 21 time snapshots and 500 particles. Let's see if we can store the dynamics in a 500x21 array.

In [None]:
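# initialize with the structured dtype, using NaN to mark missing data
particle_dynamics = np.empty((500, 21), dtype=dynamics_type)
for field in dynamics_type.names:
    particle_dynamics[field] = np.nan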

In [None]:
for time_index, time_snapshot in enumerate(storylist):
    sorted_particles = sorted(time_snapshot.story_subobjects[4:], key=particle_id)
    indices = np.array([particle_id(particle) for particle in sorted_particles]) - 1
    column = np.array([convert_to_dynamics_type(particle) for particle in sorted_particles], dtype=dynamics_type)
    particle_dynamics[indices, time_index] = column
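
The same pattern can be tried on a toy problem where a star goes missing. In this sketch the snapshots are plain dictionaries keyed by particle id (an assumption made so that it runs without starlab), and particle 3 is absent from the second snapshot.

```python
import numpy as np

dynamics_type = np.dtype([('mass', np.float64),
                          ('r', np.float64, (3,)),
                          ('v', np.float64, (3,))])

n_particles, n_snapshots = 4, 2
table = np.empty((n_particles, n_snapshots), dtype=dynamics_type)
for field in dynamics_type.names:
    table[field] = np.nan  # NaN marks "no data for this particle/time"

# Made-up snapshots: {particle id (1-based): (mass, r, v)}.
# Particle 3 has escaped by the second snapshot.
snapshots = [
    {1: (0.4, (1, 0, 0), (0, 1, 0)), 2: (0.3, (0, 1, 0), (1, 0, 0)),
     3: (0.2, (9, 9, 9), (5, 5, 5)), 4: (0.1, (0, 0, 1), (0, 0, 1))},
    {4: (0.1, (0, 0, 2), (0, 0, 1)), 1: (0.4, (1, 1, 0), (0, 1, 0)),
     2: (0.3, (1, 1, 0), (1, 0, 0))},
]

for time_index, snapshot in enumerate(snapshots):
    ids = sorted(snapshot)
    rows = np.array(ids) - 1  # ids start at 1, rows at 0
    column = np.array([snapshot[i] for i in ids], dtype=dynamics_type)
    table[rows, time_index] = column

escaped = np.isnan(table['mass'][:, -1])
print(np.flatnonzero(escaped) + 1)
```

Rows that are never assigned stay NaN, so escaped stars can be spotted later with `np.isnan` instead of needing a separate bookkeeping structure.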

In [None]:
sorted_particles = sorted(storylist[7].story_subobjects[4:], key=particle_id)

In [None]:
storylist[7]

In [None]:
print(str(storylist[7].story_vals))

In [None]:
def flatten_particle_tree(particle_story):
    """Flatten a tree of particle stories."""
    
    # first, make sure something needs to be done
    if int(particle_story.story_vals['N']) == len(particle_story.story_subobjects[4:]):
        return particle_story
    else:
        # find the particles that contain more particles
        # (exploratory for now: we only print them, and nothing is returned)
        for particle in particle_story.story_subobjects[4:]:
            if len(particle.story_subobjects) != 4:
                # we will need to apply a transformation for center of mass
                print(str(particle))

In [None]:
flatten_particle_tree(storylist[5])

In [None]:
int(storylist[5].story_vals['N'])

In [None]:
len(storylist[5].story_subobjects[4:])

In [None]:
storylist[4].story_subobjects[4]
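
One possible way to finish the flattening idea is sketched below. It rests on a loud assumption that has not been verified here: that a child's `r` and `v` are stored as offsets from its parent node, so the transformation is a simple addition. The nodes are plain dictionaries rather than `Story` objects, and the `"(2,3)"` label for the binary is made up.

```python
def flatten_particles(node, parent_r=(0.0, 0.0, 0.0), parent_v=(0.0, 0.0, 0.0)):
    """Return the leaf particles below node, shifted into the root frame.

    ASSUMPTION: each child's r and v are offsets from its parent node.
    Nodes are plain dicts (hypothetical), not Story objects.
    """
    r = tuple(p + c for p, c in zip(parent_r, node["r"]))
    v = tuple(p + c for p, c in zip(parent_v, node["v"]))
    children = node.get("children", [])
    if not children:
        return [{"id": node["id"], "m": node["m"], "r": r, "v": v}]
    leaves = []
    for child in children:
        leaves.extend(flatten_particles(child, r, v))
    return leaves


single = {"id": 1, "m": 0.5, "r": (1.0, 2.0, 3.0), "v": (0.0, 0.0, 0.0)}
binary = {
    "id": "(2,3)", "m": 0.5, "r": (10.0, 0.0, 0.0), "v": (0.0, 1.0, 0.0),
    "children": [
        {"id": 2, "m": 0.25, "r": (0.5, 0.0, 0.0), "v": (0.0, 0.25, 0.0)},
        {"id": 3, "m": 0.25, "r": (-0.5, 0.0, 0.0), "v": (0.0, -0.25, 0.0)},
    ],
}

flat = []
for top_level in (single, binary):
    flat.extend(flatten_particles(top_level))

print([(leaf["id"], leaf["r"]) for leaf in flat])
```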

## Improved serialization with h5py

Now that we know what we're dealing with, we can put together a schema for storing the data.

In [None]:
import h5py

In [None]:
file = h5py.File('test.h5', 'w')
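
As a first guess at what such a schema might look like, the sketch below writes a small structured array to an HDF5 dataset and reads back a single snapshot. The dataset name, the attribute names, and the data are all made up; this is one possible layout, not a settled design.

```python
import os
import tempfile

import h5py
import numpy as np

dynamics_type = np.dtype([('mass', np.float64),
                          ('r', np.float64, (3,)),
                          ('v', np.float64, (3,))])

# Made-up data: 4 particles (rows) by 3 snapshots (columns).
table = np.zeros((4, 3), dtype=dynamics_type)
table['mass'] = 0.25
table['r'][..., 0] = np.arange(12).reshape(4, 3)

h5_path = os.path.join(tempfile.mkdtemp(), "demo.h5")

with h5py.File(h5_path, "w") as h5file:
    dset = h5file.create_dataset("particle_dynamics", data=table,
                                 compression="gzip")
    # Hypothetical metadata stored alongside the data.
    dset.attrs["model"] = "king"
    dset.attrs["snapshot_interval"] = 5.0

with h5py.File(h5_path, "r") as h5file:
    dset = h5file["particle_dynamics"]
    model = dset.attrs["model"]
    last_snapshot = dset[:, -1]  # read just the last column

print(model, last_snapshot['r'][:, 0])
```

Using `with` here closes the file automatically, for the same reasons given for `open()` earlier.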

## Appendix: The `Story` class

In [None]:
# %load "../pystarlab/starlab.py"
from subprocess import Popen, PIPE
import os
import re
from tempfile import SpooledTemporaryFile as tempfile

class Story:
    """Generic container class for starlab data."""
    def __init__(self):
        """Create an empty story."""
        self.story_lines = []
        self.story_vals = dict()
        self.story_subobjects = []
        self.kind = None
        return

    def __repr__(self):
        """A unique representation of the story object."""
        return ("[Story] %s, %d lines, %d values, %d subobjects" %
                (self.kind,
                 len(self.story_lines),
                 len(self.story_vals.keys()),
                 len(self.story_subobjects)))

    def __str__(self):
        """A string matching starlab's native format."""
        selfstr = "(%s\n" % self.kind
        for line in self.story_lines:
            selfstr += "%s\n" % line
        for key, val in sorted(self.story_vals.items()):
            selfstr += "  %s = %s\n" % (key, val)
        for substory in self.story_subobjects:
            selfstr += str(substory)
        return selfstr + ")%s\n" % self.kind

    @classmethod
    def from_buf(cls, buffered_result):
        """Generate a story from a buffer.

        This could either be a stream or a string that has
        been split into lines. It's supposed to add flexibility for
        running long kira integrations in which we don't want to hold
        the whole string in memory when converting it to stories.

        We use a little bit of state to avoid using recursion here. The
        reason for that is twofold:

        1. We want to treat lines in Log-type stories a little differently, and
        2. This will be more efficient, especially for large buffers.

        :param buffered_result: Results of a starlab command in an iterable format
        :type buffered_result: iterable

        :returns: results parsed into a story, or a list of stories if the
                  buffer holds more than one
        :rtype: Story instance or list of Story instances
        """
        stories_to_return = []
        story_stack = []

        # shouldn't be necessary
        thestory = None

        for line in buffered_result:
            if isinstance(line, bytes):
                line = line.decode()
            # check to see if we need to start a new story
            storystart = re.match(r"^\((\w+)", line)
            if storystart:
                thestory = cls()
                thestory.kind = storystart.group(1)
                story_stack.append(thestory)
            else:
                storyend = re.match(r"\)%s" % story_stack[-1].kind, line)
                if storyend:
                    thestory = story_stack.pop()
                    if len(story_stack) > 0:
                        story_stack[-1].story_subobjects.append(thestory)
                    else:
                        stories_to_return.append(thestory)
                else:
                    chunks = re.split('=', line)
                    if ((len(chunks) == 2) and story_stack[-1].kind != "Log"):
                        story_stack[-1].story_vals[chunks[0].strip()] = chunks[1].strip()
                    else:
                        story_stack[-1].story_lines.append(line)

        if len(stories_to_return) == 0:
            raise ValueError("No stories found in buffer!")
        elif len(stories_to_return) == 1:
            return stories_to_return[0]
        else:
            return stories_to_return

    @classmethod
    def from_string(cls, result_string):
        """Generate a story from a string.

        Assumes the string contains a single story (possibly with story subobjects).
        If there's more than one story in the string (e.g., output from kira), the
        result is a list of stories, exactly as returned by from_buf.

        :param result_string: The string to parse
        :type result_string: bytestring or unicode string

        :returns: string parsed into a story
        :rtype: Story instance
        """
        if isinstance(result_string, bytes):
            lines = result_string.decode('utf-8').splitlines()
        elif isinstance(result_string, str):
            lines = result_string.splitlines()
        else:
            raise TypeError('result_string should be a string or bytestring')

        newstory = cls.from_buf(lines)

        return newstory

    @classmethod
    def from_single_command(cls, command):
        """Generate a story from a single command.

        The command should be a creation command (e.g., makeking, makeplummer, etc.).
        It should also include all of the necessary command line options.

        :param command: The starlab command to run
        :type command: a string as it would appear on the command line
                       or a list suitable for subprocess.Popen()

        :returns: the output of command
        :rtype: Story instance
        """
        if isinstance(command, str):
            command = command.split(" ")
        elif isinstance(command, list):
            pass
        else:
            raise TypeError('command should be a string or list')
        thestory = None
        story_lines = []

        with Popen(command, stdout=PIPE, bufsize=1, universal_newlines=True) as process:
            for line in process.stdout:
                story_lines.append(line.rstrip())

        thestory = cls.from_buf(story_lines)

        return thestory

    @classmethod
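    def from_command_list(cls, command_list):
        """Generate a story (or list of stories) from a list of commands.

        The first command creates the story; each later command is applied
        to the result of the previous one. The caller's list is not modified.
        """
        current_story = cls.from_single_command(command_list[0])
        for command in command_list[1:]: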
            current_story = current_story.apply_command(command)
        return current_story

    def apply_command(self, command):
        """Apply a starlab command to this story and return the result"""
        if isinstance(command, str):
            command = command.split(" ")
        elif isinstance(command, list):
            pass
        else:
            raise TypeError('command should be a string or list')

        story_lines = []
        with tempfile() as f:
            f.write(str(self).encode())
            f.seek(0)
            with Popen(command, stdout=PIPE, stdin=f,
                        universal_newlines=True, bufsize=1) as process:
                for line in process.stdout:
                    story_lines.append(line.rstrip())

        thestory = self.from_buf(story_lines)

        # if the command was an integration, we'll get a list
        if isinstance(thestory, list):
            # include the initial conditions
            thestory.insert(0, self)
        return thestory
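
To see the stack-based parsing idea from `from_buf` in isolation, here is a stripped-down sketch that builds plain dictionaries instead of `Story` objects. It is a simplification, not the module's code: it skips the special handling of `Log` stories, drops lines that aren't `key = value` pairs, and the function name `parse_stories` is made up.

```python
import re


def parse_stories(lines):
    """Parse nested "(Kind ... )Kind" blocks into dicts using a stack."""
    finished = []
    stack = []
    for line in lines:
        start = re.match(r"^\((\w+)", line)
        if start:
            # a new story opens; it becomes the current top of the stack
            stack.append({"kind": start.group(1), "vals": {}, "subobjects": []})
            continue
        if not stack:
            continue  # ignore anything outside a story
        if line.startswith(")" + stack[-1]["kind"]):
            # the current story closes; attach it to its parent if it has one
            done = stack.pop()
            if stack:
                stack[-1]["subobjects"].append(done)
            else:
                finished.append(done)
        elif "=" in line:
            key, _, value = line.partition("=")
            stack[-1]["vals"][key.strip()] = value.strip()
    return finished


text = """(Particle
  N = 2
(Dynamics
  t = 0
)Dynamics
(Particle
  i = 1
)Particle
(Particle
  i = 2
)Particle
)Particle"""

stories = parse_stories(text.splitlines())
root = stories[0]
print(root["vals"], [sub["kind"] for sub in root["subobjects"]])
```

The stack holds the chain of stories that are currently open, so the nesting depth of the input never turns into recursion depth in the parser.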
