# Simulations and Metadata

The first things we need to know about the data are
 1. which simulations are available, and
 2. what physical systems those simulations represent.

The answers are provided by the `Simulations` object, which contains `Metadata` for each simulation.

## Simulations

To begin, we load the `Simulations` object with the `sxs.load` function:

In [None]:
import sxs

simulations = sxs.load("simulations")

The first time you call this function, it will attempt to download [the latest data from GitHub](https://github.com/sxs-collaboration/sxs/tree/simulations) and cache it locally, as described in [the previous notebook](../00-Introduction#configuration-and-caching-preliminaries).

The returned object is essentially a `dict`, where the keys are SXS IDs like "SXS:BBH:1234":

In [None]:
list(simulations)[:10]

The values in this `dict` are essentially also `dict`s with metadata about the simulation.
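
Because the object behaves like a `dict`, the usual `dict` idioms should carry over.  The sketch below uses a small made-up mapping (`toy_simulations`, with invented IDs and values) rather than the real `simulations` object, so that it runs without downloading anything:

```python
# Toy stand-in for the mapping returned by sxs.load("simulations");
# the IDs and values here are invented purely for illustration.
toy_simulations = {
    "TOY:BBH:0001": {"object_types": "BHBH", "reference_mass_ratio": 1.0},
    "TOY:BBH:0002": {"object_types": "BHBH", "reference_mass_ratio": 3.0},
    "TOY:BHNS:0001": {"object_types": "BHNS", "reference_mass_ratio": 6.0},
}

# Membership tests and len() work as for any dict
print("TOY:BBH:0002" in toy_simulations, len(toy_simulations))

# Iterate over key-value pairs to pick out the IDs we care about
bbh_ids = [
    sim_id
    for sim_id, meta in toy_simulations.items()
    if meta["object_types"] == "BHBH"
]
print(bbh_ids)
```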

## Metadata

For each simulation, you need to know its physical parameters — mass ratio, spins, initial separation, eccentricity, etc. — as well as information about the simulation itself and where to find the data.  The `sxs.Metadata` object encapsulates that information, with nice interactive features to help you explore.

Just to take an example, let's focus on one particular simulation in the list, the binary black hole simulation with SXS ID `SXS:BBH:0123`:

In [None]:
metadata = simulations["SXS:BBH:0123"]

Essentially, `metadata` is a standard Python `dict`, with a few extra bells and whistles.  For example, it looks a bit tidier than your basic `dict`:

In [None]:
metadata

Some of these fields are more interesting than others.  Presumably, the most interesting ones are the numbers — things like the mass ratio and spins.  You can access them individually just like any `dict`:

In [None]:
metadata["reference_mass_ratio"]

Note that we also have tab completion when using IPython (or Jupyter).  For example, if you just start with

```python
metadata["reference_m
```

and then hit tab, you'll see a list of possible completions.  Every key can also be accessed as an attribute:

In [None]:
metadata.reference_mass_ratio

This also gives you tab completion.

Finally, we also provide some backwards compatibility with the older NRAR metadata format, which called for hyphens to be used where we use underscores:

In [None]:
metadata["reference-mass-ratio"]
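
To get a feeling for how these three access styles can coexist, here is a minimal sketch of a `dict` subclass that supports them.  The class `ToyMetadata` is hypothetical and is *not* the actual `sxs.Metadata` implementation; it only illustrates one way such behavior could be built.  The `_ipython_key_completions_` method is the hook that IPython looks for when completing `obj["...` expressions.

```python
class ToyMetadata(dict):
    """Hypothetical dict subclass; NOT the real sxs.Metadata implementation."""

    @staticmethod
    def _normalize(key):
        # Map NRAR-style hyphenated keys onto underscore keys
        return key.replace("-", "_") if isinstance(key, str) else key

    def __getitem__(self, key):
        return super().__getitem__(self._normalize(key))

    def __getattr__(self, name):
        # Called only when normal attribute lookup fails
        try:
            return self[name]
        except KeyError:
            raise AttributeError(name) from None

    def __dir__(self):
        # Advertise keys so that tab completion can offer them as attributes
        return sorted(set(super().__dir__()) | {k for k in self if isinstance(k, str)})

    def _ipython_key_completions_(self):
        # IPython calls this hook to complete `obj["...` expressions
        return list(self)


# Made-up value, for illustration only
toy = ToyMetadata(reference_mass_ratio=1.1)
print(toy["reference_mass_ratio"], toy.reference_mass_ratio, toy["reference-mass-ratio"])
```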

## Pain points with the metadata

One of the issues that has built up over time is the fact that metadata keys are not entirely consistent.  For example, one key:value pair we see above is this:

In [None]:
metadata["reference_eccentricity"]

We might have expected to get a number out of this, but we got a string.  This is because the eccentricity-fitting function can't always determine a precise value; in such cases it returns only an upper bound, which is stored as a string.  So if you're sorting through lots of different metadata files, looking for eccentricities above 0.1, say, you might have a line that says

```python
if metadata["reference_eccentricity"] > 0.1:
    do_something()
```

Unfortunately, once you get to this particular metadata file, that test ***WILL RAISE AN ERROR***:

In [None]:
metadata["reference_eccentricity"] > 0.1
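
If you do need to work with the raw metadata, one defensive option is to convert the value before comparing.  The helper below is a hypothetical sketch, not part of `sxs`; it assumes that non-numeric entries (for example, an upper bound written something like `"<1e-5"`) can safely be treated as "unknown" and mapped to NaN:

```python
import math


def eccentricity_as_float(value):
    """Return `value` as a float, or NaN if it cannot be parsed as a number.

    Hypothetical helper, not part of `sxs`.  Upper bounds and missing
    values are deliberately mapped to NaN.
    """
    try:
        return float(value)
    except (TypeError, ValueError):
        return math.nan


# Made-up inputs: a number, a numeric string, an upper bound, a missing value
for raw in [0.25, "0.25", "<1e-5", None]:
    ecc = eccentricity_as_float(raw)
    # Comparisons with NaN are always False, so no exception is raised
    print(repr(raw), ecc > 0.1)
```

Note that this throws away the information contained in an upper bound, which is exactly the kind of trade-off discussed in the next section.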

There are also many datasets where values are missing.  For example, many of these keys make no sense for simulations with matter (BHNS and NSNS); similarly, many critical pieces of information in matter simulations are irrelevant for BBH simulations.

We need a more systematic interface to the data.

## Pain reliever: the dataframe

The idea behind these metadata objects is that they should serve as the official records of what was written at the time the simulation was run.  We don't want to be too clever about fixing the pain points, because we might incorrectly change some critical piece of information.

*However*, if you are willing to accept, for the sake of consistency, that some data you could have made sense of may be replaced with NaNs, then the `simulations` object provides a more uniform interface to all the metadata collected in one place, in the form of `simulations.dataframe`.

The widely used `pandas` package is designed for precisely this application: analysing tabular data with heterogeneously typed columns.  It provides very powerful features for all sorts of sorting, selection, and statistical analysis.  So we use `pandas` to help us:

In [None]:
df = simulations.dataframe

This creates a dataframe (or table) with consistent types, and NaN for missing values:

In [None]:
df
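
To see why NaN is a convenient stand-in, consider a made-up column that mixes numbers, a numeric string, an upper bound, and a missing value.  The sketch below uses `pandas.to_numeric` with `errors="coerce"`; this illustrates the general idea only, and is not necessarily how `sxs` constructs its dataframe:

```python
import pandas as pd

# Made-up column mixing numbers, a numeric string, an upper bound, and a gap
raw = pd.Series([0.02, "0.3", "<1e-5", None], name="reference_eccentricity")

# errors="coerce" turns anything unparseable into NaN instead of raising
clean = pd.to_numeric(raw, errors="coerce")
print(clean.dtype)

# Comparisons now work on every row; NaN simply compares as False
print(clean > 0.1)
```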

Plus, we can use the `qgridnext` package to make this cool interactive table (which, unfortunately, will not show up if you are viewing this as a static web page):

In [None]:
from qgridnext import show_grid
show_grid(df, precision=8, show_toolbar=True, grid_options={"forceFitColumns": False})

You can sort by a column by clicking on the column header.  You can also filter by value by clicking the <span class="fa fa-filter filter-icon"></span> icon in the header.

## Doing that and more, programmatically

While graphical interfaces are fun, there is more reproducibility and power in programming.

### Slices

We can slice the dataframe in a dizzying number of ways.  But there are two that are simplest and most reliable.  First, and most easily, we can take standard slices, like the first four elements:

In [None]:
df[:4]

Or we can select columns to extract:

In [None]:
df[["object_types", "initial_adot"]]

To combine them, we just do them in sequence:

In [None]:
df[:4][["object_types", "initial_adot"]]

### Tests

The concept of tests is fairly simple.  For example, we can test whether or not the `object_types` field is equal to `BHNS`:

In [None]:
df["object_types"] == "BHNS"

We get a pandas Series object, where most of the results say `False`, but the last few say `True` — because they are the ones for which the `object_types` field is `BHNS`.  Now, we can use this Series just like we would in numpy to extract the items where this test gives us `True`:

In [None]:
df[df["object_types"] == "BHNS"]

(Here, we're just looking at the data, so we don't bother with the fancy grid we used above.)

Next, we might want to combine tests.  This is done by putting each test inside parentheses, and combining results with `&`:

In [None]:
df[(df["object_types"] == "BHNS") & (df["initial_separation"] < 52)]

Here, the combined test is only `True` if both tests return `True`; the `&` operator is the boolean AND.  We also have OR with `|` and XOR with `^`, as well as negation with `~`, though negation can usually be achieved by changing the test.
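
The parentheses matter because `&`, `|`, and `^` bind more tightly than comparisons like `==` and `<` in Python.  Here is a small sketch of all four operators acting on a made-up dataframe (the values and the index labels `a` to `d` are invented for illustration):

```python
import pandas as pd

# Made-up values, for illustration only
toy = pd.DataFrame(
    {
        "object_types": ["BHBH", "BHNS", "BHNS", "NSNS"],
        "initial_separation": [15.0, 40.0, 60.0, 55.0],
    },
    index=["a", "b", "c", "d"],
)

# Naming the individual tests keeps the combinations readable
is_bhns = toy["object_types"] == "BHNS"
is_close = toy["initial_separation"] < 52

both = toy[is_bhns & is_close].index.tolist()      # AND
either = toy[is_bhns | is_close].index.tolist()    # OR
only_one = toy[is_bhns ^ is_close].index.tolist()  # XOR
not_bhns = toy[~is_bhns].index.tolist()            # NOT

print(both, either, only_one, not_bhns)
```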

Before we do anything else, it's convenient to use what we've just learned to separate out the different types of systems:

In [None]:
BHBH = df[df["object_types"] == "BHBH"]
BHNS = df[df["object_types"] == "BHNS"]
NSNS = df[df["object_types"] == "NSNS"]
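
When there are many categories, writing one line per category gets tedious.  As an alternative sketch, `groupby` can split a dataframe by the distinct values of a column in one pass; the dataframe `toy` and its values below are made up for illustration:

```python
import pandas as pd

# Made-up values, for illustration only
toy = pd.DataFrame(
    {
        "object_types": ["BHBH", "BHNS", "BHBH", "NSNS"],
        "initial_separation": [15.0, 40.0, 18.0, 55.0],
    }
)

# One sub-dataframe per distinct value of `object_types`
by_type = {name: group for name, group in toy.groupby("object_types")}
print({name: len(group) for name, group in by_type.items()})
```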

### Sorting

As with the fancy graphical table above, we can perform a standard sort with respect to any key:

In [None]:
BHBH.sort_values("initial_separation")

But unlike the fancy graphical table above, we can use a function that serves as the sort key.  (This sort of key function is also available in the standard Python library's `sorted` function.)  Here, we'll sort by the absolute value of the difference between `initial_separation` and 20.0.

In [None]:
sorting_field = "initial_separation"
desired_value = 20.0

BHBH.sort_values(sorting_field, key=lambda s: abs(s-desired_value))

So, if we want the 8 systems with initial separations closest to 20, we can just take them:

In [None]:
sorting_field = "initial_separation"
desired_value = 20.0
N = 8

BHBH.sort_values(sorting_field, key=lambda s: abs(s-desired_value))[:N]
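
For comparison, here is the same idea using the built-in `sorted` function on made-up data.  One difference worth keeping in mind: the `key` passed to `sort_values` receives the whole column as a Series and must return a Series of the same shape, whereas the `key` passed to `sorted` receives one element at a time.

```python
# Made-up separations, for illustration only
separations = {"a": 12.5, "b": 19.0, "c": 30.0, "d": 21.5, "e": 20.2}
desired_value = 20.0
N = 3

# `key` is called once per element and returns the value to sort by
closest = sorted(
    separations,
    key=lambda name: abs(separations[name] - desired_value),
)[:N]
print(closest)
```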

### Plotting

Pandas also makes it easy to plot the various quantities.  For example, we can make a scatter plot of $\chi_{\mathrm{eff}}$ versus mass ratio:

In [None]:
BHBH.plot("reference_mass_ratio", "reference_chi_eff", kind="scatter")

# pandas adds the column labels as axis labels, but we can make them look nicer
import matplotlib.pyplot as plt
plt.xlabel(r"Mass ratio")
plt.ylabel(r"$\chi_\mathrm{eff}$");

Or we can make histograms of the data:

In [None]:
BHBH["initial_ADM_linear_momentum_mag"].plot.hist(log=True)
plt.xlabel(r"Magnitude of ADM linear momentum");

We can even make corner plots:

In [None]:
import seaborn as sns

pp = sns.pairplot(
    BHBH[["reference_chi_eff", "reference_chi1_perp", "reference_chi2_perp", "reference_mass_ratio"]].dropna(),
    corner=True,
)
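# Relabel the axes with nicer names.  Note that `_add_axis_labels` is a
# private seaborn method, so this trick may stop working in other versions.
pp.y_vars = [r"$\chi_{\mathrm{eff}}$", r"$\chi_{\perp,1}$", r"$\chi_{\perp,2}$", r"$q$"]
pp.x_vars = pp.y_vars
pp._add_axis_labels()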

The `simulations` object, especially when augmented with the `dataframe`, provides powerful methods for selecting the particular simulations we are interested in.  Once we have done so, we need to load and interact with the simulations.

Continue with the [introduction to the `Simulation` objects](/tutorials/02-Simulation).