# Packet Dataframe Comparison

## Simulation Setup

Run this notebook first on the most up-to-date `master` branch to produce the `master_packet_data.h5` file.
Then run it again on a feature branch and compare the new packet data against that file to check that packet behavior has not changed.

**Note:** Disabling Numba's JIT compilation (`njit`) may be needed for numerical consistency between runs.

In [69]:
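import os

# Set this before any import that pulls in numba (TARDIS does);
# the variable is read when numba is first imported.
os.environ["NUMBA_DISABLE_JIT"] = "1"

import numpy as np
from tardis.io.configuration.config_reader import Configuration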

### Configuring and running the simulation

In [70]:
from tardis import run_tardis
from tardis.io.atom_data import download_atom_data

# We download the atomic data needed to run the simulation
download_atom_data('kurucz_cd23_chianti_H_He_latest')

#config = Configuration.from_yaml("/home/connor/tardis/docs/reference/tardis_example.yml")
config = Configuration.from_yaml("/home/connor/tardis-dev/tardis/io/configuration/tests/data/tardis_configv1_verysimple.yml")
config.atom_data = "kurucz_cd23_chianti_H_He_latest.h5"
config.montecarlo.tracking.track_rpacket = True
config.montecarlo.no_of_packets = 1000
config.montecarlo.last_no_of_packets = 1000

sim = run_tardis(config, virtual_packet_logging=False)



## Extract the packet data

TARDIS provides a helper for this, `extract_and_process_packet_data`, in `tardis/visualization/plot_util.py`. One of its outputs is a dataframe, `packets_df`, with one row per packet and columns describing the packet's transport during the simulation.

In [71]:
from tardis.visualization.plot_util import extract_and_process_packet_data

packet_data = extract_and_process_packet_data(sim, packets_mode="real")

In [72]:
packet_data['packets_df']

Unnamed: 0,last_interaction_type,last_line_interaction_in_id,last_line_interaction_out_id,last_line_interaction_in_nu,last_interaction_in_r,nus,energies,lambdas
0,NO_INTERACTION,-1,-1,,,8.754070e+14,0.001019,3424.606480
1,LINE,5431,9343,1.633769e+15,1.411951e+15,8.018690e+14,0.001000,3738.671133
2,LINE,11222,12054,5.764874e+14,1.282665e+15,5.486258e+14,0.001026,5464.425370
3,LINE,7880,7697,1.012228e+15,1.301632e+15,1.104569e+15,0.001017,2714.111376
4,NO_INTERACTION,-1,-1,,,1.504592e+14,0.001018,19925.171932
...,...,...,...,...,...,...,...,...
739,NO_INTERACTION,-1,-1,,,9.159290e+14,0.001019,3273.097064
740,ESCATTERING,-1,-1,7.122196e+14,1.502366e+15,7.100973e+14,0.000977,4221.850267
741,ESCATTERING,-1,-1,9.132507e+14,1.420474e+15,9.302076e+14,0.001018,3222.855242
742,LINE,7697,7697,1.127092e+15,1.803440e+15,1.053940e+15,0.000968,2844.493195


## Load packet data from `master` for comparison

In [92]:
import pandas as pd

# The context manager closes the store even if the read fails
with pd.HDFStore("master_packet_data.h5", mode="r") as store:
    master_packet_data = store['packets_df_master']
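
The `master_packet_data.h5` file read above has to be written during the run on `master`. A minimal sketch of that step, assuming the dataframe is stored under the same `packets_df_master` key that is read here. The helper name `save_packets_df` is hypothetical, and writing HDF5 from pandas requires the optional PyTables dependency.

```python
import pandas as pd


def save_packets_df(packets_df, path="master_packet_data.h5", key="packets_df_master"):
    """Write a packet dataframe to an HDF5 file (hypothetical helper, needs PyTables)."""
    # mode="w" overwrites any stale file left over from an earlier run
    packets_df.to_hdf(path, key=key, mode="w")


# On `master`, after extracting the packet data:
# save_packets_df(packet_data["packets_df"])
```

Using the same key for writing and reading keeps the two runs of the notebook in sync.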

### Sanity check: make sure we can compare one of the data columns

In this case, the two versions of TARDIS I'm working with store the `last_interaction_type` column differently: in `master` it's simply an integer, while in our feature branch it's the name of an `InteractionType` member (e.g. `LINE`). To bring them both into the same format, I apply a function to the data from `master` that converts each integer to the name of the corresponding `InteractionType`.

In [74]:
# Last interaction type column
from tardis.transport.montecarlo.packets.radiative_packet import InteractionType

expected_li_type = master_packet_data['last_interaction_type'].apply(lambda f: InteractionType(f).name)
obtained_li_type = packet_data['packets_df']['last_interaction_type']

assert (expected_li_type == obtained_li_type).all()

The assertion passes, so we know these two columns are identical.
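
The conversion can be tried without a TARDIS installation. The sketch below uses a stand-in enum, `DemoInteractionType`, whose member values are assumptions chosen for illustration and are not necessarily those of the real `InteractionType`.

```python
from enum import IntEnum

import pandas as pd


# Stand-in enum for illustration only; the member values are assumed,
# not taken from the real TARDIS InteractionType.
class DemoInteractionType(IntEnum):
    NO_INTERACTION = -1
    ESCATTERING = 1
    LINE = 2


# Integer codes, in the style of `master`
master_col = pd.Series([2, -1, 1])
# Member names, in the style of the feature branch
feature_col = pd.Series(["LINE", "NO_INTERACTION", "ESCATTERING"])

# Look up each integer in the enum and keep only the member's name
converted = master_col.apply(lambda code: DemoInteractionType(code).name)

assert (converted == feature_col).all()
```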

## Functions to compare packet data

### `get_mismatches`

This simply masks out the matching data from two columns using a comparison function `func`, which defaults to the "is not equal" function.

The function `func` should act on two arguments and return a boolean array where they are *not* equal.

In [75]:
def get_mismatches(obtained, expected, func=lambda x, y: ~(x == y)):
    """Return the rows where `obtained` and `expected` differ, side by side."""
    # func returns True wherever the two columns do NOT match
    mismatch_mask = func(obtained, expected)
    comparison = pd.concat([obtained[mismatch_mask], expected[mismatch_mask]], axis=1)
    comparison.columns = ["obtained", "expected"]
    return comparison
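
As an illustration of the `func` argument, here is a small sketch on toy data that swaps the exact comparison for a tolerance-based one. The function is restated so the snippet runs on its own. The `not_close` helper and its `rtol` value are assumptions for the example, not something the notebook uses.

```python
import numpy as np
import pandas as pd


def get_mismatches(obtained, expected, func=lambda x, y: ~(x == y)):
    # Restated from above so this snippet runs on its own
    mismatch_mask = func(obtained, expected)
    comparison = pd.concat([obtained[mismatch_mask], expected[mismatch_mask]], axis=1)
    comparison.columns = ["obtained", "expected"]
    return comparison


obtained = pd.Series([1.0, 2.0, 3.0 + 1e-13])
expected = pd.Series([1.0, 2.5, 3.0])

# Default comparison: exact equality, so rounding noise counts as a mismatch
exact = get_mismatches(obtained, expected)
print(list(exact.index))  # [1, 2]


# Tolerant comparison (hypothetical helper): True where values are NOT close
def not_close(x, y):
    return ~np.isclose(x, y, rtol=1e-9, atol=0.0)


tolerant = get_mismatches(obtained, expected, func=not_close)
print(list(tolerant.index))  # [1]
```

A tolerant comparison like this might be useful for floating-point columns such as `nus` if JIT compilation cannot be disabled, at the cost of hiding very small genuine differences.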

### `packets_df_equal`

Here is a simple wrapper that applies `get_mismatches` to each column of the packet dataframes. Some of the data is formatted differently between the two branches (e.g., for the `last_line_interaction_in_nu` column, empty values are `NaN` on the feature branch but `0.0` on `master`), so I've defined specific conversions for the columns that need them, including the interaction type conversion used above.

In [122]:
def packets_df_equal(obtained, expected):
    """Return a dict mapping each mismatching column name to its mismatches."""
    comparison_dict = {}
    for colname in obtained.columns:

        # The conversions below return new Series, so the input dataframes are never modified
        expected_data = expected[colname]
        obtained_data = obtained[colname]

        # Specific type conversion and operations for columns that need it:
        if colname == 'last_interaction_type':
            # master stores integers; convert them to InteractionType names
            expected_data = expected_data.apply(lambda f: InteractionType(f).name)
        elif colname in ('last_line_interaction_in_nu', 'last_interaction_in_r'):
            # Replace NaN on the feature branch with the 0.0 used on master
            obtained_data = obtained_data.replace(np.nan, 0.0)

        comparison = get_mismatches(obtained_data, expected_data)
        # Keep only the columns that actually have mismatches
        if len(comparison) > 0:
            comparison_dict[colname] = comparison
    return comparison_dict
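
The `NaN` replacement matters because `NaN` never compares equal to anything, so without it every empty value would be reported as a mismatch. A small sketch on made-up values, assuming the `NaN` versus `0.0` convention described above:

```python
import numpy as np
import pandas as pd

# Feature-branch style: NaN where there was no interaction
obtained_nu = pd.Series([np.nan, 1.6e15, np.nan])
# master style: 0.0 in the same places
expected_nu = pd.Series([0.0, 1.6e15, 0.0])

# Without normalization the empty values are flagged as mismatches
raw_mismatch = ~(obtained_nu == expected_nu)
print(raw_mismatch.tolist())  # [True, False, True]

# replace() returns a new Series; obtained_nu itself is left untouched
normalized = obtained_nu.replace(np.nan, 0.0)
normalized_mismatch = ~(normalized == expected_nu)
print(normalized_mismatch.tolist())  # [False, False, False]
```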

## Packet Dataframe Comparisons

In [125]:
comparison = packets_df_equal(packet_data['packets_df'], master_packet_data)
print(comparison.keys())

dict_keys(['last_line_interaction_in_id', 'last_line_interaction_out_id', 'last_line_interaction_in_nu', 'last_interaction_in_r'])


Now we have a dict containing all of the data that doesn't match between the two packet dataframes. It's been filtered down to *only* the mismatches, and organized into "expected" and "obtained" for each packet df column.

The fact that there is no `last_interaction_type` key in this dict means that all the interaction types are identical, as expected: wherever a packet's last interaction is a line interaction in `master`, it is also a line interaction in the feature branch, and likewise for `ESCATTERING` and `NO_INTERACTION`.

Note that the `expected` data here is from `master`, and the `obtained` data is from the feature branch.

We can use the indices of these mismatches to investigate what's different about the packet behavior.

In [129]:
comparison['last_line_interaction_in_id']

Unnamed: 0,obtained,expected
11,-1,7317
48,-1,2978
56,-1,4545
57,-1,9841
66,-1,3710
71,-1,4681
80,-1,3138
83,-1,7697
88,-1,9402
100,-1,5406


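This appears to be a difference in what information is stored for packets whose last interaction is not a line interaction. For each of these packets the feature branch stores `-1` as the line interaction ID while `master` stores a line ID, and the rows below show that their last interaction type is `ESCATTERING`: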

In [68]:
packet_data['packets_df'].loc[comparison['last_line_interaction_in_id'].index]

Unnamed: 0,last_interaction_type,last_line_interaction_in_id,last_line_interaction_out_id,last_line_interaction_in_nu,last_interaction_in_r,nus,energies,lambdas
11,ESCATTERING,-1,-1,1107081000000000.0,1534973000000000.0,1120870000000000.0,0.000947,2674.640141
48,ESCATTERING,-1,-1,948314000000000.0,1274989000000000.0,922955500000000.0,0.001001,3248.178792
56,ESCATTERING,-1,-1,1961107000000000.0,1611699000000000.0,1976586000000000.0,0.000951,1516.718698
57,ESCATTERING,-1,-1,1399970000000000.0,1544619000000000.0,1398831000000000.0,0.000955,2143.163717
66,ESCATTERING,-1,-1,1130159000000000.0,1317908000000000.0,1168623000000000.0,0.000945,2565.348807
71,ESCATTERING,-1,-1,726737500000000.0,1544911000000000.0,713566900000000.0,0.000933,4201.322239
80,ESCATTERING,-1,-1,1163441000000000.0,1353222000000000.0,1193888000000000.0,0.000999,2511.060371
83,ESCATTERING,-1,-1,1030796000000000.0,1459022000000000.0,1078512000000000.0,0.000988,2779.685023
88,ESCATTERING,-1,-1,923177500000000.0,1425393000000000.0,909545000000000.0,0.001006,3296.070534
100,ESCATTERING,-1,-1,479096900000000.0,1376941000000000.0,477758700000000.0,0.000957,6274.976123
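
Rather than reading the rows by eye, the same check can be summarized programmatically. The sketch below uses small toy stand-ins for `packet_data['packets_df']` and for one entry of the comparison dict. The values are made up for illustration.

```python
import pandas as pd

# Toy stand-in for packet_data['packets_df']
packets_df = pd.DataFrame(
    {
        "last_interaction_type": ["LINE", "ESCATTERING", "ESCATTERING", "NO_INTERACTION"],
        "last_line_interaction_in_id": [5431, -1, -1, -1],
    }
)
# Toy stand-in for comparison['last_line_interaction_in_id']
mismatches = pd.DataFrame({"obtained": [-1, -1], "expected": [7317, 2978]}, index=[1, 2])

# Which interaction types do the mismatched packets end on?
counts = packets_df.loc[mismatches.index, "last_interaction_type"].value_counts()
print(counts.to_dict())  # {'ESCATTERING': 2}

# Do all mismatches share the same obtained value?
obtained_values = mismatches["obtained"].unique().tolist()
print(obtained_values)  # [-1]
```

If every mismatched packet ends on the same interaction type, that points to a systematic difference in bookkeeping rather than a change in the transport itself.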
