# The Evaluation
This notebook contains the evaluation of our paper.

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib
import numpy as np
import math
import scipy.stats as stats
from pathlib import Path

def pretty(ax):
    ax.spines['right'].set_visible(False)
    ax.spines['top'].set_visible(False)
    
    for spine in [ax.spines['left'], ax.spines['bottom'], ax.spines["right"], ax.spines['top']]:
        spine.set_position(("outward", 5))
        spine.set_color("gray")
        
    for axis in [ax.yaxis, ax.xaxis]:
        for x in axis.get_major_ticks():
            x.label1.set_color("gray")
            x.label2.set_color("gray")
            x.tick1line.set_color("gray")
            x.tick2line.set_color("gray")


This section covers the evaluation where we preserve the full bug. We start by loading the data and indexing it by `name`, `predicate`, and `strategy`. The data have been computed and put in `results/full/result.csv` by our evaluation framework. This evaluation is currently pointed at `pre-calculated`, which contains the data from one of our runs. It is slightly different from the results we have in the paper, but only in running time.

In [None]:
# Use this if you want to use the data you generated yourself.
# folder  = Path("result/full/")
# Use this to reproduce the results of the paper.
folder = Path("pre-calculated")
results = pd.read_csv(folder / "result.csv").set_index(["name", "predicate","strategy"])

We have a line for each combination of benchmark, predicate (decompiler), and strategy. We have the following 
strategies:

*  `jreduce`: the original, almost unmodified jreduce (see patch `nix/old-jreduce/reduce-util.patch`).
*  `classes`: our new version of jreduce with extended logging.
*  `items+graph+first` and `items+graph+last`: reduction at the granularity of the final version, but approximating
   the logic with a graph where we choose either the first or the last positive and negative element in each clause.
*  `items+logic`: the final version as described in the paper.
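
To make the two lossy graph strategies more concrete, here is a minimal sketch of turning one clause into one graph edge. It assumes a clause is a list of signed literals and that the edge runs from the chosen negative literal to the chosen positive one. The function `clause_to_edge`, the literal encoding, and the edge direction are hypothetical illustrations and are not taken from the jreduce sources.

```python
from typing import List, Optional, Tuple

# Hypothetical encoding of a literal: (is_positive, variable name).
Literal = Tuple[bool, str]


def clause_to_edge(clause: List[Literal], pick: str = "first") -> Optional[Tuple[str, str]]:
    """Approximate one clause by a single dependency edge.

    A clause such as (not a) or (not b) or c or d reads as
    "a and b together imply c or d". The sketch keeps one negative and one
    positive variable and returns the edge (negative, positive).
    """
    negatives = [name for positive, name in clause if not positive]
    positives = [name for positive, name in clause if positive]
    if not negatives or not positives:
        return None  # this sketch cannot represent the clause as an edge
    i = 0 if pick == "first" else -1
    return (negatives[i], positives[i])


clause = [(False, "a"), (False, "b"), (True, "c"), (True, "d")]
print(clause_to_edge(clause, "first"))  # ('a', 'c')
print(clause_to_edge(clause, "last"))   # ('b', 'd')
```

Either choice drops the other literals of the clause, which is why these strategies only approximate the logic.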

A single line of our data stores the following:

*  `bugs`, the number of lines in the cleaned-up bug report,

*  `initial-scc` and `scc`, the number of strongly connected components before and after reduction
    (for `items+logic` this contains the number of variables, and for `jreduce` it is 0 because we do not have access to the number of SCCs),

*  `initial-classes` and `classes`, the number of classes before and after reduction,

*  `initial-bytes` and `bytes`, the number of bytes before and after reduction,

*  `initial-lines` and `lines`, the number of lines in the decompiled code (for `jreduce` this is 0 because the original `jreduce` did not capture that data for each iteration),

*  `iters`, the number of invocations of the predicate,

*  `searches`, the number of binary searches made by the algorithm (ignored by `items+logic`),

*  `setup-time`, the time before the first iteration,

*  `time`, the time to reach the final successful solution,

*  `status`, which records whether the reduction completed correctly,

*  `verify`, which records information about whether the bug is preserved,

*  `flaky`, which records whether we have noticed something flaky happening.


Here is an example:


In [None]:
results.loc["url0067cdd33d_goldolphin_Mi", "cfr"]

While sadly not part of the paper, here is some in-depth data about the benchmarks.

In [None]:
fig, axes = plt.subplots(2, 4, figsize=(12,5), sharey=True)

cnfs = pd.read_csv( folder / "sizes/cnfs.csv").set_index("name")
index = results.unstack("strategy").index
bybench = pd.DataFrame(dict(clauses=[cnfs.clauses[n] for (n,v) in index], edges=[ cnfs.edges[n] for (n,v) in index]), index=index)
bybench["graphscore"] = bybench.edges / bybench.clauses

bugs = results["bugs"].unstack("strategy")["classes"]
initial_bytes = results["initial-bytes"].unstack("strategy")["classes"]
initial_classes = results["initial-classes"].unstack("strategy")["classes"]
initial_variables = results["initial-scc"].unstack("strategy")["items+logic"]
initial_lines = results["initial-lines"].unstack("strategy")["items+logic"]
clauses = cnfs.clauses[results.index]

number_of_benchmarks = len(bugs.index)

diagrams = [
    { "title": "Histogram of Classes"
    , "data": initial_classes
    , "xlabel": "Classes"
    },
    { "title": "Histogram of Bytes (in MB)"
    , "data": initial_bytes
    , "xformat" : lambda x, pos: f'{x / 1000000 :0.2f}'
    , "format" : lambda x, pos: f'{x / 1000 :0.0f} KB'
    , "xlabel": "Bytes (in MB)"
    },
    { "title": "Histogram of Lines"
    , "data": initial_lines
    , "xformat" : lambda x, pos: f'{x / 1000 :0.0f}k'
    , "format" : lambda x, pos: f'{x / 1000 :0.1f}k'
    , "xlabel": "Lines"
    #, "splits": np.linspace(bybench.graphscore.min(), 1, 11)
    },

    { "title": "Histogram of Bugs"
    , "data": bugs
    , "xlabel": "Errors in Output"
    , "format" : lambda x, pos: f'{x :0.1f}'
    , "splits": [1, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
    },
    
    { "title": "Histogram of Variables"
    , "data": initial_variables
    , "xformat" : lambda x, pos: f'{x / 1000 :0.0f}k'
    , "format" : lambda x, pos: f'{x / 1000 :0.1f}k'
    , "xlabel": "Reducible Items"
    },
    { "title": "Histogram of Clauses"
    , "data": bybench.clauses
    , "xformat" : lambda x, pos: f'{x / 1000 :0.0f}k'
    , "format" : lambda x, pos: f'{x / 1000 :0.1f}k'
    , "xlabel": "Clauses"
    },
    { "title": "Histogram of Percentage"
    , "data": bybench.graphscore
    , "xformat" : lambda x, pos: f'{x * 100 :0.0f}%'
    , "format" : lambda x, pos: f'{x * 100 :0.1f}%'
    , "xlabel": "Edges per Clause"
    , "splits": np.linspace(bybench.graphscore.min(), 1, 11)
    },

    
]

axes[0][0].set_ylabel("Benchmarks")
axes[1][0].set_ylabel("Benchmarks")
for ax, diagram in zip(axes.flatten(), diagrams):
    pretty(ax)
    
    data = diagram["data"]
    xlim = (data.min(), data.max())
    splits = diagram.get("splits",np.linspace(*xlim, 11).round(0))
    
    
    ax.set_xlim(*xlim)
    ax.set_xticks(splits[::2])
    ax.set_xticks(splits, minor=True)
    
    
    ylim = (0, number_of_benchmarks)
    ax.set_ylim(*ylim)
    ax.set_yticks(np.linspace(*ylim, 8).round(0))
    if ax not in (axes[0][0], axes[1][0]):
        ax.spines["left"].set_visible(False)
        for x in ax.yaxis.get_major_ticks():
            x.set_visible(False)
    ax.set_xlabel(diagram["xlabel"])
   
    blocks = ax.hist(diagram["data"], splits, color="black", rwidth=0.75)
    
    
    xformat = diagram.get("xformat", lambda x, pos: f'{x:0.0f}')
    ax.xaxis.set_major_formatter(plt.FuncFormatter(xformat))
    #ax.xaxis.set_tick_params(rotation=70)
    
    gmean = stats.gmean(diagram["data"])
    v = ax.vlines(gmean, *ylim)
    v.set_color("gray")
    v.set_linestyle(":")
    
    t = ax.text(gmean + (xlim[1] - xlim[0]) * 0.05, ylim[1] * 0.94, "GM " + diagram.get("format", xformat)(gmean, 0))
    t.set_color("gray")
    
fig.tight_layout()
fig.subplots_adjust(wspace=0.18)
fig.savefig("figures/benchmarks.eps")
fig.savefig("figures/benchmarks.png")

We are testing four strategies: 

- `classes`, which is equivalent to `jreduce`
- `items+graph+first`
- `items+graph+last`
- `items+logic`


In [None]:
strategies = list(reversed(["classes", "items+logic", "items+graph+first", "items+graph+last"]))

p = ["#0a1058", "#ee4242", "#ff9135", "#9857ff", "#4cb2ff"]


colors = { "classes" : "#f38989",  "items+logic": "#cc2424", "items+graph+first": "#ff9135", "items+graph+last": "#ff9135" }
shade  = { "classes" : "#F6F1B0",  "items+logic": "#B0B6F6"}
labels = { "classes" : "J-Reduce", "items+logic": "Our Reducer", "items+graph+first": "Graph (First)", "items+graph+last": "Graph (Last)"}
styles = { "classes" : "--",       "items+logic": "-", "items+graph+first": ":", "items+graph+last": ":"}


## Sanity Checks

*  What percentage of benchmarks does each strategy time out on? (Answer: none.)

In [None]:
fig, ax = plt.subplots(1, figsize=(10,0.7))

timeouts = (results.status == "timeout").groupby("strategy").mean()

ax.set_xlim(0, 100)

pretty(ax)
x = ax.barh(
        [labels[s] + " " + str((100 - timeouts[s] * 100).round(1)) + "%" for s in strategies], 
        [timeouts[s] * 100 for s in strategies], 
        color=[colors[s] for s in strategies],
    )

* On how many benchmarks does `classes` produce fewer classes than `items+logic`, and how many of those
  are not due to timeouts? (Answer: none.)

In [None]:
outperforms = []
for (b, p, x) in results.index:
    if x != "classes": continue
    c = results.classes
    if c.loc[(b, p, x)] < c.loc[(b, p, "items+logic")]:
        outperforms.append(
            ( b + "/" + p
            , (c.loc[(b, p, "items+logic")] / c.loc[(b, p, "classes")]).round(1)
            , results.loc[(b,p,"items+logic")].status
            )
        )
len(outperforms), len([x for x in outperforms if x[2] != "timeout"])


## Comparative reduction

In our first experiment we look at the comparative final size and time. We use the geometric mean, so that we can compare the results. We suppress some warnings because pandas has a weird bug when a column is named "time". We also replace all 0's with 1's so that the gmean does not crash on the number of lines in the `jreduce` output.
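
As a small illustration of why the zeros are replaced, here is a sketch of the geometric mean in plain Python (rather than `scipy.stats.gmean`, so that it stands alone). The helper names and the sample numbers are made up for illustration. Because the geometric mean is the exponential of the mean logarithm, a single 0 in the input has no defined logarithm.

```python
import math


def gmean(values):
    """Geometric mean: the exponential of the mean of the logarithms."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


def replace_zeros(values, substitute=1):
    """Swap every 0 for `substitute`, like `.replace(0, 1)` on the data frame."""
    return [substitute if v == 0 else v for v in values]


lines = [0, 0, 0]  # hypothetical `lines` column for `jreduce`, which is all zeros
sizes = [2, 8, 4]  # hypothetical final sizes of three benchmarks

print(gmean(replace_zeros(lines)))  # 1.0
print(gmean(sizes))                 # approximately 4, the cube root of 2 * 8 * 4

try:
    gmean(lines)
except ValueError as error:
    print("zeros are not allowed:", error)
```

With this replacement, the all-zero `lines` column of `jreduce` shows up as 1 in the table rather than as a meaningful measurement.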

In [None]:
import warnings
# Suppressing warnings due to a bug in pandas
with warnings.catch_warnings():
    warnings.simplefilter("ignore", category=RuntimeWarning)
    gmean = results[["time", "bytes", "classes", "lines"]].replace(0, 1).groupby("strategy").agg(stats.gmean)

gmean.round(1)

We also include the median for completeness:

In [None]:
results[["time", "bytes", "classes", "lines"]].groupby("strategy").median().round(1)

We use `classes` as a replacement for `jreduce` in the rest of the evaluation, so we want to make sure that they are equivalent on the two parameters we can measure: "time" and "classes". We can see that they are within a single percent in number of classes and 4% in time.

In [None]:
(((gmean.loc["classes"] / gmean.loc["jreduce"]).loc[["time", "classes"]] * 100) - 100).round(1) 

On page 11 we compare J-Reduce with our new reducers. To get the relative reduction we need the initial bytes, classes, and lines:

In [None]:
import warnings
# Suppressing warnings due to a bug in pandas
with warnings.catch_warnings():
    warnings.simplefilter("ignore", category=RuntimeWarning)
    initial = results[["initial-bytes", "initial-classes", "initial-lines"]]\
     .replace(0, 1)\
     .groupby("strategy").agg(stats.gmean)\
     .rename(lambda n: n[8:], axis=1)
initial.round(1)

Here are the results, as presented in the paper, where we use that J-Reduce is the `classes` strategy.
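
The relative sizes are computed as the geometric mean of the final sizes divided by the geometric mean of the initial sizes. The sketch below, using made-up numbers and a hypothetical `gmean` helper in plain Python, illustrates that this ratio equals the geometric mean of the per-benchmark ratios, so it can be read as a typical per-benchmark reduction.

```python
import math


def gmean(values):
    """Geometric mean: the exponential of the mean of the logarithms."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


initial = [1000, 4000, 250]  # hypothetical initial sizes of three benchmarks
final = [100, 200, 50]       # hypothetical sizes after reduction

ratio_of_means = gmean(final) / gmean(initial)
mean_of_ratios = gmean([f / i for f, i in zip(final, initial)])

print(math.isclose(ratio_of_means, mean_of_ratios))  # True
print(f"{ratio_of_means * 100:.1f}% of the original size")
```

The same identity does not hold for the arithmetic mean, which is one reason to prefer the geometric mean when summarising ratios.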

In [None]:
rel = (gmean[["bytes", "classes", "lines"]] / initial)
rel.round(3)

## Claims of Section 5

In [None]:
def percent(x): return (x * 100).round(1)

Our tool reduces Java Bytecode to 4.6% of the size:

In [None]:
percent(rel["bytes"]["items+logic"])

Which is 5.3 times better than the 24.3% achieved by J-Reduce:

In [None]:
percent(rel["bytes"]["classes"])

In [None]:
(rel["bytes"]["classes"] / rel["bytes"]["items+logic"]).round(1)

It does this while being only 3.1 times slower. **(In this dataset it becomes 3.3)**:

In [None]:
(gmean["time"]["items+logic"] / gmean["time"]["classes"]).round(1)

If we want the same amount of reduction as produced by J-Reduce, we can achieve that with our reducer after only 6 minutes (see Figure 8 (b)).

We can see that J-Reduce's and our reducer's geometric mean running times are 218.6 s and 680.7 s, respectively. **(In this dataset they become 220.0 s and 723.3 s.)**
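
As a quick arithmetic cross-check, the factors quoted above can be recomputed from the rounded figures in the text. This is only a sketch; the helper `times` is hypothetical, and because the inputs are already rounded, the last digit could in principle differ from what the cells compute on the unrounded data.

```python
def times(larger, smaller):
    """Ratio of two reported figures, rounded to one decimal like the cells here."""
    return round(larger / smaller, 1)


print(times(24.3, 4.6))     # 5.3, bytes left: J-Reduce vs. our reducer
print(times(680.7, 218.6))  # 3.1, running times reported in the paper
print(times(723.3, 220.0))  # 3.3, running times in this dataset
```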

In [None]:
gmean["time"][["classes", "items+logic"]].round(1)

The reduction of our reducer is much better: for number of classes we can reduce to 8.4% while J-Reduce gets 22.8%, and for bytes we reduce to 4.6% while J-Reduce gets 24.3%.

In [None]:
percent(rel[["classes", "bytes"]].loc[["classes", "items+logic"]])

We perform 2.7 times better on classes and 5.3 times better on bytes:

In [None]:
(rel.loc["classes"] / rel.loc["items+logic"])[["classes", "bytes"]].round(1)

Now we turn our focus on the lossy encodings... \[T\]he first lossy encoding leads to a reduction 4% faster than our reducer while the \[last\] leads to a reduction that is 2% slower.

In [None]:
time = gmean["time"]
percent(time[["items+graph+first", "items+graph+last"]] / time["items+logic"] -1)

The first lossy encoding produces 5% more bytes, \[... last\] produces 8% more bytes. Similarly, \[... 6% and 8%\] more lines:

In [None]:
percent(gmean.loc[["items+graph+first", "items+graph+last"]] / gmean.loc["items+logic"] - 1)[["bytes", "lines"]]

Overall our reducer is better than the first lossy encoding for 94% of the benchmarks \[... \] the second lossy encoding for 96% of our benchmarks.
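
The percentages are per-benchmark comparisons: for each benchmark we check whether our reducer's output is no larger than the lossy encoding's, and then take the share of benchmarks for which that holds. A minimal sketch with made-up byte counts follows (the function name is hypothetical). Note that ties count in favour of our reducer, mirroring the `>=` in the cell that follows.

```python
def share_at_least_as_small(ours, other):
    """Percentage of benchmarks where `ours` is no larger than `other`."""
    wins = [o <= x for o, x in zip(ours, other)]
    return 100 * sum(wins) / len(wins)


ours = [10, 20, 30, 40]   # hypothetical final bytes with the full logic
lossy = [12, 20, 25, 44]  # hypothetical final bytes with a lossy encoding

print(share_at_least_as_small(ours, lossy))  # 75.0
```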

In [None]:
onbytes = results.unstack("strategy")["bytes"]
print("items+graph+first", 
 percent((onbytes["items+graph+first"] >= onbytes["items+logic"]).mean()))
print("items+graph+last", 
 percent((onbytes["items+graph+last"] >= onbytes["items+logic"]).mean()))


## Figure 8 (a)

Here are the cumulative frequency diagrams shown in Figure 8 (a), including the final size in lines as well. 

In [None]:
def draw_diagram(full):
    fig, axes = plt.subplots(1, 4, figsize=(12,3.2), sharey=True)
    
    diagrams = [
        { "title": "Finished Programs over Time"
        , "xformat": lambda x, pos: f'{x/3600:0.0f}:{x%3600/60:02.0f}'
        , "labelformat": lambda x: f'{x:0.1f}s'
        , "data": lambda s: list(sorted(d for d in full["time"].unstack("strategy")[s]))
        , "xlim": (0, 10 * 3600)
        , "xticks": np.linspace(0, 10*3600, 6)
        , "xlabel": "Time Spent (h:mm)"
        , "percent": False
        },
        { "title": "Finished Programs over Invocations"
        , "xformat": lambda x, pos: f'{x*100:0.0f}%'
        , "labelformat": lambda x: f'{x*100:0.1f}%'
        , "data": lambda s: sorted(full["classes"].unstack("strategy")[s] / initial_classes)
        , "xticks": np.linspace(0,1, 6)
        , "xlabel": "Final Relative Size (Classes)"
        },
        # { "title": "Finished Programs over Invocations"
        # , "xformat": lambda x, pos: f'{x:0.0f}'
        # , "data": lambda s: sorted(full["iters"].unstack("strategy")[s])
        # , "xlim": (0, full["iters"].max())
        # , "xlabel": "Invocations Made"
        # },
        { "title": "Finished Programs over Invocations"
        , "xformat": lambda x, pos: f'{x*100:0.0f}%'
        , "labelformat": lambda x: f'{x*100:0.1f}%'
        , "data": lambda s: sorted(full["bytes"].unstack("strategy")[s] / initial_bytes)
        , "xticks": np.linspace(0,1, 6)
        , "xlabel": "Final Relative Size (Bytes)"
        },
        { "title": "Finished Programs over Invocations"
        , "xformat": lambda x, pos: f'{x*100:0.0f}%'
        , "labelformat": lambda x: f'{x*100:0.1f}%'
        , "data": lambda s: sorted(full["lines"].unstack("strategy")[s] / initial_lines)
        , "xticks": np.linspace(0,1, 6)
        , "xlabel": "Final Relative Size (Lines)"
        },
        
        ]

    for diagram, ax in zip(diagrams, axes.flatten()):
        maxx, minx = 0, 1000000000
        pretty(ax)
       
        strats = sorted(["classes", "items+logic"], key=lambda s: np.mean(diagram["data"](s)))
        for s in strats:
            data = diagram["data"](s)
            ax.plot(data, [i + 1 for i,_ in enumerate(data)], 
                    label=labels[s], 
                    linestyle=styles[s],
                    color=colors[s])
            maxx = max(maxx, max(data))
            minx = min(minx, min(data))
            
            mean = stats.gmean(data)
            for i, x in enumerate(data):
                if x > mean: 
                    index = i + 1
                    break
            
            ax.scatter(mean, index, color=colors[s])
            if s == "items+logic":
                loc = (mean + 6 / 100 * diagram["xticks"][-1], index * 1.05 - 2)
            else:
                loc = (mean + 4 / 100 * diagram["xticks"][-1] , index - 18)
            
            ax.text(*loc, diagram["labelformat"](mean)
                    , color=colors[s]
                    , bbox=dict(boxstyle="round", fc="white", ec="white")
                   )
            
        minx = max(1, minx)
        

        xlim = diagram.get("xlim", (0, maxx))
        ax.set_xlim(*xlim)
        xtics = diagram.get("xticks", np.linspace(*xlim, 7))
        ax.set_xticks(xtics)
        
        ylim = 0, number_of_benchmarks
        ax.set_yticks(np.linspace(*ylim, 7).round())
        ax.set_yticks([], minor=True)
        ax.set_ylim(*ylim)
        if ax == axes[0]:
            ax.set_ylabel("Benchmarks")
        
        
        if diagram.get("percent", False):
            ax2 = ax.twinx()
            pretty(ax2)
            
            yticks = [227, 200]
            strats = sorted(strategies, key=lambda s: -len(diagram["data"](s)))
            ytickslabels = [f"{(len(diagram['data'](s)) - 1) / number_of_benchmarks * 100:0.0f} %" for s in strats]
            ax2.set_yticks(yticks)
            ax2.set_yticklabels(ytickslabels)
        
        
        ax.xaxis.set_major_formatter(plt.FuncFormatter(diagram.get("xformat", lambda x, pos: f'{x:0.0f}')))
        
        
        v = ax.hlines(round(number_of_benchmarks/2),*xlim)
        v.set_color("gray")
        v.set_linestyle(":")
                                      
        if ax == axes[0]:
            t = axes[0].text(15005, round(len(data)/2) + 4.5, "MEDIAN")
            t.set_color("gray")
                            
            
        
        ax.set_xlabel(diagram["xlabel"])    
    

    fig.tight_layout()
    fig.subplots_adjust(wspace=0.13)
    axes[2].legend(loc="lower right")
    return fig

fig = draw_diagram(results)
fig.savefig("figures/timings.eps")
fig.savefig("figures/timings.png")

## Figure 8 (b) 

Figure 8 (b) uses the `bytes.csv` and `classes.csv` files, which contain the number of bytes and classes left after some amount of time. The extraction code can be found in `bin/minutes.py`.
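
The lower two panels of the figure show the same series as "times smaller", that is, the largest value in the series divided by the value at each point in time, and label the right-hand axis with the corresponding percentage left. A minimal sketch of that conversion, assuming a series keyed by minutes and using made-up numbers and hypothetical function names:

```python
def reduction_factors(left_by_minute):
    """Map each minute to how many times smaller the input is than at its largest."""
    largest = max(left_by_minute.values())
    return {minute: largest / left for minute, left in left_by_minute.items()}


def percent_left(factor):
    """Convert a "times smaller" factor back into the percentage that is left."""
    return 100 / factor


left = {0: 1000.0, 6: 250.0, 60: 50.0}  # hypothetical mean bytes left per minute

factors = reduction_factors(left)
print(factors)                    # {0: 1.0, 6: 4.0, 60: 20.0}
print(percent_left(factors[60]))  # 5.0
```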

In [None]:
import warnings
# Suppressing warnings due to a bug in pandas
with warnings.catch_warnings():
    warnings.simplefilter("ignore", category=RuntimeWarning)

    tbytes = pd.read_csv(folder / "bytes.csv").groupby("strategy").agg(stats.gmean).T.rename(int)
    tclasses = pd.read_csv(folder / "classes.csv").groupby("strategy").agg(stats.gmean).T.rename(int)

    fclasses = results.classes.groupby("strategy").agg(stats.gmean)
    fbytes = results.bytes.groupby("strategy").agg(stats.gmean)

In [None]:
fig, axes = plt.subplots(2, 2, figsize=(10,5.6))

diagrams = [
    { "title": "Mean Classes Left Over Time"
    , "data":  tclasses
    , "format": lambda x, pos: f"{x:0.0f}"
    , "percent": True
    , "ylabel": "Mean Classes Left"
    , "best": fclasses
    },
    { "title": "Mean Bytes Left Over Time (h:m)"
    , "data":  tbytes
    , "format": lambda x, pos: f"{x / 1000:0.0f} KB"
    , "percent": True
    , "ylabel": "Mean Bytes Left"
    , "best": fbytes
    },
    { "title": "Mean Reduction of Classes Over Time"
    , "data":  tclasses.rdiv(tclasses.max())
    , "format": lambda x, pos: f"x {x:0.0f}"
    , "ylim": (1, 25)
    , "yscale": "linear"
    , "yticks": np.linspace(1, 25, 7)
    , "ylabel": "Mean Times Smaller (Classes)"
    , "xlabel": "Time Spent (h:mm)"
    , "percent": False
    , "best": fclasses.rdiv(tclasses.max())
    },
    { "title": "Mean Reduction of Bytes Over Time"
    , "data":  tbytes.rdiv(tbytes.max())
    , "ylim": (1, 25)
    , "yscale": "linear"
    , "yticks": np.linspace(1, 25, 7)
    , "format": lambda x, pos: f"x {x:0.0f}"
    , "ylabel": "Mean Times Smaller (Bytes)"
    , "percent": False
    , "xlabel": "Time Spent (h:mm)"
    , "best": fbytes.rdiv(tbytes.max())
    }
]

for ax, diagram in zip(axes.flatten(), diagrams):
    data = diagram["data"]
   
    pretty(ax)
    strats = ["classes", "items+logic"]    
    for s in reversed(strats):
        ax.plot(data.index * 60, data[s], label=labels[s], color=colors[s], linestyle=styles[s])
        
        v = ax.hlines(diagram["best"][s],(data.index * 60).min(), (data.index * 60).max())
        v.set_color("lightgray")
        v.set_linestyle(":")
        
        quantiles = diagram.get("quantiles", None)
        if quantiles:
            low,high = quantiles
            ax.fill_between(low.index * 60, low[s], high[s], label=labels[s], color=shade[s], linestyle=styles[s])
            #ax.plot(high.index * 60, high[s], label=labels[s], color=colors[s], linestyle=styles[s])
            
        
    ylim = diagram.get("ylim", (0, data[strats].max().max()))
    ax.set_ylim(*ylim)
    ax.set_yscale(diagram.get("yscale", "linear"))
    yticks = diagram.get("yticks", np.linspace(*ylim, 6).round())
    ax.set_yticks([],minor=True)
    ax.set_yticks(yticks)
    yformat = diagram["format"]
    ax.yaxis.set_major_formatter(plt.FuncFormatter(yformat))
    
    ax.set_ylabel(diagram.get("ylabel"))
    
    xlabel = diagram.get("xlabel")
    ax.set_xlabel(xlabel)
    
    v = ax.vlines(6*60, *ylim)
    v.set_color("lightgray")
    v.set_linestyle(":")
    
    t = ax.text(8*60, 25 if ylim[1] == 25 else ylim[1] * 0.95, "0:06")
    t.set_color("gray")
  
    
    if diagram.get("percent", False):
        ax2 = ax.twinx()
        ax2.set_ylabel("Percentage Left")
        pretty(ax2)
        ax2.spines['right'].set_visible(True)
        ax2.yaxis.set_major_formatter(matplotlib.ticker.PercentFormatter(1, 0))
    else: 
        ax2 = ax.twinx()
        pretty(ax2)
        ax2.spines['right'].set_visible(True)
        ax2.set_ylabel("Percentage Left")
        ax2.set_ylim(*ylim)
        ax2.set_yscale(diagram.get("yscale", "linear"))
        
        onehour = (diagram["data"]["items+logic"].loc[60])
        
        v = ax.hlines(onehour,(data.index * 60).min(), (data.index * 60).max())
        v.set_color("lightgray")
        v.set_linestyle(":")
 
        
        yticks, ytickslabels = zip(
            *diagram.get("yticks2",
                        [ (d, f"{1/d * 100:0.1f}%") for d in (diagram["best"][s] for s in strats)
                        ] + [(onehour, f"{1/onehour * 100:0.1f}%")]
            ))
        ax2.set_yticks(yticks)
        ax2.set_yticklabels(ytickslabels)
        ax2.set_yticks([],minor=True)
        
        ax2.invert_yaxis()
        ax.invert_yaxis()
        
    xlim = (0, 60 * 60 * 2 + 1)
    ax.set_xlim(*xlim)
    ax.set_xticks(np.linspace(*xlim, 5))
    ax.xaxis.set_major_formatter(plt.FuncFormatter(lambda x, pos: f'{math.floor(x/3600):0.0f}:{x%3600/60:02.0f}'))
    
   
axes[0][0].legend()
fig.tight_layout()
fig.subplots_adjust(hspace=0.20)
fig.savefig("figures/by-time.eps")
fig.savefig("figures/by-time.png")

    

## Choice of example

For the example in the introduction, we chose a benchmark that is as close as possible to the mean of the difference between J-Reduce and our reducer, while still being as close to the geometric mean as possible.

In the table below, we list the number of lines for both the median (position 113 when sorted by improvement) and the example we chose (position 129). We also list how many times better our approach is in number of bytes left.
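
The selection amounts to ranking the benchmarks by how many times fewer bytes our reducer leaves and then picking benchmarks by position in that ranking. A minimal sketch with made-up benchmark names and byte counts follows; the function `pick_by_rank` is hypothetical.

```python
def pick_by_rank(jreduce_bytes, ours_bytes, positions):
    """Rank benchmarks by jreduce/ours byte ratio and return those at `positions`."""
    ratios = {name: jreduce_bytes[name] / ours_bytes[name] for name in ours_bytes}
    ranked = sorted(ratios, key=ratios.get)  # smallest improvement first
    return [(ranked[p], round(ratios[ranked[p]], 1)) for p in positions]


# Hypothetical final sizes in bytes for five benchmarks.
jreduce_bytes = {"a": 100, "b": 300, "c": 90, "d": 800, "e": 50}
ours_bytes = {"a": 50, "b": 50, "c": 30, "d": 100, "e": 50}

# Position 2 is the median of five benchmarks; position 3 is the next one up.
print(pick_by_rank(jreduce_bytes, ours_bytes, [2, 3]))  # [('c', 3.0), ('b', 6.0)]
```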

In [None]:
x = results.unstack("strategy")
b = x["bytes"]
index = (b["classes"] / b["items+logic"]).sort_values().index[[113,129]]

pd.DataFrame({
    "initial": x["initial-lines"]["classes"],
    "classes": x["lines"]["classes"],
    "items+logic": x["lines"]["items+logic"],
    "better": (b["classes"] / b["items+logic"]).round(1)
}).loc[index]

## Total Evaluation Time

In [None]:
print((results["time"].sum() / 3600).round(), "h")