# Example 5: DSE with the Python API

The example design in this directory is sensitive to FIFO depth. Let's explore how changing the FIFO depth affects the latency. To do this, we will use LightningSim's Python API.

First, we import the necessary libraries.

In [None]:
import dataclasses
from pathlib import Path
from lightningsim.model import Solution
from lightningsim.runner import Runner, RunnerStep

The `Solution` class represents a specific Vitis HLS project solution and accepts a `pathlib.Path` to the solution directory. This is the same path you would pass to the `lightningsim` command on the command line.

In [None]:
solution = Solution(Path.cwd() / "solution1")

The `Runner` class is responsible for compiling the LLVM bitcode and running it to get the execution trace, as well as parsing and resolving the trace into a dynamic schedule.

It performs steps (A)–(D) of the LightningSim flow:

[Figure: The LightningSim flow.]

It takes the `Solution` to use as a parameter.

In [None]:
runner = Runner(solution)

By default, the `Runner` does not output any progress information as it runs. We can optionally hook into each step to print progress information.

In [None]:
runner.steps[RunnerStep.ANALYZING_PROJECT].on_start(lambda _: print("Analyzing project..."))
runner.steps[RunnerStep.WAITING_FOR_BITCODE].on_start(lambda _: print("Waiting for bitcode to be generated..."))
runner.steps[RunnerStep.GENERATING_SUPPORT_CODE].on_start(lambda _: print("Generating support code..."))
runner.steps[RunnerStep.LINKING_BITCODE].on_start(lambda _: print("Linking bitcode..."))
runner.steps[RunnerStep.COMPILING_BITCODE].on_start(lambda _: print("Compiling bitcode..."))
runner.steps[RunnerStep.LINKING_TESTBENCH].on_start(lambda _: print("Linking testbench..."))
runner.steps[RunnerStep.RUNNING_TESTBENCH].on_start(lambda _: print("Running testbench..."))
runner.steps[RunnerStep.PARSING_SCHEDULE_DATA].on_start(lambda _: print("Parsing schedule data from C synthesis..."))
runner.steps[RunnerStep.RESOLVING_TRACE].on_start(lambda _: print("Resolving dynamic schedule from trace..."));

Then, simply `await runner.run()` to get a `ResolvedTrace` object representing the dynamic schedule.

> **Note:** Because `runner.run()` is an async function, you must call it either as `await runner.run()` (from another async function) or `asyncio.run(runner.run())` (from a non-async function). Since Jupyter notebooks support top-level `await`, we use the first form here.
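
For readers working from a plain Python script rather than a notebook, the second form in the note above can be sketched as follows. `StubRunner` is a hypothetical stand-in that only mimics the async `run()` method; it is not part of LightningSim and does no real work.

```python
import asyncio


class StubRunner:
    """Hypothetical stand-in for lightningsim.runner.Runner (illustration only)."""

    async def run(self):
        # The real Runner compiles and traces the design; this stub only
        # yields control once and returns a placeholder value.
        await asyncio.sleep(0)
        return "resolved-trace-placeholder"


def main():
    runner = StubRunner()
    # A plain script has no running event loop, so asyncio.run() creates one,
    # drives the coroutine to completion, and returns its result.
    return asyncio.run(runner.run())


trace_placeholder = main()
print(trace_placeholder)
```

With the real `Runner`, the same pattern would presumably read `trace = asyncio.run(runner.run())`.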

In [None]:
trace = await runner.run()
trace

The resulting `ResolvedTrace` object has a few important properties:

* `compiled: CompiledSimulation`: Can be used to obtain actual simulation results, as will be described below.
* `fifos: list[ResolvedStream]`: A list of the FIFO streams used in the design. Each `ResolvedStream` in the list has three properties:
  * `id: int`: A unique identifier for this FIFO stream.
  * `name: str`: The name assigned to the FIFO stream by Vitis HLS.
  * `width: int`: The number of bits of each element processed by the FIFO stream.
* `axi_interfaces: list[AXIInterface]`: A list of the AXI interfaces used by the design. Each `AXIInterface` in the list has two properties:
  * `address: int`: A unique base address for the AXI interface.
  * `name: str`: The name assigned to the interface by Vitis HLS.
* `params: SimulationParameters`: A dataclass containing the stall calculation parameters for the final step of the simulation. It has two important properties:
  * `fifo_depths: dict[int, int | None]`: A mapping between FIFO stream `id`s and their corresponding buffer depths. A depth of `None` is treated as infinite.
  * `axi_delays: dict[int, int]`: A mapping between AXI interface `address`es and their corresponding latencies in cycles.
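
Because `fifo_depths` is keyed by FIFO `id` rather than by name, it can be handy to join it with the `fifos` list. The sketch below shows that join on made-up data. `StubStream` is a hypothetical stand-in with the three documented fields, not the real `ResolvedStream` class, and the FIFO names and depths are invented.

```python
from dataclasses import dataclass


@dataclass
class StubStream:
    """Hypothetical stand-in for ResolvedStream (id, name, width only)."""
    id: int
    name: str
    width: int


def depths_by_name(fifos, fifo_depths):
    """Map each FIFO name to its configured depth (None means infinite)."""
    return {fifo.name: fifo_depths[fifo.id] for fifo in fifos}


# Made-up data, not taken from the example design.
fifos = [
    StubStream(id=0, name="channel", width=32),
    StubStream(id=1, name="ctrl", width=8),
]
fifo_depths = {0: 2, 1: None}

named_depths = depths_by_name(fifos, fifo_depths)
for name, depth in named_depths.items():
    print(f"{name}: {'infinite' if depth is None else depth}")
```

With a real trace, the same helper would presumably be called as `depths_by_name(trace.fifos, trace.params.fifo_depths)`.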

The initial parameters provided in `trace.params` are the values synthesized by Vitis HLS. Therefore, the easiest way to run the simulation (and the default for the `lightningsim` command) is to call `trace.compiled.execute()` with `trace.params` as the argument.

This function performs steps (E)–(F) of the LightningSim flow.

In [None]:
try:
    simulation = trace.compiled.execute(trace.params)
except ValueError:
    print("Deadlock detected")
else:
    print(f"Kernel took {simulation.top_module.end:,d} cycles")

The resulting `Simulation` object has the following properties:

* `top_module: SimulatedModule`: The top module of the simulated kernel. It has the following properties:
  * `name: str`: The name of the module.
  * `start: int`: The clock cycle at which the module started. For the top module, this is always 0.
  * `end: int`: The clock cycle at which the module was done.
  * `submodules: list[SimulatedModule]`: A list of submodules invoked by this module, each with the same properties.
* `fifo_io: dict`: The clock cycles at which each FIFO stream was written to and read from during simulation.
* `axi_io: dict`: The clock cycles at which each AXI interface's requests and responses occurred during simulation.

By traversing the `submodules` of each module recursively, starting at the `simulation.top_module`, we can see the entire hierarchy of modules invoked within the kernel.

In [None]:
def print_module(module, *, indent="", max_digits=None):
    if max_digits is None:
        max_digits = len(f"{module.end:,d}")
    print(f"[{module.start:{max_digits},d}-{module.end:{max_digits},d}] {indent}{module.name}")
    for submodule in module.submodules:
        print_module(submodule, indent=indent + "    ", max_digits=max_digits)

print_module(simulation.top_module)
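
The same recursive traversal can also collect data instead of printing it. As a minimal sketch, the generator below walks the hierarchy and is used to find the longest-running leaf module. `StubModule` is a hypothetical stand-in carrying only the four documented fields, and the module names and cycle counts are invented.

```python
from dataclasses import dataclass, field


@dataclass
class StubModule:
    """Hypothetical stand-in for SimulatedModule (illustration only)."""
    name: str
    start: int
    end: int
    submodules: list = field(default_factory=list)


def iter_modules(module, depth=0):
    """Yield (depth, module) pairs in pre-order, parents before children."""
    yield depth, module
    for submodule in module.submodules:
        yield from iter_modules(submodule, depth + 1)


# Made-up hierarchy, not taken from the example design.
top = StubModule("top", 0, 100, [
    StubModule("producer", 0, 60),
    StubModule("consumer", 5, 100, [StubModule("inner_loop", 10, 95)]),
])

# Duration in cycles of every module in the hierarchy.
durations = {m.name: m.end - m.start for _, m in iter_modules(top)}

# The leaf (no submodules) that was active for the most cycles.
slowest_leaf = max(
    (m for _, m in iter_modules(top) if not m.submodules),
    key=lambda m: m.end - m.start,
)
print(durations)
print(slowest_leaf.name)
```

Passing `simulation.top_module` instead of `top` should work the same way, assuming the real objects expose the properties listed above.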

We can test out new FIFO buffer depths simply by passing new `SimulationParameters` to `trace.compiled.execute()`.

In [None]:
# Find the FIFO named "channel" and set its depth to 1024
fifo_to_change = next(fifo for fifo in trace.fifos if fifo.name == "channel")
new_fifo_depths = {**trace.params.fifo_depths, fifo_to_change.id: 1024}
new_params = dataclasses.replace(trace.params, fifo_depths=new_fifo_depths)
simulation = trace.compiled.execute(new_params)
print_module(simulation.top_module)
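
The pattern above builds a new parameter object rather than mutating `trace.params`, so the synthesized values stay available as a baseline. A small sketch of wrapping this in a helper for a sweep follows. `StubParams` is a hypothetical stand-in for `SimulationParameters` with only the two documented fields, and the depths are invented.

```python
import dataclasses
from dataclasses import dataclass


@dataclass(frozen=True)
class StubParams:
    """Hypothetical stand-in for SimulationParameters (illustration only)."""
    fifo_depths: dict
    axi_delays: dict


def with_fifo_depth(params, fifo_id, depth):
    """Return a copy of params with one FIFO depth overridden."""
    new_depths = {**params.fifo_depths, fifo_id: depth}
    return dataclasses.replace(params, fifo_depths=new_depths)


# Made-up baseline: two FIFOs, both of depth 2.
base = StubParams(fifo_depths={0: 2, 1: 2}, axi_delays={})

# One parameter set per candidate depth for FIFO 0; FIFO 1 is left alone.
candidates = [with_fifo_depth(base, 0, depth) for depth in (2, 16, 128, 1024)]
for params in candidates:
    print(params.fifo_depths)
```

Each element of `candidates` could then be passed to `trace.compiled.execute()` in turn, although for large sweeps the `dse()` function described next is likely the better fit.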

## Advanced Usage for FIFO DSE

The `trace.compiled.execute()` function is already very fast. However, LightningSim also provides the `trace.compiled.dse()` function, which is highly parallelized but specialized for the use case of design space exploration to find the best FIFO depths.

In [None]:
# Query the design space for the selected FIFO
# This returns a list of depths for the given FIFO, each with a different amount of BRAM usage
# (You can provide multiple FIFOs with a common width if you intend to vary their depths together)
depths = trace.compiled.get_fifo_design_space([fifo_to_change.id], fifo_to_change.width)

# Run the DSE on this design space
# (For larger design spaces, e.g., when varying multiple FIFOs, you may need to sample the design space instead)
base_params = trace.params
fifo_widths = {fifo.id: fifo.width for fifo in trace.fifos}
design_points = [{fifo_to_change.id: depth} for depth in depths]
dse_results = trace.compiled.dse(base_params, fifo_widths, design_points)

# dse_results is a list of DsePoints in the same order as the design_points passed in
# Each DsePoint has a latency (None if deadlocked) and a bram_count
def print_dse_results(design_points, dse_results):
    for design_point, result in zip(design_points, dse_results):
        depth, = design_point.values()
        print(f"FIFO with depth {depth:d} uses an estimated {result.bram_count:d} BRAM_18K resources and", end=" ")
        if result.latency is not None:
            print(f"takes {result.latency:,d} cycles")
        else:
            print("deadlocks")

print_dse_results(design_points, dse_results)

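Note that the results above skip most depth values. This is intentional: each depth returned by `trace.compiled.get_fifo_design_space()` corresponds to a different amount of BRAM usage, so there is no benefit in choosing a depth outside of those design points. The next cell demonstrates this with a few hand-picked depths.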

In [None]:
design_points = [{fifo_to_change.id: depth} for depth in (1023, 1024, 1025, 2048)]
dse_results = trace.compiled.dse(base_params, fifo_widths, design_points)
print_dse_results(design_points, dse_results)

In this example, depths 1023 and 1024 both use 15 BRAM_18K resources, but depth 1023 takes 4 more cycles. Meanwhile, increasing the depth to 1025 uses 29 BRAM_18K resources, the same as depth 2048.
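
One possible way to act on results like these is to keep only the design points that are not beaten on both latency and BRAM usage (a Pareto front). The sketch below is an illustration, not part of LightningSim. `StubDsePoint` is a hypothetical stand-in with the two documented fields, the BRAM counts mirror the text above, and the latencies are invented.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class StubDsePoint:
    """Hypothetical stand-in for a DsePoint (latency is None on deadlock)."""
    latency: Optional[int]
    bram_count: int


def pareto_front(depths, results):
    """Return (depth, result) pairs that no other point beats on both metrics."""
    # Drop deadlocked points, which have no latency.
    feasible = [(d, r) for d, r in zip(depths, results) if r.latency is not None]
    # Cheapest BRAM first; ties broken by latency, then by smaller depth.
    feasible.sort(key=lambda p: (p[1].bram_count, p[1].latency, p[0]))
    front, best_latency = [], None
    for depth, result in feasible:
        # Keep a point only if it is strictly faster than every cheaper one.
        if best_latency is None or result.latency < best_latency:
            front.append((depth, result))
            best_latency = result.latency
    return front


# Invented latencies; only the BRAM counts follow the text above.
depths = [2, 1023, 1024, 1025, 2048]
results = [
    StubDsePoint(None, 0),   # deadlocks
    StubDsePoint(1004, 15),
    StubDsePoint(1000, 15),
    StubDsePoint(990, 29),
    StubDsePoint(990, 29),
]

front_depths = [depth for depth, _ in pareto_front(depths, results)]
print(front_depths)
```

On this made-up data the front keeps depth 1024 (same BRAM as 1023 but faster) and depth 1025 (the smallest depth at the 29-BRAM level); which of the two is preferable depends on how latency is weighed against resources.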