# Demo: Building a Design Dataset

Author: Stefan Abi-Karam

This notebook walks through building a design dataset from several sources. It covers two steps: using dataset retrievers to download and preprocess design sources into a design dataset, and using design flows to generate EDA and other data alongside those design sources.


## Setup Code

Below is some initial notebook setup code that loads Jupyter notebook extensions and includes common imports needed throughout the notebook.


In [28]:
%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [29]:
from pathlib import Path

from dotenv import dotenv_values

## Importing Dataset Creation Tools

Below is code to import the `DesignDataset` class used for creating and managing a design dataset, as well as various dataset retrievers for downloading and preprocessing design sources from external sources into a design dataset.


In [30]:
from digital_design_dataset.data_sources.data_retrievers import (
    EPFLDatasetRetriever,
    HW2VecDatasetRetriever,
    ISCAS85DatasetRetriever,
    ISCAS89DatasetRetriever,
    KoiosDatasetRetriever,
    LGSynth89DatasetRetriever,
    LGSynth91DatasetRetriever,
    OPDBDatasetRetriever,
    OpencoresDatasetRetriever,
    VTRDatasetRetriever,
)
from digital_design_dataset.data_sources.hls_data import (
    PolybenchRetriever
)
from digital_design_dataset.design_dataset import (
    DesignDataset,
)

## Creating a Design Dataset


First we define the directory that holds this notebook's files. `__file__` is not defined in Jupyter notebooks, so we set this path manually, as if the notebook were running from the root of the repository.


In [31]:
current_script_dir = Path("demo_scripts")
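
Because the path above is relative, it resolves against the kernel's working directory, which is not always the repository root. A minimal sketch of one way to make this more robust, assuming the directory is named `demo_scripts` and sits at or above the working directory. `find_dir_upward` is a hypothetical helper for illustration and is not part of the library:

```python
from pathlib import Path
from typing import Optional


def find_dir_upward(name: str, start: Optional[Path] = None) -> Path:
    """Return the first directory called `name` found in `start` or any of its parents."""
    start = (start or Path.cwd()).resolve()
    # Check the starting directory first, then walk up toward the filesystem root.
    for base in (start, *start.parents):
        candidate = base / name
        if candidate.is_dir():
            return candidate
    raise FileNotFoundError(f"no directory named {name!r} at or above {start}")
```

With this helper, `current_script_dir = find_dir_upward("demo_scripts")` would work whether the kernel starts in the repository root or inside `demo_scripts` itself.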

We can also define a GitHub token so that dataset retrievers get a higher rate limit when downloading data from GitHub. The token is loaded from a `.env` file located next to the notebook.

This is optional and can be set to `None` if you do not have a GitHub token.


In [32]:
env_config = dotenv_values(current_script_dir / ".env")
gh_token = None
if "GITHUB_TOKEN" in env_config:
    gh_token = env_config["GITHUB_TOKEN"]
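
The lookup above passes along whatever the `.env` file contains, including an empty value such as `GITHUB_TOKEN=`. A minimal sketch of a slightly more defensive lookup, assuming you also want to fall back to the process environment. `resolve_token` is a hypothetical helper and not part of the library:

```python
import os
from typing import Mapping, Optional


def resolve_token(
    env_config: Mapping[str, Optional[str]],
    environ: Mapping[str, str] = os.environ,
    key: str = "GITHUB_TOKEN",
) -> Optional[str]:
    """Prefer the `.env` value, fall back to the process environment, else return None."""
    for source in (env_config, environ):
        # Treat missing keys, None values, and blank strings the same way.
        value = (source.get(key) or "").strip()
        if value:
            return value
    return None
```

In this notebook it could be used as `gh_token = resolve_token(env_config)`, which still yields `None` when no token is available.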

**Finally we create the dataset!**

We need to specify the directory where the dataset will be stored on disk. We can also optionally specify whether we want to completely overwrite the dataset if it already exists, as well as the optional GitHub token mentioned previously.


In [33]:
test_db_dir = current_script_dir / "test_dataset_v2"
d = DesignDataset(
    test_db_dir,
    overwrite=True,
    gh_token=gh_token,
)

Now we will add designs to this dataset using dataset retrievers.

To use any dataset retriever:

1. create a dataset retriever instance by passing in a `DesignDataset` instance when creating the retriever
2. call the `get_dataset` method on the retriever to download and preprocess the design source into the dataset
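
The two steps above can be sketched as a small loop. This is an illustration only: `run_retrievers` is a hypothetical helper, not part of the library, and it assumes nothing beyond what the cells in this notebook show (a constructor that takes the dataset, and a `get_dataset()` method).

```python
from typing import Any, Dict, Iterable, Optional


def run_retrievers(
    dataset: Any, retriever_classes: Iterable[type]
) -> Dict[str, Optional[Exception]]:
    """Apply the construct-then-fetch pattern to each class and record failures."""
    results: Dict[str, Optional[Exception]] = {}
    for cls in retriever_classes:
        try:
            retriever = cls(dataset)  # step 1: bind the retriever to the dataset
            retriever.get_dataset()  # step 2: download and preprocess into the dataset
            results[cls.__name__] = None
        except Exception as exc:  # keep going so one failed download does not block the rest
            results[cls.__name__] = exc
    return results


# Intended use with the objects from this notebook (not executed here):
# results = run_retrievers(d, [ISCAS85DatasetRetriever, EPFLDatasetRetriever])
# failures = {name: exc for name, exc in results.items() if exc is not None}
```

Collecting exceptions instead of raising them is a design choice for long download sessions; the sections below call each retriever directly, which is equally valid.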


### ISCAS 85 Dataset Retriever

This dataset retriever sources the ISCAS 85 benchmark.

We source the benchmarks from here: [https://ddd.fit.cvut.cz/www/prj/Benchmarks/](https://ddd.fit.cvut.cz/www/prj/Benchmarks/). More work is being done to curate and mirror the original benchmarks ourselves.


In [8]:
iscas85_retriever = ISCAS85DatasetRetriever(d)
iscas85_retriever.get_dataset()
# iscas85_retriever.remove_dataset()

### ISCAS 89 Dataset Retriever

This dataset retriever sources the ISCAS 89 benchmark.

We source the benchmarks from here: [https://ddd.fit.cvut.cz/www/prj/Benchmarks/](https://ddd.fit.cvut.cz/www/prj/Benchmarks/). More work is being done to curate and mirror the original benchmarks ourselves.


In [9]:
iscas89_retriever = ISCAS89DatasetRetriever(d)
iscas89_retriever.get_dataset()
# iscas89_retriever.remove_dataset()

### LGSynth 89 Dataset Retriever

This dataset retriever sources the LGSynth 89 benchmark.

We source the benchmarks from here: [https://ddd.fit.cvut.cz/www/prj/Benchmarks/](https://ddd.fit.cvut.cz/www/prj/Benchmarks/). More work is being done to curate and mirror the original benchmarks ourselves.


In [10]:
lgsynth89_retriever = LGSynth89DatasetRetriever(d)
lgsynth89_retriever.get_dataset()
# lgsynth89_retriever.remove_dataset()

### LGSynth 91 Dataset Retriever

This dataset retriever sources the LGSynth 91 benchmark.

We source the benchmarks from here: [https://ddd.fit.cvut.cz/www/prj/Benchmarks/](https://ddd.fit.cvut.cz/www/prj/Benchmarks/). More work is being done to curate and mirror the original benchmarks ourselves.

In [11]:
lgsynth91_retriever = LGSynth91DatasetRetriever(d)
lgsynth91_retriever.get_dataset()

### OpenCores Dataset Retriever

This dataset retriever sources hardware designs / IPs that we have curated from the [OpenCores](https://opencores.org/) website and [FreeCores](http://freecores.github.io/) mirror.

We have packaged our curated designs here: [https://github.com/stefanpie/hardware-design-dataset-opencores](https://github.com/stefanpie/hardware-design-dataset-opencores)


In [12]:
opencores_retriever = OpencoresDatasetRetriever(d)
opencores_retriever.get_dataset()
# opencores_retriever.remove_dataset()

### HW2VEC Dataset Retriever

This dataset retriever sources the HW2VEC dataset. This dataset is a collection of designs from the HW2VEC paper. Many designs in HW2VEC are from other sources but are modified or tweaked for HW trojan detection and IP piracy detection research.

The designs are sourced from the HW2VEC repository: [https://github.com/AICPS/hw2vec](https://github.com/AICPS/hw2vec)


In [13]:
hw2vec_retriever = HW2VecDatasetRetriever(d)
hw2vec_retriever.get_dataset()
# hw2vec_retriever.remove_dataset()

### VTR Dataset Retriever

This dataset retriever sources benchmark designs used in the Verilog-to-Routing (VTR) project. These benchmark designs are used to assess algorithm performance and Quality of Results (QoR) of the VTR tools.

These designs are sourced from the VTR repository: [https://github.com/verilog-to-routing/vtr-verilog-to-routing](https://github.com/verilog-to-routing/vtr-verilog-to-routing)


In [34]:
vtr_retriever = VTRDatasetRetriever(d)
vtr_retriever.get_dataset()
# vtr_retriever.remove_dataset()

### Koios Dataset Retriever

This dataset retriever sources benchmark designs from the Koios 2.0 deep learning benchmark suite. This benchmark consists of designs which are deep learning accelerators. This benchmark was originally targeted for FPGA architecture and CAD research and is integrated into the Verilog-to-Routing benchmark evaluations.

These designs are hosted and sourced from the VTR repository: [https://github.com/verilog-to-routing/vtr-verilog-to-routing](https://github.com/verilog-to-routing/vtr-verilog-to-routing)


In [15]:
koios_retriever = KoiosDatasetRetriever(d)
koios_retriever.get_dataset()
# koios_retriever.remove_dataset()

### EPFL Dataset Retriever

This dataset retriever sources benchmark designs from the "EPFL Combinational Benchmark Suite". This benchmark suite is a collection of combinational circuits used for benchmarking logic optimization and synthesis algorithms.

These designs are hosted and sourced from the EPFL benchmark repository: [https://github.com/lsils/benchmarks](https://github.com/lsils/benchmarks)


In [16]:
epfl_retriever = EPFLDatasetRetriever(d)
epfl_retriever.get_dataset()
# epfl_retriever.remove_dataset()

### OPDB Dataset Retriever

This dataset retriever sources benchmark designs from the "OpenPiton Design Benchmark" (OPDB). The designs in this benchmark are created from the components of the [OpenPiton project](http://parallel.princeton.edu/openpiton/) and are used for benchmarking EDA tools.

These designs are hosted and sourced from the OpenPiton Design Benchmark repository: [https://github.com/PrincetonUniversity/OPDB](https://github.com/PrincetonUniversity/OPDB)


In [17]:
opdb_retriever = OPDBDatasetRetriever(d)
opdb_retriever.get_dataset()
# opdb_retriever.remove_dataset()

### HLS Polybench Retriever

This dataset retriever sources benchmark designs from built versions of the Polybench HLS benchmark suite. These designs are the RTL output synthesized and generated by Vitis HLS.

These designs are hosted and sourced from the HLS Polybench repository: ...

In [35]:
polybench_retriever = PolybenchRetriever(d)
polybench_retriever.get_dataset()

## Running Flows for Generating More Design Data

In addition to design sources being part of a dataset, the user can run different tool flows to generate more data that sits alongside the design sources in the dataset. The flows themselves are arbitrary and user-defined, but our work defines a common interface for running these flows and interfacing with the dataset. Flows can be implemented in Python only (e.g., text processing and embedding) or can call out to external tools such as EDA tools to generate more data (e.g., synthesis, place and route, etc.).

As with dataset retrievers, we include several flows as part of our framework. All the flows are still a work in progress and are being actively developed. We hope to cover a wide range of use cases such as different EDA tool flows and data processing tasks.

Note that in this framework, not all flows need to be able to process all designs in the dataset. The framework is designed to be flexible and allow for different flows to apply to different subsets of designs as needed. Ideally, we are working towards having all built-in designs be supported by all built-in flows.


### Module Info Flow

This flow generates hardware module information for each design using Yosys. Yosys reads in the design sources and produces a list of all instantiated modules in the design. This list is then serialized and stored as part of the design in the dataset.


In [20]:
from digital_design_dataset.flows.flows import ModuleInfoFlow

yosys_bin = "yosys"
module_info_flow = ModuleInfoFlow(d, yosys_bin)
module_info_flow.build_flow(n_jobs=12)


[2025-02-17 20:19:23,398][ModuleInfoFlow][INFO] Building flow module_count for epfl__adder
[2025-02-17 20:19:23,406][ModuleInfoFlow][INFO] Building flow module_count for epfl__arbiter
[2025-02-17 20:19:23,421][ModuleInfoFlow][INFO] Building flow module_count for epfl__cavlc
[2025-02-17 20:19:23,437][ModuleInfoFlow][INFO] Building flow module_count for epfl__bar
[2025-02-17 20:19:23,445][ModuleInfoFlow][INFO] Building flow module_count for epfl__dec
[2025-02-17 20:19:23,446][ModuleInfoFlow][INFO] Building flow module_count for epfl__mem_ctrl
[2025-02-17 20:19:23,449][ModuleInfoFlow][INFO] Building flow module_count for epfl__ctrl
[2025-02-17 20:19:23,449][ModuleInfoFlow][INFO] Building flow module_count for epfl__multiplier
[2025-02-17 20:19:23,451][ModuleInfoFlow][INFO] Building flow module_count for epfl__priority
[2025-02-17 20:19:23,453][ModuleInfoFlow][INFO] Building flow module_count for epfl__router
[2025-02-17 20:19:23,454][ModuleInfoFlow][INFO] Building flow module_count fo

PermissionError: [Errno 13] Permission denied: 'yosys'
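
The error above is raised when the bare name `yosys` is launched. We have not verified the cause in this environment; one plausible reading is that the name did not resolve to an executable file on the `PATH`. A minimal sketch of checking this before building a flow, using only the standard library. `require_executable` is a hypothetical helper, not part of the library:

```python
import os
import shutil
from typing import Optional


def require_executable(name: str, extra_dir: Optional[str] = None) -> str:
    """Return the absolute path of an executable called `name`, or raise a clear error."""
    search_path = os.environ.get("PATH", "")
    if extra_dir:
        # Search the extra directory first, e.g. a locally installed tool suite.
        search_path = extra_dir + os.pathsep + search_path
    found = shutil.which(name, path=search_path)
    if found is None:
        raise FileNotFoundError(f"{name!r} is not an executable on: {search_path}")
    return os.path.abspath(found)
```

Passing the returned absolute path as `yosys_bin` fails early with a readable message instead of partway through a parallel build. The later cells in this notebook take a similar approach with `shutil.which("yosys")`.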

### Verible AST Flow

This flow generates the syntax tree for the design sources present in a design using Verible. Verible is a suite of SystemVerilog tools that can be used to parse and analyze SystemVerilog code. We feed all the design sources into the `verible-verilog-syntax` tool, and the generated syntax tree is serialized and stored as part of the design in the dataset.

Warning: the syntax trees generated can take up a lot of disk space.


In [None]:
from digital_design_dataset.flows.flows import VeribleASTFlow

verible_verilog_syntax_bin = "verible-verilog-syntax"
verible_ast_flow = VeribleASTFlow(
    d, verible_verilog_syntax_bin=verible_verilog_syntax_bin
)
verible_ast_flow.build_flow(n_jobs=1)
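
Given the disk space warning for this flow, it can help to measure the dataset directory before and after running it. A minimal sketch using only the standard library. `dir_size_bytes` is a hypothetical helper, and `test_db_dir` refers to the dataset directory defined earlier in this notebook:

```python
from pathlib import Path


def dir_size_bytes(root: Path) -> int:
    """Sum the sizes of all regular files under `root`; symlinked files are skipped."""
    return sum(
        p.stat().st_size
        for p in root.rglob("*")
        if p.is_file() and not p.is_symlink()
    )


# Intended use in this notebook (not executed here):
# print(f"{dir_size_bytes(test_db_dir) / 1e6:.1f} MB")
```

This reports apparent file sizes, so the result can differ somewhat from what the filesystem actually allocates.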

### Yosys AIG Flow

This flow generates a synthesized AIG netlist for a design using Yosys. Yosys reads the design sources and runs a coarse-grained generic synthesis flow. Then, the AIG pass is called to convert the synthesized netlist into an AIG netlist. This AIG netlist is then serialized and stored as part of the design in the dataset.

Warning: the netlist generated can take up a lot of disk space.


In [25]:
from digital_design_dataset.flows.flows import YosysAIGFlow

yosys_bin = "yosys"
yosys_aig_flow = YosysAIGFlow(d, yosys_bin=yosys_bin)
yosys_aig_flow.build_flow(n_jobs=1)

  0%|          | 0/84 [00:00<?, ?it/s]

PermissionError: [Errno 13] Permission denied: 'yosys'

In [27]:
# YosysSimpleSynthFlow
import os
import shutil
from digital_design_dataset.flows.flows import YosysSimpleSynthFlow

# load OSS_CAD_SUITE from .env in current dir and add to path for yosys to find
env_config = dotenv_values(current_script_dir / ".env")
oss_cad_suite = None
if "OSS_CAD_SUITE" in env_config:
    oss_cad_suite = env_config["OSS_CAD_SUITE"]
    assert oss_cad_suite
    os.environ["PATH"] += os.pathsep + oss_cad_suite

yosys_bin = shutil.which("yosys")
assert yosys_bin
yosys_simple_synth_flow = YosysSimpleSynthFlow(d, yosys_bin=yosys_bin)
yosys_simple_synth_flow.build_flow(n_jobs=32)




The default value will be `edges="edges" in NetworkX 3.6.


  nx.node_link_data(G, edges="links") to preserve current behavior, or
  nx.node_link_data(G, edges="edges") for forward compatibility.
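
The repeated message in the output above comes from NetworkX's `node_link_data`, which asks callers to pass `edges=` explicitly. That call lives inside the flow code, so from the notebook the main option is to filter the warning. A minimal sketch, assuming the message is emitted as a `FutureWarning`. `silence_node_link_warning` is a hypothetical helper, not part of the library:

```python
import warnings


def silence_node_link_warning() -> None:
    """Ignore only the NetworkX `node_link_data` default-change warning."""
    warnings.filterwarnings(
        "ignore",
        # The pattern is matched against the start of the warning text.
        message=r"\s*The default value will be `edges=",
        category=FutureWarning,
    )
```

Filters installed this way apply to the current Python process only. If a flow runs its jobs in separate worker processes, the warning may still appear from those workers.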

In [36]:
import os
import shutil
from digital_design_dataset.flows.flows import YosysXilinxSynthFlow

# load OSS_CAD_SUITE from .env in current dir and add to path for yosys to find
env_config = dotenv_values(current_script_dir / ".env")
oss_cad_suite = None
if "OSS_CAD_SUITE" in env_config:
    oss_cad_suite = env_config["OSS_CAD_SUITE"]
    assert oss_cad_suite
    os.environ["PATH"] += os.pathsep + oss_cad_suite

yosys_bin = shutil.which("yosys")
assert yosys_bin
yosys_xilinx_synth_flow = YosysXilinxSynthFlow(d, yosys_bin=yosys_bin)
yosys_xilinx_synth_flow.build_flow(n_jobs=32)

The default value will be `edges="edges" in NetworkX 3.6.


  nx.node_link_data(G, edges="links") to preserve current behavior, or
  nx.node_link_data(G, edges="edges") for forward compatibility.