<a target="_blank" href="https://colab.research.google.com/drive/1142VbFt8kLRz7amDi9UQt5LaJvIcTaZt?usp=sharing">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

# CircuitsDNA: Evolutionary Synthesis of Approximate Hardware Acceleration

```
Submission to IEEE SSCS Open-Source Ecosystem “Code-a-Chip” Travel Grant Awards at ISSCC'26
```


|Name|Affiliation|
|:--:|:----------:|
|Ruichen Qi <br /> Email ID: ruichen_qi@brown.edu|Brown University|
|Junyi Luo <br /> Email ID: junyi_luo@brown.edu|Brown University|

<br>

---

## 1. Introduction



### 1.1 Background

According to Moore’s Law and Dennard scaling, continuous transistor miniaturization since 1974 has enabled exponential growth in device density—doubling with each generation—while maintaining higher clock speeds under constant power density. However, around 2007, the benefits of Dennard scaling began to fade due to leakage currents and power density constraints, and by 2012, scaling had largely halted.

Modern computing systems now face severe power and thermal bottlenecks. As transistor density continues to rise, heat dissipation has become a fundamental limitation: chips can no longer operate all transistors simultaneously without exceeding safe thermal limits. This results in “dark silicon,” where only a portion of the available cores can remain active to avoid overheating. Elevated temperatures also degrade device reliability, accelerating wear-out mechanisms and shortening system lifetime.

To address these challenges, approximate computing has emerged as a promising solution. By relaxing accuracy requirements in error-tolerant applications, approximate designs can substantially reduce power consumption and heat generation while maintaining acceptable output quality. This paradigm is especially suitable for neural network and large language model (LLM) accelerators, which are inherently resilient to small arithmetic errors. Minor inaccuracies in multiplications or accumulations typically have negligible impact on model accuracy, allowing designers to adopt approximate multipliers, adders, or reduced-precision datapaths for significant savings in power, area, and latency.

In hardware accelerators such as systolic arrays, which are highly compute-intensive, multiply–accumulate (MAC) units dominate both area and power consumption. Fig. 1 shows a systolic-array-based hardware architecture for large language models (LLMs). These regular, massively parallel structures perform billions of MAC operations per second, making them ideal candidates for approximate design. Replacing exact multipliers or accumulators with approximate counterparts can yield substantial overall energy and area reductions with minimal loss in computational accuracy.

Moreover, fine-tuning techniques can further mitigate hardware-induced errors. By retraining or adapting model parameters on the approximate hardware, most of the lost accuracy can be recovered while retaining energy efficiency gains. This makes approximate computing particularly attractive for large-scale AI accelerators, where computational and memory demands are exceptionally high.

### 1.2 Motivation

Despite its promise, approximate computing still faces practical challenges in logic synthesis. Traditional design methods, such as manual simplification or heuristic gate pruning, often rely on structural assumptions and struggle to scale to complex arithmetic blocks like multipliers. These methods typically yield locally optimized solutions and lack the flexibility to explore the vast combinational design space.

To overcome these limitations, genetic algorithms offer an effective alternative. By evolving populations of candidate circuits through mutation, crossover, and selection, GAs can efficiently explore discrete, non-linear design spaces without requiring gradient information or explicit models. Their ability to support multi-objective optimization makes them ideal for balancing trade-offs among accuracy, power, area, and delay.

However, most GA-based approximate synthesis approaches remain limited to small-scale demonstrations and rarely connect circuit-level optimization to system-level evaluation. This work bridges that gap by introducing an end-to-end GA-driven framework that synthesizes approximate computing circuits and evaluates their real-world impact on neural network tasks such as image classification.

### 1.3 Notebook Overview

This notebook presents a genetic-algorithm-based framework for approximate logic synthesis and its application-level evaluation on deep learning workloads (ResNet-18 and ResNet-20 image classification, with CIFAR-100 as the dataset).
The framework, shown in Fig. 2, supports both random circuit generation and optimization from a seed netlist, providing a complete flow from synthesis to performance analysis.

Workflow summary:

1. An 8-bit signed multiplier is synthesized and verified using Yosys-ABC, iVerilog, and OpenSTA.
2. The genetic algorithm performs approximate logic synthesis on the extracted netlist under various error and area constraints.
3. The evolved approximate design is analyzed in OpenSTA to extract power, timing, and LUT information.
4. Finally, the approximate multipliers are integrated into deep learning workloads (ResNet-18 and ResNet-20) to evaluate accuracy–efficiency trade-offs.

<div style="text-align:center;">
  <img src="./images/systolic_array_hardware_architecture.png"
       alt="systolic_array_hardware_architecture"
       style="width:80%; max-width:1600px; border:1px solid #ccc;">
  <p style="font-style:italic; color:gray;">
    Figure 1: Systolic array hardware architecture.
  </p>
</div>

<div style="text-align:center;">
  <img src="./images/workflow.png"
       alt="workflow"
       style="width:80%; max-width:1200px; border:1px solid #ccc;">
  <p style="font-style:italic; color:gray;">
    Figure 2: Workflow in this notebook.
  </p>
</div>

---

## 2. Environment Configuration

### 2.1 Open-source Tools Installation and Verification
The open-source tools should be installed with the provided script, because the flow relies on the latest versions and newly introduced features, and OpenSTA is not available in the conda repositories. The script automatically clones the corresponding GitHub repositories and completes the installation.

The latest versions of Yosys, iVerilog, and OpenSTA are needed. You can either install them manually by following the instructions on their GitHub pages, or use the script provided here:



In [None]:
# === 2.1 Open-source Tools Installation and Verification ===

import os
import sys
import shutil
import subprocess

setup_sh = os.path.join(os.getcwd(), 'setup.sh')
if os.path.exists(setup_sh):
    print(f'Running {setup_sh}...')
    subprocess.run(['bash', setup_sh], check=True)
else:
    print(f'No setup.sh found at {setup_sh}; skipping create_env')

def which_tool(name, required=False):
    p = shutil.which(name)
    if p:
        print(f'[OK] {name} found: {p}')
        return True
    else:
        if required:
            print(f'[ERR] {name} not found in PATH')
            sys.exit(1)
        else:
            print(f'[WARN] {name} not found in PATH')
            return False

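# iverilog is required here and aborts the cell if missing.
# yosys (needed for synthesis in Section 3.2) and sta (optional; STA steps are skipped without it) only produce a warning.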
which_tool('iverilog', required=True)
which_tool('yosys', required=False)
which_tool('sta', required=False)

### 2.2 Create Python Virtual Environment

This step sets up an isolated Python environment to manage dependencies for the project. It ensures that all required packages are installed locally without interfering with system-wide Python libraries. All the dependencies are saved in requirements.txt. The Python virtual environment is mainly used for training deep learning models. Section 6 requires a GPU for model training. If your current device does not support GPU acceleration, please keep only ‘pandas’ in the requirements.txt file, comment out the other dependencies, and use only the evolutionary algorithm to generate approximate computing circuits.

In [None]:
# === 2.2 Create Python Virtual Environment ===

!python3 -m venv venv
!venv/bin/python -m pip install --upgrade pip setuptools wheel ipykernel
!venv/bin/python -m pip install --upgrade-strategy eager -r requirements.txt
# !venv/bin/python -m ipykernel install --user --name=venv --display-name "Python (venv)"


For the following steps, please switch the Python kernel to the virtual environment we just created.

---

## 3. 8-bit Signed Multiplier Synthesis and Verification

In this demonstration, an 8-bit signed multiplier is synthesized and verified in the GF180 technology as an example. All necessary files for synthesis and verification are prepared. Only six basic gate types (AND, NAND, OR, XOR, XNOR, and INV) are used for synthesis, though the framework can be easily extended to include other logic gates or even larger building blocks such as full adders (FA), barrel shifters, and multipliers. These six gate types serve solely as a representative example.

### 3.1 Goldenbrick Generation, Behavioral Simulation & Verification

The following step automates the full verification pipeline for the 8-bit signed multiplier design:

3.11 Goldenbrick Generation

The “goldenbrick” serves as a golden reference output, generated by a Python script (goldenbrick.py) that computes the expected results of the multiplier for all input combinations. It writes these results into goldenbrick.txt under code/synthesis/goldenbrick/.
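As a rough illustration of what such a reference contains, the sketch below enumerates every operand pair of a signed 8-bit multiplier and computes the exact product. The function name `golden_products` and the tuple layout are assumptions made for this example; the actual output format of goldenbrick.py is not reproduced here.

```python
from itertools import product

def golden_products(width=8, signed=True):
    """Yield (a, b, a*b) for every operand pair of a width-bit multiplier."""
    if signed:
        lo, hi = -(1 << (width - 1)), 1 << (width - 1)   # -128 .. 127 for width=8
    else:
        lo, hi = 0, 1 << width                           # 0 .. 255 for width=8
    for a, b in product(range(lo, hi), repeat=2):
        yield a, b, a * b

rows = list(golden_products(width=8, signed=True))
print(len(rows))                          # 65536 input combinations
print(max(abs(p) for _, _, p in rows))    # 16384, from (-128) * (-128)
```

Exhaustive enumeration is practical here because an 8x8 multiplier has only 2^16 input combinations.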

3.12 Behavioral Simulation

The script then compiles and runs a behavioral simulation using Icarus Verilog (iverilog + vvp). It includes the GF180 standard cell models (gf180mcu_fd_sc_mcu9t5v0.v, primitives.v), the multiplier RTL (mult_8bits.sv), and its corresponding testbench (mult_8bits_tb.sv). During simulation, the testbench writes the computed results to rtl_sim_output.txt and dumps waveforms into rtl.vcd.

3.13 Result Checking

Finally, the simulation output is compared against the goldenbrick reference. If both files match exactly, it confirms that the multiplier’s behavioral model is functionally correct. Otherwise, it reports mismatches, indicating possible logic or testbench issues. This flow thus connects Python-based golden reference generation with Verilog simulation and verification, ensuring consistency between algorithmic and hardware-level implementations.
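When the two files differ, a plain pass/fail result gives little to debug with. The sketch below is a hypothetical helper (`first_mismatches` is not part of the repository) that lists the first few differing lines. Note that it compares lines after stripping whitespace, which is slightly more lenient than an exact byte-for-byte file comparison.

```python
def first_mismatches(gold_lines, sim_lines, limit=5):
    """Return up to `limit` (line_no, expected, actual) tuples that differ.

    A length mismatch is reported first, with line number 0.
    """
    diffs = []
    if len(gold_lines) != len(sim_lines):
        diffs.append((0, f"{len(gold_lines)} lines", f"{len(sim_lines)} lines"))
    for n, (g, s) in enumerate(zip(gold_lines, sim_lines), start=1):
        if len(diffs) >= limit:
            break
        if g.strip() != s.strip():
            diffs.append((n, g.strip(), s.strip()))
    return diffs

# Toy data standing in for the contents of the two text files.
gold = ["0", "-127", "16384"]
sim = ["0", "-128", "16384"]
print(first_mismatches(gold, sim))   # [(2, '-127', '-128')]
```

In practice the two lists would come from reading the reference file and the simulation output with `Path(...).read_text().splitlines()`.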

In [None]:
# === 3.1. Goldenbrick Generation, Behavioral Simulation & Verification ===

import os
import sys  # needed for sys.executable below; the kernel was switched after Section 2
import subprocess
from pathlib import Path

# Paths (mirror Makefile variables)
GF_STD_ROOT = Path('src')
STD_CELLS = [str(GF_STD_ROOT / 'gf180mcu_fd_sc_mcu9t5v0.v'), str(GF_STD_ROOT / 'primitives.v')]
SIM_FILES = 'code/synthesis/verilog/modules/mult_8bits.sv'
TOP_LEVEL_TESTBENCH = 'code/synthesis/verilog/testbench/mult_8bits_tb.sv'
VSIM_DIR = Path('code/synthesis/vsim')
RTL_VVP = str(VSIM_DIR / 'rtl.vvp')
RTL_VCD = str(VSIM_DIR / 'rtl.vcd')
RTL_TXT_OUT = str(VSIM_DIR / 'rtl_sim_output.txt')
IVFLAGS = ['-g2012', '-I../verilog/modules']

# -- 3.11 Goldenbrick Generation: run python script and save to file --
gb_path = Path('code/synthesis/goldenbrick/goldenbrick.txt')
gb_path.parent.mkdir(parents=True, exist_ok=True)
print('Generating goldenbrick...')
with open(gb_path, 'w') as f:
    subprocess.run([sys.executable, 'code/synthesis/goldenbrick/goldenbrick.py', '--width', '8', '--signed'], stdout=f, check=True)
print(f'Goldenbrick saved to {gb_path}')

# -- 3.12 Behavior Simulation: run iverilog and vvp --
VSIM_DIR.mkdir(parents=True, exist_ok=True)
iverilog_cmd = ['iverilog'] + IVFLAGS + ['-o', RTL_VVP] + STD_CELLS + [SIM_FILES, TOP_LEVEL_TESTBENCH, '-D', 'FUNCTIONAL', '-D', f'VSIM_OUT="{RTL_TXT_OUT}"', '-D', f'VCD_FILE="{RTL_VCD}"']
print('Compiling behavioral simulation with:', ' '.join(iverilog_cmd))
subprocess.run(iverilog_cmd, check=True)
subprocess.run(['vvp', RTL_VVP], check=True)
print('Behavioral simulation finished; output written to', RTL_TXT_OUT)

# -- 3.13 Result Checking: compare outputs --
gold_file = gb_path
sim_out = Path(RTL_TXT_OUT)
if gold_file.exists() and sim_out.exists():
    import filecmp
    if filecmp.cmp(str(gold_file), str(sim_out), shallow=False):
        print('Outputs match! Behavioral simulation passed!')
    else:
        print('Outputs differ!')
else:
    print('Golden or simulation output missing; cannot compare')

### 3.2 Logic Synthesis, Gate-Level Simulation & Static Timing Analysis

The following step performs the post-synthesis verification flow for the 8-bit signed multiplier design.

3.21 Logic Synthesis

This step invokes Yosys to synthesize the RTL design (mult_8bits.sv) into a gate-level netlist using the GF180MCU standard-cell library (gf180mcu_fd_sc_mcu9t5v0__tt_025C_3v30.lib).
The resulting netlist (mult_8bits.syn.v) is saved under code/synthesis/syn/.
This process transforms the high-level behavioral description into a technology-mapped implementation ready for physical analysis.

3.22 Gate-Level Simulation

After synthesis, the script runs a gate-level simulation using Icarus Verilog (iverilog + vvp).
It uses the same testbench as the behavioral simulation, along with the synthesized netlist and GF180 standard-cell models.
Simulation outputs are written to gl_sim_output.txt, and signal waveforms are dumped into gl.vcd for inspection.

3.23 Output Checking

The gate-level simulation results are compared against the goldenbrick reference.
If both outputs match, it confirms that the synthesis preserved the intended functionality.
Any mismatch indicates possible synthesis or simulation discrepancies.

3.24 Static Timing Analysis (STA)

If OpenSTA is available, the script runs sta.tcl to perform static timing analysis using the same GF180 library.
This step reports the critical path delay, setup/hold time, and timing slack information, stored in code/synthesis/sta/sta.report.
It verifies that the synthesized circuit meets timing requirements under typical operating conditions.

In [None]:
# === 3.2. Logic Synthesis, Gate-Level Simulation & Static Timing Analysis ===

import os
import subprocess
import shutil
from pathlib import Path

# Paths / vars
GF_LIB = str(Path('src') / 'gf180mcu_fd_sc_mcu9t5v0__tt_025C_3v30.lib')
SYN_FILES = Path('code/synthesis/verilog/modules/mult_8bits.sv')
SYN_NETLIST = 'code/synthesis/syn/mult_8bits.syn.v'
STD_CELLS = [str(Path('src') / 'gf180mcu_fd_sc_mcu9t5v0.v'), str(Path('src') / 'primitives.v')]
TOP_LEVEL_TESTBENCH = 'code/synthesis/verilog/testbench/mult_8bits_tb.sv'
VSIM_DIR = Path('code/synthesis/vsim')
GL_VVP = str(VSIM_DIR / 'gl.vvp')
GL_VCD = str(VSIM_DIR / 'gl.vcd')

# -- 3.21 Logic Synthesis: call yosys in code/synthesis --
print('Running synthesis (yosys)...')
env = os.environ.copy()

# Make GF_LIB absolute so it can be found regardless of the working directory used for yosys
GF_LIB = str((Path.cwd() / 'src' / 'gf180mcu_fd_sc_mcu9t5v0__tt_025C_3v30.lib').resolve())
env['LIB_SYN'] = GF_LIB
try:
    subprocess.run(['yosys', '-ql', 'syn/run_synth.log', '-c', 'syn/run_synth.tcl'], cwd='code/synthesis', check=True, env=env)
except subprocess.CalledProcessError as e:
    print('Yosys failed:')
    print(e)
    # Re-raise so the notebook cell still fails visibly after printing helpful info
    raise
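# Check the synthesized netlist (not the RTL source) to confirm that synthesis produced its output
if not Path(SYN_NETLIST).exists():
    raise SystemExit(f'[ERR] Synthesis did not produce {SYN_NETLIST}')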
print('[SYN] Done. Log: code/synthesis/syn/run_synth.log')

# -- 3.22 Gate-level simulation --
VSIM_DIR.mkdir(parents=True, exist_ok=True)
iverilog_cmd = ['iverilog', '-g2012', '-I../verilog/modules', '-o', GL_VVP] + STD_CELLS + [SYN_NETLIST, TOP_LEVEL_TESTBENCH, '-D', f'VSIM_OUT="{VSIM_DIR}/gl_sim_output.txt"', '-D', f'VCD_FILE="{GL_VCD}"']
print('Compiling gate-level simulation with:', ' '.join(iverilog_cmd))
subprocess.run(iverilog_cmd, check=True)
subprocess.run(['vvp', GL_VVP], check=True)

# -- 3.23 Output Checking: compare outputs --
gold = Path('code/synthesis/goldenbrick/goldenbrick.txt')
gl_out = Path(str(VSIM_DIR / 'gl_sim_output.txt'))
if gold.exists() and gl_out.exists():
    import filecmp
    if filecmp.cmp(str(gold), str(gl_out), shallow=False):
        print('Outputs match! Gate-level netlist simulation passed!')
    else:
        print('Outputs differ!')
else:
    print('Golden or gate-level output missing; cannot compare')

# -- 3.24 Static Timing Analysis: run OpenSTA if available --
sta_bin = shutil.which('sta')
if sta_bin:
    Path('code/synthesis/sta').mkdir(parents=True, exist_ok=True)
    env = os.environ.copy(); env['LIB_SYN'] = GF_LIB
    with open('code/synthesis/sta/sta.report', 'w') as fout:
        subprocess.run([sta_bin, 'code/synthesis/sta/sta.tcl'], check=True, stdout=fout, env=env)
    print('[STA] Done. Reports in code/synthesis/sta')
else:
    print('[WARN] OpenSTA (sta) not found; skipping STA')

---
## 4. Approximate Multiplier Synthesis from the Synthesized Netlist via Genetic Algorithm

In this section, we use a genetic algorithm (GA) to evolve approximate 8-bit signed multipliers. Genetic algorithms struggle with large circuits due to the exponentially growing search space, high computational cost, and poor scalability. Therefore, instead of starting from random circuits, our GA starts from a seed circuit, which is loaded from the Verilog netlist synthesized in the previous step.

During each generation, the algorithm evaluates all candidate circuits by simulating their outputs over all input patterns. The fitness function jointly considers circuit accuracy and area efficiency — penalizing individuals with large worst-case errors (WCE > ε_th) while favoring smaller transistor counts. Multiple error metrics such as NMED, ER, WCE, MRE, and sMAPE are recorded for analysis.
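To make the trade-off concrete, here is a minimal sketch of a fitness function of this kind. It is an assumption for illustration only: the name `fitness`, the linear penalty form, and the `penalty` weight are hypothetical, and the actual function is defined in the C++ sources.

```python
def fitness(transistor_count, wce, eps_th, max_abs=16384, penalty=1e6):
    """Lower is better: area cost, plus a large penalty if WCE exceeds the bound.

    wce     -- worst-case absolute error of the candidate circuit
    eps_th  -- relative error threshold, as a fraction of max_abs
    max_abs -- largest product magnitude (16384 for signed 8x8)
    """
    limit = eps_th * max_abs
    if wce <= limit:
        return float(transistor_count)          # feasible: rank by area only
    # Infeasible: the penalty grows with the size of the violation.
    return transistor_count + penalty * (wce - limit) / max_abs

feasible = fitness(transistor_count=2000, wce=50, eps_th=0.005)
infeasible = fitness(transistor_count=1500, wce=500, eps_th=0.005)
print(feasible < infeasible)   # True: the smaller circuit loses because it violates the bound
```

With this shape, any circuit inside the error bound beats any circuit outside it, and among feasible circuits the smaller one wins.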

New individuals are created through mutation operators, including:
1. Add / delete node: randomly insert or remove a gate;
2. Change gate type: switch to another logic primitive;
3. Rewire inputs or outputs: alter signal connectivity;
4. Merge equivalent nodes: remove redundant subcircuits.
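The sketch below illustrates two of these operators (change gate type and rewire an input) on a toy netlist, together with weighted operator selection. The list-based netlist encoding and all function names are assumptions for this example and do not mirror the C++ data structures.

```python
import random

GATE_TYPES = ["AND", "NAND", "OR", "XOR", "XNOR", "INV"]

def change_gate_type(gates, rng):
    """Replace the type of one randomly chosen gate with a different primitive."""
    i = rng.randrange(len(gates))
    old = gates[i][0]
    gates[i][0] = rng.choice([t for t in GATE_TYPES if t != old])

def rewire_input(gates, n_inputs, rng):
    """Reconnect one gate input to an earlier signal, keeping the netlist acyclic."""
    i = rng.randrange(len(gates))
    pin = rng.choice([1, 2])
    gates[i][pin] = rng.randrange(n_inputs + i)   # only signals defined before gate i

def mutate(gates, n_inputs, weights, rng):
    """Pick one operator according to `weights` and apply it in place."""
    ops = [change_gate_type, lambda g, r: rewire_input(g, n_inputs, r)]
    op = rng.choices(ops, weights=weights, k=1)[0]
    op(gates, rng)

rng = random.Random(2025)
# Each gate is [type, in_a, in_b]. Signal k < n_inputs is a primary input;
# signal n_inputs + i is the output of gate i. INV ignores its second input.
netlist = [["AND", 0, 1], ["XOR", 2, 0], ["OR", 3, 1]]
for _ in range(10):
    mutate(netlist, n_inputs=2, weights=[0.4, 0.6], rng=rng)
print(netlist)
```

Restricting each gate input to earlier signals keeps the circuit feed-forward, so every mutated candidate remains a valid combinational netlist.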

By iteratively applying mutation, pruning inactive nodes, and selecting the best individuals, the GA searches the discrete, irregular design space to obtain compact approximate multipliers with bounded output error. This process enables automatic approximate logic synthesis directly at the gate level, providing a flexible framework for exploring accuracy–area trade-offs under the GF180 technology library.
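Pruning inactive nodes amounts to a backward reachability pass from the primary outputs. The following sketch shows the idea on a toy feed-forward netlist; the encoding (each gate is `[type, in_a, in_b]`, signals below `n_inputs` are primary inputs, and signal `n_inputs + i` is the output of gate `i`) and the function name are assumptions for illustration. For simplicity, every gate is treated as two-input.

```python
def prune_inactive(gates, n_inputs, outputs):
    """Drop gates that no primary output depends on and renumber the rest."""
    live = set()
    stack = [s for s in outputs if s >= n_inputs]
    while stack:                                  # walk backwards from the outputs
        s = stack.pop()
        if s in live:
            continue
        live.add(s)
        _, a, b = gates[s - n_inputs]
        stack.extend(x for x in (a, b) if x >= n_inputs)

    remap = {s: s for s in range(n_inputs)}       # primary inputs keep their ids
    kept = []
    for i, (t, a, b) in enumerate(gates):
        s = n_inputs + i
        if s in live:
            remap[s] = n_inputs + len(kept)
            kept.append([t, remap[a], remap[b]])
    return kept, [remap[o] for o in outputs]

gates = [["AND", 0, 1], ["XOR", 0, 1], ["OR", 2, 1]]   # the XOR gate drives nothing
kept, outs = prune_inactive(gates, n_inputs=2, outputs=[4])
print(kept, outs)   # [['AND', 0, 1], ['OR', 2, 1]] [3]
```

Because dead gates contribute area but not function, removing them before evaluating fitness keeps the area estimate honest.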




<p align="center">
  <img src="./images/GA_pseudo_code.png"
       alt="GA_pseudo_code"
       style="width:80%; max-width:800px; border:1px solid #ccc;">
  <br>
  <em>Figure 3: Pseudo code for the implemented genetic algorithm.</em>
</p>

<p align="center">
  <img src="./images/GA_pseudo_code_explanation.png"
       alt="GA_pseudo_code_explanation"
       style="width:80%; max-width:800px; border:1px solid #ccc;">
  <br>
  <em>Figure 4: List of symbols and their definitions</em>
</p>

### 4.1 Genetic Algorithm Compilation

The following step compiles the C++ implementation of the genetic algorithm (GA) used for evolving approximate circuit designs. It compiles all necessary source files and produces the executable. This executable serves as the core evolution engine, responsible for generating, mutating, and evaluating circuit candidates during the genetic search process.

In [None]:
# === 4.1. Genetic Algorithm Compilation ===

import subprocess
print('Compiling genetic algorithm C++ sources...')
subprocess.run(['g++', '-std=c++17', '-O3', '-Wall', '-Wextra', '-I.', 'file_io.cpp', 'gate.cpp', 'problem.cpp', 'utils.cpp', 'main.cpp', '-o', 'main_exec'], cwd='code/genetic_algorithm', check=True)
print('Compilation finished. Executable: code/genetic_algorithm/main_exec')

### 4.2 Genetic Algorithm Execution

The following step launches the genetic algorithm (GA) engine to evolve approximate multiplier designs based on the synthesized 8-bit netlist.

4.21 Execution Setup

The script first ensures that the compiled executable (main_exec) exists under code/genetic_algorithm/.
If missing, it automatically recompiles the source files using g++ with optimization and warning flags.
All paths are resolved relative to the repository root to maintain consistency with internal file references.

4.22 Runtime Configuration

The GA is executed with explicit runtime arguments, overriding the defaults defined in main.cpp.
These parameters include:
- Maximum number of generations: 100,000
- Population size: 100
- Top-k (elite) ratio: 2% of the population
- Mutation probabilities: [0.10, 0.10, 0.10, 0.40, 0.15, 0.15] for the operators add, delete, change gate, merge equivalent, rewire input, and rewire output, in that order
- Error exploration schedule: eps_start=0.0, eps_end=0.1, eps_tau=2000.0
- Seed: 2025

The script runs the GA on the synthesized multiplier netlist (mult_8bits.syn.v) and writes all logs and evolved results into code/genetic_algorithm/output/.
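The parameter names suggest an error bound that relaxes from `eps_start` toward `eps_end` with a time constant `eps_tau`. The exact schedule is defined in main.cpp and is not reproduced here; the sketch below assumes an exponential approach purely for illustration, and also shows how the elite count follows from the population size and top-k ratio.

```python
import math

def eps_schedule(gen, eps_start=0.0, eps_end=0.1, eps_tau=2000.0):
    """Assumed form: eps(g) = eps_end + (eps_start - eps_end) * exp(-g / eps_tau)."""
    return eps_end + (eps_start - eps_end) * math.exp(-gen / eps_tau)

for g in (0, 2000, 10000, 100000):
    print(g, round(eps_schedule(g), 4))

# Number of elite individuals carried over per generation.
pop_size, top_k_ratio = 100, 0.02
n_elite = max(1, round(pop_size * top_k_ratio))
print(n_elite)   # 2
```

Under this assumed form, the allowed error starts at zero (exact circuits only) and covers about 63% of the distance to `eps_end` after `eps_tau` generations.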

4.23 Evolution Process

During execution, the GA iteratively generates, mutates, and evaluates circuit candidates to minimize area and error under given constraints.
All key metrics, including fitness scores, generation progress, and best solutions, are saved in evolution_metrics_signed8x8.txt. This step effectively performs automated approximate circuit synthesis, bridging the synthesis flow and the evolutionary optimization stage.

In [None]:
# === 4.2. Genetic Algorithm Execution ===

import subprocess
import os
from pathlib import Path

# -- 4.21 Execution Setup --
exe_rel = os.path.join('code', 'genetic_algorithm', 'main_exec')
exe = os.path.abspath(exe_rel)

# If missing, attempt to compile in the GA source directory
if not os.path.exists(exe):
    print(f'Executable not found at {exe}; attempting to compile in code/genetic_algorithm...')
    try:
        subprocess.run(['g++', '-std=c++17', '-O3', '-Wall', '-Wextra', '-I.', 'file_io.cpp', 'gate.cpp', 'problem.cpp', 'utils.cpp', 'main.cpp', '-o', 'main_exec'], cwd='code/genetic_algorithm', check=True)
    except subprocess.CalledProcessError as e:
        print('Compilation failed:')
        print(e)
        raise SystemExit('Could not compile GA executable; please check build errors.')
    if not os.path.exists(exe):
        raise SystemExit('Executable still missing after compilation attempt.')
    
# Run from repository root (top-level) so paths are consistent
candidate_cwd = os.getcwd()
print('Running GA executable from', candidate_cwd)

# Determine netlist and output directory to pass as runtime args
netlist_path = str(Path(candidate_cwd) / 'code' / 'synthesis' / 'syn' / 'mult_8bits.syn.v')
output_dir = str(Path(candidate_cwd) / 'code' / 'genetic_algorithm' / 'output')
logfile = str(Path(output_dir) / 'evolution_metrics_signed8x8.txt')

# -- 4.22 Runtime Configuration --
generations = '100000'
pop_size = '100'
top_k_ratio = '0.02'
eps_start = '0.0'
eps_end = '0.1'
eps_tau = '2000.0'
seed = '2025'

# 1 = true, 0 = false for these flags as parsed by main.cpp
load_from_file = '1'
signed_mult = '1'

# Mutation weights as a comma-separated string matching main.cpp parsing expectations
mut_weights = '0.10,0.10,0.10,0.40,0.15,0.15'   #Add, delete, change_gate, merge_equiv, rewire_in, rewire_out
cmd = [
    exe,
    '--netlist-file', netlist_path,
    '--output-dir', output_dir,
    '--logfile', logfile,
    '--generations', generations,
    '--pop-size', pop_size,
    '--top-k-ratio', top_k_ratio,
    '--eps-start', eps_start,
    '--eps-end', eps_end,
    '--eps-tau', eps_tau,
    '--seed', seed,
    '--load-from-file', load_from_file,
    '--signed-mult', signed_mult,
    '--mut-weights', mut_weights
]
print('Command:', ' '.join(cmd))

# -- 4.23 Evolution Process --
Path(output_dir).mkdir(parents=True, exist_ok=True)
try:
    proc = subprocess.run(cmd, cwd=candidate_cwd, check=True, capture_output=True, text=True)
    if proc.stdout:
        print('--- stdout ---')
        print(proc.stdout)
    if proc.stderr:
        print('--- stderr ---')
        print(proc.stderr)
    print('GA run finished')
except subprocess.CalledProcessError as e:
    print('GA executable failed with returncode', e.returncode)
    if e.stdout:
        print('--- stdout ---')
        print(e.stdout)
    if e.stderr:
        print('--- stderr ---')
        print(e.stderr)
    raise


### 4.3 Result Evaluation

Here we use the evolution with a relative worst-case error threshold of 0.5% as an example. The algorithm is lightweight and can run on a regular laptop. All the experiments in this section were run for 10 hours on an AMD Ryzen 9 7845HX laptop with 32GB memory.

4.31 Evolution process of a signed 8-bit approximate multiplier

<p align="center">
  <img src="./images/evolution_process_epsEnd_0.005.png"
       alt="evolution_process_epsEnd_0.005"
       style="width:80%; max-width:800px; border:1px solid #ccc;">
  <br>
  <em>Figure 5: Evolution process with relative worst-case error threshold of 0.5%.</em>
</p>

4.32 Equivalent transistor count and error metrics under various relative worst-case error thresholds

A grid search over the relative worst-case error threshold was performed, and the results are shown below. With only a 0.5% relative worst-case error, the equivalent transistor count was reduced to 76.5% of the original size. All the evolved circuit netlists as well as look-up tables are provided in src/ as a reference.

<p align="center">
  <img src="./images/equivalent_transistor_count_vs_epsEnd.png"
       alt="equivalent_transistor_count_vs_epsEnd"
       style="width:80%; max-width:800px; border:1px solid #ccc;">
  <br>
  <em>Figure 6: Equivalent transistor count versus relative worst-case error threshold.</em>
</p>

<p align="center">
  <img src="./images/error_metrics_vs_epsEnd.png"
       alt="error_metrics_vs_epsEnd"
       style="width:80%; max-width:800px; border:1px solid #ccc;">
  <br>
  <em>Figure 7: Error metrics versus relative worst-case error threshold.</em>
</p>


---
## 5. Analysis and Verification of Generated Approximate Multiplier

In this section, we map the generated netlist to a specific semiconductor technology (GF180, for example) and perform functional verification as well as power, performance, and area (PPA) analysis using OpenSTA. The design is mapped to standard cells from the target technology library, and timing, power, and area reports are extracted to evaluate the quality of the evolved approximate multiplier.



### 5.1 Netlist Mapping to GF180 Structural Verilog

This step converts the evolved netlist produced by the genetic algorithm into a structural SystemVerilog format compatible with the GF180MCU standard-cell library. The script calls the Python utility netlist_to_verilog.py, which reads the raw gate-level description (netlist_output.txt) and translates it into a clean, synthesizable SystemVerilog module. The resulting file netlist_output.sv is saved under code/genetic_algorithm/output/ and the top-level module is named approxMult_signed8x8.
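The translation is essentially a one-to-one emission of gate instances. The sketch below shows the idea with Verilog built-in gate primitives rather than GF180 cells, to avoid assuming library cell and pin names. The input encoding (each gate is `[type, in_a, in_b]`, with signal `n_inputs + i` as the output of gate `i`) is hypothetical, since the format of netlist_output.txt is not shown here.

```python
# Verilog built-in primitives; the first port of each primitive is its output.
PRIMS = {"AND": "and", "NAND": "nand", "OR": "or",
         "XOR": "xor", "XNOR": "xnor", "INV": "not"}

def to_structural_verilog(module, n_inputs, gates, outputs):
    """Emit a structural Verilog module for a feed-forward gate list."""
    def sig(s):
        return f"x[{s}]" if s < n_inputs else f"n{s}"

    lines = [f"module {module} (input wire [{n_inputs - 1}:0] x, "
             f"output wire [{len(outputs) - 1}:0] y);"]
    for i, (t, a, b) in enumerate(gates):
        s = n_inputs + i
        pins = [sig(a)] if t == "INV" else [sig(a), sig(b)]
        lines.append(f"  wire n{s};")
        lines.append(f"  {PRIMS[t]} g{i} (n{s}, {', '.join(pins)});")
    for k, o in enumerate(outputs):
        lines.append(f"  assign y[{k}] = {sig(o)};")
    lines.append("endmodule")
    return "\n".join(lines)

text = to_structural_verilog("approx_demo", 2, [["AND", 0, 1], ["INV", 2, 2]], [3])
print(text)
```

Targeting a standard-cell library would replace each primitive with the corresponding cell instance and named pin connections taken from that library's Verilog models.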

In [None]:
# === 5.1. Netlist Mapping to GF180 Structural Verilog ===

import subprocess
import sys
print('Mapping generated netlist to structural SystemVerilog...')
subprocess.run([sys.executable, 'scripts/netlist_to_verilog.py', '--in', './code/genetic_algorithm/output/netlist_output.txt', '--out', './code/genetic_algorithm/output/netlist_output.sv', '--module-name', 'approxMult_signed8x8'], check=True)
print('Mapped netlist written to code/genetic_algorithm/output/netlist_output.sv')

### 5.2 Static Timing Analysis for the Evolved Approximate Design

This step performs static timing analysis (STA) on the evolved approximate multiplier using OpenSTA, ensuring timing integrity after the design evolution. The script checks whether the sta binary (OpenSTA) is available in the system path. If found, it runs the timing script sta_approx.tcl under code/synthesis/sta/, using the same GF180MCU timing library (gf180mcu_fd_sc_mcu9t5v0__tt_025C_3v30.lib). All analysis results are written to sta_approx.report.

In [None]:
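# === 5.2. Static Timing Analysis for the Evolved Approximate Design ===

import os
import shutil
import subprocess
from pathlib import Path

# Absolute path to the GF180 timing library (same file as in Section 3.2)
GF_LIB = str((Path.cwd() / 'src' / 'gf180mcu_fd_sc_mcu9t5v0__tt_025C_3v30.lib').resolve())

sta_bin = shutil.which('sta')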
if sta_bin:
    Path('code/synthesis/sta').mkdir(parents=True, exist_ok=True)
    env = os.environ.copy(); env['LIB_SYN'] = GF_LIB
    with open('code/synthesis/sta/sta_approx.report', 'w') as fout:
        subprocess.run([sta_bin, 'code/synthesis/sta/sta_approx.tcl'], check=True, stdout=fout, env=env)
    print('[STA] Done. Reports in code/synthesis/sta')
else:
    print('[WARN] OpenSTA (sta) not found; skipping STA')

### 5.3 Gate-Level Simulation and Truth Table Extraction of the Evolved Design

This step verifies the functionality of the evolved approximate multiplier through gate-level simulation, followed by truth table extraction for quantitative error analysis.

5.31 Gate-Level Simulation

The script compiles and runs the evolved SystemVerilog netlist (netlist_output.sv) together with its dedicated testbench (approxMult_signed8x8_tb.sv) using Icarus Verilog (iverilog + vvp). It links against the GF180MCU standard-cell models (gf180mcu_fd_sc_mcu9t5v0.v, primitives.v) to accurately model technology-specific behavior. Simulation results are stored in gl_sim_output.txt, and the waveform is dumped into gl.vcd for inspection.

5.32 Truth Table Extraction

After simulation, the script calls extract_truth_table.py to parse the simulation results and reconstruct the complete input–output truth table of the evolved multiplier.
This extracted table serves as the basis for evaluating numerical error metrics such as ER, WCE, NMED, and MRE, enabling precise comparison against the golden reference.

In [None]:
# === 5.3. Gate-Level Simulation and Truth Table Extraction of the Evolved Design ===

from pathlib import Path
import subprocess
import sys  # needed for sys.executable below

VSIM_DIR = Path('code/synthesis/vsim')
VSIM_DIR.mkdir(parents=True, exist_ok=True)
STD_CELLS = [str(Path('src') / 'gf180mcu_fd_sc_mcu9t5v0.v'), str(Path('src') / 'primitives.v')]
APPROX_NETLIST = 'code/genetic_algorithm/output/netlist_output.sv'
APPROX_NETLIST_TESTBENCH = 'code/synthesis/verilog/testbench/approxMult_signed8x8_tb.sv'
GL_VVP = str(VSIM_DIR / 'gl.vvp')
GL_VCD = str(VSIM_DIR / 'gl.vcd')

# -- 5.31 Gate-Level Simulation --
iverilog_cmd = ['iverilog', '-g2012', '-I../verilog/modules', '-o', GL_VVP] + STD_CELLS + [APPROX_NETLIST, APPROX_NETLIST_TESTBENCH, '-D', f'VSIM_OUT="{VSIM_DIR}/gl_sim_output.txt"', '-D', f'VCD_FILE="{GL_VCD}"']
print('Compiling approximate netlist simulation...')
subprocess.run(iverilog_cmd, check=True)
subprocess.run(['vvp', GL_VVP], check=True)
print('Simulation finished; extracting truth table...')

# -- 5.32 Truth Table Extraction --
subprocess.run([sys.executable, 'scripts/extract_truth_table.py'], check=True)
print('Truth table extraction complete')

### 5.4 Error Evaluation on Extracted Truth Table

This step performs quantitative error analysis on the evolved approximate multiplier based on its extracted truth table. The script checks for the presence of the file extracted_truth_table.txt, generated in the previous step. If it exists, it calls error_eval.py to compute various approximation error metrics, using 16384 as the normalization factor (the maximum absolute value for an 8-bit signed multiplication).
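For reference, the sketch below computes four of these metrics from a pair of exact and approximate product lists. The definitions are the ones commonly used in the approximate-computing literature and are assumed here; error_eval.py may differ in details, for example in how MRE treats an exact value of zero. The toy approximation (clearing the two least significant bits) stands in for an evolved circuit.

```python
def error_metrics(exact, approx, max_abs=16384):
    """Return ER, WCE, NMED and MRE for two equally long lists of products."""
    n = len(exact)
    errs = [abs(e - a) for e, a in zip(exact, approx)]
    return {
        "ER": sum(1 for d in errs if d != 0) / n,        # fraction of wrong outputs
        "WCE": max(errs),                                 # worst-case absolute error
        "NMED": sum(errs) / n / max_abs,                  # mean error, normalized
        # Relative error; the denominator is clamped to 1 to avoid division by zero.
        "MRE": sum(d / max(1, abs(e)) for d, e in zip(errs, exact)) / n,
    }

exact = [a * b for a in range(-128, 128) for b in range(-128, 128)]
approx = [p & ~0x3 for p in exact]    # toy approximation: clear the two LSBs
m = error_metrics(exact, approx)
print(m)
```

Here `max_abs=16384` matches the `--maxabs` value passed to the script, since (-128) x (-128) = 16384 is the largest product magnitude.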

In [None]:
# === 5.4. Error Evaluation on Extracted Truth Table ===

import os
import sys  # needed for sys.executable below
import subprocess
from pathlib import Path
tt = Path('code/genetic_algorithm/output/extracted_truth_table.txt')
if tt.exists():
    print('Running error evaluation on extracted truth table...')
    subprocess.run([sys.executable, 'scripts/error_eval.py', str(tt), '--maxabs', '16384'], check=True)
else:
    print(f'Extracted truth table not found at {tt}; run the simulation/extraction step first')

---
## 6. Implementation of Deep Learning Tasks: ResNet-18 & ResNet-20 with Approximate Computing Units
### 6.1 Fine-tune an INT8 quantized ResNet-20 on the CIFAR-100 dataset
We start by fine-tuning an INT8-quantized ResNet-20 on the CIFAR-100 dataset. This experiment serves as a proof-of-concept for integrating approximate arithmetic units within a complete image-classification pipeline. It establishes the baseline workflow and verifies functional compatibility before extending the approach to larger architectures such as ResNet-18.
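One common way to emulate an approximate multiplier in an INT8 pipeline is to replace each product with a lookup into a 256x256 table. The sketch below illustrates that idea in plain Python. It is an assumption about the general technique, not a description of the training code used here, and the toy approximation (clearing the two least significant bits) stands in for an evolved multiplier's truth table.

```python
def build_lut(mul):
    """256x256 table indexed by (a + 128, b + 128) for signed 8-bit operands."""
    return [[mul(a, b) for b in range(-128, 128)] for a in range(-128, 128)]

def lut_dot(lut, xs, ws):
    """Dot product of two int8 vectors, using a table lookup for each product."""
    return sum(lut[x + 128][w + 128] for x, w in zip(xs, ws))

exact_lut = build_lut(lambda a, b: a * b)
approx_lut = build_lut(lambda a, b: (a * b) & ~0x3)   # toy stand-in for an evolved design

xs = [12, -7, 100, -128]     # quantized activations
ws = [-3, 55, 2, 127]        # quantized weights
exact_acc = lut_dot(exact_lut, xs, ws)
approx_acc = lut_dot(approx_lut, xs, ws)
print(exact_acc, approx_acc, exact_acc - approx_acc)
```

The accumulation itself stays exact in this sketch; only the individual products are approximated, which mirrors replacing the multiplier inside a MAC unit.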

In [None]:
# Quantization-aware training for CIFAR-100 ResNet-20 using helpers from train_qat_cifar100_resnet20.py
import os, sys, json, torch
from pathlib import Path

module_dir = Path("code/resnet/quant_code")
if not module_dir.is_dir():
    raise FileNotFoundError(f"Expected directory '{module_dir}' alongside this notebook.")
if str(module_dir.resolve()) not in sys.path:
    sys.path.insert(0, str(module_dir.resolve()))
repo_root = Path(".").resolve() / "code" / "resnet"

from train_qat_cifar100_resnet20 import (
    Cfg, get_loaders_cifar100, QResNet20CIFAR,
    train_one_epoch, evaluate, set_activation_quant_enabled,
    calibrate_activations, calibrate_weights
 )
from torch.optim.lr_scheduler import CosineAnnealingLR

cfg = Cfg()
cfg.data_root = str((repo_root / "datasets").resolve())
cfg.fp32_ckpt = str((repo_root / "runs_fp32" / "resnet20_cifar100_fp32.pt").resolve())
cfg.out_dir = str((repo_root / "runs_qat" / "resnet20").resolve())
os.makedirs(cfg.out_dir, exist_ok=True)

train_loader, test_loader = get_loaders_cifar100(cfg)
model = QResNet20CIFAR(num_classes=cfg.num_classes, cfg=cfg).to(cfg.device)

if cfg.fp32_ckpt and os.path.isfile(cfg.fp32_ckpt):
    sd = torch.load(cfg.fp32_ckpt, map_location='cpu')
    missing, unexpected = model.load_state_dict(sd, strict=False)
    print(f'load_state: missing={len(missing)} unexpected={len(unexpected)}')
else:
    print('[Warn] fp32 checkpoint not found, training continues from scratch.')

optimizer = torch.optim.SGD(
    model.parameters(),
    lr=cfg.lr_qat, momentum=cfg.momentum,
    weight_decay=cfg.weight_decay, nesterov=cfg.nesterov
 )

total_epochs = cfg.warmup_epochs + cfg.epochs_qat
scheduler = CosineAnnealingLR(optimizer, T_max=total_epochs, eta_min=cfg.eta_min)

history = {"train_loss": [], "train_acc": [], "test_loss": [], "test_acc": []}

set_activation_quant_enabled(model, in_enabled=False, out_enabled=False)
print(f'[Warm-up] epochs={cfg.warmup_epochs}')
for ep in range(1, cfg.warmup_epochs + 1):
    tr_loss, tr_acc = train_one_epoch(model, train_loader, optimizer, cfg.device)
    te_loss, te_acc = evaluate(model, test_loader, cfg.device)
    cur_lr = optimizer.param_groups[0]['lr']
    print(f'[W{ep:02d}] lr {cur_lr:.2e} | train {tr_loss:.4f}/{tr_acc*100:.2f}% | test {te_loss:.4f}/{te_acc*100:.2f}%')
    scheduler.step()

calibrate_weights(model)

set_activation_quant_enabled(model, in_enabled=cfg.quantize_input, out_enabled=cfg.quantize_output)
print(f'[Calib-1] collecting {cfg.calib_steps} mini-batches...')
calibrate_activations(model, train_loader, cfg.device, cfg.calib_steps)

total_qat = cfg.epochs_qat
re_ep = int(total_qat * cfg.recalib_ratio) if cfg.recalib else -1

for ep in range(1, total_qat + 1):
    tr_loss, tr_acc = train_one_epoch(model, train_loader, optimizer, cfg.device)
    te_loss, te_acc = evaluate(model, test_loader, cfg.device)
    history["train_loss"].append(tr_loss); history["train_acc"].append(tr_acc)
    history["test_loss"].append(te_loss);  history["test_acc"].append(te_acc)
    cur_lr = optimizer.param_groups[0]['lr']
    print(f'[QAT {ep:03d}] lr {cur_lr:.2e} | train {tr_loss:.4f}/{tr_acc*100:.2f}% | test {te_loss:.4f}/{te_acc*100:.2f}%')
    scheduler.step()
    if cfg.recalib and ep == re_ep:
        print(f'[Calib-2 @ QAT {ep}] collecting {cfg.calib_steps} mini-batches...')
        calibrate_activations(model, train_loader, cfg.device, cfg.calib_steps)

torch.save(model.state_dict(), os.path.join(cfg.out_dir, cfg.ckpt))
with open(os.path.join(cfg.out_dir, cfg.hist), 'w') as f:
    json.dump(history, f)

te_loss, te_acc = evaluate(model, test_loader, cfg.device)
print(f'INT8 final: loss {te_loss:.4f}, acc {te_acc*100:.2f}%')

### 6.2 Diagnostic: INT8 Weights with an Approximate Multiplier

We ran a control diagnostic where we kept the baseline INT8 weights unchanged and only swapped the exact multiplier for an approximate one. Classification accuracy degraded sharply. This confirms that hardware-aware fine-tuning is necessary: quantized weights alone cannot offset the additional arithmetic error introduced by approximate units.

Implementation note: for each approximate-multiplier variant, point `lut_table_path` to the matching LUT so that the kernel uses the correct truth table.
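
To make the LUT idea concrete, the sketch below tabulates a multiplier as a 256x256 table and uses it inside a dot product with exact accumulation. It is a simplified illustration under stated assumptions: the table layout, the `+128` index offset, and the `truncating_mul` stand-in are hypothetical and do not describe the CSV format or the kernel in `resnet20_lut`.

```python
def build_lut(mul, lo=-128, hi=127):
    """Tabulate mul(a, b) for every signed 8-bit operand pair as a 256x256 nested list."""
    return [[mul(a, b) for b in range(lo, hi + 1)] for a in range(lo, hi + 1)]


def lut_mul(lut, a, b, offset=128):
    """Look up a product; the offset maps operands in -128..127 to indices 0..255."""
    return lut[a + offset][b + offset]


def lut_dot(lut, xs, ws):
    """Dot product whose multiplications come from the LUT; accumulation stays exact."""
    return sum(lut_mul(lut, x, w) for x, w in zip(xs, ws))


def truncating_mul(a, b):
    """Hypothetical approximate multiplier: clears the 4 LSBs of the exact product."""
    return (a * b) & ~0xF


exact_lut = build_lut(lambda a, b: a * b)
approx_lut = build_lut(truncating_mul)

xs = [12, -7, 100, -128]   # quantized activations
ws = [5, 9, -3, -128]      # quantized weights
print("exact :", lut_dot(exact_lut, xs, ws))
print("approx:", lut_dot(approx_lut, xs, ws))
```

Swapping the table is the only change between the two runs, which mirrors the diagnostic above: the weights stay fixed and only the multiplier changes.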

In [None]:
from dataclasses import dataclass
from pathlib import Path
import time, torch
from torch import nn
from torch.utils.data import DataLoader, Subset
from torchvision import datasets, transforms

from resnet20_lut import ResNet20LUTCfg, build_lut_resnet20, resolve_lut_table


@dataclass
class Cfg(ResNet20LUTCfg):
    data_root: str = 'code/resnet/datasets'
    test_subset: int | None = None
    lut_table_path: str | None = 'code/resnet/lut/truth_table_0.5.csv'
    qat_ckpt: str = 'code/resnet/runs_qat/resnet20/resnet20_cifar100_qat_int8.pt'
    # lut_table_path: str | None = None



def get_test_loader(cfg: Cfg) -> DataLoader:
    mean = [0.5071, 0.4865, 0.4409]
    std = [0.2673, 0.2564, 0.2762]
    test_tf = transforms.Compose([
        transforms.ToTensor(),
        transforms.Normalize(mean, std),
    ])
    data_root = Path(cfg.data_root).expanduser()
    data_root.mkdir(parents=True, exist_ok=True)
    test_set = datasets.CIFAR100(str(data_root), train=False, download=True, transform=test_tf)
    if cfg.test_subset is not None and cfg.test_subset < len(test_set):
        indices = torch.arange(cfg.test_subset)
        test_set = Subset(test_set, indices)
    return DataLoader(test_set, batch_size=cfg.batch_sz, shuffle=False,
                      num_workers=cfg.num_workers, pin_memory=True)




@torch.no_grad()
def evaluate(model: nn.Module, loader: DataLoader, device: str) -> tuple[float, float, float]:
    model.eval()
    total = 0
    correct = 0
    total_loss = 0.0
    criterion = nn.CrossEntropyLoss()

    if device.startswith('cuda'):
        torch.cuda.synchronize()
    t0 = time.time()
    total_batches = len(loader)
    for step, (x, y) in enumerate(loader, 1):
        x = x.to(device)
        y = y.to(device)
        logits = model(x)
        loss = criterion(logits, y)
        total_loss += loss.item() * x.size(0)
        pred = logits.argmax(dim=1)
        correct += (pred == y).sum().item()
        total += x.size(0)
        if step % max(1, total_batches // 10) == 0 or step == total_batches:
            print(f'[Eval] processed {step}/{total_batches} batches...')
    if device.startswith('cuda'):
        torch.cuda.synchronize()
    elapsed = time.time() - t0
    avg_loss = total_loss / max(total, 1)
    acc = correct / max(total, 1)
    return avg_loss, acc, elapsed

cfg = Cfg()
device = cfg.device
torch.backends.cudnn.benchmark = True

if cfg.lut_table_path is not None:
    base_dir = Path.cwd()
    resolve_lut_table(cfg, base_dir)
    print(f'[Info] Loaded LUT truth table: {cfg.lut_table_path}')

print('[Info] Building LUT ResNet-20 and loading QAT checkpoint...')
print(f'[Info] Using checkpoint at: {cfg.qat_ckpt}')
model = build_lut_resnet20(cfg).to(device)

print('[Info] Preparing CIFAR-100 test set...')
test_loader = get_test_loader(cfg)

print('[Info] Starting evaluation...')
loss, acc, elapsed = evaluate(model, test_loader, device)
samples = len(test_loader.dataset)
print(f'[Result] test_loss={loss:.4f}, test_acc={acc*100:.2f}%, time={elapsed:.2f}s, throughput={samples/elapsed:.1f} samples/s')

### 6.3 Hardware-Aware Fine-Tuning with Approximate Multipliers

We now proceed to hardware-aware fine-tuning. During backpropagation, the gradient of the exact multiplier is substituted for that of the approximate multiplier to preserve a descent direction and ensure that training remains stable. Although the forward pass uses non-differentiable LUT-based approximate multipliers, this gradient substitution allows the network to adapt its parameters to compensate for arithmetic inaccuracies.



This stage aligns algorithmic adaptation with hardware characteristics. As Section 6.4 shows, fine-tuning converges stably and recovers accuracy close to the exact baseline when the multiplier error is small.
<p align="center">
  <img src="./images/fine_tuning_explanation.png"
       alt="Fine tuning"
       style="width:80%; max-width:800px; border:1px solid #ccc;">
  <br>
  <em>Figure 8: Hardware-aware fine-tuning with an approximate multiplier.</em>
</p>
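
The gradient substitution described above can be illustrated on a single weight. This is a toy sketch under stated assumptions, not the training code used in this notebook: `approx_mul` is a hypothetical multiplier that truncates the low bits of the product, and the one-weight "network" exists only to show the forward/backward mismatch.

```python
def approx_mul(a: int, b: int) -> int:
    """Hypothetical approximate multiplier: clears the 4 LSBs of the exact product."""
    return (a * b) & ~0xF


def forward(w: float, x: int) -> int:
    """Forward pass: round the weight to an integer, then multiply approximately."""
    return approx_mul(round(w), x)


def substituted_grad(w: float, x: int, target: int) -> float:
    """Gradient of 0.5 * (y - target)**2 with respect to w, using d(w * x)/dw = x.

    Rounding and the approximate product are piecewise constant, so their true
    derivative is zero almost everywhere. The exact multiplier's derivative is
    used in their place.
    """
    y = forward(w, x)
    return (y - target) * x


def fine_tune(w: float, x: int, target: int, lr: float = 1e-3, steps: int = 500) -> float:
    """Plain gradient descent driven by the substituted gradient."""
    for _ in range(steps):
        w -= lr * substituted_grad(w, x, target)
    return w


x, target = 10, 304
w_exact = target / x                       # ideal weight under exact multiplication
w_tuned = fine_tune(w_exact, x, target)    # weight adapted to the approximate unit
print("before:", forward(w_exact, x))
print("after :", forward(w_tuned, x), "with integer weight", round(w_tuned))
```

The weight that is ideal for exact arithmetic misses the target once the multiplier is approximate, and the substituted gradient nudges it to a neighboring integer that compensates. In PyTorch, the same pattern is typically expressed as a custom `torch.autograd.Function` whose `forward` performs the LUT lookup and whose `backward` returns the exact-product gradients.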

In [None]:
import json
import os, sys
import time
from dataclasses import dataclass
from pathlib import Path

import torch
import torch.nn.functional as F
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

module_dir = Path("code/resnet/quant_code")
if not module_dir.is_dir():
    raise FileNotFoundError(f"Expected directory '{module_dir}' alongside this notebook.")
module_path = str(module_dir.resolve())
if module_path not in sys.path:
    sys.path.insert(0, module_path)

from resnet20_lut import (
    QResNet20CIFARLUT,
    ResNet20LUTCfg,
    build_lut_resnet20,
    resolve_lut_table,
 )


@dataclass
class Cfg(ResNet20LUTCfg):
    qat_ckpt: str = 'code/resnet/runs_qat/resnet20/resnet20_cifar100_qat_int8.pt'  # QAT weight path
    data_root: str = 'code/resnet/datasets'
    epochs: int = 5  # Section 6.4 reports 12-epoch runs; set this to 12 to match that setup
    lr: float = 5e-4
    weight_decay: float = 1e-4
    momentum: float = 0.9
    nesterov: bool = True
    log_interval: int = 100
    eta_min: float = 1e-5
    output_dir: str = 'code/resnet/runs_qat/resnet20'
    save_path: str | None = None
    resume_path: str | None = None
    history_path: str | None = None
    lut_table_path: str | None = 'code/resnet/lut/truth_table_0.5.csv'
    num_workers: int = 0


def get_loaders(cfg: Cfg) -> tuple[DataLoader, DataLoader]:
    mean = [0.5071, 0.4865, 0.4409]
    std = [0.2673, 0.2564, 0.2762]
    train_tf = transforms.Compose([
        transforms.RandomCrop(32, padding=4),
        transforms.RandomHorizontalFlip(),
        transforms.ToTensor(),
        transforms.Normalize(mean, std),
    ])
    test_tf = transforms.Compose([
        transforms.ToTensor(),
        transforms.Normalize(mean, std),
    ])
    train_set = datasets.CIFAR100(cfg.data_root, train=True, download=True, transform=train_tf)
    test_set = datasets.CIFAR100(cfg.data_root, train=False, download=True, transform=test_tf)
    train_loader = DataLoader(
        train_set, batch_size=cfg.batch_sz, shuffle=True, num_workers=cfg.num_workers, pin_memory=True
    )
    test_loader = DataLoader(
        test_set, batch_size=cfg.batch_sz, shuffle=False, num_workers=cfg.num_workers, pin_memory=True
    )
    return train_loader, test_loader


def infer_lut_tag(lut_path: str | None) -> str:
    if not lut_path:
        return 'exact'
    stem = os.path.splitext(os.path.basename(lut_path))[0]
    candidate = stem.split('_')[-1] if '_' in stem else stem
    candidate = candidate.strip()
    return candidate or 'custom'


def train_one_epoch(
    model: QResNet20CIFARLUT,
    loader: DataLoader,
    optimizer: torch.optim.Optimizer,
    device: str,
    log_interval: int,
 ) -> tuple[float, float]:
    model.train()
    total = 0
    correct = 0
    loss_sum = 0.0
    for step, (x, y) in enumerate(loader, 1):
        x = x.to(device)
        y = y.to(device)
        optimizer.zero_grad(set_to_none=True)
        logits = model(x)
        loss = F.cross_entropy(logits, y)
        loss.backward()
        optimizer.step()
        loss_sum += loss.item() * x.size(0)
        correct += (logits.argmax(dim=1) == y).sum().item()
        total += x.size(0)
        if step % max(1, log_interval) == 0:
            print(
                f'[Train] step={step:04d} loss={loss.item():.4f} acc={(correct/total)*100:.2f}%',
                flush=True,
            )
    return loss_sum / max(total, 1), correct / max(total, 1)


@torch.no_grad()
def evaluate(model: QResNet20CIFARLUT, loader: DataLoader, device: str) -> tuple[float, float]:
    model.eval()
    total = 0
    correct = 0
    loss_sum = 0.0
    for x, y in loader:
        x = x.to(device)
        y = y.to(device)
        logits = model(x)
        loss = F.cross_entropy(logits, y)
        loss_sum += loss.item() * x.size(0)
        correct += (logits.argmax(dim=1) == y).sum().item()
        total += x.size(0)
    return loss_sum / max(total, 1), correct / max(total, 1)


def build_model(cfg: Cfg) -> QResNet20CIFARLUT:
    if cfg.qat_ckpt and os.path.isfile(cfg.qat_ckpt):
        model = build_lut_resnet20(cfg)
    elif cfg.resume_path:
        print('[Warn] Initial QAT checkpoint not found, falling back to resume only.')
        model = QResNet20CIFARLUT(num_classes=cfg.num_classes, cfg=cfg)
    else:
        raise FileNotFoundError('Missing initial QAT weights; set cfg.qat_ckpt or provide resume_path')
    if cfg.resume_path and os.path.isfile(cfg.resume_path):
        print(f'[Info] Resume from {cfg.resume_path}')
        state = torch.load(cfg.resume_path, map_location='cpu')
        model.load_state_dict(state, strict=False)
    return model

cfg = Cfg()
device = cfg.device
torch.backends.cudnn.benchmark = True
if cfg.lut_table_path is not None:
    base_dir = Path.cwd()
    resolve_lut_table(cfg, base_dir)
    print(f'[Info] Loaded LUT table: {cfg.lut_table_path}')
lut_tag = infer_lut_tag(cfg.lut_table_path)
if cfg.save_path is None:
    cfg.save_path = os.path.join(cfg.output_dir, f'resnet20_cifar100_qat_int8_lut_{lut_tag}_ft.pt')
if cfg.history_path is None:
    cfg.history_path = os.path.join(cfg.output_dir, f'history_ft_resnet20_lut_{lut_tag}.json')
train_loader, test_loader = get_loaders(cfg)
model = build_model(cfg).to(device)
optimizer = torch.optim.SGD(
    model.parameters(), lr=cfg.lr, momentum=cfg.momentum, weight_decay=cfg.weight_decay, nesterov=cfg.nesterov
 )
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=cfg.epochs, eta_min=cfg.eta_min)
best_acc = 0.0
history: dict[str, list[float]] = {
    'train_loss': [],
    'train_acc': [],
    'test_loss': [],
    'test_acc': [],
}
for epoch in range(1, cfg.epochs + 1):
    t0 = time.time()
    train_loss, train_acc = train_one_epoch(model, train_loader, optimizer, device, cfg.log_interval)
    test_loss, test_acc = evaluate(model, test_loader, device)
    scheduler.step()
    elapsed = time.time() - t0
    history['train_loss'].append(train_loss)
    history['train_acc'].append(train_acc)
    history['test_loss'].append(test_loss)
    history['test_acc'].append(test_acc)
    print(
        f"[Epoch {epoch:03d}] {elapsed:.1f}s | train {train_loss:.4f}/{train_acc*100:.2f}% | "
        f"test {test_loss:.4f}/{test_acc*100:.2f}% | lr={optimizer.param_groups[0]['lr']:.2e}",
        flush=True,
    )
    if test_acc > best_acc:
        best_acc = test_acc
        os.makedirs(os.path.dirname(cfg.save_path), exist_ok=True)
        torch.save(model.state_dict(), cfg.save_path)
        print(f'[Info] Saved new best model -> {cfg.save_path} (acc={best_acc*100:.2f}%)', flush=True)
print(f'[Done] Best test accuracy: {best_acc*100:.2f}%')
os.makedirs(os.path.dirname(cfg.history_path), exist_ok=True)
with open(cfg.history_path, 'w') as f:
    json.dump(history, f)
print(f'[Info] Training history saved to {cfg.history_path}')

### 6.4 Impact of Multiplier Accuracy on Fine-Tuning Performance

To evaluate the impact of multiplier accuracy, we perform hardware-aware fine-tuning with approximate multipliers characterized by different relative worst-case error (relative WCE) values.
In Figure 9, the left panel shows the test accuracy and the right panel shows the test loss over 12 epochs for each configuration.

Observations

Low-error multipliers (Relative WCE ≤ 0.02) achieve stable convergence and maintain accuracy close to the exact baseline (≈ 68–70%).

Moderate-error multipliers (Relative WCE ≈ 0.05) still allow partial recovery through fine-tuning, reaching about 55–60% accuracy.

High-error multipliers (Relative WCE ≥ 0.1) show significantly degraded convergence and accuracy, indicating that when arithmetic error becomes large, gradient substitution alone cannot fully compensate for the approximate behavior.

These results demonstrate that hardware-aware fine-tuning can successfully adapt to approximate arithmetic units with small to moderate error levels, but excessive deviation from exact computation leads to accuracy collapse.

Fairness and Reproducibility

For a fair comparison, all experiments are fine-tuned for exactly 12 epochs under identical hyperparameters and optimization schedules.
Readers interested in further exploration may extend the fine-tuning process or adjust the learning-rate decay to investigate whether additional training epochs can further recover performance for higher-error multipliers.

<p align="center">
  <img src="./images/lut_compare_acc.png"
       alt="Test Accuracy for Approximate Multipliers"
       style="width:45%; max-width:500px; border:1px solid #ccc; margin-right:10px;">
  <img src="./images/lut_compare_loss.png"
       alt="Test Loss for Approximate Multipliers"
       style="width:45%; max-width:500px; border:1px solid #ccc;">
</p>

<p align="center">
  <em>Figure 9: Test accuracy and loss during hardware-aware fine-tuning with approximate multipliers of different relative WCE values.</em>
</p>


The following cell plots a saved training history so that these trends can be inspected directly.

In [None]:
import json, os
import matplotlib.pyplot as plt

# Paths are relative to the notebook root, like the other cells. To plot the fine-tuning
# curves from Section 6.3 instead, point history_path at e.g.
# code/resnet/runs_qat/resnet20/history_ft_resnet20_lut_0.5.json
history_path = "code/resnet/runs_qat/resnet20/history_qat_resnet20.json"
save_path    = "code/resnet/runs_qat/resnet20/history_qat_resnet20.json.png"
title        = "Training Curves"
smooth_alpha = 0.2  # EMA smoothing factor; 0 or None disables smoothing


def _resolve_alpha(alpha):
    if alpha is None:
        return None
    try:
        alpha = float(alpha)
    except (TypeError, ValueError):
        return None
    if alpha <= 0:
        return None
    return alpha


ema_alpha = _resolve_alpha(smooth_alpha)
raw_alpha = 0.35 if ema_alpha else 0.9

def ema(xs, alpha=0.2):
    if xs is None: return None
    out = []
    m = None
    for v in xs:
        m = v if m is None else (alpha * v + (1 - alpha) * m)
        out.append(m)
    return out

def load_history(p):
    with open(p, "r") as f:
        raw = json.load(f)
    keys = ["train_loss", "test_loss", "train_acc", "test_acc"]
    if isinstance(raw, list):
        hist = {k: [] for k in keys}
        for entry in raw:
            for k in keys:
                hist[k].append(entry.get(k, 0.0))
        return hist
    for k in keys:
        raw.setdefault(k, [])
    return raw

history = load_history(history_path)
epochs = list(range(1, len(history["train_loss"]) + 1))

TL, TeL = history["train_loss"], history["test_loss"]
TA, TeA = history["train_acc"], history["test_acc"]

if ema_alpha:
    TLs, TeLs = ema(TL, ema_alpha), ema(TeL, ema_alpha)
    TAs, TeAs = ema(TA, ema_alpha), ema(TeA, ema_alpha)
else:
    TLs, TeLs, TAs, TeAs = TL, TeL, TA, TeA

plt.figure(figsize=(12,5))

# Loss
plt.subplot(1,2,1)
plt.plot(epochs, TL,  label="Train Loss", alpha=raw_alpha)
plt.plot(epochs, TeL, label="Test Loss",  alpha=raw_alpha)
if ema_alpha:
    plt.plot(epochs, TLs, label=f"Train Loss (EMA {ema_alpha:.2f})", alpha=1.0)
    plt.plot(epochs, TeLs, label=f"Test Loss (EMA {ema_alpha:.2f})", alpha=1.0)
plt.xlabel("Epoch"); plt.ylabel("Loss"); plt.title("Loss")
plt.grid(True); plt.legend()

# Accuracy
plt.subplot(1,2,2)
plt.plot(epochs, [a*100 for a in TA],  label="Train Acc (%)", alpha=raw_alpha)
plt.plot(epochs, [a*100 for a in TeA], label="Test Acc (%)",  alpha=raw_alpha)
if ema_alpha:
    plt.plot(epochs, [a*100 for a in TAs],  label=f"Train Acc (EMA {ema_alpha:.2f})", alpha=1.0)
    plt.plot(epochs, [a*100 for a in TeAs], label=f"Test Acc (EMA {ema_alpha:.2f})", alpha=1.0)
plt.xlabel("Epoch"); plt.ylabel("Accuracy (%)"); plt.title("Accuracy")
plt.grid(True); plt.legend()

plt.suptitle(title)
plt.tight_layout(rect=[0,0,1,0.96])

if save_path:
    os.makedirs(os.path.dirname(save_path) or ".", exist_ok=True)
    plt.savefig(save_path, dpi=180)
    print(f"Saved: {save_path}")

plt.show()

Readers can refer to the scripts in `code/resnet/quant_code` to fine-tune larger networks, such as ResNet-18, under the approximate-multiplier configuration.

---
## 7. Discussion


1. Currently, only six types of gates are used for synthesis and optimization, which may limit the design space explored by Yosys-abc. This restriction could lead to suboptimal implementations, especially for more complex arithmetic circuits where richer logic primitives can significantly reduce depth and transistor count. However, the proposed approach is fundamentally general and easily extensible. By modifying the gate set, our framework can seamlessly incorporate new logic types or even higher-level functional blocks such as full adders (FA), multiplexers, barrel shifters, etc. This flexibility allows the evolutionary algorithm to search a vastly larger and more expressive design space, potentially discovering architectures that are both more compact and more accurate. In other words, while our current experiments focus on a limited set of gates for clarity and consistency, the methodology itself provides a scalable foundation for future work that targets richer, application-specific gate libraries or hierarchical circuit components.

2. Our experiments show that network size strongly affects accuracy once exact units are replaced with approximate computing elements. Larger networks tend to suffer greater accuracy degradation because errors accumulate across more approximate operations. However, for the same task, larger networks also exhibit stronger error resilience: through fine-tuning, they can compensate for these approximation-induced errors and ultimately achieve better performance than smaller networks.
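
The extensibility argument in the first point can be sketched as follows. This is a minimal illustration, assuming a netlist encoded as a list of `(gate, input_indices)` nodes; the gate names are placeholders rather than the six gates used in our experiments, and `random_node` and `evaluate` are hypothetical helpers, not functions from this repository.

```python
import random


def make_gate_set():
    """Placeholder gate library: name -> (arity, function on 0/1 inputs)."""
    return {
        "AND": (2, lambda a, b: a & b),
        "OR": (2, lambda a, b: a | b),
        "XOR": (2, lambda a, b: a ^ b),
        "NOT": (1, lambda a: a ^ 1),
    }


def random_node(gates, n_signals, rng):
    """Mutation primitive: pick any registered gate and wire it to earlier signals."""
    name = rng.choice(sorted(gates))
    arity, _ = gates[name]
    return name, tuple(rng.randrange(n_signals) for _ in range(arity))


def evaluate(circuit, inputs, gates):
    """Evaluate a feed-forward netlist; the last node drives the output."""
    signals = list(inputs)
    for name, srcs in circuit:
        _, fn = gates[name]
        signals.append(fn(*(signals[s] for s in srcs)))
    return signals[-1]


gates = make_gate_set()
# Enlarging the search space only requires registering one more primitive.
gates["MUX"] = (3, lambda sel, a, b: b if sel else a)

rng = random.Random(0)
n_inputs = 3
circuit = [random_node(gates, n_inputs + i, rng) for i in range(4)]

mux_only = [("MUX", (0, 1, 2))]
print(circuit)
print(evaluate(mux_only, (1, 0, 1), gates))
```

Because the mutation step draws from whatever is registered, adding a multiplexer or a full-adder block changes the reachable designs without touching the search loop itself.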

---
## 8. Conclusion

In this work, we presented a comprehensive framework that bridges evolutionary circuit synthesis and approximate deep learning acceleration. At the circuit level, we developed a genetic algorithm-based approximate synthesis flow that optimizes logic structures under multiple constraints. Only six basic gates were used in the current setup to ensure simplicity and compatibility with standard cell libraries, but the framework itself is general and extensible—it can easily incorporate more complex gate types or higher-level functional blocks such as full adders or barrel shifters, thus broadening the search space for better trade-offs.
At the algorithmic level, we integrated these evolved approximate multipliers into neural network inference pipelines and evaluated their impact on real-world tasks.

Overall, the proposed framework establishes a complete design loop from transistor-level logic optimization to system-level accuracy evaluation. It provides a powerful foundation for co-exploring accuracy, efficiency, and resilience across multiple abstraction layers. Future work will focus on extending the gate library, incorporating realistic physical constraints (timing, power, PVT variation), and exploring joint hardware–software co-design strategies for energy-efficient AI accelerators.