FloatNum — Software and Hardware Floating-Point Implementation

A from-scratch C++ floating-point class with IEEE 754-style arithmetic (signed zeros, infinities, NaN propagation), plus a parallel Catapult HLS-synthesizable implementation using Algorithmic C (AC) datatypes.

Project Structure

FloatNum.h            — Software class declaration
FloatNum.cpp          — Software implementation
FloatNum_hw.h         — Synthesizable struct and function declarations (AC types)
FloatNum_hw.cpp       — Synthesizable implementation
test_floatnum.cpp     — GoogleTest suite for the software model (90 tests)
test_floatnum_hw.cpp  — GoogleTest suite for the hardware model (90 tests)
catapult.tcl          — Catapult HLS synthesis script
CMakeLists.txt        — Build system (GoogleTest via FetchContent)
ac_types/             — Siemens AC datatypes library (header-only)

Build and Test

Prerequisites: CMake ≥ 3.14, a C++17 compiler, internet access for the first build (GoogleTest is fetched automatically).

cmake -B build -DCMAKE_BUILD_TYPE=Debug
cmake --build build
ctest --test-dir build --output-on-failure

All 180 tests should pass: 90 for the software model, 90 for the hardware model.

Internal Representation

Both the software and hardware models share the same conceptual representation.

Value formula

For a normal (non-special) number:

value = (-1)^sign  ×  mantissa  ×  2^(exponent − 63)

Fields

| Field | SW type | HW type | Width | Role |
| --- | --- | --- | --- | --- |
| `sign` | `bool` | `bool` | 1 bit | `false` = positive, `true` = negative |
| `mantissa` / `mant` | `uint64_t` | `ac_int<64, false>` | 64 bits | Significand; bit 63 is always 1 for normal numbers |
| `exponent` / `exp` | `int32_t` | `ac_int<32, true>` | 32 bits | Unbiased binary exponent |
| `category` / `cat` | `enum class Category` | `ac_int<2, false>` | 2 bits | Numeric class (see below) |

Category encoding

| Name | SW enum | HW constant | Value | Meaning |
| --- | --- | --- | --- | --- |
| Normal | `Category::Normal` | `CAT_NORMAL` | 0 | Finite non-zero number |
| Zero | `Category::Zero` | `CAT_ZERO` | 1 | ±0 |
| Infinity | `Category::Infinity` | `CAT_INF` | 2 | ±∞ |
| NaN | `Category::NaN` | `CAT_NAN` | 3 | Not-a-Number |

Using an explicit category tag — rather than IEEE 754's sentinel bit patterns — keeps the arithmetic clean: every operation can dispatch on category first with a simple switch/if chain, and the normal-number path contains no special-value checks.

Normalization invariant

For Category::Normal numbers, bit 63 of the mantissa is always set. This is the explicit leading-1 convention (as opposed to IEEE 754's implicit leading-1). It simplifies addition and subtraction alignment because the raw uint64_t/ac_int<64,false> value can be shifted directly without masking or re-inserting the hidden bit.

The value interpretation is:

value = (-1)^sign × (mantissa as a 64-bit integer) × 2^(exponent − 63)

Because bit 63 is set, mantissa is in the range [2^63, 2^64), making the effective significand in [1.0, 2.0) — matching IEEE 754's normalized form.
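
As a quick check of the value formula, the raw fields can be decoded with std::ldexp. This is a minimal sketch rather than the project's toDouble: the name decodeNormal is hypothetical, it only covers Category::Normal values, and the cast to double rounds any mantissa that carries more than 53 significant bits.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Hypothetical helper (not part of FloatNum): decode a Normal value from its
// raw fields using value = (-1)^sign * mantissa * 2^(exponent - 63).
double decodeNormal(bool sign, std::uint64_t mantissa, std::int32_t exponent) {
    assert(mantissa >> 63);  // normalization invariant: bit 63 must be set
    double magnitude = std::ldexp(static_cast<double>(mantissa), exponent - 63);
    return sign ? -magnitude : magnitude;
}
```

With mantissa = 2^63 and exponent = 0 this yields 1.0; mantissa = 3 × 2^62 (significand 1.5) with exponent = 1 yields 3.0.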

Design choices

Rounding: truncation (round toward zero). During alignment right-shifts and multiplication/division truncation, low-order bits are simply discarded. This is documented and expected; it avoids the complexity of guard, round, and sticky bits.

Denormals: flushed to zero. If normalization would produce a mantissa of zero (complete cancellation), the result is returned as Category::Zero. Subnormal numbers are not represented.

Exponent range: roughly ±2 billion. The 32-bit signed exponent far exceeds the IEEE 754 double range (normal exponents in [−1022, +1023]), so exponent overflow and underflow cannot occur in a single operation whose operands were converted from double.

Software Model — `FloatNum`

Public API

// Constructors
FloatNum();                     // +0
explicit FloatNum(double d);    // convert from double (uses std::frexp)

// Named factories for special values
static FloatNum zero(bool negative = false);
static FloatNum infinity(bool negative = false);
static FloatNum nan();

// Conversion back to double
double toDouble() const;

// Classification
bool isZero()     const;
bool isInfinite() const;
bool isNaN()      const;
bool isNormal()   const;
bool isNegative() const;

// Arithmetic
FloatNum operator+(const FloatNum& rhs) const;
FloatNum operator-(const FloatNum& rhs) const;
FloatNum operator*(const FloatNum& rhs) const;
FloatNum operator/(const FloatNum& rhs) const;
FloatNum operator-() const;  // unary negation

// Comparisons (IEEE 754 semantics: NaN unordered, ±0 equal)
bool operator==(const FloatNum& rhs) const;
bool operator!=(const FloatNum& rhs) const;
bool operator< (const FloatNum& rhs) const;
bool operator<=(const FloatNum& rhs) const;
bool operator> (const FloatNum& rhs) const;
bool operator>=(const FloatNum& rhs) const;

// Debug
std::string toString() const;

Construction from `double`

std::frexp(d, &exp) decomposes a double into a significand f ∈ [0.5, 1.0) and an integer exponent exp such that d = f × 2^exp. Scaling f by 2^64 places it in [2^63, 2^64) and it can be stored as a uint64_t without rounding error beyond what double's 53-bit mantissa already carries.

mantissa_ = (uint64_t)(f × 2^64)
exponent_ = exp − 1

This is chosen over std::log2 to avoid floating-point approximation errors in the exponent computation.

The round-trip property FloatNum(d).toDouble() == d holds exactly for all finite, non-denormal double values because the 64-bit mantissa captures all 53 bits of the double significand.
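
The conversion described above might look roughly like the following. It is a sketch under stated assumptions: Parts, partsFromDouble and partsToDouble are hypothetical names standing in for the class internals, and only finite, non-zero inputs are handled (the real constructor also has to classify zero, infinity and NaN).

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Hypothetical stand-in for the private fields of a Normal FloatNum.
struct Parts {
    bool          sign;
    std::uint64_t mantissa;
    std::int32_t  exponent;
};

// Sketch of the frexp-based conversion for finite, non-zero inputs.
Parts partsFromDouble(double d) {
    assert(std::isfinite(d) && d != 0.0);
    int exp = 0;
    double f = std::frexp(std::fabs(d), &exp);   // f in [0.5, 1.0)
    // ldexp(f, 64) is an exact power-of-two scaling into [2^63, 2^64).
    auto mantissa = static_cast<std::uint64_t>(std::ldexp(f, 64));
    return {std::signbit(d), mantissa, exp - 1};
}

// Inverse mapping; exact here because only the top 53 mantissa bits are non-zero.
double partsToDouble(const Parts& p) {
    double magnitude = std::ldexp(static_cast<double>(p.mantissa), p.exponent - 63);
    return p.sign ? -magnitude : magnitude;
}
```

For example, -6.0 decomposes as 0.75 × 2^3, giving mantissa = 3 × 2^62, exponent = 2 and sign = true.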

Algorithms

Normalize

After every arithmetic operation, the result mantissa may have leading zeros. normalize finds the highest set bit and left-shifts the mantissa to place it at bit 63, decrementing the exponent by the same amount.

if mantissa == 0:
    return zero(sign)
lz = count_leading_zeros(mantissa)   // 0..63
mantissa <<= lz
exponent  -= lz

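In the software model, __builtin_clzll provides the leading-zero count, typically as a single instruction on CPUs with a hardware count-leading-zeros operation. Its result is undefined for a zero argument, which is why the zero check comes first. In the hardware model it is replaced by a 64-iteration priority encoder loop (see Hardware section).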
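
The normalize pseudocode translates almost line for line into C++. The sketch below assumes GCC or Clang (for __builtin_clzll); Normalized and normalizeSketch are hypothetical names, and the real function returns a FloatNum rather than a bare struct.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical result type; the real class returns a FloatNum.
struct Normalized {
    bool          isZero;
    std::uint64_t mantissa;
    std::int32_t  exponent;
};

Normalized normalizeSketch(std::uint64_t mantissa, std::int32_t exponent) {
    if (mantissa == 0) return {true, 0, 0};      // complete cancellation
    int lz = __builtin_clzll(mantissa);          // GCC/Clang builtin, 0..63 here
    std::uint64_t shifted = mantissa << lz;      // move the highest set bit to bit 63
    assert(shifted >> 63);                       // normalization invariant restored
    return {false, shifted, exponent - lz};
}
```

A mantissa of 1 with exponent 0 becomes mantissa 2^63 with exponent −63, which encodes the same value.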

Addition

operator+(a, b):
    dispatch on category (NaN, Inf, Zero)
    if same sign:
        result = addMagnitudes(a, b)
        result.sign = a.sign
    else:
        cmp = compareMagnitude(a, b)
        if cmp == 0: return +0          // exact cancellation
        if |a| > |b|: result = subtractMagnitudes(a, b); result.sign = a.sign
        else:         result = subtractMagnitudes(b, a); result.sign = b.sign

addMagnitudes: Aligns the smaller operand by right-shifting its mantissa by the exponent difference, then sums using 128-bit arithmetic to detect carry out of bit 63. If a carry occurs, the sum is right-shifted once and the exponent is incremented.

shift = hi.exponent − lo.exponent
lo_m = lo.mantissa >> shift              // truncate; zero if shift ≥ 64
sum128 = (128-bit)hi.mantissa + lo_m      // widen first so the carry lands in bit 64
if sum128[64]:                           // carry
    sum128 >>= 1
    exponent++
result.mantissa = (uint64_t)sum128       // bit 63 is guaranteed set
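
A compilable version of the same steps is sketched below. Shifting a uint64_t by 64 or more is undefined behaviour in C++, so the "zero if shift ≥ 64" case is made explicit. The names Mag and addMagnitudesSketch are hypothetical, and unsigned __int128 is a GCC/Clang extension.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical magnitude pair: normalized mantissa plus unbiased exponent.
struct Mag {
    std::uint64_t mantissa;
    std::int32_t  exponent;
};

// Adds two magnitudes; hi must have the larger (or equal) exponent.
Mag addMagnitudesSketch(Mag hi, Mag lo) {
    assert(hi.exponent >= lo.exponent);
    std::int64_t shift = std::int64_t(hi.exponent) - lo.exponent;
    std::uint64_t lo_m = shift >= 64 ? 0 : lo.mantissa >> shift;   // truncating alignment
    unsigned __int128 sum = (unsigned __int128)hi.mantissa + lo_m; // widen before adding
    std::int32_t e = hi.exponent;
    if (sum >> 64) {           // carry into bit 64
        sum >>= 1;             // drops the lowest bit (truncation)
        ++e;
    }
    return {std::uint64_t(sum), e};
}
```

For instance, 1.0 + 1.0 produces a carry and comes back as mantissa 2^63 with exponent 1, while 1.0 + 0.5 needs no carry and yields mantissa 3 × 2^62 with exponent 0.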

subtractMagnitudes: Aligns the smaller operand, subtracts, then normalizes (since subtraction can produce many leading zeros).

shift = larger.exponent − smaller.exponent
s_m = smaller.mantissa >> shift
diff = larger.mantissa − s_m             // no borrow since larger ≥ smaller
return normalize(false, diff, larger.exponent)
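
The subtraction path can be sketched the same way. The code assumes the caller has already ordered the operands by magnitude, as operator+ does; Mag and subtractMagnitudesSketch are hypothetical names, a zero mantissa in the result stands in for Category::Zero, and __builtin_clzll assumes GCC or Clang.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical magnitude pair: normalized mantissa plus unbiased exponent.
struct Mag {
    std::uint64_t mantissa;
    std::int32_t  exponent;
};

// Computes |larger| - |smaller|; returns mantissa 0 on complete cancellation.
Mag subtractMagnitudesSketch(Mag larger, Mag smaller) {
    assert(larger.exponent >= smaller.exponent);
    std::int64_t shift = std::int64_t(larger.exponent) - smaller.exponent;
    std::uint64_t s_m = shift >= 64 ? 0 : smaller.mantissa >> shift;
    assert(larger.mantissa >= s_m);              // no borrow by precondition
    std::uint64_t diff = larger.mantissa - s_m;
    if (diff == 0) return {0, 0};
    int lz = __builtin_clzll(diff);              // renormalize: may shift many bits
    return {diff << lz, larger.exponent - lz};
}
```

Subtracting 1.5 from 2.0 leaves 2^61 after alignment, which normalization shifts up by two places to give mantissa 2^63 with exponent −1, that is 0.5.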

Multiplication

Two 64-bit normalized mantissas both have bit 63 set, so their product is in [2^126, 2^128). The top 64 bits of the 128-bit product form the result mantissa, and the exponents add.

product128 = (128-bit)a.mantissa × b.mantissa   // widen first: full 64×64→128 product
result_m   = product128[127:64]              // top 64 bits
result_e   = a.exponent + b.exponent + 1

if result_m[63] == 0:                        // product was in [2^126, 2^127)
    result_m <<= 1
    result_e  -= 1

The +1 in result_e accounts for the fixed-point scaling: each mantissa represents a value m × 2^(e−63), so the product is m1×m2 × 2^(e1+e2−126) = (m1×m2 >> 64) × 2^(e1+e2−62) = result_m × 2^(result_e−63) where result_e = e1+e2+1.
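
The mantissa and exponent handling for multiplication can be checked with a short sketch. Mag and mulMagnitudesSketch are hypothetical names, special-value dispatch and the sign are left out, and unsigned __int128 is a GCC/Clang extension.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical magnitude pair: normalized mantissa plus unbiased exponent.
struct Mag {
    std::uint64_t mantissa;
    std::int32_t  exponent;
};

Mag mulMagnitudesSketch(Mag a, Mag b) {
    assert((a.mantissa >> 63) && (b.mantissa >> 63));   // both operands normalized
    unsigned __int128 product = (unsigned __int128)a.mantissa * b.mantissa;
    std::uint64_t m = std::uint64_t(product >> 64);     // top 64 bits, truncating
    std::int32_t e = a.exponent + b.exponent + 1;
    if ((m >> 63) == 0) {      // product was in [2^126, 2^127)
        m <<= 1;
        --e;
    }
    return {m, e};
}
```

For 1.5 × 1.5 the product is 9 × 2^124, so the top half is 9 × 2^60 with bit 63 already set and the exponent is 1, which decodes to 1.125 × 2 = 2.25.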

Division

The numerator mantissa is scaled up by 2^64 (128-bit shift) before dividing by the denominator mantissa, preserving 64 bits of quotient precision.

numerator128 = (128-bit)(a.mantissa) << 64
q128         = numerator128 / b.mantissa
result_e     = a.exponent − b.exponent − 1

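if q128[64]:                                 // quotient ≥ 2^64 (a.mantissa ≥ b.mantissa)
    q128     >>= 1
    result_e  += 1
return normalize(result_sign, (uint64_t)q128, result_e)

The overflow check fires whenever a.mantissa ≥ b.mantissa: the mantissa ratio is then in [1, 2), so the quotient lies in [2^64, 2^65) and needs 65 bits (it is exactly 2^64 when the mantissas are equal). A single right-shift and exponent bump resolves it. Otherwise the ratio is in (0.5, 1), the quotient is already in [2^63, 2^64), and normalize leaves it unchanged.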
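
The quotient range is easy to verify with a small sketch. Mag and divMagnitudesSketch are hypothetical names, special values and the sign are omitted, and unsigned __int128 is a GCC/Clang extension.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical magnitude pair: normalized mantissa plus unbiased exponent.
struct Mag {
    std::uint64_t mantissa;
    std::int32_t  exponent;
};

Mag divMagnitudesSketch(Mag a, Mag b) {
    assert((a.mantissa >> 63) && (b.mantissa >> 63));   // both operands normalized
    unsigned __int128 q = ((unsigned __int128)a.mantissa << 64) / b.mantissa;
    std::int32_t e = a.exponent - b.exponent - 1;
    if (q >> 64) {             // 65-bit quotient: a.mantissa >= b.mantissa
        q >>= 1;
        ++e;
    }
    return {std::uint64_t(q), e};   // bit 63 is set in both branches
}
```

Dividing 3.0 by 2.0 (mantissas 3 × 2^62 and 2^63) gives a quotient of 3 × 2^63, which takes the overflow branch even though the mantissas differ. Dividing 1.0 by 1.5 gives floor(2^65 / 3) = 0xAAAAAAAAAAAAAAAA, which already fits in 64 bits.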

Comparison

Magnitude comparison compares exponent first (higher exponent = larger value for equal-sign normals), then mantissa if exponents are equal. Because both mantissas are normalized with bit 63 set, this is a direct integer comparison.

Full comparison (operator<) logic:

NaN: always return false
±0 equal to each other
Different signs: negative is smaller
Same sign, both infinite: equal
Same sign, one infinite: +inf > anything, -inf < anything
Same sign, both normal: compare magnitudes; for negatives, larger magnitude = smaller value

operator<= and operator>= include explicit NaN guards because their naive implementations (!(rhs < *this) and !(*this < rhs)) would return true for NaN operands — a bug that was caught and fixed during testing.
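
The NaN guard is easiest to see in code. The sketch below is one possible arrangement of the rules listed above, not the project's implementation: Num, lessThan and lessEqual are hypothetical names, and ranking Zero < Normal < Infinity inside compareMagnitude is an assumption made here to fold the infinite cases together.

```cpp
#include <cassert>
#include <cstdint>

enum class Category { Normal, Zero, Infinity, NaN };

// Hypothetical stand-in for FloatNum's fields.
struct Num {
    bool          sign;
    std::uint64_t mantissa;
    std::int32_t  exponent;
    Category      cat;
};

// Returns -1, 0, +1 for |a| <, ==, > |b|. NaN operands are not allowed.
int compareMagnitude(const Num& a, const Num& b) {
    assert(a.cat != Category::NaN && b.cat != Category::NaN);
    auto rank = [](const Num& x) {
        return x.cat == Category::Zero ? 0 : x.cat == Category::Normal ? 1 : 2;
    };
    if (rank(a) != rank(b)) return rank(a) < rank(b) ? -1 : 1;
    if (a.cat != Category::Normal) return 0;          // both zero or both infinite
    if (a.exponent != b.exponent) return a.exponent < b.exponent ? -1 : 1;
    if (a.mantissa != b.mantissa) return a.mantissa < b.mantissa ? -1 : 1;
    return 0;
}

bool lessThan(const Num& a, const Num& b) {
    if (a.cat == Category::NaN || b.cat == Category::NaN) return false;
    bool aZero = a.cat == Category::Zero, bZero = b.cat == Category::Zero;
    if (aZero && bZero) return false;        // +0 and -0 compare equal
    if (aZero) return !b.sign;               // 0 < b only if b is positive
    if (bZero) return a.sign;                // a < 0 only if a is negative
    if (a.sign != b.sign) return a.sign;     // the negative operand is smaller
    int cmp = compareMagnitude(a, b);
    return a.sign ? cmp > 0 : cmp < 0;       // negatives: larger magnitude is smaller
}

bool lessEqual(const Num& a, const Num& b) {
    // Without this guard, !lessThan(b, a) would return true for NaN operands.
    if (a.cat == Category::NaN || b.cat == Category::NaN) return false;
    return !lessThan(b, a);
}
```

With a NaN operand, lessThan(b, a) is false, so the unguarded negation would report a <= b as true; the explicit check keeps NaN unordered in every operator.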

Hardware Model — `FloatNum_hw`

The hardware model is a direct translation of the software model into Catapult HLS-synthesizable C++. All prohibited constructs are replaced; the original FloatNum files are unchanged.

Why a separate model?

Catapult HLS cannot synthesize several constructs used in the software model:

| Prohibited | Reason | Replacement |
| --- | --- | --- |
| `__uint128_t` | GCC/Clang extension | `ac_int<128, false>` |
| `__builtin_clzll()` | Compiler built-in | Priority-encoder loop |
| `double` type | Not synthesizable to RTL | `#ifndef __SYNTHESIS__` guard |
| `std::frexp`, `std::ldexp`, etc. | Software library calls | Same guard |
| `std::string`, `std::ostringstream` | Not synthesizable | `toString()` omitted |
| `std::swap` on pointers | Pointer semantics | Explicit `if/else` mux |
| `<cassert>`, `<cmath>`, etc. | Standard library | Removed or guarded |
| `enum class Category` | Synthesizes poorly | `ac_int<2, false>` constants |
| `int` from `compareMagnitude` | Signed return | 2-bit code (0/1/2 encoding) |

`ac_int` key properties

ac_int<W, S> from the Algorithmic C datatypes library (ac_types/include/ac_int.h) is a templated arbitrary-precision integer class:

W: bit width; S: signedness (true = two's-complement signed)
Supports +, -, *, /, >>, <<, [] (bit access), all comparisons
Multiplication: ac_int<64,false> × ac_int<64,false> → ac_int<128,false> automatically
Slicing: x.slc<W>(lsb) extracts a W-bit field starting at bit lsb
Conversion (for testbench only): .to_uint64(), .to_int()
Compiles and runs with standard g++ without any Catapult license

`FP` struct

struct FP {
    bool              sign;  // false = positive
    ac_int<64, false> mant;  // bit 63 set for Normal
    ac_int<32, true>  exp;   // unbiased
    ac_int<2,  false> cat;   // CAT_NORMAL/ZERO/INF/NAN
};

Factory inlines create special values: fp_zero(), fp_inf(), fp_nan().

`fp_clz64` — priority encoder

Replaces __builtin_clzll. The loop is annotated with #pragma hls_unroll yes so Catapult fully unrolls it into a 64-input combinational priority network.

ac_int<6, false> fp_clz64(ac_int<64, false> x) {
    ac_int<7, false> n = 64;   // 7 bits needed to hold sentinel value 64
    #pragma hls_unroll yes
    for (int i = 0; i < 64; i++) {
        if (x[63 - i] == 1 && n == 64) n = i;
    }
    return n.slc<6>(0);        // safe: caller guarantees x != 0
}

The ac_int<7, false> sentinel is important: a 6-bit type cannot represent 64, so n == 64 would always be false. The 7-bit type holds 64 correctly; after the loop n is always 0..63 (since x != 0 is a precondition), so slc<6>(0) is lossless.
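
The loop logic can be exercised with plain integers, without the ac_types headers. In this sketch clz64Loop is a hypothetical stand-in for fp_clz64, with std::uint8_t playing the role of ac_int<7, false>; it says nothing about how Catapult maps the loop to hardware.

```cpp
#include <cassert>
#include <cstdint>

// Plain-integer stand-in for fp_clz64 (hypothetical name). std::uint8_t is wide
// enough to hold the sentinel value 64, like ac_int<7, false> in the HW model.
std::uint8_t clz64Loop(std::uint64_t x) {
    assert(x != 0);                        // same precondition as fp_clz64
    std::uint8_t n = 64;                   // sentinel: no set bit found yet
    for (int i = 0; i < 64; ++i) {
        bool bit = (x >> (63 - i)) & 1u;   // scan from bit 63 downwards
        if (bit && n == 64) n = static_cast<std::uint8_t>(i);
    }
    return n;                              // 0..63 for any non-zero x
}
```

Only the first set bit found from the top updates n, because the n == 64 test fails on every later iteration; that is the priority behaviour the unrolled hardware loop relies on.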

Arithmetic width management

Addition: (ac_int<128,false>)hi.mant + lo_m — cast one operand to 128 bits before adding to get a 128-bit result with a visible carry bit.

Multiplication: a.mant * b.mant assigned to ac_int<128,false> — do not cast either operand first; casting a.mant to 128 bits before multiplying would change the result width to 192 bits.

Division numerator shift: (ac_int<128,false>)a.mant << 64 — cast to 128 bits first, then shift. Without the cast, a.mant << 64 operates on a 64-bit value and would produce zero.

Shift amount clamping: The exponent difference is ac_int<32,true>. Before using it as a 6-bit shift amount, it is clamped with an if/else rather than a ternary, because raw_shift.slc<6>(0) returns a signed type in this version of ac_types, which creates an ambiguous ternary with the unsigned literal branch.

Non-synthesizable conversion functions

fp_from_double and fp_to_double are guarded by #ifndef __SYNTHESIS__:

Catapult defines __SYNTHESIS__ during RTL extraction, excluding these functions from synthesis
For g++ C simulation and testbenches, __SYNTHESIS__ is not defined, so both functions are available

Hardware functions

| Function | SW equivalent | Description |
| --- | --- | --- |
| `fp_clz64` | `__builtin_clzll` | Leading-zero count, 64-input priority encoder |
| `fp_normalize` | `normalize` | Left-shift mantissa, adjust exponent |
| `fp_cmp_mag` | `compareMagnitude` | Returns 0/1/2 (equal/a>b/a<b) |
| `fp_add_mag` | `addMagnitudes` | Align and add, detect carry |
| `fp_sub_mag` | `subtractMagnitudes` | Align and subtract, then normalize |
| `fp_neg` | `operator-()` | Flip sign bit |
| `fp_add` | `operator+` | Full add with special-value dispatch |
| `fp_sub` | `operator-` | Implemented as `fp_add(a, fp_neg(b))` |
| `fp_mul` | `operator*` | 64×64→128-bit multiply |
| `fp_div` | `operator/` | 128/64-bit divide with precision scaling |
| `fp_eq` | `operator==` | IEEE equality (NaN ≠ NaN, +0 == -0) |
| `fp_lt` | `operator<` | IEEE less-than (NaN unordered) |

Catapult HLS Synthesis

catapult.tcl synthesizes each arithmetic function as a separate combinational RTL block. The flow:

go compile — parse and elaborate the C++ source
go libraries — select technology cell library
go assembly — schedule operations into states
go architect — allocate hardware resources
go allocate — bind operations to resources
go extract — generate Verilog/VHDL output

Synthesis considerations

fp_clz64: With #pragma hls_unroll yes, Catapult unrolls the 64-iteration loop into a priority encoder tree (~6 levels of 2-to-1 muxes). If the pragma is not honored, an equivalent TCL directive can be added: directive set /fp_clz64/core/main -UNROLL yes.

fp_div: The 128/64-bit integer division generates a large combinational circuit. For timing-critical designs, the division can be made multi-cycle by adding: directive set /fp_div -PIPELINE_INIT_INTERVAL 2.

SCVerify RTL co-simulation: After RTL extraction, Catapult's SCVerify wraps the generated Verilog in a SystemC testbench that replays the C++ testbench against the RTL, providing bit-exact equivalence checking automatically.

Tests

Software tests (`test_floatnum.cpp`) — 90 tests

| Suite | Tests |
| --- | --- |
| `Construction` | Zero, ±0, ±1, 0.5, 0.25, 1.5, 1.75, 2, 3, 6, 1e15, 1e-15, ±inf, NaN; round-trip |
| `Negation` | Sign flip, -0, -inf, -NaN |
| `Addition` | Normal cases, cancellation, identity, both negative, inf, NaN propagation, large shift |
| `Subtraction` | Normal cases, self-subtraction, inf−inf=NaN |
| `Multiplication` | Normal cases, signs, 0×inf=NaN, inf×x, large/small cancel |
| `Division` | Normal cases, x/0=inf, 0/0=NaN, inf/inf=NaN, 0/x, x/inf |
| `Comparisons` | All six operators; ±0=0; NaN unordered; inf ordering; negative ordering |
| `Chain` | Powers of 2, distributive, 1/3×3≈1, alternating sum, √2×√2≈2 |

Hardware tests (`test_floatnum_hw.cpp`) — 90 tests

Mirrors the software test suite. Every test uses fp_from_double to construct inputs and fp_to_double to read results. Selected tests also cross-check the HW result against the SW golden model (FloatNum) to catch any behavioral divergence.

Known Limitations

Rounding: Truncation only (round toward zero). Results may differ from IEEE 754 round-to-nearest by up to 1 ULP.
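Denormals: Flushed to zero. With the mantissa normalized to [2^63, 2^64), the smallest representable magnitude is 2^(−2^31), reached at the minimum exponent; there is no gradual underflow below it.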
No exceptions: Overflow/underflow flags are not generated.
Division cost: 128/64-bit integer division in hardware is large. Consider a multi-cycle or iterative implementation for area-critical designs.
No FMA: Fused multiply-add is not implemented.
