A from-scratch C++ floating-point class with full IEEE 754-style arithmetic, plus a parallel Catapult HLS synthesizable implementation using Algorithmic C (AC) datatypes.
FloatNum.h — Software class declaration
FloatNum.cpp — Software implementation
FloatNum_hw.h — Synthesizable struct and function declarations (AC types)
FloatNum_hw.cpp — Synthesizable implementation
test_floatnum.cpp — GoogleTest suite for the software model (90 tests)
test_floatnum_hw.cpp — GoogleTest suite for the hardware model (90 tests)
catapult.tcl — Catapult HLS synthesis script
CMakeLists.txt — Build system (GoogleTest via FetchContent)
ac_types/ — Siemens AC datatypes library (header-only)
Prerequisites: CMake ≥ 3.14, a C++17 compiler, internet access for the first build (GoogleTest is fetched automatically).
cmake -B build -DCMAKE_BUILD_TYPE=Debug
cmake --build build
ctest --test-dir build --output-on-failure

All 180 tests should pass: 90 for the software model, 90 for the hardware model.
Both the software and hardware models share the same conceptual representation.
For a normal (non-special) number:
value = (-1)^sign × mantissa × 2^(exponent − 63)
| Field | SW type | HW type | Width | Role |
|---|---|---|---|---|
| `sign` | `bool` | `bool` | 1 bit | false = positive, true = negative |
| `mantissa` / `mant` | `uint64_t` | `ac_int<64, false>` | 64 bits | Significand; bit 63 is always 1 for normal numbers |
| `exponent` / `exp` | `int32_t` | `ac_int<32, true>` | 32 bits | Unbiased binary exponent |
| `category` / `cat` | `enum class Category` | `ac_int<2, false>` | 2 bits | Numeric class (see below) |
| Name | SW enum | HW constant | Value | Meaning |
|---|---|---|---|---|
| Normal | `Category::Normal` | `CAT_NORMAL` | 0 | Finite non-zero number |
| Zero | `Category::Zero` | `CAT_ZERO` | 1 | ±0 |
| Infinity | `Category::Infinity` | `CAT_INF` | 2 | ±∞ |
| NaN | `Category::NaN` | `CAT_NAN` | 3 | Not-a-Number |
Using an explicit category tag — rather than IEEE 754's sentinel bit patterns — keeps the arithmetic clean: every operation can dispatch on category first with a simple switch/if chain, and the normal-number path contains no special-value checks.
For Category::Normal numbers, bit 63 of the mantissa is always set. This is the explicit leading-1 convention (as opposed to IEEE 754's implicit leading-1). It simplifies addition and subtraction alignment because the raw uint64_t/ac_int<64,false> value can be shifted directly without masking or re-inserting the hidden bit.
The value interpretation is:
value = (-1)^sign × (mantissa as a 64-bit integer) × 2^(exponent − 63)
Because bit 63 is set, mantissa is in the range [2^63, 2^64), making the effective significand in [1.0, 2.0) — matching IEEE 754's normalized form.
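As a quick check of the formula, here is a minimal sketch that evaluates the fields of a Normal number with std::ldexp. The helper name decodeNormal is hypothetical and is not part of FloatNum.h. The result is only as precise as double allows.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Hypothetical helper: evaluates (-1)^sign * mantissa * 2^(exponent - 63)
// for a Normal number. The uint64_t -> double conversion rounds when more
// than 53 mantissa bits are significant.
double decodeNormal(bool sign, std::uint64_t mantissa, std::int32_t exponent) {
    assert((mantissa >> 63) == 1 && "Normal numbers keep bit 63 set");
    const double magnitude =
        std::ldexp(static_cast<double>(mantissa), exponent - 63);
    return sign ? -magnitude : magnitude;
}
```

For example, mantissa 0x8000000000000000 with exponent 0 decodes to 1.0, and mantissa 0xC000000000000000 (significand 1.5) with exponent 1 and the sign bit set decodes to -3.0.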
Rounding: truncation (round toward zero). During alignment right-shifts and multiplication/division truncation, low-order bits are simply discarded. This is documented and expected; it avoids the complexity of guard, round, and sticky bits.
Denormals: flushed to zero. If normalization would produce a mantissa of zero (complete cancellation), the result is returned as Category::Zero. Subnormal numbers are not represented.
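Exponent range: ±2 billion. The 32-bit signed exponent spans roughly ±2.1 × 10^9, far beyond the IEEE 754 double range of [−1022, +1023], so a single operation on values converted from double cannot overflow or underflow the exponent. Only chained operations that keep compounding the exponent (for example, repeated squaring) could approach the limit.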
// Constructors
FloatNum(); // +0
explicit FloatNum(double d); // convert from double (uses std::frexp)
// Named factories for special values
static FloatNum zero(bool negative = false);
static FloatNum infinity(bool negative = false);
static FloatNum nan();
// Conversion back to double
double toDouble() const;
// Classification
bool isZero() const;
bool isInfinite() const;
bool isNaN() const;
bool isNormal() const;
bool isNegative() const;
// Arithmetic
FloatNum operator+(const FloatNum& rhs) const;
FloatNum operator-(const FloatNum& rhs) const;
FloatNum operator*(const FloatNum& rhs) const;
FloatNum operator/(const FloatNum& rhs) const;
FloatNum operator-() const; // unary negation
// Comparisons (IEEE 754 semantics: NaN unordered, ±0 equal)
bool operator==(const FloatNum& rhs) const;
bool operator!=(const FloatNum& rhs) const;
bool operator< (const FloatNum& rhs) const;
bool operator<=(const FloatNum& rhs) const;
bool operator> (const FloatNum& rhs) const;
bool operator>=(const FloatNum& rhs) const;
// Debug
std::string toString() const;

std::frexp(d, &exp) decomposes a double into a significand f ∈ [0.5, 1.0) and an integer exponent exp such that d = f × 2^exp. Scaling f by 2^64 places it in [2^63, 2^64) and it can be stored as a uint64_t without rounding error beyond what double's 53-bit mantissa already carries.
mantissa_ = (uint64_t)(f × 2^64)
exponent_ = exp − 1
This is chosen over std::log2 to avoid floating-point approximation errors in the exponent computation.
The round-trip property FloatNum(d).toDouble() == d holds exactly for all finite, non-denormal double values because the 64-bit mantissa captures all 53 bits of the double significand.
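The conversion and its round trip can be sketched as follows. This is an illustration under stated assumptions, not the code in FloatNum.cpp: the Parts struct and the free functions fromDouble and toDouble are hypothetical stand-ins, and only finite non-zero inputs are handled.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

struct Parts {  // hypothetical stand-in for FloatNum's private fields
    bool sign;
    std::uint64_t mantissa;
    std::int32_t exponent;
};

// Assumes d is finite and non-zero; zero, infinity and NaN would be routed
// to their own categories before reaching this point.
Parts fromDouble(double d) {
    assert(std::isfinite(d) && d != 0.0);
    int e = 0;
    const double f = std::frexp(std::fabs(d), &e);  // f in [0.5, 1.0)
    // f * 2^64 is an exact integer in [2^63, 2^64), so the cast is lossless.
    const auto mantissa = static_cast<std::uint64_t>(std::ldexp(f, 64));
    return {std::signbit(d), mantissa, static_cast<std::int32_t>(e - 1)};
}

// Inverse mapping: mantissa * 2^(exponent - 63) with the sign applied.
double toDouble(const Parts& p) {
    const double magnitude =
        std::ldexp(static_cast<double>(p.mantissa), p.exponent - 63);
    return p.sign ? -magnitude : magnitude;
}
```

For d = 0.75, std::frexp yields f = 0.75 and exp = 0, so the mantissa is 0xC000000000000000 and the stored exponent is -1, which decodes back to 1.5 × 2^-1 = 0.75.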
After every arithmetic operation, the result mantissa may have leading zeros. normalize finds the highest set bit and left-shifts the mantissa to place it at bit 63, decrementing the exponent by the same amount.
if mantissa == 0:
    return zero(sign)
lz = count_leading_zeros(mantissa)   // 0..63
mantissa <<= lz
exponent -= lz
In the software model, __builtin_clzll provides the leading-zero count; on targets with a count-leading-zeros instruction it typically compiles to a single instruction. In the hardware model it is replaced by a 64-iteration priority encoder loop (see Hardware section).
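A minimal C++ sketch of this step, assuming GCC or Clang for __builtin_clzll. The Normalized result struct is hypothetical, and the sign handling from the pseudocode is left out for brevity.

```cpp
#include <cassert>
#include <cstdint>

struct Normalized {  // hypothetical result type
    bool isZero;
    std::uint64_t mantissa;
    std::int32_t exponent;
};

Normalized normalize(std::uint64_t mantissa, std::int32_t exponent) {
    if (mantissa == 0) return {true, 0, 0};      // complete cancellation
    const int lz = __builtin_clzll(mantissa);    // GCC/Clang builtin
    assert(lz >= 0 && lz <= 63);
    // Move the highest set bit to bit 63; the value is unchanged because
    // the exponent drops by the same amount.
    return {false, mantissa << lz, static_cast<std::int32_t>(exponent - lz)};
}
```

For instance, a mantissa of 1 with exponent 63 (the value 1.0 in unnormalized form) becomes mantissa 0x8000000000000000 with exponent 0.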
operator+(a, b):
    dispatch on category (NaN, Inf, Zero)
    if same sign:
        result = addMagnitudes(a, b)
        result.sign = a.sign
    else:
        cmp = compareMagnitude(a, b)
        if cmp == 0: return +0   // exact cancellation
        if |a| > |b|: result = subtractMagnitudes(a, b); result.sign = a.sign
        else:         result = subtractMagnitudes(b, a); result.sign = b.sign
addMagnitudes: Aligns the smaller operand by right-shifting its mantissa by the exponent difference, then sums using 128-bit arithmetic to detect carry out of bit 63. If a carry occurs, the sum is right-shifted once and the exponent is incremented.
shift = hi.exponent − lo.exponent
lo_m = lo.mantissa >> shift          // truncate; zero if shift ≥ 64
sum128 = (128-bit)hi.mantissa + lo_m // widen before adding so the carry is kept
if sum128[64]:                       // carry
    sum128 >>= 1
    exponent++
result.mantissa = (uint64_t)sum128   // bit 63 is guaranteed set
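The alignment and carry handling can be sketched in C++ as below, assuming a compiler that provides unsigned __int128 (GCC or Clang). The Mag struct is a hypothetical pair of mantissa and exponent, and the caller is assumed to pass the operand with the larger exponent first.

```cpp
#include <cassert>
#include <cstdint>

struct Mag {  // hypothetical magnitude: mantissa * 2^(exponent - 63)
    std::uint64_t mantissa;
    std::int32_t exponent;
};

// Precondition: hi.exponent >= lo.exponent, both mantissas normalized.
Mag addMagnitudes(Mag hi, Mag lo) {
    assert(hi.exponent >= lo.exponent);
    const std::int64_t shift =
        static_cast<std::int64_t>(hi.exponent) - lo.exponent;
    // Shifting a 64-bit value by 64 or more is undefined in C++, so the
    // "everything shifted out" case is handled explicitly.
    const std::uint64_t loM = shift >= 64 ? 0 : lo.mantissa >> shift;
    unsigned __int128 sum = static_cast<unsigned __int128>(hi.mantissa) + loM;
    std::int32_t exponent = hi.exponent;
    if ((sum >> 64) != 0) {  // carry out of bit 63
        sum >>= 1;           // truncates the lowest bit
        ++exponent;
    }
    return {static_cast<std::uint64_t>(sum), exponent};
}
```

Adding 1.5 (mantissa 0xC000000000000000, exponent 0) and 0.5 (mantissa 0x8000000000000000, exponent -1) produces a carry, giving mantissa 0x8000000000000000 with exponent 1, which is 2.0.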
subtractMagnitudes: Aligns the smaller operand, subtracts, then normalizes (since subtraction can produce many leading zeros).
shift = larger.exponent − smaller.exponent
s_m = smaller.mantissa >> shift
diff = larger.mantissa − s_m // no borrow since larger ≥ smaller
return normalize(false, diff, larger.exponent)
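A corresponding sketch for subtraction, again with a hypothetical Mag struct and with the normalization step inlined through the GCC/Clang builtin __builtin_clzll. It assumes the caller has already established that the first operand is strictly larger in magnitude, so the difference is never zero.

```cpp
#include <cassert>
#include <cstdint>

struct Mag {  // hypothetical magnitude: mantissa * 2^(exponent - 63)
    std::uint64_t mantissa;
    std::int32_t exponent;
};

// Precondition: |larger| > |smaller|, both mantissas normalized.
Mag subtractMagnitudes(Mag larger, Mag smaller) {
    assert(larger.exponent >= smaller.exponent);
    const std::int64_t shift =
        static_cast<std::int64_t>(larger.exponent) - smaller.exponent;
    const std::uint64_t sM = shift >= 64 ? 0 : smaller.mantissa >> shift;
    const std::uint64_t diff = larger.mantissa - sM;  // no borrow
    assert(diff != 0);
    // Renormalize: subtraction can clear any number of leading bits.
    const int lz = __builtin_clzll(diff);
    return {diff << lz, static_cast<std::int32_t>(larger.exponent - lz)};
}
```

Subtracting 1.0 from 1.5 leaves 0x4000000000000000, which has one leading zero, so the result is mantissa 0x8000000000000000 with exponent -1, which is 0.5.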
Two 64-bit normalized mantissas both have bit 63 set, so their product is in [2^126, 2^128). The top 64 bits of the 128-bit product form the result mantissa, and the exponents add.
product128 = (128-bit)a.mantissa × b.mantissa   // widen before multiplying
result_m = product128[127:64]                   // top 64 bits
result_e = a.exponent + b.exponent + 1
if result_m[63] == 0:                           // product was in [2^126, 2^127)
    result_m <<= 1
    result_e -= 1
The +1 in result_e accounts for the fixed-point scaling: each mantissa represents a value m × 2^(e−63), so the product is m1×m2 × 2^(e1+e2−126) = (m1×m2 >> 64) × 2^(e1+e2−62) = result_m × 2^(result_e−63) where result_e = e1+e2+1.
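The same steps in C++, as a sketch that assumes unsigned __int128 support (GCC or Clang). The Mag struct and the function name multiplyMagnitudes are hypothetical, and sign and special-value handling are omitted.

```cpp
#include <cassert>
#include <cstdint>

struct Mag {  // hypothetical magnitude: mantissa * 2^(exponent - 63)
    std::uint64_t mantissa;
    std::int32_t exponent;
};

Mag multiplyMagnitudes(Mag a, Mag b) {
    assert((a.mantissa >> 63) == 1 && (b.mantissa >> 63) == 1);
    // Widen one operand first so the full 128-bit product is kept.
    const unsigned __int128 product =
        static_cast<unsigned __int128>(a.mantissa) * b.mantissa;
    std::uint64_t m = static_cast<std::uint64_t>(product >> 64);
    std::int32_t e = a.exponent + b.exponent + 1;
    if ((m >> 63) == 0) {  // product was in [2^126, 2^127)
        m <<= 1;
        --e;
    }
    return {m, e};
}
```

Squaring 1.5 (mantissa 0xC000000000000000, exponent 0) gives a top half of 0x9000000000000000 with exponent 1, which is 1.125 × 2 = 2.25. Multiplying 1.0 by 1.0 takes the other branch and returns mantissa 0x8000000000000000 with exponent 0.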
The numerator mantissa is scaled up by 2^64 (128-bit shift) before dividing by the denominator mantissa, preserving 64 bits of quotient precision.
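numerator128 = (128-bit)(a.mantissa) << 64
q128 = numerator128 / b.mantissa
result_e = a.exponent − b.exponent − 1
if q128[64]:                 // quotient ≥ 2^64 (when a.mantissa ≥ b.mantissa)
    q128 >>= 1
    result_e += 1
return normalize(result_sign, (uint64_t)q128, result_e)
The overflow check catches every case where a.mantissa ≥ b.mantissa: the mantissa ratio then lies in [1, 2), so the quotient lies in [2^64, 2^65) and needs 65 bits (it is exactly 2^64 when the mantissas are equal). A single right-shift and exponent bump resolves it. When a.mantissa < b.mantissa the ratio lies in (0.5, 1), the quotient already has bit 63 set, and no adjustment is needed.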
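A sketch of the division path in C++, assuming unsigned __int128 support (GCC or Clang). The Mag struct and the name divideMagnitudes are hypothetical. Because both branches leave bit 63 set, the sketch asserts that instead of calling a separate normalize.

```cpp
#include <cassert>
#include <cstdint>

struct Mag {  // hypothetical magnitude: mantissa * 2^(exponent - 63)
    std::uint64_t mantissa;
    std::int32_t exponent;
};

Mag divideMagnitudes(Mag a, Mag b) {
    assert((a.mantissa >> 63) == 1 && (b.mantissa >> 63) == 1);
    // Scale the numerator by 2^64 to keep 64 bits of quotient precision.
    unsigned __int128 q =
        (static_cast<unsigned __int128>(a.mantissa) << 64) / b.mantissa;
    std::int32_t e = a.exponent - b.exponent - 1;
    if ((q >> 64) != 0) {  // a.mantissa >= b.mantissa: quotient needs 65 bits
        q >>= 1;
        ++e;
    }
    const std::uint64_t m = static_cast<std::uint64_t>(q);
    assert((m >> 63) == 1);  // already normalized in both branches
    return {m, e};
}
```

Dividing 3.0 by 2.0 (mantissas 0xC000000000000000 and 0x8000000000000000, both with exponent 1) sets bit 64 of the quotient even though the operands differ, and yields mantissa 0xC000000000000000 with exponent 0, which is 1.5. Dividing 1.0 by 1.5 takes the other branch and truncates to mantissa 0xAAAAAAAAAAAAAAAA with exponent -1.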
Magnitude comparison compares exponent first (higher exponent = larger value for equal-sign normals), then mantissa if exponents are equal. Because both mantissas are normalized with bit 63 set, this is a direct integer comparison.
Full comparison (operator<) logic:
- NaN: always return `false`
- ±0 equal to each other
- Different signs: negative is smaller
- Same sign, both infinite: equal
- Same sign, one infinite: `+inf > anything`, `-inf < anything`
- Same sign, both normal: compare magnitudes; for negatives, larger magnitude = smaller value
operator<= and operator>= include explicit NaN guards because their naive implementations (!(rhs < *this) and !(*this < rhs)) would return true for NaN operands — a bug that was caught and fixed during testing.
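The ordering rules and the NaN guard can be sketched as free functions over a small struct. The names Num, lessThan, and lessEqual are hypothetical; the real class implements these as member operators, and its internal structure may differ.

```cpp
#include <cassert>
#include <cstdint>

enum class Category { Normal, Zero, Infinity, NaN };

struct Num {  // hypothetical stand-in for FloatNum's fields
    bool sign;
    std::uint64_t mantissa;
    std::int32_t exponent;
    Category category;
};

// Returns -1, 0, +1 for |a| <, ==, > |b|; both operands must be Normal.
int compareMagnitude(const Num& a, const Num& b) {
    assert(a.category == Category::Normal && b.category == Category::Normal);
    if (a.exponent != b.exponent) return a.exponent < b.exponent ? -1 : 1;
    if (a.mantissa != b.mantissa) return a.mantissa < b.mantissa ? -1 : 1;
    return 0;
}

bool lessThan(const Num& a, const Num& b) {
    if (a.category == Category::NaN || b.category == Category::NaN) return false;
    const bool aZero = a.category == Category::Zero;
    const bool bZero = b.category == Category::Zero;
    if (aZero && bZero) return false;     // +0 and -0 compare equal
    if (aZero) return !b.sign;            // 0 < b only if b is positive
    if (bZero) return a.sign;             // a < 0 only if a is negative
    if (a.sign != b.sign) return a.sign;  // the negative operand is smaller
    const bool aInf = a.category == Category::Infinity;
    const bool bInf = b.category == Category::Infinity;
    if (aInf && bInf) return false;       // same-sign infinities are equal
    if (aInf) return a.sign;              // -inf < x; +inf is never < x
    if (bInf) return !b.sign;             // x < +inf; x is never < -inf
    const int cmp = compareMagnitude(a, b);
    return a.sign ? cmp > 0 : cmp < 0;    // negatives: larger magnitude is smaller
}

bool lessEqual(const Num& a, const Num& b) {
    // Without this guard, !lessThan(b, a) would be true for NaN operands.
    if (a.category == Category::NaN || b.category == Category::NaN) return false;
    return !lessThan(b, a);
}
```

With the guard in place, lessEqual(nan, x) and lessEqual(x, nan) are both false, matching the unordered behavior described above.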
The hardware model is a direct translation of the software model into Catapult HLS-synthesizable C++. All prohibited constructs are replaced; the original FloatNum files are unchanged.
Catapult HLS cannot synthesize several constructs used in the software model:
| Prohibited | Reason | Replacement |
|---|---|---|
| `__uint128_t` | GCC/Clang extension | `ac_int<128, false>` |
| `__builtin_clzll()` | Compiler built-in | Priority-encoder loop |
| `double` type | Not synthesizable to RTL | `#ifndef __SYNTHESIS__` guard |
| `std::frexp`, `std::ldexp`, etc. | Software library calls | Same guard |
| `std::string`, `std::ostringstream` | Not synthesizable | `toString()` omitted |
| `std::swap` on pointers | Pointer semantics | Explicit if/else mux |
| `<cassert>`, `<cmath>`, etc. | Standard library | Removed or guarded |
| `enum class Category` | Synthesizes poorly | `ac_int<2, false>` constants |
| `int` from `compareMagnitude` | Signed return | 2-bit code (0/1/2 encoding) |
ac_int<W, S> from the Algorithmic C datatypes library (ac_types/include/ac_int.h) is a templated arbitrary-precision integer class:
- W: bit width; S: signedness (`true` = two's-complement signed)
- Supports `+`, `-`, `*`, `/`, `>>`, `<<`, `[]` (bit access), all comparisons
- Multiplication: `ac_int<64,false>` × `ac_int<64,false>` → `ac_int<128,false>` automatically
- Slicing: `x.slc<W>(lsb)` extracts a W-bit field starting at bit `lsb`
- Conversion (for testbench only): `.to_uint64()`, `.to_int()`
- Compiles and runs with standard `g++` without any Catapult license
struct FP {
    bool sign;               // false = positive
    ac_int<64, false> mant;  // bit 63 set for Normal
    ac_int<32, true> exp;    // unbiased
    ac_int<2, false> cat;    // CAT_NORMAL/ZERO/INF/NAN
};

Factory inlines create special values: fp_zero(), fp_inf(), fp_nan().
Replaces __builtin_clzll. The loop is annotated with #pragma hls_unroll yes so Catapult fully unrolls it into a 64-input combinational priority network.
ac_int<6, false> fp_clz64(ac_int<64, false> x) {
    ac_int<7, false> n = 64;  // 7 bits needed to hold sentinel value 64
    #pragma hls_unroll yes
    for (int i = 0; i < 64; i++) {
        if (x[63 - i] == 1 && n == 64) n = i;
    }
    return n.slc<6>(0);  // safe: caller guarantees x != 0
}

The ac_int<7, false> sentinel is important: a 6-bit type cannot represent 64, so n == 64 would always be false. The 7-bit type holds 64 correctly; after the loop n is always 0..63 (since x != 0 is a precondition), so slc<6>(0) is lossless.
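To check the loop's logic without the AC headers, the same first-match scan can be modeled with plain integers and compared against the builtin it replaces. This is a software-only sketch: clz64Loop is a hypothetical name, an ordinary unsigned stands in for the 7-bit sentinel, and the comparison assumes GCC or Clang.

```cpp
#include <cassert>
#include <cstdint>

// Plain-integer model of the priority-encoder loop. Scans from bit 63 down
// and records only the first set bit, mirroring the "n == 64" sentinel test.
unsigned clz64Loop(std::uint64_t x) {
    assert(x != 0);  // same precondition as the hardware function
    unsigned n = 64; // sentinel: no set bit found yet
    for (unsigned i = 0; i < 64; ++i) {
        if (((x >> (63 - i)) & 1u) != 0 && n == 64) n = i;
    }
    return n;
}
```

For any non-zero input, clz64Loop(x) should agree with __builtin_clzll(x); for example, both return 63 for x = 1 and 0 for x = 0x8000000000000000.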
Addition: (ac_int<128,false>)hi.mant + lo_m — cast one operand to 128 bits before adding to get a 128-bit result with a visible carry bit.
Multiplication: a.mant * b.mant assigned to ac_int<128,false> — do not cast either operand first; casting a.mant to 128 bits before multiplying would change the result width to 192 bits.
Division numerator shift: (ac_int<128,false>)a.mant << 64 — cast to 128 bits first, then shift. Without the cast, a.mant << 64 operates on a 64-bit value and would produce zero.
Shift amount clamping: The exponent difference is ac_int<32,true>. Before using it as a 6-bit shift amount, it is clamped with an if/else rather than a ternary, because raw_shift.slc<6>(0) returns a signed type in this version of ac_types, which creates an ambiguous ternary with the unsigned literal branch.
fp_from_double and fp_to_double are guarded by #ifndef __SYNTHESIS__:
- Catapult defines `__SYNTHESIS__` during RTL extraction, excluding these functions from synthesis
- For g++ C simulation and testbenches the guard is not defined, so both functions are available
| Function | SW equivalent | Description |
|---|---|---|
| `fp_clz64` | `__builtin_clzll` | Leading-zero count, 64-input priority encoder |
| `fp_normalize` | `normalize` | Left-shift mantissa, adjust exponent |
| `fp_cmp_mag` | `compareMagnitude` | Returns 0/1/2 (equal/a>b/a<b) |
| `fp_add_mag` | `addMagnitudes` | Align and add, detect carry |
| `fp_sub_mag` | `subtractMagnitudes` | Align and subtract, then normalize |
| `fp_neg` | `operator-()` | Flip sign bit |
| `fp_add` | `operator+` | Full add with special-value dispatch |
| `fp_sub` | `operator-` | Implemented as fp_add(a, fp_neg(b)) |
| `fp_mul` | `operator*` | 64×64→128-bit multiply |
| `fp_div` | `operator/` | 128/64-bit divide with precision scaling |
| `fp_eq` | `operator==` | IEEE equality (NaN ≠ NaN, +0 == -0) |
| `fp_lt` | `operator<` | IEEE less-than (NaN unordered) |
catapult.tcl synthesizes each arithmetic function as a separate combinational RTL block. The flow:
1. `go compile`: parse and elaborate the C++ source
2. `go libraries`: select technology cell library
3. `go assembly`: schedule operations into states
4. `go architect`: allocate hardware resources
5. `go allocate`: bind operations to resources
6. `go extract`: generate Verilog/VHDL output
fp_clz64: With #pragma hls_unroll yes, Catapult unrolls the 64-iteration loop into a priority encoder tree (~6 levels of 2-to-1 muxes). If the pragma is not honored, an equivalent TCL directive can be added: directive set /fp_clz64/core/main -UNROLL yes.
fp_div: The 128/64-bit integer division generates a large combinational circuit. For timing-critical designs, the division can be made multi-cycle by adding: directive set /fp_div -PIPELINE_INIT_INTERVAL 2.
SCVerify RTL co-simulation: After RTL extraction, Catapult's SCVerify wraps the generated Verilog in a SystemC testbench that replays the C++ testbench against the RTL, providing bit-exact equivalence checking automatically.
| Suite | Tests |
|---|---|
| Construction | Zero, ±0, ±1, 0.5, 0.25, 1.5, 1.75, 2, 3, 6, 1e15, 1e-15, ±inf, NaN; round-trip |
| Negation | Sign flip, -0, -inf, -NaN |
| Addition | Normal cases, cancellation, identity, both negative, inf, NaN propagation, large shift |
| Subtraction | Normal cases, self-subtraction, inf−inf=NaN |
| Multiplication | Normal cases, signs, 0×inf=NaN, inf×x, large/small cancel |
| Division | Normal cases, x/0=inf, 0/0=NaN, inf/inf=NaN, 0/x, x/inf |
| Comparisons | All six operators; ±0=0; NaN unordered; inf ordering; negative ordering |
| Chain | Powers of 2, distributive, 1/3×3≈1, alternating sum, √2×√2≈2 |
Mirrors the software test suite. Every test uses fp_from_double to construct inputs and fp_to_double to read results. Selected tests also cross-check the HW result against the SW golden model (FloatNum) to catch any behavioral divergence.
- Rounding: Truncation only (round toward zero). Results may differ from IEEE 754 round-to-nearest by up to 1 ULP.
- Denormals: Flushed to zero. Magnitudes smaller than `2^(−2^31)` (significand 1.0 at the minimum exponent, per the value formula) cannot be represented.
- No exceptions: Overflow/underflow flags are not generated.
- Division cost: 128/64-bit integer division in hardware is large. Consider a multi-cycle or iterative implementation for area-critical designs.
- No FMA: Fused multiply-add is not implemented.