sergev/floatnum

FloatNum — Software and Hardware Floating-Point Implementation

A from-scratch C++ floating-point class with full IEEE 754-style arithmetic, plus a parallel Catapult HLS synthesizable implementation using Algorithmic C (AC) datatypes.


Project Structure

FloatNum.h            — Software class declaration
FloatNum.cpp          — Software implementation
FloatNum_hw.h         — Synthesizable struct and function declarations (AC types)
FloatNum_hw.cpp       — Synthesizable implementation
test_floatnum.cpp     — GoogleTest suite for the software model (90 tests)
test_floatnum_hw.cpp  — GoogleTest suite for the hardware model (90 tests)
catapult.tcl          — Catapult HLS synthesis script
CMakeLists.txt        — Build system (GoogleTest via FetchContent)
ac_types/             — Siemens AC datatypes library (header-only)

Build and Test

Prerequisites: CMake ≥ 3.14, a C++17 compiler, internet access for the first build (GoogleTest is fetched automatically).

cmake -B build -DCMAKE_BUILD_TYPE=Debug
cmake --build build
ctest --test-dir build --output-on-failure

All 180 tests should pass: 90 for the software model, 90 for the hardware model.


Internal Representation

Both the software and hardware models share the same conceptual representation.

Value formula

For a normal (non-special) number:

value = (-1)^sign  ×  mantissa  ×  2^(exponent − 63)

Fields

| Field | SW type | HW type | Width | Role |
|---|---|---|---|---|
| sign | bool | bool | 1 bit | false = positive, true = negative |
| mantissa / mant | uint64_t | ac_int<64, false> | 64 bits | Significand; bit 63 is always 1 for normal numbers |
| exponent / exp | int32_t | ac_int<32, true> | 32 bits | Unbiased binary exponent |
| category / cat | enum class Category | ac_int<2, false> | 2 bits | Numeric class (see below) |

Category encoding

| Name | SW enum | HW constant | Value | Meaning |
|---|---|---|---|---|
| Normal | Category::Normal | CAT_NORMAL | 0 | Finite non-zero number |
| Zero | Category::Zero | CAT_ZERO | 1 | ±0 |
| Infinity | Category::Infinity | CAT_INF | 2 | ±∞ |
| NaN | Category::NaN | CAT_NAN | 3 | Not-a-Number |

Using an explicit category tag — rather than IEEE 754's sentinel bit patterns — keeps the arithmetic clean: every operation can dispatch on category first with a simple switch/if chain, and the normal-number path contains no special-value checks.
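
As a rough illustration of category-first dispatch, the sketch below decides the category of a product before any mantissa work is done. The helper name productCategory is hypothetical and is not part of the FloatNum API; the enum values follow the encoding table above, and the 0 × ∞ = NaN rule matches the case listed in the test suite.

```cpp
enum class Category { Normal = 0, Zero = 1, Infinity = 2, NaN = 3 };

// Hypothetical helper: category of a * b, decided from the tags alone.
constexpr Category productCategory(Category a, Category b) {
    if (a == Category::NaN || b == Category::NaN) return Category::NaN;
    const bool anyInf  = a == Category::Infinity || b == Category::Infinity;
    const bool anyZero = a == Category::Zero || b == Category::Zero;
    if (anyInf && anyZero) return Category::NaN;   // 0 * inf is invalid
    if (anyInf)  return Category::Infinity;
    if (anyZero) return Category::Zero;
    return Category::Normal;   // only this case reaches the mantissa path
}
```

Only the final return needs the multiplier; every other outcome is resolved from the two tags.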

Normalization invariant

For Category::Normal numbers, bit 63 of the mantissa is always set. This is the explicit leading-1 convention (as opposed to IEEE 754's implicit leading-1). It simplifies addition and subtraction alignment because the raw uint64_t/ac_int<64,false> value can be shifted directly without masking or re-inserting the hidden bit.

The value interpretation is:

value = (-1)^sign × (mantissa as a 64-bit integer) × 2^(exponent − 63)

Because bit 63 is set, mantissa is in the range [2^63, 2^64), making the effective significand in [1.0, 2.0) — matching IEEE 754's normalized form.

Design choices

Rounding: truncation (round toward zero). During alignment right-shifts and multiplication/division truncation, low-order bits are simply discarded. This is documented and expected; it avoids the complexity of guard, round, and sticky bits.

Denormals: flushed to zero. If normalization would produce a mantissa of zero (complete cancellation), the result is returned as Category::Zero. Subnormal numbers are not represented.

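Exponent range: ±2 billion. The 32-bit signed exponent far exceeds the IEEE 754 double range of [−1022, +1023] for normal numbers, so a single operation on inputs converted from double cannot overflow or underflow the exponent. Long chains can still exceed it: repeatedly squaring a value near 2^1023 passes 2^(2^31) after roughly 21 steps.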


Software Model — FloatNum

Public API

// Constructors
FloatNum();                     // +0
explicit FloatNum(double d);    // convert from double (uses std::frexp)

// Named factories for special values
static FloatNum zero(bool negative = false);
static FloatNum infinity(bool negative = false);
static FloatNum nan();

// Conversion back to double
double toDouble() const;

// Classification
bool isZero()     const;
bool isInfinite() const;
bool isNaN()      const;
bool isNormal()   const;
bool isNegative() const;

// Arithmetic
FloatNum operator+(const FloatNum& rhs) const;
FloatNum operator-(const FloatNum& rhs) const;
FloatNum operator*(const FloatNum& rhs) const;
FloatNum operator/(const FloatNum& rhs) const;
FloatNum operator-() const;  // unary negation

// Comparisons (IEEE 754 semantics: NaN unordered, ±0 equal)
bool operator==(const FloatNum& rhs) const;
bool operator!=(const FloatNum& rhs) const;
bool operator< (const FloatNum& rhs) const;
bool operator<=(const FloatNum& rhs) const;
bool operator> (const FloatNum& rhs) const;
bool operator>=(const FloatNum& rhs) const;

// Debug
std::string toString() const;

Construction from double

std::frexp(d, &exp) decomposes a double into a significand f ∈ [0.5, 1.0) and an integer exponent exp such that d = f × 2^exp. Scaling f by 2^64 places it in [2^63, 2^64) and it can be stored as a uint64_t without rounding error beyond what double's 53-bit mantissa already carries.

mantissa_ = (uint64_t)(f × 2^64)
exponent_ = exp − 1

This is chosen over std::log2 to avoid floating-point approximation errors in the exponent computation.

The round-trip property FloatNum(d).toDouble() == d holds exactly for all finite, non-denormal double values because the 64-bit mantissa captures all 53 bits of the double significand.
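
A minimal sketch of the conversion and its round trip, assuming a finite, non-zero, non-denormal input. The names Parts, fromDoubleSketch, and toDoubleSketch are illustrative and are not the class's actual members.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

struct Parts { bool sign; uint64_t mantissa; int32_t exponent; };

// Decompose a double into sign, 64-bit mantissa (bit 63 set), and exponent.
Parts fromDoubleSketch(double d) {
    assert(std::isfinite(d) && d != 0.0);         // special values are handled elsewhere
    int exp = 0;
    const double f = std::frexp(std::fabs(d), &exp);   // f in [0.5, 1.0)
    Parts p;
    p.sign = std::signbit(d);
    // f * 2^64 lies in [2^63, 2^64) and has at most 53 significant bits,
    // so the conversion to uint64_t is exact.
    p.mantissa = static_cast<uint64_t>(std::ldexp(f, 64));
    p.exponent = exp - 1;
    return p;
}

// Rebuild the double: mantissa * 2^(exponent - 63), with the sign applied.
double toDoubleSketch(const Parts& p) {
    const double mag = std::ldexp(static_cast<double>(p.mantissa), p.exponent - 63);
    return p.sign ? -mag : mag;
}
```

For example, fromDoubleSketch(-6.0) yields sign = true, mantissa = 3 × 2^62, exponent = 2, and converting back returns -6.0 exactly.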


Algorithms

Normalize

After every arithmetic operation, the result mantissa may have leading zeros. normalize finds the highest set bit and left-shifts the mantissa to place it at bit 63, decrementing the exponent by the same amount.

if mantissa == 0:
    return zero(sign)
lz = count_leading_zeros(mantissa)   // 0..63
mantissa <<= lz
exponent  -= lz

In the software model, __builtin_clzll provides the leading-zero count in one instruction. In the hardware model it is replaced by a 64-iteration priority encoder loop (see Hardware section).
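
The pseudocode above could look like this in C++. This is a sketch, not the repository's code: NormResult and normalizeSketch are illustrative names, the sign is left to the caller, and __builtin_clzll is a GCC/Clang built-in.

```cpp
#include <cassert>
#include <cstdint>

struct NormResult { bool isZero; uint64_t mantissa; int32_t exponent; };

// Shift the highest set bit of the mantissa up to bit 63.
NormResult normalizeSketch(uint64_t mantissa, int32_t exponent) {
    if (mantissa == 0) return {true, 0, 0};       // complete cancellation
    const int lz = __builtin_clzll(mantissa);     // 0..63, undefined for 0
    mantissa <<= lz;
    exponent -= lz;
    assert(mantissa >> 63);                       // normalization invariant
    return {false, mantissa, exponent};
}
```

The zero check must come first because __builtin_clzll has an undefined result for a zero argument.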

Addition

operator+(a, b):
    dispatch on category (NaN, Inf, Zero)
    if same sign:
        result = addMagnitudes(a, b)
        result.sign = a.sign
    else:
        cmp = compareMagnitude(a, b)
        if cmp == 0: return +0          // exact cancellation
        if |a| > |b|: result = subtractMagnitudes(a, b); result.sign = a.sign
        else:         result = subtractMagnitudes(b, a); result.sign = b.sign

addMagnitudes: Aligns the smaller operand by right-shifting its mantissa by the exponent difference, then sums using 128-bit arithmetic to detect carry out of bit 63. If a carry occurs, the sum is right-shifted once and the exponent is incremented.

shift = hi.exponent − lo.exponent
lo_m = lo.mantissa >> shift              // truncate; zero if shift ≥ 64
sum128 = (128-bit)hi.mantissa + lo_m     // widen first so the carry lands in bit 64
if sum128[64]:                           // carry
    sum128 >>= 1
    exponent++
result.mantissa = (uint64_t)sum128       // bit 63 is guaranteed set
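
A compilable version of the addMagnitudes steps, as a sketch only. Mag and addMagnitudesSketch are illustrative names, unsigned __int128 is a GCC/Clang extension, and the caller is assumed to pass the operand with the larger exponent first.

```cpp
#include <cassert>
#include <cstdint>

struct Mag { uint64_t mantissa; int32_t exponent; };

// Add two normalized magnitudes; hi must have the larger (or equal) exponent.
Mag addMagnitudesSketch(Mag hi, Mag lo) {
    assert(hi.exponent >= lo.exponent);
    const int64_t shift = static_cast<int64_t>(hi.exponent) - lo.exponent;
    // Shifting a 64-bit value by 64 or more is undefined in C++, so clamp.
    const uint64_t lo_m = shift >= 64 ? 0 : lo.mantissa >> shift;
    unsigned __int128 sum = static_cast<unsigned __int128>(hi.mantissa) + lo_m;
    int32_t e = hi.exponent;
    if (sum >> 64) {          // carry into bit 64
        sum >>= 1;            // truncates the lowest bit
        ++e;
    }
    return {static_cast<uint64_t>(sum), e};
}
```

With these definitions, 1.0 + 1.0 produces a carry and returns mantissa 2^63 with exponent 1, while 2.0 + 1.0 returns mantissa 3 × 2^62 with exponent 1.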

subtractMagnitudes: Aligns the smaller operand, subtracts, then normalizes (since subtraction can produce many leading zeros).

shift = larger.exponent − smaller.exponent
s_m = smaller.mantissa >> shift
diff = larger.mantissa − s_m             // no borrow since larger ≥ smaller
return normalize(false, diff, larger.exponent)
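
The same steps as a C++ sketch, with the normalization folded in. Mag and subtractMagnitudesSketch are illustrative names; the caller is assumed to have ordered the operands by magnitude and handled exact cancellation already, and __builtin_clzll is a GCC/Clang built-in.

```cpp
#include <cassert>
#include <cstdint>

struct Mag { uint64_t mantissa; int32_t exponent; };

// Compute |larger| - |smaller| for normalized inputs with |larger| > |smaller|.
Mag subtractMagnitudesSketch(Mag larger, Mag smaller) {
    assert(larger.exponent >= smaller.exponent);
    const int64_t shift = static_cast<int64_t>(larger.exponent) - smaller.exponent;
    const uint64_t s_m = shift >= 64 ? 0 : smaller.mantissa >> shift;  // truncating alignment
    assert(larger.mantissa > s_m);                // no borrow, result non-zero
    const uint64_t diff = larger.mantissa - s_m;
    const int lz = __builtin_clzll(diff);         // cancellation may leave leading zeros
    return {diff << lz, larger.exponent - lz};
}
```

For instance, 1.0 − 0.75 leaves 2^61 before normalization; two leading zeros are shifted out, giving mantissa 2^63 with exponent −2, which is 0.25.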

Multiplication

Two 64-bit normalized mantissas both have bit 63 set, so their product is in [2^126, 2^128). The top 64 bits of the 128-bit product form the result mantissa, and the exponents add.

product128 = a.mantissa × b.mantissa         // full 64×64 → 128-bit product, no truncation
result_m   = product128[127:64]              // top 64 bits
result_e   = a.exponent + b.exponent + 1

if result_m[63] == 0:                        // product was in [2^126, 2^127)
    result_m <<= 1
    result_e  -= 1

The +1 in result_e accounts for the fixed-point scaling: each mantissa represents a value m × 2^(e−63), so the product is m1×m2 × 2^(e1+e2−126) = (m1×m2 >> 64) × 2^(e1+e2−62) = result_m × 2^(result_e−63) where result_e = e1+e2+1.
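
A sketch of the multiplication steps in C++, assuming both inputs are normalized. Mag and multiplySketch are illustrative names, and unsigned __int128 is a GCC/Clang extension. Exponent overflow is not checked.

```cpp
#include <cassert>
#include <cstdint>

struct Mag { uint64_t mantissa; int32_t exponent; };

// Multiply two normalized magnitudes, keeping the top 64 bits of the product.
Mag multiplySketch(Mag a, Mag b) {
    assert((a.mantissa >> 63) && (b.mantissa >> 63));
    const unsigned __int128 product =
        static_cast<unsigned __int128>(a.mantissa) * b.mantissa;   // in [2^126, 2^128)
    uint64_t m = static_cast<uint64_t>(product >> 64);             // low 64 bits are truncated
    int32_t e = a.exponent + b.exponent + 1;
    if ((m >> 63) == 0) {     // product was in [2^126, 2^127)
        m <<= 1;
        e -= 1;
    }
    return {m, e};
}
```

As a check, 1.5 × 1.5 gives mantissa 9 × 2^60 with exponent 1, which is 2.25, and 1.0 × 1.0 takes the renormalization branch and returns mantissa 2^63 with exponent 0.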

Division

The numerator mantissa is scaled up by 2^64 (128-bit shift) before dividing by the denominator mantissa, preserving 64 bits of quotient precision.

numerator128 = (128-bit)(a.mantissa) << 64
q128         = numerator128 / b.mantissa
result_e     = a.exponent − b.exponent − 1

if q128[64]:                                 // quotient ≥ 2^64 (whenever a.mantissa ≥ b.mantissa)
    q128     >>= 1
    result_e  += 1
return normalize(result_sign, (uint64_t)q128, result_e)

The overflow check catches every case where a.mantissa ≥ b.mantissa: the quotient then lies in [2^64, 2^65) and needs 65 bits. The smallest such quotient, exactly 2^64, occurs when a.mantissa == b.mantissa. A single right-shift and exponent bump resolves it.
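
A C++ sketch of the division steps, assuming normalized inputs and a non-zero divisor. Mag and divideSketch are illustrative names, and unsigned __int128 is a GCC/Clang extension. The final normalize call is replaced here by an assertion, since for normalized inputs bit 63 of the quotient is already set.

```cpp
#include <cassert>
#include <cstdint>

struct Mag { uint64_t mantissa; int32_t exponent; };

// Divide two normalized magnitudes with 64 bits of quotient precision.
Mag divideSketch(Mag a, Mag b) {
    assert((a.mantissa >> 63) && (b.mantissa >> 63));
    const unsigned __int128 numerator = static_cast<unsigned __int128>(a.mantissa) << 64;
    unsigned __int128 q = numerator / b.mantissa;   // truncating division
    int32_t e = a.exponent - b.exponent - 1;
    if (q >> 64) {            // quotient needs 65 bits
        q >>= 1;
        e += 1;
    }
    const uint64_t m = static_cast<uint64_t>(q);
    assert(m >> 63);          // already normalized
    return {m, e};
}
```

In this sketch 3.0 / 2.0 returns mantissa 3 × 2^62 with exponent 0 (1.5), and 1.0 / 3.0 returns mantissa 0xAAAAAAAAAAAAAAAA with exponent −2.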

Comparison

Magnitude comparison compares exponent first (higher exponent = larger value for equal-sign normals), then mantissa if exponents are equal. Because both mantissas are normalized with bit 63 set, this is a direct integer comparison.
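
A sketch of that ordering for two normalized magnitudes. Mag and compareMagnitudeSketch are illustrative names, and the −1/0/+1 return convention is an assumption made for this example.

```cpp
#include <cstdint>

struct Mag { uint64_t mantissa; int32_t exponent; };

// Returns -1 if |a| < |b|, 0 if equal, +1 if |a| > |b| (both normalized).
constexpr int compareMagnitudeSketch(Mag a, Mag b) {
    if (a.exponent != b.exponent) return a.exponent < b.exponent ? -1 : 1;
    if (a.mantissa != b.mantissa) return a.mantissa < b.mantissa ? -1 : 1;
    return 0;
}
```

The exponent can decide the result on its own only because every normalized mantissa lies in [2^63, 2^64); without the invariant, a large mantissa with a small exponent could outweigh a small mantissa with a large one.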

Full comparison (operator<) logic:

  • NaN: always return false
  • ±0 equal to each other
  • Different signs: negative is smaller
  • Same sign, both infinite: equal
  • Same sign, one infinite: +inf > anything, -inf < anything
  • Same sign, both normal: compare magnitudes; for negatives, larger magnitude = smaller value

operator<= and operator>= include explicit NaN guards because their naive implementations (!(rhs < *this) and !(*this < rhs)) would return true for NaN operands — a bug that was caught and fixed during testing.
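
The pitfall can be reproduced with a small stand-in type. Val and the three functions below are hypothetical and exist only to show why the negated form needs a guard.

```cpp
// Hypothetical stand-in: an integer value with a NaN flag.
struct Val { bool isNaN; int v; };

// IEEE-style less-than: any NaN operand makes the comparison false.
constexpr bool lessThan(Val a, Val b) {
    return !a.isNaN && !b.isNaN && a.v < b.v;
}

// Naive a <= b as "not (b < a)": wrongly true when either operand is NaN.
constexpr bool lessEqualNaive(Val a, Val b) {
    return !lessThan(b, a);
}

// Guarded version: NaN operands are rejected before negating.
constexpr bool lessEqualGuarded(Val a, Val b) {
    return !a.isNaN && !b.isNaN && !lessThan(b, a);
}
```

With a NaN on either side, lessThan returns false, so its negation returns true; the guard restores the unordered result.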


Hardware Model — FloatNum_hw

The hardware model is a direct translation of the software model into Catapult HLS-synthesizable C++. All prohibited constructs are replaced; the original FloatNum files are unchanged.

Why a separate model?

Catapult HLS cannot synthesize several constructs used in the software model:

| Prohibited | Reason | Replacement |
|---|---|---|
| __uint128_t | GCC/Clang extension | ac_int<128, false> |
| __builtin_clzll() | Compiler built-in | Priority-encoder loop |
| double type | Not synthesizable to RTL | #ifndef __SYNTHESIS__ guard |
| std::frexp, std::ldexp, etc. | Software library calls | Same guard |
| std::string, std::ostringstream | Not synthesizable | toString() omitted |
| std::swap on pointers | Pointer semantics | Explicit if/else mux |
| <cassert>, <cmath>, etc. | Standard library | Removed or guarded |
| enum class Category | Synthesizes poorly | ac_int<2, false> constants |
| int from compareMagnitude | Signed return | 2-bit code (0/1/2 encoding) |

ac_int key properties

ac_int<W, S> from the Algorithmic C datatypes library (ac_types/include/ac_int.h) is a templated arbitrary-precision integer class:

  • W: bit width; S: signedness (true = two's-complement signed)
  • Supports +, -, *, /, >>, <<, [] (bit access), all comparisons
  • Multiplication: ac_int<64,false> × ac_int<64,false> → ac_int<128,false> automatically
  • Slicing: x.slc<W>(lsb) extracts a W-bit field starting at bit lsb
  • Conversion (for testbench only): .to_uint64(), .to_int()
  • Compiles and runs with standard g++ without any Catapult license

FP struct

struct FP {
    bool              sign;  // false = positive
    ac_int<64, false> mant;  // bit 63 set for Normal
    ac_int<32, true>  exp;   // unbiased
    ac_int<2,  false> cat;   // CAT_NORMAL/ZERO/INF/NAN
};

Factory inlines create special values: fp_zero(), fp_inf(), fp_nan().

fp_clz64 — priority encoder

Replaces __builtin_clzll. The loop is annotated with #pragma hls_unroll yes so Catapult fully unrolls it into a 64-input combinational priority network.

ac_int<6, false> fp_clz64(ac_int<64, false> x) {
    ac_int<7, false> n = 64;   // 7 bits needed to hold sentinel value 64
    #pragma hls_unroll yes
    for (int i = 0; i < 64; i++) {
        if (x[63 - i] == 1 && n == 64) n = i;
    }
    return n.slc<6>(0);        // safe: caller guarantees x != 0
}

The ac_int<7, false> sentinel is important: a 6-bit type cannot represent 64, so n == 64 would always be false. The 7-bit type holds 64 correctly; after the loop n is always 0..63 (since x != 0 is a precondition), so slc<6>(0) is lossless.
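
The loop can be tried without the AC headers. The sketch below is a plain C++ analogue, not Catapult code: uint64_t stands in for ac_int<64,false>, an unsigned int holds the sentinel, and truncateTo6Bits is a hypothetical helper that mimics storing a value in a 6-bit field.

```cpp
#include <cstdint>

// Same first-set-bit scan as fp_clz64; returns 64 when x == 0.
constexpr unsigned clz64Loop(uint64_t x) {
    unsigned n = 64;                               // sentinel: no bit found yet
    for (int i = 0; i < 64; ++i) {
        if (((x >> (63 - i)) & 1u) != 0 && n == 64) n = static_cast<unsigned>(i);
    }
    return n;
}

// What a 6-bit unsigned field would keep of a value.
constexpr unsigned truncateTo6Bits(unsigned v) { return v & 0x3Fu; }
```

truncateTo6Bits(64) is 0, so a 6-bit counter initialized to the sentinel would already look like "bit 63 is set", and the n == 64 test could never succeed.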

Arithmetic width management

Addition: (ac_int<128,false>)hi.mant + lo_m — cast one operand to 128 bits before adding to get a 128-bit result with a visible carry bit.

Multiplication: a.mant * b.mant assigned to ac_int<128,false> — do not cast either operand first; casting a.mant to 128 bits before multiplying would change the result width to 192 bits.

Division numerator shift: (ac_int<128,false>)a.mant << 64 — cast to 128 bits first, then shift. Without the cast, a.mant << 64 operates on a 64-bit value and would produce zero.

Shift amount clamping: The exponent difference is ac_int<32,true>. Before using it as a 6-bit shift amount, it is clamped with an if/else rather than a ternary, because raw_shift.slc<6>(0) returns a signed type in this version of ac_types, which creates an ambiguous ternary with the unsigned literal branch.

Non-synthesizable conversion functions

fp_from_double and fp_to_double are guarded by #ifndef __SYNTHESIS__:

  • Catapult defines __SYNTHESIS__ when it compiles the design for synthesis, so these functions are excluded from the generated RTL
  • For g++ C simulation and testbenches the macro is not defined, so both functions are available
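
The guard pattern itself can be sketched as follows. FpBits, negateBits, and debugString are hypothetical stand-ins invented for this example (the real hardware model omits toString()); the point is only that host-only headers and helpers live inside the #ifndef block.

```cpp
#include <cstdint>

// Hypothetical stand-in for the FP struct, using built-in integer types.
struct FpBits { bool sign; uint64_t mant; int32_t exp; };

// Synthesizable part: no library calls, no host-only types.
inline FpBits negateBits(FpBits x) { x.sign = !x.sign; return x; }

#ifndef __SYNTHESIS__
// Host-only part: compiled by g++ for simulation, skipped during synthesis.
#include <cassert>
#include <cstdio>
#include <string>

inline std::string debugString(const FpBits& x) {
    assert(x.mant >> 63);     // this helper assumes a normal number
    char buf[64];
    std::snprintf(buf, sizeof buf, "%c0x%016llx*2^(%d-63)",
                  x.sign ? '-' : '+',
                  static_cast<unsigned long long>(x.mant),
                  static_cast<int>(x.exp));
    return std::string(buf);
}
#endif
```

Under g++ the helper is available to testbenches; with __SYNTHESIS__ defined, only FpBits and negateBits remain visible to the tool.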

Hardware functions

| Function | SW equivalent | Description |
|---|---|---|
| fp_clz64 | __builtin_clzll | Leading-zero count, 64-input priority encoder |
| fp_normalize | normalize | Left-shift mantissa, adjust exponent |
| fp_cmp_mag | compareMagnitude | Returns 0/1/2 (equal/a>b/a<b) |
| fp_add_mag | addMagnitudes | Align and add, detect carry |
| fp_sub_mag | subtractMagnitudes | Align and subtract, then normalize |
| fp_neg | operator-() | Flip sign bit |
| fp_add | operator+ | Full add with special-value dispatch |
| fp_sub | operator- | Implemented as fp_add(a, fp_neg(b)) |
| fp_mul | operator* | 64×64→128-bit multiply |
| fp_div | operator/ | 128/64-bit divide with precision scaling |
| fp_eq | operator== | IEEE equality (NaN ≠ NaN, +0 == -0) |
| fp_lt | operator< | IEEE less-than (NaN unordered) |

Catapult HLS Synthesis

catapult.tcl synthesizes each arithmetic function as a separate combinational RTL block. The flow:

  1. go compile — parse and elaborate the C++ source
  2. go libraries — select technology cell library
  3. go assembly — schedule operations into states
  4. go architect — allocate hardware resources
  5. go allocate — bind operations to resources
  6. go extract — generate Verilog/VHDL output

Synthesis considerations

fp_clz64: With #pragma hls_unroll yes, Catapult unrolls the 64-iteration loop into a priority encoder tree (~6 levels of 2-to-1 muxes). If the pragma is not honored, an equivalent TCL directive can be added: directive set /fp_clz64/core/main -UNROLL yes.

fp_div: The 128/64-bit integer division generates a large combinational circuit. For timing-critical designs, the division can be made multi-cycle by adding: directive set /fp_div -PIPELINE_INIT_INTERVAL 2.

SCVerify RTL co-simulation: After RTL extraction, Catapult's SCVerify wraps the generated Verilog in a SystemC testbench that replays the C++ testbench against the RTL, providing bit-exact equivalence checking automatically.


Tests

Software tests (test_floatnum.cpp) — 90 tests

| Suite | Tests |
|---|---|
| Construction | Zero, ±0, ±1, 0.5, 0.25, 1.5, 1.75, 2, 3, 6, 1e15, 1e-15, ±inf, NaN; round-trip |
| Negation | Sign flip, -0, -inf, -NaN |
| Addition | Normal cases, cancellation, identity, both negative, inf, NaN propagation, large shift |
| Subtraction | Normal cases, self-subtraction, inf−inf=NaN |
| Multiplication | Normal cases, signs, 0×inf=NaN, inf×x, large/small cancel |
| Division | Normal cases, x/0=inf, 0/0=NaN, inf/inf=NaN, 0/x, x/inf |
| Comparisons | All six operators; ±0=0; NaN unordered; inf ordering; negative ordering |
| Chain | Powers of 2, distributive, 1/3×3≈1, alternating sum, √2×√2≈2 |

Hardware tests (test_floatnum_hw.cpp) — 90 tests

Mirrors the software test suite. Every test uses fp_from_double to construct inputs and fp_to_double to read results. Selected tests also cross-check the HW result against the SW golden model (FloatNum) to catch any behavioral divergence.


Known Limitations

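  • Rounding: Truncation only (round toward zero), with no guard bits. Results typically differ from IEEE 754 round-to-nearest by about 1 ULP of the 64-bit mantissa per operation. Subtracting nearly equal operands whose exponents differ can lose more, because a bit shifted out during alignment is not recovered when the result is normalized.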
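  • Denormals: Flushed to zero. By the value formula, the smallest representable magnitude is 2^(−2^31) (mantissa 2^63 with the minimum exponent −2^31); smaller values cannot be represented.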
  • No exceptions: Overflow/underflow flags are not generated.
  • Division cost: 128/64-bit integer division in hardware is large. Consider a multi-cycle or iterative implementation for area-critical designs.
  • No FMA: Fused multiply-add is not implemented.
