## **XI. Generating Assembler for the Co-processor**
Dive into floating-point operations and co-processor capabilities. You'll understand the architecture of floating-point units, use their instruction sets, handle precision data types, and carry out scientific operations like trigonometric functions within mixed-mode code.

## **Topics Covered**
A. Understanding floating-point architecture

1. Binary representation

2. Floating-point stack

3. Unique registers and flags

B. Familiarization with the floating-point instruction set

1. Mixed mode assembler code

2. Arithmetic operations

3. Precision data types

4. Trigonometric functions

## **A. Understanding Floating-Point Architecture**
## **1. Binary Representation**

Modern CPUs have a **Floating-Point Unit (FPU)** (also called a co-processor) for handling real numbers.
Unlike integers, floating-point numbers follow the **IEEE 754 standard**, which uses **scientific notation in binary**.

---

### **1. IEEE 754 Floating-Point Format**

A floating-point number is stored as:

| Part                    | Bits                      | Purpose                               |
| ----------------------- | ------------------------- | ------------------------------------- |
| **Sign**                | 1                         | 0 for positive, 1 for negative        |
| **Exponent**            | 8 (single) / 11 (double)  | Determines magnitude (via power of 2) |
| **Mantissa** (Fraction) | 23 (single) / 52 (double) | Stores significant digits             |

**Formula:**

$$
\text{Value} = (-1)^{Sign} \times 1.Mantissa \times 2^{(Exponent - Bias)}
$$

* **Bias** is 127 for single-precision, 1023 for double.

---

### **2. Example - Representing 5.75 in Binary**

1. Convert integer part:

   * 5 in binary = `101`
2. Convert fraction part:

   * 0.75 x 2 = 1.5 → `1` (carry the 0.5)
   * 0.5 x 2 = 1.0 → `1`
   * Fraction = `.11`
3. Combine: `101.11`
4. Normalize to `1.0111 x 2²`
5. Sign = 0 (positive)
   Exponent = 2 + 127 = `129` → binary `10000001`
   Mantissa = `01110000000000000000000`

Final **32-bit IEEE 754**:
`0 10000001 01110000000000000000000`
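
As a quick check, the bit pattern can be decoded back with the formula given earlier. This is only a worked verification of the example, using the single-precision bias of 127:

$$
\begin{aligned}
\text{Value} &= (-1)^{0} \times 1.0111_2 \times 2^{129 - 127} \\
&= \left(1 + \tfrac{1}{4} + \tfrac{1}{8} + \tfrac{1}{16}\right) \times 2^{2} \\
&= 1.4375 \times 4 = 5.75
\end{aligned}
$$

Grouped into hexadecimal digits, the same 32 bits read `40B80000h`.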

---

### **3. Assembly and Floating-Point Data**

In assembly, floating-point constants are often defined as:

```asm
.data
num1 REAL4 5.75      ; Single precision (32-bit)
num2 REAL8 5.75      ; Double precision (64-bit)
```

---

### **4. Quick Demo - Viewing Floating-Point Representation in Assembly**

Here's a small MASM example that stores and inspects the bits of a floating-point number.

```asm
; floating_point_bits.asm
.386
.model flat, stdcall
.stack 4096
ExitProcess PROTO :DWORD

.data
floatNum REAL4 5.75
bits     DWORD ?

.code
main PROC
    mov eax, DWORD PTR floatNum ; get raw bits of float
    mov bits, eax               ; store them for inspection
    invoke ExitProcess, 0
main ENDP
END main
```

You can run the above code in a debugger and see the exact IEEE 754 bits in `bits` (`40B80000h` for 5.75).

⬇️

## **2 Floating-Point Stack**

The **x87 Floating-Point Unit (FPU)** inside many CPUs uses a **register stack** model to perform floating-point operations.

---

### **1. The FPU Stack Structure**

* There are **8 registers**: `ST(0)` through `ST(7)`.
* They are organized as a **stack**:

  * **Push** = load a new value onto the stack (`fld`)
  * **Pop** = remove the top value (`fstp`)
* The **top** of the stack is always `ST(0)`.
* Operations are **implicitly** done on `ST(0)` and possibly another stack element.

---

**Diagram:**

```
    +---------+  <- ST(0) (top)
    |  value  |
    +---------+  <- ST(1)
    |  value  |
    +---------+
       ...
    +---------+  <- ST(7) (bottom)
```

---

### **2. Common FPU Stack Instructions**

| Instruction | Action                                       |
| ----------- | -------------------------------------------- |
| `fld src`   | Push a value onto the stack                  |
| `fst dest`  | Store value from `ST(0)` without popping     |
| `fstp dest` | Store value from `ST(0)` **and** pop         |
| `fadd`      | Add `ST(0)` to another register              |
| `fsub`      | Subtract                                     |
| `fmul`      | Multiply                                     |
| `fdiv`      | Divide                                       |
| `fxch`      | Exchange `ST(0)` with another stack register |

---

### **3. Example - Using the FPU Stack**

```asm
; fpu_stack_example.asm
.386
.model flat, stdcall
.stack 4096
ExitProcess PROTO :DWORD

.data
val1 REAL4 5.0
val2 REAL4 2.0
result REAL4 ?

.code
main PROC
    fld val1        ; ST(0) = 5.0
    fld val2        ; ST(0) = 2.0, ST(1) = 5.0
    fadd            ; no-operand form = faddp st(1), st(0): add, then pop → ST(0) = 7.0
    fstp result     ; store result (7.0) and pop, leaving the stack empty
    invoke ExitProcess, 0
main ENDP
END main
```

### **4. Key Points to Remember**

* **Order matters** because it's a stack.
* `fld` loads only floating-point values; to load an integer, use `fild`, which **converts** it to floating-point as it is pushed.
* You should **always pop values** you're done with, to avoid stack overflow.
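
To make the integer case concrete, here is a minimal sketch that loads an integer with `fild` and scales it. The names `intCount`, `factor`, and `scaled` are made up for this illustration and do not appear elsewhere in the chapter.

```asm
; fild_example.asm (hypothetical file name)
.386
.model flat, stdcall
.stack 4096
ExitProcess PROTO :DWORD

.data
intCount DWORD 3          ; 32-bit signed integer
factor   REAL4 0.5
scaled   REAL4 ?

.code
main PROC
    fild intCount         ; convert 3 to 3.0 and push it: ST(0) = 3.0
    fmul factor           ; ST(0) = 3.0 * 0.5 = 1.5
    fstp scaled           ; store 1.5 and pop, leaving the stack empty
    invoke ExitProcess, 0
main ENDP
END main
```

Every push here is matched by a pop, so the register stack is empty again when the program exits.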

⬇️

## **3 Floating-Point Arithmetic Instructions**

The **x87 FPU** provides instructions for performing arithmetic operations directly on the floating-point stack registers (`ST(0)`-`ST(7)`).

---

### **1. Basic Arithmetic Operations**

| Instruction | Description                                 |
| ----------- | ------------------------------------------- |
| `fadd`      | Add `ST(0)` to another stack element        |
| `faddp`     | Add and pop stack                           |
| `fsub`      | Subtract another stack element from `ST(0)` |
| `fsubp`     | Subtract and pop                            |
| `fmul`      | Multiply `ST(0)` with another stack element |
| `fmulp`     | Multiply and pop                            |
| `fdiv`      | Divide `ST(0)` by another stack element     |
| `fdivp`     | Divide and pop                              |

---

### **2. Operand Variants**

* **Register form**: `fadd ST(0), ST(1)`
  → `ST(0) = ST(0) + ST(1)`
* **Memory form**: `fadd mem32real`
  → Adds a 32-bit floating-point value from memory to `ST(0)`
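
Operand order matters for the non-commutative operations. The sketch below contrasts the memory form of `fsub` with its reversed counterpart `fsubr`; the variable names are assumptions chosen for the illustration.

```asm
; operand_order.asm (hypothetical file name)
.386
.model flat, stdcall
.stack 4096
ExitProcess PROTO :DWORD

.data
valA    REAL4 6.0
valB    REAL4 3.0
diffAB  REAL4 ?
diffBA  REAL4 ?

.code
main PROC
    fld   valA       ; ST(0) = 6.0
    fsub  valB       ; memory form: ST(0) = ST(0) - valB = 3.0
    fstp  diffAB     ; store 3.0 and pop

    fld   valA       ; ST(0) = 6.0
    fsubr valB       ; reversed form: ST(0) = valB - ST(0) = -3.0
    fstp  diffBA     ; store -3.0 and pop
    invoke ExitProcess, 0
main ENDP
END main
```

`fdiv` and `fdivr` follow the same pattern for division.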

---

### **3. Example - Basic Arithmetic**

```asm
; fpu_arithmetic_example.asm
.386
.model flat, stdcall
.stack 4096
ExitProcess PROTO :DWORD

.data
a REAL4 6.0
b REAL4 3.0
sum REAL4 ?
product REAL4 ?
quotient REAL4 ?

.code
main PROC
    ; sum = a + b
    fld a           ; ST(0) = 6.0
    fadd b          ; ST(0) = 6.0 + 3.0 = 9.0
    fstp sum        ; store and pop

    ; product = a * b
    fld a           ; ST(0) = 6.0
    fmul b          ; ST(0) = 18.0
    fstp product

    ; quotient = a / b
    fld a           ; ST(0) = 6.0
    fdiv b          ; ST(0) = 2.0
    fstp quotient

    invoke ExitProcess, 0
main ENDP
END main
```

### **4. Pop Variants**

Some instructions (like `faddp`, `fmulp`, `fdivp`) **pop** one value from the stack after performing the operation.
Example:

```asm
fld a
fld b
faddp       ; ST(1) = a + b, pop ST(0)
```

---

### **5. Key Tips**

* **Pop vs No-Pop**: Use pop variants if you want to clean up the stack automatically.
* **Memory operands** are useful to avoid extra `fld` instructions.
* Keep track of **stack depth** to prevent overflow.

⬇️

## **4 Floating-Point Comparison and Conditional Moves**

The x87 FPU supports comparison instructions that set **condition flags** in the FPU status word (and, in some cases, in the CPU flags register) so you can perform conditional branching or moves.

---

### **1. Basic Comparison Instructions**

| Instruction | Description                                             |
| ----------- | ------------------------------------------------------- |
| `fcom`      | Compare `ST(0)` with another register or memory operand |
| `fcomp`     | Compare and pop `ST(0)`                                 |
| `fcompp`    | Compare and pop twice                                   |
| `fucom`     | Unordered compare (quiet NaNs raise no exception)       |
| `fucomp`    | Unordered compare and pop                               |
| `fucompp`   | Unordered compare and pop twice                         |

**How it works**:

* The comparison sets **C0, C2, C3** bits in the FPU status word.
* If needed, you can transfer these bits to the CPU flags register with `fstsw` followed by `sahf` to use normal `jcc` jumps.
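
The condition codes and the CPU flags they land in can be summarized as follows. `sahf` copies C0 into CF, C2 into PF, and C3 into ZF; the table is a reference sketch of the outcomes for `fcom`-style compares.

$$
\begin{array}{l|ccc|l}
\text{Result} & C3 & C2 & C0 & \text{Flags after } \texttt{sahf} \\ \hline
ST(0) > \text{operand} & 0 & 0 & 0 & ZF = 0,\ PF = 0,\ CF = 0 \\
ST(0) < \text{operand} & 0 & 0 & 1 & CF = 1 \\
ST(0) = \text{operand} & 1 & 0 & 0 & ZF = 1 \\
\text{unordered (NaN)} & 1 & 1 & 1 & ZF = PF = CF = 1
\end{array}
$$

Because an unordered result also sets CF and ZF, code that must distinguish NaNs should test PF (`jp`) before testing the other flags.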

---

### **2. Example - Comparing Two Values**

```asm
; fpu_compare.asm
.386
.model flat, stdcall
.stack 4096
ExitProcess PROTO :DWORD

.data
x REAL4 5.0
y REAL4 8.0

.code
main PROC
    fld x              ; ST(0) = 5.0
    fcomp y            ; Compare and pop ST(0)

    fstsw ax           ; Copy FPU status word into AX
    sahf               ; Move AH into FLAGS register

    ; Now ZF=1 if equal, CF=1 if ST(0) < operand
    ; Example jump:
    jc less_than
    je equal_to

greater_than:
    ; Code for x > y
    jmp done

less_than:
    ; Code for x < y
    jmp done

equal_to:
    ; Code for x == y

done:
    invoke ExitProcess, 0
main ENDP
END main
```

### **3. Conditional Moves**

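Later x86 CPUs (Pentium Pro and newer) provide **conditional moves** between FPU registers, avoiding jumps. These instructions test the CPU flags, so the comparison must set them first, either directly with `fcomi`/`fucomi` or through `fstsw ax` followed by `sahf`.

* `fcmovb`  → Move if below (CF=1)
* `fcmove`  → Move if equal (ZF=1)
* `fcmovbe` → Move if below or equal (CF=1 or ZF=1)
* `fcmovu`  → Move if unordered (PF=1, NaN detected)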

Example:

```asm
fld st(1)        ; load candidate value
fcmovb st(0), st(2)  ; if below, replace st(0) with st(2)
```
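
A fuller sketch of the same idea picks the larger of two values without a branch. It assumes a Pentium Pro or later CPU and uses `fcomi` so the comparison result lands directly in the CPU flags; the variable names are illustrative only.

```asm
; fpu_max.asm (hypothetical file name)
.686                         ; fcomi and fcmovb need the Pentium Pro instruction set
.model flat, stdcall
.stack 4096
ExitProcess PROTO :DWORD

.data
valA   REAL4 5.0
valB   REAL4 8.0
maxVal REAL4 ?

.code
main PROC
    fld    valB              ; ST(0) = b
    fld    valA              ; ST(0) = a, ST(1) = b
    fcomi  st(0), st(1)      ; compare a with b; result goes to CF/ZF/PF
    fcmovb st(0), st(1)      ; if a < b (CF = 1), copy b into ST(0)
    fstp   maxVal            ; store max(a, b) = 8.0 and pop
    fstp   st(0)             ; discard the remaining copy of b
    invoke ExitProcess, 0
main ENDP
END main
```

If either input is a NaN the compare is unordered, CF is set, and `b` is selected; code that cares about NaNs should check PF separately.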

---

### **4. Ordered vs Unordered Compare**

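* **Ordered (`fcom`)** raises an invalid-operation exception if either operand is any kind of NaN.
* **Unordered (`fucom`)** raises that exception only for signaling NaNs; quiet NaNs are reported silently.
* Both report an unordered result by setting C3 = C2 = C0 = 1.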

---

✅ **Tip:** Always use `fucom` or `fucomp` if there’s a possibility of NaN values to avoid unexpected results.

⬇️

## **5 Floating-Point Constants and Data Transfer Instructions**

Efficient floating-point programming often involves **loading constants** directly from the FPU's built-in instructions and **moving data** between the FPU registers and memory.

---

### **1. Built-in Floating-Point Constants**

The x87 FPU has **predefined constants** that can be loaded quickly without accessing memory:

| Instruction | Constant Loaded into ST(0) |
| ----------- | -------------------------- |
| `fld1`      | +1.0                       |
| `fldz`      | +0.0                       |
| `fldpi`     | π (≈ 3.14159)              |
| `fldl2e`    | log₂(e)                    |
| `fldl2t`    | log₂(10)                   |
| `fldlg2`    | log₁₀(2)                   |
| `fldln2`    | ln(2)                      |

**Example:**

```asm
fldpi     ; ST(0) = π
fld1      ; ST(0) = 1.0, ST(1) = π
```

---

### **2. Moving Data Between FPU and Memory**

#### **Load Instructions**

| Instruction | Description                                                  |
| ----------- | ------------------------------------------------------------ |
| `fld`       | Load floating-point value into ST(0) from memory or register |
| `fild`      | Load integer (and convert to float)                          |

Example:

```asm
fld DWORD PTR myFloat
fild DWORD PTR myInt
```

#### **Store Instructions**

| Instruction | Description                                 |
| ----------- | ------------------------------------------- |
| `fst`       | Store ST(0) into memory (keeps value in ST) |
| `fstp`      | Store and pop ST(0)                         |
| `fist`      | Store integer (converted from float)        |
| `fistp`     | Store integer and pop                       |

Example:

```asm
fstp DWORD PTR result
fistp DWORD PTR intResult
```

---

### **3. Moving Data Between FPU Registers**

The `fxch` instruction exchanges ST(0) with another FPU register.

```asm
fxch st(2)  ; Swap ST(0) and ST(2)
```

---

### **4. Example - Loading Constants and Saving a Result**

```asm
.386
.model flat, stdcall
.stack 4096
ExitProcess PROTO :DWORD

.data
result REAL4 ?

.code
main PROC
    fldpi           ; ST(0) = π
    fld1            ; ST(0) = 1.0, ST(1) = π
    faddp st(1), st(0) ; ST(1) = π + 1.0, pop ST(0)
    fstp result     ; Save to memory
    invoke ExitProcess, 0
main ENDP
END main
```

✅ **Tip:**
Use built-in constants (`fld1`, `fldpi`, etc.) instead of storing them in `.data` when possible: they need no memory operand and are loaded at full extended precision.

⬇️

## **6 Floating-Point Control Word and Status Word**

The **x87 Floating-Point Unit (FPU)** maintains special registers that control how it operates and record its current status. Two of the most important are:

* **Control Word (CW)** - Configures FPU behavior.
* **Status Word (SW)** - Reports current state, errors, and flags.

---

### **1. Floating-Point Control Word (CW)**

The **Control Word** is a 16-bit register that determines:

1. **Precision Control (PC)** - How many bits are used for floating-point precision:

   * `00` - 24 bits (single precision)
   * `10` - 53 bits (double precision)
   * `11` - 64 bits (extended precision)

2. **Rounding Control (RC)** - How results are rounded:

   * `00` - Round to nearest (default)
   * `01` - Round down (toward −∞)
   * `10` - Round up (toward +∞)
   * `11` - Truncate (toward 0)

3. **Exception Masking** - Whether certain floating-point exceptions (like division by zero, overflow, underflow) trigger interrupts or are ignored.
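
To see how the fields fit together, here is the bit layout of `037Fh`, the value that `FINIT` loads into the Control Word. The precision-control field occupies bits 8-9, rounding control bits 10-11, and the six exception masks bits 0-5:

$$
\texttt{037Fh} = \underbrace{0000}_{\text{bits 15-12}}\ \underbrace{00}_{RC}\ \underbrace{11}_{PC}\ \underbrace{01}_{\text{bits 7-6}}\ \underbrace{111111}_{\text{exception masks}}
$$

Read this way, the initial state is round to nearest (`RC = 00`), extended precision (`PC = 11`), and all six exceptions masked. An operating system or runtime may load a different value before your code runs.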

---

**Loading and Storing the Control Word:**

```asm
fstcw myControlWord    ; Store CW into memory
fldcw myControlWord    ; Load CW from memory
```

Example: Change rounding mode to **truncate**:

```asm
fstcw control
or control, 0C00h   ; Set RC bits to 11
fldcw control
```

---

### **2. Floating-Point Status Word (SW)**

The **Status Word** reflects:

* Condition flags (e.g., result is zero, less than zero, unordered).
* Exception flags (e.g., divide-by-zero occurred).
* Top-of-stack pointer (TOS).
* Busy status of the FPU.

---

**Reading the Status Word:**

```asm
fstsw status
```

`fstsw ax` (available on the 80287 and later) copies the Status Word straight into the AX register, which avoids a round trip through memory.

---

**Example - Check if `ST(0)` is zero:**

```asm
ftst               ; Compare ST(0) with 0.0, setting C0/C2/C3
fstsw ax
sahf               ; Load AH into CPU flags (C3 → ZF)
; ZF = 1 if ST(0) is zero (it is also set when the compare is unordered)
```

---

### **3. Example - Changing Precision and Reading Status**

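```asm
.386
.model flat, stdcall
.stack 4096
ExitProcess PROTO :DWORD

.data
cw     WORD ?
status WORD ?

.code
main PROC
    fstcw cw            ; Get current CW
    and cw, 0FCFFh      ; Clear precision-control bits (8-9)
    or cw, 0200h        ; Set PC = 10 → 53-bit precision
    fldcw cw            ; Load new CW

    fld1
    fld1
    faddp st(1), st(0)  ; Result = 2.0 in ST(0)

    fstsw status        ; Store status word
    fstp st(0)          ; Discard the result so the stack is left empty
    invoke ExitProcess, 0
main ENDP
END main
```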

✅ **Tip:**

* Use **Control Word** to fine-tune performance vs. precision trade-offs.
* Check the **Status Word** to detect and handle exceptional results during calculations.

⬇️

## **B. Familiarization with the floating-point instruction set**
### 1. Mixed mode assembler code

The x87 Floating-Point Unit (FPU) has **built-in trigonometric functions** for working with angles and periodic values, eliminating the need to implement these in software.

These instructions work with **radians** (not degrees) and operate directly on the floating-point register stack.

---

### **1. Angle Units**

* **x87 expects angles in radians**.
* Conversion formulas:

  * Degrees → Radians:

    $$
    \text{radians} = \text{degrees} \times \frac{\pi}{180}
    $$
  * Radians → Degrees:

    $$
    \text{degrees} = \text{radians} \times \frac{180}{\pi}
    $$
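
The degrees-to-radians formula maps directly onto FPU instructions. The following is a minimal sketch; `degrees`, `d180`, and `radians` are names invented for the example.

```asm
; deg_to_rad.asm (hypothetical file name)
.386
.model flat, stdcall
.stack 4096
ExitProcess PROTO :DWORD

.data
degrees REAL8 90.0
d180    REAL8 180.0
radians REAL8 ?

.code
main PROC
    fld   degrees            ; ST(0) = 90.0
    fldpi                    ; ST(0) = π, ST(1) = 90.0
    fmulp st(1), st(0)       ; ST(0) = 90π
    fdiv  d180               ; ST(0) = 90π / 180 = π/2
    fstp  radians            ; store the angle in radians and pop
    invoke ExitProcess, 0
main ENDP
END main
```

Swapping the roles of π and 180 gives the reverse conversion.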

---

### **2. Trigonometric Instructions**

| Instruction | Description                                                  | Stack Effect                          |
| ----------- | ------------------------------------------------------------ | ------------------------------------- |
| **FSIN**    | Sine of ST(0)                                                | ST(0) ← sin(ST(0))                    |
| **FCOS**    | Cosine of ST(0)                                              | ST(0) ← cos(ST(0))                    |
| **FSINCOS** | Compute both; replaces ST(0) with the sine, then pushes the cosine | ST(0) ← cos, ST(1) ← sin |
| **FPTAN**   | Replaces ST(0) with tan(x), then pushes 1.0                  | ST(0) ← 1.0, ST(1) ← tan              |
| **FPATAN**  | Compute arctan(ST(1)/ST(0)), i.e. atan2(y, x) with y in ST(1) and x in ST(0) | Result replaces ST(1), then pops once: ST(0) ← angle |

---

### **3. Example - Sine Calculation**

```asm
.data
angle   REAL8 1.57079632679    ; ~π/2 radians
result  REAL8 ?

.code
main PROC
    fld angle     ; Load π/2
    fsin          ; ST(0) = sin(π/2)
    fstp result   ; Store result in memory
    invoke ExitProcess, 0
main ENDP
END main
```

**Expected output:** `1.0`

---

### **4. Example - Sine and Cosine Together**

```asm
.data
angle   REAL8 0.78539816339    ; ~π/4 radians
sine    REAL8 ?
cosine  REAL8 ?

.code
main PROC
    fld angle     ; Load π/4
    fsincos       ; ST(0)=cos, ST(1)=sin
    fstp cosine   ; Pop and store cosine
    fstp sine     ; Pop and store sine
    invoke ExitProcess, 0
main ENDP
END main
```

**Expected output:**

* sine ≈ 0.7071
* cosine ≈ 0.7071

---

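✅ **Tip:** `FSIN`, `FCOS`, `FSINCOS`, and `FPTAN` only accept arguments with |x| < 2⁶³; outside that range they set C2 and leave `ST(0)` unchanged. The FPU reduces in-range arguments internally, but with a finite-precision value of π, so accuracy drops for large angles. **Normalize** large angles yourself before calling these instructions.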
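
One way to normalize a large angle is to take its remainder modulo 2π with `FPREM1` before calling `FSIN`. The sketch below assumes the angle is stored in a variable called `bigAngle` (a made-up name); `FPREM1` may need several passes for very large inputs, which it signals through C2.

```asm
; angle_reduce.asm (hypothetical file name)
.386
.model flat, stdcall
.stack 4096
ExitProcess PROTO :DWORD

.data
bigAngle REAL8 1000.0         ; radians, well outside [-π, π]
sinRes   REAL8 ?

.code
main PROC
    fldpi
    fadd  st(0), st(0)        ; ST(0) = 2π
    fld   bigAngle            ; ST(0) = angle, ST(1) = 2π
reduce:
    fprem1                    ; ST(0) = IEEE remainder of ST(0) / ST(1)
    fstsw ax
    test  ah, 04h             ; C2 (bit 10 of the status word) = reduction incomplete
    jnz   reduce              ; repeat until C2 = 0
    fsin                      ; sine of the reduced angle
    fstp  sinRes              ; store and pop
    fstp  st(0)               ; discard 2π
    invoke ExitProcess, 0
main ENDP
END main
```

The 2π used here is itself rounded, so the reduced angle carries an error that grows with the size of the original argument.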

⬇️

## **2. Tangent and Arctangent Instructions**

The x87 FPU provides **specialized instructions** for computing tangent and arctangent values directly on floating-point numbers.
These are useful in applications such as geometry, physics, and graphics transformations.

---

### **1. Tangent - `FPTAN`**

* **Computes:**

  $$
  \text{tan}(x) = \frac{\sin(x)}{\cos(x)}
  $$
* **Input:** `ST(0)` = angle in radians.
* **Output:**

  * `ST(1)` ← **tan(x)**
  * `ST(0)` ← **1.0**: `FPTAN` replaces the angle with its tangent and then pushes 1.0 on top (a quirk kept for compatibility with the 8087).

**Usage:**

```asm
fld angle    ; Load angle
fptan        ; ST(0) = 1.0, ST(1) = tan(angle)
fstp st(0)   ; Discard the 1.0, leaving tan(angle) in ST(0)
```

---

### **2. Arctangent - `FPATAN`**

* **Computes:**

  $$
  \text{atan2}(y, x)
  $$

  which returns the angle whose tangent is `y/x`, considering the signs of both to determine the correct quadrant.
* **Inputs:**

  * `ST(0)` = **x**
  * `ST(1)` = **y**
* **Output:** Pops once; the result replaces both inputs, leaving the **angle in radians** in `ST(0)`.

**Usage:**

```asm
fld y_value    ; Push y
fld x_value    ; Push x
fpatan         ; ST(0) = atan2(y, x)
```

---

### **3. Example - Tangent Calculation**

```asm
.data
angle   REAL8 0.78539816339    ; ~π/4 radians
result  REAL8 ?
dummy   REAL8 ?

.code
main PROC
    fld angle     ; Load π/4
    fptan         ; ST(0) = 1.0, ST(1) = tan(π/4)
    fstp dummy    ; Pop the 1.0
    fstp result   ; Store tan(π/4)
    invoke ExitProcess, 0
main ENDP
END main
```

**Expected output:** `1.0`

---

### **4. Example - Arctangent Calculation**

```asm
.data
y_value REAL8 1.0
x_value REAL8 1.0
angle   REAL8 ?

.code
main PROC
    fld y_value   ; Push y = 1.0
    fld x_value   ; Push x = 1.0
    fpatan        ; ST(0) = atan2(1,1)
    fstp angle    ; Store result (~π/4)
    invoke ExitProcess, 0
main ENDP
END main
```

**Expected output:** \~0.7854 radians (π/4).

---

✅ **Tip:**

* `FPTAN` **always leaves an extra 1.0** on the stack - don't forget to pop it.
* `FPATAN` is **very useful** in coordinate transformations where you need to find the angle from X and Y components.

⬇️

## **3. Inverse Trigonometric Functions**

The x87 FPU **does not** provide direct instructions for arcsine (`asin`) or arccosine (`acos`) like it does for arctangent.
However, we can compute them using mathematical identities and existing FPU instructions.

---

### **1. Arcsine (`asin`)**

**Formula:**

$$
\arcsin(x) = \operatorname{atan2}\left(x, \sqrt{1 - x^2}\right)
$$

**Steps in assembly:**

1. Load `x` into the FPU.
2. Compute `1 - x²`.
3. Take the square root → `sqrt(1 - x²)`.
4. Use `FPATAN` to find `atan2(x, sqrt(...))`.
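
Working the steps through by hand for x = 0.5 gives a value to compare against in the debugger:

$$
\arcsin(0.5) = \operatorname{atan2}\left(0.5, \sqrt{1 - 0.25}\right) = \operatorname{atan2}(0.5,\ 0.8660\ldots) \approx 0.5236 = \frac{\pi}{6}
$$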

**Example:**

```asm
.data
x_val   REAL8 0.5
asin_res REAL8 ?

.code
main PROC
    fld x_val              ; ST(0) = x
    fld1                   ; ST(0) = 1.0, ST(1) = x
    fld st(1)              ; ST(0) = x, ST(1) = 1.0, ST(2) = x
    fmul st(0), st(0)      ; ST(0) = x²
    fsubp st(1), st(0)     ; ST(1) = 1.0 - x², pop → ST(0) = 1 - x², ST(1) = x
    fsqrt                  ; ST(0) = sqrt(1 - x²)
    fpatan                 ; ST(0) = atan2(x, sqrt(1 - x²)) = asin(x)
    fstp asin_res          ; Store the result, leaving the stack empty
    invoke ExitProcess, 0
main ENDP
END main
```

---

### **2. Arccosine (`acos`)**

**Formula:**

$$
\arccos(x) = \frac{\pi}{2} - \arcsin(x)
$$

We can reuse the arcsine computation, then subtract from π/2.

**Example:**

```asm
.data
x_val   REAL8 0.5
pi_over_2 REAL8 1.57079632679
acos_res REAL8 ?

.code
main PROC
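    ; Compute asin(x) first
    fld x_val              ; ST(0) = x
    fld1                   ; ST(0) = 1.0, ST(1) = x
    fld st(1)              ; ST(0) = x, ST(1) = 1.0, ST(2) = x
    fmul st(0), st(0)      ; ST(0) = x²
    fsubp st(1), st(0)     ; ST(0) = 1 - x², ST(1) = x
    fsqrt                  ; ST(0) = sqrt(1 - x²)
    fpatan                 ; ST(0) = asin(x)
    fsubr pi_over_2        ; ST(0) = π/2 - asin(x) = acos(x)
    fstp acos_res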
    invoke ExitProcess, 0
main ENDP
END main
```

---

### **3. Notes**

* **Why no direct instruction?**
  The x87 designers prioritized speed for commonly used trig ops (sin, cos, tan, atan) and left others for software computation.
* **Range restrictions:**

  * `asin` and `acos` are only defined for `-1 ≤ x ≤ 1`.
* **Units:** Always **radians**.

⬇️

## **4. Hyperbolic Functions**

The x87 FPU **does not** have direct instructions for hyperbolic functions (`sinh`, `cosh`, `tanh`).
Instead, we compute them using their definitions from exponentials.

---

### **1. Definitions**

$$
\sinh(x) = \frac{e^x - e^{-x}}{2}
$$

$$
\cosh(x) = \frac{e^x + e^{-x}}{2}
$$

$$
\tanh(x) = \frac{\sinh(x)}{\cosh(x)}
$$

---

### **2. Using the x87 FPU to Compute Exponentials**

The x87 FPU has the `F2XM1` instruction (computes $2^x - 1$ for -1 ≤ x ≤ +1) and the `FYL2X` instruction (y \* log₂(x)) to help with exponentials.

To compute $e^x$:

1. Convert $e^x$ into base-2 form:

   $$
   e^x = 2^{x \cdot \log_2(e)}
   $$
2. Separate into integer and fractional parts for `F2XM1`.
3. Use `FSCALE` to handle the integer exponent.
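
As a worked instance of these steps, take x = 1 and write t for the scaled exponent, n for its integer part, and f for the fraction (symbols introduced here for illustration):

$$
\begin{aligned}
t &= x \cdot \log_2(e) \approx 1.442695 \\
n &= \operatorname{round}(t) = 1, \qquad f = t - n \approx 0.442695 \\
e^{1} &= 2^{n} \cdot 2^{f} \approx 2 \times 1.359141 = 2.718282
\end{aligned}
$$

Rounding to the nearest integer keeps |f| ≤ 0.5, which is safely inside the input range of `F2XM1`.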

---

### **3. Example: sinh(x)**

```asm
.data
x_val REAL8 1.0
log2e REAL8 1.4426950408889634   ; log₂(e)
half  REAL8 0.5
sinh_res REAL8 ?

.code
main PROC
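    ; Compute e^x = 2^(x * log₂(e)), split as 2^n * 2^f
    fld x_val              ; ST(0) = x
    fmul log2e             ; ST(0) = t = x * log₂(e)
    fld st(0)              ; ST(0) = t, ST(1) = t
    frndint                ; ST(0) = n = t rounded to an integer
    fsub st(1), st(0)      ; ST(1) = f = t - n (|f| ≤ 0.5 with default rounding)
    fxch st(1)             ; ST(0) = f, ST(1) = n
    f2xm1                  ; ST(0) = 2^f - 1
    fld1
    faddp st(1), st(0)     ; ST(0) = 2^f, ST(1) = n
    fscale                 ; ST(0) = 2^f * 2^n = e^x
    fstp st(1)             ; Drop n → ST(0) = e^x

    ; Compute e^(-x) as the reciprocal
    fld1                   ; ST(0) = 1.0, ST(1) = e^x
    fdiv st(0), st(1)      ; ST(0) = e^(-x), ST(1) = e^x

    ; Compute sinh = (e^x - e^-x) / 2
    fsubp st(1), st(0)     ; ST(0) = e^x - e^(-x)
    fmul half              ; ST(0) = sinh(x)
    fstp sinh_res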

    invoke ExitProcess, 0
main ENDP
END main
```

---

### **4. Computing cosh(x)**

Same process, but use:

$$
\cosh(x) = \frac{e^x + e^{-x}}{2}
$$
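
A minimal sketch of the cosh step is shown below. To keep it short it assumes e^x has already been computed and stored in `expX` (a hypothetical variable, here holding e^1); in a real program that value would come from the exponential routine.

```asm
; cosh_example.asm (hypothetical file name)
.386
.model flat, stdcall
.stack 4096
ExitProcess PROTO :DWORD

.data
expX    REAL8 2.718281828459045   ; e^x for x = 1.0, assumed precomputed
half    REAL8 0.5
coshRes REAL8 ?

.code
main PROC
    fld   expX               ; ST(0) = e^x
    fld1                     ; ST(0) = 1.0, ST(1) = e^x
    fdiv  st(0), st(1)       ; ST(0) = e^(-x), ST(1) = e^x
    faddp st(1), st(0)       ; ST(0) = e^x + e^(-x)
    fmul  half               ; ST(0) = cosh(x), about 1.5431 for x = 1
    fstp  coshRes            ; store and pop
    invoke ExitProcess, 0
main ENDP
END main
```

The only difference from the sinh computation is the `faddp` in place of the subtraction.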

---

### **5. Computing tanh(x)**

Once you have `sinh(x)` and `cosh(x)`:

```asm
fld sinh_val
fld cosh_val
fdiv                  ; sinh / cosh
```

---

### **6. Notes**

* Hyperbolic functions appear in **engineering, physics, and numerical simulations**.
* Unlike trig functions, they are **unbounded** for large |x|.
* For large arguments, direct computation may overflow — approximations or scaling may be needed.

⬇️

## **5. Inverse Hyperbolic Functions**

The inverse hyperbolic functions — **asinh**, **acosh**, and **atanh** — are not built into the x87 FPU,
but can be computed from their **logarithmic definitions**.

---

### **1. Mathematical Definitions**

1. **Inverse Hyperbolic Sine**

$$
\operatorname{asinh}(x) = \ln\left( x + \sqrt{x^2 + 1} \right)
$$

2. **Inverse Hyperbolic Cosine**

$$
\operatorname{acosh}(x) = \ln\left( x + \sqrt{x^2 - 1} \right), \quad x \ge 1
$$

3. **Inverse Hyperbolic Tangent**

$$
\operatorname{atanh}(x) = \frac12 \ln\left( \frac{1+x}{1-x} \right), \quad |x| < 1
$$

---

### **2. x87 FPU Approach**

We can implement these using:

* `FSQRT` for square roots
* `FYL2X` for logarithms (since $\ln(x) = \log_2(x) \cdot \ln(2)$)
* Basic arithmetic (`FADD`, `FSUB`, `FMUL`, `FDIV`)

---

### **3. Example: asinh(x)**

```asm
.data
x_val   REAL8  1.5
ln2     REAL8  0.6931471805599453
one     REAL8  1.0
asinh_res REAL8 ?

.code
main PROC
    fld x_val           ; ST0 = x
    fld st(0)           ; Duplicate x
    fmul st(0), st(0)   ; x^2
    fld1
    fadd                ; x^2 + 1
    fsqrt               ; sqrt(x^2 + 1)
    fadd                ; x + sqrt(x^2+1)

    ; Now compute ln(ST0)
    fld1
    fxch
    fyl2x               ; log₂(value)
    fld ln2
    fmul                ; ln(value)

    fstp asinh_res
    invoke ExitProcess, 0
main ENDP
END main
```

---

### **4. Example: acosh(x)**

```asm
fld x_val
fld st(0)
fmul st(0), st(0)    ; x^2
fld1
fsub                 ; x^2 - 1
fsqrt
fadd                 ; x + sqrt(x^2 - 1)
fld1
fxch
fyl2x
fld ln2
fmul
```

---

### **5. Example: atanh(x)**

```asm
fld1
fld x_val
fadd                 ; 1 + x
fld1
fld x_val
fsub                 ; 1 - x
fdiv                 ; (1+x)/(1-x)
fld1
fxch
fyl2x
fld ln2
fmul
fld1
fadd st(0), st(0)    ; ST(0) = 2.0, ST(1) = ln((1+x)/(1-x))
fdivp st(1), st(0)   ; ST(0) = ln(...) / 2 = atanh(x)
```

---

### **6. Notes**

* **Domain restrictions** must be respected:

  * `acosh(x)` → x ≥ 1
  * `atanh(x)` → |x| < 1
* For **large arguments**, `x²` can overflow before the square root is taken; rewriting `sqrt(x² ± 1)` as `|x| · sqrt(1 ± 1/x²)` avoids this.

⬇️

## **6. Special Constants in Computations**

The **x87 FPU** includes several built-in constants and provides instructions to load them directly,
avoiding the need to store them in memory.

---

### **1. Built-in Constants (x87)**

| Instruction | Loads into ST(0) | Value                 |
| ----------- | ---------------- | --------------------- |
| `FLD1`      | 1.0              | $1.0$                 |
| `FLDZ`      | 0.0              | $0.0$                 |
| `FLDPI`     | π                | $3.14159265358979...$ |
| `FLDL2E`    | log₂(e)          | $1.44269504088896...$ |
| `FLDL2T`    | log₂(10)         | $3.32192809488736...$ |
| `FLDLG2`    | log₁₀(2)         | $0.30102999566398...$ |
| `FLDLN2`    | ln(2)            | $0.69314718055995...$ |

---

### **2. Why They're Useful**

* **No memory access** → faster execution
* **Full precision** → avoids rounding errors from storing in memory
* **Convenient** for formulas involving π, ln(2), log₂(10), etc.

---

### **3. Example: Compute Circle Area**

Formula:

$$
A = \pi r^2
$$

Assembly:

```asm
.data
radius REAL8  2.5
area   REAL8  ?

.code
main PROC
    fld radius       ; ST0 = r
    fmul st(0), st(0) ; r²
    fldpi            ; ST0 = π, ST1 = r²
    fmulp st(1), st(0) ; π * r²
    fstp area
    invoke ExitProcess, 0
main ENDP
END main
```

---

### **4. Example: Convert Log Base 2 to Natural Log**

Formula:

$$
\ln(x) = \log_2(x) \cdot \ln(2)
$$

```asm
fld value       ; x
fld1
fxch
fyl2x           ; log₂(x)
fldln2          ; ln(2)
fmul            ; ln(x)
```

---

### **5. Caution**

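* The constants are **built into** the FPU; each load instruction always pushes the same value.
* Loading a constant **pushes a new value**: the previous `ST(0)` becomes `ST(1)`, and pushing onto a full stack (all eight registers in use) raises an invalid-operation exception (stack overflow).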

⬇️

## **7. Handling Floating-Point Exceptions**

The **x87 FPU** can detect and signal a variety of **floating-point exceptions** during computation.
These exceptions help programmers catch numerical errors or special conditions.

---

### **1. Types of Floating-Point Exceptions**

| Exception                | Cause                                      | Example                          |
| ------------------------ | ------------------------------------------ | -------------------------------- |
| **Invalid Operation**    | Undefined or unsupported operation         | √(-1), 0 ÷ 0                     |
| **Denormalized Operand** | Operand too small to be normalized         | Very small numbers close to zero |
| **Divide by Zero**       | Division by zero                           | 5 ÷ 0                            |
| **Overflow**             | Result too large to represent              | $1 \times 10^{5000}$             |
| **Underflow**            | Result too small to represent              | $1 \times 10^{-5000}$            |
| **Precision Loss**       | Rounding caused loss of significant digits | 1/3 in finite precision          |

---

### **2. FPU Control Word**

The **control word** determines which exceptions are **masked** (ignored) or **unmasked** (generate interrupts).

* **Mask bit = 1** → Exception is ignored, result may be special value (NaN, ±∞, etc.)
* **Mask bit = 0** → Exception triggers an interrupt

---

### **3. Default Behavior**

* By default, all exceptions are masked.
* The FPU returns **special values** like:

  * **NaN** (Not a Number) → Invalid result
  * **±∞** → Overflow or divide-by-zero
  * **Denormal or 0.0** → Underflow

---

### **4. Checking Exception Flags**

The **status word** contains bits indicating if an exception has occurred.

Example bits:

* **IE** → Invalid operation
* **ZE** → Divide by zero
* **OE** → Overflow
* **UE** → Underflow
* **PE** → Precision loss

---

### **5. Example: Detect Division by Zero**

```asm
fld1            ; 1.0
fldz            ; 0.0
fdiv            ; 1.0 / 0.0 → +∞ when masked, ZE flag set

fstsw ax        ; Store status word in AX
test al, 04h    ; ZE is bit 2 of the status word
; ZF = 0 here means a divide-by-zero was recorded
```

---

### **6. Practical Notes**

* For most applications, **masked exceptions** are fine — computations continue with special values.
* For safety-critical or scientific code, **unmasking** exceptions helps detect invalid states early.
* The `FSTCW` and `FLDCW` instructions control exception masking.

⬇️

## **8. Rounding Modes**

The **x87 Floating Point Unit (FPU)** supports different **rounding modes** for controlling how results are rounded when they cannot be represented exactly in the available precision.

---

### **1. Purpose of Rounding Modes**

When a calculation produces a result with more significant digits than the FPU's precision, the FPU must **round** it to fit.
The rounding mode determines **which direction** the value is adjusted.

---

### **2. Rounding Control Bits**

* Located in the **Control Word** (bits 10 and 11)
* Affect **all FPU arithmetic operations**
* Read with `FSTCW` (store control word) and changed with `FLDCW` (load control word)

---

### **3. Available Rounding Modes**

| Mode                                    | Bits | Description                       | Example (Round 2.7) |
| --------------------------------------- | ---- | --------------------------------- | ------------------- |
| **Round to Nearest (Even)** *(default)* | 00   | Round to nearest; ties go to even | 3.0                 |
| **Round Down (Floor)**                  | 01   | Round toward −∞                   | 2.0                 |
| **Round Up (Ceil)**                     | 10   | Round toward +∞                   | 3.0                 |
| **Truncate (Toward Zero)**              | 11   | Chop fractional part              | 2.0                 |

---

### **4. Example: Changing Rounding Mode**

```asm
fstcw  control_word      ; Save current control word
mov    ax, control_word
and    ax, 0F3FFh        ; Clear the RC field (bits 10-11)
or     ax, 0400h         ; Set RC = 01 (round down)
mov    control_word, ax
fldcw  control_word      ; Load new control word
```

---

### **5. Practical Usage**

* **Financial calculations** → Often use **Round to Nearest** for fairness.
* **Graphics / geometry** → Sometimes require **Truncate** to avoid overshooting.
* **Interval arithmetic** → May use **Round Up** or **Round Down** to keep safe bounds.
* **Debugging** → Changing rounding mode can help test numerical stability.

---

### **6. Special Note**

The rounding mode applies **globally** to all FPU instructions until changed again — so you must **restore the original control word** after temporary changes.
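
A save-and-restore pattern might look like the sketch below, which truncates a single conversion and then puts the caller's control word back. The variable names are assumptions made for the example.

```asm
; round_restore.asm (hypothetical file name)
.386
.model flat, stdcall
.stack 4096
ExitProcess PROTO :DWORD

.data
savedCW WORD ?
truncCW WORD ?
realVal REAL8 2.7
intVal  DWORD ?

.code
main PROC
    fstcw savedCW            ; remember the current control word
    mov   ax, savedCW
    or    ax, 0C00h          ; RC = 11 (truncate); OR suffices because both bits are set
    mov   truncCW, ax
    fldcw truncCW            ; switch to truncation

    fld   realVal            ; ST(0) = 2.7
    fistp intVal             ; stores 2 (fraction chopped) and pops

    fldcw savedCW            ; restore the original rounding mode
    invoke ExitProcess, 0
main ENDP
END main
```

Keeping the modified word in a separate variable leaves `savedCW` untouched, so the restore cannot accidentally reload the temporary setting.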


⬇️

## **C.1 Basics of SIMD**

---

### **1. What is SIMD?**

**SIMD** stands for **Single Instruction, Multiple Data** — a parallel processing technique where **one instruction** operates on **multiple pieces of data at the same time**.

Instead of doing:

```asm
; Scalar approach (one number at a time)
add eax, ebx
add ecx, edx
```

SIMD can do:

```asm
; SIMD approach (multiple numbers at once)
paddd xmm1, xmm2   ; Add 4 integers in parallel
```
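
From C, the same `paddd` operation is reachable through compiler intrinsics. This is a minimal sketch, assuming an x86 compiler that ships `<emmintrin.h>` (SSE2); the function name `add4_i32` is illustrative.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Add four 32-bit integers from a and b in one SIMD operation. */
void add4_i32(const int *a, const int *b, int *out)
{
    __m128i va = _mm_loadu_si128((const __m128i *)a);  /* unaligned 128-bit load */
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    __m128i vr = _mm_add_epi32(va, vb);                /* compiles to paddd      */
    _mm_storeu_si128((__m128i *)out, vr);              /* unaligned 128-bit store */
}
```

The unaligned load/store forms are used so the sketch works for any pointer; the aligned forms would require 16-byte-aligned arrays.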

---

### **2. Why SIMD Exists**

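* **Speed**: One SIMD instruction processes several data elements at once (for example, four 32-bit floats in a 128-bit register).
* **Efficiency**: Fewer instructions have to be fetched, decoded, and executed for the same amount of work.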
* **Optimized for multimedia**: Image, video, audio, and scientific computation often require the same operation on many values.

---

### **3. Common SIMD Instruction Sets**

| Instruction Set | Register Size | Data Types Supported                                          |
| --------------- | ------------- | ------------------------------------------------------------- |
| **MMX**         | 64-bit        | Packed integers                                               |
| **SSE / SSE2+** | 128-bit       | Single-precision floats (SSE); doubles and integers from SSE2 |
| **AVX / AVX2**  | 256-bit       | Floats and doubles (AVX); integers from AVX2                  |
| **AVX-512**     | 512-bit       | Floats, doubles, and integers                                 |

---

### **4. Example: Adding Arrays with SIMD**

**Scalar (regular)**:

```asm
mov eax, [a1]
add eax, [b1]
mov [c1], eax
; repeat for each element
```

**SIMD**:

```asm
movaps xmm0, [a]     ; Load 4 floats (a must be 16-byte aligned)
addps  xmm0, [b]     ; Add 4 floats at once (b must be 16-byte aligned)
movaps [c], xmm0     ; Store 4 results (c must be 16-byte aligned)
```
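
For arrays longer than four elements, the SIMD version becomes a loop that handles four floats per iteration plus a scalar tail. The following C sketch shows one way to write it, assuming an x86 compiler with `<xmmintrin.h>` (SSE); `add_arrays_f32` is a hypothetical name.

```c
#include <stddef.h>
#include <xmmintrin.h>  /* SSE intrinsics */

/* c[i] = a[i] + b[i] for i in [0, n), four floats at a time. */
void add_arrays_f32(const float *a, const float *b, float *c, size_t n)
{
    size_t i = 0;

    /* Main loop: one addps per four elements. */
    for (; i + 4 <= n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);
        __m128 vb = _mm_loadu_ps(b + i);
        _mm_storeu_ps(c + i, _mm_add_ps(va, vb));
    }

    /* Scalar tail for the remaining 0-3 elements. */
    for (; i < n; ++i)
        c[i] = a[i] + b[i];
}
```

The tail loop matters: without it, an array whose length is not a multiple of four would either lose elements or read past its end.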

---

### **5. When to Use SIMD**

* Processing large arrays of data
* Image/video/audio processing
* Cryptographic algorithms
* Scientific simulations
* Machine learning preprocessing

---

### **6. Limitations**

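* Aligned instructions such as `movaps` require data **aligned** to 16 bytes and fault otherwise; unaligned forms such as `movups` accept any address but can be slower on some CPUs.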
* Branch-heavy code does not benefit as much.
* Works best with large datasets and repetitive operations.
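
The alignment requirement can be checked at run time. The sketch below, assuming an x86 compiler with `<xmmintrin.h>`, picks the aligned load only when the pointer is 16-byte aligned; `is_aligned16`, `load4_any`, and `first_lane` are illustrative names.

```c
#include <stdint.h>
#include <xmmintrin.h>

/* Nonzero if p is a multiple of 16, as movaps / _mm_load_ps requires. */
int is_aligned16(const void *p)
{
    return ((uintptr_t)p & 15u) == 0;
}

/* Load four floats from any address, using the aligned form when legal. */
__m128 load4_any(const float *p)
{
    return is_aligned16(p) ? _mm_load_ps(p)    /* movaps */
                           : _mm_loadu_ps(p);  /* movups */
}

/* Return the lowest element of the four floats starting at p. */
float first_lane(const float *p)
{
    return _mm_cvtss_f32(load4_any(p));
}
```

In practice, declaring buffers with `_Alignas(16)` (C11) or allocating them with `aligned_alloc` avoids the check entirely.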

⬇️

## **C.2 SSE Registers**

---

### **1. What Are SSE Registers?**

**SSE (Streaming SIMD Extensions)** introduced a set of **128-bit registers** called **XMM registers**.
They are used to store multiple numbers and operate on them **in parallel**.

---

### **2. SSE Register Overview**

| Register Name    | Size                    | Purpose                                                     |
| ---------------- | ----------------------- | ----------------------------------------------------------- |
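| **XMM0 - XMM15** | 128-bit                 | Store integers or floating-point values for SIMD operations (only XMM0 - XMM7 in 32-bit mode) |
| **YMM0 - YMM15** | 256-bit (AVX extension) | Wider SIMD processing; the low 128 bits of each YMM register are the matching XMM register |
| **ZMM0 - ZMM31** | 512-bit (AVX-512)       | Ultra-wide SIMD processing; the low 256 bits of each ZMM register are the matching YMM register |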

---

### **3. Data Layout in SSE Registers**

A **128-bit XMM register** can hold:

* **4 x 32-bit floats**
* **2 x 64-bit doubles**
* **16 x 8-bit integers**
* **8 x 16-bit integers**
* **4 x 32-bit integers**

Example register layout (**4 floats**, most significant bits on the left; `float0` is the element at the lowest memory address):

```
| float3 | float2 | float1 | float0 |
|  96-127|  64-95 |  32-63 |   0-31 |
```
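
The lane order can be verified from C. This sketch assumes an x86 compiler with `<emmintrin.h>`; `xmm_float_lane` and `xmm_bytes` are hypothetical helpers. Note that `_mm_set_ps` takes its arguments from the highest lane to the lowest, mirroring the diagram above.

```c
#include <emmintrin.h>

/* Return lane i (0 = bits 0-31) of a register holding floats 0, 1, 2, 3. */
float xmm_float_lane(int i)
{
    float lanes[4];
    __m128 v = _mm_set_ps(3.0f, 2.0f, 1.0f, 0.0f);  /* float3, float2, float1, float0 */
    _mm_storeu_ps(lanes, v);                         /* lanes[0] receives bits 0-31    */
    return lanes[i & 3];
}

/* View the same 128 bits as 16 individual bytes (lowest byte first). */
void xmm_bytes(__m128 v, unsigned char out[16])
{
    _mm_storeu_si128((__m128i *)out, _mm_castps_si128(v));
}
```

Storing a register writes lane 0 to the lowest address, so `lanes[0]` is `float0` even though diagrams usually draw it on the right.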

---

### **4. Basic SSE Load/Store Instructions**

```asm
movaps xmm0, [a]     ; Load aligned 4 floats from memory
movups xmm1, [b]     ; Load unaligned 4 floats from memory
movaps [c], xmm0     ; Store result
```

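* **movaps** → Move Aligned Packed Single-precision (the memory address must be 16-byte aligned, otherwise the instruction faults)
* **movups** → Move Unaligned Packed Single-precision (works with any address)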

---

### **5. Basic SSE Arithmetic**

```asm
addps xmm0, xmm1     ; Add packed single-precision floats
subps xmm2, xmm3     ; Subtract packed floats
mulps xmm4, xmm5     ; Multiply packed floats
divps xmm6, xmm7     ; Divide packed floats
```
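
Each of these four instructions has a matching C intrinsic. The sketch below is an illustration, assuming an x86 compiler with `<xmmintrin.h>`; the `sse_op` enum and `arith4_f32` function are names invented for this example.

```c
#include <xmmintrin.h>

typedef enum { SSE_ADD, SSE_SUB, SSE_MUL, SSE_DIV } sse_op;

/* Apply one packed operation to four floats: out[i] = a[i] op b[i]. */
void arith4_f32(sse_op op, const float *a, const float *b, float *out)
{
    __m128 va = _mm_loadu_ps(a);
    __m128 vb = _mm_loadu_ps(b);
    __m128 r;

    switch (op) {
    case SSE_ADD: r = _mm_add_ps(va, vb); break;  /* addps */
    case SSE_SUB: r = _mm_sub_ps(va, vb); break;  /* subps */
    case SSE_MUL: r = _mm_mul_ps(va, vb); break;  /* mulps */
    default:      r = _mm_div_ps(va, vb); break;  /* divps */
    }

    _mm_storeu_ps(out, r);
}
```

As in the assembly form, the first operand is both a source and the destination of each instruction; the intrinsics hide that by returning a new value.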

---

### **6. Example: Vector Addition Using SSE**

```asm
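section .data
    align 16                       ; movaps requires 16-byte-aligned operands
    a   dd 1.0, 2.0, 3.0, 4.0
    b   dd 5.0, 6.0, 7.0, 8.0
    res dd 0.0, 0.0, 0.0, 0.0

section .text
global _start
_start:
    movaps xmm0, [a]     ; Load vector a
    movaps xmm1, [b]     ; Load vector b
    addps  xmm0, xmm1    ; Add a + b
    movaps [res], xmm0   ; Store result: 6.0, 8.0, 10.0, 12.0

    ; Exit cleanly (assumes 64-bit Linux; other targets need a different exit sequence)
    mov    eax, 60       ; sys_exit
    xor    edi, edi      ; status 0
    syscall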
```

This adds **4 floating-point numbers in parallel**.

---

### **7. Key Points**

* SSE registers are **128-bit wide**.
* XMM registers allow **parallel arithmetic**.
* Alignment matters for **fast access**.
* Extended sets (YMM/ZMM) come with AVX/AVX-512.

⬇️

## **C.3 SIMD Instructions**

---

### **1. What Are SIMD Instructions?**

**SIMD (Single Instruction, Multiple Data)** instructions let the CPU perform the **same operation** on multiple pieces of data **at the same time**.
They are heavily used in:

* Image processing
* Audio/video encoding
* Scientific simulations
* Cryptography

---

### **2. SIMD Instruction Categories**

| Category                | Example Mnemonics                  | Description                              |
| ----------------------- | ---------------------------------- | ---------------------------------------- |
| **Load/Store**          | `movaps`, `movups`, `movdqa`       | Move data between memory and registers   |
| **Arithmetic**          | `addps`, `subps`, `mulps`, `divps` | Parallel add, subtract, multiply, divide |
| **Logical**             | `andps`, `orps`, `xorps`           | Bitwise operations on packed values      |
| **Comparison**          | `cmpps`, `cmpeqps`                 | Compare packed floats                    |
| **Shuffling/Permuting** | `shufps`, `unpcklps`               | Rearrange data within registers          |
| **Conversion**          | `cvtsi2ss`, `cvtps2pd`             | Convert between types                    |
| **Specialized**         | `sqrtps`, `maxps`, `minps`         | Math functions in parallel               |
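
Three of the less obvious categories (comparison with logical masking, shuffling, and conversion) can be sketched in C as follows. The example assumes an x86 compiler with `<emmintrin.h>` (SSE2); the function names `relu4_f32`, `reverse4_f32`, and `i32_to_f32x4` are illustrative.

```c
#include <emmintrin.h>

/* Comparison + logical: replace negative elements with 0 without branching. */
void relu4_f32(const float *in, float *out)
{
    __m128 v    = _mm_loadu_ps(in);
    __m128 mask = _mm_cmpgt_ps(v, _mm_setzero_ps());  /* cmpps: all ones where v > 0 */
    _mm_storeu_ps(out, _mm_and_ps(v, mask));          /* andps keeps masked lanes    */
}

/* Shuffle: reverse the order of the four elements. */
void reverse4_f32(const float *in, float *out)
{
    __m128 v = _mm_loadu_ps(in);
    _mm_storeu_ps(out, _mm_shuffle_ps(v, v, _MM_SHUFFLE(0, 1, 2, 3)));  /* shufps */
}

/* Conversion: four 32-bit integers to four single-precision floats. */
void i32_to_f32x4(const int *in, float *out)
{
    __m128i vi = _mm_loadu_si128((const __m128i *)in);
    _mm_storeu_ps(out, _mm_cvtepi32_ps(vi));          /* cvtdq2ps */
}
```

The compare-then-mask pattern is how SIMD code expresses an `if` per element: the comparison produces an all-ones or all-zeros mask for each lane, and the logical instruction applies it.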

---

### **3. SIMD Data Types**

SSE and later extensions allow operations on:

* **Packed single-precision floats** (32-bit x 4 in XMM)
* **Packed double-precision floats** (64-bit x 2 in XMM)
* **Packed integers** (8-bit, 16-bit, 32-bit, 64-bit)

---

### **4. Example: SIMD Vector Addition**

```asm
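section .data
    align 16                       ; movaps requires 16-byte-aligned operands
    A   dd 1.0, 2.0, 3.0, 4.0
    B   dd 5.0, 6.0, 7.0, 8.0
    RES dd 0.0, 0.0, 0.0, 0.0

section .text
global _start
_start:
    movaps xmm0, [A]     ; Load vector A
    movaps xmm1, [B]     ; Load vector B
    addps  xmm0, xmm1    ; A + B in parallel
    movaps [RES], xmm0   ; Store result: 6.0, 8.0, 10.0, 12.0

    ; Exit cleanly (assumes 64-bit Linux; other targets need a different exit sequence)
    mov    eax, 60       ; sys_exit
    xor    edi, edi      ; status 0
    syscall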
```

* Performs **4 additions** with a **single instruction** instead of four separate scalar additions.

---

### **5. Example: SIMD Max of Two Vectors**

```asm
movaps xmm2, [A]
movaps xmm3, [B]
maxps  xmm2, xmm3    ; Select max of each element
```
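
The equivalent C intrinsic is `_mm_max_ps`. A minimal sketch, assuming an x86 compiler with `<xmmintrin.h>`; `max4_f32` is a hypothetical wrapper name.

```c
#include <xmmintrin.h>

/* out[i] = max(a[i], b[i]) for four floats, using one maxps. */
void max4_f32(const float *a, const float *b, float *out)
{
    __m128 va = _mm_loadu_ps(a);
    __m128 vb = _mm_loadu_ps(b);
    _mm_storeu_ps(out, _mm_max_ps(va, vb));  /* compiles to maxps */
}
```

No comparison or branch is needed per element; the selection happens inside the instruction.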

---

### **6. Why SIMD is Powerful**

* Processes **large datasets faster**.
* Reduces loop iterations.
* Needs fewer instructions for the same work. The amount of data moved is unchanged, so loops limited by memory bandwidth gain less.
* Modern CPUs have **256-bit (AVX)** or **512-bit (AVX-512)** SIMD registers.
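
How SIMD cuts loop iterations can be seen in a simple reduction. The C sketch below sums an array with four running partial sums in one XMM register, so the main loop runs roughly n/4 times. It assumes an x86 compiler with `<xmmintrin.h>`; `sum_f32` is an illustrative name.

```c
#include <stddef.h>
#include <xmmintrin.h>

/* Sum n floats using four parallel partial sums. */
float sum_f32(const float *x, size_t n)
{
    __m128 acc = _mm_setzero_ps();
    size_t i = 0;

    /* Each iteration adds four elements to the four lane accumulators. */
    for (; i + 4 <= n; i += 4)
        acc = _mm_add_ps(acc, _mm_loadu_ps(x + i));

    /* Combine the four lanes into one scalar. */
    float lanes[4];
    _mm_storeu_ps(lanes, acc);
    float total = (lanes[0] + lanes[1]) + (lanes[2] + lanes[3]);

    /* Scalar tail for the remaining 0-3 elements. */
    for (; i < n; ++i)
        total += x[i];

    return total;
}
```

Because the additions happen in a different order than in a plain scalar loop, the floating-point result can differ in the last bits for inputs that are not exactly representable.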