In [None]:
'''
 * Copyright (c) 2016 Radhamadhab Dalai
 *
 * Permission is hereby granted, free of charge, to any person obtaining a copy
 * of this software and associated documentation files (the "Software"), to deal
 * in the Software without restriction, including without limitation the rights
 * to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
 * copies of the Software, and to permit persons to whom the Software is
 * furnished to do so, subject to the following conditions:
 *
 * The above copyright notice and this permission notice shall be included in
 * all copies or substantial portions of the Software.
 *
 * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
 * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
 * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
 * AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
 * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
 * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
 * THE SOFTWARE.
'''

## Chapter 5: Vector Calculus

Many algorithms in machine learning optimize an objective function with respect to a set of desired model parameters that control how well a model explains the data: Finding good parameters can be phrased as an optimization problem (see Sections 8.2 and 8.3). Examples include:

(i) linear regression (see Chapter 9), where we look at curve-fitting problems and optimize linear weight parameters to maximize the likelihood;

(ii) neural-network auto-encoders for dimensionality reduction and data compression, where the parameters are the weights and biases of each layer, and where we minimize a reconstruction error by repeated application of the chain rule;

(iii) Gaussian mixture models (see Chapter 11) for modeling data distributions, where we optimize the location and shape parameters of each mixture component to maximize the likelihood of the model.

Figure 5.1 illustrates some of these problems, which we typically solve by using optimization algorithms that exploit gradient information (Section 7.1). Figure 5.2 gives an overview of how concepts in this chapter are related and how they are connected to other chapters of the book.

Central to this chapter is the concept of a function. A function $ f $ is a quantity that relates two quantities to each other. In this book, these quantities are typically inputs $ x \in \mathbb{R}^D $ and targets (function values) $ f(x) $, which we assume are real-valued if not stated otherwise. Here $ \mathbb{R}^D $ is the domain of $ f $, and the function values $ f(x) $ are the image/codomain of $ f $.

![image.png](attachment:image.png)

## Figure 5.1: Vector Calculus Applications

Vector calculus plays a central role in (a) regression (curve fitting) and (b) density estimation, i.e., modeling data distributions.

**(a)** Regression problem: Find parameters, such that the curve explains the observations (crosses) well.

**(b)** Density estimation with a Gaussian mixture model: Find means and covariances, such that the data (dots) can be explained well.

![image-2.png](attachment:image-2.png)

## Figure 5.2: Concept Map

Figure 5.2 provides a mind map of the concepts introduced in this chapter, along with where they are used in other parts of the book:

- **Difference quotient**: Defined in this chapter.
- **Partial derivatives**: Used in Chapter 7 (Optimization).
- **Jacobian**: Collected in this chapter, used in Chapter 10 (Dimensionality reduction).
- **Hessian**: Used in Chapter 11 (Density estimation).
- **Taylor series**: Used in Chapter 12 (Classification).
- **Applications**:
  - Chapter 9: Regression.
  - Chapter 10: Dimensionality reduction.
  - Chapter 11: Density estimation.
  - Chapter 12: Classification.

Section 2.7.3 provides a much more detailed discussion of domain, image, and codomain in the context of linear functions.

We often write

$$
f: \mathbb{R}^D \to \mathbb{R} \quad \text{(5.1a)}
$$

$$
x \mapsto f(x) \quad \text{(5.1b)}
$$

to specify a function, where (5.1a) specifies that $ f $ is a mapping from $ \mathbb{R}^D $ to $ \mathbb{R} $ and (5.1b) specifies the explicit assignment of an input $ x $ to a function value $ f(x) $. A function $ f $ assigns every input $ x $ exactly one function value $ f(x) $.

### Example 5.1

Recall the dot product as a special case of an inner product (Section 3.2). In the previous notation, the function $ f(x) = x^\top x $, $ x \in \mathbb{R}^2 $, would be specified as

$$
f: \mathbb{R}^2 \to \mathbb{R} \quad \text{(5.2a)}
$$

$$
x \mapsto x_1^2 + x_2^2. \quad \text{(5.2b)}
$$

In this chapter, we will discuss how to compute gradients of functions, which is often essential to facilitate learning in machine learning models since the gradient points in the direction of steepest ascent.

In [1]:
import math

# --- Matrix Operations ---
def transpose(A):
    """
    Compute the transpose of matrix A.
    """
    m, n = len(A), len(A[0])
    return [[A[j][i] for j in range(m)] for i in range(n)]

def matrix_multiply(A, B):
    """
    Multiply two matrices A (m x n) and B (n x p).
    """
    # A is m x n, B is n x p, so the product is m x p
    m, n, p = len(A), len(B), len(B[0])
    result = [[0 for _ in range(p)] for _ in range(m)]
    for i in range(m):
        for j in range(p):
            result[i][j] = sum(A[i][k] * B[k][j] for k in range(n))
    return result

def matrices_equal(A, B, tol=1e-6):
    """
    Check if two matrices are equal within a tolerance.
    """
    return all(abs(A[i][j] - B[i][j]) < tol for i in range(len(A)) for j in range(len(A[0])))

def dot_product(x, y):
    """
    Compute the dot product of two vectors.
    """
    return sum(xi * yi for xi, yi in zip(x, y))

# --- Matrix Classifier Class ---
class MatrixClassifier:
    def __init__(self, A):
        self.A = A
        self.m = len(A)
        self.n = len(A[0]) if A else 0
        self.A_T = transpose(A) if A else []

    def is_square(self):
        """
        Check if the matrix is square (m = n).
        """
        return self.m == self.n

    def determinant_2x2(self):
        """
        Compute determinant for a 2x2 matrix.
        """
        if self.m != 2 or self.n != 2:
            raise ValueError("Matrix must be 2x2")
        return self.A[0][0] * self.A[1][1] - self.A[0][1] * self.A[1][0]

    def is_invertible(self):
        """
        Check if the matrix is invertible (nonzero determinant, square matrix only).
        Simplified for 2x2 matrices.
        """
        if not self.is_square():
            return False
        if self.m == 2:
            det = self.determinant_2x2()
            return abs(det) > 1e-6
        # For larger matrices, determinant computation is complex without libraries
        return None  # Placeholder for non-2x2 matrices

    def is_symmetric(self):
        """
        Check if the matrix is symmetric (A = A^T).
        """
        if not self.is_square():
            return False
        return matrices_equal(self.A, self.A_T)

    def is_normal(self):
        """
        Check if the matrix is normal (A^T A = A A^T).
        """
        if not self.is_square():
            return False
        A_T_A = matrix_multiply(self.A_T, self.A)
        A_A_T = matrix_multiply(self.A, self.A_T)
        return matrices_equal(A_T_A, A_A_T)

    def is_orthogonal(self):
        """
        Check if the matrix is orthogonal (A^T A = A A^T = I).
        """
        if not self.is_square():
            return False
        n = self.n
        I = [[1 if i == j else 0 for j in range(n)] for i in range(n)]
        A_T_A = matrix_multiply(self.A_T, self.A)
        return matrices_equal(A_T_A, I) and matrices_equal(matrix_multiply(self.A, self.A_T), I)

    def is_diagonal(self):
        """
        Check if the matrix is diagonal (non-diagonal entries are zero).
        """
        if not self.is_square():
            return False
        for i in range(self.m):
            for j in range(self.n):
                if i != j and abs(self.A[i][j]) > 1e-6:
                    return False
        return True

    def is_identity(self):
        """
        Check if the matrix is the identity matrix.
        """
        if not self.is_diagonal():
            return False
        return all(abs(self.A[i][i] - 1) < 1e-6 for i in range(self.m))

    def is_positive_definite(self):
        """
        Check if the matrix is positive definite (x^T A x > 0 for all x ≠ 0).
        Simplified: check if symmetric and all diagonal entries are positive (for diagonal matrices).
        Full check requires eigenvalues, which is complex without libraries.
        """
        if not self.is_symmetric():
            return False
        if self.is_diagonal():
            return all(self.A[i][i] > 0 for i in range(self.m))
        # For non-diagonal matrices, we'd need eigenvalues
        return None  # Placeholder

    def classify_matrix(self):
        """
        Classify the matrix according to the phylogeny in Figure 4.13.
        """
        print(f"Classifying Matrix (shape {self.m}x{self.n}):")
        for row in self.A:
            print(row)

        properties = []
        operations = []

        # Step 1: Real matrix
        properties.append("Real matrix")

        # Step 2: Square or non-square
        if self.is_square():
            properties.append("Square")
            # Check invertibility
            invertible = self.is_invertible()
            if invertible is True:
                properties.append("Invertible (Regular)")
                operations.append("Inverse exists")
            elif invertible is False:
                properties.append("Singular (det = 0)")

            # Check for normal matrix
            if self.is_normal():
                properties.append("Normal")
                # Check for orthogonal matrix
                if self.is_orthogonal():
                    properties.append("Orthogonal")
                    properties.append("Rotation matrix (if det = 1)")
                    operations.append("A^T = A^-1")

                # Check for symmetric matrix
                if self.is_symmetric():
                    properties.append("Symmetric")
                    operations.append("Eigenvalues are real")
                    # Check for positive definite
                    pd = self.is_positive_definite()
                    if pd is True:
                        properties.append("Positive definite")
                        operations.append("Cholesky decomposition exists")
                        operations.append("Eigenvalues > 0")
                    elif pd is False:
                        properties.append("Not positive definite")

                    # Check for diagonal matrix
                    if self.is_diagonal():
                        properties.append("Diagonal")
                        # Check for identity matrix
                        if self.is_identity():
                            properties.append("Identity")

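            # Eigendecomposition (heuristic only): invertibility does not determine
            # whether a matrix is defective, so the labels below are rough placeholders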
            if invertible is not None and invertible:
                properties.append("Likely non-defective (simplified check)")
                operations.append("Eigendecomposition likely exists")
            else:
                properties.append("Possibly defective (simplified check)")

        else:
            properties.append("Nonsquare")
            operations.append("SVD exists")
            operations.append("Pseudo-inverse exists")

        # Print classification
        print("\nProperties:")
        for prop in properties:
            print(f"- {prop}")

        print("\nOperations/Characteristics:")
        for op in operations:
            print(f"- {op}")

# --- Demonstration ---
def demonstrate_matrix_phylogeny():
    """
    Demonstrate matrix classification using examples.
    """
    print("=== Matrix Phylogeny Classification ===")
    print("Section 4.7: Classifying Matrices per Figure 4.13\n")

    # Test Case 1: Identity Matrix (2x2)
    print("Test Case 1: Identity Matrix")
    A1 = [[1, 0], [0, 1]]
    classifier1 = MatrixClassifier(A1)
    classifier1.classify_matrix()

    # Test Case 2: Symmetric Positive Definite Matrix (2x2)
    print("\nTest Case 2: Symmetric Positive Definite Matrix")
    A2 = [[2, 1], [1, 2]]
    classifier2 = MatrixClassifier(A2)
    classifier2.classify_matrix()

    # Test Case 3: Non-square Matrix (2x3)
    print("\nTest Case 3: Non-square Matrix")
    A3 = [[1, 2, 3], [4, 5, 6]]
    classifier3 = MatrixClassifier(A3)
    classifier3.classify_matrix()

    # Test Case 4: Orthogonal Matrix (Rotation by 90 degrees, 2x2)
    print("\nTest Case 4: Orthogonal Matrix (Rotation by 90 degrees)")
    A4 = [[0, -1], [1, 0]]
    classifier4 = MatrixClassifier(A4)
    classifier4.classify_matrix()

# --- Main Execution ---
if __name__ == "__main__":
    print("Matrix Phylogeny Analysis")
    print("=" * 60)

    # Run demonstration
    demonstrate_matrix_phylogeny()

    print("\n" + "=" * 60)
    print("Summary of Key Results:")
    print("• Classified matrices into square/non-square, normal, symmetric, orthogonal, etc.")
    print("• Identified applicable operations (SVD, eigendecomposition, Cholesky, etc.)")
    print("• Demonstrated the phylogenetic relationships as per Figure 4.13")

Matrix Phylogeny Analysis
=== Matrix Phylogeny Classification ===
Section 4.7: Classifying Matrices per Figure 4.13

Test Case 1: Identity Matrix
Classifying Matrix (shape 2x2):
[1, 0]
[0, 1]

Properties:
- Real matrix
- Square
- Invertible (Regular)
- Normal
- Orthogonal
- Rotation matrix (if det = 1)
- Symmetric
- Positive definite
- Diagonal
- Identity
- Likely non-defective (simplified check)

Operations/Characteristics:
- Inverse exists
- A^T = A^-1
- Eigenvalues are real
- Cholesky decomposition exists
- Eigenvalues > 0
- Eigendecomposition likely exists

Test Case 2: Symmetric Positive Definite Matrix
Classifying Matrix (shape 2x2):
[2, 1]
[1, 2]

Properties:
- Real matrix
- Square
- Invertible (Regular)
- Normal
- Symmetric
- Likely non-defective (simplified check)

Operations/Characteristics:
- Inverse exists
- Eigenvalues are real
- Eigendecomposition likely exists

Test Case 3: Non-square Matrix
Classifying Matrix (shape 2x3):
[1, 2, 3]
[4, 5, 6]

Properties:
- Real matrix
-

![image.png](attachment:image.png)

#### Figure 5.3: Average Incline of a Function

Figure 5.3 illustrates the average incline of a function $ f $ between $ x_0 $ and $ x_0 + \delta x $, which is the incline of the secant (blue) through $ f(x_0) $ and $ f(x_0 + \delta x) $ and given by $ \delta y / \delta x $.

Vector calculus is one of the fundamental mathematical tools we need in machine learning. Throughout this book, we assume that functions are differentiable. With some additional technical definitions, which we do not cover here, many of the approaches presented can be extended to sub-differentials (functions that are continuous but not differentiable at certain points). We will look at an extension to the case of functions with constraints in Chapter 7.

## 5.1 Differentiation of Univariate Functions

In the following, we briefly revisit differentiation of a univariate function, which may be familiar from high school mathematics. We start with the difference quotient of a univariate function $ y = f(x) $, $ x, y \in \mathbb{R} $, which we will subsequently use to define derivatives.

### Definition 5.1 (Difference Quotient)

The difference quotient

$$
\frac{\delta y}{\delta x} := \frac{f(x + \delta x) - f(x)}{\delta x} \quad \text{(5.3)}
$$

computes the slope of the secant line through two points on the graph of $ f $. In Figure 5.3, these are the points with x-coordinates $ x_0 $ and $ x_0 + \delta x $. The difference quotient can also be considered the average slope of $ f $ between $ x $ and $ x + \delta x $ if we assume $ f $ to be a linear function.

In the limit for $ \delta x \to 0 $, we obtain the tangent of $ f $ at $ x $, if $ f $ is differentiable. The tangent is then the derivative of $ f $ at $ x $.

### Definition 5.2 (Derivative)

More formally, for $ h > 0 $ the derivative of $ f $ at $ x $ is defined as the limit

$$
\frac{df}{dx} := \lim_{h \to 0} \frac{f(x + h) - f(x)}{h}, \quad \text{(5.4)}
$$

and the secant in Figure 5.3 becomes a tangent. The derivative of $ f $ points in the direction of steepest ascent of $ f $.
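The limit in (5.4) can be watched numerically. The following is a minimal sketch, assuming the illustrative choice $ f(x) = \sin(x) $ at $ x_0 = 1 $ (neither is taken from the text); the helper name `difference_quotient` is likewise hypothetical. It evaluates the difference quotient (5.3) for shrinking step sizes and compares it with the known derivative $ \cos(x_0) $.

```python
import math

def difference_quotient(f, x, h):
    """Slope of the secant through (x, f(x)) and (x + h, f(x + h)), as in (5.3)."""
    return (f(x + h) - f(x)) / h

# Illustrative choice (not from the text): f(x) = sin(x), whose derivative is cos(x)
x0 = 1.0
exact = math.cos(x0)

errors = []
for h in [1e-1, 1e-2, 1e-3, 1e-4]:
    approx = difference_quotient(math.sin, x0, h)
    errors.append(abs(approx - exact))
    print(f"h = {h:.0e}: secant slope = {approx:.6f}, error = {errors[-1]:.2e}")
```

For this function the error shrinks roughly in proportion to $ h $, which is the behavior one would expect from a forward difference.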

### Example 5.2 (Derivative of a Polynomial)

We want to compute the derivative of $ f(x) = x^n $, $ n \in \mathbb{N} $. We may already know that the answer will be $ n x^{n-1} $, but we want to derive this result using the definition of the derivative as the limit of the difference quotient. Using the definition of the derivative in (5.4), we obtain

$$
\frac{df}{dx} = \lim_{h \to 0} \frac{f(x + h) - f(x)}{h} \quad \text{(5.5a)}
$$

$$
= \lim_{h \to 0} \frac{(x + h)^n - x^n}{h} \quad \text{(5.5b)}
$$

$$
= \lim_{h \to 0} \frac{\sum_{i=0}^{n} \binom{n}{i} x^{n-i} h^i - x^n}{h}. \quad \text{(5.5c)}
$$

We see that $ x^n = \binom{n}{0} x^{n-0} h^0 $. By starting the sum at 1, the $ x^n $-term cancels, and we obtain

$$
\frac{df}{dx} = \lim_{h \to 0} \frac{\sum_{i=1}^{n} \binom{n}{i} x^{n-i} h^i}{h} \quad \text{(5.6a)}
$$

$$
= \lim_{h \to 0} \left( \sum_{i=1}^{n} \binom{n}{i} x^{n-i} h^{i-1} \right) \quad \text{(5.6b)}
$$

$$
= \lim_{h \to 0} \left( \binom{n}{1} x^{n-1} + \sum_{i=2}^{n} \binom{n}{i} x^{n-i} h^{i-1} \right) \quad \text{(5.6c)}
$$

$$
= \binom{n}{1} x^{n-1} + \underbrace{\lim_{h \to 0} \sum_{i=2}^{n} \binom{n}{i} x^{n-i} h^{i-1}}_{ \to 0 \text{ as } h \to 0} = \frac{n!}{1!(n-1)!} x^{n-1} = n x^{n-1}. \quad \text{(5.6d)}
$$
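The cancellation argument in (5.6a) to (5.6d) can be checked with a few lines of Python. This is only a sketch under assumed values ($ x = 2 $, $ n = 5 $, chosen for illustration); `binomial_quotient` is a hypothetical helper that evaluates the sum in (5.6b) directly.

```python
from math import comb

def binomial_quotient(x, n, h):
    """Evaluate the sum in (5.6b): sum_{i=1}^{n} C(n, i) * x^(n-i) * h^(i-1)."""
    return sum(comb(n, i) * x ** (n - i) * h ** (i - 1) for i in range(1, n + 1))

# Illustrative values (assumptions, not from the text)
x, n = 2.0, 5
target = n * x ** (n - 1)  # n * x^(n-1)

for h in [1e-1, 1e-2, 1e-3]:
    value = binomial_quotient(x, n, h)
    print(f"h = {h:.0e}: sum = {value:.6f}, n*x^(n-1) = {target:.6f}")
```

Only the $ i = 1 $ term is free of $ h $, so the printed sums approach $ n x^{n-1} $ as $ h $ shrinks.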

## 5.1.1 Taylor Series

The Taylor series is a representation of a function $ f $ as an infinite sum of terms. These terms are determined using derivatives of $ f $ evaluated at $ x_0 $.

### Definition 5.3 (Taylor Polynomial)

The Taylor polynomial of degree $ n $ of $ f: \mathbb{R} \to \mathbb{R} $ at $ x_0 $ is defined as

$$
T_n(x) := \sum_{k=0}^{n} \frac{f^{(k)}(x_0)}{k!} (x - x_0)^k, \quad \text{(5.7)}
$$

where $ f^{(k)}(x_0) $ is the $ k $-th derivative of $ f $ at $ x_0 $ (which we assume exists) and $ \frac{f^{(k)}(x_0)}{k!} $ are the coefficients of the polynomial.

### Definition 5.4 (Taylor Series)

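For a smooth function $ f \in C^\infty $, $ f: \mathbb{R} \to \mathbb{R} $, the Taylor series of $ f $ at $ x_0 $ is defined as

$$
T_\infty(x) = \sum_{k=0}^\infty \frac{f^{(k)}(x_0)}{k!} (x - x_0)^k. \quad \text{(5.8)}
$$

For $ x_0 = 0 $, we obtain the Maclaurin series as a special instance of the Taylor series.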

![image-2.png](attachment:image-2.png)

## Figure 5.4: Taylor Polynomials

Figure 5.4 illustrates the original function $ f(x) = \sin(x) + \cos(x) $ (black, solid), which is approximated by Taylor polynomials (dashed) around $ x_0 = 0 $. Higher-order Taylor polynomials approximate the function $ f $ better and more globally. $ T_{10} $ is already similar to $ f $ in $ [-4, 4] $.

### Example 5.4 (Taylor Series)

Consider the function in Figure 5.4 given by

$$
f(x) = \sin(x) + \cos(x) \in C^\infty. \quad \text{(5.19)}
$$

We seek a Taylor series expansion of $ f $ at $ x_0 = 0 $, which is the Maclaurin series expansion of $ f $. We obtain the following derivatives:

$$
f(0) = \sin(0) + \cos(0) = 1 \quad \text{(5.20)}
$$

$$
f'(0) = \cos(0) - \sin(0) = 1 \quad \text{(5.21)}
$$

$$
f''(0) = -\sin(0) - \cos(0) = -1 \quad \text{(5.22)}
$$

$$
f^{(3)}(0) = -\cos(0) + \sin(0) = -1 \quad \text{(5.23)}
$$

$$
f^{(4)}(0) = \sin(0) + \cos(0) = f(0) = 1 \quad \text{(5.24)}
$$

$ \vdots $

We can see a pattern here: The coefficients in our Taylor series are only $ \pm 1 $ (since $ \sin(0) = 0 $), each of which occurs twice before switching to the other one. Furthermore, $ f^{(k+4)}(0) = f^{(k)}(0) $. Therefore, the full Taylor series expansion of $ f $ at $ x_0 = 0 $ is given by

$$
T_\infty(x) = \sum_{k=0}^\infty \frac{f^{(k)}(x_0)}{k!} (x - x_0)^k \quad \text{(5.25a)}
$$

$$
= 1 + x - \frac{1}{2!} x^2 - \frac{1}{3!} x^3 + \frac{1}{4!} x^4 + \frac{1}{5!} x^5 - \cdots \quad \text{(5.25b)}
$$

$$
= \left(1 - \frac{1}{2!} x^2 + \frac{1}{4!} x^4 \mp \cdots\right) + \left(x - \frac{1}{3!} x^3 + \frac{1}{5!} x^5 \mp \cdots\right) \quad \text{(5.25c)}
$$

$$
= \sum_{k=0}^\infty (-1)^k \frac{x^{2k}}{(2k)!} + \sum_{k=0}^\infty (-1)^k \frac{x^{2k+1}}{(2k+1)!} \quad \text{(5.25d)}
$$

$$
= \cos(x) + \sin(x), \quad \text{(5.25e)}
$$


In [2]:
# --- Vector Operations ---
def dot_product(x, y):
    """
    Compute the dot product of two vectors.
    """
    return sum(xi * yi for xi, yi in zip(x, y))

# --- Function Class to Represent f: R^D -> R ---
class Function:
    def __init__(self, domain_dim, codomain_dim=1):
        """
        Initialize a function f: R^D -> R.
        domain_dim: Dimension D of the input space (R^D).
        codomain_dim: Dimension of the output space (R, so 1).
        """
        self.domain_dim = domain_dim
        self.codomain_dim = codomain_dim

    def evaluate(self, x):
        """
        Evaluate the function at input x.
        Must be implemented by subclasses.
        """
        raise NotImplementedError("Evaluation not implemented")

    def describe(self):
        """
        Describe the function's domain and codomain.
        """
        return f"f: R^{self.domain_dim} -> R"

# --- Example Function f(x) = x^T x (Example 5.1) ---
class DotProductFunction(Function):
    def __init__(self):
        super().__init__(domain_dim=2)  # R^2 -> R

    def evaluate(self, x):
        """
        Evaluate f(x) = x^T x = x1^2 + x2^2 for x in R^2.
        """
        if len(x) != self.domain_dim:
            raise ValueError(f"Input must be in R^{self.domain_dim}")
        return dot_product(x, x)

    def describe_mapping(self):
        """
        Describe the mapping x |-> x1^2 + x2^2.
        """
        return "x |-> x1^2 + x2^2"

# --- Concept Map Representation ---
def display_concept_map():
    """
    Display the mind map of concepts from Figure 5.2.
    """
    concepts = {
        "Difference quotient": {"Defined": "Chapter 5"},
        "Partial derivatives": {"Used in": "Chapter 7 (Optimization)"},
        "Jacobian": {"Collected": "Chapter 5", "Used in": "Chapter 10 (Dimensionality reduction)"},
        "Hessian": {"Used in": "Chapter 11 (Density estimation)"},
        "Taylor series": {"Used in": "Chapter 12 (Classification)"},
        "Applications": {
            "Regression": "Chapter 9",
            "Dimensionality reduction": "Chapter 10",
            "Density estimation": "Chapter 11",
            "Classification": "Chapter 12"
        }
    }

    print("=== Figure 5.2: Concept Map ===")
    print("Concepts introduced in Chapter 5 and their connections:\n")
    for concept, details in concepts.items():
        print(f"{concept}:")
        for key, value in details.items():
            print(f"  - {key}: {value}")
        print()

# --- Demonstration ---
def demonstrate_vector_calculus_concepts():
    """
    Demonstrate the concepts introduced in the text:
    - Function definition (Equations 5.1a–b)
    - Example 5.1: f(x) = x^T x (Equations 5.2a–b)
    - Concept map (Figure 5.2)
    """
    print("=== Vector Calculus Concepts ===")
    print("Chapter 5: Introduction to Functions\n")

    # Step 1: Define and describe a function f: R^D -> R (Equations 5.1a–b)
    print("Function Definition (Equations 5.1a–b):")
    f_generic = Function(domain_dim=3)  # Example with D=3
    print(f_generic.describe())
    print("x |-> f(x) (general mapping)\n")

    # Step 2: Example 5.1: f(x) = x^T x for x in R^2 (Equations 5.2a–b)
    print("Example 5.1: Dot Product Function")
    f_dot = DotProductFunction()
    print(f"Function: {f_dot.describe()}")
    print(f"Mapping: {f_dot.describe_mapping()}")

    # Evaluate the function at some points
    test_points = [[1, 0], [0, 1], [1, 1], [2, 3]]
    print("\nEvaluating f(x) = x^T x at test points:")
    for x in test_points:
        result = f_dot.evaluate(x)
        print(f"x = {x}, f(x) = {result:.4f}")

    # Step 3: Display the concept map (Figure 5.2)
    print("\nDisplaying Concept Map (Figure 5.2):")
    display_concept_map()

    # Step 4: Introduction to gradients (placeholder for future gradient computation)
    print("\nIntroduction to Gradients:")
    print("The gradient of f(x) = x^T x in R^2 is ∇f(x) = [2x1, 2x2] (to be computed in later sections).")
    print("Gradients point in the direction of steepest ascent, crucial for optimization in machine learning.")

# --- Main Execution ---
if __name__ == "__main__":
    print("Vector Calculus Analysis")
    print("=" * 60)

    # Run demonstration
    demonstrate_vector_calculus_concepts()

    print("\n" + "=" * 60)
    print("Summary of Key Results:")
    print("• Defined functions f: R^D -> R with domain and codomain")
    print("• Implemented f(x) = x^T x for x in R^2 (Example 5.1)")
    print("• Displayed concept map linking vector calculus to machine learning applications")
    print("• Introduced gradients as a foundation for optimization")

Vector Calculus Analysis
=== Vector Calculus Concepts ===
Chapter 5: Introduction to Functions

Function Definition (Equations 5.1a–b):
f: R^3 -> R
x |-> f(x) (general mapping)

Example 5.1: Dot Product Function
Function: f: R^2 -> R
Mapping: x |-> x1^2 + x2^2

Evaluating f(x) = x^T x at test points:
x = [1, 0], f(x) = 1.0000
x = [0, 1], f(x) = 1.0000
x = [1, 1], f(x) = 2.0000
x = [2, 3], f(x) = 13.0000

Displaying Concept Map (Figure 5.2):
=== Figure 5.2: Concept Map ===
Concepts introduced in Chapter 5 and their connections:

Difference quotient:
  - Defined: Chapter 5

Partial derivatives:
  - Used in: Chapter 7 (Optimization)

Jacobian:
  - Collected: Chapter 5
  - Used in: Chapter 10 (Dimensionality reduction)

Hessian:
  - Used in: Chapter 11 (Density estimation)

Taylor series:
  - Used in: Chapter 12 (Classification)

Applications:
  - Regression: Chapter 9
  - Dimensionality reduction: Chapter 10
  - Density estimation: Chapter 11
  - Classification: Chapter 12


Introduction 

where we used the power series representations

$$
\cos(x) = \sum_{k=0}^\infty (-1)^k \frac{x^{2k}}{(2k)!}, \quad \text{(5.26)}
$$

$$
\sin(x) = \sum_{k=0}^\infty (-1)^k \frac{x^{2k+1}}{(2k+1)!}. \quad \text{(5.27)}
$$

Figure 5.4 shows the corresponding first Taylor polynomials $ T_n $ for $ n = 0, 1, 5, 10 $.
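The Taylor polynomials of this example are easy to evaluate by hand-rolled code. The sketch below assumes the evaluation point $ x = 2 $ (an arbitrary choice inside $ [-4, 4] $) and uses a hypothetical helper `taylor_sin_plus_cos` that exploits the period-4 pattern of derivatives from (5.20) to (5.24).

```python
import math

def taylor_sin_plus_cos(x, n):
    """Degree-n Taylor polynomial of sin(x) + cos(x) around x0 = 0."""
    pattern = [1, 1, -1, -1]  # f^(k)(0) repeats with period 4
    return sum(pattern[k % 4] / math.factorial(k) * x ** k for k in range(n + 1))

x = 2.0  # illustrative evaluation point (assumption)
f_x = math.sin(x) + math.cos(x)

errors = {}
for n in [0, 1, 5, 10]:
    errors[n] = abs(taylor_sin_plus_cos(x, n) - f_x)
    print(f"T_{n}({x}) = {taylor_sin_plus_cos(x, n):.6f}, error = {errors[n]:.2e}")
```

At this point the approximation error of $ T_{10} $ is much smaller than that of $ T_5 $, in line with the behavior shown in the figure.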

### Remark

A Taylor series is a special case of a power series

$$
f(x) = \sum_{k=0}^\infty a_k (x - c)^k \quad \text{(5.28)}
$$

where $ a_k $ are coefficients and $ c $ is a constant, which has the special form in Definition 5.4. $ \diamond $

## 5.1.2 Differentiation Rules

In the following, we briefly state basic differentiation rules, where we denote the derivative of $ f $ by $ f' $.

**Product rule:**

$$
(f(x)g(x))' = f'(x)g(x) + f(x)g'(x) \quad \text{(5.29)}
$$

**Quotient rule:**

$$
\left( \frac{f(x)}{g(x)} \right)' = \frac{f'(x)g(x) - f(x)g'(x)}{(g(x))^2} \quad \text{(5.30)}
$$

**Sum rule:**

$$
(f(x) + g(x))' = f'(x) + g'(x) \quad \text{(5.31)}
$$

**Chain rule:**

$$
(g(f(x)))' = (g \circ f)'(x) = g'(f(x))f'(x) \quad \text{(5.32)}
$$

Here, $ g \circ f $ denotes function composition $ x \mapsto f(x) \mapsto g(f(x)) $.
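As a sanity check of the product, quotient, and sum rules, one can compare each right-hand side with a numerical derivative. This is a minimal sketch assuming the illustrative pair $ f(x) = x^2 $ and $ g(x) = \sin(x) $ at $ x = 0.7 $; the helper `central_difference` is hypothetical and uses a symmetric difference rather than the one-sided quotient (5.3).

```python
import math

def central_difference(func, x, h=1e-5):
    """Symmetric finite-difference approximation of func'(x)."""
    return (func(x + h) - func(x - h)) / (2 * h)

# Illustrative functions (assumptions): f(x) = x^2 and g(x) = sin(x)
f, df = (lambda x: x ** 2), (lambda x: 2 * x)
g, dg = math.sin, math.cos

x = 0.7
product_rule = df(x) * g(x) + f(x) * dg(x)                   # (5.29)
quotient_rule = (df(x) * g(x) - f(x) * dg(x)) / g(x) ** 2    # (5.30)
sum_rule = df(x) + dg(x)                                     # (5.31)

product_numeric = central_difference(lambda t: f(t) * g(t), x)
quotient_numeric = central_difference(lambda t: f(t) / g(t), x)
sum_numeric = central_difference(lambda t: f(t) + g(t), x)

print(f"product:  rule = {product_rule:.8f}, numeric = {product_numeric:.8f}")
print(f"quotient: rule = {quotient_rule:.8f}, numeric = {quotient_numeric:.8f}")
print(f"sum:      rule = {sum_rule:.8f}, numeric = {sum_numeric:.8f}")
```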

### Example 5.5 (Chain Rule)

Let us compute the derivative of the function $ h(x) = (2x + 1)^4 $ using the chain rule. With

$$
h(x) = (2x + 1)^4 = g(f(x)), \quad \text{(5.33)}
$$

$$
f(x) = 2x + 1, \quad \text{(5.34)}
$$

$$
g(f) = f^4, \quad \text{(5.35)}
$$

we obtain the derivatives of $ f $ and $ g $ as

$$
f'(x) = 2, \quad \text{(5.36)}
$$

$$
g'(f) = 4 f^3, \quad \text{(5.37)}
$$

such that the derivative of $ h $ is given as

$$
h'(x) = g'(f) f'(x) = (4 f^3) \cdot 2 = 4 (2x + 1)^3 \cdot 2 = 8 (2x + 1)^3, \quad \text{(5.38)}
$$

where we used the chain rule (5.32) and substituted the definition of $ f $ in (5.34) in $ g'(f) $.
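The result (5.38) can be reproduced step by step in code. The sketch below mirrors the decomposition $ h = g \circ f $ from (5.33) to (5.37); the evaluation point $ x = 0.5 $ and the finite-difference comparison are assumptions added for illustration.

```python
def f(x):
    """Inner function (5.34)."""
    return 2 * x + 1

def g(u):
    """Outer function (5.35)."""
    return u ** 4

def h(x):
    """Composition h(x) = g(f(x)) = (2x + 1)^4."""
    return g(f(x))

def h_prime(x):
    """Chain rule (5.38): g'(f(x)) * f'(x) = 4 f(x)^3 * 2."""
    return 4 * f(x) ** 3 * 2

x0, step = 0.5, 1e-5  # illustrative point and step size (assumptions)
numeric = (h(x0 + step) - h(x0 - step)) / (2 * step)

print(f"chain rule: h'({x0}) = {h_prime(x0)}")
print(f"numeric:    h'({x0}) ~ {numeric:.6f}")
```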

## 5.2 Partial Differentiation and Gradients

Differentiation as discussed in Section 5.1 applies to functions $ f $ of a scalar variable $ x \in \mathbb{R} $. In the following, we consider the general case where the function $ f $ depends on one or more variables $ x \in \mathbb{R}^n $, e.g., $ f(x) = f(x_1, x_2) $. The generalization of the derivative to functions of several variables is the gradient. We find the gradient of the function $ f $ with respect to $ x $ by varying one variable at a time and keeping the others constant. The gradient is then the collection of these partial derivatives.

### Definition 5.5 (Partial Derivative)

For a function $ f: \mathbb{R}^n \to \mathbb{R} $, $ x \mapsto f(x) $, $ x \in \mathbb{R}^n $ of $ n $ variables $ x_1, \ldots, x_n $, we define the partial derivatives as

$$
\begin{aligned}
\frac{\partial f}{\partial x_1} &= \lim_{h \to 0} \frac{f(x_1 + h, x_2, \ldots, x_n) - f(x)}{h} \\
&\;\;\vdots \\
\frac{\partial f}{\partial x_n} &= \lim_{h \to 0} \frac{f(x_1, \ldots, x_{n-1}, x_n + h) - f(x)}{h}
\end{aligned} \quad \text{(5.39)}
$$

and collect them in the row vector

$$
\nabla_x f = \text{grad} f = \frac{df}{dx} = \begin{bmatrix} \frac{\partial f(x)}{\partial x_1} & \frac{\partial f(x)}{\partial x_2} & \cdots & \frac{\partial f(x)}{\partial x_n} \end{bmatrix} \in \mathbb{R}^{1 \times n}, \quad \text{(5.40)}
$$

where $ n $ is the number of variables and 1 is the dimension of the image/range/codomain of $ f $. Here, we defined the column vector $ x = [x_1, \ldots, x_n]^\top \in \mathbb{R}^n $. The row vector in (5.40) is called the gradient of $ f $ or the Jacobian and is the generalization of the derivative from Section 5.1.

### Remark

This definition of the Jacobian is a special case of the general definition of the Jacobian for vector-valued functions as the collection of partial derivatives. We will get back to this in Section 5.3. $ \diamond $

### Example 5.6 (Partial Derivatives Using the Chain Rule)

For $ f(x, y) = (x + 2 y^3)^2 $, we obtain the partial derivatives

$$
\frac{\partial f(x, y)}{\partial x} = 2 (x + 2 y^3) \frac{\partial}{\partial x} (x + 2 y^3) = 2 (x + 2 y^3), \quad \text{(5.41)}
$$

$$
\frac{\partial f(x, y)}{\partial y} = 2 (x + 2 y^3) \frac{\partial}{\partial y} (x + 2 y^3) = 12 (x + 2 y^3) y^2, \quad \text{(5.42)}
$$

where we used the chain rule (5.32) to compute the partial derivatives.
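The partial derivatives (5.41) and (5.42) can be compared against finite differences that perturb one argument while holding the other fixed. This is a minimal sketch; the evaluation point $ (x, y) = (1, 0.5) $ and the step size are assumptions chosen for illustration.

```python
def f(x, y):
    """f(x, y) = (x + 2 y^3)^2 from Example 5.6."""
    return (x + 2 * y ** 3) ** 2

def df_dx(x, y):
    """Partial derivative with respect to x, see (5.41)."""
    return 2 * (x + 2 * y ** 3)

def df_dy(x, y):
    """Partial derivative with respect to y, see (5.42)."""
    return 12 * (x + 2 * y ** 3) * y ** 2

x0, y0, h = 1.0, 0.5, 1e-5  # illustrative point and step size (assumptions)

# Perturb only x (y fixed), then only y (x fixed)
numeric_dx = (f(x0 + h, y0) - f(x0 - h, y0)) / (2 * h)
numeric_dy = (f(x0, y0 + h) - f(x0, y0 - h)) / (2 * h)

print(f"df/dx: analytic = {df_dx(x0, y0):.6f}, numeric = {numeric_dx:.6f}")
print(f"df/dy: analytic = {df_dy(x0, y0):.6f}, numeric = {numeric_dy:.6f}")
```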

### Remark (Gradient as a Row Vector)

It is not uncommon in the literature to define the gradient vector as a column vector, following the convention that vectors are generally column vectors. The reason why we define the gradient vector as a row vector is twofold: First, we can consistently generalize the gradient to vector-valued functions $ f: \mathbb{R}^n \to \mathbb{R}^m $ (then the gradient becomes a matrix). Second, we can immediately apply the multivariate chain rule without paying attention to the dimension of the gradient. We will discuss both points in Section 5.3. $ \diamond $

### Example 5.7 (Gradient)

For $ f(x_1, x_2) = x_1^2 x_2 + x_1 x_2^3 \in \mathbb{R} $, the partial derivatives (i.e., the derivatives of $ f $ with respect to $ x_1 $ and $ x_2 $) are

$$
\frac{\partial f(x_1, x_2)}{\partial x_1} = 2 x_1 x_2 + x_2^3 \quad \text{(5.43)}
$$

$$
\frac{\partial f(x_1, x_2)}{\partial x_2} = x_1^2 + 3 x_1 x_2^2 \quad \text{(5.44)}
$$

and the gradient is then

$$
\frac{df}{dx} = \begin{bmatrix} \frac{\partial f(x_1, x_2)}{\partial x_1} & \frac{\partial f(x_1, x_2)}{\partial x_2} \end{bmatrix} = \begin{bmatrix} 2 x_1 x_2 + x_2^3 & x_1^2 + 3 x_1 x_2^2 \end{bmatrix} \in \mathbb{R}^{1 \times 2}. \quad \text{(5.45)}
$$
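The gradient (5.45) can be assembled in code exactly as described in Section 5.2: vary one variable at a time and collect the results. The sketch below stores the row vector as a plain Python list; the evaluation point $ (1, 2) $ and the helper names are assumptions for illustration.

```python
def f(x):
    """f(x1, x2) = x1^2 * x2 + x1 * x2^3 from Example 5.7."""
    x1, x2 = x
    return x1 ** 2 * x2 + x1 * x2 ** 3

def analytic_gradient(x):
    """Row vector (5.45), stored as a flat list of length 2."""
    x1, x2 = x
    return [2 * x1 * x2 + x2 ** 3, x1 ** 2 + 3 * x1 * x2 ** 2]

def numerical_gradient(func, x, h=1e-5):
    """Vary one coordinate at a time (central differences), keeping the others fixed."""
    grad = []
    for i in range(len(x)):
        up, down = list(x), list(x)
        up[i] += h
        down[i] -= h
        grad.append((func(up) - func(down)) / (2 * h))
    return grad

point = [1.0, 2.0]  # illustrative evaluation point (assumption)
grad_exact = analytic_gradient(point)
grad_approx = numerical_gradient(f, point)

print("analytic gradient:", grad_exact)
print("numerical gradient:", [round(g, 6) for g in grad_approx])
```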

## 5.2.1 Basic Rules of Partial Differentiation

In the multivariate case, where $ x \in \mathbb{R}^n $, the basic differentiation rules that we know from school (e.g., sum rule, product rule, chain rule; see also Section 5.1.2) still apply. However, when we compute derivatives with respect to vectors $ x \in \mathbb{R}^n $, we need to pay attention: Our gradients now involve vectors and matrices, and matrix multiplication is not commutative (Section 2.2.1), i.e., the order matters. Here are the general product rule, sum rule, and chain rule:

**Product rule:**

$$
\frac{\partial}{\partial x} \left( f(x) g(x) \right) = g(x) \frac{\partial f}{\partial x} + f(x) \frac{\partial g}{\partial x} \quad \text{(5.46)}
$$

**Sum rule:**

$$
\frac{\partial}{\partial x} \left( f(x) + g(x) \right) = \frac{\partial f}{\partial x} + \frac{\partial g}{\partial x} \quad \text{(5.47)}
$$

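**Chain rule:**

$$
\frac{\partial}{\partial x} (g \circ f)(x) = \frac{\partial}{\partial x} \left( g(f(x)) \right) = \frac{\partial g}{\partial f} \frac{\partial f}{\partial x} \quad \text{(5.48)}
$$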

## Chain Rule as Matrix Multiplication

For a function $ f(\mathbf{x}) $ where $ \mathbf{x} = [x_1(s, t), x_2(s, t)]^\top $, the gradient with respect to $ (s, t) $ is computed via matrix multiplication:

$$
\frac{\partial f}{\partial (s, t)} = \underbrace{\left[ \frac{\partial f}{\partial x_1}, \frac{\partial f}{\partial x_2} \right]}_{\text{Row vector (gradient)}} \cdot \underbrace{\begin{bmatrix}
\frac{\partial x_1}{\partial s} & \frac{\partial x_1}{\partial t} \\
\frac{\partial x_2}{\partial s} & \frac{\partial x_2}{\partial t}
\end{bmatrix}}_{\text{Jacobian matrix}} 
$$

This notation assumes the **gradient is a row vector**. If gradients were defined as column vectors instead, they would have to be transposed for the dimensions in the product to match.
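
The matrix form of the chain rule can be checked numerically. The following is a minimal sketch under assumed choices: the outer function `f`, the inner map `x_of` (polar-style coordinates), and the point `st0` are all hypothetical and picked only to make the shapes concrete.

```python
import numpy as np

def f(x):
    # Hypothetical outer function f(x1, x2) = x1^2 + 2*x2
    return x[0]**2 + 2 * x[1]

def x_of(st):
    # Hypothetical inner map: x1 = s*cos(t), x2 = s*sin(t)
    s, t = st
    return np.array([s * np.cos(t), s * np.sin(t)])

def grad_f(x):
    # Row vector [df/dx1, df/dx2], shape (1, 2)
    return np.array([[2 * x[0], 2.0]])

def jac_x(st):
    # Jacobian of x with respect to (s, t), shape (2, 2)
    s, t = st
    return np.array([[np.cos(t), -s * np.sin(t)],
                     [np.sin(t),  s * np.cos(t)]])

st0 = np.array([1.5, 0.7])  # illustrative evaluation point

# Chain rule: (1 x 2) row vector times (2 x 2) Jacobian gives a (1 x 2) row vector
chain = grad_f(x_of(st0)) @ jac_x(st0)

# Central differences applied directly to the composition f(x(s, t))
h = 1e-6
fd = np.zeros((1, 2))
for i in range(2):
    e = np.zeros(2)
    e[i] = h
    fd[0, i] = (f(x_of(st0 + e)) - f(x_of(st0 - e))) / (2 * h)

print("chain rule:       ", chain)
print("finite difference:", fd)
```

Because the gradient is kept as a row vector, the product needs no transposes, and the two results should agree up to discretization error.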

---

## Gradient Checking

To verify gradient implementations numerically, use **finite differences**:

1. Compute the analytic gradient $ \frac{\partial f}{\partial x_i} $.
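2. Approximate it numerically with a small step size $ h $, perturbing only the $ i $-th coordinate ($ \mathbf{e}_i $ is the $ i $-th standard basis vector):
   $$
   \frac{\partial f}{\partial x_i} \approx \frac{f(\mathbf{x} + h \mathbf{e}_i) - f(\mathbf{x})}{h} \quad \text{(forward difference)}
   $$
   The central difference $ \frac{f(\mathbf{x} + h \mathbf{e}_i) - f(\mathbf{x} - h \mathbf{e}_i)}{2h} $ has a smaller truncation error for the same $ h $.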
3. Validate using the relative error:
   $$
   \sqrt{\frac{\sum_i (dh_i - df_i)^2}{\sum_i (dh_i + df_i)^2}} < 10^{-6}
   $$
   where $ dh_i $ is the finite-difference approximation and $ df_i $ is the analytic gradient of $ f $ with respect to $ x_i $.

---

## Gradients of Vector-Valued Functions

For $ \mathbf{f}: \mathbb{R}^n \to \mathbb{R}^m $, the Jacobian matrix generalizes gradients:

$$
\mathbf{f}(\mathbf{x}) = \begin{bmatrix}
f_1(\mathbf{x}) \\
\vdots \\
f_m(\mathbf{x})
\end{bmatrix}, \quad 
\frac{\partial \mathbf{f}}{\partial \mathbf{x}} = \begin{bmatrix}
\frac{\partial f_1}{\partial x_1} & \cdots & \frac{\partial f_1}{\partial x_n} \\
\vdots & & \vdots \\
\frac{\partial f_m}{\partial x_1} & \cdots & \frac{\partial f_m}{\partial x_n}
\end{bmatrix} \in \mathbb{R}^{m \times n}
$$

- Row $ i $ is the gradient of the component function $ f_i $, written as a row vector, so the rows stack into a matrix with consistent dimensions.
- For $ \mathbf{f}: \mathbb{R}^n \to \mathbb{R}^m $, the Jacobian has dimensions $ m \times n $.


In [3]:
import numpy as np

def f(x):
    # Example function: f(x1, x2) = x1^2 + 2*x2^2
    return x[0]**2 + 2*x[1]**2

def analytic_gradient(x):
    return np.array([2*x[0], 4*x[1]])

def finite_diff_gradient(f, x, h=1e-4):
    # Central differences: (f(x + h*e_i) - f(x - h*e_i)) / (2h) for each coordinate i
    grad = np.zeros_like(x)
    for i in range(len(x)):
        x_plus = x.copy()
        x_plus[i] += h
        x_minus = x.copy()
        x_minus[i] -= h
        grad[i] = (f(x_plus) - f(x_minus)) / (2*h)
    return grad

x = np.array([1.0, 2.0])
grad_analytic = analytic_gradient(x)
grad_finite = finite_diff_gradient(f, x)

print("Analytic gradient:", grad_analytic)
print("Finite difference gradient:", grad_finite)

# Check error
error = np.sqrt(np.sum((grad_finite - grad_analytic)**2) / np.sum((grad_finite + grad_analytic)**2))
print("Relative error:", error)


Analytic gradient: [2. 8.]
Finite difference gradient: [2. 8.]
Relative error: 5.484773222661818e-13


## Partial Derivatives of Vector-Valued Functions

For a vector-valued function $ \mathbf{f}: \mathbb{R}^n \to \mathbb{R}^m $, the **partial derivative** of $ \mathbf{f} $ with respect to $ x_i \in \mathbb{R} $ is:

$$
\frac{\partial \mathbf{f}}{\partial x_i} = \begin{bmatrix}
\lim_{h \to 0} \frac{f_1(x_1, \dots, x_i + h, \dots, x_n) - f_1(\mathbf{x})}{h} \\
\vdots \\
\lim_{h \to 0} \frac{f_m(x_1, \dots, x_i + h, \dots, x_n) - f_m(\mathbf{x})}{h}
\end{bmatrix} \in \mathbb{R}^m
$$
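
The limit definition can be watched converging by evaluating the difference quotient for shrinking $ h $. This is a sketch only: the function `f` below is a hypothetical $ \mathbb{R}^2 \to \mathbb{R}^3 $ map and `partial_quotient` is an illustrative helper name, neither taken from the text.

```python
import numpy as np

def f(x):
    # Hypothetical f: R^2 -> R^3 chosen for illustration
    return np.array([x[0] * x[1], np.sin(x[0]), x[0] + x[1]**2])

def partial_quotient(func, x, i, h):
    # Difference quotient from the limit definition, evaluated at a finite h
    x_step = x.copy()
    x_step[i] += h
    return (func(x_step) - func(x)) / h

x0 = np.array([1.0, 2.0])
# Analytic partial derivative of f with respect to x_1: a vector in R^3
exact = np.array([x0[1], np.cos(x0[0]), 1.0])

errors = []
for h in (1e-1, 1e-3, 1e-5):
    approx = partial_quotient(f, x0, 0, h)
    errors.append(np.max(np.abs(approx - exact)))
    print(f"h={h:g}  quotient={approx}  max error={errors[-1]:.2e}")
```

Each quotient is a vector with $ m = 3 $ entries, one per component function, and the error should shrink roughly in proportion to $ h $.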

---

## Jacobian Matrix

The **gradient** (Jacobian) of $ \mathbf{f}: \mathbb{R}^n \to \mathbb{R}^m $ with respect to $ \mathbf{x} \in \mathbb{R}^n $ is:

$$
\frac{\mathrm{d}\mathbf{f}(\mathbf{x})}{\mathrm{d}\mathbf{x}} = \begin{bmatrix}
\frac{\partial \mathbf{f}(\mathbf{x})}{\partial x_1} & \cdots & \frac{\partial \mathbf{f}(\mathbf{x})}{\partial x_n}
\end{bmatrix}
$$

Explicitly, the Jacobian $ \mathbf{J} \in \mathbb{R}^{m \times n} $ is:

$$
\mathbf{J} = \nabla_{\mathbf{x}} \mathbf{f} = \begin{bmatrix}
\frac{\partial f_1}{\partial x_1} & \cdots & \frac{\partial f_1}{\partial x_n} \\
\vdots & & \vdots \\
\frac{\partial f_m}{\partial x_1} & \cdots & \frac{\partial f_m}{\partial x_n}
\end{bmatrix}
$$

- **Rows**: Correspond to outputs $ f_1, \dots, f_m $.
- **Columns**: Correspond to inputs $ x_1, \dots, x_n $.

---

### Special Case: Scalar Functions
For $ f: \mathbb{R}^n \to \mathbb{R} $, the Jacobian is a **row vector** (gradient):

$$
\nabla_{\mathbf{x}} f = \begin{bmatrix}
\frac{\partial f}{\partial x_1} & \cdots & \frac{\partial f}{\partial x_n}
\end{bmatrix} \in \mathbb{R}^{1 \times n}
$$

---

## Numerator Layout Convention
In this book, we use the **numerator layout** (Jacobian formulation):
- The derivative $ \frac{\mathrm{d}\mathbf{f}}{\mathrm{d}\mathbf{x}} $ of $ \mathbf{f}: \mathbb{R}^n \to \mathbb{R}^m $ is an $ m \times n $ matrix.
- **Rows** align with $ \mathbf{f} $’s dimensions.
- **Columns** align with $ \mathbf{x} $’s dimensions.
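
One practical consequence of this layout is that the first-order approximation $ \mathbf{f}(\mathbf{x} + \Delta\mathbf{x}) \approx \mathbf{f}(\mathbf{x}) + \mathbf{J} \Delta\mathbf{x} $ works as a plain matrix-vector product. The sketch below assumes a hypothetical $ \mathbf{f}: \mathbb{R}^3 \to \mathbb{R}^2 $ (chosen so that $ m \neq n $ and the shapes are easy to tell apart); the function, point, and step are illustrative.

```python
import numpy as np

def f(x):
    # Hypothetical f: R^3 -> R^2 (m = 2 outputs, n = 3 inputs)
    return np.array([x[0] * x[1], x[1] + x[2]**2])

def jacobian(x):
    # Numerator layout: one row per output, one column per input
    return np.array([[x[1], x[0], 0.0],
                     [0.0, 1.0, 2 * x[2]]])

x0 = np.array([1.0, 2.0, 3.0])
dx = np.array([1e-4, -2e-4, 3e-4])  # small illustrative perturbation

J = jacobian(x0)
linear = f(x0) + J @ dx             # (2,) + (2, 3) @ (3,) -> (2,)
residual = np.max(np.abs(f(x0 + dx) - linear))

print("Jacobian shape:", J.shape)   # (2, 3)
print("linearization residual:", residual)
```

The residual is second order in the perturbation, so it should be far smaller than the entries of `dx` themselves.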

---

### Example
For $ \mathbf{f}: \mathbb{R}^2 \to \mathbb{R}^3 $ defined as:
$$
\mathbf{f}(\mathbf{x}) = \begin{bmatrix}
x_1^2 \\
2x_2^2 \\
x_1 x_2
\end{bmatrix},
$$
the Jacobian is:
$$
\mathbf{J} = \begin{bmatrix}
2x_1 & 0 \\
0 & 4x_2 \\
x_2 & x_1
\end{bmatrix}.
$$


In [4]:
import numpy as np

def f(x):
    # Example: f: R^2 → R^3
    return np.array([
        x[0]**2,
        2*x[1]**2,
        x[0] * x[1]
    ])

def numerical_jacobian(f, x, h=1e-6):
    n = len(x)
    m = len(f(x))
    J = np.zeros((m, n))
    for j in range(n):
        x_plus = x.copy()
        x_plus[j] += h
        x_minus = x.copy()
        x_minus[j] -= h
        J[:, j] = (f(x_plus) - f(x_minus)) / (2*h)
    return J

x = np.array([1.0, 2.0])
J_num = numerical_jacobian(f, x)
print("Numerical Jacobian:\n", J_num)

def analytic_jacobian(x):
    # For f(x) = [x1^2, 2x2^2, x1*x2]
    return np.array([
        [2*x[0], 0],
        [0, 4*x[1]],
        [x[1], x[0]]
    ])

J_ana = analytic_jacobian(x)
print("Analytic Jacobian:\n", J_ana)

# Pure-Python variant of the same check (lists instead of NumPy arrays)
def f_pure(x):
    return [x[0]**2, 2*x[1]**2, x[0]*x[1]]

def numerical_jacobian_pure(f, x, h=1e-6):
    n = len(x)
    m = len(f(x))
    J = [[0.0 for _ in range(n)] for _ in range(m)]
    for j in range(n):
        x_plus = x.copy()
        x_plus[j] += h
        x_minus = x.copy()
        x_minus[j] -= h
        # Evaluate f once per perturbed point rather than once per output
        f_plus, f_minus = f(x_plus), f(x_minus)
        for i in range(m):
            J[i][j] = (f_plus[i] - f_minus[i]) / (2*h)
    return J

x = [1.0, 2.0]
J_pure = numerical_jacobian_pure(f_pure, x)
print("Pure Python Numerical Jacobian:")
for row in J_pure:
    print(row)


Numerical Jacobian:
 [[2. 0.]
 [0. 8.]
 [2. 1.]]
Analytic Jacobian:
 [[2. 0.]
 [0. 8.]
 [2. 1.]]
Pure Python Numerical Jacobian:
[2.000000000002, 0.0]
[0.0, 8.000000000230045]
[1.999999999946489, 1.0000000000287557]
