

This notebook overviews a Python library based on "Object structure of vector calculus", hereinafter *VC*. The central thrust of VC is that a mimetic approach to vector calculus, with computation imitating mathematics as closely as possible, is necessarily object-oriented. The essential objects are also structurally constrained.

The core mathematical types of vector calculus are normed vector spaces, vectors, linear operators or maps, and differentiable functions. My particular objective is mimetic expression of continuous optimization algorithms, which generally presume an inner product, so norms are assumed here to be inner product norms. This notebook reviews a Python realization of the structure explained in VC, with a few simple examples based on NumPy. 

# Spaces and Vectors

As explained in VC, (pre-Hilbert) vector spaces combine a set of data objects with two operations on them, linear combination and inner (dot) product. The set of data objects has infinite cardinality except in one obvious instance, so is not computationally realizable. However it is sufficient to be able to determine whether an object is a member of the set, and to obtain a new object in the set on request ("let $v \in V$"). These observations translate into four pseudo-code attributes of a Space object:
1. a boolean function that takes an object argument and returns True if the argument refers to a data object for this space;
2. a function with no arguments that returns a new data object;
3. a linear combination function that evaluates $y=ax + by$, with arguments $a$, $b$ (scalars), $x$, and $y$ (data objects). Error if either $x$ or $y$ is not a valid data object;
4. an inner product function that returns a scalar $\langle x, y \rangle$ for arguments $x$ and $y$. Error if either $x$ or $y$ is not a data object.

All Spaces have these attributes, so the collection of Spaces is naturally expressed as an abstract class. In Python,

In [1]:
from abc import ABC, abstractmethod

class Space(ABC):
    @abstractmethod
    def isData(self,x):
        pass
    @abstractmethod
    def getData(self):
        pass
    @abstractmethod
    def linComb(self, a, x, y, b=1.0):
        pass
    @abstractmethod    
    def dot(self,x,y):
        pass
    ...

The *Space* class declaration in *vcl.py* includes several other convenient methods, including a self-description interface and a *cleanup* method to remove any part of a data object that Python garbage collection does not automatically remove.

The almost-universal mathematical vernacular calls the data objects of a vector space "vectors". As explained in VC, this is not only logically incorrect but also misleading: data objects must be combined with the other attributes of their vector space to act functionally as vectors. So a vector is not a data object alone, but a composite of a data object *together with* a vector space with its linear combination and inner product functions, for which the data object belongs to its proper set of data objects.

While *vcl.Space* is an abstract type, asserting behaviour but not implementing it, every aspect of vector behaviour is determined by attributes of the corresponding space. So the Python realization *vcl.Vector* is a concrete, rather than abstract, class - its attributes are defined, not merely declared:


In [2]:
class Vector:
    def __init__(self, sp):
        self.space = sp
        self.data = sp.getData()
    def __del__(self):
        self.space.cleanup(self.data)            
    def linComb(self,a,x,b=1.0):
        self.space.linComb(a,x.data,self.data,b)
    def dot(self,x):
        return self.space.dot(self.data,x.data)
    ...

The full definition in *vcl.py* includes a mechanism for assigning a *Vector* to an existing data object, rather than the new data object assigned to it on construction. This possibility is convenient in applications. Several other convenience attributes are also provided.

For a simple example, I will construct a vector space based on NumPy. This choice makes a point: NumPy is already a fine environment for matrix algebra. However it does not offer interfaces for functions on subsets of vector spaces, nor for expression of algorithms defined in terms of vector functions, such as Newton's method. NumPy's array manipulation capabilities support straightforward construction of a vector space type, in the form of a *Space* as defined (partly) above. The key choice is use of NumPy *ndarrays* as data objects. Each space corresponds mathematically to ${\bf R}^n$, so is characterized by its dimension. So an object is a data object of an *npSpace* if it is a column *numpy.ndarray* of the right dimension. Note that the *ndarray* is NOT a vector - it only becomes one when wrapped up with the appropriate linear algebra operations, as is done in *npvc.Space*.

In [3]:
import vcl
import numpy as np

# numpy vector space
class npSpace(vcl.Space):
    def __init__(self,n):
        self.dim = n
    def getData(self):
        return np.zeros(self.dim).reshape(self.dim,1)
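    def isData(self,x):
        # shape is an attribute of ndarray, not a method
        return (isinstance(x,np.ndarray) and x.shape == (self.dim,1))
    def linComb(self,a,x,y,b=1.0):
        # overwrite y in place; plain assignment would only rebind the local name
        y[:] = a*x + b*y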
    def dot(self,x,y):
        return np.dot(x.T,y)[0][0]

*npSpace* is a sub- (or derived) class of *Space*, as indicated by the first line in the definition. So it can be used in any context that calls for a *Space*.

The full definition in *npvc.py* includes several other useful functions that are (or could be) defined in terms of the core functions described above. The *linComb* and *dot* functions (along with the others) are also written to provide error messages on failure.
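One detail of any *linComb* implementation deserves a concrete illustration: the output argument must be overwritten in place, because assigning to the parameter name inside the method only rebinds a local variable. The following is a minimal sketch using plain Python lists rather than *ndarray*s; the function names are hypothetical and not part of *vcl* or *npvc*. The same distinction applies to slice assignment on NumPy arrays.

```python
# Hypothetical helpers contrasting rebinding with in-place update.
def lincomb_rebind(a, x, y, b=1.0):
    # Rebinds the local name y; the caller's object is left untouched.
    y = [a * xi + b * yi for xi, yi in zip(x, y)]

def lincomb_inplace(a, x, y, b=1.0):
    # Slice assignment overwrites the contents of the caller's object.
    y[:] = [a * xi + b * yi for xi, yi in zip(x, y)]

x = [1.0, 2.0]
y1 = [10.0, 20.0]
y2 = [10.0, 20.0]
lincomb_rebind(2.0, x, y1)
lincomb_inplace(2.0, x, y2)
print(y1)  # still the original data
print(y2)  # now holds 2*x + y
```

Only the second version behaves as a linear combination in the sense of the *Space* interface, since the caller sees the result through the reference it already holds.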

Here is a very simple use of the *npSpace* and *Vector* classes:

In [4]:
import numpy as np
import vcl
import npvc

dom = npvc.Space(2)
x=vcl.Vector(dom)
x.data[0]=1
x.data[1]=1
print('A vector in 2D NumPy-based Space:')
x.myNameIs()

A vector in 2D NumPy-based Space:
Vector in space:
npvc.Space of dimension 2
Data object:
[[1.]
 [1.]]


This example, simple as it is, makes two important points. First, there are NumPy *Space*s (namely instances of *npvc.Space*), but there is no NumPy *Vector*: there are only *Vector*s in *npvc.Space*s. The vector concept does not need to be specialized: it is a "function" of the choice of space, and is otherwise completely defined.

Second, a feature characteristic of all uses of VCL: its classes and functions manipulate data, but some data has to be initialized by external means. In this case, it's the very simple use of Python assignment to set the components of the data object (a NumPy *ndarray*). In more complex examples, external initialization is correspondingly more complex: for example, VCL applications based on RSF will need RSF utilities to initialize key data objects (RSF file pairs), which are input to VCL-coded processes.

# Linear Operators

The third fundamental concept of linear algebra, after vector space and vector, is that of linear map or operator. These are simply functions whose domains and ranges are vector spaces, and which satisfy the linearity condition. Since the vector spaces at issue are presumed to be inner product spaces, linear operators really come in adjoint pairs.

A linear operator should be able to identify its domain and range spaces, and apply itself to a vector, producing vector output. The obvious methods to identify domain and range are functions returning these as *vcl.Space*s. In the class *vcl.LinearOperator* these methods are named *getDomain* and *getRange*. While left abstract in the class constructed here, they are naturally implemented by simply returning stored references to the domain and range spaces.

Operator application (or evaluation) is a bit more subtle. The usual mathematical syntax for linear operator application is juxtaposition: if $A$ is a linear operator and $x$ is a vector in its domain, then the value of $A$ on $x$ is $Ax$. That syntax can't be reproduced precisely in code: Python needs some indication that a method is to be called. The most appropriate option appears to be asterisk, that is, the symbol that also signifies scalar multiplication (which is essentially a special case). Thus if *A* is a *vcl.LinearOperator* and *x* is an input *vcl.Vector* in its domain, the operator output is *A\*x*. This syntax is available through *\_\_mul\_\_*, one of Python's "magic methods", which *overloads* the asterisk operator: essentially, it lets you (re)define a multiplication operator appropriate to a type you have defined, with the same syntax as scalar multiplication.

If the *\_\_mul\_\_* method is to return the operator output, then the output object needs to be allocated, as part of the method definition. It's convenient to divide the evaluation task into two parts: allocation of the output object, and modification of its data object to contain the correct output data. The allocation part is always accomplished by the same code, namely the vector constructor (remember, *vcl.Vector* is a fully defined class!). Computation of output data is peculiar to the particular operator, so is naturally isolated in an abstract function, which the *\_\_mul\_\_* method calls.

These considerations lead to the (partial) abstract class definition. Note that *vcl.LinearOperator* is a subclass of *vcl.Function*, for the obvious logical reason; the *vcl.Function* class is described below.

In [5]:
class LinearOperator(vcl.Function):
    @abstractmethod
    def getDomain(self):
        pass
    @abstractmethod
    def getRange(self):
        pass
    @abstractmethod
    def applyFwd(self,x, y):
        pass
    @abstractmethod
    def applyAdj(self,x, y):
        pass    
    
    def __mul__(self,x):
        try:
            if x.space != self.getDomain():
                raise Exception('Error: input not in domain')
            y = Vector(self.getRange())
            self.applyFwd(x,y)
        except Exception as ex:
            print(ex)
            raise Exception('called from vcl.LinearOperator *')
        else:
            return y

The implementation of *\_\_mul\_\_* tests that the input vector is actually a member of the domain space. I have used Python exception handling to implement this test, and arranged the raised exceptions to offer a traceback message if invoked. This pattern is followed in all of the *vcl* classes: any obvious error conditions are tested, and exceptions raised as instances of the *Exception* class with information about the error and where it occurred. 

To pass the test, the input vector's *vcl.Space* data member must be a reference to *the same* object as is the value returned by the *getDomain* method. This choice implies that *vcl.LinearOperator*s will store references to externally defined domain and range spaces.

The output (*y* in the listing) is *constructed* as a member of the range space, so there is no need to test its membership.

The actual computations to apply the operator to the input vector are localized in the *applyFwd* method, to which are passed (already allocated) input and output vectors. It is assumed that the *applyFwd* method is used *only* in this way, so no further checks on membership in domain and range need be implemented in it. In C++ for example, this pattern could be enforced by identifying *applyFwd* as a private (or perhaps protected) method, not accessible to non-member objects. Since Python doesn't provide that kind of access control, this design depends on the usual rule for Python design: don't do anything stupid!
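The division of labor between a generic, checked entry point and an operator-specific computation can be sketched in a few self-contained lines. The class below is hypothetical and deliberately simplified: data are plain lists, and membership in the domain is tested by length only, which is an assumption made for brevity and not the test used in *vcl*. The leading underscore is the usual Python convention for marking a method as not intended for direct use.

```python
class ScaleOp:
    """Hypothetical operator: multiplies every component by a constant."""
    def __init__(self, c, n):
        self.c = c
        self.n = n

    def _apply_fwd(self, x, y):
        # Operator-specific part: fill the already allocated output.
        for i in range(self.n):
            y[i] = self.c * x[i]

    def __mul__(self, x):
        # Generic part: check the input, allocate the output, delegate.
        if len(x) != self.n:
            raise ValueError('input not in domain')
        y = [0.0] * self.n
        self._apply_fwd(x, y)
        return y

A = ScaleOp(3.0, 2)
y = A * [1.0, -2.0]
print(y)
```

Because the left operand defines *\_\_mul\_\_*, the expression `A * x` dispatches to it, and the caller never touches `_apply_fwd` directly.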

As noted above, since all of the vector spaces considered in VCL are inner product spaces, linear operators occur (implicitly) as adjoint pairs. As will become apparent shortly, the natural location for the calculations required to apply the adjoint is also *vcl.LinearOperator*, in the form of the *applyAdj* method. As for *applyFwd*, *applyAdj* is not intended for direct use. 

Probably the simplest specialization of the linear operator class uses *npSpace* and performs matrix-vector multiplication:

In [6]:
class MatrixOperator(vcl.LinearOperator):    
    def __init__(self,dom,rng,mat):
        self.dom = dom
        self.rng = rng
        self.mat = np.copy(mat)
        #....    
    def applyFwd(self,x, y):
        y.data = self.mat@x.data
    def applyAdj(self,x, y):
        y.data = self.mat.T@x.data

The elided code in the constructor checks that the dimension of *dom* equals the number of columns of the *numpy.ndarray* *mat*, and that the dimension of *rng* equals its number of rows, raising an exception if either check fails.

Note that the domain and range are passed as references to externally defined *Space* objects that exist independently of the *MatrixOperator*, whereas the matrix argument is copied, internal to the object. This construction requires the error-checking just mentioned, but has several advantages. Obviously the domain and range spaces could also be copied, or even generated internally to the *MatrixOperator* object, but that would be logically incorrect as well as inconvenient. With the *MatrixOperator* storing references to externally defined spaces, membership in those spaces can be verified by simple comparison, as in the implementation of *\_\_mul\_\_* above. Note that it cannot be enough to check that the input vector and the domain have the same dimension. For example, subclasses of *npvc.Space* could be equipped with units or other auxiliary information necessary for their proper use. Only checking equality of dimensions could lead to dimensional errors. Insisting that the spaces involved should be *the same objects* avoids such egregious errors.
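The difference between "same dimension" and "same object" is easy to demonstrate. In the sketch below, `DimSpace` is a hypothetical stand-in for a space characterized only by its dimension; it is not the *npvc.Space* class. For a class that does not define `__eq__`, Python's `==` and `!=` fall back on object identity, so a comparison like the one in *\_\_mul\_\_* distinguishes two separately constructed spaces even when their dimensions agree.

```python
class DimSpace:
    """Hypothetical space characterized only by its dimension."""
    def __init__(self, n):
        self.dim = n

meters = DimSpace(3)    # imagine: displacements measured in meters
seconds = DimSpace(3)   # imagine: time samples measured in seconds
alias = meters          # a second reference to the same object

same_dim = meters.dim == seconds.dim   # dimensions agree
same_space = meters is seconds         # but these are different spaces
print(same_dim, same_space, alias is meters)
```

A test on dimension alone would accept a vector from `seconds` as input to an operator defined on `meters`; the identity test rejects it.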

The matrix, on the other hand, is naturally internal data. This construction maintains the identity of the *MatrixOperator* even if the NumPy array passed to the constructor is subsequently changed.

Here is a simple example of matrix multiplication via the *npvc.MatrixOperator* class. The textual overhead from expressing matrix multiplication as the action of a linear operator (as compared to straight NumPy) is: 3 lines of code, out of 15.

In [7]:
import numpy as np
import vcl
import npvc

# domain space and vector in it
dom = npvc.Space(2)
x=vcl.Vector(dom)
x.data[0]=1
x.data[1]=1

# range space
rng = npvc.Space(3)

# 3 x 2 matrix - 
mat=np.zeros((3,2))
mat[0,0]=1
mat[0,1]=1
mat[1,1]=1

# matrix operator based on mat
matop=npvc.MatrixOperator(dom,rng,mat)

# matrix-vector product as matrix operator 
# application
y = matop*x

print('Input vector x')
x.myNameIs()
print('Matrix Operator matop')
matop.myNameIs()
print('Output vector y = matop(x)')
y.myNameIs()

Input vector x
Vector in space:
npvc.Space of dimension 2
Data object:
[[1.]
 [1.]]
Matrix Operator matop
NUMPY Matrix Operator with matrix:
[[1. 1.]
 [0. 1.]
 [0. 0.]]
domain:
npvc.Space of dimension 2
range:
npvc.Space of dimension 3
Output vector y = matop(x)
Vector in space:
npvc.Space of dimension 3
Data object:
[[2.]
 [1.]
 [0.]]


I emphasize that the output *y* of the matrix product operator *matop* is *created* by *matop*. You *could* create *y* before calling the operator evaluation, that is, *y=vcl.Vector(rng); y=matop\*x*.
However, after the second statement, the variable *y* would refer to the data created internally by the asterisk operator, and the data created by the first statement (*vcl.Vector* constructor) would be "orphaned", i.e. *y* no longer refers to it and no external references exist. So the data allocated in the first statement would be garbage-collected, never having been used. The first statement is redundant, and worse, involves some wasted computational work (memory allocation, initialization).

This is not just a peculiarity of the way Python handles variables - it is logically correct, and exactly parallels the corresponding mathematics. Suppose you were to write "Let $y \in Y$, and $y=Ax$". The second statement already presumes that $y$ is in the range (must be $Y$) of the linear operator $A$, so the first statement is redundant - exactly as in the corresponding code.

Access to the adjoint operator is provided through the *vcl.transp* class. This class takes a *vcl.LinearOperator* as argument to its constructor, which returns another *vcl.LinearOperator*. This latter implements the adjoint (transpose) operator by accessing the methods of the argument, especially the *applyAdj* method. The methods are arranged so that *transp(transp(A))* duplicates the action of *A*, as one would hope. 

In [8]:
# apply transpose of matop to y
z = vcl.transp(matop)*y
print('Output vector z = transp(matop)(y)')
z.myNameIs()

Output vector z = transp(matop)(y)
Vector in space:
npvc.Space of dimension 2
Data object:
[[2.]
 [3.]]
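The adjoint wrapper idea can be sketched independently of *vcl*. The classes below are hypothetical and simplified: data are plain lists, and the apply methods return their results instead of filling an output argument as *applyFwd* and *applyAdj* do above. The wrapper merely swaps the two apply methods, which is enough to make the double transpose act like the original operator. The example reuses the 3 x 2 matrix from the listing above and also checks the defining relation $\langle Ax, y\rangle = \langle x, A^Ty\rangle$.

```python
class MatOp:
    """Hypothetical matrix operator; mat is a list of rows."""
    def __init__(self, mat):
        self.mat = [row[:] for row in mat]   # internal copy
    def applyFwd(self, x):
        return [sum(a * xj for a, xj in zip(row, x)) for row in self.mat]
    def applyAdj(self, y):
        nrows, ncols = len(self.mat), len(self.mat[0])
        return [sum(self.mat[i][j] * y[i] for i in range(nrows))
                for j in range(ncols)]
    def __mul__(self, x):
        return self.applyFwd(x)

class Transp:
    """Hypothetical adjoint wrapper: swaps applyFwd and applyAdj."""
    def __init__(self, op):
        self.op = op
    def applyFwd(self, x):
        return self.op.applyAdj(x)
    def applyAdj(self, y):
        return self.op.applyFwd(y)
    def __mul__(self, x):
        return self.applyFwd(x)

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

A = MatOp([[1.0, 1.0], [0.0, 1.0], [0.0, 0.0]])
x = [1.0, 1.0]
y = A * x
z = Transp(A) * y
lhs = dot(A * x, y)            # <Ax, y>
rhs = dot(x, Transp(A) * y)    # <x, A^T y>
w = Transp(Transp(A)) * x      # double transpose acts like A
print(y, z, lhs, rhs, w)
```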


# Functions

Having constructed a class for linear operators (i.e. linear functions), it's obvious how to build a class for (possibly) non-linear functions. First, absent linearity there is no natural adjoint concept, so there is only a "forward" application function. Second, for differentiable functions a derivative (a linear operator) is another attribute. 

These considerations suggest a very simple abstract interface. The natural syntax for evaluation of the function $F$ at $x$ is $F(x)$. As was the case with linear operators, there is a Python "magic method" interface for function call syntax, which incorporates sanity checking. The computations specific to an instance are relegated to the abstract *apply* method.

In [9]:
from abc import ABC, abstractmethod
import vcl

class Function(ABC):
    @abstractmethod
    def getDomain(self):
        pass
    @abstractmethod
    def getRange(self):
        pass
    @abstractmethod
    def apply(self,x,y):
        pass
    
    def __call__(self,x):
        try:
            if x.space != self.getDomain():
                raise Exception('Error: input vec not in domain')
            y = vcl.Vector(self.getRange())
            self.apply(x,y)
        except Exception as ex:
            print(ex)
            raise Exception('called from vcl.Function operator()')
        else:
            return y

Access to the derivative is through a method *deriv*, which should return a *vcl.LinearOperator*. Error-checking goes as before, but the actual computations are specific to individual subtypes, so an abstract interface *raw_deriv* is provided. As is the case with *apply*, *raw_deriv* is not intended for direct use.

In [10]:
    # should return linear op
    @abstractmethod
    def raw_deriv(self,x):
        pass
    
    def deriv(self,x):
        try:
            if x.space != self.getDomain():
                raise Exception('Error: input vec not in domain')
        except Exception as ex:
            print(ex)
            raise Exception('called from vcl.Function.deriv')
        else:        
            return self.raw_deriv(x)

A simple NumPy-based example is provided as *npvc.OpExpl1*. It realizes the function $f: {\bf R}^2 \rightarrow {\bf R}^3$ given by 
$$
f((x_0,x_1)^T) = (x_0*x_1, -x_1+x_0^2, x_1^2)^T.
$$
Its code is written according to the principles outlined above. For instance, the domain and range spaces are constructed externally to the function object, and passed to its constructor as arguments. Their dimensions are checked to be 2 and 3 respectively. The function object stores references to these externally defined spaces. The *\_\_call\_\_* and *deriv* methods sanity-check their arguments. See the code for details.
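A quick way to gain confidence in a hand-coded derivative is to compare it with a finite difference quotient. The sketch below does this for the function $f$ defined above, in plain Python and independently of *npvc.OpExpl1*; the helper names, the test direction `dx`, and the step size `h` are arbitrary choices made for illustration.

```python
def f(x):
    x0, x1 = x
    return [x0 * x1, -x1 + x0 ** 2, x1 ** 2]

def jacobian(x):
    # Matrix of partial derivatives of f, worked out by hand.
    x0, x1 = x
    return [[x1, x0], [2 * x0, -1.0], [0.0, 2 * x1]]

def matvec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]

x = [1.0, -2.0]
dx = [2.0, -3.0]
h = 1.0e-6

fx = f(x)
fxh = f([xi + h * di for xi, di in zip(x, dx)])
fd = [(a - b) / h for a, b in zip(fxh, fx)]   # difference quotient
an = matvec(jacobian(x), dx)                  # analytic directional derivative
err = max(abs(a - b) for a, b in zip(fd, an))
print(an)
print(err)
```

Since $f$ is quadratic, the discrepancy between the two is proportional to `h`, so it shrinks as the step is reduced until rounding error takes over.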

In [11]:
dom = npvc.Space(2)
rng = npvc.Space(3)
f = npvc.OpExpl1(dom,rng)
x = vcl.Vector(dom)
x.data[0]=1
x.data[1]=-2
print('input vector:')
x.myNameIs()
y = f(x)
print('output of apply method:')
y.myNameIs()
dfx = f.deriv(x)
print('output of deriv method:')
dfx.myNameIs()

input vector:
Vector in space:
npvc.Space of dimension 2
Data object:
[[ 1.]
 [-2.]]
output of apply method:
Vector in space:
npvc.Space of dimension 3
Data object:
[[-2.]
 [ 3.]
 [ 4.]]
output of deriv method:
NUMPY Matrix Operator with matrix:
[[-2.  1.]
 [ 2. -1.]
 [ 0. -4.]]
domain:
npvc.Space of dimension 2
range:
npvc.Space of dimension 3


It is worth pointing out a feature also present in the previous example. While the input vector (*x* in both examples) is constructed and initialized externally to the *Function*, the output vector (*y*) is constructed and initialized internally, and returned to the calling environment by assignment. The same goes for the derivative, represented by a *LinearOperator*. Of course *x* also appears in the environment as the left-hand side of an assignment, namely the *Vector* constructor.

Linear functions (operators) are also functions, so really *LinearOperator* should subclass *Function*. The derivative is a linear operator, so the definition of *Function* might seem to depend on the definition of *LinearOperator*, which in turn depends on *Function*. However it is possible to take advantage of Python's genericity to break this cyclic dependence: the return type of the *Function.deriv* method isn't specified, because return types are not specified in Python! The *raw_deriv* method is implemented for *LinearOperator*, and simply returns a reference to the object itself, since linear operators are their own derivatives. Also, the *Function.apply* method is implemented by delegation to *applyFwd*. So the function call syntax also works for *LinearOperator*s, though the asterisk syntax is preferable for readability.

## Jets

The *jet* concept solves an obvious consistency vs. efficiency problem involving the attributes of functions. Suppose that *f* is a *vcl.Function*, and *x* is a *vcl.Vector*. Record the function's value in the *vcl.Vector* *y*, and the derivative in the *vcl.LinearOperator* *df*. Some number of lines later in the program, access these same objects. Is there any guarantee that they are still related in the same way? In fact, there is not. If *x* is changed but the function is not re-evaluated, then the three objects have become inconsistent. The only way to make sure that this inconsistency does not occur appears to be re-computing the value and derivative each time they are accessed. That could amount to considerable wasted computational effort.

Jets, in the sense meant here, borrow a concept from differential geometry, and offer a way around this problem. There are several equivalent definitions, of which I cite the one most relevant to computation. The $k$-jet of a $C^{\infty}$ function $f$ at a point $x$ in its domain is the sequence of values $D^{\alpha}f(x)$ of $f$ and all of its derivatives of orders $|\alpha| \le k$. This definition suggests an obvious container class, implemented in *vcl.Jet*. For the time being, I have implemented only the 1-jet. The constructor arguments are the *vcl.Function* *f* and *vcl.Vector* *x*. The methods *getPoint*, *getValue*, and *getDeriv* return the computational analogues of $x$, $f(x)$, and $Df(x)$ respectively. These are stored as internal copies, so changing *x* after creation of *vcl.Jet(f,x)* does not change the return values of these methods. Thus an instance of *vcl.Jet* provides access to a consistent set of values.
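The container idea can be illustrated with a small stand-alone sketch. The class `Jet1` below is hypothetical and is not the *vcl.Jet* implementation: it works on plain lists, and it takes the derivative as a separate callable `df`, whereas *vcl.Jet* obtains it from the function object. The method names follow those mentioned above. The point is only that the evaluation point is copied at construction, so later changes to the caller's `x` leave the stored triple consistent.

```python
import copy

class Jet1:
    """Hypothetical 1-jet container: stores copies of x, f(x), Df(x)."""
    def __init__(self, f, df, x):
        self._x = copy.deepcopy(x)   # private copy of the evaluation point
        self._val = f(self._x)
        self._der = df(self._x)
    def getPoint(self):
        return self._x
    def getValue(self):
        return self._val
    def getDeriv(self):
        return self._der

def f(x):
    return [x[0] * x[1], -x[1] + x[0] ** 2, x[1] ** 2]

def df(x):
    return [[x[1], x[0]], [2 * x[0], -1.0], [0.0, 2 * x[1]]]

x = [1.0, -2.0]
j = Jet1(f, df, x)
x[0] = 5.0   # a later change to x does not disturb the jet
print(j.getPoint(), j.getValue(), j.getDeriv())
```

Nothing prevents a caller from modifying the returned lists, so this sketch offers convention rather than enforcement.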

An earlier implementation of this concept in the C++ library RVL used access control and the *const* keyword to prevent any violation of the jet's internal data, so really offered a guarantee that the return values of its methods are coherent. Such guarantees are impossible to provide in Python, which does not implement *const* and makes all class data public. So *vcl.Jet* really only provides guidance to help the programmer maintain the coherence of function values. Vector data is also private in RVL, and can only be altered through the action of a few specified function classes. This restriction makes a *versioning* system possible. The jet methods compare the version index of a vector with a recorded index to tell whether the vector has been altered, and thus update values and derivatives automatically whenever necessary. So RVL implements the jet concept as *f(x) for variable x*. Such automation is impossible in a Python framework: the VCL user is responsible for updating *vcl.Jet* instances as needed.

Partly for this reason, I have not used the Jet class provided in *vcl* in formulating algorithms, at least so far.

## Scalar Functions

Scalar-valued functions are central players in optimization, but the classes described so far do not encompass them. ${\bf R}$ can be identified with a 1-D vector space over the reals, but it is not *the same* as that vector space. This is even more so in the computational context: *float* objects are not interchangeable with 1-D *ndarray*s. In VCL, the latter are data of *Vector*s, not themselves *Vector*s. So there are a couple of conceptual layers between scalars and vectors, and scalar-valued functions require their own proper class.

The analogue of *Function.apply* is *ScalarFunction.value*. Unlike *apply*, *value* simply returns a value, rather than altering an argument. Like *apply*, it is intended for "private" use, with sanity testing done by the *\_\_call\_\_* method. So the rule is:

for a scalar function class *fun*, implement *fun.value*. For an instance *f* of *fun*, call *f(x)* (not *f.value(x)*). 

Another feature of this class deserving of mention is the representation of the derivative by the gradient, its Riesz representer. Thus *ScalarFunction.gradient* replaces *Function.deriv*. As in the *Function* case, there is a "raw" version (*raw_gradient*) to be implemented, whereas the "cooked" version *gradient* (with standard sanity test) is to be used in code.

Because it occurs so often, I include a definition of the standard least-squares function
$$
J(x) = \frac{1}{2}\|F(x)-y\|^2
$$ 
in which $F: X \rightarrow Y$ is a map between Hilbert spaces $X$ and $Y$, $y \in Y$, and $J: X \rightarrow {\bf R}$. 

(Of course, there's nothing "least" about it, but minimizing $J$ is the standard least-squares optimization problem, and the objective function is stuck with the same name, in the vernacular.)

In [12]:
# function x -> 0.5*|f(x)-b|^2
class LeastSquares(vcl.ScalarFunction):
    def __init__(self,f,b):
        self.f = f
        self.b = b
    def value(self,x):
        # residual res = f(x) - b
        res = self.f(x)
        res.linComb(-1.0,self.b)
        return 0.5*res.dot(res)
    def raw_gradient(self,x):
        # gradient = Df(x)^T (f(x) - b)
        res = self.f(x)
        res.linComb(-1.0,self.b)
        df = self.f.deriv(x)
        return vcl.transp(df)*res

Here is an example using once again the function *npvc.OpExpl1*. Note that *J(x)* and *J.gradient(x)* appear in this "application", rather than *J.value(x)* and *J.raw_gradient(x)*, in order that the input be checked for membership in the domain.

In [13]:
dom = npvc.Space(2)
rng = npvc.Space(3)
f = npvc.OpExpl1(dom,rng)
x = vcl.Vector(dom)
y = vcl.Vector(rng)
y.data[0]=3
y.data[1]=2
y.data[2]=-3
x.data[0]=1
x.data[1]=-2
J = vcl.LeastSquares(f,y)
J.myNameIs()
print('input vector x:')
x.myNameIs()
print('J.value(x) =' + str(J(x)))
g=J.gradient(x)
print('J.gradient(x) = ')
g.myNameIs()

Least Squares Function
*** operator:
OpExpl1: npvc example of vcl.Function class
implements (x0,x1) -> (x0*x1, -x1+x0^2, x1^2)
domain = R^2, range = R^3
*** rhs vector
Vector in space:
npvc.Space of dimension 3
Data object:
[[ 3.]
 [ 2.]
 [-3.]]
input vector x:
Vector in space:
npvc.Space of dimension 2
Data object:
[[ 1.]
 [-2.]]
J.value(x) =37.5
J.gradient(x) = 
Vector in space:
npvc.Space of dimension 2
Data object:
[[ 12.]
 [-34.]]
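The printed values can be checked by hand. By the chain rule, for $J(x)=\frac{1}{2}\|F(x)-y\|^2$ (the scaling used in the *LeastSquares* listing),
$$
DJ(x)\delta x = \langle F(x)-y, DF(x)\delta x \rangle = \langle DF(x)^T(F(x)-y), \delta x \rangle,
$$
so the gradient is $DF(x)^T(F(x)-y)$. With $x=(1,-2)^T$ and $y=(3,2,-3)^T$, and using the value and derivative of $f$ computed in the earlier example,
$$
F(x)-y = (-5,1,7)^T, \qquad J(x) = \tfrac{1}{2}(25+1+49) = 37.5,
$$
$$
\nabla J(x) = \begin{pmatrix} -2 & 2 & 0\\ 1 & -1 & -4 \end{pmatrix} \begin{pmatrix} -5\\ 1\\ 7 \end{pmatrix} = \begin{pmatrix} 12\\ -34 \end{pmatrix},
$$
in agreement with the output above.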


# Product Spaces and Partial Derivatives

Most scientific data lives in (Cartesian) product spaces, and they turn up in lots of other ways. The computational realization *ProductSpace* simply makes a list of *Space* objects behave as a *Space*. A *ProductSpace* data object is naturally a list of data objects, one for each *Space* factor. *ProductSpace* methods implement standard induced operations: linear combinations are lists of linear combinations, dot product is the sum of dot products, etc. The list of *Space* factors is available as the *spl* data member - which can be queried for length, individual factors, etc. since it's publicly accessible, like all class data in Python. 
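The induced operations just described can be sketched in a few lines. The two classes below are hypothetical, simplified stand-ins and not the *vcl* or *npvc* classes: data objects of `ListSpace` are plain Python lists, and `ProdSpace` turns a list of spaces into a space whose data objects are lists of component data objects. Only the attribute name `spl` is taken from the text.

```python
class ListSpace:
    """Hypothetical R^n space whose data objects are Python lists."""
    def __init__(self, n):
        self.dim = n
    def getData(self):
        return [0.0] * self.dim
    def linComb(self, a, x, y, b=1.0):
        y[:] = [a * xi + b * yi for xi, yi in zip(x, y)]
    def dot(self, x, y):
        return sum(xi * yi for xi, yi in zip(x, y))

class ProdSpace:
    """Hypothetical product space: a list of spaces acting as one space."""
    def __init__(self, spl):
        self.spl = spl
    def getData(self):
        return [sp.getData() for sp in self.spl]
    def linComb(self, a, x, y, b=1.0):
        # linear combination, component by component
        for sp, xi, yi in zip(self.spl, x, y):
            sp.linComb(a, xi, yi, b)
    def dot(self, x, y):
        # inner product is the sum of the component inner products
        return sum(sp.dot(xi, yi) for sp, xi, yi in zip(self.spl, x, y))

pdom = ProdSpace([ListSpace(1), ListSpace(1)])
x = pdom.getData()
y = pdom.getData()
x[0][0], x[1][0] = 1.0, -2.0
y[0][0], y[1][0] = 3.0, 4.0
pdom.linComb(2.0, x, y)   # y = 2*x + y
d = pdom.dot(x, y)
print(y, d)
```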

In [14]:
dom1=npvc.Space(1)
dom2=npvc.Space(1)
spacelist=[dom1,dom2]
pdom=vcl.ProductSpace(spacelist)
pdom.myNameIs()
x=vcl.Vector(pdom)
x.data[0][0]=1
x.data[1][0]=-2
x.myNameIs()

vcl.ProductSpace
*** component 0
npvc.Space of dimension 1
*** component 1
npvc.Space of dimension 1
Vector in space:
vcl.ProductSpace
*** component 0
npvc.Space of dimension 1
*** component 1
npvc.Space of dimension 1
Data object:
[[1.]]
[[-2.]]


Comparison of this example with the preceding one reminds the reader that $R^2$ is not *the same* as $R^1 \oplus R^1$ - the two are isomorphic, but not identical. This mathematical truth is precisely reflected in the computational structure. Both examples construct a vector whose data array(s) has (have) two components with values 1 and -2, but in the first example it's an array of reals of length 2, whereas in the second it's an array of length 2 of real arrays of length 1.

A further consequence of this reasoning: there is no such thing as a "product vector". Instead, there are vectors in product spaces. However, vectors in product spaces have components, each of which is a vector. Also, product spaces have components, each of which is a space. The "magic method" *\_\_getitem\_\_* provides access via the usual indexing interface (operator[]) in each case:

In [15]:
#x0=x.component(0)
x0 = x[0]
x0.myNameIs()
x0.data[0]=2
x.myNameIs()

Vector in space:
npvc.Space of dimension 1
Data object:
[[1.]]
Vector in space:
vcl.ProductSpace
*** component 0
npvc.Space of dimension 1
*** component 1
npvc.Space of dimension 1
Data object:
[[2.]]
[[-2.]]


A function on a product space may (or may not) be provided with partial derivatives - that's one of those features that are implicit in the mathematical concept, but must be added explicitly to its computational homolog. If it is, the derivative is implemented as a *vcl.RowLinearOperator*, which provides a *vcl.LinearOperator* interface for a list of *vcl.LinearOperator*s. The individual operators making up a *RowLinearOperator* are accessed by index via the indexing operator[], another instance of *\_\_getitem\_\_*. Thus the *i*th partial derivative of *f* at *x* is *f.deriv(x)[i]*.

I have modified *npvc.OpExpl1* to make the domain $R^1 \oplus R^1$ rather than $R^2$, and implemented the derivative as a *vcl.RowLinearOperator*. This change costs a couple of extra lines of code to build up the list of partial derivatives. Note that the output of the derivative, applied to a particular vector, is the same as the sum of the partial derivatives applied to its components, as it should be.
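The relation between a row operator and its components can be sketched without *vcl*. The classes below are hypothetical simplifications on plain lists: `ColOp` represents a map from $R^1$ to $R^3$ given by a single matrix column, and `RowOp` applies each component operator to the matching component of its input and sums the results. The columns used are those of the derivative of $f$ at $x=(1,-2)$, and the input direction is $(2,-3)$.

```python
class ColOp:
    """Hypothetical operator R^1 -> R^3 given by one matrix column."""
    def __init__(self, col):
        self.col = list(col)
    def __mul__(self, x):
        return [c * x[0] for c in self.col]

class RowOp:
    """Hypothetical row operator: a list of operators with a common range."""
    def __init__(self, ops):
        self.ops = ops
    def __getitem__(self, i):
        # indexing returns the i-th component operator
        return self.ops[i]
    def __mul__(self, x):
        # x is a list of components, one per operator; outputs are summed
        outs = [op * xi for op, xi in zip(self.ops, x)]
        return [sum(vals) for vals in zip(*outs)]

dfx = RowOp([ColOp([-2.0, 2.0, 0.0]), ColOp([1.0, -1.0, -4.0])])
dx = [[2.0], [-3.0]]
dy = dfx * dx
dy0 = dfx[0] * dx[0]
dy1 = dfx[1] * dx[1]
total = [a + b for a, b in zip(dy0, dy1)]
print(dy, dy0, dy1, total)
```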

In [16]:
f = npvc.OpExpl2(pdom,rng)
x.data[0][0]=1
x.data[1][0]=-2
print('input vector:')
x.myNameIs()
y=f(x)
print('output of apply method:')
y.myNameIs()
dfx = f.deriv(x)
print('output of deriv method:')
dfx.myNameIs()
dx=vcl.Vector(pdom)
dx.data[0][0]=2
dx.data[1][0]=-3
print('input to deriv')
dx.myNameIs()
dy=dfx*dx
print('output of deriv')
dy.myNameIs()
dy0=dfx[0]*dx[0]
print('input of partial deriv 0')
dx[0].myNameIs()
print('output of partial deriv 0')
dy0.myNameIs()
dy1=dfx[1]*dx[1]
print('input of partial deriv 1')
dx[1].myNameIs()
print('output of partial deriv 1')
dy1.myNameIs()
# sum the outputs of the partial derivs
dy1.linComb(1.0,dy0)
print('sum of partial deriv outputs')
dy1.myNameIs()

input vector:
Vector in space:
vcl.ProductSpace
*** component 0
npvc.Space of dimension 1
*** component 1
npvc.Space of dimension 1
Data object:
[[1.]]
[[-2.]]
output of apply method:
Vector in space:
npvc.Space of dimension 3
Data object:
[[-2.]
 [ 3.]
 [ 4.]]
output of deriv method:
RowLinearOperator length = 2
*** Component 0:
NUMPY Matrix Operator with matrix:
[[-2.]
 [ 2.]
 [ 0.]]
domain:
npvc.Space of dimension 1
range:
npvc.Space of dimension 3
*** Component 1:
NUMPY Matrix Operator with matrix:
[[ 1.]
 [-1.]
 [-4.]]
domain:
npvc.Space of dimension 1
range:
npvc.Space of dimension 3
input to deriv
Vector in space:
vcl.ProductSpace
*** component 0
npvc.Space of dimension 1
*** component 1
npvc.Space of dimension 1
Data object:
[[2.]]
[[-3.]]
output of deriv
Vector in space:
npvc.Space of dimension 3
Data object:
[[-7.]
 [ 7.]
 [12.]]
input of partial deriv 0
Vector in space:
npvc.Space of dimension 1
Data object:
[[2.]]
output of partial deriv 0
Vector in space:
npvc.Space of dim

# Algorithms

The main excuse for inventing this machinery is to make algorithms based on vector calculus easier to express. This section describes several examples.

## Conjugate Gradient Algorithm for Linear Least Squares

This algorithm is adapted from the Hestenes-Stiefel conjugate gradient algorithm for positive definite symmetric linear systems. It gives an approximate solution of the least-squares problem
$$
\mbox{Given }A\mbox{ and }b \in B, \,\min_{x \in X} \|Ax-b\|^2
$$ 
in which $X$ and $B$ are Hilbert spaces and $A: X \rightarrow B$ is a (bounded) linear operator. Assuming that $A$ is coercive (full column rank in the finite-dimensional case), $x$ is also the unique solution of the linear system (normal equation)
$$
A^TA x = A^Tb
$$
The conjugate gradient algorithm, as described in Golub and van Loan (sections 10.2 and 10.3) and in Nocedal and Wright (algorithm 5.2), can be applied directly to this system. The only difference between that algorithm and the one described here is the introduction of a vector variable ($q$ below) to avoid explicit construction of the normal operator $A^TA$. Hanke, section 2.3, describes essentially this algorithm.
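
As a quick sanity check of the equivalence between the least-squares problem and the normal equation, here is a small NumPy sketch. The random test matrix and the variable names are my own choices for illustration and are not part of the library.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((6, 4))   # full column rank with probability one
b = rng.standard_normal(6)

# Least-squares solution from NumPy's dense solver
x_ls = np.linalg.lstsq(A, b, rcond=None)[0]

# Solution of the normal equation A^T A x = A^T b
x_ne = np.linalg.solve(A.T @ A, A.T @ b)

# The two solutions agree, and the residual is orthogonal to the range of A
print(np.allclose(x_ls, x_ne))
print(np.allclose(A.T @ (b - A @ x_ls), 0.0))
```

Forming $A^TA$ explicitly is fine for a toy problem like this one; the point of the algorithm below is to avoid it.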

As I have written it here, five auxiliary vectors are required: $r, p, s \in X$, and $e, q \in B$. At each iteration, $e=b-Ax$ is the residual vector, $r=A^T(b-Ax)$ the normal residual vector. The iteration terminates when the length of either of these two vectors falls below a factor $\epsilon, \rho \in (0,1)$ of its original length. 

The algorithm proceeds in two phases. In each of the following steps, the equality sign "$=$" represents assignment, that is, the right hand side is first evaluated, then overwritten on the left hand side.

Initialization:

1. $x = 0$
2. $e = b$
3. $r = A^Tb$
4. $p = r$
5. $\gamma_0 = \langle r, r \rangle_X$
6. $\gamma = \gamma_0$
7. $k=0$
    
Iteration: Repeat while $k<k_{\rm max}$, $\|e\|>\epsilon \|b\|$, and $\|r\|>\rho \|A^Tb\|$

1. $q = Ap$
2. $s = A^Tq$
3. $\alpha = \gamma / \langle q, q\rangle_B$
4. $x = x+\alpha p$
5. $e = e-\alpha q$
6. $r = r-\alpha s$
7. $\delta = \langle r, r \rangle_X$
8. $\beta = \delta / \gamma$
9. $p = r + \beta p$
10. $\gamma = \delta$
11. $k = k+1$

The translation into VCL code is straightforward. I pick out a few examples to illustrate how this goes, all from the iteration phase.

(step 1) $q = Ap$ becomes *A.applyFwd(p,q)*

(step 2) $s = A^Tq$ becomes *A.applyAdj(q,s)*

(step 3) $\alpha = \gamma / \langle q, q\rangle_B$ becomes *alpha = gamma/q.dot(q)*

(step 4) $x = x+\alpha p$ becomes *x.linComb(alpha,p)* (*linComb* is 
often called *axpy*, standing for "(y = ) a x plus y")

The complete algorithm (function *conjgrad* in the module *vcalg*) includes several levels of screen output, from none to printing the norms of $e$ and $r$ at every iteration.
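
For readers without the library at hand, the initialization and iteration steps listed above can be sketched in plain NumPy. This is a minimal sketch and not the *vcalg* implementation: the function name `cgls`, its signature, and the use of dense arrays are assumptions made for illustration. The test matrix is the one used in the notebook example that follows.

```python
import numpy as np

def cgls(A, b, kmax=20, eps=0.01, rho=0.01):
    """Conjugate gradient iteration for min ||Ax - b||^2 (illustrative sketch)."""
    x = np.zeros(A.shape[1])
    e = b.copy()                      # residual e = b - Ax
    r = A.T @ b                       # normal residual r = A^T (b - Ax)
    p = r.copy()
    gamma0 = r @ r
    gamma = gamma0
    e0 = np.linalg.norm(e)
    k = 0
    while (k < kmax and np.linalg.norm(e) > eps * e0
           and np.sqrt(gamma) > rho * np.sqrt(gamma0)):
        q = A @ p
        s = A.T @ q
        alpha = gamma / (q @ q)
        x += alpha * p
        e -= alpha * q
        r -= alpha * s
        delta = r @ r
        beta = delta / gamma
        p = r + beta * p
        gamma = delta
        k += 1
    return x, k

# 6 x 4 matrix with diagonal entries 1, 2, 3, 4 and noise-free data
A = np.zeros((6, 4))
A[0, 0], A[1, 1], A[2, 2], A[3, 3] = 1.0, 2.0, 3.0, 4.0
b = A @ np.ones(4)
x, k = cgls(A, b)
print(k, x)
```

The auxiliary vector `q` plays the role described above: the normal operator $A^TA$ is never formed, only applied as two successive products.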

Properties of the algorithm are described in the cited references and many others. Key facts:

1. the residual (norm of $e$) decreases monotonically. 

2. the normal residual (norm of $r$) decreases eventually, but not monotonically (in general).

3. for a system of dimension $n$, in exact arithmetic, the algorithm terminates at a solution of the normal equations in $n$ or fewer iterations. In floating point arithmetic, the residual in the normal equations is generally on the order of the square root of macheps in $n$ iterations. For very large problems, useful convergence is governed by the distribution of eigenvalues of $A^TA$, not by the dimension. 

To illustrate some of these features, I constructed a least squares problem using *MatrixOperator* with 4-dimensional domain and 6-dimensional range. I built noise-free data and solved the corresponding least squares problem using *vcalg.conjgrad*.

In [17]:
import vcl
import npvc
import vcalg
import numpy as np

# domain space and vector in it
dom = npvc.Space(4)
xstar=vcl.Vector(dom)
xstar.data[0]=1
xstar.data[1]=1
xstar.data[2]=1
xstar.data[3]=1

# range space and vector in it
rng = npvc.Space(6)

# 6 x 4 matrix: diagonal entries 1, 2, 3, 4 in the top 4 x 4 block, zero rows below
mat=np.zeros((6,4))
mat[0,0]=1
mat[1,1]=2
mat[2,2]=3
mat[3,3]=4

# matrix operator based on mat
matop=npvc.MatrixOperator(dom,rng,mat)

# initialize rhs
b = matop*xstar

# solution, residual, normal residual vectors (if desired)
x = vcl.Vector(dom)


# set cg parameters and run
kmax=20
eps=0.01
rho=0.01
vcalg.conjgrad(x=x, b=b, A=matop, kmax=kmax, eps=eps,\
               rho=rho, verbose=2)

# view result
print('\nsolution vector:')
x.myNameIs()

  k       |e|       |r|=
  0  5.4772e+00  1.8815e+01
  1  2.0912e+00  5.0185e+00
  2  1.0929e+00  1.9103e+00
  3  6.0569e-01  6.6167e-01
  4  1.2545e-15  4.3724e-15
----------------------------------------------------
  k       |e|     |e|/|e0|      |r|        |r|/r0|
  4  1.2545e-15  2.2903e-16  4.3724e-15  2.3239e-16

solution vector:
Vector in space:
npvc.Space of dimension 4
Data object:
[[1.]
 [1.]
 [1.]
 [1.]]


## Truncated Gauss-Newton Algorithm for Nonlinear Least Squares

Suppose that $F:U \rightarrow B$ is a (possibly nonlinear) function from an open subset $U$ of the Hilbert space $X$ to another Hilbert space $B$. "Nonlinear least squares" refers to the optimization problem,
$$
\mbox{Given }b \in B, \,\min_{x \in U} J(x),
$$ 
where $J(x)=0.5*\|F(x)-b\|^2$. Given a current estimate $x$ of the solution, Newton's method produces an update by using the gradient
$$
g(x) = DF(x)^T(F(x)-b)
$$
and Hessian
$$
H(x) = D(DF^T)(x)(F(x)-b) + DF(x)^TDF(x)
$$
by solving for the Newton step $s$: 
$$
x \leftarrow x+s; H(x)s = -g.
$$
The Gauss-Newton variant modifies the Hessian by dropping the first term. There are three reasons for doing this:

1. If the residual $F(x)-b$ is small at the solution (so that data has small noise), then this term should be small when $x$ is close to the minimizer;

2. Without that term, it is easy to see that $s$ is always a descent direction, assuming that $DF(x)$ has full column rank; and

3. Without that term, it is not necessary to compute the second derivative of $F$.

Also, the Gauss-Newton step is the solution of the linear least squares problem
$$
\min_s \|DF(x)s-(b-F(x))\|^2
$$
which suggests the possibility of computing $s$ via the Conjugate Gradient algorithm, which is particularly attractive for large-scale problems. Even better, it suggests a refinement for taking a partial step when far from the solution, where a full Newton step is not likely to be constructive, and furthermore reduces the total computational work. 
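
For a small dense problem, a Gauss-Newton step can be computed with a least-squares solve on the linearized residual, which makes the Gauss-Newton equation $DF(x)^TDF(x)s=-g(x)$ and the descent property easy to check. The sketch below uses the 2D Rosenbrock residual that appears later in this notebook; the helper names `F` and `DF` are mine, and the dense solver stands in for the conjugate gradient iteration.

```python
import numpy as np

def F(x):
    # residual function of the 2D Rosenbrock problem
    return np.array([10.0 * (x[1] - x[0] ** 2), -x[0]])

def DF(x):
    # Jacobian of F
    return np.array([[-20.0 * x[0], 10.0], [-1.0, 0.0]])

b = np.array([0.0, -1.0])
x = np.array([-1.2, 1.0])

Jac = DF(x)
g = Jac.T @ (F(x) - b)                               # gradient of J at x
s = np.linalg.lstsq(Jac, b - F(x), rcond=None)[0]    # Gauss-Newton step

print(np.allclose(Jac.T @ Jac @ s, -g))   # s satisfies the Gauss-Newton equation
print(g @ s < 0)                          # s is a descent direction
```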

This refinement is due to Steihaug (see Nocedal and Wright, sections 4.1 and 6.4). Suppose that the quadratic model of $J$ based on the Gauss-Newton Hessian at $x$ is presumed to be sufficiently accurate in a ball $\{x+s:\|s\|\le \Delta\}$ that actual reduction in $J$ from taking the step is a significant fraction of the predicted reduction in $J$ based on the quadratic model. The step isn't allowed to exceed the "trust radius" $\Delta$.

The quadratic model around $x$ is
$$        
J(x+s)\approx J(x) + \langle s, g(x)\rangle  + 0.5\langle s, H(x)s\rangle
$$
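where $H(x)=DF(x)^TDF(x)$, and the step $s$ solves $H(x)s = -g(x) = DF(x)^T(b-F(x))$ approximately. The predicted reduction is the decrease in the quadratic model, which equals $-0.5\langle s, g(x)\rangle$ when $s$ solves the Gauss-Newton equation exactly:
$$
\mbox{predred} = -\langle s, g(x)\rangle - 0.5\langle s, H(x)s\rangle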
$$
The actual reduction is
$$
\mbox{actred} = J(x) - J(x+s)
$$
Steihaug's algorithm uses the trust radius $\Delta$ as a maximum step length. The usual CG termination criterion (residual less than fraction of initial) is augmented by testing the step length: if it is greater than $\Delta$ then the iteration is stopped and the step is scaled back to have length $\Delta$. Otherwise the iteration terminates as usual. If this step $s$ of length at most $\Delta$, resulting from this modified CG, produces actual objective reduction (actred) that is too much less than predicted reduction (predred), then $\Delta$ is reduced and $s$ is re-computed. Otherwise, $x$ is updated to $x+s$. If the step is very successful, in that the actual reduction is close to the predicted reduction, then $\Delta$ is increased before the next update. 

As $x$ converges to a stationary point of $J$, eventually all steps are accepted. On the other hand, if the early iterates are far from a stationary point, $\Delta$ may be reduced so much that the CG stops at the first iteration: then $s$ is a positive multiple of the negative gradient, and a short enough step in that direction must produce actual reduction close to predicted reduction. Thus this "trust region" algorithm must converge to a stationary point from any initial guess, and ends with full Gauss-Newton steps.  

A single step of this algorithm involves an inner conjugate gradient iteration, and depends on a normal residual tolerance $0<\rho<1$, reduction ("Goldstein-Armijo") parameters $0<\gamma_{\rm red} < \gamma_{\rm inc} <1$, scaling parameters $\mu_{\rm red} < 1 < \mu_{\rm inc}$ satisfying $\mu_{\rm red}*\mu_{\rm inc}<1$, and gradient tolerance $0<\epsilon<1$. To update a current estimate $x$,

1. Apply the C-G algorithm to update $s$, starting at $s=0$. Stop when either (a) $\|H(x)s+g(x)\|<\rho\|g(x)\|$, or (b) $\|s\| > \Delta$. In case (b), replace $s$ by $\Delta s /\|s\|$. Evaluate actred and predred as defined above.

2. if $\mbox{actred} < \gamma_{\rm red}*\mbox{predred}$, replace $\Delta$ by $\mu_{\rm red}*\Delta$ and repeat step 1.

3. Otherwise update $x$ (replace $x$ by $x+s$).

4. if $\mbox{actred} > \gamma_{\rm inc}*\mbox{predred}$, replace $\Delta$ by $\mu_{\rm inc}*\Delta$. 

The module *vcalg.py* contains a modified CG algorithm, and a truncated Gauss-Newton algorithm using it to implement the algorithm described here. Typical parameters for the various constants are $\rho=10^{-2}, \gamma_{\rm red}=0.1, \gamma_{\rm inc}=0.9, \mu_{\rm red}=0.5, \mu_{\rm inc}=1.8, \epsilon=10^{-2}$.
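
The four steps above can be sketched in plain NumPy as follows. This is a minimal sketch under stated assumptions and not the *vcalg* code: the function names, the dense Jacobian, the relative gradient test used for termination, and the use of the quadratic-model decrease for predred are my choices. The test problem is the 2D Rosenbrock residual introduced below.

```python
import numpy as np

def truncated_cg(Jac, g, Delta, rho=1e-6, kmax=10):
    """CG for Jac^T Jac s = -g from s = 0, stopped if |s| exceeds Delta."""
    s = np.zeros_like(g)
    r = -g                                   # normal residual at s = 0
    p = r.copy()
    gamma0 = gamma = r @ r
    for _ in range(kmax):
        if np.sqrt(gamma) <= rho * np.sqrt(gamma0):
            break                            # case (a): converged
        q = Jac @ p
        alpha = gamma / (q @ q)
        s = s + alpha * p
        if np.linalg.norm(s) > Delta:
            return Delta * s / np.linalg.norm(s)   # case (b): scale back
        r = r - alpha * (Jac.T @ q)
        delta = r @ r
        p = r + (delta / gamma) * p
        gamma = delta
    return s

def trgn_sketch(F, DF, x, b, Delta=10.0, imax=40, eps=1e-3,
                mured=0.5, muinc=1.8, gammared=0.1, gammainc=0.9):
    """Trust-region Gauss-Newton iteration following steps 1-4."""
    x = np.array(x, dtype=float)
    res = F(x) - b
    Jx = 0.5 * res @ res
    g0norm = None
    for _ in range(imax):
        Jac = DF(x)
        g = Jac.T @ res
        gnorm = np.linalg.norm(g)
        g0norm = gnorm if g0norm is None else g0norm
        if gnorm <= eps * g0norm:
            break
        while True:
            s = truncated_cg(Jac, g, Delta)              # step 1
            newres = F(x + s) - b
            Jnew = 0.5 * newres @ newres
            actred = Jx - Jnew
            Js = Jac @ s
            predred = -(g @ s) - 0.5 * (Js @ Js)
            if actred >= gammared * predred:
                break
            Delta *= mured                               # step 2
        x, res, Jx = x + s, newres, Jnew                 # step 3
        if actred > gammainc * predred:
            Delta *= muinc                               # step 4
    return x

F = lambda x: np.array([10.0 * (x[1] - x[0] ** 2), -x[0]])
DF = lambda x: np.array([[-20.0 * x[0], 10.0], [-1.0, 0.0]])
b = np.array([0.0, -1.0])
xmin = trgn_sketch(F, DF, [-1.2, 1.0], b, imax=200, eps=1e-8)
print(xmin)
```

Scaling the step back to length $\Delta$ follows the description in the text; Steihaug's original method instead moves along the last search direction to the boundary of the ball.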

I apply this algorithm to the so-called Rosenbrock function, a moderately difficult nonlinear least-squares test problem (Nocedal and Wright, Exercise 9.1). The basic example is 2x2 and depends on a scale factor, usually set to 10. The function to be minimized is then 
$$
f(x_0,x_1) = 100(x_1-x_0^2)^2 + (1-x_0)^2
$$ 
which equals $2J$ (so has the same minimizers), with $J$ as defined above and
$$
F(x)=(10(x_1-x_0^2),-x_0)^T, \, b=(0,-1)
$$
I have doubled this function to make a 4x4 problem, with scale factors 10 and 2, that is,
$$
F(x)=(10(x_1-x_0^2),-x_0,2(x_3-x_2^2),-x_2)^T, \, b=(0,-1,0,-1)
$$
The minimum is at $(1,1,1,1)^T$, where $J=0$, so it is a global minimizer, also the unique stationary point of the Rosenbrock function.

This function is defined in *npvc.DoubleRosie*, as a function on the 4-dimensional *npvc.Space*. The *deriv* method returns a *npvc.MatrixOperator*, as in the other examples above.
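
A plain NumPy version of the doubled residual and its Jacobian can be written directly from the formula above. This is a sketch for orientation, not the *npvc.DoubleRosie* source; the function names are mine. The objective value and gradient norm at the starting point used below can be compared with the first row of the iteration table printed by the run that follows.

```python
import numpy as np

def double_rosie(x):
    # F(x) for the doubled Rosenbrock problem, scale factors 10 and 2
    return np.array([10.0 * (x[1] - x[0] ** 2), -x[0],
                     2.0 * (x[3] - x[2] ** 2), -x[2]])

def double_rosie_jac(x):
    # Jacobian DF(x): two independent 2 x 2 blocks
    return np.array([[-20.0 * x[0], 10.0, 0.0, 0.0],
                     [-1.0, 0.0, 0.0, 0.0],
                     [0.0, 0.0, -4.0 * x[2], 2.0],
                     [0.0, 0.0, -1.0, 0.0]])

b = np.array([0.0, -1.0, 0.0, -1.0])
x0 = np.array([-1.2, 1.0, -1.2, 1.0])

res = double_rosie(x0) - b
J0 = 0.5 * res @ res                     # objective J at the starting point
g0 = double_rosie_jac(x0).T @ res        # gradient of J at the starting point
print(J0, np.linalg.norm(g0))
```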

I initialize $x$ at a point far from the global stationary point. The inner CG iteration is given enough iterations that it returns a very precise solution (in error by square root of macheps or less) of the Gauss-Newton equation $H(x)s + g(x)=0$ - if it is allowed to converge, instead of being stopped by the trust radius condition. I have suppressed the verbose output of the CG algorithm, to make the overall course of the GN iteration easier to see (cgverbose=0). For a large-scale problem, it will be important to monitor the behaviour of the CG iteration.

Note that the initial trust radius $\Delta=10$ is reduced four times before a successful step occurs, that is, $\mbox{actred} > \gamma_{\rm red}\mbox{predred}$.  Iteration 3 requires a further reduction of the trust radius, then the step is very successful ($\mbox{actred} > \gamma_{\rm inc}\mbox{predred}$), so the trust radius is increased. That turns out to be too long, so iteration 4 again decreases the trust radius, then takes another very successful step. This pattern repeats in Iterations 5 and 6. After iteration 6, all steps are successful.

The reader can change the verbosity level to see the actred/predred comparison that drives the modification of the trust radius (gnverbose=2), and display the details of the CG iterations (cgverbose=1 or 2) to see whether the usual convergence criterion or the trust radius condition is active. In fact only in the last few iterations does CG iteration run to completion: in previous G-N iterations, CG is truncated by the trust radius constraint.

In [18]:
import vcl
import npvc
import vcalg

sp = npvc.Space(4)
x = vcl.Vector(sp)
b = vcl.Vector(sp)

x.data[0]=-1.2
x.data[1]=1.0
x.data[2]=-1.2
x.data[3]=1.0
b.data[0]=0.0
b.data[1]=-1.0
b.data[2]=0.0
b.data[3]=-1.0

F = npvc.DoubleRosie(sp)

vcalg.trgn(x, b, F, imax=40, eps=0.001, kmax=10, rho=1.e-6, \
           Delta=10.0, mured=0.5, muinc=1.8, \
           gammared=0.1, gammainc=0.95, \
           gnverbose=1, cgverbose=0)

print('\nsolution estimate:')
x.myNameIs()  

  i      J        |grad J|      Delta
  0  1.4907e+01  1.1662e+02  1.0000e+01
  0  1.4907e+01  1.1662e+02  5.0000e+00
  0  1.4907e+01  1.1662e+02  2.5000e+00
  0  1.4907e+01  1.1662e+02  1.2500e+00
  0  1.4907e+01  1.1662e+02  6.2500e-01
  1  1.4031e+01  1.0102e+02  6.2500e-01
  2  1.3251e+01  8.4256e+01  6.2500e-01
  3  1.2748e+01  6.6337e+01  6.2500e-01
  3  1.2748e+01  6.6337e+01  3.1250e-01
  4  2.4665e+00  7.1220e+00  5.6250e-01
  4  2.4665e+00  7.1220e+00  2.8125e-01
  5  1.6241e+00  3.5881e+00  5.0625e-01
  6  1.2041e+00  8.4635e+00  5.0625e-01
  6  1.2041e+00  8.4635e+00  2.5312e-01
  7  9.6584e-01  9.6969e+00  2.5312e-01
  8  8.0398e-01  1.1596e+01  2.5312e-01
  9  5.0942e-01  1.0316e+01  4.5563e-01
 10  4.5372e-01  1.5153e+01  4.5563e-01
 11  1.9967e-01  1.2305e+01  4.5563e-01
 12  6.4199e-03  2.4658e+00  8.2012e-01
 13  7.8562e-19  2.8051e-08  1.4762e+00

solution estimate:
Vector in space:
npvc.Space of dimension 4
Data object:
[[1.]
 [1.]
 [1.]
 [1.]]


## Function Composition and Constrained Optimization via Change of Variable

Constraints on the solution vectors of optimization problems occur either because of physical or other limits on the vector components, or because objective functions are undefined or ill-behaved outside of so-called feasible sets. The most common constraints are so-called simple bounds, which mandate that the components $x$ of solution vectors (with respect to a specified basis) lie between lower and upper limits, either inclusive: $l \le x \le u$, or exclusive: $l \lt x \lt u$. There is an important distinction between the two cases: the former admits the possibility of a solution lying on the boundary of the feasible set, whereas the latter does not. In the former case, the notion of solution is generalized: a solution on the boundary is not necessarily a stationary point of the objective, but instead satisfies the so-called KKT conditions (see for example Nocedal and Wright). A great deal of effort has gone into optimization with inclusive bounds, the quintessential example being linear programming (for which the solution *always* lies on the boundary).

Little has been devoted to the exclusive case, which occurs when a physical field is known to lie strictly between two bounds, and possibly when an objective function definition is only correct when the bounds are strictly obeyed, so that any optimum or stationary point must be found in the interior of the cube defined by the bounds. This section describes an approach to this interior optimization problem, based on the observation that search for interior minima (or stationary points) is not really a constrained optimization problem, because it is equivalent to an unconstrained problem by change of variable.

The first step is to devise a function that detects whether a vector lies in the interior of the cube defined by the bounds. For the NumPy-based vector class used in this notebook, that's simple: an implementation is given in *npvc.testbounds*. The use is simple:

In [19]:
import vcl
import npvc
import vcalg
import numpy as np

sp = npvc.Space(4)
x = vcl.Vector(sp)
u = vcl.Vector(sp)
l = vcl.Vector(sp)

u.data=2.0*np.ones((4,1))
l.data=-2.0*np.ones((4,1))

x.data[0]=-1.2
x.data[1]=1.0
x.data[2]=-1.2
x.data[3]=1.0

print(npvc.testbounds(u,l,x))

x.data[2]=3.0
                   
print(npvc.testbounds(u,l,x))

True
False
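
The test itself amounts to a componentwise strict comparison. A minimal NumPy sketch of such a test, with a function name of my own choosing (this is not the *npvc.testbounds* source), reproduces the two results above:

```python
import numpy as np

def strictly_inside(u, l, x):
    """True if every component of x lies strictly between l and u."""
    return bool(np.all((l < x) & (x < u)))

u = 2.0 * np.ones(4)
l = -2.0 * np.ones(4)
x = np.array([-1.2, 1.0, -1.2, 1.0])

inside_before = strictly_inside(u, l, x)
x[2] = 3.0                                # violates the upper bound
inside_after = strictly_inside(u, l, x)
print(inside_before, inside_after)
```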


I've created a somewhat artificial example of an objective function that is only well-defined in the interior of a cube, by adding a bounds test to the *DoubleRosie* function of the previous example. If you attempt to evaluate it at a point in the complement of the cube interior, it throws an exception:

In [20]:
FB = npvc.DoubleRosieWithBounds(sp,u,l)

try:
    print(FB(x).data)
except Exception as ex:
    print(ex)

bounds violated
called from npvc.DoubleRosieWithBounds.apply
called from vcl.Vector operator()


For points in the interior, evaluation is as in the unconstrained case:

In [21]:
x.data[2]=-1.2

print(FB(x).data)

[[-4.4 ]
 [ 1.2 ]
 [-0.88]
 [ 1.2 ]]


This example is of course totally artificial, since the Rosenbrock function is well-defined in ${\bf R}^n$. However many objective functions of simulation-driven optimization, based on numerical solution of stiff ordinary differential equations or partial differential equations, may fail to return a value if the parameter vectors are chosen outside of appropriate open cubes. 

This is important because unconstrained optimization methods, such as the Gauss-Newton method described in the last section, offer no means to confine their iterates to the feasible cubes, so that the iterations fail, often at the first step. Here for example is what happens if the Trust-Region Gauss-Newton algorithm is applied to *DoubleRosieWithBounds*, using the same parameters and initial solution vector as in the last section:

In [22]:
b = vcl.Vector(sp)

b.data[0]=0.0
b.data[1]=-1.0
b.data[2]=0.0
b.data[3]=-1.0

try:
    vcalg.trgn(x, b, FB, imax=40, eps=0.001, kmax=10, rho=1.e-6, \
               Delta=10.0, mured=0.5, muinc=1.8, \
               gammared=0.1, gammainc=0.95, \
               gnverbose=1, cgverbose=0)

    print('\nsolution estimate:')
    x.myNameIs()  
except Exception as ex:
    print(ex)

  i      J        |grad J|      Delta
  0  1.4907e+01  1.1662e+02  1.0000e+01
bounds violated
called from npvc.DoubleRosieWithBounds.apply
called from vcl.Vector operator()
called from trgn


The open cube defined by the bounds can be viewed as the image of ${\bf R}^n$ under a differentiable map with a differentiable inverse. An example of such a map in 1D is
$$
g(x) = \frac{u+l}{2} + \frac{u-l}{2} \frac{x}{\sqrt{1+x^2}}
$$
(obviously not the only choice): for any $x \in {\bf R}$, $l <g(x)<u$, and $g$ is $C^{\infty}$ and invertible with $C^{\infty}$ inverse. For $n>1$ dimensions, simply apply the same transformation on each axis. The resulting diagonal map has a diagonal Jacobian with positive entries.
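
The map, its inverse, and the diagonal of its Jacobian are short enough to write out in NumPy. The function names below are mine and the code is only a sketch of the formulas just given; with bounds $\pm 2$ the inverse sends $(-1.2, 1)$ to $(-0.75, 0.57735\ldots)$.

```python
import numpy as np

def to_cube(x, u, l):
    # g(x): maps R^n onto the open cube l < y < u, componentwise
    return 0.5 * (u + l) + 0.5 * (u - l) * x / np.sqrt(1.0 + x ** 2)

def from_cube(y, u, l):
    # inverse of g: t lies in (-1, 1) when y is strictly inside the bounds
    t = (2.0 * y - (u + l)) / (u - l)
    return t / np.sqrt(1.0 - t ** 2)

def to_cube_jac_diag(x, u, l):
    # diagonal of the Jacobian of g, positive for every x
    return 0.5 * (u - l) / (1.0 + x ** 2) ** 1.5

u = 2.0 * np.ones(4)
l = -2.0 * np.ones(4)
y = np.array([-1.2, 1.0, -1.2, 1.0])

x = from_cube(y, u, l)
print(x)
print(np.allclose(to_cube(x, u, l), y))       # round trip recovers y
print(np.all(to_cube_jac_diag(x, u, l) > 0))  # Jacobian entries are positive
```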

Suppose $f$ is a function on the open cube defined by vectors $l,u$. Then the composition $f \circ g$ is defined on ${\bf R}^n$ ($f \circ g (x) = f(g(x))$). The gradient of $f \circ g$ is 
$$
\mbox{grad }f\circ g(x) = Dg(x)^T \mbox{grad }f(g(x))
$$
so $g(x)$ is a stationary point of $f$ if and only if $x$ is a stationary point of $f \circ g$. Thus you can find the stationary points of $f$ by finding the stationary points of $f \circ g$ and mapping them to the feasible cube by $g$.

So here is the proposed algorithm, in a nutshell: find the stationary points of $f \circ g$ using an unconstrained minimization algorithm (like trust-region GN), then map to stationary points of $f$ in the feasible cube. 

I've implemented the mapping described above in *npvc.ulbounds* - it defines a *vcl.Function*. To transfer the initial solution estimate from the open cube to ${\bf R}^n$, you need the inverse function, implemented in *npvc.invulbounds*. For example, *xx* is the initial datum in ${\bf R}^n$ corresponding to the initial point for GN in the last section's example:

In [23]:
ful = npvc.ulbounds(sp,u,l)
iul = npvc.invulbounds(sp,u,l)

xx = iul(x)

print(xx.data)

yy = ful(xx)

print(yy.data)



[[-0.75      ]
 [ 0.57735027]
 [-0.75      ]
 [ 0.57735027]]
[[-1.2]
 [ 1. ]
 [-1.2]
 [ 1. ]]


Computational mapping of constrained to unconstrained minimization requires computational realization of function composition. This realization is supplied in *vcl.comp*, which returns a *vcl.Function* instance. *comp(f,g)* has the domain of *g*, the range of *f*, and derivative computed by the chain rule (a composition of linear maps, implemented in *vcl.lopcomp*).
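
To make the chain rule concrete, here is a small stand-alone sketch of composition with derivative. The class `Func` and the function `compose` are hypothetical stand-ins written for this illustration; they are not the *vcl* classes, and they use dense Jacobian matrices rather than operator objects. The Jacobian of the composition is checked against central finite differences.

```python
import numpy as np

class Func:
    """Hypothetical stand-in for a differentiable function: value and Jacobian."""
    def __init__(self, value, jac):
        self.value, self.jac = value, jac

    def __call__(self, x):
        return self.value(x)

def compose(f, g):
    """Composition f o g; its Jacobian is Df(g(x)) Dg(x) by the chain rule."""
    return Func(lambda x: f(g(x)), lambda x: f.jac(g(x)) @ g.jac(x))

# outer function: 2D Rosenbrock residual; inner function: map of R^2 onto (-2,2)^2
f = Func(lambda y: np.array([10.0 * (y[1] - y[0] ** 2), -y[0]]),
         lambda y: np.array([[-20.0 * y[0], 10.0], [-1.0, 0.0]]))
g = Func(lambda x: 2.0 * x / np.sqrt(1.0 + x ** 2),
         lambda x: np.diag(2.0 / (1.0 + x ** 2) ** 1.5))
fg = compose(f, g)

x = np.array([-0.75, 0.57735027])
h = 1e-6
# central differences, one column per coordinate direction
fd = np.column_stack([(fg(x + h * e) - fg(x - h * e)) / (2 * h)
                      for e in np.eye(2)])
print(np.allclose(fd, fg.jac(x), atol=1e-5))
```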

Here is the suggested algorithm, applied to the constrained version of the problem from the last section - same parameters, same initial vector, very close to the same final estimate. The residual is somewhat larger, but note that it is the residual of the ${\bf R}^n$ problem that is displayed. The final estimate is very close (four digits) to the solution of the Rosenbrock problem, and the residual is the same - the gradient is very close also, as the Jacobian of the change of variable is a rather tame matrix at that point. So this is a solution of the accuracy specified by the parameters.

In [24]:
FBC = vcl.comp(FB,ful)

try:
    xx = iul(x)
    
    vcalg.trgn(xx, b, FBC, imax=40, eps=0.001, kmax=10, rho=1.e-6, \
               Delta=10.0, mured=0.5, muinc=1.8, \
               gammared=0.1, gammainc=0.95, \
               gnverbose=1, cgverbose=0)

    x = ful(xx)

    print('\nsolution estimate:')
    x.myNameIs()  
except Exception as ex:
    print(ex)

  i      J        |grad J|      Delta
  0  1.4907e+01  1.2450e+02  1.0000e+01
  0  1.4907e+01  1.2450e+02  5.0000e+00
  0  1.4907e+01  1.2450e+02  2.5000e+00
  0  1.4907e+01  1.2450e+02  1.2500e+00
  0  1.4907e+01  1.2450e+02  6.2500e-01
  0  1.4907e+01  1.2450e+02  3.1250e-01
  1  1.4468e+01  1.3704e+02  3.1250e-01
  2  1.3421e+01  1.4054e+02  5.6250e-01
  3  1.3037e+01  1.2262e+02  5.6250e-01
  3  1.3037e+01  1.2262e+02  2.8125e-01
  4  1.9624e+00  2.4428e+01  5.0625e-01
  4  1.9624e+00  2.4428e+01  2.5312e-01
  4  1.9624e+00  2.4428e+01  1.2656e-01
  5  1.6984e+00  2.3961e+01  1.2656e-01
  6  1.5194e+00  2.6620e+01  1.2656e-01
  7  1.2855e+00  2.9754e+01  1.2656e-01
  8  1.0135e+00  3.0919e+01  2.2781e-01
  9  7.3727e-01  3.2197e+01  2.2781e-01
 10  3.8639e-01  2.4951e+01  4.1006e-01
 11  1.1081e-01  1.3575e+01  4.1006e-01
 12  2.4490e-04  6.2951e-01  7.3811e-01
 13  3.6877e-10  7.6970e-04  1.3286e+00

solution estimate:
Vector in space:
npvc.Space of dimension 4
Data object:
[[1.  

To finish this discussion of constraint implementation, I emphasize again that the change-of-variables approach does *NOT* solve constrained optimization problems in the usual sense, that is, identify points on the constraint boundary that satisfy the KKT conditions (in addition to stationary points in the interior). Many other approaches, such as L-BFGS-B and the Coleman-Li reflection algorithm (both implemented in SciPy), are constrained optimization solvers. The approach I have just described is not. If for example you were to change the bounds in the *DoubleRosieWithBounds* example to $u_2 = 0, l_2=-1$, then the change-of-variable algorithm cannot find a stationary point, since there are none in the feasible set. There are KKT points on the boundary, but our algorithm won't find them. So what does it do? Try it and find out.

The utility of the change-of-variable approach is in finding points at which the objective value is small and the bounds are strictly observed. This will usually not be a KKT point, or even (exactly) a stationary point, but that doesn't matter. If (1) the objective function is strictly convex, and (2) the minimum value is smaller than all values on the boundary, then the change-of-variables algorithm will approximate a minimizer. That is really all that can be said.

A practical use pattern could specify two sets of bounds, an outer set $(\bar{u},\bar{l})$ and an inner set $(u,l)$, satisfying $\bar{l} < l < u < \bar{u}$. The outer bounds define a feasible set for evaluation of the objective function, that is, for models satisfying the outer bounds, the objective function returns a value rather than an error condition. The inner set defines a feasible set for the known or posited properties of a solution, that is, conditions that a physically sensible model should satisfy. The algorithm uses the mapping from ${\bf R}^n$ to the open cube defined by the outer bounds, but terminates if the iteration exceeds the inner bounds. Under the conditions mentioned in the last paragraph, if such termination occurs, no stationary point satisfying the inner bounds exists.

## Truncated Gauss-Newton Algorithm for Variable Projection Reduction of Separable Nonlinear Least Squares

The variable projection method is a minimization algorithm for a scalar function $f:X \oplus W \rightarrow {\bf R}$ of class $C^2$ on a product space. I will assume that $f$ is defined on the whole product space; the refinements necessary to accommodate constraints on $x$ are similar to those needed in the nonlinear least squares problem discussed in the last section. While not strictly necessary, I will also assume that $f$ is quadratic in the second variable. Put another way, $f(x,w) = 0.5*\|A(x)w-b\|^2$, where $b \in Y$ and the values of $A$ are linear operators $W \rightarrow Y$. Further, it's usually assumed that $A(x)$ is of full column rank (or coercive, in the infinite dimensional case), so that for each $x$, there is a unique minimizer $\tilde{w}(x)$ of $w \mapsto f(x,w)$, the solution of the normal equation: $A(x)^T(A(x)\tilde{w}(x) - b)=0$. Define the *variable projection (VP) reduction* $\tilde{f}$ by 
$$
\tilde{f}(x) = 0.5*\|A(x)\tilde{w}(x)-b\|^2= \min_w f(x,w)
$$
Since $A(x)$ is assumed coercive for every $x \in X$, $A(x)^TA(x)$ is invertible, and $\tilde{w}(x) = (A(x)^TA(x))^{-1}A(x)^Tb$. So
$$ 
\tilde{f}(x) = 0.5*\|(A(x)(A(x)^TA(x))^{-1}A(x)^T - I)b\|^2
$$
The operator in parentheses is the negative of the projection of $Y$ onto the orthocomplement of the range of $A(x)$: call this projection $P(x)$. That is,
$$
P(x) = I-A(x)(A(x)^TA(x))^{-1}A(x)^T
$$
and
$$
\tilde{f}(x) = 0.5*\|P(x)b\|^2
$$
So the reduced objective is half the length squared of the projection of the data vector $b$ onto the orthocomplement of the range of $A(x)$, which of course depends on $x$. That fact accounts for the name "variable projection".
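
The identities above are easy to check numerically on a small separable model. In the sketch below the model $A(x)$, with columns $e^{-x_j t}$ sampled at a few points $t$, and the synthetic data $b$ are hypothetical choices made only for illustration; the projector is built from a QR factorization rather than from the explicit inverse of $A(x)^TA(x)$.

```python
import numpy as np

t = np.linspace(0.0, 1.0, 8)                      # hypothetical sample points

def A(x):
    # columns exp(-x_j t): a simple separable model, chosen for illustration
    return np.exp(-np.outer(t, x))

b = 1.0 * np.exp(-1.0 * t) + 2.0 * np.exp(-3.0 * t)   # synthetic data
x = np.array([0.5, 2.0])

Ax = A(x)
w = np.linalg.lstsq(Ax, b, rcond=None)[0]         # inner minimizer w~(x)
Q, _ = np.linalg.qr(Ax)                           # orthonormal basis of range A(x)
P = np.eye(len(t)) - Q @ Q.T                      # projector onto the orthocomplement

f_vp = 0.5 * np.linalg.norm(P @ b) ** 2           # 0.5 |P(x) b|^2
f_min = 0.5 * np.linalg.norm(Ax @ w - b) ** 2     # min over w of f(x, w)

print(np.isclose(f_vp, f_min))
print(np.allclose(P @ P, P), np.allclose(P, P.T), np.allclose(P @ Ax, 0.0))
```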

One of the main results of the Golub and Pereyra 1973 paper is that $x$ is a stationary point of $\tilde{f}$ if and only if $(x,\tilde{w}(x))$ is a stationary point of $f$. It's worth spelling out the argument because it highlights several important points about the VP reduction.

Note that $f$ is differentiable, if $(x,w)\mapsto A(x)w$ is differentiable, which I will assume. Suppose $s \in X$. Then the directional derivative of $\tilde{w}(x)=(A(x)^TA(x))^{-1}A(x)^Tb$ at $x$ in direction $s$ is 
$$
\frac{d}{dt}((A(x+ts)^TA(x+ts))^{-1}A(x+ts)^Tb)|_{t=0}
$$
$$
=-(A(x)^TA(x))^{-1}\frac{d}{dt}(A(x+ts)^TA(x+ts)(A(x)^TA(x))^{-1}A(x)^Tb)|_{t=0} + (A(x)^TA(x))^{-1}\frac{d}{dt}(A(x+ts)^Tb)|_{t=0}
$$
and in particular $\tilde{w}$ is differentiable. Therefore
$$
\frac{d}{dt}\tilde{f}(x+ts)|_{t=0} = \left\langle\frac{d}{dt}A(x+ts)\tilde{w}(x)|_{t=0}, A(x)\tilde{w}(x)-b\right\rangle + 
$$
$$
\left\langle A(x)\frac{d}{dt}\tilde{w}(x+ts)|_{t=0},A(x)\tilde{w}(x)-b\right\rangle
$$
Since the normal equation is equivalent to the assertion that the residual $A(x)\tilde{w}(x)-b$ is orthogonal to the range of $A(x)$, the last term vanishes.

The first term can be re-written as the directional derivative of $f(x,w)$ for fixed $w=\tilde{w}(x)$, that is,
$$
\frac{d}{dt}\tilde{f}(x+ts)|_{t=0} = \frac{d}{dt}f(x+ts,w)|_{t=0,w=\tilde{w}(x)}.
$$
So $x$ is a stationary point of $\tilde{f}$ if and only if the directional derivative of $f$ at $(x,\tilde{w}(x))$ in all directions $(s,0)$ is zero. But the directional derivative of $f$ at $(x,\tilde{w}(x))$ in all directions $(0,\delta w)$ is also zero - that is the definition of $\tilde{w}(x)$. Since the directional derivative is linear in the direction, the directional derivative of $f$ at $(x,\tilde{w}(x))$ is zero in all directions, that is, $(x,\tilde{w}(x))$ is a stationary point of $f$, if and only if $x$ is a stationary point of $\tilde{f}$. 

The derivative of the linear-operator-value function $A$ is naturally a bilinear-operator-valued function, since it's linear in the argument $w$ and in the direction $s$ separately. Call it $DA$:
$$
\frac{d}{dt}A(x+ts)w|_{t=0} = DA(x)(w,s)
$$ 
In terms of $DA$, the directional derivative of $\tilde{f}$ is
$$
\frac{d}{dt}\tilde{f}(x+ts)|_{t=0} = \left\langle DA(x)(\tilde{w}(x),s), A(x)\tilde{w}(x)-b\right\rangle 
$$
The expression on the right is the same as the derivative of the function
$$
x \mapsto f(x,w) = 0.5*\|A(x)w-b\|^2,
$$ 
evaluated at $w=\tilde{w}(x)$, that is, $f(x,w)$ *for fixed w*.

The gradient of $\tilde{f}$ is the Riesz representer of the directional derivative:
$$
\langle s, \mbox{grad} \tilde{f}(x)\rangle = \frac{d}{dt}\tilde{f}(x+ts)|_{t=0}
$$
$$
= \left\langle s, DA(x)^*(\tilde{w}(x),A(x)\tilde{w}(x)-b)\right\rangle 
$$
in which $DA(x)^*$ denotes the *partial adjoint* defined by
$$
\langle s, DA(x)^*(w,y) \rangle = \langle DA(x)(w,s),y\rangle
$$
That is, $y \mapsto DA(x)^*(w,y)$ is the adjoint of the map $s \mapsto DA(x)(w,s)$, the latter being the derivative of $x \mapsto A(x)w$. 

Thus
$$
\mbox{grad} \tilde{f}(x) = DA(x)^*(\tilde{w}(x),A(x)\tilde{w}(x)-b).
$$

In fact, this is also the gradient of a least-squares objective.
For a fixed choice $w \in W$, define $F_w(x) = A(x)w$. Then $DA(x)^*(w,y) = DF_w(x)^Ty$. Moreover, if $f_w(x) = 0.5*\|F_w(x)-b\|^2$, then 
$$
\mbox{grad} f_w(x)|_{w=\tilde{w}(x)} =  \mbox{grad}\tilde{f}(x).
$$
As in the Gauss-Newton section above,
$$
\mbox{grad} f_w(x) = DF_w(x)^T(F_w(x)-b).
$$ 
Therefore computing the gradient of the VP reduction can be accomplished by combining a computation of the gradient of a nonlinear least-squares objective with a solution of the normal equation.
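
This recipe can be checked against finite differences on a toy separable model. As before, the exponential model, the data, and all function names in this sketch are hypothetical choices for illustration and not library code. The gradient is assembled from the inner least-squares solution and the Jacobian of $x \mapsto A(x)w$ with $w$ held fixed.

```python
import numpy as np

t = np.linspace(0.0, 1.0, 8)
b = 1.0 * np.exp(-1.0 * t) + 2.0 * np.exp(-3.0 * t)    # synthetic data

def A(x):
    return np.exp(-np.outer(t, x))                     # columns exp(-x_j t)

def w_tilde(x):
    return np.linalg.lstsq(A(x), b, rcond=None)[0]     # solves the normal equation

def f_vp(x):
    r = A(x) @ w_tilde(x) - b
    return 0.5 * r @ r

def grad_vp(x):
    w = w_tilde(x)
    # Jacobian of x -> A(x) w for fixed w: column j is -t * w_j * exp(-x_j t)
    DFw = -t[:, None] * np.exp(-np.outer(t, x)) * w
    return DFw.T @ (A(x) @ w - b)

x = np.array([0.5, 2.0])
h = 1e-6
fd = np.array([(f_vp(x + h * e) - f_vp(x - h * e)) / (2 * h) for e in np.eye(2)])
print(np.allclose(fd, grad_vp(x), rtol=1e-4, atol=1e-7))
```

Note that the derivative of $\tilde{w}$ never appears in `grad_vp`, in agreement with the argument above.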

Of course $\tilde{f}$ *is* itself a nonlinear least squares objective: if you define $F(x)=A(x)\tilde{w}(x)$, then 
$$
\tilde{f}(x) = 0.5*\|F(x)-b\|^2.
$$
so it is natural to use the Gauss-Newton algorithm to minimize $\tilde{f}$. The Gauss-Newton step $s$ solves $DF(x)^T(DF(x)s+F(x)-b)=DF(x)^TDF(x)s+\mbox{grad}\tilde{f}(x)=0$. This is a simplification over the Newton step, but for the special case of the VP reduction it can be simplified still further.
$$
DF(x)s = D(A(x)\tilde{w}(x))s = DA(x)(\tilde{w}(x),s) + A(x)D\tilde{w}(x)s
$$
From the differentiability analysis of $\tilde{w}$, writing $DA(x)^T(y,s)$ for $\frac{d}{dt}A(x+ts)^Ty|_{t=0}$,
$$
D\tilde{w}(x)s = -(A(x)^TA(x))^{-1}\left[DA(x)^T(A(x)(A(x)^TA(x))^{-1}A(x)^Tb,s) + A(x)^T DA(x)((A(x)^TA(x))^{-1}A(x)^Tb,s)\right] + (A(x)^TA(x))^{-1}DA(x)^T(b,s)
$$
$$
= -(A(x)^TA(x))^{-1}[DA(x)^T(A(x)\tilde{w}(x),s) + A(x)^TDA(x)(\tilde{w}(x),s)] + (A(x)^TA(x))^{-1}DA(x)^T(b,s)
$$
So 
$$
DF(x)s = DA(x)(\tilde{w},s) - A(x) (A(x)^TA(x))^{-1}[DA(x)^T(A(x)\tilde{w}(x),s) + A(x)^TDA(x)(\tilde{w}(x),s)] + A(x)(A(x)^TA(x))^{-1}DA(x)^T(b,s)
$$
$$
= (I-A(x)(A(x)^TA(x))^{-1}A(x)^T)DA(x)(\tilde{w}(x),s) + A(x)(A(x)^TA(x))^{-1}DA(x)^T(b-A(x)\tilde{w}(x),s)
$$
The second term has the residual $b-A(x)\tilde{w}(x)=b-F(x)$ as the first argument of the bilinear operator $DA(x)^T$. Kaufman first pointed this out in 1974, and proposed that this term be dropped with the same justification as underlies the transition from Newton to Gauss-Newton: that is, if the residual is small (nearly noise-free data and close to the solution), this term should be negligible. Accepting this proposal, I obtain the VP-GN approximation
$$
DF(x)s \approx (I-A(x)(A(x)^TA(x))^{-1}A(x)^T)DA(x)(\tilde{w}(x),s)
$$
$$
= P(x)DA(x)(\tilde{w}(x),s)
$$
where $P(x)=I-A(x)(A(x)^TA(x))^{-1}A(x)^T$ is the projection of $Y$ onto the orthocomplement of the range of $A(x)$, introduced earlier.
Since $P(x)$ is a projection, it is symmetric, positive semi-definite, and idempotent, that is $P(x)^TP(x)=P(x)^2=P(x)$. Thus the Gauss-Newton operator is approximately
$$
DF(x)^TDF(x)s \approx H_{VP}(x)s = DA(x)^*(\tilde{w}(x),P(x)DA(x)(\tilde{w}(x),s)).
$$
The Kaufman modification of GN for VP is to replace $H(x)=DF(x)^TDF(x)$ with $H_{VP}$. The solution $s$ of the modified GN equation $H_{VP}(x)s=-\mbox{grad}\tilde{f}(x)$ is a descent (or at least non-ascent) direction for $\tilde{f}$:
$$
\langle \mbox{grad}\tilde{f}(x), s \rangle 
= -\langle H_{VP}(x)s, s\rangle 
$$
$$
=-\langle DA(x)^*(\tilde{w}(x),P(x)DA(x)(\tilde{w}(x),s)),s\rangle
= - \langle DA(x)(\tilde{w}(x),s),P(x)DA(x)(\tilde{w}(x),s)\rangle
$$
$$
= -\|P(x)DA(x)(\tilde{w}(x),s)\|^2 \le 0
$$
since $P(x)$ is a projector. 
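
The descent property can be observed numerically. The sketch below builds the Kaufman operator $P(x)DA(x)(\tilde{w}(x),\cdot)$ as a dense matrix for a toy exponential model and solves the modified Gauss-Newton equation by a least-squares solve. The model, the data, and the variable names are hypothetical and chosen for illustration only; a large-scale code would use operators and conjugate gradients instead.

```python
import numpy as np

t = np.linspace(0.0, 1.0, 8)
b = 1.0 * np.exp(-1.0 * t) + 2.0 * np.exp(-3.0 * t)   # synthetic data
x = np.array([0.5, 2.0])

Ax = np.exp(-np.outer(t, x))                     # A(x), columns exp(-x_j t)
w = np.linalg.lstsq(Ax, b, rcond=None)[0]        # w~(x)
res = Ax @ w - b                                 # residual A(x) w~(x) - b

DFw = -t[:, None] * Ax * w                       # matrix of s -> DA(x)(w~(x), s)
Q, _ = np.linalg.qr(Ax)
JK = DFw - Q @ (Q.T @ DFw)                       # P(x) DA(x)(w~(x), .)

grad = DFw.T @ res                               # gradient of the VP reduction
s = np.linalg.lstsq(JK, -res, rcond=None)[0]     # solves H_VP s = -grad

print(np.allclose(JK.T @ JK @ s, -grad))
print(grad @ s <= 0.0)
print(np.isclose(grad @ s, -np.linalg.norm(JK @ s) ** 2))
```

The least-squares solve works here because the residual is orthogonal to the range of $A(x)$, so that $P(x)$ leaves it unchanged and the normal equation of the solve is exactly $H_{VP}(x)s=-\mbox{grad}\tilde{f}(x)$.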

These calculations suggest a modified Gauss-Newton algorithm, which I will call the Kaufman-CG (KCG) algorithm since Kaufman supplied the key observation:

