# steveWang/Notes



EE 221A: Linear System Theory

August 23, 2012

Prof. Claire Tomlin (tomlin@eecs). 721 Sutardja Dai Hall. Somewhat tentative office hours on schedule: T 1-2, W 11-12. http://inst.eecs.berkeley.edu/~ee221a

GSI: Insoon Yang (iyang@eecs). Insoon's office hours: M 1:30 - 2:30, Th 11-12.

Homeworks typically due on Thursday or Friday.

Intro

Bird's eye view of modeling in engineering + design vs. in science.

"Science":

$$\mbox{Dynamical system} \rightarrow \mbox{experiments} \leftrightarrow \mbox{Model}$$

"Engineering":

$$\mbox{dynamical system} \rightarrow \mbox{experiments} \leftrightarrow \mbox{Model} \rightarrow \mbox{control}$$

Control validation, verification, testing.

Broad brush of a couple of concepts: modeling. We're going to spend a lot of time talking about modeling in this course.

First: for any dynamical system, there's an infinite number of models you could design, depending on your level of abstraction. Typically, you choose level of abstraction based on use case. Often, only able to use certain kinds of experiments. e.g. probing of protein concentration levels. If you're able to measure just this, then the signals in your model should have something to do with these concentration levels.

As we said, the same physical system can have many different models. Another example: a MEMS device. We can have models at various levels of abstraction, e.g. an electrical model: the silicon / electrostatics of the system. Might be interested in manipulation of the device.

Alt: mechanical model (could have a free-body diagram, e.g.).

Another example: Hubble telescope. Could think of orbital dynamics. Individual rigid body dynamics. Or properties of the telescope: the individual optical models of the mirrors and their interactions. The idea here is just to realize that the word "model" can mean very different things; the logical model to use depends on the task at hand. The main point, a basic tenet of engineering: the value of a model lies in choosing the simplest model that answers the questions you are asking about the system; the simplest model that will allow you to predict something that you didn't build into it.

Predict IO relations that you didn't explicitly design the model on. One of the properties of a good linear model for a system: it obeys linearity, so if you form a basis for your domain, then you have the system response to any input spanned by this basis. Probably the most important thing to take away from this course: linearity is a very strong principle that allows us to build up a set of tools.

Time

We have this term, a "dynamical system". A key part is that it changes with time, responding with behavior over time. Time will turn out to be quite important. Depending on how we model time, we can come up with different variables. We call time ($t$) a privileged variable because it has certain properties. Namely, when we think about time, we think about time marching forward (unidirectionality of evolution). Different models: continuous time ($t \in \Re$, could be negative, could go backwards, if we are interested in backwards evolution), or discrete time $t \in \{nT, n \in \mathbb{Z}\}$, where $T$ is some sampling time. So in that sense, in discrete time, we have some set. We can also come up with more complicated models of time, like discrete-time asynchronous. The previous model had some constant period $T$; in DT asynchronous, we just have a set of points in time. This is becoming a more important model now with asynchronous processes (reacting to events that happen at previously undefined points in time).

Linear vs. nonlinear models

More on this later. Suppose we could take the system, and we could represent it as being in one of a number of states.

First: suppose a finite number of states (so can be modeled by a FSM), which represent some configuration of the system. State space represents states system can be in at any point in time. If state space is finite, we can use a finite-state automaton. Each state has an output (prints out a message, or a measurement is taken), and we also consider inputs. The inputs are used to evolve the dynamic system. Input affects a transition. We can build up the dynamics of the system by just defining the transition function.

Packet transmitting node: first state is "ready-to-send"; second state is "send packet & wait"; and the third state is "flush buffer". If buffer empty, stay in $q_1$. If not empty, transitions to $q_2$. If ACK received, then transition to $q_3$ and return to $q_1$. If $T$ time units elapse, we time out and transition directly to $q_1$. Here, no notion of linear or nonlinear systems. To be able to talk about linear or nonlinear models, we need to be able to put some vector space structure on these three elements. System must then satisfy superposition.
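The transition structure above can be sketched as a lookup table. The state and event names below are illustrative stand-ins for $q_1, q_2, q_3$ and the buffer/ACK/timeout inputs:

```python
# A sketch of the packet-transmitting node as a finite-state automaton.
# State and event names are illustrative stand-ins for q1, q2, q3 and the
# buffer/ACK/timeout inputs described above.

TRANSITIONS = {
    # (state, input event) -> next state
    ("ready",    "buffer_empty"):    "ready",      # stay in q1
    ("ready",    "buffer_nonempty"): "sending",    # q1 -> q2
    ("sending",  "ack"):             "flushing",   # q2 -> q3 on ACK
    ("sending",  "timeout"):         "ready",      # q2 -> q1 after T units
    ("flushing", "done"):            "ready",      # q3 -> q1
}

def step(state, event):
    """The transition function: (state, input) -> next state."""
    return TRANSITIONS[(state, event)]

# One full send cycle returns to the initial state:
state = "ready"
for event in ["buffer_nonempty", "ack", "done"]:
    state = step(state, event)
assert state == "ready"
```

The dynamics are fully specified by the transition table; there is no vector-space structure here, so "linear" and "nonlinear" do not yet apply, exactly as noted above.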

Back to abstract dynamical system (thing we could never hope to model perfectly): rather than thinking about a set of rules, we're going to think about a mathematical model. Three classes: CT, DT [synchronous], and discrete-state (typically finite). Within each of these classes we can further break each down. For the first two, we can consider linearity, and we can further break these down into time-varying (TV) and time-invariant (TI). This course is going to focus just on the linear systems in continuous and discrete time, both time-varying and time-invariant. We'll use differential equation models in continuous time and difference equation models in discrete time. We usually develop in continuous-time and show analogies in discrete-time.

Analysis and Control

Control is pervasive. If you go to any of the control conferences, you see areas where techniques from this course are applied. Modern control came about because of aerospace in the 50s. e.g. autopilot, air traffic control. There the system itself is the system of aircraft. Chemical process control. Mechatronics, MEMS, robotics. Novel ways to automate things that hadn't been automated previously, mostly because of a renaissance in sensing. Power systems. Network control systems: how you combine models of the system itself with the control models. Quantum chemistry. Typically, when we think about state spaces, we think about the state as a vector in $\Re^n$. In many cases, you want to think about the state spaces as more complicated (e.g. $C^\infty$, the class of smooth functions).

Difference between verification, simulation, and validation

One of the additional basic tenets of this course: if you have a model of the system, and you can analytically verify that the model behaves in given ways for ranges of initial conditions, then that is a very valuable thing to have: you have a proof that as long as the system adheres to the model, then your model will work as expected. Simulation gives you system behavior for a certain set of parameters. Very different, but they complement each other. Analyze simpler models, simulate more complex models.

Linear Algebra

Functions and their properties.

Fields, vector spaces, properties and subspaces.

(note regarding notation: $\Re^+$ means the non-negative reals; similarly, $\mathbb{C}_+$ means the complex numbers with non-negative real part)

$\exists!$: exists a unique, $\exists?$: does there exist, $\ni$: such that.

Cartesian product: $\{(x,y) \vert x \in X \land y \in Y\}$ (set of ordered n-tuples)

Functions and Vector Spaces

August 28, 2012

OH: M/W 5-6, 258 Cory

Today: beginning of the course: review of lin. alg topics needed for the course. We're going to go through lecture notes 2 and probably start on the third sets of notes. Will bring copies of 3 and 4 on Thursday.

We did an introduction to notation and topics last time. First topic: functions, which will be used synonymously with "maps". Terminology will be used interchangeably.

Given two sets of elements X, Y, we defined $\fn{f}{X}{Y}$. Notion of range vs. codomain (range is merely the subset of the codomain covered by f). We define $f(X) \defequals \set{f(x)}{x \in X}$ to be the range.

Properties of functions

Injectivity of functions ("one-to-one"). A function $f$ is said to be injective iff the function maps each x in X to a distinct y in Y. Equivalently, $f(x_1) = f(x_2) \iff x_1 = x_2$. This is also equivalent to $x_1 \neq x_2 \iff f(x_1) \neq f(x_2)$.

Surjectivity of functions ("onto"). A function $f$ is said to be surjective if the codomain is equal to the range. Basically, the map $f$ covers the entire codomain. A way to write this formally is that $f$ is surjective iff $\forall y \in Y \exists x \in X \ni y = f(x)$.

And then a map $f$ is bijective iff it is both injective and surjective. We can write this formally as $\forall y \in Y, \exists!\ x \in X \st f(x) = y$.

Example: inverse of a map. We can talk about left and right inverses of maps. Suppose we have a map $\fn{f}{X}{Y}$. We're going to define this map $\mathbb{1}_X$ as the identity map on X. Namely, application of this map to any $x \in X$ will yield the same $x$.

The left inverse of $f$ is $\fn{g_L}{Y}{X}$ such that $g_L \circ f = \mathbb{1}_X$. In other words, $\forall x\in X, (g_L \circ f)(x) = x$.

Prove: $f$ has a left inverse $g_L$ iff $f$ is injective. First of all, let us prove the backwards implication. Assume $f$ is injective. Prove that $g_L$ exists. We're going to construct the map $\fn{g_L}{Y}{X}$ as $g_L(f(x)) = x$, where the domain here is the range of $f$. In order for this to be a well-defined function, we require that $x$ is unique, which is met by injectivity of $f$.

Now let us prove the forward implication. Assume that this left inverse $g_L$ exists. By definition, $g_L \circ f = \mathbb{1}_X \iff \forall x \in X, g_L(f(x)) = x$. If $f$ were not injective, then $g_L$ would not be well-defined ($\exists x_1 \neq x_2$ such that $f(x_1) = f(x_2)$, and so $g_L$ is no longer a function).

review: contrapositive: $(A \implies B) \iff (\lnot B \implies \lnot A)$; proof by contradiction: assume $A \land \lnot B$ and derive a contradiction, which establishes $A \implies B$.

We can similarly show surjectivity $\iff$ existence of a right inverse. With these two, we can then trivially show that bijectivity $\iff$ existence of an inverse (rather, both a left and right inverse, which we can easily show must be equal). Proof will likely be part of the first homework assignment.
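A minimal sketch of the constructive direction of the proof, for a function on a finite domain (the function `f` below is an arbitrary illustrative choice):

```python
def left_inverse(f, domain):
    """Construct g_L with g_L(f(x)) = x over a finite domain. This is
    well-defined exactly when f is injective on the domain."""
    table = {}
    for x in domain:
        y = f(x)
        if y in table and table[y] != x:
            raise ValueError("f is not injective: no left inverse exists")
        table[y] = x
    return lambda y: table[y]

f = lambda x: 2 * x + 1                  # injective on the integers
g_L = left_inverse(f, range(5))
assert all(g_L(f(x)) == x for x in range(5))   # g_L o f = identity on X
```

The `ValueError` branch is exactly the failure mode in the proof: two domain points colliding in the codomain would make the table ill-defined.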

Fields

We need the definition of a vector and a field in order to define a vector space.

A field is an object: a set of elements $S$ with two closed binary operations defined upon $S$. These two operations are addition (which forms an abelian group over $S$) and multiplication (which forms an abelian group over $S - \{0\}$) such that multiplication distributes over addition. Note that convention dictates $0$ to be the additive identity and $1$ to be the multiplicative identity.

Other silly proofs include showing that if both a left and right identity exist, they must be equivalent, or that multiplication by $0$ maps any element to $0$.

Vector spaces (linear spaces)

A vector space is a set of vectors V and a field of scalars $\mathbb{F}$, combined with vector addition and scalar multiplication. Vector addition forms an abelian group, but this time, scalar multiplication has the properties of a monoid (existence of an identity and associativity). We then have the distributive laws $(\alpha + \beta)x = \alpha x + \beta x$ and $\alpha (x + y) = \alpha x + \alpha y$.

Function spaces

We define a space $F(D,V)$, where $(V, \mathbb{F})$ is a vector space and $D$ is a set: $F(D, V)$ is the set of all functions $\fn{f}{D}{V}$. Is $(F, \mathbb{F})$ a vector space (yes), where vector addition is pointwise addition of functions and scalar multiplication is pointwise multiplication by a scalar?

Examples of this: space of continuous functions on the closed interval $\fn{\mathcal{C}}{\bracks{t_0, t_1}}{\Re^n}$, ($(C(\bracks{t_0, t_1}, \Re^n), \Re)$). This is indeed a vector space.

Lebesgue spaces

$L_p(t_0, t_1) = \set{\fn{f}{[t_0, t_1]}{\Re}}{\int_{t_0}^{t_1} \abs{f(t)}^p dt < \infty}$.

We can then talk about $\ell_p$, which are spaces of sequences. $\ell_2$ is the space of square-summable sequences of real numbers. Informally, $\ell_2 = \set{v = \{v_1, v_2, ...\}}{v_k \in \Re, \sum_k \abs{v_k}^2 < \infty}$.

In general, when looking at vector spaces, often we use $\mathbb{F} = \Re$, and we refer to the space as simply $V$.

Next: subspaces, bases, linear dependence/independence, linearity. One of the main things we're going to do is look at properties of linear functions and representation as multiplication by matrices.

Vector Spaces and Linearity

August 30, 2012

From last time

Subspaces, bases, linear dependence/independence, linearity. One of the main things we're going to do is look at properties of linear functions and representation as multiplication by matrices.

Example (of a vector space)

$\ell_2 = \{v = \{v_1, v_2, ...\} \st \sum_{i=1}^\infty \abs{v_i}^2 < \infty, v_i \in \Re \}$

Vector addition and scalar multiplication? ("pointwise" addition, multiplication by reals)

What is a vector subspace?

Consider vector space $(V, \mathbb{F})$. Consider a subset W of V combined with the same field. $(W, \mathbb{F})$ is a subspace of $(V, \mathbb{F})$ if it is closed under vector addition and scalar multiplication (formally, this must be a vector space in its own right, but these are the only vector space properties that we need to check).

Consider vectors from $\Re^n$. A plane (in $\Re^3$) is a subspace of $\Re^3$ if it contains the origin.

Aside: for $x \in V$, span$(x) = \set{\alpha x}{\alpha \in \mathbb{F}}$.

Linear dependence, linear independence.

Consider a set of $p$ vectors $\{v_1, v_2, ..., v_p\}, v_i \in V$. This set of vectors is said to be a linearly independent set iff the homogeneous equation has only the trivial solution, i.e. $\sum_i \alpha_i v_i = 0 \implies \forall i, \alpha_i = 0$. This is equivalent to saying that no one vector can be written as a linear combination of the others.

Otherwise, the set is said to be linearly dependent.

Bases

Recall: a set of vectors $W$ is said to span a space $(V, \mathbb{F})$ if any vector in the space can be written as a linear combination of vectors in the set, i.e. $\forall v \in V, \exists \set{(\alpha_i, w_i)}{v = \sum \alpha_i w_i}$ for $w_i \in W, \alpha_i \in \mathbb{F}$.

W is a basis iff it is also linearly independent.

Coordinates

Given a basis $B$ of a space $(V, \mathbb{F})$, there is a unique representation (trivial proof) of every $v \in V$ as a linear combination of elements of $B$. We define our coordinates to be the coefficients that appear in this unique representation. A visual representation is the coordinate vector, which defines

$$\alpha = \begin{bmatrix}\alpha_1 \\ \vdots \\ \alpha_n \end{bmatrix}$$

The basis is not uniquely defined, but what is constant is the number of elements in the basis. This number is the dimension of the space. Another notion is that a basis generates the corresponding space, since once you have a basis, you can acquire any element in the space.
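For a concrete instance, coordinates with respect to a basis of $\Re^2$ can be computed by stacking the basis vectors as columns and solving a linear system (the basis and vector below are illustrative choices):

```python
import numpy as np

# Coordinates of v in R^2 with respect to a non-standard basis {b1, b2}:
# stack the basis vectors as columns of B and solve B alpha = v.
b1 = np.array([1.0,  1.0])
b2 = np.array([1.0, -1.0])
B = np.column_stack([b1, b2])

v = np.array([3.0, 1.0])
alpha = np.linalg.solve(B, v)            # unique: B has independent columns
assert np.allclose(alpha[0] * b1 + alpha[1] * b2, v)   # v = sum alpha_i b_i
```

Uniqueness of `alpha` is exactly the unique-representation property above: it holds because the columns of `B` are linearly independent.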

Linearity

A function $\fn{f}{(V, \mathbb{F})}{(W, \mathbb{F})}$ (note that these spaces are defined over the same field!) is linear iff $f(\alpha_1 v_1 + \alpha_2 v_2) = \alpha_1 f(v_1) + \alpha_2 f(v_2)$.

This property is known as superposition, which is an amazing property, because if you know what this function does to the basis elements of a vector space, then you know what it does to any element in the space.

An interesting corollary is that a linear map will always map the zero vector to itself.

Definitions associated with linear maps

Suppose we have a linear map $\fn{\mathcal{A}}{U}{V}$. The range (image) of $\mathcal{A}$ is defined to be $R(\mathcal{A}) = \set{v}{v = A(u), u \in U} \subset V$. The nullspace (kernel) of $\mathcal{A}$ is defined to be $N(\mathcal{A}) = \set{u}{\mathcal{A}(u) = 0} \subset U$. Also trivial (from definition of linearity) to prove that these are subspaces.

We have a couple of very important properties now that we've defined range and nullspace.

Properties of linear maps $\fn{\mathcal{A}}{U}{V}$

$$b \in V: \quad \exists u \st \mathcal{A}(u) = b \iff b \in R(\mathcal{A})$$

$$b \in R(\mathcal{A}): \quad \exists!\ u \st \mathcal{A}(u) = b \iff N(\mathcal{A}) = \{0\}$$

(if the nullspace only contains the zero vector, we say it is trivial)

$$\mathcal{A}(x_0) = \mathcal{A}(x_1) \iff x_1 - x_0 \in N(\mathcal{A})$$

Matrix Representation of Linear Maps

September 4, 2012

Today

Matrix multiplication as a representation of a linear map; change of basis -- what happens to matrices; norms; inner products. We may get to adjoints today.

Last time, we talked about the concept of the range and the nullspace of a linear map, and we ended with a relationship that related properties of the nullspace to properties of the linear equation $\mathcal{A}(x) = b$. As we've written here, this is not matrix multiplication. As we'll see today, it can be represented as matrix multiplication, in which case, we'll write this as $Ax = b$.

There's one more important result, called the rank-nullity theorem. We defined the range and nullspace of a linear operator. We also showed that these are subspaces (range of codomain; nullspace of domain). We call $\text{dim}(R(\mathcal{A})) = \text{rank}(\mathcal{A})$ and $\text{dim}(N(\mathcal{A})) = \text{nullity}(\mathcal{A})$. Taking the dimension of the domain as $n$ and the dimension of the codomain as $m$, $\text{rank}(\mathcal{A}) + \text{nullity}(\mathcal{A}) = n$. Left as an exercise. Hints: choose a basis for the nullspace. Presumably you'd extend it to a basis for the domain (without loss of generality, because any set of $n$ linearly independent vectors will form a basis). Then consider how these relate to the range of $\mathcal{A}$. Then map $\mathcal{A}$ over this basis.
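A quick numerical illustration of rank-nullity (the rank-1 matrix below is an arbitrary choice):

```python
import numpy as np

# Rank-nullity for a map R^3 -> R^2 given by a rank-1 matrix:
# rank(A) + nullity(A) = dim(domain) = 3.
A = np.array([[1.0, 2.0, 3.0],
              [2.0, 4.0, 6.0]])          # rank 1 (rows are dependent)

rank = np.linalg.matrix_rank(A)
_, _, Vt = np.linalg.svd(A)
null_basis = Vt[rank:]                   # rows spanning N(A)
assert np.allclose(A @ null_basis.T, 0)  # they really lie in the nullspace
assert rank + null_basis.shape[0] == A.shape[1]
```

The SVD rows beyond the rank are one concrete way to carry out the hint: they extend a nullspace basis to a basis of the whole domain.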

Matrix representation

Any linear map between finite-dimensional vector spaces can be represented as matrix multiplication. We're going to show that it's true via construction.

$\fn{\mathcal{A}}{U}{V}$. We're going to choose bases for the domain and codomain. $\forall x \in U, x = \sum_{j=1}^n \xi_j u_j$. Now consider $\mathcal{A}(x) = \mathcal{A}(\sum_{j=1}^n \xi_j u_j) = \sum_{j=1}^n \xi_j \mathcal{A}(u_j)$ (through linearity). Each $\mathcal{A}(u_j) = \sum_{i=1}^m a_{ij} v_i$. Uniqueness of $a_{ij}$ and $\xi_j$ follows from writing the vectors in terms of a basis.

$$\mathcal{A}(x) = \sum_{j=1}^n \xi_j \sum_{i=1}^m a_{ij} v_i \\ = \sum_{i=1}^m \left(\sum_{j=1}^n a_{ij} \xi_j\right) v_i \\ = \sum_{i=1}^m \eta_i v_i$$

Uniqueness of representation tells me that $\eta_i \equiv \sum_{j=1}^n a_{ij} \xi_j$. We've got $i = \{1 .. m\}$ and $j = \{1 .. n\}$. We can turn this representation into a matrix by defining $\eta = A\xi$. $A \in \mathbb{F}^{m \times n}$ is defined such that its $j^{\text{th}}$ column is $\mathcal{A}(u_j)$ written with respect to the $v_i$s.

All we used here was the definitions of basis, coordinate vectors, and linearity.

Let's do a couple of examples. Foreshadowing of work later in controllability of systems. Consider a linear map $\fn{\mathcal{A}}{(\Re^n, \Re)}{(\Re^n, \Re)}$. Try to derive the matrix $A \in \Re^{n \times n}$. Both the domain and codomain have as basis $\{b, \mathcal{A}(b), \mathcal{A}^2(b), ..., \mathcal{A}^{n-1}(b)\}$, where $b \in \Re^n$ and $\mathcal{A}^n(b) = -\sum_{i=1}^n \alpha_i \mathcal{A}^{n-i}(b)$. Your task is to show that the representation of $b$ and $\mathcal{A}$ is:

$$\bar{b} = \begin{bmatrix}1 \\ 0 \\ \vdots \\ 0\end{bmatrix} \qquad \bar{A} = \begin{bmatrix} 0 & 0 & \dots & 0 & -\alpha_n \\ 1 & 0 & \dots & 0 & -\alpha_{n-1} \\ 0 & 1 & \dots & 0 & -\alpha_{n-2} \\ \vdots & \vdots & \ddots & \vdots & \vdots \\ 0 & 0 & \dots & 1 & -\alpha_1 \end{bmatrix}$$

This is really quite simple; it's almost by definition.
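One way to check this representation numerically: pick a concrete $\mathcal{A}$ and $b$ (arbitrary choices below, with the $\alpha_i$ read off the characteristic polynomial, since Cayley-Hamilton gives the recurrence above), form $P = [b \mid \mathcal{A}b \mid \dots \mid \mathcal{A}^{n-1}b]$, and compare $P^{-1}AP$ with the companion form:

```python
import numpy as np

# Pick an arbitrary A and b with {b, Ab, A^2 b} a basis; the alpha_i are
# the characteristic-polynomial coefficients (by Cayley-Hamilton,
# A^n = -sum_i alpha_i A^{n-i}). Then P^{-1} A P is the companion form.
A = np.array([[0.0,   1.0, 0.0],
              [0.0,   0.0, 1.0],
              [6.0, -11.0, 6.0]])        # char. poly: s^3 - 6 s^2 + 11 s - 6
b = np.array([1.0, 1.0, 0.0])

n = A.shape[0]
P = np.column_stack([np.linalg.matrix_power(A, k) @ b for k in range(n)])
Abar = np.linalg.inv(P) @ A @ P          # representation in the basis {A^k b}

coeffs = np.poly(A)                      # [1, alpha_1, ..., alpha_n]
companion = np.zeros((n, n))
companion[1:, :-1] = np.eye(n - 1)       # ones on the subdiagonal
companion[:, -1] = -coeffs[1:][::-1]     # last column: -alpha_n, ..., -alpha_1
assert np.allclose(Abar, companion)
```

Column $j < n$ of $\bar{A}$ is $e_{j+1}$ because $\mathcal{A}$ maps basis vector $j$ to basis vector $j+1$; only the last column involves the $\alpha_i$.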

Note that these are composable maps, where composition corresponds to matrix multiplication.

Change of basis

Suppose we have $\fn{\mathcal{A}}{U}{V}$ and two sets of bases for the domain and codomain. There exist maps between the first set of bases and the second set; composing those appropriately will give you your change of basis. Essentially, do a change of coordinates to those in which $A$ is defined (represented as $P$), apply $A$, then change the coordinates of the codomain back (represented as $Q$). Thus $\bar{A} = QAP$.

If $V = U$, then you can easily derive that $Q = P^{-1}$, so $\bar{A} = P^{-1}AP$.

We consider this transformation ($\bar{A} = QAP$) to be a similarity transformation, and $A$ and $\bar{A}$ are called similar (equivalent).

We derived these two matrices from the same linear map, but they're derived using different bases.

Proof of Sylvester's inequality on homework 2.

One last note: the rank of a linear map equals the rank of its matrix representation, that is, $\text{rank}(A) = \text{dim}(R(A)) = \text{dim}(R(\mathcal{A}))$. Similarly, $\text{nullity}(A) = \text{dim}(N(A)) = \text{dim}(N(\mathcal{A}))$.

Sylvester's inequality, which is an important relationship, says the following: suppose you have $A \in \mathbb{F}^{m \times n}$ and $B \in \mathbb{F}^{n \times p}$, so that $AB \in \mathbb{F}^{m \times p}$; then $\text{rk}(A) + \text{rk}(B) - n \le \text{rk}(AB) \le \min(\text{rk}(A), \text{rk}(B))$. On the homework, you'll have to show both inequalities. Note at the end about elementary row operations.

Next important concept about vector spaces: that of norms.

Norms

With some vector spaces, you can associate some entity called a norm. We can then speak of a normed vector space (which is, in particular, a metric space, with metric $d(x, y) = \mag{x - y}$). Suppose you have a vector space $(V, \mathbb{F})$, where $\mathbb{F}$ is either $\Re$ or $\mathbb{C}$. This is a normed space if you can find $\fn{\mag{\cdot}}{V}{\Re_+}$ that satisfies the following axioms:

$\mag{v_1 + v_2} \le \mag{v_1} + \mag{v_2}$

$\mag{\alpha v} = \abs{\alpha}\mag{v}$

$\mag{v} = 0 \iff v = \theta$

We have some common norms on these fields:

$\mag{x}_1 = \sum_{i=1}^n \abs{x_i}$ ($\ell_1$)

$\mag{x}_2 = \left(\sum_{i=1}^n \abs{x_i}^2\right)^{1/2}$ ($\ell_2$)

$\mag{x}_p = \left(\sum_{i=1}^n \abs{x_i}^p\right)^{1/p}$ ($\ell_p$, $p \ge 1$)

$\mag{x}_\infty = \max_i \abs{x_i}$ ($\ell_\infty$)

One of the most important norms that we'll be using: the induced norm is that induced by a linear operator. We'll define $\mathcal{A}$ to be a continuous linear map between two metric spaces; the induced norm is defined as

$$\mag{\mathcal{A}}_i = \sup_{u \neq \theta} \frac{\mag{\mathcal{A}u}_V}{\mag{u}_U}$$

From analysis: the supremum of a set $S$ is its least upper bound (the smallest $x$ such that $x \ge y$ for all $y \in S$).

Guest Lecture: Induced Norms and Inner Products

September 6, 2012

Induced norms of matrices

The reason that we're going to start talking about induced norms: today we're just going to build abstract algebra machinery, and at the end, we'll do the first application: least squares. We'll see why we need this machinery and why abstraction is a useful tool.

The idea is that we want to find a norm on a matrix using existing norms on vectors.

Let 1) $\fn{A}{(U,\mathbb{F})}{(V,\mathbb{F})}$, 2) let $U$ have the norm $\mag{\cdot}_u$, 3) let $V$ have the norm $\mag{\cdot}_v$. Let the induced norm be $\mag{A}_{u,v} = \sup_{x\neq 0} \frac{\mag{Ax}_v}{\mag{x}_u}$. Theorem: the induced norm is a norm. Not going to bother showing positive homogeneity and the triangle inequality (trivial in this case). Only going to show the last property: it separates points; essentially, $\mag{A}_{u,v} = 0 \iff A = 0$. The reason this is not necessarily trivial is the supremum: it maximizes this ratio over an infinite set of points, and the supremum need not be attained at a finite point.

The first direction is easy: if $A$ is zero, then its norm is 0 (by definition -- numerator is 0).

The second direction is the hard one. If $\mag{A}_{u,v} = 0$, then given any $x \neq 0$, it holds that $\frac{\mag{Ax}_v}{\mag{x}_u} \le 0$ (from the definition of the supremum). The denominator is positive (being the norm of a nonzero vector), and the numerator is nonnegative (also being a norm), so the ratio is bounded below by zero. Hence the numerator is zero for every nonzero $x$: every vector lies in the nullspace of $A$, which means that $A$ is zero.

Proposition: the induced norm satisfies (a) $\mag{Ax}_v \le \mag{A}_{u,v} \mag{x}_u$; (b) for composable maps, $\mag{AB}_{u,w} \le \mag{A}_{v,w} \mag{B}_{u,v}$ (submultiplicativity). (b) follows from (a).

Not emphasized in Claire's notes: induced norms form a small amount of all possible norms on matrices.

Examples of induced norms:

• $\mag{A}_{1,1} = \max_j \sum_i \abs{a_{ij}}$: the maximum column sum;
• $\mag{A}_{2,2} = \sqrt{\lambda_{\max}(A^T A)}$: the maximum singular value;
• $\mag{A}_{\infty, \infty} = \max_i \sum_j \abs{a_{ij}}$: the maximum row sum.
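These three formulas can be checked against numpy's built-in matrix norms (the matrix is an arbitrary example):

```python
import numpy as np

# Checking the three induced-norm formulas against numpy's built-in
# matrix norms (ord = 1, 2, inf).
A = np.array([[1.0, -2.0],
              [3.0,  4.0]])

col_sum = np.max(np.sum(np.abs(A), axis=0))               # ||A||_{1,1}
row_sum = np.max(np.sum(np.abs(A), axis=1))               # ||A||_{inf,inf}
sigma_max = np.sqrt(np.max(np.linalg.eigvalsh(A.T @ A)))  # ||A||_{2,2}

assert np.isclose(col_sum, np.linalg.norm(A, 1))
assert np.isclose(row_sum, np.linalg.norm(A, np.inf))
assert np.isclose(sigma_max, np.linalg.norm(A, 2))
```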

Other matrix norms: the Schatten norms, which are not induced norms. (a) The Frobenius norm $\sqrt{\text{trace}(A^T A)}$, which is also the square root of the sum of the squared singular values. (b) The nuclear norm, the sum of the singular values.

Statistical regularization; Frobenius norm is analogous to $\ell_2$ regularization; nuclear norm analogous to $\ell_1$ regularization. It is important to be aware that these other norms exist.

Sensitivity analysis

Nice application of norms, but we won't see that it's a nice application until later.

Computation for numerical linear algebra.

Some algebra can be performed to show that if $Ax_0 = b$ (with $A$ invertible), then for $(A + \delta A)(x_0 + \delta x) = b + \delta b$, we have the approximate bound $\frac{\mag{\delta x}}{\mag{x_0}} \le \mag{A}\mag{A^{-1}} \bracks{\frac{\mag{\delta A}}{\mag{A}} + \frac{\mag{\delta b}}{\mag{b}}}$. Need to engineer computation to improve the situation. Namely, we're perturbing $A$ and $b$ slightly: how much can the solution vary? In some sense, we have a measure of effect ($\mag{A}\mag{A^{-1}}$) and a measure of perturbation. The first quantity is important enough that people in linear algebra have defined it and called it the condition number: $\kappa(A) = \mag{A}\mag{A^{-1}} \ge 1$. The best you can do is 1. If you have a condition number of 1, your system is well-conditioned and very robust to perturbations. A larger condition number means less robustness to perturbation.
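A small numerical illustration of the $\delta b$ part of this bound (the matrices, right-hand side, and perturbation below are arbitrary choices):

```python
import numpy as np

# A well-conditioned and an ill-conditioned system, same perturbation of b:
# the relative change in the solution is bounded by kappa(A) * ||db||/||b||.
A_good = np.eye(2)                       # kappa = 1
A_bad  = np.array([[1.0, 1.0],
                   [1.0, 1.0001]])       # nearly singular, huge kappa
b  = np.array([1.0, 2.0])
db = np.array([0.0, 1e-4])               # small perturbation of b

for A in (A_good, A_bad):
    x  = np.linalg.solve(A, b)
    dx = np.linalg.solve(A, b + db) - x
    rel_change = np.linalg.norm(dx) / np.linalg.norm(x)
    bound = np.linalg.cond(A) * np.linalg.norm(db) / np.linalg.norm(b)
    assert rel_change <= bound + 1e-12   # kappa(A) caps the amplification
```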

More machinery: Inner Product & Hilbert Spaces

Consider a linear space $(H, \mathbb{F})$, and define a function $\fn{\braket{}{}}{(H, \mathbb{F})}{\mathbb{F}}$. This function is an inner product if it satisfies the following properties.

• Conjugate symmetry. $\braket{x}{y} = \braket{y}{x}^*$.
• Homogeneity. $\braket{x}{\alpha y} = \alpha \braket{x}{y}$.
• Linearity. $\braket{x}{y + z} = \braket{x}{y} + \braket{x}{z}$.
• Positive definiteness. $\braket{x}{x} \ge 0$, where equality only occurs when $x = 0$.

Inner product spaces have a natural norm (might not be the official name), and that's the norm induced by the inner product.

One can define $\mag{x}^2 = \braket{x}{x}$, which satisfies the axioms of a norm.

Examples of Hilbert spaces: finite-dimensional vectors. Much of the finite-dimensional intuition carries over to infinite-dimensional Hilbert spaces. All linear operators on finite-dimensional vector spaces are continuous because they can be written as matrices (not always the case in infinite-dimensional spaces). Suppose I have the field $\mathbb{F}$; then $(\mathbb{F}^n, \mathbb{F})$ with the inner product $\braket{x}{y} = \sum_i \bar{x_i} y_i$ is a Hilbert space. Another important inner product space is the space of square-integrable functions, $L^2([t_0, t_1], \mathbb{F}^n)$: an infinite-dimensional space, the natural setting for Fourier series. The inner product (of functions) is $\int_{t_0}^{t_1} f(t)^* g(t) dt$.

We're going to power through a little more machinery, but we're getting very close to the application. Need to go through adjoints and orthogonality before we can start doing applications.

Consider Hilbert spaces $(U, \mathbb{F}, \braket{}{}_u)$ and $(V, \mathbb{F}, \braket{}{}_v)$, and let $\fn{A}{U}{V}$ be a continuous linear function. The adjoint of $A$ is denoted $A^*$ and is the map $\fn{A^*}{V}{U}$ such that $\braket{x}{Ay}_v = \braket{A^*x}{y}_u$.

Reasoning? Sometimes you can simplify things. Suppose $A$ maps an infinite-dimensional space to a finite-dimensional space (e.g. functions to numbers). In some sense, you can convert that function into something that goes from real numbers to functions on numbers. Generalization of the Hermitian transpose.
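In finite dimensions with the standard inner product, the adjoint reduces to the (conjugate) transpose, which is easy to check numerically on random data:

```python
import numpy as np

# For a real matrix A with the standard inner products, the adjoint is the
# transpose: <x, A y> = <A^T x, y>. Data are random and illustrative.
rng = np.random.default_rng(1)
A = rng.standard_normal((3, 4))          # a map from R^4 to R^3
x = rng.standard_normal(3)
y = rng.standard_normal(4)

assert np.isclose(np.dot(x, A @ y), np.dot(A.T @ x, y))
```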

Consider functions $f, g \in C([t_0, t_1], \Re^n)$. What is the adjoint of $\fn{A}{C([t_0, t_1], \Re^n)}{\Re}$, where $A(f) = \braket{g}{f}_{C([t_0, t_1], \Re^n)}$? (aside: this notion of the adjoint will be very important when we get to observability and reachability)

Observe that $\braket{v}{A(f)}_\Re = v \cdot A(f) = v \braket{g}{f}_C = \braket{v g}{f}_C$, and so consequently the adjoint is $A^*(v) = v g$.

Orthogonality

With Hilbert spaces, one can define orthogonality in an axiomatic manner (a more abstract form, rather). Let $(H, \mathbb{F}, \braket{}{})$ be a Hilbert space. Two vectors $x, y$ are defined to be orthogonal if $\braket{x}{y} = 0$.

Cute example: suppose $c = a + b$ and $a, b$ are orthogonal. In fact, $\mag{c}^2 = \mag{a + b}^2 = \braket{a + b}{a + b} = \braket{a}{a} + \braket{b}{b} + \braket{a}{b} + \braket{b}{a} = \mag{a}^2 + \mag{b}^2$. Cute because the result is the Pythagorean theorem, which we got just through these axioms.

One more thing: the orthogonal complement of a subspace $M$ in a Hilbert space is defined as $M^\perp = \set{y \in H}{\braket{x}{y} = 0\ \forall x \in M}$.

We are at a point now where we can talk about an important theorem:

Fundamental Theorem of Linear Algebra (partially)

Let $A \in \Re^{m \times n}$. Then:

• $R(A) \perp N(A^T)$
• $R(A^T) \perp N(A)$
• $R(AA^T) = R(A)$
• $R(A^TA) = R(A^T)$
• $N(AA^T) = N(A)$
• $N(A^TA) = N(A^T)$

Proofs:

• Given any $x \in \Re^n, y \in \Re^m \st A^T y = 0$ ($y \in N(A^T)$), consider the quantity $\braket{y}{Ax} = \braket{A^Ty}{x} = 0$.

• Given any $x \in \Re^n$, $\exists y \in \Re^m \st x = A^T y + z$, where $z \in N(A)$ (as a result of the decomposition above). Thus $Ax = AA^Ty$. This implies that $R(A) \subset R(A A^T)$; the reverse inclusion is immediate, so $R(AA^T) = R(A)$.
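A numerical spot-check of two of these identities, using an illustrative rank-deficient matrix:

```python
import numpy as np

# Spot-check: R(A) is orthogonal to N(A^T), and N(A^T A) = N(A)
# (so the ranks of A^T A and A agree).
A = np.array([[1.0, 0.0, 1.0],
              [2.0, 1.0, 3.0],
              [3.0, 1.0, 4.0]])          # rank 2: third column = col1 + col2

U, s, Vt = np.linalg.svd(A)
r = np.linalg.matrix_rank(A)
N_At = U[:, r:]                          # basis of N(A^T) from the SVD
assert np.allclose(N_At.T @ A, 0)        # columns of A (spanning R(A)) are
                                         # orthogonal to N(A^T)
assert np.linalg.matrix_rank(A.T @ A) == r
```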

Now for the application.

Application: Least Squares

Consider the following problem: minimize $\mag{y - Ax}_2$, where $y \not\in R(A)$. If $y$ were in the range of $A$, and $A$ were invertible, the solution would be trivial ($A^{-1}y$). In many problems, $A \in \Re^{m\times n}$, where $m \gg n$, $y \in \Re^m$, $x \in \Re^n$.

Since we cannot solve $Ax = y$, we instead solve $Ax = \hat{y}$. According to our intuition, we would like $y - \hat{y}$ to be orthogonal to $R(A)$. From the preceding (partial) theorem, this means that $y - \hat{y} \in N(A^T) \iff A^T(y - \hat{y}) = 0$. Remember: what we really want to solve is $A^T(y - Ax) = 0 \implies A^T Ax = A^T y \implies x = (A^T A)^{-1} A^T y$ if $A^T A$ is invertible.

If $A$ has full column rank (that is, for $A \in \Re^{m \times n}$, we have $\text{rank}(A) = n$), then this means that in fact $N(A) = \{0\}$, and the preceding theorem implies that the dimension of $R(A^T)$ is $n$, which means that the dimension of $R(A^T A)$ is $n$. However, $A^T A \in \Re^{n \times n}$. Thus, $A^T A$ is invertible.
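A minimal sketch of least squares via the normal equations, on illustrative random data:

```python
import numpy as np

# Least squares via the normal equations A^T A x = A^T y for a tall matrix
# with full column rank. Data are random and purely illustrative.
rng = np.random.default_rng(0)
A = rng.standard_normal((10, 3))         # m >> n
y = rng.standard_normal(10)

x_hat = np.linalg.solve(A.T @ A, A.T @ y)

# The residual y - A x_hat lies in N(A^T), i.e. it is orthogonal to R(A):
assert np.allclose(A.T @ (y - A @ x_hat), 0)
# Agrees with numpy's dedicated least-squares solver:
assert np.allclose(x_hat, np.linalg.lstsq(A, y, rcond=None)[0])
```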

Back to condition numbers (special case)

Consider a self-adjoint and invertible matrix in $\Re^{n \times n}$. $\hat{x} = (A^T A)^{-1} A^T y = A^{-1} y$. We have two ways of determining this value: the overdetermined least-squares solution and the standard inverse. Let us look at the condition numbers.

$\kappa(A^T A) = \mag{A^T A}\mag{(A^T A)^{-1}} = \mag{A}^2\mag{A^{-1}}^2 = \bracks{\kappa(A)}^2$. This result is more general: it also applies in the $L^2$ case even if $A$ is not self-adjoint. As you can see, this is worse than if we simply use the inverse.
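This squaring of the condition number is easy to observe numerically (the self-adjoint matrix below is an arbitrary example, and the condition numbers are in the 2-norm):

```python
import numpy as np

# Check that kappa(A^T A) = kappa(A)^2 in the 2-norm, here for a
# symmetric invertible A matching the special case above.
A = np.array([[2.0, 1.0],
              [1.0, 3.0]])               # self-adjoint, invertible

k  = np.linalg.cond(A)                   # 2-norm condition number
k2 = np.linalg.cond(A.T @ A)
assert np.isclose(k2, k**2)
```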

Gram-Schmidt orthonormalization

This is a theoretical toy, not used for computation (numerics are very bad).

More definitions:

A set of vectors S is orthogonal if $x \perp y \forall x \neq y$ and $x, y \in S$.

The set is orthonormal if also $\mag{x} = 1, \forall x \in S$. Why do we care about orthonormality? Consider Parseval's theorem. The reason you get that theorem is that the bases are required to be orthonormal so that you can get that result. Otherwise it wouldn't be as clean. That's typically why people like orthonormal bases: you can represent your vectors as just coefficients (and you don't need to store the length of the vectors).

We conclude with an example of Gram-Schmidt orthonormalization. Consider the space $L^2([t_0, t_1], \Re)$. Suppose I have $v_1 = 1, v_2 = t, v_3 = t^2$, $t_0 = 0$, $t_1 = 1$, and $\mag{v_1}^2 = \int_0^1 1 \cdot 1 dt = 1$. The key idea of Gram-Schmidt orthonormalization is the following: start with $b_1 \equiv \frac{v_1}{\mag{v_1}}$. Then go on with $b_2 = \frac{v_2 - \braket{v_2}{b_1}b_1}{\mag{v_2 - \braket{v_2}{b_1}b_1}}$, and repeat until you're done (in essence: you want to preserve only the component that is orthogonal to the space spanned by the vectors you've computed so far, then renormalize).

Basically, you get after all this computation that $b_2 = \sqrt{12}\parens{t - \frac{1}{2}}$. Same construction for $b_3$.
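A numerical sketch of this construction, approximating the $L^2([0,1], \Re)$ inner product by midpoint-rule quadrature (the grid size is an arbitrary choice):

```python
import numpy as np

# Sketch: Gram-Schmidt in L^2([0,1], R) applied to v1 = 1, v2 = t, v3 = t^2,
# with <f, g> = int_0^1 f(t) g(t) dt approximated by a midpoint rule.
N = 100000
t = (np.arange(N) + 0.5) / N    # midpoints of N subintervals of [0, 1]
dt = 1.0 / N

def inner(f, g):
    return np.sum(f * g) * dt   # quadrature approximation of <f, g>

def gram_schmidt(vs):
    basis = []
    for v in vs:
        w = v - sum(inner(v, b) * b for b in basis)  # strip components along earlier b_i
        basis.append(w / np.sqrt(inner(w, w)))       # renormalize
    return basis

b1, b2, b3 = gram_schmidt([np.ones_like(t), t, t**2])

# b2 should match sqrt(12) * (t - 1/2) up to quadrature error.
print(np.max(np.abs(b2 - np.sqrt(12) * (t - 0.5))))
```

The loop is exactly the recipe above: project out the span of the earlier $b_i$, then normalize.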

Singular Value Decomposition & Introduction to Differential Equations

September 11, 2012

Reviewing the adjoint, suppose we have two vector spaces $U, V$; as we did with norms, let us associate with each a field that is either $\Re$ or $\mathbb{C}$. Assume that these spaces are inner product spaces (we're associating with each an inner product). Suppose we have a continuous (linear) map $\fn{\mathcal{A}}{U}{V}$. We define the adjoint of this map to be $\fn{\mathcal{A}^*}{V}{U}$ such that $\braket{v}{\mathcal{A} u} = \braket{\mathcal{A}^* v}{u}$ for all $u \in U, v \in V$.

We define self-adjoint maps as maps that are equal to their adjoints, i.e. $\fn{\mathcal{A}}{U}{U} \st \mathcal{A} = \mathcal{A}^*$.

In finite-dimensional vector spaces, the adjoint of a map is equivalent to the conjugate transpose of the matrix representation of the map. We refer to matrices that correspond to self-adjoint maps as hermitian.

Unitary matrices

Suppose that we have $U \in \mathbb{F}^{n\times n}$. $U$ is unitary iff $U^*U = UU^* = I_n$. If $\mathbb{F}$ is $\Re$, the matrix is called orthogonal.

These constructions lead us to something useful: singular value decomposition. We'll come back to this later when we talk about matrix operations.

Singular Value Decomposition (SVD)

Suppose you have a matrix $M \in \mathbb{F}^{m\times m}$. A complex number $\lambda$ is an eigenvalue of $M$ iff there exists a nonzero vector $v$ such that $Mv = \lambda v$ ($v$ is thus called the eigenvector associated to $\lambda$). Now we can think about how to define singular values of a matrix in terms of these definitions.

Let us think about this in general for a matrix $A \in \mathbb{F}^{m \times n}$ (which we consider to be a matrix representation of some linear map with respect to a basis). Note that $A A^* \in \mathbb{F}^{m \times m}$, which will have $m$ eigenvalues $\lambda_i, i = 1 ... m$.

Note that $AA^*$ is hermitian. We note that from the Spectral theorem, we can decompose the matrix into an orthonormal basis of eigenvectors corresponding to real eigenvalues. In fact, in this case, the eigenvalues must be real and non-negative.

If we write the eigenvalues of $AA^*$ as $\lambda_1 \ge \lambda_2 \ge ... \ge \lambda_m$, where the first $r$ are nonzero, note that $r = \text{rank}(AA^*)$. We define the non-zero singular values of $A$ to be $\sigma_i = \sqrt{\lambda_i}, i \le r$. The remaining singular values are zero.

Recall the induced 2-norm: let us relate this notion of singular values back to the induced 2-norm of a matrix $A$ ($\mag{A}_{2,i}$). Consider the induced norm to be the norm induced by the action of $A$ on the domain of $A$; if we take the induced 2-norm, then this is $\max_i \parens{\lambda_i (A^*A)}^{1/2}$, which is simply the maximum singular value.
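A quick numerical check of this identity (random real matrix, arbitrary sizes; a sketch, not part of the lecture):

```python
import numpy as np

# Sketch: the induced 2-norm of A equals its largest singular value,
# i.e. the square root of the largest eigenvalue of A^* A.
rng = np.random.default_rng(2)
A = rng.standard_normal((4, 6))

sigma = np.linalg.svd(A, compute_uv=False)  # singular values, descending
lam = np.linalg.eigvalsh(A.T @ A)           # eigenvalues of A^* A, ascending

print(np.linalg.norm(A, 2), sigma[0], np.sqrt(lam[-1]))  # all three agree
```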

Now that we know what singular values are, we can do a useful decomposition called singular value decomposition.

Take $A \in \mathbb{C}^{m \times n}$. We have the following theorem: there exist unitary matrices $U \in \mathbb{C}^{m \times m}, V \in \mathbb{C}^{n \times n}$ such that $A = U \Sigma V^*$, where $\Sigma$ is defined as a diagonal matrix containing the singular values of $A$. Consider the first $r$ columns of $U$ to be $U_1$, the first $r$ columns of $V$ to be $V_1$, and the $r \times r$ block of $\Sigma$ containing the nonzero singular values to be $\Sigma_r$. Then $A = U \Sigma V^* = U_1 \Sigma_r V_1^*$.

Consider $AA^*$. With a bit of algebra, we can show that $AA^*U_1 = U_1 \Sigma_r^2$. The columns $u_i$ of $U_1$ are thus the eigenvectors of $AA^*$ associated to eigenvalues $\sigma_i^2$; these are called the left-singular vectors.

Similarly, if we consider $A^*A$, we can show that $A^*A = V_1 \Sigma_r^2 V_1^*$ and that $A^*A v_i = \sigma_i^2 v_i$; the columns $v_i$ of $V_1$ are called the right-singular vectors.
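A numpy sketch tying these statements together (random real matrix, arbitrary sizes; note that `numpy.linalg.svd` returns $V^*$ as its third output):

```python
import numpy as np

# Sketch of the thin SVD: A = U1 Sigma_r V1^*, with the columns of U1
# eigenvectors of A A^* (left-singular vectors).
rng = np.random.default_rng(3)
m, n = 5, 3
A = rng.standard_normal((m, n))   # full column rank (generically), so r = n
r = n

U, s, Vh = np.linalg.svd(A)             # full SVD: A = U @ Sigma @ Vh
U1, V1 = U[:, :r], Vh[:r, :].conj().T   # first r left/right singular vectors
Sigma_r = np.diag(s[:r])

print(np.allclose(A, U1 @ Sigma_r @ V1.conj().T))     # thin SVD reconstructs A
print(np.allclose(A @ A.conj().T @ U1, U1 @ Sigma_r**2))  # A A^* U1 = U1 Sigma_r^2
```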

Recap

We've covered a lot of ground these past few weeks: we covered functions, vector spaces, bases, and then we started to consider linearity. And then we started talking about endowing vector spaces with things like norms, inner products; induced norms. From that, we went on to talk about adjoints. We used adjoints, we went on to talk a little about projection and least-squares optimization. We then went on to talk about Hermitian matrices and singular value decomposition. I think about this first unit as having many basic units that we'll use over and over again. Two interesting applications: least-squares, SVD.

So we have this basis now to build on as we talk about linear systems. We'll also need to build a foundation on linear differential equations. We'll spend some time going over the basics: what a solution means, under what conditions a solution exists (i.e. what properties does the differential equation need to have?). We'll spend the next couple weeks talking about properties of differential equations.

All of what we've done up to now has been covered in appendix A of Callier & Desoer. For the introduction to differential equations, we'll follow appendix B of Callier & Desoer. Not the easiest to read, but very comprehensive background reading.

The existence and uniqueness theorems are in many places, however.

Lecture notes 7.

Differential Equations

$$\dot{x} = f(x(t), t), \quad x(t_0) = x_0 \\ x \in \Re^n \\ \fn{f}{\Re^n \times \Re}{\Re^n}$$

(strictly speaking, $f$ maps $x$ to the tangent space, but for this course, we're going to consider the two spaces to be equivalent)

Often, we're going to consider the time-invariant case (where there is no dependence on $t$, but rather only on $x$), but this is a time-variant case. Recall that we consider time to be a privileged variable, i.e. always "marching forward".

What we're going to talk about now is how we can solve this differential equation. Rather (for now), under what conditions does there exist a (unique) solution to the differential equation (with initial condition)? We're interested in these two properties: existence and uniqueness. The solution we call $x(t)$ where $x(t_0) = x_0$. We need some understanding of some properties of that function $f$. We'll talk about continuity, piecewise continuity, Lipschitz continuity (thinking about the existence). In terms of uniqueness, we'll be talking about Cauchy sequences, Banach spaces, Bellman-Grönwall lemma.

A couple of different ways to prove uniqueness and existence; we'll use the Callier & Desoer method.

We'll finish today's lecture by just talking about some definitions of continuity. Suppose we have a function $f(x)$ that is said to be continuous: that is, $\forall \epsilon > 0, \exists \delta > 0 \st \abs{x_1 - x_2} < \delta \implies \abs{f(x_1) - f(x_2)} < \epsilon$ ($\epsilon$-$\delta$ definition).

Suppose we have $\fn{f(x,t)}{\Re^n \times \Re}{\Re^n}$. $f$ is said to be piece-wise continuous (w.r.t. $t$), $\forall x$ if $\fn{f(x, \cdot)}{\Re}{\Re^n}$ is continuous except at a finite number of (well-behaved) discontinuities in any closed and bounded interval of time. What I'm not allowing in this definition are functions with infinitely many points of discontinuity.

Next time we'll talk about Lipschitz continuity.

Existence and Uniqueness of Solutions to Differential Equations

September 13, 2012

Section this Friday only, 9:30 - 10:30, Cory 299.

Today: existence and uniqueness of solutions to differential equations.

We called this a DE or ODE, and we associated with it an initial condition. We started to talk about properties of the function $f$ as a function of $x$ only, but we can consider thinking about this as a function of $x$ for all t. This is a map from $\Re^n \to \Re^n$. In this class, recall, we used the $\epsilon$-$\delta$ definition for continuity.

We also introduced the concept of piecewise continuity, which will be important for thinking about the right-hand-side of the differential equation.

We defined piecewise continuity as $\fn{f(t)}{\Re_+}{\Re^n}$, where $f(t)$ is said to be piecewise continuous in $t$, where the function is continuous except at a set of well-behaved discontinuities (finitely many in any closed and bounded, i.e. compact, interval).

Finally, we will define Lipschitz continuity as follows: a function $\fn{f(\cdot, t)}{\Re^n}{\Re^n}$ is Lipschitz continuous in $x$ if there exists a piecewise continuous function of time $\fn{k(t)}{\Re_+}{\Re_+}$ such that the following inequality holds: $\mag{f(x_1, t) - f(x_2, t)} \le k(t)\mag{x_1 - x_2}, \forall x_1, x_2 \in \Re^n, \forall t \in \Re_+$. This inequality (condition) is called the Lipschitz condition.

An important thing in this inequality is that there has to be one function $k(t)$, and it has to be piecewise continuous. That is, there exists such a function that is not allowed to go to infinity in compact time intervals.

It's an interesting condition, and if we compare the Lipschitz continuity definition to the general continuity definition, we can easily show that if the function is LC (Lipschitz continuous), then it's C (continuous), since LC is a stricter condition than C. That implication is fairly straightforward to show, but the converse is not necessarily true (i.e. continuity does not necessarily imply Lipschitz continuity).

Aside: think about this condition and what it takes to show that a function is Lipschitz continuous. Need to come up with a candidate $k(t)$ (often called the Lipschitz function or constant, if it's constant). Often the hardest part: trying to extract from $f$ what a possible $k$ is.

But there's a useful possible candidate for $k(t)$, given a particular function $f$. Let's forget about time for a second and consider a function just of $x$. Consider the Jacobian $Df$ (often also written $\pderiv{f}{x}$), which is the $n \times n$ matrix with entries $(Df)^j_i = \pderiv{f_j}{x_i}$. If the Jacobian $Df$ exists, then its norm provides a candidate Lipschitz function $k(t)$.

A norm of the Jacobian of $f$, if independent of $x$, tells you that the function is Lipschitz. If the norm always seems to depend on $x$, you can still say something about the Lipschitz properties of the function: you can call it locally Lipschitz by bounding the value of $x$ in some region.

Sketch of proof: generalization of mean value theorem (easy to sketch in $\Re^1$). Mean value theorem states that there exists a point such that the instantaneous slope is the same as the average slope (assuming that the function is differentiable). If we want to generalize it to more dimensions, we say $f(x_1) - f(x_2) = Df(\lambda x_1 + (1 - \lambda) x_2)(x_1 - x_2)$ (where $0 < \lambda < 1$). All we've required is the existence of $Df$.

Now we can just take norms (and this is what's interesting now) and use some of the results we have from norms. This provides a very useful construction for a candidate for $k$ (might not provide a great bound), but it's the second thing to try if you can't immediately extract out a function $k(t)$.
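As a sketch of this recipe, take the scalar example $f(x) = x^2$ (my choice, not from the lecture): $Df(x) = 2x$ is unbounded globally, but on $\abs{x} \le r$ it is bounded by $2r$, giving a local Lipschitz constant that can be checked against sampled pairs:

```python
import numpy as np

# Sketch: for f(x) = x^2 the Jacobian is Df(x) = 2x, so on the region
# |x| <= r a candidate (local) Lipschitz constant is k = sup |Df| = 2r.
r = 5.0
k = 2 * r

rng = np.random.default_rng(4)
x1 = rng.uniform(-r, r, 10000)
x2 = rng.uniform(-r, r, 10000)
mask = x1 != x2  # avoid dividing by zero for coincident samples

# |f(x1) - f(x2)| / |x1 - x2| = |x1 + x2| <= 2r on this region
ratio = np.abs(x1[mask]**2 - x2[mask]**2) / np.abs(x1[mask] - x2[mask])
print(ratio.max() <= k)  # the Lipschitz condition holds with k = 2r
```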

Something not in the notes, but useful. Let's go back to where we started, the differential equation with initial condition, and state the main theorem.

Fundamental Theorem of DEs / the Existence and Uniqueness theorem of (O)DEs

Suppose we have a differential equation with an initial condition. Assume that $f(x, t)$ is piecewise continuous in $t$ and Lipschitz continuous in $x$. With that information, we have that there exists a unique function of time which maps $\Re_+ \to \Re^n$, which is differentiable ($C^1$) almost everywhere (the derivative exists at all points at which $f$ is continuous), and which satisfies the initial condition and differential equation. The derivative exists at all points $t \in [t_1, t_2] - D$, where $D$ is the set of points where $f$ is discontinuous in $t$.

We are going to be interested in studying differential equations where we know these conditions hold. We're also going to prove the theorem. It's a nice thing to do (a little in depth) because it demonstrates some proof techniques (as well as giving you an idea of why the theorem works).

LC condition

The norm of the Jacobian of the example is bounded for bounded $x$. That is, we can choose a local region in $\Re$ for which our $Df$ is bounded to be less than some constant. That gives us a candidate Lipschitz constant for that local region. We say then that $f(x)$ is (at least) locally Lipschitz continuous (usually we just say this without specifying a region, since you can usually find a bound given any region). Further, it is trivially piecewise continuous in time (since it doesn't depend on time).

Note: if the Lipschitz condition holds only locally, it may be that the solution is only defined over a certain range of time.

We didn't show this, but in this example, the Lipschitz condition does not hold globally.

Local Fundamental theorem of DEs

Now assume that $f(x, t)$ is piecewise continuous in $t$ and Lipschitz continuous in $x$ (for all $x \in G \subset \Re^n$). We now have that there exists an interval $[t_0, t_1]$ and a unique function of time on that interval (along which the solution remains in $G$), which is differentiable ($C^1$) almost everywhere (the derivative exists at all points at which $f$ is continuous), and which satisfies the initial condition and differential equation. As before, this derivative exists at all points $t \in [t_0, t_1] - D$, where $D$ is the set of points where $f$ is discontinuous in $t$. If the Lipschitz condition holds globally, we can make the interval as large as desired.

Proof

There are two pieces: the proof of existence and the proof of uniqueness. Today will likely just be existence.

Existence

Roadmap: construct an infinite sequence of continuous functions defined (recursively) as follows $x_{m+1}(t) = x_0 + \int_{t_0}^t f(x_m(\tau), \tau) d\tau$. First, show that this sequence converges to a continuous function $\fn{\Phi(\cdot)}{\Re_+}{\Re^n}$ which solves the DE/IC pair.
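This recursion can be sketched numerically; here it is applied to $\dot{x} = x$, $x(0) = 1$ (an example of my choosing), where the iterates are partial sums of the exponential series and converge to $e^t$:

```python
import numpy as np

# Sketch: the iteration x_{m+1}(t) = x0 + int_{t0}^t f(x_m(tau), tau) dtau
# for xdot = x, x(0) = 1 on [0, 1], with the integral done by the trapezoid rule.
N = 2000
t = np.linspace(0.0, 1.0, N)
dt = t[1] - t[0]
f = lambda x: x                 # right-hand side; Lipschitz with k(t) = 1

x = np.ones(N)                  # x_0(t) = x0 = 1 (constant initial guess)
for _ in range(30):             # iterate the integral operator
    fx = f(x)
    increments = 0.5 * (fx[1:] + fx[:-1]) * dt
    x = 1.0 + np.concatenate(([0.0], np.cumsum(increments)))

print(np.max(np.abs(x - np.exp(t))))  # small: the fixed point approximates e^t
```

Each pass through the loop is one application of the integral map; the remaining error after $m$ passes shrinks like $T^m/m!$, which is exactly the bound derived below.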

Would like to be able to prove the first thing here: I've constructed a sequence, and I want to show that the limit of this sequence is a solution to the differential equation.

The tool that I'm going to use is a property called Cauchy, and then I'm going to invoke the result that if I have a complete space, any Cauchy sequence on the space converges to something in the space. Gives me the basis of the existence of the thing that this converges to.

Goal: (1) to show that this sequence is a Cauchy sequence in a complete normed vector space, which means the sequence converges to something in the space, and (2) to show that the limit of this sequence satisfies the DE/IC pair.

A Cauchy sequence (on a normed vector space) is one where, for any $\epsilon > 0$, there exists some point in the sequence (some finite index $m$) such that the distance between any two elements beyond that index is less than $\epsilon$. In other words: if we drop a finite number of elements from the start of the sequence, the distance between any remaining elements can be made arbitrarily small.

A Banach space (equivalently, a complete normed vector space) is one in which all Cauchy sequences converge. Implicit in that, it means they converge to something in the space itself.

Just an aside, a Hilbert space is a complete inner product space. If you have an inner product space, and you define the norm in that inner product space induced by that inner product, if all Cauchy sequences of that space converge (to a limit in the space) with this norm, then it is a Hilbert space.

Think about a Cauchy sequence on a space that converges to something not necessarily in the space. Example: a sequence of rationals, such as the continued-fraction convergents of $\sqrt{2}$, is Cauchy in $\mathbb{Q}$ but converges to an irrational number.

To show (1), we'll show that this sequence $\{x_m\}$ that we constructed is a Cauchy sequence in a Banach space. Interestingly, it matters what norm you choose.

Proof of Existence and Uniqueness Theorem

September 18, 2012

Today:

• proof of existence and uniqueness theorem.
• [ if time ] introduction to dynamical systems.

First couple of weeks of review to build up basic concepts that we'll be drawing upon throughout the course. Either today or Thursday we will launch into linear system theory.

We're going to recall where we were last time. We had the fundamental theorem of differential equations, which said the following: if we had a differential equation, $\dot{x} = f(x,t)$, with initial condition $x(t_0) = x_0$, where $x(t) \in \Re^n$, etc, if $f( \cdot , t)$ is Lipschitz continuous, and $f(x, \cdot )$ is piecewise continuous, then there exists a unique solution to the differential equation / initial condition pair (some function $\phi(t)$) wherever you can take the derivative (may not be differentiable everywhere: loses differentiability on the points where discontinuities exist).

We spent quite a lot of time discussing Lipschitz continuity. The job is usually to test both conditions; the first one requires work. We described a popular candidate by looking at the mean value theorem and applying it to $f$: a norm of the Jacobian provides a candidate Lipschitz function, if it works.

We also described local Lipschitz continuity, and often, when using a norm of the Jacobian, that's fairly easy to show.

Important point to recall: a norm of the Jacobian of $f$ provides a candidate Lipschitz function.

Another important thing to say here is that we can use any norm we want, so we can be creative in our choice of norm when looking for a better bound.

We started our proof last day, and we talked a little about the structure of the proof. We are going to proceed by constructing a sequence of functions, then show (1) that it converges to a solution, then show (2) that it is unique.

Proof of Existence

We are going to construct this sequence of functions as follows: $x_{m+1}(t) = x_0 + \int_{t_0}^t f(x_m(\tau), \tau) d\tau$. Here we're dealing with an arbitrary interval from $t_1$ to $t_2$, and so $t_0 \in [t_1, t_2]$. We want to show that this sequence is a Cauchy sequence, and we're going to rely on our knowledge that the space these functions are defined in is a Banach space (hence this sequence converges to something in the space).

We have to put a norm on this space of continuous functions, so we'll use the infinity norm. Not going to prove it, but rather state that it's a Banach space. If we show that this is a Cauchy sequence, then the limit of that Cauchy sequence exists in the space. The reason that's interesting is that it's this limit that provides a candidate solution for this differential equation.

We will then prove that this limit satisfies the DE/IC pair. That is adequate to show existence. We'll then go on to prove uniqueness.

Our immediate goal is to show that this sequence is Cauchy, that is, we should show that $\mag{x_{m+p} - x_m} \to 0$ as $m$ gets large, uniformly in $p$.

First let us look at the difference between $x_{m+1}$ and $x_m$. These are just functions of time, and we can compute $\mag{x_{m+1}(t) - x_m(t)} = \mag{\int_{t_0}^t \parens{f(x_m, \tau) - f(x_{m-1}, \tau)} d\tau}$. Using the fact that $f$ is Lipschitz continuous, this is $\le \int_{t_0}^t k(\tau)\mag{x_m(\tau) - x_{m-1}(\tau)} d\tau$. The Lipschitz function is piecewise continuous, so well-defined, and it has a supremum in this interval. Let $\bar{k}$ be the supremum of $k$ over the whole interval $[t_1, t_2]$. This means that we can take this inequality and rewrite it as $\mag{x_{m+1}(t) - x_m(t)} \le \bar{k} \int_{t_0}^t \mag{x_m(\tau) - x_{m-1}(\tau)} d\tau$. Now we have a bound that relates the distance between $x_{m+1}$ and $x_m$ to the distance between $x_m$ and $x_{m-1}$, and we can relate the distance between any two subsequent elements to earlier such distances by iterating.

Let us do two things: sort out the integral on the right-hand-side, then look at arbitrary elements beyond an index.

We know that $x_1(t) = x_0 + \int_{t_0}^t f(x_0, \tau) d\tau$, and that $\mag{x_1(t) - x_0} \le \int_{t_0}^{t} \mag{f(x_0, \tau)} d\tau \le \int_{t_1}^{t_2} \mag{f(x_0, \tau)} d\tau \defequals M$. From the above inequalities, $\mag{x_2 - x_1} \le M \bar{k}\abs{t - t_0}$. Now I can look at general bounds: $\mag{x_3 - x_2} \le \frac{M\bar{k}^2 \abs{t - t_0}^2}{2!}$. In general, $\mag{x_{m+1} - x_m} \le \frac{M\parens{\bar{k} \abs{t - t_0}}^m}{m!}$.

What I've been doing up to now is looking at the norm at a particular value $t$, with $t_1 < t < t_2$; if we instead look at the norm of $x$ as a function of time, that is going to be a function norm.

Try to relate this to the norm $\mag{x_{m+1} - x_m}_\infty$. Can what we've done so far give us a bound on the difference between two functions? It can, because the infinity norm of a function is the maximum value that the function assumes (maximum vector norm over all points $t$ in the interval we're interested in). If we let $T \defequals t_2 - t_1$, we can use the previous result on the pointwise norm: if a pointwise norm is less than this bound for all relevant $t$, then its max value must be less than the same bound, so the function norm is bounded as well.

That gets us on the road we want to be, since that now gets us a bound. We can now go back to where we started. What we're actually interested in is given an index $m$, we can construct a bound on all later elements in the sequence.

$\mag{x_{m+p} - x_m}_\infty = \mag{x_{m+p} - x_{m+p-1} + x_{m+p-1} - ... - x_m}_\infty = \mag{\sum_{k=0}^{p-1} (x_{m+k+1} - x_{m+k})}_\infty \le \sum_{k=0}^{p-1} \mag{x_{m+k+1} - x_{m+k}}_\infty \le M \sum_{k=0}^{p-1} \frac{(\bar{k}T)^{m+k}}{(m+k)!}$.

We're going to recall a few things from undergraduate calculus: Taylor expansion of the exponential function and $(m+k)! \ge m!k!$.

With these, we can say that $\mag{x_{m+p} - x_m}_\infty \le M\frac{(\bar{k}T)^m}{m!} e^{\bar{k} T}$. What we'd like to show is that this can be made arbitrarily small as $m$ gets large. We study this bound as $m \to \infty$, and we recall the Stirling approximation, which shows that the factorial grows faster than the exponential function. That is enough to show that $\{x_m\}_0^\infty$ is Cauchy. Since it is in a Banach space (not proving this, since it's beyond our scope), it converges to a function (call it $x^\ell$) in the same space.

Now we just need to show that the limit $x^\ell$ solves the differential equation (and initial condition). Let's go back to the sequence that determines $x^\ell$: $x_{m+1} = x_0 + \int_{t_0}^t f(x_m, \tau) d\tau$. We've proven that this sequence converges to $x^\ell$. What we want to show is that $\int_{t_0}^t f(x_m, \tau) d\tau \to \int_{t_0}^t f(x^\ell, \tau) d\tau$. This would be immediate if we had that the function were continuous. It is clear that $x^\ell$ satisfies the initial condition by the construction of the sequence, but we need to show that it satisfies the differential equation. Conceptually, this is probably more difficult than what we've just done (establishing bounds, Cauchy sequences): thinking about what that function limit is and what it means for it to satisfy that differential equation.

Now, you can basically use some of the machinery we've been using all along to show this. Difference between these goes to $0$ as $m$ gets large.

$$\mag{\int_{t_0}^t \parens{f(x_m, \tau) - f(x^\ell, \tau)} d\tau} \\ \le \int_{t_0}^t k(\tau) \mag{x_m - x^\ell} d\tau \le \bar{k}\mag{x_m - x^\ell}_\infty T \\ \le \bar{k} M e^{\bar{k} T} \frac{(\bar{k} T)^m}{m!}T$$

Thus $x^\ell$ solves the DE/IC pair. A solution $\Phi$ is $x^\ell$, i.e. $\dot{x}^\ell(t) = f(x^\ell(t), t)\ \forall t \in [t_1, t_2] - D$ and $x^\ell(t_0) = x_0$.

To show that this solution is unique, we will use the Bellman-Gronwall lemma, which is very important. Used ubiquitously when you want to show that functions of time are equal to each other: candidate mechanism to do that.

Bellman-Gronwall Lemma

Let $u, k$ be real-valued positive piece-wise continuous functions of time, and we'll have a constant $c_1 \ge 0$ and $t_0 \ge 0$. If we have such constants and functions, then the following is true: if $u(t) \le c_1 + \int_{t_0}^t k(\tau)u(\tau) d\tau$, then $u(t) \le c_1 e^{\int_{t_0}^t k(\tau) d\tau}$.

Proof (of B-G)

$t > t_0$ WLOG.

$$U(t) = c_1 + \int_{t_0}^t k(\tau) u(\tau) d\tau \implies \dot{U}(t) = k(t)u(t) \\ u(t) \le U(t) \implies \dot{U}(t) = k(t)u(t) \le k(t)U(t) \\ \deriv{}{t}\parens{U(t)e^{-\int_{t_0}^t k(\tau) d\tau}} = \parens{\dot{U}(t) - k(t)U(t)}e^{-\int_{t_0}^t k(\tau) d\tau} \le 0 \text{ (then integrate this derivative, noting that } U(t_0) = c_1 \text{)} \\ u(t) \le U(t) \le c_1 e^{\int_{t_0}^t k(\tau) d\tau}$$
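The lemma can be sanity-checked numerically; here is a sketch with a constant $k$ and a hand-picked $u$ that satisfies the hypothesis (all specific values are arbitrary choices of mine):

```python
import numpy as np

# Sketch of Bellman-Gronwall with constant k and c1 > 0:
# u(t) = c1 * exp(0.5 * k * t) satisfies u(t) <= c1 + int_{t0}^t k u dtau,
# and the lemma's conclusion u(t) <= c1 * exp(k t) indeed holds.
c1, k = 2.0, 1.5
t = np.linspace(0.0, 3.0, 3001)
u = c1 * np.exp(0.5 * k * t)

# cumulative trapezoidal integral of k * u from t0 = 0 to t
incr = 0.5 * (k * u[1:] + k * u[:-1]) * np.diff(t)
integral = np.concatenate(([0.0], np.cumsum(incr)))

print(np.all(u <= c1 + integral + 1e-9))  # hypothesis of the lemma
print(np.all(u <= c1 * np.exp(k * t)))    # conclusion of the lemma
```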

Using this to prove uniqueness of DE/IC solutions

Here is how we're going to use the B-G lemma to prove uniqueness.

We have a solution that we constructed $\Phi$, and someone else gives us a solution $\Psi$, constructed via a different method. Show that these must be equivalent. Since they're both solutions, they have to satisfy the DE/IC pair. Take the norm of the difference between the differential equations.

$$\mag{\Phi - \Psi} \le \bar{k} \int_{t_0}^t \mag{\Phi - \Psi} d\tau \forall t_0, t \in [t_1, t_2]$$

From the Bellman-Gronwall Lemma, we can rewrite this inequality as $\mag{\Phi - \Psi} \le c_1 e^{\bar{k}(t - t_0)}$. Since $c_1 = 0$, this norm is less than or equal to 0. By positive definiteness, this norm must be equal to 0, and so the functions are equal to each other.

Reverse time differential equation

We think about time as monotonic (either increasing or decreasing, usually increasing). Suppose instead that time is decreasing, and we have $\dot{x} = f(x,t)$: we want to explore existence and uniqueness going backwards in time. Suppose we had a time variable $\tau$ which goes from $t_0$ backwards, defined by $\tau \defequals t_0 - t$. We want to define the solution to that differential equation backwards in time as $z(\tau) = x(t)$ for $t \le t_0$. Deriving the reverse-time derivative, the equation is just $-f$; we're going to use $\bar{f}$ to represent this function ($\deriv{}{\tau}z = -\deriv{}{t}x = -f(x, t) = -f(z, t_0 - \tau) \defequals \bar{f}(z, \tau)$).

This equation, if I solve the reverse time differential equation, we'll have some corresponding backwards solution. Concluding statement: can think about solutions forwards and backwards in time. Existence of unique solution forward in time means existence of unique solution backward in time (and vice versa). You can't have solutions crossing themselves in time-invariant systems.

Introduction to dynamical systems

September 20, 2012

Suppose we have equations $\dot{x} = f(x, u, t)$, $\fn{f}{\Re^n \times \Re^{n_i} \times \Re_+}{\Re^n}$ and $y = h(x, u, t)$, $\fn{h}{\Re^n \times \Re^{n_i} \times \Re_+}{\Re^{n_o}}$. We define $n_i$ as the dimension of the input space, $n_o$ as the dimension of the output space, and $n$ as the dimension of the state space.

We've looked at the form, and if we specify a particular $\bar{u}(t)$ over some time interval of interest, then we can plug this into the right hand side of this differential equation. Typically we do not supply a particular input. Thinking about solutions to this differential equation, for now, let's suppose that it's specified.

Suppose we have some feedback function of the state. If $u$ is specified, as long as $\bar{f}$ satisfies the conditions for the existence and uniqueness theorem, we have a differential equation we can solve.

Another example: instead of differential equation (which corresponds to continuous time), we have a difference equation (which corresponds to discrete time).

Example: dynamic system represented by an LRC circuit. One practical way to define the state $x$ is as a vector of elements whose derivatives appear in our differential equation. Not formal, but practical for this example.

Notions of discretizing.

What is a dynamical system?

As discussed in first lecture, we consider time $\tau$ to be a privileged variable. Based on our definition of time, the inputs and outputs are all functions of time.

Now we're going to define a dynamical system as a 5-tuple: $(\mathcal{U}, \Sigma, \mathcal{Y}, s, r)$ (input space, state space, output space, state transition function, output map).

We define the input space as the set of input functions over time to an input set $U$ (i.e. $\mathcal{U} = \{\fn{u}{\tau}{U}\}$; typically, $U = \Re^{n_i}$).

We also define the output space as the set of output functions over time to an output set $Y$ (i.e. $\mathcal{Y} = \{\fn{y}{\tau}{Y}\}$). Typically, $Y = \Re^{n_o}$.

$\Sigma$ is our state space. Not defined as the function, but the actual state space. Typically, $\Sigma = \Re^n$, and we can go back and think about the function $x(t) \in \Sigma$. $\fn{x}{\tau}{\Sigma}$ is called the state trajectory.

$s$ is called the state transition function because it defines how the state changes in response to time and the initial state and the input. $\fn{s}{\tau \times \tau \times \Sigma \times \mathcal{U}}{\Sigma}$. Usually we write this as $x(t_1) = s(t_1, t_0, x_0, u)$, where $u$ is the function $u(\cdot) |_{t_0}^{t_1}$. This is important: it is coming towards how we define state. The only things you need to get to the state at the new time are the initial state, the input, and the dynamics.

Finally, we have this output map (sometimes called the readout map) $r$. $\fn{r}{\tau \times \Sigma \times U}{Y}$. That is, we can think about $y(t) = r(t, x(t), u(t))$. There's something fundamentally different between $r$ and $s$. $s$ depended on the function $u$, whereas $r$ only depended on the current value of $u$ at a particular time.

$s$ captures dynamics, while $r$ is static. Remark: $s$ has dynamics (memory) -- things that depend on previous time, whereas $r$ is static: everything it depends on is at the current time (memoryless).

In order to be a dynamical system, we need to satisfy two axioms: a dynamical system is a five-tuple with the following two axioms:

• The state transition axiom: $\forall t_1 \ge t_0$, given $u, \tilde{u}$ that are equal to each other over a particular time interval, the state transition functions must be equal over that interval, i.e. $s(t_1, t_0, x_0, u) = s(t_1, t_0, x_0, \tilde{u})$. Requires us to not have dependence on the input outside of the time interval of interest.
• The semigroup axiom: suppose you start a system at $t_0$ and evolve it to $t_2$, and you're considering the state. You have an input $u$ defined over the whole time interval. If you were to look at an intermediate point $t_1$, and you computed the state at $t_1$ via the state transition function, we can split our time interval into two intervals, and we can compute the result any way we like. Stated as the following: $s(t_2, t_1, s(t_1, t_0, x_0, u), u) = s(t_2, t_0, x_0, u)$.
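The semigroup axiom can be illustrated with a discrete-time linear system $x(t+1) = Ax(t) + Bu(t)$ (a sketch with arbitrary random data of my choosing; the function `s` below plays the role of the state transition function):

```python
import numpy as np

# Sketch of the semigroup axiom for a discrete-time LTI system
# x(t+1) = A x(t) + B u(t): composing the transitions t0 -> t1 -> t2
# gives the same state as going t0 -> t2 directly.
rng = np.random.default_rng(5)
A = rng.standard_normal((3, 3)) * 0.5
B = rng.standard_normal((3, 1))
u = rng.standard_normal((10, 1))  # input sequence over t = 0..9

def s(t1, t0, x0, u):
    """State transition function: evolve x from time t0 to t1 under input u."""
    x = x0
    for t in range(t0, t1):
        x = A @ x + B * u[t]
    return x

x0 = rng.standard_normal((3, 1))
t0, t1, t2 = 0, 4, 10
direct = s(t2, t0, x0, u)
composed = s(t2, t1, s(t1, t0, x0, u), u)
print(np.allclose(direct, composed))  # True: the semigroup axiom holds
```

Note also that `s` only reads `u[t]` for `t0 <= t < t1`, which is the state transition axiom in miniature.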

When we talk about a dynamical system, we have to satisfy these axioms.

Response function

Since we're interested in the outputs and not the states, we can define what we call the response map. It's not considered part of the definition of a dynamical system because it can be easily derived.

It's the composition of the state transition function and the readout map, i.e. $y(t) = r(t, x(t), u(t)) = r(t, s(t, t_0, x_0, u), u(t)) \defequals \rho(t, t_0, x_0, u)$. This is an important function because it is used to define properties of a dynamical system. Why is that? We've said that states are somehow mysterious. Not something we typically care about: typically we care about the outputs. Thus we define properties like linearity and time invariance.

Time Invariance

We define a time-shift operator $\fn{T_\tau}{\mathcal{U}}{\mathcal{U}}$, $\fn{T_\tau}{\mathcal{Y}}{\mathcal{Y}}$. $(T_\tau u)(t) \defequals u(t - \tau)$. Namely, the value of $T_\tau u$ is that of the old signal at $t-\tau$.

A time-invariant (dynamical) system is one in which the input space and output space are closed under $T_\tau$ for all $\tau$, and $\rho(t, t_0, x_0, u) = \rho(t + \tau, t_0 + \tau, x_0, T_\tau u)$.

Linearity

A linear dynamical system is one in which the input, state, and output spaces are all linear spaces over the same field $\mathbb{F}$, and the response map $\rho$ is a linear map of $\Sigma \times \mathcal{U}$ into $\mathcal{Y}$.

This is a strict requirement: you have to check that the response map satisfies these conditions. Question that comes up: why do we define linearity of a dynamical system in terms of linearity of the response and not the state transition function? Goes back to a system being intrinsically defined by its inputs and outputs. Often states, you can have many different ways to define states. Typically we can't see all of them. It's accepted that when we talk about a system and think about its I/O relations, it makes sense that we define linearity in terms of this memory function of the system, as opposed to the state transition function.

Let's just say a few remarks about this: zero-input response, zero-state response. If we look at the zero element in our spaces (so we have a zero vector), then we can take our superposition, which implies that the response at time $t$ is equal to the zero-state response, which is the response, given that we started at the zero state, plus the zero input response.

That is: $\rho(t, t_0, x_0, u) = \rho(t, t_0, \theta_x, u) + \rho(t, t_0, x_0, \theta_u)$ (from the definition of linearity).

The second remark is that the zero-state response is linear in the input, and similarly, the zero-input response is linear in the state.

One more property of dynamical systems before we finish: equivalence (a property derived from the definition). Take two dynamical systems $D = (U, \Sigma, Y, s, r)$, $\tilde{D} = (U, \tilde{\Sigma}, Y, \tilde{s}, \tilde{r})$. A state $x_0 \in \Sigma$ is equivalent to $\tilde{x}_0 \in \tilde{\Sigma}$ at $t_0$ if $\forall t \ge t_0$, $\rho(t, t_0, x_0, u) = \tilde{\rho}(t, t_0, \tilde{x}_0, u)$ for every input $u$. If every $x_0$ has such an equivalent $\tilde{x}_0$ (and vice versa), the two systems are equivalent.

Linear time-varying systems

September 25, 2012

Recall the state transition function gives the state at the current time as some function of the initial state, initial time, and inputs. Suppose you have a differential equation; how do you acquire the state transition function? Solve the differential equation.

For a general dynamical system, there are different ways to get the state transition function. This is an instantiation of a dynamical system, and we're going to get the state transition function by solving the differential equation / initial condition pair.

We're going to call $\dot{x}(t) = A(t)x(t) + B(t)u(t)$ a vector differential equation with initial condition $x(t_0) = x_0$.

So that requires us to think about solving that differential equation. Do a dimension check, to make sure we know the dimensions of the matrices: $x \in \Re^n$, so $A(t) \in \Re^{n \times n}$. We can regard $A$ as a matrix-valued function, taking intervals of the real line and mapping them over to matrices. As a function, $A$ is a piecewise-continuous matrix function in time.

The entries are piecewise-continuous scalars in time. We would like to get at the state transition function; to do that, we need to solve the differential equation.

Let's assume for now that $A, B, U$ are given (part of the system definition).

Verifying the hypotheses of the existence and uniqueness theorem: piecewise continuity in $t$ is immediate, and we can use the induced norm of $A(t)$ for the Lipschitz condition. Since this induced norm is piecewise continuous in time, this is a fine bound. Therefore $f$ is globally Lipschitz continuous in $x$.

We're going to back off for a bit and introduce the state transition matrix. Background for solving the VDE. We're going to introduce a matrix differential equation, $\dot{X} = A(t) X$ (where $A(t)$ is same as before).

I'm going to define $\Phi(t, t_0)$ as the solution to the matrix differential equation (MDE) for the initial condition $\Phi(t_0, t_0) = 1_{n \times n}$: that is, $\Phi$ is the solution of the $n \times n$ matrix differential equation started at the identity matrix.

Let's first talk about properties of this matrix $\Phi$ just from the definition we have.

• If you go back to the vector differential equation, and let's just drop the term that depends on $u$ (either consider $B$ to be 0, or the input to be 0), the solution of $\dot{x} = A(t)x(t)$ is given by $x(t) = \Phi(t, t_0)x_0$.
• This is what we call the semigroup property, since it's reminiscent of the semigroup axiom. $\Phi(t, t_0) = \Phi(t, t_1) \Phi(t_1, t_0) \forall t, t_0, t_1 \in \Re^+$
• $\Phi^{-1}(t, t_0) = \Phi(t_0, t)$.
• $\text{det} \Phi(t, t_0) = \exp\parens{\int_{t_0}^t \text{tr} \parens{A (\tau)} d\tau}$.
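
These properties can be verified numerically for a concrete time-varying $A(t)$ (a hypothetical example of my own, not from the notes), by integrating the matrix differential equation with an ODE solver:

```python
import numpy as np
from scipy.integrate import solve_ivp, quad

def A(t):
    # hypothetical piecewise-continuous (here smooth) time-varying A(t)
    return np.array([[0.0, 1.0], [-1.0 - 0.5 * np.sin(t), -0.2]])

def phi(t, t0):
    """Phi(t, t0): integrate X' = A(t) X, X(t0) = I (works for t < t0 too)."""
    rhs = lambda s, x: (A(s) @ x.reshape(2, 2)).ravel()
    sol = solve_ivp(rhs, (t0, t), np.eye(2).ravel(), rtol=1e-10, atol=1e-12)
    return sol.y[:, -1].reshape(2, 2)

t0, t1, t2 = 0.0, 0.8, 2.0
# property 2: the semigroup property
assert np.allclose(phi(t2, t0), phi(t2, t1) @ phi(t1, t0), atol=1e-6)
# property 3: the inverse is obtained by swapping the time arguments
assert np.allclose(np.linalg.inv(phi(t2, t0)), phi(t0, t2), atol=1e-6)
# property 4: det Phi(t, t0) = exp(integral of trace of A)
tr_int, _ = quad(lambda s: np.trace(A(s)), t0, t2)
assert np.isclose(np.linalg.det(phi(t2, t0)), np.exp(tr_int), atol=1e-8)
```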

Here's let's talk about some machinery we can now invoke when we want to show that two functions of time are equal to each other when they're both solutions to the differential equation. You can simply show by the existence and uniqueness theorem (assuming it applies) that they satisfy the same initial condition and the same differential equation. That's an important point, and we tend to use it a lot.

(i.e. when faced with showing that two functions of time are equal to each other, you can show that they both satisfy the same initial condition and the same differential equation [as long as the differential equation satisfies the hypotheses of the existence and uniqueness theorem])

Obvious, but good to state.

Note: the initial condition doesn't have to be the initial condition given; it just has to hold at one point in the interval. Pick your point in time judiciously.

Proof of (2): check $t=t_1$. (3) follows directly from (2). (4) you can look at if you want. Gives you a way to compute $\Phi(t, t_0)$. We've introduced a matrix differential equation and an abstract solution.

Consider (1). $\Phi(t, t_0)$ is a map that takes the initial state and transitions to the new state. Thus we call $\Phi$ the state transition matrix because of what it does to the states of this vector differential equation: it transfers them from their initial value to their final value, and it transfers them through matrix multiplication.

Let's go back to the original differential equation. Claim that the solution to that differential equation has the following form: $x(t) = \Phi(t, t_0)x_0 + \int_{t_0}^t \Phi(t, \tau)B(\tau)u(\tau) d\tau$. Proof: we can use the same machinery. If someone gives you a candidate solution, you can easily show that it is the solution.

Recall the Leibniz rule, which we'll state in general as follows: $\deriv{}{z} \int_{a(z)}^{b(z)} f(x, z) dx = \int_{a(z)}^{b(z)} \pderiv{}{z}f(x, z) dx + \deriv{b}{z} f(b(z), z) - \deriv{a}{z} f(a(z), z)$.

$$\dot{x}(t) = A(t) \Phi(t, t_0) x_0 + \int_{t_0}^t \pderiv{}{t} \parens{\Phi(t, \tau)B(\tau)u(\tau)} d\tau + \deriv{t}{t}\Phi(t, t)B(t)u(t) - \deriv{t_0}{t}\Phi(t, t_0)B(t_0)u(t_0) \\ = A(t)\Phi(t, t_0)x_0 + \int_{t_0}^t A(t)\Phi(t,\tau)B(\tau)u(\tau)d\tau + B(t)u(t) \\ = A(t)\parens{\Phi(t, t_0) x_0 + \int_{t_0}^t \Phi(t, \tau)B(\tau) u(\tau) d\tau} + B(t) u(t) \\ = A(t)x(t) + B(t)u(t)$$

(using $\Phi(t, t) = I$, $\deriv{t}{t} = 1$, and $\deriv{t_0}{t} = 0$, so the last boundary term vanishes).

$x(t) = \Phi(t,t_0)x_0 + \int_{t_0}^t \Phi(t,\tau)B(\tau)u(\tau) d\tau$ is good to remember.
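
One way to gain confidence in this variation-of-constants formula is to check it numerically on a made-up time-varying system (the specific $A(\cdot)$, $B(\cdot)$, $u(\cdot)$ below are arbitrary choices of mine), computing $\Phi(t, \tau) = \Phi(t, t_0)\Phi(\tau, t_0)^{-1}$ from the semigroup and inverse properties:

```python
import numpy as np
from scipy.integrate import solve_ivp, trapezoid

A = lambda t: np.array([[0.0, 1.0], [-2.0 - np.cos(t), -0.3]])
B = lambda t: np.array([[0.0], [1.0]])
u = lambda t: np.array([np.sin(2 * t)])

t0, tf = 0.0, 3.0
x0 = np.array([1.0, -1.0])

# direct numerical solution of the vector differential equation
rhs = lambda t, x: A(t) @ x + B(t) @ u(t)
x_direct = solve_ivp(rhs, (t0, tf), x0, rtol=1e-10, atol=1e-12).y[:, -1]

# Phi(tau, t0) on a fine grid, from the matrix differential equation
taus = np.linspace(t0, tf, 2001)
mrhs = lambda t, X: (A(t) @ X.reshape(2, 2)).ravel()
Phis = solve_ivp(mrhs, (t0, tf), np.eye(2).ravel(), t_eval=taus,
                 rtol=1e-10, atol=1e-12).y.T.reshape(-1, 2, 2)
Phi_t = Phis[-1]                                   # Phi(tf, t0)

# Phi(tf, tau) = Phi(tf, t0) Phi(tau, t0)^{-1} by the semigroup/inverse properties
vals = np.array([Phi_t @ np.linalg.inv(P) @ B(s) @ u(s)
                 for P, s in zip(Phis, taus)])     # integrand samples
x_formula = Phi_t @ x0 + trapezoid(vals, taus, axis=0)
assert np.allclose(x_direct, x_formula, atol=1e-4)
```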

Not surprisingly, it depends on the input function over an interval of time.

The differential equation is changing over time, therefore the system itself is time-varying. No way in general that will be time-invariant, since the equation that defines its evolution is changing. You test time-invariance or time variance through the response map. But is it linear? You have the state transition function, so we can compute the response function (recall: readout map composed with the state transition function) and ask if this is a linear map.

Linear time-invariant systems

September 27, 2012

Last time, we talked about the time-varying differential equation, and we expressed $R(\cdot) = \bracks{A(\cdot), B(\cdot), C(\cdot), D(\cdot)}$. We used the state transition matrix to show that the solution is given by $x(t) = \Phi(t, t_0) x_0 + \int_{t_0}^t \Phi(t, \tau) B(\tau) u(\tau) d\tau$. The state transition matrix appears both in the zero-input term and inside the integral, and we haven't talked about how we would compute this matrix. In general, computing the state transition matrix is hard. But there's one important class where the computation becomes much simpler than usual: where the system does not depend on time.

Linear time-invariant case: $\dot{x} = Ax + Bu, y = Cx + Du, x(t_0) = x_0$. Does not matter at what time we start. Typically, WLOG, we use $t_0 = 0$ (we can't do this in the time-varying case).

Aside: Jacobian linearization

In practice, generally the case that someone doesn't present you with a model that looks like this. Usually, you derive this (usually nonlinear) model through physics and whatnot. What can I do to come up with a linear representation of that system? What is typically done is an approximation technique called Jacobian linearization.

So suppose someone gives you a nonlinear system and an output equation, and you want to come up with some linear representation of the system.

Two points of view: we could look at the system, and suppose we applied a particular input to the system and solve the differential equation ($u^0(t) \mapsto x^0(t)$, the nominal input and nominal solution). That would result in a solution (state trajectory, in general). Now suppose that we for some reason want to perturb that input ($u^0(t) + \delta u(t)$, the perturbed input). Suppose in general that $\delta u$ is a small perturbation. What this results in is a new state trajectory, that we'll define as $x^0(t) + \delta x(t)$, the perturbed solution.

Now we can derive from that what we call the Jacobian linearization. If we apply the nominal input, the nominal solution satisfies $\dot{x}^0 = f(x^0, u^0, t)$, with $x^0(t_0) = x_0$.

$\dot{x}^0 + \dot{\delta x} = f(x^0 + \delta x, u^0 + \delta u, t)$, where $(x^0 + \delta x)(t_0) = x_0 + \delta x_0$. Now I'm going to look at these two and perform a Taylor expansion about the nominal input and solution. Thus $f(x^0 + \delta x, u^0 + \delta u, t) = f(x^0, u^0, t) + \pderiv{}{x} f(x, u, t)\vert_{(x^0, u^0)}\delta x + \pderiv{}{u}f(x,u,t)\vert_{(x^0, u^0)} \delta u + \text{higher order terms}$ (recall that we also called $\pderiv{}{x}$ $D_1$, i.e. the derivative with respect to the first argument).

What I've done is expanded the right-hand side of the differential equation. Subtracting the nominal equation, $\dot{\delta x} = \pderiv{}{x} f(x, u, t)\vert_{(x^0, u^0)} \delta x + \pderiv{}{u} f(x, u, t)\vert_{(x^0, u^0)}\delta u + \text{higher order terms}$. If $\delta u, \delta x$ are small, we can neglect the higher-order terms, which gives us an approximate first-order linear differential equation. This gives us a linear time-varying approximation of the dynamics of this perturbation vector, in response to a perturbation input. That's what the Jacobian linearization gives you: the perturbation away from the nominal (we linearized about a bias point).

Consider $A(t)$ to be the Jacobian matrix of $f$ with respect to $x$, and $B(t)$ to be the Jacobian matrix with respect to $u$. Remember that this is an approximation, and if your system is really nonlinear, and you perturb the system a lot (stray too far from the bias point), then this linearization may cease to hold.
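
A sketch of Jacobian linearization in code, using finite differences in place of symbolic derivatives. The pendulum dynamics and numbers here are assumptions of mine, chosen to match the inverted-pendulum example that appears later in these notes ($\dot{x}_2 = \Omega^2 x_1 + u$ about the upright equilibrium):

```python
import numpy as np

# Hypothetical nonlinear inverted pendulum about the upright position:
# x1 = theta, x2 = theta_dot, scalar input u (normalized torque).
g_over_l = 9.81 / 0.5    # Omega^2 = g / l, with assumed l = 0.5 m

def f(x, u):
    return np.array([x[1], g_over_l * np.sin(x[0]) + u[0]])

def jacobians(f, x0, u0, eps=1e-6):
    """Central finite differences for A = df/dx and B = df/du at (x0, u0)."""
    n, m = len(x0), len(u0)
    A = np.zeros((n, n)); B = np.zeros((n, m))
    for j in range(n):
        dx = np.zeros(n); dx[j] = eps
        A[:, j] = (f(x0 + dx, u0) - f(x0 - dx, u0)) / (2 * eps)
    for j in range(m):
        du = np.zeros(m); du[j] = eps
        B[:, j] = (f(x0, u0 + du) - f(x0, u0 - du)) / (2 * eps)
    return A, B

# linearize about the nominal (equilibrium) x0 = 0, u0 = 0
A, B = jacobians(f, np.zeros(2), np.zeros(1))
assert np.allclose(A, [[0.0, 1.0], [g_over_l, 0.0]], atol=1e-4)
assert np.allclose(B, [[0.0], [1.0]], atol=1e-6)
```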

Linear time-invariant systems

Motivated by the fact that we have a solution to the time-varying equation, it depends on the state transition matrix, which right now is an abstract thing which we don't have a way of solving. Let's go to a more specific class of systems: that where $A, B, C, D$ do not depend on time. We know that this system is linear (we don't know yet that it is time-invariant; we have to find the response function and show that it satisfies the definition of a time-invariant system), so this still requires proof.

Since these don't depend on time, we can use some familiar tools (e.g. Laplace transforms), remembering what taking the Laplace transform of a derivative is. Denote $\hat{x}(s)$ to be the Laplace transform of $x(t)$. Transforming, $s\hat{x}(s) - x_0 = A\hat{x}(s) + B\hat{u}(s)$ and $\hat{y}(s) = C\hat{x}(s) + D\hat{u}(s)$ (the output equation involves no derivative, so no initial-condition term appears). The first equation becomes $(sI - A)\hat{x}(s) = x_0 + B\hat{u}(s)$, and we'll leave the second equation alone.

Let's first consider $\dot{x} = Ax$, $x(0) = x_0$. I could have done the same thing, except my right-hand side doesn't depend on $B$: $(sI - A)\hat{x}(s) = x_0$. Let's leave that for a second and come back to it, and make the following claim: the state transition matrix for $\dot{x} = Ax, x(t_0) = x_0$ is $\Phi(t,t_0) = e^{A(t-t_0)}$, which is called the matrix exponential, defined as $e^{A(t-t_0)} = I + A(t-t_0) + \frac{A^2(t-t_0)^2}{2!} + ...$ (the Taylor expansion of the exponential function).

We just need to show that the state transition matrix, using definitions we had last day, is indeed the state transition matrix for that system. We could go back to the definition of the state transition matrix for the system, or we could go back to the state transition function for the vector differential equation.

From last time, we know that the solution to $\dot{x} = A(t)x, x(t_0) = x_0$ is given by $x(t) = \Phi(t, t_0)x_0$; here, we are claiming then that $x(t) = e^{A(t - t_0)} x_0$, where $x(t)$ is the solution to $\dot{x} = Ax$ with initial condition $x_0$.

First show that it satisfies the vector differential equation: $\dot{x} = \pderiv{}{t}\exp\parens{A(t-t_0)} x_0 = \parens{0 + A + A^2(t - t_0) + ...}x_0 = A\parens{I + A(t-t_0) + \frac{A^2(t-t_0)^2}{2!} + ...} x_0 = Ae^{A(t-t_0)} x_0 = Ax(t)$, so it satisfies the differential equation. Checking the initial condition, we get $e^{A \cdot 0}x_0 = I x_0 = x_0$. We've shown that this satisfies the time-invariant differential equation and initial condition; by the existence and uniqueness theorem, it is the solution.

Through this proof, we've shown a couple of things: the derivative of the matrix exponential, and we evaluated it at $t-t_0=0$. So now let's go back and reconsider its infinite series representation and classify some of its other properties.

Properties of the matrix exponential

• $e^0 = I$
• $e^{A(t+s)} = e^{At}e^{As}$
• $e^{(A+B)t} = e^{At}e^{Bt}$ iff $\comm{A}{B} = 0$.
• $\parens{e^{At}}^{-1} = e^{-At}$, and these properties hold in general if you're looking at $t$ or $t - t_0$.
• $\deriv{e^{At}}{t} = Ae^{At} = e^{At}A$ (i.e. $\comm{e^{At}}{A} = 0$)
• Suppose $X(t) \in \Re^{n \times n}$, $\dot{X} = AX, X(0) = I$, then the solution of this matrix differential equation and initial condition pair is given by $X(t) = e^{At}$. Proof in the notes; very similar to what we just did (more general proof, that the state transition matrix is just given by the matrix exponential).
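
A quick numerical spot-check of these properties (the random test matrices are my own choices; note in particular that property 3 genuinely fails for non-commuting matrices):

```python
import numpy as np
from scipy.linalg import expm

rng = np.random.default_rng(0)
A = 0.5 * rng.standard_normal((3, 3))
t, s = 0.7, 1.1

assert np.allclose(expm(np.zeros((3, 3))), np.eye(3))              # property 1
assert np.allclose(expm(A * (t + s)), expm(A * t) @ expm(A * s))   # property 2
assert np.allclose(np.linalg.inv(expm(A * t)), expm(-A * t))       # property 4
# property 3 fails for a generic (non-commuting) pair A, B ...
B = 0.5 * rng.standard_normal((3, 3))
assert not np.allclose(expm((A + B) * t), expm(A * t) @ expm(B * t))
# ... but holds when [A, B] = 0 (e.g. B2 = A @ A commutes with A)
B2 = A @ A
assert np.allclose(expm((A + B2) * t), expm(A * t) @ expm(B2 * t))
```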

Calculating $e^{At}$, given $A$

What this is now useful for is making more concrete this state transition concept. Still a little abstract, since we're still considering the exponential of a matrix.

The first point is that using the infinite series representation to compute $e^{At}$ is in general hard.

It would be doable if you knew $A$ were nilpotent ($A^k = 0$ for some positive integer $k$), since the series then terminates, but it's not always feasible; it would not be feasible if $k$ were large.

The way one usually computes the state transition matrix $e^{At}$ is as follows:

Recall: $\dot{X}(t) = AX(t)$, with $X(0) = I$. We know from what we've done before (property 6) that we can easily prove $X(t) = e^{At}$. We also know that $(sI - A)\hat{X}(s) = I$, so $\hat{X}(s) = (sI - A)^{-1}$. That tells me that $e^{At} = \mathcal{L}^{-1}\parens{(sI - A)^{-1}}$. That gives us a way of computing $e^{At}$, assuming we have a way to compute a matrix's inverse and an inverse Laplace transform. This is what people usually do, and most algorithms approach the problem this way. Generally hard to compute the inverse and the inverse Laplace transform.

This requires proof that $sI - A$ is invertible (for $s$ not an eigenvalue of $A$) and that its inverse is indeed the Laplace transform of $e^{At}$.

Cleve Moler started MATLAB (built on LINPACK, the linear algebra package that was its original engine). Famous in computational linear algebra. Paper: "Nineteen dubious ways to compute the exponential of a matrix." Actually a hard problem in general; related to the factoring of $n$-degree polynomials.

If we were to consider our simple nilpotent case, we'll compute $sI - A = \begin{bmatrix}s & -1 \\ 0 & s\end{bmatrix}$. We can immediately write down its inverse as $\begin{bmatrix}\frac{1}{s} & \frac{1}{s^2} \\ 0 & \frac{1}{s}\end{bmatrix}$. Inverse Laplace transform takes no work; it's simply $\begin{bmatrix}1 & t \\ 0 & 1\end{bmatrix}$.
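
This example can be reproduced symbolically. The sketch below (assuming the nilpotent matrix is $A = \begin{bmatrix}0 & 1 \\ 0 & 0\end{bmatrix}$, consistent with the $sI - A$ shown above) computes the resolvent, inverts the Laplace transform entrywise, and cross-checks against the matrix exponential:

```python
import sympy as sp

s, t = sp.symbols('s t', positive=True)
A = sp.Matrix([[0, 1], [0, 0]])       # assumed nilpotent example: A^2 = 0

# resolvent (sI - A)^{-1}
resolvent = (s * sp.eye(2) - A).inv()
assert sp.simplify(resolvent - sp.Matrix([[1/s, s**-2], [0, 1/s]])) == sp.zeros(2, 2)

# entrywise inverse Laplace transform of the nonzero entries gives e^{At};
# sympy attaches Heaviside(t) factors, which are 1 for t > 0
ilt = lambda F: sp.inverse_laplace_transform(F, s, t)
eAt = sp.Matrix([[ilt(resolvent[0, 0]), ilt(resolvent[0, 1])],
                 [0, ilt(resolvent[1, 1])]]).subs(sp.Heaviside(t), 1)
assert sp.simplify(eAt - sp.Matrix([[1, t], [0, 1]])) == sp.zeros(2, 2)

# cross-check against the matrix exponential computed directly
assert sp.simplify((A * t).exp() - eAt) == sp.zeros(2, 2)
```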

In the next lecture (and next series of lectures) we will be talking about the Jordan form of a matrix. We have a way to compute $e^{At}$. We'll write $A = TJT^{-1}$. In its simplest case, it's diagonal. Either way, all of the work is in exponentiating $J$. You still end up doing something that's the inverse Laplace transform of $sI - J$.

We've shown that for a linear TI system, $\dot{x} = Ax + Bu$; $y = Cx + Du$ ($x(0) = x_0$). $x(t) = e^{At}x_0 + \int_0^t e^{A(t-\tau)} Bu(\tau) d\tau$. We proved it last time, but you can check this satisfies the differential equation and initial condition.
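
For a constant (step) input the convolution integral has a closed form, $\int_0^t e^{A(t-\tau)}B \, d\tau = A^{-1}(e^{At} - I)B$ when $A$ is invertible, which makes this solution easy to sanity-check against a direct ODE solve (the particular $A$, $B$, $x_0$ below are arbitrary choices of mine):

```python
import numpy as np
from scipy.linalg import expm
from scipy.integrate import solve_ivp

A = np.array([[0.0, 1.0], [-2.0, -3.0]])   # invertible, eigenvalues -1 and -2
B = np.array([[0.0], [1.0]])
x0 = np.array([1.0, 0.0])
u0, tf = 1.0, 2.0                          # unit step input, final time

# closed form: x(t) = e^{At} x0 + A^{-1}(e^{At} - I) B u0
x_closed = expm(A * tf) @ x0 \
    + (np.linalg.inv(A) @ (expm(A * tf) - np.eye(2)) @ B * u0).ravel()

# direct numerical integration of x' = Ax + Bu with u(t) = u0
x_num = solve_ivp(lambda t, x: A @ x + (B * u0).ravel(), (0.0, tf), x0,
                  rtol=1e-10, atol=1e-12).y[:, -1]
assert np.allclose(x_closed, x_num, atol=1e-6)
```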

From that, you can compute the response function and show that it's time-invariant. Let's conclude today's class with a planar inverted pendulum. Let's call the angle of rotation away from the vertical $\theta$, mass $m$, length $\ell$, and torque $\tau$. Equations of motion: $m\ell^2 \ddot{\theta} - mg\ell \sin \theta = \tau$. Perform Jacobian linearization about the trivial trajectory where the pendulum is straight up (we redefine $\theta = 0$ to be the upright position). Therefore $\delta\theta = \theta$ and $\sin\theta \approx \theta$, so $m\ell^2 \ddot{\theta} - mg\ell\theta = \tau$. With $u = \frac{\tau}{m\ell^2}$, $\Omega^2 = \frac{g}{\ell}$, and states $x_1 = \theta$, $x_2 = \dot{\theta}$: $\dot{x}_1 = x_2$, and $\dot{x}_2 = \Omega^2 x_1 + u$.

$y = \theta = x_1$, so the linearized model is $\dot{x}_1 = x_2, \dot{x}_2 = \Omega^2 x_1 + u, y = x_1$. Stabilization of the system via feedback by considering poles of the Laplace transform, etc.: $\frac{\hat{y}}{\hat{u}} = \frac{1}{s^2 - \Omega^2} = G(s)$ (the plant).

In general, not a good idea: canceling an unstable pole, and then using feedback. In the notes, this is some controller $K(s)$. If we look at the open-loop transfer function ($K(s)G(s) = \frac{1}{s(s+\Omega)}$), $\hat{u} = \frac{s-\Omega}{s}\hat{\bar{u}}$, so $\dot{u} = \dot{\bar{u}} - \Omega\bar{u}$ (assume zero initial conditions on $u, \bar{u}$). If we define a third state variable now, $x_3 = \bar{u} - u$ (so $u = \bar{u} - x_3$), then that tells us that $\dot{x}_3 = \Omega \bar{u}$. Here, I have $A = \begin{bmatrix} 0 & 1 & 0 \\ \Omega^2 & 0 & -1 \\ 0 & 0 & 0 \end{bmatrix}$, $B = \begin{bmatrix}0 \\ 1 \\ \Omega\end{bmatrix}$, $C = \begin{bmatrix}1 & 0 & 0\end{bmatrix}$, $D = 0$. Out of time today, but we'll solve at the beginning of Tuesday's class.

Solve for $x(t) = \begin{bmatrix}x_1 & x_2 & x_3\end{bmatrix}^T$. We have a few approaches:

• Using $A,B,C,D$: compute the following: $y(t) = Ce^{At} x_0 + C\int_0^t e^{A(t - \tau)}Bu(\tau) d\tau$. In doing that, we'll need to compute $e^{At}$, and then we have this expression for general $u$: suppose you supply a step input.
• Suppose $\bar{u} = -y = -Cx$. Therefore $\dot{x} = Ax + B(-Cx) = (A - BC)x$. We have a new $A_{CL} = A - BC$, and we can exponentiate this instead.

Foreshadows later, when we think about control. Introduces this standard notion of feedback for stabilizing systems. Using newfound knowledge of state transition matrix for TI systems (how to compute it), see how to compute. See what MATLAB is doing.

Computing the Matrix Exponential; Dyadic Expansion

October 2, 2012

Today: example computing $x(t)$ using $e^{At}$. Sastry guest lecturing on Thursday; LN11 is linear quadratic optimization.

Date for midterm: in class, Tuesday Oct. 16 (two weeks from today). It's going to cover material up to and including what we cover in homework assignments (e.g. won't have anything on linear quadratic optimization; will probably be everything up to what we finish this week: up to lecture notes 13, not including lecture notes 11). Sample midterms, which will be posted; next Friday, Oct. 12, will be midterm review in section. On Monday, Oct. 15, Insoon will have extended office hours. One more homework: #4, posted tonight, due before midterm (so likely next Thursday).

In general, you'll compute the matrix exponential via the inverse Laplace transform of $(sI - A)^{-1}$. Similarity transformation: $TJT^{-1}$. We don't in general talk about $e^{At}$, but rather functions of a matrix $A$ which can be represented in its Jordan form.

Continuation of example. We recall that we're working in the realm of LTI systems described by $\dot{x} = Ax + Bu$, $y = Cx + Du$. Remember the solution given by the convolution (so to speak) of the input with the matrix exponential.

We set up this model with the inverted pendulum, and we considered the open-loop representation. We proposed a controller, $k(s)$, and then we wrote out the system dynamics. We then defined a new state variable to represent the dynamics of the controller, so we had three first-order differential equations where the right-hand side is just a function of the inputs.

We then began to consider a closed-loop representation of the system. If we write the system in terms of the transfer functions, what we're doing is measuring the output and feeding it back through negative feedback to the input. If we look at the new state update equation and the output equation, we recognize that even though the matrices are different, they still have the same dimensions.

Computed a few things in MATLAB: CL step response. What we see is that it has a step response indicative of a nice, stable system, but that's the closed loop zero-state response. If we consider the zero-input response, we find that the solution diverges.

So what's going on here? We can go back and look at the system. At some times, we can compute $x_{CL}(t)$.
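
The apparent paradox (stable step response, divergent zero-input response) is visible in the closed-loop eigenvalues: the unstable pole we canceled survives as an internal eigenvalue of $A_{CL} = A - BC$. A sketch, assuming $\Omega = 1$ for concreteness and the $A, B, C$ matrices from the inverted-pendulum example above:

```python
import numpy as np

Om = 1.0                               # assume Omega = 1
A = np.array([[0.0,   1.0,  0.0],
              [Om**2, 0.0, -1.0],
              [0.0,   0.0,  0.0]])
B = np.array([[0.0], [1.0], [Om]])
C = np.array([[1.0, 0.0, 0.0]])

# closed loop under u_bar = -y: x' = (A - BC) x
A_cl = A - B @ C
eigs = np.linalg.eigvals(A_cl)

# the canceled pole at s = +Omega survives as a closed-loop eigenvalue, so the
# zero-input response diverges even though the step response looks stable
assert np.isclose(eigs.real.max(), Om, atol=1e-8)
```

The other two eigenvalues sit in the open left half-plane; they are the poles of the (stable) input-output transfer function after the cancellation.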

Once we compute the matrix's inverse, we consider the partial fraction decomposition of the elements of the inverse. In general, you would have to do that for all nine terms (but in this case, there are repeated terms, so you'd have to do this just three times). You can perform this partial fraction expansion, and then you have to remember what the inverse Laplace transform is.

In our problem, we're asking for less: we only need to compute the first column (which is actually still everything).

One of the things you're tempted to do as an undergraduate in controls is attempt to cancel out your poles, but it is impossible to get perfect pole-zero cancellation.

Back to the main point: we're in general faced with computing the matrix exponential, and taking the inverse Laplace transform is hard. The idea is to try to represent A in a canonical form to simply compute what that Laplace transform is, if possible.

General: Functions of a matrix $A \in \mathbb{C}^{n\times n}$

(Goal: find simpler ways to compute $e^{At}$, but in general, functions of a matrix)

We started with LTI systems; we know the solution in terms of the state transition matrix is given by $e^{At}x_0$. More for terminology than anything, we're going to write the inverse of the matrix $sI - A$ in the following form: the ratio of something called the adjugate of $sI - A$ over the determinant, i.e. $(sI - A)^{-1} = \frac{\mathrm{adj} (sI - A)}{\det (sI - A)}$. I'm going to define the characteristic polynomial of $A$ as $\hat{\chi}_A(s) \defequals \det (sI - A)$ (so it will look like $s^n + \alpha_1 s^{n-1} + ... + \alpha_n$). The adjugate is given by a matrix polynomial $B_0 s^{n-1} + B_1 s^{n-2} + ... + B_{n-2} s + B_{n-1}$, where $B_i \in \mathbb{C}^{n \times n}$. We define this matrix as follows: $\mathrm{adj} (sI - A) = C^T$, where $C_{ij} = (-1)^{i+j} M_{ij}$ (where $M_{ij}$ is the determinant of the $(n-1) \times (n-1)$ matrix obtained by eliminating row $i$ and column $j$).

What we really want to do is define a notation here for understanding the Cayley-Hamilton theorem. If we had to, we could compute it, but let's go back and use the notation.

The theorem we're going to state here is the Cayley-Hamilton theorem, which states that every $n \times n$ matrix satisfies its own characteristic polynomial. What does that mean? If you set the characteristic polynomial of the matrix to 0, and everywhere you see $s$ you plug in the matrix, that will sum up to 0.

Given the setup here, we can easily prove this. With the definition of an inverse, let's multiply both sides (of our definition of the inverse) by $sI - A$ and the characteristic polynomial. This will yield the equivalent expression that $\mathrm{adj}(sI - A)(sI - A) = \hat{\chi}_A(s) I$. Just matching coefficients, we can write out the $B_i$. From that, we can write out the polynomial and simply work through the math. All we used in the proof of that is just the general form of the matrix in terms of $sI - A$.

Really important result, since we can use this to say general things. This tells us something immediately about polynomials of matrices: any power $A^k$ with $k \ge n$ can be rewritten as a linear combination of lower powers. If you had general $k$-degree polynomials in $s$, $\hat{p}_1, \hat{p}_2$, then dividing each by $\hat{\chi}_A$, we can write these as $\hat{q}_1 \hat{\chi}_A + \hat{r}_1$ and $\hat{q}_2 \hat{\chi}_A + \hat{r}_2$. Even if the two are not equal, if the remainders are the same (i.e. $\hat{r}_1 = \hat{r}_2$), then evaluating the two on the matrix $A$ gives $\hat{p}_1(A) = \hat{p}_2 (A)$, since $\hat{\chi}_A(A) = 0$ by Cayley-Hamilton. That tells us that every polynomial function of $A$ can be written as a function of $I, A, A^2, ..., A^{n-1}$.
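
Both facts (Cayley-Hamilton itself, and reduction of a polynomial modulo the characteristic polynomial) are easy to check numerically on a random matrix (a sketch with hypothetical data):

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((4, 4))

# characteristic polynomial coefficients: chi(s) = s^4 + a1 s^3 + ... + a4
coeffs = np.poly(A)

# evaluate chi(A) by Horner's method; Cayley-Hamilton says it is the zero matrix
chi_A = np.zeros((4, 4))
for c in coeffs:
    chi_A = chi_A @ A + c * np.eye(4)
assert np.allclose(chi_A, 0, atol=1e-8)

# any polynomial in A reduces mod chi_A: p(A) = r(A) when p = q*chi + r
p = rng.standard_normal(7)            # a random degree-6 polynomial
_, r = np.polydiv(p, coeffs)          # r has degree < 4
evalp = lambda cs: sum(c * np.linalg.matrix_power(A, len(cs) - 1 - i)
                       for i, c in enumerate(cs))
assert np.allclose(evalp(p), evalp(r), atol=1e-6)
```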

In general, interesting functions are not necessarily polynomials, and we have an infinite series representation. How do we compute functions of a matrix A? In order to do that, we're going to have to think about two cases: we're going to look at the matrix A and compute its eigenvalues and eigenvectors. We've mentioned that before; we'll have a quick review of computing eigenvalues and eigenvectors. We'll break this up into two cases: we have $n$ linearly independent eigenvectors, and we have less than $n$ linearly independent eigenvectors. In the first case, $A$ is diagonalizable, meaning there exists a similarity transformation that can compute a diagonalization of the matrix where the diagonal elements are the eigenvalues, and in the latter case, $A$ can be represented in Jordan form.

If $A$ has $n$ distinct eigenvalues, that implies that $A$ has $n$ linearly independent eigenvectors (but the reverse implication is not true!).

Next: Sastry will talk about diagonalizable case, and he'll probably get through part of lecture notes 13.

Eigenvalues, etc.

October 4, 2012

If $A$ is real, then the coefficients of its characteristic polynomial are real, which also tells us that the roots (eigenvalues) appear in complex conjugate pairs. If the matrix is complex, all bets are off.

Amazingly, the study of these eigenvalues and eigenvectors is actually a lot easier if the $\lambda_i$ are distinct, i.e. $\lambda_i \neq \lambda_j$ for $i \neq j$. The first thing to notice is that if you have $n$ distinct eigenvalues, from the definition of eigenvectors, there's at least one eigenvector associated with every eigenvalue.

First thing to prove is that eigenvectors corresponding to distinct eigenvalues are linearly independent. Suppose $\sum_i \alpha_i e_i = \theta$. Apply $A - \lambda_i I$ to both sides for each $i$ except one index $k$; each factor annihilates the corresponding $e_i$, and what remains is $\alpha_k (\lambda_k - \lambda_1)(\lambda_k - \lambda_2) \cdots (\lambda_k - \lambda_n) e_k = \theta$ (the product omitting the factor for $i = k$). Since the $\lambda_i$ are all distinct, $\alpha_k$ must be $0$, and this holds for every $k$.

Jordan's Theorem

We have $A\begin{bmatrix}e_1 & e_2 & ... & e_n \end{bmatrix} = \begin{bmatrix}e_1 & e_2 & ... & e_n\end{bmatrix} \begin{bmatrix} \lambda_1 & 0 & ... & 0 \\ 0 & \lambda_2 & ... & 0 \\ \vdots & & \ddots & \\ 0 & 0 & ... & \lambda_n\end{bmatrix}$, i.e. $A = T^{-1}\Lambda T$ with $T^{-1} = \begin{bmatrix}e_1 & e_2 & ... & e_n\end{bmatrix}$. If you look at $A$ in a new basis given by the eigenvectors, $A$ is a diagonal matrix, where the diagonal entries are given by the eigenvalues.

Jordan (canonical) form of a matrix.

Let's give a name to the rows of $T$: $T = \begin{bmatrix} v_1^T \\ v_2^T \\ \vdots \\ v_n^T \end{bmatrix}$. These look also like eigenvectors; they are called the left eigenvectors ($\{e_i\}$ are the right eigenvectors) -- these are actually row vectors as opposed to column vectors.

Notion of an outer product. The dyadic (outer) product of the right and left eigenvectors is sometimes called the residue matrix (basically, a projection matrix): $R_i = e_i v_i^T$. If you flipped the order of the product, what that yields is a scalar, and since $TT^{-1} = I$ we get $v_i^T e_j = \delta_{ij}$. How would you then define a function $f(A)$? $f(A) = \sum_i f(\lambda_i) R_i$, where $f$ is required to be analytic at the $\lambda_i$.
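
A numerical sketch of this dyadic expansion (using a random matrix of my own choosing; with probability one its eigenvalues are distinct), checking that the residue matrices resolve the identity and that $f(A) = \sum_i f(\lambda_i) R_i$ reproduces $e^{At}$:

```python
import numpy as np
from scipy.linalg import expm

rng = np.random.default_rng(2)
A = rng.standard_normal((3, 3))      # distinct eigenvalues with probability one

lam, Tinv = np.linalg.eig(A)         # columns of Tinv are right eigenvectors e_i
T = np.linalg.inv(Tinv)              # rows of T are left eigenvectors v_i^T

# residue (dyadic) matrices R_i = e_i v_i^T
R = [np.outer(Tinv[:, i], T[i, :]) for i in range(3)]
assert np.allclose(sum(R), np.eye(3))          # sum_i R_i = T^{-1} T = I

# f(A) = sum_i f(lambda_i) R_i, here with f = exp (scaled by t)
t = 0.5
f_A = sum(np.exp(lam[i] * t) * R[i] for i in range(3))
assert np.allclose(f_A, expm(A * t), atol=1e-6)   # imaginary parts cancel
```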

Let's try to apply this to (single-input, single-output) linear systems. Assume that $A$ has distinct eigenvalues. That means that there exists $T$ such that $A = T^{-1}\Lambda T$. If I were to rewrite this with a change of basis $z = Tx$, then $\dot{z} = \Lambda z + Tbu$, $y = cT^{-1}z$. If we consider the Laplace transform, we have poles corresponding to the eigenvalues. The modes are decoupled, and a mode that does not appear in $cT^{-1}$ cannot be observed. By putting systems in modal form, you can actually see which modes are influencing the overall transfer function. The initial condition contributes just the transient response. The multi-input, multi-output case is identical, by linearity.

Just to tell you that this kind of manipulation has some quite interesting other implications, let me lead you through another calculation and talk about some popular misconceptions about linear systems. In most undergraduate courses, you'll be told about the "eigenfunction property": if you put in an input at some frequency, you'll get an output at the same frequency. Two problems: is that always true for every frequency, and secondly, what about the role of the initial condition: what if the system is unstable?

What I'm going to do is just an exercise in Laplace transforms, really. (end of lecture 12 notes)

Output generally corresponds to the forced response, the response to the input at the same frequency (by the way, without further assumption, the eigenfunction property is immediately false if you force the system at one of its natural modes, i.e. eigenvalues), plus a second component, the contribution due to the initial condition. If $A$ is stable, then the second component goes to 0. Is it possible to still get this pure response by clever choice of initial condition? Yes: regularization (the unique choice of $x_0$ once $u_0$ is specified and $B$ is given: $x_0 = (\lambda I - A)^{-1}Bu_0$). Otherwise, we have a transient response, and if $A$ is growing exponentially, then this will dominate the system.
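
The regularization claim can be checked numerically: for $\lambda = i\omega$ and the real input $u(t) = u_0\cos(\omega t)$, choosing $x_0 = \Re\parens{(i\omega I - A)^{-1}Bu_0}$ leaves a pure sinusoidal response with no transient, even for an unstable $A$ (the specific matrices below are hypothetical examples of mine):

```python
import numpy as np
from scipy.integrate import solve_ivp

A = np.array([[0.0, 1.0], [2.0, 1.0]])   # unstable: eigenvalues 2 and -1
B = np.array([[0.0], [1.0]])
w, u0 = 3.0, 1.0                         # forcing at lambda = i*w

# regularizing initial condition x0 = Re[(i w I - A)^{-1} B u0]
z0 = np.linalg.solve(1j * w * np.eye(2) - A, (B * u0).ravel())
x0 = z0.real

# simulate x' = A x + B u0 cos(w t) from the regularizing initial condition
rhs = lambda t, x: A @ x + (B * u0 * np.cos(w * t)).ravel()
tf = 3.0
x_tf = solve_ivp(rhs, (0.0, tf), x0, rtol=1e-11, atol=1e-13).y[:, -1]

# pure sinusoid at the forcing frequency -- the unstable transient never appears
assert np.allclose(x_tf, (z0 * np.exp(1j * w * tf)).real, atol=1e-5)
```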

The other thing this calculation is good for is that it tells you something about the zeros of a linear system. $\lambda$ is a zero of transmission if $\exists u_0 \neq \theta \st \hat{H}(\lambda) u_0 = \theta$, i.e. if at this value $\hat{H}$ has a nontrivial right nullspace. Here is a way to think about what a zero of transmission is: it blocks transmission at that value $\lambda$. If you've chosen the initial condition properly, you'll see nothing at the output.

What you need to remember: distinct eigenvalues means distinct eigenvectors; you can decompose a matrix into its Jordan form; we can see which modes affect the output; and here is a definition of a zero of transmission: if there is a nontrivial choice of $u_0$ and frequency $\lambda$ where you can block transmission.

Lecture Notes 13

Repeated eigenvalue case. Invariant subspaces. If $\fn{A}{(V, \mathbb{C})}{(V, \mathbb{C})}$, $\mathcal{M} \subset V$ is said to be an invariant subspace if $x \in \mathcal{M} \implies Ax \in \mathcal{M}$. Aside: $\mathcal{M}$ is also invariant under any function of $A$. Examples of invariant subspaces: $\mathcal{N}(A)$, $\mathcal{R}(A)$, $\mathcal{N}(A - \lambda I)$, $m_1 \cap m_2$ (with invariant subspaces $m_1, m_2$), $m_1 + m_2$ ($+$ is the sum). The reason why $m_1 \cup m_2$ is not in general an invariant subspace is because it is not a subspace.

Jordan Form

October 9, 2012

We just started talking about invariant subspaces. Midterm will be up to and including lecture notes 12 (but we haven't gone through 11). Today I'd like to continue in this vein that we've been presenting for the past couple of lectures. What we've been doing is being motivated by our need to efficiently compute $e^{At}$, and we've been thinking more generally about how to compute functions of a square matrix $A$. What we've done in our presentation is divide that problem into two classes: case 1 (where $A$ is diagonalizable) -- that is, $A$ has $n$ linearly independent eigenvectors, each of which is an element of $\mathbb{C}^n$. Remember that if $A$ has $n$ distinct eigenvalues, it is given that $A$ has $n$ eigenvectors. However, we don't need $n$ distinct eigenvalues (consider degeneracy of quantum states).

What we proved last time was a theorem that in this case, the eigenvectors form a basis for $\mathbb{C}^n$, and we went ahead and constructed a similarity transform which transforms $A$ to a diagonal matrix. Consider $T^{-1}$, whose columns are the eigenvectors (since the columns form a basis, it is invertible, and $T$ exists). Then $\Lambda = TAT^{-1}$, where $\Lambda$ is a diagonal matrix containing the corresponding eigenvalues.

Consider a function $\fn{f}{\mathbb{C}^{n\times n}}{\mathbb{C}^{n\times n}}$. We talked about polynomial functions of a matrix (recall the Cayley-Hamilton theorem). Let us consider a more general case: $f$ is analytic, that is, locally given by a convergent power series. We have that $f(A) = f(T^{-1}\Lambda T) = T^{-1}f(\Lambda)T$. What does $f(\Lambda)$ mean? It is just the diagonal matrix with $f$ applied to each diagonal entry.
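As a quick numerical sketch of this identity (the $2\times 2$ matrix below is made up for illustration; numpy's `eig` returns the eigenvectors as the columns of its second output, which play the role of $T^{-1}$ here):

```python
# Sketch: for diagonalizable A, f(A) = T^{-1} f(Lambda) T with f = exp,
# checked against scipy.linalg.expm. Example matrix is invented.
import numpy as np
from scipy.linalg import expm

A = np.array([[0.0, 1.0],
              [-2.0, -3.0]])        # eigenvalues -1 and -2: diagonalizable

evals, Tinv = np.linalg.eig(A)      # columns of Tinv are the eigenvectors
T = np.linalg.inv(Tinv)

f_Lambda = np.diag(np.exp(evals))   # f applied to each diagonal entry
f_A = Tinv @ f_Lambda @ T           # f(A) = T^{-1} f(Lambda) T

assert np.allclose(f_A, expm(A))
```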

In general, $A$ is not diagonalizable, and so we have to think about what a canonical form (a simple form for a general $n \times n$ matrix) might be. We now are considering case 2, where $A$ is not diagonalizable, i.e. $A$ has strictly fewer than $n$ linearly independent eigenvectors. This can only happen if $A$ has repeated eigenvalues (note: repeated eigenvalues do not guarantee it!).

We will present the "generalization" of the diagonal form, which is called Jordan form. Instead of using $\Lambda$, we'll use $J$. Basically, it'll be an upper triangular matrix with the eigenvalues on the diagonal, and on the off-diagonal, we'll have either ones or zeros (depending on how many eigenvectors we have for a given eigenvalue). We call this a canonical form (note that the diagonal form is a special case of the Jordan form when there are n eigenvectors).

We're going to approach this today (by constructing a similarity transform). In order to do that, we need a little bit of background: we need to look in more detail at the eigenstructures of a matrix. We're going to do that through this concept of $\mathcal{A}$-invariant subspaces. Suppose you have a linear map $\fn{\mathcal{A}}{V}{V}$. $M$, a subspace of $V$ is said to be $\mathcal{A}$-invariant (or invariant under $\mathcal{A}$) if $\fn{\mathcal{A}}{M}{M}$.

Example: the nullspace of $\mathcal{A}$ is $\mathcal{A}$-invariant. Very easy to show that a subspace is $\mathcal{A}$-invariant, typically.

Now consider $N(A - \lambda I)$, where $\lambda$ is an eigenvalue of $A$. We know that this space contains the eigenvectors of $A$ that correspond to $\lambda$; we can then show that this is also an invariant subspace. A few more examples: if you have a polynomial function of a matrix $A$, then the nullspace of this polynomial function is $A$-invariant. Further, given two $A$-invariant subspaces, the intersection of these subspaces is also $A$-invariant. Further, if you define sums of subspaces in the usual manner, the sum of two $A$-invariant subspaces is also $A$-invariant.

Direct sum of subspaces

Again we have a vector space $X$, and we'll have $k$ subspaces $M_1, M_2, ..., M_k$ of $X$. $V$ is said to be a direct sum (i.e. $V = \Oplus M_i$) if $\forall v \in V, \exists! m_i \in M_i \st v = \sum m_i$. In some sense, this is a generalization of linear independence.

A direct result of the uniqueness constraint of the definition of the direct sum is that the intersection of any two of these subspaces contains only the zero vector.

We need two more things before we consider the Jordan form of a matrix.

One way we can relate $A$-invariance and direct sum is going back to the beginning of the class: if you look at $A \in \mathbb{C}^{n \times n}$ with $n$ distinct eigenvalues (or even just $n$ linearly independent eigenvectors), then $\mathbb{C}^n = \Oplus_i N(A-\lambda_i I)$ (we proved last time that in this case, the eigenvectors form a basis for the space). This is a direct sum decomposition of $\mathbb{C}^n$.

So now let's go on to something quite important, which we generally call the Second Representation Theorem. First, let's consider what representation means: by representation we mean the matrix representation of a linear map between finite-dimensional vector spaces. Aside: the first representation theorem was that we could construct a matrix representation for any linear map between finite-dimensional vector spaces.

Now we consider the case where $\fn{\mathcal{A}}{V}{V}$. Consider $V = M_1 \oplus M_2$, where $V$ has dimension $n$, $M_1$ has dimension $k$, and $M_2$ has dimension $n-k$. If $M_1$ is $A$-invariant, we can say more about what this matrix $A$ can look like: $\mathcal{A}$ can be represented as a block matrix $A$ in which the lower left block is the zero matrix. The upper left block is $k$-dimensional, and the remaining block sizes can be inferred from this.

Easy to prove: can use first representation theorem. We'd like to prove that there exists a basis for $V$ such that $A$ has the stated form. Begin by constructing a basis for $M_1, M_2$. Let $\{b_1, ..., b_k\}$ be a basis for $M_1$ and $\{b_{k+1}, ..., b_n\}$ be a basis for $M_2$. The image of $A$ on the first $k$ basis vectors can be written as linear combinations of the first $k$ basis vectors (i.e. you don't need any components involving $b_{k+1}, ..., b_n$), and by uniqueness (guaranteed by this being a direct sum decomposition), any representation with respect to this basis will produce the lower-left block of zeros (by the first representation theorem).

If $M_1, M_2$ are both $A$-invariant, then the top-right block can also be zero (by symmetry); in fact, the basis we chose earlier will yield this result. The proof is identical.
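A small numerical sanity check of this block structure (all matrices below are invented for illustration; we build $A$ so that $M_1$, the span of the first $k$ basis vectors, is $A$-invariant by construction):

```python
# Sketch of the second representation theorem: representing A with respect
# to a basis adapted to an A-invariant subspace M1 yields a zero lower-left
# block. We construct A by conjugating a block upper-triangular matrix by a
# random basis B; M1 = span of the first k columns of B is then A-invariant.
import numpy as np

rng = np.random.default_rng(0)
n, k = 4, 2

upper = rng.standard_normal((n, n))
upper[k:, :k] = 0.0                   # block upper-triangular in the standard basis
B = rng.standard_normal((n, n))       # columns b_1..b_n: the adapted basis
A = B @ upper @ np.linalg.inv(B)

# Representation of A with respect to the basis {b_1, ..., b_n}:
rep = np.linalg.inv(B) @ A @ B
assert np.allclose(rep[k:, :k], 0.0)  # lower-left block is zero
```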

Example:

• Suppose $A$ has only one eigenvector, with eigenvalue $\lambda_1$. Then $\mathbb{C}^n = N((A - \lambda_1 I)^n)$. Let us just understand this example as we leave today's class: $A$ is no longer diagonalizable. Going back to our second representation theorem and applying it to this case, we'll derive the Jordan form (not on the midterm).

• Suppose $A$ has $n$ eigenvectors. $\mathbb{C}^n = \Oplus_i N(A - \lambda_i I)$. By the second representation theorem, we are guaranteed that this matrix is diagonalizable.

Jordan Form

October 11, 2012

We're going to finish up talking about the case where we don't have $n$ eigenvectors. Recall that if we have $n$ distinct eigenvalues, we are guaranteed $n$ eigenvectors, but this is not a necessary condition.

We'll then start an example, the linear quadratic regulator.

Recall: we defined the characteristic polynomial $\hat{\chi}_A(s)$ as $\det(sI-A)$, which we rewrote as $\hat{\chi}_A(s) = \prod_i (s - \lambda_i)^{d_i}$. We're using $d_i$ to represent the (algebraic) multiplicity of eigenvalue $\lambda_i$. The Cayley-Hamilton theorem tells us that the characteristic polynomial evaluated on the matrix $A$ is the $n$-by-$n$ zero matrix.
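The Cayley-Hamilton statement is easy to check numerically (the example matrix is made up; `np.poly` applied to a square matrix returns the coefficients of $\det(sI - A)$):

```python
# Numerical check of Cayley-Hamilton: evaluating the characteristic
# polynomial chi_A at the matrix A itself gives the zero matrix.
import numpy as np

A = np.array([[1.0, 2.0, 0.0],
              [0.0, 1.0, 3.0],
              [4.0, 0.0, 2.0]])

coeffs = np.poly(A)                 # coefficients of det(sI - A), leading 1
chi_A = sum(c * np.linalg.matrix_power(A, deg)
            for deg, c in zip(range(len(coeffs) - 1, -1, -1), coeffs))

assert np.allclose(chi_A, np.zeros((3, 3)))
```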

Now we're going to define something new. We've been talking about the characteristic polynomial, and we know its properties. The roots of the characteristic polynomial are the eigenvalues of the matrix, and by Cayley-Hamilton, every matrix satisfies its characteristic polynomial.

Let's now define something called the minimal polynomial of $A$. The minimal polynomial $\hat{\psi}_A(s)$ is the monic polynomial of least degree such that $\hat{\psi}_A(A) = \theta_{n \times n}$. The question is whether there is a polynomial of degree less than $n$ such that this identity is satisfied. It is the structure of the characteristic polynomial and that of the minimal polynomial that allows us to explore the Jordan form.

Just by definition, we have the result that $\hat{\psi}_A(s)$ divides $\hat{\chi}_A(s)$. Proof: dividing, $\hat{\chi}_A = \hat{q}\hat{\psi}_A + \hat{r}$ with $\hat{r}$ of lesser degree than $\hat{\psi}_A$; evaluating at $A$ and using Cayley-Hamilton, $\hat{r}(A) = \theta$. A nonzero $\hat{r}$ would violate the minimality of $\hat{\psi}_A$, so $\hat{r}$ is identically the zero polynomial.

Thus, we can write $\hat{\psi}_A(s) = \prod_i (s - \lambda_i)^{m_i}$, $m_i \le d_i$.

It turns out that the exponents $m_i$ in the minimal polynomial are determined by the largest Jordan block associated with $\lambda_i$. (We define a Jordan block to be a block of the Jordan form that contains only ones as the super-diagonal elements.) In general, given $A$, how do we find its minimal polynomial?
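One way to see the connection concretely (on a constructed example already in Jordan form, so the answer is known in advance): for a single eigenvalue $\lambda$ whose largest Jordan block has size $m$, the minimal polynomial is $(s - \lambda)^m$, i.e. $(A - \lambda I)^m = \theta$ but $(A - \lambda I)^{m-1} \neq \theta$.

```python
# Constructed example: Jordan blocks of sizes 3 and 2 for eigenvalue lam,
# so d = 5 and the minimal-polynomial exponent is m = 3 (largest block).
import numpy as np

lam = 2.0

def jordan_block(lam, size):
    return lam * np.eye(size) + np.eye(size, k=1)   # ones on the super-diagonal

J = np.block([[jordan_block(lam, 3), np.zeros((3, 2))],
              [np.zeros((2, 3)),     jordan_block(lam, 2)]])

N = J - lam * np.eye(5)
assert not np.allclose(np.linalg.matrix_power(N, 2), 0.0)  # (s-lam)^2 is too small
assert np.allclose(np.linalg.matrix_power(N, 3), 0.0)      # minimal exponent m = 3
```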

(thoughts: consider each Jordan block of $A - \lambda_i$ to be analogous to a linear feedback shift register, with no new inputs; as such, it shows that the largest Jordan block has nilpotency corresponding to the number of registers, and all other Jordan blocks for the eigenvalue reach 0 earlier. Alternately, repeat the above while considering raising and lowering operators)

Theorem: $\mathbb{C}^n = \Oplus_i N(A - \lambda_i I)^{m_i}$, where $N(A - \lambda_i I)^{m_i}$ is of dimension $d_i$.

Recall: if there were $d_i$ eigenvectors associated to $\lambda_i$ for all $i$, then the matrix would be diagonalizable. Generally, the things you add to this form are called generalized eigenvectors.

Proof: We have the minimal polynomial $\hat{\psi}_A(s)$. By partial fractions, $\frac{1}{\hat{\psi}_A(s)} = \frac{1}{\prod_i (s - \lambda_i)^{m_i}} = \sum_i \frac{\hat{n}_i(s)}{(s-\lambda_i)^{m_i}}$. Now, multiplying both sides by the minimal polynomial, we end up with $1 = \sum_i \hat{n}_i(s)\hat{p}_i(s)$, as in the notes. Evaluating this polynomial at the matrix $A$, we get $I = \sum_i \hat{n}_i(A)\hat{p}_i(A)$. Multiplying by an arbitrary $x \in \mathbb{C}^n$, $x = \sum_i \hat{n}_i(A)\hat{p}_i(A)x$, which we've written as the sum of $\sigma$ terms.

In general, $x_i = \hat{n}_i(A)\hat{p}_i(A)x$. Since $\hat{p}_i(s) = \frac{\hat{\psi}_A(s)}{(s - \lambda_i)^{m_i}}$, we have $(A - \lambda_i I)^{m_i}x_i = \hat{n}_i(A)\hat{\psi}_A(A)x = \theta$, i.e. $x_i \in N(A - \lambda_i I)^{m_i}$.

In order to prove the theorem, we must next show that this decomposition is unique (meaning that for any vector $x$, there is not another way you can break it up into $\sum x_i$; see notes), and that the nullity of $(A - \lambda_i I)^{m_i} = d_i$, which is the multiplicity of $(s - \lambda_i)$ in the characteristic polynomial.

As a concrete but slightly abstract example, we're going to look at the geometric structure of eigenspaces.

Geometric Structure of Eigenspaces

Consider $A$ with just one eigenvalue, whose characteristic polynomial is given as defined above. In this case, $d_1 = n, 1 \le m_1 \le d_1$.

Suppose $n=6, m=3$. One question is whether you can uniquely determine a Jordan form (no; we'll see this in a second). What we're saying from the theorem we just proved is that we can decompose $\mathbb{C}^6 = N(A - \lambda I)^3$. We don't have enough information yet to determine what the Jordan form looks like, and how the characteristic and minimal polynomials lead to the Jordan form.

Suppose, in addition, that $N(A - \lambda I) = \mathrm{span}\{e_1, e_2, e_3\}$ (where $e_i$ is an eigenvector of $A$ associated to the eigenvalue $\lambda$). Further, suppose that $N(A - \lambda I)^2 = N(A - \lambda I) \oplus \mathrm{span}\{v_1\} \oplus \mathrm{span}\{v_2\}$, where $v_1, v_2$ are linearly independent vectors mapped by $(A - \lambda I)$ into $N(A - \lambda I)$.

Then, suppose without loss of generality that $(A - \lambda I) v_1 = e_1$, and $(A - \lambda I) v_2 = e_2$.

Finally, suppose $N(A - \lambda I)^3 = N(A - \lambda I)^2 \oplus \mathrm{span}\{w_1\}$, where $(A - \lambda I)w_1 = v_1$ (without loss of generality, again).

In this case (and in general), $e_1$ is an eigenvector associated to $\lambda$, $v_i$ is a generalized eigenvector (of degree 1) associated to $\lambda, e_1$, and $w_1$ is a generalized eigenvector (of degree 2) associated to $\lambda, e_1, v_1$.

$e_2$ is also an eigenvector associated to $\lambda$, and $v_2$ is also a generalized eigenvector (of degree 1) associated to $\lambda, e_2$.

Finally, $e_3$ is an eigenvector associated to $\lambda$.

As such, through these definitions, we can construct a similarity transformation that gives us a canonical form for the matrix $A$, the Jordan form.

Constructing the Jordan Form

$$Ae_1 = \lambda e_1 \\ Ae_2 = \lambda e_2 \\ Ae_3 = \lambda e_3 \\ Av_1 = \lambda v_1 + e_1 \\ Av_2 = \lambda v_2 + e_2 \\ Aw_1 = \lambda w_1 + v_1$$

I know I can always construct these (in a linearly independent fashion). Now, consider a similarity transform $J = TAT^{-1}$, where $T^{-1} = \begin{pmatrix} e_1 & v_1 & w_1 & e_2 & v_2 & e_3 \end{pmatrix}$. I'm putting my eigenvectors with their generalized eigenvectors. $J$ is the Jordan form of $A$, which is effectively a diagonal matrix with some ones on the super-diagonal. The blocks might have been ordered differently, so the Jordan form as written is not unique; however, it is unique up to permutations of the Jordan blocks.
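A small numerical sketch of this construction (the eigenvalue and basis below are arbitrary; we build $A$ from a chosen $J$ and a random $T^{-1}$, so the chain relations above hold by construction and can be checked directly):

```python
# Take J with Jordan blocks of sizes 3, 2, 1 for a single eigenvalue lam,
# pick a random invertible T^{-1} whose columns play the roles
# (e1, v1, w1, e2, v2, e3), and form A = T^{-1} J T. Then verify the chain.
import numpy as np

lam = -1.0
J = np.diag([lam] * 6) + np.diag([1, 1, 0, 1, 0], k=1)   # blocks: 3, 2, 1

rng = np.random.default_rng(1)
Tinv = rng.standard_normal((6, 6))    # columns: e1, v1, w1, e2, v2, e3
T = np.linalg.inv(Tinv)
A = Tinv @ J @ T                      # so that J = T A T^{-1}

e1, v1, w1, e2, v2, e3 = Tinv.T       # unpack the columns
assert np.allclose(A @ e1, lam * e1)
assert np.allclose(A @ v1, lam * v1 + e1)
assert np.allclose(A @ w1, lam * w1 + v1)
assert np.allclose(A @ e2, lam * e2)
assert np.allclose(A @ v2, lam * v2 + e2)
assert np.allclose(A @ e3, lam * e3)
```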

Note: we can't have multiple generalized eigenvectors that correspond to the same (generalized) eigenvector, or else they are no longer linearly independent with the generalized eigenvectors established thus far; we require this chain structure.

What we've done is define a construct called the generalized eigenvector. We'll start next Thursday's class with computing $e^{Jt}$.

Functions of Jordan Form, Introduction to Optimal Control

October 18, 2012

Alternative derivation of linear quadratic regulator; intro to optimal control via LQR (Dr. Jerry Ding).

Today's a bit of a different lecture: finish up the last note we have in this functions of a matrix. If we want to compute an arbitrary analytic function of a matrix (which we showed we could generally transform to its Jordan canonical form), how do we do that? We'll then go on to optimal control.

For the first five or ten minutes, let's talk about this Jordan form and where we ended last day.

Functions of a matrix ($n\times n$) in Jordan Form

From what we derived last day, we can write down a minimal polynomial, and recall that the exponent corresponds to the largest Jordan block for that polynomial. This Jordan form could have come (generally) from a matrix $A$, and we could have computed $J$ through the similarity transform we discussed last day. We recall that the columns of $T^{-1}$ are the eigenvectors and generalized eigenvectors. The number of Jordan blocks tells you the number of eigenvectors of the matrix.

$f(A) = f(T^{-1} J T) = T^{-1} f(J) T$. The claim is that if you have an analytic function $f$ (i.e. one locally given by a convergent power series) and the Jordan form, then computing $f(J)$ (which we can do by applying $f$ to each of the Jordan blocks) lets us compute $f(A)$.

So we look at each Jordan block, and when we compute $f$ of each Jordan block, the diagonal elements are just the function applied to the eigenvalue; the blocks become upper-triangular matrices, where the entries to the right of the diagonal are derivatives of $f$ (scaled by factorials), evaluated at the eigenvalue.

If in general you were looking at the $k$-th Jordan block $J_k$ (with eigenvalue $\lambda_k$), the claim is as follows: the elements are of the following structure:

$$(f(J_k))_{ij} = \begin{cases} \dfrac{f^{(j-i)}(\lambda_k)}{(j-i)!} & j \ge i \\ 0 & j < i \end{cases}$$
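For $f = \exp$, this entry structure is easy to verify against a direct matrix exponential (single Jordan block with an arbitrarily chosen eigenvalue; every derivative of $\exp$ at $\lambda$ is $e^\lambda$, so the entries are $e^\lambda/(j-i)!$):

```python
# Check the block formula for f = exp on a single 4x4 Jordan block:
# entry (i, j) is f^{(j-i)}(lam)/(j-i)! for j >= i, and 0 below the diagonal.
import numpy as np
from math import factorial
from scipy.linalg import expm

lam, n = 0.5, 4
Jk = lam * np.eye(n) + np.eye(n, k=1)       # one Jordan block

fJ = np.zeros((n, n))
for i in range(n):
    for j in range(i, n):
        fJ[i, j] = np.exp(lam) / factorial(j - i)   # exp^{(j-i)}(lam)/(j-i)!

assert np.allclose(fJ, expm(Jk))
```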

In the notes, we prove it -- this is a very interesting proof, but it's very involved and doesn't add much to our discussion. I'll refer you to the notes; we use the minimal polynomial structure and what's called the method of interpolating polynomials.

This claim gives us an important theorem, which we can pull off from this result, the spectral mapping theorem, which tells us something about the eigenvalues of a function of a matrix. Recall we used $\sigma$ to represent the spectrum (set of eigenvalues). This theorem says that $\sigma(f(J)) = f(\sigma(J))$. Specifically we see this from the Jordan form, and since we know that we can always construct this matrix, we can consider that this applies to all matrices.

This is probably one of our most powerful tools: we can look at the eigenvalues of the matrix, and by the spectral mapping theorem, we can consider the effect of the function of the eigenvalues to figure something out about the matrix.

Another thing: 10/18 is apparently California's ShakeOut day, when we practice what we're supposed to do in the event of an earthquake.

You have some linear time-varying system $\dot{x}(t) = A(t)x(t) + B(t)u(t)$ with initial condition $x(t_0) = x_0$. Consider our state to be in $\Re^n$, $u \in \Re^{n_i}$, $t \in [t_0, t_1]$. Our objective function in this case will be the integral of a penalty on control plus a penalty on state plus a terminal cost, i.e. $J(u) = \frac{1}{2}\bracks{\int_{t_0}^{t_1} (\mag{u(t)}^2_2 + \mag{C(t)x(t)}^2_2) dt + x^T(t_1)Sx(t_1)}$. In some sense, we want to penalize the deviation of the state from zero.

Our optimization space $U$ here is a set of piecewise-continuous functions. Obviously not differentiable at points of discontinuity, so it satisfies the differential equation almost everywhere.

The quadratic optimization problem is this: compute the optimal cost of our objective function, subject to LTV, i.e. $J^* = \min_{u \in U} J(u)$.

What are the applications? e.g. stabilization and minimum energy control, set point tracking. More interestingly, you can do trajectory tracking (one slight problem is you also have some dynamics in the trajectory that you want to track; this changes the structure of the controller a little bit, but all the ideas are the same).

Finally, an interesting problem is LQG (linear quadratic Gaussian) control. Suppose you have a plant, which is linear, plus some Gaussian noise. With some input, we get some output (and in general you don't get to observe the full state vector). Estimate the system's state based on measurements, and compute the control from the estimate. With the LQG problem, you're minimizing the expected cost. We have a Kalman filter fed directly into an LQR. This is one of the famous instances of the equivalence principle, i.e. you can do the estimation and the control completely separately. This is one of the few cases in stochastic control where you actually have this equivalence principle.

There are typically two types of solutions: either go by dynamic programming or optimality condition. So what does optimality condition mean?

Suppose we have a continuously differentiable scalar function on Euclidean space, and we want to solve $\min_{x \in \Re^n} \phi(x)$. If $x$ is optimal, then the gradient of the cost function is 0 (i.e. $\nabla \phi(x) = 0$). Your optimal point must be one of your critical points; you don't know which one you're working with. In a lot of numerical optimization problems, what you're trying to do is follow the gradient; you're trying to proceed in the direction of decreasing cost. However, you can get stuck in a local minimum.

Equivalently, a similar condition holds in terms of directional derivatives: $\nabla_v\phi(x) = 0 \ \forall v \in \Re^n$ (what we're doing is taking an $\epsilon$ deviation along this direction and figuring out what the perturbation is). Ideally, you'd like to find a similar derivative for the cost function, but first you need to make sense of a perturbation in $u$. The nice thing about $U$ is that it's a vector space. We can now define our perturbation as follows: $u_\epsilon = u + \epsilon\delta u, \delta u \in U$.

Next question: how does the state trajectory change as a result of the control perturbation? The trajectory of the system with this input is $x_\epsilon(t) = \Phi(t,t_0)x_0 + \int_{t_0}^t \Phi(t, \tau) B(\tau)(u(\tau) + \epsilon\delta u(\tau))d\tau = x(t) + \epsilon \int_{t_0}^{t} \Phi(t,\tau)B(\tau) \delta u(\tau) d\tau$; we can call this second integral $\delta x(t)$, which we then refer to as our perturbation of the state trajectory. Basically, $\delta x$ is the zero-state response to $\delta u$.

$\delta \dot{x} = A(t) \delta x(t) + B(t) \delta u(t); \delta x(0) = 0$.

$$J(u + \epsilon\delta u) = \frac{1}{2}\bracks{\int_{t_0}^{t_1}\parens{\mag{u + \epsilon \delta u}^2_2 + \mag{C(x + \epsilon \delta x)}^2_2} dt + (x(t_1) + \epsilon \delta x(t_1))^T S (x(t_1) + \epsilon \delta x(t_1))} = J(u) + \epsilon F(u, \delta u) + \epsilon^2 Q(\delta u)$$

What you see is that if you introduce an $\epsilon$ perturbation in your input, you get an $\epsilon$ perturbation in your output. As such, the directional derivative is well-defined, and can be found explicitly as $F(u, \delta u)$. This is the first variation of $J$. This first variation we can write down in an analogous manner to the static analysis we talked about.

$u$ is optimal $\iff \nabla_{\delta u}J(u) = 0, \forall \delta u \in U$ (this is two-way because it is a convex optimization problem).

Two parts: first assume $u$ is optimal. This implies that $\forall \delta u, \epsilon$, $J(u + \epsilon\delta u) \ge J(u)$, which means that $J(u + \epsilon \delta u) - J(u) = \epsilon(\nabla_{\delta u}J(u) + \epsilon Q(\delta u)) \ge 0$. Case 1: $\epsilon > 0$, so $\nabla_{\delta u}J(u) + \epsilon Q(\delta u) \ge 0$; letting $\epsilon \downarrow 0$, $\nabla_{\delta u} J(u) \ge 0$. In case 2, $\epsilon < 0 \implies \nabla_{\delta u} J(u) + \epsilon Q(\delta u) \le 0 \implies \nabla_{\delta u} J(u) \le 0$. Together, $\nabla_{\delta u} J(u) = 0$.

For the other direction, assume $\nabla_{\delta u} J(u) = 0, \forall \delta u$. $J(u + \epsilon \delta u) = J(u) + \epsilon \nabla_{\delta u}J(u) + \epsilon^2 Q(\delta u) \ge J(u)$ ($\forall \delta u, \epsilon$), so $u$ is optimal.

This is where we deviate from the approach in Callier and Desoer. Because of this result, what we're really interested in is this first variation. Now I'll write down more explicitly what that cross-term is. $\nabla_{\delta u} J(u) = \int_{t_0}^{t_1} \bracks{u(t)^T \delta u(t) + x(t)^T C(t)^TC(t)\delta x(t)} dt + x(t_1)^T S\delta x(t_1)$.

We'll begin by introducing a quantity that we call the control Hamiltonian: $H(x, p, u, t) = p^T(A(t)x + B(t)u) + \frac{1}{2}(u^Tu + x^TC(t)^T C(t)x)$. $p$ is basically your costate. The thing to note here is that your original system equation comes from $\dot{x} = \nabla_p H(x, p, u, t)$. The costate evolves according to the negative gradient with respect to $x$, i.e. $\dot{p} = -\nabla _x H(x, p, u, t)$. This comes from classical mechanics and conservation of energy. It's sort of intuitive as an optimal control problem that there's a coupling relationship between your dynamics and your cost, and you're trying to descend to the state of lowest cost (minimizing the action). Instead of solving the dynamic optimization problem, all we're going to minimize is our Hamiltonian.

I'll try to give an informal derivation for now. Here, $\dot{p}(t) = -A^T(t) p(t) - C(t)^TC(t) x(t)$; $p(t_1) = \nabla_x \parens{\frac{1}{2} x^T(t_1) S x(t_1)} = Sx(t_1)$. What we do now is take a time derivative of $p(t)^T \delta x(t)$. When you evaluate this derivative, something magical sort of happens: you get $\dot{p}(t)^T \delta x(t) + p(t)^T\delta \dot{x}(t)$. What you can show from this is if you just plug in the dynamics, $\deriv{}{t}\parens{p(t)^T \delta x(t)} = -x(t)^T C(t)^T C(t) \delta x(t) + p(t)^T B(t) \delta u(t)$. If you integrate both sides of this expression from $t_0$ to $t_1$, you get $p(t_1)^T \delta x(t_1) - p(t_0)^T \delta x(t_0) = \int_{t_0}^{t_1} \parens{p(t)^T B(t) \delta u(t) - x(t)^T C(t)^T C(t) \delta x(t)} dt$.

Rearranging terms (using $\delta x(t_0) = 0$ and $p(t_1) = Sx(t_1)$), this basically becomes $x(t_1)^T S \delta x(t_1) + \int_{t_0}^{t_1} x(t)^T C(t)^T C(t) \delta x(t) dt = \int_{t_0}^{t_1} p(t)^T B(t) \delta u(t) dt$. Essentially, we can rewrite the first variation as $\nabla_{\delta u} J(u) = \int_{t_0}^{t_1} \parens{u(t)^T + p(t)^T B(t)} \delta u(t) dt$. If $u$ is optimal, then (taking $\delta u = u + B^Tp$) $\int_{t_0}^{t_1} \mag{u(t)^T + p(t)^T B(t)}^2_2 dt = 0$, namely $\nabla_u H(x, p, u, t) = 0$. This informal proof doesn't tell you what the critical points are.

If you go through a more formal proof, we'd apply Pontryagin's minimum principle: if $u$ is optimal, let $(x,p)$ be corresponding state and costate trajectories. Along this trajectory, $H(x(t), p(t), u(t), t) = \min_{u \in \Re^{n_i}} H(x(t), p(t), u,t)$. Basically, you get a feedback function in terms of $x$. However, this is only a necessary condition, so you only have a set of possible solutions.

Applying the minimum principle, the unique solution to this problem $\min_u H(x, p, u, t)$ is $u^* = -B^T p$. Basically, what this says is by the minimum principle, if $u$ is optimal, then it must have this form. If you knew nothing about optimal control, and you just wrote down the Hamiltonian, this gives you the form, if it exists at all.

You then have to go back and see which value works, and then going back to the other problem (which is both necessary and sufficient, since this is a convex optimization problem), then you can verify the solution.

This is basically the main result for this linear quadratic optimization problem. Theorem 2.1.157 from Callier and Desoer:

The optimal solution to (LQ) is $u(t) = -B(t)^Tp(t)$, for almost every $t \in [t_0, t_1]$, where $(x,p)$ is the solution of $\dot{x} = A(t)x(t) - B(t)B(t)^Tp(t), x(t_0) = x_0$ and $\dot{p} = -C(t)^TC(t)x(t) - A(t)^T p(t), p(t_1) = Sx(t_1)$. So this is basically the structure of the controller.

The rest of the proof tries to find a more tractable form that you can compute. Because in general this two-point boundary problem is hard to solve, you want to turn this into an initial value or terminal condition problem. You can represent your co-state as a linear function of your state $p(t) = P(t)x(t)$; what you get out is a linear state feedback law: $u(t) = -B(t)^TP(t) x(t)$. This matrix then evolves as follows: $-\dot{P}(t) = A(t)^T P(t) + P(t) A(t) - P(t)B(t)B(t)^T P(t) + C(t)^T C(t), P(t_1) = S$.
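A sketch of how one might integrate this Riccati equation numerically (the double-integrator system and weights below are made up for illustration; `solve_ivp` integrates backwards from $t_1$ when handed a decreasing time span):

```python
# Integrate -P' = A^T P + P A - P B B^T P + C^T C backwards from P(t1) = S,
# then read off the state feedback gain in u(t) = -B^T P(t) x(t).
import numpy as np
from scipy.integrate import solve_ivp

A = np.array([[0.0, 1.0], [0.0, 0.0]])   # double integrator (made up)
B = np.array([[0.0], [1.0]])
C = np.eye(2)
S = np.eye(2)                            # terminal cost x(t1)^T S x(t1)
t0, t1 = 0.0, 5.0

def riccati_rhs(t, p_flat):
    P = p_flat.reshape(2, 2)
    dP = -(A.T @ P + P @ A - P @ B @ B.T @ P + C.T @ C)
    return dP.ravel()

# Decreasing time span (t1, t0) makes solve_ivp integrate backwards in time.
sol = solve_ivp(riccati_rhs, (t1, t0), S.ravel(),
                dense_output=True, rtol=1e-8, atol=1e-10)

P0 = sol.sol(t0).reshape(2, 2)
K0 = B.T @ P0                            # u(t0) = -K0 x(t0)
assert np.allclose(P0, P0.T, atol=1e-6)  # P(t) stays symmetric
assert np.all(np.linalg.eigvalsh(P0) > 0)
```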

The thing to note is that this equation is derived from the theorem.

October 23, 2012

How old is control theory? As old as Troy. The legend goes that Aeneas, when in Carthage, met Queen Dido, of whom the following is told: the local chieftains of Carthage gave her a piece of oxhide and told her she could be queen of whatever she could cover with the oxhide. Legend says that she cut the oxhide into a long thin strip, and she solved the corresponding isoperimetric optimization problem.

Mechanics is a special case of optimal control. Part of what we're not trying to do is drag you through thousands of years of thinking, but rather to encapsulate so many years of human thought into two lectures.

Jerry gave you one version of this; here is a different way to get to where optimal control is going.

Start by writing down a linear system: $\dot{x} = A(t)x + B(t)u, y(t) = C(t)x + D(t) u$. Associated with this linear system, one defines the dual system (whose state is also known as the costate) -- last time we represented this with $p$. Following from the notes, $-\dot{\tilde{x}} = A^*\tilde{x} + C^*(t) \tilde{u}, \tilde{y} = B^*\tilde{x} + D^*\tilde{u}$. Recall that these are linear maps between spaces of continuous functions. This is the adjoint (or dual) system, which plays a large role in optimal control.

In circuit theory or mechanical systems, you might find the dual of a circuit by replacing capacitors with inductors, series with parallel, etc, or you might see mass-damper spring systems. And so that's the relationship between them.

A pretty interesting equality that relates them is the so-called pairing lemma, which is a way to associate the initial conditions and final state. The pairing lemma states that $\braket{\tilde{x}(t)}{x(t)} + \int_{t_0}^t \braket{\tilde{u}(\tau)}{y(\tau)}d\tau = \braket {\tilde{x}(t_0)}{x(t_0)} + \int_{t_0}^t \braket{\tilde{y}(\tau)} {u(\tau)} d\tau$. The proof is not very involved: start with $\braket{\dot{\tilde{x}} + A^*\tilde{x} + C^*\tilde{u}}{x} + \braket{-y + B^*\tilde{x} + D^*\tilde{u}}{u} = 0$; work through algebra. Think of this as a little algebraic fact; it has some meaning in terms of the language of the adjoint, but let's not get too sidetracked yet.

Using the pairing lemma, there's a more algebraic proof, compared to the one on Thursday. In these notes, we set up the optimal control problem.

The optimal control problem that you give yourself is stated for the given system: minimize, over the choice of $u$, the cost function as defined last time: $J(u) = \frac{1}{2}\bracks{\int_{t_0}^{t_1} (\mag{u(t)}^2_2 + \mag{C(t)x(t)}^2_2) dt + x^T(t_1)Sx(t_1)}$ (note that $S \ge 0$: it's a Hermitian positive semi-definite matrix). Note: there's a cost on the final state, a cost associated with regulation of the state, and in fact, this whole thing doesn't need you to define an output $y$. Optimal control says: given some dynamics, minimize this.

The reason you talk about Newton and Calculus of variations is that Newton considered this. Newton's laws were a consequence of the particles minimizing the total action ($(\mag{u(t)}^2_2 + \mag{C(t)x(t)}^2_2)$ is the Lagrangian).

The way you do this is via calculus of variations: $J(u + \epsilon\delta u) = J(u) + \epsilon\delta J(\delta u) + O(\epsilon^2)$. If $u$ were the best choice, then $\forall \delta u$ (in some class), $\delta J(\delta u) = 0$.

Newton invented calculus to solve these calculus of variations problems because he wanted to describe how things moved.

So what you get from this is that $\delta u$ causes a $\delta x$, and as last time, you get $\delta \dot{x} = A\delta x + B \delta u, \delta x(t_0) = 0$. From this (when you do the first variation), we saw that you get $\delta J = \int_{t_0}^{t_1} \braket{u}{\delta u} dt + \int_{t_0}^{t_1}\braket{C^*Cx} {\delta x}dt + \braket{Sx(t_1)}{\delta x(t_1)}$. Setting this to zero is a necessary condition for optimality, but it isn't so explicit. The reason to use the pairing lemma is to simplify the differential equation. I'm going to associate with the system $\delta\dot{x} = A\delta x + B\delta u$ a dummy output, i.e. $y = \delta x$. The dual of this system is simply $-\dot{\tilde{x}} = A^*\tilde{x} + \tilde{u}, \tilde{y} = B^*\tilde{x}$.

Using the pairing lemma and working through a bit of algebra, we get that $\int_{t_0}^{t_1} \bracks{\braket{u}{\delta u} + \braket{B^* \tilde{x}} {\delta u}}dt = 0$, so $\int_{t_0}^{t_1} \braket{u + B^*\tilde{x}}{\delta u}dt = 0$.

(aside: Kalman filtering is the dual of optimal control)

The Hamiltonian system: $\begin{bmatrix}\dot{x} \\ \dot{\tilde{x}} \end{bmatrix} = \begin{bmatrix}A & -BB^* \\ -C^*C & -A^* \end{bmatrix} \begin{bmatrix}x \\ \tilde{x} \end{bmatrix}$, $x(t_0) = x_0, \tilde{x}(t_1) = Sx(t_1)$. In mathematics, we call this a two-point boundary equation. Not fun to solve, generally.

However, the theorem (of Linear Quadratic Optimal Control) is as follows: $u = -B^*(t) P(t) x(t)$; we have $-\dot{P} = A^*P + PA - PBB^*P + C^*C$, $P(t_1) = S$. This has a very famous name associated with it: Riccati.

Riccati did the following: given $\dot{x} = ax + by$ and $\dot{y} = cx + dy$, a pair of coupled linear equations, if you define the new state variable $z = y/x$, what differential equation does $z$ satisfy? A quadratic one: $\dot{z} = c + (d - a)z - bz^2$. Further, the reason that the original equation looks quadratic is because $P(t) = \tilde{X} X^{-1}$, where the matrices satisfy the Hamiltonian system we had before. Amazingly, the answer to this problem is simply a state feedback law, $u = F(t)x(t)$, where $F(t) = -B^*(t)P(t)$.

Consider the matrix differential equation $\deriv{}{t} \begin{bmatrix}X \\ \tilde{X} \end{bmatrix} = \begin{bmatrix}A & -BB^* \\ -C^*C & -A^* \end{bmatrix} \begin{bmatrix}X \\ \tilde{X} \end{bmatrix}$ with boundary conditions $X(t_1) = I$, $\tilde{X}(t_1) = S$.

What we want to show is that $u(t) = -B^*\tilde{X}X^{-1}x(t)$. One part of this proof is to show that $X(t)$ is invertible (page 36 of C&D, so we'll skip this).

Begin by choosing $k$ such that $X(t_0)k = x_0$, then define $x(t) = X(t) k, \tilde{x}(t) = \tilde{X}(t) k$. By construction, $x, \tilde{x}$ satisfy the Hamiltonian system, and by the boundary condition $X(t_1) = I$, we get $x(t_1) = k$, so $\tilde{x}(t_1) = Sx(t_1)$. This shows that $u(t) = -B^* \tilde{X} X^{-1} x(t)$, which is our linear feedback law; it remains to show that $\tilde{X} X^{-1} = P$, which is more algebra.

Note: we can find the derivative of $\deriv{}{t}X^{-1}$ by differentiating $X^{-1} X = I$.
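Carrying that remark out explicitly (a one-line derivation, filling in the step):

```latex
\frac{d}{dt}\bigl(X^{-1}X\bigr)
  = \deriv{}{t}\bigl(X^{-1}\bigr)\,X + X^{-1}\dot{X} = \deriv{}{t}I = 0
\quad\Longrightarrow\quad
\deriv{}{t}\bigl(X^{-1}\bigr) = -X^{-1}\,\dot{X}\,X^{-1}
```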

If you take the ratio of two quantities in a linear differential equation, then this ratio satisfies the Riccati equation. People say that the Riccati equation is the simplest nonlinear differential equation because it's just a linear differential equation in disguise.
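A quick numerical sanity check of the "linear in disguise" claim (a sketch with made-up coefficients $a, b, c, d$, not from the lecture): if $\dot{x} = ax + b\tilde{x}$ and $\dot{\tilde{x}} = cx + d\tilde{x}$, the ratio $z = \tilde{x}/x$ should satisfy the scalar Riccati equation $\dot{z} = c + (d-a)z - bz^2$.

```python
import numpy as np
from scipy.integrate import solve_ivp

# Hypothetical coefficients for the coupled linear pair
a, b, c, d = -1.0, 0.5, 0.3, -2.0

def linear(t, v):
    x, xt = v
    return [a * x + b * xt, c * x + d * xt]

def riccati(t, z):
    # scalar Riccati equation satisfied by z = xt / x
    return [c + (d - a) * z[0] - b * z[0] ** 2]

T = np.linspace(0.0, 2.0, 50)
lin = solve_ivp(linear, (0.0, 2.0), [1.0, 0.2], t_eval=T, rtol=1e-10, atol=1e-12)
ric = solve_ivp(riccati, (0.0, 2.0), [0.2], t_eval=T, rtol=1e-10, atol=1e-12)

z_from_linear = lin.y[1] / lin.y[0]      # ratio of the two linear solutions
err = np.max(np.abs(z_from_linear - ric.y[0]))
print(err)  # the two trajectories agree
```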

The person we should really credit for finishing off optimal control (coming up with this composite story) was R. E. Kalman. This paper (1961) was rejected by most journals, including IEEE.

There had been a Russian mathematician, Pontryagin, who came up with a version of optimal control: the classical-mechanics analogue is Hamilton's principle of least action; Pontryagin treated costs of the form $\int_{t_0}^{t_1} L(x, u)dt + V(x(t_1))$ (Pontryagin's maximum principle). Zadeh and Desoer rewrote Pontryagin's maximum principle. Kalman, of course, to his credit, made everything quadratic.

Algebraic Riccati equation: $C^*C + A^*\bar{P} + \bar{P}A - \bar{P}BB^* \bar{P} = 0$, in the time-invariant case. To do this right, you'll have to cover notions of stability, then controllability and observability. In undergraduate classes, you heard about root locus, in which you tweak a parameter and make sure that poles don't end up in the right half plane.

Input-Output Stability

October 25, 2012

Think back to LTI systems: consider a system specified by $A, B, C, D$; assume single-input, single-output, $D = 0$, and $A$ a diagonal matrix. We've seen examples like this before, so we have a keen sense of what the solution should look like.

IO stability refers to the stability of the system from the point of view of the input/output relation. If we call the input/output response $H(s)$, we know that $H(s) = Y(s) / U(s)$, and we know from our Laplace transform when we were looking at solutions to the time-invariant equation, that this is equal to $C(sI-A)^{-1}B$. We can write $(sI-A)^{-1}$ by inspection, since this is a diagonal matrix.
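For a diagonal $A$, $(sI-A)^{-1} = \mathrm{diag}\{1/(s-\lambda_i)\}$, so the SISO transfer function is a sum of first-order terms, $H(s) = \sum_i c_i b_i / (s - \lambda_i)$. A small numerical check (the matrices are my own toy example):

```python
import numpy as np

# Toy diagonal SISO system (hypothetical numbers)
lam = np.array([-1.0, -3.0])     # diagonal entries (eigenvalues) of A
A = np.diag(lam)
B = np.array([[1.0], [2.0]])
C = np.array([[4.0, 5.0]])

s = 2.0 + 1.0j                   # arbitrary test point in the complex plane
H_full = (C @ np.linalg.inv(s * np.eye(2) - A) @ B)[0, 0]
H_sum = np.sum(C[0] * B[:, 0] / (s - lam))   # sum-of-first-order-terms form
print(abs(H_full - H_sum))       # ~0: the two forms agree
```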

For a multi-input, multi-output system, with $A$ the same as before, we still take the Laplace transform; we can no longer just take the ratio $Y(s)/U(s)$. The same formulation is still used to compute $H(s)$, which is now a matrix (one transfer function per input-output pair).

When looking at an I/O relationship, you may not have all of the information available in $A$ in the I/O relationship because of either how you're controlling $B$ or how you're sensing: notion of hidden modes.

Important when we start thinking about I/O stability.

Consider a system whose output $y(t)$ is characterized by $\int_{-\infty}^t H(t,\tau) u(\tau) d\tau$. Here $H(t,\tau)$ (the weighting pattern / function of the system; a matrix for each pair $(t,\tau)$) and $u(\tau)$ are piecewise continuous in $\tau$, and we require that the integral of the norm of $H(t,\tau)u(\tau)$ be finite.

For linear time-varying systems, $H(t,\tau)$ is given by the following matrix: $C(t)\Phi(t,\tau)B(\tau) + D(t)\delta(t-\tau)$ ($\delta(t)$ here is the delta function).

For linear time-invariant systems, it's a little easier: we can do the same thing, but it simplifies: it only depends (unsurprisingly) on the difference of $t$ and $\tau$, so we can write it as $H(t-\tau) = C\exp(A(t-\tau))B + D\delta(t-\tau)$. Then take the unilateral Laplace transform.

If we have a vector in $\Re^n$, we know that the infinity norm is simply the max element of a vector. Recall that the corresponding induced norm is simply the max row sum.

Suppose we have a function $u$; the infinity norm is simply the supremum of the infinity norm of the resulting vectors. $L_{\infty}^{n_i}$ is simply the set of functions that are finite over the infinity-norm.

We'll consider a particular case of I/O stability, called bounded input, bounded output stability.

Bounded Input, Bounded Output Stability (BIBO)

A system ($L$) is said to be BIBO stable very generally if $u \in L_\infty^{n_i} \implies y(u) \in L_\infty^{n_o}$, or equivalently, $\exists K < \infty \st \forall u \in L_\infty^{n_i}, \mag{y(\cdot)}_\infty \le K\mag{u(\cdot)}_\infty$.

Equivalently, a system is not BIBO stable if no such $K$ exists for this condition; that is, $\forall K < \infty, \exists u \st \mag{y}_\infty > K\mag{u}_\infty$.

BIBO stability theorem:

(L) is BIBO stable iff $\sup_t \int_{-\infty}^t \mag{H(t,\tau)}_{i,\infty} d\tau < \infty$.

Infinite Horizon, BIBO for LTI systems, State space stability

October 30, 2012

We want to find an input that minimizes the cost function (with $t_0=0$) $J = \int_0^\infty (y^TQy + u^TRu) dt$. We could consider the first term as a weighted norm $\mag{Cx}^2$ and the second term as a weighted norm $\mag{u}^2$; $Q, R$ are specified by the designer; the only rule we have to follow here is that $Q$ is positive semi-definite (all eigenvalues nonnegative) and $R$ is positive definite (all eigenvalues strictly positive). Typically, $Q, R$ are chosen to be diagonal, since we generally like to penalize components of the input and output relative to each other.

The penalty on a component $y_i$ being nonzero is simply the associated diagonal entry of $Q$. What's typically important is the ratio of the penalties.

Optimization problem whose constraints are defined by a dynamical system. It turns out that the optimal input is given by linear state feedback where the state is multiplied by a matrix $F$ determined by the system model and the matrices of the cost function: (as seen before): $u = -Fx, F = R^{-1}B^T P$, where $P$ is the positive-definite solution to the algebraic Riccati equation: $PA + A^TP - PBR^{-1}B^TP + C^TQC = 0$. $P \in \Re^{n \times n}$. Out of all possible inputs, the best one is given by linear state feedback.

We have some intuition given the linear time-varying case, which is more complicated; easy to justify through completing the square. Intuition as to what cost means; now we just need to solve this equation (there's a MATLAB routine for this, ARE).
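In Python, `scipy.linalg.solve_continuous_are` serves the same purpose; a sketch on a double integrator (the system and weights are my choices, not from the lecture):

```python
import numpy as np
from scipy.linalg import solve_continuous_are

# Double integrator with full-state output (hypothetical example)
A = np.array([[0.0, 1.0], [0.0, 0.0]])
B = np.array([[0.0], [1.0]])
C = np.eye(2)
Q = np.eye(2)            # output penalty (designer's choice)
R = np.array([[1.0]])    # input penalty, must be positive definite

# Solves A^T P + P A - P B R^{-1} B^T P + C^T Q C = 0
P = solve_continuous_are(A, B, C.T @ Q @ C, R)
F = np.linalg.solve(R, B.T @ P)     # optimal gain: u = -F x

# Residual of the ARE should be ~0, and A - B F should be stable
res = P @ A + A.T @ P - P @ B @ np.linalg.solve(R, B.T @ P) + C.T @ Q @ C
print(np.max(np.abs(res)), np.linalg.eigvals(A - B @ F))
```

The closed-loop eigenvalues of $A - BF$ all have negative real part, illustrating the remark below that the optimal law is also stabilizing.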

If we were to take the derivative of $x^TPx$ ($\dot{x}^TPx + x^TP\dot{x}$) and integrate it, we get $(x^TPx)(\infty) - (x^TPx)(0)$. Since we defined $J = \int_0^\infty (y^TQy + u^TRu)dt$, $J - (x^TPx)(0) = -(x^TPx)(\infty) + \int_0^\infty (y^TQy + u^TRu)dt + \int_0^\infty \deriv{}{t}(x^TPx)\, dt$. From the definition of $\dot{x}$, we can rewrite the derivative. After some tweaking, $J - (x^TPx)(0) = -(x^TPx)(\infty) + \int_0^\infty (u + R^{-1}B^TPx)^TR(u + R^{-1}B^TPx)dt$

By our hypothesis, we are assuming that $P$ solves the algebraic Ricatti equation. We can use that to simplify the equation. If the system under this control law is stable (exponential, internal stability), then $x^TPx(\infty) = 0$. If we choose $u = -R^{-1}B^TPx$, then the term under the integral is 0, and so $J$ is minimized -- cost depends only on initial state, which you can't affect.

The other thing we talked about was the positive semi-definite solution to this equation. It has only one positive semi-definite (symmetric) solution. We can also prove that if you choose this control input, it will stabilize the system: it is not only the optimal control law, but also a stabilizing control law.

We defined BIBO stability as describing a system in which all bounded inputs provide bounded outputs.

We get the following theorem: write out the transfer function of the weighting pattern of the system, $\hat{H}(s) = C(sI - A)^{-1}B + D$, which is a proper matrix; then $R = [A,B,C,D]$ is BIBO stable iff all poles of $\hat{H}$ are in the open left half of the complex plane ($\mathbb{C}^o_-$).

We can ignore $D$ because it's a constant, and so it does not generate (or cancel) any poles.
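A toy illustration of the hidden-modes caveat (my own example, not from the lecture): $A$ has an unstable eigenvalue at $+1$, but $C$ blocks that mode, so it never appears as a pole of $\hat{H}(s)$.

```python
import numpy as np

# Hypothetical system with a hidden unstable mode: A has an eigenvalue at +1,
# but the output matrix C blocks it, so it is not a pole of H(s).
lam = np.array([-1.0, 1.0])
A = np.diag(lam)
B = np.array([[1.0], [1.0]])
C = np.array([[1.0, 0.0]])

# For diagonal A, H(s) = sum_i C[0,i] * B[i,0] / (s - lam[i]);
# the residue at lam[i] is C[0,i] * B[i,0], so zero residues drop out.
residues = C[0] * B[:, 0]
poles = lam[residues != 0]
print(poles)                      # only [-1.]: BIBO stable...
print(np.linalg.eigvals(A))       # ...yet A has an unstable eigenvalue
```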

State space stability

November 1, 2012

overslept :/

Internal stability, Lyapunov condition for LTI systems, Controllability / Observability

November 6, 2012

LTI case: $\dot{x} = Ax$. Asymptotic / exponential stability iff all eigenvalues of $A$ are in $\mathbb{C}_-^o$. Stability iff all eigenvalues are in $\mathbb{C}_-$ and each eigenvalue on the $j\omega$ axis has Jordan blocks of size 1.

Connections between internal stability and BIBO stability

Go back to the general linear time-varying case. If the equilibrium at 0 (WLOG) of the zero-input system is exponentially stable, then, assuming (which we do) that $B,C,D$ are bounded matrices, the system is BIBO stable.

We just need to show, now, that given that this is exponentially stable, the condition (with weighting patterns and all) for BIBO stability holds.

(show, as an exercise, that given those bounds, and given our knowledge that this is an exponentially stable system and we can develop a bound on the norm, that this is BIBO).

What if $\dot{x} = A(t)x$ is just asymptotically stable? Do we have BIBO stability? Something to think about.

We know from our examples that a system may be BIBO stable but not internally stable (notion of modes being blocked by $B$ or not exposed by $C$) due to the possibility of what we call hidden modes, i.e. eigenvalues of $A$ may be not accessible through the input or invisible at the output.

So when can you relate BIBO stability to internal stability? Foreshadowing of what we'll start in a minute.

If we have an LTI system, we say that if the (A,B) pair is completely controllable, and the (A,C) pair is completely observable, then we call the system (A,B,C) minimal. If this is the case, then it's true that BIBO stability is equivalent to internal exponential stability.

For linear time-varying systems, I'll make one remark. It is often tempting to look at the eigenvalues of $A(t)$. In general there is no connection between the spectrum of $A(t)$ ($\sigma(A(t))$) and the stability of the system.

There are two cases in which the eigenvalues tell us something about the stability:

• $A$ is symmetric (or more generally normal: $AA^\dag = A^\dag A$): $A$ can be decomposed into an orthonormal basis of eigenvectors (with real eigenvalues in the symmetric case). If $\lambda \le -\mu$ ($\mu > 0$) for all $\lambda \in \sigma(A(t))$, then we have that the equilibrium at 0 is exponentially stable.

• Proof: Consider $V(x) = x^Tx$. Its time derivative (actually a special kind of derivative) is $\dot{x}^Tx + x^T\dot{x} = 2x^TAx$. Rewrite $A = T^{-1} \Lambda T$ (with $T^{-1} = T^T$), and we have $x^TAx = (Tx)^T \Lambda (Tx)$. Our initial assumption gives us an upper bound on the product $x^T A x$, which tells us that $\dot{V} \le -2\mu V$. We can integrate this inequality and come up with the following: $V(x(t)) \le V(x(0))\exp(-2\mu t)$. Note that $V$ is the square of a norm of $x$, so using this norm, we have that $\mag{x(t)}^2 \le \mag{x(0)}^2\exp(-2\mu t)$, so $\mag{x(t)} \le \mag{x(0)}\exp(-\mu t)$.

• The reason why this proof is interesting is that it invokes a technique common to control theory: we bring in this function and use the derivative to establish the stability of the system. This is a special case of what we call a Lyapunov function. Lyapunov theory is the most popular tool for assessing stability.

• If $\text{Re}(\lambda) < -\lambda_0$ for all $\lambda \in \sigma(A(t))$ and $\mag{\dot{A}(t)} \le \epsilon$, with $\epsilon$ small enough (a slowly-varying system), then you can show that $\dot{x} = A(t)x$ is exponentially stable (beyond the scope of this course).

Today we'll be looking just at Lyapunov as it applies to the time-invariant case. We spend a third of EE 222 studying Lyapunov theory and its variations and how to construct Lyapunov functions. Most exportable: use of external functions to avoid needing to integrate the system itself.

This leads us to our last test: a taste of what Lyapunov theory is.

Theorem (Lyapunov condition for exponential stability of $\dot{x} = Ax$)

We consider the following matrix equation (the Lyapunov equation) $A^TP + PA = -Q$, where $P,Q$ are square matrices. Theorem: the system is exponentially stable iff $\forall Q = Q^T > 0$ the Lyapunov equation has a unique solution $P = P^T > 0$.

Consider $\fn{V}{\Re^n}{\Re}$, $V(x) = x^TPx$, where for $Q = Q^T > 0$, $P = P^T > 0$ is the solution to the Lyapunov equation.

We're going to look at $\dot{V}$ (this is the same derivative as before, the Lie derivative: wherever we see the time derivative of the state, we apply the dynamics of the system). This derivative gives a measure of how much $V(x)$ is changing along the trajectories of the system. It suffices to say that $\dot{x}$ represents the dynamics of the system.

So $\dot{V} = x^TA^TPx + x^TPAx = x^T(A^TP + PA)x = -x^TQx$. What we want to do is relate the changes in $V$ back to the changes in $x$. As before, $\dot{V}(x) = -x^TQx \le -\lambda_{\min}(Q)\mag{x}^2$. Note that $-\mag{x}^2 \le -\frac{V(x)}{\lambda_{\max}(P)}$, and so we can proceed as before.
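The Lyapunov test is easy to run numerically; a sketch using `scipy.linalg.solve_continuous_lyapunov` (the example $A$ is mine, with the standard choice $Q = I$):

```python
import numpy as np
from scipy.linalg import solve_continuous_lyapunov

# Check exponential stability of xdot = A x via the Lyapunov equation
# A^T P + P A = -Q with Q = I (hypothetical example matrix).
A = np.array([[0.0, 1.0], [-2.0, -3.0]])
Q = np.eye(2)

# scipy solves M X + X M^H = Y; with M = A^T and Y = -Q this is A^T P + P A = -Q
P = solve_continuous_lyapunov(A.T, -Q)

eigP = np.linalg.eigvals(P)
# P = P^T > 0 certifies exponential stability (A's eigenvalues are -1, -2)
print(np.allclose(A.T @ P + P @ A, -Q), np.all(eigP.real > 0))
```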

This in itself is an important result. More of a technique for analysis than something that you'd use for assessing stability. The introduction of a Lyapunov function with a specific form (a positive definite function whose Lie derivative is a negative definite function) is what's useful.

Often the work in using Lyapunov theory is to construct such functions. In linear system it's easy because it's just given to you.

(four equivalent statements, proven in Callier and Desoer -- one of these is that if one Q exists such that this holds, then it holds for all Q)

Controllability / Observability

Motivation: designing and manipulating the system so you can get it to do a host of things. What we're going to launch into in the last part of the class: understanding what controllability of a system means. Part of that question is whether or not we can sufficiently observe the dynamics of the system through the output. That tells us how well we can control the system. Given that, we go on to ponder how to design controllers; we'll use the feedback topology we've seen twice already.

Two parts: assessment of controllability of systems, controller design.

In the last ten minutes: intuitive discussion about what controllability and observability mean. These are general to dynamic systems (not specific to linear systems).

Recall our definition of dynamical systems: five-tuple representation $D = (\mathcal{U}, \Sigma, \mathcal{Y}, s,r)$, semigroup and state transition axioms hold. The input is said to steer the initial condition $x_0$ @ $t_0$ to $x_1$ @ $t_1$ if $x_1 = s(t_1, t_0, x_0, u)$.

We say that D is controllable on some time interval if $\forall x_0, x_1 \in \Sigma \exists u \in \mathcal{U}$ such that $u$ steers $x_0$ @ $t_0$ to $x_1$ @ $t_1$.

We define controllability to be specific on a time interval. completely controllable (cc) denotes that the system is controllable on all time (if controllability is not indicated with an interval, it usually denotes complete controllability). Controllability of the system corresponds to surjectivity of the state transition function, i.e. $\forall x_0 \in \Sigma, \fn{s(t_1, t_0, x_0, \cdot)}{U}{\Sigma}$ is surjective.

We'll start next day by talking about observability and thinking about more specific tests.

Controllability and Observability

November 8, 2012

Difference between dynamical system and its representation. Instantiating it with the representation is simply our model of the system; the system itself is just the five-tuple. Same system with different representations; notion of a minimal representation. Controllability and observability are properties of a representation.

The computational tests for controllability and observability -- these grammians -- are dependent on the system representations.

We can define it more basically with the properties of the state transition function and the readout map (rather, recall that the composition of the two yields the response function).

note: cc = completely controllable, co = completely observable.

We discussed what it meant for an input to steer the system, and then we defined controllability to be surjectivity of the state transition function. In other words, a dynamic system $D$ is controllable on $\bracks{t_0, t_1}$ if the response map (from inputs to states, given some initial conditions) is surjective.

We'll now define observability as follows: $D$ is said to be observable on $\bracks{t_0, t_1}$ if for all valid input functions and all valid output functions, the initial state $x_0$ at $t_0$ is uniquely determined. In other words, the response map $\rho(t_1, t_0, \cdot, u)$ is injective.

What happens with controllability and observability under feedback? We're going to talk about feedback and one feed-forward topology. What can we say about the composite system? Is controllability / observability preserved with feedback? We're going to look at two kinds of feedback: state feedback and output feedback.

Memoryless feedback

No memory; static. Recall that the solution to LQR was just linear state feedback. That's memoryless: takes the state and multiplies by a constant gain matrix.

State feedback takes states and gives us inputs; output feedback takes outputs and gives us inputs.

Suppose you have a dynamical system $D$. We can actually measure the state and feed it back through this function $F_s$ to construct a new input.

Negative state feedback -- subtract this from our input. Initial input is the auxiliary input.

The output feedback case is typically more popular because we don't usually have the state to measure. We can no longer measure the state, but we can take the output.

With these feedback forms, we're going to make the well-posedness assumption: for all auxiliary inputs $v$ and all initial states and times, when you look at the algebraic relationships of those forms, there exists unique trajectories $x(t), y(t)$ over that time interval.

If you don't have well-posedness, you can come up with problems: consider $A=B=C=0$, $D=1$, $u = v+y$. Then $y = u = v + y$, which forces $v = 0$: the only $v$ that works is 0.

If we have that the systems are well-posed, we can make some nice observations about the observability of the feedback forms.

The first theorem states the following: suppose you have a well-posed system. $D$ is controllable on the interval iff $D_s$ is controllable on the same interval, and iff $D_o$ is controllable on the same interval.

Namely, controllability is preserved under static state feedback.

Formally rests on the fact that $F_s, F_o$ are known maps. We have all information to be able to compute one input from the other uniquely (by well-posedness assumption).

Observability

Still considering memoryless output feedback, but now we're also going to consider input feed-forward.

We have another system: D is observable on $\bracks{t_0, t_1}$ iff $D_o$ is observable and iff $D_f$ is observable.

$D$ is observable on the interval if, given all valid inputs and outputs over the time interval, there exists a unique consistent initial condition.

We've talked about preservation of controllability over state and output feedback; we've talked about preservation of observability of output feedback and feedforward. But the case we've missed is state feedback. Is observability preserved by state feedback? The answer, in general, is no. We can explain this in a couple of ways.

Observability is not necessarily preserved over state feedback -- consider nullspace of $F_f$.

Dynamic feedback. More assumptions that have to be made. Up to now, we've been assuming that all of these feedback / feedforward functions are memoryless. What if they actually have dynamics associated to them? A lot of control design involves using dynamic controllers.

In general these properties we've talked about (preservation of controllability / observability) hold for the memoryless case. If we now consider that the feedback and feedforward systems themselves are dynamical systems (and have memory), we have to be careful in what we mean by controllability and observability of the resulting system.

The input space of a state feedback function is the space of state trajectories of $D$, and the output space is the input space of $D$.

In order to talk about preservation of controllability and observability, we must now ask about the controllability and observability of the feedback system (i.e. the "loop").

We're going to actually approach that later. For now, skip over preservation of controllability and observability under dynamic (state and output) feedback until later. We'll discuss different topologies under which these are preserved. We'll return to it under the pretext of controllability and observability of linear systems.

Grammians

Given a system representation, how do we construct tests for controllability and observability? We have a test for the linear time-varying case, which is in general hard to compute, and tests for the linear time-invariant case, which are easier. In both cases, they're both matrix-rank conditions.

Controllability Grammians

Let us look at our linear system representation and focus on the time-varying case. When we talk about controllability of a linear system, we care about controllability of $(A, B)$ -- we know that controllability on $\bracks{t_0, t_1}$ is the same as surjectivity of the state transition function, which is defined only in terms of $A, B$: recall that $s(t_1, t_0, x_0, u_{[t_0,t_1]}(\cdot)) = \Phi(t_1,t_0)x_0 + \int_{t_0}^{t_1} \Phi(t_1, \tau)B(\tau) u(\tau) d\tau$. Let us refer to the integral as $\mathcal{L}_c(u)$, which maps input functions to states. Now we can characterize controllability in the following manner: we say that $(A,B)$ (or generally $R$) is completely controllable on $[t_0, t_1]$ iff $R(\mathcal{L}_c) = \Re^n$, which is equivalent to saying $\forall x_0 \in \Re^n$, $\exists u_{[t_0,t_1]}$ that steers $(x_0, t_0)$ to $(\theta_n, t_1)$ (steering to the origin), which is further equivalent to saying that you can steer $\theta_n$ to any state (reaching from the origin).

What is initially unintuitive is that either of the two latter statements is equivalent to the first statement. The idea is equating this to any vector in $\Re^n$: because the first piece of the state transition function does not depend on $u$, it all boils down to surjectivity of $\mathcal{L}_c$ (since this is the only part that depends on $u$). Thus surjectivity of $s$, as a map of $u$, is equivalent to surjectivity of $\mathcal{L}_c$.

Controllability and Observability: Grammians and tools

November 13, 2012

Controllability of $[A, B]$

$s(t_1, t_0, x_0, u_{[t_0, t_1]}) = \Phi(t_1, t_0) x_0 + \int_{t_0}^{t_1} \Phi(t_1,\tau)B(\tau)u(\tau)d\tau$

We refer to the term with the integral as $\mathcal{L}_c(u)$ (the linear controller map). Recall that we showed that controllability of $[A,B]$ on $[t_0, t_1]$ is equivalent to surjectivity of $\mathcal{L}_c$.

Grammians: matrix tests for controllability and observability for time-varying systems, and end up being something quite nice in the LTI case.

Recall: a system is said to be controllable if for any pair of states, there exists an input that takes you from the initial state to the terminal state over this time interval. We further discussed that this was equivalent to saying that we could reach any final state from 0, which was further equivalent to saying that we could reach 0 from any initial state. (reachability from the origin, steering to the origin)

Let us pair this, now, with

Observability

You have your model, final state, and some defined time interval, and from that information, you can uniquely determine the initial state of the system. For now we'll assume we know the input, but that turns out to be extraneous information.

Response map is the readout map composed with state transition function: in particular, it has memory.

Given the output and the input, observability relates to uniquely determining the initial state. Just as we isolated the term that depended on input in the controllability case, in observability, we isolate the dependence on the initial state.

We've turned injectivity of $\rho$ into injectivity of the linear operator $\fn{\mathcal{L}_o}{\Sigma}{\mathcal{Y}}$, $(\mathcal{L}_o x_0)(t) = C(t)\Phi(t, t_0)x_0$ (the linear observability map).

Theorem: complete observability of the system $(A, C)$ on the interval $[t_0, t_1]$ is equivalent to saying that $\mathcal{N}(\mathcal{L}_o) = \{\theta_n\}$ (injectivity of the observability map; saying that the nullspace is trivial)

Controllability and Observability grammians; named after the same Gram of Gram-Schmidt. Recall work in adjoints: we'll turn these tests into matrix tests through our knowledge of adjoint maps.

Recall: adjoint of linear maps. Take two inner product spaces and a map that goes from one space to the other. The map's adjoint is defined in terms of the inner product.

We can show that the adjoint of the linear controller map is simply $B^*(\cdot)\Phi^*(t_1, \cdot)$: just consider the inner product and bring $v^T$ under the integral.

Given $\fn{\mathcal{A}}{\mathcal{U}}{\mathcal{V}}$, we have the following propositions:

$V = R(A) \overset{\perp}{\oplus} N(A^*)$

$U = R(A^*) \overset{\perp}{\oplus} N(A)$

What we'd want to show is one pair; you'd prove this by showing that the nullspace of $A^*$ is perpendicular to the range of $A$.

You'll often see this as bubble diagrams: the space $U$ is often drawn as two bubbles that intersect only at the zero vector.

More important propositions:

$N(AA^*) = N(A^*)$

$R(AA^*) = R(A)$

$N(A^*A) = N(A)$

$R(A^*A) = R(A^*)$

These are very important. Quick sketch of proof of (1): show that the norm induced by the adjoint map of anything in $N(AA^*)$ is the zero vector; this is adequate.

Sketch of proof of (2): use the decomposition from the above propositions ($u_1 \in R(A^*), u_2 \in N(A)$); apply $A$ to this.

We've deviated a little. Started with tests for controllability and observability; did adjoints, showed decomposition of domain and codomain spaces when mapping through $A$ and its adjoint, we showed how to break down these spaces.

We can now relate these to our observability and controllability tests.

Controllability

$\fn{\mathcal{L}_c}{\mathcal{U}_{[t_0, t_1]}}{\Re^n}$, $\mathcal{L}_c(u) = \int_{t_0}^{t_1} \Phi(t_1, \tau)B(\tau)u(\tau) d\tau$. Recall that we just computed that $\mathcal{L}_c^*(x) = B^*(\cdot)\Phi^*(t_1, \cdot)x$. Complete controllability on this interval, using a proposition from above, is equivalent to the following: $R(\mathcal{L}_c) = \Re^n = R(\mathcal{L}_c\mathcal{L}_c^*)$. This is often easier to compute. Composition of $\mathcal{L}_c$ with $\mathcal{L}_c^*$ yields $W_c = \int_{t_0}^{t_1}\Phi(t_1, \tau) B(\tau)B^*(\tau)\Phi^*(t_1, \tau) d\tau$, which is an $n\times n$ matrix. The system is completely controllable on this interval iff $W_c$ (the controllability grammian) has full rank.

If you look at the integrand, this matrix is a (symmetric) positive semidefinite matrix (just write it in terms of the norm). If completely controllable, it is positive definite.

Exercise. Hint: look at the quadratic form $x^* W_c x$, and show, using that quadratic form for any $x$, that it's positive semi-definite. Our controllability test is that $\text{rank } W_c = n$.
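A minimal numerical sketch of this grammian test for an LTI pair (the double-integrator example and horizon $T$ are my choices): on $[0, T]$ the grammian reduces to $W_c = \int_0^T e^{At}BB^Te^{A^Tt}\,dt$, approximated here by a Riemann sum.

```python
import numpy as np
from scipy.linalg import expm

# Finite-horizon controllability grammian W_c for a hypothetical LTI pair,
# approximating the integral of e^{At} B B^T e^{A^T t} by a Riemann sum.
A = np.array([[0.0, 1.0], [0.0, 0.0]])   # double integrator
B = np.array([[0.0], [1.0]])
T, N = 1.0, 2000
ts = np.linspace(0.0, T, N)
dt = ts[1] - ts[0]

Wc = sum(expm(A * t) @ B @ B.T @ expm(A.T * t) for t in ts) * dt

# Full rank (here 2) <=> completely controllable; Wc is symmetric PSD,
# and positive definite exactly when the pair is completely controllable.
print(np.linalg.matrix_rank(Wc), np.linalg.eigvals(Wc).real.min() > 0)
```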

Observability

Recall this involved looking at the nullspace of the map. $\mathcal{L}_o^* \mathcal{L}_o x_0 = \int_{t_0}^{t_1} \Phi^*(\tau, t_0) C^*(\tau) C(\tau) \Phi(\tau, t_0) d\tau \; x_0$. Complete observability is the claim that this composition of maps, called $W_o$ (the observability grammian), has full rank. We've turned some fairly abstract tests into something more concrete: a matrix that involves computing the state transition matrix. The main theme is to get at simple tests for the time-invariant case. Todo: show how these grammians simplify to very simple grammians in the time-invariant case.

Last five minutes: talk about the controller and observability maps, what they mean in terms of grammians.

Kalman filters; some thoughts about observability

Kalman filter: the way we tend to reconstruct state from measurements given our understanding of the system.

We started today by talking about the nullspace of the observability map being trivial. So how do we solve for $x_0$? Use the left-inverse, if this is completely observable. Let's take that equation and pre-multiply both sides by $\mathcal{L}_o^*$. We know that this product ($\mathcal{L}_o^* \mathcal{L}_o$) is bijective, so we can invert it. Not actually done in Kalman filters; keep a running estimate.
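A discrete-sample analogue of this reconstruction (my own toy example; in practice you only sample $y$ at finitely many times): stack $y(t_k) = Ce^{At_k}x_0$ and solve for $x_0$ by least squares, which is the $(\mathcal{L}_o^*\mathcal{L}_o)^{-1}\mathcal{L}_o^*$ computation in matrix form.

```python
import numpy as np
from scipy.linalg import expm

# Recover x0 from zero-input output samples y(t_k) = C e^{A t_k} x0
# (example system and initial state are hypothetical).
A = np.array([[0.0, 1.0], [-2.0, -1.0]])
C = np.array([[1.0, 0.0]])
x0_true = np.array([0.7, -1.3])

ts = np.linspace(0.0, 3.0, 30)
Phi = np.vstack([C @ expm(A * t) for t in ts])   # stacked observability map
y = Phi @ x0_true                                # noiseless output samples

# Least squares solves the normal equations (Phi^T Phi) x0 = Phi^T y
x0_hat, *_ = np.linalg.lstsq(Phi, y, rcond=None)
print(np.max(np.abs(x0_hat - x0_true)))          # ~0 when (A, C) is observable
```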

Controllability and Observability of LTI systems

November 15, 2012

What does it mean in terms of controlling the state and observing the system? Controllable and observable subspaces.

Recall: the injectivity of $\mathcal{L}_o$ corresponds to observability, which relates to the bijectivity of what we called the observability grammian, $\mathcal{L}_o^*\mathcal{L}_o$, on our time interval. Suppose that this is injective. How do we solve for the initial state? What we came up with last day was finding the inverse of the grammian, i.e. $x_0 = W_o^{-1} \int_{t_0}^{t_1}\Phi^*(\tau, t_0)C^*(\tau)y(\tau) d\tau$.

Suppose $y$ is not in the range of $\mathcal{L}_o$, but $\mathcal{L}_o$ is injective. What is the "best" estimate of $x_0$? Claim: our computation from before still gives us the best estimate. What do we mean by best? We'll explore this.

We've got this map $\mathcal{L}_o$. We can write $y$ to be the sum of two components, one being in the range of $\mathcal{L}_o$, and the other being in the nullspace of $\mathcal{L}_o^*$ (recall that we showed that these two spaces are orthogonal). Therefore we minimize the distance between the actual initial state and our estimate.

Our third case is that $\mathcal{L}_o$ is not injective, but it is surjective. $\mathcal{L}_o\mathcal{L}_o^*$ is bijective, therefore. Even though this is a map between infinite-dimensional spaces, we can define an inverse because it is bijective.

Our fourth case is that $\mathcal{L}_o$ is neither injective nor surjective. You can compute how you would find the least-squares estimate or least norm.

This formulation leads us to the idea of gathering all of the information about the input and output, and then compute exactly or estimate the state. Typically, you update your best estimate with some online algorithm.

(derivation of recursive formula, which yields Kalman filter; information form of the Kalman filter.)

LTI case

Our representation is simply $[A,B,C,D]$. Let's start with controllability.

Controllability and Observability of LTI systems

November 20, 2012

$\{A,B,C,D\}$ is completely observable (note that again, the time interval is irrelevant for LTI because the integrand of the grammian does not depend on $t$) $\iff$ rank of $\begin{bmatrix} C \\ CA \\ \vdots \\ CA^{n-1}\end{bmatrix} = n$ $\iff$ rank $\begin{bmatrix} sI - A \\ C \end{bmatrix} = n\ \forall s \in \sigma(A)$.

Same proofs as controllability theorem: instead of considering $\{A, B\}$, consider matrix pair $\{A^T, C^T\}$.

We call $Q = \begin{bmatrix}B & AB & \hdots & A^{n-1} B \end{bmatrix}$ the controllability matrix; might also use $\mathcal{C}$. Similarly, we call $\mathcal{O} = \begin{bmatrix} C \\ CA \\ \vdots \\ CA^{n-1}\end{bmatrix}$ the observability matrix.
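Both rank tests are one-liners numerically (the example matrices are my own):

```python
import numpy as np

# Kalman rank tests for a hypothetical LTI triple (A, B, C)
A = np.array([[0.0, 1.0], [-2.0, -3.0]])
B = np.array([[0.0], [1.0]])
C = np.array([[1.0, 0.0]])
n = A.shape[0]

# Controllability matrix [B, AB, ..., A^{n-1} B] and
# observability matrix [C; CA; ...; CA^{n-1}]
ctrb = np.hstack([np.linalg.matrix_power(A, k) @ B for k in range(n)])
obsv = np.vstack([C @ np.linalg.matrix_power(A, k) for k in range(n)])

print(np.linalg.matrix_rank(ctrb) == n, np.linalg.matrix_rank(obsv) == n)
```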

We can easily prove the following, but we'll state it as a theorem:

The range of the controllability grammian (does not matter which time interval, so again we'll consider $[0, \Delta]$) is equal to the range of the controllability map, wihch is further equal to the range of the controllability matrix. We haven't shown this directly; however, it is a direct extension of the proof of the theorems in the controllability matrix or just a direct extension of the Cayley-Hamilton theorem.

Similarly, the nullspace of the observability grammian is equal to the nullspace of the observability map (already shown), but that's equal to the nullspace of the observability matrix. That's in general some subset of $\Re^n$. If the system is completely observable, the nullspace of these maps is trivial.

The range of the controllability matrix is the set of all states reachable from $\theta$; The nullspace of the observability matrix is the set of unobservable states.

A few definitions:

• $R$ is stabilizable if all of its uncontrollable states are stable.

• $R$ is detectable if all of its unobservable states are stable.

• $R$ is stabilizable $\iff \text{rk}\begin{bmatrix} sI - A & B \end{bmatrix} = n \forall s \in \sigma(A) \cap \mathbb{C}_+$.

• $R$ is detectable $\iff \text{rk}\begin{bmatrix} sI - A \\ C \end{bmatrix} = n \forall s \in \sigma(A) \cap \mathbb{C}_+$.
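The two PBH-style rank conditions above only need to be checked at the unstable eigenvalues of $A$. A hedged sketch (assuming numpy; the example system is invented) of these tests:

```python
import numpy as np

def is_stabilizable(A, B):
    """PBH test: rank [sI - A, B] = n at every eigenvalue s with Re(s) >= 0."""
    n = A.shape[0]
    for s in np.linalg.eigvals(A):
        if s.real >= 0:  # only unstable modes matter
            M = np.hstack([s * np.eye(n) - A, B])
            if np.linalg.matrix_rank(M) < n:
                return False
    return True

def is_detectable(A, C):
    """PBH test: rank [sI - A; C] = n at every eigenvalue s with Re(s) >= 0."""
    n = A.shape[0]
    for s in np.linalg.eigvals(A):
        if s.real >= 0:
            M = np.vstack([s * np.eye(n) - A, C])
            if np.linalg.matrix_rank(M) < n:
                return False
    return True

A = np.array([[1.0, 0.0], [0.0, -1.0]])  # one unstable mode at s = 1
B = np.array([[1.0], [0.0]])             # input reaches the unstable mode
C = np.array([[1.0, 0.0]])               # output sees the unstable mode
print(is_stabilizable(A, B), is_detectable(A, C))  # True True
```

Here the stable mode at $s = -1$ is neither controlled nor observed, yet the system is still stabilizable and detectable, since only unstable modes must pass the rank test.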

Intro to control design via pole placement

Using state feedback, when can we place the eigenvalues of the closed loop system anywhere in $\mathbb{C}$? System must be controllable.

Control design using eigenvalue placement

November 27, 2012

Observer design. Final exam Tues Dec. 11, 9-12, 293 Cory.

Theorem: Let $(A, b)$ be completely controllable. Let $\hat{\pi}(s)$ be any monic polynomial of degree $n$ with real coefficients. Then $\exists! f^T \in \Re^n$ such that $\hat{\chi}_{A + bf}(s) = \hat{\pi}(s)$, where $f^T = -\begin{bmatrix} 0 & 0 & \hdots & 1\end{bmatrix}\begin{bmatrix}b & Ab & \hdots & A^{n-1} b\end{bmatrix}^{-1} \hat{\pi}(A)$.
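The formula in the theorem (Ackermann's formula) translates directly into code. A minimal sketch, assuming numpy; the double-integrator example and the desired polynomial $(s+1)^2$ are assumptions for illustration:

```python
import numpy as np

def acker(A, b, poly):
    """Ackermann's formula for single-input eigenvalue placement.

    poly = [1, c1, ..., cn]: coefficients of the desired monic
    characteristic polynomial pi(s) = s^n + c1 s^{n-1} + ... + cn.
    Returns the row vector f^T with sigma(A + b f^T) = roots of pi.
    """
    n = A.shape[0]
    Q = np.hstack([np.linalg.matrix_power(A, k) @ b for k in range(n)])
    pi_A = sum(c * np.linalg.matrix_power(A, n - k) for k, c in enumerate(poly))
    e_n = np.zeros((1, n)); e_n[0, -1] = 1.0   # last row of Q^{-1} selector
    return -(e_n @ np.linalg.inv(Q) @ pi_A)    # f^T = -e_n^T Q^{-1} pi(A)

A = np.array([[0.0, 1.0], [0.0, 0.0]])  # double integrator
b = np.array([[0.0], [1.0]])
f = acker(A, b, [1.0, 2.0, 1.0])        # pi(s) = (s + 1)^2
print(np.linalg.eigvals(A + b @ f))     # both eigenvalues at -1
```

Note that the formula requires inverting the controllability matrix, which is exactly why complete controllability is needed.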

Proposition: Complete controllability is equivalent to saying $\exists T \in \Re^{n\times n}$ s.t. $\tilde{A} = T^{-1}AT = \begin{bmatrix}0 & 1 & 0 & \hdots & 0\\ 0 & 0 & 1 & \hdots & 0\\ \vdots & & & \ddots & \vdots \\ -\alpha_n & -\alpha_{n-1} & \hdots & \hdots & -\alpha_1\end{bmatrix}$.

This result is basically changing the basis; we define a new state $z = T^{-1}x$. That's why we're interested in this particular similarity transform.

Recall what it means to be completely controllable. For LTI systems, we have our tests for controllability, which are easy to check.

Controllable canonical form. Canonical means that any completely controllable system can be transformed into a matrix of this same shape, with the same entries: in fact, as we saw last time, these are the coefficients of the characteristic polynomial.

Multiple-input, multiple output; Observer design

November 29, 2012

The closed-loop matrix is the same; we always have the same structure: we're feeding back an input equal to $Fx + z$.

Algorithm in LN 21 shows how to simplify the problem by considering it as a single-input system. Basically asks the question "does there exist a feedback gain matrix $F$ such that $(A + BF, b)$ is completely controllable, for some $b \in R(B)$?"

State estimation: we ask whether $\hat{z}$ is a good estimate of the state. We measure the error, which tells us how good the estimate is (the Luenberger observer).

Theorem: if $(A, C)$ is completely observable, $\exists T \in \Re^{n \times n_o}$ such that $\sigma(A - TC)$ can be placed arbitrarily in $\mathbb{C}$.

We can think about this problem by relating it back to the complete controllability and placement of spectrum of $A - BF$.

If we look at the transpose of $A - TC$ (eigenvalues are preserved under transposition), then we're back to the complete controllability case.
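This duality argument is also how one computes an observer gain in practice: place the spectrum of $(A^T, C^T)$ by state feedback, then transpose. A sketch assuming scipy's `place_poles`; the system and desired observer poles are invented for illustration:

```python
import numpy as np
from scipy.signal import place_poles

A = np.array([[0.0, 1.0], [-2.0, -3.0]])
C = np.array([[1.0, 0.0]])

desired = [-5.0, -6.0]                        # desired observer eigenvalues
# Place sigma(A^T - C^T K) = desired, then transpose: T = K^T gives
# sigma(A - T C) = desired, since eigenvalues are preserved under transpose.
K = place_poles(A.T, C.T, desired).gain_matrix
T = K.T                                       # observer gain
print(np.sort(np.linalg.eigvals(A - T @ C).real))  # approx [-6, -5]
```

The observer $\dot{\hat{z}} = A\hat{z} + Bu + T(y - C\hat{z})$ then has error dynamics $\dot{e} = (A - TC)e$, which decay at the rate set by the placed eigenvalues.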

Don't typically move observer eigenvalues much farther into the left half-plane than needed: doing so requires a large gain matrix, and hence large gain in the feedback loop, which is typically undesirable.

Want the observer error to converge faster than the closed-loop state dynamics.