Early investigators recognized that random signals could be described loosely in terms of their spectral content, but a rigorous mathematical description of such signals was not formulated until the 1940s, most notably with the work of Wiener and Rice (1, 2).

Recall that in probability we assume that we have some a priori knowledge about the likelihood of certain elemental random outcomes. Then we wish to predict the theoretical relative frequency of occurrence of combinations of these outcomes. 

In statistics we do just the reverse. We make observations of random outcomes. Then, based on these observations, we seek mathematical models that faithfully represent what we have observed.

The relative-frequency concept is especially helpful in visualizing the statistical significance of the results of probability calculations.

It should be apparent that the intuitive concepts of probability have their limitations. The ratio-of-outcomes approach requires the equal-likelihood assumption for all outcomes.

Also, there are situations where all possible outcomes simply cannot be enumerated.

Axiomatic probability begins with the concept of a sample space.

The sample space is the set of all possible outcomes of an experiment. The individual outcomes are called elements or points in the sample space.

The number of points in the sample space may be finite, countably infinite, or uncountably infinite, depending on the experiment under consideration.

It should be noted that elements of a sample space must always be mutually exclusive or disjoint. There is no overlap of points in a sample space.

In axiomatic probability, the term event has special meaning and should not be used interchangeably with outcome.

An event is a special subset of the sample space S. 

In discrete problems, this can always be done by defining the set of events under consideration to be all possible subsets of the sample space S. 

 We will tacitly assume that the null set is a subset of every set, and that every set is a subset of itself.

The event A is said to occur if any point in A occurs.

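The first two axioms are

$$P(A) \ge 0 \quad \text{for every event } A$$

$$P(S) = 1$$

The third axiom is then

$$P(A_1 \cup A_2 \cup \cdots) = P(A_1) + P(A_2) + \cdots \quad \text{for mutually exclusive events } A_1, A_2, \ldots$$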

 If the sample space S contains a finite number of elements, the probability assignment is usually made directly on the elements of S. They are, of course, elementary events themselves. 

However, if the sample space consists of an infinite “smear” of points, the probability assignment must be made on events and not on points in the sample space.

Once we have specified the sample space, the set of events, and the probabilities associated with the events, we have what is known as a probability space.

The probability P(A∩B) is known as the joint probability of A and B.

Let us say that we have a conceptual experiment for which we have defined a suitable events space and a probability assignment for the set of events. A random variable is simply a function that maps every point in the events space onto points on the real line.

The probability assignments in the events space transfer over to the corresponding points on the real line.

In many applications we have need to consider two or more chance situations where there may be probabilistic connections among the events. 

We can now think of the pair X and Y as defining a new sample space with the probabilities as indicated in Table 1.1.

They are the probabilities of the joint occurrence of X and Y, and they will be denoted simply as pXY. The entries in the array are all nonnegative and sum to unity, so the conditions for a legitimate sample space that were stated in Section 1.3 are satisfied.

In addition to the joint probabilities in Table 1.1, the sums of the entries in the respective rows and columns are also of interest. They are sometimes called the marginal or unconditional probabilities.

Thinking in terms of statistics, they represent the relative frequency of occurrence of a particular value of one of the random variables irrespective of the other member of the pair.

Finally, there are two other distributions that are also of interest when considering joint random variables. These are the conditional probabilities of X given Y and Y given X, and they will be denoted as pX|Y and pY|X.

Note here that we consider the “given” variable as being fixed, and the probability distribution is on the “first” variable. 

Bayes Rule:

$$p_{X|Y} = \frac{p_{Y|X}\, p_X}{p_Y}$$

Conditional and Joint Relationships:

$$p_{X|Y} = \frac{p_{XY}}{p_Y}$$

or

$$p_{Y|X} = \frac{p_{XY}}{p_X}$$

Note that the probabilities pX and pY are the “normalizing” factors in the denominators of Eqs. (1.5.2) and (1.5.3) that take us from the joint row and column distributions to the respective conditional distributions. 

Two discrete random variables X and Y are said to be independent if

$$p_{XY} = p_X\, p_Y$$

That is, in words, X and Y are independent if and only if their joint probability distribution is equal to the product of their respective pX and pY distributions. 

It is tacitly implied that this must be true for all permissible values of X and Y. 

It is worth mentioning that we do not have to rely on the intuitive notion of independence when we know the joint probability distribution in all its detail.
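
The relationships among joint, marginal, and conditional probabilities can be sketched in a few lines of code. The joint table below is hypothetical (it is not Table 1.1, whose entries are not reproduced in these notes), and the function names are chosen only for illustration. Exact fractions are used so that the independence test is not disturbed by rounding.

```python
from fractions import Fraction

# Hypothetical joint table p_XY[i][j]: rows index values of X, columns values of Y.
p_xy = [
    [Fraction(1, 10), Fraction(2, 10)],
    [Fraction(3, 10), Fraction(4, 10)],
]

def marginals(table):
    """Sum out the other variable: row sums give p_X, column sums give p_Y."""
    p_x = [sum(row) for row in table]
    p_y = [sum(col) for col in zip(*table)]
    return p_x, p_y

def conditional_x_given_y(table):
    """p_{X|Y}[i][j] = p_XY[i][j] / p_Y[j]; each column then sums to one."""
    _, p_y = marginals(table)
    return [[table[i][j] / p_y[j] for j in range(len(p_y))]
            for i in range(len(table))]

def is_independent(table):
    """Check p_XY = p_X * p_Y for every permissible pair of values."""
    p_x, p_y = marginals(table)
    return all(table[i][j] == p_x[i] * p_y[j]
               for i in range(len(p_x)) for j in range(len(p_y)))

p_x, p_y = marginals(p_xy)
print("p_X =", p_x, " p_Y =", p_y)
print("p_X|Y =", conditional_x_given_y(p_xy))
print("independent:", is_independent(p_xy))
```

For this particular table the product test fails, so X and Y are dependent; a table built as the outer product of two marginal distributions would pass.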

It works out that the integral of the probability density function is also a useful descriptor of a continuous random variable. 

It is usually defined with the integral's lower limit being set at the random variable's smallest possible realization, and the upper limit is set at some dummy variable, say x, where x is within the range of the random variable space.

   The integral is then

$$F_X(x) = \int_{-\infty}^{x} f_X(u)\, du$$
 
   FX(x) is called the cumulative probability distribution function (or sometimes, for short, just distribution function in contrast to density function).
    
   In words, the cumulative probability distribution is the probability that the random variable X is equal to or less than the argument x.
    
   As a matter of notation, we will normally use an uppercase symbol for the distribution function and the corresponding lowercase symbol for the density function.
    
   In either case, random or deterministic, the average is just the sum of the numbers divided by the number of quantities being averaged.
    
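   In the random case, the sample average or sample mean of a random variable X is defined as

$$\bar{X} = \frac{X_1 + X_2 + \cdots + X_N}{N}$$

   where X_1, X_2, …, X_N are sample realizations obtained from N trials.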
    
   In the study of random variables we also like to consider the conceptual average that would occur for an infinite number of trials. 
    
   This hypothetical average is called expected value and is aptly named; it simply refers to what one would “expect” in the typical statistical situation.
    
   Thus, the sample average would be

Image
    
    This suggests the following definition for expected value for the discrete probability case:

$$E(X) = \sum_{i} p_i x_i$$

    Similarly, for a continuous random variable X, we have

$$E(X) = \int_{-\infty}^{\infty} x f_X(x)\, dx$$
    
    We can use these same arguments for defining the expectation of a function of X, as well as for X.
    
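    Thus, we have the following:

Discrete case:

$$E(g(X)) = \sum_{i} p_i\, g(x_i)$$

Continuous case:

$$E(g(X)) = \int_{-\infty}^{\infty} g(x) f_X(x)\, dx$$

    With g(X) = X^k, Eq. (1.7.5) then provides an expression for the kth moment of X, that is,

$$E(X^k) = \int_{-\infty}^{\infty} x^k f_X(x)\, dx$$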
    
    The first moment is, of course, just the expectation of X, which is also known as the mean or average value of X. 
    
    Note that when the term sample is omitted, we tacitly assume that we are referring to the hypothetical infinite-sample average.
    
    We also have occasion to look at the second moment of X “about the mean.” This quantity is called the variance of X and is defined as

$$\operatorname{Var} X = E\left[(X - E(X))^2\right]$$
    
    In a qualitative sense, the variance of X is a measure of the dispersion of X about its mean. 
    
    Of course, if the mean is zero, the variance is identical to the second moment.
    
    The expression for variance given by Eq. (1.7.9) can be reduced to a more convenient computational form by expanding the quantity within the brackets and then noting that the expectation of the sum is the sum of the expectations. This leads to

$$\operatorname{Var} X = E(X^2) - [E(X)]^2$$

    The square root of the variance is also of interest, and it has been given the name standard deviation, that is,

$$\sigma_X = \sqrt{\operatorname{Var} X}$$
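
As a small check on the two forms of the variance, the sketch below evaluates both for a fair six-sided die. The die and the helper names are illustrative assumptions, not part of the original text; exact fractions make the agreement between the two forms visible without rounding.

```python
from fractions import Fraction
import math

def expectation(pmf, g=lambda x: x):
    """E[g(X)] = sum of p_i * g(x_i) for a discrete pmf given as {value: probability}."""
    return sum(p * g(x) for x, p in pmf.items())

def variance_definition(pmf):
    """Second moment about the mean: E[(X - E(X))^2]."""
    m = expectation(pmf)
    return expectation(pmf, lambda x: (x - m) ** 2)

def variance_shortcut(pmf):
    """Computational form: E(X^2) - [E(X)]^2."""
    return expectation(pmf, lambda x: x * x) - expectation(pmf) ** 2

# Illustrative example: a fair die with faces 1..6.
die = {face: Fraction(1, 6) for face in range(1, 7)}

mean = expectation(die)
var = variance_definition(die)
print("mean =", mean)
print("variance (definition) =", var)
print("variance (shortcut)   =", variance_shortcut(die))
print("standard deviation    =", math.sqrt(var))
```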
    
    The characteristic function associated with the random variable X is defined as

$$\psi_X(\omega) = E\left(e^{j\omega X}\right) = \int_{-\infty}^{\infty} f_X(x)\, e^{j\omega x}\, dx$$

    It can be seen that ψX(ω) is just the Fourier transform of the probability density function with a reversal of sign on ω.
    
    Thus, the theorems (and tables) of Fourier transform theory can be used to advantage in evaluating characteristic functions and their inverses.
    
    The characteristic function is especially useful in evaluating the moments of X.

    It can be seen that

$$E(X^k) = \frac{1}{j^k} \left. \frac{d^k \psi_X}{d\omega^k} \right|_{\omega = 0}$$
    
    Thus, with the help of a table of Fourier transforms, you can often evaluate the moments without performing the integrations indicated in their definitions.
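
A short worked example may help here. It is not from the original text: it assumes the unit exponential density f_X(x) = e^{-x} for x ≥ 0, and uses the convention that ψ_X(ω) = E(e^{jωX}), so that the kth moment is the kth derivative of ψ_X at ω = 0 divided by j^k.

```latex
\begin{align*}
\psi_X(\omega) &= \int_{0}^{\infty} e^{-x} e^{j\omega x}\, dx = \frac{1}{1 - j\omega} \\
E(X) &= \frac{1}{j} \left. \frac{d\psi_X}{d\omega} \right|_{\omega=0}
      = \frac{1}{j} \left. \frac{j}{(1 - j\omega)^2} \right|_{\omega=0} = 1 \\
E(X^2) &= \frac{1}{j^2} \left. \frac{d^2\psi_X}{d\omega^2} \right|_{\omega=0}
        = \frac{1}{j^2} \left. \frac{2j^2}{(1 - j\omega)^3} \right|_{\omega=0} = 2
\end{align*}
```

Both values agree with direct integration of x e^{-x} and x² e^{-x} over the positive real axis, and they give a variance of 2 − 1² = 1.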
    
    The random variable X is called normal or Gaussian if its probability density function is

$$f_X(x) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left[-\frac{(x - m_X)^2}{2\sigma^2}\right]$$

    Note that this density function contains two parameters, m_X and σ². These are the random variable's mean and variance.

    Note that the normal density function is completely specified by assigning numerical values to the mean and variance. Thus, a shorthand notation has come into common usage to designate a normal random variable. When we write

$$X \sim N(m_X, \sigma^2)$$

    we mean that X is normal with mean m_X and variance σ².
    
     Also, as a matter of terminology, the terms normal and Gaussian are used interchangeably in describing normal random variables, and we will make no distinction between the two.
    
    Note that the density function is symmetric and peaks at its mean.
    
    Qualitatively, then, the mean is seen to be the most likely value, with values on either side of the mean gradually becoming less and less likely as the distance from the mean becomes larger. 
    
    Since many natural random phenomena seem to exhibit this central-tendency property, at least approximately, the normal distribution is encountered frequently in applied probability. 
    
    Recall that the variance is a measure of dispersion about the mean. Thus, small σ corresponds to a sharp-peaked density curve, whereas large σ will yield a curve with a flat peak.
    
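    The normal distribution function is, of course, the integral of the density function:

$$F_X(x) = \int_{-\infty}^{x} \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left[-\frac{(u - m_X)^2}{2\sigma^2}\right] du$$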
    
    Unfortunately, this integral cannot be represented in closed form in terms of elementary functions. Thus, its value must be obtained from tables or by numerical integration.
    
    A quick glance at the table will show that the distribution function is very close to unity for values of the argument greater than 4.0 (i.e., 4σ).
    
    In some applications, though, the difference between FX(x) and unity [i.e., the area under the “tail” of fX(x)] is very much of interest, even though it is quite small. 
    
    In spite of the previous remarks, tables of probabilities, both normal and otherwise, can be useful for quick and rough calculations.
    
    A word of caution is in order, though, relative to normal distribution tables. They come in a variety of forms. 
    
    Some tables give the one-sided area under the normal density curve from 0 to X, rather than from -∞ to X.
    
    Other tables do something similar by tabulating a function known as the error function, which is normalized differently than the usual distribution function.
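
In place of a table, the normal distribution function can be evaluated through the error function. The sketch below assumes the standard identity F_X(x) = ½[1 + erf((x − m_X)/(σ√2))] and uses Python's `math.erf` and `math.erfc`; the function names are illustrative. It also shows the relation between the usual distribution function and the one-sided "0 to x" area that some tables list.

```python
import math

def normal_cdf(x, mean=0.0, sigma=1.0):
    """Area under the normal density from -infinity to x."""
    return 0.5 * (1.0 + math.erf((x - mean) / (sigma * math.sqrt(2.0))))

def normal_tail(x, mean=0.0, sigma=1.0):
    """Area under the upper tail, 1 - F_X(x), computed with erfc so that
    very small tail areas are not lost to rounding."""
    return 0.5 * math.erfc((x - mean) / (sigma * math.sqrt(2.0)))

def zero_to_x_area(x):
    """One-sided area from 0 to x for the zero-mean, unit-variance case."""
    return normal_cdf(x) - 0.5

for x in (1.0, 2.0, 4.0):
    print(f"x = {x}: F = {normal_cdf(x):.9f}, "
          f"tail = {normal_tail(x):.3e}, 0-to-x area = {zero_to_x_area(x):.9f}")
```

Computing the tail directly, rather than as one minus the distribution function, matters in exactly the applications where the small tail area is the quantity of interest.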
    
    We also have occasion to consider cases where the random variable has a mixture of discrete and smooth distribution.
    
    Rectification or any sort of hard limiting of noise leads to this situation.
    
    Note that in order to have the area under the density function be Image at y = 0, we must have a Dirac delta (impulse) function at the origin.
    
    This, in turn, gives rise to a jump or discontinuity in the corresponding distribution function.
    
    It should be mentioned that at the point of discontinuity, the value of the distribution function is Image. That is, the distribution function is continuous from the right and not from the left. 
    
    These are also referred to as multivariate random variables with the case of two joint random variables, called the bivariate case, being encountered frequently.
    
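    It then follows directly from integral calculus that the probability that X and Y both lie within the region shown as R in Fig. 1.8 is:

$$P(X \text{ and } Y \text{ lie within } R) = \iint_{R} f_{XY}(x, y)\, dx\, dy$$

    It should now be apparent that we can also define a joint cumulative distribution function for the bivariate case as:

$$F_{XY}(x, y) = \int_{-\infty}^{y} \int_{-\infty}^{x} f_{XY}(u, v)\, du\, dv$$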
    
    It should also be apparent that we have a differential/integral relationship between the density and cumulative distribution functions that is similar to what we have in the single-variate case. The only difference is that we have a double integral for the integration and the second partial derivative for the derivative part of the analogy.
    
    It is worthy of mention that only the joint probability density function gives the complete description of the probabilistic relationship between the X and Y random variables.
    
    The other densities (i.e., conditional and marginal) contain specialized information which may be important in certain applications, but each does not, when considered alone, “tell the whole story”. 
    
    Bayes Rule, independence, and the equations for marginal density carry over directly from discrete probability to the continuous case. 
    
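    Bayes Rule (Continuous Case):

$$f_{X|Y} = \frac{f_{Y|X}\, f_X}{f_Y}$$

Conditional Densities (Continuous Case):

$$f_{X|Y} = \frac{f_{XY}}{f_Y}$$

    Also,

$$f_{Y|X} = \frac{f_{XY}}{f_X}$$

    Independence:

Random variables X and Y are independent if and only if

$$f_{XY}(x, y) = f_X(x)\, f_Y(y)$$

    Marginal probability density:

$$f_X(x) = \int_{-\infty}^{\infty} f_{XY}(x, y)\, dy$$

and

$$f_Y(y) = \int_{-\infty}^{\infty} f_{XY}(x, y)\, dx$$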
    
    One should always remember that to get marginal probability from joint probability, you must always “sum out,” not simply substitute a fixed value for the random variable being held fixed.
    
    The expectation of the product of two random variables X and Y is of special interest. In general, it is given by

$$E(XY) = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} x y\, f_{XY}(x, y)\, dx\, dy$$

    There is a special simplification of Eq. (1.11.1) that occurs when X and Y are independent. Equation (1.11.1) then reduces to

$$E(XY) = E(X)\, E(Y)$$
    
    If X and Y possess the property of Eq. (1.11.2), that is, the expectation of the product is the product of the individual expectations, they are said to be uncorrelated. 
    
    Obviously, if X and Y are independent, they are also uncorrelated. However, the converse is not true, except in a few special cases.
    
    As a matter of terminology, if

$$E(XY) = 0$$

X and Y are said to be orthogonal.
    
    The covariance of X and Y is also of special interest, and it is defined as

$$\operatorname{Cov}(X, Y) = E\left[(X - m_X)(Y - m_Y)\right]$$

    With the definition of Eq. (1.11.4) we can now define the correlation coefficient for two random variables as

$$\rho = \frac{\operatorname{Cov}(X, Y)}{\sigma_X \sigma_Y}$$
    
    The correlation coefficient is a normalized measure of the degree of correlation between two random variables, and the normalization is such that ρ always lies within the range −1 ≤ ρ ≤ 1. 
    
    Let X and Y be independent random variables with probability density functions fX(x) and fY(y). Define another random variable Z as the sum of X and Y:

$$Z = X + Y$$

    It is now apparent from Eq. (1.12.5) that the quantity within the brackets is the desired probability density function for Z. Thus,

$$f_Z(z) = \int_{-\infty}^{\infty} f_X(x)\, f_Y(z - x)\, dx$$

    It is of interest to note that the integral on the right side of Eq. (1.12.6) is a convolution integral.
    
    Thus, from Fourier transform theory, we can then write

$$\mathcal{F}[f_Z] = \mathcal{F}[f_X] \cdot \mathcal{F}[f_Y]$$
    
    We now have two ways of evaluating the density of Z: (1) We can evaluate the convolution integral directly, or (2) we can first transform fX and fY, then form the product of the transforms, and finally invert the product to get fZ.
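
The direct-convolution route can be illustrated with a discrete analog, in which the convolution integral becomes a convolution sum over probability mass functions. The sketch below is an assumption-laden illustration rather than the example in the text: it uses fair dice instead of continuous densities, and the function name `convolve` is chosen here for convenience.

```python
from fractions import Fraction

def convolve(p, q):
    """Discrete convolution: pmf of the sum of two independent variables
    whose pmfs are given as lists over consecutive integer values."""
    out = [Fraction(0)] * (len(p) + len(q) - 1)
    for i, p_i in enumerate(p):
        for j, q_j in enumerate(q):
            out[i + j] += p_i * q_j
    return out

# Faces 1..6 of a fair die are stored at indices 0..5.
die = [Fraction(1, 6)] * 6

two_dice = convolve(die, die)          # index k corresponds to a sum of k + 2
three_dice = convolve(two_dice, die)   # index k corresponds to a sum of k + 3

for total, prob in enumerate(two_dice, start=2):
    print(total, prob)
```

Each individual pmf is flat, yet the pmf of the sum is triangular for two dice and already bell-like for three, which gives a feel for how repeated convolution reshapes the distribution of a sum.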
    
    This simple example is intended to demonstrate (not prove) that a superposition of independent random variables tends toward normality under fairly general conditions, largely regardless of the distribution of the individual random variables contributing to the sum. This is known as the central limit theorem of statistics.
    
    In engineering applications the noise we must deal with is frequently due to a superposition of many small contributions. When this is so, we have good reason to make the assumption of normality.
    
    The summation of any number of random variables can always be thought of as a sequence of summing operations on two variables; therefore, it should be clear that summing any number of independent normal random variables leads to a normal random variable.
    
    This rather remarkable result can be generalized further to include the case of dependent normal random variables.
    
    A mathematical transformation that takes one set of variables (say, inputs) into another set (say, outputs) is a common situation in systems analysis.
    
    Assume we know the probability density function for X, and would like to find the corresponding density for Y. First, let us assume that the transformation g(x) is one-to-one for all permissible x.
    
    By this we mean that the functional relationship given by Eq. (1.13.1) can be reversed, and x can be written uniquely as a function of y. Let the “reverse” relationship be

$$x = h(y)$$
    
    The probabilities that X and Y lie within corresponding differential regions must be equal. That is,

Image

or

Image
    
    The differential equivalent of Eq. (1.13.4) is

$$f_X(x)\, dx = f_Y(y)\, |dy|$$

where we have tacitly assumed dx to be positive. 
    
    Also, x is constrained to be h(y). Thus, we have

$$f_Y(y) = \left| \frac{dx}{dy} \right| f_X(h(y))$$

    or, equivalently,

$$f_Y(y) = |h'(y)|\, f_X(h(y))$$
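
The change-of-variables rule f_Y(y) = |h'(y)| f_X(h(y)) can be checked numerically for a simple scale factor. The sketch below assumes Y = aX with X zero-mean normal, so that h(y) = y/a and |h'(y)| = 1/|a|; the function names and the particular values of a and σ are illustrative.

```python
import math

def normal_pdf(x, mean, sigma):
    """Normal density with the given mean and standard deviation."""
    z = (x - mean) / sigma
    return math.exp(-0.5 * z * z) / (math.sqrt(2.0 * math.pi) * sigma)

def scaled_density(f_x, a):
    """Density of Y = a*X from the density of X, using
    f_Y(y) = |h'(y)| * f_X(h(y)) with h(y) = y / a."""
    def f_y(y):
        return abs(1.0 / a) * f_x(y / a)
    return f_y

sigma_x = 2.0   # assumed standard deviation of X
a = -3.0        # assumed scale factor (negative to exercise the absolute value)

f_y = scaled_density(lambda x: normal_pdf(x, 0.0, sigma_x), a)

for y in (-4.0, 0.0, 1.5):
    # The transformed density should match a normal density with sigma = |a| * sigma_x.
    print(y, f_y(y), normal_pdf(y, 0.0, abs(a) * sigma_x))
```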
    
    It can now be seen that transforming a zero-mean normal random variable with a simple scale factor yields another normal random variable with a corresponding scale change in its standard deviation.
    
    It is important to note that normality is preserved in a linear transformation.
    
    The single-variate density function given by Eq. (1.13.22) is called the Rayleigh density function. It is of considerable importance in applied probability.
    
    It is easily verified that the mode (peak value) of the Rayleigh density is equal to the standard deviation of the x and y normal random variables from which it was derived.
    
    Thus, we see that similar independent, zero-mean normal densities in the x, y domain correspond to Rayleigh and uniform densities in the r, θ domain.
    
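    Consider a set of n random variables X1, X2, …, Xn (also called variates). We define a vector random variable X as

$$\mathbf{X} = \begin{bmatrix} X_1 \\ X_2 \\ \vdots \\ X_n \end{bmatrix}$$

    In general, the components of X may be correlated and have nonzero means. 
    
    We denote the respective means as m1, m2, …, mn, and thus, we define a mean vector m as

$$\mathbf{m} = \begin{bmatrix} m_1 \\ m_2 \\ \vdots \\ m_n \end{bmatrix}$$

    The covariance matrix for X is defined as

$$\mathbf{C} = E\left[(\mathbf{X} - \mathbf{m})(\mathbf{X} - \mathbf{m})^T\right]$$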
    
    The terms along the major diagonal of C are seen to be the variances of the variates, and the off-diagonal terms are the covariances.
    
    The random variables X1, X2, …, Xn are said to be jointly normal or jointly Gaussian if their joint probability density function is given by

$$f_X(\mathbf{x}) = \frac{1}{(2\pi)^{n/2}\, |\mathbf{C}|^{1/2}} \exp\left[-\frac{1}{2} (\mathbf{x} - \mathbf{m})^T \mathbf{C}^{-1} (\mathbf{x} - \mathbf{m})\right]$$

    Note that the defining function for fX is scalar and is a function of x1, x2, …, xn when written out explicitly. 
    
    Also note that C⁻¹ must exist in order for fX to be properly defined by Eq. (1.14.5). Thus, C must be nonsingular.
    
    As mentioned previously, the third- and higher-order densities are very cumbersome to write out explicitly; therefore, we will examine the bivariate density in some detail in order to gain insight into the general multivariate normal density function.
    
    The bivariate normal density function is a smooth hill-shaped surface over the x1, x2 plane. 
    
    In the more general case, a constant probability density contour projects into the x1, x2 plane as an ellipse with its center at (m1, m2) as shown in Fig. 1.15.
    
    The orientation of the ellipse in Fig. 1.15 corresponds to a positive correlation coefficient.
    
    Points on the ellipse may be thought of as equally likely combinations of x1 and x2.
    
     If ρ = 0, we have the case where X1 and X2 are uncorrelated, and the ellipses have their semimajor and semiminor axes parallel to the x1 and x2 axes. 
    
    If we specialize further and let σ1 = σ2 (and ρ = 0), the ellipses degenerate to circles.
    
    In the other extreme, as |ρ| approaches unity, the ellipses become more and more eccentric.
    
    Thus, two normal random variables that are uncorrelated are also statistically independent.
    
    It is easily verified from Eq. (1.14.5) that this is also true for any number of uncorrelated normal random variables. 
    
    This is exceptional, because in general zero correlation does not imply statistical independence. It does, however, in the Gaussian case.
    
    We now define a new set of random variables Y1, Y2, …, Yn that are linearly related to X1, X2, …, Xn via the equation

Image
    
    A is a square matrix that will be assumed to be nonsingular (i.e., invertible).
    
    In particular, the transformation is one-to-one; therefore, a generalized version of Eq. (1.13.32) may be used.

Image
    
    We find the mean of Y by taking the expectation of both sides of the linear transformation

Image

Thus,

Image
    
    It is apparent now that fY is also normal in form with the mean and covariance matrix given by

Image

and

Image
    
    Thus, we see that normality is preserved in a linear transformation.
    
    All that is changed is the mean and the covariance matrix; the form of the density function remains unchanged.
    
    Any transformation, say, S, that produces a new covariance matrix S C_X S^T that is diagonal is of special interest.
    
    Such a transformation will yield a new set of normal random variables that are uncorrelated, and thus they are also statistically independent. 
    
    In a given problem, we may not choose to actually make this change of variables, but it is important just to know that the variables can be decoupled and under what circumstances this can be done.
    
    It works out that a diagonalizing transformation will always exist if C_X is positive definite (8).
    
    In the case of a covariance matrix, this implies that all the correlation coefficients are less than unity in magnitude.
    
    A symmetric matrix C is said to be positive definite if the scalar x^T C x is positive for all nontrivial x, that is, x ≠ 0. 
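
For the bivariate case, a decoupling transformation can be written down explicitly. The sketch below is one possible construction, not necessarily the one the source has in mind: it assumes a plane rotation S through the angle θ = ½ atan2(2c12, c11 − c22), which makes S C S^T diagonal for any symmetric 2×2 matrix C. The function name and the numerical values (σ1 = 2, σ2 = 1, ρ = 0.5) are illustrative.

```python
import math

def decorrelate_2x2(c11, c12, c22):
    """Return a rotation S and D = S C S^T for the symmetric matrix
    C = [[c11, c12], [c12, c22]]; D is diagonal up to rounding."""
    theta = 0.5 * math.atan2(2.0 * c12, c11 - c22)
    c, s = math.cos(theta), math.sin(theta)
    S = [[c, s], [-s, c]]
    C = [[c11, c12], [c12, c22]]
    # Form S*C, then (S*C)*S^T, with explicit sums to stay dependency-free.
    SC = [[sum(S[i][k] * C[k][j] for k in range(2)) for j in range(2)]
          for i in range(2)]
    D = [[sum(SC[i][k] * S[j][k] for k in range(2)) for j in range(2)]
         for i in range(2)]
    return S, D

sigma1, sigma2, rho = 2.0, 1.0, 0.5
S, D = decorrelate_2x2(sigma1 ** 2, rho * sigma1 * sigma2, sigma2 ** 2)

print("S =", S)
print("S C S^T =", D)
# With |rho| < 1 the determinant sigma1^2 * sigma2^2 * (1 - rho^2) is positive,
# and both diagonal entries of D (the new variances) are positive.
```

The rotation preserves the trace and determinant of C, so the two new variances sum to σ1² + σ2² and their product equals σ1²σ2²(1 − ρ²).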
    
    It is appropriate now to summarize some of the important properties of multivariate normal random variables:
    
    The probability density function describing a vector random variable X is completely defined by specifying the mean and covariance matrix of X.
    
    The covariance matrix of X is positive definite. The magnitudes of all correlation coefficients are less than unity.
    
    If normal random variables are uncorrelated, they are also statistically independent.
    
    A linear transformation of normal random variables leads to another set of normal random variables. A decoupling (decorrelating) transformation will always exist if the original covariance matrix is positive definite.
    
    If the joint density function for n random variables is normal in form, all marginal and conditional densities associated with the n variates will also be normal in form.
    
    By convergence we mean that if a given accuracy figure is specified, we can find an appropriate number of terms such that the specified accuracy is met by a truncated version of the series.
    
     In particular, note that once we have determined how many terms are needed in the truncated series, this same number is good for all x within the interval, and there is nothing “chancy” about it.
    
    Let us now be more specific in this example, and let X (and thus X1, X2, …, Xn) be normal with mean m_X and variance σ_X².
    
    The sample mean is, of course, an estimate of the true mean of X, and we see from Eq. (1.16.3) that it at least yields E(X) “on the average.”
    
    Estimators that have this property are said to be unbiased. 
    
    That is, an estimator is said to be unbiased if

$$E(\text{estimate of } X) = E(X)$$
    
    Thus, we see that the variance of the sample mean decreases with increasing n and eventually goes to zero as n → ∞.
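
The shrinking variance of the sample mean can be seen in a small simulation. The sketch below assumes independent zero-mean normal samples and compares the empirical variance of many sample means with the value σ²/n that holds for independent samples; the function name, seed, and trial count are arbitrary choices made for illustration.

```python
import random
import statistics

def sample_mean_variance(n, sigma, trials=4000, seed=0):
    """Empirical variance of the sample mean of n independent N(0, sigma^2)
    draws, estimated over the given number of repeated trials."""
    rng = random.Random(seed)
    means = []
    for _ in range(trials):
        draws = [rng.gauss(0.0, sigma) for _ in range(n)]
        means.append(sum(draws) / n)
    return statistics.pvariance(means)

sigma = 2.0
for n in (4, 25, 100):
    empirical = sample_mean_variance(n, sigma)
    print(f"n = {n:3d}: empirical = {empirical:.4f}, sigma^2/n = {sigma ** 2 / n:.4f}")
```

The empirical values scatter around σ²/n rather than matching it exactly, which is itself an instance of convergence in a statistical rather than a deterministic sense.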
    
    It should be clear from the figure that convergence of some sort takes place as n → ∞. 
    
    However, no matter how large we make n, there will still remain a nonzero probability that Yn will fall outside some specified accuracy interval. 
    
    Thus, we have convergence in only a statistical sense and not in an absolute deterministic sense.
    
    There are a number of types of statistical convergence that have been defined and are in common usage
    
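    Consider a sequence of random variables Y1, Y2, …, Yn. The sequence Yn is said to converge in the mean (or mean square) to Y if

$$\lim_{n \to \infty} E\left[(Y_n - Y)^2\right] = 0$$

    Convergence in the mean is sometimes abbreviated as

$$\underset{n \to \infty}{\text{l.i.m.}}\; Y_n = Y$$

    The sequence Yn converges in probability to Y if

$$\lim_{n \to \infty} P\left(|Y_n - Y| \ge \varepsilon\right) = 0$$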

where ε is an arbitrarily small positive number.
    
    Roughly speaking, convergence in the mean indicates that the dispersion (variance) about the limiting value shrinks to zero in the limit.
    
    Similarly, convergence in probability means that an arbitrarily small accuracy criterion is met with a probability of one as n → ∞.
    
    Davenport and Root (9) point out that convergence in the mean is a more severe requirement than convergence in probability.
    
    Thus, if a sequence converges in the mean, we are also assured that it will converge in probability. 
    
    The converse is not true though, because convergence in probability is a “looser” sort of criterion than convergence in the mean.
    
    A consistent estimator is one that continually gets better and better with more and more observations.
    
    Estimation efficiency is a measure of the accuracy of the estimate relative to what we could ever expect to achieve with a given set of observations; and so forth.
    
    However, estimating parameters is a statistical problem, not a filtering problem.
    
    In the filtering problem we usually assume that the key statistical parameters have already been determined, and the remaining problem is that of estimating the random process itself as it evolves with time.
    
    Linear estimate. A linear estimate of a random variable is one that is formed as a linear function of the measurements, both past and present.
    
    We think of the random variable and its estimate as evolving with time; i.e., in general, they are not constants.
    
    Prior information about the process being estimated may also be included in the estimate, but if so, it must be accounted for linearly.
    
    Unbiased estimate. This estimate is one where

$$E(\hat{x}) = E(x)$$

    If the above statement is satisfied, then the expectation of the error is zero, i.e.,

$$E(x - \hat{x}) = 0$$

    Minimum-mean-square-error estimate. This estimate is formed such that E[(x − x̂)²] is made as small as possible.
    
    Minimum variance estimate. Here, the variance of the estimate is made as small as possible. Clearly, if x itself has zero mean, then the minimum-mean-square-error estimate described above is the same as the minimum variance estimate.
    
    Consistent estimate. When associated with the filtering of a dynamic random process, a consistent estimate is defined as one where the estimation error is zero mean with the covariance matching that calculated by the filter.
    
    
    

 