# Mathematics for Machine Learning

## 1. Linear Algebra

### In general, vectors are special objects that can be added together and multiplied by scalars to produce another object of the same kind.


#### 1.1. Geometric vectors
#### 1.2. Polynomials are also vectors
#### 1.3. Audio signals are vectors
#### 1.4. Elements of R^n (tuples of n real numbers) are vectors

### The concept of a vector space and its properties underlie much of ***Machine Learning***

***INFO 1:* Gaussian elimination is an intuitive and constructive way to solve a system of linear equations with thousands of variables.**
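
As a minimal sketch of the idea, the function below solves a small square system `A x = b` by forward elimination and back substitution. The function name `gaussian_elimination`, the use of partial pivoting, and the example system are illustrative assumptions, not something taken from these notes.

```python
def gaussian_elimination(A, b):
    """Solve A x = b for a square, nonsingular A (illustrative sketch)."""
    n = len(A)
    # Build the augmented matrix [A | b] with float entries.
    M = [[float(v) for v in row] + [float(rhs)] for row, rhs in zip(A, b)]
    for col in range(n):
        # Partial pivoting: use the row with the largest entry in this column.
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular or nearly singular")
        M[col], M[pivot] = M[pivot], M[col]
        # Eliminate the entries below the pivot.
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    # Back substitution on the resulting upper-triangular system.
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(M[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (M[r][n] - s) / M[r][r]
    return x


A = [[2, 1, -1], [-3, -1, 2], [-2, 1, 2]]
b = [8, -11, -3]
x = gaussian_elimination(A, b)  # approximately [2, 3, -1]
```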

***INFO 2:* Groups play an important role in computer science. Besides providing a fundamental framework for operations on sets, they are heavily used in cryptography, coding theory, and graphics.**

***INFO 3:* Vector subspaces are a key idea in machine learning.**

***INFO 4:* Linear independence is one of the most important concepts in linear
algebra.**

***INFO 5:* In the machine learning literature, the distinction between linear and affine is sometimes not clear so that we can find references to affine spaces/mappings as linear spaces/mappings.**

***INFO 6:* Symmetric, positive definite matrices play an important role in machine learning, and they are defined via the inner product. The idea of symmetric positive semidefinite matrices is key in the definition of kernels.**

***INFO 7:* Projections are an important class of linear transformations (besides rotations and reflections) and play an important role in graphics, coding theory, statistics and machine learning. In machine learning, we often deal with data that is high-dimensional. High-dimensional data is often hard to analyze or visualize. However, high-dimensional data quite often possesses the property that only a few dimensions contain most information, and most other dimensions are not essential to describe key properties of the data. When we compress or visualize high-dimensional data, we will lose information. To minimize this compression loss, we ideally find the most informative dimensions in the data.**
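
The simplest case of a projection, onto a one-dimensional subspace, can be sketched as follows. The helper `project_onto_line` and the example vectors are hypothetical; the formula used is the standard orthogonal projection of `x` onto the line spanned by a nonzero vector `b`.

```python
def project_onto_line(x, b):
    """Orthogonal projection of x onto the line spanned by b (b must be nonzero)."""
    dot_bx = sum(bi * xi for bi, xi in zip(b, x))
    dot_bb = sum(bi * bi for bi in b)
    coeff = dot_bx / dot_bb  # coordinate of the projection with respect to b
    return [coeff * bi for bi in b]


x = [2.0, 3.0]
b = [1.0, 0.0]
p = project_onto_line(x, b)
# The residual x - p is orthogonal to b; it is the information lost by projecting.
residual = [xi - pi for xi, pi in zip(x, p)]
```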

***INFO 8:* In machine learning, inner products are important in the context of kernel methods.**

***INFO 9:* Methods to analyze and learn from network data are an essential component of machine learning methods.**

***INFO 10:* The Cholesky decomposition is an important tool for the numerical computations underlying machine learning.**
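
A minimal sketch of the decomposition for a small symmetric, positive definite matrix is shown below. It computes a lower-triangular `L` with `A = L Lᵀ`; the function name `cholesky` and the example matrix are illustrative assumptions, and a numerical library routine would normally be used in practice.

```python
import math


def cholesky(A):
    """Return lower-triangular L with A = L L^T (A symmetric positive definite)."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            # Sum of products of the already computed entries in rows i and j.
            s = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                d = A[i][i] - s
                if d <= 0:
                    raise ValueError("matrix is not positive definite")
                L[i][j] = math.sqrt(d)
            else:
                L[i][j] = (A[i][j] - s) / L[j][j]
    return L


A = [[4.0, 12.0, -16.0], [12.0, 37.0, -43.0], [-16.0, -43.0, 98.0]]
L = cholesky(A)
```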

***INFO 11:* The SVD is used in a variety of applications in machine learning, from least-squares problems in curve fitting to solving systems of linear equations. Substituting a matrix with its SVD often has the advantage of making calculations more robust to numerical rounding errors. The SVD’s ability to approximate matrices with “simpler” matrices in a principled manner opens up machine learning applications ranging from dimensionality reduction and topic modeling to data compression and clustering.**

***INFO 12:* The low-rank approximation of a matrix appears in many machine learning applications, e.g., image processing, noise filtering, and regularization of ill-posed problems.**

***INFO 13:* Many algorithms in machine learning optimize an objective function with respect to a set of desired model parameters that control how well a model explains the data: Finding good parameters can be phrased as an optimization problem.**

***INFO 14:* Vector calculus is one of the fundamental mathematical tools we need in machine learning.**

***INFO 15:* To facilitate learning in machine learning models, we need to compute gradients of functions, since the gradient points in the direction of steepest ascent.**

***INFO 16:* When we compute gradients and implement them, we can use finite differences to numerically test our computation and implementation: We choose the value 'h' to be small (e.g. h = 10^-4) and compare the finite-difference approximation with our -analytic- implementation of the gradient. If the error is small, our gradient implementation is probably correct.**
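
The check described above can be sketched as follows. The test function `f`, its hand-derived gradient `grad_f`, and the choice of central (rather than forward) differences are assumptions made for illustration.

```python
def f(x):
    # Example function of two variables (chosen only for illustration).
    return x[0] ** 2 + 3.0 * x[0] * x[1]


def grad_f(x):
    # Analytic gradient of f, derived by hand.
    return [2.0 * x[0] + 3.0 * x[1], 3.0 * x[0]]


def finite_difference_gradient(func, x, h=1e-4):
    """Approximate the gradient of func at x with central differences."""
    grad = []
    for i in range(len(x)):
        x_plus = list(x)
        x_minus = list(x)
        x_plus[i] += h
        x_minus[i] -= h
        grad.append((func(x_plus) - func(x_minus)) / (2.0 * h))
    return grad


x0 = [1.5, -0.5]
numeric = finite_difference_gradient(f, x0)
analytic = grad_f(x0)
# A small maximum deviation suggests the analytic gradient is implemented correctly.
max_error = max(abs(n - a) for n, a in zip(numeric, analytic))
```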

***INFO 17:* The Jacobian determinant is important because we transform random variables and probability distributions. These transformations are extremely relevant in machine learning in the context of *training deep neural networks* using the reparametrization trick, also called infinitesimal perturbation analysis.**

***INFO 18:* In many machine learning applications, we find good model parameters by performing gradient descent which relies on the fact that we can compute the gradient of a learning objective with respect to the parameters of the model.**

***INFO 19:* For training deep neural network models, the backpropagation algorithm (Kelley, 1960; Bryson, 1961; Dreyfus, 1962; Rumelhart et al., 1986) is an efficient way to compute the gradient of an error function with respect to the parameters of the model.**

***INFO 20:* An area where the chain rule is used to an extreme is deep learning, where the function value y is computed as a many-level function composition.**

***INFO 21:* It turns out that backpropagation is a special case of a general technique in numerical analysis called automatic differentiation.**
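
To make the connection concrete, here is a toy sketch of reverse-mode automatic differentiation on scalars. The `Var` class and `backward` function are hypothetical names for this example; real frameworks work on tensors and support many more operations.

```python
class Var:
    """A scalar node that remembers its parents and the local derivatives."""

    def __init__(self, value, parents=()):
        self.value = value
        self.parents = parents  # tuples of (parent node, d(self)/d(parent))
        self.grad = 0.0

    def __add__(self, other):
        return Var(self.value + other.value, ((self, 1.0), (other, 1.0)))

    def __mul__(self, other):
        return Var(self.value * other.value,
                   ((self, other.value), (other, self.value)))


def backward(output):
    """Propagate d(output)/d(node) from the output back to every input."""
    order, seen = [], set()

    def visit(node):
        if id(node) in seen:
            return
        seen.add(id(node))
        for parent, _ in node.parents:
            visit(parent)
        order.append(node)  # parents are appended before their children

    visit(output)
    output.grad = 1.0
    # Reverse topological order: a node's gradient is complete before it is used.
    for node in reversed(order):
        for parent, local in node.parents:
            parent.grad += node.grad * local  # chain rule


a = Var(2.0)
b = Var(3.0)
y = a * b + a * a  # y = a*b + a^2
backward(y)
# Expected by hand: dy/da = b + 2a = 7 and dy/db = a = 2.
```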

***INFO 22:* In the context of neural networks, where the input dimensionality is often much higher than the dimensionality of the labels, the reverse mode is computationally significantly cheaper than the forward mode.**

***INFO 23:* For neural network training, we backpropagate the error of the prediction with respect to the label.**

***INFO 24:* In machine learning (and other disciplines), we often need to compute expectations, i.e., we need to solve integrals.**

***INFO 25:* In machine learning and statistics, there are two major interpretations of probability: the Bayesian and frequentist interpretations.**

***INFO 26:* In machine learning, we often avoid explicitly referring to the probability space, but instead refer to probabilities on quantities of interest, which we denote by T. We refer to T as the target space and refer to elements of T as states.**

***INFO 27:* In statistics, we observe that something has happened and try to figure out the underlying process that explains the observations. In this sense, machine learning is close to statistics in its goals to construct a model that adequately represents the process that generated the data. We can use the rules of probability to obtain a “best-fitting” model for some
data.**

***INFO 28:* Another aspect of machine learning systems is that we are interested in generalization error. This means that we are actually interested in the performance of our system on instances that we will observe in future, which are not identical to the instances that we have seen so far.**

***INFO 29:* In machine learning, we use discrete probability distributions to model categorical variables, i.e., variables that take a finite set of unordered values. They could be categorical features, such as the degree taken at university when used for predicting the salary of a person, or categorical labels, such as letters of the alphabet when doing handwriting recognition.**

***INFO 30:* However, in many machine learning applications discrete states take numerical values, e.g., z1 = -1.1; z2 = 0.3; z3 = 1.5, where we could say z1 < z2 < z3. Discrete states that assume numerical values are particularly useful because we often consider expected values of random variables.**
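
For the three states quoted above, an expected value can be computed as in the sketch below. The probabilities are hypothetical values chosen for illustration; only the states come from the note.

```python
def expected_value(states, probabilities):
    """Expected value of a discrete random variable with numerical states."""
    if abs(sum(probabilities) - 1.0) > 1e-12:
        raise ValueError("probabilities must sum to 1")
    return sum(p * z for p, z in zip(probabilities, states))


states = [-1.1, 0.3, 1.5]        # z1, z2, z3 from the note
probabilities = [0.2, 0.5, 0.3]  # assumed for this example
mean = expected_value(states, probabilities)  # -0.22 + 0.15 + 0.45, about 0.38
```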

***INFO 31:* Unfortunately, machine learning literature uses notation and nomenclature that hides the distinction between the sample space &Omega;, the target space T, and the random variable X.**

***INFO 32:* In line with most machine learning literature, we also rely on context to distinguish the different uses of the phrase probability distribution.**

***INFO 33:* Probabilistic modeling provides a principled foundation for designing machine learning methods.**

***INFO 34:* In machine learning and Bayesian statistics, we are often interested in making inferences of unobserved (latent) random variables given that we have observed other random variables.**

***INFO 35:* If we think in a bigger context, then the posterior can be used within a decision-making system, and having the full posterior can be extremely useful and lead to decisions that are robust to disturbances.**

***INFO 36:* For example, in the context of model-based reinforcement learning, Deisenroth et al. (2015) show that using the full posterior distribution of plausible transition functions leads to very fast (data/sample efficient) learning, whereas focusing on the maximum of the posterior leads to consistent failures. Therefore, having the full posterior can be very useful for a downstream task.**

***INFO 37:* The concept of the expected value is central to machine learning, and the foundational concepts of probability itself can be derived from the expected value (Whittle, 2000).**

***INFO 38:* In machine learning, we need to learn from empirical observations of data.**

***INFO 39:* We use the empirical covariance, which is a biased estimate. The unbiased (sometimes called corrected) covariance has the factor N-1 in the denominator instead of N.**
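
The difference between the two denominators can be sketched as follows. The function `covariance`, its `unbiased` flag, and the sample data are illustrative assumptions.

```python
def covariance(xs, ys, unbiased=False):
    """Empirical covariance of two equally long samples.

    Divides by N by default (biased) and by N - 1 when unbiased=True.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    total = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return total / (n - 1 if unbiased else n)


xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]
biased = covariance(xs, ys)                   # 10 / 4
corrected = covariance(xs, ys, unbiased=True)  # 10 / 3
```

The two estimates differ by the factor N / (N - 1), so the gap shrinks as the number of samples grows.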

***INFO 40:* The raw-score version of the variance can be useful in machine learning, e.g., when deriving the bias–variance decomposition (Bishop, 2006).**

***INFO 41:* In machine learning, we often consider problems that can be modeled as independent and identically distributed (i.i.d.) random variables, X_1,...,X_N.**

***INFO 42:* Another concept that is important in machine learning is conditional independence.**

***INFO 43:* There are many other areas of machine learning that also benefit from using a Gaussian distribution, for example Gaussian processes, variational inference, and reinforcement learning.**

***INFO 44:* Gaussians are widely used in statistical estimation and machine learning as they have closed-form expressions for marginal and conditional distributions.**

***INFO 45:* It is worth recalling at this point the desiderata for manipulating probability distributions in the machine learning context: *1.* There is some “closure property” when applying the rules of probability, e.g., Bayes’ theorem. By closure, we mean that applying a particular operation returns an object of the same type. *2.* As we collect more data, we do not need more parameters to describe the distribution. *3.* Since we are interested in learning from data, we want parameter estimation to behave nicely.**

***INFO 46:* The rewriting above of the Bernoulli distribution, where we use Boolean variables as numerical 0 or 1 and express them in the exponents, is a trick that is often used in machine learning textbooks.**
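
The exponent trick can be sketched in a few lines, assuming the usual form p(x | μ) = μ^x (1 - μ)^(1 - x) with x in {0, 1}. The function name `bernoulli_pmf` is hypothetical.

```python
def bernoulli_pmf(x, mu):
    """Bernoulli probability of x in {0, 1} with success probability mu.

    The Boolean outcome is used as a numerical exponent, so a single
    expression covers both cases without an if/else.
    """
    if x not in (0, 1):
        raise ValueError("x must be 0 or 1")
    return mu ** x * (1.0 - mu) ** (1 - x)


p_success = bernoulli_pmf(1, 0.3)  # mu
p_failure = bernoulli_pmf(0, 0.3)  # 1 - mu
```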

***INFO 47:* In machine learning, we consider a finite number of samples from a distribution. One could imagine that for simple distributions we only need a small number of samples to estimate the parameters of the distributions.**

***INFO 48:* In machine learning, we often use the second level of abstraction, that is, we fix the parametric form (the univariate Gaussian) and infer the parameters from data.**

***INFO 49:* The relationship between the original Bernoulli parameter &#956; and the natural parameter &theta; is known as the sigmoid or logistic function. Observe that &#956; &isin; (0, 1) but &theta; &isin; &reals;, and therefore the sigmoid function squeezes a real value into the range (0, 1). This property is useful in machine learning; for example, it is used in logistic regression as well as as a nonlinear activation function in neural networks.**
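
A short sketch of this mapping and its inverse is given below. It assumes the standard relations μ = 1 / (1 + exp(-θ)) and θ = log(μ / (1 - μ)); the function names and the numerically stable branching are choices made for this example.

```python
import math


def sigmoid(theta):
    """Map a real natural parameter theta to a Bernoulli parameter mu."""
    # Two branches avoid overflow in exp() for large |theta|.
    if theta >= 0:
        return 1.0 / (1.0 + math.exp(-theta))
    e = math.exp(theta)
    return e / (1.0 + e)


def natural_parameter(mu):
    """Inverse of the sigmoid (the logit), defined for mu strictly in (0, 1)."""
    return math.log(mu / (1.0 - mu))


mu = sigmoid(0.0)                          # 0.5
round_trip = sigmoid(natural_parameter(0.8))  # close to 0.8
```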

***INFO 50:* Since machine learning algorithms are implemented on a computer, the mathematical formulations are expressed as numerical optimization methods.**



## Continuous Optimization 

***INFO 1:* Training a machine learning model often boils down to finding a good set of parameters. The notion of “good” is determined by the objective function or the probabilistic model.**

***INFO 2:* By convention, most objective functions in machine learning are intended to be minimized, that is, the best value is the minimum value.**

***INFO 3:* It turns out that many machine learning objective functions are designed such that they are convex.**

***INFO 4:* Gradient descent is a first-order optimization algorithm.**

***INFO 5:* The step-size is also called the learning rate.**
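
A minimal sketch of gradient descent with a fixed step-size is shown below. The quadratic objective, the starting point, and the step-size value 0.1 are assumptions chosen so that the iteration converges; they are not taken from these notes.

```python
def gradient_descent(grad, x0, step_size=0.1, num_steps=100):
    """Repeatedly step against the gradient with a fixed step-size."""
    x = list(x0)
    for _ in range(num_steps):
        g = grad(x)
        x = [xi - step_size * gi for xi, gi in zip(x, g)]
    return x


def grad_objective(x):
    # Gradient of f(x) = (x0 - 3)^2 + 2 * (x1 + 1)^2, whose minimum is at (3, -1).
    return [2.0 * (x[0] - 3.0), 4.0 * (x[1] + 1.0)]


x_min = gradient_descent(grad_objective, [0.0, 0.0])
```

A step-size that is too large makes the iterates overshoot and diverge, while one that is too small makes progress slow, which is why the learning rate is usually tuned.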

***INFO 6:* Standard gradient descent, as introduced previously, is a “batch” optimization method, i.e., optimization is performed using the full training set by updating the vector of parameters.**

***INFO 7:* When the learning rate decreases at an appropriate rate, and subject to relatively mild assumptions, stochastic gradient descent converges almost surely to a local minimum (Bottou, 1998).**

***INFO 8:* In machine learning, optimization methods are used for training by minimizing an objective function on the training data, but the overall goal is to improve generalization performance.**

***INFO 9:* Since the goal in machine learning does not necessarily need a precise estimate of the minimum of the objective function, approximate gradients using mini-batch approaches have been widely used.**
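
The mini-batch idea can be sketched on a one-parameter model y ≈ w·x with squared loss. Everything here (the function `minibatch_sgd`, the noiseless synthetic data, the batch size, and the step-size) is an assumption for illustration.

```python
import random


def minibatch_sgd(xs, ys, step_size=0.05, batch_size=4, num_epochs=200, seed=0):
    """Fit y ~ w * x by mini-batch stochastic gradient descent."""
    rng = random.Random(seed)
    w = 0.0
    indices = list(range(len(xs)))
    for _ in range(num_epochs):
        rng.shuffle(indices)  # a new random partition into mini-batches each epoch
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            # Gradient of the mean squared error, estimated on the mini-batch only.
            g = sum(2.0 * (w * xs[i] - ys[i]) * xs[i] for i in batch) / len(batch)
            w -= step_size * g
    return w


xs = [0.1 * i for i in range(1, 21)]
ys = [2.0 * x for x in xs]  # noiseless data with true slope 2
w = minibatch_sgd(xs, ys)
```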

***INFO 10:* Stochastic gradient descent is very effective in large-scale machine learning problems such as training deep neural networks on millions of images, topic models, reinforcement learning, or training of large-scale Gaussian process models.**

***INFO 11:* In machine learning, we often use sums of functions; for example, the objective function of the training set includes a sum of the losses for each example in the training set.**

***INFO 12:* The Legendre-Fenchel conjugate turns out to be quite useful for machine learning problems that can be expressed as convex optimization problems. In particular, for convex loss functions that apply independently to each example, the conjugate loss is a convenient way to derive a dual problem.**

***INFO 13:* Modern applications of machine learning often mean that the size of datasets prohibits the use of batch gradient descent, and hence stochastic gradient descent is the current workhorse of large-scale machine learning methods.**


## 8. Central Machine Learning Problems
### 8.0. When Models Meet Data
#### Four pillars of Machine Learning:
1. Regression

2. Dimensionality reduction

3. Density estimation

4. Classification

### 8.1. Data, Models, and Learning 

***INFO 1:* There are three major components of a machine learning system: data, models, and learning. The main question of machine learning is “What do we mean by good models?”.**
 
#### 8.1.1. Data as Vectors

***INFO 2:* Data is assumed to be in a tidy format. Each row of the table represents a particular instance or example, and each column represents a particular feature.**

***INFO 3:* Even numerical data that could potentially be directly read into a machine learning algorithm should be carefully considered for units, scaling, and constraints.**

***INFO 4:* Each row is a particular individual x_n, often referred to as an example or data point in machine learning. Each column represents a particular feature of interest about the example, and we index the features as d = 1, ..., D.**

***INFO 5:* In many machine learning algorithms, we need to additionally be able to compare two vectors.**

***INFO 6:* In recent years, deep learning methods have shown promise in using the data itself to learn new good features and have been very successful in areas such as computer vision, speech recognition, and natural language processing.**

#### 8.1.2. Models as Functions

***INFO 7:* A predictor is a function that, when given a particular input example, produces an output.**

***INFO 8:* Linear functions strike a good balance between the generality of the problems that can be solved and the amount of background mathematics that is needed.**

#### 8.1.3 Models as Probability Distributions

***INFO 9:* Instead of considering a predictor as a single function, we could consider predictors to be probabilistic models, i.e., models describing the distribution of possible functions.**

#### 8.1.4 Learning is Finding Parameters

***INFO 10:* The goal of learning is to find a model and its corresponding parameters such that the resulting predictor will perform well on unseen data.**

***INFO 11:* There are conceptually three distinct algorithmic phases when discussing machine learning algorithms: (1) *prediction or inference*, (2) *training or parameter estimation*, and (3) *hyperparameter tuning or model selection*.**

***INFO 12:* We are interested in learning a model based on data such that it performs well on future data. It is not enough for the model to only fit the training data well; the predictor needs to perform well on unseen data.**

***INFO 13:* The distinction between parameters and hyperparameters is somewhat arbitrary, and is mostly driven by the distinction between what can be numerically optimized versus what needs to use search techniques.**


### 8.2 Empirical Risk Minimization

***INFO 14:* The “learning” part of machine learning boils down to estimating parameters based on training data.**

#### 8.2.1 Hypothesis Class of Functions 

***INFO 15:* It is also common in machine learning to choose a parametrized class of functions, for example affine functions.**

***INFO 16:* Recent advances in neural networks allow for efficient computation of more complex non-linear function classes.**

#### 8.2.2 Loss Function for Training

***INFO 17:* We are not interested in a predictor that only performs well on the training data. Instead, we seek a predictor that performs well (has low risk) on unseen test data.**

***INFO 18:* Many machine learning tasks are specified with an associated performance measure, e.g., accuracy of prediction or root mean squared error.**
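
The two performance measures named above can be sketched as follows. The function names and the toy predictions are hypothetical.

```python
import math


def root_mean_squared_error(y_true, y_pred):
    """Square root of the average squared difference (regression)."""
    n = len(y_true)
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)


def accuracy(labels_true, labels_pred):
    """Fraction of predictions that match the true label (classification)."""
    correct = sum(1 for t, p in zip(labels_true, labels_pred) if t == p)
    return correct / len(labels_true)


rmse = root_mean_squared_error([1.0, 2.0, 3.0], [1.0, 2.0, 5.0])  # sqrt(4 / 3)
acc = accuracy([0, 1, 1, 0], [0, 1, 0, 0])                        # 3 of 4 correct
```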

#### 8.2.3 Regularization to Reduce Overfitting

***INFO 19:* We simulate this unseen data by holding out a proportion of the whole dataset.**

***INFO 20:* It is important for the user to not cycle back to a new round of training after having observed the test set.**

***INFO 21:* In machine learning, the penalty term is referred to as regularization. Regularization is a way to compromise between accurate solution of empirical risk minimization and the size or complexity of the solution.**
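
As a sketch of this compromise, consider a one-parameter model y ≈ w·x fitted with a squared L2 penalty, i.e. minimizing (1/N) Σ (y_n - w x_n)² + λ w². The penalty form, the function `ridge_slope`, and the data are assumptions for illustration; setting the derivative to zero gives the closed form used below.

```python
def ridge_slope(xs, ys, lam):
    """Minimizer of (1/N) * sum((y - w*x)^2) + lam * w^2 over the scalar w."""
    n = len(xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    # Setting the derivative to zero gives w = sum(x*y) / (sum(x*x) + N*lam).
    return sxy / (sxx + n * lam)


xs = [1.0, 2.0, 3.0]
ys = [2.1, 3.9, 6.2]
w_plain = ridge_slope(xs, ys, lam=0.0)  # pure empirical risk minimization
w_reg = ridge_slope(xs, ys, lam=1.0)    # penalized: pulled toward zero
```

Increasing `lam` trades a worse fit on the training data for a smaller parameter.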

#### 8.2.4 Cross-Validation to Assess the Generalization Performance

***INFO 22:* Cross-validation iterates through (ideally) all combinations of assignments of chunks to R (training set) and V (validation set).**
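
One common way to assign chunks, K-fold cross-validation, can be sketched as below: each chunk serves once as the validation set while the remaining chunks form the training set. The function `k_fold_splits` and the contiguous, unshuffled chunking are simplifying assumptions.

```python
def k_fold_splits(num_examples, k):
    """Return k (training indices, validation indices) pairs."""
    indices = list(range(num_examples))
    # Spread the remainder over the first folds so sizes differ by at most one.
    fold_sizes = [num_examples // k + (1 if i < num_examples % k else 0)
                  for i in range(k)]
    splits = []
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        training = indices[:start] + indices[start + size:]
        splits.append((training, validation))
        start += size
    return splits


splits = k_fold_splits(10, 3)
for training, validation in splits:
    # A model would be trained on `training` and evaluated on `validation` here.
    pass
```

Because the K training runs do not depend on each other, they can be executed in parallel.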

***INFO 23:* Evaluating the quality of the model, depending on these hyperparameters, may result in a number of training runs that is exponential in the number of model parameters.**

***INFO 24:* However, cross-validation is an embarrassingly parallel problem, i.e., little effort is needed to separate the problem into a number of parallel tasks. Given sufficient computing resources (e.g., cloud computing, server farms), cross-validation does not require longer than a single performance assessment.**

***INFO 25:* In this section, we saw that empirical risk minimization is based on the following concepts: the hypothesis class of functions, the loss function, and regularization.**

***INFO 26:* A recent machine learning theory textbook that builds on the theoretical foundations and develops efficient learning algorithms is Shalev-Shwartz and Ben-David (2014).**

***INFO 27:* Alternatives to cross-validation are the bootstrap and the jackknife (Efron and Tibshirani, 1993; Davison and Hinkley, 1997; Hall, 1992).**

***INFO 28:* Thinking about empirical risk minimization as “probability free” is incorrect.**

### 8.3 Parameter Estimation

#### 8.3.1 Maximum Likelihood Estimation

***INFO 29:* The idea behind maximum likelihood estimation (MLE) is to define a function of the parameters that enables us to find a model that fits the data well.**
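
As a minimal sketch, assume the data are modeled as i.i.d. draws from a univariate Gaussian. The function of the parameters is then the negative log-likelihood, and its minimizer has a closed form: the sample mean and the variance with N in the denominator. The function names and the data are illustrative.

```python
import math


def gaussian_nll(data, mean, variance):
    """Negative log-likelihood of i.i.d. data under N(mean, variance)."""
    n = len(data)
    squared_error = sum((x - mean) ** 2 for x in data)
    return 0.5 * n * math.log(2.0 * math.pi * variance) + squared_error / (2.0 * variance)


def gaussian_mle(data):
    """Closed-form maximum likelihood estimates of the mean and variance."""
    n = len(data)
    mean = sum(data) / n
    variance = sum((x - mean) ** 2 for x in data) / n  # divides by N, not N - 1
    return mean, variance


data = [2.1, 1.9, 2.4, 2.0, 1.6]
mean_hat, var_hat = gaussian_mle(data)
best = gaussian_nll(data, mean_hat, var_hat)
# Any other parameter setting should give a larger negative log-likelihood.
worse = gaussian_nll(data, mean_hat + 0.1, var_hat * 1.5)
```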
