---
layout: post
title: Deep learning notes
date: 2018-06-10 
categories: [deep learning]
tags: [deep learning]
---
Deep learning notes
<!--more-->

# Part II - Deep Networks: Modern Practices
## Chapter 6 - Deep Feedforward Networks
### 6.2 Gradient-Based Learning
For feedforward neural networks, it is important to initialize all weights to small random values. The biases may be initialized to zero or to small positive values.
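
As a rough illustration of this initialization advice, the sketch below builds the parameters of one affine layer in plain Python. The helper name `init_layer`, the Gaussian distribution, and the scale of 0.01 are assumptions made for the example; the notes only ask for "small random values".

```python
import random

def init_layer(n_in, n_out, scale=0.01, bias=0.0, seed=0):
    """Return (W, b) for an affine layer z = W^T x + b.

    W has shape n_in x n_out with small zero-mean Gaussian entries
    (the distribution and scale are illustrative assumptions).
    b is filled with a constant: zero or a small positive value.
    """
    rng = random.Random(seed)
    W = [[rng.gauss(0.0, scale) for _ in range(n_out)] for _ in range(n_in)]
    b = [bias] * n_out
    return W, b

# Small random weights, small positive biases.
W, b = init_layer(n_in=4, n_out=3, bias=0.1)
```

Random weights break the symmetry between units in the same layer; if every weight started at the same value, those units would receive identical updates.
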
#### 6.2.1 Cost Function
In most cases, our parametric model defines a distribution $p(y|x;\theta)$ and we simply use the principle of maximum likelihood. This means we use the cross-entropy between the training data and the model's predictions as the cost function.  
Mean squared error and mean absolute error often lead to poor results when used with gradient-based optimization. Some output units that saturate produce very small gradients when combined with these cost functions.
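
To make the saturation problem concrete, consider a single sigmoid output unit $\hat{y}=\sigma(z)$ with target $y$. The sketch below is an illustration with hypothetical function names, not something from the notes. It compares the derivative of the loss with respect to $z$ for squared error, $(\hat{y}-y)\hat{y}(1-\hat{y})$, and for cross-entropy, $\hat{y}-y$.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def grad_mse(z, y):
    """d/dz of 0.5 * (sigmoid(z) - y)**2."""
    y_hat = sigmoid(z)
    return (y_hat - y) * y_hat * (1.0 - y_hat)

def grad_cross_entropy(z, y):
    """d/dz of -[y*log(sigmoid(z)) + (1 - y)*log(1 - sigmoid(z))]."""
    return sigmoid(z) - y

# A confidently wrong prediction: the unit is saturated near 0, the target is 1.
z, y = -10.0, 1.0
g_mse = grad_mse(z, y)           # tiny: the sigmoid derivative crushes it
g_ce = grad_cross_entropy(z, y)  # close to -1: still a strong learning signal
```

The extra factor $\hat{y}(1-\hat{y})$ in the squared-error gradient vanishes exactly when the unit saturates, which is why learning stalls; the log in the cross-entropy undoes the exponential in the sigmoid and removes that factor.
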
#### 6.2.2 Output Units
In general, if we define a conditional distribution $p(y|x;\theta)$, the principle of maximum likelihood suggests we use $-\log p(y|x;\theta)$ as our cost function.  
It has been reported that gradient-based optimization of conditional Gaussian mixtures can be unreliable, in part because one gets divisions (by the variance) which can be numerically unstable. One solution is to clip gradients, while another is to scale the gradients heuristically.
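
The notes do not say which form of clipping is meant. One common variant, assumed here, rescales the whole gradient vector when its norm exceeds a threshold. The function name and the threshold value are hypothetical.

```python
import math

def clip_by_norm(grad, threshold):
    """Rescale grad so that its Euclidean norm is at most threshold.

    The direction of the gradient is preserved; only its length changes.
    """
    norm = math.sqrt(sum(g * g for g in grad))
    if norm <= threshold:
        return list(grad)
    scale = threshold / norm
    return [g * scale for g in grad]

# A very large gradient, e.g. caused by dividing by a tiny variance.
clipped = clip_by_norm([3000.0, 4000.0], threshold=1.0)
```
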
### 6.3 Hidden Units
Unless indicated otherwise, most hidden units can be described as accepting a vector of inputs $x$, computing an affine transformation $z=W^Tx+b$, and then applying an element-wise nonlinear function $g(z)$. Most hidden units are distinguished from each other only by the choice of the form of the activation function $g(z)$.
#### 6.3.1 Rectified linear units and their generalizations
Rectified linear units are typically used on top of an affine transformation:
$$h=g(W^Tx+b).$$
When initializing the parameters of the affine transformation, it can be a good practice to set all elements of $b$ to a small positive value, such as 0.1.  
One drawback to rectified linear units is that they cannot learn via gradient-based methods on examples for which their activation is zero. Various generalizations of rectified linear units guarantee that they receive gradient everywhere.  
Three generalizations of rectified linear units are based on using a nonzero slope $\alpha_i$ when $z_i<0$: $h_i=g(z,\alpha)_i=\max(0,z_i)+\alpha_i\min(0,z_i)$. Absolute value rectification fixes $\alpha_i=-1$ to obtain $g(z)=|z|$. A leaky ReLU fixes $\alpha_i$ to a small value like 0.01, while a parametric ReLU, or PReLU, treats $\alpha_i$ as a learnable parameter.  
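
The three variants differ only in how $\alpha_i$ is chosen, which a short scalar sketch makes visible. The function names are hypothetical, and `prelu_grad_alpha` is included only to show what "learnable" means for PReLU.

```python
def generalized_relu(z, alpha):
    """h = max(0, z) + alpha * min(0, z) for a single pre-activation z."""
    return max(0.0, z) + alpha * min(0.0, z)

def relu(z):
    return generalized_relu(z, 0.0)        # slope 0 for z < 0

def absolute_value(z):
    return generalized_relu(z, -1.0)       # alpha = -1 gives |z|

def leaky_relu(z, alpha=0.01):
    return generalized_relu(z, alpha)      # small fixed slope for z < 0

def prelu_grad_alpha(z):
    """dh/dalpha = min(0, z): the signal used to update a learnable alpha."""
    return min(0.0, z)

outputs = [f(-2.0) for f in (relu, absolute_value, leaky_relu)]
```

For a negative input such as $z=-2$, the plain ReLU outputs 0, whereas the other variants keep a nonzero slope, so gradient still flows through the unit.
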
Maxout units generalize rectified linear units further. Instead of applying an element-wise function $g(z)$, maxout units divide $z$ into groups of $k$ values. Each maxout unit then outputs the maximum element of one of these groups:
$$g(z)_i=\max_{j\in G^{(i)}}z_j,$$
where $G^{(i)}$ is the set of indices into the inputs for group $i$, $\{(i-1)k+1,...,ik\}$.  
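
A minimal sketch of the maxout operation, assuming $z$ is a flat list whose length is a multiple of $k$. The function name is hypothetical. The 1-based index set $\{(i-1)k+1,...,ik\}$ corresponds to the 0-based slice `z[(i-1)*k : i*k]`.

```python
def maxout(z, k):
    """Split z into consecutive groups of k values and take the max of each."""
    if k <= 0 or len(z) % k != 0:
        raise ValueError("len(z) must be a positive multiple of k")
    return [max(z[start:start + k]) for start in range(0, len(z), k)]

# Six pre-activations, groups of k = 2, so three maxout units.
h = maxout([1.0, -2.0, 0.5, 3.0, -1.0, -4.0], k=2)
```
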
Rectified linear units and all these generalizations of them are based on the principle that models are easier to optimize if their behavior is closer to linear. This same general principle of using linear behavior to obtain easier optimization also applies in other contexts besides deep linear networks.
### 6.4 Architecture Design
#### 6.4.1 Universal approximation properties and depth
The universal approximation theorem states that a feedforward network with a linear output layer and at least one hidden layer with any "squashing" activation function (such as the logistic sigmoid) can approximate any Borel measurable function from one finite-dimensional space to another with any desired nonzero amount of error, provided that the network is given enough hidden units.
#### 6.4.2 Other architectural considerations
![Effect of number of parameters.](/assets/2018-06-10/6-1.png)
The figure shows that increasing the number of parameters in layers of convolutional networks without increasing their depth is not nearly as effective at increasing test set performance. We observe that shallow models in this context overfit at around 20 million parameters while deep ones can benefit from having over 60 million. This suggests that using a deep model expresses a useful preference over the space of functions the model can learn.
### 6.6 Historical notes
Most of the improvement in neural network performance from 1986 to 2015 can be attributed to two factors. First, larger datasets have reduced the degree to which statistical generalization is a challenge for neural networks. Second, neural networks have become much larger, because of more powerful computers and better software infrastructure.  
A small number of algorithmic changes have also improved the performance of neural networks noticeably. One of these was the replacement of mean squared error with the cross-entropy family of loss functions. The use of cross-entropy losses greatly improved the performance of models with sigmoid and softmax outputs, which had previously suffered from saturation and slow learning when using the mean squared error loss. The other major algorithmic change that has greatly improved the performance of feedforward networks was the replacement of sigmoid hidden units with piecewise linear hidden units, such as rectified linear units. The design of rectified linear units was also influenced by neuroscience.
## Chapter 7 - Regularization for Deep Learning
### 7.1 Parameter norm penalties
We denote the regularized objective function by $\tilde{J}$:
$$\tilde{J}(\theta;X,y)=J(\theta;X,y)+\alpha\Omega(\theta)$$
For neural networks, we typically choose to use a parameter norm penalty $\Omega$ that penalizes only the weights of the affine transformation at each layer and leaves the biases unregularized. Also, regularizing the bias parameters can introduce a significant amount of underfitting.  
It is sometimes desirable to use a separate penalty with a different $\alpha$ coefficient for each layer. Because searching over multiple hyperparameters is expensive, it is still reasonable to use the same weight decay at all layers just to reduce the size of the search space.
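
The sketch below shows one way to compute $\tilde{J}=J+\alpha\Omega(\theta)$ with a single shared $\alpha$. It assumes an $L^2$ penalty $\Omega(\theta)=\frac{1}{2}\sum w^2$, which is only one possible choice of norm, and hypothetical function names. Biases are never passed in, so they stay unregularized.

```python
def l2_penalty(weight_matrices):
    """Omega(theta) = 0.5 * sum of squared weights over all layers.

    weight_matrices is a list of per-layer weight matrices (lists of rows).
    Bias vectors are deliberately excluded from the penalty.
    """
    return 0.5 * sum(w * w for W in weight_matrices for row in W for w in row)

def regularized_objective(J, weight_matrices, alpha):
    """J_tilde = J + alpha * Omega(theta), with one alpha shared by all layers."""
    return J + alpha * l2_penalty(weight_matrices)

# Two layers with illustrative weights; J is an illustrative unregularized loss.
layers = [[[1.0, 2.0], [0.0, -1.0]], [[3.0]]]
J_tilde = regularized_objective(J=2.0, weight_matrices=layers, alpha=0.1)
```

Setting `alpha=0.0` recovers the unregularized objective; larger values put more weight on the penalty.
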
### 7.12 Dropout
One advantage of dropout is that it is very computationally cheap. Another significant advantage of dropout is that it does not significantly limit the type of model or training procedure that can be used.  
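
To see why the cost is low, here is a sketch of dropout applied to one layer's activations at training time. It uses the "inverted dropout" convention, in which kept units are rescaled by `1 / keep_prob`; that convention, the function name, and the parameter names are assumptions for illustration, not something stated in the notes.

```python
import random

def dropout(h, keep_prob, rng):
    """Zero each activation with probability 1 - keep_prob.

    Kept activations are divided by keep_prob so the expected value of each
    output matches its input. Cost: one random draw and one multiply per unit.
    """
    if not 0.0 < keep_prob <= 1.0:
        raise ValueError("keep_prob must be in (0, 1]")
    return [v / keep_prob if rng.random() < keep_prob else 0.0 for v in h]

rng = random.Random(0)
h = [1.0] * 8
dropped = dropout(h, keep_prob=0.5, rng=rng)
```

The function only touches a list of activations and knows nothing about the model that produced them, which reflects the second advantage: the same masking step fits many architectures and training procedures.
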
## Chapter 8 - Optimization for Training Deep Models

# Part I - Applied Math and Machine Learning Basics
## Chapter 5 - Machine Learning Basics
### 5.1 Learning algorithms
A computer program is said to learn from experience $E$ with respect to some class of tasks $T$ and performance measure $P$, if its performance at tasks in $T$, as measured by $P$, improves with experience $E$.  
### 5.2 Capacity, overfitting and underfitting
#### 5.2.2 Regularization
Regularization is any modification we make to a learning algorithm that is intended to reduce its generalization error but not its training error.
### 5.8 Unsupervised learning algorithms
There are multiple ways of defining a simple representation. Three of the most common are lower-dimensional representations, sparse representations, and independent representations. Low-dimensional representations attempt to compress as much information about $x$ as possible in a small representation. Sparse representations embed the dataset into a representation whose entries are mostly zeros for most inputs. The use of sparse representations typically requires increasing the dimensionality of the representation, so that making the representation mostly zeros does not discard too much information. This results in an overall structure of the representation that tends to distribute data along the axes of the representation space. Independent representations attempt to disentangle the sources of variation underlying the data distribution such that the dimensions of the representation are statistically independent.
### 5.10 Building a machine learning algorithm
Nearly all deep learning algorithms can be described as particular instances of a fairly simple recipe: combine a specification of a dataset, a cost function, an optimization procedure and a model.
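
The recipe can be sketched with the simplest possible instance: linear regression trained by batch gradient descent. The dataset, learning rate, and iteration count below are made-up values chosen for illustration. Mean squared error is a reasonable cost here because the output is linear and does not saturate.

```python
# 1. Dataset: illustrative points lying exactly on y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]

# 2. Model: a line with parameters w and b.
def predict(w, b, x):
    return w * x + b

# 3. Cost function: mean squared error over the dataset.
def cost(w, b):
    return sum((predict(w, b, x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def gradient(w, b):
    """Partial derivatives of the cost with respect to w and b."""
    n = len(xs)
    dw = sum(2.0 * (predict(w, b, x) - y) * x for x, y in zip(xs, ys)) / n
    db = sum(2.0 * (predict(w, b, x) - y) for x, y in zip(xs, ys)) / n
    return dw, db

# 4. Optimization procedure: batch gradient descent.
w, b = 0.0, 0.0
learning_rate = 0.05
for _ in range(2000):
    dw, db = gradient(w, b)
    w -= learning_rate * dw
    b -= learning_rate * db
```

Each of the four pieces can be replaced largely independently of the others, for example swapping the line for a deep network or the squared error for a cross-entropy cost.
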