### Implementation note: Unrolling Parameters

#### Advanced Optimization
Neural Network (L=4)   
* $\Theta^{(1)}, \Theta^{(2)}, \Theta^{(3)}$ - matrices (Theta1, Theta2, Theta3)
* $D^{(1)}, D^{(2)}, D^{(3)} $ - matrices

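Advanced optimization routines such as fminunc expect the parameters and the gradient as vectors, so "unroll" the matrices into vectors.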


#### Example

$s_1 = 10, s_2 = 10, s_3 = 10, s_4 = 1$

$\Theta^{(1)} \in \mathbb{R}^{10 \times 11}, \Theta^{(2)} \in \mathbb{R}^{10 \times 11}, \Theta^{(3)} \in \mathbb{R}^{1 \times 11}$    
$D^{(1)} \in \mathbb{R}^{10 \times 11}, D^{(2)} \in \mathbb{R}^{10 \times 11}, D^{(3)} \in \mathbb{R}^{1 \times 11}$

thetaVec = [ Theta1(:); Theta2(:); Theta3(:)];   
DVec = [ D1(:); D2(:); D3(:)];   

Reshape to get the matrices back from the vector:   
Theta1 = reshape(thetaVec(1:110), 10, 11);   
Theta2 = reshape(thetaVec(111:220), 10, 11);   
Theta3 = reshape(thetaVec(221:231), 1, 11);   
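
The unroll and reshape steps can be sketched in plain Python to make the index arithmetic explicit. This is only an illustration, not part of the original notes: `unroll`, `reshape` and `make_matrix` are hypothetical helpers, matrices are lists of rows, and the flattening is column by column to mimic Octave's `A(:)`.

```python
def unroll(matrix):
    """Flatten a list-of-rows matrix column by column, like Octave's A(:)."""
    rows, cols = len(matrix), len(matrix[0])
    return [matrix[r][c] for c in range(cols) for r in range(rows)]


def reshape(vec, rows, cols):
    """Rebuild a rows x cols matrix from a column-major vector."""
    assert len(vec) == rows * cols
    return [[vec[c * rows + r] for c in range(cols)] for r in range(rows)]


def make_matrix(rows, cols, start):
    """Hypothetical helper: distinct entries so the round trip can be checked."""
    return [[start + r * cols + c for c in range(cols)] for r in range(rows)]


theta1 = make_matrix(10, 11, 0)
theta2 = make_matrix(10, 11, 1000)
theta3 = make_matrix(1, 11, 2000)

# Same idea as thetaVec = [Theta1(:); Theta2(:); Theta3(:)]
theta_vec = unroll(theta1) + unroll(theta2) + unroll(theta3)

# Octave's 1-based inclusive ranges 1:110, 111:220, 221:231
# become 0-based half-open slices in Python.
theta1_back = reshape(theta_vec[0:110], 10, 11)
theta2_back = reshape(theta_vec[110:220], 10, 11)
theta3_back = reshape(theta_vec[220:231], 1, 11)
```

The vector has 110 + 110 + 11 = 231 entries, and reshaping each slice with the original dimensions returns the original matrices.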

Unroll to get initialTheta to pass to fminunc(@costFunction, initialTheta, options)    

function [jVal, gradientVec] = costFunction(thetaVec)      
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; From thetaVec get $\Theta^{(1)}, \Theta^{(2)},\Theta^{(3)}$        
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Use forward prop / back prop to compute $D^{(1)}, D^{(2)}, D^{(3)}$ and $J(\Theta)$     
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Unroll  $D^{(1)}, D^{(2)}, D^{(3)}$  to get gradientVec.    
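
The shape of such a cost function can be sketched in Python. Everything here is an assumption made for illustration: forward prop / back prop is replaced by a toy cost $J = \frac{1}{2}\sum w^2$ (whose gradient matrices equal the weight matrices), and `SHAPES`, `split`, `unroll` and `reshape` are hypothetical names. Only the pattern "reshape, compute, unroll" is the point.

```python
SHAPES = [(10, 11), (10, 11), (1, 11)]  # dimensions of Theta1, Theta2, Theta3


def unroll(matrix):
    """Flatten a list-of-rows matrix column by column, like Octave's A(:)."""
    rows, cols = len(matrix), len(matrix[0])
    return [matrix[r][c] for c in range(cols) for r in range(rows)]


def reshape(vec, rows, cols):
    """Rebuild a rows x cols matrix from a column-major vector."""
    return [[vec[c * rows + r] for c in range(cols)] for r in range(rows)]


def split(theta_vec):
    """Step 1: recover the weight matrices from the unrolled vector."""
    matrices, start = [], 0
    for rows, cols in SHAPES:
        size = rows * cols
        matrices.append(reshape(theta_vec[start:start + size], rows, cols))
        start += size
    return matrices


def cost_function(theta_vec):
    thetas = split(theta_vec)
    # Step 2: toy stand-in for forward prop / back prop.
    # J = 0.5 * sum of squared weights, so each gradient matrix D equals Theta.
    j_val = 0.5 * sum(w * w for theta in thetas for row in theta for w in row)
    grads = [[list(row) for row in theta] for theta in thetas]
    # Step 3: unroll D1, D2, D3 into a single gradient vector.
    gradient_vec = [g for d in grads for g in unroll(d)]
    return j_val, gradient_vec


theta_vec = [i / 100 for i in range(231)]
j_val, gradient_vec = cost_function(theta_vec)
```

Because the gradient is unrolled in the same order as the parameters, entry `i` of `gradient_vec` is the partial derivative with respect to entry `i` of `theta_vec`, which is what a vector-based optimizer relies on.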




### Gradient Checking

Useful to check whether the gradient is implemented correctly.

Suppose we have a function $J(\theta)$ with $\theta \in \mathbb{R}$. The derivative at a point $\theta$ is the slope of the function there. For a small $\epsilon$ (e.g. $10^{-4}$) it can be approximated by the two-sided difference:    

$\frac{d}{d\theta}J(\theta) \approx \frac{J(\theta + \epsilon) - J(\theta - \epsilon)}{2\epsilon}$


Implement gradApprox = (J(theta + EPSILON) - J(theta - EPSILON)) / (2*EPSILON) as a numerical estimate of the gradient.
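
As a quick sanity check of the two-sided estimate, here is a minimal Python sketch for a scalar parameter. The cost $J(\theta) = \theta^3$ is an arbitrary example chosen only because its exact derivative $3\theta^2$ is known; `grad_approx` is a hypothetical name.

```python
def grad_approx(J, theta, epsilon=1e-4):
    """Two-sided (central) difference estimate of dJ/dtheta."""
    return (J(theta + epsilon) - J(theta - epsilon)) / (2 * epsilon)


def J(theta):
    """Example cost with a known derivative: dJ/dtheta = 3 * theta^2."""
    return theta ** 3


estimate = grad_approx(J, 1.0)
exact = 3 * 1.0 ** 2
error = abs(estimate - exact)
```

For this cost the error of the two-sided estimate is of order $\epsilon^2$, which is why it is preferred over the one-sided difference $(J(\theta + \epsilon) - J(\theta)) / \epsilon$.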



#### Parameter Vector $\theta$  
$\theta \in \mathbb{R}^{n}$ (e.g. $\theta$ is the unrolled version of $\Theta^{(1)}, \Theta^{(2)}, \Theta^{(3)}$)    
$\theta = \begin{bmatrix}\theta_1 & \theta_2 & \theta_3 & \dots & \theta_n \end{bmatrix}$


$\frac{\partial }{\partial \theta_1}J(\theta) \approx 
\frac{J(\theta_1 + \epsilon, \theta_2, \dots, \theta_n) - J(\theta_1 - \epsilon, \theta_2, \dots, \theta_n)}{2\epsilon}$    

$\frac{\partial }{\partial \theta_2}J(\theta) \approx 
\frac{J(\theta_1, \theta_2 + \epsilon, \dots, \theta_n) - J(\theta_1, \theta_2 - \epsilon, \dots, \theta_n)}{2\epsilon}$

$\vdots$

$\frac{\partial }{\partial \theta_n}J(\theta) \approx 
\frac{J(\theta_1, \theta_2, \dots, \theta_n + \epsilon) - J(\theta_1, \theta_2, \dots, \theta_n - \epsilon)}{2\epsilon}$



E.g. of implementation  

epsilon = 1e-4;   
for i = 1:n,   
  thetaPlus = theta;   
  thetaPlus(i) += epsilon;   
  thetaMinus = theta;   
  thetaMinus(i) -= epsilon;   
  gradApprox(i) = (J(thetaPlus) - J(thetaMinus))/(2*epsilon);   
end;   

Check that gradApprox $\approx$ DVec   
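
The whole check can be sketched end to end in Python. This is an illustration under stated assumptions: the cost $J(\theta) = \sum_i i\,\theta_i^2$ is a toy stand-in for the network cost, `analytic_gradient` plays the role of backprop's DVec, and all function names are hypothetical.

```python
def numerical_gradient(J, theta, epsilon=1e-4):
    """Perturb one component at a time and apply the two-sided difference."""
    grad = []
    for i in range(len(theta)):
        theta_plus = list(theta)
        theta_plus[i] += epsilon
        theta_minus = list(theta)
        theta_minus[i] -= epsilon
        grad.append((J(theta_plus) - J(theta_minus)) / (2 * epsilon))
    return grad


def J(theta):
    """Toy cost: sum over i of (i + 1) * theta_i^2 (0-based i)."""
    return sum((i + 1) * t * t for i, t in enumerate(theta))


def analytic_gradient(theta):
    """Exact gradient of the toy cost; stands in for DVec from backprop."""
    return [2 * (i + 1) * t for i, t in enumerate(theta)]


theta = [0.5, -1.0, 2.0]
grad_approx = numerical_gradient(J, theta)
d_vec = analytic_gradient(theta)

# The two vectors should agree to several decimal places.
max_diff = max(abs(a - b) for a, b in zip(grad_approx, d_vec))
```

Note that the loop evaluates `J` twice per parameter, so the cost of one check grows linearly with the number of parameters.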

#### Implementation Note
- Implement backprop to compute DVec (unrolled D1, D2, ...)
- Implement numerical gradient check to compute gradApprox
- Make sure they give similar values
- Turn off gradient checking and use the backprop code for learning

#### Important
- Be sure to disable your gradient checking code before training your classifier. If you run the numerical gradient computation on every iteration of gradient descent, the code will be very slow.



### Random Initialization 

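For gradient descent we need initial values for $\Theta$. 
If we initialize all of them to zero, every unit in a hidden layer computes the same activation and receives the same gradient, so after each update the units of that layer still have identical weights and keep computing the same feature.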

To avoid this problem (symmetry breaking), initialize each $\Theta_{ij}^{(l)}$ to a random value in $[-\epsilon, \epsilon]$. Note that this $\epsilon$ is unrelated to the $\epsilon$ used in gradient checking.

E.g., if Theta1 is 10x11, Theta2 is 10x11 and Theta3 is 1x11:

Theta1 = rand(10,11) * (2 * INIT_EPSILON) - INIT_EPSILON;         
Theta2 = rand(10,11) * (2 * INIT_EPSILON) - INIT_EPSILON;      
Theta3 = rand(1,11) * (2 * INIT_EPSILON) - INIT_EPSILON;       
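
The same initialization can be sketched in Python with the standard `random` module. The value `INIT_EPSILON = 0.12` is an assumption chosen for illustration (the notes do not fix it), and `rand_init` is a hypothetical helper; the seed is fixed only to make the example reproducible.

```python
import random

INIT_EPSILON = 0.12  # assumed value, not given in the notes


def rand_init(rows, cols, rng, init_epsilon=INIT_EPSILON):
    """Matrix of independent draws, each mapped from [0, 1) to [-eps, eps)."""
    return [
        [rng.random() * (2 * init_epsilon) - init_epsilon for _ in range(cols)]
        for _ in range(rows)
    ]


rng = random.Random(0)  # fixed seed for reproducibility only
theta1 = rand_init(10, 11, rng)
theta2 = rand_init(10, 11, rng)
theta3 = rand_init(1, 11, rng)
```

Since every weight is drawn independently, units in the same layer start with different weights, which is what breaks the symmetry described above.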

