
Merge pull request #1706 from amchercashin/mu-to-learning_rate_2
rename var mu to learning_rate
Vijay Vasudevan committed Mar 31, 2016
2 parents 0ec9c62 + 3714b97 commit b4b276e
Showing 1 changed file with 7 additions and 7 deletions.
14 changes: 7 additions & 7 deletions tensorflow/tools/docker/notebooks/2_getting_started.ipynb
@@ -591,7 +591,7 @@
"# The learning rate. Also known has the step size. This changes how far\n",
"# we move down the gradient toward lower error at each step. Too large\n",
"# jumps risk inaccuracy, too small slow the learning.\n",
"mu = 0.002\n",
"learning_rate = 0.002\n",
"\n",
"# In TensorFlow, we need to run everything in the context of a session.\n",
"with tf.Session() as sess:\n",
@@ -620,7 +620,7 @@
" loss = tf.nn.l2_loss(yerror)\n",
"\n",
" # Perform gradient descent. \n",
" # This essentially just updates weights, like weights += grads * mu\n",
" # This essentially just updates weights, like weights += grads * learning_rate\n",
" # using the partial derivative of the loss with respect to the\n",
" # weights. It's the direction we want to go to move toward lower error.\n",
" update_weights = tf.train.GradientDescentOptimizer(learning_rate).minimize(loss)\n",
@@ -753,8 +753,8 @@
"\n",
" tf.initialize_all_variables().run()\n",
" \n",
" # mu is the learning rate (step size), so how much we jump from the current spot\n",
" mu = 0.002\n",
" # learning_rate is the step size, so how much we jump from the current spot\n",
" learning_rate = 0.002\n",
" \n",
" # The operations in the operation graph.\n",
" # Compute the predicted y values given our current weights\n",
@@ -764,7 +764,7 @@
" # Change the weights by subtracting derivative with respect to that weight\n",
" loss = 0.5 * tf.reduce_sum(tf.mul(yerror, yerror))\n",
" gradient = tf.reduce_sum(tf.transpose(tf.mul(input, yerror)), 1, keep_dims=True)\n",
" update_weights = tf.assign_sub(weights, mu * gradient)\n",
" update_weights = tf.assign_sub(weights, learning_rate * gradient)\n",
" \n",
" # Repeatedly run the operation graph over the training data and weights.\n",
" for _ in range(training_steps):\n",
@@ -820,13 +820,13 @@
"\n",
">`gradient = tf.reduce_sum(tf.transpose(tf.mul(input, yerror)), 1, keep_dims=True)`\n",
"\n",
`update_">
-">`update_weights = tf.assign_sub(weights, mu * gradient)`\n",
+">`update_weights = tf.assign_sub(weights, learning_rate * gradient)`\n",
"\n",
"The first line calculates the L2 loss manually. It's the same as `l2_loss(yerror)`, which is half of the sum of the squared error, so $\\frac{1}{2} \\sum (\\hat{y} - y)^2$. With this code, you can see exactly what the `l2_loss` operation does. It's the total of all the squared differences between the target and our estimates. And minimizing the L2 loss will minimize how much our estimates of $y$ differ from the true values of $y$.\n",
"\n",
"The second line calculates $\\begin{bmatrix}\\sum{(\\hat{y} - y)*1} \\\\ \\sum{(\\hat{y} - y)*x_i}\\end{bmatrix}$. What is that? It's the partial derivatives of the L2 loss with respect to $w_1$ and $w_2$, the same thing as what `gradients(loss, weights)` does in the earlier code. Not sure about that? Let's look at it in more detail. The gradient calculation is going to get the partial derivatives of loss with respect to each of the weights so we can change those weights in the direction that will reduce the loss. L2 loss is $\\frac{1}{2} \\sum (\\hat{y} - y)^2$, where $\\hat{y} = w_2 x + w_1$. So, using the chain rule and substituting in for $\\hat{y}$ in the derivative, $\\frac{\\partial}{\\partial w_2} = \\sum{(\\hat{y} - y)\\, *x_i}$ and $\\frac{\\partial}{\\partial w_1} = \\sum{(\\hat{y} - y)\\, *1}$. `GradientDescentOptimizer` does these calculations automatically for you based on the graph structure.\n",
"\n",
"The third line is equivalent to `weights -= mu * gradient`, so it subtracts a constant the gradient after scaling by the learning rate (to avoid jumping too far each time, which risks moving in the wrong direction). It's also the same thing that `GradientDescentOptimizer(learning_rate).minimize(loss)` does in the earlier code. Gradient descent updates its first parameter based on the values in the second after scaling by the third, so it's equivalent to the `assign_sub(weights, mu * gradient)`.\n",
"The third line is equivalent to `weights -= learning_rate * gradient`, so it subtracts a constant the gradient after scaling by the learning rate (to avoid jumping too far each time, which risks moving in the wrong direction). It's also the same thing that `GradientDescentOptimizer(learning_rate).minimize(loss)` does in the earlier code. Gradient descent updates its first parameter based on the values in the second after scaling by the third, so it's equivalent to the `assign_sub(weights, learning_rate * gradient)`.\n",
"\n",
"Hopefully, this other code gives you a better understanding of what the operations we used previously are actually doing. In practice, you'll want to use those high level operators most of the time rather than calculating things yourself. For this toy example and simple network, it's not too bad to compute and apply the gradients yourself from scratch, but things get more complicated with larger networks."
]
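As a sanity check on the derivation above, a few lines of NumPy (again illustrative, not from the notebook) can confirm that the analytic gradient $[\sum{(\hat{y} - y) \cdot 1}; \sum{(\hat{y} - y) \cdot x_i}]$ matches a finite-difference estimate of the L2 loss's slope:

```python
import numpy as np

def l2_loss(w, inp, y):
    # Half the sum of squared errors, as in the text.
    yerror = inp.dot(w) - y
    return 0.5 * np.sum(yerror ** 2)

rng = np.random.RandomState(0)
inp = np.hstack([np.ones((10, 1)), rng.rand(10, 1)])  # bias column + x values
y = rng.rand(10, 1)
w = rng.rand(2, 1)

# Analytic gradient from the chain-rule derivation: sum over examples of
# yerror * 1 (for w1) and yerror * x (for w2).
yerror = inp.dot(w) - y
analytic = np.sum(inp * yerror, axis=0, keepdims=True).T  # shape (2, 1)

# Central finite differences: bump each weight by +/- eps and measure the slope.
eps = 1e-6
numeric = np.zeros_like(w)
for i in range(2):
    bump = np.zeros_like(w)
    bump[i] = eps
    numeric[i] = (l2_loss(w + bump, inp, y) - l2_loss(w - bump, inp, y)) / (2 * eps)

print(np.allclose(analytic, numeric, atol=1e-4))  # True: the derivation checks out
```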
