FTRL implementation in tensorflow V.S. FTRL in Google's research paper #3725

tangruiming · 2016-08-10T08:25:21Z

Hello, every one!

I am interested in digging the details how FTRL is implemented in tensorflow. I find some information in the file "gen_training_ops.py" in the folder /tensorflow/python/training. In this file, the formula of FTRL algorithm is described as follows:

def apply_ftrl(var, accum, linear, grad, lr, l1, l2, lr_power,
use_locking=None, name=None):
r"""Update '*var' according to the Ftrl-proximal scheme
accum_new = accum + grad * grad      ------ (1)
linear += grad + (accum_new^(-lr_power) - accum^(-lr_power)) / lr * var        ------ (2)
quadratic = 1.0 / (accum_new^(lr_power) * lr) + 2 * l2              ------ (3)
var = (sign(linear) * l1 - linear) / quadratic if |linear| > l1 else 0.0           ------(4)
accum = accum_new        ------(5)

I am also reading the paper "Ad Click Prediction: a View from the Trenches" by Google in KDD'13. The formula of FTRL algorithm is given in page 2 of this paper. Comparing this two implementations, we find some connections:
var is w_{t,i} in the paper; l1 is lambda1 in the paper; linear is zi in the paper; lr is alpha in the paper; grad is gi in the paper; accum is ni in the paper.

But also, there are some inconsistent points:
according to the paper, the Equation (2) above should be
linear += grad - (accum_new^(-lr_power) - accum^(-lr_power)) / lr * var
also we can get the following equation by comparing the two implementations:
2l2 *alpha = beta + alpha * lambda2

For any expert who is familiar with the FTRL implementation in the tensorflow, can you help us to clarify the meaning the parameters given in tensorflow, and the connections with the FTRL code in Google's research paper "Ad Click Prediction: a View from the Trenches".

Thanks!

The text was updated successfully, but these errors were encountered:

girving · 2016-08-10T16:29:04Z

Unless you think there is a bug, this question would be better asked on StackOverflow. Github issues are for code bugs and feature requests, not requests for clarification.

tangruiming · 2016-08-11T00:20:56Z

I think there is a bug in it.

will001 · 2016-08-11T21:01:47Z

The FTRL implementation in Tensorflow was exactly coming from that paper. I think the meaning of parameters is quite consistent with the notation in the paper. Where do you think the bug is?

tangruiming · 2016-08-12T08:56:14Z

Hi, will001:

Thanks for your attention!

To our understanding, we can establish the following connections:
var is w_{t,i} in the paper; l1 is lambda1 in the paper; linear is zi in the paper; lr is alpha in the paper; grad is gi in the paper; accum is ni in the paper.

However, we found that the notation "beta" is the paper is missing from the Tensorflow implementation.

Furthermore, we guess that l2 is lambda2 in the paper, however, we are not able to get this conclusion by comparing the two implementations. Instead, we can only get this equation:
2l2 *alpha = beta + alpha * lambda2

Can you clarify these two points a little bit?

Thanks again.

yanyachen · 2016-08-14T22:19:22Z

I have same confusion. It seems that the optimizer force beta = l2 * alpha.
Is there any reason behind that ?

tangruiming · 2016-08-15T00:14:27Z

I do have the same question with @yanyachen

ageron · 2016-08-18T14:51:52Z

Although the documentation points to the right paper, it was unclear to me (until I dug into the code) whether the TensorFlow class implemented Nesterov's dual averaging (ie. FTRL) or the FTRL-Proximal variant proposed in the Ad Click Prediction paper.

It would be good to clarify this in the documentation, along with the meaning of the hyperparameters. Thanks!

will001 · 2016-08-22T22:54:58Z

Thanks tangruiming@ for pointing out, the comments 2) in get_training_ops.py is not accurate, which should be:
linear += grad - (accum_new^(-lr_power) - accum^(-lr_power)) / lr * var.
I will fix this very soon.

For the missing parameter 'beta' in the implementation, which actually comes from the initial_accumulator_value for the 'accum', where accum = initial_accumulator_value + sigma{g(i)^2}. So, you can think of it as beta + sqrt(n^i) == sqrt((initial_accumulator_value + sigma(g(i)^2))).

To ageron@, the implementation in Tensorflow as FTRL-Proximal, proposed in the Ad Click Prediction paper.

tangruiming · 2016-08-23T02:11:34Z

Hi, will001@:

Thank you very much for your clarification.

I found another place in the comments in the get_training_ops.py that may be wrong:
quadratic = 1.0 / (accum_new^(lr_power) * lr) + 2 * l2 ------ (3)
I think it should be
quadratic = 1.0 / (accum_new^(lr_power) * lr) + l2 ------ (3)
Am I right?

Secondly, I want to confirm sth about "beta":
I can have the similar equation as you stated:
beta + sqrt(n_i) = sqrt(accum_new) = sqrt(initial_accumulator_value + sigma(g(i)^2)).
Approximately, we can have beta + sqrt(n_i) = sqrt(beta + n_i), so based on these two equations, it can be concluded that beta is approximately the same as initial_accumulator_value. Am I right?

Thanks again.

Ruiming

will001 · 2016-08-23T19:54:58Z

To tangruiming@, you are right. Thanks for pointing it out.

tangruiming · 2016-08-24T00:19:43Z

To will001, thank you very much. My doubts are clear.

CasyWang · 2017-01-11T13:11:11Z

by the way, where is the bias of logistic regression?

drpngx · 2017-01-24T21:55:22Z

Closing after reading latest comment from @tangruiming.

microwish · 2017-03-29T22:13:36Z

@will001 , seems points (2) and (3) @tangruiming mentioned aren't fixed yet.

ydp · 2018-03-30T10:48:48Z

hey, sorry to dig the old issue, but go into the implementation, still have some question here.

tensorflow/core/kernel/training_ops.cc

class SparseApplyFtrlOp

          T updated_a = a + g * g;
          using Eigen::numext::pow;
          T sigma = pow(updated_a, -lr_power_scalar) - pow(a, -lr_power_scalar);
          sigma /= lr_scalar;
          T updated_l = l + g - sigma * v;
          v = FtrlCompute(updated_a, updated_l, lr_scalar, l1_scalar, l2_scalar,
                          lr_power_scalar);
          a = updated_a;
          l = updated_l;

a/updated_a -> n
l/updated_l -> z
v -> w
lr_scalar -> alpha
l1_scalar -> lambda1
l2_scalar -> lambda2

FtrlCompute

template <typename T>
inline T FtrlCompute(const T& accum, const T& linear, const T& lr, const T& l1,
                     const T& l2, const T& lr_power) {
  T quadratic;
  if (lr_power == static_cast<T>(-0.5)) {
    quadratic = Eigen::numext::sqrt(accum) / lr + static_cast<T>(2) * l2;
  } else {
    quadratic =
        Eigen::numext::pow(accum, -lr_power) / lr + static_cast<T>(2) * l2;
  }
  if (Eigen::numext::abs(linear) > l1) {
    return (l1 * sgn(linear) - linear) / quadratic;
  } else {
    return static_cast<T>(0.0);
  }
}

linear -> z
l1 -> lambda1
l2 -> lambda2
lr -> alpha
accum -> n

so , the problem is here

quadratic = Eigen::numext::sqrt(accum) / lr + static_cast<T>(2) * l2;

why there is a static_cast<T>(2)? according paper, it is only lambda2.

sjtusmartboy · 2019-07-22T02:19:25Z

@ydp still in this SparseApplyFtrlOp I cannot find where the var is set to zero if var = (sign(linear) * l1 - linear) / quadratic if |linear| <= l1 , so where is the sparse solution

tanzhenyu · 2019-09-09T20:50:07Z

I agree. Seems like an issue we should fix.

xuyan1115 · 2020-08-14T14:00:03Z

@tanzhenyu @will001
Has this bug been fixed? I see that the comment has not changed.

girving closed this as completed Aug 10, 2016

girving reopened this Aug 12, 2016

prb12 added the stat:awaiting tensorflower Status - Awaiting response from tensorflower label Aug 22, 2016

girving assigned will001 Jan 17, 2017

girving removed the stat:awaiting tensorflower Status - Awaiting response from tensorflower label Jan 17, 2017

drpngx closed this as completed Jan 24, 2017

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

FTRL implementation in tensorflow V.S. FTRL in Google's research paper #3725

FTRL implementation in tensorflow V.S. FTRL in Google's research paper #3725

tangruiming commented Aug 10, 2016

girving commented Aug 10, 2016

tangruiming commented Aug 11, 2016

will001 commented Aug 11, 2016

tangruiming commented Aug 12, 2016

yanyachen commented Aug 14, 2016

tangruiming commented Aug 15, 2016

ageron commented Aug 18, 2016

will001 commented Aug 22, 2016

tangruiming commented Aug 23, 2016

will001 commented Aug 23, 2016

tangruiming commented Aug 24, 2016

CasyWang commented Jan 11, 2017

drpngx commented Jan 24, 2017

microwish commented Mar 29, 2017

ydp commented Mar 30, 2018

sjtusmartboy commented Jul 22, 2019

tanzhenyu commented Sep 9, 2019

xuyan1115 commented Aug 14, 2020 •

edited

FTRL implementation in tensorflow V.S. FTRL in Google's research paper #3725

FTRL implementation in tensorflow V.S. FTRL in Google's research paper #3725

Comments

tangruiming commented Aug 10, 2016

girving commented Aug 10, 2016

tangruiming commented Aug 11, 2016

will001 commented Aug 11, 2016

tangruiming commented Aug 12, 2016

yanyachen commented Aug 14, 2016

tangruiming commented Aug 15, 2016

ageron commented Aug 18, 2016

will001 commented Aug 22, 2016

tangruiming commented Aug 23, 2016

will001 commented Aug 23, 2016

tangruiming commented Aug 24, 2016

CasyWang commented Jan 11, 2017

drpngx commented Jan 24, 2017

microwish commented Mar 29, 2017

ydp commented Mar 30, 2018

sjtusmartboy commented Jul 22, 2019

tanzhenyu commented Sep 9, 2019

xuyan1115 commented Aug 14, 2020 • edited

xuyan1115 commented Aug 14, 2020 •

edited