support I/O of NaN, +inf, -inf values #67

bob-carpenter · 2014-05-05T17:21:51Z

This is on the border between bug and feature. We should be reading in NaN and +inf and -inf values through all our interfaces. I think the output's OK now, but we can't input NaN or inf now. Or if we can, it's not clear to me or our users how to do it.

The corresponding issue for CmdStan: stan-dev/stan#637

I'm not sure what the status is of PyStan on this.

ariddell · 2014-05-05T17:29:10Z

I'm happy to copy/translate RStan's tests -- do they exist for this?

bob-carpenter · 2014-05-05T17:37:17Z

I'm afraid not --- RStan still doesn't support I/O for NaN or inf.

So I'm not sure what you'd need to do to get it into Python.

It's not a huge issue, because rarely is data or inits going to
be NaN or +/- inf. The request came from someone wanting to code
R's NA ("undefined") value, which isn't really possible on the Stan
C++ side, because we're just using IEEE double-precision floating point
for arithmetic.

It looks like Python lets you check if a variable is defined, but
doesn't let you define it to have an NA-like value, which is what
R does. But I'm not a Python expert (I'm barely a Python novice).

Bob

On May 5, 2014, at 7:29 PM, Allen Riddell notifications@github.com wrote:

I'm happy to copy/translate RStan's tests -- do they exist for this?

—
Reply to this email directly or view it on GitHub.

bgoodri · 2014-05-05T23:05:28Z

I think we need a more general thinking about special values in Stan. I can't see what sense it makes to have Inf as input, but maybe as output. For missing values, it is often useful to distinguish between a value that could exist but happens to be missing vs. a value that is observed to be undefined or inapplicable. With mi, we use NA for the former and NaN for the latter. Another option would be to do what Stata does (which may be wise if there is ever a Stata interface) and reserve the last 27 values that are supported for that numeric type for various missingness codes, which no one should be using anyway because they will likely overflow. See http://www.stata.com/help.cgi?dta_115#numbers

bob-carpenter · 2014-05-06T09:41:54Z

On May 6, 2014, at 1:05 AM, bgoodri notifications@github.com wrote:

I think we need a more general thinking about special values in Stan. I can't see what sense it makes to have Inf as input,

I don't see much use for it yet. It could be used
for bounds if we allow variable bounds. Negative inf could be used
as data for a value that's exponentiated to produce 0. In general,
though, I just think they're valid IEEE numeric values we should handle.

but maybe as output. For missing values, it is often useful to distinguish between a value that could exist but happens to be missing vs. a value that is observed to be undefined or inapplicable.

Absolutely. I get what R does with NA.

With mi, we use NA for the former and NaN for the latter. Another option would be to do what Stata does (which may be wise if there is ever a Stata interface)

Yes, we're going to do a Stata interface for IES.

and reserve the last 27 values that are supported for that numeric type for various missingness codes, which no one should be using anyway because they will likely overflow. See http://www.stata.com/help.cgi?dta_115#numbers

That's an interesting approach. I really don't like messing
with arithmetic, though. I'd really rather just avoid the NA
thing altogether. But I could be convinced otherwise.

Bob

bgoodri · 2014-05-06T12:40:10Z

Technically, you could exp(-Inf) to get 0 in transformed data but why would
someone do that when they could just pass the data on the antilog scale?
Anyway, that is just -Inf. For +Inf, I can't think of any reason why that
would be reasonable input. So one option would be for RStan and other
interfaces to map the missing NAs to +Inf. Then, NaN could be used for
observed to be undefined.

Another option would be to distinguish between quiet and signaling NaN.

The Stata approach (in addition to being compatible with Stata) doesn't
necessarily require messing with arithmetic. We would need to change
check_finite() to return an exception if any of the values are >= the 27th
biggest representable value. That way, you get an exception with a nice
message if it gets passed to a density function.

If we supported sparse Eigen::Matrix, we could do that thing I suggested at
one point where missing values were treated as "zeros" and then shadow that
with a sparse matrix of parameters where the originally observed values
were treated as "zeros".

I don't think ignoring NA altogether is a viable option because missingness
is so common and all of the other statistical software deals with them in
one way or another.

bob-carpenter · 2014-05-06T16:28:56Z

On May 6, 2014, at 2:40 PM, bgoodri notifications@github.com wrote:

Technically, you could exp(-Inf) to get 0 in transformed data but why would
someone do that when they could just pass the data on the antilog scale?
Anyway, that is just -Inf. For +Inf, I can't think of any reason why that
would be reasonable input. So one option would be for RStan and other
interfaces to map the missing NAs to +Inf. Then, NaN could be used for
observed to be undefined.

I'd rather go with some random high valued finite values a la Stata than
to mess with unique special values. I think re-using any of these is asking
for trouble, but then I'm not really in favor of adding a NaN --- like you'd
prefer antilog rather than -inf, I'd rather they just code the missing data
as parameters directly or use a database-like notation for existing values.

I know it's not the R or BUGS/JAGS way.

Another option would be to distinguish between quiet and signaling NaN.

We could perhaps use a signaling NaN if they're implemented on all of
our platforms and compilers. I found this useful:

http://en.wikipedia.org/wiki/NaN#Signaling_NaN

I haven't tried playing with clang++ on the Mac to swee what happens with
std::numeric_limits.

The Stata approach (in addition to being compatible with Stata) doesn't
necessarily require messing with arithmetic. We would need to change
check_finite() to return an exception if any of the values are >= the 27th
biggest representable value. That way, you get an exception with a nice
message if it gets passed to a density function.

If we supported sparse Eigen::Matrix, we could do that thing I suggested at
one point where missing values were treated as "zeros" and then shadow that
with a sparse matrix of parameters where the originally observed values
were treated as "zeros".

I'd think we'd want to separate two data structures, sparse matrices, which
have zeros, and partial matrices, where only some entries are observed.

I don't think ignoring NA altogether is a viable option because missingness
is so common and all of the other statistical software deals with them in
one way or another.

R and BUGS do, but obviously not "all" stat software does!

Let's discuss this with more bandwidth at the next Stan meeting
we're both at.

Bob

bgoodri · 2014-05-08T21:39:37Z

Summarizing my view, here is basically all you can do with Stan currently

data {
  int<lower=1> K;              // number of variables
  int<lower=1> N;              // number of observations
  matrix[N,K]  X;              // data matrix
  int<lower=0,upper=1> R[N,K]; // two-dimensional array that indicates X element is observed
}
transformed data {
  int<lower=0> n_miss;
  n_miss <- 0;
  for(i in 1:N) for(j in 1:K) if(R[i,j] == 0) n_miss <- n_miss + 1;
}
parameters {
  vector[K] mu;                 # expectation of rows of X
  cholesky_factor_cov[K,K] L;   # Cholesky factor of covariance matrix
  vector[n_miss] X_miss;        # missing values to fill in
}
model {
  /* fill in missing data values as necessary and evaluate multinormal */
  count <- 1;
  for(i in 1:N) {
    vector[K] Y;
    for(j in 1:K) {
      if(R[i,j] == 1) Y[j] <- X[i,j];
      else {
        Y[j] <- X_miss[count];
        count <- count + 1;
      }
    }
    Y ~ multi_normal_cholesky(mu, L);
  }
  /* priors on mu and L */  
}

This behavior is bad in several respects. First, if X were accidentally used in the model block, Stan has no way to catch this mistake because the elements of X where R == 0 have some arbitrary but meaningless numeric value that Stan I/O accepts. Second, it is tedious and error prone to do all that fiddling with R.

I still think there was no consensus as to whether it would be a good idea for Stan to have special values for missing or undefined, but I think that if these existed, then it would then be possible to do stuff more easily with Stan. So, let's suppose missing is a quiet NaN and undefined is a signaling NaN. You could introduce some syntax like

data {
  int<lower=1> K;              // number of variables
  int<lower=1> N;              // number of observations
  matrix[N,K]  X;              // data matrix with missing values
}
parameters {
  vector[K] mu;                 # expectation of rows of X
  cholesky_factor_cov[K,K] L;   # Cholesky factor of covariance matrix
  /* an object with the same dimensions as X but with missing values replaced by unknowns */
  X_complete(X);
}

I suppose you only write the unknown elements of X_complete to the .csv file and permit something like

generated quantities {
  X_complete2(X)
}

if the user actually wants to store tons of matrices that mix observed and estimated elements.

That only works for continuous things. For discrete objects, counts are a fundamental problem. But for categoricals, I don't think it would be so hard to change the behavior of the PMFs to marginalize over the missing values. For example, bernoulli_logit would do one thing if y == 1, another if y == 0, and mix those if y == quiet_NaN.

That's why I think that some progress can be made if Stan would just do what every other statistical software does and have a way to represent missingness so that the appropriate functions could recognize it.

isaactpetersen · 2016-05-15T04:01:00Z

It's been a couple years since the last message in this thread--any updates on allowing missing data (e.g., via imputation)? Missing data are quite common in data analysis. Having the ability to handle missing data would make Stan much more flexible.

bgoodri · 2016-05-15T04:18:14Z

Little has changed. With R-style indexing, it takes a bit less typing to do
what could be done before.

On Sun, May 15, 2016 at 12:01 AM, dadrivr notifications@github.com wrote:

It's been a couple years since the last message in this thread--any
updates on allowing missing data (e.g., via imputation)? Missing data are
quite common in data analysis. Having the ability to handle missing data
would make Stan much more flexible.

—
You are receiving this because you commented.
Reply to this email directly or view it on GitHub
#67 (comment)

isaactpetersen · 2016-05-15T17:30:20Z

Thanks for the update! Hope that handling missing data is a consideration for the future. I love Stan and would love for it to be able to handle that data that I work with! Thanks for all of your work on the project.

bob-carpenter · 2016-05-15T17:57:32Z

It's not impossible to code up missing data models in Stan
now, just a pain because of the index fiddling and declarations
required. We don't really have any good support for multiple
imputation on the outside, either. Maybe a case study would help
with these.

Bob

On May 15, 2016, at 1:30 PM, dadrivr notifications@github.com wrote:

Thanks for the update! Hope that handling missing data is a consideration for the future. I love Stan and would love for it to be able to handle that data that I work with! Thanks for all of your work on the project.

—
You are receiving this because you authored the thread.
Reply to this email directly or view it on GitHub

bob-carpenter added the enhancement label May 5, 2014

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

support I/O of NaN, +inf, -inf values #67

support I/O of NaN, +inf, -inf values #67

bob-carpenter commented May 5, 2014

ariddell commented May 5, 2014

bob-carpenter commented May 5, 2014

bgoodri commented May 5, 2014

bob-carpenter commented May 6, 2014

bgoodri commented May 6, 2014

bob-carpenter commented May 6, 2014

bgoodri commented May 8, 2014

isaactpetersen commented May 15, 2016

bgoodri commented May 15, 2016

isaactpetersen commented May 15, 2016

bob-carpenter commented May 15, 2016

support I/O of NaN, +inf, -inf values #67

support I/O of NaN, +inf, -inf values #67

Comments

bob-carpenter commented May 5, 2014

ariddell commented May 5, 2014

bob-carpenter commented May 5, 2014

bgoodri commented May 5, 2014

bob-carpenter commented May 6, 2014

bgoodri commented May 6, 2014

bob-carpenter commented May 6, 2014

bgoodri commented May 8, 2014

isaactpetersen commented May 15, 2016

bgoodri commented May 15, 2016

isaactpetersen commented May 15, 2016

bob-carpenter commented May 15, 2016