Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

support I/O of NaN, +inf, -inf values #67

Open
bob-carpenter opened this issue May 5, 2014 · 11 comments
Open

support I/O of NaN, +inf, -inf values #67

bob-carpenter opened this issue May 5, 2014 · 11 comments

Comments

@bob-carpenter
Copy link

This is on the border between bug and feature. We should be reading in NaN and +inf and -inf values through all our interfaces. I think the output's OK now, but we can't input NaN or inf now. Or if we can, it's not clear to me or our users how to do it.

The corresponding issue for CmdStan: stan-dev/stan#637

I'm not sure what the status is of PyStan on this.

@ariddell
Copy link

ariddell commented May 5, 2014

I'm happy to copy/translate RStan's tests -- do they exist for this?

@bob-carpenter
Copy link
Author

I'm afraid not --- RStan still doesn't support I/O for NaN or inf.

So I'm not sure what you'd need to do to get it into Python.

It's not a huge issue, because rarely is data or inits going to
be NaN or +/- inf. The request came from someone wanting to code
R's NA ("undefined") value, which isn't really possible on the Stan
C++ side, because we're just using IEEE double-precision floating point
for arithmetic.

It looks like Python lets you check if a variable is defined, but
doesn't let you define it to have an NA-like value, which is what
R does. But I'm not a Python expert (I'm barely a Python novice).

  • Bob

On May 5, 2014, at 7:29 PM, Allen Riddell notifications@github.com wrote:

I'm happy to copy/translate RStan's tests -- do they exist for this?


Reply to this email directly or view it on GitHub.

@bgoodri
Copy link
Contributor

bgoodri commented May 5, 2014

I think we need a more general thinking about special values in Stan. I can't see what sense it makes to have Inf as input, but maybe as output. For missing values, it is often useful to distinguish between a value that could exist but happens to be missing vs. a value that is observed to be undefined or inapplicable. With mi, we use NA for the former and NaN for the latter. Another option would be to do what Stata does (which may be wise if there is ever a Stata interface) and reserve the last 27 values that are supported for that numeric type for various missingness codes, which no one should be using anyway because they will likely overflow. See http://www.stata.com/help.cgi?dta_115#numbers

@bob-carpenter
Copy link
Author

On May 6, 2014, at 1:05 AM, bgoodri notifications@github.com wrote:

I think we need a more general thinking about special values in Stan. I can't see what sense it makes to have Inf as input,

I don't see much use for it yet. It could be used
for bounds if we allow variable bounds. Negative inf could be used
as data for a value that's exponentiated to produce 0. In general,
though, I just think they're valid IEEE numeric values we should handle.

but maybe as output. For missing values, it is often useful to distinguish between a value that could exist but happens to be missing vs. a value that is observed to be undefined or inapplicable.

Absolutely. I get what R does with NA.

With mi, we use NA for the former and NaN for the latter. Another option would be to do what Stata does (which may be wise if there is ever a Stata interface)

Yes, we're going to do a Stata interface for IES.

and reserve the last 27 values that are supported for that numeric type for various missingness codes, which no one should be using anyway because they will likely overflow. See http://www.stata.com/help.cgi?dta_115#numbers

That's an interesting approach. I really don't like messing
with arithmetic, though. I'd really rather just avoid the NA
thing altogether. But I could be convinced otherwise.

  • Bob

@bgoodri
Copy link
Contributor

bgoodri commented May 6, 2014

Technically, you could exp(-Inf) to get 0 in transformed data but why would
someone do that when they could just pass the data on the antilog scale?
Anyway, that is just -Inf. For +Inf, I can't think of any reason why that
would be reasonable input. So one option would be for RStan and other
interfaces to map the missing NAs to +Inf. Then, NaN could be used for
observed to be undefined.

Another option would be to distinguish between quiet and signaling NaN.

The Stata approach (in addition to being compatible with Stata) doesn't
necessarily require messing with arithmetic. We would need to change
check_finite() to return an exception if any of the values are >= the 27th
biggest representable value. That way, you get an exception with a nice
message if it gets passed to a density function.

If we supported sparse Eigen::Matrix, we could do that thing I suggested at
one point where missing values were treated as "zeros" and then shadow that
with a sparse matrix of parameters where the originally observed values
were treated as "zeros".

I don't think ignoring NA altogether is a viable option because missingness
is so common and all of the other statistical software deals with them in
one way or another.

@bob-carpenter
Copy link
Author

On May 6, 2014, at 2:40 PM, bgoodri notifications@github.com wrote:

Technically, you could exp(-Inf) to get 0 in transformed data but why would
someone do that when they could just pass the data on the antilog scale?
Anyway, that is just -Inf. For +Inf, I can't think of any reason why that
would be reasonable input. So one option would be for RStan and other
interfaces to map the missing NAs to +Inf. Then, NaN could be used for
observed to be undefined.

I'd rather go with some random high valued finite values a la Stata than
to mess with unique special values. I think re-using any of these is asking
for trouble, but then I'm not really in favor of adding a NaN --- like you'd
prefer antilog rather than -inf, I'd rather they just code the missing data
as parameters directly or use a database-like notation for existing values.

I know it's not the R or BUGS/JAGS way.

Another option would be to distinguish between quiet and signaling NaN.

We could perhaps use a signaling NaN if they're implemented on all of
our platforms and compilers. I found this useful:

http://en.wikipedia.org/wiki/NaN#Signaling_NaN

I haven't tried playing with clang++ on the Mac to swee what happens with
std::numeric_limits.

The Stata approach (in addition to being compatible with Stata) doesn't
necessarily require messing with arithmetic. We would need to change
check_finite() to return an exception if any of the values are >= the 27th
biggest representable value. That way, you get an exception with a nice
message if it gets passed to a density function.

If we supported sparse Eigen::Matrix, we could do that thing I suggested at
one point where missing values were treated as "zeros" and then shadow that
with a sparse matrix of parameters where the originally observed values
were treated as "zeros".

I'd think we'd want to separate two data structures, sparse matrices, which
have zeros, and partial matrices, where only some entries are observed.

I don't think ignoring NA altogether is a viable option because missingness
is so common and all of the other statistical software deals with them in
one way or another.

R and BUGS do, but obviously not "all" stat software does!

Let's discuss this with more bandwidth at the next Stan meeting
we're both at.

  • Bob

@bgoodri
Copy link
Contributor

bgoodri commented May 8, 2014

Summarizing my view, here is basically all you can do with Stan currently

data {
  int<lower=1> K;              // number of variables
  int<lower=1> N;              // number of observations
  matrix[N,K]  X;              // data matrix
  int<lower=0,upper=1> R[N,K]; // two-dimensional array that indicates X element is observed
}
transformed data {
  int<lower=0> n_miss;
  n_miss <- 0;
  for(i in 1:N) for(j in 1:K) if(R[i,j] == 0) n_miss <- n_miss + 1;
}
parameters {
  vector[K] mu;                 # expectation of rows of X
  cholesky_factor_cov[K,K] L;   # Cholesky factor of covariance matrix
  vector[n_miss] X_miss;        # missing values to fill in
}
model {
  /* fill in missing data values as necessary and evaluate multinormal */
  count <- 1;
  for(i in 1:N) {
    vector[K] Y;
    for(j in 1:K) {
      if(R[i,j] == 1) Y[j] <- X[i,j];
      else {
        Y[j] <- X_miss[count];
        count <- count + 1;
      }
    }
    Y ~ multi_normal_cholesky(mu, L);
  }
  /* priors on mu and L */  
}

This behavior is bad in several respects. First, if X were accidentally used in the model block, Stan has no way to catch this mistake because the elements of X where R == 0 have some arbitrary but meaningless numeric value that Stan I/O accepts. Second, it is tedious and error prone to do all that fiddling with R.

I still think there was no consensus as to whether it would be a good idea for Stan to have special values for missing or undefined, but I think that if these existed, then it would then be possible to do stuff more easily with Stan. So, let's suppose missing is a quiet NaN and undefined is a signaling NaN. You could introduce some syntax like

data {
  int<lower=1> K;              // number of variables
  int<lower=1> N;              // number of observations
  matrix[N,K]  X;              // data matrix with missing values
}
parameters {
  vector[K] mu;                 # expectation of rows of X
  cholesky_factor_cov[K,K] L;   # Cholesky factor of covariance matrix
  /* an object with the same dimensions as X but with missing values replaced by unknowns */
  X_complete(X);
}

I suppose you only write the unknown elements of X_complete to the .csv file and permit something like

generated quantities {
  X_complete2(X)
}

if the user actually wants to store tons of matrices that mix observed and estimated elements.

That only works for continuous things. For discrete objects, counts are a fundamental problem. But for categoricals, I don't think it would be so hard to change the behavior of the PMFs to marginalize over the missing values. For example, bernoulli_logit would do one thing if y == 1, another if y == 0, and mix those if y == quiet_NaN.

That's why I think that some progress can be made if Stan would just do what every other statistical software does and have a way to represent missingness so that the appropriate functions could recognize it.

@isaactpetersen
Copy link

It's been a couple years since the last message in this thread--any updates on allowing missing data (e.g., via imputation)? Missing data are quite common in data analysis. Having the ability to handle missing data would make Stan much more flexible.

@bgoodri
Copy link
Contributor

bgoodri commented May 15, 2016

Little has changed. With R-style indexing, it takes a bit less typing to do
what could be done before.

On Sun, May 15, 2016 at 12:01 AM, dadrivr notifications@github.com wrote:

It's been a couple years since the last message in this thread--any
updates on allowing missing data (e.g., via imputation)? Missing data are
quite common in data analysis. Having the ability to handle missing data
would make Stan much more flexible.


You are receiving this because you commented.
Reply to this email directly or view it on GitHub
#67 (comment)

@isaactpetersen
Copy link

Thanks for the update! Hope that handling missing data is a consideration for the future. I love Stan and would love for it to be able to handle that data that I work with! Thanks for all of your work on the project.

@bob-carpenter
Copy link
Author

It's not impossible to code up missing data models in Stan
now, just a pain because of the index fiddling and declarations
required. We don't really have any good support for multiple
imputation on the outside, either. Maybe a case study would help
with these.

  • Bob

On May 15, 2016, at 1:30 PM, dadrivr notifications@github.com wrote:

Thanks for the update! Hope that handling missing data is a consideration for the future. I love Stan and would love for it to be able to handle that data that I work with! Thanks for all of your work on the project.


You are receiving this because you authored the thread.
Reply to this email directly or view it on GitHub

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants