Stan throws error with empty array #437

Open
bob-carpenter opened this Issue Feb 16, 2018 · 13 comments

bob-carpenter commented Feb 16, 2018

From @lazypanda1 on February 14, 2018 22:3

When I run the example below with N = 0, I run into an error (also below).

This error doesn't occur when the data is empty but of type real, or when the data has at least one element (for both real and int types). The error message is also unclear to me: the data array is empty, so it contains no non-int values. If array emptiness is the issue, why doesn't the program also fail for an empty real array?

Is all of this expected?

data {
  int<lower=0> N;
  int datax[N];
}

model {
  for (i in 1:N) {
    datax[i] ~ weibull(1.0, 1.0);
  }
}

Error message:

INFO:pystan:COMPILING THE C++ CODE FOR MODEL anon_model_b792a2a57df879a15a969a03d556f524 NOW.
Traceback (most recent call last):
  File "driver.py", line 34, in <module>
    fit = sm.sampling(data=data, iter=1000, chains=4)
  File "/usr/local/lib/python2.7/dist-packages/pystan/model.py", line 671, in sampling
    fit = self.fit_class(data)
  File "stanfit4anon_model_b792a2a57df879a15a969a03d556f524_9077231157931774679.pyx", line 457, in stanfit4anon_model_b792a2a57df879a15a969a03d556f524_9077231157931774679.StanFit4Model.__cinit__
RuntimeError: int variable contained non-int values; processing stage=data initialization; variable name=data; base type=int

Environment

pystan 2.16.0.0
python 2.7.6

Copied from original issue: stan-dev/stan#2468

bob-carpenter Feb 16, 2018

This is not a Stan issue, it's a PyStan issue. I verified everything works fine in RStan with

data {
  int<lower = 0> N;
  int y[N];
}
model {
  for(n in 1:N)
    y[n] ~ weibull(1.0, 1.0);
}

with

stan("foo.stan", data = list(N = 0, y = array(0, dim=c(0))),
     algorithm="Fixed_param")

So I'm going to move this issue to the PyStan tracker.

ahartikainen Feb 16, 2018

Collaborator

How is the data passed to StanModel?

lazypanda1 Feb 16, 2018

I am loading the data from a JSON file with the contents shown below:

{"datax": [], "N": 0}

Python code snippet:

with open('data.json') as dataFile:
    data = json.load(dataFile)
...

fit = sm.sampling(data=data, iter=1000, chains=4)

ahartikainen Feb 16, 2018

Collaborator

OK, I know what the problem is. Pass the empty array with an explicit integer dtype:

{'datax' : np.array([], dtype=int), 'N' : 0}

The reason is that np.asarray([]) has float as its default dtype, so the empty list is treated as real data.

See pystan.misc:

def _split_data(data):
    data_r = {}
    data_i = {}
    # data_r and data_i are going to be converted into C++ objects of
    # type: map<string, pair<vector<double>, vector<size_t>>> and
    # map<string, pair<vector<int>, vector<size_t>>> so prepare
    # them accordingly.
    for k, v in data.items():
        if np.issubdtype(np.asarray(v).dtype, np.integer):
            data_i.update({k.encode('utf-8'): np.asarray(v, dtype=int)})
        elif np.issubdtype(np.asarray(v).dtype, np.floating):
            data_r.update({k.encode('utf-8'): np.asarray(v, dtype=float)})
        else:
            msg = "Variable {} is neither int nor float nor list/array thereof"
            raise ValueError(msg.format(k))
    return data_r, data_i
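
The effect can be reproduced outside PyStan. The sketch below applies the same dtype test as `_split_data` to the JSON data from this thread. The helper name `classify` is hypothetical and introduced only for illustration; it is not part of PyStan.

```python
import json

import numpy as np


def classify(value):
    """Return 'int' or 'real' using the same dtype test as _split_data.

    Hypothetical helper for illustration; not part of PyStan.
    """
    dtype = np.asarray(value).dtype
    if np.issubdtype(dtype, np.integer):
        return "int"
    if np.issubdtype(dtype, np.floating):
        return "real"
    raise ValueError("neither int nor float")


data = json.loads('{"datax": [], "N": 0}')

# An empty list carries no element type, so NumPy falls back to a float dtype.
before = classify(data["datax"])

# Workaround: state the integer dtype explicitly before calling sampling().
data["datax"] = np.array(data["datax"], dtype=int)
after = classify(data["datax"])

print(before, after)  # real int
```

With the explicit dtype, the empty array takes the integer branch, which matches the `int datax[N]` declaration in the model.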

ahartikainen Feb 16, 2018

Collaborator

Thinking about this, we probably don't want to parse dtypes from the Stan model, and we can't change NumPy's defaults, so we could warn that an empty list is being converted to float and tell the user how to pass an int array explicitly.

@ariddell any thoughts?

bob-carpenter Feb 16, 2018

This has also been an ongoing problem with RStan. The underlying cause is the same---dynamic types making it unconventional to specify the type of something that's empty.

ariddell Feb 16, 2018

Member

ahartikainen Feb 18, 2018

Collaborator

If we can extract the dtypes, we could use them in fit.extract to set the correct dtype automatically. So do we need to change fit.sim to hold a list of dtypes or something similar?

@ariddell ariddell added this to the 2.18 milestone Feb 18, 2018

riddell-stan Jul 26, 2018

Contributor

This relates to an important desideratum: we need to be able to get the data types from the Stan model. This is needed for a lot of things.

@bob-carpenter Can we get the data types for variables from a stan_model yet? I don't see it in an older but post-2.17 stan.

@riddell-stan riddell-stan added the bug label Jul 26, 2018

ahartikainen Jul 26, 2018

Collaborator

I did something "similar" in ArviZ. It returns all int params; everything else is float.
It will also pick up local variables defined in the generated quantities block, which is probably fine in most cases (is this even possible: (int theta; ...); real theta)?

It would be better if Stan also returned dtypes for all returned variables.

https://github.com/arviz-devs/arviz/blob/master/arviz/utils/xarray_utils.py

    def infer_dtypes(self):
        pattern_remove_comments = re.compile(
            r'//.*?$|/\*.*?\*/|\'(?:\\.|[^\\\'])*\'|"(?:\\.|[^\\"])*"',
            re.DOTALL|re.MULTILINE
        )
        stan_integer = r"int"
        stan_limits = r"(?:\<[^\>]+\>)*" # ignore group: 0 or more <....>
        stan_param = r"([^;=\s\[]+)" # capture group: ends= ";", "=", "[" or whitespace
        stan_ws = r"\s*" # 0 or more whitespace
        pattern_int = re.compile(
            "".join((stan_integer, stan_ws, stan_limits, stan_ws, stan_param)),
            re.IGNORECASE
        )
        stan_code = self.obj.get_stancode()
        # remove deprecated '#'-style comments
        stan_code = "\n".join(
            line if "#" not in line else line[:line.find("#")]
            for line in stan_code.splitlines()
        )
        stan_code = re.sub(pattern_remove_comments, "", stan_code)
        stan_code = stan_code.split("generated quantities")[-1]
        dtypes = re.findall(pattern_int, stan_code)
        dtypes = {item.strip() : 'int' for item in dtypes if item.strip() in self.varnames}
        return dtypes
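
To experiment with this approach outside ArviZ, the method can be adapted into a free function. The following is only a sketch: the name `infer_int_names`, its signature, and the sample model are hypothetical, and the regular expressions are copied from the method above.

```python
import re


def infer_int_names(stan_code, varnames):
    """Return the names in `varnames` that appear in an int declaration.

    Standalone adaptation for experimentation; name and signature are hypothetical.
    """
    # Matches //-comments, /* */-comments, and quoted strings.
    pattern_remove_comments = re.compile(
        r'//.*?$|/\*.*?\*/|\'(?:\\.|[^\\\'])*\'|"(?:\\.|[^\\"])*"',
        re.DOTALL | re.MULTILINE,
    )
    pattern_int = re.compile(r"int\s*(?:\<[^\>]+\>)*\s*([^;=\s\[]+)", re.IGNORECASE)
    # Drop everything after '#' (old-style comments, and also #include lines).
    stan_code = "\n".join(line.split("#", 1)[0] for line in stan_code.splitlines())
    stan_code = re.sub(pattern_remove_comments, "", stan_code)
    # Keep only the text after the last "generated quantities" marker, if any.
    stan_code = stan_code.split("generated quantities")[-1]
    names = re.findall(pattern_int, stan_code)
    return {name.strip() for name in names if name.strip() in varnames}


model_code = """
data {
  int<lower=0> N;   // number of observations
  real y[N];
}
parameters {
  real mu;
}
generated quantities {
  int k = 3;
  real z = mu;
}
"""

int_names = infer_int_names(model_code, {"N", "mu", "k", "z"})
print(int_names)  # {'k'}
```

Because the code is cut at "generated quantities", `N` from the data block is not reported here. When a model has no such block, the whole program is scanned instead.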

ahartikainen Jul 26, 2018

Collaborator

Also, this will fail with all the #include statements.

ahartikainen Aug 4, 2018

Collaborator

I believe we can't fix this in 2.18? We could throw a warning when we see an empty list/tuple?

riddell-stan Aug 8, 2018

Contributor

@bob-carpenter when will we be able to introspect a stan model and get out the type (int or real) for a variable?

@ahartikainen I suppose we could warn if we see (1) an empty list and (2) a scalar 0 in the data dict. A decent short-term fix, right?
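
A minimal sketch of such a check, assuming it would run on the data dict before the int/real split. The function name `warn_on_empty_data` and the message wording are hypothetical and not PyStan API.

```python
import warnings

import numpy as np


def warn_on_empty_data(data):
    """Warn about empty, untyped sequences that NumPy will treat as float.

    Hypothetical helper, not part of PyStan. Returns the names it warned about.
    """
    # A scalar 0 suggests that some array in the model is declared with size 0.
    has_zero_scalar = any(
        np.isscalar(v) and not isinstance(v, (str, bytes)) and v == 0
        for v in data.values()
    )
    flagged = []
    for name, value in data.items():
        is_empty_sequence = isinstance(value, (list, tuple)) and len(value) == 0
        if has_zero_scalar and is_empty_sequence:
            warnings.warn(
                "Variable {!r} is an empty list/tuple and will be treated as "
                "float; if the model declares it as int, pass "
                "np.array([], dtype=int) instead.".format(name)
            )
            flagged.append(name)
    return flagged


flagged = warn_on_empty_data({"datax": [], "N": 0})
```

An array that already carries an explicit dtype, such as `np.array([], dtype=int)`, is not a list or tuple and so is not flagged.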
