Skip to content

Stan variable#287

Merged
ahartikainen merged 7 commits intostan-dev:developfrom
LukasNeugebauer:stan_variable
Sep 14, 2020
Merged

Stan variable#287
ahartikainen merged 7 commits intostan-dev:developfrom
LukasNeugebauer:stan_variable

Conversation

@LukasNeugebauer
Copy link
Copy Markdown
Contributor

@LukasNeugebauer LukasNeugebauer commented Sep 4, 2020

Submission Checklist

  • Run unit tests
  • Declare copyright holder and open-source license: see below

Summary

CmdStanMCMC.stan_variable now returns pandas.DataFrames with named columns instead of numpy.ndarrays. I also adapted the test that checks the output shape. This is because we're now always returning 2D DataFrames, while the dimensionality was variable before.

Copyright and Licensing

Please list the copyright holder for the work you are submitting (this will be you or your assignee, such as a university or company):

Lukas Neugebauer

By submitting this pull request, the copyright holder is agreeing to license the submitted work under the following licenses:

Comment thread cmdstanpy/stanfit.py Outdated
Comment on lines +755 to +773
if dims == 1:
idx = self.column_names.index(name)
return self._draws[self._draws_warmup :, :, idx].reshape(
(dim0,), order='A'
)
return pd.DataFrame({
name: self._draws[self._draws_warmup:, :, idx].reshape(
(dim0,), order='A'
)
})
else:
idxs = [
x[0]
x
for x in enumerate(self.column_names)
if x[1].startswith(name + '.')
]
var_dims = [dim0]
var_dims.extend(dims)
return self._draws[
self._draws_warmup :, :, idxs[0] : idxs[-1] + 1
].reshape(tuple(var_dims), order='A')
return pd.DataFrame({
n: self._draws[
self._draws_warmup:, :, x
].reshape(dim0, order='A')
for x, n in idxs
})
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi, I think there is now a possibility to streamline this a bit, at the same time we can use regex to find columns.

Please check if this code works.

Import re at start

import re

This works for all the variable types. Also it takes one chunk on transforms that to dataframe, so no need for dictionary handling.

dims = self._stan_variable_dims[name]
idxs = []
names = []
pattern = r'^{}(\.\d)*$'.format(name)
for i, column_name in enumerate(self.column_names):
    if re.search(pattern, column_name):
        names.append(column_name)
        idxs.append(i)

var_dims = [dim0]
var_dims.extend(dims)
return pd.DataFrame(self._draws[
    self._draws_warmup :, :, idxs[0] : idxs[-1] + 1
].reshape(tuple(var_dims), order='A'), columns=names)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey!
Thanks a lot! Of course you're right. I made a few tweaks to your suggestion to make it work. Hope that's fine.

Comment thread cmdstanpy/stanfit.py
].reshape(dim0, order='A')
for x, n in idxs
})
dims = np.prod(self._stan_variable_dims[name])
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good use of prod here

Comment thread cmdstanpy/stanfit.py
)

def stan_variables(self) -> Dict:
"""
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What should we do for this function? Or maybe we could basically copy the stan_variable, but just change it so it works with multiple names. I don't mind of duplicated code, but I would like to create dataframe only once.

Would it make sense to use concat? And suggest users to learn filter

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.filter.html

cc @mitzimorris

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It could return the whole DataFrame as default and a DataFrame containing only the requested variables if names are given. But I guess that would make the stan_variable function kind of redundant.

Copy link
Copy Markdown
Contributor

@ahartikainen ahartikainen Sep 5, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

True, we could make it so that stan_variable would call stan_variables.

What is the output in CmdStanR?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know because I don't use R but I'll have a look at it later. Does stan_variables have to be a function? How about a cached property and stan_variable just uses filter on the columns?

Copy link
Copy Markdown
Contributor Author

@LukasNeugebauer LukasNeugebauer Sep 7, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just had a look at the CmdStanR documentation and the equivalent seems to be CmdStanMCMC::draws() which returns a iterations X chains X variables array or a iterations X chains array when called with a variable name as argument.

Copy link
Copy Markdown
Member

@mitzimorris mitzimorris Sep 8, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what's useful about stan_variables and corresponding stan_variable_dims is that the returned dict gives you the names of all Stan variables present in the output - something that combines these two things would be good.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So maybe we could still return a dict of dataframes in stan_variables, that sounds fine by me.

@mitzimorris
Copy link
Copy Markdown
Member

mitzimorris commented Sep 5, 2020

a pandas dataframe would allow us to keep the column names from the Stan csv file.

could you add unit tests showing how this works?
the Lotka-Volterra model is solving the ODE to get populations for species u and v at time t - this is coded as transformed data variable real z[N, 2] - each column is a species - either preditor or prey - each row is a year. column 1 names are z.1.1 through z.20.1 and column 2 names are z.1.2 through z.20.2.
it would be nice to show how using column names helps.
does this make sense?

@LukasNeugebauer
Copy link
Copy Markdown
Contributor Author

a pandas dataframe would allow us to keep the column names from the Stan csv file.

could you add unit tests showing how this works?

You mean tests that check that the expected column names show up in the DataFrame?

@ahartikainen
Copy link
Copy Markdown
Contributor

Do we want to keep x.4.7 or go with x[4,7]

@LukasNeugebauer
Copy link
Copy Markdown
Contributor Author

I think [4,7] is nicer than .4.7.

@LukasNeugebauer
Copy link
Copy Markdown
Contributor Author

Is the renaming of columns from .4.7 to [4,7] something that comes up frequently? Would that be worth a function on its own?

@ahartikainen
Copy link
Copy Markdown
Contributor

@mitzimorris what your opinion on renaming parameters? If CmdStanR doesn't do it, then we probably shouldn't so it either.

@mitzimorris
Copy link
Copy Markdown
Member

Is the renaming of columns from .4.7 to [4,7] something that comes up frequently?

renaming to [4,7] should be done, as this is what has been done, not just in the interfaces, but also in CmdStan's stansummary function. the name foo.4.7 should be thought of as the internal name - that's just the header that's produced by the Stan model code.

@LukasNeugebauer
Copy link
Copy Markdown
Contributor Author

Is the renaming of columns from .4.7 to [4,7] something that comes up frequently?

renaming to [4,7] should be done, as this is what has been done, not just in the interfaces, but also in CmdStan's stansummary function. the name foo.4.7 should be thought of as the internal name - that's just the header that's produced by the Stan model code.

Is the renaming of columns from .4.7 to [4,7] something that comes up frequently?

This should work:

def rename_columns(column_names):
    return [re.sub(r",([\d,]+)$", r"[\1]", column.replace(".",",")) for column in column_names]

Any suggestions? If not, should I put it in utils.py? Or just rename the columns in stan_variable?

@ahartikainen
Copy link
Copy Markdown
Contributor

Hmm, Should we rename things already at the csv read step?

@mitzimorris
Copy link
Copy Markdown
Member

mitzimorris commented Sep 8, 2020

Hmm, Should we rename things already at the csv read step?

yes! reasons:

  • it lines up with how variables are used in Stan program
  • that's what CmdStanR does
  • that's how parameters are reported by stansummary cmds

and if I'd thought more about it - especially w/r/t the first point, it would have been done this way from the get-go.
I can see why the Stan model generates the set of column names that it does - (commas - doh!) - but those column names might as well be considered internal names.

@LukasNeugebauer
Copy link
Copy Markdown
Contributor Author

Sorry for the late reply, I was quite busy. I'll have a look at where we would have to rename the columns and try to implement it.

Copy link
Copy Markdown
Contributor

@ahartikainen ahartikainen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Copy link
Copy Markdown
Member

@mitzimorris mitzimorris left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@ahartikainen
Copy link
Copy Markdown
Contributor

@bletham FYI I think this is a breaking change for prophet.

@mitzimorris
Copy link
Copy Markdown
Member

you do need to get rid of pylint and flake8 complaints

@LukasNeugebauer
Copy link
Copy Markdown
Contributor Author

you do need to get rid of pylint and flake8 complaints

That's the stuff in the Travis CI build, right? I had no idea what these tests are doing and was wondering why one failed anyway.
Ok, so to do for tomorrow: Find out what pylint and flake8 are and fix whatever's wrong!

@codecov-commenter
Copy link
Copy Markdown

codecov-commenter commented Sep 14, 2020

Codecov Report

Merging #287 into develop will increase coverage by 0.03%.
The diff coverage is 100.00%.

Impacted file tree graph

@@             Coverage Diff             @@
##           develop     #287      +/-   ##
===========================================
+ Coverage    76.28%   76.31%   +0.03%     
===========================================
  Files            9        9              
  Lines         2197     2200       +3     
===========================================
+ Hits          1676     1679       +3     
  Misses         521      521              
Impacted Files Coverage Δ
cmdstanpy/stanfit.py 95.56% <100.00%> (ø)
cmdstanpy/utils.py 77.65% <100.00%> (+0.12%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update d8cf8b8...eddf81f. Read the comment docs.

@mitzimorris
Copy link
Copy Markdown
Member

many thanks!

@ahartikainen ahartikainen reopened this Sep 14, 2020
@ahartikainen ahartikainen merged commit 798e178 into stan-dev:develop Sep 14, 2020
@LukasNeugebauer
Copy link
Copy Markdown
Contributor Author

My pleasure. Hopefully not the last one. Thanks for guiding me through this! :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants