Stan variable #287
ahartikainen merged 7 commits into stan-dev:develop from LukasNeugebauer:stan_variable
| if dims == 1: | ||
| idx = self.column_names.index(name) | ||
| return self._draws[self._draws_warmup :, :, idx].reshape( | ||
| (dim0,), order='A' | ||
| ) | ||
| return pd.DataFrame({ | ||
| name: self._draws[self._draws_warmup:, :, idx].reshape( | ||
| (dim0,), order='A' | ||
| ) | ||
| }) | ||
| else: | ||
| idxs = [ | ||
| x[0] | ||
| x | ||
| for x in enumerate(self.column_names) | ||
| if x[1].startswith(name + '.') | ||
| ] | ||
| var_dims = [dim0] | ||
| var_dims.extend(dims) | ||
| return self._draws[ | ||
| self._draws_warmup :, :, idxs[0] : idxs[-1] + 1 | ||
| ].reshape(tuple(var_dims), order='A') | ||
| return pd.DataFrame({ | ||
| n: self._draws[ | ||
| self._draws_warmup:, :, x | ||
| ].reshape(dim0, order='A') | ||
| for x, n in idxs | ||
| }) |
There was a problem hiding this comment.
Hi, I think we can now streamline this a bit and, at the same time, use a regex to find the columns.
Please check whether this code works.
Import re at the start:
import re
This works for all variable types. It also takes one chunk and transforms it into a DataFrame, so there is no need for dictionary handling.
dims = self._stan_variable_dims[name]
idxs = []
names = []
# Match the scalar column `name` or its indexed elements, e.g. `name.4.7`.
# `\d+` allows multi-digit indices; `re.escape` guards special characters.
pattern = r'^{}(\.\d+)*$'.format(re.escape(name))
for i, column_name in enumerate(self.column_names):
    if re.search(pattern, column_name):
        names.append(column_name)
        idxs.append(i)
var_dims = [dim0]
var_dims.extend(dims)
return pd.DataFrame(self._draws[
    self._draws_warmup :, :, idxs[0] : idxs[-1] + 1
].reshape(tuple(var_dims), order='A'), columns=names)
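For anyone who wants to try the column-matching idea outside the class, here is a minimal sketch. It is not the merged implementation: `select_variable` and the toy `draws` array are hypothetical stand-ins for the method and `self._draws`, the pattern uses `\d+` and `re.escape` so that multi-digit indices are matched, and the selected block is flattened to two dimensions (draw-major row order, which is an assumption) so that it fits a DataFrame.

```python
import re

import numpy as np
import pandas as pd


def select_variable(draws, column_names, name, warmup=0):
    """Hypothetical helper: return the post-warmup draws of one Stan variable.

    `draws` is assumed to have shape (draws, chains, columns).
    """
    # Match `name` itself or indexed elements such as `name.4.7`.
    pattern = re.compile(r'^{}(\.\d+)*$'.format(re.escape(name)))
    idxs = [i for i, col in enumerate(column_names) if pattern.match(col)]
    if not idxs:
        raise ValueError('unknown variable: {}'.format(name))
    names = [column_names[i] for i in idxs]
    # Drop warmup draws, keep the matching columns, and merge the
    # draw and chain axes so that each row is one draw of one chain.
    kept = draws[warmup:, :, idxs]
    return pd.DataFrame(kept.reshape(-1, len(idxs)), columns=names)


column_names = ['lp__', 'theta', 'beta.1', 'beta.2', 'beta.10', 'beta_raw.1']
draws = np.arange(3 * 2 * 6, dtype=float).reshape(3, 2, 6)

beta = select_variable(draws, column_names, 'beta', warmup=1)
theta = select_variable(draws, column_names, 'theta')
```

Note that `beta_raw.1` is not picked up when asking for `beta`, which is the case a plain `startswith(name + '.')` check also handles but a looser prefix match would not.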
There was a problem hiding this comment.
Hey!
Thanks a lot! Of course you're right. I made a few tweaks to your suggestion to make it work. Hope that's fine.
| ].reshape(dim0, order='A') | ||
| for x, n in idxs | ||
| }) | ||
| dims = np.prod(self._stan_variable_dims[name]) |
| ) | ||
|
|
||
| def stan_variables(self) -> Dict: | ||
| """ |
There was a problem hiding this comment.
What should we do for this function? Maybe we could basically copy stan_variable and just change it so that it works with multiple names. I don't mind duplicated code, but I would like to create the DataFrame only once.
Would it make sense to use concat, and to suggest that users learn filter?
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.filter.html
cc @mitzimorris
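To make the filter and concat suggestion concrete, here is a small sketch on a toy DataFrame. The column names and values are invented for illustration; only `DataFrame.filter` and `pandas.concat` are real pandas calls.

```python
import pandas as pd

# Toy draws table using the dotted column names found in Stan CSV output.
draws = pd.DataFrame({
    'lp__': [-7.1, -6.9, -7.4],
    'theta': [0.2, 0.3, 0.25],
    'beta.1': [1.0, 1.1, 0.9],
    'beta.2': [2.0, 2.1, 1.9],
    'beta_raw.1': [0.5, 0.6, 0.4],
})

# `filter(regex=...)` keeps the columns whose label matches the pattern.
beta = draws.filter(regex=r'^beta(\.\d+)*$')
theta = draws.filter(regex=r'^theta(\.\d+)*$')

# Several variables can be combined column-wise into a single DataFrame.
combined = pd.concat([theta, beta], axis=1)
```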
There was a problem hiding this comment.
It could return the whole DataFrame as default and a DataFrame containing only the requested variables if names are given. But I guess that would make the stan_variable function kind of redundant.
There was a problem hiding this comment.
True, we could make it so that stan_variable would call stan_variables.
What is the output in CmdStanR?
There was a problem hiding this comment.
I don't know because I don't use R but I'll have a look at it later. Does stan_variables have to be a function? How about a cached property and stan_variable just uses filter on the columns?
There was a problem hiding this comment.
Just had a look at the CmdStanR documentation and the equivalent seems to be CmdStanMCMC::draws(), which returns an iterations x chains x variables array, or an iterations x chains array when called with a variable name as argument.
There was a problem hiding this comment.
What's useful about stan_variables and the corresponding stan_variable_dims is that the returned dict gives you the names of all Stan variables present in the output. Something that combines these two things would be good.
There was a problem hiding this comment.
So maybe we could still return a dict of dataframes in stan_variables, that sounds fine by me.
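As a rough illustration of the dict-of-DataFrames idea, the sketch below groups the columns of one draws table by variable name. `group_by_variable` is a hypothetical helper, not part of CmdStanPy, and it assumes that everything before the first `.` in a column label is the variable name.

```python
import pandas as pd


def group_by_variable(draws):
    """Hypothetical helper: map each variable name to a DataFrame of its columns."""
    groups = {}
    for col in draws.columns:
        # Assumption: `beta.4.7` belongs to the variable `beta`.
        base = col.split('.', 1)[0]
        groups.setdefault(base, []).append(col)
    return {base: draws[cols] for base, cols in groups.items()}


draws = pd.DataFrame({
    'lp__': [-7.1, -6.9, -7.4],
    'theta': [0.2, 0.3, 0.25],
    'beta.1': [1.0, 1.1, 0.9],
    'beta.2': [2.0, 2.1, 1.9],
})
variables = group_by_variable(draws)
```

The keys of the resulting dict list every variable present in the output, which is the property of stan_variables described above.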
could you add unit tests showing how this works? |
You mean tests that check that the expected column names show up in the DataFrame? |
|
Do we want to keep |
|
I think [4,7] is nicer than .4.7. |
|
Is the renaming of columns from .4.7 to [4,7] something that comes up frequently? Would that be worth a function on its own? |
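A standalone renaming function of the kind asked about here might look like the sketch below. It is purely illustrative and is not the code proposed in this PR: `bracket_name` is a hypothetical name, and the function assumes that trailing `.N` groups are always array indices.

```python
import re

# A base name without dots, followed by one or more `.N` index groups.
_INDEXED = re.compile(r'^(?P<base>[^.]+)(?P<idx>(?:\.\d+)+)$')


def bracket_name(column_name):
    """Hypothetical helper: turn `beta.4.7` into `beta[4,7]`."""
    match = _INDEXED.match(column_name)
    if match is None:
        # Scalars such as `theta` or `lp__` are returned unchanged.
        return column_name
    indices = match.group('idx').lstrip('.').split('.')
    return '{}[{}]'.format(match.group('base'), ','.join(indices))


renamed = [bracket_name(c) for c in ['lp__', 'theta', 'beta.4.7', 'x.10']]
```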
|
@mitzimorris what's your opinion on renaming parameters? If CmdStanR doesn't do it, then we probably shouldn't do it either. |
renaming to |
This should work: Any suggestions? If not, should I put it in |
|
Hmm, should we rename things already at the CSV read step? |
yes! reasons:
and if I'd thought more about it - especially w/r/t the first point, it would have been done this way from the get-go. |
|
Sorry for the late reply, I was quite busy. I'll have a look at where we would have to rename the columns and try to implement it. |
|
@bletham FYI I think this is a breaking change for prophet. |
|
you do need to get rid of pylint and flake8 complaints |
That's the stuff in the Travis CI build, right? I had no idea what these tests are doing and was wondering why one failed anyway. |
Codecov Report
@@ Coverage Diff @@
## develop #287 +/- ##
===========================================
+ Coverage 76.28% 76.31% +0.03%
===========================================
Files 9 9
Lines 2197 2200 +3
===========================================
+ Hits 1676 1679 +3
Misses 521 521
|
|
many thanks! |
|
My pleasure. Hopefully not the last one. Thanks for guiding me through this! :) |
Submission Checklist
Summary
CmdStanMCMC.stan_variable now returns a pandas.DataFrame with named columns instead of a numpy.ndarray. I also adapted the test that checks the output shape, because the method now always returns a 2D DataFrame, whereas the dimensionality varied before.
Copyright and Licensing
Please list the copyright holder for the work you are submitting (this will be you or your assignee, such as a university or company):
Lukas Neugebauer
By submitting this pull request, the copyright holder is agreeing to license the submitted work under the following licenses: