Stan variable by LukasNeugebauer · Pull Request #287 · stan-dev/cmdstanpy

LukasNeugebauer · 2020-09-04T19:23:09Z

Submission Checklist

Run unit tests
Declare copyright holder and open-source license: see below

Summary

CmdStanMCMC.stan_variable now returns pandas.DataFrames with named columns instead of numpy.ndarrays. I also adapted the test that checks the output shape. This is because we're now always returning 2D DataFrames, while the dimensionality was variable before.

Copyright and Licensing

Please list the copyright holder for the work you are submitting (this will be you or your assignee, such as a university or company):

Lukas Neugebauer

By submitting this pull request, the copyright holder is agreeing to license the submitted work under the following licenses:

Code: BSD 3-clause (https://opensource.org/licenses/BSD-3-Clause)

ahartikainen · 2020-09-04T20:28:17Z

        if dims == 1:
            idx = self.column_names.index(name)
-            return self._draws[self._draws_warmup :, :, idx].reshape(
-                (dim0,), order='A'
-            )
+            return pd.DataFrame({
+                name: self._draws[self._draws_warmup:, :, idx].reshape(
+                    (dim0,), order='A'
+                )
+            })
        else:
            idxs = [
-                x[0]
+                x
                for x in enumerate(self.column_names)
                if x[1].startswith(name + '.')
            ]
-            var_dims = [dim0]
-            var_dims.extend(dims)
-            return self._draws[
-                self._draws_warmup :, :, idxs[0] : idxs[-1] + 1
-            ].reshape(tuple(var_dims), order='A')
+            return pd.DataFrame({
+                n: self._draws[
+                    self._draws_warmup:, :, x
+                ].reshape(dim0, order='A')
+                for x, n in idxs
+            })


Hi, I think there is now a possibility to streamline this a bit, at the same time we can use regex to find columns.

Please check if this code works.

Import re at start

import re

This works for all variable types. It also takes a single chunk of the draws and transforms it into a DataFrame, so there is no need for dictionary handling.

        dims = self._stan_variable_dims[name]
        idxs = []
        names = []
        pattern = r'^{}(\.\d)*$'.format(name)
        for i, column_name in enumerate(self.column_names):
            if re.search(pattern, column_name):
                names.append(column_name)
                idxs.append(i)
        var_dims = [dim0]
        var_dims.extend(dims)
        return pd.DataFrame(self._draws[
            self._draws_warmup :, :, idxs[0] : idxs[-1] + 1
        ].reshape(tuple(var_dims), order='A'), columns=names)

Hey!
Thanks a lot! Of course you're right. I made a few tweaks to your suggestion to make it work. Hope that's fine.
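
For readers who want to try the regex-based selection outside the class, here is a self-contained sketch. The function name `select_variable` and the plain (draws, chains, columns) array are assumptions for illustration, not CmdStanPy API. It deviates from the reviewed snippet in two deliberate ways: it uses `\d+` instead of `\d` so that multi-digit indices such as `z.20.1` match, and it passes the name through `re.escape`. Rows are flattened in NumPy's default C order, which may differ from the `order='A'` reshape in the diff.

```python
import re

import numpy as np
import pandas as pd


def select_variable(draws, column_names, name):
    """Return the draws of one Stan variable as a 2D DataFrame.

    draws: array of shape (n_draws, n_chains, n_columns)
    column_names: CSV-style names such as 'z.1.2'
    """
    # re.escape guards against regex metacharacters in the name;
    # \d+ accepts multi-digit indices such as 'z.20.1'.
    pattern = re.compile(r'^{}(\.\d+)*$'.format(re.escape(name)))
    idxs = [i for i, col in enumerate(column_names) if pattern.match(col)]
    if not idxs:
        raise ValueError('unknown variable: {}'.format(name))
    names = [column_names[i] for i in idxs]
    n_draws, n_chains, _ = draws.shape
    # Flatten draws and chains into rows; one column per matched element.
    flat = draws[:, :, idxs].reshape(n_draws * n_chains, len(idxs))
    return pd.DataFrame(flat, columns=names)
```

Anchoring the pattern with `^` and `$` keeps a variable named `z` from also picking up columns of a variable named `zz`.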

ahartikainen · 2020-09-05T10:29:28Z

-                ].reshape(dim0, order='A')
-                for x, n in idxs
-            })
+        dims = np.prod(self._stan_variable_dims[name])


Good use of prod here
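
A small illustration of what the product gives here: the product of a variable's dimensions is the number of flattened columns it occupies. The helper name `n_columns` is hypothetical, and treating dims as a tuple or list of ints is an assumption. Note that `np.prod` of an empty sequence returns the float `1.0`, hence the explicit `int(...)`.

```python
import numpy as np


def n_columns(dims):
    """Number of flattened columns for a variable with the given dims."""
    # np.prod(()) is 1.0 (a float), so convert explicitly.
    return int(np.prod(dims))
```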

ahartikainen · 2020-09-05T10:37:52Z

+        )

    def stan_variables(self) -> Dict:
        """


What should we do with this function? Maybe we could basically copy stan_variable and change it so that it works with multiple names. I don't mind duplicated code, but I would like to create the DataFrame only once.

Would it make sense to use concat, and suggest that users learn filter?

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.filter.html

cc @mitzimorris
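
A minimal sketch of the concat-plus-filter idea, using made-up frames for two variables `mu` and `theta` (the data and names are illustrative only). `pd.concat(..., axis=1)` joins the per-variable frames column-wise, and `DataFrame.filter(regex=...)` selects one variable's columns back out.

```python
import numpy as np
import pandas as pd

# Hypothetical per-variable frames with three draws each.
mu = pd.DataFrame({'mu': [0.1, 0.2, 0.3]})
theta = pd.DataFrame(
    np.arange(6.0).reshape(3, 2), columns=['theta.1', 'theta.2']
)

# One wide frame holding every variable.
all_vars = pd.concat([mu, theta], axis=1)

# Pull a single variable back out by column name.
theta_only = all_vars.filter(regex=r'^theta(\.\d+)*$')
```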

It could return the whole DataFrame as default and a DataFrame containing only the requested variables if names are given. But I guess that would make the stan_variable function kind of redundant.

True, we could make it so that stan_variable would call stan_variables.

What is the output in CmdStanR?

I don't know because I don't use R but I'll have a look at it later. Does stan_variables have to be a function? How about a cached property and stan_variable just uses filter on the columns?

Just had a look at the CmdStanR documentation, and the equivalent seems to be CmdStanMCMC::draws(), which returns an iterations x chains x variables array, or an iterations x chains array when called with a variable name as its argument.

what's useful about stan_variables and corresponding stan_variable_dims is that the returned dict gives you the names of all Stan variables present in the output - something that combines these two things would be good.

So maybe we could still return a dict of dataframes in stan_variables, that sounds fine by me.
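
One way the "dict of DataFrames" idea could look, as a sketch only: group the columns of a wide frame under their base variable name, so the dict keys list every Stan variable present. The function name `split_by_variable` is hypothetical, and the sketch assumes CSV-style names where everything before the first `.` is the variable name.

```python
import pandas as pd


def split_by_variable(df):
    """Map each base variable name to a DataFrame of its columns."""
    groups = {}
    for col in df.columns:
        # 'z.1.2' -> 'z'; names without a dot map to themselves.
        base = col.split('.', 1)[0]
        groups.setdefault(base, []).append(col)
    return {base: df[cols] for base, cols in groups.items()}
```

Each value keeps the original column names, so the per-variable frames remain compatible with name-based selection.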

mitzimorris · 2020-09-05T15:48:57Z

a pandas dataframe would allow us to keep the column names from the Stan csv file.

could you add unit tests showing how this works?
the Lotka-Volterra model is solving the ODE to get populations for species u and v at time t - this is coded as transformed data variable real z[N, 2] - each column is a species - either predator or prey - each row is a year. column 1 names are z.1.1 through z.20.1 and column 2 names are z.1.2 through z.20.2.
it would be nice to show how using column names helps.
does this make sense?
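
To illustrate how the column names help in the Lotka-Volterra case, here is a sketch with placeholder data (zeros, not real model output). It builds the 40 names `z.1.1` ... `z.20.2` in the order described above and then selects one species by name.

```python
import numpy as np
import pandas as pd

n_years, n_species = 20, 2

# Names in the order given above: z.1.1 ... z.20.1, then z.1.2 ... z.20.2.
names = [
    'z.{}.{}'.format(year, species)
    for species in range(1, n_species + 1)
    for year in range(1, n_years + 1)
]

# Placeholder draws: 4 rows of zeros, one column per element of z.
draws = pd.DataFrame(np.zeros((4, len(names))), columns=names)

# All years for the first species, selected purely by column name.
species_1 = draws.filter(regex=r'^z\.\d+\.1$')
```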

LukasNeugebauer · 2020-09-05T15:53:55Z

a pandas dataframe would allow us to keep the column names from the Stan csv file.

could you add unit tests showing how this works?

You mean tests that check that the expected column names show up in the DataFrame?

ahartikainen · 2020-09-05T15:58:25Z

Do we want to keep x.4.7 or go with x[4,7]

LukasNeugebauer · 2020-09-05T16:04:39Z

I think [4,7] is nicer than .4.7.

LukasNeugebauer · 2020-09-07T07:37:28Z

Is the renaming of columns from .4.7 to [4,7] something that comes up frequently? Would that be worth a function on its own?

ahartikainen · 2020-09-08T12:23:00Z

@mitzimorris what's your opinion on renaming parameters? If CmdStanR doesn't do it, then we probably shouldn't do it either.

mitzimorris · 2020-09-08T15:57:51Z

Is the renaming of columns from .4.7 to [4,7] something that comes up frequently?

renaming to [4,7] should be done, as this is what has been done, not just in the interfaces, but also in CmdStan's stansummary function. the name foo.4.7 should be thought of as the internal name - that's just the header that's produced by the Stan model code.

LukasNeugebauer · 2020-09-08T16:27:11Z

Is the renaming of columns from .4.7 to [4,7] something that comes up frequently?

renaming to [4,7] should be done, as this is what has been done, not just in the interfaces, but also in CmdStan's stansummary function. the name foo.4.7 should be thought of as the internal name - that's just the header that's produced by the Stan model code.

This should work:

def rename_columns(column_names):
    return [re.sub(r",([\d,]+)$", r"[\1]", column.replace(".",",")) for column in column_names]

Any suggestions? If not, should I put it in utils.py? Or just rename the columns in stan_variable?
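
For comparison, a regex-free variant of the same renaming, written as a sketch and not taken from the PR. It only rewrites a name when everything after the first dot consists of integer indices, and leaves all other names untouched. The function name is changed to `rename_columns_strict` to make clear it is a hypothetical alternative.

```python
def rename_columns_strict(column_names):
    """Rename 'z.4.7' to 'z[4,7]'; leave other names unchanged."""
    renamed = []
    for name in column_names:
        base, dot, indices = name.partition('.')
        # Only rewrite when every dot-separated part is an integer index.
        if dot and all(part.isdigit() for part in indices.split('.')):
            renamed.append('{}[{}]'.format(base, indices.replace('.', ',')))
        else:
            renamed.append(name)
    return renamed
```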

ahartikainen · 2020-09-08T16:58:20Z

Hmm, Should we rename things already at the csv read step?

mitzimorris · 2020-09-08T19:07:11Z

Hmm, Should we rename things already at the csv read step?

yes! reasons:

it lines up with how variables are used in Stan program
that's what CmdStanR does
that's how parameters are reported by stansummary cmds

and if I'd thought more about it - especially w/r/t the first point, it would have been done this way from the get-go.
I can see why the Stan model generates the set of column names that it does - (commas - doh!) - but those column names might as well be considered internal names.
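
A rough sketch of renaming at the read step, under clearly stated assumptions: it uses `pandas.read_csv` on an in-memory string with made-up contents, which is not necessarily how CmdStanPy parses its output files, and `to_stan_name` is a hypothetical helper. Comment lines starting with `#` are skipped, and the header is renamed once, right after reading.

```python
import io

import pandas as pd

# Made-up CSV text in the general shape of a Stan output file.
csv_text = (
    "# model = example\n"
    "lp__,theta.1,theta.2\n"
    "-1.0,0.1,0.2\n"
    "# Adaptation terminated\n"
    "-1.1,0.3,0.4\n"
)


def to_stan_name(col):
    """Turn the internal name 'theta.1' into 'theta[1]'."""
    base, dot, indices = col.partition('.')
    if not dot:
        return col
    return '{}[{}]'.format(base, indices.replace('.', ','))


# Skip comment lines, then rename the header in one pass.
df = pd.read_csv(io.StringIO(csv_text), comment='#').rename(columns=to_stan_name)
```

Renaming once at this point means every later consumer sees the same names that stansummary reports.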

LukasNeugebauer · 2020-09-13T13:57:45Z

Sorry for the late reply, I was quite busy. I'll have a look at where we would have to rename the columns and try to implement it.

ahartikainen

LGTM

mitzimorris

LGTM!

ahartikainen · 2020-09-14T19:00:40Z

@bletham FYI I think this is a breaking change for prophet.
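
To make the nature of the break concrete: downstream code that expected a NumPy array now receives a DataFrame, and a scalar parameter that used to come back 1D is now a single-column 2D frame. A defensive shim along the following lines is one possible way for callers to cope; `as_array` is a hypothetical helper, not something from CmdStanPy or prophet.

```python
import numpy as np
import pandas as pd


def as_array(value):
    """Accept either return type and hand back a NumPy array."""
    if isinstance(value, pd.DataFrame):
        return value.to_numpy()
    return np.asarray(value)
```

Callers that relied on the old 1D shape would still need to squeeze or index the result themselves.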

mitzimorris · 2020-09-14T19:01:55Z

you do need to get rid of pylint and flake8 complaints

LukasNeugebauer · 2020-09-14T20:07:50Z

you do need to get rid of pylint and flake8 complaints

That's the stuff in the Travis CI build, right? I had no idea what these tests are doing and was wondering why one failed anyway.
Ok, so to do for tomorrow: Find out what pylint and flake8 are and fix whatever's wrong!

codecov-commenter · 2020-09-14T21:24:07Z

Codecov Report

Merging #287 into develop will increase coverage by 0.03%.
The diff coverage is 100.00%.

@@             Coverage Diff             @@
##           develop     #287      +/-   ##
===========================================
+ Coverage    76.28%   76.31%   +0.03%     
===========================================
  Files            9        9              
  Lines         2197     2200       +3     
===========================================
+ Hits          1676     1679       +3     
  Misses         521      521

Impacted Files	Coverage Δ
cmdstanpy/stanfit.py	`95.56% <100.00%> (ø)`
cmdstanpy/utils.py	`77.65% <100.00%> (+0.12%)`	⬆️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update d8cf8b8...eddf81f. Read the comment docs.

mitzimorris · 2020-09-14T21:43:34Z

many thanks!

LukasNeugebauer · 2020-09-15T06:26:03Z

My pleasure. Hopefully not the last one. Thanks for guiding me through this! :)

Lukas Neugebauer added 2 commits September 4, 2020 19:05

stan_variable now returns a pd.DataFrame

3f78137

Updated description and adapted unit tests to new output of stan_variable

0c48ec5

ahartikainen reviewed Sep 4, 2020

ahartikainen reviewed Sep 5, 2020

stan_variable now in more streamlined version according to Ari's suggestions

98a6457

Columns are being renamed, adapted tests

6406aee

ahartikainen approved these changes Sep 14, 2020

mitzimorris approved these changes Sep 14, 2020

Updated regex patterns to search for new column names

c2c5de5

mitzimorris closed this Sep 14, 2020

ahartikainen reopened this Sep 14, 2020

ahartikainen merged commit 798e178 into stan-dev:develop Sep 14, 2020

Lukas Neugebauer added 2 commits September 15, 2020 00:34

Minor formatting to make flake8 and pylint happy

beef6b0

deleted redundant 'shape'

eddf81f

mitzimorris mentioned this pull request Sep 15, 2020

stan_variable return pandas dataframe to allow for named columns #276

Closed

bletham mentioned this pull request Mar 4, 2021

Upgrade CmdStanPy interface facebook/prophet#1834

Closed
