data transformation from aes strings #188

Closed
jankatins opened this issue Jan 29, 2014 · 5 comments · Fixed by #201

Comments

@jankatins
Contributor

Right now this doesn't work:

import pandas
import numpy as np
from ggplot.ggplot import _build_df_from_transforms
from ggplot import aes
df = pandas.DataFrame({"a":[1,2,3,4]})
print(_build_df_from_transforms(df, aes(x="a+1")))  # works
print(_build_df_from_transforms(df, aes(x="a-1")))  # works
print(_build_df_from_transforms(df, aes(x="a-np.max(a)")))  # does not work, fails with:
  File "<string>", line 1
    data.get('a')-data.get('np')data.get('max')(data.get('a'))
                                   ^
SyntaxError: invalid syntax

I would very much like to use patsy (https://github.com/pydata/patsy/) for this; docs: http://patsy.readthedocs.org/en/latest/formulas.html#the-formula-language

Unfortunately I haven't found a way to support strings/factors without getting dummy coding, but numeric columns work:

from patsy import dmatrix
_f = "np.log(a+1)-1"
print(dmatrix(_f, df, return_type="dataframe")) # works

def factor(series):
    return series.apply(str)
_f = "factor(a)-1"
# returns a dummy-encoded dataframe :-(
print(dmatrix(_f, df, return_type="dataframe"))

CC: @njsmith: do you have any ideas?

@jankatins
Contributor Author

#136 should also be fixed when working on this...

This was referenced Jan 29, 2014
@njsmith

njsmith commented Jan 30, 2014

I'm not familiar with what you're trying to do here, so hard to make recommendations :-)

What's the intended semantics of a string like "a - np.max(a)" or "a + 1"? Is it supposed to be an arithmetic transformation, or a way to list several variables (like in patsy formulas) or...?

If you want something that has the abstract structure of a formula -- i.e., a set of individual "terms", each of which is an "interaction" of "factors" -- then patsy is probably the way to go, and if you want to avoid categorical coding and such you can work with the lower level interfaces like ModelDesc directly, instead of using dmatrix.

If these strings are supposed to be arithmetic operations, then the easiest approach is probably not to go through patsy's formula parser, and not to try transforming the code either, but instead to just use Python's eval with a carefully constructed environment in which lookups get directed first to the dataframe and, if not found there, to the environment where aes was called. Patsy also has a chunk of code to abstract away the complicated parts of this -- see EvalEnvironment. Basically the way it works is you do

  def aes(..., eval_env=0):
      env = EvalEnvironment.capture(eval_env=eval_env, reference=1)
      ...

Now env is an EvalEnvironment object which allows us to interpret python code in the way that it would be interpreted in the stack frame that called aes (both for variable lookups and for __future__ flag settings). Or, if the eval_env= is given, it can be a pre-existing EvalEnvironment, or it can be an integer saying how many stack frames out you want to go (in theory this could be useful for helper functions, though in practice when clever things are going on it seems better just to pass explicit EvalEnvironments around instead of trying to count stack frames).
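To make the stack-frame idea concrete, here is a minimal stdlib-only sketch of capturing the caller's namespace. It is not patsy's implementation: `capture_namespace`, `aes_sketch`, and `plot_code` are hypothetical names invented for illustration, and the sketch ignores `__future__` flags, which EvalEnvironment also tracks.

```python
import inspect
from collections import ChainMap


def capture_namespace(depth=0):
    """Snapshot the names visible `depth` frames above our direct caller."""
    frame = inspect.currentframe().f_back  # frame of our direct caller
    for _ in range(depth):
        frame = frame.f_back
    # Locals shadow globals, mirroring normal name resolution.
    return ChainMap(dict(frame.f_locals), frame.f_globals)


def aes_sketch(expr):
    """Hypothetical stand-in for aes: remember where the expression was written."""
    # depth=1 skips aes_sketch itself and captures whoever called it.
    return expr, capture_namespace(depth=1)


def plot_code():
    offset = 5  # local to the caller, not visible inside aes_sketch
    expr, env = aes_sketch("offset + 1")
    # Evaluate later, using the names that were visible at the call site.
    return eval(expr, {}, env)


result = plot_code()
print(result)
```

Counting frames like this is fragile once helper functions sit between the user's code and the capture point, which is the reason given above for passing an explicit environment object around instead.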

Anyway, then when we want to apply a transformation, we do

    values = env.eval("a - np.max(a)", inner_namespace=my_data_frame)

and we get the value of that expression, where variables are first looked for in my_data_frame, and if not found get pulled out of the EvalEnvironment namespace.
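The lookup order described here (data first, then the surrounding environment) can be sketched with plain `eval` and `collections.ChainMap`. This is only an illustration of the idea under simplifying assumptions, not how patsy implements it: `eval_in_data` is a hypothetical helper, and plain dicts with scalar values stand in for the dataframe and the captured environment.

```python
import math
from collections import ChainMap


def eval_in_data(expr, data, env):
    """Evaluate `expr`, resolving names in `data` first, then in `env`."""
    # ChainMap searches its mappings left to right, so `data` shadows `env`.
    namespace = ChainMap(data, env)
    # The empty dict is the globals; Python adds the builtins to it itself.
    # Note: eval runs arbitrary code, so only use it on trusted strings.
    return eval(expr, {}, namespace)


# Stand-in for the environment where aes was called (assumed contents).
env = {"math": math, "scale": 10, "a": -1.0}
# Stand-in for the dataframe: its "a" takes priority over the one in env.
data = {"a": 16.0}

result = eval_in_data("math.sqrt(a) * scale", data, env)
print(result)
```

Here `a` comes from `data`, while `math` and `scale` are not found there and fall through to `env`.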

...does any of this help?

@jankatins
Contributor Author

@njsmith the last para sounds like exactly what I want to do :-) Thanks a lot!

@jankatins
Contributor Author

reminder: look at #171 for examples which should work

jankatins added a commit to jankatins/ggplot that referenced this issue Feb 6, 2014
Data transformation in aes (`aes(x="np.log(column)")`) now uses
patsy.eval.EvalEnvironment. This should enable things like
`np.log(column)`.

Closes: yhat#188

The changes also exposed a bug in the current unittests:
setting an aes mapping (`aes(fill=True)`) was considered equivalent
to setting this value in the geom (`geom_density(fill=True)`). Now
this will give the same weird result as in ggplot (if we
had already implemented fill... -> yhat#191). The affected
unittests (test_basic.py, test_readme_examples.py) were changed.

Also implement `__deepcopy__()` for `aes` and `ggplot` so that the
needed eval environment is not deepcopied, as deepcopy failed with
the above change. ggplot deepcopy now does *not* copy the
dataframe, so this should result in some speedups. Also adjusted
the unittest in test_geom.py to fit this new model.

Closes: yhat#145

Added unittests (test_ggplot_internals.py) to make sure that the
original data is not changed and also that no data is changed
after a geom addition.
@jankatins
Contributor Author

This was merged...

has2k1 pushed a commit to has2k1/plotnine that referenced this issue Apr 25, 2017