data transformation from aes strings #188

jankatins · 2014-01-29T21:21:49Z

Right now this doesn't work:

import pandas
import numpy as np
from ggplot.ggplot import _build_df_from_transforms
from ggplot import aes
df = pandas.DataFrame({"a":[1,2,3,4]})
print(_build_df_from_transforms(df, aes(x="a+1"))) # works
print(_build_df_from_transforms(df, aes(x="a-1"))) # works
print(_build_df_from_transforms(df, aes(x="a-np.max(a)"))) # does not work
  File "<string>", line 1
    data.get('a')-data.get('np')data.get('max')(data.get('a'))
                                   ^
SyntaxError: invalid syntax

I would very much like to use patsy for this: https://github.com/pydata/patsy/ (docs: http://patsy.readthedocs.org/en/latest/formulas.html#the-formula-language).

Unfortunately I haven't found a way to support strings/factors without getting dummy coding, but numeric columns work:

from patsy import dmatrix
_f = "np.log(a+1)-1"
print(dmatrix(_f, df, return_type="dataframe")) # works

def factor(series):
    return series.apply(str)
_f = "factor(a)-1"
# returns a dummy encoded dataframe :-(
print( dmatrix(_f, df, return_type="dataframe") )

CC: @njsmith: do you have any ideas?

jankatins · 2014-01-29T21:22:17Z

#136 should also be fixed when working on this...

njsmith · 2014-01-30T00:49:00Z

I'm not familiar with what you're trying to do here, so hard to make recommendations :-)

What are the intended semantics of a string like "a - np.max(a)" or "a + 1"? Is it supposed to be an arithmetic transformation, or a way to list several variables (like in patsy formulas) or...?

If you want something that has the abstract structure of a formula -- i.e., a set of individual "terms", each of which is an "interaction" of "factors" -- then patsy is probably the way to go, and if you want to avoid categorical coding and such you can work with the lower level interfaces like ModelDesc directly, instead of using dmatrix.

If these strings are supposed to be arithmetic operations, then the easiest approach is probably not to go through patsy's formula parser, and not to try transforming the code either, but instead just to use Python's eval with a carefully constructed environment in which lookups get directed first to the dataframe and then, if not found, to the environment where aes was called. Patsy also has a chunk of code to abstract away the complicated parts of this -- see EvalEnvironment. Basically the way it works is you do

  def aes(..., eval_env=0):
      env = EvalEnvironment.capture(eval_env=eval_env, reference=1)
      ...

Now env is an EvalEnvironment object which allows us to interpret python code in the way that it would be interpreted in the stack frame that called aes (both for variable lookups and for __future__ flag settings). Or, if the eval_env= is given, it can be a pre-existing EvalEnvironment, or it can be an integer saying how many stack frames out you want to go (in theory this could be useful for helper functions, though in practice when clever things are going on it seems better just to pass explicit EvalEnvironments around instead of trying to count stack frames).
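To make the lookup order concrete, here is a rough standard-library approximation of the idea (data first, then the frame that called the helper). This is only a sketch: `eval_in_data` and its `depth` parameter are hypothetical names for illustration, not ggplot or patsy API, and it ignores the `__future__` flag handling that EvalEnvironment takes care of.

```python
import sys
from collections import ChainMap


def eval_in_data(expr, data, depth=0):
    """Evaluate expr, resolving names in data first, then in the caller's scope.

    Hypothetical helper for illustration only. depth=0 means "the frame that
    called eval_in_data"; larger values walk further out.
    """
    frame = sys._getframe(depth + 1)
    try:
        # ChainMap searches its mappings left to right, so data shadows
        # the caller's locals, which in turn shadow the caller's globals.
        namespace = ChainMap(data, frame.f_locals, frame.f_globals)
        # eval() runs arbitrary code, so only pass it trusted strings.
        return eval(expr, {}, namespace)
    finally:
        del frame  # avoid keeping a reference cycle through the frame


def demo():
    import math  # found in the caller's locals, not in the data
    offset = 10  # likewise taken from the caller's locals
    return eval_in_data("math.sqrt(a) + offset", {"a": 16.0})


result = demo()
```

Here `a` comes from the data mapping, while `math` and `offset` are picked up from the function that called the helper. If the data also contained a key named `offset`, that value would win.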

Anyway, then when we want to apply a transformation, we do

    values = env.eval("a - np.max(a)", inner_namespace=my_data_frame)

and we get the value of that expression, where variables are first looked for in my_data_frame, and if not found get pulled out of the EvalEnvironment namespace.
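Putting the two snippets above together, a minimal end-to-end sketch might look like this. `evaluate_mapping` is a hypothetical helper name chosen for illustration (it is not how ggplot names things); the only patsy calls used are `EvalEnvironment.capture` and `EvalEnvironment.eval`, and the example assumes numpy, pandas and patsy are installed.

```python
import numpy as np
import pandas as pd
from patsy import EvalEnvironment


def evaluate_mapping(expr, data, eval_env=0):
    """Evaluate an aes-style expression string against a DataFrame.

    Hypothetical helper for illustration only.
    """
    # reference=1: capture the frame that called evaluate_mapping,
    # so names like np are looked up where the user wrote the string.
    env = EvalEnvironment.capture(eval_env, reference=1)
    # Column names are tried first; anything else falls back to env.
    return env.eval(expr, inner_namespace=data)


df = pd.DataFrame({"a": [1, 2, 3, 4]})

# "a" is resolved as a column of df, "np" from the calling scope.
shifted = evaluate_mapping("a - np.max(a)", df)
print(shifted)
```

The expression that failed in the original report is evaluated as ordinary Python here, so dotted names and function calls need no special handling.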

...does any of this help?

jankatins · 2014-01-30T11:05:35Z

@njsmith the last para sounds like exactly what I want to do :-) Thanks a lot!

jankatins · 2014-01-30T22:25:55Z

reminder: look at #171 for examples which should work

Data transformation in aes (`aes(x="np.log(column)")`) now uses patsy.eval.EvalEnvironment. This should enable things like `np.log(column)`. Closes: yhat#188

The changes also exposed a bug in the current unittests: setting an aes mapping (`aes(fill=True)`) was considered equivalent to setting this value in the geom (`geom_density(fill=True)`). Now this will give the same weird result as in ggplot (if we had already implemented fill... -> yhat#191). The affected unittests (test_basic.py, test_readme_examples.py) were changed.

Also implement `__deepcopy__()` for `aes` and `ggplot` so that the needed eval environment is not deepcopied, as deepcopy failed with the above change. ggplot deepcopy now does *not* copy the dataframe, so this should result in some speedups. Also adjusted the unittest in test_geom.py to fit this new model. Closes: yhat#145

Added unittests (test_ggplot_internals.py) to make sure that the original data is not changed and also that no data is changed after a geom addition.

jankatins · 2014-02-28T12:02:28Z

This was merged...

This was referenced Jan 29, 2014

Proper Factor Function #40

Closed

added geom_rect. addresses #171 #184

Merged

This was referenced Feb 3, 2014

pd.Categorial and level ordering and adding to a DataFrame pandas-dev/pandas#6242

Closed

Wrong Unittests #199

Closed

jankatins mentioned this issue Feb 6, 2014

ENH: Better transformations and don't deepcopy data #201

Merged

jankatins closed this as completed Feb 28, 2014
