WIP/ENH: adding support for categorial factors #527

Closed
wants to merge 2 commits into from

3 participants

@dengemann

This relates to our recent discussion on the mailing list

  • added a private _recode function that internally recodes the x factor
  • added an additional positional argument, a dict that lets the user specify the remapping done by _recode
  • I would have preferred a kwarg, but this messes up the ax.plot call below. The options I see within this approach are a) allowing users to optionally pass a tuple like (x, x_levels), so no additional positional argument is required, and b) explicitly passing a dict with plotting parameters instead of **kwargs

Wdyt?
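
For illustration, option a) could look roughly like the sketch below. The names `split_x` and `interaction_plot_sketch` are hypothetical and not part of this PR; the stand-in function only shows how the levels would be separated from `x` before anything reaches `ax.plot`.

```python
def split_x(x):
    """Return (values, levels); levels is None unless x is a (values, dict) pair."""
    if isinstance(x, tuple) and len(x) == 2 and isinstance(x[1], dict):
        values, levels = x
        return values, levels
    return x, None


def interaction_plot_sketch(x, trace, response, **kwargs):
    # Hypothetical stand-in for interaction_plot: the levels travel with x,
    # so **kwargs still contains plotting options only.
    x, x_levels = split_x(x)
    return x, x_levels, kwargs


values, levels, plot_kwargs = interaction_plot_sketch(
    (['low', 'hi'], {'low': 0, 'hi': 1}), None, None, ms=10)
```

With this shape the signature stays backwards compatible, at the cost of a slightly unusual type for `x`.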

dengemann added some commits
@dengemann dengemann WIP/ENH: adding support for categorial factors
8c83c71
@dengemann dengemann ENH/WIP: adding simple example demonstrating categorial factorplots 7c61c31
@jseabold jseabold commented on the diff
statsmodels/graphics/factorplots.py
@@ -4,7 +4,7 @@
import utils
-def interaction_plot(x, trace, response, func=np.mean, ax=None, plottype='b',
+def interaction_plot(x, trace, response, x_levels, func=np.mean, ax=None, plottype='b',
@jseabold Owner

Are we okay with adding args like this? I don't much mind, but it breaks backwards compatibility.

@jseabold
Owner

Do you think we really need the x_levels argument? Couldn't we just check the dtype in the plot and call _recode with some default levels e.g., range(n_unique)? Thoughts?
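
A minimal sketch of that idea, assuming string-like input should be detected from the dtype and coded in sorted label order. The helper name `default_recode` is hypothetical; `np.unique(..., return_inverse=True)` is used here as one way to obtain codes equivalent to `range(n_unique)`.

```python
import numpy as np


def default_recode(x):
    """Return (codes, labels) for string-like x, or (x, None) for other dtypes."""
    x = np.asarray(x)
    if x.dtype.kind in "OSU":  # object, bytes, or unicode string arrays
        # labels are the sorted unique values; codes index into labels
        labels, codes = np.unique(x, return_inverse=True)
        return codes, labels
    return x, None


codes, labels = default_recode(['low', 'hi', 'low'])
```

Note that the codes follow the sorted order of the labels, so this default gives no control over which level is plotted on the left.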

@dengemann
@jseabold
Owner

Sure. We can update the ticklabels with the categories though, so this may alleviate some of this - they'll never see the levels. Now if you really want to control treatment on left, etc. you might be better off rolling your own plot?
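
As a rough sketch of the tick label idea, assuming the factor has already been recoded to integers. The data below is made up for illustration, and the Agg backend is selected only so the snippet runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt

# Hypothetical recoded data: integer codes plus the category each code stands for.
labels = ['hi', 'low']
codes = [1, 0, 1, 0]
response = [1.0, 2.0, 1.5, 2.5]

fig, ax = plt.subplots()
ax.plot(codes, response, 'o')
ax.set_xticks(list(range(len(labels))))  # one tick per integer code
texts = ax.set_xticklabels(labels)       # show category names instead of codes
```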

@dengemann
@josef-pkt
Owner

just a generic comment:

It takes me 5 minutes to understand what the argument names mean, even with reading the doc string.

@dengemann
@dengemann
@josef-pkt
Owner

In general, I think this was already the case before your changes.
Mainly, I didn't understand what "trace" means, or why we have the letter x while y is called "response".

(factor1, factor2, response)
(x1, x2_levels, response)
(endog, exog, groups)

in general: x1 could be continuous if we have continuous-categorical interaction.

I'm reading the function completely out of context and never tried it, so it's not obvious to me what this means, except for the basic doc string example.

I don't have a comment about the pull request directly, since I haven't figured out the levels and labels yet. (busy with other things.)

@dengemann

Closing this one, continued on clean PR.

@dengemann dengemann closed this
Commits on Oct 8, 2012
  1. @dengemann

    WIP/ENH: adding support for categorial factors

    dengemann authored
  2. @dengemann
16 examples/example_interaction_categorial.py
@@ -0,0 +1,16 @@
+
+import numpy as np
+from statsmodels.graphics.factorplots import interaction_plot
+from pandas import Series
+
+np.random.seed(12345)
+weight = Series(np.repeat(['low', 'hi', 'low', 'hi'], 15))
+nutrition = Series(np.repeat(['lo_carb', 'hi_carb'], 30))
+days = np.log(np.random.randint(1, 30, size=60))
+levels = dict(low=0, hi=1)
+fig = interaction_plot(weight, nutrition, days, levels,
+                       colors=['red', 'blue'], markers=['D', '^'],
+                       ms=10)
+
+import matplotlib.pyplot as plt
+plt.show()
58 statsmodels/graphics/factorplots.py
@@ -4,7 +4,7 @@
import utils
-def interaction_plot(x, trace, response, func=np.mean, ax=None, plottype='b',
+def interaction_plot(x, trace, response, x_levels, func=np.mean, ax=None, plottype='b',
xlabel=None, ylabel=None, colors=[], markers=[],
linestyles=[], legendloc='best', legendtitle=None,
**kwargs):
@@ -26,6 +26,9 @@ def interaction_plot(x, trace, response, func=np.mean, ax=None, plottype='b',
     response : array-like
         The response variable. If a `pandas.Series` is given
         its name will be used in `ylabel` if `ylabel` is None.
+    x_levels : dict
+        Maps categorical levels (keys, str) to factor codings (values, int)
+        for the x factor.
     func : function
         Anything accepted by `pandas.DataFrame.aggregate`. This is applied to
         the response variable grouped by the trace levels.
@@ -105,6 +108,13 @@ def interaction_plot(x, trace, response, func=np.mean, ax=None, plottype='b',
     ax.set_ylabel(ylabel)
     ax.set_xlabel(x_name)
+    if isinstance(x_levels, dict):
+        x = _recode(x, x_levels)
+
+    elif x_levels is not None:
+        raise ValueError('%s is not a valid option. '
+                         'A dict is required.' % x_levels)
+
     data = DataFrame(dict(x=x, trace=trace, response=response))
     plot_data = data.groupby(['trace', 'x']).aggregate(func).reset_index()
@@ -158,3 +168,49 @@ def interaction_plot(x, trace, response, func=np.mean, ax=None, plottype='b',
ax.legend(loc=legendloc, title=legendtitle)
ax.margins(.1)
return fig
+
+
+def _recode(a, levels):
+    """Recode categorical data to an integer factor.
+    Parameters
+    ----------
+    a : array-like
+        Array-like object of categorically coded data that supports
+        numpy array methods.
+    levels : dict
+        Mapping of labels to integer codings.
+
+    Returns
+    -------
+    out : numpy.ndarray or pandas.Series
+
+    """
+    from pandas import Series
+    name = None
+
+    if isinstance(a, Series):
+        name = a.name
+        a = a.values
+
+    if a.dtype.type not in [np.str_, np.object_]:
+        raise ValueError('This is not a categorical factor.'
+                         ' Array of str type required.')
+
+    elif not isinstance(levels, dict):
+        raise ValueError('This is not a valid value for levels.'
+                         ' Dict required.')
+
+    elif set(np.unique(a)) != set(levels):
+        raise ValueError('The levels do not match the array values.')
+
+    else:
+        out = np.empty(a.shape[0], dtype=int)
+        for level, coding in levels.items():
+            out[a == level] = coding
+
+    if name:
+        out = Series(out)
+        out.name = name
+
+    return out
+
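
For comparison, the per-level loop in `_recode` could probably be expressed with `pandas.Series.map`. This is only a sketch under that assumption, not the code in this PR; `recode_with_map` is a hypothetical name, and unlike `_recode` it only requires that every value has a key in `levels`, not that the keys match the values exactly.

```python
import pandas as pd


def recode_with_map(a, levels):
    """Map category labels to integer codes; raise if a value has no code."""
    s = pd.Series(a)        # keeps the name if `a` is already a Series
    out = s.map(levels)     # values without a key in `levels` become NaN
    if out.isna().any():
        missing = s[out.isna()].unique().tolist()
        raise ValueError('The levels do not match the array values: %s'
                         % missing)
    return out.astype(int)


weight = pd.Series(['low', 'hi', 'low'], name='weight')
recoded = recode_with_map(weight, {'low': 0, 'hi': 1})
```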