
How to do a polynomial? #20

Closed · jseabold opened this issue Jun 25, 2013 · 13 comments

Comments

@jseabold (Member)

I've seen https://groups.google.com/forum/#!topic/pystatsmodels/96cMRgFXBaA, but why doesn't something like this work?

from patsy import dmatrices
data = dict(y=range(1,11), x1=range(21,31), x2=range(11,21))

dmatrices("y ~ x1 + x2**2", data)

or

dmatrices("y ~ x1 + I(x2**2)", data)

This works:

dmatrices("y ~ x1 + np.power(x2, 2)", data)
@njsmith (Member) commented Jun 25, 2013

The reason x2**2 doesn't work is that x2**2 is shorthand for x2*x2, which is shorthand for x2 + x2 + x2:x2, which collapses down to just plain x2, because self-interactions don't make sense. See
https://patsy.readthedocs.org/en/latest/formulas.html#the-formula-language

There's an argument that for numerical predictors, x2:x2 should be the same as np.multiply(x2, x2), just like x1:x2 is the same as np.multiply(x1, x2). But even then, x2**2 would mean x2 + np.multiply(x2, x2).

In your example, I(x2**2) doesn't work because x2 is a Python list, not a numpy array. patsy has no crystal ball that tells it that lists should be coerced to arrays while interpreting arbitrary Python code. Use DataFrame() instead of dict() and it will just work.
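A minimal check of both points (a sketch; the column names are how I recall patsy rendering them and may differ slightly across versions):

import pandas as pd
from patsy import dmatrices

df = pd.DataFrame(dict(y=range(1, 11), x1=range(21, 31), x2=range(11, 21)))

# "x2**2" is expanded at the formula level and collapses to plain x2:
_, X = dmatrices("y ~ x1 + x2**2", df)
print(X.design_info.column_names)  # ['Intercept', 'x1', 'x2']

# with a DataFrame, I(x2**2) evaluates as ordinary Python and squares the column:
_, X = dmatrices("y ~ x1 + I(x2**2)", df)
print(X.design_info.column_names)  # ['Intercept', 'x1', 'I(x2 ** 2)']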


@jseabold (Member, Author)

Would it be possible/desirable, if the RHS of ** is an integer (or a numeric term), to skip the interaction expansion, since that's surely not what someone wants?

@njsmith (Member) commented Jun 25, 2013

I don't understand. The only legal RHS for ** is an integer; anything else raises an error in the formula evaluator.

The intended usage is to say things like "give me all 1-, 2-, or 3-way interactions, but nothing higher than that":

(a + b + c + d + e + f)**3
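For example (a minimal sketch of the expansion; the column order follows patsy's ordering by interaction degree, as I recall it):

import pandas as pd
from patsy import dmatrix

df = pd.DataFrame(dict(a=[1., 2, 3, 4], b=[2., 3, 4, 5], c=[5., 6, 7, 8]))

# (a + b + c)**2 expands to all main effects plus all 2-way interactions
X = dmatrix("(a + b + c)**2", df)
print(X.design_info.column_names)
# ['Intercept', 'a', 'b', 'c', 'a:b', 'a:c', 'b:c']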


@jseabold (Member, Author)

Yeah, I was thinking of *. Hmm. Oh well.

@njsmith (Member) commented Jun 25, 2013

It's better to use something like poly(x, 3) anyway, right? Because that can produce orthogonalized columns. The people I know are always excited about testing whether the higher-order terms are significant, and that's most easily done if you use orthogonalized columns as your polynomial basis. (Of course this requires that we implement poly(), but that's not hard; no one has gotten around to it yet.)
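To see why orthogonalized columns make those tests easy, here's a small sketch (poly() doesn't exist in patsy yet, so this builds the basis directly with QR; the data are made up):

import numpy as np

rng = np.random.RandomState(0)
x = np.linspace(0, 1, 100)
y = 1 + 2 * x + rng.normal(scale=0.1, size=100)

# orthonormal polynomial basis: QR of the Vandermonde matrix
Q = np.linalg.qr(np.vander(x, 4)[:, ::-1])[0]

# with orthonormal columns, adding the cubic term leaves the
# lower-order coefficients untouched, so testing the higher-order
# terms doesn't disturb the rest of the fit
b2 = np.linalg.lstsq(Q[:, :3], y, rcond=None)[0]
b3 = np.linalg.lstsq(Q, y, rcond=None)[0]
print(np.allclose(b2, b3[:3]))  # True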


@jseabold (Member, Author)

I suppose, if it's for ease of inference, though I rarely see it in practice. Might just be where I hang out. Different strokes.

@jseabold (Member, Author)

Translation of Stata's orthpoly.ado, so likely not fit for inclusion anywhere, but I'll leave it here. // cc @josef-pkt

https://gist.github.com/jseabold/5859121

@njsmith (Member) commented Jun 25, 2013

I think those are orthogonal polynomials, which are different. Orthogonal polynomials are ones where, if you evaluate them at every point in their domain, the resulting vectors are orthogonal. For regression, we want polynomials such that if you evaluate them at every point where there is data, the resulting vectors are orthogonal.

There's already code for doing this using QR in the patsy Poly class for categorical polynomial coding, but it hasn't been adapted to be a numerical stateful transform.
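A quick numerical illustration of the distinction (legval and vander are standard numpy; the data points are arbitrary):

import numpy as np
from numpy.polynomial import legendre

# P1 and P2 are orthogonal as functions over [-1, 1] (integral sense)...
x = np.array([0.1, 0.3, 0.7, 0.9])  # ...but evaluated at arbitrary data points
P1 = legendre.legval(x, [0, 1, 0])  # P1(x) = x
P2 = legendre.legval(x, [0, 0, 1])  # P2(x) = (3x**2 - 1)/2
print(P1 @ P2)  # not zero: the evaluated vectors are not orthogonal

# QR of the Vandermonde matrix gives vectors orthogonal at these data points
Q = np.linalg.qr(np.vander(x, 3)[:, ::-1])[0]
print(np.isclose(Q[:, 1] @ Q[:, 2], 0))  # True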

@josef-pkt

I find it confusing that polynomials should be QR-orthogonalized. I think the main polynomials should be numpy.polynomial's xxxvander functions (legvander, chebvander, etc.).

It's a question of choosing names so that most users (and not just R users) understand what the different versions mean.
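For reference, a minimal example of what those look like (legvander is existing numpy API; the sample points are arbitrary):

import numpy as np
from numpy.polynomial import legendre

x = np.linspace(-1, 1, 5)
# pseudo-Vandermonde matrix of Legendre polynomials up to degree 3
V = legendre.legvander(x, 3)
print(V.shape)  # (5, 4): columns are P0(x), P1(x), P2(x), P3(x)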

@jseabold (Member, Author)

AFAICT, the Poly class is only intended for equally spaced categorical variables. At least, I couldn't figure out how to replicate what orthpoly gives with Poly for a continuous regressor, but I just wanted to work on something else for an hour.

I'm not sure I understand your distinction between what's orthogonal to what. I think orthpoly is like a scaled version of legvander that Josef was after.

import numpy as np
from statsmodels.tools.tools import webuse

dta = webuse("auto")
weight = dta["weight"].values.astype(float)
orth = orthpoly(weight, 4)  # orthpoly as defined in the gist above
np.allclose(np.dot(orth[:, 0], orth[:, 1]), 0)  # True

http://www.stata.com/help11.cgi?orthpoly

@njsmith (Member) commented Jun 25, 2013

I only looked at your code for a few seconds, so I could be confused :-)

This looks like a good skimmable paper on why orthogonal is better than vander in many cases (and I'm not aware of any situation where it's worse): http://www.jstor.org/stable/1403204

The confusion is that the term "orthogonal polynomial" almost always refers to something else, which afaict is totally irrelevant to linear regression. In ordinary mathematical usage, two polynomials f1 and f2 are orthogonal iff the integral of f1(x)f2(x) dx over the whole domain is 0.

What we want are polynomials such that sum f1(x)f2(x) = 0 when summing over our data points.

Poly() implements the latter (and only uses it for categorical variables treated as integers, but the code would work for non-integer data just as well).

What Stata does, I don't know. The manual page you pointed me to strongly implies that they produce the second kind of orthogonal polynomials (when it says "orthog and orthpoly create a set of variables such that the 'effects' of all the preceding variables have been removed from each variable"). But afaict the Christoffel-Darboux recurrence is for producing the first kind of orthogonal polynomials.

Are the predictor vectors that you get out of orthpoly actually orthogonal as vectors?


@jseabold (Member, Author)

Just parking this one-liner here, since I came back around to this. It's equivalent to R's poly for continuous variables, AFAIK, though not really numerically sound.

# equivalent to R's poly(x, degree), up to column signs; assumes numpy imported as np
poly = lambda x, degree: np.linalg.qr(np.vander(x, degree + 1)[:, ::-1])[0][:, 1:]
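A quick sanity check of the one-liner (a minimal sketch; the test points are arbitrary):

import numpy as np

poly = lambda x, degree: np.linalg.qr(np.vander(x, degree + 1)[:, ::-1])[0][:, 1:]

x = np.linspace(0.0, 10.0, 50)
P = poly(x, 3)
# the columns are orthonormal, and each is orthogonal to the constant column
print(np.allclose(P.T @ P, np.eye(3)))  # True
print(np.allclose(P.sum(axis=0), 0))    # True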

@matthewwardrop (Collaborator)

I'm going to close this one out since a PR exists that adds a poly implementation (#92), and there's no further action to be taken.
