patsy questions/wishlist #93
Comments
For item 1 here, #97 greatly improves categorical coding time: from tens of seconds on million-record datasets to only a few seconds. NA/NaN/NaT/None and the like are a real nightmare...
Are you aware that patsy has full support for building design matrices in bounded memory, by doing multiple passes over the data? You call
Does
I'm not sure what this means. Is this a standard technique? Any references to what the resulting design matrices would look like?
True! You can always define your own stateful transform but adding features like this to
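The bounded-memory, multi-pass approach mentioned in this comment can be sketched roughly as follows. This is a minimal illustration, not a recommendation from the thread: it assumes patsy's `incr_dbuilder` and `build_design_matrices` functions, and the `iter_chunks` generator and its toy data are hypothetical stand-ins for reading a large dataset in pieces.

```python
import numpy as np
from patsy import incr_dbuilder, build_design_matrices


def iter_chunks():
    """Hypothetical chunk source; in practice each chunk might be read from disk."""
    yield {"x": np.array([1.0, 2.0, 3.0]), "g": ["a", "b", "a"]}
    yield {"x": np.array([4.0, 5.0]), "g": ["b", "c"]}


# First stage: patsy calls iter_chunks() as many times as it needs in order
# to learn the levels of g and the overall mean used by center(x).
design_info = incr_dbuilder("center(x) + g", iter_chunks)

# Second stage: build the design matrix one chunk at a time, so only one
# chunk has to be held in memory at once.
mats = []
for chunk in iter_chunks():
    (mat,) = build_design_matrices([design_info], chunk)
    mats.append(np.asarray(mat))

# Stacked here only to inspect the toy result; a real bounded-memory
# workflow would consume each chunk's matrix instead of keeping them all.
full = np.vstack(mats)
```

Because the state for `center(x)` and the levels of `g` are learned across all chunks before any matrix is built, each chunk's matrix is consistent with what a single in-memory call would have produced.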
Wow thanks for your answers.
I hadn't known about that... and it may be useful. Does What would also be helpful is that if I do something like Does that make sense?
So think of vehicle types. There are 4 door sedans, 2 door sedans, small SUV, medium SUV, large SUV, convertibles, 1/4 ton regular cab pickup, 1/4 ton extended cab pickup, 1/2 ton extended cab pickup, 1/2 ton crew cab pickup, etc. (there are actually about 200 in a list we use). What I'd like to do is conveniently group these things (and potentially have an "all else" bucket). Maybe I decide that I just want to have 2 groups: sedans vs. SUV/pickup. Maybe I want 3 groups: sedans vs. SUV vs. pickup. Maybe I want a different 3 groups: sedans/SUV vs. 1/4 ton pickups vs. 1/2 ton pickups. There's a huge potential number of groupings, and deciding which to use is honestly more gut than science. But that's the idea.
I presume with
Re:
then you could do it by using Also, regarding:
this isn't something that patsy currently keeps track of -- but potentially it could, I guess. It's a little tricky because from patsy's point of view,
Re: categorical grouping: Oh, I see, you just want a way to recode your categorical so that it merges some existing categories together? Sure, that makes sense (though patsy doesn't have anything for this right now). In the general case this is probably something that makes more sense to do in pandas, using all its tools, before entering patsy, since patsy's core competence isn't arbitrary data manipulation but rather providing a terse mini-language for experimenting with models. (E.g., you probably shouldn't be listing 200 different categories inside your patsy formula string :-).) But if there's a nice notation for simple kinds of grouping that would fit into patsy, then that would make sense.
Remember also that you're free to define your own helper functions and use them in patsy without them having to be built into patsy, e.g., you can do stuff like:

    def collapse_to_sedan_vs_pickup(categorical_series):
        ...
        return recoded_series

    dmatrix("bs(Age, 4) + C(collapse_to_sedan_vs_pickup(VehicleType), Poly) + x", data=df)

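One possible way to fill in the body of such a helper is a dictionary lookup with a catch-all bucket. This is only a sketch: the `GROUPS` mapping, the `"other"` label, and the toy DataFrame are assumptions made up for illustration, and default treatment coding is used instead of `Poly` to keep the example small.

```python
import pandas as pd
from patsy import dmatrix

# Hypothetical mapping from detailed vehicle types to coarse groups.
GROUPS = {
    "4 door sedan": "sedan",
    "2 door sedan": "sedan",
    "1/4 ton regular cab pickup": "pickup",
    "1/2 ton crew cab pickup": "pickup",
}


def collapse_to_sedan_vs_pickup(categorical_series):
    """Recode detailed types into coarse groups.

    Anything not listed in GROUPS (including missing values) falls into
    the "other" bucket in this sketch.
    """
    return pd.Series(categorical_series).map(GROUPS).fillna("other")


df = pd.DataFrame({
    "VehicleType": ["4 door sedan", "1/2 ton crew cab pickup",
                    "small SUV", "2 door sedan"],
    "x": [1.0, 2.0, 3.0, 4.0],
})

# The helper is looked up in the calling namespace when the formula is evaluated.
design = dmatrix("C(collapse_to_sedan_vs_pickup(VehicleType)) + x", data=df)
```

Trying a different grouping then only means swapping the mapping, which keeps the 200-odd category names out of the formula string.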
Right, internally the way stateful transforms are implemented is confusing, because of the need to work with
The complications you're seeing are that if we don't have all the data loaded in memory at once, then we can't just call
Adding weight support would require finding and implementing an algorithm for incrementally computing the weighted std -- but if you scroll down in that wikipedia link, then the next section is called "weighted incremental algorithm", so maybe this is not so hard ;-).
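For reference, a weighted incremental mean/std update of the kind referred to above might look like the sketch below. It follows the commonly cited single-pass weighted update (often attributed to West) and is not patsy code; the class name `WeightedRunningStd`, the population-style normalization (dividing by the total weight), and the choice to skip non-positive weights are all assumptions of this sketch.

```python
import math


class WeightedRunningStd:
    """Accumulate a weighted mean and std one chunk at a time."""

    def __init__(self):
        self.w_sum = 0.0  # total weight seen so far
        self.mean = 0.0   # running weighted mean
        self.s = 0.0      # running sum of w * (x - mean)^2

    def update(self, values, weights):
        """Fold one chunk of (value, weight) pairs into the running state."""
        for x, w in zip(values, weights):
            if w <= 0:
                continue  # non-positive weights are ignored in this sketch
            self.w_sum += w
            old_mean = self.mean
            self.mean = old_mean + (w / self.w_sum) * (x - old_mean)
            self.s += w * (x - old_mean) * (x - self.mean)

    def std(self):
        """Weighted population standard deviation of everything seen so far."""
        return math.sqrt(self.s / self.w_sum)


# Two chunks standing in for data that does not fit in memory at once.
acc = WeightedRunningStd()
acc.update([1.0, 2.0], [1.0, 1.0])
acc.update([3.0, 4.0], [2.0, 2.0])
```

The state is just three numbers, so it could be carried across chunks in the same way the unweighted statistics are.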
Hello,
I fit some (relatively) large-ish GLMs in statsmodels and have been experimenting with using `patsy` instead of a home-rolled thing. My home-rolled method isn't very good (I tend to underestimate challenges...). I've gotten some better hardware, so some models that used to not work with patsy (because of memory constraints) work now. I've run across a few things that might make it easier for me to use patsy more. Happy to work on PRs for them if there's interest.
1. In `patsy`, every individual value is checked against a rather detailed list of rules for how to handle NaNs/missing values/empty whatever. I ran cProfile on this and it was quite slow. I think the bottleneck is here: https://github.com/pydata/patsy/blob/master/patsy/categorical.py#L341
   I know that NaNs/NA/None/Empty is a mess in general, but it's a fact of life in my line of work (insurance modeling). I'm wondering if we could scope out exactly which scenarios we need to control for and use (pandas maybe?) to do this more elegantly? I'm not sure of the scope, as there are far more players here than me.
2. Take a model like `y ~ a + b + a:c`. I want to come up with predictions of `y` assuming that just `a` changes or just `b` changes. I think the process would look something like this (assuming we're talking about changing only `a`): create a new design matrix with every unique value of `a` as a separate row, and hold `b` and `c` constant for these rows at their most frequent value (or some other innocuous value). Then feed this dataset through the statsmodels `predict` routine. This is very helpful for GLMs with the log link, which is the bulk of what I work with.
3. With `standardize`, it may make sense to weight the observations. (Really only applicable if you have really skewed data where certain values are more prevalent on higher-weighted records.)
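The "change only `a`" process from item 2 could be sketched with plain pandas as below. The helper name `one_at_a_time_grid`, its parameters, and the toy data are hypothetical, and using the most frequent value as the held-constant value is just the choice suggested in the item.

```python
import pandas as pd


def one_at_a_time_grid(df, vary, hold):
    """One row per unique value of `vary`; `hold` columns fixed at their mode."""
    levels = sorted(df[vary].dropna().unique())
    grid = pd.DataFrame({vary: levels})
    for col in hold:
        # mode() can return several values on ties; take the first one.
        grid[col] = df[col].mode().iloc[0]
    return grid


df = pd.DataFrame({
    "a": ["x", "y", "z", "x"],
    "b": ["p", "p", "q", "p"],
    "c": [1.0, 2.0, 2.0, 3.0],
})

# Every level of a, with b and c held at their most frequent values.
grid = one_at_a_time_grid(df, vary="a", hold=["b", "c"])
```

Assuming the model was fit through a formula interface that remembers the original design, a frame like `grid` would then be passed to the fitted model's `predict` so that the same coding of `a`, `b`, and `c` is reapplied.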