CLN/API refactor drop and filter and deprecate select #6599

hayd · 2014-03-11T16:29:40Z

need to unify api for drop and filter
should they take args for regex (regex, flags, case)
should we keep select or filter?

hayd · 2014-03-11T16:31:59Z

Also, failing tests is to convert to string, basically before doing regex I do index.astype(str), if you have non-ascii this fails.

jreback · 2014-03-11T16:38:56Z

pandas/core/generic.py

@@ -1731,44 +1710,101 @@ def _reindex_axis(self, new_index, fill_method, axis, copy):
        else:
            return self._constructor(new_data).__finalize__(self)

-    def filter(self, items=None, like=None, regex=None, axis=None):
+    def filter(self, labels=None, axis=None, inplace=False, regex=True, **kwargs):


may want to intercept previous arguments in kwargs and show a deprecation message? (and then just adjust the args and return)

ah, that's better than what I was doing (and hadn't set up the warning yet).

hayd · 2014-03-11T17:27:30Z

Also should regex be match or contains... I'm confused about difference between like/regex. (If it's contains can wrap in ^foo$ to make match.

jreback · 2014-03-11T17:45:33Z

I think if not specified should be like startswith, e.g. ^,

most often you want give me all the columns which start with foo

so match

hayd · 2014-03-11T19:15:53Z

@jreback should there be a way to distinguish between match and contains? Previously regex arg acted as contains and like arg as match (which is startswith).

Was thinking of using regex arg, could hack it to take 'match' or 'contains' (match by default)... :s or maybe regex=None use match, regex=True use contains, regex=False don't use regex ?

jreback · 2014-03-11T19:17:21Z

yes that makes sense

default is match
False is an exact
contains uses contains rather than match

jreback · 2014-03-11T19:18:50Z

maybe catch this: #5798 (deal with non-strings, e.g .0)
#5657 (I think this PR will close this - perf of filter)

jreback · 2014-03-11T19:19:48Z

@jseabold what do you think of this so far? (I think we should put this on the mailng-list when Andy is ready)....

hayd · 2014-03-11T19:22:48Z

atm I'm just False-ing numbers (rather than string-ifying them.. I tried to do that but got ascii errors, so not sure it's worth it, easy to do though). It doesn't raise now (it use string methods, although unfortunately these NaN for numbers so have to fill in after).

Will check the perf.

I can live with exact string args, gives more flexibility later.

jseabold · 2014-03-11T19:30:39Z

I'm pretty much on board AFAICT. I haven't followed too closely. I'd expect re.search rather than re.match though just becuse it's the more general one. I don't feel strongly about this though. It's not unheard of to want to match on the end of a variable name. first_educ, second_educ, etc. Something like that. If user wants match behavior it's easier to add that then to get around it. ^.+ vs ^ (or whatever).

jreback · 2014-03-11T19:31:54Z

@jseabold you can always provide a regex if you want to (or alternately can have regex='match|contains|search' maybe overkill though

jseabold · 2014-03-11T19:37:52Z

Well then I'm a bit confused about what exactly this is doing.

hayd · 2014-03-11T19:49:51Z

@jreback perf is an issue, for two reasons:

str methods are slower than == (although, if regex is False we can just use ==) and isin (esp. for short lists). Not sure how we can speed this up?
I'm boolean masking using _slice, which is slow compared to iloc (can I call loc on a generic axis / or is there a better way completely) ?

@jseabold This PR makes the API the same for filter and drop, and do the same thing under the hood. You can pass drop a regex, a boolean function (to apply to index) or a list of labels to drop.

hayd · 2014-03-11T19:52:51Z

hmmmm, unfortunately regex taking string for match/contains screws up backwards compat. Another arg?

alternative soln is to use select over filter (and dep filter instead), but not the best reason for doing so...

jreback · 2014-03-11T19:53:22Z

@hayd no I think you should always uses a compiled regex (or isin/== if you can)...that's the issue

I think you can just compute the indexer and do an iloc it should work, just construct the tuple

say indexer is your axis indexer

tuple_indexer = [ slice(None) ] * self.ndim
tuple_indexer[axis] = indexer
return self.iloc[tuple(tuple_indexer]]

jreback · 2014-03-11T19:53:56Z

@hayd what do you mean by:

hmmmm, unfortunately regex taking string for match/contains screws up backwards compat. Another arg?

hayd · 2014-03-11T19:56:45Z

Previously regex was an kwarg for filter. In my code I was doing:

if isinstance(regex, string_types): return self.filter(labels=regex, regex=True, axis=axis)

if regex is allowed to be a string (e.g. 'match') I can't do that, and of course this will break old code if they did happen to be filtering by the regex 'match' (unlikely!)

jreback · 2014-03-11T21:27:40Z

@hayd I believe this should be true

indexer = boolean indexer on axis a
df.filter(indexer,axis=a) == df.drop(df._get_axis(a)[~indexer],axis=a)

(and I think that that drop should be implemented in terms of filter)

hayd · 2014-03-12T00:22:36Z

@jreback yup, that's exactly what the _select method does.

Was thinking about this, and not sure regex should be name of flag to decide how regex is applied, as you may want to do contains without using regex. If that makes sense...

jreback · 2014-03-12T00:25:26Z

hmm regex is what replace uses

hayd · 2014-03-12T01:46:01Z

replace doesn't offer a way to tweak the way regex works (i.e. match vs contains), you either use regex or you don't. IMO regex should be bool (a la everywhere else) and have another argument to distinguish between contains vs match vs whatever ?

In [20]: s = pd.Series(['a', 'ab', 'ca'])

In [15]: s.replace(to_replace='a', value='d')
Out[15]:
0     d
1    ab
2    ca
dtype: object

In [16]: s.replace(regex='a', value='d')
Out[16]:
0     d
1    db
2    cd
dtype: object

In [17]: s.replace(to_replace='a', regex=True, value='d')
Out[17]:
0     d
1    db
2    cd
dtype: object

jseabold · 2014-03-12T02:17:16Z

I'm still just not sure I understand the issue here. It seems to me (and tell me to shut up, if you get my point already, or this is impossible for backwards compat reasons) that regex should everywhere be search. match is just a way to avoid writing '^'. But since you can just write '^', you shouldn't bend over backwards to provide some weird combinations of keywords that do match and some that do search. AFAICT, you can do everything and more with search (via re.MULTILINE flag) than you can do with match. I always thought it was a mistake (and accident of history) that match was targeted by the old regex functions. Make a one-line note in the docs "if you want to do match-like behavior, then prepend '^'". If people know what match vs contains is enough for the keywords to make sense, they probably already know about '^' right?

[~/]                                                                            
[5]: s.replace(to_replace='^a', regex=True, value='d')                          
[5]:                                                                            
0     d                                                                         
1    db                                                                         
2    ca                                                                         
dtype: object

jseabold · 2014-03-12T02:17:55Z

*old regex functions are those vectorized .str methods that Wes added in an afternoon.

hayd · 2014-03-12T04:15:14Z

Yeah, I think contains (aka search?) should be the default, you can always prepend ^ if you so wish, I'll add it as notes in the docstring (and for legacy like kwarg will still do that). regex being bool makes it easier.

hayd · 2014-03-12T04:35:33Z

I guess the issue is: that will make it a bigger API change... fixed by making regex=False the default.

jorisvandenbossche · 2014-05-02T23:18:13Z

Added a bunch of comments :-)

Sorry to sound a bit harsh (and sorry that this comes maybe a bit late, but I thougth Andy had to update it so didn't look at it yet), but I don't think this is ready to merge.

If we break backwards compatibility in these functions to clean up the API, we have one shot: we should do it once, and do it good. And I have the impression we should take a bit more care here.

My main concerns:

the API: there was some discussion about it, and in the end there were still different proposals:
- @jreback proposed to follow how filter was now (with seperate keywords for items/function and regex) (CLN/API refactor drop and filter and deprecate select #6599 (comment), second last comment of discussion)
- @hayd proposed to have regex or bool or string if labels was None (CLN/API refactor drop and filter and deprecate select #6599 (comment), last comment)
- the actual implementation is still something else (namely, regex is bool or contains/match/search to change behaviour of the regex)
I personally think the API for filter has become more difficult. Previously you had the like option, which was easy to understand. To do the same now, you have to know regex (or at least enough to know that it means you can do the behaviour of like with regex without any real regex-fancyness). And I think the majority of our users does not really know regex. (but this can maybe be solved with some more explanation in the docstring)
At the moment, for drop there is a major incompatibility, as a single string is not interpreted as a exact match anymore. At least I used this a lot this way, and also a lot of the tests are written like this (it does not let them fail now, because they often don't have overlapping column names (eg just A, B, C) so using regex or not does not matter)
just some things that are not yet correct (for example the like was not match previously)

Other todo's:

Update all docs (eg http://pandas-docs.github.io/pandas-docs-travis/indexing.html#the-select-method)
Expand the whatsnew section
Are there actually backward incompatibilities? If so, list these in appropriate section in whatsnew.
Add examples in the docstring
New tests for new functionality, and for issues that will be closed (+ it seems there are some things not covered with the old tests (eg like not only on start of string, or on ints), which would fail now)
There was a mention of a performance issue (CLN/API refactor drop and filter and deprecate select #6599 (comment)). Is this solved?

There was also said above to maybe put this on mailing list once we had a proposal for the API.

@jreback I know you want to get this in as we want to release 0.14, but I think this is important enough to not do this in a hurry just to get it in. And maybe my concerns are easily answered, and not much should be adapted. And @hayd, maybe not that clear from all my comments, but really thanks for you work on this!

jreback · 2014-05-05T00:22:28Z

ok, I think let's defer this to 0.14.1 (it should be back-compat so no biggie to putting in a minor release).

as @jorisvandenbossche brings up some good points...and don't want to throw this in w/o some more discussion / consideration

jreback · 2014-06-03T23:47:53Z

@hayd are we pushing this to 0.15 as its an API change?

can you update what the state-of-the-art with this is

hayd · 2014-06-04T03:17:23Z

I agree with @jorisvandenbossche on this, I had thought I had this a lot cleaner. I will fix up with the earlier suggestions (it's actually not that many, I just need to slog thorough it) and we can merge it soon after 0.14.1.

jreback · 2014-07-21T12:11:35Z

@hayd let's revist!

hayd · 2014-07-21T17:45:23Z

revisiting this evening, rebase + cleanup.

jreback · 2014-08-11T13:06:20Z

@hayd can we resurrect this? (and put the API changes in the top section for comment)

jreback · 2014-09-04T00:46:57Z

@hayd ????

jreback · 2014-09-14T14:22:50Z

@hayd ????

jreback · 2015-01-18T21:41:38Z

closing as stale. But this would be a much needed refactor.

jreback reviewed Mar 11, 2014
View reviewed changes

jreback added API Design labels Mar 11, 2014

jreback added this to the 0.14.0 milestone Mar 11, 2014

jreback modified the milestones: 0.14.1, 0.14.0 May 5, 2014

hayd modified the milestones: 0.15.0, 0.14.1 Jun 4, 2014

jreback modified the milestones: 0.15.0, 0.15.1 Jul 6, 2014

hayd self-assigned this Jul 21, 2014

jreback modified the milestones: 0.16, 0.15.0 Sep 14, 2014

jreback mentioned this pull request Sep 23, 2014

subset keyword argument will include last column if incorrect keys are given #8303

Closed

jreback modified the milestones: 0.16, 0.15.1 Oct 7, 2014

jreback mentioned this pull request Oct 10, 2014

Series.drop() with boolean index: inconsistent behaviour #8530

Closed

jreback mentioned this pull request Oct 21, 2014

Series.drop() with MultiIndex: inconsistent behaviour #8594

Closed

jreback closed this Jan 18, 2015

jorisvandenbossche removed this from the 0.16.0 milestone Jan 19, 2015

jorisvandenbossche mentioned this pull request Feb 19, 2016

Doc/df filter #12395

Closed

jreback mentioned this pull request Feb 20, 2016

DEPR: filter & select #12401

Closed

2 tasks

CLN/API refactor drop and filter and deprecate select #6599

CLN/API refactor drop and filter and deprecate select #6599

Conversation

hayd commented Mar 11, 2014

hayd commented Mar 11, 2014

jreback Mar 11, 2014

Choose a reason for hiding this comment

hayd Mar 11, 2014

Choose a reason for hiding this comment

hayd commented Mar 11, 2014

jreback commented Mar 11, 2014

hayd commented Mar 11, 2014

jreback commented Mar 11, 2014

jreback commented Mar 11, 2014

jreback commented Mar 11, 2014

hayd commented Mar 11, 2014

jseabold commented Mar 11, 2014

jreback commented Mar 11, 2014

jseabold commented Mar 11, 2014

hayd commented Mar 11, 2014

hayd commented Mar 11, 2014

jreback commented Mar 11, 2014

jreback commented Mar 11, 2014

hayd commented Mar 11, 2014

jreback commented Mar 11, 2014

hayd commented Mar 12, 2014

jreback commented Mar 12, 2014

hayd commented Mar 12, 2014

jseabold commented Mar 12, 2014

jseabold commented Mar 12, 2014

hayd commented Mar 12, 2014

hayd commented Mar 12, 2014

jorisvandenbossche commented May 2, 2014

jreback commented May 5, 2014

jreback commented Jun 3, 2014

hayd commented Jun 4, 2014

jreback commented Jul 21, 2014

hayd commented Jul 21, 2014

jreback commented Aug 11, 2014

jreback commented Sep 4, 2014

jreback commented Sep 14, 2014

jreback commented Jan 18, 2015