Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CLN/API refactor drop and filter and deprecate select #6599

Closed
wants to merge 1 commit into from

Conversation

hayd
Copy link
Contributor

@hayd hayd commented Mar 11, 2014

closes #4818

  • need to unify api for drop and filter
  • should they take args for regex (regex, flags, case)
  • should we keep select or filter?

closes #6189
xref #8530
xref #8594

@hayd
Copy link
Contributor Author

hayd commented Mar 11, 2014

Also, failing tests is to convert to string, basically before doing regex I do index.astype(str), if you have non-ascii this fails.

@@ -1731,44 +1710,101 @@ def _reindex_axis(self, new_index, fill_method, axis, copy):
else:
return self._constructor(new_data).__finalize__(self)

def filter(self, items=None, like=None, regex=None, axis=None):
def filter(self, labels=None, axis=None, inplace=False, regex=True, **kwargs):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

may want to intercept previous arguments in kwargs and show a deprecation message? (and then just adjust the args and return)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah, that's better than what I was doing (and hadn't set up the warning yet).

@jreback jreback added this to the 0.14.0 milestone Mar 11, 2014
@hayd
Copy link
Contributor Author

hayd commented Mar 11, 2014

Also should regex be match or contains... I'm confused about difference between like/regex. (If it's contains can wrap in ^foo$ to make match.

@jreback
Copy link
Contributor

jreback commented Mar 11, 2014

I think if not specified should be like startswith, e.g. ^,

most often you want give me all the columns which start with foo

so match

@hayd
Copy link
Contributor Author

hayd commented Mar 11, 2014

@jreback should there be a way to distinguish between match and contains? Previously regex arg acted as contains and like arg as match (which is startswith).

Was thinking of using regex arg, could hack it to take 'match' or 'contains' (match by default)... :s or maybe regex=None use match, regex=True use contains, regex=False don't use regex ?

@jreback
Copy link
Contributor

jreback commented Mar 11, 2014

yes that makes sense

  • default is match
  • False is an exact
  • contains uses contains rather than match

@jreback
Copy link
Contributor

jreback commented Mar 11, 2014

maybe catch this: #5798 (deal with non-strings, e.g .0)
#5657 (I think this PR will close this - perf of filter)

@jreback
Copy link
Contributor

jreback commented Mar 11, 2014

@jseabold what do you think of this so far? (I think we should put this on the mailng-list when Andy is ready)....

@hayd
Copy link
Contributor Author

hayd commented Mar 11, 2014

atm I'm just False-ing numbers (rather than string-ifying them.. I tried to do that but got ascii errors, so not sure it's worth it, easy to do though). It doesn't raise now (it use string methods, although unfortunately these NaN for numbers so have to fill in after).

Will check the perf.

I can live with exact string args, gives more flexibility later.

@jseabold
Copy link
Contributor

I'm pretty much on board AFAICT. I haven't followed too closely. I'd expect re.search rather than re.match though just becuse it's the more general one. I don't feel strongly about this though. It's not unheard of to want to match on the end of a variable name. first_educ, second_educ, etc. Something like that. If user wants match behavior it's easier to add that then to get around it. ^.+ vs ^ (or whatever).

@jreback
Copy link
Contributor

jreback commented Mar 11, 2014

@jseabold you can always provide a regex if you want to (or alternately can have regex='match|contains|search' maybe overkill though

@jseabold
Copy link
Contributor

Well then I'm a bit confused about what exactly this is doing.

@hayd
Copy link
Contributor Author

hayd commented Mar 11, 2014

@jreback perf is an issue, for two reasons:

  1. str methods are slower than == (although, if regex is False we can just use ==) and isin (esp. for short lists). Not sure how we can speed this up?
  2. I'm boolean masking using _slice, which is slow compared to iloc (can I call loc on a generic axis / or is there a better way completely) ?

@jseabold This PR makes the API the same for filter and drop, and do the same thing under the hood. You can pass drop a regex, a boolean function (to apply to index) or a list of labels to drop.

@hayd
Copy link
Contributor Author

hayd commented Mar 11, 2014

hmmmm, unfortunately regex taking string for match/contains screws up backwards compat. Another arg?

alternative soln is to use select over filter (and dep filter instead), but not the best reason for doing so...

@jreback
Copy link
Contributor

jreback commented Mar 11, 2014

@hayd no I think you should always uses a compiled regex (or isin/== if you can)...that's the issue

I think you can just compute the indexer and do an iloc it should work, just construct the tuple

say indexer is your axis indexer

tuple_indexer = [ slice(None) ] * self.ndim
tuple_indexer[axis] = indexer
return self.iloc[tuple(tuple_indexer]]

@jreback
Copy link
Contributor

jreback commented Mar 11, 2014

@hayd what do you mean by:

hmmmm, unfortunately regex taking string for match/contains screws up backwards compat. Another arg?

@hayd
Copy link
Contributor Author

hayd commented Mar 11, 2014

Previously regex was an kwarg for filter. In my code I was doing:

if isinstance(regex, string_types): return self.filter(labels=regex, regex=True, axis=axis)

if regex is allowed to be a string (e.g. 'match') I can't do that, and of course this will break old code if they did happen to be filtering by the regex 'match' (unlikely!)

@jreback
Copy link
Contributor

jreback commented Mar 11, 2014

@hayd I believe this should be true

indexer = boolean indexer on axis a
df.filter(indexer,axis=a) == df.drop(df._get_axis(a)[~indexer],axis=a)

(and I think that that drop should be implemented in terms of filter)

@hayd
Copy link
Contributor Author

hayd commented Mar 12, 2014

@jreback yup, that's exactly what the _select method does.

Was thinking about this, and not sure regex should be name of flag to decide how regex is applied, as you may want to do contains without using regex. If that makes sense...

@jreback
Copy link
Contributor

jreback commented Mar 12, 2014

hmm regex is what replace uses

@hayd
Copy link
Contributor Author

hayd commented Mar 12, 2014

replace doesn't offer a way to tweak the way regex works (i.e. match vs contains), you either use regex or you don't. IMO regex should be bool (a la everywhere else) and have another argument to distinguish between contains vs match vs whatever ?

In [20]: s = pd.Series(['a', 'ab', 'ca'])

In [15]: s.replace(to_replace='a', value='d')
Out[15]:
0     d
1    ab
2    ca
dtype: object

In [16]: s.replace(regex='a', value='d')
Out[16]:
0     d
1    db
2    cd
dtype: object

In [17]: s.replace(to_replace='a', regex=True, value='d')
Out[17]:
0     d
1    db
2    cd
dtype: object

@jseabold
Copy link
Contributor

I'm still just not sure I understand the issue here. It seems to me (and tell me to shut up, if you get my point already, or this is impossible for backwards compat reasons) that regex should everywhere be search. match is just a way to avoid writing '^'. But since you can just write '^', you shouldn't bend over backwards to provide some weird combinations of keywords that do match and some that do search. AFAICT, you can do everything and more with search (via re.MULTILINE flag) than you can do with match. I always thought it was a mistake (and accident of history) that match was targeted by the old regex functions. Make a one-line note in the docs "if you want to do match-like behavior, then prepend '^'". If people know what match vs contains is enough for the keywords to make sense, they probably already know about '^' right?

[~/]                                                                            
[5]: s.replace(to_replace='^a', regex=True, value='d')                          
[5]:                                                                            
0     d                                                                         
1    db                                                                         
2    ca                                                                         
dtype: object 

@jseabold
Copy link
Contributor

*old regex functions are those vectorized .str methods that Wes added in an afternoon.

@hayd
Copy link
Contributor Author

hayd commented Mar 12, 2014

Yeah, I think contains (aka search?) should be the default, you can always prepend ^ if you so wish, I'll add it as notes in the docstring (and for legacy like kwarg will still do that). regex being bool makes it easier.

@hayd
Copy link
Contributor Author

hayd commented Mar 12, 2014

I guess the issue is: that will make it a bigger API change... fixed by making regex=False the default.

@jorisvandenbossche
Copy link
Member

Added a bunch of comments :-)

Sorry to sound a bit harsh (and sorry that this comes maybe a bit late, but I thougth Andy had to update it so didn't look at it yet), but I don't think this is ready to merge.

If we break backwards compatibility in these functions to clean up the API, we have one shot: we should do it once, and do it good. And I have the impression we should take a bit more care here.

My main concerns:

  • the API: there was some discussion about it, and in the end there were still different proposals:
  • I personally think the API for filter has become more difficult. Previously you had the like option, which was easy to understand. To do the same now, you have to know regex (or at least enough to know that it means you can do the behaviour of like with regex without any real regex-fancyness). And I think the majority of our users does not really know regex. (but this can maybe be solved with some more explanation in the docstring)
  • At the moment, for drop there is a major incompatibility, as a single string is not interpreted as a exact match anymore. At least I used this a lot this way, and also a lot of the tests are written like this (it does not let them fail now, because they often don't have overlapping column names (eg just A, B, C) so using regex or not does not matter)
  • just some things that are not yet correct (for example the like was not match previously)

Other todo's:

There was also said above to maybe put this on mailing list once we had a proposal for the API.

@jreback I know you want to get this in as we want to release 0.14, but I think this is important enough to not do this in a hurry just to get it in. And maybe my concerns are easily answered, and not much should be adapted. And @hayd, maybe not that clear from all my comments, but really thanks for you work on this!

@jreback
Copy link
Contributor

jreback commented May 5, 2014

ok, I think let's defer this to 0.14.1 (it should be back-compat so no biggie to putting in a minor release).

as @jorisvandenbossche brings up some good points...and don't want to throw this in w/o some more discussion / consideration

@jreback jreback modified the milestones: 0.14.1, 0.14.0 May 5, 2014
@jreback
Copy link
Contributor

jreback commented Jun 3, 2014

@hayd are we pushing this to 0.15 as its an API change?

can you update what the state-of-the-art with this is

@hayd
Copy link
Contributor Author

hayd commented Jun 4, 2014

I agree with @jorisvandenbossche on this, I had thought I had this a lot cleaner. I will fix up with the earlier suggestions (it's actually not that many, I just need to slog thorough it) and we can merge it soon after 0.14.1.

@hayd hayd modified the milestones: 0.15.0, 0.14.1 Jun 4, 2014
@jreback jreback modified the milestones: 0.15.0, 0.15.1 Jul 6, 2014
@jreback
Copy link
Contributor

jreback commented Jul 21, 2014

@hayd let's revist!

@hayd hayd self-assigned this Jul 21, 2014
@hayd
Copy link
Contributor Author

hayd commented Jul 21, 2014

revisiting this evening, rebase + cleanup.

@jreback
Copy link
Contributor

jreback commented Aug 11, 2014

@hayd can we resurrect this? (and put the API changes in the top section for comment)

@jreback
Copy link
Contributor

jreback commented Sep 4, 2014

@hayd ????

1 similar comment
@jreback
Copy link
Contributor

jreback commented Sep 14, 2014

@hayd ????

@jreback
Copy link
Contributor

jreback commented Jan 18, 2015

closing as stale. But this would be a much needed refactor.

@jreback jreback closed this Jan 18, 2015
@jorisvandenbossche jorisvandenbossche removed this from the 0.16.0 milestone Jan 19, 2015
@jreback jreback mentioned this pull request Feb 20, 2016
2 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
API Design Indexing Related to indexing on series/frames, not to indexes themselves Refactor Internal refactoring of code
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Drop with boolean raises a ValueError Feature Request: regex for drop
4 participants