ENH: add regex functionality to DataFrame.replace #3584

Merged: 1 commit, May 17, 2013

Conversation

cpcloud
Member

@cpcloud cpcloud commented May 13, 2013

addresses #2285. cc @jreback and #3582.

@jreback
Contributor

jreback commented May 13, 2013

if you are passed a non-compiled regex, e.g. r'b', can you tell that it is a regex, or do you need to pass regex=True?

@cpcloud
Member Author

cpcloud commented May 13, 2013

the difference between r'b' and 'b' is at the syntactic level (sugar, raw strings are just a way to avoid 1e9 backslashes) not at the representation level (unlike e.g., u'\u22ee', which indicates a representation level change) so u must pass regex=True.
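A minimal sketch of the raw-string point, using only the standard library. The variable names are illustrative and do not come from the PR.

```python
import re

# A raw-string literal and a plain literal yield the same str value.
plain = 'b'
raw = r'b'
same_value = plain == raw
same_type = type(plain) is type(raw)

# The r prefix only changes how backslashes in the literal are read.
raw_backslash = r'\d'   # two characters: backslash, d
escaped = '\\d'         # the same two characters, written with an escape

# A compiled pattern, by contrast, has its own type that code can detect.
compiled = re.compile(r'b')
is_pattern = isinstance(compiled, re.Pattern)
```

Since nothing about the string object records which literal form produced it, a library cannot infer "this is a regex" from r'b' alone.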

@jreback
Contributor

jreback commented May 13, 2013

ok in order to use a regex I have to pass to_replace=my_regex, and regex=True (or pass a compiled regex)

alternatively, could we make to_replace default to None, and allow an actual regex to be passed through as regex=my_regex?

so:

df.replace(value='b',regex='a')

would be equivalent to:

df.replace('a','b',regex=True)

?
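A short sketch of the two call styles proposed above, assuming a pandas release in which DataFrame.replace accepts a pattern through the regex keyword. The frame and its column name are made up for illustration.

```python
import pandas as pd

# Hypothetical data, not taken from the PR.
df = pd.DataFrame({'x': ['a', 'c']})

# Pattern passed positionally, with regex=True turning on regex matching.
positional = df.replace('a', 'b', regex=True)

# Pattern passed through the regex keyword; to_replace keeps its default.
keyword = df.replace(regex='a', value='b')

both_agree = positional.equals(keyword)
```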

@cpcloud
Member Author

cpcloud commented May 13, 2013

I chose not to detect compiled regexes, so those must also have regex=True ATM. I will change that and make them detected automatically. Currently regex can only be a bool. I thought that would place less of a burden on the user. I know having args with many type options can sometimes tax my working memory, and this method has a bunch of variable-type arguments, but I like your suggestion. Will change. Will also allow regex to be a compiled pattern.

@cpcloud
Member Author

cpcloud commented May 13, 2013

@jreback Requested functionality added, along with more tests for it. regex/value calling works and now to_replace can be a regex (if it's compiled) without having to pass regex=True. Need to add release notes/few notes to docs.
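A small sketch of the compiled-pattern path described here, assuming a pandas release in which DataFrame.replace accepts a compiled pattern as to_replace. The data is hypothetical.

```python
import re

import pandas as pd

# Hypothetical data, not taken from the PR.
df = pd.DataFrame({'x': ['a', 'c']})

# A compiled pattern is recognisable by its type, so no regex flag is passed.
pattern = re.compile('a')
via_compiled = df.replace(pattern, 'b')

# A plain string is compared literally unless regex=True is supplied,
# so '.' here only matches cells that are exactly '.'.
literal = df.replace('.', 'b')
```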

@cpcloud
Member Author

cpcloud commented May 13, 2013

possible to change this to 0.11.1 or can I add v0.12.0.txt?

@jreback
Contributor

jreback commented May 13, 2013

@y-p what do you think, 0.11.1 or 0.12?

@cpcloud
Member Author

cpcloud commented May 14, 2013

I take that back. Probably 2 API changes should be left for 0.12.0. I also need a bit more time to document interpolate because of this.

@cpcloud
Member Author

cpcloud commented May 14, 2013

@jreback @y-p This API change might be more drastic than I thought (exposing interpolate and disallowing value=None unless to_replace is a dict). The tests suggest that replace with value=None is equivalent to calling fillna with method=a_valid_method, which does nothing (at least on simple frames), BUT it does break a lot of tests, which makes me think there might be people using it in this way. Well, okay, not a lot of tests, but enough to make me a bit dubious.

@jreback
Contributor

jreback commented May 14, 2013

ok how about we add your new functionality
and the interpolate API change for 0.12?

@cpcloud
Member Author

cpcloud commented May 14, 2013

cool. will also allow me to document the relationship between fillna/interpolate/replace more thoroughly.

@cpcloud
Member Author

cpcloud commented May 14, 2013

oh sweetness, got nested dicts of regexes working, i.e.,

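from string import ascii_letters as letters
from numpy import nan
from pandas import DataFrame

df = DataFrame({'a': list(letters[:4]), 'b': list(letters[4:8]), 'c': range(4)})
# nested dict: column name -> {pattern: replacement}
df.replace({'b': {'.*e.*': nan}}, regex=True)
# or
df.replace(regex={'b': {'.*e.*': nan}})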

@cpcloud
Member Author

cpcloud commented May 14, 2013

Pushing even though the next build will fail because of 2 empty tests, just in case someone wants to play with it.

@cpcloud
Member Author

cpcloud commented May 14, 2013

@jreback So far, I'm not seeing the point of having interpolate and replace since they are virtually the same thing. I could see interpolate being a DataFrame version of the Series version, but keeping interpolate along with the new replace seems pointless. If we are going to keep it, int dtypes must be converted to float64 since the padding/filling functions in common don't work with ints. For now I'll assume the latter is what u want.

@jreback
Contributor

jreback commented May 14, 2013

I agree, though interpolate functionality is really encompassed by fillna

And I would be ok with removing the method arg on replace for that matter then

interpolating is just a form of filling

@cpcloud
Member Author

cpcloud commented May 14, 2013

okay. sounds good. will remove method and no new interpolate frame method will be added.

@cpcloud
Member Author

cpcloud commented May 14, 2013

I've also added an infer_types argument that allows conversion to a "better" type if possible, e.g., a regular expression replaces an entire column with some numeric value.

@jreback
Contributor

jreback commented May 14, 2013

and forever more you get to be mr. replace, fixing bugs.....

@cpcloud
Member Author

cpcloud commented May 14, 2013

i hope that is a good thing...fyi goodbye limit arg as well, it was only used in interpolate

@jreback
Contributor

jreback commented May 14, 2013

yep :)

so you r going to do a pr for 0.11.1

then all this stuff in separate for 0.12?

@cpcloud
Member Author

cpcloud commented May 14, 2013

oh crap...i thought u meant combine them in your previous message :( oh well, git cherry pick here i come.

@cpcloud
Member Author

cpcloud commented May 14, 2013

although the current state of regex replace is solid, the failing tests have to do with interpolate; i will remove them so u can merge and then open a separate pr for the stuff we just discussed.

@jreback
Contributor

jreback commented May 14, 2013

sorry for the confusion...go ahead and leave that for 0.12 (the interpolate stuff)...that is an API change, while your other stuff is just added functionality...(and seems to be done anyhow)

@jreback
Contributor

jreback commented May 14, 2013

I am a little iffy on the infer_types kw, why do we need that in here?

a string -> string replacement (that is actually a number)?

@cpcloud
Member Author

cpcloud commented May 14, 2013

No to your last question. Here's an example:

from string import ascii_letters as ltrs
from numpy import nan
from pandas import DataFrame

a, b, c = list(ltrs[:4]), list(ltrs[4:8]), list(ltrs[:3]) + [4]
df = DataFrame({'a': a, 'b': b, 'c': c})
df.replace(regex={'c': {'[a-c]': nan}}, infer_types=True, inplace=True)
print(df.c.dtype)  # should be float64

(This won't work yet, I forgot to add the infer_types argument in the recursive calls to replace, (about to push the fix)).

@jreback
Contributor

jreback commented May 14, 2013

I think you can dispense with it, and just do a df.convert_objects() at the end. The convert_objects with the arguments are pretty forceful, all I think you need is to do soft-conversions. E.g. a column is full of floats but happens to be object dtype. Try with this first.

@cpcloud
Member Author

cpcloud commented May 14, 2013

Right now there's a call into the blkmgr method convert. Is that what u mean by soft-conversions? If not, how do I do those, or do u just mean get rid of the arg altogether.

@jreback
Contributor

jreback commented May 14, 2013

In [10]: df = DataFrame([[nan,1.,1.],[nan,1.,'foo']],dtype=object)

In [11]: df
Out[11]: 
     0  1    2
0  NaN  1    1
1  NaN  1  foo

In [12]: df.dtypes
Out[12]: 
0    object
1    object
2    object
dtype: object

In [13]: df.convert_objects().dtypes
Out[13]: 
0    float64
1    float64
2     object
dtype: object
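convert_objects was later deprecated and removed from pandas. Assuming a release that provides DataFrame.infer_objects, the same soft conversion can be sketched as below. This is an illustration of the idea, not the code path the PR uses.

```python
import numpy as np
import pandas as pd

# Same frame as in the session above: numeric values stored as object dtype.
df = pd.DataFrame([[np.nan, 1.0, 1.0], [np.nan, 1.0, 'foo']], dtype=object)

# Soft conversion: a column changes dtype only if its values are already
# numeric; the column that also holds 'foo' stays object.
soft = df.infer_objects()
soft_dtypes = [str(t) for t in soft.dtypes]
```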

@cpcloud
Member Author

cpcloud commented May 14, 2013

sure ok, will remove the param.

@jreback
Contributor

jreback commented May 14, 2013

convert_objects just calls the block method convert, so you can do it there too (just no need to pass any of the options, let them default)

I don't think you need this (though if you wanted to get really fancy, maybe, but it shouldn't be the default)

In [14]: df = DataFrame([['1','1.']])

In [15]: df
Out[15]: 
   0   1
0  1  1.

In [16]: df.dtypes
Out[16]: 
0    object
1    object
dtype: object

In [17]: df.convert_objects()
Out[17]: 
   0   1
0  1  1.

In [18]: df.convert_objects(convert_numeric=True)
Out[18]: 
   0  1
0  1  1

In [19]: df.convert_objects(convert_numeric=True).dtypes
Out[19]: 
0      int64
1    float64
dtype: object
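In pandas releases where convert_objects is no longer available, a forced string-to-number conversion like the one above can be approximated with pd.to_numeric. The sketch below assumes that function as a stand-in for convert_numeric=True; it is not what the PR itself calls.

```python
import pandas as pd

# Strings that look numeric, stored as object dtype.
df = pd.DataFrame([['1', '1.']], dtype=object)

# Soft conversion does not parse strings, so the cell is still the text '1'.
soft = df.infer_objects()

# Hard conversion: parse every column as numbers.
hard = df.apply(pd.to_numeric)
hard_dtypes = [str(t) for t in hard.dtypes]
```

Keeping the hard conversion opt-in matches the suggestion above: a replace call should not silently turn text that merely looks numeric into numbers.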

@cpcloud
Member Author

cpcloud commented May 14, 2013

ok so i will just use convert on the _data attribute (sans arguments). also how do u create that ipython output so fast?

@cpcloud
Member Author

cpcloud commented May 14, 2013

ah %%capture magic. am so going to write a magic to output to gfm...

@cpcloud
Member Author

cpcloud commented May 15, 2013

@jreback will it be annoying if I branch off of this to remove interpolate and then submit a pr based on that branch?

@jreback
Contributor

jreback commented May 15, 2013

nope...close 'em and open 2 new PRs, just fine

@cpcloud
Member Author

cpcloud commented May 15, 2013

not sure i follow. close this one, open it again and then open a separate one based on this...? why not just branch off of this and open a new pr? sorry if i'm being a little thick here.

@jreback
Contributor

jreback commented May 15, 2013

sorry....I think we talked about a new PR for the interpolate stuff (for 0.12), so this one is for 0.11.1?

just rebase as you need on this one, and make new one (I guess you are asking if the new one is going to be based off of this one, ok by me)

@cpcloud
Member Author

cpcloud commented May 15, 2013

yup thanks.

add default of None to to_replace

add ability to pass regex as to_replace regex

Remove cruft

more tests and add ability to pass regex and value

Make exceptions more clear; push examples to missing_data.rst

remove interpolation call

make inplace work across axes in interpolate method

ability to use nested dicts for regexs and others

mostly doc updates

formatting

infer_types correction

rls notes
@cpcloud
Member Author

cpcloud commented May 17, 2013

@jreback this is ready to go (modulo travis build passing).

@jreback
Contributor

jreback commented May 17, 2013

@cpcloud fyi...you don't need the imports in the docs, all of that is imported at the top of each file (unless you need something special)

@cpcloud
Member Author

cpcloud commented May 17, 2013

Ah ok. Thanks. I'm paranoid about impenetrable sphinx errors :) I guess. Will remove.

@jreback jreback merged commit 064445f into pandas-dev:master May 17, 2013
@jreback
Contributor

jreback commented May 17, 2013

@cpcloud merged..thanks!

I added a link in v0.11.1 to the place in the docs

maybe add a couple of examples (you can pull them right out of the docs) in v0.11.1? for string replacement

@jseabold
Contributor

I just happened upon this. Does this close #1479? I don't follow closely how interpolation came into a regex PR. The interpolate needs documentation AFAICT. I didn't read through things closely.

@cpcloud
Member Author

cpcloud commented May 19, 2013

@jseabold Doesn't close #1479 (at least in full). There's a Block method called interpolate that basically does what fillna does and it is called when u call fillna. FYI Series.interpolate and a would-be DataFrame.interpolate are unrelated insofar as Series.interpolate does numeric interpolation and DataFrame.interpolate doesn't (not in the traditional sense of interpolation). My apologies if u already knew this stuff...

As per my conversation with @jreback above, we decided that the functionality that would've been provided by the would-be DataFrame.interpolate is already provided by DataFrame.fillna. There's a slightly tangled web of connection between this PR and interpolation, but the main reason interpolate came up is that interpolation was performed when no value argument was passed to this (DataFrame.replace) method. Hope this clears things up :)
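To make the distinction between filling and numeric interpolation concrete, here is a minimal sketch on a Series. It uses Series.ffill and Series.interpolate from current pandas rather than the internal Block method discussed above, and the data is made up.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0])

# Filling copies a neighbouring observed value into the gap.
filled = s.ffill()

# Numeric interpolation computes a new value between the neighbours
# (linear by default).
interpolated = s.interpolate()
```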

@jseabold
Contributor

Ah, ok. I did not know about the Block method and that does clear it up. Thanks.

@jreback
Contributor

jreback commented May 20, 2013

@cpcloud
Member Author

cpcloud commented May 20, 2013

Oh nice! Thanks. Bonus that it is faster too!
