ENH: add regex functionality to DataFrame.replace #3584

cpcloud · 2013-05-13T03:37:37Z

jreback · 2013-05-13T15:15:08Z

if you are passed a non-compiled regex, e.g. r'b' can you tell this is a regex or do you need to pass regex=True?

cpcloud · 2013-05-13T15:19:36Z

the difference between r'b' and 'b' is at the syntactic level (sugar, raw strings are just a way to avoid 1e9 backslashes) not at the representation level (unlike e.g., u'\u22ee', which indicates a representation level change) so u must pass regex=True.
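A quick way to see this, as a minimal sketch using only the standard library: the raw-string prefix changes how the literal is parsed, not the object that results, so a function receiving the value cannot tell which spelling was used. The variable names are made up for illustration.

```python
import re

plain = 'b'
raw = r'b'

# Both literals produce the same str value; the r prefix leaves no trace.
same_value = plain == raw
same_type = type(plain) is type(raw)

# The prefix only matters when the literal contains backslashes.
escaped = '\\d+'      # backslash doubled in a normal literal
raw_escaped = r'\d+'  # backslash written once in a raw literal
same_pattern = escaped == raw_escaped

# A compiled pattern, by contrast, is a different type and can be detected.
compiled = re.compile(raw)
compiled_is_str = isinstance(compiled, str)
```

That is why a plain string needs regex=True, while a compiled pattern could in principle be recognised on its own.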

jreback · 2013-05-13T15:25:57Z

ok in order to use a regex I have to pass to_replace=my_regex, and regex=True (or pass a compiled regex)

alternatively could make to_replace default to None, and allow thru an actual regex in regex=myregex?

so:

df.replace(value='b',regex='a')

would be equivalent to:

df.replace('a','b',regex=True)

?
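The two spellings can be compared directly. This is a sketch that assumes a pandas version in which regex accepts a pattern while to_replace is left as None; the frame contents are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({'x': ['a', 'ab', 'c'], 'y': ['ba', 'd', 'a']})

# Pattern passed through the regex keyword, to_replace left at its default.
via_keyword = df.replace(value='b', regex='a')

# Pattern passed as to_replace, with regex=True used as a flag.
via_flag = df.replace('a', 'b', regex=True)

# Both calls should substitute every match of 'a' with 'b'.
same = via_keyword.equals(via_flag)
x_values = via_flag['x'].tolist()
```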

cpcloud · 2013-05-13T15:30:51Z

I chose not to detect compiled regexes, so those must also have regex=True ATM. I will change that and detect them. Currently regex can only be a bool. I thought that would place less of a burden on the user. I know having args with many type options can sometimes tax my working memory, and this method has a bunch of variable-type arguments, but I like your suggestion. Will change. Will also allow regex to be compiled.

cpcloud · 2013-05-13T23:47:50Z

@jreback Requested functionality added, along with more tests for it. regex/value calling works and now to_replace can be a regex (if it's compiled) without having to pass regex=True. Need to add release notes/few notes to docs.
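The detection rule described here can be sketched roughly as follows. This is not the code from the PR: wants_regex is a hypothetical helper written only to illustrate the idea that an explicit flag or a compiled pattern triggers regex handling, while a bare string does not.

```python
import re

# The type of a compiled pattern, obtained without relying on its name.
_PATTERN_TYPE = type(re.compile(''))


def wants_regex(to_replace, regex=False):
    """Hypothetical helper: decide whether to treat to_replace as a regex."""
    # An explicit flag always wins.
    if regex:
        return True
    # Otherwise only a compiled pattern is unambiguous; a plain str is not.
    return isinstance(to_replace, _PATTERN_TYPE)
```

With this rule, wants_regex(re.compile('a')) is true without a flag, whereas wants_regex('a') stays false until regex=True is passed.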

cpcloud · 2013-05-13T23:53:37Z

possible to change this to 0.11.1 or can I add v0.12.0.txt?

jreback · 2013-05-13T23:56:51Z

@y-p what do you think, 0.11.1 or 0.12?

cpcloud · 2013-05-14T15:44:31Z

I take that back. Probably 2 API changes should be left for 0.12.0. I also need a bit more time to document interpolate because of this.

cpcloud · 2013-05-14T17:55:48Z

@jreback @y-p This API might be more drastic than I thought, (exposing interpolate and disallowing value=None unless to_replace is a dict). The tests suggest that replace with value=None is equivalent to calling fillna with method=a_valid_method which does nothing (at least on simple frames), BUT it does break a lot of tests which makes me think there might be people using it in this way. Well, okay, not a lot of tests but enough to make me a bit dubious.

jreback · 2013-05-14T18:04:18Z

ok how about we add your new functionality
and the interpolate API change for 0.12?

cpcloud · 2013-05-14T18:14:34Z

cool. will also allow me to document the relationship between fillna/interpolate/replace more thoroughly.

cpcloud · 2013-05-14T19:43:50Z

oh sweetness, got nested dicts of regexes working, i.e.,

from string import ascii_letters as letters
from numpy import nan
from pandas import DataFrame

df = DataFrame({'a': list(letters[:4]), 'b': list(letters[4:8]), 'c': range(4)})
df.replace({'b': {'.*e.*': nan}}, regex=True)
# or
df.replace(regex={'b': {'.*e.*': nan}})

cpcloud · 2013-05-14T19:56:00Z

Pushing even though the next build will fail because of 2 empty tests, just in case someone wants to play with it.

cpcloud · 2013-05-14T21:07:05Z

@jreback So far, I'm not seeing the point of having interpolate and replace since they are virtually the same thing. I could see interpolate being a DataFrame version of the Series version, but keeping interpolate along with the new replace seems pointless. If we are going to keep it, int dtypes must be converted to float64 since the padding/filling functions in common don't work with ints. For now I'll assume the latter is what u want.

jreback · 2013-05-14T21:24:19Z

I agree, though interpolate functionality is really encompassed in fillna

And I would be ok with removing the method arg on replace for that matter then

interpolating is just a form of filling

cpcloud · 2013-05-14T21:25:50Z

okay. sounds good. will remove method and no new interpolate frame method will be added.

cpcloud · 2013-05-14T21:31:41Z

I've also added an infer_types argument that allows conversion to a "better" type if possible, e.g., a regular expression replaces an entire column with some numeric value.

jreback · 2013-05-14T21:35:43Z

and forever more you get to be mr. replace, fixing bugs.....

cpcloud · 2013-05-14T21:53:39Z

i hope that is a good thing...fyi goodbye limit arg as well, it was only used in interpolate

jreback · 2013-05-14T21:54:55Z

yep :)

so you r going to do a pr for 0.11.1

then all this stuff in separate for 0.12?

cpcloud · 2013-05-14T21:55:48Z

oh crap...i thought u meant combine them in your previous message :( oh well, git cherry pick here i come.

cpcloud · 2013-05-14T21:57:17Z

although the current state of regex replace gh is solid, the failing tests have to do with interpolate; i will remove them so u can merge and then a separate pr for the stuff we just discussed.

jreback · 2013-05-14T21:58:54Z

sorry for the confusion...go ahead and leave that for 0.12 (the interpolate stuff)...that is an API change, while your other stuff is just added functionality...(and seems to be done anyhow)

jreback · 2013-05-14T22:01:23Z

I am a little iffy on the infer_types kw, why do we need that in here?

a string -> string replacement (that is actually a number)?

cpcloud · 2013-05-14T22:17:19Z

No to your last question. Here's an example:

from string import ascii_letters as ltrs
from numpy import nan
from pandas import DataFrame
a, b, c = list(ltrs[:4]), list(ltrs[4:8]), list(ltrs[:3]) + [4]
df = DataFrame({'a': a, 'b': b, 'c': c})
df.replace(regex={'c': {'[a-c]': nan}}, infer_types=True, inplace=True)
print(df.c.dtype)  # should be float64

(This won't work yet, I forgot to add the infer_types argument in the recursive calls to replace, (about to push the fix)).

jreback · 2013-05-14T22:22:28Z

I think you can dispense with it, and just do a df.convert_objects() at the end. The convert_objects with the arguments are pretty forceful, all I think you need is to do soft-conversions. E.g. a column is full of floats but happens to be object dtype. Try with this first.
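The suggestion amounts to replacing first and soft-converting afterwards. Below is a sketch of that idea under one loud assumption: convert_objects is not available in later pandas releases, so infer_objects stands in for the soft conversion. The frame is made up for illustration.

```python
import numpy as np
import pandas as pd

# Mixed strings and an int, so the column starts out as object dtype.
df = pd.DataFrame({'c': ['a', 'b', 'c', 4]})

# Blank out the letters with a regex, leaving only the number behind.
replaced = df.replace(regex={'c': {'[a-c]': np.nan}})

# Soft conversion (assumed stand-in for convert_objects): a column changes
# dtype only if its existing values already fit the new dtype.
converted = replaced.infer_objects()
result_dtype = str(converted['c'].dtype)
```

No extra keyword on replace is needed for this; the conversion is a separate, optional step.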

cpcloud · 2013-05-14T22:25:23Z

Right now there's a call into the blkmgr method convert. Is that what u mean by soft-conversions? If not, how do I do those, or do u just mean get rid of the arg altogether.

jreback · 2013-05-14T22:25:36Z

In [10]: df = DataFrame([[nan,1.,1.],[nan,1.,'foo']],dtype=object)

In [11]: df
Out[11]: 
     0  1    2
0  NaN  1    1
1  NaN  1  foo

In [12]: df.dtypes
Out[12]: 
0    object
1    object
2    object
dtype: object

In [13]: df.convert_objects().dtypes
Out[13]: 
0    float64
1    float64
2     object
dtype: object

cpcloud · 2013-05-14T22:26:33Z

sure ok, will remove the param.

jreback · 2013-05-14T22:28:21Z

convert_objects just calls the block method convert, so you can do it there too (just no need to pass any of the options, let them default)

I dont think you need this (though if you wanted to get really fancy, maybe, but shouldn't be by default)

In [14]: df = DataFrame([['1','1.']])

In [15]: df
Out[15]: 
   0   1
0  1  1.

In [16]: df.dtypes
Out[16]: 
0    object
1    object
dtype: object

In [17]: df.convert_objects()
Out[17]: 
   0   1
0  1  1.

In [18]: df.convert_objects(convert_numeric=True)
Out[18]: 
   0  1
0  1  1

In [19]: df.convert_objects(convert_numeric=True).dtypes
Out[19]: 
0      int64
1    float64
dtype: object
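For readers on a later pandas where convert_objects no longer exists, the same soft versus forced distinction can be sketched with infer_objects and to_numeric. Treating these as stand-ins is an assumption of this example, not something stated in the thread.

```python
import pandas as pd

# Numeric-looking strings held in object columns.
df = pd.DataFrame([['1', '1.']], dtype=object)

# Soft conversion leaves numeric-looking strings as strings.
soft = df.infer_objects()
still_string = soft.iloc[0, 0] == '1'

# Forced conversion parses each column's strings into numbers.
forced = df.apply(pd.to_numeric)
forced_dtypes = [str(t) for t in forced.dtypes]
```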

cpcloud · 2013-05-14T22:32:12Z

ok so i will just use convert on the _data attribute (sans arguments). also how do u create that ipython output so fast?

cpcloud · 2013-05-14T22:36:48Z

ah %%capture magic. am so going to write a magic to output to gfm...

cpcloud · 2013-05-15T22:22:15Z

@jreback will it be annoying if I branch off of this to remove interpolate and then submit a pr based on that branch?

jreback · 2013-05-15T22:27:06Z

nope...close 'em and open 2 new PRs, just fine

cpcloud · 2013-05-15T22:30:18Z

not sure i follow. close this one, open it again and then open a separate one based on this...? why not just branch off this and open a new pr? sorry if being a little thick here.

jreback · 2013-05-15T22:42:10Z

sorry....I think we talked about a new PR for the interpolate stuff (for 0.12), so this one is for 0.11.1?

just rebase as you need on this one, and make new one (I guess you are asking if the new one is going to be based off of this one, ok by me)

cpcloud · 2013-05-15T22:43:02Z

yup thanks.

add default of None to to_replace
add ability to pass regex as to_replace regex
Remove cruft
more tests and add ability to pass regex and value
Make exceptions more clear; push examples to missing_data.rst
remove interpolation call
make inplace work across axes in interpolate method
ability to use nested dicts for regexs and others
mostly doc updates
formatting
infer_types correction
rls notes

cpcloud · 2013-05-17T18:17:41Z

@jreback this is ready to go (modulo travis build passing).

jreback · 2013-05-17T19:02:42Z

@cpcloud fyi...you don't need the imports in the docs, all of that is imported at the top of each file (unless you need something special)

cpcloud · 2013-05-17T19:10:29Z

Ah ok. Thanks. I'm paranoid about impenetrable sphinx errors :) I guess. Will remove.

jreback · 2013-05-17T19:25:41Z

@cpcloud merged..thanks!

I added a link in v0.11.1 to the place in the docs

maybe add a couple of examples (you can pull them right out of the docs) in v0.11.1? for string replacement

jseabold · 2013-05-19T04:17:59Z

I just happened upon this. Does this close #1479? I don't follow closely how interpolation came into a regex PR. The interpolate needs documentation AFAICT. I didn't read through things closely.

cpcloud · 2013-05-19T04:26:54Z

@jseabold Doesn't close #1479 (at least in full). There's a Block method called interpolate that basically does what fillna does and it is called when u call fillna. FYI Series.interpolate and a would-be DataFrame.interpolate are unrelated insofar as Series.interpolate does numeric interpolation and DataFrame.interpolate doesn't (not in the traditional sense of interpolation). My apologies if u already knew this stuff...

As per my conversation with @jreback above we decided that the functionality that would've been provided by the would-be DataFrame.interpolate is provided by DataFrame.fillna. There's a slightly tangled web of connection between this PR and interpolation, but the main reason interpolate came up is because interpolation is performed when there is no value argument to this (DataFrame.replace) method. Hope this clears things up :)
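The difference between the two meanings of "interpolate" can be illustrated with a small sketch. It uses Series.interpolate and Series.ffill as they exist in later pandas releases; the data is made up.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0])

# Numeric (linear) interpolation estimates the gap from its neighbours.
interpolated = s.interpolate().tolist()

# Forward filling just repeats the last valid value, which is a form of fillna.
filled = s.ffill().tolist()
```

Here the first call produces a value between the neighbours, while the second only pads, which is the fillna-style behaviour the Block method provides.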

jseabold · 2013-05-19T05:19:45Z

Ah, ok. I did not know about the Block method and that does clear it up. Thanks.

jreback · 2013-05-20T11:50:42Z

I used your new method!

http://stackoverflow.com/questions/16643695/pandas-convert-strings-to-float-for-multiple-columns-in-dataframe

cpcloud · 2013-05-20T13:33:04Z

Oh nice! Thanks. Bonus that it is faster too!

jreback merged commit 064445f into pandas-dev:master May 17, 2013

jreback mentioned this pull request May 17, 2013

Replace with regular expression #2285

Closed
