
Replace with regular expression #2285

Closed
changhiskhan opened this Issue · 45 comments

6 participants

@paulproteus

It seems to me this would be a good ticket for a first-time contributor who is familiar with (or willing to learn about) regular expressions.

You could model your test case after the code at the Stack Overflow post that @changhiskhan linked to here.

You would presumably be modifying "class DataFrame" in pandas/core/frame.py by adding a method to it.

If it makes things more convenient, your new method could presumably insist that the regular expression passed in has already been compiled with re.compile, so that your function does not have to compile it.
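
For illustration, a minimal sketch of such a method as a standalone helper (replace_regex is a hypothetical name for illustration, not the eventual pandas API):

import re
import numpy as np
import pandas as pd

def replace_regex(df, pattern, value):
    # Accept either a pattern string or a precompiled regex.
    compiled = re.compile(pattern) if isinstance(pattern, str) else pattern
    # Element-wise: swap in `value` for any string cell matching the pattern.
    return df.applymap(
        lambda x: value if isinstance(x, str) and compiled.match(x) else x)

df = pd.DataFrame({'a': [' . ', 'foo']})
replace_regex(df, r'^\s*\.\s*$', np.nan)  # ' . ' -> NaN, 'foo' unchanged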

@louist87

I would also be in favor of such a feature, provided we can also pass exact matches as literals. I have eye-tracker data in an ASCII file format in which missing values are indicated by the string '.'.

I suggest castna as a method name, since we're recasting certain values as another type.

@y-p

Possibly #3276 would do half the job.

@cpcloud
Collaborator

I would be happy to implement this.

@paulproteus Regex compiles are so fast (a string of length 1000 compiles in < 1 us) that it's not really an issue; plus, you can get the original string via the pattern attribute on the compiled regex.

@louist87 A regular expression consisting of no metacharacters (escape metacharacters to match them literally) does exactly that. I think castna is a little misleading, since replace works on non-NA values; accepting a regex in the to_replace argument is probably better.
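
For example, both points are plain standard-library behavior:

import re

pat = re.compile(re.escape('.'))  # escape metacharacters to match them literally
pat.match('.')   # matches: the pattern now means a literal dot
pat.match('x')   # None: the escaped '.' no longer matches any character
pat.pattern      # '\\.' -- the original string is recoverable from the compiled regex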

@jreback
Owner

@cpcloud FYI, replace currently exists in both core/series and core/internals, and the two do the same thing.

@cpcloud
Collaborator

@jreback Maybe it's better to wait until series-as-ndframe is merged before working on this?

@jreback
Owner

@cpcloud Up to you... series-as-ndframe is coming in 0.12. Why don't you add it to Block.replace in core/internals? That way there's no wasted code (and target 0.12).

And if I recall correctly, I haven't fixed replace in Series yet anyhow, so go ahead if you'd like.

@cpcloud
Collaborator

OK, cool. This should only work on strings, correct? For example, passing the numeric-token regexes from the tokenize module will not match anything unless the number is actually stored as a string.

@jreback
Owner

@cpcloud I think it has to be a valid expression for re.compile (which I think stringifies)?

@jreback
Owner

Actually, are we adding an argument for this, e.g. regex=? Otherwise, how do we determine when the user wants an exact replacement (as opposed to a regex replacement)?

@cpcloud
Collaborator

@jreback OK, hold on, I think I may be confused about something. Do we want to be able to pass a dict/list of regexes, essentially to "vectorize" re.sub? In that case another argument is probably the way to go, or some try/except mania. I could be wrong, though; I haven't thought it through completely yet.

@jreback
Owner

My understanding may be wrong, but isn't this something like

replace_expr -> value ?

How do you distinguish the replace_expr from a straight string, or would you always treat it as a regex? So 'foo' is an exact match but 'foo*' is the regex.

I guess you could always do regex matching.

@cpcloud
Collaborator

Always regex matching was my thinking, since plain strings are just special cases of regexes (in the most formal sense). Of course, once I actually start working on this, things may change.

@jreback
Owner

I would make a dict/list always an exact match (like now).

I see the problem: if I pass the number 2, do you match on the string '2', or match numbers == 2?

You could handle it by dtype.
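
A rough sketch of the by-dtype idea (replace_by_dtype is a hypothetical helper for illustration, not the internals code):

import re
import pandas as pd

def replace_by_dtype(df, to_replace, value):
    out = df.copy()
    for col in out.columns:
        if out[col].dtype == object and isinstance(to_replace, str):
            # Object (string) columns: treat to_replace as a regex.
            pat = re.compile(to_replace)
            out[col] = out[col].map(
                lambda x: value if isinstance(x, str) and pat.search(x) else x)
        else:
            # Other dtypes: exact-value replacement only.
            out[col] = out[col].mask(out[col] == to_replace, value)
    return out

# e.g. replace_by_dtype(df, 2, 0) touches numeric cells equal to 2, never the string '2'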

@cpcloud
Collaborator

Right, that was my original concern. Do we want regexes as replacement values as well? For example, I think the following is intuitive (the ASCII eye-tracker example from @louist87):

df.replace({r'^\s*\.\s*$': NaN})

But what if the user wants to replace a string value based on the matched regex? E.g.,

df.replace({r'\s*(\w{2})(\w{2})\s*': r'\1\2'})

Additionally, how should the empty-match case be handled? (This could come for free from pandas' convenient None handling.)
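
For reference, that backreference substitution is standard re.sub behavior:

import re

re.sub(r'\s*(\w{2})(\w{2})\s*', r'\1\2', '  abcd  ')  # -> 'abcd', whitespace stripped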

@jreback
Owner

This makes sense (especially if you can handle it conditionally via different block dtypes), e.g. a regex is only valid for ObjectBlocks, whereas you would only allow numeric replacements via Int/Float block types (maybe punt on Series replace for now, as this gets tricky).

You can probably leave Block.replace alone and just write a new one for ObjectBlock (though you can use putmask, by the way, if you know the values to replace).

@cpcloud
Collaborator

OK, will do. I had already started looking at ObjectBlock replace :) since that sidesteps the issue of regexes matching numeric types. Well, not sidesteps, but makes it easier.

@cpcloud
Collaborator

@jreback Are the keys of df.blocks subject to change?

@jreback
Owner

What do you mean?

How are you using it?

@cpcloud
Collaborator

Sorry, never mind; that was preemptive.

@cpcloud
Collaborator

Do we want to be able to match regexes on the columns, similar to df.filter? If not, I can submit this PR tomorrow. E.g.,

df.replace({r'\w+\.\w+': nan}, {r'\w+\.\w+': 1})

I think this is probably a bad idea, after typing it out. However, something like

df.replace({r'\w+\.\w+': {'to_replace_regex': nan}})

would be cool, I think. This would mean: get all the columns whose names match the first level of regexes, then within those columns replace the values matching each inner dict's regexes. However, this will obviously take more time.

@jreback
Owner

I would not add the column matching; too much magic.

@cpcloud
Collaborator

@jreback Can you describe what the single-dict replace is supposed to do? For example, I could see that

df.replace({'a': 'b'})

either replaces all occurrences of 'a' with 'b', or raises an exception about ambiguity, saying you need to provide the value argument. For a Series with the same argument ({'a': 'b'}), the former is what happens, which makes sense because there's no notion of columns in a Series.
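
For what it's worth, released pandas went with the former behavior: with a dict and no value argument, the keys are the values to replace and the dict values are their replacements, applied across all columns. For example:

import pandas as pd

df = pd.DataFrame({'x': ['a', 'c'], 'y': ['a', 'b']})
df.replace({'a': 'b'})  # every 'a' becomes 'b', in both columns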

@cpcloud
Collaborator

I have another question; hope this isn't too annoying. When should one use make_block, as opposed to just copying the block and setting the values attribute (the copying depending on the value of a possible inplace parameter)?

@jreback
Owner

Almost always use make_block.

If you want a specific dtype, pass the klass argument; otherwise it will be inferred (e.g. when you don't know whether you'd have int or float, let it infer).

Very rarely do I set the values directly.

@cpcloud
Collaborator

FYI, there is a possible back-compat issue here with my implementation, since every string is now treated as a regex. For example, df.replace('.', 'a') will match any character and replace it with 'a'. Is this a huge problem? A way around it is to provide an argument that tells the method to interpret the string as a regex; @jreback had suggested this earlier.

@jreback
Owner

I think you could do minimal validation and reject too-general regexes, e.g. just a few cases where you raise (or warn), or just document it; maybe that's good enough.

Unless you want to do exact string matching by default, and maybe use r'regex' to signal regexes?

@cpcloud
Collaborator

I think a regex=True argument, with exact matching as the default, is probably the way to go for back compat.
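
That is the API pandas ultimately shipped; a quick illustration:

import pandas as pd

df = pd.DataFrame({'a': ['.', 'a.b']})
df.replace('.', 'x')              # exact match by default: only the literal '.' cell changes
df.replace('.', 'x', regex=True)  # regex match: '.' matches every character, so 'a.b' -> 'xxx'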

@jreback
Owner

Yes, that sounds right.

@cpcloud
Collaborator

The docs for replace are going to be kind of a beast, since there are so many ways to slice and dice here...

@cpcloud
Collaborator

@jreback This is ready to go, modulo some additional documentation (regexes and examples in the method docstring) and your word on what to do with a single-dict to_replace. Also, nested-dict regexes will not work until this is implemented for Series.

@jreback
Owner

Closed via #3584.

@jreback jreback closed this
@jreback
Owner

This replace is quite flexible!

http://stackoverflow.com/questions/16818871/extracting-value-and-creating-new-column-out-of-it

It might be slower than method one, though, as that approach uses the vectorized string methods (which in theory this could use too).

@cpcloud
Collaborator

Glad you like it! replace uses np.vectorize under the hood. I chose this since lib.map_infer_mask only works over one dimension, although I could use that and then reshape, or maybe there's something in common.py that fully replaces vectorize.

@jreback
Owner

FYI, this is a pretty common idiom:

func(values_2d.ravel()).reshape(values_2d.shape)
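
For instance, with np.vectorize (illustrative values):

import numpy as np

values_2d = np.array([['a', 'b'], ['c', 'd']], dtype=object)
func = np.vectorize(str.upper)
# Apply the 1-D-friendly function element-wise, then restore the original shape.
func(values_2d.ravel()).reshape(values_2d.shape)
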
@cpcloud
Collaborator

OK, yeah, I'm always paranoid about copies even though I know that reshape shares data. MATLAB and R have scarred me for life.

@jreback
Owner

Hahah... this is all view-based, no copies!

@cpcloud
Collaborator

Actually, I just made the change (using _na_map from core/strings.py) and compared master to my change: vectorize is faster by about 40 us, and the replace method is about 4x faster than the vectorized-strings method. The map_infer_mask signature is just (ndarray, object, ndarray[i8] mask, bint convert), which won't give a huge speedup (except that the overhead of each iteration is smaller), since just declaring ndarray doesn't get you much beyond faster attribute access and some assert isinstance(obj, ndarray)-style checks at compile time.

@cpcloud
Collaborator

(this is using timeit)

@jreback
Owner

OK, so what you are doing is fine; but ought we to change all of the string methods to use vectorize (or write them in Cython)?

@cpcloud
Collaborator

Hm, writing them in Cython could be done via generate_code.py. It might be better to try vectorize first, since that change will be easier. Looks like a vbench for strings might be needed at this point.

@jreback
Owner

Look at #2802; can you do a quick test using vectorize on that example?

@cpcloud
Collaborator

In [15]: timeit vectorize(lambda s: s.endswith('world'))(p.strings)
100 loops, best of 3: 3.35 ms per loop

In [16]: %%timeit
   ....: for ii in xrange(1000):
   ....:     p['ishello'] = p['strings'].str.endswith('world')
   ....:
1 loops, best of 3: 3.61 s per loop

In [17]: %%timeit
   ....: for ii in xrange(1000) :
   ....:     p['isHello'] = [s.endswith('world') for s in p['strings'].values]
   ....:
1 loops, best of 3: 2.69 s per loop

In [18]: %%timeit
   ....: for ii in xrange(1000) :
   ....:     p['isHello'] = pandas.Series([s.endswith('world') for s in p['strings'].values])
   ....:
1 loops, best of 3: 2.27 s per loop

In [19]: f = vectorize(lambda x: x.endswith('world'))

In [20]: %%timeit
   ....: for ii in xrange(1000):
   ....:     p['isHello'] = f(p['strings'])
   ....:
1 loops, best of 3: 3.49 s per loop

In [21]: %%timeit
   ....: for ii in xrange(1000):
   ....:     p['isHello'] = f(p['strings'])
   ....:
1 loops, best of 3: 3.49 s per loop

@cpcloud
Collaborator

Not really a big difference, and the current methods are faster.

@jreback
Owner

Yep, got the same. OK, not a big deal then.
