
Replace with regular expression #2285

Closed
changhiskhan opened this Issue · 45 comments

6 participants

@paulproteus

It seems to me this would be a good ticket for a first-time contributor who is familiar with (or willing to learn about) regular expressions.

You could model your test case after the code at the Stack Overflow post that @changhiskhan linked to here.

You would presumably be modifying "class DataFrame" in pandas/core/frame.py by adding a method to it.

If it makes things more convenient, your new method could presumably insist that the regular expression passed in has already been compiled with re.compile, so that your function does not have to compile it.
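
For illustration, a minimal sketch of such a method as a standalone helper (replace_regex is a hypothetical name for illustration, not the eventual pandas API):

import re
import numpy as np
import pandas as pd

def replace_regex(df, pattern, value):
    # Accept either a pattern string or a precompiled regex.
    compiled = re.compile(pattern) if isinstance(pattern, str) else pattern
    # Element-wise: swap in `value` for any string cell matching the pattern.
    return df.applymap(
        lambda x: value if isinstance(x, str) and compiled.match(x) else x)

df = pd.DataFrame({'a': [' . ', 'foo']})
replace_regex(df, r'^\s*\.\s*$', np.nan)  # ' . ' -> NaN, 'foo' unchanged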

@louist87

I would also be in favor of such a feature, provided we can also pass exact matches as literals. I have eye-tracker data in an ASCII file format in which missing values are indicated by the string '.'.

I suggest castna as a method name, since we're recasting certain values as another type.

@y-p

Possibly #3276 would do half the job.

@cpcloud
Collaborator

I would be happy to implement this.

@paulproteus Regex compiles are so fast (a string of length 1000 compiles in < 1 us) that it's not really an issue; plus, you can get the original string via the pattern attribute on the compiled regex.

@louist87 A regular expression consisting of no metacharacters (escape metacharacters to match them literally) does exactly that. I think castna is a little misleading, since replace works on non-NA values; accepting a regex in the to_replace argument is probably better.
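
For example, both points are plain standard-library behavior:

import re

pat = re.compile(re.escape('.'))  # escape metacharacters to match them literally
pat.match('.')   # matches: the pattern now means a literal dot
pat.match('x')   # None: the escaped '.' no longer matches any character
pat.pattern      # '\\.' -- the original string is recoverable from the compiled regex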

@jreback
Owner

@cpcloud FYI, replace currently exists in both core/series and core/internals, and the two do the same thing.

@cpcloud
Collaborator

@jreback Maybe it's better to wait until series-as-ndframe is merged before working on this?

@jreback
Owner

@cpcloud Up to you... series-as-ndframe is coming in 0.12. Why don't you add it to Block.replace in core/internals? That way there's no wasted code (and target 0.12).

And if I recall correctly, I haven't fixed replace in Series yet anyhow, so go ahead if you'd like.

@cpcloud
Collaborator

OK, cool. This should only work on strings, correct? For example, passing the numeric-token regexes from the tokenize module will not match anything unless the number is actually stored as a string.

@jreback
Owner

@cpcloud I think it has to be a valid expression for re.compile (which I think stringifies)?

@jreback
Owner

Actually, are we adding an argument for this, e.g. regex=? Otherwise, how do we determine when the user wants an exact replacement (as opposed to a regex replacement)?

@cpcloud
Collaborator

@jreback OK, hold on, I think I may be confused about something. Do we want to be able to pass a dict/list of regexes, essentially to "vectorize" re.sub? In that case another argument is probably the way to go, or some try/except mania. I could be wrong, though; I haven't thought it through completely yet.

@jreback
Owner

My understanding may be wrong, but isn't this something like

replace_expr -> value ?

How do you distinguish the replace_expr from a straight string, or would you always treat it as a regex? So 'foo' is an exact match but 'foo*' is the regex.

I guess you could always do regex matching.

@cpcloud
Collaborator

Always regex matching was my thinking, since plain strings are just special cases of regexes (in the most formal sense). Of course, once I actually start working on this, things may change.

@jreback
Owner

I would make a dict/list always an exact match (like now).

I see the problem: if I pass the number 2, do you match on the string '2', or match numbers == 2?

You could handle it by dtype.
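
A rough sketch of the by-dtype idea (replace_by_dtype is a hypothetical helper for illustration, not the internals code):

import re
import pandas as pd

def replace_by_dtype(df, to_replace, value):
    out = df.copy()
    for col in out.columns:
        if out[col].dtype == object and isinstance(to_replace, str):
            # Object (string) columns: treat to_replace as a regex.
            pat = re.compile(to_replace)
            out[col] = out[col].map(
                lambda x: value if isinstance(x, str) and pat.search(x) else x)
        else:
            # Other dtypes: exact-value replacement only.
            out[col] = out[col].mask(out[col] == to_replace, value)
    return out

# e.g. replace_by_dtype(df, 2, 0) touches numeric cells equal to 2, never the string '2'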

@cpcloud
Collaborator

Right, that was my original concern. Do we want regexes as replacement values as well? For example, I think the following is intuitive (the ASCII eye-tracker example from @louist87):

df.replace({r'^\s*\.\s*$': NaN})

But what if the user wants to replace a string value based on the matched regex? E.g.,

df.replace({r'\s*(\w{2})(\w{2})\s*': r'\1\2'})

Additionally, how should the empty-match case be handled? (This could come for free from pandas' convenient None handling.)
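
For reference, that backreference substitution is standard re.sub behavior:

import re

re.sub(r'\s*(\w{2})(\w{2})\s*', r'\1\2', '  abcd  ')  # -> 'abcd', whitespace stripped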

@jreback
Owner

This makes sense (especially if you can handle it conditionally via different block dtypes), e.g. a regex is only valid for ObjectBlocks, whereas you would only allow numeric replacements via Int/Float block types (maybe punt on Series replace for now, as this gets tricky).

You can probably leave Block.replace alone and just write a new one for ObjectBlock (though you can use putmask, by the way, if you know the values to replace).

@cpcloud
Collaborator

OK, will do. I had already started looking at ObjectBlock replace :) since that sidesteps the issue of regexes matching numeric types. Well, not sidesteps, but makes it easier.

@cpcloud
Collaborator

@jreback Are the keys of df.blocks subject to change?

@jreback
Owner

What do you mean?

How are you using it?

@cpcloud
Collaborator

Sorry, never mind; that was preemptive.

@cpcloud
Collaborator

Do we want to be able to match regexes on the columns, similar to df.filter? If not, I can submit this PR tomorrow. E.g.,

df.replace({r'\w+\.\w+': nan}, {r'\w+\.\w+': 1})

I think this is probably a bad idea, after typing it out. However, something like

df.replace({r'\w+\.\w+': {'to_replace_regex': nan}})

would be cool, I think. This would mean: get all the columns whose names match the first level of regexes, then within those columns replace the values matching each inner dict's regexes. However, this will obviously take more time.

@jreback
Owner

I would not add the column matching; too much magic.

@cpcloud
Collaborator

@jreback Can you describe what the single-dict replace is supposed to do? For example, I could see that

df.replace({'a': 'b'})

either replaces all occurrences of 'a' with 'b', or raises an exception about ambiguity, saying you need to provide the value argument. For a Series with the same argument ({'a': 'b'}), the former is what happens, which makes sense because there's no notion of columns in a Series.
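
For what it's worth, released pandas went with the former behavior: with a dict and no value argument, the keys are the values to replace and the dict values are their replacements, applied across all columns. For example:

import pandas as pd

df = pd.DataFrame({'x': ['a', 'c'], 'y': ['a', 'b']})
df.replace({'a': 'b'})  # every 'a' becomes 'b', in both columns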

@cpcloud
Collaborator

I have another question; hope this isn't too annoying. When should one use make_block, as opposed to just copying the block and setting the values attribute (the copying depending on the value of a possible inplace parameter)?

@jreback
Owner

Almost always use make_block.

If you want a specific dtype, pass the klass argument; otherwise it will be inferred (e.g. when you don't know whether you'd have int or float, let it infer).

Very rarely do I set the values directly.

@cpcloud
Collaborator

FYI, there is a possible back-compat issue here with my implementation, since every string is now treated as a regex. For example, df.replace('.', 'a') will match any character and replace it with 'a'. Is this a huge problem? A way around it is to provide an argument that tells the method to interpret the string as a regex; @jreback had suggested this earlier.

@jreback
Owner

I think you could do minimal validation and reject too-general regexes, e.g. just a few cases where you raise (or warn), or just document it; maybe that's good enough.

Unless you want to do exact string matching by default, and maybe use r'regex' to signal regexes?

@cpcloud
Collaborator

I think a regex=True argument, with exact matching as the default, is probably the way to go for back compat.
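
That is the API pandas ultimately shipped; a quick illustration:

import pandas as pd

df = pd.DataFrame({'a': ['.', 'a.b']})
df.replace('.', 'x')              # exact match by default: only the literal '.' cell changes
df.replace('.', 'x', regex=True)  # regex match: '.' matches every character, so 'a.b' -> 'xxx'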

@jreback
Owner

Yes, that sounds right.

@cpcloud
Collaborator

The docs for replace are going to be kind of a beast, since there are so many ways to slice and dice here...

@cpcloud
Collaborator

@jreback This is ready to go, modulo some additional documentation (regexes and examples in the method docstring) and your word on what to do with a single-dict to_replace. Also, nested-dict regexes will not work until this is implemented for Series.

@jreback
Owner

Closed via #3584.

@jreback jreback closed this
@jreback
Owner

This replace is quite flexible!

http://stackoverflow.com/questions/16818871/extracting-value-and-creating-new-column-out-of-it

It might be slower than method one, though, as that approach uses the vectorized string methods (which in theory this could use too).

@cpcloud
Collaborator

Glad you like it! replace uses np.vectorize under the hood. I chose this since lib.map_infer_mask only works over one dimension, although I could use that and then reshape, or maybe there's something in common.py that fully replaces vectorize.

@jreback
Owner

FYI, this is a pretty common idiom:

func(values_2d.ravel()).reshape(values_2d.shape)
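
For instance, with np.vectorize (illustrative values):

import numpy as np

values_2d = np.array([['a', 'b'], ['c', 'd']], dtype=object)
func = np.vectorize(str.upper)
# Apply the 1-D-friendly function element-wise, then restore the original shape.
func(values_2d.ravel()).reshape(values_2d.shape)
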
@cpcloud
Collaborator

OK, yeah, I'm always paranoid about copies even though I know that reshape shares data. MATLAB and R have scarred me for life.

@jreback
Owner

Hahah... this is all view-based, no copies!

@cpcloud
Collaborator

Actually, I just made the change (using _na_map from core/strings.py) and compared master to my change: vectorize is faster by about 40 us, and the replace method is about 4x faster than the vectorized-strings method. The map_infer_mask signature is just (ndarray, object, ndarray[i8] mask, bint convert), which won't give a huge speedup (except that the overhead of each iteration is smaller), since just declaring ndarray doesn't get you much beyond faster attribute access and some assert isinstance(obj, ndarray)-style checks at compile time.

@cpcloud
Collaborator

(this is using timeit)

@jreback
Owner

OK, so what you are doing is fine; but ought we to change all of the string methods to use vectorize (or write them in Cython)?

@cpcloud
Collaborator

Hm, writing them in Cython could be done via generate_code.py. It might be better to try vectorize first, since that change will be easier. Looks like a vbench for strings might be needed at this point.

@jreback
Owner

Look at #2802; can you do a quick test using vectorize on that example?

@cpcloud
Collaborator

In [15]: timeit vectorize(lambda s: s.endswith('world'))(p.strings)
100 loops, best of 3: 3.35 ms per loop

In [16]: %%timeit
   ....: for ii in xrange(1000):
   ....:     p['ishello'] = p['strings'].str.endswith('world')
   ....:
1 loops, best of 3: 3.61 s per loop

In [17]: %%timeit
   ....: for ii in xrange(1000) :
   ....:     p['isHello'] = [s.endswith('world') for s in p['strings'].values]
   ....:
1 loops, best of 3: 2.69 s per loop

In [18]: %%timeit
   ....: for ii in xrange(1000) :
   ....:     p['isHello'] = pandas.Series([s.endswith('world') for s in p['strings'].values])
   ....:
1 loops, best of 3: 2.27 s per loop

In [19]: f = vectorize(lambda x: x.endswith('world'))

In [20]: %%timeit
   ....: for ii in xrange(1000):
   ....:     p['isHello'] = f(p['strings'])
   ....:
1 loops, best of 3: 3.49 s per loop

In [21]: %%timeit
   ....: for ii in xrange(1000):
   ....:     p['isHello'] = f(p['strings'])
   ....:
1 loops, best of 3: 3.49 s per loop

@cpcloud
Collaborator

Not really a big difference, and the current methods are faster.

@jreback
Owner

Yep, got the same. OK, not a big deal then.
