
Replace with regular expression #2285

Closed
changhiskhan opened this issue Nov 19, 2012 · 45 comments

@changhiskhan
Contributor

xref: http://stackoverflow.com/questions/13445241/replacing-blank-values-white-space-with-nan-in-pandas

@paulproteus

It seems to me this would be a good ticket for a first-time contributor who is familiar with (or willing to learn about) regular expressions.

You could model your test case after the code at the Stack Overflow post that @changhiskhan linked to here.

You would presumably be modifying "class DataFrame" in pandas/core/frame.py by adding a method to it.

If it makes things more convenient, presumably your new method could insist that the regular expression that comes has already been compiled with re.compile, so that your function does not have to compile it.
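The precompiled-pattern idea sketched out (the helper names here are made up for illustration, not pandas API):

```python
import re

# A pattern compiled once with re.compile can be reused many times,
# and still exposes its source string via the .pattern attribute.
blank = re.compile(r"^\s*$")

def is_blank(value):
    # assumes value is a string
    return bool(blank.match(value))

print(is_blank("   "))   # True
print(is_blank("x"))     # False
print(blank.pattern)     # ^\s*$
```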

@louist87

I would also be in favor of such a feature provided we can also pass exact matches as literals. I have eyetracker data in ASCII file format in which missing values are indicated with the '.' string.

I suggest castna as a method name since we're recasting certain values as another type.

@ghost

ghost commented Apr 12, 2013

Possibly #3276 would do half the job.

@cpcloud
Member

cpcloud commented May 10, 2013

I would be happy to implement this.

@paulproteus regex compiles are so fast (a string of length 1000 compiles in < 1 us) that it's not really an issue, plus you can get the string via the pattern attr on the compiled regex.

@louist87 A regular expression consisting of no metacharacters (escape metacharacters to match them) does exactly that. I think castna is a little misleading since replace works on non-NA values, probably accepting a regex in the to_replace argument is better.
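This is how it works in the replace API that eventually shipped: escape the metacharacters and the pattern matches the literal string. A small sketch (data invented; behavior as in released pandas):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"val": ["1", ".", "2.5"]})

# Escaping the metacharacter makes '.' match only a literal dot,
# and the anchors keep '2.5' from matching on its embedded dot.
out = df.replace(r"^\.$", np.nan, regex=True)
print(out["val"].tolist())  # ['1', nan, '2.5']
```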

@jreback
Contributor

jreback commented May 10, 2013

@cpcloud FYI replace currently exists in both core/series and core/internals,
and they do the same thing

@cpcloud
Member

cpcloud commented May 10, 2013

@jreback maybe better to wait until series-as-ndframe is merged before working on this?

@jreback
Contributor

jreback commented May 10, 2013

@cpcloud up 2 you....series-as-ndframe is coming in 0.12 ...why don't you add to the core/internals/Block/replace....that way no wasted code.....(and do for 0.12),

and if I recall...I didn't fix replace in series yet anyhow......so go ahead and do if you would like...

@cpcloud
Member

cpcloud commented May 10, 2013

Ok. Cool. This should only work on strings, correct? For example, passing the numeric token regexes from the tokenize module will not match anything unless that number is actually a string.

@jreback
Contributor

jreback commented May 10, 2013

@cpcloud I think it has to be a valid re.compileable expression (which i think stringifies)?

@jreback
Contributor

jreback commented May 10, 2013

actually.....are we adding an argument for this? e.g. regex= otherwise how to determine when you want an actual replacement (as opposed to a re replacement)?

@cpcloud
Member

cpcloud commented May 10, 2013

@jreback Ok hold on, I think I may be confused about something. Do we want to be able to pass a dict/list of regexes to essentially "vectorize" re.sub? In that case, another arg is prob the way 2 go. Or some try..except mania. That could be wrong, though. I haven't thought about it completely yet.

@jreback
Contributor

jreback commented May 10, 2013

my understanding may be wrong but isn't this something like

replace_expr -> value ?
how do u distinguish the replace_expr from a straight string
or would u always do that?

so 'foo' is an exact match but 'foo*' is the re

I guess u could always do re matching

@cpcloud
Member

cpcloud commented May 10, 2013

Always re matching was my thinking since straight strings are just special cases of regexes (in the most formal sense). Of course, once I actually start working on this things may change.

@jreback
Contributor

jreback commented May 10, 2013

I would make a dict/list always exact match (like now)

I see the problem, if I pass the number 2 what do u match on the string 2?
or match numbers == 2 ?

you could handle it by dtype
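Released pandas resolves this by dtype, as suggested here: exact matching compares by equality per block, so the int 2 never touches string cells. A tiny sketch (column names invented):

```python
import pandas as pd

df = pd.DataFrame({"n": [2, 3], "s": ["2", "3"]})

# the integer 2 matches only the numeric column; the string "2" survives
out = df.replace(2, 0)
print(out["n"].tolist())  # [0, 3]
print(out["s"].tolist())  # ['2', '3']
```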

@cpcloud
Member

cpcloud commented May 10, 2013

Right that was my orig. concern. Do we want regexes as replacement values as well? For example, I think the following is intuitive

(ascii eyelink example from @louist87)

df.replace({r'^\s*\.\s*$': NaN})

But what if the user wants to replace a str value based on the matched re? E.g.,

df.replace({r'\s*(\w{2})(\w{2})\s*': r'\1\2'})

Additionally, how should the empty match case be handled (this could come for free from pandas' convenient None handling)?
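Both spellings work in the pandas that eventually shipped; when the replacement is a string, re.sub semantics apply, so backreferences work. A runnable sketch (data invented):

```python
import pandas as pd

df = pd.DataFrame({"code": ["  abcd ", "xy"]})

# swap the two captured pairs and strip the surrounding whitespace;
# 'xy' has no four-character match, so it is left alone
out = df.replace({r"^\s*(\w{2})(\w{2})\s*$": r"\2\1"}, regex=True)
print(out["code"].tolist())  # ['cdab', 'xy']
```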

@jreback
Contributor

jreback commented May 10, 2013

this makes sense (and if you can conditionally handle via different block dtypes), e.g. a re is only valid for ObjectBlocks, whereas you would only allow numeric replacements via Int/Float Block types (maybe punt Series replace for now as this gets tricky)

You can prob leave the Block.replace alone and just write a new one for the ObjectBlock (though you CAN putmask btw, if you know the values to replace)

@cpcloud
Member

cpcloud commented May 10, 2013

ok. will do. i had already started looking at object block replace :) since that sidesteps the issue of re's matching numeric types. well not sidesteps...but makes it easier

@cpcloud
Member

cpcloud commented May 12, 2013

@jreback Are the keys of df.blocks subject to change?

@jreback
Contributor

jreback commented May 12, 2013

what do u mean?

how r u using it?

@cpcloud
Member

cpcloud commented May 12, 2013

sorry. nvm. preemptive.

@cpcloud
Member

cpcloud commented May 12, 2013

Do we want to be able to match regexes on the columns, similar to df.filter? If not, I can submit this pr tmrw. E.g.,

df.replace({'\w+\.\w+': nan}, {'\w+\.\w+': 1})

I think this is probably a bad idea, after typing that out. However, something like

df.replace({'\w+\.\w+': {'to_replace_regex': nan}}) 

would be cool I think. This would mean: get all the columns with names matching the first level of regexes, then within those columns replace the values matching each inner dict's regexes. However, this will obviously take more time.
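For reference, the column-selection half of this already exists as df.filter, which matches column labels against a regex (data invented):

```python
import pandas as pd

df = pd.DataFrame({"a.b": [1], "c": [2]})

# keep only the columns whose labels match the pattern
sub = df.filter(regex=r"\w+\.\w+")
print(sub.columns.tolist())  # ['a.b']
```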

@jreback
Contributor

jreback commented May 12, 2013

I would not add the column matching, too much magic

@cpcloud
Member

cpcloud commented May 12, 2013

@jreback Can u give a description of what the single dict replace method is supposed to do? For example, I could see that

df.replace({'a': 'b'})

will replace all occurrences of 'a' with 'b' or it could raise an exception saying something about ambiguity and that you need to provide the value argument. For a Series with the same argument ({'a': 'b'}), the former is what's happening, which makes sense because there's no notion of columns in a Series.
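Released pandas went with the former interpretation: a single dict with no value argument maps old values to new ones across every column (example data invented):

```python
import pandas as pd

df = pd.DataFrame({"x": ["a", "b"], "y": ["a", "c"]})

# {'a': 'b'} means: wherever the value 'a' appears, put 'b'
out = df.replace({"a": "b"})
print(out["x"].tolist())  # ['b', 'b']
print(out["y"].tolist())  # ['b', 'c']
```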

@cpcloud
Member

cpcloud commented May 12, 2013

I have another question. Hope this isn't too annoying. When should one use make_block as opposed to just copying the block and setting the values attribute (copying being dependent on the value of a possible inplace parameter)?

@jreback
Contributor

jreback commented May 12, 2013

almost always use make_block

if u want to be a specific dtype pass the klass arg
otherwise it will be inferred (eg if u don't know whether you would have int or float,
let it infer it)

Very rarely do i set the values directly

@cpcloud
Member

cpcloud commented May 12, 2013

FYI there is a possible back compat issue here with my implementation, since every string is now treated as a regex. For example, df.replace('.', 'a') will match any character and replace it with 'a'. Is this a huge problem? A way around it is to provide an argument that tells the method to interpret the string as a regex. @jreback had suggested this earlier.

@jreback
Contributor

jreback commented May 12, 2013

I think u could do a minimal validation and reject too-general regexes,
eg just a few where u raise (or warn), or just document it; maybe that's good enough

unless u want to do string exact matching by default
and maybe use /r'regex' to signal regexes?

@cpcloud
Member

cpcloud commented May 12, 2013

i think a regex=True argument and default to exact matching is probably the way to go for back compat.
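This is the design that shipped: plain strings are exact matches unless regex=True is passed. A tiny invented example:

```python
import pandas as pd

df = pd.DataFrame({"s": ["a.b", "axb"]})

exact = df.replace(".", "-")           # exact match: no cell equals "."
rx = df.replace(".", "-", regex=True)  # regex: '.' matches any character

print(exact["s"].tolist())  # ['a.b', 'axb']
print(rx["s"].tolist())     # ['---', '---']
```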

@jreback
Contributor

jreback commented May 12, 2013

yes that sounds right

@cpcloud
Member

cpcloud commented May 12, 2013

the docs for replace are going to be kind of a beast since there are so many ways to slice and dice here...

@cpcloud
Member

cpcloud commented May 12, 2013

@jreback this is ready 2 go modulo some additional documentation (regexes and examples in the method) and your word on what to do with single-dict to_replace. also nested-dict regexes will not work until this is implemented for Series.

@jreback
Contributor

jreback commented May 17, 2013

closed via #3584

@jreback jreback closed this as completed May 17, 2013
@jreback
Contributor

jreback commented May 29, 2013

This replace is quite flexible!

http://stackoverflow.com/questions/16818871/extracting-value-and-creating-new-column-out-of-it

It might be slower than method one there, though, as that uses the vectorized string methods (which in theory this could use too)

@cpcloud
Member

cpcloud commented May 29, 2013

glad u like! replace uses np.vectorize under the hood. i chose it since lib.map_infer_mask only works over one dimension, although i could use that and then reshape, or maybe there's some stuff in common.py that fully replaces vectorize

@jreback
Contributor

jreback commented May 29, 2013

FYI, this is a pretty common idiom

func(values_2d.ravel()).reshape(values_2d.shape)
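The idiom written out runnably (the wrapper name is mine), assuming func only handles 1-D input:

```python
import numpy as np

def apply_flat(func, values_2d):
    # run a 1-D-only function over a 2-D array and restore the shape;
    # for contiguous arrays, ravel and reshape return views, not copies
    return func(values_2d.ravel()).reshape(values_2d.shape)

arr = np.arange(6).reshape(2, 3)
out = apply_flat(lambda v: v * 2, arr)
print(out.shape)  # (2, 3)
```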

@cpcloud
Member

cpcloud commented May 29, 2013

ok. yeah i'm always paranoid about copies even though i know that reshape shares data. matlab and r have scarred me 4 life

@jreback
Contributor

jreback commented May 29, 2013

hahah....this is all view based, no copies!

@cpcloud
Member

cpcloud commented May 29, 2013

actually just made the change (using _na_map from core/strings.py) and then compared master to my change: vectorize is faster by about 40 us, and the replace method is about 4x faster than the vec strings method. the map_infer_mask function signature is just ndarray, object, ndarray[i8] mask, bint convert, which won't give a huge speedup (except that the time to go to the next iteration is smaller), since just declaring ndarray doesn't really get you much beyond faster attribute access and some assert isinstance(obj, ndarray)-ish checks at compile time.

@cpcloud
Member

cpcloud commented May 29, 2013

(this is using timeit)

@jreback
Contributor

jreback commented May 29, 2013

ok....so what you are doing is fine; but ought we to change all of the string methods to use vectorize (or write them in cython)?

@cpcloud
Member

cpcloud commented May 29, 2013

hm writing them in cython could be done via generate_code.py. might be better to check vectorize first since that change will be easier. looks like a vbench for strings might be needed at this point

@jreback
Contributor

jreback commented May 29, 2013

look at #2802, can you just do a quick test using vectorize on the example?

@cpcloud
Member

cpcloud commented May 29, 2013

```
In [15]: timeit vectorize(lambda s: s.endswith('world'))(p.strings)
100 loops, best of 3: 3.35 ms per loop

In [16]: %%timeit
   ....: for ii in xrange(1000):
   ....:     p['ishello'] = p['strings'].str.endswith('world')
   ....:
1 loops, best of 3: 3.61 s per loop

In [17]: %%timeit
   ....: for ii in xrange(1000):
   ....:     p['isHello'] = [s.endswith('world') for s in p['strings'].values]
   ....:
1 loops, best of 3: 2.69 s per loop

In [18]: %%timeit
   ....: for ii in xrange(1000):
   ....:     p['isHello'] = pandas.Series([s.endswith('world') for s in p['strings'].values])
   ....:
1 loops, best of 3: 2.27 s per loop

In [19]: f = vectorize(lambda x: x.endswith('world'))

In [20]: %%timeit
   ....: for ii in xrange(1000):
   ....:     p['isHello'] = f(p['strings'])
   ....:
1 loops, best of 3: 3.49 s per loop
```

@cpcloud
Member

cpcloud commented May 29, 2013

not really a big difference and the current methods are faster

@jreback
Contributor

jreback commented May 29, 2013

yep.....got the same....ok...not a big deal then
