ENH: More convenient regex? #4685

danielballan · 2013-08-27T04:57:44Z

How about circumventing annoying match objects by providing an additional, somewhat redundant string method?

s.str.extract(pattern)

as a shortcut for

s.str.match(pattern).str.get(0)

If the pattern contains multiple groups, a DataFrame should be returned. Thoughts? Maybe @hayd would be into this one.


hayd · 2013-08-27T10:31:54Z

I do hate match objects (is this like a get groups thing?)... this sounds useful. I think we should be able to extract using group name (at the moment get is by integer location only), though I'm not sure how that would work with multiple groups.

I also dislike how findall returns a series of lists...

danielballan · 2013-08-27T14:36:27Z

This is how I think single and multiple groups should work.

>>> Series(['a1', 'b2', 'c3']).str.extract('[ab](\d)')
0    1
1    2
2    NaN
dtype: object

>>> Series(['a1', 'b2', 'c3']).str.extract('([ab])(\d)')
     0    1
0    a    1
1    b    2
2  NaN  NaN
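
For reference, the semantics proposed above can be sketched with the standard `re` module alone. This is only an illustration, not the pandas implementation: `extract_sketch` is a hypothetical helper, it assumes `re.search` is used to find the match, and it uses `None` and plain lists/tuples where pandas would use `NaN`, a Series, or a DataFrame.

```python
import re

def extract_sketch(strings, pattern):
    """Hypothetical helper illustrating the proposed behavior."""
    regex = re.compile(pattern)
    rows = []
    for s in strings:
        m = regex.search(s)  # assumption: search, not an anchored match
        if m is None:
            # No match: every group in the row is missing.
            rows.append((None,) * regex.groups)
        else:
            rows.append(m.groups())
    # A single group collapses to a flat list (the Series case);
    # several groups stay as rows of tuples (the DataFrame case).
    if regex.groups == 1:
        return [row[0] for row in rows]
    return rows

single = extract_sketch(['a1', 'b2', 'c3'], r'[ab](\d)')
multi = extract_sketch(['a1', 'b2', 'c3'], r'([ab])(\d)')
print(single)  # ['1', '2', None]
print(multi)   # [('a', '1'), ('b', '2'), (None, None)]
```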

What are you thinking for group names?

jtratner · 2013-08-27T14:50:47Z

Would this / does this only work for Series? Also, how would you handle named match groups? (They look like '(?P<name>pattern)'.)

danielballan · 2013-08-27T15:02:37Z

I think it would only work neatly for Series. As for named match groups, what about this?

>>> Series(['a1', 'b2', 'c3']).str.extract('(?P<letter>[ab])(?P<digit>\d)')
     letter  digit
0    a       1
1    b       2
2    NaN     NaN

As above, if there is no match, the whole row is NaN. If there is an optional group and that group is absent from a given entry, only the absent group is NaN.

>>> Series(['a1', 'b2', '3']).str.extract('(?P<letter>[ab])?(?P<digit>\d)') # first group optional
     letter  digit
0    a       1
1    b       2
2    NaN     3
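
The two rules described here (no match gives an all-missing row, an absent optional group gives a single missing cell) fall out of `groupdict()` in the standard `re` module. The sketch below is an illustration only: `extract_named_sketch` is a hypothetical helper, `None` stands in for `NaN`, and the extra input `'x'` is added just to show the no-match case.

```python
import re

def extract_named_sketch(strings, pattern):
    """Hypothetical helper: one dict per input string, keyed by group name."""
    regex = re.compile(pattern)
    # Group names ordered by their position in the pattern.
    names = sorted(regex.groupindex, key=regex.groupindex.get)
    rows = []
    for s in strings:
        m = regex.search(s)  # assumption: search, not an anchored match
        if m is None:
            # No match at all: the whole row is missing.
            rows.append({name: None for name in names})
        else:
            # Groups that did not participate in the match come back as None.
            rows.append(m.groupdict())
    return names, rows

names, rows = extract_named_sketch(
    ['a1', 'b2', '3', 'x'], r'(?P<letter>[ab])?(?P<digit>\d)'
)
print(names)    # ['letter', 'digit']
print(rows[2])  # {'letter': None, 'digit': '3'}
print(rows[3])  # {'letter': None, 'digit': None}
```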

jtratner · 2013-08-27T16:00:25Z

That's a good way to do it.

hayd · 2013-08-27T17:29:03Z

Yep, that was exactly what I was getting at with group names, this would be a great enhancement.

The stupid thing I was also thinking was multiple optional things, but actually this doesn't appear to be supported in re anyway (nor does it really work with unnamed stuff):

In [11]: re.match('((?P<letter>[ab]))*','ab').groupdict()
Out[11]: {'letter': 'b'}
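
To spell out what the session above shows: a capturing group inside a repetition keeps only its last repetition, so one match object cannot report every occurrence. A minimal sketch of that behavior, with `re.finditer` as one possible way to collect each occurrence separately (the variable names are illustrative only):

```python
import re

# A repeated named group retains only the final repetition.
m = re.match(r'((?P<letter>[ab]))*', 'ab')
last_only = m.groupdict()

# Iterating over non-overlapping matches recovers every occurrence.
all_letters = [hit.group('letter')
               for hit in re.finditer(r'(?P<letter>[ab])', 'ab')]

print(last_only)    # {'letter': 'b'}
print(all_letters)  # ['a', 'b']
```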

danielballan · 2013-08-27T19:54:42Z

Two examples that work in the PR referenced above. Will write tests out of these, and more.

In [1]: Series(['a1', 'b1', '1', 'a']).str.extract('([ab])?(?P<digit>\d)')
Out[1]: 
     1 digit
0    a     1
1    b     1
2  NaN     1
3  NaN   NaN

In [2]: Series(['a1', 'b1', '1', 'a']).str.extract('([ab])?')
Out[2]: 
0      a
1      b
2    NaN
3      a
dtype: object

Notice that, in the column names, I follow the re module's convention of labeling any unnamed groups with 1. It's actually '1'. Should it be 1? Which is less confusing?
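
As a rough illustration of the labeling question, column labels can be derived from a compiled pattern's `groupindex` and `groups` attributes. `column_labels` is a hypothetical helper, not the code in the PR; it uses the integer group number for unnamed groups, and wrapping the number in `str()` would give the string variant instead.

```python
import re

def column_labels(pattern):
    """Hypothetical helper: a group's name if it has one, else its 1-based number."""
    regex = re.compile(pattern)
    # groupindex maps name -> group number; invert it for lookup by number.
    names_by_number = {number: name
                       for name, number in regex.groupindex.items()}
    return [names_by_number.get(i, i) for i in range(1, regex.groups + 1)]

labels = column_labels(r'([ab])?(?P<digit>\d)')
print(labels)  # [1, 'digit']
```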

hayd · 2013-08-27T20:01:07Z

Oh sorry, just commented about that same thing. Not sure what's better. Probably should follow their (weird) convention.

danielballan · 2013-08-27T20:14:07Z

Ha. Saw your comment and changed it. I'm really not sure which is better.

danielballan mentioned this issue Aug 27, 2013

ENH: Series.str.extract returns regex matches more conveniently #4696

Merged

jreback closed this as completed in #4696 Sep 20, 2013

This was referenced Sep 22, 2013

Problem with Series.str.match #2074

Closed

Add stack argument to str.findall #4428

Closed
