ENH: More convenient regex? #4685

Closed
danielballan opened this issue Aug 27, 2013 · 9 comments · Fixed by #4696

@danielballan
Contributor

How about circumventing annoying match objects by providing an additional, somewhat redundant string method?

s.str.extract(pattern)

as a shortcut for

s.str.match(pattern).str.get(0)

If the pattern contains multiple groups, a DataFrame should be returned. Thoughts? Maybe @hayd would be into this one.
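
A minimal sketch of the single-group behaviour proposed above, written against the standard-library `re` module rather than pandas. The helper name `extract_first_group` is hypothetical, and the use of `search` (rather than `match`) is an assumption; `None` stands in for NaN.

```python
import re

def extract_first_group(values, pattern):
    """Return the first capture group for each string, or None if no match."""
    regex = re.compile(pattern)
    result = []
    for value in values:
        m = regex.search(value)  # assumption: search semantics, not anchored match
        result.append(m.group(1) if m else None)
    return result

print(extract_first_group(['x=10', 'y=20', 'none'], r'=(\d+)'))
```

The point is that the caller gets plain strings back and never has to touch a match object.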

@hayd
Contributor

hayd commented Aug 27, 2013

I do hate match objects (is this like a get groups thing?)... this sounds useful. I think it should be able to extract using group name (at the moment get is by integer location only); not sure how that would work with multiple groups.

I also dislike how findall returns a series of lists...

@danielballan
Contributor Author

This is how I think single and multiple groups should work.

>>> Series(['a1', 'b2', 'c3']).str.extract(r'[ab](\d)')
0    1
1    2
2    NaN
dtype: object

>>> Series(['a1', 'b2', 'c3']).str.extract(r'([ab])(\d)')
     0    1
0    a    1
1    b    2
2  NaN  NaN

What are you thinking for group names?
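
The multiple-group case can be sketched the same way with plain `re`. This is an illustration of the proposed semantics, not the pandas implementation; `extract_groups` is a hypothetical name, and each returned tuple plays the role of one DataFrame row, with `None` in place of NaN.

```python
import re

def extract_groups(values, pattern):
    """Return one tuple of capture groups per string; all None if no match."""
    regex = re.compile(pattern)
    empty_row = (None,) * regex.groups  # regex.groups = number of capture groups
    rows = []
    for value in values:
        m = regex.search(value)
        rows.append(m.groups() if m else empty_row)
    return rows

for row in extract_groups(['a1', 'b2', 'c3'], r'([ab])(\d)'):
    print(row)
```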

@jtratner
Contributor

Would this / does this only work for Series? Also, how would you handle named match groups? (They look like '(?P<name>pattern)'.)

@danielballan
Contributor Author

I think it would only work neatly for Series. As for named match groups, what about this?

>>> Series(['a1', 'b2', 'c3']).str.match(r'(?P<letter>[ab])(?P<digit>\d)')
  letter digit
0      a     1
1      b     2
2    NaN   NaN

As above, if there is no match, the whole row is NaN. If there is an optional group and that group is absent from a given entry, only the absent group is NaN.

>>> Series(['a1', 'b2', '3']).str.match(r'(?P<letter>[ab])?(?P<digit>\d)') # first group optional
  letter digit
0      a     1
1      b     2
2    NaN     3
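
A rough sketch of the two rules just described (no match gives an all-empty row; an absent optional group gives an empty cell), again using only the standard-library `re` module. The helper `extract_named` is hypothetical, dicts keyed by group name stand in for DataFrame rows, and `None` stands in for NaN.

```python
import re

def extract_named(values, pattern):
    """Return one dict per string, keyed by group name."""
    regex = re.compile(pattern)
    # Order the names by group position so columns follow the pattern.
    names = sorted(regex.groupindex, key=regex.groupindex.get)
    rows = []
    for value in values:
        m = regex.search(value)
        if m is None:
            # No match at all: every column is empty.
            rows.append({name: None for name in names})
        else:
            # A group that did not participate in the match yields None.
            rows.append({name: m.group(name) for name in names})
    return rows

pattern = r'(?P<letter>[ab])?(?P<digit>\d)'  # first group optional
for row in extract_named(['a1', 'b2', '3', 'x'], pattern):
    print(row)
```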

@jtratner
Contributor

That's a good way to do it.

@hayd
Contributor

hayd commented Aug 27, 2013

Yep, that was exactly what I was getting at with group names, this would be a great enhancement.

The other (silly) case I was thinking of was a repeated optional group, but this doesn't appear to be supported in re anyway (nor does it really work with unnamed groups); only the last repetition is captured:

In [11]: re.match('((?P<letter>[ab]))*','ab').groupdict()
Out[11]: {'letter': 'b'}
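
For comparison, a short standard-library check of the same behaviour: a repeated group keeps only its last repetition, whereas `re.findall` returns every occurrence. The variable names are illustrative only.

```python
import re

# A repeated named group records only the final repetition it matched.
last_only = re.match(r'((?P<letter>[ab]))*', 'ab').groupdict()
print(last_only)

# findall collects each occurrence instead, at the cost of returning a list.
all_letters = re.findall(r'[ab]', 'ab')
print(all_letters)
```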

@danielballan
Contributor Author

Two examples that work in the PR referenced above. Will write tests out of these, and more.

In [1]: Series(['a1', 'b1', '1', 'a']).str.extract(r'([ab])?(?P<digit>\d)')
Out[1]: 
     1 digit
0    a     1
1    b     1
2  NaN     1
3  NaN   NaN

In [2]: Series(['a1', 'b1', '1', 'a']).str.extract('([ab])?')
Out[2]: 
0      a
1      b
2    NaN
3      a
dtype: object

Notice that, in the column names, I follow the re module's convention of labeling unnamed groups by their group number, here 1. It's actually the string '1'. Should it be the integer 1? Which is less confusing?
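
To make the two labeling options concrete, here is a small sketch that derives column labels from a compiled pattern with the standard-library `re` module. The function `column_labels` and its `as_str` flag are hypothetical names for illustration, not part of pandas or of the PR.

```python
import re

def column_labels(pattern, as_str=False):
    """Label each group by its name, falling back to its group number."""
    regex = re.compile(pattern)
    # groupindex maps name -> number; invert it to look names up by number.
    names = {index: name for name, index in regex.groupindex.items()}
    labels = []
    for number in range(1, regex.groups + 1):
        fallback = str(number) if as_str else number
        labels.append(names.get(number, fallback))
    return labels

pattern = r'([ab])?(?P<digit>\d)'
print(column_labels(pattern))               # integer fallback
print(column_labels(pattern, as_str=True))  # string fallback
```

With the integer fallback the labels are of mixed type (`1` next to `'digit'`); with the string fallback they are all strings, but `'1'` is easy to mistake for the integer when selecting a column.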

@hayd
Contributor

hayd commented Aug 27, 2013

Oh sorry, just commented about that same thing. Not sure what's better. Probably should follow their (weird) convention.

@danielballan
Contributor Author

Ha. Saw your comment and changed it. I'm really not sure which is better.
