ENH: add expand kw to str.get_dummies #10103

Closed
wants to merge 1 commit into from

sinhrks
Member

@sinhrks sinhrks commented May 11, 2015

Ref: #10008.

Though this is still a work in progress (it needs #10089 to simplify the get_dummies flow), I would like to discuss the following points.

#### .str.extract note: overlaps with #11386

Currently it returns a Series for a single group and a DataFrame for multiple groups. To support the expand kw, we have to choose one of the following:

1. Add an expand option that keeps the existing behavior, with a warning about a future change to expand=True (current impl).
2. Add an expand option that keeps the existing behavior. Standardize on expand=None (or another option) to select the returned dimensionality automatically.
3. Add an expand option with default True (or False). This breaks the API.
4. Make Index.str.extract return a MultiIndex in the multiple-group case without adding an expand option.
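
For readers comparing these options, here is a minimal sketch of how the `expand` keyword affects the returned dimensionality. It assumes a recent pandas release in which `Series.str.extract` accepts `expand`; it is not the implementation in this PR, and the sample data is made up for illustration.

```python
import pandas as pd

# Illustrative data only; object dtype keeps the behaviour independent
# of the default string dtype of the installed pandas version.
s = pd.Series(["a1", "b2", "c3"], dtype=object)

# One capture group with expand=False: the result is a Series.
one_flat = s.str.extract(r"([ab])", expand=False)

# One capture group with expand=True: the result is a one-column DataFrame.
one_frame = s.str.extract(r"([ab])", expand=True)

# Two capture groups: the result is a DataFrame with one column per group.
two_frame = s.str.extract(r"([ab])(\d)", expand=True)

print(type(one_flat).__name__, type(one_frame).__name__, two_frame.shape)
```

Rows that do not match the pattern (here `"c3"`) come back as missing values in every case.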

#### .str.get_dummies

  1. Add an expand kw with default True. Currently this always returns a DataFrame (and raises TypeError on an Index). This doesn't break the API (current impl).
  2. Make Index.str.get_dummies return a MultiIndex without adding an expand option.
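
As a rough illustration of the second option, the sketch below compares `Series.str.get_dummies` with `Index.str.get_dummies`. It assumes a pandas release in which the Index variant returns a MultiIndex (the behaviour this thread eventually settles on); the sample strings are made up.

```python
import pandas as pd

values = ["a|b", "a|c", "b|c"]  # illustrative data only

# Series: one indicator column per distinct token.
frame = pd.Series(values, dtype=object).str.get_dummies("|")

# Index: the same indicators, packed as one MultiIndex level per token.
mi = pd.Index(values, dtype=object).str.get_dummies("|")

print(frame)
print(mi.names)
```

Each MultiIndex level plays the role of one DataFrame column, so the two results carry the same information in different containers.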

CC @mortada

@sinhrks sinhrks added the API Design, Strings (String extension data type and string data), and Compat (pandas objects compatibility with NumPy or Python functions) labels May 11, 2015
@sinhrks sinhrks added this to the 0.17.0 milestone May 11, 2015
1 2
2 NaN
dtype: object

Contributor

I think it'd be good to change this example to showcase expand=False when it actually has multiple groups, i.e.,

>>> Series(['a1', 'b2', 'c3']).str.extract('([ab])(\d)', expand=False)
0        [a, 1]
1        [b, 2]
2    [nan, nan]
Name: [0, 1], dtype: object

I'd also move this one example down, so we'd have:

"A pattern with more than one group will return a DataFrame."

"But you can specify expand=False to return Series."

@jorisvandenbossche
Member

Question: is there actually a need to have the expand option here for those two functions?

If we add them, my opinion about the discussion points:

  • extract:
    • I would keep the current behaviour (Series for a single group, DataFrame for multiple groups). It seems like good behaviour, and I see no reason to change it.
    • Have expand=None as the default, so that you are able to override the default behaviour
  • get_dummies:
    • OK. How would this look for expand=False? A series/index with lists as elements? (question is if we want to encourage this? -> coming back to my initial question)

@sinhrks
Member Author

sinhrks commented May 12, 2015

@jorisvandenbossche Correct. There is an option to make Index.str work the same as Series.str without adding the expand kw. I added these as alternatives. Though I prefer to make them to have unified kw/behavior.

get_dummies: How would this look for expand=False? A series/index with lists as elements?

The result will be a Series/Index of tuples, the same as str.split(expand=False).

@jorisvandenbossche
Member

Not that important for this discussion, but str.split gives a list, not a tuple for me?
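
A quick way to check the element type, as a minimal sketch with made-up data (object dtype is assumed so the result does not depend on the installed pandas version's default string dtype):

```python
import pandas as pd

s = pd.Series(["a b", "c d e"], dtype=object)  # illustrative data only

# expand=False (the default for str.split) keeps one container per row.
parts = s.str.split(expand=False)

# Inspect the Python type of the first element.
first = parts.iloc[0]
print(type(first).__name__, first)
```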

@jorisvandenbossche
Member

Though I prefer to make them to have unified kw/behavior.

Well, I would also like that very much, but the default values of the keyword would in any case not be unified. Since it is not really possible to unify it that way, I was considering the option of not adding the keyword at all (which is also not unified behaviour).

@sinhrks sinhrks force-pushed the str_expand branch 3 times, most recently from f30f63c to da9a38e Compare May 23, 2015 21:33
@sinhrks sinhrks changed the title (WIP) ENH: add expand kw to str.extract and str.get_dummies ENH: add expand kw to str.extract and str.get_dummies May 29, 2015
@sinhrks sinhrks changed the title ENH: add expand kw to str.extract and str.get_dummies (WIP)ENH: add expand kw to str.extract and str.get_dummies May 29, 2015
@jreback
Contributor

jreback commented Jul 28, 2015

status?

@sinhrks
Member Author

sinhrks commented Jul 29, 2015

I hope to work on this, but it requires Index.fillna to keep the flow simple (#10089).

@jreback jreback modified the milestones: Next Major Release, 0.17.0 Aug 20, 2015
@sinhrks sinhrks force-pushed the str_expand branch 2 times, most recently from 5ab40f1 to 08c283c Compare November 13, 2015 21:45
@sinhrks sinhrks changed the title (WIP)ENH: add expand kw to str.extract and str.get_dummies ENH: add expand kw to str.extract and str.get_dummies Nov 14, 2015
@sinhrks sinhrks force-pushed the str_expand branch 2 times, most recently from d7cf295 to 8f867ca Compare November 23, 2015 05:17
@jreback jreback changed the title ENH: add expand kw to str.extract and str.get_dummies ENH: add expand kw to str.get_dummies Feb 13, 2016
@jreback
Contributor

jreback commented Mar 12, 2016

@sinhrks what are we doing with this one?

@jreback jreback removed this from the Next Major Release milestone Mar 13, 2016
@sinhrks
Member Author

sinhrks commented Mar 14, 2016

There are two points, and I think the first point (add expand=False to return a Series) is less useful (only for consistency). How about adding Index.str.get_dummies and closing the issue?

  1. Add an expand kw with default True. Currently this always returns a DataFrame (and raises TypeError on an Index). This doesn't break the API (current impl).
  2. Make Index.str.get_dummies return a MultiIndex without adding an expand option.

@jreback
Contributor

jreback commented Mar 14, 2016

@sinhrks I suppose you could add Index.str.get_dummies. I'm not really sure how useful this is, but it makes things consistent.

@sinhrks sinhrks mentioned this pull request Apr 10, 2016
4 tasks
@jreback jreback added this to the 0.18.1 milestone Apr 11, 2016
@jreback jreback closed this in e1aa2d9 Apr 11, 2016
@sinhrks sinhrks deleted the str_expand branch April 12, 2016 22:37