
API: convert_objects, xref #11116, instead of warning, raise a ValueError #11133

Closed

Conversation

@jreback (Contributor) commented Sep 17, 2015

closes #11116

This is a lot noisier than the current implementation. It pretty much raises on old-style input if it had coerce in it. This is correct, as the actual behavior of coerce has changed (it is now much more strict).

Raise a ValueError on old-style input:

  • raise a ValueError for df.convert_objects('coerce')
  • raise a ValueError for df.convert_objects(convert_dates='coerce') (and likewise for convert_numeric and convert_timedelta)
In [1]: data = """foo,bar
   ...: 2015-09-14,True
   ...: 2015-09-15,
   ...: """

In [2]: df = pd.read_csv(StringIO(data),sep=',')

In [3]: df
Out[3]: 
          foo   bar
0  2015-09-14  True
1  2015-09-15   NaN
In [4]: df.convert_objects('coerce')
ValueError: The use of 'coerce' as an input is deprecated. Instead set coerce=True.

In [5]: df.convert_objects(coerce=True)
ValueError: coerce=True was provided with no options for conversion. Exactly one of 'datetime', 'numeric' or 'timedelta' must be True when coerce=True.

In [8]: df.convert_objects(convert_dates='coerce')
/Users/jreback/miniconda/bin/ipython:1: FutureWarning: the 'convert_dates' keyword is deprecated, use 'datetime' instead
ValueError: The use of 'coerce' as an input is deprecated. Instead set coerce=True.
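For comparison, the coercing conversion attempted in the session above maps onto the per-type converters. A minimal sketch using today's pandas API (pd.to_datetime with errors='coerce'), not the convert_objects call under discussion:

```python
import pandas as pd
from io import StringIO

data = """foo,bar
2015-09-14,True
2015-09-15,
"""
df = pd.read_csv(StringIO(data))

# Coerce only the column that is actually datetime-like; unparseable
# values become NaT instead of raising.
df["foo"] = pd.to_datetime(df["foo"], errors="coerce")
```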

@jreback jreback added Dtype Conversions Unexpected or buggy dtype conversions Error Reporting Incorrect or improved errors from pandas labels Sep 17, 2015
@jreback jreback added this to the 0.17.0 milestone Sep 17, 2015
@jreback (Contributor Author) commented Sep 17, 2015

cc AmbroseKeith
cc @bashtage

@jorisvandenbossche @TomAugspurger

@jorisvandenbossche (Member)

I am not really in favor of this, as this really completely breaks people's code.

Another option would be to not do any mapping of the old keyword arguments to the new ones. So leave the old keywords with deprecation warnings but their original behaviour, and the new keywords with the new behaviour.

@jreback (Contributor Author) commented Sep 17, 2015

It's not the deprecation mappings that are the problem; it's that the coerce conversion is by definition forced. Using it on a frame is just not a good idea at all.

in prior versions it worked because it was really never a forced conversion.

Users' code was simply wrong and should break.

I don't see any way around this, nor should there be; they were relying on broken behavior.

this is a loud break showing that what you were doing was wrong

I suppose a better error message would help.

@jorisvandenbossche (Member)

Yes, the fact that previously the conversion was not really 'forced' (it only converted if at least one value could be converted), while now it is 'forced', is what makes this a breaking change. People's code was not wrong; they were just using the function as it worked.

Not doing the mapping of the old to the new keyword arguments would be a way to introduce the new behaviour (under the new kwargs) without breaking people's code (but of course, the old behaviour should still be supported in the code base in this case, which could make it more complex)

I agree that this PR is in fact (in some cases) better than current master, as it raises an error notifying the user he/she should look at it, instead of just changing the result and breaking possible further processing.

@jreback (Contributor Author) commented Sep 17, 2015

I don't see any reason to support old code like that
it just prolongs the change period

@jorisvandenbossche (Member)

Actually, what you are doing here is partly the same as I am saying: not doing the mapping of old to new kwargs (for the 'coerce' case). But the question is then what to do with the old kwargs: raising an error (this PR), or keeping the old behaviour but deprecating (allowing a more smooth transition).

@jorisvandenbossche (Member)

@jreback I don't say that is the better or easy approach, but the reason to do this is pretty clear: keep current user code working with 0.17.0 and giving them some time to adapt.

@jorisvandenbossche (Member)

Further, not doing the mapping would also allow us to further clean up some inconsistencies in the current API (which would otherwise be more breaking changes):

  • numeric=True also implies coerce=True (while this is not the case for datetime and timedelta), so converting numeric strings without coercing is not possible (actually, it is possible with astype, but not with convert_objects)
  • datetime=True does not parse strings to datetimes, while datetime=True, coerce=True does parse strings, whereas coerce should only be about converting inconvertibles to NaN or NaT.

@jreback (Contributor Author) commented Sep 17, 2015

As I said, I think we could give a better error message (kind of like instructions on how this changed).

I'll put that up a bit later

but this is a pretty narrow case - breaking is sometimes necessary

@jreback (Contributor Author) commented Sep 17, 2015

I think that we should just change this whole approach - this is too error prone

let's deprecate this entire function - not adding anything (so I guess revert to the prior behavior) - or just outright raise an error

add pd.to_numeric(errors='raise'|'ignore'|'coerce')

e.g. the same signature as pd.to_datetime/to_timedelta

that way the pattern is that the user needs to apply this to individual columns
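A sketch of the proposed signature, matching the pd.to_numeric that pandas eventually shipped (the sample data here is made up for illustration):

```python
import pandas as pd

s = pd.Series(["1", "2", "apple"])

# errors='coerce': inconvertible values become NaN
coerced = pd.to_numeric(s, errors="coerce")

# errors='raise' (the default): an inconvertible value raises ValueError
try:
    pd.to_numeric(s, errors="raise")
    raised = False
except ValueError:
    raised = True
```

This mirrors pd.to_datetime/pd.to_timedelta, so the user applies one explicit conversion per column rather than a frame-wide guess.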

@jreback (Contributor Author) commented Sep 17, 2015

another option is to make a new method
.convert with the new behavior

and leave .convert_objects alone (but deprecate it fully)

@bashtage (Contributor)

using it on a frame is just not a good idea at all.

I agree with this. Better practice is to use

df['a'] = df['a'].astype('float')

or

df['a'] = pd.to_datetime(df['a'])

so that the minimal conversion is used (or an error is raised).

let's deprecate this entire function - not adding anything (so I guess revert to the prior behavior) - or just outright raise an error

I think this is probably the right approach, plus adding to_numeric which will handle the numeric uses of convert_objects
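The per-column pattern described above, end to end (a minimal sketch; the frame and column names are hypothetical):

```python
import pandas as pd

# Hypothetical frame with object-dtype columns
df = pd.DataFrame({"a": ["1.5", "2.5"], "b": ["2015-09-14", "2015-09-15"]})

# Explicit per-column conversions: each column gets exactly the
# conversion intended, and a bad value raises instead of being
# silently coerced.
df["a"] = df["a"].astype("float")
df["b"] = pd.to_datetime(df["b"])
```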

@jreback (Contributor Author) commented Sep 17, 2015

@bashtage you want to take a crack at this?

  • restore the original .convert_objects (though you probably want to use your new structure; maybe add a compat parameter to _possibly_convert_objects)
  • deprecate .convert_objects
  • update the whatsnew
  • add pd.to_numeric

@jreback (Contributor Author) commented Sep 20, 2015

@bashtage are you going to be able to address this?

@jorisvandenbossche (Member)

+1 to adding a pd.to_numeric

I would not add a new convert method. If we want this functionality, we can do it within convert_objects (instead of deprecating the method in favor of a new one, we can deprecate the keyword arguments in favor of new arguments within the same method). In the end, the result can be the same, but we don't need a new method on DataFrame.

@bashtage (Contributor)

I can get to it this week.

IMO the old behavior of convert_objects is at best poor, and in the long run it would be better to remove it. For example, it didn't even respect the coerce flag, and it has weird dependencies between types (since it tries one, moves on to the next, etc., and so is not robust to lots of reasonable refactorings).

I think there are two ways forward:

  1. Add to_numeric so there is a counterpart to to_datetime and to_timedelta, completing the set; start the deprecation of convert_objects.
  2. Move the new version to convert and start the deprecation.

In principle, adding to_numeric and convert are orthogonal, so both could be done.

@jreback (Contributor Author) commented Sep 21, 2015

@bashtage thanks!

  • I think let's just add pd.to_numeric
  • As for the original .convert_objects: I agree the original behavior was just misleading. I think it might be best to do roughly what this PR does, e.g. map the original arguments; if no coerce is specified it should work, and if not, refuse to guess and just raise. Further, I think let's simply deprecate the entire function (.convert_objects).

Yes, this will break code, but it will do so loudly, and I think only a very small subset of users will experience this. We cannot do this cleanly, as the fundamental behavior has changed. I suppose we could revert to the original behavior, but the simplest/easiest thing from this point is to just raise.

@jorisvandenbossche (Member)

Agree on pd.to_numeric, this is useful in any case!

On convert_objects, if we want to deprecate it, I would revert it to how it was originally (0.16.2). I don't really see a point in breaking people's code to make something more consistent in a feature we are going to deprecate anyway. So why not just deprecate it and point users to pd.to_numeric/to_datetime? I would say either change how convert_objects works (to make it better, in order to keep it) or deprecate it.

@jreback (Contributor Author) commented Sep 22, 2015

ok, let's roll back the entire .convert_objects change and just deprecate it entirely.
If you want to keep the logic you have internally for _possibly_convert_objects that's fine as well (though it should have the exact same external API), so not sure what is easier.

and add pd.to_numeric (which is essentially your new logic).

@bashtage change the documentation to advertise to_numeric/to_datetime/to_timedelta, rather than .convert_objects.

@bashtage (Contributor)

I have been working on the following:

  1. Revert the public convert_objects with the small bug fixed.
  2. Turn the current convert_objects into a private _convert. This function is used internally in a handful of places and works correctly.
  3. Add to_numeric

One question: where should to_numeric live? generic.py? The other to_* functions live in time-series-specific modules.

@jreback (Contributor Author) commented Sep 22, 2015

@bashtage awesome!

pd.to_numeric could live in pandas/tools/util.py (not much there ATM), then import it in __init__ into the pd API.

We should probably reorganize this at some point, though that is purely internal.

@jreback (Contributor Author) commented Sep 22, 2015

closing in favor of #11173

@jreback jreback closed this Sep 22, 2015
bashtage pushed a commit to bashtage/pandas that referenced this pull request Oct 1, 2015
Restores the v0.16 behavior of convert_objects and moves the new version to _convert.
Adds to_numeric for directly converting numeric data.

closes pandas-dev#11116
closes pandas-dev#11133
Successfully merging this pull request may close these issues.

0.17 regression: convert_objects coercion with multiple dtypes