Use errors='coerce' when converting to numerical or datetime in TableVectorizer's _apply_cast #666

Merged
merged 9 commits into from
Jul 21, 2023

@LeoGrin (Contributor) commented Jul 20, 2023

Fix #631

Questions:

  • do we want to warn the user if there is an issue with one of the inferred types?
  • Ideally, we would simply ignore a new entry of the wrong type. If we convert it to NaN instead, a model may apply what it learned about missing values during training to this entry, even though the value is not actually missing.
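
To make the second concern concrete, here is a minimal sketch (the sample values are made up for illustration) of how `errors="coerce"` makes an unparseable entry indistinguishable from a genuinely missing one, and how the two could still be told apart by comparing against the original column:

```python
import pandas as pd

# Illustrative data: "oops" is an invalid entry, None is genuinely missing.
values = pd.Series(["1", "2", "oops", None])

# Both the invalid entry and the missing one come out as NaN.
converted = pd.to_numeric(values, errors="coerce")

# Entries that became NaN only because of coercion:
# NaN after conversion, but not missing before it.
coerced_mask = converted.isna() & values.notna()

print(converted.isna().tolist())  # [False, False, True, True]
print(coerced_mask.tolist())      # [False, False, True, False]
```

Whether such a mask should be surfaced to the user or to the downstream model is exactly the open question here; this only shows that the information is recoverable at conversion time.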

@GaelVaroquaux (Member) commented

Answers to your questions:

  • I think that we should warn, possibly with a dedicated error to make it easy to catch this error
  • We should attempt the "best conversion possible", whatever that means. This may not be easy.

I'll review tomorrow, it's getting too late

@GaelVaroquaux (Member) left a comment

One comment: I think that we should code a mechanism to raise a warning. We can discuss this to gauge how important it is.

# we want to ignore entries that cannot be converted
# to this dtype
if pd.api.types.is_numeric_dtype(dtype):
    X[col] = pd.to_numeric(X[col], errors="coerce")

With the current usage of "coerce", it is hard to raise a warning when an invalid value is encountered.

If we feel that it is important to warn (I am leaning in this direction but am not 100% sure), one way to do it is with a try/except: first make the call with 'errors="raise"'; if it raises, catch the error in the "except" block, emit a warning (ideally an informative one), and then make the same call again with 'errors="coerce"'.
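
A minimal sketch of that try/except pattern for the numeric case is below. The helper name `to_numeric_with_warning`, the use of `UserWarning`, and the warning message are assumptions made for illustration; they are not part of this PR or of the skrub API.

```python
import warnings

import pandas as pd


def to_numeric_with_warning(column: pd.Series) -> pd.Series:
    """Hypothetical helper: try a strict cast, fall back to coercion with a warning."""
    try:
        # Strict attempt: raises if any entry cannot be parsed as a number.
        return pd.to_numeric(column, errors="raise")
    except (ValueError, TypeError) as exc:
        # Tell the user what happened before silently replacing values.
        warnings.warn(
            f"Column {column.name!r} contains values that cannot be cast to a "
            f"numeric dtype; they will be replaced by NaN. Original error: {exc}",
            UserWarning,
            stacklevel=2,
        )
        # Lenient fallback: unparseable entries become NaN.
        return pd.to_numeric(column, errors="coerce")
```

Calling `to_numeric_with_warning(pd.Series(["1", "2", "oops"], name="age"))` would emit one warning and return a float column whose last entry is NaN, while a clean column converts without any warning. The cost is a second pass over the column, paid only when the strict cast fails. A dedicated warning subclass instead of `UserWarning` would make it easier to filter or catch.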

elif pd.api.types.is_datetime64_any_dtype(dtype):
    X[col] = pd.to_datetime(X[col], errors="coerce")
else:
    # this should not happen

❤️

@GaelVaroquaux GaelVaroquaux merged commit f9f2b1c into skrub-data:main Jul 21, 2023
23 of 25 checks passed
@GaelVaroquaux (Member) commented

Thanks!! Merged

Development

Successfully merging this pull request may close these issues.

TableVectorizer fails when the type change between train and test