Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Change auto_cast in SuperVectorizer #238

Merged
merged 15 commits into from
Mar 17, 2022

Conversation

LeoGrin
Copy link
Contributor

@LeoGrin LeoGrin commented Feb 14, 2022

Uses pandas TextParser (used in read_csv, as per @GaelVaroquaux suggestion in #234) instead of convert_dtypes in the SuperVectorizer _auto_cast method. This has several advantages:

Some comments:

  • TODO: I'd like to enable the user to specify a specific date format if he wants to.
  • Strings detected as NaNs are the default STR_NA_VALUES, and I've added [None, " ", "?", "..."]. I think we could find a more principled way to choose which strings to include.
  • I haven't properly measured the speed of the conversion.
  • To use TextParser, I need to convert the input data into a list of list. I don't think this is a problem, but tell me if you think otherwise.
  • All variables which are not numerics or datetimes are converted to object (and thus handled as categorical in the SuperVectorizer).
  • Right now, if a datetime column contains mixed timezones, it is encoded as an object.

…hich fixed the bug which made the parser convert a column to numeric if it had failed to convert it as date). We now use pd.to_datetime after the parsing.
Copy link
Member

@LilianBoulard LilianBoulard left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR, it looks nice !

dirty_cat/super_vectorizer.py Outdated Show resolved Hide resolved
dirty_cat/super_vectorizer.py Outdated Show resolved Hide resolved
dirty_cat/super_vectorizer.py Outdated Show resolved Hide resolved
dirty_cat/super_vectorizer.py Outdated Show resolved Hide resolved
Copy link
Member

@LilianBoulard LilianBoulard left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM !

@LeoGrin
Copy link
Contributor Author

LeoGrin commented Feb 20, 2022

In trying to take into account @GaelVaroquaux's comment, I've actually realized that instead of using TextParser, it would be simpler to go back to @LilianBoulard's version of autocast, and:

  • replace convert_dtypes by pd.to_numeric and pd.to_datetime
  • Replace every missing value with np.nan before type conversion

@GaelVaroquaux
Copy link
Member

Very interesting!! You know these questions better than us, now, so your input is super useful. Tell us when you want a second review.

@GaelVaroquaux
Copy link
Member

Darn, the tests errors are showing us that the strategy here is probably quite sensitive to the version of pandas.

Copy link
Member

@LilianBoulard LilianBoulard left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good !

dirty_cat/super_vectorizer.py Outdated Show resolved Hide resolved
dirty_cat/super_vectorizer.py Outdated Show resolved Hide resolved
dirty_cat/super_vectorizer.py Outdated Show resolved Hide resolved
LeoGrin and others added 2 commits March 3, 2022 15:39
Copy link
Member

@GaelVaroquaux GaelVaroquaux left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few thoughts (I'm too tired to finish tonight).

CHANGES.rst Outdated Show resolved Hide resolved
Copy link
Member

@GaelVaroquaux GaelVaroquaux left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Merging. Thanks a lot!!

@GaelVaroquaux GaelVaroquaux merged commit 262434b into skrub-data:master Mar 17, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants