Change auto_cast in SuperVectorizer #238
…hich fixed the bug which made the parser convert a column to numeric if it had failed to convert it as date). We now use pd.to_datetime after the parsing.
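As a rough illustration of running date detection as a separate step after numeric parsing, here is a minimal sketch. The helper name `try_parse_dates` is hypothetical and not taken from the PR; it only assumes that `pd.to_datetime` is tried on non-numeric columns and that the column is left unchanged when conversion fails.

```python
import pandas as pd


def try_parse_dates(col: pd.Series) -> pd.Series:
    """Hypothetical helper: attempt date conversion on an already-parsed column."""
    # Numeric columns are returned as-is, so a failed or skipped date
    # conversion can never turn a column into numbers (or numbers into dates).
    if pd.api.types.is_numeric_dtype(col):
        return col
    try:
        return pd.to_datetime(col)
    except (ValueError, TypeError):
        # Not parseable as dates: keep the original values untouched.
        return col


dates = try_parse_dates(pd.Series(["2020-01-01", "2020-02-03"]))
words = try_parse_dates(pd.Series(["a", "b"]))
```

Keeping the date step separate from the numeric step means each conversion either succeeds fully or leaves the column alone.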
Thanks for the PR, it looks nice !
Co-authored-by: Lilian <lilian@boulard.fr>
…irty_cat into dates-super-vectorizer merge
LGTM !
In trying to take into account @GaelVaroquaux's comment, I've actually realized that instead of using TextParser, it would be simpler to go back to @LilianBoulard's version of autocast, and:
Very interesting!! You know these questions better than us, now, so your input is super useful. Tell us when you want a second review.
Darn, the test errors are showing us that the strategy here is probably quite sensitive to the version of pandas.
Looks good !
Co-authored-by: Lilian <lilian@boulard.fr>
A few thoughts (I'm too tired to finish tonight).
LGTM. Merging. Thanks a lot!!
Uses pandas TextParser (used in read_csv, as per @GaelVaroquaux's suggestion in #234) instead of convert_dtypes in the SuperVectorizer _auto_cast method. This has several advantages:
- ["3", "2", "1"] would get detected as integer. Solves "Handle string in auto_cast (SuperVectorizer)" #234.
- [1, 2, "", 4] is detected as integers (solves "Replace strings which indicates missing values by NaNs (for SuperVectorizer)" #231). Which strings get detected as NaNs is easy to change (see below).

Some comments:
- STR_NA_VALUES, and I've added [None, " ", "?", "..."]. I think we could find a more principled way to choose which strings to include.
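The behaviour described above can be approximated with a small sketch. It goes through `pd.read_csv` rather than calling `TextParser` directly, relying on the description's statement that the two share parsing logic, so it is not the PR's actual implementation. The helper `infer_column` and the constant `EXTRA_NA_VALUES` are hypothetical names introduced only for this example.

```python
import io

import pandas as pd

# Hypothetical extra missing-value markers, on top of pandas' defaults.
EXTRA_NA_VALUES = (" ", "?", "...")


def infer_column(values, na_values=EXTRA_NA_VALUES):
    """Round-trip values through the CSV parser and return the re-typed column.

    Assumes the values contain no commas, quotes or newlines.
    """
    # One value per line under a single header named "col".
    text = "col\n" + "\n".join(str(v) for v in values) + "\n"
    df = pd.read_csv(
        io.StringIO(text),
        na_values=list(na_values),  # added to pandas' default NA strings
        skip_blank_lines=False,     # keep empty strings as missing rows
    )
    return df["col"]


digits = infer_column(["3", "2", "1"])   # strings holding digits
with_gap = infer_column([1, 2, "", 4])   # empty string as a missing value
marked = infer_column(["1", "?", "3"])   # "?" as a missing value
```

Here `digits` comes back with an integer dtype, while `with_gap` and `marked` come back as numeric columns with one missing entry each. Changing which strings count as missing only requires passing a different `na_values` sequence.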