Change auto_cast in SuperVectorizer #238

LeoGrin · 2022-02-14T11:38:52Z

Uses pandas' TextParser (the parser behind read_csv, as per @GaelVaroquaux's suggestion in #234) instead of convert_dtypes in the SuperVectorizer's _auto_cast method. This has several advantages:

Type detection works on string columns, e.g. ["3", "2", "1"] is detected as integer. Solves "Handle string in auto_cast (SuperVectorizer)" #234.
Missing values encoded as strings are treated as missing values and no longer prevent type detection. For instance, [1, 2, "", 4] is detected as integers. Solves "Replace strings which indicates missing values by NaNs (for SuperVectorizer)" #231. Which strings are detected as NaNs is easy to change (see below).
Datetime columns are converted to datetimes, and the format is inferred by pandas. Solves the first part of "Handling date columns in SuperVectorizer" #233.
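To make the first point concrete, here is a small sketch of why convert_dtypes alone does not help with string columns. It is only an illustration with a made-up Series, not code from the PR, and pd.to_numeric is used simply as one way to parse the strings.

```python
import pandas as pd

# Hypothetical column where the numbers arrive as strings.
strings = pd.Series(["3", "2", "1"])

# convert_dtypes picks the best dtype for the values as they are stored,
# so a column of digit strings stays a string column.
converted = strings.convert_dtypes()

# Parsing the strings is a separate step; pd.to_numeric is one way to do it.
parsed = pd.to_numeric(strings)

print(converted.dtype)  # a string dtype, not a numeric one
print(parsed.dtype)     # an integer dtype
```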

Some comments:

TODO: I'd like to let the user specify a date format if they want to.
The strings detected as NaNs are pandas' default STR_NA_VALUES, to which I've added [None, " ", "?", "..."]. I think we could find a more principled way to choose which strings to include.
I haven't properly measured the speed of the conversion.
To use TextParser, I need to convert the input data into a list of lists. I don't think this is a problem, but tell me if you think otherwise.
All variables that are neither numeric nor datetime are converted to object (and thus handled as categorical in the SuperVectorizer).
Right now, if a datetime column contains mixed timezones, it is encoded as an object.
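The list-of-lists conversion and the extra missing-value strings mentioned in the comments above can be sketched roughly as follows. This is a minimal illustration, not the code in the PR: the sample frame and the `extra_na` name are hypothetical, and `None` is left out of the extra markers because only strings are passed here.

```python
import pandas as pd
from pandas.io.parsers import TextParser

# Hypothetical input: every value arrives as a string.
df = pd.DataFrame(
    {
        "count": ["3", "2", "1"],
        "score": ["1.5", "?", "2.5"],
        "city": ["Paris", "London", "Rome"],
    }
)

# TextParser reads rows given as a list of lists, so flatten the frame first.
rows = df.to_numpy().tolist()

# Extra strings to treat as missing, on top of the pandas defaults.
extra_na = [" ", "?", "..."]

# header=None because the column names are passed separately via `names`.
with TextParser(
    rows, names=list(df.columns), header=None, na_values=extra_na
) as parser:
    parsed = parser.read()

print(parsed.dtypes)
```

Changing which strings count as missing then only requires editing the list passed as `na_values`.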

LilianBoulard

Thanks for the PR, it looks nice!

dirty_cat/super_vectorizer.py

LilianBoulard

LGTM!

dirty_cat/super_vectorizer.py

LeoGrin · 2022-02-20T17:09:05Z

While trying to address @GaelVaroquaux's comment, I realized that instead of using TextParser, it would be simpler to go back to @LilianBoulard's version of auto_cast and:

Replace convert_dtypes with pd.to_numeric and pd.to_datetime.
Replace every missing value with np.nan before type conversion.
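The two steps above could look something like the sketch below. It is an assumption-laden illustration rather than the code that was merged: the function name `auto_cast_sketch`, the `NA_STRINGS` list, and the sample frame are all hypothetical.

```python
import numpy as np
import pandas as pd

# Illustrative subset of strings treated as missing (an assumption,
# not the list used in the PR).
NA_STRINGS = ["", " ", "?", "...", "NA", "NaN", "null"]


def auto_cast_sketch(df: pd.DataFrame) -> pd.DataFrame:
    """Cast each column to numeric, then datetime, else leave it as object."""
    out = df.copy()
    for col in out.columns:
        # Work on an object column so strings and NaN can coexist.
        column = out[col].astype(object)
        # Step 1: replace missing-value markers with np.nan before converting.
        column = column.mask(column.isin(NA_STRINGS), np.nan)
        # Step 2: try numeric first, then datetime.
        try:
            out[col] = pd.to_numeric(column)
            continue
        except (ValueError, TypeError):
            pass
        try:
            out[col] = pd.to_datetime(column)
            continue
        except (ValueError, TypeError):
            pass
        # Neither conversion worked: keep the column as object.
        out[col] = column
    return out


raw = pd.DataFrame(
    {
        "n": ["3", "2", "1"],
        "with_gap": [1, "", 4],
        "when": ["2022-02-14", "2022-02-20", "?"],
        "city": ["Paris", "London", "Rome"],
    }
)
cast = auto_cast_sketch(raw)
print(cast.dtypes)
```

Trying numeric before datetime matters here, since pd.to_datetime will also accept plain numbers and interpret them as timestamps.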

GaelVaroquaux · 2022-02-20T19:42:11Z

Very interesting!! You know these questions better than us, now, so your input is super useful. Tell us when you want a second review.

GaelVaroquaux · 2022-02-20T19:43:47Z

Darn, the test errors show that the strategy here is probably quite sensitive to the pandas version.

dirty_cat/test/test_super_vectorizer.py

LilianBoulard

Looks good!

dirty_cat/super_vectorizer.py

GaelVaroquaux

A few thoughts (I'm too tired to finish tonight).

CHANGES.rst

GaelVaroquaux

LGTM. Merging. Thanks a lot!!

LeoGrin added 3 commits February 14, 2022 09:57

test name for fit_transform_equiv (for pytest)

755b7ed

Change auto_cast to use TextParser + tests

564dc1a

Changes how dates are converted to be compatible with pandas 1.4.0 (which fixed the bug which made the parser convert a column to numeric if it had failed to convert it as date). We now use pd.to_datetime after the parsing.

c183cf1

LilianBoulard reviewed Feb 16, 2022

dirty_cat/super_vectorizer.py (4 review threads, all outdated)

LeoGrin and others added 6 commits February 16, 2022 19:26

fix python3.6 compatibility by considering pd.NA as NaNs in the TextParser

0ad1cd9

Update dirty_cat/super_vectorizer.py

58516c9

Co-authored-by: Lilian <lilian@boulard.fr>

Update dirty_cat/super_vectorizer.py

aeb241b

Co-authored-by: Lilian <lilian@boulard.fr>

Taking into account Lilian's suggestion on using is_numeric_dtype instead of == int64

4aceccc

Merge branch 'dates-super-vectorizer' of https://github.com/LeoGrin/dirty_cat into dates-super-vectorizer merge

641e3a5

change array_equal to all_close to be compatible with numpy < 1.19

fa35070

LilianBoulard approved these changes Feb 18, 2022

GaelVaroquaux reviewed Feb 18, 2022

dirty_cat/super_vectorizer.py (outdated)

handling missing values and type conversion by hand (not with TextParser)

f51e46e

GaelVaroquaux reviewed Feb 20, 2022

dirty_cat/test/test_super_vectorizer.py

LeoGrin added 2 commits February 21, 2022 12:18

don't import STR_NA_VALUES

a9bb0dd

hardcode STR_NA_VALUES to prevent import errors

32b21c7

LilianBoulard reviewed Mar 3, 2022

dirty_cat/super_vectorizer.py (3 review threads, all outdated)

LeoGrin and others added 2 commits March 3, 2022 15:39

Apply suggestions from code review

1277f1c

Co-authored-by: Lilian <lilian@boulard.fr>

fix error message (Lilian's review)

cef56d4

GaelVaroquaux reviewed Mar 13, 2022

CHANGES.rst (outdated)

Update CHANGES.rst

e9d00fe

GaelVaroquaux approved these changes Mar 17, 2022

GaelVaroquaux merged commit 262434b into skrub-data:master Mar 17, 2022
