DatetimeEncoder fixes #743
Hey @LeoGrin, I'd advocate not to change the scope of PRs once started because it's confusing to review. To fix new issues, let's open new PRs instead :)
Dismissing my approval since the scope of the PR has changed.
I think returning the total number of seconds since Epoch when `self.extract_until=None` is misleading for the user.
I'd rather have `self.extract_until=None` as a no-op (not returning anything), and have an additional keyword init parameter like `self.seconds_since_epoch`.
What do you both think?
Otherwise, I don't have specific remarks on the PR itself, but I plan to refactor a lot of the DatetimeEncoder implementation that I'm not super happy with in a subsequent PR.
> I think returning the total number of seconds since Epoch when `self.extract_until=None` is misleading for the user.
> I'd rather have `self.extract_until=None` as a no-op (not returning anything), and have an additional keyword init parameter like `self.seconds_since_epoch`.
> What do you both think?
I agree, that seems better. If we don't want to add a parameter, we could even always return the seconds since epoch in addition to any other features the user requests.
> Otherwise, I don't have specific remarks on the PR itself, but I plan to refactor a lot of the DatetimeEncoder implementation that I'm not super happy with in a subsequent PR.
I agree the DatetimeEncoder could use some cleanup. For example, it re-parses the whole date column to extract each feature.
Also, the DatetimeEncoder and the TableVectorizer both parse date strings, and they do it in different ways, so it may be worth thinking about their respective responsibilities and doing some refactoring.
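For illustration, parsing the column once and deriving every feature from the parsed values could look like the sketch below. It uses plain pandas rather than the actual DatetimeEncoder code, and the sample dates and column names are made up for the example.

```python
import pandas as pd

# Invented sample data: a column of date strings.
dates = pd.Series(["2021-03-04 10:30:00", "2022-11-25 23:05:10"])

# Parse the strings a single time...
parsed = pd.to_datetime(dates)

# ...then read each feature off the already-parsed values.
features = pd.DataFrame(
    {
        "year": parsed.dt.year,
        "month": parsed.dt.month,
        "day": parsed.dt.day,
        "hour": parsed.dt.hour,
    }
)
print(features)
```

The parsing cost is then paid once regardless of how many features are requested.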
Yes, I think that's a good idea. Thinking about it a bit more, it would be better for the shape of the output not to depend on the data, for instance if we want to use this inside a grid search, so it may be better to get rid of the logic we have now (check if columns are constant).
> Thinking about it a bit more, it would be better for the shape of the output not to depend on the data, for instance if we want to use this inside a grid search, so it may be better to get rid of the logic we have now (check if columns are constant).
I think that's a good idea. Dropping constant columns can be done later in the pipeline with a VarianceThreshold transformer from scikit-learn.
I guess the idea was to avoid adding, e.g., an hour column full of 0 when the column we are transforming contains dates, not times. But for that it may be better to rely on the date format detected when parsing the dates than to check the variance a posteriori.
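A minimal sketch of dropping constant columns downstream, assuming the encoder output is a numeric array whose last column is an all-zero hour feature (the data is invented for the example):

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

# Invented encoder output: columns are year, month, hour.
# The hour column is constant because the inputs were dates without times.
X = np.array(
    [
        [2021, 1, 0],
        [2022, 5, 0],
        [2023, 9, 0],
    ],
    dtype=float,
)

# With the default threshold of 0.0, only zero-variance columns are removed.
selector = VarianceThreshold()
X_reduced = selector.fit_transform(X)

# Boolean mask of the columns that were kept.
kept = selector.get_support()
print(X_reduced.shape, kept)
```

With this arrangement the encoder itself always emits the same columns, and the data-dependent pruning is an explicit, separate pipeline step.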
Alternatively, we could let the user switch between these two behaviors with an additional
We definitely need to add the
Let's change the behavior of `extract_until=None` to be a no-op and add an extra `add_total_seconds_since_epoch` or `add_seconds_since_epoch` keyword parameter (default True) as discussed :)
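To make the proposal concrete, here is a rough sketch of the intended semantics on a single value, using only the standard library. The function `encode_datetime`, the `FEATURES` list, and the `total_seconds` key are hypothetical names for illustration; this is not the DatetimeEncoder implementation.

```python
from datetime import datetime, timezone

# Ordered from coarsest to finest; illustrative, not the library's list.
FEATURES = ["year", "month", "day", "hour", "minute", "second", "microsecond"]


def encode_datetime(value, extract_until="hour", add_total_seconds_since_epoch=True):
    """Hypothetical sketch of the discussed behavior."""
    out = {}
    # extract_until=None is a no-op: no calendar features are extracted.
    if extract_until is not None:
        stop = FEATURES.index(extract_until) + 1
        for name in FEATURES[:stop]:
            out[name] = getattr(value, name)
    # The epoch feature is controlled independently of extract_until.
    if add_total_seconds_since_epoch:
        out["total_seconds"] = value.timestamp()
    return out


dt = datetime(2020, 1, 2, 3, 4, 5, tzinfo=timezone.utc)
only_total = encode_datetime(dt, extract_until=None)
nothing = encode_datetime(dt, extract_until=None, add_total_seconds_since_epoch=False)
until_month = encode_datetime(dt, extract_until="month", add_total_seconds_since_epoch=False)
```

In this sketch the set of output keys depends only on the two parameters, never on the data, which matches the grid-search concern raised earlier in the thread.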
@@ -137,6 +137,10 @@ Minor changes
which provides some more information about the job title.
:pr:`581` by :user:`Lilian Boulard <LilianBoulard>`

* Fix bugs which was triggered when `extract_until` was "year", "month", "microseconds"
- * Fix bugs which was triggered when `extract_until` was "year", "month", "microseconds"
+ * Fix bugs which were triggered when `extract_until` was "year", "month", "microseconds"
Let's merge so that we can do the refactoring soon.
`exact_until='microsecond'` #741: Remove millisecond extraction in DatetimeEncoder (turns out pandas' `DatetimeIndex` doesn't have a `millisecond` attribute).
`extract_until` is "year" or "month" #745: Stop using the `floor` function and simplify the logic to check if we need `total_time`.
`None` option for `extract_until` to only extract `total_time`. @jeromedockes