step_date_cyclic(...) to feature code time into sin and cos #702

lindeloev opened this issue Apr 28, 2021 · 10 comments

feature a feature request or enhancement


lindeloev commented Apr 28, 2021

Updated: I initially proposed an API like step_date(..., cyclic = TRUE). I now think that would be confusing because features = "month" would then mean "time of month" whereas it currently means "month of year". Instead, I propose a new function called step_date_cyclic.


When forecasting, cyclical time trends can be an important predictor. I propose to add a new function called step_date_cyclic which feature-codes POSIXct columns as trigonometric columns, as mlr3pipelines::PipeOpDateFeatures() does. E.g., for features = "month" it would add two columns: sin(time) and cos(time). This avoids the data reduction inherent in the current categorical feature coding. And since many learners support numerical features, this enables more models to work on forecasting.

Here's one exposition of the rationale behind this feature coding:

R code demo

Say the user has data with some timestamps:

df = data.frame(
  timestamps = seq(as.POSIXct("2021-02-01"), as.POSIXct("2021-05-01"), length.out = 300)
)

Calling step_date_cyclic(..., features = "month") would add two new columns:

timestamps_secs = as.numeric(df$timestamps)
secs_per_month = 30 * 24 * 60 * 60
df$timestamps_month_sin = sin(timestamps_secs * 2 * pi / secs_per_month)
df$timestamps_month_cos = cos(timestamps_secs * 2 * pi / secs_per_month)


plot(timestamps_month_sin ~ timestamps, df)
points(timestamps_month_cos ~ timestamps, df, col = "red")


Jointly, these two columns uniquely identify each time of month:

plot(timestamps_month_cos ~ timestamps_month_sin, df)



Two columns would be added for each feature. For any other feature, just use the corresponding secs_per_{feature}; the rest stays the same.
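
A minimal sketch of that generalization, assuming a hypothetical lookup `secs_per` of nominal period lengths and a hypothetical helper `add_cyclic_features()`. Neither is part of recipes, and the period values are approximations, since calendar months and years vary in length.

```r
# Hypothetical lookup of nominal period lengths in seconds (assumed values).
secs_per <- c(
  day   = 24 * 60 * 60,
  week  = 7 * 24 * 60 * 60,
  month = 30 * 24 * 60 * 60,
  year  = 365 * 24 * 60 * 60
)

# Hypothetical helper: adds <col>_<feature>_sin and <col>_<feature>_cos
# for every requested feature.
add_cyclic_features <- function(df, col, features = "month") {
  secs <- as.numeric(df[[col]])  # POSIXct -> seconds since the epoch
  for (feature in features) {
    angle <- secs * 2 * pi / secs_per[[feature]]
    df[[paste0(col, "_", feature, "_sin")]] <- sin(angle)
    df[[paste0(col, "_", feature, "_cos")]] <- cos(angle)
  }
  df
}

df <- data.frame(
  timestamps = seq(as.POSIXct("2021-02-01", tz = "UTC"),
                   as.POSIXct("2021-05-01", tz = "UTC"),
                   length.out = 300)
)
out <- add_cyclic_features(df, "timestamps", features = c("week", "month"))
names(out)
```

Each sin/cos pair lies on the unit circle, which is why the pair identifies the position within the cycle while either column alone does not.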

@lindeloev lindeloev changed the title step_date(..., cyclic = TRUE) to feature code time into sin and cos step_date_cyclic(...) to feature code time into sin and cos May 4, 2021

topepo commented May 4, 2021

That would be great. Can you create a PR?

@juliasilge juliasilge added the feature a feature request or enhancement label May 4, 2021

Would be happy to do a PR, but I can't promise a time frame. I'm just beginning to use tidymodels now, coming from mlr3.


When you get started, check out:


jkennel commented Jun 17, 2021

Any progress on this? I have some code for step_harmonic which basically does this that I could submit. If you have or are working on it I'll hold off.


topepo commented Jun 17, 2021

Go ahead and make a PR 👍


I likely won't work on this for at least a few months, so go ahead, @jkennel! This is an informal function I wrote up and use day-to-day myself. Feel free to use, if useful. It converts POSIX/numeric columns using user-specified periods. TO DO:

  • Adherence to the recipe guide, e.g., tidyselect columns instead of character
  • Perhaps a way to specify phase shift
#' Code numerical and POSIXct columns as cyclic features
#' @param df A data.frame, tibble, or data.table
#' @param cols Names of the columns to cyclify. Must be `numeric` or `POSIXct`
#' @param periods A named list of periods to cycle over (numerical). The number
#'   of seconds for `POSIXct` features. The names are used to set the name of
#'   the new feature columns (see "Value" section).
#' @param keep_cols Logical. Keep `cols` after coding or drop them?
#' @return `df` with two added columns per `period`:
#'   `colname.{period_name}_sin` and `colname.{period_name}_cos`.
step_cyclic = function(df, cols = character(0), periods = list(month = 60*60*24*30.5, year = 60*60*24*365.15), keep_cols = FALSE) {
  is.POSIXct = function(x) inherits(x, "POSIXct")

  # Validate inputs
  if (is.data.frame(df) == FALSE && tibble::is_tibble(df) == FALSE && data.table::is.data.table(df) == FALSE)
    stop("`df` must be table-like.")

  # Check each column via [[ ]] so a single column is handled like several
  ok_cols = vapply(cols, function(col) is.numeric(df[[col]]) || is.POSIXct(df[[col]]), logical(1))
  if (any(ok_cols == FALSE))
    stop("The following column(s) were not POSIXct or numeric: ", paste0(names(ok_cols[ok_cols == FALSE]), collapse = ", "))

  if (is.null(names(periods)) || any(names(periods) == ""))
    stop("Some periods were unnamed")

  ok_periods = vapply(periods, is.numeric, logical(1))
  if (any(ok_periods == FALSE))
    stop("The following period(s) were not numeric: ", paste0(names(ok_periods)[ok_periods == FALSE], collapse = ", "))

  # Convert to sin/cos
  for (col in cols) {
    # as.numeric() turns POSIXct into seconds since the epoch and leaves
    # numeric columns unchanged, so no type test is needed here
    col_vec = as.numeric(df[[col]])
    for (period_name in names(periods)) {
      angle = col_vec * 2 * pi / periods[[period_name]]
      df[[paste0(col, ".", period_name, "_sin")]] = sin(angle)
      df[[paste0(col, ".", period_name, "_cos")]] = cos(angle)
    }

    # Remove the original column unless asked to keep it
    if (keep_cols == FALSE)
      df[[col]] = NULL
  }

  df
}
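
One way the phase-shift item from the TO DO list might look, as a rough sketch only: `cyclic_encode()` and its `shift` argument are hypothetical names for illustration and are not part of recipes or of the function above. The shift is given in the same units as the input column.

```r
# Hypothetical helper: sin/cos coding with an optional phase shift,
# expressed in the same units as `x`.
cyclic_encode <- function(x, period, shift = 0) {
  angle <- (as.numeric(x) - shift) * 2 * pi / period
  data.frame(sin = sin(angle), cos = cos(angle))
}

# Hour of day as a plain numeric column with a 24-hour period.
hours   <- 0:23
plain   <- cyclic_encode(hours, period = 24)
shifted <- cyclic_encode(hours, period = 24, shift = 6)  # cycle starts at 06:00

# A quarter-period shift turns the sine column into minus the cosine column.
all.equal(shifted$sin, -plain$cos)
```

Because a shifted sine is a linear combination of the unshifted sin and cos columns, a linear model can absorb the phase on its own; an explicit shift mainly matters for learners that do not combine the two columns linearly.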



jkennel commented Jun 18, 2021

Thanks! I will document and test to have a PR for next week.


jkennel commented Jun 24, 2021

I have added a pull request. I went with the name step_harmonic as the technique may be applied to values other than time (distance is another common application). For @lindeloev's example you would be able to do the following.

library(recipes)
#> Loading required package: dplyr
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>     filter, lag
#> The following objects are masked from 'package:base':
#>     intersect, setdiff, setequal, union
#> Attaching package: 'recipes'
#> The following object is masked from 'package:stats':
#>     step

# prepare data ------------------------------------------------------------
df = data.frame(
  timestamps = seq(as.POSIXct("2021-02-01", tz = 'UTC'), 
                   as.POSIXct("2021-05-01", tz = 'UTC'), 
                   length.out = 300)
)

secs_per_month = 30 * 24 * 60 * 60 # month seconds
# signal to fit begins at zero to simplify phase comparison
timestamps_secs = as.numeric(df$timestamps) - as.numeric(df$timestamps)[1]
amp <- 22   # amplitude
sh  <- pi/3 # phase shift
df$output <- amp * sin(timestamps_secs * 2 * pi / secs_per_month + sh)

# recipe ------------------------------------------------------------------
rec_harm <- recipe(output~timestamps, df) %>%
  step_harmonic(timestamps,
                period = secs_per_month,
                cycle_unit = 'second') %>%
  prep() %>%
  bake(new_data = NULL)

# model -------------------------------------------------------------------
fit <- lm(output~timestamps_sin_p_2592000 + timestamps_cos_p_2592000 - 1,
          data = rec_harm)
co  <- coefficients(fit)

sqrt(sum(co^2))          # 22 amplitude
#> [1] 22
atan2(co[[2]], co[[1]])  # pi/3 phase shift relative to first timestamp
#> [1] 1.047198

Created on 2021-06-24 by the reprex package (v2.0.0)
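
The amplitude and phase recovery above rests on the identity A·sin(x + φ) = (A·cos φ)·sin(x) + (A·sin φ)·cos(x): the fitted sine coefficient estimates A·cos φ and the cosine coefficient estimates A·sin φ. A minimal base-R sketch of the same check, building the two harmonic columns by hand rather than through a recipe step (the variable names here are illustrative):

```r
# Time in seconds from the first observation, spanning 89 days.
secs_per_month <- 30 * 24 * 60 * 60
secs <- seq(0, 89 * 24 * 60 * 60, length.out = 300)

amp <- 22      # amplitude
sh  <- pi / 3  # phase shift
y <- amp * sin(secs * 2 * pi / secs_per_month + sh)

# Harmonic features built by hand.
x_sin <- sin(secs * 2 * pi / secs_per_month)
x_cos <- cos(secs * 2 * pi / secs_per_month)

# No intercept: the signal is a pure linear combination of the two columns.
fit <- lm(y ~ x_sin + x_cos - 1)
co  <- coef(fit)

amp_hat <- sqrt(sum(co^2))          # estimated amplitude
sh_hat  <- atan2(co[[2]], co[[1]])  # estimated phase shift
```

With noise-free data the estimates should match `amp` and `sh` up to floating-point error; with real data they are least-squares estimates.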


Thank you @jkennel for implementing this step! I'm gonna close this issue since this step got merged in this PR #735


This issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with a reprex and link to this issue).

@github-actions github-actions bot locked and limited conversation to collaborators Sep 23, 2021