# Lab 11: Factors, dates and times




In [None]:
library(tidyverse)
library(nycflights13)
library(lubridate)

## Factors basics

Throughout this course we have encountered factor variables, but have yet to explore what a `factor` is in R. Factors are used to process categorical data in R. On the surface they look like character variables, but they have additional properties that are convenient when performing data analysis.

In this section of the lab we will cover how to construct and modify factors.

### Creating factors

Imagine that we have the following responses to a particular question in a student evaluation:

In [None]:
response = c("Agree", "Agree", "Strongly Agree", "Disagree", "Agree")

In [None]:
response
class(response)

Currently, `response` is a character vector. Some operations in R will not treat it the way our interpretation of the values suggests. For example, sorting the vector orders it alphabetically, though we may want to sort it by how strongly the student agrees with the statement.

In [None]:
sort(response)

Using `factor` we can add additional structure to the object that better matches our understanding of the variable's values.

In [None]:
response_levels  <- c("Strongly Disagree", "Disagree", "Neutral", "Agree", "Strongly Agree")
response_fct  <- factor(response, levels = response_levels)

Now when we sort `response_fct`, it is ordered according to the strength of the statement.

In [None]:
sort(response_fct)

Note that the sort order depends on the order in which the levels are passed to `factor`.

In [None]:
response_levels  <- c("Strongly Agree", "Disagree", "Neutral", "Agree", "Strongly Disagree")
response_fct  <- factor(response, levels = response_levels)
sort(response_fct)

This added structure is useful when analyzing and visualizing categorical data, because results and plots follow the order we consider natural for the variable.
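One way to see the extra structure, sketched here with base R only: `table()` reports every level of a factor, including levels that no student chose, while a character vector only reports the values that occur. The names `resp`, `resp_levels`, and `resp_fct` are illustrative and kept separate from the lab's variables.

```r
# Illustrative copies of the survey responses and their natural order
resp <- c("Agree", "Agree", "Strongly Agree", "Disagree", "Agree")
resp_levels <- c("Strongly Disagree", "Disagree", "Neutral", "Agree", "Strongly Agree")
resp_fct <- factor(resp, levels = resp_levels)

# Character vector: only the three observed values, in alphabetical order
chr_counts <- table(resp)
chr_counts

# Factor: all five levels in the specified order, zero counts included
fct_counts <- table(resp_fct)
fct_counts
```

The zero counts are what let a bar plot or summary table show that nobody answered "Neutral" or "Strongly Disagree".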

### A deeper look at factors

Below we create a variable that represents the decades 1980-2010.

In [None]:
decades   <- c(1980, 1990, 2000, 2010)
class(decades)

We now create a new factor variable based on the numeric `decades` variable and then observe what happens when we try to convert it back to a numeric variable.

In [None]:
decades_fct <- factor(decades)
decades_2  <- as.numeric(decades_fct)


decades
decades_fct
decades_2

Using `as.numeric` does not return the original numeric data. Instead it returns the integer codes that the factor stores internally: 1980 is mapped to 1, 1990 is mapped to 2, and so on. Each of the codes 1, 2, 3, and 4 is associated with one of the `levels` '1980', '1990', '2000', and '2010'.

Using the function `levels` we can access the levels of a factor variable. Note that this returns a character variable.

In [None]:
levels(decades_fct)
class(levels(decades_fct))
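Because the levels are stored as text, the original numbers can be recovered by converting through the labels rather than through the codes. A minimal base R sketch (the factor is rebuilt here so the block runs on its own):

```r
decades_fct <- factor(c(1980, 1990, 2000, 2010))

# Internal integer codes, which is what as.numeric() returns for a factor
codes <- as.integer(decades_fct)
codes

# Convert the labels, not the codes, to get the original values back
via_character <- as.numeric(as.character(decades_fct))
via_character

# Same result, converting each level once and then indexing by the codes
via_levels <- as.numeric(levels(decades_fct))[decades_fct]
via_levels
```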

##  Factor Basics Exercises 

1. If `x = c(1, 2, 3, 3, 5, 3, 2, 4, NA)`, what are the levels of `factor(x)`?


2. If `z <- factor(c("p", "q", "p", "r", "q"))` and levels of z are "p", "q" ,"r", write an R expression that will change the level "p" to "w" so that z is equal to: "w", "q" , "w", "r" , "q".

3. Let `df <- data.frame(q=c(2, 4, 6), p=c("a", "b", "c"))`. Write an R statement that will replace levels a, b, c with labels "fertiliser1", "fertiliser2", "fertiliser3".  
_hint_ : There is an additional argument to `factor` that lets you modify the labels of a factor variable.

In [None]:
# Answer to Exercise 1:
x = c(1, 2, 3, 3, 5, 3, 2, 4, NA)
factor(x)

In [None]:
# Answer to Exercise 2:
z <- factor(c("p", "q", "p", "r", "q"))
levels(z)[1] <- "w"
z

In [None]:
# Answer to Exercise 3:
df <- data.frame(q=c(2, 4, 6), p=c("a", "b", "c"))
df$p <- factor(df$p, levels=c("a", "b", "c"), labels=c("fertiliser1", "fertiliser2", "fertiliser3"))
df

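In exercise 1, `NA` does not become a level: by default `factor` leaves missing values out of the levels and prints them as `<NA>`, so they are easy to overlook. `fct` from `forcats` is a stricter alternative to `factor` that raises an error when its input is not what it expects. Note that in current versions of `forcats`, `fct` only accepts character vectors, so the call below stops with an error because `x` is numeric.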

In [None]:
fct(x)

Similarly, if a character variable contains values that are not among the specified levels, `factor` silently converts them to `NA`. To catch this, we can use `fct`, which raises an error instead.

In [None]:
response = c("Agree", "Agree", "Strongly Agree", "Disagre", "Agree")
factor(response, levels = response_levels)
fct(response, levels = response_levels)
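If an error is too strict for a given workflow, the same check can be sketched with base R before building the factor. The names `resp`, `resp_levels`, and `unmatched` are illustrative.

```r
resp <- c("Agree", "Agree", "Strongly Agree", "Disagre", "Agree")
resp_levels <- c("Strongly Disagree", "Disagree", "Neutral", "Agree", "Strongly Agree")

# Values that occur in the data but are missing from the levels
unmatched <- setdiff(resp, resp_levels)
unmatched

# factor() turns those values into NA without any warning
n_lost <- sum(is.na(factor(resp, levels = resp_levels)))
n_lost
```

Inspecting `unmatched` first makes it possible to fix typos such as "Disagre" before they silently turn into missing values.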

`forcats` (an anagram of "factors") is a tidyverse package that provides a wide range of functions for working with factors. The rest of this section explores using these functions to handle factor variables.

### Modifying factor order

Going forward we will use `forcats::gss_cat`, a sample of data from the General Social Survey, a long-running US survey conducted by the independent research organization NORC at the University of Chicago. 

In [None]:
head(gss_cat)

As we have seen in previous homeworks and labs, we often need to adjust the order of our categorical variable to clearly demonstrate a trend in the plot. 

Below we look at the average age for each religion:

In [None]:
gss_cat %>% 
    group_by(relig) %>%
    summarise(age  = mean(age, na.rm = T)) %>%
    ggplot(aes(x = relig, y = age)) +
    geom_bar(stat = "identity") + coord_flip()

For visualization purposes it is nicer to have the bars in our plot ordered by their height. The function `fct_reorder` allows us to do this. `fct_reorder()` takes three main arguments:

- `.f`, the factor whose levels you want to modify.
- `.x`, a numeric vector that you want to use to reorder the levels.
- Optionally, `.fun`, a function that’s used if there are multiple values of `.x` for each value of `.f`. 

In [None]:
gss_cat %>% 
    group_by(relig) %>%
    summarise(age  = mean(age, na.rm = T)) %>%
    ggplot(aes(x = fct_reorder(relig, age), y = age)) +
    geom_bar(stat = "identity") + coord_flip()

In [None]:
gss_cat %>% 
    ggplot(aes(x = fct_reorder(relig, age, mean, na.rm = T), y = age)) +
    geom_boxplot() + coord_flip()

Note that the default function used in `fct_reorder` is median.

In [None]:
gss_cat %>% 
    filter(!is.na(age)) %>% # we need to filter out NAs for fct_reorder to work as intended
    ggplot(aes(x = fct_reorder(relig, age), y = age)) +
    geom_boxplot() + coord_flip()
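The effect of `fct_reorder` is easier to check on a toy factor than on a plot. The sketch below uses made-up vectors `f` and `x`; each level of `f` has two values of `x`, so the default summary function (the median) decides the order.

```r
library(forcats)

f <- factor(c("a", "b", "c", "a", "b", "c"))
x <- c(3, 1, 2, 5, 1, 2)   # medians per level: a = 4, b = 1, c = 2

# Default level order is alphabetical
levels(f)

# Levels ordered by increasing median of x
asc <- levels(fct_reorder(f, x))
asc

# Levels ordered by decreasing median of x
desc <- levels(fct_reorder(f, x, .desc = TRUE))
desc
```

Only the order of the levels changes; the values stored in the factor stay the same.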

### Modifying factor levels

Often when working with categorical data, the levels are not initially in a clean and presentable form. The most common way to alter factor levels is `fct_recode`, which lets us rename each level to our liking. Levels that are not mentioned are left unchanged.

In [None]:
gss_cat %>%
  mutate(
    partyid = fct_recode(partyid,
      "Republican, strong"    = "Strong republican",
      "Republican, weak"      = "Not str republican",
      "Independent, near rep" = "Ind,near rep",
      "Independent, near dem" = "Ind,near dem",
      "Democrat, weak"        = "Not str democrat",
      "Democrat, strong"      = "Strong democrat"
    )
  ) %>%
  count(partyid)

We can use `fct_collapse` to aggregate levels of a factor variable into broader categories.

In [None]:
gss_cat %>%
  mutate(
    partyid = fct_collapse(partyid,
      "other" = c("No answer", "Don't know", "Other party"),
      "rep" = c("Strong republican", "Not str republican"),
      "ind" = c("Ind,near rep", "Independent", "Ind,near dem"),
      "dem" = c("Not str democrat", "Strong democrat")
    )
  ) %>%
  count(partyid)
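The difference between the two functions can be seen on a small made-up factor (`party` is illustrative, not part of `gss_cat`): `fct_recode` renames one level at a time, while `fct_collapse` merges several levels into one.

```r
library(forcats)

party <- factor(c("Strong democrat", "Not str democrat", "Independent", "Strong democrat"))

# fct_recode(): new name on the left, existing level on the right
recoded <- fct_recode(party, "Democrat, strong" = "Strong democrat")
levels(recoded)

# fct_collapse(): several existing levels are merged into one new level
collapsed <- fct_collapse(party, dem = c("Strong democrat", "Not str democrat"))
collapsed_counts <- table(collapsed)
collapsed_counts
```

In both calls the "Independent" level is not mentioned, so it is carried over untouched.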

The `fct_lump_*()` functions allow us to keep the most frequent categories and aggregate the smaller categories into "Other".

In [None]:
gss_cat %>%
  mutate(relig = fct_lump_lowfreq(relig)) %>%
  count(relig)

In [None]:
gss_cat %>%
  mutate(relig = fct_lump_n(relig, n = 5)) %>%
  count(relig, sort = TRUE)
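A toy factor makes the lumping rule easier to follow. In this sketch (the vector `x` is made up), `fct_lump_n` keeps the two most frequent levels and gathers the rest into "Other".

```r
library(forcats)

x <- factor(c("a", "a", "a", "b", "b", "c", "d"))

# Keep the 2 most frequent levels; "c" and "d" are lumped together
lumped <- fct_lump_n(x, n = 2)
lumped_counts <- table(lumped)
lumped_counts

# The name of the catch-all level can be changed with other_level
renamed <- fct_lump_n(x, n = 2, other_level = "Rare")
levels(renamed)
```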

## Factor Exercises

1.) Create a categorical variable of the following age brackets: 18-30, 30-40, 40-50, 50-65, 65 + and make a bar plot of the mean tv hours across age groups in increasing order.

In [None]:
#@title Exercise 1 Answer

gss_cat %>%
    mutate(age_cat = cut(age, c(18, 30, 40, 50, 65, 90), include.lowest = TRUE, right = FALSE)) %>% 
    filter(!is.na(age_cat)) %>%
    group_by(age_cat) %>%
    summarise(tvhours = mean(tvhours, na.rm = T)) %>%
    ggplot(aes(x = fct_reorder(age_cat, tvhours), y = tvhours)) + geom_col()

2.) Change `rincome` to only have the levels `"$20000 or more"`, `"$10000 - 19999"`, `"$5000 to 9999"`, `"0 to 4999"`, and `"Unknown"`.

In [None]:
#@title Exercise 2 Answer

gss_cat %>%
  mutate(
    rincome = fct_collapse(rincome,
      "$20000 or more" = c("$25000 or more", "$20000 - 24999"),
      "$10000 - 19999" = c("$15000 - 19999", "$10000 - 14999"),
      "$5000 to 9999"  = c("$8000 to 9999",  "$7000 to 7999",  "$6000 to 6999", "$5000 to 5999"),
      "0 to 4999"      = c("$4000 to 4999",  "$3000 to 3999",  "$1000 to 2999", "Lt $1000"),
      "Unknown"        = c("No answer", "Don't know", "Refused", "Not applicable")
    )
  ) %>%
  count(rincome)

## Dates and times


Now we turn our attention to working with dates and times in R. For this portion of the lab we will focus on the `lubridate` package.

### Creating date/times

There are three types of date/time data that refer to an instant in time:

- A **date**. Tibbles print this as `<date>`.
- A **time**. Tibbles print this as `<time>`.
- A **date-time** is a date plus a time: it uniquely identifies an instant in time (typically to the nearest second). Tibbles print this as `<dttm>`. Base R calls these POSIXct.

We will focus on dates and date-times.


The first thing to know about working with date-times in R (and other languages) is the date-time format. These formats are standard across many programming languages and describe each date component using a `%` followed by a single character.

There are many date-time format components, but only a few are truly necessary to know. They are described in this table:

| Type  | Code | Meaning           | Example  |
|-------|------|-------------------|----------|
| Year  | %Y   | 4 digit year      | 2021     |
|       | %y   | 2 digit year      | 21       |
| Month | %m   | Number            | 2        |
|       | %b   | Abbreviated name  | Feb      |
|       | %B   | Full name         | February |
| Day   | %d   | Two digits        | 02       |
|       | %e   | One or two digits | 2        |
| Time  | %H   | 24-hour hour      | 13       |
|       | %M   | Minutes           | 35       |
|       | %S   | Seconds           | 45       |




For example, `%Y-%m-%d` specifies a date written as a four-digit year, a dash, the month as a number, a dash, and the day, such as `2023-02-13`.
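The same codes work in both directions, which can be sketched with base R alone: `format()` turns a date into text and `as.Date()` reads text back into a date.

```r
d <- as.Date("2023-02-13")

# Date -> string: each %-code is replaced by the matching component
iso   <- format(d, "%Y-%m-%d")
short <- format(d, "%d/%m/%y")
iso
short

# String -> Date: the same codes describe how to read the text
parsed <- as.Date("13/02/23", format = "%d/%m/%y")
parsed
```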

`lubridate` provides functions that automatically detect these formats, so we typically do not need to specify them directly. One scenario where you might need to specify the date-time format manually is when reading data into R, as shown in the toy example below.


In [None]:
csv <- "
  date
  01/02/15
"

read_csv(csv, col_types = cols(date = col_date("%m/%d/%y")))

We must be careful as there are multiple ways we could interpret this date value depending on how we specify the format:

In [None]:
read_csv(csv, col_types = cols(date = col_date("%d/%m/%y")))
read_csv(csv, col_types = cols(date = col_date("%y/%m/%d")))

We typically create date-time variables from strings. The package `lubridate` provides a convenient set of functions that determine the date-time format automatically. To use them, identify the order in which year, month, and day appear in your dates, then arrange "y", "m", and "d" in the same order. That sequence of "y", "m", and "d" gives the name of the function.

See the example below. Note that the specific format is not important as long as we know the date components come in the order year, month, day.

In [None]:
ymd("2020-01-31")
ymd("2020/01/31")
ymd("20200131")

Below we see some examples with different orders of the date components and different date formats.

In [None]:
mdy("January 31st, 2017")
dmy("31-Jan-2017")
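A few further behaviors are worth knowing, sketched below: these parsers also accept unquoted numbers, they cope with different separators inside one vector, and a date that does not exist becomes `NA` with a warning (suppressed here only to keep the output short).

```r
library(lubridate)

# Unquoted numbers are accepted
from_number <- ymd(20200131)
from_number

# Separators may differ within a single vector
mixed <- ymd(c("2020-01-31", "2020/02/01", "20200203"))
mixed

# February 30th does not exist, so parsing fails and returns NA
bad <- suppressWarnings(ymd("2023-02-30"))
is.na(bad)
```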

To create date-time variables, add an underscore to the name of one of the functions above, followed by one or more of "h", "m", and "s", depending on whether hours, minutes, or seconds appear in the string.

In [None]:
ymd_hms("2017-01-31 20:11:59")

`lubridate` also provides functions for "rounding" dates.

In [None]:
floor_date(ymd_hms("2017-01-31 20:11:59"), "minute")
floor_date(ymd_hms("2017-01-31 20:11:59"), "hour")
floor_date(ymd_hms("2017-01-31 20:11:59"), "day")
floor_date(ymd_hms("2017-01-31 20:11:59"), "month")
floor_date(ymd_hms("2017-01-31 20:11:59"), "year")
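`floor_date` always rounds down. Its companions `round_date` and `ceiling_date` round to the nearest unit and up to the next unit boundary, as this short sketch with the same timestamp shows.

```r
library(lubridate)

x <- ymd_hms("2017-01-31 20:11:59")

nearest_hour <- round_date(x, "hour")      # 20:11:59 is closer to 20:00 than to 21:00
next_hour    <- ceiling_date(x, "hour")    # next hour boundary
next_month   <- ceiling_date(x, "month")   # start of the following month

nearest_hour
next_hour
next_month
```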

We can also build date variables from individual date components.

In [None]:
make_datetime(2013, 3, 13, 10, 30)

If needed, we can specify the time zone (this also applies to other `lubridate` functions that build date-time variables).

In [None]:
make_datetime(2013, 3, 13, 10, 30, tz = "EST")

Let's look at an example of creating date-time variables using the flights dataset from `nycflights13`.

Here we have numeric variables that represent the year, month, day, hour, and minute of the flight.

In [None]:
flights %>% select(year, month, day, hour, minute, sched_dep_time) %>% head()

In [None]:
flights %>% 
    select(year, month, day, hour, minute, time_hour) %>%
    mutate(departure = make_datetime(year, month, day, hour, minute)) %>% head()

We can also easily plot date-times with `ggplot`. Note that for histograms of date-times the unit for `binwidth` is seconds, so in the following plot a binwidth of 600 represents 10 minutes.

In [None]:
flights %>% 
    select(year, month, day, hour, minute, time_hour) %>%
    mutate(departure = make_datetime(year, month, day, hour, minute)) %>% 
    filter(day == 2, month==1) %>%
    ggplot(aes(x = departure)) + 
    geom_histogram(binwidth = 600) # 600 s = 10 minutes

### Unix epoch

Sometimes dates and times are represented as numeric offsets from the Unix epoch, 1970-01-01. Such a value counts either days or seconds since 1970-01-01. If the value counts days, use `as_date` to get a date variable; if it counts seconds, use `as_datetime` to get a date-time variable.

In [None]:
as_date(365)

In [None]:
as_datetime(60)
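The conversion also works in the other direction. This base R sketch strips the class from a date and a date-time to show the numbers stored underneath.

```r
# Dates are stored as days since 1970-01-01
days_since_epoch <- as.numeric(as.Date("1971-01-01"))
days_since_epoch

# Date-times are stored as seconds since 1970-01-01 00:00:00 UTC
secs_since_epoch <- as.numeric(as.POSIXct("1970-01-01 00:01:00", tz = "UTC"))
secs_since_epoch
```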

### Getting components 

Sometimes we start with a date-time variable and want to work with specific date-time components. `lubridate` also provides functions for obtaining these components.

In [None]:
datetime <- ymd_hms("2026-07-08 12:34:56")


year(datetime)
month(datetime)
mday(datetime)
yday(datetime)
wday(datetime)

We have the option of extracting date components as a `factor` variable.

In [None]:
month(datetime, label = TRUE)
wday(datetime, label = TRUE, abbr = FALSE)

Using these functions we can also modify datetimes in the following manner:

In [None]:
datetime

year(datetime) <- 2030
month(datetime) <- 01


hour(datetime) <- hour(datetime) + 1

datetime
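As an alternative to modifying a variable in place, `update()` returns a new date-time with several components changed at once. The sketch below uses its own variable `dt` so it does not interfere with `datetime` above.

```r
library(lubridate)

dt <- ymd_hms("2026-07-08 12:34:56")

# Returns a modified copy; dt itself is left unchanged
dt2 <- update(dt, year = 2030, month = 2, mday = 2, hour = 2)
dt2

# Values that are too large roll over into the next unit
rolled <- update(ymd("2023-02-01"), mday = 30)
rolled
```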

###  Time Spans 

We may also want to do arithmetic with date-time variables. `lubridate` represents time spans with three classes:

- **Durations**, which represent an exact number of seconds.
- **Periods**, which represent human units like weeks and months.
- **Intervals**, which represent a starting and ending point.

Subtracting two date-times yields a difftime class object which records a time span of seconds, minutes, hours, days, or weeks. 

In [None]:
days_in_23 = today() - ymd("2023-01-01")
days_in_23

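A difftime's units change with the size of the span, which can be awkward to work with. `lubridate` provides the **duration** class, which always stores the span in seconds for consistency.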

In [None]:
as.duration(days_in_23)

There are a number of functions that help us work with durations:

In [None]:
dminutes(1)
ddays(0:5)

dhours(10) + ddays(1:3)

These functions allow us to conveniently do arithmetic with date-times:

In [None]:
ymd_hms("2026-07-08 12:34:56") + dhours(2)

ymd_hms("2026-07-08 12:34:56") - dweeks(2)

ymd_hms("2026-07-08 12:34:56") - ddays(1:3)
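Since every duration is just a number of seconds, durations built from different units can be compared directly. A minimal sketch:

```r
library(lubridate)

# Converting a duration to a plain number gives its length in seconds
secs_per_minute <- as.numeric(dminutes(1))
secs_per_day    <- as.numeric(ddays(1))
secs_per_minute
secs_per_day

# One day and 24 hours are the same number of seconds
same_span <- ddays(1) == dhours(24)
same_span
```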

**Periods** are time spans that work with human units such as days and months. As with **durations**, `lubridate` provides functions that let us build periods and combine them with arithmetic.

In [None]:
hours(c(12, 24))
months(1:4)

In [None]:
years(1) + months(6)

years(1) + months(4)

As with **durations**, we can use **periods** to modify date-time variables:

In [None]:
today() + months(6)
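One pitfall with month arithmetic is that the resulting day may not exist. In the sketch below, adding one month to January 31st gives `NA`, while the `%m+%` operator from `lubridate` rolls back to the last valid day of the month.

```r
library(lubridate)

jan31 <- ymd("2023-01-31")

# 2023-02-31 does not exist, so the result is NA
plain <- jan31 + months(1)
plain

# %m+% rolls back to the last day of February
rolled_back <- jan31 %m+% months(1)
rolled_back
```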

**Periods** are useful when date-time arithmetic has to respect calendar effects such as daylight saving time or leap years.

Below, the **duration** adds exactly 86,400 seconds, so the clock time shifts by an hour across the start of daylight saving time, while the **period** keeps the same clock time on the next day.

In [None]:
ymd_hms("2023-03-11 03:00:00", tz = "America/New_York") + ddays(1)

ymd_hms("2023-03-11 03:00:00", tz = "America/New_York") + days(1)

The lengths of years and days vary depending on the specific date. Some years have 365 days while others have 366, and a day is an hour shorter or longer when daylight saving time starts or ends. **Durations** therefore use fixed lengths: `ddays(1)` is always 86,400 seconds, and `dyears(1)` is the length of an average year, 365.25 days. **Periods** have no fixed length, so dividing one period by another only gives an estimate. We can see this below.

In [None]:
dyears(1) / ddays(365)
years(1) / days(1)
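A leap year shows the difference in practice. In this sketch, adding a duration of one average year to the start of 2024 falls short of the next New Year, while adding a period of one year lands on the same calendar date.

```r
library(lubridate)

start <- ymd_hms("2024-01-01 00:00:00")   # 2024 is a leap year with 366 days

# Duration: exactly 365.25 days later
with_duration <- start + dyears(1)
with_duration

# Period: same calendar date in the following year
with_period <- start + years(1)
with_period
```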

**Durations** and **periods** do not tell us how long a particular span of calendar time actually lasts. For that we can use **intervals**, which have a fixed start and end and so give the exact length of a specific span.
You can create an interval by writing `start %--% end`:

In [None]:
y2023 <- ymd("2023-01-01") %--% ymd("2024-01-01")
y2023

In [None]:
y2023 / days(1)
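Repeating the calculation for a leap year confirms that intervals account for the actual calendar. The sketch also shows two related helpers, `time_length()` and `%within%`; the name `y2024` is illustrative.

```r
library(lubridate)

y2024 <- ymd("2024-01-01") %--% ymd("2025-01-01")

# 2024 is a leap year, so the interval spans 366 days
n_days <- y2024 / days(1)
n_days

# time_length() gives the same answer for a chosen unit
n_days_tl <- time_length(y2024, "day")
n_days_tl

# %within% tests whether a date falls inside the interval
leap_day_inside <- ymd("2024-02-29") %within% y2024
leap_day_inside
```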

## Dates and Times Exercises

1.) Plot the average departure delay per hour for each day of the week.

2.) Convert `arr_time` and `dep_time` to date-time variables. Calculate their difference and 
compare the result to air_time. Compare this to taking the difference of `arr_time` and `dep_time`  directly.

_hint_ 1: You'll probably need to consider the different time zones between destination and origin. How can we incorporate the airports dataset to account for this?   
_hint_ 2: Note that `dep_time` and `arr_time` are in HHMM format (e.g. 513 is 5 hours and 13 minutes).   
_hint_ 3: The operator `%/%` is integer division and `%%` returns the remainder from division, so `513 %/% 100` returns 5 and `513 %% 100` returns 13. Use these operators with your knowledge of `dep_time` and `arr_time` to construct a date-time variable.
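The arithmetic in hint 3 can be tried on a few made-up HHMM values before applying it to the flights data. The vector `hhmm` below is illustrative.

```r
hhmm <- c(513, 1545, 5)    # 5:13, 15:45 and 0:05 in HHMM form

hh <- hhmm %/% 100         # integer division keeps the hours
mm <- hhmm %% 100          # the remainder keeps the minutes
hh
mm

# Minutes since midnight, a convenient form for further arithmetic
minutes_since_midnight <- hh * 60 + mm
minutes_since_midnight
```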

In [None]:
#@title Exercise 1 Answer
flights %>%
    mutate(weekday = wday(time_hour, label = T)) %>% 
    group_by(weekday, hour) %>% 
    summarise(dep_delay = mean(dep_delay, na.rm=T) , .groups = 'drop') %>% 
    ggplot(aes(x = hour, y = dep_delay)) + 
    geom_col() + facet_wrap(~weekday)

In [None]:
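#@title Exercise 2 Answer
# A date-time vector carries a single time zone, so build each clock time in UTC
# first and then use force_tzs() to reinterpret it in each airport's own zone.
flights %>%
    inner_join(select(airports, faa, origin_tz = tzone),
               by = c("origin" = "faa")) %>%
    inner_join(select(airports, faa, dest_tz = tzone),
               by = c("dest" = "faa")) %>%
    # drop cancelled flights and airports without a recorded time zone
    filter(!is.na(dep_time), !is.na(arr_time), !is.na(origin_tz), !is.na(dest_tz)) %>%
    mutate(
        departure = force_tzs(make_datetime(year, month, day,
                                            hour = dep_time %/% 100,
                                            min = dep_time %% 100),
                              tzones = origin_tz),
        arrival = force_tzs(make_datetime(year, month, day,
                                          hour = arr_time %/% 100,
                                          min = arr_time %% 100),
                            tzones = dest_tz),
        # elapsed hours; negative when the flight lands after midnight,
        # because the arrival is built from the departure date
        flight_duration = as.numeric(difftime(arrival, departure, units = "hours")),
        flight_duration_no_tz = arr_time - dep_time, # naive HHMM difference, not a real duration
        air_time = air_time / 60 # minutes to hours
    ) %>%
    select(flight_duration, flight_duration_no_tz, air_time, departure, arrival,
           dep_time, arr_time, origin, dest) %>%
    head(20)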
