In [None]:
library(data.table)
library(nycflights13)
library(DT)

In [None]:
options(repr.matrix.max.rows=20, repr.matrix.max.cols=30) # limit the number of rows and columns shown when tables are printed

In this session, we will cover the basic features of data.table.

Important: Do not confuse the data.table package, which is used for data wrangling, with the datatable() function from the DT package, which creates interactive widgets for viewing data.

# Datasets

For each dataset, you can open its help page, preview its structure and first rows, and browse it interactively.

## airlines

In [None]:
#?airlines

In [None]:
head(airlines)

In [None]:
str(airlines)

In [None]:
datatable(airlines, filter = "top")

## airports

In [None]:
#?airports

In [None]:
head(airports)

In [None]:
str(airports)

In [None]:
datatable(airports, filter = "top")

## planes

In [None]:
#?planes

In [None]:
head(planes)

In [None]:
str(planes)

In [None]:
datatable(planes, filter = "top")

## weather

In [None]:
#?weather

In [None]:
head(weather)

In [None]:
str(weather)

In [None]:
#datatable(weather, filter = "top")

## flights

In [None]:
#?flights

In [None]:
head(flights)

In [None]:
str(flights)

In [None]:
#datatable(flights, filter = "top")

In [None]:
class(flights)

In [None]:
attributes(flights)

# convert data.frames to data.tables

In [None]:
airlines <- copy(airlines)
setDT(airlines)

In [None]:
airports <- copy(airports)
setDT(airports)

In [None]:
planes <- copy(planes)
setDT(planes)

In [None]:
weather <- copy(weather)
setDT(weather)

In [None]:
class(flights)

In [None]:
flights <- copy(flights)
setDT(flights)

In [None]:
class(flights)

# basic operations

DT[i, j, by]

## i: filtering rows

- You don't have to put an additional "," if you just want to filter rows
- You don't have to repeat the object name with flights$xxx to refer to the columns
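
As a minimal sketch of these two points, the toy table below (its column values are made up for illustration) is filtered once in data.frame style and once in data.table style. Both calls should select the same rows.

```r
library(data.table)

# Toy table; the values are invented for illustration only
toy <- data.table(id = 1:5,
                  distance = c(100, 500, 300, 800, 200),
                  dep_delay = c(30, 10, 25, 50, 5))

# data.frame style: repeat the object name and keep the trailing comma
base_style <- toy[toy$distance < 400 & toy$dep_delay > 20, ]

# data.table style: bare column names, no trailing comma
dt_style <- toy[distance < 400 & dep_delay > 20]

dt_style$id                       # 1 3
all.equal(base_style, dt_style)   # TRUE
```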

In [None]:
flights[distance < 400 & dep_delay > 20]

In [None]:
flights[!hour %between% c(7, 24)]

## j: column operations

You don't have to use an `i` filter to use column operations

If you want to take all rows, leave `i` empty and keep the leading "," inside the brackets as such:

DT[, ...]

Column names:

In [None]:
names(flights)

Extract a single column as a vector:

In [None]:
flights[, head(carrier)]

Combine filter and column operations

In [None]:
flights[distance < 400 & dep_delay > 20 & !hour %between% c(7, 24) & air_time < 35,
       carrier]

Extract a single column as a data.table

`.()` is an alias for `list()`
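
A small sketch of the difference, using a made-up toy table: a bare column name in `j` returns a vector, while wrapping it in `.()` or `list()` returns a data.table.

```r
library(data.table)

# Toy table; the values are invented for illustration only
toy <- data.table(carrier = c("AA", "UA", "DL"), air_time = c(30, 45, 28))

as_vector <- toy[, carrier]         # bare name: a character vector
as_table  <- toy[, .(carrier)]      # wrapped in .(): a one-column data.table
as_table2 <- toy[, list(carrier)]   # list() gives the same result as .()

is.character(as_vector)             # TRUE
is.data.table(as_table)             # TRUE
all.equal(as_table, as_table2)      # TRUE
```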

In [None]:
flights[distance < 400 & dep_delay > 20 & !hour %between% c(7, 24) & air_time < 35, .(carrier)]

Extract multiple columns as data.table

In [None]:
flights[distance < 400 & dep_delay > 20 & !hour %between% c(7, 24) & air_time < 35,
        .(carrier, origin, dest)]

Return a new table with calculated columns that summarize the data

In [None]:
flights[distance < 400 & dep_delay > 20 & !hour %between% c(7, 24) & air_time < 35,
        .(max_at = max(air_time), min_at = min(air_time))]

Return a new table with calculated columns that have one value per row

For example, the numeric value 517 in dep_time represents 5:17, so we split it into an hour column and a minute column

In [None]:
flights[distance < 400 & dep_delay > 20 & !hour %between% c(7, 24) & air_time < 35,
        .(dep_hour = dep_time %/% 100, dep_min = dep_time %% 100)]

Calculate new columns, with a value for each row, and add to the dataset

Note the `:=` operator for in-place modification

We can do that for adding a new column or modifying an existing column
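
To see what "in place" means, here is a sketch on a made-up toy table. The names `alias` and `snapshot` are hypothetical: `alias` is a plain assignment and points to the same table, while `copy()` creates an independent one.

```r
library(data.table)

# Toy table; the values are invented for illustration only
toy <- data.table(dep_time = c(517L, 1342L, 2359L))

alias    <- toy         # not a copy: both names refer to the same table
snapshot <- copy(toy)   # an independent copy

# Add a new column
toy[, dep_hour := dep_time %/% 100L]

# Modify an existing column: convert HHMM to minutes since midnight
toy[, dep_time := (dep_time %/% 100L) * 60L + dep_time %% 100L]

names(alias)      # "dep_time" "dep_hour": the alias sees both changes
names(snapshot)   # "dep_time": the copy is untouched
```

This is also why the datasets above were passed through `copy()` before `setDT()`.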

In [None]:
flights[distance < 400 & dep_delay > 20 & !hour %between% c(7, 24) & air_time < 35,
        dep_hour := dep_time %/% 100]

In [None]:
flights[distance < 400 & dep_delay > 20 & !hour %between% c(7, 24) & air_time < 35,
        dep_min := dep_time %% 100]

For the rows excluded by the filter, the new columns hold NA values
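
A minimal sketch on a made-up toy table: only the rows that pass the `i` filter get a value, the rest are filled with NA.

```r
library(data.table)

# Toy table; the values are invented for illustration only
toy <- data.table(dep_time = c(517L, 1342L, 2359L),
                  distance = c(200, 900, 350))

# Assign only where distance < 400 (rows 1 and 3)
toy[distance < 400, dep_min := dep_time %% 100L]

toy$dep_min   # 17 NA 59
```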

In [None]:
flights[!is.na(dep_min), .(dep_time, dep_hour, dep_min)]

In [None]:
str(flights)

Delete existing columns:

In [None]:
flights[, dep_min := NULL]
flights[, dep_hour := NULL]

Calculate and add multiple columns in a single step.

Note the wrapping () in the LHS and .() in the RHS
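
As a sketch on made-up toy tables, the wrapping `()` matters most when the column names are stored in a variable (`new_cols` below is a hypothetical name). The functional form of `:=` is an alternative that puts each name next to its expression.

```r
library(data.table)

# Form 1: names held in a variable, wrapped in () on the LHS
a <- data.table(dep_time = c(517L, 1342L))
new_cols <- c("dep_hour", "dep_min")   # hypothetical variable name
a[, (new_cols) := .(dep_time %/% 100L, dep_time %% 100L)]

# Form 2: functional form, one name = expression pair per column
b <- data.table(dep_time = c(517L, 1342L))
b[, `:=`(dep_hour = dep_time %/% 100L, dep_min = dep_time %% 100L)]

all.equal(a, b)   # TRUE
```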

In [None]:
flights[distance < 400 & dep_delay > 20 & !hour %between% c(7, 24) & air_time < 35,
        (c("dep_hour", "dep_min")) := .(dep_time %/% 100, dep_time %% 100)]

In [None]:
flights[!is.na(dep_min), .(dep_time, dep_hour, dep_min)]

Add summarizing columns; the summaries are calculated from the filtered rows only

In [None]:
flights[distance < 400 & dep_delay > 20 & !hour %between% c(7, 24) & air_time < 35,
        (c("max_at", "min_at", "av_at")) := .(max(air_time), min(air_time), mean(air_time))]

Note that a summarized column holds the same value in every row it is assigned to

In [None]:
flights[distance < 400 & dep_delay > 20 & !hour %between% c(7, 24) & air_time < 35,
        .(max_at, min_at, av_at)]

We can compute interim values without saving them, and assign or return only the last expression, by wrapping multiple statements in curly braces `{}`. Statements are separated by line breaks or by semicolons `;`, so the trailing semicolons in the cell below are optional
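
A small sketch on a made-up toy table, with the statements separated by line breaks only. The interim names `m` and `s` are hypothetical and are not stored as columns.

```r
library(data.table)

# Toy table; the values are invented for illustration only
toy <- data.table(air_time = c(10, 20, 30, NA))

toy[, z := {
  m <- mean(air_time, na.rm = TRUE)   # 20, interim value
  s <- sd(air_time, na.rm = TRUE)     # 10, interim value
  (air_time - m) / s                  # last expression: this is assigned to z
}]

toy$z        # -1  0  1 NA
names(toy)   # "air_time" "z"
```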

In [None]:
flights[, z_at := { std_at <- sd(air_time, na.rm = T);
                    av_at <- mean(air_time, na.rm = T);
                    (air_time - av_at) / std_at
                   }
                   ]

In [None]:
flights[, summary(z_at)]

In [None]:
flights[, mean(z_at, na.rm = T)]

## by operations

Repeat the operations in `j` for each unique value of the selected column(s) and return the results as a new table
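
A minimal sketch on a made-up toy table, small enough to check by hand. The first call returns one row per group, the second writes the group value back to every row of that group.

```r
library(data.table)

# Toy table; the values are invented for illustration only
toy <- data.table(origin = c("JFK", "JFK", "LGA", "LGA", "LGA"),
                  air_time = c(100, 200, 50, 70, 90))

# One summary row per origin, in order of first appearance
by_origin <- toy[, .(max_at = max(air_time), av_at = mean(air_time)),
                 by = origin]
by_origin$max_at   # 200  90
by_origin$av_at    # 150  70

# Same calculation added back as a column, repeated within each group
toy[, av_at_o := mean(air_time), by = origin]
toy$av_at_o        # 150 150  70  70  70
```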

In [None]:
flights[, .(max_at_o = max(air_time, na.rm = T), min_at_o = min(air_time, na.rm = T), av_at_o = mean(air_time, na.rm = T)), by = origin]

In [None]:
flights[, .(max_at_oc = max(air_time, na.rm = T), min_at_oc = min(air_time, na.rm = T), av_at_oc = mean(air_time, na.rm = T)),
        by = c("origin", "carrier")]

Or we can add the columns back to the dataset; the values are calculated separately for each unique value of the selected column(s)

Note that line breaks are arbitrary, but as a style guideline, try to keep lines short for readability

In [None]:
flights[, (c("max_at_oc", "min_at_oc", "av_at_oc")) := .(max(air_time, na.rm = T),
                                                         min(air_time, na.rm = T),
                                                         mean(air_time, na.rm = T)),
        by = c("origin", "carrier")]

In [None]:
flights[seq(1e4, 1e5, 1e4), .(origin, carrier, max_at_oc, min_at_oc, av_at_oc)]

# chaining

We can perform successive operations by chaining square brackets

Note that in-place modification is quiet: it prints nothing. To print the final result, end the chain with an empty `[]`
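
A short sketch of a chain on a made-up toy table. Each bracket works on the result of the previous one, so the second step can already use the `av` column created by the first.

```r
library(data.table)

# Toy table; the values are invented for illustration only
toy <- data.table(x = c(1, 2, 3, 6))

# Step 1 adds the mean, step 2 uses it, the final [] makes the result print
toy[, av := mean(x)][, dev := x - av][]

toy$dev   # -2 -1  0  3
```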

In [None]:
flights[, (c("std_at", "av_at")) := .(sd(air_time, na.rm = T),
                                      mean(air_time, na.rm = T))][,
    z_at := (air_time - av_at) / std_at][]

# ordering and naming

## setorder

setorder sorts the rows by the given columns and reorders the table in place, without assignment
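
A minimal sketch on a made-up toy table. Prefixing a column with `-` sorts it in descending order, and no assignment is needed because the table itself is reordered.

```r
library(data.table)

# Toy table; the values are invented for illustration only
toy <- data.table(carrier = c("UA", "AA", "UA", "AA"),
                  delay = c(5, 20, 15, 10))

# carrier ascending, then delay descending within each carrier
setorder(toy, carrier, -delay)

toy$carrier   # "AA" "AA" "UA" "UA"
toy$delay     # 20 10 15  5
```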

In [None]:
head(flights)

In [None]:
setorder(flights, carrier, time_hour)

In [None]:
setorder(flights)

## setcolorder

Changes the order of columns

In [None]:
setcolorder(flights, c("time_hour", "carrier")) 

In [None]:
flights

## setnames

Changes the names of columns

In [None]:
setnames(flights, "time_hour", "date_time")

In [None]:
flights

In [None]:
setnames(flights, "date_time", "time_hour")

In [None]:
flights

# symbols and shortcuts

## .SD

To refer to all columns or repeat operations on each column, you can use .SD

You can get the first flight (the first row) of each day:

In [None]:
flights[, .SD[1], by = c("year", "month", "day")]

Get the classes of all columns:

In [None]:
flights[, lapply(.SD, class)]

Get the ranges (min and max) of selected columns

In [None]:
flights[, .(max_at_oc, min_at_oc, av_at_oc)][, lapply(.SD, range)]

## .SDcols

We may want to repeat calculations on or refer to multiple columns but not all of them

.SDcols makes a selection of columns that .SD will refer to
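
A small sketch on a made-up toy table: `.SDcols` restricts `.SD` to `x` and `y`, so the character column `label` is left out of the calculation.

```r
library(data.table)

# Toy table; the values are invented for illustration only
toy <- data.table(g = c("a", "a", "b"),
                  x = c(1, 3, 5),
                  y = c(10, 30, 50),
                  label = c("p", "q", "r"))

# Group means of x and y only
res <- toy[, lapply(.SD, mean), by = g, .SDcols = c("x", "y")]

names(res)   # "g" "x" "y"
res$x        # 2 5
res$y        # 20 50
```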

In [None]:
flights[, lapply(.SD, range), .SDcols = c("max_at_oc", "min_at_oc", "av_at_oc")]

In [None]:
flights[, .SD[1], by = c("year", "month", "day"), .SDcols = c("dep_time", "arr_time")]

## .N

Gives the number of filtered or grouped rows

In [None]:
nrow(flights)

In [None]:
flights[, .N]

In [None]:
flights[month == 1, .N]

This returns the last row:

In [None]:
flights[.N]

Number of rows by carrier

In [None]:
flights[, .N, by = carrier]

Or along with the shares in total

In [None]:
flights[, .N, by = carrier][, nshare := N / sum(N)][]

Or create an index column that restarts for each unique combination of values in the selected columns

In [None]:
flights[, index1 := 1:.N, by = c("carrier", "origin")]

In [None]:
flights

## .I

.I holds the row indices.

There is a single running index over the rows; it does not restart within each group defined by `by`
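
Because the index does not restart per group, a common use is to find the row position of a per-group result. The sketch below uses a made-up toy table and returns, for each group, the row number that holds the largest `x`.

```r
library(data.table)

# Toy table; the values are invented for illustration only
toy <- data.table(g = c("a", "b", "a", "b"),
                  x = c(1, 9, 4, 2))

# Inside each group, .I still holds positions in the whole table
res <- toy[, .I[which.max(x)], by = g]

res$V1          # 3 2
toy[res$V1]$x   # 4 9
```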

In [None]:
flights[, index2 := .I]

In [None]:
flights[, index3 := .I, by = c("carrier", "origin")]

In [None]:
flights

In [None]:
flights[, identical(index2, index3)]

## .GRP

Unique id number for each group according to `by`
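
A minimal sketch on a made-up toy table. Groups are numbered in order of first appearance, and every row of a group gets the same id.

```r
library(data.table)

# Toy table; the values are invented for illustration only
toy <- data.table(carrier = c("AA", "UA", "AA", "DL", "UA"))

toy[, grp := .GRP, by = carrier]

toy$grp   # 1 2 1 3 2
```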

In [None]:
flights[, grp1 := .GRP, by = c("carrier", "origin")]

In [None]:
flights[, grp1[1], by = c("carrier", "origin")]

## rleid

Unique id for each run of contiguous identical values
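
A short sketch on a made-up vector. Unlike `.GRP`, a value that comes back after an interruption starts a new run and gets a new id, which is why the row order matters.

```r
library(data.table)

# Made-up sequence: "AA" appears in two separate runs
carriers <- c("AA", "AA", "UA", "AA", "AA", "DL")

ids <- rleid(carriers)
ids   # 1 1 2 3 3 4
```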

In [None]:
setorder(flights, origin, time_hour)

In [None]:
flights[, rl_co := rleid(carrier), by = origin]

In [None]:
flights[, .(time_hour, origin, carrier, rl_co)]

# reshaping

## dcast

Let's revisit the example where we calculated the summaries of air time for each origin and carrier

In [None]:
flights_at <- flights[, .(max_at_oc = max(air_time, na.rm = T), min_at_oc = min(air_time, na.rm = T), av_at_oc = mean(air_time, na.rm = T)),
        by = c("origin", "carrier")]

In [None]:
flights_at

Now we may want to see the average values for each origin and carrier in a grid form

In [None]:
flights_at_wide <- dcast(flights_at, carrier ~ origin, value.var = "av_at_oc")

In [None]:
flights_at_wide

Now let's convert this wide format back into a long format
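
A minimal round trip on a made-up toy table shows one thing to watch for: combinations missing from the long data become NA cells in the wide table, and `melt()` keeps them as rows unless `na.rm = TRUE` is set.

```r
library(data.table)

# Toy long table; UA has no LGA row. Values are invented for illustration
long <- data.table(carrier = c("AA", "AA", "UA"),
                   origin  = c("JFK", "LGA", "JFK"),
                   av      = c(100, 120, 90))

# Long to wide: one row per carrier, one column per origin
wide <- dcast(long, carrier ~ origin, value.var = "av")
names(wide)   # "carrier" "JFK" "LGA"

# Wide to long: the missing UA/LGA cell comes back as an NA row
back <- melt(wide, id.vars = "carrier",
             variable.name = "origin", value.name = "av")
nrow(back)    # 4

# Drop the NA rows to recover the original three combinations
back_no_na <- melt(wide, id.vars = "carrier",
                   variable.name = "origin", value.name = "av",
                   na.rm = TRUE)
nrow(back_no_na)   # 3
```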

In [None]:
melt(flights_at_wide, id.vars = "carrier", variable.name = "origin2", value.name = "av_time2")

# joins

## Exact merge

Merge the coordinates of origins using merge()

In [None]:
merge(flights[, .(origin, dest, distance)], airports[, .(faa, lat, lon)], by.x = "origin", by.y = "faa")

Or using the right join method: B[A, on ...]
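
The two methods differ when some rows have no match. In the sketch below (made-up tables, with a hypothetical origin code "XXX" and illustrative coordinates), `merge()` with its defaults drops the unmatched row, while `B[A, on = ...]` keeps every row of `A` and fills the missing columns with NA.

```r
library(data.table)

# Made-up tables; "XXX" has no match in B, coordinates are illustrative
A <- data.table(origin = c("JFK", "LGA", "XXX"), flight = 1:3)
B <- data.table(faa = c("JFK", "LGA"), lat = c(40.64, 40.78))

# merge() defaults to an inner join: only matching rows are kept
inner <- merge(A, B, by.x = "origin", by.y = "faa")
nrow(inner)   # 2

# B[A, on = ...] keeps all rows of A; the join column is named after B's column
right <- B[A, on = c(faa = "origin")]
nrow(right)                 # 3
right[faa == "XXX", lat]    # NA
```

With `merge()`, setting `all.x = TRUE` would keep the unmatched rows of the first table as well.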

In [None]:
airports[, .(faa, lat, lon)][flights[, .(origin, dest, distance)], on = c(faa = "origin")]

## roll merge

Suppose we want to join the wind speed at the time of departure for each origin

First, calculate the date time of departure by adding the minutes of the departure time, converted to seconds, to `time_hour`. Flights with a missing `dep_time` are assigned 30 minutes past the hour:

In [None]:
flights[, time_hour2 := time_hour + ifelse(is.na(dep_time), 30, dep_time) %% 100 * 60]

In [None]:
flights

Now let's do an exact merge; the great majority of rows do not match:

In [None]:
weather[, .(origin, time_hour, wind_speed)][flights[, .(time_hour2, origin, carrier, flight)],
                                            on = c("origin", time_hour = "time_hour2")]

In [None]:
weather

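Now let's match the closest wind speed reading before the flight if there is no exact match. In the result, `time_hour` carries the departure time from flights, so we keep the time of the matched reading in `time_hour_org`: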

In [None]:
weather[, .(origin, time_hour, time_hour_org = time_hour, wind_speed)][flights[, .(time_hour2, origin, carrier, flight)],
                                            on = c("origin", time_hour = "time_hour2"), roll = Inf]

Or the closest wind speed reading after the flight if there is no exact match:

In [None]:
weather[, .(origin, time_hour, wind_speed)][flights[, .(time_hour2, origin, carrier, flight)],
                                            on = c("origin", time_hour = "time_hour2"), roll = -Inf]
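
To check the direction of `roll`, here is a sketch on made-up tables that uses plain numbers (minutes) instead of date times, which is an assumption made to keep it readable. The roll is applied to the last column listed in `on`.

```r
library(data.table)

# Made-up readings at minutes 0, 60 and 120; departures at 30, 60 and 100
readings <- data.table(origin = "JFK", t = c(0, 60, 120), wind = c(5, 8, 12))
deps     <- data.table(origin = "JFK", t = c(30, 60, 100))

# roll = Inf: carry the last reading forward (reading at or before departure)
locf <- readings[deps, on = c("origin", "t"), roll = Inf]
locf$wind   # 5 8 8

# roll = -Inf: carry the next reading backward (reading at or after departure)
nocb <- readings[deps, on = c("origin", "t"), roll = -Inf]
nocb$wind   # 8 8 12
```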