# Secondary indices

Secondary indices are similar to keys in data.table, except for two major differences:

- A secondary index doesn't physically reorder the entire data.table in RAM. Instead, it only computes the order for the set of columns provided and stores that order vector in an additional attribute called index.

- There can be more than one secondary index for a data.table (as we will see below).

How can we set the column origin as a secondary index in the data.table flights?

In [1]:
library(data.table)

In [2]:
flights <- fread("flights.wiki/NYCflights14/flights14.csv")

In [3]:
setindex(flights, origin)

In [4]:
names(attributes(flights))

setindex(flights, NULL) would remove all secondary indices.

In [5]:
indices(flights)
setindex(flights, origin, dest)
indices(flights)
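
The two differences listed above, and the effect of setindex(DT, NULL), can be checked on a small table. The following is a minimal sketch that assumes a hypothetical toy table DT (not the flights data) with made-up values; only setindex(), indices() and copy() from data.table are used.

```r
library(data.table)

# Hypothetical toy table (not the flights data), small enough to inspect by eye
DT <- data.table(origin = c("LGA", "JFK", "EWR", "JFK"),
                 dest   = c("TPA", "LAX", "MIA", "SFO"))

rows_before <- copy(DT)        # snapshot of the row order before indexing
setindex(DT, origin)           # compute the order of `origin`, store it as an attribute

# The rows are still in their original order; only an attribute was added
same_order     <- identical(DT$origin, rows_before$origin)
has_index_attr <- "index" %in% names(attributes(DT))

setindex(DT, origin, dest)     # a second index can coexist with the first
idx_names <- indices(DT)

setindex(DT, NULL)             # drop every secondary index again
idx_after_removal <- indices(DT)
```

Printing same_order, idx_names and idx_after_removal shows, in turn, that the rows were not moved, that both indices are registered, and that none remain after the NULL call.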

Reordering a data.table can be expensive and not always ideal

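Consider the case where you would like to perform a fast key-based subset on the origin column for the value “JFK”. We'd do this by first setting origin as the key with setkey(flights, origin), which reorders the whole table, and then subsetting with flights["JFK"].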
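
The keyed approach can be sketched on a hypothetical toy table (the table DT and its values are assumptions made for illustration, not the flights data). It shows that setkey() physically sorts the table it is applied to, which is the cost a secondary index avoids.

```r
library(data.table)

# Hypothetical toy table standing in for flights
DT <- data.table(origin = c("LGA", "JFK", "EWR", "JFK"),
                 delay  = c(5L, -2L, 10L, 0L))

keyed <- copy(DT)        # work on a copy so DT keeps its original row order
setkey(keyed, origin)    # physically reorders `keyed` by origin

jfk <- keyed["JFK"]      # key-based subset; keyed[.("JFK")] is equivalent

# Compare the row order of the two tables
DT$origin
keyed$origin
```

The copy is only there to make the comparison possible; calling setkey() directly on a table reorders that table in place.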

Secondary indices can be reused

Because a data.table can hold multiple secondary indices, and creating one amounts to storing the order vector as an attribute, we can skip recomputing the order vector whenever a matching index already exists.

# Fast subsetting using on argument and secondary indices

## Fast subsets in i

Subset all rows where the origin airport matches “JFK” using on.

In [6]:
flights["JFK", on = "origin"]

year,month,day,dep_time,dep_delay,arr_time,arr_delay,cancelled,carrier,tailnum,flight,origin,dest,air_time,distance,hour,min
2014,1,1,914,14,1238,13,0,AA,N338AA,1,JFK,LAX,359,2475,9,14
2014,1,1,1157,-3,1523,13,0,AA,N335AA,3,JFK,LAX,363,2475,11,57
2014,1,1,1902,2,2224,9,0,AA,N327AA,21,JFK,LAX,351,2475,19,2
2014,1,1,1347,2,1706,1,0,AA,N319AA,117,JFK,LAX,350,2475,13,47
2014,1,1,2133,-2,37,-18,0,AA,N323AA,185,JFK,LAX,338,2475,21,33
2014,1,1,1542,-3,1906,-14,0,AA,N328AA,133,JFK,LAX,356,2475,15,42
2014,1,1,1509,-1,1828,-17,0,AA,N5FJAA,145,JFK,MIA,161,1089,15,9
2014,1,1,1848,-2,2206,-14,0,AA,N3HYAA,235,JFK,SEA,349,2422,18,48
2014,1,1,1752,7,2120,-5,0,AA,N332AA,177,JFK,SFO,365,2586,17,52
2014,1,1,1253,3,1351,1,0,AA,N3JWAA,178,JFK,BOS,39,187,12,53


If we had already created a secondary index, using setindex(), then on would reuse it instead of (re)computing it. We can see that by using verbose = TRUE:



In [7]:
setindex(flights, origin)
flights["JFK", on = "origin", verbose = TRUE][1:5]

on= matches existing index, using index
Starting bmerge ...done in 0.001 secs


year,month,day,dep_time,dep_delay,arr_time,arr_delay,cancelled,carrier,tailnum,flight,origin,dest,air_time,distance,hour,min
2014,1,1,914,14,1238,13,0,AA,N338AA,1,JFK,LAX,359,2475,9,14
2014,1,1,1157,-3,1523,13,0,AA,N335AA,3,JFK,LAX,363,2475,11,57
2014,1,1,1902,2,2224,9,0,AA,N327AA,21,JFK,LAX,351,2475,19,2
2014,1,1,1347,2,1706,1,0,AA,N319AA,117,JFK,LAX,350,2475,13,47
2014,1,1,2133,-2,37,-18,0,AA,N323AA,185,JFK,LAX,338,2475,21,33


How can I subset based on origin and dest columns?

For example, to subset the "JFK", "LAX" combination:

In [8]:
flights[.("JFK", "LAX"), on = c("origin", "dest")][1:5]

year,month,day,dep_time,dep_delay,arr_time,arr_delay,cancelled,carrier,tailnum,flight,origin,dest,air_time,distance,hour,min
2014,1,1,914,14,1238,13,0,AA,N338AA,1,JFK,LAX,359,2475,9,14
2014,1,1,1157,-3,1523,13,0,AA,N335AA,3,JFK,LAX,363,2475,11,57
2014,1,1,1902,2,2224,9,0,AA,N327AA,21,JFK,LAX,351,2475,19,2
2014,1,1,1347,2,1706,1,0,AA,N319AA,117,JFK,LAX,350,2475,13,47
2014,1,1,2133,-2,37,-18,0,AA,N323AA,185,JFK,LAX,338,2475,21,33


## Select in j

Return the arr_delay column alone as a data.table, for rows where origin = "LGA" and dest = "TPA".



In [9]:
flights[.("LGA", "TPA"), .(arr_delay), on = c("origin", "dest")]

arr_delay
1
14
-17
-4
-12
94
193
11
56
5


## Aggregation using "by"

Get the maximum departure delay for each month corresponding to origin = "JFK". Order the result by month.


In [10]:
ans <- flights["JFK", max(dep_delay), keyby = month, on = "origin"]
head(ans)

month,V1
1,881
2,1014
3,920
4,1241
5,853
6,798
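
To make the grouping behaviour easy to verify by hand, here is a minimal sketch of the same call pattern on a hypothetical toy table (the table DT and its values are invented for illustration). It subsets with on, aggregates in j, and groups with keyby so the result comes back sorted by month.

```r
library(data.table)

# Hypothetical toy table; values chosen so the maxima are easy to check
DT <- data.table(origin    = c("JFK", "JFK", "LGA", "JFK", "JFK"),
                 month     = c(2L, 1L, 1L, 2L, 1L),
                 dep_delay = c(10L, 5L, 99L, 30L, 7L))

# Keep only the JFK rows, then take the maximum delay per month.
# keyby sorts the result by month and sets month as its key.
ans <- DT["JFK", max(dep_delay), keyby = month, on = "origin"]
ans
# one row per month, in order: month 1 -> 7, month 2 -> 30
```

The LGA row with delay 99 is excluded by the subset in i, so it does not affect the maximum for month 1.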


## Auto indexing

When we use == or %in% on a single column for the first time, a secondary index is created automatically, and it is used to perform the subset.
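
The behaviour described above can be observed with indices(). This is a minimal sketch on a hypothetical toy table (the column names id and value are assumptions), and it assumes the data.table options datatable.auto.index and datatable.use.index are left at their default of TRUE.

```r
library(data.table)

set.seed(1L)
# Hypothetical toy table; `id` and `value` are illustrative column names
dt <- data.table(id = sample(100L, 1e4L, TRUE), value = runif(1e4L))

idx_before <- indices(dt)   # no secondary index has been set yet

res <- dt[id == 42L]        # first `==` subset on a single column

idx_after <- indices(dt)    # inspect which indices exist after the subset
idx_after
```

Running the same subset a second time should then be able to reuse the stored index rather than recompute the order.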