# `data.table`

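- `data.table` is often more efficient than `data.frame`
- It inherits from `data.frame`
    - so functions that accept a data frame generally work on a data table as well
- Its core routines are written in C, which makes many operations much faster
- Grouping, subsetting, and updating data are considerably faster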
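
The inheritance claim can be checked directly. The following is a minimal sketch; the table `dt` and its column names are made up for illustration.

```r
library(data.table)

# A small table; the names `dt`, `id`, and `value` are illustrative only
dt <- data.table(id = 1:3, value = c(10, 20, 30))

class(dt)                   # "data.table" "data.frame"
inherits(dt, "data.frame")  # TRUE
is.data.frame(dt)           # TRUE

# Base functions written for data frames accept it unchanged
nrow(dt)
summary(dt)
```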

http://stackoverflow.com/questions/13618488/what-you-can-do-with-data-frame-that-you-cant-in-data-table

https://github.com/raphg/Biostat-578/blob/master/Advanced_data_manipulation.Rmd

### 1. Creating Data Tables

In [4]:
library(data.table)

In [45]:
DF = data.frame(x=rnorm(9), y=rep(c("a", "b", "c"), each=3), z=rnorm(9))
head(DF, 3)

x,y,z
-0.1259656,a,-1.741379
0.1307752,a,-2.821392
-0.167333,a,-1.264231


In [5]:
DT = data.table(x=rnorm(9), y=rep(c("a", "b", "c"), each=3), z=rnorm(9))
head(DT, 3)

x,y,z
1.2983127,a,0.4161695
-0.9594838,a,0.8963377
0.5943397,a,1.6486333


In [6]:
# List all data tables currently in memory
tables()

     NAME NROW NCOL MB COLS  KEY
[1,] DT      9    3  1 x,y,z    
Total: 1MB


### 2. Subsetting Rows

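Unlike a data frame, a data table treats a single subscript as a row index: `DF[1]` selects the first column, while `DT[1]` selects the first row.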

In [49]:
# First column (a data frame indexes columns with a single subscript)
head(DF[1],3)

x
-0.1259656
0.1307752
-0.167333


In [50]:
# First row (a data table indexes rows with a single subscript)
head(DT[1],3)

x,y,z
1.298313,a,0.4161695


In [35]:
DT[2,]
DT[DT$y=="a", ]

x,y,z
-0.9594838,a,0.8963377


x,y,z
1.2983127,a,0.4161695
-0.9594838,a,0.8963377
0.5943397,a,1.6486333


### 3. Subsetting Columns

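- The subsetting operator `[` behaves differently for data tables
- The argument after the comma (`j`) is an expression that is evaluated with the table's columns available as variables
- In R, `{}` groups a sequence of statements into a single expression whose value is that of the last statement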

In [52]:
head(DT[,c(2,3)],3)

y,z
a,0.4161695
a,0.8963377
a,1.6486333


In [55]:
# Evaluating the block prints 10,
# but its value (what is assigned to k) is 5
k = {print(10); 5}

print(k)

[1] 10
[1] 5


In [58]:
# In a data table, functions can be applied to columns directly inside j
DT[, list(mean(x), sum(z))]

V1,V2
0.4683495,-0.4775512
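
The `V1` and `V2` names in the output above are defaults given to unnamed list elements. A small sketch of naming them explicitly follows; the table `dt`, the seed, and the names `mean_x` and `sum_z` are arbitrary choices for illustration.

```r
library(data.table)

set.seed(1)  # arbitrary seed so the sketch is reproducible
dt <- data.table(x = rnorm(9), z = rnorm(9))

# .() is shorthand for list(); naming its elements names the result columns
res <- dt[, .(mean_x = mean(x), sum_z = sum(z))]
names(res)   # "mean_x" "sum_z"
```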


In [59]:
DT[,table(y)]

y
a b c 
3 3 3 

In [62]:
# Data Table Transformation
print(DT[,w:=z^2])

              x y          z          w
1:  1.298312716 a  0.4161695 0.17319703
2: -0.959483753 a  0.8963377 0.80342121
3:  0.594339660 a  1.6486333 2.71799176
4:  1.005841924 b -3.0001965 9.00117881
5:  0.008340454 b -1.2837106 1.64791278
6:  0.622138020 b  0.7330602 0.53737722
7:  0.332431202 c -0.2226859 0.04958902
8:  1.780496679 c -0.5777695 0.33381760
9: -0.467271742 c  0.9126106 0.83285815


In [63]:
DT2 <- DT
print(DT[, y:=2])

“Coerced 'double' RHS to 'character' to match the column's type; may have truncated precision. Either change the target column to 'double' first (by creating a new 'double' vector length 9 (nrows of entire table) and assign that; i.e. 'replace' column), or coerce RHS to 'character' (e.g. 1L, NA_[real|integer]_, as.*, etc) to make your intent clear and for speed. Or, set the column type correctly up front when you create the table and stick to it, please.”

              x y          z          w
1:  1.298312716 2  0.4161695 0.17319703
2: -0.959483753 2  0.8963377 0.80342121
3:  0.594339660 2  1.6486333 2.71799176
4:  1.005841924 2 -3.0001965 9.00117881
5:  0.008340454 2 -1.2837106 1.64791278
6:  0.622138020 2  0.7330602 0.53737722
7:  0.332431202 2 -0.2226859 0.04958902
8:  1.780496679 2 -0.5777695 0.33381760
9: -0.467271742 2  0.9126106 0.83285815


In [65]:
head(DT,n=3)

x,y,z,w
1.2983127,2,0.4161695,0.173197
-0.9594838,2,0.8963377,0.8034212
0.5943397,2,1.6486333,2.7179918


In [67]:
# DT2 changed as well: DT2 <- DT does not copy, and := modifies the table in place
# Use copy() when an independent copy is needed
head(DT2,n=3)

x,y,z,w
1.2983127,2,0.4161695,0.173197
-0.9594838,2,0.8963377,0.8034212
0.5943397,2,1.6486333,2.7179918
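
To make the difference between plain assignment and `copy()` concrete, here is a minimal sketch. The names `orig`, `alias`, and `snap` are hypothetical.

```r
library(data.table)

orig <- data.table(x = 1:3)

alias <- orig        # plain assignment: both names refer to the same table
snap  <- copy(orig)  # copy(): an independent table

orig[, x := x * 2L]  # := updates the column in place

alias$x  # 2 4 6 (changed along with orig)
snap$x   # 1 2 3 (unaffected)
```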


### 4. Multiple Operations

In [69]:
DT[, m:= {tmp <- (x+z); log2(tmp+5)}]
DT

x,y,z,w,m
1.298312716,2,0.4161695,0.17319703,2.747276
-0.959483753,2,0.8963377,0.80342121,2.303592
0.59433966,2,1.6486333,2.71799176,2.856582
1.005841924,2,-3.0001965,9.00117881,1.587675
0.008340454,2,-1.2837106,1.64791278,1.897097
0.62213802,2,0.7330602,0.53737722,2.667937
0.332431202,2,-0.2226859,0.04958902,2.353251
1.780496679,2,-0.5777695,0.3338176,2.632903
-0.467271742,2,0.9126106,0.83285815,2.445022


In [71]:
DT[, a:=x>0]
DT

x,y,z,w,m,a
1.298312716,2,0.4161695,0.17319703,2.747276,True
-0.959483753,2,0.8963377,0.80342121,2.303592,False
0.59433966,2,1.6486333,2.71799176,2.856582,True
1.005841924,2,-3.0001965,9.00117881,1.587675,True
0.008340454,2,-1.2837106,1.64791278,1.897097,True
0.62213802,2,0.7330602,0.53737722,2.667937,True
0.332431202,2,-0.2226859,0.04958902,2.353251,True
1.780496679,2,-0.5777695,0.3338176,2.632903,True
-0.467271742,2,0.9126106,0.83285815,2.445022,False


In [78]:
# Grouped by a:
# rows where a is TRUE all receive the mean of x + w over that group in b,
# and rows where a is FALSE receive the mean over theirs
DT[, b:=mean(x+w), by=a]
DT

x,y,z,w,m,a,b
1.298312716,2,0.4161695,0.17319703,2.747276,True,2.8718521
-0.959483753,2,0.8963377,0.80342121,2.303592,False,0.1047619
0.59433966,2,1.6486333,2.71799176,2.856582,True,2.8718521
1.005841924,2,-3.0001965,9.00117881,1.587675,True,2.8718521
0.008340454,2,-1.2837106,1.64791278,1.897097,True,2.8718521
0.62213802,2,0.7330602,0.53737722,2.667937,True,2.8718521
0.332431202,2,-0.2226859,0.04958902,2.353251,True,2.8718521
1.780496679,2,-0.5777695,0.3338176,2.632903,True,2.8718521
-0.467271742,2,0.9126106,0.83285815,2.445022,False,0.1047619
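
Using `:=` with `by` keeps every row and fills in the group value, whereas returning a list with `by` collapses the table to one row per group. A small sketch of the contrast, with made-up columns `g` and `v`:

```r
library(data.table)

dt <- data.table(g = c("a", "a", "b"), v = c(1, 3, 10))

# := with by: every row stays and receives its group's mean
dt[, g_mean := mean(v), by = g]
dt$g_mean    # 2 2 10

# .() with by: one row per group
agg <- dt[, .(g_mean = mean(v)), by = g]
agg$g_mean   # 2 10
```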


### 5. Special Variables

`.N`: an integer of length 1 holding the number of rows in the current group

In [85]:
set.seed(123)
DT <- data.table(x=sample(letters[1:3], 1E5, TRUE))
DT[, .N, by=x]

x,N
a,33387
c,33201
b,33412
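
`.N` is also usable without `by`, where the "group" is whatever rows `i` selects. The sketch below uses a tiny made-up table so the counts are easy to verify by eye.

```r
library(data.table)

dt <- data.table(x = c("a", "b", "a", "c", "a"))

n_all  <- dt[, .N]           # total number of rows: 5
n_a    <- dt[x == "a", .N]   # rows matching a condition: 3
last   <- dt[.N]             # in i, .N is the index of the last row

# Counts per group, largest first
counts <- dt[, .N, by = x][order(-N)]
```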


### 6. Keys

Setting a key sorts the table by that column, which makes subsets and joins on it much faster.

In [86]:
DT <- data.table(x=rep(c("a", "b", "c"), each = 100), y=rnorm(300))
setkey(DT, x)
DT['a']

x,y
a,0.25958973
a,0.91751072
a,-0.72231834
a,-0.80828402
a,-0.14135202
a,2.25701345
a,-2.37955015
a,-0.45425393
a,-0.06007418
a,0.86090061
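
What `setkey` does can be inspected on a small table. This is a sketch with made-up data; it also shows the `on =` argument as an alternative that leaves the table unkeyed.

```r
library(data.table)

dt <- data.table(x = c("c", "a", "b", "a"), y = 1:4)
unkeyed <- copy(dt)   # keep an unsorted copy for comparison

setkey(dt, x)         # sorts dt by x in place and records the key
key(dt)               # "x"
dt$x                  # "a" "a" "b" "c"

by_key <- dt["a"]                  # lookup on the key
by_on  <- unkeyed["a", on = "x"]   # ad hoc lookup without setting a key
```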


In [90]:
# Join using keys
DT1 <- data.table(x=c('a', 'a', 'b', 'dt1'), y=1:4)
DT2 <- data.table(x=c('a', 'b', 'dt2'), z=5:7)
setkey(DT1, x); setkey(DT2, x)
merge(DT1, DT2)

x,y,z
a,1,5
a,2,5
b,3,6
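
The result above is an inner join, so the unmatched `dt1` and `dt2` rows are dropped. As a sketch of the other join types on the same two tables, `merge` takes `all.x` and `all`, and the `X[Y]` form looks up each row of `Y` in `X`:

```r
library(data.table)

DT1 <- data.table(x = c("a", "a", "b", "dt1"), y = 1:4)
DT2 <- data.table(x = c("a", "b", "dt2"), z = 5:7)

inner <- merge(DT1, DT2, by = "x")                # matched rows only
left  <- merge(DT1, DT2, by = "x", all.x = TRUE)  # keep every row of DT1
full  <- merge(DT1, DT2, by = "x", all = TRUE)    # keep rows from both

# X[Y] syntax: for each row of DT1, find the matching rows of DT2
looked_up <- DT2[DT1, on = "x"]
```

Rows with no match get `NA` in the columns that come from the other table.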


### 7. Fast Reading

In [91]:
big_df <- data.frame(x=rnorm(1E6), y=rnorm(1E6))
file <- tempfile()
write.table(big_df, file=file, row.names=FALSE, col.names=TRUE, sep="\t", quote=FALSE)
system.time(fread(file))

# fread reads a delimited file into a data table

   user  system elapsed 
  0.614   0.012   0.627 

In [92]:
system.time(read.table(file, header=TRUE, sep="\t"))

   user  system elapsed 
  8.484   0.075   8.580 
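
`fread` has a writing counterpart, `fwrite`. The following round-trip sketch uses a tiny made-up table rather than a benchmark, so it makes no timing claims.

```r
library(data.table)

dt <- data.table(id = 1:5, value = c(0.5, 1.5, 2.5, 3.5, 4.5))

path <- tempfile(fileext = ".tsv")
fwrite(dt, path, sep = "\t")   # write a tab-separated file

back <- fread(path)            # separator and column types are detected
all.equal(dt, back)            # TRUE

# Read only some of the columns
only_value <- fread(path, select = "value")

unlink(path)                   # remove the temporary file
```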