# dplyr Package

- Optimized and distilled version of `plyr`
- Very fast (key operations are implemented in C++)
- Always returns a new data frame
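As a minimal sketch of the "returns a new data frame" point, the snippet below filters a small toy data frame and checks that the input is left untouched. The data values and the names `df` and `warm` are made up for illustration and are not taken from `chicago.rds`.

```r
library(dplyr)

# Toy data frame (hypothetical values, not from chicago.rds)
df <- data.frame(city = c("chic", "chic", "chic"),
                 tmpd = c(31.5, 33.0, 29.0))

# filter() builds and returns a new data frame; df itself is not modified
warm <- filter(df, tmpd > 30)

nrow(warm)  # 2 rows satisfy the condition
nrow(df)    # the input still has all 3 rows
```

To keep a result, assign it to a name (possibly the same name as the input), since the verbs never modify their argument in place.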

- `select` : return a subset of the columns of a data frame
- `filter` : extract a subset of rows from a data frame based on logical conditions
- `rename` : rename variables (columns) in a data frame
- `mutate` : add new variables/columns or transform existing ones
- `summarize` : generate summary statistics of different variables, possibly within strata
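The sections below walk through `select` and `filter` on the real data. For the other three verbs, here is a rough sketch on a toy data frame. The values, the `year` column, and the new names `dewpoint`, `pm25`, and `pm25detrend` are assumptions chosen for illustration; only `dptp` and `pm25tmean2` mirror column names from the data set.

```r
library(dplyr)

# Toy data (hypothetical values); two column names mimic the chicago data
df <- data.frame(
  year       = c(1998, 1998, 1999, 1999),
  dptp       = c(21.9, 25.8, 51.3, 53.7),
  pm25tmean2 = c(38.1, 33.95, 39.4, NA)
)

# rename(): the syntax is new_name = old_name
df2 <- rename(df, dewpoint = dptp, pm25 = pm25tmean2)

# mutate(): add a column derived from an existing one
df3 <- mutate(df2, pm25detrend = pm25 - mean(pm25, na.rm = TRUE))

# summarize() with strata: group_by() defines the groups,
# summarize() then returns one row of statistics per group
by_year <- summarize(group_by(df3, year),
                     pm25     = mean(pm25, na.rm = TRUE),
                     dewpoint = max(dewpoint))
by_year
```

`na.rm = TRUE` is needed here because a single `NA` would otherwise turn the whole mean into `NA`.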


In [4]:
library(dplyr)


Attaching package: ‘dplyr’

The following objects are masked from ‘package:stats’:

    filter, lag

The following objects are masked from ‘package:base’:

    intersect, setdiff, setequal, union



In [6]:
chicago <- readRDS("chicago.rds")
dim(chicago)

In [8]:
str(chicago)

'data.frame':	6940 obs. of  8 variables:
 $ city      : chr  "chic" "chic" "chic" "chic" ...
 $ tmpd      : num  31.5 33 33 29 32 40 34.5 29 26.5 32.5 ...
 $ dptp      : num  31.5 29.9 27.4 28.6 28.9 ...
 $ date      : Date, format: "1987-01-01" "1987-01-02" ...
 $ pm25tmean2: num  NA NA NA NA NA NA NA NA NA NA ...
 $ pm10tmean2: num  34 NA 34.2 47 NA ...
 $ o3tmean2  : num  4.25 3.3 3.33 4.38 4.75 ...
 $ no2tmean2 : num  20 23.2 23.8 30.4 30.3 ...


In [9]:
names(chicago)

### 1. Select

In [10]:
# Show every column from city through dptp
head(select(chicago, city:dptp))

city,tmpd,dptp
chic,31.5,31.5
chic,33.0,29.875
chic,33.0,27.375
chic,29.0,28.625
chic,32.0,28.875
chic,40.0,35.125


In [11]:
# Exclude a range of columns
head(select(chicago, -(city:dptp)))

date,pm25tmean2,pm10tmean2,o3tmean2,no2tmean2
1987-01-01,,34.0,4.25,19.9881
1987-01-02,,,3.304348,23.19099
1987-01-03,,34.16667,3.333333,23.81548
1987-01-04,,47.0,4.375,30.43452
1987-01-05,,,4.75,30.33333
1987-01-06,,48.0,5.833333,25.77233


In [13]:
# The same can be done in base R, it is just more cumbersome
i <- match("city", names(chicago))
j <- match("dptp", names(chicago))

head(chicago[, -(i:j)])

date,pm25tmean2,pm10tmean2,o3tmean2,no2tmean2
1987-01-01,,34.0,4.25,19.9881
1987-01-02,,,3.304348,23.19099
1987-01-03,,34.16667,3.333333,23.81548
1987-01-04,,47.0,4.375,30.43452
1987-01-05,,,4.75,30.33333
1987-01-06,,48.0,5.833333,25.77233
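The selection styles above can be compared side by side on a small stand-in data frame. This is only a sketch: the one-row `df` is hypothetical, and `ends_with()` is a selection helper exported by dplyr that the notebook itself does not use.

```r
library(dplyr)

# One-row stand-in for the chicago data (hypothetical values)
df <- data.frame(city = "chic", tmpd = 31.5, dptp = 31.5,
                 pm10tmean2 = 34, o3tmean2 = 4.25)

# Range of columns by name, and its complement
keep_range <- select(df, city:dptp)
drop_range <- select(df, -(city:dptp))

# Helper: pick columns by a shared name suffix
pollutants <- select(df, ends_with("tmean2"))

# Base R needs the column positions first
i <- match("city", names(df))
j <- match("dptp", names(df))
base_drop <- df[, -(i:j)]

names(keep_range)
names(drop_range)
names(base_drop)
```

In this toy case `drop_range`, `pollutants`, and `base_drop` all contain the same two columns; `select` simply avoids the detour through numeric positions.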


### 2. Filter

In [14]:
# Keep rows where PM2.5 exceeds 30; rows where the condition is NA are dropped
chic.f <- filter(chicago, pm25tmean2 > 30)
head(chic.f)

city,tmpd,dptp,date,pm25tmean2,pm10tmean2,o3tmean2,no2tmean2
chic,23,21.9,1998-01-17,38.1,32.46154,3.180556,25.3
chic,28,25.8,1998-01-23,33.95,38.69231,1.75,29.3763
chic,55,51.3,1998-04-30,39.4,34.0,10.786232,25.3131
chic,59,53.7,1998-05-01,35.4,28.5,14.295125,31.42905
chic,57,52.0,1998-05-02,33.3,35.0,20.662879,26.79861
chic,57,56.0,1998-05-07,32.1,34.5,24.270422,33.99167


In [19]:
# Base R equivalent: NA values must be excluded explicitly,
# and the original row names are kept
head(chicago[!is.na(chicago$pm25tmean2) & chicago$pm25tmean2 > 30, ])

row.names,city,tmpd,dptp,date,pm25tmean2,pm10tmean2,o3tmean2,no2tmean2
4035,chic,23,21.9,1998-01-17,38.1,32.46154,3.180556,25.3
4041,chic,28,25.8,1998-01-23,33.95,38.69231,1.75,29.3763
4138,chic,55,51.3,1998-04-30,39.4,34.0,10.786232,25.3131
4139,chic,59,53.7,1998-05-01,35.4,28.5,14.295125,31.42905
4140,chic,57,52.0,1998-05-02,33.3,35.0,20.662879,26.79861
4145,chic,57,56.0,1998-05-07,32.1,34.5,24.270422,33.99167
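The reason the base R version needs `!is.na(...)` can be seen on a toy example. The sketch below uses a hypothetical four-row data frame with an `id` column added only to track rows.

```r
library(dplyr)

# Toy data (hypothetical values) with one missing PM2.5 measurement
df <- data.frame(id = 1:4, pm25tmean2 = c(NA, 38.1, 12.0, 33.95))

# filter() keeps only rows where the condition is TRUE; NA is dropped
kept <- filter(df, pm25tmean2 > 30)

# Base R logical indexing turns an NA condition into an all-NA row
naive <- df[df$pm25tmean2 > 30, ]

# Guarding with !is.na() removes that spurious row
guarded <- df[!is.na(df$pm25tmean2) & df$pm25tmean2 > 30, ]

nrow(kept)         # 2
nrow(naive)        # 3, including one row of NAs
nrow(guarded)      # 2
rownames(guarded)  # "2" "4": original row positions are preserved
```

`kept` and `guarded` hold the same observations; they differ only in their row names.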


In [16]:
# Combine conditions with &: PM2.5 above 30 and temperature above 80
chic.f <- filter(chicago, pm25tmean2 > 30 & tmpd > 80)
head(chic.f)

city,tmpd,dptp,date,pm25tmean2,pm10tmean2,o3tmean2,no2tmean2
chic,81,71.2,1998-08-23,39.6,59.0,45.86364,14.32639
chic,81,70.4,1998-09-06,31.5,50.5,50.6625,20.3125
chic,82,72.2,2001-07-20,32.3,58.5,33.0038,33.675
chic,84,72.9,2001-08-01,43.7,81.5,45.17736,27.44239
chic,85,72.6,2001-08-08,38.8375,70.0,37.98047,27.62743
chic,84,72.6,2001-08-09,38.2,66.0,36.73245,26.46742
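A short sketch of how conditions combine in `filter`, again on hypothetical toy values: comma-separated conditions behave like `&`, while `|` keeps rows that satisfy either condition.

```r
library(dplyr)

# Toy data (hypothetical values), including one missing PM2.5 value
df <- data.frame(tmpd       = c(81, 70, 85, 90),
                 pm25tmean2 = c(39.6, 35.0, 20.0, NA))

# Both conditions must hold: & and a comma are interchangeable here
both_amp   <- filter(df, pm25tmean2 > 30 & tmpd > 80)
both_comma <- filter(df, pm25tmean2 > 30, tmpd > 80)

# Either condition may hold
either <- filter(df, pm25tmean2 > 30 | tmpd > 80)

nrow(both_amp)    # 1: only the first row passes both tests
nrow(both_comma)  # 1
nrow(either)      # 4: NA | TRUE evaluates to TRUE, so the last row is kept
```

The last line shows that missing values do not always remove a row: with `|`, a row survives as long as one condition is definitely `TRUE`.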
