# Transforming Data

## Subsetting and sorting

### Subsetting

In [2]:
set.seed(13435)
X <- data.frame("var1"=sample(1:5),"var2"=sample(6:10),"var3"=sample(11:15))
X <- X[sample(1:5),]; X$var2[c(1,3)] = NA
X

,var1,var2,var3
1,2,,15
4,1,10.0,11
2,3,,12
3,5,6.0,14
5,4,9.0,13


### Subsetting - quick review

In [3]:
X[,1]
X[,"var1"]
X[1:2,"var2"]

### Logical ands and ors

In [32]:
X[(X$var1 <= 3 & X$var3 > 11),]
X[(X$var1 <= 3 | X$var3 > 15),]

var1,var2,var3
2,,15
3,,12


,var1,var2,var3
1,2,,15
4,1,10.0,11
2,3,,12


### Dealing with missing values

In [4]:
X[which(X$var2 > 8),]

,var1,var2,var3
4,1,10,11
5,4,9,13
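
The reason for wrapping the condition in `which()` can be seen in a small sketch. The data frame below is typed in by hand from the printed output above (an assumption, so the example does not depend on the random seed). A plain logical index keeps one all-`NA` row for every `NA` in the condition, whereas `which()` returns only the positions where the condition is `TRUE`.

```r
# Hand-copied version of X from the output above (assumed values)
X <- data.frame(var1 = c(2, 1, 3, 5, 4),
                var2 = c(NA, 10, NA, 6, 9),
                var3 = c(15, 11, 12, 14, 13))

# Logical subsetting: each NA in the condition produces a row of NAs
withNA <- X[X$var2 > 8, ]
nrow(withNA)     # 4 rows: two real matches plus two all-NA rows

# which() drops the NA positions and keeps only the TRUE ones
withoutNA <- X[which(X$var2 > 8), ]
nrow(withoutNA)  # 2 rows
```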


### Sorting

In [5]:
sort(X$var1)
sort(X$var1,decreasing=TRUE)
sort(X$var2,na.last=TRUE)

### Ordering

In [6]:
X[order(X$var1),]

,var1,var2,var3
4,1,10.0,11
1,2,,15
2,3,,12
5,4,9.0,13
3,5,6.0,14


### Ordering

In [7]:
X[order(X$var1,X$var3),]

,var1,var2,var3
4,1,10.0,11
1,2,,15
2,3,,12
5,4,9.0,13
3,5,6.0,14
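
Because every value of `var1` in `X` is unique, the second key `var3` never comes into play above. A minimal sketch with a made-up data frame `df` (hypothetical, not part of the lecture data) shows how ties in the first key are broken by the second, and how negating a numeric key sorts it in decreasing order:

```r
# Made-up data with ties in the first key
df <- data.frame(g = c(2, 1, 2, 1), v = c(10, 40, 30, 20))

# Sort by g ascending, then by v descending within each value of g
idx <- order(df$g, -df$v)
idx          # 2 4 3 1
df[idx, ]
```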


### Ordering with plyr

In [8]:
library(plyr)
arrange(X,var1)
arrange(X,desc(var1))

var1,var2,var3
1,10.0,11
2,,15
3,,12
4,9.0,13
5,6.0,14


var1,var2,var3
5,6.0,14
4,9.0,13
3,,12
2,,15
1,10.0,11


### Adding rows and columns

In [9]:
X$var4 <- rnorm(5)
X

,var1,var2,var3,var4
1,2,,15,0.187596
4,1,10.0,11,1.7869764
2,3,,12,0.4966936
3,5,6.0,14,0.063183
5,4,9.0,13,-0.5361329


### Adding rows and columns

In [10]:
Y <- cbind(X,rnorm(5))
Y

,var1,var2,var3,var4,rnorm(5)
1,2,,15,0.187596,0.6257849
4,1,10.0,11,1.7869764,-2.4508375
2,3,,12,0.4966936,0.08909424
3,5,6.0,14,0.063183,0.4783857
5,4,9.0,13,-0.5361329,1.00053336
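
The column added by `cbind()` above ends up with the name `rnorm(5)`. As a small sketch on a made-up data frame (the name `noise` is only illustrative), passing a name to `cbind()` gives a cleaner result, and `rbind()` adds rows in the same way provided the column names match:

```r
# Made-up data frame for illustration
X <- data.frame(var1 = 1:3, var2 = c(6, 7, 8))

# Name the new column explicitly when binding it on the right
Y <- cbind(X, noise = rnorm(3))
names(Y)   # "var1" "var2" "noise"

# Add a row at the bottom; column names must match those of X
Z <- rbind(X, data.frame(var1 = 4L, var2 = 9))
dim(Z)     # 4 2
```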


### Notes and further resources

* R programming in the Data Science Track
* Andrew Jaffe's lecture notes [http://www.biostat.jhsph.edu/~ajaffe/lec_winterR/Lecture%202.pdf](http://www.biostat.jhsph.edu/~ajaffe/lec_winterR/Lecture%202.pdf)

## Summarizing data

### Example data set 

<img class=center src="https://raw.githubusercontent.com/DataScienceSpecialization/courses/master/assets/img/03_ObtainingData/restaurants.png" height=500 />


[https://data.baltimorecity.gov/Community/Restaurants/k5ry-ef3g](https://data.baltimorecity.gov/Community/Restaurants/k5ry-ef3g)

### Getting the data from the web


In [2]:
if(!file.exists("./data")){dir.create("./data")}
fileUrl <- "http://data.baltimorecity.gov/api/views/k5ry-ef3g/rows.csv?accessType=DOWNLOAD"
download.file(fileUrl, destfile="./data/restaurants.csv", method = "curl")
restData <- read.csv("./data/restaurants.csv")

### Look at a bit of the data


In [4]:
head(restData,n=3)
tail(restData,n=3)


name,zipCode,neighborhood,councilDistrict,policeDistrict,Location.1
410,21206,Frankford,2,NORTHEASTERN,"4509 BELAIR ROAD Baltimore, MD"
1919,21231,Fells Point,1,SOUTHEASTERN,"1919 FLEET ST Baltimore, MD"
SAUTE,21224,Canton,1,SOUTHEASTERN,"2844 HUDSON ST Baltimore, MD"


,name,zipCode,neighborhood,councilDistrict,policeDistrict,Location.1
1325,ZINK'S CAF,21213,Belair-Edison,13,NORTHEASTERN,"3300 LAWNVIEW AVE Baltimore, MD"
1326,ZISSIMOS BAR,21211,Hampden,7,NORTHERN,"1023 36TH ST Baltimore, MD"
1327,ZORBAS,21224,Greektown,2,SOUTHEASTERN,"4710 EASTERN Ave Baltimore, MD"


### Make summary

In [5]:
summary(restData)

                           name         zipCode             neighborhood
 MCDONALD'S                  :   8   Min.   :-21226   Downtown    :128  
 POPEYES FAMOUS FRIED CHICKEN:   7   1st Qu.: 21202   Fells Point : 91  
 SUBWAY                      :   6   Median : 21218   Inner Harbor: 89  
 KENTUCKY FRIED CHICKEN      :   5   Mean   : 21185   Canton      : 81  
 BURGER KING                 :   4   3rd Qu.: 21226   Federal Hill: 42  
 DUNKIN DONUTS               :   4   Max.   : 21287   Mount Vernon: 33  
 (Other)                     :1293                    (Other)     :863  
 councilDistrict       policeDistrict                         Location.1   
 Min.   : 1.000   SOUTHEASTERN:385    1101 RUSSELL ST\nBaltimore, MD:   9  
 1st Qu.: 2.000   CENTRAL     :288    201 PRATT ST\nBaltimore, MD   :   8  
 Median : 9.000   SOUTHERN    :213    2400 BOSTON ST\nBaltimore, MD :   8  
 Mean   : 7.191   NORTHERN    :157    300 LIGHT ST\nBaltimore, MD   :   5  
 3rd Qu.:11.000   NORTHEASTERN: 72  

### More in depth information

In [6]:
str(restData)

'data.frame':	1327 obs. of  6 variables:
 $ name           : Factor w/ 1277 levels "#1 CHINESE KITCHEN",..: 9 3 992 1 2 4 5 6 7 8 ...
 $ zipCode        : int  21206 21231 21224 21211 21223 21218 21205 21211 21205 21231 ...
 $ neighborhood   : Factor w/ 173 levels "Abell","Arlington",..: 53 52 18 66 104 33 98 133 98 157 ...
 $ councilDistrict: int  2 1 1 14 9 14 13 7 13 1 ...
 $ policeDistrict : Factor w/ 9 levels "CENTRAL","EASTERN",..: 3 6 6 4 8 3 6 4 6 6 ...
 $ Location.1     : Factor w/ 1210 levels "1 BIDDLE ST\nBaltimore, MD",..: 835 334 554 755 492 537 505 530 507 569 ...


### Quantiles of quantitative variables

In [7]:
quantile(restData$councilDistrict,na.rm=TRUE)
quantile(restData$councilDistrict,probs=c(0.5,0.75,0.9))


### Make table

In [8]:
table(restData$zipCode,useNA="ifany")


-21226  21201  21202  21205  21206  21207  21208  21209  21210  21211  21212 
     1    136    201     27     30      4      1      8     23     41     28 
 21213  21214  21215  21216  21217  21218  21220  21222  21223  21224  21225 
    31     17     54     10     32     69      1      7     56    199     19 
 21226  21227  21229  21230  21231  21234  21237  21239  21251  21287 
    18      4     13    156    127      7      1      3      2      1 

### Make table

In [9]:
table(restData$councilDistrict,restData$zipCode)

    
     -21226 21201 21202 21205 21206 21207 21208 21209 21210 21211 21212 21213
  1       0     0    37     0     0     0     0     0     0     0     0     2
  2       0     0     0     3    27     0     0     0     0     0     0     0
  3       0     0     0     0     0     0     0     0     0     0     0     2
  4       0     0     0     0     0     0     0     0     0     0    27     0
  5       0     0     0     0     0     3     0     6     0     0     0     0
  6       0     0     0     0     0     0     0     1    19     0     0     0
  7       0     0     0     0     0     0     0     1     0    27     0     0
  8       0     0     0     0     0     1     0     0     0     0     0     0
  9       0     1     0     0     0     0     0     0     0     0     0     0
  10      1     0     1     0     0     0     0     0     0     0     0     0
  11      0   115   139     0     0     0     1     0     0     0     1     0
  12      0    20    24     4     0     0     0     0     0

### Check for missing values

In [10]:
sum(is.na(restData$councilDistrict))
any(is.na(restData$councilDistrict))
all(restData$zipCode > 0)

### Row and column sums

In [11]:
colSums(is.na(restData))
all(colSums(is.na(restData))==0)
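
Since the restaurant data has to be downloaded, the same checks can be tried on a tiny made-up data frame. This is only a sketch; `df` and its columns are hypothetical:

```r
# Made-up data frame with one missing value in each of two columns
df <- data.frame(a = c(1, NA, 3), b = c("x", "y", NA), c = 1:3)

naPerCol <- colSums(is.na(df))   # missing values per column
naPerCol                         # a = 1, b = 1, c = 0
naPerRow <- rowSums(is.na(df))   # missing values per row

# Keep only the rows that have no missing values at all
complete <- df[complete.cases(df), ]
nrow(complete)                   # 1
```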

### Values with specific characteristics

In [12]:
table(restData$zipCode %in% c("21212"))
table(restData$zipCode %in% c("21212","21213"))


FALSE  TRUE 
 1299    28 


FALSE  TRUE 
 1268    59 
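
`zipCode` is stored as an integer, yet the comparisons above use character strings. This works because `%in%` is built on `match()`, which coerces both sides to a common type before comparing. A short sketch with made-up zip codes:

```r
# Made-up integer zip codes
zips <- c(21212L, 21213L, 21224L)

# Character and numeric lookup tables give the same answer
asChar <- zips %in% c("21212", "21213")
asNum  <- zips %in% c(21212, 21213)
asChar   # TRUE TRUE FALSE
```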

### Values with specific characteristics

In [13]:
restData[restData$zipCode %in% c("21212","21213"),]


,name,zipCode,neighborhood,councilDistrict,policeDistrict,Location.1
29,BAY ATLANTIC CLUB,21212,Downtown,11,CENTRAL,"206 REDWOOD ST Baltimore, MD"
39,BERMUDA BAR,21213,Broadway East,12,EASTERN,"1801 NORTH AVE Baltimore, MD"
92,ATWATER'S,21212,Chinquapin Park-Belvedere,4,NORTHERN,"529 BELVEDERE AVE Baltimore, MD"
111,BALTIMORE ESTONIAN SOCIETY,21213,South Clifton Park,12,EASTERN,"1932 BELAIR RD Baltimore, MD"
187,CAFE ZEN,21212,Rosebank,4,NORTHERN,"438 BELVEDERE AVE Baltimore, MD"
220,CERIELLO FINE FOODS,21212,Chinquapin Park-Belvedere,4,NORTHERN,"529 BELVEDERE AVE Baltimore, MD"
266,CLIFTON PARK GOLF COURSE SNACK BAR,21213,Darley Park,14,NORTHEASTERN,"2701 ST LO DR Baltimore, MD"
276,CLUB HOUSE BAR & GRILL,21213,Orangeville Industrial Area,13,EASTERN,"4217 ERDMAN AVE Baltimore, MD"
289,CLUBHOUSE BAR & GRILL,21213,Orangeville Industrial Area,13,EASTERN,"4217 ERDMAN AVE Baltimore, MD"
291,COCKY LOU'S,21213,Broadway East,12,EASTERN,"2101 NORTH AVE Baltimore, MD"


### Cross tabs

In [14]:
data(UCBAdmissions)
DF = as.data.frame(UCBAdmissions)
summary(DF)

      Admit       Gender   Dept       Freq      
 Admitted:12   Male  :12   A:4   Min.   :  8.0  
 Rejected:12   Female:12   B:4   1st Qu.: 80.0  
                           C:4   Median :170.0  
                           D:4   Mean   :188.6  
                           E:4   3rd Qu.:302.5  
                           F:4   Max.   :512.0  

### Cross tabs

In [15]:
xt <- xtabs(Freq ~ Gender + Admit,data=DF)
xt

        Admit
Gender   Admitted Rejected
  Male       1198     1493
  Female      557     1278
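
Raw counts are hard to compare when the groups differ in size. One possible follow-up, shown here as a sketch rather than as part of the original lecture, is to turn the cross tab into row proportions with `prop.table()`:

```r
data(UCBAdmissions)
DF <- as.data.frame(UCBAdmissions)
xt <- xtabs(Freq ~ Gender + Admit, data = DF)

# margin = 1 divides each cell by its row total,
# giving the admitted/rejected share within each gender
p <- prop.table(xt, margin = 1)
round(p, 3)
```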

### Flat tables

In [16]:
warpbreaks$replicate <- rep(1:9, len = 54)
xt = xtabs(breaks ~.,data=warpbreaks)
xt

, , replicate = 1

    tension
wool  L  M  H
   A 26 18 36
   B 27 42 20

, , replicate = 2

    tension
wool  L  M  H
   A 30 21 21
   B 14 26 21

, , replicate = 3

    tension
wool  L  M  H
   A 54 29 24
   B 29 19 24

, , replicate = 4

    tension
wool  L  M  H
   A 25 17 18
   B 19 16 17

, , replicate = 5

    tension
wool  L  M  H
   A 70 12 10
   B 29 39 13

, , replicate = 6

    tension
wool  L  M  H
   A 52 18 43
   B 31 28 15

, , replicate = 7

    tension
wool  L  M  H
   A 51 35 28
   B 41 21 15

, , replicate = 8

    tension
wool  L  M  H
   A 26 30 15
   B 20 39 16

, , replicate = 9

    tension
wool  L  M  H
   A 67 36 26
   B 44 29 28


### Flat tables

In [17]:
ftable(xt)

             replicate  1  2  3  4  5  6  7  8  9
wool tension                                     
A    L                 26 30 54 25 70 52 51 26 67
     M                 18 21 29 17 12 18 35 30 36
     H                 36 21 24 18 10 43 28 15 26
B    L                 27 14 29 19 29 31 41 20 44
     M                 42 26 19 16 39 28 21 39 29
     H                 20 21 24 17 13 15 15 16 28

### Size of a data set

In [18]:
fakeData = rnorm(1e5)
object.size(fakeData)
print(object.size(fakeData),units="Mb")

800048 bytes

0.8 Mb


## Creating new variables

### Why create new variables?

* Often the raw data won't have a value you are looking for
* You will need to transform the data to get the values you would like
* Usually you will add those values to the data frames you are working with
* Common variables to create
  * Missingness indicators
  * "Cutting up" quantitative variables
  * Applying transforms

### Example data set 

<img class=center src="https://raw.githubusercontent.com/DataScienceSpecialization/courses/master/assets/img/03_ObtainingData/restaurants.png" height=500 />


[https://data.baltimorecity.gov/Community/Restaurants/k5ry-ef3g](https://data.baltimorecity.gov/Community/Restaurants/k5ry-ef3g)

### Creating sequences

_Sometimes you need an index for your data set_

In [19]:
s1 <- seq(1,10,by=2) ; s1
s2 <- seq(1,10,length=3); s2
x <- c(1,3,8,25,100); seq(along = x)
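
For reference, here is what those calls return, together with `seq_along()`, which is the more common spelling of `seq(along = x)`. The empty-vector case at the end is an added illustration of why `seq_along()` is usually preferred over `1:length(x)` in loops:

```r
s1 <- seq(1, 10, by = 2)          # 1 3 5 7 9
s2 <- seq(1, 10, length.out = 3)  # 1.0 5.5 10.0

x <- c(1, 3, 8, 25, 100)
idx <- seq_along(x)               # 1 2 3 4 5

# With an empty vector, seq_along() gives an empty index,
# while 1:length(x) would give c(1, 0)
empty <- seq_along(numeric(0))
length(empty)                     # 0
```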

### Subsetting variables

In [20]:
restData$nearMe = restData$neighborhood %in% c("Roland Park", "Homeland")
table(restData$nearMe)


FALSE  TRUE 
 1314    13 

### Creating binary variables

In [21]:
restData$zipWrong = ifelse(restData$zipCode < 0, TRUE, FALSE)
table(restData$zipWrong,restData$zipCode < 0)

       
        FALSE TRUE
  FALSE  1326    0
  TRUE      0    1

### Creating categorical variables

In [22]:
restData$zipGroups = cut(restData$zipCode,breaks=quantile(restData$zipCode))
table(restData$zipGroups)
table(restData$zipGroups,restData$zipCode)


(-2.123e+04,2.12e+04]  (2.12e+04,2.122e+04] (2.122e+04,2.123e+04] 
                  337                   375                   282 
(2.123e+04,2.129e+04] 
                  332 

                       
                        -21226 21201 21202 21205 21206 21207 21208 21209 21210
  (-2.123e+04,2.12e+04]      0   136   201     0     0     0     0     0     0
  (2.12e+04,2.122e+04]       0     0     0    27    30     4     1     8    23
  (2.122e+04,2.123e+04]      0     0     0     0     0     0     0     0     0
  (2.123e+04,2.129e+04]      0     0     0     0     0     0     0     0     0
                       
                        21211 21212 21213 21214 21215 21216 21217 21218 21220
  (-2.123e+04,2.12e+04]     0     0     0     0     0     0     0     0     0
  (2.12e+04,2.122e+04]     41    28    31    17    54    10    32    69     0
  (2.122e+04,2.123e+04]     0     0     0     0     0     0     0     0     1
  (2.123e+04,2.129e+04]     0     0     0     0     0     0     0     0     0
                       
                        21222 21223 21224 21225 21226 21227 21229 21230 21231
  (-2.123e+04,2.12e+04]     0     0     0     0     0     0     0
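
The four group counts above add up to 1326 rather than 1327. That is consistent with how `cut()` treats its breaks: intervals are open on the left by default, so a value equal to the smallest break falls outside every interval and becomes `NA`. A sketch on made-up numbers, using the `include.lowest` argument of `cut()`:

```r
# Made-up values; quantile(z) returns 1, 2.75, 4.5, 6.25, 8
z <- c(1, 2, 3, 4, 5, 6, 7, 8)

# Default: the minimum (1) is not inside (1, 2.75], so it becomes NA
g1 <- cut(z, breaks = quantile(z))
sum(is.na(g1))   # 1

# include.lowest = TRUE closes the first interval on the left
g2 <- cut(z, breaks = quantile(z), include.lowest = TRUE)
sum(is.na(g2))   # 0
```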

### Easier cutting

In [24]:
install.packages("Hmisc")
library(Hmisc)
restData$zipGroups = cut2(restData$zipCode,g=4)
table(restData$zipGroups)


[-21226,21205) [ 21205,21220) [ 21220,21227) [ 21227,21287] 
           338            375            300            314 

### Creating factor variables

In [25]:
restData$zcf <- factor(restData$zipCode)
restData$zcf[1:10]
class(restData$zcf)

### Levels of factor variables

In [26]:
yesno <- sample(c("yes","no"),size=10,replace=TRUE)
yesnofac = factor(yesno,levels=c("yes","no"))
relevel(yesnofac,ref="no")
as.numeric(yesnofac)
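
Because `yesno` is sampled at random, the output of the cell above changes from run to run. A deterministic sketch with fixed, made-up values shows what `levels`, `relevel()` and `as.numeric()` do, and why converting a numeric-looking factor needs `as.character()` first:

```r
yesno <- c("yes", "no", "no", "yes")
f <- factor(yesno, levels = c("yes", "no"))
as.numeric(f)                  # 1 2 2 1: codes follow the level order

# relevel() moves the reference level to the front
f2 <- relevel(f, ref = "no")
levels(f2)                     # "no" "yes"
as.numeric(f2)                 # 2 1 1 2

# For a factor built from numbers, as.numeric() returns level codes,
# not the original values
zcf <- factor(c(21212, 21201, 21212))
codes  <- as.numeric(zcf)                # 2 1 2
values <- as.numeric(as.character(zcf))  # 21212 21201 21212
```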

### Cutting produces factor variables

In [27]:
library(Hmisc)
restData$zipGroups = cut2(restData$zipCode,g=4)
table(restData$zipGroups)


[-21226,21205) [ 21205,21220) [ 21220,21227) [ 21227,21287] 
           338            375            300            314 

### Using the mutate function

In [28]:
library(Hmisc); library(plyr)
restData2 = mutate(restData,zipGroups=cut2(zipCode,g=4))
table(restData2$zipGroups)



[-21226,21205) [ 21205,21220) [ 21220,21227) [ 21227,21287] 
           338            375            300            314 

### Common transforms

* `abs(x)`: absolute value
* `sqrt(x)`: square root
* `ceiling(x)`: `ceiling(3.475)` is 4
* `floor(x)`: `floor(3.475)` is 3
* `round(x,digits=n)`: `round(3.475,digits=2)` is 3.48
* `signif(x,digits=n)`: `signif(3.475,digits=2)` is 3.5
* `cos(x)`, `sin(x)`, etc.
* `log(x)`: natural logarithm
* `log2(x)`, `log10(x)`: other common logs
* `exp(x)`: exponentiating x

[http://www.biostat.jhsph.edu/~ajaffe/lec_winterR/Lecture%202.pdf](http://www.biostat.jhsph.edu/~ajaffe/lec_winterR/Lecture%202.pdf)
[http://statmethods.net/management/functions.html](http://statmethods.net/management/functions.html)

### Notes and further reading

* A tutorial from the developer of plyr - [http://plyr.had.co.nz/09-user/](http://plyr.had.co.nz/09-user/)
* Andrew Jaffe's R notes [http://www.biostat.jhsph.edu/~ajaffe/lec_winterR/Lecture%202.pdf](http://www.biostat.jhsph.edu/~ajaffe/lec_winterR/Lecture%202.pdf)
* A nice lecture on categorical and factor variables [http://www.stat.berkeley.edu/classes/s133/factors.html](http://www.stat.berkeley.edu/classes/s133/factors.html)

## Reshaping data

### The goal is tidy data

<img class=center src="https://raw.githubusercontent.com/DataScienceSpecialization/courses/master/assets/img/01_DataScientistToolbox/excel.png" height=300 />


1. Each variable forms a column
2. Each observation forms a row
3. Each table/file stores data about one kind of observation (e.g. people/hospitals).


[http://vita.had.co.nz/papers/tidy-data.pdf](http://vita.had.co.nz/papers/tidy-data.pdf)

[Leek, Taub, and Pineda 2011 PLoS One](http://www.plosone.org/article/info%3Adoi%2F10.1371%2Fjournal.pone.0026895)


### Start with reshaping

In [29]:
library(reshape2)
head(mtcars)

,mpg,cyl,disp,hp,drat,wt,qsec,vs,am,gear,carb
Mazda RX4,21.0,6,160,110,3.9,2.62,16.46,0,1,4,4
Mazda RX4 Wag,21.0,6,160,110,3.9,2.875,17.02,0,1,4,4
Datsun 710,22.8,4,108,93,3.85,2.32,18.61,1,1,4,1
Hornet 4 Drive,21.4,6,258,110,3.08,3.215,19.44,1,0,3,1
Hornet Sportabout,18.7,8,360,175,3.15,3.44,17.02,0,0,3,2
Valiant,18.1,6,225,105,2.76,3.46,20.22,1,0,3,1


### Melting data frames


In [30]:
mtcars$carname <- rownames(mtcars)
carMelt <- melt(mtcars,id=c("carname","gear","cyl"),measure.vars=c("mpg","hp"))
head(carMelt,n=3)
tail(carMelt,n=3)

carname,gear,cyl,variable,value
Mazda RX4,4,6,mpg,21.0
Mazda RX4 Wag,4,6,mpg,21.0
Datsun 710,4,4,mpg,22.8


,carname,gear,cyl,variable,value
62,Ferrari Dino,5,6,hp,175
63,Maserati Bora,5,8,hp,335
64,Volvo 142E,4,4,hp,109


[http://www.statmethods.net/management/reshape.html](http://www.statmethods.net/management/reshape.html)

### Casting data frames

[http://www.statmethods.net/management/reshape.html](http://www.statmethods.net/management/reshape.html)


In [32]:
cylData <- dcast(carMelt, cyl ~ variable)
cylData
cylData <- dcast(carMelt, cyl ~ variable,mean)
cylData


Aggregation function missing: defaulting to length


cyl,mpg,hp
4,11,11
6,7,7
8,14,14


cyl,mpg,hp
4,26.66364,82.63636
6,19.74286,122.28571
8,15.1,209.21429
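
The first call has no aggregation function, so `dcast()` falls back to counting rows per cell. The second call produces per-cylinder means. As a cross-check that does not need `reshape2`, the same means can be computed in base R with `aggregate()`; this is a sketch of an alternative route, not the method used in the lecture:

```r
# Mean mpg and hp for each number of cylinders, using base R only
cylMeans <- aggregate(cbind(mpg, hp) ~ cyl, data = mtcars, FUN = mean)
cylMeans
```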


### Averaging values

[http://www.r-bloggers.com/a-quick-primer-on-split-apply-combine-problems/](http://www.r-bloggers.com/a-quick-primer-on-split-apply-combine-problems/)


In [35]:
head(InsectSprays)
tapply(InsectSprays$count,InsectSprays$spray,sum)

count,spray
10,A
7,A
20,A
14,A
14,A
12,A


### Another way - split

In [36]:
spIns =  split(InsectSprays$count,InsectSprays$spray)
spIns

### Another way - apply

In [37]:
sprCount = lapply(spIns,sum)
sprCount

### Another way - combine

In [38]:
unlist(sprCount)
sapply(spIns,sum)
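
The three steps (split, apply, combine) can be followed end to end on a handful of made-up counts. The `vapply()` call at the end is an added suggestion: it does the same job as `sapply()` but states the expected result type up front.

```r
# Made-up counts for two sprays
counts <- c(10, 7, 20, 14, 3, 5)
spray  <- factor(c("A", "A", "A", "B", "B", "B"))

parts <- split(counts, spray)    # split: a list with one vector per spray
sums  <- lapply(parts, sum)      # apply: a list of totals
combined <- unlist(sums)         # combine: a named numeric vector
combined                         # A = 37, B = 22

# Same result in one step, with the return type declared
checked <- vapply(parts, sum, numeric(1))
```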

### Another way - plyr package

In [39]:
ddply(InsectSprays,.(spray),summarize,sum=sum(count))

spray,sum
A,174
B,184
C,25
D,59
E,42
F,200


### Creating a new variable


In [40]:
spraySums <- ddply(InsectSprays,.(spray),summarize,sum=ave(count,FUN=sum))
dim(spraySums)
head(spraySums)

spray,sum
A,174
A,174
A,174
A,174
A,174
A,174
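
Unlike `sum()`, `ave()` returns a vector as long as its input, which is why `spraySums` has one row per observation instead of one row per spray. A base R sketch of the same idea, passing the grouping variable to `ave()` directly:

```r
# One total per observation, repeated within each spray group
totals <- ave(InsectSprays$count, InsectSprays$spray, FUN = sum)
length(totals)                               # 72, same as nrow(InsectSprays)
unique(totals[InsectSprays$spray == "A"])    # 174
```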


### More information

* A tutorial from the developer of plyr - [http://plyr.had.co.nz/09-user/](http://plyr.had.co.nz/09-user/)
* A nice reshape tutorial [http://www.slideshare.net/jeffreybreen/reshaping-data-in-r](http://www.slideshare.net/jeffreybreen/reshaping-data-in-r)
* A good plyr primer - [http://www.r-bloggers.com/a-quick-primer-on-split-apply-combine-problems/](http://www.r-bloggers.com/a-quick-primer-on-split-apply-combine-problems/)
* See also the functions
  * acast - for casting as multi-dimensional arrays
  * arrange - for faster reordering without using order() commands
  * mutate - adding new variables

## Package dplyr

### dplyr

The data frame is a key data structure in statistics and in R.

* There is one observation per row

* Each column represents a variable or measure or characteristic

* The primary implementation you will use is the default R
  implementation

* There are other implementations, particularly relational database
  systems

### dplyr

* Developed by Hadley Wickham of RStudio

* An optimized and distilled version of the `plyr` package (also by Hadley)

* Does not provide any "new" functionality per se, but **greatly**
  simplifies existing functionality in R

* Provides a "grammar" (in particular, verbs) for data manipulation

* Is **very** fast, as many key operations are coded in C++

### dplyr Verbs

* `select`: return a subset of the columns of a data frame

* `filter`: extract a subset of rows from a data frame based on
  logical conditions

* `arrange`: reorder rows of a data frame


* `rename`: rename variables in a data frame

* `mutate`: add new variables/columns or transform existing variables

* `summarise` / `summarize`: generate summary statistics of different
  variables in the data frame, possibly within strata

There is also a handy `print` method that prevents you from printing a
lot of data to the console.

### dplyr Properties

* The first argument is a data frame.

* The subsequent arguments describe what to do with it, and you can
  refer to columns in the data frame directly without using the $
  operator (just use the names).

* The result is a new data frame

* Data frames must be properly formatted and annotated for this to all
  be useful

### Load the `dplyr` package


This step is important!

In [43]:
library(dplyr)

### Download data file

In [46]:
fileUrl <- "https://raw.githubusercontent.com/DataScienceSpecialization/courses/master/03_GettingData/dplyr/chicago.rds"
download.file(fileUrl,destfile="./chicago.rds",method="curl")

### `select`

In [47]:
chicago <- readRDS("chicago.rds")
dim(chicago)
head(select(chicago, 1:5))

city,tmpd,dptp,date,pm25tmean2
chic,31.5,31.5,1987-01-01,
chic,33.0,29.875,1987-01-02,
chic,33.0,27.375,1987-01-03,
chic,29.0,28.625,1987-01-04,
chic,32.0,28.875,1987-01-05,
chic,40.0,35.125,1987-01-06,


### `select`


In [48]:
names(chicago)[1:3]
head(select(chicago, city:dptp))

city,tmpd,dptp
chic,31.5,31.5
chic,33.0,29.875
chic,33.0,27.375
chic,29.0,28.625
chic,32.0,28.875
chic,40.0,35.125


### `select`

In dplyr you can do

In [49]:
head(select(chicago, -(city:dptp)))

date,pm25tmean2,pm10tmean2,o3tmean2,no2tmean2
1987-01-01,,34.0,4.25,19.9881
1987-01-02,,,3.304348,23.19099
1987-01-03,,34.16667,3.333333,23.81548
1987-01-04,,47.0,4.375,30.43452
1987-01-05,,,4.75,30.33333
1987-01-06,,48.0,5.833333,25.77233


Equivalent base R

In [50]:
i <- match("city", names(chicago))
j <- match("dptp", names(chicago))
head(chicago[, -(i:j)])

date,pm25tmean2,pm10tmean2,o3tmean2,no2tmean2
1987-01-01,,34.0,4.25,19.9881
1987-01-02,,,3.304348,23.19099
1987-01-03,,34.16667,3.333333,23.81548
1987-01-04,,47.0,4.375,30.43452
1987-01-05,,,4.75,30.33333
1987-01-06,,48.0,5.833333,25.77233


### `filter`

In [51]:
chic.f <- filter(chicago, pm25tmean2 > 30)
head(select(chic.f, 1:3, pm25tmean2), 10)


city,tmpd,dptp,pm25tmean2
chic,23,21.9,38.1
chic,28,25.8,33.95
chic,55,51.3,39.4
chic,59,53.7,35.4
chic,57,52.0,33.3
chic,57,56.0,32.1
chic,75,65.8,56.5
chic,61,59.0,33.8
chic,73,60.3,30.3
chic,78,67.1,41.4


### `filter`

In [52]:
chic.f <- filter(chicago, pm25tmean2 > 30 & tmpd > 80)
head(select(chic.f, 1:3, pm25tmean2, tmpd), 10)

city,tmpd,dptp,pm25tmean2
chic,81,71.2,39.6
chic,81,70.4,31.5
chic,82,72.2,32.3
chic,84,72.9,43.7
chic,85,72.6,38.8375
chic,84,72.6,38.2
chic,82,67.4,33.0
chic,82,63.5,42.5
chic,81,70.4,33.1
chic,82,66.2,38.85


### `arrange`

Reordering rows of a data frame (while preserving corresponding order
of other columns) is normally a pain to do in R.

In [53]:
chicago <- arrange(chicago, date)
head(select(chicago, date, pm25tmean2), 3)
tail(select(chicago, date, pm25tmean2), 3)


date,pm25tmean2
1987-01-01,
1987-01-02,
1987-01-03,


,date,pm25tmean2
6938,2005-12-29,7.45
6939,2005-12-30,15.05714
6940,2005-12-31,15.0


### `arrange`

Rows can also be arranged in descending order of a column.

In [54]:
chicago <- arrange(chicago, desc(date))
head(select(chicago, date, pm25tmean2), 3)
tail(select(chicago, date, pm25tmean2), 3)

date,pm25tmean2
2005-12-31,15.0
2005-12-30,15.05714
2005-12-29,7.45


,date,pm25tmean2
6938,1987-01-03,
6939,1987-01-02,
6940,1987-01-01,


### `rename`

Renaming a variable in a data frame in R is surprisingly hard to do!

In [55]:
head(chicago[, 1:5], 3)
chicago <- rename(chicago, dewpoint = dptp, 
                  pm25 = pm25tmean2)
head(chicago[, 1:5], 3)

city,tmpd,dptp,date,pm25tmean2
chic,35,30.1,2005-12-31,15.0
chic,36,31.0,2005-12-30,15.05714
chic,35,29.4,2005-12-29,7.45


city,tmpd,dewpoint,date,pm25
chic,35,30.1,2005-12-31,15.0
chic,36,31.0,2005-12-30,15.05714
chic,35,29.4,2005-12-29,7.45
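
For comparison, a base R version of the same renaming might look like the sketch below. It uses a one-row made-up data frame rather than the Chicago data, and matches columns by name so that it does not depend on column positions:

```r
# Made-up single-row stand-in for the chicago data frame
df <- data.frame(city = "chic", dptp = 30.1, pm25tmean2 = 15)

# Replace each old name with its new one
names(df)[names(df) == "dptp"] <- "dewpoint"
names(df)[names(df) == "pm25tmean2"] <- "pm25"
names(df)   # "city" "dewpoint" "pm25"
```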


### `mutate`

In [56]:
chicago <- mutate(chicago, 
                  pm25detrend=pm25-mean(pm25, na.rm=TRUE))
head(select(chicago, pm25, pm25detrend))

pm25,pm25detrend
15.0,-1.230958
15.05714,-1.173815
7.45,-8.780958
17.75,1.519042
23.56,7.329042
8.4,-7.830958


### `group_by`

Generating summary statistics by stratum

In [57]:
chicago <- mutate(chicago, 
                  tempcat = factor(1 * (tmpd > 80), 
                                   labels = c("cold", "hot")))
hotcold <- group_by(chicago, tempcat)
summarize(hotcold, pm25 = mean(pm25, na.rm = TRUE), 
          o3 = max(o3tmean2), 
          no2 = median(no2tmean2))

tempcat,pm25,o3,no2
cold,15.97807,66.5875,24.54924
hot,26.48118,62.969656,24.9387
,47.7375,9.416667,37.44444
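
The third row, with no `tempcat` label, appears to be the group of days on which `tmpd` is missing: the comparison `tmpd > 80` is `NA` there, so the factor is `NA` too. A base R sketch with made-up temperatures shows how this arises:

```r
# Made-up temperatures, one of them missing
tmpd <- c(70, 85, NA, 90)

# Same construction as above: FALSE/TRUE become 0/1, then get labels
tempcat <- factor(1 * (tmpd > 80), labels = c("cold", "hot"))
as.character(tempcat)             # "cold" "hot" NA "hot"

# The NA shows up as its own group when counting
table(tempcat, useNA = "ifany")
```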


### `group_by`

Generating summary statistics by stratum

In [61]:
chicago <- mutate(chicago, 
                  year = as.POSIXlt(date)$year + 1900)
years <- group_by(chicago, year)
summarize(years, pm25 = mean(pm25, na.rm = TRUE), 
          o3 = max(o3tmean2, na.rm = TRUE), 
          no2 = median(no2tmean2, na.rm = TRUE))
chicago$year <- NULL  ## Can't use mutate to create an existing variable

year,pm25,o3,no2
1987,,62.96966,23.49369
1988,,61.67708,24.52296
1989,,59.72727,26.14062
1990,,52.22917,22.59583
1991,,63.10417,21.38194
1992,,50.8287,24.78921
1993,,44.30093,25.76993
1994,,52.17844,28.475
1995,,66.5875,27.26042
1996,,58.39583,26.38715


### `%>%`

In [74]:
chicago %>% 
    mutate(month = as.POSIXlt(date)$mon + 1) %>% 
    group_by(month) %>% 
    summarize(pm25 = mean(pm25, na.rm = TRUE), 
              o3 = max(o3tmean2, na.rm = TRUE), 
              no2 = median(no2tmean2, na.rm = TRUE))                                                                         

month,pm25,o3,no2
1,17.76996,28.22222,25.35417
2,20.37513,37.375,26.78034
3,17.40818,39.05,26.76984
4,13.85879,47.94907,25.03125
5,14.0742,52.75,24.22222
6,15.86461,66.5875,25.0114
7,16.57087,59.54167,22.38442
8,16.9338,53.96701,22.98333
9,15.91279,57.48864,24.47917
10,14.23557,47.09275,24.15217
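
The `%>%` operator passes the result on its left as the first argument of the function on its right, so a pipeline is just another way to write nested calls. A minimal sketch on the built-in `mtcars` data (chosen here only so that the example runs without the Chicago file):

```r
library(dplyr)

# Nested form: read from the inside out
nested <- summarize(group_by(mtcars, cyl), mpg = mean(mpg))

# Piped form: read from top to bottom
piped <- mtcars %>%
    group_by(cyl) %>%
    summarize(mpg = mean(mpg))

# Both forms give the same table
all.equal(as.data.frame(nested), as.data.frame(piped))
```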



### dplyr

Once you learn the dplyr "grammar" there are a few additional benefits:

* dplyr can work with other data frame "backends"

  * `data.table` for large, fast tables

  * SQL interface for relational databases via the DBI package


## Merging data

### Peer review experiment data


<img class=center src="https://raw.githubusercontent.com/DataScienceSpecialization/courses/master/assets/img/03_ObtainingData/cooperation.png" height=500 />


[http://www.plosone.org/article/info:doi/10.1371/journal.pone.0026895](http://www.plosone.org/article/info:doi/10.1371/journal.pone.0026895)

### Peer review data

In [77]:
if(!file.exists("./data")){dir.create("./data")}
fileUrl1 = "https://raw.githubusercontent.com/DataScienceSpecialization/courses/master/03_GettingData/03_05_mergingData/data/reviews.csv"
fileUrl2 = "https://raw.githubusercontent.com/DataScienceSpecialization/courses/master/03_GettingData/03_05_mergingData/data/solutions.csv"
download.file(fileUrl1,destfile="./data/reviews.csv",method="curl")
download.file(fileUrl2,destfile="./data/solutions.csv",method="curl")
reviews = read.csv("./data/reviews.csv"); solutions <- read.csv("./data/solutions.csv")
head(reviews,2)
head(solutions,2)

id,solution_id,reviewer_id,start,stop,time_left,accept
1,3,27,1304095698,1304095758,1754,1
2,4,22,1304095188,1304095206,2306,1


id,problem_id,subject_id,start,stop,time_left,answer
1,156,29,1304095119,1304095169,2343,B
2,269,25,1304095119,1304095183,2329,C


### Merging data - merge()

* Merges data frames
* Important parameters: _x_,_y_,_by_,_by.x_,_by.y_,_all_

In [79]:
names(reviews)
names(solutions)

### Merging data - merge()

In [80]:
mergedData = merge(reviews,solutions,by.x="solution_id",by.y="id",all=TRUE)
head(mergedData)

solution_id,id,reviewer_id,start.x,stop.x,time_left.x,accept,problem_id,subject_id,start.y,stop.y,time_left.y,answer
1,4,26,1304095267,1304095423,2089,1,156,29,1304095119,1304095169,2343,B
2,6,29,1304095471,1304095513,1999,1,269,25,1304095119,1304095183,2329,C
3,1,27,1304095698,1304095758,1754,1,34,22,1304095127,1304095146,2366,C
4,2,22,1304095188,1304095206,2306,1,19,23,1304095127,1304095150,2362,D
5,3,28,1304095276,1304095320,2192,1,605,26,1304095127,1304095167,2345,A
6,16,22,1304095303,1304095471,2041,1,384,27,1304095131,1304095270,2242,C
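
The effect of `by.x`, `by.y` and `all` is easier to see on a few made-up rows. In this sketch the two small data frames only mimic the structure of `reviews` and `solutions`; their values are invented:

```r
# Made-up stand-ins: three reviews, four solutions
reviews   <- data.frame(id = 1:3, solution_id = c(3, 1, 2), accept = c(1, 1, 0))
solutions <- data.frame(id = 1:4, answer = c("B", "C", "C", "D"))

# Inner join: only solutions that have a review
inner <- merge(reviews, solutions, by.x = "solution_id", by.y = "id")
nrow(inner)    # 3

# all = TRUE keeps unmatched rows from both sides, filled with NA
outer <- merge(reviews, solutions, by.x = "solution_id", by.y = "id", all = TRUE)
nrow(outer)    # 4
outer[outer$solution_id == 4, ]
```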


### Default - merge all common column names

In [81]:
intersect(names(solutions),names(reviews))
mergedData2 = merge(reviews,solutions,all=TRUE)
head(mergedData2)

id,start,stop,time_left,solution_id,reviewer_id,accept,problem_id,subject_id,answer
1,1304095119,1304095169,2343,,,,156.0,29.0,B
1,1304095698,1304095758,1754,3.0,27.0,1.0,,,
2,1304095119,1304095183,2329,,,,269.0,25.0,C
2,1304095188,1304095206,2306,4.0,22.0,1.0,,,
3,1304095127,1304095146,2366,,,,34.0,22.0,C
3,1304095276,1304095320,2192,5.0,28.0,1.0,,,


### Using join in the plyr package 

_Faster, but less full featured - defaults to left join, see help file for more_

In [82]:
df1 = data.frame(id=sample(1:10),x=rnorm(10))
df2 = data.frame(id=sample(1:10),y=rnorm(10))
arrange(join(df1,df2),id)

Joining by: id


id,x,y
1,-1.12776022,-0.2511562
2,0.12237495,-1.8707042
3,1.15277358,-0.2262397
4,0.7739239,0.2800294
5,-0.06356345,0.6836208
6,1.20481314,-0.4803662
7,0.85872874,2.110322
8,-1.66200452,-0.403788
9,-0.33168102,-1.1575186
10,0.22187199,-2.07279


### If you have multiple data frames

In [83]:
df1 = data.frame(id=sample(1:10),x=rnorm(10))
df2 = data.frame(id=sample(1:10),y=rnorm(10))
df3 = data.frame(id=sample(1:10),z=rnorm(10))
dfList = list(df1,df2,df3)
join_all(dfList)

Joining by: id
Joining by: id


id,x,y,z
1,-0.3313171,0.095952941,0.89247231
8,0.0902374,0.355782558,-0.0464247
4,-1.0156393,-0.002788904,-1.10335279
10,1.0728304,0.412972922,0.86268372
6,-0.216883,1.846073745,1.40645167
9,-0.8451512,0.271928758,0.03163361
3,0.1696061,-0.121032591,-0.66245901
2,-0.5499409,-0.644574775,-0.49926052
5,-0.7746011,0.943470973,0.40611362
7,-0.7813288,1.85554411,-0.74399529


### More on merging data

* The quick R data merging page - [http://www.statmethods.net/management/merging.html](http://www.statmethods.net/management/merging.html)
* plyr information - [http://plyr.had.co.nz/](http://plyr.had.co.nz/)
* Types of joins - [http://en.wikipedia.org/wiki/Join_(SQL)](http://en.wikipedia.org/wiki/Join_(SQL))

## Editing text variables

### Example - Baltimore camera data

<img class=center src=https://raw.githubusercontent.com/DataScienceSpecialization/courses/master/assets/img/03_ObtainingData/cameras.png height=500>

[https://data.baltimorecity.gov/Transportation/Baltimore-Fixed-Speed-Cameras/dz54-2aru](https://data.baltimorecity.gov/Transportation/Baltimore-Fixed-Speed-Cameras/dz54-2aru)


### Fixing character vectors - tolower(), toupper()

In [84]:
if(!file.exists("./data")){dir.create("./data")}
fileUrl <- "https://data.baltimorecity.gov/api/views/dz54-2aru/rows.csv?accessType=DOWNLOAD"
download.file(fileUrl,destfile="./data/cameras.csv",method="curl")
cameraData <- read.csv("./data/cameras.csv")
names(cameraData)
tolower(names(cameraData))

### Fixing character vectors - strsplit()

* Good for automatically splitting variable names
* Important parameters: _x_, _split_

In [85]:
splitNames = strsplit(names(cameraData),"\\.")
splitNames[[5]]
splitNames[[6]]

### Quick aside - lists

[http://www.biostat.jhsph.edu/~ajaffe/lec_winterR/Lecture%203.pdf](http://www.biostat.jhsph.edu/~ajaffe/lec_winterR/Lecture%203.pdf)


In [86]:
mylist <- list(letters = c("A", "b", "c"), numbers = 1:3, matrix(1:25, ncol = 5))
head(mylist)


0,1,2,3,4
1,6,11,16,21
2,7,12,17,22
3,8,13,18,23
4,9,14,19,24
5,10,15,20,25


### Quick aside - lists

[http://www.biostat.jhsph.edu/~ajaffe/lec_winterR/Lecture%203.pdf](http://www.biostat.jhsph.edu/~ajaffe/lec_winterR/Lecture%203.pdf)


In [87]:
mylist[1]
mylist$letters
mylist[[1]]


### Fixing character vectors - sapply()

* Applies a function to each element in a vector or list
* Important parameters: _X_,_FUN_

In [88]:
splitNames[[6]][1]
firstElement <- function(x){x[1]}
sapply(splitNames,firstElement)
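
As a self-contained sketch of the same idea, the list below is a hypothetical stand-in for the output of `strsplit()`. `vapply()` is shown as a stricter alternative; it is not used in the original material.

```r
# Hypothetical split names, shaped like the output of strsplit()
exampleSplit <- list("address", "direction", c("Location", "1"))

# Keep only the first piece of each element
firstElement <- function(x) x[1]
cleanNames <- sapply(exampleSplit, firstElement)

# vapply() does the same but checks that every result is a single string
cleanNamesChecked <- vapply(exampleSplit, firstElement, character(1))

cleanNames
```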

### Peer review experiment data


<img class=center src="https://raw.githubusercontent.com/DataScienceSpecialization/courses/master/assets/img/03_ObtainingData/cooperation.png" height=500 />


[http://www.plosone.org/article/info:doi/10.1371/journal.pone.0026895](http://www.plosone.org/article/info:doi/10.1371/journal.pone.0026895)


### Fixing character vectors - sub()

* Important parameters: _pattern_, _replacement_, _x_

In [89]:
names(reviews)
sub("_","",names(reviews))

### Fixing character vectors - gsub()

In [90]:
testName <- "this_is_a_test"
sub("_","",testName)
gsub("_","",testName)

### Finding values - grep(),grepl()

In [91]:
grep("Alameda",cameraData$intersection)
table(grepl("Alameda",cameraData$intersection))
cameraData2 <- cameraData[!grepl("Alameda",cameraData$intersection),]


FALSE  TRUE 
   77     3 

### More on grep()

[http://www.biostat.jhsph.edu/~ajaffe/lec_winterR/Lecture%203.pdf](http://www.biostat.jhsph.edu/~ajaffe/lec_winterR/Lecture%203.pdf)



In [93]:
grep("Alameda",cameraData$intersection,value=TRUE)
grep("JeffStreet",cameraData$intersection)
length(grep("JeffStreet",cameraData$intersection))
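
The different return types are easier to compare side by side. This sketch uses a made-up vector of intersections rather than the real Baltimore data.

```r
# Hypothetical intersections, not the real Baltimore data
intersections <- c("Caton Ave & Benson Ave",
                   "The Alameda & 33rd St",
                   "E 33rd St & The Alameda",
                   "Erdman Ave & Macon St")

hits    <- grep("Alameda", intersections)                # positions of the matches
isMatch <- grepl("Alameda", intersections)               # one TRUE/FALSE per element
matched <- grep("Alameda", intersections, value = TRUE)  # the matching strings themselves

# A pattern that never occurs returns integer(0), so its length is 0
noHits <- grep("JeffStreet", intersections)
length(noHits) == 0
```

Checking `length()` of the result is a simple way to test whether a value appears at all before subsetting on it.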

### More useful string functions


In [94]:
library(stringr)
nchar("Jeffrey Leek")
substr("Jeffrey Leek",1,7)
paste("Jeffrey","Leek")

### More useful string functions

In [95]:
paste0("Jeffrey","Leek")
str_trim("Jeff      ")

### Important points about text in data sets

* Names of variables should be 
  * All lower case when possible
  * Descriptive (Diagnosis versus Dx)
  * Not duplicated
  * Free of underscores, dots, and white spaces
* Variables with character values
  * Should usually be made into factor variables (depends on application)
  * Should be descriptive (use TRUE/FALSE instead of 0/1 and Male/Female versus 0/1 or M/F)
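
One possible way to apply the naming guidelines with the functions from this section is sketched below. The variable names are invented for illustration.

```r
# Hypothetical messy variable names
rawNames <- c("Patient_ID", "Dx.Code", "Visit Date", "patient_id")

# Lower-case everything, then drop underscores, dots, and white space
cleanNames <- gsub("[_. ]", "", tolower(rawNames))

# Cleaning can create duplicates, so check for them afterwards
hasDuplicates <- anyDuplicated(cleanNames) > 0

cleanNames
hasDuplicates
```

Here the first and last names collapse to the same string, so the duplicate check flags a problem that still has to be resolved by hand.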

## Regular expressions

### Regular expressions

- Regular expressions can be thought of as a combination of literals and _metacharacters_
- To draw an analogy with natural language, think of literal text forming the words of this language, and the metacharacters defining its grammar
- Regular expressions have a rich set of metacharacters


### Literals

The simplest pattern consists only of literals. The literal “nuclear” would match the following lines:

```markdown
Ooh. I just learned that to keep myself alive after a
nuclear blast! All I have to do is milk some rats
then drink the milk. Aweosme. :}

Laozi says nuclear weapons are mas macho

Chaos in a country that has nuclear weapons -- not good.

my nephew is trying to teach me nuclear physics, or
possibly just trying to show me how smart he is
so I’ll be proud of him [which I am].

lol if you ever say "nuclear" people immediately think
DEATH by radiation LOL
```

### Literals

The literal “Obama” would match the following lines:

```markdown
Politics r dum. Not 2 long ago Clinton was sayin Obama
was crap n now she sez vote 4 him n unite? WTF?
Screw em both + Mcain. Go Ron Paul!

Clinton conceeds to Obama but will her followers listen??

Are we sure Chelsea didn’t vote for Obama?

thinking ... Michelle Obama is terrific!

jetlag..no sleep...early mornig to starbux..Ms. Obama
was moving
```

### Regular Expressions

- Simplest pattern consists only of literals; a match occurs if the sequence of literals occurs anywhere in the text being tested

- What if we only want the word “Obama”, or sentences that end in the word “Clinton”, “clinton”, or “clinto”?


### Regular Expressions

We need a way to express

- whitespace word boundaries
- sets of literals
- the beginning and end of a line
- alternatives (“war” or “peace”)

Metacharacters to the rescue!

### Metacharacters

^ represents the start of a line

```markdown
^i think
```

will match the lines

```markdown
i think we all rule for participating
i think i have been outed
i think this will be quite fun actually
i think i need to go to work
i think i first saw zombo in 1999.
```

### Metacharacters

$ represents the end of a line

```markdown
morning$
```

will match the lines

```markdown
well they had something this morning
then had to catch a tram home in the morning
dog obedience school in the morning
and yes happy birthday i forgot to say it earlier this morning
I walked in the rain this morning
good morning
```
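
In R, the two anchors can be tried with `grepl()`. This is a small sketch on made-up lines.

```r
# Made-up lines for illustration
tweets <- c("i think i need to go to work",
            "well i think so",
            "good morning",
            "morning has broken")

# ^ anchors the pattern to the start of the string
startsWithThink <- grepl("^i think", tweets)

# $ anchors the pattern to the end of the string
endsWithMorning <- grepl("morning$", tweets)

tweets[startsWithThink]
tweets[endsWithMorning]
```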

### Character Classes with []

We can list a set of characters we will accept at a given point in the match

```markdown
[Bb][Uu][Ss][Hh]
```

will match the lines

```markdown
The democrats are playing, "Name the worst thing about Bush!"
I smelled the desert creosote bush, brownies, BBQ chicken
BBQ and bushwalking at Molonglo Gorge
Bush TOLD you that North Korea is part of the Axis of Evil
I’m listening to Bush - Hurricane (Album Version)
```

### Character Classes with []

```markdown
^[Ii] am
```

will match

```markdown
i am so angry at my boyfriend i can’t even bear to
look at him

i am boycotting the apple store

I am twittering from iPhone

I am a very vengeful person when you ruin my sweetheart.

I am so over this. I need food. Mmmm bacon...
```

### Character Classes with []

Similarly, you can specify a range of letters [a-z] or [a-zA-Z]; notice that the order doesn’t matter

```markdown
^[0-9][a-zA-Z]
```

will match the lines

```markdown
7th inning stretch
2nd half soon to begin. OSU did just win something
3am - cant sleep - too hot still.. :(
5ft 7 sent from heaven
1st sign of starvagtion
```

### Character Classes with []

When used at the beginning of a character class, the “\^” is also a metacharacter and indicates matching characters NOT in the indicated class

```markdown
[^?.]$
```

will match the lines

```markdown
i like basketballs
6 and 9
dont worry... we all die anyway!
Not in Baghdad
helicopter under water? hmmm
```
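
The three character-class patterns from this section can be checked in R as follows. The input lines are invented for the sketch.

```r
# Made-up lines for illustration
tweets <- c("Bush TOLD you",
            "creosote bush",
            "7th inning stretch",
            "helicopter under water? hmmm",
            "Not in Baghdad.")

# Each position accepts either case of the letter
anyCaseBush <- grepl("[Bb][Uu][Ss][Hh]", tweets)

# A digit followed by a letter at the start of the line
digitThenLetter <- grepl("^[0-9][a-zA-Z]", tweets)

# Last character is anything except a question mark or a period
noFinalPunct <- grepl("[^?.]$", tweets)
```

Inside square brackets, `?` and `.` are ordinary characters, so they need no escaping there.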


## Regular expressions II

### More Metacharacters

“.” is used to refer to any character. So

```markdown
9.11
```

will match the lines

```markdown
its stupid the post 9-11 rules
if any 1 of us did 9/11 we would have been caught in days.
NetBios: scanning ip 203.169.114.66
Front Door 9:11:46 AM
Sings: 0118999881999119725...3 !
```

### More Metacharacters: |

This does not mean “pipe” in the context of regular expressions; instead it translates to “or”; we can use it to combine two expressions, the subexpressions being called alternatives

```markdown
flood|fire
```

will match the lines

```markdown
is firewire like usb on none macs?
the global flood makes sense within the context of the bible
yeah ive had the fire on tonight
... and the floods, hurricanes, killer heatwaves, rednecks, gun nuts, etc.
```


### More Metacharacters: |

We can include any number of alternatives...

```markdown
flood|earthquake|hurricane|coldfire
```

will match the lines

```markdown
Not a whole lot of hurricanes in the Arctic.
We do have earthquakes nearly every day somewhere in our State
hurricanes swirl in the other direction
coldfire is STRAIGHT!
’cause we keep getting earthquakes
```

### More Metacharacters: |

The alternatives can be real expressions and not just literals

```markdown
^[Gg]ood|[Bb]ad
```

will match the lines

```markdown
good to hear some good knews from someone here
Good afternoon fellow american infidels!
good on you-what do you drive?
Katie... guess they had bad experiences...
my middle name is trouble, Miss Bad News
```

### More Metacharacters: ( and )

Subexpressions are often contained in parentheses to constrain the alternatives

```markdown
^([Gg]ood|[Bb]ad)
```

will match the lines

```markdown
bad habbit
bad coordination today
good, becuase there is nothing worse than a man in kinky underwear
Badcop, its because people want to use drugs
Good Monday Holiday
Good riddance to Limey
```
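
The effect of the parentheses shows up when the two patterns are run on the same input. A minimal sketch with made-up lines:

```r
# Made-up lines for illustration
tweets <- c("good to hear",
            "Miss Bad News",
            "bad habbit",
            "not so good")

# Without parentheses, ^ applies only to the first alternative
ungrouped <- grepl("^[Gg]ood|[Bb]ad", tweets)

# With parentheses, ^ applies to both alternatives
grouped <- grepl("^([Gg]ood|[Bb]ad)", tweets)

ungrouped
grouped
```

Only the second line differs: "Bad" appears in the middle of it, which the ungrouped pattern accepts and the grouped pattern rejects.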

### More Metacharacters: ?

The question mark indicates that the indicated expression is optional

```markdown
[Gg]eorge( [Ww]\.)? [Bb]ush
```

will match the lines

```markdown
i bet i can spell better than you and george bush combined
BBC reported that President George W. Bush claimed God told him to invade I
a bird in the hand is worth two george bushes
```


### One thing to note...

In the following

```markdown
[Gg]eorge( [Ww]\.)? [Bb]ush
```

we wanted to match a “.” as a literal period; to do that, we had to “escape” the metacharacter, preceding it with a backslash. In general, we have to do this for any metacharacter we want to include in our match.
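
In R there is one extra layer: the pattern is written as a string literal, so the backslash itself has to be doubled. A minimal sketch with made-up input lines:

```r
# Made-up lines for illustration
tweets <- c("george bush", "George W. Bush", "George Wx Bush")

# "\\." in the R string is the regular expression \. (a literal period)
escaped <- grepl("[Gg]eorge( [Ww]\\.)? [Bb]ush", tweets)

# Without the escape, "." matches any character, including the "x"
unescaped <- grepl("[Gg]eorge( [Ww].)? [Bb]ush", tweets)

escaped
unescaped
```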

### More metacharacters: * and +

The * and + signs are metacharacters used to indicate repetition; * means “any number, including none, of the item” and + means “at least one of the item”

```markdown
\(.*\)
```

will match the lines

```markdown
anyone wanna chat? (24, m, germany)
hello, 20.m here... ( east area + drives + webcam )
(he means older men)
()
```

### More metacharacters: * and +

The * and + signs are metacharacters used to indicate repetition; * means “any number, including none, of the item” and + means “at least one of the item”

```markdown
[0-9]+ (.*)[0-9]+
```

will match the lines

```markdown
working as MP here 720 MP battallion, 42nd birgade
so say 2 or 3 years at colleage and 4 at uni makes us 23 when and if we fin
it went down on several occasions for like, 3 or 4 *days*
Mmmm its time 4 me 2 go 2 bed
```

### More metacharacters: { and }

{ and } are referred to as interval quantifiers; they let us specify the minimum and maximum number of matches of an expression

```markdown
[Bb]ush( +[^ ]+ +){1,5} debate
```

will match the lines

```markdown
Bush has historically won all major debates he’s done.
in my view, Bush doesn’t need these debates..
bush doesn’t need the debates? maybe you are right
That’s what Bush supporters are doing about the debate.
Felix, I don’t disagree that Bush was poorly prepared for the debate.
indeed, but still, Bush should have taken the debate more seriously.
Keep repeating that Bush smirked and scowled during the debate
```

### More metacharacters: { and }

- {m,n} means at least m but not more than n matches
- {m} means exactly m matches
- {m,} means at least m matches
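
The three interval forms can be compared on a small made-up vector. This is a sketch; the codes below are invented.

```r
# Hypothetical codes: the letter "a" followed by one to four digits
codes <- c("a1", "a12", "a123", "a1234")

exactlyTwo <- grepl("^a[0-9]{2}$", codes)    # exactly 2 digits
twoToThree <- grepl("^a[0-9]{2,3}$", codes)  # at least 2, at most 3 digits
twoOrMore  <- grepl("^a[0-9]{2,}$", codes)   # at least 2 digits

codes[exactlyTwo]
codes[twoToThree]
codes[twoOrMore]
```

The `^` and `$` anchors matter here: without them, a run of two digits inside a longer run would also count as a match.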


### More metacharacters: ( and ) revisited

- In most implementations of regular expressions, the parentheses not only limit the scope of alternatives divided by a “|”, but also can be used to “remember” text matched by the subexpression enclosed
- We refer to the matched text with \1, \2, etc.


### More metacharacters: ( and ) revisited

So the expression

```markdown
+([a-zA-Z]+) +\1 +
```

will match the lines

```markdown
time for bed, night night twitter!
blah blah blah blah
my tattoo is so so itchy today
i was standing all all alone against the world outside...
hi anybody anybody at home
estudiando css css css css.... que desastritooooo
```
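
In R the backreference is written `\\1` inside the string literal, and it can be used in the replacement of `sub()` as well as in the pattern. A minimal sketch on made-up lines:

```r
# Made-up lines for illustration
tweets <- c("time for bed, night night twitter!",
            "my tattoo is so so itchy today",
            "nothing repeated here")

# A word, then spaces, then the same word again
pattern <- " +([a-zA-Z]+) +\\1 +"
hasRepeat <- grepl(pattern, tweets)

# In the replacement, \\1 inserts the remembered word once
collapsed <- sub(pattern, " \\1 ", tweets[2])

hasRepeat
collapsed
```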


### More metacharacters: ( and ) revisited

The * is “greedy” so it always matches the _longest_ possible string that satisfies the regular expression. So

```markdown
^s(.*)s
```

matches

```markdown
sitting at starbucks
setting up mysql and rails
studying stuff for the exams
spaghetti with marshmallows
stop fighting with crackers
sore shoulders, stupid ergonomics
```

### More metacharacters: ( and ) revisited

The greediness of * can be turned off with the ?, as in

```markdown
^s(.*?)s$
```
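
To see what each version actually captures, the matched text can be extracted with `regexpr()` and `regmatches()`. In this sketch the trailing `$` is dropped, because with it the match must reach the end of the line either way; `perl = TRUE` is an assumption made here to get Perl-style lazy matching.

```r
x <- "sitting at starbucks"

# Greedy: .* runs as far as possible, up to the last "s"
greedy <- regmatches(x, regexpr("^s(.*)s", x, perl = TRUE))

# Lazy: .*? stops at the first "s" that completes the match
lazy <- regmatches(x, regexpr("^s(.*?)s", x, perl = TRUE))

greedy
lazy
```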


### Summary

- Regular expressions are used in many different languages; they are not unique to R.
- Regular expressions are composed of literals and metacharacters that represent sets or classes of characters/words
- Text processing via regular expressions is a very powerful way to extract data from “unfriendly” sources (not all data comes as a CSV file)
- Used with the functions `grep`,`grepl`,`sub`,`gsub` and others that involve searching for text strings

## Working with dates

### Starting simple

In [96]:
d1 = date()
d1
class(d1)

### Date class

In [97]:
d2 = Sys.Date()
d2
class(d2)


### Formatting dates

`%d` = day as number (01-31), `%a` = abbreviated weekday, `%A` = unabbreviated weekday, `%m` = month (01-12), `%b` = abbreviated month,
`%B` = unabbreviated month, `%y` = 2 digit year, `%Y` = four digit year

In [98]:
format(d2,"%a %b %d")
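
Because `Sys.Date()` changes every day, a fixed date makes the format codes easier to check. This sketch uses only numeric codes, since the text produced by `%a` and `%b` depends on the locale.

```r
# A fixed date chosen for illustration
d <- as.Date("2014-01-08")

dayMonthYear <- format(d, "%d/%m/%y")  # two-digit day, month, and year
isoStyle     <- format(d, "%Y-%m-%d")  # four-digit year first
dayOnly      <- format(d, "%d")        # zero-padded day of the month

dayMonthYear
isoStyle
dayOnly
```

Note that `format()` always returns character strings, so the results can no longer be used in date arithmetic.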


### Creating dates

In [99]:
x = c("1jan1960", "2jan1960", "31mar1960", "30jul1960"); z = as.Date(x, "%d%b%Y")
z
z[1] - z[2]
as.numeric(z[1]-z[2])

Time difference of -1 days

### Converting to Julian 

In [100]:
weekdays(d2)
months(d2)
julian(d2)
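
`julian()` returns the number of days since an origin, which is 1970-01-01 by default. A small sketch with fixed dates, chosen here only for illustration:

```r
# Days since the default origin, 1970-01-01
j <- julian(as.Date("1970-01-10"))
daysSinceOrigin <- as.numeric(j)

# Subtracting dates gives a difference in days; 1960 was a leap year
z <- as.Date(c("1960-01-01", "1960-03-31"))
daysBetween <- as.numeric(z[2] - z[1])

daysSinceOrigin
daysBetween
```

The origin is stored as an attribute on the result of `julian()`, which is why `as.numeric()` is used to get a plain number.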

### Lubridate 

[http://www.r-statistics.com/2012/03/do-more-with-dates-and-times-in-r-with-lubridate-1-1-0/](http://www.r-statistics.com/2012/03/do-more-with-dates-and-times-in-r-with-lubridate-1-1-0/)


In [101]:
library(lubridate); ymd("20140108")
mdy("08/04/2013")
dmy("03-04-2013")


Attaching package: ‘lubridate’

The following object is masked from ‘package:plyr’:

    here

The following object is masked from ‘package:base’:

    date



### Dealing with times

[http://www.r-statistics.com/2012/03/do-more-with-dates-and-times-in-r-with-lubridate-1-1-0/](http://www.r-statistics.com/2012/03/do-more-with-dates-and-times-in-r-with-lubridate-1-1-0/)


In [102]:
ymd_hms("2011-08-03 10:15:03")
ymd_hms("2011-08-03 10:15:03",tz="Pacific/Auckland")
?Sys.timezone

[1] "2011-08-03 10:15:03 UTC"

[1] "2011-08-03 10:15:03 NZST"

### Some functions have slightly different syntax

In [103]:
x = dmy(c("1jan2013", "2jan2013", "31mar2013", "30jul2013"))
wday(x[1])
wday(x[1],label=TRUE)


### Notes and further resources

* More information in this nice lubridate tutorial [http://www.r-statistics.com/2012/03/do-more-with-dates-and-times-in-r-with-lubridate-1-1-0/](http://www.r-statistics.com/2012/03/do-more-with-dates-and-times-in-r-with-lubridate-1-1-0/)
* The lubridate vignette covers the same content [http://cran.r-project.org/web/packages/lubridate/vignettes/lubridate.html](http://cran.r-project.org/web/packages/lubridate/vignettes/lubridate.html)
* Ultimately you want your dates and times as class "Date" or the classes "POSIXct", "POSIXlt". For more information type `?POSIXlt`
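
As a rough sketch of what those classes offer, the base R example below builds two "POSIXct" times and inspects one of them as "POSIXlt". The timestamps are made up, and the time zone is fixed to UTC so the result does not depend on the local machine.

```r
# Two made-up timestamps in a fixed time zone
t1 <- as.POSIXct("2011-08-03 10:15:03", tz = "UTC")
t2 <- as.POSIXct("2011-08-03 12:45:03", tz = "UTC")

# POSIXct stores seconds since 1970-01-01, so differences are simple arithmetic
elapsedMins <- as.numeric(difftime(t2, t1, units = "mins"))

# POSIXlt stores the components separately, which makes them easy to pull out
lt <- as.POSIXlt(t1)
hourOfDay <- lt$hour
minute    <- lt$min

class(t1)
elapsedMins
```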

## Data resources

### Open Government Sites

* United Nations [http://data.un.org/](http://data.un.org/)
* U.S. [http://www.data.gov/](http://www.data.gov/)
  * [List of cities/states with open data](http://simplystatistics.org/2012/01/02/list-of-cities-states-with-open-data-help-me-find/)
* United Kingdom [http://data.gov.uk/](http://data.gov.uk/)
* France [http://www.data.gouv.fr/](http://www.data.gouv.fr/)
* Ghana [http://data.gov.gh/](http://data.gov.gh/)
* Australia [http://data.gov.au/](http://data.gov.au/)
* Germany [https://www.govdata.de/](https://www.govdata.de/) 
* Hong Kong [http://www.gov.hk/en/theme/psi/datasets/](http://www.gov.hk/en/theme/psi/datasets/)
* Japan [http://www.data.go.jp/](http://www.data.go.jp/)
* Many more [http://www.data.gov/opendatasites](http://www.data.gov/opendatasites)


### Gapminder

<img class=center src=https://raw.githubusercontent.com/DataScienceSpecialization/courses/master/assets/img/03_ObtainingData/gapminder.png height=400/>

[http://www.gapminder.org/](http://www.gapminder.org/)


### Survey data from the United States

<img class=center src=https://raw.githubusercontent.com/DataScienceSpecialization/courses/master/assets/img/03_ObtainingData/asdfree.png height=400/>

[http://www.asdfree.com/](http://www.asdfree.com/)

### Infochimps Marketplace

<img class=center src=https://raw.githubusercontent.com/DataScienceSpecialization/courses/master/assets/img/03_ObtainingData/infochimps.png height=400/>

[http://www.infochimps.com/marketplace](http://www.infochimps.com/marketplace)


### Kaggle

<img class=center src=https://raw.githubusercontent.com/DataScienceSpecialization/courses/master/assets/img/03_ObtainingData/kaggle.png  height=400 />

[http://www.kaggle.com/](http://www.kaggle.com/)


### Collections by data scientists

* Hilary Mason http://bitly.com/bundles/hmason/1
* Peter Skomoroch https://delicious.com/pskomoroch/dataset
* Jeff Hammerbacher http://www.quora.com/Jeff-Hammerbacher/Introduction-to-Data-Science-Data-Sets
* Gregory Piatetsky-Shapiro http://www.kdnuggets.com/gps.html
* [http://blog.mortardata.com/post/67652898761/6-dataset-lists-curated-by-data-scientists](http://blog.mortardata.com/post/67652898761/6-dataset-lists-curated-by-data-scientists)


### More specialized collections

* [Stanford Large Network Data](http://snap.stanford.edu/data/)
* [UCI Machine Learning](http://archive.ics.uci.edu/ml/)
* [KDnuggets Datasets](http://www.kdnuggets.com/datasets/index.html)
* [CMU Statlib](http://lib.stat.cmu.edu/datasets/)
* [Gene expression omnibus](http://www.ncbi.nlm.nih.gov/geo/)
* [ArXiv Data](http://arxiv.org/help/bulk_data)
* [Public Data Sets on Amazon Web Services](http://aws.amazon.com/publicdatasets/)


### Some APIs with R interfaces

* [twitter](https://dev.twitter.com/) and [twitteR](http://cran.r-project.org/web/packages/twitteR/index.html) package
* [figshare](http://api.figshare.com/docs/intro.html) and [rfigshare](http://cran.r-project.org/web/packages/rfigshare/index.html)
* [PLoS](http://api.plos.org/) and [rplos](http://cran.r-project.org/web/packages/rplos/rplos.pdf)
* [rOpenSci](http://ropensci.org/packages/index.html)
* [Facebook](https://developers.facebook.com/) and [Rfacebook](http://cran.r-project.org/web/packages/Rfacebook/)
* [Google maps](https://developers.google.com/maps/) and [RgoogleMaps](http://cran.r-project.org/web/packages/RgoogleMaps/index.html)

## Questions