Week 1
===
This course covers the basic ideas behind getting data ready for analysis
    - Finding and extracting raw data
    - Tidy data principles and how to make data tidy
    - Practical implementation through a range of R packages
What this course depends on
    - The Data Scientist's Toolkit
    - R Programming
    
## The goal of this course
<b>Raw data -> Processing script -> tidy data</b> -> data analysis -> data communication

## Raw and Processed Data
Definition of <b>data</b>

"Data are values of qualitative or quantitative variables, belonging to a set of items."

Qualitative: Country of origin, sex, treatment 
Quantitative: Height, weight, blood pressure

<b>Raw data</b>
    - The original source of the data
    - Often hard to use for data analyses
    - Data analysis <i>includes</i> processing
    - Raw data may only need to be processed once
    
<b>Processed data</b>
    - Data that is ready for analysis
    - Processing can include merging, subsetting, transforming, etc.
    - There may be standards for processing
    - All steps should be recorded

## Components of Tidy Data
### The four things you should have
    1. The raw data
    2. A tidy data set
    3. A code book describing each variable and its values in the tidy data set.
    4. An explicit and exact recipe that you used to go 1 -> 2,3
    
### The raw data
    - The strange binary file your measurement machine spits out
    - The unformatted Excel file with 10 worksheets the company you contracted with sent you
    - The complicated JSON data you got from scraping the Twitter API
    - The hand-entered numbers you collected looking through a microscope
    
<i>You know the raw data is in the right format if you</i>
    1. Ran no software on the data
    2. Did not manipulate any of the numbers in the data
    3. Did not remove any data from the data set
    4. Did not summarize the data in any way
    
### The tidy data
    1. Each variable you measure should be in one column
    2. Each different observation of that variable should be in a different row
    3. There should be one table for each "kind" of variable
    4. If you have multiple tables, they should include a column in the table that allows them to be linked
    
<i>Some other important tips</i>
    - Include a row at the top of each file with variable names.
    - Make variable names human readable: AgeAtDiagnosis instead of AgeDx
    - In general data should be saved in one file per table
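
A minimal sketch of the first two principles, assuming a hypothetical table in which each treatment has its own column. The names messy, tidy, patient, treatment, and response are invented for illustration and are not part of the course material.

```r
# Hypothetical messy table: one column per treatment, so the variable
# "treatment" is spread across the column names.
messy <- data.frame(
  patient    = c("p1", "p2", "p3"),
  treatmentA = c(5.1, 4.8, 6.0),
  treatmentB = c(4.2, 4.9, 5.5),
  stringsAsFactors = FALSE
)

# stack() gathers the measurement columns into two columns:
# 'values' (the numbers) and 'ind' (the column each number came from).
long <- stack(messy[, c("treatmentA", "treatmentB")])

# Tidy version: one column per variable, one row per observation.
tidy <- data.frame(
  patient   = rep(messy$patient, times = 2),
  treatment = sub("^treatment", "", as.character(long$ind)),
  response  = long$values,
  stringsAsFactors = FALSE
)
tidy
```

Each row of tidy is now a single measurement of one patient under one treatment, and the first row of the data frame holds the variable names.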

### The code book
    1. Information about the variables (including units!) in the data set not contained in the tidy data
    2. Information about the summary choices you made
    3. Information about the experimental study design you used
    
<i>Some other important tips</i>
    - A common format for this document is a Word/text file.
    - There should be a section called "Study design" that has a thorough description of how you collected the data
    - There must be a section called "Code book" that describes each variable and its units.
    
### The instruction list
    - Ideally a computer script (R or Python)
    - The input for the script is the raw data
    - The output is the processed, tidy data
    - There are no parameters to the script
    
In some cases it will not be possible to script every step. In that case you should provide instructions like:
    1. Step 1 - take the raw file, run version 3.1.2 of summarize software with parameters a=1, b=2, c=3
    2. Step 2 - run the software separately for each sample
    3. Step 3 - take column three of outputfile.out for each sample and that is the corresponding row in the output data set

In [1]:
getwd()

## Downloading files
### Get/set your working directory
* A basic component of working with data is knowing your working directory
* The two main commands are getwd() and setwd()
* Be aware of relative versus absolute paths
    - Relative - setwd("./data"), setwd("../")
    - Absolute - setwd("/Users/Users/data/")

### Checking for and creating directories
* file.exists("directoryName") will check to see if the directory exists
* dir.create("directoryName") will create a directory if it doesn't exist
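
The two calls are usually combined so that the directory is created only when it is missing. This is a small sketch; the location under tempdir() is an assumption chosen so the example does not touch your real working directory.

```r
# Hypothetical directory location used only for this illustration
data_dir <- file.path(tempdir(), "data")

# Create the directory only if it is not already there;
# calling dir.create() on an existing directory would raise a warning.
if (!file.exists(data_dir)) {
  dir.create(data_dir)
}

file.exists(data_dir)
```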

### Getting data from the internet - download.file()
* Downloads a file from the internet
* Even if you could do this by hand, helps with reproducibility
* Important parameters are <i>url, destfile, method</i>
* Useful for downloading tab-delimited, csv, and other files

Downloading a file from the web:

In [2]:
fileUrl <- "http://data.baltimorecity.gov/api/views/dz54-2aru/rows.csv?accessType=DOWNLOAD"
download.file(fileUrl, destfile = "./data/cameras.csv", method = "curl")
list.files("./data")

In [3]:
dateDownloaded <- date()
dateDownloaded

### Some notes about download.file()
* If the url starts with <i>http</i> you can use download.file()
* If the url starts with <i>https</i> on Mac you may need to set method = "curl"
* Be sure to record when you downloaded
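
One possible way to make the download date hard to forget is to write it to disk next to the file. The helper download_with_date and the ".downloaded" suffix are hypothetical names, not part of base R, and the demonstration uses a local file:// URL on a Unix-like system so that it runs without network access.

```r
# Hypothetical helper: download a file and record when it was retrieved.
download_with_date <- function(url, destfile, ...) {
  status <- download.file(url, destfile = destfile, ...)
  if (status != 0) stop("download failed with status ", status)
  stamp <- date()
  # Store the timestamp in a small text file beside the downloaded data
  writeLines(stamp, paste0(destfile, ".downloaded"))
  invisible(stamp)
}

# Demonstration on a local file standing in for a remote resource
src <- tempfile(fileext = ".csv")
writeLines(c("a,b", "1,2"), src)
dest <- tempfile(fileext = ".csv")

download_with_date(paste0("file://", src), dest, quiet = TRUE)
readLines(paste0(dest, ".downloaded"))
```

For a real https URL you would pass the method argument through the dots, for example method = "curl", if your platform needs it.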

## Reading Local files
### Loading flat files - read.table()
* This is the main function for reading data into R
* Flexible and robust but requires more parameters
* Reads the data into RAM - big data can cause problems
* Important parameters <i>file, header, sep, row.names, nrows</i>
* Related: read.csv(), read.csv2()

### Baltimore example

In [4]:
cameraData <- read.table("./data/cameras.csv")

ERROR: Error in scan(file = file, what = what, sep = sep, quote = quote, dec = dec, : line 1 did not have 13 elements


In [5]:
head(cameraData)

ERROR: Error in head(cameraData): object 'cameraData' not found


The error is thrown because the file is comma-separated, while read.table splits fields on whitespace by default (sep = "")

In [6]:
cameraData <- read.table("./data/cameras.csv", sep = ",", header = TRUE)
head(cameraData, n=10)

#read.csv sets sep="," and header=TRUE

address,direction,street,crossStreet,intersection,Location.1
S CATON AVE & BENSON AVE,N/B,Caton Ave,Benson Ave,Caton Ave & Benson Ave,"(39.2693779962, -76.6688185297)"
S CATON AVE & BENSON AVE,S/B,Caton Ave,Benson Ave,Caton Ave & Benson Ave,"(39.2693157898, -76.6689698176)"
WILKENS AVE & PINE HEIGHTS AVE,E/B,Wilkens Ave,Pine Heights,Wilkens Ave & Pine Heights,"(39.2720252302, -76.676960806)"
THE ALAMEDA & E 33RD ST,S/B,The Alameda,33rd St,The Alameda & 33rd St,"(39.3285013141, -76.5953545714)"
E 33RD ST & THE ALAMEDA,E/B,E 33rd,The Alameda,E 33rd & The Alameda,"(39.3283410623, -76.5953594625)"
ERDMAN AVE & N MACON ST,E/B,Erdman,Macon St,Erdman & Macon St,"(39.3068045671, -76.5593167803)"
ERDMAN AVE & N MACON ST,W/B,Erdman,Macon St,Erdman & Macon St,"(39.306966535, -76.5593122365)"
N CHARLES ST & E LAKE AVE,S/B,Charles,Lake Ave,Charles & Lake Ave,"(39.3690535299, -76.625826716)"
E MADISON ST & N CAROLINE ST,W/B,Madison,Caroline St,Madison & Caroline St,"(39.2993257666, -76.5976760827)"
ORLEANS ST & N LINWOOD AVE,E/B,Orleans,Linwood Ave,Orleans & Linwood Ave,"(39.2958661981, -76.5764270078)"


### Some more important parameters
* quote - tells R which characters, if any, are used to quote values; quote = "" means no quotes.
* na.strings - set the character that represents a missing value.
* nrows - how many rows to read of the file
* skip - number of lines to skip before starting to read

In my experience, the biggest trouble with reading flat files is quotation marks (' or ") placed in data values; setting quote="" often resolves this.
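
The parameters above can be tried without any file on disk by passing the text argument of read.table. The sample lines below are made up for illustration; they include a junk first line, apostrophes inside values, and the word "missing" standing in for absent data.

```r
# Hypothetical file contents, supplied as a character vector
lines <- c("exported by hand",
           "name,score",
           "O'Brien,10",
           "Smith,missing",
           "D'Angelo,7")

dat <- read.table(text = lines,
                  sep = ",",
                  header = TRUE,
                  skip = 1,                # ignore the junk first line
                  quote = "",              # treat ' and " as ordinary characters
                  na.strings = "missing",  # read "missing" as NA
                  stringsAsFactors = FALSE)
dat
```

Without quote = "", read.table would treat the apostrophes as opening and closing quotes and the rows would not be split as intended.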

## Reading Excel Files
Excel files are still probably the most widely used format for sharing data

Downloading the Excel version of the cameras dataset:

In [7]:
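# NOTE: fileUrl still holds the CSV link from In [2], so this call saves CSV
# content under an .xlsx name. Point fileUrl at the Excel export of the
# dataset before downloading, otherwise read.xlsx fails as shown below.
download.file(fileUrl, destfile="./data/cameras.xlsx", method="curl")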
dateDownloaded <- date()
dateDownloaded

In [8]:
library(xlsx)
cameraData <- read.xlsx("./data/cameras.xlsx", sheetIndex = 1, header = TRUE)
head(cameraData)

Loading required package: rJava
Loading required package: xlsxjars


ERROR: Error in .jcall("RJavaTools", "Ljava/lang/Object;", "invokeMethod", cl, : java.lang.IllegalArgumentException: Your InputStream was neither an OLE2 stream, nor an OOXML stream


### Reading specific rows and columns
You can read specific rows and columns. In this example you only read columns 2 and 3, and rows 1 through 4.

In [None]:
colIndex <- 2:3
rowIndex <- 1:4
cameraDataSubset <- read.xlsx("./data/cameras.xlsx", sheetIndex=1, colIndex=colIndex, rowIndex=rowIndex)
cameraDataSubset

### Further notes
* The write.xlsx function will write out an Excel file with similar arguments
* read.xlsx2 is much faster than read.xlsx but for reading subsets of rows may be slightly unstable.
* The XLConnect package has more options for writing and manipulating Excel files
* The XLConnect vignette is a good place to start for that package
* In general it is advised to store your data in either a database or in comma separated files (.csv) or tab separated files (.tab/.txt) as they are easier to distribute.

## Reading XML
### XML
* Extensible markup language
* Frequently used to store structured data
* Particularly widely used in internet applications
* Extracting XML is the basis for most web scraping
* Components:
    - Markup - labels that give the text structure
    - Content - actual text of the document

### Tags, elements and attributes
* Tags correspond to general labels
    - Start tags <section>
    - End tags </section>
    - Empty tags <line-break />
* Elements are specific examples of tags
    - <Greeting> Hello, world </Greeting>
* Attributes are components of the label
    - <img src="me.jpg" alt="student"/>
    - <step number="3"> Connect A to B. </step>
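
To see how tags, content, and attributes surface in R, here is a small sketch that parses an inline XML string with the XML package. The string itself is a made-up fragment built from the step example; only functions exported by the XML package are used.

```r
library(XML)

# Hypothetical XML fragment: one element with an attribute and one empty tag
xml_text <- '<instructions><step number="3">Connect A to B.</step><line-break/></instructions>'

doc  <- xmlParse(xml_text, asText = TRUE)
root <- xmlRoot(doc)
step <- root[[1]]            # first child element of the root

xmlName(step)                # the tag name
xmlValue(step)               # the content between start and end tag
xmlGetAttr(step, "number")   # the attribute value, as a character string
xmlSize(root)                # number of child nodes under the root
```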

http://www.w3schools.com/xml/simple.xml

In [9]:
library(XML)
fileUrl <- "./data/simple.xml"
doc <- xmlTreeParse(fileUrl, useInternalNodes=TRUE)
rootNode <- xmlRoot(doc)
xmlName(rootNode)

In [10]:
names(rootNode)

### Directly access parts of the XML document

In [11]:
rootNode[[1]]

<food>
  <name>Belgian Waffles</name>
  <price>$5.95</price>
  <description>Two of our famous Belgian Waffles with plenty of real maple syrup</description>
  <calories>650</calories>
</food> 

In [12]:
rootNode[[1]][[1]]

<name>Belgian Waffles</name> 

### Programmatically extract parts of the file

In [13]:
xmlSApply(rootNode, xmlValue)

### XPath
* /node Top level node
* //node Node at any level
* node[@attr-name] Node with an attribute name
* node[@attr-name='bob'] Node with attribute name attr-name='bob'

http://www.stat.berkeley.edu/
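
The four XPath patterns can be tried on a small inline document. The menu below is hypothetical (the type attribute in particular is invented so that the attribute patterns have something to match).

```r
library(XML)

# Hypothetical menu with an invented 'type' attribute on some items
menu <- '<menu>
  <food type="breakfast"><name>Waffles</name><price>5.95</price></food>
  <food type="lunch"><name>Soup</name><price>4.50</price></food>
  <food><name>Toast</name><price>2.00</price></food>
</menu>'
doc <- xmlParse(menu, asText = TRUE)

top       <- xpathSApply(doc, "/menu", xmlName)                       # top level node
all_names <- xpathSApply(doc, "//name", xmlValue)                     # node at any level
typed     <- xpathSApply(doc, "//food[@type]/name", xmlValue)         # food nodes that have a type attribute
lunch     <- xpathSApply(doc, "//food[@type='lunch']/name", xmlValue) # food nodes with type='lunch'

all_names
typed
lunch
```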

### Get the items on the menu and prices

In [14]:
xpathSApply(rootNode,"//name",xmlValue)

In [15]:
xpathSApply(rootNode,"//price",xmlValue)

### Another example
http://espn.go.com/nfl/team/_/name/bal/baltimore-ravens

In [18]:
fileUrl <- "http://espn.go.com/nfl/team/_/name/bal/baltimore-ravens"
doc <- htmlTreeParse(fileUrl, useInternalNodes=TRUE)
scores <- xpathSApply(doc,"//li[@class='score']",xmlValue)
teams <- xpathSApply(doc,"//li[@class='team-name']",xmlValue)
scores

In [19]:
teams

## Reading JSON
### JSON
* JavaScript Object Notation
* Lightweight data storage
* Common format for data from application programming interfaces (APIs)
* Similar structure to XML but different syntax/format
* Data stored as
    - Numbers (double)
    - Strings (double quoted)
    - Boolean (true or false)
    - Array (ordered [])
    - Object (unordered{})

https://en.wikipedia.org/wiki/JSON
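
As a rough illustration of how these JSON types arrive in R, the sketch below parses a made-up JSON string with jsonlite. The field names and values are invented for the example.

```r
library(jsonlite)

# Hypothetical JSON object containing one value of each type
json_text <- '{"height": 1.75, "name": "Ada", "active": true,
               "scores": [90, 85, 77], "address": {"city": "Baltimore"}}'

parsed <- fromJSON(json_text)

parsed$height        # number  -> numeric
parsed$name          # string  -> character
parsed$active        # boolean -> logical
parsed$scores        # array   -> vector
parsed$address$city  # object  -> named list, accessed with $
```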

### Example JSON file
https://api.github.com/users/jtleek/repos

In [20]:
library(jsonlite)
jsonData <- fromJSON("https://api.github.com/users/jtleek/repos")
names(jsonData)

In [21]:
names(jsonData$owner)

In [22]:
jsonData$owner$login

### Writing data frames to JSON

In [23]:
myjson <- toJSON(iris, pretty=TRUE)
cat(myjson)

[
  {
    "Sepal.Length": 5.1,
    "Sepal.Width": 3.5,
    "Petal.Length": 1.4,
    "Petal.Width": 0.2,
    "Species": "setosa"
  },
  {
    "Sepal.Length": 4.9,
    "Sepal.Width": 3,
    "Petal.Length": 1.4,
    "Petal.Width": 0.2,
    "Species": "setosa"
  },
  {
    "Sepal.Length": 4.7,
    "Sepal.Width": 3.2,
    "Petal.Length": 1.3,
    "Petal.Width": 0.2,
    "Species": "setosa"
  },
  {
    "Sepal.Length": 4.6,
    "Sepal.Width": 3.1,
    "Petal.Length": 1.5,
    "Petal.Width": 0.2,
    "Species": "setosa"
  },
  {
    "Sepal.Length": 5,
    "Sepal.Width": 3.6,
    "Petal.Length": 1.4,
    "Petal.Width": 0.2,
    "Species": "setosa"
  },
  {
    "Sepal.Length": 5.4,
    "Sepal.Width": 3.9,
    "Petal.Length": 1.7,
    "Petal.Width": 0.4,
    "Species": "setosa"
  },
  {
    "Sepal.Length": 4.6,
    "Sepal.Width": 3.4,
    "Petal.Length": 1.4,
    "Petal.Width": 0.3,
    "Species": "setosa"
  },
  {
    "Sepal.Length": 5,
    "Sepal.Width": 3.4,
    "Petal.Length": 1.5,
    "Peta

### Convert back from JSON

In [24]:
iris2 <- fromJSON(myjson)
head(iris2)

Sepal.Length,Sepal.Width,Petal.Length,Petal.Width,Species
5.1,3.5,1.4,0.2,setosa
4.9,3.0,1.4,0.2,setosa
4.7,3.2,1.3,0.2,setosa
4.6,3.1,1.5,0.2,setosa
5.0,3.6,1.4,0.2,setosa
5.4,3.9,1.7,0.4,setosa


### Further resources
* http://www.json.org/

## Using data.table

### data.table
* Inherits from data.frame
    - All functions that accept data.frame work on data.table
* Written in C so it is much faster
* Much, much faster at subsetting, grouping, and updating

### Create data tables just like data frames

In [28]:
library(data.table)
DF = data.frame(x=rnorm(9),y=rep(c("a","b","c"),each=3),z=rnorm(9))
head(DF,3)

x,y,z
0.3096739,a,1.3007757
-1.5439204,a,0.9210169
-0.2349689,a,2.0753142


In [29]:
DT = data.table(x=rnorm(9),y=rep(c("a","b","c"),each=3),z=rnorm(9))
head(DT,3)

x,y,z
-0.12513889,a,-1.6064954
-1.16185355,a,0.7312962
-0.08420981,a,0.2532261


### See all the data tables in memory

In [30]:
tables()

     NAME NROW NCOL MB COLS  KEY
[1,] DT      9    3  1 x,y,z    
Total: 1MB


### Subsetting rows

In [31]:
DT[2,]

x,y,z
-1.161854,a,0.7312962


In [34]:
DT[DT$y=="a",]

x,y,z
-0.12513889,a,-1.6064954
-1.16185355,a,0.7312962
-0.08420981,a,0.2532261


In [35]:
DT[c(2,3)]

x,y,z
-1.16185355,a,0.7312962
-0.08420981,a,0.2532261


### Subsetting columns!?

In [36]:
DT[,c(2,3)]

y,z
a,-1.6064954
a,0.7312962
a,0.2532261
b,-0.6856561
b,-1.7739271
b,-0.5892033
c,1.3613298
c,0.5009019
c,-0.8500094


### Column subsetting in data.table
* The subsetting function is modified for data.table
* The argument you pass after the comma is called an "expression"
* In R an expression is a collection of statements enclosed in curly brackets

In [37]:
{
    x = 1
    y = 2
}
k = {print(10); 5}

[1] 10


In [38]:
print(k)

[1] 5


### Calculating values for variables with expressions

In [39]:
DT[,list(mean(x),sum(z))]

V1,V2
0.01593698,-2.658537


In [40]:
DT[,table(y)]

y
a b c 
3 3 3 

### Adding new columns

In [41]:
DT[,w:=z^2]

x,y,z,w
-0.12513889,a,-1.6064954,2.58082743
-1.16185355,a,0.7312962,0.53479411
-0.08420981,a,0.2532261,0.06412345
0.72498088,b,-0.6856561,0.47012434
-0.3175159,b,-1.7739271,3.1468173
0.63671102,b,-0.5892033,0.34716052
-0.60397626,c,1.3613298,1.85321895
0.93432497,c,0.5009019,0.25090274
0.14011033,c,-0.8500094,0.72251594


In [42]:
DT2 <- DT
DT[, y:=2]

“Coerced 'double' RHS to 'character' to match the column's type; may have truncated precision. Either change the target column to 'double' first (by creating a new 'double' vector length 9 (nrows of entire table) and assign that; i.e. 'replace' column), or coerce RHS to 'character' (e.g. 1L, NA_[real|integer]_, as.*, etc) to make your intent clear and for speed. Or, set the column type correctly up front when you create the table and stick to it, please.”

x,y,z,w
-0.12513889,2,-1.6064954,2.58082743
-1.16185355,2,0.7312962,0.53479411
-0.08420981,2,0.2532261,0.06412345
0.72498088,2,-0.6856561,0.47012434
-0.3175159,2,-1.7739271,3.1468173
0.63671102,2,-0.5892033,0.34716052
-0.60397626,2,1.3613298,1.85321895
0.93432497,2,0.5009019,0.25090274
0.14011033,2,-0.8500094,0.72251594


### Careful!

In [43]:
head(DT, n=3)

x,y,z,w
-0.12513889,2,-1.6064954,2.58082743
-1.16185355,2,0.7312962,0.53479411
-0.08420981,2,0.2532261,0.06412345


In [44]:
head(DT2, n=3)

x,y,z,w
-0.12513889,2,-1.6064954,2.58082743
-1.16185355,2,0.7312962,0.53479411
-0.08420981,2,0.2532261,0.06412345
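
DT2 changed because DT2 <- DT does not copy a data.table; both names refer to the same object, and := modifies that object in place. A short sketch of the difference, using copy() from data.table (the names alias and snapshot are chosen only for this illustration):

```r
library(data.table)

DT <- data.table(x = 1:3, y = c("a", "b", "c"))

alias    <- DT        # another name for the same table
snapshot <- copy(DT)  # an independent copy

DT[, y := "changed"]  # modifies the table in place

alias$y      # follows DT: every value is now "changed"
snapshot$y   # unaffected: still "a" "b" "c"
```

Use copy() whenever you want to keep the table as it was before an update by reference.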


### Multiple operations

In [45]:
DT[,m:={tmp <- (x+x); log2(tmp+5)}]

x,y,z,w,m
-0.12513889,2,-1.6064954,2.58082743,2.247843
-1.16185355,2,0.7312962,0.53479411,1.420236
-0.08420981,2,0.2532261,0.06412345,2.272495
0.72498088,2,-0.6856561,0.47012434,2.689291
-0.3175159,2,-1.7739271,3.1468173,2.125971
0.63671102,2,-0.5892033,0.34716052,2.649253
-0.60397626,2,1.3613298,1.85321895,1.922977
0.93432497,2,0.5009019,0.25090274,2.780027
0.14011033,2,-0.8500094,0.72251594,2.400598


### plyr-like operations

In [46]:
DT[,a:=x>0]

x,y,z,w,m,a
-0.12513889,2,-1.6064954,2.58082743,2.247843,False
-1.16185355,2,0.7312962,0.53479411,1.420236,False
-0.08420981,2,0.2532261,0.06412345,2.272495,False
0.72498088,2,-0.6856561,0.47012434,2.689291,True
-0.3175159,2,-1.7739271,3.1468173,2.125971,False
0.63671102,2,-0.5892033,0.34716052,2.649253,True
-0.60397626,2,1.3613298,1.85321895,1.922977,False
0.93432497,2,0.5009019,0.25090274,2.780027,True
0.14011033,2,-0.8500094,0.72251594,2.400598,True


In [47]:
DT[,b:=mean(x+w),by=a]

x,y,z,w,m,a,b
-0.12513889,2,-1.6064954,2.58082743,2.247843,False,1.177417
-1.16185355,2,0.7312962,0.53479411,1.420236,False,1.177417
-0.08420981,2,0.2532261,0.06412345,2.272495,False,1.177417
0.72498088,2,-0.6856561,0.47012434,2.689291,True,1.056708
-0.3175159,2,-1.7739271,3.1468173,2.125971,False,1.177417
0.63671102,2,-0.5892033,0.34716052,2.649253,True,1.056708
-0.60397626,2,1.3613298,1.85321895,1.922977,False,1.177417
0.93432497,2,0.5009019,0.25090274,2.780027,True,1.056708
0.14011033,2,-0.8500094,0.72251594,2.400598,True,1.056708
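
Using := with by adds the group value to every row. When you want one row per group instead, pass a list() in the second argument without :=. A small sketch with made-up data (the column names g and v are hypothetical):

```r
library(data.table)

DT <- data.table(g = c("a", "a", "b"), v = c(1, 3, 10))

# One summary row per group: the mean of v and the group size
summary_dt <- DT[, list(mean_v = mean(v), n = .N), by = g]
summary_dt
```

Here group a has mean 2 over 2 rows and group b has mean 10 over 1 row.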


### Special variables
.N is an integer of length 1 containing the number of rows in the current group

In [48]:
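set.seed(123);
DT <- data.table(x=sample(letters[1:3], 1E5, TRUE))
DT[, .N, by=x]

The result has one row per letter; the three counts sum to 100000, and each is close to 33333 (the exact values depend on the sampling algorithm of your R version).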
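
A small deterministic sketch of .N, using a hand-written column so the counts can be checked by eye (the data are invented for illustration):

```r
library(data.table)

DT <- data.table(x = c("a", "a", "b", "c", "c", "c"))

# .N is evaluated once per group defined by 'by'
counts <- DT[, .N, by = x]
counts
```

The result has one row per distinct value of x, with a appearing 2 times, b once, and c 3 times.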


### Keys

In [49]:
DT <- data.table(x=rep(c("a","b","c"),each=100), y=rnorm(300))
setkey(DT, x)
DT['a']

x,y
a,1.19020663
a,-1.68955566
a,1.23949589
a,-0.10896597
a,-0.11724196
a,0.18308261
a,1.28055488
a,-1.72727063
a,1.69018435
a,0.50381245


### Joins
* You can also use keys to facilitate joins between data tables

In [50]:
DT1 <- data.table(x=c('a','a','b','dt1'), y=1:4)
DT2 <- data.table(x=c('a','b','dt2'), z=5:7)
setkey(DT1,x); setkey(DT2, x)
merge(DT1, DT2)

x,y,z
a,1,5
a,2,5
b,3,6
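
By default merge keeps only the keys present in both tables, which is why the dt1 and dt2 rows are missing from the result. A sketch of the same tables merged with all = TRUE, which keeps unmatched rows and fills the gaps with NA (the names inner and outer are chosen only for this illustration):

```r
library(data.table)

DT1 <- data.table(x = c("a", "a", "b", "dt1"), y = 1:4)
DT2 <- data.table(x = c("a", "b", "dt2"), z = 5:7)

inner <- merge(DT1, DT2, by = "x")              # keys found in both tables
outer <- merge(DT1, DT2, by = "x", all = TRUE)  # keep unmatched keys as well

inner
outer
```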


### Fast reading
It is advantageous to use data.table's fread to read large files from disk quickly

In [51]:
big_df <- data.frame(x=rnorm(1E6), y=rnorm(1E6))
file <- tempfile()
write.table(big_df, file=file, row.names=FALSE, col.names=TRUE, sep="\t", quote=FALSE)
system.time(fread(file))

   user  system elapsed 
  0.442   0.028   0.491 

In [52]:
system.time(read.table(file, header=TRUE, sep="\t"))

   user  system elapsed 
  9.404   0.274  10.028 

### Summary and further reading
* The latest development version contains new functions like melt and dcast for data.tables
* Here is a list of differences between data.table and data.frame