In [38]:
library(tidyverse)

── Attaching packages ─────────────────────────────────────── tidyverse 1.2.1 ──
✔ ggplot2 2.2.1     ✔ purrr   0.2.4
✔ tibble  1.3.4     ✔ dplyr   0.7.4
✔ tidyr   0.7.2     ✔ stringr 1.2.0
✔ readr   1.1.1     ✔ forcats 0.2.0
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter()  masks stats::filter()
✖ purrr::flatten() masks jsonlite::flatten()
✖ dplyr::lag()     masks stats::lag()


# Lecture 08: tibbles & importing data
In this notebook, we will learn about:

* [Tibbles](#Tibbles), a modern take on R data frames.
* [Importing data](#Data-import)

In [89]:
iris[1:2,1:2]

  Sepal.Length Sepal.Width
1 5.1          3.5        
2 4.9          3.0        

## Tibbles
A `tibble` is the `tidyverse` way of storing a table full of data. We've already used them extensively when working with the `dplyr` data manipulation commands. 

`tibble`s are not the "traditional" way of representing data in R. That is the `data.frame`. If you consult the R help functions, or read R tutorials online, you will most likely encounter `data.frame`s.

Compared to `data.frame`s, `tibble`s have some nice features:
- They print nicely.
- They behave in expected ways.
- They work better with `tidyverse`.

In [182]:
head(iris) # the famous iris data set used by Sir Ronald Fisher
class(iris)

  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1          3.5         1.4          0.2         setosa 
2 4.9          3.0         1.4          0.2         setosa 
3 4.7          3.2         1.3          0.2         setosa 
4 4.6          3.1         1.5          0.2         setosa 
5 5.0          3.6         1.4          0.2         setosa 
6 5.4          3.9         1.7          0.4         setosa 

[1] "data.frame"

In [90]:
iris_tibble = as_tibble(iris) # convert to a tibble
print(iris_tibble)

# A tibble: 150 x 5
   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
          <dbl>       <dbl>        <dbl>       <dbl>  <fctr>
 1          5.1         3.5          1.4         0.2  setosa
 2          4.9         3.0          1.4         0.2  setosa
 3          4.7         3.2          1.3         0.2  setosa
 4          4.6         3.1          1.5         0.2  setosa
 5          5.0         3.6          1.4         0.2  setosa
 6          5.4         3.9          1.7         0.4  setosa
 7          4.6         3.4          1.4         0.3  setosa
 8          5.0         3.4          1.5         0.2  setosa
 9          4.4         2.9          1.4         0.2  setosa
10          4.9         3.1          1.5         0.1  setosa
# ... with 140 more rows


Note that `tibble`s *are* a special type of data frame, so all the commands that work on data frames will work on tibbles as well. However, `tibble`s have some nice added functionality that data frames do not possess.
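
As a small sketch of this point (the object name mirrors the cell above), we can check that a tibble carries the `data.frame` class and is accepted by base R functions written for data frames:

```r
library(tibble)

iris_tibble <- as_tibble(iris)

# A tibble carries the data.frame class alongside its own classes.
class(iris_tibble)
is.data.frame(iris_tibble)

# Base R functions written for data frames therefore accept it unchanged.
nrow(iris_tibble)
colMeans(iris_tibble[, 1:4])
```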

### Differences between tibble and data frame
Data frames are known for doing some unexpected things in R, for example:
- Converting strings to factors.
- Renaming variables automatically.
- Adding 'row names' that look like row numbers, but are not.
- Returning different data types depending on how you index.

#### Strings to factors

In [95]:
df = tibble(string.to.factor=c('a','b','c'))
class(df$string.to.factor)
print(df)

[1] "character"

# A tibble: 3 x 1
  string.to.factor
             <chr>
1                a
2                b
3                c
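
For contrast, here is a sketch of the data frame side. `stringsAsFactors` is set explicitly because its default depends on the R version (it was `TRUE` before R 4.0.0); the object names are made up for illustration.

```r
library(tibble)

# Explicit stringsAsFactors = TRUE reproduces the behaviour that was the
# default before R 4.0.0.
df_factor <- data.frame(s = c("a", "b", "c"), stringsAsFactors = TRUE)
class(df_factor$s)   # a factor

# tibble() never converts strings to factors.
tb_chr <- tibble(s = c("a", "b", "c"))
class(tb_chr$s)      # a character vector
```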


#### Auto-renaming

In [99]:
tbl = tibble("var with spaces"=c(1,2,3))
tbl$`var with spaces`

[1] 1 2 3
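
The corresponding data frame behaviour can be sketched as follows (object names are illustrative). `data.frame()` has `check.names = TRUE` by default, which rewrites names that are not syntactically valid:

```r
library(tibble)

# data.frame() passes names through make.names(), so spaces become dots.
df_renamed <- data.frame("var with spaces" = c(1, 2, 3))
names(df_renamed)

# tibble() leaves the name untouched.
tb_kept <- tibble("var with spaces" = c(1, 2, 3))
names(tb_kept)
```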

#### Row names

In [116]:
df = tibble(x=1:100)
# df  # note row-number looking things
subset = df[10:100,]
subset[7,]

  x 
1 16
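
A side-by-side sketch of the row-name behaviour (object names are illustrative): after subsetting, a data frame keeps the row names of the original, while a tibble simply renumbers from 1.

```r
library(tibble)

df_rows <- data.frame(x = 1:100, y = 1)
sub_df  <- df_rows[10:100, ]
rownames(sub_df)[7]   # row name inherited from the original data frame
sub_df[7, "x"]

tb_rows <- tibble(x = 1:100, y = 1)
sub_tb  <- tb_rows[10:100, ]
rownames(sub_tb)[7]   # rows are renumbered, so this matches the position
sub_tb$x[7]
```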

#### Differing return types

In [117]:
df = data.frame(x=1:10)
df1 = df[1:5,]  # select first five rows
class(df)
class(df1)
#df = data.frame(x=1:10, y=1)
#class(df[1:5,])

[1] "data.frame"

[1] "integer"
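
The same indexing on a tibble can be sketched like this (object names are illustrative). In base R the vector result can be avoided with `drop = FALSE`; a tibble stays a tibble without it.

```r
library(tibble)

df_one <- data.frame(x = 1:10)
class(df_one[1:5, ])                 # single column is dropped to a vector
class(df_one[1:5, , drop = FALSE])   # stays a data frame

tb_one <- tibble(x = 1:10)
class(tb_one[1:5, ])                 # row indexing always returns a tibble
```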

Tibbles handle these cases nicely:

In [123]:
mpg %>% print
mpg %>% slice(c(1,2,3)) %>% print

# A tibble: 234 x 11
   manufacturer      model displ  year   cyl      trans   drv   cty   hwy    fl
          <chr>      <chr> <dbl> <int> <int>      <chr> <chr> <int> <int> <chr>
 1         audi         a4   1.8  1999     4   auto(l5)     f    18    29     p
 2         audi         a4   1.8  1999     4 manual(m5)     f    21    29     p
 3         audi         a4   2.0  2008     4 manual(m6)     f    20    31     p
 4         audi         a4   2.0  2008     4   auto(av)     f    21    30     p
 5         audi         a4   2.8  1999     6   auto(l5)     f    16    26     p
 6         audi         a4   2.8  1999     6 manual(m5)     f    18    26     p
 7         audi         a4   3.1  2008     6   auto(av)     f    18    27     p
 8         audi a4 quattro   1.8  1999     4 manual(m5)     4    18    26     p
 9         audi a4 quattro   1.8  1999     4   auto(l5)     4    16    25     p
10         audi a4 quattro   2.0  2008     4 manual(m6)     4    20    28     p
# ... with 224 more

### Non-features
There are a couple of data.frame features that tibbles intentionally lack. This is because they can lead to unintended behavior, a.k.a. bugs.



#### Argument expansion
Data frames will automatically recycle a shorter column vector to match the longest column, as long as its length divides the longer one evenly. `tibble` only does this for scalars (vectors of length 1).

In [128]:
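#data.frame(x=1:4, y=1)    # ok: the scalar is recycled
#data.frame(x=1:4, y=1:2)  # ok: y is recycled to length 4
#tibble(x=1:4, y=1:2)      # error: tibble only recycles vectors of length 1
tibble(x=1:4, y=1)         # ok: the scalar is recycled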

  x y
1 1 1
2 2 1
3 3 1
4 4 1
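
A sketch of the failing case, wrapped in `tryCatch()` so that the cell runs to completion (the object names and the message string are illustrative):

```r
library(tibble)

# data.frame() recycles y because 2 divides 4 evenly.
df_recycled <- data.frame(x = 1:4, y = 1:2)
df_recycled$y

# tibble() refuses: only length-1 vectors are recycled.
result <- tryCatch(
    tibble(x = 1:4, y = 1:2),
    error = function(e) "tibble() raised an error"
)
result
```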

#### Name abbreviation
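With `$`, data frames allow you to abbreviate column names: `df$x` returns column `xyz` through partial matching. Tibbles never do this; they return `NULL` with a warning. Avoid relying on abbreviation.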

In [131]:
data.frame(xyz=1:4)
tibble(xyz=1:4)$x

  xyz
1 1  
2 2  
3 3  
4 4  

“Unknown or uninitialised column: 'x'.”

NULL
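
The two behaviours side by side, as a minimal sketch (object names are illustrative; the tibble warning is suppressed only to keep the output short):

```r
library(tibble)

df_abbrev <- data.frame(xyz = 1:4)
df_abbrev$x                     # partial match: returns the xyz column

tb_abbrev <- tibble(xyz = 1:4)
suppressWarnings(tb_abbrev$x)   # NULL: tibbles do not partially match
```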

## Creating tibbles
Tibbles can be created in either row or column form. Column form mimics the interface of `data.frame`: you pass key-value pairs, where the keys are the column names and the values are vectors of the column entries.

In [134]:
tibble(
    x = seq(100, 1000, 100),
    y = 50,     # scalars are automatically expanded
    z = x + 50  # variables are created sequentially so can refer to a variable already created
) %>% print

# A tibble: 10 x 3
       x     y     z
   <dbl> <dbl> <dbl>
 1   100    50   150
 2   200    50   250
 3   300    50   350
 4   400    50   450
 5   500    50   550
 6   600    50   650
 7   700    50   750
 8   800    50   850
 9   900    50   950
10  1000    50  1050


You can also use a different notation to create tibbles row-wise. This is more natural for hand-entered data. To create tibbles by specifying rows, use `tribble` (for **tr**ansposed t**ibble**).

In [42]:
tribble(
    ~letter, ~Letter, ~number, # note use of R formulas in specifying variables names in tribble()
    #--|---|---
    "a", "A", 1,
    "f", "F", 6,
    "k", "K", 11,
    "p", "P", 16,
    "u", "U", 21,
    "z", "Z", 26
) %>% print

# A tibble: 6 x 3
  letter Letter number
   <chr>  <chr>  <dbl>
1      a      A      1
2      f      F      6
3      k      K     11
4      p      P     16
5      u      U     21
6      z      Z     26


You can control the printing behavior of tibbles by adding options to `print()`:

In [143]:
mpg %>% print(n=20, width=100)

# A tibble: 234 x 11
   manufacturer              model displ  year   cyl      trans   drv   cty   hwy    fl   class
          <chr>              <chr> <dbl> <int> <int>      <chr> <chr> <int> <int> <chr>   <chr>
 1         audi                 a4   1.8  1999     4   auto(l5)     f    18    29     p compact
 2         audi                 a4   1.8  1999     4 manual(m5)     f    21    29     p compact
 3         audi                 a4   2.0  2008     4 manual(m6)     f    20    31     p compact
 4         audi                 a4   2.0  2008     4   auto(av)     f    21    30     p compact
 5         audi                 a4   2.8  1999     6   auto(l5)     f    16    26     p compact
 6         audi                 a4   2.8  1999     6 manual(m5)     f    18    26     p compact
 7         audi                 a4   3.1  2008     6   auto(av)     f    18    27     p compact
 8         audi         a4 quattro   1.8  1999     4 manual(m5)     4    18    26     p compact
 9         audi    

If you want to alter the defaults for all printed tibbles, use the `options()` function:

In [145]:
options(tibble.print_min = 10, tibble.print_max = 30, tibble.width = Inf)
print(mpg)

# A tibble: 234 x 11
   manufacturer      model displ  year   cyl      trans   drv   cty   hwy    fl   class
          <chr>      <chr> <dbl> <int> <int>      <chr> <chr> <int> <int> <chr>   <chr>
 1         audi         a4   1.8  1999     4   auto(l5)     f    18    29     p compact
 2         audi         a4   1.8  1999     4 manual(m5)     f    21    29     p compact
 3         audi         a4   2.0  2008     4 manual(m6)     f    20    31     p compact
 4         audi         a4   2.0  2008     4   auto(av)     f    21    30     p compact
 5         audi         a4   2.8  1999     6   auto(l5)     f    16    26     p compact
 6         audi         a4   2.8  1999     6 manual(m5)     f    18    26     p compact
 7         audi         a4   3.1  2008     6   auto(av)     f    18    27     p compact
 8         audi a4 quattro   1.8  1999     4 manual(m5)     4    18    26     p compact
 9         audi a4 quattro   1.8  1999     4   auto(l5)     4    16    25     p compact
10         

### Selecting columns

In [158]:
tbl <- tibble(
    x = 1:10,
    y = 1:10,
)
print(tbl)

# A tibble: 10 x 2
       x     y
   <int> <int>
 1     1     1
 2     2     2
 3     3     3
 4     4     4
 5     5     5
 6     6     6
 7     7     7
 8     8     8
 9     9     9
10    10    10


The syntax for extracting a column from a tibble is
```{r}
tbl$<column name>
```

In [152]:
mean(select(tbl, x)) # select() returns a one-column tibble, not a vector, so mean() gives NA

“argument is not numeric or logical: returning NA”

[1] NA

Remember: `tbl$x` is a vector of numbers. It is different from `select(tbl, x)`, which is a tibble with one numeric column named `x`:

In [51]:
select(tbl, x)  # tibble
tbl$x  # vector 

   x        
1  0.4820801
2  0.5995658
3  0.4935413
4  0.1862176
5  0.8273733
6  0.6684667
7  0.7942399
8  0.1079436
9  0.7237109
10 0.4112744

 [1] 0.4820801 0.5995658 0.4935413 0.1862176 0.8273733 0.6684667 0.7942399
 [8] 0.1079436 0.7237109 0.4112744
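
A short sketch of why the distinction matters, using a stand-in tibble named `tbl_demo` (an illustrative name). `pull()` is the `dplyr` verb that extracts a single column as a vector, so it can be used where `select()` cannot:

```r
library(tibble)
library(dplyr)

tbl_demo <- tibble(x = 1:10, y = 1:10)

mean(tbl_demo$x)                 # works: a plain vector
mean(pull(tbl_demo, x))          # pull() also returns a plain vector

is_tibble(select(tbl_demo, x))   # select() keeps the tibble wrapper
```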

Another way to extract columns is with the `[[]]` operator. This is useful because you can also access columns by their number, or via another variable:

In [162]:
# v = "x"
# tbl[[v]] # double brackets can extract variables using names
tbl[[1]] # or using positions
tbl$1  # error

ERROR: Error in parse(text = x, srcfile = src): <text>:4:5: unexpected numeric constant
3: tbl[[1]] # or using positions
4: tbl$1
       ^


To use `$` or `[[` in a pipe, you have to use the special placeholder `.`. Recall that the `.` in a pipe stands for the LHS of the pipe:

In [163]:
tbl %>% .$x  # same as tbl$x
tbl$x
tbl %>% .[[1]]  # same as tbl[[1]] 

 [1]  1  2  3  4  5  6  7  8  9 10

 [1]  1  2  3  4  5  6  7  8  9 10

 [1]  1  2  3  4  5  6  7  8  9 10

In [56]:
var_name = "x" # store variable name in a variable

In [53]:
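tbl$var_name     # doesn't work: `$` looks for a column literally named "var_name"
tbl[[var_name]]  # works once var_name is defined; the error below comes from running this cell before var_name was set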

“Unknown or uninitialised column: 'var_name'.”

NULL

ERROR: Error in `[[.tbl_df`(tbl, var_name): object 'var_name' not found
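
Run in order, the lookup through a variable behaves as intended. A minimal sketch with a stand-in tibble (`tbl_demo` is an illustrative name):

```r
library(tibble)

tbl_demo <- tibble(x = 1:10, y = 1:10)
var_name <- "y"   # column name stored in a variable

tbl_demo[[var_name]]                  # the column named "y"
suppressWarnings(tbl_demo$var_name)   # NULL: `$` looks for a column called "var_name"
```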


We can use backticks to create variable names that are not valid R variable names, e.g., numbers:

In [166]:
(cubes <- tibble(
    `1` = 1:10,
    `2` = `1`^2,
    `3 space 4` = `1`^3
)) %>% filter(`1` > 5)

  1  2   3 space 4
1  6  36  216     
2  7  49  343     
3  8  64  512     
4  9  81  729     
5 10 100 1000     

However, it's best to make life easy and avoid this. Using these variables requires you to always quote them, for example:

In [1]:
rename(cubes, x = `1`, squares = `2`, cubes = `3 space 4`)  # needs dplyr loaded, e.g. via library(tidyverse)


### Selecting rows
Above we saw how to access columns in tibbles. Now let's see how to access rows. To select rows of a tibble, use the `filter()` command for logical conditions, or the `slice()` command for row positions.

In [68]:
filter(tbl, x < .2)  # All rows with x < .2
slice(tbl, 1:3)  # First 3 rows

  x          y       
1 0.06178627 1.124931

  x         y         
1 0.2655087 -0.8204684
2 0.3721239  0.4874291
3 0.5728534  0.7383247

The traditional syntax for selecting rows in a data frame is to pass in a numerical or logical vector:

In [56]:
tbl[tbl$x < .2, ]
tbl[c(1,2,3), ]

  x         y        
1 0.1862176 -1.989352
2 0.1079436 -1.470752

  x         y         
1 0.4820801 0.91897737
2 0.5995658 0.78213630
3 0.4935413 0.07456498

Notice that accessing rows is fundamentally different from accessing columns: a row has no corresponding vector, because its entries can be heterogeneous (have different data types). So when we say "accessing rows", we mean selecting a subset of the rows in a data table, and returning a new data table containing only those rows.
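
To make this concrete, here is a sketch with a small mixed-type tibble (`tbl_demo` and the other names are illustrative). Every way of selecting rows returns a tibble, even when only one row is selected:

```r
library(tibble)
library(dplyr)

tbl_demo <- tibble(x = 1:5, y = letters[1:5])

by_filter  <- filter(tbl_demo, x < 3)      # logical condition
by_bracket <- tbl_demo[tbl_demo$x < 3, ]   # traditional bracket syntax
one_row    <- slice(tbl_demo, 1)           # a single row by position

is_tibble(one_row)   # still a table: one row, two columns of different types
dim(one_row)
```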

### Selecting specific entries
Finally, it's possible to select specific entries of a tibble or data frame. The most basic way is to treat the data table as a 2D matrix and select by row and column coordinates:

In [175]:
tbl
mean(tbl[[1, 2]])  # entry in row 1, column 2, returned as a scalar
mean(tbl[1, 2])  # same entry, but returned as a 1 x 1 tibble, so mean() gives NA
# class(tbl)
# 

   x  y 
1   1  1
2   2  2
3   3  3
4   4  4
5   5  5
6   6  6
7   7  7
8   8  8
9   9  9
10 10 10

[1] 1

“argument is not numeric or logical: returning NA”

[1] NA

Alternatively, we could first select the column we are interested in, and then select the value.

In [177]:
tbl[["y"]][1]
tbl$y[1]

[1] 1

[1] 1

## Data import
Next we will cover tools for taking existing data sets and importing them into R.

We will focus on the `read_csv()` function but there are other ones available:

* `read_csv2()`: uses semicolon as delimiter
* `read_tsv()`: uses tab as delimiter
* `read_delim()`: can use any delimiter
* `read_fwf()`: to read fixed-width files
* `read_table()`: to read fixed-width files where columns are separated by white space
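
As a rough sketch of two of these variants, the cell below writes two tiny files to a temporary location and reads them back. The file contents and object names are invented for illustration.

```r
library(readr)

# Hypothetical pipe-delimited file.
pipe_file <- tempfile(fileext = ".txt")
writeLines(c("a|b|c", "1|2|3", "4|5|6"), pipe_file)

# Hypothetical tab-delimited file.
tab_file <- tempfile(fileext = ".tsv")
writeLines(c("a\tb\tc", "1\t2\t3", "4\t5\t6"), tab_file)

from_pipe <- read_delim(pipe_file, delim = "|")  # any single-character delimiter
from_tab  <- read_tsv(tab_file)                  # tab as delimiter
```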

### Comma-separated value data
Comma-separated value (csv) is one of the most common formats for sharing data. It has the advantage of being human-readable. The disadvantage is that there is no single standard that every program follows when reading or writing these files!

Here's an example of CSV data on heights:
    
    "earn","height","sex","ed","age","race"
    50000,74.4244387818035,"male",16,45,"white"
    60000,65.5375428255647,"female",16,58,"white"
    30000,63.6291977374349,"female",16,29,"white"
    50000,63.1085616752971,"female",16,91,"other"
    51000,63.4024835710879,"female",17,39,"white"
    9000,64.3995075440034,"female",15,26,"white"
    
The first row (usually) has a *header* giving the column names. Subsequent rows give the actual data. Strings are (usually) quoted.

To read in csv data we will use the `read_csv` command. Note that this command comes from the `readr` package, which is part of `tidyverse`, and is different from `read.csv` in base R! You generally want to use `read_csv` over `read.csv` since:
- It is much faster.
- It outputs nicely formatted `tibble`s which you can pass into other tidyverse functions.

In [178]:
heights <- read_csv("heights.csv")

Parsed with column specification:
cols(
  earn = col_double(),
  height = col_double(),
  sex = col_character(),
  ed = col_integer(),
  age = col_integer(),
  race = col_character()
)


Here `read_csv` has told us which columns it found and which data type it guessed for each of them. Generally these guesses will be correct, but we will see examples later where it guesses wrongly and we have to override them manually.

To create short examples illustrating `read_csv`'s behavior, we can specify the contents of a csv file inline.

In [91]:
read_csv(
    "a, b, c
     1, 2, 3
     4, 5, 6
")

  a b c
1 1 2 3
2 4 5 6

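You might want to skip a few rows at the beginning that contain metadata, using the `skip` argument. Note that the cell below sets `skip = 2` although there are three metadata lines, so the third line is taken as the header and the remaining rows trigger parsing failures; `skip = 3` reads the data as intended.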

In [181]:
read_csv(
"First row to skip
Second row to skip
Third row to skip
a, b, c
1, 2, 3
4, 5, 6
", skip = 2)

“3 parsing failures.
# A tibble: 3 x 5
    row   col  expected    actual         file
  <int> <chr>     <chr>     <chr>        <chr>
1     1  <NA> 1 columns 3 columns literal data
2     2  <NA> 1 columns 3 columns literal data
3     3  <NA> 1 columns 3 columns literal data
”

  Third row to skip
1 a                
2 1                
3 4                
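
In the output above the third metadata line ends up as the header. A sketch of the same data read with all three metadata lines skipped, written to a temporary file first (object names are illustrative):

```r
library(readr)

csv_file <- tempfile(fileext = ".csv")
writeLines(c(
    "First row to skip",
    "Second row to skip",
    "Third row to skip",
    "a,b,c",
    "1,2,3",
    "4,5,6"
), csv_file)

# skip = 3 drops all three metadata lines, so "a,b,c" becomes the header.
skipped <- read_csv(csv_file, skip = 3)
names(skipped)
```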

You can also skip comment lines by specifying a comment character.

In [184]:
read_csv("
- First comment line
a, b, c
- This separates the header from the data
1, 2, 3
4, 5, 6
- Another comment line
", comment = '-')

  a b c
1 a b c
2 1 2 3
3 4 5 6

Set `col_names = FALSE` when you don't have column names in the file. The column names are then set to X1, X2, ...

In [186]:
read_csv("
1, 2, 3
4, 5, 6
", col_names = FALSE)

  X1 X2 X3
1 1  2  3 
2 4  5  6 

You can specify your own column names.

In [101]:
read_csv("
1, 2, 3
4, 5, 6
", col_names = c("a", "b", "c"))

  a b c
1 1 2 3
2 4 5 6

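You can specify how missing values are represented in the file. By default an empty field is read as `NA`; use the `na` argument to name other strings that should be treated as missing.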

In [189]:
read_csv(
    "a, b, c
     1, 2, 3
     4,,6
") %>% print

# A tibble: 2 x 3
      a     b     c
  <int> <int> <int>
1     1     2     3
2     4    NA     6


In [191]:
read_csv(
    "a, b, c
     1, 2, 3
     4, ., 6
", na = ".") %>% print

# A tibble: 2 x 3
      a     b     c
  <int> <int> <int>
1     1     2     3
2     4    NA     6
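
The `na` argument also accepts several markers at once. A sketch with a made-up file in which `.`, `-99`, and `NA` all stand for missing values:

```r
library(readr)

na_file <- tempfile(fileext = ".csv")
writeLines(c("a,b,c", "1,2,3", "4,.,6", "7,-99,NA"), na_file)

# Every string listed in `na` is converted to a missing value.
with_na <- read_csv(na_file, na = c(".", "-99", "NA"))
with_na$b
with_na$c
```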


You can write a tibble to a csv file using `write_csv()`.

In [193]:
cubes %>% print
write_csv(cubes, "cubes.csv")

# A tibble: 10 x 3
     `1`   `2` `3 space 4`
   <int> <dbl>       <dbl>
 1     1     1           1
 2     2     4           8
 3     3     9          27
 4     4    16          64
 5     5    25         125
 6     6    36         216
 7     7    49         343
 8     8    64         512
 9     9    81         729
10    10   100        1000


In [194]:
cat(read_file('cubes.csv'))

1,2,3 space 4
1,1,1
2,4,8
3,9,27
4,16,64
5,25,125
6,36,216
7,49,343
8,64,512
9,81,729
10,100,1e3


In [196]:
cubes2 <- read_csv("cubes.csv")
print(cubes2)

Parsed with column specification:
cols(
  `1` = col_integer(),
  `2` = col_integer(),
  `3 space 4` = col_double()
)


# A tibble: 10 x 3
     `1`   `2` `3 space 4`
   <int> <int>       <dbl>
 1     1     1           1
 2     2     4           8
 3     3     9          27
 4     4    16          64
 5     5    25         125
 6     6    36         216
 7     7    49         343
 8     8    64         512
 9     9    81         729
10    10   100        1000
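
A csv file stores no type information, so column types are guessed again when the file is read back; in the output above, column `2` was written as `<dbl>` but came back as `<int>`. One way to make a round trip preserve types is to state them explicitly. This is a sketch on a small invented tibble, not the `cubes` table itself:

```r
library(tibble)
library(readr)

squares <- tibble(n = 1:3, sq = n^2)   # n is integer, sq is double

sq_file <- tempfile(fileext = ".csv")
write_csv(squares, sq_file)

# State the types instead of letting read_csv() guess them.
reread <- read_csv(
    sq_file,
    col_types = cols(n = col_integer(), sq = col_double())
)
```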


### Fixed-width files
Another common data format is called a *fixed-width file*. Each data column gets its own fixed width, in this case six characters:

    a     b     c
    10    2     3
    4     1.5   3

To read this type of data we use the `read_table()` function:

In [207]:
read_table("a     b     c
10    2     3
4     1.5   3")

  a  b   c
1 10 2.0 3
2  4 1.5 3
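
When columns are not reliably separated by white space, `read_fwf()` lets you state the widths yourself. A sketch on the same data, assuming widths of 6, 6, and 1 characters and skipping the header line so the names can be supplied by hand:

```r
library(readr)

fwf_file <- tempfile(fileext = ".txt")
writeLines(c(
    "a     b     c",
    "10    2     3",
    "4     1.5   3"
), fwf_file)

# fwf_widths() describes each column by its width in characters.
fixed <- read_fwf(
    fwf_file,
    col_positions = fwf_widths(c(6, 6, 1), col_names = c("a", "b", "c")),
    skip = 1
)
```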

### How parsing works
Sometimes the automatic parsers will fail. To understand why, it's helpful to look at how these functions actually parse data.

The first step is to guess each column type. The parser functions will look at the first few entries of each column and use that to try and guess the column type. The default is 1000 entries and can be controlled with the `guess_max=` option.

In [211]:
tbl = read_csv(
"a, b
1, 3
2, 4
's', 't'
", guess_max=2
)
problems(tbl)

“2 parsing failures.
# A tibble: 2 x 5
    row   col   expected actual         file
  <int> <chr>      <chr>  <chr>        <chr>
1     3     a an integer    's' literal data
2     3     b an integer    't' literal data
”

  row col expected   actual file        
1 3   a   an integer 's'    literal data
2 3   b   an integer 't'    literal data

The reason this fails for `guess_max=2` is that it looks at the first two entries, sees integers, and assumes the rest of the column will be integers. Then it effectively calls `parse_integer()` on the column's strings, here `c("1", "2", "'s'")`, and the last entry cannot be parsed as an integer:

In [221]:
# guess_parser(c("1.1", "2"))
# guess_parser(c("1", "2", "c"))
parse_integer(c("1", "2", "b"))

“1 parsing failure.
# A tibble: 1 x 4
    row   col   expected actual
  <int> <int>      <chr>  <chr>
1     3    NA an integer      b
”

[1]  1  2 NA
attr(,"problems")
# A tibble: 1 x 4
    row   col   expected actual
  <int> <int>      <chr>  <chr>
1     3    NA an integer      b
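
The guessing step can be inspected directly with `guess_parser()`, which reports the type `readr` would choose for a vector of strings. A short sketch (object names are illustrative):

```r
library(readr)

guess_parser(c("1.1", "2"))                  # all entries parse as numbers
guess_parser(c("1", "2", "c"))               # one non-number forces character
guess_parser(c("2018-01-31", "2018-02-01"))  # ISO dates are recognised
guess_parser(c("TRUE", "FALSE"))

# Parsing with the wrong type keeps going and records what failed.
parsed <- suppressWarnings(parse_integer(c("1", "2", "b")))
parsed
problems(parsed)
```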

A useful function for figuring out why parsing went wrong is `problems()`:

In [178]:
tbl = read_csv(
"a, b
1, 3
2, 4
'b', 'c'
", guess_max=2
)
problems(tbl)

“2 parsing failures.
# A tibble: 2 x 5
    row   col   expected actual         file
  <int> <chr>      <chr>  <chr>        <chr>
1     3     a an integer    'b' literal data
2     3     b an integer    'c' literal data
”

  row col expected   actual file        
1 3   a   an integer 'b'    literal data
2 3   b   an integer 'c'    literal data

If you already know the type of each column, you can tell R directly with `col_types` rather than hoping it guesses correctly:

In [214]:
read_csv(
"a, b
1, 3
2, 4
1, 2
",
   col_types=list(
       a=col_character(),
       b=col_character()
   )
) %>% print

# A tibble: 3 x 2
      a     b
  <chr> <chr>
1     1     3
2     2     4
3     1     2
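
`col_types` also accepts a compact string with one letter per column (for example `i` for integer and `c` for character), or a `cols()` specification. A sketch on the same data, written to a temporary file (object names are illustrative):

```r
library(readr)

typed_file <- tempfile(fileext = ".csv")
writeLines(c("a,b", "1,3", "2,4", "1,2"), typed_file)

# Compact form: first column integer, second column character.
compact <- read_csv(typed_file, col_types = "ic")

# Equivalent explicit form using cols().
explicit <- read_csv(
    typed_file,
    col_types = cols(a = col_integer(), b = col_character())
)
```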
