# R for data manipulation and visualization

MNB data course

# R basics

R is an

 - interpreted
 - dynamically typed
 - multi-paradigm

programming language.

Although it can be considered multi-paradigm (like Python) because it supports a variety of coding styles, it is *at heart a functional* language.

If none of the above means anything to you yet, that is perfectly all right.

## Vectors

In R, there is no difference between a scalar and a (one-element) vector.

In [1]:
vec1 <- c(1, 2, 3)
vec2 <- c(4, 5, 6)

In [2]:
vec1 + vec2

In [3]:
c(vec1, vec2)
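A quick way to convince yourself that a scalar is just a length-one vector is the sketch below (the variable names are made up for illustration):

```r
# A "scalar" is simply a vector of length one
x <- 5
length(x)            # 1
is.vector(x)         # TRUE
identical(x, c(5))   # TRUE

# Arithmetic is element-wise, so the length-one vector is recycled
vec <- c(1, 2, 3)
vec * 2              # 2 4 6
```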

All elements of a vector must be of the same type, but implicit coercion occurs in some cases:

In [4]:
vec_char <- c("a", "b", "c")
c(vec1, vec_char)

But not in others

In [5]:
vec1 + vec_char

ERROR: Error in vec1 + vec_char: non-numeric argument to binary operator


The most common vector types are
 - logical: `c(TRUE, FALSE)`
 - integer: `c(1L, 2L, 3L)` (without the `L` suffix, `c(1, 2, 3)` is a double vector)
 - double: `c(1.5, 2.0, 2.5)`
 - character: `c("a", 'b', "c")`
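`typeof()` reports the underlying type, which makes the coercion rules visible. A small sketch: when types are mixed, `c()` picks the most general one (logical, then integer, then double, then character).

```r
# The four common atomic types
typeof(c(TRUE, FALSE))     # "logical"
typeof(c(1L, 2L, 3L))      # "integer"
typeof(c(1, 2, 3))         # "double" (no L suffix)
typeof(c("a", "b"))        # "character"

# Mixing types coerces to the more general one
typeof(c(TRUE, 2L))        # "integer"
typeof(c(1L, 2.5))         # "double"
typeof(c(1.5, "a"))        # "character"
```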

## Lists

Lists are more flexible.
 - can have heterogeneous elements
 - are recursive (can have lists as elements)
 - can have vectors as elements

In [6]:
lst1 <- list(1, 2, 3)
lst1

In [7]:
lst2 <- list(
    c(1, 2, 3),
    c(4, 5, 6)
)
lst2

In [8]:
c(lst1, lst2)

In [9]:
lst1[[4]] <- lst2
lst1

`[]` always returns a sublist, while `[[]]` always returns a single element:

In [10]:
lst1[1]

In [11]:
lst1[[1]]

In [12]:
lst1[1:2]

In [13]:
lst1[[1:2]]

ERROR: Error in lst1[[1:2]]: subscript out of bounds
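The error above is not about selecting two elements at once: given a vector, `[[` indexes recursively, so `lst1[[1:2]]` means `lst1[[1]][[2]]`, and the first element has no second component. A minimal sketch that rebuilds the list from the cells above:

```r
lst1 <- list(1, 2, 3)
lst1[[4]] <- list(c(1, 2, 3), c(4, 5, 6))

class(lst1[1])      # "list": a sublist of length one
class(lst1[[1]])    # "numeric": the element itself

# Recursive indexing: take the fourth element, then its second element
lst1[[c(4, 2)]]     # 4 5 6
```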


## Functions

Defining functions in R is really simple. Do so often!

In [14]:
fun1 <- function(a, b) {
    return(a + b)
}
fun1(2, 3)

If there is no call to `return()` (which is a function in R), the value of the last evaluated expression is returned implicitly:

In [15]:
fun2 <- function(a, b) {
    a - b
}
fun2(2, 3)

Function arguments can also be specified by name instead of position

In [16]:
fun2(b = 2, a = 3)

And functions can do a whole lot of things in R:
 - take functions as arguments
 - return functions
 - delay the evaluation of their arguments
 - etc.

However, their most important use cases are *eliminating code repetition* and *structuring the code*
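The first three items in the list can be illustrated with a short sketch. All function names here (`apply_twice`, `make_adder`, `first_only`) are made up for this example:

```r
# 1. A function that takes another function as an argument
apply_twice <- function(f, x) f(f(x))

# 2. A function that returns a function (a closure remembering n)
make_adder <- function(n) {
    function(x) x + n
}
add_two <- make_adder(2)
apply_twice(add_two, 1)    # 5

# 3. Lazy evaluation: y is never used, so it is never evaluated
first_only <- function(x, y) x
first_only(1, stop("this is never evaluated"))    # 1
```

The last call returns normally because the `stop()` expression is only evaluated if the function body actually touches `y`.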

## Control flow

R has the usual set of control flow tools

In [17]:
a <- 3
if (a < 2) {
    print("a is small")
} else if (a < 4) {
    print("a is pretty large")
} else {
    print("Whoah, a is humongous")
}

[1] "a is pretty large"


In [18]:
for (i in 1:3) {
    print(i)
}

[1] 1
[1] 2
[1] 3


In [19]:
n <- 1
while (n < 3) {
    print(n)
    n <- n + 1
}

[1] 1
[1] 2


They can be nested of course

In [20]:
for (i in 1:10) {
    if (i %% 6 == 0) {
        print("fizzbuzz")
    } else if (i %% 2 == 0) {
        print("fizz")
    } else if (i %% 3 == 0) {
        print("buzz")
    } else {
        print(i)
    }
}

[1] 1
[1] "fizz"
[1] "buzz"
[1] "fizz"
[1] 5
[1] "fizzbuzz"
[1] 7
[1] "fizz"
[1] "buzz"
[1] "fizz"


# Working with data

Several toolkits are available for working with data:

 - Base R:
   - (+) Available by default
   - (-) Sometimes questionable and idiosyncratic behavior
   - (-) Not very intuitive/readable
   - (/) Suitable for small-medium data (<1-2 GiB)


 - data.table:
   - (+) Very concise syntax
   - (+) Suitable for small-large(ish) data (tens of GiB depending on available memory)
   - (+) Lightning fast filtering and joins
   - (-) Too concise syntax (not very readable)
   - (-) Not so easy to learn


 - Tidyverse:
   - (+) Very readable syntax
   - (+) Relatively easy to learn
   - (-) Verbose syntax
   - (/) Suitable for small-medium data (<1-2 GiB)

## Reading data

### Base R

In [21]:
cars_df <- read.csv("cars.csv")

In [22]:
head(cars_df)

Acceleration,Cylinders,Displacement,Horsepower,Miles_per_Gallon,Name,Origin,Weight_in_lbs,Year
12.0,8,307,130,18,chevrolet chevelle malibu,USA,3504,1970-01-01
11.5,8,350,165,15,buick skylark 320,USA,3693,1970-01-01
11.0,8,318,150,18,plymouth satellite,USA,3436,1970-01-01
12.0,8,304,150,16,amc rebel sst,USA,3433,1970-01-01
10.5,8,302,140,17,ford torino,USA,3449,1970-01-01
10.0,8,429,198,15,ford galaxie 500,USA,4341,1970-01-01


In [23]:
sapply(cars_df, class)
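One thing worth checking with `sapply(..., class)` is the `Year` column: `read.csv()` does not parse dates, so the column arrives as text and has to be converted by hand. A sketch using two inline rows in place of `cars.csv` (the `text =` argument stands in for the file, and `toy_df` is a made-up name):

```r
csv_text <- "Name,Year
ford torino,1970-01-01
datsun pl510,1971-01-01"

# stringsAsFactors = FALSE keeps the behaviour identical across R versions
toy_df <- read.csv(text = csv_text, stringsAsFactors = FALSE)
sapply(toy_df, class)            # both columns are "character"

# Convert the ISO-formatted strings to proper dates
toy_df$Year <- as.Date(toy_df$Year)
class(toy_df$Year)               # "Date"
```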

### data.table

In [24]:
library(data.table)

In [25]:
cars_dt <- fread("cars.csv")

In [26]:
head(cars_dt)

Acceleration,Cylinders,Displacement,Horsepower,Miles_per_Gallon,Name,Origin,Weight_in_lbs,Year
12.0,8,307,130,18,chevrolet chevelle malibu,USA,3504,1970-01-01
11.5,8,350,165,15,buick skylark 320,USA,3693,1970-01-01
11.0,8,318,150,18,plymouth satellite,USA,3436,1970-01-01
12.0,8,304,150,16,amc rebel sst,USA,3433,1970-01-01
10.5,8,302,140,17,ford torino,USA,3449,1970-01-01
10.0,8,429,198,15,ford galaxie 500,USA,4341,1970-01-01


In [27]:
sapply(cars_dt, class)

### Tidyverse

In [28]:
library(tidyverse)

-- Attaching packages --------------------------------------- tidyverse 1.2.1 --
v ggplot2 3.1.1       v purrr   0.3.2  
v tibble  2.1.1       v dplyr   0.8.0.1
v tidyr   0.8.3       v stringr 1.4.0  
v readr   1.3.1       v forcats 0.4.0  
-- Conflicts ------------------------------------------ tidyverse_conflicts() --
x dplyr::between()   masks data.table::between()
x dplyr::filter()    masks stats::filter()
x dplyr::first()     masks data.table::first()
x dplyr::lag()       masks stats::lag()
x dplyr::last()      masks data.table::last()
x purrr::transpose() masks data.table::transpose()


In [29]:
cars_tbl <- read_csv("cars.csv")

Parsed with column specification:
cols(
  Acceleration = col_double(),
  Cylinders = col_double(),
  Displacement = col_double(),
  Horsepower = col_double(),
  Miles_per_Gallon = col_double(),
  Name = col_character(),
  Origin = col_character(),
  Weight_in_lbs = col_double(),
  Year = col_date(format = "")
)


In [30]:
head(cars_tbl)

Acceleration,Cylinders,Displacement,Horsepower,Miles_per_Gallon,Name,Origin,Weight_in_lbs,Year
12.0,8,307,130,18,chevrolet chevelle malibu,USA,3504,1970-01-01
11.5,8,350,165,15,buick skylark 320,USA,3693,1970-01-01
11.0,8,318,150,18,plymouth satellite,USA,3436,1970-01-01
12.0,8,304,150,16,amc rebel sst,USA,3433,1970-01-01
10.5,8,302,140,17,ford torino,USA,3449,1970-01-01
10.0,8,429,198,15,ford galaxie 500,USA,4341,1970-01-01


In [31]:
sapply(cars_tbl, class)

## Manipulating data

### Slicing and filtering

Single column as a vector:

In [32]:
cars_df[, "Miles_per_Gallon"][1:10]

In [33]:
cars_dt[, Miles_per_Gallon][1:10]

In [34]:
pull(cars_tbl, Miles_per_Gallon)[1:10]

Single column as a data.frame / data.table / tibble

In [35]:
cars_df[, "Miles_per_Gallon", drop = FALSE] %>% head(3)

Miles_per_Gallon
18
15
18


In [36]:
cars_dt[, .(Miles_per_Gallon)] %>% head(3)

Miles_per_Gallon
18
15
18


In [37]:
select(cars_tbl, Miles_per_Gallon) %>% head(3)

Miles_per_Gallon
18
15
18


Rows:

In [38]:
cars_df[3:6, ]  # the empty second argument is important!

Unnamed: 0,Acceleration,Cylinders,Displacement,Horsepower,Miles_per_Gallon,Name,Origin,Weight_in_lbs,Year
3,11.0,8,318,150,18,plymouth satellite,USA,3436,1970-01-01
4,12.0,8,304,150,16,amc rebel sst,USA,3433,1970-01-01
5,10.5,8,302,140,17,ford torino,USA,3449,1970-01-01
6,10.0,8,429,198,15,ford galaxie 500,USA,4341,1970-01-01


In [39]:
cars_dt[3:6]  # no empty second argument is needed

Acceleration,Cylinders,Displacement,Horsepower,Miles_per_Gallon,Name,Origin,Weight_in_lbs,Year
11.0,8,318,150,18,plymouth satellite,USA,3436,1970-01-01
12.0,8,304,150,16,amc rebel sst,USA,3433,1970-01-01
10.5,8,302,140,17,ford torino,USA,3449,1970-01-01
10.0,8,429,198,15,ford galaxie 500,USA,4341,1970-01-01


In [40]:
slice(cars_tbl, 3:6)  # the base R solution works as well

Acceleration,Cylinders,Displacement,Horsepower,Miles_per_Gallon,Name,Origin,Weight_in_lbs,Year
11.0,8,318,150,18,plymouth satellite,USA,3436,1970-01-01
12.0,8,304,150,16,amc rebel sst,USA,3433,1970-01-01
10.5,8,302,140,17,ford torino,USA,3449,1970-01-01
10.0,8,429,198,15,ford galaxie 500,USA,4341,1970-01-01


Columns by position:

In [41]:
cars_df[, 1:2] %>% head()

Acceleration,Cylinders
12.0,8
11.5,8
11.0,8
12.0,8
10.5,8
10.0,8


In [42]:
cars_dt[, 1:2] %>% head()

Acceleration,Cylinders
12.0,8
11.5,8
11.0,8
12.0,8
10.5,8
10.0,8


In [43]:
select(cars_tbl, 1:2) %>% head()

Acceleration,Cylinders
12.0,8
11.5,8
11.0,8
12.0,8
10.5,8
10.0,8


Rows and columns:

In [44]:
cars_df[3:6, 1:2]

Unnamed: 0,Acceleration,Cylinders
3,11.0,8
4,12.0,8
5,10.5,8
6,10.0,8


In [45]:
cars_dt[3:6, 1:2]

Acceleration,Cylinders
11.0,8
12.0,8
10.5,8
10.0,8


In [46]:
cars_tbl %>% 
    slice(3:6) %>% 
    select(1:2)

Acceleration,Cylinders
11.0,8
12.0,8
10.5,8
10.0,8


More complex indexing:

In [47]:
cars_df[c(1, 3:4, 9), c(1:2, 8)]

Unnamed: 0,Acceleration,Cylinders,Weight_in_lbs
1,12,8,3504
3,11,8,3436
4,12,8,3433
9,10,8,4425


In [48]:
cars_dt[c(1, 3:4, 9), c(1:2, 8)]

Acceleration,Cylinders,Weight_in_lbs
12,8,3504
11,8,3436
12,8,3433
10,8,4425


In [49]:
cars_tbl %>%
    slice(c(1, 3:4, 9)) %>% 
    select(c(1:2, 8))

Acceleration,Cylinders,Weight_in_lbs
12,8,3504
11,8,3436
12,8,3433
10,8,4425


Based on column name:

In [50]:
cars_df[c(1, 3:4, 9), c("Horsepower", "Name")]

Unnamed: 0,Horsepower,Name
1,130,chevrolet chevelle malibu
3,150,plymouth satellite
4,150,amc rebel sst
9,225,pontiac catalina


In [51]:
cars_dt[c(1, 3:4, 9), .(Horsepower, Name)]

Horsepower,Name
130,chevrolet chevelle malibu
150,plymouth satellite
150,amc rebel sst
225,pontiac catalina


In [52]:
cars_tbl %>%
    slice(c(1, 3:4, 9)) %>% 
    select(Horsepower, Name)

Horsepower,Name
130,chevrolet chevelle malibu
150,plymouth satellite
150,amc rebel sst
225,pontiac catalina


Based on conditions:

In [53]:
cars_df[cars_df$Origin == "USA" & cars_df$Horsepower < 62, ]  # watch out for the NAs!

Unnamed: 0,Acceleration,Cylinders,Displacement,Horsepower,Miles_per_Gallon,Name,Origin,Weight_in_lbs,Year
NA,,,,,,,,,
NA.1,,,,,,,,,
203,22.2,4.0,85.0,52.0,29.0,chevrolet chevette,USA,2035.0,1976-01-01
204,22.1,4.0,98.0,60.0,24.5,chevrolet woody,USA,2164.0,1976-01-01
NA.2,,,,,,,,,
NA.3,,,,,,,,,
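The all-`NA` rows in the base R result appear because `Horsepower` has missing values: the comparison yields `NA`, and indexing a data frame with `NA` returns an empty row. `data.table` and `filter()` treat such rows as not matching. A sketch of two base R workarounds on a made-up three-row data frame:

```r
toy <- data.frame(
    Origin = c("USA", "USA", "Japan"),
    Horsepower = c(52, NA, 60)
)

cond <- toy$Origin == "USA" & toy$Horsepower < 62
cond                        # TRUE NA FALSE

nrow(toy[cond, ])           # 2: the NA produces an extra all-NA row
nrow(toy[which(cond), ])    # 1: which() keeps only the TRUE positions
nrow(subset(toy, Origin == "USA" & Horsepower < 62))    # 1: subset() drops NAs
```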


In [54]:
cars_dt[Origin == "USA" & Horsepower < 62]

Acceleration,Cylinders,Displacement,Horsepower,Miles_per_Gallon,Name,Origin,Weight_in_lbs,Year
22.2,4,85,52,29.0,chevrolet chevette,USA,2035,1976-01-01
22.1,4,98,60,24.5,chevrolet woody,USA,2164,1976-01-01


In [55]:
filter(cars_tbl, Origin == "USA" & Horsepower < 62)

Acceleration,Cylinders,Displacement,Horsepower,Miles_per_Gallon,Name,Origin,Weight_in_lbs,Year
22.2,4,85,52,29.0,chevrolet chevette,USA,2035,1976-01-01
22.1,4,98,60,24.5,chevrolet woody,USA,2164,1976-01-01


These examples only cover a subset of slicing options, but they should be enough for most use cases.

### Creating columns

In [56]:
cars_df$Weight_in_kg <- cars_df$Weight_in_lbs * 0.453592
cars_df$l_per_100km <- 235.215 / cars_df$Miles_per_Gallon

In [57]:
cars_dt[, Weight_in_kg := Weight_in_lbs * 0.453592]
cars_dt[, l_per_100km := 235.215 / Miles_per_Gallon]

In [58]:
cars_tbl <- mutate(cars_tbl,
                   Weight_in_kg = Weight_in_lbs * 0.453592,
                   l_per_100km = 235.215 / Miles_per_Gallon)

In [59]:
head(cars_tbl)

Acceleration,Cylinders,Displacement,Horsepower,Miles_per_Gallon,Name,Origin,Weight_in_lbs,Year,Weight_in_kg,l_per_100km
12.0,8,307,130,18,chevrolet chevelle malibu,USA,3504,1970-01-01,1589.386,13.0675
11.5,8,350,165,15,buick skylark 320,USA,3693,1970-01-01,1675.115,15.681
11.0,8,318,150,18,plymouth satellite,USA,3436,1970-01-01,1558.542,13.0675
12.0,8,304,150,16,amc rebel sst,USA,3433,1970-01-01,1557.181,14.70094
10.5,8,302,140,17,ford torino,USA,3449,1970-01-01,1564.439,13.83618
10.0,8,429,198,15,ford galaxie 500,USA,4341,1970-01-01,1969.043,15.681
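The two constants used above follow from unit definitions: one pound is 0.45359237 kg, one US gallon is 3.785411784 l, and one mile is 1.609344 km. A short check (the variable and helper names are made up for this sketch):

```r
kg_per_lb    <- 0.45359237
l_per_gallon <- 3.785411784
km_per_mile  <- 1.609344

# l / 100 km = 100 * (litres per gallon) / (km per mile) / mpg
mpg_factor <- 100 * l_per_gallon / km_per_mile
round(mpg_factor, 3)        # 235.215

mpg_to_l_per_100km <- function(mpg) mpg_factor / mpg
mpg_to_l_per_100km(18)      # about 13.07, in line with the first row above
```

Because the relationship is a reciprocal, doubling the miles per gallon halves the litres per 100 km.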


### Aggregating data

Simple:

In [60]:
aggregate(cars_df$Miles_per_Gallon,
          by = list(cars_df$Origin),
          FUN = function (x) mean(x, na.rm = TRUE))

Group.1,x
Europe,27.89143
Japan,30.45063
USA,20.08353


In [61]:
cars_dt[, mean(Miles_per_Gallon, na.rm = TRUE), by = Origin]

Origin,V1
USA,20.08353
Europe,27.89143
Japan,30.45063


In [62]:
cars_tbl %>% 
    group_by(Origin) %>% 
    summarise(mean(Miles_per_Gallon, na.rm = TRUE))

Origin,"mean(Miles_per_Gallon, na.rm = TRUE)"
Europe,27.89143
Japan,30.45063
USA,20.08353


Advanced:

In [63]:
cars_dt[
    , 
    .(mpg = mean(Miles_per_Gallon, na.rm = TRUE),
      weight = mean(Weight_in_lbs, na.rm = TRUE)),
    by = .(Origin, Cylinders)
]

Origin,Cylinders,mpg,weight
USA,8,14.96311,4105.194
Europe,4,28.41111,2343.318
Japan,4,31.59565,2153.493
USA,6,19.66351,3213.905
USA,4,27.84028,2437.167
Japan,3,20.55,2398.5
Japan,6,23.88333,2882.0
Europe,6,20.1,3382.5
Europe,5,27.36667,3103.333


In [64]:
cars_tbl %>% 
    group_by(Origin, Cylinders) %>% 
    summarise(mpg = mean(Miles_per_Gallon, na.rm = TRUE),
              weight = mean(Weight_in_lbs, na.rm = TRUE))

Origin,Cylinders,mpg,weight
Europe,4,28.41111,2343.318
Europe,5,27.36667,3103.333
Europe,6,20.1,3382.5
Japan,3,20.55,2398.5
Japan,4,31.59565,2153.493
Japan,6,23.88333,2882.0
USA,4,27.84028,2437.167
USA,6,19.66351,3213.905
USA,8,14.96311,4105.194
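The advanced example has a base R counterpart as well. A sketch on a made-up five-row data frame, using the formula interface of `aggregate()`; note that this interface drops every row with an `NA` in any of the variables used, which is not quite the same as a per-column `na.rm = TRUE`:

```r
toy <- data.frame(
    Origin           = c("USA", "USA", "USA", "Japan", "Japan"),
    Cylinders        = c(8, 8, 4, 4, 4),
    Miles_per_Gallon = c(15, 17, 28, 30, 32),
    Weight_in_lbs    = c(4000, 4200, 2400, 2100, 2300)
)

# Mean of two columns for every Origin / Cylinders combination
res <- aggregate(
    cbind(Miles_per_Gallon, Weight_in_lbs) ~ Origin + Cylinders,
    data = toy,
    FUN = mean
)
res
```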


Again, this is only the tip of the iceberg.

### Joins

In [65]:
capitals_df <- data.frame(
    country = c("USA", "Japan", "Hungary"),
    capital = c("Washington DC", "Tokyo", "Budapest")
)
capitals_dt <- data.table(
    country = c("USA", "Japan", "Hungary"),
    capital = c("Washington DC", "Tokyo", "Budapest")
)
capitals_tbl <- tibble(  # I know it's confusing, but it is a tibble
    country = c("USA", "Japan", "Hungary"),
    capital = c("Washington DC", "Tokyo", "Budapest")
)
capitals_tbl

country,capital
USA,Washington DC
Japan,Tokyo
Hungary,Budapest


In [66]:
merge(
    cars_df,
    capitals_df,
    by.x = "Origin",
    by.y = "country",
    all.x = FALSE,
    all.y = FALSE
) %>% head()

Origin,Acceleration,Cylinders,Displacement,Horsepower,Miles_per_Gallon,Name,Weight_in_lbs,Year,Weight_in_kg,l_per_100km,capital
Japan,14.5,4,120,97,23.0,toyouta corona mark ii (sw),2506,1972-01-01,1136.7016,10.226739,Tokyo
Japan,14.5,4,97,88,27.0,datsun pl510,2130,1971-01-01,966.151,8.711667,Tokyo
Japan,18.5,4,98,68,31.5,honda Accelerationord cvcc,2045,1977-01-01,927.5956,7.467143,Tokyo
Japan,16.5,4,97,88,27.0,toyota corolla 1600 (sw),2100,1972-01-01,952.5432,8.711667,Tokyo
Japan,19.2,4,85,65,31.8,datsun 210,2020,1979-01-01,916.2558,7.396698,Tokyo
Japan,14.5,6,146,97,22.0,datsun 810,2815,1977-01-01,1276.8615,10.691591,Tokyo


In [67]:
merge(
    cars_dt,
    capitals_dt,
    by.x = "Origin",
    by.y = "country",
    all.x = FALSE,
    all.y = FALSE
) %>% head()

Origin,Acceleration,Cylinders,Displacement,Horsepower,Miles_per_Gallon,Name,Weight_in_lbs,Year,Weight_in_kg,l_per_100km,capital
Japan,15.0,4,113,95,24,toyota corona mark ii,2372,1970-01-01,1075.9202,9.800625,Tokyo
Japan,14.5,4,97,88,27,datsun pl510,2130,1970-01-01,966.151,8.711667,Tokyo
Japan,14.5,4,97,88,27,datsun pl510,2130,1971-01-01,966.151,8.711667,Tokyo
Japan,14.0,4,113,95,25,toyota corona,2228,1971-01-01,1010.603,9.4086,Tokyo
Japan,19.0,4,71,65,31,toyota corolla 1200,1773,1971-01-01,804.2186,7.587581,Tokyo
Japan,18.0,4,72,69,35,datsun 1200,1613,1971-01-01,731.6439,6.720429,Tokyo


In [68]:
inner_join(
    cars_tbl,
    capitals_tbl,
    by = c("Origin" = "country")
) %>% head()

Acceleration,Cylinders,Displacement,Horsepower,Miles_per_Gallon,Name,Origin,Weight_in_lbs,Year,Weight_in_kg,l_per_100km,capital
12.0,8,307,130,18,chevrolet chevelle malibu,USA,3504,1970-01-01,1589.386,13.0675,Washington DC
11.5,8,350,165,15,buick skylark 320,USA,3693,1970-01-01,1675.115,15.681,Washington DC
11.0,8,318,150,18,plymouth satellite,USA,3436,1970-01-01,1558.542,13.0675,Washington DC
12.0,8,304,150,16,amc rebel sst,USA,3433,1970-01-01,1557.181,14.70094,Washington DC
10.5,8,302,140,17,ford torino,USA,3449,1970-01-01,1564.439,13.83618,Washington DC
10.0,8,429,198,15,ford galaxie 500,USA,4341,1970-01-01,1969.043,15.681,Washington DC


 - all of the tools can do basic joins (left, right, inner, outer)
 - tidyverse can do it in a very readable way
 - data.table can do much more, and much more quickly (e.g. non-equi joins)
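The four basic join types differ only in which unmatched rows are kept. A base R sketch with `merge()` on two made-up data frames (in `dplyr` the corresponding verbs are `inner_join()`, `left_join()`, `right_join()` and `full_join()`):

```r
cars_small <- data.frame(
    Name   = c("ford torino", "datsun pl510", "volvo 264gl"),
    Origin = c("USA", "Japan", "Europe"),
    stringsAsFactors = FALSE
)
capitals_small <- data.frame(
    country = c("USA", "Japan", "Hungary"),
    capital = c("Washington DC", "Tokyo", "Budapest"),
    stringsAsFactors = FALSE
)

join <- function(...) {
    merge(cars_small, capitals_small, by.x = "Origin", by.y = "country", ...)
}

nrow(join())               # 2: inner join, matches only
nrow(join(all.x = TRUE))   # 3: left join, keeps the European car (capital is NA)
nrow(join(all.y = TRUE))   # 3: right join, keeps Hungary (Name is NA)
nrow(join(all = TRUE))     # 4: full outer join
```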

### Reordering a dataset

In [69]:
cars_df <- cars_df[order(cars_df$Origin, -cars_df$Displacement), ]

In [70]:
setorder(cars_dt, Origin, -Displacement)

In [71]:
cars_tbl <- arrange(cars_tbl, Origin, -Displacement)

In [72]:
head(cars_tbl)

Acceleration,Cylinders,Displacement,Horsepower,Miles_per_Gallon,Name,Origin,Weight_in_lbs,Year,Weight_in_kg,l_per_100km
20.1,5,183,77,25.4,mercedes benz 300d,Europe,3530,1979-01-01,1601.18,9.260433
16.7,6,168,120,16.5,mercedes-benz 280s,Europe,3820,1976-01-01,1732.721,14.255455
13.6,6,163,125,17.0,volvo 264gl,Europe,3140,1978-01-01,1424.279,13.836176
15.8,6,163,133,16.2,peugeot 604sl,Europe,3410,1978-01-01,1546.749,14.519444
21.8,4,146,67,30.0,mercedes-benz 240d,Europe,3250,1980-01-01,1474.174,7.8405
19.6,6,145,76,30.7,volvo diesel,Europe,3160,1982-01-01,1433.351,7.661726
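In base R the unary minus trick works only for numeric columns. For a descending sort on a character column, `order()` needs something like `xtfrm()`, while `dplyr` offers `desc()`. A base R sketch on made-up data:

```r
toy <- data.frame(
    Origin       = c("USA", "Europe", "Japan", "Europe"),
    Displacement = c(307, 183, 97, 168),
    stringsAsFactors = FALSE
)

# Origin ascending, Displacement descending (as in the cells above)
toy[order(toy$Origin, -toy$Displacement), ]

# Origin descending: -toy$Origin would fail, so rank the strings first
toy_desc <- toy[order(-xtfrm(toy$Origin), toy$Displacement), ]
toy_desc$Origin    # "USA" "Japan" "Europe" "Europe"
```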


### Let's put it all together!

Task: list the median weight and fuel consumption of cars heavier than 1200 kg by capital of the country of origin and number of cylinders. Then compute the consumption per kg from the medians in each category (I know it does not make much sense). Order the result by number of cylinders (decreasing) and capital.

In [73]:
dt <- merge(
    cars_dt[Weight_in_kg > 1200],
    capitals_dt,
    by.x = "Origin",
    by.y = "country",
    all.x = FALSE,
    all.y = FALSE
)
dt_coll <- dt[
    ,
    .(weight = median(Weight_in_kg, na.rm = TRUE),
      consumption = median(l_per_100km, na.rm = TRUE)),
    by = .(Cylinders, capital)
]
dt_coll[, consumption_per_kg := consumption / weight]
setorder(dt_coll, -Cylinders, capital)
dt_coll

Cylinders,capital,weight,consumption,consumption_per_kg
8,Washington DC,1876.737,16.801071,0.008952279
6,Tokyo,1317.685,10.20561,0.007745107
6,Washington DC,1480.751,12.379737,0.008360444
4,Tokyo,1225.606,7.893121,0.00644018
4,Washington DC,1265.522,9.224118,0.007288787
3,Tokyo,1233.77,10.940233,0.008867318


In [74]:
cars_tbl %>% 
    filter(Weight_in_kg > 1200) %>% 
    inner_join(capitals_tbl, by = c("Origin" = "country")) %>% 
    group_by(Cylinders, capital) %>% 
    summarize(weight = median(Weight_in_kg, na.rm = TRUE),
              consumption = median(l_per_100km, na.rm = TRUE)) %>% 
    mutate(consumption_per_kg = consumption / weight) %>% 
    arrange(-Cylinders, capital)

Cylinders,capital,weight,consumption,consumption_per_kg
8,Washington DC,1876.737,16.801071,0.008952279
6,Tokyo,1317.685,10.20561,0.007745107
6,Washington DC,1480.751,12.379737,0.008360444
4,Tokyo,1225.606,7.893121,0.00644018
4,Washington DC,1265.522,9.224118,0.007288787
3,Tokyo,1233.77,10.940233,0.008867318


# Data visualization

 - Multiple ways and packages (surprise, surprise...)
 
 
 - However, `ggplot2` is the clear favourite of most R users
    - is part of the tidyverse ecosystem
    - very powerful
    - grammar of graphics
    
    
 - Other options:
    - Base R graphics (painful, but very customizable)
    - `lattice` (somewhat better than base R)
    - `vegalite` (grammar of graphics, compiles to vega-lite (JSON), easy to share online)
 
 - In this lesson we are going to use `ggplot2`

In [75]:
library(ggplot2)

Creating the plot:

In [79]:
# The data should be in long/tidy format!
ggplot(cars_tbl) %>% ggsave("./plots/plot1.png", ., width = 6, height = 4)

In [None]:
# The data should be in long/tidy format!
ggplot(cars_tbl)

<img src="https://raw.githubusercontent.com/stanmart/mnb_dataviz/master/plots//plot1.png" width="600" align="left"/>

Adding an element:

In [96]:
( # Only the code between the outer parentheses is necessary!
    
ggplot(cars_tbl) +
    geom_point(aes(x = Horsepower, y = l_per_100km))

) %>% ggsave(filename = "./plots/plot2.png", plot = ., width = 6, height = 4)

"Removed 14 rows containing missing values (geom_point)."

In [None]:
ggplot(cars_tbl) +
    geom_point(aes(x = Horsepower, y = l_per_100km))

<img src="https://raw.githubusercontent.com/stanmart/mnb_dataviz/master/plots//plot2.png" width="600" align="left"/>

Two more aesthetics (mappings from data to visual properties):

In [92]:
(

ggplot(cars_tbl) +
    geom_point(aes(x = Horsepower, y = l_per_100km, color = Origin, shape = factor(Cylinders)))

) %>% ggsave(filename = "./plots/plot3.png", plot = ., width = 6, height = 4)

"Removed 14 rows containing missing values (geom_point)."

In [None]:
ggplot(cars_tbl) +
    geom_point(aes(x = Horsepower, y = l_per_100km, color = Origin, shape = factor(Cylinders)))

<img src="https://raw.githubusercontent.com/stanmart/mnb_dataviz/master/plots//plot3.png" width="600" align="left"/>

Adding another element:

In [93]:
(

ggplot(cars_tbl) +
    geom_point(aes(x = Horsepower, y = l_per_100km, color = Origin, shape = factor(Cylinders)), alpha = 0.5) +
    stat_smooth(aes(x = Horsepower, y = l_per_100km, color = Origin), method = "lm")

) %>% ggsave(filename = "./plots/plot4.png", plot = ., width = 6, height = 4)

"Removed 14 rows containing missing values (geom_point)."

In [None]:
ggplot(cars_tbl) +
    geom_point(aes(x = Horsepower, y = l_per_100km, color = Origin, shape = factor(Cylinders)), alpha = 0.5) +
    stat_smooth(aes(x = Horsepower, y = l_per_100km, color = Origin), method = "lm")

<img src="https://raw.githubusercontent.com/stanmart/mnb_dataviz/master/plots//plot4.png" width="600" align="left"/>
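Aesthetics shared by all layers can also be declared once in `ggplot()`, from where every layer inherits them. A sketch using the built-in `mtcars` data so that it runs without `cars.csv`; the column names `hp`, `mpg` and `cyl` belong to that dataset, not to the cars data above:

```r
library(ggplot2)

# x, y and color are set once and inherited by both layers
p <- ggplot(mtcars, aes(x = hp, y = mpg, color = factor(cyl))) +
    geom_point(alpha = 0.5) +
    stat_smooth(method = "lm", formula = y ~ x)   # one fitted line per color group
```

Printing `p` (or passing it to `ggsave()`) renders the plot. Layer-specific aesthetics, such as `shape` for the points only, can still be added inside the individual `geom_*()` call.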

Faceting:

In [94]:
(

ggplot(cars_tbl) +
    geom_point(aes(x = Horsepower, y = l_per_100km, color = Origin, shape = factor(Cylinders)), alpha = 0.5) +
    stat_smooth(aes(x = Horsepower, y = l_per_100km, color = Origin), method = "lm") +
    facet_wrap(~ factor(Year))

) %>% ggsave(filename = "./plots/plot5.png", plot = ., width = 6, height = 4)

"Removed 14 rows containing missing values (geom_point)."

In [None]:
ggplot(cars_tbl) +
    geom_point(aes(x = Horsepower, y = l_per_100km, color = Origin, shape = factor(Cylinders)), alpha = 0.5) +
    stat_smooth(aes(x = Horsepower, y = l_per_100km, color = Origin), method = "lm") +
    facet_wrap(~ factor(Year))

<img src="https://raw.githubusercontent.com/stanmart/mnb_dataviz/master/plots//plot5.png" width="600" align="left"/>

Adding fluff:

In [95]:
(

ggplot(cars_tbl) +
    geom_point(aes(x = Horsepower, y = l_per_100km, color = Origin, shape = factor(Cylinders)),
               alpha = 0.5) +
    stat_smooth(aes(x = Horsepower, y = l_per_100km, color = Origin),
                method = "lm") +
    facet_wrap(~ factor(Year), ncol = 4) +
    ggtitle("Horsepower and consumption by year") +
    xlab("Horsepower") +
    ylab("Consumption (l / 100 km)") +
    theme_bw()

) %>% ggsave(filename = "./plots/plot6.png", plot = ., width = 6, height = 4)

"Removed 14 rows containing missing values (geom_point)."

In [None]:
ggplot(cars_tbl) +
    geom_point(aes(x = Horsepower, y = l_per_100km, color = Origin, shape = factor(Cylinders)),
               alpha = 0.5) +
    stat_smooth(aes(x = Horsepower, y = l_per_100km, color = Origin),
                method = "lm") +
    facet_wrap(~ factor(Year), ncol = 4) +
    ggtitle("Horsepower and consumption by year") +
    xlab("Horsepower") +
    ylab("Consumption (l / 100 km)") +
    theme_bw()

<img src="https://raw.githubusercontent.com/stanmart/mnb_dataviz/master/plots//plot6.png" width="600" align="left"/>

Saving the plot:

In [97]:
p <- ggplot(cars_tbl) +
    geom_point(aes(x = Horsepower, y = l_per_100km, color = Origin, shape = factor(Cylinders)),
               alpha = 0.5) +
    stat_smooth(aes(x = Horsepower, y = l_per_100km, color = Origin),
                method = "lm") +
    facet_wrap(~ factor(Year), ncol = 4) +
    ggtitle("Horsepower and consumption by year") +
    xlab("Horsepower") +
    ylab("Consumption (l / 100 km)") +
    theme_bw()

ggsave("cars.png", p)

Saving 6.67 x 6.67 in image
"Removed 14 rows containing missing values (geom_point)."

There are many more plot objects (geoms and stats) and faceting options:

In [99]:
ls("package:ggplot2", pattern = "^geom_")

In [68]:
ls("package:ggplot2", pattern = "^stat_")

In [65]:
ls("package:ggplot2", pattern = "^facet_")

# Thank you!