# <center>Transitioning into the tidyverse</center>
### <center>Rebecca Barter (@rlbarter)</center>


<center><img src="images/horst_tidyverse.png" alt="Tidyverse" style="width: 400px;" align="middle"/>
(Image by Allison Horst)
</center>



## Slides 
### [https://github.com/rlbarter/tidyverse_tutorial](https://github.com/rlbarter/tidyverse_tutorial)


## Blog post 
### [http://www.rebeccabarter.com/blog/2019-08-05_base_r_to_tidyverse/](http://www.rebeccabarter.com/blog/2019-08-05_base_r_to_tidyverse/)

# <center> What is the tidyverse? </center>
<br>
<center>A set of packages (<i>ggplot2, dplyr, purrr, tidyr, readr, tibble</i>)</center>
<br>
<center>
... and a way of thinking about "tidy" analysis.
</center>
<br>

<center>Two ways to keep up with the tidyverse:</center>


<center>
    <table>
        <tr>
         <td><img src="images/twitter.png" alt="Twitter" style="width: 100px;"/></td>
         <td><img src="images/rstudio.png" alt="RStudio" style="width: 200px;"/></td>
        </tr>        
    </table>
</center>


# <center>Loading the tidyverse</center>

In [None]:
library(dplyr)
library(ggplot2)
library(purrr)
library(tidyr)
library(readr)
library(tibble)

<center>Or you can just load the tidyverse package:</center>

In [None]:
library(tidyverse) 

# <center> Entering the tidyverse </center>
<br>
<center>The fundamental object type of the tidyverse is the <b>data frame</b></center>

<center><b>Your data frame is the universe</b> whose objects are the columns</center>

In [None]:
# Alcohol consumption in Australia and New Zealand
library(DAAG)
grog

<center>Tidy coding involves <b>defining as few new objects as possible</b></center>

## <center> Part 1 </center>
### <center> piping, dplyr and ggplot2 </center>


<center>
    <table>
        <tr>
            <td><img src="images/pipe.png" alt="Pipe" style="width: 200px;"/></td>
            <td><img src="images/dplyr.png" alt="dplyr" style="width: 200px;"/></td>
            <td><img src="images/ggplot2.png" alt="ggplot2" style="width: 200px;"/></td>
        </tr>        
     </table>
</center>



## <center> Part 2 </center>
### <center>tidyr, purrr, readr, tibbles, lubridate, forcats, stringr </center>

<center>
    <table>
        <tr>
            <td><img src="images/tidyr.jpg" alt="tidyr" style="width: 200px;"/></td>
            <td><img src="images/purrr.jpg" alt="purrr" style="width: 200px;"/></td>
            <td><img src="images/readr.png" alt="readr" style="width: 170px;"/></td>
            <td><img src="images/tibble.png" alt="tibble" style="width: 165px;"/></td>
            <td><img src="images/lubridate.png" alt="lubridate" style="width: 165px;"/></td>
            <td><img src="images/forcats.png" alt="forcats" style="width: 165px;"/></td>
            <td><img src="images/stringr.jpg" alt="stringr" style="width: 165px;"/></td>
        </tr>        
     </table>
</center>

# <center>Gapminder data!</center>

Let's load the gapminder dataset

In [None]:
# to download the data directly:
gapminder_orig <- read.csv(
    "https://raw.githubusercontent.com/swcarpentry/r-novice-gapminder/gh-pages/_episodes_rmd/data/gapminder-FiveYearData.csv"
)
# define a copy of the original dataset that we will clean and play with 
gapminder <- gapminder_orig

In [None]:
# check the dimension of the data
dim(gapminder)

In [None]:
# show the first 10 rows of the data
head(gapminder, 10)

# <center> Piping </center>

<center><img src="images/pipe.png" alt="Pipe" style="width: 200px;" align="middle"/></center>

<center>&ensp;&ensp;Chain functions together &emsp; &emsp;  Easy to read  &emsp;&emsp; Reduces intermediate objects  </center>

# <center>The pipe: %>%</center>
<br>
<center>Read the pipe as <i>"and then"</i> </center>

In [None]:
gapminder %>%  
  filter(continent == "Americas", year == 2007) %>%
  select(country, year, lifeExp)

### What is the pipe doing?

The object on the *left-hand-side* of the `%>%` is used as the *first argument* of the function on the *right-hand-side*.


In [None]:
gapminder %>% head()
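As a minimal sketch of this rule on toy vectors (the values are made up for illustration), the piped and unpiped calls below return identical results:

```r
library(magrittr)  # provides %>% (dplyr re-exports it too)

x <- c(4, 9, 16)  # toy values, not from gapminder

# x %>% sqrt() is evaluated as sqrt(x)
piped <- x %>% sqrt()
unpiped <- sqrt(x)
identical(piped, unpiped)

# extra arguments come after the piped object:
# y %>% round(1) is evaluated as round(y, 1)
y <- c(1.234, 5.678)
rounded_piped <- y %>% round(1)
rounded_unpiped <- round(y, 1)
identical(rounded_piped, rounded_unpiped)
```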

Which is easier to read?

The unpiped version:

In [None]:
gapminder_filtered <- filter(gapminder, continent == "Americas", year == 2007)
gapminder_filtered_selected <- select(gapminder_filtered, country, lifeExp)
gapminder_filtered_selected

or the piped version:

In [None]:
gapminder %>%
  filter(continent == "Americas", year == 2007) %>%
  select(country, lifeExp)

# <center>dplyr</center>

<center><img src="images/dplyr.png" alt="dplyr" style="width: 200px;" align="middle"/></center>

<center> the data manipulation package </center>

# <center> dplyr: indexing </center>

The tidyverse rarely uses `[,]` or `$` indexing.

Instead, indexing is done using 

- `filter()` for rows 

- `select()` for columns.

Variable names inside tidyverse functions are **unquoted**

# <center>dplyr::select()</center>

`select()` is for indexing columns. Specify the column names that you want to keep

In [None]:
gapminder %>% 
  select(country, gdpPercap) %>%
  head

or specify those that you want to remove

In [None]:
gapminder %>% 
  select(-continent) %>%
  head

# <center>dplyr::filter()</center>

`filter()` is for indexing rows that satisfy a condition. 

In [None]:
gapminder %>% 
  filter(pop > 1000000000) 

You can specify multiple conditions 

In [None]:
gapminder %>% 
  filter(pop > 1000000000, year == 1992) 

# <center> dplyr: readability </center>

Compare dplyr code

In [None]:
gapminder %>%
  filter(continent == "Americas", year == 2007) %>%
  select(country, lifeExp)

with one potential version of equivalent base R code

In [None]:
continent_year_index <- which(gapminder["continent"] == "Americas" & gapminder["year"] == 2007)
gapminder[continent_year_index, c("country", "lifeExp")]

# <center> dplyr::mutate() </center>

`mutate()` lets you add new variables.

The new variable is *usually* a function of existing variables.


In [None]:
gapminder %>% 
  mutate(gdp = gdpPercap * pop) %>%
  head

Note that if you want to actually modify your data, you would need to redefine the `gapminder` data frame

In [None]:
gapminder <- gapminder %>% 
  mutate(gdp = gdpPercap * pop) 
gapminder %>% head

# <center> dplyr::arrange() </center>

`arrange()` lets you reorder the rows of your data frame based on a specified column/variable

In [None]:
gapminder %>% 
  arrange(lifeExp) %>%
  head

To arrange in *descending* order, you need to wrap the variable in the `desc()` function

In [None]:
gapminder %>% 
  arrange(desc(lifeExp)) %>%
  head

# <center> dplyr::group_by() </center>

`group_by()` lets you apply functions separately within groups defined by a categorical variable. 

After you're done, remember to `ungroup()`.

In [None]:
gapminder %>%
  group_by(continent) %>%
  filter(lifeExp > mean(lifeExp)) %>%
  ungroup() 
#  count(continent)

In [None]:
# compare with the ungrouped version
gapminder %>%
  filter(lifeExp > mean(lifeExp)) 
#  count(continent)
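To see the difference on something small, here is a sketch with a made-up data frame (`toy`, `group` and `value` are hypothetical names, not part of gapminder):

```r
library(dplyr)

# toy data: two groups with very different means
toy <- data.frame(group = c("a", "a", "b", "b"),
                  value = c(1, 3, 10, 30))

# grouped: each value is compared with the mean of its own group (2 and 20)
grouped <- toy %>%
  group_by(group) %>%
  filter(value > mean(value)) %>%
  ungroup()

# ungrouped: each value is compared with the overall mean (11)
ungrouped <- toy %>%
  filter(value > mean(value))

grouped$value
ungrouped$value
```

The grouped version keeps one row per group, while the ungrouped version keeps only the single row above the overall mean.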

# <center> dplyr::summarise() </center>

`summarise()` lets you aggregate across rows of the data frame.

In [None]:
gapminder %>% 
  summarise(mean_lifeExp = mean(lifeExp),
            total_gdp = sum(gdp))

`summarise()` also works really nicely with `group_by()`:

In [None]:
gapminder %>% 
  group_by(year) %>%
  summarise(mean_lifeExp = mean(lifeExp),
            total_gdp = sum(gdp)) %>%
  ungroup()

# <center> More dplyr functions </center>

`rename()` for renaming variables of the data frame

In [None]:
gapminder %>%
  rename(gdp_per_capita = gdpPercap,
         life_exp = lifeExp) %>%
  head

`distinct()` for extracting the distinct values of a variable

In [None]:
gapminder %>% 
  distinct(continent)

`sample_n()` and `sample_frac()` for taking random samples of rows

In [None]:
gapminder %>% 
  sample_n(2)
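From dplyr 1.0.0 onwards, `sample_n()` and `sample_frac()` are superseded by `slice_sample()`. A minimal sketch on a made-up data frame (`toy` is a hypothetical name), assuming dplyr >= 1.0.0 is installed:

```r
library(dplyr)

toy <- data.frame(id = 1:10)  # toy data, not gapminder

set.seed(1)  # make the random sample reproducible
# equivalent of sample_n(2)
two_rows <- toy %>% slice_sample(n = 2)
# equivalent of sample_frac(0.5)
half_rows <- toy %>% slice_sample(prop = 0.5)

nrow(two_rows)
nrow(half_rows)
```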

`count()` for counting the number of rows with each value of a categorical variable

In [None]:
gapminder %>% 
  count(continent)

`transmute()` for mutating and selecting at the same time

In [None]:
gapminder %>% 
  transmute(gdp = gdpPercap * pop) %>%
  head

Advanced dplyr practitioners will eventually want to learn about [*scoped verbs*](http://www.rebeccabarter.com/blog/2019-01-23_scoped-verbs/). 
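In dplyr 1.0.0 and later, the scoped verbs are superseded by `across()`. The sketch below, on a made-up data frame (`toy`, `x` and `y` are hypothetical names), applies one function to several columns at once:

```r
library(dplyr)

# toy data: two numeric columns and a grouping column
toy <- data.frame(group = c("a", "a", "b"),
                  x = c(1, 2, 3),
                  y = c(10, 20, 30))

# compute the mean of both x and y within each group
means <- toy %>%
  group_by(group) %>%
  summarise(across(c(x, y), mean)) %>%
  ungroup()

means
```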


# <center> ggplot2 </center>


<center><img src="images/ggplot2.png" alt="ggplot2" style="width: 200px;" align="middle"/></center>

<center>the data visualisation package</center>

# <center>ggplot2</center>

Based on the **layered grammar of graphics**.

**Geom**etric objects are created based on **aes**thetic mappings from data.

In [None]:
ggplot(gapminder)

# <center> ggplot2 with pipes </center>

In [None]:
gapminder %>% 
  filter(year == 2007) %>%
  ggplot()


# <center> Geom layers </center>

Layers are added using `+` (rather than `%>%`).

In [None]:
gapminder %>%
  filter(year == 2007) %>%
  ggplot() +
  # add a points layer on top (for gdpPercap and lifeExp)
  geom_point(aes(x = gdpPercap, y = lifeExp))


# <center> Geom layers </center>

You can layer multiple geoms on top of one another.

In [None]:
gapminder %>%
  filter(year == 2007) %>%
  ggplot() +
  geom_point(aes(x = gdpPercap, y = lifeExp)) +
  # add a smoothed LOESS layer on top of the points
  geom_smooth(aes(x = gdpPercap, y = lifeExp), method = "loess")



# <center> Global aesthetic mappings </center>

Instead of specifying a separate `aes()` function for each layer, you can specify a global aesthetic function in the `ggplot()` function

In [None]:
gapminder %>%
  filter(year == 2007) %>%
  # specify global aesthetic mappings, which every layer inherits
  ggplot(aes(x = gdpPercap, y = lifeExp)) +
  geom_point() +
  geom_smooth(method = "loess")


# <center> ggplot2::geom_line() </center>

Line plots are great for visualizing time series. 

Plotting the average life expectancy by year.

In [None]:
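gapminder %>% 
  # calculate the average life expectancy for each year
  group_by(year) %>%
  summarise(avg_lifeExp = mean(lifeExp)) %>%
  # plot points and line
  ggplot(aes(x = year, y = avg_lifeExp)) +
  geom_point() +
  geom_line()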

# <center> Grouping line plots </center>

What if we want a separate line for each continent?

In [None]:
# create a separate line for each continent
gapminder %>%
  # calculate average life expectancy for each continent-year combination
  group_by(continent, year) %>%
  summarise(avg_lifeExp = mean(lifeExp)) %>%
  ungroup() %>%
  # plot points and lines, grouped by continent
  ggplot(aes(x = year, y = avg_lifeExp, group = continent)) +
  geom_point() +
  geom_line()


# <center> More aesthetic mappings </center>

We have only seen x- and y-position mappings so far. 

To change the color of all points - add a `col` argument *outside* the `aes()` function 

In [None]:
gapminder %>%
  filter(year == 2007) %>%
  ggplot() +
  # change color of all points to cornflowerblue (set outside aes())
  geom_point(aes(x = gdpPercap, 
                 y = lifeExp),
             col = "cornflowerblue")

## <center> Using a variable to colour the points </center>

To specify color based on a variable - add a `col` argument *inside* the `aes()` function.


In [None]:
gapminder %>%
  filter(year == 2007) %>%
  ggplot() +
  # change color based on continent (mapped inside aes())
  geom_point(aes(x = gdpPercap, 
                 y = lifeExp,
                 col = continent))

# <center> Changing the colors with a scale layer </center>

In [None]:
gapminder %>%
  filter(year == 2007) %>%
  ggplot() +
  geom_point(aes(x = gdpPercap, 
                 y = lifeExp, 
                 col  = continent)) + 
  # manually choose the colors for the groups
  scale_colour_manual(values = c("orange", "red4", "purple", "darkgreen", "blue"))

# <center> Other aesthetic mappings </center>

Specifying the **size** based on a variable 

In [None]:
gapminder %>%
  filter(year == 2007) %>%
  ggplot() +
  geom_point(aes(x = gdpPercap, 
                 y = lifeExp, 
                 col = continent, 
                 size = pop)) 

# <center> Line plots: grouping and color </center>

Mapping a variable to color also groups the lines by that variable

In [None]:
gapminder %>%
  group_by(continent, year) %>%
  summarise(avg_lifeExp = mean(lifeExp)) %>%
  ungroup() %>%
  ggplot() +
  # add a line layer; mapping col to continent draws one colored line per continent
  geom_line(aes(x = year, y = avg_lifeExp, col = continent))

# <center> Other types of geoms </center>

## Histograms

In [None]:
gapminder %>%
  ggplot() + 
  geom_histogram(aes(x = lifeExp))

## Boxplots

In [None]:
gapminder %>%
  ggplot() +
  geom_boxplot(aes(x = continent, y = lifeExp))

# <center> Faceting: making grids of plots </center>

In [None]:
gapminder %>%
  filter(year == 2007) %>%
  ggplot() +
  geom_point(aes(x = gdpPercap, y = lifeExp)) +
  # create a grid of plots, one panel per continent
  facet_wrap(~continent)

# <center> Customizing ggplot2 </center>

In [None]:
gapminder %>% 
  filter(year == 2007) %>%
  ggplot() +
  # add scatter points
  geom_point(aes(x = gdpPercap, y = lifeExp, col = continent, size = pop),
             alpha = 0.7) +
  # add some text annotations for the very large countries
  geom_text(aes(x = gdpPercap, y = lifeExp + 3, label = country),
            col = "grey50",
            data = filter(gapminder, year == 2007, pop > 1000000000 | country %in% c("Nigeria", "United States"))) +
  # clean the axes names and breaks
  scale_x_log10(limits = c(200, 60000)) +
  # change labels
  labs(title = "GDP versus life expectancy in 2007",
       x = "GDP per capita (log scale)",
       y = "Life expectancy",
       size = "Population",
       col = "Continent") +
  # change the size scale
  scale_size(range = c(0.1, 10),
             # remove size legend
             guide = "none") +
  # add a nicer theme
  theme_classic() +
  # place legend at top
  theme(legend.position = "top")

In [None]:
gapminder_life_exp_diff <- gapminder %>%
  # filter to the starting and ending years only
  filter(year == 1952 | year == 2007) %>%
  # ensure that the data are arranged so that 1952 is first and 2007 is second 
  # within each country
  arrange(country, year) %>%
  # for each country, add a variable corresponding to the difference between life 
  # expectancy in 2007 and 1952
  group_by(country) %>%
  mutate(lifeExp_diff = lifeExp[2] - lifeExp[1],
         # also calculate the largest population for the country (based on the two years)
         max_pop = max(pop)) %>%
  ungroup() %>%
  # arrange by the difference in life expectancy (smallest difference first)
  arrange(lifeExp_diff) %>%
  # restrict to countries with a population of more than 50 million so we can fit 
  # the plot in a reasonable space
  filter(max_pop > 50000000) %>%
  # redefine the country variable so that it does not have the additional 
  # country levels corresponding to countries that were removed in the previous
  # step; factor() comes first because read.csv() in R >= 4.0 returns character
  # columns, and droplevels() has no method for character vectors
  mutate(country = droplevels(factor(country))) %>%
  select(country, year, continent, lifeExp, lifeExp_diff)
gapminder_life_exp_diff %>%
  mutate(country = fct_inorder(country)) %>%
  # for each country define a variable for min and max life expectancy
  group_by(country) %>%
  mutate(max_lifeExp = max(lifeExp),
         min_lifeExp = min(lifeExp)) %>% 
  ungroup() %>%
  ggplot() +
  # plot a horizontal line from min to max life expectancy for each country
  geom_segment(aes(x = min_lifeExp, xend = max_lifeExp, 
                   y = country, yend = country,
                   col = continent), alpha = 0.5, size = 7) +
  # add a point for each life expectancy data point
  geom_point(aes(x = lifeExp, y = country, col = continent), size = 8) +
  # add text of the country name as well as the max and min life expectancy 
  geom_text(aes(x = min_lifeExp + 0.7, y = country, 
                label = paste(country, round(min_lifeExp))), 
            col = "grey50", hjust = "right") +
  geom_text(aes(x = max_lifeExp - 0.7, y = country, 
                label = round(max_lifeExp)), 
            col = "grey50", hjust = "left") +
  # ensure that the left-most text is not cut off 
  scale_x_continuous(limits = c(20, 85)) +
  # choose a different colour palette
  scale_colour_brewer(palette = "Pastel2") +
  # set the title
  labs(title = "Change in life expectancy",
       subtitle = "Between 1952 and 2007",
       x = "Life expectancy (in 1952 and 2007)",
       y = NULL, 
       col = "Continent") +
  # remove the grey background
  theme_classic() +
  # remove the axes and move the legend to the top
  theme(legend.position = "top", 
        axis.line = element_blank(),
        axis.ticks = element_blank(),
        axis.text = element_blank())

# <center> Part 2: everything else </center>

## <center> tidyr, purrr, readr, tibbles, lubridate, forcats, stringr </center>


<center>
    <table>
        <tr>
            <td><img src="images/tidyr.jpg" alt="tidyr" style="width: 200px;"/></td>
            <td><img src="images/purrr.jpg" alt="purrr" style="width: 200px;"/></td>
            <td><img src="images/readr.png" alt="readr" style="width: 170px;"/></td>
            <td><img src="images/tibble.png" alt="tibble" style="width: 165px;"/></td>
            <td><img src="images/lubridate.png" alt="lubridate" style="width: 165px;"/></td>
            <td><img src="images/forcats.png" alt="forcats" style="width: 165px;"/></td>
            <td><img src="images/stringr.jpg" alt="stringr" style="width: 165px;"/></td>
        </tr>        
     </table>
</center>

# <center> tidyr </center>

<center><td><img src="images/tidyr.jpg" alt="tidyr" style="width: 200px;"/></td></center>

### <center> Convert data between long and wide form </center>

- `gather()` and `spread()`

- `pivot_longer()` and `pivot_wider()`

In [None]:
gapminder %>% 
  filter(country == "Australia", year > 2000) 
  #gather(key = "variable", value = "value", -country, -year) 
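A minimal sketch of the newer `pivot_longer()` / `pivot_wider()` pair, using a made-up wide data frame (`wide` and its values are hypothetical, not gapminder rows):

```r
library(dplyr)  # for %>%
library(tidyr)

# toy wide-format data: one column per measured variable
wide <- data.frame(country = c("A", "B"),
                   lifeExp = c(70, 80),
                   gdpPercap = c(1000, 2000))

# wide -> long: one row per country/variable pair
long <- wide %>%
  pivot_longer(cols = c(lifeExp, gdpPercap),
               names_to = "variable",
               values_to = "value")

# long -> wide: back to one column per variable
wide_again <- long %>%
  pivot_wider(names_from = variable, values_from = value)

long
wide_again
```

Both columns being pivoted are numeric here; mixing column types in `cols` would need extra handling.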
  

# <center> purrr </center>

<center> <td><img src="images/purrr.jpg" alt="purrr" style="width: 200px;"/></td> </center>

### <center> Iterate. Tidyverse's apply functions </center>

- `map()` functions iterate over the elements of a list or vector, or over the columns of a data frame

In [None]:
gapminder %>% map_df(class) 
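The suffix of a `map_*()` function determines the type of its output. A small sketch on a made-up list (`toy_list` is a hypothetical name):

```r
library(purrr)

toy_list <- list(a = 1:3, b = c(10, 20))  # toy input, not gapminder

# map() always returns a list
lengths_list <- map(toy_list, length)

# map_dbl() returns a (named) numeric vector
means_vec <- map_dbl(toy_list, mean)

lengths_list
means_vec
```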

# <center> readr </center>

<center> <td><img src="images/readr.png" alt="readr" style="width: 170px;"/></td> </center>

### <center>  Load in data </center>

`read_csv()`, `read_delim()`, ....
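A minimal sketch of `read_csv()`, reading a small made-up file written to a temporary path (the file contents are hypothetical):

```r
library(readr)

# write a tiny CSV file to a temporary location
tmp <- tempfile(fileext = ".csv")
writeLines(c("country,year,lifeExp",
             "A,2007,70.5",
             "B,2007,80.1"), tmp)

# col_types = cols() lets readr guess the column types without printing a message
toy <- read_csv(tmp, col_types = cols())

class(toy)    # a tibble rather than a plain data frame
sapply(toy, class)
```

Unlike `read.csv()`, `read_csv()` returns a tibble and never converts strings to factors.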


# <center> tibble </center>

<center> <td><img src="images/tibble.png" alt="tibble" style="width: 165px;"/></td> </center>

### <center> Fancy data frames </center>
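One concrete difference, sketched on a made-up data frame (`df` and `tb` are hypothetical names): selecting a single column with `[` drops a data frame to a vector, but a tibble stays a tibble.

```r
library(tibble)

df <- data.frame(x = 1:3, y = c("a", "b", "c"))  # toy data
tb <- as_tibble(df)

# base data frame: single-column indexing returns a vector
class(df[, "x"])

# tibble: single-column indexing still returns a tibble
class(tb[, "x"])
```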



# <center> lubridate </center>

<center> <td><img src="images/lubridate.png" alt="lubridate" style="width: 165px;"/></td> </center>

### <center> Handle dates and times </center>


In [None]:
library(lubridate)
mdy("August 2nd 2019")

In [None]:
mdy("8/2/19")

In [None]:
mdy_hms("August 2nd 2019, 1:21:30 pm") - mdy_hms("August 1st 2019, 11:23:33 am")
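Once a string is parsed into a date, its components can be extracted and date arithmetic works directly. A small sketch (the date is just an example):

```r
library(lubridate)

d <- ymd("2019-08-02")

# extract components of the date
year(d)
month(d)
day(d)

# add a period of 30 days
later <- d + days(30)
later
```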

# <center> forcats </center>

<center> <td><img src="images/forcats.png" alt="forcats" style="width: 165px;"/></td> </center>

### <center> Handle factors </center>


In [None]:
library(repr)
options(repr.plot.width=8, repr.plot.height=2.5)

gapminder %>%
  filter(pop > 100000000) %>%
  group_by(country) %>% 
  summarise(avg_lifeExp = mean(lifeExp)) %>%
  arrange(desc(avg_lifeExp)) %>% ############ Why doesn't this work?
  ggplot() +
  geom_bar(aes(x = country, y = avg_lifeExp), stat = "identity")
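`arrange()` only reorders rows, whereas ggplot2 orders a discrete axis by factor levels (alphabetically for character columns). One possible fix is to reorder the levels with `fct_reorder()`; the sketch below uses a made-up data frame (`toy` is a hypothetical name) to show the idea:

```r
library(dplyr)
library(forcats)

# toy summary data, not computed from gapminder
toy <- data.frame(country = c("A", "B", "C"),
                  avg_lifeExp = c(60, 80, 70))

# order the country levels by decreasing average life expectancy
reordered <- toy %>%
  mutate(country = fct_reorder(country, avg_lifeExp, .desc = TRUE))

levels(reordered$country)
```

In the bar chart above, replacing the `arrange()` step with a `mutate()` like this one should give bars in decreasing order.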

# <center> stringr </center>

 <center> <td><img src="images/stringr.jpg" alt="stringr" style="width: 165px;"/></td> </center>

### <center> do things with strings </center>

<center> <td><img src="images/stringr_table.png" alt="stringr" style="width: 800px;"/></td> </center>

https://stringr.tidyverse.org/articles/from-base.html
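A few common stringr functions, sketched on a made-up character vector (`countries` is a hypothetical name). All stringr functions start with `str_` and take the string as their first argument, so they work well with the pipe:

```r
library(stringr)

countries <- c("New Zealand", "Australia", "United States")

# which strings contain a pattern?
has_united <- str_detect(countries, "United")

# replace the first match of a pattern
underscored <- str_replace(countries, " ", "_")

# change case and count characters
upper <- str_to_upper(countries)
n_chars <- str_length(countries)
```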

# Blog post
### http://www.rebeccabarter.com/blog/2019-08-05_base_r_to_tidyverse/