# Introduction to dplyr and tidyr (part 1)

### Why use R?


- Save and rerun code
- Lots of stats/data science packages
- Great graphics
- Built for data
- Free, open source, cross-platform
- Large community

### Market share

![](img/Fig-1a-IndeedJobs-2017.png)

What is the `tidyverse`?
========================================================

- Multiple packages for managing and manipulating data, using data verb syntax
    - `dplyr`
        - filter
        - mutate
        - aggregate
    - `tidyr`
        - stack and unstack (`gather()` and `spread()`, superseded by `pivot_longer()` and `pivot_wider()` in tidyr 1.0.0 and later)
    - `ggplot2`
        - Nice graphics
    
    

### Loading a Library

In [1]:
# This loads all of the dplyr functions
# You must do this every time you start a new R session
library("dplyr")


Attaching package: ‘dplyr’

The following objects are masked from ‘package:stats’:

    filter, lag

The following objects are masked from ‘package:base’:

    intersect, setdiff, setequal, union
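
The message above means that, once `dplyr` is attached, a bare `filter` refers to `dplyr::filter()` rather than `stats::filter()`. As a minimal sketch, the `package::function` prefix lets you pick either one explicitly. The data frame `toy` is made up for illustration and is not part of the course data.

```r
library(dplyr)

# Hypothetical toy data frame, for illustration only
toy <- data.frame(year = c(1994, 1995, 1995), weight = c(40, 41, 45))

# Explicitly call the dplyr version: keeps rows where the condition is TRUE
rows_1995 <- dplyr::filter(toy, year == 1995)
rows_1995

# The masked stats version is still reachable through its namespace;
# it applies a linear filter (here a two-point moving average) to a series
smoothed <- stats::filter(c(1, 2, 3, 4), rep(1 / 2, 2))
smoothed
```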



#### Reading in data

In [2]:
surveys <- read.csv('https://github.com/WSU-DataScience/DSCI_210_R_notebooks/raw/main/data/portal_data_joined.csv')
head(surveys)

record_id,month,day,year,plot_id,species_id,sex,hindfoot_length,weight,genus,species,taxa,plot_type
1,7,16,1977,2,NL,M,32.0,,Neotoma,albigula,Rodent,Control
72,8,19,1977,2,NL,M,31.0,,Neotoma,albigula,Rodent,Control
224,9,13,1977,2,NL,,,,Neotoma,albigula,Rodent,Control
266,10,16,1977,2,NL,,,,Neotoma,albigula,Rodent,Control
349,11,12,1977,2,NL,,,,Neotoma,albigula,Rodent,Control
363,11,12,1977,2,NL,,,,Neotoma,albigula,Rodent,Control


Selecting columns with `select`
========================================================

In [3]:
# Syntax: select(df, col1, col2, ...)
new_df <- select(surveys, plot_id, species_id, weight)

In [4]:
# Good habit: Always inspect the result with head
head(new_df)

plot_id,species_id,weight
2,NL,
2,NL,
2,NL,
2,NL,
2,NL,
2,NL,


Filtering rows with `filter`
========================================================

Why are the columns we dropped with `select` on the last slide still here?

In [5]:
new_df2 <- filter(surveys, year == 1995)
head(new_df2)

record_id,month,day,year,plot_id,species_id,sex,hindfoot_length,weight,genus,species,taxa,plot_type
22314,6,7,1995,2,NL,M,34,,Neotoma,albigula,Rodent,Control
22728,9,23,1995,2,NL,F,32,165.0,Neotoma,albigula,Rodent,Control
22899,10,28,1995,2,NL,F,32,171.0,Neotoma,albigula,Rodent,Control
23032,12,2,1995,2,NL,F,33,,Neotoma,albigula,Rodent,Control
22003,1,11,1995,2,DM,M,37,41.0,Dipodomys,merriami,Rodent,Control
22042,2,4,1995,2,DM,F,36,45.0,Dipodomys,merriami,Rodent,Control
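
One way to explore the question above: dplyr verbs return a new data frame and leave their input untouched. The sketch below uses a made-up data frame `toy` (not the course data) to show that `select()` does not change the data frame it was given.

```r
library(dplyr)

# Hypothetical toy data frame, for illustration only
toy <- data.frame(id = 1:3, year = c(1994, 1995, 1995), weight = c(40, 41, 45))

# select() returns a NEW data frame with fewer columns...
toy_small <- select(toy, id, weight)

# ...and the original is unchanged
names(toy)        # still has all three columns
names(toy_small)  # only the selected columns

# Filtering the original therefore keeps every column
toy_1995 <- filter(toy, year == 1995)
names(toy_1995)
```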


Making a new column with `mutate`
========================================================

In [6]:
new_df <- select(surveys, plot_id, species_id, weight, year)
new_df2 <- filter(new_df, year == 1995)
new_df3 <- mutate(new_df2, weight_kg = weight / 1000)
head(new_df3)

plot_id,species_id,weight,year,weight_kg
2,NL,,1995,
2,NL,165.0,1995,0.165
2,NL,171.0,1995,0.171
2,NL,,1995,
2,DM,41.0,1995,0.041
2,DM,45.0,1995,0.045


In [7]:
# Drop the old weight column:
new_df4 <- select(new_df3, -weight)
head(new_df4)

plot_id,species_id,year,weight_kg
2,NL,1995,
2,NL,1995,0.165
2,NL,1995,0.171
2,NL,1995,
2,DM,1995,0.041
2,DM,1995,0.045


Motivating pipes
========================================================

### Imperative pattern - Save, save, save


<img width="450" src="img/imperative_pattern.png">

- **Problem 1:** Lots of temporary variables 
- **Problem 2:** Messy and lots of *overhead* 
    - All the extra *stuff* clouds the meaning/intent of the code 

### Poor solution - Overwrite the same data frame

```{R}
surveys <- select(surveys, plot_id, species_id, weight, year)
surveys <- filter(surveys, year == 1995)
surveys <- mutate(surveys, weight_kg = weight / 1000)
```

**Question:** What are the problems with this approach?
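
As a hint, here is a small sketch of one thing that can go wrong when a data frame is overwritten. The data frame `toy` is made up for illustration. After it is overwritten, an earlier step can no longer be re-run, because the column it needs is gone.

```r
library(dplyr)

# Hypothetical toy data frame, for illustration only
toy <- data.frame(id = 1:3, year = c(1994, 1995, 1995), weight = c(40, 41, 45))

# Overwrite `toy` with a version that no longer has `year`
toy <- select(toy, id, weight)

# A step that depends on `year` now fails; tryCatch() turns the
# error into a message so the sketch keeps running
result <- tryCatch(
  filter(toy, year == 1995),
  error = function(e) "error: `year` is no longer available in `toy`"
)
result
```

The only way back is to rebuild or re-read the original data, which is easy to forget in a long notebook.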

### Use a pipe for cleaner code

The pipe (`%>%`) passes the data frame on its left into the first argument position of the function on its right:

<img width="350" src="img/pipe1.png">

### Imagine an invisible data frame in the first spot...

Important Point - Each data frame is NEW

<img width="350" src="img/pipe2.png">

### ... but don't write it!

In [8]:
surveys  %>% 
  select(plot_id, species_id, weight, year) %>%
  filter(year == 1995) %>%
  mutate(weight_kg = weight / 1000) %>%
  head()

plot_id,species_id,weight,year,weight_kg
2,NL,,1995,
2,NL,165.0,1995,0.165
2,NL,171.0,1995,0.171
2,NL,,1995,
2,DM,41.0,1995,0.041
2,DM,45.0,1995,0.045
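
To check that the pipe really does place the data frame in the first argument position, the following sketch compares a piped call with the equivalent direct call. The data frame `toy` is made up for illustration.

```r
library(dplyr)

# Hypothetical toy data frame, for illustration only
toy <- data.frame(id = 1:3, year = c(1994, 1995, 1995))

# The pipe supplies `toy` as the first argument of filter(),
# so these two calls do the same thing
piped  <- toy %>% filter(year == 1995)
direct <- filter(toy, year == 1995)

identical(piped, direct)
```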


### My preferred code format

In [9]:
(surveys  
 %>% select(plot_id, species_id, weight, year) 
 %>% filter(year == 1995) 
 %>% mutate(weight_kg = weight / 1000) 
 %>% head()
 )

plot_id,species_id,weight,year,weight_kg
2,NL,,1995,
2,NL,165.0,1995,0.165
2,NL,171.0,1995,0.171
2,NL,,1995,
2,DM,41.0,1995,0.041
2,DM,45.0,1995,0.045
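
The outer parentheses in this format matter. Without them, R treats the first line as a complete expression and then cannot parse a line that starts with `%>%`. A minimal sketch on a made-up data frame `toy`, using `parse()` only to test whether a piece of code is syntactically valid:

```r
library(dplyr)

# Hypothetical toy data frame, for illustration only
toy <- data.frame(id = 1:3)

# Inside parentheses, R keeps reading until the closing ")",
# so each line may start with the pipe
with_parens <- (toy
  %>% filter(id > 1)
)
with_parens

# The same code without parentheses does not parse:
# `toy` ends the expression and the next line starts with %>%
leading_pipe <- "toy\n  %>% filter(id > 1)"
fails_to_parse <- tryCatch(
  { parse(text = leading_pipe); FALSE },
  error = function(e) TRUE
)
fails_to_parse
```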


## <font color="red"> Exercise 1 </font>

Write a pipe to perform the following tasks.

1. Compute the weight of all species in lbs.
2. Compute the weight of all species in lbs for each `genus`.

In [10]:
# Your code for task 1 here

In [11]:
# Your code for task 2 here

# Part 2 - Converting code and types of errors.

### You've seen piping before...
 
<img width="850" src="https://github.com/WSU-DataScience/DSCI_210_R_notebooks/raw/main/img/openrefine_piping.PNG">

### Saving the result of a piped operation

In [12]:
surveys_small <- surveys %>%
  filter(weight < 5) %>%
  select(species_id, sex, weight)

head(surveys_small)

species_id,sex,weight
PF,F,4
PF,F,4
PF,M,4
RM,F,4
RM,M,4
PF,,4


### The advantages of piping

* Reads left-to-right
* Reads top-to-bottom
* Focuses on verbs
* Removes pointless nouns

### Compare

* Imperative
* Functional
* Piping

### Imperative:

In [13]:
x <- pi
r_x <- round(x, 2)
c_x <- as.character(r_x)
c_x

### Functional:

In [14]:
as.character(round(pi, 2))

### Piping:

In [15]:
pi %>%
  round(2) %>%
  as.character()
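
As a quick check, the three styles can be run side by side to confirm that they produce the same value. This sketch only collects the three cells above into one place. The variable names `imperative`, `functional`, and `piped` are chosen here for illustration.

```r
library(dplyr)

# Imperative: one named intermediate result per step
x <- pi
r_x <- round(x, 2)
imperative <- as.character(r_x)

# Functional: nested calls, read from the inside out
functional <- as.character(round(pi, 2))

# Piping: steps read top to bottom
piped <- pi %>%
  round(2) %>%
  as.character()

# All three give the same character string
c(imperative, functional, piped)
```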

## Example 1 - Converting to pipes

In [16]:
surveys_small <- filter(surveys, weight < 5) 
survey_small_id_sex_wgt <- select(surveys_small, species_id, sex, weight)
head(survey_small_id_sex_wgt)

species_id,sex,weight
PF,F,4
PF,F,4
PF,M,4
RM,F,4
RM,M,4
PF,,4


In [17]:
# Convert to piped code

## Example 2 - Converting to imperative

In [18]:
surveys_small <- surveys %>%
  filter(species_id == 'NL') %>%
  select(species_id, sex, weight)

head(surveys_small)

species_id,sex,weight
NL,M,
NL,M,
NL,,
NL,,
NL,,
NL,,


In [19]:
# Convert to imperative

## Example 3 - Converting to functional

In [20]:
surveys_small <- surveys %>%
  filter(weight < 5) %>%
  select(species_id, sex, weight)

head(surveys_small)

species_id,sex,weight
PF,F,4
PF,F,4
PF,M,4
RM,F,4
RM,M,4
PF,,4


In [21]:
# Convert to functional

## <font color="red"> Exercise 2 </font>

Perform each of the following code conversions.

In [22]:
sales <- read.csv('https://github.com/WSU-DataScience/DSCI_210_R_notebooks/raw/main/data/auto_sales.csv')
head(sales)

Salesperson,Compact,Sedan,SUV,Truck
Ann,22,18,15,12
Bob,19,12,17,20
Yolanda,19,8,32,15
Xerxes,12,23,18,9


#### <font color="red">TASK 1</font>. Convert the following *piped code* to the *imperative style*

In [23]:
sales %>%
    select(Salesperson, Compact, Sedan) %>%
    mutate(Car = Compact + Sedan) 

Salesperson,Compact,Sedan,Car
Ann,22,18,40
Bob,19,12,31
Yolanda,19,8,27
Xerxes,12,23,35


In [24]:
# Your code here

#### <font color="red">TASK 2</font>. Convert the following *imperative code* to the *piped style*

In [25]:
df2 <- mutate(sales, Car = Compact + Sedan)
df3 <- mutate(df2, Utility = SUV + Truck)
df4 <- select(df3, Salesperson, Car, Utility)
head(df4)

Salesperson,Car,Utility
Ann,40,27
Bob,31,37
Yolanda,27,47
Xerxes,35,27


In [26]:
# Your code here

Types of programming errors
========================================================

* Name errors
* Syntax errors
* Semantic errors (hardest/worst)

### Name Errors - Using the wrong name

In [27]:
sales <- read.csv('https://github.com/WSU-DataScience/DSCI_210_R_notebooks/raw/main/data/auto_sales.csv')
head(sales)

Salesperson,Compact,Sedan,SUV,Truck
Ann,22,18,15,12
Bob,19,12,17,20
Yolanda,19,8,32,15
Xerxes,12,23,18,9


## Find the name errors

In [28]:
sales %>%
  select(salesperson, sedan)

ERROR: Error: Can't subset columns that don't exist.
✖ Column `salesperson` doesn't exist.
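
Column names in R are case-sensitive, so a quick look at the actual names usually settles a name error. As a minimal sketch, `sales` is rebuilt by hand here from the table printed above so that the code runs without a download. The helper variables `wanted` and `missing_cols` are made up for illustration.

```r
# `sales` rebuilt by hand from the table printed above
sales <- data.frame(
  Salesperson = c("Ann", "Bob", "Yolanda", "Xerxes"),
  Compact = c(22, 19, 19, 12),
  Sedan = c(18, 12, 8, 23),
  SUV = c(15, 17, 32, 18),
  Truck = c(12, 20, 15, 9)
)

# List the exact column names, including capitalization
names(sales)

# Check candidate names before passing them to select()
wanted <- c("salesperson", "Sedan")
missing_cols <- wanted[!wanted %in% names(sales)]
missing_cols
```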


### Syntax errors - Incorrect syntax

In [29]:
head(sales

ERROR: Error in parse(text = x, srcfile = src): <text>:2:0: unexpected end of input
1: head(sales
   ^
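
A syntax error is reported when R parses the code, before any of it runs. The sketch below uses `parse()` to test whether a string is valid R without evaluating it. The helper `parses()` is made up for illustration and is not part of any package.

```r
# Hypothetical helper: TRUE if `code` is syntactically valid R
parses <- function(code) {
  tryCatch(
    { parse(text = code); TRUE },
    error = function(e) FALSE
  )
}

# Missing closing parenthesis: not valid R
incomplete_ok <- parses("head(sales")

# Balanced parentheses: valid R, even though `sales` is never
# looked up, because parse() does not evaluate the code
complete_ok <- parses("head(sales)")

c(incomplete = incomplete_ok, complete = complete_ok)
```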


## Find the syntax errors

In [30]:
# Find the syntax error
sales %>%
  mutate(monthly_sedan = Sedan/3,
         monthly_suv = SUV/3
         monthly_truck = Truck/3

ERROR: Error in parse(text = x, srcfile = src): <text>:5:10: unexpected symbol
4:          monthly_suv = SUV/3
5:          monthly_truck
            ^


### Semantic Errors - Correct code, wrong meaning

In [31]:
# Find the semantic errors
sales %>%
  group_by(Salesperson) %>%
  mutate(avg_sedan = median(Truck))

Salesperson,Compact,Sedan,SUV,Truck,avg_sedan
Ann,22,18,15,12,12
Bob,19,12,17,20,20
Yolanda,19,8,32,15,15
Xerxes,12,23,18,9,9
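
R cannot detect a semantic error, because the code is valid and runs. One way to catch such errors is to compare a result against a value worked out by hand. A minimal sketch with a different, made-up example: converting grams to kilograms with the wrong factor.

```r
library(dplyr)

# Hypothetical toy data frame: weights in grams
toy <- data.frame(weight = c(165, 171))

# Runs without any error, but divides by the wrong factor
wrong <- toy %>% mutate(weight_kg = weight / 100)

# Intended conversion: 1000 g per kg
right <- toy %>% mutate(weight_kg = weight / 1000)

# Hand-computed check: 165 g should be 0.165 kg
wrong_passes <- isTRUE(all.equal(wrong$weight_kg[1], 0.165))
right_passes <- isTRUE(all.equal(right$weight_kg[1], 0.165))
c(wrong = wrong_passes, right = right_passes)
```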


## <font color="red"> Exercise 3 </font>

Identify all of the errors and classify each as either a name, syntax, or semantic error.

In [32]:
sales %>%
    mutate(Car = compact + sedan) %>%
    mutate(Utility = SUV * Truck %>%

ERROR: Error in parse(text = x, srcfile = src): <text>:4:0: unexpected end of input
2:     mutate(Car = compact + sedan) %>%
3:     mutate(Utility = SUV * Truck %>%
  ^


> Your answer here