<a href="https://colab.research.google.com/github/yardsale8/DSCI_210_R_notebooks/blob/main/lecture_6_1_getting_started_with_dplyr.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Introduction to dplyr and tidyr (part 1)

### Why use R?


- Save and rerun code
- Lots of stats/data science packages
- Great graphics
- Built for data
- Free, open source, cross-platform
- Large community

### Market-share

![](https://github.com/WSU-DataScience/DSCI_210_R_notebooks/blob/main/img/Fig-1a-IndeedJobs-2017.png?raw=1)

What is the `tidyverse`?
========================================================

- Multiple packages for managing and manipulating data, using data verb syntax
    - `dplyr`
        - filter
        - mutate
        - aggregate
    - `tidyr`
        - stack and unstack (`gather()` and `spread()`, superseded by `pivot_longer()` and `pivot_wider()`)
    - `ggplot2`
        - Nice graphics
    
    

### Loading a Library

In [3]:
# This attaches the core tidyverse packages, including dplyr and tidyr.
# You must do this every time you start a new R session.
library("tidyverse")

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.3     ✔ readr     2.1.4
✔ forcats   1.0.0     ✔ stringr   1.5.0
✔ ggplot2   3.4.3     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.0
✔ purrr     1.0.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors


#### Reading in data

In [4]:
surveys <- read.csv('https://github.com/WSU-DataScience/DSCI_210_R_notebooks/raw/main/data/portal_data_joined.csv')
head(surveys)

,record_id,month,day,year,plot_id,species_id,sex,hindfoot_length,weight,genus,species,taxa,plot_type
,<int>,<int>,<int>,<int>,<int>,<chr>,<chr>,<int>,<int>,<chr>,<chr>,<chr>,<chr>
1,1,7,16,1977,2,NL,M,32.0,,Neotoma,albigula,Rodent,Control
2,72,8,19,1977,2,NL,M,31.0,,Neotoma,albigula,Rodent,Control
3,224,9,13,1977,2,NL,,,,Neotoma,albigula,Rodent,Control
4,266,10,16,1977,2,NL,,,,Neotoma,albigula,Rodent,Control
5,349,11,12,1977,2,NL,,,,Neotoma,albigula,Rodent,Control
6,363,11,12,1977,2,NL,,,,Neotoma,albigula,Rodent,Control
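
Besides `head`, a few base R functions give a quick overview of a freshly read data frame. The sketch below is only an illustration: it types three rows (copied from outputs in this notebook, with a reduced set of columns) in as text so that it runs without a download, and the name `toy` is made up for this example.

```r
# Three-row extract typed in as text (an assumption made for this example,
# so that no file download is needed)
csv_text <- "record_id,year,species_id,weight
1,1977,NL,
72,1977,NL,
22728,1995,NL,165"

toy <- read.csv(text = csv_text)

head(toy, 2)       # first two rows only
dim(toy)           # number of rows, then number of columns
str(toy)           # column names and types
is.na(toy$weight)  # empty numeric fields are read as NA
```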


## <font color="red"> Exercise 1 </font>

1. Load `NC_births.csv` from GitHub (see the data folder and use the Raw link).
2. Inspect the data using `head`.

In [1]:
# Your code here

Selecting columns with `select`
========================================================

In [5]:
# Syntax: select(df, col1, col2, ...)
new_df <- select(surveys, plot_id, species_id, weight, year)

No output?  $\rightarrow$ This is normal: an assignment stores the result without printing it.  

Use `head` to inspect.
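
The silence comes from the assignment, not from `select`. A minimal sketch with a small made-up data frame (the name `toy` and its values are illustrative, not part of the course data) shows that an assignment stores the result without printing, while wrapping the assignment in parentheses prints it as well.

```r
library(dplyr)

# Hypothetical toy data, for illustration only
toy <- data.frame(plot_id = c(1, 2, 3),
                  weight  = c(40, NA, 165),
                  year    = c(1977, 1995, 1995))

# Assignment: the result is stored, nothing is printed
toy_small <- select(toy, plot_id, weight)

# Parentheses around an assignment store AND print the result
(toy_small <- select(toy, plot_id, weight))

# select() returned a new data frame; the input is untouched
ncol(toy)        # still 3 columns
ncol(toy_small)  # 2 columns
```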

In [6]:
# Good habit: Always inspect the result with head
head(new_df)

,plot_id,species_id,weight,year
,<int>,<chr>,<int>,<int>
1,2,NL,,1977
2,2,NL,,1977
3,2,NL,,1977
4,2,NL,,1977
5,2,NL,,1977
6,2,NL,,1977


### Filtering rows with `filter`


In [7]:
new_df2 <- filter(surveys, year == 1995)
head(new_df2)

,record_id,month,day,year,plot_id,species_id,sex,hindfoot_length,weight,genus,species,taxa,plot_type
,<int>,<int>,<int>,<int>,<int>,<chr>,<chr>,<int>,<int>,<chr>,<chr>,<chr>,<chr>
1,22314,6,7,1995,2,NL,M,34,,Neotoma,albigula,Rodent,Control
2,22728,9,23,1995,2,NL,F,32,165.0,Neotoma,albigula,Rodent,Control
3,22899,10,28,1995,2,NL,F,32,171.0,Neotoma,albigula,Rodent,Control
4,23032,12,2,1995,2,NL,F,33,,Neotoma,albigula,Rodent,Control
5,22003,1,11,1995,2,DM,M,37,41.0,Dipodomys,merriami,Rodent,Control
6,22042,2,4,1995,2,DM,F,36,45.0,Dipodomys,merriami,Rodent,Control
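
`filter` keeps only the rows where the condition is `TRUE`; rows where the condition evaluates to `NA` are dropped. The sketch below uses a made-up data frame (`toy` is a hypothetical name) to show conditions combined with a comma and what happens to missing weights.

```r
library(dplyr)

# Hypothetical toy data, for illustration only
toy <- data.frame(species_id = c("NL", "NL", "DM", "DM"),
                  year       = c(1977, 1995, 1995, 1995),
                  weight     = c(NA, 165, 41, NA))

# Conditions separated by commas must all hold (logical AND)
heavy_1995 <- filter(toy, year == 1995, weight > 100)
heavy_1995

# A comparison with NA yields NA, so those rows are dropped
light <- filter(toy, weight < 100)
nrow(light)  # 1: only the DM row with weight 41
```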



<font color="red"> <b> Question.</b></font> Why are the columns that were not selected (last slide) still here? <font size = 1>Answer below.</font>


<font color="red"> <b> Tasks.</b></font>

1. Edit the filter code to correctly apply both the select and filter,
2. Identify the new bug  <font size = 1>Answer below</font>,
3. Correct the bug by editing the select code.

<font color="orange">
Your answers here
</font>

Making a new column with `mutate`
========================================================

In [15]:
new_df <- select(surveys, plot_id, species_id, weight, year)
new_df2 <- filter(new_df, year == 1995)
new_df3 <- mutate(new_df2, weight_kg = weight / 1000)
head(new_df3)

,plot_id,species_id,weight,year,weight_kg
,<int>,<chr>,<int>,<int>,<dbl>
1,2,NL,,1995,
2,2,NL,165.0,1995,0.165
3,2,NL,171.0,1995,0.171
4,2,NL,,1995,
5,2,DM,41.0,1995,0.041
6,2,DM,45.0,1995,0.045


In [18]:
# Drop the old weight column
new_df4 <- select(new_df3, -weight)
head(new_df4)

,plot_id,species_id,year,weight_kg
,<int>,<chr>,<int>,<dbl>
1,2,NL,1995,
2,2,NL,1995,0.165
3,2,NL,1995,0.171
4,2,NL,1995,
5,2,DM,1995,0.041
6,2,DM,1995,0.045
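
`mutate` can also create several columns in one call, and a later column may use one created earlier in the same call. A sketch on a hypothetical data frame (`toy` and `weight_mg` are names invented for this example):

```r
library(dplyr)

# Hypothetical toy data; weights are in grams
toy <- data.frame(species_id = c("NL", "DM"),
                  weight     = c(165, 41))

toy2 <- mutate(toy,
               weight_kg = weight / 1000,    # grams -> kilograms
               weight_mg = weight_kg * 1e6)  # uses the column created just above

toy2
```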


Motivating pipes
========================================================

### Imperative pattern - Save, save, save


<img width="450" src="https://github.com/WSU-DataScience/DSCI_210_R_notebooks/blob/main/img/imperative_pattern.png?raw=1">

- **Problem 1:** Lots of temporary variables
- **Problem 2:** Messy and lots of *overhead*
    - All the extra *stuff* clouds the meaning/intent of the code

### Poor solution - Overwrite the same data frame

```{R}
surveys <- select(surveys, plot_id, species_id, weight, year)
surveys <- filter(surveys, year == 1995)
surveys <- mutate(surveys, weight_kg = weight / 1000)
```

<font color="red"> <b> Question.</b></font> What are the problems with this approach? <font size = 1>Answer below.</font>


<font color="orange" size=3>
Your answer here
</font>

## Use a pipe for cleaner code

The pipe (`%>%`) pushes the data frame into the first argument position of the next function:

<img width="350" src="https://github.com/WSU-DataScience/DSCI_210_R_notebooks/blob/main/img/pipe1.png?raw=1">

### Imagine an invisible data frame in the first spot...

Important Point - Each data frame is NEW

<img width="350" src="https://github.com/WSU-DataScience/DSCI_210_R_notebooks/blob/main/img/pipe2.png?raw=1">

### ... but don't write it!

In [19]:
surveys  %>%
  select(plot_id, species_id, weight, year) %>%
  filter(year == 1995) %>%
  mutate(weight_kg = weight / 1000) %>%
  head()

,plot_id,species_id,weight,year,weight_kg
,<int>,<chr>,<int>,<int>,<dbl>
1,2,NL,,1995,
2,2,NL,165.0,1995,0.165
3,2,NL,171.0,1995,0.171
4,2,NL,,1995,
5,2,DM,41.0,1995,0.041
6,2,DM,45.0,1995,0.045


### My preferred code format

In [8]:
(surveys
 %>% select(plot_id, species_id, weight, year)
 %>% filter(year == 1995)
 %>% mutate(weight_kg = weight / 1000)
 %>% head()
 ) -> survey_clean

survey_clean

,plot_id,species_id,weight,year,weight_kg
,<int>,<chr>,<int>,<int>,<dbl>
1,2,NL,,1995,
2,2,NL,165.0,1995,0.165
3,2,NL,171.0,1995,0.171
4,2,NL,,1995,
5,2,DM,41.0,1995,0.041
6,2,DM,45.0,1995,0.045


## <font color="red"> Exercise 2 </font>

Write a pipe to perform the following tasks.

1. Select the `weight`, `sex` and `gender` columns,
2. Filter out the rows that contain `NA` values (see [answer 1](https://stackoverflow.com/questions/28857653/removing-na-observations-with-dplyrfilter)), and
3. Compute the weight of all species in pounds (lbs).

In [None]:
# Your piped process here

## Deliverables

Please submit the following on D2L.

1. A Word document containing screenshots of your solutions (code + output), and
2. A share link to your notebook.  You will need to save a copy of your notebook and change the share permissions to allow anyone with the link access.

# Part 2 - Converting code and types of errors.

### You've seen piping before...

<img width="850" src="https://github.com/WSU-DataScience/DSCI_210_R_notebooks/raw/main/img/openrefine_piping.PNG">

### Saving the result of a piped operation

In [None]:
surveys_small <- surveys %>%
  filter(weight < 5) %>%
  select(species_id, sex, weight)

head(surveys_small)

species_id,sex,weight
PF,F,4
PF,F,4
PF,M,4
RM,F,4
RM,M,4
PF,,4


### The advantages of piping

* Reads left-to-right
* Reads top-to-bottom
* Focuses on verbs
* Removes pointless nouns

### Compare

* Imperative
* Functional
* Piping

### Imperative:

In [None]:
x <- pi
r_x <- round(x, 2)
c_x <- as.character(r_x)
c_x

### Functional:

In [None]:
as.character(round(pi, 2))

### Piping:

In [None]:
pi %>%
  round(2) %>%
  as.character()
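
As a quick check that the three styles are interchangeable, the sketch below stores each result and compares them. The variable names `imperative`, `functional`, and `piped` are made up for this example.

```r
library(dplyr)  # dplyr re-exports the %>% pipe

# Imperative: one named intermediate per step
x   <- pi
r_x <- round(x, 2)
imperative <- as.character(r_x)

# Functional: nested calls, read from the inside out
functional <- as.character(round(pi, 2))

# Piped: read from top to bottom
piped <- pi %>%
  round(2) %>%
  as.character()

imperative                         # "3.14"
identical(imperative, functional)  # TRUE
identical(functional, piped)       # TRUE
```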

## Example 1 - Converting to pipes

In [None]:
surveys_small <- filter(surveys, weight < 5)
survey_small_id_sex_wgt <- select(surveys_small, species_id, sex, weight)
head(survey_small_id_sex_wgt)

species_id,sex,weight
PF,F,4
PF,F,4
PF,M,4
RM,F,4
RM,M,4
PF,,4


In [None]:
# Convert to piped code

## Example 2 - Converting to imperative

In [None]:
surveys_small <- surveys %>%
  filter(species_id == 'NL') %>%
  select(species_id, sex, weight)

head(surveys_small)

species_id,sex,weight
NL,M,
NL,M,
NL,,
NL,,
NL,,
NL,,


In [None]:
# Convert to imperative

## Example 3 - Converting to functional

In [None]:
surveys_small <- surveys %>%
  filter(weight < 5) %>%
  select(species_id, sex, weight)

head(surveys_small)

species_id,sex,weight
PF,F,4
PF,F,4
PF,M,4
RM,F,4
RM,M,4
PF,,4


In [None]:
# Convert to functional

## <font color="red"> Exercise 2 </font>

Perform each of the following code conversions.

In [None]:
sales <- read.csv('https://github.com/WSU-DataScience/DSCI_210_R_notebooks/raw/main/data/auto_sales.csv')
head(sales)

Salesperson,Compact,Sedan,SUV,Truck
Ann,22,18,15,12
Bob,19,12,17,20
Yolanda,19,8,32,15
Xerxes,12,23,18,9


#### <font color="red">TASK 1</font>. Convert the following *piped code* to the *imperative style*

In [None]:
sales %>%
    select(Salesperson, Compact, Sedan) %>%
    mutate(Car = Compact + Sedan)

Salesperson,Compact,Sedan,Car
Ann,22,18,40
Bob,19,12,31
Yolanda,19,8,27
Xerxes,12,23,35


In [None]:
# Your code here

#### <font color="red">TASK 2</font>. Convert the following *imperative code* to the *piped style*

In [None]:
df2 <- mutate(sales, Car = Compact + Sedan)
df3 <- mutate(df2, Utility = SUV + Truck)
df4 <- select(df3, Salesperson, Car, Utility)
head(df4)

Salesperson,Car,Utility
Ann,40,27
Bob,31,37
Yolanda,27,47
Xerxes,35,27


In [None]:
# Your code here

Types of programming errors
========================================================

* Name errors
* Syntax errors
* Semantic errors (hardest/worst)

### Name Errors - Using the wrong name

In [None]:
sales <- read.csv('https://github.com/WSU-DataScience/DSCI_210_R_notebooks/raw/main/data/auto_sales.csv')
head(sales)

Salesperson,Compact,Sedan,SUV,Truck
Ann,22,18,15,12
Bob,19,12,17,20
Yolanda,19,8,32,15
Xerxes,12,23,18,9


##  Find the name errors

In [None]:
sales %>%
  select(salesperson, sedan)

ERROR: Error: Can't subset columns that don't exist.
✖ Column `salesperson` doesn't exist.
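
One way to track down a name error is to list the exact column names and compare them with what was typed, remembering that R names are case-sensitive. The sketch below rebuilds the `sales` table by hand from the values printed above so that it runs without a download; the variable `result` is made up for this example.

```r
library(dplyr)

# Same values as the sales table printed above, typed in by hand
sales <- data.frame(Salesperson = c("Ann", "Bob", "Yolanda", "Xerxes"),
                    Compact     = c(22, 19, 19, 12),
                    Sedan       = c(18, 12, 8, 23),
                    SUV         = c(15, 17, 32, 18),
                    Truck       = c(12, 20, 15, 9))

# List the exact column names
names(sales)

# Names are case-sensitive: "sedan" is not "Sedan"
"sedan" %in% names(sales)  # FALSE
"Sedan" %in% names(sales)  # TRUE

# The failing call raises an error, which tryCatch() can capture
result <- tryCatch(select(sales, salesperson, sedan),
                   error = function(e) "name error")
result  # "name error"
```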


### Syntax errors - Incorrect syntax

In [None]:
head(sales

ERROR: Error in parse(text = x, srcfile = src): <text>:2:0: unexpected end of input
1: head(sales
   ^
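
Syntax errors are caught when R parses the code, before any of it runs. The sketch below illustrates this with base R's `parse()`, keeping the code as text so that the example itself stays valid. The helper `parses()` is hypothetical and written only for this illustration.

```r
# The code under test is kept as text so this example can itself be parsed
bad_code  <- "head(sales"
good_code <- "head(sales)"

# Hypothetical helper: TRUE if the text is grammatically valid R.
# parse() only checks the grammar; it does not run the code.
parses <- function(code) {
  tryCatch({
    parse(text = code)
    TRUE
  }, error = function(e) FALSE)
}

parses(bad_code)   # FALSE: the closing parenthesis is missing
parses(good_code)  # TRUE, even though no object called sales exists here
```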


##  Find the syntax errors

In [None]:
# Find the syntax error
sales %>%
  mutate(monthly_sedan = Sedan/3,
         monthly_suv = SUV/3
         monthly_truck = Truck/3

ERROR: Error in parse(text = x, srcfile = src): <text>:5:10: unexpected symbol
4:          monthly_suv = SUV/3
5:          monthly_truck
            ^


### Semantic Errors - Correct code, wrong meaning

In [None]:
# Find the semantic errors
sales %>%
  group_by(Salesperson) %>%
  mutate(avg_sedan = median(Truck))

Salesperson,Compact,Sedan,SUV,Truck,avg_sedan
Ann,22,18,15,12,12
Bob,19,12,17,20,20
Yolanda,19,8,32,15,15
Xerxes,12,23,18,9,9
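
Because a semantic error produces no message, one defence is to compare a result against a value worked out by hand. The sketch below uses a different, made-up bug (multiplying where the code should divide) on a hypothetical data frame `toy`.

```r
library(dplyr)

# Hypothetical toy data; weights are in grams
toy <- data.frame(weight = c(165, 41))

# Runs without any error, but multiplies where it should divide
wrong <- mutate(toy, weight_kg = weight * 1000)

# Intended calculation
right <- mutate(toy, weight_kg = weight / 1000)

# Hand-computed check: 165 g is 0.165 kg
isTRUE(all.equal(wrong$weight_kg[1], 0.165))  # FALSE: the bug is exposed
isTRUE(all.equal(right$weight_kg[1], 0.165))  # TRUE
```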


## <font color="red"> Exercise 3 </font>

Identify all of the errors and classify each as either a name, syntax, or semantic error.

In [None]:
sales %>%
    mutate(Car = compact + sedan) %>%
    mutate(Utility = SUV * Truck %>%

ERROR: Error in parse(text = x, srcfile = src): <text>:4:0: unexpected end of input
2:     mutate(Car = compact + sedan) %>%
3:     mutate(Utility = SUV * Truck %>%
  ^


> Your answer here