
In [3]:
# Load the tidyverse, which includes all of the dplyr functions.
# Run this every time you start a new R session.
library("tidyverse")

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.4.4     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors


# The basics of `select`, `filter`, `mutate`, and `pipes`.

In this notebook, we will look at the basics of data management in R, including

1. Applying `select`, `filter`, and `mutate`, and
2. Performing a data management process using a pipe.

## Example - Loading some survey data.

In [4]:
surveys <- read.csv('https://github.com/WSU-DataScience/DSCI_210_R_notebooks/raw/main/data/portal_data_joined.csv')
head(surveys)

,record_id,month,day,year,plot_id,species_id,sex,hindfoot_length,weight,genus,species,taxa,plot_type
,<int>,<int>,<int>,<int>,<int>,<chr>,<chr>,<int>,<int>,<chr>,<chr>,<chr>,<chr>
1,1,7,16,1977,2,NL,M,32.0,,Neotoma,albigula,Rodent,Control
2,72,8,19,1977,2,NL,M,31.0,,Neotoma,albigula,Rodent,Control
3,224,9,13,1977,2,NL,,,,Neotoma,albigula,Rodent,Control
4,266,10,16,1977,2,NL,,,,Neotoma,albigula,Rodent,Control
5,349,11,12,1977,2,NL,,,,Neotoma,albigula,Rodent,Control
6,363,11,12,1977,2,NL,,,,Neotoma,albigula,Rodent,Control


## Topic 1 - `select`, `filter`, and `mutate`


### Selecting columns with `select`

In [5]:
# Syntax: df %>% select(col1, col2, ...)
(surveys
 %>% select(plot_id, species_id, weight, year)
)

plot_id,species_id,weight,year
<int>,<chr>,<int>,<int>
2,NL,,1977
2,NL,,1977
2,NL,,1977
2,NL,,1977
2,NL,,1977
2,NL,,1977
2,NL,,1977
2,NL,,1978
2,NL,218,1978
2,NL,,1978


**Note.** Looking at all that data is annoying $\rightarrow$ add a temporary `head` at the end of the pipe.

In [10]:
(surveys
 %>% select(plot_id, species_id, weight, year)
 %>% head # Temp.  Remember to delete before saving
)

,plot_id,species_id,weight,year
,<int>,<chr>,<int>,<int>
1,2,NL,,1977
2,2,NL,,1977
3,2,NL,,1977
4,2,NL,,1977
5,2,NL,,1977
6,2,NL,,1977
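
As a small aside on `head`, here is a minimal sketch of how the number of rows can be controlled. It assumes a made-up table called `toy_surveys` (hypothetical, not part of the course data) so that it runs without downloading anything.

```r
library(dplyr)

# Hypothetical toy table (made-up values) standing in for `surveys`
toy_surveys <- data.frame(
  plot_id = rep(2, 10),
  year    = 1977:1986
)

# With no extra argument, head returns the first 6 rows
first_six <- toy_surveys %>% head

# Pass `n` to ask for a different number of rows
first_three <- toy_surveys %>% head(n = 3)

nrow(first_six)
nrow(first_three)
```

Either form is only for inspection, so the same "remove before saving" habit applies.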


**Dropping columns with `-`**

In [12]:
# Syntax: df %>% select(-col1, -col2, ...)
(surveys
 %>% select(-record_id, -plot_type)
 %>% head # Temp.  Remember to delete before saving
)


,month,day,year,plot_id,species_id,sex,hindfoot_length,weight,genus,species,taxa
,<int>,<int>,<int>,<int>,<chr>,<chr>,<int>,<int>,<chr>,<chr>,<chr>
1,7,16,1977,2,NL,M,32.0,,Neotoma,albigula,Rodent
2,8,19,1977,2,NL,M,31.0,,Neotoma,albigula,Rodent
3,9,13,1977,2,NL,,,,Neotoma,albigula,Rodent
4,10,16,1977,2,NL,,,,Neotoma,albigula,Rodent
5,11,12,1977,2,NL,,,,Neotoma,albigula,Rodent
6,11,12,1977,2,NL,,,,Neotoma,albigula,Rodent
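
The two styles of `select` can describe the same result. The sketch below uses a small made-up table, `toy_surveys` (a hypothetical stand-in for `surveys`), to show keeping columns by name next to dropping the unwanted one with `-`.

```r
library(dplyr)

# Hypothetical toy table (made-up values) standing in for `surveys`
toy_surveys <- data.frame(
  record_id  = 1:4,
  plot_id    = c(2, 2, 3, 3),
  species_id = c("NL", "DM", "DM", "NL"),
  weight     = c(NA, 41, 45, 165)
)

# Keep columns by naming them ...
kept <- toy_surveys %>% select(plot_id, species_id, weight)

# ... or drop the one we do not want with `-`
dropped <- toy_surveys %>% select(-record_id)

names(kept)
names(dropped)
```

Naming the columns to keep is usually clearer when only a few are needed; dropping is handy when almost all of them stay.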


### Filtering rows with `filter`


In [14]:
(surveys
 %>% filter(year == 1995)
 %>% head # Temp.  Remember to delete before saving
)

,record_id,month,day,year,plot_id,species_id,sex,hindfoot_length,weight,genus,species,taxa,plot_type
,<int>,<int>,<int>,<int>,<int>,<chr>,<chr>,<int>,<int>,<chr>,<chr>,<chr>,<chr>
1,22314,6,7,1995,2,NL,M,34,,Neotoma,albigula,Rodent,Control
2,22728,9,23,1995,2,NL,F,32,165.0,Neotoma,albigula,Rodent,Control
3,22899,10,28,1995,2,NL,F,32,171.0,Neotoma,albigula,Rodent,Control
4,23032,12,2,1995,2,NL,F,33,,Neotoma,albigula,Rodent,Control
5,22003,1,11,1995,2,DM,M,37,41.0,Dipodomys,merriami,Rodent,Control
6,22042,2,4,1995,2,DM,F,36,45.0,Dipodomys,merriami,Rodent,Control


**Combining with `&`   [AND]**

In [15]:
(surveys
 %>% filter(year == 1995 & sex == 'M')
 %>% head # Temp.  Remember to delete before saving
)

,record_id,month,day,year,plot_id,species_id,sex,hindfoot_length,weight,genus,species,taxa,plot_type
,<int>,<int>,<int>,<int>,<int>,<chr>,<chr>,<int>,<int>,<chr>,<chr>,<chr>,<chr>
1,22314,6,7,1995,2,NL,M,34,,Neotoma,albigula,Rodent,Control
2,22003,1,11,1995,2,DM,M,37,41.0,Dipodomys,merriami,Rodent,Control
3,22044,2,4,1995,2,DM,M,37,46.0,Dipodomys,merriami,Rodent,Control
4,22109,3,4,1995,2,DM,M,37,46.0,Dipodomys,merriami,Rodent,Control
5,22168,4,1,1995,2,DM,M,36,48.0,Dipodomys,merriami,Rodent,Control
6,22368,6,27,1995,2,DM,M,37,44.0,Dipodomys,merriami,Rodent,Control


**Combining with `|`  [OR]**

In [16]:
(surveys
 %>% filter(sex == 'M' | sex == 'F')
 %>% head # Temp.  Remember to delete before saving
)


,record_id,month,day,year,plot_id,species_id,sex,hindfoot_length,weight,genus,species,taxa,plot_type
,<int>,<int>,<int>,<int>,<int>,<chr>,<chr>,<int>,<int>,<chr>,<chr>,<chr>,<chr>
1,1,7,16,1977,2,NL,M,32.0,,Neotoma,albigula,Rodent,Control
2,72,8,19,1977,2,NL,M,31.0,,Neotoma,albigula,Rodent,Control
3,588,2,18,1978,2,NL,M,,218.0,Neotoma,albigula,Rodent,Control
4,845,5,6,1978,2,NL,M,32.0,204.0,Neotoma,albigula,Rodent,Control
5,990,6,9,1978,2,NL,M,,200.0,Neotoma,albigula,Rodent,Control
6,1164,8,5,1978,2,NL,M,34.0,199.0,Neotoma,albigula,Rodent,Control
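
Two common shorthand forms may be useful here. This is a minimal sketch on a made-up table (`toy_surveys` is hypothetical): separating conditions with a comma inside `filter` behaves like `&`, and `%in%` is a compact alternative to chaining `==` comparisons with `|`.

```r
library(dplyr)

# Hypothetical toy table (made-up values) standing in for `surveys`
toy_surveys <- data.frame(
  year   = c(1995, 1995, 1996, 1995),
  sex    = c("M", "F", "M", ""),
  weight = c(41, 45, 46, NA)
)

# AND: `&` and a comma between conditions keep the same rows
with_amp   <- toy_surveys %>% filter(year == 1995 & sex == "M")
with_comma <- toy_surveys %>% filter(year == 1995, sex == "M")

# OR: chained `|` and `%in%` keep the same rows
with_or <- toy_surveys %>% filter(sex == "M" | sex == "F")
with_in <- toy_surveys %>% filter(sex %in% c("M", "F"))

nrow(with_amp)
nrow(with_or)
```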


## <font color="red"> <b> Exercise 7.2.1 </b></font>

**Questions.**
1. Why are the columns that we did not select earlier (in the `select` examples) still present in the output above?
2. How would we perform both the `select` and `filter` together?


<font color="orange">
Your answers here
</font>

## Topic 2 - Some good habits.

1. Perform all steps in one pipe,
2. Add a temporary `head` to the end of the pipe for cleaner output, and
3. Save and inspect the results of tables required for later tasks.

#### Perform all steps in one pipe.

In [None]:
(surveys
 %>% select(plot_id, species_id, weight, year)
 %>% filter(year == 1995)
)


plot_id,species_id,weight,year
<int>,<chr>,<int>,<int>
2,NL,,1995
2,NL,165,1995
2,NL,171,1995
2,NL,,1995
2,DM,41,1995
2,DM,45,1995
2,DM,46,1995
2,DM,49,1995
2,DM,46,1995
2,DM,48,1995


#### Output too long? $\rightarrow$ Temporarily shorten with `head`

Use `head` to inspect.

In [17]:
(surveys
 %>% select(plot_id, species_id, weight, year)
 %>% filter(year == 1995)
 %>% head  # Temporary - comment/remove before saving
)

,plot_id,species_id,weight,year
,<int>,<chr>,<int>,<int>
1,2,NL,,1995
2,2,NL,165.0,1995
3,2,NL,171.0,1995
4,2,NL,,1995
5,2,DM,41.0,1995
6,2,DM,45.0,1995


#### Saving and inspecting the results of a computation

In [None]:
(surveys
 %>% select(plot_id, species_id, weight, year)
 %>% filter(year == 1995)
#  %>% head  # Comment and move to last line
) -> survey_narrow

survey_narrow %>% head

,plot_id,species_id,weight,year
,<int>,<chr>,<int>,<int>
1,2,NL,,1995
2,2,NL,165.0,1995
3,2,NL,171.0,1995
4,2,NL,,1995
5,2,DM,41.0,1995
6,2,DM,45.0,1995
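
The `->` at the end of the pipe is R's right-assignment operator. As a rough illustration on a made-up table (`toy_surveys`, `toy_right`, and `toy_left` are hypothetical names), it stores the same result as the more familiar `<-`; it just puts the name at the end, where the pipe finishes.

```r
library(dplyr)

# Hypothetical toy table (made-up values) standing in for `surveys`
toy_surveys <- data.frame(
  year   = c(1995, 1996, 1995),
  weight = c(41, 45, 46)
)

# Right assignment: the name comes at the end of the pipe
(toy_surveys
 %>% filter(year == 1995)
) -> toy_right

# Left assignment: same result, but the name comes first
toy_left <- toy_surveys %>% filter(year == 1995)

nrow(toy_right)
all.equal(toy_right, toy_left)
```

Because `head` is applied only when inspecting, the saved table keeps all of its rows.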


## <font color="red"> Exercise 7.2.2 </font>

Write a single pipe to perform the following tasks.

1. Select the `weight`, `sex`, and `genus` columns,
2. Filter out the rows that contain `NA` values (see [answer 1](https://stackoverflow.com/questions/28857653/removing-na-observations-with-dplyrfilter)),
3. Save the resulting table to an appropriately named variable, and
4. Write the table to a CSV file and download the resulting file.

In [None]:
# Your code here

## Topic 3 - Making a new column with `mutate`

**Syntax.**

```{R}
...
%>% mutate(new_col_name = _some_arithmetic_expression)
...
```

For example, suppose we want to convert the weight from grams to kilograms.

In [None]:
(surveys
 %>% select(plot_id, species_id, weight, year)
 %>% filter(year == 1995)
 %>% mutate(weight_kg = weight / 1000)        # Convert to kg.
 %>% select(-weight)                          # Drop the old weight column
) -> survey_narrow

survey_narrow %>% head


,plot_id,species_id,year,weight_kg
,<int>,<chr>,<int>,<dbl>
1,2,NL,1995,
2,2,NL,1995,0.165
3,2,NL,1995,0.171
4,2,NL,1995,
5,2,DM,1995,0.041
6,2,DM,1995,0.045
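
Notice the blank `weight_kg` entries in the output above: arithmetic on a missing value gives a missing value. The sketch below shows this on a made-up table (`toy_surveys` and `toy_kg` are hypothetical names).

```r
library(dplyr)

# Hypothetical toy table (made-up values); the first weight is missing
toy_surveys <- data.frame(
  species_id = c("NL", "NL", "DM"),
  weight     = c(NA, 165, 41)
)

# NA / 1000 is still NA, so the new column is missing wherever weight is
toy_kg <- toy_surveys %>% mutate(weight_kg = weight / 1000)

toy_kg$weight_kg
is.na(toy_kg$weight_kg)
```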


## <font color="red"> Exercise 7.2.3 </font>

Write a single pipe to perform the following tasks.

1. Select the `weight`, `sex`, and `genus` columns,
2. Filter out the rows that contain `NA` values (see [answer 1](https://stackoverflow.com/questions/28857653/removing-na-observations-with-dplyrfilter)),
3. Compute the weight of all species in lbs,
4. Save the resulting table to an appropriately named variable, and
5. Write the table to a CSV file and download the resulting file.

In [None]:
# Your piped process here

## Motivating pipes

We prefer to use `%>%` to organize our data management steps in a pipe.  This section explains why we prefer this style.

### Imperative programming style

In [None]:
new_df <- select(surveys, plot_id, species_id, weight, year)
new_df2 <- filter(new_df, year == 1995)
new_df3 <- mutate(new_df2, weight_kg = weight / 1000)
head(new_df3)

,plot_id,species_id,weight,year,weight_kg
,<int>,<chr>,<int>,<int>,<dbl>
1,2,NL,,1995,
2,2,NL,165.0,1995,0.165
3,2,NL,171.0,1995,0.171
4,2,NL,,1995,
5,2,DM,41.0,1995,0.041
6,2,DM,45.0,1995,0.045


### Imperative pattern - Save, save, save


<img width="450" src="https://github.com/WSU-DataScience/DSCI_210_R_notebooks/blob/main/img/imperative_pattern.png?raw=1">

- **Problem 1:** Lots of temporary variables
- **Problem 2:** Messy and lots of *overhead*
    - All the extra *stuff* clouds the meaning/intent of the code

### Poor solution - Rewrite to the same data frame

```{R}
surveys <- select(surveys, plot_id, species_id, weight, year)
surveys <- filter(surveys, year == 1995)
surveys <- mutate(surveys, weight_kg = weight / 1000)
```

<font color="red"> <b> Question.</b></font> What are the problems with this approach? <font size = 1>Answer below.</font>


<font color="orange" size=3>
Your answer here
</font>

### Use a pipe for cleaner code

The pipe passes the data frame on its left into the first argument of the function on its right:

<img width="350" src="https://github.com/WSU-DataScience/DSCI_210_R_notebooks/blob/main/img/pipe1.png?raw=1">
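
To make the "first position" idea concrete, here is a minimal sketch on a made-up table (`toy`, `piped`, `direct`, `piped2`, and `nested` are hypothetical names). It compares a piped call with the same call written with the data frame as an explicit first argument.

```r
library(dplyr)

# Hypothetical toy table (made-up values)
toy <- data.frame(
  year   = c(1995, 1996),
  weight = c(41, 45)
)

# These two lines do the same thing
piped  <- toy %>% filter(year == 1995)
direct <- filter(toy, year == 1995)

# A two-step pipe is the same as nesting the calls inside out
piped2 <- toy %>% filter(year == 1995) %>% mutate(weight_kg = weight / 1000)
nested <- mutate(filter(toy, year == 1995), weight_kg = weight / 1000)

all.equal(piped, direct)
all.equal(piped2, nested)
```

The piped version reads top to bottom in the order the steps happen, while the nested version has to be read from the inside out.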

### Imagine an invisible data frame in the first spot...

**Important point.** Each step produces a NEW data frame.

<img width="350" src="https://github.com/WSU-DataScience/DSCI_210_R_notebooks/blob/main/img/pipe2.png?raw=1">

### ... but don't write it!

In [None]:
surveys  %>%
  select(plot_id, species_id, weight, year) %>%
  filter(year == 1995) %>%
  mutate(weight_kg = weight / 1000) %>%
  head()

,plot_id,species_id,weight,year,weight_kg
,<int>,<chr>,<int>,<int>,<dbl>
1,2,NL,,1995,
2,2,NL,165.0,1995,0.165
3,2,NL,171.0,1995,0.171
4,2,NL,,1995,
5,2,DM,41.0,1995,0.041
6,2,DM,45.0,1995,0.045
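
Since every step in a pipe returns a new data frame, the table the pipe started from is left as it was. A minimal sketch on a made-up table (`toy` and `narrow` are hypothetical names):

```r
library(dplyr)

# Hypothetical toy table (made-up values)
toy <- data.frame(
  plot_id = c(2, 3),
  weight  = c(41, 45),
  year    = c(1995, 1996)
)

# The pipe builds a new, narrower table ...
narrow <- toy %>% select(plot_id, weight)

# ... while the original still has all of its columns
names(narrow)
names(toy)
```

Nothing is kept unless the result is assigned to a name.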


### My preferred code format

In [None]:
(surveys
 %>% select(plot_id, species_id, weight, year)
 %>% filter(year == 1995)
 %>% mutate(weight_kg = weight / 1000)
) -> survey_clean

survey_clean %>% head()

,plot_id,species_id,weight,year,weight_kg
,<int>,<chr>,<int>,<int>,<dbl>
1,2,NL,,1995,
2,2,NL,165.0,1995,0.165
3,2,NL,171.0,1995,0.171
4,2,NL,,1995,
5,2,DM,41.0,1995,0.041
6,2,DM,45.0,1995,0.045


## Deliverables

Please submit the following on D2L.

1. A Word document containing screenshots of your solutions (code + output), and
2. A share link to your notebook.  You will need to save a copy of your notebook and change the share permissions so that anyone with the link can access it.