<a href="https://colab.research.google.com/github/thooks630/DSCI_210_R_notebooks/blob/main/lecture_6_1_introduction_to_dplyr_and_coding.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Part 1 - Introduction to the `tidyverse` in `R`

## Why use R?


- Save and rerun code
- Several data science/statistics packages available
- Great graphics
- Built for data
- Free and open-source
- Large user community

### Market share

![](https://github.com/WSU-DataScience/DSCI_210_R_notebooks/blob/main/img/Fig-1a-IndeedJobs-2017.png?raw=1)

## What is the `tidyverse`?

The tidyverse is a collection of `R` packages designed for data science. They all share an underlying design philosophy, grammar, and data structures. We will focus on a few packages for managing data, using the data verb syntax.
* `dplyr` (`select`, `filter`, `mutate`, `group_by`, `summarize`)
* `tidyr` (stack and unstack with `gather` and `spread`, which newer versions of `tidyr` supersede with `pivot_longer` and `pivot_wider`)


In future data science courses, you will likely use `ggplot2` to create nice graphics.
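
To make "stack and unstack" concrete, here is a minimal sketch using two rows of the auto sales data that appears later in this notebook. The data frame names `wide`, `long`, and `back` and the column names `type` and `sold` are made up for illustration.

```r
library(tidyr)

# Two rows borrowed from the auto sales data used later in this notebook
wide <- data.frame(
  Salesperson = c("Ann", "Bob"),
  Sedan       = c(18, 12),
  SUV         = c(15, 17)
)

# Stack: one row per salesperson/vehicle-type combination
long <- gather(wide, key = "type", value = "sold", Sedan, SUV)
long

# Unstack: back to one column per vehicle type
back <- spread(long, key = type, value = sold)
back
```

The stacked version has one row for every salesperson and vehicle type, while the unstacked version has the same shape as the original.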
    

# Introduction to the `dplyr` package in `R`

## Loading a Library

In [None]:
# This loads all of the dplyr functions
# You must do this every time you start a new R session

library("dplyr")


Attaching package: ‘dplyr’


The following objects are masked from ‘package:stats’:

    filter, lag


The following objects are masked from ‘package:base’:

    intersect, setdiff, setequal, union




## Reading in data

In [None]:
surveys <- read.csv('https://github.com/WSU-DataScience/DSCI_210_R_notebooks/raw/main/data/portal_data_joined.csv')

# Good habit: Always inspect the result with head
head(surveys)

| | record_id | month | day | year | plot_id | species_id | sex | hindfoot_length | weight | genus | species | taxa | plot_type |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | `<int>` | `<int>` | `<int>` | `<int>` | `<int>` | `<chr>` | `<chr>` | `<int>` | `<int>` | `<chr>` | `<chr>` | `<chr>` | `<chr>` |
| 1 | 1 | 7 | 16 | 1977 | 2 | NL | M | 32 | | Neotoma | albigula | Rodent | Control |
| 2 | 72 | 8 | 19 | 1977 | 2 | NL | M | 31 | | Neotoma | albigula | Rodent | Control |
| 3 | 224 | 9 | 13 | 1977 | 2 | NL | | | | Neotoma | albigula | Rodent | Control |
| 4 | 266 | 10 | 16 | 1977 | 2 | NL | | | | Neotoma | albigula | Rodent | Control |
| 5 | 349 | 11 | 12 | 1977 | 2 | NL | | | | Neotoma | albigula | Rodent | Control |
| 6 | 363 | 11 | 12 | 1977 | 2 | NL | | | | Neotoma | albigula | Rodent | Control |


## Selecting columns with `select`

In [None]:
# Syntax: select(df, col1, col2, ...)

new_df <- select(surveys, plot_id, species_id, weight)
head(new_df)

## Filtering rows with `filter`

In [None]:
new_df2 <- filter(surveys, year == 1995)
head(new_df2)

| | record_id | month | day | year | plot_id | species_id | sex | hindfoot_length | weight | genus | species | taxa | plot_type |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | `<int>` | `<int>` | `<int>` | `<int>` | `<int>` | `<chr>` | `<chr>` | `<int>` | `<int>` | `<chr>` | `<chr>` | `<chr>` | `<chr>` |
| 1 | 22314 | 6 | 7 | 1995 | 2 | NL | M | 34 | | Neotoma | albigula | Rodent | Control |
| 2 | 22728 | 9 | 23 | 1995 | 2 | NL | F | 32 | 165 | Neotoma | albigula | Rodent | Control |
| 3 | 22899 | 10 | 28 | 1995 | 2 | NL | F | 32 | 171 | Neotoma | albigula | Rodent | Control |
| 4 | 23032 | 12 | 2 | 1995 | 2 | NL | F | 33 | | Neotoma | albigula | Rodent | Control |
| 5 | 22003 | 1 | 11 | 1995 | 2 | DM | M | 37 | 41 | Dipodomys | merriami | Rodent | Control |
| 6 | 22042 | 2 | 4 | 1995 | 2 | DM | F | 36 | 45 | Dipodomys | merriami | Rodent | Control |


*Question: Why are the columns not selected up above still appearing here?*

## Creating a new column with `mutate`

In [None]:
new_df <- select(surveys, plot_id, species_id, weight, year)
new_df2 <- filter(new_df, year == 1995)
new_df3 <- mutate(new_df2, weight_kg = weight / 1000)
head(new_df3)

In [None]:
# To drop the old weight column:

new_df4 <- select(new_df3, -weight)
head(new_df4)

| | plot_id | species_id | year | weight_kg |
|---|---|---|---|---|
| | `<int>` | `<chr>` | `<int>` | `<dbl>` |
| 1 | 2 | NL | 1995 | |
| 2 | 2 | NL | 1995 | 0.165 |
| 3 | 2 | NL | 1995 | 0.171 |
| 4 | 2 | NL | 1995 | |
| 5 | 2 | DM | 1995 | 0.041 |
| 6 | 2 | DM | 1995 | 0.045 |


## Motivating pipes

The pipe, `%>%`, is a powerful tool for clearly expressing a sequence of multiple operations. Before we explore using the pipe with `dplyr` functions, let's look at some alternatives.

### Alternative #1: Imperative coding pattern - save, save, save!


<img width="450" src="https://github.com/WSU-DataScience/DSCI_210_R_notebooks/blob/main/img/imperative_pattern.png?raw=1">

This works, but it's not the best approach.
- **Problem 1:** Creates lots of temporary variables 
- **Problem 2:** Messy and lots of overhead

All the extra *stuff* clouds the meaning/intent of the code!

### Alternative #2 - Rewrite to the same data frame

Instead of creating new objects at each step, we could just overwrite the original:

```{R}
surveys <- select(surveys, plot_id, species_id, weight, year)
surveys <- filter(surveys, year == 1995)
surveys <- mutate(surveys, weight_kg = weight / 1000)
```

**Problem:** This approach obscures what's changing on each line.



### Alternative #3 - Functional coding approach

This approach nests the function calls inside one another:

In [None]:
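# Same steps as before, nested: select is innermost, mutate is outermost.
# The result is saved under a new name so that `surveys` stays intact for later cells.
surveys_1995 <-
  mutate(
    filter(
      select(surveys, plot_id, species_id, weight, year),
      year == 1995),
    weight_kg = weight / 1000)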

**Problem:** We have to read from inside-out. This is difficult to understand!

### The fix: use a pipe for cleaner code

The pipe helps us write code that is easier to read and understand. It passes the data frame on its left into the first argument position of the function on its right:

<img width="350" src="https://github.com/WSU-DataScience/DSCI_210_R_notebooks/blob/main/img/pipe1.png?raw=1">

Imagine an invisible data frame in the first spot... but don't write it!

<img width="350" src="https://github.com/WSU-DataScience/DSCI_210_R_notebooks/blob/main/img/pipe2.png?raw=1">

Note this important point: each step in a pipe returns a NEW data frame, and the original data frame is left unchanged.
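
As a small check of both ideas, the sketch below compares a piped call with the equivalent ordinary call on a toy data frame. The data frame `toy` and its values are invented for illustration and are not part of the surveys data.

```r
library(dplyr)

# Toy data frame (hypothetical values, for illustration only)
toy <- data.frame(
  species_id = c("NL", "DM", "DM"),
  weight     = c(165, 41, 45)
)

# The pipe supplies `toy` as the first argument, so these two calls do the same thing
piped  <- toy %>% filter(weight < 100)
direct <- filter(toy, weight < 100)
identical(piped, direct)

# `toy` itself is untouched: it still has all of its rows
nrow(toy)
```

Because neither call modifies `toy`, you only keep a result if you assign it to a name.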


### The code with pipes - much cleaner!
The code shown below uses the pipe with `dplyr` functions. The advantage is that we are now focusing on the data verbs!

In [None]:
surveys  %>% 
  select(plot_id, species_id, weight, year) %>%
  filter(year == 1995) %>%
  mutate(weight_kg = weight / 1000) %>%
  head()

### My preferred code format

In [None]:
(surveys  
 %>% select(plot_id, species_id, weight, year) 
 %>% filter(year == 1995) 
 %>% mutate(weight_kg = weight / 1000) 
 %>% head()
)

| | plot_id | species_id | weight | year | weight_kg |
|---|---|---|---|---|---|
| | `<int>` | `<chr>` | `<int>` | `<int>` | `<dbl>` |
| 1 | 2 | NL | | 1995 | |
| 2 | 2 | NL | 165 | 1995 | 0.165 |
| 3 | 2 | NL | 171 | 1995 | 0.171 |
| 4 | 2 | NL | | 1995 | |
| 5 | 2 | DM | 41 | 1995 | 0.041 |
| 6 | 2 | DM | 45 | 1995 | 0.045 |


## <font color="red"> Exercise 1 </font>

Write code using `dplyr` with pipes to perform the following tasks.

1. Compute the weight of all species in lbs.
2. Compute the weight of all species in lbs for each `genus` separately.

In [None]:
# Your code for task 1 here

In [None]:
# Your code for task 2 here

# Part 2 - Converting code and types of errors

## You've seen piping before...
 
<img width="850" src="https://github.com/thooks630/DSCI_210_R_notebooks/raw/main/img/openrefine_piping.PNG">

## Saving the result of a piped operation

In [None]:
surveys_small <- 
(surveys 
  %>% filter(weight < 5) 
  %>% select(species_id, sex, weight)
)

head(surveys_small)

| | species_id | sex | weight |
|---|---|---|---|
| | `<chr>` | `<chr>` | `<int>` |
| 1 | PF | F | 4 |
| 2 | PF | F | 4 |
| 3 | PF | M | 4 |
| 4 | RM | F | 4 |
| 5 | RM | M | 4 |
| 6 | PF | | 4 |


## A recap - the advantages of piping

* Reads left-to-right
* Reads top-to-bottom
* Focuses on verbs
* Removes pointless nouns

## Comparing three different coding approaches

* Imperative
* Functional
* Piping

### Imperative:

In [None]:
x <- pi
r_x <- round(x, 2)
c_x <- as.character(r_x)
c_x

### Functional:

In [None]:
as.character(round(pi, 2))

### Piping:

In [None]:
pi %>%
  round(2) %>%
  as.character()
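
To confirm that the three styles really are interchangeable, the following sketch stores each result under its own name and compares them. The names `imperative`, `functional`, and `piped` are chosen here for illustration only.

```r
library(dplyr)  # provides the %>% pipe

# Imperative: one named intermediate per step
x   <- pi
r_x <- round(x, 2)
imperative <- as.character(r_x)

# Functional: nested calls, read from the inside out
functional <- as.character(round(pi, 2))

# Piping: read from left to right
piped <- pi %>% round(2) %>% as.character()

# Both comparisons should come back TRUE
c(imperative == functional, functional == piped)
```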

## Example 1 - converting to pipes

In [None]:
surveys_small <- filter(surveys, weight < 5) 
survey_small_id_sex_wgt <- select(surveys_small, species_id, sex, weight)
head(survey_small_id_sex_wgt)

| | species_id | sex | weight |
|---|---|---|---|
| | `<chr>` | `<chr>` | `<int>` |
| 1 | PF | F | 4 |
| 2 | PF | F | 4 |
| 3 | PF | M | 4 |
| 4 | RM | F | 4 |
| 5 | RM | M | 4 |
| 6 | PF | | 4 |


In [None]:
# Convert to piped code

## Example 2 - converting to imperative approach

In [None]:
surveys_small <- surveys %>%
  filter(species_id == 'NL') %>%
  select(species_id, sex, weight)

head(surveys_small)

In [None]:
# Convert to imperative

## Example 3 - converting to functional approach

In [None]:
surveys_small <- surveys %>%
  filter(weight < 5) %>%
  select(species_id, sex, weight)

head(surveys_small)

In [None]:
# Convert to functional

## <font color="red"> Exercise 2 </font>

Perform each of the following code conversions.

In [None]:
sales <- read.csv('https://github.com/WSU-DataScience/DSCI_210_R_notebooks/raw/main/data/auto_sales.csv')
head(sales)

#### <font color="red">TASK 1</font>. Convert the following *piped code* to the *imperative style*

In [None]:
sales %>%
    select(Salesperson, Compact, Sedan) %>%
    mutate(Car = Compact + Sedan) 

In [None]:
# Your code here (using imperative approach)

#### <font color="red">TASK 2</font>. Convert the following *imperative code* to the *piped style*

In [None]:
df2 <- mutate(sales, Car = Compact + Sedan)
df3 <- mutate(df2, Utility = SUV + Truck)
df4 <- select(df3, Salesperson, Car, Utility)
head(df4)

In [None]:
# Your code here

## Types of programming errors

* Name errors
* Syntax errors
* Semantic errors (hardest/worst)
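
The three kinds of error surface at different moments, which is part of why semantic errors are the hardest to catch. The sketch below uses base R only and deliberately avoids the sales data; the variable names and the misspelling `wieght_g` are invented for illustration.

```r
# Syntax error: R cannot parse the text, so nothing runs at all
syntax_msg <- tryCatch(
  { parse(text = "x <- 1 +* 2"); "parsed fine" },
  error = function(e) "syntax error"
)

# Name error: the code parses, but `wieght_g` does not exist when the line runs
weight_g <- 165
name_msg <- tryCatch(
  { wieght_g / 1000; "ran fine" },
  error = function(e) "name error"
)

# Semantic error: runs without any complaint, but the answer is wrong
# (converting grams to kilograms should divide by 1000, not multiply)
weight_kg <- weight_g * 1000

syntax_msg
name_msg
weight_kg
```

R reports the first two problems for you; only checking the result against what you expect reveals the third.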

### Name Errors - Using the wrong name

In [None]:
sales <- read.csv('https://github.com/WSU-DataScience/DSCI_210_R_notebooks/raw/main/data/auto_sales.csv')
head(sales)

| | Salesperson | Compact | Sedan | SUV | Truck |
|---|---|---|---|---|---|
| | `<chr>` | `<int>` | `<int>` | `<int>` | `<int>` |
| 1 | Ann | 22 | 18 | 15 | 12 |
| 2 | Bob | 19 | 12 | 17 | 20 |
| 3 | Yolanda | 19 | 8 | 32 | 15 |
| 4 | Xerxes | 12 | 23 | 18 | 9 |


In [None]:
# Find the name errors
sales %>%
  select(salesperson, sedan)

### Syntax errors - Incorrect syntax

In [None]:
head(sales

In [None]:
# Find the syntax errors
sales %>%
  mutate(monthly_sedan = Sedan/3,
         monthly_suv = SUV/3
         monthly_truck = Truck/3

### Semantic Errors - Correct code, wrong meaning

In [None]:
# Find the semantic errors
sales %>%
  group_by(Salesperson) %>%
  mutate(avg_sedan = median(Truck))

## <font color="red"> Exercise 3 </font>

Identify all of the errors in the following code and classify each as either a name, syntax, or semantic error.

In [None]:
sales %>%
    mutate(Car = compact + sedan) %>%
    mutate(Utility = SUV * Truck %>%

> Your answer here