
# **Common mutations**

*   Numeric transformations
*   String transformations



Let's start by loading the `dplyr` package and the `surveys` dataset:

In [None]:
library(dplyr)

In [None]:
surveys <- read.csv('https://github.com/WSU-DataScience/DSCI_210_R_notebooks/raw/main/data/portal_data_joined.csv')
head(surveys)

# **Some Numeric Transformations**

## Example - Creating a decade column with the `floor` function

In [None]:
# Playing around
# Step 1 - Save the column
year <- surveys$year 
year %>% head

In [None]:
# Playing around
# Step 2 - Try it out
new <- floor(year/10)*10
head(new)

In [None]:
# Step 3 - Embed in a mutate
(surveys
 %>% mutate(decade = floor(year/10)*10)
 %>% head
 )
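
To see why the formula works, it can help to run it on a handful of values first. The following is a minimal sketch; the `years` vector is made up for illustration and is not taken from `surveys`. It also shows integer division (`%/%`) as an equivalent way to drop the last digit.

```r
# Hypothetical years, chosen only to illustrate the arithmetic
years <- c(1977, 1980, 1989, 1999, 2002)

# Divide by 10, drop the fractional part, then scale back up
decade_floor <- floor(years / 10) * 10

# Same idea using integer division
decade_intdiv <- (years %/% 10) * 10

print(decade_floor)
print(identical(decade_floor, decade_intdiv))
```

Both versions map every year from 1980 through 1989 to 1980, which is the behavior we want from a `decade` column.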

## Example - Making an indicator column with the `ifelse` function

In [None]:
# Playing around
sex <- surveys$sex
new <- ifelse(sex == 'M', 1, 0)
head(new)

In [None]:
(surveys
%>% mutate(is_male = ifelse(sex == 'M', 1, 0))
%>% head
)
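
One thing to watch for with `ifelse` is how it treats blank and missing values. The sketch below uses a made-up vector (`sex_example` is hypothetical, not the real column) to show that an empty string becomes `0`, while a true `NA` stays `NA`. If you would rather count missing values as `0`, testing with `%in%` is one option.

```r
# Hypothetical values: two recorded sexes, an empty string, and a missing value
sex_example <- c("M", "F", "", NA)

# ifelse() tests each element; an NA in the test gives an NA in the result
is_male <- ifelse(sex_example == "M", 1, 0)
print(is_male)

# %in% never returns NA, so missing values fall into the 0 group
is_male_no_na <- ifelse(sex_example %in% "M", 1, 0)
print(is_male_no_na)
```

Which behavior is appropriate depends on whether a missing `sex` should count as "not male" or remain unknown.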

## <font color="red"> Exercise 6.2 - Problem 1 - Convert lbs. to kg. and create an indicator column </font>

The variables `Bench.pre` and `Bench.post` represent the pre- and post-measurement for maximum bench press in lbs.

* **Task 1:** Create two new columns that contain the bench press weights (pre- and post-) converted to kilograms.
* **Task 2:** Create an indicator column called `is_class_1` that is `1` for all rows with `Class == 1` and `0` otherwise.

In [None]:
football_sleep <- read.csv('https://github.com/WSU-DataScience/DSCI_210_R_notebooks/raw/main/data/Football_Sleep_data.csv')
head(football_sleep)

In [None]:
# Your code here

# **Some String Transformations**

*   Substring extraction
*   Replace
*   Recode


## Example 1 - Substring extraction with the `substr` function

**Syntax:** `substr(col, start, stop)`

In [None]:
# Play around
plot_type <- surveys$plot_type
unique(plot_type)

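**Note:** R indexing starts at 1, not 0, and both `start` and `stop` are inclusive, so `substr(col, 1, 5)` returns the first five characters.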

In [None]:
new <- substr(plot_type, 1, 5)
head(new)

In [None]:
(surveys
 %>% mutate(plot_type = substr(plot_type, 1, 5))
 %>% head
 )
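
Here is a small sketch of how `start` and `stop` behave, using a few illustrative strings (the vector `plot_examples` is made up for this demonstration). Note that a string shorter than `stop` is simply returned as far as it goes.

```r
# Illustrative strings; "abc" is deliberately shorter than 5 characters
plot_examples <- c("Control", "Long-term Krat Exclosure", "abc")

# Characters 1 through 5 of each string (both ends inclusive)
first_five <- substr(plot_examples, 1, 5)
print(first_five)

# start does not have to be 1: characters 6 through 9
middle <- substr("Long-term Krat Exclosure", 6, 9)
print(middle)
```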

## <font color="red"> Exercise 6.2 - Problem 2 - Creating a decade column using string functions</font>

In a previous example, we used division and the `floor` function to create a `decade` column.  Another approach to this problem is to convert the years to strings and use `substr` to extract the decade.

**Tasks:** The basic process for converting a year to its decade is:

1. Use `as.character` to convert `year` to a string.
2. Use `substr` to extract the first three digits.
3. Use `paste(column, '0', sep = '')` to add on a zero.
4. Use `as.numeric` to convert back to a numeric column.

Be sure to *play around* with the columns first, THEN mutate.

In [None]:
head(surveys)

In [None]:
example_paste <- paste(surveys$species_id, "a", sep = '')
unique(example_paste)

In [None]:
# Your play-around code here

In [None]:
# Your mutate pipe here

## Example 2 - Using the `str_replace` function from the `stringr` library

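**Syntax:** `str_replace(column, pattern, replacement)`

**Note:** `str_replace` replaces only the first match in each string; `str_replace_all` replaces every match.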

In [None]:
library(stringr)

### Example - Cleaning up the percents

In [None]:
pct7 <- football_sleep$Pct7
head(pct7)

#### (A) Try out `str_replace`

In [None]:
new <- str_replace(pct7, '%', '')
head(new)

#### (B) Switch to numeric with `as.numeric`

In [None]:
new <- str_replace(pct7, '%', '') %>% as.numeric
head(new)

#### (C) Embed in a `mutate`

In [None]:
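# Refer to the column by its name (Pct7) inside mutate,
# rather than the pct7 vector saved while playing around
(football_sleep
 %>% mutate(percent = str_replace(Pct7, '%', '') 
                     %>% as.numeric)
 %>% head
)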

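| student | Class | Pct7 | Avg.sleep.per.night | GPA.pre | GPA.post | Clean.pre | Clean.post | Back.pre | Back.post | Bench.pre | Bench.post | percent |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | | | | 3.21 | 3.6 | 270 | | 425 | | 315 | | |
| 2 | 1.0 | 95% | 7.6 | 3.1 | 3.3 | 265 | 290.0 | 385 | 430.0 | 255 | 275.0 | 95.0 |
| 3 | 2.0 | 90% | 7.5 | 2.94 | 3.46 | 264 | 264.0 | 290 | 425.0 | 290 | 290.0 | 90.0 |
| 4 | 2.0 | 25% | 6.0 | 2.57 | 2.2 | 290 | | 450 | | 275 | | 25.0 |
| 5 | 1.0 | 44% | 6.0 | 3.5 | 3.5 | 280 | 265.0 | 415 | 390.0 | 270 | 235.0 | 44.0 |
| 6 | 2.0 | 88% | 7.0 | 2.64 | 1.53 | 253 | 253.0 | 405 | 415.0 | 305 | 325.0 | 88.0 |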


#### (D) Also compute the fraction

In [None]:
# percent is created first, so fraction can use it in the same mutate
(football_sleep
 %>% mutate(percent = str_replace(Pct7, '%', '') 
                     %>% as.numeric,
           fraction = percent/100)
 %>% head
)

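| student | Class | Pct7 | Avg.sleep.per.night | GPA.pre | GPA.post | Clean.pre | Clean.post | Back.pre | Back.post | Bench.pre | Bench.post | percent | fraction |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | | | | 3.21 | 3.6 | 270 | | 425 | | 315 | | | |
| 2 | 1.0 | 95% | 7.6 | 3.1 | 3.3 | 265 | 290.0 | 385 | 430.0 | 255 | 275.0 | 95.0 | 0.95 |
| 3 | 2.0 | 90% | 7.5 | 2.94 | 3.46 | 264 | 264.0 | 290 | 425.0 | 290 | 290.0 | 90.0 | 0.9 |
| 4 | 2.0 | 25% | 6.0 | 2.57 | 2.2 | 290 | | 450 | | 275 | | 25.0 | 0.25 |
| 5 | 1.0 | 44% | 6.0 | 3.5 | 3.5 | 280 | 265.0 | 415 | 390.0 | 270 | 235.0 | 44.0 | 0.44 |
| 6 | 2.0 | 88% | 7.0 | 2.64 | 1.53 | 253 | 253.0 | 405 | 415.0 | 305 | 325.0 | 88.0 | 0.88 |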
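
The same three steps (strip the `%`, convert to numeric, divide by 100) can be checked on a small vector before touching the data frame. This is a minimal sketch with made-up values; `pct_example` is hypothetical, and the empty string stands in for a student with no recorded percent.

```r
library(stringr)

# Hypothetical percent strings, including an empty entry
pct_example <- c("95%", "90%", "", "7.5%")

# Remove the "%" from each string, then convert the text to numbers
percent <- as.numeric(str_replace(pct_example, "%", ""))

# Rescale from a 0-100 percent to a 0-1 fraction
fraction <- percent / 100

print(percent)
print(fraction)
```

The empty string converts to `NA`, and that `NA` carries through to `fraction`, which matches the blank cells in the first row of the output.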


## <font color="red"> Exercise 6.2 - Problem 3 - Replacing `-` with `_`</font>

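Use `str_replace` to replace the hyphens in the `plot_type` column with underscores. Keep in mind that `str_replace` changes only the first match in each string, so check the output of `unique` to confirm that no value contains more than one hyphen.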

In [None]:
unique(surveys$plot_type)

In [None]:
# Your code here

## Example 3 - Recoding a character column


**Syntax:**

```{r}
recode(column, 
      `old value 1` = "new string 1",
      `old value 2` = "new string 2",
      ...)
```

In [None]:
(surveys
%>% mutate(month_name = recode(month,
                               `1` = "Jan",
                               `2` = "Feb",
                               `3` = "Mar"))
 %>% head(10)
)
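
The example above supplies replacements for only three months. As far as I understand `recode`, when the old values are numeric and the new values are strings, any value without a replacement comes back as `NA` (with a warning). The sketch below, on a made-up vector (`month_example` is hypothetical), uses the `.default` argument to fill those in instead. For month abbreviations specifically, base R's built-in `month.abb` vector is another option.

```r
library(dplyr)

# Hypothetical month numbers; 7 has no replacement supplied below
month_example <- c(1, 2, 3, 7)

# .default fills in values that match none of the listed old values
recoded <- recode(month_example,
                  `1` = "Jan",
                  `2` = "Feb",
                  `3` = "Mar",
                  .default = "Other")
print(recoded)

# Base R alternative for this case: index the built-in month.abb vector
abbreviated <- month.abb[month_example]
print(abbreviated)
```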

## <font color="red"> Exercise 6.2 - Problem 4 - Cleaning up the comic column </font>

**Tasks:** 

1. Use `unique` to explore the `comic` column and identify problems.
2. Use `recode` to clean up the `comic` column.

In [None]:
comics <- read.csv('https://github.com/WSU-DataScience/DSCI_210_R_notebooks/raw/main/data/Comic_Data_Messy.csv')
comics %>% head

In [None]:
# Your code here