# Common mutations

In [1]:
library(dplyr)


Attaching package: ‘dplyr’

The following objects are masked from ‘package:stats’:

    filter, lag

The following objects are masked from ‘package:base’:

    intersect, setdiff, setequal, union



In [2]:
surveys <- read.csv('https://github.com/WSU-DataScience/DSCI_210_R_notebooks/raw/main/data/portal_data_joined.csv')
head(surveys)

record_id,month,day,year,plot_id,species_id,sex,hindfoot_length,weight,genus,species,taxa,plot_type
1,7,16,1977,2,NL,M,32.0,,Neotoma,albigula,Rodent,Control
72,8,19,1977,2,NL,M,31.0,,Neotoma,albigula,Rodent,Control
224,9,13,1977,2,NL,,,,Neotoma,albigula,Rodent,Control
266,10,16,1977,2,NL,,,,Neotoma,albigula,Rodent,Control
349,11,12,1977,2,NL,,,,Neotoma,albigula,Rodent,Control
363,11,12,1977,2,NL,,,,Neotoma,albigula,Rodent,Control


# Outline

* Numeric transformations
* String transformations

# Numeric Transformations

### Example - Creating a decade column

In [3]:
# Playing around
# Step 1 - Save the column
year <- surveys$year 
year %>% head

In [4]:
# Playing around
# Step 2 - Try it out
new <- floor(year/10)*10
head(new)

In [5]:
# Step 3 - Embed in a mutate
(surveys
 %>% mutate(decade = floor(year/10)*10)
 %>% head
 )

record_id,month,day,year,plot_id,species_id,sex,hindfoot_length,weight,genus,species,taxa,plot_type,decade
1,7,16,1977,2,NL,M,32.0,,Neotoma,albigula,Rodent,Control,1970
72,8,19,1977,2,NL,M,31.0,,Neotoma,albigula,Rodent,Control,1970
224,9,13,1977,2,NL,,,,Neotoma,albigula,Rodent,Control,1970
266,10,16,1977,2,NL,,,,Neotoma,albigula,Rodent,Control,1970
349,11,12,1977,2,NL,,,,Neotoma,albigula,Rodent,Control,1970
363,11,12,1977,2,NL,,,,Neotoma,albigula,Rodent,Control,1970
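
To see why `floor(year/10)*10` yields the decade, it can help to trace each intermediate step on a few hand-picked values. The sketch below uses a made-up vector called `years_demo` (a hypothetical name, not a column of `surveys`).

```r
# Hypothetical sample years, chosen to include a decade boundary
years_demo <- c(1977, 1980, 1989, 2002)

# Dividing by 10 moves the last digit behind the decimal point
tenths <- years_demo / 10        # 197.7 198.0 198.9 200.2

# floor() drops the fractional part
whole <- floor(tenths)           # 197 198 198 200

# Multiplying by 10 restores the scale, now with a trailing zero
decade_demo <- whole * 10        # 1970 1980 1980 2000
decade_demo
```

Breaking the expression into named steps like this mirrors the "play around first, then mutate" workflow used in the cells above.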


### Making an indicator column with `ifelse`

In [6]:
# Playing around
sex <- surveys$sex
new <- ifelse(sex == 'M', 1, 0)
head(new)

In [7]:
(surveys
%>% mutate(is_male = ifelse(sex == 'M', 1, 0))
%>% head
)

record_id,month,day,year,plot_id,species_id,sex,hindfoot_length,weight,genus,species,taxa,plot_type,is_male
1,7,16,1977,2,NL,M,32.0,,Neotoma,albigula,Rodent,Control,1
72,8,19,1977,2,NL,M,31.0,,Neotoma,albigula,Rodent,Control,1
224,9,13,1977,2,NL,,,,Neotoma,albigula,Rodent,Control,0
266,10,16,1977,2,NL,,,,Neotoma,albigula,Rodent,Control,0
349,11,12,1977,2,NL,,,,Neotoma,albigula,Rodent,Control,0
363,11,12,1977,2,NL,,,,Neotoma,albigula,Rodent,Control,0
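
In the output above, rows with a blank `sex` receive `0`, so "not recorded" and "not male" can no longer be told apart. If that distinction matters, one option is to map blanks to `NA` first. The sketch below uses a made-up vector `sex_demo` (hypothetical) and assumes blanks are stored as empty strings, as the output above suggests.

```r
# Hypothetical values: male, female, not recorded (empty string), and NA
sex_demo <- c("M", "F", "", NA)

# The indicator from the example: blanks fall into the "otherwise" branch
simple <- ifelse(sex_demo == "M", 1, 0)           # 1 0 0 NA

# Variant: treat the empty string as missing before testing for "M"
with_na <- ifelse(sex_demo == "", NA,
                  ifelse(sex_demo == "M", 1, 0))  # 1 0 NA NA
```

Whether a blank should count as `0` or `NA` depends on the analysis; the nested `ifelse` is only one way to express the second choice.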


## <font color="red"> Exercise 6.2.1 - Convert to kg. </font>

The variables `Bench.pre` and `Bench.post` hold the pre- and post-measurements of maximum bench press in pounds (lbs).

* **Task 1:** Create two new columns that contain the bench press weights (pre- and post-) converted to kilograms.
* **Task 2:** Create an indicator column called `is_class_1` that is `1` for all rows with `Class == 1` and `0` otherwise.

In [8]:
football_sleep <- read.csv('https://github.com/WSU-DataScience/DSCI_210_R_notebooks/raw/main/data/Football_Sleep_data.csv')
head(football_sleep)

student,Class,Pct7,Avg.sleep.per.night,GPA.pre,GPA.post,Clean.pre,Clean.post,Back.pre,Back.post,Bench.pre,Bench.post
1,,,,3.21,3.6,270,,425,,315,
2,1.0,95%,7.6,3.1,3.3,265,290.0,385,430.0,255,275.0
3,2.0,90%,7.5,2.94,3.46,264,264.0,290,425.0,290,290.0
4,2.0,25%,6.0,2.57,2.2,290,,450,,275,
5,1.0,44%,6.0,3.5,3.5,280,265.0,415,390.0,270,235.0
6,2.0,88%,7.0,2.64,1.53,253,253.0,405,415.0,305,325.0


In [9]:
# Your code here

# String Transformations

1. Substring extraction
2. Replace
3. Recode

## Pattern 1 - Substring extraction with `substr`

**Syntax:** `substr(col, start, stop)`

In [10]:
# Play around
plot_type <- surveys$plot_type
unique(plot_type)

**Note:** R indexing starts at 1, not 0, and `substr` includes both the `start` and `stop` positions.
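
A minimal sketch of how the `start` and `stop` arguments count characters, using single strings rather than the full column (the variable names are hypothetical):

```r
# A single example string (one of the plot_type values shown by head(surveys))
word <- "Control"

first_five <- substr(word, 1, 5)   # characters 1 through 5: "Contr"
middle     <- substr(word, 3, 5)   # characters 3 through 5: "ntr"
first_one  <- substr(word, 1, 1)   # start == stop keeps one character: "C"

# substr is vectorised: it is applied to every element of a character vector
labels <- substr(c("Control", "Rodent"), 1, 3)   # "Con" "Rod"
```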

In [11]:
new <- substr(plot_type, 1, 5)
head(new)

In [12]:
(surveys
 %>% mutate(plot_type = substr(plot_type, 1, 5))
 %>% head
 )

record_id,month,day,year,plot_id,species_id,sex,hindfoot_length,weight,genus,species,taxa,plot_type
1,7,16,1977,2,NL,M,32.0,,Neotoma,albigula,Rodent,Contr
72,8,19,1977,2,NL,M,31.0,,Neotoma,albigula,Rodent,Contr
224,9,13,1977,2,NL,,,,Neotoma,albigula,Rodent,Contr
266,10,16,1977,2,NL,,,,Neotoma,albigula,Rodent,Contr
349,11,12,1977,2,NL,,,,Neotoma,albigula,Rodent,Contr
363,11,12,1977,2,NL,,,,Neotoma,albigula,Rodent,Contr


## <font color="red"> Exercise 6.2.2 - Creating a decade column using string functions.</font>

In a previous example, we used division and `floor` to create a `decade` column.  Another approach to this problem is to convert the years to strings and use `substr` to extract the decade.

**Tasks:** The basic process for converting to decade is

1. Use `as.character` to convert `year` to a string.
2. Use `substr` to extract the first three digits.
3. Use `paste(column, '0', sep = '')` to add on a zero.
4. Use `as.numeric` to convert back to a numeric column.

Be sure to *play around* with the columns first, THEN mutate.

In [13]:
head(surveys)

record_id,month,day,year,plot_id,species_id,sex,hindfoot_length,weight,genus,species,taxa,plot_type
1,7,16,1977,2,NL,M,32.0,,Neotoma,albigula,Rodent,Control
72,8,19,1977,2,NL,M,31.0,,Neotoma,albigula,Rodent,Control
224,9,13,1977,2,NL,,,,Neotoma,albigula,Rodent,Control
266,10,16,1977,2,NL,,,,Neotoma,albigula,Rodent,Control
349,11,12,1977,2,NL,,,,Neotoma,albigula,Rodent,Control
363,11,12,1977,2,NL,,,,Neotoma,albigula,Rodent,Control


In [15]:
example_paste <- paste(surveys$species_id, "a", sep = '')
unique(example_paste)
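
The `sep` argument controls what `paste` puts between the pieces. A small sketch on a made-up vector `codes` (hypothetical) shows the difference between the default separator and `sep = ''`:

```r
# Hypothetical species codes
codes <- c("NL", "DM")

# The default separator is a single space
spaced <- paste(codes, "a")             # "NL a" "DM a"

# sep = '' glues the pieces together with nothing in between
glued <- paste(codes, "a", sep = "")    # "NLa" "DMa"

# paste0 is shorthand for paste(..., sep = "")
glued0 <- paste0(codes, "a")            # "NLa" "DMa"
```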

In [16]:
# Your play-around code here

In [17]:
# Your mutate pipe here

## Pattern 2 - `str_replace` from the `stringr` library

**Syntax:** `str_replace(column, pattern, replace)`

In [18]:
library(stringr)
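
Two details of `str_replace` are easy to miss: it changes only the first match in each string, and the pattern is interpreted as a regular expression by default. The sketch below illustrates both on made-up strings (all variable names are hypothetical); `str_replace_all` and `fixed` are also part of `stringr`.

```r
library(stringr)

# Hypothetical messy numbers stored as text
amounts <- c("1,234,567", "89", "4,000")

# str_replace changes only the FIRST match in each string
first_only <- str_replace(amounts, ",", "")       # "1234,567" "89" "4000"

# str_replace_all changes EVERY match
all_matches <- str_replace_all(amounts, ",", "")  # "1234567" "89" "4000"

# The pattern is a regular expression by default, so "." means "any character";
# wrap it in fixed() to match a literal dot
regex_dot   <- str_replace("3.14", ".", "")          # removes the "3": ".14"
literal_dot <- str_replace("3.14", fixed("."), "")   # removes the dot: "314"
```

Characters such as `%` and `,` have no special meaning in a regular expression, so they can be used as patterns directly.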

### Example - Cleaning up the percents

In [19]:
pct7 <- football_sleep$Pct7
head(pct7)

#### (A) Try out `str_replace`

In [20]:
new <- str_replace(pct7, '%', '')
head(new)

#### (B) Switch to numeric with `as.numeric`

In [21]:
new <- str_replace(pct7, '%', '') %>% as.numeric
head(new)

#### (C) Embed in a `mutate`

In [22]:
(football_sleep
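 # Refer to the column Pct7 (names are case-sensitive), not the pct7 vector from the play-around cell
 %>% mutate(percent = str_replace(Pct7, '%', '')
                     %>% as.numeric)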
 %>% head
)

student,Class,Pct7,Avg.sleep.per.night,GPA.pre,GPA.post,Clean.pre,Clean.post,Back.pre,Back.post,Bench.pre,Bench.post,percent
1,,,,3.21,3.6,270,,425,,315,,
2,1.0,95%,7.6,3.1,3.3,265,290.0,385,430.0,255,275.0,95.0
3,2.0,90%,7.5,2.94,3.46,264,264.0,290,425.0,290,290.0,90.0
4,2.0,25%,6.0,2.57,2.2,290,,450,,275,,25.0
5,1.0,44%,6.0,3.5,3.5,280,265.0,415,390.0,270,235.0,44.0
6,2.0,88%,7.0,2.64,1.53,253,253.0,405,415.0,305,325.0,88.0


#### (D) Also compute the fraction

In [23]:
(football_sleep
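 # Refer to the column Pct7 (names are case-sensitive), not the pct7 vector from the play-around cell
 %>% mutate(percent = str_replace(Pct7, '%', '')
                     %>% as.numeric,
           fraction = percent/100)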
 %>% head
)

student,Class,Pct7,Avg.sleep.per.night,GPA.pre,GPA.post,Clean.pre,Clean.post,Back.pre,Back.post,Bench.pre,Bench.post,percent,fraction
1,,,,3.21,3.6,270,,425,,315,,,
2,1.0,95%,7.6,3.1,3.3,265,290.0,385,430.0,255,275.0,95.0,0.95
3,2.0,90%,7.5,2.94,3.46,264,264.0,290,425.0,290,290.0,90.0,0.9
4,2.0,25%,6.0,2.57,2.2,290,,450,,275,,25.0,0.25
5,1.0,44%,6.0,3.5,3.5,280,265.0,415,390.0,270,235.0,44.0,0.44
6,2.0,88%,7.0,2.64,1.53,253,253.0,405,415.0,305,325.0,88.0,0.88
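
The first row of the output has a blank `percent` and `fraction`. This is consistent with how `as.numeric` treats text that is not a number: it returns `NA`, and arithmetic on `NA` stays `NA`. A minimal sketch with made-up strings (`cleaned` and the other names are hypothetical):

```r
# Hypothetical strings after the percent sign has been removed
cleaned <- c("", "95", "90")

# An empty string cannot be read as a number, so it becomes NA
percent_demo <- as.numeric(cleaned)     # NA 95 90

# NA propagates through arithmetic
fraction_demo <- percent_demo / 100     # NA 0.95 0.90

# Count how many values ended up missing
n_missing <- sum(is.na(percent_demo))   # 1
```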


## <font color="red"> Exercise 6.2.3 - Replacing `-` with `_`</font>

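Use `str_replace` to replace the hyphens in the `plot_type` column with underscores. Note that `str_replace` changes only the first match in each string; use `str_replace_all` if a value can contain more than one hyphen.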

In [24]:
unique(surveys$plot_type)

In [25]:
# Your code here

## Pattern 3 - Recoding a character column


**Syntax**

```{r}
recode(column, 
      `old value 1` = "new string 1",
      `old value 2` = "new string 2",
      ...)
```

In [26]:
(surveys
%>% mutate(month_name = recode(month,
                               `1` = "Jan",
                               `2` = "Feb",
                               `3` = "Mar"))
 %>% head(10)
)

“Problem with `mutate()` input `month_name`.
ℹ Unreplaced values treated as NA as .x is not compatible. Please specify replacements exhaustively or supply .default
“Unreplaced values treated as NA as .x is not compatible. Please specify replacements exhaustively or supply .default”

record_id,month,day,year,plot_id,species_id,sex,hindfoot_length,weight,genus,species,taxa,plot_type,month_name
1,7,16,1977,2,NL,M,32.0,,Neotoma,albigula,Rodent,Control,
72,8,19,1977,2,NL,M,31.0,,Neotoma,albigula,Rodent,Control,
224,9,13,1977,2,NL,,,,Neotoma,albigula,Rodent,Control,
266,10,16,1977,2,NL,,,,Neotoma,albigula,Rodent,Control,
349,11,12,1977,2,NL,,,,Neotoma,albigula,Rodent,Control,
363,11,12,1977,2,NL,,,,Neotoma,albigula,Rodent,Control,
435,12,10,1977,2,NL,,,,Neotoma,albigula,Rodent,Control,
506,1,8,1978,2,NL,,,,Neotoma,albigula,Rodent,Control,Jan
588,2,18,1978,2,NL,M,,218.0,Neotoma,albigula,Rodent,Control,Feb
661,3,11,1978,2,NL,,,,Neotoma,albigula,Rodent,Control,Mar
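
The warning above appears because only months 1 to 3 were given replacements, so every other month becomes `NA`. Two possible ways around it are sketched below on a made-up vector `month_demo` (hypothetical): supplying `.default`, as the warning suggests, or indexing into `month.abb`, a constant built into R that holds the twelve abbreviated English month names. The second option is a swapped-in alternative to `recode`, not part of the original example.

```r
library(dplyr)

# Hypothetical month numbers
month_demo <- c(7, 1, 2, 3, 12)

# Option 1: give recode a fallback for values that have no replacement
with_default <- recode(month_demo,
                       `1` = "Jan",
                       `2` = "Feb",
                       `3` = "Mar",
                       .default = "other")
# "other" "Jan" "Feb" "Mar" "other"

# Option 2: use the month number as an index into the built-in month.abb
by_lookup <- month.abb[month_demo]
# "Jul" "Jan" "Feb" "Mar" "Dec"
```

Either expression could be placed inside `mutate` in the same way as the `recode` call above.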


## <font color="red"> Exercise 6.2.4 - Cleaning up comics </font>

**Tasks:** 

1. Use `unique` to explore the `comic` column and identify problems.
2. Use `recode` to clean up the `comic` column.

In [27]:
comics <- read.csv('https://github.com/WSU-DataScience/DSCI_210_R_notebooks/raw/main/data/Comic_Data_Messy.csv')
comics %>% head

page_id,urlslug,ID,ALIGN,SEX,ALIVE,APPEARANCES,FIRST.APPEARANCE,comic,PHYSICAL
666101,\/Jonathan_Dillon_(Earth-616),Public Identity,,Male Characters,Living Characters,4.0,Apr-97,marvel,"Blue Eyes , Brown Hair"
280850,\/John_(Mutant)_(Earth-616),Public Identity,,Male Characters,Deceased Characters,,Oct-01,marvl,"Blue Eyes , Blond Hair"
129267,\/wiki\/Gene_LaBostrie_(New_Earth),Public Identity,Good Characters,Male Characters,Living Characters,15.0,"1987, September",DC comics,", Black Hair"
157368,\/wiki\/Reemuz_(New_Earth),Public Identity,Good Characters,Male Characters,Deceased Characters,15.0,"1992, September",DC,"Black Eyes ,"
16171,\/Aquon_(Earth-616),Secret Identity,Bad Characters,Male Characters,Living Characters,1.0,Jul-73,marvl,","
240298,\/wiki\/Rachel_Berkowitz_(New_Earth),Secret Identity,Bad Characters,Female Characters,Living Characters,4.0,"2002, April",DC,", Black Hair"


In [130]:
# Your code here