
In [1]:
library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.4.4     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors


# String Verbs in `R`

In [2]:
surveys <- read.csv('https://github.com/WSU-DataScience/DSCI_210_R_notebooks/raw/main/data/portal_data_joined.csv')
head(surveys)

Unnamed: 0_level_0,record_id,month,day,year,plot_id,species_id,sex,hindfoot_length,weight,genus,species,taxa,plot_type
Unnamed: 0_level_1,<int>,<int>,<int>,<int>,<int>,<chr>,<chr>,<int>,<int>,<chr>,<chr>,<chr>,<chr>
1,1,7,16,1977,2,NL,M,32.0,,Neotoma,albigula,Rodent,Control
2,72,8,19,1977,2,NL,M,31.0,,Neotoma,albigula,Rodent,Control
3,224,9,13,1977,2,NL,,,,Neotoma,albigula,Rodent,Control
4,266,10,16,1977,2,NL,,,,Neotoma,albigula,Rodent,Control
5,349,11,12,1977,2,NL,,,,Neotoma,albigula,Rodent,Control
6,363,11,12,1977,2,NL,,,,Neotoma,albigula,Rodent,Control


## String Transformations

1. **Substring extraction.** Extract a fixed-length substring,
2. **Replace.** Replace a substring with another, and
3. **Recode.** Translate each string to another.

### Pattern 1 - Substring extraction with `substr`

**Syntax:**
```{R}
...
%>% mutate(new_col = substr(old_col, start, stop))
...
```
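
Before applying the pattern to a data frame, it can help to try `substr` on a plain vector. The sketch below uses a made-up vector (`plot_types` is only an illustration, not the `surveys` column) to show that `start` and `stop` are 1-based and inclusive.

```r
# Toy vector standing in for a character column (values chosen for illustration)
plot_types <- c("Control", "Rodent Exclosure", "Spectab exclosure")

# `start` and `stop` are 1-based and inclusive, so 1 to 5 keeps five characters
first_five <- substr(plot_types, 1, 5)
print(first_five)  # "Contr" "Roden" "Spect"

# Asking for more characters than exist returns whatever is available
short <- substr("abc", 2, 10)
print(short)       # "bc"
```

Because `substr` is vectorized, the same call works unchanged on a whole column inside `mutate`.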

In [5]:
(surveys
 %>% select(plot_type) # Temp
 %>% mutate(plot_type_short = substr(plot_type, 1, 5))
 %>% head # Temp
 )

Unnamed: 0_level_0,plot_type,plot_type_short
Unnamed: 0_level_1,<chr>,<chr>
1,Control,Contr
2,Control,Contr
3,Control,Contr
4,Control,Contr
5,Control,Contr
6,Control,Contr


## <font color="red"> Exercise 8.4.1 - Creating a decade column using string functions.</font>

In a previous example, we used division and `floor` to create a `decade` column.  Another approach to this problem is to convert the years to strings and use `substr` to extract the decade.

**Tasks:** The basic process for converting to decade is

1. Use `as.character` to convert `year` to a string. [Inside `mutate`]
2. Use `substr` to extract the first three digits into a column named `first_three`. [Inside `mutate`]
3. Use `glue("{first_three}0")` to add on a zero. [Inside `mutate`]
4. Use `as.numeric` to convert back to a numeric column. [Inside `mutate`]

Be sure to *play around* with the columns first, THEN mutate.

In [6]:
library(glue)

In [None]:
# Your mutate pipe here

## Pattern 2 - `str_replace` from the `stringr` library

**Syntax:**
```{R}
...
%>% mutate(new_col = str_replace(old_col, pattern, replace))
...
```

#### Use `str_replace` to replace the first instance.

In [None]:
# Get help on str_replace
?str_replace

In [7]:
# Functional code
str_replace("a-b-c", '-', '/')

#### Use `str_replace_all` to replace all instances.

In [8]:
# functional code
str_replace_all("a-b-c", '-', '/')
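
The two calls differ only in how many matches they touch. A minimal side-by-side sketch, with the results stored in illustrative variable names:

```r
library(stringr)

first_only <- str_replace("a-b-c", "-", "/")      # only the first hyphen changes
every_one  <- str_replace_all("a-b-c", "-", "/")  # every hyphen changes

print(first_only)  # "a/b-c"
print(every_one)   # "a/b/c"
```

Note that `pattern` is treated as a regular expression by default, so characters such as `(`, `)`, `.`, and `|` need escaping when they should be matched literally.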

### Example - Cleaning up the percents

In [10]:
football_sleep <- read.csv("https://github.com/yardsale8/DSCI_210_R_notebooks/raw/main/data/Football_Sleep_data.csv")

football_sleep %>% head

Unnamed: 0_level_0,student,Class,Pct7,Avg.sleep.per.night,GPA.pre,GPA.post,Clean.pre,Clean.post,Back.pre,Back.post,Bench.pre,Bench.post
Unnamed: 0_level_1,<int>,<int>,<chr>,<dbl>,<dbl>,<dbl>,<int>,<int>,<int>,<int>,<int>,<int>
1,1,,,,3.21,3.6,270,,425,,315,
2,2,1.0,95%,7.6,3.1,3.3,265,290.0,385,430.0,255,275.0
3,3,2.0,90%,7.5,2.94,3.46,264,264.0,290,425.0,290,290.0
4,4,2.0,25%,6.0,2.57,2.2,290,,450,,275,
5,5,1.0,44%,6.0,3.5,3.5,280,265.0,415,390.0,270,235.0
6,6,2.0,88%,7.0,2.64,1.53,253,253.0,405,415.0,305,325.0


In [11]:
pct7 <- football_sleep$Pct7
head(pct7)

#### (A) Try out `str_replace`

In [12]:
new <- pct7 %>% str_replace('%', '')
head(new)

#### (B) Switch to numeric with `as.numeric`

In [13]:
new <- pct7 %>% str_replace('%', '') %>% as.numeric
head(new)
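
Steps (A) and (B) can be checked on a tiny vector first. The values below are hypothetical stand-ins for the first few `Pct7` entries, assuming a missing percent is read in as an empty string.

```r
library(stringr)

# Hypothetical values mimicking the first few entries of `Pct7`
pct_toy <- c("", "95%", "90%")

stripped  <- str_replace(pct_toy, "%", "")  # "" "95" "90"
as_number <- as.numeric(stripped)           # NA 95 90

print(stripped)
print(as_number)
```

An empty string converts to `NA`, which is one plausible reason a row without a percent ends up with a blank numeric value.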

#### (C) Embed in a `mutate`

In [14]:
(football_sleep
 %>% mutate(percent = Pct7 %>% str_replace('%', '')  %>% as.numeric) # use the Pct7 column itself, not the global pct7 vector
 %>% head
)

Unnamed: 0_level_0,student,Class,Pct7,Avg.sleep.per.night,GPA.pre,GPA.post,Clean.pre,Clean.post,Back.pre,Back.post,Bench.pre,Bench.post,percent
Unnamed: 0_level_1,<int>,<int>,<chr>,<dbl>,<dbl>,<dbl>,<int>,<int>,<int>,<int>,<int>,<int>,<dbl>
1,1,,,,3.21,3.6,270,,425,,315,,
2,2,1.0,95%,7.6,3.1,3.3,265,290.0,385,430.0,255,275.0,95.0
3,3,2.0,90%,7.5,2.94,3.46,264,264.0,290,425.0,290,290.0,90.0
4,4,2.0,25%,6.0,2.57,2.2,290,,450,,275,,25.0
5,5,1.0,44%,6.0,3.5,3.5,280,265.0,415,390.0,270,235.0,44.0
6,6,2.0,88%,7.0,2.64,1.53,253,253.0,405,415.0,305,325.0,88.0


#### (D) Make the code easier to read

In [15]:
(football_sleep
 %>% mutate(percent = Pct7
                      %>% str_replace('%', '')
                      %>% as.numeric,
           )
 %>% head
)

Unnamed: 0_level_0,student,Class,Pct7,Avg.sleep.per.night,GPA.pre,GPA.post,Clean.pre,Clean.post,Back.pre,Back.post,Bench.pre,Bench.post,percent
Unnamed: 0_level_1,<int>,<int>,<chr>,<dbl>,<dbl>,<dbl>,<int>,<int>,<int>,<int>,<int>,<int>,<dbl>
1,1,,,,3.21,3.6,270,,425,,315,,
2,2,1.0,95%,7.6,3.1,3.3,265,290.0,385,430.0,255,275.0,95.0
3,3,2.0,90%,7.5,2.94,3.46,264,264.0,290,425.0,290,290.0,90.0
4,4,2.0,25%,6.0,2.57,2.2,290,,450,,275,,25.0
5,5,1.0,44%,6.0,3.5,3.5,280,265.0,415,390.0,270,235.0,44.0
6,6,2.0,88%,7.0,2.64,1.53,253,253.0,405,415.0,305,325.0,88.0


#### (E) Also compute the fraction

In [16]:
(football_sleep
 %>% mutate(percent = Pct7
                      %>% str_replace('%', '')
                      %>% as.numeric,
            fraction = percent/100,
           )
 %>% head
)

Unnamed: 0_level_0,student,Class,Pct7,Avg.sleep.per.night,GPA.pre,GPA.post,Clean.pre,Clean.post,Back.pre,Back.post,Bench.pre,Bench.post,percent,fraction
Unnamed: 0_level_1,<int>,<int>,<chr>,<dbl>,<dbl>,<dbl>,<int>,<int>,<int>,<int>,<int>,<int>,<dbl>,<dbl>
1,1,,,,3.21,3.6,270,,425,,315,,,
2,2,1.0,95%,7.6,3.1,3.3,265,290.0,385,430.0,255,275.0,95.0,0.95
3,3,2.0,90%,7.5,2.94,3.46,264,264.0,290,425.0,290,290.0,90.0,0.9
4,4,2.0,25%,6.0,2.57,2.2,290,,450,,275,,25.0,0.25
5,5,1.0,44%,6.0,3.5,3.5,280,265.0,415,390.0,270,235.0,44.0,0.44
6,6,2.0,88%,7.0,2.64,1.53,253,253.0,405,415.0,305,325.0,88.0,0.88


### Making piped code readable

<img src="https://github.com/yardsale8/DSCI_210_R_notebooks/blob/99bd4f4d10de6abdee409a5ad62d83abf1cde832/img/more_readable_code%20copy.png?raw=true" width="700">

## <font color="red"> Exercise 6.2.3 - Replacing `-` with `_`</font>

Use `str_replace` (or `str_replace_all`, if a value can contain more than one hyphen) to replace the hyphens in the `plot_type` column with underscores.

In [None]:
plot_type <- surveys$plot_type
plot_type %>% unique

In [None]:
# Your code here

### Pattern 3 - Recoding a character column


**Syntax**

```{r}
recode(column,
      `old value 1` = "new string 1",
      `old value 2` = "new string 2",
      ...)
```

In [17]:
(surveys
%>% mutate(month_name = recode(month,
                               `1` = "Jan",
                               `2` = "Feb",
                               `3` = "Mar"),
          )
 %>% head(10)
)

[1m[22m[36mℹ[39m In argument: `month_name = recode(month, `1` = "Jan", `2` = "Feb", `3` =
  "Mar")`.
[1m[22m[33m![39m Unreplaced values treated as NA as `.x` is not compatible.
Please specify replacements exhaustively or supply `.default`.”


Unnamed: 0_level_0,record_id,month,day,year,plot_id,species_id,sex,hindfoot_length,weight,genus,species,taxa,plot_type,month_name
Unnamed: 0_level_1,<int>,<int>,<int>,<int>,<int>,<chr>,<chr>,<int>,<int>,<chr>,<chr>,<chr>,<chr>,<chr>
1,1,7,16,1977,2,NL,M,32.0,,Neotoma,albigula,Rodent,Control,
2,72,8,19,1977,2,NL,M,31.0,,Neotoma,albigula,Rodent,Control,
3,224,9,13,1977,2,NL,,,,Neotoma,albigula,Rodent,Control,
4,266,10,16,1977,2,NL,,,,Neotoma,albigula,Rodent,Control,
5,349,11,12,1977,2,NL,,,,Neotoma,albigula,Rodent,Control,
6,363,11,12,1977,2,NL,,,,Neotoma,albigula,Rodent,Control,
7,435,12,10,1977,2,NL,,,,Neotoma,albigula,Rodent,Control,
8,506,1,8,1978,2,NL,,,,Neotoma,albigula,Rodent,Control,Jan
9,588,2,18,1978,2,NL,M,,218.0,Neotoma,albigula,Rodent,Control,Feb
10,661,3,11,1978,2,NL,,,,Neotoma,albigula,Rodent,Control,Mar


#### Adding a `.default` option

In [18]:
(surveys
%>% mutate(month_name = recode(month,
                               `1` = "Jan",
                               `2` = "Feb",
                               `3` = "Mar",
                               .default = "Other"),
         )
 %>% head(10)
)

Unnamed: 0_level_0,record_id,month,day,year,plot_id,species_id,sex,hindfoot_length,weight,genus,species,taxa,plot_type,month_name
Unnamed: 0_level_1,<int>,<int>,<int>,<int>,<int>,<chr>,<chr>,<int>,<int>,<chr>,<chr>,<chr>,<chr>,<chr>
1,1,7,16,1977,2,NL,M,32.0,,Neotoma,albigula,Rodent,Control,Other
2,72,8,19,1977,2,NL,M,31.0,,Neotoma,albigula,Rodent,Control,Other
3,224,9,13,1977,2,NL,,,,Neotoma,albigula,Rodent,Control,Other
4,266,10,16,1977,2,NL,,,,Neotoma,albigula,Rodent,Control,Other
5,349,11,12,1977,2,NL,,,,Neotoma,albigula,Rodent,Control,Other
6,363,11,12,1977,2,NL,,,,Neotoma,albigula,Rodent,Control,Other
7,435,12,10,1977,2,NL,,,,Neotoma,albigula,Rodent,Control,Other
8,506,1,8,1978,2,NL,,,,Neotoma,albigula,Rodent,Control,Jan
9,588,2,18,1978,2,NL,M,,218.0,Neotoma,albigula,Rodent,Control,Feb
10,661,3,11,1978,2,NL,,,,Neotoma,albigula,Rodent,Control,Mar
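
The backticks are needed because the old values are numbers, and a bare `1 = "Jan"` is not a valid argument name. The sketch below repeats the mapping on a toy vector (`months_toy` is made up for illustration) and, as an alternative that is not part of the original notebook, shows `case_match()` from dplyr 1.1.0 or later, which expresses the same mapping without backticks.

```r
library(dplyr)

months_toy <- c(7, 1, 2, 3)  # toy stand-in for the `month` column

# recode(): old values are argument names, so numbers need backticks
with_recode <- recode(months_toy,
                      `1` = "Jan", `2` = "Feb", `3` = "Mar",
                      .default = "Other")

# case_match(): old value ~ new value, no backticks required
with_case_match <- case_match(months_toy,
                              1 ~ "Jan", 2 ~ "Feb", 3 ~ "Mar",
                              .default = "Other")

print(with_recode)      # "Other" "Jan" "Feb" "Mar"
print(with_case_match)  # "Other" "Jan" "Feb" "Mar"
```

For a full set of month abbreviations, base R also provides the built-in vector `month.abb`, which can be indexed by month number.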


## Pattern 3 - Using `str_detect` to determine if a string CONTAINS a value.

To apply the string verb CONTAINS, we use the `str_detect` function from the `stringr` library (in `tidyverse`).

In [19]:
sex <- surveys$sex
sex %>% unique
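
`str_detect` returns one logical value per string, which is what makes it usable inside `filter` and `ifelse`. A minimal sketch on a made-up vector (`sex_toy` is an illustration, including an explicit `NA` that may or may not occur in the real column):

```r
library(stringr)

sex_toy <- c("M", "F", "", NA)

has_m <- str_detect(sex_toy, "M")
print(has_m)  # TRUE FALSE FALSE NA
```

An empty string simply fails to match, while a missing value stays missing.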

### Use Case 1 - Using `str_detect` in a `filter`

In [20]:
(surveys
 %>% filter(str_detect(sex, 'M'))
 %>% head
)

Unnamed: 0_level_0,record_id,month,day,year,plot_id,species_id,sex,hindfoot_length,weight,genus,species,taxa,plot_type
Unnamed: 0_level_1,<int>,<int>,<int>,<int>,<int>,<chr>,<chr>,<int>,<int>,<chr>,<chr>,<chr>,<chr>
1,1,7,16,1977,2,NL,M,32.0,,Neotoma,albigula,Rodent,Control
2,72,8,19,1977,2,NL,M,31.0,,Neotoma,albigula,Rodent,Control
3,588,2,18,1978,2,NL,M,,218.0,Neotoma,albigula,Rodent,Control
4,845,5,6,1978,2,NL,M,32.0,204.0,Neotoma,albigula,Rodent,Control
5,990,6,9,1978,2,NL,M,,200.0,Neotoma,albigula,Rodent,Control
6,1164,8,5,1978,2,NL,M,34.0,199.0,Neotoma,albigula,Rodent,Control


### Use Case 2 - Use with `ifelse` to create an indicator column

In [21]:
(surveys
 %>% mutate(male = ifelse(str_detect(sex, 'M'), 1, 0))
 %>% head
)

Unnamed: 0_level_0,record_id,month,day,year,plot_id,species_id,sex,hindfoot_length,weight,genus,species,taxa,plot_type,male
Unnamed: 0_level_1,<int>,<int>,<int>,<int>,<int>,<chr>,<chr>,<int>,<int>,<chr>,<chr>,<chr>,<chr>,<dbl>
1,1,7,16,1977,2,NL,M,32.0,,Neotoma,albigula,Rodent,Control,1
2,72,8,19,1977,2,NL,M,31.0,,Neotoma,albigula,Rodent,Control,1
3,224,9,13,1977,2,NL,,,,Neotoma,albigula,Rodent,Control,0
4,266,10,16,1977,2,NL,,,,Neotoma,albigula,Rodent,Control,0
5,349,11,12,1977,2,NL,,,,Neotoma,albigula,Rodent,Control,0
6,363,11,12,1977,2,NL,,,,Neotoma,albigula,Rodent,Control,0
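
Notice that rows with a blank `sex` receive a 0, the same code as females. If blanks should instead be treated as unknown, one possible variant (an assumption about the desired coding, not part of the original notebook) is to convert them to `NA` with `na_if` before detecting. The sketch uses a toy vector.

```r
library(dplyr)
library(stringr)

sex_toy <- c("M", "F", "")

# Same logic as the cell above: blanks are coded 0, just like "F"
male_01 <- ifelse(str_detect(sex_toy, "M"), 1, 0)

# Variant: turn "" into NA first, so unknown sex stays missing
male_na <- ifelse(str_detect(na_if(sex_toy, ""), "M"), 1, 0)

print(male_01)  # 1 0 0
print(male_na)  # 1 0 NA
```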


## Pattern 4 - Split & Get

We can use the `str_split_i(string, pattern, i)` function to split `string` on `pattern` and then get element `i`. Note that the pattern itself is dropped from the pieces.

In [22]:
( surveys
 %>% select(genus)
 %>% filter(genus == 'Neotoma')
  %>% mutate(part1 = str_split_i(genus, 't', 1),
             part2 = str_split_i(genus, 't', 2),
            )
  %>% head
)

Unnamed: 0_level_0,genus,part1,part2
Unnamed: 0_level_1,<chr>,<chr>,<chr>
1,Neotoma,Neo,oma
2,Neotoma,Neo,oma
3,Neotoma,Neo,oma
4,Neotoma,Neo,oma
5,Neotoma,Neo,oma
6,Neotoma,Neo,oma
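
To see the indexing on its own, here is a small sketch on single strings. It assumes stringr 1.5.0 or later, where `str_split_i` is available; a negative `i` counts pieces from the end.

```r
library(stringr)

first_piece  <- str_split_i("Neotoma", "t", 1)  # piece before the "t"
second_piece <- str_split_i("Neotoma", "t", 2)  # piece after the "t"
last_piece   <- str_split_i("a-b-c", "-", -1)   # -1 means the last piece

print(first_piece)   # "Neo"
print(second_piece)  # "oma"
print(last_piece)    # "c"
```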


# Activity 6.3 - Cleaning the comic data in `R`

In today's activity, we will return to the comics data, this time cleaning it up in `R`.

#### Problem 0 - Load the `Comic_Data_Messy.csv` file, which is located in the `data` folder in D2L.

In [None]:
# Your code here

#### Problem 1 - Clean up the `comic` column
**Tasks:**

1. Use `unique` to explore the `comic` column and identify problems.
2. Use `recode` to clean up the `comic` column

In [None]:
# Your code here

### Problem 1 - Load a fresh copy of the file.

In [None]:
# Your code here.

## Problem 2 - Extract the year

**Task 1.** Use `ifelse`, `str_detect`, and `str_split_i` to extract the year from the `FIRST.APPEARANCE` column into a column called `year.raw`. The code will have the basic form shown below.

```
mutate(year.raw = ifelse(CONTAINS a comma, SPLIT on comma and GET 1, SPLIT on hyphen and GET 2))
```

**Task 2.** Suppose that we want to keep only the last two digits of the year. Use `ifelse` to transform `year.raw`, extracting these last two digits into a column called `year`.

**Task 3.** Drop the `year.raw` column.

In [None]:
# Your code here

## Problem 3 - Extract and recode the month

Next, we will extract the month of the `FIRST.APPEARANCE`.

**Task 1.** Use a pattern similar to the one in the last problem to extract the month.

**Task 2.** Use `recode` to recode the months to their three-letter abbreviations.

In [None]:
# Your code here

## Case Study - Cleaning and extracting the information from `urlslug`

In a previous activity, I shared the code for cleaning up the `urlslug` column.  I will apply the exact same process, but this time using `R`.

In [None]:
(comics
#  %>% select(urlslug) # Temp select
 %>% mutate(has_group = ifelse(str_detect(urlslug, '\\)_\\('), 1, 0), # parentheses are special and need \\ in front
            urlslug2 = urlslug
                        %>% str_replace("\\\\/wiki", '') # "\\\\" matches one literal backslash; the / itself needs no escaping
                        %>% str_replace("\\\\/", "")
                        %>% str_replace("\\)_\\(", "|")
                        %>% str_replace("_\\(", "|")
                        %>% str_replace('\\)', '')
                        %>% str_replace('_', ' '),
           name = str_split_i(urlslug2, '\\|', 1),
           group_or_alias = ifelse(has_group, str_split_i(urlslug2, '\\|', 2), NA),
           universe = str_split_i(urlslug2, '\\|', -1),
           )
 %>% select(-urlslug, -urlslug2)
 %>% head(10)
)

Unnamed: 0_level_0,page_id,ID,ALIGN,SEX,ALIVE,APPEARANCES,FIRST.APPEARANCE,comic,PHYSICAL,has_group,name,group_or_alias,universe
Unnamed: 0_level_1,<int>,<chr>,<chr>,<chr>,<chr>,<int>,<chr>,<chr>,<chr>,<dbl>,<chr>,<chr>,<chr>
1,666101,Public Identity,,Male Characters,Living Characters,4.0,Apr-97,marvel,"Blue Eyes , Brown Hair",0,Jonathan Dillon,,Earth-616
2,280850,Public Identity,,Male Characters,Deceased Characters,,Oct-01,marvl,"Blue Eyes , Blond Hair",1,John,Mutant,Earth-616
3,129267,Public Identity,Good Characters,Male Characters,Living Characters,15.0,"1987, September",DC comics,", Black Hair",0,Gene LaBostrie,,New_Earth
4,157368,Public Identity,Good Characters,Male Characters,Deceased Characters,15.0,"1992, September",DC,"Black Eyes ,",0,Reemuz,,New Earth
5,16171,Secret Identity,Bad Characters,Male Characters,Living Characters,1.0,Jul-73,marvl,",",0,Aquon,,Earth-616
6,240298,Secret Identity,Bad Characters,Female Characters,Living Characters,4.0,"2002, April",DC,", Black Hair",0,Rachel Berkowitz,,New_Earth
7,186659,,Bad Characters,Male Characters,Living Characters,,"1999, May",DC,",",0,Golem II,,New_Earth
8,288877,Public Identity,Bad Characters,Male Characters,Living Characters,1.0,Oct-86,Marvel Comics,",",0,Burka,,Earth-616
9,402328,,Bad Characters,Male Characters,Living Characters,14.0,Jul-06,marvelcomics,"Blue Eyes , Blond Hair",0,Max Lohmer,,Earth-616
10,22342,Public Identity,Neutral Characters,Male Characters,Deceased Characters,7.0,Aug-89,marvelcomics,", Grey Hair",0,Paul Harker,,Earth-616
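
The double backslashes in the case study exist because `stringr` treats patterns as regular expressions, where `(`, `)`, and `|` have special meaning. As a hedged aside, `fixed()` from `stringr` offers another way to match such characters literally. The strings below are hypothetical fragments made up for illustration; the real `urlslug` values may differ.

```r
library(stringr)

# Hypothetical slug fragment, not taken from the data
piece <- "John_(Mutant)_(Earth-616)"

escaped <- str_detect(piece, "\\)_\\(")     # regex with escaped parentheses
literal <- str_detect(piece, fixed(")_("))  # same test, pattern taken literally

# Hypothetical cleaned value with | as the separator
parts <- "John|Mutant|Earth-616"

by_regex <- str_split_i(parts, "\\|", 2)      # escaped pipe
by_fixed <- str_split_i(parts, fixed("|"), 2) # literal pipe

print(c(escaped, literal))    # TRUE TRUE
print(c(by_regex, by_fixed))  # "Mutant" "Mutant"
```

Both styles give the same result here; escaping keeps the full power of regular expressions, while `fixed()` trades that power for patterns that are easier to read.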


#### Problem 4 - Explore the case study code.

**Questions.**

1. What is the purpose of the first (now commented out) `select`?
2. Use comments to step through the creation of the `urlslug2` column and explain the purpose of each step.
3. Discuss how the 2-3 pieces of information were split into three columns.

<font color="orange">
Your answers here
</font>