# Lab 9: Regular Expressions and Factors


# Preliminaries



# Regular Expressions


In [None]:
library(tidyverse)

## Regular Expressions are Hard

-  Even seasoned programmers often struggle with regular expressions; they require a lot of practice to master.
-  Entire books have been written on the topic. The three below, each with a different animal on the cover, all come from a single publisher.


![owl](https://raw.githubusercontent.com/dereklhansen/stats306_lab/master/lab8/owl.jpg)
![weasel](https://raw.githubusercontent.com/dereklhansen/stats306_lab/master/lab8/weasel.jpg)
![bat](https://raw.githubusercontent.com/dereklhansen/stats306_lab/master/lab8/bat.jpg)


-  This website will let you test out regular expressions on the fly: https://www.regexpal.com/
-  Be sure to check the "multiline" box under "flags"

![](https://raw.githubusercontent.com/dereklhansen/stats306_lab/master/lab8/regex_pal2.png)

## RegExr
-  RegExr is a more powerful website that color-codes expressions as you build them: https://regexr.com/

![](https://raw.githubusercontent.com/dereklhansen/stats306_lab/master/lab8/regexr.png)

## Using Regular Expressions in dplyr
- Regular expressions aren't just useful for text data. They are also useful when data comes in a wide format with many columns that are hard to reconcile.
- Here we have a table of crime data by age, sex, and race from the ICPSR at the University of Michigan (original source here: https://www.icpsr.umich.edu/web/NACJD/studies/36115)
- As you'll see below, each row corresponds to a particular agency, month, and offense. The subsequent columns give counts either by gender and age (e.g. "F20" is the number of reports for females age 20) or by racial group, broken up by juvenile status and race ("AW" is Adult White).
- While high-quality, this data is not very tidy; it is a good example of how data actually comes in the real world.


In [None]:
icpsr_raw <- read_csv("https://raw.githubusercontent.com/dereklhansen/stats306_lab/master/lab8/icpsr_raw.csv")

In [None]:
head(icpsr_raw)
names(icpsr_raw)

In [None]:
glimpse(icpsr_raw)

- We want to form a table of the crime counts by gender and age group. We use regular expressions to select the columns that match the gender-age format we saw.
- We can print out our column names, then copy-paste them into RegExr to test a pattern.

In [None]:
message(paste0(names(icpsr_raw), collapse="\n"))

![](https://raw.githubusercontent.com/dereklhansen/stats306_lab/master/lab8/gender_age2.png)

-  We then use the ```matches``` function within ```select```. ```matches``` will keep all columns whose names match our regular expression.
-  ```matches``` works with tidyr functions as well, such as ```gather```.
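To see what ```matches``` keeps without loading the full dataset, here is a minimal sketch on a made-up one-row table. The table ```toy``` and its values are hypothetical; only the column naming scheme mimics the ICPSR data. Note that ```matches``` ignores case by default (```ignore.case = TRUE```).

```r
library(dplyr)

# Hypothetical toy table that mimics the ICPSR column naming scheme
toy <- tibble(ORI = "A1", MONTH = 1, M20 = 3, F0_9 = 0, AW = 5)

# Keep only columns whose names start with M or F followed by a digit.
# MONTH starts with "M" but is followed by a letter, so it is dropped.
selected <- toy %>% select(matches("^(M|F)[0-9]+"))
print(names(selected))
```

The anchor ```^``` matters here: without it, any column name containing an "M" or "F" followed by a digit somewhere in the middle would also be kept.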

In [None]:
icpsr_raw %>% filter(CONTENTS == 3) %>% 
  select(originating_agency = ORI,
         month = MONTH,
         offense = OFFENSE,
         matches("^(M|F)[0-9]+")) %>% print()

In [None]:
icpsr_raw %>%
  filter(CONTENTS == 3) %>%
  select(originating_agency = ORI,
         month = MONTH,
         offense = OFFENSE,
         matches("^(M|F)[0-9]+")) %>%
  gather(matches("^(M|F)[0-9]+"), 
         key = "gender_age",
         value = "count") %>% print()

In [None]:
icpsr_gender_age <- icpsr_raw %>%
  filter(CONTENTS == 3) %>%
  select(originating_agency = ORI,
         month = MONTH,
         offense = OFFENSE,
         matches("^(M|F)[0-9]+")) %>%
  gather(matches("^(M|F)[0-9]+"), 
         key = "gender_age",
         value = "count") %>%
  mutate(count = ifelse(count == 99999, 0, count)) %>%
  mutate(count = ifelse(count == 99998, NA_real_, count)) %>% 
  separate(gender_age, into = c("gender", "age"), sep=1)

In [None]:
print(icpsr_gender_age)

- In this simple case, we could just use the tidyr ```separate``` function with a character position (```sep=1```) to split after the first character.
- The ```extract``` function from ```tidyr``` is more powerful, as we can have it search for particular patterns.

In [None]:
icpsr_gender_age <- icpsr_raw %>%
  filter(CONTENTS == 3) %>%
  select(originating_agency = ORI,
         month = MONTH,
         offense = OFFENSE,
         matches("^(M|F)[0-9]+")) %>%
  gather(matches("^(M|F)[0-9]+"), 
         key = "gender_age",
         value = "count") %>%
  mutate(count = ifelse(count == 99999, 0, count)) %>%
  mutate(count = ifelse(count == 99998, NA_real_, count)) %>%
  extract(gender_age, into = c("gender", "age"), regex=("(M|F)([0-9]+)"))
print(icpsr_gender_age)

-  What happened? Try looking back at the pattern we tested on the RegExr website.
-  Our regular expression was good enough to find the right columns, but "[0-9]+" stops at the underscore, so it does not capture the full age range in names such as "M0_9".
-  This is why regular expressions are tricky to master!
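The difference between the two patterns can be checked on a few strings directly. This is a small sketch using ```str_match``` from stringr on a hypothetical vector of column names (not the real ICPSR names); it illustrates the regular expression behavior rather than reproducing the ```extract``` call itself.

```r
library(stringr)

# Hypothetical column names in the gender-age format
col_names <- c("M20", "F0_9", "M10_12")

# Column 3 of the str_match result holds the second capture group (the age).
# Digits only: the match stops at the underscore.
age_digits_only <- str_match(col_names, "(M|F)([0-9]+)")[, 3]

# Allowing an optional "_digits" suffix captures the whole range.
age_with_range <- str_match(col_names, "(M|F)([0-9]+(?:_[0-9]+)?)")[, 3]

print(age_digits_only)
print(age_with_range)
```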

In [None]:
icpsr_gender_age <- icpsr_raw %>%
  filter(CONTENTS == 3) %>%
  select(originating_agency = ORI,
         month = MONTH,
         offense = OFFENSE,
         matches("^(M|F)[0-9]+")) %>%
  gather(matches("^(M|F)[0-9]+"), 
         key = "gender_age",
         value = "count") %>%
  mutate(count = ifelse(count == 99999, 0, count)) %>%
  mutate(count = ifelse(count == 99998, NA_real_, count)) %>%
  extract(gender_age, into = c("gender", "age"), regex=("(M|F)([0-9]+$|[0-9]+_[0-9]+)"))
print(icpsr_gender_age)

**Exercise:**

Instead of keeping age as a single character column, split the range into ```age_min``` and ```age_max```. If there is just one age, set ```age_min = age_max```.

**Solution:**

You could just do ```separate``` on the "age" column. Or you can do it directly, all with regular expressions!

We need to make sure that we handle all cases properly. We use three capture groups to parse gender_age:
-  ```"(M|F)"``` matches either male or female
-  ```"([0-9]+)"``` matches "9", "20", etc.
-  ```"(_[0-9]+|)"``` matches ```"_10"```, ```"_100"```, **or nothing**. If we leave out the empty alternative, strings without a range (such as "F20") fail to match the pattern and return "NA" for all values.
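One way to put the three groups together is sketched below on a hypothetical three-row table (```toy``` and its counts are made up for illustration). The cleanup step after ```extract``` is an assumption about how to finish the exercise: it strips the leading underscore from the third group and falls back to ```age_min``` when no range was present.

```r
library(dplyr)
library(tidyr)
library(stringr)

# Hypothetical stand-in for the gathered ICPSR table
toy <- tibble(gender_age = c("M20", "F0_9", "M10_12"), count = c(3, 0, 7))

solution <- toy %>%
  extract(gender_age,
          into = c("gender", "age_min", "age_max"),
          regex = "(M|F)([0-9]+)(_[0-9]+|)") %>%
  mutate(
    # Third group is "_9", "_12", or empty; drop the underscore
    age_max = str_remove(age_max, "_"),
    # A single age has no upper bound captured, so reuse age_min
    age_max = ifelse(is.na(age_max) | age_max == "", age_min, age_max),
    age_min = as.integer(age_min),
    age_max = as.integer(age_max)
  )
print(solution)
```

The same pipeline steps can be appended to the full ```icpsr_raw``` pipeline in place of the earlier ```extract``` call.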

# Factors

In [None]:
# Convert the categorical columns to factors
icpsr_gender_age_fctr <- icpsr_gender_age %>%
    mutate(originating_agency = factor(originating_agency),
           offense = factor(offense),
           gender = factor(gender),
           age = factor(age))
print(icpsr_gender_age_fctr)

Suppose we want to make the age groupings clearer (i.e. say "0 to 9" instead of "0_9"). The ```fct_recode``` function is useful for making quick changes.

In [None]:
mutate(icpsr_gender_age_fctr, age=fct_recode(age, "0 to 9"="0_9", "10 to 12"="10_12", "13 to 14"="13_14")) %>%
    print()

However, in our case, it would be tedious to make this same change for every group. Instead, we can manipulate the ```levels``` directly, which lets us use regular expressions.

In [None]:
levels(icpsr_gender_age_fctr$age)

In [None]:
levels(icpsr_gender_age_fctr$age) <- str_replace(levels(icpsr_gender_age_fctr$age), "_", " to ")

In [None]:
print(icpsr_gender_age_fctr)
print(levels(icpsr_gender_age_fctr$age))
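As an alternative to assigning to ```levels``` in place, forcats also provides ```fct_relabel```, which applies a function to every level and returns a new factor. The sketch below uses a small hypothetical factor rather than the ICPSR table.

```r
library(forcats)
library(stringr)

# Hypothetical age factor with the same underscore format
age <- factor(c("0_9", "20", "10_12", "0_9"))

# Apply the replacement to every level; the original factor is left unchanged
age_relabelled <- fct_relabel(age, function(lv) str_replace(lv, "_", " to "))

print(levels(age_relabelled))
```

Because it returns a new factor, ```fct_relabel``` fits naturally inside a ```mutate``` call.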

For our last example, we revisit the ```gss_cat``` dataset. We want to extract different features of ```partyid```.

In [None]:
print(gss_cat)

In [None]:
print(levels(gss_cat$partyid))

Regular expressions don't need to be complicated to be useful! Even simple ones can save you a lot of time.

In [None]:
gss_splitup <- mutate(gss_cat, 
       party    = str_extract(partyid, "(Ind|rep|dem|Other party|No answer|Don't know)"),
       leaning  = str_extract(partyid, "(rep|dem)"),
       strength_of_leaning = str_extract(partyid, "(Strong|Not str|Ind,near)")
      ) 

In [None]:
print(select(gss_splitup, partyid, party, leaning, strength_of_leaning))
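Two behaviors of ```str_extract``` explain the columns above: it returns only the first (leftmost) match in each string, and it returns ```NA``` when nothing matches. A short sketch on three of the ```partyid``` labels, typed in by hand here as a plain character vector:

```r
library(stringr)

# Three partyid labels, written out by hand for illustration
labels <- c("Ind,near rep", "Strong democrat", "Independent")

# Leftmost match wins: "Ind,near rep" yields "Ind", not "rep"
party <- str_extract(labels, "(Ind|rep|dem)")

# No "rep" or "dem" in "Independent", so the result is NA
leaning <- str_extract(labels, "(rep|dem)")

print(party)
print(leaning)
```

This is why "Ind,near rep" is counted under the "Ind" party while still having a non-missing leaning.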

In [None]:
ggplot(gss_splitup) + geom_bar(aes(x=party, fill=paste0(leaning))) +
    scale_fill_manual(values = c(dem="blue", rep="red", `NA`="grey"))

In [None]:
ggplot(filter(gss_splitup, !is.na(leaning))) + geom_bar(aes(x=leaning, fill=strength_of_leaning))

In [None]:
ggplot(filter(gss_splitup, !is.na(leaning))) + geom_bar(aes(x=leaning, fill=strength_of_leaning), position="fill")