# Recoding categories in R

We have previously seen how variables can be created or recoded from existing variables using arithmetic operations (for example `df$newvar <- df$oldvar^2`).

Data sets often contain categorical variables that need to be recoded as well. Categories are often stored as strings, so renaming a category or combining categories means replacing one text value with another.

## Recoding categories with base R

It is possible to recode categories with base R operations. Recoding is done by locating the values that need to be replaced and then assigning the new category to them.
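The idea of locating values and then replacing them can be sketched on a small toy vector. The labels and the names `edu` and `edu_num` are made up for illustration and are not part of the ESS data.

```r
# Toy vector of category labels (hypothetical values, not from the ESS data)
edu <- c("primary", "secondary", "primary", NA, "tertiary")

# Start with an all-missing numeric vector of the same length
edu_num <- rep(NA_real_, length(edu))

# which() returns the positions where the condition is TRUE and drops NA
edu_num[which(edu == "primary")] <- 1
edu_num[which(edu == "secondary")] <- 2

# Values that were never matched ("tertiary" and NA) remain missing
edu_num
```

Wrapping the condition in `which()` means missing values in the original vector are skipped rather than used as indices.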

In the example below, a variable is created indicating the level of educational attainment recoded to ISCED ([International Standard Classification of Education](https://ec.europa.eu/eurostat/statistics-explained/index.php?title=International_Standard_Classification_of_Education_(ISCED)#Implementation_of_ISCED_2011_.28levels_of_education.29)).

The values are recoded using the following schema:

| edlvddk | ISCED |
|----|----|
| Folkeskole 6.-8. klasse | 1 |
| Folkeskole 9.-10. klasse | 2 |
| Gymnasielle uddannelser, studentereksamen, HF, HHX, HTX | 3 |
| Kort erhvervsuddannelse under 1-2 års varighed, F.eks. AMU Arbejdsmarkedsuddann | 3 |
| Faglig uddannelse (håndværk, handel, landbrug mv.), F.eks. Faglærte, Social- | 3 |
| Kort videregående uddannelse af op til 2-3 års varighed, F.eks. Erhvervsakadem | 5 |
| Mellemlang videregående uddannelse af 3-4 års varighed. Professionsbachelorer, | 6 |
| Universitetsbachelor. 1. del af kandidatuddannelse | 6 |
| Lang videregående uddannelse. Kandidatuddannelser af 5.-6. års varighed, F.eks | 7 |
| Licentiat | 7 |
| Forskeruddannelse. Ph.d., doktor | 8 |
| Other | NA |


In [6]:
# Read in the data

library(readr)

data <- read_csv("https://github.com/CALDISS-AAU/workshop_R-intro/raw/master/data/ESS2018DK_subset.csv")
head(data)

Parsed with column specification:
cols(
  idno = col_double(),
  netustm = col_double(),
  ppltrst = col_double(),
  vote = col_character(),
  prtvtddk = col_character(),
  lvpntyr = col_character(),
  tygrtr = col_character(),
  gndr = col_character(),
  yrbrn = col_double(),
  edlvddk = col_character(),
  eduyrs = col_double(),
  wkhct = col_double(),
  wkhtot = col_double(),
  grspnum = col_double(),
  frlgrsp = col_double(),
  inwtm = col_double()
)


idno,netustm,ppltrst,vote,prtvtddk,lvpntyr,tygrtr,gndr,yrbrn,edlvddk,eduyrs,wkhct,wkhtot,grspnum,frlgrsp,inwtm
110,180,8,Yes,Socialdemokratiet - The Social democrats,1968,Never too young,Male,1949,"Kort videregående uddannelse af op til 2-3 års varighed, F.eks. Erhvervsakadem",9,28,28,,,119
705,60,5,Yes,Det Konservative Folkeparti - Conservative People's Party,1976,67,Male,1958,"Kort videregående uddannelse af op til 2-3 års varighed, F.eks. Erhvervsakadem",22,37,45,,,55
1327,240,5,,,"Still in parental home, never left 2 months",,Male,2000,Folkeskole 9.-10. klasse,11,37,37,,,37
3760,300,7,Not eligible to vote,,"Still in parental home, never left 2 months",40,Male,2002,Folkeskole 9.-10. klasse,9,2,2,200.0,,43
4658,90,8,Yes,,1974,50,Female,1956,"Kort videregående uddannelse af op til 2-3 års varighed, F.eks. Erhvervsakadem",4,30,30,,,62
5816,90,7,Yes,SF Socialistisk Folkeparti - Socialist People's Party,1994,60,Male,1974,"Mellemlang videregående uddannelse af 3-4 års varighed. Professionsbachelorer,",35,37,37,37000.0,35000.0,61


In [15]:
# Create new empty variable (all values missing; NA_real_ makes the column numeric)

data$edlvisced <- NA_real_

# Specify values to replace and replace with ISCED

data[which(data$edlvddk == "Folkeskole 6.-8. klasse"), "edlvisced"] <- 1
data[which(data$edlvddk == "Folkeskole 9.-10. klasse"), "edlvisced"] <- 2
data[which(data$edlvddk == "Gymnasielle uddannelser, studentereksamen, HF, HHX, HTX"), "edlvisced"] <- 3
data[which(data$edlvddk == "Kort erhvervsuddannelse under 1-2 års varighed, F.eks. AMU Arbejdsmarkedsuddann"), "edlvisced"] <- 3
data[which(data$edlvddk == "Faglig uddannelse (håndværk, handel, landbrug mv.), F.eks. Faglærte, Social-"), "edlvisced"] <- 3
data[which(data$edlvddk == "Kort videregående uddannelse af op til 2-3 års varighed, F.eks. Erhvervsakadem"), "edlvisced"] <- 5
data[which(data$edlvddk == "Mellemlang videregående uddannelse af 3-4 års varighed. Professionsbachelorer,"), "edlvisced"] <- 6
data[which(data$edlvddk == "Universitetsbachelor. 1. del af kandidatuddannelse"), "edlvisced"] <- 6
data[which(data$edlvddk == "Lang videregående uddannelse. Kandidatuddannelser af 5.-6. års varighed, F.eks"), "edlvisced"] <- 7
data[which(data$edlvddk == "Licentiat"), "edlvisced"] <- 7
data[which(data$edlvddk == "Forskeruddannelse. Ph.d., doktor"), "edlvisced"] <- 8
data[which(data$edlvddk == "Other"), "edlvisced"] <- NA

In [18]:
head(data[, c('edlvddk', 'edlvisced')])

edlvddk,edlvisced
"Kort videregående uddannelse af op til 2-3 års varighed, F.eks. Erhvervsakadem",5
"Kort videregående uddannelse af op til 2-3 års varighed, F.eks. Erhvervsakadem",5
Folkeskole 9.-10. klasse,2
Folkeskole 9.-10. klasse,2
"Kort videregående uddannelse af op til 2-3 års varighed, F.eks. Erhvervsakadem",5
"Mellemlang videregående uddannelse af 3-4 års varighed. Professionsbachelorer,",6
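When many categories have to be recoded, repeating one assignment per category becomes verbose. One possible base R alternative, sketched here under the assumption that a named lookup vector suits the task, is to index the lookup by the original labels. The lookup below is deliberately shortened, and the names `isced_lookup` and the toy `edlvddk` vector are introduced only for this illustration.

```r
# Hypothetical, shortened lookup: names are the original labels,
# values are the ISCED levels from the schema
isced_lookup <- c(
  "Folkeskole 6.-8. klasse" = 1,
  "Folkeskole 9.-10. klasse" = 2,
  "Licentiat" = 7,
  "Forskeruddannelse. Ph.d., doktor" = 8
)

# Toy input vector standing in for data$edlvddk
edlvddk <- c("Licentiat", "Folkeskole 9.-10. klasse", "Other", NA)

# Index the lookup by label; labels not in the lookup (and NA) become NA
edlvisced <- unname(isced_lookup[edlvddk])
edlvisced
```

With the real data, the same indexing could be applied to `data$edlvddk` once the lookup contains every label from the schema.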


## Recoding categories with dplyr 

`dplyr` offers several functions for recoding. The three main ones are:
- `recode`: For replacing individual values with new values
- `if_else`: For recoding based on a single logical condition
- `case_when`: For recoding based on several logical conditions

To store the result as a variable in the data frame, use these functions inside `mutate`.

In [20]:
# Read in the data

library(readr)
library(dplyr)

data <- read_csv("https://github.com/CALDISS-AAU/workshop_R-intro/raw/master/data/ESS2018DK_subset.csv")


Attaching package: 'dplyr'

The following objects are masked from 'package:stats':

    filter, lag

The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union

Parsed with column specification:
cols(
  idno = col_double(),
  netustm = col_double(),
  ppltrst = col_double(),
  vote = col_character(),
  prtvtddk = col_character(),
  lvpntyr = col_character(),
  tygrtr = col_character(),
  gndr = col_character(),
  yrbrn = col_double(),
  edlvddk = col_character(),
  eduyrs = col_double(),
  wkhct = col_double(),
  wkhtot = col_double(),
  grspnum = col_double(),
  frlgrsp = col_double(),
  inwtm = col_double()
)


In [22]:
# Recoding two categories in edlvddk to ISCED labels (text values); all other values are kept as they are
data <- data %>%
    mutate(edlvisced = recode(edlvddk,
                              "Folkeskole 6.-8. klasse" = "Primary education",
                              "Folkeskole 9.-10. klasse" = "Lower secondary education"))

head(data)

idno,netustm,ppltrst,vote,prtvtddk,lvpntyr,tygrtr,gndr,yrbrn,edlvddk,eduyrs,wkhct,wkhtot,grspnum,frlgrsp,inwtm,edlvisced
110,180,8,Yes,Socialdemokratiet - The Social democrats,1968,Never too young,Male,1949,"Kort videregående uddannelse af op til 2-3 års varighed, F.eks. Erhvervsakadem",9,28,28,,,119,"Kort videregående uddannelse af op til 2-3 års varighed, F.eks. Erhvervsakadem"
705,60,5,Yes,Det Konservative Folkeparti - Conservative People's Party,1976,67,Male,1958,"Kort videregående uddannelse af op til 2-3 års varighed, F.eks. Erhvervsakadem",22,37,45,,,55,"Kort videregående uddannelse af op til 2-3 års varighed, F.eks. Erhvervsakadem"
1327,240,5,,,"Still in parental home, never left 2 months",,Male,2000,Folkeskole 9.-10. klasse,11,37,37,,,37,Lower secondary education
3760,300,7,Not eligible to vote,,"Still in parental home, never left 2 months",40,Male,2002,Folkeskole 9.-10. klasse,9,2,2,200.0,,43,Lower secondary education
4658,90,8,Yes,,1974,50,Female,1956,"Kort videregående uddannelse af op til 2-3 års varighed, F.eks. Erhvervsakadem",4,30,30,,,62,"Kort videregående uddannelse af op til 2-3 års varighed, F.eks. Erhvervsakadem"
5816,90,7,Yes,SF Socialistisk Folkeparti - Socialist People's Party,1994,60,Male,1974,"Mellemlang videregående uddannelse af 3-4 års varighed. Professionsbachelorer,",35,37,37,37000.0,35000.0,61,"Mellemlang videregående uddannelse af 3-4 års varighed. Professionsbachelorer,"


Using the `.default` argument, a new value can be set for all values that are not explicitly listed.

In [25]:
# Recoding edlvddk to two categories ("lower secondary or below" and "above lower secondary")
data <- data %>%
    mutate(edlvbin = recode(edlvddk,
                            "Folkeskole 6.-8. klasse" = "lower secondary or below",
                            "Folkeskole 9.-10. klasse" = "lower secondary or below",
                            .default = "above lower secondary"))

head(data)

# Can you see any problems with the code above?

idno,netustm,ppltrst,vote,prtvtddk,lvpntyr,tygrtr,gndr,yrbrn,edlvddk,eduyrs,wkhct,wkhtot,grspnum,frlgrsp,inwtm,edlvisced,edlvbin
110,180,8,Yes,Socialdemokratiet - The Social democrats,1968,Never too young,Male,1949,"Kort videregående uddannelse af op til 2-3 års varighed, F.eks. Erhvervsakadem",9,28,28,,,119,"Kort videregående uddannelse af op til 2-3 års varighed, F.eks. Erhvervsakadem",above lower secondary
705,60,5,Yes,Det Konservative Folkeparti - Conservative People's Party,1976,67,Male,1958,"Kort videregående uddannelse af op til 2-3 års varighed, F.eks. Erhvervsakadem",22,37,45,,,55,"Kort videregående uddannelse af op til 2-3 års varighed, F.eks. Erhvervsakadem",above lower secondary
1327,240,5,,,"Still in parental home, never left 2 months",,Male,2000,Folkeskole 9.-10. klasse,11,37,37,,,37,Lower secondary education,lower secondary or below
3760,300,7,Not eligible to vote,,"Still in parental home, never left 2 months",40,Male,2002,Folkeskole 9.-10. klasse,9,2,2,200.0,,43,Lower secondary education,lower secondary or below
4658,90,8,Yes,,1974,50,Female,1956,"Kort videregående uddannelse af op til 2-3 års varighed, F.eks. Erhvervsakadem",4,30,30,,,62,"Kort videregående uddannelse af op til 2-3 års varighed, F.eks. Erhvervsakadem",above lower secondary
5816,90,7,Yes,SF Socialistisk Folkeparti - Socialist People's Party,1994,60,Male,1974,"Mellemlang videregående uddannelse af 3-4 års varighed. Professionsbachelorer,",35,37,37,37000.0,35000.0,61,"Mellemlang videregående uddannelse af 3-4 års varighed. Professionsbachelorer,",above lower secondary
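How `recode` treats unmatched and missing values can be checked on a small toy vector. This is a minimal sketch: the values in `x` and the result names are made up for illustration, while `.default` and `.missing` are arguments of `dplyr::recode`.

```r
library(dplyr)

# Toy vector with one missing value (hypothetical, not from the ESS data)
x <- c("a", "b", "c", NA)

# Without .default, unmatched values are kept as they are
kept <- recode(x, "a" = "A")

# With .default, every unmatched non-missing value is replaced; NA stays NA
with_default <- recode(x, "a" = "A", .default = "other")

# .missing sets the value that NA is replaced with
with_missing <- recode(x, "a" = "A", .default = "other", .missing = "unknown")

kept
with_default
with_missing
```

Note that `.default` applies to every unmatched non-missing value, which is worth keeping in mind when some categories should not end up in the default group.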


Use `if_else` when recoding based on a single logical condition.

In [28]:
# Note: "Other" is also recoded to "Not PhD"; missing values (NA) in edlvddk stay NA
data <- data %>%
    mutate(phdornot = if_else(edlvddk == "Forskeruddannelse. Ph.d., doktor", "PhD", "Not PhD"))

head(data)

idno,netustm,ppltrst,vote,prtvtddk,lvpntyr,tygrtr,gndr,yrbrn,edlvddk,eduyrs,wkhct,wkhtot,grspnum,frlgrsp,inwtm,edlvisced,edlvbin,phdornot
110,180,8,Yes,Socialdemokratiet - The Social democrats,1968,Never too young,Male,1949,"Kort videregående uddannelse af op til 2-3 års varighed, F.eks. Erhvervsakadem",9,28,28,,,119,"Kort videregående uddannelse af op til 2-3 års varighed, F.eks. Erhvervsakadem",above lower secondary,Not PhD
705,60,5,Yes,Det Konservative Folkeparti - Conservative People's Party,1976,67,Male,1958,"Kort videregående uddannelse af op til 2-3 års varighed, F.eks. Erhvervsakadem",22,37,45,,,55,"Kort videregående uddannelse af op til 2-3 års varighed, F.eks. Erhvervsakadem",above lower secondary,Not PhD
1327,240,5,,,"Still in parental home, never left 2 months",,Male,2000,Folkeskole 9.-10. klasse,11,37,37,,,37,Lower secondary education,lower secondary or below,Not PhD
3760,300,7,Not eligible to vote,,"Still in parental home, never left 2 months",40,Male,2002,Folkeskole 9.-10. klasse,9,2,2,200.0,,43,Lower secondary education,lower secondary or below,Not PhD
4658,90,8,Yes,,1974,50,Female,1956,"Kort videregående uddannelse af op til 2-3 års varighed, F.eks. Erhvervsakadem",4,30,30,,,62,"Kort videregående uddannelse af op til 2-3 års varighed, F.eks. Erhvervsakadem",above lower secondary,Not PhD
5816,90,7,Yes,SF Socialistisk Folkeparti - Socialist People's Party,1994,60,Male,1974,"Mellemlang videregående uddannelse af 3-4 års varighed. Professionsbachelorer,",35,37,37,37000.0,35000.0,61,"Mellemlang videregående uddannelse af 3-4 års varighed. Professionsbachelorer,",above lower secondary,Not PhD
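The handling of missing values in `if_else` can be illustrated with a toy vector. This is a sketch only: the values in `x` and the label "Unknown" are assumptions made for the example, while `missing` is an argument of `dplyr::if_else`.

```r
library(dplyr)

# Toy vector with one missing value (hypothetical, not from the ESS data)
x <- c("PhD", "MA", NA)

# A comparison with NA yields NA, so the missing value stays missing
phd_default <- if_else(x == "PhD", "PhD", "Not PhD")

# The missing argument sets an explicit value for those cases
phd_explicit <- if_else(x == "PhD", "PhD", "Not PhD", missing = "Unknown")

phd_default
phd_explicit
```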


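Use `case_when` when recoding based on several logical conditions. The conditions are evaluated in order, and the first one that is `TRUE` determines the new value.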

In [30]:
# Recoding two categories in edlvddk to ISCED labels (text values) - same result as the first recode example

data <- data %>%
    mutate(edlvisced = case_when(
        edlvddk == "Folkeskole 6.-8. klasse" ~ "Primary education", 
        edlvddk == "Folkeskole 9.-10. klasse" ~ "Lower secondary education",
        TRUE ~ edlvddk)) #This line keeps remaining values as they are

head(data)

idno,netustm,ppltrst,vote,prtvtddk,lvpntyr,tygrtr,gndr,yrbrn,edlvddk,eduyrs,wkhct,wkhtot,grspnum,frlgrsp,inwtm,edlvisced,edlvbin,phdornot
110,180,8,Yes,Socialdemokratiet - The Social democrats,1968,Never too young,Male,1949,"Kort videregående uddannelse af op til 2-3 års varighed, F.eks. Erhvervsakadem",9,28,28,,,119,"Kort videregående uddannelse af op til 2-3 års varighed, F.eks. Erhvervsakadem",above lower secondary,Not PhD
705,60,5,Yes,Det Konservative Folkeparti - Conservative People's Party,1976,67,Male,1958,"Kort videregående uddannelse af op til 2-3 års varighed, F.eks. Erhvervsakadem",22,37,45,,,55,"Kort videregående uddannelse af op til 2-3 års varighed, F.eks. Erhvervsakadem",above lower secondary,Not PhD
1327,240,5,,,"Still in parental home, never left 2 months",,Male,2000,Folkeskole 9.-10. klasse,11,37,37,,,37,Lower secondary education,lower secondary or below,Not PhD
3760,300,7,Not eligible to vote,,"Still in parental home, never left 2 months",40,Male,2002,Folkeskole 9.-10. klasse,9,2,2,200.0,,43,Lower secondary education,lower secondary or below,Not PhD
4658,90,8,Yes,,1974,50,Female,1956,"Kort videregående uddannelse af op til 2-3 års varighed, F.eks. Erhvervsakadem",4,30,30,,,62,"Kort videregående uddannelse af op til 2-3 års varighed, F.eks. Erhvervsakadem",above lower secondary,Not PhD
5816,90,7,Yes,SF Socialistisk Folkeparti - Socialist People's Party,1994,60,Male,1974,"Mellemlang videregående uddannelse af 3-4 års varighed. Professionsbachelorer,",35,37,37,37000.0,35000.0,61,"Mellemlang videregående uddannelse af 3-4 års varighed. Professionsbachelorer,",above lower secondary,Not PhD
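Because `case_when` works with arbitrary logical conditions, it is not limited to matching text values. The sketch below groups a toy vector of years of education into bands; the cutoffs, labels and the name `edu_group` are hypothetical and chosen only to show that the order of the conditions matters.

```r
library(dplyr)

# Toy values standing in for the eduyrs variable
eduyrs <- c(4, 9, 12, 22, NA)

edu_group <- case_when(
  is.na(eduyrs) ~ NA_character_,   # keep missing values missing
  eduyrs < 10   ~ "under 10 years",
  eduyrs < 15   ~ "10-14 years",   # only reached when eduyrs >= 10
  TRUE          ~ "15 years or more"
)

edu_group
```

The second range does not need a lower bound, since values below 10 have already been caught by the preceding condition.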


## DISCUSSION: RECODING FUNCTIONS

- Now that you are familiar with the various ways of recoding with `dplyr`, how would you recode all values in `edlvddk` to the ISCED categories?

- Can you identify (possibly other) situations where `case_when` could be useful?