In [89]:
# Setup

library(tidyverse)
data <- read_csv("https://github.com/CALDISS-AAU/workshop_R-intro/raw/master/data/ESS2018DK_subset.csv")

Parsed with column specification:
cols(
  idno = col_double(),
  netustm = col_double(),
  ppltrst = col_double(),
  vote = col_character(),
  prtvtddk = col_character(),
  lvpntyr = col_character(),
  tygrtr = col_character(),
  gndr = col_character(),
  yrbrn = col_double(),
  edlvddk = col_character(),
  eduyrs = col_double(),
  wkhct = col_double(),
  wkhtot = col_double(),
  grspnum = col_double(),
  frlgrsp = col_double(),
  inwtm = col_double()
)


# Categorical variables (factors)

Categorical variables in R are typically stored as "factors".

Unlike many other statistical software packages, R does not let you refer to a category by an underlying numerical value. A factor does store integer codes internally, but these are an implementation detail: values in a factor are referred to by their category name.
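Strictly speaking, a factor does keep integer codes internally, but they only index the level labels and are not meant to be used for referring to values. A minimal sketch on a made-up toy vector (the name `gender` and its values are for illustration only):

```r
# Toy vector; the values are made up for illustration
gender <- factor(c("Male", "Female", "Male"))

# Internally: integer codes that index the (alphabetically sorted) levels
as.integer(gender)
levels(gender)

# In practice, values are selected by their label
gender == "Male"
```

The codes depend on the level order, so they change whenever the levels are reordered. The labels do not, which is why filtering and recoding should rely on the labels.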

Factors can cause problems on import, because some import functions convert text variables to factors by default. Base R's `read.csv()`, for example, did so until R 4.0.0, whereas `read_csv()` from the tidyverse keeps text as character. Automatic conversion gives you little control over which levels are created and how they are ordered.
It often makes more sense to recode the variables as factors yourself.
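As a rough sketch of that workflow in base R: import text as character, then convert explicitly. The inline CSV and the names `csv_text` and `raw` are made up for illustration and are not part of the workshop data.

```r
# Made-up inline CSV standing in for a real file
csv_text <- "id,vote\n1,Yes\n2,No\n3,Yes"

# stringsAsFactors = FALSE keeps text columns as character
# (the default since R 4.0.0; older versions defaulted to TRUE)
raw <- read.csv(text = csv_text, stringsAsFactors = FALSE)
class(raw$vote)

# Convert explicitly, choosing the levels and their order yourself
raw$vote <- factor(raw$vote, levels = c("Yes", "No"))
levels(raw$vote)
```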

Factors serve two purposes: in statistical models they tell R how a categorical variable should be treated (unordered or ordered), and in graphs they control the order in which categories are displayed.

## Creating factors

Character variables can be coerced directly to factors with `as.factor()`:

In [90]:
# Coerce as factor
data <- data %>%
    mutate(gndr = as.factor(gndr))

head(data)

idno,netustm,ppltrst,vote,prtvtddk,lvpntyr,tygrtr,gndr,yrbrn,edlvddk,eduyrs,wkhct,wkhtot,grspnum,frlgrsp,inwtm
110,180,8,Yes,Socialdemokratiet - The Social democrats,1968,Never too young,Male,1949,"Kort videregående uddannelse af op til 2-3 års varighed, F.eks. Erhvervsakadem",9,28,28,,,119
705,60,5,Yes,Det Konservative Folkeparti - Conservative People's Party,1976,67,Male,1958,"Kort videregående uddannelse af op til 2-3 års varighed, F.eks. Erhvervsakadem",22,37,45,,,55
1327,240,5,,,"Still in parental home, never left 2 months",,Male,2000,Folkeskole 9.-10. klasse,11,37,37,,,37
3760,300,7,Not eligible to vote,,"Still in parental home, never left 2 months",40,Male,2002,Folkeskole 9.-10. klasse,9,2,2,200.0,,43
4658,90,8,Yes,,1974,50,Female,1956,"Kort videregående uddannelse af op til 2-3 års varighed, F.eks. Erhvervsakadem",4,30,30,,,62
5816,90,7,Yes,SF Socialistisk Folkeparti - Socialist People's Party,1994,60,Male,1974,"Mellemlang videregående uddannelse af 3-4 års varighed. Professionsbachelorer,",35,37,37,37000.0,35000.0,61


Just inspecting the data shows no difference between the string version and the factor version of the variable, but isolating the variable reveals how it is now structured:

In [91]:
# Inspecting values and levels
unique(data$gndr)

Compared to strings, factors contain both the observed values *and* the set of possible values of the factor (the levels).

In [92]:
levels(data$gndr)
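The difference between a character vector and a factor can be seen on a small made-up example (the names `answers_chr` and `answers_fct` are for illustration only):

```r
# Made-up answers for illustration
answers_chr <- c("Yes", "No", "Yes")
answers_fct <- as.factor(answers_chr)

# A character vector has no levels attribute
levels(answers_chr)

# A factor carries its set of possible values, sorted alphabetically by default
levels(answers_fct)

# unique() returns the observed values in order of appearance,
# still as a factor with the levels attached
unique(answers_fct)
```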

### Ordered/unordered

A factor is unordered (nominally scaled) by default. This can be changed by using the `factor()` function and its `ordered = ` argument. Where `as.factor()` simply converts the string values to unordered categories, `factor()` lets you specify both the possible categories and whether or not they are ordered:

In [93]:
# Create categorical variable for contracted hours:

data <- data %>%
    mutate(wkhct_cat = case_when(
        wkhct == 37 ~ "37 hours",
        wkhct < 37 ~ "Less than 37 hours",
        wkhct > 37 ~ "More than 37 hours",
        TRUE ~ NA_character_ # specifies the type of missing (character missing)
        ))

data %>%
    group_by(wkhct_cat) %>%
    summarize(count = n())

wkhct_cat,count
,124
37 hours,848
Less than 37 hours,426
More than 37 hours,174


In [94]:
# Create factor as ordered/ordinal (but what order?)
data <- data %>%
    mutate(wkhct_cat = factor(wkhct_cat, ordered = TRUE))

In [95]:
# Inspecting values and levels
unique(data$wkhct_cat)

Because the order was not explicitly specified, R orders the categories alphabetically, so "37 hours" is treated as the lowest category:

In [96]:
head(data, 2)

data$wkhct_cat[1] > data$wkhct_cat[2]

idno,netustm,ppltrst,vote,prtvtddk,lvpntyr,tygrtr,gndr,yrbrn,edlvddk,eduyrs,wkhct,wkhtot,grspnum,frlgrsp,inwtm,wkhct_cat
110,180,8,Yes,Socialdemokratiet - The Social democrats,1968,Never too young,Male,1949,"Kort videregående uddannelse af op til 2-3 års varighed, F.eks. Erhvervsakadem",9,28,28,,,119,Less than 37 hours
705,60,5,Yes,Det Konservative Folkeparti - Conservative People's Party,1976,67,Male,1958,"Kort videregående uddannelse af op til 2-3 års varighed, F.eks. Erhvervsakadem",22,37,45,,,55,37 hours
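The alphabetical default can be reproduced on a toy vector with the same three category labels (a minimal sketch; `hours` and `hours_default` are names made up for illustration):

```r
# The three category labels used above, as a toy vector
hours <- c("Less than 37 hours", "37 hours", "More than 37 hours")

# ordered = TRUE without levels: the levels are sorted alphabetically
hours_default <- factor(hours, ordered = TRUE)
levels(hours_default)

# "Less than 37 hours" is therefore ranked above "37 hours"
hours_default[1] > hours_default[2]
```

The comparison returns `TRUE`, which is the opposite of what the labels mean.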


The order has to be explicitly specified:

In [97]:
# Creating ordered factor but setting custom order
data <- data %>%
    mutate(wkhct_cat = factor(wkhct_cat, levels = c('Less than 37 hours', '37 hours', 'More than 37 hours'), 
                              ordered = TRUE))

unique(data$wkhct_cat)

In [99]:
head(data, 2)

data$wkhct_cat[1] > data$wkhct_cat[2]

idno,netustm,ppltrst,vote,prtvtddk,lvpntyr,tygrtr,gndr,yrbrn,edlvddk,eduyrs,wkhct,wkhtot,grspnum,frlgrsp,inwtm,wkhct_cat
110,180,8,Yes,Socialdemokratiet - The Social democrats,1968,Never too young,Male,1949,"Kort videregående uddannelse af op til 2-3 års varighed, F.eks. Erhvervsakadem",9,28,28,,,119,Less than 37 hours
705,60,5,Yes,Det Konservative Folkeparti - Conservative People's Party,1976,67,Male,1958,"Kort videregående uddannelse af op til 2-3 års varighed, F.eks. Erhvervsakadem",22,37,45,,,55,37 hours
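Once the levels are in a meaningful order, comparisons and summaries behave as expected. A small sketch on made-up values (the vector `hours` is for illustration and is not taken from the survey data):

```r
# Made-up values with an explicit, meaningful level order
hours <- factor(
  c("Less than 37 hours", "37 hours", "More than 37 hours", "37 hours"),
  levels = c("Less than 37 hours", "37 hours", "More than 37 hours"),
  ordered = TRUE
)

# "Less than 37 hours" is no longer ranked above "37 hours"
hours[1] > hours[2]

# Ordered factors can be compared against a level label
hours >= "37 hours"

# max() and min() respect the level order
max(hours)
```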


## Strings vs. factors

The main benefit of factors is being able to control the behaviour of the categories. Factors allow you to work with categories that may not be present in a specific variable (this is useful for Likert scales, where not every possible level of the scale may occur in the data).

In [138]:
data <- data %>%
    mutate(gndr_3 = factor(gndr, levels = c("Female", "Male", "Other")))

In [139]:
table(data$gndr_3)


Female   Male  Other 
   726    846      0 
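The Likert case mentioned above can be sketched the same way. The responses and the five-point scale below are made up for illustration:

```r
# Hypothetical five-point scale and a handful of made-up responses
likert_levels <- c("Strongly disagree", "Disagree", "Neutral",
                   "Agree", "Strongly agree")
responses <- c("Agree", "Neutral", "Agree", "Strongly agree")

# As strings, only the observed categories show up
table(responses)

# As a factor, all five levels are reported, with 0 for unobserved ones
responses_fct <- factor(responses, levels = likert_levels, ordered = TRUE)
table(responses_fct)
```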

**Note of caution**
`factor()` silently recodes any value in the data that is not listed in `levels` to missing (`NA`). Below, "Female" is left out of `levels`, so all 726 "Female" responses become missing:

In [140]:
data <- data %>%
    mutate(gndr_3 = factor(gndr, levels = c("Male", "Other")))

In [141]:
table(data$gndr_3)


 Male Other 
  846     0 
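Because `table()` leaves out missing values by default, the loss is easy to overlook. One way to check for it, sketched on a made-up vector (`gender` and `gender_fct` are illustrative names):

```r
# Made-up values for illustration
gender <- c("Male", "Female", "Male", "Female", "Female")

# "Female" is not among the levels, so it is silently turned into NA
gender_fct <- factor(gender, levels = c("Male", "Other"))

# Count how many values were lost in the conversion
sum(is.na(gender_fct)) - sum(is.na(gender))

# useNA = "ifany" makes table() report the missing values as well
table(gender_fct, useNA = "ifany")
```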

For extra safety, use `parse_factor()` from `readr` instead, as it gives a warning when values fall outside the specified levels. Note that `parse_factor()` expects a character vector as input:

In [137]:
data <- data %>%
    mutate(gndr_3 = parse_factor(as.character(gndr), levels = c("Male", "Other")))

table(data$gndr_3)

"726 parsing failures.
row col           expected actual
  5  -- value in level set Female
  7  -- value in level set Female
  9  -- value in level set Female
 11  -- value in level set Female
 14  -- value in level set Female
... ... .................. ......
See problems(...) for more details.
"


 Male Other 
  846     0 
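If an error is preferable to a warning, a similar check can be written in base R. The helper `strict_factor()` below is hypothetical (it is not part of base R or `readr`) and is only a sketch of the idea:

```r
# Hypothetical helper: stop instead of silently producing NA
# when the data holds values that are not in `levels`
strict_factor <- function(x, levels, ...) {
  unexpected <- setdiff(unique(x[!is.na(x)]), levels)
  if (length(unexpected) > 0) {
    stop("Values not in levels: ", paste(unexpected, collapse = ", "))
  }
  factor(x, levels = levels, ...)
}

# Made-up values for illustration
gender <- c("Male", "Female", "Male")
strict_factor(gender, levels = c("Female", "Male", "Other"))

# Leaving out "Female" now raises an error rather than creating NAs;
# tryCatch() captures the message so the script keeps running
result <- tryCatch(
  strict_factor(gender, levels = c("Male", "Other")),
  error = function(e) conditionMessage(e)
)
result
```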

## `forcats` 

`forcats` is a package specifically for working with factors in R. It provides a range of functions for modifying the level labels and the order of the levels of a factor.

See [the cheatsheet](https://raw.githubusercontent.com/rstudio/cheatsheets/master/factors.pdf).

Some useful functions include:
- `fct_recode()`: Alternative to `recode()` that maintains the factor levels
- `fct_collapse()`: Combine categories in a factor
- `fct_lump()`: Combine small categories to a common category (like "Other")
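
The three functions listed above can be tried on a small made-up factor. This is a minimal sketch that assumes `forcats` is installed (it is loaded with the tidyverse); the party labels are placeholders, not values from the survey data.

```r
library(forcats)

# Made-up factor: "A" occurs 3 times, "B" twice, "C" and "D" once each
party <- factor(c("A", "A", "A", "B", "B", "C", "D"))

# Rename a level (syntax: new name = old name)
renamed <- fct_recode(party, "Party A" = "A")
levels(renamed)

# Merge several levels into one
collapsed <- fct_collapse(party, "Small parties" = c("C", "D"))
table(collapsed)

# Keep the 2 most frequent levels and lump the rest into "Other"
lumped <- fct_lump(party, n = 2)
table(lumped)
```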