In [1]:
# Setup

library(tidyverse)
data <- read_csv("https://github.com/CALDISS-AAU/workshop_R-intro/raw/master/data/ESS2018DK_subset.csv")

Registered S3 methods overwritten by 'ggplot2':
  method         from 
  [.quosures     rlang
  c.quosures     rlang
  print.quosures rlang
Registered S3 method overwritten by 'rvest':
  method            from
  read_xml.response xml2
-- Attaching packages --------------------------------------- tidyverse 1.2.1 --
v ggplot2 3.1.1       v purrr   0.3.2  
v tibble  2.1.1       v dplyr   0.8.0.1
v tidyr   0.8.3       v stringr 1.4.0  
v readr   1.3.1       v forcats 0.4.0  
-- Conflicts ------------------------------------------ tidyverse_conflicts() --
x dplyr::filter() masks stats::filter()
x dplyr::lag()    masks stats::lag()
Parsed with column specification:
cols(
  idno = col_double(),
  netustm = col_double(),
  ppltrst = col_double(),
  vote = col_character(),
  prtvtddk = col_character(),
  lvpntyr = col_character(),
  tygrtr = col_character(),
  gndr = col_character(),
  yrbrn = col_double(),
  edlvddk = col_character(),
  eduyrs = col_double(),
  wkhct = col_double(),
  wkhtot =

# Reading data from other analysis software with `haven`

[`haven`](https://haven.tidyverse.org/) is a tidyverse package for reading and writing data from other analysis software such as SAS, Stata and SPSS.

The functions in `haven` are simple to use, but because R works differently from the program the data was created in, imported data should always be inspected before it is used.

## Reading Stata data with `haven` 

Stata data (.dta) can be read into R using `read_dta()`:

In [6]:
library(haven)

ess18_occu <- read_dta("https://github.com/CALDISS-AAU/workshop_r-table-data/raw/master/data/ESS2018DK_subset_occu-ethn.dta")

In [7]:
head(ess18_occu)

idno,health,brncntr,facntr,mocntr,marsts,isco08
110,3,1,1,1,1.0,9334
705,2,1,1,1,,210
1327,2,1,1,1,6.0,7231
3760,1,1,1,1,6.0,9111
4658,1,1,2,1,,3251
5816,2,1,1,1,,2352


A core feature of Stata is the use of descriptive labels for both variables and values. R has no built-in equivalent, so the data is simply read in its "raw" form, as the underlying values.

`haven` does, however, store the variable and value labels as attributes:

In [34]:
attr(ess18_occu$health, "label")

In [33]:
attr(ess18_occu$health, "labels")

In [72]:
# Print each variable name followed by its variable label
for (i in seq_along(ess18_occu)) {
    cat(names(ess18_occu)[i], "\t", attr(ess18_occu[[i]], "label"), "\n")
}

idno 	 Respondent's identification number 
health 	 Subjective general health 
brncntr 	 Born in country 
facntr 	 Father born in country 
mocntr 	 Mother born in country 
marsts 	 Legal marital status 
isco08 	 Occupation, ISCO08 
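The same lookup can be sketched without a loop. The example below is a minimal illustration and does not use the ESS file: `toy` is a hypothetical two-column tibble built with `labelled()`, with values and labels chosen only to resemble the data above. It collects the variable labels into a named character vector and falls back to `NA` for columns without a label.

```r
library(haven)
library(tibble)

# Hypothetical stand-in for the imported data (illustrative values only)
toy <- tibble(
  idno = labelled(c(110, 705, 1327), label = "Respondent's identification number"),
  health = labelled(
    c(3, 2, 2),
    labels = c("Very good" = 1, "Good" = 2, "Fair" = 3, "Bad" = 4, "Very bad" = 5),
    label = "Subjective general health"
  )
)

# One variable label per column; NA if a column has no "label" attribute
var_labels <- vapply(
  toy,
  function(col) {
    lbl <- attr(col, "label", exact = TRUE)
    if (is.null(lbl)) NA_character_ else lbl
  },
  character(1)
)
var_labels

# Value labels are a named vector: names are the labels, values are the codes
val_labels <- attr(toy$health, "labels")
names(val_labels)[val_labels == 3]
```

A named vector like `var_labels` can be handy as a small codebook when the labels are dropped later in the conversion.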


## Dealing with `haven_labelled`

To ensure no information in the data is lost, `haven` stores the value labels by treating the variables as the class `haven_labelled`:

In [73]:
class(ess18_occu$health)

This class has limited functionality, and one should *always* convert `haven_labelled` to an appropriate R class (numeric, character, factor, logical).
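Before converting, it can help to see which columns are still `haven_labelled`. The sketch below uses `is.labelled()` from `haven` on a hypothetical tibble `toy` (not the ESS data), so the column names and values are assumptions made for illustration.

```r
library(haven)
library(tibble)

# Hypothetical data: one plain numeric column, one labelled column
toy <- tibble(
  idno = c(110, 705, 1327),
  health = labelled(
    c(3, 2, 2),
    labels = c("Very good" = 1, "Good" = 2, "Fair" = 3, "Bad" = 4, "Very bad" = 5)
  )
)

# TRUE for every column that still needs converting
is_lab <- vapply(toy, is.labelled, logical(1))
names(toy)[is_lab]
```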

### Converting to numeric

Use `as.numeric()` to convert to a plain numeric vector. The result is numeric, not categorical, and the value labels are dropped:

In [106]:
ess18_occu %>%
    mutate(health = as.numeric(health)) %>%
    head()

idno,health,brncntr,facntr,mocntr,marsts,isco08
110,3,1,1,1,1.0,9334
705,2,1,1,1,,210
1327,2,1,1,1,6.0,7231
3760,1,1,1,1,6.0,9111
4658,1,1,2,1,,3251
5816,2,1,1,1,,2352
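As a small self-contained illustration of what the conversion does, the sketch below builds a hypothetical labelled vector (the values mirror the first six `health` values shown above, but the vector is constructed by hand) and converts it. After conversion the vector is an ordinary numeric vector that works with functions such as `mean()`.

```r
library(haven)

# Hypothetical labelled vector, constructed by hand for illustration
health <- labelled(
  c(3, 2, 2, 1, 1, 2),
  labels = c("Very good" = 1, "Good" = 2, "Fair" = 3, "Bad" = 4, "Very bad" = 5)
)

health_num <- as.numeric(health)

class(health_num)   # no longer haven_labelled
mean(health_num)    # ordinary numeric summaries now work
```

Whether a mean is meaningful depends on the variable: for a categorical code it is not, which is why factors are usually the better target for such variables.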


### Converting to factor

Use `as_factor()` to convert a `haven_labelled` to a factor. The `levels` argument controls what the factor levels are: the values (`levels = "values"`), the labels (`levels = "labels"`), or both combined (`levels = "both"`):

In [107]:
# Using values
ess18_occu %>%
    mutate(health = as_factor(health, levels = 'values', ordered = TRUE)) %>%
    head(10)

idno,health,brncntr,facntr,mocntr,marsts,isco08
110,3,1,1,1,1.0,9334
705,2,1,1,1,,210
1327,2,1,1,1,6.0,7231
3760,1,1,1,1,6.0,9111
4658,1,1,2,1,,3251
5816,2,1,1,1,,2352


In [108]:
# Using labels
ess18_occu %>%
    mutate(health = as_factor(health, levels = 'labels', ordered = TRUE)) %>%
    head()


idno,health,brncntr,facntr,mocntr,marsts,isco08
110,Fair,1,1,1,1.0,9334
705,Good,1,1,1,,210
1327,Good,1,1,1,6.0,7231
3760,Very good,1,1,1,6.0,9111
4658,Very good,1,2,1,,3251
5816,Good,1,1,1,,2352


In [109]:
# Using both
ess18_occu %>%
    mutate(health = as_factor(health, levels = 'both', ordered = TRUE)) %>%
    head()

idno,health,brncntr,facntr,mocntr,marsts,isco08
110,[3] Fair,1,1,1,1.0,9334
705,[2] Good,1,1,1,,210
1327,[2] Good,1,1,1,6.0,7231
3760,[1] Very good,1,1,1,6.0,9111
4658,[1] Very good,1,2,1,,3251
5816,[2] Good,1,1,1,,2352
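To compare the three options side by side, the sketch below applies them to a hypothetical labelled vector built by hand (not the ESS column) and inspects the resulting levels with `levels()`.

```r
library(haven)

# Hypothetical labelled vector, constructed by hand for illustration
health <- labelled(
  c(3, 2, 2, 1, 1, 2),
  labels = c("Very good" = 1, "Good" = 2, "Fair" = 3, "Bad" = 4, "Very bad" = 5)
)

f_values <- as_factor(health, levels = "values")
f_labels <- as_factor(health, levels = "labels")
f_both   <- as_factor(health, levels = "both")

levels(f_values)  # the codes, as character levels
levels(f_labels)  # the label text
levels(f_both)    # code and label combined, e.g. "[3] Fair"
```

Levels that do not occur in the data (here "Bad" and "Very bad") are still kept as levels, since they are defined in the value labels.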


#### Important note on ordering! 

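When `as_factor()` creates an ordered factor, the levels follow the order of the value labels (here, ascending by value), so the label attached to the lowest value becomes the lowest level. *Just remember that R treats the first level as the lowest, so comparisons only read naturally when the levels run from worst to best!*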

In the case of the `health` variable in the ESS data, the values are ranked from best to worst in terms of health. This can easily cause confusion if the labels are used as levels:

In [110]:
ess18_occu %>%
    mutate(health = as_factor(health, levels = 'labels', ordered = TRUE)) %>%
    filter(health > "Fair") %>%
    head(10)

idno,health,brncntr,facntr,mocntr,marsts,isco08
11688,Very bad,1,1,1,,5320.0
28202,Bad,1,1,1,,5120.0
76553,Bad,1,1,1,,2413.0
78061,Bad,1,1,1,4.0,7212.0
78613,Bad,1,1,1,,8189.0
78728,Very bad,1,1,1,,1344.0
80016,Bad,1,1,1,4.0,9334.0
80397,Bad,1,1,1,,7233.0
82127,Bad,1,1,1,6.0,
82855,Bad,1,1,1,6.0,7114.0


This can easily be resolved by using the function `fct_rev()` from `forcats`, which will reverse the order of the levels:

In [111]:
ess18_occu %>%
    mutate(health = as_factor(health, levels = 'labels', ordered = TRUE)) %>%
    mutate(health = fct_rev(health)) %>%
    filter(health > "Fair") %>%
    head(10)

idno,health,brncntr,facntr,mocntr,marsts,isco08
705,Good,1,1,1,,210
1327,Good,1,1,1,6.0,7231
3760,Very good,1,1,1,6.0,9111
4658,Very good,1,2,1,,3251
5816,Good,1,1,1,,2352
9607,Good,1,1,1,,4110
16357,Very good,1,1,1,6.0,8113
17504,Good,1,1,1,6.0,5311
19970,Very good,2,2,1,4.0,1120
20724,Good,1,1,1,,7125
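The effect of the level order on comparisons can be checked on a small scale. The sketch below uses a hypothetical labelled vector with one observation per category (constructed by hand, not taken from the ESS data) and compares the result of `> "Fair"` before and after `fct_rev()`.

```r
library(haven)
library(forcats)

# Hypothetical labelled vector: one observation per health category
health <- labelled(
  c(3, 2, 4, 1, 5),
  labels = c("Very good" = 1, "Good" = 2, "Fair" = 3, "Bad" = 4, "Very bad" = 5)
)

# Levels follow the stored labels: "Very good" is lowest, "Very bad" is highest
f <- as_factor(health, levels = "labels", ordered = TRUE)
worse_than_fair <- f > "Fair"        # TRUE for "Bad" and "Very bad"

# Reversed levels: "Very bad" is lowest, "Very good" is highest
f_rev <- fct_rev(f)
better_than_fair <- f_rev > "Fair"   # TRUE for "Good" and "Very good"

data.frame(health = as.character(f), worse_than_fair, better_than_fair)
```

Checking `levels()` after any conversion is a cheap way to confirm which end of the scale R treats as lowest before filtering or comparing.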
