In [45]:
# Setup

library(tidyverse)
data <- read_csv("https://github.com/CALDISS-AAU/workshop_R-intro/raw/master/data/ESS2018DK_subset.csv")

Parsed with column specification:
cols(
  idno = col_double(),
  netustm = col_double(),
  ppltrst = col_double(),
  vote = col_character(),
  prtvtddk = col_character(),
  lvpntyr = col_character(),
  tygrtr = col_character(),
  gndr = col_character(),
  yrbrn = col_double(),
  edlvddk = col_character(),
  eduyrs = col_double(),
  wkhct = col_double(),
  wkhtot = col_double(),
  grspnum = col_double(),
  frlgrsp = col_double(),
  inwtm = col_double()
)


# Recoding and creating variables

Creating variables and (simple) recoding are usually done in the same way. The only difference is whether the result is assigned to a new variable or overwrites an existing one (here we only look at recoding by arithmetic operations, not by replacing values).

In base R, we assign the contents to a variable name that does not yet exist in the data:

In [46]:
data$inwth <- data$inwtm / 60 # Creating variable for length of interview in hours

head(data$inwth)

In [47]:
data$inwth <- NULL # This line removes the variable
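To make the difference between creating and recoding concrete, here is a minimal sketch on a small made-up data frame (`toy` and its values are assumptions for illustration, not part of the ESS data). Assigning to a new name adds a column, while assigning to an existing name overwrites it:

```r
# Toy data standing in for the ESS subset (values are made up)
toy <- data.frame(idno = 1:3, inwtm = c(30, 60, 90))

# Creating: `inwth` does not exist yet, so a new column is added
toy$inwth <- toy$inwtm / 60

# Recoding: `inwtm` already exists, so its values are overwritten
# (here converted from minutes to seconds)
toy$inwtm <- toy$inwtm * 60

names(toy)
toy
```

Because overwriting discards the original values, it is often safer to recode into a new variable until the result has been checked.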

## Recoding and creating variables using `dplyr`

The function `mutate()` in `dplyr` is used for creating and recoding variables:

In [48]:
data <- data %>%
    mutate(inwth = inwtm / 60)

head(data$inwth)
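`mutate()` can also create several variables in one call, and later expressions can use variables created earlier in the same call. The sketch below uses a made-up tibble (`toy` and `long_intw` are hypothetical names chosen for illustration):

```r
library(dplyr)

# Toy data standing in for the ESS subset (values are made up)
toy <- tibble(inwtm = c(30, 60, 90))

toy <- toy %>%
    mutate(inwth = inwtm / 60,     # new variable: interview length in hours
           long_intw = inwth > 1)  # uses `inwth`, created on the line above

toy
```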

# Classes in R

R differentiates between objects via the "class" of the object.

The function `class()` is used to check the class of an object:

In [49]:
name <- "keenan"
year <- 1964

In [50]:
class(name)

In [51]:
class(year)

Single variables/vectors can only contain values of the same class. The `class()` function therefore works on vectors too.
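A small sketch of what the same-class rule means in practice: when values of different classes are combined with `c()`, R silently converts them to a single class that can hold them all (the object names are made up for illustration):

```r
# Mixing a number and text: everything becomes character
mixed <- c(1964, "keenan")
class(mixed)      # "character"
mixed             # the number is now the text "1964"

# Mixing a number and a logical: everything becomes numeric
num_lgl <- c(1964, TRUE)
class(num_lgl)    # "numeric"
```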

The variable `tygrtr` (Retire permanently, age too young) seems like a variable that should contain numeric values (the age). However, looking at the first couple of rows, we see that it also contains text values:

In [52]:
head(data$tygrtr)

When we check the class, we also see that the values are stored as text:

In [53]:
class(data$tygrtr)

This means that we cannot perform numeric calculations with this variable. For example, `max()` does not return the highest age:

In [54]:
max(data$tygrtr)

## Class coercion

In most cases, R can coerce values from one class to another. When doing this, values that are incompatible with the new class are coded as missing (`NA`), so beware!

Values can be coerced to character values with `as.character()`.

Values can be coerced to numeric values with `as.numeric()`.

Here we coerce the variable to be numeric (notice the warning):

In [55]:
data <- mutate(data, tygrtr = as.numeric(tygrtr))

"NAs introduced by coercion"

Now the variable can be used in calculations:

In [56]:
max(data$tygrtr, na.rm = TRUE)
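Since coercion turns incompatible values into `NA` without saying which ones, it can be useful to inspect them first. The following is a minimal sketch on made-up answers (the vector `answers` and the text "Never too young" are assumptions standing in for the text values in `tygrtr`):

```r
# Hypothetical answers; one text value and one value that is already missing
answers <- c("67", "40", "Never too young", NA)

# Coerce quietly, then compare with the original values
as_num <- suppressWarnings(as.numeric(answers))

# Values that were not missing before but are missing after coercion
lost <- unique(answers[is.na(as_num) & !is.na(answers)])
lost
```

Here `lost` contains only the text value, which shows exactly what the coercion would discard.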

# Missing values

Data will often contain missing values. Missing values can denote many things, such as a non-response, an invalid answer, inaccessible information and so on.

Missing values act as placeholders for values that are not there. They are denoted as `NA` in R.

The `summary()` function includes information about the number of missing values:

In [57]:
summary(data$inwtm)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
  17.00   50.00   59.00   62.68   70.00  613.00       6 

Missing values are neither high nor low in R. Computations that involve a missing value therefore generally return `NA`:

In [58]:
min(data$inwtm) # NA is neither high nor low - returns NA
max(data$inwtm) # NA is neither high nor low - returns NA
mean(data$inwtm) # The mean cannot be computed with NA present - returns NA
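As a short sketch on a made-up vector (`x` is an assumption for illustration), the argument `na.rm = TRUE` tells these functions to ignore missing values, and `is.na()` can be used to count them:

```r
x <- c(17, 59, NA)

NA > 1                  # NA: a comparison with a missing value is itself missing
mean(x)                 # NA
mean(x, na.rm = TRUE)   # 38: the mean of 17 and 59
sum(is.na(x))           # 1: number of missing values
```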

Usually one will have to deal with the missing values in some way, either by replacing them or by removing them.

## Removing missing observations (listwise deletion)

`drop_na()` from `tidyr` is used for listwise deletion. By default, it removes every row with a missing value in any column. If columns are specified, it only looks for missing values in those columns:

In [59]:
library(tidyr)
data_drop_all <- drop_na(data)

print(dim(data))
print(dim(data_drop_all))

[1] 1572   17
[1] 171  17


In [60]:
data_drop_specific <- drop_na(data, inwtm)

print(dim(data))
print(dim(data_drop_specific))

[1] 1572   17
[1] 1566   17


In [61]:
data_drop_several <- drop_na(data, inwtm, tygrtr)

print(dim(data))
print(dim(data_drop_several))

[1] 1572   17
[1] 1238   17
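The row counts above are easier to follow on a dataset small enough to read. This is a minimal sketch with a made-up data frame (`toy` and its values are assumptions), where each of rows 2 to 4 has exactly one missing value in a different column:

```r
library(tidyr)

# Toy data: row 1 is complete; rows 2-4 each miss one value
toy <- data.frame(
    inwtm  = c(50, NA, 70, 60),
    tygrtr = c(67, 40, NA, 50),
    vote   = c("Yes", "Yes", "No", NA),
    stringsAsFactors = FALSE
)

nrow(drop_na(toy))                  # 1: only row 1 has no missing values at all
nrow(drop_na(toy, inwtm))           # 3: only row 2 is missing `inwtm`
nrow(drop_na(toy, inwtm, tygrtr))   # 2: rows 2 and 3 are dropped, row 4 is kept
```

The more columns are included in the check, the more rows are lost, which is why dropping on all columns reduces the ESS subset so sharply.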


## Replacing missing values

`replace_na()` is used to replace missing values with a specified value. It can, for example, be used in combination with `mutate()`:

In [63]:
data %>%
    mutate(prtvtddk = replace_na(prtvtddk, 'MISSING')) %>%
    head()

idno,netustm,ppltrst,vote,prtvtddk,lvpntyr,tygrtr,gndr,yrbrn,edlvddk,eduyrs,wkhct,wkhtot,grspnum,frlgrsp,inwtm,inwth,prtvtddk_misreplace
110,180,8,Yes,Socialdemokratiet - The Social democrats,1968,,Male,1949,"Kort videregående uddannelse af op til 2-3 års varighed, F.eks. Erhvervsakadem",9,28,28,,,119,1.9833333,Socialdemokratiet - The Social democrats
705,60,5,Yes,Det Konservative Folkeparti - Conservative People's Party,1976,67.0,Male,1958,"Kort videregående uddannelse af op til 2-3 års varighed, F.eks. Erhvervsakadem",22,37,45,,,55,0.9166667,Det Konservative Folkeparti - Conservative People's Party
1327,240,5,,MISSING,"Still in parental home, never left 2 months",,Male,2000,Folkeskole 9.-10. klasse,11,37,37,,,37,0.6166667,MISSING
3760,300,7,Not eligible to vote,MISSING,"Still in parental home, never left 2 months",40.0,Male,2002,Folkeskole 9.-10. klasse,9,2,2,200.0,,43,0.7166667,MISSING
4658,90,8,Yes,MISSING,1974,50.0,Female,1956,"Kort videregående uddannelse af op til 2-3 års varighed, F.eks. Erhvervsakadem",4,30,30,,,62,1.0333333,MISSING
5816,90,7,Yes,SF Socialistisk Folkeparti - Socialist People's Party,1994,60.0,Male,1974,"Mellemlang videregående uddannelse af 3-4 års varighed. Professionsbachelorer,",35,37,37,37000.0,35000.0,61,1.0166667,SF Socialistisk Folkeparti - Socialist People's Party
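`replace_na()` can also be applied to a whole data frame by passing a named list with one replacement value per column. The sketch below uses a made-up data frame (`toy`, `filled` and the replacement values are assumptions for illustration). Note that replacing missing numeric values with 0 changes means and other statistics, so it is shown here only to illustrate the syntax:

```r
library(tidyr)

# Toy data with one missing value in each column
toy <- data.frame(
    prtvtddk = c("Socialdemokratiet - The Social democrats", NA),
    wkhtot   = c(37, NA),
    stringsAsFactors = FALSE
)

# One replacement value per column, given as a named list
filled <- replace_na(toy, list(prtvtddk = "MISSING", wkhtot = 0))
filled
```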
