In [13]:
# Setup

library(readr)
data <- read_csv("https://github.com/CALDISS-AAU/workshop_R-intro/raw/master/data/ESS2018DK_subset.csv")

Parsed with column specification:
cols(
  idno = col_double(),
  netustm = col_double(),
  ppltrst = col_double(),
  vote = col_character(),
  prtvtddk = col_character(),
  lvpntyr = col_character(),
  tygrtr = col_character(),
  gndr = col_character(),
  yrbrn = col_double(),
  edlvddk = col_character(),
  eduyrs = col_double(),
  wkhct = col_double(),
  wkhtot = col_double(),
  grspnum = col_double(),
  frlgrsp = col_double(),
  inwtm = col_double()
)


# Filtering and subsetting

Base R supports filtering and subsetting out of the box, but some packages offer more intuitive functions for the same tasks (such as the packages in the tidyverse: https://www.tidyverse.org/).

## Compare and contrast

These two commands achieve the same result. Which is more intuitive?

In [14]:
subset <- data[data$gndr == 'Male', c('gndr', 'prtvtddk')]
head(subset)

gndr,prtvtddk
Male,Socialdemokratiet - The Social democrats
Male,Det Konservative Folkeparti - Conservative People's Party
Male,
Male,
Male,SF Socialistisk Folkeparti - Socialist People's Party
Male,Socialdemokratiet - The Social democrats


In [15]:
library(dplyr)
subset <- data %>%
    filter(gndr == 'Male') %>%
    select('gndr', 'prtvtddk')
head(subset)

gndr,prtvtddk
Male,Socialdemokratiet - The Social democrats
Male,Det Konservative Folkeparti - Conservative People's Party
Male,
Male,
Male,SF Socialistisk Folkeparti - Socialist People's Party
Male,Socialdemokratiet - The Social democrats


## Filtering observations (subsetting)

Like values in a vector, each value in a data frame has an index. Each value in a data frame can be uniquely identified by the combination of its row and column index.

Data frames are indexed using `[rowindex, columnindex/column name]`:

In [16]:
data[10, 2] # Row 10, column 2

netustm
150


In [17]:
data[10, 'prtvtddk'] # Row 10 in column prtvtddk

prtvtddk
Socialdemokratiet - The Social democrats


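Notice that even though we are asking for a single value, the object returned is a 1x1 data frame. This is because `read_csv()` returns a tibble, and indexing a tibble with `[` always returns a tibble (a base `data.frame` would return the bare value here). To get the actual value, index the column as a vector with `$`: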

In [18]:
data$prtvtddk[10]
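
As a small illustration of this difference, the sketch below compares a base `data.frame` with a tibble. The values are made up for the example and are not taken from the ESS data; the names `df`, `tbl` and `cell_*` are hypothetical.

```r
library(tibble)

# Toy data (hypothetical values, not taken from the ESS file)
df  <- data.frame(gndr = c("Male", "Female", "Male"), ppltrst = c(8, 5, 7))
tbl <- as_tibble(df)

cell_df  <- df[2, "ppltrst"]      # base data.frame: drops to a plain numeric value
cell_tbl <- tbl[2, "ppltrst"]     # tibble: stays a 1x1 tibble
cell_val <- tbl[["ppltrst"]][2]   # `[[` extracts the column as a vector first

is.data.frame(cell_df)    # FALSE
is.data.frame(cell_tbl)   # TRUE
cell_val                  # 5
```

`tbl$ppltrst[2]` and `tbl[["ppltrst"]][2]` are equivalent here; `[[` is convenient when the column name is stored in a variable.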

We can also ask for several rows and columns:

In [19]:
data[1:10, c('prtvtddk', 'gndr')] # Rows 1-10, columns prtvtddk and gndr (both specified as vectors)

prtvtddk,gndr
Socialdemokratiet - The Social democrats,Male
Det Konservative Folkeparti - Conservative People's Party,Male
,Male
,Male
,Female
SF Socialistisk Folkeparti - Socialist People's Party,Male
Dansk Folkeparti - Danish People's Party,Female
Socialdemokratiet - The Social democrats,Male
Alternativet - The Alternative,Female
Socialdemokratiet - The Social democrats,Male
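
A few more indexing patterns may be useful alongside the one above. The following is a minimal sketch on a toy data frame (hypothetical values and object names, not the ESS data): leaving an index empty keeps everything along that dimension, and a negative index excludes positions.

```r
# Toy data (hypothetical values, not taken from the ESS file)
df <- data.frame(
  idno  = c(110, 705, 1327, 3760),
  gndr  = c("Male", "Male", "Male", "Female"),
  yrbrn = c(1949, 1958, 2000, 2002)
)

first_two <- df[1:2, ]                  # rows 1-2, all columns (column index left empty)
two_cols  <- df[, c("idno", "yrbrn")]   # all rows, two named columns
without_1 <- df[-1, ]                   # negative index: every row except the first

dim(first_two)   # 2 3
dim(two_cols)    # 4 2
dim(without_1)   # 3 3
```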


### Filtering with booleans/logical values

Normally we do not know the row index of the values we want to keep. Rather, we want to filter observations based on certain criteria.

In R this is done via the use of "booleans" or "logical values". These are values that are either `TRUE` or `FALSE`.

The comparison operators in R return logical values:

- `>`
- `>=`
- `<`
- `<=`
- `==`
- `!=`

In [20]:
42 > 10

In [21]:
10 != 10
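
The comparisons above use single numbers, but the operators are vectorised: applied to a vector, they return one logical value per element. The sketch below uses a made-up vector of birth years (the values are assumptions for illustration) and also combines two comparisons with `&`. Note that comparing a missing value yields `NA` rather than `TRUE` or `FALSE`.

```r
yrbrn <- c(1949, 1958, 2000, NA)    # hypothetical birth years, one missing

born_1958_or_later <- yrbrn >= 1958
born_1958_or_later                  # FALSE TRUE TRUE NA

# Combine two conditions with & (element-wise "and")
born_1958_to_1999 <- yrbrn >= 1958 & yrbrn < 2000
born_1958_to_1999                   # FALSE TRUE FALSE NA
```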

Logicals can be used when filtering observations. Logicals are used to index the rows, so that only rows meeting the criteria will be returned:

In [22]:
head(data[data$gndr == 'Male', c('gndr', 'prtvtddk')])

gndr,prtvtddk
Male,Socialdemokratiet - The Social democrats
Male,Det Konservative Folkeparti - Conservative People's Party
Male,
Male,
Male,SF Socialistisk Folkeparti - Socialist People's Party
Male,Socialdemokratiet - The Social democrats


Logicals can also be used to index vectors, which makes it possible to perform a calculation on a specific group:

In [23]:
mean(data$ppltrst[data$gndr == "Male"], na.rm = TRUE) # Mean of ppltrst for gndr = male
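
One thing to watch for with base R logical indexing is missing values in the condition. The sketch below uses a toy data frame (hypothetical values, not the ESS data) to show that an `NA` in the logical index produces an all-`NA` row, and that wrapping the condition in `which()` keeps only the positions that are `TRUE`.

```r
# Toy data (hypothetical values); gndr is missing for one respondent
df <- data.frame(
  gndr    = c("Male", "Female", NA, "Male"),
  ppltrst = c(8, 5, 6, NA)
)

is_male <- df$gndr == "Male"          # TRUE FALSE NA TRUE

n_logical <- nrow(df[is_male, ])          # 3: the NA condition yields an all-NA row
n_which   <- nrow(df[which(is_male), ])   # 2: which() keeps only the TRUE positions

# Group mean, ignoring missing ppltrst values
male_mean <- mean(df$ppltrst[which(is_male)], na.rm = TRUE)
male_mean                             # 8
```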

## Subsetting data with `dplyr` 

One major drawback of base R subsetting is that you quickly end up repeating the name of the dataset over and over. Base R commands can also be a bit counter-intuitive to read.

The package `dplyr` contains various commands for filtering and subsetting data. The functions `filter` and `select` can be used to subset data instead of base R commands.

`filter()` takes a dataset and a logical statement using a variable in the data. It returns a dataset with the observations that meet the criteria.

`select()` takes a dataset and one or more variable names. It returns the dataset with only the specified variables.

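NOTE: The `stats` package, which is loaded by default in R, also has a function called `filter()`. Loading `dplyr` masks it (it is not removed), so `filter()` then refers to the `dplyr` version; the original remains available as `stats::filter()`.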
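
To make the note about masking concrete, here is a minimal sketch using toy data (hypothetical values and object names). The `package::function()` form states explicitly which `filter()` is meant, which can help in scripts where the loading order of packages is not obvious.

```r
library(dplyr)

# Toy data (hypothetical values, not taken from the ESS file)
df <- data.frame(gndr = c("Male", "Female", "Male"), yrbrn = c(1949, 1956, 2000))

# dplyr version: keeps the rows where the condition is TRUE
males <- dplyr::filter(df, gndr == "Male")
nrow(males)                     # 2

# stats version: a linear filter on a numeric series (here a 3-point moving average)
smoothed <- stats::filter(c(1, 2, 3, 4, 5), rep(1 / 3, 3))
```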

In [24]:
library(dplyr)

data_male <- filter(data, gndr == 'Male') # Subset with only males

head(data_male)

idno,netustm,ppltrst,vote,prtvtddk,lvpntyr,tygrtr,gndr,yrbrn,edlvddk,eduyrs,wkhct,wkhtot,grspnum,frlgrsp,inwtm
110,180,8,Yes,Socialdemokratiet - The Social democrats,1968,Never too young,Male,1949,"Kort videregående uddannelse af op til 2-3 års varighed, F.eks. Erhvervsakadem",9,28,28,,,119
705,60,5,Yes,Det Konservative Folkeparti - Conservative People's Party,1976,67,Male,1958,"Kort videregående uddannelse af op til 2-3 års varighed, F.eks. Erhvervsakadem",22,37,45,,,55
1327,240,5,,,"Still in parental home, never left 2 months",,Male,2000,Folkeskole 9.-10. klasse,11,37,37,,,37
3760,300,7,Not eligible to vote,,"Still in parental home, never left 2 months",40,Male,2002,Folkeskole 9.-10. klasse,9,2,2,200.0,,43
5816,90,7,Yes,SF Socialistisk Folkeparti - Socialist People's Party,1994,60,Male,1974,"Mellemlang videregående uddannelse af 3-4 års varighed. Professionsbachelorer,",35,37,37,37000.0,35000.0,61
7887,360,8,Yes,Socialdemokratiet - The Social democrats,1983,55,Male,1958,"Lang videregående uddannelse. Kandidatuddannelser af 5.-6. års varighed, F.eks",25,39,39,36000.0,42000.0,89


In [25]:
data_male_subset <- select(data_male, idno, gndr, yrbrn, edlvddk) # Selecting specific variables

head(data_male_subset)

idno,gndr,yrbrn,edlvddk
110,Male,1949,"Kort videregående uddannelse af op til 2-3 års varighed, F.eks. Erhvervsakadem"
705,Male,1958,"Kort videregående uddannelse af op til 2-3 års varighed, F.eks. Erhvervsakadem"
1327,Male,2000,Folkeskole 9.-10. klasse
3760,Male,2002,Folkeskole 9.-10. klasse
5816,Male,1974,"Mellemlang videregående uddannelse af 3-4 års varighed. Professionsbachelorer,"
7887,Male,1958,"Lang videregående uddannelse. Kandidatuddannelser af 5.-6. års varighed, F.eks"


`select()` can also be used to drop variables by prefixing the variable name with `-`:

In [37]:
data_nogndr <- select(data, -gndr)
head(data_nogndr)

idno,netustm,ppltrst,vote,prtvtddk,lvpntyr,tygrtr,yrbrn,edlvddk,eduyrs,wkhct,wkhtot,grspnum,frlgrsp,inwtm
110,180,8,Yes,Socialdemokratiet - The Social democrats,1968,Never too young,1949,"Kort videregående uddannelse af op til 2-3 års varighed, F.eks. Erhvervsakadem",9,28,28,,,119
705,60,5,Yes,Det Konservative Folkeparti - Conservative People's Party,1976,67,1958,"Kort videregående uddannelse af op til 2-3 års varighed, F.eks. Erhvervsakadem",22,37,45,,,55
1327,240,5,,,"Still in parental home, never left 2 months",,2000,Folkeskole 9.-10. klasse,11,37,37,,,37
3760,300,7,Not eligible to vote,,"Still in parental home, never left 2 months",40,2002,Folkeskole 9.-10. klasse,9,2,2,200.0,,43
4658,90,8,Yes,,1974,50,1956,"Kort videregående uddannelse af op til 2-3 års varighed, F.eks. Erhvervsakadem",4,30,30,,,62
5816,90,7,Yes,SF Socialistisk Folkeparti - Socialist People's Party,1994,60,1974,"Mellemlang videregående uddannelse af 3-4 års varighed. Professionsbachelorer,",35,37,37,37000.0,35000.0,61
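
Several variables can be dropped in the same call. The following sketch uses a toy data frame (hypothetical values and object names) and shows two equivalent spellings.

```r
library(dplyr)

# Toy data (hypothetical values, not taken from the ESS file)
df <- data.frame(
  idno   = c(110, 4658),
  gndr   = c("Male", "Female"),
  yrbrn  = c(1949, 1956),
  eduyrs = c(9, 4)
)

dropped_a <- select(df, -gndr, -eduyrs)     # one minus sign per variable
dropped_b <- select(df, -c(gndr, eduyrs))   # or negate a group of variables

names(dropped_a)   # "idno" "yrbrn"
```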


### Chaining commands with the pipe `%>%`

The pipe operator (`%>%`) is widely used across the `tidyverse`. It comes from the `magrittr` package (https://magrittr.tidyverse.org/) and is re-exported by `dplyr`, so it is available as soon as `dplyr` is loaded.

The pipe allows one to chain together commands without having to refer to the name of the dataset at each step: the output of the previous step is passed as the first argument to the next function:

In [26]:
data_male_subset <- data %>% # Creating the same subset as before but with shorter code
    filter(gndr == 'Male') %>%
    select(idno, gndr, yrbrn, edlvddk)

head(data_male_subset)

idno,gndr,yrbrn,edlvddk
110,Male,1949,"Kort videregående uddannelse af op til 2-3 års varighed, F.eks. Erhvervsakadem"
705,Male,1958,"Kort videregående uddannelse af op til 2-3 års varighed, F.eks. Erhvervsakadem"
1327,Male,2000,Folkeskole 9.-10. klasse
3760,Male,2002,Folkeskole 9.-10. klasse
5816,Male,1974,"Mellemlang videregående uddannelse af 3-4 års varighed. Professionsbachelorer,"
7887,Male,1958,"Lang videregående uddannelse. Kandidatuddannelser af 5.-6. års varighed, F.eks"
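
To see what the pipe actually does, it can help to compare it with the equivalent nested call. The sketch below uses a toy data frame (hypothetical values and object names) and checks that both forms give the same result.

```r
library(dplyr)

# Toy data (hypothetical values, not taken from the ESS file)
df <- data.frame(
  idno  = c(110, 4658, 705),
  gndr  = c("Male", "Female", "Male"),
  yrbrn = c(1949, 1956, 1958)
)

# Nested calls are read inside-out: filter() runs first, then select()
nested <- select(filter(df, gndr == "Male"), idno, yrbrn)

# Piped calls are read top to bottom, in the order the steps are applied
piped <- df %>%
    filter(gndr == "Male") %>%
    select(idno, yrbrn)

identical(nested, piped)   # TRUE
```

R 4.1.0 and later also include a native pipe, `|>`, which behaves similarly for simple chains like this one.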


## Arrange and reorder data

Data can be arranged and reordered using `arrange()` and `select()` from `dplyr`.

- `arrange()` is used for sorting rows based on one or several variables
- `select()` is used for changing the order of the variables

### Arrange

In the subset created (`data_male_subset`), rows are currently ordered by their original order in the dataset. 

`arrange()` can be used to change the order based on one or several variables:

In [28]:
head(arrange(data_male_subset, yrbrn))

idno,gndr,yrbrn,edlvddk
77284,Male,1929,Folkeskole 6.-8. klasse
105339,Male,1929,"Faglig uddannelse (håndværk, handel, landbrug mv.), F.eks. Faglærte, Social-"
116473,Male,1929,"Kort videregående uddannelse af op til 2-3 års varighed, F.eks. Erhvervsakadem"
134875,Male,1929,"Mellemlang videregående uddannelse af 3-4 års varighed. Professionsbachelorer,"
137116,Male,1929,Folkeskole 6.-8. klasse
145351,Male,1929,"Faglig uddannelse (håndværk, handel, landbrug mv.), F.eks. Faglærte, Social-"


`arrange()` sorts in ascending order by default. This can be changed by wrapping the variable in `desc()`:

In [29]:
head(arrange(data_male_subset, desc(yrbrn)))

idno,gndr,yrbrn,edlvddk
93939,Male,2003,Folkeskole 6.-8. klasse
98385,Male,2003,Folkeskole 6.-8. klasse
107396,Male,2003,Folkeskole 9.-10. klasse
112183,Male,2003,
117251,Male,2003,Folkeskole 6.-8. klasse
123968,Male,2003,Folkeskole 6.-8. klasse


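`arrange()` accepts several variables. Rows are sorted by the first variable, and ties are broken by the following variables in the order given: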

In [33]:
head(arrange(data_male_subset, edlvddk, desc(yrbrn)))

idno,gndr,yrbrn,edlvddk
76225,Male,2001,"Faglig uddannelse (håndværk, handel, landbrug mv.), F.eks. Faglærte, Social-"
114759,Male,1997,"Faglig uddannelse (håndværk, handel, landbrug mv.), F.eks. Faglærte, Social-"
131603,Male,1997,"Faglig uddannelse (håndværk, handel, landbrug mv.), F.eks. Faglærte, Social-"
134970,Male,1997,"Faglig uddannelse (håndværk, handel, landbrug mv.), F.eks. Faglærte, Social-"
96513,Male,1996,"Faglig uddannelse (håndværk, handel, landbrug mv.), F.eks. Faglærte, Social-"
122274,Male,1995,"Faglig uddannelse (håndværk, handel, landbrug mv.), F.eks. Faglærte, Social-"
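
The tie-breaking is easier to follow on a small example. This sketch uses a toy data frame with made-up education labels `"A"` and `"B"` (hypothetical values, not ESS categories): rows are first grouped by `edlvddk`, and within each label the most recent birth year comes first.

```r
library(dplyr)

# Toy data (hypothetical values, not taken from the ESS file)
df <- data.frame(
  edlvddk = c("B", "A", "A", "B"),
  yrbrn   = c(1990, 1980, 2000, 1970)
)

sorted <- arrange(df, edlvddk, desc(yrbrn))

sorted$edlvddk   # "A" "A" "B" "B"
sorted$yrbrn     # 2000 1980 1990 1970
```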


### Select

`select()` is used for selecting specific columns. Notice that the columns are returned in the order in which they are listed:

In [35]:
data_male_subset <- data %>%
    filter(gndr == 'Male') %>%
    select(idno, gndr, yrbrn, edlvddk)

head(data_male_subset)

idno,gndr,yrbrn,edlvddk
110,Male,1949,"Kort videregående uddannelse af op til 2-3 års varighed, F.eks. Erhvervsakadem"
705,Male,1958,"Kort videregående uddannelse af op til 2-3 års varighed, F.eks. Erhvervsakadem"
1327,Male,2000,Folkeskole 9.-10. klasse
3760,Male,2002,Folkeskole 9.-10. klasse
5816,Male,1974,"Mellemlang videregående uddannelse af 3-4 års varighed. Professionsbachelorer,"
7887,Male,1958,"Lang videregående uddannelse. Kandidatuddannelser af 5.-6. års varighed, F.eks"


`everything()` can be used with `select()` to select all columns. Because `select()` keeps each column at the position where it is first mentioned, columns already specified are not repeated, so `everything()` can be used to move chosen columns to the front:

In [39]:
data_reordered <- select(data, idno, gndr, yrbrn, edlvddk, everything())
head(data_reordered)

idno,gndr,yrbrn,edlvddk,netustm,ppltrst,vote,prtvtddk,lvpntyr,tygrtr,eduyrs,wkhct,wkhtot,grspnum,frlgrsp,inwtm
110,Male,1949,"Kort videregående uddannelse af op til 2-3 års varighed, F.eks. Erhvervsakadem",180,8,Yes,Socialdemokratiet - The Social democrats,1968,Never too young,9,28,28,,,119
705,Male,1958,"Kort videregående uddannelse af op til 2-3 års varighed, F.eks. Erhvervsakadem",60,5,Yes,Det Konservative Folkeparti - Conservative People's Party,1976,67,22,37,45,,,55
1327,Male,2000,Folkeskole 9.-10. klasse,240,5,,,"Still in parental home, never left 2 months",,11,37,37,,,37
3760,Male,2002,Folkeskole 9.-10. klasse,300,7,Not eligible to vote,,"Still in parental home, never left 2 months",40,9,2,2,200.0,,43
4658,Female,1956,"Kort videregående uddannelse af op til 2-3 års varighed, F.eks. Erhvervsakadem",90,8,Yes,,1974,50,4,30,30,,,62
5816,Male,1974,"Mellemlang videregående uddannelse af 3-4 års varighed. Professionsbachelorer,",90,7,Yes,SF Socialistisk Folkeparti - Socialist People's Party,1994,60,35,37,37,37000.0,35000.0,61
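
The same reordering can be sketched on a toy data frame (hypothetical values and object names). The example also shows `relocate()`, which is available in `dplyr` 1.0.0 and later and moves columns without having to list the rest; whether it reads better than `select()` with `everything()` is a matter of taste.

```r
library(dplyr)

# Toy data (hypothetical values, not taken from the ESS file)
df <- data.frame(
  idno    = c(110, 705),
  netustm = c(180, 60),
  gndr    = c("Male", "Male"),
  yrbrn   = c(1949, 1958)
)

# Listed columns first, then all remaining columns in their original order
front <- select(df, idno, gndr, yrbrn, everything())
names(front)   # "idno" "gndr" "yrbrn" "netustm"

# relocate() (dplyr >= 1.0.0): move gndr and yrbrn to just after idno
moved <- relocate(df, gndr, yrbrn, .after = idno)
```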
