# <center>WORKING WITH TABLE DATA IN R - REFRESHER</center>
<img src="../elem/caldiss_symbol_square.png" width="200">


<i><center>Kristian Gade Kjelmann</center></i>
<i><center>August 20th 2020</center></i>

# How to work with the R language

Working with R means working with objects and functions.

*Objects* are containers of information and can be pretty much anything.

An object is always of a certain *class*, denoting the type of information the object contains (a numerical value, a string/text value, a dataset, a list of values/vector etc.).

*Functions* are used to manipulate objects and produce output. Functions take *arguments* as input.

**Assign objects**

Assign objects using `<-`.

In [1]:
a_number <- 42     # assign the number '42' to object 'a_number'
a_word <- "hello"  # assign the word 'hello' to object 'a_word'

**Using a function**

Functions are used by writing the function name with the arguments in `()`. Additional arguments are separated by `,`.

In [4]:
toupper(a_word)
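As a small illustration of additional arguments, the sketch below uses the base R function `round()`, which is not part of the original material. The value `pi_value` is an assumption chosen for the example. Arguments can be given by position or by name:

```r
pi_value <- 3.14159                       # example value, chosen for illustration

rounded_whole <- round(pi_value)          # one argument: rounds to a whole number
rounded_pos <- round(pi_value, 2)         # second argument given by position: 2 decimals
rounded_named <- round(pi_value, digits = 2)  # same call with the argument named

rounded_whole
rounded_pos
rounded_named
```

Naming the argument (`digits = 2`) tends to make the call easier to read than relying on position alone.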

Functions do not change the object they are used on. To keep the output of a function, it has to be assigned to an object (either a new object or the same one):

In [5]:
print(a_word)

[1] "hello"


In [6]:
another_word <- toupper(a_word)
print(another_word)

[1] "HELLO"


**Check class**

The class of an object can be checked using the function `class()`.

In [2]:
class(a_number)    # what class is a_number?
class(a_word)      # what class is a_word?

If an object is stored as an incorrect class (like a number stored as text) it can be changed using a function like `as.numeric()`:

In [7]:
another_number <- "42" 
class(another_number)   # 'another_number' is a character class (text)

another_number <- as.numeric(another_number)
class(another_number)   # 'another_number' is now a numeric class
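A conversion only works when the text actually looks like a number. The sketch below uses a hypothetical value, `"forty-two"`, to show what may happen otherwise: R returns `NA` and normally prints a warning (suppressed here with `suppressWarnings()` so the example runs quietly):

```r
not_a_number <- "forty-two"   # hypothetical text that cannot be read as a number

# as.numeric() cannot interpret the text, so the result is NA
converted <- suppressWarnings(as.numeric(not_a_number))

converted          # NA
is.na(converted)   # TRUE
```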

**Logical/boolean values**

A frequently occurring object class in R is the logical class. A logical is either `TRUE` or `FALSE`.
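Logical values typically arise from comparisons. A minimal sketch, reusing the object name `a_number` from earlier with an example vector chosen for illustration:

```r
a_number <- 42

is_large <- a_number > 40     # TRUE
is_ten <- a_number == 10      # FALSE
class(is_large)               # "logical"

# comparisons work element by element on several values at once
above_ten <- c(2, 9, 14, 42) > 10
above_ten                     # FALSE FALSE TRUE TRUE
```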

**NA**

`NA` is the R equivalent of a missing value.
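Missing values affect most calculations. The sketch below uses a hypothetical vector, `values_with_na`, with the base R functions `is.na()` and `mean()` to show how `NA` can be detected and excluded:

```r
values_with_na <- c(4, NA, 10)   # hypothetical example values

na_flags <- is.na(values_with_na)                # FALSE TRUE FALSE
mean_with_na <- mean(values_with_na)             # NA: the missing value propagates
mean_no_na <- mean(values_with_na, na.rm = TRUE) # 7: NA is removed before calculating

na_flags
mean_with_na
mean_no_na
```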

# Data structures in R

## Vectors

A `vector` is an object in R for storing several values. It is comparable to a variable/column in a dataset.

Vectors are created using `c()`:

In [1]:
my_vector <- c(2, 9, 14, 42)
my_vector

Like a variable in a dataset, a vector can only contain one type of value; i.e. all values in a vector must be of the same class.

When trying to combine values of different classes, the vector will be coerced to a class compatible with all of them (in the case below, all values will be converted to character):

In [2]:
my_mix_vector <- c(2, "hello", 14, 42)
my_mix_vector

`class()` can be used on vectors as well:

In [3]:
class(my_vector)
class(my_mix_vector)
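The coercion described above also applies to other combinations of classes. A short sketch with made-up values (the object names are only for illustration):

```r
num_and_logical <- c(1, TRUE, FALSE)   # logicals become numbers: 1 1 0
logical_and_text <- c(TRUE, "hello")   # everything becomes text: "TRUE" "hello"

class(num_and_logical)    # "numeric"
class(logical_and_text)   # "character"
```

Character is the most general class here, so mixing in a single text value turns the whole vector into text.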

Each element in a vector is given an index starting from 1. Specific values of a vector can be extracted by referring to their index inside `[]`:

In [18]:
my_mix_vector[2]  # prints element 2: "hello"
my_mix_vector[4]  # prints element 4: "42" (stored as text after coercion)

A range of elements is specified using `:`:

In [20]:
my_mix_vector[2:3]  # elements 2 to 3; both included

Several elements can be specified by inputting a vector of indexes:

In [21]:
my_mix_vector[c(1, 3)]  # elements 1 and 3: "2" and "14"

`length()` returns the number of elements:

In [22]:
length(my_mix_vector)

## Data frames

A data frame is the most common data structure in R for tabular data (rows and columns). Data frames can be created manually by combining a number of vectors:

In [5]:
my_dataframe <- data.frame(numbers = c(1, 2, 4, 6, 12), words = c("hello", "cat", "banana", "carpenter", "dog"))
my_dataframe

numbers,words
1,hello
2,cat
4,banana
6,carpenter
12,dog


Often data frames are created from existing data sets.

When importing tabular data sets in R, they will in most cases be converted to a data frame.

R also contains a range of sample data sets for exercises; like `mtcars`:

In [8]:
data(mtcars)  # this line loads the dataset "mtcars" into the R environment

`head()` prints the first six rows of the dataset. Six is the default; the number of rows to print can be changed with the second argument of the function.

In [9]:
head(mtcars)

(row names),mpg,cyl,disp,hp,drat,wt,qsec,vs,am,gear,carb
Mazda RX4,21.0,6,160,110,3.9,2.62,16.46,0,1,4,4
Mazda RX4 Wag,21.0,6,160,110,3.9,2.875,17.02,0,1,4,4
Datsun 710,22.8,4,108,93,3.85,2.32,18.61,1,1,4,1
Hornet 4 Drive,21.4,6,258,110,3.08,3.215,19.44,1,0,3,1
Hornet Sportabout,18.7,8,360,175,3.15,3.44,17.02,0,0,3,2
Valiant,18.1,6,225,105,2.76,3.46,20.22,1,0,3,1
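A brief sketch of changing the number of rows printed, using the second argument of `head()`. The base R function `tail()` is added here for comparison and is not part of the original material:

```r
data(mtcars)

first_three <- head(mtcars, 3)      # first three rows
same_three <- head(mtcars, n = 3)   # same call with the argument named
last_two <- tail(mtcars, 2)         # tail() works the same way for the last rows

first_three
last_two
```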


`summary()` provides summary statistics for all the variables:

In [13]:
summary(mtcars)

      mpg             cyl             disp             hp       
 Min.   :10.40   Min.   :4.000   Min.   : 71.1   Min.   : 52.0  
 1st Qu.:15.43   1st Qu.:4.000   1st Qu.:120.8   1st Qu.: 96.5  
 Median :19.20   Median :6.000   Median :196.3   Median :123.0  
 Mean   :20.09   Mean   :6.188   Mean   :230.7   Mean   :146.7  
 3rd Qu.:22.80   3rd Qu.:8.000   3rd Qu.:326.0   3rd Qu.:180.0  
 Max.   :33.90   Max.   :8.000   Max.   :472.0   Max.   :335.0  
      drat             wt             qsec             vs        
 Min.   :2.760   Min.   :1.513   Min.   :14.50   Min.   :0.0000  
 1st Qu.:3.080   1st Qu.:2.581   1st Qu.:16.89   1st Qu.:0.0000  
 Median :3.695   Median :3.325   Median :17.71   Median :0.0000  
 Mean   :3.597   Mean   :3.217   Mean   :17.85   Mean   :0.4375  
 3rd Qu.:3.920   3rd Qu.:3.610   3rd Qu.:18.90   3rd Qu.:1.0000  
 Max.   :4.930   Max.   :5.424   Max.   :22.90   Max.   :1.0000  
       am              gear            carb      
 Min.   :0.0000   Min.   :3.000  

`nrow()` and `ncol()` return the number of rows and columns, respectively:

In [26]:
nrow(mtcars)
ncol(mtcars)

`colnames()` returns the column names (names of the variables):

In [28]:
colnames(mtcars)

Specific columns can be referenced using `$`. A column referenced this way works just like a vector:

In [25]:
mtcars$wt[4]  # 4th value in the column "wt" (the value in the 4th row)
length(mtcars$wt)  # number of elements in wt (corresponding to the number of rows)

As with vectors, square brackets `[]` are used for subsetting. The format is `[rows, columns]`. Columns can be referenced either by their number (column index) or by their name:

In [27]:
mtcars[10:15, c('wt', 'gear')]  # row 10-15, column wt and gear - returns as a data frame

(row names),wt,gear
Merc 280,3.44,4
Merc 280C,3.44,4
Merc 450SE,4.07,3
Merc 450SL,3.73,3
Merc 450SLC,3.78,3
Cadillac Fleetwood,5.25,3
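The same subset can be written with column indexes instead of names. A minimal sketch, assuming the standard column order of `mtcars` (where `wt` is column 6 and `gear` is column 10):

```r
data(mtcars)

colnames(mtcars)[c(6, 10)]                  # "wt" "gear"

by_index <- mtcars[10:15, c(6, 10)]         # columns referenced by number
by_name <- mtcars[10:15, c("wt", "gear")]   # columns referenced by name
identical(by_index, by_name)                # TRUE

all_columns <- mtcars[1:3, ]   # leaving the column position empty keeps all columns
```

Referencing by name is usually more robust, since the code keeps working if the column order changes.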


# Packages

- Installing 
- Loading

R is open source, and a lot of functionality is developed by various contributors. Additional functionality can be added to R through "packages".

Packages are installed with `install.packages("packagename")` (note that the package name must be in quotes):

In [29]:
install.packages('readr') # installs the package 'readr' for reading various data files

Installing package into 'D:/ProgramData/R/packages'
(as 'lib' is unspecified)


package 'readr' successfully unpacked and MD5 sums checked

The downloaded binary packages are in
	C:\Users\kjelm\AppData\Local\Temp\RtmpUZUIEl\downloaded_packages


Once installed, the package can be loaded using `library(packagename)`. A package must be loaded before its functions are available.

In [30]:
library(readr)

"package 'readr' was built under R version 3.6.3"

# (Re-)Introducing tidyverse

[`tidyverse`](https://www.tidyverse.org/) is a collection of packages for various data handling tasks.

These include:

- `readr`: for reading data into R
- `dplyr`: for data management
- `lubridate`: for handling dates
- `stringr`: for handling strings (text values)

## `readr`

`readr` contains various functions for reading data into R. 

The function `read_csv()` is used for loading a csv file into R. It works on both local files and URLs. The loaded data must be assigned to an object. Otherwise R will just print the dataset in the console.

In [31]:
ess_data <- read_csv("https://github.com/CALDISS-AAU/workshop_r-table-data/raw/master/data/ess2014_mainsub_p1.csv")

Parsed with column specification:
cols(
  idno = col_double(),
  ppltrst = col_character(),
  polintr = col_character(),
  vote = col_character(),
  lrscale = col_character(),
  happy = col_character(),
  health = col_character(),
  cgtsday = col_double(),
  cgtsmke = col_character(),
  alcfreq = col_character(),
  brncntr = col_character(),
  height = col_double(),
  weight = col_double(),
  gndr = col_character(),
  yrbrn = col_double(),
  edlvddk = col_character(),
  marsts = col_character(),
  polpartvt = col_character()
)


In [32]:
head(ess_data)

idno,ppltrst,polintr,vote,lrscale,happy,health,cgtsday,cgtsmke,alcfreq,brncntr,height,weight,gndr,yrbrn,edlvddk,marsts,polpartvt
921018,6,Hardly interested,Not eligible to vote,4,9,Very good,10.0,I smoke but not every day,2-3 times a month,Yes,178,64,Male,1990,Folkeskole 6.-8. klasse,None of these (NEVER married or in legally registered civil union),[NA] Not applicable
921026,8,Quite interested,Yes,4,8,Very good,,I have never smoked,Several times a week,Yes,172,64,Female,1948,"Mellemlang videregående uddannelse af 3-4 års varighed. Professionsbachelorer,",Widowed/civil partner died,[1] Socialdemokraterne - the Danish social democrats
921034,8,Quite interested,Yes,7,8,Good,,I don't smoke now but I used to,Every day,Yes,176,87,Male,1957,"Kort videregående uddannelse af op til 2-3 års varighed, F.eks. Erhvervsakadem",Not applicable,"[7] Venstre, Danmarks Liberale Parti - Venstre"
921181,9,Quite interested,Yes,5,9,Fair,,I don't smoke now but I used to,Once a week,Yes,194,102,Male,1956,"Faglig uddannelse (håndværk, handel, landbrug mv.), F.eks. Faglærte, Social-",Not applicable,[2] Det Radikale Venstre - Danish Social-Liberal Party
921204,9,Hardly interested,Yes,7,8,Good,,I don't smoke now but I used to,Once a week,No,157,48,Female,1941,"Kort videregående uddannelse af op til 2-3 års varighed, F.eks. Erhvervsakadem",Not applicable,[NA] Don't know
921262,8,Hardly interested,Yes,7,8,Very good,,I have never smoked,2-3 times a month,Yes,180,93,Male,1987,"Mellemlang videregående uddannelse af 3-4 års varighed. Professionsbachelorer,",None of these (NEVER married or in legally registered civil union),"[7] Venstre, Danmarks Liberale Parti - Venstre"
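`read_csv()` works the same way on a local file. The sketch below is an illustration under stated assumptions: it writes a small made-up CSV file to a temporary location (the file contents and the object names `tmp_file` and `small_data` are hypothetical) and reads it back in. The optional `col_types` argument states which class each column should have instead of letting `readr` guess:

```r
library(readr)

# write a small hypothetical CSV file to a temporary location
tmp_file <- tempfile(fileext = ".csv")
writeLines(c("idno,height,gndr", "1,178,Male", "2,172,Female"), tmp_file)

# read the local file; col_types specifies the class of each column
small_data <- read_csv(tmp_file, col_types = cols(
    idno = col_double(),
    height = col_double(),
    gndr = col_character()
))

small_data
```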


# Data wrangling with `dplyr`

The package `dplyr` contains various functions that ease data management tasks.

In [36]:
library(dplyr)

`select()` is used for selecting specific variables/columns:

In [44]:
ess_subset <- select(ess_data, idno, polintr, happy)  # selects variables idno, polintr and happy
head(ess_subset, 3)

idno,polintr,happy
921018,Hardly interested,9
921026,Quite interested,8
921034,Quite interested,8


Note that the dataset to subset has to be the first argument specified, as any number of datasets can be loaded into R at the same time.

`filter()` is used to filter observations based on a certain condition:

In [45]:
ess_subset <- filter(ess_data, yrbrn > 1985)  # filter observations born after 1985
head(ess_subset, 3)

idno,ppltrst,polintr,vote,lrscale,happy,health,cgtsday,cgtsmke,alcfreq,brncntr,height,weight,gndr,yrbrn,edlvddk,marsts,polpartvt
921018,6,Hardly interested,Not eligible to vote,4,9,Very good,10.0,I smoke but not every day,2-3 times a month,Yes,178,64,Male,1990,Folkeskole 6.-8. klasse,None of these (NEVER married or in legally registered civil union),[NA] Not applicable
921262,8,Hardly interested,Yes,7,8,Very good,,I have never smoked,2-3 times a month,Yes,180,93,Male,1987,"Mellemlang videregående uddannelse af 3-4 års varighed. Professionsbachelorer,",None of these (NEVER married or in legally registered civil union),"[7] Venstre, Danmarks Liberale Parti - Venstre"
921319,8,Very interested,Not eligible to vote,8,Extremely happy,Fair,10.0,I smoke daily,Once a week,Yes,185,82,Male,1997,Folkeskole 9.-10. klasse,None of these (NEVER married or in legally registered civil union),[NA] Not applicable
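`filter()` also accepts several conditions. The sketch below uses a small made-up data frame (`toy_data` is hypothetical and only loosely modelled on the ESS columns) so that it runs without downloading the data:

```r
library(dplyr)

# hypothetical toy data, loosely modelled on the ESS columns
toy_data <- data.frame(
    idno = 1:5,
    yrbrn = c(1990, 1948, 1987, 1997, 1975),
    gndr = c("Male", "Female", "Male", "Male", "Female"),
    stringsAsFactors = FALSE
)

# conditions separated by a comma must all be true (AND)
young_men <- filter(toy_data, yrbrn > 1985, gndr == "Male")

# with | at least one of the conditions must be true (OR)
young_or_women <- filter(toy_data, yrbrn > 1985 | gndr == "Female")

young_men
young_or_women
```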


`arrange()` is used for ordering the data based on one or several columns:

In [47]:
ess_ordered <- arrange(ess_data, height)  # ordered by height, lowest first
head(ess_ordered, 3)

idno,ppltrst,polintr,vote,lrscale,happy,health,cgtsday,cgtsmke,alcfreq,brncntr,height,weight,gndr,yrbrn,edlvddk,marsts,polpartvt
939530,7,Very interested,Yes,8,8,Bad,,I don't smoke now but I used to,Several times a week,No,149,56,Female,1983,"Kort videregående uddannelse af op til 2-3 års varighed, F.eks. Erhvervsakadem",None of these (NEVER married or in legally registered civil union),[2] Det Radikale Venstre - Danish Social-Liberal Party
926246,5,Hardly interested,Yes,2,5,Fair,,I have never smoked,Never,No,150,53,Female,1969,"Gymnasielle uddannelser, studentereksamen, HF, HHX, HTX",Legally divorced/civil union dissolved,[1] Socialdemokraterne - the Danish social democrats
931697,8,Very interested,Yes,9,9,Very good,,I have never smoked,2-3 times a month,Yes,150,52,Female,1946,"Mellemlang videregående uddannelse af 3-4 års varighed. Professionsbachelorer,",Not applicable,"[7] Venstre, Danmarks Liberale Parti - Venstre"


`desc()` can be used in combination with `arrange()` to sort descending:

In [48]:
ess_ordered <- arrange(ess_data, desc(height))  # ordered by height, highest first
head(ess_ordered, 3)

idno,ppltrst,polintr,vote,lrscale,happy,health,cgtsday,cgtsmke,alcfreq,brncntr,height,weight,gndr,yrbrn,edlvddk,marsts,polpartvt
921408,7,Quite interested,Yes,8,9,Fair,1.0,I smoke but not every day,Several times a week,Yes,203,104,Male,1977,"Lang videregående uddannelse. Kandidatuddannelser af 5.-6. års varighed, F.eks",Not applicable,"[7] Venstre, Danmarks Liberale Parti - Venstre"
949999,8,Hardly interested,Not eligible to vote,4,4,Good,,I have only smoked a few times,Several times a week,No,200,88,Male,1994,"Gymnasielle uddannelser, studentereksamen, HF, HHX, HTX",None of these (NEVER married or in legally registered civil union),[NA] Not applicable
926149,Most people can be trusted,Quite interested,Yes,5,5,Very good,,I have never smoked,Several times a week,Yes,198,105,Male,1956,"Lang videregående uddannelse. Kandidatuddannelser af 5.-6. års varighed, F.eks",Not applicable,[2] Det Radikale Venstre - Danish Social-Liberal Party
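Ordering by several columns can be sketched as follows, again with a small made-up data frame (`toy_data` is hypothetical). Rows are ordered by the first column given, and ties are ordered by the next:

```r
library(dplyr)

# hypothetical toy data
toy_data <- data.frame(
    idno = 1:4,
    gndr = c("Male", "Female", "Male", "Female"),
    height = c(178, 172, 194, 157),
    stringsAsFactors = FALSE
)

# order by gender first, then by height (highest first) within each gender
toy_ordered <- arrange(toy_data, gndr, desc(height))
toy_ordered
```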


## The pipe

All `tidyverse` packages (like `dplyr`) support using the pipe `%>%`.

The pipe allows one to chain commands together. When using the pipe, the result of whatever is to the left of the pipe is passed on as the first argument of the next function:

In [50]:
ess_subset <- ess_data %>%  # ess_data is "fed" into the pipe; the final result is stored in ess_subset
    select(idno, polintr, happy, yrbrn, height) %>% # 5 variables selected. the first argument is omitted as it is given through the pipe
    filter(yrbrn > 1985) %>% # observations filtered
    arrange(desc(height))  # ordered by height, descending

head(ess_subset, 3)

idno,polintr,happy,yrbrn,height
949999,Hardly interested,4,1994,200
948715,Very interested,6,1993,198
924113,Quite interested,8,1997,197
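To illustrate what the pipe does, the sketch below writes the same chain of commands twice on a small made-up data frame (`toy_data` is hypothetical): once as nested function calls and once with the pipe. Both versions should give the same result:

```r
library(dplyr)

# hypothetical toy data
toy_data <- data.frame(
    idno = 1:4,
    yrbrn = c(1990, 1948, 1987, 1997),
    height = c(178, 172, 194, 185)
)

# without the pipe: functions are nested and read from the inside out
nested <- arrange(filter(select(toy_data, idno, yrbrn, height), yrbrn > 1985), desc(height))

# with the pipe: the same steps read from top to bottom
piped <- toy_data %>%
    select(idno, yrbrn, height) %>%
    filter(yrbrn > 1985) %>%
    arrange(desc(height))

identical(nested, piped)
```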


## Recoding and new variables

`mutate()` is used both for recoding and creating new variables.

Creating an age variable based on `yrbrn` (data is from 2014):

In [52]:
ess_data %>%    # note no object assignment meaning this change is not stored
    mutate(age = 2014 - yrbrn) %>%
    select(idno, yrbrn, age) %>%
    head(3)

idno,yrbrn,age
921018,1990,24
921026,1948,66
921034,1957,57


`recode()` is used to change single values. `.default = ` sets the value for values not specified:

In [54]:
ess_data %>% #note that this code also recodes missing
    mutate(new_alcfreq = recode(alcfreq, "Every day" = "DAILY DRINKER", "Once a week" = "WEEKLY DRINKER", 
                            .default = "IRRELEVANT")) %>%
    select(idno, alcfreq, new_alcfreq) %>%
    head(4)

idno,alcfreq,new_alcfreq
921018,2-3 times a month,IRRELEVANT
921026,Several times a week,IRRELEVANT
921034,Every day,DAILY DRINKER
921181,Once a week,WEEKLY DRINKER


`if_else()` is used to recode based on logicals/booleans:

In [57]:
ess_data %>% #note that this code also recodes missing
    mutate(new_health = if_else(health == "Very good", "HEALTHY PERSON", "LESS HEALTHY PERSON")) %>%
    select(idno, health, new_health) %>%
    head(3)

idno,health,new_health
921018,Very good,HEALTHY PERSON
921026,Very good,HEALTHY PERSON
921034,Good,LESS HEALTHY PERSON
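When the condition itself is `NA` (a true missing value, as opposed to a missing category stored as text), `if_else()` returns `NA` unless a value is given in the `missing` argument. A minimal sketch with a made-up vector (the values and the label `"UNKNOWN"` are hypothetical):

```r
library(dplyr)

health <- c("Very good", "Good", NA)   # hypothetical values, including a true NA

# the NA stays NA
without_missing <- if_else(health == "Very good", "HEALTHY PERSON", "LESS HEALTHY PERSON")

# the 'missing' argument sets the value used where the condition is NA
with_missing <- if_else(health == "Very good", "HEALTHY PERSON", "LESS HEALTHY PERSON",
                        missing = "UNKNOWN")

without_missing
with_missing
```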


`case_when()` is used for specifying several logical conditions. Conditions are checked in order, and the first one that is true determines the new value:

In [80]:
ess_data %>%
    mutate(new_health = case_when(
        health == "Very good" ~ "healthy", 
        health == "Good" ~ "healthy",
        health == "Fair" ~ "healthy",
        health == "Bad" ~ "unhealthy",
        health == "Very bad" ~ "unhealthy", 
        TRUE ~ health)) %>% #This line keeps remaining values as they are
    select(idno, health, new_health, yrbrn) %>%
    filter(yrbrn > 1990) %>%
    head(5)

idno,health,new_health,yrbrn
921319,Fair,healthy,1997
921636,Very good,healthy,1996
922129,Good,healthy,1998
922454,Good,healthy,1991
922941,Bad,unhealthy,1991
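Conditions that lead to the same value can be combined with the base R operator `%in%`, which is not used in the original material. The sketch below shows one possible shorter version of the recode on a small made-up data frame (`toy_data` is hypothetical):

```r
library(dplyr)

# hypothetical toy data
toy_data <- data.frame(
    health = c("Very good", "Fair", "Bad", "Don't know"),
    stringsAsFactors = FALSE
)

toy_recoded <- toy_data %>%
    mutate(new_health = case_when(
        health %in% c("Very good", "Good", "Fair") ~ "healthy",
        health %in% c("Bad", "Very bad") ~ "unhealthy",
        TRUE ~ health))   # keeps remaining values as they are

toy_recoded
```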


# Refresher exercise

This exercise will have you apply a variety of the functions taught in this refresher for reading and handling table data in R.

1. Use `read_csv()` to read the dataset `ess2014_mainsub_p1.csv` from the URL: https://github.com/CALDISS-AAU/workshop_r-table-data/raw/master/data/ess2014_mainsub_p1.csv.
    - Remember to assign the dataset to an object.


2. Inspect the data set using `head()`. What does the data seem to contain? How many rows and columns? (`dim()`, `nrow()`, `ncol()`). 


3. Use `mutate()` to create an age variable from the variable `yrbrn` (data is from 2014).


4. Use `mutate()` with `recode()` or `case_when()` to create a dummy variable indicating whether the respondent is a smoker or non-smoker (use either `cgtsday` or `cgtsmke`).
    - Tip: Use `unique()` to see the unique values of a variable.


5. Create a subset of the data set containing only smokers over the age of 40. How many observations are in the subset?

In [8]:
library(readr)
library(dplyr)

ess_data_smokers <- read_csv("https://github.com/CALDISS-AAU/workshop_r-table-data/raw/master/data/ess2014_mainsub_p1.csv") %>%
    mutate(age = 2014 - yrbrn,
          smoker = ifelse(is.na(cgtsday), "non-smoker", "smoker")) %>%
    filter(age > 40 & smoker == "smoker")

head(ess_data_smokers)

Parsed with column specification:
cols(
  idno = col_integer(),
  ppltrst = col_character(),
  polintr = col_character(),
  vote = col_character(),
  lrscale = col_character(),
  happy = col_character(),
  health = col_character(),
  cgtsday = col_integer(),
  cgtsmke = col_character(),
  alcfreq = col_character(),
  brncntr = col_character(),
  height = col_integer(),
  weight = col_double(),
  gndr = col_character(),
  yrbrn = col_integer(),
  edlvddk = col_character(),
  marsts = col_character(),
  polpartvt = col_character()
)


idno,ppltrst,polintr,vote,lrscale,happy,health,cgtsday,cgtsmke,alcfreq,brncntr,height,weight,gndr,yrbrn,edlvddk,marsts,polpartvt,age,smoker
921678,4,Not at all interested,Yes,5,Extremely happy,Very good,10,I smoke daily,Every day,Yes,178,65,Male,1946,Folkeskole 6.-8. klasse,Legally divorced/civil union dissolved,[1] Socialdemokraterne - the Danish social democrats,68,smoker
921911,5,Hardly interested,Yes,6,1,Bad,20,I smoke daily,Every day,Yes,175,98,Male,1969,Folkeskole 6.-8. klasse,Legally divorced/civil union dissolved,[2] Det Radikale Venstre - Danish Social-Liberal Party,45,smoker
922098,4,Hardly interested,Yes,8,9,Very good,3,I smoke but not every day,Less than once a month,Yes,186,69,Male,1947,"Mellemlang videregående uddannelse af 3-4 års varighed. Professionsbachelorer,",Not applicable,[5] Dansk Folkeparti - Danish peoples party,67,smoker
922284,7,Quite interested,Yes,3,8,Good,6,I smoke but not every day,Every day,Yes,176,77,Male,1947,"Faglig uddannelse (håndværk, handel, landbrug mv.), F.eks. Faglærte, Social-",Not applicable,[1] Socialdemokraterne - the Danish social democrats,67,smoker
922593,Most people can be trusted,Quite interested,Yes,7,8,Fair,15,I smoke daily,Every day,Yes,163,50,Female,1968,"Mellemlang videregående uddannelse af 3-4 års varighed. Professionsbachelorer,",Not applicable,[3] Det Konservative Folkeparti - Conservative,46,smoker
922983,6,Hardly interested,Yes,Don't know,7,Good,15,I smoke daily,Once a week,Yes,160,54,Female,1947,"Faglig uddannelse (håndværk, handel, landbrug mv.), F.eks. Faglærte, Social-",Widowed/civil partner died,[NA] Refusal,67,smoker


In [9]:
dim(ess_data_smokers)