In [1]:
# Setup

library(tidyverse)
data <- read_csv("https://github.com/CALDISS-AAU/workshop_R-intro/raw/master/data/ESS2018DK_subset.csv")

Registered S3 methods overwritten by 'ggplot2':
  method         from 
  [.quosures     rlang
  c.quosures     rlang
  print.quosures rlang
Registered S3 method overwritten by 'rvest':
  method            from
  read_xml.response xml2
-- Attaching packages --------------------------------------- tidyverse 1.2.1 --
v ggplot2 3.1.1       v purrr   0.3.2  
v tibble  2.1.1       v dplyr   0.8.0.1
v tidyr   0.8.3       v stringr 1.4.0  
v readr   1.3.1       v forcats 0.4.0  
-- Conflicts ------------------------------------------ tidyverse_conflicts() --
x dplyr::filter() masks stats::filter()
x dplyr::lag()    masks stats::lag()
Parsed with column specification:
cols(
  idno = col_double(),
  netustm = col_double(),
  ppltrst = col_double(),
  vote = col_character(),
  prtvtddk = col_character(),
  lvpntyr = col_character(),
  tygrtr = col_character(),
  gndr = col_character(),
  yrbrn = col_double(),
  edlvddk = col_character(),
  eduyrs = col_double(),
  wkhct = col_double(),
  wkhtot =

# Summarizing data

Data can be summarized with base R's table functions. Alternatively, `group_by()` and `summarise()` from `dplyr` create summaries at group level.

## Tables

Frequency and contingency tables can be created in several ways in R, depending on the kind of table you want.

R has a few built-in functions based around the `table()` function. The `table()` function is used for creating a table object that can then be manipulated.

Specifying a single variable creates a one-dimensional frequency table:

In [2]:
table(data$gndr)


Female   Male 
   726    846 

Specifying two variables creates a crosstable of counts of every combination:

In [3]:
table(data$gndr, data$vote)

        
          No Not eligible to vote Yes
  Female  34                   63 626
  Male    60                   77 706

The functions `margin.table()` and `prop.table()` calculate marginal frequencies and proportions, respectively. Both take a table object as input; the second argument selects the margin (`1` for rows, `2` for columns).

In [4]:
ess_table <- table(data$gndr, data$vote) # creating table object (gndr as rows, vote as columns)

margin.table(ess_table, 1) # gndr frequencies (row frequencies)


Female   Male 
   723    843 
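
These row totals (723 and 843) are lower than the one-way counts shown earlier (726 and 846) because `table()` drops missing values by default, and a few respondents have no recorded `vote`. Below is a minimal sketch of how the `useNA` argument keeps them visible. The data frame `toy` and its values are made up for illustration and are not taken from the ESS data.

```r
# Hypothetical toy data: two of the five respondents have a missing vote
toy <- data.frame(
  gndr = c("Female", "Female", "Male", "Male", "Male"),
  vote = c("Yes", NA, "No", "Yes", NA),
  stringsAsFactors = FALSE
)

# Default behaviour: rows with NA in either variable are left out
tab_default <- table(toy$gndr, toy$vote)

# useNA = "ifany" adds an NA column (or row) when missing values exist
tab_with_na <- table(toy$gndr, toy$vote, useNA = "ifany")

sum(tab_default)  # counts only complete gndr/vote pairs
sum(tab_with_na)  # counts every row of toy
tab_with_na
```

Whether the `NA` category belongs in a table depends on the analysis; it is shown here only to explain the difference in totals.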

In [5]:
margin.table(ess_table, 2) # vote frequencies (column frequencies)


                  No Not eligible to vote                  Yes 
                  94                  140                 1332 

In [6]:
prop.table(ess_table, 1) # row proportions (each gndr row sums to 1)

        
                 No Not eligible to vote        Yes
  Female 0.04702628           0.08713693 0.86583679
  Male   0.07117438           0.09134045 0.83748517

In [7]:
prop.table(ess_table, 2) # column proportions (each vote column sums to 1)

        
                No Not eligible to vote       Yes
  Female 0.3617021            0.4500000 0.4699700
  Male   0.6382979            0.5500000 0.5300300
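
`prop.table()` returns proportions between 0 and 1, not percentages. The sketch below shows one way to convert them to rounded percentages, and how `addmargins()` appends totals to a table of counts. To stay self-contained, it rebuilds the counts by hand from the `gndr` by `vote` crosstable shown earlier; the object names `ess_counts`, `row_pct` and `with_totals` are illustrative.

```r
# Counts copied from the gndr-by-vote crosstable (missing votes excluded)
ess_counts <- as.table(matrix(
  c(34, 60, 63, 77, 626, 706),   # filled column by column
  nrow = 2,
  dimnames = list(
    gndr = c("Female", "Male"),
    vote = c("No", "Not eligible to vote", "Yes")
  )
))

# Row percentages, rounded to one decimal
row_pct <- round(100 * prop.table(ess_counts, 1), 1)
row_pct

# Counts with a "Sum" row and a "Sum" column appended
with_totals <- addmargins(ess_counts)
with_totals
```

Because of rounding, a row of percentages may add up to slightly more or less than 100.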

### The `CrossTable()` function (part of `gmodels`)

The package `gmodels` contains the function `CrossTable()`.

`CrossTable` combines the various table functionalities in base R for an easier way to create crosstables. It also makes it easier to include various tests of independence.

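The line below creates a crosstable of `vote` by `gndr` that shows column proportions only (`prop.c = TRUE`, with the row, total and chi-squared-contribution proportions switched off) and requests a chi-squared test of independence (`chisq = TRUE`).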

In [8]:
library(gmodels)

CrossTable(data$vote, data$gndr, prop.r = FALSE, prop.c = TRUE, prop.t = FALSE, prop.chisq = FALSE, chisq = TRUE)


 
   Cell Contents
|-------------------------|
|                       N |
|           N / Col Total |
|-------------------------|

 
Total Observations in Table:  1566 

 
                     | data$gndr 
           data$vote |    Female |      Male | Row Total | 
---------------------|-----------|-----------|-----------|
                  No |        34 |        60 |        94 | 
                     |     0.047 |     0.071 |           | 
---------------------|-----------|-----------|-----------|
Not eligible to vote |        63 |        77 |       140 | 
                     |     0.087 |     0.091 |           | 
---------------------|-----------|-----------|-----------|
                 Yes |       626 |       706 |      1332 | 
                     |     0.866 |     0.837 |           | 
---------------------|-----------|-----------|-----------|
        Column Total |       723 |       843 |      1566 | 
                     |     0.462 |     0.538 |           | 
----------------
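
A chi-squared test of independence can also be run without `gmodels`, using base R's `chisq.test()` on a table object. The sketch below rebuilds the counts by hand from the crosstable above so that it runs on its own; with the data loaded, `chisq.test(table(data$vote, data$gndr))` should be the equivalent call. The object names are illustrative.

```r
# Counts copied from the vote-by-gndr crosstable above
vote_by_gndr <- as.table(matrix(
  c(34, 63, 626,    # Female column
    60, 77, 706),   # Male column
  nrow = 3,
  dimnames = list(
    vote = c("No", "Not eligible to vote", "Yes"),
    gndr = c("Female", "Male")
  )
))

# Pearson chi-squared test of independence between vote and gndr
res <- chisq.test(vote_by_gndr)

res$statistic  # test statistic
res$parameter  # degrees of freedom: (rows - 1) * (columns - 1)
res$p.value    # p-value
res$expected   # expected counts under independence
```

Checking `res$expected` is a quick way to see whether the cells are large enough for the chi-squared approximation to be reasonable.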

## Grouped summaries

`group_by()` is part of the `dplyr` package. It is used together with `summarise()` to compute summary statistics for each group.

Below, the mean time spent on the internet per day is calculated for each gender:

In [9]:
data %>%
    group_by(gndr) %>%
    summarise(mean_internettime = mean(netustm, na.rm = TRUE))

gndr,mean_internettime
Female,227.5647
Male,236.0696


Several summary statistics can be created for the same grouping:

In [10]:
data %>%
    group_by(gndr) %>%
    mutate(age = 2018 - yrbrn) %>%
    summarise(mean_age = mean(age),
             mean_internettime = mean(netustm, na.rm = TRUE),
             count = n())

gndr,mean_age,mean_internettime,count
Female,50.16529,227.5647,726
Male,49.40189,236.0696,846


Observations can be grouped by several variables:

In [11]:
data %>%
    group_by(gndr, vote) %>%
    mutate(age = 2018 - yrbrn) %>%
    summarise(mean_age = mean(age),
             mean_internettime = mean(netustm, na.rm = TRUE),
             count = n())

gndr,vote,mean_age,mean_internettime,count
Female,,56.0,240.0,3
Female,No,45.32353,280.0417,34
Female,Not eligible to vote,26.34921,299.4839,63
Female,Yes,52.79712,217.0458,626
Male,,39.66667,272.6667,3
Male,No,41.7,295.4118,60
Male,Not eligible to vote,23.57143,324.3421,77
Male,Yes,52.91501,220.4675,706
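
The rows with an empty `vote` value in the output above appear to be respondents whose `vote` is missing (`NA`); `group_by()` treats `NA` as a group of its own. If those rows are not wanted, they can be filtered out before grouping. The sketch below assumes a small made-up data frame (`toy` is hypothetical, not the ESS data).

```r
library(dplyr)

# Hypothetical toy data with one missing vote and one missing netustm
toy <- data.frame(
  gndr    = c("Female", "Female", "Male", "Male", "Male"),
  vote    = c("Yes", NA, "No", "Yes", "Yes"),
  netustm = c(120, 60, 90, 200, NA),
  stringsAsFactors = FALSE
)

summary_tbl <- toy %>%
  filter(!is.na(vote)) %>%          # drop respondents without a vote value
  group_by(gndr, vote) %>%
  summarise(mean_internettime = mean(netustm, na.rm = TRUE),
            count = n()) %>%        # n() counts rows, including NA netustm
  ungroup()                         # remove any grouping left by summarise()

summary_tbl
```

Note that `count` still includes respondents whose `netustm` is missing, since `na.rm = TRUE` only affects the mean.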


### Tabulating with `tidyverse`

There are various ways of creating tables and cross-tables using functions from the tidyverse.

`count()` (part of `dplyr`) can be used for frequency tables:

In [12]:
library(dplyr)

data %>%
    count(gndr)

gndr,n
Female,726
Male,846
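
`count()` also accepts several variables, and its output is an ordinary data frame, so proportions can be added with `mutate()`. A minimal sketch on made-up data (`toy` and the column name `prop` are illustrative):

```r
library(dplyr)

# Hypothetical toy data
toy <- data.frame(
  gndr = c("Female", "Female", "Male", "Male", "Male"),
  vote = c("Yes", "No", "Yes", "Yes", "No"),
  stringsAsFactors = FALSE
)

# One-way frequency table with a proportion column
gndr_freq <- toy %>%
  count(gndr) %>%
  mutate(prop = n / sum(n))

# Two-way counts in long format: one row per gndr/vote combination
gndr_vote_freq <- toy %>%
  count(gndr, vote)

gndr_freq
gndr_vote_freq
```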


Crosstables can be created by combining `group_by()` summaries with `pivot_wider()` (available in `tidyr` 1.0.0 and later):

In [13]:
library(tidyr)

data %>%
    group_by(gndr, vote) %>%
    summarise(n = n()) %>%
    pivot_wider(names_from = gndr, values_from = n)

vote,Female,Male
No,34,60
Not eligible to vote,63,77
Yes,626,706
,3,3
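
A shorter variant of the same idea uses `count()`, which does the work of `group_by()` plus `summarise(n = n())`. Combinations that never occur in the data would otherwise show up as `NA` in the wide table; the `values_fill` argument of `pivot_wider()` can set them to zero instead. The sketch below uses made-up data in which no woman answered "No" (`toy` and `crosstab` are hypothetical names), and assumes `tidyr` 1.0.0 or later.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data: the Female/No combination does not occur
toy <- data.frame(
  gndr = c("Female", "Female", "Male", "Male"),
  vote = c("Yes", "Yes", "No", "Yes"),
  stringsAsFactors = FALSE
)

crosstab <- toy %>%
  count(vote, gndr) %>%                       # long format: vote, gndr, n
  pivot_wider(names_from = gndr,
              values_from = n,
              values_fill = list(n = 0L))     # empty cells become 0, not NA

crosstab
```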
