# Descriptive

- Summaries
- Tables

# Statistics

- Models
- Output (`stargazer`)

In [8]:
library(readr)
library(dplyr)

ess_data <- read_csv("https://github.com/CALDISS-AAU/workshop_r-table-data/raw/master/data/ess2014_mainsub_p1.csv") %>%
    mutate(bmi = weight / (height/100)^2,
          age = 2014 - yrbrn)

Parsed with column specification:
cols(
  idno = col_double(),
  ppltrst = col_character(),
  polintr = col_character(),
  vote = col_character(),
  lrscale = col_character(),
  happy = col_character(),
  health = col_character(),
  cgtsday = col_double(),
  cgtsmke = col_character(),
  alcfreq = col_character(),
  brncntr = col_character(),
  height = col_double(),
  weight = col_double(),
  gndr = col_character(),
  yrbrn = col_double(),
  edlvddk = col_character(),
  marsts = col_character(),
  polpartvt = col_character()
)


# Summaries

`group_by()`, part of the `dplyr` package, is used together with `summarise()` to create summary statistics per group.

Below, the mean age per gender is calculated and displayed:

In [22]:
ess_data %>%
    group_by(gndr) %>%
    summarise(mean_age = mean(age))

gndr,mean_age
Female,49.38244
Male,47.37437


Several summary statistics can be created for the same grouping:

In [26]:
ess_data %>%
    group_by(gndr) %>%
    mutate(cgtsday = ifelse(is.na(cgtsday), 0, cgtsday)) %>%
    summarise(mean_age = mean(age),
             mean_cgs = mean(cgtsday, na.rm = TRUE),
             count = n())

gndr,mean_age,mean_cgs,count
Female,49.38244,2.082153,353
Male,47.37437,3.155779,398
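The cell above recodes missing `cgtsday` values to 0 before averaging, which treats a missing value as "smokes zero cigarettes". A minimal sketch of how that choice changes the mean, using a made-up toy data frame (the names `toy`, `cigs_per_day` and the values are assumptions for illustration, not taken from the ESS data):

```r
library(dplyr)

# Hypothetical toy data: NA in cigs_per_day is assumed to mean "does not smoke"
toy <- data.frame(
  gender = c("Female", "Female", "Male", "Male"),
  cigs_per_day = c(10, NA, 20, NA)
)

toy_summary <- toy %>%
  group_by(gender) %>%
  summarise(
    # NA rows are dropped: average among those with a recorded value
    mean_recorded_only = mean(cigs_per_day, na.rm = TRUE),
    # NA rows are counted as 0: average across everyone in the group
    mean_na_as_zero = mean(coalesce(cigs_per_day, 0)),
    count = n()
  )

toy_summary
```

Which of the two means is appropriate depends on what a missing value represents in the data.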


Observations can be grouped based on several variables:

In [33]:
ess_data %>%
    group_by(gndr, brncntr) %>%
    mutate(cgtsday = ifelse(is.na(cgtsday), 0, cgtsday)) %>%
    summarise(mean_age = mean(age),
             mean_cgs = mean(cgtsday, na.rm = TRUE),
             count = n())

gndr,brncntr,mean_age,mean_cgs,count
Female,No,48.94118,1.617647,34
Female,Yes,49.42947,2.131661,319
Male,No,43.29167,3.25,24
Male,Yes,47.63636,3.149733,374


## Tables

Frequency and contingency tables can be created in several ways in R, depending on the kind of table you want to make.

R has a few built-in functions based around the `table()` function. The `table()` function is used for creating a table object that can then be manipulated.

Specifying a single variable creates a one-dimensional frequency table:

In [44]:
table(ess_data$gndr)


Female   Male 
   353    398 

Specifying two variables creates a crosstable of counts of every combination:

In [47]:
table(ess_data$gndr, ess_data$brncntr)

        
          No Yes
  Female  34 319
  Male    24 374

The functions `margin.table()` and `prop.table()` are used for calculating marginal frequencies and proportions respectively. They both accept a table object as input. The second argument selects the margin: `1` for rows, `2` for columns.

In [52]:
ess_table <- table(ess_data$gndr, ess_data$brncntr) # creating table object (gndr as rows, brncntr as columns)

margin.table(ess_table, 1) # gndr frequencies (row frequencies)


Female   Male 
   353    398 

In [53]:
margin.table(ess_table, 2) # brncntr frequencies (column frequencies)


 No Yes 
 58 693 

In [55]:
prop.table(ess_table, 1) # proportions within each gndr (each row sums to 1)

        
                 No        Yes
  Female 0.09631728 0.90368272
  Male   0.06030151 0.93969849

In [56]:
prop.table(ess_table, 2) # proportions within each brncntr (each column sums to 1)

        
                No       Yes
  Female 0.5862069 0.4603175
  Male   0.4137931 0.5396825
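`prop.table()` returns proportions between 0 and 1. A small sketch of turning them into rounded percentages and adding totals with `addmargins()`, using made-up vectors (`gender` and `born_in_country` are illustrative names, not the ESS variables):

```r
# Hypothetical toy vectors for illustration
gender <- c("Female", "Female", "Male", "Male", "Male")
born_in_country <- c("Yes", "No", "Yes", "Yes", "No")

toy_table <- table(gender, born_in_country)

# Row percentages: multiply the row proportions by 100 and round
row_pct <- round(100 * prop.table(toy_table, 1), 1)
row_pct

# Counts with row and column totals added
toy_totals <- addmargins(toy_table)
toy_totals
```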

### The `CrossTable()` function (part of `gmodels`)

The package `gmodels` contains the function `CrossTable()`.

`CrossTable` combines the various table functionalities in base R for an easier way to create crosstables. It also makes it easier to include various tests of independence.

The line below creates a crosstable for `cgtsmke` and `gndr`, displaying column proportions and performing a chi-squared test.

In [65]:
# install.packages("gmodels") # run once if the package is not installed
library(gmodels)

CrossTable(ess_data$cgtsmke, ess_data$gndr, prop.r = FALSE, prop.c = TRUE, prop.t = FALSE, prop.chisq = FALSE, chisq = TRUE)
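If `gmodels` is not available, a chi-squared test of independence can also be run with base R by passing a table object to `chisq.test()`. The sketch below uses made-up counts (`smokes` and `gender` are illustrative vectors, not the ESS data) and switches off the continuity correction to get the plain Pearson statistic:

```r
# Hypothetical toy data: 100 women and 100 men with made-up smoking status
smokes <- c(rep("Yes", 30), rep("No", 70), rep("Yes", 45), rep("No", 55))
gender <- c(rep("Female", 100), rep("Male", 100))

smoke_table <- table(smokes, gender)

# Column proportions, comparable to prop.c = TRUE in CrossTable()
prop.table(smoke_table, 2)

# Pearson chi-squared test without Yates' continuity correction
test_result <- chisq.test(smoke_table, correct = FALSE)
test_result$statistic
test_result$p.value
```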


### Tabulating with `tidyverse`

There are various ways of creating tables and cross-tables using functions from the tidyverse.

`count()` (part of `dplyr`) can be used for frequency tables:

In [58]:
library(dplyr)

ess_data %>%
    count(gndr)

gndr,n
Female,353
Male,398


Crosstables can be achieved by combining `group_by()` summaries with `pivot_wider()`:

In [64]:
library(tidyr)

ess_data %>%
    group_by(gndr, brncntr) %>%
    summarise(n = n()) %>%
    pivot_wider(names_from = gndr, values_from = n)

brncntr,Female,Male
No,34,24
Yes,319,374
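As a possible shortcut, `count()` accepts several variables and does the grouping and counting in one step. The sketch below uses a made-up data frame (`toy` and its columns are illustrative) and assumes a `tidyr` version where `values_fill` accepts a single value, so that combinations that never occur show up as 0 rather than `NA`:

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data: no men born outside the country
toy <- data.frame(
  gender = c("Female", "Female", "Male", "Male", "Male"),
  born_in_country = c("Yes", "No", "Yes", "Yes", "Yes")
)

toy_crosstab <- toy %>%
  count(gender, born_in_country) %>%   # one row per observed combination
  pivot_wider(names_from = gender, values_from = n, values_fill = 0)

toy_crosstab
```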


# Statistical models

There are a lot of packages for creating statistical models, and there are packages for all kinds of specific analyses.

A recurring element of many of these packages and functions, however, is that the model is specified as a formula.

Formulas are specified as:
- `y ~ x1 (+x2 +x3 ... +xn)`
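A formula is an ordinary R object, so it can be stored and inspected before it is passed to a modelling function. A minimal sketch using the built-in `mtcars` data (chosen only for illustration; it is not part of this workshop's data):

```r
# Store a formula: mpg is the outcome, wt and hp are predictors
model_formula <- mpg ~ wt + hp

class(model_formula)     # a formula object
all.vars(model_formula)  # the variable names used in the formula

# The stored formula can be passed to lm() together with a data frame
fit <- lm(model_formula, data = mtcars)
coef(fit)
```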


The code below creates a linear model of weight on year of birth (`yrbrn`):

In [9]:
#Linear model for weight and yrbrn
lm(weight ~ yrbrn, ess_data)


Call:
lm(formula = weight ~ yrbrn, data = ess_data)

Coefficients:
(Intercept)        yrbrn  
   44.27414      0.01624  


In [10]:
#Multiple linear regression: bmi on weight and height
lm(bmi ~ weight + height, ess_data)


Call:
lm(formula = bmi ~ weight + height, data = ess_data)

Coefficients:
(Intercept)       weight       height  
    50.1059       0.3318      -0.2889  


An advantage of R is that a model can be stored like any other object, which makes it easy to keep and recall past results.

In [11]:
#Storing model
bmi_model <- lm(bmi ~ weight + height, ess_data)

In [12]:
#Summary statistics for bmi_model
summary(bmi_model)


Call:
lm(formula = bmi ~ weight + height, data = ess_data)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.1250 -0.1842  0.0199  0.1593  4.1995 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 50.105861   0.341344   146.8   <2e-16 ***
weight       0.331774   0.001376   241.2   <2e-16 ***
height      -0.288922   0.002197  -131.5   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.4993 on 737 degrees of freedom
  (11 observations deleted due to missingness)
Multiple R-squared:  0.9875,	Adjusted R-squared:  0.9875 
F-statistic: 2.915e+04 on 2 and 737 DF,  p-value: < 2.2e-16
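Once a model is stored, individual results can be pulled out of it instead of reading them off the printed summary. A short sketch with the built-in `mtcars` data (used here only as a stand-in):

```r
# Fit and store a model on the built-in mtcars data
fit <- lm(mpg ~ wt + hp, data = mtcars)

coef(fit)                # named vector of coefficients
confint(fit)             # 95% confidence intervals by default
summary(fit)$r.squared   # R-squared from the summary object

# Predict the outcome for a new, made-up observation
new_car <- data.frame(wt = 3, hp = 110)
predicted_mpg <- predict(fit, newdata = new_car)
predicted_mpg
```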


## Modelling interactions and quadratic terms

**Interactions**

Interactions can be modelled using `*` or `:`. `a:b` adds only the interaction term, while `a*b` is shorthand for `a + b + a:b`:

In [17]:
bmi_model <- lm(bmi ~ height + weight + age + weight*age, ess_data)

summary(bmi_model)


Call:
lm(formula = bmi ~ height + weight + age + weight * age, data = ess_data)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.2292 -0.1913  0.0293  0.1714  4.1145 

Coefficients:
              Estimate Std. Error  t value Pr(>|t|)    
(Intercept)  5.107e+01  4.413e-01  115.728  < 2e-16 ***
height      -2.904e-01  2.271e-03 -127.885  < 2e-16 ***
weight       3.235e-01  3.432e-03   94.263  < 2e-16 ***
age         -1.551e-02  5.001e-03   -3.101  0.00200 ** 
weight:age   1.816e-04  6.623e-05    2.741  0.00627 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.496 on 735 degrees of freedom
  (11 observations deleted due to missingness)
Multiple R-squared:  0.9877,	Adjusted R-squared:  0.9876 
F-statistic: 1.477e+04 on 4 and 735 DF,  p-value: < 2.2e-16
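The relationship between `*` and `:` can be checked by fitting the same model both ways. A minimal sketch on the built-in `mtcars` data (a stand-in dataset, not the ESS data):

```r
# wt * hp expands to wt + hp + wt:hp
fit_star  <- lm(mpg ~ wt * hp, data = mtcars)
fit_colon <- lm(mpg ~ wt + hp + wt:hp, data = mtcars)

names(coef(fit_star))

# Both specifications give the same coefficients
same_model <- isTRUE(all.equal(coef(fit_star), coef(fit_colon)))
same_model
```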


**Quadratic terms**

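Quadratic terms cannot be written as `height^2` directly, because `^` has a special meaning inside formulas (it expands interactions). The term can either be wrapped in `I()`, as in `I(height^2)`, or created as a separate variable before fitting the model.

Here a variable for the quadratic term is created before creating the model: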

In [21]:
ess_data$quad_height <- ess_data$height^2

bmi_model <- lm(bmi ~ height + weight + quad_height, ess_data)

summary(bmi_model)


Call:
lm(formula = bmi ~ height + weight + quad_height, data = ess_data)

Residuals:
    Min      1Q  Median      3Q     Max 
-2.9395 -0.1880 -0.0090  0.1415  4.3044 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) 35.4419541  4.6707998   7.588 9.84e-14 ***
height      -0.1201562  0.0536584  -2.239  0.02544 *  
weight       0.3316839  0.0013678 242.503  < 2e-16 ***
quad_height -0.0004838  0.0001537  -3.148  0.00171 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.4963 on 736 degrees of freedom
  (11 observations deleted due to missingness)
Multiple R-squared:  0.9877,	Adjusted R-squared:  0.9876 
F-statistic: 1.967e+04 on 3 and 736 DF,  p-value: < 2.2e-16
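As an alternative to creating the squared variable by hand, the term can be written inside the formula with `I()` or `poly(..., raw = TRUE)`. A sketch on the built-in `mtcars` data (a stand-in dataset) showing that the three approaches give the same fit:

```r
# 1) Squared variable created before the model (copy of mtcars with a new column)
cars_sq <- transform(mtcars, wt_sq = wt^2)
fit_manual <- lm(mpg ~ wt + wt_sq, data = cars_sq)

# 2) I() protects ^ so it is evaluated as arithmetic inside the formula
fit_I <- lm(mpg ~ wt + I(wt^2), data = mtcars)

# 3) poly() with raw = TRUE generates wt and wt^2 as plain powers
fit_poly <- lm(mpg ~ poly(wt, 2, raw = TRUE), data = mtcars)

# Coefficient names differ, the estimates do not
same_manual_I <- isTRUE(all.equal(unname(coef(fit_manual)), unname(coef(fit_I))))
same_I_poly   <- isTRUE(all.equal(unname(coef(fit_I)), unname(coef(fit_poly))))
c(same_manual_I, same_I_poly)
```

Without `raw = TRUE`, `poly()` uses orthogonal polynomials: the fitted values are the same, but the coefficients are not directly comparable to the ones above.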


## Models and categorical variables

When working with categorical variables in R, almost everything about how to treat the variable in a model should be specified *before* creating the model.

- Should the variable be treated as ordered (ordinal) or unordered (nominal)?
- What value should be used as reference/base?
- Is the ordinal variable to be used as an interval variable?


In [65]:
#Linear model with categorical (2 values)
lm(height ~ yrbrn + gndr, ess_data)


Call:
lm(formula = height ~ yrbrn + gndr, data = ess_data)

Coefficients:
(Intercept)        yrbrn     gndrMale  
   -50.2225       0.1104      12.7617  


In [67]:
#Linear model with ordinal
ess_data$healthcat <- factor(ess_data$health, levels = c('Very bad', 'Bad', 'Fair', 'Good', 'Very good'), ordered = TRUE)

summary(lm(height ~ yrbrn + healthcat, ess_data))


Call:
lm(formula = height ~ yrbrn + healthcat, data = ess_data)

Residuals:
     Min       1Q   Median       3Q      Max 
-26.8004  -6.7797  -0.1917   6.4317  30.0358 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) -61.37901   36.03229  -1.703   0.0889 .  
yrbrn         0.11893    0.01836   6.478 1.69e-10 ***
healthcat.L   4.11147    2.31731   1.774   0.0764 .  
healthcat.Q  -1.34712    1.99509  -0.675   0.4997    
healthcat.C   2.02962    1.51963   1.336   0.1821    
healthcat^4  -2.09098    1.03979  -2.011   0.0447 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 9.311 on 742 degrees of freedom
  (3 observations deleted due to missingness)
Multiple R-squared:  0.07674,	Adjusted R-squared:  0.07052 
F-statistic: 12.34 on 5 and 742 DF,  p-value: 1.644e-11
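The terms `.L`, `.Q`, `.C` and `^4` in the output above appear because R uses polynomial contrasts (linear, quadratic, cubic, fourth degree) for ordered factors by default. A small sketch with a made-up vector (`health_values` is illustrative, not the ESS data) showing the contrasts, and two alternatives: dropping the ordering to get ordinary dummy coding, or using the level codes as an interval variable:

```r
health_levels <- c("Very bad", "Bad", "Fair", "Good", "Very good")
# Hypothetical toy values for illustration
health_values <- c("Good", "Fair", "Very good", "Bad", "Very bad", "Good")

health_ordered <- factor(health_values, levels = health_levels, ordered = TRUE)
contrasts(health_ordered)    # polynomial contrasts: .L, .Q, .C, ^4

# Same level order, but unordered: gives one dummy per non-reference level
health_unordered <- factor(health_ordered, ordered = FALSE)
contrasts(health_unordered)  # treatment contrasts with "Very bad" as reference

# Level codes 1-5, if the variable is to be treated as interval
health_numeric <- as.integer(health_ordered)
health_numeric
```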


In [68]:
#Linear model with nominal (character as factor)
summary(lm(height ~ yrbrn + health, ess_data))


Call:
lm(formula = height ~ yrbrn + health, data = ess_data)

Residuals:
     Min       1Q   Median       3Q      Max 
-26.8004  -6.7797  -0.1917   6.4317  30.0358 

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)    
(Intercept)     -60.03582   35.99289  -1.668   0.0957 .  
yrbrn             0.11893    0.01836   6.478 1.69e-10 ***
healthFair       -2.12265    1.69675  -1.251   0.2113    
healthGood        0.03304    1.61819   0.020   0.9837    
healthVery bad   -5.55532    3.82994  -1.450   0.1473    
healthVery good   0.92896    1.61690   0.575   0.5658    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 9.311 on 742 degrees of freedom
  (3 observations deleted due to missingness)
Multiple R-squared:  0.07674,	Adjusted R-squared:  0.07052 
F-statistic: 12.34 on 5 and 742 DF,  p-value: 1.644e-11
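In the output above, `Bad` has no coefficient because it is the reference category: when a character variable is converted to a factor, the levels are sorted alphabetically and the first one becomes the base. A sketch of choosing another reference with `relevel()`, using a made-up vector (`health_values` is illustrative):

```r
# Hypothetical toy values for illustration
health_values <- c("Good", "Fair", "Very good", "Bad", "Very bad", "Good")

health_factor <- factor(health_values)
levels(health_factor)  # alphabetical order; the first level is the reference

# Make "Good" the reference category instead
health_ref_good <- relevel(health_factor, ref = "Good")
levels(health_ref_good)
```

The releveled factor can then be used in the formula in place of the original variable.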


## Output a model

In [69]:
library(stargazer)


Please cite as: 


 Hlavac, Marek (2018). stargazer: Well-Formatted Regression and Summary Statistics Tables.

 R package version 5.2.2. https://CRAN.R-project.org/package=stargazer 




In [70]:
height_model <- lm(height ~ yrbrn + health, ess_data)
stargazer(height_model, type = "html", out = "../output/modelout.html")


<table style="text-align:center"><tr><td colspan="2" style="border-bottom: 1px solid black"></td></tr><tr><td style="text-align:left"></td><td><em>Dependent variable:</em></td></tr>
<tr><td></td><td colspan="1" style="border-bottom: 1px solid black"></td></tr>
<tr><td style="text-align:left"></td><td>height</td></tr>
<tr><td colspan="2" style="border-bottom: 1px solid black"></td></tr><tr><td style="text-align:left">yrbrn</td><td>0.119<sup>***</sup></td></tr>
<tr><td style="text-align:left"></td><td>(0.018)</td></tr>
<tr><td style="text-align:left"></td><td></td></tr>
<tr><td style="text-align:left">healthFair</td><td>-2.123</td></tr>
<tr><td style="text-align:left"></td><td>(1.697)</td></tr>
<tr><td style="text-align:left"></td><td></td></tr>
<tr><td style="text-align:left">healthGood</td><td>0.033</td></tr>
<tr><td style="text-align:left"></td><td>(1.618)</td></tr>
<tr><td style="text-align:left"></td><td></td></tr>
<tr><td style="text-align:left">healthVery bad</td><td>-5.555</td><