### Outcomes

- Read in and interpret .csv data
- Create readable summary statistics tables of relevant data


### References

- Statistics Canada, Survey of Financial Security, 2019, 2021. Reproduced and distributed on an "as is" basis with the permission of Statistics Canada. Adapted from Statistics Canada, Survey of Financial Security, 2019, 2021. This does not constitute an endorsement by Statistics Canada of this product.

In [90]:
# install.packages("gtsummary")
# install.packages("cardx")
# install.packages("IRdisplay")

library(tidyverse)  # attaches dplyr, tidyr, and the rest of the core tidyverse
library(gtsummary)
library(haven)
library(stargazer)
library(cardx)
library(IRdisplay)
library(cancensus)
library(sf)

Linking to GEOS 3.12.2, GDAL 3.9.3, PROJ 9.4.1; sf_use_s2() is TRUE



In [181]:
source("intermediate_summary_statistics_functions.R")

### Creating Simple Summary Statistics Tables Using `gtsummary`

In [155]:
api_key = "CensusMapper_f228791d9506a7a747ece66db73b95be"

In [186]:
census_data <- get_dataset(api_key) # get_dataset() is fairly long, but it can be a helpful template for projects in 326/490.
# The full code is in the .R file sourced above, which sits alongside this notebook.

Reading vectors data from local cache.



Reading geo data from local cache.



In [187]:
census_data <- census_data |>
  drop_na() |>
  glimpse()

Rows: 992
Columns: 12
$ population_density     <dbl> 2110.9, 4575.3, 6663.7, 4895.0, 6567.9, 6445.0,…
$ age                    <dbl> 43.2, 42.2, 41.5, 40.2, 40.1, 41.0, 43.9, 37.6,…
$ income                 <dbl> 0.90965, 0.93611, 0.84736, 0.45184, 0.82517, 0.…
$ education              <dbl> 195, 140, 235, 100, 135, 80, 165, 75, 180, 80, …
$ car_commute_driver     <dbl> 240, 165, 240, 130, 210, 170, 280, 195, 275, 21…
$ car_commute_driven     <dbl> 0, 0, 0, 10, 0, 0, 10, 10, 0, 0, 0, 0, 20, 0, 0…
$ pt_commute             <dbl> 25, 20, 50, 30, 45, 50, 45, 50, 50, 15, 20, 15,…
$ walk_commute           <dbl> 0, 0, 0, 0, 10, 0, 0, 0, 0, 10, 10, 0, 10, 0, 1…
$ bike_commute           <dbl> 10, 15, 0, 0, 10, 0, 10, 0, 10, 0, 10, 0, 0, 15…
$ other_commute          <dbl> 0, 0, 0, 0, 10, 0, 0, 0, 0, 0, 0, 0, 

Suppose that, for our project, we want to determine whether a greater proportion of people cycling to work is associated with higher income, adjusting for age, education, and population density. Note the columns ending in `_commute`: each gives the number of people living in a census area who commute by that mode (driving, walking, cycling, and so on). `total_reported_commute` is the total number of people in each census area who reported their commute method to Statistics Canada in the 2016 census. The first thing to check is that the `_commute` columns sum to `total_reported_commute`. This lets us check for missing values and, more importantly, confirm that the vectors we picked from the cancensus API are the correct ones for our analysis:

In [73]:
mismatch <- census_data |>
  # Sum every commute mode within each census area
  mutate(commute_sum = car_commute_driver + car_commute_driven +
           pt_commute + walk_commute + bike_commute + other_commute) |>
  # Keep only the rows where the sum disagrees with the reported total
  filter(commute_sum != total_reported_commute) |>
  mutate(
    commute_sum = as.numeric(commute_sum),
    total_reported_commute = as.numeric(total_reported_commute),
    mismatch_amount = total_reported_commute - commute_sum
  ) |>
  select(commute_sum, total_reported_commute, mismatch_amount) |>
  # The geometry column is still attached after select(), so drop it explicitly
  as_tibble() |>
  select(!geometry)

glimpse(mismatch)

Rows: 987
Columns: 3
$ commute_sum            <dbl> 275, 200, 290, 170, 285, 220, 345, 255, 335, 23…
$ total_reported_commute <dbl> 165, 120, 210, 140, 175, 155, 215, 170, 220, 13…
$ mismatch_amount        <dbl> -110, -80, -80, -30, -110, -65, -130, -85, -115…
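
Before interpreting the mismatch, it can help to check how many rows disagree, in which direction, and by how much relative to the reported total. The sketch below does this in base R on the first nine rows shown in the output above. The names `share_mismatched`, `direction`, and `relative_mismatch` are illustrative and do not appear elsewhere in the notebook.

```r
# First nine rows of the glimpse() output above
commute_sum <- c(275, 200, 290, 170, 285, 220, 345, 255, 335)
total_reported_commute <- c(165, 120, 210, 140, 175, 155, 215, 170, 220)

# Same definition as in the notebook: reported total minus summed modes
mismatch_amount <- total_reported_commute - commute_sum

# Share of rows where the two totals disagree
share_mismatched <- mean(mismatch_amount != 0)

# Direction of the disagreement: -1 means the summed modes exceed the total
direction <- table(sign(mismatch_amount))

# Size of the disagreement relative to the reported total
relative_mismatch <- mismatch_amount / total_reported_commute

share_mismatched
direction
summary(relative_mismatch)
```

In these nine rows every difference is negative, so the summed modes exceed the reported total in each case.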


We see that almost every row (987 of 992) is mismatched: people are either reporting more than one type of commute, or reporting commute types that are not collected in the census (for example, not commuting at all, or working from home). Let's create a summary statistics table using `gtsummary` to gauge the scale of the differences.

In [None]:

summary_table <- mismatch |>
  select(commute_sum, total_reported_commute, mismatch_amount) |>
  tbl_summary(
    type = all_continuous() ~ "continuous2",  # continuous2 allows several statistic rows per variable
    statistic = list(all_continuous() ~ c("{mean}", "{min}", "{max}")),
    digits = all_continuous() ~ 2,
    label = list(
      commute_sum ~ "Total Commute",
      total_reported_commute ~ "Reported Commute",
      mismatch_amount ~ "Mismatch Amount"
    )
  )

# Render in a separate step so that summary_table keeps the gtsummary object
# rather than the return value of display_html()
summary_table |>
  as_gt() |>
  gt::as_raw_html() |>
  display_html()



| Characteristic | N = 987 |
| --- | --- |
| Total Commute | |
| Mean | 223.31 |
| Min | 10.0 |
| Max | 3760.0 |
| Reported Commute | |
| Mean | 162.26 |
| Min | 20.0 |
| Max | 2760.0 |
| Mismatch Amount | |
| Mean | -61.05 |


Great! On average, the sum of all commute types exceeds the number of people who reported a commute method by about 61 people per census area. This suggests that individuals are reporting more than one type of commute.

Lastly, we'll put together a table of summary statistics for all our variables. Recall our research question: we want to determine whether an increase in the proportion of commuters who cycle to work is correlated with an increase in average income. Hence, we can exclude non-cycling modes of transport from our summary statistics table. We'll begin with some minor data wrangling:

In [86]:
census_data <- census_data |>
  mutate(bike_prop = bike_commute / total_reported_commute) |>  # proportion of commuters who cycle
  as_tibble() |>
  drop_na() |>
  glimpse()

Rows: 991
Columns: 13
$ population_density     <dbl> 2110.9, 4575.3, 6663.7, 4895.0, 6567.9, 6445.0,…
$ age                    <dbl> 43.2, 42.2, 41.5, 40.2, 40.1, 41.0, 43.9, 37.6,…
$ income                 <dbl> 0.90965, 0.93611, 0.84736, 0.45184, 0.82517, 0.…
$ education              <dbl> 195, 140, 235, 100, 135, 80, 165, 75, 180, 80, …
$ car_commute_driver     <dbl> 240, 165, 240, 130, 210, 170, 280, 195, 275, 21…
$ car_commute_driven     <dbl> 0, 0, 0, 10, 0, 0, 10, 10, 0, 0, 0, 0, 20, 0, 0…
$ pt_commute             <dbl> 25, 20, 50, 30, 45, 50, 45, 50, 50, 15, 20, 15,…
$ walk_commute           <dbl> 0, 0, 0, 0, 10, 0, 0, 0, 0, 10, 10, 0, 10, 0, 1…
$ bike_commute           <dbl> 10, 15, 0, 0, 10, 0, 10, 0, 10, 0, 10, 0, 0, 15…
$ other_commute          <dbl> 0, 0, 0, 0, 10, 0, 0, 0, 0, 0, 0, 0, 
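
The row count falls from 992 to 991 once `bike_prop` is computed and missing values are dropped. One possible explanation, offered here as an assumption rather than something the output confirms, is a census area whose `total_reported_commute` is zero. The base R sketch below uses hypothetical values (only the first pair, 10 and 165, comes from the data above) to show how division by zero behaves and why a missing-value filter catches only one of the two cases.

```r
# Hypothetical values: the second and third areas report a total of zero
bike_commute <- c(10, 0, 5)
total_reported_commute <- c(165, 0, 0)

bike_prop <- bike_commute / total_reported_commute

bike_prop             # a finite proportion, then NaN (0/0), then Inf (5/0)
is.na(bike_prop)      # NaN counts as missing, Inf does not
is.finite(bike_prop)  # FALSE for both NaN and Inf

# A stricter filter than dropping missing values alone
usable <- bike_prop[is.finite(bike_prop)]
```

If infinite proportions are a concern, checking `is.finite()` before summarising is a cheap safeguard, since `Inf` would otherwise pass through a missing-value filter and distort the mean and maximum.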

In [87]:
summary_table_2 <- census_data |>
  select(population_density, income, education, bike_prop) |>
  tbl_summary(
    type = all_continuous() ~ "continuous2",  # continuous2 allows several statistic rows per variable
    statistic = list(all_continuous() ~ c("{mean}", "{min}", "{max}")),
    digits = all_continuous() ~ 2,
    label = list(
      population_density ~ "Population Density",
      income ~ "Income (in 100k CAD)",
      education ~ "Education Level",
      bike_prop ~ "Proportion of Bikers"
    )
  )

# Render in a separate step so that summary_table_2 keeps the gtsummary object
# rather than the return value of display_html()
summary_table_2 |>
  as_gt() |>
  gt::as_raw_html() |>
  display_html()

| Characteristic | N = 991 |
| --- | --- |
| Population Density | |
| Mean | 10011.26 |
| Min | 297.9 |
| Max | 77692.3 |
| Income (in 100k CAD) | |
| Mean | 0.73 |
| Min | 0.13 |
| Max | 2.14 |
| Education Level | |
| Mean | 220.85 |


Looks great! Now suppose we want to isolate the summary statistics for census areas that have a bikeway passing through them and compare them with census areas that do not (i.e., include a dummy variable for the presence of a bikeway in each census area).

In [192]:
source("intermediate_summary_statistics_functions.R")

In [193]:
# Like get_dataset(), get_bikes() is very long; the full code can be found in the adjacent .R file.
census_data_bikes <- get_bikes(api_key) |>
  as_tibble()
glimpse(census_data_bikes)

Reading vectors data from local cache.

Reading geo data from local cache.

Reading data from temporary cache



Rows: 991
Columns: 6
$ population_density <dbl> 2110.9, 4575.3, 6663.7, 4895.0, 6567.9, 6445.0, 629…
$ age                <dbl> 43.2, 42.2, 41.5, 40.2, 40.1, 41.0, 43.9, 37.6, 42.…
$ income             <dbl> 0.90965, 0.93611, 0.84736, 0.45184, 0.82517, 0.7436…
$ education          <dbl> 195, 140, 235, 100, 135, 80, 165, 75, 180, 80, 130,…
$ has_bike_lane      <dbl> 1, 0, 1, 0, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, …
$ bike_prop          <dbl> 0.06060606, 0.12500000, 0.00000000, 0.00000000, 0.0…
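
One way to compare areas with and without a bikeway is the `by` argument of `tbl_summary()`, which produces one column of statistics per value of the grouping variable. The sketch below is self-contained and runs on a small hypothetical data frame, `toy`, whose values are illustrative and are not census figures; only the column names mirror the output above.

```r
library(gtsummary)

# Hypothetical toy data; the values are illustrative, not census figures
toy <- data.frame(
  has_bike_lane = c(1, 0, 1, 0, 1, 0),
  income        = c(0.91, 0.94, 0.85, 0.45, 0.83, 0.74),
  bike_prop     = c(0.06, 0.13, 0.00, 0.00, 0.04, 0.02)
)

tbl_by <- toy |>
  tbl_summary(
    by = has_bike_lane,  # one column of statistics per value of the dummy
    # Types are declared explicitly because, with so few distinct values,
    # gtsummary would likely classify these numeric columns as categorical
    type = list(income ~ "continuous2", bike_prop ~ "continuous2"),
    statistic = all_continuous() ~ c("{mean}", "{min}", "{max}"),
    digits = all_continuous() ~ 2,
    label = list(
      income ~ "Income (in 100k CAD)",
      bike_prop ~ "Proportion of Bikers"
    )
  )
```

In this notebook, `tbl_by` could be rendered with the same `as_gt() |> gt::as_raw_html() |> display_html()` chain used for the earlier tables. With the full data, the explicit `type` list should be unnecessary, since the variables have many distinct values.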


We will once again repeat the same process for gener