---
title: "Parliament's gender problem"
author: "Jesse Tweedle"
date: '2018-02-14'
slug: parliament-gender
categories: ["r"]
tags: ["r", "canada", "parliament", "gender", "bias", "text analysis", "hansard"]
description: 'A look at digital copies of Canadian parliamentary debates 1994--2017, showing gender imbalance in both number of speakers and time pattern of speakers.'
image: "https://jesse.tw/images/circle-arrows.png"
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(collapse = TRUE,
warning = FALSE,
message = FALSE,
fig.height = 4, fig.width = 8)
colours <- function(nlevels) { viridis::magma(nlevels+1)[-nlevels-1] }
```
(Preface: I'm `r emo::ji('canada')` but idk anything about parliament and didn't know what a hansard was before I started this. So if you see something wrong, get the [code from github](https://github.com/tweed1e/weblog/blob/master/content/post/2018-02-14-parliament-gender.Rmd) and diy!)
## Summary
1. Women make up [20-25% of Parliament](http://www.cbc.ca/news2/interactives/women-politics/)
2. Share of words spoken by women rises from 15% to 28% over 1994--2017
3. Increase mainly comes from doubling involvement in routine government business
4. Nice to have Trudeau's [gender-balanced cabinet](http://www.cbc.ca/news2/interactives/women-politics/), but female MPs are also becoming more involved in day-to-day operations
[(Click to skip to the important pictures.)](#dayplot)
## Motivation
When it comes to equality, Canada likes to talk a big game. What really happens in Parliament? Of course Trudeau likes to play up his female cabinet (extra props go to Chrystia Freeland for not capitulating to that weird Belgian area that was [holding up the Canada-EU trade agreement](http://www.cbc.ca/news/politics/canada-eu-ceta-brussels-friday-1.3815332), and I hope she can handle [you-know-who and NAFTA too](http://www.cbc.ca/news/politics/freeland-nafta-fifth-round-prepare-for-worst-1.4412673), although she switched from trade to foreign affairs).
But you don't need to take anyone's word for it; we have data on gender and parliament. What does it say? On one hand, we just set a [record for female MPs in the 2015 election](https://en.wikipedia.org/wiki/Women_in_the_42nd_Canadian_Parliament)! On the other hand, that record is still only 26%. Uh. Not great. But that's only part of the story. A gender-balanced cabinet helps (Trudeau appointed 15 female ministers out of a 30-minister cabinet). *But what actually happens when parliament gets down to business?*
## The data: [https://openparliament.ca/](https://openparliament.ca/)
[Michael Mulley](https://github.com/michaelmulley) (not a government employee, that would be too obvious) gathered all the parliament data and made a website! You can go to the site or the [project's github page](https://github.com/michaelmulley/openparliament) to see what's up. There's an API to access the data, but I had no idea how parliament works or what I was looking for, so I [downloaded a PostgreSQL](https://openparliament.ca/data-download/) copy of the database to go through on my own. It's about 4GB.
### Set up postgres and the database
To get the data into R, you need to
1. set up a local postgres database,
2. load the downloaded dump into that database,
3. connect to the database from R,
4. and read the tables you'd like.
First, I didn't have postgres. So google [like mad](https://www.codementor.io/engineerapart/getting-started-with-postgresql-on-mac-osx-are8jcopb) and get super frustrated, then finally settle on this strat:
```
$ brew install postgresql
$ createdb -T template0 openparl
$ psql openparl < openparliament.public.sql
```
The first one installs postgresql; the next creates the database `openparl` that I'll put the data in, and the last one loads the dump I downloaded from the website into the database! Done and done. Now open `psql` and try to look at what's inside:
```
$ psql postgres
postgres-# \list
Name | Owner | Encoding | Collate | Ctype | Access privileges
-----------+--------------+----------+-------------+-------------+-------------------------------
openparl | jessetweedle | UTF8 | en_US.UTF-8 | en_US.UTF-8 |
postgres | jessetweedle | UTF8 | en_US.UTF-8 | en_US.UTF-8 |
template0 | jessetweedle | UTF8 | en_US.UTF-8 | en_US.UTF-8 | =c/jessetweedle +
| | | | | jessetweedle=CTc/jessetweedle
template1 | jessetweedle | UTF8 | en_US.UTF-8 | en_US.UTF-8 | =c/jessetweedle +
| | | | | jessetweedle=CTc/jessetweedle
(4 rows)
```
It's there! It worked ffs! Next, connect to the database and list contents (I'll show only a few results)
```
postgres-# \connect openparl
openparl-# \dt
List of relations
Schema | Name | Type | Owner
--------+------------------------------------------+-------+--------------
public | accounts_logintoken | table | jessetweedle
public | accounts_user | table | jessetweedle
public | activity_activity | table | jessetweedle
public | alerts_politicianalert | table | jessetweedle
...(lots more)...
(64 rows)
```
It turns out the two tables we're looking for are `hansards_statement` (the parliament transcripts) and `core_politician` (for politician gender data). Now all we have to do is leave postgres running and connect to the database from R! (Aside: the R tutorials can be super frustrating, because they assume you've got the database part worked out already. No! That's the only part I need!)
### Hansard
A [hansard](https://en.wikipedia.org/wiki/Hansard) is a non-verbatim transcript of parliamentary proceedings. Aside: Canada is bilingual, so MPs can speak in either official language; the Hansard records which language was spoken and later translates it into the other for the official record. Which led to this gem:
>...during a Liberal filibuster in the Canadian Senate, Senator Philippe Gigantès was accused of reading one of his books only so that he could get the translation for free through the Hansard. ([Wikipedia](https://en.wikipedia.org/wiki/Hansard#cite_ref-16)).
So, we have a record of who speaks at what time during every session of parliament. Let's take a look at the `hansards_statement` table in the open parliament database.
### Import data into R and explore
Now that we're back home in R, let's do our regular warm-up exercises:
``` {r warning = FALSE, message = FALSE}
library(RPostgreSQL) # to access database
library(tidyverse) # to tidy things
library(viridis) # bc I like the colours
library(lubridate) # to deal with times; the `hms` package is needed too
```
Now, let's get the data:
``` {r, eval = FALSE}
con <- dbConnect(dbDriver("PostgreSQL"),
dbname = "openparl",
host = "localhost",
user = "jessetweedle", password = "")
```
This is mainly ⌘+C/⌘+V from [an r-bloggers post](https://www.r-bloggers.com/getting-started-with-postgresql-in-r/) by someone named [David Zimmermann](https://www.linkedin.com/in/david-zimmermann-76a737a4/). Thanks David.
I had to write `"openparl"` as the `dbname` because that's the name of the new local database I created to store the tables; `host` is local, and `user` is the name automatically given to me when I set up `psql`, and the default password is empty.
Now let's check out the tables; you can modify this code to get all the tables in this database, but I'm focusing on ones with `"hansard"` in them (because I've already written all the code and I know what I need!):
``` {r, eval = FALSE}
tbl_query <- "SELECT *
FROM pg_tables
WHERE schemaname='public';"
dbGetQuery(con, tbl_query) %>%
as_tibble() %>%
select(schemaname, tablename, tableowner) %>%
filter(grepl("hansard", x = tablename))
# A tibble: 5 x 3
# schemaname tablename tableowner
# <chr> <chr> <chr>
# 1 public hansards_document jessetweedle
# 2 public hansards_statement jessetweedle
# 3 public hansards_statement_bills jessetweedle
# 4 public hansards_statement_mentioned_politicians jessetweedle
# 5 public hansards_oldsequencemapping jessetweedle
```
Now check out the ones we want, and save them to a `tibble`. First, let's have a quick look at `hansards_statement`:
``` {r eval = FALSE}
test_query <- "SELECT *
FROM hansards_statement
LIMIT 20;"
df <- dbGetQuery(con, test_query) %>% as_tibble()
df %>% print(n = 5)
# A tibble: 20 x 27
# id document_id time h1_en h2_en member_id who_en content_en sequence
# * <int> <int> <dttm> <chr> <chr> <int> <chr> <chr> <int>
# 1 645328 388 2008-02-14 13:15:00 Routi… Ques… NA "" "<p class=\… 86
# 2 232373 1878 2001-05-03 13:50:00 Gover… Fede… 2611 Ms. M… <p>Mr. Spea… 89
# 3 170961 944 1999-05-10 16:05:00 Gover… Inco… 3066 Mr. G… "<p>Mr. Spe… 210
# 4 27684 1138 1994-11-18 10:00:00 "" Priv… NA The S… "<p>My coll… 0
# 5 645329 388 2008-02-14 13:15:00 Routi… Ques… 1534 Mr. T… "<p data-Ho… 87
# # ... with 15 more rows, and 18 more variables: wordcount <int>, politician_id <int>,
# # procedural <lgl>, h3_en <chr>, who_hocid <int>, content_fr <chr>, statement_type <chr>,
# # written_question <chr>, source_id <chr>, who_context_en <chr>, slug <chr>,
# # urlcache <chr>, h1_fr <chr>, h2_fr <chr>, h3_fr <chr>, who_fr <chr>,
# # who_context_fr <chr>, wordcount_en <int>
```
Data so cool. Thank you again [Michael](https://twitter.com/michaelmulley?lang=en).
``` {r, eval = FALSE}
data_query <- "SELECT time, h1_en, h2_en, h3_en, who_en, politician_id,
wordcount, who_hocid, who_context_en, name, gender
FROM hansards_statement
LEFT JOIN core_politician
ON hansards_statement.politician_id = core_politician.id
ORDER BY time, sequence;"
han_df <- dbGetQuery(con, data_query) %>% as_tibble()
han_df %>% print(n = 5)
# A tibble: 2,297,140 x 11
# time h1_en h2_en h3_en who_en politician_id wordcount who_hocid
# <dttm> <chr> <chr> <chr> <chr> <int> <int> <int>
# 1 1994-01-17 11:00:00 "" "" "" "" NA 301 NA
# 2 1994-01-17 11:25:00 "" "" "" The Clerk of… NA 27 NA
# 3 1994-01-17 11:25:00 "" Election… "" The Presidin… NA 134 NA
# 4 1994-01-17 11:25:00 "" Election… "" Mr. Nunziata 4892 40 NA
# 5 1994-01-17 11:25:00 "" Election… "" The Presidin… NA 289 NA
# ... with 2.297e+06 more rows, and 3 more variables: who_context_en <chr>, name <chr>, gender <chr>
```
Cool, the two things I want (for now) are `time` and `who_en`.
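A note on the `LEFT JOIN` in that query: it keeps every statement, including the ones with no matching row in `core_politician` (the Speaker, the Clerk), and leaves `name` and `gender` as `NA` for them. Here is a minimal sketch of the same join in dplyr on made-up rows; the ids, names and genders are invented for illustration and are not taken from the database.

```r
library(dplyr)

# Toy stand-ins for hansards_statement and core_politician (made-up values).
statements <- tibble(
  who_en        = c("The Speaker", "Mr. A", "Ms. B"),
  politician_id = c(NA, 1L, 2L),
  wordcount     = c(27L, 40L, 134L)
)
politicians <- tibble(
  id     = c(1L, 2L),
  name   = c("A", "B"),
  gender = c("M", "F")
)

# Same shape as the SQL: every statement survives, unmatched ones get NA.
joined <- statements %>%
  left_join(politicians, by = c("politician_id" = "id"))

joined
```

That is why the gender column needs a second source later on: plenty of rows come back with no gender at all.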
## Analysis
First, an overview of the data we're going to work with.
``` {r}
han_df <- read_csv("han_df.zip")
han_df <- han_df %>% mutate(time = with_tz(time, "America/Toronto"))
han_df %>% sample_n(5) # an easy way to look at a random sample of observations instead of just the first 10
```
Ok, now we're getting somewhere. I want to check two things: (1) what does the time distribution look like? and (2) can we get the gender of the MPs from names? We got genders from the politician database, but there are some sanity checks we'll need. Leave that for later.
### Cycles of activity
So, the time distribution:
``` {r, echo = FALSE, fig.keep = 'all'}
name_time <- han_df %>% mutate(hms = hms::hms(second(time), minute(time), hour(time)))
words <- name_time %>%
select(time, hms, politician_id, who_en, wordcount, gender) %>%
filter(who_en != "" & wordcount > 0)
name_time %>%
group_by(hms) %>%
summarize(n = sum(wordcount)) %>%
# count(hms) %>%
filter(n > 40000) %>%
ggplot(aes(x = hms, y = n / 1e6)) +
geom_point() + geom_smooth(span = 0.15) +
scale_x_time(breaks = hms::hms(rep(0, 24), rep(0, 24), 0:23),
labels = function(x) strftime(x, "%H:%M", tz = "UTC", usetz = FALSE)) +
theme_minimal() +
labs(x = "Time",
y = "Number of words (millions)",
title = "Number of words spoken over the day in Hansard, 1994--2017")
```
There are cycles of activity; it picks up when the session begins, drops off from 1 to 4, then jumps up from 4 to 5. There are obvious patterns to the activity in the Hansard that definitely correspond to the daily schedule of parliament. Keep that in mind for later. At this stage, we just want to know that the data make sense. There are some outliers left off the graph (e.g., filibustering that lasts through the night).
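The code behind that plot is hidden in the post, so here is a rough sketch of the aggregation step on made-up statements: strip the date, keep the clock time, and sum the word counts for each clock time across all days. The timestamps and word counts are invented, and `by_clock` is just a name I picked for the example.

```r
library(dplyr)
library(lubridate)

# Made-up statements; only the clock time matters here.
toy <- tibble(
  time = as.POSIXct(c("2017-02-06 10:05:00", "2017-02-07 10:05:00",
                      "2017-02-06 14:00:00"), tz = "UTC"),
  wordcount = c(100L, 50L, 30L)
)

by_clock <- toy %>%
  # hms::hms() takes seconds, minutes, hours (in that order)
  mutate(clock = hms::hms(second(time), minute(time), hour(time))) %>%
  group_by(clock) %>%
  summarize(n = sum(wordcount))

by_clock
```

The two 10:05 statements from different days land in the same group, which is what turns 24 years of timestamps into one daily profile.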
### Gender (im)balance in Parliament: overall stats
From our exploratory analysis before, we know that there is often a gendered title associated with the name, along with the gender given by the politician database. As a first stab to sanity check and validate the data, just call names with "Mr." (and French equivalents) Male, and names with "Mrs." (and other English and French equivalents) Female. (This has its own problems---in a few cases, the hansard specifies that a woman is speaking on behalf of a man, and vice versa.) Putting these two things together gives us a more accurate measure of gender.
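Before running it on the full data, here is a quick sketch of how the two sources combine, using made-up speakers. The `gender` column plays the role of the politician database, and the title wins whenever the database value is missing or disagrees with it. The names are invented, and `toy_gender` is only an illustrative name.

```r
library(dplyr)

# Made-up speakers; `gender` stands in for the database column.
toy <- tibble(
  who_en = c("Mr. A", "Ms. B", "Mrs. C", "The Speaker", "Ms. D"),
  gender = c("M",     NA,      "M",      NA,            "F")
)

toy_gender <- toy %>%
  # gender_x: gender implied by the title, "" when there is no title
  mutate(gender_x = case_when(
    grepl("(Mr\\.|M\\.)", who_en) ~ "M",
    grepl("(Mrs\\.|Ms\\.|Miss|Mlle\\.|Mme\\.)", who_en) ~ "F",
    TRUE ~ ""
  )) %>%
  # Use the title when the database gender is missing or disagrees with it.
  mutate(gender = ifelse(is.na(gender) | (gender_x != gender & gender_x != ""),
                         gender_x, gender)) %>%
  # Drop speakers with no gender from either source (e.g. "The Speaker").
  filter(gender != "")

toy_gender
```

In this toy set, "Ms. B" picks up a gender from her title, "Mrs. C" has her database value overridden, and "The Speaker" drops out because neither source says anything.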
``` {r}
words_gender <- words %>%
filter(who_en != "") %>%
mutate(gender_x = case_when(
grepl("(Mr\\.|M\\.)", x = who_en) ~ "M",
grepl("(Mrs\\.|Ms\\.|Miss|Mlle\\.|Mme\\.)", x = who_en) ~ "F",
TRUE ~ "")) %>%
mutate(gender = ifelse(is.na(gender) | (gender_x != gender & gender_x != ""), gender_x, gender)) %>%
filter(gender != "")
words_gender %>% select(who_en, gender, gender_x) %>% sample_n(5)
```
Ok, that's not bad; now what are the overall statistics?
``` {r, echo = FALSE}
words_gender %>%
group_by(gender) %>%
summarize(n = sum(wordcount)) %>%
mutate(freq = scales::percent(n / sum(n))) %>%
rename_all(stringr::str_to_title) %>%
knitr::kable("html", align = "lrr") %>%
kableExtra::kable_styling(full_width = FALSE)
```
Not great. Does it change over time? We know 2015 was a record-breaking election for female candidates; on the other hand, the previous record was 1993. So in between: `r emo::ji('shrug')`.
``` {r, echo = FALSE, fig.keep = 'all'}
words_gender %>%
mutate(year = year(time)) %>%
group_by(gender, year) %>%
summarize(n = sum(wordcount) / 1e6) %>%
group_by(year) %>%
mutate(freq = n / sum(n)) %>%
rename(Frequency = freq, Count = n, Gender = gender) %>%
gather(key = type, value = stat, -(Gender:year)) %>%
filter(type == "Count" | (type == "Frequency" & Gender == "F")) %>%
ggplot(aes(x = year, y = stat, colour = Gender)) +
geom_point() + geom_line() +
expand_limits(y = 0) +
scale_colour_manual(values = colours(2)) +
scale_x_continuous(breaks = seq.int(1994, 2017, 3), minor_breaks = 1994:2017) +
facet_wrap(~ type, scales = "free") +
labs(x = "Year", y = "", title = "Count (millions) and frequency of words by gender") +
theme_minimal()
```
Whoops. There's an obvious break in the series for Count---so something happened to the data collection or the structure of the hansard (it's less likely that parliament itself changed significantly at that time). On the plus side, women get more words in Parliament over time!
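The plotting code is hidden, so here is a small sketch of how the yearly share is computed: sum words by year and gender, then divide by the year's total. The word counts are made up for illustration and are not the real figures.

```r
library(dplyr)

# Made-up word counts, two years, two genders.
toy <- tibble(
  year      = c(1994, 1994, 2017, 2017),
  gender    = c("F", "M", "F", "M"),
  wordcount = c(10, 90, 30, 70)
)

share <- toy %>%
  group_by(year, gender) %>%
  summarize(n = sum(wordcount)) %>%
  group_by(year) %>%
  mutate(freq = n / sum(n)) %>%   # share of that year's words
  ungroup()

share %>% filter(gender == "F")
```

Because the share is computed within each year, it is not thrown off by the break in the raw counts the way the Count panel is.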
### Gender (im)balance in Parliament: time and day patterns {#dayplot}
``` {r, echo = FALSE, fig.keep = 'all'}
parliament_schedule <- read_csv("parliament_schedule_2018.csv") %>%
gather(key = day, value = session_type, -X1) %>%
rename(time = X1) %>%
mutate(session_type = ifelse(grepl("Government Orders", x = session_type), "Government Orders", session_type))
psked <- parliament_schedule %>%
filter(!is.na(session_type)) %>%
mutate(day = factor(day, levels = wday(1, label = TRUE) %>% levels()), hour = hour(time), minute = minute(time))
dayplot <- words_gender %>%
mutate(
year = year(time),
hour = hour(time),
minute = as.integer((minute(time) %/% 15) * 15),
day = wday(time, label = TRUE)
) %>%
left_join(psked, by = c("hour", "minute", "day")) %>%
mutate(hms = hms::parse_hm(paste0(hour, ":", minute))) %>%
filter(session_type != "Review of Delegated Legislation") %>%
group_by(gender, hour, hms, day) %>%
summarize(n = sum(wordcount)) %>%
spread(key = gender, value = n) %>%
filter(hour(hms) %>% between(8, 17)) %>%
filter(`F` > 5 & day != "Sat" & day != "Sun") %>%
mutate(f_ratio = `F` / (M + `F`))
dayplot %>%
filter(hour %>% between(10, 17) & !(hour > 17 & day == "Fri")) %>%
ggplot(aes(x = hms, y = f_ratio * 100, colour = day)) +
# geom_hline(yintercept = overall, colour = "#999999", linetype = 'longdash', size = 1) +
geom_point() +
geom_line() +
scale_colour_manual(values = colours(n_distinct(dayplot$day))) +
scale_x_time(breaks = hms::hms(rep(0, 24), rep(0, 24), 0:23),
labels = function(x) strftime(x, "%H:%M", tz = "UTC", usetz = FALSE)) +
# expand_limits(y = 0) +
labs(x = "Hour",
y = "% female",
title = "% of female speakers spikes at 2:00PM",
colour = "Day") +
theme_minimal()
```
Wow. That's a pattern I didn't expect---there's a spike at 2:00PM from Monday-Thursday, but not Friday. Weird. That cyclical activity pattern we noted before probably explains some of this. There's some information on this in the dataset (under the original heading `h1_en`), but it's not very consistent (different capitalization, spelling, naming conventions).
So let's do some googling to find [Parliament's daily order of business](https://www.ourcommons.ca/About/Schedules/DailyOrderOfBusiness-e.html). Mon-Thurs at 2:00PM (and Friday at 11:00AM) is time for "Statements by Members"! That's where all the spikes are! Which means women are speaking at the time when each member can get exactly one minute to speak on any topic. Their share is also relatively high at 2:30 (same days), which is Question Period! They're talking and asking questions at the times they are allowed to do so. But they're not speaking as much during regular government business.
Let's save that schedule as a csv file, read it in and join it to the statement data by day of week and 15-minute time slot.
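A sketch of that join on toy data: round each timestamp down to the start of its 15-minute slot, then match on day, hour and minute. The two-row `schedule` here is a hypothetical stand-in for the real csv, and I use the numeric day of week (1 = Sunday) instead of labels so the example does not depend on locale.

```r
library(dplyr)
library(lubridate)

# Hypothetical schedule: day of week (1 = Sunday), slot start, session type.
schedule <- tibble(
  day          = c(2, 2),
  hour         = c(14L, 14L),
  minute       = c(0L, 15L),
  session_type = c("Statements by Members", "Oral Questions")
)

# Made-up statements on Monday 2017-02-06.
statements <- tibble(
  time = as.POSIXct(c("2017-02-06 14:07:00", "2017-02-06 14:29:00"), tz = "UTC"),
  wordcount = c(160L, 90L)
)

binned <- statements %>%
  mutate(
    day    = wday(time, week_start = 7),            # Monday is 2
    hour   = hour(time),
    minute = as.integer((minute(time) %/% 15) * 15) # 14:07 -> 0, 14:29 -> 15
  ) %>%
  left_join(schedule, by = c("day", "hour", "minute"))

binned
```

Integer division by 15 is what does the rounding: any statement timestamped 14:00 to 14:14 falls in the 14:00 slot.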
``` {r, echo = FALSE}
gsesh <- words_gender %>%
mutate(year = year(time),
hour = hour(time),
minute = as.integer((minute(time) %/% 15) * 15),
day = wday(time, label = TRUE)) %>%
left_join(psked, by = c("hour", "minute", "day")) %>%
group_by(gender, year, session_type) %>%
summarize(n = sum(wordcount)) %>%
filter(!is.na(session_type))
gsesh %>%
mutate(session_type = ifelse(session_type == "Statements by Members", "Statements by Members", "Other")) %>%
group_by(year, gender, session_type) %>%
summarize(n = sum(n)) %>%
# filter(session_type != "Review of Delegated Legislation") %>%
group_by(year, session_type) %>%
mutate(freq = n / sum(n)) %>%
filter(gender == "F") %>%
ggplot(aes(x = year, y = freq * 100, colour = session_type)) +
geom_point() +
geom_line() +
scale_x_continuous(breaks = seq.int(1994, 2017,2), minor_breaks = 1994:2017) +
scale_colour_manual(values = colours(2)) + #n_distinct(gsesh$session_type))) +
labs(x = "Year",
y = "% words spoken by female",
title = "% of words spoken by women of non-'statement' gov't\nbusiness almost doubles from 1994--2017",
colour = "Session Type") +
theme_minimal()
```
Wow. Statements by Members is constant (matching the relative number of female MPs), but the % of words spoken by women during routine business has doubled over time! Let's take a look at the deeper categories of business:
``` {r, echo = FALSE}
gsesh %>%
filter(year == 2017 | year == 1994) %>%
group_by(gender, session_type, year) %>%
summarize(n = sum(n)) %>%
group_by(gender, year) %>%
spread(key = gender, value = n) %>%
rename(`Session Type` = session_type) %>%
mutate(`% F` = `F` / (M + `F`)) %>%
# arrange(-`% F`) %>%
mutate(`% Female` = scales::percent(`% F`)) %>% select(-`% F`, -`F`, -M) %>%
spread(key = year, value = `% Female`, fill = "0%") %>%
knitr::kable("html", align = "lrr") %>%
kableExtra::kable_styling(full_width = FALSE) %>%
kableExtra::add_header_above(c(" " = 1, "% female words" = 2))
```
Dang. Alright. The share of words spoken by women in parliament doubles outside regular government business, which suggests they are less involved in the actual running of the government (from this data, we can't tell whether that's by choice of the individuals or by design of those in power; you be the judge). Their share jumps when they get the chance ('statements by members', 'oral questions' and 'adjournment proceedings' are all made up of short one-minute statements and question period).
That's it. Parliament's gender power imbalance is still poor, but has improved over time:
1. Women make up 20-25% of Parliament
2. Share of words spoken by women rises from 15% to 28% over 1994--2017
3. Increase mainly comes from doubling involvement in routine government business
And that's before we get to the words! Who knows what these people are actually saying? [tidytext](https://github.com/juliasilge/tidytext) does! And after writing this, I found out about [Linked Parliamentary Data Project (LiPaD)](https://www.lipad.ca/data/) at the University of Toronto, an even larger historical digital source of Canadian Hansards. Who knows what we can come up with!