---
title: "01-1_get_data"
subtitle: "Read, subset, and save the data"
author: "Ross Gayler"
date: "2021-01-07"
output: workflowr::wflow_html
editor_options:
  chunk_output_type: console
  markdown:
    wrap: 72
---
```{r setup}
# Set up the project environment, because each Rmd file knits in a new R session
# so doesn't get the project setup from .Rprofile
library(here)
source(here::here("code", "setup_project.R"))
# Extra set up for the 01*.Rmd notebooks
source(here::here("code", "setup_01.R"))
# Extra set up for this notebook
library(vroom) # fast reading of delimited text files
# start the execution time clock
tictoc::tic("Computation time (excl. render)")
```

# Introduction

The `01*.Rmd` notebooks read the data, filter it to the subset to be
used for modelling, characterise it to understand it, check for possible
gotchas, clean it, and save it for the analyses proper.

This notebook (`01-1_get_data`) reads the raw data, subsets it to the
data we will use, briefly sanity checks the data, and saves it in an
R-friendly format.

Some of the analyses have been run on a laptop computer with 16 GB RAM.
Consequently it is helpful to minimise the data volume as soon as
possible. This means we will occasionally take a diversion in the
analyses to make the data fit the hardware.

## Entity data

This project uses historical voter registration data from the [North
Carolina State Board of Elections](https://www.ncsbe.gov/). This
information is made publicly available in accordance with [North
Carolina state
law](https://s3.amazonaws.com/dl.ncsbe.gov/ReadMe_PUBLIC_DATA.txt). The
[Voter Registration Data
page](https://www.ncsbe.gov/results-data/voter-registration-data) links
to a [folder of Voter Registration
snapshots](https://dl.ncsbe.gov/index.html?prefix=data/Snapshots/),
which contains the snapshot data files and a [metadata file describing
the layout of the snapshot data
files](https://s3.amazonaws.com/dl.ncsbe.gov/data/Snapshots/layout_VR_Snapshot.txt).

At the time of writing the snapshot files cover the years 2005 to 2020
with at least one snapshot per year. The files are [ZIP
compressed](https://en.wikipedia.org/wiki/ZIP_(file_format)) and
relatively large, with the smallest being 572 MB after compression.

The snapshots contain many columns that are irrelevant to this project
(e.g. school district name) and/or prohibited under Australian privacy
law (e.g. political affiliation, race). We do not read these unneeded
columns from the snapshot file.

We use only one snapshot file
([VR_Snapshot_20051125.zip](https://s3.amazonaws.com/dl.ncsbe.gov/data/Snapshots/VR_Snapshot_20051125.zip))
because this project does not investigate linkage of records across
time. We chose the oldest snapshot (2005) because it is the smallest and
the contents are the most out of date, minimising the current
information made available. Note that this project will not generate any
information that is not already directly, publicly available from NCSBE.

# Read entity data

The snapshot ZIP file was manually downloaded (572 MB), uncompressed
(5.7 GB), then re-compressed in [XZ
format](https://en.wikipedia.org/wiki/XZ_Utils) to minimise the size
(248 MB). The compressed snapshot file and the metadata file are stored
in the `data` directory.
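The re-compression was done outside R. The XZ round trip can be sketched
in shell; this is only an illustration on a tiny stand-in file
(`sample.tsv` is a placeholder, not the real snapshot):

```shell
# Illustrate the XZ re-compression round trip on a tiny stand-in file.
# For the real data, sample.tsv would be the 5.7 GB uncompressed snapshot.
printf 'last_name\tfirst_name\nGAYLER\tROSS\n' > sample.tsv
xz -9 -f --keep sample.tsv                  # writes sample.tsv.xz, keeps the original
xz --decompress --stdout sample.tsv.xz > roundtrip.tsv
cmp -s sample.tsv roundtrip.tsv && echo "lossless round trip"
# prints: lossless round trip
```

(`vroom` can read `xz`-compressed files directly, so the uncompressed
5.7 GB file never needs to be kept on disk.)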
The data is tab-separated, not fixed-width as you might reasonably think
from reading the metadata. The field widths (interpreted as maximum
lengths) in the metadata are not accurate. Some fields contain values
longer than the stated width.
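This can be checked by comparing the observed maximum field lengths
against the widths stated in the metadata. A minimal sketch (not
evaluated here; it assumes the data has already been read into `d` with
every column as character):

```{r eval=FALSE}
# Maximum observed length of each field, read as character; compare
# these against the widths stated in layout_VR_Snapshot.txt
d %>%
  dplyr::summarise(
    dplyr::across(dplyr::everything(), ~ max(nchar(.x), na.rm = TRUE))
  ) %>%
  glimpse()
```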
Inspection of the raw data shows that the character fields are unquoted.
However, at least one character value contains a double-quote character,
which has the potential to confuse the parsing if it is looking for
quoted values.
```{r}
# Show the raw data file location
# This is set in code/file_paths.R
f_entity_uncln_tsv
# read the raw data
d <- vroom::vroom(
f_entity_uncln_tsv,
# n_max = 1e4, # limit the rows for testing
col_select = c( # get all the columns that might conceivably be used
# the names and ordering are from the metadata file
snapshot_dt : voter_status_reason_desc, # 9 cols
last_name : street_sufx_cd, # 10 cols
unit_num : zip_code, # 4 cols
area_cd, phone_num, # 2 cols
sex_code : registr_dt, # 5 cols
cancellation_dt, load_dt # 2 cols
), # total 32 cols
col_types = cols(
.default = col_character() # all cols as chars to allow for bad formatting
),
delim = "\t", # assume that fields are *only* delimited by tabs
col_names = TRUE, # use the column names on the first line of data
na = "", # missing fields are empty string or whitespace only (see trim_ws argument)
quote = "", # don't allow for quoted strings
comment = "", # don't allow for comments
trim_ws = TRUE, # trim leading and trailing whitespace
escape_double = FALSE, # assume no escaped quotes
escape_backslash = FALSE # assume no escaped backslashes
) %>%
tibble::as_tibble() %>%
dplyr::mutate( # convert the datetime cols to dates
snapshot_dt = lubridate::as_date(snapshot_dt),
registr_dt = lubridate::as_date(registr_dt),
cancellation_dt = lubridate::as_date(cancellation_dt),
load_dt = lubridate::as_date(load_dt)
)
```
Check the number of rows and columns read.
```{r}
dim(d)
```
- Correct number of data rows extracted (external line count of input
file = 8,003,294)
- Correct number of columns extracted (checked against manual count of
columns in metadata)
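The external line count was obtained outside R. The header-offset
arithmetic looks like this, demonstrated on a tiny stand-in file (the
real input would be the XZ-compressed snapshot):

```shell
# Count data rows of a compressed delimited file without storing the
# uncompressed version (tiny stand-in file, not the real snapshot)
printf 'h1\th2\n1\t2\n3\t4\n' | xz > sample_count.tsv.xz
lines=$(xz --decompress --stdout sample_count.tsv.xz | wc -l)
echo "$((lines - 1)) data rows (excluding the header line)"
# prints: 2 data rows (excluding the header line)
```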
# Exclusions

Preliminary examination of the data showed that about half the rows
correspond to people who have been removed from the electoral roll.
Remove these rows from the data. Keep only the rows flagged as ACTIVE
and VERIFIED, because by a common-sense interpretation of those labels
these rows have passed the electoral checking criteria and therefore
are least likely to contain errors or be duplicates.

Preliminary examination also discovered one row that was an obvious
test record yet was flagged as ACTIVE and VERIFIED. Remove that row
from the data too.

Remove those rows now to reduce processing time for later steps and to
avoid thinking about records that won't be used.
```{r}
d <- d %>%
dplyr::filter(
voter_status_desc == "ACTIVE" & voter_status_reason_desc == "VERIFIED",
! (first_name == "THIS" & last_name == "TEST")
)
dim(d)
```
# Quick check

Take a very quick look at all the columns to see if they contain the
expected content.
```{r}
glimpse(d)
skimr::skim(d)
```
## No useful variation

The following variable contains only missing values:

- `ncid` North Carolina identification number (NCID) of voter

The following variables have exactly one unique nonmissing value:

- `snapshot_dt` Date of snapshot
- `load_dt` Data load date

The following variables have exactly one unique nonmissing value
because of selecting only ACTIVE & VERIFIED records:

- `status_cd` Status code for voter registration
- `voter_status_desc` Status code description
- `reason_cd` Reason code for voter registration status
- `voter_status_reason_desc` Reason code description

Those variables cannot possibly be useful. Drop them from the data.
```{r}
d <- d %>%
dplyr::select(
-c(ncid, snapshot_dt, load_dt,
status_cd, voter_status_desc, reason_cd, voter_status_reason_desc)
)
```
## Useful variation

The remaining variables have more than one unique nonmissing value, so
they are potentially usable.

- Eyeball checking of these variables shows they appear to have
  contents of the expected type.

Add an identity variable. We assume that each record corresponds to a
unique person, so we just number the records sequentially.
```{r}
d <- d %>%
dplyr::mutate(id = 1:nrow(.))
```
# Save entity data

The usable data is stored as an [`fst` format](https://www.fstpackage.org/)
file in the `output` directory. This format stores only a data frame and
can be read *very* rapidly. It is possible to read a subset of the stored
columns, so we don't have to be too worried about storing columns that
aren't always needed.
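For instance, a later notebook could pull just a few columns straight
from the saved file. A sketch (not evaluated here; `id`, `last_name`,
and `first_name` are columns present in this data):

```{r eval=FALSE}
# Read only the columns needed from the saved fst file
d_names <- fst::read_fst(
  f_entity_fst,
  columns = c("id", "last_name", "first_name")
)
```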
```{r}
# Show the saved entity data file location
# This is set in code/file_paths.R
f_entity_fst
# save the usable entity data (cheap-skate caching)
d %>% fst::write_fst(f_entity_fst, compress = 100)
```
# Timing {.unnumbered}
```{r echo=FALSE}
tictoc::toc()
```