---
title: "01-5_check_name"
subtitle: "Check name variables"
author: "Ross Gayler"
date: "2021-01-12"
output: workflowr::wflow_html
editor_options:
  chunk_output_type: console
  markdown:
    wrap: 72
---
```{r setup}
# Set up the project environment, because each Rmd file knits in a new R session
# so doesn't get the project setup from .Rprofile
# Project setup
library(here)
source(here::here("code", "setup_project.R"))
# Extra set up for the 01*.Rmd notebooks
source(here::here("code", "setup_01.R"))
# Extra set up for this notebook
# ???
# start the execution time clock
tictoc::tic("Computation time (excl. render)")
```
# Introduction
The `01*.Rmd` notebooks read the data, filter it to the subset to be
used for modelling, characterise it to understand it, check for possible
gotchas, clean it, and save it for the analyses proper.
This notebook (`01-5_check_name`) characterises the name variables in
the saved subset of the data.
These variables will be used to construct the main predictors in the
compatibility models.
We intend to use the one snapshot file as both the database to be
queried and as the set of queries. Consequently, strictly speaking, we
don't need to standardise the name variables because the database and
query records are guaranteed to be identical (they will literally be the
same record). However, we will look at the name variables with an eye to
standardisation because it is never a good idea to statistically model
data without having an idea about the quality of the data. We will apply
some basic standardisation to the name variables, if appropriate,
because it parallels what would be necessary in practice.
------------------------------------------------------------------------
Define the name variables.
```{r}
vars_name <- c(
  "last_name", "first_name", "midl_name", "name_sufx_cd"
)
```
Read the usable data. Remember that this consists of only the ACTIVE &
VERIFIED records.
```{r}
# Show the entity data file location
# This is set in code/file_paths.R
fs::path_file(f_entity_raw_fst)
# get data for next section of analyses
d <- fst::read_fst(
  f_entity_raw_fst,
  columns = c(vars_name, "sex") # get sex as well for cross-checking
) %>%
  tibble::as_tibble()
dim(d)
```
Take a quick look at the distributions.
```{r}
d %>% skimr::skim()
```
- `last_name` 100% filled
- `first_name` \~100% filled (23 missing)
- `midl_name` 94% filled
- `name_sufx_cd` 6% filled
# Name length
Look at the distributions of name lengths first, before moving on to
analyses more focused on standardisation.
Calculate the lengths of the name variables.
```{r}
x <- d %>%
  dplyr::mutate(
    len_last = stringr::str_length(last_name),
    len_first = stringr::str_length(first_name),
    len_midl = stringr::str_length(midl_name)
  )
```
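As a small aside on how missing names behave in the length calculation, the sketch below uses invented toy values (not taken from the voter data) and base R `nchar()` in place of `stringr::str_length()`. Both return `NA` for a missing string, which is why the tabulations in the following sections are called with `useNA = "ifany"`.

```r
# Toy first names, invented for illustration only
toy_first <- c("JO", "A", NA, "ELIZABETH-LINDSAY")

# nchar() returns NA for a missing string, so missing names stay visible
toy_len <- nchar(toy_first, type = "chars")

# useNA = "ifany" adds an <NA> cell to the tabulation when any length is missing
toy_tab <- table(toy_len, useNA = "ifany")
print(toy_tab)
```

Without `useNA = "ifany"`, the missing name would silently drop out of the table.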
## last_name
`last_name` Voter last name
Look at the distributions of name lengths.
```{r}
summary(x$len_last)
table(x$len_last, useNA = "ifany")
x %>%
ggplot() +
geom_histogram(aes(x = len_last), binwidth = 1) +
scale_y_sqrt()
```
Look at examples of short names.
```{r}
# length == 1
x %>%
dplyr::filter(len_last == 1) %>%
dplyr::select(ends_with("_name")) %>%
dplyr::arrange(last_name, first_name) %>%
knitr::kable()
```
- 1-letter last names are very rare
- 1-letter last names are probably errors
```{r}
# length == 2
x %>%
dplyr::filter(len_last == 2) %>%
dplyr::select(ends_with("_name")) %>%
dplyr::slice_sample(n = 20) %>%
dplyr::arrange(last_name, first_name) %>%
knitr::kable()
```
- Most 2-letter last names are probably valid.
- ST is probably Saint from a multi-word last name
Look at examples of long names.
```{r}
# length == 21
x %>%
dplyr::filter(len_last == 21) %>%
dplyr::select(ends_with("_name")) %>%
dplyr::arrange(last_name, first_name) %>%
knitr::kable()
```
- 21-letter last names are hyphenated
```{r}
# length >= 20
x %>%
dplyr::filter(len_last >= 20) %>%
dplyr::select(ends_with("_name")) %>%
dplyr::slice_sample(n = 20) %>%
dplyr::arrange(last_name, first_name) %>%
knitr::kable()
```
- 20+-letter last names appear to be multi-word and/or hyphenated
## first_name
`first_name` Voter first name
Look at the distributions of name lengths.
```{r}
summary(x$len_first)
table(x$len_first, useNA = "ifany")
x %>%
ggplot() +
geom_histogram(aes(x = len_first), binwidth = 1) +
scale_y_sqrt()
```
Look at the missing names.
```{r}
x %>%
dplyr::filter(is.na(first_name)) %>%
dplyr::select(ends_with("_name")) %>%
dplyr::arrange(last_name, first_name) %>%
knitr::kable()
```
- Some missing first names look like the middle name is actually the
first name, e.g. ? JASON ALEXANDER
- Some missing first names appear to have only a last name, e.g. ? ?
AMEN
- Some missing first names appear to have the entire name in the last
name variable, e.g. ? ? FRYE WILLIAM C
Look at examples of short names.
```{r}
# length == 1
x %>%
dplyr::filter(len_first == 1) %>%
dplyr::select(ends_with("_name")) %>%
dplyr::slice_sample(n = 20) %>%
dplyr::arrange(last_name, first_name) %>%
knitr::kable()
```
- The 1-letter first names appear to be using an initial as the first
name
```{r}
# length == 2
x %>%
dplyr::filter(len_first == 2) %>%
dplyr::select(ends_with("_name")) %>%
dplyr::slice_sample(n = 20) %>%
dplyr::arrange(last_name, first_name) %>%
knitr::kable()
```
2-letter first names appear to be:
- Valid, e.g. JO W CLARK, HO NGOC NGUYEN
- Part of a multi-word name that has been split across the first and
middle name variables, e.g. LA SONDA FOWLER
Look at the long names.
```{r}
# length >= 16
x %>%
dplyr::filter(len_first >= 16) %>%
dplyr::select(ends_with("_name")) %>%
dplyr::arrange(last_name, first_name) %>%
knitr::kable()
```
Long first names appear to be:
- Long non-Anglo names, e.g. LAKSHMINARAYANAN
- Multi-word and/or hyphenated, e.g. ELIZABETH-LINDSAY
## midl_name
`midl_name` Voter middle name
These names will often be missing or initials only.
Look at the distributions of name lengths.
```{r}
summary(x$len_midl)
table(x$len_midl, useNA = "ifany")
x %>%
ggplot() +
geom_histogram(aes(x = len_midl), binwidth = 1) +
scale_y_sqrt()
```
- *Many* records are missing middle name
- Spike of 1-letter names will be initials
Look at the long names.
```{r}
# length >= 16
x %>%
dplyr::filter(len_midl >= 16) %>%
dplyr::select(ends_with("_name")) %>%
dplyr::slice_sample(n = 20) %>%
dplyr::arrange(last_name, first_name) %>%
knitr::kable()
```
- Long middle names appear to be multiple names and/or hyphenated
```{r}
# clean up
rm(x)
gc()
```
# name_sufx_cd
`name_sufx_cd` Voter name suffix
This is intended for generation markers, e.g. Junior, Senior.
I am not going to use name suffix in entity resolution because age
should be sufficient and is much better quality. I will look at what
values turn up in the name suffix because the same values sometimes
wrongly occur in the main name variables. Knowing what values occur may
help us to remove those values from the main name variables.
```{r}
d %>% dplyr::select(name_sufx_cd) %>% skimr::skim()
table(d$name_sufx_cd, useNA = "ifany") %>% sort() %>% rev()
# get a better look at the cleaned suffixes
d %>%
  dplyr::mutate(
    sufx = name_sufx_cd %>%
      stringr::str_to_upper() %>%
      stringr::str_remove_all(pattern = "[^A-Z0-9]") %>% # remove non-alphanumeric
      dplyr::na_if("")
  ) %>%
  dplyr::count(sufx) %>%
  dplyr::filter(n > 1) %>%
  dplyr::arrange(desc(n), sufx) %>%
  knitr::kable()
```
- There are generation suffixes: JR, SR, I, II (11), III (111), IV, V,
VI, VII
- There are honorific titles: MRS, MR, MS, DR, REV
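
Knowing which suffix values occur suggests one way they might be removed when they leak into the main name variables. The following is only a sketch under stated assumptions: `strip_trailing_suffix()` and `sufx_words` are hypothetical names introduced here, the word list is a deliberately small subset of the values noted above, and base R is used instead of the notebook's stringr pipeline. It is not a step this notebook performs.

```r
# Hypothetical suffix list: a subset of the generation suffixes observed above.
# Single letters such as I and V are left out because they are plausible initials.
sufx_words <- c("JR", "SR", "II", "III", "IV")

# Hypothetical helper: drop one trailing suffix word, which must be preceded
# by a space or comma and may be followed by a period
strip_trailing_suffix <- function(name, words = sufx_words) {
  pattern <- paste0("[ ,]+(", paste(words, collapse = "|"), ")\\.?$")
  sub(pattern, "", name)
}

# Toy last names, invented for illustration only
toy_last <- c("SMITH JR", "JONES, III", "ST JOHN", "SRIVASTAVA")
stripped <- strip_trailing_suffix(toy_last)
print(stripped)
```

Requiring a preceding space or comma keeps names that merely begin with suffix letters (such as SRIVASTAVA) intact.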
# Standardisation
Look at issues that might be addressed by standardisation.
For each type of standardisation issue, look at first, middle, and last
names separately, because the issue may manifest differently in each of
the name variables.
## Lower-case letters
```{r}
d %>% dplyr::select(last_name) %>%
dplyr::filter(stringr::str_detect(last_name, "[a-z]"))
d %>% dplyr::select(first_name) %>%
dplyr::filter(stringr::str_detect(first_name, "[a-z]"))
d %>% dplyr::select(midl_name) %>%
dplyr::filter(stringr::str_detect(midl_name, "[a-z]"))
```
- Lower case letters occur in last, first, and middle names
- Associated with particles where there would optionally be a space,
e.g. JoANN, McBride
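
One possible standardisation for this issue is simply to fold everything to upper case. The sketch below shows the detection and the fold on invented toy values, using base R (`grepl()` with `perl = TRUE` so that `[a-z]` is interpreted as an ASCII range, and `toupper()`). It is an illustration, not a decision the notebook has made at this point.

```r
# Toy names, invented for illustration only
toy_names <- c("JoANN", "McBride", "SMITH")

# Detect any lower-case ASCII letter (perl = TRUE avoids locale-dependent ranges)
has_lower <- grepl("[a-z]", toy_names, perl = TRUE)

# Fold to upper case so that JoANN and JOANN compare as equal
upper_names <- toupper(toy_names)
print(data.frame(toy_names, has_lower, upper_names))
```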
## Non-alphanumeric
Check for non-alphanumeric characters in names.
### Hyphen
Check for hyphens.
```{r}
x <- d %>%
dplyr::filter(stringr::str_detect(last_name, "-"))
nrow(x)
x %>%
dplyr::slice_sample(n = 20) %>%
dplyr::arrange(last_name, sex) %>%
knitr::kable()
```
- \~21k last names with hyphens
- Look like legitimately hyphenated last names
```{r}
x <- d %>%
dplyr::filter(stringr::str_detect(first_name, "-"))
nrow(x)
x %>%
dplyr::slice_sample(n = 20) %>%
dplyr::arrange(first_name, sex) %>%
knitr::kable()
```
- \~3k first names with hyphens
- Look like legitimately hyphenated first names
```{r}
x <- d %>%
dplyr::filter(stringr::str_detect(midl_name, "-"))
nrow(x)
x %>%
dplyr::slice_sample(n = 20) %>%
dplyr::arrange(midl_name, sex) %>%
knitr::kable()
```
- ~4k middle names with hyphens
- Look like legitimately hyphenated middle names
### Quote
Check for quotes.
```{r}
x <- d %>%
dplyr::filter(stringr::str_detect(last_name, "'"))
nrow(x)
x %>%
dplyr::slice_sample(n = 20) %>%
dplyr::arrange(last_name, sex) %>%
knitr::kable()
```
- ~5k last names with quotes
- Look like legitimately quoted last names
```{r}
x <- d %>%
dplyr::filter(stringr::str_detect(first_name, "'"))
nrow(x)
x %>%
dplyr::slice_sample(n = 20) %>%
dplyr::arrange(first_name, sex) %>%
knitr::kable()
```
- ~1k first names with quotes
- Look like legitimately quoted first names
```{r}
x <- d %>%
dplyr::filter(stringr::str_detect(midl_name, "'"))
nrow(x)
x %>%
dplyr::slice_sample(n = 20) %>%
dplyr::arrange(midl_name, sex) %>%
knitr::kable()
```
- ~3k middle names with quotes
- Look like legitimately quoted middle names
### Period
Check for periods.
```{r}
x <- d %>%
dplyr::filter(stringr::str_detect(last_name, "\\."))
nrow(x)
x %>%
dplyr::slice_sample(n = 20) %>%
dplyr::arrange(last_name, sex) %>%
knitr::kable()
```
- 11 last names with periods
- Look like legitimate abbreviations
```{r}
x <- d %>%
dplyr::filter(stringr::str_detect(first_name, "\\."))
nrow(x)
x %>%
dplyr::slice_sample(n = 20) %>%
dplyr::arrange(first_name, sex) %>%
knitr::kable()
```
- 120 first names with periods
- Look like initials
```{r}
x <- d %>%
dplyr::filter(stringr::str_detect(midl_name, "\\."))
nrow(x)
x %>%
dplyr::slice_sample(n = 20) %>%
dplyr::arrange(midl_name, sex) %>%
knitr::kable()
```
- ~2k middle names with periods
- Look like initials
### Comma
Check for commas.
```{r}
x <- d %>%
dplyr::filter(stringr::str_detect(last_name, ","))
nrow(x)
x %>%
dplyr::slice_sample(n = 20) %>%
dplyr::arrange(last_name, sex) %>%
knitr::kable()
```
- 2 last names with commas
- Punctuation for suffix field values added to last name
```{r}
x <- d %>%
dplyr::filter(stringr::str_detect(first_name, ","))
nrow(x)
x %>%
dplyr::slice_sample(n = 20) %>%
dplyr::arrange(first_name, sex) %>%
knitr::kable()
```
- 4 first names with commas
- Arbitrary added punctuation
- Punctuation for suffix field value added to first name
```{r}
x <- d %>%
dplyr::filter(stringr::str_detect(midl_name, ","))
nrow(x)
x %>%
dplyr::slice_sample(n = 20) %>%
dplyr::arrange(midl_name, sex) %>%
knitr::kable()
```
- 12 middle names with commas
- List separator
- Punctuation to squeeze in extra field
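
The punctuation checks above repeat the same filter for each character and each name variable. As a compact cross-check, the counts could also be collected into one table. This is a sketch on an invented toy data frame (`toy` and `chars` are hypothetical names), using base R with `fixed = TRUE` so that the period is matched literally rather than as a regular-expression wildcard.

```r
# Toy data frame, invented for illustration only
toy <- data.frame(
  last_name  = c("SMITH-JONES", "O'NEAL", "ST. JOHN", "LEE"),
  first_name = c("ANN-MARIE", "D'ARCY", "J.", "KIM"),
  midl_name  = c(NA, "A.", "LEE, JR", "B"),
  stringsAsFactors = FALSE
)

# Punctuation characters examined in the sections above
chars <- c(hyphen = "-", quote = "'", period = ".", comma = ",")

# Rows are name variables, columns are characters;
# each cell is the number of values containing that character
counts <- sapply(chars, function(ch) {
  sapply(toy, function(col) sum(grepl(ch, col, fixed = TRUE)))
})
print(counts)
```

Note that base `grepl()` returns `FALSE` for a missing value, so missing names count as not containing the character.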
### Other non-alphanumeric
Check for other non-alphanumeric characters.
```{r}
x <- d %>%
dplyr::filter(stringr::str_detect(last_name, "[^ a-zA-Z0-9\\.,'-]"))
nrow(x)
x %>%
dplyr::slice_sample(n = 20) %>%
dplyr::arrange(last_name, sex) # %>%
# knitr::kable() # some of the characters break the kable formatting
```
- 31 last names with other non-alphanumeric characters
- Most look like substitutions for hyphen or quote
- Some look like random cruft
```{r}
x <- d %>%
dplyr::filter(stringr::str_detect(first_name, "[^ a-zA-Z0-9\\.,'-]"))
nrow(x)
x %>%
dplyr::slice_sample(n = 20) %>%
dplyr::arrange(first_name, sex) # %>%
# knitr::kable() # some of the characters break the kable formatting
```
- 102 first names with other non-alphanumeric characters
- Some look like substitutions for hyphen or quote
- Some are parenthetical notes
- Some look like random cruft
```{r}
x <- d %>%
dplyr::filter(stringr::str_detect(midl_name, "[^ a-zA-Z0-9\\.,'-]"))
nrow(x)
x %>%
dplyr::slice_sample(n = 20) %>%
dplyr::arrange(midl_name, sex) %>%
knitr::kable()
```
- ~1k middle names with other non-alphanumeric characters
- Some look like substitutions for hyphen
- Many are parenthetical notes (NMN = no middle name)
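
To see which unexpected characters are actually present, rather than only which rows contain them, the offending characters can be extracted and tabulated. The sketch below does this on invented toy values with base R. The pattern is the base R / PCRE spelling of the character class used above: inside the brackets the period needs no escape.

```r
# Toy middle names, invented for illustration only
toy_midl <- c("ANN_MARIE", "(NMN)", "O`NEAL", "LEE", "JEAN/PAUL")

# Anything other than space, ASCII alphanumerics, period, comma, quote, hyphen
other_pattern <- "[^ a-zA-Z0-9.,'-]"

# Extract every matching character from every value, then count each character
found <- unlist(regmatches(toy_midl, gregexpr(other_pattern, toy_midl, perl = TRUE)))
char_counts <- sort(table(found), decreasing = TRUE)
print(char_counts)
```

A table like this makes it easier to tell hyphen or quote substitutes apart from parentheses and other cruft.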
## Digits
Check for digits.
### Zero
Check for zero.
```{r}
x <- d %>%
dplyr::filter(stringr::str_detect(last_name, "0"))
nrow(x)
x %>%
dplyr::slice_sample(n = 20) %>%
dplyr::arrange(last_name, sex) %>%
knitr::kable()
```
- 29 last names with zero
- Substitution for O
```{r}
x <- d %>%
dplyr::filter(stringr::str_detect(first_name, "0"))
nrow(x)
x %>%
dplyr::slice_sample(n = 20) %>%
dplyr::arrange(first_name, sex) %>%
knitr::kable()
```
- 33 first names with zero
- Substitution for O
```{r}
x <- d %>%
dplyr::filter(stringr::str_detect(midl_name, "0"))
nrow(x)
x %>%
dplyr::slice_sample(n = 20) %>%
dplyr::arrange(midl_name, sex) %>%
knitr::kable()
```
- 77 middle names with zero
- Some are substitution for O
- Some are in superfluous numbers
### One
Check for one.
```{r}
x <- d %>%
dplyr::filter(stringr::str_detect(last_name, "1"))
nrow(x)
x %>%
dplyr::slice_sample(n = 20) %>%
dplyr::arrange(last_name, sex) %>%
knitr::kable()
```
- 1 last name with one
- Substitution for I in generation suffix (111 = III)
```{r}
x <- d %>%
dplyr::filter(stringr::str_detect(first_name, "1"))
nrow(x)
```
- 0 first names with one
```{r}
x <- d %>%
dplyr::filter(stringr::str_detect(midl_name, "1"))
nrow(x)
x %>%
dplyr::slice_sample(n = 20) %>%
dplyr::arrange(midl_name, sex) %>%
knitr::kable()
```
- 39 middle names with one
- Some are substitution for I in generation suffix
- Some are in superfluous numbers
### Other digits
Check for other digits.
```{r}
x <- d %>%
dplyr::filter(stringr::str_detect(last_name, "[2-9]"))
nrow(x)
x %>%
dplyr::slice_sample(n = 20) %>%
dplyr::arrange(last_name, sex) %>%
knitr::kable()
```
- 1 last name with a 5
- Random insertion
```{r}
x <- d %>%
dplyr::filter(stringr::str_detect(first_name, "[2-9]"))
nrow(x)
x %>%
dplyr::slice_sample(n = 20) %>%
dplyr::arrange(first_name, sex) %>%
knitr::kable()
```
- 2 first names with digits 2-9
- Look like random insertions
```{r}
x <- d %>%
dplyr::filter(stringr::str_detect(midl_name, "[2-9]"))
nrow(x)
x %>%
dplyr::slice_sample(n = 20) %>%
dplyr::arrange(midl_name, sex) %>%
knitr::kable()
```
- 24 middle names with digits 2-9
- One random insertion
- Most appear to be superfluous numbers (from the address?)
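
The digit findings above distinguish look-alike substitutions (0 for O, 1 for I) from superfluous numbers. A repair that respects that distinction might look like the following sketch. `fix_zero()` and `fix_ones()` are hypothetical helpers introduced here and are not part of this notebook; the rules (a zero touching a letter becomes O, a word made only of ones becomes the same number of Is) are assumptions chosen for illustration.

```r
# Toy names, invented for illustration only
toy_names <- c("J0HNSON", "0LIVER", "SMITH 111", "APT 204", "LEE")

# Hypothetical helper: a zero directly next to an upper-case letter becomes O
fix_zero <- function(x) {
  gsub("(?<=[A-Z])0|0(?=[A-Z])", "O", x, perl = TRUE)
}

# Hypothetical helper: a whole word consisting only of ones (11, 111) becomes Is
fix_ones <- function(x) {
  m <- gregexpr("\\b1+\\b", x, perl = TRUE)
  regmatches(x, m) <- lapply(regmatches(x, m), function(w) chartr("1", "I", w))
  x
}

fixed_names <- fix_ones(fix_zero(toy_names))
print(fixed_names)
```

Digits embedded in other numbers (such as the 0 in 204) are left alone, because they look like superfluous numbers rather than letter substitutions.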
## Special words
Look for special words that shouldn't be in names.
Define word patterns to search for.
```{r}
# honorifics
w_hons <- c(
  "MR", "MISTER", "MASTER", "MRS", "MS", "MISS",
  "REV", "REVEREND", "SR", "SISTER", "BR", "BROTHER",
  "FATHER", "MOTHER", "PASTOR", "ELDER", "BISHOP",
  "DR", "DOCTOR", "MD", "PROF", "PROFESSOR"
)

# generation suffixes
w_gen <- c(
  "JR", "JNR", "JUNIOR", "SR", "SNR", "SENIOR",
  "1ST", "2ND", "3RD", "4TH", "5TH", "6TH", "7TH", "8TH",
  "FIRST", "SECOND", "THIRD", "FOURTH", "FIFTH", "SIXTH", "SEVENTH", "EIGHTH", "EIGHTTH",
  "1", "2", "3", "4", "5", "6", "7", "8",
  "I", "II", "III", "IIII", "IV", "V", "VI"
)

# special values
w_spec <- c(
  "NN", "NMN", "NAME",
  "UNK", "UNKNOWN", "AKA", "KNOWN AS", "ALSO KNOWN AS", "ALIAS",
  "BLIND"
)

# test
w_test <- c(
  "TEST", "TST", "DUMMY", "VOTER", "([A-Z])\\1{2,}"
)
```
### Last name
```{r}
# regular expression to match words
w_regexp <-
  c(w_hons, w_gen, w_spec, w_test) %>% # all special words
  unique() %>% # make it a set
  dplyr::setdiff( # remove words that appear to mostly be validly used
    c(
      "BISHOP",
      "BLIND",
      "BROTHER",
      "DOCTOR",
      "ELDER",
      "FIRST",
      "JUNIOR",
      "MASTER",
      "MISS",
      "MISTER",
      "PASTOR",
      "SENIOR",
      "TEST",
      "THIRD",
      "VOTER"
    )
  ) %>%
  glue::glue(x = ., "\\b{x}\\b") %>% # must be words
  glue::glue_collapse(sep = "|") # search for any

x <- d %>%
  dplyr::mutate(
    match =
      last_name %>%
      stringr::str_to_upper() %>%
      stringr::str_replace_all(pattern = "[^ A-Z]", replacement = " ") %>%
      stringr::str_squish() %>%
      stringr::str_extract(pattern = w_regexp)
  ) %>%
  dplyr::filter(!is.na(match))

nrow(x)

x %>%
  dplyr::arrange(match, sex, last_name, first_name) %>%
  knitr::kable()
```
I eyeballed the results and removed words which appeared to be mostly
validly used.
Invalid words:
- As whole field:
- As first word:
- As last word: DR, II, III, IIII, IV, JR, MD, SR
- As internal word: SR
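
To make the matching logic above easier to follow, here is a reduced sketch of the same idea on invented toy values: clean the name to upper-case letters and single spaces, then extract the first special word that appears as a whole word. It swaps in base R (`paste0()`, `gsub()`, `regexpr()`) for the glue and stringr calls, and `words`, `toy_last`, and `matched` are hypothetical names used only here.

```r
# A few special words, standing in for the full lists defined above
words <- c("JR", "SR", "MD", "III")

# Wrap each word in word boundaries and join the alternatives with "|"
pattern <- paste0("\\b", words, "\\b", collapse = "|")

# Toy last names, invented for illustration only
toy_last <- c("SMITH JR", "JRADI", "O'MD", "WILLIAMS-III", "BROWN")

# Same cleaning idea as above: upper-case, non-letters to spaces, squeeze spaces
cleaned <- gsub("[^ A-Z]", " ", toupper(toy_last), perl = TRUE)
cleaned <- trimws(gsub(" +", " ", cleaned))

# Extract the first whole-word match, or NA when there is none
hit <- regexpr(pattern, cleaned, perl = TRUE)
matched <- rep(NA_character_, length(cleaned))
matched[hit > 0] <- regmatches(cleaned, hit)
print(data.frame(toy_last, cleaned, matched))
```

Two consequences of the cleaning step are worth keeping in mind when reading the results. Punctuation becomes a word break, so `O'MD` yields the word `MD`. Digits are also replaced by spaces before matching, so the digit-containing entries in the word lists (for example `1ST` or `2`) cannot match after this cleaning.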
### First name
```{r}
# regular expression to match words
w_regexp <-
  c(w_hons, w_gen, w_spec, w_test) %>% # all special words
  unique() %>% # make it a set
  dplyr::setdiff( # remove words that appear to mostly be validly used
    c(
      "BISHOP",
      "BROTHER",
      "DOCTOR",
      "ELDER",
      "JUNIOR",
      "MASTER",
      "MISTER",
      "PASTOR",
      "PROFESSOR"
    )
  ) %>%
  glue::glue(x = ., "\\b{x}\\b") %>% # must be words
  glue::glue_collapse(sep = "|") # search for any

x <- d %>%
  dplyr::mutate(
    match =
      first_name %>%
      stringr::str_to_upper() %>%
      stringr::str_replace_all(pattern = "[^ A-Z]", replacement = " ") %>%
      stringr::str_squish() %>%
      stringr::str_extract(pattern = w_regexp)
  ) %>%
  dplyr::filter(!is.na(match))

nrow(x)

x %>%
  dplyr::arrange(match, sex, last_name, first_name) %>%
  knitr::kable()
```
I eyeballed the results and removed words which appeared to be mostly
validly used.
Invalid words:
- As whole field: FATHER, III, IV, JR, MD, MR, MRS, SISTER, SR
- As first word: DR, MISS, MRS, REV, SISTER
- As last word: III, JR, MRS, NMN, SR
- As internal word: MRS
### Middle name
```{r}
# regular expression to match words
w_regexp <-
  c(w_hons, w_gen, w_spec, w_test) %>% # all special words
  unique() %>% # make it a set
  dplyr::setdiff( # remove words that appear to mostly be validly used
    c(
      "BISHOP",
      "BLIND",
      "BR",
      "BROTHER",
      "DOCTOR",
      "ELDER",
      "FIRST",
      "JR", # invalid & too many to display
      "JUNIOR",
      "MASTER",
      "MISTER",
      "MRS", # invalid & too many to display
      "NMN", # invalid & too many to display
      "PASTOR",
      "SENIOR",
      "SISTER",
      "I",
      "V",
      "VI",
      "VOTER"
    )
  ) %>%
  glue::glue(x = ., "\\b{x}\\b") %>% # must be words
  glue::glue_collapse(sep = "|") # search for any

x <- d %>%
  dplyr::mutate(
    match =
      midl_name %>%
      stringr::str_to_upper() %>%
      stringr::str_replace_all(pattern = "[^ A-Z]", replacement = " ") %>%
      stringr::str_squish() %>%
      stringr::str_extract(pattern = w_regexp)
  ) %>%
  dplyr::filter(!is.na(match))

nrow(x)

x %>%
  dplyr::arrange(match, sex, last_name, first_name) %>%
  knitr::kable()
```
I eyeballed the results and removed words which appeared to be mostly
validly used.
Invalid words:
- As whole field: AKA, DR, II, III, IV, JR, MD, MISS, MRS, MS, NMN,
REV, SR
- As first word: JR, MRS
- As last word: DR, II, III, IV, JR, MD, MISS, MR, MRS, NMN, NN, SR
- As internal word: JR
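
The invalid-word lists above were compiled by eye. If the same four-way split (whole field, first word, last word, internal word) were to be computed, it might be sketched as below. `classify_position()` is a hypothetical helper introduced here, it assumes the name has already been cleaned to upper-case words separated by single spaces, and it assumes `word` contains only letters (no regular-expression metacharacters).

```r
# Hypothetical helper: where does `word` occur within each cleaned name?
classify_position <- function(cleaned, word) {
  whole    <- cleaned == word
  first    <- grepl(paste0("^", word, " "), cleaned)
  last     <- grepl(paste0(" ", word, "$"), cleaned)
  internal <- grepl(paste0(" ", word, " "), cleaned)
  # The first condition that holds wins; NA means the word does not occur
  ifelse(whole, "whole field",
    ifelse(first, "first word",
      ifelse(last, "last word",
        ifelse(internal, "internal word", NA_character_)
      )
    )
  )
}

# Toy cleaned middle names, invented for illustration only
toy_midl <- c("JR", "JR LEE", "LEE JR", "ANN JR LEE", "LEE")
positions <- classify_position(toy_midl, "JR")
print(data.frame(toy_midl, positions))
```

Tabulating such positions per word would give counts to set beside the eyeballed lists, though it would not replace the judgement about which uses are valid.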
# Timing {.unnumbered}
```{r echo=FALSE}
tictoc::toc()
```