Prepare Raking Targets from Census Microdata


‘I’m sorry, leaf peeping? Is that something we do now?’ President Jed Bartlet, The West Wing

leafpeepr prepares data for weighting with raking packages like autumn. It creates weighting targets from census microdata. It can also recode values and collapse small categories into an "other" category, in both census data and survey data.


# install.packages("remotes")


Let’s get an example dataset of census microdata ready to be used as targets for raking.

acs_nh
#> # A tibble: 13,780 x 7
#>   PERWT   SEX   AGE  RACE HISPAN   BPL  EDUC
#>   <dbl> <dbl> <dbl> <dbl>  <dbl> <dbl> <dbl>
#> 1   124     2    57     1      0    36     8
#> 2    18     2    35     1      0    33    10
#> 3   104     2    15     1      0    33     3
#> 4   213     2    31     1      0    25    10
#> 5   182     1    52     1      0   410    10
#> # ... with 1.378e+04 more rows


leaf_recode() recodes columns using a data frame as a map.

acs_sex_codes
#> # A tibble: 2 x 2
#>    code SEX   
#>   <dbl> <chr> 
#> 1     1 Male  
#> 2     2 Female

leaf_recode(acs_nh, acs_sex_codes)
#> # A tibble: 13,780 x 7
#>   PERWT SEX      AGE  RACE HISPAN   BPL  EDUC
#>   <dbl> <chr>  <dbl> <dbl>  <dbl> <dbl> <dbl>
#> 1   124 Female    57     1      0    36     8
#> 2    18 Female    35     1      0    33    10
#> 3   104 Female    15     1      0    33     3
#> 4   213 Female    31     1      0    25    10
#> 5   182 Male      52     1      0   410    10
#> # ... with 1.378e+04 more rows
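
To make the idea of a recoding map concrete, here is a minimal base R sketch of a value lookup with `match()`. It is an illustration only, not leafpeepr's implementation; `toy` and `sex_map` are made-up objects shaped like the data above.

```r
# Toy data standing in for a few rows of acs_nh (illustrative values)
toy <- data.frame(PERWT = c(124, 18, 182), SEX = c(2, 2, 1))

# A map shaped like acs_sex_codes: a `code` column plus a column named
# after the variable being recoded
sex_map <- data.frame(
  code = c(1, 2),
  SEX  = c("Male", "Female"),
  stringsAsFactors = FALSE
)

# Find each raw value in `code` and take the label from the matching row
toy$SEX <- sex_map$SEX[match(toy$SEX, sex_map$code)]

toy
```

Values that do not appear in `code` come back as `NA` in this sketch, because `match()` returns `NA` when it finds no match.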

The recoding map can also use formulas.

acs_bpl_codes
#> # A tibble: 2 x 2
#>   code       BPL            
#>   <chr>      <chr>          
#> 1 ~ . <  120 United States  
#> 2 ~ . >= 120 Another country

leaf_recode(acs_nh, acs_bpl_codes)
#> # A tibble: 13,780 x 7
#>   PERWT   SEX   AGE  RACE HISPAN BPL              EDUC
#>   <dbl> <dbl> <dbl> <dbl>  <dbl> <chr>           <dbl>
#> 1   124     2    57     1      0 United States       8
#> 2    18     2    35     1      0 United States      10
#> 3   104     2    15     1      0 United States       3
#> 4   213     2    31     1      0 United States      10
#> 5   182     1    52     1      0 Another country    10
#> # ... with 1.378e+04 more rows
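
One way to picture formula codes is to treat the right-hand side of each formula as a test, with `.` standing for the column values. The base R sketch below works under that assumption; it is not necessarily how leafpeepr evaluates formulas, and `bpl_map`, `bpl`, and `recoded` are names made up for the example.

```r
# Formula-style codes stored as text, the way they print in the map above
bpl_map <- data.frame(
  code = c("~ . < 120", "~ . >= 120"),
  BPL  = c("United States", "Another country"),
  stringsAsFactors = FALSE
)

# Raw BPL values from the first five rows of acs_nh shown earlier
bpl <- c(36, 33, 33, 25, 410)

recoded <- rep(NA_character_, length(bpl))
for (i in seq_len(nrow(bpl_map))) {
  # Convert the text to a one-sided formula and take its right-hand side,
  # e.g. the expression `. < 120`
  rhs <- as.formula(bpl_map$code[i])[[2]]
  # Evaluate the expression with `.` bound to the raw values
  hit <- eval(rhs, list(. = bpl))
  # The first matching rule wins; later rules only fill what is still NA
  recoded[hit & is.na(recoded)] <- bpl_map$BPL[i]
}

recoded
```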

Or a combination of values and formulas.

acs_educ_codes
#> # A tibble: 4 x 2
#>   code           EDUC                    
#>   <chr>          <chr>                   
#> 1 ~ . %in%   0:5 Non-high school graduate
#> 2 6              High school graduate    
#> 3 ~ . %in%   7:9 Some college            
#> 4 ~ . %in% 10:11 College graduate

leaf_recode(acs_nh, acs_educ_codes)
#> # A tibble: 13,780 x 7
#>   PERWT   SEX   AGE  RACE HISPAN   BPL EDUC                    
#>   <dbl> <dbl> <dbl> <dbl>  <dbl> <dbl> <chr>                   
#> 1   124     2    57     1      0    36 Some college            
#> 2    18     2    35     1      0    33 College graduate        
#> 3   104     2    15     1      0    33 Non-high school graduate
#> 4   213     2    31     1      0    25 College graduate        
#> 5   182     1    52     1      0   410 College graduate        
#> # ... with 1.378e+04 more rows

You can recode multiple columns at once using wide or long data frames.

acs_codes
#> # A tibble: 8 x 12
#>    code SEX    code   RACE   code   HISPAN  code  EDUC   code  BPL   code  AGE  
#>   <dbl> <chr>  <chr>  <chr>  <chr>  <chr>   <chr> <chr>  <chr> <chr> <chr> <chr>
#> 1     1 Male   1      White  0      Not Hi~ ~ . ~ Non-h~ ~ . ~ Unit~ ~ . ~ 17 o~
#> 2     2 Female 2      Black  ~ . %~ Hispan~ 6     High ~ ~ . ~ Anot~ ~ . ~ 18-23
#> 3    NA <NA>   3      AIAN   9      <NA>    ~ . ~ Some ~ <NA>  <NA>  ~ . ~ 24-29
#> 4    NA <NA>   ~ . %~ AAPI   <NA>   <NA>    ~ . ~ Colle~ <NA>  <NA>  ~ . ~ 30-39
#> 5    NA <NA>   7      Other~ <NA>   <NA>    <NA>  <NA>   <NA>  <NA>  ~ . ~ 40-49
#> # ... with 3 more rows
leaf_recode(acs_nh, acs_codes)
#> # A tibble: 13,780 x 7
#>   PERWT SEX    AGE          RACE  HISPAN      BPL           EDUC                
#>   <dbl> <chr>  <chr>        <chr> <chr>       <chr>         <chr>               
#> 1   124 Female 50-59        White Not Hispan~ United States Some college        
#> 2    18 Female 30-39        White Not Hispan~ United States College graduate    
#> 3   104 Female 17 or young~ White Not Hispan~ United States Non-high school gra~
#> 4   213 Female 30-39        White Not Hispan~ United States College graduate    
#> 5   182 Male   50-59        White Not Hispan~ Another coun~ College graduate    
#> # ... with 1.378e+04 more rows

acs_codes_long
#> # A tibble: 25 x 3
#>   variable code  value 
#>   <chr>    <chr> <chr> 
#> 1 SEX      1     Male  
#> 2 SEX      2     Female
#> 3 RACE     1     White 
#> 4 RACE     2     Black 
#> 5 RACE     3     AIAN  
#> # ... with 20 more rows
leaf_recode(acs_nh, acs_codes_long)
#> # A tibble: 13,780 x 7
#>   PERWT SEX    AGE          RACE  HISPAN      BPL           EDUC                
#>   <dbl> <chr>  <chr>        <chr> <chr>       <chr>         <chr>               
#> 1   124 Female 50-59        White Not Hispan~ United States Some college        
#> 2    18 Female 30-39        White Not Hispan~ United States College graduate    
#> 3   104 Female 17 or young~ White Not Hispan~ United States Non-high school gra~
#> 4   213 Female 30-39        White Not Hispan~ United States College graduate    
#> 5   182 Male   50-59        White Not Hispan~ Another coun~ College graduate    
#> # ... with 1.378e+04 more rows
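
If you already have one map per variable, a long map like the one above can be assembled by stacking them. This is a rough base R sketch; `to_long()` is a hypothetical helper written for this example, not a leafpeepr function, and the two small maps are illustrative.

```r
sex_map <- data.frame(
  code = c("1", "2"),
  SEX  = c("Male", "Female"),
  stringsAsFactors = FALSE
)
race_map <- data.frame(
  code = c("1", "2", "3"),
  RACE = c("White", "Black", "AIAN"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: reshape a two-column map into variable/code/value rows
to_long <- function(map) {
  variable <- setdiff(names(map), "code")  # the non-code column names the variable
  data.frame(
    variable = variable,
    code     = as.character(map$code),
    value    = map[[variable]],
    stringsAsFactors = FALSE
  )
}

# Stack the per-variable maps into one long map
codes_long <- do.call(rbind, lapply(list(sex_map, race_map), to_long))

codes_long
```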

Creating interaction variables

leaf_interact() creates an interaction between two variables.

acs_nh_recoded <- leaf_recode(acs_nh, acs_codes) %>% 
  janitor::clean_names() #Make column names nicer to look at

leaf_interact(acs_nh_recoded, race, hispan)
#> # A tibble: 13,780 x 8
#>   perwt sex    age       race  hispan   bpl       educ          race_x_hispan   
#>   <dbl> <chr>  <chr>     <chr> <chr>    <chr>     <chr>         <chr>           
#> 1   124 Female 50-59     White Not His~ United S~ Some college  White x Not His~
#> 2    18 Female 30-39     White Not His~ United S~ College grad~ White x Not His~
#> 3   104 Female 17 or yo~ White Not His~ United S~ Non-high sch~ White x Not His~
#> 4   213 Female 30-39     White Not His~ United S~ College grad~ White x Not His~
#> 5   182 Male   50-59     White Not His~ Another ~ College grad~ White x Not His~
#> # ... with 1.378e+04 more rows
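
Judging from the output above, an interaction column pastes the two levels together with " x " and names the result `<first>_x_<second>`. A base R sketch of the same idea, with made-up rows, looks like this (an illustration, not leafpeepr's code):

```r
# Illustrative rows; the labels mirror the recoded data above
toy <- data.frame(
  race   = c("White", "Black"),
  hispan = c("Not Hispanic", "Hispanic"),
  stringsAsFactors = FALSE
)

# Combine the two levels row by row into a single interaction level
toy$race_x_hispan <- paste(toy$race, toy$hispan, sep = " x ")

toy
```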

leaf_interactions() creates multiple interactions at once using a list.

leaf_interactions(acs_nh_recoded, c("race", "educ"), c("sex", "age"))
#> # A tibble: 13,780 x 9
#>   perwt sex    age     race  hispan   bpl    educ     race_x_educ    sex_x_age  
#>   <dbl> <chr>  <chr>   <chr> <chr>    <chr>  <chr>    <chr>          <chr>      
#> 1   124 Female 50-59   White Not His~ Unite~ Some co~ White x Some ~ Female x 5~
#> 2    18 Female 30-39   White Not His~ Unite~ College~ White x Colle~ Female x 3~
#> 3   104 Female 17 or ~ White Not His~ Unite~ Non-hig~ White x Non-h~ Female x 1~
#> 4   213 Female 30-39   White Not His~ Unite~ College~ White x Colle~ Female x 3~
#> 5   182 Male   50-59   White Not His~ Anoth~ College~ White x Colle~ Male x 50-~
#> # ... with 1.378e+04 more rows

leaf_interact_all() creates interactions between one variable and all other variables.

leaf_interact_all(acs_nh_recoded, sex, except = perwt)
#> # A tibble: 13,780 x 12
#>   perwt sex   age   race  hispan bpl   educ  age_x_sex race_x_sex hispan_x_sex
#>   <dbl> <chr> <chr> <chr> <chr>  <chr> <chr> <chr>     <chr>      <chr>       
#> 1   124 Fema~ 50-59 White Not H~ Unit~ Some~ 50-59 x ~ White x F~ Not Hispani~
#> 2    18 Fema~ 30-39 White Not H~ Unit~ Coll~ 30-39 x ~ White x F~ Not Hispani~
#> 3   104 Fema~ 17 o~ White Not H~ Unit~ Non-~ 17 or yo~ White x F~ Not Hispani~
#> 4   213 Fema~ 30-39 White Not H~ Unit~ Coll~ 30-39 x ~ White x F~ Not Hispani~
#> 5   182 Male  50-59 White Not H~ Anot~ Coll~ 50-59 x ~ White x M~ Not Hispani~
#> # ... with 1.378e+04 more rows, and 2 more variables: bpl_x_sex <chr>,
#> #   educ_x_sex <chr>
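
The "one variable against all others" pattern can be sketched in base R as a loop over the remaining columns, skipping the excluded one. The column names and ordering follow the output above; the data and the loop itself are illustrative assumptions rather than leafpeepr internals.

```r
# Illustrative rows with a weight column that should not be interacted
toy <- data.frame(
  perwt = c(124, 182),
  sex   = c("Female", "Male"),
  age   = c("50-59", "30-39"),
  race  = c("White", "White"),
  stringsAsFactors = FALSE
)

# Every column except the focal variable and the excluded weight column
others <- setdiff(names(toy), c("sex", "perwt"))

for (v in others) {
  # New column is named like age_x_sex and holds levels like "50-59 x Female"
  toy[[paste0(v, "_x_sex")]] <- paste(toy[[v]], toy$sex, sep = " x ")
}

names(toy)
```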

Generating a target data frame

Once our data is recoded, leaf_peep() prepares it to be used as weighting targets in autumn::harvest().

acs_nh_interacted <- leaf_interactions(
  acs_nh_recoded, c("race", "educ"), c("sex", "age")
)

leaf_peep(acs_nh_interacted, weight_col = perwt)
#> # A tibble: 64 x 3
#>   variable level         proportion
#>   <chr>    <chr>              <dbl>
#> 1 sex      Female            0.502 
#> 2 sex      Male              0.498 
#> 3 age      17 or younger     0.183 
#> 4 age      18-23             0.0825
#> 5 age      24-29             0.0729
#> # ... with 59 more rows
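
The proportions in a target table like this are weighted shares: within each variable, the weights for a level are summed and divided by the total weight. Here is a small base R sketch of that arithmetic, assuming this is what the targets represent; `weighted_props()` is a hypothetical helper and the data are made up.

```r
toy <- data.frame(
  perwt = c(100, 50, 50),
  sex   = c("Female", "Male", "Female"),
  age   = c("18-23", "18-23", "24-29"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: weighted share of each level of one variable
weighted_props <- function(data, var, weight) {
  totals <- tapply(data[[weight]], data[[var]], sum)  # weight per level
  data.frame(
    variable   = var,
    level      = names(totals),
    proportion = as.numeric(totals) / sum(data[[weight]]),
    stringsAsFactors = FALSE
  )
}

# One block of rows per variable, stacked into a long target table
targets <- do.call(
  rbind,
  lapply(c("sex", "age"), weighted_props, data = toy, weight = "perwt")
)

targets
```

In this toy example, Female carries 150 of the 200 total weight, so its proportion is 0.75 even though it accounts for two of three rows.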

Collapsing categories

leaf_other() collapses levels into an "other" category if their proportion is below a given cutoff.

acs_nh_targets <- leaf_peep(acs_nh_interacted, weight_col = perwt)

dplyr::arrange(acs_nh_targets, proportion)
#> # A tibble: 64 x 3
#>    variable    level                                 proportion
#>    <chr>       <chr>                                      <dbl>
#>  1 race_x_educ AIAN x Non-high school graduate         0.000280
#>  2 race_x_educ AIAN x High school graduate             0.000469
#>  3 race_x_educ AIAN x Some college                     0.000626
#>  4 race_x_educ Other race x Some college               0.000830
#>  5 race_x_educ AIAN x College graduate                 0.000881
#>  6 race_x_educ Other race x College graduate           0.00110 
#>  7 race_x_educ Other race x High school graduate       0.00158 
#>  8 race        AIAN                                    0.00226 
#>  9 race_x_educ Other race x Non-high school graduate   0.00256 
#> 10 race_x_educ Black x College graduate                0.00261 
#> # ... with 54 more rows

leaf_other(acs_nh_targets, 0.01) %>% 
  dplyr::arrange(proportion)
#> # A tibble: 44 x 3
#>    variable    level                   proportion
#>    <chr>       <chr>                        <dbl>
#>  1 race_x_educ AAPI x College graduate     0.0127
#>  2 race        Mixed race                  0.0208
#>  3 race        Other                       0.0229
#>  4 race        AAPI                        0.0311
#>  5 sex_x_age   Female x 24-29              0.0358
#>  6 sex_x_age   Male x 24-29                0.0371
#>  7 hispan      Hispanic                    0.0386
#>  8 sex_x_age   Male x 18-23                0.0402
#>  9 sex_x_age   Female x 18-23              0.0423
#> 10 sex_x_age   Male x 70 or older          0.0524
#> # ... with 34 more rows

If the "other" category would itself fall below the cutoff proportion, the next smallest level is added to it. To avoid this, set inclusive = FALSE.

leaf_other(acs_nh_targets, 0.01, inclusive = FALSE) %>% 
  dplyr::arrange(proportion)
#> # A tibble: 45 x 3
#>    variable    level                   proportion
#>    <chr>       <chr>                        <dbl>
#>  1 race        Other                      0.00833
#>  2 race_x_educ AAPI x College graduate    0.0127 
#>  3 race        Black                      0.0146 
#>  4 race        Mixed race                 0.0208 
#>  5 race        AAPI                       0.0311 
#>  6 sex_x_age   Female x 24-29             0.0358 
#>  7 sex_x_age   Male x 24-29               0.0371 
#>  8 hispan      Hispanic                   0.0386 
#>  9 sex_x_age   Male x 18-23               0.0402 
#> 10 sex_x_age   Female x 18-23             0.0423 
#> # ... with 35 more rows
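
The collapsing rule for a single variable can be sketched in a few lines of base R. `collapse_other()` is a hypothetical function written for this illustration, and it assumes that exactly one extra level is absorbed when `inclusive = TRUE`; leafpeepr's actual logic may differ. The race proportions are approximate values read or derived from the rounded output above (for example, "Other race" is taken as 0.00833 minus AIAN's 0.00226).

```r
collapse_other <- function(props, cutoff, inclusive = TRUE) {
  props   <- sort(props)           # smallest levels first
  n_small <- sum(props < cutoff)   # number of levels below the cutoff
  if (n_small == 0) return(props)

  # Assumption: absorb one more level if the combined "Other" is still too small
  if (inclusive && sum(props[seq_len(n_small)]) < cutoff &&
      n_small < length(props)) {
    n_small <- n_small + 1
  }

  other <- sum(props[seq_len(n_small)])
  sort(c(props[-seq_len(n_small)], Other = other))
}

# Approximate race proportions (see the note above about their origin)
race <- c(White = 0.92517, AAPI = 0.0311, `Mixed race` = 0.0208,
          Black = 0.0146, `Other race` = 0.00607, AIAN = 0.00226)

incl <- collapse_other(race, 0.01)                     # Black joins Other
excl <- collapse_other(race, 0.01, inclusive = FALSE)  # Other stays small

incl
excl
```

With these inputs, AIAN and "Other race" together come to about 0.0083, which is below the 0.01 cutoff, so the inclusive version also folds in Black (0.0146) for a combined share of about 0.0229. That mirrors the two outputs above.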


Hex sticker font is Source Code Pro by Adobe.

Image adapted from and Twemoji by Twitter.

Please note that leafpeepr is released with a Contributor Code of Conduct.


