# Data Cleaning and Preparation with IPUMS USA
### by [Kate Vavra-Musser](https://vavramusser.github.io) for the [R Spatial Notebook Series](https://vavramusser.github.io/r-spatial)

## Introduction
This notebook demonstrates the process of cleaning and preparing tabular population data for analysis.  The notebook uses data extracted from the [IPUMS USA](https://usa.ipums.org/usa) repository, which provides harmonized data from the U.S. Decennial Census, American Community Survey, Puerto Rico Community Survey, and other sources.  Working with large and complex datasets, like those provided by IPUMS, often requires meticulous cleaning and preparation to ensure that the data are suitable for the user's specific analysis. This notebook will guide you through key steps in this process, including importing, exploring, and transforming IPUMS data.

### Notebook Goals
This notebook introduces a data cleaning workflow using data previously downloaded from [IPUMS USA](https://usa.ipums.org/usa) through the [IPUMS API](https://developer.ipums.org/docs/v2/apiprogram) via the [ipumsr R package](https://cran.r-project.org/web/packages/ipumsr/index.html).  This notebook is intended as a follow-up to [IPUMS USA Data Extraction Using ipumsr](https://platform.i-guide.io/notebooks/ab5cad39-6d00-43d2-bc51-17fd4e6b98f2).  Users will learn how to clean and recode common continuous and categorical variables used in population data in R.  By the end of this notebook, users will have the skills to create their own workflows for cleaning and preparing tabular IPUMS data or other population datasets for social and demographic research.

### ★ Prerequisites ★
* Complete [Chapter 2.1 IPUMS USA Data Extraction Using ipumsr](https://platform.i-guide.io/notebooks/ab5cad39-6d00-43d2-bc51-17fd4e6b98f2)
* Have a copy of the *ipums_usa_example.rds* file available in your workspace
  * If you worked through [Chapter 2.1](https://platform.i-guide.io/notebooks/ab5cad39-6d00-43d2-bc51-17fd4e6b98f2) you should have created and saved a copy of *ipums_usa_example.rds* in the final section of the notebook.
  * You can also download a copy of the *ipums_usa_example.rds* file from [the I-GUIDE platform](https://platform.i-guide.io/datasets/0cb99a7c-97c0-4ffc-a2d7-ff539c8eadae) or [Kate's GitHub](https://github.com/vavramusser/r-spatial/blob/main/ipums_usa_example.csv).

#### About the Example Data Set
The [*ipums_usa_example.rds*](https://github.com/vavramusser/r-spatial/blob/main/ipums_usa_example.csv) file contains basic demographic information (sex, age, race, educational attainment, and total personal income) on residents of the state of Michigan collected as part of the 2010 [American Community Survey (ACS)](https://www.census.gov/programs-surveys/acs/about.html).  The ACS is an annual survey conducted by the [U.S. Census Bureau](https://www.census.gov) that collects information on a subset of the U.S. population.  It is a more in-depth supplement to the [U.S. Decennial Census](https://www.census.gov/programs-surveys/decennial-census.html) and in 2005 replaced the long-form version of the Decennial Census survey which was previously conducted every ten years.

### Notebook Overview
1. Setup
2. Initial Review
3. Clean and Recode Categorical Variables
4. Clean and Recode Continuous Variables
5. Final Review

## 1. Setup

This section will guide you through the process of installing essential packages.

[**dplyr**](https://cran.r-project.org/web/packages/dplyr/index.html) · A Grammar of Data Manipulation. This notebook uses the following functions from *dplyr*.

* [*case_when*](https://rdrr.io/cran/dplyr/man/case_when.html) · a general vectorized if-else
* [*count*](https://rdrr.io/cran/dplyr/man/count.html) · count the observations in each group
* [*mutate*](https://rdrr.io/cran/dplyr/man/mutate.html) · create, modify, and delete columns
* [*row_number*](https://rdrr.io/cran/dplyr/man/row_number.html) · integer ranking functions
* [*select*](https://rdrr.io/cran/dplyr/man/select.html) · keep or drop columns using their names and types
* This notebook also uses [*%>%*](https://magrittr.tidyverse.org/reference/pipe.html), referred to as the *pipe* operator.  The *pipe* operator is used to pass the output from one function directly into the next function for the purpose of creating streamlined workflows and is a commonly used component of the [*tidyverse*](https://www.tidyverse.org).

[**haven**](https://cran.r-project.org/web/packages/haven/index.html) · Import foreign statistical formats into R via the embedded ["ReadStat"](https://github.com/WizardMac/ReadStat) C library.  This notebook uses the following functions from *haven*.

* [*as_factor*](https://rdrr.io/cran/haven/man/as_factor.html) · convert labelled vectors to factors

### 1a. Install and Load Required Packages
If you have not already installed the required packages, uncomment and run the code below:

In [None]:
# install.packages(c("dplyr", "haven"))

Load the packages into your workspace.

In [None]:
library(dplyr)
library(haven)

### 1b. Load the Data File

Run the following line of code to read the *ipums_usa_example.rds* file into memory.  You may need to update the file path to reflect the file's location on your machine or in your working directory.

In [None]:
dat <- readRDS("ipums_usa_example.rds")

## 2. Initial Review

As a reminder, the [*ipums_usa_example.rds*](https://github.com/vavramusser/r-spatial/blob/main/ipums_usa_example.csv) file contains basic demographic information (sex, age, race, educational attainment, and total personal income) on residents of the state of Michigan collected as part of the 2010 [American Community Survey (ACS)](https://www.census.gov/programs-surveys/acs/about.html).

Let's take a look at the number of observations and variables in the data.

In [None]:
dim(dat)

The data includes information on 19 variables for 98,973 individuals living in the state of Michigan in 2010.

Let's take a look at the first few lines of the data.

In [None]:
head(dat)

In [Chapter 2.1](https://platform.i-guide.io/notebooks/ab5cad39-6d00-43d2-bc51-17fd4e6b98f2) we set up our IPUMS API extraction with a selection of 7 population variables.  However, IPUMS extracts also include detailed versions of some selected variables and a set of preselected variables containing metadata and other supplemental information, which together account for the additional 12 variables.

Let's take a look at the list of column names.

In [None]:
colnames(dat)

Below is a reference list of the variables included in the data.  This list includes the 7 variables originally selected as well as detailed supplemental variables for 2 of the selected variables (RACED for the variable RACE and EDUCD for the variable EDUC) and 10 IPUMS preselected variables which mainly include metadata such as identification codes, weights, and other metainformation.

**User Variable Selection**
* [State FIPS Code (STATEFIP)](https://en.wikipedia.org/wiki/Federal_Information_Processing_Standard_state_code)
* [Public Use Microdata Area (PUMA)](https://www.census.gov/programs-surveys/geography/guidance/geo-areas/pumas.html#:~:text=Public%20Use%20Microdata%20Areas%20(PUMAs)%20are%20non%2Doverlapping%2C,%2C%20Puerto%20Rico%2C%20and%20Guam.)
* Sex (SEX)
* Age (AGE)
* Race (RACE)
* Educational Attainment (EDUC)
* Total Personal Income (INCTOT)

**Detailed Supplements for Selected Variables**
* Race (detailed) (RACED)
* Education (detailed) (EDUCD)

**IPUMS Preselected Variables**
* Census Year (YEAR)
* IPUMS Sample Identifier (SAMPLE)
* Household Serial Number (SERIAL)
* Original Census Bureau Household Serial Number (CBSERIAL)
* Household Weight (HHWT)
* Household Cluster for Variance Estimation (CLUSTER)
* Household Strata for Variance Estimation (STRATA)
* Group Quarters Status (GQ)
* Person Number in Sample Unit (PERNUM)
* Person Weight (PERWT)

### 2a. Summarize the Data

Let's start with a quick summary of the dataset using the *summary* command, which provides an overview of a dataset and is a quick and easy way of exploring the overall data structure as well as potential high-level issues.  For this step we will restrict the summary to only a subset of demographic variables since those are the variables we will focus on in the data cleaning process for this notebook.

In [None]:
summary(dat[, c("SEX", "AGE", "RACE", "RACED", "EDUC", "EDUCD", "INCTOT")])

The high-level summary shows us a few potential issues with these data.

1. **All variables are coded as numeric, even though many of them are actually categorical.**  It is very common to code categorical variables using numeric codes but it is important to remember that these categorical variables are useless without data labels or a codebook reference.  Fortunately, the IPUMS extraction and download process included metadata for our data, which includes labels for relevant variables.  We can also refer to the [IPUMS USA Online Data Finder](https://usa.ipums.org/usa-action/variables/group/demog) to find detailed information on each of our variables.  It is a good data stewardship practice to always have the codebook or other reference material nearby while performing the data cleaning process.
2. **There are no missing values.**  If there were missing values in the data they would be indicated by an "NA" count for each variable.  While it is possible that our data actually has no missing values, it is also very possible that missing values are specified with a specific code.  Especially with very large and complex datasets, it is important to retain skepticism about the lack of missing values and review each variable and its corresponding labels and codebook during the cleaning process.
3. **The largest total income (INCTOT) value is 9999999.**  While it is possible that this is the actual maximum total income value in the dataset, it is more likely that this is a [topcoded reference value](https://en.wikipedia.org/wiki/Top-coded) or a missing value code.  Values made up of repeated 9s, e.g. "999" or "9999999", are commonly used to indicate a topcoded value, a missing value, or another special circumstance.
4. **The smallest total income (INCTOT) value is negative.**  What does this mean?

Next we will review each of the demographic variables, keeping these potential issues at the front of our mind as we go.  The cleaning process will involve looking for and managing any issues with the data, such as missing values or values which need to be recoded, as well as reorganizing variables to make them easier to use or to suit the needs of our project.
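
The checks suggested by the summary can be sketched on a small made-up data frame.  The object `toy` and all of its values are assumptions invented for illustration; the same base R calls can be pointed at the corresponding columns of `dat`.

```r
# Toy data standing in for dat (values invented for illustration)
toy <- data.frame(
  AGE    = c(34, 51, 7, 68),
  INCTOT = c(42000, -1500, 9999999, 18000)
)

# Count explicit NA values in each column
na_counts <- colSums(is.na(toy))
na_counts

# Count values equal to a suspected special code
n_sentinel <- sum(toy$INCTOT == 9999999)
n_sentinel

# Count negative values
n_negative <- sum(toy$INCTOT < 0)
n_negative
```

A column can report zero *NA* values and still contain special codes, which is why both checks are worth running.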

## 3. Clean and Recode Categorical Variables

We will start with the three categorical variables in our set of demographic variables.  The categorical variables in our dataset include:

* [sex (SEX)](https://usa.ipums.org/usa-action/variables/SEX#codes_section)
* [race (RACE and RACED)](https://usa.ipums.org/usa-action/variables/RACE#codes_section)
* [educational attainment (EDUC and EDUCD)](https://usa.ipums.org/usa-action/variables/EDUC#codes_section)

For this section we will create and use the following helper function *variable_nice*, built with the [*count*](https://rdrr.io/cran/dplyr/man/count.html), [*mutate*](https://rdrr.io/cran/dplyr/man/mutate.html), and [*row_number*](https://rdrr.io/cran/dplyr/man/row_number.html) functions from the [**dplyr**](https://cran.r-project.org/web/packages/dplyr/index.html) package and the [*as_factor*](https://rdrr.io/cran/haven/man/as_factor.html) function from the [**haven**](https://cran.r-project.org/web/packages/haven/index.html) package.  This function creates a simple view showing the data label *{{variable}}*, the count of observations *n*, a numeric level indicator *code*, and the percent of the data represented by each variable level.  Note that *code* is the row position in the sorted table, so it matches the IPUMS numeric code only when the codes present in the data run consecutively from 1.

Alternately, we could use the function *table* which would provide us with the counts for each variable level but would not show the percents or labels.
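
For comparison, here is a minimal sketch of the *table* approach on a made-up vector of codes.  The vector `sex_codes` is an assumption invented for illustration.  Percents can be recovered with the base R function *prop.table*, but the output still shows only numeric codes, not labels.

```r
# Toy vector of numeric sex codes (invented for illustration)
sex_codes <- c(1, 2, 2, 1, 2)

# Counts for each code
sex_counts <- table(sex_codes)
sex_counts

# Percent of observations for each code, rounded to two decimals
sex_percent <- round(prop.table(sex_counts) * 100, 2)
sex_percent
```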

In [None]:
variable_nice <- function(data, variable) {
    data %>%
    count({{variable}}) %>%
    mutate(code = row_number(),
           {{variable}} := as_factor({{variable}}),
           percent = round(n / sum(n) * 100, 2))
}

In [None]:
variable_nice(dat, RACE)

### 3a. Sex

The [IPUMS codebook for SEX](https://usa.ipums.org/usa-action/variables/SEX#codes_section) shows us that there are three categories for this variable.

* 1 ‧ Male
* 2 ‧ Female
* 9 ‧ *Missing or Blank*

Similar to the 9999999 INCTOT values we saw in the initial summary, the SEX variable also has a special coding using 9s.  A code of 9 for SEX indicates a missing or blank value.

Let's take a look at the breakdown of the SEX variable in our data.

In [None]:
variable_nice(dat, SEX)

All values in the SEX variable in our data are 1 (male) or 2 (female) and there are no missing values.

### 3b. Race

As we were reminded in our initial review of the data, the IPUMS data extraction included both the RACE variable and a supplementary detailed race RACED variable.  The [IPUMS codebook for RACE and RACED](https://usa.ipums.org/usa-action/variables/RACE#codes_section) shows us that there are nine categories for the basic race variable (RACE) and 253 categories for the detailed race variable (RACED).

Let's first take a look at the RACE variable.

In [None]:
variable_nice(dat, RACE)

As we saw in the codebook, the RACE variable has nine categories, and our dataset includes individuals in each of the nine categories.

**★ Pro Tip:** The acronym NEC, as in the "Other race, nec" category, stands for "not elsewhere classified" and is commonly used in demographic data.

The RACE variable is an example of a very common occurrence in demographic data where one or a few categories represent the vast majority of the data and the remaining categories are relatively rare.  In our sample nearly 85% of the individuals are White and most of the other categories constitute fewer than 1% of the sample.  This sometimes poses an issue in analysis, especially if we are interested in focusing on minority populations.  We won't worry about it for now, but it is something to think about when you conduct your own analyses.

We know that the detailed race variable (RACED) has a lot of categories (253).  Let's check how many of them are represented in our data.

In [None]:
length(unique((dat$RACED)))

The individuals in our data represent 104 of the possible 253 detailed race categories.

104 is still a lot of categories so we aren't going to attempt to view the full table for this variable.  Instead, let's take a look at the first ten categories.

In [None]:
head(variable_nice(dat, RACED), 10)

We can already see that this version of the race variable has a lot of additional information.  The first two categories (White and Black/African-American) are the same as the RACE variable but the next few categories show that the more general "American Indian or Alaska Native" category is split into categories for specific Native American tribes.  Without looking at the entire list of all categories in RACED we can get an idea of the level of detail included in this variable.

This amount of detail might come in handy later on if we use this data for a detailed race-based analysis.  But for now let's keep it simple and stick to just using the RACE variable.

#### Recoding the Race Variable

Even the 9 race categories in the simple RACE variable are a lot, especially if we plan to run a regression or do some other analysis where statistical power is important.  We might want to simplify this variable to even fewer categories.

For some types of analysis it might be essential to reduce the number of categories, especially if there are many categories with very few observations like the RACE variable, so that quantitative analyses have sufficient statistical power.  On the other hand, other types of analyses, especially qualitative analyses, may significantly benefit from the high level of detail included in the RACED variable.

When it comes to a variable like race, the practice of combining categories can be contentious and nuanced.  The type of analysis you plan to carry out and the goals of your project will determine how to best recode complex variables like RACE and RACED.  For this exercise, we will condense our data to six very commonly used race categories for working with data on the United States general population.  These are similar to the categories used by the U.S. Decennial Census.  This set of categories is not the best option for all analyses, but it is fine for our example workflow.

1. White
2. Black or African-American
3. American Indian or Alaska Native
4. Asian or Pacific Islander
5. Some Other Race
6. Two or More Races

This step will simplify our data down from 9 to 6 categories.  Individuals classified as "4. Chinese", "5. Japanese", and "6. Other Asian or Pacific Islander" will be reclassified to the new "4. Asian or Pacific Islander" category and individuals classified as "8. Two major races" and "9. Three or more major races" will be reclassified to the new "6. Two or More Races" category.

To carry out the recode, we will use the [*case_when*](https://rdrr.io/cran/dplyr/man/case_when.html) and [*mutate*](https://rdrr.io/cran/dplyr/man/mutate.html) functions from the [**dplyr**](https://cran.r-project.org/web/packages/dplyr/index.html) package to create a new variable (RACE_RECODE) based on information from the existing RACE variable.

In [None]:
dat <- dat %>%
  mutate(RACE_RECODE = case_when(
    RACE == 1 ~ 1,
    RACE == 2 ~ 2,
    RACE == 3 ~ 3,
    RACE %in% c(4, 5, 6) ~ 4,
    RACE == 7 ~ 5,
    RACE %in% c(8, 9) ~ 6))

We should also specify the new RACE_RECODE variable as a [**factor**](https://r4ds.hadley.nz/factors.html), which is a data type used to represent categorical variables that have a limited and fixed set of possible categories.  The individual categories of a factor are usually referred to as "levels".  A nice feature of factors is that we can also set the order of the levels and add labels to each level.

The RACE_RECODE variable has a fixed set of six possible levels, corresponding to the six race groupings we created, so it fits the criteria of a factor.  We will add labels in this step which will help us remember what races are grouped into the categories represented by each of the RACE_RECODE levels.  The following code uses the [*mutate*](https://rdrr.io/cran/dplyr/man/mutate.html) function from the [**dplyr**](https://cran.r-project.org/web/packages/dplyr/index.html) package and the *factor* function to generate the new version of the race variable.

In [None]:
dat <- dat %>%
  mutate(RACE_RECODE = factor(RACE_RECODE,
                              levels = c(1, 2, 3, 4, 5, 6),
                              labels = c("White",
                                         "Black",
                                         "American Indian or Alaska Native",
                                         "Asian or Pacific Islander",
                                         "Other Race",
                                         "Two or More Races")))

Let's take a look at the new variable.

In [None]:
variable_nice(dat, RACE_RECODE)

We now have a new, simplified version of the RACE variable with fewer categories.
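
One quick way to confirm that a recode behaved as intended is to cross-tabulate the original codes against the recoded values.  The sketch below uses a made-up data frame `toy` (an assumption for illustration) with numeric codes in both columns; the same idea applies to RACE and RACE_RECODE in `dat`.

```r
# Toy data with an original code and its recode (values invented for illustration)
toy <- data.frame(
  RACE        = c(1, 2, 4, 5, 6, 8, 9, 7),
  RACE_RECODE = c(1, 2, 4, 4, 4, 6, 6, 5)
)

# Rows are original codes, columns are recoded values
xtab <- table(original = toy$RACE, recode = toy$RACE_RECODE)
xtab

# Each original code should land in exactly one recoded category
maps_to_one <- all(rowSums(xtab > 0) == 1)
maps_to_one

# Count observations left without a recoded value
n_unmapped <- sum(is.na(toy$RACE_RECODE))
n_unmapped
```

Any original code that was missed by the *case_when* conditions would show up as an *NA* in the recoded column, so a nonzero unmapped count is a signal to revisit the recode.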

### 3c. Education

Similar to the RACE variable, the educational attainment variable EDUC comes with the more detailed educational attainment variable EDUCD.  The EDUC and EDUCD variables provide information on total individual educational attainment.  The [IPUMS codebook for EDUC and EDUCD](https://usa.ipums.org/usa-action/variables/EDUCD#codes_section) shows us that there are 13 categories for the basic education variable (EDUC) and 44 categories for the detailed education variable (EDUCD).

Let's first take a look at the EDUC variable.

In [None]:
variable_nice(dat, EDUC)

Looking at the EDUC variable, we can see that the first level corresponds to *N/A or no schooling*.  In the initial review, we mentioned the possibility that some variables will have levels which correspond to missing values or special circumstances.  Unfortunately for us, the two components of the *N/A or no schooling* group correspond to two very different situations.  *N/A* likely refers to situations where educational attainment was not collected, not relevant, or not included in the database for some reason.  However, *no schooling* refers to individuals who specifically have no educational attainment.

Depending on the type of analysis, we might want to separate these two situations if possible, for example by recoding the *N/A* values to *missing* and either creating a separate group for individuals with no schooling or combining them into a new *little or no schooling* category with other low-educational attainment categories.

We can also see that, while the EDUC variable provides information on the number of years of education completed, it does not provide information on degree completion.  In most cases, there is a significant social and economic difference between receiving a high school diploma or equivalent degree and completing 12 years of education but not receiving a degree.  It would be best if we could explore additional detail for the individuals in some of these categories.

Fortunately, we have the EDUCD variable which has a lot more information on educational attainment.  Let's take a look at what information is available in EDUCD.

In [None]:
variable_nice(dat, EDUCD)

The EDUCD variable has many more categories, including separate categories for *N/A* and *no schooling completed* as well as detailed information on degree completion!  We can use this information to recode the data into a new variable which better suits our analysis needs.

#### Recoding the Education Variable

Our recode will have the following seven categories.

1. eight or fewer years of education (including no schooling)
2. some high school (9-12 years, no degree)
3. high school diploma, GED, or alternative credential completed
4. some college (1+ years, no degree)
5. associate's degree
6. bachelor's degree
7. advanced degree (master's, professional, or doctoral degree)

Unlike RACE_RECODE, the EDUC_RECODE variable will include missing values since we plan to recode instances of *N/A* to missing.  In R, missing values are represented by the special value *NA*.  Most analytical functions in R either drop *NA* values automatically or have an option to ignore them.  Recoding missing categories as *NA* will make things easier for us down the line if we plan to carry out quantitative analysis with our data.
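
As a brief illustration of how *NA* values behave, the sketch below uses a made-up numeric vector `x` (an assumption for illustration).  The base R function *mean* returns *NA* by default when any value is missing and only drops missing values when asked, so it is worth checking how each function you use treats *NA*.

```r
# Toy numeric vector with one missing value (invented for illustration)
x <- c(20000, 35000, NA, 50000)

# By default the mean of a vector containing NA is NA
mean_default <- mean(x)
mean_default

# na.rm = TRUE drops missing values before computing the mean
mean_no_na <- mean(x, na.rm = TRUE)
mean_no_na

# summary() reports the number of NA values separately
summary(x)
```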

In [None]:
dat <- dat %>%
  mutate(EDUC_RECODE = case_when(
    EDUCD == 1 ~ NA_real_,          # N/A (recoded to missing; NA_real_ matches the numeric outputs below)
    EDUCD %in% c(2:26) ~ 1,         # eight or fewer years (including no schooling)
    EDUCD %in% c(30:61) ~ 2,        # some high school (9-12 years, no degree)
    EDUCD %in% c(63:64) ~ 3,        # high school diploma, GED, or alternative credential completed
    EDUCD %in% c(65:71) ~ 4,        # some college (1+ years, no degree)
    EDUCD == 81 ~ 5,                # associate's degree
    EDUCD == 101 ~ 6,               # bachelor's degree
    EDUCD %in% c(114:116) ~ 7))     # advanced degree (master's, professional, or doctoral degree)

As with the RACE_RECODE variable in the previous section, we will classify the EDUC_RECODE as a factor and attach category labels.

In [None]:
dat <- dat %>%
  mutate(EDUC_RECODE = factor(EDUC_RECODE,
                              levels = c(1, 2, 3, 4, 5, 6, 7),
                              labels = c("eight or fewer years",
                                         "some high school, no degree",
                                         "high school degree or equivalent",
                                         "some college, no degree",
                                         "associate's degree",
                                         "bachelor's degree",
                                         "advanced degree")))

Let's take a look at the new variable.

In [None]:
variable_nice(dat, EDUC_RECODE)

We now have a new, simplified version of the education variable with fewer categories and recoded missing values.  In this case, the new variable is both simpler than either of the original EDUC and EDUCD variables and also reorganizes the data in a way which may be more meaningful for the analyses we have in mind.

## 4. Clean and Recode Continuous Variables

Now that we have finished reviewing the three categorical demographic variables, we will move on to the two continuous demographic variables.  The continuous variables in our dataset include:

* [age (AGE)](https://usa.ipums.org/usa-action/variables/AGE#codes_section)
* [individual total income (INCTOT)](https://usa.ipums.org/usa-action/variables/INCTOT#codes_section)

### 4a. Age

The AGE variable provides the numeric age, in years, for each individual in the sample.  Reviewing the [IPUMS codebook for AGE](https://usa.ipums.org/usa-action/variables/AGE#codes_section) shows us that *999* corresponds to a missing value, but based on our initial review of the data there were no instances of *999* in the AGE variable, indicating there is no missing data for age.  We will double check for missing values in this step and still carry out the recode step.

We could use the *variable_nice* function used to view the categorical variables to view the continuous AGE variable but that would give us a table which includes a line for every age in the dataset, probably not very useful.

Instead, let's take a look at summary statistics and a frequency distribution for this variable using the *summary* and *hist* functions.

Both the *summary* and *hist* functions are from base R!  The information and graphics these functions produce are relatively simple and not very pretty, but the commands are easy to remember and use, are part of the base R environment (so they don't require any additional packages), and are computationally non-intensive.  You could use more complicated functions to generate summary tables and visualizations, but at this stage we just want quick snapshots of the data so we can make management decisions, and the simple base R functions work great.

**★ Pro Tip:**  Kate thinks base R is the best R and you can't convince her otherwise.

In [None]:
summary(dat$AGE)

Ages in this sample range from 0 to 94 with a median of 42 which seems reasonable for a large sample of the general population.  The summary also confirms that we don't have any instances of the *999* code which would indicate missing data.

In [None]:
hist(dat$AGE, breaks = 100, main = "Histogram of Age", xlab = "Age")

The distribution of ages, as shown in the histogram, also seems to be a reasonable representation of a large sample of the general population.

Interestingly, there seem to be lower-than-expected frequencies for individuals in their twenties through late thirties.  This may be representative of the actual population in Michigan or it may be an artifact of the sample population or sampling process.  Unlike the Decennial Census, the ACS is not a complete population count, so some subpopulations, such as college students or younger adults, may be more prone to being left out of the sampling process.

If we wanted to use ACS data as an estimate for the true 2010 Michigan population, we would first need to apply the survey design and survey weights to the data.  The IPUMS preselected set of variables included multiple survey design and survey weight supplemental variables which we would use to apply weights.  For now we will skip the process of applying a survey design and survey weights to our data.
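
Although this notebook skips weighting, a minimal sketch can show the basic idea of a person weight.  The data frame `toy` and its values are assumptions invented for illustration.  The sketch only produces weighted point estimates with base R; estimating variances properly would also require the CLUSTER and STRATA variables and a dedicated survey analysis workflow, which is outside the scope of this sketch.

```r
# Toy data with person weights (values invented for illustration)
toy <- data.frame(
  AGE   = c(10, 30, 50, 70),
  PERWT = c(100, 50, 50, 200)
)

# The sum of person weights estimates the number of people represented
est_population <- sum(toy$PERWT)
est_population

# Weighted mean age versus the unweighted sample mean
mean_weighted   <- weighted.mean(toy$AGE, w = toy$PERWT)
mean_unweighted <- mean(toy$AGE)
c(weighted = mean_weighted, unweighted = mean_unweighted)
```

In this toy example the heavily weighted oldest person pulls the weighted mean above the unweighted mean, which is the kind of shift weights can introduce in real estimates.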

#### Categorizing the Age Variable

While the numeric AGE variable is very useful, it likely contains more detail than we will need for our analysis.  Let's create a new categorical age variable where each category represents a range of ages.  Our new variable will use the following age ranges commonly used in demographic and sociological research.

1. 0 to 17 (children)
2. 18 to 24 (young adults)
3. 25 to 44
4. 45 to 64
5. 65+ (older adults; retirement age)

We will use the [*mutate*](https://rdrr.io/cran/dplyr/man/mutate.html) and [*case_when*](https://rdrr.io/cran/dplyr/man/case_when.html) functions from the [**dplyr**](https://cran.r-project.org/web/packages/dplyr/index.html) package to recode information from the AGE variable into a new AGE_CAT variable in the same way we created the RACE_RECODE and EDUC_RECODE variables.

In [None]:
dat <- dat %>%
  mutate(AGE_CAT = case_when(
    AGE < 18 ~ 1,
    AGE %in% c(18:24) ~ 2,
    AGE %in% c(25:44) ~ 3,
    AGE %in% c(45:64) ~ 4,
    AGE >= 65 ~ 5))

Although the original AGE variable was continuous, the new AGE_CAT variable is categorical, like RACE_RECODE and EDUC_RECODE.  Therefore, we will also specify AGE_CAT as a factor and set labels for each level using the same workflow we used for the categorical variables.

In [None]:
dat <- dat %>%
  mutate(AGE_CAT = factor(AGE_CAT,
                          levels = c(1, 2, 3, 4, 5),
                          labels = c("under 18",
                                     "18-24",
                                     "25-44",
                                     "45-64",
                                     "65+")))

Now that we have a labeled, categorical version of the AGE variable, let's take a look at the percent breakdown by age range.

In [None]:
variable_nice(dat, AGE_CAT)

The new AGE_CAT variable simplifies the AGE information into a format which is likely easier for us to use and understand.
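
As an alternative to the two-step *case_when* and *factor* workflow, the base R function *cut* can bin a numeric variable and attach labels in a single call.  The sketch below is an illustration on a made-up vector `age` (an assumption), not a replacement for the workflow above.  With the default `right = TRUE`, each interval includes its upper break, so a break at 17 places ages 17 and under in the first category.

```r
# Toy ages at the edges of each range (invented for illustration)
age <- c(0, 17, 18, 24, 25, 44, 45, 64, 65, 94)

# Bin ages into labeled ranges; intervals are (lower, upper]
age_cat <- cut(age,
               breaks = c(-Inf, 17, 24, 44, 64, Inf),
               labels = c("under 18", "18-24", "25-44", "45-64", "65+"))

# Each category should contain two of the toy ages
table(age_cat)
```

The result is already a factor, so no separate labeling step is needed.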

### 4b. Income

The last demographic variable we will work with is the income variable (INCTOT).  Like the age variable, the original version of this variable is numeric and represents the exact reported income for each surveyed individual.

As with AGE, rather than using the *variable_nice* function, we'll take a look at the summary statistics and frequency distribution of INCTOT using the *summary* and *hist* functions.

In [None]:
summary(dat$INCTOT)

The maximum INCTOT value is *9999999*, the same value we flagged in the initial summary.  Something to be suspicious about...

We are also reminded that the INCTOT variable includes negative values, another interesting characteristic of the data that we should look into.  According to the IPUMS variable description, INCTOT reports total pre-tax personal income or losses from all sources, so negative values represent net losses.

In [None]:
hist(dat$INCTOT, breaks = 100, main = "Histogram of Total Income", xlab = "Total Income")

The histogram showing the distribution of income values is another red flag that something is not right.  Numeric, continuous demographic variables often follow a roughly normal or log-normal distribution.  However, there is clearly a large chunk of the data at a single very high value, an anomaly which doesn't follow the pattern of the rest of the data.

The solution is to take a look at the [IPUMS codebook for INCTOT](https://usa.ipums.org/usa-action/variables/INCTOT#codes_section).  The codebook tells us that the value *9999998* indicates "unknown" and the value *9999999* indicates "N/A".

So for our next step we should recode these two values as *NA* so they won't be included in any calculations or in visualizations of the distribution.

In [None]:
dat <- dat %>%
  mutate(INCTOT = na_if(INCTOT, 9999998),
         INCTOT = na_if(INCTOT, 9999999))

Let's take another look at the summary statistics after recoding the missing information.

In [None]:
summary(dat$INCTOT)

And the new distribution of values.

In [None]:
hist(dat$INCTOT, breaks = 100, main = "Histogram of Total Income", xlab = "Total Income")

Recoding the *9999998* and *9999999* values to *NA* affected 17,688 observations, or about 18% of the data.  After removing those values, the summary and distribution appear much more as we would expect.  The distribution shows that the majority of the individuals in the sample have total incomes around \\$200,000 or less but there are a few outliers ranging up to the largest income in the sample, \\$637,000.

#### Categorizing the Income Variable

As with age, total income is often more useful when grouped into brackets than as a raw number.

Binning income into categories is especially useful in our situation since the income variable has a frequency distribution with a very long right tail.  A small number of individuals with very high incomes skew the overall distribution.  In addition, when a variable has a highly skewed distribution, values usually take on very different meanings in different sections of the distribution.  For example, the difference between an income of \\$10,000 and \\$50,000 corresponds to a very large difference in terms of lifestyle and socioeconomic status.  However, the difference between an income of \\$1,010,000 and \\$1,050,000 is likely negligible, and both correspond to a very high socioeconomic status.

We will recode the income variable in the following way:

1. < 25,000 (low income)
2. 25,000 to 49,999 (lower-middle income)
3. 50,000 to 74,999 (middle income)
4. 75,000 to 149,999 (upper-middle income)
5. 150,000 or more (high income)

In [None]:
dat <- dat %>%
  mutate(INCTOT_CAT = case_when(
    # conditions are checked in order, so each line only needs an upper bound
    INCTOT < 25000 ~ 1,
    INCTOT < 50000 ~ 2,
    INCTOT < 75000 ~ 3,
    INCTOT < 150000 ~ 4,
    INCTOT >= 150000 ~ 5))  # NA incomes match no condition and stay NA

As usual, we will code this variable as a factor and create a set of labels for each level.

In [None]:
dat <- dat %>%
  mutate(INCTOT_CAT = factor(INCTOT_CAT,
                              levels = c(1, 2, 3, 4, 5),
                              labels = c("under 25,000",
                                         "25,000 to 49,999",
                                         "50,000 to 74,999",
                                         "75,000 to 149,999",
                                         "150,000 or more")))

Let's take a look at the percentage breakdown for our new categorical variable.

In [None]:
variable_nice(dat, INCTOT_CAT)
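
The `variable_nice` helper is defined earlier in the notebook.  If it is not available in your session, a comparable percentage breakdown can be sketched with **dplyr** alone.  The example below is an assumption about the kind of table wanted, not a reproduction of `variable_nice`; the `toy` data frame and its level names are hypothetical.

```r
library(dplyr)

# Hypothetical categorical variable standing in for dat$INCTOT_CAT
toy <- data.frame(
  INCTOT_CAT = factor(c("low", "low", "mid", "high", NA, "mid", "low", "mid"),
                      levels = c("low", "mid", "high")))

breakdown <- toy %>%
  count(INCTOT_CAT) %>%                          # one row per level, NA included
  mutate(percent = round(100 * n / sum(n), 1))   # share of all records

print(breakdown)
```

Because `count` keeps *NA* as its own row, the percentages here are shares of all records, including those with missing income.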

As with AGE_CAT, the INCTOT_CAT variable condenses the income information into a simpler format which is easier to understand and work with.
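
The binning and labeling steps above can also be combined into a single call with the base R `cut` function.  This is an alternative sketch, not the approach used in this notebook, and `toy_income` is a hypothetical vector.  Setting `right = FALSE` makes each interval include its lower bound and exclude its upper bound, which matches the bracket definitions listed earlier.

```r
# Hypothetical incomes, including a negative value and a missing value
toy_income <- c(-500, 24999, 25000, 49999.5, 74999, 75000, 150000, NA)

toy_cat <- cut(toy_income,
               breaks = c(-Inf, 25000, 50000, 75000, 150000, Inf),
               right  = FALSE,   # intervals are [lower, upper)
               labels = c("under 25,000",
                          "25,000 to 49,999",
                          "50,000 to 74,999",
                          "75,000 to 149,999",
                          "150,000 or more"))

print(data.frame(income = toy_income, category = toy_cat))
```

The result is already a factor, and missing incomes remain *NA*.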

## 5. Final Review

We've completed initial data management and recode steps for each of the demographic variables in our data.  Let's take a final look at the updated summary information including all the new demographic variables we created in this section.

In [None]:
summary(dat[, c("SEX", "AGE", "AGE_CAT", "RACE", "RACED", "RACE_RECODE", "EDUC", "EDUCD", "EDUC_RECODE", "INCTOT", "INCTOT_CAT")])

We've made a lot of updates!  We now have categorical versions of our two continuous numeric variables, AGE and INCTOT, and new, recoded versions of RACE and EDUC.  We've also recoded the missing values in EDUC_RECODE and INCTOT to *NA* so we can use these variables in analyses without having to worry about incorrectly treating the missing codes like real numbers.

Let's make a simplified version of the data, specifically for analysis, that only includes the variables we plan to keep.

This step isn't essential, but it gives us a smaller version of the data for the next steps of our analysis, which will be less confusing and less computationally intensive to work with.

Specifically, we will restrict our data to the following demographic variables:

1. sex (SEX)
2. age (AGE)
3. age categories (AGE_CAT)
4. race categories (RACE_RECODE)
5. educational attainment categories (EDUC_RECODE)
6. income (INCTOT)
7. income categories (INCTOT_CAT)

And the two geographic identification variables from the original data extraction request:

8. [State FIPS Code (STATEFIP)](https://en.wikipedia.org/wiki/Federal_Information_Processing_Standard_state_code)
9. [Public Use Microdata Area (PUMA)](https://www.census.gov/programs-surveys/geography/guidance/geo-areas/pumas.html)

We'll create a new, subsetted version of the data using the [*select*](https://rdrr.io/cran/dplyr/man/select.html) function from the [**dplyr**](https://cran.r-project.org/web/packages/dplyr/index.html) package.

In [None]:
dat_analysis <- dat %>% select(STATEFIP, PUMA, SEX, AGE, AGE_CAT, RACE_RECODE, EDUC_RECODE, INCTOT, INCTOT_CAT)
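
When the list of variables to keep is long, it can help to store the names in a character vector and pass them through `all_of`, which raises an error if any requested column is missing.  The following is a minimal sketch of that variant; the `toy` data frame, its values, the `SCRATCH` column, and the `keep_vars` name are all hypothetical.

```r
library(dplyr)

# Hypothetical miniature of dat with one extra column to drop
toy <- data.frame(STATEFIP = c(6, 36), PUMA = c(101, 3801),
                  SEX = c(1, 2), AGE = c(34, 61), SCRATCH = c(TRUE, FALSE))

keep_vars <- c("STATEFIP", "PUMA", "SEX", "AGE")

# all_of() raises an error if any name in keep_vars is absent from the data
toy_analysis <- toy %>% select(all_of(keep_vars))

print(names(toy_analysis))
```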

Let's take a look at the first few lines of *dat_analysis*.

In [None]:
head(dat_analysis)

And a look at the summary information for *dat_analysis*.

In [None]:
summary(dat_analysis)

Finally, we'll save a copy of *dat_analysis* in the R Data Serialization (RDS) file format.  The .rds format preserves R-specific metadata, such as factor levels and column types, so the data will look the same the next time we import the file back into R.  Note that the format is intended for use within R.

In [None]:
saveRDS(dat_analysis, "ipums_usa_analysis.rds")
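
To confirm that the saved file restores the data as expected, the file can be read back with `readRDS` and compared with the original.  The following is a minimal sketch on a hypothetical `toy` data frame, written to a temporary file so that nothing in the workspace is overwritten.

```r
# Hypothetical miniature data set with a factor column
toy <- data.frame(AGE = c(34, 61, 19),
                  AGE_CAT = factor(c("adult", "senior", "adult"),
                                   levels = c("adult", "senior")))

path <- tempfile(fileext = ".rds")   # temporary file, so nothing is overwritten
saveRDS(toy, path)
toy_reloaded <- readRDS(path)

# TRUE when column types, factor levels, and values all survive the round trip
print(identical(toy, toy_reloaded))
```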

At the end of this exercise we have a cleaned up and nicely prepared version of our IPUMS USA data saved in our workspace.

## Recommended Next Steps

* **Continue with Chapter 3: Data Cleaning and Preparation**
  * 3.2 Spatial Data Preparation and Transformation with IPUMS NHGIS
* **Move on to Chapter 4: Exploratory Data Analysis (EDA)**
  * [4.1 Exploratory Data Analysis (EDA) with IPUMS USA](https://platform.i-guide.io/notebooks/29c5c2da-4bfe-4150-9c05-b65956c997b4)

## Quick Code
A clean and simple version of the code included in this notebook.