# Data Wrangling

Data wrangling takes up a substantial portion of every data professional's working life. It encompasses the steps you take to organize and clean your underlying data before analysis, including merging and appending datasets, finding typos, and creating new variables.

## Tidy data

Before we get to the analysis stage of a project, you'll likely first want to arrive at a "cleaned" dataset. Some may call this an "analysis" file, which implies you can start doing analysis without much more fuss. The exact structure of such an analysis file will vary between projects, but some fundamental concepts are common across many situations. Let's start by taking a look at this example from the book _R for Data Science_ by Hadley Wickham and Garrett Grolemund ([Chapter 12](https://r4ds.had.co.nz/tidy-data.html)). Consider the following tables of data.

In [1]:
# Loading packages
library(dplyr) # A popular way to do data manipulations in R

# Table (1)
infections_tidy <- tribble(
    ~country,      ~year,  ~cases, ~population,
    "Afghanistan",  1999,     745,    19987071, 
    "Afghanistan",  2000,    2666,    20595360, 
    "Brazil",       1999,   37737,   172006362, 
    "Brazil",       2000,   80488,   174504898, 
    "China",        1999,  212258,  1272915272, 
    "China",        2000,  213766,  1280428583
)

# Table (2)
infections_too_long <- tribble(
    ~country,      ~year, ~variable,        ~value,
    "Afghanistan",  1999, "cases",             745,
    "Afghanistan",  1999, "population",   19987071,
    "Afghanistan",  2000, "cases",            2666,
    "Afghanistan",  2000, "population",   20595360,
    "Brazil",       1999, "cases",           37737,
    "Brazil",       1999, "population",  172006362,
    "Brazil",       2000, "cases",           80488,
    "Brazil",       2000, "population",  174504898,
    "China",        1999, "cases",          212258,
    "China",        1999, "population", 1272915272,
    "China",        2000, "cases",          213766,
    "China",        2000, "population", 1280428583
)

# Table (3a)
infections_just_cases <- tribble(
    ~country,      ~`1999`, ~`2000`,
    "Afghanistan",     745,    2666,
    "Brazil",        37737,   80488,
    "China",        212258,  213766
)

# Table (3b)
infections_just_population <- tribble(
    ~country,          ~`1999`,     ~`2000`,
    "Afghanistan",    19987071,    20595360,
    "Brazil",        172006362,   174504898,
    "China",        1272915272,  1280428583
)



If you wanted to calculate the infection rate (cases per population) and plot it over time by country, which dataset makes that the easiest? Table (1) does, because it is already "tidy". Three rules make a dataset tidy:

- Each variable must have its own column.
- Each observation must have its own row.
- Each value must have its own cell.
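
To make the question above concrete, here is a minimal sketch of the infection rate calculation on the tidy Table (1). The column name `rate_per_10k` and the scaling to cases per 10,000 people are illustrative choices, not something fixed by the data.

```r
library(dplyr)

# Same values as Table (1) above
infections_tidy <- tribble(
    ~country,      ~year,  ~cases, ~population,
    "Afghanistan",  1999,     745,    19987071,
    "Afghanistan",  2000,    2666,    20595360,
    "Brazil",       1999,   37737,   172006362,
    "Brazil",       2000,   80488,   174504898,
    "China",        1999,  212258,  1272915272,
    "China",        2000,  213766,  1280428583
)

# Each variable has its own column, so the rate is a single mutate() call
# (rate_per_10k is an illustrative name and scale)
infections_rate <- infections_tidy %>%
    mutate(rate_per_10k = cases / population * 10000)

infections_rate
```

With the other tables, the same calculation would first require reshaping or joining, which is what the next cells walk through.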

The R package called {tidyr} has several functions to help us translate these untidy tables to tidy ones.

In [2]:
library(tidyr) # A set of functions to support tidy data

# Table (2)
infections_too_long %>%
    pivot_wider(names_from = variable, values_from = value)

country,year,cases,population
<chr>,<dbl>,<dbl>,<dbl>
Afghanistan,1999,745,19987071
Afghanistan,2000,2666,20595360
Brazil,1999,37737,172006362
Brazil,2000,80488,174504898
China,1999,212258,1272915272
China,2000,213766,1280428583


In [3]:
# Tables (3a) + (3b)

# Step (1): Reshape to long
# Step (2): Rename variables
infections_cases_long <- infections_just_cases %>%
    pivot_longer(!country, names_to = "year") %>%
    rename(cases = value)

infections_population_long <- infections_just_population %>%
    pivot_longer(!country, names_to = "year") %>%
    rename(population = value)

# Step (3): Join the data
inner_join(infections_cases_long, infections_population_long, by = c("country", "year"))

country,year,cases,population
<chr>,<chr>,<dbl>,<dbl>
Afghanistan,1999,745,19987071
Afghanistan,2000,2666,20595360
Brazil,1999,37737,172006362
Brazil,2000,80488,174504898
China,1999,212258,1272915272
China,2000,213766,1280428583
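
Notice that `year` comes out as a character column (`<chr>`) here, because `pivot_longer()` builds it from column names. If you'd rather have a numeric year, one option is the `names_transform` argument, and `values_to` can name the value column directly so the separate `rename()` step isn't needed. This is a sketch that assumes a reasonably recent version of {tidyr} (1.1.0 or later).

```r
library(dplyr)
library(tidyr)

# Same values as Table (3a) above
infections_just_cases <- tribble(
    ~country,      ~`1999`, ~`2000`,
    "Afghanistan",     745,    2666,
    "Brazil",        37737,   80488,
    "China",        212258,  213766
)

infections_cases_long <- infections_just_cases %>%
    pivot_longer(
        !country,
        names_to = "year",
        values_to = "cases",                       # name the value column directly
        names_transform = list(year = as.integer)  # convert "1999" to 1999L
    )

infections_cases_long
```

Keeping the key columns the same type in both tables also matters for the join: the join keys need compatible types in the two data frames.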


## Tidy values

The previous section discussed the overall structure of the dataset. Now let's dig into common problems you'll run into with actual values (cells) of your data. We can divide these into two categories:

- Strings
- Numeric

There are of course other data types (dates, Booleans, and factors), but let's just focus on these two main types for the moment.

### Strings

Strings can be very messy! Let's take the simple example of just asking people to write down what campus they attend.

In [4]:
student_campuses <- tribble(
  ~student_id, ~campus,
            1, "BERKELEY",      # Capitalization
            2, "UCLA ",         # Trailing space
            3, " UCLA",         # Leading space            
            4, "U.C.L.A.",      # Punctuation
            5, "Berkely ",      # Typos
            6, "UC  Berkeley",  # Embedded blanks
            7, "UCB",           # Abbreviations
            8, "Not applicable" # Missing
)

If we tried to count the number of students at each campus now, we would get incorrect counts, because every distinct spelling is treated as a separate campus. Let's go over some common strategies for tidying up these strings.

- Standardizing the case of the values (i.e., making everything uppercase or lowercase)
- Removing trailing and leading spaces, and reducing multiple consecutive spaces to a single space
- Cleaning up punctuation
- Recoding values
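
Before cleaning anything, it can help to see the problem directly. The sketch below counts students by the raw `campus` value; the name `raw_counts` is just for illustration.

```r
library(dplyr)

# Same values as the student_campuses table above
student_campuses <- tribble(
  ~student_id, ~campus,
            1, "BERKELEY",
            2, "UCLA ",
            3, " UCLA",
            4, "U.C.L.A.",
            5, "Berkely ",
            6, "UC  Berkeley",
            7, "UCB",
            8, "Not applicable"
)

# Every spelling is its own group, so we get eight "campuses" of one student each
raw_counts <- student_campuses %>%
  count(campus)

raw_counts
```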

In [5]:
library(stringr)

# Let's make all campuses uppercase
student_campuses %>%
  mutate(consistent_campus = str_to_upper(campus))

student_id,campus,consistent_campus
<dbl>,<chr>,<chr>
1,BERKELEY,BERKELEY
2,UCLA,UCLA
3,UCLA,UCLA
4,U.C.L.A.,U.C.L.A.
5,Berkely,BERKELY
6,UC Berkeley,UC BERKELEY
7,UCB,UCB
8,Not applicable,NOT APPLICABLE


In [6]:
# Let's get rid of the extra spaces
student_campuses %>%
  mutate(consistent_campus = str_squish(campus))

student_id,campus,consistent_campus
<dbl>,<chr>,<chr>
1,BERKELEY,BERKELEY
2,UCLA,UCLA
3,UCLA,UCLA
4,U.C.L.A.,U.C.L.A.
5,Berkely,Berkely
6,UC Berkeley,UC Berkeley
7,UCB,UCB
8,Not applicable,Not applicable


In [7]:
# Let's get rid of the punctuation
student_campuses %>%
  mutate(consistent_campus = str_replace_all(campus, "[:punct:]", ""))

student_id,campus,consistent_campus
<dbl>,<chr>,<chr>
1,BERKELEY,BERKELEY
2,UCLA,UCLA
3,UCLA,UCLA
4,U.C.L.A.,UCLA
5,Berkely,Berkely
6,UC Berkeley,UC Berkeley
7,UCB,UCB
8,Not applicable,Not applicable


In [8]:
# Let's fix typos
student_campuses %>%
  mutate(consistent_campus = str_replace_all(campus, "Berkely", "Berkeley"))

student_id,campus,consistent_campus
<dbl>,<chr>,<chr>
1,BERKELEY,BERKELEY
2,UCLA,UCLA
3,UCLA,UCLA
4,U.C.L.A.,U.C.L.A.
5,Berkely,Berkeley
6,UC Berkeley,UC Berkeley
7,UCB,UCB
8,Not applicable,Not applicable


In [9]:
# Let's standardize values
student_campuses %>%
  mutate(consistent_campus = if_else(str_detect(campus, "[UC]*Berkeley"), "UCB", campus))

student_id,campus,consistent_campus
<dbl>,<chr>,<chr>
1,BERKELEY,BERKELEY
2,UCLA,UCLA
3,UCLA,UCLA
4,U.C.L.A.,U.C.L.A.
5,Berkely,Berkely
6,UC Berkeley,UCB
7,UCB,UCB
8,Not applicable,Not applicable
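
A note on the pattern used above: `[UC]` is a character class ("a U or a C"), and `*` allows zero occurrences, so `"[UC]*Berkeley"` matches any string that contains `Berkeley`. That's fine for this small example, but it can be looser than intended. The sketch below compares it with an anchored pattern on already-cleaned, uppercase values. The value `"BERKELEY HEIGHTS"` is a made-up example, not part of the dataset above.

```r
library(stringr)

# Hypothetical cleaned values to test the patterns against
campus_clean <- c("BERKELEY", "UC BERKELEY", "UCB", "UCLA", "BERKELEY HEIGHTS")

# Loose: TRUE for anything that contains "BERKELEY"
loose <- str_detect(campus_clean, "[UC]*BERKELEY")
loose
# TRUE TRUE FALSE FALSE TRUE

# Stricter: the whole string must be "BERKELEY", optionally preceded by "UC "
strict <- str_detect(campus_clean, "^(UC )?BERKELEY$")
strict
# TRUE TRUE FALSE FALSE FALSE
```

Which pattern is appropriate depends on what values you expect to see, so it's worth checking the matches against your actual data.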


In [10]:
# Cleaning up missing values
student_campuses %>%
    mutate(consistent_campus = na_if(campus, "Not applicable"))

student_id,campus,consistent_campus
<dbl>,<chr>,<chr>
1,BERKELEY,BERKELEY
2,UCLA,UCLA
3,UCLA,UCLA
4,U.C.L.A.,U.C.L.A.
5,Berkely,Berkely
6,UC Berkeley,UC Berkeley
7,UCB,UCB
8,Not applicable,


In [11]:
# Now let's put it all together!
student_campuses %>%
  mutate(
      consistent_campus = campus %>%
          str_to_upper(.) %>%                     # Consistent case
          str_squish(.)   %>%                     # Remove spaces (trailing, leading, consecutive)
          str_replace_all(., "[:punct:]", "") %>% # Punctuation
          str_replace_all(                        # Fix typos
              .,
              "BERKELY",
              "BERKELEY"
          ) %>%
          if_else(                                # Standardize values
              str_detect(., "[UC]*BERKELEY"),
              "UCB",
              .
          ) %>%
          na_if(., "NOT APPLICABLE")              # Clean up missing values
  )

student_id,campus,consistent_campus
<dbl>,<chr>,<chr>
1,BERKELEY,UCB
2,UCLA,UCLA
3,UCLA,UCLA
4,U.C.L.A.,UCLA
5,Berkely,UCB
6,UC Berkeley,UCB
7,UCB,UCB
8,Not applicable,
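
With consistent values in hand, the count by campus finally makes sense. The sketch below bundles the same steps into a small helper so they can be reused. The function name `clean_campus` is hypothetical, and it uses the simpler pattern `"BERKELEY"`, which matches the same rows here.

```r
library(dplyr)
library(stringr)

# Same values as the student_campuses table above
student_campuses <- tribble(
  ~student_id, ~campus,
            1, "BERKELEY",
            2, "UCLA ",
            3, " UCLA",
            4, "U.C.L.A.",
            5, "Berkely ",
            6, "UC  Berkeley",
            7, "UCB",
            8, "Not applicable"
)

# Hypothetical helper wrapping the cleaning steps from the cell above
clean_campus <- function(x) {
  x <- str_to_upper(x)                               # Consistent case
  x <- str_squish(x)                                 # Remove extra spaces
  x <- str_replace_all(x, "[[:punct:]]", "")         # Punctuation
  x <- str_replace_all(x, "BERKELY", "BERKELEY")     # Fix typos
  x <- if_else(str_detect(x, "BERKELEY"), "UCB", x)  # Standardize values
  na_if(x, "NOT APPLICABLE")                         # Clean up missing values
}

campus_counts <- student_campuses %>%
  mutate(consistent_campus = clean_campus(campus)) %>%
  count(consistent_campus)

campus_counts
```

This gives four students at UCB, three at UCLA, and one missing value, instead of eight separate groups.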


### String or number?

Sometimes you have a number stuck in a string (e.g., `"$1,000"`) and sometimes you have a string posing as a number (e.g., ZIP Codes). Let's go through an example of each and how to solve these issues.

In [12]:
zcta_earnings <- tribble(
    ~zcta, ~median_earnings,
     1240, "$33,530",
     1242, "*",
    89010, "$26,172",
    89019, "$36,354"
)

Some identification codes consisting of digits (e.g., ZIP Codes and FIPS codes) may be mistakenly stored or loaded as numbers. Why does this matter? In general, we wouldn't want to treat a ZIP Code as a number: it makes no sense to say that 89010 is "greater than" 1240, and numeric storage drops leading zeros, so ZIP Code 01240 becomes 1240. Ideally, such variables should be stored as strings instead of numbers.

In [13]:
zcta_earnings %>%
    mutate(zcta_str = str_pad(zcta, width = 5, pad = "0"), .after = zcta)

zcta,zcta_str,median_earnings
<dbl>,<chr>,<chr>
1240,01240,"$33,530"
1242,01242,*
89010,89010,"$26,172"
89019,89019,"$36,354"
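
Padding fixes the problem after the fact, but when you control how the file is loaded, it's often easier to read the codes as strings in the first place. Here is a minimal sketch using {readr}'s `col_types` argument. The inline CSV text is made up to mirror the table above, and passing it with `I()` assumes {readr} 2.0 or later.

```r
library(readr)

# Hypothetical file contents, written inline so the example is self-contained
csv_text <- 'zcta,median_earnings
01240,"$33,530"
01242,*
89010,"$26,172"
'

# Declare zcta as character so the leading zeros survive the import
zcta_from_csv <- read_csv(
    I(csv_text),
    col_types = cols(
        zcta = col_character(),
        median_earnings = col_character()
    )
)

zcta_from_csv
```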


In [14]:
library(readr) # Used for reading data files, but has a helpful parse_number function

zcta_earnings %>%
    mutate(median_earnings_num = parse_number(median_earnings, na = c("*", "")))

zcta,median_earnings,median_earnings_num
<dbl>,<chr>,<dbl>
1240,"$33,530",33530.0
1242,*,
89010,"$26,172",26172.0
89019,"$36,354",36354.0


### Numeric

Numeric variables can be represented as integers or decimal values (float/double). Sometimes integers depict measurements that represent actual integer values (e.g., number of apartments in a building) and sometimes they represent ordinal values (e.g., survey responses of "strongly disagree", "disagree", etc.).
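
For ordinal values, one way to keep the meaning attached to the codes is an ordered factor. The sketch below uses base R only. The response codes and the five labels are hypothetical, so check your own codebook for the real mapping.

```r
# Hypothetical survey responses coded 1 to 5
responses <- c(2, 5, 4, 1, 3, 5)

# Hypothetical labels, listed from lowest to highest code
response_levels <- c(
    "strongly disagree", "disagree", "neutral", "agree", "strongly agree"
)

responses_fct <- factor(
    responses,
    levels = 1:5,
    labels = response_levels,
    ordered = TRUE
)

# Counts per category, shown in their natural order
table(responses_fct)

# Comparisons respect the ordering: "strongly agree" > "disagree"
responses_fct[2] > responses_fct[1]
# TRUE
```

Storing the variable this way makes it harder to accidentally compute something like a mean on codes that only represent a ranking.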

Let's create an example dataset that contains some of these kinds of variables.

In [15]:
housing <- tribble(
    ~resident_id, ~unit_id, ~age, ~education, ~income,
               1,        1,   35,          6,    57324,
               2,        1,   27,          5,    67366,
               3,        2,   42,        -99,    47343,
               4,        3,   56,          4,   -43123
)

One of the first things to do with numeric values is to make sure you're representing missing values correctly. Many data sources encode missing values with a placeholder value like -99. Double-check your data documentation to make sure you know how missing values are encoded.

In [16]:
housing %>% mutate(education = na_if(education, -99))

resident_id,unit_id,age,education,income
<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
1,1,35,6.0,57324
2,1,27,5.0,67366
3,2,42,,47343
4,3,56,4.0,-43123
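
Some datasets use more than one placeholder, or use the same placeholder in several columns. Here is a sketch of one way to handle that with `across()`. The -99 code comes from the example above, while -88 is a hypothetical second code added for illustration.

```r
library(dplyr)

# Same values as the housing table above
housing <- tribble(
    ~resident_id, ~unit_id, ~age, ~education, ~income,
               1,        1,   35,          6,    57324,
               2,        1,   27,          5,    67366,
               3,        2,   42,        -99,    47343,
               4,        3,   56,          4,   -43123
)

# -99 is from the example; -88 is a hypothetical additional code
missing_codes <- c(-99, -88)

# Recode the placeholders as NA in every column that may contain them
housing_clean <- housing %>%
    mutate(across(
        c(education, income),
        ~ replace(.x, .x %in% missing_codes, NA)
    ))

# Count the missing values per column as a quick check
housing_clean %>%
    summarize(across(everything(), ~ sum(is.na(.x))))
```

Note that the negative income in the last row is left alone, since -43123 is not one of the listed codes.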


Sometimes numeric values must fall within a known interval. For example, the fraction of days absent in a school year must fall between zero and one, and height must be greater than zero. However, some situations are ambiguous. For example, we might think income needs to be at least zero. But if someone includes business losses in their reported income, negative values might be plausible! For now, let's suppose losses are not possible in our dataset. We can recode all values we think are invalid as missing.

In [17]:
housing %>% mutate(income = replace(income, income < 0, NA))

resident_id,unit_id,age,education,income
<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
1,1,35,6,57324.0
2,1,27,5,67366.0
3,2,42,-99,47343.0
4,3,56,4,
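
Before overwriting values, it can be useful to flag the suspicious ones and look at them first. The sketch below adds logical check columns instead of changing the data. The age bounds of 0 to 120 and the column names `age_in_range` and `income_in_range` are illustrative assumptions.

```r
library(dplyr)

# Same values as the housing table above
housing <- tribble(
    ~resident_id, ~unit_id, ~age, ~education, ~income,
               1,        1,   35,          6,    57324,
               2,        1,   27,          5,    67366,
               3,        2,   42,        -99,    47343,
               4,        3,   56,          4,   -43123
)

housing_checked <- housing %>%
    mutate(
        age_in_range    = between(age, 0, 120),  # hypothetical plausible range
        income_in_range = income >= 0            # assumes losses are not possible
    )

# Inspect the rows that fail at least one check
housing_checked %>%
    filter(!age_in_range | !income_in_range)
```

Only resident 4 is flagged here, because of the negative income. Keeping the flags separate from the values makes it easy to revisit the decision later.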


Once you're more confident with the values in your numeric columns, you're ready to do the calculations needed for your analysis! Sometimes this could mean doing transformations of variables.

In [18]:
housing %>% mutate(income = replace(income, income < 0, NA)) %>%
    mutate(ln_income = log(income))

resident_id,unit_id,age,education,income,ln_income
<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
1,1,35,6,57324.0,10.95647
2,1,27,5,67366.0,11.1179
3,2,42,-99,47343.0,10.76517
4,3,56,4,,


Or sometimes you need to do aggregations.

In [19]:
housing %>% mutate(income = replace(income, income < 0, NA)) %>%
    group_by(unit_id) %>%
    summarize(avg_unit_income = mean(income, na.rm = TRUE))

unit_id,avg_unit_income
<dbl>,<dbl>
1,62345.0
2,47343.0
3,NaN
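
One thing to watch for in aggregations: when every value in a group is missing, `mean(x, na.rm = TRUE)` returns `NaN` rather than `NA`, which is what happens for unit 3. A sketch of one way to make this explicit is to count the non-missing values alongside the mean. The column names `n_residents` and `n_with_income` are illustrative.

```r
library(dplyr)

# Same values as the housing table above
housing <- tribble(
    ~resident_id, ~unit_id, ~age, ~education, ~income,
               1,        1,   35,          6,    57324,
               2,        1,   27,          5,    67366,
               3,        2,   42,        -99,    47343,
               4,        3,   56,          4,   -43123
)

unit_income <- housing %>%
    mutate(income = replace(income, income < 0, NA)) %>%
    group_by(unit_id) %>%
    summarize(
        n_residents   = n(),                 # residents in the unit
        n_with_income = sum(!is.na(income)), # residents with a usable income
        # Return NA (not NaN) when no resident has a usable income
        avg_unit_income = if (n_with_income > 0) {
            mean(income, na.rm = TRUE)
        } else {
            NA_real_
        }
    )

unit_income
```

Reporting the counts next to the average also shows how many observations each average is based on.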
