R
Clone or download
Latest commit cba1d54 Aug 1, 2018
Permalink
Failed to load latest commit information.
.github add contribution guidelines Nov 26, 2017
R Correct NA handling in excel_numeric_to_date (#221) Jul 29, 2018
docs rebuild pkgdown site Aug 1, 2018
inst/doc update Readme/Index and links Mar 17, 2018
man interminable pre-release tweaks Jul 18, 2018
revdep give attribution for #220 fix, bump version number post-CRAN acceptance Jul 31, 2018
tests Correct NA handling in excel_numeric_to_date (#221) Jul 29, 2018
vignettes re-knit vignette Mds for github Jul 17, 2018
.Rbuildignore add revdepchecks Mar 17, 2018
.gitignore remove inst/doc from .gitignore Nov 10, 2017
.travis.yml clean_names respects separators (or lack) between number and letter (#… Feb 9, 2018
DESCRIPTION give attribution for #220 fix, bump version number post-CRAN acceptance Jul 31, 2018
LICENSE Create LICENSE Apr 13, 2016
LICENSE.md Update LICENSE.md Apr 13, 2016
NAMESPACE Add new function, row_to_names (#208) Jun 22, 2018
NEWS.md give attribution for #220 fix, bump version number post-CRAN acceptance Jul 31, 2018
README.Rmd complete tweaks to index, readme Mar 17, 2018
README.md complete tweaks to index, readme Mar 17, 2018
codecov.yml moving print.tabyl into its own file, exclude it from covr check Oct 29, 2017
cran-comments.md update particulars for v1.1.1 Jul 31, 2018
dirty_data.xlsx use adorn_totals() in README, don't use two deprecated functions Apr 12, 2017
index.Rmd small edits to reflect 1.0 now on CRAN Mar 23, 2018
index.md small edits to reflect 1.0 now on CRAN Mar 23, 2018
janitor.Rproj Add janitor.Rproj Dec 1, 2016
planning.Rmd Update planning.Rmd Jan 5, 2017
planning.md more notes to planning page Dec 23, 2016

README.md

Data scientists, according to interviews and expert estimates, spend from 50 percent to 80 percent of their time mired in this more mundane labor of collecting and preparing unruly digital data, before it can be explored for useful nuggets.

-- "For Big-Data Scientists, 'Janitor Work' Is Key Hurdle to Insight" - The New York Times, 2014

janitor


Travis-CI Build Status Coverage Status lifecycle CRAN_Status_Badge !Monthly Downloads !Downloads

janitor has simple functions for examining and cleaning dirty data. It was built with beginning and intermediate R users in mind and is optimized for user-friendliness. Advanced R users can already do everything covered here, but with janitor they can do it faster and save their thinking for the fun stuff.

The main janitor functions:

  • perfectly format data.frame column names;
  • create and format frequency tables of one, two, or three variables - think an improved table(); and
  • isolate partially-duplicate records.

The tabulate-and-report functions approximate popular features of SPSS and Microsoft Excel.

janitor is a #tidyverse-oriented package. Specifically, it plays nicely with the %>% pipe and is optimized for cleaning data brought in with the readr and readxl packages.

Installation

You can install:

  • the most recent officially-released version from CRAN with

    install.packages("janitor")
  • the latest development version from GitHub with

    install.packages("devtools")
    devtools::install_github("sfirke/janitor")

Using janitor

Below are quick examples of how janitor tools are commonly used. A full description of each function can be found in janitor's catalog of functions vignette.

Cleaning dirty data

Take this roster of teachers at a fictional American high school, stored in the Microsoft Excel file dirty_data.xlsx: All kinds of dirty.

Dirtiness includes:

  • Dreadful column names
  • Rows and columns containing Excel formatting but no data
  • Dates stored as numbers
  • Values spread inconsistently over the "Certification" columns

Here's that data after being read in to R:

library(pacman) # for loading packages
p_load(readxl, janitor, dplyr, here)

roster_raw <- read_excel(here("dirty_data.xlsx")) # available at http://github.com/sfirke/janitor
glimpse(roster_raw)
#> Observations: 13
#> Variables: 11
#> $ `First Name`        <chr> "Jason", "Jason", "Alicia", "Ada", "Desus", "Chien-Shiung", "Chien-Shiung", N...
#> $ `Last Name`         <chr> "Bourne", "Bourne", "Keys", "Lovelace", "Nice", "Wu", "Wu", NA, "Joyce", "Lam...
#> $ `Employee Status`   <chr> "Teacher", "Teacher", "Teacher", "Teacher", "Administration", "Teacher", "Tea...
#> $ Subject             <chr> "PE", "Drafting", "Music", NA, "Dean", "Physics", "Chemistry", NA, "English",...
#> $ `Hire Date`         <dbl> 39690, 39690, 37118, 27515, 41431, 11037, 11037, NA, 32994, 27919, 42221, 347...
#> $ `% Allocated`       <dbl> 0.75, 0.25, 1.00, 1.00, 1.00, 0.50, 0.50, NA, 0.50, 0.50, NA, NA, 0.80
#> $ `Full time?`        <chr> "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", NA, "No", "No", "No", "No", ...
#> $ `do not edit! --->` <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA
#> $ Certification       <chr> "Physical ed", "Physical ed", "Instr. music", "PENDING", "PENDING", "Science ...
#> $ Certification__1    <chr> "Theater", "Theater", "Vocal music", "Computers", NA, "Physics", "Physics", N...
#> $ Certification__2    <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA

Excel formatting led to an untitled empty column and 5 empty rows at the bottom of the table (only 12 records have any actual data). Bad column names are preserved.

Clean it with janitor functions:

roster <- roster_raw %>%
  clean_names() %>%
  remove_empty(c("rows", "cols")) %>%
  mutate(hire_date = excel_numeric_to_date(hire_date),
         cert = coalesce(certification, certification_1)) %>% # from dplyr
  select(-certification, -certification_1) # drop unwanted columns

roster
#> # A tibble: 12 x 8
#>    first_name   last_name employee_status subject    hire_date  percent_allocated full_time cert          
#>    <chr>        <chr>     <chr>           <chr>      <date>                 <dbl> <chr>     <chr>         
#>  1 Jason        Bourne    Teacher         PE         2008-08-30             0.750 Yes       Physical ed   
#>  2 Jason        Bourne    Teacher         Drafting   2008-08-30             0.250 Yes       Physical ed   
#>  3 Alicia       Keys      Teacher         Music      2001-08-15             1.00  Yes       Instr. music  
#>  4 Ada          Lovelace  Teacher         <NA>       1975-05-01             1.00  Yes       PENDING       
#>  5 Desus        Nice      Administration  Dean       2013-06-06             1.00  Yes       PENDING       
#>  6 Chien-Shiung Wu        Teacher         Physics    1930-03-20             0.500 Yes       Science 6-12  
#>  7 Chien-Shiung Wu        Teacher         Chemistry  1930-03-20             0.500 Yes       Science 6-12  
#>  8 James        Joyce     Teacher         English    1990-05-01             0.500 No        English 6-12  
#>  9 Hedy         Lamarr    Teacher         Science    1976-06-08             0.500 No        PENDING       
#> 10 Carlos       Boozer    Coach           Basketball 2015-08-05            NA     No        Physical ed   
#> 11 Young        Boozer    Coach           <NA>       1995-01-01            NA     No        Political sci.
#> 12 Micheal      Larsen    Teacher         English    2009-09-15             0.800 No        Vocal music

The core janitor cleaning function is clean_names() - call it whenever you load data into R.

Examining dirty data

Finding duplicates

Use get_dupes() to identify and examine duplicate records during data cleaning. Let's see if any teachers are listed more than once:

roster %>% get_dupes(first_name, last_name)
#> # A tibble: 4 x 9
#>   first_name   last_name dupe_count employee_status subject   hire_date  percent_allocated full_time cert    
#>   <chr>        <chr>          <int> <chr>           <chr>     <date>                 <dbl> <chr>     <chr>   
#> 1 Chien-Shiung Wu                 2 Teacher         Physics   1930-03-20             0.500 Yes       Science…
#> 2 Chien-Shiung Wu                 2 Teacher         Chemistry 1930-03-20             0.500 Yes       Science…
#> 3 Jason        Bourne             2 Teacher         PE        2008-08-30             0.750 Yes       Physica…
#> 4 Jason        Bourne             2 Teacher         Drafting  2008-08-30             0.250 Yes       Physica…

Yes, some teachers appear twice. We ought to address this before counting employees.

Tabulating tools

A variable (or combinations of two or three variables) can be tabulated with tabyl(). The resulting data.frame can be tweaked and formatted with the suite of adorn_ functions for quick analysis and printing of pretty results in a report. adorn_ functions can be helpful with non-tabyls, too.

tabyl can be called two ways:

  • On a vector, when tabulating a single variable - e.g., tabyl(roster$subject)
  • On a data.frame, specifying 1, 2, or 3 variable names to tabulate : roster %>% tabyl(subject, employee_status).
    • Here the data.frame is passed in with the %>% pipe; this allows tabyl to be used in an analysis pipeline

tabyl()

Like table(), but pipe-able, data.frame-based, and fully featured.

One variable:

roster %>%
  tabyl(subject)
#>     subject n    percent valid_percent
#>  Basketball 1 0.08333333           0.1
#>   Chemistry 1 0.08333333           0.1
#>        Dean 1 0.08333333           0.1
#>    Drafting 1 0.08333333           0.1
#>     English 2 0.16666667           0.2
#>       Music 1 0.08333333           0.1
#>          PE 1 0.08333333           0.1
#>     Physics 1 0.08333333           0.1
#>     Science 1 0.08333333           0.1
#>        <NA> 2 0.16666667            NA

Two variables:

roster %>%
  filter(hire_date > as.Date("1950-01-01")) %>%
  tabyl(employee_status, full_time)
#>  employee_status No Yes
#>   Administration  0   1
#>            Coach  2   0
#>          Teacher  3   4

Three variables:

roster %>%
  tabyl(full_time, subject, employee_status, show_missing_levels = FALSE)
#> $Administration
#>  full_time Dean
#>        Yes    1
#> 
#> $Coach
#>  full_time Basketball NA_
#>         No          1   1
#> 
#> $Teacher
#>  full_time Chemistry Drafting English Music PE Physics Science NA_
#>         No         0        0       2     0  0       0       1   0
#>        Yes         1        1       0     1  1       1       0   1
Adorning tabyls

The adorn_ functions dress up the results of these tabulation calls for fast, basic reporting. Here are some of the functions that augment a summary table for reporting:

roster %>%
  tabyl(employee_status, full_time) %>%
  adorn_totals("row") %>%
  adorn_percentages("row") %>%
  adorn_pct_formatting() %>%
  adorn_ns() %>%
  adorn_title("combined")
#>  employee_status/full_time         No        Yes
#>             Administration   0.0% (0) 100.0% (1)
#>                      Coach 100.0% (2)   0.0% (0)
#>                    Teacher  33.3% (3)  66.7% (6)
#>                      Total  41.7% (5)  58.3% (7)

Pipe that right into knitr::kable() in your RMarkdown report.

These modular adornments can be layered to reduce R's deficit against Excel and SPSS when it comes to quick, informative counts.

Contact me

You are welcome to: