feature request: trim whitespace on read #211

Closed
gitJMDR opened this Issue Oct 20, 2016 · 4 comments

gitJMDR commented Oct 20, 2016

It'd be great to have an option like strip.white=TRUE, or strip_white=BOTH to automatically remove leading or trailing whitespace on import.

svenhalvorson commented Nov 7, 2016

I think this is very important as well, especially in light of how dplyr handles column names. If you import a data set with spaces in the column names, you can't even use rename() to fix the problem.

Member

jennybc commented Jan 5, 2017

@svenhalvorson Even if leading and trailing whitespace is trimmed, this won't touch embedded whitespace. You can deal with such column names like so:

```r
suppressPackageStartupMessages(library(tidyverse))
(df <- tibble(`with space` = 1:2))
#> # A tibble: 2 × 1
#>   `with space`
#>          <int>
#> 1            1
#> 2            2
df %>% 
  rename(no_space = `with space`)
#> # A tibble: 2 × 1
#>   no_space
#>      <int>
#> 1        1
#> 2        2
```

Member

hadley commented Mar 24, 2017

Alternatively you could just do: purrr::modify_if(df, is.character, trimws) (assuming you have the dev version of purrr, which will be released soon).

Having this as an option just doesn't feel that important for readxl to me.
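
For readers who want to try the modify_if() suggestion, here is a minimal sketch. The data frame `df` and its contents are made up for illustration and stand in for whatever an import returns; it assumes a purrr version that provides modify_if().

```r
library(purrr)

# Hypothetical data standing in for an imported sheet: the character
# column carries stray leading and trailing spaces.
df <- data.frame(
  id = 1:2,
  name = c("  alice", "bob  "),
  stringsAsFactors = FALSE
)

# Apply trimws() to character columns only; other columns are left as-is.
trimmed <- modify_if(df, is.character, trimws)

trimmed$name
```

Note that this only touches cell values; column names are left unchanged.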

Member

jennybc commented Mar 28, 2017

Why it's needed:

  1. It is handy.
  2. readr does it and readxl wants to grow up to be readr, but for spreadsheets.
  3. The modify_if() workaround doesn't address column names. Sure, something similar could be done on the names. But that gets back to point 1.

Why it's not needed:

  • Column type guessing is different in readxl vs readr: it's based on the cell types declared in the xls[x] file, not on the data. So whitespace trimming isn't nearly as high stakes. Post-import modification of character columns, e.g. with modify_if(), is viable.
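
The "something similar could be done on the names" remark can be sketched with base R alone. This is only an illustration of the post-import workaround, not what readxl does internally; the data frame and its padded names are hypothetical.

```r
# Hypothetical imported data whose column names carry stray whitespace.
df <- data.frame(a = 1:2, b = c("x", "y"), stringsAsFactors = FALSE)
names(df) <- c(" first col ", "second  ")

# trimws() removes leading and trailing whitespace by default;
# embedded whitespace (as in "first col") is left alone.
names(df) <- trimws(names(df))

names(df)
```

As noted earlier in the thread, embedded whitespace survives trimming, so a name like `first col` still needs backticks or a rename().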

@jennybc jennybc added this to TODO in jennybc Mar 29, 2017

@jennybc jennybc closed this in 00a8891 Apr 8, 2017

@jennybc jennybc moved this from TODO to Done in jennybc Apr 11, 2017
