Ignore missing/duplicate names if column is skipped #571

Open
cbrnr opened this issue Jan 9, 2017 · 7 comments
Labels
feature read

Comments

@cbrnr cbrnr commented Jan 9, 2017

When I read a file with trailing delimiters, read_csv spits out a warning that a missing column name was filled in. Is there a way to tell the function that I want to read in all but the last (empty) column so that the warning message is not produced? I don't know how common such (malformed) CSV files are, but an option to ignore trailing delimiters might be useful. I tried to get it to work with the col_types argument, but it seems like all columns are read in at first. See also my question on StackOverflow.

@jennybc
Member

@jennybc jennybc commented Jan 9, 2017

You can get the result you want by explicitly skipping that column. Here is one way, but there are some others, such as using cols_only(). Apparently you still get the warning. Perhaps that should be phrased differently, because you have declared your desire to skip this last variable.

library(readr)
read_csv("X1,X2,\nhi,there,\n", col_types = "cc_")
#> Warning: Missing column names filled in: 'X3' [3]
#> # A tibble: 1 × 2
#>      X1    X2
#>   <chr> <chr>
#> 1    hi there

@cbrnr
Author

@cbrnr cbrnr commented Jan 10, 2017

Hm. It seems like skipping columns always occurs after all data has been read, which is why the warning makes sense if you know how read_csv works. If you naively assume that skipped columns do not influence the result, it seems a bit odd to see this warning.

The same thing is true for skipping columns in arbitrary positions if they don't have values at all, for example:

library(readr)
read_csv("X1,,X2\n1,,2\n3,,4", col_types="i_i")

This results in two warnings: first the unnamed column is automatically named X2, and then the existing X2 column is renamed to X2_1 to avoid a duplicate name. I guess this is not what most users would expect. Of course, this can be solved by explicitly specifying column names like this:

read_csv("X1,,X2\n1,,2\n3,,4", col_types="i_i", col_names=c("X1", "X2"), skip=1)

Given the behavior above, I expected to have to supply 3 column names, but that doesn't work: only the names of the columns that are actually kept need to be specified.

I'm really just starting to use readr, so it might be my lack of experience. But maybe all of this could be fixed by having an option to ignore consecutive delimiters (which might include a trailing/leading delimiter as a special case). Or maybe people should try to format their CSVs properly before loading and this is not within readr's scope at all, I don't know.
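The "format the CSV before loading" route mentioned above can be sketched in a few lines of base R. This is only an illustration under stated assumptions: `strip_trailing_delim` is a hypothetical helper, not a readr feature, and it assumes that no quoted field spans several lines. It handles only the narrow trailing-delimiter case and only when every line ends with the delimiter, since collapsing consecutive delimiters in general would merge genuinely empty fields.

```r
# Hypothetical pre-processing helper (not part of readr): drop one trailing
# delimiter per line, but only if every non-empty line ends with it.
strip_trailing_delim <- function(lines, delim = ",") {
  lines <- lines[nzchar(lines)]
  if (length(lines) > 0 && all(endsWith(lines, delim))) {
    lines <- substr(lines, 1, nchar(lines) - nchar(delim))
  }
  lines
}

raw <- c("X1,X2,", "hi,there,")
clean <- strip_trailing_delim(raw)

# Base R's read.csv is used here only to keep the sketch dependency-free;
# the cleaned lines could be handed to readr instead.
df <- read.csv(text = clean, stringsAsFactors = FALSE)
```

Because the header line loses its trailing delimiter as well, the parser never sees an unnamed column, so no name has to be filled in.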

@hadley hadley added the feature label Jan 14, 2017
@hadley hadley changed the title from "Skip trailing delimiters?" to "Ignore missing/duplicate names if column is skipped" Jan 14, 2017
@hadley
Member

@hadley hadley commented Jan 14, 2017

I think this is a problem to do with automatically renaming columns that are then skipped. An option to skip consecutive delimiters seems dangerous to me.

library(readr)
read_csv("X1,\nhi", col_types = "c_")
#> Warning: Missing column names filled in: 'X2' [2]
#> Warning: 1 parsing failure.
#> row col  expected    actual
#>   1  -- 2 columns 1 columns
#> # A tibble: 1 × 1
#>      X1
#>   <chr>
#> 1    hi

read_csv("X2,\nhi", col_types = "c_")
#> Warning: Missing column names filled in: 'X2' [2]
#> Warning: Duplicated column names deduplicated: 'X2' => 'X2_1' [2]
#> Warning: 1 parsing failure.
#> row col  expected    actual
#>   1  -- 2 columns 1 columns
#> # A tibble: 1 × 1
#>      X2
#>   <chr>
#> 1    hi

@hadley hadley added the read label Jan 14, 2017
@jimhester
Member

@jimhester jimhester commented Feb 7, 2017

There is a bit of a chicken-and-egg problem here: standardising column types needs the column names sorted out first, but the column names in turn depend on which columns are skipped ☹️.

It can be done I am sure, but will likely take some refactoring of col_spec_standardise.
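To make the ordering problem concrete, here is a rough sketch of resolving names only after the skipped columns are known. It is purely illustrative and is not how `col_spec_standardise` works: `names_after_skip` is a hypothetical function, and it assumes a compact `col_types` string with exactly one character per column, where `_` marks a skipped column.

```r
# Hypothetical sketch, not readr's implementation: repair names only for the
# columns that survive a compact col_types string such as "c_" or "i_i".
names_after_skip <- function(header, spec) {
  flags <- strsplit(spec, "")[[1]]
  stopifnot(length(flags) == length(header))
  keep <- flags != "_"
  kept <- header[keep]
  positions <- which(keep)

  # Fill in missing names among kept columns only, using the original position.
  missing <- is.na(kept) | kept == ""
  if (any(missing)) {
    kept[missing] <- paste0("X", positions[missing])
    warning("Missing column names filled in among kept columns")
  }

  # Deduplicate among kept columns only.
  if (anyDuplicated(kept) > 0) {
    warning("Duplicated column names among kept columns")
    kept <- make.unique(kept, sep = "_")
  }
  kept
}

names_after_skip(c("X1", "X2", ""), "cc_")  # c("X1", "X2"), no warning
names_after_skip(c("X1", "", "X2"), "i_i")  # c("X1", "X2"), no warning
names_after_skip(c("X2", ""), "c_")         # "X2", no warning
```

With this ordering, the three examples from the thread produce no name-related warnings, because the unnamed column is dropped before any renaming happens.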

@jennybc
Member

@jennybc jennybc commented Feb 7, 2017

FWIW I have the same problem in readxl. Also unsolved. We should talk/commiserate about this @jimhester, to harmonize the solutions as much as possible.

@cbrnr
Author

@cbrnr cbrnr commented Sep 9, 2019

I just stumbled over this issue again. I'm reading a CSV file with an extra delimiter at the end of each line (so read_csv spits out a warning "Missing column names filled in: 'X56' [56]"). This happens even though I'm passing col_types=cols_only(...), where I only specify a subset of column names.

Short example:

read_csv("X1,X2,\nhi,there,\n",
         col_types=cols_only(X1=col_character(),
                             X2=col_character()))

Since I explicitly state which columns I want to load, the warning is a bit irritating. Would it be possible to not issue the warning when the unnamed column is one I haven't selected? Otherwise, wrapping everything in withCallingHandlers and suppressing that specific warning gets really unreadable:

withCallingHandlers(
    read_csv("X1,X2,\nhi,there,\n",
             col_types=cols_only(X1=col_character(),
                                 X2=col_character())),
    warning=function(w) {
        # Muffle only the warning about filled-in column names
        if (startsWith(conditionMessage(w), "Missing column names"))
            invokeRestart("muffleWarning")
    })

Or maybe read_csv could have a suppressWarnings argument that also accepts a regex to suppress specific warnings I know I'm going to ignore?
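Until something like that exists, the handler can be factored into a small reusable wrapper. The sketch below uses base R only; `muffle_warnings` and `noisy_read` are hypothetical names introduced for illustration, and `noisy_read` merely stands in for a reader that emits one expected and one unexpected warning.

```r
# Hypothetical helper (not part of readr): evaluate `expr` and muffle only
# those warnings whose message matches the regular expression `pattern`.
muffle_warnings <- function(expr, pattern) {
  withCallingHandlers(
    expr,
    warning = function(w) {
      if (grepl(pattern, conditionMessage(w))) {
        invokeRestart("muffleWarning")
      }
    }
  )
}

# Stand-in for a reader call; it raises two different warnings.
noisy_read <- function() {
  warning("Missing column names filled in: 'X3' [3]")
  warning("1 parsing failure.")
  "data"
}

# Only the "Missing column names" warning is muffled; the other still surfaces.
result <- muffle_warnings(noisy_read(), "^Missing column names")
```

With readr loaded, the same wrapper could be used as `muffle_warnings(read_csv(...), "^Missing column names")`, which keeps the call site short while leaving every other warning visible.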

@glensbo glensbo commented Apr 17, 2020

I've got the same problem with read_delim:

N_CSV <- read_delim("~/Documents/RFolder/BOGEN/Bogen_Kapt3/Data/NormalTusindSepDel.csv", delim = ",") %>% slice(-1)
#> Parsed with column specification:
#> cols(
#>   i = col_double(),
#>   t = col_character(),
#>   TRH = col_character(),
#>   pTSH = col_character(),
#>   TSH = col_character(),
#>   TT4 = col_character(),
#>   FT4 = col_character(),
#>   TT3 = col_character(),
#>   FT3 = col_character(),
#>   cT3 = col_character(),
#>   X11 = col_logical()
#> )
#> Warning message:
#> Missing column names filled in: 'X11' [11]
