New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add_column() with zero-row data frame #167

Closed
krlmlr opened this Issue Aug 30, 2016 · 9 comments

Comments

Projects
None yet
2 participants
@krlmlr
Member

krlmlr commented Aug 30, 2016

The following still doesn't work as expected:

add_column(iris[0, ], a = character(), b = 1)
## Error: Variables must be length 1 or 1.
## Problem variables: 'a'

I'd expect the result to be a zero-row data frame with two new columns a (character) and b (numeric).

I suspect we need a special case for zero-row input. @LaDilettante: Would you mind taking a look?

@anhqle

This comment has been minimized.

Contributor

anhqle commented Aug 30, 2016

Yes, I'd be happy to.

iris[0, ] and a are empty, but b has data. Shouldn't adding data to zero-row data frame throw an error instead of silently giving back a zero-data frame? The user may think that the data in b is successfully added somehow.

I have the same question about issue #147

@anhqle

This comment has been minimized.

Contributor

anhqle commented Oct 10, 2016

@krlmlr I can take on this next, but please give some guidance on my point in the previous comment. I think

add_column(iris[0, ], a = character(), b = 1)

should give an error instead of a zero-row data frame. Otherwise, the user may think that the data from b = 1 successfully gets added.

@krlmlr

This comment has been minimized.

Member

krlmlr commented Oct 10, 2016

How about a new argument .allow_atomic = NA? TRUE allows this use case, FALSE explicitly forbids adding atomic values (forcing to always use full vectors), and NA is the current behavior.

@anhqle

This comment has been minimized.

Contributor

anhqle commented Oct 10, 2016

Could you elaborate? My understanding is that add_column has always allowed adding atomic values (using recycle to fill the column with that atomic value).

The issue at hand is how should add_column behave when the user add a column to a zero-row data frame.

  • Your top post suggests returning a zero-row data frame with new column
  • I suggests raising an error to highlight that the data in the new column is not added

Is allow_atomic related to this choice? I'd appreciate more clarification. Thanks!

@krlmlr

This comment has been minimized.

Member

krlmlr commented Oct 10, 2016

Support for atomic values is very recent (#164, thanks!), and not on CRAN yet, so there's time to reconsider.

This issue was motivated by the desire to provide a robust interface for programming that also works in the zero-row corner case. Maybe we just need a new add_column_atomic(), and throw an error as we did before with add_column?

@anhqle

This comment has been minimized.

Contributor

anhqle commented Oct 11, 2016

I see. IMO,

  1. We should keep adding an atomic value as the default because users are probably accustomed to cbind(1:4, 1). Plus, adding a constant is a common task (e.g. adding an intercept, some categorical values, etc.)
  2. Regarding zero-row tibble, I still can't imagine an use case for it, but my best guess is that it's used to initialize storage. With that assumption,
    • add_column(iris[0, ], a = character(), b = integer()) returns a zero-row tibble with new columns as expected
    • add_column(iris[0, ], a = character(), b = 1) should never happen. If it does, we should raise an error to alert the user with a better message rather than silently returning a zero-row tibble. Imagine being a reader of the code, seeing add_column(iris[0, ], a = character(), b = 1) raising no error. I would wrongly expect b = 1 to be added in the tibble.

So in sum, I suggest we keep the behavior of add_column and improve the error message in the case of zero-row edge case.

@krlmlr

This comment has been minimized.

Member

krlmlr commented Oct 11, 2016

Zero-row tibbles can occur in a data analysis (think empty groups). I think this corner case needs to be handled well.

Thinking again, it's only about mixing length-one and length-n columns where the problem occurs. The real cause for this behavior is that tibble(a = 1:3, b = 1) succeeds but tibble(a = integer(), b = 1) raises an error; we might as well fix that.

Let's keep this issue open for a while and think of a clean way to handle this.

@anhqle

This comment has been minimized.

Contributor

anhqle commented Oct 11, 2016

IMO tibble(a = integer(), b = 1) raising an error makes sense. 1 is not divisible by 0. The user should not expect recycling-like behavior here.

@krlmlr

This comment has been minimized.

Member

krlmlr commented Oct 11, 2016

I'm thinking more of a programming context: The integer() might be a vector of unknown length, in this case it's not very helpful to have to special-case for zero.

@krlmlr krlmlr added the bug label Jul 27, 2017

@krlmlr krlmlr closed this in ca24e73 Jul 27, 2017

krlmlr added a commit that referenced this issue Jul 28, 2017

Merge tag 'v1.3.3.9001'
- Escape factor levels when printing (#277).
- Adding columns to zero-row data frames now also works when mixing
lengths 1 and 0 in the new columns (#167).
- Non-syntactic names are now also escaped in `glimpse()` (#280).
- The `validate` argument is now also supported in `as_tibble.tbl_df()`,
with default to `FALSE` (#278).

krlmlr added a commit that referenced this issue Aug 22, 2017

Merge tag 'v1.3.4'
Bug fixes
---------

- Values of length 1 in a `tibble()` call are recycled prior to evaluating subsequent arguments, improving consistency with `mutate()` (#213).
- Recycling of values of length 1 in a `tibble()` call maintains their class (#284).
- `add_row()` now always preserves the column data types of the input data frame the same way as `rbind()` does (#296).
- `lst()` now again handles duplicate names, the value defined last is used in case of a clash.
- Adding columns to zero-row data frames now also works when mixing lengths 1 and 0 in the new columns (#167).
- The `validate` argument is now also supported in `as_tibble.tbl_df()`, with default to `FALSE` (#278).  It must be passed as named argument, as in `as_tibble(validate = TRUE)`.

Formatting
----------

- `format_v()` now always surrounds lists with `[]` brackets, even if their length is one. This affects `glimpse()` output for list columns (#106).
- Factor levels are escaped when printing (#277).
- Non-syntactic names are now also escaped in `glimpse()` (#280).
- `tibble()` gives a consistent error message in the case of duplicate column names (#291).
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment