Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[ drops attributes of df #155

Closed
t-kalinowski opened this issue Aug 17, 2016 · 29 comments · Fixed by #512
Closed

[ drops attributes of df #155

t-kalinowski opened this issue Aug 17, 2016 · 29 comments · Fixed by #512

Comments

@t-kalinowski
Copy link

t-kalinowski commented Aug 17, 2016

possibly related to tidyverse/dplyr#1984

library(tibble)
df <- tibble(x = 1:2, y = x)
attr(df, "along for the ride") <- "still here"

attr(df, "along for the ride")
# [1] "still here"

df <- df[names(df)]
attr(df, "along for the ride")
# NULL

this is my current workaround:

df[] <- df[names(df)]

@krlmlr
Copy link
Member

krlmlr commented Aug 18, 2016

Hm, I'm not sure if we want to maintain all attributes. What if they somehow relate to the number of rows or columns?

A safer way would be to define your own class and implement [ by calling NextMethod() and copying over all attributes you want to.

What is your use case?

@t-kalinowski
Copy link
Author

I'm not sure if my use case is broadly applicable. I'm trying to delicately (lazily?) add some functionality to existing code by adding an attribute "rider" that contains some additional information doesn't fit neatly into the dataframe. My thought was that the attribute would be safely ignored in all the code except where I then explicitly look for it.

The alternatives approach that comes to mind (besides a new class as you mention) is forcing my core data frame and new "rider" into a list at one point, then unpacking it midway in the code path into two variables that get passed as separate arguments to the functions I'm trying to add.

I'm not sure where attributes fit into the tidyverse. To me, user-defined attributes (i.e., not dim and names) are never added "by mistake", and therefore seem like something that should be added, dropped, or modified only when explicitly called for by the user, and should otherwise be carried over by all the "infrastructure" code.

But maybe that encourages bad coding. I honestly don't know.

@krlmlr
Copy link
Member

krlmlr commented Aug 18, 2016

As far as I understand tidy data, assuming there is more than one rider, you'd probably use a data frame with a "rider" and a "data" column, the "data" column would be a list of data frames. A row from that data frame naturally becomes a named list, which is in the spirit of the alternative approach you sketched.

From another angle, the attribute defines something like a "has-a" relationship: the tabular data also "has a rider". The list approach you suggested would be more like a compound object that has both tabular data and a rider. This looks tidier to me, but then I don't know the structure of your data.

As shown in in the dplyr issue you linked, dplyr verbs also don't seem to (always) maintain attributes. This guarantee would be a feature that needs to be specified, implemented, and thoroughly tested. I'd review a pull request, but to me this is currently not a top priority for tibble.

@krlmlr
Copy link
Member

krlmlr commented Aug 26, 2016

Closing for now, please comment or open a new issue for further questions.

@krlmlr krlmlr closed this as completed Aug 26, 2016
@t-kalinowski
Copy link
Author

t-kalinowski commented Aug 26, 2016

Thanks for taking a look. I agree this is not high priority.

I was reminded of this recently when I updated readr and saw that readr tables now come with a new "spec" attribute. This addition is very analogous to my use case that led me to file this issue. I was adding an attribute that described how the tbl_df was generated (a legend, of sorts), and i was also grafting it onto existing code while trying to not break anything.

It's clear that "spec" as an attribute is the correct way for readr to add that information to the returned object. The alternatives of adding a new column to the returned data frame, or returning a list (compound object), are not appropriate. So the "spec" attribute from readr makes for a good and specific thought experiment. Should functions like [ or mutate() preserve it (and arbitrary, user-defined, attributes like it)? Or should the "spec" attribute be lost as soon as the data frame is modified in any way.

The more I think about it, the less of an opinion I have on what is the correct decision. It would just be nice if there was a consistent behavior across dplyr / purrr / tibble et al.

But again, I agree this is low priority right now. Thank you for your time and attention.

@jennybc
Copy link
Member

jennybc commented Aug 26, 2016

define your own class and implement [

I basically have the same question asked here. I've been building S3 classes based on tibble with the main motivation being to create a print method. So the loss of nice printing (due to the loss of my class) just from looking at a subset of the rows is a source of aggravation.

But I'm also waiting to see if I can exert my will over printing by applying a class to specific columns and then allowing tibble printing to happen normally. Therefore, in my case, this is very linked to #144.

@krlmlr
Copy link
Member

krlmlr commented Aug 26, 2016

@t-kalinowski: In the readr case, I think the spec apply to the imported data, but not so much for any modifications of it. My feeling is that stripping the attribute isn't the worst thing to do here.

@jennybc: But who is removing your class attribute? Can you please show an example?

@krlmlr krlmlr reopened this Aug 26, 2016
@krlmlr krlmlr added this to the 1.3 milestone Aug 26, 2016
@t-kalinowski
Copy link
Author

t-kalinowski commented Aug 27, 2016

Some further thoughts:
Conceptually, I view this as related to #141 and #156, and driven by the same goal as Hmisc::label and base::comment (which is then respected by, for example, package:zoo).

All are driven by a desire to add richer annotations to the tidy data, and for those annotations to travel with the object.

Each of the examples above is limited to character strings unfortunately. It would be nice if annotations could also be lists.

What if they somehow relate to the number of rows or columns?

It might make sense to support attributes at both levels, annotations for individual vectors, and annotations for the data frame.
Also, annotations like this are primarily intended for human interpretation, and so the potential harm is limited.

@jennybc
Copy link
Member

jennybc commented Aug 27, 2016

Here's an example of a tibble losing my class from row-subsetting:

library(tibble)

x <- structure(head(iris), class = c("jenny", "data.frame"))
class(x)
#> [1] "jenny"      "data.frame"
class(x[1:2, ])
#> [1] "jenny"      "data.frame"

xt <- structure(tibble::as_tibble(head(iris)),
                class = c("jenny", "tbl_df", "tbl", "data.frame"))
class(xt)
#> [1] "jenny"      "tbl_df"     "tbl"        "data.frame"
class(xt[1:2, ])
#> [1] "tbl_df"     "tbl"        "data.frame"

@t-kalinowski
Copy link
Author

I just came across the last example in ?[.data.frame`` link and realized that using NextMethod() with my own class this is much simpler than I originally thought. I'll be doing that. Thanks for the pointer.

@krlmlr
Copy link
Member

krlmlr commented Apr 17, 2017

Can we agree that subclasses of tbl_df should provide their own [ method, like grouped_df does?

library(dplyr, warn.conflicts = FALSE)
group_by(iris, Species)[45:55,]
#> Source: local data frame [11 x 5]
#> Groups: Species [2]
#> 
#>    Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
#>           <dbl>       <dbl>        <dbl>       <dbl>     <fctr>
#> 1           5.1         3.8          1.9         0.4     setosa
#> 2           4.8         3.0          1.4         0.3     setosa
#> 3           5.1         3.8          1.6         0.2     setosa
#> 4           4.6         3.2          1.4         0.2     setosa
#> 5           5.3         3.7          1.5         0.2     setosa
#> 6           5.0         3.3          1.4         0.2     setosa
#> 7           7.0         3.2          4.7         1.4 versicolor
#> 8           6.4         3.2          4.5         1.5 versicolor
#> 9           6.9         3.1          4.9         1.5 versicolor
#> 10          5.5         2.3          4.0         1.3 versicolor
#> 11          6.5         2.8          4.6         1.5 versicolor

@t-kalinowski
Copy link
Author

t-kalinowski commented Apr 18, 2017

Over time, my opinion on this has completely reversed. I think the current behavior, where subclasses of tbl_df should provide their own [ method, is more preferable now. I'm ok to close this issue. Thank you for considering.

@jennybc
Copy link
Member

jennybc commented Apr 18, 2017

I think that's fine. Then I think it's important to document this well. Maybe there could be a vignette on making your own tibble-based classes. And it could list the minimal set of methods you should implement, with an example.

@krlmlr
Copy link
Member

krlmlr commented Mar 1, 2018

I'm happy to call sloop::reconstruct() in [.tbl_df once we standardize on it.

@Deleetdk
Copy link

Deleetdk commented Mar 9, 2018

Any news on this? Many dplyr verbs drop classes too and it seems silly to have to implement a custom version of every single dplyr verb as well as the base subsetting ones to avoid custom classes being dropped. My case use is similar to @jennybc, namely that I'm building on top of tibble to add a custom print and some light data integrity testing, but I want it to stay functional with all the usual data frame functions and my data is tidy.

There's a lot of related issues and questions (some of which relate to attributes generally):

@krlmlr
Copy link
Member

krlmlr commented Mar 12, 2018

Thanks for the reminder and for collecting the pointers. We need sloop before we can work on it. tidyverse/dplyr#3259 is the only open dplyr issue that covers this problem. Please upvote (add 👍 to the first post with the "smiley" icon).

@vspinu
Copy link
Member

vspinu commented Mar 18, 2018

Is there (will it be) an easy solution without requiring dplyr, tibble, sloop etc? Maybe an attributes marker "leave-my-attributes-alone" or a simple similarly-minded method?

My problem is exactly as the one described by @jennybc . I add an extra class to a data.frame to customize printing and a bunch of non-structural attributes.

@vspinu
Copy link
Member

vspinu commented Mar 18, 2018

Can we agree that subclasses of tbl_df should provide their own [

Would the default that none of the attributes are vectorized on columns or rows make more sense? Then, only the subclasses dealing with such attributes would need to provide the [ method.

@krlmlr
Copy link
Member

krlmlr commented Mar 19, 2018

I like the idea of a special attribute that contains the names of all attributes that are safe to copy over. This can be used later with a solution based on reconstruct(), too.

@t-kalinowski
Copy link
Author

t-kalinowski commented Mar 19, 2018

I like the idea, but mostly as a mechanism to more precisely control attribute-copying behavior. However, I would prefer if the default was to copy over all attributes, even when this attribute is missing.
In any case, it made me think of this: image

@vspinu
Copy link
Member

vspinu commented Mar 19, 2018

The exclusion declaration - list all attributes which should't be copied - seems more natural to me as well. But I am not sure how useful would it be. Subclasses which deal with such attributes will probably have to declare their [ method anyway, unless there is a provision for row-subsetable-attributes and col-subsetable-attributes attributes.

@t-kalinowski
Copy link
Author

t-kalinowski commented Mar 19, 2018

hmm... if the attribute so neatly fits into being row-subsettable, why would it be an attribute and not an additional column? Also, if it neatly fits into being column subsettable, why would the attribute be set on the whole tibble and not on the column itself? Are you using attributes to implement user-hidden columns?

@vspinu
Copy link
Member

vspinu commented Mar 19, 2018

I am not using attributes vectorized on rows or columns, but from what I see others do and, if I understand correctly, this is the foremost reason why attributes are dropped by dplyr verbs in the first place.

attribute and not an additional column?

Because it's internal to the class representation and do not constitute user level data.

@krlmlr
Copy link
Member

krlmlr commented Mar 21, 2018

I think e.g. sf is doing the right thing by keeping the geometry in a regular column. Supporting "special" attributes that act as an extra column or row feels out of scope.

The opt-in interface (attributes³) feels like the safer option, though. This only affects packages that subclass tibbles, I'm happy to make this easier but I'd rather not trade this for the safety of just stripping all attributes if in doubt.

@Deleetdk
Copy link

In my case, the information stored in attributes usually has a different structure than the main data. Often it's a single row data frame with other rows than the main one, used for various metadata. E.g. a chromosomes data frame has 24 chromosomes (rows) and their metrics (e.g. copy number, in columns), while the attribute has some genome level information (e.g. ISCN karyotype).

@hadley
Copy link
Member

hadley commented Sep 7, 2018

As a short-term solution, I think the functions in tibble that create new tibbles from an existing tibble (e.g. [.tbl_df) should copy over all attributes — and I think the short-term if that doesn't work for your class, you'll need to override the method. (In other words, the default will change from dropping all attributes to preserving all attributes). This also allows random additional attributes to be carried along for the ride.

In the long-term, these function should call sloop::reconstruct() (which is likely to move into the vctrs package and get a new name); that way people extending tibble with row or col based attributes will only need to override a single method.

@hadley
Copy link
Member

hadley commented Sep 7, 2018

This issue can be closed a PR. We should then open a new issue that lists all functions that create a tibble, and that will need to use reconstruct() in the future.

@krlmlr krlmlr removed this from the 1.4.0 milestone Sep 28, 2018
@hadley
Copy link
Member

hadley commented Sep 28, 2018

Note that sloop::reconstruct() is now vctrs::vec_restore()

@github-actions
Copy link
Contributor

github-actions bot commented Dec 9, 2020

This old thread has been automatically locked. If you think you have found something related to this, please open a new issue and link to this old issue if necessary.

@github-actions github-actions bot locked and limited conversation to collaborators Dec 9, 2020
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
6 participants