[ drops attributes of df
#155
Comments
Hm, I'm not sure if we want to maintain all attributes. What if they somehow relate to the number of rows or columns? A safer way would be to define your own class and implement What is your use case? |
I'm not sure if my use case is broadly applicable. I'm trying to delicately (lazily?) add some functionality to existing code by adding an attribute "rider" that contains some additional information that doesn't fit neatly into the data frame. My thought was that the attribute would be safely ignored in all the code except where I then explicitly look for it. The alternative approach that comes to mind (besides a new class as you mention) is forcing my core data frame and new "rider" into a list at one point, then unpacking it midway in the code path into two variables that get passed as separate arguments to the functions I'm trying to add. I'm not sure where attributes fit into the tidyverse. To me, user-defined attributes (i.e., not dim and names) are never added "by mistake", and therefore seem like something that should be added, dropped, or modified only when explicitly called for by the user, and should otherwise be carried over by all the "infrastructure" code. But maybe that encourages bad coding. I honestly don't know. |
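For readers following along, the "own class" route mentioned above could look roughly like the sketch below. It is only an illustration in base R under stated assumptions: the class name with_rider and the constructor new_with_rider() are hypothetical, and this is not how tibble or dplyr handle attributes.

```r
# Hypothetical constructor: a data frame that carries a "rider" attribute.
new_with_rider <- function(df, rider) {
  stopifnot(is.data.frame(df))
  attr(df, "rider") <- rider
  class(df) <- c("with_rider", setdiff(class(df), "with_rider"))
  df
}

# Subsetting method: delegate to the data frame method, then re-attach
# the rider and the class if the result is still a data frame.
`[.with_rider` <- function(x, ...) {
  rider <- attr(x, "rider")
  out <- NextMethod()
  if (is.data.frame(out)) {
    attr(out, "rider") <- rider
    class(out) <- c("with_rider", setdiff(class(out), "with_rider"))
  }
  out
}

df <- new_with_rider(head(iris), rider = list(source = "example"))
sub <- df[1:2, c("Species", "Sepal.Length")]
attr(sub, "rider")
```

The method only restores the rider when the result is still a data frame, so a call that collapses to a plain vector is left alone. Whether it is safe to copy the rider unchanged depends on it not being tied to the number of rows or columns.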
As far as I understand tidy data, assuming there is more than one rider, you'd probably use a data frame with a "rider" and a "data" column, the "data" column would be a list of data frames. A row from that data frame naturally becomes a named list, which is in the spirit of the alternative approach you sketched. From another angle, the attribute defines something like a "has-a" relationship: the tabular data also "has a rider". The list approach you suggested would be more like a compound object that has both tabular data and a rider. This looks tidier to me, but then I don't know the structure of your data. As shown in the dplyr issue you linked, dplyr verbs also don't seem to (always) maintain attributes. This guarantee would be a feature that needs to be specified, implemented, and thoroughly tested. I'd review a pull request, but to me this is currently not a top priority for tibble. |
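A minimal sketch of the "rider" plus "data" layout described above, written in base R so it runs without extra packages. The rider labels and the split of iris are made up for illustration.

```r
# One row per rider; the "data" column is a list of data frames.
nested <- data.frame(rider = c("a", "b"), stringsAsFactors = FALSE)
nested$data <- list(head(iris, 3), tail(iris, 3))

# Taking one row and converting it gives a named list with both pieces.
row1 <- as.list(nested[1, ])
names(row1)

# The tabular data for that rider is the single element of the list column.
rider_data <- row1$data[[1]]
```

Here the rider travels with its data as an ordinary column, so ordinary row subsetting keeps the two together without relying on attributes.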
Closing for now, please comment or open a new issue for further questions. |
Thanks for taking a look. I agree this is not high priority. I was reminded of this recently when I updated readr and saw that readr tables now come with a new "spec" attribute. This addition is very analogous to my use case that led me to file this issue. I was adding an attribute that described how the tbl_df was generated (a legend, of sorts), and I was also grafting it onto existing code while trying to not break anything. It's clear that "spec" as an attribute is the correct way for readr to add that information to the returned object. The alternatives of adding a new column to the returned data frame, or returning a list (compound object), are not appropriate. So the "spec" attribute from readr makes for a good and specific thought experiment. Should functions like The more I think about it, the less of an opinion I have on what is the correct decision. It would just be nice if there was a consistent behavior across dplyr / purrr / tibble et al. But again, I agree this is low priority right now. Thank you for your time and attention. |
I basically have the same question asked here. I've been building S3 classes based on tibble with the main motivation being to create a print method. So the loss of nice printing (due to the loss of my class) just from looking at a subset of the rows is a source of aggravation. But I'm also waiting to see if I can exert my will over printing by applying a class to specific columns and then allowing tibble printing to happen normally. Therefore, in my case, this is very linked to #144. |
@t-kalinowski: In the readr case, I think the spec applies to the imported data, but not so much to any modifications of it. My feeling is that stripping the attribute isn't the worst thing to do here. @jennybc: But who is removing your class attribute? Can you please show an example? |
Some further thoughts: All are driven by a desire to add richer annotations to the tidy data, and for those annotations to travel with the object. Each of the examples above is limited to character strings unfortunately. It would be nice if annotations could also be lists.
It might make sense to support attributes at both levels, annotations for individual vectors, and annotations for the data frame. |
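To make the two levels concrete, here is a small base R sketch. The attribute names "unit" and "legend" are arbitrary choices for the example, not an existing convention.

```r
df <- head(iris)

# Column-level annotation: attached to one vector inside the data frame.
attr(df$Sepal.Length, "unit") <- "cm"

# Frame-level annotation: attached to the data frame itself; a list is allowed.
attr(df, "legend") <- list(source = "iris", note = "first six rows")

# Selecting whole columns passes the vectors through untouched, so the
# column-level annotation is still there afterwards.
picked <- df["Sepal.Length"]
unit_after <- attr(picked$Sepal.Length, "unit")
```

Whether either kind of annotation survives other operations (row subsetting, joins, dplyr verbs) is exactly the open question in this thread, so the sketch makes no claim about those.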
Here's an example of a tibble losing my class from row-subsetting:
library(tibble)
x <- structure(head(iris), class = c("jenny", "data.frame"))
class(x)
#> [1] "jenny" "data.frame"
class(x[1:2, ])
#> [1] "jenny" "data.frame"
xt <- structure(tibble::as_tibble(head(iris)),
class = c("jenny", "tbl_df", "tbl", "data.frame"))
class(xt)
#> [1] "jenny" "tbl_df" "tbl" "data.frame"
class(xt[1:2, ])
#> [1] "tbl_df" "tbl" "data.frame" |
I just came across the last example in |
Can we agree that subclasses of
library(dplyr, warn.conflicts = FALSE)
group_by(iris, Species)[45:55,]
#> Source: local data frame [11 x 5]
#> Groups: Species [2]
#>
#>    Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
#>           <dbl>       <dbl>        <dbl>       <dbl>     <fctr>
#> 1           5.1         3.8          1.9         0.4     setosa
#> 2           4.8         3.0          1.4         0.3     setosa
#> 3           5.1         3.8          1.6         0.2     setosa
#> 4           4.6         3.2          1.4         0.2     setosa
#> 5           5.3         3.7          1.5         0.2     setosa
#> 6           5.0         3.3          1.4         0.2     setosa
#> 7           7.0         3.2          4.7         1.4 versicolor
#> 8           6.4         3.2          4.5         1.5 versicolor
#> 9           6.9         3.1          4.9         1.5 versicolor
#> 10          5.5         2.3          4.0         1.3 versicolor
#> 11          6.5         2.8          4.6         1.5 versicolor |
Over time, my opinion on this has completely reversed. I think the current behavior, where subclasses of tbl_df should provide their own |
I think that's fine. Then I think it's important to document this well. Maybe there could be a vignette on making your own tibble-based classes. And it could list the minimal set of methods you should implement, with an example. |
I'm happy to call |
Thanks for the reminder and for collecting the pointers. We need sloop before we can work on it. tidyverse/dplyr#3259 is the only open dplyr issue that covers this problem. Please upvote (add 👍 to the first post with the "smiley" icon). |
Is there (will it be) an easy solution without requiring dplyr, tibble, sloop etc? Maybe an attributes marker "leave-my-attributes-alone" or a simple similarly-minded method? My problem is exactly as the one described by @jennybc . I add an extra class to a data.frame to customize printing and a bunch of non-structural attributes. |
Would the default that none of the attributes are vectorized on columns or rows make more sense? Then, only the subclasses dealing with such attributes would need to provide the |
I like the idea of a special attribute that contains the names of all attributes that are safe to copy over. This can be used later with a solution based on |
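A rough sketch of what such an opt-in marker could look like in base R. Everything here is hypothetical: the attribute name "safe_attrs" and the helper copy_safe_attrs() are invented for illustration and are not part of tibble, dplyr, or any other package.

```r
# Hypothetical convention: the attribute "safe_attrs" names the attributes
# that may be copied to a result without checking its shape.
copy_safe_attrs <- function(from, to) {
  keep <- attr(from, "safe_attrs")
  for (nm in keep) {
    attr(to, nm) <- attr(from, nm)
  }
  # Carry the marker itself along so the result can be subset again.
  if (length(keep) > 0) attr(to, "safe_attrs") <- keep
  to
}

df <- head(iris)
attr(df, "legend") <- "measurements in cm"
attr(df, "safe_attrs") <- "legend"

# Re-attach the declared attributes after an ordinary subsetting step.
sub <- copy_safe_attrs(df, df[1:2, c("Species", "Sepal.Length")])
attr(sub, "legend")
```

Attributes that are not listed are simply left to whatever the subsetting step does with them, which keeps the default conservative.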
The exclusion declaration - list all attributes which shouldn't be copied - seems more natural to me as well. But I am not sure how useful it would be. Subclasses which deal with such attributes will probably have to declare their |
hmm... if the attribute so neatly fits into being row-subsettable, why would it be an attribute and not an additional column? Also, if it neatly fits into being column subsettable, why would the attribute be set on the whole tibble and not on the column itself? Are you using attributes to implement user-hidden columns? |
I am not using attributes vectorized on rows or columns, but from what I see others do and, if I understand correctly, this is the foremost reason why attributes are dropped by dplyr verbs in the first place.
Because it's internal to the class representation and does not constitute user-level data. |
I think e.g. sf is doing the right thing by keeping the geometry in a regular column. Supporting "special" attributes that act as an extra column or row feels out of scope. The opt-in interface (attributes³) feels like the safer option, though. This only affects packages that subclass tibbles, I'm happy to make this easier but I'd rather not trade this for the safety of just stripping all attributes if in doubt. |
In my case, the information stored in attributes usually has a different structure than the main data. Often it's a single row data frame with other rows than the main one, used for various metadata. E.g. a |
As a short-term solution, I think the functions in tibble that create new tibbles from an existing tibble (e.g. In the long term, these functions should call |
This issue can be closed with a PR. We should then open a new issue that lists all functions that create a tibble, and that will need to use |
Note that |
This old thread has been automatically locked. If you think you have found something related to this, please open a new issue and link to this old issue if necessary. |
possibly related to tidyverse/dplyr#1984
this is my current workaround:
df[] <- df[names(df)]