Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow sparse vectors in tibbles? #196

Closed
vspinu opened this issue Nov 23, 2016 · 19 comments
Closed

Allow sparse vectors in tibbles? #196

vspinu opened this issue Nov 23, 2016 · 19 comments
Labels
vctrs ↗️ Requires vctrs package

Comments

@vspinu
Copy link
Member

vspinu commented Nov 23, 2016

How difficult would it be to allow for sparse vectors as part of tibbles?

For instance

> tibble(a = 1:10, b = Matrix::sparseVector(c(T, T, T), c(2, 5, 9), 10))
Error: Each variable must be a 1d atomic vector or list.
Problem variables: 'b'

My primary interest is in space efficient logical vectors.

@vspinu
Copy link
Member Author

vspinu commented Nov 23, 2016

Oh. I now see that tibble is a very lightweight wrapper around data.frame. So I guess the above is doomed.

@krlmlr
Copy link
Member

krlmlr commented Nov 23, 2016

We could do that, but probably not until we've switched to vctrs, and support for sparse columns is added there. What happens if you call as_tibble() on such a data frame?

@vspinu
Copy link
Member Author

vspinu commented Nov 23, 2016

Actually it works. If I remove the 1d check in dataframe.R I get:

> tt <- tibble(a = 1:10, b = sparseVector(c(T, T, T), c(2, 5, 9), 10))
> dd <- cbind(tt, tt)
> str(dd)
'data.frame':	10 obs. of  4 variables:
 $ a: int  1 2 3 4 5 6 7 8 9 10
 $ b:Formal class 'lsparseVector' [package "Matrix"] with 3 slots
  .. ..@ x     : logi  TRUE TRUE TRUE
  .. ..@ length: num 10
  .. ..@ i     : num  2 5 9
 $ a: int  1 2 3 4 5 6 7 8 9 10
 $ b:Formal class 'lsparseVector' [package "Matrix"] with 3 slots
  .. ..@ x     : logi  TRUE TRUE TRUE
  .. ..@ length: num 10
  .. ..@ i     : num  2 5 9
> dd[1, "b"]
sparse vector (nnz/length = 0/1) of class "lsparseVector"
[1] .

The problem is with the print method:

> tt
Error in prettyNum(.Internal(format(x, trim, digits, nsmall, width, 3L,  (from utils-format.r#89) : 
  first argument must be atomic

But that's easy to fix I would guess.

@vspinu
Copy link
Member Author

vspinu commented Nov 23, 2016

I can look deeper into it if there is interest.

Adjusting is_vector to

is_vector <- function(x) {
  is_atomic(x) || is.list(x) || inherits(x, "sparseVector")
}

is enough to fix the constructor.

@krlmlr
Copy link
Member

krlmlr commented Nov 23, 2016

I'd prefer if this was tackled with r-lib/vctrs#27, then tibbles get to support this data type once we switch to vctrs. For now you can probably work around at your own risk with as_tibble(validate = FALSE) .

@vspinu
Copy link
Member Author

vspinu commented Nov 25, 2016

I don't think you are right about passing the ball to that project. These two projects have orthogonal concerns.

The number of data structures supported by vctrs is limited by the design, because the whole set of vctrs operations must be re-implemented for each vector type. Then there will be an exponentially growing set of cross-type operations. If the aim is also to be high performance, then there is no way you can go beyond 2-3 core supported types. Look how bloated Matrix package is.

The core task of tibble is to store column data. Potentially arbitrary types of data could be stored as columns as long as those types support basic protocol (length, [, [[, what else?). Why would one want to artificially restrict this abstraction to vectrs?

Finaly the scope of that project is pretty undefined at the moment and it might take years till it finally takes shape.

@krlmlr
Copy link
Member

krlmlr commented Nov 26, 2016

I'd like to refrain from supporting in tibble anything that might bring dplyr in trouble. The support for nonstandard data types in dplyr is rather weak, to improve it we need a solid foundation like vctrs. So, currently you'll have to use validate = FALSE to override tibble's protection mechanisms, and be aware of potential consequences.

I don't see the issue of "exponential growth of cross-type operations" because vctrs will be very strict about what can be combined.

@hadley: Could you please comment?

@hadley
Copy link
Member

hadley commented Nov 26, 2016

Vctrs is where we will be thinking about "vector type" objects including thinking about what the generic API should be. A data frame/tibble is just a little of vectors of equal lengths. So this discussion definitely belongs in vctrs. I expect the bulk of vctrs to be complete by end of next year, and the a lot of the API design will be complete in the next 3 months.

@krlmlr
Copy link
Member

krlmlr commented Aug 7, 2019

@mmaechler: Now that vctrs is gaining traction, what's the best way to support sparse vectors and matrices in data frames or tibbles? Currently I see:

tibble::tibble(a = 1:10, b = Matrix::sparseVector(c(T, T, T), c(2, 5, 9), 10))
#> All columns in a tibble must be 1d or 2d objects:
#> * Column `b` is lsparseVector
data.frame(a = 1:10, b = Matrix::sparseVector(c(T, T, T), c(2, 5, 9), 10))
#> Error in as.data.frame.default(x[[i]], optional = TRUE): cannot coerce class 'structure("lsparseVector", package = "Matrix")' to a data.frame

Created on 2019-08-07 by the reprex package (v0.3.0)

@mmaechler
Copy link

mmaechler commented Aug 7, 2019

Interesting: As Matrix::sparseVector() and the corresponding small class hierarchy have been very stable for many years, this may be an interesting test case. I would not go for "sparseMatrix" yet, as the class hierarchy is much much more sophisticated there.

As vectrs is not even on CRAN, it's too early for me to look further. For me, such sparse objects have always belonged to matrices rather than data frames, as matrices are much more efficient for these for what typical usage is, but then you seem to have reasonable use cases? To have as.data.frame.default() work for "sparseVector" objects (there are several: it's a hierarchy) seems to be asked much, but maybe there are other lower-level auxiliary methods that would be sufficient to you ?

@krlmlr
Copy link
Member

krlmlr commented Aug 7, 2019

Thanks, Martin.

{vctrs} 0.2.0 is on CRAN. The relevant text is https://vctrs.r-lib.org/articles/s3-vector.html, it's talking about S3 vectors but perhaps we can use S3 dispatch for the S4 classes?

Outside of vctrs, we could see if we can implement an as.data.frame.lsparseVector() S3 method for the sparse vector class; perhaps we don't need to change base for this?

I suspect it's sufficient to implement methods for the base class(es).

We could start the work in a package that imports {Matrix} and then incorporate if desired.

@krlmlr
Copy link
Member

krlmlr commented Oct 26, 2019

@lionel-: What would be a good vec_proxy() implementation for the "lsparseVector" class?

@krlmlr krlmlr mentioned this issue Oct 26, 2019
@lionel-
Copy link
Member

lionel- commented Oct 28, 2019

I'm not familiar with sparse implementations. Is it similar to run-length encoding? Maybe a record vector would be appropriate?

@krlmlr
Copy link
Member

krlmlr commented Nov 1, 2019

It's an S4 class with two vector and one scalar components.

x <- Matrix::sparseVector(1:3, 3:5, 7)
x
#> sparse vector (nnz/length = 3/7) of class "isparseVector"
#> [1] . . 1 2 3 . .
as.numeric(x)
#> [1] 0 0 1 2 3 0 0
x@x
#> [1] 1 2 3
x@i
#> [1] 3 4 5
x@length
#> [1] 7

Created on 2019-11-01 by the reprex package (v0.3.0)

@lionel-
Copy link
Member

lionel- commented Nov 4, 2019

Thanks, I'll look into it once I'm done with tidyselect. I'm going to implement methods for third party packages like sf. By the way, I think the tibble methods should also stay in vctrs for a while, because there will probably be API changes in the coming year.

@krlmlr
Copy link
Member

krlmlr commented Nov 4, 2019

Does this also mean that 98cac0d (and other packages we discover) belongs in vctrs? Happy to submit PRs.

Are we still recommending the "list subclass" approach, where we treat each S3 class that explicitly inherits from "list" as a vector? I forgot where I read about it.

@lionel-
Copy link
Member

lionel- commented Nov 5, 2019

If it's an important class feel free to submit a PR implementing methods. Perhaps after I have implemented sf, so that we have figured out the layout, conventions, and testing setup.

We're definitely going to do the explicit list inheritance approach, that's the first thing I'll tackle when I get back to work on vctrs, which I plan to this week.

@krlmlr
Copy link
Member

krlmlr commented Mar 11, 2020

Requires r-lib/vctrs#332, r-lib/vctrs#550 and probably r-lib/vctrs#551, before we can start adding vctrs support to {Matrix} classes. Once the following returns TRUE we're good to go:

vctrs::vec_is(Matrix::sparseVector(integer(), integer(), 0))
#> [1] FALSE

Created on 2020-03-11 by the reprex package (v0.3.0)

Even then this would have to happen in a different place, I'm thinking about a {mtrx} package which could eventually be incorporated into {Matrix}.

@krlmlr krlmlr closed this as completed Mar 11, 2020
@github-actions
Copy link
Contributor

This old thread has been automatically locked. If you think you have found something related to this, please open a new issue and link to this old issue if necessary.

@github-actions github-actions bot locked and limited conversation to collaborators Mar 12, 2021
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
vctrs ↗️ Requires vctrs package
Projects
None yet
Development

No branches or pull requests

5 participants