Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ichunk #46

Open
dselivanov opened this issue Nov 1, 2015 · 5 comments
Open

ichunk #46

dselivanov opened this issue Nov 1, 2015 · 5 comments

Comments

@dselivanov
Copy link

I eveluated both itertools::ichunk and itertools2::ichunk, and the first (at least for me) looks more logical while trying to handle objects which length not divisible by chunk_size. For example I have vector v=1:5 and want to obtain list(c(1,2), c(3,4), c(5)), is it possible with itertools2::ichunk?

@ramhiser
Copy link
Owner

ramhiser commented Nov 2, 2015

itertools2::ichunk has a similar behavior, but could be improved. Here's a similar example to the one you provided:

> it <- itertools2::ichunk(1:5, chunk_size=2)
> nextElem(it)
[[1]]
[1] 1

[[2]]
[1] 2

> nextElem(it)
[[1]]
[1] 3

[[2]]
[1] 4

> nextElem(it)
[[1]]
[1] 5

[[2]]
[1] NA

@dselivanov
Copy link
Author

I understand how to use itertools2::ichunk, but I think last chunk with value c(5) will be much more logical. I'm ok with fill=NA option, but fill=NULL doesn't work.

Also there 2 notes:

  1. How to stop iterate over ichunk iterator? after the end it silently fills all values with NA and continue iteration:
it <- ichunk(1:10, 3)
while(TRUE) {
  val = try(nextElem(it), silent = T)
  if (class(val) == "try-error") break
}

Of course I can check it with anyNA() call, NA can be in underlying data... Believe, user should not care about such sort of things.
2. this note is about performance. I want to heavily use iterators in my text2vec package, but there is big issue with performance. Consider following examples:

devtools::install_github('dselivanov/text2vec@redesign')
# reviews of 25000 movies ~ 30mb
data("movie_review")
it <- ichunk(movie_review[['review']], chunk_size = 1000)
limit <- length(movie_review[['review']]) / 1000
i <- 1
system.time(
  # i <= limit because of behavior from #1
  while(TRUE && i <= limit) {
    val = try(nextElem(it), silent = T)
    if (class(val) == "try-error") break
    i <- i + 1
  }
)
#user  system elapsed 
#1.141   0.002   1.142 

#Now lets check without iterators

chunk <- function(x, n) suppressWarnings(split(x, rep(seq_len(n), each = ceiling(length(x) / n)) ))
ich <- chunk(seq_len(nrow(movie_review)), 25)
system.time(
  for(i in ich ) {
    tmp<-movie_review[['review']][i]
  }
)
#user  system elapsed 
#0.002   0.000   0.003

As you can see, there is very very big overhead of using iterators here. Can we do something with that?

@ramhiser
Copy link
Owner

ramhiser commented Nov 5, 2015

I'll take a look at the ichunk behavior this weekend and see what needs to improve.

As for the iterators package, I'm aware of its unfortunate overhead -- I have seen it crop up in several cases. Contributing to or forking iterators has been on my agenda for some time.

@dselivanov
Copy link
Author

iterators::idiv() solved my problems,. I write custom iterator based on
top of idiv and now everything works fine.

2015-11-05 16:52 GMT+03:00 John Ramey notifications@github.com:

I'll take a look at the ichunk behavior this weekend and see what needs
to improve.

As for the iterators package, I'm aware of its unfortunate overhead -- I
have seen it crop up in several cases. Contributing to or forking
iterators has been on my agenda for some time.


Reply to this email directly or view it on GitHub
#46 (comment).

Regards
Dmitriy Selivanov

@ramhiser
Copy link
Owner

ramhiser commented Nov 5, 2015

That is good to hear.

After you created the issue, I realized the ichunk is not ideal. I still intend to investigate it this weekend.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants