Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

filter and integer indices #226

Closed
romainfrancois opened this issue Jan 30, 2014 · 18 comments
Closed

filter and integer indices #226

romainfrancois opened this issue Jan 30, 2014 · 18 comments
Assignees
Labels
Milestone

Comments

@romainfrancois
Copy link
Member

@romainfrancois romainfrancois commented Jan 30, 2014

Would it make sense to support integer indices in filter. Something like this:

> mtcars %.% group_by(cyl) %.% filter(1:3)
> mtcars %.% group_by(cyl) %.% filter(sample(n(), 10))

Perhaps it is another verb though ?

Would help for this question: http://stackoverflow.com/questions/21255366/sample-dataframe-with-dplyr

This is related to: #202 but not completely.

@romainfrancois
Copy link
Member Author

@romainfrancois romainfrancois commented Jan 30, 2014

Also, we should probably error when giving integer indices if we don't want to support them. At the moment, they are silently cast to logical, so gives:

# treats 1:3 as a logical vector, so fails because wrong length
>  mtcars %.% group_by(cyl) %.% filter(1:3)
Erreur : incorrect length (3), expecting: 14

# treats the sample as logical vector, so considers them to all be TRUE (!=0)
>  mtcars %.% group_by(cyl) %.% filter( sample(n()) )
Source: local data frame [32 x 11]
Groups: cyl

    mpg cyl  disp  hp drat    wt  qsec vs am gear carb
1  21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
2  21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
3  22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
...

@hadley
Copy link
Member

@hadley hadley commented Jan 30, 2014

I agree that this would be useful, I want to keep each verb really simple, so I think it would have to be a new verb. I think that's ok - it would be analogous to the different ways of selecting columns in select() and mutate()

@romainfrancois
Copy link
Member Author

@romainfrancois romainfrancois commented Jan 30, 2014

Ok. First I'll fix filter so that it fails if the expression evaluates to anything but logical.
Then I can put some internal code in place which we can promote to a proper verb once this has a name.

@romainfrancois
Copy link
Member Author

@romainfrancois romainfrancois commented Jan 31, 2014

I put in place very early stage code for what I've called (but I don't like the name) integer_filter. It works the same as filter, but:

  • expects the expression to evaluate to an integer vector
  • only handles one expression.
> mtcars[1:3, ]
               mpg cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4     21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag 21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
Datsun 710    22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
> mtcars %.% integer_filter( 1:3 )
Source: local data frame [3 x 11]

   mpg cyl disp  hp drat    wt  qsec vs am gear carb
1 21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
2 21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
3 22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
>
> group_by( mtcars, cyl ) %.% integer_filter(1:2)
Source: local data frame [6 x 11]
Groups: cyl

   mpg cyl  disp  hp drat    wt  qsec vs am gear carb
1 18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
2 14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
3 22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
4 24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
5 21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
6 21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4

It does not handle negative indices yet. Maybe positive and negative should be a different function. Perhaps keep and drop or something.

This opens a few optimization possibilities, e.g. in integer_filter( 1:3 ) there is no need to materialize 1:3 into a vector, we could internally understand it. same with things like seq_len, seq_along ...

Anyway, for now I just wanted to put some initial code in place.

@romainfrancois
Copy link
Member Author

@romainfrancois romainfrancois commented Jan 31, 2014

Perhaps we can have something like css nth-child : http://www.w3schools.com/cssref/sel_nth-child.asp.

mtcars %.% integer_filter( nth_child(odd) )
mtcars %.% integer_filter( nth_child(even) )
mtcars %.% integer_filter( nth_child(3*n+1) )

Although we already can express all this with combinations of seq() and n()

@hadley
Copy link
Member

@hadley hadley commented Feb 17, 2014

What happens if the row numbers aren't present in the subset? i.e. what does integer_filter(mtcars, 35) do?

This feels like a generalisation of head() and tail() so maybe something like slice()?

@romainfrancois
Copy link
Member Author

@romainfrancois romainfrancois commented Feb 17, 2014

Hm. Right now it fails because 35 is numeric and not integer:

> integer_filter(mtcars, 35)
Error: integer_filter condition does not evaluate to an integer vector.

and otherwise it gives garbage:

> integer_filter(mtcars, 35L )
Source: local data frame [1 x 11]

            mpg           cyl          disp hp           drat           wt
1 2.126212e-314 2.126212e-314 2.126212e-314  0 -2.681562e+154 1.28823e-231
Variables not shown: qsec (dbl), vs (dbl), am (dbl), gear (dbl), carb (dbl)

and could likely segfault as well. I think it is in the land of undefined behavior.

I like slice.

@hadley
Copy link
Member

@hadley hadley commented Feb 17, 2014

I wonder if integer_filter() should silently drop rows outside the range. Then it's easy to express take the first 10 (1:10), or the last 10 rows ((n() - 10):n()). But that would also mean ignoring negative subscripts when mixed with positive, which would make the logic more complicated.

Another idea would be to make the arguments to integer_filter() like the arguments to seq(), and provide some value (e.g. n()) to mean the number of rows in the group. Then you could do:

# Select odd
slice(mtcars, from = 1, to = n(), by = 2)
# Select even
slice(mtcars, from = 2, to = n(), by = 2)
# Select first 10
slice(mtcars, from = 1, to = 10)
# Select last 10
slice(mtcars, from = n() - 10, to = n())`

from would default to 1, and to to n, so real code would be a bit more succinct:

# Select odd
slice(mtcars, by = 2)
# Select even
slice(mtcars, from = 2, by = 2)
# Select first 10
slice(mtcars, to = 10)
# Select last 10
slice(mtcars, from = n() - 10)`

That would require a little special evaluation.

It's not easy to implement this in SQL, but database users could always fall back to filter() + rownumber().

@romainfrancois
Copy link
Member Author

@romainfrancois romainfrancois commented Feb 17, 2014

Hold on to that thought, we need something similar here to express row filtering in fastread tidyverse/readr#6

I like the expressiveness of your last comment, but would we still be able to use arbitrary indices, e.g.:

slice( mtcars, c(1L,3L,2L,7L) )

not that it is particularly useful

@hadley
Copy link
Member

@hadley hadley commented Feb 17, 2014

I don't think selecting arbitrary indices is useful enough to get a top level function.

It might be useful to preserve that behaviour in C++ so that both slice() and sample() could be written on top of it. OTOH, sample() will need to be more complicated to support samples weighted by another variable.

@hadley hadley added this to the v0.2 milestone Mar 17, 2014
@hadley hadley added this to the 0.3 milestone Apr 9, 2014
@hadley hadley removed this from the v0.2 milestone Apr 9, 2014
@hadley
Copy link
Member

@hadley hadley commented Apr 9, 2014

Looking at this again, I now think your original API looks better. I've fleshed out a few more options below:

by_cyl <- mtcars %.% group_by(cyl)
# Select first row in each group
mtcars %.% slice(1)
# Select last row in each group
mtcars %.% slice(n())
# Rows not present in group silently ignored
mtcars %.% slice(10)
# Select arbitrary rows
mtcars %.% slice(1, 3, 9)
mtcars %.% slice(c(1, 3, 9))
# Select even rows
mtcars %.% slice(seq(2, n(), by = 2))

# Drop first row in each group
mtcars %.% slice(-1)

# Returns all values
mtcars %.% slice()

But I don't think this is critical for 0.2, so I'm going to move the milestone

@hoesler hoesler mentioned this issue Apr 17, 2014
@hadley hadley self-assigned this Aug 1, 2014
hadley added a commit that referenced this issue Aug 28, 2014
@hadley
Copy link
Member

@hadley hadley commented Aug 28, 2014

I've exported integer_filter as slice. I think there only a few thinks it needs for an initial release:

  • should silently coerce numeric input to integers
  • needs to silently drop out of range inputs

@hadley hadley assigned romainfrancois and unassigned hadley Aug 28, 2014
@romainfrancois
Copy link
Member Author

@romainfrancois romainfrancois commented Sep 11, 2014

I dealt with the coercion issue. So now:

> mtcars %>% group_by(cyl) %>% slice(1)
Source: local data frame [3 x 11]
Groups: cyl

   mpg cyl disp  hp drat   wt  qsec vs am gear carb
1 22.8   4  108  93 3.85 2.32 18.61  1  1    4    1
2 21.0   6  160 110 3.90 2.62 16.46  0  1    4    4
3 18.7   8  360 175 3.15 3.44 17.02  0  0    3    2

For the other issue, I think we are covered already bc of this line https://github.com/hadley/dplyr/blob/master/src/dplyr.cpp#L1361

    for( int i=0; i<ngroups; i++, ++git){
        SlicingIndex indices = *git ;
        g_test = check_filter_integer_result( call_proxy.get( indices ) ) ;

        int ntest = g_test.size() ;
        for( int j=0; j<ntest; j++){
            int k = g_test[j] ;
            if( k > 0 && k <= indices.size() ){  // <<<<<<<<<<<<< HERE
                indx.push_back( indices[k-1] ) ;
            }
        }

    }

@romainfrancois
Copy link
Member Author

@romainfrancois romainfrancois commented Sep 11, 2014

Ah. But it's not handled for the non grouped case. On it.

@romainfrancois
Copy link
Member Author

@romainfrancois romainfrancois commented Sep 11, 2014

I've added tests about coercion and dropping of out of range. Perhaps slice also needs additional tests.

@romainfrancois
Copy link
Member Author

@romainfrancois romainfrancois commented Sep 11, 2014

Hmmm. Now looking back at the examples from above here, this one no longer works:

mtcars %>% slice(-1)

as I'm just considering it out of range. I guess this needs more work if we want to support dropping off rows as well as keeping them, e.g. checking that values are either all positives or all negatives. What do you think @hadley ? I'll hold on on this for now.

@romainfrancois
Copy link
Member Author

@romainfrancois romainfrancois commented Sep 20, 2014

I gave a good go at this. It should now handle:

  • silently drop out of range indices
  • all positive indices
  • all negative indices
  • error if both positive and negative

Might need more testing, but I've included some tests in test-slice.r

@hadley
Copy link
Member

@hadley hadley commented Sep 22, 2014

Awesome - thanks Romain!

krlmlr pushed a commit to krlmlr/dplyr that referenced this issue Mar 2, 2016
@lock lock bot locked as resolved and limited conversation to collaborators Jun 10, 2018
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
Projects
None yet
Linked pull requests

Successfully merging a pull request may close this issue.

None yet
2 participants