Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

incorrect summarise behavior #452

Closed
kbicknell opened this issue Jun 6, 2014 · 10 comments
Closed

incorrect summarise behavior #452

kbicknell opened this issue Jun 6, 2014 · 10 comments
Assignees
Labels
Milestone

Comments

@kbicknell
Copy link

@kbicknell kbicknell commented Jun 6, 2014

In the following code (run with just dplyr loaded), only the second line calculates the summary variable g correctly:

data.frame(a = c(0, 1)) %>% summarise(f = length(a), g = sum(a)*2)
data.frame(a = c(0, 1)) %>% summarise(g = sum(a)*2, f = length(a))

This is the output:

  f g
1 2 0
  g f
1 2 2

I'm running R 3.1.0 on a mac, dplyr installed fresh from github today.

@kevinushey
Copy link
Contributor

@kevinushey kevinushey commented Jun 6, 2014

It appears that, for whatever reason, in the first statement sum(a) * 2, a is taken as only the first value of a:

data.frame(a = c(10, 100)) %>% summarise(f = length(a), g = sum(a) * 2)

giving

  f  g
1 2 20

Loading

@kevinushey
Copy link
Contributor

@kevinushey kevinushey commented Jun 6, 2014

From what I can see, the hybrid evaluator is buggering up and substituting in 0 in place of a for your example.

Loading

@kevinushey
Copy link
Contributor

@kevinushey kevinushey commented Jun 6, 2014

It looks like dplyr::internal::Sum is getting a pointer to the correct data, but the indices are wrong about the size of the vector:

Process 23878 stopped
* thread #1: tid = 0x3d52d, 0x000000010eae8102 dplyr.so`dplyr::internal::Sum<14, false, SlicingIndex>::process(ptr=0x0000000109d53d30, indices=0x00007fff5fbf8160) + 18 at Sum.h:61, queue = 'com.apple.main-thread', stop reason = breakpoint 6.241
    frame #0: 0x000000010eae8102 dplyr.so`dplyr::internal::Sum<14, false, SlicingIndex>::process(ptr=0x0000000109d53d30, indices=0x00007fff5fbf8160) + 18 at Sum.h:61
   58       template <typename Index>
   59       struct Sum<REALSXP, false, Index> {
   60           static double process( double* ptr, const Index& indices ){
-> 61               long double res = 0.0 ;
   62               int n = indices.size() ;
   63               for( int i=0; i<n; i++){
   64                   // we don't test for NA here because += NA will give NA
(lldb) p ptr[0]
(double) $29 = 0
(lldb) p ptr[1]
(double) $30 = 1
(lldb) p indices.size()
(int) $31 = 1

So something is going wrong in constructing the indices in the first example.

Loading

@kevinushey
Copy link
Contributor

@kevinushey kevinushey commented Jun 6, 2014

I think the problem has to do with how the number of rows is computed for the LazySubsets class. Currently we're doing:

inline int nrows() const {
    return Rf_length( data_map.begin()->second ) ;  
}

In the first case, the subsets object looks like:

(dplyr::LazySubsets) $62 = {
  data_map = {
    table_ = {
      boost::unordered::detail::table<boost::unordered::detail::map<std::__1::allocator<std::__1::pair<SEXPREC *const, SEXPREC *> >, SEXPREC *, SEXPREC *, boost::hash<SEXPREC>, std::__1::equal_to<SEXPREC *> > > = {
        boost::unordered::detail::functions<boost::hash<SEXPREC>, std::__1::equal_to<SEXPREC *> > = {
          current_ = false
          funcs_ = {
            [0] = {
              data_ = (buf = "\x84", align_ = '\x84')
            }
            [1] = {
              data_ = (buf = "?", align_ = '?')
            }
          }
        }
        allocators_ = {}
        bucket_count_ = 4
        size_ = 1
        mlf_ = 1
        max_load_ = 4
        buckets_ = 0x0000000108147970
      }
    }
  }
}

In the second:

(dplyr::LazySubsets) $67 = {
  data_map = {
    table_ = {
      boost::unordered::detail::table<boost::unordered::detail::map<std::__1::allocator<std::__1::pair<SEXPREC *const, SEXPREC *> >, SEXPREC *, SEXPREC *, boost::hash<SEXPREC>, std::__1::equal_to<SEXPREC *> > > = {
        boost::unordered::detail::functions<boost::hash<SEXPREC>, std::__1::equal_to<SEXPREC *> > = {
          current_ = false
          funcs_ = {
            [0] = {
              data_ = (buf = "\x84", align_ = '\x84')
            }
            [1] = {
              data_ = (buf = "?", align_ = '?')
            }
          }
        }
        allocators_ = {}
        bucket_count_ = 4
        size_ = 2
        mlf_ = 1
        max_load_ = 4
        buckets_ = 0x0000000101421ac0
      }
    }
  }
}

Notice that the size is 1 in the first case, but it's 2 in the second case. Maybe the hybrid evaluator needs to query the size() instead of the nrows() when hybrid evaluation occurs like this? I am not so sure yet.

Loading

@romainfrancois
Copy link
Member

@romainfrancois romainfrancois commented Jun 6, 2014

Thanks. I'll pick it up shortly.

Loading

@edonnachie
Copy link

@edonnachie edonnachie commented Jul 3, 2014

I also encountered this bug. A command like this:

data_frame %.% summarise( a1 = mean(x), b = sd(y), a2 = mean(x))

will sometimes output a1 = mean(x) and a2 = mean(x[1]). Unfortunately, I haven't been able to identify the exact circumstances under which this happens. It seems to depend on both the other variables summarised and the ordering of the arguments.

Loading

@hadley hadley added the bug label Jul 28, 2014
@hadley hadley added this to the 0.3 milestone Jul 28, 2014
@steadyfish
Copy link

@steadyfish steadyfish commented Aug 14, 2014

I came across a similar issue today. It does sometimes throw an error when there's no group_by preceding the summarise statement..but it works fine if summarise comes after a group_by.

Loading

@hadley
Copy link
Member

@hadley hadley commented Sep 12, 2014

This now appears to be correct.

Loading

@hadley hadley closed this Sep 12, 2014
@kbicknell
Copy link
Author

@kbicknell kbicknell commented Sep 13, 2014

I just installed the latest dplyr from github, tested out the example above, and got the same results as before. So maybe this isn't fully fixed yet? As before, I'm on a mac, R 3.1.1.

Loading

@hadley hadley reopened this Sep 16, 2014
@hadley
Copy link
Member

@hadley hadley commented Sep 16, 2014

This example makes it a little easier to see what's going wrong:

data.frame(a = c(10, 100)) %>% summarise(sum(a), sum(a) * 2)

Loading

@lock lock bot locked as resolved and limited conversation to collaborators Jun 10, 2018
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
Projects
None yet
Linked pull requests

Successfully merging a pull request may close this issue.

None yet
6 participants