Ignore nulls for n_distinct() #1052

@saurabhRTR

Consider the following data frame (`mydf`):

       uid term_number order_id
68 1001190           0  1985608
69 1001190           0  2052320
70 1001190           0  2089064
71 1001190           1  2125056
72 1001190           2  2275108
73 1001190           2  2296768
74 1001190           2  2343148
75 1001190           3  2474898
76 1001190           4  2676880
77 1001190           5  2718370
78 1001190           6       NA
79 1001190           7  3109466
80 1001190           7  3132486


mydf %>%
  group_by(uid, term_number) %>%
  summarize(n_distinct(order_id))


Source: local data frame [8 x 3]
Groups: uid

      uid term_number n_distinct(order_id)
1 1001190           0                    3
2 1001190           1                    1
3 1001190           2                    3
4 1001190           3                    1
5 1001190           4                    1
6 1001190           5                    1
7 1001190           6                    1
8 1001190           7                    2

The result reports 1 distinct order for term 6, where I expected 0. The only order_id in that group is NA, and n_distinct() counts NA as a distinct value instead of ignoring it. I guess that makes sense in some cases, so ideally there would be a flag to ignore NAs.
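As a possible workaround until such a flag exists, the NAs can be dropped before counting. The sketch below uses base R only and rebuilds the example data from the table above; `n_distinct_no_na` is a hypothetical helper name introduced here for illustration, not a dplyr function.

```r
# Rebuild the example data from this issue (values copied from the table above).
mydf <- data.frame(
  uid = 1001190,
  term_number = c(0, 0, 0, 1, 2, 2, 2, 3, 4, 5, 6, 7, 7),
  order_id = c(1985608, 2052320, 2089064, 2125056, 2275108, 2296768,
               2343148, 2474898, 2676880, 2718370, NA, 3109466, 3132486)
)

# Hypothetical helper (not part of dplyr): count distinct values, skipping NA.
n_distinct_no_na <- function(x) length(unique(x[!is.na(x)]))

# Distinct non-NA order ids per term, computed with base R's tapply().
counts <- tapply(mydf$order_id, mydf$term_number, n_distinct_no_na)
print(counts)
```

The same filtering should also work inside the pipeline, e.g. `summarize(n_distinct(order_id[!is.na(order_id)]))`. Later dplyr releases document an `na.rm` argument for `n_distinct()`, so it is worth checking the installed version first.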

PS: Most databases ignore NULL by default when counting distinct values. R's distinct operations (e.g. unique()) do not.

Thanks,
