Issue 2134 #2136

koheiw · 2021-09-14T11:08:12Z

codecov · 2021-09-14T11:17:08Z

Codecov Report

Merging #2136 (8e4ed0f) into master (e008c48) will increase coverage by 0.00%.
The diff coverage is 100.00%.

@@           Coverage Diff           @@
##           master    #2136   +/-   ##
=======================================
  Coverage   96.14%   96.14%           
=======================================
  Files          87       87           
  Lines        4952     4953    +1     
=======================================
+ Hits         4761     4762    +1     
  Misses        191      191

Impacted Files	Coverage Δ
R/dfm_group.R	`98.46% <100.00%> (+0.02%)`	⬆️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update e008c48...8e4ed0f.

kbenoit

Looks good. I was wondering, however, whether we should add an option to group the NAs, for instance use.na = TRUE. I investigated a few similar functions, and they work like this:

split()
Drops NA as a group. From its documentation:
> Any missing values in f are dropped together with the corresponding values of x.

dplyr::group_split()
Keeps NA as a group.

df <- structure(list(
  x = 1:10,
  y = structure(c(
    NA, 1L, 1L, 1L, 1L,
    NA, 3L, 3L, 3L, 3L
  ),
  .Label = c("a", "b", "c"),
  class = "factor"
  )
),
row.names = c(NA, -10L), class = "data.frame"
)
df
##     x    y
## 1   1 <NA>
## 2   2    a
## 3   3    a
## 4   4    a
## 5   5    a
## 6   6 <NA>
## 7   7    c
## 8   8    c
## 9   9    c
## 10 10    c

# drops NA
split(df, df$y, drop = TRUE)
## $a
##   x y
## 2 2 a
## 3 3 a
## 4 4 a
## 5 5 a
## 
## $c
##     x y
## 7   7 c
## 8   8 c
## 9   9 c
## 10 10 c

# keeps NA
dplyr::group_split(df, y, .drop = TRUE)
## <list_of<
##   tbl_df<
##     x: integer
##     y: factor<af15a>
##   >
## >[3]>
## [[1]]
## # A tibble: 4 × 2
##       x y    
##   <int> <fct>
## 1     2 a    
## 2     3 a    
## 3     4 a    
## 4     5 a    
## 
## [[2]]
## # A tibble: 4 × 2
##       x y    
##   <int> <fct>
## 1     7 c    
## 2     8 c    
## 3     9 c    
## 4    10 c    
## 
## [[3]]
## # A tibble: 2 × 2
##       x y    
##   <int> <fct>
## 1     1 <NA> 
## 2     6 <NA>

And aggregate() has an option na.action = na.omit (although it applies only to the formula method):
> na.action a function which indicates what should happen when the data contain NA values. The default is to ignore missing values in the given variables.
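
As a minimal sketch of that default, the block below rebuilds a data frame with the same x and y values as df above (so it runs on its own) and calls the formula method of aggregate(). The choice of sum as the summary function is only for illustration.

```r
# Same x and y values as the df above, rebuilt so this block is self-contained
df <- data.frame(
  x = 1:10,
  y = factor(c(NA, "a", "a", "a", "a", NA, "c", "c", "c", "c"),
             levels = c("a", "b", "c"))
)

# Formula method: the default na.action = na.omit removes the rows where
# y is NA before grouping, so no NA group appears in the result
agg <- aggregate(x ~ y, data = df, FUN = sum)
agg
# two rows: a (sum 14) and c (sum 34)
```

Rows 1 and 6, where y is NA, do not contribute to either group, which matches the split() behaviour rather than the dplyr::group_split() behaviour.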

koheiw · 2021-09-15T20:34:32Z

This PR only makes dfm_group() work in the same way as tokens_group() and corpus_group(). See 5f40e22 and compare with the others.
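
One way to check that consistency by hand is sketched below. It assumes a quanteda version that provides corpus_group() (3.0 or later); the texts and the grp vector are made up for illustration and are not taken from the PR or its tests.

```r
library(quanteda)

# Hypothetical toy data: three documents, one with a missing group value
txt <- c(d1 = "a b", d2 = "b c", d3 = "c d")
grp <- c("x", NA, "x")

corp <- corpus(txt)
toks <- tokens(corp)
dfmat <- dfm(toks)

# Number of documents left after grouping each object type by the same vector
n_corp <- ndoc(corpus_group(corp, groups = grp))
n_toks <- ndoc(tokens_group(toks, groups = grp))
n_dfm <- ndoc(dfm_group(dfmat, groups = grp))

c(corpus = n_corp, tokens = n_toks, dfm = n_dfm)
```

If dfm_group() handles NA groups the same way as the other two functions, the three counts should agree.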

koheiw added 5 commits September 14, 2021 18:37

Drop rows for NA

7ce0362

Fix the bug

5f40e22

Add test

0eee8f3

Revert to original

182e895

Add more tests

0b4418b

koheiw requested a review from kbenoit September 14, 2021 11:28

kbenoit reviewed Sep 15, 2021

Merge branch 'master' into issue-2134

8e4ed0f

kbenoit approved these changes Oct 26, 2021

kbenoit added 2 commits October 26, 2021 17:00

Update NEWS for #2134

66ba240

Merge branch 'master' into issue-2134

8397835

kbenoit merged commit ae806d5 into master Oct 26, 2021

kbenoit deleted the issue-2134 branch October 26, 2021 16:39
