Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Issue 2134 #2136

Merged
merged 8 commits into from Oct 26, 2021
Merged

Issue 2134 #2136

merged 8 commits into from Oct 26, 2021

Conversation

koheiw
Copy link
Collaborator

@koheiw koheiw commented Sep 14, 2021

Fix #2134

@codecov
Copy link

codecov bot commented Sep 14, 2021

Codecov Report

Merging #2136 (8e4ed0f) into master (e008c48) will increase coverage by 0.00%.
The diff coverage is 100.00%.

Impacted file tree graph

@@           Coverage Diff           @@
##           master    #2136   +/-   ##
=======================================
  Coverage   96.14%   96.14%           
=======================================
  Files          87       87           
  Lines        4952     4953    +1     
=======================================
+ Hits         4761     4762    +1     
  Misses        191      191           
Impacted Files Coverage Δ
R/dfm_group.R 98.46% <100.00%> (+0.02%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update e008c48...8e4ed0f. Read the comment docs.

@koheiw koheiw requested a review from kbenoit September 14, 2021 11:28
Copy link
Collaborator

@kbenoit kbenoit left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good. I was thinking however whether we should add an option to group the NAs - for instance use.na = TRUE. I investigated a few similar functions, and they work like this:

  • split()
    Drops NA as a group.

    Any missing values in f are dropped together with the corresponding values of x.

  • dplyr::group_split()
    Keeps NA as a group

df <- structure(list(
  x = 1:10,
  y = structure(c(
    NA, 1L, 1L, 1L, 1L,
    NA, 3L, 3L, 3L, 3L
  ),
  .Label = c("a", "b", "c"),
  class = "factor"
  )
),
row.names = c(NA, -10L), class = "data.frame"
)
df
##     x    y
## 1   1 <NA>
## 2   2    a
## 3   3    a
## 4   4    a
## 5   5    a
## 6   6 <NA>
## 7   7    c
## 8   8    c
## 9   9    c
## 10 10    c

# drops NA
split(df, df$y, drop = TRUE)
## $a
##   x y
## 2 2 a
## 3 3 a
## 4 4 a
## 5 5 a
## 
## $c
##     x y
## 7   7 c
## 8   8 c
## 9   9 c
## 10 10 c

# keeps NA
dplyr::group_split(df, y, .drop = TRUE)
## <list_of<
##   tbl_df<
##     x: integer
##     y: factor<af15a>
##   >
## >[3]>
## [[1]]
## # A tibble: 4 × 2
##       x y    
##   <int> <fct>
## 1     2 a    
## 2     3 a    
## 3     4 a    
## 4     5 a    
## 
## [[2]]
## # A tibble: 4 × 2
##       x y    
##   <int> <fct>
## 1     7 c    
## 2     8 c    
## 3     9 c    
## 4    10 c    
## 
## [[3]]
## # A tibble: 2 × 2
##       x y    
##   <int> <fct>
## 1     1 <NA> 
## 2     6 <NA>

And aggregate() has an option na.action = na.omit (although it applies only to formulas):
> na.action a function which indicates what should happen when the data contain NA values. The default is to ignore missing values in the given variables.

@koheiw
Copy link
Collaborator Author

koheiw commented Sep 15, 2021

This PR only makes dfm_group() to work in the same way as tokens_group() and corpus_group(). See 5f40e22 and compare with others.

@kbenoit kbenoit merged commit ae806d5 into master Oct 26, 2021
@kbenoit kbenoit deleted the issue-2134 branch October 26, 2021 16:39
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Error in grouping dfm
2 participants