No is.null() for empty dfmSparse object computed by dfm() #811

contefranz · 2017-06-20T13:36:40Z

I found that when I compute a document-feature matrix using dfm() with a custom dictionary and no words are matched, dfm() appears to return NULL. The problem arises when I check this with the standard is.null() function, which returns FALSE. A minimal example follows.

# defining a silly dictionary
my_dictionary = dictionary( list( a = c( "asd", "dsa" ),
                                  b = c( "foo", "jup" ) ) )

# writing a little piece of text
raw_text = c( "Wow I can't believe it's not raining!", 
              "Today is a beautiful day. The sky is blue and there are burritos" )

# building the related corpus
my_corpus = corpus( raw_text )
summary( my_corpus )

# building the related DFM based on my silly dictionary
my_dfm = dfm( my_corpus, dictionary = my_dictionary )

# now is.null() returns FALSE even though the object prints as NULL
is.null( my_dfm )

The temporary workaround I found is to convert the my_dfm object to either a matrix or a data.table and check the dimensions as follows.

# in the case of a matrix, the number of columns is zero, while the number
# of rows corresponds to the number of texts detected by corpus().
my_dfm_mat = as.matrix( my_dfm )
dim( my_dfm_mat )

# in the case of a data.table both rows and columns are zero
my_dfm_dt = as.data.table( my_dfm )
dim( my_dfm_dt )

So I guess that to execute a code chunk if and only if the document-feature matrix is not empty, one can run something like the following:

# 1. matrix case: at least one feature column is present
if ( ncol( my_dfm_mat ) > 0L ) {
  # run your code ...
}

# 2. data.table case: neither dimension is zero
if ( all( dim( my_dfm_dt ) != c( 0L, 0L ) ) ) {
  # run your code ...
}

At least this worked for me, but it would be nice to have a direct way to check whether dfm() has computed an empty matrix.


koheiw · 2017-06-20T13:50:35Z

An empty dfm looks like NULL when printed, but it is still a dfm. Please use nfeature() == 0 to check whether it is empty.

my_dictionary = dictionary( list( a = c( "asd", "dsa" ),
                                  b = c( "foo", "jup" ) ) )

# writing a little piece of text
raw_text = c( "Wow I can't believe it's not raining!", 
              "Today is a beautiful day. The sky is blue and there are burritos" )

my_corpus = corpus( raw_text )
summary( my_corpus )

my_dfm = dfm( my_corpus, dictionary = my_dictionary )

is.null( my_dfm ) # FALSE
is.dfm( my_dfm ) # TRUE
nfeature( my_dfm ) # 0 (because it is still a dfm with zero features)
str( my_dfm )
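
The same check can be wrapped in a guard around downstream code. The following is only a minimal sketch: has_features() is a hypothetical helper name, not part of quanteda, and plain base-R matrices stand in for dfm objects so that the example runs without any package. Assuming ncol() reports the same column count on the dfm itself as on its matrix conversion, the helper could be called on the dfm directly.

```r
# Hypothetical helper (not part of quanteda): TRUE when a matrix-like
# object has at least one column, i.e. at least one feature.
has_features <- function(x) {
  ncol(x) > 0L
}

# Stand-ins for dfm objects: rows are documents, columns are features.
empty_like <- matrix(0, nrow = 2, ncol = 0)   # 2 documents, 0 features
full_like  <- matrix(c(1, 0, 2, 1), nrow = 2) # 2 documents, 2 features

# Run the analysis only when something was matched.
if (has_features(empty_like)) {
  message("running analysis on matched features")
} else {
  message("no dictionary features matched; skipping")
}
```

With the stand-in data above, the guard takes the else branch for empty_like and would take the first branch for full_like.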

contefranz · 2017-06-20T13:56:02Z

Cool! Thank you.
It is counterintuitive to me, though. If I see NULL when I print an object, I instinctively want to test for it in the usual way.
Do you think it would be possible, and would it make sense, to implement an is.null.dfm() function?

Thank you again.

kbenoit · 2017-06-20T21:17:41Z

Better would be to change the print method to reflect an empty dfm, which would be consistent with empty versions of other objects, e.g.

> matrix()
     [,1]
[1,]   NA
> data.frame()
data frame with 0 columns and 0 rows
> tibble::tibble()
# A tibble: 0 x 0
> as(matrix(), "dgCMatrix")
1 x 1 sparse Matrix of class "dgCMatrix"
       
[1,] NA

kbenoit · 2017-06-20T21:21:15Z

Burritos? 🌯

😄

contefranz · 2017-06-22T10:35:28Z

@kbenoit That would help indeed. My point was that when I call a print method, I expect to be able to test for whatever it shows. Printing NULL without being able to detect it with is.null() is, I think, something to change. What you suggest addresses the problem. Maybe, since dfm() computes a matrix, even if it is of class dfmSparse, printing NA is the way to go. In that case calling is.na() should return TRUE. What do you think?

myeomans · 2017-06-25T21:58:44Z

I just ran into this same problem... But why the special NULL case, at all? Wouldn't an i-by-j matrix full of zeros be the more appropriate return? In my case, I am binding the dfm output to a separate matrix of dependency-based features from the same documents. An i-by-j matrix of zeros perfectly captures the information I need (i.e. none of these dictionary features are present) while the "NULL" 0-by-0 matrix breaks the pipeline.
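
To make the binding scenario concrete, here is a rough sketch using base-R matrices only. All names and values are made up for illustration and are not quanteda output: other_features stands in for the dependency-based features, and dict_counts for a dictionary result in which nothing matched.

```r
docs <- c("text1", "text2")

# Stand-in for features computed elsewhere for the same two documents.
other_features <- matrix(c(1, 0, 2, 3), nrow = 2,
                         dimnames = list(docs, c("dep_a", "dep_b")))

# Stand-in for a dictionary result with no matches: one zero-filled
# column per dictionary key, one row per document.
dict_counts <- matrix(0, nrow = 2, ncol = 2,
                      dimnames = list(docs, c("a", "b")))

# Column-binding works because both matrices have one row per document.
combined <- cbind(other_features, dict_counts)
dim(combined)       # 2 rows, 4 columns
colnames(combined)  # "dep_a" "dep_b" "a" "b"
```

The zero-filled columns record that the dictionary keys were looked up and not found, which is the information the downstream model needs.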

kbenoit · 2017-06-28T07:42:56Z

We also need to change kwic():

(tmp <- kwic(data_char_ukimmig2010, "unicorn"))
## NULL
is.null(tmp)
## [1] FALSE

contefranz · 2017-06-28T08:21:46Z

Yes, that is correct.
Can we flag this issue as related to design or enhancement?

kbenoit · 2017-06-28T08:27:18Z

Fix and update for both conditions is imminent.

kbenoit added design infrastructure labels Jun 28, 2017

kbenoit added a commit that referenced this issue Jun 29, 2017

Modify kwic print output for empty kwic

5f8ef43

- per issue #811

kbenoit added a commit that referenced this issue Jun 29, 2017

Modify print.dfm for dfms that are empty in one or both dimensions

96f2a30

Fixes #811

kbenoit mentioned this issue Jun 29, 2017

Issue 811 #823

Merged

kbenoit closed this as completed in #823 Jun 29, 2017

kbenoit added a commit that referenced this issue Jun 29, 2017

Merge pull request #823 from kbenoit/issue-811

e2ae400

Fixes #811

kbenoit mentioned this issue Jul 25, 2017

How to drop all the docvars from corpus? #879

Closed
