No is.null() for empty dfmSparse object computed by dfm() #811

Closed
contefranz opened this issue Jun 20, 2017 · 9 comments

Comments

@contefranz

I found that when I compute a document-feature matrix using dfm() with a custom dictionary and no words are matched, dfm() appears to return NULL. The problem arises when I check this with the standard is.null() function, which returns FALSE. A minimal example follows.

# defining a silly dictionary
my_dictionary = dictionary( list( a = c( "asd", "dsa" ),
                                  b = c( "foo", "jup" ) ) )

# writing a little piece of text
raw_text = c( "Wow I can't believe it's not raining!", 
              "Today is a beautiful day. The sky is blue and there are burritos" )

# building the related corpus
my_corpus = corpus( raw_text )
summary( my_corpus )

# building the related DFM based on my silly dictionary
my_dfm = dfm( my_corpus, dictionary = my_dictionary )

# is.null() now returns FALSE even though the object prints as NULL
is.null( my_dfm )

The temporary workaround I found is to convert the my_dfm object to either a matrix or a data.table and check the dimensions as follows.

# in the case of a matrix, the number of columns is zero while the
# number of rows corresponds to the number of texts detected by corpus().
my_dfm_mat = as.matrix( my_dfm )
dim( my_dfm_mat )

# in the case of a data.table both rows and columns are zero
my_dfm_dt = as.data.table( my_dfm )
dim( my_dfm_dt )

So I guess that, to execute a code chunk if and only if the document-feature matrix is non-empty, one can run something like the following:

# 1. matrix case: 
if ( ncol( my_dfm_mat ) > 0L ) {
  # run your code ...
}

# 2. data.table case
if ( all( dim( my_dfm_dt ) != c( 0L, 0L ) ) ) {
  # run your code ...
}

At least this worked for me, but it would be nice to have a direct way to check whether dfm() has computed an empty matrix.

@koheiw
Collaborator

koheiw commented Jun 20, 2017

An empty dfm looks like NULL when printed, but it is still a dfm. Please use nfeature() == 0 to check whether it is empty.

my_dictionary = dictionary( list( a = c( "asd", "dsa" ),
                                  b = c( "foo", "jup" ) ) )

# writing a little piece of text
raw_text = c( "Wow I can't believe it's not raining!", 
              "Today is a beautiful day. The sky is blue and there are burritos" )

my_corpus = corpus( raw_text )
summary( my_corpus )

my_dfm = dfm( my_corpus, dictionary = my_dictionary )

is.null( my_dfm ) # FALSE
is.dfm( my_dfm ) # TRUE
nfeature( my_dfm ) # 0 (because it is still a dfm with zero features)
str( my_dfm )
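
The same distinction can be sketched in base R alone. This is only an illustration under assumptions: a plain zero-column matrix stands in for the empty dfm (a real dfm is a sparse S4 object), and has_no_features() is a hypothetical helper, not a quanteda function.

```r
# Base-R stand-in for an empty dfm: 2 documents, 0 features.
# (Assumption: a plain matrix is used purely for illustration.)
empty_like <- matrix(0, nrow = 2, ncol = 0)

# The object exists, so it is not NULL; it merely has no columns.
null_check  <- is.null(empty_like)
empty_check <- ncol(empty_like) == 0L

# Hypothetical helper (not part of quanteda):
# TRUE when x has dimensions and zero columns.
has_no_features <- function(x) {
  !is.null(dim(x)) && ncol(x) == 0L
}

print(null_check)
print(empty_check)
print(has_no_features(empty_like))
print(has_no_features(matrix(1, nrow = 2, ncol = 3)))
```

The point is that emptiness is a property of the dimensions, not of the object being NULL, which is why a column or feature count is the check to use.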

@contefranz
Author

contefranz commented Jun 20, 2017

Cool! Thank you.
It is counterintuitive to me, though. If I see NULL when I print an object, my instinct is to test for it in the usual way.
Do you think it would be possible, and would it make sense, to implement an is.null.dfm() function?

Thank you again.

@kbenoit
Collaborator

kbenoit commented Jun 20, 2017

Better would be to change the print method to reflect an empty dfm, which would be consistent with empty versions of other objects, e.g.

> matrix()
     [,1]
[1,]   NA
> data.frame()
data frame with 0 columns and 0 rows
> tibble::tibble()
# A tibble: 0 x 0
> as(matrix(), "dgCMatrix")
1 x 1 sparse Matrix of class "dgCMatrix"
       
[1,] NA
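
A rough sketch of what such a print method could look like, written as a base-R S3 method on a made-up class. Both the class name toy_dfm and the message wording are assumptions for illustration; this is not quanteda's actual print method.

```r
# Hypothetical S3 class "toy_dfm" wrapping a plain matrix (illustration only).
new_toy_dfm <- function(m) {
  structure(m, class = "toy_dfm")
}

# Print method that always reports the dimensions, so an empty object
# shows up as "0 features" instead of NULL.
print.toy_dfm <- function(x, ...) {
  cat("toy_dfm: ", nrow(x), " documents, ", ncol(x), " features\n", sep = "")
  if (ncol(x) > 0L) {
    print(unclass(x))
  }
  invisible(x)
}

empty_toy <- new_toy_dfm(matrix(0, nrow = 2, ncol = 0))
print(empty_toy)
```

With output like this, nobody would be tempted to reach for is.null(), since the printed form no longer suggests a NULL object.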

@kbenoit
Collaborator

kbenoit commented Jun 20, 2017

Burritos? 🌯

😄

@contefranz
Author

@kbenoit That would help indeed. My point was that when I call a print method, I expect to be able to test for what it shows. Printing NULL without being able to check it with is.null() is, I think, something to change. What you suggest addresses the problem. Maybe, since dfm() computes a matrix, even if it is of class dfmSparse, printing NA is the way to go. In that case calling is.na() should return TRUE. What do you think?

@myeomans

myeomans commented Jun 25, 2017

I just ran into this same problem... But why the special NULL case, at all? Wouldn't an i-by-j matrix full of zeros be the more appropriate return? In my case, I am binding the dfm output to a separate matrix of dependency-based features from the same documents. An i-by-j matrix of zeros perfectly captures the information I need (i.e. none of these dictionary features are present) while the "NULL" 0-by-0 matrix breaks the pipeline.
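
To make the pipeline concern concrete, here is a small base-R sketch of the zero-filled block described above. The feature names (dep_a, dep_b) and document names are made up for illustration, and plain matrices stand in for the dfm and the dependency-feature matrix.

```r
# Made-up dependency-based features for two documents (illustration only).
other_features <- matrix(
  c(1, 0, 2, 3), nrow = 2,
  dimnames = list(c("text1", "text2"), c("dep_a", "dep_b"))
)

# Dictionary keys that matched nothing in either document.
dict_keys <- c("a", "b")

# An i-by-j block of zeros: one row per document, one column per key.
zero_block <- matrix(
  0, nrow = nrow(other_features), ncol = length(dict_keys),
  dimnames = list(rownames(other_features), dict_keys)
)

# Binding works because the row counts agree.
combined <- cbind(other_features, zero_block)
print(dim(combined))
print(colnames(combined))
```

The zero block keeps one row per document, so downstream code sees "these features were absent" rather than losing the documents altogether.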

@kbenoit
Collaborator

kbenoit commented Jun 28, 2017

We also need to change kwic():

(tmp <- kwic(data_char_ukimmig2010, "unicorn"))
## NULL
is.null(tmp)
## [1] FALSE

@contefranz
Author

Yes, that is correct.
Can we label this issue as design or enhancement?

@kbenoit
Collaborator

kbenoit commented Jun 28, 2017

Fix and update for both conditions is imminent.
