NB documentation and cross validation #1010

Closed
cschwem2er opened this issue Oct 7, 2017 · 3 comments

cschwem2er commented Oct 7, 2017

Hi,

The documentation for textmodel_NB does not include explanations of the different priors, even though the arguments table says it should:

prior	prior distribution on texts; see Details

And loosely related to this: do you have any recommendations for using cross-validation with quanteda textmodels? At the moment I manually split the data into training and test sets, but it would be very handy to have a quanteda function for CV.
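To give an idea of the manual approach: below is a rough sketch of my split-and-score loop, generalized to k folds. It is plain base R wrapped around textmodel_NB and predict (the nb.predicted field is what predict currently returns), not a quanteda feature.

# sketch of manual k-fold CV around textmodel_NB; x is a dfm, y the class labels
cv_nb <- function(x, y, k = 5) {
    folds <- sample(rep_len(1:k, ndoc(x)))   # random fold assignment per document
    acc <- numeric(k)
    for (i in 1:k) {
        fit <- textmodel_NB(x[folds != i, ], y[folds != i])
        pred <- predict(fit, newdata = x[folds == i, ])
        acc[i] <- mean(pred$nb.predicted == y[folds == i])  # held-out fold accuracy
    }
    mean(acc)  # average accuracy across folds
}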

Edit: I also noticed that for distribution = 'Bernoulli', the underlying code seems to convert the dfm to binary automatically:

else if (object$distribution == "Bernoulli") {
    newdata <- tf(newdata, "boolean")
    Nc <- length(object$Pc)
If so, the related suggestion in the documentation could be removed.
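A quick way to see what that weighting does (tf() is the same function the snippet above calls; I assume this toy example behaves the same way in the current version):

# toy check: boolean weighting turns every nonzero count into 1
library(quanteda)
toy <- dfm(c(d1 = "good good good bad", d2 = "bad bad"))
tf(toy, "boolean")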

cschwem2er (Author) commented

Sorry, I have two other related questions:

  1. Is the smooth parameter used for Laplacian (add-one) smoothing, as explained here? (A worked example of what I mean follows the code below.)
  2. What would you say is the best way to identify words that are most important for assigning class labels? I'm currently using PcGw for this:
library(quanteda)
library(tidyverse)

movies <- quantedaData::data_corpus_movies$documents
movies <- sample_frac(movies, size = 1) # shuffle the dataset

trainingset <- dfm(movies$texts)
trainingset2 <- dfm_trim(trainingset,
         min_count = 10, max_docfreq = 0.5) # keep features with >= 10 occurrences in <= 50% of docs
trainingclass <- movies$Sentiment %>% as.factor()


train <- trainingset2[1:1500, ]   # subset documents (rows), so keep the comma
test <- trainingset2[1501:2000, ]

train_lab <- trainingclass[1:1500]
test_lab <- trainingclass[1501:2000]


nb_movies <- textmodel_NB(train, train_lab, prior = "docfreq")
pred <- predict(nb_movies, newdata = test)



# table(pred$nb.predicted, test_lab) %>%
#   caret::confusionMatrix()
# 81% accuracy

# converting word class probabilities to dataframe
word_probs <- nb_movies$PcGw %>% as.matrix() %>% t() %>%
  as.data.frame() %>% 
  mutate(feature = rownames(.))

# top 20 negative words
word_probs %>% 
  arrange(desc(neg)) %>% 
  head(20)

# top 20 positive words
word_probs %>% 
  arrange(desc(pos)) %>% 
  head(20)
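And to spell out what I mean by Laplacian smoothing in question 1, here is the textbook add-one formula worked by hand; this is my assumption about what smooth does, not a statement about quanteda's internals:

# add-one smoothing: P(w|c) = (count(w, c) + smooth) / (N_c + smooth * V)
counts <- c(good = 3, bad = 1, dull = 0)  # word counts within one class
smooth <- 1
p_w_given_c <- (counts + smooth) / (sum(counts) + smooth * length(counts))
p_w_given_c       # the unseen word "dull" now gets nonzero probability
sum(p_w_given_c)  # still sums to 1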


kbenoit commented Oct 8, 2017

Point taken on the documentation, will fix asap.

On the Bernoulli: there are actually three distributions commonly used in Naive Bayes for text: multinomial, Bernoulli, and "binary multinomial". In the issues linked below, the conditional for Bernoulli was not doing anything, and the predict method was also wrong (now fixed, see the code starting at https://github.com/kbenoit/quanteda/blob/master/R/textmodel_NB.R#L202). For fitting, the model computes probabilities based on binary occurrence, but prediction on new data needed fixing.
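To make the distinction concrete, here are the three likelihoods for a toy document written out by hand (the textbook formulas, not the package code):

# log-likelihood of one document under each distribution;
# x = the document's count vector over the vocabulary, p = P(w|c)
x <- c(good = 2, bad = 0, plot = 1)
p <- c(good = 0.5, bad = 0.3, plot = 0.2)
multinomial <- sum(x * log(p))                               # raw counts
binary_mult <- sum((x > 0) * log(p))                         # counts clipped to 0/1
bernoulli   <- sum((x > 0) * log(p) + (x == 0) * log(1 - p)) # absence also counts

The Bernoulli likelihood penalizes a class for vocabulary words that are absent from the document, which is why simply binarizing the dfm is not enough at prediction time.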

See the discussions here:

If this is not clear or you find an error, please open a new issue on the Bernoulli Naive Bayes with specifics; otherwise I will fix the documentation and close this issue.

kbenoit closed this as completed in d7fad71 on Oct 8, 2017
kbenoit commented Oct 8, 2017

I suggest that for cross-validation, you propose a desired behaviour and open it as a new issue.
