NB documentation and cross validation #1010

Closed
cschwem2er opened this issue Oct 7, 2017 · 3 comments

cschwem2er commented Oct 7, 2017

Hi,

The documentation for textmodel_NB does not include explanations of the different priors, even though the arguments table says it should:

prior	prior distribution on texts; see Details

And loosely related to this: do you have any recommendations for using cross-validation with quanteda textmodels? At the moment I manually split the data into training and test sets, but it would be very handy to have a quanteda function for CV.
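To give an idea of the manual approach: below is a rough sketch of my split-and-score loop, generalized to k folds. It is plain base R wrapped around textmodel_NB and predict (the nb.predicted field is what predict currently returns), not a quanteda feature.

# sketch of manual k-fold CV around textmodel_NB; x is a dfm, y the class labels
cv_nb <- function(x, y, k = 5) {
    folds <- sample(rep_len(1:k, ndoc(x)))   # random fold assignment per document
    acc <- numeric(k)
    for (i in 1:k) {
        fit <- textmodel_NB(x[folds != i, ], y[folds != i])
        pred <- predict(fit, newdata = x[folds == i, ])
        acc[i] <- mean(pred$nb.predicted == y[folds == i])  # held-out fold accuracy
    }
    mean(acc)  # average accuracy across folds
}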

Edit: I also noticed that for distribution = 'Bernoulli', the underlying code seems to convert the dfm to binary automatically:

else if (object$distribution == "Bernoulli") {
    newdata <- tf(newdata, "boolean")
    Nc <- length(object$Pc)
If so, the related suggestion in the documentation could be removed.
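A quick way to see what that weighting does (tf() is the same function the snippet above calls; I assume this toy example behaves the same way in the current version):

# toy check: boolean weighting turns every nonzero count into 1
library(quanteda)
toy <- dfm(c(d1 = "good good good bad", d2 = "bad bad"))
tf(toy, "boolean")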

cschwem2er (Author) commented

Sorry, I have two other related questions:

  1. Is the smooth parameter used for Laplacian (add-one) smoothing, as explained here? (A worked example of what I mean follows the code below.)
  2. What would you say is the best way to identify words that are most important for assigning class labels? I'm currently using PcGw for this:
library(quanteda)
library(tidyverse)

movies <- quantedaData::data_corpus_movies$documents
movies <- sample_frac(movies, size = 1) # shuffle the dataset

trainingset <- dfm(movies$texts)
trainingset2 <- dfm_trim(trainingset,
         min_count = 10, max_docfreq = 0.5) # keep features with >= 10 occurrences in <= 50% of docs
trainingclass <- movies$Sentiment %>% as.factor()


train <- trainingset2[1:1500, ]   # subset documents (rows), so keep the comma
test <- trainingset2[1501:2000, ]

train_lab <- trainingclass[1:1500]
test_lab <- trainingclass[1501:2000]


nb_movies <- textmodel_NB(train, train_lab, prior = "docfreq")
pred <- predict(nb_movies, newdata = test)



# table(pred$nb.predicted, test_lab) %>%
#   caret::confusionMatrix()
# 81% accuracy

# converting word class probabilities to dataframe
word_probs <- nb_movies$PcGw %>% as.matrix() %>% t() %>%
  as.data.frame() %>% 
  mutate(feature = rownames(.))

# top 20 negative words
word_probs %>% 
  arrange(desc(neg)) %>% 
  head(20)

# top 20 positive words
word_probs %>% 
  arrange(desc(pos)) %>% 
  head(20)
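And to spell out what I mean by Laplacian smoothing in question 1, here is the textbook add-one formula worked by hand; this is my assumption about what smooth does, not a statement about quanteda's internals:

# add-one smoothing: P(w|c) = (count(w, c) + smooth) / (N_c + smooth * V)
counts <- c(good = 3, bad = 1, dull = 0)  # word counts within one class
smooth <- 1
p_w_given_c <- (counts + smooth) / (sum(counts) + smooth * length(counts))
p_w_given_c       # the unseen word "dull" now gets nonzero probability
sum(p_w_given_c)  # still sums to 1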


kbenoit commented Oct 8, 2017

Point taken on the documentation, will fix asap.

On the Bernoulli: there are actually three distributions commonly used in Naive Bayes for text: multinomial, Bernoulli, and "binary multinomial". In the issues linked below, the conditional for Bernoulli was not doing anything, and the predict method was also wrong (now fixed, see the code starting at https://github.com/kbenoit/quanteda/blob/master/R/textmodel_NB.R#L202). For fitting, the model computes probabilities based on binary occurrence, but prediction on new data needed fixing.
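To make the distinction concrete, here are the three likelihoods for a toy document written out by hand (the textbook formulas, not the package code):

# log-likelihood of one document under each distribution;
# x = the document's count vector over the vocabulary, p = P(w|c)
x <- c(good = 2, bad = 0, plot = 1)
p <- c(good = 0.5, bad = 0.3, plot = 0.2)
multinomial <- sum(x * log(p))                               # raw counts
binary_mult <- sum((x > 0) * log(p))                         # counts clipped to 0/1
bernoulli   <- sum((x > 0) * log(p) + (x == 0) * log(1 - p)) # absence also counts

The Bernoulli likelihood penalizes a class for vocabulary words that are absent from the document, which is why simply binarizing the dfm is not enough at prediction time.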

See the discussions here:

If this is not clear or you find an error, please open a new issue on the Bernoulli Naive Bayes with specifics; otherwise I will fix the documentation and close this issue.

kbenoit closed this as completed in d7fad71 on Oct 8, 2017
kbenoit commented Oct 8, 2017

I suggest that for cross-validation, you propose a desired behaviour and open it as a new issue.
