Predictions using textmodel_NB #129

kalebima · 2016-05-03T00:52:53Z

See Stack Overflow for original post

I have a dataset of BBC articles with two columns: 'category' and 'text'. I need to construct a Naive Bayes classifier that predicts the category (e.g. business, entertainment) of an article based on its text.

I'm attempting this with Quanteda and have the following code:

library(quanteda)

bbc_data <- read.csv('bbc_articles_labels_all.csv')
text <- textfile('bbc_articles_labels_all.csv', textField='text')
bbc_corpus <- corpus(text)
bbc_dfm <- dfm(bbc_corpus, ignoredFeatures = stopwords("english"), stem=TRUE)


# 80/20 split for training and test data
trainclass <- factor(c(bbc_data$category[1:1780], rep(NA, 445)))
testclass <- factor(c(bbc_data$category[1781:2225]))

bbcNb <- textmodel_NB(bbc_dfm, trainclass)
bbc_pred <- predict(bbcNb, testclass)

Here is a link to the dataset.

Ken noted that there is a bug in the predict method when k > 2 (that is, when there are more than two classes).


kbenoit · 2016-05-06T14:16:44Z

Fixed in 0.9.5-25 (2051d5c).

kbenoit · 2016-05-06T16:14:19Z

It will now work with your code, which I have tidied up:

text <- textfile('~/Downloads/bbc_articles_labels_all.csv', textField='text')
bbc_corpus <- corpus(text)
bbc_dfm <- dfm(bbc_corpus, ignoredFeatures = stopwords("english"), stem=TRUE)


# 80/20 split for training and test data
trainclass <- factor(c(docvars(bbc_corpus, "category")[1:1780], rep(NA, 445)))
testclass <- factor(c(docvars(bbc_corpus, "category")[1781:2225]))

bbcNb <- textmodel_NB(bbc_dfm, trainclass)
bbcNb

bbc_pred <- predict(bbcNb, newdata = bbc_dfm[1781:2225, ])
bbc_pred

Your call to predict used an incorrect second argument: it needs to be a dfm (for newdata), not a factor of class labels. We don't have a cross-validation function yet, but we will add one at some point.
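The split above relies on the training label vector having one entry per document, with NA marking the held-out rows. As a minimal base-R sketch of that layout, using hypothetical toy labels rather than the BBC data (the names category and n_train are assumptions for illustration):

```r
# Hypothetical toy labels standing in for docvars(bbc_corpus, "category")
category <- factor(c("business", "sport", "tech", "business", "sport"))
n_train <- 4  # first 4 documents are used for training, the rest are held out

# as.character() keeps the label text; calling c() on a factor together with
# NA can return the underlying integer codes instead of the labels
trainclass <- factor(c(as.character(category[1:n_train]),
                       rep(NA, length(category) - n_train)))

# True labels of the held-out documents, sharing the training levels
testclass <- factor(as.character(category[(n_train + 1):length(category)]),
                    levels = levels(trainclass))

length(trainclass)   # one entry per document
is.na(trainclass)    # TRUE only for the held-out rows
```

Converting to character first is a defensive choice here: it keeps the class names readable in the fitted model and in any later comparison against testclass.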

kalebima · 2016-05-09T01:09:54Z

Hey Ken,

When I executed your code I received the following error from predict():

Error in predict.textmodel_NB_fitted(bbcNb, newdata = bbc_dfm[1781:2225, :
scores must be equal in length to number of classes.

Can you confirm if you receive the same error?

Thank you,

Matt


kbenoit · 2016-05-09T09:55:06Z

I just verified that it works. Just pushed the newest build to CRAN.

packageVersion("quanteda")
## [1] ‘0.9.6.0’

kalebima · 2016-05-10T04:21:47Z

Thank you, it's working now with the latest version installed.

Would you mind shedding light on why calculating the confusion matrix as confusionMatrix(bbc_pred, testclass) gives the following error?

Error in sort.list(y) : 'x' must be atomic for 'sort.list'
Have you called 'sort' on a list?

Not sure where I'm going wrong here.

kbenoit · 2016-05-10T07:57:31Z

bbc_pred is a list; you probably want just the predicted class. That would be:

bbc_pred$nb.predicted

I will be adding accessor functions for these very soon, similar to coef(lm.class.object).

Ken
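To illustrate the point about extracting the predicted class, here is a minimal base-R sketch that builds a confusion matrix with table() instead of confusionMatrix() (presumably caret's). The list below is a hypothetical stand-in for the object returned by predict(); only the element name nb.predicted comes from this thread, and the labels are toy data.

```r
# Hypothetical stand-in for the list returned by predict(); only the
# nb.predicted element name is taken from the thread
bbc_pred_toy <- list(nb.predicted = c("business", "sport", "sport", "tech"))

# Hypothetical true labels for the same four documents
actual <- c("business", "sport", "tech", "tech")

# Use a shared set of levels so the table is square even if a class
# is never predicted
lev <- sort(unique(c(bbc_pred_toy$nb.predicted, actual)))

# Pass the atomic vector inside the list, not the list itself
conf_mat <- table(predicted = factor(bbc_pred_toy$nb.predicted, levels = lev),
                  actual    = factor(actual, levels = lev))
conf_mat

# Overall accuracy: correct predictions lie on the diagonal
accuracy <- sum(diag(conf_mat)) / sum(conf_mat)
accuracy
```

With the real objects, the same idea would be to pass bbc_pred$nb.predicted rather than bbc_pred, after checking that its labels are coded the same way as testclass.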


kbenoit added the bug label May 3, 2016

kbenoit self-assigned this May 3, 2016

kbenoit closed this as completed May 6, 2016
