Predictions using textmodel_NB #129

Closed
kalebima opened this issue May 3, 2016 · 6 comments

kalebima commented May 3, 2016

See Stack Overflow for original post

I have a dataset of BBC articles with two columns: 'category' and 'text'. I need to construct a Naive Bayes classifier that predicts the category (e.g. business, entertainment) of an article based on its text.

I'm attempting this with Quanteda and have the following code:

library(quanteda)

bbc_data <- read.csv('bbc_articles_labels_all.csv')
text <- textfile('bbc_articles_labels_all.csv', textField='text')
bbc_corpus <- corpus(text)
bbc_dfm <- dfm(bbc_corpus, ignoredFeatures = stopwords("english"), stem=TRUE)


#80/20 split for training and test data
trainclass <- factor(c(bbc_data$category[1:1780], rep(NA, 445)))
testclass <- factor(c(bbc_data$category[1781:2225]))

bbcNb <- textmodel_NB(bbc_dfm, trainclass)
bbc_pred <- predict(bbcNb, testclass)

Here is a link to the dataset.

Ken noted that there is a bug in the predict method when there are more than two classes (k > 2).

kbenoit added the bug label May 3, 2016
kbenoit self-assigned this May 3, 2016
Collaborator

kbenoit commented May 6, 2016

Fixed in 0.9.5-25 (2051d5c).

kbenoit closed this as completed May 6, 2016

kbenoit commented May 6, 2016

It will now work with your code, which I have tidied up:

text <- textfile('~/Downloads/bbc_articles_labels_all.csv', textField='text')
bbc_corpus <- corpus(text)
bbc_dfm <- dfm(bbc_corpus, ignoredFeatures = stopwords("english"), stem=TRUE)


# 80/20 split for training and test data
trainclass <- factor(c(docvars(bbc_corpus, "category")[1:1780], rep(NA, 445)))
testclass <- factor(c(docvars(bbc_corpus, "category")[1781:2225]))

bbcNb <- textmodel_NB(bbc_dfm, trainclass)
bbcNb

bbc_pred <- predict(bbcNb, newdata = bbc_dfm[1781:2225, ])
bbc_pred

Your call to predict used an incorrect second argument: it needs to be a dfm (for newdata), not a factor of class labels. At some point we will add a cross-validation function, but we don't have one yet.
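The hold-out pattern used in the code above (class labels for the training documents, NA for the rest, then prediction on the held-out rows) can be sketched in base R without the dataset. This is only a minimal sketch on a made-up label vector: `labels`, `n_train`, and `test_idx` are hypothetical names, the categories are invented, and nothing here calls quanteda.

```r
# Toy stand-in for docvars(bbc_corpus, "category"); the values are made up.
labels <- factor(c("business", "sport", "tech", "business", "sport",
                   "tech", "business", "sport", "tech", "business"))

n <- length(labels)
n_train <- floor(0.8 * n)       # 80/20 split, as in the thread
test_idx <- (n_train + 1):n     # rows held out for testing

# Training labels: same length as the full document set,
# with NA for the held-out rows.
trainclass <- labels
trainclass[test_idx] <- NA

# True labels for the held-out rows, kept for later comparison.
testclass <- labels[test_idx]

print(trainclass)
print(testclass)
```

Assigning NA into a copy of the vector keeps it a factor with its original level names. The same `test_idx` can then be used to select the held-out rows of the dfm for newdata.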

Author

kalebima commented May 9, 2016

Hey Ken,

When I executed your code I received the following error from predict():

Error in predict.textmodel_NB_fitted(bbcNb, newdata = bbc_dfm[1781:2225, :
scores must be equal in length to number of classes.

Can you confirm if you receive the same error?

Thank you,

Matt



kbenoit commented May 9, 2016

I just verified that it works. Just pushed the newest build to CRAN.

packageVersion("quanteda")
## [1] ‘0.9.6.0’

@kalebima
Author

Thank you, it's working now with the latest version installed.

Would you mind shedding light on why calculating the confusion matrix as confusionMatrix(bbc_pred, testclass) gives the following error?

Error in sort.list(y) : 'x' must be atomic for 'sort.list'
Have you called 'sort' on a list?

Not sure where I'm going wrong here.


kbenoit commented May 10, 2016

bbc_pred is a list; you probably want just the predicted class. That would be:

bbc_pred$nb.predicted

I will be adding accessor functions for these very soon, similar to coef(lm.class.object).
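Once the predicted classes are extracted from the list, a confusion matrix can be built from an atomic vector. Here is a minimal sketch, assuming a list shaped like bbc_pred with an nb.predicted element as described above. The `toy_pred` and `toy_truth` objects and their values are made up for illustration, and base R's table() is used in place of confusionMatrix().

```r
# Hypothetical stand-in for bbc_pred: a list whose nb.predicted element
# holds one predicted class per test document.
toy_pred <- list(nb.predicted = c("tech", "business", "sport", "sport"))

# Hypothetical true labels for the same four test documents.
toy_truth <- factor(c("tech", "business", "sport", "tech"))

# Put the predictions on the same levels as the truth so the table is square.
predicted <- factor(toy_pred$nb.predicted, levels = levels(toy_truth))

# Rows are predicted classes, columns are true classes.
conf_mat <- table(predicted = predicted, actual = toy_truth)
print(conf_mat)

# Overall accuracy: share of documents on the diagonal.
accuracy <- sum(diag(conf_mat)) / sum(conf_mat)
print(accuracy)
```

With the objects from this thread, the corresponding step would be to pass bbc_pred$nb.predicted rather than bbc_pred itself; the two inputs may also need matching factor levels before they can be compared.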

Ken

