Predicted output coming wrong for textmodel_nb prediction #476

Closed
vibhutittu opened this issue Jan 18, 2017 · 3 comments

vibhutittu commented Jan 18, 2017

I am working on naive Bayes text classification with your package, using textmodel_NB, for which you fixed an issue with wrong priors under prior = "docfreq". The priors now come out correct, but the predictions also need fixing. Please see the example below, taken from your package:

library(quanteda)
trainingset <- as.dfm(matrix(c(1, 2, 0, 0, 0, 0,
                               0, 2, 0, 0, 1, 0,
                               0, 1, 0, 1, 0, 0,
                               0, 1, 1, 0, 0, 1,
                               0, 3, 1, 0, 0, 1), 
                             ncol=6, nrow=5, byrow=TRUE,
                             dimnames = list(docs = paste("d", 1:5, sep = ""),
                                             features = c("Beijing", "Chinese",  "Japan", "Macao", 
                                                          "Shanghai", "Tokyo"))))
trainingclass <- factor(c("Y", "Y", "Y", "N", NA), ordered = TRUE)
## replicate IIR p261 prediction for test set (document 5)
(nb.p261 <- textmodel_NB(trainingset, trainingclass,prior="docfreq"))
predict(nb.p261, newdata = trainingset[5, ])

Output:

Fitted Naive Bayes model:
Call:
	textmodel_NB(x = trainingset, y = trainingclass, prior = "docfreq")


Training classes and priors:
   N    Y 
0.25 0.75 

		  Likelihoods:		Class Posteriors:
6 x 4 Matrix of class "dgeMatrix"
                  Y         N          Y         N
Beijing  0.14285714 0.1111111 0.30000000 0.7000000
Chinese  0.42857143 0.2222222 0.39130435 0.6086957
Japan    0.07142857 0.2222222 0.09677419 0.9032258
Macao    0.14285714 0.1111111 0.30000000 0.7000000
Shanghai 0.14285714 0.1111111 0.30000000 0.7000000
Tokyo    0.07142857 0.2222222 0.09677419 0.9032258

This part is correct now.

Predicted textmodel of type: Naive Bayes

       lp(N)     lp(Y)     Pr(N)  Pr(Y) Predicted
d5 -9.206303 -7.808069    0.1981 0.8019         N

The prediction should be Y, since Pr(Y) > Pr(N), but it is giving N.

Please fix it to get the correct predictions.

Collaborator

kbenoit commented Jan 18, 2017

Thanks! Should be ok now.

Note that there was a worse bug: when Pc (the class priors) was wrongly ordered, it also affected the computation of the fitted likelihoods. That's all corrected now, and the output matches the textbook example.

> predict(textmodel_NB(trainingset, trainingclass, prior = "docfreq"))
Predicted textmodel of type: Naive Bayes

       lp(Y)     lp(N)     Pr(Y)  Pr(N) Predicted
d1 -3.928188 -6.591674    0.9348 0.0652         Y
d2 -3.928188 -6.591674    0.9348 0.0652         Y
d3 -3.080890 -5.087596    0.8815 0.1185         Y
d4 -6.413095 -5.898527    0.3741 0.6259         N
d5 -8.107690 -8.906681    0.6898 0.3102         Y
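
For anyone who wants to check the d5 row by hand, here is a minimal base-R sketch that does not use quanteda at all. It assumes add-one (Laplace) smoothing and document-frequency priors, which is what reproduces the numbers above; the variable names (`counts`, `classes`, `lik`, `lp`) are made up for illustration and are not part of the package.

```r
# Hand check of the d5 prediction using base R only (no quanteda).
# Assumptions: add-one smoothing and document-frequency priors.

# Term counts for the four labelled training documents d1..d4.
counts <- matrix(c(1, 2, 0, 0, 0, 0,
                   0, 2, 0, 0, 1, 0,
                   0, 1, 0, 1, 0, 0,
                   0, 1, 1, 0, 0, 1),
                 nrow = 4, byrow = TRUE,
                 dimnames = list(paste0("d", 1:4),
                                 c("Beijing", "Chinese", "Japan",
                                   "Macao", "Shanghai", "Tokyo")))
classes <- c("Y", "Y", "Y", "N")

# The unlabelled test document d5.
d5 <- c(Beijing = 0, Chinese = 3, Japan = 1, Macao = 0, Shanghai = 0, Tokyo = 1)

# Document-frequency priors: share of training documents per class.
prior <- table(classes) / length(classes)

# Per-class term totals (rows sorted as "N", "Y"), then smoothed likelihoods.
class_counts <- rowsum(counts, classes)
lik <- (class_counts + 1) / (rowSums(class_counts) + ncol(counts))

# Log prior plus the count-weighted log likelihoods, keyed by class name
# so that priors and likelihoods cannot end up in different orders.
lp <- log(as.numeric(prior[rownames(lik)])) + as.numeric(log(lik) %*% d5)
names(lp) <- rownames(lik)

# Normalise to posterior probabilities and pick the most probable class.
posterior <- exp(lp - max(lp)) / sum(exp(lp - max(lp)))
predicted <- names(which.max(lp))

print(round(lp, 6))
print(round(posterior, 4))
print(predicted)
```

Indexing the priors by class name rather than by position is the point of the sketch: pairing the Y likelihoods with the N prior (and vice versa) gives -9.206303 and -7.808069, the values in the original report.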

Author

vibhutittu commented Jan 19, 2017 via email

Collaborator

kbenoit commented Jan 19, 2017

Thanks! Would love to have you describe your experience through feedback in issue #461.
