You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Attempting to run NDPmix() on a real data example following as closely as possible the Sin() wave example in the tutorial, with a single standardized predictor, sampling fails at intermediate times during sampling, usually soon after the burnin, with error
Error in sample.int(length(x), size, replace, prob) :
NA in probability vector
The data set has multiple repeated values in the x column (a years variable rounded to the nearest integer), but the same error occurs even if a jitter is added to x (or to y). The error also occurs under a range of parameter settings for init_k. I had conjectured that it might have something to do with the input data format, as the tutorial code with the simulated data does not create errors, but the data has no NAs and identical type to the simulated data. There may be some other feature of the data set that makes this happen, perhaps leading to incompatibility with the model: data and code are linked below.
Looking at the code for NDPmix(), my conjecture is that what is happening is that somewhere in the birth-death process for new clusters, somehow the last cluster gets deleted, leading to an empty value for class_names, a value of K equal to 0, and so an undefined sampling probability. I have not inspected the algorithm closely enough to be confident in this conjecture, or to evaluate whether, if it is correct, there could be a way to avoid it while maintaining the correctness of the algorithm, possibly by checking the length of the list of clusters at intermediate steps.
For completeness, I include the full code including data cleaning and formatting code up to the error below: the data set is Angrist and Krueger (1991) available at Angrist's website.
library(haven) #Open Stata files in R
library(ChiRP)
NEW7080<-read_dta("NEW7080.dta")
colnames(NEW7080)<-c("AGE","AGEQ","v3","EDUC","ENOCENT","ESOCENT","v7","v8","LWKLYWGE","MARRIED","MIDATL","MT","NEWENG","v14","v15","CENSUS","v17","QOB","RACE","SMSA","SOATL","v22","v23","WNOCENT","WSOCENT","v26","YOB")
#Split dataset to training and test
subsetsize<-500
trainfraction<-0.6
set.seed(1)
ids<-sample(1:length(NEW7080$EDUC), size = subsetsize, replace = F)
trainids<-ids[1:(trainfractionsubsetsize)]
testids<-ids[(trainfractionsubsetsize+1):subsetsize]
trainset<-NEW7080[trainids,]
testset<-NEW7080[testids,]
trainset2<-data.frame(x=trainset$EDUC,y=trainset$LWKLYWGE)
set.seed(7)
jitsize<-0.05 #enough to avoid integers but small enough to avoid misclassification
trainset2$x<-trainset2$x+jitsize*rnorm(length(trainset2$x))
trainset2$x<-scale(trainset2$x) #Demeans and standardizes
testset2<-data.frame(x=testset$EDUC) #,y=testset$LWKLYWGE)
testset2$x<-scale(testset2$x)
set.seed(160)
NDP_res<-NDPMix(d_train = trainset2, d_test = testset2,
formula = y ~ x,
burnin=3000, iter = 5000,
phi_y = c(5,10), beta_var_scale = 10000,
init_k = 10, mu_scale = 2, tau_scale = .001)
The text was updated successfully, but these errors were encountered:
Thanks for reporting this David! There must be some edge case in the cluster book-keeping that hasn't been coded around as you suggest. I'll look into this when I get a chance - but it might not be for quite some time (end of semester rush and all).
Attempting to run
NDPmix()
on a real data example following as closely as possible the Sin() wave example in the tutorial, with a single standardized predictor, sampling fails at intermediate times during sampling, usually soon after the burnin, with errorThe data set has multiple repeated values in the
x
column (a years variable rounded to the nearest integer), but the same error occurs even if a jitter is added to x (or to y). The error also occurs under a range of parameter settings forinit_k
. I had conjectured that it might have something to do with the input data format, as the tutorial code with the simulated data does not create errors, but the data has no NAs and identical type to the simulated data. There may be some other feature of the data set that makes this happen, perhaps leading to incompatibility with the model: data and code are linked below.Looking at the code for
NDPmix()
, my conjecture is that what is happening is that somewhere in the birth-death process for new clusters, somehow the last cluster gets deleted, leading to an empty value forclass_names
, a value ofK
equal to 0, and so an undefined sampling probability. I have not inspected the algorithm closely enough to be confident in this conjecture, or to evaluate whether, if it is correct, there could be a way to avoid it while maintaining the correctness of the algorithm, possibly by checking the length of the list of clusters at intermediate steps.For completeness, I include the full code including data cleaning and formatting code up to the error below: the data set is Angrist and Krueger (1991) available at Angrist's website.
The text was updated successfully, but these errors were encountered: