Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

NDPMix() sampler appears to reach NA-valued probabilities #14

Open
donskerclass opened this issue Apr 22, 2023 · 1 comment
Open

NDPMix() sampler appears to reach NA-valued probabilities #14

donskerclass opened this issue Apr 22, 2023 · 1 comment

Comments

@donskerclass
Copy link

Attempting to run NDPmix() on a real data example following as closely as possible the Sin() wave example in the tutorial, with a single standardized predictor, sampling fails at intermediate times during sampling, usually soon after the burnin, with error

Error in sample.int(length(x), size, replace, prob) :
NA in probability vector

The data set has multiple repeated values in the x column (a years variable rounded to the nearest integer), but the same error occurs even if a jitter is added to x (or to y). The error also occurs under a range of parameter settings for init_k. I had conjectured that it might have something to do with the input data format, as the tutorial code with the simulated data does not create errors, but the data has no NAs and identical type to the simulated data. There may be some other feature of the data set that makes this happen, perhaps leading to incompatibility with the model: data and code are linked below.

Looking at the code for NDPmix(), my conjecture is that what is happening is that somewhere in the birth-death process for new clusters, somehow the last cluster gets deleted, leading to an empty value for class_names, a value of K equal to 0, and so an undefined sampling probability. I have not inspected the algorithm closely enough to be confident in this conjecture, or to evaluate whether, if it is correct, there could be a way to avoid it while maintaining the correctness of the algorithm, possibly by checking the length of the list of clusters at intermediate steps.

For completeness, I include the full code including data cleaning and formatting code up to the error below: the data set is Angrist and Krueger (1991) available at Angrist's website.

library(haven) #Open Stata files in R
library(ChiRP)
NEW7080<-read_dta("NEW7080.dta")
colnames(NEW7080)<-c("AGE","AGEQ","v3","EDUC","ENOCENT","ESOCENT","v7","v8","LWKLYWGE","MARRIED","MIDATL","MT","NEWENG","v14","v15","CENSUS","v17","QOB","RACE","SMSA","SOATL","v22","v23","WNOCENT","WSOCENT","v26","YOB")
#Split dataset to training and test
subsetsize<-500
trainfraction<-0.6
set.seed(1)
ids<-sample(1:length(NEW7080$EDUC), size = subsetsize, replace = F)
trainids<-ids[1:(trainfractionsubsetsize)]
testids<-ids[(trainfraction
subsetsize+1):subsetsize]
trainset<-NEW7080[trainids,]
testset<-NEW7080[testids,]
trainset2<-data.frame(x=trainset$EDUC,y=trainset$LWKLYWGE)
set.seed(7)
jitsize<-0.05 #enough to avoid integers but small enough to avoid misclassification
trainset2$x<-trainset2$x+jitsize*rnorm(length(trainset2$x))
trainset2$x<-scale(trainset2$x) #Demeans and standardizes
testset2<-data.frame(x=testset$EDUC) #,y=testset$LWKLYWGE)
testset2$x<-scale(testset2$x)
set.seed(160)
NDP_res<-NDPMix(d_train = trainset2, d_test = testset2,
formula = y ~ x,
burnin=3000, iter = 5000,
phi_y = c(5,10), beta_var_scale = 10000,
init_k = 10, mu_scale = 2, tau_scale = .001)

@stablemarkets
Copy link
Owner

Thanks for reporting this David! There must be some edge case in the cluster book-keeping that hasn't been coded around as you suggest. I'll look into this when I get a chance - but it might not be for quite some time (end of semester rush and all).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants