NDPMix() sampler appears to reach NA-valued probabilities #14

donskerclass · 2023-04-22T18:07:32Z

Attempting to run NDPmix() on a real data example following as closely as possible the Sin() wave example in the tutorial, with a single standardized predictor, sampling fails at intermediate times during sampling, usually soon after the burnin, with error

Error in sample.int(length(x), size, replace, prob) :
NA in probability vector

The data set has multiple repeated values in the x column (a years variable rounded to the nearest integer), but the same error occurs even if a jitter is added to x (or to y). The error also occurs under a range of parameter settings for init_k. I had conjectured that it might have something to do with the input data format, as the tutorial code with the simulated data does not create errors, but the data has no NAs and identical type to the simulated data. There may be some other feature of the data set that makes this happen, perhaps leading to incompatibility with the model: data and code are linked below.

Looking at the code for NDPmix(), my conjecture is that what is happening is that somewhere in the birth-death process for new clusters, somehow the last cluster gets deleted, leading to an empty value for class_names, a value of K equal to 0, and so an undefined sampling probability. I have not inspected the algorithm closely enough to be confident in this conjecture, or to evaluate whether, if it is correct, there could be a way to avoid it while maintaining the correctness of the algorithm, possibly by checking the length of the list of clusters at intermediate steps.

For completeness, I include the full code including data cleaning and formatting code up to the error below: the data set is Angrist and Krueger (1991) available at Angrist's website.

library(haven) #Open Stata files in R
library(ChiRP)
NEW7080<-read_dta("NEW7080.dta")
colnames(NEW7080)<-c("AGE","AGEQ","v3","EDUC","ENOCENT","ESOCENT","v7","v8","LWKLYWGE","MARRIED","MIDATL","MT","NEWENG","v14","v15","CENSUS","v17","QOB","RACE","SMSA","SOATL","v22","v23","WNOCENT","WSOCENT","v26","YOB")
#Split dataset to training and test
subsetsize<-500
trainfraction<-0.6
set.seed(1)
ids<-sample(1:length(NEW7080$EDUC), size = subsetsize, replace = F)
trainids<-ids[1:(trainfractionsubsetsize)]
testids<-ids[(trainfractionsubsetsize+1):subsetsize]
trainset<-NEW7080[trainids,]
testset<-NEW7080[testids,]
trainset2<-data.frame(x=trainset$EDUC,y=trainset$LWKLYWGE)
set.seed(7)
jitsize<-0.05 #enough to avoid integers but small enough to avoid misclassification
trainset2$x<-trainset2$x+jitsize*rnorm(length(trainset2$x))
trainset2$x<-scale(trainset2$x) #Demeans and standardizes
testset2<-data.frame(x=testset$EDUC) #,y=testset$LWKLYWGE)
testset2$x<-scale(testset2$x)
set.seed(160)
NDP_res<-NDPMix(d_train = trainset2, d_test = testset2,
formula = y ~ x,
burnin=3000, iter = 5000,
phi_y = c(5,10), beta_var_scale = 10000,
init_k = 10, mu_scale = 2, tau_scale = .001)

The text was updated successfully, but these errors were encountered:

stablemarkets · 2023-04-22T18:45:17Z

Thanks for reporting this David! There must be some edge case in the cluster book-keeping that hasn't been coded around as you suggest. I'll look into this when I get a chance - but it might not be for quite some time (end of semester rush and all).

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

NDPMix() sampler appears to reach NA-valued probabilities #14

NDPMix() sampler appears to reach NA-valued probabilities #14

donskerclass commented Apr 22, 2023

stablemarkets commented Apr 22, 2023

NDPMix() sampler appears to reach NA-valued probabilities #14

NDPMix() sampler appears to reach NA-valued probabilities #14

Comments

donskerclass commented Apr 22, 2023

stablemarkets commented Apr 22, 2023