FlexCoDE

Implements FlexCoDE (Flexible Conditional Density Estimation) in R.

FlexCoDE is a general-purpose methodology for converting any conditional mean point estimator of $z$ into a conditional *density* estimator $f(z \vert x)$, where $x$ represents the covariates. The key idea is to expand the unknown function $f(z \vert x)$ in an orthonormal basis $\{\phi_i(z)\}_{i}$:

$$f(z \vert x)=\sum_{i}\beta_{i}(x)\phi_i(z).$$

By the orthogonality of the basis, the expansion coefficients are just conditional means,

$$\beta_{i}(x) = \mathbb{E}\left[\phi_i(z) \vert x\right] \equiv \int f(z \vert x) \phi_i(z)\, dz,$$

so each coefficient can be estimated from data by regressing $\phi_i(z)$ on $x$ with any regression method.
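
To make the connection concrete, here is a minimal sketch of this idea (illustrative only, not the package internals; the cosine basis, the simulated data, and the choice of smoothing splines as the regression method are all assumptions for the example):

# Estimate the coefficient function beta_1(x) = E[phi_1(z) | x] by
# regressing phi_1(z) on x (illustrative sketch, not FlexCoDE internals)
set.seed(1)
x <- runif(500)                 # covariate
z <- rbeta(500, 1 + 5 * x, 2)   # response on [0,1], depends on x

phi <- function(i, z) sqrt(2) * cos(i * pi * z)  # cosine basis on [0,1]

beta1_hat <- smooth.spline(x, phi(1, z))         # any regression works here
predict(beta1_hat, x = seq(0, 1, length.out = 5))$y  # estimated beta_1 on a grid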

More on FlexCoDE: Izbicki, R. and Lee, A. B. (2017). Converting High-Dimensional Regression to High-Dimensional Conditional Density Estimation. Electronic Journal of Statistics.

To install the package, run

# install.packages("devtools")
devtools::install_github("rizbicki/FlexCoDE")

A simple example:

library(FlexCoDE)

# generate data
set.seed(1) # for reproducibility
n=1000
d=10
data=matrix(NA,n,d+1)
data[,1:d]=matrix(rnorm(n*d),n,d)
data[,d+1]=data[,1]+rnorm(n,0,0.1)

# determine sample sizes
nTrain=round(0.7*n)
nValidation=round(0.25*n)
nTest=n-nTrain-nValidation

# split data
randomIndex=sample(1:n)
xTrain=data[randomIndex[1:nTrain],1:d]
xValidation=data[randomIndex[(nTrain+1):(nTrain+nValidation)],1:d]
xTest=data[randomIndex[(nTrain+nValidation+1):n],1:d]
zTrain=data[randomIndex[1:nTrain],d+1]
zValidation=data[randomIndex[(nTrain+1):(nTrain+nValidation)],d+1]
zTest=data[randomIndex[(nTrain+nValidation+1):n],d+1]

# Fit nearest neighbors FlexCoDE
fit=fitFlexCoDE(xTrain,zTrain,xValidation,zValidation,xTest,zTest,
            nIMax = 20,regressionFunction = regressionFunction.NN)
fit$estimatedRisk # estimated risk (loss) on the test set
print(fit)
plot(fit,xTest,zTest)

# Plot estimated curves evaluated at new test points and
# compare them with the true conditional density
predictedValues=predict(fit,xTest,B=500)
# predictedValues$z is the grid of z values; each row of predictedValues$CDE
# is the estimated density for one test point
plot(predictedValues$z,predictedValues$CDE[1,],type="l")
lines(predictedValues$z,dnorm(predictedValues$z,xTest[1,1],0.1),col=2)

# Fit sparse additive FlexCoDE
fit=fitFlexCoDE(xTrain,zTrain,xValidation,zValidation,xTest,zTest,
            nIMax = 30,regressionFunction = regressionFunction.SpAM)
fit$estimatedRisk
print(fit)
plot(fit,xTest,zTest)

# Plot estimated curves evaluated at new test points and
# compare them with the true conditional density
predictedValues=predict(fit,xTest,B=500)
plot(predictedValues$z,predictedValues$CDE[1,],type="l")
lines(predictedValues$z,dnorm(predictedValues$z,xTest[1,1],0.1),col=2)


# Fit sparse additive FlexCoDE using 4 cores (parallel computing)
fit=fitFlexCoDE(xTrain,zTrain,xValidation,zValidation,xTest,zTest,
            nIMax = 30,regressionFunction = regressionFunction.SpAM,
            regressionFunction.extra=list(nCores=4))
fit$estimatedRisk
print(fit)
plot(fit,xTest,zTest)

FlexZBoost

FlexZBoost is a particular realization of FlexCoDE that uses XGBoost for the regression, as it scales well to massive data. FlexZBoost has an additional tuning parameter: an exponent $\alpha$ that is used to sharpen the computed density estimates $\widehat{f}(z \vert x)$ according to $\widetilde{f}(z \vert x) \propto \widehat{f}(z \vert x)^\alpha$.
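
For intuition, here is a minimal stand-alone sketch of the sharpening step (an assumed illustration, not part of the package's API): raise the estimate to the power $\alpha$, then renormalize on an evenly spaced grid of $z$ values.

# Hypothetical helper illustrating alpha-sharpening (not FlexCoDE's API);
# assumes f_hat is evaluated on an evenly spaced z grid
sharpen <- function(f_hat, z_grid, alpha) {
  f_tilde <- f_hat^alpha
  f_tilde / (sum(f_tilde) * diff(z_grid)[1])  # renormalize to integrate to one
}

# alpha > 1 concentrates the density around its modes; alpha < 1 flattens it
z_grid <- seq(0, 1, length.out = 200)
f_sharp <- sharpen(dbeta(z_grid, 2, 5), z_grid, alpha = 2)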

A simple example of fitting FlexZBoost to the data generated above:

fit=flexZBoost(xTrain,zTrain,xValidation,zValidation,xTest,zTest,
            nIMax = 30)
fit$bestAlpha
fit$estimatedRisk
plot(fit,xTest,zTest)

An example with the Buzzard data

If you are a member of the LSST DESC, you will find information about and pointers to Data Challenge 1 (DC1) data at https://confluence.slac.stanford.edu/pages/viewpage.action?spaceKey=LSSTDESC&title=DC1+resources

library(dplyr)
data.spec=read.table("buzzard_spec_witherrors_mass.txt",header=TRUE)

data.spec.redshift=data.spec$redshift
# compute additional covariates (colors and their propagated errors):
data.spec.cov=data.spec %>% mutate(ug = u-g, gr = g-r, ri = r-i, iz = i-z, zy = z-y,
                                   ug.err = sqrt(u.err^2+g.err^2),
                                   gr.err = sqrt(g.err^2+r.err^2),
                                   ri.err = sqrt(r.err^2+i.err^2),
                                   iz.err = sqrt(i.err^2+z.err^2),
                                   zy.err = sqrt(z.err^2+y.err^2))
data.spec.cov=data.spec.cov %>% select(-redshift)

data.photo=read.table("buzzard_phot_witherrors_mass.txt",header=TRUE)
# compute additional covariates (colors and their propagated errors):
data.photo.cov=data.photo %>% mutate(ug = u-g, gr = g-r, ri = r-i, iz = i-z, zy = z-y,
                                     ug.err = sqrt(u.err^2+g.err^2),
                                     gr.err = sqrt(g.err^2+r.err^2),
                                     ri.err = sqrt(r.err^2+i.err^2),
                                     iz.err = sqrt(i.err^2+z.err^2),
                                     zy.err = sqrt(z.err^2+y.err^2))
data.photo.redshift=read.table("data/buzzard_truth.txt",header=TRUE)[,1]

# determine sample sizes: 80% for training, the rest for validation
n=nrow(data.spec.cov)
nTrain=round(0.8*n)
# split data
randomIndex=sample(1:n)
xTrain=data.spec.cov[randomIndex[1:nTrain],]
xValidation=data.spec.cov[-randomIndex[1:nTrain],]
zTrain=data.spec.redshift[randomIndex[1:nTrain]]
zValidation=data.spec.redshift[-randomIndex[1:nTrain]]

fit=flexZBoost(xTrain=xTrain,zTrain=zTrain,
               xValidation=xValidation,zValidation=zValidation,
               xTest=data.photo.cov,
               zTest=data.photo.redshift,
               nIMax = 40,
               regressionFunction.extra=list(nCores=5,ninter=2000))
fit$estimatedRisk
plot(fit,data.photo.cov,data.photo.redshift)