# Introduction

We will be using Google Colab for this example on density regression using semi-parametric quantile regression (SPQR). First, go to https://colab.research.google.com/. Click File -> New notebook in Drive, and then change the runtime to R (Runtime -> Change runtime type, then pick R in the dropdown). We will not be using GPUs, so keep the CPU box checked.

# Installation
To install `keras3`, run the following code. Colab already has Python and TensorFlow installed, so we do not need to do anything particularly complicated here.

In [None]:
remotes::install_github("rstudio/tensorflow")
install.packages(c("keras3","splines2"))
library(keras3)

Set seed for reproducibility.

In [2]:
set_random_seed(1)

## The model

We will assume that for a covariate vector $X$ (which could be multivariate) and a univariate response $Y$, $$Y_i\vert X_i \sim Normal(\mu(X_i),\sigma(X_i))$$

and estimate the conditional density of $Y\vert X$. Here $\sigma(X_i)$ is the conditional standard deviation. We first demonstrate this using some simulated data, and then use the weather data we used for SPQR.

## Generating data

We generate 10000 data points from the following data generating process:
$$X_{1i}\sim Beta(3,2), X_{2i}\sim Beta(2,5)$$
$$Y_i\vert X_{1i},X_{2i} \sim Normal(X_{1i}^2 - 3X_{2i} + 5,2X_{1i})$$

We also generate 1000 test points.


In [3]:
x_train <- cbind(rbeta(10000,3,2), rbeta(10000,2,5))
y_train <- rnorm(10000,mean = x_train[,1]^2-3*x_train[,2]+5,sd = 2*x_train[,1])
x_test <- cbind(rbeta(1000,3,2), rbeta(1000,2,5))
y_test <- rnorm(1000,mean = x_test[,1]^2-3*x_test[,2]+5,sd = 2*x_test[,1])

Let's visualize this real quick.

In [None]:
par(mfrow=c(1,3))
plot(x_train[,1],y_train)
plot(x_train[,2],y_train)
plot(density(y_train))
par(mfrow=c(1,1))

The mean is non-linear in $X_1$ and the spread of $Y$ changes with $X_1$, so I don't think linear regression would be a good idea here. Quantile regression might do better, but we will estimate the conditional density instead.

## Define the keras model

This is no longer a sequential model, because we need to run our 2 outputs (the conditional mean and SD) through different activation functions. We employ what is called the functional API for `keras3`. A detailed guide on how it differs from the sequential models is found in the official docs at <https://keras3.posit.co/articles/functional_api.html>.

In short, things don't need to be in sequence and can have different branches, where we can do different things to different branches. Complex architectures will often be built using the functional API.

In [5]:
input1 <- keras_input(shape=dim(x_train)[2], name = 'covariates')
x_1 <- layer_dense(input1, units = 12, activation = 'relu')

x_2 <- layer_dense(x_1, 12, activation = 'relu')

mu <- layer_dense(x_2, 1, activation = 'linear', name = "mean")
sig <- layer_dense(x_2, 1, activation = 'exponential', name = "sd")

out_concat <- layer_concatenate(mu,sig)
out_concat <- layer_identity(out_concat, name='params')
model <- keras_model(inputs = list(input1),
                     outputs = out_concat, name = "norm_dist")

### The loss function

The output layer of the MLP above is not a point prediction of $Y$, but instead the conditional mean and SD of a Normal distribution fitted to $Y\vert X$. We will now do MLE. For this we need the log-likelihood of the Normal distribution. In particular, since this needs to be a loss function that is minimized, we have the negative log-likelihood.
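
As a reminder of where the loss in the next cell comes from, here is the standard derivation written in the notation used above. For a single observation, the Normal log-density is

$$\log f(y_i\vert x_i) = -\frac{1}{2}\log(2\pi) - \log\sigma(x_i) - \frac{1}{2}\left(\frac{y_i-\mu(x_i)}{\sigma(x_i)}\right)^2$$

The first term does not depend on the network outputs, so it can be dropped. Negating and summing over observations gives the quantity to minimize:

$$\sum_i \left[\log\sigma(x_i) + \frac{1}{2}\left(\frac{y_i-\mu(x_i)}{\sigma(x_i)}\right)^2\right]$$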

In [7]:
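nloglik_loss_normal <- function(y_true, y_pred){
    # Column 1 of the model output is the mean, column 2 is the SD
    mu <- y_pred[,1]
    sig <- y_pred[,2]
    # Negative Normal log-likelihood summed over the batch,
    # dropping the constant 0.5*log(2*pi)
    nll <- op_sum(op_log(sig) + 0.5*((y_true-mu)/sig)^2)
    return(nll)
}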
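
As a quick sanity check on the formula, here is a minimal base-R sketch (no `keras3` needed) that compares the same expression against `dnorm()`. The function name `nll_normal_r` and the toy numbers are made up for illustration; the two results should differ only by the constant that the loss drops.

```r
# Plain-R version of the same formula; the name is illustrative only.
nll_normal_r <- function(y, mu, sig) {
  sum(log(sig) + 0.5 * ((y - mu) / sig)^2)
}

# Small made-up inputs (fixed values, so no random seed is touched)
y   <- c(0.3, 2.1, 4.0, 1.7, -0.5)
mu  <- rep(2, 5)
sig <- rep(1.5, 5)

ours  <- nll_normal_r(y, mu, sig)
exact <- -sum(dnorm(y, mean = mu, sd = sig, log = TRUE))

# Adding back the dropped constant n * 0.5 * log(2 * pi) should recover
# the exact negative log-likelihood
all.equal(ours + length(y) * 0.5 * log(2 * pi), exact)
```

Because the dropped term is constant in `mu` and `sig`, minimizing either version gives the same fitted parameters.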

I now compile and run the model. I have very few bells and whistles in my model specification; I am sure this can be improved with some effort.

In [9]:
model |> compile(
    loss = nloglik_loss_normal,
    optimizer = optimizer_adam(learning_rate=0.01)
)

history <- model |> fit(
    x= x_train,
    y=y_train,
    epochs = 50,
    batch_size = 128,
    verbose=0,
    callbacks=list(callback_early_stopping(monitor = "val_loss",
                                                     min_delta = 0, patience = 10)),
    validation_split = 0.2
)

Print the model summary and plot the training history:

In [None]:
model
plot(history)

It seems to have converged pretty quickly. Even though the problem is non-linear, it is fairly straightforward by deep learning standards.

## Predictions

Since I know what the actual means and SDs were for this, we can directly compare them with the predictions.

In [None]:
preds <- as.matrix(model(x_test))
true_mean <- x_test[,1]^2-3*x_test[,2]+5
true_sd <- 2*x_test[,1]

par(mfrow=c(1,2))
plot(preds[,1],true_mean)
abline(0,1)

plot(preds[,2],true_sd)
abline(0,1)
par(mfrow=c(1,1))
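
The scatter plots above can be complemented with a couple of numbers. The following is a minimal sketch, not part of the original analysis: `compare_to_truth` is a hypothetical helper, and the toy vectors exist only so the block runs on its own.

```r
# Hypothetical helper: root-mean-squared error and correlation between
# predictions and the corresponding true values.
compare_to_truth <- function(pred, truth) {
  c(rmse = sqrt(mean((pred - truth)^2)),
    cor  = cor(pred, truth))
}

# Toy inputs for illustration only
toy_truth <- c(1, 2, 3, 4, 5)
toy_pred  <- c(1.1, 1.9, 3.2, 3.8, 5.0)
toy_summary <- compare_to_truth(toy_pred, toy_truth)
toy_summary
```

In the notebook, the analogous calls would be `compare_to_truth(preds[,1], true_mean)` and `compare_to_truth(preds[,2], true_sd)`.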

# Fitting a distribution for `tmax|pr`

I'm going to use the same data I used for the SPQR example. I'll even keep the same model I used with the simulated data. The only change is that I add an intercept column to the precipitation data so that there are 2 covariates, which is what the model expects. It also led to a better-fitting and more stable model when I checked.

In [None]:
file_url <- "https://github.com/reetamm/AI4stats/blob/main/weather.RDS?raw=true"
weather <- readRDS(url(file_url))
weather <- weather[weather$month==7 & weather$loc==1,]
head(weather)

mnth <- weather$month
tmax <- weather$tmax - 273 #convert tmax to celsius
pr <- log(weather$pr + 0.0001) #convert pr to log-scale
plot(tmax,pr,pch=20)
n_total <- length(tmax)
train_ind <- sample(1:n_total,ceiling(0.8*n_total)) #80% training data

tmax_range <- range(tmax)
y1 <- tmax
X <- cbind(1,pr)

# train and test data
y1_train <- y1[train_ind]
y1_test <- y1[-train_ind]

# For conditional density of tmax with intercept and log-pr
X2_train <- X[train_ind,1:2]
X2_test <- X[-train_ind,1:2]

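We now fit the model with this new dataset. Since this is the same `model` object, training continues from the weights learned on the simulated data.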

In [None]:
history <- model |> fit(
    x= X2_train,
    y=y1_train,
    epochs = 50,
    batch_size = 128,
    verbose = 0,
    callbacks=list(callback_early_stopping(monitor = "val_loss",
                                           min_delta = 0, patience = 10)),
    validation_split = 0.2
)
plot(history)

## Predictions
For each data point, we predict a mean and SD. There aren't any `true` values to compare against, but we can nevertheless check goodness of fit. I'll do that for the first 6 observations by simulating from each fitted Normal distribution and marking the observed `tmax` with a vertical line.

In [None]:
preds <- as.matrix(model(X2_test))
head(y1_test)
head(preds)
y_pred <- matrix(NA,1000,6)
for(i in 1:6)
    y_pred[,i] <- rnorm(1000,preds[i,1],preds[i,2])
par(mfrow=c(2,3))
for(i in 1:6){
    prcp <- round(exp(X2_test[i,2]) - 0.0001,2)
    plot(density(y_pred[,i]),main=paste0('prcp = ',prcp))
    abline(v=y1_test[i])
}
par(mfrow=c(1,1))
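
The density plots above look at six observations one at a time. One possible way to summarize fit over the whole test set, offered here as a suggestion rather than something from the original analysis, is the probability integral transform (PIT): if the fitted Normal distributions are well calibrated, `pnorm(y, mean, sd)` should look roughly Uniform(0,1). The helper name `pit_normal` and the simulated inputs are illustrative assumptions.

```r
# Hypothetical helper: PIT values under fitted Normal(mean, sd) distributions.
pit_normal <- function(y, mean, sd) pnorm(y, mean = mean, sd = sd)

# Simulated stand-in so the block runs on its own. The "fitted" parameters
# here are the true ones, so the PIT values should look roughly uniform.
toy_mean <- seq(20, 30, length.out = 500)
toy_sd   <- rep(2, 500)
toy_y    <- rnorm(500, toy_mean, toy_sd)

u <- pit_normal(toy_y, toy_mean, toy_sd)
hist(u, breaks = 20, main = "PIT histogram", xlab = "PIT value")
```

In the notebook, the analogous call would be `pit_normal(y1_test, preds[,1], preds[,2])`. A histogram that piles up near 0 and 1 would suggest the fitted SDs are too small, while a hump in the middle would suggest they are too large.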