# Finite mixture models

## Principle

Energy signature models offer only a coarse disaggregation of energy use into three components: heating, cooling, and baseline consumption. Furthermore, they rely on very long sampling times and cannot predict sub-daily consumption profiles. Finite Mixture Models (FMM) are one way to take the disaggregation of the baseline energy consumption further. Their most common special case is the Gaussian Mixture Model (GMM).

Finite mixture models assume that the outcome $y$ is drawn from one of several distributions, the identity of which is controlled by a categorical mixing distribution.\cite{stan_guide} For instance, the mixture of $K$ normal distributions $f$ with locations $\mu_k$ and scales $\sigma_k$ reads:
\begin{equation}
	p(y_i|\lambda, \mu, \sigma) = \sum_{k=1}^K \lambda_k f(y_i|\mu_k,\sigma_k)
	(\#eq:fmm)
\end{equation}
where $\lambda_k$ is the (positive) mixing proportion of the $k$th component and $\sum_{k=1}^K \lambda_k = 1$. The FMM distributes the observed values into a finite number of distributions with probability $\lambda_k$. The optimal number of components is not always a trivial choice: studies involving GMM often rely on some model selection index, such as the Bayesian Information Criterion (BIC), to guide the choice of the appropriate value for $K$. 
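
As a quick illustration of the mixture density above, here is a minimal base-R sketch. The function name `dmix` and the parameter values are made up for the example and do not come from the dataset used later. It evaluates the weighted sum of normal densities and checks numerically that the result integrates to one.

```r
# Density of a K-component normal mixture: sum_k lambda_k * N(y | mu_k, sigma_k)
dmix <- function(y, lambda, mu, sigma) {
  stopifnot(length(lambda) == length(mu), length(mu) == length(sigma),
            all(lambda >= 0), abs(sum(lambda) - 1) < 1e-8)
  # One column per component, then sum across components for each y
  dens <- sapply(seq_along(lambda),
                 function(k) lambda[k] * dnorm(y, mean = mu[k], sd = sigma[k]))
  rowSums(matrix(dens, nrow = length(y)))
}

# Illustrative (hypothetical) parameters for K = 2
lambda <- c(0.3, 0.7)
mu     <- c(2, 6)
sigma  <- c(0.5, 1.5)

dmix(c(2, 6), lambda, mu, sigma)

# A valid density should integrate to (approximately) 1
total_mass <- integrate(function(y) dmix(y, lambda, mu, sigma), -Inf, Inf)$value
total_mass
```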

The dependency of the observations $y$ on explanatory variables $x$ can be included in the FMM by formulating its parameters $\left\{ \lambda_k(x), \mu_k(x), \sigma_k(x) \right\}$ as functions of these regressors. Furthermore, to account for different energy demand behaviours, the mixture probabilities $\lambda_k$ can be modelled as dependent on a categorical variable $z$. Finite mixture models thus offer considerable flexibility for disaggregating and predicting energy use, while including the possible effects of continuous or discrete explanatory variables.
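
One possible way to make the mixing proportions depend on a regressor is sketched below for $K = 2$. The logistic link and the coefficients `a` and `b` are assumptions chosen for illustration only; the text above does not prescribe any particular functional form.

```r
# Hypothetical logistic link: lambda_1(x) = plogis(a + b * x), lambda_2(x) = 1 - lambda_1(x)
mixing_proportions <- function(x, a, b) {
  lambda1 <- plogis(a + b * x)
  cbind(lambda1 = lambda1, lambda2 = 1 - lambda1)
}

x   <- c(-5, 0, 5)                       # made-up regressor values
lam <- mixing_proportions(x, a = 0, b = 1)
lam                                      # each row sums to 1
```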

## Example

This example uses a data file provided [in the book's repository](https://github.com/srouchier/buildingenergygeeks/tree/master/data). The tutorial below is written in **R** and uses [Stan](https://mc-stan.org/). Unsurprisingly, the Stan user's guide also has [a chapter on finite mixtures](https://mc-stan.org/docs/2_27/stan-users-guide/mixture-modeling-chapter.html).

In [1]:
library(rstan)
library(tidyverse)

df <- read_csv("data/mixture.csv")
summary(df)
nrow(df)

Loading required package: ggplot2
Loading required package: StanHeaders
rstan (Version 2.17.3, GitRev: 2e1f913d3ca3)
For execution on a local, multicore CPU with excess RAM we recommend calling
options(mc.cores = parallel::detectCores()).
To avoid recompilation of unchanged Stan programs, we recommend calling
rstan_options(auto_write = TRUE)
── Attaching packages ─────────────────────────────────────── tidyverse 1.2.1 ──
✔ tibble  1.4.2     ✔ purrr   0.2.5
✔ tidyr   0.8.1     ✔ dplyr   0.7.6
✔ readr   1.1.1     ✔ stringr 1.2.0
✔ tibble  1.4.2     ✔ forcats 0.3.0
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ tidyr::extract() masks rstan::extract()
✖ dplyr::filter()  masks stats::filter()
✖ dplyr::lag()     masks stats::lag()
Parsed with column specification:
cols(
  mean_consumption = col_double(),
  ratio_house = col_double(),
  ratio_apartment = col_double()
)


 mean_consumption  ratio_house     ratio_apartment 
 Min.   : 0.770   Min.   :0.0000   Min.   :0.0000  
 1st Qu.: 3.753   1st Qu.:0.2006   1st Qu.:0.1181  
 Median : 5.146   Median :0.6346   Median :0.3654  
 Mean   : 5.499   Mean   :0.5523   Mean   :0.4477  
 3rd Qu.: 6.939   3rd Qu.:0.8819   3rd Qu.:0.7994  
 Max.   :41.805   Max.   :1.0000   Max.   :1.0000  

This data file is an excerpt of an energy consumption census in France. Each row represents an "area" of about 2,000 residents. The available data are the mean residential energy consumption in each area, and the ratios of houses and apartments.

On average, we expect a house to have a higher energy consumption than an apartment: it is larger, has more residents, and has a larger envelope area through which heat is lost. Therefore, we can expect areas with more houses to have a higher mean consumption than areas with more apartments.

Let us look at a pairplot of the three features.

In [2]:
library(GGally)
ggpairs(df, columns=c("mean_consumption", "ratio_house", "ratio_apartment"))

There is indeed *some* correlation between the ratio of houses in each area and the mean consumption. The density of `ratio_house` is strongly bimodal, and the density of `mean_consumption` looks like it could be split into two distributions as well.

We can now try to translate our assumptions into a simple mixture of two distributions. Equation \@ref(eq:fmm) can be formulated as such:
\begin{equation}
p(y_t | \lambda, \mu, \sigma) = \lambda_t f\left(y_t | \mu_1, \sigma_1 \right) + (1-\lambda_t) f\left(y_t | \mu_2, \sigma_2 \right) (\#eq:fmm2)
\end{equation}
where, for each data point $t$,

* $y_t$ is the dependent variable `mean_consumption`.
* $\lambda_t$ is the explanatory variable `ratio_house`. Since the two ratios sum to one in each area, $1-\lambda_t$ is `ratio_apartment`.
* $f$ is a type of continuous probability distribution. It can be Normal, Gamma, LogNormal, etc.
* $\mu$ and $\sigma$ are the parameters of the distribution $f$ that we will choose. The indices $1$ and $2$ denote each of the two mixture components.
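
To make the likelihood in Equation \@ref(eq:fmm2) concrete, the following base-R sketch evaluates its logarithm for a LogNormal choice of $f$. It sums the log of each component's weighted density with the log-sum-exp trick for numerical stability. The helper names and the three data points are hypothetical and only serve the illustration.

```r
# Numerically stable log(sum(exp(v)))
log_sum_exp <- function(v) {
  m <- max(v)
  if (!is.finite(m)) return(m)
  m + log(sum(exp(v - m)))
}

# Log-likelihood of a K-component lognormal mixture with known mixing ratios l (N x K)
mixture_loglik <- function(y, l, mu, sigma) {
  l <- as.matrix(l)
  total <- 0
  for (n in seq_along(y)) {
    # log(lambda_k) + log f(y_n | mu_k, sigma_k) for every component k
    lps <- log(l[n, ]) + dlnorm(y[n], meanlog = mu, sdlog = sigma, log = TRUE)
    total <- total + log_sum_exp(lps)
  }
  total
}

# Made-up example: three areas, ratios of houses and apartments in columns
y_demo <- c(2, 5, 8)
l_demo <- cbind(c(1, 0.5, 0), c(0, 0.5, 1))
ll <- mixture_loglik(y_demo, l_demo, mu = c(1.8, 1.4), sigma = c(0.3, 0.4))
ll
```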

This is a Stan mixture model with any number `K` of components:

In [None]:
mixture <- "
data {
  // This block declares all data which will be passed to the Stan model.
  int<lower=0> N;       // number of data items in the training dataset
  int<lower=0> K;       // number of components
  real y[N];            // outcome energy vector
  real l[N, K];         // ratios in the training dataset
}
parameters {
  // This block declares the parameters of the model.
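  vector[K] mu;               // log-scale location of each component
  vector<lower=0>[K] sigma;   // log-scale spread of each component (must be positive)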
}
model {
  for (n in 1:N) {
    vector[K] lps;
    for (k in 1:K) {
      lps[k] = log(l[n, k]) + lognormal_lpdf(y[n] | mu[k], sigma[k]);
    }
    target += log_sum_exp(lps);
  }
}
"

We can separate the data into a training set and a test set like so. The following block allocates 75% of the data to the training set:

In [None]:
# set.seed(12345)  # this is optional but ensures that results are reproducible
train_ind <- sample(seq_len(nrow(df)), size = floor(0.75 * nrow(df)))

train <- df[train_ind, ]
test <- df[-train_ind, ]

The next step maps the data to the Stan model and runs the MCMC algorithm.

In [None]:
model_data <- list(
  N = nrow(train),
  N_test = nrow(test),
  K = 2,
  y = train$mean_consumption,
  l = as.matrix(train %>% select(ratio_house, ratio_apartment))  # N x K matrix of ratios
)

# Fit the model
fit1 <- stan(
  model_code = mixture,     # Stan program
  data = model_data,        # named list of data
  chains = 2,               # number of Markov chains. 4 is better, 2 is faster
  warmup = 1000,            # number of warmup iterations per chain
  iter = 4000,              # total number of iterations per chain
  cores = 2                 # number of cores (could use one per chain)
)

Let us now display the results of the fitting:

In [None]:
print(fit1, pars=c("mu", "sigma", "lp__"))
traceplot(fit1, pars=c("mu", "sigma", "lp__"))
pairs(fit1, pars=c("mu", "sigma", "lp__"))

It looks like we can be satisfied with the MCMC convergence: `n_eff` is high enough and `Rhat` close to 1 for all parameters, and all chains seem stationary. The last step is to predict values of the mean consumption of each area in the test data set. We calculate this prediction from the ratios of houses and apartments, and from the mean estimated values of the distributions in the mixture model.

In [None]:
# Extracting distribution parameters from the fit object
la <- rstan::extract(fit1, permuted = TRUE)
mu <- colMeans(la$mu)
sigma <- colMeans(la$sigma)

# Predict the consumption of the test data from the ratios
test$y <- test$ratio_house * rlnorm(nrow(test), mu[1], sigma[1]) +
  test$ratio_apartment * rlnorm(nrow(test), mu[2], sigma[2])

# Plot to compare measured and predicted consumption on the test data
ggplot(data=test) +
  geom_histogram(mapping=aes(x=mean_consumption), bins=50, fill='blue', alpha=0.3) +
  geom_histogram(mapping=aes(x=y), bins=50, fill='red', alpha=0.3)
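
The prediction above combines one draw from each component, weighted by the ratios. An alternative that follows the mixture formulation of Equation \@ref(eq:fmm2) more literally is to first pick a component for each area with probability `ratio_house`, then draw from that component only. The sketch below is one way this could be done; the function name `predict_mixture` and the demonstration values are hypothetical.

```r
# Draw one value per area from a two-component lognormal mixture:
# component 1 with probability ratio_house, component 2 otherwise.
predict_mixture <- function(ratio_house, mu, sigma) {
  n <- length(ratio_house)
  z <- ifelse(runif(n) < ratio_house, 1, 2)   # sampled component label
  rlnorm(n, meanlog = mu[z], sdlog = sigma[z])
}

# Stand-alone demonstration with made-up ratios and parameters
set.seed(42)
ratio_demo <- c(0, 0.25, 0.5, 1)
y_demo_pred <- predict_mixture(ratio_demo, mu = c(1.8, 1.4), sigma = c(0.3, 0.4))
y_demo_pred

# With the objects defined above, one would call:
# test$y_mix <- predict_mixture(test$ratio_house, mu, sigma)
```

Which of the two predictions is more appropriate depends on whether an area's mean consumption is better viewed as a weighted blend of the two housing types or as a draw from one of two populations.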