# Exercise 5 - Multivariate Gaussians

In this exercise, we will estimate a Gaussian from a dataset and answer inference queries using the mean- and canonical parameterizations. Runtime experiments will illustrate the importance of both parameterizations.

In the event of a persistent problem, do not hesitate to contact the course instructors at
- paul.kahlmeyer@uni-jena.de

### Submission

- Deadline of submission: 04.12.2022
- Submission on [moodle page](https://moodle.uni-jena.de/course/view.php?id=34630)

### Help
In case you cannot solve a task, you can use the saved values within the `help` directory to continue working on the other tasks:
- Load arrays with [Numpy](https://numpy.org/doc/stable/reference/generated/numpy.load.html)
```
import numpy as np
np.load('help/array_name.npy')
```
- Load functions with [Dill](https://dill.readthedocs.io/en/latest/dill.html)
```
import dill
with open('help/some_func.pkl', 'rb') as f:
    func = dill.load(f)
```

# Dataset

In this exercise, we will use a dataset used for [predicting wine quality](https://www.kaggle.com/uciml/red-wine-quality-cortez-et-al-2009).

You find this dataset stored as `dataset.csv`. 

### Task 1
Read this dataset into a $1599\times 12$ matrix.

Each row represents one specific wine, each column corresponds to a measured attribute.

In [1]:
import pandas as pd
import numpy as np
pd.read_csv('dataset.csv')

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality
0,7.4,0.700,0.00,1.9,0.076,11.0,34.0,0.99780,3.51,0.56,9.4,5
1,7.8,0.880,0.00,2.6,0.098,25.0,67.0,0.99680,3.20,0.68,9.8,5
2,7.8,0.760,0.04,2.3,0.092,15.0,54.0,0.99700,3.26,0.65,9.8,5
3,11.2,0.280,0.56,1.9,0.075,17.0,60.0,0.99800,3.16,0.58,9.8,6
4,7.4,0.700,0.00,1.9,0.076,11.0,34.0,0.99780,3.51,0.56,9.4,5
...,...,...,...,...,...,...,...,...,...,...,...,...
1594,6.2,0.600,0.08,2.0,0.090,32.0,44.0,0.99490,3.45,0.58,10.5,5
1595,5.9,0.550,0.10,2.2,0.062,39.0,51.0,0.99512,3.52,0.76,11.2,6
1596,6.3,0.510,0.13,2.3,0.076,29.0,40.0,0.99574,3.42,0.75,11.0,6
1597,5.9,0.645,0.12,2.0,0.075,32.0,44.0,0.99547,3.57,0.71,10.2,5


In [2]:
X = pd.read_csv('dataset.csv').to_numpy()

# Model selection

Here we use the model assumption that the samples come from a multivariate normal distribution. 

### Task 2
Estimate the Maximum Likelihood parameters

\begin{align}
\mu_{\text{ML}} &= \frac{1}{N}\sum_{i=1}^Nx^{(i)}\\
\Sigma_{\text{ML}} &= \frac{1}{N}\sum_{i=1}^N\left(x^{(i)}-\mu_{\text{ML}}\right)\left(x^{(i)}-\mu_{\text{ML}}\right)^T
\end{align}

for a multivariate normal distribution based on this dataset. Here $N$ is the number of samples and $x^{(i)}$ is the i-th sample.

In [3]:
N = X.shape[0]
sample_mean = (1/N) * np.sum(X, axis=0)
M = X.shape[1]
sample_cov = np.zeros((M,M))
for i in range(N):
    sample_cov += np.outer(X[i]-sample_mean, X[i]-sample_mean)
sample_cov = sample_cov / N
sample_cov

array([[ 3.02952057e+00, -7.98014785e-02,  2.27677527e-01,
         2.81580055e-01,  7.67389030e-03, -2.79916982e+00,
        -6.47829186e+00,  2.19385070e-03, -1.83470891e-01,
         5.39763142e-02, -1.14349595e-01,  1.74314505e-01],
       [-7.98014785e-02,  3.20423261e-02, -1.92595685e-02,
         4.83888167e-04,  5.16263851e-04, -1.96612867e-02,
         4.50144000e-01,  7.43900996e-06,  6.49063758e-03,
        -7.91647985e-03, -3.85760812e-02, -5.64405638e-02],
       [ 2.27677527e-01, -1.92595685e-02,  3.79237511e-02,
         3.94096081e-02,  1.86755609e-03, -1.24174408e-01,
         2.27554874e-01,  1.34090669e-04, -1.62873900e-02,
         1.03212557e-02,  2.28009045e-02,  3.55896216e-02],
       [ 2.81580055e-01,  4.83888167e-04,  3.94096081e-02,
         1.98665392e+00,  3.68786810e-03,  2.75688624e+00,
         9.41055252e+00,  9.44819611e-04, -1.86326290e-02,
         1.32011525e-03,  6.31794232e-02,  1.56252677e-02],
       [ 7.67389030e-03,  5.16263851e-04,  1.8675560
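
The explicit loop over outer products can be cross-checked against a vectorized form. The following is a minimal sketch on synthetic data; `X_demo` and `ml_estimate` are illustrative names that are not part of the exercise. It assumes that rows are samples and compares the result with `np.cov` called with `bias=True`, which normalizes by $N$ instead of $N-1$.

```python
import numpy as np

def ml_estimate(X):
    """Return the ML mean and covariance (normalized by N) for row-wise samples."""
    N = X.shape[0]
    mu = X.mean(axis=0)
    centered = X - mu                   # subtract the mean from every sample
    sigma = centered.T @ centered / N   # sum of outer products, divided by N
    return mu, sigma

# Synthetic stand-in for the dataset (assumption: 500 samples, 4 attributes)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(500, 4))

mu_demo, sigma_demo = ml_estimate(X_demo)
print(np.allclose(sigma_demo, np.cov(X_demo, rowvar=False, bias=True)))
```

The matrix product `centered.T @ centered` computes the same sum of outer products as the loop, but in a single call.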

# Inference

Now that we have estimated the parameters of our underlying model, we want to perform inference in order to answer the query:

**"What quality and alcohol level can we expect, if we observe a wine with**
- **citric acid level of 0.6,**
- **residual sugar of 2.5,**
- **chlorides level of 0.1,**
- **density of 0.994,**
- **sulphate level of 0.5?"**

## Mean Parameterization

The mean parameterization of a Gaussian consists of the mean vector $\mu$ and the covariance matrix $\Sigma$.

**Marginalizing** a Gaussian so that only a subset $J$ of the dimensions is kept results in a Gaussian with 
- Mean vector $\mu_J$
- Covariance matrix $\Sigma_{JJ}$

**Conditioning** a subset $J$ of the dimensions on values $x_J$ also gives us a Gaussian with 
- Mean vector $\mu_I+\Sigma_{IJ}\Sigma_{JJ}^{-1}(x_J-\mu_J)$ 
- Covariance matrix $S_{II} = \Sigma_{II}-\Sigma_{IJ}\Sigma_{JJ}^{-1}\Sigma_{JI}$

Here, the subscripts indicate the selected dimensions of the variables. $I$ denotes the remaining dimensions, after we condition on the dimensions $J$. $S$ denotes the Schur complement.
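
To make the conditioning formulas concrete, here is a small worked sketch on a two-dimensional toy Gaussian. The function name `condition_mean` and the toy values are assumptions chosen for illustration. The sketch uses `np.linalg.solve` instead of an explicit inverse, which is one possible design choice and not required by the exercise.

```python
import numpy as np

def condition_mean(mu, sigma, idx_J, x_J):
    """Condition N(mu, sigma) on x[idx_J] = x_J and return the parameters of the remaining dims."""
    idx_J = np.asarray(idx_J)
    idx_I = np.setdiff1d(np.arange(len(mu)), idx_J)   # remaining dimensions
    sigma_ij = sigma[np.ix_(idx_I, idx_J)]
    sigma_jj = sigma[np.ix_(idx_J, idx_J)]
    # Solve linear systems with sigma_jj instead of forming its inverse explicitly
    mu_cond = mu[idx_I] + sigma_ij @ np.linalg.solve(sigma_jj, np.asarray(x_J) - mu[idx_J])
    # sigma_ij.T equals sigma_ji because sigma is symmetric
    sigma_cond = sigma[np.ix_(idx_I, idx_I)] - sigma_ij @ np.linalg.solve(sigma_jj, sigma_ij.T)
    return mu_cond, sigma_cond

mu_toy = np.array([0.0, 1.0])
sigma_toy = np.array([[2.0, 1.0],
                      [1.0, 2.0]])

# Condition on the second dimension taking the value 3.0
mu_c, sigma_c = condition_mean(mu_toy, sigma_toy, idx_J=[1], x_J=[3.0])
print(mu_c, sigma_c)
```

By hand, the conditional mean is $0 + 1\cdot\frac{1}{2}\cdot(3-1) = 1$ and the conditional variance is $2 - 1\cdot\frac{1}{2}\cdot 1 = 1.5$.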

### Task 3
Implement the following class of a Gaussian with mean parameterization. Then use your implementation to answer the query.

Note: `marginalize` and `condition` should not return any parameters, but update the internal parameters.

In [4]:
def myindex(arr, idx_I, idx_J=[]): # Leaves only the indices idx_I and idx_J in array arr
    if arr.ndim > 2: return arr
    if arr.ndim == 1: return arr[idx_I]
    if arr.ndim == 2: return arr[idx_I,:][:,idx_J]

In [5]:
class MeanGaussian():
    def __init__(self, mu, sigma):
        '''
        Mean parameterization of a gaussian
        
        @Params: 
            mu... vector of size ndims
            sigma... matrix of size ndims x ndims
        '''
        
        self.mu = mu
        self.sigma = sigma
        
    def marginalize(self,idx_J):
        '''
        Marginalizes a set of indices from the Gaussian.
    
        @Params:
            idx_J... list of indices to keep after marginalization (these indices remain)
            
        @Returns:
            Nothing, parameters are changed internally
        '''
        self.mu = self.mu[idx_J] # mu and sigma have to be numpy arrays
        self.sigma = myindex(self.sigma, idx_J, idx_J)
        
    
    def condition(self,idx_J, x_J):
        '''
        Conditions a set of indices on values.
        
        @Params:
            idx_J... list of indices that are conditioned on
            x_J... values that are conditioned on
            
        @Returns:
            Nothing, parameters are changed internally
        '''
        idx_I = np.array([idx for idx in range(len(self.mu)) if idx not in idx_J])
        sigma_ii = myindex(self.sigma, idx_I, idx_I)
        sigma_ij = myindex(self.sigma, idx_I, idx_J)
        sigma_ji = myindex(self.sigma, idx_J, idx_I)
        sigma_jj = myindex(self.sigma, idx_J, idx_J)
        self.mu = self.mu[idx_I] + sigma_ij @ np.linalg.inv(sigma_jj) @ (x_J - self.mu[idx_J])
        self.sigma = sigma_ii - sigma_ij @ np.linalg.inv(sigma_jj) @ sigma_ji

distribution = MeanGaussian(sample_mean, sample_cov)
idx_J = [2,3,4,7,9]
x_J = [0.6, 2.5, 0.1, 0.994, 0.5]
distribution.condition(idx_J, x_J)
distribution.marginalize([5,6])
distribution.mu
# mu now holds the expected values of alcohol and quality

array([11.78770805,  6.10114144])

## Canonical Parameterization

The canonical parameterization $(\nu,\Lambda)$ results from the mean parameterization through

\begin{align}
\nu &=\Sigma^{-1}\mu\\
\Lambda &= \Sigma^{-1}
\end{align}


In the canonical parameterization, **marginalizing** dimensions from a Gaussian, to keep a subset $J$ of the dimensions, results in a Gaussian with 
- Vector $\nu_J-\Lambda_{JI}\Lambda_{II}^{-1}\nu_I$
- Precision matrix $S_{JJ}=\Lambda_{JJ}-\Lambda_{JI}\Lambda_{II}^{-1}\Lambda_{IJ}$

**Conditioning** a subset $J$ of the dimensions on values $x_J$ again gives us a Gaussian with 
- Vector $\nu_I-\Lambda_{IJ}x_J$
- Precision matrix $\Lambda_{II}$

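The subscripts indicate the selected dimensions of the variables. For marginalization, $J$ denotes the dimensions that are kept and $I$ the dimensions that are marginalized out. For conditioning, $J$ denotes the dimensions that are conditioned on and $I$ the remaining dimensions. $S$ denotes the Schur complement.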

We shall see later that in some cases the canonical parameterization is preferable to the mean parameterization.
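
The conversion and the canonical conditioning rule can be tried on a two-dimensional toy Gaussian. This is a minimal sketch with made-up values; all variable names are illustrative. Applying the mean-parameterization formulas by hand to the same toy Gaussian gives a conditional mean of $0 + 1\cdot\frac{1}{2}\cdot(3-1) = 1$ and a conditional variance of $2-\frac{1}{2} = 1.5$, so converting the canonical result back should reproduce these numbers.

```python
import numpy as np

mu_toy = np.array([0.0, 1.0])
sigma_toy = np.array([[2.0, 1.0],
                      [1.0, 2.0]])

# Convert to canonical parameters: Lambda = Sigma^{-1}, nu = Lambda mu
lamb_toy = np.linalg.inv(sigma_toy)
nu_toy = lamb_toy @ mu_toy

# Condition on dimension 1 taking the value 3.0:
# this needs only slicing and one matrix-vector product
idx_I, idx_J = [0], [1]
x_J = np.array([3.0])
nu_cond = nu_toy[idx_I] - lamb_toy[np.ix_(idx_I, idx_J)] @ x_J
lamb_cond = lamb_toy[np.ix_(idx_I, idx_I)]

# Convert back to the mean parameterization to read off mean and covariance
sigma_cond = np.linalg.inv(lamb_cond)
mu_cond = sigma_cond @ nu_cond
print(mu_cond, sigma_cond)
```

Conditioning itself needs no inversion here. The only inverses are in the conversions between the two parameterizations.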

### Task 4
Implement the following class of a Gaussian with canonical parameterization. Then use your implementation to answer the query.

Note: `marginalize` and `condition` should not return any parameters, but update the internal parameters.
The solution should be the same as in Task 3.

In [8]:
class CanonicalGaussian():
    def __init__(self, nu, lamb):
        '''
        Canonical representation of a Gaussian
        
        @Params: 
            nu... vector of size ndims
            lamb... matrix of size ndims x ndims (precision matrix)
        '''
        
        self.nu = nu
        self.lamb = lamb
        
    def marginalize(self,idx_J):
        '''
        Marginalizes a set of indices from the Gaussian.
    
        @Params:
            idx_J... list of indices to keep after marginalization (these indices remain)
            
        @Returns:
            Nothing, parameters are changed internally
        '''
        
        idx_I = np.array([idx for idx in range(len(self.nu)) if idx not in idx_J])
        lambda_ii = myindex(self.lamb, idx_I, idx_I)
        lambda_ij = myindex(self.lamb, idx_I, idx_J)
        lambda_ji = myindex(self.lamb, idx_J, idx_I)
        lambda_jj = myindex(self.lamb, idx_J, idx_J)
        self.nu = self.nu[idx_J] - lambda_ji @ np.linalg.inv(lambda_ii) @ self.nu[idx_I]
        self.lamb = lambda_jj - lambda_ji @ np.linalg.inv(lambda_ii) @ lambda_ij
        
    
    def condition(self,idx_J, x_J):
        '''
        Conditions a set of indices on values.
        
        @Params:
            idx_J... list of indices that are conditioned on
            x_J... values that are conditioned on
            
        @Returns:
            Nothing, parameters are changed internally
        '''
        idx_I = np.array([idx for idx in range(len(self.nu)) if idx not in idx_J])
        lambda_ii = myindex(self.lamb, idx_I, idx_I)
        lambda_ij = myindex(self.lamb, idx_I, idx_J)
        self.nu = self.nu[idx_I] - lambda_ij @ x_J
        self.lamb = lambda_ii
    
distribution = CanonicalGaussian(np.linalg.inv(sample_cov) @ sample_mean, np.linalg.inv(sample_cov))
idx_J = [2,3,4,7,9]
x_J = [0.6, 2.5, 0.1, 0.994, 0.5]
distribution.condition(idx_J, x_J)
distribution.marginalize([5,6])

# nu = inv(sigma) @ mu
# sigma @ nu = mu
# inv(lambda) @ nu = mu
np.linalg.inv(distribution.lamb) @ distribution.nu

array([11.78770805,  6.10114144])

# Computational costs

Why do we need two different parameterizations of the same probability distribution?
What is the difference?

We cannot observe the effect of a different parameterization on our dataset, as it is far too small (too few dimensions).

In the `synthetic/` directory, you find parameters for a Gaussian with 300 dimensions, as well as a value vector `x` for conditioning.
Load these arrays and calculate the parameters for the canonical parameterization.

In [9]:
mu = np.load('synthetic/mu.npy')
sigma = np.load('synthetic/sigma.npy')
x = np.load('synthetic/x.npy')
lamb = np.linalg.inv(sigma)
nu = lamb @ mu

We now want to investigate the computation times for the following inference operations:

1. Marginalize out the dimensions 200-299, then condition on the dimensions 100-199 with $x$
2. Condition on the dimensions 100-199 with $x$, then marginalize out the dimensions 200-299


<div>
<img src="images/indices.png" width="700"/>
</div>

Both operations yield the same result, $p(x_0,\dots,x_{99}|x_{100},\dots,x_{199})$; they differ only in the order of marginalization and conditioning.
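
That the order does not change the result can be checked numerically on a small example. The sketch below uses a random six-dimensional Gaussian instead of the 300-dimensional one. The helper functions `marginalize` and `condition` are standalone illustrations in the mean parameterization, not the classes from Tasks 3 and 4.

```python
import numpy as np

def marginalize(mu, sigma, keep):
    """Keep only the dimensions listed in `keep`."""
    return mu[keep], sigma[np.ix_(keep, keep)]

def condition(mu, sigma, idx_J, x_J):
    """Condition on x[idx_J] = x_J and return the parameters of the remaining dims."""
    idx_I = np.setdiff1d(np.arange(len(mu)), idx_J)
    s_ij = sigma[np.ix_(idx_I, idx_J)]
    s_jj = sigma[np.ix_(idx_J, idx_J)]
    mu_c = mu[idx_I] + s_ij @ np.linalg.solve(s_jj, x_J - mu[idx_J])
    sigma_c = sigma[np.ix_(idx_I, idx_I)] - s_ij @ np.linalg.solve(s_jj, s_ij.T)
    return mu_c, sigma_c

rng = np.random.default_rng(1)
A = rng.normal(size=(6, 6))
sigma6 = A @ A.T + 6 * np.eye(6)   # symmetric positive definite by construction
mu6 = rng.normal(size=6)
x_obs = rng.normal(size=2)

# Order 1: marginalize out dims 4-5, then condition on dims 2-3
mu_a, sig_a = condition(*marginalize(mu6, sigma6, [0, 1, 2, 3]), [2, 3], x_obs)
# Order 2: condition on dims 2-3, then keep the first two of the remaining dims
mu_b, sig_b = marginalize(*condition(mu6, sigma6, [2, 3], x_obs), [0, 1])

print(np.allclose(mu_a, mu_b), np.allclose(sig_a, sig_b))
```

Note that after conditioning, the remaining dimensions are renumbered from zero. This is why order 2 keeps indices `[0, 1]` of the conditioned Gaussian.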

### Task 5
Track the computational costs for both inference operations using the mean parameters and the canonical parameters.

What do you observe? Try to find an explanation for your observations.

In [10]:
import time
meang = MeanGaussian(mu, sigma)
canonicalg = CanonicalGaussian(nu, lamb)

# 1. Marginalize first, then condition
# (perf_counter is a high-resolution clock intended for measuring short durations)
start = time.perf_counter()
meang.marginalize(list(range(200)))
meang.condition(list(range(100,200)), x)
end = time.perf_counter()
mean_1 = end-start

start = time.perf_counter()
canonicalg.marginalize(list(range(200)))
canonicalg.condition(list(range(100,200)), x)
end = time.perf_counter()
canonical_1 = end-start

In [11]:
meang = MeanGaussian(mu, sigma)
canonicalg = CanonicalGaussian(nu, lamb)

# 2. Condition first, then marginalize
# (after conditioning, the remaining dimensions are renumbered from 0)
start = time.perf_counter()
meang.condition(list(range(100,200)), x)
meang.marginalize(list(range(100)))
end = time.perf_counter()
mean_2 = end-start

start = time.perf_counter()
canonicalg.condition(list(range(100,200)), x)
canonicalg.marginalize(list(range(100)))
end = time.perf_counter()
canonical_2 = end-start

In [12]:
print('   Mean                 Canonical')
print(f'1: {mean_1} {canonical_1}')
print(f'2: {mean_2} {canonical_2}')

   Mean                 Canonical
1: 0.008557558059692383 0.01397848129272461
2: 0.013402938842773438 0.007674455642700195


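In this single run, the mean parameterization is roughly 1.6 times faster than the canonical parameterization when we marginalize first, and the canonical parameterization is roughly 1.7 times faster than the mean parameterization when we condition first.

I would think that the speed differences come from the fact that, for the mean parameterization, marginalization is a cheaper computation than conditioning, while it is the other way around for the canonical parameterization. In the mean parameterization, marginalization only selects entries of $\mu$ and $\Sigma$, whereas conditioning requires inverting $\Sigma_{JJ}$. In the canonical parameterization, conditioning only needs slicing and one matrix-vector product, whereas marginalization requires inverting $\Lambda_{II}$. When the cheap operation is done first, the expensive operation runs on the already reduced parameters, so the whole query goes faster.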
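
Single wall-clock measurements in the millisecond range are noisy, so the explanation above could be checked by timing the two building blocks separately and repeatedly. The following is a hedged sketch using `timeit.repeat` on a synthetic covariance matrix. The matrix `sigma_demo` and the functions `slice_block` and `schur_block` are assumptions for illustration and do not use the arrays from `synthetic/`.

```python
import timeit
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(300, 300))
sigma_demo = A @ A.T + 300 * np.eye(300)   # synthetic SPD stand-in for the covariance

keep = np.arange(200)
idx_I, idx_J = np.arange(100), np.arange(100, 200)

def slice_block():
    # Marginalization in the mean parameterization: pure index selection
    return sigma_demo[np.ix_(keep, keep)]

def schur_block():
    # Conditioning in the mean parameterization: requires a linear solve
    s_ij = sigma_demo[np.ix_(idx_I, idx_J)]
    s_jj = sigma_demo[np.ix_(idx_J, idx_J)]
    return sigma_demo[np.ix_(idx_I, idx_I)] - s_ij @ np.linalg.solve(s_jj, s_ij.T)

# Take the best of several repeats to reduce noise from other processes
t_slice = min(timeit.repeat(slice_block, number=20, repeat=5)) / 20
t_schur = min(timeit.repeat(schur_block, number=20, repeat=5)) / 20
print(f'slicing: {t_slice:.2e} s, Schur complement: {t_schur:.2e} s')
```

The absolute numbers depend on the machine and the linear algebra backend, so only the relative sizes of the two timings are meaningful.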