# Covariance matrix estimation in the presence of missing values

There are many strategies to estimate the covariance matrix in the presence of missing values:

* mean value imputation
* maximum likelihood imputation
* pairwise deletion 
* more complex EM-algorithm based solutions

If the data are missing completely at random (MCAR), it is reasonable to estimate the covariance matrix entry by entry: each pairwise covariance is computed after omitting the observations in which at least one of the two values is missing.

However, because each entry is then computed from a different subset of observations, the resulting estimate is not guaranteed to be positive semi-definite (PSD). A common workaround is to replace it with the nearest PSD matrix. The following notebook combines various internet sources to implement this approach.
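To see how pairwise deletion can break positive semi-definiteness, here is a minimal sketch on a made-up toy dataset (the column names and values are illustrative assumptions, not taken from the sources used in this notebook). Each pair of columns is observed on a different block of rows, and `DataFrame.corr` computes each correlation from the rows where both columns are present.

```python
import numpy as np
import pandas as pd

nan = np.nan

# Hypothetical toy data: each pair of columns is jointly observed on a different block of rows
toy = pd.DataFrame({
    "x": [1, 2, 3, nan, nan, nan, 1, 2, 3],
    "y": [1, 2, 3, 1, 2, 3, nan, nan, nan],
    "z": [nan, nan, nan, 1, 2, 3, 3, 2, 1],
})

# Each correlation is computed from the rows where both columns are present
pairwise_corr = toy.corr()

# x~y and y~z come out perfectly positively correlated, x~z perfectly negatively.
# No complete dataset can produce this combination, so one eigenvalue is negative.
min_eigenvalue = np.linalg.eigvalsh(pairwise_corr.to_numpy()).min()

print(pairwise_corr)
print("smallest eigenvalue:", min_eigenvalue)
```

The three pairwise correlations are individually valid, but taken together they do not form a valid correlation matrix.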

In [2]:
import numpy as np
import pandas as pd
import numpy.random as rnd
import scipy.stats as stats
import matplotlib.pyplot as plt
import scipy.cluster.hierarchy as hclust 
import sklearn

from pandas import Series
from pandas import DataFrame
from typing import List,Tuple

from pandas import Categorical
from pandas.api.types import CategoricalDtype

from tqdm import tnrange#, tqdm_notebook
from plotnine import *

from scipy.stats import norm
from scipy.stats import multivariate_normal
from scipy.stats import uniform 


from sklearn.cluster import AgglomerativeClustering

# Local imports
from common import *
from convenience import *

## I. Correct implementation of pairwise coefficient computation strategy

Corresponding functions exist in both the `pandas` and `numpy` packages. However, according to [this pandas issue](https://github.com/pandas-dev/pandas/issues/16837), the two implementations differ.
Hence, we are going to choose the one that is closest to the GNU R implementation. See the comments in [this Stack Overflow discussion](https://stackoverflow.com/questions/8287573/numpy-ma-cov-pairwise-correlations-for-with-missing-values).
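The difference between the two libraries can be made visible on a small example. The following is a sketch with made-up data (the values and variable names are illustrative assumptions). It only shows that `DataFrame.cov` and `np.ma.cov` can return different numbers for the same input when both columns have missing entries in different rows.

```python
import numpy as np
import pandas as pd

# Hypothetical data: each column has one missing value, in different rows
data = np.array([[1.0, 1.0],
                 [2.0, 2.0],
                 [3.0, 3.0],
                 [10.0, np.nan],
                 [np.nan, 10.0]])

# pandas: covariance computed with missing values excluded pairwise
cov_pandas = pd.DataFrame(data).cov().to_numpy()

# numpy: masked-array covariance with observations in the rows
masked = np.ma.array(data, mask=np.isnan(data))
cov_numpy = np.ma.filled(
    np.ma.cov(masked, rowvar=False, allow_masked=True).astype(float), np.nan)

print("pandas off-diagonal:", cov_pandas[0, 1])
print("numpy  off-diagonal:", cov_numpy[0, 1])
```

On data like this the diagonal entries agree while the off-diagonal entries do not, so it is worth checking which convention a downstream analysis expects.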

### Test and target values in the Stackoverflow discussion

In [3]:
test = [[np.nan, np.nan, 0.217, 0.562],
        [np.nan, np.nan, 0.217, 0.562],
        [0.269, 0.0, 0.217, 0.562],
        [np.nan, np.nan, 0.217, -0.953],
        [np.nan, np.nan, 0.217, -0.788],
        [0.75, 0.0, 0.217, 0.326],
        [0.207, 0.0, 0.217, 0.814],
        [np.nan, np.nan, 0.217, 0.562],
        [np.nan, np.nan, 0.217, -0.022],
        [np.nan, np.nan, 0.217, 0.562],
        [np.nan, np.nan, 0.217, -0.953],
        [np.nan, np.nan, 0.217, -0.953],
        [0.078, 0.0, 0.217, -0.953],
        [np.nan, np.nan, 0.217, -0.953],
        [0.078, 0.0, 0.217, 0.562]]

target = [[0.0769733, 0, 0, 0.0428294],
          [0.0000000, 0, 0, 0.0000000],
          [0.0000000, 0, 0, 0.0000000],
          [0.0428294, 0, 0, 0.5536484]]

### Corresponding code

In [4]:
maskedarr = np.ma.array(test, mask=np.isnan(test))
cov = np.ma.cov(maskedarr,rowvar=False,allow_masked=True)
display(DataFrame(np.ma.filled(cov.astype(float), np.nan)))

Unnamed: 0,0,1,2,3
0,0.0769733,0.0,5.3926040000000004e-33,0.0428294
1,0.0,0.0,0.0,0.0
2,5.3926040000000004e-33,0.0,3.301594e-33,8.804251e-34
3,0.0428294,0.0,8.804251e-34,0.5536484


### Corresponding function

In [5]:
def pairwise_cov_matrix(X:np.array)-> np.array:
    """
    Computes covariance matrix in the presence of missing values using pairwise covariance estimates.
    
    Argument:
    X – data matrix where observations are in the rows
    
    Return value:
    Covariance matrix that may contain nan values if the number of matching observations is too low
    for a particular column pair.
    
    This implementation is the closest match to GNU R function cov(X, use = "pairwise"). 
    Note that the pandas implementation of the covariance matrix is different. 
    """
    
    return np.ma.filled(np.ma.cov(np.ma.array(X, mask=np.isnan(X)),
                                  rowvar=False, allow_masked=True).astype(float), np.nan)

In [6]:
display(DataFrame(pairwise_cov_matrix(test) - target))

Unnamed: 0,0,1,2,3
0,0.0,0.0,5.3926040000000004e-33,-3.4694470000000005e-17
1,0.0,0.0,0.0,0.0
2,5.3926040000000004e-33,0.0,3.301594e-33,8.804251e-34
3,-3.4694470000000005e-17,0.0,8.804251e-34,9.52381e-09


## II. Correct implementation of the nearest PSD matrix

We take code directly from [the Stackoverflow discussion](https://stackoverflow.com/questions/10939213/how-can-i-calculate-the-nearest-positive-semi-definite-matrix) that also refers to [this blog post](http://statsadventure.blogspot.com/2011/12/non-pd-matrices-in-r-cont.html) for further details.
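As read from the functions in this section (an interpretation of the code, not a statement taken from the cited sources), the correction clips the eigenvalues from below and then rescales the rows so that the diagonal stays equal to one. The symbols $V$, $\lambda_j$, $t_i$ and $B$ are notation introduced here for illustration, and $\varepsilon$ corresponds to the `epsilon` argument:

$$
\begin{aligned}
A &= V \operatorname{diag}(\lambda_1, \dots, \lambda_n)\, V^{\top}, \qquad
\lambda_j^{+} = \max(\lambda_j, \varepsilon), \qquad
t_i = \Bigl( \sum_{j=1}^{n} V_{ij}^{2}\, \lambda_j^{+} \Bigr)^{-1}, \\
B &= \operatorname{diag}\bigl(\sqrt{t_1}, \dots, \sqrt{t_n}\bigr)\, V\, \operatorname{diag}\bigl(\sqrt{\lambda_1^{+}}, \dots, \sqrt{\lambda_n^{+}}\bigr), \qquad
\hat{A} = B B^{\top}, \\
\hat{A}_{ii} &= t_i \sum_{j=1}^{n} V_{ij}^{2}\, \lambda_j^{+} = 1 .
\end{aligned}
$$

Since $\hat{A}$ has the form $B B^{\top}$, it is positive semi-definite by construction, and the choice of $t_i$ keeps its diagonal at one.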

### Test and target values in the blog post discussion

In [7]:
test = np.array([[1.0, 0.9, 0.7],
                 [0.9, 1.0, 0.3],
                 [0.7, 0.3, 1.0]])

target = np.array([[1.0000000, 0.8940244, 0.6963191],
                   [0.8940244, 1.0000000, 0.3009690],
                   [0.6963191, 0.3009690, 1.0000000]])

### Corresponding function

In [8]:
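def near_psd_matrix_simple(A, epsilon=0):
    """Nearest PSD matrix for a correlation matrix A (eigenvalues clipped at epsilon)."""
    n = A.shape[0]
    eigval, eigvec = np.linalg.eig(A)
    # Clip the eigenvalues from below at epsilon
    val = np.matrix(np.maximum(eigval, epsilon))
    vec = np.matrix(eigvec)
    # Row-wise scaling factors that restore the unit diagonal
    T = 1/(np.multiply(vec, vec) * val.T)
    T = np.matrix(np.sqrt(np.diag(np.array(T).reshape((n)))))
    B = T * vec * np.diag(np.array(np.sqrt(val)).reshape((n)))
    out = B*B.T
    return out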

In [9]:
def near_psd_matrix(x:np.array, epsilon:float=0):
    '''
    Calculates the nearest positive semi-definite matrix for a correlation/covariance matrix

    Parameters
    ----------
    x : array_like
      Covariance/correlation matrix
    epsilon : float
      Eigenvalue limit (usually set to zero to ensure positive semi-definiteness)

    Returns
    -------
    near_cov : array_like
      closest positive semi-definite covariance/correlation matrix

    Notes
    -----
    Document source
    http://www.quarchome.org/correlationmatrix.pdf
    
    Source is directly copied from 
    https://stackoverflow.com/questions/10939213/how-can-i-calculate-the-nearest-positive-semi-definite-matrix
    '''

    if min(np.linalg.eigvals(x)) > epsilon:
        return x

    # Removing scaling factor of covariance matrix
    n = x.shape[0]
    var_list = np.array([np.sqrt(x[i,i]) for i in range(n)])
    y = np.array([[x[i, j]/(var_list[i]*var_list[j]) for i in range(n)] for j in range(n)])

    # getting the nearest correlation matrix
    eigval, eigvec = np.linalg.eig(y)
    val = np.matrix(np.maximum(eigval, epsilon))
    vec = np.matrix(eigvec)
    T = 1/(np.multiply(vec, vec) * val.T)
    T = np.matrix(np.sqrt(np.diag(np.array(T).reshape((n)) )))
    B = T * vec * np.diag(np.array(np.sqrt(val)).reshape((n)))
    near_corr = B*B.T    

    # restoring the scaling factors (back from correlation to covariance scale)
    near_cov = np.array([[near_corr[i, j]*(var_list[i]*var_list[j]) for i in range(n)] for j in range(n)])
    return near_cov

In [10]:
display(DataFrame(near_psd_matrix(test)-target))
display(DataFrame(near_psd_matrix_simple(test)-target))

Unnamed: 0,0,1,2
0,-3.330669e-16,8.508599e-09,-3.388561e-08
1,8.508599e-09,-2.220446e-16,3.610459e-08
2,-3.388561e-08,3.610459e-08,0.0


Unnamed: 0,0,1,2
0,-3.330669e-16,8.508599e-09,-3.388561e-08
1,8.508599e-09,-2.220446e-16,3.610459e-08
2,-3.388561e-08,3.610459e-08,0.0
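The agreement with the blog-post target can be cross-checked with a sketch that avoids `np.matrix`. The function name `near_corr_ndarray` is hypothetical, and `np.linalg.eigh` (which assumes a symmetric input) is swapped in for the `np.linalg.eig` call used above. The test and target values are the ones from the cell above.

```python
import numpy as np

def near_corr_ndarray(A: np.ndarray, epsilon: float = 0.0) -> np.ndarray:
    """Eigenvalue clipping with diagonal rescaling, written with plain ndarrays."""
    eigval, eigvec = np.linalg.eigh(A)          # symmetric input assumed
    clipped = np.maximum(eigval, epsilon)       # clip eigenvalues from below
    scale = 1.0 / ((eigvec ** 2) @ clipped)     # one scaling factor per row
    B = np.sqrt(scale)[:, None] * eigvec * np.sqrt(clipped)[None, :]
    return B @ B.T

test = np.array([[1.0, 0.9, 0.7],
                 [0.9, 1.0, 0.3],
                 [0.7, 0.3, 1.0]])

target = np.array([[1.0000000, 0.8940244, 0.6963191],
                   [0.8940244, 1.0000000, 0.3009690],
                   [0.6963191, 0.3009690, 1.0000000]])

result = near_corr_ndarray(test)

print("max abs deviation from target:", np.abs(result - target).max())
print("diagonal:", np.diag(result))
print("smallest eigenvalue:", np.linalg.eigvalsh(result).min())
```

Plain arrays with explicit broadcasting keep the row and column scaling visible, which makes the unit diagonal easier to verify.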


## III. Correct implementation of the covariance matrix with pairwise computation strategy

In [11]:
def pairwise_cov_matrix(X:np.array, psd_correction:bool=False, ridge_coeff:float=0)-> np.array:
    """
    Computes covariance matrix in the presence of missing values using pairwise covariance estimates.
    
    Arguments:
    X – data matrix where observations are in the rows
    psd_correction – if set, makes the covariance matrix positive semi-definite
    ridge_coeff – coefficient to be added to the main diagonal (before psd correction)
    
    Return value:
    Covariance matrix. An exception is raised if the number of matching observations is too low
    for a particular column pair, or if the matrix is rank deficient after the ridge term is added.
    If psd_correction is set, return a matrix that is positive semi-definite
    by finding the nearest PSD matrix for the original covariance matrix.  
    
    This implementation is the closest match to GNU R function cov(X, use = "pairwise"). 
    Note that the pandas implementation of the covariance matrix is different. 
    """
    
    cov_matrix = np.ma.cov(np.ma.array(X, mask=np.isnan(X)), rowvar=False, allow_masked=True)

    if np.ma.is_masked(cov_matrix):
        raise Exception("Too many missing values")
    
    cov_matrix = np.ma.filled(cov_matrix.astype(float), np.nan)
    cov_matrix = 0.5 * (cov_matrix + cov_matrix.T) 
    cov_matrix[np.diag_indices_from(cov_matrix)] += ridge_coeff

    if np.linalg.matrix_rank(cov_matrix) != cov_matrix.shape[0]:
        raise Exception("Covariance matrix is linearly dependent. Increase ridge regularisation parameter")
    
    # Apply the nearest-PSD correction only when requested
    return near_psd_matrix(cov_matrix) if psd_correction else cov_matrix

In [12]:
pairwise_cov_matrix(test, ridge_coeff=0.00001)

array([[ 0.02334333,  0.05166667, -0.03166667],
       [ 0.05166667,  0.14334333, -0.11833333],
       [-0.03166667, -0.11833333,  0.12334333]])
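The role of the ridge term in the call above can be illustrated with a sketch on the same 3×3 `test` matrix, treated as three complete observations. Because nothing is missing here, plain `np.cov` is used as a stand-in for the pairwise estimator (an assumption of this sketch), and the variable names are illustrative.

```python
import numpy as np

test = np.array([[1.0, 0.9, 0.7],
                 [0.9, 1.0, 0.3],
                 [0.7, 0.3, 1.0]])
ridge_coeff = 0.00001

# With complete data the pairwise estimate reduces to the ordinary sample covariance
cov = np.cov(test, rowvar=False)

# Three observations give a covariance matrix of rank at most two,
# so its smallest eigenvalue is zero up to rounding error
eig_plain = np.linalg.eigvalsh(cov)

# Adding ridge_coeff to the diagonal shifts every eigenvalue up by ridge_coeff
ridged = cov + ridge_coeff * np.eye(cov.shape[0])
eig_ridged = np.linalg.eigvalsh(ridged)

print("smallest eigenvalue without ridge:", eig_plain.min())
print("smallest eigenvalue with ridge:   ", eig_ridged.min())
```

A small ridge is therefore enough to pass the rank check, at the cost of a slight upward bias in the variances.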