### <span style="color:red">IMPORTANT: Only modify cells which have the following comment:</span>
```python
# Modify this cell
```
##### <span style="color:red">Do not add any new cells when you submit the homework</span>

# Setting Up Notebook

In [1]:
import findspark
findspark.init()

In [2]:
from pyspark import SparkContext
sc = SparkContext(master="local[4]")

In [3]:
import numpy as np
import math
from numpy import linalg as LA

# Exercise:
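The function **computeCov** computes the covariance matrix using RDDs. The code tolerates undefined (`NaN`) entries and computes the covariance without letting them bias the result: each entry is estimated only from the vectors in which the relevant coordinates are defined.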

Your homework is to complete the missing parts of **computeCov** (marked with `...`) so that it calculates the covariance correctly.
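
As a sketch of what the function is expected to compute (the notation here is ours, not part of the assignment): write $x_k$ for the $k$-th input vector and $\tilde{x}_k = (1, x_{k,1}, \dots, x_{k,d})$ for the same vector with a 1 prepended. The first row of the outer product $\tilde{x}_k \tilde{x}_k^\top$ then contains $x_k$ itself, and the lower-right block contains $x_k x_k^\top$. Summing only over defined entries gives

$$
\hat{\mu}_i = \frac{1}{N_i} \sum_{k \,:\, x_{k,i}\ \text{defined}} x_{k,i},
\qquad
\widehat{\mathrm{Cov}}_{ij} = \frac{1}{N_{ij}} \sum_{k \,:\, x_{k,i},\, x_{k,j}\ \text{defined}} x_{k,i}\, x_{k,j} \;-\; \hat{\mu}_i\, \hat{\mu}_j ,
$$

where $N_i$ is the number of vectors in which coordinate $i$ is defined and $N_{ij}$ is the number in which both coordinates $i$ and $j$ are defined.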

    Note: The functions and libraries in the cell below will be useful to you

In [4]:
def outerProduct(X):
    """Compute the outer product and indicate which locations in the matrix are undefined"""
    O=np.outer(X,X)
    N=1-np.isnan(O)
    return (O,N)

def sumWithNan(M1,M2):
    """Add two pairs of (matrix,count)"""
    (X1,N1)=M1
    (X2,N2)=M2
    N=N1+N2
    X=np.nansum(np.dstack((X1,X2)),axis=2)
    return (X,N)
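
The NaN handling in these helpers can be illustrated on two small matrices. This is a minimal sketch using only NumPy; the matrices `A` and `B` are made-up illustration data, not part of the assignment.

```python
import numpy as np

# Two hypothetical 2x2 matrices, each with one undefined entry.
A = np.array([[1.0, np.nan], [3.0, 4.0]])
B = np.array([[10.0, 20.0], [np.nan, 40.0]])

# 1 where an entry is defined, 0 where it is NaN (same idea as in outerProduct).
count_A = 1 - np.isnan(A)
count_B = 1 - np.isnan(B)

# Stack along a third axis and sum, treating NaN as missing (same idea as in sumWithNan).
total = np.nansum(np.dstack((A, B)), axis=2)
counts = count_A + count_B

# Dividing the sums by the counts gives the per-entry average over defined values.
average = total / counts
```

Each entry of `average` is computed only from the matrices in which that entry is defined, which is the property `computeCov` relies on.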

In [5]:
# Modify this cell

def computeCov(RDDin):
    # input: RDDin is an RDD of np arrays, all of the same length
    # we insert 1 at the beginning of each vector so the calculation also yields the mean vector
    
    RDD=RDDin.map(lambda v:np.array(np.insert(v,0,1),dtype=np.float64)) 
    # separating map and reduce does not matter, since Spark uses lazy execution.
    OuterRDD=RDD.map(lambda X:outerProduct(X))    #<-- do mapping here
    (S,N)=OuterRDD.reduce(lambda m1,m2:sumWithNan(m1,m2))   #<-- do reducing here
    
    E=S[0,1:]
    NE=np.float64(N[0,1:])

    print('shape of E=', E.shape, 'shape of NE=', NE.shape)
    Mean=E/NE
    O=S[1:,1:]
    NO=np.float64(N[1:,1:])

    Cov=O/NO-np.outer(Mean,Mean) # This is the covariance matrix
    
    # Also output the diagonal, which is the variance for each day
    Var=np.array([Cov[i,i] for i in range(Cov.shape[0])])
    return {'E':E,'NE':NE,'O':O,'NO':NO,'Cov':Cov,'Mean':Mean,'Var':Var}
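
One way to sanity-check the result on a small input, without Spark, is to compare it against a direct NumPy computation. The sketch below is an assumption-laden reference, not the assignment's tester: `reference_cov` and the sample `data` are hypothetical names and values introduced here for illustration.

```python
import numpy as np

def reference_cov(vectors):
    """Spark-free reference: NaN-aware mean vector and covariance matrix."""
    X = np.vstack(vectors).astype(np.float64)
    mean = np.nanmean(X, axis=0)          # mean over defined entries, per coordinate
    d = X.shape[1]
    cov = np.empty((d, d))
    for i in range(d):
        for j in range(d):
            prod = X[:, i] * X[:, j]      # NaN wherever either coordinate is undefined
            cov[i, j] = np.nanmean(prod) - mean[i] * mean[j]
    return mean, cov

# Hypothetical sample: three length-2 vectors, one undefined entry.
data = [np.array([1.0, 2.0]), np.array([3.0, np.nan]), np.array([5.0, 6.0])]
mean, cov = reference_cov(data)
```

Assuming a running `SparkContext` named `sc`, the `'Mean'` and `'Cov'` entries returned by `computeCov(sc.parallelize(data))` should agree with these values up to floating-point error, for example when compared with `np.allclose`.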


In [6]:
import Tester.SmallPCA as pca
pca.exercise(computeCov, sc)

Checking data_list of length 3 with length 10 vectors each having 2 np.NaN values
[array([ 1.        ,  0.28505698, -0.68273734, -0.21508341,  0.36345365,
       -0.4729565 ,  0.24899262, -0.02985747, -0.11534323,         nan,
               nan]), array([ 1.        , -1.49465064, -0.6474741 ,  0.66631294,         nan,
        0.75373141,         nan,  0.35621844,  0.54247292,  0.33588386,
       -1.02131315]), array([ 1.        ,         nan,         nan,  0.21262433,  0.78617074,
        0.06847457, -0.69221111,  0.17744531,  0.21787918, -0.54160755,
       -0.53479166])]
shape of E= (10,) shape of NE= (10,)

Checking data_list of length 100 with length 10 vectors each having 4 np.NaN values
[array([ 1.        ,         nan,  0.00482591,  0.04274835,         nan,
        0.01841503, -0.00810807,         nan,  0.02753697,         nan,
        0.0305126 ]), array([ 1.        , -0.64668782, -0.0658972 ,         nan,         nan,
        0.0328489 ,  0.27665254,         nan,  0.90044631,