# Tutorial 2 - Covariance and Principal Components

In this tutorial we will get some experience with estimating the covariance matrix and finding the principal components of some data. You will also get some experience manipulating matrices in Python.

An estimator for the covariance matrix between variables $x$ and $y$ is
\begin{align}
{\hat{C}}_{xy} = \frac{1}{N-1} \sum_{i=1}^N \left( x_i - \bar{x} \right) \left( y_i - \bar{y} \right)
\end{align}
where
\begin{align}
\bar{x} = \frac{1}{N} \sum_{i=1}^N x_i
\end{align}
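
As a quick numerical check of the estimator above, here is a minimal sketch on a small made-up data set. The function name `sample_covariance` and the numbers are assumptions chosen for illustration only; they are not part of the tutorial's data files.

```python
import numpy as np

def sample_covariance(x, y):
    """Sample covariance of two equal-length 1D arrays, with the 1/(N-1) factor."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = x.size
    return np.sum((x - x.mean()) * (y - y.mean())) / (n - 1)

# Made-up values, used only to illustrate the formula
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 4.0, 5.0, 9.0])

c_xy = sample_covariance(x, y)  # covariance of x and y
c_xx = sample_covariance(x, x)  # covariance of x with itself is the variance of x
print(c_xy, c_xx)
```

Here the deviations from the means multiply to 4.5, 0.5, 0 and 6, so the covariance is 11/3, and the variance of `x` is 5/3.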



You will need to import `numpy`, `matplotlib.pyplot` and `pandas`.


1) Read file `homework_01_2d-datafile.csv` into a dataframe using pandas

Make a scatter plot of X vs Y.
 

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pa

 2) Find the covariance matrix for the two variables.  Don't use the 
 function `numpy.cov()`; write your own function this first time.   



In [None]:

.
.
.

C[0,0] = ...
C[1,0] = ...
C[0,1] = ...
C[1,1] = ...

# Print the covariance matrix C for X and Y.

# Do X and Y appear to be correlated?

# What is the variance of X?

# What is the variance of Y?


3) Find the precision matrix, $C^{-1}$, and print it. (Use `np.linalg.inv()`.)
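
A minimal sketch of the inversion, using a made-up 2x2 matrix in place of the covariance matrix from the data (the names `C_demo` and `P_demo` are hypothetical):

```python
import numpy as np

# Made-up symmetric, positive-definite matrix standing in for a covariance matrix
C_demo = np.array([[4.0, 3.0],
                   [3.0, 9.0]])

P_demo = np.linalg.inv(C_demo)  # precision matrix

# The product should be the identity up to floating-point rounding
print(np.allclose(C_demo @ P_demo, np.eye(2)))
```

Multiplying a matrix by its inverse and comparing to the identity is a cheap sanity check that the inversion behaved as expected.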

The normalized covariance coefficients, or Pearson's correlation coefficients,  are
\begin{align}
\rho_{ij} = \frac{\hat{C}_{ij} }{\sqrt{\hat{C}_{ii} \hat{C}_{jj} }}
\end{align}

The off-diagonal components of this matrix are measures of correlations.  They have a range of -1 to 1.
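
To make the formula concrete, the sketch below evaluates a single coefficient from a made-up covariance matrix (the values and the name `C_demo` are assumptions for illustration, not results from the data files):

```python
import numpy as np

# Made-up covariance matrix: variances 4 and 9, covariance 3
C_demo = np.array([[4.0, 3.0],
                   [3.0, 9.0]])

# One off-diagonal coefficient, written out exactly as in the formula
rho_01 = C_demo[0, 1] / np.sqrt(C_demo[0, 0] * C_demo[1, 1])

# A diagonal coefficient is always 1 by construction
rho_00 = C_demo[0, 0] / np.sqrt(C_demo[0, 0] * C_demo[0, 0])

print(rho_01, rho_00)
```

With these numbers the off-diagonal coefficient is 3 / sqrt(36) = 0.5, a moderate positive correlation.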

 4) Consider the following code:
 
 ```
    D = np.diag(C)
    
    print(D)
    
    J =  np.outer(D,D)
    
    print(J)
```
 Use it to efficiently calculate the denominator for the normalized covariance matrix and print the matrix.


 5) Decompose the covariance matrix (not the normalized one) using 
    an eigenvalue decomposition.
    
    Use `V,M = numpy.linalg.eig()` to find the decomposition.  `V` contains the eigenvalues and `M` contains the eigenvectors as columns. 
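
The column convention is easy to get wrong, so here is a small sketch that checks it on a made-up symmetric matrix (`A_demo` is a hypothetical stand-in, not the covariance matrix of the data):

```python
import numpy as np

# Made-up symmetric matrix; its eigenvalues are 1 and 3
A_demo = np.array([[2.0, 1.0],
                   [1.0, 2.0]])

V, M = np.linalg.eig(A_demo)

# Column k of M is the eigenvector belonging to eigenvalue V[k]
for k in range(len(V)):
    assert np.allclose(A_demo @ M[:, k], V[k] * M[:, k])

# For a symmetric matrix with distinct eigenvalues the eigenvectors are orthonormal
print(np.allclose(M.T @ M, np.eye(2)))
```

Note that `numpy.linalg.eig` does not promise any particular ordering of the eigenvalues, so it is safer to check the pairing column by column than to assume the largest comes first.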

In [60]:
.
.
.
    # What are the principal components (eigenvectors) of the data? 
 .
 .
 .   
    print("v1 = ",...)
    print("v2 = ",...)


    # What are the eigenvalues?
.
.

 6) Transform the data into the basis of the principal components (or eigenvectors) that were found in 5).  If `M` is a matrix whose *rows* are the eigenvectors of `C` then `x=M d` is the transformation of the data point `d` into its principal components `x`.

  You can collect the data into a structure `data = np.array([X,Y])` and then matrix 
  multiply it by a matrix with `np.dot(,)` (or `@`). You might need to transpose something...
  
`numpy.shape()` is useful to make sure you're doing matrix multiplications correctly when the matrices are not square.  The trick here is to remember what dimensions things should be.  This should take only a single line of code.
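
The shape bookkeeping can be sketched on synthetic data. Everything below (the random data, the seed, and the names `data_demo`, `E_demo`, `M_demo`, `x_demo`) is an assumption for illustration and does not use the tutorial's data file; it also uses `np.cov` for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic 2 x N data with correlated rows
n_points = 500
u = rng.normal(size=n_points)
data_demo = np.array([u + 0.3 * rng.normal(size=n_points),
                      u + 0.3 * rng.normal(size=n_points)])

C_demo = np.cov(data_demo)            # rows are treated as variables
V_demo, E_demo = np.linalg.eig(C_demo)  # columns of E_demo are eigenvectors
M_demo = E_demo.T                     # rows of M_demo are eigenvectors

# (2, 2) @ (2, N) -> (2, N): every column of the data is transformed at once
x_demo = M_demo @ data_demo
print(np.shape(M_demo), np.shape(data_demo), np.shape(x_demo))

# In the new basis the covariance matrix is diagonal, with the eigenvalues on the diagonal
print(np.allclose(np.cov(x_demo), np.diag(V_demo)))
```

Printing the shapes before multiplying is usually enough to tell whether a transpose is missing.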

In [None]:

#make a scatter plot of the data in this basis

#print the covariance matrix of the data in this basis



7) Compare the diagonal elements (variances) of the covariance matrix 
to the eigenvalues you got in part 5).  Are the variances of the principal components larger, smaller, or both 
larger and smaller than the original variances?

 8) Do 1) through 6), but using the data file `homework_01_5d-datafile.csv` this time.
    In this case the data is 5-dimensional.  You can use `numpy.cov()` this time.
    You can't plot all those dimensions at once, so you don't have to do the scatter plots,
    but you can plot pairs of them.

 The best way to find the covariance matrix is to make a 5 by 2000 array out 
 of the columns of the dataframe using `numpy.array([...])` as we did in the 2D case above.
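
As a rough sketch of that layout, the example below builds a stand-in dataframe from random numbers rather than reading the data file. The column names and the seed are hypothetical; the real file's column names may differ.

```python
import numpy as np
import pandas as pa  # same alias as the import cell in this notebook

rng = np.random.default_rng(1)

# Stand-in dataframe with 2000 rows and 5 columns (hypothetical column names)
df_demo = pa.DataFrame(rng.normal(size=(2000, 5)),
                       columns=["a", "b", "c", "d", "e"])

# One row per variable, one column per data point
data_demo = np.array([df_demo[name] for name in df_demo.columns])
print(data_demo.shape)  # (5, 2000)

# numpy.cov treats each row as a variable by default, so this is 5 x 5
C_demo = np.cov(data_demo)
print(C_demo.shape)  # (5, 5)
```

Keeping variables in rows matches the default `rowvar=True` of `numpy.cov`, so no transpose is needed before calling it.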

In [None]:
    # read in the data into a dataframe
    
df = ...

#print(df.info())

# put the data into a 5 x 2000 numpy array

data = ...
np.shape(data)

# find the covariance matrix of the data

...

print(C)

# Which variables seem to be correlated with each other and which ones not?
# Find this with the Pearson correlation coefficients.

...

print('R = ',R)

9) Find the principal components and their variances.

In [None]:

    # What are the principal components (eigenvectors) of the data? 

....

    # What are the variances of each principal component?
...


10) Make a copy (`data2`) of the `data` transformed into the PCA basis.

In [None]:
data2 = ...

11) This is a function that will make a scatter plot of all pairs of parameters in a triangle of panels.  Fill in the missing parts.

In [None]:

def triangle_plot(data,labels=None):
    n = data.shape[0]
    fig, ax = plt.subplots(n-1,n-1)
    
    if labels is None:
        labels = [ 'X'+str(i) for i in range(n)]
    
    for i in range(n-1):
        for j in range(i) :
            ax[i,j].axis('off')
        for j in range(i+1,n):
            ax[i,j-1].scatter(...,...,s=0.2)
            # common axis range for both variables (avoid shadowing the built-ins min and max)
            lo = np.min([np.min(data[i,:]),np.min(data[j,:])])
            hi = np.max([np.max(data[i,:]),np.max(data[j,:])])
            delt = (hi - lo)*0.1
            ax[i,j-1].set_xlim(lo-delt,hi+delt)
            ax[i,j-1].set_ylim(lo-delt,hi+delt)
            if(i == j-1) :
                ax[i,j-1].set_xlabel(labels[j],fontsize=8)
                ax[i,j-1].set_ylabel(labels[i],fontsize=8)
            else :
                ax[i,j-1].tick_params(axis='both', which='both', labelsize=10)
            
    plt.show()

12) Use `triangle_plot()` to plot `data2`.