# Correlation Metrics

## Authors
B.W. Holwerda

## Learning Goals
* learn about different correlation metrics
* Pearson, Spearman and Kendall Tau rankings
* Correlation is not causation
* But strong correlation does imply a common origin.

## Keywords
Pearson ranking, Spearman ranking, Kendall tau ranking, correlation

## Companion Content


## Summary
One of the first steps in analyzing a large data sample in physics is to determine whether the measured quantities are correlated. The Pearson, Spearman and Kendall rankings are all correlation metrics that come with a significance estimate.

<hr>


## Student Name and ID: Christopher Stephens and 5439371



## Date: 11/10/2023

<hr>

### Galaxy properties.

In Assignment 5, we encountered a catalog of galaxies. We will now see which of the galaxy properties are related, how strongly and how significantly.


0 - GAMA CATAID

1 - Stellar mass (log10 solar masses)

2 - u-r colour

3 - Sérsic index (log10)

4 - Half-light radius (log10 kpc)

5 - Specific star formation rate (log10 Gyr^-1)



In [2]:
import matplotlib.pyplot as plt
from astropy.io import ascii
from scipy.stats import spearmanr
from scipy.stats import pearsonr
from scipy.stats import kendalltau
import numpy as np


# Alternative: read the catalog with astropy instead of NumPy
# data = ascii.read("GAMA.csv", format='csv', names=['cataid','Mstar','u-r','n','r50','sSFR'],fast_reader=False)

# Read the catalog into a NumPy structured array; columns are accessed by name, e.g. data['Mstar']
data = np.genfromtxt("GAMA.csv",  names=['cataid','Mstar','ur_color','n','r50','sSFR'], delimiter=',')
print(data)

[(   6802., 9.04805315, 1.3398169 , -0.04411991,  0.04909877, -9.74545192)
 (   6821., 7.50064806, 0.53505933,  0.76676222, -0.15638103, -8.90204893)
 (   6989., 7.73407941, 1.0682539 , -0.08586849,  0.08764822, -9.41850546)
 ...
 (3913968., 8.54777471, 1.1930662 ,  0.00423537,  0.44101094, -9.82217503)
 (3913987., 9.72427584, 1.3254836 ,  0.1085312 ,  0.51973835, -9.69143559)
 (3913997., 8.46434048, 1.0752945 ,  0.24519176,  0.25875511, -9.24328784)]
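As a small illustration of how the structured array behaves, the sketch below reads two rows copied from the printed output above out of an in-memory string instead of `GAMA.csv`. The `sample_csv` and `sample` names are made up for this example; the column names are the ones used in the notebook.

```python
from io import StringIO
import numpy as np

# Two rows copied from the printed catalog output, standing in for GAMA.csv
sample_csv = StringIO(
    "6802,9.04805315,1.3398169,-0.04411991,0.04909877,-9.74545192\n"
    "6821,7.50064806,0.53505933,0.76676222,-0.15638103,-8.90204893\n"
)

names = ['cataid', 'Mstar', 'ur_color', 'n', 'r50', 'sSFR']
sample = np.genfromtxt(sample_csv, names=names, delimiter=',')

# Each named column comes back as a plain 1-D float array
print(sample.dtype.names)
print(sample['Mstar'])
print(sample['sSFR'])
```

Because every column is addressed by its name, looping over the list of names is enough to reach every pair of properties.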


## Spearman and Pearson Rankings.

**Spearman's ranking** is a nonparametric measure of rank correlation (statistical dependence between the rankings of two variables). It assesses how well the relationship between two variables can be described using a *monotonic function*.

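**Pearson ranking** is a measure of the *linear* correlation between two variables X and Y. Unlike the other two metrics, it is computed from the values themselves rather than from their ranks.

**Kendall $\tau$** is another rank-based measure of how well the relationship between two variables can be described by a monotonic function; it is computed from the numbers of concordant and discordant pairs of data points. Both this and the Spearman ranking are special cases of a general correlation coefficient.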
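The difference between a linear and a monotonic relationship can be seen in a small sketch. The data here are made up for illustration (an exponential curve, not catalog values): the relation is perfectly monotonic but far from a straight line, so the two rank-based metrics should reach 1 while the Pearson coefficient stays below it.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr, kendalltau

# Illustrative data (not from the catalog): y grows monotonically but not linearly with x
x = np.linspace(0.0, 5.0, 50)
y = np.exp(x)

r, _ = pearsonr(x, y)
rho, _ = spearmanr(x, y)
tau, _ = kendalltau(x, y)

print('Pearson  r   = %.3f' % r)
print('Spearman rho = %.3f' % rho)
print('Kendall  tau = %.3f' % tau)
```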


### Exercise 1 -- Which two parameters are the most correlated according to their Spearman ranking?

hint: run through the list of column names and use *for* loops.

In [5]:
# student work here
names=['cataid','Mstar','ur_color','n','r50','sSFR']
maxspearman = 0  # largest |rho| found so far
for val1 in names:
    for val2 in names:
        spearman, pval = spearmanr(data[val1], data[val2])
        print('%5s %5s: %4.5f \t %4.5f' % (val1, val2, spearman,pval))
        # skip self-pairs (always rho = 1) and keep the strongest correlation of either sign
        if abs(spearman) > maxspearman and val1 != val2:
            maxspearman = abs(spearman)
            maxval1 = val1
            maxval2 = val2
            maxpval = pval

print('\n')
# maxspearman holds |rho|, so the sign of the strongest correlation is not shown here
print('%5s %5s: %4.5f \t %4.5f' % (maxval1, maxval2, maxspearman,maxpval))


cataid cataid: 1.00000 	 0.00000
cataid Mstar: 0.04537 	 0.00010
cataid ur_color: 0.02186 	 0.06115
cataid     n: -0.01508 	 0.19640
cataid   r50: 0.03082 	 0.00829
cataid  sSFR: 0.01408 	 0.22787
Mstar cataid: 0.04537 	 0.00010
Mstar Mstar: 1.00000 	 0.00000
Mstar ur_color: 0.71900 	 0.00000
Mstar     n: 0.35914 	 0.00000
Mstar   r50: 0.55979 	 0.00000
Mstar  sSFR: -0.61320 	 0.00000
ur_color cataid: 0.02186 	 0.06115
ur_color Mstar: 0.71900 	 0.00000
ur_color ur_color: 1.00000 	 0.00000
ur_color     n: 0.40271 	 0.00000
ur_color   r50: 0.24851 	 0.00000
ur_color  sSFR: -0.83486 	 0.00000
    n cataid: -0.01508 	 0.19640
    n Mstar: 0.35914 	 0.00000
    n ur_color: 0.40271 	 0.00000
    n     n: 1.00000 	 0.00000
    n   r50: 0.03651 	 0.00176
    n  sSFR: -0.37945 	 0.00000
  r50 cataid: 0.03082 	 0.00829
  r50 Mstar: 0.55979 	 0.00000
  r50 ur_color: 0.24851 	 0.00000
  r50     n: 0.03651 	 0.00176
  r50   r50: 1.00000 	 0.00000
  r50  sSFR: -0.19057 	 0.00000
 sSFR cataid: 0.0140

### Exercise 2 -- Is that correlation statistically significant? 

What is the p-value?

*your answer here*
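Yes. The strongest pair is u-r color and specific star formation rate (Spearman coefficient -0.83486). Its p-value prints as 0.00000, meaning it is smaller than the printed precision, so the correlation is highly significant.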
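A p-value that prints as 0.00000 is usually a formatting effect rather than an exact zero. The sketch below uses synthetic data (an assumption for illustration, not the GAMA catalog) to compare the fixed-point format used in the notebook with scientific notation.

```python
import numpy as np
from scipy.stats import spearmanr

# Synthetic, clearly correlated data (assumption: illustration only, not the GAMA catalog)
rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = x + rng.normal(size=200)

rho, pval = spearmanr(x, y)

fixed = '%4.5f' % pval   # the fixed-point format used in the notebook's print statements
sci = '%.3e' % pval      # scientific notation keeps the order of magnitude

print('rho = %.3f' % rho)
print('fixed-point p-value:', fixed)
print('scientific  p-value:', sci)
```

For very large samples with a strong correlation, the p-value can also underflow to exactly zero in floating point, in which case even scientific notation shows 0.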

### Exercise 3 -- Did you use an 'if' statement to find the highest ranking? 

if not, what would one look like?

*your answer here*
Yes!
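For comparison, here is one possible shape for such an `if` statement, sketched on synthetic columns (the `columns` dictionary and its values are made up, not catalog data). It uses `itertools.combinations` so that each pair is visited once and self-pairs never appear, which removes the need for the `val1 != val2` check.

```python
from itertools import combinations
import numpy as np
from scipy.stats import spearmanr

# Synthetic stand-in columns (assumption: illustrative only, not the GAMA catalog)
rng = np.random.default_rng(1)
a = rng.normal(size=500)
columns = {
    'a': a,
    'b': a + 0.3 * rng.normal(size=500),   # tightly tied to a
    'c': -a + 2.0 * rng.normal(size=500),  # loosely anti-correlated with a
    'd': rng.normal(size=500),             # unrelated
}

best = None  # (name1, name2, rho, pval) of the strongest pair so far
for name1, name2 in combinations(columns, 2):   # each unordered pair exactly once
    rho, pval = spearmanr(columns[name1], columns[name2])
    # compare absolute values so strong anti-correlations count too
    if best is None or abs(rho) > abs(best[2]):
        best = (name1, name2, rho, pval)

print('strongest pair: %s %s  rho = %.3f' % best[:3])
```

Storing the signed coefficient, rather than its absolute value, keeps the direction of the strongest correlation available afterwards.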

### Exercise 4 -- Which two parameters are the most correlated according to their Pearson ranking?

hint: define the list of column names and use *for* loops.

In [6]:
# student work here
names=['cataid','Mstar','ur_color','n','r50','sSFR']
maxpearson = 0  # largest |r| found so far
for val1 in names:
    for val2 in names:
        pearson, pval = pearsonr(data[val1], data[val2])
        print('%5s %5s: %4.5f \t %4.5f' % (val1, val2, pearson,pval))
        # skip self-pairs (always r = 1) and keep the strongest correlation of either sign
        if abs(pearson) > maxpearson and val1 != val2:
            maxpearson = abs(pearson)
            maxval1 = val1
            maxval2 = val2
            maxpval = pval

print('\n')
# maxpearson holds |r|, so the sign of the strongest correlation is not shown here
print('%5s %5s: %4.5f \t %4.5f' % (maxval1, maxval2, maxpearson,maxpval))

cataid cataid: 1.00000 	 0.00000
cataid Mstar: 0.02863 	 0.01418
cataid ur_color: 0.00033 	 0.97750
cataid     n: -0.01276 	 0.27442
cataid   r50: 0.02562 	 0.02819
cataid  sSFR: 0.01505 	 0.19726
Mstar cataid: 0.02863 	 0.01418
Mstar Mstar: 1.00000 	 0.00000
Mstar ur_color: 0.74484 	 0.00000
Mstar     n: 0.42056 	 0.00000
Mstar   r50: 0.58792 	 0.00000
Mstar  sSFR: -0.58382 	 0.00000
ur_color cataid: 0.00033 	 0.97750
ur_color Mstar: 0.74484 	 0.00000
ur_color ur_color: 1.00000 	 0.00000
ur_color     n: 0.47684 	 0.00000
ur_color   r50: 0.27085 	 0.00000
ur_color  sSFR: -0.81018 	 0.00000
    n cataid: -0.01276 	 0.27442
    n Mstar: 0.42056 	 0.00000
    n ur_color: 0.47684 	 0.00000
    n     n: 1.00000 	 0.00000
    n   r50: 0.09432 	 0.00000
    n  sSFR: -0.46669 	 0.00000
  r50 cataid: 0.02562 	 0.02819
  r50 Mstar: 0.58792 	 0.00000
  r50 ur_color: 0.27085 	 0.00000
  r50     n: 0.09432 	 0.00000
  r50   r50: 1.00000 	 0.00000
  r50  sSFR: -0.16003 	 0.00000
 sSFR cataid: 0.0150

### Exercise 5 -- Now do this for the Kendall $\tau$

In [4]:
# student work here




### Exercise 6 -- Plot the most and second most correlated values against each other

According to the above correlation functions, two pairs of values are most correlated. 

In [5]:
## Plot the two most correlated values 
# student work here


In [6]:
## plot the second most correlated values here:
# student work here
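The printed Spearman and Pearson tables give u-r colour versus specific star formation rate as the strongest pair and stellar mass versus u-r colour as the second strongest. A possible plotting sketch for these two pairs follows; the values are synthetic stand-ins (made-up numbers, not the catalog), and the `Agg` backend line is only there so the sketch runs without a display.

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend; drop this line inside the notebook
import matplotlib.pyplot as plt
import numpy as np

# Synthetic stand-in values (assumption: made-up numbers, not the GAMA catalog)
rng = np.random.default_rng(3)
ur_color = rng.normal(1.2, 0.3, 2000)
sSFR = -8.0 - 1.5 * ur_color + rng.normal(0.0, 0.2, 2000)
Mstar = 8.0 + 1.0 * ur_color + rng.normal(0.0, 0.5, 2000)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

# Low alpha makes dense regions appear darker, which helps reveal sub-populations
ax1.scatter(ur_color, sSFR, s=5, alpha=0.1)
ax1.set_xlabel('u-r colour')
ax1.set_ylabel('Specific star formation rate (log10 Gyr^-1)')

ax2.scatter(ur_color, Mstar, s=5, alpha=0.1)
ax2.set_xlabel('u-r colour')
ax2.set_ylabel('Stellar mass (log10 solar masses)')

fig.tight_layout()
```

With the real catalog, the synthetic arrays would be replaced by `data['ur_color']`, `data['sSFR']` and `data['Mstar']`.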


### Exercise 7 -- Are both correlations similar?

Are the two strongest correlations similar, or is there evidence for sub-populations? (Hint: set alpha=0.1 to see the density of points.)


*your answer here*
The two strongest correlations, according to the tables above, are (u-r color and specific star formation rate) and (stellar mass and u-r color). Is there evidence of subpopulations? Well, to answer this question I must first address

### Exercise 8 -- Use the correlation

Do you think the u-r color would make a good estimator for the stellar mass? Motivate why or why not.

*your answer and motivation here*
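Yes, with caveats. The correlation between u-r color and stellar mass is strong (Spearman 0.72, Pearson 0.74) and the p-value prints as 0.00000, so it is highly significant. Significance alone does not make a good estimator, though: a Pearson coefficient of 0.74 corresponds to an r^2 of about 0.55, so color accounts for only a little over half of the variance in stellar mass and any estimate carries substantial scatter.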
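How good such an estimator is can be judged from the scatter around a fitted line rather than from the p-value. The sketch below fits a straight line with `np.polyfit` to synthetic values (made-up numbers chosen to give a correlation near 0.7, not the catalog) and reports the residual scatter, which is the typical error of a mass estimated from colour.

```python
import numpy as np

# Synthetic stand-in values (assumption: made-up numbers, not the GAMA catalog)
rng = np.random.default_rng(4)
ur_color = rng.normal(1.2, 0.3, 2000)
Mstar = 8.0 + 1.0 * ur_color + rng.normal(0.0, 0.3, 2000)

# Straight-line fit: Mstar ~ slope * colour + intercept
slope, intercept = np.polyfit(ur_color, Mstar, 1)
predicted = slope * ur_color + intercept

# The residual scatter is the typical error made when estimating mass from colour
residual_scatter = np.std(Mstar - predicted)
r = np.corrcoef(ur_color, Mstar)[0, 1]

print('slope = %.3f, intercept = %.3f' % (slope, intercept))
print('Pearson r = %.3f, r^2 = %.3f' % (r, r**2))
print('residual scatter = %.3f dex' % residual_scatter)
```

Running the same fit on `data['ur_color']` and `data['Mstar']` would show whether the scatter is small enough for the intended use.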


