# Correlation


## Introduction

In this notebook we will learn how to correlate two variables/signals, and what 'correlation' roughly means. 

In general, the correlation is used to measure the degree of similarity between two variables or even two signals.

The 'classic' linear correlation produces a single coefficient as output, while the cross-correlation produces a whole new signal. The first approach is often used when two signals share, e.g., a common time axis but are of different origin. The cross-correlation is used, for example, when both signals are of the same type but were recorded at different times.



<div class ="alert alert-warning">
One important aspect beforehand:

*Correlation ≠ Causation*

Even if two things correlate, that does not directly mean they have a causal relation.
</div>
 

Lectures:
-  ..


## Table of Contents
- [Classic Correlation](#Correlation)
- [Cross-correlation](#Cross-correlation)
- [Autocorrelation](#Autocorrelation)
- [Summary](#Summary)



<a id='Correlation'></a> 
# Classic Correlation
The 'classic' linear correlation, defined by one coefficient, is often used when two signals have, e.g., a common time axis but are of different origin.

The correlation measures how similar the trend of two variables is, independent of the time.

[More to correlation](https://machinelearningmastery.com/how-to-use-correlation-to-understand-the-relationship-between-variables/)

In [None]:
import numpy as num
import matplotlib.pyplot as plt
from scipy import signal
import time

First, as always, we create some synthetic noisy data.

In [None]:
# Creating data
num.random.seed(0)
datlen = 100
data1 = num.linspace(-10, 10, datlen) + num.random.normal(0, 2, datlen)
data2 = num.linspace(-10, 10, datlen) + num.random.normal(0, 1, datlen)

plt.figure()
plt.scatter(data1, data2)
plt.xlabel('Data 1')
plt.ylabel('Data 2')
plt.show()

#### Covariance matrix 

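Two important metrics to check during correlation are the [covariance](https://en.wikipedia.org/wiki/Covariance) and the [covariance matrix](https://en.wikipedia.org/wiki/Covariance_matrix). The sign of the covariance gives the 'direction' of the relation, but its magnitude depends on the scale of the data, so it does not tell how strong the relation is. In simpler words, we can use it to find out if the data has a positive or negative relation, but not how strong this connection is. The covariance matrix collects the covariances of all pairs of variables, with the variances on its diagonal.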


$\large Cov(x,y) = \frac{\sum_{i=1}^{N}(x_i - \bar{x}) \cdot (y_i - \bar{y})}{N - 1}$ 



$\large CovMatrix(x,y) = \begin{bmatrix} Cov(x,x) & Cov(x,y) \\ Cov(y,x) & Cov(y,y) \end{bmatrix}$

In [None]:
covariance = num.cov(data1, data2)
print('Covariance-matrix:\n', covariance)
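
To connect the formula above with `num.cov`, the covariance can also be computed by hand. The following is a minimal sketch on freshly generated data of the same kind as above. The helper name `covariance_by_hand` is only illustrative and not part of NumPy.

```python
import numpy as num


def covariance_by_hand(x, y):
    """Sample covariance with the N - 1 normalisation used in the formula."""
    x = num.asarray(x, dtype=float)
    y = num.asarray(y, dtype=float)
    return num.sum((x - x.mean()) * (y - y.mean())) / (len(x) - 1)


# Synthetic data, regenerated here so that the sketch runs on its own
rng = num.random.default_rng(0)
x = num.linspace(-10, 10, 100) + rng.normal(0, 2, 100)
y = num.linspace(-10, 10, 100) + rng.normal(0, 1, 100)

cov_xy = covariance_by_hand(x, y)
cov_matrix = num.cov(x, y)

# The off-diagonal entries of the covariance matrix are Cov(x, y),
# the diagonal entries are the variances Cov(x, x) and Cov(y, y)
print('Cov(x, y) by hand  :', cov_xy)
print('Cov(x, y) num.cov  :', cov_matrix[0, 1])
print('Var(x)    by hand  :', covariance_by_hand(x, x))
print('Var(x)    num.cov  :', cov_matrix[0, 0])
```

Both routes should agree up to floating-point rounding, because `num.cov` also normalises by N - 1 by default.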

#### Correlation coefficients
Correlation is a standardized covariance, which makes it possible to investigate how strong the relation is. The correlation coefficient lies between -1 (perfect anti-correlation) and 1 (perfect correlation). A coefficient of 0 means that the variables are not linearly correlated.


Attention: some formulas/algorithms assume that the data is Gaussian-distributed.

In [None]:
from scipy.stats import pearsonr, spearmanr

# Pearson's correlation coefficient = covariance(X, Y) / (stdv(X) * stdv(Y))
# Gaussian or Gaussian-like distribution
corr, _ = pearsonr(data1, data2)
print('Pearsons correlation coefficient : %.3f' % corr)

# Spearman's correlation coefficient = covariance(rank(X), rank(Y)) / (stdv(rank(X)) * stdv(rank(Y)))
# non-Gaussian distribution
corr, _ = spearmanr(data1, data2)
print('Spearmans correlation coefficient: %.3f' % corr)
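
The two formulas in the comments above can be checked directly. This is a minimal sketch, assuming data of the same kind as before. `pearson_by_hand` is a hypothetical helper name, and `rankdata` from `scipy.stats` is used here to obtain the ranks.

```python
import numpy as num
from scipy.stats import pearsonr, spearmanr, rankdata


def pearson_by_hand(x, y):
    """Covariance divided by the product of the two standard deviations."""
    cov = num.cov(x, y)  # 2x2 covariance matrix
    return cov[0, 1] / num.sqrt(cov[0, 0] * cov[1, 1])


# Synthetic data, regenerated here so that the sketch runs on its own
rng = num.random.default_rng(0)
x = num.linspace(-10, 10, 100) + rng.normal(0, 2, 100)
y = num.linspace(-10, 10, 100) + rng.normal(0, 1, 100)

# Pearson: standardized covariance of the values
r_hand = pearson_by_hand(x, y)
r_scipy, _ = pearsonr(x, y)

# Spearman: the same formula, applied to the ranks of the values
rho_hand = pearson_by_hand(rankdata(x), rankdata(y))
rho_scipy, _ = spearmanr(x, y)

print('Pearson  by hand / scipy: %.6f / %.6f' % (r_hand, r_scipy))
print('Spearman by hand / scipy: %.6f / %.6f' % (rho_hand, rho_scipy))
```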

<div class ="alert alert-success">
Tasks
    
- Change sign
- Change randomness/increase std
</div>

### Non-linear correlation 
It is not so trivial ... .
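
One illustrative case, chosen here as an assumption and not taken from the lecture material, is a purely quadratic relation. The variable `y` is fully determined by `x`, yet both coefficients are close to zero, because the relation is neither linear nor monotonic.

```python
import numpy as num
from scipy.stats import pearsonr, spearmanr

# Symmetric x values around zero and a purely quadratic dependence
x = num.arange(-50, 51) / 50.0
y = x**2

r, _ = pearsonr(x, y)
rho, _ = spearmanr(x, y)

print('Pearson  coefficient: %.3f' % r)
print('Spearman coefficient: %.3f' % rho)
```

A coefficient near zero therefore only rules out a linear (Pearson) or monotonic (Spearman) relation, not a relation in general.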

<a id='Cross-correlation'></a> 
# Cross Correlation

The correlation can also take a second dimension of the signals into account. For seismic signals this is, e.g., time. This procedure is called cross-correlation.

Cross-correlation investigates how similar two (time) signals are and, in addition, at which lag (shift) time the correlation is highest. As a result we obtain a new 'trace' consisting of correlation values for different lag times, instead of a single value as for the classic correlation.
Cross-correlation is typically applied to two different signals.

Let's start by generating some 'peaky' data.

In [None]:
# Creating data
xdata = num.linspace(0, 100, 15)
ydata1 = num.zeros(len(xdata))
ydata2 = num.zeros(len(xdata))
ydata1[5] = 1
ydata1[10] = 1

ydata2[5] = 2
ydata2[10] = -1

plt.figure()
plt.plot(xdata, ydata1, label='Signal1')
plt.plot(xdata, ydata2, label='Signal2')
plt.xlabel('Time [s]')
plt.legend()
plt.show()

Now we use the [correlate](https://docs.scipy.org/doc/scipy/reference/generated/scipy.signal.correlate.html) function. What we get is a new time-series. The correct time shifts can be obtained from the [correlation_lags](https://docs.scipy.org/doc/scipy/reference/generated/scipy.signal.correlation_lags.html#scipy.signal.correlation_lags) function.

In [None]:
corr = signal.correlate(ydata1, ydata2)
lags = signal.correlation_lags(len(ydata1), len(ydata2))

# correlation_lags returns lags in samples; convert them to seconds
dt = xdata[1] - xdata[0]

plt.figure()
plt.plot(lags * dt, corr)
plt.xlabel('Lag-Time [s]')
plt.show()
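
What `correlate` computes can be written out with two plain loops. The following is a slow but explicit sketch for two signals of equal length. `cross_correlate_by_hand` is a hypothetical helper, written only for illustration.

```python
import numpy as num
from scipy import signal


def cross_correlate_by_hand(a, b):
    """Full cross-correlation: for every lag k, sum a[n + k] * b[n]."""
    a = num.asarray(a, dtype=float)
    b = num.asarray(b, dtype=float)
    lags = num.arange(-(len(b) - 1), len(a))
    corr = num.zeros(len(lags))
    for i, k in enumerate(lags):
        for n in range(len(b)):
            if 0 <= n + k < len(a):  # only overlapping samples contribute
                corr[i] += a[n + k] * b[n]
    return lags, corr


# The same 'peaky' signals as above
ydata1 = num.zeros(15)
ydata1[5] = 1
ydata1[10] = 1
ydata2 = num.zeros(15)
ydata2[5] = 2
ydata2[10] = -1

lags_hand, corr_hand = cross_correlate_by_hand(ydata1, ydata2)
corr_scipy = signal.correlate(ydata1, ydata2)
lags_scipy = signal.correlation_lags(len(ydata1), len(ydata2))

best_lag = lags_hand[num.argmax(corr_hand)]
print('Lag of the highest correlation (in samples):', best_lag)
```

The highest value appears at the lag for which the second pulse of `ydata1` lines up with the large pulse of `ydata2`.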

To understand this a bit better, here is a simple, custom-made function that hopefully visualizes the cross-correlation more clearly.

In [None]:
import matplotlib.pyplot as plt
from IPython.display import display, clear_output

def signal_correlation_animation(sig1, sig2, pausetime=0.):
    ## might be a small WARNING bug
    
    # Make sure that sig1 is the longer signal (swap if necessary)
    if len(sig2) > len(sig1):
        sig1, sig2 = sig2, sig1
        
    x1 = num.arange(len(sig1))
    x2 = num.arange(len(sig2))
    
    fig, axs = plt.subplots(2,1, figsize=(16,9))
    ax1 = axs[0]
    ax2 = axs[1]
     
    corr = signal.correlate(sig1, sig2)
    lags = signal.correlation_lags(len(sig1), len(sig2))
     
    crosscorr = [0]
    for ii in range(len(sig1) + len(sig2)):
        if ii == 0:
            continue
        
        if ii <= len(sig1):
            idxx = (ii - len(sig2))
            if idxx < 0:
                idxx = 0
            x = sig1[idxx: ii]
        else:
            xx = ii - len(sig1)
            x = sig1[xx + abs(len(sig1) - len(sig2)): ]
        
        if ii <= len(sig2):
            idxy = -ii + len(sig1)
            if idxy >= 0:
                idxy = None
            y = sig2[-ii: idxy]
        else:
            yy = ii - len(sig2)
            idxyy = len(sig1) -yy
            if idxyy > len(sig2):
                idxyy = None
            y = sig2[:idxyy]
        
        crosscorr.append(num.sum(x * y))
        
        ax1.cla()
        ax1.plot(lags, corr, color='orange', alpha=0.5, zorder=-2)
        ax1.scatter(num.arange(len(crosscorr)) - len(sig2), crosscorr, c='blue')
        ax1.set_xlabel('Lag [samples]')
        
        ax2.cla()
        ax2.plot(x1 + len(sig2) - ii, sig1)
        ax2.plot(x2, sig2)
        ax2.set_xlim(-len(sig1), len(sig1) + len(sig2))
        ax2.set_xlabel('Time [samples]')

        display(fig)
        
        if pausetime >= 0.01:
            plt.pause(pausetime)
        
        if ii + 1 == len(sig1) + len(sig2):
            clear_output(wait=False)
        else:
            clear_output(wait=True)
    
    return

Run this same example with the new function:

In [None]:
signal_correlation_animation(ydata1, ydata2, pausetime=0.1)

Now we do the same with a sine and a cosine.

In [None]:
df = 10
xdata = num.arange(num.pi * df) / df
ysin = num.sin(2*num.pi*xdata * 1/ num.pi)
ycos = num.cos(2*num.pi*xdata * 1/ num.pi)

plt.figure()
plt.plot(xdata, ysin, label='sin')
plt.plot(xdata, ycos, label='cos')
plt.legend()
plt.grid()

corr = signal.correlate(ysin, ycos)
lags = signal.correlation_lags(len(ysin), len(ycos))

plt.figure()
plt.plot(lags / df, corr)
plt.grid()
plt.show()

# signal_correlation_animation(ysin, ycos, pausetime=0.001)
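
A common use of the lag axis is to read off the time shift between two signals: the lag with the highest correlation, multiplied by the sampling interval, gives the delay. The sketch below assumes a Gaussian pulse and a delay of 17 samples. Both are arbitrary choices for illustration.

```python
import numpy as num
from scipy import signal

dt = 0.1                                        # assumed sampling interval [s]
t = num.arange(200) * dt
pulse = num.exp(-0.5 * ((t - 5.0) / 0.5) ** 2)  # Gaussian pulse centred at 5 s

true_shift = 17                                 # delay in samples (1.7 s)
delayed = num.roll(pulse, true_shift)           # same pulse, arriving later

corr = signal.correlate(delayed, pulse)
lags = signal.correlation_lags(len(delayed), len(pulse))

# Lag (in samples) with the highest correlation, converted to seconds
best_lag = lags[num.argmax(corr)]
delay_seconds = best_lag * dt

print('Estimated delay: %d samples = %.2f s' % (best_lag, delay_seconds))
```

With this argument order a positive lag means that the first signal is delayed relative to the second one.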

<div class ="alert alert-success">
Tasks
    
- test several different signals
- different time length
</div>



It is also possible to cross-correlate a shorter signal against a longer one. The two signals do not need to be equally long, but they need the same sampling frequency (interval)!

This is illustrated in the following. We have a longer time series that represents our noisy data. Now we want to test whether this series contains a certain reference signal, and where. Here, we 'hide' our signal in noise to see if we can find the time at which it occurs.

In [None]:
# With noise

# Both series use the same sampling interval (1 s)
xdata = num.arange(100)
ysin = num.sin(2*num.pi*xdata / 100)   # one full sine period over 100 samples
#ycos = num.cos(2*num.pi*xdata / 100)

xdata2 = num.arange(1000)
ydata = num.zeros(len(xdata2))
ydata[200:200 + len(ysin)] = ysin
ydata[700:700 + len(ysin)] = -ysin
ydata += num.random.normal(0, 0.99, len(xdata2))

plt.figure()
plt.plot(xdata2, ydata, label='Signal')
plt.plot(xdata, ysin, label='sin')
plt.legend()
plt.grid()
plt.show()

# Longer series first: a positive lag is then the position of the reference signal in the series
corr = signal.correlate(ydata, ysin)
lags = signal.correlation_lags(len(ydata), len(ysin))

plt.figure()
plt.plot(lags, corr)
plt.grid()
plt.show()

# signal_correlation_animation(ysin, ydata, pausetime=0.001)
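
The positions of the hidden signal can also be read off automatically from the extreme values of the cross-correlation. This is a minimal sketch under assumptions of its own: a fixed random seed, a lower noise level than above, and `mode='valid'` so that only lags with full overlap are evaluated.

```python
import numpy as num
from scipy import signal

rng = num.random.default_rng(42)

# Reference signal: one sine period over 100 samples
template = num.sin(2 * num.pi * num.arange(100) / 100)

# Longer series with two hidden copies, one of them with flipped sign
series = num.zeros(1000)
series[200:300] += template
series[700:800] -= template
series += rng.normal(0, 0.5, len(series))  # noise level is an assumption

# mode='valid' keeps only lags where the template lies fully inside the series
corr = signal.correlate(series, template, mode='valid')
lags = signal.correlation_lags(len(series), len(template), mode='valid')

pos_start = lags[num.argmax(corr)]  # best match
neg_start = lags[num.argmin(corr)]  # best match with opposite sign

print('Best positive match starts near sample', pos_start)
print('Best negative match starts near sample', neg_start)
```

Because of the noise, the detected start samples only approximate the true positions 200 and 700.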

<a id='Autocorrelation'></a> 
## Autocorrelation
The autocorrelation is a special case of the cross-correlation. Instead of using two independent signals, one signal is cross-correlated with itself. It is often used to identify recurring and periodic patterns within a signal.

We will show the resulting function for a simple two-pulse signal.

In [None]:
# Creating data
xdata = num.linspace(0, 100, 30)
ydata = num.zeros(len(xdata))
ydata[10] = 1
ydata[20] = -1
# ydata[15] = 1
# ydata[5] = 1

plt.figure()
plt.plot(xdata, ydata)
plt.xlabel('Time [s]')


# Autocorrelation

corr = signal.correlate(ydata, ydata)
lags = signal.correlation_lags(len(ydata), len(ydata))

# correlation_lags returns lags in samples; convert them to seconds
dt = xdata[1] - xdata[0]

plt.figure()
plt.plot(lags * dt, corr)
plt.xlabel('Lag-Time [s]')
plt.show()

# signal_correlation_animation(ydata, ydata, pausetime=0.2)

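The highest peak is always at lag time 0. As expected, there are two further side peaks, located at plus and minus the time between the two pulses. They are negative here because the two pulses have opposite signs.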
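
The same idea can be used to estimate the period of a recurring pattern: the first strong peak after lag 0 marks the shift at which the signal matches itself again. The sketch below assumes a clean sine with a period of 25 samples. The threshold of half the zero-lag value is an arbitrary choice for illustration.

```python
import numpy as num
from scipy import signal

period = 25                          # true period in samples (assumed)
n = num.arange(200)
y = num.sin(2 * num.pi * n / period)

acf = signal.correlate(y, y)
lags = signal.correlation_lags(len(y), len(y))

# The autocorrelation is symmetric, so the non-negative lags are sufficient
acf_pos = acf[lags >= 0]

# First local maximum after lag 0 that reaches at least half the zero-lag value
peaks, _ = signal.find_peaks(acf_pos, height=0.5 * acf_pos[0])
estimated_period = peaks[0]

print('Estimated period: %d samples' % estimated_period)
```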


<div class ="alert alert-success">
Tasks

- Correlate random noise
- different data signals 
</div>


<a id='Summary'></a> 
# Summary

We have learned that
- the correlation of two variables can be tested with Pearson's correlation (Gaussian distributions) or Spearman's correlation (non-Gaussian distributions).
- two time signals can be tested for correlation with the autocorrelation (if it is twice the same signal) or the cross-correlation. Besides the maximum value, the side peaks and their lag times can also be of interest.