# Correlations

## Pairwise correlations

In this notebook we will see how to compute pairwise correlation coefficients across the columns of a pandas DataFrame using the [pairwise_corr](https://raphaelvallat.github.io/pingouin/build/html/generated/pingouin.pairwise_corr.html#pingouin.pairwise_corr) function.

We'll start by generating a pandas DataFrame with three continuous variables, each in a separate column.

In [1]:
import numpy as np
import pandas as pd
np.random.seed(123)
n = 20
mean, cov = [4, 6], [(1, .6), (.6, 1)]
# x and y are two correlated random variables
x, y = np.random.multivariate_normal(mean, cov, n).T
z = np.random.normal(size=n)
df = pd.DataFrame({'X': x, 'Y': y, 'Z': z})
df.head()

| | X | Y | Z |
|---|---|---|---|
| 0 | 4.524991 | 7.417044 | -0.805367 |
| 1 | 4.420532 | 5.073261 | -1.727669 |
| 2 | 3.778971 | 7.256061 | -0.3909 |
| 3 | 6.362303 | 7.978672 | 0.573806 |
| 4 | 3.25533 | 4.480094 | 0.338589 |
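The covariance matrix passed to `multivariate_normal` fixes the population correlation between `X` and `Y`. As a quick check that is not part of the original notebook, the sketch below converts that matrix into a correlation matrix; the variable names are illustrative only.

```python
import numpy as np

# Covariance matrix used to generate X and Y
cov = np.array([[1.0, 0.6], [0.6, 1.0]])

# Divide each covariance by the product of the two standard deviations
sd = np.sqrt(np.diag(cov))
corr = cov / np.outer(sd, sd)

rho_xy = corr[0, 1]  # population correlation between X and Y
print(rho_xy)
```

Because both variances equal 1 here, the population correlation equals the covariance (0.6). The sample estimate from only 20 observations will generally differ somewhat from that value.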


Next, we compute the pairwise correlations between all the columns of the DataFrame.

By default, the function returns Pearson correlation coefficients with two-sided p-values. This can be adjusted using the `tail` and `method` arguments. In addition, the output dataframe contains:

1. the parametric 95% confidence intervals of the r value (`CI95%`)
2. the R<sup>2</sup> (= coefficient of determination, `r2`)
3. the adjusted R<sup>2</sup> (`adj_r2`)
4. the standardized (Z-transformed) correlation coefficients (`z`)
5. the uncorrected p-values (`p-unc`)
6. the Bayes Factor for the alternative hypothesis (`BF10`)

In the example below, we can see that there is a strong correlation between variables `X` and `Y`, as indicated by the correlation coefficient (0.583), the p-value (0.007) and the Bayes Factor (6.27, meaning that the data are ~6 times more likely under the alternative hypothesis than under the null hypothesis).

In [2]:
from pingouin import pairwise_corr
pairwise_corr(df)

| | X | Y | method | tail | r | CI95% | r2 | adj_r2 | z | p-unc | BF10 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | X | Y | pearson | two-sided | 0.583 | [0.19, 0.82] | 0.34 | 0.262 | 0.667 | 0.007004 | 6.27 |
| 1 | X | Z | pearson | two-sided | -0.083 | [-0.51, 0.37] | 0.007 | -0.11 | -0.083 | 0.729457 | 0.181 |
| 2 | Y | Z | pearson | two-sided | -0.197 | [-0.59, 0.27] | 0.039 | -0.074 | -0.2 | 0.40414 | 0.241 |
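Several of the derived columns follow directly from `r` and the sample size. The sketch below recomputes them for the `X`-`Y` pair from the rounded value r = 0.583 and n = 20. It assumes the usual Fisher z-transform for `z` and `CI95%`, and an adjusted R<sup>2</sup> with an `n - 3` denominator; that denominator was chosen because it reproduces the tabulated value, so treat it as an assumption rather than a statement about Pingouin's internals.

```python
import math
from statistics import NormalDist

r, n = 0.583, 20  # rounded r for the X-Y pair, and the sample size

r2 = r ** 2  # coefficient of determination

# Assumed adjustment (n - 3 denominator); it matches the tabulated adj_r2
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - 3)

# Fisher z-transform of r
z = math.atanh(r)

# Parametric 95% CI: build the interval on the z scale, then back-transform
se = 1 / math.sqrt(n - 3)
crit = NormalDist().inv_cdf(0.975)
ci = (math.tanh(z - crit * se), math.tanh(z + crit * se))

print(round(r2, 2), round(adj_r2, 3), round(z, 3), [round(v, 2) for v in ci])
# prints: 0.34 0.262 0.667 [0.19, 0.82]
```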


### Non-parametric correlations
If your data do not follow a normal distribution, the software will display a warning message suggesting that you use a non-parametric method such as the Spearman rank correlation.

In the example below, we compute the one-sided Spearman pairwise correlations between a subset of columns. Note that the Bayes Factor is only computed when using the Pearson method and is therefore not present in the table below.

In [3]:
pairwise_corr(data=df, columns=['X', 'Y'], tail='one-sided', method='spearman')

| | X | Y | method | tail | r | CI95% | r2 | adj_r2 | z | p-unc |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | X | Y | spearman | one-sided | 0.537 | [0.12, 0.79] | 0.288 | 0.204 | 0.6 | 0.007332 |
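For intuition, the Spearman coefficient is simply the Pearson coefficient computed on the ranks of the data. The sketch below illustrates this on synthetic data (an arbitrary seed, not the DataFrame used above) and compares it with the classic rank-difference formula. The `ranks` helper is hypothetical and is only valid when there are no tied values.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, illustrative data only
n = 20
x = rng.normal(size=n)
y = 0.6 * x + rng.normal(size=n)

def ranks(a):
    """Return ranks 1..n of a 1-D array (assumes no ties)."""
    return np.argsort(np.argsort(a)) + 1

rx, ry = ranks(x), ranks(y)

# Spearman as Pearson on the ranks
rho_pearson_on_ranks = np.corrcoef(rx, ry)[0, 1]

# Classic formula based on squared rank differences (no ties)
d = rx - ry
rho_classic = 1 - 6 * np.sum(d ** 2) / (n * (n ** 2 - 1))

print(rho_pearson_on_ranks, rho_classic)
```

Both values agree up to floating-point error. With tied observations, average ranks are needed and the rank-difference formula no longer holds exactly.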


### Robust correlations
If you believe that your dataset contains outliers, you can use a robust correlation method. There are currently three robust correlation methods implemented in Pingouin, namely the percentage bend correlation ([Wilcox 1994](https://link.springer.com/article/10.1007/BF02294395)), Shepherd's pi correlation ([Schwarzkopf et al. 2012](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3397314/)) and the skipped correlation ([Rousselet and Pernet 2012](https://www.frontiersin.org/articles/10.3389/fnhum.2012.00119/full)).

While the first method is particularly well-suited for univariate outliers (i.e. outliers present in only one variable), the latter two methods work well with multivariate outliers. Note that the skipped correlation requires the scikit-learn package. Learn more in the documentation of the [pairwise_corr](https://raphaelvallat.github.io/pingouin/build/html/generated/pingouin.pairwise_corr.html#pingouin.pairwise_corr) function.

In [4]:
# Introduce two outliers in variable X
df.loc[[5, 12], 'X'] = 18

# Percentage bend correlation
pairwise_corr(data=df, columns=['X', 'Y'], method='percbend')

| | X | Y | method | tail | r | CI95% | r2 | adj_r2 | z | p-unc |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | X | Y | percbend | two-sided | 0.56 | [0.16, 0.8] | 0.313 | 0.232 | 0.633 | 0.01031 |


In [5]:
# Shepherd's correlation
pairwise_corr(data=df, columns=['X', 'Y'], method='shepherd')

| | X | Y | method | tail | r | CI95% | r2 | adj_r2 | z | p-unc |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | X | Y | shepherd | two-sided | 0.507 | [0.08, 0.78] | 0.257 | 0.169 | 0.559 | 0.031873 |
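To see why robust methods are worth considering, it helps to look at how strongly a couple of outliers can distort the ordinary Pearson coefficient. The sketch below uses a deliberately artificial, perfectly linear toy dataset (not the notebook's data) and overwrites two `x` values, mirroring the rows modified above. It does not implement any of the robust methods themselves.

```python
import numpy as np

# Toy data: y is a perfect linear function of x
x = np.arange(20, dtype=float)
y = np.arange(20, dtype=float)
r_clean = np.corrcoef(x, y)[0, 1]

# Overwrite two observations of x with an extreme value
x_out = x.copy()
x_out[[5, 12]] = 100.0
r_outlier = np.corrcoef(x_out, y)[0, 1]

print(r_clean, r_outlier)
```

Two univariate outliers out of twenty points are enough to pull a perfect correlation down to a weak one in this toy case, which is the situation robust methods are designed to handle.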


### Correction for multiple comparisons
Finally, if you are computing a large number of correlation coefficients, you might want to correct the p-values for multiple comparisons. This can be done with the `padjust` argument:

In [6]:
pairwise_corr(df, method='spearman', padjust="holm")

| | X | Y | method | tail | r | CI95% | r2 | adj_r2 | z | p-unc | p-corr | p-adjust |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | X | Y | spearman | two-sided | 0.532 | [0.12, 0.79] | 0.283 | 0.198 | 0.593 | 0.015812 | 0.047435 | holm |
| 1 | X | Z | spearman | two-sided | -0.081 | [-0.51, 0.37] | 0.007 | -0.11 | -0.081 | 0.733511 | 0.733511 | holm |
| 2 | Y | Z | spearman | two-sided | -0.224 | [-0.61, 0.24] | 0.05 | -0.062 | -0.228 | 0.342286 | 0.684573 | holm |
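The `p-corr` column can be checked by hand. The sketch below applies the standard Holm step-down procedure to the three uncorrected p-values from the table. `holm_adjust` is a hypothetical helper written for illustration, not Pingouin's own function, and small differences in the last digit are expected because the inputs are rounded.

```python
import numpy as np

def holm_adjust(pvals):
    """Holm step-down adjustment (illustrative helper, not Pingouin's API)."""
    p = np.asarray(pvals, dtype=float)
    m = p.size
    order = np.argsort(p)
    # Multiply the k-th smallest p-value (k = 0, 1, ...) by (m - k)
    adjusted_sorted = p[order] * (m - np.arange(m))
    # Enforce monotonicity and cap at 1
    adjusted_sorted = np.minimum(np.maximum.accumulate(adjusted_sorted), 1.0)
    # Put the adjusted values back in the original order
    adjusted = np.empty(m)
    adjusted[order] = adjusted_sorted
    return adjusted

p_unc = [0.015812, 0.733511, 0.342286]  # X-Y, X-Z, Y-Z from the table
p_corr = holm_adjust(p_unc)
print(np.round(p_corr, 4))
```

The smallest p-value is multiplied by 3, the next by 2, and the largest by 1, which matches the `p-corr` values shown above up to rounding.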
