# Correlations

## Pairwise correlations

In the first part of this notebook, we will see how to compute pairwise correlation coefficients across the columns of a pandas DataFrame using the [pairwise_corr](https://raphaelvallat.github.io/pingouin/build/html/generated/pingouin.pairwise_corr.html#pingouin.pairwise_corr) function.

To do so, we will first load an example dataset in which each row represents one subject and each column represents a score on one of the well-known Big Five personality traits. There are 500 subjects in total.

In [1]:
from pingouin.datasets import read_dataset

df = read_dataset('pairwise_corr')

# Remove the 'Subject' column
df.drop(columns='Subject', inplace=True)

# Print the first lines
df.head()

Unnamed: 0,Neuroticism,Extraversion,Openness,Agreeableness,Conscientiousness
0,2.47917,4.20833,3.9375,3.95833,3.45833
1,2.60417,3.1875,3.95833,3.39583,3.22917
2,2.8125,2.89583,3.41667,2.75,3.5
3,2.89583,3.5625,3.52083,3.16667,2.79167
4,3.02083,3.33333,4.02083,3.20833,2.85417


Let's see whether the personality dimensions are correlated. For that, we will compute the pairwise correlations between all the columns of the DataFrame.

By default, the function returns the two-sided Pearson's correlation coefficients. This can be adjusted using the `tail` and `method` arguments. In addition, the output dataframe contains:

1. the parametric 95% confidence intervals of the r value (`CI95%`)
2. the R<sup>2</sup> (= coefficient of determination, `r2`)
3. the adjusted R<sup>2</sup> (`adj_r2`)
4. the standardized (Z-transformed) correlation coefficients (`z`)
5. the uncorrected p-values (`p-unc`)
6. the Bayes Factor for the alternative hypothesis (`BF10`)

In the example below, we can see that the strongest correlation between personality dimensions is the negative correlation between `Neuroticism` and `Conscientiousness`, as indicated by the correlation coefficient (-0.368), the p-value (1.76e-17) and the Bayes Factor (1.81e14).

In [2]:
from pingouin import pairwise_corr
pairwise_corr(df)

Unnamed: 0,X,Y,method,tail,r,CI95%,r2,adj_r2,z,p-unc,BF10
0,Neuroticism,Extraversion,pearson,two-sided,-0.35,"[-0.42, -0.27]",0.123,0.119,-0.365,7.323047e-16,4592461000000.0
1,Neuroticism,Openness,pearson,two-sided,-0.01,"[-0.1, 0.08]",0.0,-0.004,-0.01,0.816854,0.037
2,Neuroticism,Agreeableness,pearson,two-sided,-0.134,"[-0.22, -0.05]",0.018,0.014,-0.135,0.002615436,3.286
3,Neuroticism,Conscientiousness,pearson,two-sided,-0.368,"[-0.44, -0.29]",0.135,0.132,-0.386,1.7589680000000002e-17,180831000000000.0
4,Extraversion,Openness,pearson,two-sided,0.267,"[0.18, 0.35]",0.071,0.068,0.274,1.287742e-09,3481580.0
5,Extraversion,Agreeableness,pearson,two-sided,0.055,"[-0.03, 0.14]",0.003,-0.001,0.055,0.2233908,0.075
6,Extraversion,Conscientiousness,pearson,two-sided,0.065,"[-0.02, 0.15]",0.004,0.0,0.065,0.1492461,0.101
7,Openness,Agreeableness,pearson,two-sided,0.159,"[0.07, 0.24]",0.025,0.021,0.16,0.0003516781,21.015
8,Openness,Conscientiousness,pearson,two-sided,-0.013,"[-0.1, 0.07]",0.0,-0.004,-0.013,0.7641957,0.037
9,Agreeableness,Conscientiousness,pearson,two-sided,0.159,"[0.07, 0.24]",0.025,0.021,0.16,0.0003685092,20.117
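
Several of the derived columns in the table above follow directly from `r` and the sample size. The sketch below recomputes them for the `Neuroticism` / `Conscientiousness` pair using only the Python standard library. It assumes the textbook Fisher z-transform for `z` and `CI95%`, a commonly used adjusted R<sup>2</sup> formula, and n = 500 subjects; Pingouin's internal implementation may differ in detail, and the helper name `describe_r` is hypothetical.

```python
import math
from statistics import NormalDist


def describe_r(r, n, confidence=0.95):
    """Recompute r2, adj_r2, Fisher z and a parametric CI from r and n.

    Hypothetical helper for illustration only; not part of Pingouin.
    """
    r2 = r ** 2
    # Assumed adjustment formula for two variables and n observations
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - 3)
    # Fisher z-transform of the correlation coefficient
    z = math.atanh(r)
    # Standard error of z and the two-sided normal critical value
    se = 1 / math.sqrt(n - 3)
    crit = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    # Build the interval on the z scale, then map it back to the r scale
    ci = (math.tanh(z - crit * se), math.tanh(z + crit * se))
    return {"r2": r2, "adj_r2": adj_r2, "z": z, "ci": ci}


stats = describe_r(r=-0.368, n=500)
print(round(stats["r2"], 3), round(stats["adj_r2"], 3), round(stats["z"], 3))
print([round(bound, 2) for bound in stats["ci"]])
```

With these inputs, the rounded values agree with the corresponding row of the table (`r2` = 0.135, `adj_r2` = 0.132, `z` = -0.386, `CI95%` = [-0.44, -0.29]).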


### Non-parametric correlations
If your data do not follow a normal distribution, you may want to use a non-parametric method such as the Spearman rank-correlation.

In the example below, we compute the one-sided Spearman pairwise correlations between a subset of columns. Note that the Bayes Factor is only computed when using the Pearson method and is therefore not present in the table below.

In [3]:
pairwise_corr(data=df, columns=['Neuroticism', 'Extraversion'], tail='one-sided', method='spearman')

Unnamed: 0,X,Y,method,tail,r,CI95%,r2,adj_r2,z,p-unc
0,Neuroticism,Extraversion,spearman,one-sided,-0.325,"[-0.4, -0.24]",0.106,0.102,-0.337,4.192429e-14
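
To make the idea of a rank correlation concrete, here is a minimal standard-library sketch of the usual definition: Spearman's coefficient is the Pearson correlation computed on the ranks of the data, with tied values sharing their average rank. The helper names and the toy data are made up for illustration and are not part of Pingouin.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def average_ranks(values):
    """Rank values from 1 to n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of tied values
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Made-up toy data: perfectly monotonic but strongly non-linear
x = [1, 2, 3, 4, 5, 6]
y = [1, 2, 4, 8, 16, 200]

r_pearson = pearson(x, y)
r_spearman = spearman(x, y)
print(round(r_pearson, 3), round(r_spearman, 3))
```

Because the toy relationship is monotonic, the rank-based coefficient is essentially 1, whereas the Pearson coefficient is pulled down by the extreme last value.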


### Robust correlations
If you believe that your dataset contains outliers, you can use a robust correlation method. There are currently three robust correlation methods implemented in Pingouin, namely the percentage bend correlation ([Wilcox 1994](https://link.springer.com/article/10.1007/BF02294395)), Shepherd's pi correlation ([Schwarzkopf et al. 2012](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3397314/)) and the skipped correlation ([Rousselet and Pernet 2012](https://www.frontiersin.org/articles/10.3389/fnhum.2012.00119/full)).

While the first method is particularly well-suited to univariate outliers (i.e. outliers present in only one variable), the latter two work well with multivariate outliers. Note that the skipped correlation requires the scikit-learn package. To learn more, see the documentation of the [pairwise_corr](https://raphaelvallat.github.io/pingouin/build/html/generated/pingouin.pairwise_corr.html#pingouin.pairwise_corr) function.

In [4]:
# Introduce four outliers in variable X
df.loc[[5, 12, 24, 58], 'Neuroticism'] = 18

# Percentage bend correlation
pairwise_corr(data=df, columns=['Neuroticism', 'Extraversion'], method='percbend')

Unnamed: 0,X,Y,method,tail,r,CI95%,r2,adj_r2,z,p-unc
0,Neuroticism,Extraversion,percbend,two-sided,-0.327,"[-0.4, -0.25]",0.107,0.104,-0.339,5.985071e-14


In [5]:
# Shepherd's correlation
pairwise_corr(data=df, columns=['Neuroticism', 'Extraversion'], method='shepherd')

Unnamed: 0,X,Y,method,tail,r,CI95%,r2,adj_r2,z,p-unc
0,Neuroticism,Extraversion,shepherd,two-sided,-0.319,"[-0.4, -0.24]",0.102,0.098,-0.331,6.790904e-13


### Correction for multiple comparisons
Finally, if you are computing a large number of correlation coefficients, you might want to correct the p-values for multiple comparisons. This can be done with the `padjust` argument:

In [6]:
pairwise_corr(df, method='spearman', padjust="holm").round(3)

Unnamed: 0,X,Y,method,tail,r,CI95%,r2,adj_r2,z,p-unc,p-corr,p-adjust
0,Neuroticism,Extraversion,spearman,two-sided,-0.33,"[-0.41, -0.25]",0.109,0.105,-0.343,0.0,0.0,holm
1,Neuroticism,Openness,spearman,two-sided,-0.02,"[-0.11, 0.07]",0.0,-0.004,-0.02,0.662,1.0,holm
2,Neuroticism,Agreeableness,spearman,two-sided,-0.132,"[-0.22, -0.04]",0.017,0.014,-0.133,0.003,0.015,holm
3,Neuroticism,Conscientiousness,spearman,two-sided,-0.365,"[-0.44, -0.29]",0.133,0.129,-0.383,0.0,0.0,holm
4,Extraversion,Openness,spearman,two-sided,0.243,"[0.16, 0.32]",0.059,0.055,0.248,0.0,0.0,holm
5,Extraversion,Agreeableness,spearman,two-sided,0.062,"[-0.03, 0.15]",0.004,-0.0,0.062,0.166,0.666,holm
6,Extraversion,Conscientiousness,spearman,two-sided,0.056,"[-0.03, 0.14]",0.003,-0.001,0.056,0.213,0.666,holm
7,Openness,Agreeableness,spearman,two-sided,0.17,"[0.08, 0.25]",0.029,0.025,0.172,0.0,0.001,holm
8,Openness,Conscientiousness,spearman,two-sided,-0.007,"[-0.09, 0.08]",0.0,-0.004,-0.007,0.88,1.0,holm
9,Agreeableness,Conscientiousness,spearman,two-sided,0.161,"[0.07, 0.24]",0.026,0.022,0.162,0.0,0.002,holm
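
For readers who want to see what the Holm correction does to the uncorrected p-values, here is a minimal standard-library sketch of the step-down procedure as it is usually described. The function name `holm` and the p-values are made up for illustration, and this is not necessarily how Pingouin implements it internally.

```python
def holm(pvals):
    """Holm step-down adjusted p-values, returned in the input order."""
    m = len(pvals)
    # Indices of the p-values sorted from smallest to largest
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # The k-th smallest p-value (k = rank + 1) is multiplied by m - k + 1
        candidate = min(1.0, (m - rank) * pvals[idx])
        # Enforce monotonicity: adjusted p-values never decrease with rank
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted


# Made-up p-values for illustration
p_unc = [0.01, 0.04, 0.03, 0.005]
p_corr = holm(p_unc)
print([round(p, 3) for p in p_corr])
```

The monotonicity step means that two tests can end up with the same corrected p-value, which is consistent with the two identical `p-corr` values of 0.666 in the table above.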


***
## Partial correlation

In some cases, you will want to measure the correlation between two variables whilst controlling for the potential influence of other variables (also known as covariates). This can be done easily using the [partial_corr](https://raphaelvallat.github.io/pingouin/build/html/generated/pingouin.partial_corr.html#pingouin.partial_corr) function.

To illustrate this, we will append two new (fake) columns to our dataframe with the age and Body Mass Index (BMI) of each subject:

In [7]:
import numpy as np

# Fix the seed so that the fake columns are reproducible
np.random.seed(123)
df['Age'] = np.random.randint(18, 70, size=df.shape[0])
df['BMI'] = np.random.randint(18, 45, size=df.shape[0])

df.head()

Unnamed: 0,Neuroticism,Extraversion,Openness,Agreeableness,Conscientiousness,Age,BMI
0,2.47917,4.20833,3.9375,3.95833,3.45833,63,28
1,2.60417,3.1875,3.95833,3.39583,3.22917,20,36
2,2.8125,2.89583,3.41667,2.75,3.5,46,25
3,2.89583,3.5625,3.52083,3.16667,2.79167,52,27
4,3.02083,3.33333,4.02083,3.20833,2.85417,56,33


In [8]:
from pingouin import partial_corr

# Correlation between extraversion and openness whilst controlling for age:
partial_corr(data=df, x='Extraversion', y='Openness', covar='Age', method='pearson')

Unnamed: 0,r,CI95%,r2,adj_r2,p-val,BF10
pearson,0.267,"[0.18, 0.35]",0.072,0.068,1.229016e-09,3643495.714
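
As a rough illustration of what "controlling for" a single covariate means, the sketch below computes a partial correlation in two equivalent ways: with the standard first-order formula based on the three pairwise Pearson correlations, and by correlating the residuals left after regressing each variable on the covariate. It uses only the standard library and made-up toy data; the helper names are hypothetical, and no claim is made about how Pingouin computes this internally.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def residuals(y, z):
    """Residuals of a simple least-squares regression of y on z."""
    n = len(y)
    mz, my = sum(z) / n, sum(y) / n
    slope = (sum((a - mz) * (b - my) for a, b in zip(z, y))
             / sum((a - mz) ** 2 for a in z))
    intercept = my - slope * mz
    return [b - (intercept + slope * a) for a, b in zip(z, y)]


def partial_r_formula(x, y, z):
    """First-order partial correlation of x and y, controlling for z."""
    r_xy, r_xz, r_yz = pearson(x, y), pearson(x, z), pearson(y, z)
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))


# Made-up toy data: x and y both increase with the covariate z
z = [1, 2, 3, 4, 5, 6, 7, 8]
x = [2, 1, 4, 3, 6, 5, 8, 7]
y = [1, 3, 2, 5, 4, 7, 6, 9]

r_raw = pearson(x, y)
r_formula = partial_r_formula(x, y, z)
r_resid = pearson(residuals(x, z), residuals(y, z))
print(round(r_raw, 3), round(r_formula, 3), round(r_resid, 3))
```

In this toy example the raw correlation between `x` and `y` is positive only because both follow `z`; once `z` is accounted for, the partial correlation is negative, and both computations give the same value.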


In [9]:
# Correlation between extraversion and openness whilst controlling for age and BMI:
partial_corr(data=df, x='Extraversion', y='Openness', covar=['Age', 'BMI'], method='pearson')

Unnamed: 0,r,CI95%,r2,adj_r2,p-val,BF10
pearson,0.266,"[0.18, 0.35]",0.071,0.067,1.531824e-09,2940187.983
