In [1]:
# %pip install scikit_posthocs

In [2]:
import scipy.stats as stats
import pandas as pd
import scikit_posthocs as sp
from sklearn.preprocessing import LabelEncoder


# Correlation Measures
This notebook enumerates various methods for computing correlations between features/variables. The focus is to look at comparable methods for the three different cases: discrete-continuous, discrete-discrete and continuous-continuous.

# discrete-discrete
There are two types of correlation measures for this case:
* Distance metrics: Euclidean, Manhattan
* Statistical metrics: Goodman-Kruskal's lambda, chi-square test for use on contingency tables

### Distance metrics
The metrics below aren't comparable when the features being compared have different numbers of categories/values:
* Sum of Absolute Distance
* Sum of Squared Distance
* Mean-Absolute Error
* Euclidean Distance: with centered and scaled (z-scored) data, the squared distance is a linear function of the Pearson correlation, so converting between the two is easy
* Manhattan Distance
* Chessboard Distance
* Minkowski Distance
* Canberra Distance
* Cosine Distance
* Hamming Distance
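
To make the list above concrete, here is a minimal sketch that computes each distance on two small integer-encoded features. The vectors `a` and `b` are made-up illustration data, not taken from this notebook, and the `scipy.spatial.distance` functions are one convenient choice among several. The last lines check the Euclidean-to-Pearson relation for z-scored data, $d^2 = 2n(1 - r)$.

```python
import numpy as np
from scipy.spatial import distance

# Two hypothetical categorical features, already integer-encoded (illustration only).
a = np.array([0, 1, 2, 1, 0, 2])
b = np.array([0, 2, 2, 1, 1, 2])

distances = {
    "sum_abs": np.abs(a - b).sum(),           # sum of absolute differences
    "sum_sq": ((a - b) ** 2).sum(),           # sum of squared differences
    "mae": np.abs(a - b).mean(),              # mean absolute error
    "euclidean": distance.euclidean(a, b),
    "manhattan": distance.cityblock(a, b),
    "chessboard": distance.chebyshev(a, b),
    "minkowski_p3": distance.minkowski(a, b, p=3),
    "canberra": distance.canberra(a, b),
    "cosine": distance.cosine(a, b),
    "hamming": distance.hamming(a, b),        # fraction of positions that differ
}

# Euclidean distance on z-scored data versus Pearson correlation.
za = (a - a.mean()) / a.std()                 # population std (ddof=0)
zb = (b - b.mean()) / b.std()
r = np.corrcoef(a, b)[0, 1]
d_squared = distance.euclidean(za, zb) ** 2   # should equal 2 * n * (1 - r)

for name, value in distances.items():
    print(f"{name}: {value:.4f}")
print(d_squared, 2 * len(a) * (1 - r))
```

Note that most of these distances treat the integer codes as magnitudes, which is only meaningful for ordinal categories; Hamming distance is the one that compares labels purely for equality.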

### Contingency table analysis
* Goodman-Kruskal's lambda
* Cramér's V: biased upward, especially for small samples; use a bias-corrected version
* Phi coefficient
* Tschuprow's T
* Contingency coefficient C: its maximum value depends on the size of the contingency table
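
As a rough sketch of how the measures above relate to the chi-square statistic, the block below builds a contingency table with `pd.crosstab` and derives each coefficient by hand. The `color`/`size` data is hypothetical, and the bias correction shown is the Bergsma (2013) adjustment, which is an assumption about which "correction method" is intended.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Hypothetical categorical data, for illustration only.
df_cat = pd.DataFrame({
    "color": ["red", "red", "blue", "blue", "green", "green", "red", "blue", "green", "red"],
    "size": ["S", "M", "S", "L", "M", "L", "S", "L", "M", "S"],
})
table = pd.crosstab(df_cat["color"], df_cat["size"])
counts = table.to_numpy()

# correction=False disables Yates' continuity correction (only relevant for 2x2 tables).
chi2, p, dof, expected = stats.chi2_contingency(counts, correction=False)
n = counts.sum()
r, k = counts.shape
phi2 = chi2 / n  # squared phi coefficient

cramers_v = np.sqrt(phi2 / min(r - 1, k - 1))
tschuprows_t = np.sqrt(phi2 / np.sqrt((r - 1) * (k - 1)))
contingency_c = np.sqrt(chi2 / (chi2 + n))

# Bias-corrected Cramér's V (Bergsma's adjustment, assumed here).
phi2_corr = max(0.0, phi2 - (r - 1) * (k - 1) / (n - 1))
r_corr = r - (r - 1) ** 2 / (n - 1)
k_corr = k - (k - 1) ** 2 / (n - 1)
cramers_v_corr = np.sqrt(phi2_corr / min(r_corr - 1, k_corr - 1))

# Goodman-Kruskal's lambda for predicting the row variable (color) from the column variable (size).
max_row_total = counts.sum(axis=1).max()
gk_lambda = (counts.max(axis=0).sum() - max_row_total) / (n - max_row_total)

print(cramers_v, cramers_v_corr, tschuprows_t, contingency_c, gk_lambda)
```

Unlike the chi-square-based coefficients, lambda is asymmetric: swapping the roles of the two variables generally gives a different value.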

# discrete-continuous
This is one of the less commonly documented cases. 
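* Point Biserial Correlation: range [-1, 1]; assumes the continuous variable is normally distributed and the discrete variable is truly dichotomous. If the dichotomous variable was created by discretizing continuous data, Biserial Correlation is the better choice, because the point-biserial coefficient then underestimates the association with the underlying continuous variable
* Logistic Regression: assumes a linear relation between the features and the logit of the outcome. The assumptions of normality and homoscedasticity are relaxed
* Kruskal-Wallis H-test: a non-parametric alternative to one-way ANOVA; unlike ANOVA, it doesn't assume normally distributed residuals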

### Point Biserial Correlation

In [3]:
x = [0, 1, 1, 0, 0, 0, 1, 0, 1, 1, 0]
y = [12, 14, 17, 17, 11, 22, 23, 11, 19, 8, 12]
stats.pointbiserialr(x, y)

SignificanceResult(statistic=np.float64(0.2181634545788746), pvalue=np.float64(0.5192842928773611))

The point-biserial correlation coefficient is 0.21816 and the corresponding p-value is 0.51928.

Since the correlation coefficient is positive, the variable y tends to take on higher values when the variable x takes on the value “1” than when it takes on the value “0.”

Since the p-value of this correlation is not less than .05, this correlation is not statistically significant. 
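
The point-biserial coefficient is just the Pearson correlation with the dichotomous variable coded as 0/1. The sketch below checks this on the same `x` and `y`, and also against the closed-form expression $r_{pb} = \frac{M_1 - M_0}{s_n}\sqrt{\frac{n_1 n_0}{n^2}}$, where $s_n$ is the population standard deviation of `y`. The variable names other than `x` and `y` are introduced here for illustration.

```python
import numpy as np
from scipy import stats

x = np.array([0, 1, 1, 0, 0, 0, 1, 0, 1, 1, 0])
y = np.array([12, 14, 17, 17, 11, 22, 23, 11, 19, 8, 12])

r_pb, p_pb = stats.pointbiserialr(x, y)
r_pearson, p_pearson = stats.pearsonr(x, y)

# Closed form: difference of group means, scaled by the population std of y.
m1, m0 = y[x == 1].mean(), y[x == 0].mean()
n1, n0, n = (x == 1).sum(), (x == 0).sum(), len(x)
r_manual = (m1 - m0) / y.std(ddof=0) * np.sqrt(n1 * n0 / n ** 2)

print(r_pb, r_pearson, r_manual)
```

All three values should agree up to floating-point error, which also makes the sign interpretation explicit: the coefficient is positive exactly when the mean of `y` is higher in the group where `x` is 1.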

### Kruskal-Wallis

In [7]:
df_ = pd.DataFrame.from_dict({
    "group1": [7, 14, 14, 13, 12, 9, 6, 14, 12, 8],
    "group2": [15, 17, 13, 15, 15, 13, 9, 12, 10, 8],
    "group3": [6, 8, 8, 9, 5, 14, 13, 8, 10, 9],
})
df_

Unnamed: 0,group1,group2,group3
0,7,15,6
1,14,17,8
2,14,13,8
3,13,15,9
4,12,15,5
5,9,13,14
6,6,9,13
7,14,12,8
8,12,10,10
9,8,8,9


`stack` takes all columns and 'stacks' them into a narrow (long-format) dataframe. After `reset_index`, `level_0` is the original index, `level_1` is the column name, and `0` is the value of the original column at that index.

In [6]:
df = df_.stack().reset_index()
df

Unnamed: 0,level_0,level_1,0
0,0,group1,7
1,0,group2,15
2,0,group3,6
3,1,group1,14
4,1,group2,17
5,1,group3,8
6,2,group1,14
7,2,group2,13
8,2,group3,8
9,3,group1,13


In [None]:
df = df.drop('level_0', axis=1)
df = df.rename(columns={'level_1': 'group', 0: 'value'})
df.head()
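
For comparison, the same wide-to-long reshaping can be sketched with `DataFrame.melt`, which names the output columns directly. The small `df_wide` frame below is a shortened stand-in for `df_`, and `long_stack`/`long_melt` are names introduced for this example.

```python
import pandas as pd

# Shortened stand-in for the wide dataframe used in this notebook.
df_wide = pd.DataFrame({
    "group1": [7, 14, 14],
    "group2": [15, 17, 13],
    "group3": [6, 8, 8],
})

# Route used in this notebook: stack, then clean up the generated column names.
long_stack = (
    df_wide.stack()
    .reset_index()
    .drop("level_0", axis=1)
    .rename(columns={"level_1": "group", 0: "value"})
)

# Alternative: melt produces the 'group' and 'value' columns in one step.
long_melt = df_wide.melt(var_name="group", value_name="value")

# The two results hold the same rows in a different order
# (stack goes row by row, melt goes column by column).
by_keys = ["group", "value"]
same = (
    long_stack.sort_values(by_keys).reset_index(drop=True)
    .equals(long_melt.sort_values(by_keys).reset_index(drop=True))
)
print(same)
```

Row order does not matter for the rank-based tests used here, so either route should work.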

In [11]:
le = LabelEncoder()
le.fit(df['group'])
df['group'] = le.transform(df['group'])
df.head()

Unnamed: 0,group,value
0,0,7
1,1,15
2,2,6
3,0,14
4,1,17


In [12]:
stats.kruskal(df_['group1'], df_['group2'], df_['group3'])

KruskalResult(statistic=np.float64(6.287801578353988), pvalue=np.float64(0.043114289703508814))
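
To show what the H statistic measures, here is a sketch that recomputes it by hand: pool all observations, rank them (ties get average ranks), compare the per-group rank sums, and apply the standard tie correction. The p-value uses the chi-square approximation with (number of groups - 1) degrees of freedom. Variable names are introduced for this example.

```python
import numpy as np
from scipy import stats

groups = [
    [7, 14, 14, 13, 12, 9, 6, 14, 12, 8],
    [15, 17, 13, 15, 15, 13, 9, 12, 10, 8],
    [6, 8, 8, 9, 5, 14, 13, 8, 10, 9],
]

pooled = np.concatenate(groups)
ranks = stats.rankdata(pooled)  # tied values receive their average rank
n_total = len(pooled)

# Split the pooled ranks back into their original groups.
sizes = [len(g) for g in groups]
rank_groups = np.split(ranks, np.cumsum(sizes)[:-1])

# Uncorrected H statistic from the per-group rank sums.
rank_term = sum(rg.sum() ** 2 / len(rg) for rg in rank_groups)
h = 12.0 / (n_total * (n_total + 1)) * rank_term - 3 * (n_total + 1)

# Tie correction: 1 - sum(t^3 - t) / (N^3 - N), with t the size of each tie group.
_, tie_counts = np.unique(pooled, return_counts=True)
tie_correction = 1 - (tie_counts ** 3 - tie_counts).sum() / (n_total ** 3 - n_total)
h_corrected = h / tie_correction

p_value = stats.chi2.sf(h_corrected, df=len(groups) - 1)
print(h_corrected, p_value)
```

The result should match `stats.kruskal` on the same three groups.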

In [13]:
df.groupby('group').median()

group,value
0,12.0
1,13.0
2,8.5


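The Kruskal-Wallis test only indicates that at least one group differs from the others. Use pairwise Conover tests (with Holm-adjusted p-values) to determine which groups differ.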

In [13]:
sp.posthoc_conover(df, val_col='value', group_col='group', p_adjust='holm')

Unnamed: 0,0,1,2
0,1.0,0.317264,0.317264
1,0.317264,1.0,0.03286
2,0.317264,0.03286,1.0


In [15]:
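# Dunn's test with Bonferroni correction as an alternative post-hoc test.
# Passing a list of samples labels the groups 1, 2, 3 in input order.
posthoc = sp.posthoc_dunn([df_['group1'], df_['group2'], df_['group3']], p_adjust='bonferroni')
posthoc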

Unnamed: 0,1,2,3
1,1.0,0.550846,0.718451
2,0.550846,1.0,0.036633
3,0.718451,0.036633,1.0


# continuous-continuous
* Spearman
* Kendall
* Pearson
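
A minimal sketch of the three coefficients using `scipy.stats`, on synthetic data generated here purely for illustration (a monotonic but non-linear relationship with a little noise). The seed, sample size, and variable names are arbitrary assumptions.

```python
import numpy as np
from scipy import stats

# Synthetic data: v increases monotonically, but not linearly, with u.
rng = np.random.default_rng(0)
u = rng.normal(size=200)
v = np.exp(u) + rng.normal(scale=0.1, size=200)

pearson_r, pearson_p = stats.pearsonr(u, v)        # linear association
spearman_rho, spearman_p = stats.spearmanr(u, v)   # rank-based, monotonic association
kendall_tau, kendall_p = stats.kendalltau(u, v)    # based on concordant/discordant pairs

print(f"Pearson r:    {pearson_r:.3f}")
print(f"Spearman rho: {spearman_rho:.3f}")
print(f"Kendall tau:  {kendall_tau:.3f}")
```

Pearson measures linear association, while Spearman and Kendall depend only on ranks, so they are typically less affected by a non-linear but monotonic relationship like this one.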