# Day 15 Workout - Bivariate Statistics

The following dataset contains information about patients with heart and blood conditions that are commonly associated with heart attacks. Your task is to determine how strongly each collected feature is correlated with the occurrence of a heart attack.

The label column `num` is 1 if the patient had a heart attack and 0 if not.

In [51]:
import pandas as pd

df = pd.read_csv('data/heart_attack_clean.csv')

In [61]:
# Rename the last column (the label) to 'num'
df.rename(columns={df.columns[-1]: 'num'}, inplace=True)

Create a correlation matrix of all numeric features.

In [63]:
corr_matrix = df.corr()
corr_matrix

| . | age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | num |
| - | - | - | - | - | - | - | - | - | - | - | - |
| age | 1.0 | 0.020133 | 0.142592 | 0.257889 | 0.096937 | 0.198526 | 0.052657 | -0.460095 | 0.264962 | 0.208633 | 0.160249 |
| sex | 0.020133 | 1.0 | 0.217588 | 0.094937 | 0.055653 | 0.067651 | -0.081372 | -0.073062 | 0.129519 | 0.120925 | 0.249531 |
| cp | 0.142592 | 0.217588 | 1.0 | 0.079504 | 0.161049 | 0.044556 | 0.006512 | -0.390128 | 0.481323 | 0.360063 | 0.503254 |
| trestbps | 0.257889 | 0.094937 | 0.079504 | 1.0 | 0.11689 | 0.115005 | 0.02225 | -0.220708 | 0.231742 | 0.229117 | 0.148295 |
| chol | 0.096937 | 0.055653 | 0.161049 | 0.11689 | 1.0 | 0.124957 | 0.056886 | -0.136292 | 0.172802 | 0.113572 | 0.217929 |
| fbs | 0.198526 | 0.067651 | 0.044556 | 0.115005 | 0.124957 | 1.0 | 0.022119 | -0.082902 | 0.125333 | 0.069241 | 0.178642 |
| restecg | 0.052657 | -0.081372 | 0.006512 | 0.02225 | 0.056886 | 0.022119 | 1.0 | -0.011117 | 0.056668 | 0.023457 | -0.019413 |
| thalach | -0.460095 | -0.073062 | -0.390128 | -0.220708 | -0.136292 | -0.082902 | -0.011117 | 1.0 | -0.425644 | -0.327207 | -0.345074 |
| exang | 0.264962 | 0.129519 | 0.481323 | 0.231742 | 0.172802 | 0.125333 | 0.056668 | -0.425644 | 1.0 | 0.641122 | 0.55786 |
| oldpeak | 0.208633 | 0.120925 | 0.360063 | 0.229117 | 0.113572 | 0.069241 | 0.023457 | -0.327207 | 0.641122 | 1.0 | 0.565669 |
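
Each cell of the matrix is a Pearson coefficient: the covariance of two columns divided by the product of their standard deviations. The sketch below recomputes one coefficient by hand and compares it with `DataFrame.corr()`. It uses synthetic stand-in data rather than the heart attack dataset, and `pearson_r` and `toy` are illustrative names, not part of the workout.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data (NOT the heart attack dataset)
rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 0.5 * x + rng.normal(size=200)
toy = pd.DataFrame({"x": x, "y": y})


def pearson_r(a, b):
    """Pearson r: sum of co-deviations over the root of the product of squared deviations."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    a_centered = a - a.mean()
    b_centered = b - b.mean()
    return (a_centered * b_centered).sum() / np.sqrt(
        (a_centered ** 2).sum() * (b_centered ** 2).sum()
    )


manual_r = pearson_r(toy["x"], toy["y"])
pandas_r = toy.corr().loc["x", "y"]  # pandas uses Pearson by default
print(manual_r, pandas_r)
```

The two numbers should agree up to floating-point error.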


Which feature is most (and least) correlated with whether or not the patient has had a heart attack?

In [65]:
# Rank features by the strength (absolute value) of their correlation with the label
corr_matrix['num'].abs().sort_values(ascending=False)

num         1.000000
oldpeak     0.565669
exang       0.557860
cp          0.503254
thalach     0.345074
sex         0.249531
chol        0.217929
fbs         0.178642
age         0.160249
trestbps    0.148295
restecg     0.019413
Name: num, dtype: float64
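
Taking the absolute value hides the direction of each relationship (in the matrix, thalach and restecg are negatively correlated with num). One possible alternative, sketched here with four coefficients copied from the matrix, is to sort by absolute value while keeping the signed numbers. It relies on the `key` argument of `Series.sort_values`, which requires pandas 1.1 or later.

```python
import pandas as pd

# Four coefficients copied from the 'num' column of the correlation matrix
corr_with_num = pd.Series(
    {"age": 0.160249, "restecg": -0.019413, "thalach": -0.345074, "oldpeak": 0.565669}
)

# Order by magnitude, but keep the original signed values in the result
ranked = corr_with_num.sort_values(key=lambda s: s.abs(), ascending=False)
print(ranked)
```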

Let's check the assumptions behind a correlation:

1. Check the skewness of each variable

In [66]:
df.skew()

age        -0.295159
sex        -1.074823
cp         -0.261933
trestbps    0.747295
chol        1.447011
fbs         3.307711
restecg     1.962358
thalach    -0.126144
exang       0.786107
oldpeak     1.511448
num         0.517266
dtype: float64

Calculate the p-value of the correlation between each feature and the label ('num'). Make sure not to calculate the p-value of the label with itself. Sort the results from lowest to highest p-value.

In [71]:
from scipy import stats
import numpy as np

In [68]:
# Collect the p-value for every feature except the label itself, then sort ascending
p_values = {col: stats.pearsonr(df['num'], df[col])[1] for col in df.columns if col != 'num'}
for col, p in sorted(p_values.items(), key=lambda item: item[1]):
    print(f'The p-value for the {col} feature is: {round(p, 4)}')

The p-value for the oldpeak feature is: 0.0
The p-value for the exang feature is: 0.0
The p-value for the cp feature is: 0.0
The p-value for the thalach feature is: 0.0
The p-value for the sex feature is: 0.0
The p-value for the chol feature is: 0.0004
The p-value for the fbs feature is: 0.0038
The p-value for the age feature is: 0.0095
The p-value for the trestbps feature is: 0.0165
The p-value for the restecg feature is: 0.7549
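
A p-value printed as 0.0 is only zero after rounding to four decimals. To show where these numbers come from, the sketch below derives the two-sided p-value from r and the sample size n with the t statistic t = r * sqrt((n - 2) / (1 - r^2)) on n - 2 degrees of freedom, and compares it with the value returned by `stats.pearsonr`. The data are synthetic, not the heart attack dataset.

```python
import numpy as np
from scipy import stats

# Synthetic stand-in data (NOT the heart attack dataset)
rng = np.random.default_rng(1)
n = 100
x = rng.normal(size=n)
y = 0.3 * x + rng.normal(size=n)

r, p_scipy = stats.pearsonr(x, y)

# Two-sided p-value from the t distribution with n - 2 degrees of freedom
t_stat = r * np.sqrt((n - 2) / (1 - r ** 2))
p_manual = 2 * stats.t.sf(abs(t_stat), df=n - 2)

print(r, p_scipy, p_manual)
```

Because n is the same for every feature, a larger absolute r always gives a smaller p-value, which is why the p-value ordering mirrors the ranking by absolute correlation.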


In [76]:
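### Did skewness change anything?

# oldpeak is right-skewed (skew of about 1.51), so try three skew-reducing transforms
df['oldpeaksq'] = np.sqrt(df['oldpeak'])     # square root
df['oldpeakcube'] = np.cbrt(df['oldpeak'])   # cube root
df['oldpeaklog'] = np.log1p(df['oldpeak'])   # log(1 + x)

# Skewness of each transformed column
print(df['oldpeaksq'].skew())
print(df['oldpeakcube'].skew())
print(df['oldpeaklog'].skew())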


In [79]:
df['oldpeakcube'].corr(df['num'])

0.5715154748492568
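
The cube-root version of oldpeak gives r = 0.5715 against 0.5657 for the untransformed column, so reducing the skew changes the correlation only slightly here. As a rough illustration of what the three transforms do to skewness itself, the sketch below applies them to a synthetic right-skewed sample (exponentially distributed, an assumption made only for this example). Note that `np.sqrt` returns NaN for negative inputs and `np.log1p` for inputs below -1, so these transforms suit non-negative columns.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed sample (NOT the oldpeak column)
rng = np.random.default_rng(42)
raw = pd.Series(rng.exponential(scale=1.0, size=2000))

transforms = {
    "raw": raw,
    "sqrt": np.sqrt(raw),
    "cbrt": np.cbrt(raw),
    "log1p": np.log1p(raw),
}

# Sample skewness of each version, computed the same way as df.skew()
skews = pd.Series({name: values.skew() for name, values in transforms.items()})
print(skews.round(3))
```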

Search online to find out how to compute a Spearman correlation instead of a Pearson correlation with the SciPy stats package.

Create a DataFrame with the results of the Pearson and Spearman correlations and p-values for the heart attack dataset.

It should look like this:

| . | Pearson r | Pearson p | Spearman r | Spearman p |
| - | - | - | - | - |
|cp | .503 | .000 | .517 | .000 |
| etc | # | # | # | # |

In [82]:
# Compute both correlations (and their p-values) between each feature and the label
rows = {}

for col in df.columns:
    if col == 'num':
        continue  # skip the label itself

    pearson_r, pearson_p = stats.pearsonr(df['num'], df[col])
    spearman_r, spearman_p = stats.spearmanr(df['num'], df[col])

    rows[col] = {'Pearson r': pearson_r, 'Pearson p': pearson_p,
                 'Spearman r': spearman_r, 'Spearman p': spearman_p}

# One row per feature, one column per statistic
statsdf = pd.DataFrame.from_dict(rows, orient='index')

statsdf.round(4)



| . | Pearson r | Pearson p | Spearman r | Spearman p |
| - | - | - | - | - |
| age | 0.1602 | 0.0095 | 0.1447 | 0.0194 |
| sex | 0.2495 | 0.0 | 0.2495 | 0.0 |
| cp | 0.5033 | 0.0 | 0.517 | 0.0 |
| trestbps | 0.1483 | 0.0165 | 0.1567 | 0.0112 |
| chol | 0.2179 | 0.0004 | 0.1834 | 0.0029 |
| fbs | 0.1786 | 0.0038 | 0.1786 | 0.0038 |
| restecg | -0.0194 | 0.7549 | -0.0031 | 0.9605 |
| thalach | -0.3451 | 0.0 | -0.3363 | 0.0 |
| exang | 0.5579 | 0.0 | 0.5579 | 0.0 |
| oldpeak | 0.5657 | 0.0 | 0.5792 | 0.0 |
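
In the table above, sex, fbs and exang have identical Pearson and Spearman values. A likely explanation, assuming these three features are binary like the label, is that Spearman's coefficient is simply Pearson's coefficient computed on ranks, and ranking a two-valued variable only shifts and rescales it. The sketch below checks both points on synthetic data; `spearman_via_ranks`, `binary_feat` and `cont_feat` are illustrative names, not columns of the dataset.

```python
import numpy as np
from scipy import stats

# Synthetic stand-in data (NOT the heart attack dataset)
rng = np.random.default_rng(7)
n = 300
label = rng.integers(0, 2, size=n)                             # binary label
binary_feat = (rng.random(n) < 0.3 + 0.4 * label).astype(int)  # binary feature
cont_feat = np.exp(label + rng.normal(size=n))                 # skewed continuous feature


def spearman_via_ranks(a, b):
    """Spearman's rho as Pearson's r of the average ranks."""
    return stats.pearsonr(stats.rankdata(a), stats.rankdata(b))[0]


results = {}
for name, feat in [("binary", binary_feat), ("continuous", cont_feat)]:
    pearson_r, _ = stats.pearsonr(label, feat)
    spearman_r, _ = stats.spearmanr(label, feat)
    results[name] = (pearson_r, spearman_r, spearman_via_ranks(label, feat))
    print(name, results[name])
```

For the binary feature all three numbers should coincide, while for the skewed continuous feature Pearson and Spearman generally differ.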
