# Aim: 
## Exploring gender difference: whether there are different indicators of heart disease in males and females.

# Methods:
## Performed binary classification (heart disease absent or present) in each gender group
## 1. Classification: Logistic Regression with L1 penalty.
## 2. 10-Fold stratified cross-validation
## 3. Examined coefficients with a post-hoc chi-square test for independence: given an indicator value (e.g., having_chest_pain==0), are gender and disease presence independent?
### NB: The goal here is to examine potentially different heart disease indicators in each gender. It is NOT the current goal to examine the underlying physiological mechanisms of each disease indicator.
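
For reference, the statistic behind the post-hoc test in step 3 can be written as follows. This is the standard Pearson chi-square formula applied to a 2×2 table (gender × disease status, counting only the rows with the given indicator value). The symbols $O_{ij}$ (observed count), $E_{ij}$ (expected count under independence), and $N$ (table total) are notation introduced here for illustration.

$$
E_{ij} = \frac{\left(\sum_{j'} O_{ij'}\right)\left(\sum_{i'} O_{i'j}\right)}{N},
\qquad
\chi^2 = \sum_{i=1}^{2}\sum_{j=1}^{2} \frac{\left(O_{ij} - E_{ij}\right)^2}{E_{ij}},
\qquad
\mathrm{dof} = (2-1)(2-1) = 1
$$

Note that `scipy.stats.chi2_contingency` applies Yates' continuity correction by default when there is one degree of freedom, so the value it reports is somewhat smaller than the uncorrected formula above.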

# Results:
## 1. Both gender groups were classified well above chance (mean 10-fold accuracy: female 0.88, male 0.81).
## 2. Indicators common in both groups.
### - Thalassemia (thal): reversible blood flow defect present (thal==3), likely NO heart disease (NB: Some other notebooks indicate that this feature stands for thalassemia, but I could not find official information, and given that thalassemia is a specific genetic blood disorder, the values do not seem to make sense. In either case, I do not intend to interpret the physiological basis here.)


## 3. Distinct indicators between genders.

### a. Chest pain (cp):
#### - Female: "pain without relation to angina" (cp==2), likely heart disease.
#### - Male: Absence of chest pain (cp==0), likely NO heart disease.
### ==> i.e., for males, whether pain is present at all is already a good indicator, whereas for females that is not enough and a more specific kind of pain is needed.
### NB: There seem to be inconsistencies in the reported meanings of the values. Assumed here: 0=absent, 1=atypical angina, 2=pain without relation to angina, 3=typical angina. Again, we do not intend to interpret the physiological basis but focus on the difference between the genders.

### b. Major vessels seen on fluoroscopy (ca)
#### - Female: No vessels seen (ca==0), likely heart disease (similar trend for Male but smaller effect).
#### - Male: 2 vessels seen (ca==2), likely NO heart disease (marginal, p=0.06).

### c. Thalassemia (thal): 
#### - Female: fixed defect (thal==2), likely heart disease (similar trend for Male but smaller effect).

### d. Slope (of the peak exercise ST segment):
#### - Female: ascending slope (slope==2), likely heart disease.
#### - Male: flat slope (slope==1), likely NO heart disease.
### ==> The informative slope category differs between males and females.

# Conclusions and Discussion
## The indicators of heart disease showed very different degrees of sensitivity in men and women. In particular,
## a) In men, absence of chest pain reliably indicated no disease, whereas in women that was not sufficient, and a more specific type of chest pain indicated presence of disease.
## b) Regarding the number of vessels seen on fluoroscopy, in men 2 visible vessels indicated absence of disease, whereas in women any number greater than 0 did.
## c) Men and women showed distinct informative slopes of the peak exercise ST segment. For men, a flat slope was a strong indicator of no disease, whereas for women an ascending slope indicated presence of disease.

In [None]:

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))


In [None]:
# Thanks to https://www.kaggle.com/carlosdg/a-detail-description-of-the-heart-disease-dataset
# and https://www.kaggle.com/tentotheminus9/what-causes-heart-disease-explaining-the-model
# for some background information on the features
import sklearn as sk
from matplotlib import pyplot as pl
heart = pd.read_csv("../input/heart-disease-uci/heart.csv")
print(f"Dataset shape: {heart.shape}")
# print(heart.describe())

# Variables
## Continuous variables (5):
* age
* trestbps: resting blood pressure (mm Hg on admission to the hospital)
* chol: The person's cholesterol measurement in mg/dl
* thalach: The person's maximum heart rate achieved
* oldpeak: ST depression induced by exercise relative to rest: greater value, more displacement indicating abnormality

## Categorical variables (8):
* sex: The person's sex (1 = male, 0 = female) 
* cp: Chest Pain type: 0 = asymptomatic; 1 = atypical angina; 2 = pain without relation to angina; 3 = typical angina
* fbs: The person's fasting blood sugar (> 120 mg/dl, 1 = true; 0 = false)
* restecg: Resting electrocardiographic measurement (0 = normal, 1 = having ST-T wave abnormality, 2 = showing probable or definite left ventricular hypertrophy by Estes' criteria)
* exang: Exercise induced angina (1 = yes; 0 = no)
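* slope: the slope of the peak exercise ST segment (per the feature description: Value 1: upsloping, Value 2: flat, Value 3: downsloping; note that the results below instead treat slope==1 as flat and slope==2 as ascending)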
* ca: The number of major vessels seen (0-3)
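* thal: unclear; described elsewhere as thalassemia (a blood disorder). Assumed here: 1 = normal, 2 = fixed defect, 3 = reversible defect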

## target: Heart disease (0 = no, 1 = yes)

In [None]:
#pairwise correlation across all features (not ideal for categorical variables, but maybe still useful as higher values are generally worse)
pl.figure(figsize=(8,8))
cm = np.corrcoef(heart.T)
pl.imshow(cm,vmin=-.5,vmax=.5,cmap='jet')
pl.xticks(range(heart.shape[1]),heart.columns,rotation=90)
pl.yticks(range(heart.shape[1]),heart.columns)
pl.colorbar()
pl.show()
# A few features are already correlated with the target
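
As the comment at the top of this cell notes, Pearson correlation is not ideal for categorical variables. One common alternative is Cramér's V, which is derived from the chi-square statistic. The sketch below is a swapped-in technique, not what this notebook uses; the helper name `cramers_v` and the toy data are assumptions made up for illustration.

```python
import numpy as np
import pandas as pd
from scipy import stats

def cramers_v(x, y):
    """Cramér's V between two categorical series (0 = no association, 1 = perfect)."""
    table = pd.crosstab(x, y).to_numpy()
    # correction=False: plain Pearson chi-square, without Yates' correction
    chi2 = stats.chi2_contingency(table, correction=False)[0]
    n = table.sum()
    k = min(table.shape) - 1  # assumes both variables take at least two values
    return np.sqrt(chi2 / (n * k))

# Toy data (made up for illustration, not taken from heart.csv)
toy = pd.DataFrame({
    "cp":     [0, 0, 0, 1, 1, 2, 2, 2, 3, 3],
    "target": [0, 0, 0, 0, 1, 1, 1, 1, 1, 0],
})
v = cramers_v(toy["cp"], toy["target"])
print(f"Cramér's V(cp, target) = {v:.3f}")
```

On the real data, the same helper could be called as `cramers_v(heart['cp'], heart['target'])` for any pair of categorical columns.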

In [None]:
from sklearn import linear_model, model_selection, metrics 

In [None]:
# Simple setup from the sklearn tutorial; results are already good.
# Use logistic regression, following the typical approach (multiple regression) in public health,
# where the features are often correlated with each other.
# NB: per the sklearn docs, warm_start has no effect with the liblinear solver.
clf = linear_model.LogisticRegression(penalty='l1', solver='liblinear', tol=1e-6,
                                      max_iter=int(1e6), warm_start=True,
                                      intercept_scaling=10000.)
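
To illustrate why an L1 penalty is convenient for reading off indicators, the sketch below fits the same kind of model on synthetic data and counts how many coefficients are driven exactly to zero. The data come from `make_classification`, and the two `C` values are arbitrary choices for illustration (the classifier above keeps sklearn's default `C`).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic data (not heart.csv): 3 informative features out of 20
X_syn, y_syn = make_classification(n_samples=200, n_features=20, n_informative=3,
                                   n_redundant=0, random_state=0)

n_zero = {}
for C in (0.01, 100.0):  # small C = strong penalty, large C = weak penalty
    model = LogisticRegression(penalty="l1", solver="liblinear", C=C)
    model.fit(X_syn, y_syn)
    n_zero[C] = int(np.sum(model.coef_ == 0))
    print(f"C={C}: {n_zero[C]} of {model.coef_.size} coefficients are exactly zero")
```

A stronger penalty zeroes out more coefficients, so the features that keep a non-zero weight are the ones worth inspecting.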

<a id="Divide"></a>
# Divide dataset by gender

In [None]:
heart_f = heart[heart.sex == 0]
heart_m = heart[heart.sex == 1]
print(heart_f.shape, heart_m.shape)
# Group sizes are the number of rows (shape[0]), not the number of columns (shape[1])
n_male = heart_m.shape[0]
n_female = heart_f.shape[0]

In [None]:
#pairwise correlation across all features FOR EACH GENDER GROUP
pl.figure(figsize=(8,12))
pl.subplot(121)
pl.imshow(np.corrcoef(heart_f.T),vmin=-.5,vmax=.5,cmap='jet')
pl.title('Female')
pl.xticks(range(heart.shape[1]),heart.columns,rotation=90)
pl.yticks(range(heart.shape[1]),heart.columns)
pl.subplot(122)
pl.imshow(np.corrcoef(heart_m.T),vmin=-.5,vmax=.5,cmap='jet')
pl.title('Male')
pl.xticks(range(heart.shape[1]),heart.columns,rotation=90)
pl.yticks(range(heart.shape[1]),heart.columns)#pl.colorbar()
pl.show()

In [None]:
# Run the same log reg clf (stratified 10fold) for each gender group:
var_cat = ['cp','fbs','restecg','exang','slope','ca','thal']
n_it = 10
kf = model_selection.StratifiedKFold(n_splits=n_it)
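
As a quick illustration of what the stratification buys, the sketch below splits a toy label vector and prints the share of positives in each test fold. The labels are made up (30 negatives, 70 positives) and do not reflect the class balance of the heart data.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Toy labels (made up): 30 negatives followed by 70 positives
y_toy = np.array([0] * 30 + [1] * 70)
X_toy = np.zeros((len(y_toy), 1))  # feature values do not matter for the split itself

skf = StratifiedKFold(n_splits=10)
pos_rates = []
for train_idx, test_idx in skf.split(X_toy, y_toy):
    # Share of positive labels in this test fold
    pos_rates.append(y_toy[test_idx].mean())
    print(len(test_idx), pos_rates[-1])
```

Every test fold keeps roughly the overall class ratio, which matters for the smaller female group, where an unstratified fold could end up with very few cases of one class.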

In [None]:
# Female: 
X = pd.get_dummies(heart_f,columns=var_cat)  # NB: no ca_4 column here, since no female has ca == 4
y = X.target 
X = X.drop(columns=['sex','target','fbs_0','exang_0'])
print(X.shape,X.columns)
accs = np.zeros(n_it)
coefs = []
for it, (tr,te) in enumerate(kf.split(X,y)):
    clf.fit(X.iloc[tr],y.iloc[tr])
    y_true = y.iloc[te]
    y_pred = clf.predict(X.iloc[te])
    accs[it] = metrics.accuracy_score(y_true,y_pred)  # sklearn convention: (y_true, y_pred)
    print(it,tr.shape,te.shape,accs[it])
    coefs.append(clf.coef_.ravel())
print('Female: %f+/-%f' % (np.mean(accs),np.std(accs)))
coefs_f = np.asarray(coefs)
cols_f = X.columns

# Male: 
X = pd.get_dummies(heart_m,columns=var_cat)
y = X.target 
X = X.drop(columns=['sex','target','fbs_0','exang_0'])
print(X.shape,X.columns)
accs = np.zeros(n_it)
coefs = []
for it, (tr,te) in enumerate(kf.split(X,y)):
    clf.fit(X.iloc[tr],y.iloc[tr])
    y_true = y.iloc[te]
    y_pred = clf.predict(X.iloc[te])
    accs[it] = metrics.accuracy_score(y_true,y_pred)  # sklearn convention: (y_true, y_pred)
    print(it,tr.shape,te.shape,accs[it])
    coefs.append(clf.coef_.ravel())
print('Male: %f+/-%f' % (np.mean(accs),np.std(accs)))
coefs_m = np.asarray(coefs)
cols_m = X.columns
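
The female and male loops above are identical apart from the input frame, so they could be folded into one helper. The following is a minimal sketch under assumptions: the function name `cv_logreg` is hypothetical, the classifier uses sklearn's default settings rather than the exact ones defined earlier, and the demo data are synthetic.

```python
import numpy as np
from sklearn import linear_model, metrics, model_selection

def cv_logreg(X, y, n_splits=10):
    """Return per-fold accuracies and coefficients of an L1 logistic regression."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    kf = model_selection.StratifiedKFold(n_splits=n_splits)
    accs, coefs = [], []
    for tr, te in kf.split(X, y):
        # A fresh classifier per fold, so no state carries over between folds
        clf = linear_model.LogisticRegression(penalty="l1", solver="liblinear")
        clf.fit(X[tr], y[tr])
        accs.append(metrics.accuracy_score(y[te], clf.predict(X[te])))
        coefs.append(clf.coef_.ravel())
    return np.asarray(accs), np.asarray(coefs)

# Synthetic stand-in data (not heart.csv): only the first feature carries signal
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(120, 5))
y_demo = (X_demo[:, 0] + 0.5 * rng.normal(size=120) > 0).astype(int)
accs_demo, coefs_demo = cv_logreg(X_demo, y_demo)
print("accuracy: %.3f +/- %.3f" % (accs_demo.mean(), accs_demo.std()))
```

With the real data, the helper would be called once per gender group on the dummy-coded feature matrix and the `target` column.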

In [None]:
# Note: mean accuracy +/- SD across the 10 folds
# F (n=96):  0.8844 +/- 0.0573
# M (n=207): 0.8064 +/- 0.0926
#print(coefs_f.shape,coefs_m.shape)#print(cols_f,cols_m)#
pl.figure(figsize=(8,10))
pl.subplot(211)
pl.imshow(coefs_f,cmap='jet')#pl.colorbar()#
pl.xticks(range(len(cols_f)),cols_f,rotation=90)
pl.title('FEMALE')
pl.subplot(212)
pl.imshow(coefs_m,cmap='jet')#pl.colorbar()#
pl.title('MALE')
pl.xticks(range(len(cols_m)),cols_m,rotation=90)
pl.show()
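
Because the dummy columns are built separately for each group, the two heatmaps above need not have the same columns (the female group has no `ca_4`). One way to line them up is to reindex both frames onto the union of their columns. The sketch below uses made-up toy frames; `group_a` and `group_b` are placeholder names.

```python
import pandas as pd

# Toy frames (made up): the second group never takes the value ca == 4
group_a = pd.DataFrame({"ca": [0, 1, 4], "age": [50, 60, 70]})
group_b = pd.DataFrame({"ca": [0, 1, 2], "age": [55, 65, 75]})

dummies_a = pd.get_dummies(group_a, columns=["ca"])
dummies_b = pd.get_dummies(group_b, columns=["ca"])

# Take the union of the column names, then reindex both frames onto it;
# a dummy that is missing in one group becomes an all-zero column there
all_cols = dummies_a.columns.union(dummies_b.columns)
aligned_a = dummies_a.reindex(columns=all_cols, fill_value=0)
aligned_b = dummies_b.reindex(columns=all_cols, fill_value=0)
print(list(all_cols))
```

An alternative with the same effect would be to dummy-code the full dataset once and split by sex afterwards.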

<a id="Vis"></a> 
# Result summary and visualization 
## Similar between both genders (the sign prefix gives the sign of the coefficient):
* -cp_0: no chest pain, no heart disease
* +ca_0: no major vessels seen, likely heart disease
* -thal_3: reversible blood flow defect present, no heart disease

## Different:
* +cp_2 ("pain without relation to angina"): Female: present, likely heart disease
* -ca_2: Male: present, no heart disease
* -exang_1 (exercise induced angina): Female: present, no heart disease
* +thal_2 (fixed defect): Female: present, likely heart disease (similar trend for Male too)
* -slope_1 (flat): Male: present, no heart disease
* +slope_2 (ascending): Female: present, likely heart disease
In [None]:
from scipy import stats as spst
def visualizeByGender(targFeat):
    # Helper for visualizing the targeted feature in the two gender groups
    # (the chi-square test for independence is run separately by contingency_table_test)
    
    pl.figure(figsize=(12,3))
    ax = pl.subplot(121)
    pd.crosstab(heart_f.target,heart_f[targFeat]).plot(kind="bar",title='FEMALE'+' '+targFeat,ax=ax)
    pl.xticks([0,1],['Absent','Present'],rotation=0)
    ax = pl.subplot(122)
    pd.crosstab(heart_m.target,heart_m[targFeat]).plot(kind="bar",title='MALE'+' '+targFeat,ax=ax)#edgecolor='k',color=['w','k'])
    pl.xticks([0,1],['Absent','Present'],rotation=0)
    pl.show()
    
def contingency_table_test(targFeat,targValue):
    # Build the 2x2 contingency table (gender by disease status), counting only
    # the rows where the target feature equals the given target value
    cont_tbl = np.zeros((2,2))
    cont_tbl[0,0] = np.sum(heart_f[heart_f.target==0][targFeat]==targValue)
    cont_tbl[0,1] = np.sum(heart_f[heart_f.target==1][targFeat]==targValue)
    cont_tbl[1,0] = np.sum(heart_m[heart_m.target==0][targFeat]==targValue)
    cont_tbl[1,1] = np.sum(heart_m[heart_m.target==1][targFeat]==targValue)
    chi2, p, dof, ex = spst.chi2_contingency(cont_tbl)
    return (chi2, p, dof, ex)
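
To make the return values concrete, here is a small worked example of `chi2_contingency` on a 2×2 table. The counts are made up for illustration; rows stand for female and male, columns for disease absent and present.

```python
import numpy as np
from scipy import stats

# Toy 2x2 table (made-up counts): rows = female, male; columns = no disease, disease
table = np.array([[10, 30],
                  [40, 20]])
chi2, p, dof, expected = stats.chi2_contingency(table)

# The argument order must match the format specifiers: %d takes dof, %f takes chi2 and p
print('chi2(%d)=%f, p=%f' % (dof, chi2, p))
# Expected counts under independence: row total * column total / grand total
print(expected)
```

For any 2×2 table the test has one degree of freedom, and `expected` holds the counts that would be observed if gender and disease status were independent.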

<a id='cp'></a>
# Chest Pain ('cp'): 
* For both, "absence of chest pain" (cp_0, blue) is a strong indicator of no heart disease.
    * It is a stronger indicator in males than in females.
* For females, "pain without relation to angina" (cp_2, green) strongly indicates presence of heart disease, whereas it is not a strong indicator for males.

In [None]:
visualizeByGender('cp')
chi2, p, dof, ex = contingency_table_test('cp',0)
print('cp==0: chi2(%d)=%f, p=%f' % (dof,chi2,p))
chi2, p, dof, ex = contingency_table_test('cp',2)
print('cp==2: chi2(%d)=%f, p=%f' % (dof,chi2,p))

<a id='ca'></a>
# Number of main blood vessels seen with the radioactive dye ('ca')
* For both, no major vessel seen (ca_0, blue) indicates presence of heart disease:
    * It is a stronger indicator in females than in males.
* For males, being able to see 2 vessels (ca_2, green) indicates no heart disease.

In [None]:
visualizeByGender('ca')
chi2, p, dof, ex = contingency_table_test('ca',0)
print('ca==0: chi2(%d)=%f, p=%f' % (dof,chi2,p))
chi2, p, dof, ex = contingency_table_test('ca',2)
print('ca==2: chi2(%d)=%f, p=%f' % (dof,chi2,p))

In [None]:
chi2, p, dof, ex = contingency_table_test('ca',1)
print('ca==1: chi2(%d)=%f, p=%f' % (dof,chi2,p))
chi2, p, dof, ex = contingency_table_test('ca',3)
print('ca==3: chi2(%d)=%f, p=%f' % (dof,chi2,p))

<a id='thal'></a>
# thalassemia ('thal'). 
* NB: There seems to be inconsistency about what the values indicate. Assumed here: 1=normal, 2=fixed defect, 3=reversible defect
* For both, reversible blood flow defect (thal_3, red) indicates no heart disease
* For females, fixed defect (thal_2, green) indicates presence of heart disease
    * The direction is the same for males, but the effect is smaller

In [None]:
visualizeByGender('thal')
chi2, p, dof, ex = contingency_table_test('thal',3)
print('thal==3: chi2(%d)=%f, p=%f' % (dof,chi2,p))
chi2, p, dof, ex = contingency_table_test('thal',2)
print('thal==2: chi2(%d)=%f, p=%f' % (dof,chi2,p))

<a id='exang'></a>
# EXercise induced ANGina ('exang')
* Presence of exercise induced angina indicates no disease for females, but the gender difference is not significant by the chi-square test.

In [None]:
visualizeByGender('exang')
chi2, p, dof, ex = contingency_table_test('exang',1)
print('exang: chi2(%d)=%f, p=%f' % (dof,chi2,p))

<a id='slope'></a>
# Slope:
## (The slope of the peak exercise ST segment, 'slope')
* Females and males have distinct disease indicators:
    * For Female, ascending slope (slope_2, green) indicates presence of disease
    * For Male, flat slope (slope_1, orange) indicates no disease.

In [None]:
visualizeByGender('slope')
chi2, p, dof, ex = contingency_table_test('slope',2)
print('slope==2: chi2(%d)=%f, p=%f' % (dof,chi2,p))
chi2, p, dof, ex = contingency_table_test('slope',1)
print('slope==1: chi2(%d)=%f, p=%f' % (dof,chi2,p))
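
The per-feature tests in the sections above are run and printed one at a time. As a possible tidy-up, they could be collected into a single table. The sketch below is an assumption-laden illustration: `gender_by_disease_test` is a hypothetical helper that builds the same gender × disease table with `pd.crosstab`, and the data frame is random synthetic data with the same column names, not heart.csv.

```python
import numpy as np
import pandas as pd
from scipy import stats

def gender_by_disease_test(df, feature, value):
    """Chi-square test on the gender x disease table for rows where feature == value."""
    subset = df[df[feature] == value]
    table = pd.crosstab(subset["sex"], subset["target"])
    chi2, p, dof, _ = stats.chi2_contingency(table)
    return {"feature": feature, "value": value, "chi2": chi2, "dof": dof, "p": p}

# Synthetic stand-in data (random, so no real effect is expected)
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "sex": rng.integers(0, 2, size=300),
    "target": rng.integers(0, 2, size=300),
    "cp": rng.integers(0, 4, size=300),
    "slope": rng.integers(0, 3, size=300),
})
tests = [("cp", 0), ("cp", 2), ("slope", 1), ("slope", 2)]
results = pd.DataFrame([gender_by_disease_test(demo, f, v) for f, v in tests])
print(results)
```

Having all tests in one table also makes it easier to see how many comparisons were run when judging marginal p-values.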