## Prettier data visualization with seaborn
- The dataset description is a bit confusing: it does not really explain `oldpeak` or `slp`.
- Is `target` in the description the same attribute as `output` in the CSV?

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import os
import matplotlib.pyplot as plt
import seaborn as sns

for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

## About this dataset (Copied from the description)
- Age : Age of the patient
- Sex : Sex of the patient (1 = female; 0 = male)
- exang: exercise induced angina (1 = yes; 0 = no)
- ca: number of major vessels (0-3)
- cp : chest pain type
    - Value 1: typical angina
    - Value 2: atypical angina
    - Value 3: non-anginal pain
    - Value 4: asymptomatic
- trtbps : resting blood pressure (in mm Hg)
- chol : cholesterol in mg/dl fetched via BMI sensor
- fbs : (fasting blood sugar > 120 mg/dl) (1 = true; 0 = false)
- rest_ecg : resting electrocardiographic results
    - Value 0: normal
    - Value 1: having ST-T wave abnormality (T wave inversions and/or ST elevation or depression of > 0.05 mV)
    - Value 2: showing probable or definite left ventricular hypertrophy by Estes' criteria
- thalach : maximum heart rate achieved
- oldpeak : previous peak
- slp : slope
- target : 0 = less chance of heart attack; 1 = more chance of heart attack

## Explore the dataset
- Continuous (non-discrete) variables: 'age', 'trtbps', 'chol', 'thalachh', 'oldpeak'
- Judging by 'sex' and 'age', the dataset is not balanced: neither the sex ratio nor the age groups are evenly represented

In [None]:
data_source = pd.read_csv('/kaggle/input/heart-attack-analysis-prediction-dataset/heart.csv')
data_source.info()

In [None]:
data_source.describe()

## Gender w.r.t. each age group
- Sex : Sex of the patient (1 = female; 0 = male)
- There are noticeably more samples from female patients than from male patients

In [None]:
plt.figure(figsize=(9, 5))
sns.histplot(
    data=data_source,
    x="age",
    hue="sex",
    kde=True,
    binwidth=2
)
plt.title("Distribution of gender - age")
plt.show()
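
The histogram shows the imbalance visually; a cross-tabulation puts numbers on it. Below is a minimal sketch, assuming 10-year age bins (an arbitrary choice) and a small made-up frame standing in for `data_source`. The helper name `sex_by_age_group` is hypothetical and not part of the notebook.

```python
import pandas as pd

def sex_by_age_group(df, bin_width=10):
    """Count samples per (age bin, sex). bin_width is an illustrative choice."""
    start = int(df["age"].min() // bin_width) * bin_width
    stop = int(df["age"].max() // bin_width + 1) * bin_width
    bins = list(range(start, stop + 1, bin_width))
    # right=False gives half-open bins such as [40, 50)
    age_group = pd.cut(df["age"], bins=bins, right=False)
    return pd.crosstab(age_group, df["sex"])

# Made-up rows standing in for data_source; not real patient data.
demo = pd.DataFrame({
    "age": [29, 41, 45, 52, 58, 63, 67, 44],
    "sex": [1, 1, 0, 1, 1, 0, 1, 1],
})
counts = sex_by_age_group(demo)
print(counts)
```

On the real data the call would be `sex_by_age_group(data_source)`; each row is an age bin and each column is one value of `sex`.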

## Distribution of the continuous vars
- We may want to pay close attention to patients with higher 'trtbps' and 'chol', to see whether their heart attack rate differs noticeably from the rest of the sample
- Likewise for patients with lower 'thalachh'
- What exactly is 'oldpeak'?

In [None]:
con_feature_list = ['trtbps', 'chol', 'thalachh', 'oldpeak']
plt.figure(figsize=(24, 1))
# one horizontal boxplot per continuous feature, side by side
for idx, feature in enumerate(con_feature_list, start=1):
    plt.subplot(1, len(con_feature_list), idx)
    sns.boxplot(
        data=data_source,
        x=feature
    )
plt.show()
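
The points drawn beyond the whiskers of a boxplot are, with the default settings, the values more than 1.5 IQR away from the quartiles. The sketch below counts them per column so the "higher 'trtbps' and 'chol'" groups can be sized. The helper `iqr_outlier_counts` and the demo values are assumptions for illustration only.

```python
import pandas as pd

def iqr_outlier_counts(df, columns, k=1.5):
    """Count values outside [Q1 - k*IQR, Q3 + k*IQR] for each column."""
    counts = {}
    for col in columns:
        q1, q3 = df[col].quantile([0.25, 0.75])
        iqr = q3 - q1
        outside = (df[col] < q1 - k * iqr) | (df[col] > q3 + k * iqr)
        counts[col] = int(outside.sum())
    return pd.Series(counts)

# Made-up values standing in for data_source; not real patient data.
demo = pd.DataFrame({
    "trtbps": [120, 130, 125, 118, 135, 128, 200],
    "chol": [210, 230, 250, 240, 220, 235, 245],
})
outliers = iqr_outlier_counts(demo, ["trtbps", "chol"])
print(outliers)
```

With the notebook's data this would be called as `iqr_outlier_counts(data_source, con_feature_list)`.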

## Heart attack rate in dataset
- target : 0 = less chance of heart attack (low risk); 1 = more chance of heart attack (high risk)

In [None]:
# class counts, selected by label so the order is always (0, 1)
no_heart_atk, heart_atk = data_source.output.value_counts().loc[[0, 1]]
plt.figure(figsize=(6, 4))
sns.barplot(
    x=["no_heart_atk", "heart_atk"],
    y=[no_heart_atk, heart_atk]
)
plt.show()

## Heart attack rate among each age group
- In this dataset, the heart attack rate seems to be higher between ages 40 and 55
- The rate drops as people get older, from around age 60
- This may be due to work and life pressure on younger people, or the data may carry some bias from how it was collected

In [None]:
plt.figure(figsize=(9,5))
sns.histplot(
    data=data_source,
    x="age",
    hue="output",
    kde=True,
    binwidth=2
)
plt.title("Distribution of heart attack - age")
plt.show()
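
The overlaid histograms compare counts, so a bin with many samples can look "worse" than a bin with few. The share of `output == 1` within each age bin is a more direct measure of the rate. A minimal sketch, assuming hand-picked bin edges and made-up rows; the helper name `rate_by_age_group` is hypothetical.

```python
import pandas as pd

def rate_by_age_group(df, bins):
    """Share of output == 1 and sample count within each age bin."""
    age_group = pd.cut(df["age"], bins=bins, right=False)
    grouped = df.groupby(age_group, observed=True)["output"]
    # the mean of a 0/1 column is the fraction of ones
    return pd.DataFrame({"rate": grouped.mean(), "n": grouped.size()})

# Made-up rows standing in for data_source; not real patient data.
demo = pd.DataFrame({
    "age": [42, 45, 48, 51, 62, 65, 68, 69],
    "output": [1, 1, 1, 0, 0, 0, 1, 0],
})
result = rate_by_age_group(demo, bins=[40, 50, 60, 70])
print(result)
```

Keeping the `n` column next to the rate makes it easy to spot bins where the rate rests on only a handful of patients.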

## Heart attack rate w.r.t. gender & age
- The rate is significantly (alarmingly) higher in the male group
- Within each gender, the share of heart attack cases is high in the 40~55 age group
- I suspect the male group does not have enough samples

In [None]:
# heart_atk in each gender & age group
male_patient = data_source[data_source.sex == 0]
female_patient = data_source[data_source.sex == 1]
plt.figure(figsize=(18, 5))
plt.subplot(121)
sns.histplot(
    data=male_patient,
    x="age",
    hue="output",
    kde=True,
    binwidth=2
)
plt.title("Heart attack - male patients")
plt.subplot(122)
sns.histplot(
    data=female_patient,
    x="age",
    hue="output",
    kde=True,
    binwidth=2
)
plt.title("Heart attack - female patients")
plt.show()
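
One way to judge whether a subgroup has "enough" samples is to put a confidence interval around its rate: the fewer samples, the wider the interval. The sketch below uses the Wilson score interval, which is a choice made here and not something the notebook itself uses. The counts passed in are hypothetical, not taken from the dataset.

```python
import math

def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half

# Hypothetical counts: the same observed rate (50%) with 10 vs. 100 samples.
small = wilson_interval(5, 10)
large = wilson_interval(50, 100)
print("n=10: ", small)
print("n=100:", large)
```

If the intervals for two groups overlap heavily, a difference in their observed rates should be read with caution.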

## Countplot for discrete variables
- It seems we should still exercise, even though it may sometimes induce angina
- 'ca' and 'cp' appear to be strongly related to heart attack
- 'fbs' does not show an obvious relationship with heart attack
- 'rest_ecg' of 1 and 'slope' of 1 and 2 are worth exploring further (but what exactly is 'slope'?)

In [None]:
# exang: exercise induced angina (1 = yes; 0 = no)
# ca: number of major vessels (0-3)
# rest_ecg : resting electrocardiographic results
# cp : chest pain type
# fbs : (fasting blood sugar > 120 mg/dl) (1 = true; 0 = false)
plt.figure(figsize=(24, 10))
plt.subplot(231)
sns.countplot(
    data=data_source,
    x="exng",
    hue="output"
)
plt.title("Heart attack - exercise induced angina")
plt.subplot(232)
sns.countplot(
    data=data_source,
    x="caa",
    hue="output"
)
plt.title("Heart attack - number of major vessels")
plt.subplot(233)
sns.countplot(
    data=data_source,
    x="restecg",
    hue="output"
)
plt.title("Heart attack - resting electrocardiographic results")
plt.subplot(234)
sns.countplot(
    data=data_source,
    x="cp",
    hue="output"
)
plt.title("Heart attack - Chest Pain type")
plt.subplot(235)
sns.countplot(
    data=data_source,
    x="fbs",
    hue="output"
)
plt.title("Heart attack - fasting blood sugar > 120 mg/dl")
plt.subplot(236)
sns.countplot(
    data=data_source,
    x="slp",
    hue="output"
)
plt.title("Heart attack - slope")
plt.show()
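
The count plots suggest which discrete variables relate to `output`; a single number per variable makes them comparable. The sketch below computes Cramér's V from a contingency table using plain NumPy, without bias correction. This is an added illustration under those assumptions, and the function name `cramers_v` and the demo series are hypothetical.

```python
import numpy as np
import pandas as pd

def cramers_v(x, y):
    """Cramér's V between two categorical series (0 = no association, 1 = perfect)."""
    observed = pd.crosstab(x, y).to_numpy(dtype=float)
    min_dim = min(observed.shape) - 1
    if min_dim == 0:
        return float("nan")  # one of the variables is constant
    n = observed.sum()
    # expected counts under independence: row total * column total / n
    expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / n
    chi2 = ((observed - expected) ** 2 / expected).sum()
    return float(np.sqrt(chi2 / (n * min_dim)))

# Made-up series: one perfectly associated pair, one independent pair.
x = pd.Series([0, 0, 0, 0, 1, 1, 1, 1])
v_same = cramers_v(x, x)
v_indep = cramers_v(x, pd.Series([0, 1, 0, 1, 0, 1, 0, 1]))
print(v_same, v_indep)
```

On the notebook's data a call would look like `cramers_v(data_source["caa"], data_source["output"])`, repeated for each discrete column.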

## Correlation in continuous vars
- In the pair plot, most pairs show no clear linear relationship at first glance
- 'thalachh' and 'age' seem to have some linear correlation and can partly separate the two classes
- In the heat map of Pearson correlations, three pairs stand out with fairly high correlation: 'output' and 'oldpeak', 'oldpeak' and 'thalachh', and 'thalachh' and 'age'

In [None]:
con_feature_list = ["age", "trtbps", "chol", "thalachh", "oldpeak", "output"]
sns.pairplot(
    data=data_source[con_feature_list], 
    hue="output"
)
plt.show()

In [None]:
corr = data_source[con_feature_list].corr()
plt.figure(figsize=(8, 6))
sns.heatmap(
    data=corr,
    annot=True,
    square=True
)
plt.title("Pearson correlation between continuous vars")
plt.show()
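
Pearson correlation only measures linear association, so a monotonic but curved relationship can be understated. Spearman correlation, which is Pearson correlation computed on ranks, is a common complement. A minimal sketch with made-up data; the helper name `pearson_and_spearman` is hypothetical.

```python
import pandas as pd

def pearson_and_spearman(df, target):
    """Pearson and Spearman correlation of every column with `target`."""
    pearson = df.corr()[target].drop(target)
    # Spearman correlation equals Pearson correlation computed on ranks.
    spearman = df.rank().corr()[target].drop(target)
    return pd.DataFrame({"pearson": pearson, "spearman": spearman})

# Made-up data: y is monotonic in x but not linear.
demo = pd.DataFrame({"x": [1, 2, 3, 4, 5, 6]})
demo["y"] = demo["x"] ** 3
result = pearson_and_spearman(demo, "y")
print(result)
```

On the notebook's data the call would be `pearson_and_spearman(data_source[con_feature_list], "output")`.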

## A simple RandomForest model

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score, accuracy_score, confusion_matrix, classification_report, PrecisionRecallDisplay
from sklearn.ensemble import RandomForestClassifier

def model_report(model, tar_x, tar_y):
    """Print common classification metrics and plot the precision-recall curve."""
    pred = model.predict(tar_x)
    f1 = f1_score(tar_y, pred)
    print("f1-score: ", f1)
    acc = accuracy_score(tar_y, pred)
    print("accuracy: ", acc)
    cm = confusion_matrix(tar_y, pred)
    print("confusion matrix:\n", cm)
    cls_report = classification_report(tar_y, pred)
    print("classification report:\n", cls_report)
    # plot_precision_recall_curve was removed in scikit-learn 1.2;
    # PrecisionRecallDisplay.from_estimator is its replacement
    disp = PrecisionRecallDisplay.from_estimator(model, tar_x, tar_y)
    disp.ax_.set_title('2-class Precision-Recall curve')

In [None]:
split_seed = 77
ratio = .2
train_set, test_set = train_test_split(data_source, test_size=ratio, random_state=split_seed)
train_y = train_set.output
train_x = train_set.drop(columns=['output'])
test_y = test_set.output
test_x = test_set.drop(columns=['output'])

randomForest = RandomForestClassifier(random_state=7)
randomForest.fit(train_x, train_y)
model_report(randomForest, test_x, test_y)
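
A single 20% hold-out on a small dataset gives a noisy estimate, so the scores above can shift noticeably with a different `split_seed`. Stratified k-fold cross-validation is one way to see that spread. The sketch below runs on synthetic data from `make_classification` as a stand-in for the heart dataset; the sample size, feature count, and fold count are arbitrary assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in for the features and labels; not the heart dataset.
X, y = make_classification(n_samples=300, n_features=13, random_state=0)

# Stratified folds keep the class ratio roughly equal in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=77)
scores = cross_val_score(
    RandomForestClassifier(random_state=7), X, y, cv=cv, scoring="f1"
)
print("f1 per fold:", scores.round(3))
print("mean +/- std: %.3f +/- %.3f" % (scores.mean(), scores.std()))
```

With the notebook's data, `X` and `y` would be `data_source.drop(columns=['output'])` and `data_source.output`.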