# Introduction
In this task we have to classify whether a person is likely to have a heart attack. The metric we want to maximize is recall, because we want to find every person who is likely to have one.
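
To make the metric concrete, here is a minimal sketch of how recall and precision follow from confusion-matrix counts. The counts and the `recall_precision` helper are made up for illustration and are not taken from the dataset.

```python
def recall_precision(tp, fp, fn):
    """Return (recall, precision) from confusion-matrix counts."""
    recall = tp / (tp + fn)      # share of actual positives that were found
    precision = tp / (tp + fp)   # share of positive predictions that were right
    return recall, precision

# Hypothetical counts: 40 at-risk patients found, 10 missed, 20 false alarms
recall, precision = recall_precision(tp=40, fp=20, fn=10)
print(recall, precision)
```

A missed at-risk patient counts as a false negative, which lowers recall but leaves precision unchanged. That is why recall is the metric to watch here.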

    Age : Age of the patient

    Sex : Sex of the patient

    exng : exercise-induced angina (1 = yes; 0 = no)

    caa : number of major vessels (0-3)

    cp : chest pain type
        Value 1: typical angina
        Value 2: atypical angina
        Value 3: non-anginal pain
        Value 4: asymptomatic

    trtbps : resting blood pressure (in mm Hg)

    chol : cholesterol in mg/dl fetched via BMI sensor

    fbs : (fasting blood sugar > 120 mg/dl) (1 = true; 0 = false)

    restecg : resting electrocardiographic results
        Value 0: normal
        Value 1: having ST-T wave abnormality (T wave inversions and/or ST elevation or depression of > 0.05 mV)
        Value 2: showing probable or definite left ventricular hypertrophy by Estes' criteria

    thalachh : maximum heart rate achieved

    target (output) : 0 = less chance of heart attack; 1 = more chance of heart attack


# Imports & Read Data

In [None]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore')

In [None]:
df = pd.read_csv('../input/heart-attack-analysis-prediction-dataset/heart.csv')
df

The first thing I want to do is to separate categorical and numerical features because later we will need to make some transformations and plot some graphs.

In [None]:
numerical_features = ['age','trtbps','chol','thalachh','oldpeak']
categorical_features = ['sex','cp','fbs','restecg','exng','slp','caa','thall']

# Univariate analysis

Let's look at how balanced our target variable is.

In [None]:
plt.figure(figsize=(8,6))
sns.countplot(x=df['output']);

We see that there is no significant imbalance in the target variable.
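
The plot can be backed up with numbers. The sketch below uses a made-up label series (`toy_output` is a hypothetical stand-in for `df['output']`) to show how class shares can be read off directly.

```python
import pandas as pd

# Toy labels standing in for df['output'] (made-up values, not the real data)
toy_output = pd.Series([1, 0, 1, 1, 0, 1, 0, 1, 0, 1], name='output')

# normalize=True returns class shares instead of raw counts
shares = toy_output.value_counts(normalize=True)
print(shares)
```

In the notebook itself the same call on `df['output']` would give the actual shares.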

It is also good practice to look at the statistics and distributions of our features.

In [None]:
df.describe()

### From here we see some useful information about our features:

    1. Min age = 29, max age = 77, mean age = 54.3, so we have no information about young people.
    2. A sex mean of 0.68 means that 68% of our observations are labeled as 1.
    3. It also looks like we have no outliers (judging by the min and max values). The observation with chol = 564 may be one, but I will not remove it because the value is very high yet still achievable.
    4. Other conclusions could be drawn from this table, but they are less obvious, both to me and to most readers.
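
Judging outliers by eye from the min and max values can be complemented with a simple rule. The sketch below applies the common 1.5 × IQR rule, which is swapped in here for illustration and is not what the notebook uses. The readings are made up; only 564 mirrors the value discussed above.

```python
import numpy as np

# Made-up cholesterol readings (mg/dl); only 564 mirrors the value discussed above
chol = np.array([180, 200, 210, 230, 240, 250, 260, 280, 300, 564])

q1, q3 = np.percentile(chol, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Anything outside [lower, upper] is flagged as a candidate outlier
outliers = chol[(chol < lower) | (chol > upper)]
print(outliers)
```

A flagged value is only a candidate. Whether to drop it remains a judgment call, as with the chol = 564 observation.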

# Bivariate analysis

In this section we will look at the dependency between the target variable and the features.

In [None]:
_,ax = plt.subplots(1,2,figsize=(14,6))

sns.histplot(data=df,x='age',hue='output',kde=True,ax=ax[0])
sns.boxplot(data=df,y='age',x='output',ax=ax[1])

From the graphs above we see something odd: the median age of people who are more likely to have a heart attack is lower than the median age of people who are less likely.

In [None]:
sns.countplot(data=df,x='sex',hue='output')

We see that people with sex labeled 0 are more likely to have a heart attack than those labeled 1.

In [None]:
_,ax = plt.subplots(2,2,figsize=(16,10))
for i,x in enumerate(['trtbps','chol','thalachh','oldpeak']):
    sns.histplot(data=df,x=x,hue='output',kde=True,ax=ax[i%2][i//2])
    

From the graphs we see that all these features might be useful.

Now let's look at the distributions of the categorical variables.

In [None]:
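_,ax = plt.subplots(4,2,figsize=(16,15))
for i,x in enumerate(['cp','fbs','restecg','exng','slp','caa','thall']):
    # countplot suits discrete categories; a KDE curve is not meaningful here
    sns.countplot(data=df,x=x,hue='output',ax=ax[i%4][i//4])
ax[3][1].set_visible(False)  # seven features, so the eighth panel stays empty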

From these graphs we can conclude that the feature fbs might not be useful. To check this, we can build two models, the first with this feature and the second without it.
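
One way to run that comparison is to score the same model twice with cross-validation, once on all columns and once with the candidate column dropped. The sketch below uses synthetic stand-in data: the `signal` and `noise` columns and the `mean_recall` helper are hypothetical and not part of the dataset. In this notebook the same idea would apply to `fbs`.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n = 200
# Synthetic stand-in data: 'signal' drives the label, 'noise' is unrelated to it
toy = pd.DataFrame({'signal': rng.normal(size=n),
                    'noise': rng.integers(0, 2, size=n)})
toy_y = (toy['signal'] + rng.normal(scale=0.5, size=n) > 0).astype(int)

def mean_recall(columns):
    """Mean 5-fold cross-validated recall using only the given columns."""
    scores = cross_val_score(LogisticRegression(), toy[columns], toy_y, scoring='recall')
    return scores.mean()

with_feature = mean_recall(['signal', 'noise'])
without_feature = mean_recall(['signal'])
print(with_feature, without_feature)
```

If the two scores are close, the extra column is probably not contributing much. On a small dataset, differences of a few hundredths may simply be noise.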

# Missing values

In [None]:
df.isna().sum()

In this data we don't have missing values.

# Scaling and Encoding

We want to scale the data so that all variables have a similar range of values. We can't simply fit the scaler on the whole dataset, because statistics computed from the test rows would leak into training and bias the evaluation. So we first have to split the data into train and test sets.
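
A minimal sketch of leakage-free scaling, assuming a toy single-feature array (`values` is made up for illustration): the scaler learns its mean and standard deviation from the training rows only and then reuses them on the test rows.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Toy single-feature data (made up for illustration)
values = np.arange(20, dtype=float).reshape(-1, 1)
train, test = train_test_split(values, random_state=1)

scaler = StandardScaler().fit(train)   # statistics come from the training rows only
train_scaled = scaler.transform(train)
test_scaled = scaler.transform(test)   # test rows reuse the training mean and std

print(train_scaled.mean(), test_scaled.mean())
```

The scaled training data has mean zero by construction, while the test mean is generally not zero, which is expected. Wrapping the scaler in a `Pipeline` and passing it to `cross_val_score` applies the same rule inside every fold.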

Encoding transforms categorical data into a form the model can understand. I will use one-hot encoding (OneHotEncoder), since it is one of the best choices for linear models.
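
As a small illustration of what one-hot encoding produces, the sketch below fits an encoder on three made-up category codes (`train_cp` is a hypothetical stand-in for a column such as `cp`) and shows what `handle_unknown='ignore'` does with a code it has never seen.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Made-up category codes seen during fitting
train_cp = np.array([[0], [1], [2]])
encoder = OneHotEncoder(handle_unknown='ignore').fit(train_cp)

# A known code becomes a row with a single 1 in its own column
known = encoder.transform(np.array([[1]])).toarray()
# An unseen code becomes an all-zero row instead of raising an error
unseen = encoder.transform(np.array([[3]])).toarray()

print(known, unseen)
```

The all-zero row matters during cross-validation on a small dataset, where a rare category may appear only in a validation fold.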

I will build a transformation pipeline with ColumnTransformer. To do this, I need to set the column dtypes so that the appropriate transformation is applied to each column.

In [None]:
from sklearn.preprocessing import StandardScaler,OneHotEncoder
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer

In [None]:
df.dtypes

In [None]:
df[numerical_features]=df[numerical_features].astype('float64')
df[categorical_features] = df[categorical_features].astype('category')

In [None]:
df.dtypes

In [None]:
num_transformer = Pipeline(steps=[('scaler',StandardScaler())])

cat_transformer = Pipeline(steps=[('onehot',OneHotEncoder(handle_unknown='ignore'))])

transformer = ColumnTransformer(transformers=[
        ('num', num_transformer, numerical_features),
        ('cat', cat_transformer,categorical_features)])


# Modeling

Our transformer is ready, so now we can build some models.

In [None]:
from sklearn.metrics import recall_score, precision_score
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold,train_test_split, cross_val_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
import numpy as np

In [None]:
df_train, df_test = train_test_split(df,random_state=1)
X_test = df_test.drop(columns=['output'])
y_test = df_test.output

In [None]:
X = df_train.drop(columns=['output'])
y = df_train.output

In [None]:
def print_recall(model,message=''):
    print('-'*9)
    print(message)
    pipeline = Pipeline(steps=[('transformer',transformer),('model',model)])
    print('Recall = ',round(np.mean(cross_val_score(pipeline,X,y,scoring='recall')),5))
    print('Precision = ',round(np.mean(cross_val_score(pipeline,X,y,scoring='precision')),5))    

In [None]:
print_recall(LogisticRegression(C=0.1),'LogRegres, c = 0.1')
print_recall(LogisticRegression(C=1),'LogRegres, c = 1')
print_recall(LogisticRegression(C=10),'LogRegres, c = 10')

print_recall(SVC(C=0.1),'SVC, c = 0.1')
print_recall(SVC(C=1),'SVC, c = 1')
print_recall(SVC(C=10),'SVC, c = 10')

In [None]:
print_recall(KNeighborsClassifier(n_neighbors=3),'KNN, k = 3')
print_recall(KNeighborsClassifier(n_neighbors=5),'KNN, k = 5')
print_recall(KNeighborsClassifier(n_neighbors=8),'KNN, k = 8')

print_recall(RandomForestClassifier(min_samples_leaf=1),'Forest, 1 sample per leaf')
print_recall(RandomForestClassifier(min_samples_leaf=3),'Forest, 3 samples per leaf')
print_recall(RandomForestClassifier(min_samples_leaf=5),'Forest, 5 samples per leaf')

Let's now focus on logistic regression and lower its decision threshold until recall reaches 0.95.
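
To see why lowering the threshold raises recall, consider a tiny sketch with made-up probabilities and labels (`probabilities`, `labels`, and `recall_at` are hypothetical and not taken from the model).

```python
# Made-up predicted probabilities and true labels for six patients
probabilities = [0.9, 0.6, 0.45, 0.4, 0.3, 0.1]
labels = [1, 1, 1, 0, 1, 0]

def recall_at(threshold):
    """Recall when every probability >= threshold is predicted positive."""
    predicted = [p >= threshold for p in probabilities]
    true_positives = sum(1 for pred, y in zip(predicted, labels) if pred and y == 1)
    return true_positives / sum(labels)

for threshold in (0.5, 0.4, 0.3):
    print(threshold, recall_at(threshold))
```

Each step down captures more of the true positives, but it also lets in more false positives, such as the patient with probability 0.4. Recall therefore rises at the cost of precision.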

In [None]:
pipeline = Pipeline(steps=[('transformer',transformer),('model',LogisticRegression(C=0.1))])

In [None]:
prob = 0.0
kfold = KFold()  # default 5 folds without shuffling, so the splits are the same for every threshold
# Lower the threshold from 0.5 toward 0 and stop at the first value whose mean CV recall exceeds 0.95
for p in np.linspace(0.5,0,100):
    recall = list()
    precision = list()
    for train_idx,test_idx in kfold.split(X):
        pipeline.fit(X.iloc[train_idx],y.iloc[train_idx])
        proba = pipeline.predict_proba(X.iloc[test_idx])
        # Predict class 1 whenever its probability reaches the current threshold
        predictions = proba[:,1] >= p
        recall.append(recall_score(y.iloc[test_idx],predictions))
        precision.append(precision_score(y.iloc[test_idx],predictions))

    if np.mean(recall) > 0.95:
        prob = p
        print('p = ', p)
        print('Recall = ', np.mean(recall))
        print('Precision = ', np.mean(precision))
        break

As we can see, we need a threshold of about 0.39 to get a recall of 0.95. Let's check the results on the test set.

In [None]:
pipeline.fit(X,y);

In [None]:
proba = pipeline.predict_proba(X_test)
predictions = proba[:,1] >= prob
print('Recall = ',recall_score(y_test,predictions))
print('Precision = ',precision_score(y_test,predictions))

Well, the result differs, but this is because the dataset is small, so the split has a huge influence on the metrics.
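
The influence of the split can be checked by repeating it with different seeds. The sketch below does this on a small synthetic dataset (300 rows and 8 features, both arbitrary choices; `X_toy`, `y_toy`, and the number of seeds are assumptions for illustration) and reports how much test recall moves.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Small synthetic stand-in dataset (not the heart data)
X_toy, y_toy = make_classification(n_samples=300, n_features=8, random_state=0)

recalls = []
for seed in range(20):
    # Only the split changes between runs; the model settings stay fixed
    X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, random_state=seed)
    model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    recalls.append(recall_score(y_te, model.predict(X_te)))

print('min =', min(recalls), 'max =', max(recalls), 'std =', np.std(recalls))
```

A wide gap between the minimum and maximum would support the explanation above. Reporting the mean and spread over several splits is usually more informative than a single test score.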