# Heart Attack Analysis and Prediction - Binary Classification with Logistic Regression

About the dataset

* Age : Age of the patient
* Sex : Sex of the patient
* exang: exercise induced angina (1 = yes; 0 = no)
* ca: number of major vessels (0-3)
* cp : chest pain type
    * Value 1: typical angina
    * Value 2: atypical angina
    * Value 3: non-anginal pain
    * Value 4: asymptomatic
* trtbps : resting blood pressure (in mm Hg)
* chol : cholesterol in mg/dl, fetched via BMI sensor
* fbs : (fasting blood sugar > 120 mg/dl) (1 = true; 0 = false)
* rest_ecg : resting electrocardiographic results 
    * Value 0: normal
    * Value 1: having ST-T wave abnormality (T wave inversions and/or ST elevation or depression of > 0.05 mV)
    * Value 2: showing probable or definite left ventricular hypertrophy by Estes' criteria
* thalach : maximum heart rate achieved
* **target : 0 = less chance of heart attack; 1 = more chance of heart attack**

### **This notebook proceeds through the steps below:**
```
Step 1. Data Description
Step 2. EDA
Step 3. Correlation Check
Step 4. Test Data Split and Standard Scaling
Step 5. Modeling and Prediction (sklearn-LogisticRegression)
```

## Step 1. Data Description

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

In [None]:
cd Heart_Attack_Analysis_and_Prediction

In [None]:
ls

In [None]:
df = pd.read_csv('/kaggle/input/heart-attack-analysis-prediction-dataset/heart.csv')

In [None]:
df

In [None]:
df.info()

In [None]:
import missingno as msno

In [None]:
msno.bar(df);

In [None]:
df.nunique()

In [None]:
df.describe()

In [None]:
df.shape

------------
1. All columns are numeric.
2. There are no null values.
3. Columns -> x: 13, y: 1
4. Rows -> 303 (14 columns in total)
5. The features need to be normalized.
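
The observations above can also be checked in code rather than by eye. The sketch below is only an illustration: `summarize_frame` is a hypothetical helper that is not part of this notebook, and it runs on a tiny made-up frame instead of the real `heart.csv`.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

def summarize_frame(frame: pd.DataFrame, target: str) -> dict:
    """Collect the basic facts checked in Step 1 (hypothetical helper)."""
    return {
        "all_numeric": all(is_numeric_dtype(frame[c]) for c in frame.columns),
        "n_missing": int(frame.isna().sum().sum()),
        "n_rows": frame.shape[0],
        "n_features": frame.shape[1] - 1,  # every column except the target
        "has_target": target in frame.columns,
    }

# Toy data for illustration only; these values are made up.
toy = pd.DataFrame({"age": [63, 37, 41], "chol": [233, 250, 204], "output": [1, 1, 0]})
summary = summarize_frame(toy, target="output")
print(summary)
```

Calling `summarize_frame(df, target="output")` on the real data should report the same facts as `df.info()` and `df.shape`, just in one place.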

In [None]:
df

## Step 2. EDA

In [None]:
df.nunique()

----------
Some columns are categorical even though they are stored as integers. Let's explore those first.
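
One way to find such columns is to pick those with only a few distinct values. The sketch below assumes a cutoff on the unique count; the helper name `low_cardinality_columns` and the threshold are illustrative assumptions, not something this notebook defines, and the data is made up.

```python
import pandas as pd

def low_cardinality_columns(frame: pd.DataFrame, max_unique: int = 5, exclude=()):
    """Return columns with at most `max_unique` distinct values (hypothetical helper)."""
    counts = frame.nunique()
    return [c for c in frame.columns if counts[c] <= max_unique and c not in exclude]

# Toy data for illustration only; these values are made up.
toy = pd.DataFrame({
    "age": [63, 37, 41, 56, 57, 44],
    "sex": [1, 1, 0, 1, 0, 1],
    "cp": [3, 2, 1, 1, 0, 1],
    "output": [1, 1, 1, 0, 0, 0],
})
candidates = low_cardinality_columns(toy, max_unique=4, exclude=("output",))
print(candidates)
```

The result is only a list of candidates. A column with few distinct values can still be a genuine count, so the list is worth reviewing by hand.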

In [None]:
df['sex'].value_counts()

In [None]:
cat_cols = ['sex','cp','fbs','restecg','exng','slp','caa','thall']

In [None]:
len(cat_cols)

In [None]:
# Compute the value counts once per categorical column
cat_bar = [df[col].value_counts() for col in cat_cols]

fig, ax = plt.subplots(nrows=4, ncols=2, figsize=(20,16))

# ax.flat walks the 4x2 grid row by row, matching the order of cat_cols
for axes, col, counts in zip(ax.flat, cat_cols, cat_bar):
    sns.barplot(x=counts.index, y=counts.values, ax=axes)
    axes.set_title(col)
plt.show()

In [None]:
df.nunique()

-------
Let's review the data description again, focusing on the categorical columns:
* Sex : sex of the patient
    * Value 0: female
    * Value 1: male

* exang : exercise induced angina
    * Value 0: no (False)
    * Value 1: yes (True)

* ca : number of major vessels
    * Value 0: none
    * Value 1: 1 vessel
    * Value 2: 2 vessels
    * Value 3: 3 vessels
    * Value 4: 4 vessels

* cp : chest pain type
    * Value 1: typical angina
    * Value 2: atypical angina
    * Value 3: non-anginal pain
    * Value 4: asymptomatic

* fbs : fasting blood sugar > 120 mg/dl
    * Value 0: fasting blood sugar <= 120 (False)
    * Value 1: fasting blood sugar > 120 (True)

* rest_ecg : resting electrocardiographic results
    * Value 0: normal
    * Value 1: having ST-T wave abnormality (T wave inversions and/or ST elevation or depression of > 0.05 mV)
    * Value 2: showing probable or definite left ventricular hypertrophy by Estes' criteria

* **target : 0 = less chance of heart attack; 1 = more chance of heart attack**

-----------
- The description above gives no information about ['slp','thall','caa']
- Approximate share of the most common value(s) in each described column:
    * sex : male, about 70%
    * cp : typical angina, about 50%
    * fbs : fasting blood sugar <= 120 (low), about 85%
    * rest_ecg : values 0 and 1 together (normal or ST-T wave abnormality), about 90%
    * exang (exercise induced angina) : 0 (False), about 70%
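
Shares like these can be computed directly instead of being read off the bar plots. A minimal sketch, using a made-up `sex` column rather than the real data:

```python
import pandas as pd

# Made-up values for illustration; not taken from heart.csv.
toy = pd.DataFrame({"sex": [1, 1, 1, 0, 1, 0, 1, 1, 0, 1]})

# normalize=True returns fractions of the total instead of raw counts
shares = toy["sex"].value_counts(normalize=True)
print((shares * 100).round(1))
```

On the notebook's data, the same call would be `df[col].value_counts(normalize=True)` for each `col` in `cat_cols`.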
    

In [None]:
df.output.value_counts()

In [None]:
output = df.output.value_counts()

In [None]:
sns.barplot(x=output.index, y=output.values)

The target `output` (y) is reasonably balanced between the two classes.

## Step 3. Correlation Check

In [None]:
df.info()

In [None]:
colormap = plt.cm.RdBu
plt.figure(figsize=(20,16))
sns.heatmap(df.corr(), cmap=colormap)  # apply the diverging colormap defined above
plt.show()


In [None]:
df.corr()

In [None]:
df.corr()['output']

In [None]:
df.corr()['output'].sort_values(ascending=False)

In [None]:
Output = pd.DataFrame(df.corr()['output'].sort_values(ascending=False))

In [None]:
sns.heatmap(Output)

------------
**The features most positively correlated with `output` are ['cp','thalachh','slp','restecg']**
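
Sorting the signed correlations in descending order surfaces only the positive relationships; strongly negative ones end up at the bottom of the list. The sketch below ranks features by the magnitude of the correlation instead. It uses synthetic data, and `rank_by_target_correlation` is a hypothetical helper, not part of this notebook.

```python
import numpy as np
import pandas as pd

def rank_by_target_correlation(frame: pd.DataFrame, target: str) -> pd.Series:
    """Pearson correlation of each feature with `target`, ordered by magnitude."""
    corr = frame.corr()[target].drop(target)
    return corr.reindex(corr.abs().sort_values(ascending=False).index)

# Synthetic data for illustration only
rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=200)
toy = pd.DataFrame({
    "strong_pos": y + rng.normal(0, 0.3, size=200),
    "strong_neg": -y + rng.normal(0, 0.6, size=200),
    "noise": rng.normal(0, 1, size=200),
    "output": y,
})
ranking = rank_by_target_correlation(toy, "output")
print(ranking)
```

The returned values keep their sign, so a negatively correlated feature still shows as negative while being ranked by strength.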

## Step 4. Test Data Split and Standard Scaling

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
df

In [None]:
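# Features (X): every column except the last. Despite the name, this is not a training split.
train = df.iloc[:,:-1]

In [None]:
# Target (y): the last column, kept as a one-column DataFrame.
test = df.iloc[:,-1:]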

In [None]:
train

In [None]:
test

In [None]:
X_train, X_test, y_train, y_test = train_test_split(train,test, test_size=0.3, random_state=32)

In [None]:
X_train

In [None]:
y_train

In [None]:
X_test

In [None]:
y_test

In [None]:
print(X_train.shape, X_test.shape)
print(y_train.shape, y_test.shape)

----------
The split looks right: the training set has 212 rows, with 13 feature columns (x) and 1 target column (y).
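
The split sizes follow from how scikit-learn handles a fractional `test_size`: the test share is rounded up and the remaining rows go to training. The sketch below reproduces the arithmetic on placeholder arrays, assuming the dataset has 303 rows (an assumption that is consistent with the 212 training rows above). The `stratify` argument is an optional addition that this notebook does not use; it keeps the class ratio the same in both parts.

```python
import math
import numpy as np
from sklearn.model_selection import train_test_split

n_samples, test_size = 303, 0.3  # 303 rows is an assumption about the dataset

# scikit-learn rounds the test share up; the remainder becomes the training set
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

# Placeholder arrays with the assumed row count and 13 feature columns
X_dummy = np.zeros((n_samples, 13))
y_dummy = np.arange(n_samples) % 2
X_tr, X_te, y_tr, y_te = train_test_split(
    X_dummy, y_dummy, test_size=test_size, random_state=32, stratify=y_dummy
)
print(n_train, n_test)
print(X_tr.shape, X_te.shape)
```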

In [None]:
from sklearn.preprocessing import StandardScaler

In [None]:
scaler = StandardScaler()
X_train_raw = scaler.fit_transform(X_train)
X_test_raw = scaler.transform(X_test)

In [None]:
train.columns

In [None]:
test.columns

In [None]:
X_test.shape

In [None]:
X_train.shape

In [None]:
X_train = pd.DataFrame(X_train_raw, columns=X_train.columns, index=X_train.index)
X_test = pd.DataFrame(X_test_raw, columns=X_test.columns, index=X_test.index)

In [None]:
X_train

In [None]:
y_train

In [None]:
X_test

In [None]:
y_test

-------------
OK, the features are scaled. Let's build a model.
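
It is worth spelling out what the scaling step does: `StandardScaler` learns the mean and standard deviation from the training part only and reuses them for the test part, so no information from the test rows leaks into the transform. A minimal sketch on made-up numbers (the array names are illustrative):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up values for illustration only
rng = np.random.default_rng(1)
train_part = rng.normal(loc=130.0, scale=15.0, size=(50, 2))
test_part = rng.normal(loc=130.0, scale=15.0, size=(20, 2))

scaler = StandardScaler().fit(train_part)  # statistics come from the training part only
train_scaled = scaler.transform(train_part)
test_scaled = scaler.transform(test_part)

# Same transform by hand: subtract the training mean, divide by the training std (ddof=0)
manual = (test_part - train_part.mean(axis=0)) / train_part.std(axis=0)

print(train_scaled.mean(axis=0).round(6), train_scaled.std(axis=0).round(6))
print(np.allclose(test_scaled, manual))
```

The scaled training columns have mean 0 and standard deviation 1 by construction. The scaled test columns will only be close to that, because they use the training statistics.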

## Step 5. Modeling and Prediction

In [None]:
from sklearn.linear_model import LogisticRegression

In [None]:
lr = LogisticRegression()

In [None]:
lr.fit(X_train, y_train.values.ravel())  # ravel() flattens the (n, 1) target column into a 1-D array
y_pred = lr.predict(X_test)


-----------------
Let's look more closely at ravel()

In [None]:
y_train.values.ravel()

In [None]:
y_train

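This difference matters: `y_train` is a one-column DataFrame, and passing it directly to `fit` raises a DataConversionWarning, while the 1-D array returned by `ravel()` does not.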
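
The shapes make the distinction concrete. A small sketch with made-up labels (the variable names are illustrative):

```python
import pandas as pd

y_frame = pd.DataFrame({"output": [1, 0, 1, 1]})  # made-up labels

column_vector = y_frame.values    # 2-D array with one column
flat = y_frame.values.ravel()     # flattened to 1-D
series = y_frame["output"]        # selecting the column also gives a 1-D object

print(column_vector.shape, flat.shape, series.shape)
```

Selecting the target as a Series, for example `df['output']`, is another way to get a 1-D target without calling `ravel()`.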

In [None]:
(y_pred == y_test.values.ravel()).sum() / len(y_pred)

This is the same as:

In [None]:
from sklearn.metrics import accuracy_score

In [None]:
accuracy_score(y_test, y_pred)  # signature is (y_true, y_pred)
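
To see that the manual ratio and `accuracy_score` agree, and to look one step beyond a single number, here is a short sketch on made-up labels and predictions. The confusion matrix is an addition for illustration; it is not computed elsewhere in this notebook.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])  # made-up labels
y_hat = np.array([1, 0, 0, 1, 0, 1, 1, 0])   # made-up predictions

manual_acc = (y_hat == y_true).mean()   # share of matching predictions
sk_acc = accuracy_score(y_true, y_hat)  # same quantity via scikit-learn
cm = confusion_matrix(y_true, y_hat)    # rows: true class, columns: predicted class

print(manual_acc, sk_acc)
print(cm)
```

Accuracy does not distinguish between the two kinds of error. The off-diagonal cells of the confusion matrix show how many positives were missed and how many negatives were flagged.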