# Exploratory Data Analysis

## Aim :
- Understand the data ("A small step forward is better than a big one backwards")
- Begin to develop a modelling strategy

## Features

- Age : Age of the patient

- Sex : Sex of the patient

- exang: exercise induced angina (1 = yes; 0 = no)

- ca: number of major vessels (0-3)

- cp : chest pain type :
  - Value 1: typical angina
  - Value 2: atypical angina
  - Value 3: non-anginal pain
  - Value 4: asymptomatic
  
- trtbps : resting blood pressure (in mm Hg)

- chol : cholesterol in mg/dl fetched via BMI sensor

- fbs : (fasting blood sugar > 120 mg/dl) (1 = true; 0 = false)

- rest_ecg : resting electrocardiographic results :
  - Value 0: normal
  - Value 1: having ST-T wave abnormality (T wave inversions and/or ST elevation or depression of > 0.05 mV)
  - Value 2: showing probable or definite left ventricular hypertrophy by Estes' criteria

- thalach : maximum heart rate achieved

- target : 0 = less chance of heart attack, 1 = more chance of heart attack (stored in the `output` column)
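
The coded values listed above can be kept in small lookup tables, which helps when labelling plots. This is a minimal sketch: `CP_LABELS`, `REST_ECG_LABELS` and `decode` are hypothetical names, and the codes are copied from the list above, so they may not match the codes actually stored in the CSV. For that reason unknown codes are surfaced rather than silently dropped.

```python
# Lookup tables built from the feature descriptions above
CP_LABELS = {
    1: "typical angina",
    2: "atypical angina",
    3: "non-anginal pain",
    4: "asymptomatic",
}
REST_ECG_LABELS = {
    0: "normal",
    1: "ST-T wave abnormality",
    2: "left ventricular hypertrophy",
}

def decode(values, labels):
    """Map integer codes to labels; unknown codes stay visible as 'unknown (code)'."""
    return [labels.get(v, f"unknown ({v})") for v in values]

decoded = decode([1, 4, 2, 0], CP_LABELS)
print(decoded)
```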


## Base Checklist
#### Shape Analysis :
- **target feature** : output
- **rows and columns** : 303 , 14
- **feature types** : qualitative : 0 , quantitative : 14
- **NaN analysis** :
    - No NaN (0 % of missing values)

#### Columns Analysis :
- **Target Analysis** :
    - Balanced (Yes/No) : Yes
    - Percentages : 55% / 45%
- **Categorical values**
    - There are 8 categorical features (0/1) (not including the target)
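
The checklist items above can be computed directly from a DataFrame. The sketch below runs on a small invented table, not on `heart.csv`. `summarize` is a hypothetical helper, and treating a column with at most 5 distinct values as categorical is an assumption made for illustration.

```python
import pandas as pd

def summarize(df, target, max_levels=5):
    """Return the basic shape-analysis facts for a DataFrame."""
    features = df.drop(columns=target)
    return {
        "shape": df.shape,
        # Percentage of missing values per column
        "nan_pct": (df.isna().mean() * 100).round(1).to_dict(),
        # Share of each class in the target
        "balance": df[target].value_counts(normalize=True).to_dict(),
        # Columns with few distinct values are treated as categorical (assumption)
        "categorical": [c for c in features.columns
                        if features[c].nunique() <= max_levels],
    }

# Toy data, invented for illustration only
toy = pd.DataFrame({
    "age": [63, 37, 41, 56, 57, 44, 52, 60],
    "sex": [1, 1, 0, 1, 0, 1, 1, 0],
    "chol": [233, 250, None, 236, 354, 263, 199, 240],
    "output": [1, 1, 1, 1, 0, 0, 0, 1],
})
summary = summarize(toy, target="output")
print(summary)
```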

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

## Dataset Analysis

In [None]:
data = pd.read_csv('../input/heart-attack-analysis-prediction-dataset/heart.csv')
df = data.copy()
# Show every row and column when displaying the DataFrame
pd.set_option('display.max_rows',df.shape[0])
pd.set_option('display.max_columns',df.shape[1])
df.head()

In [None]:
(df.isna().sum()/df.shape[0]*100).sort_values(ascending=False)

In [None]:
print('There are' , df.shape[0] , 'rows')
print('There are' , df.shape[1] , 'columns')

## Visualising Target and Features

In [None]:
df['output'].value_counts(normalize=True) # Check class balance (share of each class)

In [None]:
for col in df.select_dtypes(include=['float64','int64']):
    # displot is figure-level and creates its own figure
    sns.displot(df[col],kind='kde',height=3)
    plt.show()

In [None]:
X = df.drop('output',axis=1)
y = df['output']

## Detailed Analysis

In [None]:
riskyDF = df[y == 1]
safeDF = df[y == 0]

In [None]:
# pairplot is figure-level and creates its own figure
sns.pairplot(data,height=1.5)
plt.show()

In [None]:
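# Absolute Pearson correlation between every pair of columns
corr = df.corr(method='pearson').abs()

fig = plt.figure(figsize=(8,6))
# Values are absolute, so the colour scale runs from 0 to 1
sns.heatmap(corr, annot=True, cmap='viridis', vmin=0, vmax=1)
plt.title('Absolute Pearson Correlation')
plt.show()

# Features ranked by absolute correlation with the target
print(corr['output'].sort_values())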

In [None]:
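# X excludes the target, whose per-class distribution would be constant
for col in X.select_dtypes(include=['float64','int64']):
    plt.figure(figsize=(4,4))
    # histplot with kde=True replaces the deprecated sns.distplot
    sns.histplot(riskyDF[col], kde=True, stat='density', color='tab:red', label='High Risk')
    sns.histplot(safeDF[col], kde=True, stat='density', color='tab:blue', label='Low Risk')
    plt.legend()
    plt.show()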

### Comments

It looks like we have some very useful features here, with an absolute correlation with the target above 0.4.
The following features seem promising for predicting whether a patient will have a heart attack or not :
- **oldpeak**
- **exng**
- **cp**
- **thalachh**
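
The 0.4 cut-off used above can also be applied programmatically instead of reading the values off the heatmap. A minimal sketch on synthetic data follows. The columns `strong` and `noise` are invented for illustration, and the threshold is simply the one quoted above.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200
target = rng.integers(0, 2, size=n)

# Synthetic features: one tied to the target, one pure noise
toy = pd.DataFrame({
    "strong": target + rng.normal(0, 0.5, size=n),
    "noise": rng.normal(0, 1, size=n),
    "output": target,
})

# Absolute Pearson correlation of each feature with the target
corr_with_target = toy.corr()["output"].drop("output").abs()

# Keep only the features above the threshold
selected = corr_with_target[corr_with_target > 0.4].index.tolist()
print(corr_with_target.sort_values(ascending=False))
print(selected)
```

Pearson correlation only captures linear association, so a feature below the threshold is not necessarily useless to a non-linear model.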

We can also notice that **slp** and **oldpeak** look correlated. Let's find out!

In [None]:
for col in X.select_dtypes(include=['float64','int64']):
    # lmplot is figure-level and creates its own figure
    sns.lmplot(x='oldpeak', y=col, hue='output', data=df)
    plt.show()

#### Comments



# A bit of data engineering ...

In [None]:
def encoding(df):
    code = {
            # All columns are made of quantitative values (floats actually), so there is no need to encode the features
           }
    for col in df.select_dtypes('object'):
        df.loc[:,col]=df[col].map(code)
        
    return df

def imputation(df):
    
    df = df.dropna(axis=0) # There are no NaN anyways
    
    return df

def feature_engineering(df):
    useless_columns = [] # Let's consider we want to use all the features
    df = df.drop(useless_columns,axis=1)
    return df

In [None]:
def preprocessing(df):
    df = encoding(df)
    df = feature_engineering(df)
    df = imputation(df)
    
    X = df.drop('output',axis=1)
    y = df['output']    
      
    return df,X,y

### Comments
We can now analyze categorical features as quantitative features (note : there are no qualitative features to encode here)
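
For a dataset that did contain text columns, the `code` dictionary in `encoding` would hold the mapping. Below is a minimal sketch of that case, assuming a hypothetical `smoker` column and a hypothetical helper `encode_objects`. Neither exists in this notebook or in `heart.csv`.

```python
import pandas as pd

def encode_objects(df, code):
    """Return a copy where every object column is mapped through `code`."""
    out = df.copy()
    for col in out.select_dtypes("object"):
        out[col] = out[col].map(code)
    return out

# Hypothetical raw data with a text column (not present in heart.csv)
raw = pd.DataFrame({
    "age": [63, 37, 41],
    "smoker": pd.Series(["yes", "no", "yes"], dtype="object"),
})
encoded = encode_objects(raw, {"yes": 1, "no": 0})
print(encoded)
```

Working on a copy leaves the input untouched, which avoids surprises when the same frame is preprocessed more than once. Values missing from the mapping become NaN, so they would then be dropped by a step like `imputation`.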

# Modelling

In [None]:
from sklearn.model_selection import train_test_split
df = data.copy()
trainset, testset = train_test_split(df, test_size=0.2, random_state=0)
print(trainset['output'].value_counts())
print(testset['output'].value_counts())
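
The split above is purely random, so the class proportions in `trainset` and `testset` can drift from those of the full data. One possible refinement is the `stratify` argument of `train_test_split`, which keeps the target proportions the same in both parts. The sketch below uses an invented toy frame, not the heart data.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame with a 60% / 40% target, invented for illustration
toy = pd.DataFrame({
    "feature": np.arange(100),
    "output": [1] * 60 + [0] * 40,
})

# stratify keeps the 60/40 ratio in both the train and the test part
train, test = train_test_split(
    toy, test_size=0.2, random_state=0, stratify=toy["output"]
)

print(train["output"].value_counts(normalize=True))
print(test["output"].value_counts(normalize=True))
```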

In [None]:
_, X_train, y_train = preprocessing(trainset)
_, X_test, y_test = preprocessing(testset)

In [None]:
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.svm import SVC
from sklearn.pipeline import make_pipeline
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import RobustScaler

In [None]:
preprocessor = make_pipeline(RobustScaler())

RandomPipeline = make_pipeline(preprocessor,RandomForestClassifier(random_state=0))
AdaPipeline = make_pipeline(preprocessor,AdaBoostClassifier(random_state=0))
SVMPipeline = make_pipeline(preprocessor,SVC(random_state=0,probability=True))
KNNPipeline = make_pipeline(preprocessor,KNeighborsClassifier())
LRPipeline = make_pipeline(preprocessor,LogisticRegression())

In [None]:
dict_of_models = {'RandomForest': RandomPipeline,
'AdaBoost': AdaPipeline,
'SVM': SVMPipeline,
'KNN': KNNPipeline,
'LR': LRPipeline}

In [None]:
from sklearn.metrics import f1_score, accuracy_score, confusion_matrix, classification_report, roc_curve
from sklearn.model_selection import learning_curve, cross_val_score, GridSearchCV

def evaluation(model):
    model.fit(X_train, y_train)
    # calculating the probabilities
    y_pred_proba = model.predict_proba(X_test)

    # predicted class = column with the highest probability
    y_pred = np.argmax(y_pred_proba,axis=1)
    print('Accuracy = ', accuracy_score(y_test, y_pred))
    print('-')
    print(confusion_matrix(y_test,y_pred))
    print('-')
    print(classification_report(y_test,y_pred))
    print('-')
    
    N, train_score, val_score = learning_curve(model, X_train, y_train, 
                                               cv=4, scoring='f1', 
                                               train_sizes=np.linspace(0.1,1,10))
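    plt.figure(figsize=(12,8))
    plt.plot(N, train_score.mean(axis=1), label='train score')
    plt.plot(N, val_score.mean(axis=1), label='validation score')
    plt.xlabel('training set size')
    plt.ylabel('f1 score')
    plt.legend()
    # show each curve right after its printed report
    plt.show()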
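
The `np.argmax` step in `evaluation` picks the class with the highest probability. With two classes this amounts to thresholding the class-1 probability at 0.5. The sketch below uses made-up probabilities to show the equivalence, and shows how a different threshold (0.25 here, a hypothetical choice) changes the predictions.

```python
import numpy as np

# Made-up predict_proba output: one row per sample,
# columns = P(class 0), P(class 1)
proba = np.array([
    [0.9, 0.1],
    [0.4, 0.6],
    [0.7, 0.3],
    [0.2, 0.8],
])

# Class with the highest probability
by_argmax = np.argmax(proba, axis=1)

# Same decision, written as a threshold on P(class 1)
by_threshold = (proba[:, 1] > 0.5).astype(int)

# A lower threshold flags more samples as class 1
cautious = (proba[:, 1] > 0.25).astype(int)
print(by_argmax, by_threshold, cautious)
```

In a screening setting a lower threshold trades more false alarms for fewer missed high-risk patients. Which trade-off is right depends on the application and is not decided here.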

In [None]:
for name, model in dict_of_models.items():
    print('---------------------------------')
    print(name)
    evaluation(model)

### Comments

All 5 models look promising, but **SVM** has a slightly better accuracy **(79%)**.
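
With a test set of about 61 rows (20% of 303), a single prediction moves accuracy by roughly 1.6 points, so small gaps between models may be noise. One way to get a steadier comparison is cross-validation on the training data. The sketch below uses a synthetic dataset from `make_classification` as a stand-in for the heart data, and only two of the pipelines, to keep it short.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import RobustScaler

# Synthetic stand-in for the training set (not the heart data)
X_demo, y_demo = make_classification(n_samples=300, n_features=13, random_state=0)

candidates = {
    "KNN": make_pipeline(RobustScaler(), KNeighborsClassifier()),
    "LR": make_pipeline(RobustScaler(), LogisticRegression()),
}

results = {}
for name, model in candidates.items():
    # 5-fold accuracy; the scaler is refit inside each fold by the pipeline
    scores = cross_val_score(model, X_demo, y_demo, cv=5, scoring="accuracy")
    results[name] = (scores.mean(), scores.std())
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```

Reporting the standard deviation next to the mean shows whether a difference between two models is larger than the fold-to-fold spread.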

# Using AdaBoost

In [None]:
AdaPipeline.fit(X_train, y_train)
y_proba = AdaPipeline.predict_proba(X_test)
y_pred = np.argmax(y_proba,axis=1)

print("Adaboost : ", accuracy_score(y_test, y_pred))

In [None]:
y_pred_prob = AdaPipeline.predict_proba(X_test)[:,1]

fpr,tpr,thresholds=roc_curve(y_test,y_pred_prob)

plt.plot(fpr,tpr,label='AdaBoost ROC Curve')
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("AdaBoost ROC Curve")
plt.show()
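
A ROC curve is often summarised by the area under it (AUC). The sketch below computes AUC from its pairwise definition: the fraction of (positive, negative) pairs in which the positive sample gets the higher score. `auc_from_scores` is a hypothetical helper written for illustration and the scores are made up. In practice `sklearn.metrics.roc_auc_score` does the same job.

```python
def auc_from_scores(y_true, y_score):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count as half."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Made-up labels and scores: 3 of the 4 pairs are ranked correctly
auc = auc_from_scores([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(auc)  # 0.75
```

An AUC of 0.5 corresponds to random ranking and 1.0 to a perfect one.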

# If you like this notebook, please upvote!