This kernel is divided into three sections:

* Data preparation
* Data exploration and visualization
* Building the model

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
import seaborn as sns
# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 5GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
df= pd.read_csv("/kaggle/input/health-care-data-set-on-heart-attack-possibility/heart.csv")

In [None]:
df.head()

In [None]:
df.isnull().sum()

In [None]:
df.describe()

<h2>Data Preparation</h2>
For a medical dataset, it is good practice to group ages into categories, since this makes the affected groups easier to see and compare.

I will categorize age into senior citizen, middle aged, adult, young adult, teenager and child so we can understand the affected age groups.

* 'S' - Senior Citizen (above 60)
* 'MA' - Middle Aged (46 - 60)
* 'A' - Adult (31 - 45)
* 'YA' - Young Adult (20 - 30)
* 'T' - Teenager (13 - 19)
* 'C' - Child (12 and below)


In [None]:
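def categorize_age(age):
    # Checks run from oldest to youngest, so each branch only needs a lower bound
    if age > 60:
        return 'S'   # Senior citizen
    elif age > 45:
        return 'MA'  # Middle aged
    elif age > 30:
        return 'A'   # Adult
    elif age >= 20:
        return 'YA'  # Young adult (age 20 used to fall through and return None)
    elif age > 12:
        return 'T'   # Teenager
    else:
        return 'C'   # Child (also covers ages of 1 and below)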

In [None]:
def categorize_sex(sex):
    if (sex==0):
        return "F"
    else:
        return "M"

In [None]:
df['age_category'] = df['age'].apply(categorize_age)
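
As an alternative to the row-wise `apply` above, the same bands could be assigned in one step with `pd.cut`. This is only a sketch: the sample ages are made up for illustration, and the bin edges assume whole-number ages with 20 counted as a young adult.

```python
import numpy as np
import pandas as pd

# Hypothetical sample ages; in the notebook this would be df['age']
ages = pd.Series([8, 15, 20, 30, 31, 45, 46, 60, 61, 77])

# Right-closed bins: (-inf, 12], (12, 19], (19, 30], (30, 45], (45, 60], (60, inf)
bins = [-np.inf, 12, 19, 30, 45, 60, np.inf]
labels = ['C', 'T', 'YA', 'A', 'MA', 'S']

age_category = pd.cut(ages, bins=bins, labels=labels)
print(age_category.tolist())
```

Keeping the bin edges in a single list makes it harder to leave a gap between two categories.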

In [None]:
df['gender'] = df['sex'].apply(categorize_sex)
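
A possible variation for the gender column is `Series.map` with a dictionary. The sketch below uses a made-up series (the value 2 is deliberately invalid) to show that unexpected codes become NaN instead of being labelled "M" by the `else` branch.

```python
import pandas as pd

# Hypothetical sex codes; 2 is an invalid value added for illustration
sex = pd.Series([0, 1, 1, 0, 2])

# Codes missing from the dictionary are mapped to NaN
gender = sex.map({0: "F", 1: "M"})
print(gender.tolist())
```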

In [None]:
df.head()

<h2>Data Exploration</h2>

Now let the visualizations do all the talking :) The graphs are simple but effective.

In [None]:
df['gender'].value_counts().plot(kind='bar').set_title("Total records of Male & Female")

In [None]:
df['gender'].loc[df['target']==1].value_counts().plot(kind='bar').set_title("Genders having risk of heart disease")

In [None]:
df['gender'].loc[df['target']==0].value_counts().plot(kind='bar').set_title("Genders having no risk of heart disease")
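
The raw counts in the bar charts above depend on how many males and females are in the data, so a share per gender can be easier to compare. A minimal sketch with `pd.crosstab`, using made-up rows that only reuse the notebook's column names:

```python
import pandas as pd

# Hypothetical toy rows with the notebook's column names; not real patient data
toy = pd.DataFrame({
    "gender": ["M", "M", "M", "F", "F", "M", "F", "M"],
    "target": [1, 0, 1, 1, 1, 0, 0, 0],
})

# normalize="index" turns each row (gender) into proportions that sum to 1
rates = pd.crosstab(toy["gender"], toy["target"], normalize="index")
print(rates)
```

On the real data, the same call with `df["gender"]` and `df["target"]` would give the share of each gender in each target class.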

In [None]:
df.head()

In [None]:
df['age_category'].loc[(df['target']==1) & (df['gender']=='M')].value_counts().plot(
    kind='bar').set_title("Age categories having high risk - For Male")

In [None]:
df['age_category'].loc[(df['target']==1) & (df['gender']=='F')].value_counts().plot(
    kind='bar').set_title("Age categories having high risk - For Female")

We can see that the majority of them are adults and middle aged. Many factors could contribute to a person developing heart disease:

* Lifestyle of the person
* Food choices
* Smoking

and many more.


In [None]:
sns.scatterplot(x='thalach' , y='chol' , data=df , hue='target').set_title("Relationship between the cholesterol level and the highest heart rate recorded")

In [None]:
sns.violinplot(x='cp',y='chol',data=df.loc[df['target']==1] , 
               hue='age_category').set_title("Plot for different kinds of chest pain with respect to the cholesterol level & age")

We can see that middle-aged patients and adults mostly have chest pain types 1 and 2, and their cholesterol level is also higher than average.
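
A reading like this can be checked numerically with a `groupby`. The sketch below uses made-up rows (only the column names come from the notebook) and lists the chest pain and age groups whose mean cholesterol is above the overall mean.

```python
import pandas as pd

# Hypothetical toy rows that mimic the notebook's columns; not real patient data
toy = pd.DataFrame({
    "cp": [1, 1, 2, 2, 0, 0],
    "age_category": ["MA", "A", "MA", "A", "S", "S"],
    "chol": [260, 240, 250, 230, 200, 210],
    "target": [1, 1, 1, 1, 1, 0],
})

overall_mean = toy["chol"].mean()

# Mean cholesterol per (chest pain type, age category) among target == 1 rows
at_risk = toy.loc[toy["target"] == 1]
group_means = at_risk.groupby(["cp", "age_category"])["chol"].mean()

# Keep only the groups that sit above the overall average
above_average = group_means[group_means > overall_mean]
print(above_average)
```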

In [None]:
sns.violinplot(x='gender',y='thalach',data=df.loc[df['target']==1] , hue='age_category').set_title("Highest heart rate with respect to the Age Categories")

In [None]:
sns.distplot(df['thalach'].loc[(df['target']==1) & (df['gender']=='M')],
             color='r').set_title("Distribution for the highest heart rate recorded for Males")

In [None]:
sns.distplot(df['thalach'].loc[(df['target']==1) & (df['gender']=='F')] 
             , color='y').set_title("Distribution for the highest heart rate recorded for Females")

In [None]:
sns.countplot(x="age_category" , data=df.loc[(df["gender"]=="M") & 
                                             (df["exang"]==1) &
                                             (df["target"]==1)]).set_title("Male age categories at high risk with Exercise induced Angina")

In [None]:
sns.countplot(x="age_category" , data=df.loc[(df["gender"]=="F") & 
                                             (df["exang"]==1) &
                                             (df["target"]==1)]).set_title("Female age categories at high risk with Exercise induced Angina")

In [None]:
sns.distplot(df["thalach"].loc[(df["target"]==1)& (df["exang"]==0) & (df["gender"]=="M")]
            ,color='r').set_title("Highest heart rate recorded for Male(Red) & Female(Yellow) (High Risk, no Exercise induced Angina)")
sns.distplot(df["thalach"].loc[(df["target"]==1)& (df["exang"]==0) & (df["gender"]=="F")]
            ,color='y')

In [None]:
sns.distplot(df["thalach"].loc[(df["target"]==1)& (df["exang"]==1) & (df["gender"]=="M")]
            ,color='r').set_title("Highest heart rate recorded for Male(Red) & Female(Yellow) (High Risk, with Exercise induced Angina)")
sns.distplot(df["thalach"].loc[(df["target"]==1)& (df["exang"]==1) & (df["gender"]=="F")]
            ,color='y')

In [None]:
sns.distplot(df["thalach"].loc[(df["target"]==0)& (df["exang"]==1) & (df["gender"]=="M")]
            ,color="r").set_title("Highest heart rate recorded for Male(Red) & Female(Yellow) who are Low risk but have Angina")
sns.distplot(df["thalach"].loc[(df["target"]==0)& (df["exang"]==1) & (df["gender"]=="F")]
            ,color="y")

<h2>Building model</h2>

I will be using several tree-based classifiers, including boosting and bagging ensembles:

* CatBoost
* XGBClassifier
* RandomForestClassifier
* DecisionTreeClassifier

The accuracy of these models could be improved with some tuning and other techniques.

In [None]:
train = df.drop(['target','gender','age_category'],axis=1)

In [None]:
target=df['target'].values

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
from catboost import CatBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report


In [None]:
# Fix the seed so the split, and therefore the reports below, are repeatable
x_train , x_test , y_train, y_test = train_test_split(train,target,test_size=0.1,random_state=42)

In [None]:
print (f"X Train : {x_train.shape} \nY Train : {y_train.shape} \nX Test : {x_test.shape} \nY Test : {y_test.shape}")
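
With a test set this small, the class balance can differ between the train and test parts. One option is the `stratify` argument of `train_test_split`. The sketch below uses made-up arrays with a 60/40 class mix (an assumption, not the real data) to show the effect.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical features and labels: 100 rows, 60 of class 0 and 40 of class 1
X = np.arange(200).reshape(100, 2)
y = np.array([0] * 60 + [1] * 40)

# stratify=y keeps the 60/40 class mix in both parts; random_state fixes the shuffle
x_tr, x_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.1, stratify=y, random_state=42
)
print(f"Share of class 1 in train: {y_tr.mean():.2f}, in test: {y_te.mean():.2f}")
```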

In [None]:
classifiers = {"randomforest":RandomForestClassifier(),
              "xgboost":XGBClassifier(),
              "catboost":CatBoostClassifier(),
              "decisiontree": DecisionTreeClassifier()
              }

In [None]:
for model_name , model in classifiers.items():
    print (f"For : {model_name}")
    model.fit(x_train,y_train)
    prediction = model.predict(x_test)
    print (f"Classification Report for : {model_name}")
    print (classification_report(y_test,prediction))
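
Since a single 10% split leaves only a few test rows, the comparison could also be done with k-fold cross-validation. This is a sketch under assumptions: it uses synthetic data from `make_classification` in place of `train` and `target`, and only the two scikit-learn models so that it runs without extra packages.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the notebook's `train` and `target`
X, y = make_classification(n_samples=300, n_features=13, random_state=0)

models = {
    "randomforest": RandomForestClassifier(random_state=0),
    "decisiontree": DecisionTreeClassifier(random_state=0),
}

cv_scores = {}
for name, model in models.items():
    # 5-fold cross-validation: each row is used for testing exactly once
    scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
    cv_scores[name] = scores
    print(f"{name}: mean accuracy {scores.mean():.3f} (std {scores.std():.3f})")
```

The spread across folds gives a rough idea of how much of the difference between two models is just noise from the split.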

<h2>Summary</h2>


In this run, XGBoost performed the best of the four models. With only 10% of the data held out for testing, the ranking can change with a different split.

<h3>Thank you! :) Criticism is welcome </h3>