# Business Context

Cardio Catch Diseases is a company that specializes in detecting cardiac diseases at an early stage.

- **Business Model:** The price of a diagnosis varies with the accuracy achieved by the team of specialists: the client pays R$500 for each 5% of accuracy above 50%.


- **Current Scenario:** Because the diagnosis is complex, the current diagnostic accuracy varies between 55% and 65%, at a cost of R$1000.


- **Main Goal:** Create a disease diagnosis tool with stable accuracy. 


- **Secondary Goals:** Deliver a report answering the following questions:
    1. How accurate and precise is the tool?
    2. How much profit will Cardio Catch Diseases make with the new tool?
    3. How reliable are the results given by the new tool?
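
The pricing rule in the business model can be turned into a small calculation. The sketch below is one reading of it, not the company's actual formula: it assumes the price scales linearly (R$500 per 5 percentage points above 50%) and that the R$1000 cost applies to each diagnosis. The function and constant names are hypothetical.

```python
# Assumption: the price scales linearly with accuracy above the 50% baseline.
PRICE_PER_STEP = 500.0       # R$ per step, from the business model
STEP = 0.05                  # one step = 5 percentage points of accuracy
BASELINE = 0.50              # nothing is paid at or below this accuracy
COST_PER_DIAGNOSIS = 1000.0  # R$, assumed to be a per-diagnosis cost


def diagnosis_price(accuracy: float) -> float:
    """Price paid by the client for one diagnosis, in R$."""
    return max(0.0, accuracy - BASELINE) / STEP * PRICE_PER_STEP


def diagnosis_profit(accuracy: float) -> float:
    """Price minus the assumed cost of one diagnosis, in R$."""
    return diagnosis_price(accuracy) - COST_PER_DIAGNOSIS


# accuracy range reported for the current scenario
for acc in (0.55, 0.65):
    print(f'accuracy={acc:.0%}  price=R${diagnosis_price(acc):.2f}  '
          f'profit=R${diagnosis_profit(acc):.2f}')
```

Under these assumptions a diagnosis only breaks even at 60% accuracy, so the current 55% to 65% range swings between a loss and a gain of R$500 per diagnosis.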
    

Business Project reference: https://sejaumdatascientist.com/projeto-de-data-science-diagnostico-precoce-de-doencas-cardiovasculares/

Data Source: https://www.kaggle.com/sulianova/cardiovascular-disease-dataset

# 0.0. Imports

In [1]:
# data manipulation
import numpy as np
import pandas as pd

# data splitting (train/test and train/validation)
from sklearn import model_selection as ms

# machine learning models
from sklearn.ensemble import RandomForestClassifier

# model evaluation
from sklearn.metrics import accuracy_score

## 0.1. Auxiliary Functions

## 0.2. Load Data

In [2]:
data_raw = pd.read_csv('../datasets/cardio_train.csv', sep=';')

## 0.3. Split Data into Train and Test

In [3]:
data_train, data_test = ms.train_test_split(data_raw, test_size=0.2, random_state=42)
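
The split above is a plain random split. If the class ratio should be preserved exactly in both partitions, `train_test_split` accepts a `stratify` argument. A minimal sketch on synthetic data follows; the `toy` frame and its 70/30 class ratio are hypothetical and only illustrate the call.

```python
import numpy as np
import pandas as pd
from sklearn import model_selection as ms

# Synthetic stand-in for the raw data: 1000 rows, 30% positive class
rng = np.random.default_rng(42)
toy = pd.DataFrame({
    'feature': rng.normal(size=1000),
    'cardio': np.repeat([0, 1], [700, 300]),
})

# stratify keeps the share of each `cardio` class the same in both partitions
toy_train, toy_test = ms.train_test_split(
    toy, test_size=0.2, random_state=42, stratify=toy['cardio']
)

print(f"positive share in train: {toy_train['cardio'].mean():.3f}")
print(f"positive share in test:  {toy_test['cardio'].mean():.3f}")
```

With a target as balanced as the one in this dataset, a random split is usually close enough; stratification matters more when one class is rare.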

# 1.0. Data Description

In [4]:
df01 = data_train.copy()

## 1.1. Data Types

In [5]:
df01.dtypes

id               int64
age              int64
gender           int64
height           int64
weight         float64
ap_hi            int64
ap_lo            int64
cholesterol      int64
gluc             int64
smoke            int64
alco             int64
active           int64
cardio           int64
dtype: object

## 1.2. Data Dimensions

In [6]:
print(f'Number of rows: {df01.shape[0]}')
print(f'Number of columns: {df01.shape[1]}')

Number of rows: 56000
Number of columns: 13


## 1.3. Check NA

In [7]:
df01.isna().sum()

id             0
age            0
gender         0
height         0
weight         0
ap_hi          0
ap_lo          0
cholesterol    0
gluc           0
smoke          0
alco           0
active         0
cardio         0
dtype: int64

## 1.4. Fill Out NA

## 1.5. Change Data Type

## 1.6. Check Data Balance

In [8]:
df01['cardio'].value_counts(normalize=True)

0    0.500589
1    0.499411
Name: cardio, dtype: float64
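
With classes this balanced, a model that always predicts the most frequent class should score close to 50%, which gives a floor against which any real model can be compared. The sketch below illustrates this with scikit-learn's `DummyClassifier` on synthetic labels; `X_toy` and `y_toy` are hypothetical stand-ins, not the project data.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Synthetic labels with roughly the 50/50 balance reported above
rng = np.random.default_rng(0)
y_toy = rng.integers(0, 2, size=2000)
X_toy = np.zeros((2000, 1))  # the dummy model ignores the features

# always predict the most frequent class seen during fit
baseline = DummyClassifier(strategy='most_frequent').fit(X_toy, y_toy)
baseline_acc = accuracy_score(y_toy, baseline.predict(X_toy))

print(f'Majority-class baseline accuracy: {baseline_acc:.3f}')
```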

## 1.7. Descriptive Analysis

In [9]:
# select numeric attributes
num_attributes = df01[['age', 'height', 'weight', 'ap_hi', 'ap_lo']]

# select categoric attributes (include binary and status attributes)
cat_attributes = df01[['gender', 'cholesterol', 'gluc', 'smoke', 'alco', 'active', 'cardio']]

### 1.7.1. Numerical Attributes

In [10]:
# central tendency - mean, median
mean = pd.DataFrame(num_attributes.apply(np.mean)).T
median = pd.DataFrame(num_attributes.apply(np.median)).T

# dispersion - std, min, max, range, skew, kurtosis
# note: std is computed here but is not included in the summary table below
std = pd.DataFrame(num_attributes.apply(np.std)).T
min_ = pd.DataFrame(num_attributes.apply(min)).T
max_ = pd.DataFrame(num_attributes.apply(max)).T
range_ = pd.DataFrame(num_attributes.apply(lambda x: x.max() - x.min())).T
skew = pd.DataFrame(num_attributes.apply(lambda x: x.skew())).T
kurtosis = pd.DataFrame(num_attributes.apply(lambda x: x.kurtosis())).T

# join dataframes
num_stats = pd.concat([min_, max_, range_, mean, median, skew, kurtosis]).T.reset_index()
num_stats.columns = ['attributes', 'min', 'max', 'range', 'mean', 'median', 'skew', 'kurtosis']

# display numerical analysis
num_stats

| | attributes | min | max | range | mean | median | skew | kurtosis |
|---|---|---|---|---|---|---|---|---|
| 0 | age | 10798.0 | 23713.0 | 12915.0 | 19464.929107 | 19699.0 | -0.305523 | -0.823461 |
| 1 | height | 55.0 | 250.0 | 195.0 | 164.348125 | 165.0 | -0.594831 | 7.616794 |
| 2 | weight | 22.0 | 200.0 | 178.0 | 74.188586 | 72.0 | 1.015661 | 2.630229 |
| 3 | ap_hi | -140.0 | 14020.0 | 14160.0 | 128.737893 | 120.0 | 85.641414 | 7642.334178 |
| 4 | ap_lo | 0.0 | 11000.0 | 11000.0 | 97.025536 | 80.0 | 31.969044 | 1369.19557 |
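
The blood pressure columns stand out in the table above: `ap_hi` ranges from -140 to 14020 and `ap_lo` from 0 to 11000, with very large skew and kurtosis, which suggests data entry errors. One way to quantify the problem is to count readings outside a plausible range. The sketch below uses toy rows and hypothetical bounds; the bounds are assumptions for illustration, not clinical guidance.

```python
import pandas as pd

# Toy rows; the extreme values are the min/max figures from the table above
toy = pd.DataFrame({
    'ap_hi': [120, -140, 14020, 130],
    'ap_lo': [80, 0, 11000, 90],
})

# Hypothetical plausibility bounds in mmHg (assumed, inclusive on both ends)
bounds = {'ap_hi': (60, 260), 'ap_lo': (30, 200)}

# True where a reading falls outside its bounds
implausible = pd.DataFrame({
    col: ~toy[col].between(low, high) for col, (low, high) in bounds.items()
})

# number of implausible readings per column
print(implausible.sum())
```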


### 1.7.2. Categorical Attributes

In [11]:
cat_attributes.astype('object').describe()

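| | gender | cholesterol | gluc | smoke | alco | active | cardio |
|---|---|---|---|---|---|---|---|
| count | 56000 | 56000 | 56000 | 56000 | 56000 | 56000 | 56000 |
| unique | 2 | 3 | 3 | 2 | 2 | 2 | 2 |
| top | 1 | 1 | 1 | 0 | 0 | 1 | 0 |
| freq | 36401 | 41910 | 47619 | 51030 | 52929 | 45011 | 28033 |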


# 2.0. Feature Engineering

In [12]:
df02 = df01.copy()
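
No features are derived at this stage. As an illustration of what could go here, the sketch below converts `age` to years and computes a body mass index. It assumes, based on the value ranges in section 1.7.1, that `age` is recorded in days, `height` in centimeters, and `weight` in kilograms; the column names `age_years` and `bmi` are hypothetical.

```python
import pandas as pd

toy = pd.DataFrame({
    'age': [10798, 19699, 23713],   # min, median, max from section 1.7.1
    'height': [165, 165, 165],      # median height from section 1.7.1
    'weight': [72.0, 72.0, 72.0],   # median weight from section 1.7.1
})

# Assumption: age in days, height in cm, weight in kg
toy['age_years'] = toy['age'] / 365.25
toy['bmi'] = toy['weight'] / (toy['height'] / 100) ** 2

print(toy[['age_years', 'bmi']].round(1))
```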

# 3.0. Data Filtering

In [13]:
df03 = df02.copy()

## 3.1. Filtering Rows

## 3.2. Columns Selection

# 4.0. Exploratory Data Analysis

In [14]:
df04 = df03.copy()

# 5.0. Data Preparation

In [15]:
df05 = df04.copy()

# 6.0. Feature Selection

In [16]:
df06 = df05.copy()

## 6.1. Split Dataframe into Training and Validation Datasets

In [17]:
# features dataset
X = df06.drop(['cardio'],axis=1)

# response dataset
y = df06['cardio']
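
Note that `X` above still contains the `id` column. If `id` is only a row identifier, it carries no diagnostic information, and a tree-based model may still split on it. A minimal sketch of excluding it along with the target, on a hypothetical toy frame:

```python
import pandas as pd

toy = pd.DataFrame({
    'id': [0, 1, 2],
    'age': [18393, 20228, 18857],
    'ap_hi': [110, 140, 130],
    'cardio': [0, 1, 1],
})

# drop the identifier together with the response column
X_no_id = toy.drop(columns=['id', 'cardio'])
y_toy = toy['cardio']

print(list(X_no_id.columns))
```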

In [18]:
# split dataset into training and validation
X_train, X_val, y_train, y_val = ms.train_test_split(X, y, test_size=0.2, random_state=42)

# 7.0. Machine Learning Model

In [19]:
# select data for machine learning models
x_train = X_train.copy()
x_val = X_val.copy()

## 7.1. Random Forest Classifier 

In [20]:
# model definition
model_rf = RandomForestClassifier(n_estimators=100, n_jobs=-1).fit(x_train, y_train)
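
The model above is fitted without a fixed `random_state`, so its accuracy can vary slightly between runs, and a single validation split gives only one estimate. Since the main goal is stable accuracy, one option is to fix the seed and measure the spread across folds with cross-validation. The sketch below runs on synthetic data from `make_classification`; the data, the fold count, and the smaller forest are assumptions chosen to keep the example fast.

```python
import numpy as np
from sklearn import model_selection as ms
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the training data
X_toy, y_toy = make_classification(n_samples=500, n_features=8, random_state=42)

# fixed seeds make both the model and the folds reproducible
model = RandomForestClassifier(n_estimators=50, random_state=42)
cv = ms.StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# one accuracy score per fold
scores = ms.cross_val_score(model, X_toy, y_toy, cv=cv, scoring='accuracy')

print(f'accuracy per fold: {np.round(scores, 3)}')
print(f'mean={scores.mean():.3f}  std={scores.std():.3f}')
```

The standard deviation across folds is a rough indicator of how much the accuracy could move on new data, which speaks to the reliability question in the business context.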

### 7.1.1. Random Forest Classifier - Performance

In [21]:
# prediction (x_val is the copy of X_val prepared for the models)
pred_rf = model_rf.predict(x_val)

# accuracy
acc_rf = accuracy_score(y_val, pred_rf)  
print(f'Random Forest - Accuracy: {acc_rf}')

Random Forest - Accuracy: 0.7297321428571428
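
Accuracy alone does not answer how precise the tool is. Precision, recall, and the confusion matrix can be computed from the same predictions with `sklearn.metrics`. The sketch below uses small hand-made label lists (`y_true` and `y_pred` are hypothetical) so the counts are easy to check by eye.

```python
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_score, recall_score)

# Hand-made labels: 1 = disease present, 0 = disease absent
y_true = [0, 0, 1, 1, 1, 0, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0, 1, 0]

# for binary labels, ravel() returns the counts in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

acc = accuracy_score(y_true, y_pred)
precision = precision_score(y_true, y_pred)  # tp / (tp + fp)
recall = recall_score(y_true, y_pred)        # tp / (tp + fn)

print(f'tn={tn} fp={fp} fn={fn} tp={tp}')
print(f'accuracy={acc:.2f} precision={precision:.2f} recall={recall:.2f}')
```

In the notebook, the same calls would take `y_val` and `pred_rf` in place of the toy lists.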
