# **Heart Disease**
## **Introduction**
Heart failure remains one of the leading causes of mortality worldwide. Effective analysis and modelling of clinical records can provide valuable insights into the factors contributing to heart failure, enabling healthcare professionals to make informed decisions and improve patient outcomes.

## **Project Objective**

This machine learning project aims to analyze clinical records of patients with heart failure using Python. The primary objectives are:

1. **Data Exploration:** Examine the distribution and characteristics of clinical variables.
2. **Feature Engineering:** Select and transform relevant features for modeling.
3. **Model Development:** Train and evaluate machine learning models to predict heart disease outcomes.
4. **Insight Generation:** Identify key factors associated with heart disease and provide actionable recommendations.


## **Importing Libraries and Loading Data**

In [1]:
import warnings
import sys
import pandas as pd

sys.path.append("..")

from analysis_src.data_inspection import DataInspector, DataTypeInspection, SummaryStatisticsInspection
from analysis_src.missing_values_analysis import SimpleMissingValuesAnalysis
from src.clean_data import SimpleDataCleaner

warnings.filterwarnings("ignore")
df = pd.read_csv("../datasets/heart_failure_clinical_records_dataset.csv")

## **Data Preparation**
The data was cleaned with custom Python classes that convert each feature to an appropriate type and format.

In [11]:
cleaner = SimpleDataCleaner()
cleaned_data = cleaner.clean_data(df)
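
The implementation of `SimpleDataCleaner` lives in `src.clean_data` and is not shown in this notebook. As a rough illustration only, the sketch below shows the kind of transformation that would produce the labelled values seen in the cleaned data (`Yes`/`No`, `Man`/`Woman`, lower-case column names). The function name `sketch_clean`, the 0/1 raw encoding, and the `DEATH_EVENT` raw column name are all assumptions, not the project's actual code.

```python
import pandas as pd

# Assumed encodings; the project's SimpleDataCleaner may differ.
YES_NO = {0: "No", 1: "Yes"}
SEX_LABELS = {0: "Woman", 1: "Man"}
BINARY_COLUMNS = ["anaemia", "diabetes", "high_blood_pressure", "smoking"]


def sketch_clean(raw: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with lower-case column names and labelled categoricals."""
    cleaned = raw.copy()
    cleaned.columns = [name.lower() for name in cleaned.columns]
    for column in BINARY_COLUMNS:
        cleaned[column] = cleaned[column].map(YES_NO)
    cleaned["sex"] = cleaned["sex"].map(SEX_LABELS)
    return cleaned


# Two illustrative rows written in the assumed raw (0/1) encoding.
raw_example = pd.DataFrame(
    {
        "anaemia": [0, 1],
        "diabetes": [0, 1],
        "high_blood_pressure": [1, 0],
        "sex": [1, 0],
        "smoking": [0, 0],
        "DEATH_EVENT": [1, 1],
    }
)
cleaned_example = sketch_clean(raw_example)
print(cleaned_example)
```

Working on a copy leaves the raw frame untouched, so the original encoding stays available if a model later needs numeric inputs.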

## **Exploratory Data Analysis**

In [3]:
# Step 1: Data Inspection
inspector = DataInspector(DataTypeInspection())
inspector.inspect(cleaned_data)


Data Types and Non-NULL counts:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 299 entries, 0 to 298
Data columns (total 13 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   age                       299 non-null    float64
 1   anaemia                   299 non-null    object 
 2   creatinine_phosphokinase  299 non-null    int64  
 3   diabetes                  299 non-null    object 
 4   ejection_fraction         299 non-null    int64  
 5   high_blood_pressure       299 non-null    object 
 6   platelets                 299 non-null    float64
 7   serum_creatinine          299 non-null    float64
 8   serum_sodium              299 non-null    int64  
 9   sex                       299 non-null    object 
 10  smoking                   299 non-null    object 
 11  time                      299 non-null    int64  
 12  death_event               299 non-null    int64  
dtypes: float64(3), int64(5), object(5)

### **Data Inspection Results**
The dataset has been successfully loaded and cleaned. Below is a summary of the cleaned data:

**Data Overview**
The dataset contains 299 records of patients with heart failure. Each record includes the following variables:

- **age:** Age of the patient (float)
- **anaemia:** Whether the patient has anaemia (object)
- **creatinine_phosphokinase:** Level of the CPK enzyme in the blood (int)
- **diabetes:** Whether the patient has diabetes (object)
- **ejection_fraction:** Percentage of blood leaving the heart at each contraction (int)
- **high_blood_pressure:** Whether the patient has high blood pressure (object)
- **platelets:** Platelets in the blood (float)
- **serum_creatinine:** Level of serum creatinine in the blood (float)
- **serum_sodium:** Level of serum sodium in the blood (int)
- **sex:** Gender of the patient (object)
- **smoking:** Whether the patient smokes (object)
- **time:** Follow-up period (days) (int)
- **death_event:** Whether the patient died during the follow-up period (int)

### **Sample Data**
Here is a sample of the cleaned data:

|   | age | anaemia | creatinine_phosphokinase | diabetes | ejection_fraction | high_blood_pressure | platelets | serum_creatinine | serum_sodium | sex   | smoking | time | death_event |
|---|-----|---------|--------------------------|----------|-------------------|---------------------|-----------|------------------|--------------|-------|---------|------|-------------|
| 0 | 75.0| No      | 582                      | No       | 20                | Yes                 | 265000.00 | 1.9              | 130          | Man   | No      | 4    | 1           |
| 1 | 55.0| No      | 7861                     | No       | 38                | No                  | 263358.03 | 1.1              | 136          | Man   | No      | 6    | 1           |
| 2 | 65.0| No      | 146                      | No       | 20                | No                  | 162000.00 | 1.3              | 129          | Man   | Yes     | 7    | 1           |
| 3 | 50.0| Yes     | 111                      | No       | 20                | No                  | 210000.00 | 1.9              | 137          | Man   | No      | 7    | 1           |
| 4 | 65.0| Yes     | 160                      | Yes      | 20                | No                  | 327000.00 | 2.7              | 116          | Woman | No      | 8    | 1           |

### **Data Types**
The data types of the columns are as follows:

- **float64:** 3 columns
- **int64:** 5 columns
- **object:** 5 columns

The dataset has no missing values.
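
A check along the following lines would confirm the absence of missing values. It is a minimal sketch on a small stand-in frame (three columns taken from the first rows of the sample table); in the notebook the same calls would be applied to `cleaned_data`.

```python
import pandas as pd

# Small stand-in frame; in the notebook this would be `cleaned_data`.
example = pd.DataFrame(
    {
        "age": [75.0, 55.0, 65.0],
        "anaemia": ["No", "No", "No"],
        "ejection_fraction": [20, 38, 20],
    }
)

# Count missing entries per column, then reduce to a single yes/no answer.
missing_per_column = example.isna().sum()
has_missing = bool(missing_per_column.any())

print(missing_per_column)
print("Any missing values:", has_missing)
```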

The dataset is now ready for further analysis and modeling.
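
The `DataInspector` calls in this notebook (a strategy passed to the constructor, later swapped with `set_strategy`) suggest a strategy pattern. The real classes are defined in `analysis_src.data_inspection` and are not reproduced here; the sketch below only illustrates how such a design might look. The base class `InspectionStrategy` and every method body are assumptions.

```python
from abc import ABC, abstractmethod

import pandas as pd


class InspectionStrategy(ABC):
    """Hypothetical interface that each inspection strategy implements."""

    @abstractmethod
    def inspect(self, df: pd.DataFrame) -> None:
        ...


class DataTypeInspection(InspectionStrategy):
    def inspect(self, df: pd.DataFrame) -> None:
        # DataFrame.info() prints dtypes and non-null counts.
        print("Data Types and Non-NULL counts:")
        df.info()


class SummaryStatisticsInspection(InspectionStrategy):
    def inspect(self, df: pd.DataFrame) -> None:
        # DataFrame.describe() summarises numeric columns by default.
        print("Summary statistics for numerical features:")
        print(df.describe())


class DataInspector:
    """Delegates the inspection to whichever strategy is currently set."""

    def __init__(self, strategy: InspectionStrategy) -> None:
        self._strategy = strategy

    def set_strategy(self, strategy: InspectionStrategy) -> None:
        self._strategy = strategy

    def inspect(self, df: pd.DataFrame) -> None:
        self._strategy.inspect(df)


# Usage mirrors the notebook cells, on a tiny stand-in frame.
demo = pd.DataFrame({"age": [75.0, 55.0], "time": [4, 6]})
inspector = DataInspector(DataTypeInspection())
inspector.inspect(demo)
inspector.set_strategy(SummaryStatisticsInspection())
inspector.inspect(demo)
```

With this layout, a new kind of inspection can be added as another strategy class without changing `DataInspector` itself.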

In [14]:
inspector.set_strategy(SummaryStatisticsInspection())
inspector.inspect(cleaned_data)


Summary statistics for `numerical` features:
              age  creatinine_phosphokinase  ejection_fraction      platelets  \
count  299.000000                299.000000         299.000000     299.000000   
mean    60.833893                581.839465          38.083612  263358.029264   
std     11.894809                970.287881          11.834841   97804.236869   
min     40.000000                 23.000000          14.000000   25100.000000   
25%     51.000000                116.500000          30.000000  212500.000000   
50%     60.000000                250.000000          38.000000  262000.000000   
75%     70.000000                582.000000          45.000000  303500.000000   
max     95.000000               7861.000000          80.000000  850000.000000   

       serum_creatinine  serum_sodium        time  death_event  
count         299.00000    299.000000  299.000000    299.00000  
mean            1.39388    136.625418  130.260870      0.32107  
std             1.03451      

### **Summary Statistics**

The summary statistics describe the central tendency and spread of each numerical feature in the cleaned dataset. The mean, standard deviation, minimum, and maximum of each variable are listed below:

- **age:** 
    - Mean: 60.83 years
    - Standard Deviation: 11.89 years
    - Minimum: 40 years
    - Maximum: 95 years

- **creatinine_phosphokinase:** 
    - Mean: 581.84 units/L
    - Standard Deviation: 970.29 units/L
    - Minimum: 23 units/L
    - Maximum: 7861 units/L

- **ejection_fraction:** 
    - Mean: 38.08%
    - Standard Deviation: 11.83%
    - Minimum: 14%
    - Maximum: 80%

- **platelets:** 
    - Mean: 263358.03 kiloplatelets/mL
    - Standard Deviation: 97804.24 kiloplatelets/mL
    - Minimum: 25100 kiloplatelets/mL
    - Maximum: 850000 kiloplatelets/mL

- **serum_creatinine:** 
    - Mean: 1.39 mg/dL
    - Standard Deviation: 1.03 mg/dL
    - Minimum: 0.50 mg/dL
    - Maximum: 9.40 mg/dL

- **serum_sodium:** 
    - Mean: 136.63 mEq/L
    - Standard Deviation: 4.41 mEq/L
    - Minimum: 113 mEq/L
    - Maximum: 148 mEq/L

- **time:** 
    - Mean: 130.26 days
    - Standard Deviation: 77.87 days
    - Minimum: 4 days
    - Maximum: 285 days

- **death_event:** 
    - Mean: 0.32 (32% of patients experienced a death event)
    - Standard Deviation: 0.47
    - Minimum: 0
    - Maximum: 1
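
Because `death_event` takes only the values 0 and 1, its mean is simply the proportion of patients who died. The reported mean of 0.32107 over 299 records corresponds to about 96 deaths (0.32107 × 299 ≈ 96). The sketch below rebuilds a 0/1 column with that inferred count and recovers the reported mean and standard deviation. The variable names are illustrative, and the death count is derived from the summary output rather than read from the data.

```python
import statistics

n_patients = 299
n_deaths = round(0.32107 * n_patients)  # inferred from the reported mean

# Rebuild a 0/1 column with the inferred number of deaths.
death_event = [1] * n_deaths + [0] * (n_patients - n_deaths)

mean = statistics.mean(death_event)
std = statistics.stdev(death_event)  # sample std (n - 1), as pandas' describe() uses

print(f"deaths: {n_deaths}, mean: {mean:.5f}, std: {std:.2f}")
# prints: deaths: 96, mean: 0.32107, std: 0.47
```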

These statistics summarize the range and variability of each clinical variable. Some features are strongly right-skewed: for example, `creatinine_phosphokinase` has a mean of 581.84 against a median of 250 and a maximum of 7861. This information guides the feature engineering and modeling steps used to predict heart disease outcomes.