# **Heart Disease**
## **Introduction**
Heart failure remains one of the leading causes of mortality worldwide. Effective analysis and modelling of clinical records can provide valuable insights into the factors contributing to heart failure, enabling healthcare professionals to make informed decisions and improve patient outcomes.

## **Project Objective**

This machine learning project aims to analyze clinical records of patients with heart failure using Python. The primary objectives are:

1. **Data Exploration:** Examine the distribution and characteristics of clinical variables.
2. **Feature Engineering:** Select and transform relevant features for modeling.
3. **Model Development:** Train and evaluate machine learning models to predict heart disease outcomes.
4. **Insight Generation:** Identify key factors associated with heart disease and provide actionable recommendations.

## **Dataset Overview**
This analysis uses the heart failure clinical records dataset (`heart_failure_clinical_records_dataset.csv`), which contains 299 records of patients with heart failure across 13 columns. The dataset includes variables such as:
- **Demographics** (age, sex)
- **Medical history** (high blood pressure, diabetes, anaemia, smoking)
- **Clinical measurements** (creatinine_phosphokinase, ejection_fraction, platelets, serum_creatinine, serum_sodium)
- **Heart failure outcomes:** Whether the patient died during the follow-up period (`DEATH_EVENT`), together with the length of the follow-up period in days (`time`).
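
As a quick sanity check, the variable groups listed above can be written down as a column map and compared with the loaded frame. The sketch below is illustrative only: `EXPECTED_COLUMNS` and `missing_columns` are hypothetical names, and the two-row frame with made-up values stands in for the real CSV.

```python
import pandas as pd

# Hypothetical grouping of the dataset's columns, following the overview above.
EXPECTED_COLUMNS = {
    "demographics": ["age", "sex"],
    "medical_history": ["high_blood_pressure", "diabetes", "anaemia", "smoking"],
    "clinical_measurements": [
        "creatinine_phosphokinase", "ejection_fraction", "platelets",
        "serum_creatinine", "serum_sodium",
    ],
    "outcome": ["DEATH_EVENT", "time"],
}


def missing_columns(df: pd.DataFrame) -> list:
    """Return the expected column names that are absent from df."""
    expected = [col for group in EXPECTED_COLUMNS.values() for col in group]
    return [col for col in expected if col not in df.columns]


# Stand-in frame with made-up values; the real data comes from the CSV file.
sample = pd.DataFrame({"age": [60.0, 75.0], "sex": [1, 0], "DEATH_EVENT": [0, 1]})
missing = missing_columns(sample)
print(missing)
```

Running `missing_columns(df)` on the full dataset should return an empty list if every column named in the overview is present.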


## **Data Preparation**
The data was cleaned by transforming each feature into the appropriate format. The libraries used to transform and analyse the data are imported below.

In [1]:
import warnings
import sys
import pandas as pd

sys.path.append("../src/")

from analysis_src.data_inspection import DataInspector, DataTypeInspection, SummaryStatisticsInspection
from analysis_src.missing_values_analysis import SimpleMissingValuesAnalysis
# A notebook has no parent package, so a relative import fails here.
# Assumes clean_data sits directly under ../src/, which is already on sys.path.
import clean_data

warnings.filterwarnings("ignore")
df = pd.read_csv("../datasets/heart_failure_clinical_records_dataset.csv")
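
A notebook is not run as part of a package, so relative imports (`from ..src import ...`) have no parent to resolve against; modules under `../src` are reached through `sys.path` instead. A minimal sketch, assuming the notebook lives one directory below the project root with `src/` beside it (the helper name `add_src_to_path` is hypothetical):

```python
import sys
from pathlib import Path


def add_src_to_path(notebook_dir: Path) -> Path:
    """Put <notebook_dir>/../src on sys.path once and return the resolved path."""
    src_dir = (notebook_dir / ".." / "src").resolve()
    if str(src_dir) not in sys.path:
        sys.path.append(str(src_dir))
    return src_dir


# In Jupyter the working directory is usually the notebook's own folder.
src_dir = add_src_to_path(Path.cwd())
print(src_dir)
```

Resolving to an absolute path keeps the import working even if the working directory changes later in the session, and the membership check avoids appending duplicates when the cell is re-run.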

In [13]:
cleaned_data = 

In [14]:
inspector = DataInspector(DataTypeInspection())  # assumed: created in a cell not shown here
inspector.inspect(df)


Data Types and Non-NULL counts:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 299 entries, 0 to 298
Data columns (total 13 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   age                       299 non-null    float64
 1   anaemia                   299 non-null    int64  
 2   creatinine_phosphokinase  299 non-null    int64  
 3   diabetes                  299 non-null    int64  
 4   ejection_fraction         299 non-null    int64  
 5   high_blood_pressure       299 non-null    int64  
 6   platelets                 299 non-null    float64
 7   serum_creatinine          299 non-null    float64
 8   serum_sodium              299 non-null    int64  
 9   sex                       299 non-null    int64  
 10  smoking                   299 non-null    int64  
 11  time                      299 non-null    int64  
 12  DEATH_EVENT               299 non-null    int64  
dtypes: float64(3), int64(10)
memory 

In [15]:
inspector.set_strategy(SummaryStatisticsInspection())
inspector.inspect(df)


Summary statistics for `numerical` features:
              age     anaemia  creatinine_phosphokinase    diabetes  \
count  299.000000  299.000000                299.000000  299.000000   
mean    60.833893    0.431438                581.839465    0.418060   
std     11.894809    0.496107                970.287881    0.494067   
min     40.000000    0.000000                 23.000000    0.000000   
25%     51.000000    0.000000                116.500000    0.000000   
50%     60.000000    0.000000                250.000000    0.000000   
75%     70.000000    1.000000                582.000000    1.000000   
max     95.000000    1.000000               7861.000000    1.000000   

       ejection_fraction  high_blood_pressure      platelets  \
count         299.000000           299.000000     299.000000   
mean           38.083612             0.351171  263358.029264   
std            11.834841             0.478136   97804.236869   
min            14.000000             0.000000   25100.0000
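
The calls to `inspector.set_strategy(...)` and `inspector.inspect(df)` suggest a strategy pattern, in which the inspector delegates to an interchangeable inspection object. The project's `analysis_src.data_inspection` module is not shown here, so the classes below are a guess at its shape rather than its actual code; the base-class name `InspectionStrategy` and the constructor signature are assumptions.

```python
from abc import ABC, abstractmethod

import pandas as pd


class InspectionStrategy(ABC):
    """Assumed common interface shared by every inspection."""

    @abstractmethod
    def inspect(self, df: pd.DataFrame) -> None: ...


class DataTypeInspection(InspectionStrategy):
    def inspect(self, df: pd.DataFrame) -> None:
        print("Data Types and Non-NULL counts:")
        df.info()  # prints dtypes and non-null counts to stdout


class SummaryStatisticsInspection(InspectionStrategy):
    def inspect(self, df: pd.DataFrame) -> None:
        print("Summary statistics for numerical features:")
        print(df.describe())


class DataInspector:
    """Holds the current strategy and delegates inspect() to it."""

    def __init__(self, strategy: InspectionStrategy) -> None:
        self._strategy = strategy

    def set_strategy(self, strategy: InspectionStrategy) -> None:
        self._strategy = strategy

    def inspect(self, df: pd.DataFrame) -> None:
        self._strategy.inspect(df)


# Toy frame with made-up values, only to exercise both strategies.
toy = pd.DataFrame({"age": [60.0, 75.0], "smoking": [0, 1]})
inspector = DataInspector(DataTypeInspection())
inspector.inspect(toy)
inspector.set_strategy(SummaryStatisticsInspection())
inspector.inspect(toy)
```

With this layout, a new kind of inspection only needs another `InspectionStrategy` subclass; the notebook cells that call `inspector.inspect(df)` stay unchanged.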

In [16]:
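import numpy as np

# Recode the binary flags as "No"/"Yes"; the output below implies anaemia was recoded too
for col in ["anaemia", "diabetes"]:
    df[col] = np.where(df[col] == 0, "No", "Yes")
df["anaemia"].value_counts()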

anaemia
No     170
Yes    129
Name: count, dtype: int64
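
Overwriting a 0/1 column with text labels makes tables easier to read but removes the numeric form that most models expect. One possible alternative is to keep the original column and add a labelled copy. The `_label` suffix and the `add_labels` helper below are hypothetical, and the toy values are made up.

```python
import pandas as pd

BINARY_LABELS = {0: "No", 1: "Yes"}


def add_labels(df: pd.DataFrame, columns: list) -> pd.DataFrame:
    """Return a copy of df with a '<column>_label' text column per flag."""
    out = df.copy()
    for col in columns:
        out[f"{col}_label"] = out[col].map(BINARY_LABELS)
    return out


# Toy frame standing in for the real data.
toy = pd.DataFrame({"anaemia": [0, 1, 0], "diabetes": [1, 1, 0]})
labelled = add_labels(toy, ["anaemia", "diabetes"])
print(labelled["anaemia_label"].value_counts())
```

Because the helper works on a copy and maps from a fixed dictionary, re-running the cell gives the same result instead of relabelling already-converted values.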

In [22]:
df.sex

0      1
1      1
2      1
3      1
4      0
      ..
294    1
295    0
296    0
297    1
298    1
Name: sex, Length: 299, dtype: int64

In [23]:
df.smoking

0      0
1      0
2      1
3      0
4      0
      ..
294    1
295    0
296    0
297    1
298    1
Name: smoking, Length: 299, dtype: int64
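
Displaying a whole column shows only its first and last rows. For binary flags such as `sex` and `smoking`, counting the distinct codes is usually more informative. The sketch below uses a small made-up frame in place of `df`.

```python
import pandas as pd

# Made-up values standing in for the real columns.
toy = pd.DataFrame({"sex": [1, 1, 0, 1], "smoking": [0, 0, 1, 0]})

# Count how often each code (0 or 1) occurs in every flag column.
counts = {col: toy[col].value_counts().to_dict() for col in ["sex", "smoking"]}
for col, freq in counts.items():
    print(col, freq)
```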