Cardio Catch Diseases

Summary

0. Business Problem
- 0.1. What is a Service
- 0.2. What are Cardiovascular Diseases
1. Solution Strategy & Assumptions Summary
- 1.1. First CRISP Cycle
- 1.2. Second CRISP Cycle
2. Exploratory Data Analysis
- 2.1. EDA On First Cycle
- 2.2. Top 3 EDA Insights
3. Data Preparation
- 3.1. Dataset Balance
4. Embedding Space Study
5. Machine Learning Models
6. Model Tuning
- 6.1. First Cycle Model Tuning
  - 6.1.1. Calibration Curves
  - 6.1.2. Confidence Intervals
- 6.2. Second Cycle Model Tuning
  - 6.2.1. Calibration Curves
  - 6.2.2. Confidence Intervals
7. Model Business Results
8. Model Deployment
9. References

0. Business Problem

Cardio Catch Diseases is a company specializing in early-stage heart disease detection. It follows a service business model: the company offers early diagnosis of cardiovascular disease for a fee.

Currently, the diagnosis of a cardiovascular disease is done manually by a team of specialists. The current accuracy of the diagnosis varies between 55% and 65%, due to the complexity of the diagnosis and also to the fatigue of the team that takes turns to minimize the risks. The cost of each diagnosis, including the equipment and the analysts' payroll, is around R$ 1,000.00.

The goal is to use a machine learning model to diagnose cardiovascular disease with better precision than the current manual process.

What is the precision and accuracy of this new tool?

How much profit will Cardio Catch Diseases earn with this new tool?

What is the confidence interval of this new tool?

0.1. What is a Service

A service is a business model similar to consulting: the company performs a job and is paid according to the results of its work. At Cardio Catch Diseases, the price of a diagnosis, paid by the client, varies with the precision achieved by the team of specialists: the client pays R$500.00 for every 5% of accuracy above 50%. For example, a diagnosis with 55% accuracy costs the client R$500.00, one with 60% accuracy costs R$1,000.00, and so on. If the diagnostic accuracy is 50%, the client does not pay for it.
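The pricing rule can be written as a small function. This is a minimal sketch: the function name and its parameters are hypothetical, and it assumes that only full 5-point steps above 50% are billed, which the rule does not state explicitly.

```python
def diagnosis_price(accuracy_pct: float,
                    baseline_pct: float = 50.0,
                    step_pct: float = 5.0,
                    price_per_step: float = 500.0) -> float:
    """Price (R$) of one diagnosis under the rule described above.

    Assumption: partial steps are not billed (57% is priced like 55%).
    """
    if accuracy_pct <= baseline_pct:
        return 0.0
    full_steps = int((accuracy_pct - baseline_pct) // step_pct)
    return full_steps * price_per_step


for acc in (50, 55, 60, 65):
    print(f"{acc}% accuracy -> R$ {diagnosis_price(acc):,.2f}")
```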

Another example is terrain analysis, where the price of the analysis can change considerably depending on the size and quality of the terrain.

0.2. What are Cardiovascular Diseases

Cardiovascular diseases (CVDs) are a group of disorders of the heart and blood vessels.

Heart attacks and strokes are usually acute events and are mainly caused by a blockage that prevents blood from flowing to the heart or brain. The most common reason for this is a build-up of fatty deposits on the inner walls of the blood vessels that supply the heart or brain.

The most important behavioural risk factors of heart disease and stroke are unhealthy diet, physical inactivity, tobacco use and harmful use of alcohol. The effects of behavioural risk factors may show up in individuals as raised blood pressure, raised blood glucose, raised blood lipids, and overweight and obesity. These “intermediate risk factors” can be measured in primary care facilities and indicate an increased risk of heart attack, stroke, heart failure and other complications.

In addition, drug treatment of hypertension, diabetes and high blood lipids is necessary to reduce cardiovascular risk and prevent heart attacks and strokes among people with these conditions.

0.2.1. Heart Attack

A heart attack occurs when the blood flow to a part of the heart is blocked by a blood clot, fat or other substances. If the blockage cuts off the blood flow completely, the part of the heart muscle supplied by that artery begins to die. The interruption of blood flow can damage or destroy part of the heart muscle.

The medications and lifestyle changes that your doctor recommends may vary according to how badly your heart was damaged, and to what degree of heart disease caused the heart attack.

0.2.2. Heart Failure

Heart failure occurs when the heart muscle doesn't pump blood as well as it should. When this happens, blood often backs up and fluid can build up in the lungs, causing shortness of breath. In heart failure, the main pumping chambers of the heart (the ventricles) may become stiff and not fill properly between beats. In some people, the heart muscle may become damaged and weakened. The ventricles may stretch to the point that the heart can't pump enough blood through the body.

One way to prevent heart failure is to prevent and control conditions that can cause it, such as coronary artery disease, high blood pressure, diabetes and obesity.

0.2.3. Heart Valve Problems

Your heart has four valves that keep blood flowing in the correct direction. In some cases, one or more of the valves don't open or close properly. This can cause the blood flow through your heart to your body to be disrupted.

Regurgitation.

The valve flaps don't close properly, causing blood to leak backward in your heart. This commonly occurs due to valve flaps bulging back, a condition called prolapse.

Stenosis.

The valve flaps become thick or stiff and possibly fuse together. This results in a narrowed valve opening and reduced blood flow through the valve.

Atresia.

The valve isn't formed, and a solid sheet of tissue blocks the blood flow between the heart chambers.

0.2.4. Stroke

There are two types of stroke:

Ischemic stroke

It happens when the brain's blood vessels become narrowed or blocked, causing severely reduced blood flow (ischemia). It is caused by fatty deposits that build up in blood vessels, or by blood clots or other debris that travel through the bloodstream, most often from the heart, and lodge in the blood vessels in the brain.

Hemorrhagic stroke

It occurs when a blood vessel within the brain bursts. This is most often caused by uncontrolled hypertension (high blood pressure).

0.2.5. Arrhythmia

Arrhythmia refers to an abnormal heart rhythm. There are various types of arrhythmia: the heart can beat too slowly, too fast or irregularly. An arrhythmia can affect how well your heart works: with an irregular heartbeat, your heart may not be able to pump enough blood to meet your body’s needs.

Bradycardia

A heart rate that’s too slow: less than 60 beats per minute.

Tachycardia

A heart rate that’s too fast: more than 100 beats per minute.

0.3. Blood Pressure

Blood pressure is another important indicator of heart health. It is measured in mm Hg and written like a fraction of two values, systolic over diastolic (for example, 120 systolic / 60 diastolic).

Systolic Pressure

The top number refers to the amount of pressure in your arteries during the contraction of your heart muscle.

Diastolic Pressure

The bottom number refers to your blood pressure when your heart muscle is between beats.

Base dataset: Cardiovascular Disease.

1. Solution Strategy and Assumptions Summary

The model is deployed on Google Sheets, where the cardio team can check the probability that new users in the database have a cardiovascular disease.

Full Documentation PT-BR

1.1. First CRISP Cycle

Data Cleaning & Descriptive Statistics.: The first step was to download the dataset, import it into Jupyter and go through seven steps covering data types, data dimensions, filling NAs, and so on. For the first statistical dataframe, I used simple descriptive statistics to check how the data is organized, and found that the dataset has only numerical attributes.
Feature Engineering.: I used coggle.it to build a mind map and derived a list of hypotheses from it. Based on that list, I created new features, mostly blood-related: blood volume, systolic and diastolic blood pressure, pulse pressure and BMI. The dataset does not have other attributes that would allow further feature engineering.
Data Filtering.: The dataset has some outliers in height, weight and blood pressure, including extreme negative diastolic pressure. To handle them I relied on "medical intuition": I removed the extreme negative diastolic and systolic blood pressure values and applied a small threshold on height and weight.
Data Balance.: In the next cycle I plan to use SMOTEENN to clean overlapping data for better model accuracy and precision.
Exploratory Data Analysis.: With this dataset it is hard to define a boundary between classes; much deeper feature engineering is needed.
Data Preparation.: I used MinMaxScaler, RobustScaler and frequency encoding to rescale some features, and dropped "alco" and "smoke" because XGBoost and Random Forest did not rank these two features as relevant.
ML Models.: I tried 7 models in total, four of which are tree-based.
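The feature engineering and filtering steps above can be sketched with pandas. This is only an illustration: the column names (`height`, `weight`, `ap_hi`, `ap_lo`), the sample rows and the height and weight thresholds are assumptions, not the values used in the notebooks.

```python
import pandas as pd

# Tiny hypothetical sample; column names and units are assumptions.
raw = pd.DataFrame({
    "height": [168, 156, 165, 250],       # cm
    "weight": [62.0, 85.0, 64.0, 70.0],   # kg
    "ap_hi": [110, 140, 130, 120],        # systolic pressure, mm Hg
    "ap_lo": [80, 90, -70, 80],           # diastolic pressure, mm Hg
})


def filter_outliers(df, height_range=(120, 220), weight_range=(30, 200)):
    """Drop non-positive pressures and implausible heights and weights.

    The ranges are illustrative thresholds, not the project's values.
    """
    mask = (
        (df["ap_hi"] > 0)
        & (df["ap_lo"] > 0)
        & df["height"].between(*height_range)
        & df["weight"].between(*weight_range)
    )
    return df.loc[mask].reset_index(drop=True)


def add_features(df):
    """Add BMI (kg / m^2) and pulse pressure (systolic minus diastolic)."""
    out = df.copy()
    out["bmi"] = out["weight"] / (out["height"] / 100.0) ** 2
    out["pulse_pressure"] = out["ap_hi"] - out["ap_lo"]
    return out


clean = add_features(filter_outliers(raw))
print(clean)
```

Filtering before deriving features keeps invalid pressures from producing misleading pulse-pressure values.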

1.2. Second CRISP Cycle

Data Balance.: I used SMOTEENN and SMOTETomek to balance the dataset.
Data Preparation.: I used MinMaxScaler, RobustScaler and frequency encoding to rescale some features of both datasets (the SMOTEENN dataset and the SMOTETomek dataset), and dropped "alco" and "smoke" because XGBoost and Random Forest did not rank these two features as relevant.
ML Models.: I used SGD and AdaBoost, with a focus on the SGD classifier.

1.3. Third CRISP Cycle

Feature Space Study.: In this new cycle I tried several embedding spaces using UMAP, PCA, t-SNE and tree-based embedding. I used all the datasets created previously (the entire dataset, the SMOTEENN dataset and the SMOTETomek dataset) to analyze how the data behaves under different tools.

2. Exploratory Data Analysis

EDA is one of the most important steps in a data science project: in this step you "deep dive" into the data through univariate, bivariate and multivariate analysis.

2.1. EDA On First Cycle

Univariate Analysis

Some features have similar, approximately normal distributions, which is good for machine learning models.
The classes are well balanced.

Bivariate Analysis

When the features are split by class, it is hard to find a separating boundary.

With Pearson's correlation, I found a positive correlation of approximately 0.50 between height and gender.
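For reference, Pearson's correlation coefficient can be computed directly from its definition. The sketch below uses made-up values and a hypothetical numeric coding for gender; it does not reproduce the 0.50 figure from the dataset.

```python
import numpy as np


def pearson_r(x, y):
    """Pearson correlation: covariance divided by the product of the spreads."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc = x - x.mean()
    yc = y - y.mean()
    return float((xc @ yc) / np.sqrt((xc @ xc) * (yc @ yc)))


# Illustrative values only; the 1/2 coding for gender is an assumption.
gender = [1, 2, 1, 2, 1, 2]
height = [160, 175, 158, 180, 165, 170]

r = pearson_r(gender, height)
print(f"Pearson r between gender and height: {r:.3f}")
```

With a two-valued variable such as gender, this is the point-biserial correlation, so a positive value means one group is taller on average.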

2.2. Top 3 EDA Insights

People who suffer from dwarfism have 25% higher cholesterol than a normal adult person.

Alcoholic people have a greater chance of developing cardiovascular disease than people who smoke.

People over 45 are 70% more likely to develop cardiovascular disease.

3. Data Preparation

For rescaling I used both MinMaxScaler and RobustScaler, plus frequency encoding for numerical features such as Gluc Level.

In the first cycle I did not use SMOTEENN to clean overlapping data. In the next cycles I will try more things, such as better feature engineering, PCA and SMOTEENN.

3.1. Dataset Balance

3.1.1. First Cycle

In the next cycle I will try to balance the dataset with SMOTEENN and SMOTETomek.

3.1.2. Second Cycle

I tried both SMOTEENN and SMOTETomek on the cardio dataset.

SMOTEENN removed a lot of the overlapping data in the dataset, while SMOTETomek did not work as well as SMOTEENN for balancing.

4. Embedding Space Study

I dedicated a complete cycle, the third one, to the feature space (embedding) study. In it I analyzed the behavior of the datasets I had already created (the SMOTEENN dataset, the SMOTETomek dataset and the full dataset) with different tools and methods.

For rescaling I used both StandardScaler and MinMaxScaler, to analyze the differences between them.

With both rescaling methods I ran UMAP, t-SNE, PCA and tree-based embedding on all three datasets. You can check the third-cycle notebook to see the different behaviors.
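A minimal sketch of this kind of comparison, limited to PCA and t-SNE from scikit-learn and run on synthetic data. UMAP and the tree-based embedding are left out because they need extra packages or design choices that are not described here.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Synthetic stand-in for one of the datasets (assumption).
X, y = make_classification(
    n_samples=200, n_features=10, n_informative=5, random_state=0
)

embeddings = {}
for scaler in (StandardScaler(), MinMaxScaler()):
    scaler_name = type(scaler).__name__
    X_scaled = scaler.fit_transform(X)
    # Project the rescaled data to 2 dimensions with each method.
    embeddings[(scaler_name, "PCA")] = PCA(
        n_components=2, random_state=0
    ).fit_transform(X_scaled)
    embeddings[(scaler_name, "tSNE")] = TSNE(
        n_components=2, random_state=0
    ).fit_transform(X_scaled)

for key, emb in embeddings.items():
    print(key, emb.shape)
```

Each 2D array can then be scatter-plotted and colored by class to see whether the rescaling method changes how well the classes separate.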

I'm not going to share all the spaces in the README so as not to make it too big.

5. Machine Learning Models

Support Vector Machines

I had read about how powerful SVMs are, but in training the model did not perform that well. I may still take this model forward to tuning.

XGBoost

My personal favorite model: fast, light, and it gave average results when trained on this dataset.

Random Forest

Random Forest got lower results than XGBoost, but it ranked as important some of the features that I had selected.

K Nearest Neighbors

This was the first time I trained a KNN, and I really liked the result with 10 neighbors.

Stochastic Gradient Descent

This is a good "linear" model; I may use it in tuning too.

Light GBM

Similar to XGBoost, but significantly better on this single training dataset.

Ada Boost

This was the first time I trained AdaBoost.
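A comparison like this can be sketched with cross-validated precision. The example covers only the scikit-learn models (XGBoost and LightGBM live in separate packages), runs on synthetic data, and uses default hyperparameters apart from the 10 neighbors mentioned above; none of it reproduces the project's actual scores.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the cardio dataset (assumption).
X, y = make_classification(
    n_samples=500, n_features=10, n_informative=5, random_state=0
)

# Scale-sensitive models are wrapped in a pipeline with a scaler.
models = {
    "SVM": make_pipeline(StandardScaler(), SVC()),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "KNN (10 neighbors)": make_pipeline(
        StandardScaler(), KNeighborsClassifier(n_neighbors=10)
    ),
    "SGD": make_pipeline(StandardScaler(), SGDClassifier(random_state=0)),
    "AdaBoost": AdaBoostClassifier(random_state=0),
}

results = {}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5, scoring="precision")
    results[name] = (scores.mean(), scores.std())
    print(f"{name}: precision {scores.mean():.3f} +/- {scores.std():.3f}")
```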

6. Model Tuning

This is the main step of this data science project: since there are not many ways to create features or collect new data, very detailed data preparation and good tuning are especially important.

6.1. First Cycle Model Tuning

In the first cycle I tuned SGD and AdaBoost: AdaBoost because it had great performance, and SGD because it is a "linear model" with linear coefficients. After tuning, I chose SGD because it is much lighter on disk than AdaBoost, taking only 5 KB of disk space.

I used a simple random search to find the best parameters for the model.

In cross-validation the model had good performance (precision).
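A random search over an SGD classifier can be sketched with `RandomizedSearchCV`. The search space, the number of iterations and the synthetic data below are assumptions; the parameters actually searched in the project are not listed here.

```python
from scipy.stats import loguniform
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the cardio dataset (assumption).
X, y = make_classification(
    n_samples=500, n_features=10, n_informative=5, random_state=0
)

pipeline = make_pipeline(StandardScaler(), SGDClassifier(random_state=0))

# Illustrative search space; keys follow the pipeline step name.
param_distributions = {
    "sgdclassifier__alpha": loguniform(1e-5, 1e-1),
    "sgdclassifier__penalty": ["l2", "l1", "elasticnet"],
    "sgdclassifier__loss": ["hinge", "modified_huber"],
}

search = RandomizedSearchCV(
    pipeline,
    param_distributions,
    n_iter=20,            # number of sampled parameter combinations
    scoring="precision",  # the metric the project focuses on
    cv=5,
    random_state=0,
)
search.fit(X, y)

print("Best parameters:", search.best_params_)
print(f"Best cross-validated precision: {search.best_score_:.3f}")
```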

6.1.1. Calibration Curves

This step comes after tuning the model, to calibrate its predicted probabilities and correct over- and under-estimation.

Performance of the calibrated SGD model.
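Calibration of an SGD classifier can be sketched with `CalibratedClassifierCV`, and the calibration curve with `calibration_curve`. The sigmoid method, the 5 folds and the synthetic data are assumptions; the calibration method used in the project is not stated here.

```python
from sklearn.calibration import CalibratedClassifierCV, calibration_curve
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the cardio dataset (assumption).
X, y = make_classification(
    n_samples=1000, n_features=10, n_informative=5, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

base_model = make_pipeline(StandardScaler(), SGDClassifier(random_state=0))

# Wrap the model so its scores are mapped to calibrated probabilities.
calibrated = CalibratedClassifierCV(base_model, method="sigmoid", cv=5)
calibrated.fit(X_train, y_train)

prob_pos = calibrated.predict_proba(X_test)[:, 1]

# Fraction of true positives per bin versus mean predicted probability.
frac_pos, mean_pred = calibration_curve(y_test, prob_pos, n_bins=10)
for predicted, observed in zip(mean_pred, frac_pos):
    print(f"predicted {predicted:.2f} -> observed {observed:.2f}")
```

Points below the diagonal (observed lower than predicted) indicate over-estimation, and points above it indicate under-estimation.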

6.1.2. Confidence Intervals

This is the last step of tuning the machine learning model. Here the confidence intervals are calculated using a ready-made formula from MachineLearningMastery.
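A percentile bootstrap is one common way to obtain such intervals. The sketch below is a simplified variant that resamples fixed predictions instead of retraining the model on each bootstrap sample. The function names, the synthetic labels and the 1,000 iterations are assumptions, and it may differ from the formula actually used.

```python
import numpy as np


def precision(y_true, y_pred):
    """Share of predicted positives that are truly positive."""
    predicted_pos = y_pred == 1
    if predicted_pos.sum() == 0:
        return 0.0
    return float((y_true[predicted_pos] == 1).mean())


def bootstrap_stats(y_true, y_pred, metric, n_iterations=1000, seed=0):
    """Metric values computed on samples drawn with replacement."""
    rng = np.random.default_rng(seed)
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    n = len(y_true)
    stats = []
    for _ in range(n_iterations):
        idx = rng.integers(0, n, size=n)
        stats.append(metric(y_true[idx], y_pred[idx]))
    return np.array(stats)


def percentile_interval(stats, alpha):
    """Central interval holding a fraction `alpha` of the bootstrap values."""
    lower = (1.0 - alpha) / 2.0 * 100
    upper = (alpha + (1.0 - alpha) / 2.0) * 100
    return float(np.percentile(stats, lower)), float(np.percentile(stats, upper))


# Synthetic labels and predictions that agree about 78% of the time.
rng = np.random.default_rng(42)
y_true = rng.integers(0, 2, size=2000)
y_pred = np.where(rng.random(2000) < 0.78, y_true, 1 - y_true)

stats = bootstrap_stats(y_true, y_pred, precision)
for alpha in (0.25, 0.50, 0.75, 0.95):
    low, high = percentile_interval(stats, alpha)
    print(f"{alpha:.0%} interval: {low:.4f} to {high:.4f}")
```

Because all intervals are read from the same bootstrap distribution, a higher confidence level always gives an interval at least as wide as a lower one.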

6.2. Second Cycle Model Tuning

In the second cycle I again tuned SGD and AdaBoost, this time on the SMOTEENN and SMOTETomek datasets. After some tests I preferred to use AdaBoost in production.

Final AdaBoost model performance in Cycle II, for production.

6.2.1. Calibration Curves

The calibration curve of Raw ADA model

The calibration curve of Tuned ADA model

6.2.2. Confidence Intervals

Bootstrap of the tuned-only AdaBoost model.

I did not select the calibrated + tuned model because, on bootstrap, it gave a slightly (insignificantly) larger error. I used only the tuned model for deployment.

7. Model Business Results

The questions to answer:

What is the precision and accuracy of this new tool?

How much profit will Cardio Catch Diseases earn with this new tool?

What is the confidence interval of this new tool?

7.1. What is the precision and accuracy of this new tool

Cross-validation results (mean +/- std):

Accuracy: 0.724 +/- 0.0006
Precision: 0.7874 +/- 0.0048

7.2. How much profit will CCD earn with this new tool

Based on All Dataset (68k Patients).

Precision (%)	50	55	60	65	70	75	80
Price per diagnosis	Free	$ 500	$ 1,000	$ 1,500	$ 2,000	$ 2,500	$ 3,000
Current revenue	$ 0	$ 343,530.00	$ 687,060.00	$ 1,030,590.00	$ --#--	$ --#--	$ --#--
Model revenue	$ --#--	$ 343,530.00	$ 687,060.00	$ 1,030,590.00	$ 1,374,120.00	$ 1,717,650.00	$ --#--

Process	Best Scenario	Worst Scenario
Model	+/- $ 1,717,650.00	+/- $ 1,374,120.00
Current	+/- $ 1,030,590.00	+/- $ 343,530.00
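The scenario figures follow from the revenue rows: each 5-point step above 50% is worth $ 343,530.00 in the table. The sketch below takes that per-step amount as given and reproduces the best and worst scenarios. The pairing of scenarios (model at 75% and 70%, current process at 65% and 55%) is read from the tables, while the gain comparison is an added illustration.

```python
REVENUE_PER_STEP = 343_530.00  # table value for one 5-point step above 50%


def revenue(precision_pct: float) -> float:
    """Total revenue for a precision level, using the table's per-step amount."""
    steps = max(0, int((precision_pct - 50) // 5))
    return steps * REVENUE_PER_STEP


scenarios = {
    "model": {"best": revenue(75), "worst": revenue(70)},
    "current": {"best": revenue(65), "worst": revenue(55)},
}

# Illustrative comparison: best against best, worst against worst.
gain = {
    case: scenarios["model"][case] - scenarios["current"][case]
    for case in ("best", "worst")
}

for case in ("best", "worst"):
    print(
        f"{case}: model $ {scenarios['model'][case]:,.2f}, "
        f"current $ {scenarios['current'][case]:,.2f}, "
        f"gain $ {gain[case]:,.2f}"
    )
```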

7.3. What is the confidence interval of this new tool

25% confidence interval of model performance: 77.81% to 78.34%
50% confidence interval of model performance: 77.44% to 78.59%
75% confidence interval of model performance: 77.02% to 78.95%

8. Model Deployment

The model is served through Google Sheets. This is a convenient deployment strategy, and it also helps with studying cardiovascular disease cases and checking whether the model needs improvement. For example, the cardio team can change the input values to study new behaviors and see how the model classifies them.

This is the step where users apply the model's predictions to make better decisions. I chose to deploy the model on Heroku and call it with a JS request from Google Apps Script.

google_sheets_cardio.mp4

xGabrielR/Cardio-Catch-Diseases