
heart_datasets_ML

Heartbeat classification (>90% accuracy, >95% CV score, >90% precision, recall, and F-score) and disease detection (>80% accuracy)

Author

Simone Maria Giancola
simonegiancola09@gmail.com
+39 3314788683
LinkedIn Profile: Simone Giancola
GitHub Profile: @simonegiancola09

I am absolutely not a professional, so any suggestion, whether related to typos, code, writing, or analysis, is highly appreciated and will be welcomed with enthusiasm.

IMPORTANT

Because the notebooks are quite big, my Jupyter environment did not load a table of contents at the beginning, making the documents very difficult to navigate. Google Colab has a built-in table of contents on the left. This also makes reading in Jupyter somewhat painful, since I numbered the headings by hand and they do not match those generated by Jupyter. If Colab does not open, there is also a link to NbViewer below for both documents:

The second dataset is definitely more interesting and has better results, so I suggest taking a look at it after having skimmed the method in the first dataset, without spending too much time on the latter. I would also point out that most of the models I found online had a higher accuracy but a much lower CV score (around 30% less), meaning less reliability. My model, though less accurate, seems to be more robust.
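
To make that comparison concrete, here is a minimal sketch (not the notebooks' actual code; the synthetic data and the classifier are illustrative assumptions) of how test-set metrics and a cross-validation score can be computed with scikit-learn. A model whose single-split accuracy sits far above its mean CV score is likely benefiting from a lucky split rather than being genuinely robust.

```python
# Illustrative only: synthetic data and a placeholder classifier, not the
# project's actual dataset or final model.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
from sklearn.model_selection import cross_val_score, train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

clf = RandomForestClassifier(random_state=0).fit(X_train, y_train)
y_pred = clf.predict(X_test)

# Single-split test metrics: these can look optimistic on one lucky split.
acc = accuracy_score(y_test, y_pred)
prec, rec, f1, _ = precision_recall_fscore_support(y_test, y_pred, average="weighted")

# Mean 5-fold cross-validation score: a rough robustness check.
cv_mean = cross_val_score(clf, X_train, y_train, cv=5).mean()

print(f"accuracy={acc:.3f} precision={prec:.3f} recall={rec:.3f} f1={f1:.3f} CV={cv_mean:.3f}")
```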

After posting the project on LinkedIn, some people reached out suggesting further developments or giving advice for future projects, so I considered whether to apply some of their suggestions to this analysis. The notebooks were split after this exchange of thoughts. Although some suggested using a validation set instead of cross-validating, I chose not to change this, since my personal studies showed that this choice, like many others, should be justified and assessed case by case, weighing pros and cons. My conclusion in these situations is to leave the current approach as it is and to try the other option in another project, being aware of the possible differences. For more information, it is enough to check the textbook used in the course mentioned later in this file, searching for the keyword "cross validation".
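
For readers curious about the alternative that was suggested, the sketch below contrasts the two options in scikit-learn; the estimator, data, and split sizes are assumptions for illustration, not the settings used in the notebooks.

```python
# Illustrative comparison of a single held-out validation set versus k-fold
# cross-validation; placeholder data and estimator.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

model = LogisticRegression(max_iter=1000)

# Option A (suggested by some readers): carve a single validation set out of
# the training data and score the model on it once.
X_tr, X_val, y_tr, y_val = train_test_split(X_train, y_train, test_size=0.25, random_state=0)
model.fit(X_tr, y_tr)
print("Validation-set score:", model.score(X_val, y_val))

# Option B (kept in this project): 5-fold cross-validation, which uses all of
# the training data for both fitting and scoring, at the cost of more compute.
print("Mean CV score:", cross_val_score(model, X_train, y_train, cv=5).mean())
```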

While attending university I had the opportunity to learn the basics of Python and some of its libraries. This is one of my attempts to satisfy my personal taste with projects that could also be useful.
After attending the Machine Learning course at Bocconi University, I felt motivated enough to apply the knowledge acquired to a healthcare setting.
Kaggle was a good starting point. There I found the two datasets I will work with in the following notebooks. The first one is rather small, while the second one is considerably larger. More information can be found at the following links:

Technical and knowledge sources range from across the internet to library documentation and the lecture notes of the Python courses I took at Bocconi University. The framework, notation, and main ideas recall those of the Machine Learning course I already mentioned, which used "Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow" as a textbook.

In order to avoid writing an enormous document, I leave to the reader the option to review the data sources, feature descriptions, and acknowledgements. The first dataset in particular had 4 or 5 missing values that were fixed by a user who cleaned the file and shared it with the public (all explained at the link provided).
I will develop a process of thought to analyze data and then apply it to both scenarios, adapting it to the needs of each dataset, to see its results.

If the reader is curious about the results only, I suggest skipping all chapters up to "Pipeline and Ensemble" for both datasets. I will present my reasoning with words and numbers, in more detail than a simple script. I also suggest at least taking a look at the data visualization cells across the whole notebook.
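
As a rough idea of what that chapter refers to, a scikit-learn pipeline feeding a voting ensemble might look like the sketch below; the scaler and the base estimators are placeholders, not the project's actual configuration.

```python
# Illustrative only: a pipeline that scales features and feeds a soft-voting
# ensemble. The preprocessing step and estimators are assumptions.
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

ensemble = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=1000)),
        ("rf", RandomForestClassifier(random_state=0)),
        ("knn", KNeighborsClassifier()),
    ],
    voting="soft",  # average predicted probabilities across models
)

pipeline = Pipeline([
    ("scaler", StandardScaler()),  # preprocessing step
    ("ensemble", ensemble),        # final estimator
])

# Usage: pipeline.fit(X_train, y_train); pipeline.score(X_test, y_test)
```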
