Paste the GitHub notebook link into https://nbviewer.org/ to see the full notebook. Some charts only work there and are not shown on GitHub, since GitHub only renders static images. Our notebook includes HTML/JavaScript embeddings, which GitHub cannot display properly! Alternatively, just click THIS LINK HERE!
This is a Mini-Project for SC1015 (Introduction to Data Science and Artificial Intelligence). We use the Stroke Dataset from Kaggle.
The order of our notebook is:
- Introduction
- Exploratory Data Analysis
- Data Balancing
- Modeling
- Final Thoughts
(Links to jump around the notebook are provided in the notebook itself to minimise scrolling. However, they only work on https://nbviewer.org/ !)
- @tengyaolong2000 Teng Yao Long
- @jewel-chin Jewel Chin
- @yuminp Park Yumin
- What are the main predictors of Hypertension?
- Which model would be the best to predict Hypertension?
- Logistic Regression
- Decision Tree
- Random Forest
- Support Vector Machine
- Artificial Neural Network
- eXtreme Gradient Boosting Classifier
- K Nearest Neighbours
- Naive Bayes Classifier
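As a minimal sketch of how we compared these models, the snippet below trains a few of them with sklearn and reports recall and F1. It uses a synthetic imbalanced dataset from `make_classification` as a stand-in for the actual stroke/hypertension data, so the exact scores are illustrative only.

```python
# Hedged sketch: comparing a few of the listed classifiers on a synthetic,
# imbalanced binary dataset (a stand-in for the hypertension labels).
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import recall_score, f1_score

# ~10% positive class, mimicking the class imbalance in the real data
X, y = make_classification(n_samples=1000, n_features=8,
                           weights=[0.9, 0.1], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=42)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "Random Forest": RandomForestClassifier(random_state=42),
    "Naive Bayes": GaussianNB(),
}
for name, model in models.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    print(f"{name}: recall={recall_score(y_test, pred):.2f} "
          f"f1={f1_score(y_test, pred):.2f}")
```

In the notebook itself, each model is additionally tuned and evaluated on the resampled stroke data rather than on synthetic features.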
- Age and BMI unanimously are the biggest predictors of hypertension.
- Other predictors include average glucose level and heart disease (It's important to exercise!!! 🏃♂️🏃♀️).
- Tree models are good at predicting hypertension if we focus only on recall. However, the scores of other metrics are sacrificed too much.
- Logistic Regression and Naive Bayes models have decent recall without sacrificing other scores too much. These models are good if we have limited resources (GPU/memory).
- If we have sufficient resources, the Neural Network has the potential to be the best after more hyperparameter tuning or an increase in model complexity.
- However, we would also need to deal with overfitting.
- If we were to use a Deep Learning approach, we could also utilise transfer learning or ensemble modeling.
- Handling imbalanced datasets using resampling methods and imblearn package (SMOTE)
- Feature selection/ feature importance techniques (RFE, SHAP, Permutation importance)
- Logistic Regression with sklearn
- Random Forest with sklearn
- Support Vector Machines with sklearn
- Artificial Neural Networks with TensorFlow Keras
- XGBoost with xgboost
- K Nearest Neighbours with sklearn
- Naive Bayes Classifier with sklearn
- Collaborating using GitHub
- Data visualisation with plotly
- Grid Search to determine best hyperparameters
- Concepts on different metrics such as Recall, F1 score
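The resampling idea above can be sketched without the imblearn package: SMOTE generates synthetic minority-class samples by interpolating between a minority point and one of its nearest minority neighbours. The numpy sketch below illustrates that idea on a tiny toy array; the function name `smote_sketch` is ours, not imblearn's API, and the notebook itself uses `imblearn.over_sampling.SMOTE`.

```python
# Minimal numpy sketch of the SMOTE idea (synthetic minority oversampling).
# Each new sample lies on the segment between a minority point and one of
# its k nearest minority neighbours.
import numpy as np

def smote_sketch(X_min, n_new, k=5, rng=None):
    """Generate n_new synthetic samples from minority-class rows X_min."""
    rng = np.random.default_rng(rng)
    out = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))
        # distances from the chosen point to every other minority point
        d = np.linalg.norm(X_min - X_min[i], axis=1)
        neighbours = np.argsort(d)[1:k + 1]   # skip the point itself
        j = rng.choice(neighbours)
        gap = rng.random()                    # interpolation factor in [0, 1)
        out.append(X_min[i] + gap * (X_min[j] - X_min[i]))
    return np.array(out)

# toy minority-class points (e.g. the hypertensive patients)
X_min = np.array([[1.0, 1.0], [1.2, 0.9], [0.8, 1.1],
                  [1.1, 1.2], [0.9, 0.8], [1.0, 1.3]])
synthetic = smote_sketch(X_min, n_new=4, rng=0)
print(synthetic.shape)  # (4, 2)
```

Because every synthetic point is an interpolation between two real minority points, the new samples stay inside the minority class's region of feature space instead of being exact duplicates, which is what distinguishes SMOTE from plain random oversampling.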
- https://en.wikipedia.org/wiki/Artificial_neural_network
- https://en.wikipedia.org/wiki/Naive_Bayes_classifier
- https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html
- https://keras.io/
- https://plotly.com/python/
- https://en.wikipedia.org/wiki/Logistic_regression
- https://en.wikipedia.org/wiki/Random_forest
- https://scikit-learn.org/stable/modules/svm.html
- https://en.wikipedia.org/wiki/Support-vector_machine
- https://towardsdatascience.com/machine-learning-basics-with-the-k-nearest-neighbors-algorithm-6a6e71d01761
- https://en.wikipedia.org/wiki/K-nearest_neighbors_algorithm
- https://scikit-learn.org/stable/modules/naive_bayes.html
- https://en.wikipedia.org/wiki/XGBoost
- https://machinelearningmastery.com/gentle-introduction-xgboost-applied-machine-learning/
- https://machinelearningmastery.com/introduction-to-regularization-to-reduce-overfitting-and-improve-generalization-error/
- https://machinelearningmastery.com/rfe-feature-selection-in-python/
- https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.RFE.html
- https://shap.readthedocs.io/en/latest/index.html
- https://machinelearningmastery.com/smote-oversampling-for-imbalanced-classification/
- https://scikit-learn.org/stable/modules/permutation_importance.html