This project predicts diabetes using several machine learning algorithms: given an individual's lifestyle-related attributes and historical data, the models predict whether that individual is likely to have diabetes.
The dataset used in this project is originally from the NIDDK (National Institute of Diabetes and Digestive and Kidney Diseases). The goal is to predict diabetes in patients from diagnostic measurements, and thereby to evaluate the predictive performance of different machine learning algorithms and to identify trends useful for diabetes management and early, personalized treatment. The predictions are based on the following patient records:
Pregnancies: number of times a female has conceived
Glucose: blood sugar level (mg/dL)
Blood Pressure: diastolic blood pressure (mm Hg)
Skin Thickness: triceps skin fold thickness (mm)
Insulin: 2-hour serum insulin (µU/mL)
Body Mass Index: body weight divided by height squared (kg/m²)
Diabetes Pedigree Function: hereditary (genetic) influence
Age: age of the female patient (years)
These are the independent variables. I apply the different machine learning algorithms to them to predict the target variable, and so address this disease prediction problem in the healthcare domain.
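As a rough sketch of how these variables might be laid out in code, the snippet below separates the eight independent variables from the target. The column names (including `Outcome` for the target), the helper `split_features_target`, and the two example rows are all illustrative assumptions, not values taken from the project's data.

```python
import pandas as pd

# Assumed column names for the eight independent variables; the project's
# actual file may label them differently.
FEATURES = [
    "Pregnancies", "Glucose", "BloodPressure", "SkinThickness",
    "Insulin", "BMI", "DiabetesPedigreeFunction", "Age",
]
TARGET = "Outcome"  # assumed name: 1 = diabetic, 0 = non-diabetic


def split_features_target(df: pd.DataFrame):
    """Return (X, y): the independent variables and the target column."""
    missing = [c for c in FEATURES + [TARGET] if c not in df.columns]
    if missing:
        raise KeyError(f"missing columns: {missing}")
    return df[FEATURES], df[TARGET]


# Two made-up rows, only to show the expected shape of the data.
example = pd.DataFrame(
    [[2, 120, 70, 25, 80, 28.5, 0.45, 33, 0],
     [5, 160, 82, 30, 150, 34.1, 0.80, 47, 1]],
    columns=FEATURES + [TARGET],
)
X, y = split_features_target(example)
print(X.shape, y.tolist())
```

Keeping the feature list in one place makes it easier to reuse the same split for every model being compared.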
I am working on diabetes prediction for three main reasons.
EARLY IDENTIFICATION OF THE DISEASE: identifying at-risk individuals before they actually develop the disease can save many lives.
PLANNING PERSONALIZED TREATMENTS: predictive models help healthcare professionals plan proper personalized treatment for patients who have the disease or are likely to develop it.
ALLOCATION OF RESOURCES: based on a person's likelihood of developing the disease, healthcare professionals can allocate resources such as educational campaigns and preventive measures so that patients benefit from them.
Data Availability, Data Quality And Standardization, Data Privacy And Security, Feature Selection And Dimensionality, Imbalanced Data, Model Interpretability
Accurate Predictions, Early Prevention, Improved Patient Outcomes, Research And Insights, Continuous Learning And Improvement, Personalized Approach
Here I compute the total count and percentage of patients who had diabetes and who did not.
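The count and percentage per class can be obtained along these lines. This is only a sketch: the `Outcome` name and the ten labels are made up for illustration and are not the project's data.

```python
import pandas as pd

# Made-up labels for illustration: 1 = diabetic, 0 = non-diabetic.
outcome = pd.Series([0, 0, 1, 0, 1, 0, 0, 1, 0, 0], name="Outcome")

counts = outcome.value_counts()                            # absolute count per class
percentages = outcome.value_counts(normalize=True) * 100   # share of each class in %

summary = pd.DataFrame({"count": counts, "percent": percentages.round(1)})
print(summary)
```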
From the above visualisation, I can say that the distribution of Blood Pressure of a person both with diabetes
From the above plot, I can see that the distribution of Pregnancies in females is right skewed.
From the above plots, I learned how the various features in the data are distributed.
Some of the features are right skewed, which tells us that the mean is greater than the median.
These are 'Pregnancies', 'Skin Thickness', 'Insulin', 'Diabetes Pedigree Function', 'BMI' and 'Age'.
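One way to check a right skew numerically, rather than only by eye, is to compare the mean with the median and look at the sign of the sample skewness. The values below are made up for illustration; only the comparison itself is the point.

```python
import pandas as pd

# Made-up right-skewed values (a few large values stretch the right tail).
values = pd.Series([0, 1, 1, 2, 2, 3, 4, 6, 10, 17], name="Pregnancies")

mean = values.mean()
median = values.median()
skewness = values.skew()  # sample skewness; a positive value suggests a right skew

print(f"mean={mean:.2f}, median={median:.2f}, skew={skewness:.2f}")
```

A mean well above the median together with a positive skewness is consistent with a right-skewed distribution.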
If the correlation coefficient is close to 1, the two variables are strongly positively correlated; a value of exactly 1 means they are perfectly correlated.
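To make the reading of these values concrete, here is a small sketch using pandas' `DataFrame.corr()` (Pearson correlation by default) on synthetic columns. The column names and data are assumptions made up for the example.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
x = rng.normal(size=200)

df = pd.DataFrame({
    "x": x,
    "perfect": 2 * x + 1,                           # exact linear function of x
    "noisy": x + rng.normal(scale=1.0, size=200),   # related to x, but with noise
    "unrelated": rng.normal(size=200),              # generated independently of x
})

corr = df.corr()  # Pearson correlation matrix
print(corr.round(2))
```

The `perfect` column correlates with `x` at 1, the `noisy` column lands somewhere between 0 and 1, and the `unrelated` column stays near 0.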
From this plot, it is clear that most non-diabetic patients have an Insulin
level in the range 0-200 µU/mL: the plot is wider there than it is for diabetic patients
(fewer people). However, the median Insulin level is higher in diabetic females than in
non-diabetic females, and this category also has the largest number of outliers.
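The same comparison can be made numerically. The sketch below computes the median per group and counts outliers with the usual 1.5 × IQR box-plot rule. The data, the column names and the helper `count_iqr_outliers` are illustrative assumptions, not the project's values.

```python
import pandas as pd

# Made-up values for illustration only: Outcome 1 = diabetic, 0 = non-diabetic.
df = pd.DataFrame({
    "Outcome": [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1],
    "Insulin": [40, 60, 75, 90, 110, 130, 90, 120, 140, 160, 180, 600],
})


def count_iqr_outliers(s: pd.Series) -> int:
    """Count values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return int(((s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)).sum())


by_outcome = df.groupby("Outcome")["Insulin"]
medians = by_outcome.median()
outliers = by_outcome.apply(count_iqr_outliers)
print(pd.DataFrame({"median": medians, "outliers": outliers}))
```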
From the above visualization, I would interpret that all the features (independent variables) contain
Lastly, after training the models and evaluating their performance, we conclude that the random forest is the best-performing model
for diabetes prediction, with the highest recall (0.60) among the four models we employed. We therefore examine its confusion matrix.
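For reference, recall can be read directly off a confusion matrix as TP / (TP + FN). The sketch below shows this with scikit-learn on synthetic data. The dataset, split and hyperparameters are assumptions for illustration, so the numbers it prints will not match the 0.60 reported above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, recall_score
from sklearn.model_selection import train_test_split

# Synthetic, mildly imbalanced data standing in for the real dataset.
X, y = make_classification(
    n_samples=600, n_features=8, weights=[0.65, 0.35], random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# For binary labels [0, 1], ravel() yields tn, fp, fn, tp in that order.
tn, fp, fn, tp = confusion_matrix(y_test, y_pred, labels=[0, 1]).ravel()
recall = tp / (tp + fn)  # share of actual positives the model caught

print(f"TN={tn} FP={fp} FN={fn} TP={tp}")
print(f"recall from matrix={recall:.2f}, recall_score={recall_score(y_test, y_pred):.2f}")
```

Recall is a natural metric here because a false negative (a diabetic patient predicted as non-diabetic) is usually the costlier mistake in screening.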
This model can be leveraged in other domains, such as:
FINANCE: Random forests can be applied to tasks like fraud detection and investment prediction by training the model on historical financial data, so that it can assess creditworthiness and detect anomalous transactions.
RETAIL AND E-COMMERCE: Random forests can be used to predict customer behavior and recommend products based on customer attributes and purchase history, making it easier for the business to take data-driven decisions.
The approach can also be scaled up with some further tweaks, such as:
Feature engineering and domain-specific knowledge: additional features make it easier for the model to predict the target variable.
Model ensembling: combining models can improve the overall prediction by leveraging the strengths of different algorithms.
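As one possible form of model ensembling, the sketch below combines a random forest with two other classifiers using scikit-learn's `VotingClassifier`. The choice of logistic regression and k-nearest neighbours, the hyperparameters and the synthetic data are assumptions for illustration, not the models used in this project.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data standing in for the real dataset.
X, y = make_classification(n_samples=600, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

ensemble = VotingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=100, random_state=0)),
        # Scale features for the models that are sensitive to feature ranges.
        ("lr", make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))),
        ("knn", make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))),
    ],
    voting="soft",  # average the predicted class probabilities
)
ensemble.fit(X_train, y_train)

accuracy = ensemble.score(X_test, y_test)
print(f"ensemble accuracy on held-out data: {accuracy:.2f}")
```

Soft voting averages probabilities, so it only works with estimators that expose `predict_proba`. Whether the ensemble actually beats the single random forest would need to be checked on the real data.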