# Part 3: Machine learning model training and feature importance

In [1]:
# import libraries
import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

from sklearn.metrics import confusion_matrix, classification_report

In [2]:
# prepare independent and dependent variables
df = pd.read_csv("diabetes_data_clean.csv")

X = df.drop('class', axis = 1)
y = df['class']

In [3]:
# split data into train and test data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, stratify = y)
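
The `stratify = y` argument keeps the class proportions of the full dataset in both subsets. The sketch below illustrates this on toy labels; the toy data, the variable names, and `random_state=42` are assumptions added here for repeatability and are not part of the original notebook.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data (hypothetical): 520 rows, 200 negatives and 320 positives.
toy = pd.DataFrame({"feature": range(520), "class": [0] * 200 + [1] * 320})
X_toy = toy.drop("class", axis=1)
y_toy = toy["class"]

# random_state is an added assumption; it only makes the split repeatable.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, stratify=y_toy, random_state=42
)

# With stratification, the share of positives is (nearly) identical everywhere.
print(f"full: {y_toy.mean():.3f}  train: {y_tr.mean():.3f}  test: {y_te.mean():.3f}")
```

Without a fixed `random_state`, each run produces a different split, so the exact confusion matrices in this notebook may not reproduce from run to run.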

In [4]:
# model training
# start with DummyClassifier to establish baseline
dummy = DummyClassifier()
dummy.fit(X_train, y_train)
dummy_pred = dummy.predict(X_test)

In [5]:
# assess DummyClassifier model using confusion matrix
confusion_matrix(y_test, dummy_pred)

array([[ 0, 40],
       [ 0, 64]], dtype=int64)

In [6]:
# assess DummyClassifier model using a classification report
print(classification_report(y_test, dummy_pred))

              precision    recall  f1-score   support

           0       0.00      0.00      0.00        40
           1       0.62      1.00      0.76        64

    accuracy                           0.62       104
   macro avg       0.31      0.50      0.38       104
weighted avg       0.38      0.62      0.47       104
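
Because the baseline never predicts class 0, precision for that class is undefined, and scikit-learn reports it as 0.00 while raising an `UndefinedMetricWarning`. A minimal sketch of how the `zero_division` parameter of `classification_report` makes that choice explicit; the label lists are reconstructed from the confusion matrix above rather than taken from the real test set.

```python
from sklearn.metrics import classification_report

# Labels reconstructed from the baseline confusion matrix: 40 negatives, 64 positives.
y_true = [0] * 40 + [1] * 64
# The baseline predicts the majority class (1) for every sample.
y_pred = [1] * 104

# zero_division=0 reports undefined precision as 0.0 without emitting a warning.
report = classification_report(y_true, y_pred, zero_division=0, output_dict=True)
print(report["0"]["precision"], report["1"]["recall"], round(report["accuracy"], 2))
```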





In [17]:
# train LogisticRegression model
logr = LogisticRegression(max_iter=10000)
logr.fit(X_train, y_train)
logr_pred = logr.predict(X_test)

In [18]:
# assess LogisticRegression model using confusion matrix
confusion_matrix(y_test, logr_pred)

array([[39,  1],
       [ 2, 62]], dtype=int64)

In [19]:
# assess LogisticRegression model using a classification report
print(classification_report(y_test, logr_pred))

              precision    recall  f1-score   support

           0       0.95      0.97      0.96        40
           1       0.98      0.97      0.98        64

    accuracy                           0.97       104
   macro avg       0.97      0.97      0.97       104
weighted avg       0.97      0.97      0.97       104



- Compared with our baseline (DummyClassifier), the LogisticRegression model performs much better: its accuracy is 97%, against 62% for the baseline.
- It also produces far fewer false positives: 1 of the 40 negative cases, whereas the baseline misclassifies all 40.
- It produces 2 false negatives among the 64 positive cases, whereas the baseline has none (only because the baseline labels every case as positive). False negatives are the less desirable error in a real-world healthcare setting.
- For real-world diabetes prediction, a model with fewer false negatives is preferable: it is better to falsely diagnose someone with diabetes than to falsely assess someone as diabetes-free.
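
The counts above can be turned into rates directly from the confusion matrices. This is a small sketch; the helper name `error_rates` is hypothetical, and it assumes scikit-learn's layout (rows are true labels, columns are predictions, class 0 first).

```python
def error_rates(cm):
    """Return (false_positive_rate, false_negative_rate) for a 2x2 confusion matrix.

    Assumes scikit-learn's layout: rows are true labels, columns are
    predictions, with class 0 (negative) listed first.
    """
    (tn, fp), (fn, tp) = cm
    fpr = fp / (fp + tn)  # share of true negatives predicted as positive
    fnr = fn / (fn + tp)  # share of true positives predicted as negative
    return fpr, fnr

# Confusion matrices copied from the outputs above.
dummy_cm = [[0, 40], [0, 64]]
logr_cm = [[39, 1], [2, 62]]

dummy_fpr, dummy_fnr = error_rates(dummy_cm)
logr_fpr, logr_fnr = error_rates(logr_cm)
print(f"Dummy: FPR={dummy_fpr:.3f}, FNR={dummy_fnr:.3f}")
print(f"LogisticRegression: FPR={logr_fpr:.3f}, FNR={logr_fnr:.3f}")
```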

In [20]:
# train DecisionTree model
tree = DecisionTreeClassifier()
tree.fit(X_train, y_train)
tree_pred = tree.predict(X_test)

In [21]:
# assess DecisionTree model using confusion matrix
confusion_matrix(y_test, tree_pred)

array([[39,  1],
       [ 1, 63]], dtype=int64)

In [22]:
# assess DecisionTree model using a classification report
print(classification_report(y_test, tree_pred))

              precision    recall  f1-score   support

           0       0.97      0.97      0.97        40
           1       0.98      0.98      0.98        64

    accuracy                           0.98       104
   macro avg       0.98      0.98      0.98       104
weighted avg       0.98      0.98      0.98       104



- Compared with our baseline (DummyClassifier), the DecisionTree model performs much better: its accuracy is 98%, against 62% for the baseline.
- Compared with our LogisticRegression model, the DecisionTree model performs slightly better: 98% accuracy against 97%.
- The DecisionTree model produces 1 false negative, one fewer than the LogisticRegression model (2).
- The DecisionTree model produces 1 false positive, the same as the LogisticRegression model.
- Although the gain in accuracy is small, the lower number of false negatives makes the DecisionTree model the better fit for this problem statement in a real-world scenario.
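
The gap between the two models here is a single test sample, so it may be worth checking whether it holds across several splits. The sketch below compares recall for the positive class with 5-fold cross-validation; it runs on synthetic stand-in data from `make_classification`, and the fold count and `random_state` values are assumptions, so the printed numbers say nothing about the diabetes dataset itself.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (hypothetical); swap in X and y from the notebook.
X_demo, y_demo = make_classification(
    n_samples=520, n_features=16, n_informative=5, random_state=0
)

models = {
    "LogisticRegression": LogisticRegression(max_iter=10000),
    "DecisionTree": DecisionTreeClassifier(random_state=0),
}

# Recall on the positive class is 1 minus the false negative rate.
cv_recall = {}
for name, model in models.items():
    scores = cross_val_score(model, X_demo, y_demo, cv=5, scoring="recall")
    cv_recall[name] = scores
    print(f"{name}: mean recall {scores.mean():.3f} (std {scores.std():.3f})")
```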

In [13]:
# train RandomForest model
forest = RandomForestClassifier()
forest.fit(X_train, y_train)
forest_pred = forest.predict(X_test)

In [14]:
# assess RandomForest model using confusion matrix
confusion_matrix(y_test, forest_pred)

array([[40,  0],
       [ 1, 63]], dtype=int64)

In [15]:
# assess RandomForest model using a classification report
print(classification_report(y_test, forest_pred))

              precision    recall  f1-score   support

           0       0.98      1.00      0.99        40
           1       1.00      0.98      0.99        64

    accuracy                           0.99       104
   macro avg       0.99      0.99      0.99       104
weighted avg       0.99      0.99      0.99       104



- Compared with our baseline (DummyClassifier), the RandomForest model performs much better: its accuracy is 99%, against 62% for the baseline.
- Compared with LogisticRegression (97%) and DecisionTree (98%), the RandomForest model performs slightly better, with an accuracy of 99%.
- The RandomForest model produces no false positives, fewer than every other model.
- The RandomForest model produces 1 false negative, the same as DecisionTree and one fewer than LogisticRegression.
- It can be concluded that RandomForest is the best performing model, with an accuracy of 99% and, for the positive class, a precision of 1.00, a recall of 0.98 and an f1-score of 0.99.
- With the fewest false negatives among the trained models (1) and no false positives, this model would be the best fit in the real-world scenario.
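
To see the four models side by side, the reported confusion matrices can be collected into one summary table. This is a sketch only: the matrices are copied from the outputs above, and the column names are chosen here for illustration.

```python
import pandas as pd

# Confusion matrices copied from the outputs above
# (rows: true class 0/1, columns: predicted class 0/1).
matrices = {
    "DummyClassifier": [[0, 40], [0, 64]],
    "LogisticRegression": [[39, 1], [2, 62]],
    "DecisionTree": [[39, 1], [1, 63]],
    "RandomForest": [[40, 0], [1, 63]],
}

rows = []
for name, ((tn, fp), (fn, tp)) in matrices.items():
    total = tn + fp + fn + tp
    rows.append({
        "model": name,
        "accuracy": round((tn + tp) / total, 2),
        "false_positives": fp,
        "false_negatives": fn,
        "recall_positive": round(tp / (tp + fn), 2),
    })

summary = pd.DataFrame(rows).set_index("model")
print(summary)
```

The baseline's zero false negatives come from labelling every case as positive, which is why it is paired with 40 false positives.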

In [16]:
# feature importances for random forest model training
pd.DataFrame({'feature': X.columns, 
              'importance': forest.feature_importances_}).sort_values('importance', ascending=False)

| | feature | importance |
|---|---|---|
| 2 | polyuria | 0.225614 |
| 3 | polydipsia | 0.198683 |
| 1 | ismale | 0.095163 |
| 0 | age | 0.095106 |
| 12 | partial paresis | 0.050188 |
| 4 | sudden weight loss | 0.045033 |
| 14 | alopecia | 0.040403 |
| 10 | irritability | 0.039587 |
| 11 | delayed healing | 0.038481 |
| 9 | itching | 0.031948 |


It can be seen that in this model, the features that contribute the most to the predictions are polyuria and polydipsia, with importances of about 0.23 and 0.20.
<br>
Gender (ismale) is also a fairly important feature, followed very closely by age; both have an importance of about 0.095.
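
The values in `feature_importances_` are impurity-based and are computed from the training data, and scikit-learn's documentation notes that they can favour features with many distinct values. One possible cross-check is permutation importance on held-out data. The sketch below uses synthetic stand-in data with hypothetical feature names (`f0` to `f5`); `n_estimators`, `n_repeats`, and `random_state` are assumptions, not settings from the notebook.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (hypothetical); replace with X and y from the notebook.
X_demo, y_demo = make_classification(
    n_samples=500, n_features=6, n_informative=3, random_state=0
)
X_demo = pd.DataFrame(X_demo, columns=[f"f{i}" for i in range(6)])
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=0
)

rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)

# Shuffle each column of the held-out data and record the mean drop in accuracy.
perm = permutation_importance(rf, X_te, y_te, n_repeats=10, random_state=0)

comparison = pd.DataFrame({
    "feature": X_demo.columns,
    "impurity_importance": rf.feature_importances_,
    "permutation_importance": perm.importances_mean,
}).sort_values("permutation_importance", ascending=False)
print(comparison)
```

If both rankings agree on the top features, that lends some support to the impurity-based ordering; large disagreements would be a reason to look more closely.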