We are trying to predict who has diabetes and who doesn't (Outcome = 1 and 0, respectively). This notebook shows a simple Random Forest implementation that reaches about 83% accuracy, i.e. (TP+TN) / total, on a 20% validation split. I drop 2 columns, impute missing values for some other columns, and apply the default random forest model. This is my first notebook, so please let me know if there are any ways I can improve!

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
import seaborn as sns

# Load data
X = pd.read_csv('../input/pima-indians-diabetes-database/diabetes.csv')
y = X.pop('Outcome')

In [None]:
X

A value of 0 is impossible for some fields, such as "Glucose", so it probably denotes a missing value. Plotting each distribution with and without the 0 values shows the fields for which 0 is not a natural part of the distribution and is therefore acting as a null marker. This is the case for Glucose and the 4 columns after it (BloodPressure, SkinThickness, Insulin, BMI).

In [None]:
numRows = 8
numCols = 2
figWidth = 5
figHeight = 5
fig, axes = plt.subplots(numRows, numCols, sharex=False, figsize=(numCols * figWidth, numRows * figHeight))

for i in range(numRows):
    colname = X.columns[i]
    var2plot = X[colname]

    # left = plot of original column
    sns.histplot(x=var2plot, hue=y, ax=axes[i][0], common_bins=False, element='step')

    # right = plot of column with 0's filtered out
    sns.histplot(x=var2plot[var2plot > 0], hue=y, ax=axes[i][1], common_bins=False, element='step')

    # No plt.legend() call here: seaborn already draws the hue legend on each axis,
    # and plt.legend() would act on the figure's last axis only.

The code below reveals what percentage of the data are missing for those columns.

In [None]:
n_rows = len(X)  # 768 rows in this dataset
for i in range(1, 6):
    # Count the 0 entries directly instead of looking them up in value_counts()
    n_zeros = (X.iloc[:, i] == 0).sum()
    print("{} \t {:.1%}".format(X.columns[i], n_zeros / n_rows))
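
Another way to make the missing values explicit is to replace the 0 placeholders with NaN, so that pandas' own missing-value tools apply. The sketch below uses a small made-up frame (not the real dataset), and the names `toy`, `zero_means_missing`, `marked` and `missing_share` are only illustrative.

```python
import numpy as np
import pandas as pd

# Made-up rows for illustration; 0 is a real value for Pregnancies
# but a placeholder for "missing" in the other two columns.
toy = pd.DataFrame({
    "Pregnancies": [0, 1, 8, 1],
    "Glucose": [148, 0, 183, 89],
    "BloodPressure": [72, 66, 0, 0],
})

# Only these columns treat 0 as missing (an assumption of this sketch).
zero_means_missing = ["Glucose", "BloodPressure"]

marked = toy.copy()
marked[zero_means_missing] = marked[zero_means_missing].replace(0, np.nan)

# isna() gives a boolean frame; its column mean is the share of missing values.
missing_share = marked.isna().mean()
print(missing_share)
```

Restricting the replacement to the listed columns keeps genuine zeros, such as 0 pregnancies, untouched.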

In [None]:
# Impute missing values as the median.

from sklearn.impute import SimpleImputer
start_i = 1
end_i = 5

# Imputation: missing values in columns 1-5 ("Glucose" to "BMI") are denoted by 0.
# I replace these with the median of the column's non-zero values.
my_imputer = SimpleImputer(missing_values=0, strategy='median')
cols_to_impute = X.columns[start_i : end_i+1]

# fit_transform returns a plain array, so put the column names and row index back
imputed_cols = pd.DataFrame(my_imputer.fit_transform(X[cols_to_impute]), columns=cols_to_impute, index=X.index)
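
The imputer above is fitted on every row, including rows that later end up in the validation set, so the medians carry a little information from the validation data. One possible alternative is to wrap the imputer and the model in a scikit-learn pipeline, so that the medians are learned from the training rows only. This is a minimal sketch on synthetic stand-in data, not the notebook's actual workflow; `demo_X`, `demo_y`, `pipe` and `valid_accuracy` are hypothetical names.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Synthetic stand-in data (not the diabetes dataset).
rng = np.random.default_rng(0)
demo_X = pd.DataFrame({
    "Glucose": rng.integers(70, 200, size=200),
    "BMI": rng.uniform(18, 45, size=200).round(1),
})
demo_y = (demo_X["Glucose"] > 130).astype(int)

# Overwrite about 10% of the Glucose values with 0 to mimic missing entries.
missing_rows = demo_X.sample(frac=0.1, random_state=0).index
demo_X.loc[missing_rows, "Glucose"] = 0

X_train, X_valid, y_train, y_valid = train_test_split(
    demo_X, demo_y, test_size=0.2, random_state=0
)

# The pipeline fits the imputer on X_train only, then reuses those
# medians when transforming X_valid inside score()/predict().
pipe = make_pipeline(
    SimpleImputer(missing_values=0, strategy="median"),
    RandomForestClassifier(random_state=0),
)
pipe.fit(X_train, y_train)
valid_accuracy = pipe.score(X_valid, y_valid)
print(valid_accuracy)
```

With only 768 rows the effect on the medians is probably small, but the pipeline form also makes it easy to reuse the same preprocessing in cross-validation.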

In [None]:
# Replace original columns in X with imputed columns, call it "X_i"
X_i = pd.concat([X.Pregnancies, imputed_cols, X.iloc[:, 6:8]], axis=1)
X_i

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

# Define the models
#model = DecisionTreeClassifier(random_state=0) # 2% less accurate
model = RandomForestClassifier(random_state=0)

### Select data to use for modelling ###

#my_X = X.iloc[:, :1] # Using first variable alone (Pregnancies) gets 73% accuracy
#my_X = X.iloc[:, 0:4] # Using first 4 variables boosts accuracy to 77%
#my_X = X.iloc[:, :8] # Using all 8 gets 79% accuracy

#my_X = X_i # 82% after imputing missing values
# 83% by dropping SkinThickness and Insulin, which have many missing values
my_X = X_i.drop(columns=['SkinThickness', 'Insulin'])

# Break off validation set from training data (80% train, 20% validation)
X_t, X_v, y_t, y_v = train_test_split(my_X, y, train_size=0.8, test_size=0.2, random_state=0)

model.fit(X_t, y_t)
preds = model.predict(X_v)


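# Make confusion matrix: confusion[true label][predicted label] holds counts
confusion = [[0,0],[0,0]]
for i in range(len(preds)):
    confusion[y_v.iloc[i]][preds[i]] += 1

# Convert counts to fractions of the validation set
confusion_perc = [[0,0],[0,0]]
for i in range(2):
    for j in range(2):
        confusion_perc[i][j] = confusion[i][j] / len(preds)
print('Confusion matrix:')
print(confusion_perc)

# Accuracy = true negatives + true positives, as fractions of the validation set
print('Accuracy: ' + str( confusion_perc[0][0] + confusion_perc[1][1] ))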
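
The same numbers can also be obtained from `sklearn.metrics` rather than by hand. The sketch below runs on made-up labels and predictions (`y_true` and `y_pred` are illustrative, not the notebook's results), so the printed values say nothing about the model above.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Made-up labels and predictions, for illustration only.
y_true = [0, 0, 0, 1, 1, 0, 1, 0]
y_pred = [0, 0, 1, 1, 0, 0, 1, 0]

# Rows are true classes, columns are predicted classes.
# normalize="all" divides every cell by the total number of samples.
cm = confusion_matrix(y_true, y_pred, normalize="all")
acc = accuracy_score(y_true, y_pred)

print(cm)
print(acc)
```

In the notebook, `y_v` and `preds` would take the place of `y_true` and `y_pred`.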