# Train diabetes classification model

This notebook reads a CSV file and trains a model to predict diabetes in patients. The data is already preprocessed and requires no feature engineering.

The evaluation steps in this notebook were used during experimentation to decide whether the model was accurate enough. Going forward, the plan is to use MLflow's autolog feature instead, so that the model is easier to deploy later on.

## Read data from local file



In [1]:
import pandas as pd

In [2]:
df = pd.read_csv('data/diabetes-dev.csv')

In [None]:
df

## Split data

In [None]:
features = ['Pregnancies', 'PlasmaGlucose', 'DiastolicBloodPressure', 'TricepsThickness',
            'SerumInsulin', 'BMI', 'DiabetesPedigree', 'Age']
X, y = df[features].values, df['Diabetic'].values

In [None]:
len(X)

In [None]:
import numpy as np

In [None]:
print(np.unique(y, return_counts=True))

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=0)
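The class counts printed above matter for the split: if the labels are imbalanced, a purely random split can leave the two partitions with slightly different class ratios. One possible variation, shown here only as a sketch on synthetic data (the arrays `X_demo` and `y_demo` are made up for illustration), is to pass `stratify` to `train_test_split`:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced labels standing in for the 'Diabetic' column
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(1000, 8))
y_demo = np.array([0] * 700 + [1] * 300)

# stratify=y_demo keeps the class ratio the same in both partitions
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.30, random_state=0, stratify=y_demo
)

train_pos_rate = y_tr.mean()
test_pos_rate = y_te.mean()
print(train_pos_rate, test_pos_rate)
```

Whether stratification is worth adopting here depends on how skewed the real label counts turn out to be.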

## Train model

In [None]:
from sklearn.linear_model import LogisticRegression

In [None]:
# C is the inverse of the regularization strength; here C = 1/0.1 = 10
model = LogisticRegression(C=1/0.1, solver="liblinear").fit(X_train, y_train)

## Evaluate model

In [None]:
# Accuracy: fraction of test predictions that match the true labels
y_hat = model.predict(X_test)
acc = np.average(y_hat == y_test)

In [None]:
acc
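To judge whether an accuracy value is "accurate enough", it can help to compare it against the accuracy of always predicting the most frequent class. The sketch below uses small made-up label arrays (not the notebook's data) to show that the `np.average` approach matches `sklearn.metrics.accuracy_score` and how such a baseline could be computed.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Made-up labels and predictions, for illustration only
y_true_demo = np.array([0, 0, 0, 1, 1, 0, 1, 0])
y_pred_demo = np.array([0, 0, 1, 1, 0, 0, 1, 0])

# Same calculation as in the cell above, next to scikit-learn's helper
manual_acc = np.average(y_pred_demo == y_true_demo)
sk_acc = accuracy_score(y_true_demo, y_pred_demo)

# Accuracy of always predicting the most frequent class
majority_baseline = np.bincount(y_true_demo).max() / len(y_true_demo)

print(manual_acc, sk_acc, majority_baseline)  # 0.75 0.75 0.625
```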

In [None]:
from sklearn.metrics import roc_auc_score

In [None]:
# predict_proba returns one column per class; column 1 is the positive class
y_scores = model.predict_proba(X_test)
auc = roc_auc_score(y_test, y_scores[:, 1])

In [None]:
auc
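Indexing column 1 of `predict_proba` relies on the positive label being second in `model.classes_`, which holds for 0/1 labels. A slightly more defensive variant, sketched here on synthetic data with hypothetical names (`demo_model`, `pos_col`), looks the column up explicitly:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

# Synthetic stand-in for the diabetes data (8 features, binary label)
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)
demo_model = LogisticRegression(solver="liblinear").fit(X_demo, y_demo)

# Columns of predict_proba follow the order of demo_model.classes_
pos_col = int(np.where(demo_model.classes_ == 1)[0][0])
scores = demo_model.predict_proba(X_demo)[:, pos_col]

# Scored on the training data here purely to keep the example short
demo_auc = roc_auc_score(y_demo, scores)
```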

In [None]:
from sklearn.metrics import roc_curve
import matplotlib.pyplot as plt

In [None]:
# Compute the ROC curve from the positive-class probabilities
fpr, tpr, thresholds = roc_curve(y_test, y_scores[:, 1])
fig = plt.figure(figsize=(6, 4))
# Plot the diagonal line that a random classifier would follow
plt.plot([0, 1], [0, 1], 'k--')
# Plot the FPR and TPR achieved by our model
plt.plot(fpr, tpr)
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve')
plt.show()