This dataset is originally from the National Institute of Diabetes and Digestive and Kidney Diseases. The objective is to predict, based on diagnostic measurements, whether a patient has diabetes.

Several constraints were placed on the selection of these instances from a larger database. In particular, all patients here are females at least 21 years old of Pima Indian heritage.

Pregnancies: Number of times pregnant
Glucose: Plasma glucose concentration at 2 hours in an oral glucose tolerance test
BloodPressure: Diastolic blood pressure (mm Hg)
SkinThickness: Triceps skin fold thickness (mm)
Insulin: 2-Hour serum insulin (mu U/ml)
BMI: Body mass index (weight in kg/(height in m)^2)
DiabetesPedigreeFunction: Diabetes pedigree function
Age: Age (years)
Outcome: Class variable (0 or 1)

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

In [None]:
# logistic (sigmoid) function f(x) = e^x / (1 + e^x), plotted on [-6, 6]
x = np.arange(-6,6+0.01,0.01)
f = np.exp(x)/(1 + np.exp(x))
plt.plot(x,f)
plt.xlabel('x')
plt.ylabel('f(x)')
plt.grid()
plt.show()
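
The curve plotted above is the logistic (sigmoid) function, which logistic regression uses to map a linear score to a value between 0 and 1. The following is a minimal sketch using only the standard library; the helper name `sigmoid` is an illustrative assumption and is not used elsewhere in this notebook.

```python
import math

def sigmoid(x):
    """Logistic function e^x / (1 + e^x), arranged to avoid overflow for large |x|."""
    if x >= 0:
        # Equivalent form 1 / (1 + e^-x): e^-x stays small when x is large
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)  # small when x is very negative
    return z / (1.0 + z)

# Evaluate at the ends and the center of the plotted range [-6, 6]
for x in (-6, 0, 6):
    print(x, sigmoid(x))
```

The function equals 0.5 at x = 0 and is symmetric in the sense that sigmoid(x) + sigmoid(-x) = 1, which is why the plotted curve is S-shaped around the point (0, 0.5).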

In [None]:
col_names = ['pregnant', 'glucose', 'bp', 'skin', 'insulin',
             'bmi', 'pedigree', 'age', 'label']
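# read the table, replacing the file's own header row (row 0) with col_names
# header = 0 is needed here: header = 1 would treat the first data row as the header and drop it
diabetes = pd.read_csv("../input/pima-indians-diabetes-database/diabetes.csv", header = 0, names = col_names)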

In [None]:
# display the first five rows of the DataFrame
diabetes.head()

In [None]:
# select the explanatory variables (features) from the dataset
# and the variable to be explained (target)
feature_cols = ['pregnant', 'insulin', 'bmi', 'age', 'glucose', 'bp', 'pedigree']
X = diabetes[feature_cols]  # explanatory variables
y = diabetes.label  # variable to explain

In [None]:
# 75% of the data is used for learning
# 25% of the data is used for the test
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state = 0)
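
To make the 75% / 25% split concrete, here is a minimal sketch of a shuffled hold-out split written with the standard library only. It is a simplified stand-in for illustration, not the implementation of `train_test_split`; the function name `simple_split` and the toy data are assumptions.

```python
import random

def simple_split(rows, test_size=0.25, seed=0):
    """Shuffle a copy of rows and hold out a fraction test_size for testing."""
    rng = random.Random(seed)      # fixed seed so the split is reproducible
    shuffled = list(rows)          # copy, so the input is left untouched
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_size)
    # first n_test shuffled rows form the test set, the rest the training set
    return shuffled[n_test:], shuffled[:n_test]

rows = list(range(100))            # toy data standing in for patient records
train, test = simple_split(rows)
print(len(train), len(test))       # 75 25
```

Fixing the seed plays the same role as `random_state = 0` in the cell above: rerunning the split gives the same training and test sets.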

In [None]:
# import of the LogisticRegression class
from sklearn.linear_model import LogisticRegression
# creation of an instance of the LogisticRegression class
logistiquereg = LogisticRegression(solver = 'lbfgs', max_iter = 120)
# estimation of model parameters
logistiquereg.fit(X_train,y_train)
# Using the model to make a prediction
y_pred=logistiquereg.predict(X_test)

In [None]:
# evaluation of model performance using the confusion matrix
from sklearn import metrics
confusionMatrix = metrics.confusion_matrix(y_test, y_pred)
confusionMatrix

Here the confusion matrix is a 2×2 table because we are solving a binary classification problem (two classes, 0 and 1). In scikit-learn's convention, rows correspond to the actual classes and columns to the predicted classes.

The values on the main diagonal are the correct predictions.

The off-diagonal values are the incorrect predictions.
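
The layout of the table can be illustrated on a handful of made-up labels. This is a minimal sketch, assuming labels are the integers 0 and 1; the helper `confusion_matrix_2x2` and the toy lists are hypothetical and only mimic the row/column convention of `metrics.confusion_matrix`.

```python
def confusion_matrix_2x2(y_true, y_pred):
    """Return [[TN, FP], [FN, TP]]: rows are actual classes, columns are predicted classes."""
    matrix = [[0, 0], [0, 0]]
    for actual, predicted in zip(y_true, y_pred):
        matrix[actual][predicted] += 1
    return matrix

# Made-up labels for illustration only
y_true = [0, 0, 0, 1, 1, 1, 1, 0]
y_pred = [0, 1, 0, 1, 0, 1, 1, 0]

cm = confusion_matrix_2x2(y_true, y_pred)
print(cm)                              # [[3, 1], [1, 3]]

correct = cm[0][0] + cm[1][1]          # main diagonal
incorrect = cm[0][1] + cm[1][0]        # off-diagonal
print(correct, incorrect)              # 6 2
```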

In [None]:
# correct predictions: main diagonal of the confusion matrix
nbPredictionExact = confusionMatrix[0, 0] + confusionMatrix[1, 1]
print('The number of correct predictions is', nbPredictionExact)
# incorrect predictions: off-diagonal values
nbPredictionInexact = confusionMatrix[0, 1] + confusionMatrix[1, 0]
print('The number of incorrect predictions is', nbPredictionInexact)
nbPredictionTotal = nbPredictionExact + nbPredictionInexact
print('Total number of predictions: {0}, to be compared with the number of y_test items: {1}'
      .format(nbPredictionTotal, len(y_test)))

In [None]:
# visualization of the confusion matrix in the form of a heatmap
import seaborn as sns

In [None]:
nomClasse = [0, 1]
fig, ax = plt.subplots()
tick_marks = np.arange(len(nomClasse))
plt.xticks(tick_marks, nomClasse)
plt.yticks(tick_marks, nomClasse)
# create heatmap
sns.heatmap(pd.DataFrame(confusionMatrix), annot = True, cmap = "YlGnBu",fmt = 'g')
ax.xaxis.set_label_position("top")
plt.tight_layout()
plt.title('Confusion matrix', y=1.1)
plt.ylabel('Actual label')
plt.xlabel('Predicted label')
plt.show()

In [None]:
# model evaluation
print("Accuracy:",metrics.accuracy_score(y_test, y_pred))
print("Precision:",metrics.precision_score(y_test, y_pred))
print("Recall:",metrics.recall_score(y_test, y_pred))
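
The three scores printed above can all be derived from the four cells of the confusion matrix. The sketch below shows the arithmetic on made-up counts; the function name `classification_metrics` and the numbers are assumptions for illustration and are not results from this notebook.

```python
def classification_metrics(tn, fp, fn, tp):
    """Compute accuracy, precision and recall from confusion-matrix counts."""
    total = tn + fp + fn + tp
    accuracy = (tp + tn) / total                       # share of all predictions that are correct
    precision = tp / (tp + fp) if (tp + fp) else 0.0   # share of predicted positives that are real
    recall = tp / (tp + fn) if (tp + fn) else 0.0      # share of real positives that are found
    return accuracy, precision, recall

# Hypothetical counts, chosen only to make the arithmetic easy to follow
accuracy, precision, recall = classification_metrics(tn=50, fp=10, fn=15, tp=25)
print("Accuracy:", accuracy)    # 75 / 100
print("Precision:", precision)  # 25 / 35
print("Recall:", recall)        # 25 / 40
```

In a diagnostic setting such as this one, recall indicates how many patients with diabetes the model actually detects, while precision indicates how many of the positive predictions are right.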

In [None]:
# ROC curve (Receiver Operating Characteristic)
# predicted probability of class 1 for each test sample
y_pred_proba = logistiquereg.predict_proba(X_test)[:, 1]
fpr, tpr, _ = metrics.roc_curve(y_test,  y_pred_proba)
auc = metrics.roc_auc_score(y_test, y_pred_proba)
plt.plot(fpr,tpr,label = "data 1, auc=" + str(auc))
plt.xlabel('False positive rate')
plt.ylabel('True positive rate')
plt.legend(loc = 4)
plt.show()
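
One way to read the AUC value is as the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one, with ties counted as one half. The sketch below computes it that way on made-up scores; the helper `auc_by_pairs` is hypothetical and is meant as an illustration of the idea, not as how `metrics.roc_auc_score` is implemented.

```python
def auc_by_pairs(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    positives = [s for y, s in zip(y_true, scores) if y == 1]
    negatives = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

# Made-up labels and predicted probabilities for illustration only
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(auc_by_pairs(y_true, scores))  # 0.75
```

A value of 0.5 corresponds to random ranking and 1.0 to a perfect separation of the two classes.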