## Problem - Diabetes Prediction



Let's build a diabetes prediction model using logistic regression.

First, load the Pima Indians Diabetes dataset with pandas' `read_csv` function. You can download the data from Kaggle: https://www.kaggle.com/uciml/pima-indians-diabetes-database, or select a dataset from DataCamp: https://www.datacamp.com/workspace/datasets. The ready-to-use dataset lets you train the model on DataCamp's Workspace, a free Jupyter notebook in the cloud.

## Import Packages

In [87]:
import numpy as np
import pandas as pd
import statsmodels.api as sm
import matplotlib.pyplot as plt

## Get Data

In [88]:
df = pd.read_csv("diabetes.csv")

In [89]:
df

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,0
4,0,137,40,35,168,43.1,2.288,33,1
...,...,...,...,...,...,...,...,...,...
763,10,101,76,48,180,32.9,0.171,63,0
764,2,122,70,27,0,36.8,0.340,27,0
765,5,121,72,23,112,26.2,0.245,30,0
766,1,126,60,0,0,30.1,0.349,47,1


## Modeling

### Create feature matrix X

Select features (every column except the last one, `Outcome`)

In [90]:
X = df.iloc[:,:-1]

Add constant

In [91]:
X = sm.add_constant(X)  # Add a constant column to estimate the intercept parameter

In [92]:
X

Unnamed: 0,const,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age
0,1.0,6,148,72,35,0,33.6,0.627,50
1,1.0,1,85,66,29,0,26.6,0.351,31
2,1.0,8,183,64,0,0,23.3,0.672,32
3,1.0,1,89,66,23,94,28.1,0.167,21
4,1.0,0,137,40,35,168,43.1,2.288,33
...,...,...,...,...,...,...,...,...,...
763,1.0,10,101,76,48,180,32.9,0.171,63
764,1.0,2,122,70,27,0,36.8,0.340,27
765,1.0,5,121,72,23,112,26.2,0.245,30
766,1.0,1,126,60,0,0,30.1,0.349,47


In [93]:
y = df.iloc[:,-1]
y.value_counts()

Outcome
0    500
1    268
Name: count, dtype: int64

### Fit Model

In [94]:
model = sm.Logit(y, X)

In [95]:
res = model.fit()

Optimization terminated successfully.
         Current function value: 0.470993
         Iterations 6


In [96]:
print(res.summary())

                           Logit Regression Results                           
Dep. Variable:                Outcome   No. Observations:                  768
Model:                          Logit   Df Residuals:                      759
Method:                           MLE   Df Model:                            8
Date:                Wed, 27 Mar 2024   Pseudo R-squ.:                  0.2718
Time:                        15:36:59   Log-Likelihood:                -361.72
converged:                       True   LL-Null:                       -496.74
Covariance Type:            nonrobust   LLR p-value:                 9.652e-54
                               coef    std err          z      P>|z|      [0.025      0.975]
--------------------------------------------------------------------------------------------
const                       -8.4047      0.717    -11.728      0.000      -9.809      -7.000
Pregnancies                  0.1232      0.032      3.840      0.000       0.060       0.
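
The headline numbers in the summary can be cross-checked by hand. The following is a minimal sketch that uses only the class counts (500 and 268) and the log-likelihood values printed above. It assumes that the null model is intercept-only, that `Pseudo R-squ.` is McFadden's pseudo R-squared (1 - LL / LL-Null), and that `Current function value` is the negative log-likelihood averaged over observations. All variable names are illustrative.

```python
import math

# Values copied from the outputs above
n_neg, n_pos = 500, 268      # class counts from y.value_counts()
ll_model = -361.72           # Log-Likelihood from the summary
ll_null_reported = -496.74   # LL-Null from the summary

n = n_neg + n_pos
p_pos = n_pos / n            # share of positive cases

# Assumption: the null model predicts the same probability p_pos for every row,
# so its log-likelihood depends only on the class counts.
ll_null = n_pos * math.log(p_pos) + n_neg * math.log(1 - p_pos)

# McFadden's pseudo R-squared compares the fitted model with the null model
pseudo_r2 = 1 - ll_model / ll_null_reported

# Average negative log-likelihood per observation
mean_nll = -ll_model / n

print(f"LL-Null   ~ {ll_null:.2f}")
print(f"Pseudo R2 ~ {pseudo_r2:.4f}")
print(f"Mean NLL  ~ {mean_nll:.4f}")
```

If the assumptions hold, the three printed values should agree with `LL-Null`, `Pseudo R-squ.`, and `Current function value` up to rounding.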

## Use Model

Predict the probability of the positive class (`Outcome` = 1, diabetes):



In [97]:
X

Unnamed: 0,const,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age
0,1.0,6,148,72,35,0,33.6,0.627,50
1,1.0,1,85,66,29,0,26.6,0.351,31
2,1.0,8,183,64,0,0,23.3,0.672,32
3,1.0,1,89,66,23,94,28.1,0.167,21
4,1.0,0,137,40,35,168,43.1,2.288,33
...,...,...,...,...,...,...,...,...,...
763,1.0,10,101,76,48,180,32.9,0.171,63
764,1.0,2,122,70,27,0,36.8,0.340,27
765,1.0,5,121,72,23,112,26.2,0.245,30
766,1.0,1,126,60,0,0,30.1,0.349,47


In [98]:
res.predict(X)

0      0.721727
1      0.048642
2      0.796702
3      0.041625
4      0.902184
         ...   
763    0.317115
764    0.318969
765    0.170416
766    0.284976
767    0.072014
Length: 768, dtype: float64
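
For a logit model, each predicted probability is the logistic (sigmoid) function applied to the linear combination of a row of `X` and the fitted coefficients (`res.params`). The sketch below illustrates that calculation in plain Python. The coefficient and feature values are hypothetical, chosen only to keep the arithmetic easy to follow; they are not the fitted values from this model.

```python
import math

def sigmoid(z):
    """Map a linear score z to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(params, row):
    """Probability of the positive class for one row of the design matrix."""
    z = sum(b * x for b, x in zip(params, row))  # linear predictor
    return sigmoid(z)

# Hypothetical coefficients: [intercept, slope] (NOT the fitted values)
params = [-1.0, 0.5]
# Hypothetical row: [const, feature]; const is the 1.0 column added earlier
row = [1.0, 2.0]

p = predict_proba(params, row)  # linear predictor is -1.0 + 0.5 * 2.0 = 0.0
print(p)
```

The leading `1.0` in the row plays the role of the `const` column, which is why the intercept can be treated like any other coefficient.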

Assign each observation to a class using a decision threshold of `0.5`:

In [99]:
y_hat = (res.predict(X) >= 0.5).astype(int)

In [100]:
y_hat

0      1
1      0
2      1
3      0
4      1
      ..
763    0
764    0
765    0
766    0
767    0
Length: 768, dtype: int32
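
The thresholding step can be checked by hand on the first five predicted probabilities shown earlier. This is a minimal sketch in plain Python; the `classify` helper and the alternative threshold of `0.8` are illustrative assumptions, not part of the notebook.

```python
# First five predicted probabilities, copied from the res.predict(X) output
probs = [0.721727, 0.048642, 0.796702, 0.041625, 0.902184]

def classify(probs, threshold=0.5):
    """Return 1 where the probability reaches the threshold, else 0."""
    return [int(p >= threshold) for p in probs]

labels = classify(probs)                   # same rule as the notebook (0.5)
stricter = classify(probs, threshold=0.8)  # hypothetical stricter cut-off

print(labels)
print(stricter)
```

With the `0.5` threshold, the labels match the first five entries of `y_hat`. Raising the threshold turns borderline positives into negatives, which trades fewer false positives for more missed cases.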