In [1]:
# Some of the libraries we will use
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

Now let's load our data into Python using pandas.

In [3]:
diabetes = pd.read_csv('diabetes.csv')
print(diabetes.columns)
diabetes.head()

Index(['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin',
       'BMI', 'DiabetesPedigreeFunction', 'Age', 'Outcome'],
      dtype='object')


,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,0
4,0,137,40,35,168,43.1,2.288,33,1


### The diabetes dataset consists of 768 data points, with 8 features each and one label

In [116]:
print("dimension of diabetes data: {}".format(diabetes.shape))

dimension of diabetes data: (768, 9)


### Outcome 0 means no diabetes, outcome 1 means diabetes
Of these 768 data points, 500 are labeled as 0 and 268 as 1:

In [117]:
print(diabetes.groupby('Outcome').size())

Outcome
0    500
1    268
dtype: int64


### Feel free to explore further and come up with more expressive plots of the features against the outcome, for example scatter plots.
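As a starting point for such a plot, here is a minimal sketch of a scatter plot of two features colored by outcome. It uses a small synthetic DataFrame named `demo` (a made-up stand-in, not the real data) so that it runs on its own; the choice of `Glucose` and `BMI` is just an illustration.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Synthetic stand-in for the diabetes DataFrame (values are made up, for illustration only).
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "Glucose": rng.normal(120, 30, size=200),
    "BMI": rng.normal(32, 7, size=200),
    "Outcome": rng.integers(0, 2, size=200),
})

# Draw one scatter layer per outcome so each class gets its own color and legend entry.
fig, ax = plt.subplots(figsize=(6, 4))
for outcome, group in demo.groupby("Outcome"):
    ax.scatter(group["Glucose"], group["BMI"], alpha=0.6, label=f"Outcome {outcome}")
ax.set_xlabel("Glucose")
ax.set_ylabel("BMI")
ax.legend()
```

To try it on the real data, replace `demo` with the `diabetes` DataFrame loaded earlier and pick any pair of feature columns.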

# Logistic Regression
As with any ML algorithm we will deal with, we need to go through a few steps. These steps rarely change.

### 1- Split the data between training and validation
For that, we use the function `train_test_split` from `sklearn.model_selection`.

In [4]:
from sklearn.model_selection import train_test_split
X = diabetes.loc[:, diabetes.columns != 'Outcome']
Y = diabetes['Outcome']
X_train, X_test, y_train, y_test = train_test_split(X,Y, stratify=Y, random_state=66, test_size=0.25)
# random_state: fixes the random seed so that every run produces the same split. If you remove it, the split will be different on each run.
# In production, we will most likely not set it.

# stratify: splits the data so that the class proportions in the train and test sets match the proportions in the array passed to stratify.
# For example, if y is a binary categorical variable with values 0 and 1, with 25% zeros and 75% ones, stratify=y makes sure
# that each part of your random split has 25% 0's and 75% 1's.

# test_size: when given as a float, it should be between 0.0 and 1.0 and represents the proportion of the dataset to include in the test split.
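To see what `stratify` does in practice, the following sketch splits a hypothetical label array with the same class balance as our data (500 zeros, 268 ones) and prints the share of 1s in each part. The names `X_demo` and `y_demo` and the placeholder feature column are assumptions made for this illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical labels with the same class balance as the diabetes data: 500 zeros, 268 ones.
y_demo = np.array([0] * 500 + [1] * 268)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)  # placeholder feature column

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, stratify=y_demo, random_state=66, test_size=0.25
)

# With stratify, the share of 1s should be (almost) identical in all three arrays.
print("share of 1s overall:", y_demo.mean())
print("share of 1s in train:", y_tr.mean())
print("share of 1s in test: ", y_te.mean())
```

Running the same code without `stratify=y_demo` will usually give slightly different shares in the train and test parts.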

### 2- Create the Model
In the last lecture, you saw how powerful it is to create the model from scratch. However, doing so every time is tedious and repetitive. The best thing in coding is to package the repeated steps behind an API, so that you can change only what you need and obtain the results without much effort. Where have we seen that before? Correct: the function. Luckily for us, someone has already done this work. scikit-learn's `LogisticRegression` packages everything we need, including setting up the cost equation and the update equations, so we will use it to create our model.
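For reference, the cost and update equations for logistic regression typically take the form below. This is a sketch that assumes the from-scratch version used the standard cross-entropy cost minimized with gradient descent; the symbols $w$, $b$, $m$ (number of samples), and $\alpha$ (learning rate) are notation chosen here, not taken from the lecture.

$$
\hat{y}^{(i)} = \sigma\!\left(w^\top x^{(i)} + b\right), \qquad \sigma(z) = \frac{1}{1 + e^{-z}}
$$

$$
J(w, b) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log \hat{y}^{(i)} + \left(1 - y^{(i)}\right) \log\left(1 - \hat{y}^{(i)}\right) \right]
$$

$$
w \leftarrow w - \frac{\alpha}{m} \sum_{i=1}^{m} \left(\hat{y}^{(i)} - y^{(i)}\right) x^{(i)}, \qquad
b \leftarrow b - \frac{\alpha}{m} \sum_{i=1}^{m} \left(\hat{y}^{(i)} - y^{(i)}\right)
$$

Note that scikit-learn's `LogisticRegression` adds an L2 penalty by default and minimizes the cost with a dedicated solver rather than the plain update above, so its coefficients may differ somewhat from a from-scratch implementation.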

In [7]:
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression()

### 3- Fit the Model
Now that the model is created, we need to train it, i.e., fit it. In other words, we feed it the training data so that it can learn from it.

In [8]:
logreg.fit(X_train, y_train)



LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='warn',
          n_jobs=None, penalty='l2', random_state=None, solver='warn',
          tol=0.0001, verbose=0, warm_start=False)
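The `solver='warn'` and `multi_class='warn'` entries in the output above suggest an older scikit-learn release. Newer releases default to the `lbfgs` solver, which may emit a `ConvergenceWarning` on unscaled features with the default `max_iter=100`. The sketch below shows one common way to handle this: set the solver and iteration limit explicitly and standardize the features first. It uses synthetic data from `make_classification` as an assumption, so the numbers are unrelated to the diabetes results.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data with 768 samples and 8 features (same shape as X, but made-up values).
X_demo, y_demo = make_classification(n_samples=768, n_features=8, random_state=0)

# Scale the features, then fit logistic regression with an explicit solver and iteration limit.
model = make_pipeline(
    StandardScaler(),
    LogisticRegression(solver="lbfgs", max_iter=1000),
)
model.fit(X_demo, y_demo)

# One coefficient per feature for a binary problem.
print(model.named_steps["logisticregression"].coef_.shape)
```

Whether scaling changes the accuracy on the diabetes data is something to check for yourself; the main point is that the explicit settings make the behavior less dependent on the installed version.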

### 4- See How Good our Training is
We now ask our model to predict the outcome for each sample and compare it with the true labels in `y`. The `predict` method returns the predicted outcomes. Whenever the prediction for a sample matches the true outcome, it counts as correct; otherwise, it counts as wrong. The accuracy is then the number of correct predictions divided by the total number of samples. There is also a method that does both steps at once: `score`.

In [10]:

train_score = logreg.score(X_train, y_train)
print('Training accuracy is ', train_score)

Training accuracy is  0.78125
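To make the two-step description concrete, the sketch below computes the accuracy by hand from `predict` and compares it with `score`. It runs on synthetic data from `make_classification`, and the names `clf`, `X_demo`, and `y_demo` are assumptions for this illustration, so the value it prints is not the diabetes accuracy.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary classification data (made-up values, for illustration only).
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_demo, y_demo)

# Step 1: predict. Step 2: fraction of predictions that match the true labels.
y_pred = clf.predict(X_demo)
manual_accuracy = np.mean(y_pred == y_demo)

# score() performs both steps and should return the same number.
print(manual_accuracy, clf.score(X_demo, y_demo))
```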


### 5- Validate our Model
By calculating the accuracy on the test set only, we can check how well our model generalizes. Notice that our model has never seen or learned anything from the validation data before; we only use it to make predictions and compute the accuracy on unseen samples.

In [11]:
test_score = logreg.score(X_test, y_test)
print('Test accuracy is ', test_score)

Test accuracy is  0.7708333333333334
