# Diabetes Prediction Web Application using Logistic Regression and Streamlit

Diabetes is a chronic condition in which the body either does not produce enough insulin or becomes resistant to it. Insulin is the hormone that lets cells absorb glucose from the blood. Diabetes affects many people worldwide and is usually divided into Type 1 and Type 2, which have different characteristics. This article analyzes the PIMA Indian Diabetes dataset and builds a model that predicts whether a given observation is at risk of developing diabetes, based on the independent factors. It covers the steps followed to create a suitable model, including exploratory data analysis (EDA) and the model itself.

## Import the important libraries and dependencies

In [1]:
import pandas as pd
import numpy as np

In [2]:
# pd.read_csv() reads the CSV file into a DataFrame; .head() returns the first 5 records.

dataset = pd.read_csv("diabetes.csv")

dataset.head()

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,0
4,0,137,40,35,168,43.1,2.288,33,1


## Quick analysis of the dataset

Before modelling, it is important to do a quick analysis to learn more about the data. For that we can use some built-in pandas functions, such as:

1) ***dataset.info()***

**So what does this function do?**

It reports:

1) the number of entries, i.e. the number of records;

2) the data columns, i.e. the number of columns in the dataset;

3) the data type of each column;

4) for each column, how many of the records are non-null. This is very important because it shows whether the data contains any empty or null values.

In [3]:
dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Pregnancies               768 non-null    int64  
 1   Glucose                   768 non-null    int64  
 2   BloodPressure             768 non-null    int64  
 3   SkinThickness             768 non-null    int64  
 4   Insulin                   768 non-null    int64  
 5   BMI                       768 non-null    float64
 6   DiabetesPedigreeFunction  768 non-null    float64
 7   Age                       768 non-null    int64  
 8   Outcome                   768 non-null    int64  
dtypes: float64(2), int64(7)
memory usage: 54.1 KB


#### Q) Is our data numerical or categorical?
As we can see, the data type of every column is int or float, so all the values are stored as numbers and the data is numerical. (Outcome is a 0/1 label, but it is also stored as an integer.)

2) ***dataset.describe()***

#### Q) What does this function do?
It returns the **summary statistics of the Series or DataFrame provided.**

For numeric data, the result's index includes **count**, **mean**, **std**, **min**, **max** as well as the **lower, 50th and upper percentiles** for every column in the dataset.

#### Q) What are the default values of the lower and upper percentiles?
By default the **lower percentile is 25 and the upper percentile is 75. The 50th percentile is the same as the median**.
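To illustrate these defaults, here is a small sketch on a made-up column (the values are hypothetical and not taken from the diabetes data). It checks that the 50% row equals the median and shows that other percentiles can be requested through the `percentiles` argument of `describe()`:

```python
import pandas as pd

# A made-up column with nine values, so the median is easy to see (5).
toy = pd.DataFrame({"value": [1, 2, 3, 4, 5, 6, 7, 8, 9]})

# Default summary: count, mean, std, min, 25%, 50%, 75%, max.
default_summary = toy.describe()
print(default_summary)

# The 50% row is the median of the column.
median_matches = default_summary.loc["50%", "value"] == toy["value"].median()
print(median_matches)

# Ask for the 10th and 90th percentiles instead of the defaults.
custom_summary = toy.describe(percentiles=[0.1, 0.9])
print(custom_summary.index.tolist())
```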

In [4]:
dataset.describe()


Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
count,768.0,768.0,768.0,768.0,768.0,768.0,768.0,768.0,768.0
mean,3.845052,120.894531,69.105469,20.536458,79.799479,31.992578,0.471876,33.240885,0.348958
std,3.369578,31.972618,19.355807,15.952218,115.244002,7.88416,0.331329,11.760232,0.476951
min,0.0,0.0,0.0,0.0,0.0,0.0,0.078,21.0,0.0
25%,1.0,99.0,62.0,0.0,0.0,27.3,0.24375,24.0,0.0
50%,3.0,117.0,72.0,23.0,30.5,32.0,0.3725,29.0,0.0
75%,6.0,140.25,80.0,32.0,127.25,36.6,0.62625,41.0,1.0
max,17.0,199.0,122.0,99.0,846.0,67.1,2.42,81.0,1.0


#### Q) What do these values mean?
1) **count** - the number of non-NA/null observations.

2) **mean** - the mean of the values.

3) **std** - the standard deviation of the observations.

4) **min** - the minimum of the values in the column.

5) **max** - the maximum of the values in the column.

**Now the most important step is to check for missing values in the dataset.**

.isnull().sum() gives the number of null values in every column.

In [5]:
dataset.isnull().sum()

Pregnancies                 0
Glucose                     0
BloodPressure               0
SkinThickness               0
Insulin                     0
BMI                         0
DiabetesPedigreeFunction    0
Age                         0
Outcome                     0
dtype: int64

Here every column returns **0**, which means there are **no missing values**. This agrees with **dataset.info()** above, which showed 768 non-null values in every column.
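A count of zero nulls does not rule out missing measurements that were recorded as 0. In the describe() output above, the minimum of Glucose, BloodPressure, SkinThickness, Insulin and BMI is 0, which is not a plausible reading for a living patient, so those zeros may be placeholders for missing data (this is an assumption; nothing in the file itself confirms it). A minimal sketch of how such zeros could be counted, using only the five rows printed by head() above:

```python
import pandas as pd

# The five rows printed by dataset.head() above, rebuilt by hand so the
# example runs without diabetes.csv.
sample = pd.DataFrame(
    {
        "Pregnancies": [6, 1, 8, 1, 0],
        "Glucose": [148, 85, 183, 89, 137],
        "BloodPressure": [72, 66, 64, 66, 40],
        "SkinThickness": [35, 29, 0, 23, 35],
        "Insulin": [0, 0, 0, 94, 168],
        "BMI": [33.6, 26.6, 23.3, 28.1, 43.1],
    }
)

# Columns where 0 is assumed to be a placeholder rather than a measurement.
suspect_columns = ["Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI"]

# Count how many zeros each suspect column contains.
zero_counts = (sample[suspect_columns] == 0).sum()
print(zero_counts)
```

On the full dataset the same expression would be applied to `dataset[suspect_columns]`.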

## Now let's prepare the independent and dependent features and store them in the variables X and y

#### Q) What are independent and dependent features?  

**Independent feature**: as the name suggests, its value is not affected by changes in the values of the other features.  

**Dependent feature**: a feature whose value changes when the values of other features change.  

**E.g.**: suppose a dataset has 3 columns: **no of hrs study**, **no of hrs play** and **result**.  

Here **result depends on no of hrs study and no of hrs play**, because **if I study more and play less the result will be pass, and if I study less and play more the result will be fail**.  

A change in the values of study and play affects the value of result, which is why **result is the dependent feature and the other two are independent**.  

We can also say that **the value we are going to predict is always the dependent feature and the rest are independent**.

In [17]:
## Independent and Dependent features

# Separate the data and the labels. We drop the Outcome column because it is the
# dependent feature, and store the remaining columns unchanged in X.
# columns='Outcome' already tells pandas to drop a column, so axis=1 is not needed.
X = dataset.drop(columns='Outcome')

# There is only one dependent feature, so we select that column and store it in y
y = dataset['Outcome']

In [18]:
# all independent features
X.head()

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age
0,6,148,72,35,0,33.6,0.627,50
1,1,85,66,29,0,26.6,0.351,31
2,8,183,64,0,0,23.3,0.672,32
3,1,89,66,23,94,28.1,0.167,21
4,0,137,40,35,168,43.1,2.288,33


In [19]:
# dependent feature
y

0      1
1      0
2      1
3      0
4      1
      ..
763    0
764    0
765    0
766    1
767    0
Name: Outcome, Length: 768, dtype: int64

## Train Test Split  

#### Q) What is Train Test Split?
It's a **process to split your data into Training Data and Testing Data**.  
**Training data is**, as the name suggests, **used to train your model**. **Test data** is **data that the model has not seen during training**, so it is used to check how well the model handles new records.

In [20]:
##Train Test Split
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.3,random_state=42)

**sklearn.model_selection** contains a **train_test_split() function** that splits your data into training and test sets.  

**test_size = 0.3** means that **30% of our data is used for testing** and the remaining 70% for training; we can change this as needed. **random_state = 42** fixes the random shuffle so that the same split is produced on every run.

Here we created training data and test data for both the independent and the dependent features, i.e. for X and y.
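To make the 70/30 split concrete, here is a small sketch on random stand-in data with the same shape as our features (768 rows, 8 columns). The names `X_demo`, `y_demo`, `X_tr` and `X_te` are placeholders chosen for this example only:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in data with the same number of rows (768) and feature columns (8)
# as the diabetes dataset; the values themselves are random placeholders.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(768, 8))
y_demo = rng.integers(0, 2, size=768)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42
)

# scikit-learn rounds the test set up, so 30% of 768 rows becomes 231 rows
# and the remaining 537 rows are used for training.
print(X_tr.shape, X_te.shape)
```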

In [21]:
# training data of independent feature
X_train

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age
334,1,95,60,18,58,23.9,0.260,22
139,5,105,72,29,325,36.9,0.159,28
485,0,135,68,42,250,42.3,0.365,24
547,4,131,68,21,166,33.1,0.160,28
18,1,103,30,38,83,43.3,0.183,33
...,...,...,...,...,...,...,...,...
71,5,139,64,35,140,28.6,0.411,26
106,1,96,122,0,0,22.4,0.207,27
270,10,101,86,37,0,45.6,1.136,38
435,0,141,0,0,0,42.4,0.205,29


In [22]:
# testing data of independent feature
X_test

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age
668,6,98,58,33,190,34.0,0.430,43
324,2,112,75,32,0,35.7,0.148,21
624,2,108,64,0,0,30.8,0.158,21
690,8,107,80,0,0,24.6,0.856,34
473,7,136,90,0,0,29.9,0.210,50
...,...,...,...,...,...,...,...,...
619,0,119,0,0,0,32.4,0.141,24
198,4,109,64,44,99,34.8,0.905,26
538,0,127,80,37,210,36.3,0.804,23
329,6,105,70,32,68,30.8,0.122,37


## Model Building   

Here we use the Logistic Regression algorithm to build our model.  

#### Q) What is logistic regression?  

**Logistic regression is a classification algorithm. It is used to predict a binary outcome based on a set of independent variables.**  

OK, so what does this mean? A binary outcome is one where there are **only two possible scenarios: either the event happens (1) or it does not happen (0)**. **Independent variables are those variables or factors which may influence the outcome (or dependent variable)**.

So: logistic regression is a suitable type of analysis when you’re working with binary data. You know you’re dealing with binary data when the output or dependent variable is dichotomous, in other words, if it fits into one of two categories (such as “yes” or “no”, “pass” or “fail”, and so on).  

**E.g.**: predicting if an incoming email is spam or not spam, or predicting if a credit card transaction is fraudulent or not fraudulent.  

<img src="logic_reg.png" width="300">  

**The predicted probability always lies between 0 and 1**; it is turned into a class label (0 or 1) by comparing it with a threshold, typically 0.5.
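The S-shaped curve behind logistic regression is the sigmoid (logistic) function, which squeezes any real-valued score into the range between 0 and 1. A minimal sketch, assuming some hypothetical scores rather than values produced by our model:

```python
import numpy as np

def sigmoid(z):
    """Map any real number (or array of numbers) into the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical scores: in logistic regression each score is a weighted sum
# of the features plus an intercept.
scores = np.array([-6.0, -2.0, 0.0, 2.0, 6.0])
probabilities = sigmoid(scores)
print(probabilities)

# Turn probabilities into class labels with a 0.5 threshold.
labels = (probabilities >= 0.5).astype(int)
print(labels)
```

A score of 0 maps to a probability of exactly 0.5, so negative scores fall into class 0 and positive scores into class 1.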

In [23]:
# we have to import model from sklearn.linear_model
from sklearn.linear_model import LogisticRegression

In [24]:
# creating an object of the model
model = LogisticRegression()

In [25]:
# now we'll train our model by calling model.fit() with our training data

model.fit(X_train, y_train)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


LogisticRegression()
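The message above is part of scikit-learn's convergence warning: the solver reached its iteration limit before converging, and the warning itself suggests raising `max_iter` or scaling the data. A minimal sketch of both remedies follows, assuming synthetic stand-in data from `make_classification` instead of diabetes.csv. The value `max_iter=1000` is an arbitrary choice for illustration, not one taken from this notebook:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the diabetes data: 768 rows, 8 numeric features.
X_demo, y_demo = make_classification(n_samples=768, n_features=8, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42
)

# Scale the features, then fit logistic regression with a higher iteration limit.
scaled_model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scaled_model.fit(X_tr, y_tr)

# Fraction of correct predictions on the held-out rows.
test_accuracy = scaled_model.score(X_te, y_te)
print(test_accuracy)
```

Because the scaler is part of the pipeline, the same scaling is applied automatically whenever `predict()` is called on new inputs.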

In [26]:
# accuracy score on the training data, used to check the accuracy of the model

from sklearn.metrics import accuracy_score

# model.predict() is used to make predictions.
X_train_prediction = model.predict(X_train)

# accuracy_score expects the true labels first, then the predictions
training_data_accuracy = accuracy_score(y_train, X_train_prediction)

In [29]:
print('Accuracy score in training data = ' ,training_data_accuracy)

Accuracy score in training data =  0.7821229050279329


Here the accuracy score is about 78%, which means the model's prediction matches the true label for roughly 78 out of every 100 training records. For a simple baseline model this is a reasonable result.
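Accuracy is simply the number of correct predictions divided by the total number of predictions. A small sketch with hypothetical labels (not the output of our model) that computes it by hand:

```python
import numpy as np

# Hypothetical true labels and predictions for eight records.
y_true = np.array([1, 0, 1, 0, 1, 0, 0, 1])
y_pred = np.array([1, 0, 0, 0, 1, 1, 0, 1])

# A prediction is correct when it equals the true label.
correct = (y_true == y_pred).sum()
accuracy = correct / len(y_true)
print(correct, accuracy)   # 6 correct out of 8 gives 0.75
```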

In [32]:
# accuracy score on the test data 

X_test_prediction = model.predict(X_test)
# true labels first, then the predictions
test_data_accuracy = accuracy_score(y_test, X_test_prediction)

print('Accuracy score in test data = ' ,test_data_accuracy)

Accuracy score in test data =  0.7402597402597403


Here you can see that the model makes predictions with similar accuracy on the training data (about 78%) and the test data (about 74%), which suggests that our **model is neither overfitting** nor **underfitting** badly.  

#### Q) What are overfitting and underfitting?  

**Overfitting**: when a model performs very well on the training data but poorly on test data (new data), it is known as overfitting.

**Underfitting**: an underfit model performs poorly even on the training data and will give unreliable predictions.

## Making Predictions

Here we take the values of all the independent variables and store them in input_data. The values of Pregnancies, Glucose, BloodPressure, SkinThickness, Insulin, BMI, DiabetesPedigreeFunction and Age are given in that order, and based on these values we predict whether the patient is diabetic or not.  

If the outcome is 1 the person is diabetic; if the outcome is 0 the person is non-diabetic.

In [33]:
# predictive system 

input_data = (5,166,72,19,175,25.8,0.587,51)

# changing the input_data to numpy array
input_data_as_numpy_array = np.asarray(input_data)

# reshape the array as we are predicting for one instance
input_data_reshaped = input_data_as_numpy_array.reshape(1,-1)

# model.predict() makes a prediction based on the given input data; we store it in a variable
prediction = model.predict(input_data_reshaped)
print(prediction)

# prediction is an array, [1] or [0]. We want the value itself, so we use prediction[0]
# to fetch the value at index 0.

# if the value is 0 the person is not diabetic; if the value is 1 the person is diabetic
if prediction[0] == 0:
  print('The person is not diabetic')
else:
  print('The person is diabetic')

[1]
The person is diabetic
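The `reshape(1, -1)` step above is needed because scikit-learn's `predict()` expects a 2D array with one row per sample and one column per feature. A small sketch of what the reshape does to the same input tuple (the names `flat` and `row` are used only in this example):

```python
import numpy as np

input_data = (5, 166, 72, 19, 175, 25.8, 0.587, 51)

# A tuple of 8 numbers becomes a 1D array with a single axis.
flat = np.asarray(input_data)
print(flat.shape)

# 1 row; -1 lets NumPy work out the number of columns (8 here).
row = flat.reshape(1, -1)
print(row.shape)
```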




Now we'll save the model to a pickle file so that we can use it elsewhere without retraining it every time. We'll just load this pickle file and the model will be ready to use.

In [34]:
import pickle

# 'wb' opens the file for writing in binary mode; the with block closes it afterwards
with open('trained_model', 'wb') as f:
    pickle.dump(model, f)

In [35]:
# loading the pickle file. We load the trained model from the file called trained_model
# into the variable loaded_model

with open('trained_model', 'rb') as f:
    loaded_model = pickle.load(f)
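The save-and-load round trip can be tried on its own. In the sketch below a plain dictionary stands in for the trained model and a temporary folder is used, both of which are assumptions made so that the example runs anywhere. Note that unpickling can execute arbitrary code, so only load pickle files that you created yourself or otherwise trust.

```python
import os
import pickle
import tempfile

# A plain dictionary stands in for the trained model so the example has no
# dependency on scikit-learn; any picklable Python object works the same way.
stand_in_model = {"name": "demo", "coefficients": [0.1, -0.2, 0.3]}

# Write to a temporary directory so nothing is left behind afterwards.
with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "trained_model")

    with open(path, "wb") as f:   # 'wb' = write binary
        pickle.dump(stand_in_model, f)

    with open(path, "rb") as f:   # 'rb' = read binary
        restored = pickle.load(f)

# The restored object is equal to the one that was saved.
print(restored == stand_in_model)
```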

#### Making Prediction using pickle file

In [36]:
input_data = (5,166,72,19,175,25.8,0.587,51)

# changing the input_data to numpy array
input_data_as_numpy_array = np.asarray(input_data)

# reshape the array as we are predicting for one instance
input_data_reshaped = input_data_as_numpy_array.reshape(1,-1)

prediction = loaded_model.predict(input_data_reshaped)
print(prediction)

if (prediction[0] == 0):
  print('The person is not diabetic')
else:
  print('The person is diabetic')

[1]
The person is diabetic




## !!!! DON'T RUN THE CODE FROM THIS CELL ONWARDS IN JUPYTER NOTEBOOK

Now for the next step.

## Let's create a Streamlit app and make a beautiful front-end application  


### Steps to Create:  

1) Open Anaconda Navigator on your PC. If you don't have it, download it. It is a desktop GUI that lets you launch applications and easily manage conda packages, environments and channels without using command-line commands.  

2) Create a separate environment so that the packages we install for the app do not affect the rest of the system. All the packages will stay in that environment only. You can create it from Anaconda Navigator itself, using the CREATE button to create a new environment.

3) After creating the new environment, open the terminal of that environment and install all the necessary packages, such as pandas, streamlit, scikit-learn, etc.  

Use commands such as **pip install pandas** or **pip install streamlit**; otherwise your app will throw an error.  

4) After setting up the environment, we need to build a front-end application using Streamlit. You can use any IDE you want; I used Visual Studio Code.  

5) Launch VS Code from Anaconda Navigator, create a new project, open the folder that contains your ipynb file and pickle file, and start coding. Create a file and save it with the extension ".py". We will build our entire code in this file.   


### !!!! WARNING: DON'T RUN ANY CELLS FROM HERE.

In [None]:
## !!!! DON'T RUN THIS CODE HERE, IT WON'T WORK. INSTEAD, FOLLOW THE ABOVE INSTRUCTIONS.

import numpy as np
import pickle
import streamlit as st


# loading the saved model; the with block closes the file after reading
with open('trained_model', 'rb') as f:
    loaded_model = pickle.load(f)


# creating a function for Prediction

def diabetes_prediction(input_data):
    

    # changing the input_data to numpy array
    input_data_as_numpy_array = np.asarray(input_data)

    # reshape the array as we are predicting for one instance
    input_data_reshaped = input_data_as_numpy_array.reshape(1,-1)

    prediction = loaded_model.predict(input_data_reshaped)
    print(prediction)

    if (prediction[0] == 0):
      return 'The person is not diabetic'
    else:
      return 'The person is diabetic'
  
    
  
def main():
    
    
    # giving a title
    st.title('Diabetes Prediction Web App')
    
    
    # getting the input data from the user
    
    
    Pregnancies = st.number_input('Number of Pregnancies', step=1.)
    Glucose = st.number_input('Glucose Level',step=1.)
    BloodPressure = st.number_input('Blood Pressure value',step=1.)
    SkinThickness = st.number_input('Skin Thickness value',step=1.)
    Insulin = st.number_input('Insulin Level',step=1.)
    BMI = st.number_input('BMI value',step=1.)
    DiabetesPedigreeFunction = st.number_input('Diabetes Pedigree Function value',step=1.)
    Age = st.number_input('Age of the Person', step=1.)
    
    
    # code for Prediction
    
    # creating a button for Prediction; the result box is shown only after a click,
    # so no empty message appears when the page first loads
    
    if st.button('Diabetes Test Result'):
        diagnosis = diabetes_prediction([Pregnancies, Glucose, BloodPressure, SkinThickness, Insulin, BMI, DiabetesPedigreeFunction, Age])
        st.success(diagnosis)
    
    
if __name__ == '__main__':
    main()
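The list passed to `diabetes_prediction` must contain the eight values in the same order as the columns the model was trained on. A minimal sketch of a helper that makes this order explicit is shown below. The function name `build_feature_row` and the `FEATURE_ORDER` list are hypothetical additions for illustration and are not part of the app above:

```python
import numpy as np

# Column order used for training (the columns of X shown earlier).
FEATURE_ORDER = [
    "Pregnancies", "Glucose", "BloodPressure", "SkinThickness",
    "Insulin", "BMI", "DiabetesPedigreeFunction", "Age",
]

def build_feature_row(values):
    """Return a (1, 8) float array ordered like the training columns.

    `values` is a dict mapping each feature name to a number.
    """
    missing = [name for name in FEATURE_ORDER if name not in values]
    if missing:
        raise ValueError(f"Missing features: {missing}")
    ordered = [float(values[name]) for name in FEATURE_ORDER]
    return np.asarray(ordered).reshape(1, -1)

# The keys can be given in any order; the helper puts them in training order.
example = {
    "Age": 51, "BMI": 25.8, "Glucose": 166, "Pregnancies": 5,
    "BloodPressure": 72, "SkinThickness": 19, "Insulin": 175,
    "DiabetesPedigreeFunction": 0.587,
}
row = build_feature_row(example)
print(row)
```

The returned array could then be passed straight to `loaded_model.predict()`.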

### Let's understand the code:  


1) First we import the libraries. Make sure all the packages are installed in your environment, otherwise an error will be thrown when the app runs. 

2) We load our trained model into a variable by loading the pickle file, opening it in 'rb' (read binary) mode. Now our model is ready to make predictions.  

3) Now we create a main function in which we take the values of our independent features from the user through a form.   

> st.title creates a title for your app.    
> st.number_input creates an input field that asks the user for a number. All our features are numerical, which is why I used number_input instead of text_input.  
> We store the value of each feature in a variable.  

> st.button creates a button. When the user clicks it, all the values the user has entered are passed to another function, which makes the prediction.

4) create a function to make prediction and here 