<a href="https://colab.research.google.com/github/zehor-l/Diabetes-Prediction-Using-Machine-Learning/blob/main/Diabetes_Prediction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Diabetes Prediction Using Machine Learning
This project predicts diabetes with a Support Vector Machine (SVM), a supervised learning algorithm. Trained on labeled medical data such as BMI, blood glucose, and insulin levels, the SVM looks for a hyperplane that separates diabetic patients from non-diabetic ones as well as possible.
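
To make the idea of a separating hyperplane concrete, here is a minimal sketch on made-up 2-D points (not the diabetes data). It assumes a linear kernel, where the fitted model exposes the hyperplane as a weight vector `coef_` and an offset `intercept_`; the helper `side_of_hyperplane` is a hypothetical name introduced only for this illustration.

```python
import numpy as np
from sklearn import svm

# Toy 2-D data (made up for illustration): two well-separated groups.
X_toy = np.array([[1.0, 1.0], [1.5, 0.5], [2.0, 1.5],
                  [6.0, 6.0], [6.5, 5.5], [7.0, 6.5]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

toy_clf = svm.SVC(kernel='linear')
toy_clf.fit(X_toy, y_toy)

# For a linear kernel the hyperplane is the set of points where w . x + b = 0.
w = toy_clf.coef_[0]
b = toy_clf.intercept_[0]

def side_of_hyperplane(point):
    """Return 1 if the point lies on the positive side of the hyperplane, else 0."""
    return int(np.dot(w, point) + b > 0)

# The manual rule should agree with the classifier's own predictions.
manual = [side_of_hyperplane(p) for p in X_toy]
print(manual)
print(toy_clf.predict(X_toy).tolist())
```

Each prediction comes down to which side of the hyperplane a point falls on, which is what the trained classifier does later with eight features instead of two.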

In [2]:
# Let's begin by importing the necessary libraries!
# NumPy arrays for numerical processing
import numpy as np
# pandas DataFrames to structure the data
import pandas as pd
# StandardScaler to standardize the features
from sklearn.preprocessing import StandardScaler
# train_test_split to split the data into training and test sets
from sklearn.model_selection import train_test_split
# The SVM model and the accuracy metric
from sklearn import svm
from sklearn.metrics import accuracy_score


Next comes data collection and analysis. We use the PIMA Diabetes Dataset, which is available on Kaggle.

In [3]:
# loading dataset
dataset= pd.read_csv('/content/diabetes.csv')
# printing the first rows of the dataset
dataset.head()

   Pregnancies  Glucose  BloodPressure  SkinThickness  Insulin   BMI  DiabetesPedigreeFunction  Age  Outcome
0            6      148             72             35        0  33.6                     0.627   50        1
1            1       85             66             29        0  26.6                     0.351   31        0
2            8      183             64              0        0  23.3                     0.672   32        1
3            1       89             66             23       94  28.1                     0.167   21        0
4            0      137             40             35      168  43.1                     2.288   33        1


Before going further, it helps to understand what the columns represent. SkinThickness is the triceps skinfold thickness in millimetres, a proxy for subcutaneous body fat. BMI is the Body Mass Index: weight in kilograms divided by the square of height in metres. DiabetesPedigreeFunction is a score that summarizes the likelihood of diabetes based on family history.
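
As a quick worked example of the BMI formula, the sketch below computes it for one hypothetical person. The function name `bmi` and the sample weight and height are assumptions chosen for illustration; they do not come from the dataset.

```python
def bmi(weight_kg, height_m):
    """Body Mass Index: weight in kilograms divided by height in metres squared."""
    if height_m <= 0:
        raise ValueError("height_m must be positive")
    return weight_kg / height_m ** 2

# Hypothetical person: 70 kg, 1.75 m tall.
example = bmi(weight_kg=70.0, height_m=1.75)
print(round(example, 1))
```

The result is on the same scale as the BMI column in the dataset.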

In [4]:
# Explore the dataset further: get the number of rows and columns
dataset.shape

(768, 9)

Each of the 768 rows is one person, and the 9 columns hold that person's attributes: 8 features plus the Outcome label. Now, let's obtain the statistical summary of the dataset.

In [5]:
# getting the statistical measures of the data
dataset.describe()

       Pregnancies     Glucose  BloodPressure  SkinThickness     Insulin        BMI  DiabetesPedigreeFunction        Age   Outcome
count        768.0       768.0          768.0          768.0       768.0      768.0                     768.0      768.0     768.0
mean      3.845052  120.894531      69.105469      20.536458   79.799479  31.992578                  0.471876  33.240885  0.348958
std       3.369578   31.972618      19.355807      15.952218  115.244002    7.88416                  0.331329  11.760232  0.476951
min            0.0         0.0            0.0            0.0         0.0        0.0                     0.078       21.0       0.0
25%            1.0        99.0           62.0            0.0         0.0       27.3                   0.24375       24.0       0.0
50%            3.0       117.0           72.0           23.0        30.5       32.0                    0.3725       29.0       0.0
75%            6.0      140.25           80.0           32.0      127.25       36.6                   0.62625       41.0       1.0
max           17.0       199.0          122.0           99.0       846.0       67.1                      2.42       81.0       1.0


Now, let's see how many cases there are for diabetic (1) and non-diabetic (0) examples in the dataset.

In [6]:
dataset['Outcome'].value_counts()

Outcome  count
0          500
1          268
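
The counts above show that the classes are imbalanced, which matters when reading an accuracy figure. As a rough sketch, the snippet below rebuilds the label column from those two counts (rather than loading the CSV, so it runs on its own) and computes the class proportions and the accuracy a model would get by always predicting the majority class. The variable names are illustrative.

```python
import pandas as pd

# Rebuild the label column from the reported counts: 500 zeros and 268 ones.
outcome = pd.Series([0] * 500 + [1] * 268, name='Outcome')

# normalize=True returns fractions instead of raw counts.
proportions = outcome.value_counts(normalize=True)
majority_baseline = proportions.max()

print(proportions.round(3))
print(f'Majority-class baseline accuracy: {majority_baseline:.3f}')
```

Any classifier trained on this data should be judged against that baseline, not against 50%.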


In [7]:
dataset.groupby('Outcome').mean()

         Pregnancies     Glucose  BloodPressure  SkinThickness     Insulin        BMI  DiabetesPedigreeFunction        Age
Outcome
0              3.298      109.98         68.184         19.664      68.792    30.3042                  0.429734      31.19
1           4.865672  141.257463      70.824627      22.164179  100.335821  35.142537                    0.5505  37.067164


Next, we separate the features from the labels.

In [9]:
X = dataset.drop(columns='Outcome')  # features: every column except the label
Y = dataset['Outcome']               # labels
print(X)  # we expect to get all the data except Outcome

     Pregnancies  Glucose  BloodPressure  SkinThickness  Insulin   BMI  \
0              6      148             72             35        0  33.6   
1              1       85             66             29        0  26.6   
2              8      183             64              0        0  23.3   
3              1       89             66             23       94  28.1   
4              0      137             40             35      168  43.1   
..           ...      ...            ...            ...      ...   ...   
763           10      101             76             48      180  32.9   
764            2      122             70             27        0  36.8   
765            5      121             72             23      112  26.2   
766            1      126             60              0        0  30.1   
767            1       93             70             31        0  30.4   

     DiabetesPedigreeFunction  Age  
0                       0.627   50  
1                       0.351   31  


We can see that the features have very different ranges (Insulin goes up to 846, while DiabetesPedigreeFunction stays below 2.5). An SVM is sensitive to feature scale, so features with large values can dominate the result. Therefore, the next step in our data processing is data standardization.

In [12]:
scalar = StandardScaler()
# Fit the scaler: learn each feature's mean and standard deviation
scalar.fit(X)
# Transform the data to zero mean and unit variance
standardized_data = scalar.transform(X)
X = standardized_data
print(X)


[[ 0.63994726  0.84832379  0.14964075 ...  0.20401277  0.46849198
   1.4259954 ]
 [-0.84488505 -1.12339636 -0.16054575 ... -0.68442195 -0.36506078
  -0.19067191]
 [ 1.23388019  1.94372388 -0.26394125 ... -1.10325546  0.60439732
  -0.10558415]
 ...
 [ 0.3429808   0.00330087  0.14964075 ... -0.73518964 -0.68519336
  -0.27575966]
 [-0.84488505  0.1597866  -0.47073225 ... -0.24020459 -0.37110101
   1.17073215]
 [-0.84488505 -0.8730192   0.04624525 ... -0.20212881 -0.47378505
  -0.87137393]]
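
To show what the scaler computes, here is a small sketch that standardizes a tiny matrix by hand and compares the result with `StandardScaler`. The four rows reuse the Glucose and DiabetesPedigreeFunction values from the first rows shown earlier; the variable names are illustrative. Each value becomes `(x - mean) / std`, computed per column.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two features on very different scales (Glucose, DiabetesPedigreeFunction).
toy = np.array([[148.0, 0.627],
                [85.0, 0.351],
                [183.0, 0.672],
                [89.0, 0.167]])

# Manual z-score: subtract the column mean, divide by the column standard deviation.
# StandardScaler uses the population standard deviation (ddof=0), NumPy's default.
manual_z = (toy - toy.mean(axis=0)) / toy.std(axis=0)

sklearn_z = StandardScaler().fit_transform(toy)
print(np.allclose(manual_z, sklearn_z))
```

After the transform, every column has mean 0 and standard deviation 1, so no single feature dominates because of its units.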


Now our data is ready to be split into training and test sets.


In [13]:
# Split the data into 80% training and 20% test using test_size.
# Y is either 1 or 0; stratifying on Y keeps the same class proportions in both splits.
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, stratify=Y, random_state=2)

In [14]:
print(X.shape, X_train.shape, X_test.shape)

(768, 8) (614, 8) (154, 8)
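
The effect of `stratify` can be checked directly. The sketch below uses placeholder features and a label array with the same class counts as the dataset (500 and 268), so it runs without the CSV; it then compares the share of positive labels in the full data, the training split, and the test split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data with the dataset's class counts (500 negatives, 268 positives).
labels = np.array([0] * 500 + [1] * 268)
features = np.arange(len(labels)).reshape(-1, 1)

_, _, y_tr, y_te = train_test_split(
    features, labels, test_size=0.2, stratify=labels, random_state=2)

# Split sizes, then the fraction of positive labels in each set.
print(len(y_tr), len(y_te))
print(round(labels.mean(), 3), round(y_tr.mean(), 3), round(y_te.mean(), 3))
```

The three fractions should be nearly identical; without `stratify`, a random split could leave the test set with noticeably more or fewer diabetic cases than the training set.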


The fun part: training the model!

In [15]:
classifier= svm.SVC(kernel='linear')
#training
classifier.fit(X_train,Y_train)
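
In the cells above, the scaler was fitted on all 768 rows before the split, so the test rows contributed to the scaling statistics. One common alternative, sketched here on synthetic data rather than the diabetes CSV, is to chain the scaler and the SVM in a scikit-learn pipeline so that the scaler is fitted on the training rows only. The synthetic data from `make_classification` and the variable names are assumptions for illustration; this is not the notebook's own method.

```python
from sklearn import svm
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the diabetes data: 768 rows, 8 features.
X_syn, y_syn = make_classification(n_samples=768, n_features=8, random_state=0)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.2, stratify=y_syn, random_state=2)

# fit() scales using training statistics only, then trains the linear SVM.
model = make_pipeline(StandardScaler(), svm.SVC(kernel='linear'))
model.fit(X_tr, y_tr)

# score() applies the same training-set scaling to the test rows before predicting.
print('test accuracy:', model.score(X_te, y_te))
```

With this layout the test set stays unseen until evaluation, and the same pipeline object can be reused for new inputs without calling the scaler separately.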


Now, we can evaluate our model

In [17]:
# accuracy on the training data
train_predictions = classifier.predict(X_train)
train_accuracy = accuracy_score(Y_train, train_predictions)  # (y_true, y_pred)
print('train_Accuracy is:', train_accuracy)

train_Accuracy is: 0.7866449511400652


In [19]:
# accuracy on the test data
test_predictions = classifier.predict(X_test)
test_accuracy = accuracy_score(Y_test, test_predictions)  # (y_true, y_pred)
print('test_Accuracy is:', test_accuracy)

test_Accuracy is: 0.7727272727272727
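
Accuracy alone can hide how a model treats each class, especially when one class is more common. The sketch below shows how a confusion matrix and recall could complement it. The labels and predictions are made up solely to demonstrate the calls; they are not results from the model above.

```python
from sklearn.metrics import accuracy_score, confusion_matrix, recall_score

# Made-up ground truth and predictions (6 negatives, 4 positives).
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 0, 1, 1, 1, 0, 0]

acc = accuracy_score(y_true, y_pred)
cm = confusion_matrix(y_true, y_pred)   # rows: true class, columns: predicted class
rec = recall_score(y_true, y_pred)      # share of true positives that were found

print(acc)
print(cm)
print(rec)
```

In this toy case the accuracy is 0.7 while recall is only 0.5: half of the positive cases are missed. In the notebook, the same functions could be called with `Y_test` and `test_predictions`.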


Now that our model is trained and evaluated, we can create a predictive system. It takes a person's medical data as input, standardizes it with the scaler fitted earlier, and uses the trained model to predict whether the person is diabetic. The cell below performs these steps for a single example.

In [23]:
input_data = (5,166,72,19,175,25.8,0.587,51)

# changing the input_data to numpy array
new_data= np.asarray(input_data)

# reshape the array as we are predicting for one instance
data_reshaped = new_data.reshape(1,-1)

# standardize the input data
std_data = scalar.transform(data_reshaped)
print(std_data)

prediction = classifier.predict(std_data)
print(prediction)

if prediction[0] == 0:
    print('The person is not diabetic')
else:
    print('The person is diabetic')

[[ 0.3429808   1.41167241  0.14964075 -0.09637905  0.82661621 -0.78595734
   0.34768723  1.51108316]]
[1]
The person is diabetic
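
The same steps can be wrapped in a reusable function. This is a minimal sketch: the name `predict_diabetes` and its signature are hypothetical, and the scaler and classifier in the demo are trained on synthetic data so that the block runs on its own.

```python
import numpy as np
from sklearn import svm
from sklearn.datasets import make_classification
from sklearn.preprocessing import StandardScaler

def predict_diabetes(scaler, classifier, input_data):
    """Standardize one record and return a readable prediction string."""
    # Reshape to (1, n_features) because we predict for a single instance.
    row = np.asarray(input_data, dtype=float).reshape(1, -1)
    label = classifier.predict(scaler.transform(row))[0]
    return 'The person is diabetic' if label == 1 else 'The person is not diabetic'

# Stand-in scaler and model fitted on synthetic 8-feature data (illustration only).
X_syn, y_syn = make_classification(n_samples=200, n_features=8, random_state=0)
demo_scaler = StandardScaler().fit(X_syn)
demo_clf = svm.SVC(kernel='linear').fit(demo_scaler.transform(X_syn), y_syn)

result = predict_diabetes(demo_scaler, demo_clf, X_syn[0])
print(result)
```

In the notebook, the equivalent call would be `predict_diabetes(scalar, classifier, input_data)`, using the scaler and classifier fitted in the earlier cells.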


