# Diabetes Prediction

**WORKFLOW:**

1. **Data Collection**: First, we gather the diabetes dataset, which contains medical measurements for each person and a label indicating whether that person has diabetes. We'll use this data to train our machine learning model.

2. **Data Pre-processing**: Before we can use this data to train our model, we need to clean and prepare it. This involves analyzing the data to understand its structure and ensuring that it's in a format suitable for machine learning. Specifically, we'll standardize the data, meaning we'll rescale each medical measurement so that it has a mean of 0 and a standard deviation of 1. This step is important because features measured on a large scale (such as insulin) would otherwise dominate features measured on a small scale (such as the diabetes pedigree function).

3. **Splitting the Data**: After pre-processing, we'll divide the data into two sets: a training set and a test set. The training set will be used to teach the model, while the test set will be used to evaluate how well the model has learned. Evaluating on data the model has not seen during training shows how accurately it can predict diabetes for new patients.

4. **Model Training**: We'll use a Support Vector Machine (SVM) model to classify the data. The classifier learns from the training data to distinguish diabetic from non-diabetic patients. Once trained, it can predict whether a new patient is diabetic based on their medical data.
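
The standardization described in step 2 can be illustrated on a small scale. The sketch below uses a hypothetical five-row sample with two features on very different scales (the array `X_demo` is made up for illustration) and checks that `StandardScaler` matches the manual z-score formula, `(x - mean) / std`.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical mini-sample: two features on very different scales.
X_demo = np.array([
    [148.0, 0.627],
    [85.0, 0.351],
    [183.0, 0.672],
    [89.0, 0.167],
    [137.0, 2.288],
])

# Manual z-score: subtract the column mean, divide by the column standard deviation.
# StandardScaler uses the population standard deviation (ddof=0), NumPy's default.
manual = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0)

scaled = StandardScaler().fit_transform(X_demo)

print(np.allclose(manual, scaled))
print(scaled.mean(axis=0).round(6), scaled.std(axis=0).round(6))
```

After scaling, both columns have mean 0 and standard deviation 1, so neither dominates purely because of its units.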

## Importing the Dependencies

In [1]:
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler # standardizes features to mean 0 and variance 1
from sklearn.model_selection import train_test_split
from sklearn import svm # support vector machine estimators; svm.SVC is the classifier used below
from sklearn.metrics import accuracy_score

## Data Collection and Analysis

In [2]:
# Loading the diabetes dataset into a pandas DataFrame
diabetes_dataset = pd.read_csv('diabetes.csv')

In [3]:
# Printing the first 5 rows of the dataset
diabetes_dataset.head()

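| | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
| 1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |
| 2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 |
| 3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 |
| 4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 |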


**Observation:** column 'Outcome' is the label
- 1 represents that the person is diabetic.
- 0 represents that the person is non-diabetic.

In [4]:
# Number of rows and columns in this dataset
diabetes_dataset.shape

(768, 9)

In [5]:
# Getting the statistical measures of the data
diabetes_dataset.describe()

| | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| count | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 |
| mean | 3.845052 | 120.894531 | 69.105469 | 20.536458 | 79.799479 | 31.992578 | 0.471876 | 33.240885 | 0.348958 |
| std | 3.369578 | 31.972618 | 19.355807 | 15.952218 | 115.244002 | 7.88416 | 0.331329 | 11.760232 | 0.476951 |
| min | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.078 | 21.0 | 0.0 |
| 25% | 1.0 | 99.0 | 62.0 | 0.0 | 0.0 | 27.3 | 0.24375 | 24.0 | 0.0 |
| 50% | 3.0 | 117.0 | 72.0 | 23.0 | 30.5 | 32.0 | 0.3725 | 29.0 | 0.0 |
| 75% | 6.0 | 140.25 | 80.0 | 32.0 | 127.25 | 36.6 | 0.62625 | 41.0 | 1.0 |
| max | 17.0 | 199.0 | 122.0 | 99.0 | 846.0 | 67.1 | 2.42 | 81.0 | 1.0 |
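
The summary reports a minimum of 0 for `Glucose`, `BloodPressure`, `SkinThickness`, `Insulin`, and `BMI`. A reading of 0 is not plausible for measurements such as BMI or blood pressure, so these zeros may be placeholders for missing values. That interpretation is an assumption, not something the dataset states. A minimal sketch for counting such zeros, using a hypothetical five-row frame named `sample` in place of the full dataset:

```python
import pandas as pd

# Hypothetical stand-in for diabetes_dataset (five illustrative rows).
sample = pd.DataFrame({
    "Glucose": [148, 85, 183, 89, 137],
    "BloodPressure": [72, 66, 64, 66, 40],
    "SkinThickness": [35, 29, 0, 23, 35],
    "Insulin": [0, 0, 0, 94, 168],
    "BMI": [33.6, 26.6, 23.3, 28.1, 43.1],
})

# Columns where 0 is assumed to be a placeholder rather than a real measurement.
suspect_columns = ["Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI"]

# Count zero entries per column.
zero_counts = (sample[suspect_columns] == 0).sum()
print(zero_counts)
```

Running the same count on the full `diabetes_dataset` would show how many rows are affected in each column before deciding whether to impute or leave them.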


In [6]:
diabetes_dataset['Outcome'].value_counts()

Outcome
0    500
1    268
Name: count, dtype: int64

0 --> Non-Diabetic \
1 --> Diabetic
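
The classes are imbalanced, which affects how an accuracy figure should be read. As a rough reference point, the sketch below computes the class proportions from the counts above and the accuracy of a trivial classifier that always predicts the majority class.

```python
# Class counts reported by value_counts() above.
class_counts = {0: 500, 1: 268}

total = sum(class_counts.values())
proportions = {label: count / total for label, count in class_counts.items()}

# A trivial classifier that always predicts the most frequent class
# would be correct this often.
majority_label = max(class_counts, key=class_counts.get)
baseline_accuracy = class_counts[majority_label] / total

print(proportions)
print(f"Majority class: {majority_label}, baseline accuracy: {baseline_accuracy:.3f}")
```

Any trained model should be compared against this baseline of roughly 65% rather than against 50%.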

In [7]:
diabetes_dataset.groupby('Outcome').mean()

| Outcome | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age |
|---|---|---|---|---|---|---|---|---|
| 0 | 3.298 | 109.98 | 68.184 | 19.664 | 68.792 | 30.3042 | 0.429734 | 31.19 |
| 1 | 4.865672 | 141.257463 | 70.824627 | 22.164179 | 100.335821 | 35.142537 | 0.5505 | 37.067164 |


In [8]:
# Separating the data and labels
X = diabetes_dataset.drop(columns='Outcome')  # all feature columns; columns= already implies axis=1
Y = diabetes_dataset['Outcome']

In [9]:
print(X)

     Pregnancies  Glucose  BloodPressure  SkinThickness  Insulin   BMI  \
0              6      148             72             35        0  33.6   
1              1       85             66             29        0  26.6   
2              8      183             64              0        0  23.3   
3              1       89             66             23       94  28.1   
4              0      137             40             35      168  43.1   
..           ...      ...            ...            ...      ...   ...   
763           10      101             76             48      180  32.9   
764            2      122             70             27        0  36.8   
765            5      121             72             23      112  26.2   
766            1      126             60              0        0  30.1   
767            1       93             70             31        0  30.4   

     DiabetesPedigreeFunction  Age  
0                       0.627   50  
1                       0.351   31  


In [10]:
print(Y)

0      1
1      0
2      1
3      0
4      1
      ..
763    0
764    0
765    0
766    1
767    0
Name: Outcome, Length: 768, dtype: int64


## Data Standardization

In [11]:
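scaler = StandardScaler()  # rescales each feature to mean 0 and standard deviation 1

In [12]:
scaler.fit(X)  # learn each feature's mean and standard deviation

In [13]:
standardized_data = scaler.transform(X)  # apply the learned scaling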

In [14]:
print(standardized_data)

[[ 0.63994726  0.84832379  0.14964075 ...  0.20401277  0.46849198
   1.4259954 ]
 [-0.84488505 -1.12339636 -0.16054575 ... -0.68442195 -0.36506078
  -0.19067191]
 [ 1.23388019  1.94372388 -0.26394125 ... -1.10325546  0.60439732
  -0.10558415]
 ...
 [ 0.3429808   0.00330087  0.14964075 ... -0.73518964 -0.68519336
  -0.27575966]
 [-0.84488505  0.1597866  -0.47073225 ... -0.24020459 -0.37110101
   1.17073215]
 [-0.84488505 -0.8730192   0.04624525 ... -0.20212881 -0.47378505
  -0.87137393]]


In [15]:
X = standardized_data  # features: the standardized NumPy array
Y = diabetes_dataset['Outcome']  # labels: unchanged by standardization

In [16]:
print(X)
print(Y)

[[ 0.63994726  0.84832379  0.14964075 ...  0.20401277  0.46849198
   1.4259954 ]
 [-0.84488505 -1.12339636 -0.16054575 ... -0.68442195 -0.36506078
  -0.19067191]
 [ 1.23388019  1.94372388 -0.26394125 ... -1.10325546  0.60439732
  -0.10558415]
 ...
 [ 0.3429808   0.00330087  0.14964075 ... -0.73518964 -0.68519336
  -0.27575966]
 [-0.84488505  0.1597866  -0.47073225 ... -0.24020459 -0.37110101
   1.17073215]
 [-0.84488505 -0.8730192   0.04624525 ... -0.20212881 -0.47378505
  -0.87137393]]
0      1
1      0
2      1
3      0
4      1
      ..
763    0
764    0
765    0
766    1
767    0
Name: Outcome, Length: 768, dtype: int64


## Train Test Split

In [17]:
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, 
                                                    test_size=0.2, # 20% of the data is used for testing
                                                    stratify=Y, # Maintain the same class distribution as in Y
                                                    random_state=2) # Set random seed for reproducibility

In [18]:
print(X.shape, X_train.shape, X_test.shape)

(768, 8) (614, 8) (154, 8)
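
To see what `stratify` does, the sketch below repeats the split on a synthetic stand-in with the same shape and class counts as the dataset (768 rows, 8 features, 500 negatives, 268 positives). The feature values are random and the names `X_demo` and `y_demo` are illustrative only.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: same shape and class counts as the dataset, random features.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(768, 8))
y_demo = np.array([0] * 500 + [1] * 268)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=2
)

print(X_tr.shape, X_te.shape)
print("positive rate overall:", round(y_demo.mean(), 3))
print("positive rate train:  ", round(y_tr.mean(), 3))
print("positive rate test:   ", round(y_te.mean(), 3))
```

With stratification, the share of positive labels in the training and test sets stays close to the overall share, so neither split is accidentally dominated by one class.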


## Training the Model

In [19]:
classifier = svm.SVC(kernel='linear')  # support vector classifier with a linear decision boundary

In [20]:
# training the support vector machine classifier
classifier.fit(X_train, Y_train)
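
In this notebook the scaler is fitted on all 768 rows before the split, so the test rows contribute to the means and standard deviations used for scaling. One common alternative is to fit the scaler on the training split only, for example by wrapping both steps in a scikit-learn pipeline. The sketch below shows that pattern on synthetic data from `make_classification`. The data and variable names are assumptions for illustration, not the notebook's results.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the diabetes features and labels (not the real data).
X_demo, y_demo = make_classification(n_samples=768, n_features=8, random_state=0)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=2
)

# The pipeline fits the scaler on the training split only,
# then reuses those statistics whenever it transforms new data.
model = make_pipeline(StandardScaler(), SVC(kernel="linear"))
model.fit(X_tr, y_tr)

fitted_scaler = model.named_steps["standardscaler"]
print(np.allclose(fitted_scaler.mean_, X_tr.mean(axis=0)))
print("test accuracy:", model.score(X_te, y_te))
```

Because the pipeline scales inputs itself, `model.predict` can be called on raw, unscaled feature values.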

## Model Evaluation

### Accuracy Score

In [21]:
# Accuracy score on the training data
X_train_prediction = classifier.predict(X_train)
training_data_accuracy = accuracy_score(Y_train, X_train_prediction)  # argument order: (y_true, y_pred)

In [22]:
print('Accuracy score of the training data : ', training_data_accuracy)

Accuracy score of the training data :  0.7866449511400652


In [23]:
# Accuracy score on the testing data
X_test_prediction = classifier.predict(X_test)
test_data_accuracy = accuracy_score(Y_test, X_test_prediction)  # argument order: (y_true, y_pred)

In [24]:
print('Accuracy score of the test data : ', test_data_accuracy)

Accuracy score of the test data :  0.7727272727272727
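
Accuracy alone does not show which kind of error the model makes. With a medical label, missed diabetic cases (false negatives) may matter more than false alarms. The sketch below computes a confusion matrix, precision, and recall on ten hypothetical labels and predictions. The lists `y_true` and `y_pred` are made up for illustration and are not the notebook's results.

```python
from sklearn.metrics import accuracy_score, confusion_matrix, precision_score, recall_score

# Hypothetical labels and predictions for ten patients (1 = diabetic).
y_true = [0, 0, 0, 0, 1, 1, 1, 0, 1, 0]
y_pred = [0, 0, 1, 0, 1, 0, 1, 0, 1, 0]

# Rows are true classes, columns are predicted classes:
# [[true negatives, false positives],
#  [false negatives, true positives]]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

print("accuracy: ", accuracy_score(y_true, y_pred))
print("precision:", precision_score(y_true, y_pred))
print("recall:   ", recall_score(y_true, y_pred))
print("tn, fp, fn, tp =", tn, fp, fn, tp)
```

The same calls can be applied to `Y_test` and `X_test_prediction` to break the test accuracy down by error type.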


## Making a Predictive System

In [25]:
# Input features in the same order as the training columns (Pregnancies through Age).
# The label is not included because the model will predict it.
input_data = (9, 171, 110, 24, 240, 45.4, 0.721, 54)

# Changing the input_data to numpy array
input_data_as_numpy_array = np.asarray(input_data)

# Reshape the array because the model expects input for one instance (1 row, multiple columns)
input_data_reshaped = input_data_as_numpy_array.reshape(1, -1)

# Standardize the input data so that it has the same scale as the data used to train the model
std_data = scaler.transform(input_data_reshaped)
print(std_data)

# Make a prediction using the classifier
prediction = classifier.predict(std_data)
print(prediction)

# Check the prediction result: 0 means not diabetic, 1 means diabetic
if prediction[0] == 0:  # predict() returns a NumPy array with one entry per input row, so take the first element
    print('The person is not diabetic')
else:
    print('The person is diabetic')

[[1.53084665 1.56815814 2.11415525 0.21726125 1.39100445 1.70165987
  0.75238313 1.76634642]]
[1]
The person is diabetic
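
The prediction steps can be wrapped in a small reusable function. The sketch below is one possible shape for it: `predict_diabetes`, `FEATURES`, and the five-row training frame are hypothetical names and data chosen for illustration. It passes the input as a one-row DataFrame because recent scikit-learn versions warn when a scaler fitted on named columns receives a plain NumPy array.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

FEATURES = ["Pregnancies", "Glucose", "BloodPressure", "SkinThickness",
            "Insulin", "BMI", "DiabetesPedigreeFunction", "Age"]

def predict_diabetes(values, scaler, classifier):
    """Return a readable prediction for one patient (hypothetical helper)."""
    # A one-row DataFrame keeps the column names the scaler was fitted with.
    row = pd.DataFrame([values], columns=FEATURES)
    label = classifier.predict(scaler.transform(row))[0]
    return "The person is diabetic" if label == 1 else "The person is not diabetic"

# Tiny stand-in training set (five illustrative rows, not the full dataset).
train = pd.DataFrame(
    [[6, 148, 72, 35, 0, 33.6, 0.627, 50],
     [1, 85, 66, 29, 0, 26.6, 0.351, 31],
     [8, 183, 64, 0, 0, 23.3, 0.672, 32],
     [1, 89, 66, 23, 94, 28.1, 0.167, 21],
     [0, 137, 40, 35, 168, 43.1, 2.288, 33]],
    columns=FEATURES,
)
labels = [1, 0, 1, 0, 1]

demo_scaler = StandardScaler().fit(train)
demo_classifier = SVC(kernel="linear").fit(demo_scaler.transform(train), labels)

result = predict_diabetes((9, 171, 110, 24, 240, 45.4, 0.721, 54), demo_scaler, demo_classifier)
print(result)
```

In the notebook itself, the function would be called with the fitted `scaler` and `classifier` instead of the demo objects.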


