## Diabetes Prediction with Support Vector Machine & Linear Regression Analysis

### Introduction

Predictive analysis is important to identify the likelihood of future outcomes based on given sets of data. This notebook focuses on the Support Vector Machine learning model to predict if a given patient is diabetic. 

The Support Vector Machine (SVM) model was chosen for this analysis because it is well suited to two-group (binary) classification problems. This kind of predictive analysis helps medical professionals address underlying issues that could lead to a confirmed diagnosis of a health condition.

The insights help professionals make informed decisions and allow them to take cautionary and preventative measures that may lower the severity of such diseases.

### Objective
To diagnostically predict if a patient is diabetic based on certain variables such as Age, Number of Pregnancies, Body Mass Index (BMI), Insulin, and Blood Pressure.

### Data Source

This dataset is originally from the National Institute of Diabetes and Digestive and Kidney Diseases (NIDDK). It was retrieved from Kaggle and can be accessed at https://www.kaggle.com/datasets/akshaydattatraykhare/diabetes-dataset

### Importing the Libraries

In [4]:
import numpy as np # linear algebra
import pandas as pd # data processing
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn import svm
from sklearn.metrics import accuracy_score
import warnings
warnings.filterwarnings("ignore")

### Data Collection and Pre-Processing

Using the read_csv function, the data is loaded into a pandas DataFrame called db_data

In [5]:
db_data = pd.read_csv('diabetes.csv')

In [6]:
db_data.head() #To see the first five rows of the data

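| | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
| 1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |
| 2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 |
| 3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 |
| 4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 |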


In [5]:
db_data.tail()  #To see the last five rows of the data

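| | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| 763 | 10 | 101 | 76 | 48 | 180 | 32.9 | 0.171 | 63 | 0 |
| 764 | 2 | 122 | 70 | 27 | 0 | 36.8 | 0.34 | 27 | 0 |
| 765 | 5 | 121 | 72 | 23 | 112 | 26.2 | 0.245 | 30 | 0 |
| 766 | 1 | 126 | 60 | 0 | 0 | 30.1 | 0.349 | 47 | 1 |
| 767 | 1 | 93 | 70 | 31 | 0 | 30.4 | 0.315 | 23 | 0 |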


To better understand our data, we can obtain more information about it using the info function and the columns and shape attributes in Pandas

In [6]:
db_data.info() #info is a method, so it must be called with parentheses to print the column names, dtypes and non-null counts

In [7]:
db_data.columns #To see the column names

Index(['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin',
       'BMI', 'DiabetesPedigreeFunction', 'Age', 'Outcome'],
      dtype='object')

In [8]:
db_data.shape #To get the number of rows and columns

(768, 9)

### Checking for null values

The data has 768 rows and 9 columns. We need to check for null values and handle any we find before the analysis. The isnull function can be used for this, with the following syntax

In [7]:
db_data.isnull().sum()

Pregnancies                 0
Glucose                     0
BloodPressure               0
SkinThickness               0
Insulin                     0
BMI                         0
DiabetesPedigreeFunction    0
Age                         0
Outcome                     0
dtype: int64

There are no null values, so no rows need to be removed. Zeros such as those in the Insulin and SkinThickness columns are not counted as null by this check.
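
A null check only finds values stored as NaN. The sketch below counts zeros in the clinical measurement columns, using the five rows shown by head() earlier. It assumes that a zero in these columns is a placeholder for a missing reading, which the source does not state; the names `sample` and `sample_with_nan` are illustrative only.

```python
import numpy as np
import pandas as pd

# First five rows of the dataset, copied from the head() output above.
sample = pd.DataFrame({
    "Glucose": [148, 85, 183, 89, 137],
    "BloodPressure": [72, 66, 64, 66, 40],
    "SkinThickness": [35, 29, 0, 23, 35],
    "Insulin": [0, 0, 0, 94, 168],
    "BMI": [33.6, 26.6, 23.3, 28.1, 43.1],
})

# Assumption: a zero in these columns means "not measured".
zero_counts = (sample == 0).sum()
print(zero_counts)

# Optionally mark the zeros as NaN so that isnull() can see them.
sample_with_nan = sample.replace(0, np.nan)
print(sample_with_nan.isnull().sum())
```

Whether to impute or drop such values is a modelling decision; the notebook keeps the zeros as they are.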

### Diabetic versus non-diabetic Patients in the dataset
We can check the number of diabetic and non-diabetic patients in the data by using the value_counts function, which counts how often each Outcome value occurs: 1 if the patient is diabetic, and 0 if they are non-diabetic

In [8]:
db_data['Outcome'].value_counts()

0    500
1    268
Name: Outcome, dtype: int64

500 patients are non-diabetic, 268 are diabetic!
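
The two classes are not equally represented, which matters when reading accuracy scores later. As a rough illustration, the counts above can be turned into proportions and into the accuracy of a trivial model that always predicts the majority class. The variable names here are illustrative and not part of the notebook.

```python
import pandas as pd

# Class counts copied from the value_counts() output above.
counts = pd.Series({0: 500, 1: 268}, name="Outcome")

# Share of each class in the dataset.
proportions = counts / counts.sum()
print(proportions.round(3))

# Accuracy of a trivial model that always predicts the majority class (0).
baseline_accuracy = counts.max() / counts.sum()
print(f"Majority-class baseline accuracy: {baseline_accuracy:.1%}")
```

Any classifier trained on this data should be compared against that baseline rather than against 50%.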

### Dependent and Independent Variables
Since the goal of this analysis is to determine if a given patient is diabetic or not, we can drop the 'Outcome' column (dependent variable) and use the model to predict it from the independent variables. The drop function in pandas would be used to remove the desired column

In [9]:
X = db_data.drop(columns='Outcome') #Independent variables (axis=1 is redundant when columns= is given)
Y = db_data['Outcome'] #Dependent variable

Using the head function to see the first 5 rows of the modified data

In [10]:
X.head()

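| | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age |
|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 |
| 1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 |
| 2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 |
| 3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 |
| 4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 |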


In [13]:
Y.head()

0    1
1    0
2    1
3    0
4    1
Name: Outcome, dtype: int64

Here, we will use the scikit-learn library for data standardization. The StandardScaler object is stored in a variable called SC

In [11]:
SC = StandardScaler()
SC.fit(X) #learns the mean and standard deviation of each column of X

Next, we transform the data with the fitted scaler so that every feature has zero mean and unit variance

In [12]:
standardize_data = SC.transform(X)
standardize_data 

array([[ 0.63994726,  0.84832379,  0.14964075, ...,  0.20401277,
         0.46849198,  1.4259954 ],
       [-0.84488505, -1.12339636, -0.16054575, ..., -0.68442195,
        -0.36506078, -0.19067191],
       [ 1.23388019,  1.94372388, -0.26394125, ..., -1.10325546,
         0.60439732, -0.10558415],
       ...,
       [ 0.3429808 ,  0.00330087,  0.14964075, ..., -0.73518964,
        -0.68519336, -0.27575966],
       [-0.84488505,  0.1597866 , -0.47073225, ..., -0.24020459,
        -0.37110101,  1.17073215],
       [-0.84488505, -0.8730192 ,  0.04624525, ..., -0.20212881,
        -0.47378505, -0.87137393]])
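
To make the transformation concrete, the sketch below standardizes two columns by hand with z = (x - mean) / std and compares the result with StandardScaler. It uses only the Glucose and Age values from the first five rows shown earlier, so the numbers will not match the full-dataset output above; the array name `features` is illustrative.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Glucose and Age from the first five rows shown earlier.
features = np.array([
    [148.0, 50.0],
    [85.0, 31.0],
    [183.0, 32.0],
    [89.0, 21.0],
    [137.0, 33.0],
])

# z = (x - mean) / std, computed per column with the population std (ddof=0).
manual = (features - features.mean(axis=0)) / features.std(axis=0)

# StandardScaler applies the same formula column by column.
scaled = StandardScaler().fit_transform(features)

print(np.allclose(manual, scaled))
print(scaled.mean(axis=0).round(6), scaled.std(axis=0).round(6))
```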

X now holds the standardized data, and Y remains the Outcome we are trying to predict

In [13]:
X = standardize_data
X

array([[ 0.63994726,  0.84832379,  0.14964075, ...,  0.20401277,
         0.46849198,  1.4259954 ],
       [-0.84488505, -1.12339636, -0.16054575, ..., -0.68442195,
        -0.36506078, -0.19067191],
       [ 1.23388019,  1.94372388, -0.26394125, ..., -1.10325546,
         0.60439732, -0.10558415],
       ...,
       [ 0.3429808 ,  0.00330087,  0.14964075, ..., -0.73518964,
        -0.68519336, -0.27575966],
       [-0.84488505,  0.1597866 , -0.47073225, ..., -0.24020459,
        -0.37110101,  1.17073215],
       [-0.84488505, -0.8730192 ,  0.04624525, ..., -0.20212881,
        -0.47378505, -0.87137393]])

In [14]:
Y = db_data['Outcome']
Y

0      1
1      0
2      1
3      0
4      1
      ..
763    0
764    0
765    0
766    1
767    0
Name: Outcome, Length: 768, dtype: int64

### Data Splitting (Train Test Split)
Splitting the data into training and test sets, using 80% of our data for training and 20% for testing

In [34]:
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=2) #test_size=0.2 keeps 80% of our data for training
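
As a quick check of what an 80/20 split produces, the sketch below splits stand-in data with the same shape and class counts as the dataset (768 rows, 500 zeros, 268 ones). The feature values are random, and the `stratify` argument is an optional extra that the notebook itself does not use; it keeps the class proportions similar in both sets.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in data with the same shape and class counts as the dataset;
# the feature values are random and carry no meaning.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(768, 8))
y_demo = np.array([0] * 500 + [1] * 268)

# stratify=y_demo is an assumption here, not part of the original split.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=2
)

print(X_tr.shape, X_te.shape)  # 614 training rows and 154 test rows
print(round(y_tr.mean(), 3), round(y_te.mean(), 3))  # share of diabetic cases
```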

### Support Vector Machine Classifier
The model is trained by fitting a Support Vector Machine classifier with a linear kernel to the training X and Y values

In [35]:
classifier = svm.SVC(kernel='linear')
classifier.fit(X_train, Y_train)
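
In this notebook the scaler was fitted on all rows before the split, so the test rows influence the scaling statistics. One possible alternative, sketched here on synthetic data rather than on diabetes.csv, is to put the scaler and the classifier in a scikit-learn pipeline so that the scaler is fitted on the training rows only. The names `X_demo`, `y_demo` and `model` are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the diabetes features (8 columns, binary target).
X_demo, y_demo = make_classification(
    n_samples=768, n_features=8, n_informative=5, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=2
)

# The scaler is fitted inside the pipeline, on the training rows only.
model = make_pipeline(StandardScaler(), SVC(kernel="linear"))
model.fit(X_tr, y_tr)

print(f"Test accuracy on synthetic data: {model.score(X_te, y_te):.3f}")
```

The printed score describes the synthetic data only and says nothing about the diabetes dataset.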

The model can then be evaluated for accuracy on both the training and test data

In [36]:
X_train_pred = classifier.predict(X_train)
train_data_acc = accuracy_score(Y_train, X_train_pred) #accuracy_score expects (y_true, y_pred)
print(f'The accuracy score for the trained model is {(round(train_data_acc,3)*100)}%')

The accuracy score for the trained model is 77.2%


In [37]:
X_test_pred = classifier.predict(X_test)
test_data_acc = accuracy_score(Y_test, X_test_pred) #accuracy_score expects (y_true, y_pred)
print(f'The accuracy score for the test model is {(round(test_data_acc,3)*100)}%') 

The accuracy score for the test model is 76.6%
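
Accuracy alone does not show which kind of error the model makes. A confusion matrix separates missed diabetic cases from false alarms. The sketch below uses ten made-up labels, not the notebook's predictions, purely to show how the numbers relate.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical labels for ten patients (not taken from the notebook's results).
y_true = [1, 0, 1, 0, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 0, 0, 1, 0, 1, 0, 0, 0]

# Rows are true classes, columns are predicted classes, in the order [0, 1].
cm = confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = cm.ravel()

print(cm)
print(f"accuracy = {accuracy_score(y_true, y_pred):.2f}")      # 0.70
print(f"recall for diabetic class = {tp / (tp + fn):.2f}")     # 0.50
```

In this toy case 70% of predictions are correct, yet half of the diabetic patients are missed, which is the kind of detail a single accuracy figure hides.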


### Model Prediction
With the creation of the model, we can try the first two rows of our original data to see if the model gives the correct prediction. The data will be converted to a numpy array, reshaped, and standardized for accurate prediction

In [19]:
db_input_data = (6,148,72,35,0,33.6,0.672,50) #First row of the original dataset, except that DiabetesPedigreeFunction is 0.627 in the dataset and was entered here as 0.672
db_input_data_as_array = np.asarray(db_input_data)
db_input_data_reshaped = db_input_data_as_array.reshape(1, -1)
standardized_data = SC.transform(db_input_data_reshaped)
print(standardized_data)

[[ 0.63994726  0.84832379  0.14964075  0.90726993 -0.69289057  0.20401277
   0.60439732  1.4259954 ]]


In [20]:
prediction = classifier.predict(standardized_data) #plugging the values into the model
print(prediction)

[1]


The prediction for the above data sample is 1 (diabetic). This is consistent with the outcome of the first row in the dataset. We can use the 'if' and 'else' statements to print the result in words

In [21]:
if (prediction[0] == 0):
    print('This patient is not diabetic')
else:
    print('This patient is Diabetic. Requires further evaluation') 

This patient is Diabetic. Requires further evaluation


The same procedure can be done with the second row in the dataset

In [22]:
db_input_data = (1,85,66,29,0,26.6,0.351,31)
db_input_data_as_array = np.asarray(db_input_data)
db_input_data_reshaped = db_input_data_as_array.reshape(1, -1)
standardized_data = SC.transform(db_input_data_reshaped)
print(standardized_data) 

[[-0.84488505 -1.12339636 -0.16054575  0.53090156 -0.69289057 -0.68442195
  -0.36506078 -0.19067191]]


In [23]:
prediction = classifier.predict(standardized_data)
print(prediction)

[0]


In [24]:
if (prediction[0] == 0):
    print('This patient is not diabetic')
else:
    print('This patient is Diabetic. Requires further evaluation') 

This patient is not diabetic
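
The two predictions above repeat the same steps, so they could be wrapped in a small helper. The sketch below is one way to do that. It trains on only the five rows shown by head() earlier so that it runs on its own, and `describe_patient`, `FEATURES`, `scaler` and `clf` are hypothetical names, not part of the notebook. Building a one-row DataFrame keeps the column names that the scaler saw during fitting.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

FEATURES = ["Pregnancies", "Glucose", "BloodPressure", "SkinThickness",
            "Insulin", "BMI", "DiabetesPedigreeFunction", "Age"]

# Tiny illustrative training set: the five rows shown by head() earlier.
train = pd.DataFrame([
    [6, 148, 72, 35, 0, 33.6, 0.627, 50],
    [1, 85, 66, 29, 0, 26.6, 0.351, 31],
    [8, 183, 64, 0, 0, 23.3, 0.672, 32],
    [1, 89, 66, 23, 94, 28.1, 0.167, 21],
    [0, 137, 40, 35, 168, 43.1, 2.288, 33],
], columns=FEATURES)
labels = [1, 0, 1, 0, 1]

scaler = StandardScaler().fit(train)
clf = SVC(kernel="linear").fit(scaler.transform(train), labels)

def describe_patient(values):
    """Hypothetical helper: scale one patient's values and report the prediction."""
    row = pd.DataFrame([values], columns=FEATURES)  # keeps the feature names
    label = clf.predict(scaler.transform(row))[0]
    if label == 0:
        return "This patient is not diabetic"
    return "This patient is Diabetic. Requires further evaluation"

print(describe_patient((1, 85, 66, 29, 0, 26.6, 0.351, 31)))
```

With the notebook's own objects, the same function body would use SC and classifier in place of the stand-ins.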


### Insights
- The accuracy scores on the training and test sets were 77.2% and 76.6% respectively. The model was tested with two samples, and the same outcomes as in the original dataset were obtained.

- While this model seems to predict the outcome as expected, higher accuracy scores may be achievable, for example by changing the sizes of the training and test data.
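
A score from a single split depends on which rows happen to land in the test set. One way to get a steadier estimate, shown here as a sketch on synthetic data rather than on diabetes.csv, is k-fold cross-validation. The choice of 5 folds and the names `X_demo`, `y_demo` and `model` are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the diabetes features (8 columns, binary target).
X_demo, y_demo = make_classification(
    n_samples=768, n_features=8, n_informative=5, random_state=0
)

# Scaling happens inside each fold, so no test fold leaks into the scaler.
model = make_pipeline(StandardScaler(), SVC(kernel="linear"))
scores = cross_val_score(model, X_demo, y_demo, cv=5, scoring="accuracy")

print(scores.round(3))
print(f"mean = {scores.mean():.3f}, std = {scores.std():.3f}")
```

The spread across folds gives a sense of how much an accuracy figure can move from one split to another.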

### Next Steps

The next steps for the analysis would be to use other prediction models, such as the Random Forest Classifier and Logistic Regression, to see how well they can predict the outcome of this dataset.
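
A minimal sketch of how such a comparison might be set up is given below. It runs on synthetic data, so the printed scores say nothing about the diabetes dataset. The split settings mirror the ones used earlier, while the hyperparameters (`max_iter=1000`, `n_estimators=100`) and the `candidates` dictionary are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the diabetes features (8 columns, binary target).
X_demo, y_demo = make_classification(
    n_samples=768, n_features=8, n_informative=5, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=2
)

candidates = {
    "SVM (linear)": make_pipeline(StandardScaler(), SVC(kernel="linear")),
    "Logistic Regression": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000)
    ),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

# Fit each model on the same training rows and score it on the same test rows.
results = {}
for name, model in candidates.items():
    model.fit(X_tr, y_tr)
    results[name] = accuracy_score(y_te, model.predict(X_te))
    print(f"{name}: {results[name]:.3f}")
```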