# Introduction
Diabetes is a condition where the body struggles to control blood sugar levels. If not treated properly, it can cause serious health issues.

In this project, I created a simple machine learning model using Support Vector Machine (SVM) in Python to predict whether a person has diabetes based on their medical information. The inputs include features such as glucose level, BMI, and age.

The project also helps me learn more about how machine learning models work, especially SVM, a popular algorithm for classification tasks like this one.

# Project Objectives

1. To clean, explore, and prepare the dataset for machine learning.
2. To apply the Support Vector Machine (SVM) algorithm to build a classification model and evaluate the model’s performance using accuracy and other metrics.
3. To predict whether a person has diabetes based on medical data.

# Importing Libraries

In [5]:
import pandas as pd  # For handling data in DataFrames (data manipulation and analysis)
import numpy as np  # For numerical operations
import seaborn as sns  # For data visualisation
import matplotlib.pyplot as plt  # For plotting charts
from sklearn.preprocessing import StandardScaler  # For standardizing feature values
from sklearn.model_selection import train_test_split  # For splitting the dataset into train/test sets
from sklearn import svm  # For using Support Vector Machine algorithms
from sklearn.metrics import accuracy_score  # For evaluating the model’s performance

scikit-learn (imported as `sklearn`) provides a wide range of machine learning algorithms, along with utilities for preprocessing, splitting data, and evaluating models, which are the parts used in this project.

# Importing Data

In [13]:
Data= pd.read_csv("C:/Users/hp/Downloads/Diabetes/diabetes.csv")

FileNotFoundError: [Errno 2] No such file or directory: 'C:/Users/hp/Downloads/Diabetes/diabetes.csv'
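The error above comes from a hard-coded absolute path that only exists on one machine. A minimal sketch of a more portable loader follows; the helper name `load_dataset` and the relative `data/diabetes.csv` location are assumptions for illustration, not part of the original notebook.

```python
from pathlib import Path

import pandas as pd


def load_dataset(csv_path):
    """Read the diabetes CSV, failing early with a clear message if it is missing."""
    path = Path(csv_path)
    if not path.is_file():
        raise FileNotFoundError(
            f"Dataset not found at {path.resolve()}; place diabetes.csv there first"
        )
    return pd.read_csv(path)


# Hypothetical location: a "data" folder next to the notebook.
# Data = load_dataset("data/diabetes.csv")
```

Keeping the path relative to the notebook means the same cell can run on another machine as long as the file is copied alongside it.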

In [None]:
Data.head()  # Show the first 5 rows

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,0
4,0,137,40,35,168,43.1,2.288,33,1


In [None]:
Data.shape

(768, 9)

In [None]:
Data.describe()

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
count,768.0,768.0,768.0,768.0,768.0,768.0,768.0,768.0,768.0
mean,3.845052,120.894531,69.105469,20.536458,79.799479,31.992578,0.471876,33.240885,0.348958
std,3.369578,31.972618,19.355807,15.952218,115.244002,7.88416,0.331329,11.760232,0.476951
min,0.0,0.0,0.0,0.0,0.0,0.0,0.078,21.0,0.0
25%,1.0,99.0,62.0,0.0,0.0,27.3,0.24375,24.0,0.0
50%,3.0,117.0,72.0,23.0,30.5,32.0,0.3725,29.0,0.0
75%,6.0,140.25,80.0,32.0,127.25,36.6,0.62625,41.0,1.0
max,17.0,199.0,122.0,99.0,846.0,67.1,2.42,81.0,1.0


In [None]:
Data['Outcome'].value_counts()

Outcome
0    500
1    268
Name: count, dtype: int64
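The classes are imbalanced, which matters when reading accuracy later: a model that always predicts the majority class would already be right about 65% of the time. The short calculation below uses only the counts printed above.

```python
import pandas as pd

# Class counts copied from the value_counts() output above.
counts = pd.Series({0: 500, 1: 268}, name="count")

# Share of each class, and the accuracy of always predicting the larger class.
proportions = counts / counts.sum()
baseline_accuracy = proportions.max()

print(proportions.round(3).to_dict())
print(f"Majority-class baseline accuracy: {baseline_accuracy:.3f}")
```

Any accuracy figure for this dataset is best judged against that baseline rather than against 50%.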

# Missing Value Analysis

In [None]:
# Explore missing values
Data.isnull().sum()

Pregnancies                 0
Glucose                     0
BloodPressure               0
SkinThickness               0
Insulin                     0
BMI                         0
DiabetesPedigreeFunction    0
Age                         0
Outcome                     0
dtype: int64
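`isnull()` finds no missing values, but the `describe()` output earlier shows a minimum of 0 for Glucose, BloodPressure, SkinThickness, Insulin, and BMI, which is not physiologically plausible. These zeros are likely placeholders for missing measurements. The sketch below counts them on the five rows shown by `Data.head()`; treating zero as missing in these five columns is an assumption, and the variable names are illustrative.

```python
import numpy as np
import pandas as pd

# The five rows shown by Data.head() above.
sample = pd.DataFrame({
    "Pregnancies": [6, 1, 8, 1, 0],
    "Glucose": [148, 85, 183, 89, 137],
    "BloodPressure": [72, 66, 64, 66, 40],
    "SkinThickness": [35, 29, 0, 23, 35],
    "Insulin": [0, 0, 0, 94, 168],
    "BMI": [33.6, 26.6, 23.3, 28.1, 43.1],
    "DiabetesPedigreeFunction": [0.627, 0.351, 0.672, 0.167, 2.288],
    "Age": [50, 31, 32, 21, 33],
    "Outcome": [1, 0, 1, 0, 1],
})

# Assumption: a zero in these columns means "not measured", not a real reading.
zero_means_missing = ["Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI"]

# Count the suspicious zeros per column.
zero_counts = (sample[zero_means_missing] == 0).sum()

# Mark them as NaN so that isnull() reports them.
cleaned = sample.copy()
cleaned[zero_means_missing] = cleaned[zero_means_missing].replace(0, np.nan)

print(zero_counts)
print(cleaned.isnull().sum())
```

Running the same check on the full `Data` frame would show how many rows are affected before deciding whether to impute or drop them.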

# Separating Input Features (X) and Target Label (Y)

In [None]:
# Split the data into features (x) and target (y)
# x contains all columns except 'Outcome'
# y is the 'Outcome' column, which is the label (1 = diabetic, 0 = not diabetic)

x=Data.drop(columns='Outcome')
y=Data['Outcome']

In [None]:
x

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age
0,6,148,72,35,0,33.6,0.627,50
1,1,85,66,29,0,26.6,0.351,31
2,8,183,64,0,0,23.3,0.672,32
3,1,89,66,23,94,28.1,0.167,21
4,0,137,40,35,168,43.1,2.288,33
...,...,...,...,...,...,...,...,...
763,10,101,76,48,180,32.9,0.171,63
764,2,122,70,27,0,36.8,0.340,27
765,5,121,72,23,112,26.2,0.245,30
766,1,126,60,0,0,30.1,0.349,47


In [None]:
y

0      1
1      0
2      1
3      0
4      1
      ..
763    0
764    0
765    0
766    1
767    0
Name: Outcome, Length: 768, dtype: int64

In [None]:
# Standardize the input features so they all share the same scale
# SVMs are sensitive to feature scale, so wide-ranging features (e.g. Insulin) would otherwise dominate

scaler=StandardScaler()
scaler.fit(x)


In [None]:
# Apply the standardization to the input features
# This transforms the data so each feature has a mean of 0 and standard deviation of 1

standardized_data=scaler.transform(x)
standardized_data

array([[ 0.63994726,  0.84832379,  0.14964075, ...,  0.20401277,
         0.46849198,  1.4259954 ],
       [-0.84488505, -1.12339636, -0.16054575, ..., -0.68442195,
        -0.36506078, -0.19067191],
       [ 1.23388019,  1.94372388, -0.26394125, ..., -1.10325546,
         0.60439732, -0.10558415],
       ...,
       [ 0.3429808 ,  0.00330087,  0.14964075, ..., -0.73518964,
        -0.68519336, -0.27575966],
       [-0.84488505,  0.1597866 , -0.47073225, ..., -0.24020459,
        -0.37110101,  1.17073215],
       [-0.84488505, -0.8730192 ,  0.04624525, ..., -0.20212881,
        -0.47378505, -0.87137393]])
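To see where these numbers come from, the Glucose value of the first row (148) can be standardized by hand using the mean and standard deviation printed by `describe()`. One detail: `describe()` reports the sample standard deviation (`ddof=1`), while `StandardScaler` divides by the population standard deviation (`ddof=0`), so a small correction is needed. The variable names are illustrative.

```python
import math

n = 768                          # number of rows, from Data.shape
glucose_mean = 120.894531        # from Data.describe()
glucose_sample_std = 31.972618   # from Data.describe(), computed with ddof=1

# StandardScaler uses the population standard deviation (ddof=0).
glucose_pop_std = glucose_sample_std * math.sqrt((n - 1) / n)

# z-score of the first row's Glucose value.
z = (148 - glucose_mean) / glucose_pop_std
print(round(z, 4))
```

The result agrees with the second entry of the first row in the array above (0.84832379) to the precision of the rounded inputs.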

In [None]:
x=standardized_data

In [None]:
# Split the dataset into training and testing sets
# 80% of the data is used for training, and 20% for testing

x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=0.2)


In [None]:
x_train.shape

(614, 8)

In [None]:
x_test.shape

(154, 8)
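Two refinements are worth considering for this step. The split above has no `random_state`, so the accuracy figures change from run to run, and the scaler was fitted on all rows before splitting, so the test rows influenced the scaling statistics. The sketch below shows one possible alternative: a stratified, seeded split with the scaler fitted on the training rows only. It runs on synthetic stand-in data from `make_classification` (not the diabetes dataset), and the seed value and variable names are arbitrary choices.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (NOT the diabetes dataset): 768 rows, 8 features,
# with roughly the same 65/35 class balance.
X_raw, y_all = make_classification(
    n_samples=768, n_features=8, weights=[0.65, 0.35], random_state=0
)

# stratify keeps the class ratio similar in both parts; random_state makes it repeatable.
X_train_raw, X_test_raw, y_train_demo, y_test_demo = train_test_split(
    X_raw, y_all, test_size=0.2, stratify=y_all, random_state=42
)

# Fit the scaler on the training rows only, then apply it to both parts.
train_scaler = StandardScaler().fit(X_train_raw)
X_train_std = train_scaler.transform(X_train_raw)
X_test_std = train_scaler.transform(X_test_raw)

print(X_train_std.shape, X_test_std.shape)
print(round(float(y_train_demo.mean()), 3), round(float(y_test_demo.mean()), 3))
```

With this ordering the test set stays unseen until evaluation, which makes the test accuracy a fairer estimate.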

# Training the Support Vector Machine (SVM) Model

In [None]:
# Create an SVM model with a linear kernel
# Train the model using the training data

clf=svm.SVC(kernel='linear')
clf.fit(x_train,y_train)
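With a linear kernel, the fitted model reduces to one weight per feature plus a bias, and the prediction is the sign of a weighted sum. The sketch below checks this on synthetic stand-in data (not the diabetes dataset); the `_demo` names are illustrative.

```python
import numpy as np
from sklearn import svm
from sklearn.datasets import make_classification

# Synthetic stand-in data (NOT the diabetes dataset): 200 rows, 8 features.
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

clf_demo = svm.SVC(kernel="linear")
clf_demo.fit(X_demo, y_demo)

# For a linear kernel, the model exposes one weight per feature and a bias.
weights = clf_demo.coef_[0]
bias = clf_demo.intercept_[0]

# Decision score = w . x + b; a positive score maps to class 1, otherwise class 0.
manual_scores = X_demo @ weights + bias
manual_labels = (manual_scores > 0).astype(int)

print(np.allclose(manual_scores, clf_demo.decision_function(X_demo)))
print(np.array_equal(manual_labels, clf_demo.predict(X_demo)))
```

On standardized inputs, the size of each weight gives a rough indication of how strongly that feature pushes the prediction toward one class.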

In [None]:
# Predict the output (diabetes or not) for the training data
# Compare predictions with actual labels to check accuracy

x_train_prediction=clf.predict(x_train)
accuracy_score(y_train,x_train_prediction)

0.7801302931596091

In [None]:
# Make predictions on the test data using the trained model
# Then, check how accurate the predictions are by comparing them to the actual test labels

x_test_prediction=clf.predict(x_test)
accuracy_score(y_test,x_test_prediction)

0.7792207792207793
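Accuracy alone hides which kind of mistake the model makes, and with imbalanced classes that distinction matters. A minimal sketch of additional metrics follows, using scikit-learn's `confusion_matrix` and `classification_report` on synthetic stand-in data (not the diabetes dataset), so the printed numbers say nothing about the real model.

```python
from sklearn import svm
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (NOT the diabetes dataset), roughly 65/35 class balance.
X_demo, y_demo = make_classification(
    n_samples=768, n_features=8, weights=[0.65, 0.35], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=0
)

model = svm.SVC(kernel="linear").fit(X_tr, y_tr)
y_pred = model.predict(X_te)

# Rows are true classes, columns are predicted classes: [[TN, FP], [FN, TP]].
cm = confusion_matrix(y_te, y_pred)
tn, fp, fn, tp = cm.ravel()
print(cm)

# Precision, recall, and F1 per class.
print(classification_report(y_te, y_pred, target_names=["class 0", "class 1"]))
```

For a screening task, recall on the positive class (how many true cases are caught) is often more informative than overall accuracy.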

In [None]:
# Create a sample input (e.g. a new patient's data)
input_sample=(5,166,72,19,175,22.7,0.6,51)
# Convert the input to a NumPy array so it can be used by the model
input_np_array=np.asarray(input_sample)
# Reshape the array to match the model's expected input format
input_np_array_reshaped=input_np_array.reshape(1,-1)

In [None]:
# Apply the same standardization to the new input data
# This makes sure the input has the same scale as the data used to train the model

std_data=scaler.transform(input_np_array_reshaped)
std_data



array([[ 0.3429808 ,  1.41167241,  0.14964075, -0.09637905,  0.82661621,
        -1.179407  ,  0.38694877,  1.51108316]])

In [None]:
# Use the trained model to predict whether the person has diabetes (1) or not (0)

prediction=clf.predict(std_data)
prediction

array([1], dtype=int64)

In [None]:
# Check the prediction result and print a meaningful message
if prediction[0] == 1:
    print("person is diabetic")
else:
    print("person is not diabetic")

person is diabetic
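The prediction steps above (build the input, standardize it, predict, print a message) can be wrapped in one function. The sketch below is one way to do it; `predict_patient` and `FEATURES` are hypothetical names, and the scaler and model are fitted on synthetic stand-in data so the block runs on its own. Passing a DataFrame with the original column names also avoids the feature-name warning that recent scikit-learn versions may raise when a scaler fitted on a DataFrame receives a plain array.

```python
import pandas as pd
from sklearn import svm
from sklearn.datasets import make_classification
from sklearn.preprocessing import StandardScaler

FEATURES = ["Pregnancies", "Glucose", "BloodPressure", "SkinThickness",
            "Insulin", "BMI", "DiabetesPedigreeFunction", "Age"]


def predict_patient(values, scaler, model, feature_names=FEATURES):
    """Return a readable label for one patient given raw feature values."""
    if len(values) != len(feature_names):
        raise ValueError(f"expected {len(feature_names)} values, got {len(values)}")
    # One-row DataFrame so the scaler sees the same column names it was fitted on.
    row = pd.DataFrame([values], columns=feature_names)
    label = model.predict(scaler.transform(row))[0]
    return "person is diabetic" if label == 1 else "person is not diabetic"


# Synthetic stand-in training data (NOT the diabetes dataset).
X_fake, y_fake = make_classification(n_samples=200, n_features=8, random_state=0)
X_fake = pd.DataFrame(X_fake, columns=FEATURES)
demo_scaler = StandardScaler().fit(X_fake)
demo_model = svm.SVC(kernel="linear").fit(demo_scaler.transform(X_fake), y_fake)

# Same sample input as above; the label is meaningless here because the model is synthetic.
print(predict_patient((5, 166, 72, 19, 175, 22.7, 0.6, 51), demo_scaler, demo_model))
```

In the notebook itself, the call would pass the fitted `scaler` and `clf` instead of the demo objects.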


# Conclusion

In this project, I built a simple machine learning model using the Support Vector Machine (SVM) algorithm to predict whether a person has diabetes based on medical information.

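The features were standardized, the data was split into training (80%) and test (20%) sets, and a linear-kernel SVM was trained on the training set. The model reached about 78% accuracy on both the training data (0.780) and the test data (0.779), compared with the roughly 65% that always predicting "not diabetic" would achieve (500 of 768 rows). The small gap between training and test accuracy suggests little overfitting, although a single random split of 768 rows is a limited basis for judging generalization.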
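One way to get a steadier estimate than a single split is k-fold cross-validation with the scaler and the SVM combined in a pipeline, so that scaling is re-fitted inside each fold. This is a sketch on synthetic stand-in data (not the diabetes dataset); the choice of 5 folds and the seed are arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data (NOT the diabetes dataset), roughly 65/35 class balance.
X_demo, y_demo = make_classification(
    n_samples=768, n_features=8, weights=[0.65, 0.35], random_state=0
)

# The pipeline fits the scaler on each training fold only, then trains the SVM.
pipeline = make_pipeline(StandardScaler(), SVC(kernel="linear"))

# Five stratified folds, shuffled with a fixed seed for repeatability.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(pipeline, X_demo, y_demo, cv=cv, scoring="accuracy")

print(scores.round(3))
print(f"mean={scores.mean():.3f}, std={scores.std():.3f}")
```

The spread across folds gives a sense of how much a single train/test accuracy figure could vary by chance.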

This project helped me understand key steps in a machine learning workflow, including data preparation, model training, evaluation, and making predictions on new data.

Overall, a model like this could support early screening for diabetes and inform medical decision-making, but it should not replace professional diagnosis.