
**Importing Dependencies**


In [None]:
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn import svm
from sklearn.metrics import accuracy_score

**Data collection and analysis**

In [None]:
diabets_dataset = pd.read_csv("diabetes.csv")
diabets_dataset

,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,0
4,0,137,40,35,168,43.1,2.288,33,1
...,...,...,...,...,...,...,...,...,...
763,10,101,76,48,180,32.9,0.171,63,0
764,2,122,70,27,0,36.8,0.340,27,0
765,5,121,72,23,112,26.2,0.245,30,0
766,1,126,60,0,0,30.1,0.349,47,1


**Getting the statistical measures of the data**

In [None]:
diabets_dataset.describe()

,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
count,768.0,768.0,768.0,768.0,768.0,768.0,768.0,768.0,768.0
mean,3.845052,120.894531,69.105469,20.536458,79.799479,31.992578,0.471876,33.240885,0.348958
std,3.369578,31.972618,19.355807,15.952218,115.244002,7.88416,0.331329,11.760232,0.476951
min,0.0,0.0,0.0,0.0,0.0,0.0,0.078,21.0,0.0
25%,1.0,99.0,62.0,0.0,0.0,27.3,0.24375,24.0,0.0
50%,3.0,117.0,72.0,23.0,30.5,32.0,0.3725,29.0,0.0
75%,6.0,140.25,80.0,32.0,127.25,36.6,0.62625,41.0,1.0
max,17.0,199.0,122.0,99.0,846.0,67.1,2.42,81.0,1.0


In [None]:
diabets_dataset["Outcome"].value_counts()

Outcome,count
0,500
1,268


0 -> Non-Diabetic

1 -> Diabetic
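
Because the classes are imbalanced (500 vs. 268), it helps to know what a trivial model would score before judging any classifier. A minimal sketch using only the counts shown above; the variable names are illustrative:

```python
# Class counts taken from the value_counts() output above
counts = {0: 500, 1: 268}

total = sum(counts.values())
majority_class = max(counts, key=counts.get)

# Accuracy of a model that always predicts the majority class
baseline_accuracy = counts[majority_class] / total

print(f"Majority class: {majority_class}")
print(f"Baseline accuracy: {baseline_accuracy:.3f}")
```

Any accuracy reported later is easier to interpret against this baseline.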

In [14]:
# Features: every column except the label
x = diabets_dataset.drop(columns="Outcome")
# Label: 0 = non-diabetic, 1 = diabetic
y = diabets_dataset["Outcome"]


**Standardize the data**

In [17]:
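scaler = StandardScaler()
# Learn each feature's mean and standard deviation.
# Note: this fits on all 768 rows, so the rows later held out for testing
# also influence the scaling statistics.
scaler.fit(x)
standardized_data = scaler.transform(x)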

In [18]:
x = standardized_data
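
As a rough illustration of what StandardScaler computes for each column, the sketch below standardizes the first five Glucose values from the table above with z = (x - mean) / std. It uses only the standard library. StandardScaler divides by the population standard deviation (ddof=0), which is why `pstdev` is used here; the five-value subset is only for illustration, so the numbers differ from those computed on the full column.

```python
from statistics import fmean, pstdev

# First five Glucose values from the table above (illustrative subset)
glucose = [148, 85, 183, 89, 137]

mean = fmean(glucose)
std = pstdev(glucose)  # population standard deviation (ddof=0)

# z-score: subtract the mean, then divide by the standard deviation
z_scores = [(value - mean) / std for value in glucose]

print([round(z, 3) for z in z_scores])
# After standardization the values have mean ~0 and standard deviation ~1
print(round(fmean(z_scores), 6), round(pstdev(z_scores), 6))
```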

**Split the data into training and testing sets**

In [19]:
# Hold out 20% for testing; stratify=y keeps roughly the same class ratio in both splits
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, stratify=y, random_state=2)

In [25]:
print(x.shape, x_train.shape, x_test.shape)

(768, 8) (614, 8) (154, 8)
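
In this notebook the scaler was fitted on all rows before the split. One common alternative is to fit the scaler on the training rows only, for example by wrapping it in a pipeline. The sketch below is not the notebook's method: it assumes synthetic data from `make_classification` as a stand-in for diabetes.csv, so its scores are not comparable to the ones reported here.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the dataset (assumption: 768 rows, 8 features)
X_demo, y_demo = make_classification(
    n_samples=768, n_features=8, n_informative=4, random_state=2
)

# Split first, so the test rows never influence the scaler
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=2
)

# The pipeline fits StandardScaler on the training rows only,
# then reuses those statistics when scoring the test rows
model = make_pipeline(StandardScaler(), SVC(kernel="linear"))
model.fit(X_tr, y_tr)

print("train accuracy:", model.score(X_tr, y_tr))
print("test accuracy:", model.score(X_te, y_te))
```

A pipeline also applies the same scaling automatically at prediction time, so raw feature values can be passed to `model.predict` directly.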


**Training the Model**

In [27]:
classifier = svm.SVC(kernel='linear')

**svm.SVC:**

* This is the Support Vector Classifier (SVC) from the svm module in scikit-learn.

* An SVM is a supervised learning model commonly used for classification. It looks for the hyperplane that separates the classes with the largest margin, that is, the greatest distance to the nearest training points of each class.

**kernel='linear':**

* The kernel parameter selects the kernel function. A kernel measures the similarity between pairs of points as if they had been mapped into another, often higher-dimensional, feature space, which lets nonlinear kernels such as 'rbf' find a boundary for data that isn’t linearly separable.
* Setting kernel='linear' uses the plain dot product with no such mapping, so the classifier looks for a straight line (in two dimensions) or a hyperplane (here, in the 8-dimensional feature space) to separate the classes.
* A linear kernel is typically used when the data is roughly linearly separable or when simplicity and interpretability are important.
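
To make the idea of a separating hyperplane concrete: a linear classifier predicts from the sign of w·x + b. The weights and bias below are hypothetical values chosen for illustration, not the ones learned in this notebook (for a fitted linear SVC, scikit-learn exposes the learned values as `coef_` and `intercept_`).

```python
import numpy as np

# Hypothetical hyperplane parameters for a 2-feature problem
w = np.array([1.5, -0.5])  # weight vector (normal to the hyperplane)
b = -0.25                  # bias (offset)

def predict_linear(points):
    """Return 1 where w.x + b > 0, else 0."""
    scores = points @ w + b
    return (scores > 0).astype(int)

points = np.array([
    [1.0, 0.0],    # score =  1.25 -> class 1
    [0.0, 1.0],    # score = -0.75 -> class 0
    [-1.0, -1.0],  # score = -1.25 -> class 0
])

print(predict_linear(points))
```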

In [28]:
classifier.fit(x_train, y_train)

**Model Evaluation**

**Accuracy score**

In [36]:
# Accuracy score on the training data
x_train_prediction = classifier.predict(x_train)
# accuracy_score expects (y_true, y_pred)
training_data_accuracy = accuracy_score(y_train, x_train_prediction)
training_data_accuracy

0.7866449511400652

In [38]:
# Accuracy score on the testing data
x_test_prediction = classifier.predict(x_test)
# accuracy_score expects (y_true, y_pred)
testing_data_accuracy = accuracy_score(y_test, x_test_prediction)
testing_data_accuracy

0.7727272727272727

* Accuracy on Training Data: 0.79 (79%)
* Accuracy on Testing Data: 0.77 (77%)

**What These Values Mean**

* Accuracy score: the proportion of correct predictions out of all predictions. An accuracy of 0.79, for example, means that the model correctly predicted 79% of the training data labels.
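
As a small worked example of this definition, the sketch below computes accuracy by hand on made-up labels. The helper name `accuracy` is illustrative; on the same inputs it should agree with scikit-learn's `accuracy_score`. The scores reported above are consistent with 483 correct out of 614 training rows and 119 correct out of 154 test rows.

```python
def accuracy(y_true, y_pred):
    """Fraction of positions where the prediction matches the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

# Made-up labels for illustration: 4 of 5 predictions are correct
y_true = [0, 1, 1, 0, 1]
y_pred = [0, 1, 0, 0, 1]
print(accuracy(y_true, y_pred))

# Counts consistent with the scores reported in this notebook
print(round(483 / 614, 4), round(119 / 154, 4))
```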

**Training Accuracy (0.79):**

* The model was fit to this data, so training accuracy is an optimistic estimate and is usually at least as high as accuracy on unseen data.
* An accuracy of 79% means the linear boundary still misclassifies about one in five training examples.

**Testing Accuracy (0.77):**

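* Testing accuracy is often lower than training accuracy, as it is based on data the model hasn't seen during training.
* A testing accuracy of 77% is well above the roughly 65% that always predicting the majority class would score (500 of 768 records are non-diabetic), so the model has learned something useful, but it still misclassifies about one in four test examples.
* A testing accuracy close to the training accuracy (77% vs. 79%) typically indicates the model isn’t overfitting and is learning patterns that apply generally rather than just memorizing the training data. This comes from a single 80/20 split, so the exact gap may vary with a different random_state.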

**Predicting for new data**

In [52]:
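# Feature order: Pregnancies, Glucose, BloodPressure, SkinThickness, Insulin, BMI, DiabetesPedigreeFunction, Age
# Note: the Glucose value of 890 is far above the dataset maximum of 199
input_data = (1, 890, 66, 23, 94, 28.1, 0.167, 80)

# Convert the input tuple to a NumPy array
input_data_as_numpy_array = np.asarray(input_data)
input_data_as_numpy_array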

array([1.00e+00, 8.90e+02, 6.60e+01, 2.30e+01, 9.40e+01, 2.81e+01,
       1.67e-01, 8.00e+01])

In [53]:
# Reshape to (1, n_features) because we are predicting for a single instance
input_data_reshaped = input_data_as_numpy_array.reshape(1, -1)

# Standardize the input with the scaler fitted earlier
std_data = scaler.transform(input_data_reshaped)

prediction = classifier.predict(std_data)

if prediction[0] == 0:
  print("The person is not Diabetic")
else:
  print("The person is Diabetic")

The person is Diabetic
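
The glucose value of 890 used in this input is far outside the range seen in the data (the describe() output lists a maximum of 199), so the prediction for it is an extrapolation. A minimal sketch of a range check that could run before predicting; `FEATURE_RANGES` and `out_of_range` are hypothetical names, with bounds copied from the min and max rows of describe():

```python
# (min, max) per feature, copied from the describe() output above
FEATURE_RANGES = {
    "Pregnancies": (0.0, 17.0),
    "Glucose": (0.0, 199.0),
    "BloodPressure": (0.0, 122.0),
    "SkinThickness": (0.0, 99.0),
    "Insulin": (0.0, 846.0),
    "BMI": (0.0, 67.1),
    "DiabetesPedigreeFunction": (0.078, 2.42),
    "Age": (21.0, 81.0),
}

def out_of_range(values):
    """Return the names of features whose value falls outside the observed range."""
    flagged = []
    for (name, (low, high)), value in zip(FEATURE_RANGES.items(), values):
        if not low <= value <= high:
            flagged.append(name)
    return flagged

input_data = (1, 890, 66, 23, 94, 28.1, 0.167, 80)
print(out_of_range(input_data))
```

A non-empty result suggests the input should be double-checked (for example, for a typing error) before trusting the model's output.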


