Importing the dependencies

    1. NumPy provides arrays, which make indexing and reshaping the data easy.

    2. pandas provides the DataFrame, which holds the data as a structured table.

    3. StandardScaler : standardizes each feature to zero mean and unit variance.

    4. train_test_split : splits the data into training and test sets.

    5. svm : scikit-learn's Support Vector Machine module.

    6. accuracy_score : computes the fraction of predictions that are correct.


    # Remember: accuracy alone can be misleading, so report it together with the confusion matrix, precision, recall and F1-score.

    Accuracy = (TP+TN)/(TP+FP+FN+TN)

    Precision = (TP)/(TP+FP)

    Recall = (TP)/(TP+FN)

    F1-Score = 2 * (Precision * Recall) / (Precision + Recall)
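The four formulas above can be checked by hand from confusion-matrix counts. The following is a minimal sketch in plain Python; the function name `classification_metrics` and the example counts are made up for illustration and are not results from this notebook's model.

```python
def classification_metrics(tp, fp, fn, tn):
    """Return (accuracy, precision, recall, f1) from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * (precision * recall) / (precision + recall)
    return accuracy, precision, recall, f1


# Hypothetical counts, chosen only to illustrate the arithmetic.
acc, prec, rec, f1 = classification_metrics(tp=30, fp=10, fn=20, tn=40)
print(acc, prec, rec, f1)
```

Note that the sketch does not guard against division by zero, which happens when the model never predicts the positive class (TP + FP = 0) or when there are no positive samples (TP + FN = 0).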

Classification metrics:

    1. accuracy_score
    2. precision
    3. recall
    4. f1-score


Regression Metrics:

    1. mean_squared_error
    2. root_mean_squared_error
    3. mean_absolute_error
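For comparison, the three regression metrics listed above can be written out in a few lines. This is a plain-Python sketch with made-up numbers; the function name `regression_metrics` is hypothetical, and the notebook itself does not use these metrics because the task is classification.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return (mse, rmse, mae) for two equally long sequences of numbers."""
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mse = sum(e * e for e in errors) / len(errors)    # mean squared error
    rmse = math.sqrt(mse)                             # root mean squared error
    mae = sum(abs(e) for e in errors) / len(errors)   # mean absolute error
    return mse, rmse, mae


# Illustrative values only.
mse, rmse, mae = regression_metrics([3.0, 5.0, 2.0, 7.0], [2.0, 5.0, 4.0, 8.0])
print(mse, rmse, mae)
```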

This is a classification problem, because the label takes only the values 0 and 1 (a binary outcome).

The label is the 'Outcome' column: 0 means non-diabetic and 1 means diabetic.


In [1]:
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler   
from sklearn.model_selection import train_test_split
from sklearn import svm     
from sklearn.metrics import accuracy_score

Data Collection and Analysis

    PIMA Diabetes Dataset

In [2]:
# loading the dataset into a pandas DataFrame

diabetes_df = pd.read_csv("/content/diabetes.csv")
diabetes_df.head()

,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,0
4,0,137,40,35,168,43.1,2.288,33,1


In [4]:
# The Outcome column is the label; the other eight columns are features.

diabetes_df.shape

(768, 9)

In [5]:
# getting the statistical measures of the data

diabetes_df.describe()

,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
count,768.0,768.0,768.0,768.0,768.0,768.0,768.0,768.0,768.0
mean,3.845052,120.894531,69.105469,20.536458,79.799479,31.992578,0.471876,33.240885,0.348958
std,3.369578,31.972618,19.355807,15.952218,115.244002,7.88416,0.331329,11.760232,0.476951
min,0.0,0.0,0.0,0.0,0.0,0.0,0.078,21.0,0.0
25%,1.0,99.0,62.0,0.0,0.0,27.3,0.24375,24.0,0.0
50%,3.0,117.0,72.0,23.0,30.5,32.0,0.3725,29.0,0.0
75%,6.0,140.25,80.0,32.0,127.25,36.6,0.62625,41.0,1.0
max,17.0,199.0,122.0,99.0,846.0,67.1,2.42,81.0,1.0


In [6]:
diabetes_df["Outcome"].value_counts()

# 0 --> Non-diabetic
# 1 --> Diabetic

0    500
1    268
Name: Outcome, dtype: int64
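The classes are imbalanced (500 versus 268), which gives a useful reference point for later accuracy scores: a model that always predicts the majority class would already be right about 65% of the time. A small sketch of that arithmetic, using the counts printed above (the variable names are chosen for illustration):

```python
# Class counts copied from the value_counts() output above.
counts = {0: 500, 1: 268}

total = sum(counts.values())
majority_class = max(counts, key=counts.get)

# Accuracy of a trivial model that always predicts the majority class.
baseline_accuracy = counts[majority_class] / total
print(majority_class, round(baseline_accuracy, 3))
```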

In [8]:
diabetes_df.groupby("Outcome").mean()

Outcome,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age
0,3.298,109.98,68.184,19.664,68.792,30.3042,0.429734,31.19
1,4.865672,141.257463,70.824627,22.164179,100.335821,35.142537,0.5505,37.067164


In [9]:
# separating features and labels 
X = diabetes_df.drop(columns=['Outcome'])
Y = diabetes_df['Outcome']

In [14]:
print(X.head())

   Pregnancies  Glucose  BloodPressure  SkinThickness  Insulin   BMI  \
0            6      148             72             35        0  33.6   
1            1       85             66             29        0  26.6   
2            8      183             64              0        0  23.3   
3            1       89             66             23       94  28.1   
4            0      137             40             35      168  43.1   

   DiabetesPedigreeFunction  Age  
0                     0.627   50  
1                     0.351   31  
2                     0.672   32  
3                     0.167   21  
4                     2.288   33  


In [11]:
print(Y.head())

0    1
1    0
2    1
3    0
4    1
Name: Outcome, dtype: int64


In [16]:
# Data standardization: rescale every feature to zero mean and unit variance, so that no feature dominates because of its units

scaler = StandardScaler()

In [17]:
scaler.fit(X)

StandardScaler()

In [18]:
standardized_data = scaler.transform(X)

# Instead of calling fit and transform separately, we can also call scaler.fit_transform(X)
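What StandardScaler does to each column can be reproduced with NumPy: subtract the column mean and divide by the population standard deviation. The sketch below uses only the five Glucose values shown by `head()` as a toy column, so the resulting numbers differ from the notebook's output, which is based on all 768 rows.

```python
import numpy as np

# First five Glucose values from diabetes_df.head(), used only as a toy column.
glucose = np.array([148, 85, 183, 89, 137], dtype=float)

mean = glucose.mean()
std = glucose.std()            # population standard deviation (ddof=0), as StandardScaler uses
z = (glucose - mean) / std     # standardized values

# After standardization the column has mean ~0 and standard deviation ~1.
print(z.mean(), z.std())
```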

In [19]:
print(standardized_data)

[[ 0.63994726  0.84832379  0.14964075 ...  0.20401277  0.46849198
   1.4259954 ]
 [-0.84488505 -1.12339636 -0.16054575 ... -0.68442195 -0.36506078
  -0.19067191]
 [ 1.23388019  1.94372388 -0.26394125 ... -1.10325546  0.60439732
  -0.10558415]
 ...
 [ 0.3429808   0.00330087  0.14964075 ... -0.73518964 -0.68519336
  -0.27575966]
 [-0.84488505  0.1597866  -0.47073225 ... -0.24020459 -0.37110101
   1.17073215]
 [-0.84488505 -0.8730192   0.04624525 ... -0.20212881 -0.47378505
  -0.87137393]]


In [20]:
X = standardized_data
Y = diabetes_df["Outcome"]

In [23]:
print(X.shape)
print(Y.shape)

(768, 8)
(768,)


Train Test Split

In [24]:
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, stratify=Y, random_state=2)

# stratify=Y keeps the proportion of 0s and 1s in Y the same in the training and test sets

# random_state=2 seeds the shuffle, so the same rows land in each set on every run
# a different value (e.g. random_state=1) gives a different, equally valid split
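The effect of `stratify` and `random_state` can be seen on synthetic labels that have the same class counts as the Outcome column. This is a sketch only; the placeholder feature and the variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels with the same class counts as Outcome (500 zeros, 268 ones).
y = np.array([0] * 500 + [1] * 268)
X = np.arange(len(y)).reshape(-1, 1)   # placeholder feature; only the labels matter here

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, stratify=y, random_state=2)

# With stratify=y the share of 1s is nearly identical in all three sets.
print("full :", y.mean())
print("train:", y_tr.mean())
print("test :", y_te.mean())

# The same seed reproduces exactly the same split.
_, X_te_again, _, _ = train_test_split(X, y, test_size=0.2, stratify=y, random_state=2)
print(np.array_equal(X_te, X_te_again))
```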

In [26]:
print(X.shape, X_train.shape, X_test.shape)

(768, 8) (614, 8) (154, 8)
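In the cells above the scaler is fitted on all 768 rows before the split, so the test rows contribute to the mean and standard deviation used for scaling. A common variation, sketched here on synthetic data rather than the diabetes file, is to split first and fit the scaler on the training rows only. The variable names and the random data are assumptions for illustration, not the notebook's own method.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: 200 rows, 3 features, binary labels.
rng = np.random.default_rng(0)
X_raw = rng.normal(loc=100.0, scale=20.0, size=(200, 3))
y = rng.integers(0, 2, size=200)

# 1) Split the raw (unscaled) data first.
X_train_raw, X_test_raw, y_train, y_test = train_test_split(
    X_raw, y, test_size=0.2, stratify=y, random_state=2
)

# 2) Fit the scaler on the training rows only.
scaler = StandardScaler().fit(X_train_raw)

# 3) Apply the same fitted scaler to both sets.
X_train_std = scaler.transform(X_train_raw)
X_test_std = scaler.transform(X_test_raw)

print(X_train_std.mean(axis=0))   # close to 0 for every feature
print(X_test_std.mean(axis=0))    # near 0, but not exactly 0
```

On a dataset of this size the difference in accuracy is usually small, but this ordering keeps the test set strictly unseen during preprocessing.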


Training the model

In [27]:
classifier = svm.SVC(kernel="linear")                  # SVC = Support Vector Classifier

In [29]:
# training the support vector machine classifier

classifier.fit(X_train, Y_train)

SVC(kernel='linear')
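With a linear kernel the fitted classifier is a hyperplane: it exposes the weights as `coef_` and the offset as `intercept_`, and the sign of `w . x + b` decides the class. The sketch below shows this on a tiny, made-up two-feature dataset; `X_toy`, `y_toy` and `clf` are hypothetical names and not part of the notebook.

```python
import numpy as np
from sklearn import svm

# Tiny, linearly separable toy data (hypothetical values).
X_toy = np.array([[-2.0, -1.0], [-1.0, -2.0], [-1.5, -0.5],
                  [1.0, 2.0], [2.0, 1.0], [1.5, 0.5]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

clf = svm.SVC(kernel="linear").fit(X_toy, y_toy)

# Recompute the decision scores by hand from the learned hyperplane.
scores = X_toy @ clf.coef_.T + clf.intercept_

# A positive score means class 1, a negative score means class 0.
manual_pred = (scores.ravel() > 0).astype(int)

print(clf.coef_, clf.intercept_)
print(manual_pred)
```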

Model Evaluation :

Evaluation measures how often the model's predictions match the true labels.


Accuracy Score

In [32]:
# accuracy_score for the training data

X_train_prediction = classifier.predict(X_train)
training_data_accuracy = accuracy_score(Y_train, X_train_prediction)     # argument order: (y_true, y_pred)

In [33]:
print("The accuracy score of the training data : ", training_data_accuracy)     # as a rough rule of thumb, a score above 0.75 is acceptable here

The accuracy score of the training data :  0.7866449511400652


This means the model predicts the correct label for about 79 out of every 100 training samples.

In [34]:
# accuracy_score for the test data

X_test_prediction = classifier.predict(X_test)
test_data_accuracy = accuracy_score(Y_test, X_test_prediction)

In [35]:
print("The accuracy score of the test data : ", test_data_accuracy)     

The accuracy score of the test data :  0.7727272727272727


This is a good result: the test accuracy (about 0.77) is close to the training accuracy (about 0.79), which suggests the model is not over-fitting.

Over-fitting : the model scores very high accuracy on the training data but noticeably lower accuracy on the test data.
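Since accuracy should be read alongside the confusion matrix, precision, recall and F1-score, the sketch below shows how scikit-learn computes them. It uses a short pair of made-up label lists so that it runs on its own; in this notebook one would presumably pass `Y_test` and `X_test_prediction` instead.

```python
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)

# Hypothetical true labels and predictions (1 = diabetic, 0 = non-diabetic).
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 0, 1, 1, 1, 0, 0]

# For binary labels, ravel() yields the counts in the order TN, FP, FN, TP.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tn, fp, fn, tp)

accuracy = accuracy_score(y_true, y_pred)
precision = precision_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred)
print(accuracy, precision, recall, f1)
```

In this toy case the accuracy is 0.7, yet the recall shows that only half of the positive cases are found, which is the kind of gap accuracy alone hides.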

Making a Predictive System

In [38]:
input_data = (11,143,94,33,146,36.6,0.254,51)

# converting the input_data into numpy array
input_data_as_np_array = np.asarray(input_data)

# reshape the array as we are predicting for one instance
  # why reshaping : transform() and predict() expect a 2-D array of shape (n_samples, n_features),
  # so the 8 values must become a single row of shape (1, 8)
input_data_reshaped = input_data_as_np_array.reshape(1, -1)

# standardize the input data
std_data = scaler.transform(input_data_reshaped)
print(std_data)


prediction = classifier.predict(std_data)
print(prediction)

if (prediction[0]==0):
  print("The person is not diabetic.")
else:
  print("The person is diabetic.")

[[ 2.12477957  0.69183807  1.28699125  0.7818138   0.57481223  0.58477051
  -0.65801229  1.51108316]]
[1]
The person is diabetic.
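The reshape step is easiest to understand by looking at the array shapes before and after. A minimal sketch using the same input tuple as the cell above (the names `as_array` and `row` are chosen for illustration):

```python
import numpy as np

input_data = (11, 143, 94, 33, 146, 36.6, 0.254, 51)

as_array = np.asarray(input_data)
print(as_array.shape)    # (8,)   a flat, 1-D array of 8 values

# reshape(1, -1): one row, and as many columns as needed to hold all values.
row = as_array.reshape(1, -1)
print(row.shape)         # (1, 8) one sample with 8 features
```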



Saving the trained model 

In [43]:
import pickle

In [44]:
filename = "diabetes_trained_model.sav"
with open(filename, "wb") as f:         # wb = write the file in binary mode
    pickle.dump(classifier, f)

In [45]:
# loading the saved model

with open("diabetes_trained_model.sav", "rb") as f:         # rb = read the file in binary mode
    loaded_model = pickle.load(f)
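The saved file contains only the classifier, while predictions also need the fitted scaler. In a fresh session the scaler would no longer be in memory, so one option is to pickle both objects together. The sketch below does this on synthetic data; the dictionary keys, the temporary directory and the file name `diabetes_model_bundle.sav` are assumptions for illustration, not part of the notebook.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn import svm
from sklearn.preprocessing import StandardScaler

# Synthetic two-class data standing in for the real features.
rng = np.random.default_rng(0)
X_demo = np.vstack([rng.normal(0.0, 1.0, size=(20, 2)),
                    rng.normal(5.0, 1.0, size=(20, 2))])
y_demo = np.array([0] * 20 + [1] * 20)

scaler = StandardScaler().fit(X_demo)
classifier = svm.SVC(kernel="linear").fit(scaler.transform(X_demo), y_demo)

# Save the scaler and the classifier together in one file (hypothetical file name).
path = os.path.join(tempfile.mkdtemp(), "diabetes_model_bundle.sav")
with open(path, "wb") as f:
    pickle.dump({"scaler": scaler, "classifier": classifier}, f)

# Load the bundle and predict with the restored objects only.
with open(path, "rb") as f:
    bundle = pickle.load(f)

sample = np.array([[5.0, 5.0]])
restored_pred = bundle["classifier"].predict(bundle["scaler"].transform(sample))
original_pred = classifier.predict(scaler.transform(sample))
print(restored_pred, original_pred)
```

Pickle files should only be loaded from trusted sources, and ideally with the same scikit-learn version that created them.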

In [49]:
input_data = (4,110,92,0,0,37.6,0.191,30)

# converting the input_data into numpy array
input_data_as_np_array = np.asarray(input_data)

# reshape the array as we are predicting for one instance
  # why reshaping : transform() and predict() expect a 2-D array of shape (n_samples, n_features),
  # so the 8 values must become a single row of shape (1, 8)
input_data_reshaped = input_data_as_np_array.reshape(1, -1)


# the standardization step cannot be skipped: the model was trained on standardized features,
# so new input must be transformed with the same fitted scaler before predicting

# standardize the input data
std_data = scaler.transform(input_data_reshaped)
print(std_data)


prediction = loaded_model.predict(std_data)
print(prediction)

if (prediction[0]==0):
  print("The person is not diabetic.")
else:
  print("The person is diabetic.")

[[ 0.04601433 -0.34096773  1.18359575 -1.28821221 -0.69289057  0.71168975
  -0.84827977 -0.27575966]]
[0]
The person is not diabetic.

