<center><h1>Classification Model using Naive Bayes</h1></center>

The dataset used here is for determining whether someone is diabetic or not. It consists of 995 records with two features used for classification:
- ***Glucose***: It states the glucose level of the person.
- ***BloodPressure***: It states the blood pressure of the person.

### Import dependencies

In [3]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn import metrics

import warnings
warnings.filterwarnings("ignore", category=UserWarning)

### Load the CSV file, separate the 'Diabetes' target column, and split into training and test sets

In [4]:
df = pd.read_csv('Naive-Bayes-Classification-Data.csv')
df

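|     | Glucose | BloodPressure | Diabetes |
|-----|---------|---------------|----------|
| 0   | 148     | 72            | 1        |
| 1   | 85      | 66            | 0        |
| 2   | 183     | 64            | 1        |
| 3   | 89      | 66            | 0        |
| 4   | 137     | 40            | 1        |
| ... | ...     | ...           | ...      |
| 990 | 113     | 80            | 0        |
| 991 | 138     | 82            | 0        |
| 992 | 108     | 68            | 0        |
| 993 | 99      | 70            | 0        |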


In [5]:
x = df.drop('Diabetes', axis=1)
y = df['Diabetes']

In [6]:
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2)

In [7]:
x_train

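|     | Glucose | BloodPressure |
|-----|---------|---------------|
| 459 | 134     | 74            |
| 959 | 111     | 90            |
| 984 | 85      | 55            |
| 514 | 99      | 54            |
| 61  | 133     | 72            |
| ... | ...     | ...           |
| 895 | 194     | 68            |
| 576 | 108     | 44            |
| 299 | 112     | 72            |
| 276 | 106     | 60            |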


In [8]:
y_train

459    0
959    0
984    0
514    0
61     1
      ..
895    1
576    0
299    0
276    1
608    0
Name: Diabetes, Length: 796, dtype: int64
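
Because `train_test_split` is called without a `random_state`, each run draws a different partition, so the scores reported further down will vary from run to run. A minimal sketch of a reproducible, class-balanced split is shown here. It uses made-up toy lists rather than the CSV file, and the names `toy_x`, `toy_y` and the seed value 42 are arbitrary assumptions.

```python
from sklearn.model_selection import train_test_split

# Toy stand-in data (hypothetical): 100 samples, 60 of class 0 and 40 of class 1
toy_x = [[i, i % 7] for i in range(100)]
toy_y = [0] * 60 + [1] * 40

# random_state fixes the shuffle; stratify keeps the class ratio in both parts
tx_train, tx_test, ty_train, ty_test = train_test_split(
    toy_x, toy_y, test_size=0.2, random_state=42, stratify=toy_y
)

print('Train size:', len(tx_train), 'Test size:', len(tx_test))
print('Positives in train:', sum(ty_train), 'Positives in test:', sum(ty_test))
```

Whether stratification is wanted depends on the goal; it mainly helps when one class is much rarer than the other.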

### Train model

In [9]:
model = GaussianNB()
model.fit(x_train, y_train)
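
As a rough illustration of what fitting a Gaussian Naive Bayes model involves, the sketch below estimates a prior plus a per-feature mean and variance for each class, then predicts the class with the highest log-posterior. It is a simplified stand-in written in plain Python, not scikit-learn's implementation: the function names `fit_gaussian_nb` and `predict_gaussian_nb`, the toy rows, and the fixed `1e-9` variance floor are all assumptions made for this example.

```python
import math
from collections import defaultdict


def fit_gaussian_nb(rows, labels):
    """Estimate prior, per-feature means and variances for each class."""
    grouped = defaultdict(list)
    for row, label in zip(rows, labels):
        grouped[label].append(row)

    params = {}
    for label, members in grouped.items():
        n = len(members)
        columns = list(zip(*members))
        means = [sum(col) / n for col in columns]
        # Small constant (assumed value) keeps the variance strictly positive
        variances = [
            sum((v - m) ** 2 for v in col) / n + 1e-9
            for col, m in zip(columns, means)
        ]
        params[label] = (n / len(rows), means, variances)
    return params


def predict_gaussian_nb(params, row):
    """Return the class whose log prior plus Gaussian log-likelihoods is largest."""
    def log_posterior(prior, means, variances):
        score = math.log(prior)
        for x, m, var in zip(row, means, variances):
            score += -0.5 * math.log(2 * math.pi * var) - (x - m) ** 2 / (2 * var)
        return score

    return max(params, key=lambda label: log_posterior(*params[label]))


# Made-up [glucose, blood pressure] rows, not taken from the dataset
toy_rows = [[90, 70], [100, 65], [95, 72], [160, 80], [170, 85], [150, 78]]
toy_labels = [0, 0, 0, 1, 1, 1]

toy_params = fit_gaussian_nb(toy_rows, toy_labels)
print(predict_gaussian_nb(toy_params, [98, 68]))   # close to the class 0 rows
print(predict_gaussian_nb(toy_params, [165, 82]))  # close to the class 1 rows
```

The "naive" part is the sum inside `log_posterior`: each feature contributes independently, as if glucose and blood pressure were unrelated within a class.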

### Test and evaluate the model

In [10]:
predict = model.predict(x_test)
predict

array([0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1,
       0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0,
       0, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 1, 0,
       0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0,
       1, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0,
       0, 0, 0, 1, 1, 1, 0, 0, 1, 1, 0, 0, 0, 1, 1, 1, 1, 0, 1, 0, 1, 0,
       1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 1, 0,
       0], dtype=int64)

In [11]:
print(metrics.classification_report(y_test, predict))

              precision    recall  f1-score   support

           0       0.76      0.87      0.81       119
           1       0.76      0.59      0.66        80

    accuracy                           0.76       199
   macro avg       0.76      0.73      0.74       199
weighted avg       0.76      0.76      0.75       199



In [12]:
print('Accuracy Score: {:.1f}%'.format(accuracy_score(y_test, predict) * 100))

Accuracy Score: 75.9%


In [13]:
print('Training Accuracy: {:.1f}%'.format(model.score(x_train, y_train) * 100))
print('Test Accuracy: {:.1f}%'.format(model.score(x_test, y_test) * 100))

Training Accuracy: 73.2%
Test Accuracy: 75.9%


In [14]:
cf = confusion_matrix(y_test, predict)  # rows: actual class, columns: predicted class
print('Confusion Matrix: \n', cf)

Confusion Matrix: 
 [[104  15]
 [ 33  47]]


In [15]:
# For labels 0 and 1, ravel() returns the cells in the order tn, fp, fn, tp
tn, fp, fn, tp = cf.ravel()

print('True Positive: ', tp)
print('False Positive: ', fp)
print('True Negative: ', tn)
print('False Negative: ', fn)

True Positive:  47
False Positive:  15
True Negative:  104
False Negative:  33


In [16]:
# scikit-learn metric functions take the true labels first, then the predictions
recall = metrics.recall_score(y_test, predict) * 100
precision = metrics.precision_score(y_test, predict) * 100
accuracy = metrics.accuracy_score(y_test, predict) * 100
type_1_err = (fp / (fp + tn)) * 100  # false positive rate
type_2_err = (fn / (fn + tp)) * 100  # false negative rate

print('Recall: {:.1f}%'.format(recall))
print('Precision: {:.1f}%'.format(precision))
print('Accuracy: {:.1f}%'.format(accuracy))
print('Type I Error: {:.1f}%'.format(type_1_err))
print('Type II Error: {:.1f}%'.format(type_2_err))

#Formulas:
#recall = (tp / (tp + fn)) * 100
#precision = (tp / (tp + fp)) * 100
#accuracy = ((tp + tn) / (tp + fp + tn + fn)) * 100

Recall: 58.8%
Precision: 75.8%
Accuracy: 75.9%
Type I Error: 12.6%
Type II Error: 41.2%
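
The formulas in the comments can also be checked by hand. The sketch below uses the counts implied by the classification report for the diabetic class (80 actual positives of which 47 are found, and 119 actual negatives of which 104 are found), with class 1 treated as the positive class. The helper name `binary_metrics` is hypothetical and exists only for this example.

```python
def binary_metrics(tn, fp, fn, tp):
    """Return common binary-classification metrics as percentages."""
    total = tn + fp + fn + tp
    return {
        'recall': 100 * tp / (tp + fn),        # share of actual positives found
        'precision': 100 * tp / (tp + fp),     # share of predicted positives that are right
        'accuracy': 100 * (tp + tn) / total,   # share of all predictions that are right
        'type_1_error': 100 * fp / (fp + tn),  # false positive rate
        'type_2_error': 100 * fn / (fn + tp),  # false negative rate
    }


# Counts derived from the report: 119 - 104 = 15 false positives, 80 - 47 = 33 false negatives
scores = binary_metrics(tn=104, fp=15, fn=33, tp=47)
for name, value in scores.items():
    print('{}: {:.1f}%'.format(name, value))
```

Note that recall and the Type II error rate always sum to 100%, since both are computed over the actual positives.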


### Count and print all wrong predictions

In [17]:
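count = 0
# Check every row of the full dataset (training and test rows alike)
for i, row in df.iterrows():
    # Feature values as a single-sample 2D array, the shape predict() expects
    data = row.drop('Diabetes').values.reshape(1, -1) # type: ignore
    prediction = model.predict(data)

    # Report the row when the predicted label differs from the actual label
    if prediction != row['Diabetes']:
        print('Glucose: {}, Blood Pressure: {}, Actual: {}, Prediction {}'.format(row['Glucose'], row['BloodPressure'], row['Diabetes'], prediction))
        count += 1
print('Wrong predictions: {}'.format(count))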

Glucose: 137, Blood Pressure: 40, Actual: 1, Prediction [0]
Glucose: 78, Blood Pressure: 50, Actual: 1, Prediction [0]
Glucose: 115, Blood Pressure: 0, Actual: 0, Prediction [1]
Glucose: 125, Blood Pressure: 96, Actual: 1, Prediction [0]
Glucose: 118, Blood Pressure: 84, Actual: 1, Prediction [0]
Glucose: 107, Blood Pressure: 74, Actual: 1, Prediction [0]
Glucose: 115, Blood Pressure: 70, Actual: 1, Prediction [0]
Glucose: 119, Blood Pressure: 80, Actual: 1, Prediction [0]
Glucose: 125, Blood Pressure: 70, Actual: 1, Prediction [0]
Glucose: 145, Blood Pressure: 82, Actual: 0, Prediction [1]
Glucose: 102, Blood Pressure: 76, Actual: 1, Prediction [0]
Glucose: 90, Blood Pressure: 68, Actual: 1, Prediction [0]
Glucose: 111, Blood Pressure: 72, Actual: 1, Prediction [0]
Glucose: 180, Blood Pressure: 64, Actual: 0, Prediction [1]
Glucose: 159, Blood Pressure: 64, Actual: 0, Prediction [1]
Glucose: 146, Blood Pressure: 56, Actual: 0, Prediction [1]
Glucose: 103, Blood Pressure: 66, Actual: 1
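
The row-by-row loop calls `model.predict` once per record. One possible alternative is to predict the whole frame in a single call and filter with a boolean mask. The sketch below uses a small made-up frame and fits its own `toy_model` so that it runs on its own; only the column names are taken from the dataset, and every value in `toy` is invented.

```python
import pandas as pd
from sklearn.naive_bayes import GaussianNB

# Hypothetical toy records with the same column names as the dataset
toy = pd.DataFrame({
    'Glucose': [90, 100, 95, 160, 170, 150, 155, 105],
    'BloodPressure': [70, 65, 72, 80, 85, 78, 82, 68],
    'Diabetes': [0, 0, 0, 1, 1, 1, 0, 1],
})

features = toy[['Glucose', 'BloodPressure']]
toy_model = GaussianNB().fit(features, toy['Diabetes'])

# One predict() call for all rows, then keep only the mismatches
toy['Prediction'] = toy_model.predict(features)
wrong = toy[toy['Prediction'] != toy['Diabetes']]

print(wrong)
print('Wrong predictions: {}'.format(len(wrong)))
```

Passing a DataFrame with the training column names also avoids the feature-name `UserWarning` that the notebook suppresses at import time.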

### Inference

In [18]:
# Read one sample from the user
glucose = int(input('Enter glucose: '))
bp = int(input('Enter blood pressure: '))

# Build a single-row frame with the same column names used for training
df_sample = pd.DataFrame([[glucose, bp]], columns=['Glucose', 'BloodPressure'])

prediction = model.predict(df_sample)

# predict() returns an array, so compare its single element
if prediction[0] == 0:
    print('Non-diabetic')
else:
    print('Diabetic. Stay away from sweets.')

Non-diabetic
