In [1]:
import numpy as np
import pandas as pd

In [3]:
dataset = pd.read_csv('data6.csv')
dataset.head()


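| Unnamed: 0 | Age | EstimatedSalary | Purchased |
| ---------- | --- | --------------- | --------- |
| 0          | 19  | 19000           | 0         |
| 1          | 35  | 20000           | 0         |
| 2          | 26  | 43000           | 0         |
| 3          | 27  | 57000           | 0         |
| 4          | 19  | 76000           | 0         |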


In [7]:
X = dataset.iloc[:, :-1].values  # all columns except the last one (features)
y = dataset.iloc[:, -1].values   # last column (target)
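
The `head()` output above includes an `Unnamed: 0` column, which typically appears when a CSV was saved together with its row index. If that is the case here (an assumption; the notebook does not say), positional slicing with `iloc[:, :-1]` carries that index into `X` as a third feature. A minimal sketch of selecting the features by name instead, using the five rows shown above as stand-in data:

```python
import pandas as pd

# Stand-in for data6.csv: the five rows shown by dataset.head()
dataset = pd.DataFrame({
    "Unnamed: 0": [0, 1, 2, 3, 4],
    "Age": [19, 35, 26, 27, 19],
    "EstimatedSalary": [19000, 20000, 43000, 57000, 76000],
    "Purchased": [0, 0, 0, 0, 0],
})

# Positional slicing keeps every column except the last, including the saved index
X_positional = dataset.iloc[:, :-1].values

# Selecting by name keeps only the two intended features
X = dataset[["Age", "EstimatedSalary"]].values
y = dataset["Purchased"].values

print(X_positional.shape)  # (5, 3)
print(X.shape, y.shape)    # (5, 2) (5,)
```

Selecting by name also keeps the feature order explicit, which matters later when a single `[age, salary]` pair is passed to the model.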


The next cell splits the dataset into **training** and **testing** sets with `train_test_split`.

* `X` = features (input)
* `y` = target (output)
* `test_size=0.25` means **25% for testing** and **75% for training**
* `random_state=2` ensures you get the **same split every time** for reproducibility

So after this:

* `X_train`, `y_train` → used to **train** the model
* `X_test`, `y_test` → used to **test** the model
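
The effect of `test_size` and `random_state` can be checked directly. The sketch below uses synthetic stand-in arrays (not the notebook's dataset) and shows that two calls with the same `random_state` return the same split:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 100 samples, 2 features
X = np.arange(200).reshape(100, 2)
y = np.arange(100) % 2

# Two calls with the same random_state produce identical splits
X_train_a, X_test_a, y_train_a, y_test_a = train_test_split(
    X, y, test_size=0.25, random_state=2)
X_train_b, X_test_b, y_train_b, y_test_b = train_test_split(
    X, y, test_size=0.25, random_state=2)

print(X_train_a.shape, X_test_a.shape)     # (75, 2) (25, 2)
print(np.array_equal(X_test_a, X_test_b))  # True
```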


In [8]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state = 2)

Gaussian Naive Bayes (`GaussianNB`) is a classification algorithm based on Bayes' Theorem. It assumes that, within each class, the features are independent of one another and each follows a normal (Gaussian) distribution. The next cell does three things:

* imports the `GaussianNB` model from `sklearn.naive_bayes`
* creates an object named `classifier` of the `GaussianNB` class
* trains the model on the training data (`X_train`, `y_train`), which means estimating the mean and variance of each feature for each class, along with the class priors
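
To make the training step concrete, here is a minimal sketch on synthetic data (the arrays and the name `model` are illustrative, not taken from the notebook). It compares the per-class feature means stored by the fitted model with means computed by hand:

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in data: two features, two classes
rng = np.random.default_rng(0)
X0 = rng.normal(loc=[25.0, 30000.0], scale=[5.0, 8000.0], size=(50, 2))
X1 = rng.normal(loc=[45.0, 90000.0], scale=[5.0, 8000.0], size=(30, 2))
X = np.vstack([X0, X1])
y = np.array([0] * 50 + [1] * 30)

model = GaussianNB().fit(X, y)

# The fitted model stores one mean per class and feature in theta_
manual_means = np.array([X[y == c].mean(axis=0) for c in model.classes_])
print(np.allclose(model.theta_, manual_means))  # True

# Class priors are estimated as class frequencies (50/80 and 30/80)
print(model.class_prior_)  # [0.625 0.375]
```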

In [9]:
from sklearn.naive_bayes import GaussianNB
classifier = GaussianNB()
classifier.fit(X_train, y_train)

The next cell uses the trained `GaussianNB` model to predict labels for the test data (`X_test`) and stores them in `y_pred`.

In [10]:
y_pred = classifier.predict(X_test)
y_pred

array([0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 1, 1, 0, 0, 0,
       0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0,
       0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1,
       0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0,
       1, 1, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0])
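
Each predicted label is the class with the highest posterior probability. The sketch below, on synthetic stand-in data, uses `predict_proba` to show those probabilities and checks that `predict` agrees with their argmax:

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in data (not the notebook's dataset)
rng = np.random.default_rng(1)
X = np.vstack([
    rng.normal(loc=[25.0, 30000.0], scale=[5.0, 8000.0], size=(40, 2)),
    rng.normal(loc=[45.0, 90000.0], scale=[5.0, 8000.0], size=(40, 2)),
])
y = np.array([0] * 40 + [1] * 40)

model = GaussianNB().fit(X, y)

proba = model.predict_proba(X[:5])  # one row per sample, one column per class
labels = model.predict(X[:5])

# Each row of probabilities sums to 1, and predict picks the most probable class
print(np.allclose(proba.sum(axis=1), 1.0))
print(np.array_equal(labels, model.classes_[proba.argmax(axis=1)]))
```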

A confusion matrix is a table used to evaluate the performance of a classification model. It shows how many predictions the model got right or wrong, and it breaks them down by class.

|                     | Predicted Positive  | Predicted Negative  |
| ------------------- | ------------------- | ------------------- |
| **Actual Positive** | True Positive (TP)  | False Negative (FN) |
| **Actual Negative** | False Positive (FP) | True Negative (TN)  |
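
Note that scikit-learn arranges its confusion matrix differently from the table above: rows and columns follow the sorted label order, so for labels 0 and 1 the negative class comes first and the layout is `[[TN, FP], [FN, TP]]`. A small sketch with made-up toy labels:

```python
from sklearn.metrics import confusion_matrix

# Toy labels (hypothetical, for illustration only)
y_true = [0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 1, 0, 1, 1, 0, 1]

cm = confusion_matrix(y_true, y_pred)
print(cm)
# [[2 1]
#  [1 3]]

# ravel() flattens row by row, giving TN, FP, FN, TP for labels 0 and 1
tn, fp, fn, tp = cm.ravel()
print(tn, fp, fn, tp)  # 2 1 1 3
```

This is why the unpacking order `tn, fp, fn, tp` is the correct one for `confusion_matrix(...).ravel()` in a binary problem.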

* **Accuracy** is the ratio of correct predictions to all predictions. It shows how often the model is right.
* **Error rate** is the ratio of incorrect predictions to all predictions. It shows how often the model is wrong, and equals 1 minus the accuracy.
* **Precision** is the ratio of correctly predicted positives to all predicted positives. It shows how reliable a positive prediction is.
* **Recall** (also called Sensitivity or True Positive Rate) is the ratio of correctly predicted positives to all actual positives. It shows how well the model finds the actual positives.
* **Specificity** (also called True Negative Rate) is the ratio of correctly predicted negatives to all actual negatives.
* **F1 score** is the harmonic mean of precision and recall.
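
Written in terms of the four confusion-matrix counts, these definitions become the formulas below. Specificity and the F1 score are included because the metrics cell in this notebook also computes them.

$$
\begin{aligned}
\text{Accuracy} &= \frac{TP + TN}{TP + TN + FP + FN} \\
\text{Error rate} &= \frac{FP + FN}{TP + TN + FP + FN} = 1 - \text{Accuracy} \\
\text{Precision} &= \frac{TP}{TP + FP} \\
\text{Recall} &= \frac{TP}{TP + FN} \\
\text{Specificity} &= \frac{TN}{TN + FP} \\
F_1 &= \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}
\end{aligned}
$$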

In [12]:
# Compute confusion matrix
from sklearn.metrics import confusion_matrix
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()

In [13]:
# Calculate metrics (accuracy, specificity and error rate as percentages; the rest as fractions)
accuracy = (tn + tp) * 100 / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1_score = (2 * precision * recall) / (precision + recall)
specificity = tn * 100 / (tn + fp)
error = (fp + fn) * 100 / (tp + tn + fp + fn)

In [14]:
# Display metrics
print(f"Accuracy: {accuracy:.2f}%")
print(f"Precision: {precision:.2f}")
print(f"Recall: {recall:.2f}")
print(f"F1 Score: {f1_score:.2f}")
print(f"Specificity: {specificity:.2f}%")
print(f"Error Rate: {error:.2f}%")

Accuracy: 87.00%
Precision: 0.88
Recall: 0.76
F1 Score: 0.82
Specificity: 93.55%
Error Rate: 13.00%


         Predicted
          0   1
Actual  -----------
   0   | 58 | 4  |  → True Negative (TN), False Positive (FP)
   1   |  9 | 29 |  → False Negative (FN), True Positive (TP)

In [15]:
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)
# Display the confusion matrix (rows are actual classes, columns are predicted classes)
print("Confusion Matrix:")
print(cm)


Confusion Matrix:
[[58  4]
 [ 9 29]]
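
As a cross-check, the reported metrics can be reproduced from these four counts alone. The sketch below rebuilds label vectors with the same counts (the names `y_true` and `y_hat` are illustrative) and evaluates them with scikit-learn's metric functions:

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Counts taken from the confusion matrix printed above
tn, fp, fn, tp = 58, 4, 9, 29

# Rebuild label vectors that reproduce those counts (sample order does not matter)
y_true = np.array([0] * tn + [0] * fp + [1] * fn + [1] * tp)
y_hat = np.array([0] * tn + [1] * fp + [0] * fn + [1] * tp)

print(f"Accuracy:  {accuracy_score(y_true, y_hat):.2f}")   # 0.87
print(f"Precision: {precision_score(y_true, y_hat):.2f}")  # 0.88
print(f"Recall:    {recall_score(y_true, y_hat):.2f}")     # 0.76
print(f"F1 Score:  {f1_score(y_true, y_hat):.2f}")         # 0.82
```

The values match the hand-computed metrics: 87/100 for accuracy, 29/33 for precision, 29/38 for recall and 58/71 for the F1 score.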


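StandardScaler is a preprocessing tool from scikit-learn that standardizes each feature of the dataset: it subtracts the feature's mean and divides by its standard deviation, so that the transformed feature has mean 0 and standard deviation 1. The scaler is fitted on the training set only, and the same statistics are then applied to the test set and to any new input.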
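
A minimal sketch of that transformation, using the Age and EstimatedSalary values from the five rows shown earlier as stand-in data, and comparing `StandardScaler` with the same computation done by hand:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Age and EstimatedSalary from the five rows shown by dataset.head()
X = np.array([
    [19, 19000],
    [35, 20000],
    [26, 43000],
    [27, 57000],
    [19, 76000],
], dtype=float)

sc = StandardScaler()
X_scaled = sc.fit_transform(X)

# The same result by hand: subtract the column mean, divide by the column standard deviation
manual = (X - X.mean(axis=0)) / X.std(axis=0)

print(np.allclose(X_scaled, manual))            # True
print(np.allclose(X_scaled.mean(axis=0), 0.0))  # True
print(np.allclose(X_scaled.std(axis=0), 1.0))   # True
```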

In [19]:
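# Feature scaling: standardize each feature to mean 0 and standard deviation 1
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)  # Fit on the training set, then transform it
X_test = sc.transform(X_test)  # Transform the test set with the training statistics

# The classifier was trained on unscaled features, so refit it on the scaled
# training set before giving it scaled input
classifier.fit(X_train, y_train)
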
# Accept user input for prediction
age = float(input("Enter age: "))
salary = float(input("Enter salary: "))

# Scale the input before making the prediction
scaled_input = sc.transform([[age, salary]])  # Scale the input values
prediction = classifier.predict(scaled_input)  # Make the prediction

print(f"The predicted class for age {age} and salary {salary} is: {prediction[0]}")




The predicted class for age 32.0 and salary 150000.0 is: 1
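
One way to keep scaling and prediction consistent is to bundle both steps in a scikit-learn pipeline, so that the scaler and the classifier are always fitted on the same data and raw input can be passed straight to `predict`. The sketch below is an alternative arrangement, not the notebook's own method, and it uses synthetic stand-in data with illustrative names:

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: columns are age and salary (not the notebook's dataset)
rng = np.random.default_rng(2)
X_train = np.vstack([
    rng.normal(loc=[28.0, 40000.0], scale=[6.0, 10000.0], size=(60, 2)),
    rng.normal(loc=[48.0, 110000.0], scale=[6.0, 10000.0], size=(40, 2)),
])
y_train = np.array([0] * 60 + [1] * 40)

# The pipeline fits the scaler and the classifier together on the same data
model = make_pipeline(StandardScaler(), GaussianNB())
model.fit(X_train, y_train)

# Raw (unscaled) input goes in; the pipeline applies the fitted scaler before predicting
prediction = model.predict([[32.0, 150000.0]])
print(prediction[0])
```

With this arrangement there is no separate `sc.transform` call to forget, and the classifier cannot end up trained on one scale and queried on another.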
