# K-Nearest Neighbors (KNN)

**K-Nearest Neighbors (KNN)** is a simple supervised machine learning algorithm used mainly for classification tasks.  
It classifies new data points based on feature similarity, by comparing them with the nearest examples in the training set.

## Overview
- KNN is based on the assumption that similar data points exist close to each other in feature space.  
- It stores all training data and classifies new samples according to the majority label among their nearest neighbors.  
- The parameter **k** determines how many neighbors are considered during classification.  
- Choosing a good value of **k** (a form of **hyperparameter tuning**) is important for good accuracy.  
  - A common rule of thumb is **k ≈ √n**, where *n* is the total number of data points.  
  - For binary classification, use an **odd value of k** to avoid tied votes between the two classes.
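
As a rough illustration of the two rules of thumb above, the sketch below picks an odd k near √n. The helper name `suggest_k` is hypothetical, and the result is only a starting point for tuning, not a guaranteed best value.

```python
import math

def suggest_k(n_samples: int) -> int:
    """Return an odd k close to sqrt(n_samples) as a starting point."""
    k = max(1, round(math.sqrt(n_samples)))
    # Step down to the nearest odd number so a two-class vote cannot tie.
    if k % 2 == 0:
        k -= 1
    return max(1, k)

print(suggest_k(154))  # sqrt(154) is about 12.41, so this prints 11
print(suggest_k(100))  # sqrt(100) is 10, so this prints 9
```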

## When to Use KNN
- The data is **labeled**.  
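- The dataset is **relatively noise-free**, since mislabeled or outlying neighbors directly affect the vote.  
- The dataset is **small**, since KNN is a *lazy learner*: it does not build an explicit model, so every prediction compares the new sample against the stored training data.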

## How KNN Works
1. Choose the number of neighbors (**k**).  
2. Calculate the distance between the new data point and every training point, commonly the **Euclidean distance**. For two points (x, y) and (a, b) in two dimensions:  
   \[
   d = \sqrt{(x - a)^2 + (y - b)^2}
   \]
3. Select the **k** nearest neighbors.  
4. Perform **majority voting** among them to assign the class label to the new sample.
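
The four steps above can be sketched from scratch with NumPy. This is a minimal illustration rather than how scikit-learn implements KNN; the function name `knn_predict` and the toy data are made up for the example.

```python
from collections import Counter
import numpy as np

def knn_predict(X_train, y_train, x_new, k=3):
    """Classify x_new by majority vote among its k nearest training points."""
    X_train = np.asarray(X_train, dtype=float)
    x_new = np.asarray(x_new, dtype=float)
    # Step 2: Euclidean distance from x_new to every training point.
    distances = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
    # Step 3: indices of the k smallest distances.
    nearest = np.argsort(distances)[:k]
    # Step 4: majority vote among the neighbors' labels.
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Toy data (made up for illustration): two clusters in 2-D.
X_toy = [[1.0, 1.0], [1.5, 2.0], [2.0, 1.5], [6.0, 6.0], [6.5, 7.0], [7.0, 6.5]]
y_toy = ["A", "A", "A", "B", "B", "B"]

print(knn_predict(X_toy, y_toy, [2.0, 2.0], k=3))  # A
print(knn_predict(X_toy, y_toy, [6.0, 7.0], k=3))  # B
```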

## Recap
- A positive integer **k** and a new sample are given.  
- Select the **k closest entries** from the dataset.  
- The **most common class** among them becomes the predicted label for the new sample.


# K-Nearest Neighbors (KNN) – Predict Diabetes

## Objective
The objective of this project is to predict whether a person is likely to be diagnosed with diabetes, using the **K-Nearest Neighbors (KNN)** classification algorithm.

This model uses patient health data (such as glucose level, BMI, age, etc.) to learn patterns associated with diabetes and classify new cases based on similarity to known data.

In [None]:
# Import libraries
import pandas as pd
import numpy as np

# Model and preprocessing
from sklearn.model_selection import train_test_split      # Split dataset into training and testing sets
from sklearn.preprocessing import StandardScaler          # Standardize features (zero mean, unit variance)
from sklearn.neighbors import KNeighborsClassifier        # KNN model

# Evaluation metrics
from sklearn.metrics import confusion_matrix, f1_score, accuracy_score


In [11]:
dataset = pd.read_csv('/content/diabetes.csv')
dataset

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,0
4,0,137,40,35,168,43.1,2.288,33,1
...,...,...,...,...,...,...,...,...,...
763,10,101,76,48,180,32.9,0.171,63,0
764,2,122,70,27,0,36.8,0.340,27,0
765,5,121,72,23,112,26.2,0.245,30,0
766,1,126,60,0,0,30.1,0.349,47,1


## Data Cleaning

In the dataset, certain features such as **Glucose**, **BloodPressure**, **SkinThickness**, **Insulin**, and **BMI** cannot realistically have a value of zero.  
A zero in these columns stands for a missing measurement, and leaving it in place would distort the distance calculations and hurt model performance.  
To handle this, we replace zero values in these columns with the **mean** of the column's non-zero values.
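
Before replacing anything, it can help to see how many zeros each column contains and what the mean looks like once they are excluded. The sketch below uses a small made-up frame that borrows three column names from the dataset; the values are illustrative only.

```python
import numpy as np
import pandas as pd

# Toy frame (values made up) using column names from the dataset.
toy = pd.DataFrame({
    "Glucose": [148, 0, 183, 89],
    "BloodPressure": [72, 66, 0, 0],
    "BMI": [33.6, 26.6, 23.3, 0.0],
})

# Count how many entries in each column are exactly zero.
zero_counts = (toy == 0).sum()
print(zero_counts)

# Mean of Glucose with zeros treated as missing: (148 + 183 + 89) / 3
glucose_mean = toy["Glucose"].replace(0, np.nan).mean()
print(glucose_mean)  # 140.0
```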

In [10]:
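# Replace invalid zeros with the column mean (computed over the non-zero values)
zero_not_accepted = ['Glucose', 'BloodPressure', 'SkinThickness', 'BMI', 'Insulin']
for column in zero_not_accepted:
    # Mark zeros as missing so they do not pull the mean down
    dataset[column] = dataset[column].replace(0, np.nan)
    # Note: int() truncates the mean to a whole number
    mean = int(dataset[column].mean(skipna=True))
    dataset[column] = dataset[column].replace(np.nan, mean)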

## Split Dataset

Next, we split the dataset into **training** and **testing** sets.  
This ensures that the model is trained on one portion of the data and evaluated on unseen data to assess its performance.

- **X (features):** all columns except the target variable  
- **y (target):** the outcome column  
- **Train/Test ratio:** 80% training and 20% testing
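
As a small illustration of the 80/20 ratio, the sketch below splits a made-up array of 50 samples and prints the resulting shapes. The variable names and data are assumptions for the example and are unrelated to the diabetes dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data (made up): 50 samples, 3 features, binary labels.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 3))
y_demo = np.array([0, 1] * 25)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)
print(X_tr.shape, X_te.shape)  # (40, 3) (10, 3)
```

Fixing `random_state` makes the shuffle reproducible, so the same rows land in each set on every run.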

In [26]:
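# Split dataset into features (X) and target (y), selecting columns by name
feature_columns = ['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness',
                   'Insulin', 'BMI', 'DiabetesPedigreeFunction', 'Age']
x = dataset[feature_columns]   # The eight feature columns
y = dataset['Outcome']         # Target column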

# Split into training and testing sets (80/20)
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, random_state=0
)

## Feature Scaling

KNN is a distance-based algorithm, meaning it calculates the distance between data points to determine similarity.  
If features have different scales (e.g., *Age* in years vs. *Glucose* in mg/dL), the larger values can dominate the distance calculation.

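**Rule of thumb:**  
Any algorithm that **computes distances** (like KNN, SVM, or K-Means) is sensitive to feature scale and generally needs feature scaling. Scaling is also standard practice before PCA, which is driven by feature variance, and before regularized Logistic Regression, whose penalty depends on coefficient magnitude.

We will use **StandardScaler** to standardize the features so that each has a mean of 0 and a standard deviation of 1. The scaler is fit on the training set only and then applied to the test set, so no information from the test data leaks into training.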
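
To see why scaling matters for a distance-based method, the sketch below compares Euclidean distances before and after standardization. The data and the helper `dist` are made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy data (made up): column 0 is on a small scale, column 1 on a large one.
X_demo = np.array([
    [0.2, 100.0],
    [0.9, 110.0],
    [0.3, 300.0],
    [0.8, 290.0],
])

def dist(a, b):
    """Euclidean distance between two 1-D arrays."""
    return float(np.linalg.norm(a - b))

# Raw distances are driven almost entirely by column 1.
print(dist(X_demo[0], X_demo[1]), dist(X_demo[0], X_demo[2]))

X_scaled = StandardScaler().fit_transform(X_demo)
# Each column now has mean 0 and standard deviation 1.
print(X_scaled.mean(axis=0).round(6), X_scaled.std(axis=0).round(6))

# After scaling, both columns contribute on a comparable scale.
print(dist(X_scaled[0], X_scaled[1]), dist(X_scaled[0], X_scaled[2]))
```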


In [15]:
# Feature scaling: fit on the training set, then apply the same transform to the test set
sc_x = StandardScaler()
x_train = sc_x.fit_transform(x_train)
x_test = sc_x.transform(x_test)

## Model Definition and Training

In this step, we initialize and train the **K-Nearest Neighbors (KNN)** classifier.

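- **n_neighbors = 11:** the number of nearest neighbors (k). The square root of the test-set size is about 12.4 (computed in the next cell), and 11 is the nearest odd number below it. Note that the rule of thumb is applied here to the test-set size.  
- **p = 2:** the power parameter of the Minkowski metric; `p = 2` corresponds to **Euclidean distance**.  
- **metric = 'euclidean':** selects Euclidean distance explicitly, which makes `p = 2` redundant here.  

Since all features are numeric and have been standardized, Euclidean distance is a reasonable measure of similarity between data points.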
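
As a quick check on how `p` and `metric` relate, the sketch below fits two classifiers on made-up data, one with `metric='euclidean'` and one with `metric='minkowski'` and `p=2`, and compares their predictions. The data and variable names are assumptions for the example.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Toy data (made up): 60 samples, 4 features, binary labels.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(60, 4))
y_demo = (X_demo[:, 0] + X_demo[:, 1] > 0).astype(int)
X_query = rng.normal(size=(10, 4))

knn_euclid = KNeighborsClassifier(n_neighbors=5, metric="euclidean")
knn_mink = KNeighborsClassifier(n_neighbors=5, metric="minkowski", p=2)
knn_euclid.fit(X_demo, y_demo)
knn_mink.fit(X_demo, y_demo)

# Minkowski distance with p=2 is Euclidean distance, so predictions match.
same = np.array_equal(knn_euclid.predict(X_query), knn_mink.predict(X_query))
print(same)  # True
```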


In [21]:
import math
# Rule-of-thumb starting point for k: square root of the number of test samples
math.sqrt(len(y_test))

12.409673645990857

In [28]:
# Define the KNN model
classifier = KNeighborsClassifier(
    n_neighbors=11,
    p=2,
    metric='euclidean'
)

# Fit the model on the training data
classifier.fit(x_train, y_train)
print("KNN model training completed.")


KNN model training completed.


In [31]:
# Predict the test set result
y_pred = classifier.predict(x_test)
y_pred

array([1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 1,
       1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 1, 1,
       1, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0,
       1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1,
       0, 0, 1, 0, 1, 1, 0, 1, 0, 1, 1, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0,
       0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0])

## Model Evaluation – Confusion Matrix

To evaluate the model, we use a **confusion matrix**, which summarizes the number of correct and incorrect predictions made by the classifier.

- **True Positive (TP):** correctly predicted diabetes cases  
- **True Negative (TN):** correctly predicted non-diabetes cases  
- **False Positive (FP):** non-diabetes cases incorrectly predicted as diabetes  
- **False Negative (FN):** diabetes cases incorrectly predicted as non-diabetes

After generating the confusion matrix, we calculate additional evaluation metrics to better understand model performance:

- **Accuracy Score:** measures the overall proportion of correctly classified instances.  
- **F1 Score:** harmonic mean of precision and recall, useful for imbalanced datasets.

In [32]:
# Evaluate the model
cm = confusion_matrix(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
acc = accuracy_score(y_test, y_pred)

# Display results
print("Confusion Matrix:\n", cm)
print(f"F1 Score: {f1:.4f}")
print(f"Accuracy: {acc:.4f}")

Confusion Matrix:
 [[91 16]
 [20 27]]
F1 Score: 0.6000
Accuracy: 0.7662
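
The reported scores can be reproduced by hand from the four counts in the confusion matrix above. The sketch below assumes scikit-learn's layout for binary labels, where rows are true labels and columns are predicted labels, so the matrix reads `[[TN, FP], [FN, TP]]`.

```python
import numpy as np

# Confusion matrix as printed above: rows are true labels, columns are predictions.
cm = np.array([[91, 16],
               [20, 27]])
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(f"Accuracy:  {accuracy:.4f}")   # 0.7662
print(f"Precision: {precision:.4f}")  # 0.6279
print(f"Recall:    {recall:.4f}")     # 0.5745
print(f"F1 Score:  {f1:.4f}")         # 0.6000
```

Recall is lower than precision here: 20 of the 47 actual diabetes cases in the test set were missed, which accuracy alone does not reveal.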
