## [K Nearest Neighbor Classifier](https://towardsdatascience.com/k-nearest-neighbor-classifier-explained-a-visual-guide-with-code-examples-for-beginners-a3d85cad00e1/)

A K Nearest Neighbor (KNN) classifier predicts the class of a new data point from the majority class among its K nearest training points in the feature space. It assumes that similar points lie close together, which makes it intuitive and easy to understand.

To classify a new data point, the algorithm works as follows:

1. Calculate the distance between the new data point and all points in the training set.
2. Select the K nearest neighbors based on these distances.
3. Take a majority vote of the classes of these K neighbors.
4. Assign the majority class to the new data point.
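
The four steps above can be sketched in a few lines of plain Python. This is a minimal illustration, not how scikit-learn implements it: the function name `knn_predict` and the toy points are hypothetical, Euclidean distance is assumed, and ties fall to whichever tied class appears first among the nearest neighbors.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, new_point, k=3):
    """Classify new_point by majority vote among its k nearest training points."""
    # Step 1: Euclidean distance from the new point to every training point
    distances = [math.dist(point, new_point) for point in train_points]
    # Step 2: indices of the k smallest distances
    nearest = sorted(range(len(distances)), key=distances.__getitem__)[:k]
    # Step 3: count the class labels among those k neighbors
    votes = Counter(train_labels[i] for i in nearest)
    # Step 4: the most common label becomes the prediction
    return votes.most_common(1)[0][0]

# Hypothetical toy data: two well-separated clusters
toy_points = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
toy_labels = ["a", "a", "a", "b", "b", "b"]

print(knn_predict(toy_points, toy_labels, (0.5, 0.5)))  # lands in the "a" cluster
print(knn_predict(toy_points, toy_labels, (5.5, 5.5)))  # lands in the "b" cluster
```

There is no training step beyond storing the data. All the work happens at prediction time, which is why KNN is often described as a lazy learner.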

In [1]:
!pip install -q numpy pandas scikit-learn matplotlib

In [2]:
import warnings
warnings.filterwarnings('ignore')

In [3]:
# Import libraries
import pandas as pd
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score
from scipy.spatial import distance

# Load data
dataset_dict = {
    'Outlook': ['sunny', 'sunny', 'overcast', 'rainy', 'rainy', 'rainy', 'overcast', 'sunny', 'sunny', 'rainy', 'sunny', 'overcast', 'overcast', 'rainy', 'sunny', 'overcast', 'rainy', 'sunny', 'sunny', 'rainy', 'overcast', 'rainy', 'sunny', 'overcast', 'sunny', 'overcast', 'rainy', 'overcast'],
    'Temperature': [85.0, 80.0, 83.0, 70.0, 68.0, 65.0, 64.0, 72.0, 69.0, 75.0, 75.0, 72.0, 81.0, 71.0, 81.0, 74.0, 76.0, 78.0, 82.0, 67.0, 85.0, 73.0, 88.0, 77.0, 79.0, 80.0, 66.0, 84.0],
    'Humidity': [85.0, 90.0, 78.0, 96.0, 80.0, 70.0, 65.0, 95.0, 70.0, 80.0, 70.0, 90.0, 75.0, 80.0, 88.0, 92.0, 85.0, 75.0, 92.0, 90.0, 85.0, 88.0, 65.0, 70.0, 60.0, 95.0, 70.0, 78.0],
    'Wind': [False, True, False, False, False, True, True, False, False, False, True, True, False, True, True, False, False, True, False, True, True, False, True, False, False, True, False, False],
    'Play': ['No', 'No', 'Yes', 'Yes', 'Yes', 'No', 'Yes', 'No', 'Yes', 'Yes', 'Yes', 'Yes', 'Yes', 'No', 'No', 'Yes', 'Yes', 'No', 'No', 'No', 'Yes', 'Yes', 'Yes', 'Yes', 'Yes', 'Yes', 'No', 'Yes']
}
df = pd.DataFrame(dataset_dict)

# Choose a Distance Metric
distance_metric = 'euclidean'

# Preprocess data
df = pd.get_dummies(df, columns=['Outlook'], prefix='', prefix_sep='', dtype=int)
df['Wind'] = df['Wind'].astype(int)
df['Play'] = (df['Play'] == 'Yes').astype(int)

In [4]:
# Split data
X, y = df.drop(columns='Play'), df['Play']
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.5, shuffle=False)

# Standardize features
scaler = StandardScaler()
float_cols = X_train.select_dtypes(include=['float64']).columns
X_train[float_cols] = scaler.fit_transform(X_train[float_cols])
X_test[float_cols] = scaler.transform(X_test[float_cols])

# Sanity check: Euclidean distance between training rows ID 0 and ID 1 (after standardization)
print("distance between ID 0 and ID 1: ", np.linalg.norm(X_train.loc[0].values - X_train.loc[1].values))

# Train model
knn_clf = KNeighborsClassifier(n_neighbors=3, metric=distance_metric)
knn_clf.fit(X_train, y_train)

distance between ID 0 and ID 1:  1.3789269844186147


| Parameter | Value |
|---|---|
| n_neighbors | 3 |
| weights | 'uniform' |
| algorithm | 'auto' |
| leaf_size | 30 |
| p | 2 |
| metric | 'euclidean' |
| metric_params | None |
| n_jobs | None |
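
The printed distance between ID 0 and ID 1 can be reproduced by hand, which also shows why standardization matters. The sketch below uses only the standard library and assumes what the cell above does: the training half is the first 14 rows, only `Temperature` and `Humidity` are scaled, and scaling uses the population standard deviation (as `StandardScaler` does). The helper name `zscore` is made up for this example.

```python
import math
from statistics import fmean, pstdev

# Temperature and Humidity for the 14 training rows (first half of the dataset)
temperature = [85.0, 80.0, 83.0, 70.0, 68.0, 65.0, 64.0,
               72.0, 69.0, 75.0, 75.0, 72.0, 81.0, 71.0]
humidity = [85.0, 90.0, 78.0, 96.0, 80.0, 70.0, 65.0,
            95.0, 70.0, 80.0, 70.0, 90.0, 75.0, 80.0]

def zscore(values):
    """Standardize to zero mean and unit (population) standard deviation."""
    mu, sigma = fmean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

temp_z, hum_z = zscore(temperature), zscore(humidity)

# Rows 0 and 1 are both 'sunny', so the one-hot Outlook columns contribute 0.
# Wind is False for row 0 and True for row 1, so it contributes (0 - 1)^2 = 1.
wind_sq = (0 - 1) ** 2

raw_dist = math.sqrt((85.0 - 80.0) ** 2 + (85.0 - 90.0) ** 2 + wind_sq)
scaled_dist = math.sqrt((temp_z[0] - temp_z[1]) ** 2
                        + (hum_z[0] - hum_z[1]) ** 2
                        + wind_sq)

print(f"raw distance:    {raw_dist:.4f}")
print(f"scaled distance: {scaled_dist:.4f}")
```

Without scaling, the two numeric columns dominate the distance and the `Wind` difference barely registers. After scaling, all features contribute on a comparable scale.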


In [5]:
# Compute the distances from the first row of X_test to all rows in X_train
distances = distance.cdist(X_test.iloc[0:1], X_train, metric='euclidean')

# Create a DataFrame to display the distances
distance_df = pd.DataFrame({
    'Train_ID': X_train.index,
    'Distance': distances[0].round(2),
    'Label': y_train
}).set_index('Train_ID')

print(distance_df.sort_values(by='Distance'))

          Distance  Label
Train_ID                 
1             0.26      0
0             1.22      0
7             1.89      0
11            2.02      1
2             2.05      1
10            2.12      1
9             2.15      1
12            2.21      1
13            2.28      0
3             2.59      1
4             2.82      1
8             2.86      1
5             3.46      0
6             3.88      1
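
Reading the prediction off this table takes one more step: keep the K closest rows and take a majority vote. The sketch below copies the distances and labels printed above into a list and tries a few values of K. The function name `vote` is hypothetical, and the tuples are simply the table rows.

```python
from collections import Counter

# (Train_ID, Distance, Label) rows copied from the table above
neighbors = [
    (1, 0.26, 0), (0, 1.22, 0), (7, 1.89, 0), (11, 2.02, 1), (2, 2.05, 1),
    (10, 2.12, 1), (9, 2.15, 1), (12, 2.21, 1), (13, 2.28, 0), (3, 2.59, 1),
    (4, 2.82, 1), (8, 2.86, 1), (5, 3.46, 0), (6, 3.88, 1),
]

def vote(rows, k):
    """Return the majority label among the k rows with the smallest distance."""
    nearest = sorted(rows, key=lambda row: row[1])[:k]
    counts = Counter(label for _, _, label in nearest)
    return counts.most_common(1)[0][0]

for k in (3, 5, 7):
    print(f"k={k}: predicted label {vote(neighbors, k)}")
```

With K=3 the three closest rows (IDs 1, 0 and 7) all carry label 0, so the first test point is classified as 0. A larger K pulls in more distant label-1 rows and can flip the vote, which is why the choice of K matters.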


In [6]:
# Predict and evaluate
y_pred = knn_clf.predict(X_test)

print("Label     :", list(y_test))
print("Prediction:", y_pred.tolist())  # tolist() yields plain Python ints instead of np.int64

accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy * 100:.2f}%')

Label     : [0, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 1]
Prediction: [0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1]
Accuracy: 78.57%
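
The accuracy figure hides which kind of mistake the model makes. As a rough check, the printed labels and predictions can be tallied by hand. The variable names below are chosen for this sketch, and the two lists are copied from the output above.

```python
# Labels and predictions copied from the output above (1 = Play, 0 = Don't play)
labels      = [0, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 1]
predictions = [0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1]

pairs = list(zip(labels, predictions))
true_pos  = sum(1 for y, p in pairs if y == 1 and p == 1)
true_neg  = sum(1 for y, p in pairs if y == 0 and p == 0)
false_pos = sum(1 for y, p in pairs if y == 0 and p == 1)
false_neg = sum(1 for y, p in pairs if y == 1 and p == 0)

accuracy = (true_pos + true_neg) / len(pairs)
print(f"TP={true_pos} TN={true_neg} FP={false_pos} FN={false_neg}")
print(f"Accuracy: {accuracy * 100:.2f}%")
```

All three errors are false positives: the model predicts "Play" for days labeled "No". With only 14 test rows, one more or one fewer mistake moves the accuracy by about seven percentage points, so the figure should be read as a rough indication.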
