# KNN Algorithm
The k-nearest neighbors (k-NN) algorithm is a simple yet effective supervised machine learning algorithm used for classification and regression tasks. It operates on the principle of similarity: the output for a data point is determined by a majority vote (classification) or an average (regression) over its k nearest neighbors in the feature space. The sections below give a theoretical overview of the algorithm.

**Algorithm Overview:**

Given a dataset D of n samples, each represented by a feature vector and a corresponding class label (classification) or continuous target value (regression), the k-NN algorithm works as follows:

1. For a new input data point, compute the distance between this point and every point in the training set.

2. Identify the k nearest neighbors based on these distances.

3. For classification tasks, assign the most common class label among the k nearest neighbors to the new data point.

4. For regression tasks, compute the average of the target values of the k nearest neighbors and use it as the predicted value for the new data point.
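The steps above can be sketched in a few lines of NumPy. This is a minimal brute-force illustration, not the implementation scikit-learn uses; the function name `knn_predict` and the toy data are assumptions made for this example.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3, task="classification"):
    """Predict the target of one query point with brute-force k-NN (illustrative sketch)."""
    X_train = np.asarray(X_train, dtype=float)
    y_train = np.asarray(y_train)
    # Euclidean distance from the query to every training sample
    distances = np.linalg.norm(X_train - np.asarray(x_new, dtype=float), axis=1)
    # Indices of the k smallest distances
    nearest = np.argsort(distances)[:k]
    neighbor_targets = y_train[nearest]
    if task == "classification":
        # Majority vote among the k neighbors
        return Counter(neighbor_targets.tolist()).most_common(1)[0][0]
    # Regression: mean of the neighbors' target values
    return float(neighbor_targets.mean())

# Hypothetical toy data: two well-separated clusters
X_toy = [[1.0, 1.0], [1.5, 2.0], [2.0, 1.0], [8.0, 8.0], [9.0, 8.5], [8.5, 9.0]]
y_toy = [0, 0, 0, 1, 1, 1]

print(knn_predict(X_toy, y_toy, [1.2, 1.4], k=3))  # 0
print(knn_predict(X_toy, y_toy, [8.7, 8.6], k=3))  # 1
```

Sorting all distances with `np.argsort` keeps the sketch short; for large training sets a partial sort or a spatial index would be the more usual choice.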

**Distance Metrics:**

1. The choice of distance metric significantly influences the performance of the k-NN algorithm. Common choices include Euclidean distance, Manhattan distance, Minkowski distance (which generalizes both: p = 2 gives Euclidean, p = 1 gives Manhattan), and cosine distance (one minus cosine similarity).

2. Euclidean distance is often preferred for its simplicity and effectiveness, especially in low-dimensional spaces.
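To make the metrics concrete, here is a small sketch that computes them by hand with NumPy. The helper names `minkowski_distance` and `cosine_distance` and the two example points are assumptions for illustration only.

```python
import numpy as np

def minkowski_distance(a, b, p=2):
    """Minkowski distance; p=1 is Manhattan, p=2 is Euclidean."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(np.sum(np.abs(a - b) ** p) ** (1.0 / p))

def cosine_distance(a, b):
    """One minus the cosine similarity; depends only on the angle between the vectors."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

a, b = [1.0, 2.0], [4.0, 6.0]  # hypothetical points; they differ by (3, 4)
print(minkowski_distance(a, b, p=1))  # Manhattan: 7.0
print(minkowski_distance(a, b, p=2))  # Euclidean: 5.0
print(cosine_distance(a, b))          # small, because the vectors point in similar directions
```

Note that cosine distance ignores vector length: `[1, 2]` and `[2, 4]` are at cosine distance zero even though their Euclidean distance is not.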

**Parameter Selection:**

1. The key hyperparameter in k-NN is k, the number of neighbors to consider. Choosing the right value of k is crucial and depends on the dataset characteristics.

2. Smaller values of k lead to more flexible models with higher variance but lower bias, while larger values of k result in smoother decision boundaries with lower variance but higher bias.
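The effect of k can be observed by comparing training and test accuracy for several values. The sketch below uses synthetic data from `make_classification` (an assumption for illustration, not the dataset used later in this notebook); the variable names are likewise hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic data, used here only for illustration
X_demo, y_demo = make_classification(
    n_samples=400, n_features=4, n_informative=3, n_redundant=1, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.25, random_state=0)

train_acc, test_acc = {}, {}
for k in (1, 5, 25, 101):
    model = KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr)
    train_acc[k] = model.score(X_tr, y_tr)
    test_acc[k] = model.score(X_te, y_te)
    print(f"k={k:>3}  train accuracy={train_acc[k]:.2f}  test accuracy={test_acc[k]:.2f}")
```

With k = 1 every training point is its own nearest neighbor, so the training accuracy is perfect, which is the high-variance extreme. As k grows the training accuracy typically drops and the decision boundary becomes smoother.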

**Curse of Dimensionality:**

1. The performance of k-NN can deteriorate significantly as the dimensionality of the feature space increases. This is known as the curse of dimensionality.

2. In high-dimensional spaces, the notion of distance becomes less meaningful, and the data points become increasingly sparse, making it difficult to identify meaningful nearest neighbors.
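One way to see distances losing their meaning is to measure how much farther the farthest point is than the nearest one. The sketch below does this for uniformly random points; the function name `relative_contrast`, the sample size, and the chosen dimensions are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def relative_contrast(n_dims, n_points=500):
    """(d_max - d_min) / d_min for distances from a random query to uniform random points."""
    points = rng.random((n_points, n_dims))
    query = rng.random(n_dims)
    d = np.linalg.norm(points - query, axis=1)
    return float((d.max() - d.min()) / d.min())

for n_dims in (2, 10, 100, 1000):
    print(n_dims, round(relative_contrast(n_dims), 3))
```

The printed ratio shrinks as the number of dimensions grows: in high dimensions the nearest and the farthest points are almost equally far away, so "nearest" carries little information.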

**Preprocessing:**

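1. Proper preprocessing, in particular feature scaling or normalization, usually improves the performance of k-NN: because predictions depend on distances, a feature with a large numeric range (such as a salary) would otherwise dominate features with a small range (such as an age).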

2. Additionally, feature selection or dimensionality reduction techniques like PCA (Principal Component Analysis) can help mitigate the curse of dimensionality.
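The following sketch shows how scaling can change which neighbor is nearest. The five rows are hypothetical `[Age, EstimatedSalary]` values invented for this example, and `distances_from_first` is a helper introduced only here.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical rows: [Age, EstimatedSalary]
X_raw = np.array([
    [25.0, 50000.0],   # query user
    [60.0, 51000.0],   # very different age, similar salary
    [27.0, 58000.0],   # similar age, somewhat different salary
    [20.0, 20000.0],   # extra rows that set the feature ranges
    [45.0, 150000.0],
])

def distances_from_first(X):
    """Euclidean distance from row 0 to rows 1 and 2."""
    return np.linalg.norm(X[1:3] - X[0], axis=1)

raw = distances_from_first(X_raw)
scaled = distances_from_first(MinMaxScaler().fit_transform(X_raw))

print("raw distances:   ", raw)     # salary dominates: the 60-year-old looks nearest
print("scaled distances:", scaled)  # after scaling: the 27-year-old is nearest
```

On the raw values the salary column decides almost everything; after min-max scaling both features contribute on a comparable scale.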

**Computational Complexity:**

1. With a brute-force search, the training cost of k-NN is negligible, since the model simply stores all training samples.

2. Prediction, however, can be expensive for large datasets: a brute-force query computes the distance to every training sample, which costs O(n·d) for n samples with d features.

3. Index structures such as KD-trees or Ball trees can speed up the neighbor search, at the price of some build time during training; their advantage tends to shrink as the number of features grows.
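In scikit-learn the search strategy is selected with the `algorithm` parameter. The sketch below runs the same queries with a brute-force search, a KD-tree, and a Ball tree on random data (an assumption for illustration) and checks that they return the same neighbors.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(42)
X_index = rng.random((2000, 3))   # hypothetical training points
queries = rng.random((5, 3))      # hypothetical query points

results = {}
for algorithm in ("brute", "kd_tree", "ball_tree"):
    nn = NearestNeighbors(n_neighbors=5, algorithm=algorithm).fit(X_index)
    distances, indices = nn.kneighbors(queries)
    results[algorithm] = indices

# The index structures change how the search is done, not which neighbors are found
print(np.array_equal(results["kd_tree"], results["ball_tree"]))
print(np.array_equal(results["brute"], results["kd_tree"]))
```

The same `algorithm` argument is accepted by `KNeighborsClassifier`; its default, `"auto"`, lets scikit-learn pick a strategy based on the data passed to `fit`.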

**Handling Imbalanced Data:**

1. In classification tasks with imbalanced class distributions, the majority class may dominate the decision-making process. Techniques like weighted k-NN or resampling methods can help alleviate this issue.
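As a small illustration of weighted k-NN, the sketch below compares uniform voting with distance-weighted voting through scikit-learn's `weights` parameter. The one-dimensional data set is hypothetical and constructed so that a single close minority-class neighbor competes against two farther majority-class neighbors.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical imbalanced data: one sample of class 1, five of class 0
X_imb = np.array([[0.0], [1.0], [1.1], [1.2], [2.0], [2.5]])
y_imb = np.array([1, 0, 0, 0, 0, 0])
query = np.array([[0.1]])  # very close to the single class-1 sample

uniform = KNeighborsClassifier(n_neighbors=3, weights="uniform").fit(X_imb, y_imb)
weighted = KNeighborsClassifier(n_neighbors=3, weights="distance").fit(X_imb, y_imb)

print(uniform.predict(query))   # [0]: two of the three neighbors are class 0
print(weighted.predict(query))  # [1]: the close class-1 neighbor outweighs them
```

With `weights="distance"` each neighbor's vote is weighted by the inverse of its distance, so a very close minority sample is not simply outvoted.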

**Evaluation:**

1. Performance evaluation of the k-NN algorithm typically involves metrics such as accuracy, precision, recall, F1-score (for classification), and mean squared error or R-squared (for regression).

2. Cross-validation techniques like k-fold cross-validation are commonly used to assess the algorithm's generalization performance.
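A minimal sketch of k-fold cross-validation for a k-NN classifier follows. It uses synthetic data from `make_classification` as a stand-in (an assumption, not this notebook's dataset) and wraps the scaler and the classifier in a pipeline so that scaling is re-fitted on each training fold.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in data, for illustration only
X_cv, y_cv = make_classification(
    n_samples=400, n_features=3, n_informative=3, n_redundant=0, random_state=42
)

# The scaler is fitted inside each fold, so no test-fold information leaks into training
model = make_pipeline(MinMaxScaler(), KNeighborsClassifier(n_neighbors=5))
scores = cross_val_score(model, X_cv, y_cv, cv=5, scoring="accuracy")

print("fold accuracies:", scores.round(2))
print("mean accuracy:  ", scores.mean().round(2))
```

Reporting the mean together with the spread across folds gives a more reliable picture than a single train/test split.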

**Applications:**

1. k-NN is widely used in various domains, including pattern recognition, image classification, recommendation systems, and anomaly detection, among others.

By implementing and analyzing the k-nearest neighbors algorithm, one can gain insights into its strengths, weaknesses, and suitability for different types of datasets and tasks. Experimentation with different parameter settings, distance metrics, and preprocessing techniques can provide valuable practical experience in machine learning. Additionally, comparing the performance of k-NN with other algorithms can help in understanding its relative advantages and limitations.


---


The dataset used is Social Network Ads from Kaggle: https://www.kaggle.com/datasets/rakeshrau/social-network-ads


# Importing the Necessary Libraries

In [1]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
%matplotlib inline

# Data Importation and Exploration

In [2]:
# Loading and previewing our dataset
social_df = pd.read_csv('./Social_Network_Ads.csv')
social_df.head()

|   | User ID | Gender | Age | EstimatedSalary | Purchased |
|---|---|---|---|---|---|
| 0 | 15624510 | Male | 19 | 19000 | 0 |
| 1 | 15810944 | Male | 35 | 20000 | 0 |
| 2 | 15668575 | Female | 26 | 43000 | 0 |
| 3 | 15603246 | Female | 27 | 57000 | 0 |
| 4 | 15804002 | Male | 19 | 76000 | 0 |


In [3]:
# Determining the size of our dataset
social_df.shape

(400, 5)

# Data Preparation

In [4]:
# Encoding Gender as a binary feature: Male -> 1, Female -> 0
social_df["Gender"] = np.where(social_df["Gender"] == "Male", 1, 0)
social_df.head()

|   | User ID | Gender | Age | EstimatedSalary | Purchased |
|---|---|---|---|---|---|
| 0 | 15624510 | 1 | 19 | 19000 | 0 |
| 1 | 15810944 | 1 | 35 | 20000 | 0 |
| 2 | 15668575 | 0 | 26 | 43000 | 0 |
| 3 | 15603246 | 0 | 27 | 57000 | 0 |
| 4 | 15804002 | 1 | 19 | 76000 | 0 |


# Data Modeling

In [5]:
# Preparing our dataset for training
# Features: Gender, Age, EstimatedSalary (User ID is dropped); target: Purchased
X = social_df.iloc[:, [1, 2, 3]].values
y = social_df.iloc[:, 4].values

In [6]:
# Splitting the dataset into a training set and test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state = 42)

In [7]:
# Normalisation: the scaler is fitted on the training set only, then applied to both sets
from sklearn.preprocessing import MinMaxScaler
norm = MinMaxScaler().fit(X_train)
X_train = norm.transform(X_train)
X_test = norm.transform(X_test)

In [8]:
from sklearn.neighbors import KNeighborsClassifier

knn_classifier = KNeighborsClassifier()

# Fitting the classifier (default n_neighbors=5) on the training data, X_train and y_train
knn_classifier.fit(X_train, y_train)


KNeighborsClassifier()

In [9]:
# Predicting the test set results
knn_y_prediction = knn_classifier.predict(X_test)



In [10]:
# Printing the accuracy of the classifier on the test set
from sklearn.metrics import classification_report, accuracy_score


print(accuracy_score(y_test, knn_y_prediction))

0.92


In [11]:
# Printing the classification report
print('KNN Classifier:')
print(classification_report(y_test, knn_y_prediction))

KNN Classifier:
              precision    recall  f1-score   support

           0       0.95      0.92      0.94        63
           1       0.87      0.92      0.89        37

    accuracy                           0.92       100
   macro avg       0.91      0.92      0.92       100
weighted avg       0.92      0.92      0.92       100



In [12]:
# Using a confusion matrix to break down correct and incorrect predictions per class
from sklearn.metrics import confusion_matrix

# Rows are actual classes, columns are predicted classes
print('KNN Classifier:')
print(confusion_matrix(y_test, knn_y_prediction))

KNN Classifier:
[[58  5]
 [ 3 34]]
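
The figures in the classification report can be re-derived from the four counts in the confusion matrix. Taking class 1 (Purchased) as the positive class, the results above correspond to 58 true negatives, 34 true positives, 5 false positives and 3 false negatives; the short check below is only a worked calculation with those counts, and the variable names are chosen for this example.

```python
# Counts read from the results above, with class 1 as the positive class
tn, fp, fn, tp = 58, 5, 3, 34

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)          # share of predicted buyers who actually bought
recall = tp / (tp + fn)             # share of actual buyers the model found
f1 = 2 * precision * recall / (precision + recall)

print(round(accuracy, 2), round(precision, 2), round(recall, 2), round(f1, 2))
# prints: 0.92 0.87 0.92 0.89
```

These values match the accuracy and the class 1 row of the classification report.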


In [13]:
# Making a new prediction & comparing results
new_case = [[0, 60, 2500]] # Gender, Age, Salary

# We will need to transform our new case
new_case = norm.transform(new_case)

print('KNN classifier:', knn_classifier.predict(new_case))

KNN classifier: [1]


