# K-Nearest Neighbors Algorithm

K-nearest neighbors (KNN) is a non-parametric, supervised learning method that uses proximity to classify an individual data point or predict its value. It can be used for both regression and classification, but it is most often used for classification, and it rests on the assumption that similar points lie close to each other.


# Intuition behind the KNN algorithm

For classification problems, a class label is assigned by majority vote: the label most frequently represented among the neighbors of a given data point is used. Strictly speaking this is “plurality voting”, but the term “majority vote” is more common in the literature. The distinction is that a true majority requires more than 50% of the votes, which mainly applies when there are only two categories. With multiple classes, for example four categories, you do not necessarily need 50% of the vote to decide on a class; you could assign a class label with a vote share greater than 25%.
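
The vote described above can be sketched in a few lines. This is only an illustration: the function name `plurality_vote` and the neighbor labels are made up for the example, and only the standard library is used.

```python
from collections import Counter

def plurality_vote(neighbor_labels):
    """Return the most frequent label among the neighbors and its vote share."""
    counts = Counter(neighbor_labels)
    label, votes = counts.most_common(1)[0]
    return label, votes / len(neighbor_labels)

# Hypothetical labels of the k = 7 nearest neighbors, drawn from four classes
neighbors = ["A", "B", "A", "C", "D", "A", "B"]
winner, share = plurality_vote(neighbors)
print(winner, round(share, 2))  # A 0.43
```

Class "A" wins with about 43% of the votes, which is a plurality but not a majority. In this sketch, ties are broken in favor of the label that appears first among the neighbors; other tie-breaking rules are equally possible.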

Regression problems use the same idea as classification problems, but here the prediction is the average of the values of the k nearest neighbors. The main distinction is that classification predicts discrete values, whereas regression predicts continuous ones. In either case, a distance metric must be defined before a prediction can be made. Euclidean distance is the most commonly used.
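
As a rough illustration of the regression case, the sketch below averages the targets of the k nearest points under Euclidean distance. The function name `knn_regress` and the sample data are assumptions made for this example, not part of any library.

```python
import math
from statistics import mean

def knn_regress(train_points, train_targets, query, k):
    """Predict a continuous value as the mean target of the k nearest points."""
    # Euclidean distance from the query to every training point
    distances = [math.dist(point, query) for point in train_points]
    # Indices of the k smallest distances
    nearest = sorted(range(len(distances)), key=distances.__getitem__)[:k]
    return mean(train_targets[i] for i in nearest)

# Hypothetical 2-D training points with continuous targets
points = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0), (5.0, 5.0)]
targets = [10.0, 20.0, 30.0, 100.0]

print(knn_regress(points, targets, query=(0.0, 0.0), k=3))  # 20.0
```

The far-away point at (5, 5) is not among the three nearest neighbors, so its target of 100 does not affect the prediction.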

# Pseudocode

1. KNN algorithm inputs:

X_train: Feature matrix of the training data set.

y_train: Array of labels of the training data set.

X_test: Test data set, consisting of a feature matrix.

k: Number of nearest neighbors to consider.

2. Output

The output of the KNN algorithm is an array of predictions, one for each point in the test data set.

3. Algorithm

The KNN algorithm can be divided into the following steps:

1.   Calculate the distances between the test and training data points. This step can be performed using any distance measure, such as the Euclidean distance, Manhattan distance, or Mahalanobis distance.
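2.   Find the k nearest neighbors for each test data point. This can be done by sorting the distances with any sorting algorithm (bubble sort, selection sort, or insertion sort would all work, although faster sorts or a partial selection of the k smallest values are preferable), or with a spatial index such as a k-d tree or ball tree.
3.   Assign a prediction to each test data point from the labels of its nearest neighbors: the majority vote for classification, or the mean for regression.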
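
The three steps above can be sketched with NumPy as follows. This is a minimal brute-force version for illustration, assuming Euclidean distance and a plain vote; the function name `knn_predict` and the toy data are made up for the example, and this is not how scikit-learn implements the search internally.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, X_test, k):
    """Classify each row of X_test by a vote among its k nearest training rows."""
    X_train = np.asarray(X_train, dtype=float)
    X_test = np.asarray(X_test, dtype=float)
    y_train = np.asarray(y_train)

    # Step 1: Euclidean distance between every test row and every training row.
    # Broadcasting gives an array of shape (n_test, n_train).
    diffs = X_test[:, None, :] - X_train[None, :, :]
    distances = np.sqrt((diffs ** 2).sum(axis=2))

    # Step 2: indices of the k smallest distances for each test row.
    nearest = np.argsort(distances, axis=1)[:, :k]

    # Step 3: most common label among those neighbors.
    predictions = [Counter(y_train[row]).most_common(1)[0][0] for row in nearest]
    return np.array(predictions)

# Hypothetical data: two well-separated groups of points
X_train = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
y_train = ["Carro", "Carro", "Carro", "Muñeca", "Muñeca", "Muñeca"]
X_test = [[0.5, 0.5], [10.5, 10.5]]

print(knn_predict(X_train, y_train, X_test, k=3))  # ['Carro' 'Muñeca']
```

Sorting every row with `np.argsort` keeps the sketch short; for large training sets a partial selection or a tree-based index would avoid sorting all distances.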






# Implementation


In [None]:
# Import the necessary libraries
import numpy as np # this is used for numerical operations
import pandas as pd # Pandas simplifies data manipulation and analysis
from sklearn.model_selection import train_test_split # Scikit-learn is a machine learning library, and 'train_test_split' is used to split the dataset into training and testing sets
from sklearn.neighbors import KNeighborsClassifier # Scikit-learn's 'KNeighborsClassifier' provides an implementation of the k-nearest neighbors algorithm for classification
from sklearn.metrics import accuracy_score # 'accuracy_score' from Scikit-learn metrics is used to measure the accuracy of the model's predictions



In [None]:
np.random.seed(42) # Sets the seed of the NumPy random number generator to 42.

In [None]:
# Generate 100 random integers between 0 and 255 for each color channel (red, green, blue) of the toys
data = {
    'Rojo': np.random.randint(0, 256, 100),
    'Verde': np.random.randint(0, 256, 100),
    'Azul': np.random.randint(0, 256, 100),
    'Juguete': ['Carro' if i < 50 else 'Muñeca' for i in range(100)]  # The first 50 toys are labeled 'Carro' (car) and the next 50 'Muñeca' (doll)
}


This cell generates simulated data for a set of toys with color attributes (red, green, blue) and a label indicating whether each toy is a car or a doll.

In [None]:
df = pd.DataFrame(data) # Create a DataFrame named 'df' using the previously generated 'data'
X_train, X_test, y_train, y_test = train_test_split(df[['Rojo', 'Verde', 'Azul']], df['Juguete'], test_size=0.2, random_state=42)


Split into training and test sets (train_test_split): with test_size=0.2, 80 of the 100 rows go to the training set (X_train, y_train) and the remaining 20 go to the test set (X_test, y_test).

In [None]:

k = 3  # Number of neighbors (defined by us)
knn = KNeighborsClassifier(n_neighbors=k)  # Create a k-NN classifier with k neighbors
knn.fit(X_train, y_train)  # Train the model using the training data
# A new, unlabeled toy described by its red, green, and blue color values
nuevo_juguete = pd.DataFrame({
    'Rojo': [150],
    'Verde': [30],
    'Azul': [60]
})


In [None]:
# Make a prediction for the new toy
prediccion = knn.predict(nuevo_juguete)
print(f'The predicted toy type is: {prediccion[0]}')

# Make predictions on the test set
y_pred = knn.predict(X_test)

# Calculate the model accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy}')


The predicted toy type is: Muñeca
Accuracy: 0.4

An accuracy close to 0.5 is expected here: the color values are random and unrelated to the label, so the classifier has no real pattern to learn and can do no better than chance.


k-NN does not have a loss function that can be minimized during training. In fact, this algorithm is not trained at all. The only "training" that happens for k-NN is memorising the data (creating a local copy), so that during prediction you can do a search and majority vote. Technically, no function is fitted to the data, and so no optimization is done (it cannot be trained using gradient descent).

https://stats.stackexchange.com/questions/420416/does-knn-have-a-loss-function

KNN does not optimize model parameters through iterative updates, as gradient descent does in neural networks or as genetic algorithms do. Instead, it classifies or predicts new data points based on their similarity to existing data points.

https://medium.com/@denizgunay/knn-algorithm-3604c19cd809#:~:text=%E2%96%B9KNN%20is%20a%20simple,similarity%20to%20existing%20data%20points.
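
The point that k-NN only memorises the data can be made concrete with a small sketch. The class `LazyKNN` below is hypothetical and written only for this illustration (it is not scikit-learn's implementation): its `fit` method stores a copy of the data, and all computation happens in `predict`.

```python
import numpy as np

class LazyKNN:
    """Hypothetical minimal k-NN classifier showing where the work happens."""

    def __init__(self, k=1):
        self.k = k

    def fit(self, X, y):
        # "Training" is only a copy of the data: no loss, no gradients, no iterations.
        self.X_ = np.array(X, dtype=float)
        self.y_ = np.array(y)
        return self

    def predict(self, X):
        # All computation is deferred to prediction time.
        X = np.asarray(X, dtype=float)
        distances = np.linalg.norm(X[:, None, :] - self.X_[None, :, :], axis=2)
        nearest = np.argsort(distances, axis=1)[:, :self.k]
        predictions = []
        for row in nearest:
            labels, counts = np.unique(self.y_[row], return_counts=True)
            predictions.append(labels[np.argmax(counts)])
        return np.array(predictions)

# Randomly generated points and binary labels, used only for this demonstration
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 3))
y = rng.integers(0, 2, size=30)

model = LazyKNN(k=1).fit(X, y)
# With k = 1, each training point is its own nearest neighbor,
# so the stored labels come back unchanged.
print(np.array_equal(model.predict(X), y))  # True
```

Reproducing the training labels exactly with k = 1 is a direct consequence of memorisation and says nothing about how well the model generalizes to new points.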