## K Nearest Neighbors

K Nearest Neighbors (KNN) is a popular and simple machine learning algorithm used for classification and regression tasks. It is a non-parametric and instance-based learning algorithm, meaning it doesn't make any assumptions about the underlying data distribution and stores the entire training dataset to make predictions. Here's how the KNN algorithm works:

1. Data Preparation: Gather a labeled dataset with features and corresponding target labels. KNN can be used for both classification (discrete labels) and regression (continuous labels) tasks.

2. Choosing 'K': Select a value for 'K', which represents the number of nearest neighbors to consider when making predictions. It is a hyperparameter, and the choice of 'K' can significantly impact the algorithm's performance.

3. Distance Metric: Choose a distance metric to measure how far apart two data points are. Common choices include Euclidean distance, Manhattan distance, and Minkowski distance, which generalizes both.

4. Prediction: When a new data point (query point) needs to be classified or predicted, the algorithm finds the 'K' closest data points (nearest neighbors) based on the chosen distance metric.

5. Voting (Classification) or Averaging (Regression): For classification tasks, the algorithm takes a majority vote among the 'K' nearest neighbors and assigns the most common class label to the new data point. For regression tasks, it calculates the average of the target values of the 'K' nearest neighbors and assigns that value as the prediction for the new data point.

6. Choosing the Output: In some cases, you may need to decide how to handle ties (e.g., two classes with equal votes in classification). Various approaches can be used, such as considering the distance of neighbors or using weighted voting.

7. Model Evaluation: Finally, evaluate the performance of the KNN model using appropriate evaluation metrics such as accuracy, precision, recall, F1 score, mean squared error (MSE), etc.
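
The steps above can be sketched from scratch in a few lines of NumPy. This is a minimal illustration, not how scikit-learn implements it: the helper names `minkowski_distances` and `knn_predict` and the toy data are made up for this example, and ties are broken in favor of the class whose neighbor is closest to the query, which is only one of the options mentioned in step 6.

```python
import numpy as np
from collections import Counter


def minkowski_distances(X_train, query, p=2):
    """Minkowski distance from one query point to every training row.

    p=2 gives Euclidean distance, p=1 gives Manhattan distance.
    """
    return np.sum(np.abs(X_train - query) ** p, axis=1) ** (1.0 / p)


def knn_predict(X_train, y_train, query, k=3, p=2):
    """Classify one query point by majority vote among its k nearest neighbors."""
    distances = minkowski_distances(X_train, query, p=p)
    # Indices of the k smallest distances, nearest first.
    nearest = np.argsort(distances, kind="stable")[:k]
    votes = Counter(y_train[nearest].tolist())
    # most_common lists equal counts in first-seen order, so a tied vote
    # goes to the class whose neighbor is closest to the query.
    return votes.most_common(1)[0][0]


# Hypothetical toy dataset: two well-separated clusters.
X_toy = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                  [5.0, 5.0], [5.0, 6.0], [6.0, 5.0]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

pred_near_origin = knn_predict(X_toy, y_toy, np.array([0.5, 0.5]), k=3)
pred_far_corner = knn_predict(X_toy, y_toy, np.array([5.5, 5.5]), k=3)
print(pred_near_origin, pred_far_corner)  # 0 1
```

Note that there is no training step: the "model" is just the stored arrays, and all the work happens at prediction time.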

It's important to note that KNN can be sensitive to the choice of 'K' and the distance metric. A small 'K' value can lead to overfitting because predictions follow individual noisy points, while a large 'K' value can cause underfitting because predictions are averaged over neighbors that are no longer local. Additionally, feature scaling is recommended: KNN is distance-based, so features with larger numeric ranges can dominate the distance calculations.
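
One way to see the effect of 'K' is to compare cross-validated accuracy for a few values, with scaling applied inside a pipeline. The sketch below is illustrative only: the candidate values of K, the 5-fold setting, and the variable names are assumptions chosen for this example, and the data is synthetic.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic binary classification data (illustrative settings).
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=3, n_redundant=1, n_classes=2, random_state=999
)

cv_scores = {}
for k in (1, 5, 15, 51):
    # The scaler sits inside the pipeline, so each CV fold is scaled
    # using statistics from its own training portion only.
    model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k))
    cv_scores[k] = cross_val_score(model, X_demo, y_demo, cv=5).mean()

for k, score in cv_scores.items():
    print(f"K={k:>2}: mean CV accuracy = {score:.3f}")
```

Keeping the scaler inside the pipeline avoids leaking information from the validation fold into the scaling statistics.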

While KNN is simple and intuitive, it can be computationally expensive for large datasets: a brute-force search compares the query point to every training point at prediction time, and the whole training set must be kept in memory. However, it can be an effective choice for smaller datasets or as a baseline model to establish a performance benchmark for more complex algorithms.
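
scikit-learn exposes the neighbor search strategy through the `algorithm` parameter (`'brute'`, `'kd_tree'`, `'ball_tree'`, or `'auto'`). As a rough sketch on made-up random data, the snippet below checks that a brute-force search and a KD-tree return the same neighbor distances; the tree only changes how the neighbors are found, not which ones are found. The array sizes and variable names are arbitrary choices for this example.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Hypothetical random data: 2000 stored points and 5 query points in 3-D.
rng = np.random.default_rng(0)
points = rng.normal(size=(2000, 3))
queries = rng.normal(size=(5, 3))

# Brute force computes the distance to every stored point.
brute = NearestNeighbors(n_neighbors=5, algorithm="brute").fit(points)
# A KD-tree partitions the space so most points can be skipped per query.
tree = NearestNeighbors(n_neighbors=5, algorithm="kd_tree").fit(points)

brute_dist, brute_idx = brute.kneighbors(queries)
tree_dist, tree_idx = tree.kneighbors(queries)

# Both strategies should agree on the neighbor distances (up to rounding).
distances_match = np.allclose(brute_dist, tree_dist)
print(distances_match)
```

Tree-based indexes tend to help most when the number of features is small; in high dimensions their advantage over brute force usually shrinks.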

In [1]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
%matplotlib inline

In [2]:
from sklearn.datasets import make_classification

In [3]:
X, y = make_classification(
    n_samples=1000,    # 1000 observations
    n_features=3,      # 3 total features
    n_redundant=1,     # 1 feature is a linear combination of the informative ones
    n_classes=2,       # binary target/label
    random_state=999   # fixed seed for reproducibility
)

In [4]:
from sklearn.model_selection import train_test_split

In [5]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42)

In [6]:
from sklearn.neighbors import KNeighborsClassifier

In [7]:
classifier = KNeighborsClassifier(n_neighbors=5, algorithm='auto')
classifier.fit(X_train, y_train)

KNeighborsClassifier()

In [8]:
y_pred = classifier.predict(X_test)

In [9]:
from sklearn.metrics import confusion_matrix, accuracy_score, classification_report

In [10]:
# scikit-learn metrics expect (y_true, y_pred) in that order
print(confusion_matrix(y_test, y_pred))
print(accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))

[[158  11]
 [ 20 141]]
0.906060606060606
              precision    recall  f1-score   support

           0       0.89      0.93      0.91       169
           1       0.93      0.88      0.90       161

    accuracy                           0.91       330
   macro avg       0.91      0.91      0.91       330
weighted avg       0.91      0.91      0.91       330



## K Nearest Neighbors Regressor

K Nearest Neighbors (KNN) is a supervised learning algorithm used for both classification and regression; this section focuses on the K Nearest Neighbors Regressor. It predicts the value of a data point by finding the K training points whose feature values are closest to the input and averaging their target values. Here's how the KNN Regressor works:

1. Data Preparation: Ensure your dataset is properly cleaned and preprocessed, with features and corresponding target values.

2. Choosing K: Choose the value of K, the number of neighbors to consider. A smaller K makes the model more sensitive to noise, while a larger K smooths the predictions but can make them less localized and less accurate. The best K value is usually determined through cross-validation.

3. Distance Metric: Define a distance metric to measure how far apart data points are. The most common metrics for continuous features are Euclidean distance and Manhattan distance. The metric determines which K training points count as the nearest neighbors of a given data point.

4. Prediction: For each data point in the test set, the KNN Regressor algorithm does the following:

   - Calculate the distance between the test data point and all the data points in the training set using the chosen distance metric.
   - Sort the distances in ascending order and select the K nearest neighbors.
   - Calculate the average (or weighted average) of the target values of the K neighbors. This average will be the predicted value for the test data point.

5. Regression Evaluation: After making predictions for all test data points, evaluate the performance of the model using regression evaluation metrics like Mean Squared Error (MSE), Mean Absolute Error (MAE), or R-squared.
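
The prediction step can be written out directly in NumPy. The following is a minimal sketch under stated assumptions: the function name `knn_regress`, the `weighted` flag, and the toy data are hypothetical, Euclidean distance is assumed, and inverse-distance weighting is used as one common choice of weighted average.

```python
import numpy as np


def knn_regress(X_train, y_train, query, k=3, weighted=False):
    """Predict a continuous target from the k nearest training points."""
    # Euclidean distance from the query to every training point.
    distances = np.sqrt(np.sum((X_train - query) ** 2, axis=1))
    # Sort ascending and keep the k nearest neighbors.
    nearest = np.argsort(distances, kind="stable")[:k]
    targets = y_train[nearest]
    if not weighted:
        # Plain average of the neighbors' target values.
        return float(np.mean(targets))
    d = distances[nearest]
    if np.any(d == 0):
        # The query coincides with a training point: use the exact matches.
        return float(np.mean(targets[d == 0]))
    # Inverse-distance weights: closer neighbors count for more.
    weights = 1.0 / d
    return float(np.sum(weights * targets) / np.sum(weights))


# Hypothetical 1-D toy data where the target is 10 * x.
X_toy = np.array([[0.0], [1.0], [2.0], [3.0], [10.0]])
y_toy = np.array([0.0, 10.0, 20.0, 30.0, 100.0])

uniform_pred = knn_regress(X_toy, y_toy, np.array([1.2]), k=3)
weighted_pred = knn_regress(X_toy, y_toy, np.array([1.2]), k=3, weighted=True)
print(round(uniform_pred, 3), round(weighted_pred, 3))  # 10.0 10.588
```

For the query at 1.2, the three nearest points are x = 1, 2, and 0, so the plain average is (10 + 20 + 0) / 3 = 10. The weighted version moves the prediction toward the target of the closest point, x = 1.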

In [11]:
from sklearn.datasets import make_regression
X, y = make_regression(n_samples=1000, n_features=2, noise=10, random_state=42)

In [12]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42)

In [13]:
from sklearn.neighbors import KNeighborsRegressor

In [14]:
regressor = KNeighborsRegressor(n_neighbors=6, algorithm='auto')
regressor.fit(X_train, y_train)

KNeighborsRegressor(n_neighbors=6)

In [15]:
y_pred = regressor.predict(X_test)

In [16]:
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error

In [17]:
print(r2_score(y_test, y_pred))             # R-squared
print(mean_absolute_error(y_test, y_pred))  # MAE
print(mean_squared_error(y_test, y_pred))   # MSE

0.9189275159979495
9.009462452972217
127.45860414317289
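
Step 2 notes that the best K is usually found through cross-validation. One possible way to do that with scikit-learn is `GridSearchCV`, sketched below on the same kind of synthetic data. The candidate grid, the 5-fold setting, and the variable names are assumptions made for this example, not tuned recommendations.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsRegressor

X_reg, y_reg = make_regression(n_samples=1000, n_features=2, noise=10, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_reg, y_reg, test_size=0.33, random_state=42)

# Illustrative grid: a few K values, with plain and distance-weighted averaging.
param_grid = {
    "n_neighbors": [2, 4, 6, 8, 12, 16],
    "weights": ["uniform", "distance"],
}

# 5-fold cross-validation on the training split only, scored by R-squared.
search = GridSearchCV(KNeighborsRegressor(), param_grid, cv=5, scoring="r2")
search.fit(X_tr, y_tr)

best_params = search.best_params_
# The held-out test split is used once, after K has been chosen.
test_r2 = search.score(X_te, y_te)
print(best_params, round(test_r2, 3))
```

Selecting K on the training folds and scoring once on the test split keeps the reported test score from being biased by the search itself.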
