![KNN](https://i.ibb.co/FWgzgFL/knn.png)

- **ML Part 1** - Logistic Regression
- **ML Part 2 - K-Nearest Neighbors (KNN)**
- **ML Part 3** - Support Vector Machine (SVM)
- **ML Part 4** - Artificial Neural Network (NN) 
- **ML Part 5** - Classification and Regression Tree (CART)
- **ML Part 6** - Random Forests
- **ML Part 7** - Gradient Boosting Machines (GBM)
- **ML Part 8** - XGBoost
- **ML Part 9** - LightGBM
- **ML Part 10** - CatBoost

K-nearest neighbors (KNN) is a nonparametric method used for classification and regression. It is one of the simplest machine learning techniques: a lazy learning model that predicts from the local neighborhood of each query point.

## Basic Theory
The basic idea behind KNN is to look at the points surrounding a test data point, assume the test point is similar to them, and derive the output from those neighbors. In other words, we find the neighbors and base the prediction on them.

In KNN classification, a majority vote is taken over the k nearest data points, while in KNN regression, the average of the k nearest data points is returned as the output. As a general rule, an odd k is chosen, which prevents tied votes in two-class problems. KNN is a lazy learning model: all computation happens at prediction time.
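The voting and averaging rules described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the implementation scikit-learn uses: the helper names (`k_nearest_indices`, `knn_classify`, `knn_regress`) and the toy data are made up for this example, and Euclidean distance is assumed.

```python
from collections import Counter

import numpy as np


def k_nearest_indices(X_train, query, k):
    """Return the indices of the k training points closest to `query` (Euclidean)."""
    distances = np.linalg.norm(X_train - query, axis=1)
    return np.argsort(distances, kind="stable")[:k]


def knn_classify(X_train, y_train, query, k):
    """Classification: majority vote among the k nearest labels."""
    idx = k_nearest_indices(X_train, query, k)
    return Counter(y_train[idx].tolist()).most_common(1)[0][0]


def knn_regress(X_train, y_train, query, k):
    """Regression: mean of the k nearest target values."""
    idx = k_nearest_indices(X_train, query, k)
    return float(np.mean(y_train[idx]))


# Toy data (hypothetical): two small clusters in 2-D.
X_toy = np.array([[1.0, 1.0], [1.5, 2.0], [2.0, 1.5],
                  [6.0, 6.0], [6.5, 7.0], [7.0, 6.5]])
labels = np.array(["A", "A", "A", "B", "B", "B"])
targets = np.array([10.0, 12.0, 14.0, 50.0, 52.0, 54.0])

query = np.array([2.0, 2.0])
predicted_class = knn_classify(X_toy, labels, query, k=3)
predicted_value = knn_regress(X_toy, targets, query, k=3)
print(predicted_class, predicted_value)
```

Nothing is learned ahead of time in this sketch: every prediction recomputes the distances to all stored points, which is what "lazy" means here.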

![](https://i.ibb.co/CMtFYSX/Ek-A-klama-2020-08-28-000822.jpg)

Yellow and purple dots in the above diagram correspond to Class A and Class B in the training data. The red star indicates the test point to be classified. When k = 3, we predict Class B as output, and when k = 6, Class A.
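The effect of k in the diagram can be reproduced on a small made-up layout. The coordinates below are hypothetical and are not read from the figure; they are chosen only so that the two closest points belong to Class B while Class A dominates a slightly wider neighborhood.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical layout: the query sits at the origin.
X_toy = np.array([
    [0.1, 0.0], [0.0, 0.2],                            # two Class B points very close to the query
    [0.3, 0.0], [0.0, 0.4], [0.5, 0.0], [0.0, 0.6],    # four Class A points slightly farther out
    [2.0, 0.0], [0.0, 2.1],                            # two distant Class B points
])
y_toy = np.array(["B", "B", "A", "A", "A", "A", "B", "B"])
query = np.array([[0.0, 0.0]])

predictions = {}
for k in (3, 6):
    model = KNeighborsClassifier(n_neighbors=k).fit(X_toy, y_toy)
    predictions[k] = model.predict(query)[0]
    print(f"k={k}: predicted class {predictions[k]}")
```

With k = 3 the vote is two B against one A, and with k = 6 it is four A against two B, so the predicted class flips exactly as in the text.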

## Loss Function
KNN has no training phase, so no loss function is minimized. At test time, the k neighbors with the smallest distance to the query point determine the classification or regression output.

## Advantages
- Easy and simple machine learning model.
- Few hyperparameters to adjust.

## Disadvantages
- The value of k must be chosen carefully, as it strongly affects the result.
- High computational cost at prediction time when the sample size is large.
- Features must be scaled appropriately so that each one contributes fairly to the distance.
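The scaling point can be illustrated with a short sketch. The five "patients" below are invented values in the style of an age column and a platelet-count column; they are not rows from the data set used later.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical patients: [age in years, platelet count]
X_raw = np.array([
    [40.0, 250_000.0],   # index 0: the query patient
    [85.0, 252_000.0],   # index 1: very different age, almost identical platelets
    [42.0, 265_000.0],   # index 2: almost the same age, somewhat different platelets
    [60.0, 400_000.0],   # index 3
    [55.0, 150_000.0],   # index 4
])


def nearest_to_first(X):
    """Index of the row closest to row 0 (Euclidean distance)."""
    distances = np.linalg.norm(X[1:] - X[0], axis=1)
    return int(np.argmin(distances)) + 1


nearest_raw = nearest_to_first(X_raw)
nearest_scaled = nearest_to_first(StandardScaler().fit_transform(X_raw))
print("nearest without scaling:", nearest_raw)
print("nearest with scaling   :", nearest_scaled)
```

Without scaling, the platelet column dominates the distance because of its much larger numeric range, so the 85-year-old (index 1) looks closest. After standardization both features carry comparable weight and the 42-year-old (index 2) becomes the nearest neighbor.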

## Hyperparameters
- KNN has two main hyperparameters: the k value and the distance function.
- K value:
    - The number of neighbors that take part in the prediction. k should be chosen according to the validation error.
- Distance function:
    - Euclidean distance is the most commonly used distance function. Manhattan, Hamming, and Minkowski distances are alternatives.
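For reference, the distance functions named above can be written out directly. This is a minimal sketch with hypothetical helper names. Euclidean and Manhattan distance are the Minkowski distance with p = 2 and p = 1, and Hamming distance is taken here as the fraction of positions that differ.

```python
import numpy as np


def minkowski(a, b, p):
    """Minkowski distance of order p between two vectors."""
    return float(np.sum(np.abs(a - b) ** p) ** (1.0 / p))


def euclidean(a, b):
    """Minkowski distance with p = 2."""
    return minkowski(a, b, p=2)


def manhattan(a, b):
    """Minkowski distance with p = 1."""
    return minkowski(a, b, p=1)


def hamming(a, b):
    """Fraction of positions at which the two vectors differ."""
    return float(np.mean(a != b))


a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 6.0, 3.0])
print("Euclidean :", euclidean(a, b))        # sqrt(3^2 + 4^2 + 0^2)
print("Manhattan :", manhattan(a, b))        # 3 + 4 + 0
print("Minkowski3:", minkowski(a, b, p=3))

u = np.array([1, 0, 1, 1])
v = np.array([1, 1, 0, 1])
print("Hamming   :", hamming(u, v))          # 2 of 4 positions differ
```

In scikit-learn, `KNeighborsClassifier` takes the distance through its `metric` and `p` parameters; the default is Minkowski with `p=2`, which is the Euclidean distance.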

## Assumptions
- The input domain should be clearly understood.
- The sample size should be moderate (due to space and time constraints).
- Collinearity and outliers should be addressed before fitting the model.

## Comparison with Other Models
- The main difference between KNN and other models is that KNN does most of its computation at prediction time, so predictions are more expensive than with most other models.


### KNN vs Naive Bayes
- Naive Bayes is much faster than KNN at prediction time, because KNN computes distances to the training points for every query.
- Naive Bayes is parametric, whereas KNN is nonparametric.

### KNN vs SVM
- SVM handles outliers better than KNN.
- SVM performs better than KNN when there are many features and little training data.

### KNN vs Neural Networks (NN)
- Neural networks require more training data than KNN to achieve sufficient accuracy.
- Neural networks have many more hyperparameters to tune than KNN.

# Coding Time

![](https://ac-cdn.azureedge.net/infusionnewssiteimages/agingcare/21e637ea-aa74-4ae2-b278-181d2cded7a3.jpg)

In [None]:
# Import the necessary packages

import numpy as np
import pandas as pd

import seaborn as sns
import plotly.express as px
import matplotlib.pyplot as plt
%matplotlib inline

import sklearn
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import ConfusionMatrixDisplay  # plot_confusion_matrix was removed in scikit-learn 1.2
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
from sklearn.metrics import f1_score, recall_score, precision_score, confusion_matrix
from sklearn.metrics import r2_score, roc_auc_score, roc_curve, classification_report
from sklearn.neighbors import KNeighborsClassifier

import warnings
warnings.filterwarnings('ignore')

In [None]:
# Import and read dataset

input_ = "../input/heart-failure-clinical-data/heart_failure_clinical_records_dataset.csv"
data = pd.read_csv(input_)
df = data.copy()

data.head(10)

In [None]:
data.describe()

In [None]:
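# Separate the features from the target column
inp_data = data.drop(columns=['DEATH_EVENT'])
out_data = data['DEATH_EVENT']

# Split first, then fit the scaler on the training set only to avoid leaking test statistics
X_train, X_test, y_train, y_test = train_test_split(inp_data, out_data, test_size=0.2, random_state=42)

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)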

In [None]:
print("X_train Shape : ", X_train.shape)
print("X_test Shape  : ", X_test.shape)
print("y_train Shape : ", y_train.shape)
print("y_test Shape  : ", y_test.shape)

In [None]:
test_scores = []
train_scores = []

# Fit one model per k and record the train and test accuracy
for i in range(1,50):
    knn = KNeighborsClassifier(n_neighbors=i)
    knn.fit(X_train, y_train)

    train_scores.append(knn.score(X_train,y_train))
    test_scores.append(knn.score(X_test,y_test))

In [None]:
max_train_score = np.max(train_scores)
train_scores_ind = [i for i,v in enumerate(train_scores) if v == max_train_score]
print('Max Train Score: {}'.format(max_train_score*100))
print("k: {}".format(list(map(lambda x: x+1, train_scores_ind))))

In [None]:
max_test_score = np.max(test_scores)
test_scores_ind = [i for i,v in enumerate(test_scores) if v == max_test_score]
print('Max Test Score: {}'.format(max_test_score*100))
print("k: {}".format(list(map(lambda x: x+1, test_scores_ind))))

In [None]:
plt.figure(figsize=(12,5))
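# Recent seaborn versions require x and y to be passed as keyword arguments
p = sns.lineplot(x=list(range(1,50)), y=train_scores, marker='*', label='Train Score')
p = sns.lineplot(x=list(range(1,50)), y=test_scores, marker='o', label='Test Score')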

In [None]:
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train,y_train)
knn.score(X_test,y_test)

y_pred = knn.predict(X_test)

# confusion_matrix expects (y_true, y_pred): rows are actual classes, columns are predictions
cf_matrix = confusion_matrix(y_test, y_pred)
sns.heatmap((cf_matrix / np.sum(cf_matrix)*100), annot = True, fmt=".2f", cmap="Blues")

## Reporting
I evaluated the results with a confusion matrix. The percentages below are shares of the test set, which holds 60 of the 299 records (test_size=0.2):

### Correctly predicted -> 71.67% (43 of 60 predictions are correct)
- True Negative -> 55.00% -> Those who were predicted not to die and who did not die
- True Positive -> 16.67% -> Those who were predicted to die and who did die

### Wrongly predicted -> 28.33% (17 of 60 predictions are wrong)
- False Positive -> 3.33% -> Those who were predicted to die but who did not die
- False Negative -> 25.00% -> Those who were predicted not to die but who did die

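### Not dead
- 203 -> Those who did not die in the full data set
- 174 -> Test-set share of actual survivors (TN + FP = 58.33%, i.e. 35 of 60) scaled to 299 records

### The dead
- 96 -> Those who died in the full data set
- 125 -> Test-set share of actual deaths (TP + FN = 41.67%, i.e. 25 of 60) scaled to 299 records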
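The usual summary metrics follow directly from the four confusion-matrix cells. The sketch below assumes the test set holds 60 records (20% of 299), so the reported percentages correspond to TN = 33, TP = 10, FP = 2 and FN = 15. These counts are derived from the percentages above, not read from a separate run.

```python
# Counts implied by the reported percentages on an assumed 60-record test set
tn, tp, fp, fn = 33, 10, 2, 15
total = tn + tp + fp + fn

accuracy = (tp + tn) / total          # share of all predictions that are correct
precision = tp / (tp + fp)            # of the predicted deaths, how many were real
recall = tp / (tp + fn)               # of the real deaths, how many were found
f1 = 2 * precision * recall / (precision + recall)

print(f"total     : {total}")
print(f"accuracy  : {accuracy:.4f}")
print(f"precision : {precision:.4f}")
print(f"recall    : {recall:.4f}")
print(f"f1-score  : {f1:.4f}")
```

Under these assumptions the accuracy is about 0.72 while the recall is only 0.40: the model misses more than half of the actual deaths, which the accuracy figure alone hides.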

---

## Finding the highest score

In [None]:
# Select three feature columns by position; the last column is the target
inp_data = data.iloc[:, [4,7,11]].values
out_data = data.iloc[:,-1].values

X_train, X_test, y_train, y_test = train_test_split(inp_data, out_data, test_size=0.2, random_state=12345)

# Fit the scaler on the training set only and reuse it for the test set
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

k_range = list(range(1, 50))
w = ["distance", "uniform"]
param_grid = dict(n_neighbors=k_range, weights=w)

# Search over k and the weighting scheme with 10-fold cross-validation
grid = GridSearchCV(KNeighborsClassifier(), param_grid, cv=10, scoring='accuracy', n_jobs=-1)
grid.fit(X_train, y_train)
grid.best_params_

In [None]:
knn = KNeighborsClassifier(n_neighbors = 4, weights='distance')
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)

# scikit-learn metrics expect (y_true, y_pred) in that order
print('Accuracy Score : {:.4f}'.format(accuracy_score(y_test, y_pred)))
print('KNN f1-score   : {:.4f}'.format(f1_score(y_test, y_pred)))
print('KNN precision  : {:.4f}'.format(precision_score(y_test, y_pred)))
print('KNN recall     : {:.4f}'.format(recall_score(y_test, y_pred)))
print("\n",classification_report(y_test, y_pred))

In [None]:
# confusion_matrix expects (y_true, y_pred): rows are actual classes, columns are predictions
cf_matrix = confusion_matrix(y_test, y_pred)
sns.heatmap((cf_matrix / np.sum(cf_matrix)*100), annot = True, fmt=".2f", cmap="Blues")