# k-Nearest Neighbors

The k-nearest neighbors (KNN) algorithm is a simple, non-parametric, lazy, supervised machine learning algorithm. It uses feature similarity to solve both classification and regression problems.

Learning KNN is a good way to introduce yourself to machine learning and to classification in general. At its most basic level, KNN classifies a new point by finding the most similar data points in the training data and making an educated guess based on their labels.

Although simple to understand and implement, the method is widely applied in domains such as recommendation systems, semantic search, and anomaly detection.

## What is K-Nearest Neighbors?

**K-Nearest Neighbors is:**

* **Non-parametric**, as it makes no assumption about the underlying data distribution.
* **A lazy algorithm**, as it has no real training step: fitting essentially just stores the training data, and the stored points are consulted only at prediction time. This makes prediction the costly step.
* **A supervised machine learning algorithm**, as the target variable is known for the training data.
* Used for both **classification** and **regression**.
* Based on **feature similarity (nearest neighbors)**: it predicts the class of a new point by majority vote among its nearest neighbors, or its value by averaging their targets.

(Read Full Article [Here](https://medium.com/@rndayala/k-nearest-neighbors-a76d0831bab0))
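
The properties listed above can be illustrated with a minimal from-scratch sketch. It is not how scikit-learn implements KNN; the function name `knn_predict`, the toy points, and the choice of Euclidean distance are all assumptions made for illustration.

```python
import math
from collections import Counter

def knn_predict(train_points, train_targets, query, k=3, task="classification"):
    """Predict the target of `query` from its k nearest training points."""
    # Euclidean distance from the query to every stored training point.
    distances = [
        (math.dist(point, query), target)
        for point, target in zip(train_points, train_targets)
    ]
    # "Lazy" learning: all of the work happens here, at prediction time.
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    neighbor_targets = [target for _, target in nearest]

    if task == "classification":
        # Majority vote among the k nearest neighbors.
        return Counter(neighbor_targets).most_common(1)[0][0]
    # Regression: average the neighbors' target values.
    return sum(neighbor_targets) / k

# Toy data, made up for illustration: two features per point.
points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0), (6.0, 6.0), (7.0, 7.5), (6.5, 6.0)]
labels = [0, 0, 0, 1, 1, 1]
values = [10.0, 12.0, 11.0, 50.0, 55.0, 52.0]

predicted_label = knn_predict(points, labels, (6.2, 6.4), k=3)
predicted_value = knn_predict(points, values, (1.2, 1.3), k=3, task="regression")
print(predicted_label, predicted_value)
```

Note that there is no `fit` step at all: the "model" is just the stored `points` and their targets, which is why prediction cost grows with the size of the training set.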

## Import Libraries

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

## Import Dataset

In [1]:
train = pd.read_csv('../input/titanic/train.csv')
test = pd.read_csv('../input/titanic/test.csv')
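submission = pd.read_csv('../input/titanic/gender_submission.csv')
# Note: gender_submission.csv is Kaggle's sample submission (it assumes that only female
# passengers survive), not the true test labels, so scores against it are only indicative.
test['Survived'] = submission['Survived']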

### Combine Train and Test Datasets for Data Cleaning

In [1]:
df = pd.concat([test.assign(ind="test"), train.assign(ind="train")])

## Data Overview

In [1]:
df.head(2)

In [1]:
[train.shape , test.shape , submission.shape]

In [1]:
df.info()

In [1]:
sns.barplot(data=df, x='Sex', y= 'Survived')

In [1]:
sns.countplot(data=df, x='Survived')

In [1]:
sns.scatterplot(data=df, x='Age', y='Fare', hue='Survived')

## Data Cleaning
I covered this part in detail in [this notebook](https://www.kaggle.com/sajjadnajafi/logistic-regression-titanic/).

In [1]:
# Deal with "Cabin" Column
df = df.drop(['Cabin'] , axis=1)

# Drop identifier-like columns that are not used as features here:
df = df.drop(['PassengerId', 'Ticket', 'Name'], axis=1)

# Deal with "Age" Column
df = df.dropna(axis=0, subset=['Age'])

# Drop the rows with nulls in the Embarked and Fare columns:
df = df.dropna(axis=0, subset=['Embarked', 'Fare'])

# Treat Pclass as categorical: cast it to string so it is one-hot encoded below
df['Pclass'] = df['Pclass'].apply(str)

# Convert all object-type columns (except the 'ind' indicator) to one-hot encoding

# START ONE HOT ENCODING
df_num = df.select_dtypes(exclude='object')
df_obj = df.select_dtypes(include='object')
non_dummy_cols = ['ind']
dummy_cols = list(set(df_obj.columns) - set(non_dummy_cols))
df_obj = pd.get_dummies(df_obj, columns=dummy_cols, drop_first=True)
df = pd.concat([df_num, df_obj], axis = 1)
# END ONE HOT ENCODING

# Split Test data from df
test, train = df[df["ind"].eq("test")], df[df["ind"].eq("train")]

# Drop the indicator column from the test and train dataframes:
test= test.drop(['ind'], axis=1)
train= train.drop(['ind'], axis=1)

In [1]:
df.head(2)

In [1]:
# Determine the Features & Target Variable
# Split the Data to Train & Test
X_test, y_test = test.drop(columns='Survived').copy(), test['Survived'].copy()
X_train, y_train = train.drop(columns='Survived').copy(), train['Survived'].copy()

### Scaling the Features

In [1]:
from sklearn.preprocessing import StandardScaler
scaler= StandardScaler()
scaler.fit(X_train)
scaled_X_train= scaler.transform(X_train)
scaled_X_test= scaler.transform(X_test)
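
KNN relies on distances, so features measured on larger scales would otherwise dominate the neighbor search. The sketch below reproduces by hand what the cell above does with `StandardScaler`: the mean and standard deviation are computed from the training rows only and then reused for test rows. The helper names `fit_standardizer` and `standardize` and the (Age, Fare) rows are hypothetical, made up for illustration.

```python
import statistics

# Made-up (Age, Fare) rows, purely for illustration.
train_rows = [(22.0, 7.25), (38.0, 71.28), (26.0, 7.92), (35.0, 53.10), (54.0, 51.86)]
test_row = (30.0, 13.00)

def fit_standardizer(rows):
    """Return one (mean, std) pair per column, computed from `rows`."""
    # Population standard deviation (ddof=0), matching StandardScaler.
    return [(statistics.fmean(col), statistics.pstdev(col)) for col in zip(*rows)]

def standardize(row, params):
    """Apply previously fitted (mean, std) pairs to a single row."""
    return tuple((value - mean) / std for value, (mean, std) in zip(row, params))

# Fit on the training rows only, then reuse the same parameters for test data.
params = fit_standardizer(train_rows)
scaled_train = [standardize(row, params) for row in train_rows]
scaled_test_row = standardize(test_row, params)

# Each scaled training column should have mean ~0 and standard deviation ~1.
for column in zip(*scaled_train):
    print(round(statistics.fmean(column), 6), round(statistics.pstdev(column), 6))
print(scaled_test_row)
```

Fitting the scaler on the training data alone keeps information about the test set from leaking into the preprocessing step.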

## Train the Model

In [1]:
from sklearn.neighbors import KNeighborsClassifier
knn_model= KNeighborsClassifier(n_neighbors=1)
knn_model.fit(scaled_X_train, y_train)

## Predicting Test Data

In [1]:
y_pred= knn_model.predict(scaled_X_test)
# Predicted values vs. actual values for the test data
pd.DataFrame({'Y_Test':y_test, 'Y_Pred': y_pred})

## Evaluating the Model

In [1]:
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
accuracy_score(y_test, y_pred)

In [1]:
confusion_matrix(y_test, y_pred)

In [1]:
print(classification_report(y_test, y_pred))

## Elbow Method for Choosing Reasonable K Values

In [1]:
test_error_rate= []


for k in range (1, 30):
    knn_model = KNeighborsClassifier(n_neighbors=k)
    knn_model.fit(scaled_X_train, y_train)
    
    y_pred_test = knn_model.predict(scaled_X_test)
    
    test_error=1- accuracy_score(y_test, y_pred_test)
    test_error_rate.append(test_error)

In [1]:
test_error_rate

In [1]:
plt.figure(figsize=(10, 6))
plt.plot(range(1, 30), test_error_rate, label='Test Error')
plt.legend()
plt.ylabel('Error Rate')
plt.xlabel('K Value')
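
The loop above scores every K on the test set, so the test set ends up influencing the choice of K. A hedged alternative is to compute the same error curve with cross-validation on the training data only. The sketch below uses synthetic data from `make_classification` as a stand-in (an assumption; in the notebook you would pass `X_train` and `y_train` instead), and the names `X_demo`, `y_demo`, and `cv_error_rate` are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data; replace with X_train, y_train in the notebook.
X_demo, y_demo = make_classification(n_samples=300, n_features=6, random_state=0)

k_values = list(range(1, 30))
cv_error_rate = []
for k in k_values:
    # Scaling sits inside the pipeline, so each fold is scaled using
    # statistics from its own training part only.
    model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k))
    scores = cross_val_score(model, X_demo, y_demo, cv=5, scoring="accuracy")
    cv_error_rate.append(1 - scores.mean())

# K with the lowest mean cross-validated error.
best_k = k_values[cv_error_rate.index(min(cv_error_rate))]
print(best_k)
```

The resulting `cv_error_rate` list can be plotted against `k_values` exactly like the test error curve, while the test set stays untouched for the final evaluation.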

## Creating a Pipeline to Find the K Value

In [1]:
scaler= StandardScaler()
knn= KNeighborsClassifier()
knn.get_params().keys()

In [1]:
operations= [('scaler', scaler), ('knn', knn)]

In [1]:
from sklearn.pipeline import Pipeline
pipe= Pipeline(operations)
from sklearn.model_selection import GridSearchCV
k_values= list(range(1, 20))
param_grid= {'knn__n_neighbors': k_values}
full_cv_classifier= GridSearchCV(pipe, param_grid, cv=5, scoring='accuracy')
full_cv_classifier.fit(X_train, y_train)

In [1]:
full_cv_classifier.best_estimator_.get_params()

In [1]:
full_cv_classifier.cv_results_.keys()
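
Rather than scanning the full parameter dictionary, the selected K and its score can be read directly from the fitted search object. The sketch below shows this on synthetic stand-in data (an assumption, so its result will not match the Titanic run); `demo_pipe` and `demo_search` are illustrative names mirroring `pipe` and `full_cv_classifier`.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data; the notebook would use X_train, y_train.
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=1)

demo_pipe = Pipeline([("scaler", StandardScaler()), ("knn", KNeighborsClassifier())])
demo_search = GridSearchCV(
    demo_pipe,
    {"knn__n_neighbors": list(range(1, 20))},
    cv=5,
    scoring="accuracy",
)
demo_search.fit(X_demo, y_demo)

# The winning K, its mean cross-validated accuracy, and the score for every K tried.
best_k = demo_search.best_params_["knn__n_neighbors"]
best_cv_accuracy = demo_search.best_score_
mean_scores = demo_search.cv_results_["mean_test_score"]
print(best_k, round(best_cv_accuracy, 3))
```

With the default `refit=True`, the search object also refits the best pipeline on all the data it was given, so `demo_search.predict(...)` can be used directly.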

## Final Model

In [1]:
scaler= StandardScaler()
knn14= KNeighborsClassifier(n_neighbors=14)
operations= [('scaler', scaler), ('knn14', knn14)]

In [1]:
pipe= Pipeline(operations)
pipe.fit(X_train, y_train)

In [1]:
pipe_pred= pipe.predict(X_test)
print(classification_report(y_test, pipe_pred))

In [1]:
sample= X_test.iloc[35]
sample

In [1]:
sample.values

In [1]:
sample.values.reshape(1, -1)

In [1]:
pipe.predict(sample.values.reshape(1, -1))

In [1]:
pipe.predict_proba(sample.values.reshape(1, -1))
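
For a KNN classifier with the default uniform weights, the values returned by `predict_proba` are simply the fraction of the K nearest neighbors that belong to each class. A small sketch on made-up one-dimensional data (the names `X_toy`, `y_toy`, and `toy_knn` are illustrative) makes this concrete:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Made-up 1-D data: two points of class 0, four points of class 1.
X_toy = np.array([[0.0], [1.0], [2.0], [10.0], [11.0], [12.0]])
y_toy = np.array([0, 0, 1, 1, 1, 1])

toy_knn = KNeighborsClassifier(n_neighbors=3).fit(X_toy, y_toy)
query = np.array([[1.4]])

# The three nearest points are x=1 (class 0), x=2 (class 1), and x=0 (class 0),
# so two of the three neighbors vote for class 0.
proba = toy_knn.predict_proba(query)
label = toy_knn.predict(query)
print(proba, label)
```

Because the probabilities are vote fractions, a model with `n_neighbors=14` can only output multiples of 1/14 for each class.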