K Nearest Neighbors (KNN) is a supervised machine learning algorithm that can be used for both classification and regression. It does not build a model that generalizes from the training data; it simply stores the samples, which is why it is described as a **lazy learning method**. Methods that fit a model during training, e.g. neural networks, SVMs and tree-based methods, are called **eager learning methods**. Lazy learners are slow and computationally expensive at inference time, whereas eager learners do the hard work during training. Lazy learning is especially suitable when the training data is updated very often, because there is no model to retrain.

KNN is also a nonparametric method: it does not assume a fixed functional form or distribution for the data, and its effective complexity grows with the number of training samples. A parametric method, in contrast, makes strong assumptions. For example, if you fit a probability distribution to your data and assume it is Gaussian, you only need to estimate the mean and standard deviation. If the assumption is consistent with the data, the method gives good results; otherwise it may fail. Because it makes no such assumption, KNN can handle both linear and nonlinear decision boundaries.

Now, think about classification. How does KNN classify without training? When the class of a test sample is queried, KNN measures the similarity between the test sample and every training sample, using a distance metric. The assumption is that two samples that are close to each other in feature space probably belong to the same class. Searching for the closest points in a set is called nearest neighbor search. KNN takes the K closest training samples and decides by majority vote: the mode of the classes of the K nearest neighbors is the predicted class of the test sample.
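
The procedure described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the implementation used later in this notebook: the function name `knn_predict` and the toy data are hypothetical, and Euclidean distance is assumed.

```python
import numpy as np
from collections import Counter


def knn_predict(train_X, train_y, query, k=3):
    """Classify one query point by majority vote among its k nearest neighbors."""
    # Euclidean distance from the query to every training sample
    dists = np.linalg.norm(train_X - query, axis=1)
    # indices of the k training samples with the smallest distances
    nearest = np.argsort(dists)[:k]
    # the mode of the neighbors' labels is the prediction
    return Counter(train_y[nearest].tolist()).most_common(1)[0][0]


# toy data (hypothetical): two well-separated clusters
toy_X = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                  [5.0, 5.0], [5.1, 4.9], [4.9, 5.2]])
toy_y = np.array([0, 0, 0, 1, 1, 1])

pred_a = knn_predict(toy_X, toy_y, np.array([0.3, 0.3]), k=3)
pred_b = knn_predict(toy_X, toy_y, np.array([4.8, 5.0]), k=3)
print(pred_a, pred_b)
```

Note that nothing is "trained": all the work happens inside `knn_predict` at query time, which is what makes the method lazy.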

K is a hyperparameter that controls how sensitive KNN is to individual training samples. As K increases, more samples vote, so the prediction is less sensitive to noise. A large K gives low variance and high bias (underfitting), whereas a small K gives high variance and low bias (overfitting).
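
The effect of K can be illustrated on synthetic data. The sketch below is only an illustration under assumed settings (the dataset from `make_classification`, the label noise level and the K values are arbitrary choices, not part of this notebook's analysis). With K = 1 every training point is its own nearest neighbor, so the training accuracy is perfect, which is the high-variance extreme.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# synthetic, noisy two-class problem (assumed settings, for illustration only)
X, y = make_classification(n_samples=600, n_features=10, n_informative=5,
                           flip_y=0.1, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3,
                                          stratify=y, random_state=0)

scores = {}
for k in (1, 15, 101):
    clf = KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr)
    # (training accuracy, test accuracy) for this K
    scores[k] = (clf.score(X_tr, y_tr), clf.score(X_te, y_te))
    print("K={:3d}  train acc={:.3f}  test acc={:.3f}".format(k, *scores[k]))
```

A large gap between training and test accuracy points to high variance; a low score on both points to high bias.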

The outline of the work is as follows:

* Load Data
* Feature Engineering
* Split Data
* Outlier Check with IQR
* Visualization
* Standardization
* Correlation Analysis
* KNN with Brute NN Search
* KNN with KDTree

In [None]:
import numpy as np
import pandas as pd

import seaborn as sea

from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import LabelBinarizer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import PolynomialFeatures
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import classification_report

import matplotlib.pyplot as plt
import matplotlib.gridspec as gridspec

In [None]:
sea.set_style("darkgrid")

## Load Data

The **Telco Customer Churn** dataset is used. It includes the following information:

* Customers who left within the last month – target column **Churn**
* Services that each customer has signed up for – phone, multiple lines, internet, online security, online backup, device protection, tech support and streaming TV and movies
* Customer account information – how long they’ve been a customer, contract, payment method, paperless billing, monthly charges and total charges
* Demographic info about customers – gender, age range, and if they have partners and dependents

The dataset is loaded from the input CSV file into a **Pandas** dataframe. Pandas styling is used to customize the look of the table.

In [None]:
data = pd.read_csv("/kaggle/input/telco-customer-churn/"
                   "WA_Fn-UseC_-Telco-Customer-Churn.csv")

# Styler.set_precision was removed in pandas 2.0; format(precision=...) replaces it
data.head(10).style.format(precision=2). \
                    set_properties(**{"min-width": "80px"}). \
                    set_properties(**{"color": "#111111"}). \
                    set_properties(**{"text-align": "center"}). \
                    set_table_styles([
                          {"selector": "th",
                           "props": [("font-weight", "bold"),
                                     ("font-size", "12px"),
                                     ("text-align", "center")]},
                          {"selector": "tr:nth-child(even)",
                           "props": [("background-color", "#f2f2f2")]},
                          {"selector": "tr:nth-child(odd)",
                           "props": [("background-color", "#fdfdfd")]},
                          {"selector": "tr:hover",
                           "props": [("background-color", "#bcbcbc")]}])

**customerID** is a unique identifier with no predictive value for churn, so it's dropped.

In [None]:
data.drop("customerID", axis=1, inplace=True)

Features and corresponding labels are assigned to **data_X** and **data_Y**, respectively. Using Pandas info function, we inspect column data types and number of non-null values in data_X and data_Y.

In [None]:
# disable SettingWithCopyWarning
pd.options.mode.chained_assignment = None

# take explicit copies so later column assignments do not touch `data`
data_X = data.loc[:, data.columns != "Churn"].copy()
data_Y = data[["Churn"]].copy()

print("\ndata_X info:\n")
data_X.info()
print("\ndata_Y info:\n")
data_Y.info()

The dataset has 7043 rows (samples) and 19 features. The target column **Churn** is categorical.

## Feature Engineering

The unique values each feature can take are inspected below.

In [None]:
for c in data_X.columns:
    
    print("Feature name: {}".format(c))
    print("Unique values:\n")
    print(data_X[c].unique())
    print("\n--------------------------------------------------\n")

Features gender, Partner, Dependents, PhoneService, MultipleLines, InternetService, OnlineSecurity, OnlineBackup, DeviceProtection, TechSupport, StreamingTV, StreamingMovies, Contract, PaperlessBilling and PaymentMethod are categorical. Although the data type of SeniorCitizen is int64, it is also categorical: it only takes the values 0 and 1.

Features tenure, MonthlyCharges and TotalCharges are numeric. Note that the data type of TotalCharges is object (as info() shows). On closer inspection, its entries are float values stored as strings.

Spaces are removed from each entry of TotalCharges, then the column is converted to float. Entries that consisted only of spaces become empty strings, which to_numeric parses as NaN.

In [None]:
data_X["TotalCharges"] = [s.replace(" ","")
                          for s in data_X["TotalCharges"]]
data_X["TotalCharges"] = pd.to_numeric(data_X["TotalCharges"])

We have to check if there are any null values in TotalCharges.

In [None]:
data_X["TotalCharges"].isnull().sum()

Impute null values with mean.

In [None]:
# assign the result back instead of using inplace=True on a column selection
data_X["TotalCharges"] = data_X["TotalCharges"].fillna(
    data_X["TotalCharges"].mean())

Categorical and numeric features:

In [None]:
cat = ["gender", "SeniorCitizen", "Partner", "Dependents", "PhoneService",
       "MultipleLines", "InternetService", "OnlineSecurity",
       "OnlineBackup", "DeviceProtection", "TechSupport",
       "StreamingTV", "StreamingMovies", "Contract",
       "PaperlessBilling", "PaymentMethod"]

num = ["tenure", "MonthlyCharges", "TotalCharges"]

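Categorical features are converted to numeric with one-hot encoding. The first category of each feature is dropped, because the full set of dummy columns for a feature always sums to 1 and would therefore be perfectly collinear.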

In [None]:
enc = OneHotEncoder(drop="first")
enc.fit(data_X[cat]);

# get_feature_names was removed in scikit-learn 1.2
cat2 = enc.get_feature_names_out(cat)
data_X_C = pd.DataFrame(enc.transform(data_X[cat]).toarray(),
                        columns = cat2)

Numeric and one-hot-encoded categorical features are combined into a single dataframe.

In [None]:
data_X = pd.concat([data_X_C, data_X[num]], axis=1)

Feature names are:

In [None]:
feature_names = data_X.columns
print(feature_names)

Unique values of target variable:

In [None]:
data_Y["Churn"].unique()

There are 2 output classes as expected. Churn is converted to binary.

In [None]:
lb = LabelBinarizer()

lb.fit(data_Y["Churn"]);
data_Y["Churn"] = lb.transform(data_Y["Churn"])

## Split Data

The dataset is split into training and test sets. The stratify parameter of train_test_split keeps the class distribution the same across the two sets.

In [None]:
train_X, test_X, train_Y, test_Y = train_test_split(data_X, data_Y,
                                                    test_size=0.2,
                                                    shuffle = True,
                                                    stratify=data_Y,
                                                    random_state=0)

train_X.reset_index(drop=True, inplace=True);
test_X.reset_index(drop=True, inplace=True);
train_Y.reset_index(drop=True, inplace=True);
test_Y.reset_index(drop=True, inplace=True);
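
The effect of stratification can be checked on a toy example. The labels below are hypothetical (75% class 0, 25% class 1); the point is only that both splits keep roughly the original class ratio.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# hypothetical imbalanced labels: 750 negatives, 250 positives
y = np.array([0] * 750 + [1] * 250)
X = np.arange(len(y)).reshape(-1, 1)  # dummy single feature

_, _, y_tr, y_te = train_test_split(X, y, test_size=0.2,
                                    stratify=y, random_state=0)

# fraction of positives in each split
print("train positive rate:", y_tr.mean())
print("test positive rate: ", y_te.mean())
```

The same check can be run on train_Y and test_Y with `value_counts(normalize=True)`.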

## Outlier Check with IQR

Numeric features are analyzed for outliers using **interquartile range (IQR)**.

In [None]:
fig, axes = plt.subplots(1, 3, figsize=(10,6))
for i, c in enumerate(train_X[num]):
    # pass the column as y explicitly to get a vertical box plot
    sea.boxplot(y=train_X[c], color = "#6f7501",
                width = 0.2, ax=axes[i])
    
fig.tight_layout(pad=3.0)

No points fall outside the whiskers of the box plots (1.5 × IQR beyond the quartiles by default), so there are no outliers.
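
The rule behind the box-plot whiskers can also be applied numerically. The sketch below is a minimal version of the IQR check; the helper name `iqr_outlier_mask` and the sample values are hypothetical, and the factor 1.5 is the conventional default.

```python
import numpy as np


def iqr_outlier_mask(values, whis=1.5):
    """Return a boolean mask marking values outside [Q1 - whis*IQR, Q3 + whis*IQR]."""
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - whis * iqr, q3 + whis * iqr
    return (values < lower) | (values > upper)


# hypothetical sample with one obvious outlier
sample = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 100.0])
mask = iqr_outlier_mask(sample)
print(sample[mask])
```

Applying the same function to each column in `num` and summing the mask gives the number of outliers per feature.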

## Visualization

Categorical Features

In [None]:
fig = plt.figure(figsize=(10,66))
gs = gridspec.GridSpec(nrows=19, ncols=2, figure=fig)

for i, c in enumerate(train_X[cat2]):
    y, x = i // 2, i % 2  # np.int was removed in NumPy 1.24
    ax = fig.add_subplot(gs[y,x])    
    sea.distplot(train_X.loc[train_Y["Churn"]==0,c], kde = False,
                 color = "#004a4d", hist_kws = dict(alpha=0.7),
                 bins=10, label="Churn_No", ax=ax);
    sea.distplot(train_X.loc[train_Y["Churn"]==1,c], kde = False,
                 color = "#7d0101", hist_kws = dict(alpha=0.7),
                 bins=10, label="Churn_Yes", ax=ax);

ax.legend(loc="center left", bbox_to_anchor=(1.5,0.5),
          prop={"size":12});

Numerical Features

In [None]:
fig = plt.figure(figsize=(10,8))
gs = gridspec.GridSpec(nrows=2, ncols=2, figure=fig)

for i, c in enumerate(train_X[num]):
    y, x = i // 2, i % 2  # np.int was removed in NumPy 1.24
    ax = fig.add_subplot(gs[y,x])    
    sea.distplot(train_X.loc[train_Y["Churn"]==0,c], kde = True,
                 color = "#004a4d", hist_kws = dict(alpha=0.8),
                 bins=20, label="Churn_No", ax=ax);
    sea.distplot(train_X.loc[train_Y["Churn"]==1,c], kde = True,
                 color = "#7d0101", hist_kws = dict(alpha=0.5),
                 bins=20, label="Churn_Yes", ax=ax);

ax.legend(loc="center left", bbox_to_anchor=(1.5,0.5),
          prop={"size":12});

## Standardization

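KNN relies on distances, so features on larger scales would otherwise dominate. StandardScaler is fit only on the training data to prevent data leakage; the test data is transformed with the statistics learned from the training set.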

In [None]:
scaler = StandardScaler()

# fit to train_X
scaler.fit(train_X)

# transform train_X
train_X = scaler.transform(train_X)
train_X = pd.DataFrame(train_X, columns = feature_names)

# transform test_X
test_X = scaler.transform(test_X)
test_X = pd.DataFrame(test_X, columns = feature_names)
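
The fit-on-train, transform-both pattern can be verified on synthetic data. The two features below are hypothetical and only mimic columns on very different scales (think tenure versus total charges).

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# hypothetical features on very different scales
train = np.column_stack([rng.uniform(0, 72, 200), rng.uniform(0, 8000, 200)])
test = np.column_stack([rng.uniform(0, 72, 50), rng.uniform(0, 8000, 50)])

# the mean and standard deviation are learned from the training split only
scaler = StandardScaler().fit(train)
train_s = scaler.transform(train)
test_s = scaler.transform(test)

print(train_s.mean(axis=0), train_s.std(axis=0))  # ~0 and ~1 by construction
print(test_s.mean(axis=0), test_s.std(axis=0))    # close to, but not exactly, 0 and 1
```

The test statistics are only approximately 0 and 1, which is expected: the test set is treated as unseen data.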

## Correlation Analysis

In [None]:
corr_matrix = pd.concat([train_X, train_Y], axis=1).corr()
mask = np.triu(np.ones_like(corr_matrix, dtype=bool))  # np.bool was removed in NumPy 1.24

plt.figure(figsize=(10,8))
sea.heatmap(corr_matrix,annot=False, fmt=".1f", vmin=-1,
            vmax=1, linewidths = 1,
            center=0, mask=mask,cmap="RdBu_r");

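Some features are dropped due to high correlation. The "No internet service" dummy columns are perfectly correlated with each other and with InternetService_No, and MultipleLines_No phone service is perfectly (negatively) correlated with PhoneService_Yes, so they carry no extra information. Then, dataframes are converted to numpy arrays.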

In [None]:
drop = ["OnlineSecurity_No internet service",
        "OnlineBackup_No internet service",
        "DeviceProtection_No internet service",
        "TechSupport_No internet service",
        "StreamingTV_No internet service",
        "StreamingMovies_No internet service",
        "MultipleLines_No phone service"]

for d in drop:
    train_X.drop(d, axis=1, inplace=True)
    test_X.drop(d, axis=1, inplace=True)
    
np_train_X = train_X.values
np_train_Y = train_Y.values.ravel()
np_test_X = test_X.values
np_test_Y = test_Y.values.ravel()

## KNN with Brute NN Search

We first try KNN with brute-force nearest neighbor search. Grid search with stratified 5-fold cross validation is used to find the best hyperparameters. The distance metric is Minkowski and its order p is searched over 1 and 2: when p equals 1, Minkowski is the Manhattan distance, and when p equals 2, it is the Euclidean distance. During grid search, each parameter combination is scored (here with F1) on the cross validation folds. The combination with the highest mean score is returned as the best parameter set.

In [None]:
knn_cls = KNeighborsClassifier()
parameters = {
    "n_neighbors": range(30, 50, 2),
    "metric": ["minkowski"],
    "p": [1.0, 2.0],
    "algorithm": ["brute"]
}

skf_cv = StratifiedKFold(n_splits=5, random_state=0, shuffle=True)
gscv = GridSearchCV(
    estimator=knn_cls,
    param_grid=parameters,
    scoring="f1",
    n_jobs=-1,
    cv=skf_cv,
    verbose=False
)

gscv.fit(np_train_X, np_train_Y)
print("Best parameters {}".format(gscv.best_params_))
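
For reference, the Minkowski distance of order p between two points a and b is the p-th root of the sum of |a_i - b_i|^p. The helper below is a small sketch of that formula (the function name `minkowski` and the example points are hypothetical); scikit-learn computes the same quantity internally.

```python
import numpy as np


def minkowski(a, b, p):
    """Minkowski distance of order p between two 1-D arrays."""
    return float(np.sum(np.abs(a - b) ** p) ** (1.0 / p))


a = np.array([0.0, 0.0])
b = np.array([3.0, 4.0])

d1 = minkowski(a, b, 1)  # Manhattan: |3| + |4|
d2 = minkowski(a, b, 2)  # Euclidean: sqrt(3^2 + 4^2)
print(d1, d2)
```

For these two points the Manhattan distance is 7 and the Euclidean distance is 5, so the choice of p changes which neighbors count as closest.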

We train a new KNeighborsClassifier with the best parameters on np_train_X. Then we make predictions on np_test_X.

In [None]:
knn_cls = KNeighborsClassifier(**gscv.best_params_)
knn_cls.fit(np_train_X, np_train_Y)
y_pred = knn_cls.predict(np_test_X)
print(classification_report(np_test_Y, y_pred,
                            target_names=["Churn No", "Churn Yes"]))

## KNN with KDTree

In nearest neighbor search, a data structure such as a k-d tree can be used instead of brute-force search. A k-d tree records which region of the feature space each training sample lies in. At inference time, it leads the query to the neighborhood of the test sample, so that only nearby points have to be examined. This makes search in low-dimensional spaces efficient; the advantage shrinks as the number of dimensions grows.
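
As a quick check that a k-d tree returns the same neighbors as brute-force search, the sketch below queries scikit-learn's `KDTree` directly on random points and compares the result with a full distance computation. The data and sizes are arbitrary assumptions for illustration.

```python
import numpy as np
from sklearn.neighbors import KDTree

rng = np.random.default_rng(0)
points = rng.normal(size=(500, 3))   # hypothetical "training" points
queries = rng.normal(size=(5, 3))    # hypothetical query points

# k-d tree search: distances and indices of the 4 nearest points per query
tree = KDTree(points, leaf_size=2)
tree_dist, tree_idx = tree.query(queries, k=4)

# brute force: all pairwise Euclidean distances, then sort each row
all_dist = np.linalg.norm(queries[:, None, :] - points[None, :, :], axis=2)
brute_idx = np.argsort(all_dist, axis=1)[:, :4]
brute_dist = np.take_along_axis(all_dist, brute_idx, axis=1)

print(np.allclose(tree_dist, brute_dist))
```

The tree only changes how the neighbors are found, not which neighbors are found.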

When a k-d tree is built, each node splits the data along one dimension (one feature). The split point is the median of the points along that dimension, and the separating hyperplane is orthogonal to that axis. Points on the left of the hyperplane go to the left child and the rest go to the right child. By choosing the maximum number of points per leaf, we slice the space into subspaces of the desired resolution. When a leaf is reached, we are left with a small set of candidate training points and KNeighborsClassifier switches to brute-force search on that set.
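
The construction described above can be sketched as a short recursive function. This is a simplified illustration, not scikit-learn's implementation: the names `build_kdtree` and `collect_leaves` are hypothetical, and the split dimension is assumed to cycle with depth, which is only one common way to choose it.

```python
import numpy as np


def build_kdtree(points, indices=None, depth=0, leaf_size=2):
    """Recursively split point indices at the median of one dimension."""
    if indices is None:
        indices = np.arange(len(points))
    # stop splitting once the node is small enough: it becomes a leaf
    if len(indices) <= leaf_size:
        return {"leaf": True, "indices": indices}
    axis = depth % points.shape[1]            # assumed rule: cycle through dimensions
    order = indices[np.argsort(points[indices, axis])]
    mid = len(order) // 2                     # median position along this axis
    return {
        "leaf": False,
        "axis": axis,
        "split": points[order[mid], axis],    # hyperplane orthogonal to the axis
        "left": build_kdtree(points, order[:mid], depth + 1, leaf_size),
        "right": build_kdtree(points, order[mid:], depth + 1, leaf_size),
    }


def collect_leaves(node):
    """Return the index arrays stored in all leaves of the tree."""
    if node["leaf"]:
        return [node["indices"]]
    return collect_leaves(node["left"]) + collect_leaves(node["right"])


rng = np.random.default_rng(0)
pts = rng.normal(size=(37, 2))                # hypothetical 2-D points
kd_root = build_kdtree(pts, leaf_size=2)
leaf_sets = collect_leaves(kd_root)
print(len(leaf_sets), max(len(s) for s in leaf_sets))
```

Every point ends up in exactly one leaf, and no leaf holds more than `leaf_size` points, which is the "resolution" mentioned above.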

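Below, we run another grid search with KNeighborsClassifier to find the best K and Minkowski p when the algorithm parameter is set to kd_tree. This time there is an extra parameter, leaf_size, which sets the number of points at which the tree switches to brute-force search. Since k-d tree search is exact, leaf_size mainly affects query speed and memory rather than the predictions.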

In [None]:
knn_cls = KNeighborsClassifier()
parameters = {
    "n_neighbors": range(40, 60, 2),
    "leaf_size": [1, 2, 3],
    "metric": ["minkowski"],
    "p": [1.0, 2.0],
    "algorithm": ["kd_tree"]
}

skf_cv = StratifiedKFold(n_splits=5, random_state=0, shuffle=True)
gscv = GridSearchCV(
    estimator=knn_cls,
    param_grid=parameters,
    scoring="f1",
    n_jobs=-1,
    cv=skf_cv,
    verbose=False
)

gscv.fit(np_train_X, np_train_Y)
print("Best parameters {}".format(gscv.best_params_))

We train a new KNeighborsClassifier with the best parameters on np_train_X. Then we make predictions on np_test_X.

In [None]:
knn_cls = KNeighborsClassifier(**gscv.best_params_)
knn_cls.fit(np_train_X, np_train_Y)
y_pred = knn_cls.predict(np_test_X)
print(classification_report(np_test_Y, y_pred,
                            target_names=["Churn No", "Churn Yes"]))