# KNN

K-nearest neighbors (KNN) is a non-parametric algorithm that classifies a sample by majority vote among its k closest training samples. TensorFlow is a deep learning library, and KNN is not a deep learning algorithm, so I recommend Scikit-learn as the more appropriate library for this task. Below is an implementation of a KNN model using Scikit-learn:

In [97]:
import numpy as np
import pandas as pd
import tensorflow as tf
from tqdm import tqdm
import json
import os
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix, f1_score, make_scorer, recall_score
from imblearn.over_sampling import SMOTE

from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier

import shap

In [98]:
# Load data
train = pd.read_csv("dataset_original.csv")
#train = pd.read_csv("dataset_160k.csv")

# shuffle the dataset
train = train.sample(frac=1).reset_index(drop=True)

In [99]:
# PreferedOrderCat
train.loc[train["PreferedOrderCat"] == "Laptop & Accessory", "PreferedOrderCat"] = "Laptop_Accessory"
train.loc[train["PreferedOrderCat"] == "Mobile Phone", "PreferedOrderCat"] = "Mobile"

#PreferredPaymentMode
train.loc[train["PreferredPaymentMode"] == "Debit Card", "PreferredPaymentMode"] = "DebitCard"
train.loc[train["PreferredPaymentMode"] == "Credit Card", "PreferredPaymentMode"] = "CreditCard"
train.loc[train["PreferredPaymentMode"] == "CC", "PreferredPaymentMode"] = "CreditCard"
train.loc[train["PreferredPaymentMode"] == "E wallet", "PreferredPaymentMode"] = "Ewallet"
train.loc[train["PreferredPaymentMode"] == "Cash on Delivery", "PreferredPaymentMode"] = "COD"

#PreferredLoginDevice
train.loc[train["PreferredLoginDevice"] == "Mobile Phone", "PreferredLoginDevice"] = "Mobile"
train.loc[train["PreferredLoginDevice"] == "Phone", "PreferredLoginDevice"] = "Mobile"

In [100]:
# Drop the 'CustomerID' column since it's not useful for prediction
X = train.drop('CustomerID', axis=1)

# Separate the target variable from the rest of the dataset
y = train['Churn']

X = X.drop(columns=['Churn'])
#X = X.drop(columns=[], axis=1)

# Perform one-hot encoding on the categorical features
cat_cols = ['PreferredLoginDevice', 'PreferredPaymentMode', 'PreferedOrderCat','Gender','MaritalStatus']
X = pd.get_dummies(X, columns=cat_cols)

# Fill missing values with mean
#X = X.fillna(0)
#X.fillna(X.mode().iloc[0], inplace=True)
X.fillna(X.mean(), inplace=True)
# X.fillna(X.median(), inplace=True)
# X.fillna(method='ffill', inplace=True)
# X.fillna(method='bfill', inplace=True)
#X.interpolate(method='linear', inplace=True)

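#y = y.fillna(0)
# Impute missing labels with the most frequent class: the mean of a 0/1 target
# is fractional and would be truncated to 0 by the later astype(int).
y = y.fillna(y.mode().iloc[0])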
# y.fillna(y.median(), inplace=True)
# y.fillna(method='ffill', inplace=True)
# y.fillna(method='bfill', inplace=True)
#y.interpolate(method='linear', inplace=True)

In [101]:
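# After get_dummies the original categorical columns no longer exist, so pick the
# numerical columns by excluding the one-hot columns (named "<feature>_<value>").
dummy_prefixes = tuple(f"{col}_" for col in cat_cols)
num_cols = [col for col in X.columns if not col.startswith(dummy_prefixes)]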

In [102]:
# Normalize the numerical features using min-max scaling
X[num_cols] = (X[num_cols] - X[num_cols].min()) / (X[num_cols].max() - X[num_cols].min())

# Another way to normalize:
#X = (X - np.mean(X, axis=0)) / np.std(X, axis=0)
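
The min-max formula in the cell above takes its minimum and maximum from the whole dataset, before the train/test split. A minimal sketch of the alternative, assuming the bounds are learned from the training rows only and then reused on the test rows, could look like this. The helper names `fit_min_max` and `apply_min_max` and the toy values are hypothetical and are not part of the notebook.

```python
def fit_min_max(rows):
    """Return per-column (min, max) pairs learned from a list of numeric rows."""
    columns = list(zip(*rows))
    return [(min(col), max(col)) for col in columns]


def apply_min_max(rows, bounds):
    """Scale each column with previously learned (min, max) bounds."""
    scaled = []
    for row in rows:
        scaled_row = []
        for value, (lo, hi) in zip(row, bounds):
            span = hi - lo
            # A constant column has span 0; map it to 0.0 instead of dividing by zero.
            scaled_row.append((value - lo) / span if span else 0.0)
        scaled.append(scaled_row)
    return scaled


# Hypothetical toy data with two numeric features per row.
train_rows = [[1.0, 10.0], [5.0, 30.0], [9.0, 20.0]]
test_rows = [[3.0, 40.0]]

bounds = fit_min_max(train_rows)                 # learned from the training rows only
train_scaled = apply_min_max(train_rows, bounds)
test_scaled = apply_min_max(test_rows, bounds)

print(train_scaled)  # [[0.0, 0.0], [0.5, 1.0], [1.0, 0.5]]
print(test_scaled)   # [[0.25, 1.5]]
```

Test values can fall outside [0, 1] when they lie beyond the range seen in training, as the second feature does here. That is expected and shows that no test information leaked into the scaling.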

In [103]:
# Convert the target variable to binary (0 or 1)
y = y.astype(int)

In [104]:
print(X.shape)
print(y.shape)

(5630, 30)
(5630,)


In [105]:
def apply_smote(X, y, random_state=None):
    """
    Applies SMOTE to the input features (X) and target variable (y) to balance the dataset.
    
    Parameters:
    X: numpy array or pandas DataFrame with the input features
    y: numpy array or pandas Series with the target variable
    random_state: int, default=None, controls the randomness of the SMOTE algorithm
    
    Returns:
    X_resampled: numpy array with the resampled input features
    y_resampled: numpy array with the resampled target variable
    """
    smote = SMOTE(random_state=random_state)
    X_resampled, y_resampled = smote.fit_resample(X, y)
    return X_resampled, y_resampled

In [106]:
#X, y = apply_smote(X, y, random_state=42)
print(X.shape)
print(y.shape)

(5630, 30)
(5630,)


In [107]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print(X_train.shape, y_train.shape, X_test.shape, y_test.shape)

(4504, 30) (4504,) (1126, 30) (1126,)


In [108]:
# Standardize the features
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

In [109]:
# Choose the number of neighbors (k)
k = 1

In [110]:
# Train the KNN model
knn_classifier = KNeighborsClassifier(n_neighbors=k)
knn_classifier.fit(X_train, y_train)

In [111]:
# Make predictions
y_pred = knn_classifier.predict(X_test)
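
To make the prediction step concrete, here is a minimal pure-Python sketch of what a KNN classifier does for one query point: compute the Euclidean distance to every training point, keep the k closest, and take a majority vote. The function `knn_predict` and the toy points are illustrative assumptions. Scikit-learn's `KNeighborsClassifier` uses optimized neighbor searches and its own tie-breaking, so this is not its actual implementation.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k=1):
    """Predict the label of `query` by majority vote among its k nearest training points."""
    # Euclidean distance from the query to every training point.
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    # Keep the k closest neighbors.
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    # Majority vote over the neighbors' labels (ties resolved by first occurrence,
    # which is a simplification).
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical toy data: three points of class 0 near the origin, two of class 1 near (1, 1).
train_points = [(0.0, 0.0), (0.1, 0.2), (1.0, 1.0), (0.9, 1.1), (0.2, 0.1)]
train_labels = [0, 0, 1, 1, 0]
query = (0.8, 0.9)

print(knn_predict(train_points, train_labels, query, k=1))  # 1
print(knn_predict(train_points, train_labels, query, k=3))  # 1
print(knn_predict(train_points, train_labels, query, k=5))  # 0
```

With k = 5 every training point votes, so the majority class wins regardless of distance. This is why a large k tends to hurt recall on a minority class.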

In [112]:
# Evaluate the model
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
print("Accuracy:", accuracy_score(y_test, y_pred))

[[930   5]
 [ 20 171]]
              precision    recall  f1-score   support

           0       0.98      0.99      0.99       935
           1       0.97      0.90      0.93       191

    accuracy                           0.98      1126
   macro avg       0.98      0.94      0.96      1126
weighted avg       0.98      0.98      0.98      1126

Accuracy: 0.977797513321492


RESULTS ACHIEVED WITHOUT CLASS BALANCING:

[[908  24]
 [ 75 119]]
              precision    recall  f1-score   support

           0       0.92      0.97      0.95       932
           1       0.83      0.61      0.71       194

    accuracy                           0.91      1126
   macro avg       0.88      0.79      0.83      1126
weighted avg       0.91      0.91      0.91      1126

Accuracy: 0.9120781527531083


Let's analyze the performance metrics from the run without class balancing:

Recall (Sensitivity): Recall is the proportion of actual positive cases that were correctly identified by the model. The recall for class 1 (positive class) is 0.61, meaning that 61% of the positive cases were correctly identified. In some applications, this might be considered low, especially if false negatives have significant consequences (e.g., medical diagnosis, fraud detection). For class 0 (negative class), the recall is 0.97, which indicates that the model is better at identifying the negative class.

Precision: Precision represents the proportion of predicted positive cases that were actually positive. The precision for class 1 is 0.83, meaning that 83% of the instances predicted as positive were indeed positive. For class 0, the precision is 0.92, indicating that the model is more precise in predicting the negative class.

F1-score: The F1-score is the harmonic mean of precision and recall. It's useful when you want to balance both metrics. For class 1, the F1-score is 0.71, and for class 0, it's 0.95. The F1-score shows that the model has better performance in predicting the negative class.

Accuracy: The overall accuracy of the model is 0.912 (91.2%), which means that the model correctly classifies 91.2% of the cases. However, accuracy can be misleading if the dataset is imbalanced, so it's essential to look at other metrics like precision, recall, and F1-score.

In general, the model performs well in classifying the negative class (class 0) but has lower performance for the positive class (class 1). Depending on the specific problem and the importance of minimizing false negatives, you may want to optimize the model for recall to improve its performance in identifying the positive class.
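
The class 1 figures discussed above can be recomputed by hand from the confusion matrix of that run, `[[908, 24], [75, 119]]`. The sketch below assumes scikit-learn's layout for binary labels, where rows are true classes and columns are predicted classes, so the four cells are TN, FP, FN, TP. The helper `binary_metrics` is a hypothetical name introduced only for illustration.

```python
def binary_metrics(tn, fp, fn, tp):
    """Compute positive-class precision, recall, F1 and overall accuracy from the four cells."""
    precision = tp / (tp + fp)          # predicted positives that are truly positive
    recall = tp / (tp + fn)             # actual positives that were found
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}


# Cells taken from the confusion matrix [[908, 24], [75, 119]].
metrics = binary_metrics(tn=908, fp=24, fn=75, tp=119)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
# precision: 0.83
# recall: 0.61
# f1: 0.71
# accuracy: 0.91
```

The 75 false negatives are what pull recall down to 0.61, while the 908 true negatives keep accuracy above 0.91.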

### SCAN FOR BEST PARAMETERS:

In [53]:
# Define hyperparameters to search
parameters = {'n_neighbors': list(range(1, 31))}

# Create a recall scorer
recall_scorer = make_scorer(recall_score)

# Perform grid search with cross-validation to find the best hyperparameters
grid_search = GridSearchCV(knn_classifier, parameters, scoring=recall_scorer, cv=5)
grid_search.fit(X_train, y_train)

# Get the best hyperparameters
best_parameters = grid_search.best_params_

# Train the KNN model with the best hyperparameters
knn_classifier_optimized = KNeighborsClassifier(**best_parameters)
knn_classifier_optimized.fit(X_train, y_train)

# Make predictions
y_pred = knn_classifier_optimized.predict(X_test)

# Evaluate the model
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
print("Accuracy:", accuracy_score(y_test, y_pred))

[[917  15]
 [ 21 173]]
              precision    recall  f1-score   support

           0       0.98      0.98      0.98       932
           1       0.92      0.89      0.91       194

    accuracy                           0.97      1126
   macro avg       0.95      0.94      0.94      1126
weighted avg       0.97      0.97      0.97      1126

Accuracy: 0.9680284191829485


The grid search selects k = 1. With this setting, recall for the positive class (Churn) rises from 0.61 to 0.89.

In [54]:
best_parameters

{'n_neighbors': 1}
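
The search above runs on features that were already scaled before cross-validation. One possible variant, sketched here under stated assumptions, wraps the scaler and the classifier in a scikit-learn `Pipeline` so that each fold fits the scaler on its own training part only. The synthetic data from `make_classification` is a stand-in for the churn dataset, and the names `X_demo`, `y_demo`, `pipeline` and `search` are illustrative. No output is shown because the selected k depends on the data.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced stand-in for the churn data (assumption, not the real dataset).
X_demo, y_demo = make_classification(
    n_samples=600, n_features=10, weights=[0.83, 0.17], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=0
)

# Scaling happens inside the pipeline, so it is refit on every training fold.
pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("knn", KNeighborsClassifier()),
])

# Pipeline parameters are addressed as "<step name>__<parameter>".
param_grid = {"knn__n_neighbors": list(range(1, 31))}
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# scoring="recall" optimizes recall of the positive class (label 1).
search = GridSearchCV(pipeline, param_grid, scoring="recall", cv=cv)
search.fit(X_tr, y_tr)

print(search.best_params_)
print("Test recall:", search.score(X_te, y_te))
```

After fitting, `search` is refit on the full training split with the best k, so `search.predict` can be used directly on held-out data.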