# Retail Banking Demo for K-Nearest Neighbors (KNN)

This notebook demonstrates the application of the K-Nearest Neighbors (KNN) algorithm using a synthetic dataset representing retail banking customer information. The dataset includes a variety of features such as customer age, number of years with the bank, credit score, account balances, loan statuses, and customer satisfaction levels.

## Data Generation

First, let's generate a synthetic dataset that mimics retail banking data. This dataset will be used to demonstrate the KNN algorithm.

import pandas as pd
import numpy as np

np.random.seed(0)  # For reproducibility

# Column descriptions
columns = [
    'Customer_Age', 'Years_with_Bank', 'Number_of_Products', 'Credit_Score',
    'Mortgage_Value', 'Savings_Account_Balance', 'Current_Account_Balance',
    'Personal_Loan', 'Home_Loan', 'Car_Loan', 'Credit_Card_Spending', 'Customer_Satisfaction'
]

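# Generating random data for 1255 rows. Each column is drawn independently
# (randint upper bounds are exclusive), so the target has no real relationship to the features.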
data = {
    'Customer_Age': np.random.randint(18, 70, size=1255),
    'Years_with_Bank': np.random.randint(1, 30, size=1255),
    'Number_of_Products': np.random.randint(1, 10, size=1255),
    'Credit_Score': np.random.randint(300, 850, size=1255),
    'Mortgage_Value': np.random.randint(0, 500000, size=1255),
    'Savings_Account_Balance': np.random.randint(0, 100000, size=1255),
    'Current_Account_Balance': np.random.randint(0, 50000, size=1255),
    'Personal_Loan': np.random.randint(0, 2, size=1255),
    'Home_Loan': np.random.randint(0, 2, size=1255),
    'Car_Loan': np.random.randint(0, 2, size=1255),
    'Credit_Card_Spending': np.random.randint(0, 10000, size=1255),
    'Customer_Satisfaction': np.random.randint(1, 6, size=1255)
}

# Creating a DataFrame
df = pd.DataFrame(data, columns=columns)

df.head()


|   | Customer_Age | Years_with_Bank | Number_of_Products | Credit_Score | Mortgage_Value | Savings_Account_Balance | Current_Account_Balance | Personal_Loan | Home_Loan | Car_Loan | Credit_Card_Spending | Customer_Satisfaction |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 62 | 28 | 3 | 721 | 336375 | 43393 | 1790 | 1 | 0 | 0 | 8943 | 1 |
| 1 | 65 | 25 | 4 | 806 | 15940 | 26552 | 3847 | 1 | 0 | 1 | 9294 | 2 |
| 2 | 18 | 26 | 7 | 559 | 11762 | 835 | 41940 | 1 | 1 | 1 | 9917 | 3 |
| 3 | 21 | 27 | 5 | 724 | 334545 | 13137 | 5779 | 0 | 1 | 0 | 3039 | 4 |
| 4 | 21 | 25 | 8 | 783 | 64605 | 72961 | 19714 | 0 | 0 | 0 | 4338 | 4 |


## Data Preparation and Feature Scaling

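Before applying the KNN algorithm, we split the dataset into training and test sets. We then standardize the features, because KNN relies on distances between points and features with large ranges (such as `Mortgage_Value`) would otherwise dominate those with small ranges (such as `Customer_Age`). The scaler is fitted on the training set only and then applied to the test set, so no test-set information leaks into training.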

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Splitting the dataset into features and target variable
X = df.drop('Customer_Satisfaction', axis=1)
y = df['Customer_Satisfaction']

# Splitting into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Feature scaling
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
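
To see why scaling matters for a distance-based method, here is a minimal sketch with three hypothetical customers described only by `Customer_Age` and `Mortgage_Value`. The values and the helper `euclidean` are made up for illustration, and the standardization uses the statistics of these three rows rather than a fitted `StandardScaler`.

```python
import numpy as np

# Hypothetical customers: [Customer_Age, Mortgage_Value]
a = np.array([25.0, 200_000.0])
b = np.array([65.0, 201_000.0])  # 40 years older, almost the same mortgage
c = np.array([26.0, 230_000.0])  # almost the same age, larger mortgage

def euclidean(u, v):
    """Plain Euclidean distance between two feature vectors."""
    return float(np.sqrt(np.sum((u - v) ** 2)))

# Raw features: the mortgage column dominates, so the 40-year age gap barely registers
raw_ab = euclidean(a, b)
raw_ac = euclidean(a, c)

# Standardize each column (zero mean, unit variance) using these three rows only
X_small = np.vstack([a, b, c])
Z = (X_small - X_small.mean(axis=0)) / X_small.std(axis=0)
scaled_ab = euclidean(Z[0], Z[1])
scaled_ac = euclidean(Z[0], Z[2])

print(f"raw:    d(a,b)={raw_ab:.1f}  d(a,c)={raw_ac:.1f}")
print(f"scaled: d(a,b)={scaled_ab:.2f}  d(a,c)={scaled_ac:.2f}")
```

On the raw values, `c` appears about 30 times farther from `a` than `b` does, purely because mortgage values are numerically larger. After standardization the two distances are of similar size, so both features influence which neighbors are found.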


## Model Training

We train a KNN model on the scaled data. The `n_neighbors` parameter is set to 5, meaning each prediction is the majority class among the 5 training points closest to the query point (by Euclidean distance, the scikit-learn default).

from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train_scaled, y_train)
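
The following from-scratch sketch shows roughly what a KNN prediction involves: compute distances, take the k closest training points, and vote. The function `knn_predict` and the toy data are hypothetical and only illustrate the idea; scikit-learn's implementation is more efficient and may break ties differently.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_query, k=5):
    """Return the majority label among the k training points nearest to x_query."""
    distances = np.sqrt(((X_train - x_query) ** 2).sum(axis=1))  # Euclidean distances
    nearest = np.argsort(distances)[:k]                          # indices of the k closest points
    votes = Counter(y_train[nearest].tolist())                   # count labels among neighbors
    return votes.most_common(1)[0][0]

# Toy data: two well-separated clusters labelled 1 and 5
X_toy = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                  [1.0, 1.0], [1.1, 0.9], [0.9, 1.2]])
y_toy = np.array([1, 1, 1, 5, 5, 5])

pred_low = knn_predict(X_toy, y_toy, np.array([0.05, 0.05]), k=3)
pred_high = knn_predict(X_toy, y_toy, np.array([1.0, 1.1]), k=3)
print(pred_low, pred_high)  # 1 5
```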


Choosing k=5 for the K-Nearest Neighbors (KNN) algorithm is common practice when nothing is known in advance about the optimal number of neighbors. The choice is somewhat arbitrary but rests on a few general considerations:

- **Default or starting point**: k=5 is often used as a starting point in KNN examples and tutorials, and it is the default value of `n_neighbors` in scikit-learn's `KNeighborsClassifier`. It strikes a balance between too few neighbors (which makes the model highly sensitive to noise in the data) and too many neighbors (which can smooth over real structure in the data and lead to underfitting).
- **Avoiding ties**: Choosing an odd k in binary classification avoids ties, where the vote would otherwise be split equally between two classes. For multi-class problems like this one, an odd k can still reduce the chance of a tie, but it no longer rules ties out, and the guarantee weakens as the number of classes increases.
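
The point about ties can be checked with a small vote counter. The helper `vote` below is hypothetical and only counts labels; it does not reproduce how any particular library resolves a tie.

```python
from collections import Counter

def vote(neighbor_labels):
    """Return the label(s) with the most votes and whether the vote is tied."""
    counts = Counter(neighbor_labels)
    top = max(counts.values())
    winners = sorted(label for label, n in counts.items() if n == top)
    return winners, len(winners) > 1

# k=5 with two classes: an odd k always produces a single winner
binary_winners, binary_tie = vote([0, 1, 1, 0, 1])
print(binary_winners, binary_tie)  # [1] False

# k=5 with five possible classes: a 2-2-1 split is still a tie
multi_winners, multi_tie = vote([1, 1, 3, 3, 5])
print(multi_winners, multi_tie)  # [1, 3] True
```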

## Prediction and Evaluation

Let's make predictions on the test set and evaluate the model's performance using a confusion matrix and classification report.

from sklearn.metrics import classification_report, confusion_matrix

y_pred = knn.predict(X_test_scaled)

# Evaluation
conf_matrix = confusion_matrix(y_test, y_pred)
class_report = classification_report(y_test, y_pred)

print(conf_matrix)
print(class_report)


[[14  6 12  7  9]
 [11  6  7  7  7]
 [14  8 14 11 14]
 [13 11 13  6  9]
 [22  2 13  8  7]]
              precision    recall  f1-score   support

           1       0.19      0.29      0.23        48
           2       0.18      0.16      0.17        38
           3       0.24      0.23      0.23        61
           4       0.15      0.12      0.13        52
           5       0.15      0.13      0.14        52

    accuracy                           0.19       251
   macro avg       0.18      0.19      0.18       251
weighted avg       0.18      0.19      0.18       251



The output above consists of two parts: the confusion matrix and the classification report for the KNN model's predictions on the test set. Let's break down what each part tells us.

### Confusion Matrix
A confusion matrix is a table used to describe the performance of a classification model on a set of test data for which the true values are known. The matrix is:

```
[[14,  6, 12,  7,  9],
 [11,  6,  7,  7,  7],
 [14,  8, 14, 11, 14],
 [13, 11, 13,  6,  9],
 [22,  2, 13,  8,  7]]
```

In scikit-learn's `confusion_matrix`, each row represents the instances of an actual class and each column represents the instances of a predicted class. For a 5-class classification (customer satisfaction levels from 1 to 5), the diagonal elements (14, 6, 14, 6, 7) show the number of correct predictions for each class, while the off-diagonal elements show the misclassifications.

For example:
- 14 instances of class 1 were correctly predicted, but 6 instances of class 1 were predicted as class 2, 12 as class 3, etc.
- 22 instances of class 5 were incorrectly predicted as class 1, more than three times the 7 instances of class 5 that were predicted correctly.
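
The readings above can be reproduced directly from the matrix. This sketch assumes scikit-learn's convention that rows are actual classes and columns are predicted classes; the variable names are illustrative.

```python
import numpy as np

# Confusion matrix from the output above (rows: actual class, columns: predicted class)
cm = np.array([
    [14,  6, 12,  7,  9],
    [11,  6,  7,  7,  7],
    [14,  8, 14, 11, 14],
    [13, 11, 13,  6,  9],
    [22,  2, 13,  8,  7],
])

correct = np.diag(cm)                 # correct predictions per class
support = cm.sum(axis=1)              # actual instances per class (row sums)
predicted_counts = cm.sum(axis=0)     # how often each class was predicted (column sums)
accuracy = float(correct.sum() / cm.sum())

print(correct.tolist())            # [14, 6, 14, 6, 7]
print(support.tolist())            # [48, 38, 61, 52, 52]
print(predicted_counts.tolist())   # [74, 33, 59, 39, 46]
print(round(accuracy, 2))          # 0.19
```

The row sums match the `support` column of the classification report, and the diagonal total (47 of 251) gives the reported accuracy.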

### Classification Report
The classification report provides the following metrics for each class:

- **Precision**: The ratio of correctly predicted positive observations to the total predicted positives. It tells us the accuracy of positive predictions.
- **Recall (Sensitivity)**: The ratio of correctly predicted positive observations to all observations in the actual class. It tells us the coverage of the actual positive sample.
- **F1-score**: The harmonic mean of precision and recall, \(F_1 = 2 \cdot \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}\). It takes both false positives and false negatives into account, reaching its best value at 1 (perfect precision and recall) and its worst at 0.
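
As a check on these definitions, the per-class metrics can be recomputed from the confusion matrix printed earlier. This is a sketch that assumes rows are actual classes and columns are predicted classes, and it uses the harmonic-mean form of the F1-score.

```python
import numpy as np

cm = np.array([
    [14,  6, 12,  7,  9],
    [11,  6,  7,  7,  7],
    [14,  8, 14, 11, 14],
    [13, 11, 13,  6,  9],
    [22,  2, 13,  8,  7],
])

tp = np.diag(cm).astype(float)        # true positives per class
precision = tp / cm.sum(axis=0)       # TP / everything predicted as that class
recall = tp / cm.sum(axis=1)          # TP / everything actually in that class
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean

for label, p, r, f in zip(range(1, 6), precision, recall, f1):
    print(f"class {label}: precision={p:.2f} recall={r:.2f} f1={f:.2f}")
# First line printed: class 1: precision=0.19 recall=0.29 f1=0.23
```

The rounded values agree with the classification report above.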

For this model:
- The precision values range from 0.15 to 0.24 across the classes, indicating low accuracy in positive predictions.
- Recall values also show that the coverage of actual positives is quite low, especially for classes 4 and 5.
- F1-scores are similarly low, indicating a poor balance between precision and recall across all classes.
- The overall accuracy of the model is 0.19 (19%), which is no better than the roughly 20% expected from guessing uniformly among five classes.

**Inference**: The model is not effective at classifying customer satisfaction levels. The low precision, recall, and F1-scores across all classes show that it cannot distinguish between the satisfaction levels. The main cause here is the data itself: `Customer_Satisfaction` was generated at random, independently of every feature, so there is no pattern for any model to learn. On real data, similar scores could instead stem from non-informative features, class imbalance, or poorly chosen hyperparameters, and would call for further investigation and model refinement.
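
To put the 19% accuracy in context, it helps to compare it with two trivial baselines computed from the test-set class counts in the report: guessing uniformly at random, and always predicting the most frequent class. The sketch below uses only numbers printed above; the variable names are illustrative.

```python
# Test-set support per class, taken from the classification report
support = {1: 48, 2: 38, 3: 61, 4: 52, 5: 52}
n_test = sum(support.values())                      # 251

uniform_guess_accuracy = 1 / len(support)           # expected accuracy of uniform guessing
majority_label = max(support, key=support.get)      # most frequent class in the test set
majority_accuracy = support[majority_label] / n_test
knn_accuracy = 47 / n_test                          # 47 correct predictions on the diagonal

print(f"uniform guess:  {uniform_guess_accuracy:.3f}")  # 0.200
print(f"majority class: {majority_accuracy:.3f}")       # 0.243
print(f"KNN (k=5):      {knn_accuracy:.3f}")            # 0.187
```

The KNN model falls below both baselines, which is what one would expect when the features carry no information about the target.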

from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

# Assuming X and y are already defined
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Scaling features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Setting up the KNN classifier and grid search parameters
knn = KNeighborsClassifier()
param_grid = {'n_neighbors': range(1, 31)}

# Performing grid search with cross-validation
grid_search = GridSearchCV(knn, param_grid, cv=5, scoring='accuracy')
grid_search.fit(X_train_scaled, y_train)

# Best number of neighbors
best_k = grid_search.best_params_['n_neighbors']
best_score = grid_search.best_score_

# Training the KNN classifier with the best number of neighbors
# (grid_search.best_estimator_ already holds an equivalent model, since refit=True by default)
knn_best = KNeighborsClassifier(n_neighbors=best_k)
knn_best.fit(X_train_scaled, y_train)

# Evaluating the model
accuracy_train = knn_best.score(X_train_scaled, y_train)
accuracy_test = knn_best.score(X_test_scaled, y_test)

print(f"Optimal k: {best_k} with Cross-Validated Accuracy: {best_score}")
print(f"Training Set Accuracy: {accuracy_train}")
print(f"Test Set Accuracy: {accuracy_test}")


Optimal k: 1 with Cross-Validated Accuracy: 0.23203980099502486
Training Set Accuracy: 1.0
Test Set Accuracy: 0.21912350597609562


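With k=1 every training point is its own nearest neighbor, which explains the perfect training accuracy. The cross-validated (0.23) and test (0.22) accuracies remain close to the 20% expected from guessing among five classes, so tuning k does not help on this dataset.

Grid search with cross-validation is a widely used method for hyperparameter tuning in machine learning. It works systematically through every combination of candidate parameter values, cross-validating each one to determine which gives the best performance. Here is a detailed breakdown of how it works: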

### Step 1: Define the Parameter Grid
You start by defining a grid of parameters that you want to test. This grid is essentially a dictionary where the keys are the parameter names and the values are the ranges of values to test for each parameter. For instance, if you're using a K-Nearest Neighbors (KNN) classifier, your grid might include different values for `n_neighbors` to determine the best number of neighbors.
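
As a sketch of what such a grid expands to, the snippet below enumerates every combination with `itertools.product`. The second parameter, `weights`, is a real `KNeighborsClassifier` option but was not part of the grid used in this notebook; it is added here only to show how combinations multiply.

```python
from itertools import product

param_grid = {
    'n_neighbors': range(1, 31),           # the grid used in this notebook
    'weights': ['uniform', 'distance'],    # assumption: extra parameter for illustration
}

# Cartesian product of all value lists, one dict per candidate configuration
names = list(param_grid)
combinations = [dict(zip(names, values)) for values in product(*param_grid.values())]

print(len(combinations))   # 60
print(combinations[0])     # {'n_neighbors': 1, 'weights': 'uniform'}
```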

### Step 2: Setup Cross-Validation
Cross-validation is a technique for assessing how well a model's results will generalize to an independent data set. It is primarily used when the goal is prediction and one wants to estimate how accurately a model will perform in practice. The most common variant is \(k\)-fold cross-validation (this \(k\) is the number of folds, unrelated to the number of neighbors in KNN), where the data is split into \(k\) subsets, or folds. The model is trained on \(k-1\) of those folds, and the remaining fold is used for testing. This process is repeated \(k\) times, with each fold used exactly once as the test set. The \(k\) results are then averaged (or otherwise combined) to produce a single estimate.
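
The splitting scheme can be sketched in a few lines of NumPy. The generator `k_fold_indices` is a hypothetical, simplified helper: it shuffles and splits indices evenly, whereas scikit-learn's splitters offer further options such as stratification by class.

```python
import numpy as np

def k_fold_indices(n_samples, n_folds=5, seed=0):
    """Yield (train_idx, test_idx) pairs so that each sample is tested exactly once."""
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(n_samples)
    folds = np.array_split(shuffled, n_folds)   # n_folds nearly equal chunks
    for i in range(n_folds):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(n_folds) if j != i])
        yield train_idx, test_idx

# 1004 is the size of the training set in this notebook (80% of 1255 rows)
splits = list(k_fold_indices(n_samples=1004, n_folds=5))
for fold, (train_idx, test_idx) in enumerate(splits, start=1):
    print(f"fold {fold}: train={len(train_idx)} test={len(test_idx)}")
```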

### Step 3: Perform Grid Search
The grid search algorithm takes the parameter grid and the cross-validation setup as inputs and then systematically creates and evaluates models for each combination of parameters in your grid.

- For each parameter combination, grid search uses the cross-validation method to evaluate the model.
- It trains the model multiple times (once for each fold), each time using a different fold as the test set and the remaining data as the training set.
- The performance scores obtained from each fold are then averaged to get the overall performance metric for the parameter combination.
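
The loop described above can be written out by hand with `cross_val_score`. This is a sketch on a small synthetic dataset from `make_classification` (an assumption, chosen so the snippet runs on its own); the candidate values and variable names are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Assumption: a small synthetic dataset standing in for the scaled training data
X_demo, y_demo = make_classification(
    n_samples=200, n_features=5, n_informative=3, random_state=0
)

candidate_ks = [1, 3, 5, 7, 9]
mean_scores = {}
for k in candidate_ks:
    # One accuracy score per fold; the model is refitted for each fold
    fold_scores = cross_val_score(
        KNeighborsClassifier(n_neighbors=k), X_demo, y_demo, cv=5, scoring='accuracy'
    )
    mean_scores[k] = fold_scores.mean()   # average over the 5 folds

best_k = max(mean_scores, key=mean_scores.get)
print(mean_scores)
print(f"best k: {best_k}")
```

`GridSearchCV` automates this loop and also keeps the per-fold results in `cv_results_`.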

### Step 4: Select the Best Parameters
After evaluating all the parameter combinations, grid search selects the combination that resulted in the best performance metric (e.g., accuracy for classification problems). This combination of parameters is considered the best and is usually used to train a final model.

### Step 5: Train the Final Model
Finally, with the best parameters identified, a new model is trained on the entire training dataset. This final model is then ready to be evaluated on the test set or used for making predictions on new data.
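
In scikit-learn, `GridSearchCV` performs this final refit itself (`refit=True` is the default) and exposes the result as `best_estimator_`. The sketch below also wraps the scaler and classifier in a `Pipeline`, so that scaling is refitted inside each cross-validation fold; that is a design choice not used in this notebook's cells, and the synthetic data is an assumption made to keep the snippet self-contained.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Assumption: synthetic data standing in for X and y
X_demo, y_demo = make_classification(
    n_samples=300, n_features=5, n_informative=3, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

# Scaling and KNN as one estimator; the step name prefixes the parameter name in the grid
pipeline = Pipeline([('scaler', StandardScaler()), ('knn', KNeighborsClassifier())])
search = GridSearchCV(pipeline, {'knn__n_neighbors': [1, 3, 5, 7, 9]}, cv=5, scoring='accuracy')
search.fit(X_tr, y_tr)

# Already refitted on all of X_tr with the best parameters
final_model = search.best_estimator_
test_accuracy = final_model.score(X_te, y_te)
print(search.best_params_, round(test_accuracy, 3))
```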

### Advantages and Disadvantages
**Advantages**:
- **Systematic Exploration**: Grid search provides a systematic approach for exploring a wide range of parameters to find the best performing model.
- **Simplicity**: It's relatively simple to understand and implement.
- **Parallelization**: The process can be easily parallelized to speed up computation.

**Disadvantages**:
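- **Computational Cost**: Every combination is fitted once per fold, so the number of model fits is the product of the grid sizes times the number of folds (here 30 values × 5 folds = 150 fits). This becomes expensive for large datasets, complex models, or grids over several parameters.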
- **Grid Resolution**: The performance of grid search can heavily depend on how finely the parameter space is defined. A coarse grid might miss the optimal parameters, while a very fine grid increases computational cost.

To mitigate some of these disadvantages, techniques like random search or Bayesian optimization can be used as alternatives or complements to grid search, offering a balance between exploration of the parameter space and computational efficiency.
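
As one example of such an alternative, scikit-learn's `RandomizedSearchCV` evaluates a fixed number of randomly sampled combinations instead of the full grid. This is a sketch on synthetic data (an assumption); the `weights` parameter and the choice of `n_iter=10` are illustrative and not taken from this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Assumption: synthetic data standing in for the scaled training set
X_demo, y_demo = make_classification(
    n_samples=200, n_features=5, n_informative=3, random_state=0
)

random_search = RandomizedSearchCV(
    KNeighborsClassifier(),
    param_distributions={
        'n_neighbors': list(range(1, 31)),
        'weights': ['uniform', 'distance'],
    },
    n_iter=10,          # sample 10 of the 60 possible combinations
    cv=5,
    scoring='accuracy',
    random_state=0,
)
random_search.fit(X_demo, y_demo)

n_tried = len(random_search.cv_results_['params'])
print(n_tried)  # 10
print(random_search.best_params_)
```

With 5 folds this costs 50 fits plus the final refit, compared with 300 for the exhaustive grid over the same 60 combinations.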