# Lab Assignment 2 - Part B: k-Nearest Neighbor Classification
Please refer to the `README.pdf` for full laboratory instructions.


## Problem Statement
In this part, you will implement the k-Nearest Neighbor (k-NN) classifier and evaluate it on two datasets:
- **Lenses Dataset**: A small dataset for contact lens prescription
- **Credit Approval (CA) Dataset**: Credit card application data with binary labels (+/-)

### Your Tasks
1. **Preprocess the data**: Handle missing values and normalize features
2. **Implement k-NN** with L2 distance
3. **Evaluate** on both datasets for different values of k
4. **Discuss** your results

### Datasets
The data files are located in the `credit 2017/` folder:
- `lenses.training`, `lenses.testing`
- `crx.data.training`, `crx.data.testing`
- `crx.names` (describes the features)


## Setup


In [1]:
# Library declarations
import numpy as np
import matplotlib.pyplot as plt
from collections import Counter


In [4]:
# Data paths
DATA_PATH = "/content/credit 2017/"

# Load Lenses data
def load_lenses_data():
    """Load the lenses dataset."""
    train_data = np.loadtxt(DATA_PATH + "lenses.training", delimiter=',')
    test_data = np.loadtxt(DATA_PATH + "lenses.testing", delimiter=',')

    # First column is ID, last column is label
    X_train = train_data[:, 1:-1]
    y_train = train_data[:, -1]
    X_test = test_data[:, 1:-1]
    y_test = test_data[:, -1]

    return X_train, y_train, X_test, y_test

# Load Credit Approval data
def load_credit_data():
    """
    Load the Credit Approval dataset.
    Note: This dataset contains missing values (?) and mixed types,
    so loading and preprocessing are done together in
    preprocess_credit_data (defined under Task 1).
    """
    # The data is comma-separated, missing values are marked with '?',
    # and the last column is the label ('+' or '-')
    return preprocess_credit_data(
        DATA_PATH + "crx.data.training",
        DATA_PATH + "crx.data.testing",
    )

# Test loading lenses data
X_train_lenses, y_train_lenses, X_test_lenses, y_test_lenses = load_lenses_data()
print(f"Lenses - Train: {X_train_lenses.shape}, Test: {X_test_lenses.shape}")


Lenses - Train: (18, 3), Test: (6, 3)


## Task 1: Data Preprocessing
For the Credit Approval dataset, you need to:
1. **Handle missing values** (marked with '?'):
   - Categorical features: replace with mode/median
   - Numerical features: replace with label-conditioned mean
2. **Normalize features** using z-scaling:
   $$z_i^{(m)} = \frac{x_i^{(m)} - \mu_i}{\sigma_i}$$

Document exactly how you handle each feature!


In [5]:
import pandas as pd
import numpy as np

def preprocess_credit_data(train_path, test_path):
    # 1. Load the data
    # Using '?' as the standard missing value marker for this dataset
    names = [f'A{i}' for i in range(1, 17)]
    train_df = pd.read_csv(train_path, names=names, na_values='?')
    test_df = pd.read_csv(test_path, names=names, na_values='?')

    numerical_indices = [1, 2, 7, 10, 13, 14] # A2, A3, A8, A11, A14, A15
    categorical_indices = [0, 3, 4, 5, 6, 8, 9, 11, 12] # A1, A4-7, A9, A10, A12, A13

    # Separate features and target
    y_train = train_df['A16'].map({'+': 1, '-': 0}).values
    y_test = test_df['A16'].map({'+': 1, '-': 0}).values

    X_train_df = train_df.drop('A16', axis=1)
    X_test_df = test_df.drop('A16', axis=1)

    # 2. Handle Missing Values
    # Categorical: Replace with Mode
    for i in categorical_indices:
        col = f'A{i+1}'
        mode_val = X_train_df[col].mode()[0]
        X_train_df[col] = X_train_df[col].fillna(mode_val)
        X_test_df[col] = X_test_df[col].fillna(mode_val)

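    # Numerical: Replacing with Label-Conditioned Mean
    # Note: We calculate means from training data based on target y_train
    for i in numerical_indices:
        col = f'A{i+1}'
        # Cast to float so the float mean can be stored even in integer-typed columns
        X_train_df[col] = X_train_df[col].astype(float)
        X_test_df[col] = X_test_df[col].astype(float)
        for label in [0, 1]:
            mask_train = (y_train == label)
            mean_val = X_train_df.loc[mask_train, col].mean()

            # Fill missing in train
            X_train_df.loc[mask_train & X_train_df[col].isna(), col] = mean_val
            # Fill missing in test with the training mean of the matching class.
            # Caveat: the class is picked using the test labels, which would
            # not be available when classifying genuinely unseen data.
            mask_test = (y_test == label)
            X_test_df.loc[mask_test & X_test_df[col].isna(), col] = mean_val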

    # 3. Categorical Encoding (Simple mapping for distance calculation)
    # Since the custom distance function uses (1 if a != b else 0),
    # we can keep these as strings or factorize them.
    X_train = X_train_df.values
    X_test = X_test_df.values

    # 4. Normalize Numerical Features
    X_train, X_test = z_normalize(X_train, X_test, numerical_indices)

    return X_train, y_train, X_test, y_test

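def z_normalize(X_train, X_test, feature_indices):
    """Z-scale the given columns using the training-set mean and std only."""
    # astype returns a copy, so the inputs are left untouched
    X_train_norm = X_train.astype(object)
    X_test_norm = X_test.astype(object)

    for i in feature_indices:
        # The arrays are object-typed (mixed columns), so cast each
        # numerical column to float before computing statistics
        train_col = X_train[:, i].astype(float)
        test_col = X_test[:, i].astype(float)

        # Calculating mean and std only from training data
        mu = np.mean(train_col)
        sigma = np.std(train_col)

        # Applying normalization to both sets
        # If sigma is 0, we avoid division by zero
        if sigma > 0:
            X_train_norm[:, i] = (train_col - mu) / sigma
            X_test_norm[:, i] = (test_col - mu) / sigma
        else:
            X_train_norm[:, i] = 0.0
            X_test_norm[:, i] = 0.0

    return X_train_norm, X_test_norm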
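
The numerical imputation above picks the class mean for test rows using the test labels. As a point of comparison, here is a minimal sketch of a variant that never consults test labels. The helper name `impute_numeric` and the choice of the overall training mean for test rows are assumptions of this sketch, not part of the assignment, and the data is made up.

```python
import numpy as np


def impute_numeric(x_train, y_train, x_test):
    """Fill NaNs in one numerical column using training statistics only."""
    x_train = np.asarray(x_train, dtype=float).copy()
    x_test = np.asarray(x_test, dtype=float).copy()

    # Overall training mean, computed before any filling (NaNs ignored).
    overall_mean = np.nanmean(x_train)

    # Training rows: label-conditioned mean, as in the task description.
    for label in np.unique(y_train):
        in_class = (y_train == label)
        class_mean = np.nanmean(x_train[in_class])
        x_train[in_class & np.isnan(x_train)] = class_mean

    # Test rows: overall training mean (assumption), so no test label is used.
    x_test[np.isnan(x_test)] = overall_mean
    return x_train, x_test


# Toy column, made up for illustration.
train_col = np.array([1.0, 3.0, np.nan, 10.0, np.nan])
train_labels = np.array([0, 0, 0, 1, 1])
test_col = np.array([np.nan, 4.0])

train_filled, test_filled = impute_numeric(train_col, train_labels, test_col)
print(train_filled)  # class 0 gap filled with 2.0, class 1 gap with 10.0
print(test_filled)   # gap filled with the overall training mean 14/3
```

If a class has no observed value in a column, `np.nanmean` returns NaN for it, so a real pipeline would need a fallback for that case.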

## Task 2: Implement k-NN Classifier
Implement k-NN with L2 (Euclidean) distance:
$$\mathcal{D}_{L2}(\mathbf{a}, \mathbf{b}) = \sqrt{\sum_i (a_i - b_i)^2}$$

For **categorical attributes**, use:
- Distance = 1 if values are different
- Distance = 0 if values are the same
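
As a worked example (the vectors are made up for illustration and are not rows from either dataset), take $\mathbf{a} = (0.5, \text{u}, -1.0)$ and $\mathbf{b} = (1.5, \text{y}, -1.0)$, where the first and third attributes are numerical and the second is categorical. Assuming the categorical 0/1 term is added inside the square root alongside the squared numerical differences:

$$\mathcal{D}_{L2}(\mathbf{a}, \mathbf{b}) = \sqrt{(0.5 - 1.5)^2 + 1 + (-1.0 - (-1.0))^2} = \sqrt{1 + 1 + 0} = \sqrt{2} \approx 1.414$$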


In [6]:
import numpy as np
from collections import Counter

def l2_distance(a, b):
    """
    Compute hybrid L2 distance between two vectors.
    Numerical: (a_i - b_i)^2
    Categorical: 1 if different, 0 if same
    """
    dist_sq = 0.0
    for i in range(len(a)):
        # Checking if the values are numeric (int/float)
        if isinstance(a[i], (int, float, np.number)) and not isinstance(a[i], bool):
            dist_sq += (a[i] - b[i]) ** 2
        else:
            # Categorical distance
            dist_sq += 1.0 if a[i] != b[i] else 0.0

    return np.sqrt(dist_sq)

def knn_predict(X_train, y_train, X_test, k):
    """
    Predict labels for test data using k-NN majority voting.
    """
    predictions = []

    for test_point in X_test:
        # 1. Compute distance to all training samples
        distances = [l2_distance(test_point, train_point) for train_point in X_train]

        # 2. Find k nearest neighbors indices
        k_neighbor_indices = np.argsort(distances)[:k]

        # 3. Get labels for these k neighbors
        k_neighbor_labels = y_train[k_neighbor_indices]

        # 4. Majority voting
        most_common = Counter(k_neighbor_labels).most_common(1)
        predictions.append(most_common[0][0])

    return np.array(predictions)

def compute_accuracy(y_true, y_pred):
    """
    Compute classification accuracy.
    """
    if len(y_true) == 0:
        return 0.0
    return np.sum(y_true == y_pred) / len(y_true)
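
The per-element loop in `l2_distance` is easy to follow but slow on larger datasets. A possible vectorized sketch is shown below. It assumes the numerical and categorical column indices are known up front (as they are in `preprocess_credit_data`) and that missing values have already been imputed. The name `hybrid_distances` and its parameters are hypothetical.

```python
import numpy as np


def hybrid_distances(query, X_train, numeric_idx, categorical_idx):
    """Hybrid L2 distances from one query row to every row of X_train."""
    # Numerical part: sum of squared differences over the numerical columns.
    num_train = X_train[:, numeric_idx].astype(float)
    num_query = np.asarray(query[numeric_idx], dtype=float)
    num_sq = ((num_train - num_query) ** 2).sum(axis=1)

    # Categorical part: each mismatching attribute contributes 1.
    cat_mismatch = (X_train[:, categorical_idx] != query[categorical_idx]).sum(axis=1)

    return np.sqrt(num_sq + cat_mismatch)


# Toy data, made up for illustration: columns 0 and 2 numerical, column 1 categorical.
X_toy = np.array([[0.0, "a", 1.0],
                  [1.0, "b", 1.0],
                  [3.0, "a", 5.0]], dtype=object)
q_toy = np.array([0.0, "b", 1.0], dtype=object)

d = hybrid_distances(q_toy, X_toy, [0, 2], [1])
print(d)  # distances 1, 1 and sqrt(26)
```

Unlike `l2_distance`, this version decides which columns are numerical from the index lists rather than from the runtime type of each value.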

## Task 3: Evaluate on Lenses Dataset
Test your k-NN implementation on the Lenses dataset for different values of k.


In [10]:
import pandas as pd
import numpy as np

# Load Lenses data using comma as the separator
train_lenses = pd.read_csv('/content/credit 2017/lenses.training', header=None, sep=',')
test_lenses = pd.read_csv('/content/credit 2017/lenses.testing', header=None, sep=',')

# Structure: Features are all columns except the last. Label is the last column
X_train_lenses = train_lenses.iloc[:, :-1].values
y_train_lenses = train_lenses.iloc[:, -1].values.astype(int)

X_test_lenses = test_lenses.iloc[:, :-1].values
y_test_lenses = test_lenses.iloc[:, -1].values.astype(int)

# Evaluation Loop
k_values = [1, 3, 5, 7]
print("--- Lenses Dataset k-NN Evaluation ---")
print(f"{'k':<5} | {'Accuracy':<10}")
print("-" * 20)

for k in k_values:
    predictions = knn_predict(X_train_lenses, y_train_lenses, X_test_lenses, k)
    predictions = np.array(predictions).astype(int)
    accuracy = compute_accuracy(y_test_lenses, predictions)
    print(f"{k:<5} | {accuracy:<10.4f}")

--- Lenses Dataset k-NN Evaluation ---
k     | Accuracy  
--------------------
1     | 0.6667    
3     | 0.8333    
5     | 0.5000    
7     | 0.5000    
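
With only 6 test samples (the size reported in the Setup output), accuracy can only move in steps of 1/6. The short check below, a sketch using the values printed above, converts each reported accuracy back into a count of correctly classified test points.

```python
from fractions import Fraction

n_test = 6  # test-set size from the Setup output
reported = {1: 0.6667, 3: 0.8333, 5: 0.5000, 7: 0.5000}

correct_counts = {}
for k, acc in reported.items():
    correct = round(acc * n_test)
    # The reported value must match a whole number of correct predictions.
    assert abs(correct / n_test - acc) < 1e-4
    correct_counts[k] = correct
    print(f"k={k}: {correct}/{n_test} correct ({Fraction(correct, n_test)})")
```

The gap between the best and worst $k$ is therefore two test points, which is worth keeping in mind when interpreting the ranking.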


## Task 4: Evaluate on Credit Approval Dataset
First preprocess the data, then evaluate k-NN.


In [11]:
# 1. Define the data path
DATA_PATH = "/content/credit 2017/"

# 2. Preprocessing the Credit Approval data
X_train_credit, y_train_credit, X_test_credit, y_test_credit = preprocess_credit_data(
    DATA_PATH + "crx.data.training",
    DATA_PATH + "crx.data.testing"
)

# 3. Evaluating k-NN for different values of k
k_values = [1, 3, 5, 7]
credit_results = []

print("--- Credit Approval Dataset k-NN Evaluation ---")
print(f"{'k':<5} | {'Accuracy':<10}")
print("-" * 20)

for k in k_values:
    # Prediction using the hybrid k-NN implementation
    predictions = knn_predict(X_train_credit, y_train_credit, X_test_credit, k)

    # Calculating accuracy
    accuracy = compute_accuracy(y_test_credit, predictions)
    credit_results.append((k, accuracy))
    print(f"{k:<5} | {accuracy:<10.4f}")


--- Credit Approval Dataset k-NN Evaluation ---
k     | Accuracy  
--------------------
1     | 0.8116    
3     | 0.8478    
5     | 0.8333    
7     | 0.8478    



## Summary and Discussion

### Results Table

| Dataset | k=1 | k=3 | k=5 | k=7 |
|---------|-----|-----|-----|-----|
| Lenses | ? | ? | ? | ? |
| Credit Approval | ? | ? | ? | ? |

### Discussion
*Answer these questions:*
1. Which value of k works best for each dataset? Why do you think that is?
2. How did preprocessing affect your results on the Credit Approval dataset?
3. What are the trade-offs of using different values of k?
4. What did you learn from this exercise?


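--> Results Table

The accuracy scores obtained for both datasets across varying values of $k$ are summarized below:

| Dataset | k=1 | k=3 | k=5 | k=7 |
|---------|-----|-----|-----|-----|
| Lenses | 0.6667 | 0.8333 | 0.5000 | 0.5000 |
| Credit Approval | 0.8116 | 0.8478 | 0.8333 | 0.8478 |

--> Discussion

--> Which value of k works best for each dataset? Why do you think that is?

For the Lenses dataset, $k=3$ performed best (0.8333), followed by $k=1$ (0.6667), while $k=5$ and $k=7$ both dropped to 0.5000. The dataset is very small (18 training and 6 test samples) and based on deterministic clinical rules, so local patterns are reliable and a small neighborhood works well. As $k$ increases, the neighborhood expands to include a large percentage of the total data ($k=7$ is more than a third of the training set), which dilutes the specific rules and leads to misclassification. With only 6 test samples, each step between these accuracies is a single test point, so the ranking should be read with caution.

For the Credit Approval dataset, $k=3$ and $k=7$ tied for the best performance (0.8478). In this larger, noisier dataset, a single neighbor ($k=1$) can be an outlier or a noisy data point. Increasing $k$ allows the majority vote to smooth out these anomalies, leading to better generalization. The gain is modest (0.8116 to 0.8478) and not monotonic, since $k=5$ scores 0.8333.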

--> How did preprocessing affect your results on the Credit Approval dataset?

Preprocessing was vital for the Credit Approval dataset due to its mixed feature types. Without z-score normalization, features with large numerical ranges (like A15) would have dominated the distance calculation: a categorical mismatch adds at most 1 to the squared distance, so unscaled numerical differences would have made the categorical features nearly irrelevant. Furthermore, label-conditioned mean imputation filled missing numerical values without blurring the difference between the positive and negative classes, which helps preserve the predictive power of those features.
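
The effect of scale on the nearest neighbor can be illustrated with a small sketch. The data below is made up (it is not taken from the Credit Approval files): one feature lies in [0, 1] and the other in the thousands.

```python
import numpy as np

# Toy training set: column 0 has a small range, column 1 a large one.
X_demo = np.array([[0.0, 1000.0],
                   [0.0, 3000.0],
                   [1.0, 1200.0],
                   [1.0, 3200.0]])
query = np.array([0.0, 1150.0])


def nearest(X, q):
    """Index of the row of X closest to q under Euclidean distance."""
    return int(np.argmin(np.sqrt(((X - q) ** 2).sum(axis=1))))


# Raw features: the large-range column decides the neighbor on its own.
raw_nn = nearest(X_demo, query)

# Z-scaled features, using training mean and std only.
mu = X_demo.mean(axis=0)
sigma = X_demo.std(axis=0)
scaled_nn = nearest((X_demo - mu) / sigma, (query - mu) / sigma)

print(raw_nn, scaled_nn)  # → 2 0
```

On the raw data the query's nearest neighbor is row 2, even though the two disagree completely on the first feature. After z-scaling, both features contribute comparably and row 0 becomes the nearest neighbor.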

--> What are the trade-offs of using different values of k?

The choice of $k$ represents a trade-off between bias and variance:

- **Small $k$ (e.g., $k=1$)**: Results in low bias but high variance. The model is highly sensitive to the specific training points and noise, which can lead to overfitting.
- **Large $k$ (e.g., $k=7$)**: Results in lower variance but higher bias. While the model is more robust to noise and outliers, the decision boundary can become too smooth, potentially ignoring important local patterns and underfitting the data.
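
One common way to navigate this trade-off without touching the test set is to pick $k$ by leave-one-out cross-validation on the training data. The sketch below is illustrative only: `loo_accuracy` is a hypothetical helper, it uses plain Euclidean distance rather than the hybrid distance, and the 1-D data is made up (two clusters with one mislabeled point).

```python
import numpy as np
from collections import Counter


def loo_accuracy(X, y, k):
    """Leave-one-out accuracy of k-NN with Euclidean distance."""
    correct = 0
    for i in range(len(X)):
        dists = np.sqrt(((X - X[i]) ** 2).sum(axis=1))
        dists[i] = np.inf  # exclude the held-out point itself
        # Stable sort: ties in distance are broken by training index.
        neighbors = np.argsort(dists, kind="stable")[:k]
        vote = Counter(y[neighbors].tolist()).most_common(1)[0][0]
        correct += int(vote == y[i])
    return correct / len(X)


# Toy data: class 0 near 0..3, class 1 near 10..13, with x=2 labeled 1 as noise.
X_toy = np.array([[0.0], [1.0], [2.0], [3.0], [10.0], [11.0], [12.0], [13.0]])
y_toy = np.array([0, 0, 1, 0, 1, 1, 1, 1])

scores = {k: loo_accuracy(X_toy, y_toy, k) for k in (1, 3, 5)}
print(scores)  # → {1: 0.75, 3: 0.875, 5: 0.5}
```

On this toy set, $k=1$ is hurt by the noisy point, $k=5$ reaches into the other cluster, and $k=3$ sits in between, which mirrors the bias-variance picture described above.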


--> What did you learn from this exercise?

This exercise demonstrated the critical role of data preparation in machine learning. The best $k$ depends on the data: accuracy changed with $k$ on both datasets, and far more sharply on the small Lenses test set (0.5000 to 0.8333) than on Credit Approval (0.8116 to 0.8478). Additionally, implementing a hybrid distance metric showed how categorical and numerical information can be combined mathematically into a single distance for classification.