# Exercice 2: Classification system with KNN - To Loan or Not To Loan

## Imports

Import some useful libraries

In [1]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import OrdinalEncoder, StandardScaler
from sklearn.model_selection import train_test_split
from math import pow, sqrt

## a. Getting started

### Data loading

The original dataset comes from the Kaggle's [Loan Prediction](https://www.kaggle.com/ninzaami/loan-predication) problem. The provided dataset has already undergone some processing, such as removing some columns and invalid data. Pandas is used to read the CSV file.

In [2]:
data = pd.read_csv("loandata.csv")

Display the head of the data.

In [3]:
data.head()

Unnamed: 0,Gender,Married,Education,TotalIncome,LoanAmount,CreditHistory,LoanStatus
0,Male,Yes,Graduate,6091.0,128.0,1.0,N
1,Male,Yes,Graduate,3000.0,66.0,1.0,Y
2,Male,Yes,Not Graduate,4941.0,120.0,1.0,Y
3,Male,No,Graduate,6000.0,141.0,1.0,Y
4,Male,Yes,Graduate,9613.0,267.0,1.0,Y


Data's columns:
* **Gender:** Applicant gender (Male/ Female)
* **Married:** Is the Applicant married? (Y/N)
* **Education:** Applicant Education (Graduate/ Not Graduate)
* **TotalIncome:** Applicant total income (sum of `ApplicantIncome` and `CoapplicantIncome` columns in the original dataset)
* **LoanAmount:** Loan amount in thousands
* **CreditHistory:** Credit history meets guidelines
* **LoanStatus** (Target)**:** Loan approved (Y/N)

### Data preprocessing

Define a list of categorical columns to encode.

In [4]:
categorical_columns = ["Gender", "Married", "Education", "LoanStatus"]

Encode categorical columns using the [`OrdinalEncoder`](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OrdinalEncoder.html) of scikit learn.

In [5]:
data[categorical_columns] = OrdinalEncoder().fit_transform(data[categorical_columns])

Split into `X` and `y`.

In [6]:
X = data.drop(columns="LoanStatus")
y = data.LoanStatus

Normalize data using the [`StandardScaler`](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html) of scikit learn.

In [7]:
X[X.columns] = StandardScaler().fit_transform(X[X.columns])

Convert `y` type to `int` 

In [8]:
y = y.astype(int)

Split dataset into train and test sets.

In [9]:
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=0.2, random_state=42)

## b. Dummy classifier

Build a dummy classifier that takes decisions randomly.

In [10]:
class DummyClassifier():
    
    def __init__(self):
        """
        Initialize the class.
        """
    
    def fit(self, X, y):
        """
        Fit the dummy classifier.
        
        Parameters
        ----------
        X : Numpy array or Pandas DataFrame of shape (n_samples, n_features)
            Training data.
        y : Numpy array or Pandas DataFrame of shape (n_samples,)
            Target values.
        """
        self.X_fit = X
        self.y_fit = y
    
    def predict(self, X):
        """
        Predict the class labels for the provided data.

        Parameters
        ----------
        X : Numpy array or Pandas DataFrame of shape (n_queries, n_features)
            Test samples.

        Returns
        -------
        y : Numpy array or Pandas DataFrame of shape (n_queries,)
            Class labels for each data sample.
        """
        return [np.random.choice(np.unique(self.y_fit)) for i in range(len(X))]

Implement a function to evaluate the performance of a classification by computing the accuracy ($N_{correct}/N$).

In [11]:
def accuracy_score(y_true, y_pred):
    return np.sum(y_true == y_pred) / len(y_pred) if len(y_true) == len(y_pred) else None

Compute the performance of the dummy classifier using the provided test set.

In [12]:
dummy = DummyClassifier()
dummy.fit(X_train, y_train)
accuracy_score(y_test, dummy.predict(X_test))

0.53125

## c. K-Nearest Neighbors classifier

Build a K-Nearest Neighbors classifier using an Euclidian distance computation and a simple majority voting criterion.

In [13]:
class KNNClassifier():
    
    def __init__(self, n_neighbors=3):
        """
        Initialize the class.
        
        Parameters
        ----------
        n_neighbors : int, default=3
            Number of neighbors to use by default.
        """
        self.n_neighbors = n_neighbors
    
    def fit(self, X, y):
        """
        Fit the k-nearest neighbors classifier.
        
        Parameters
        ----------
        X : Numpy array or Pandas DataFrame of shape (n_samples, n_features)
            Training data.
        y : Numpy array or Pandas DataFrame of shape (n_samples,)
            Target values.
        """
        self.X_fit = np.asarray(X)
        self.y_fit = np.asarray(y)
    
    @staticmethod
    def _euclidian_distance(a, b):
        """
        Utility function to compute the euclidian distance.
        
        Parameters
        ----------
        a : Numpy array or Pandas DataFrame
            First operand.
        b : Numpy array or Pandas DataFrame
            Second operand.
        """
        d = 0.0
        for i in range(len(a)-1):
            d += pow((a[i] - b[i]),2)
        return sqrt(d)
    
    def predict(self, X):
        labels_pred = []
        for x_to_pred in np.asarray(X):
            distances_tuple = []
            for i in range(len(self.X_fit)):
                distances_tuple.append((self._euclidian_distance(x_to_pred, self.X_fit[i]), self.y_fit[i]))
            distances_tuple.sort(key=lambda tup: tup[0])
            labels_kept = [item[1] for item in distances_tuple[:self.n_neighbors]]
            #print(str(labels_kept) + " -> " + str(max(set(labels_kept), key=labels_kept.count)))
            labels_pred.append(max(set(labels_kept), key=labels_kept.count))
        return labels_pred
        """
        Predict the class labels for the provided data.

        Parameters
        ----------
        X : Numpy array or Pandas DataFrame of shape (n_queries, n_features)
            Test samples.

        Returns
        -------
        y : Numpy array or Pandas DataFrame of shape (n_queries,)
            Class labels for each data sample.
        """

Compute the performance of the system as a function of $k = 1...7$.

In [14]:
for i in range(1,8):
    k = KNNClassifier(i)
    k.fit(X_train, y_train)
    print("k=" + str(i) + " -> accuracy = " + str(accuracy_score(y_test, k.predict(X_test))))

k=1 -> accuracy = 0.5625
k=2 -> accuracy = 0.53125
k=3 -> accuracy = 0.6458333333333334
k=4 -> accuracy = 0.6041666666666666
k=5 -> accuracy = 0.6666666666666666
k=6 -> accuracy = 0.6354166666666666
k=7 -> accuracy = 0.6458333333333334


Run the KNN algorithm using only the features `TotalIncome` and `CreditHistory`.

In [15]:
X_train_filtered = X_train[['TotalIncome', 'CreditHistory']]
X_test_filtered = X_test[['TotalIncome', 'CreditHistory']]
for i in range(1,8):
    k = KNNClassifier(i)
    k.fit(X_train_filtered, y_train)
    print("k=" + str(i) + " -> accuracy = " + str(accuracy_score(y_test, k.predict(X_test_filtered))))


k=1 -> accuracy = 0.5833333333333334
k=2 -> accuracy = 0.53125
k=3 -> accuracy = 0.6145833333333334
k=4 -> accuracy = 0.5208333333333334
k=5 -> accuracy = 0.625
k=6 -> accuracy = 0.59375
k=7 -> accuracy = 0.6354166666666666


Re-run the KNN algorithm using the features `TotalIncome`, `CreditHistory` and `Married`.

In [16]:
X_train_filtered = X_train[['TotalIncome', 'CreditHistory', 'Married']]
X_test_filtered = X_test[['TotalIncome', 'CreditHistory', 'Married']]
for i in range(1,8):
    k = KNNClassifier(i)
    k.fit(X_train_filtered, y_train)
    print("k=" + str(i) + " -> accuracy = " + str(accuracy_score(y_test, k.predict(X_test_filtered))))

k=1 -> accuracy = 0.7604166666666666
k=2 -> accuracy = 0.6875
k=3 -> accuracy = 0.78125
k=4 -> accuracy = 0.6979166666666666
k=5 -> accuracy = 0.8229166666666666
k=6 -> accuracy = 0.78125
k=7 -> accuracy = 0.8125


Re-run the KNN algorithm using all features.

In [17]:
for i in range(1,8):
    k = KNNClassifier(i)
    k.fit(X_train, y_train)
    print("k=" + str(i) + " -> accuracy = " + str(accuracy_score(y_test, k.predict(X_test))))

k=1 -> accuracy = 0.5625
k=2 -> accuracy = 0.53125
k=3 -> accuracy = 0.6458333333333334
k=4 -> accuracy = 0.6041666666666666
k=5 -> accuracy = 0.6666666666666666
k=6 -> accuracy = 0.6354166666666666
k=7 -> accuracy = 0.6458333333333334


In [18]:
for i in range(1,8):
    k = KNNClassifier(i)
    k.fit(X_train, y_train)
    print("k=" + str(i) + " -> accuracy = " + str(accuracy_score(y_test, k.predict(X_test))))

k=1 -> accuracy = 0.5625
k=2 -> accuracy = 0.53125
k=3 -> accuracy = 0.6458333333333334
k=4 -> accuracy = 0.6041666666666666
k=5 -> accuracy = 0.6666666666666666
k=6 -> accuracy = 0.6354166666666666
k=7 -> accuracy = 0.6458333333333334


c.
    
    b) When we try to find the best k, it looks like k=5 is the best because we have the best accuracy. It seems like choosing a lower k is not enough for learning and choosing a greater k make the model overfit a bit.

    e) The accuracy changes according to the number of features. As we can sse, a big number of features will not necessarily porduce an important accuracy.
    It's necessary to select the important features.  

    f) It takes the lower class's index. In this case, y = 0.