# Support Vector Machine

The purpose of this notebook is to predict whether an applicant is approved for a loan using a Support Vector Machine. The dataset and the problem definition come from the following [Kaggle](https://www.kaggle.com/competitions/playground-series-s4e10/overview) competition, to which we will also submit our predictions.

In [19]:
import pandas as pd

pd.set_option('display.max_rows', 25)
pd.set_option('display.max_columns', 500)
pd.set_option('display.max_colwidth', 50)

import numpy as np

np.random.seed(42)

%matplotlib inline
import matplotlib as mpl
import matplotlib.pyplot as plt

mpl.rc('axes', labelsize=14)
mpl.rc('xtick', labelsize=12)
mpl.rc('ytick', labelsize=12)
plt.rc('font', size=12)
plt.rc('figure', figsize=(12, 5))

import seaborn as sns

sns.set_style("whitegrid")
sns.set_context("notebook", font_scale=1, rc={"lines.linewidth": 2, 'font.family': [u'times']})

In [20]:
train_set = pd.read_csv('./dataset/train.csv', index_col=0)
test_set = pd.read_csv('./dataset/test.csv', index_col=0)

# Data Exploration

The first step in any machine learning project is to explore the data so that we know what we are working with. This will later help us decide how to process and use the data.

In [21]:
print(train_set.shape[0])
print(test_set.shape[0])

58645
39098


In [22]:
train_set.head()

id,person_age,person_income,person_home_ownership,person_emp_length,loan_intent,loan_grade,loan_amnt,loan_int_rate,loan_percent_income,cb_person_default_on_file,cb_person_cred_hist_length,loan_status
0,37,35000,RENT,0.0,EDUCATION,B,6000,11.49,0.17,N,14,0
1,22,56000,OWN,6.0,MEDICAL,C,4000,13.35,0.07,N,2,0
2,29,28800,OWN,8.0,PERSONAL,A,6000,8.9,0.21,N,10,0
3,30,70000,RENT,14.0,VENTURE,B,12000,11.11,0.17,N,5,0
4,22,60000,RENT,2.0,MEDICAL,A,6000,6.92,0.1,N,3,0


In [23]:
train_set.dtypes

person_age                      int64
person_income                   int64
person_home_ownership          object
person_emp_length             float64
loan_intent                    object
loan_grade                     object
loan_amnt                       int64
loan_int_rate                 float64
loan_percent_income           float64
cb_person_default_on_file      object
cb_person_cred_hist_length      int64
loan_status                     int64
dtype: object

In [24]:
train_set.select_dtypes(include=['object']).head()

id,person_home_ownership,loan_intent,loan_grade,cb_person_default_on_file
0,RENT,EDUCATION,B,N
1,OWN,MEDICAL,C,N
2,OWN,PERSONAL,A,N
3,RENT,VENTURE,B,N
4,RENT,MEDICAL,A,N


The previous cells show some properties of the dataset, such as the relative sizes of the train and test sets, the number of columns, and the data types used. The dataset contains 4 categorical columns, which we will probably need to one-hot encode later.

In the following steps we will check the data quality of the dataset, beginning by looking for NaNs.

In [25]:
train_set.isna().sum()

person_age                    0
person_income                 0
person_home_ownership         0
person_emp_length             0
loan_intent                   0
loan_grade                    0
loan_amnt                     0
loan_int_rate                 0
loan_percent_income           0
cb_person_default_on_file     0
cb_person_cred_hist_length    0
loan_status                   0
dtype: int64

As we can see, the training dataset does not contain any NaNs. This is great news because we can use all the data without having to fill gaps with estimated values or drop incomplete rows.

Now we will check for values that we know cannot be correct in the following columns: age, income, employment length, and loan amount.

In [26]:
# Ages need to be natural numbers within a plausible human range
# (double-quoted f-strings so the inner single quotes also parse before Python 3.12)
print(f"Minimum age: {train_set['person_age'].min()}, Maximum age: {train_set['person_age'].max()}")

# Income will always be positive
print(f"Minimum income: {train_set['person_income'].min()} ")

# Employment length cannot be as long as, or longer than, the person's age
print(f"Any employment length longer than age: {any(train_set['person_age'] <= train_set['person_emp_length'])}")

# Loan amount will always be positive
print(f"Minimum loan: {train_set['loan_amnt'].min()} ")


Minimum age: 20, Maximum age: 123
Minimum income: 4200 
Any employment length longer than age: True
Minimum loan: 500 


From the previous results we can see that some of the data is incorrect: one person is listed as 123 years old, and at least one person has an employment length that is as long as or longer than their age.
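
To get a feel for how many rows such checks would flag, the rules can be expressed as boolean masks and counted. The following is only a minimal sketch on a made-up toy frame that reuses the dataset's column names; the thresholds (a maximum age of 100 and a minimum working age of 16) are assumptions, not facts taken from the data.

```python
import pandas as pd

# Toy frame with the same column names as the dataset; the values are invented.
toy = pd.DataFrame({
    "person_age": [25, 123, 30, 40],
    "person_income": [40000, 52000, 61000, 30000],
    "person_emp_length": [3.0, 5.0, 31.0, 10.0],
    "loan_amnt": [5000, 8000, 12000, 700],
})

# One boolean Series per rule; True marks a suspicious row.
# The thresholds (100 years, working age of 16) are assumptions.
rules = {
    "age_over_100": toy["person_age"] > 100,
    "non_positive_income": toy["person_income"] <= 0,
    "started_work_before_16": (toy["person_age"] - toy["person_emp_length"]) < 16,
    "non_positive_loan": toy["loan_amnt"] <= 0,
}

violations = pd.DataFrame(rules)
counts = violations.sum()                       # rows failing each rule
n_flagged = int(violations.any(axis=1).sum())   # rows failing at least one rule

print(counts)
print(f"Rows flagged by at least one rule: {n_flagged}")
```

Counting before filtering makes it easy to confirm that only a small fraction of the rows would be discarded.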

# Data Preprocessing
The next step after exploring the data is preprocessing, where we will clean the data, normalise it, and remove anything irrelevant.

### Data Cleaning
As we saw in the previous section, some of the data does not make sense. We will begin by removing all of these cases.

In [27]:
def clean_data(df: pd.DataFrame) -> pd.DataFrame:
    # Filter for age between 20 and 100
    df = df[(df['person_age'] >= 20) & (df['person_age'] <= 100)]
    
    # Filter for positive income
    df = df[df['person_income'] > 0]
    
    # Filter where the difference between age and employment length is at least 16
    df = df[(df['person_age'] - df['person_emp_length']) >= 16]
    
    # Filter for positive loan amount
    df = df[df['loan_amnt'] > 0]
    
    return df

train_set = clean_data(train_set)

Now we will proceed to One-Hot encode all the categorical columns.

In [28]:
def OH_encode(df: pd.DataFrame, feature: str):
    dummies = pd.get_dummies(df[[feature]], dtype=int)
    df = df.drop(feature, axis=1)
    res = pd.concat([df, dummies], axis=1)
    return res

train_set_OH = train_set.copy()

for f in list(train_set.select_dtypes(include='object').columns):
    train_set_OH = OH_encode(train_set_OH, f)


In [29]:
train_set_OH

id,person_age,person_income,person_emp_length,loan_amnt,loan_int_rate,loan_percent_income,cb_person_cred_hist_length,loan_status,person_home_ownership_MORTGAGE,person_home_ownership_OTHER,person_home_ownership_OWN,person_home_ownership_RENT,loan_intent_DEBTCONSOLIDATION,loan_intent_EDUCATION,loan_intent_HOMEIMPROVEMENT,loan_intent_MEDICAL,loan_intent_PERSONAL,loan_intent_VENTURE,loan_grade_A,loan_grade_B,loan_grade_C,loan_grade_D,loan_grade_E,loan_grade_F,loan_grade_G,cb_person_default_on_file_N,cb_person_default_on_file_Y
0,37,35000,0.0,6000,11.49,0.17,14,0,0,0,0,1,0,1,0,0,0,0,0,1,0,0,0,0,0,1,0
1,22,56000,6.0,4000,13.35,0.07,2,0,0,0,1,0,0,0,0,1,0,0,0,0,1,0,0,0,0,1,0
2,29,28800,8.0,6000,8.90,0.21,10,0,0,0,1,0,0,0,0,0,1,0,1,0,0,0,0,0,0,1,0
3,30,70000,14.0,12000,11.11,0.17,5,0,0,0,0,1,0,0,0,0,0,1,0,1,0,0,0,0,0,1,0
4,22,60000,2.0,6000,6.92,0.10,3,0,0,0,0,1,0,0,0,1,0,0,1,0,0,0,0,0,0,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
58640,34,120000,5.0,25000,15.95,0.21,10,0,1,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,1
58641,28,28800,0.0,10000,12.73,0.35,8,1,0,0,0,1,0,0,0,1,0,0,0,0,1,0,0,0,0,1,0
58642,23,44000,7.0,6800,16.00,0.15,2,1,0,0,0,1,0,1,0,0,0,0,0,0,0,1,0,0,0,1,0
58643,22,30000,2.0,5000,8.90,0.17,3,0,0,0,0,1,0,1,0,0,0,0,1,0,0,0,0,0,0,1,0
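
If the same encoding is later applied to the test set, the resulting columns have to match those of the training set, even when a category is missing from one of the two files. The sketch below shows one possible way to do this with `reindex`. The two toy frames and their values are invented for illustration, and in the real data the target column `loan_status` would have to be set aside first, since the test set does not contain it.

```python
import pandas as pd

# Invented toy data: grade "C" appears in train but not in test.
train_toy = pd.DataFrame({"loan_grade": ["A", "B", "C"], "loan_amnt": [1000, 2000, 3000]})
test_toy = pd.DataFrame({"loan_grade": ["A", "A", "B"], "loan_amnt": [1500, 2500, 3500]})

# Encode each frame independently.
train_enc = pd.get_dummies(train_toy, columns=["loan_grade"], dtype=int)
test_enc = pd.get_dummies(test_toy, columns=["loan_grade"], dtype=int)

# Align the test columns to the training columns:
# columns missing in test are created and filled with 0, extra ones are dropped.
test_enc = test_enc.reindex(columns=train_enc.columns, fill_value=0)

print(test_enc)
```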


In the next step we will normalise the data.

In [30]:
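def normalize_df(df: pd.DataFrame) -> pd.DataFrame:
    # Max-abs scaling: divide each column by its largest absolute value.
    # Note that this modifies the DataFrame that is passed in.
    for col in df.columns:
        df[col] = df[col] / df[col].abs().max()
    
    return df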

train_set_OH = normalize_df(train_set_OH)

In [31]:
train_set_OH

id,person_age,person_income,person_emp_length,loan_amnt,loan_int_rate,loan_percent_income,cb_person_cred_hist_length,loan_status,person_home_ownership_MORTGAGE,person_home_ownership_OTHER,person_home_ownership_OWN,person_home_ownership_RENT,loan_intent_DEBTCONSOLIDATION,loan_intent_EDUCATION,loan_intent_HOMEIMPROVEMENT,loan_intent_MEDICAL,loan_intent_PERSONAL,loan_intent_VENTURE,loan_grade_A,loan_grade_B,loan_grade_C,loan_grade_D,loan_grade_E,loan_grade_F,loan_grade_G,cb_person_default_on_file_N,cb_person_default_on_file_Y
0,0.440476,0.018421,0.000000,0.171429,0.494832,0.204819,0.466667,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
1,0.261905,0.029474,0.146341,0.114286,0.574935,0.084337,0.066667,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0
2,0.345238,0.015158,0.195122,0.171429,0.383290,0.253012,0.333333,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
3,0.357143,0.036842,0.341463,0.342857,0.478467,0.204819,0.166667,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
4,0.261905,0.031579,0.048780,0.171429,0.298019,0.120482,0.100000,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
58640,0.404762,0.063158,0.121951,0.714286,0.686908,0.253012,0.333333,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0
58641,0.333333,0.015158,0.000000,0.285714,0.548234,0.421687,0.266667,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0
58642,0.273810,0.023158,0.170732,0.194286,0.689061,0.180723,0.066667,1.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0
58643,0.261905,0.015789,0.048780,0.142857,0.383290,0.204819,0.100000,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
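
The normalisation above divides every column by its maximum absolute value. When new data such as the test set has to be scaled in the same way, the maxima would normally be taken from the training set and reused, so that both sets share one scale. The following is a minimal sketch of that idea on invented toy frames; the zero-guard is an added assumption for columns that contain only zeros.

```python
import pandas as pd

# Invented toy data with two of the dataset's column names.
train_toy = pd.DataFrame({"person_age": [20, 40, 80], "loan_amnt": [500, 1000, 2000]})
test_toy = pd.DataFrame({"person_age": [30, 100], "loan_amnt": [4000, 250]})

# Per-column maxima learned from the training data only.
col_max = train_toy.abs().max()
# Assumption: replace a maximum of 0 with 1 to avoid dividing by zero.
col_max = col_max.replace(0, 1)

# Dividing a DataFrame by a Series aligns on column names.
train_scaled = train_toy / col_max
test_scaled = test_toy / col_max

print(train_scaled)
print(test_scaled)
```

Values in the test set can end up above 1 when they exceed the training maximum, which is expected with this kind of scaling.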


Now that we have normalised the data, we can look for the strongest correlations between the features.

In [37]:
# Get the pairwise correlations between all columns
corr_matrix = train_set_OH.corr()

# Blank out the lower triangle and the diagonal so each pair appears only once
n = corr_matrix.shape[0]
mask = np.tri(n, n, 0, dtype=bool)
corr_matrix[mask] = np.nan

corr_matrix = corr_matrix.stack().reset_index()
corr_matrix.columns = ['F1', 'F2', 'Corr']

top_5_corr = corr_matrix.nlargest(5, 'Corr')
bot_5_corr = corr_matrix.nsmallest(5, 'Corr')

print(top_5_corr)
print(bot_5_corr)

                F1                           F2      Corr
5       person_age   cb_person_cred_hist_length  0.876292
76       loan_amnt          loan_percent_income  0.646876
119  loan_int_rate  cb_person_default_on_file_Y  0.501420
114  loan_int_rate                 loan_grade_D  0.476913
335   loan_grade_C  cb_person_default_on_file_Y  0.475093
                                 F1                           F2      Corr
350     cb_person_default_on_file_N  cb_person_default_on_file_Y -1.000000
182  person_home_ownership_MORTGAGE   person_home_ownership_RENT -0.895226
111                   loan_int_rate                 loan_grade_A -0.822288
315                    loan_grade_A                 loan_grade_B -0.544994
118                   loan_int_rate  cb_person_default_on_file_N -0.501420


From the previous correlation results, we can see a high correlation between a person's age and the length of their credit history, which makes sense: the older a person is, the longer they have had to build up a credit history. On the other side, leaving aside the perfect negative correlation between the two default-on-file columns, which is obvious since they are complementary (Y or N), we can also see a strong negative correlation between the mortgage and rent columns, as well as between the loan interest rate and grade A loans.
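
The correlation of exactly -1 between the two default-on-file columns is a side effect of one-hot encoding a two-valued feature: each column is fully determined by the other. One common option, shown here only as a sketch on invented toy data and not as what this notebook does, is to drop the first level of each encoded feature with the `drop_first` parameter of `pd.get_dummies`.

```python
import pandas as pd

# Invented toy column with the same name as in the dataset.
toy = pd.DataFrame({"cb_person_default_on_file": ["N", "Y", "N", "N", "Y"]})

# Full one-hot encoding produces two complementary columns.
full = pd.get_dummies(toy, dtype=int)
corr_ny = full["cb_person_default_on_file_N"].corr(full["cb_person_default_on_file_Y"])
print(f"Correlation between the N and Y columns: {corr_ny:.2f}")

# drop_first=True keeps one column per two-valued feature.
reduced = pd.get_dummies(toy, drop_first=True, dtype=int)
print(list(reduced.columns))
```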