# Vehicle Loan Prediction Machine Learning Model

# Chapter 5 - Linear Classifiers

### Load Data and Import Libraries

- Notice that we have included two new modules from sklearn

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

In [2]:
loan_df = pd.read_csv('../data/vehicle_loans_feat.csv', index_col='UNIQUEID')

## Lesson 1 - Train/Test Split

For the rest of this chapter, we will work through the steps of creating a simple linear classifier using Logistic Regression

First let's remind ourselves of the variables we are dealing with

In [3]:
#look at the columns
loan_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 233154 entries, 420825 to 630213
Data columns (total 31 columns):
 #   Column                               Non-Null Count   Dtype  
---  ------                               --------------   -----  
 0   DISBURSED_AMOUNT                     233154 non-null  float64
 1   ASSET_COST                           233154 non-null  float64
 2   LTV                                  233154 non-null  float64
 3   MANUFACTURER_ID                      233154 non-null  int64  
 4   EMPLOYMENT_TYPE                      233154 non-null  object 
 5   STATE_ID                             233154 non-null  int64  
 6   AADHAR_FLAG                          233154 non-null  int64  
 7   PAN_FLAG                             233154 non-null  int64  
 8   VOTERID_FLAG                         233154 non-null  int64  
 9   DRIVING_FLAG                         233154 non-null  int64  
 10  PASSPORT_FLAG                        233154 non-null  int64  
 11  PERFORM_

It is important that our classifier recognises categorical variables where appropriate.

Let's use the dtypes property to look at the variable types of our categorical fields.

In [4]:
#look at categorical data types
category_cols = ['DISBURSED_CAT', 'DISBURSAL_MONTH', 'PERFORM_CNS_SCORE_DESCRIPTION', 'STATE_ID', 'MANUFACTURER_ID', 'EMPLOYMENT_TYPE']
loan_df[category_cols].dtypes

DISBURSED_CAT                    object
DISBURSAL_MONTH                   int64
PERFORM_CNS_SCORE_DESCRIPTION    object
STATE_ID                          int64
MANUFACTURER_ID                   int64
EMPLOYMENT_TYPE                  object
dtype: object


- We do not want to treat MANUFACTURER_ID, STATE_ID and DISBURSAL_MONTH as integers
- We can encode our categorical columns with the category data type

In [5]:
#convert to categorical type
loan_df[category_cols] = loan_df[category_cols].astype('category')
loan_df[category_cols].dtypes

DISBURSED_CAT                    category
DISBURSAL_MONTH                  category
PERFORM_CNS_SCORE_DESCRIPTION    category
STATE_ID                         category
MANUFACTURER_ID                  category
EMPLOYMENT_TYPE                  category
dtype: object

### EXERCISE 

- To keep our first model simple, select 6 variables including 'LOAN_DEFAULT' and 'DISBURSED_CAT'
- Using these variables create a subset of loan_df and store it as a separate DataFrame loan_df_sml
- HINT: Think about the results of your exploratory analysis, which variables might be good predictors?

### SOLUTION

- I have selected the following 6 columns, 'STATE_ID', 'LTV', 'DISBURSED_CAT', 'PERFORM_CNS_SCORE', 'DISBURSAL_MONTH', 'LOAN_DEFAULT'
- You could have selected any six columns you are interested in, so long as they include 'LOAN_DEFAULT' and 'DISBURSED_CAT', which we will use later in this chapter

In [6]:
#type solution here
small_cols = ['STATE_ID', 'LTV', 'DISBURSED_CAT', 'PERFORM_CNS_SCORE', 'DISBURSAL_MONTH', 'LOAN_DEFAULT']
loan_df_sml = loan_df[small_cols]

Nice! Let's have a quick look at our new dataframe

In [7]:
#check the dimensions
loan_df_sml.shape

(233154, 6)

We still have 233154 rows but now there are only 6 columns

In [8]:
#look at the columns
loan_df_sml.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 233154 entries, 420825 to 630213
Data columns (total 6 columns):
 #   Column             Non-Null Count   Dtype   
---  ------             --------------   -----   
 0   STATE_ID           233154 non-null  category
 1   LTV                233154 non-null  float64 
 2   DISBURSED_CAT      233154 non-null  category
 3   PERFORM_CNS_SCORE  233154 non-null  float64 
 4   DISBURSAL_MONTH    233154 non-null  category
 5   LOAN_DEFAULT       233154 non-null  int64   
dtypes: category(3), float64(2), int64(1)
memory usage: 7.8 MB


### Training/Test Split

- Before we fit (train) our basic linear model we need to split our data into training and test sets.
- Training Data: used to fit the model to our specific data
- Test Data: used to test the predictive power of the trained model  

We can use the [train_test_split](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html) from sklearn to create our training and test sets

[train_test_split](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html) splits whatever arrays we pass to it; we will pass it two inputs

- x: all of the rows and columns except the target variable 
- y: all of the rows but just the target variable column

### EXERCISE

- Create two variables x and y to match the required parameters for [train_test_split](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html)

### SOLUTION

In [9]:
#type solution here
x = loan_df_sml.drop(['LOAN_DEFAULT'], axis = 1)
y = loan_df_sml['LOAN_DEFAULT']

We should investigate the dimensions of x and y to make sure the above solution is correct

In [12]:
#check the rows and columns
print(x.shape)
print(y.shape)

(233154, 5)
(233154,)


In [13]:
#x info
x.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 233154 entries, 420825 to 630213
Data columns (total 5 columns):
 #   Column             Non-Null Count   Dtype   
---  ------             --------------   -----   
 0   STATE_ID           233154 non-null  category
 1   LTV                233154 non-null  float64 
 2   DISBURSED_CAT      233154 non-null  category
 3   PERFORM_CNS_SCORE  233154 non-null  float64 
 4   DISBURSAL_MONTH    233154 non-null  category
dtypes: category(3), float64(2)
memory usage: 6.0 MB


In [15]:
#y info
y.dtype

dtype('int64')

Great! Looks like we have what we need, now we can create our training/test data sets

In addition to the required parameters of [train_test_split](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html) we will also use 
- test_size: floating value between 0 and 1 indicating the proportion of the rows to place in the test set 
- random_state: integer value used for random seeding, allows for repeatability of the split

In [16]:
#train test split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state = 42)

Notice that [train_test_split](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html) returns 4 output values 

- x_train: the training rows without the target variable 
- x_test: the test rows without the target variable 
- y_train: the training rows, target variable only 
- y_test: the test rows, target variable only 

Let's familiarize ourselves with this output

In [17]:
#check rows and columns
print("x_train has {0} rows and {1} columns".format(x_train.shape[0], x_train.shape[1]))
print("x_test has {0} rows and {1} columns".format(x_test.shape[0], x_test.shape[1]))
print("y_train has {0} rows".format(y_train.count()))
print("y_test has {0} rows".format(y_test.count()))

x_train has 186523 rows and 5 columns
x_test has 46631 rows and 5 columns
y_train has 186523 rows
y_test has 46631 rows


Looks like the number of rows and columns is what we would expect

In [18]:
#x train info
x_train.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 186523 entries, 633275 to 501520
Data columns (total 5 columns):
 #   Column             Non-Null Count   Dtype   
---  ------             --------------   -----   
 0   STATE_ID           186523 non-null  category
 1   LTV                186523 non-null  float64 
 2   DISBURSED_CAT      186523 non-null  category
 3   PERFORM_CNS_SCORE  186523 non-null  float64 
 4   DISBURSAL_MONTH    186523 non-null  category
dtypes: category(3), float64(2)
memory usage: 4.8 MB


In [19]:
#y train info
y_train.head()

UNIQUEID
633275    1
646002    0
591252    0
475736    0
639478    0
Name: LOAN_DEFAULT, dtype: int64

In [20]:
#x test info
x_test.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 46631 entries, 617183 to 626383
Data columns (total 5 columns):
 #   Column             Non-Null Count  Dtype   
---  ------             --------------  -----   
 0   STATE_ID           46631 non-null  category
 1   LTV                46631 non-null  float64 
 2   DISBURSED_CAT      46631 non-null  category
 3   PERFORM_CNS_SCORE  46631 non-null  float64 
 4   DISBURSAL_MONTH    46631 non-null  category
dtypes: category(3), float64(2)
memory usage: 1.2 MB


In [21]:
#y test info
y_test.head()

UNIQUEID
617183    1
515702    0
466872    0
632384    0
461426    0
Name: LOAN_DEFAULT, dtype: int64

Brilliant! All the train and test data has the correct columns 

Now let's use [value_counts](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.value_counts.html) to check the distribution of the class variable

In [22]:
#check the training target variable
y_train.value_counts(normalize=True)

0    0.783099
1    0.216901
Name: LOAN_DEFAULT, dtype: float64

In [23]:
#check the test target variable
y_test.value_counts(normalize=True)

0    0.782248
1    0.217752
Name: LOAN_DEFAULT, dtype: float64

Great! Both the training and test sets contain defaulted loans at roughly 21.7% (21.69% and 21.78% respectively)!

We did not need to stratify due to the size of the dataset and the random nature of the sampling in [train_test_split](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html)
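For smaller or more imbalanced datasets, the class proportions can drift between the two sets. As a minimal sketch, the `stratify` parameter of train_test_split asks for the proportions to be preserved. The toy data and the names `toy_x`, `toy_y`, `tx_train` and so on are made up for illustration and are not part of the loan dataset.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 1,000 rows with roughly 22% positives
rng = np.random.default_rng(42)
toy_x = pd.DataFrame({"feature": rng.normal(size=1000)})
toy_y = pd.Series((rng.random(1000) < 0.22).astype(int), name="target")

# stratify=toy_y asks for the class proportions to be kept in both splits
tx_train, tx_test, ty_train, ty_test = train_test_split(
    toy_x, toy_y, test_size=0.2, random_state=42, stratify=toy_y
)

# proportion of positives in each split
print(ty_train.mean(), ty_test.mean())
```

With our 233154 rows the unstratified split was already very close, so this is something to keep in mind rather than something we need here.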

## Lesson 2 - Variable Encoding

Now it's time to build our first binary classifier!

First we create the model object using [LogisticRegression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html)

In [24]:
#initialize the model
logistic_model = LogisticRegression()

Now we [fit](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html#sklearn.linear_model.LogisticRegression.fit) the training data!

In [25]:
#fit the logistic model
logistic_model.fit(x_train, y_train)

ValueError: could not convert string to float: '60k - 75k'

### One Hot Encoding 

Ok, looks like we made a mistake!

The problem is that Logistic Regression, like most machine learning methods, does not know how to deal with string data

We can use [pd.get_dummies](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.get_dummies.html) to one hot encode our categorical variables 

- Remember in lesson 1 we converted our categorical variables to the 'category' data type 
- If we didn't do this, variables like STATE_ID which contained integer representations of categories would be missed by [pd.get_dummies](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.get_dummies.html)
- Then they would be incorrectly treated as continuous variables
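The points above can be checked on a tiny example. This is only a sketch on a made-up DataFrame (`toy` and the `BAND` column are hypothetical), comparing the columns produced before and after converting the integer column to the 'category' type.

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    "STATE_ID": [1, 2, 3, 1],               # integer codes standing in for categories
    "BAND": ["low", "high", "low", "mid"],  # string column
})

# STATE_ID is still an integer here, so only BAND is encoded
as_int = pd.get_dummies(toy)

# after conversion to 'category', STATE_ID is encoded as well
toy_cat = toy.astype({"STATE_ID": "category"})
as_cat = pd.get_dummies(toy_cat)

print(list(as_int.columns))
print(list(as_cat.columns))
```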

Let's one hot encode our small dataframe and assign it to a new variable 'loan_data_dumm'


In [26]:
#one hot encode
loan_data_dumm = pd.get_dummies(loan_df_sml, prefix_sep = '_', drop_first = True)

We are passing three parameters to [pd.get_dummies](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.get_dummies.html)

- loan_df_sml: our small dataframe which we want to encode 
- prefix_sep: separator placed between the original column name and the category value, so new columns will be created like 'STATE_ID_2'
- drop_first: drop the first dummy variable for each category 

HOLD ON! Why are we dropping the first dummy variable for each category?
- Think about it
- If we have 10 boolean variables indicating the presence of some category and there are no missing values 
- Then if an observation doesn't belong to any of the first 9 categories 
- It must belong to the 10th 
- So we can drop one of the dummy columns without losing any information
- This helps to simplify the model and reduce the impact of correlated variables 
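To make the reasoning above concrete, here is a small sketch on a made-up three-category column (`toy` and `BAND` are hypothetical names). It compares the full set of dummy columns with the set produced by drop_first.

```python
import pandas as pd

# Hypothetical toy column with three categories
toy = pd.DataFrame({"BAND": pd.Categorical(["low", "mid", "high", "low"])})

full = pd.get_dummies(toy)                      # one dummy column per category
reduced = pd.get_dummies(toy, drop_first=True)  # first category's column dropped

# which dummy column was dropped
dropped = set(full.columns) - set(reduced.columns)
print(dropped)

# rows belonging to the dropped category are all zeros in the reduced frame
print(reduced)
```

The dropped category is still recoverable: it is the one case where every remaining dummy column for that variable is 0.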

Let's look at the results of [pd.get_dummies](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.get_dummies.html)

In [27]:
#check the columns
loan_data_dumm.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 233154 entries, 420825 to 630213
Data columns (total 40 columns):
 #   Column                    Non-Null Count   Dtype  
---  ------                    --------------   -----  
 0   LTV                       233154 non-null  float64
 1   PERFORM_CNS_SCORE         233154 non-null  float64
 2   LOAN_DEFAULT              233154 non-null  int64  
 3   STATE_ID_2                233154 non-null  uint8  
 4   STATE_ID_3                233154 non-null  uint8  
 5   STATE_ID_4                233154 non-null  uint8  
 6   STATE_ID_5                233154 non-null  uint8  
 7   STATE_ID_6                233154 non-null  uint8  
 8   STATE_ID_7                233154 non-null  uint8  
 9   STATE_ID_8                233154 non-null  uint8  
 10  STATE_ID_9                233154 non-null  uint8  
 11  STATE_ID_10               233154 non-null  uint8  
 12  STATE_ID_11               233154 non-null  uint8  
 13  STATE_ID_12               233154 non-nu

Great! Looks like we have dummy columns for our categoricals

### EXERCISE 

- Take time to investigate the contents of these new columns 
- Make sure you understand how [pd.get_dummies](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.get_dummies.html) is transforming our dataset

### SOLUTION

In [None]:
#Extra space for exploration


## Lesson 3 - Train and Validate

### EXERCISE 

- Recreate our training and test set using loan_data_dumm
- Make sure the class distributions are correct

### SOLUTION 

In [28]:
#type solution here
x = loan_data_dumm.drop(['LOAN_DEFAULT'], axis = 1)
y = loan_data_dumm['LOAN_DEFAULT']

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.3, random_state = 42)

print(y_train.value_counts(normalize = True))
print(y_test.value_counts(normalize = True))

0    0.782975
1    0.217025
Name: LOAN_DEFAULT, dtype: float64
0    0.782821
1    0.217179
Name: LOAN_DEFAULT, dtype: float64


Now let's try to [fit](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html#sklearn.linear_model.LogisticRegression.fit) our model again

In [29]:
#initialize and train logistic regression
logistic_model = LogisticRegression()
logistic_model.fit(x_train, y_train)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


LogisticRegression()

Ok! We are nearly there. We have successfully trained our model. But there is a warning we should take care of.

The above warning is telling us that the solver used by LogisticRegression did not converge on a solution within the maximum number of iterations. 

The specifics of this warning are out of the scope of this course, but a likely explanation is that our continuous features have not been scaled, which can make the solver slow to converge. This is why the warning message suggests scaling the data. 

Something to keep in mind, but for now, we can try to resolve the warning by increasing the maximum allowed iterations.

The default value is 100, so let's try 200!

*The warning may or may not appear depending on your system. If you don't see any problems you can skip this step*

In [30]:
#fit model
logistic_model = LogisticRegression(max_iter=200)
logistic_model.fit(x_train, y_train)

LogisticRegression(max_iter=200)

Great! We have successfully trained our model
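The warning text also pointed to scaling the data as an alternative to raising max_iter. The following is a rough sketch of that option, not the approach used in this chapter: it chains StandardScaler and LogisticRegression with make_pipeline on synthetic data. The feature names, value ranges and target below are all invented for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 2000

# two hypothetical features on very different scales
ratio = rng.uniform(10, 95, size=n)
score = rng.uniform(0, 900, size=n)
toy_x = np.column_stack([ratio, score])

# synthetic target loosely related to both features
logit = 0.03 * (ratio - 50) - 0.002 * (score - 450)
toy_y = (rng.random(n) < 1 / (1 + np.exp(-logit))).astype(int)

# the scaler is fitted on the training data and applied before the model
scaled_model = make_pipeline(StandardScaler(), LogisticRegression())
scaled_model.fit(toy_x, toy_y)

# number of solver iterations actually used
print(scaled_model.named_steps["logisticregression"].n_iter_)
```

Putting the scaler inside the pipeline means the same scaling learned from the training rows is reused when predicting on test rows.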

Now we need to generate some predictions for our test set

We pass our test features to the model to generate predictions, using [predict](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html#sklearn.linear_model.LogisticRegression.predict)



In [31]:
#generate predictions
preds = logistic_model.predict(x_test)
preds

array([0, 0, 0, ..., 0, 0, 0], dtype=int64)

The output of predict is an array of 0s and 1s representing the loan default prediction

This is great but we need some measure of model performance

The [score](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html#sklearn.linear_model.LogisticRegression.score) function generates predictions and compares the predicted class with the actual class. The output is a floating-point number between 0 and 1 telling us the fraction of loans we correctly classified, also known as the accuracy!

In [32]:
#get accuracy
logistic_model.score(x_test, y_test)

0.7828212789683617
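To see what score is doing, the sketch below computes the same number three ways on synthetic data: by hand, with accuracy_score from sklearn.metrics, and with score. The data and the names `toy_x`, `toy_y`, `model` and `toy_preds` are made up for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# hypothetical toy data: the target depends on the first feature plus noise
rng = np.random.default_rng(1)
toy_x = rng.normal(size=(500, 2))
toy_y = (toy_x[:, 0] + rng.normal(scale=0.5, size=500) > 0).astype(int)

model = LogisticRegression().fit(toy_x, toy_y)
toy_preds = model.predict(toy_x)

# fraction of predictions that match the true labels
manual = (toy_preds == toy_y).mean()

print(manual, accuracy_score(toy_y, toy_preds), model.score(toy_x, toy_y))
```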

Wow! At first glance our model performed quite well: it classified 78% of our test cases correctly.

Don't get too excited, accuracy can be a misleading measure of model performance! The next chapter will look at other measures of model performance
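One reason for caution: the accuracy above (0.782821) is the same as the share of non-defaulted loans in y_test, which suggests the model may be predicting 0 for almost every loan. As a rough illustration on synthetic labels (not the loan data), the sketch below uses DummyClassifier from sklearn.dummy to show the accuracy obtained by always predicting the most frequent class.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# hypothetical labels with roughly 22% positives
rng = np.random.default_rng(7)
toy_y = (rng.random(10_000) < 0.22).astype(int)
toy_x = np.zeros((10_000, 1))  # features are ignored by this baseline

# always predicts the most frequent class seen during fit
baseline = DummyClassifier(strategy="most_frequent").fit(toy_x, toy_y)
baseline_acc = baseline.score(toy_x, toy_y)

# accuracy of the baseline equals the share of the majority class
print(baseline_acc, 1 - toy_y.mean())
```

A model is only useful if it beats this kind of baseline, which is one motivation for the additional measures in the next chapter.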