### Loan Prediction Data Wrangling


In this part of the loan prediction project, my goal is to prepare my data for future pre-processing and analysis.

In [10]:
# Import python packages
import pandas as pd

This standard package should help me review and prepare the dataset for future processing and analysis. For this project, I used a dataset from Kaggle, which can be found here: https://www.kaggle.com/datasets/nikhil1e9/loan-default/data

The goal of the project is to predict whether a customer will default on their loan, given a variety of features about the client. By identifying high-risk loans, financial institutions can reduce losses and improve profits.

In [11]:
# Load the dataset
data = pd.read_csv('Loan_default.csv')
print(data.head())

       LoanID  Age  Income  LoanAmount  CreditScore  MonthsEmployed  \
0  I38PQUQS96   56   85994       50587          520              80   
1  HPSK72WA7R   69   50432      124440          458              15   
2  C1OZ6DPJ8Y   46   84208      129188          451              26   
3  V2KKSFM3UN   32   31713       44799          743               0   
4  EY08JDHTZP   60   20437        9139          633               8   

   NumCreditLines  InterestRate  LoanTerm  DTIRatio    Education  \
0               4         15.23        36      0.44   Bachelor's   
1               1          4.81        60      0.68     Master's   
2               3         21.17        24      0.31     Master's   
3               3          7.07        24      0.23  High School   
4               4          6.51        48      0.73   Bachelor's   

  EmploymentType MaritalStatus HasMortgage HasDependents LoanPurpose  \
0      Full-time      Divorced         Yes           Yes       Other   
1      Full-time    

The dataset was loaded successfully using pandas read_csv. Next, I will look more closely at the dataset to get a better understanding of the data. 

Since the target variable, 'Default', is binary (0 = no default, 1 = default), this is a binary classification problem.
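Because the target is binary, it can help to look at how the two classes are split before modelling. The snippet below is a minimal sketch on a toy frame; the values are invented for illustration and are not taken from the loan dataset.

```python
import pandas as pd

# Hypothetical stand-in for the real 'Default' column (invented values)
toy = pd.DataFrame({"Default": [0, 0, 0, 1, 0, 1, 0, 0]})

# Share of each class; a strong skew toward one class is worth knowing
# about before choosing evaluation metrics or train/test splits
class_share = toy["Default"].value_counts(normalize=True)
print(class_share)
```

On the real data, the same call would be made on `data['Default']`.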

In [12]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 255347 entries, 0 to 255346
Data columns (total 18 columns):
 #   Column          Non-Null Count   Dtype  
---  ------          --------------   -----  
 0   LoanID          255347 non-null  object 
 1   Age             255347 non-null  int64  
 2   Income          255347 non-null  int64  
 3   LoanAmount      255347 non-null  int64  
 4   CreditScore     255347 non-null  int64  
 5   MonthsEmployed  255347 non-null  int64  
 6   NumCreditLines  255347 non-null  int64  
 7   InterestRate    255347 non-null  float64
 8   LoanTerm        255347 non-null  int64  
 9   DTIRatio        255347 non-null  float64
 10  Education       255347 non-null  object 
 11  EmploymentType  255347 non-null  object 
 12  MaritalStatus   255347 non-null  object 
 13  HasMortgage     255347 non-null  object 
 14  HasDependents   255347 non-null  object 
 15  LoanPurpose     255347 non-null  object 
 16  HasCoSigner     255347 non-null  object 
 17  Default   

There are 18 columns. Of these, 8 columns are objects (I need to investigate these more), 8 columns are integers and 2 columns are floats. The dataset does not appear to have any null/NaN values, but I will check more closely.
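One way to investigate the non-numeric columns is to count the distinct values in each. The sketch below uses a small made-up frame whose column names mirror the dataset; the rows themselves are assumptions for illustration only.

```python
import pandas as pd

# Toy frame with invented rows; column names follow the loan dataset
toy = pd.DataFrame({
    "LoanID": ["A1", "B2", "C3", "D4"],
    "Education": ["PhD", "Master's", "PhD", "High School"],
    "HasMortgage": ["Yes", "No", "Yes", "Yes"],
    "Age": [56, 69, 46, 32],
})

# Keep only the non-numeric columns and count distinct values in each
non_numeric = toy.select_dtypes(exclude="number")
unique_counts = non_numeric.nunique()
print(unique_counts)
```

A column with as many distinct values as there are rows (here the toy `LoanID`) behaves like an identifier rather than a category, while a column with only a few distinct values is a candidate for encoding.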

In [13]:
print(data.isnull().sum())

LoanID            0
Age               0
Income            0
LoanAmount        0
CreditScore       0
MonthsEmployed    0
NumCreditLines    0
InterestRate      0
LoanTerm          0
DTIRatio          0
Education         0
EmploymentType    0
MaritalStatus     0
HasMortgage       0
HasDependents     0
LoanPurpose       0
HasCoSigner       0
Default           0
dtype: int64


As suspected, there are no null values.  This is great!  The dataset seems to be in good initial shape.  

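The columns 'HasCoSigner', 'Education', 'EmploymentType', 'MaritalStatus', 'HasMortgage', 'HasDependents' and 'LoanPurpose' are categorical and need to be converted into a numeric type. 'LoanID' is also an object column, but it is only an identifier, so I will not encode it and will drop it later. One way to do the conversion is the pd.get_dummies function in pandas, which creates one indicator column per category. With drop_first=True, the first category of each column is dropped and acts as the baseline. This will make the categorical columns usable for machine learning models.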
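To make the effect of drop_first=True concrete, here is a minimal sketch on a toy frame. The rows are invented; only the column names and category labels follow the dataset.

```python
import pandas as pd

# Toy frame with invented rows
toy = pd.DataFrame({
    "Education": ["Bachelor's", "Master's", "PhD", "High School"],
    "HasMortgage": ["Yes", "No", "Yes", "No"],
})

# One indicator column per category, minus the first (alphabetical) level
encoded = pd.get_dummies(toy, columns=["Education", "HasMortgage"], drop_first=True)
print(encoded.columns.tolist())
```

The dropped levels ("Bachelor's" and "No" in this toy example) become the implicit baseline: a row with all indicators for a column set to False belongs to that baseline category.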

In [14]:
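# One-hot encode the categorical columns, dropping one baseline level per column
categorical_cols = ['HasCoSigner', 'Education', 'EmploymentType', 'MaritalStatus',
                    'HasMortgage', 'HasDependents', 'LoanPurpose']
data = pd.get_dummies(data, columns=categorical_cols, drop_first=True)
print(data.columns)
data.info()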

Index(['LoanID', 'Age', 'Income', 'LoanAmount', 'CreditScore',
       'MonthsEmployed', 'NumCreditLines', 'InterestRate', 'LoanTerm',
       'DTIRatio', 'Default', 'HasCoSigner_Yes', 'Education_High School',
       'Education_Master's', 'Education_PhD', 'EmploymentType_Part-time',
       'EmploymentType_Self-employed', 'EmploymentType_Unemployed',
       'MaritalStatus_Married', 'MaritalStatus_Single', 'HasMortgage_Yes',
       'HasDependents_Yes', 'LoanPurpose_Business', 'LoanPurpose_Education',
       'LoanPurpose_Home', 'LoanPurpose_Other'],
      dtype='object')
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 255347 entries, 0 to 255346
Data columns (total 26 columns):
 #   Column                        Non-Null Count   Dtype  
---  ------                        --------------   -----  
 0   LoanID                        255347 non-null  object 
 1   Age                           255347 non-null  int64  
 2   Income                        255347 non-null  int64  
 3   LoanAmount 

In [15]:
print(data['Education_PhD'])

0         False
1         False
2         False
3         False
4         False
          ...  
255342    False
255343    False
255344    False
255345    False
255346    False
Name: Education_PhD, Length: 255347, dtype: bool


There are now 26 columns. The only column that remains an object is 'LoanID', which I do not need and will drop. However, the dummy columns are bool type rather than numeric, so I will convert them to a numeric type, preferably float. To do this, I will use a lambda function to convert from bool to float.
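The same bool-to-float conversion can also be sketched without an element-wise function, by selecting the bool columns and casting them in one step. This is an alternative shown on a toy frame with invented values, not the approach used in the cell that follows.

```python
import pandas as pd

# Toy frame with invented values; two bool dummy columns and one integer column
toy = pd.DataFrame({
    "Age": [56, 69, 46],
    "Education_PhD": [False, True, False],
    "HasMortgage_Yes": [True, False, True],
})

# Find the bool columns automatically instead of listing them by hand
bool_cols = toy.select_dtypes(include="bool").columns

# Cast only those columns; False becomes 0.0 and True becomes 1.0
toy[bool_cols] = toy[bool_cols].astype(float)
print(toy.dtypes)
```

Selecting by dtype avoids maintaining a hand-written list of dummy column names, at the cost of also converting any other bool column that happens to be in the frame.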

In [16]:
# Convert the bool dummy columns to float (DataFrame.map requires pandas >= 2.1)
bool_to_float = lambda x: float(x) if isinstance(x, bool) else x
columns_to_convert = ['Education_High School',
                      "Education_Master's", 'Education_PhD', 'EmploymentType_Part-time',
                      'EmploymentType_Self-employed', 'EmploymentType_Unemployed',
                      'MaritalStatus_Married', 'MaritalStatus_Single', 'HasMortgage_Yes',
                      'HasDependents_Yes', 'LoanPurpose_Business', 'LoanPurpose_Education',
                      'LoanPurpose_Home', 'LoanPurpose_Other', 'HasCoSigner_Yes']
data[columns_to_convert] = data[columns_to_convert].map(bool_to_float)
# LoanID is only an identifier, so drop it
data = data.drop('LoanID', axis=1)
print(data)
print(data.dtypes)

        Age  Income  LoanAmount  CreditScore  MonthsEmployed  NumCreditLines  \
0        56   85994       50587          520              80               4   
1        69   50432      124440          458              15               1   
2        46   84208      129188          451              26               3   
3        32   31713       44799          743               0               3   
4        60   20437        9139          633               8               4   
...     ...     ...         ...          ...             ...             ...   
255342   19   37979      210682          541             109               4   
255343   32   51953      189899          511              14               2   
255344   56   84820      208294          597              70               3   
255345   42   85109       60575          809              40               1   
255346   62   22418       18481          636             113               2   

       

My data is now ready for analysis. My next step is exploratory data analysis, which I will complete in the next notebook.

In [17]:
print(data.shape)

(255347, 25)


In [18]:
# Save the cleaned dataset for the next notebook (index=False omits the row index)
transformed_data = data
transformed_data.to_csv('transformed_data.csv', index=False)
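A quick way to confirm that a saved file can be read back as expected is a round trip through CSV. The sketch below does this on a toy frame in a temporary directory; the file name `toy_transformed.csv` and the values are assumptions for illustration.

```python
import os
import tempfile

import pandas as pd

# Toy frame with invented values, shaped like a slice of the transformed data
toy = pd.DataFrame({
    "Age": [56, 69],
    "HasMortgage_Yes": [1.0, 0.0],
    "Default": [0, 1],
})

# Write to a temporary location, then read the file back
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy_transformed.csv")
    toy.to_csv(path, index=False)
    reloaded = pd.read_csv(path)

# Shape and column names should survive the round trip
print(reloaded.shape == toy.shape)
print(reloaded.columns.tolist() == toy.columns.tolist())
```

Because index=False was used when writing, the reloaded frame has no extra unnamed index column.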