## Initial Modeling Attempt: Small Business Loans with Random Forest

In [1]:
import pandas as pd
import numpy as np
import datetime as dt

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.pipeline import make_pipeline

In [2]:
df = pd.read_csv("../Data/Processed/sba_cleaned.csv")

## Preparing the Data

To start, we load in the cleaned data from our initial data wrangling.

In [3]:
## Recall the feature names from our data.
df.columns

Index(['Unnamed: 0', 'ID', 'Name', 'City', 'State', 'Zip', 'Bank', 'BankState',
       'NAICS', 'ApprovalDate', 'ApprovalFY', 'Term', 'NoEmp', 'NewExist',
       'CreateJob', 'RetainedJob', 'FranchiseCode', 'UrbanRural', 'LowDoc',
       'ChgOffDate', 'DisbursementDate', 'DisbursementGross', 'BalanceGross',
       'MIS_Status', 'ChgOffPrinGr', 'GrAppv', 'SBA_Appv'],
      dtype='object')

In [4]:
## It appears that I've accidentally added an extra column called `Unnamed: 0`! Let's remove it by name.
df = df.drop(columns='Unnamed: 0')

We choose the features to use in our model. Some features, such as `ChgOffDate`, clearly relate to the eventual fate of the loan, so they are not appropriate to use in our model. Similarly, we drop `DisbursementDate`, `DisbursementGross`, `BalanceGross` and `ChgOffPrinGr`. It is not obvious whether `CreateJob` and `RetainedJob` refer to projections from the loan application or to later follow-up, so we leave them in for now. To simplify the initial model, we drop `ApprovalDate` and keep only `ApprovalFY`.

We also remove `ID`, `Name`, `City`, `Zip`, `Bank`, `BankState` and `FranchiseCode`. These categorical variables take a large number of distinct values, which would create memory issues with one-hot encoding.
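One way to check this kind of claim is to count the distinct values in each column, since every distinct value becomes its own column under one-hot encoding. The sketch below uses a small made-up frame (`toy`, not the loan data) purely to show the call.

```python
import pandas as pd

# Toy frame standing in for the loan data; all values are made up.
toy = pd.DataFrame({
    "City": ["Austin", "Boston", "Chicago", "Denver"],
    "State": ["TX", "MA", "IL", "CO"],
    "LowDoc": ["Y", "N", "N", "Y"],
})

# Distinct values per column: each one becomes a column after one-hot encoding.
cardinality = toy.nunique().sort_values(ascending=False)
print(cardinality)
```

Running the same `nunique()` call on the real `df` would show which columns are too wide to encode directly.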

We subset on the remaining columns, and drop rows with missing values.

In [5]:
## Create list of features to use for our model
features = ['State', 'NAICS', 'ApprovalFY', 'Term', 'NoEmp', 'NewExist', 'CreateJob', 'RetainedJob', 'UrbanRural','LowDoc', 'MIS_Status', 'GrAppv', 'SBA_Appv']

In [6]:
## Subset on relevant columns (copy so later assignments do not act on a view of df)
df_pred = df[features].copy()

In [8]:
## Drop rows with missing values
df_pred = df_pred.dropna(axis = 0)

Since the CSV format does not store data types, we must examine them and reset them as needed.

In [10]:
## Examine the data types.
df_pred.dtypes

State           object
NAICS          float64
ApprovalFY      object
Term             int64
NoEmp            int64
NewExist       float64
CreateJob        int64
RetainedJob      int64
UrbanRural       int64
LowDoc          object
MIS_Status      object
GrAppv         float64
SBA_Appv       float64
dtype: object
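As an alternative to resetting types after loading, `pd.read_csv` accepts a `dtype` mapping, so some columns can be typed at read time. This is a minimal sketch on an in-memory CSV with made-up rows. Which columns to declare this way in the real file is a judgment call.

```python
import io
import pandas as pd

# Tiny in-memory CSV with made-up rows, standing in for the cleaned file.
csv_text = "State,Term,LowDoc\nTX,84,Y\nMA,120,N\n"

# Declare categorical columns up front; undeclared columns are inferred as usual.
toy = pd.read_csv(
    io.StringIO(csv_text),
    dtype={"State": "category", "LowDoc": "category"},
)
print(toy.dtypes)
```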

In [18]:
## Columns to be converted to categories
cat_cols = ['State', 'NAICS', 'NewExist', 'UrbanRural', 'LowDoc', 'MIS_Status']
df_pred[cat_cols] = df_pred[cat_cols].astype('category')

In [25]:
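## Save approval year as an integer, keeping only the first four characters.
## str(x) guards against entries that were already parsed as numbers.
df_pred['ApprovalFY'] = df_pred['ApprovalFY'].apply(lambda x: int(str(x)[:4]))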
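The same conversion can also be written with pandas string methods instead of a Python-level `apply`. The sketch below assumes, for illustration only, that some fiscal-year entries carry a trailing character after the four-digit year. The sample values are made up.

```python
import pandas as pd

# Made-up fiscal-year entries of mixed type, stored as an object column.
fy = pd.Series(["1997", "1976A", 2004], dtype=object)

# Cast everything to str, keep the first four characters, then convert to int.
fy_int = fy.astype(str).str[:4].astype(int)
print(fy_int.tolist())
```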

In [26]:
df_pred.dtypes

State          category
NAICS          category
ApprovalFY        int64
Term              int64
NoEmp             int64
NewExist       category
CreateJob         int64
RetainedJob       int64
UrbanRural     category
LowDoc         category
MIS_Status     category
GrAppv          float64
SBA_Appv        float64
dtype: object

There are a large number of NAICS codes, which may create memory issues with one-hot encoding. However, the first two digits of a NAICS code identify the overall type of industry. We simplify by replacing each NAICS code with its first two digits.

In [34]:
## First two digits of each industry code
df_pred['Industry'] = df_pred['NAICS'].apply(lambda x: str(x)[:2])

In [37]:
df_pred['Industry'] = df_pred['Industry'].astype('category')

In [42]:
df_pred = df_pred.drop('NAICS', axis = 1)
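The effect of this truncation is easy to see on a few values. The sketch below uses made-up six-digit codes stored as floats, as in the dtypes listing above, and casts to `int` first so the string form has no trailing `.0`.

```python
import pandas as pd

# Made-up six-digit NAICS codes, stored as floats.
naics = pd.Series([451120.0, 722410.0, 111110.0])

# Cast to int, then to str, and keep the two leading digits as a category.
sector = naics.astype(int).astype(str).str[:2].astype("category")
print(sector.tolist())
```

This gives the same two-character prefixes as `str(x)[:2]` for six-digit codes. The vectorized form just avoids a Python-level loop.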

We use one-hot encoding to encode categorical variables as numeric. 

In [43]:
## One-hot encode categorical features
df_features = pd.get_dummies(df_pred)

In [45]:
## Inspect our new list of features
df_features.columns

Index(['ApprovalFY', 'Term', 'NoEmp', 'CreateJob', 'RetainedJob', 'GrAppv',
       'SBA_Appv', 'State_AK', 'State_AL', 'State_AR', 'State_AZ', 'State_CA',
       'State_CO', 'State_CT', 'State_DC', 'State_DE', 'State_FL', 'State_GA',
       'State_HI', 'State_IA', 'State_ID', 'State_IL', 'State_IN', 'State_KS',
       'State_KY', 'State_LA', 'State_MA', 'State_MD', 'State_ME', 'State_MI',
       'State_MN', 'State_MO', 'State_MS', 'State_MT', 'State_NC', 'State_ND',
       'State_NE', 'State_NH', 'State_NJ', 'State_NM', 'State_NV', 'State_NY',
       'State_OH', 'State_OK', 'State_OR', 'State_PA', 'State_RI', 'State_SC',
       'State_SD', 'State_TN', 'State_TX', 'State_UT', 'State_VA', 'State_VT',
       'State_WA', 'State_WI', 'State_WV', 'State_WY', 'NewExist_1.0',
       'NewExist_2.0', 'UrbanRural_0', 'UrbanRural_1', 'UrbanRural_2',
       'LowDoc_N', 'LowDoc_Y', 'MIS_Status_CHGOFF', 'MIS_Status_P I F',
       'Industry_11', 'Industry_21', 'Industry_22', 'Industry_23',
       'Ind

In [51]:
## One-hot encoding creates two columns corresponding to MIS Status.
## Drop the redundant `MIS_Status_P I F` column by name rather than by position.
df_features = df_features.drop('MIS_Status_P I F', axis = 1)

In [56]:
## Array of predictors
X = df_features.drop('MIS_Status_CHGOFF', axis = 1).values

In [57]:
## Column of labels
y = df_features['MIS_Status_CHGOFF'].values
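A possible alternative is to build the label directly from `MIS_Status` and one-hot encode only the predictors, so that no label column has to be dropped afterwards. The sketch below shows the idea on a made-up frame. The names `toy`, `X_toy` and `y_toy` are illustrative only.

```python
import pandas as pd

# Made-up rows with one numeric feature, one categorical feature and the label.
toy = pd.DataFrame({
    "Term": [84, 120, 60, 36],
    "LowDoc": ["Y", "N", "N", "Y"],
    "MIS_Status": ["CHGOFF", "P I F", "P I F", "CHGOFF"],
})

# Label: 1 if the loan was charged off, 0 otherwise.
y_toy = (toy["MIS_Status"] == "CHGOFF").astype(int).values

# Predictors: everything except the label, with categoricals one-hot encoded.
X_toy = pd.get_dummies(toy.drop(columns="MIS_Status"))
print(list(X_toy.columns))
```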

## Training the Model

We train a random forest model, holding out 20% of our data as a test set and training on the remaining 80%. We scale the features using `StandardScaler`.

Question: should the scaler be applied to columns representing categorical data?
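One way to explore this question is to scale only the numeric columns and pass the one-hot columns through unchanged, using scikit-learn's `ColumnTransformer`. Tree-based models split on thresholds, so rescaling individual features should have little effect on a random forest. The distinction matters more if a scale-sensitive model is swapped in later. The sketch below runs on randomly generated data. The column names are borrowed from our features, but the values are not real.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Randomly generated stand-in data: two numeric columns and one dummy column.
rng = np.random.default_rng(0)
toy_X = pd.DataFrame({
    "Term": rng.integers(12, 300, size=40),
    "GrAppv": rng.uniform(1e4, 1e6, size=40),
    "LowDoc_Y": rng.integers(0, 2, size=40),
})
toy_y = rng.integers(0, 2, size=40)

# Scale only the numeric columns; leave the remaining columns untouched.
numeric_cols = ["Term", "GrAppv"]
scale_numeric = ColumnTransformer(
    [("scale", StandardScaler(), numeric_cols)],
    remainder="passthrough",
)

pipe = make_pipeline(
    scale_numeric,
    RandomForestClassifier(n_estimators=10, random_state=0),
)
pipe.fit(toy_X, toy_y)

# Scaled columns come first in the output, passthrough columns after them.
transformed = scale_numeric.fit_transform(toy_X)
```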

In [62]:
## Split data into test and train
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

In [63]:
## Build a Random Forest pipeline, making sure to scale the data first
RF_pipe = make_pipeline(
    StandardScaler(),
    RandomForestClassifier(random_state = 0)
)

In [64]:
## Fit the model
RF_pipe.fit(X_train, y_train)

Pipeline(steps=[('standardscaler', StandardScaler()),
                ('randomforestclassifier',
                 RandomForestClassifier(random_state=0))])

In [67]:
## Make predictions
y_te_pred = RF_pipe.predict(X_test)

In [69]:
## Accuracy score on the test set is nearly 93%. Not bad for a first try!
accuracy_score(y_test, y_te_pred)

0.9295228289478454
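Accuracy is easier to interpret next to a baseline. If charged-off loans are a minority of the data, a model that always predicts the majority class can already score well. The sketch below shows how such a baseline and a confusion matrix could be computed. The labels are made up, and the 80/20 split is an assumption for illustration, not the class balance of our data.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, confusion_matrix

# Made-up labels: 0 = paid in full, 1 = charged off (assumed 80/20 split).
toy_y = np.array([0] * 80 + [1] * 20)
toy_X = np.zeros((100, 1))  # the dummy model ignores the features

# Baseline that always predicts the most frequent class.
baseline = DummyClassifier(strategy="most_frequent").fit(toy_X, toy_y)
baseline_pred = baseline.predict(toy_X)

baseline_acc = accuracy_score(toy_y, baseline_pred)
cm = confusion_matrix(toy_y, baseline_pred)  # rows: true class, columns: predicted
print(baseline_acc)
print(cm)
```

Applying the same two calls to `y_test` and `y_te_pred` would show how much of the 93% comes from the model rather than from class imbalance.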