**Step 1**
We load the CSV file into a pandas DataFrame so that any pandas function can be used to manipulate the data.

In [36]:
#loading data
import pandas as pd
data = pd.read_csv("SBA_loans_project_1.csv")

**Step 2**
Remove the index column as it is not required for training

In [37]:
#removing index
data.drop(columns="index",inplace=True)
data.head()


Unnamed: 0,City,State,Zip,Bank,BankState,NAICS,NoEmp,NewExist,CreateJob,RetainedJob,FranchiseCode,UrbanRural,RevLineCr,LowDoc,DisbursementGross,BalanceGross,GrAppv,SBA_Appv,MIS_Status
0,GLEN BURNIE,MD,21060,"BUSINESS FINANCE GROUP, INC.",VA,811111,7,1.0,6,7,1,1,0,N,743000.0,0.0,743000.0,743000.0,0
1,WEST BEND,WI,53095,JPMORGAN CHASE BANK NATL ASSOC,IL,722410,20,1.0,0,0,1,0,N,N,137000.0,0.0,137000.0,109737.0,0
2,SAN DIEGO,CA,92128,UMPQUA BANK,OR,0,2,1.0,0,0,1,0,0,N,280000.0,0.0,280000.0,210000.0,0
3,WEBSTER,MA,1570,HOMETOWN BANK A CO-OPERATIVE B,MA,621310,7,1.0,0,0,1,1,0,Y,144500.0,0.0,144500.0,122825.0,0
4,JOPLIN,MO,64804,U.S. BANK NATIONAL ASSOCIATION,OH,0,2,2.0,0,0,1,0,N,Y,52500.0,0.0,52500.0,42000.0,0


**Step 3**
Next, we explore the data to check that it matches the data description, then find the missing / NA values and replace them with appropriate values.

In [38]:
#show unique values in each column
for col in data.columns:
    print(col,":",data[col].unique())
    print("")

#find datatype of column
for col in data.columns:
    print(col,":",data[col].dtype)

City : ['GLEN BURNIE' 'WEST BEND' 'SAN DIEGO' ... 'Orange park' 'GREENHAVEN'
 'SCHAFFERSTOWN']

State : ['MD' 'WI' 'CA' 'MA' 'MO' 'OH' 'IL' 'GA' 'MI' 'NY' 'SC' 'FL' 'KS' 'ID'
 'AZ' 'NH' 'NM' 'KY' 'NJ' 'TX' 'PA' 'MN' 'OK' 'OR' 'WA' 'IN' 'UT' 'AL'
 'MS' 'CO' 'NC' 'CT' 'ME' 'HI' 'LA' 'IA' 'MT' 'RI' 'WV' 'NV' 'AR' 'VA'
 'TN' 'ND' 'VT' 'WY' 'AK' 'SD' 'DE' 'NE' 'DC' nan]

Zip : [21060 53095 92128 ... 32006 56038 14784]

Bank : ['BUSINESS FINANCE GROUP, INC.' 'JPMORGAN CHASE BANK NATL ASSOC'
 'UMPQUA BANK' ... 'WILSHIRE CREDIT CORP' 'NEVADA BANK & TRUST COMPANY'
 'FIRST COMMUN BK OF OZARKS']

BankState : ['VA' 'IL' 'OR' 'MA' 'OH' 'CA' 'SD' 'CT' 'RI' 'SC' 'WI' 'GA' 'MI' 'AZ'
 'DE' 'NY' 'NM' 'KY' 'NC' 'NJ' 'MN' 'WA' 'UT' 'IN' 'AL' 'MS' 'TX' 'DC'
 'CO' 'ID' 'PA' 'NH' 'MO' 'MD' 'HI' 'TN' 'IA' 'FL' 'LA' 'MT' nan 'KS' 'WV'
 'NV' 'OK' 'NE' 'ME' 'ND' 'WY' 'AK' 'VT' 'AR' 'PR' 'GU' 'VI' 'EN']

NAICS : [811111 722410      0 ... 927110 813211 336414]

NoEmp : [   7   20    2    5    3    1   10    4   18

According to the dataset description, a few columns contain values outside their documented codes:
- RevLineCr should only have values of Y or N
- LowDoc should only have values of Y or N
- NewExist should only have values of 1 or 2

In [39]:
for i in data['RevLineCr']:
    if i not in ['Y','N']:
        data['RevLineCr'].replace(i,'N',inplace=True)
print("RevLineCr",data['RevLineCr'].unique())

for i in data['LowDoc']:
    if i not in ['Y','N']:
        data['LowDoc'].replace(i,'N',inplace=True)
print("LowDoc",data['LowDoc'].unique())

for i in data['NewExist']:
    if i not in [1,2]:
        data['NewExist'].replace(i,None,inplace=True)
print("NewExist",data['NewExist'].unique())


RevLineCr ['N' 'Y']
LowDoc ['N' 'Y']


KeyboardInterrupt: 
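
The loops above call `replace` once per row, which rescans the whole column each time and is probably why the cell was interrupted. A minimal vectorized sketch of the same cleaning rules follows; the toy frame and its stray codes are hypothetical, and only the column names come from the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with a few stray codes and missing values
toy = pd.DataFrame({
    "RevLineCr": ["Y", "N", "0", "T", np.nan],
    "LowDoc": ["N", "Y", "C", "1", np.nan],
    "NewExist": [1.0, 2.0, 0.0, np.nan, 1.0],
})

# Keep Y/N and map everything else (including NaN) to 'N' in a single pass
for col in ["RevLineCr", "LowDoc"]:
    toy[col] = toy[col].where(toy[col].isin(["Y", "N"]), "N")

# Keep 1/2 and turn every other code into NaN so it can be imputed later
toy["NewExist"] = toy["NewExist"].where(toy["NewExist"].isin([1, 2]))

print(toy)
```

`Series.where` keeps the values where the condition holds and substitutes the second argument elsewhere, so each column is processed once regardless of the number of rows.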

In [None]:
#finding null values (isnull and isna are aliases, so both counts match)
missing = data.isnull().sum()
na_col = data.isna().sum()
print(missing[missing>0])
print("---------------------")
print(na_col[na_col>0])

City           25
State          12
Bank         1405
BankState    1411
NewExist     1060
dtype: int64
---------------------
City           25
State          12
Bank         1405
BankState    1411
NewExist     1060
dtype: int64


As we can see, several columns have missing values, which we replace as follows: a numeric column gets the median of the column and a categorical column gets the mode. All five affected columns (City, State, Bank, BankState, NewExist) are treated as categorical here, so only the mode rule is applied.

In [None]:
#replacing missing values
cat_cols=['City', 'State', 'Bank', 'BankState', 'RevLineCr', 'LowDoc','NewExist']

for column in cat_cols:
    data[column]=data[column].fillna(data[column].mode()[0])

print(data.isnull().sum())

City                 0
State                0
Zip                  0
Bank                 0
BankState            0
NAICS                0
NoEmp                0
NewExist             0
CreateJob            0
RetainedJob          0
FranchiseCode        0
UrbanRural           0
RevLineCr            0
LowDoc               0
DisbursementGross    0
BalanceGross         0
GrAppv               0
SBA_Appv             0
MIS_Status           0
dtype: int64
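
The median rule for numeric columns is described above but never exercised, since every column with gaps was filled with its mode. A minimal sketch of both rules side by side, on a hypothetical toy frame (the helper name `impute` is an assumption, not part of the notebook):

```python
import numpy as np
import pandas as pd

def impute(frame, numeric_cols, categorical_cols):
    """Fill numeric gaps with the median and categorical gaps with the mode."""
    out = frame.copy()
    for col in numeric_cols:
        out[col] = out[col].fillna(out[col].median())
    for col in categorical_cols:
        out[col] = out[col].fillna(out[col].mode()[0])
    return out

# Hypothetical toy data with one numeric and one categorical gap
toy = pd.DataFrame({
    "NoEmp": [2.0, np.nan, 10.0, 4.0],
    "State": ["MD", "MD", None, "WI"],
})
filled = impute(toy, numeric_cols=["NoEmp"], categorical_cols=["State"])
print(filled)
```

The function works on a copy, so the input frame is left untouched and the filled result can be compared against it.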


In [None]:
data.head()

Unnamed: 0,City,State,Zip,Bank,BankState,NAICS,NoEmp,NewExist,CreateJob,RetainedJob,FranchiseCode,UrbanRural,RevLineCr,LowDoc,DisbursementGross,BalanceGross,GrAppv,SBA_Appv,MIS_Status
0,GLEN BURNIE,MD,21060,"BUSINESS FINANCE GROUP, INC.",VA,811111,7,1.0,6,7,1,1,N,N,743000.0,0.0,743000.0,743000.0,0
1,WEST BEND,WI,53095,JPMORGAN CHASE BANK NATL ASSOC,IL,722410,20,1.0,0,0,1,0,N,N,137000.0,0.0,137000.0,109737.0,0
2,SAN DIEGO,CA,92128,UMPQUA BANK,OR,0,2,1.0,0,0,1,0,N,N,280000.0,0.0,280000.0,210000.0,0
3,WEBSTER,MA,1570,HOMETOWN BANK A CO-OPERATIVE B,MA,621310,7,1.0,0,0,1,1,N,Y,144500.0,0.0,144500.0,122825.0,0
4,JOPLIN,MO,64804,U.S. BANK NATIONAL ASSOCIATION,OH,0,2,2.0,0,0,1,0,N,Y,52500.0,0.0,52500.0,42000.0,0


**Step 4**
Now that we have replaced the missing values, we will split the dataset into training and testing datasets. Let's split the data into 80% training and 20% testing. We are using random state 42 for reproducibility.

In [None]:
from sklearn.model_selection import train_test_split

train,test = train_test_split(data,test_size=0.2,random_state=42)
train.shape, test.shape

((647397, 19), (161850, 19))

Training set now has 647397 samples and testing set has 161850 samples.
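
The two sizes above can be checked by hand. This is a small sketch that assumes scikit-learn rounds a fractional test size up and gives the remaining rows to the training set:

```python
import math

n_rows = 647397 + 161850   # total rows implied by the two shapes above
test_size = 0.2

# Assumed rounding rule: the test share is rounded up, training gets the rest
n_test = math.ceil(test_size * n_rows)
n_train = n_rows - n_test

print(n_train, n_test)
```

Because the row count is not a multiple of five, the split is not exactly 80/20, and the extra row lands in the test set.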

**Step 5**
Let's start with the feature encodings.

**Target Encoding**
Next, let's perform target encoding on our categorical columns. We'll take MIS_Status as the target column. We fit the encoder on the training dataset only and then apply the same encoding to the testing dataset. New columns will be created for each of the categorical columns with the suffix "_trg".

In [None]:
#target encoder
import category_encoders as ce
categorical_columns = ['City', 'State', 'Bank', 'BankState', 'RevLineCr', 'LowDoc','NewExist', 'UrbanRural']

encoder = ce.TargetEncoder(cols=categorical_columns)
encoder.fit(train, train['MIS_Status'])

train_encoded = encoder.transform(train)
test_encoded = encoder.transform(test)

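# Renaming the columns
# Note: rename(..., inplace=False) returns a renamed copy that is not stored here,
# so train_encoded and test_encoded keep their original column names
train_encoded.rename(columns={col: col + "_trg" if col in categorical_columns else col for col in train_encoded.columns}, inplace=False)
test_encoded.rename(columns={col: col + "_trg" if col in categorical_columns else col for col in test_encoded.columns}, inplace=False)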

train_encoded.head()



Unnamed: 0,City,State,Zip,Bank,BankState,NAICS,NoEmp,NewExist,CreateJob,RetainedJob,FranchiseCode,UrbanRural,RevLineCr,LowDoc,DisbursementGross,BalanceGross,GrAppv,SBA_Appv,MIS_Status
362339,0.120141,0.198398,8873,0.270897,0.196262,453110,1,0.170224,0,1,1,0.244078,0.253225,0.186971,6450.0,0.0,4200.0,2100.0,0
744820,0.223837,0.212677,37403,0.548842,0.21823,422110,1,0.187814,1,1,1,0.244078,0.152733,0.186971,10000.0,0.0,10000.0,8500.0,0
763237,0.223837,0.212677,37416,0.150754,0.102112,0,1,0.187814,0,0,1,0.071191,0.152733,0.090954,42000.0,0.0,42000.0,37800.0,0
637364,0.239931,0.197623,14559,0.123545,0.168405,445110,220,0.170224,0,0,1,0.244078,0.152733,0.186971,637000.0,0.0,637000.0,477750.0,0
17777,0.142045,0.115634,99503,0.087592,0.087512,0,200,0.170224,0,0,79950,0.071191,0.152733,0.186971,500000.0,0.0,500000.0,500000.0,0
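
To make the encoded numbers above easier to read, here is a minimal sketch of plain target encoding done by hand with pandas. It is not what `ce.TargetEncoder` computes exactly, since that encoder also smooths each category mean toward the overall mean; the toy data is hypothetical.

```python
import pandas as pd

train = pd.DataFrame({
    "State": ["MD", "MD", "WI", "WI", "WI", "CA"],
    "MIS_Status": [0, 1, 0, 0, 1, 0],
})
test = pd.DataFrame({"State": ["MD", "WI", "TX"]})  # 'TX' never appears in train

# Learn one number per category from the training rows only
state_means = train.groupby("State")["MIS_Status"].mean()
global_mean = train["MIS_Status"].mean()

# Apply the same mapping to both sets; unseen categories get the global mean
train["State_trg"] = train["State"].map(state_means)
test["State_trg"] = test["State"].map(state_means).fillna(global_mean)

print(test)
```

Each category is replaced by the share of its training rows with MIS_Status equal to 1, which is why the encoded columns hold values between 0 and 1.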


**Step 6**
Now that we have encoded the categorical variables, we will scale the numerical variables. We will use the StandardScaler from sklearn to scale the numerical variables. We will fit the scaler on the training data and then transform the training and testing data.

In [None]:
from sklearn.preprocessing import StandardScaler

numerical_columns = [ 'NoEmp', 'CreateJob', 'RetainedJob', 'GrAppv', 'SBA_Appv', 'DisbursementGross', 'BalanceGross']
scaler = StandardScaler()
train_encoded[numerical_columns] = scaler.fit_transform(train_encoded[numerical_columns])
test_encoded[numerical_columns] = scaler.transform(test_encoded[numerical_columns])

train_encoded.head()

Unnamed: 0,City,State,Zip,Bank,BankState,NAICS,NoEmp,NewExist,CreateJob,RetainedJob,FranchiseCode,UrbanRural,RevLineCr,LowDoc,DisbursementGross,BalanceGross,GrAppv,SBA_Appv,MIS_Status
362339,0.120141,0.198398,8873,0.270897,0.196262,453110,-0.14003,0.170224,-0.035513,-0.041117,1,0.244078,0.253225,0.186971,-0.676128,-0.001977,-0.665714,-0.645651,0
744820,0.223837,0.212677,37403,0.548842,0.21823,422110,-0.14003,0.187814,-0.031245,-0.041117,1,0.244078,0.152733,0.186971,-0.663806,-0.001977,-0.64524,-0.617633,0
763237,0.223837,0.212677,37416,0.150754,0.102112,0,-0.14003,0.187814,-0.035513,-0.045383,1,0.071191,0.152733,0.090954,-0.552726,-0.001977,-0.53228,-0.489361,0
637364,0.239931,0.197623,14559,0.123545,0.168405,445110,2.809836,0.170224,-0.035513,-0.045383,1,0.244078,0.152733,0.186971,1.512665,-0.001977,1.568064,1.43669,0
17777,0.142045,0.115634,99503,0.087592,0.087512,0,2.540442,0.170224,-0.035513,-0.045383,79950,0.071191,0.152733,0.186971,1.037105,-0.001977,1.084456,1.534098,0
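
The fit-on-train, transform-both pattern can be sketched without scikit-learn. The toy amounts are hypothetical; the population standard deviation (`ddof=0`) is used because that is what StandardScaler computes.

```python
import pandas as pd

train = pd.DataFrame({"GrAppv": [10000.0, 20000.0, 30000.0, 40000.0]})
test = pd.DataFrame({"GrAppv": [25000.0, 100000.0]})

# Statistics come from the training rows only
mean = train["GrAppv"].mean()
std = train["GrAppv"].std(ddof=0)

# The same mean and std are applied to both sets
train["GrAppv_scaled"] = (train["GrAppv"] - mean) / std
test["GrAppv_scaled"] = (test["GrAppv"] - mean) / std

print(train["GrAppv_scaled"].mean(), train["GrAppv_scaled"].std(ddof=0))
print(test)
```

The training column ends up with mean 0 and standard deviation 1, while the test column generally does not, because it is measured against the training statistics.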


Logarithmic transformations are particularly useful in dealing with skewed data and can help in capturing non-linear patterns. We can apply log transformations to some of the continuous features to create new ones. The reason we are creating log values is that variables like loan amounts or balances can take on large values. Taking the logarithm reduces the scale of these values, making them easier to work with. 


In [None]:
import numpy as np

# Creating log-based features for the training dataset
train_encoded['Log_DisbursementGross'] = np.log1p(train_encoded['DisbursementGross'])
train_encoded['Log_NoEmp'] = np.log1p(train_encoded['NoEmp'])
train_encoded['Log_GrAppv'] = np.log1p(train_encoded['GrAppv'])
train_encoded['Log_SBA_Appv'] = np.log1p(train_encoded['SBA_Appv'])
train_encoded['Log_BalanceGross'] = np.log1p(train_encoded['BalanceGross'])

# Binning 
train_encoded['Disbursement_Bins'] = pd.cut(train_encoded['DisbursementGross'], 
                                           bins=[-np.inf, 50000, 150000, np.inf], 
                                           labels=['Low', 'Medium', 'High'])

# Loan Efficiency
train_encoded['Loan_Efficiency'] = train_encoded['DisbursementGross'] / (train_encoded['CreateJob'] + train_encoded['RetainedJob'] + 1)  # Adding 1 to avoid division by zero

# Guarantee Ratio
train_encoded['Guarantee_Ratio'] = train_encoded['SBA_Appv'] / train_encoded['GrAppv']

# Loan Guarantee Interaction
train_encoded['Loan_Guarantee_Interaction'] = train_encoded['SBA_Appv'] * train_encoded['GrAppv']

# Disbursement Squared
train_encoded['Disbursement_Squared'] = train_encoded['DisbursementGross'] ** 2

# Displaying the newly created features
train_encoded[['Log_DisbursementGross', 'Log_NoEmp', 'Log_GrAppv', 'Log_SBA_Appv','Disbursement_Bins', 'Loan_Efficiency', 'Guarantee_Ratio', 'Loan_Guarantee_Interaction', 'Disbursement_Squared']].head()


Unnamed: 0,Log_DisbursementGross,Log_NoEmp,Log_GrAppv,Log_SBA_Appv,Disbursement_Bins,Loan_Efficiency,Guarantee_Ratio,Loan_Guarantee_Interaction,Disbursement_Squared
362339,-1.127408,-0.150857,-1.095759,-1.037473,Low,-0.73224,0.969862,0.429819,0.45715
744820,-1.090066,-0.150857,-1.036314,-0.961373,Low,-0.715586,0.957214,0.398521,0.440638
763237,-0.804583,-0.150857,-0.759886,-0.672092,Low,-0.601374,0.919366,0.260477,0.305506
637364,0.921344,1.337586,0.943152,0.890641,Low,1.645804,0.916219,2.252823,2.288156
17777,0.71153,1.264252,0.734508,0.929838,Low,1.128386,1.414625,1.663662,1.075586


For the log transformations we use np.log1p, which computes log(1 + x), so zero values map to 0 instead of hitting the undefined log(0).
Apart from the log features, we have created the following features:

- Disbursement_Bins: DisbursementGross converted into three categorical bins (Low, Medium, High) with edges at 50,000 and 150,000.
- Loan_Efficiency: DisbursementGross divided by the sum of jobs created and retained. This captures how much loan amount is disbursed per job impact. 1 has been added to the denominator to avoid division by zero.
- Guarantee_Ratio: Ratio of the amount guaranteed by the SBA to the gross loan amount approved by the bank.
- Loan_Guarantee_Interaction: Product of SBA_Appv and GrAppv.
- Disbursement_Squared: Square of DisbursementGross to capture non-linear effects.
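
One thing to watch in the cell above: it runs after the Step 6 scaling, so the log transforms and the dollar-valued bin edges are applied to standardized values, and Disbursement_Bins is Low in every displayed row. The train_model function later in the notebook computes these features before scaling. A minimal sketch of why the order matters, using three GrAppv amounts taken from the tables above:

```python
import numpy as np
import pandas as pd

raw = pd.Series([4200.0, 137000.0, 637000.0], name="GrAppv")
scaled = (raw - raw.mean()) / raw.std(ddof=0)

# On raw dollar amounts log1p compresses the range and is always finite
log_raw = np.log1p(raw)

# On standardized values the input can be negative; at -1 the result is -inf
# and below -1 it is NaN
with np.errstate(invalid="ignore", divide="ignore"):
    log_of_low_score = np.log1p(pd.Series([-0.5, -1.0, -1.5]))

# Dollar-valued bin edges only make sense on the raw scale
edges = [-np.inf, 50000, 150000, np.inf]
labels = ["Low", "Medium", "High"]
bins_raw = pd.cut(raw, bins=edges, labels=labels)
bins_scaled = pd.cut(scaled, bins=edges, labels=labels)

print(list(bins_raw), list(bins_scaled))
```

On the raw scale the three loans fall into three different bins; after standardization all of them sit far below 50,000 and collapse into Low.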

**Step 7**
Let's move on to the model tuning part.

In [None]:
train_encoded.columns

Index(['City', 'State', 'Zip', 'Bank', 'BankState', 'NAICS', 'NoEmp',
       'NewExist', 'CreateJob', 'RetainedJob', 'FranchiseCode', 'UrbanRural',
       'RevLineCr', 'LowDoc', 'DisbursementGross', 'BalanceGross', 'GrAppv',
       'SBA_Appv', 'MIS_Status', 'Log_DisbursementGross', 'Log_NoEmp',
       'Log_GrAppv', 'Log_SBA_Appv', 'Log_BalanceGross', 'Disbursement_Bins',
       'Loan_Efficiency', 'Guarantee_Ratio', 'Loan_Guarantee_Interaction',
       'Disbursement_Squared'],
      dtype='object')

In [None]:
train_encoded.head()

Unnamed: 0,City,State,Zip,Bank,BankState,NAICS,NoEmp,NewExist,CreateJob,RetainedJob,...,Log_DisbursementGross,Log_NoEmp,Log_GrAppv,Log_SBA_Appv,Log_BalanceGross,Disbursement_Bins,Loan_Efficiency,Guarantee_Ratio,Loan_Guarantee_Interaction,Disbursement_Squared
362339,0.120141,0.198398,8873,0.270897,0.196262,453110,-0.14003,0.170224,-0.035513,-0.041117,...,-1.127408,-0.150857,-1.095759,-1.037473,-0.001979,Low,-0.73224,0.969862,0.429819,0.45715
744820,0.223837,0.212677,37403,0.548842,0.21823,422110,-0.14003,0.187814,-0.031245,-0.041117,...,-1.090066,-0.150857,-1.036314,-0.961373,-0.001979,Low,-0.715586,0.957214,0.398521,0.440638
763237,0.223837,0.212677,37416,0.150754,0.102112,0,-0.14003,0.187814,-0.035513,-0.045383,...,-0.804583,-0.150857,-0.759886,-0.672092,-0.001979,Low,-0.601374,0.919366,0.260477,0.305506
637364,0.239931,0.197623,14559,0.123545,0.168405,445110,2.809836,0.170224,-0.035513,-0.045383,...,0.921344,1.337586,0.943152,0.890641,-0.001979,Low,1.645804,0.916219,2.252823,2.288156
17777,0.142045,0.115634,99503,0.087592,0.087512,0,2.540442,0.170224,-0.035513,-0.045383,...,0.71153,1.264252,0.734508,0.929838,-0.001979,Low,1.128386,1.414625,1.663662,1.075586


Here we will determine the best hyperparameters to train the model.

Let's start with a Logistic Regression model and use GridSearchCV to search over its hyperparameters.

For the inverse regularization strength C, let's start with common values: 0.01, 0.1, 1, 10, 100.

For the penalty, let's try l1, l2 and elasticnet, which is a combination of l1 and l2.

Considering the computing resources at hand, let's limit max iterations to 100 and 200.

To control the mix between L1 and L2 when using elasticnet, we use the l1_ratio parameter, starting with 0.2, 0.5 and 0.8.
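
To get a feel for the size of this search, the four lists above can be enumerated directly. This sketch only counts combinations and does not fit any model; the variable names are illustrative.

```python
from itertools import product

C_values = [0.01, 0.1, 1, 10, 100]
penalties = ["l1", "l2", "elasticnet"]
max_iters = [100, 200]
l1_ratios = [0.2, 0.5, 0.8]

# Every combination a plain dictionary grid would try
full_grid = list(product(C_values, penalties, max_iters, l1_ratios))

# l1_ratio only matters for elasticnet, so collapse it for the other penalties
effective = {
    (C, penalty, max_iter, ratio if penalty == "elasticnet" else None)
    for C, penalty, max_iter, ratio in full_grid
}

print(len(full_grid), len(effective))
```

These four lists give 90 combinations, of which only 50 are genuinely different models. The search log in this notebook reports 360 candidates, four times as many, so the grid that was actually run presumably varied further parameters that are not listed here.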

In [None]:
#logistic regression
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

target = "MIS_Status"
nontrain_cols=['City', 'State', 'Zip','Bank', 'BankState', 'LowDoc','RevLineCr','MIS_Status','Disbursement_Bins']

x = train_encoded.drop(columns=[target])
y = train_encoded[target]

lr = LogisticRegression(random_state=42,max_iter=100,n_jobs=-1, verbose=1)

traincol = [col for col in x.columns if col not in nontrain_cols]

classifier = lr.fit(x[traincol],y)

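#GridSearchCV
# Each entry must be a list of candidate values. The values shown match the best
# parameters reported below; the logged search itself covered 360 candidates.
param_grid = {
    'C': [0.01],
    'penalty': ['l1'],
    'solver': ['liblinear'],
    'max_iter': [100],
    'class_weight': [None],
    'l1_ratio': [0.2]
}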
grid = GridSearchCV(lr, param_grid, cv=5, verbose=1, n_jobs=-1)
grid.fit(x[traincol], y)
print("Best parameters found: ", grid.best_params_)
print("Best cross-validation score: {:.2f}".format(grid.best_score_))

[Parallel(n_jobs=-1)]: Using backend LokyBackend with 16 concurrent workers.


Fitting 5 folds for each of 360 candidates, totalling 1800 fits


300 fits failed out of a total of 1800.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
300 fits failed with the following error:
Traceback (most recent call last):
  File "c:\Users\varun\anaconda3\envs\aml\lib\site-packages\sklearn\model_selection\_validation.py", line 729, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "c:\Users\varun\anaconda3\envs\aml\lib\site-packages\sklearn\base.py", line 1152, in wrapper
    return fit_method(estimator, *args, **kwargs)
  File "c:\Users\varun\anaconda3\envs\aml\lib\site-packages\sklearn\linear_model\_logistic.py", line 1169, in fit
    solver = _check_solver(self.solver, self.penalty, self.dual)
  File "c:\Users\varun\anaconda3\envs\aml\lib\site-packages\sklearn\linear

[LibLinear]Best parameters found:  {'C': 0.01, 'class_weight': None, 'l1_ratio': 0.2, 'max_iter': 100, 'penalty': 'l1', 'solver': 'liblinear'}
Best cross-validation score: 0.82


**Best Parameters**
Best parameters found:  {'C': 0.01, 'class_weight': None, 'l1_ratio': 0.2, 'max_iter': 100, 'penalty': 'l1', 'solver': 'liblinear'}
Best cross-validation score: 0.82

In [None]:
#dump best params (the with-block closes the file after writing)
import pickle
with open('best_params.pkl','wb') as f:
    pickle.dump(grid.best_params_, f)


**Step 8**
Now we have the best configuration for the model, so we can move forward with the training part of the project.


In [295]:
#threshold calculation
import numpy as np
from sklearn.metrics import f1_score

# def calculate_optimal_threshold(classifier, X, y):
#     """
#     Calculate the optimal threshold for a classifier based on Youden's J statistic.
#     """
    
#     # Predict probabilities
#     y_prob = classifier.predict_proba(X)[:,1]

#     # Calculate ROC curve
#     fpr, tpr, thresholds = roc_curve(y, y_prob)

#     # Calculate Youden's J statistic
#     J = tpr - fpr

#     # Determine the optimal threshold
#     optimal_idx = np.argmax(J)
#     optimal_threshold = thresholds[optimal_idx]

#     return optimal_threshold

def calculate_optimal_threshold(classifier, X, y):
    y_prob = classifier.predict_proba(X)[:, 1]
    
    thresholds = np.linspace(0, 1, 100)
    f1_scores = []
    
    # Compute the F1 score
    for threshold in thresholds:
        y_pred = (y_prob > threshold).astype(int)
        score = f1_score(y, y_pred, average='macro')
        f1_scores.append(score)
    
    # Getting threshold
    optimal_threshold = thresholds[np.argmax(f1_scores)]
    
    return optimal_threshold
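
For readers unfamiliar with `average='macro'`, the sketch below computes the same quantity by hand with NumPy and runs the same 100-point threshold sweep on hypothetical labels and probabilities. The helper name `macro_f1` is an assumption used only for this illustration.

```python
import numpy as np

def macro_f1(y_true, y_pred):
    """Unweighted mean of the F1 scores of class 0 and class 1."""
    scores = []
    for cls in (0, 1):
        tp = np.sum((y_pred == cls) & (y_true == cls))
        fp = np.sum((y_pred == cls) & (y_true != cls))
        fn = np.sum((y_pred != cls) & (y_true == cls))
        denom = 2 * tp + fp + fn
        # A class with no true or predicted members scores 0
        scores.append(2 * tp / denom if denom else 0.0)
    return float(np.mean(scores))

# Hypothetical labels and predicted probabilities of class 1
y_true = np.array([0, 0, 0, 0, 1, 1])
y_prob = np.array([0.05, 0.10, 0.20, 0.40, 0.35, 0.80])

thresholds = np.linspace(0, 1, 100)
scores = [macro_f1(y_true, (y_prob > t).astype(int)) for t in thresholds]
best = thresholds[int(np.argmax(scores))]

print(best, max(scores))
```

Macro averaging weights both classes equally, so on an imbalanced target the chosen threshold tends to move away from 0.5 toward whichever value stops the minority class from being ignored.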


In [296]:
import pandas as pd
import numpy as np
import category_encoders as ce
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
import pickle
from sklearn.model_selection import train_test_split
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)


def train_model(data):
    """
    Train sample model and save artifacts
    """
    target = "MIS_Status"
    nontrain_cols=['City', 'State', 'Zip','Bank', 'BankState', 'LowDoc','RevLineCr','MIS_Status','Disbursement_Bins']
    # Removing the index column
    if "index" in data.columns:
        data.drop(columns="index", inplace=True)
    
    y = data[target] if target in data.columns else None
    x = data.drop(columns=[target]) if target in data.columns else data.copy()
        
    # Cleaning the flag columns in one vectorized pass per column
    # Anything other than Y/N (including NaN) becomes 'N'
    data['RevLineCr'] = data['RevLineCr'].where(data['RevLineCr'].isin(['Y','N']), 'N')
    data['LowDoc'] = data['LowDoc'].where(data['LowDoc'].isin(['Y','N']), 'N')
    # Anything other than 1/2 becomes NaN and is filled with the mode below
    data['NewExist'] = data['NewExist'].where(data['NewExist'].isin([1,2]))

    cat_cols=['City', 'State', 'Bank', 'BankState', 'RevLineCr', 'LowDoc','NewExist', 'UrbanRural']
    for column in cat_cols:
        data[column]=data[column].fillna(data[column].mode()[0])
    
    
    # Target encoding the categorical columns
    categorical_columns = ['City', 'State', 'Bank', 'BankState', 'RevLineCr', 'LowDoc','NewExist', 'UrbanRural']
    encoder = ce.TargetEncoder(cols=categorical_columns)
    encoder.fit(data[categorical_columns], data['MIS_Status'])
    train_encoded = encoder.transform(data[categorical_columns])
    train_encoded = train_encoded.add_suffix('_trg')
    #train_encoded = pd.concat([train_encoded, data], axis=1)
    train_encoded = pd.concat([train_encoded, data], axis=1)
    for column in categorical_columns:
        train_encoded[column + "_trg"].fillna(train_encoded[column + "_trg"].mean(), inplace=True)
    
    # Renaming the columns
    #train_encoded.rename(columns={col: col + "_trg" if col in categorical_columns else col for col in train_encoded.columns}, inplace=False)
    print(train_encoded.columns)
    #Feature Engineering
    train_encoded['Log_DisbursementGross'] = np.log1p(train_encoded['DisbursementGross'])
    train_encoded['Log_NoEmp'] = np.log1p(train_encoded['NoEmp'])
    train_encoded['Log_GrAppv'] = np.log1p(train_encoded['GrAppv'])
    train_encoded['Log_SBA_Appv'] = np.log1p(train_encoded['SBA_Appv'])
    train_encoded['Log_BalanceGross'] = np.log1p(train_encoded['BalanceGross'])
    train_encoded['Loan_Efficiency'] = train_encoded['DisbursementGross'] / (train_encoded['CreateJob'] + train_encoded['RetainedJob'] + 1)  
    train_encoded['Guarantee_Ratio'] = train_encoded['SBA_Appv'] / train_encoded['GrAppv']
    train_encoded['Loan_Guarantee_Interaction'] = train_encoded['SBA_Appv'] * train_encoded['GrAppv']
    train_encoded['Disbursement_Squared'] = train_encoded['DisbursementGross'] ** 2
    train_encoded['Disbursement_Bins'] = pd.cut(train_encoded['DisbursementGross'], 
                                           bins=[-np.inf, 50000, 150000, np.inf], 
                                           labels=['Low', 'Medium', 'High'])
     # Scaling the numerical columns
    numerical_columns = [ 'NoEmp', 'CreateJob', 'RetainedJob', 'GrAppv', 'SBA_Appv', 'DisbursementGross', 'BalanceGross',
                        'Log_DisbursementGross', 'Log_NoEmp', 'Log_GrAppv', 'Log_SBA_Appv','Loan_Efficiency', 'Guarantee_Ratio', 
                        'Loan_Guarantee_Interaction', 'Disbursement_Squared']
    
    scaler = StandardScaler()
    #fit and transform separately
    scaler.fit(train_encoded[numerical_columns])
    train_encoded[numerical_columns] = scaler.transform(train_encoded[numerical_columns])

    clf = LogisticRegression(random_state=42,max_iter=100,n_jobs=-1, verbose=1)
    traincol = [col for col in train_encoded.columns if col not in nontrain_cols]
    classifier = clf.fit(train_encoded[traincol],y)

    # GridSearchCV
    param_grid = {
        'C': [0.01],
        'penalty': ['l1'],
        'solver': ['liblinear'],  
        'max_iter': [100],
        'class_weight': [None],
        'l1_ratio': [0.2] 
    }
    grid = GridSearchCV(clf, param_grid, cv=5, verbose=1, n_jobs=-1, scoring='f1')
    grid.fit(train_encoded[traincol], y)
    print("Best parameters found: ", grid.best_params_)
    best_clf = grid.best_estimator_

    # Saving the artifacts
    artifacts_dict = {
        "model": best_clf,
        "target_encoder": encoder,
        "te_columns": categorical_columns,
        "columns_to_train":traincol,
        "numerical_columns":numerical_columns,
        "cat_cols":cat_cols,
        "scaler":scaler
    }

    #calculating threshold
    # Note: the threshold is tuned on `classifier` (default parameters), while the
    # model stored in the artifacts is `best_clf` from the grid search
    if y is not None:
        optimal_threshold = calculate_optimal_threshold(classifier, train_encoded[traincol], y)
        print(f"Optimal Threshold: {optimal_threshold}")
        # Saving the threshold in artifacts
        artifacts_dict["threshold"] = optimal_threshold

    # Writing the artifacts; the with-block closes the file even if dump fails
    with open("./Artifacts/artifacts_dict_file.pkl", "wb") as artifacts_dict_file:
        pickle.dump(obj=artifacts_dict, file=artifacts_dict_file)

    return classifier


In [297]:
from sklearn.model_selection import train_test_split
df = pd.read_csv("SBA_loans_project_1.csv")

target = "MIS_Status"
y = df[target]
x = df.drop(columns=[target])

# Splitting the dataset into train and test
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)
X_train.reset_index(inplace=True, drop=True)
y_train.reset_index(inplace=True, drop=True)
X_test.reset_index(inplace=True, drop=True)
y_test.reset_index(inplace=True, drop=True)
df_train = X_train.copy()
df_train[target] = y_train
train_model(df_train)

Index(['City_trg', 'State_trg', 'Bank_trg', 'BankState_trg', 'RevLineCr_trg',
       'LowDoc_trg', 'NewExist_trg', 'UrbanRural_trg', 'City', 'State', 'Zip',
       'Bank', 'BankState', 'NAICS', 'NoEmp', 'NewExist', 'CreateJob',
       'RetainedJob', 'FranchiseCode', 'UrbanRural', 'RevLineCr', 'LowDoc',
       'DisbursementGross', 'BalanceGross', 'GrAppv', 'SBA_Appv',
       'MIS_Status'],
      dtype='object')


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 16 concurrent workers.


Fitting 5 folds for each of 1 candidates, totalling 5 fits




[LibLinear]Best parameters found:  {'C': 0.01, 'class_weight': None, 'l1_ratio': 0.2, 'max_iter': 100, 'penalty': 'l1', 'solver': 'liblinear'}
Optimal Threshold: 0.05050505050505051


In [211]:
artifacts_dict_file = open("./Artifacts/artifacts_dict_file.pkl", "rb")
artifacts_dict = pickle.load(file=artifacts_dict_file)
artifacts_dict_file.close()

print([l for l in artifacts_dict["target_encoder"].mapping])

print([l for l in artifacts_dict["columns_to_train"]])

['City', 'State', 'Bank', 'BankState', 'RevLineCr', 'LowDoc', 'NewExist', 'UrbanRural']
['NewExist', 'UrbanRural', 'NAICS', 'NoEmp', 'CreateJob', 'RetainedJob', 'FranchiseCode', 'DisbursementGross', 'BalanceGross', 'GrAppv', 'SBA_Appv', 'Log_DisbursementGross', 'Log_NoEmp', 'Log_GrAppv', 'Log_SBA_Appv', 'Log_BalanceGross', 'Loan_Efficiency', 'Guarantee_Ratio', 'Loan_Guarantee_Interaction', 'Disbursement_Squared']


In [291]:
def scoring(data):
    """
    Function to score input dataset.
    
    Input: dataset in Pandas DataFrame format
    Output: Python list of labels in the same order as input records
    
    Flow:
        - Load artifacts
        - Transform dataset
        - Score dataset
        - Return labels
    
    """
    if "index" in data.columns:
        data.drop(columns="index", inplace=True)
    #Load Artifacts
    artifacts_dict_file = open("./Artifacts/artifacts_dict_file.pkl", "rb")
    artifacts_dict = pickle.load(file=artifacts_dict_file)
    artifacts_dict_file.close()

    clf = artifacts_dict["model"]
    te = artifacts_dict["target_encoder"]
    te_columns = artifacts_dict["te_columns"]
    columns_to_score = artifacts_dict["columns_to_train"]
    threshold = artifacts_dict["threshold"]
    cat_cols = artifacts_dict["cat_cols"]
    numerical_columns = artifacts_dict["numerical_columns"]
    scaler = artifacts_dict["scaler"]
    
    # Cleaning the flag columns in one vectorized pass per column
    # Anything other than Y/N (including NaN) becomes 'N'
    data['RevLineCr'] = data['RevLineCr'].where(data['RevLineCr'].isin(['Y','N']), 'N')
    data['LowDoc'] = data['LowDoc'].where(data['LowDoc'].isin(['Y','N']), 'N')
    # Anything other than 1/2 becomes NaN and is filled with the mode below
    data['NewExist'] = data['NewExist'].where(data['NewExist'].isin([1,2]))

    for column in cat_cols:
        data[column]=data[column].fillna(data[column].mode()[0])
    
    #Adding Features
    data['Log_DisbursementGross'] = np.log1p(data['DisbursementGross'])
    data['Log_NoEmp'] = np.log1p(data['NoEmp'])
    data['Log_GrAppv'] = np.log1p(data['GrAppv'])
    data['Log_SBA_Appv'] = np.log1p(data['SBA_Appv'])
    data['Log_BalanceGross'] = np.log1p(data['BalanceGross'])
    data['Loan_Efficiency'] = data['DisbursementGross'] / (data['CreateJob'] + data['RetainedJob'] + 1)
    data['Guarantee_Ratio'] = data['SBA_Appv'] / data['GrAppv']
    data['Loan_Guarantee_Interaction'] = data['SBA_Appv'] * data['GrAppv']
    data['Disbursement_Squared'] = data['DisbursementGross'] ** 2
    data['Disbursement_Bins'] = pd.cut(data['DisbursementGross'], 
                                           bins=[-np.inf, 50000, 150000, np.inf], 
                                           labels=['Low', 'Medium', 'High'])
     
    # Scaling the numerical columns
    data[numerical_columns] = scaler.transform(data[numerical_columns])                             
    
    # Target encoding the categorical columns
    data_encoded = te.transform(data[te_columns])
    data_encoded = data_encoded.add_suffix('_trg')
    data_encoded = pd.concat([data_encoded, data], axis=1)
    
    # Filling any missing encoded values with the column mean
    for column in te_columns:
        data_encoded[column + "_trg"].fillna(data_encoded[column + "_trg"].mean(), inplace=True)
    
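    # Predicting the probabilities
    y_prob = clf.predict_proba(data_encoded[columns_to_score])
    # Note: the threshold was tuned as P(class 1) > threshold in
    # calculate_optimal_threshold, whereas this line tests P(class 0) < threshold
    y_pred = (y_prob[:,0] < threshold).astype(int)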
    d = {
        "index": data.index,
        "label": y_pred,
        "probability_0": y_prob[:,0],
        "probability_1": y_prob[:,1]
    }
    #print(y_prob)
    return pd.DataFrame(d)


In [298]:
print(scoring(X_test))


         index  label  probability_0  probability_1
0            0      0       0.921069       0.078931
1            1      0       0.948424       0.051576
2            2      0       0.952875       0.047125
3            3      0       0.848874       0.151126
4            4      0       0.786245       0.213755
...        ...    ...            ...            ...
161845  161845      0       0.962089       0.037911
161846  161846      0       0.850704       0.149296
161847  161847      0       0.964865       0.035135
161848  161848      0       0.830517       0.169483
161849  161849      0       0.786244       0.213756

[161850 rows x 4 columns]
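
Every visible label in this output is 0, even for rows whose probability_1 is above the stored threshold of about 0.0505. A likely cause is that the threshold was tuned on P(class 1) but is applied to P(class 0) in scoring(). The sketch below compares the two rules on the first four rows of the table above; which rule was intended is an assumption.

```python
import numpy as np

# First four rows of the scoring output: column 0 is P(class 0), column 1 is P(class 1)
y_prob = np.array([
    [0.921069, 0.078931],
    [0.948424, 0.051576],
    [0.952875, 0.047125],
    [0.848874, 0.151126],
])
threshold = 0.05050505050505051  # value printed by train_model

# Rule used when the threshold was tuned: P(class 1) > threshold
labels_tuned_rule = (y_prob[:, 1] > threshold).astype(int)

# Rule used in scoring(): P(class 0) < threshold
labels_scoring_rule = (y_prob[:, 0] < threshold).astype(int)

print(labels_tuned_rule, labels_scoring_rule)
```

With a threshold this low, the scoring rule only predicts class 1 when P(class 0) drops below roughly 5%, which none of these rows comes close to, while the tuned rule would flag three of the four rows.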


In [293]:
val_data = pd.read_csv("SBA_loans_project_1_holdout_students_valid.csv")
print(scoring(val_data))

       index  label  probability_0  probability_1
0          0      0       0.473412       0.526588
1          1      0       0.956414       0.043586
2          2      0       0.632336       0.367664
3          3      0       0.984358       0.015642
4          4      0       0.936927       0.063073
...      ...    ...            ...            ...
89912  89912      0       0.903601       0.096399
89913  89913      0       0.774825       0.225175
89914  89914      0       0.890063       0.109937
89915  89915      0       0.960143       0.039857
89916  89916      0       0.944447       0.055553

[89917 rows x 4 columns]


**H2O**

In [6]:
import h2o
h2o.init(max_mem_size='12G')
h2o.remove_all()

Checking whether there is an H2O instance running at http://localhost:54321..... not found.
Attempting to start a local H2O server...
; Java HotSpot(TM) 64-Bit Server VM (build 25.381-b09, mixed mode)
  Starting server from C:\Users\varun\anaconda3\envs\aml\Lib\site-packages\h2o\backend\bin\h2o.jar
  Ice root: C:\Users\varun\AppData\Local\Temp\tmpiz90910x
  JVM stdout: C:\Users\varun\AppData\Local\Temp\tmpiz90910x\h2o_varun_started_from_python.out
  JVM stderr: C:\Users\varun\AppData\Local\Temp\tmpiz90910x\h2o_varun_started_from_python.err
  Server is running at http://127.0.0.1:54321
Connecting to H2O server at http://127.0.0.1:54321 ... successful.


0,1
H2O_cluster_uptime:,03 secs
H2O_cluster_timezone:,America/Chicago
H2O_data_parsing_timezone:,UTC
H2O_cluster_version:,3.44.0.1
H2O_cluster_version_age:,18 days
H2O_cluster_name:,H2O_from_python_varun_izb6bk
H2O_cluster_total_nodes:,1
H2O_cluster_free_memory:,10.66 Gb
H2O_cluster_total_cores:,16
H2O_cluster_allowed_cores:,16


In [None]:
import pandas as pd
import numpy as np
import os

In [7]:
df_h2o = h2o.import_file("SBA_loans_project_1.csv")
df_h2o.head()

Parse progress: |████████████████████████████████████████████████████████████████| (done) 100%


index,City,State,Zip,Bank,BankState,NAICS,NoEmp,NewExist,CreateJob,RetainedJob,FranchiseCode,UrbanRural,RevLineCr,LowDoc,DisbursementGross,BalanceGross,GrAppv,SBA_Appv,MIS_Status
0,GLEN BURNIE,MD,21060,"BUSINESS FINANCE GROUP, INC.",VA,811111,7,1,6,7,1,1,0,N,743000,0,743000,743000,0
1,WEST BEND,WI,53095,JPMORGAN CHASE BANK NATL ASSOC,IL,722410,20,1,0,0,1,0,N,N,137000,0,137000,109737,0
2,SAN DIEGO,CA,92128,UMPQUA BANK,OR,0,2,1,0,0,1,0,0,N,280000,0,280000,210000,0
3,WEBSTER,MA,1570,HOMETOWN BANK A CO-OPERATIVE B,MA,621310,7,1,0,0,1,1,0,Y,144500,0,144500,122825,0
4,JOPLIN,MO,64804,U.S. BANK NATIONAL ASSOCIATION,OH,0,2,2,0,0,1,0,N,Y,52500,0,52500,42000,0
5,NEWTOWN,OH,45244,HAMILTON CNTY DEVEL COMPANY IN,OH,234110,5,1,2,0,1,0,N,N,52000,0,52000,52000,0
6,MISSION VIEJO,CA,92691,BANK OF AMERICA CALIFORNIA N.A,CA,445310,3,1,0,0,1,0,Y,N,50000,0,50000,25000,0
7,OSWEGO,IL,60543,JPMORGAN CHASE BANK NATL ASSOC,IL,812990,1,1,2,1,0,1,Y,N,38619,0,25000,12500,1
8,DECATUR,GA,30033,WELLS FARGO BANK NATL ASSOC,SD,561421,10,1,1,11,1,1,Y,N,32714,0,20000,10000,1
9,ROLLING HILLS,CA,90274,BANC OF CALIFORNIA NATL ASSOC,CA,541512,4,1,4,4,1,1,Y,N,90055,0,50000,25000,0


In [8]:
df_h2o.describe()

Unnamed: 0,index,City,State,Zip,Bank,BankState,NAICS,NoEmp,NewExist,CreateJob,RetainedJob,FranchiseCode,UrbanRural,RevLineCr,LowDoc,DisbursementGross,BalanceGross,GrAppv,SBA_Appv,MIS_Status
type,int,enum,enum,int,enum,enum,int,int,int,int,int,int,int,enum,enum,int,int,int,int,int
mins,0.0,,,0.0,,,0.0,0.0,0.0,0.0,0.0,0.0,0.0,,,0.0,0.0,200.0,100.0,0.0
mean,404623.0,,,53800.9370044004,,,398573.7836099488,11.414083709917984,1.2802764488289113,8.415865613341783,10.773365857395833,2751.93917555456,0.7577476345293829,,,201194.3958816036,3.1859951288049255,192717.89713153083,149528.1544176252,0.17528517251222434
maxs,809246.0,,,99999.0,,,928120.0,9999.0,2.0,8800.0,9500.0,99999.0,2.0,,,11446325.0,996262.0,5472000.0,5472000.0,1.0
sigma,233609.63098297126,,,31186.36710873629,,,263354.97981331375,74.52942885131719,0.45169187889462253,236.28834837455483,236.61205316253333,12758.411810115618,0.6463471493041519,,,287848.92292755377,1516.284729590831,283166.5956093533,228332.17708348396,0.38021107222877165
zeros,1,,,262,,,181845,5937,932,566148,396287,187961,290804,,,169,809235,0,0,667398
missing,0,25,12,0,1405,1411,0,0,128,0,0,0,0,4094,3662,0,0,0,0,0
0,0.0,GLEN BURNIE,MD,21060.0,"BUSINESS FINANCE GROUP, INC.",VA,811111.0,7.0,1.0,6.0,7.0,1.0,1.0,0,N,743000.0,0.0,743000.0,743000.0,0.0
1,1.0,WEST BEND,WI,53095.0,JPMORGAN CHASE BANK NATL ASSOC,IL,722410.0,20.0,1.0,0.0,0.0,1.0,0.0,N,N,137000.0,0.0,137000.0,109737.0,0.0
2,2.0,SAN DIEGO,CA,92128.0,UMPQUA BANK,OR,0.0,2.0,1.0,0.0,0.0,1.0,0.0,0,N,280000.0,0.0,280000.0,210000.0,0.0


In [9]:
df_h2o = df_h2o.drop("index", axis=1)
df_h2o.describe()

Unnamed: 0,City,State,Zip,Bank,BankState,NAICS,NoEmp,NewExist,CreateJob,RetainedJob,FranchiseCode,UrbanRural,RevLineCr,LowDoc,DisbursementGross,BalanceGross,GrAppv,SBA_Appv,MIS_Status
type,enum,enum,int,enum,enum,int,int,int,int,int,int,int,enum,enum,int,int,int,int,int
mins,,,0.0,,,0.0,0.0,0.0,0.0,0.0,0.0,0.0,,,0.0,0.0,200.0,100.0,0.0
mean,,,53800.9370044004,,,398573.7836099488,11.414083709917984,1.2802764488289113,8.415865613341783,10.773365857395833,2751.93917555456,0.7577476345293829,,,201194.3958816036,3.1859951288049255,192717.89713153083,149528.1544176252,0.17528517251222434
maxs,,,99999.0,,,928120.0,9999.0,2.0,8800.0,9500.0,99999.0,2.0,,,11446325.0,996262.0,5472000.0,5472000.0,1.0
sigma,,,31186.36710873629,,,263354.97981331375,74.52942885131719,0.45169187889462253,236.28834837455483,236.61205316253333,12758.411810115618,0.6463471493041519,,,287848.92292755377,1516.284729590831,283166.5956093533,228332.17708348396,0.38021107222877165
zeros,,,262,,,181845,5937,932,566148,396287,187961,290804,,,169,809235,0,0,667398
missing,25,12,0,1405,1411,0,0,128,0,0,0,0,4094,3662,0,0,0,0,0
0,GLEN BURNIE,MD,21060.0,"BUSINESS FINANCE GROUP, INC.",VA,811111.0,7.0,1.0,6.0,7.0,1.0,1.0,0,N,743000.0,0.0,743000.0,743000.0,0.0
1,WEST BEND,WI,53095.0,JPMORGAN CHASE BANK NATL ASSOC,IL,722410.0,20.0,1.0,0.0,0.0,1.0,0.0,N,N,137000.0,0.0,137000.0,109737.0,0.0
2,SAN DIEGO,CA,92128.0,UMPQUA BANK,OR,0.0,2.0,1.0,0.0,0.0,1.0,0.0,0,N,280000.0,0.0,280000.0,210000.0,0.0


In [15]:
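# Collect the names of columns that contain at least one missing value
missing_col = []

for col in df_h2o.columns:
    # isna() flags missing cells with 1; a positive sum means the column has NAs
    if df_h2o[col].isna().sum() > 0:
        missing_col.append(col)

print(missing_col)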


['City', 'State', 'Bank', 'BankState', 'NewExist', 'RevLineCr', 'LowDoc']
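
The printed list can be cross-checked against the `missing` row of the `describe()` output above. This is a small sketch in plain Python, with the column names and counts copied from that output; the variable names are illustrative only.

```python
# Column names and per-column missing counts, copied from describe() above.
columns = ["City", "State", "Zip", "Bank", "BankState", "NAICS", "NoEmp",
           "NewExist", "CreateJob", "RetainedJob", "FranchiseCode",
           "UrbanRural", "RevLineCr", "LowDoc", "DisbursementGross",
           "BalanceGross", "GrAppv", "SBA_Appv", "MIS_Status"]
missing = [25, 12, 0, 1405, 1411, 0, 0, 128, 0, 0, 0, 0, 4094, 3662,
           0, 0, 0, 0, 0]

# Keep only the columns whose missing count is positive.
missing_by_col = {c: n for c, n in zip(columns, missing) if n > 0}

print(list(missing_by_col))
print(sum(missing_by_col.values()))  # total number of missing cells
```

The resulting names match the seven columns printed by the loop. In H2O itself, `H2OFrame.nacnt()` should return the same per-column counts as a list in a single call, so the `zip` pattern may also be applied to `df_h2o.columns` and `df_h2o.nacnt()`.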


Exception ignored in: <function ExprNode.__del__ at 0x0000026FA69439A0>
Traceback (most recent call last):
  File "c:\Users\varun\anaconda3\envs\aml\lib\site-packages\h2o\expr.py", line 204, in __del__
    ExprNode.rapids("(rm {})".format(self._cache._id))
  File "c:\Users\varun\anaconda3\envs\aml\lib\site-packages\h2o\expr.py", line 258, in rapids
    return h2o.api("POST /99/Rapids", data={"ast": expr, "session_id": h2o.connection().session_id})
  File "c:\Users\varun\anaconda3\envs\aml\lib\site-packages\h2o\h2o.py", line 122, in api
    return h2oconn.request(endpoint, data=data, json=json, filename=filename, save_to=save_to)
  File "c:\Users\varun\anaconda3\envs\aml\lib\site-packages\h2o\backend\connection.py", line 494, in request
    resp = requests.request(method=method, url=url, data=rd, json=json, params=params,
  File "c:\Users\varun\anaconda3\envs\aml\lib\site-packages\requests\api.py", line 59, in request
    return session.request(method=method, url=url, **kwargs)
  File "

KeyError: 'RevLineCr'