## Step 1: Preliminaries

#### Import Packages

In [1]:

import numpy as np
import pandas as pd
import xgboost as xgb
from pandas.api.types import CategoricalDtype
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import mutual_info_classif
from sklearn.model_selection import cross_val_score
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report
from imblearn.under_sampling import NearMiss
from imblearn.over_sampling import RandomOverSampler
from imblearn.combine import SMOTETomek


#### Data Exploration

Exploring the data and reading the data description document revealed the following:
* The dataset is imbalanced.
* There are columns containing arbitrary values.
* Some columns pose a data leakage problem.
* The target column has some missing values.
* Missing values are minimal (<2%).
* There are corrupt values, typos, and wrong dtypes.

These issues will now be addressed.

**Confirm Dataset is imbalanced**

First, I need to check whether the dataset is imbalanced.

In [2]:
# read data into dataframe
df = pd.read_csv("SBAnational.csv", low_memory = False)
# Check the value counts of the target variable
df.MIS_Status.value_counts()

P I F     739609
CHGOFF    157558
Name: MIS_Status, dtype: int64

The ratio of `P I F` (paid in full) to `CHGOFF` (charged off) loans is roughly 5 : 1 (739,609 to 157,558), so the dataset is clearly imbalanced.
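As a quick check of that ratio, the counts printed above can be plugged into a few lines of pandas. This is only a sketch: the two numbers are copied from the output, and the variable names are illustrative.

```python
import pandas as pd

# Class counts copied from the value_counts() output above
counts = pd.Series({"P I F": 739609, "CHGOFF": 157558})

# How many paid-in-full loans there are per charged-off loan
majority_to_minority = counts["P I F"] / counts["CHGOFF"]
# Fraction of labelled loans that were charged off
minority_share = counts["CHGOFF"] / counts.sum()

print(f"P I F : CHGOFF = {majority_to_minority:.2f} : 1")
print(f"CHGOFF share of labelled loans = {minority_share:.1%}")
```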

## Step 2: Data Preprocessing and Transformation

**Load Data**

The following function loads the data. It reads the file, drops predetermined columns, handles missing values, and cleans and transforms the data. The factorization step is left commented out here and is applied later, when the features are scored and finalized.

In [3]:
def load_data():
    # Read data
    df = pd.read_csv("SBAnational.csv", low_memory = False)
    # PREPROCESSING
    # Function to drop missing values and arbitrary + data leakage columns
    df = drop_cols_and_missing_vals(df)
    # Function to clean and transform data
    df = clean_and_transform(df)
    # Function to factorize categorical features
    # df = factorizer(df)
    
    return df

##### **Handling Missing Values and Arbitrary + Data Leakage Columns**

**Arbitrary and Data Leakage Columns:**

* The `LoanNr_ChkDgt` column holds the loan identification number while the `Name` column is the name of the borrower. These columns are arbitrary and hence not helpful.
* `ChgOffDate` is the date a loan was charged off, so only charged-off loans have a value in this column. `ChgOffPrinGr` is the amount that was charged off; again, only charged-off loans have a value here. `BalanceGross` is the outstanding balance on a loan, and all charged-off loans have a $0 value. These columns reveal the outcome and would cause data leakage, so they need to be dropped.
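One way to spot this kind of leakage is to compare how often a column is filled in for each target class. The sketch below uses a few invented rows that mimic the pattern described above; the values are not taken from the dataset.

```python
import pandas as pd

# Toy rows that mimic the pattern described above (values are invented)
toy = pd.DataFrame({
    "MIS_Status": ["P I F", "CHGOFF", "P I F", "CHGOFF", "P I F"],
    "ChgOffDate": [None, "13-Jun-05", None, "02-Feb-09", None],
})

# Share of rows with a charge-off date, per target class
filled_by_class = toy["ChgOffDate"].notna().groupby(toy["MIS_Status"]).mean()
print(filled_by_class)
```

A column that is filled for one class and empty for the other effectively encodes the label, which is why it has to be removed before training.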


In [4]:
arbitrary_cols = ['LoanNr_ChkDgt', 'Name']
data_leakage_cols = ['ChgOffDate', 'ChgOffPrinGr', 'BalanceGross']

**Missing Values:**

Calculate Missing Values

In [5]:
df.drop(columns = arbitrary_cols + data_leakage_cols, inplace=True)
# Number of missing cells per column
missing_vals = df.isnull().sum()
dataset_size = len(df)
total_missing = missing_vals[missing_vals > 0].sum()
# Missing cells expressed as a percentage of the number of rows
isnull_values = round((total_missing / dataset_size) * 100, 2)
missing_values_percentage = pd.DataFrame(
    data = {'Count': [dataset_size, total_missing, isnull_values]},
    index = ['Size of Dataset', 'Missing Values', 'Missing Values %'])
missing_values_percentage

| | Count |
|---|---|
| Size of Dataset | 899164.0 |
| Missing Values | 14780.0 |
| Missing Values % | 1.64 |


The missing values make up less than 2% of the dataset. Considering the size of the dataset (~900k rows), this seems negligible, so rows with missing values will be dropped.
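The figure above counts missing cells relative to the number of rows. A complementary view, sketched here on an invented three-column frame, is the percentage missing per column and the number of rows that `dropna()` would actually remove.

```python
import numpy as np
import pandas as pd

# Invented example frame; column names are borrowed from the dataset
toy = pd.DataFrame({
    "Bank": ["A", None, "C", "D"],
    "NewExist": [1.0, 2.0, np.nan, 1.0],
    "Term": [84, 60, 120, 36],
})

# Percentage of missing values in each column
pct_missing_per_column = toy.isnull().mean().mul(100).round(2)
# Rows removed when every row containing a missing value is dropped
rows_lost = len(toy) - len(toy.dropna())

print(pct_missing_per_column)
print(f"rows dropped by dropna(): {rows_lost} of {len(toy)}")
```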

Function to drop missing values and arbitrary + data leakage columns.

In [6]:
def drop_cols_and_missing_vals(df):
    # Drop arbitrary and data leakage columns
    df.drop(columns=arbitrary_cols + data_leakage_cols, inplace = True)
    # Drop missing values from Target variable column
    df.dropna(subset = 'MIS_Status', inplace=True, axis=0)
    # Drop missing values
    df.dropna(inplace = True)
    
    return df

**Clean and Transform Data**

From reading the data description document and performing some data exploration, it was found that substantial data cleaning needs to be performed.

In [7]:


# These columns contain nominal data stored as numbers, so they are mapped to labels that reflect their categorical nature
# The target variable is also mapped from its text labels to binary values
map_cols = {'NAICS': {11: 'Agriculture', 21: 'Mining', 22: 'Utilities', 23: 'Construction', 31: 'Manufacturing', 32: 'Manufacturing',\
              33: 'Manufacturing', 42: 'Wholesale', 44: 'Retail', 45: 'Retail', 48: 'Transportation', 49: 'Transportation', 51: 'Information', \
              52: 'Finance', 53: 'RealEstate', 54: 'Professional', 55: 'Management', 56: 'AdminWasteRem', 61: 'Education', 62: 'HealthCare', \
              71: 'Arts', 72: 'Accommodation', 81: 'OtherServices', 92: 'PublicAdmin'},\
           'NewExist': {0: 'Undefined', 1: 'Existing', 2: 'New'}, 'UrbanRural': {0: 'Undefined', 1: 'Urban', 2: 'Rural'}, \
           'MIS_Status': {'P I F': 0, 'CHGOFF': 1},}
yesno_cols = ['RevLineCr', 'LowDoc']
# These columns hold continuous numerical data but contain $ and comma characters that need to be removed
currency_cols = ['DisbursementGross', 'GrAppv', 'SBA_Appv']



def clean_and_transform(df):
    # Some values in ApprovalFY column have typos that need to be corrected
    df.ApprovalFY = df.ApprovalFY.replace({'1976A': 1976})
    # Transform the remaining columns using map_cols, yesno_cols and currency_cols defined above
    for column in df.columns:
        if column == 'FranchiseCode':
            df[column] = df[column].apply(lambda x: 0 if x == 0 or x == 1 else 1)
        elif column == 'NAICS':
            df[column] = df[column].apply(lambda x: 'Other' if x == 0 else map_cols[column][int(str(x)[:2])])
        elif column in yesno_cols:
            df[column] = df[column].apply(lambda x: x if x == 'Y' or x == 'N' else 'Unknown')
        elif column in currency_cols:
            df[column] = df[column].map(lambda x: int(x[1:-4].replace(',','')))
        elif column in map_cols.keys():
            df[column] = df[column].map(map_cols[column])
        else:
            continue

        
    return df
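The currency conversion above slices each string with `x[1:-4]`, which relies on every value having a one-character prefix and a four-character suffix. A slightly more defensive variant is sketched below; the sample strings are assumptions about the format, not values copied from the dataset.

```python
import pandas as pd

# Example strings in the "$60,000.00 " style that the slicing above assumes
raw = pd.Series(["$60,000.00 ", "$1,250.00 ", "$0.00 "])

# Strip whitespace, remove "$" and ",", then convert to a numeric dtype
parsed = (
    raw.str.strip()
       .str.replace(r"[$,]", "", regex=True)
       .astype(float)
)
print(parsed.tolist())
```

This keeps any cents that are present and does not depend on the exact length of the suffix.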


**Factorize**

This function factorizes all categorical (object dtype) features in the data, replacing each distinct value with an integer code.

In [8]:
def factorizer(df):
    for colname in df.select_dtypes('object'):
        df[colname], _ = df[colname].factorize()
    return df
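To make the effect of `factorize()` concrete, here is a small illustration on an invented series. Codes are assigned in order of first appearance, so they carry no ordinal meaning.

```python
import pandas as pd

toy = pd.Series(["Urban", "Rural", "Urban", "Undefined"])

# factorize() returns integer codes plus the unique values they refer to
codes, uniques = toy.factorize()

print(codes.tolist())    # [0, 1, 0, 2]
print(list(uniques))     # ['Urban', 'Rural', 'Undefined']
```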

## Step 3: Feature Utility Scores

Mutual Information (MI) will be used to measure the relationship between each feature and the target variable.
This will help in determining which features are useful and which can be dropped.

The function defined below will generate the MI scores for the features to determine their utility.

In [9]:

def make_mi_scores(df):
    # Factorize a copy so the caller's DataFrame is left unchanged
    X = factorizer(df.copy())
    y = X.pop('MIS_Status')
    # All discrete features should now have integer dtypes
    discrete_features = [pd.api.types.is_integer_dtype(t) for t in X.dtypes]
    # Get MI scores
    mi_scores = mutual_info_classif(X, y, discrete_features=discrete_features)
    # Convert to a Series object
    mi_scores = pd.Series(mi_scores, name="MI Scores", index=X.columns)
    mi_scores = mi_scores.sort_values(ascending=False)
    return mi_scores
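As a sanity check of how MI scores behave, the sketch below scores two synthetic discrete features: one that mostly copies the label and one that is independent of it. The data and names are invented for illustration.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif

rng = np.random.default_rng(0)
n = 2000
y = rng.integers(0, 2, size=n)

# "informative" copies the label but flips about 10% of the entries
flip = rng.random(n) < 0.10
informative = np.where(flip, 1 - y, y)
# "noise" is drawn independently of the label
noise = rng.integers(0, 2, size=n)

X = np.column_stack([informative, noise])
mi = mutual_info_classif(X, y, discrete_features=True, random_state=0)
print(dict(zip(["informative", "noise"], mi.round(3))))
```

The informative feature should score far higher than the noise feature, whose score stays close to zero. Small positive scores on real data can therefore be weak evidence of usefulness.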


In [10]:
df = load_data()
mi_scores = make_mi_scores(df)
print(mi_scores)

Term                 0.261964
DisbursementGross    0.104011
DisbursementDate     0.087544
ApprovalDate         0.073108
ApprovalFY           0.066067
SBA_Appv             0.060602
Bank                 0.058030
Zip                  0.039582
GrAppv               0.035734
City                 0.034859
RetainedJob          0.026046
UrbanRural           0.024535
BankState            0.019807
NAICS                0.014742
NoEmp                0.008602
CreateJob            0.007551
State                0.006458
RevLineCr            0.006196
LowDoc               0.003942
NewExist             0.000242
FranchiseCode        0.000123
Name: MI Scores, dtype: float64


All features have MI scores greater than zero, although some scores are quite minimal. Since there are no zero scores, all features will be kept.

**Load Data**

The final part of this step is to reload the data, so that later steps start from the cleaned, unfactorized DataFrame.

In [11]:
df = load_data()

## Step 4: Feature Engineering

A few new features were created with the hope that they would improve the informativeness of the dataset.
* A possibly useful feature is the ratio of the payment term in months `Term` to the loan amount `GrAppv`.
* Another interesting feature could be the ratio of the SBA insured loan portion `SBA_Appv` to the loan amount `GrAppv`.
* A third possibly useful feature could be the ratio of the payment term in months `Term` to the SBA insured loan portion `SBA_Appv`.

The function defined below will create these new features and add them to the dataset.

In [12]:
def add_new_features(df):
    X_new = pd.DataFrame()
    X_new['feature_1'] = df.Term / df.GrAppv
    X_new['feature_2'] = df.SBA_Appv / df.GrAppv
    X_new['feature_3'] = df.Term / df.SBA_Appv
    df = df.join(X_new)
    
    return df
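Ratio features can produce infinite values when a denominator is zero. Whether the SBA data contains zero amounts is not stated here, so the sketch below only shows, on invented numbers, how such rows behave and one way to neutralise them.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Term": [84, 60, 120],
    "GrAppv": [50000, 0, 200000],      # the zero is deliberate
    "SBA_Appv": [25000, 0, 150000],
})

# Same three ratios as above, with infinities converted to NaN
ratios = pd.DataFrame({
    "feature_1": toy.Term / toy.GrAppv,
    "feature_2": toy.SBA_Appv / toy.GrAppv,
    "feature_3": toy.Term / toy.SBA_Appv,
}).replace([np.inf, -np.inf], np.nan)
print(ratios)
```

XGBoost treats NaN as a missing value by default, so rows handled this way can still be used for training.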

The MI scores are now recomputed to check whether the new features add information.

In [13]:
df = add_new_features(df)
mi_scores = make_mi_scores(df)
print(mi_scores)

Term                 0.261964
feature_1            0.145049
feature_3            0.142019
DisbursementGross    0.104011
DisbursementDate     0.087544
ApprovalDate         0.073108
ApprovalFY           0.066067
SBA_Appv             0.060602
Bank                 0.058030
feature_2            0.049165
Zip                  0.039582
GrAppv               0.035734
City                 0.034859
RetainedJob          0.026046
UrbanRural           0.024535
BankState            0.019807
NAICS                0.014742
NoEmp                0.008602
CreateJob            0.007551
State                0.006458
RevLineCr            0.006196
LowDoc               0.003942
NewExist             0.000242
FranchiseCode        0.000123
Name: MI Scores, dtype: float64


Their MI scores imply that the new features are informative, so they will be kept.

### Final Features

A function to load the final features, which will include the new features from the feature engineering step.

In [14]:
def final_features():
    df = load_data()
    df = add_new_features(df)
    df = factorizer(df)
    X = df.copy()
    y = X.pop('MIS_Status')
    
    return X, y

## Step 5: Train and Evaluate Model

In [15]:
# Load final features
X, y = final_features()

In [16]:
# Split dataset into train and test sets with train set = 80% and test set = 20%
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
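The split above is a plain random split. An optional variation, not used in this notebook, is to pass `stratify` so that both sets keep the same class proportions, which can matter with imbalanced targets. The sketch below uses invented labels.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy labels with a 20% positive rate
y_demo = np.array([0] * 800 + [1] * 200)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)

# stratify=y_demo keeps the positive rate the same in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)
print(y_tr.mean(), y_te.mean())
```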

**Model - XGBoost**

XGBoost is known for its accuracy and efficiency. It is a good choice for dealing with imbalanced datasets 
because it can be fine-tuned to focus on the minority class.

**Evaluation Metric - Confusion Matrix and Classification Report**

* The confusion matrix was chosen because it shows misclassifications as raw counts. For this dataset, the focus should be on the True Positives and False Negatives, because the aim is to identify borrowers who will default (True Positives). Minimizing the number of defaulters wrongly classified as no risk (False Negatives) is therefore also important.
* The classification report was chosen because it focuses on precision and recall. Precision (the percentage of positive predictions that are correct) should always be evaluated. However, given the aim of this task, recall (the percentage of actual positives that are correctly predicted) is especially important.
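To show how the four cells of the confusion matrix relate to precision and recall, here is a minimal sketch on invented labels, with 1 standing for a charged-off loan.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Toy labels: 1 = charged off (positive class), 0 = paid in full
y_true = [0, 0, 0, 0, 1, 1, 1, 0, 1, 0]
y_pred = [0, 0, 1, 0, 1, 0, 1, 0, 1, 0]

# For binary labels, ravel() yields the cells in the order TN, FP, FN, TP
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision = tp / (tp + fp)   # correct positives among predicted positives
recall = tp / (tp + fn)      # correct positives among actual positives
print(tn, fp, fn, tp, precision, recall)
```

The hand-computed values agree with `precision_score` and `recall_score` from scikit-learn.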

Using these metrics can help train the model to produce better results regarding default prediction.

#### Hyperparameter tuning

To get the best results, hyperparameter tuning was done using GridSearchCV. The best parameters found are listed in the `params` dictionary below. Of particular interest was the `scale_pos_weight` parameter, which scales the weight of the positive class and can be used to give more weight to the minority class. In this case, the model performed best with this parameter left at its default (1), which suggests that adjusting it does not help with the imbalance in this dataset.
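The search itself is not shown in this notebook, and neither is the scoring it used. The sketch below only illustrates the general shape of such a search. It makes several assumptions: the data is synthetic, a scikit-learn decision tree stands in for XGBoost, `class_weight` plays the role that `scale_pos_weight` plays there, and the grid values are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic imbalanced data (about 80% negatives), purely for illustration
X_demo, y_demo = make_classification(
    n_samples=600, n_features=8, weights=[0.8, 0.2], random_state=0
)

# Hypothetical grid; class_weight is the stand-in for scale_pos_weight
param_grid = {
    "max_depth": [3, 5],
    "class_weight": [None, {0: 1, 1: 4}],
}

search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid,
    scoring="recall",   # rank candidates by recall of the positive class
    cv=3,
)
search.fit(X_demo, y_demo)
print(search.best_params_)
```

The choice of `scoring` matters here: a search ranked by accuracy may prefer no reweighting, while one ranked by recall may prefer a heavier positive class.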

**Parameters**

In [17]:
# Best parameters found for XGBoost
params = {
        'learning_rate': 0.25,
        'n_estimators': 100,
        'objective':'binary:logistic',
        'colsample_bytree': 0.9,
        'max_depth': 11,
        'scale_pos_weight': 1
        }


The model was trained and evaluated using the parameters above.

In [18]:
model = xgb.XGBClassifier(**params)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))

[[142820   3161]
 [  4150  27120]]
              precision    recall  f1-score   support

           0       0.97      0.98      0.98    145981
           1       0.90      0.87      0.88     31270

    accuracy                           0.96    177251
   macro avg       0.93      0.92      0.93    177251
weighted avg       0.96      0.96      0.96    177251



The precision and recall for the negative class are very high. This is expected, since it is the majority class in the imbalanced dataset.
The precision and recall for the positive class are lower, which is understandable as it is the minority class.

The aim now is to mitigate the effects of the imbalance and thereby improve recall for the positive class without sacrificing too much precision.

## Step 6: Improve model by mitigating dataset imbalance

### Resampling methods

#### Under-Sampling, Over-Sampling & SMOTE

To try to improve model performance by accounting for the imbalance in the data, the following resampling techniques were applied:
* **Under-Sampling**: This reduces the majority class to bring it closer in size to the minority class.
* **Over-Sampling**: This copies minority class samples and uses the copies to bring the class closer in size to the majority class.
* **SMOTE**: This technique oversamples the minority class by creating synthetic data points between existing minority class points. The code below uses `SMOTETomek`, which combines SMOTE with Tomek-link cleaning.

The sampling strategy I used here is 0.5, meaning each method resamples until the minority class is half the size of the majority class.
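To make that ratio concrete, the following hand-rolled sketch duplicates minority labels until the minority-to-majority ratio reaches 0.5. It is a NumPy illustration on invented labels, not the imbalanced-learn implementation.

```python
import numpy as np

rng = np.random.default_rng(42)
y_demo = np.array([0] * 1000 + [1] * 100)   # toy labels, 10:1 imbalance
target_ratio = 0.5                          # minority / majority after resampling

minority_idx = np.flatnonzero(y_demo == 1)
n_majority = int((y_demo == 0).sum())
# Number of extra minority rows needed to reach the target ratio
n_extra = int(target_ratio * n_majority) - len(minority_idx)

# Duplicate randomly chosen minority rows (sampling with replacement)
extra_idx = rng.choice(minority_idx, size=n_extra, replace=True)
y_resampled = np.concatenate([y_demo, y_demo[extra_idx]])

print(np.bincount(y_resampled))   # [1000  500]
```

Under-sampling reaches the same ratio from the other direction, by shrinking the majority class instead.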
The model was then trained and evaluated on each of the resampled datasets and the results were compared. The code block below is shown as markdown rather than as an executable cell because it takes a long time to run. Note that the loop uses `xgb.XGBClassifier()` with default parameters.

```
sample = 0.5
imbl_dict = {
    'Under-Sampling': NearMiss(sampling_strategy=sample),
    'Over-Sampling': RandomOverSampler(sampling_strategy=sample),
    'SMOTE': SMOTETomek(sampling_strategy=sample)
}

for key in imbl_dict.keys():
    sampler = imbl_dict[key]
    X_train_s, y_train_s = sampler.fit_resample(X_train, y_train)
    model = xgb.XGBClassifier()
    model.fit(X_train_s, y_train_s)
    y_pred = model.predict(X_test)
    print(key + ' Result')
    print(confusion_matrix(y_test, y_pred))
    print(classification_report(y_test, y_pred))
```

Following are the results of the above code block.

```
Under-Sampling Result
[[120635  25346]
 [  2450  28820]]
              precision    recall  f1-score   support

           0       0.98      0.83      0.90    145981
           1       0.53      0.92      0.67     31270

    accuracy                           0.84    177251
   macro avg       0.76      0.87      0.79    177251
weighted avg       0.90      0.84      0.86    177251

Over-Sampling Result
[[140188   5793]
 [  2845  28425]]
              precision    recall  f1-score   support

           0       0.98      0.97      0.97    145981
           1       0.86      0.90      0.88     31270

    accuracy                           0.95    177251
   macro avg       0.91      0.93      0.92    177251
weighted avg       0.95      0.95      0.95    177251

SMOTE Result
[[141793   4188]
 [  4515  26755]]
              precision    recall  f1-score   support

           0       0.97      0.97      0.97    145981
           1       0.86      0.86      0.86     31270

    accuracy                           0.95    177251
   macro avg       0.92      0.91      0.92    177251
weighted avg       0.95      0.95      0.95    177251
```

In [19]:

resampler = RandomOverSampler(sampling_strategy = 0.5)
X_train, y_train = resampler.fit_resample(X_train, y_train)

model = xgb.XGBClassifier(**params)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))


[[141421   4560]
 [  3117  28153]]
              precision    recall  f1-score   support

           0       0.98      0.97      0.97    145981
           1       0.86      0.90      0.88     31270

    accuracy                           0.96    177251
   macro avg       0.92      0.93      0.93    177251
weighted avg       0.96      0.96      0.96    177251



Based on the above results, Over-Sampling is the method to go with, because it is the only one that significantly increases recall without compromising precision excessively.
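The trade-off can be read directly from the two confusion matrices produced with the tuned parameters. The helper below is illustrative; the numbers are copied from the `In [18]` and `In [19]` outputs.

```python
def precision_recall(matrix):
    """Return (precision, recall) for class 1 from [[TN, FP], [FN, TP]]."""
    (tn, fp), (fn, tp) = matrix
    return tp / (tp + fp), tp / (tp + fn)

# Confusion matrices copied from the tuned-model outputs in this notebook
baseline = [[142820, 3161], [4150, 27120]]       # In [18], no resampling
oversampled = [[141421, 4560], [3117, 28153]]    # In [19], RandomOverSampler

for name, matrix in [("baseline", baseline), ("oversampled", oversampled)]:
    p, r = precision_recall(matrix)
    print(f"{name:12s} precision={p:.3f} recall={r:.3f}")
```

Recall for the charged-off class rises from about 0.87 to 0.90, while precision falls from about 0.90 to 0.86.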

The final model will be trained on Over-Sampled data.

#### Final Model

In [20]:
X, y = final_features()

resampler = RandomOverSampler(sampling_strategy = 0.5)
X_train, y_train = resampler.fit_resample(X, y)

model = xgb.XGBClassifier(**params)
model.fit(X_train, y_train)


XGBClassifier(base_score=None, booster=None, callbacks=None,
              colsample_bylevel=None, colsample_bynode=None,
              colsample_bytree=0.9, early_stopping_rounds=None,
              enable_categorical=False, eval_metric=None, feature_types=None,
              gamma=None, gpu_id=None, grow_policy=None, importance_type=None,
              interaction_constraints=None, learning_rate=0.25, max_bin=None,
              max_cat_threshold=None, max_cat_to_onehot=None,
              max_delta_step=None, max_depth=11, max_leaves=None,
              min_child_weight=None, missing=nan, monotone_constraints=None,
              n_estimators=100, n_jobs=None, num_parallel_tree=None,
              predictor=None, random_state=None, ...)

#### Save Model

In [21]:
# Save the model
model.save_model("model.json")


Thank you.