# Lending Club: Predicting Interest Rates


## Preprocessing and Training Data Development

Preprocessing plays a crucial role in preparing data for fitting a regression model. It involves transforming and 
manipulating the data to ensure it is in a suitable format and meets the requirements of the regression algorithm. 
We will take the following preprocessing steps for our regression modeling:
    
1. Feature scaling: Scale numerical features to ensure they are on a similar scale.
    Scaling can prevent certain features from dominating the regression model due to their larger magnitude.

2. Encoding categorical variables: We will convert categorical variables into a numerical representation that can be 
    understood by the regression model. We will use one-hot encoding to create binary dummy variables. 
    Alternatives include label encoding (assigning numerical labels) and entity embeddings for 
    high-cardinality categorical variables.

3. Feature selection: We will identify the features most strongly related to the target variable. 
    Techniques such as correlation analysis, stepwise regression, or L1 regularization (Lasso) 
    can help select the most informative features and remove irrelevant or redundant ones. 
    Ridge regression shrinks coefficients but does not set them to zero, so on its own it does not select features.

4. Splitting the data: Lastly, we will divide the dataset into training and test sets. The training set is used to 
    train the regression model, while the test set is used to evaluate its performance on unseen data. 
    This step helps assess the model's generalization ability.
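
The four steps above can be sketched as a single scikit-learn pipeline. This is a minimal illustration on a small synthetic frame, not the preprocessing actually applied later in this notebook: the toy values, the `demo_*` names, and the use of `SelectFromModel` with Lasso for step 3 are all assumptions made for the example.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso, Ridge
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in data (hypothetical values, not the Lending Club file)
rng = np.random.default_rng(0)
n = 200
demo = pd.DataFrame({
    "loan_amnt": rng.uniform(1_000, 35_000, n),
    "annual_inc": rng.uniform(20_000, 150_000, n),
    "home_ownership": rng.choice(["RENT", "OWN", "MORTGAGE"], n),
})
demo["int_rate"] = 8 + 0.0002 * demo["loan_amnt"] + rng.normal(0, 1, n)

X_demo = demo.drop(columns="int_rate")
y_demo = demo["int_rate"]

# Step 4 comes first in code: split before fitting any transformer,
# so that no statistics from the test rows leak into training
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42
)

# Steps 1 and 2: scale numeric columns, one-hot encode the categorical one
preprocess = ColumnTransformer([
    ("num", StandardScaler(), ["loan_amnt", "annual_inc"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["home_ownership"]),
])

# Step 3: keep only features with a non-zero Lasso coefficient, then fit Ridge
demo_pipeline = Pipeline([
    ("preprocess", preprocess),
    ("select", SelectFromModel(Lasso(alpha=0.05))),
    ("model", Ridge(alpha=1.0)),
])
demo_pipeline.fit(X_tr, y_tr)
demo_test_r2 = demo_pipeline.score(X_te, y_te)
print("Test R^2 on synthetic data:", demo_test_r2)
```

Keeping every transformer inside the pipeline means the scaler, encoder, and selector are fit on the training rows only and then reused unchanged on the test rows.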

In [1]:
# Core data and numerical libraries
import numpy as np
import pandas as pd
import scipy.stats as stats

# Plotting
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

# Preprocessing
from sklearn import preprocessing
from sklearn.decomposition import PCA
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, StandardScaler

# Model selection and evaluation
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import (
    GridSearchCV,
    RandomizedSearchCV,
    cross_val_score,
    train_test_split,
    validation_curve,
)

# Models
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Ridge
from xgboost import XGBRegressor

In [2]:
#load the data
loan_df = pd.read_csv('../lc_loanDf.csv')

In [3]:
loan_df.shape

(817103, 53)

In [4]:
loan_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 817103 entries, 0 to 817102
Data columns (total 53 columns):
 #   Column                       Non-Null Count   Dtype  
---  ------                       --------------   -----  
 0   id                           817103 non-null  int64  
 1   loan_amnt                    817103 non-null  float64
 2   funded_amnt                  817103 non-null  float64
 3   funded_amnt_inv              817103 non-null  float64
 4   term                         817103 non-null  object 
 5   int_rate                     817103 non-null  float64
 6   installment                  817103 non-null  float64
 7   grade                        817103 non-null  object 
 8   sub_grade                    817103 non-null  object 
 9   emp_title                    817103 non-null  object 
 10  emp_length                   817103 non-null  int64  
 11  home_ownership               817103 non-null  object 
 12  annual_inc                   817103 non-null  float64
 13 

To determine which columns are relevant for predicting the interest rate, we used domain knowledge. 
Below is a general analysis of the selected columns and their potential relevance:

    'int_rate': This column is the target variable, the interest rate. It is the quantity we want to predict, so it is kept in the selection as the target and is not used as a feature.

    'loan_amnt': The loan amount can potentially have an impact on the interest rate. Higher loan amounts might be associated with higher interest rates due to increased risk. Therefore, 'loan_amnt' can be considered relevant for predicting interest rates.

    'annual_inc': Borrower's annual income is an important factor that lenders consider when determining interest rates. Higher incomes may indicate a borrower's ability to handle debt, and lower incomes may result in higher interest rates. Therefore, 'annual_inc' can be relevant for predicting interest rates.

    'mths_since_last_delinq': This column represents the number of months since the borrower's last delinquency. Delinquencies may affect creditworthiness, which can impact interest rates. Therefore, 'mths_since_last_delinq' can potentially be relevant in predicting interest rates.

    'revol_bal' and 'revol_util': These columns are related to the borrower's revolving credit balance and utilization. Higher revolving balances and higher utilization may indicate a higher risk for lenders, potentially leading to higher interest rates. Thus, 'revol_bal' and 'revol_util' can be relevant features for predicting interest rates.

    'tot_coll_amt' and 'tot_cur_bal': These columns represent the total collection amount and the total current balance, respectively. These variables provide information about the borrower's credit history and their current financial situation. Both variables can be relevant for predicting interest rates as they reflect creditworthiness and financial stability.

    'total_rev_hi_lim': This column represents the total revolving credit limit. It provides information about the borrower's available credit and potential debt capacity. Higher credit limits might suggest a better credit profile, which can influence interest rates. Thus, 'total_rev_hi_lim' can be relevant in predicting interest rates.

    'term_in_month': This column indicates the loan term in months. Loan terms can affect interest rates, as longer terms may carry higher risks and potentially higher interest rates. Therefore, 'term_in_month' can be relevant for predicting interest rates.

    'emp_length': The length of employment can be an indicator of job stability. Lenders may consider borrowers with longer employment histories as more financially secure and reliable, potentially resulting in lower interest rates. Conversely, borrowers with shorter employment histories may be perceived as higher risk, leading to higher interest rates.
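
Domain reasoning of this kind can be cross-checked against the data. As a minimal sketch, the snippet below ranks candidate features by their absolute Pearson correlation with `int_rate`. It uses a small synthetic frame rather than the real `loan_df`; the values, and the relationship deliberately built into them, are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 500
toy = pd.DataFrame({
    "loan_amnt": rng.uniform(1_000, 35_000, n),
    "annual_inc": rng.uniform(20_000, 150_000, n),
    "revol_util": rng.uniform(0, 100, n),
})
# Hypothetical relationship: rate rises with utilization, falls with income,
# and does not depend on loan amount at all
toy["int_rate"] = (
    10
    + 0.05 * toy["revol_util"]
    - 0.00002 * toy["annual_inc"]
    + rng.normal(0, 1, n)
)

# Absolute correlation of each feature with the target, strongest first
corr_with_target = (
    toy.corr()["int_rate"]
    .drop("int_rate")
    .abs()
    .sort_values(ascending=False)
)
print(corr_with_target)
```

Pearson correlation only captures linear association, so a low value here does not rule out a feature that matters through a non-linear effect or an interaction.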



In [5]:
relevant_cols=['int_rate','loan_amnt','annual_inc','mths_since_last_delinq','revol_bal', 'revol_util',
               'tot_coll_amt','tot_cur_bal','total_rev_hi_lim','term_in_month', 'emp_length']

In [6]:
# loan_Df (capital D) holds only the selected columns; loan_df remains the full frame
loan_Df=loan_df[relevant_cols]

In [7]:
loan_Df.shape

(817103, 11)

In [8]:
loan_Df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 817103 entries, 0 to 817102
Data columns (total 11 columns):
 #   Column                  Non-Null Count   Dtype  
---  ------                  --------------   -----  
 0   int_rate                817103 non-null  float64
 1   loan_amnt               817103 non-null  float64
 2   annual_inc              817103 non-null  float64
 3   mths_since_last_delinq  817103 non-null  int64  
 4   revol_bal               817103 non-null  float64
 5   revol_util              817103 non-null  float64
 6   tot_coll_amt            817103 non-null  float64
 7   tot_cur_bal             817103 non-null  float64
 8   total_rev_hi_lim        817103 non-null  float64
 9   term_in_month           817103 non-null  int64  
 10  emp_length              817103 non-null  int64  
dtypes: float64(8), int64(3)
memory usage: 68.6 MB


# Split data into training and testing subsets

In [9]:
from sklearn.model_selection import train_test_split

# Define the features and target variable
X = loan_Df.drop('int_rate', axis=1) # all columns except the target
y = loan_Df['int_rate'] # target variable

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

# Modeling

In [10]:
#create three separate regression models: ridge_model, rf_model, and xgbr_model

ridge_model=Ridge(alpha=1.0, random_state=1)

rf_model=RandomForestRegressor()

xgbr_model=XGBRegressor()

In [11]:
#fit the three regression models

ridge_model.fit(X_train, y_train)

rf_model.fit(X_train, y_train)

xgbr_model.fit(X_train, y_train)

XGBRegressor(base_score=None, booster=None, callbacks=None,
             colsample_bylevel=None, colsample_bynode=None,
             colsample_bytree=None, early_stopping_rounds=None,
             enable_categorical=False, eval_metric=None, feature_types=None,
             gamma=None, gpu_id=None, grow_policy=None, importance_type=None,
             interaction_constraints=None, learning_rate=None, max_bin=None,
             max_cat_threshold=None, max_cat_to_onehot=None,
             max_delta_step=None, max_depth=None, max_leaves=None,
             min_child_weight=None, missing=nan, monotone_constraints=None,
             n_estimators=100, n_jobs=None, num_parallel_tree=None,
             predictor=None, random_state=None, ...)
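
Ridge penalizes coefficient magnitudes, so it is sensitive to feature scale, whereas tree-based models are not. The Ridge model above is fit on unscaled features. The following is a sketch of how scaling could be attached to Ridge with a pipeline, shown on synthetic data with hypothetical variable names. It does not establish whether scaling would change the scores on the real loan data.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
# Two features on very different scales (hypothetical stand-ins)
X_syn = np.column_stack([
    rng.uniform(1_000, 35_000, 300),  # e.g. a dollar amount
    rng.uniform(0, 1, 300),           # e.g. a ratio
])
y_syn = 0.0001 * X_syn[:, 0] + 5.0 * X_syn[:, 1] + rng.normal(0, 0.5, 300)

Xs_train, Xs_test, ys_train, ys_test = train_test_split(
    X_syn, y_syn, test_size=0.25, random_state=42
)

# The scaler is fit on the training rows only and reused at predict time
scaled_ridge = make_pipeline(StandardScaler(), Ridge(alpha=1.0))
scaled_ridge.fit(Xs_train, ys_train)
scaled_ridge_r2 = scaled_ridge.score(Xs_test, ys_test)
print("Scaled Ridge test R^2 (synthetic):", scaled_ridge_r2)
```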

In [None]:
#print('Mean absolute error:', mean_absolute_error(y_train, rf_model.predict(X_train)))

In [12]:
from tabulate import tabulate
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score


# Predict on the test set
ridge_pred = ridge_model.predict(X_test)
rf_pred = rf_model.predict(X_test)
xgbr_pred = xgbr_model.predict(X_test)

# Calculate metrics for each model
ridge_mse = mean_squared_error(y_test, ridge_pred)
ridge_mae = mean_absolute_error(y_test, ridge_pred)
ridge_r2 = r2_score(y_test, ridge_pred)

rf_mse = mean_squared_error(y_test, rf_pred)
rf_mae = mean_absolute_error(y_test, rf_pred)
rf_r2 = r2_score(y_test, rf_pred)

xgbr_mse = mean_squared_error(y_test, xgbr_pred)
xgbr_mae = mean_absolute_error(y_test, xgbr_pred)
xgbr_r2 = r2_score(y_test, xgbr_pred)

# Create a table with the model metrics
table = [
    ["Model", "MSE", "MAE", "R^2"],
    ["Ridge", ridge_mse, ridge_mae, ridge_r2],
    ["Random Forest", rf_mse, rf_mae, rf_r2],
    ["XGBoost", xgbr_mse, xgbr_mae, xgbr_r2]
]

# Print the table
print(tabulate(table, headers="firstrow", tablefmt="fancy_grid"))


╒═══════════════╤═════════╤═════════╤══════════╕
│ Model         │     MSE │     MAE │      R^2 │
╞═══════════════╪═════════╪═════════╪══════════╡
│ Ridge         │ 14.2689 │ 2.99902 │ 0.265462 │
├───────────────┼─────────┼─────────┼──────────┤
│ Random Forest │ 12.2698 │ 2.77889 │ 0.368373 │
├───────────────┼─────────┼─────────┼──────────┤
│ XGBoost       │ 11.3792 │ 2.66945 │ 0.414218 │
╘═══════════════╧═════════╧═════════╧══════════╛


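Based on the results above, XGBoost performs best on the test set: it has the lowest MSE (11.38) and MAE (2.67) 
and the highest R^2 (0.414), compared with 0.368 for Random Forest and 0.265 for Ridge.
Even so, the best model explains only about 41% of the variance in the interest rate, so there is still substantial room for improvement.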
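
For reference, the three metrics in the table can be computed directly from their definitions. The sketch below uses made-up numbers (not values from this dataset) and compares the hand-written formulas with scikit-learn. It also computes RMSE, which is in the same units as the target; for example, an MSE of 11.3792 corresponds to an RMSE of about 3.37 in the units of `int_rate`.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Made-up illustrative values, not taken from the loan data
y_true = np.array([10.0, 12.5, 15.0, 9.0, 18.0])
y_hat = np.array([11.0, 12.0, 14.0, 10.5, 16.0])

residuals = y_true - y_hat

# MSE: mean of squared residuals; MAE: mean of absolute residuals
mse_manual = np.mean(residuals ** 2)
mae_manual = np.mean(np.abs(residuals))
rmse_manual = np.sqrt(mse_manual)

# R^2 = 1 - SS_res / SS_tot, where SS_tot is measured around the mean of y_true
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print("MSE :", mse_manual, "vs sklearn", mean_squared_error(y_true, y_hat))
print("MAE :", mae_manual, "vs sklearn", mean_absolute_error(y_true, y_hat))
print("RMSE:", rmse_manual)
print("R^2 :", r2_manual, "vs sklearn", r2_score(y_true, y_hat))
```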

In [13]:
from sklearn.pipeline import make_pipeline
from sklearn.ensemble import RandomForestRegressor

# Create the pipeline
clf = make_pipeline(
    #OrdinalEncoder(),
    #SimpleImputer(),
    RandomForestRegressor(random_state=42, n_jobs=1, max_samples=0.6)
)


In [14]:
parm_Grid = {
    #'simpleimputer__strategy': ['median', 'mean'],
    'randomforestregressor__max_depth': range(5, 60, 5),
    'randomforestregressor__n_estimators': range(25, 200, 25)
}


In [15]:
from sklearn.model_selection import RandomizedSearchCV

models_rfrs = RandomizedSearchCV(
    clf,
    param_distributions=parm_Grid,
    n_iter=10,
    cv=5,
    n_jobs=-1,
    random_state=42,
    verbose=2
)


In [None]:
models_rfrs.fit(X_train,y_train)

Fitting 5 folds for each of 10 candidates, totalling 50 fits
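
Once a search like this finishes, the fitted object exposes the selected hyperparameters and the cross-validated scores. The sketch below shows the attributes typically inspected, using a small synthetic problem so that it runs quickly; the data, the reduced grid, and the `small_*` names are assumptions for illustration. Because no `scoring` argument is passed, the reported score is the regressor's default, R^2.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import make_pipeline

# Small synthetic regression problem (stand-in for X_train, y_train)
X_small, y_small = make_regression(
    n_samples=200, n_features=5, noise=10.0, random_state=42
)

small_pipe = make_pipeline(
    RandomForestRegressor(random_state=42, max_samples=0.6)
)
# Reduced grid so the example finishes in seconds
small_grid = {
    "randomforestregressor__max_depth": range(2, 8, 2),
    "randomforestregressor__n_estimators": range(10, 40, 10),
}

small_search = RandomizedSearchCV(
    small_pipe,
    param_distributions=small_grid,
    n_iter=4,
    cv=3,
    random_state=42,
)
small_search.fit(X_small, y_small)

# Best sampled combination and its mean cross-validated R^2
print(small_search.best_params_)
print(small_search.best_score_)

# Per-candidate summary, best rank first
results = pd.DataFrame(small_search.cv_results_)[
    ["params", "mean_test_score", "std_test_score", "rank_test_score"]
].sort_values("rank_test_score")
print(results.head())
```

With the default `refit=True`, the search object also holds `best_estimator_`, a pipeline refit on all of the data passed to `fit`, which can be scored on a held-out test set.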


# Cross-validation for Random Forest