<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Import-and-Load-Data" data-toc-modified-id="Import-and-Load-Data-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Import and Load Data</a></span></li><li><span><a href="#Preprocessing" data-toc-modified-id="Preprocessing-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Preprocessing</a></span></li><li><span><a href="#Trying-Out-Models" data-toc-modified-id="Trying-Out-Models-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Trying Out Models</a></span><ul class="toc-item"><li><span><a href="#Logistic-Regression" data-toc-modified-id="Logistic-Regression-3.1"><span class="toc-item-num">3.1&nbsp;&nbsp;</span>Logistic Regression</a></span></li><li><span><a href="#Support-Vector-Machine" data-toc-modified-id="Support-Vector-Machine-3.2"><span class="toc-item-num">3.2&nbsp;&nbsp;</span>Support Vector Machine</a></span></li><li><span><a href="#Decision-Trees-(Random-Forest,-Gradient-Boosting,-XGBoost)" data-toc-modified-id="Decision-Trees-(Random-Forest,-Gradient-Boosting,-XGBoost)-3.3"><span class="toc-item-num">3.3&nbsp;&nbsp;</span>Decision Trees (Random Forest, Gradient Boosting, XGBoost)</a></span></li><li><span><a href="#Other-Models-(e.g.-Bagging-Classifier)" data-toc-modified-id="Other-Models-(e.g.-Bagging-Classifier)-3.4"><span class="toc-item-num">3.4&nbsp;&nbsp;</span>Other Models (e.g. Bagging Classifier)</a></span></li></ul></li><li><span><a href="#Model-Evaluation" data-toc-modified-id="Model-Evaluation-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Model Evaluation</a></span></li></ul></div>

## Import and Load Data

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler, OneHotEncoder, OrdinalEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import Lasso, ElasticNet
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from xgboost import XGBClassifier
from sklearn.impute import SimpleImputer
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score

In [2]:
df = pd.read_csv("loans.csv")



## Preprocessing

 - Handle missing values
 - Encode categorical variables, scale data (if you wish), feature selection, etc.
 - Split the dataset into features (X) and target variable (y)
 - Split into training and testing sets

In [3]:
threshold = len(df) * 0.10 # minimum number of non-missing values a column needs (10% of rows)

In [4]:
df_cleaned = df.dropna(axis = 1, thresh = threshold) # drop columns that have more than 90% missing values
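
Because `thresh` counts non-missing values rather than missing ones, `thresh = len(df) * 0.10` keeps any column that is at least 10% populated, i.e. it drops columns that are more than 90% missing. A minimal sketch on a toy frame (the column names `mostly_full`, `half_full`, and `all_empty` are made up for illustration and are not in `loans.csv`):

```python
import numpy as np
import pandas as pd

# Toy frame with 10 rows; column names are hypothetical
toy = pd.DataFrame({
    "mostly_full": [1.0] * 9 + [np.nan],      # 10% missing
    "half_full": [1.0] * 5 + [np.nan] * 5,    # 50% missing
    "all_empty": [np.nan] * 10,               # 100% missing
})

# thresh is the minimum number of NON-missing values a column needs to be kept
min_non_missing = int(len(toy) * 0.10)  # 1 for a 10-row frame
kept = toy.dropna(axis=1, thresh=min_non_missing)

print(kept.columns.tolist())
```

Only the completely empty column is dropped here; a stricter filter would need a larger `thresh`.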

In [5]:
# df_cleaned = df_cleaned.drop(axis = 1, columns =['id', 'member_id','emp_title', 'url', 'Unnamed: 0', 'title', 'zip_code','addr_state', 'policy_code','desc', 'next_pymnt_d',
#                                                  'issue_d', 'out_prncp', 'out_prncp_inv', 'total_pymnt',
#                                                  'total_pymnt_inv', 'total_rec_prncp', 'total_rec_int','total_rec_late_fee','recoveries', 'collection_recovery_fee',
#                                                  'last_pymnt_d', 'last_pymnt_amnt', 'last_credit_pull_d', 'collections_12_mths_ex_med','sub_grade'\
#                                                  ,'mths_since_last_delinq', 'mths_since_last_major_derog','mths_since_last_record', 'earliest_cr_line'\
#                                                  ,'grade', 'initial_list_status', 'term',\
#                                                 , 'inq_last_6mths', 'open_acc', 'pub_rec',\
#                                                 'tot_coll_amt', 'tot_cur_bal', 'total_rev_hi_lim', 'revol_util'])

In [6]:
#df_cleaned['term'].replace({' 36 months': 36, ' 60 months': 60}, inplace=True) #remove string value from column to ensure int value in column

In [7]:
#df_cleaned.term.unique()

In [8]:
#df_cleaned['earliest_cr_line_year'] = pd.to_datetime(df_cleaned['earliest_cr_line']).dt.year   #convert to year
#df_cleaned['earliest_cr_line_month'] = pd.to_datetime(df_cleaned['earliest_cr_line']).dt.month  #convert to month
#df_cleaned.drop(columns = 'earliest_cr_line', inplace = True)

In [9]:
other_nominal_columns = [ 'pymnt_plan']
other_continuous_columns = ['emp_length', ]

In [10]:
df_cleaned.earliest_cr_line

0         Jan-1985
1         Apr-1999
2         Nov-2001
3         Feb-1996
4         Jan-1996
            ...   
887374    Sep-2004
887375    Mar-1974
887376    Sep-2003
887377    Oct-2003
887378    Dec-2001
Name: earliest_cr_line, Length: 887379, dtype: object

In [11]:
df_cleaned['earliest_cr_line'] = pd.to_datetime(df_cleaned['earliest_cr_line'])
df_cleaned['issue_d'] = pd.to_datetime(df_cleaned['issue_d'])

  df_cleaned['earliest_cr_line'] = pd.to_datetime(df_cleaned['earliest_cr_line'])
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_cleaned['earliest_cr_line'] = pd.to_datetime(df_cleaned['earliest_cr_line'])
  df_cleaned['issue_d'] = pd.to_datetime(df_cleaned['issue_d'])
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_cleaned['issue_d'] = pd.to_datetime(df_cleaned['issue_d'])
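
The `SettingWithCopyWarning` above appears because pandas cannot tell whether `df_cleaned` is an independent frame or a view of `df`. One common way to avoid it is to take an explicit `.copy()` when deriving the cleaned frame. The sketch below uses a small made-up frame rather than the real loans data, and the `format="%b-%Y"` argument assumes every date looks like `Jan-1985`:

```python
import pandas as pd

# Toy stand-in for the loans data; values are illustrative only
raw = pd.DataFrame({
    "earliest_cr_line": ["Jan-1985", "Apr-1999"],
    "issue_d": ["Dec-2011", "Dec-2011"],
    "empty": [None, None],
})

# .copy() makes the result an independent DataFrame, so later column
# assignments should not trigger SettingWithCopyWarning
cleaned = raw.dropna(axis=1, thresh=1).copy()

# An explicit format avoids per-element date inference on "Mon-YYYY" strings
for col in ["earliest_cr_line", "issue_d"]:
    cleaned[col] = pd.to_datetime(cleaned[col], format="%b-%Y")

print(cleaned.dtypes)
```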


In [12]:
continuous_columns = ['loan_amnt', 'installment', 'funded_amnt','funded_amnt_inv', 'annual_inc', 'int_rate', 'dti', 'delinq_2yrs',\
                     'revol_bal', 'total_acc', 'acc_now_delinq', 'revol_util', 'open_acc', 'inq_last_6mths', 'pub_rec']
nominal_columns = ['home_ownership', 'purpose', 'verification_status', 'application_type', 'initial_list_status', 'pymnt_plan']
ordinal_columns = ['sub_grade']
time_columns = ['earliest_cr_line']

In [13]:
total_cols = continuous_columns + nominal_columns + ordinal_columns

In [14]:
total_cols

['loan_amnt',
 'installment',
 'funded_amnt',
 'funded_amnt_inv',
 'annual_inc',
 'int_rate',
 'dti',
 'delinq_2yrs',
 'revol_bal',
 'total_acc',
 'acc_now_delinq',
 'revol_util',
 'open_acc',
 'inq_last_6mths',
 'pub_rec',
 'home_ownership',
 'purpose',
 'verification_status',
 'application_type',
 'initial_list_status',
 'pymnt_plan',
 'sub_grade']

In [27]:
len(total_cols)

22

In [15]:
from sklearn.impute import SimpleImputer
preprocessor = ColumnTransformer(
    transformers=[
        ('nominal', Pipeline([
            ('imputer', SimpleImputer(strategy='most_frequent')),
            ('encoder', OneHotEncoder())
        ]), nominal_columns),
        ('continuous', Pipeline([
            ('imputer', SimpleImputer(strategy='mean')),
            ('scaler', StandardScaler())
        ]), continuous_columns),
        ('ordinal', Pipeline(steps=[
            ('encoder', OrdinalEncoder())
        ]), ordinal_columns),
    ],
    remainder='drop'
)
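
One thing to watch with this preprocessor: by default `OneHotEncoder()` raises an error when `transform` meets a category that was absent during `fit`, which can happen with rare categories after a random train/test split. A minimal sketch of the `handle_unknown="ignore"` option, on made-up data (whether it is needed for the loans data is an assumption that depends on the split):

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Made-up data: "ANY" appears only in the test frame
train = pd.DataFrame({"home_ownership": ["RENT", "OWN", "MORTGAGE"]})
test = pd.DataFrame({"home_ownership": ["RENT", "ANY"]})

# Default behaviour: an unseen category raises at transform time
strict = OneHotEncoder()
strict.fit(train)
try:
    strict.transform(test)
    strict_failed = False
except ValueError:
    strict_failed = True

# handle_unknown="ignore" encodes the unseen category as all zeros
lenient = OneHotEncoder(handle_unknown="ignore")
lenient.fit(train)
encoded = lenient.transform(test).toarray()

print(strict_failed)
print(encoded)
```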

In [16]:
# from sklearn.impute import SimpleImputer
# preprocessor = ColumnTransformer(
#     transformers=[
#         ('nominal', Pipeline([
#             ('imputer', SimpleImputer(strategy='most_frequent')),
#             ('encoder', OneHotEncoder())
#         ]), nominal_columns),
#         ('continuous', Pipeline([
#             ('imputer', SimpleImputer(strategy='mean')),
#             ('scaler', StandardScaler())
#         ]), continuous_columns),
#         ('ordinal', Pipeline(steps=[
#             ('encoder', OrdinalEncoder())
#         ]), ordinal_columns),
#         ('time', Pipeline(steps=[
#             ('imputer', SimpleImputer(strategy='median'))
#         ]), time_columns),
#     ],
#     remainder='drop'
# )

In [17]:
#df.loan_status.unique()

In [18]:
#df[~df['loan_status'].isin(['Current', 'In Grace Period', 'Issued'])].loan_status.unique()

In [19]:
#df_cleaned = df_cleaned[~df_cleaned['loan_status'].isin(['Current', 'In Grace Period', 'Issued'])] #drop rows containing data not needed for model

In [20]:
df_cleaned['binary_loan_status'] = df_cleaned['loan_status'].apply(lambda x: 1 if x in ['Fully Paid'] else 0)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_cleaned['binary_loan_status'] = df_cleaned['loan_status'].apply(lambda x: 1 if x in ['Fully Paid'] else 0)


In [21]:
df_cleaned.drop(columns = 'loan_status', inplace = True)

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_cleaned.drop(columns = 'loan_status', inplace = True)


In [22]:
X = df_cleaned[total_cols]

In [23]:
y = df_cleaned.binary_loan_status

In [24]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 42, shuffle = True, test_size=0.3)

In [25]:
from sklearn.feature_selection import SelectFromModel
from sklearn.metrics import make_scorer
pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    #('feature_selection', SelectFromModel(DecisionTreeClassifier(random_state=42))),
    ('classifier', LogisticRegression(max_iter = 1000))
])

pipeline.fit(X_train, y_train)

In [26]:
pipeline.fit(X_train, y_train)
y_pred_proba = pipeline.predict_proba(X_test)[:,1]

auc_score = roc_auc_score(y_test, y_pred_proba)
print("AUC Score: ", auc_score)

AUC Score:  0.7705692975631879


In [24]:
pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('feature_selection', SelectFromModel(DecisionTreeClassifier(random_state=42))),
    ('classifier', LogisticRegression(max_iter = 1000))
])

pipeline.fit(X_train, y_train)

In [25]:
pipeline.fit(X_train, y_train)
y_pred_proba = pipeline.predict_proba(X_test)[:,1]

auc_score = roc_auc_score(y_test, y_pred_proba)
print("AUC Score: ", auc_score)

AUC Score:  0.7323003108135538


In [102]:
param_grid = {
    'classifier__C': [0.01, 0.1, 1, 10],
    'classifier__penalty': ['l1', 'l2']
}

# Use the built-in 'roc_auc' scorer; make_scorer(..., needs_proba=True) is deprecated in newer scikit-learn releases
grid_search = GridSearchCV(pipeline, param_grid, cv=5, scoring='roc_auc', verbose=1, n_jobs=-1)

grid_search.fit(X_train, y_train)

print("Best parameters found: ", grid_search.best_params_)

y_pred_proba = grid_search.predict_proba(X_test)[:,1]

auc_score = roc_auc_score(y_test, y_pred_proba)
print("AUC Score: ", auc_score)

Fitting 5 folds for each of 8 candidates, totalling 40 fits


20 fits failed out of a total of 40.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
20 fits failed with the following error:
Traceback (most recent call last):
  File "/Users/beautse/opt/anaconda3/lib/python3.9/site-packages/sklearn/model_selection/_validation.py", line 686, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/Users/beautse/opt/anaconda3/lib/python3.9/site-packages/sklearn/pipeline.py", line 405, in fit
    self._final_estimator.fit(Xt, y, **fit_params_last_step)
  File "/Users/beautse/opt/anaconda3/lib/python3.9/site-packages/sklearn/linear_model/_logistic.py", line 1162, in fit
    solver = _check_solver(self.solver, self.penalty, self.dual)
  File "/Users/beautse/opt/anaconda3/lib/python3.9/si

Best parameters found:  {'classifier__C': 10, 'classifier__penalty': 'l2'}
AUC Score:  0.6989960097588532
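
The traceback above is cut off, but a failure inside `_check_solver` is consistent with the default `lbfgs` solver not supporting the `l1` penalty. That would account for exactly half of the fits (4 values of `C` × 5 folds = 20). A minimal sketch on synthetic data, not the loans data, that reproduces the incompatibility and shows a solver that accepts `l1`:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=42)

# The default solver (lbfgs) handles only l2 or no penalty, so l1 raises at fit time
try:
    LogisticRegression(penalty="l1", max_iter=1000).fit(X_demo, y_demo)
    lbfgs_l1_ok = True
except ValueError as err:
    lbfgs_l1_ok = False
    print("lbfgs + l1 failed:", err)

# liblinear (and saga) support the l1 penalty
l1_model = LogisticRegression(penalty="l1", solver="liblinear").fit(X_demo, y_demo)
print("liblinear + l1 fitted, coef shape:", l1_model.coef_.shape)
```

Setting `solver='liblinear'` or `solver='saga'` on the classifier lets the grid cover both penalties without failed fits.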


## Trying Out Models

Here, try each type of machine learning model and run the train/validate loop for it: search for the hyperparameters that give the best cross-validated performance on the training data, then score the tuned model on the held-out test set. `GridSearchCV` is the natural tool for this.
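
As a rough template for that loop, the sketch below tunes a single hyperparameter with `GridSearchCV` and then scores the refit model on held-out data. It runs on synthetic data from `make_classification`; the names `demo_pipeline`, `demo_grid`, and `search` are placeholders, not part of this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the loan features and target
X_demo, y_demo = make_classification(n_samples=500, n_features=8, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42
)

demo_pipeline = Pipeline(steps=[
    ("scaler", StandardScaler()),
    ("classifier", LogisticRegression(solver="liblinear")),
])

# Step names prefix the parameter names: <step>__<parameter>
demo_grid = {"classifier__C": [0.01, 0.1, 1, 10]}

# 5-fold cross-validation on the training split only; the test split stays untouched
search = GridSearchCV(demo_pipeline, demo_grid, cv=5, scoring="roc_auc")
search.fit(X_tr, y_tr)

print("Best parameters:", search.best_params_)
print("Mean CV AUC:", search.best_score_)
print("Test AUC:", search.score(X_te, y_te))
```

`best_score_` is the validation estimate used to pick the hyperparameters, while `search.score` on the test split is the number to report when comparing models.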

### Logistic Regression

In [24]:
# Print all column names from the DataFrame
print("DataFrame columns:", df_cleaned.columns.tolist())

# Print specified column names for each transformer
print("Continuous columns:", continuous_columns)
print("Nominal columns:", nominal_columns)

# Check for any specified columns that are not in the DataFrame
all_specified_columns = set(continuous_columns + nominal_columns)
missing_columns = [col for col in all_specified_columns if col not in df_cleaned.columns]
if missing_columns:
    print("Missing columns in DataFrame:", missing_columns)
else:
    print("All specified columns are present in the DataFrame.")


DataFrame columns: ['loan_amnt', 'funded_amnt', 'funded_amnt_inv', 'int_rate', 'installment', 'emp_length', 'home_ownership', 'annual_inc', 'verification_status', 'pymnt_plan', 'purpose', 'application_type', 'binary_loan_status']
Continuous columns: ['loan_amnt', 'int_rate', 'installment', 'funded_amnt', 'funded_amnt_inv', 'annual_inc']
Nominal columns: ['purpose', 'verification_status', 'emp_length', 'home_ownership', 'application_type', 'pymnt_plan']
All specified columns are present in the DataFrame.


In [25]:
from sklearn.feature_selection import SelectFromModel
from sklearn.metrics import make_scorer

# Define the logistic regression pipeline with feature selection
pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('feature_selection', SelectFromModel(DecisionTreeClassifier(random_state=42))),
    ('classifier', LogisticRegression(solver='liblinear'))
])

# Logistic regression hyperparameters to tune through cross-validation (the feature selection step is not tuned)
param_grid = {
    'classifier__C': [0.001, 0.01, 0.1, 1, 10, 100],
    'classifier__penalty': ['l1', 'l2']
}

# Grid search with cross-validation, optimizing AUC via the built-in 'roc_auc' scorer
# (make_scorer(..., needs_proba=True) is deprecated in newer scikit-learn releases)
grid_search = GridSearchCV(pipeline, param_grid, cv=5, scoring='roc_auc', verbose=1, n_jobs=-1)

# Fit the grid search to the data
grid_search.fit(X_train, y_train)

# Best parameters found
print("Best parameters found: ", grid_search.best_params_)

# Predict probabilities on the test set
y_pred_proba = grid_search.predict_proba(X_test)[:,1]

# Compute AUC score
auc_score = roc_auc_score(y_test, y_pred_proba)
print("AUC Score: ", auc_score)

Fitting 5 folds for each of 12 candidates, totalling 60 fits
Best parameters found:  {'classifier__C': 0.1, 'classifier__penalty': 'l1'}
AUC Score:  0.5800350011247017


### Support Vector Machine

In [None]:
# Import necessary libraries
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import roc_auc_score
from sklearn.pipeline import Pipeline

# Define the pipeline
# The preprocessor was defined in the Preprocessing section above
svm_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', SVC(probability=True, random_state=42))
])

# Parameter grid for GridSearchCV
param_grid = {
    'classifier__C': [0.1, 1, 10],
    'classifier__gamma': ['scale', 'auto'],
    'classifier__kernel': ['rbf', 'linear']
}

# Setup the GridSearchCV
grid_search = GridSearchCV(svm_pipeline, param_grid, cv=5, scoring='roc_auc', verbose=2, n_jobs=-1)

# Fit the model
grid_search.fit(X_train, y_train)

# Best parameters found
print("Best parameters found: ", grid_search.best_params_)

# Predict probabilities on the test set
y_prob = grid_search.predict_proba(X_test)[:, 1]

# Calculate AUC score
auc_score = roc_auc_score(y_test, y_prob)
print(f"The AUC score for the optimized SVM model is: {auc_score:.4f}")


### Decision Trees (Random Forest, Gradient Boosting, XGBoost)

In [None]:
from sklearn.metrics import roc_curve, auc

# Initialize the Decision Tree Classifier
dt_classifier = DecisionTreeClassifier(random_state=42)

# Setup the pipeline for preprocessing and model
pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                           ('classifier', dt_classifier)])

# Parameters to search for the Decision Tree Classifier
param_grid = {
    'classifier__max_depth': [3, 5, 10, None],
    'classifier__min_samples_split': [2, 5, 10],
    'classifier__min_samples_leaf': [1, 2, 5],
    'classifier__criterion': ['gini', 'entropy']
}

# Setup GridSearchCV to find the best parameters using cross-validation
grid_search = GridSearchCV(pipeline, param_grid, cv=5, scoring='roc_auc', verbose=1)

# Fit the model
grid_search.fit(X_train, y_train)

# Print the best parameters
print("Best parameters found: ", grid_search.best_params_)

# Predict probabilities for the test set
y_pred_proba = grid_search.predict_proba(X_test)[:,1]

# Calculate AUC
auc_score = roc_auc_score(y_test, y_pred_proba)
print("AUC Score: ", auc_score)


In [None]:
from sklearn.metrics import roc_auc_score

# Define the pipeline steps
pipeline_steps = [
    ('preprocessor', preprocessor),
    ('classifier', GradientBoostingClassifier(random_state=42))
]

# Create the pipeline
pipeline = Pipeline(steps=pipeline_steps)

# Define the parameter grid for GridSearchCV
param_grid = {
    'classifier__n_estimators': [100, 200, 300],
    'classifier__learning_rate': [0.01, 0.1, 0.2],
    'classifier__max_depth': [3, 4, 5]
}

# Initialize GridSearchCV with the pipeline and parameter grid
grid_search = GridSearchCV(pipeline, param_grid, cv=5, scoring='roc_auc', verbose=1, n_jobs=-1)

# Fit the GridSearchCV to the training data
grid_search.fit(X_train, y_train)

# Print the best parameters and the best score
print("Best parameters found: ", grid_search.best_params_)
print("Best AUC score found: ", grid_search.best_score_)

# Predict on the test set
y_pred_prob = grid_search.predict_proba(X_test)[:, 1]

# Calculate AUC score on the test set
test_auc_score = roc_auc_score(y_test, y_pred_prob)
print("AUC score on the test set: ", test_auc_score)


In [None]:
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV

# Define a pipeline that includes the preprocessing steps and the classifier
model_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', XGBClassifier(use_label_encoder=False, eval_metric='logloss'))
])

# Define the hyperparameter space for the XGBoost model
param_grid = {
    'classifier__n_estimators': [100, 200, 300],
    'classifier__max_depth': [3, 6, 9],
    'classifier__learning_rate': [0.01, 0.1, 0.2],
    'classifier__subsample': [0.8, 0.9, 1],
    'classifier__colsample_bytree': [0.8, 0.9, 1]
}

# Initialize the GridSearchCV object
grid_search = GridSearchCV(model_pipeline, param_grid, cv=5, scoring='roc_auc', n_jobs=-1)

# Fit the model to the training data
grid_search.fit(X_train, y_train)

# Print the best parameters and best score
print("Best parameters found: ", grid_search.best_params_)
print("Best AUC found: ", grid_search.best_score_)

# Evaluate the model on the test set
y_pred_proba = grid_search.predict_proba(X_test)[:,1]
auc_score = roc_auc_score(y_test, y_pred_proba)
print("AUC Score on Test Set: ", auc_score)


### Other Models (e.g. Bagging Classifier)

In [None]:
from sklearn.ensemble import BaggingClassifier
from sklearn.metrics import roc_auc_score

# Define the base estimator
base_estimator = DecisionTreeClassifier(random_state=42)

# Initialize the BaggingClassifier with the Decision Tree as the base estimator
# (`estimator` replaces `base_estimator`, which was deprecated in scikit-learn 1.2 and removed in 1.4)
bagging_clf = BaggingClassifier(estimator=base_estimator, random_state=42)

# Create a pipeline with preprocessing and the classifier
pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                           ('classifier', bagging_clf)])

# Define the parameter grid to search over
param_grid = {
    'classifier__n_estimators': [10, 50, 100],  # Example: trying 10, 50, and 100 trees in the ensemble
    # Add other parameters here if you wish to tune them
}

# Set up the GridSearchCV to find the best number of estimators for the bagging model
grid_search = GridSearchCV(pipeline, param_grid, cv=5, scoring='roc_auc', verbose=2)

# Fit the grid search to the data
grid_search.fit(X_train, y_train)

# Best parameter set found
print("Best parameters found: ", grid_search.best_params_)

# Predict on the test set
y_pred_proba = grid_search.predict_proba(X_test)[:, 1]

# Calculate AUC
auc_score = roc_auc_score(y_test, y_pred_proba)
print(f"AUC Score: {auc_score:.4f}")


## Model Evaluation

Compare the best models' performance on the test data. Which one does the best? Which one does the worst? Why do you think this is the case?
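
One way to line the results up is to collect each model's test AUC into a single sorted table. The sketch below does this on synthetic data with untuned models; in this notebook the dictionary values would instead be the `best_estimator_` of each grid search, and the names `models` and `comparison` are placeholders.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the loan features and target
X_demo, y_demo = make_classification(n_samples=500, n_features=8, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42
)

# Untuned placeholder models, for illustration only
models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(max_depth=5, random_state=42),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
}

rows = []
for name, model in models.items():
    model.fit(X_tr, y_tr)
    proba = model.predict_proba(X_te)[:, 1]  # probability of the positive class
    rows.append({"model": name, "test_auc": roc_auc_score(y_te, proba)})

# Best model first
comparison = (
    pd.DataFrame(rows)
    .sort_values("test_auc", ascending=False)
    .reset_index(drop=True)
)
print(comparison)
```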