<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Import-and-Load-Data" data-toc-modified-id="Import-and-Load-Data-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Import and Load Data</a></span></li><li><span><a href="#Preprocessing" data-toc-modified-id="Preprocessing-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Preprocessing</a></span></li><li><span><a href="#Trying-Out-Models" data-toc-modified-id="Trying-Out-Models-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Trying Out Models</a></span><ul class="toc-item"><li><span><a href="#Logistic-Regression" data-toc-modified-id="Logistic-Regression-3.1"><span class="toc-item-num">3.1&nbsp;&nbsp;</span>Logistic Regression</a></span></li><li><span><a href="#Support-Vector-Machine" data-toc-modified-id="Support-Vector-Machine-3.2"><span class="toc-item-num">3.2&nbsp;&nbsp;</span>Support Vector Machine</a></span></li><li><span><a href="#Decision-Trees-(Random-Forest,-Gradient-Boosting,-XGBoost)" data-toc-modified-id="Decision-Trees-(Random-Forest,-Gradient-Boosting,-XGBoost)-3.3"><span class="toc-item-num">3.3&nbsp;&nbsp;</span>Decision Trees (Random Forest, Gradient Boosting, XGBoost)</a></span></li><li><span><a href="#Other-Models-(e.g.-Bagging-Classifier)" data-toc-modified-id="Other-Models-(e.g.-Bagging-Classifier)-3.4"><span class="toc-item-num">3.4&nbsp;&nbsp;</span>Other Models (e.g. Bagging Classifier)</a></span></li></ul></li><li><span><a href="#Model-Evaluation" data-toc-modified-id="Model-Evaluation-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Model Evaluation</a></span></li></ul></div>

## Import and Load Data

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler, OneHotEncoder, OrdinalEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import Lasso, ElasticNet
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from xgboost import XGBClassifier
from sklearn.impute import SimpleImputer
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score

In [2]:
#import pandas as pd
#from google.colab import drive
#import zipfile

# Mount Google Drive
#drive.mount('/content/drive')

# Extract the ZIP file containing the CSV
#zip_ref = zipfile.ZipFile("/content/drive/MyDrive/OMIS116 (1)/loans.csv.zip", 'r')
#zip_ref.extractall("/content/dataset")
#zip_ref.close()

# Assuming the CSV file is named loans.csv and is directly inside the zip without any folder structure
#csv_file_path = "/content/dataset/loans.csv"

# Load the CSV file into a pandas DataFrame
#df = pd.read_csv(csv_file_path)
df = pd.read_csv('loans.csv')



## Preprocessing

 - Handle missing values
 - Encode categorical variables, scale data (if you wish), feature selection, etc.
 - Split the dataset into features (X) and target variable (y)
 - Split into training and testing sets

In [3]:
df.drop(['id','member_id','emp_title','title','zip_code','url'],axis=1,inplace=True)

In [4]:
# Raw string so that \d is not treated as an (invalid) string escape
df['emp_length'] = df['emp_length'].str.extract(r'(\d+)').astype(float)

In [6]:
df_cleaned = df.copy()

In [7]:
threshold = len(df) * 0.10 # minimum number of non-missing values a column must have (10% of rows)

In [8]:
df_cleaned = df.dropna(axis = 1, thresh = threshold) # drop columns that have more than 90% missing values
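
The `thresh` argument counts *non-missing* values, which is easy to read backwards. The following is a minimal sketch on a made-up frame (the `demo` data and column names are hypothetical, not taken from `loans.csv`) showing which columns survive a 10% threshold. The threshold is cast to `int` here only to keep the toy example unambiguous.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: 10 rows, columns with different amounts of missing data
demo = pd.DataFrame({
    "full": range(10),                  # 10 non-missing values
    "sparse": [1.0] + [np.nan] * 9,     # 1 non-missing value (90% missing)
    "empty": [np.nan] * 10,             # 0 non-missing values (100% missing)
})

# A column is kept only if it has at least `demo_threshold` non-missing values
demo_threshold = int(len(demo) * 0.10)  # 1
kept = demo.dropna(axis=1, thresh=demo_threshold)
kept_columns = list(kept.columns)
print(kept_columns)
```

Only the column with no observed values at all is dropped here; a column that is exactly 90% missing still meets the threshold.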

In [10]:
continuous_columns = ['loan_amnt', 'installment', 'funded_amnt','funded_amnt_inv', 'annual_inc', 'dti', \
                      'revol_bal', 'revol_util', 'total_rev_hi_lim','total_acc',\
                      'int_rate', 'pub_rec', 'delinq_2yrs','inq_last_6mths','open_acc','acc_now_delinq', 'emp_length'
                     ]
nominal_columns = ['home_ownership', 'pymnt_plan', 'term', 'application_type', 'initial_list_status', 'purpose',  'verification_status',\
                    'sub_grade', 'addr_state']
ordinal_columns = []
time_columns = ['earliest_cr_line']

In [12]:
total_cols = nominal_columns + continuous_columns

In [13]:
from sklearn.impute import SimpleImputer
preprocessor = ColumnTransformer(
    transformers=[
        ('nominal', Pipeline([
            ('imputer', SimpleImputer(strategy='most_frequent')),
            ('encoder', OneHotEncoder(handle_unknown = 'ignore'))
        ]), nominal_columns),
        ('continuous', Pipeline([
            ('imputer', SimpleImputer(strategy='mean')),
            ('scaler', StandardScaler())
        ]), continuous_columns),
    ],
    remainder='drop'
)

In [15]:
desired_statuses = ['Fully Paid', 'Default', 'Charged Off']

# Take an explicit copy so the assignments below modify df_cleaned itself
# rather than a slice of df (this avoids pandas' SettingWithCopyWarning)
df_cleaned = df[df['loan_status'].isin(desired_statuses)].copy()

In [17]:
# Target: 1 = 'Fully Paid', 0 = 'Default' or 'Charged Off'
df_cleaned['binary_loan_status'] = (df_cleaned['loan_status'] == 'Fully Paid').astype(int)

In [18]:
df_cleaned = df_cleaned.drop(columns = 'loan_status')


In [19]:
X = df_cleaned[total_cols]

In [20]:
y = df_cleaned.binary_loan_status

In [21]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 42, shuffle = True, test_size=0.3)

## Trying Out Models

### Logistic Regression

In [26]:
pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', LogisticRegression())
])

pipeline.fit(X_train, y_train)
y_pred_proba = pipeline.predict_proba(X_test)[:,1]

auc_score = roc_auc_score(y_test, y_pred_proba)
print("AUC Score: ", auc_score)

AUC Score:  0.7047060488150438


In [None]:
#advanced logistic regression code

from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV

# liblinear supports both the l1 and l2 penalties searched below
pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                           ('classifier', LogisticRegression(solver='liblinear'))])

# Parameters of the logistic regression to be tuned through cross-validation
param_grid = {
    'classifier__C': [0.1, 1, 10, 100, 1000, 10000],
    'classifier__penalty': ['l1', 'l2']
}

# Grid search with cross-validation, optimizing the hyperparameters for AUC.
# The built-in 'roc_auc' scorer scores predicted probabilities, so no custom scorer is needed.
grid_search = GridSearchCV(pipeline, param_grid, cv=5, scoring='roc_auc', verbose=1)

grid_search.fit(X_train, y_train)

# Best parameters found
print("Best parameters found: ", grid_search.best_params_)

# Predict probabilities on the test set
y_pred_proba = grid_search.predict_proba(X_test)[:,1]

# Compute AUC score
auc_score = roc_auc_score(y_test, y_pred_proba)
print("AUC Score: ", auc_score)

### Support Vector Machine

### Decision Trees (Random Forest, Gradient Boosting, XGBoost)

In [None]:
#Decision tree simple
from sklearn.metrics import roc_curve, auc

# Initialize the Decision Tree Classifier
dt_classifier = DecisionTreeClassifier(random_state=42, max_depth = 5)

# Setup the pipeline for preprocessing and model
pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                           ('classifier', dt_classifier)])

pipeline.fit(X_train, y_train)
y_pred_proba = pipeline.predict_proba(X_test)[:,1]

auc_score = roc_auc_score(y_test, y_pred_proba)
print("AUC Score: ", auc_score)

In [None]:
#Decision tree advanced

from sklearn.metrics import roc_curve, auc

# Initialize the Decision Tree Classifier
dt_classifier = DecisionTreeClassifier(random_state=42)

# Setup the pipeline for preprocessing and model
pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                           ('classifier', dt_classifier)])

# Parameters to search for the Decision Tree Classifier
param_grid = {
    'classifier__max_depth': [10, 20, 30],
    'classifier__min_samples_split': [2, 10, 50],
    'classifier__min_samples_leaf': [1, 5, 10]
}


# Setup GridSearchCV to find the best parameters using cross-validation
grid_search = GridSearchCV(pipeline, param_grid, cv=5, scoring='roc_auc', verbose=1)

# Fit the model
grid_search.fit(X_train, y_train)

# Print the best parameters
print("Best parameters found: ", grid_search.best_params_)

# Predict probabilities for the test set
y_pred_proba = grid_search.predict_proba(X_test)[:,1]

# Calculate AUC
auc_score = roc_auc_score(y_test, y_pred_proba)
print("AUC Score: ", auc_score)


In [None]:
#random forest classifier sample
# Initialize the Random Forest Classifier
rf_classifier = RandomForestClassifier(random_state=42)

# Setup the pipeline for preprocessing and model
pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                           ('classifier', rf_classifier)])

pipeline.fit(X_train, y_train)
y_pred_proba = pipeline.predict_proba(X_test)[:,1]

auc_score = roc_auc_score(y_test, y_pred_proba)
print("AUC Score: ", auc_score)

In [None]:
#random forest classifier model

from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import roc_auc_score

# Initialize the Random Forest Classifier
rf_classifier = RandomForestClassifier(random_state=42)

# Setup the pipeline for preprocessing and model
pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                           ('classifier', rf_classifier)])

# Parameters to search for the Random Forest Classifier
param_grid = {
    'classifier__n_estimators': [100, 200, 300],  # Number of trees in the forest
    'classifier__max_depth': [10, 20, 30],  # Maximum depth of the tree
    'classifier__min_samples_split': [2, 10, 50],  # Minimum number of samples required to split an internal node
    'classifier__min_samples_leaf': [1, 5, 10]  # Minimum number of samples required to be at a leaf node
}

# Setup GridSearchCV to find the best parameters using cross-validation
grid_search = GridSearchCV(pipeline, param_grid, cv=5, scoring='roc_auc', verbose=1)

# Assuming X_train, y_train, X_test, and y_test are already defined
# Fit the model
grid_search.fit(X_train, y_train)

# Print the best parameters
print("Best parameters found: ", grid_search.best_params_)

# Predict probabilities for the test set
y_pred_proba = grid_search.predict_proba(X_test)[:, 1]

# Calculate AUC
auc_score = roc_auc_score(y_test, y_pred_proba)
print("AUC Score: ", auc_score)


In [None]:
#gradient boosting simple
from sklearn.metrics import roc_auc_score

# Define the pipeline steps
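pipeline_steps = [
    ('preprocessor', preprocessor),
    ('classifier', GradientBoostingClassifier(random_state=42, n_iter_no_change=10))
]

# Create the pipeline; without this, the fit below would reuse the pipeline from an earlier cell
pipeline = Pipeline(steps=pipeline_steps)
pipeline.fit(X_train, y_train)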
y_pred_proba = pipeline.predict_proba(X_test)[:,1]

auc_score = roc_auc_score(y_test, y_pred_proba)
print("AUC Score: ", auc_score)

In [None]:
#gradient boosting
from sklearn.metrics import roc_auc_score

# Define the pipeline steps
pipeline_steps = [
    ('preprocessor', preprocessor),
    ('classifier', GradientBoostingClassifier(random_state=42, n_iter_no_change=10))
]

# Create the pipeline
pipeline = Pipeline(steps=pipeline_steps)

# Define the parameter grid for GridSearchCV
param_grid = {
    'classifier__n_estimators': [100, 200, 300],
    'classifier__learning_rate': [0.01, 0.1, 0.2],
    'classifier__max_depth': [3, 4, 5]
}

# Initialize GridSearchCV with the pipeline and parameter grid
grid_search = GridSearchCV(pipeline, param_grid, cv=5, scoring='roc_auc', verbose=1, n_jobs=-1)

# Fit the GridSearchCV to the training data
grid_search.fit(X_train, y_train)

# Print the best parameters and the best score
print("Best parameters found: ", grid_search.best_params_)
print("Best AUC score found: ", grid_search.best_score_)

# Predict on the test set
y_pred_prob = grid_search.predict_proba(X_test)[:, 1]

# Calculate AUC score on the test set
test_auc_score = roc_auc_score(y_test, y_pred_prob)
print("AUC score on the test set: ", test_auc_score)


In [None]:
#XGBoost simple
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV

# Define a pipeline that includes the preprocessing steps and the classifier
model_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', XGBClassifier(use_label_encoder=False, eval_metric='logloss'))
])
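# Fit and score the XGBoost pipeline defined above (not the `pipeline` left over from an earlier cell)
model_pipeline.fit(X_train, y_train)
y_pred_proba = model_pipeline.predict_proba(X_test)[:,1]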

auc_score = roc_auc_score(y_test, y_pred_proba)
print("AUC Score: ", auc_score)

In [None]:
#XGBoost
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV

# Define a pipeline that includes the preprocessing steps and the classifier
model_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', XGBClassifier(use_label_encoder=False, eval_metric='logloss'))
])

# Define the hyperparameter space for the XGBoost model
param_grid = {
    'classifier__n_estimators': [100, 200],
    'classifier__max_depth': [3, 6],
    'classifier__learning_rate': [0.05, 0.1]
}

# Initialize the GridSearchCV object
grid_search = GridSearchCV(model_pipeline, param_grid, cv=5, scoring='roc_auc', n_jobs=-1)

# Fit the model to the training data
grid_search.fit(X_train, y_train)

# Print the best parameters and best score
print("Best parameters found: ", grid_search.best_params_)
print("Best AUC found: ", grid_search.best_score_)

# Evaluate the model on the test set
y_pred_proba = grid_search.predict_proba(X_test)[:,1]
auc_score = roc_auc_score(y_test, y_pred_proba)
print("AUC Score on Test Set: ", auc_score)

### Other Models (e.g. Bagging Classifier)

In [None]:
#bagging simple
from sklearn.ensemble import BaggingClassifier

dtree = DecisionTreeClassifier()

bag_clf = BaggingClassifier(dtree,
                           n_estimators = 500,
                           max_samples = 100,
                           n_jobs = -1, random_state = 42)
pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                           ('classifier', bag_clf)])
pipeline.fit(X_train, y_train)
y_pred_proba = pipeline.predict_proba(X_test)[:,1]

auc_score = roc_auc_score(y_test, y_pred_proba)
print("AUC Score: ", auc_score)

In [None]:
from sklearn.ensemble import BaggingClassifier
from sklearn.metrics import roc_auc_score

# Define the base estimator
base_estimator = DecisionTreeClassifier(random_state=42, max_depth = 5)

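# Initialize the BaggingClassifier with the Decision Tree as the base estimator.
# The estimator is passed positionally because the keyword was renamed from
# `base_estimator` to `estimator` in newer scikit-learn releases.
bagging_clf = BaggingClassifier(base_estimator, random_state=42)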

# Create a pipeline with preprocessing and the classifier
pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                           ('classifier', bagging_clf)])

# Define the parameter grid to search over.
# Every key needs the 'classifier__' prefix so GridSearchCV routes it to the pipeline's classifier step.
param_grid = {
    'classifier__n_estimators': [10, 50, 100],  # Example: trying 10, 50, and 100 trees in the ensemble
    'classifier__max_samples': [0.5, 0.75, 1.0],
    'classifier__max_features': [0.5, 0.75, 1.0],
    'classifier__bootstrap': [True, False],
    'classifier__bootstrap_features': [False, True]
    # Add other parameters here if you wish to tune them
}

# Set up the GridSearchCV to find the best parameters for both the model and preprocessing
grid_search = GridSearchCV(pipeline, param_grid, cv=5, scoring='roc_auc', verbose=2)

# Fit the grid search to the data
grid_search.fit(X_train, y_train)

# Best parameter set found
print("Best parameters found: ", grid_search.best_params_)

# Predict on the test set
y_pred_proba = grid_search.predict_proba(X_test)[:, 1]

# Calculate AUC
auc_score = roc_auc_score(y_test, y_pred_proba)
print(f"AUC Score: {auc_score:.4f}")


## Model Evaluation

Compare the best models' performance on the test data. Which one performs best, and which one worst? Why do you think this is the case?
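
One way to organize this comparison is to collect each model's test AUC in a single table and sort it. The block below is only a sketch under stated assumptions: it uses synthetic data from `make_classification` rather than the loan dataset, a hypothetical `candidates` dictionary with three untuned models, and separate variable names (`Xd_train`, `yd_test`, and so on) so it does not overwrite the notebook's own split. In the notebook itself, the fitted pipelines or `grid_search.best_estimator_` objects would take the place of these models.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the loan data (assumption: NOT the real dataset)
X_demo, y_demo = make_classification(n_samples=1000, n_features=10, random_state=42)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo
)

# Hypothetical set of models to compare
candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "decision_tree": DecisionTreeClassifier(max_depth=5, random_state=42),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=42),
}

# Fit each model on the training split and record its AUC on the held-out split
rows = []
for name, model in candidates.items():
    model.fit(Xd_train, yd_train)
    proba = model.predict_proba(Xd_test)[:, 1]
    rows.append({"model": name, "test_auc": roc_auc_score(yd_test, proba)})

# Best model first
comparison = (
    pd.DataFrame(rows)
    .sort_values("test_auc", ascending=False)
    .reset_index(drop=True)
)
print(comparison)
```

Scoring every model on the same held-out split with the same metric keeps the ranking comparable across models. The test set should be used only for this final comparison, not for choosing hyperparameters.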