<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Import-and-Load-Data" data-toc-modified-id="Import-and-Load-Data-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Import and Load Data</a></span></li><li><span><a href="#Preprocessing" data-toc-modified-id="Preprocessing-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Preprocessing</a></span></li><li><span><a href="#Trying-Out-Models" data-toc-modified-id="Trying-Out-Models-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Trying Out Models</a></span><ul class="toc-item"><li><span><a href="#Logistic-Regression" data-toc-modified-id="Logistic-Regression-3.1"><span class="toc-item-num">3.1&nbsp;&nbsp;</span>Logistic Regression</a></span></li><li><span><a href="#Support-Vector-Machine" data-toc-modified-id="Support-Vector-Machine-3.2"><span class="toc-item-num">3.2&nbsp;&nbsp;</span>Support Vector Machine</a></span></li><li><span><a href="#Decision-Trees-(Random-Forest,-Gradient-Boosting,-XGBoost)" data-toc-modified-id="Decision-Trees-(Random-Forest,-Gradient-Boosting,-XGBoost)-3.3"><span class="toc-item-num">3.3&nbsp;&nbsp;</span>Decision Trees (Random Forest, Gradient Boosting, XGBoost)</a></span></li><li><span><a href="#Other-Models-(e.g.-Bagging-Classifier)" data-toc-modified-id="Other-Models-(e.g.-Bagging-Classifier)-3.4"><span class="toc-item-num">3.4&nbsp;&nbsp;</span>Other Models (e.g. Bagging Classifier)</a></span></li></ul></li><li><span><a href="#Model-Evaluation" data-toc-modified-id="Model-Evaluation-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Model Evaluation</a></span></li></ul></div>

## Import and Load Data

In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler, OneHotEncoder, OrdinalEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import Lasso, ElasticNet
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from xgboost import XGBClassifier
from sklearn.impute import SimpleImputer
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score

In [None]:
from google.colab import drive
# Mount the Drive root (a mount point must be a directory path without spaces);
# loans.csv then sits under /content/drive/Shareddrives/OMIS 116 Team Drive/Homework/
drive.mount('/content/drive')

In [None]:
df = pd.read_csv("loans.csv")


## Preprocessing

 - Handle missing values
 - Encode categorical variables, scale data (if you wish), feature selection, etc.
 - Split the dataset into features (X) and target variable (y)
 - Split into training and testing sets
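
The missing-value step depends on how `DataFrame.dropna` reads `thresh`: it is the minimum number of non-missing values a column needs in order to be kept, not the share of missing values. A minimal sketch on a made-up toy frame (the column names `a`, `b`, `c` and the 75% cutoff are hypothetical and not taken from the loans data):

```python
import numpy as np
import pandas as pd

# Toy frame: 'a' is complete, 'b' is 50% missing, 'c' is entirely missing
toy = pd.DataFrame({
    'a': [1, 2, 3, 4],
    'b': [1.0, np.nan, 3.0, np.nan],
    'c': [np.nan] * 4,
})

# Share of missing values per column, useful for choosing a cutoff
missing_share = toy.isna().mean()

# Keep only columns with at least 75% non-missing values (3 of 4 rows here)
min_non_null = int(len(toy) * 0.75)
kept = toy.dropna(axis=1, thresh=min_non_null)
kept_columns = kept.columns.tolist()

print(missing_share)
print(kept_columns)
```

Inspecting `missing_share` before dropping anything makes it easier to see which columns a given cutoff would remove.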

In [None]:
threshold = int(len(df) * 0.10)  # a column must have at least 10% non-missing values to be kept

In [None]:
df_cleaned = df.dropna(axis = 1, thresh = threshold) # drop columns that have more than 90% missing values

In [None]:
df_cleaned = df_cleaned.drop(axis = 1, columns =['id', 'member_id','emp_title', 'url', 'Unnamed: 0', 'title', 'zip_code','addr_state', 'policy_code','desc', 'next_pymnt_d',
                                                 'issue_d', 'out_prncp', 'out_prncp_inv', 'total_pymnt',
                                                 'total_pymnt_inv', 'total_rec_prncp', 'total_rec_int','total_rec_late_fee','recoveries', 'collection_recovery_fee',
                                                 'last_pymnt_d', 'last_pymnt_amnt', 'last_credit_pull_d', 'collections_12_mths_ex_med'])

In [None]:
df_cleaned['term'] = df_cleaned['term'].replace({' 36 months': 36, ' 60 months': 60}) # map the term strings to integers; assign back instead of using inplace on a column selection


In [None]:
earliest_cr_line = pd.to_datetime(df_cleaned['earliest_cr_line'])  # parse the dates once and reuse the result
df_cleaned['earliest_cr_line_year'] = earliest_cr_line.dt.year     # extract the year
df_cleaned['earliest_cr_line_month'] = earliest_cr_line.dt.month   # extract the month
df_cleaned.drop(columns = 'earliest_cr_line', inplace = True)


In [None]:
continous_columns = [
    'loan_amnt', 'int_rate', 'installment', 'term','funded_amnt','funded_amnt_inv',
    'annual_inc', 'dti', 'delinq_2yrs', 'inq_last_6mths', 'open_acc', 'pub_rec',
    'revol_bal', 'total_acc', 'acc_now_delinq', 'tot_coll_amt', 'tot_cur_bal', 'total_rev_hi_lim', 'revol_util',
    'mths_since_last_delinq', 'mths_since_last_major_derog','mths_since_last_record']
nominal_columns = ['purpose', 'verification_status','emp_length','home_ownership', 'application_type', 'initial_list_status', 'pymnt_plan']
time_columns = ['earliest_cr_line_year' , 'earliest_cr_line_month']
ordinal_columns = ['grade', 'sub_grade']

In [None]:
from sklearn.impute import SimpleImputer

# Adjusted imputation strategies
# For continuous columns
continuous_imputer = SimpleImputer(strategy='median')

# For nominal columns (categorical data without a specific order)
nominal_imputer = SimpleImputer(strategy='most_frequent')

# For ordinal columns (categorical data with a specific order)
ordinal_imputer = SimpleImputer(strategy='most_frequent')
time_imputer = SimpleImputer(strategy='median')

# Updated preprocessor to include imputation for all feature types
preprocessor = ColumnTransformer(
    transformers=[
        ('nominal', Pipeline(steps=[
            ('imputer', nominal_imputer),
            ('encoder', OneHotEncoder(handle_unknown='ignore'))
        ]), nominal_columns),

        ('continuous', Pipeline(steps=[
            ('imputer', continuous_imputer),
            ('scaler', StandardScaler())
        ]), continous_columns),

        ('ordinal', Pipeline(steps=[
            ('imputer', ordinal_imputer),
            ('encoder', OrdinalEncoder())
        ]), ordinal_columns),
        ('time', Pipeline(steps=[
            ('imputer', time_imputer)
        ]), time_columns),
    ],
    remainder='passthrough'  # Include any other column that doesn't fit into the above categories without transformation
)


In [None]:
df_cleaned = df_cleaned[~df_cleaned['loan_status'].isin(['Current', 'In Grace Period', 'Issued'])] #drop rows containing data not needed for model

In [None]:
# 1 = fully paid; 0 = every other remaining status (charged off, late, default), so charged-off loans are no longer labeled as paid
df_cleaned['binary_loan_status'] = df_cleaned['loan_status'].isin(['Fully Paid', 'Does not meet the credit policy. Status:Fully Paid']).astype(int)
df_cleaned.drop(columns = 'loan_status', inplace = True)

In [None]:
df_cleaned  # loan_status was dropped above, so display the full frame instead

| | loan_amnt | funded_amnt | funded_amnt_inv | term | int_rate | installment | grade | sub_grade | emp_length | home_ownership | ... | initial_list_status | mths_since_last_major_derog | application_type | acc_now_delinq | tot_coll_amt | tot_cur_bal | total_rev_hi_lim | earliest_cr_line_year | earliest_cr_line_month | binary_loan_status |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 5000 | 5000 | 4975.0 | 36 | 10.65 | 162.87 | B | B2 | 10+ years | RENT | ... | f | NaN | INDIVIDUAL | 0.0 | NaN | NaN | NaN | 1985.0 | 1.0 | 1 |
| 1 | 2500 | 2500 | 2500.0 | 60 | 15.27 | 59.83 | C | C4 | < 1 year | RENT | ... | f | NaN | INDIVIDUAL | 0.0 | NaN | NaN | NaN | 1999.0 | 4.0 | 0 |
| 2 | 2400 | 2400 | 2400.0 | 36 | 15.96 | 84.33 | C | C5 | 10+ years | RENT | ... | f | NaN | INDIVIDUAL | 0.0 | NaN | NaN | NaN | 2001.0 | 11.0 | 1 |
| 3 | 10000 | 10000 | 10000.0 | 36 | 13.49 | 339.31 | C | C1 | 10+ years | RENT | ... | f | NaN | INDIVIDUAL | 0.0 | NaN | NaN | NaN | 1996.0 | 2.0 | 1 |
| 5 | 5000 | 5000 | 5000.0 | 36 | 7.90 | 156.46 | A | A4 | 3 years | RENT | ... | f | NaN | INDIVIDUAL | 0.0 | NaN | NaN | NaN | 2004.0 | 11.0 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 887351 | 4200 | 4200 | 4200.0 | 36 | 15.99 | 147.64 | D | D2 | 10+ years | MORTGAGE | ... | f | 38.0 | INDIVIDUAL | 0.0 | 0.0 | 207975.0 | 20400.0 | 1990.0 | 8.0 | 0 |
| 887364 | 10775 | 10775 | 10775.0 | 36 | 6.03 | 327.95 | A | A1 | < 1 year | RENT | ... | w | 28.0 | INDIVIDUAL | 0.0 | 0.0 | 24696.0 | 41700.0 | 1975.0 | 11.0 | 1 |
| 887366 | 6225 | 6225 | 6225.0 | 36 | 16.49 | 220.37 | D | D3 | 2 years | RENT | ... | f | NaN | INDIVIDUAL | 0.0 | 0.0 | 8357.0 | 1800.0 | 2011.0 | 2.0 | 1 |
| 887369 | 4000 | 4000 | 4000.0 | 36 | 8.67 | 126.59 | B | B1 | 10+ years | MORTGAGE | ... | f | NaN | INDIVIDUAL | 0.0 | 0.0 | 18979.0 | 30100.0 | 2002.0 | 9.0 | 1 |


In [None]:
X= df_cleaned.drop(columns = 'binary_loan_status')

In [None]:
y = df_cleaned.binary_loan_status

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 42, shuffle = True, test_size=0.3)

## Trying Out Models

Here, try each type of machine learning model and run the train-validate-test loop: identify the hyperparameters that perform best under cross-validation on the training set, then check the tuned model on the test set. GridSearchCV is likely relevant.
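
Each model in this section follows the same loop: wrap the estimator in a pipeline, search a parameter grid with cross-validation on the training split, then score the refit best model once on the held-out test split. A minimal sketch of that loop, assuming synthetic data from `make_classification` in place of the loans data and a hypothetical helper named `tune_and_score` (neither is part of the assignment):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


def tune_and_score(pipeline, param_grid, X_tr, y_tr, X_te, y_te, cv=3):
    """Grid-search on the training split, then report AUC on the test split."""
    search = GridSearchCV(pipeline, param_grid, cv=cv, scoring='roc_auc')
    search.fit(X_tr, y_tr)  # cross-validation only ever sees training data
    test_proba = search.predict_proba(X_te)[:, 1]  # uses the refit best estimator
    return search.best_params_, search.best_score_, roc_auc_score(y_te, test_proba)


# Synthetic stand-in for the loans data (assumption: numeric features, binary target)
X_demo, y_demo = make_classification(n_samples=400, n_features=8, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=42)

demo_pipeline = Pipeline(steps=[
    ('scaler', StandardScaler()),
    ('classifier', LogisticRegression(max_iter=1000)),
])
demo_grid = {'classifier__C': [0.01, 1, 100]}

best_params, cv_auc, test_auc = tune_and_score(
    demo_pipeline, demo_grid, X_tr, y_tr, X_te, y_te
)
print(best_params, round(cv_auc, 3), round(test_auc, 3))
```

Keeping the cross-validated AUC and the test AUC side by side shows how much of the validation performance carries over to unseen data.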

### Logistic Regression

In [None]:
# Print all column names from the DataFrame
print("DataFrame columns:", df_cleaned.columns.tolist())

# Print specified column names for each transformer
print("Continuous columns:", continous_columns)
print("Nominal columns:", nominal_columns)

# Check for any specified columns (including ordinal and time columns) that are not in the DataFrame
all_specified_columns = set(continous_columns + nominal_columns + ordinal_columns + time_columns)
missing_columns = [col for col in all_specified_columns if col not in df_cleaned.columns]
if missing_columns:
    print("Missing columns in DataFrame:", missing_columns)
else:
    print("All specified columns are present in the DataFrame.")


In [None]:
from sklearn.metrics import roc_auc_score, make_scorer
from sklearn.model_selection import GridSearchCV

# Define the logistic regression pipeline
pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                           ('classifier', LogisticRegression(solver='liblinear'))])

# Parameters of the logistic regression to be tuned through cross-validation
param_grid = {
    'classifier__C': [0.001, 0.01, 0.1, 1, 10, 100],
    'classifier__penalty': ['l1', 'l2']
}

# Optimize hyperparameters on ROC AUC using the built-in 'roc_auc' scorer
# (make_scorer's needs_proba argument is deprecated in recent scikit-learn releases)

# Grid search with cross-validation
grid_search = GridSearchCV(pipeline, param_grid, cv=5, scoring='roc_auc', verbose=1)

# Fit the grid search to the data
grid_search.fit(X_train, y_train)

# Best parameters found
print("Best parameters found: ", grid_search.best_params_)

# Predict probabilities on the test set
y_pred_proba = grid_search.predict_proba(X_test)[:,1]

# Compute AUC score
auc_score = roc_auc_score(y_test, y_pred_proba)
print("AUC Score: ", auc_score)

### Support Vector Machine

In [None]:
# Import necessary libraries
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import roc_auc_score
from sklearn.pipeline import Pipeline

# Define the pipeline
# Reuses the preprocessor defined in the Preprocessing section
svm_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', SVC(probability=True, random_state=42))
])

# Parameter grid for GridSearchCV
param_grid = {
    'classifier__C': [0.1, 1, 10],
    'classifier__gamma': ['scale', 'auto'],
    'classifier__kernel': ['rbf', 'linear']
}

# Setup the GridSearchCV
grid_search = GridSearchCV(svm_pipeline, param_grid, cv=5, scoring='roc_auc', verbose=2, n_jobs=-1)

# Fit the model
grid_search.fit(X_train, y_train)

# Best parameters found
print("Best parameters found: ", grid_search.best_params_)

# Predict probabilities on the test set
y_prob = grid_search.predict_proba(X_test)[:, 1]

# Calculate AUC score
auc_score = roc_auc_score(y_test, y_prob)
print(f"The AUC score for the optimized SVM model is: {auc_score:.4f}")


### Decision Trees (Random Forest, Gradient Boosting, XGBoost)

In [None]:
from sklearn.metrics import roc_curve, auc

# Initialize the Decision Tree Classifier
dt_classifier = DecisionTreeClassifier(random_state=42)

# Setup the pipeline for preprocessing and model
pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                           ('classifier', dt_classifier)])

# Parameters to search for the Decision Tree Classifier
param_grid = {
    'classifier__max_depth': [3, 5, 10, None],
    'classifier__min_samples_split': [2, 5, 10],
    'classifier__min_samples_leaf': [1, 2, 5],
    'classifier__criterion': ['gini', 'entropy']
}

# Setup GridSearchCV to find the best parameters using cross-validation
grid_search = GridSearchCV(pipeline, param_grid, cv=5, scoring='roc_auc', verbose=1)

# Fit the model
grid_search.fit(X_train, y_train)

# Print the best parameters
print("Best parameters found: ", grid_search.best_params_)

# Predict probabilities for the test set
y_pred_proba = grid_search.predict_proba(X_test)[:,1]

# Calculate AUC
auc_score = roc_auc_score(y_test, y_pred_proba)
print("AUC Score: ", auc_score)


In [None]:
from sklearn.metrics import roc_auc_score

# Define the pipeline steps
pipeline_steps = [
    ('preprocessor', preprocessor),
    ('classifier', GradientBoostingClassifier(random_state=42))
]

# Create the pipeline
pipeline = Pipeline(steps=pipeline_steps)

# Define the parameter grid for GridSearchCV
param_grid = {
    'classifier__n_estimators': [100, 200, 300],
    'classifier__learning_rate': [0.01, 0.1, 0.2],
    'classifier__max_depth': [3, 4, 5]
}

# Initialize GridSearchCV with the pipeline and parameter grid
grid_search = GridSearchCV(pipeline, param_grid, cv=5, scoring='roc_auc', verbose=1, n_jobs=-1)

# Fit the GridSearchCV to the training data
grid_search.fit(X_train, y_train)

# Print the best parameters and the best score
print("Best parameters found: ", grid_search.best_params_)
print("Best AUC score found: ", grid_search.best_score_)

# Predict on the test set
y_pred_prob = grid_search.predict_proba(X_test)[:, 1]

# Calculate AUC score on the test set
test_auc_score = roc_auc_score(y_test, y_pred_prob)
print("AUC score on the test set: ", test_auc_score)


In [None]:
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV

# Define a pipeline that includes the preprocessing steps and the classifier
model_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', XGBClassifier(eval_metric='logloss', random_state=42))  # use_label_encoder is deprecated and no longer needed
])

# Define the hyperparameter space for the XGBoost model
param_grid = {
    'classifier__n_estimators': [100, 200, 300],
    'classifier__max_depth': [3, 6, 9],
    'classifier__learning_rate': [0.01, 0.1, 0.2],
    'classifier__subsample': [0.8, 0.9, 1],
    'classifier__colsample_bytree': [0.8, 0.9, 1]
}

# Initialize the GridSearchCV object
grid_search = GridSearchCV(model_pipeline, param_grid, cv=5, scoring='roc_auc', n_jobs=-1)

# Fit the model to the training data
grid_search.fit(X_train, y_train)

# Print the best parameters and best score
print("Best parameters found: ", grid_search.best_params_)
print("Best AUC found: ", grid_search.best_score_)

# Evaluate the model on the test set
y_pred_proba = grid_search.predict_proba(X_test)[:,1]
auc_score = roc_auc_score(y_test, y_pred_proba)
print("AUC Score on Test Set: ", auc_score)


### Other Models (e.g. Bagging Classifier)

In [None]:
from sklearn.ensemble import BaggingClassifier
from sklearn.metrics import roc_auc_score

# Define the base estimator
base_estimator = DecisionTreeClassifier(random_state=42)

# Initialize the BaggingClassifier with the Decision Tree as the base estimator
bagging_clf = BaggingClassifier(estimator=base_estimator, random_state=42)  # 'base_estimator' was renamed to 'estimator' in scikit-learn 1.2

# Create a pipeline with preprocessing and the classifier
pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                           ('classifier', bagging_clf)])

# Define the parameter grid to search over
param_grid = {
    'classifier__n_estimators': [10, 50, 100],  # Example: trying 10, 50, and 100 trees in the ensemble
    # Add other parameters here if you wish to tune them
}

# Set up GridSearchCV to find the best parameters for the bagging model
grid_search = GridSearchCV(pipeline, param_grid, cv=5, scoring='roc_auc', verbose=2)

# Fit the grid search to the data
grid_search.fit(X_train, y_train)

# Best parameter set found
print("Best parameters found: ", grid_search.best_params_)

# Predict on the test set
y_pred_proba = grid_search.predict_proba(X_test)[:, 1]

# Calculate AUC
auc_score = roc_auc_score(y_test, y_pred_proba)
print(f"AUC Score: {auc_score:.4f}")


## Model Evaluation

Compare the best models' performance on the test data. Which one performs best, and which one worst? Why do you think this is the case?
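
One way to organize this comparison is to score every fitted model on the same test split and sort the results. The sketch below uses synthetic data and two stand-in models; the helper `compare_models` and the dictionary keys are hypothetical. In this notebook each tuned search would first need to be stored under its own name, since the cells above all reuse `grid_search`.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier


def compare_models(fitted_models, X_te, y_te):
    """Return one row of test metrics per fitted model, highest AUC first."""
    rows = []
    for name, model in fitted_models.items():
        proba = model.predict_proba(X_te)[:, 1]  # probability of the positive class
        pred = model.predict(X_te)               # hard labels at the default cutoff
        rows.append({
            'model': name,
            'auc': roc_auc_score(y_te, proba),
            'accuracy': accuracy_score(y_te, pred),
            'f1': f1_score(y_te, pred),
        })
    table = pd.DataFrame(rows).sort_values('auc', ascending=False)
    return table.reset_index(drop=True)


# Synthetic stand-in for the loans data (assumption: numeric features, binary target)
X_demo, y_demo = make_classification(n_samples=400, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=0)

fitted = {
    'logistic_regression': LogisticRegression(max_iter=1000).fit(X_tr, y_tr),
    'decision_tree': DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_tr, y_tr),
}
results = compare_models(fitted, X_te, y_te)
print(results)
```

A fitted `GridSearchCV` object exposes `predict` and `predict_proba` through its best estimator, so tuned searches can be passed in the dictionary directly.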