<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Import-and-Load-Data" data-toc-modified-id="Import-and-Load-Data-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Import and Load Data</a></span></li><li><span><a href="#Preprocessing" data-toc-modified-id="Preprocessing-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Preprocessing</a></span></li><li><span><a href="#Trying-Out-Models" data-toc-modified-id="Trying-Out-Models-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Trying Out Models</a></span><ul class="toc-item"><li><span><a href="#Logistic-Regression" data-toc-modified-id="Logistic-Regression-3.1"><span class="toc-item-num">3.1&nbsp;&nbsp;</span>Logistic Regression</a></span></li><li><span><a href="#Support-Vector-Machine" data-toc-modified-id="Support-Vector-Machine-3.2"><span class="toc-item-num">3.2&nbsp;&nbsp;</span>Support Vector Machine</a></span></li><li><span><a href="#Decision-Trees-(Random-Forest,-Gradient-Boosting,-XGBoost)" data-toc-modified-id="Decision-Trees-(Random-Forest,-Gradient-Boosting,-XGBoost)-3.3"><span class="toc-item-num">3.3&nbsp;&nbsp;</span>Decision Trees (Random Forest, Gradient Boosting, XGBoost)</a></span></li><li><span><a href="#Other-Models-(e.g.-Bagging-Classifier)" data-toc-modified-id="Other-Models-(e.g.-Bagging-Classifier)-3.4"><span class="toc-item-num">3.4&nbsp;&nbsp;</span>Other Models (e.g. Bagging Classifier)</a></span></li></ul></li><li><span><a href="#Model-Evaluation" data-toc-modified-id="Model-Evaluation-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Model Evaluation</a></span></li></ul></div>

## Import and Load Data

In [43]:
#from google.colab import drive
#drive.mount('/content/drive', force_remount=True)

In [44]:
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

In [45]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import Lasso, ElasticNet
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
#from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score

In [None]:
# Load data
df = pd.read_csv("loans.csv", index_col=0)
df.head()
df.shape

In [None]:
# Examine all columns
df.columns

In [None]:
df = df[df.loan_status != 'Current'] # Drops loans with loan status 'Current'
df.loan_status.value_counts()
df.info()

In [None]:
# Map 'Fully Paid' and 'Does not meet the credit policy. Status:Fully Paid' to 1, and the other listed statuses to 0
df['loan_status'] = df['loan_status'].replace({
    'Fully Paid': 1,
    'Charged Off': 0,
    'Late (31-120 days)': 0,
    'Late (16-30 days)': 0,
    'Default': 0,
    'Does not meet the credit policy. Status:Charged Off': 0,
    'Does not meet the credit policy. Status:Fully Paid': 1,
})

# Drop 'Issued' and 'In Grace Period' because these statuses do not tell us whether a loan is 'good' or 'bad'
df = df[~df['loan_status'].isin(['Issued', 'In Grace Period'])]
df.loan_status.value_counts()

In [None]:
# Reformat 'term' column to int (raw string avoids an invalid-escape warning; expand=False returns a Series)
df['term'] = df['term'].str.extract(r'(\d+)', expand=False).astype(int)
# Examine term variable after reformatting
df.term.value_counts()

In [None]:
# Convert issue_d to datetime for ML
df['issue_d'] = pd.to_datetime(df['issue_d'], format='%b-%Y')

# Separate into year and month features
df['issue_year'] = df['issue_d'].dt.year
df['issue_month'] = df['issue_d'].dt.month

df.drop(columns=['issue_d'], inplace=True)

In [None]:
df.issue_month.head()

# Use cyclical (sine/cosine) encoding instead of one-hot encoding for month, so the model can see that December (12) is adjacent to January (1)
df['issue_month_sin'] = np.sin(2 * np.pi * df['issue_month'] / 12)
df['issue_month_cos'] = np.cos(2 * np.pi * df['issue_month'] / 12)

df.drop(columns=['issue_month'], inplace=True)
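
To see why the sine/cosine pair helps, the sketch below encodes a few months and compares distances between them. The helper names `encode_month` and `month_distance` are hypothetical and not part of this notebook; the point is only that December and January end up as close as any other pair of neighbouring months.

```python
import numpy as np

def encode_month(month):
    """Map a month number (1-12) to a point on the unit circle."""
    angle = 2 * np.pi * month / 12
    return np.array([np.sin(angle), np.cos(angle)])

def month_distance(a, b):
    """Euclidean distance between two encoded months."""
    return float(np.linalg.norm(encode_month(a) - encode_month(b)))

dec_jan = month_distance(12, 1)  # neighbours across the year boundary
jan_feb = month_distance(1, 2)   # neighbours within the year
jan_jul = month_distance(1, 7)   # half a year apart

print(dec_jan, jan_feb, jan_jul)
```

With the raw integer encoding, December and January would be 11 units apart; on the circle they are as close as January and February.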


In [None]:
# Examine emp_title and emp_length variables
df.emp_title.dropna().value_counts()
df.emp_length.dropna().value_counts()
df.emp_title.isna().sum()
df.emp_length.isna().sum()

In [None]:
df.application_type.value_counts()
# We can safely drop this
df = df.drop('application_type',axis=1)

In [None]:
# Examining more features
df.title.value_counts()
df.earliest_cr_line.value_counts()
df.last_pymnt_d.value_counts()
df.last_credit_pull_d.value_counts()

In [None]:
df.head()

In [None]:
# Dropping more columns that likely will not be strong predictors
df = df.drop(['emp_title','zip_code','title','emp_length','url','id','member_id'],axis=1)

In [None]:
import pandas as pd
from sklearn.impute import SimpleImputer

df1 = df.copy()

# Drop all columns with any missing values
df1 = df1.dropna(axis=1, how='any')

# Check remaining shape
print("df1 shape after dropping all columns with missing values:", df1.shape)


df2 = df.copy()

# Drop columns with more than 50% missing values
threshold = 0.5
df2 = df2.dropna(axis=1, thresh=int((1-threshold) * len(df2)))

# Separate numerical and categorical columns
numerical_columns = df2.select_dtypes(include=['number']).columns
categorical_columns = df2.select_dtypes(include=['object']).columns

# Impute missing values for numerical columns with mean
num_imputer = SimpleImputer(strategy='mean')
df2[numerical_columns] = num_imputer.fit_transform(df2[numerical_columns]).copy()

# Impute missing values for categorical columns with most frequent value
cat_imputer = SimpleImputer(strategy='most_frequent')
df2[categorical_columns] = cat_imputer.fit_transform(df2[categorical_columns]).copy()

# Check final shape
print("df2 shape after partial drop + imputation:", df2.shape)


In [None]:
df2['earliest_cr_line'] = pd.to_datetime(df2['earliest_cr_line'], format='%b-%Y')

df2['earliest_cr_line_year'] = df2['earliest_cr_line'].dt.year
df2['earliest_cr_line_month'] = df2['earliest_cr_line'].dt.month

df2.drop(columns=['earliest_cr_line'], inplace=True)

In [None]:
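def parse_dates(date):
    """Parse 'Oct-2015' or '2015' style strings; return NaT if neither format matches."""
    try:
        return pd.to_datetime(date, format='%b-%Y')  # Try parsing as 'Oct-2015'
    except (ValueError, TypeError):
        try:
            return pd.to_datetime(date, format='%Y')  # Try parsing as '2015'
        except (ValueError, TypeError):
            return pd.NaT  # Assign NaT if parsing fails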

df2['last_pymnt_d'] = df2['last_pymnt_d'].astype(str).apply(parse_dates)



In [None]:
df2['last_pymnt_d_year'] = df2['last_pymnt_d'].dt.year
df2['last_pymnt_d_month'] = df2['last_pymnt_d'].dt.month

df2.drop(columns=['last_pymnt_d'], inplace=True)

In [None]:
df2['last_credit_pull_d'] = pd.to_datetime(df2['last_credit_pull_d'], format='%b-%Y')

df2['last_credit_pull_d_year'] = df2['last_credit_pull_d'].dt.year
df2['last_credit_pull_d_month'] = df2['last_credit_pull_d'].dt.month

df2.drop(columns=['last_credit_pull_d'], inplace=True)

In [None]:
df2['earliest_cr_line_month_sin'] = np.sin(2 * np.pi * df2['earliest_cr_line_month'] / 12)
df2['earliest_cr_line_month_cos'] = np.cos(2 * np.pi * df2['earliest_cr_line_month'] / 12)

df2.drop(columns=['earliest_cr_line_month'], inplace=True)

df2['last_pymnt_d_sin'] = np.sin(2 * np.pi * df2['last_pymnt_d_month'] / 12)
df2['last_pymnt_d_cos'] = np.cos(2 * np.pi * df2['last_pymnt_d_month'] / 12)

df2.drop(columns=['last_pymnt_d_month'], inplace=True)

df2['last_credit_pull_d_sin'] = np.sin(2 * np.pi * df2['last_credit_pull_d_month'] / 12)
df2['last_credit_pull_d_cos'] = np.cos(2 * np.pi * df2['last_credit_pull_d_month'] / 12)

df2.drop(columns=['last_credit_pull_d_month'], inplace=True)

In [None]:
from sklearn.preprocessing import OneHotEncoder

df1_categorical = ['term','grade','sub_grade','home_ownership', 'verification_status','pymnt_plan','purpose', 'addr_state','initial_list_status']

df2_categorical = df1_categorical 

encoder = OneHotEncoder(handle_unknown='ignore', sparse_output=False)

encoded_cats_df2 = encoder.fit_transform(df2[df2_categorical])

# Convert to DataFrame
encoded_df2 = pd.DataFrame(encoded_cats_df2, columns=encoder.get_feature_names_out(df2_categorical))

# Drop original categorical columns and concatenate encoded ones
df2 = df2.drop(columns=df2_categorical).reset_index(drop=True)
df2 = pd.concat([df2, encoded_df2], axis=1)

print("df2 shape after one-hot encoding:", df2.shape)

encoded_cats_df1 = encoder.fit_transform(df1[df1_categorical])

# Convert to DataFrame
encoded_df1 = pd.DataFrame(encoded_cats_df1, columns=encoder.get_feature_names_out(df1_categorical))

# Drop original categorical columns and concatenate encoded ones
df1 = df1.drop(columns=df1_categorical).reset_index(drop=True)
df1 = pd.concat([df1, encoded_df1], axis=1)
print("df1 shape after one-hot encoding:", df1.shape)

In [None]:
numerical_columns = ['loan_amnt',
 'funded_amnt',
 'funded_amnt_inv',
 'term',
 'int_rate',
 'installment',
 'annual_inc',
 'dti',
 'delinq_2yrs',
 'inq_last_6mths',
 'mths_since_last_delinq',
 'mths_since_last_record',
 'open_acc',
 'pub_rec',
 'revol_bal',
 'revol_util',
 'total_acc',
 'out_prncp',
 'out_prncp_inv',
 'total_pymnt',
 'total_pymnt_inv',
 'total_rec_prncp',
 'total_rec_int',
 'total_rec_late_fee',
 'recoveries',
 'collection_recovery_fee',
 'last_pymnt_amnt',
 'collections_12_mths_ex_med',
 'mths_since_last_major_derog',
 'policy_code',
 'annual_inc_joint',
 'dti_joint',
 'acc_now_delinq',
 'tot_coll_amt',
 'tot_cur_bal',
 'open_acc_6m',
 'open_il_6m',
 'open_il_12m',
 'open_il_24m',
 'mths_since_rcnt_il',
 'total_bal_il',
 'il_util',
 'open_rv_12m',
 'open_rv_24m',
 'max_bal_bc',
 'all_util',
 'total_rev_hi_lim',
 'inq_fi',
 'total_cu_tl',
 'inq_last_12m', 'issue_year', 'last_credit_pull_d_year', 'last_pymnt_d_year','earliest_cr_line_year']


In [None]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
common_numerical = df1.columns.intersection(numerical_columns)
common_numerical2 = df2.columns.intersection(numerical_columns)
df1[common_numerical] = scaler.fit_transform(df1[common_numerical])
df2[common_numerical2] = scaler.fit_transform(df2[common_numerical2])


In [None]:
df1.loan_status.value_counts()
df2.loan_status.value_counts()

The value counts above show that both datasets are imbalanced. To address this, we can use SMOTE (Synthetic Minority Over-sampling Technique), which generates synthetic examples of the minority class.

In [None]:
df1['loan_status'] = df1['loan_status'].astype(int)
df2['loan_status'] = df2['loan_status'].astype(int)

In [None]:
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import train_test_split

# Define target variable
target = 'loan_status'


X1 = df1.drop(columns=[target])
y1 = df1[target]

X2 = df2.drop(columns=[target])
y2 = df2[target]

# Apply SMOTE separately to df1 and df2
smote = SMOTE(sampling_strategy='auto', random_state=42)

X1_resampled, y1_resampled = smote.fit_resample(X1, y1)
X2_resampled, y2_resampled = smote.fit_resample(X2, y2)

# Convert back to DataFrame
df1_balanced = pd.DataFrame(X1_resampled, columns=X1.columns)
df1_balanced[target] = y1_resampled

df2_balanced = pd.DataFrame(X2_resampled, columns=X2.columns)
df2_balanced[target] = y2_resampled

In [None]:
df1_balanced.loan_status.value_counts()
df2_balanced.loan_status.value_counts()

In [None]:
X1 = df1_balanced.drop(columns=[target])
y1 = df1_balanced[target]

X2 = df2_balanced.drop(columns=[target])
y2 = df2_balanced[target]

In [None]:
X_train,X_test,y_train,y_test = train_test_split(X1,y1,test_size=0.3,random_state=42,stratify=y1)
X_train2,X_test2,y_train2,y_test2 = train_test_split(X2,y2,test_size=0.3,random_state=42,stratify=y2)
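
The cells above oversample before splitting, so synthetic rows derived from what later becomes test data can end up in the training set. A common alternative is to split first and resample only the training portion. The sketch below illustrates that ordering on synthetic data. It uses plain random oversampling via `sklearn.utils.resample` as a stand-in for SMOTE, and every name in it (`X_demo`, `y_demo`, `train_balanced`, and so on) is hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Synthetic imbalanced data (assumption: stands in for the loan features)
rng = np.random.default_rng(42)
X_demo = pd.DataFrame(rng.normal(size=(1000, 3)), columns=["f1", "f2", "f3"])
y_demo = pd.Series((rng.random(1000) < 0.8).astype(int), name="loan_status")

# 1) Split first, so the test set holds only real, untouched rows
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo
)

# 2) Oversample the minority class inside the training set only
train = pd.concat([X_tr, y_tr], axis=1)
minority_label = train["loan_status"].value_counts().idxmin()
minority = train[train["loan_status"] == minority_label]
majority = train[train["loan_status"] != minority_label]
minority_up = resample(
    minority, replace=True, n_samples=len(majority), random_state=42
)
train_balanced = pd.concat([majority, minority_up])

X_tr_bal = train_balanced.drop(columns=["loan_status"])
y_tr_bal = train_balanced["loan_status"]

print(y_tr_bal.value_counts())  # balanced training classes
print(y_te.value_counts())      # test set keeps the original imbalance
```

With imbalanced-learn, step 2 would instead be a call to `SMOTE(...).fit_resample(X_tr, y_tr)`. The test set then reflects the class balance the model would face in practice.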

In [None]:
df2_sample = df2_balanced.sample(n=50000, random_state=42)

X = df2_sample.drop(columns=['loan_status'])
y = df2_sample['loan_status']

corr_matrix = pd.DataFrame(X).corr().abs()
upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))
to_drop = [column for column in upper.columns if any(upper[column] > 0.9)]
X = pd.DataFrame(X).drop(columns=to_drop)
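
The correlation filter above can be checked on a toy example. The sketch below builds a small frame with one nearly duplicated column and applies the same upper-triangle logic; the frame `toy` and its column names are made up for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({
    "a": a,
    "a_copy": a * 2.0 + 0.01 * rng.normal(size=200),  # nearly collinear with "a"
    "b": rng.normal(size=200),                         # independent of "a"
})

corr = toy.corr().abs()
# Keep only the strict upper triangle so each pair is inspected once
upper = corr.where(np.triu(np.ones(corr.shape), k=1).astype(bool))
# A column is dropped if it correlates above 0.9 with any earlier column
dropped = [col for col in upper.columns if (upper[col] > 0.9).any()]
reduced = toy.drop(columns=dropped)

print(dropped)
print(list(reduced.columns))
```

Because only the upper triangle is examined, the later column of each highly correlated pair is the one removed, so the result depends on column order.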

In [None]:
lasso = LogisticRegression(penalty='l1', solver='saga', max_iter=500)
param_dist = {'C': np.logspace(-4, 4, 10)}  # Log-distributed values

In [None]:
from sklearn.model_selection import RandomizedSearchCV

random_search = RandomizedSearchCV(lasso, param_distributions=param_dist,
                                   n_iter=5, scoring='roc_auc', cv=3, n_jobs=-1, random_state=42)
random_search.fit(X, y)

best_lasso = random_search.best_estimator_
lasso_coefficients = best_lasso.coef_.flatten()
selected_features = X.columns[lasso_coefficients != 0]

print(f"Selected {len(selected_features)} features using Lasso:")
print(selected_features)
print(f"Best Regularization Parameter (C): {random_search.best_params_['C']}")

In [None]:
feature_importance = pd.DataFrame({'Feature': X.columns, 'Coefficient': lasso_coefficients})

# Select only non-zero coefficients
feature_importance = feature_importance[feature_importance['Coefficient'] != 0]

# Sort features by absolute coefficient value in descending order
feature_importance = feature_importance.sort_values(by='Coefficient', key=abs, ascending=False)


feature_importance.head(20)


## Preprocessing

 - Handle missing values
 - Encode categorical variables, scale data (if you wish), feature selection, etc.
 - Split the dataset into features (X) and target variable (y)
 - Split into training and testing sets

### Feature selection

1. Loan Amount (loan_amnt)
2. Annual Income (annual_inc)
3. Debt-to-Income Ratio (dti)
4. Loan Term (term)
5. Employment Length (emp_length)
6. Grade and Subgrade (grade, sub_grade): credit grading, FICO score
7. Interest Rate (int_rate)
8. Purpose of Loan (purpose)? C
9. Home Ownership Status (home_ownership)? C


normalizer - shubhaan

feature selection - lasso, decisiontrees/randomforests - generate feature importance plot

Creating X and Y and train_test_split and LASSO - shubhaan
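
For the tree-based feature importance mentioned in the notes, a minimal sketch on synthetic data might look like the following. The data comes from `make_classification`, and the feature names `feat_0` ... `feat_7` are placeholders, not columns of the loan dataset.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data: with shuffle=False the first 3 columns are the informative ones
X_arr, y_arr = make_classification(
    n_samples=500, n_features=8, n_informative=3, n_redundant=0,
    shuffle=False, random_state=42,
)
feature_names = [f"feat_{i}" for i in range(X_arr.shape[1])]
X_feat = pd.DataFrame(X_arr, columns=feature_names)

forest = RandomForestClassifier(n_estimators=200, random_state=42)
forest.fit(X_feat, y_arr)

# Impurity-based importances, one value per feature, summing to 1
importances = pd.Series(
    forest.feature_importances_, index=feature_names
).sort_values(ascending=False)
print(importances)
# importances.plot.barh() would draw the bar chart if matplotlib is installed
```

The ranking can then be compared with the non-zero coefficients from the L1-penalized logistic regression above.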

## Trying Out Models

Here, you want to try each type of machine learning model and run the train-validate loop: search for the hyperparameters that let the model perform well on both the training and validation data. GridSearchCV is likely relevant.
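
As a rough illustration of that loop, the sketch below tunes one hyperparameter of a logistic regression with `GridSearchCV` on synthetic data. The grid values, the pipeline step names (`scaler`, `clf`), and the variables `X_gs`/`y_gs` are assumptions chosen for the example, not settings taken from this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the prepared loan features
X_gs, y_gs = make_classification(n_samples=600, n_features=10, random_state=42)
X_gs_train, X_gs_test, y_gs_train, y_gs_test = train_test_split(
    X_gs, y_gs, test_size=0.3, random_state=42, stratify=y_gs
)

# Scaling lives inside the pipeline so it is refit on each CV training fold
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])
param_grid = {"clf__C": [0.01, 0.1, 1.0, 10.0]}

grid = GridSearchCV(pipe, param_grid, scoring="roc_auc", cv=3)
grid.fit(X_gs_train, y_gs_train)

print(grid.best_params_)
print(round(grid.best_score_, 3))
```

The same pattern applies to the other model families by swapping the `clf` step and its parameter grid.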

### Logistic Regression

### Support Vector Machine

### Decision Trees (Random Forest, Gradient Boosting, XGBoost)

### Other Models (e.g. Bagging Classifier)

## Model Evaluation

Compare the best models' performance on the test data. Which one performs best? Which one performs worst? Why do you think this is the case?
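
One way to organise that comparison is a small results table with one row per model. The sketch below does this for two untuned models on synthetic data. The model choices and names (`models`, `results`, `X_ev`, `y_ev`) are illustrative assumptions; in the notebook the tuned estimators and the real test split would take their place.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the prepared loan features
X_ev, y_ev = make_classification(n_samples=600, n_features=10, random_state=0)
X_ev_train, X_ev_test, y_ev_train, y_ev_test = train_test_split(
    X_ev, y_ev, test_size=0.3, random_state=0, stratify=y_ev
)

models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

rows = []
for name, model in models.items():
    model.fit(X_ev_train, y_ev_train)
    pred = model.predict(X_ev_test)               # hard labels for accuracy / F1
    proba = model.predict_proba(X_ev_test)[:, 1]  # class-1 scores for ROC AUC
    rows.append({
        "model": name,
        "accuracy": accuracy_score(y_ev_test, pred),
        "f1": f1_score(y_ev_test, pred),
        "roc_auc": roc_auc_score(y_ev_test, proba),
    })

results = pd.DataFrame(rows).set_index("model").sort_values("roc_auc", ascending=False)
print(results)
```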