# Grade: /100 pts

# Assignment 05: Model Selection & Cross Validation

### Follow These Instructions

Once you are finished, complete the following steps.

1.  Restart your kernel by clicking 'Kernel' > 'Restart & Run All'.

2.  Fix any errors which result from this.

3.  Repeat steps 1. and 2. until your notebook runs without errors.

4.  Submit your completed notebook to OWL by the deadline.



#### In this assignment, we will work with bank loan data. The task is to build a model that uses client information to predict whether a client defaults. The data file is `loan_Data.csv`. The target variable is `loanDefault`, which is either `Fully Paid` or `Charged Off`. The data include information about payment behavior and customer characteristics, such as job and the purpose of the current loan. Variable descriptions are available in `loan_param.xlsx`.



---

### Global Toolbox

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split,StratifiedKFold
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import roc_curve, auc, roc_auc_score


pd.set_option('display.max_columns', 500)
%matplotlib inline
plt.style.use('ggplot')


_____

## Question 1: /10 pts

#### 1.1 Load the data `loan_Data.csv` and display the last 10 rows. How many features and how many observations are there in the dataset?

In [None]:
# Load the data and display

# 2 pt
data = pd.read_csv('loan_Data.csv',on_bad_lines='skip')

df = data['loanAmnt;annualInc;application_type;int_rate;revol_bal;revol_util;dti;emp_length;grade;homeOwnership;installment;job;loanDefault;mortAcc;pub_rec_bankruptcies;purpose;term;Year'].str.split(';', expand=True)
df.columns = ['loanAmnt', 'annualInc', 'application_type', 'int_rate', 'revol_bal', 'revol_util', 'dti', 'emp_length', 'grade', 'homeOwnership', 'installment', 'job', 'loanDefault', 'mortAcc', 'pub_rec_bankruptcies', 'purpose', 'term', 'Year']

df['annualInc'] = df['annualInc'].astype(float)  # keep the raw scale here; the log transform is applied in Q2.2
df['loanAmnt'] = df['loanAmnt'].astype(float)
df['int_rate'] = df['int_rate'].astype(float)
df['revol_bal'] = df['revol_bal'].astype(float)
df['revol_util'] = df['revol_util'].astype(float)
df['dti'] = df['dti'].astype(float)
df['emp_length'] = df['emp_length'].astype(int)
df['installment'] = df['installment'].astype(float)
df['mortAcc'] = df['mortAcc'].astype(int)
df['term'] = df['term'].astype(int)

print(df.tail(10))  # last 10 rows
print(df.shape)     # (number of observations, number of columns)


**Written Answer** [2 pt]: 
There are 982 observations and 18 columns (17 features plus the target `loanDefault`).

#### 1.2 Create a bar graph to visualize the count of `Charged Off` and `Fully Paid`. Calculate the percentage of `Charged off`, which is the percentage of default.

In [None]:
# Plot

df['loanDefault'].value_counts().plot(kind='bar', color=['green', 'red'])
plt.title('Loan Status Counts')
plt.xlabel('Loan Status')
plt.ylabel('Count')
plt.show()


# 2pts

In [None]:
# Calculate the percentage
counts = df['loanDefault'].value_counts()
FullyPaid = counts['Fully Paid']
ChargedOff = counts['Charged Off']
Total = FullyPaid + ChargedOff
percentage = 100 * ChargedOff / Total  # share of Charged Off loans, in percent
print(f'Percentage of Charged Off: {percentage:.2f}%')



**Written Answer** [2 pt]: 81.87%

#### 1.3 Change the values of the column `loanDefault` to 1 if the loan is `Charged Off` and 0 if it is `Fully Paid`. 

In [None]:
# Change the values
df['loanDefault'] = np.where(df['loanDefault'] == 'Charged Off', 1, 0)

# 2 pts

_____________

## Question 2: /16 pts 
Here we are interested in whether the distribution of income differs between clients who defaulted on their loans and those who did not.

#### 2.1 First create the histogram of the annual income `annualInc` for all the clients. Do not forget to label the axes.

In [None]:
# Plot the distribution
plt.hist(df['annualInc'], bins=20)
plt.title('Distribution of Annual Income')
plt.xlabel('Annual Income')
plt.ylabel('Number of Clients')
plt.show()
# 2 pts

What do you notice about the distribution of the annual income variable? What transformation would you suggest for it?

**Written Answer** [2 pts]: 
The values vary over a very wide range, so a log transformation is suggested.

#### 2.2  Apply the transformation (*i.e.*, based on your answer to the previous question) to annual income and plot the histogram of the transformed version. Update (*i.e.*, overwrite) the original entry values of `annualInc` with the transformed ones.

In [None]:
# Apply transformation and plot distribution
plt.hist(np.log(df['annualInc'].astype(float)), bins=20)
plt.title('Distribution of Log Annual Income')
plt.xlabel('log(Annual Income)')
plt.ylabel('Number of Clients')
plt.show()
# 4 pts

In [None]:
# Overwrite
df['annualInc'] = np.log(df['annualInc'].astype(float))

# 2 pts
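
The effect of the log transformation applied above can be illustrated on synthetic data. The sketch below is not based on `loan_Data.csv`: it draws made-up, right-skewed values from a log-normal distribution (an assumption chosen only for illustration) and compares the sample skewness before and after taking the logarithm.

```python
import numpy as np
import pandas as pd

# Synthetic, right-skewed "income" values (log-normal); NOT the assignment data.
rng = np.random.default_rng(0)
income = pd.Series(rng.lognormal(mean=11.0, sigma=0.6, size=5000))

# Apply the same transformation used for annualInc.
log_income = np.log(income)

# A skewness near 0 indicates a roughly symmetric distribution.
skew_raw = income.skew()
skew_log = log_income.skew()
print(f"skewness before log: {skew_raw:.2f}")
print(f"skewness after log:  {skew_log:.2f}")
```

On data of this kind the skewness drops to roughly zero after the transform, which is why the histogram of the transformed variable tends to look more symmetric.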

#### 2.3 Plot the histograms of annual income for clients who defaulted and clients who did not default. Compare to see if there is any noticeable difference. Comment qualitatively.

In [None]:
# Plot two distributions overlaid (use the alpha argument to create contrast in overlay plots)
# Clients who defaulted (loanDefault == 1)
plt.hist(df[df['loanDefault'] == 1]['annualInc'], bins=20, alpha=0.5, label='Defaulted')
# Clients who did not default (loanDefault == 0)
plt.hist(df[df['loanDefault'] == 0]['annualInc'], bins=20, alpha=0.5, label='Did not default')
plt.title('Distribution of Annual Income by Default Status')
plt.xlabel('Annual Income (log-transformed)')
plt.ylabel('Number of Clients')
plt.legend()
plt.show()

# 4 pts

**Written Answer** [2pts]: Clients who defaulted have a higher average income.

___________

## Question 3: /14 pts

Let's build a model and use the annual income to predict the default outcome.

#### 3.1 Create a model pipeline that includes a preprocessing step using `StandardScaler` and a basic logistic regression model (with default penalization, `solver='lbfgs'`, `max_iter=10000` and `random_state=0`)

In [None]:
# Create a model pipeline
pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('logreg', LogisticRegression(solver='lbfgs', max_iter=10000, random_state=0))])
# 2 pt
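
One reason to put the scaler inside the pipeline is that its mean and standard deviation are then learned from the training rows only whenever the pipeline is fitted. The following is a minimal sketch on synthetic data (the arrays `X_demo` and `y_demo` and the name `demo_pipe` are made up for illustration) that checks this.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic one-feature data; the values are made up for illustration.
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=50.0, scale=10.0, size=(200, 1))
y_demo = (X_demo[:, 0] + rng.normal(scale=10.0, size=200) > 50.0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

demo_pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('logreg', LogisticRegression(solver='lbfgs', max_iter=10000, random_state=0)),
])
demo_pipe.fit(X_tr, y_tr)

# The scaler's statistics come from the training rows only.
fitted_mean = demo_pipe.named_steps['scaler'].mean_[0]
print(fitted_mean, X_tr[:, 0].mean())

# At prediction time the test rows are scaled with the stored training statistics.
test_proba = demo_pipe.predict_proba(X_te)
```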

#### 3.2 Use an 80/20 train-test split of the data and remember to set `random_state=0`. Fit the model, then evaluate it by plotting the ROC curve and reporting the AUC value. 

In [None]:
# Get the X and y
X = df[['annualInc']]
y = df['loanDefault']

# Split the train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)


# Fit the model
pipe.fit(X_train, y_train)

# 4 pts

In [None]:
# Use predict_proba to get the probability of default
prob = pipe.predict_proba(X_test)

# 2 pts

# Plot the ROC curve and report AUC
fpr, tpr, thresholds = roc_curve(y_test, prob[:, 1])  # column 1 holds P(default)
auc_value = roc_auc_score(y_test, prob[:, 1])
plt.figure(figsize=(10, 8))
plt.plot(fpr, tpr, color='blue', label='ROC (AUC = {:.2f})'.format(auc_value))
plt.plot([0, 1], [0, 1], color='gray', linestyle='--')
plt.xlabel('False Positive Rate (FPR)')
plt.ylabel('True Positive Rate (TPR)')
plt.title('Receiver Operating Characteristic (ROC) Curve')
plt.legend(loc='lower right')
plt.grid(True)
plt.show()
print(f"Area Under the Curve (AUC): {auc_value:.2f}")

# 4 pts

Is income alone good enough to predict the default outcome?


**Written Answer** [2 pt] No
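
When judging whether an AUC value is "good enough", it can help to recall that the AUC equals the fraction of (positive, negative) pairs in which the positive case receives the higher score, so a value near 0.5 means the ranking is no better than chance. The sketch below uses a tiny set of made-up labels and scores (not model output from this notebook) to check that this pairwise count agrees with `roc_auc_score`.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Made-up labels and scores, for illustration only.
y_true = np.array([0, 0, 1, 0, 1, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8, 0.65, 0.9])

pos = scores[y_true == 1]
neg = scores[y_true == 0]

# Compare every positive score with every negative score; ties count as half.
diff = pos[:, None] - neg[None, :]
pairwise_auc = ((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size

sk_auc = roc_auc_score(y_true, scores)
print(pairwise_auc, sk_auc)
```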

____________

## Question 4: /34 pts

Here, let's use cross-validation to find how each numeric feature performs to predict the default status.


#### 4.1: Let's write our own function instead of using `cross_val_score` to get the cross-validation AUC score. First, create a function `AUC_calculation` with inputs `(model, X, y, index_train, index_test)` which calculates the AUC of the model trained on `index_train` and tested on `index_test`. Here we assume that X and y are pandas objects (a DataFrame and a Series).

In [None]:
def AUC_calculation(model, X, y, index_train, index_test):
    # Define Xtrain, ytrain, Xtest, ytest 
    Xtrain, ytrain = X.iloc[index_train], y.iloc[index_train]
    Xtest, ytest = X.iloc[index_test], y.iloc[index_test]

    # Fit the model
    model.fit(Xtrain, ytrain)
    proba_positive_class = model.predict_proba(Xtest)[:, 1]
    
    # Calculate the auc score
    score_auc = roc_auc_score(ytest, proba_positive_class)
    return score_auc
# 6 pts

#### 4.2: Create a function named `AUC_cross_validation` which has as input (model, X, y, n_fold) and does a `StratifiedKFold` cross validation with n_fold and its output should be a list with the AUC for each fold. This function will call the above function `AUC_calculation`.

In [None]:
def AUC_cross_validation(model, X, y, n_fold):
    # Create the stratified folds
    skf = StratifiedKFold(n_splits=n_fold, shuffle=True, random_state=42)

    # Loop over folds and compute the AUC score for each fold
    
    list_auc = []
    for train_index, test_index in skf.split(X, y):
        auc = AUC_calculation(model, X, y, train_index, test_index)
        list_auc.append(auc)

    return list_auc
# 6 pts 
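
A quick way to sanity-check the two functions above is to compare them with `cross_val_score` using the same splitter and `scoring='roc_auc'`. The sketch below restates both functions in compact form so that it runs on its own, and uses synthetic data from `make_classification` (the names `X_syn`, `y_syn`, `manual` and `reference` are made up for this illustration).

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold, cross_val_score


def AUC_calculation(model, X, y, index_train, index_test):
    # Fit on the training rows and score the held-out rows.
    model.fit(X.iloc[index_train], y.iloc[index_train])
    proba = model.predict_proba(X.iloc[index_test])[:, 1]
    return roc_auc_score(y.iloc[index_test], proba)


def AUC_cross_validation(model, X, y, n_fold):
    skf = StratifiedKFold(n_splits=n_fold, shuffle=True, random_state=42)
    return [AUC_calculation(model, X, y, tr, te) for tr, te in skf.split(X, y)]


# Synthetic binary classification data; NOT the loan data.
X_arr, y_arr = make_classification(n_samples=300, n_features=4, random_state=0)
X_syn = pd.DataFrame(X_arr, columns=[f'f{i}' for i in range(4)])
y_syn = pd.Series(y_arr)

model = LogisticRegression(solver='lbfgs', max_iter=10000, random_state=0)
manual = AUC_cross_validation(model, X_syn, y_syn, n_fold=5)

# Same splitter settings, so the folds are identical.
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
reference = cross_val_score(model, X_syn, y_syn, cv=skf, scoring='roc_auc')

print(np.round(manual, 4))
print(np.round(reference, 4))
```

If the two printed arrays agree fold by fold, the hand-written loop is splitting, fitting and scoring the same way as the library routine.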

#### 4.3 Now we are ready to estimate and compare, through cross-validation, the performance of all the *simple models* that use only one numeric predictor as input. Here we will apply a logarithmic transformation to `loanAmnt` and overwrite the original values. We will also exclude `Year` and `installment`.

In [None]:
# Extract only the numeric features
feau_num = ['int32', 'int64', 'float64']
data_num = pd.DataFrame(df.select_dtypes(include=feau_num))


# Log transform 
data_num['loanAmnt'] = np.log(data_num['loanAmnt']) 


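# Drop the excluded columns, and the target so that it is never used as a predictor
data_num = data_num.drop(columns=['Year', 'installment', 'loanDefault'], errors='ignore') 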



data_num.head()
# 2pts

#### 4.4: Use the function `AUC_cross_validation` and the model from Q3.1 to compute cross-validation estimates of the AUC for each single numeric feature model, and use a pandas dataframe (named `AUC_models`) to store the AUC value for each fold and each of the models (use `n_fold=10`).

The column names of `AUC_models` have to be in the form `simple-[numeric predictor variable]`, *e.g.*, `simple-int_rate`.

In [None]:
# Construct AUC_models dataframe
AUC_models = pd.DataFrame({})

# Run cross-validation for each feature
for feature in data_num.columns:
    auc_scores = AUC_cross_validation(pipe, data_num[[feature]], y, n_fold=10)  # Using each feature for prediction
    AUC_models[f'simple-{feature}'] = auc_scores
print(AUC_models)
# 8 pts

In [None]:
# Print AUC_models dataframe. The shape should be 10 x number of features
print(AUC_models)


#### 4.5: Let's use a `sns.boxplot` (without presenting outliers) to show the distribution of the AUC scores for each feature.

In [None]:
# Hint: use data=pd.melt(AUC_models) in boxplot

# Melt the dataframe for plotting
melted_AUC_models = AUC_models.melt(var_name='Features', value_name='AUC Score')

# Create the boxplot
plt.figure(figsize=(15, 8))
sns.boxplot(x='Features', y='AUC Score', data=melted_AUC_models, showfliers=False)  # showfliers=False to hide outliers
plt.xticks(rotation=45)  # Rotate feature names for better visualization
plt.title('AUC scores for each feature')
plt.tight_layout()
plt.show()

#5 pts
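
The reshaping step used before plotting can be seen on a small example. The sketch below builds a wide table with hypothetical per-fold AUC values (the numbers are made up) and shows how `pd.melt` turns it into the long format with one row per (model, fold) pair, which is the layout the boxplot call above takes through its `x`, `y` and `data` arguments.

```python
import pandas as pd

# Hypothetical per-fold AUC values for two models (made-up numbers).
wide = pd.DataFrame({
    'simple-int_rate': [0.70, 0.68, 0.72],
    'simple-dti': [0.58, 0.60, 0.57],
})

# One row per (model, fold) pair: the column name goes to 'Features',
# the cell value goes to 'AUC Score'.
long = pd.melt(wide, var_name='Features', value_name='AUC Score')
print(long)
```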


What is the feature that yields the best performance?

**Written Answer** [1 pt]: `int_rate`

#### 4.6: Now let's use a model including all the numeric features for training. Again use 10-fold cross-validation to determine if this new model has better performance. Add the results to the previous AUC_models dataframe and visualize again using boxplots.

In [None]:
# Get the X and y
X = data_num
y = df['loanDefault']  # the target is the 0/1 default indicator, not a predictor column

# Calculate the auc scores using cross validation
auc_scores_all_numeric = AUC_cross_validation(pipe, X, y, n_fold=10)

# 3. Include the auc scores in the AUC_models DataFrame in the column 'All_numeric'
AUC_models['All_numeric'] = auc_scores_all_numeric
# Print the new data frame
print(AUC_models)


In [None]:
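# Plot 
# Re-melt so that the new 'All_numeric' column is included in the plot
melted_AUC_models = AUC_models.melt(var_name='Features', value_name='AUC Score')
plt.figure(figsize=(15, 8))
sns.boxplot(x='Features', y='AUC Score', data=melted_AUC_models, showfliers=False)  # showfliers=False to hide outliers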
plt.xticks(rotation=45)  # Rotate feature names for better visualization
plt.title('Distribution of AUC scores for each feature including All Numeric Features')
plt.tight_layout()
plt.show()
# 2 pts

_____________

## Question 5: /10 pts

#### 5.1 Let's also add the categorical variable `grade` to the model (in addition to all the numeric features). And, again, add the results to the `AUC_models` dataframe.

In [None]:
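# Convert category into numerical values (one-hot encoding; the first level is dropped as the baseline)
grade_dummies = pd.get_dummies(df['grade'], prefix='grade', drop_first=True, dtype=float)

# Add this feature to all the numeric variables
X_grade = pd.concat([data_num, grade_dummies], axis=1)

# Calculate the auc scores using cross validation
auc_scores_grade = AUC_cross_validation(pipe, X_grade, df['loanDefault'], n_fold=10)

# Include the auc scores in the AUC_models DataFrame in the column 'All_numeric_&_Grade'
AUC_models['All_numeric_&_Grade'] = auc_scores_grade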

# 4 pts
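
There is more than one reasonable way to turn `grade` into numbers. The sketch below contrasts two common options on a few made-up grade values: one-hot encoding, which makes no assumption about order, and an ordinal mapping, which assumes the grades run from `A` to `G` in a meaningful order (that ordering is an assumption here and should be checked against `loan_param.xlsx`).

```python
import pandas as pd

# A few made-up grade values, for illustration only.
grades = pd.Series(['A', 'C', 'B', 'A', 'G'], name='grade')

# Option 1: one-hot encoding, one 0/1 column per grade present in the data.
one_hot = pd.get_dummies(grades, prefix='grade', dtype=float)

# Option 2: ordinal encoding, ASSUMING the grades are ordered from A to G.
order = {g: i for i, g in enumerate('ABCDEFG')}
ordinal = grades.map(order)

print(one_hot)
print(ordinal.tolist())
```

One-hot encoding lets the logistic regression learn a separate coefficient per grade, while the ordinal version uses a single coefficient and therefore imposes a linear effect across grades.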

#### 5.2 Print the AUC mean and AUC standard deviation for each of the models. Which model would you choose and why?

In [None]:
mean_scores = AUC_models.mean()
std_scores = AUC_models.std()
summary = pd.DataFrame({
    'AUC Mean': mean_scores,
    'AUC Std Dev': std_scores
})
print(summary)

# 6 pts

**Written Answer** [2 pts]: AUC. Since AUC has a 

______________

## Question 6: /14 pts
Train and test the model you selected using an 80/20 train-test split of the data.
- Use the bootstrap technique, without refitting the model, to obtain a confidence interval for the test AUC measure
- Plot the distribution of the bootstrap AUC scores


In [None]:
# Train test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)


#Fit the model
pipe.fit(X_train, y_train)
#Calculate the predictions on the Test data
y_prob = pipe.predict_proba(X_test)[:, 1]


# Create the AUC
test_auc = roc_auc_score(y_test, y_prob)  # renamed so the imported sklearn `auc` function is not shadowed
print(f'Original AUC: {test_auc}')
# 2 pt

In [None]:
# Bootstrap
rng = np.random.default_rng(0)  # fixed seed so the interval is reproducible
bootstrap_aucs = []
n_bootstrap = 1000

for _ in range(n_bootstrap):
    # Sample with replacement from y_test and y_prob
    indices = rng.choice(len(y_test), len(y_test), replace=True)
    y_test_sampled = y_test.iloc[indices].values
    y_prob_sampled = y_prob[indices]
    bootstrap_aucs.append(roc_auc_score(y_test_sampled, y_prob_sampled))

# 4 pts

In [None]:
# Plot
plt.figure(figsize=(10, 6))
sns.histplot(bootstrap_aucs, kde=True, bins=30)
plt.title('Bootstrap Distribution of AUC Scores')
plt.xlabel('AUC')
plt.ylabel('Frequency')
plt.show()
# 2 pts

In [None]:
# Find the confidence interval
alpha = 0.05
ci_min = np.percentile(bootstrap_aucs, 100 * alpha / 2.)
ci_max = np.percentile(bootstrap_aucs, 100 * (1 - alpha / 2.))

print(f'The CI for the AUC of the model is: {(ci_min, ci_max)}')

# 6 pts
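
The percentile bootstrap used above can also be tried end to end on synthetic data. In the sketch below, `y_true` and `y_score` are made-up stand-ins for `y_test` and `y_prob`; the model is never refitted, only the (label, score) pairs are resampled. A guard skips any resample that happens to contain a single class, since the AUC is undefined in that case.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

# Synthetic test labels and scores standing in for y_test and y_prob.
y_true = rng.integers(0, 2, size=200)
y_score = np.clip(0.3 * y_true + rng.normal(0.4, 0.2, size=200), 0, 1)

boot = []
for _ in range(1000):
    # Resample row positions with replacement.
    idx = rng.integers(0, len(y_true), size=len(y_true))
    if np.unique(y_true[idx]).size < 2:
        continue  # AUC is undefined when a resample has only one class
    boot.append(roc_auc_score(y_true[idx], y_score[idx]))

# 95% percentile interval from the bootstrap distribution.
ci_low, ci_high = np.percentile(boot, [2.5, 97.5])
point = roc_auc_score(y_true, y_score)
print(f"AUC = {point:.3f}, 95% CI = ({ci_low:.3f}, {ci_high:.3f})")
```

The width of the interval reflects only the sampling variability of the test set, not the variability that would come from retraining the model on a different training sample.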