<h1 style="color:red;">Credit rating assignment</h1>
<p></p>
In this assignment, we'll work our way through a simple ML exercise. Machine learning is an iterative process that starts with feature engineering (making the features ready for ML), works it way through various models and hyperparameter tuning exercises, until we find a model that seems to work well for us. 

<h3 style="color:green;">The problem: Rating creditworthiness of loan applicants</h3>

When banks issue loans to individuals, they have two goals that conflict with each other:
<ol>
    <li>Give as many loans as possible (fees, interest, all add to revenue)</li>
    <li>Try not to give loans to individuals who won't pay it back (lose money on the loan, collection costs, etc.)</li>
</ol>
    
<li>A typical machine learning program in this space tries to find a suitable tradeoff between finding many good loans and not calling a bad loan good</li>

<li>In this assignment, we'll try to build a "good" model that finds a good tradeoff between these two objectives</li>

<li>In machine learning terms, the proportion of times we get our guess right (i.e., we call a bad loan a bad loan and a good loan a good loan divided by the total number of cases) is called <span style="color:blue">accuracy</span></li>

<li>The proportion of actual good loans that we identify as good loans is known as <span style="color:blue">recall</span></li>

<li>The probability that if a loan is called good it actually is good is called <span style="color:blue">precision</span></li>

<li>The precision recall tradeoff is measured through a score called <span style="color:blue">f1 score</span></li>

<li>An important part of running an ML model is trying to figure out "which metric is right for you"</li>


    
    
<ol>
    <li>We'll try the SGD classifier, tune hyperparameters using grid search, and examine the results</li>
    <li>then, set up the data for a random forest classifier, run a grid search, and examine the results</li>
        <li>finally, run a couple of gradient booster models</li>

    <li>draw precision recall curves and roc curves for the two classifiers and compare the results</li>
    <li>note that grid search is a computing intensive activity. I've simplified the search to a few options but even those can take a long while (less than 15 minutes on my laptop but could be a couple of hours if you have an older machine)</li>
</ol>

<h3 style="color:green;">The models</h3>
<p></p>
<li><b>Model 1 SGD Classifier</b>: Vanilla version with max_iter set to 1000</li>
<li><b>Model 2 SGD Classifier round 2</b>: SGD Classifier with positive cases assigned a higher weight. One issue with our data is that positive cases are vastly outnumbered by negative cases (in other words, a model that says all cases are negative will have a pretty good accuracy). By overweighting positive cases in our model, we increase the efficacy of the model in finding an actual good solution</li>
<li><b>Model 3 SGD Classifier round 3</b>: Best SGD Classifier model after grid search</li>
<li><b>Model 4 Random Forest Classifier round 1</b>: Random Forest Classifier with base parameters (see below)</li>
<li><b>Model 5 Random Forest Classifier round 2</b>: Best model from grid search</li>
<li><b>Model 6 Gradient Booster Classfier</b></li>
<li><b>Model 7 Gradient Booster Classifier (2nd model)</b></li>

For each model, collect model metrics in the following dataframe results_df. After each model run, replace the 0.0 with the appropriate metric value


In [1]:
import pandas as pd
import numpy as np
results_df = pd.DataFrame(np.zeros(shape=(7,6)))
results_df.index=[1,2,3,4,5,6,7]
results_df.columns = ["accuracy","precision","recall","f1_score","AUC","AP"]
results_df.index.rename("Model",inplace=True)
results_df

Unnamed: 0_level_0,accuracy,precision,recall,f1_score,AUC,AP
Model,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
1,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0
5,0.0,0.0,0.0,0.0,0.0,0.0
6,0.0,0.0,0.0,0.0,0.0,0.0
7,0.0,0.0,0.0,0.0,0.0,0.0


<h3 style="color:green;">The data</h3>
<p></p>
<li>A curated extract from the popular Lending club loan data. The data is in the file loan_data_small.csv</li>
<li>The dataset contains information about loan applications. Very basic information about the applicant and the status of the loan</li>
<li>The goal of the ML exercise is to build a model that uses information about the loan to predict whether a loan is a "good" one (i.e., it will be paid back) or a "bad" one (the money is unrecoverable)</li>
<li>Note that we're only using a fraction of the data. If you're interested, I can share the curated extract on a larger fraction which gives better results (but can crash your machine!)</li>

<h1 style="color:red;font-size:xx-large">Data preparation and feature engineering</h1>


<h3 style="color:green;">Build a binary target</h3>

<li>For the purposes of this analysis, drop rows that contain any NaN values</li>
<li><b>Target</b>: For the classifier, classify any loans that have a loan_status value of "Charged Off","Default", or "Does not meet the credit policy. Status:Charged Off" as a bad loan and give these loans a target value of 1 (we're predicting bad loans)</li>
<li><b>Input features</b>: create the input feature dataframe (i.e., drop any columns that are not an independent variable). The input variables we're interested in are "int_rate", "grade", "home_ownership","annual_income", "loan_amt", and "purpose"</li>
<p></p>
<li>The data should look like:</li>
<pre>
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 565167 entries, 0 to 565166
Data columns (total 7 columns):
 #   Column          Non-Null Count   Dtype  
---  ------          --------------   -----  
 0   Unnamed: 0.1    565167 non-null  int64  
 1   int_rate        565167 non-null  float64
 2   grade           565167 non-null  object 
 3   home_ownership  565167 non-null  object 
 4   annual_inc      565167 non-null  float64
 5   loan_amnt       565167 non-null  int64  
 6   purpose         565167 non-null  object 
dtypes: float64(2), int64(2), object(3)
memory usage: 30.2+ MB
Out[108]:
0         False
1          True
2         False
3         False
4          True
          ...  
565162    False
565163    False
565164    False
565165     True
565166    False
Name: loan_status, Length: 565167, dtype: bool

</pre>

In [2]:
#read the file
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
df = pd.read_csv("loan_data_small.csv")

#Drop rows with NaN values
df.dropna()



#Prepare the y (target) variable
#The target variable should be 1 if loan_status is "Charged Off","Default", or "Does not meet the credit policy. Status:Charged Off"
#And 0 otherwise
#(Hint: Create a boolean mask series)

y = df["loan_status"].isin(["Charged Off","Default", "Does not meet the credit policy"]) 

#remove unwanted input features "Unnamed: 0" and "loan_status"
df = df.drop(["Unnamed: 0", "loan_status"], axis=1)

#Examine the df and the target
df.info()

y

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 565167 entries, 0 to 565166
Data columns (total 7 columns):
 #   Column          Non-Null Count   Dtype  
---  ------          --------------   -----  
 0   Unnamed: 0.1    565167 non-null  int64  
 1   int_rate        565167 non-null  float64
 2   grade           565167 non-null  object 
 3   home_ownership  565167 non-null  object 
 4   annual_inc      565167 non-null  float64
 5   loan_amnt       565167 non-null  int64  
 6   purpose         565167 non-null  object 
dtypes: float64(2), int64(2), object(3)
memory usage: 30.2+ MB


0         False
1          True
2         False
3         False
4          True
          ...  
565162    False
565163    False
565164    False
565165     True
565166    False
Name: loan_status, Length: 565167, dtype: bool

<h3 style="color:green;">Label Encoding</h3>
<li>Since we're using regression as our underlying algorithm, all values need to be numerical. ML Models generally deal with numerical data</li>
<li>But, <span style="color:blue">grade</span>, <span style="color:blue">purpose</span>, and <span style="color:blue">home_ownership</span> are not</li>
</li>
<li>sklearn's <a href="https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html">LabelEncoder</a> assigns numerical values to categorical data</li>
<li>LabelEncoder replaces each categorical string value with an integer - 0, 1, 2, ...</li>
<li>After label encoding, df.info() should return:</li>
<pre>
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 565167 entries, 0 to 565166
Data columns (total 7 columns):
 #   Column          Non-Null Count   Dtype  
---  ------          --------------   -----  
 0   Unnamed: 0.1    565167 non-null  int64  
 1   int_rate        565167 non-null  float64
 2   grade           565167 non-null  int64  
 3   home_ownership  565167 non-null  int64  
 4   annual_inc      565167 non-null  float64
 5   loan_amnt       565167 non-null  int64  
 6   purpose         565167 non-null  int64  
dtypes: float64(2), int64(5)
memory usage: 30.2 MB
</pre>

In [3]:
#replace grade, purpose, and home_ownership by label encoded versions


from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
df["grade"] = le.fit_transform(df["grade"])
df["purpose"] = le.fit_transform(df["purpose"])
df["home_ownership"] = le.fit_transform(df["home_ownership"])

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 565167 entries, 0 to 565166
Data columns (total 7 columns):
 #   Column          Non-Null Count   Dtype  
---  ------          --------------   -----  
 0   Unnamed: 0.1    565167 non-null  int64  
 1   int_rate        565167 non-null  float64
 2   grade           565167 non-null  int64  
 3   home_ownership  565167 non-null  int64  
 4   annual_inc      565167 non-null  float64
 5   loan_amnt       565167 non-null  int64  
 6   purpose         565167 non-null  int64  
dtypes: float64(2), int64(5)
memory usage: 30.2 MB


<h3 style="color:green;">One-hot encoding</h3>

<p></p>
<li>In regression, the assumption is that values associated with a feature are ordered</li>
<li>But, this is not necessarily so for the label encoded categorical values</li>
<li>The way to deal with this in regression is to create dummy variables, one for each category, that take the value 1 if the category is present in the row and 0 otherwise</li>
<li>In ML, a procedure known as <a href="https://en.wikipedia.org/wiki/One-hot">one-hot encoding</a> is used to do this conversion</li>
<li>One hot encoding is the process of converting a single column of categorical (integer) data with k categories into k-1 columns of 0 or 1 values</li>
<li>for example, the array with three possible categories [1,2,3,2,1] will be converted into the matrix:</li>

$$\begin{bmatrix} 0 & 0 \\ 1 & 0 \\ 0 & 1 \\ 1 & 0 \\ 0 & 0 \end{bmatrix}$$

<li>1's are replaced by (0, 0); 2's by (1, 0); and 3's by (0, 1). Note that category 1 is implicitly coded</li>
<li><b>Documentation</b>: <a href="https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html">https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html</a>

<h3 style="color:green;">Scaling</h3>

<p></p>
<li>Non-categorical independent variables need to be scaled so that they follow the same underlying distribution</li>
<li>We will normalize them so that the mean is 0 and standard deviation is 1 using sklearn's StandardScaler feature transformer</li>
<li><a href="https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html">https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html</a></li>

<li>All feature transformations can be encapsulated in the sklearn <a href="https://scikit-learn.org/stable/modules/generated/sklearn.compose.make_column_transformer.html">make_column_transformer</a> object</li>
<li>Use <span style="color:blue">make_column_transformer</span> to encapsulate both the one-hot coding as well as standard scaling. Note that the one-hot encoded columns are not scaled!</li>

In [5]:
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer, make_column_transformer

#Make a column transformer object that scales (using StandardScaler) the two non-categorical columns
# and one hot encodes (using OneHotEncoder) the three categorical columns
# Using make_column_transformer 
preprocess = make_column_transformer(
    (StandardScaler(),['int_rate', 'annual_inc'], ),
    (OneHotEncoder(categories="auto",drop="first"),['grade', 'home_ownership','purpose'], )
)

#Generate the independent variable df
X = preprocess.fit_transform(df)
X.shape
#Should return (565167, 26)

(565167, 26)

<h3 style="color:green;">Train/Test split</h3>

<li><a href="https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html">https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html</a></li>
<li>split the data into 70% training and 30% testing</li>
<li>make sure the x and y datasets are aligned</li>
<li>use random_state=42 to get the same split as in my code </li>
<li>x and y training data shapes: (395616, 26) (395616,)</li>
<li>x and y testing data shapes: (169551, 26) (169551,)</li>

In [6]:
from sklearn.model_selection import train_test_split
#Get x_train, x_test, y_train, y_test
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 42)

#And check the shape
print(x_train.shape,y_train.shape)
print(x_test.shape,y_test.shape)

"""
Should return:
(395616, 26) (395616,)
(169551, 26) (169551,)
"""

(395616, 26) (395616,)
(169551, 26) (169551,)


'\nShould return:\n(395616, 26) (395616,)\n(169551, 26) (169551,)\n'

In [7]:
x_test

<169551x26 sparse matrix of type '<class 'numpy.float64'>'
	with 813222 stored elements in Compressed Sparse Row format>

<h1 style="color:green">The models</h1>
<li>For each model, do the following</li>
<ol>
    <li>Fit a classifier to the training data</li>
    <li>calculate the metrics</li>
    <ul>
        <li>training accuracy</li>
        <li>testing accuracy</li>
        <li>precision on test dataset</li>
        <li>recall on test dataset</li>
        <li>f1 score on test dataset</li>
        <li>area under the curve on test dataset</li>
        <li>average precision on the test dataset</li>
    </ul>
    <li>Write up a brief (pointwise) interpretation of the results
</ol>
<li>Chart the various metrics</li>


<h1 style="color:red;font-size:xx-large">Build Model 1</h1>


<h3 style="color:green;">Build the model on the training data set</h3>

<li>set random_state to 42 (if you want to get the same results that I got) and max_iter to 1000</li>
<li>set the loss function to "log_loss" ("log" if using sklearn 1.0.x or on colab)</li>

In [8]:
from sklearn.linear_model import SGDClassifier
model_1 = SGDClassifier(random_state=42, max_iter=1000, loss="log_loss") #package version causes the different results
model_1.fit(x_train,y_train) #change if you used different variable names

print(model_1.score(x_train,y_train))
print(model_1.score(x_test,y_test))
"""
You should get:
0.8846634109843889
0.8843828700508991
"""

0.8848732103858287
0.8845716038242181


'\nYou should get:\n0.8846634109843889\n0.8843828700508991\n'


<h3 style="color:green;">Model 1 metrics</h3>
<li>Report the following on the <b>test</b> data:</li>
<ul>
<li>the confusion matrix</li>
<li>the accuracy, precision, recall, f1-score, AUC, and AP </li>
</ul>


In [9]:
from sklearn.metrics import confusion_matrix
from sklearn.metrics import f1_score,recall_score,precision_score
from sklearn.metrics import average_precision_score,roc_auc_score

y_test_pred = model_1.predict(x_test)
cfm_1 = confusion_matrix(y_test, y_test_pred)
tn, fp, fn, tp = cfm_1.ravel()

y_train_pred = model_1.predict(x_train)
cfm_train_1 = confusion_matrix(y_train, y_train_pred)
tn_train, fp_train, fn_train, tp_train = cfm_train_1.ravel()

accuracy_training_1 = (tp_train+tn_train)/(tp_train+tn_train+fp_train+fn_train)
accuracy_testing_1 = (tp+tn)/(tp+tn+fp+fn)
precision_1  = precision_score(y_test, y_test_pred)
recall_1  = recall_score(y_test, y_test_pred)
f1_1  = f1_score(y_test, y_test_pred)
auc_1  = roc_auc_score(y_test, y_test_pred)
ap_1  = average_precision_score(y_test, y_test_pred)

print("Confusion Matrix: \n",cfm_1)
print("Training accuracy: ",accuracy_training_1)
print("Testing  accuracy: ",accuracy_testing_1)
print("Precision: ",precision_1)
print("Recall: ",recall_1)
print("F1-Score: ",f1_1)
print("AUC: ",auc_1)
print("Average Precision: ",ap_1)


"""

You should see:

Confusion Matrix: 
 [[149948      1]
 [ 19602      0]]
Training accuracy:  0.8846634109843889
Testing  accuracy:  0.8843828700508991
Precision:  0.0
Recall:  0.0
F1-Score:  0.0
AUC:  0.692962177388246
Average Precision:  0.11561123201868465
"""

Confusion Matrix: 
 [[149977     33]
 [ 19538      3]]
Training accuracy:  0.8848732103858287
Testing  accuracy:  0.8845716038242181
Precision:  0.08333333333333333
Recall:  0.00015352336113811984
F1-Score:  0.0003064820963375389
AUC:  0.4999667690134135
Average Precision:  0.11524655808547495


'\n\nYou should see:\n\nConfusion Matrix: \n [[149948      1]\n [ 19602      0]]\nTraining accuracy:  0.8846634109843889\nTesting  accuracy:  0.8843828700508991\nPrecision:  0.0\nRecall:  0.0\nF1-Score:  0.0\nAUC:  0.692962177388246\nAverage Precision:  0.11561123201868465\n'

<h3 style="color:green;">Interpret the results</h3>
<li>In a few bullet points, write your interpreation of the results. Why are we seeing what we are seeing? Is it useful? Why is the AUC not 0.5?</li>

<h4>Interpretation</h4>
<li>The model accuracy is relatively high at around 88%, indicating that the model is performing well on both training and testing data, to predict the good and bad loans right.</li>
<li>However, the confusion matrix shows that there are 149,997 true negatives, 33 false positives, 19,538 false negatives and 3 true positives. It means that the model is predicting most of the cases as negatives (good loan), and not identify many positives (bad loan). This issue is also indicated by the near zero values for precision, recall, and F1-score. </li>
<li>The AUC score of around 0.5 indicates that the model is just performing as good as random chance. It is not 0.5 because the model is not making random predictions. The AUC score measures the ability of the model to distinguish between positive and negative samples, and a score of near 0.5 suggests that the model is not able to make this distinction effectively. </li>
<li>The average precision score of 0.12 indicates that the model has a low ability to correctly identify positive (bad loan) samples. </li>
<li>The model is not very useful in predicting credit ratings since it is not identifying many positive (bad loan) samples, which is one of the main purposes of the model. This model could make bank mistakenly grant loans to individuals who won't pay it back. Therefore, the model needs further refinement to perform better. 


<h3 style="color:green;">Update results_df</h3>


In [10]:
results_df["accuracy"][1] = accuracy_testing_1
results_df["precision"][1] = precision_1
results_df["recall"][1] = recall_1
results_df["f1_score"][1] = f1_1
results_df["AUC"][1] = auc_1
results_df["AP"][1] = ap_1

results_df

Unnamed: 0_level_0,accuracy,precision,recall,f1_score,AUC,AP
Model,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
1,0.884572,0.083333,0.000154,0.000306,0.499967,0.115247
2,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0
5,0.0,0.0,0.0,0.0,0.0,0.0
6,0.0,0.0,0.0,0.0,0.0,0.0
7,0.0,0.0,0.0,0.0,0.0,0.0


<h1 style="color:red;font-size:xx-large">Build Model 2</h1>



<li>sklearn's ML models can be given a <span style="color:blue">class_weight</span> parameter</li>
<li>weights can be given explicitly or implicitly</li>
<li>note that by increasing the weight of the true cases, our model is more likely to find true positives</li>
<li>and by decreasing the weight of the true cases, our model is more likely to find true negatives</li>
<li>In Model 2, increase the weight of positives by a factor of 9 to balance the positives and negatives</li>

<h3 style="color:green">Build model 2 and report metrics</h3>

In [11]:
model_2 = SGDClassifier(random_state=42, max_iter=1000, loss="log_loss", class_weight={1:9}) 
model_2.fit(x_train,y_train) 

print(model_2.score(x_train,y_train))
print(model_2.score(x_test,y_test))

y_test_pred = model_2.predict(x_test)
cfm_2 = confusion_matrix(y_test, y_test_pred)
tn, fp, fn, tp = cfm_2.ravel()

y_train_pred = model_2.predict(x_train)
cfm_train_2 = confusion_matrix(y_train, y_train_pred)
tn_train, fp_train, fn_train, tp_train = cfm_train_2.ravel()

accuracy_training_2 = (tp_train+tn_train)/(tp_train+tn_train+fp_train+fn_train)
accuracy_testing_2 = (tp+tn)/(tp+tn+fp+fn)
precision_2  = precision_score(y_test, y_test_pred)
recall_2  = recall_score(y_test, y_test_pred)
f1_2 = f1_score(y_test, y_test_pred)
auc_2 = roc_auc_score(y_test, y_test_pred)
ap_2 = average_precision_score(y_test, y_test_pred)

print("Confusion Matrix: \n",cfm_2)
print("Training accuracy: ",accuracy_training_2)
print("Testing  accuracy: ",accuracy_testing_2)
print("Precision: ",precision_2)
print("Recall: ",recall_2)
print("F1-Score: ",f1_2)
print("AUC: ",auc_2)
print("Average Precision: ",ap_2)

0.5571336851896789
0.5571185071158531
Confusion Matrix: 
 [[80059 69951]
 [ 5140 14401]]
Training accuracy:  0.5571336851896789
Testing  accuracy:  0.5571185071158531
Precision:  0.1707250569044006
Recall:  0.736963307916688
F1-Score:  0.2772275321725236
AUC:  0.6353271975887687
Average Precision:  0.15613346501988695


<h3 style="color:green;">Interpret the results</h3>


<h4>Interpretation</h4>
<li>The model accuracy is around 56%, which means the model only correctly classified 56% of the credit rating labels in both training and testing data. It indicates that the model is not performing very well, and there is a need for improvement.</li>
<li>However, compared to model 1, the confusion matrix shows that there are more items predicted as postive (bad loan). The number of predicted positives is increased from 36 to 84,352. Since we increased the weight of the true cases in model 2, our model is more likely to find true positives. This effect also demostrates on the significant increase of precision, recall and f-1 score.</li>
<li>The precision score of 0.17 means that out of all the instances the model predicted as positive (bad loan), only 17% were actually positive. In other words, the model is still not very precise in identifying positive instances (bad loan).</li>
<li>The recall score of 0.74 indicates that the model could correctly identify 74% of the positive instances (bad loan) in the dataset. In other words, the model has decent recall performance, in order to correctly identify the bad loan.</li>
<li>F-1 score is a harmonic mean of precision and recall, and in this case, the score is 0.28. It shows the overall performance of the model has improved, considering both precision and recall.</li>
<li>The AUC is 0.64, which is slightly better than a random model (0.5). It indicates that the model has some ability to distinguish between positive and negative cases.</li>
<li>The average precision score is around 16%, which indicates that the model's precision is still low, but slightly better than model 1.</li>
<li>Model 2 is more useful than model 1, because the higher precision and recall indicates its higher ability to identify good loan, and the higher f-1 score indicates it's a better model that finds tradeoff between precision and recall.

<h3 style="color:green;">Update results_df</h3>

In [12]:
results_df["accuracy"][2] = accuracy_testing_2
results_df["precision"][2] = precision_2
results_df["recall"][2] = recall_2
results_df["f1_score"][2] = f1_2
results_df["AUC"][2] = auc_2
results_df["AP"][2] = ap_2

results_df

Unnamed: 0_level_0,accuracy,precision,recall,f1_score,AUC,AP
Model,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
1,0.884572,0.083333,0.000154,0.000306,0.499967,0.115247
2,0.557119,0.170725,0.736963,0.277228,0.635327,0.156133
3,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0
5,0.0,0.0,0.0,0.0,0.0,0.0
6,0.0,0.0,0.0,0.0,0.0,0.0
7,0.0,0.0,0.0,0.0,0.0,0.0


<h1 style="color:red;font-size:xx-large">Build Model 3</h1>

<h3 style="color:green;">Tune hyperparameters using grid search</h3>
<li><span style="color:blue">parameters</span> versus <span style="color:blue">hyperparameters</span></li>
<ul>
    <li><span style="color:blue">parameters</span>: the parameters that are necessary for the model to make predictions. For example, the coefficients of the linear equation estimated by the SGD classifier are parameters of the model. Parameters are estimated by the algorithm and from the data</li>
    <li><span style="color:blue">hyperparameters</span>: parameters that are external to the model and cannot be estimated from the data. For example, in an SGD classifier, parameters like the loss function, the regularization parameter, stopping rules, etc. are hyper parameters</li>
    </ul>
<li>In ML, hyperparameters are often set intuitively and then <span style="color:red">tuned</span> using a grid search</li>
<li>In a grid search, various combinations of hyperparameters are tried and <span style="color:blue">k-fold cross validation</span> is used to test the efficacy of the parameter combination</li>
<li>the best combination is then selected as a candidate model</li>

<h3 style="color:green;">The <span style="color:blue">scoring</span> parameter</h3>
<li>since our data is imbalaced, we should look for the model with the best f1 score (precision/recall tradeoff)</li>
<li>set the scoring parameter for GridSearchCV so that it maximizes the f1 score</li>
<li>Though we should be using a much wider range of parameters, I've reduced them so that it runs fairly quickly</li>
<li>This takes about 30 seconds on my machine. Could take longer on your machine</li>

In [13]:
%%time
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import SGDClassifier
#Set up the hyperparameter options in param_grid
param_grid = {
    'alpha':(0.001, 0.01, 0.1),
    'l1_ratio': (0.01, 0.1, 1),
    'class_weight':({1:3},{1:6},{1:9})
}

#Do the search
gs_clf = GridSearchCV(SGDClassifier(random_state=42,
                                    max_iter=100,
                                    loss="log_loss",
                                    penalty="elasticnet"),
                      param_grid, cv=5, n_jobs=-1)
gs_clf.fit(x_train, y_train)



CPU times: user 1.27 s, sys: 519 ms, total: 1.79 s
Wall time: 22.5 s


<h3 style="color:green;">Get the best model parameters</h3>


In [14]:
gs_clf.best_params_

{'alpha': 0.1, 'class_weight': {1: 3}, 'l1_ratio': 1}

<h3 style="color:green;">Run the best model and report metrics</h3>
<li>Run the classifier using the best parameters</li>






In [15]:
model_3 = SGDClassifier(random_state=42, 
                        max_iter=100, 
                        loss="log_loss", 
                        penalty="elasticnet",
                        alpha=0.1, 
                        class_weight={1:3}, 
                        l1_ratio=1) 
model_3.fit(x_train,y_train) 

print(model_3.score(x_train,y_train))
print(model_3.score(x_test,y_test))

y_test_pred = model_3.predict(x_test)
cfm_3 = confusion_matrix(y_test, y_test_pred)
tn, fp, fn, tp = cfm_3.ravel()

y_train_pred = model_3.predict(x_train)
cfm_train_3 = confusion_matrix(y_train, y_train_pred)
tn_train, fp_train, fn_train, tp_train = cfm_train_3.ravel()

accuracy_training_3 = (tp_train+tn_train)/(tp_train+tn_train+fp_train+fn_train)
accuracy_testing_3 = (tp+tn)/(tp+tn+fp+fn)
precision_3  = precision_score(y_test, y_test_pred)
recall_3  = recall_score(y_test, y_test_pred)
f1_3  = f1_score(y_test, y_test_pred)
auc_3  = roc_auc_score(y_test, y_test_pred)
ap_3  = average_precision_score(y_test, y_test_pred)

print("Confusion Matrix: \n",cfm_3)
print("Training accuracy: ",accuracy_training_3)
print("Testing  accuracy: ",accuracy_testing_3)
print("Precision: ",precision_3)
print("Recall: ",recall_3)
print("F1-Score: ",f1_3)
print("AUC: ",auc_3)
print("Average Precision: ",ap_3)

0.8774417617083232
0.8772758638993577
Confusion Matrix: 
 [[148020   1990]
 [ 18818    723]]
Training accuracy:  0.8774417617083232
Testing  accuracy:  0.8772758638993577
Precision:  0.2664946553630667
Recall:  0.036999130034286884
F1-Score:  0.06497708277163655
AUC:  0.5118666738765528
Average Precision:  0.12084732497959114


<h3 style="color:green;">Interpret the results</h3>


<h4>Interpretation</h4>
<li>Compared to model 2, although the accuracy of model 3 increases from 56% to 88%, the f1-score reduces from 0.28 to 0.065. It means that even though the model's increases its overal ability in predicting a good loan and bad loan correctly, this model can not balance the needs between identifying bad loans accurately and increasing the posibility of catching bad loans. </li>
<li>Precision increased from 0.17 to 0.27, indicating that model 3 has higher posibility to catch bad loans. However, recall significantly decreased from 0.74 to 0.037, indicating the huge drop of model's ability to identify bad loans accurately.</li>
<li>Meanwhile, AUC has reduced from 0.64 to 0.51, which indicates that the model's efficacy has reduced and is only slightly better than random guessing. 
<li>Therefore, compared to model 2, model 3 is a less useful credit rating model to banks. </li>

<h3 style="color:green;">Update results_df</h3>

In [16]:
results_df["accuracy"][3] = accuracy_testing_3
results_df["precision"][3] = precision_3
results_df["recall"][3] = recall_3
results_df["f1_score"][3] = f1_3
results_df["AUC"][3] = auc_3
results_df["AP"][3] = ap_3

results_df

Unnamed: 0_level_0,accuracy,precision,recall,f1_score,AUC,AP
Model,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
1,0.884572,0.083333,0.000154,0.000306,0.499967,0.115247
2,0.557119,0.170725,0.736963,0.277228,0.635327,0.156133
3,0.877276,0.266495,0.036999,0.064977,0.511867,0.120847
4,0.0,0.0,0.0,0.0,0.0,0.0
5,0.0,0.0,0.0,0.0,0.0,0.0
6,0.0,0.0,0.0,0.0,0.0,0.0
7,0.0,0.0,0.0,0.0,0.0,0.0


<h1 style="color:red;font-size:xx-large">Build Model 4</h1>

<h3 style="color:green;">Random Forest Classifier</h3>
<li>We need to improve recall and precision so perhaps a non-linear classifier will help</li>

<h3 style="color:green;">Build, fit, and report metrics</h3>

<li>Run this with the following parameters (these are our base parameters)</li>
<li>random_state=42,n_estimators=30,max_depth=6,min_samples_leaf=2000,min_samples_split=4000,class_weight={1:5}</li>


In [17]:
from sklearn.ensemble import RandomForestClassifier
model_4 = RandomForestClassifier(random_state=42,
                                 n_estimators=30,
                                 max_depth=6,
                                 min_samples_leaf=2000,
                                 min_samples_split=4000,
                                 class_weight={1:5})
model_4.fit(x_train,y_train)

In [18]:
y_test_pred = model_4.predict(x_test)
cfm_4 = confusion_matrix(y_test, y_test_pred)
tn, fp, fn, tp = cfm_4.ravel()

y_train_pred = model_4.predict(x_train)
cfm_train_4 = confusion_matrix(y_train, y_train_pred)
tn_train, fp_train, fn_train, tp_train = cfm_train_4.ravel()

accuracy_training_4 = (tp_train+tn_train)/(tp_train+tn_train+fp_train+fn_train)
accuracy_testing_4 = (tp+tn)/(tp+tn+fp+fn)
precision_4  = precision_score(y_test, y_test_pred)
recall_4  = recall_score(y_test, y_test_pred)
f1_4  = f1_score(y_test, y_test_pred)
auc_4  = roc_auc_score(y_test, y_test_pred)
ap_4  = average_precision_score(y_test, y_test_pred)

print("Confusion Matrix: \n",cfm_4)
print("Training accuracy: ",accuracy_training_4)
print("Testing  accuracy: ",accuracy_testing_4)
print("Precision: ",precision_4)
print("Recall: ",recall_4)
print("F1-Score: ",f1_4)
print("AUC: ",auc_4)
print("Average Precision: ",ap_4)

Confusion Matrix: 
 [[132195  17815]
 [ 13503   6038]]
Training accuracy:  0.816948758391976
Testing  accuracy:  0.8152886152249176
Precision:  0.25313377772187984
Recall:  0.30899135151732254
F1-Score:  0.27828732082776414
AUC:  0.5951163010503085
Average Precision:  0.15785590250314663


<h3 style="color:green;">Interpreting model 4 results</h3>
<p></p>

<h4>Interpretation</h4>
<li>The model accuracy is around 82% which suggests that the model is performing well on both the training and testing data, but there is still room for improvement.</li>
<li>Precision, recall and f-1 score all maintain in an acceptable level. The precision of the model is 25.3%, which means that out of all the positive predictions (bad loan) made by the model, only 25.3% were actually correct. The recall of the model is 30.9%, which means that out of all the actual positive cases (bad loan), the model correctly identified 30.9%. The F1-score is 27.8%, which is measuring the precision recall tradeof.</li>
<li>The AUC is 0.595, which indicates that the model's ability to distinguish between positive and negative cases is just slightly better than random chance. The average precision of the model is 15.8%, which suggests that the model is still not very good at identifying positives (bad loan) while minimizing false positives.</li>
<li>Overall, while the model is performing reasonably well and it's a quite useful model to banks, there is still room for improvement in terms of precision, recall, and the ability to distinguish between positive and negative cases.</li>

<h3 style="color:green;">Update results_df</h3>

In [19]:
results_df["accuracy"][4] = accuracy_testing_4
results_df["precision"][4] = precision_4
results_df["recall"][4] = recall_4
results_df["f1_score"][4] = f1_4
results_df["AUC"][4] = auc_4
results_df["AP"][4] = ap_4

results_df

Unnamed: 0_level_0,accuracy,precision,recall,f1_score,AUC,AP
Model,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
1,0.884572,0.083333,0.000154,0.000306,0.499967,0.115247
2,0.557119,0.170725,0.736963,0.277228,0.635327,0.156133
3,0.877276,0.266495,0.036999,0.064977,0.511867,0.120847
4,0.815289,0.253134,0.308991,0.278287,0.595116,0.157856
5,0.0,0.0,0.0,0.0,0.0,0.0
6,0.0,0.0,0.0,0.0,0.0,0.0
7,0.0,0.0,0.0,0.0,0.0,0.0


<h1 style="color:red;font-size:xx-large">Build Model 5</h1>

<h3 style="color:green;">Random Forest Grid Search</h3>
<p></p>


<li>Run the best model</li>
<li>Note that this will take a while, perhaps even a couple of hours (25 minutes on my laptop). Let it run. Get some coffee or whatever beverage you like. Then come back in a while to check out the results!</li>
<li>If you want to speed it up, remove the 500 option from n_estimators (n_estimators is the number of trees generated and is the single most expensive part of the grid search)</li>


In [20]:
%%time
from sklearn.ensemble import RandomForestClassifier

#I tried to run the original grid search for several hours but still did not get result. 
#Therefore, I just selected some parameters to reduce the grid search's burden.
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import average_precision_score,make_scorer
parameters = {
    'n_estimators': [800],
    'min_samples_split': [100],
    'class_weight': [{1:6}],
    'min_samples_leaf': [10] 
}
gs_clf = GridSearchCV(RandomForestClassifier(random_state=42),parameters,cv=5,n_jobs=-1,
                      scoring='f1')
gs_clf.fit(x_train, np.ravel(y_train))

CPU times: user 13min 45s, sys: 1.07 s, total: 13min 47s
Wall time: 29min 6s


<h3 style="color:green;">Get the best model parameters</h3>


In [21]:
gs_clf.best_params_

{'class_weight': {1: 6},
 'min_samples_leaf': 10,
 'min_samples_split': 100,
 'n_estimators': 800}

<h3 style="color:green;">Run the best model and get metrics</h3>


In [22]:
model_5 = RandomForestClassifier(random_state=42,
                                 n_estimators=800,
                                 min_samples_leaf=10,
                                 min_samples_split=100,
                                 class_weight={1:6},
                                 n_jobs=-1)
model_5.fit(x_train,y_train)

In [23]:
y_test_pred = model_5.predict(x_test)
cfm_5 = confusion_matrix(y_test, y_test_pred)
tn, fp, fn, tp = cfm_5.ravel()

y_train_pred = model_5.predict(x_train)
cfm_train_5 = confusion_matrix(y_train, y_train_pred)
tn_train, fp_train, fn_train, tp_train = cfm_train_5.ravel()

accuracy_training_5 = (tp_train+tn_train)/(tp_train+tn_train+fp_train+fn_train)
accuracy_testing_5 = (tp+tn)/(tp+tn+fp+fn)
precision_5  = precision_score(y_test, y_test_pred)
recall_5  = recall_score(y_test, y_test_pred)
f1_5  = f1_score(y_test, y_test_pred)
auc_5  = roc_auc_score(y_test, y_test_pred)
ap_5  = average_precision_score(y_test, y_test_pred)

print("Training accuracy: ",accuracy_training_5)
print("Testing  accuracy: ",accuracy_testing_5)
print("confusion matrix:")
print(cfm_5)
print("precision: ",precision_5)
print("recall: ",recall_5)
print("f1 score: ",f1_5)
print("auc",auc_5)
print("ap",ap_5)

Training accuracy:  0.782374828116153
Testing  accuracy:  0.7632924606755489
confusion matrix:
[[119694  30316]
 [  9818   9723]]
precision:  0.242838232723095
recall:  0.4975692134486464
f1 score:  0.3263846928499497
auc 0.6477380098307828
ap 0.17873470927770774


<h3 style="color:green;">Interpreting model 5 results</h3>

<p>
    </p>
<h4>Interpretation</h4>
<li>Compared to model 4, this model's accuracy decreases to 76%, but the higher f1-score indicates that it does a better job in precision recall tradeoff. Even though the model's overal ability in predicting a good loan and bad loan correctly is worse, this model can  better balance the needs between identifying bad loans accurately and increasing the posibility of catching bad loans.</li>
<li>Precision (24%) maintains the similar level as model 4. However, recall significantly increases from 31% to 50%, indicating its ability to correctly identify positives (bad loan) increases. Therefore, f1-score increases as well.</li>
<li>AUC and average precision are also better than model 4. The AUC of the model is 0.65, which is better than random guessing (AUC of 0.5), but still has room for improvement. The average precision score of the model is 0.18, which is again still low and indicates that the model's predictions are not very precise.</li> 
<li>Overall, this model is more useful than model 4 in credit rating applications, but it still needs improvement but it can be effectively used.</li>

<h3 style="color:green;">Update results df</h3>


In [24]:
results_df["accuracy"][5] = accuracy_testing_5
results_df["precision"][5] = precision_5
results_df["recall"][5] = recall_5
results_df["f1_score"][5] = f1_5
results_df["AUC"][5] = auc_5
results_df["AP"][5] = ap_5

results_df

Unnamed: 0_level_0,accuracy,precision,recall,f1_score,AUC,AP
Model,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
1,0.884572,0.083333,0.000154,0.000306,0.499967,0.115247
2,0.557119,0.170725,0.736963,0.277228,0.635327,0.156133
3,0.877276,0.266495,0.036999,0.064977,0.511867,0.120847
4,0.815289,0.253134,0.308991,0.278287,0.595116,0.157856
5,0.763292,0.242838,0.497569,0.326385,0.647738,0.178735
6,0.0,0.0,0.0,0.0,0.0,0.0
7,0.0,0.0,0.0,0.0,0.0,0.0


<h1 style="color:red;font-size:xx-large">Build Model 6</h1>

<li>Gradient Boosting Classifier</li>
<li>Grid search on GBC can take several days so let's just skip to the best models (I ran a 2-day reduced version)!</li>
<li>Sklearn's gradient boosting classifier uses a sample weight vector to correct for imbalances in the data</li>


In [25]:
from sklearn.ensemble import GradientBoostingClassifier

#sample_weight is a vector that indicates the weight of each 
#case in the training sample
#If you're interested, try values from 1 to 10 instead of 4
sample_weight = np.array([4 if i == 1 else 1 for i in y_train])

model_6 = GradientBoostingClassifier(min_samples_split=100,
                                     max_depth=8,
                                     min_samples_leaf=100,
                                     n_estimators=400,
                                     subsample=0.6)

model_6.fit(x_train,y_train,sample_weight=sample_weight)

In [26]:
#Calculate and print metrics
y_test_pred = model_6.predict(x_test)
cfm_6 = confusion_matrix(y_test, y_test_pred)
tn, fp, fn, tp = cfm_6.ravel()

y_train_pred = model_6.predict(x_train)
cfm_train_6 = confusion_matrix(y_train, y_train_pred)
tn_train, fp_train, fn_train, tp_train = cfm_train_6.ravel()

accuracy_training_6 = (tp_train+tn_train)/(tp_train+tn_train+fp_train+fn_train)
accuracy_testing_6 = (tp+tn)/(tp+tn+fp+fn)
precision_6  = precision_score(y_test, y_test_pred)
recall_6  = recall_score(y_test, y_test_pred)
f1_6  = f1_score(y_test, y_test_pred)
auc_6  = roc_auc_score(y_test, y_test_pred)
ap_6  = average_precision_score(y_test, y_test_pred)

print("Training accuracy: ",accuracy_training_6)
print("Testing  accuracy: ",accuracy_testing_6)
print("confusion matrix:")
print(cfm_6)
print("precision: ",precision_6)
print("recall: ",recall_6)
print("f1 score: ",f1_6)
print("auc",auc_6)
print("ap",ap_6)

Training accuracy:  0.8189961983337377
Testing  accuracy:  0.8038466302174567
confusion matrix:
[[128280  21730]
 [ 11528   8013]]
precision:  0.26940792791581214
recall:  0.4100608975999181
f1 score:  0.3251765278792306
auc 0.6326019440336101
ap 0.17846499857984097


<h3 style="color:green;">Interpreting model 6 results</h3>

<p>
    </p>
<h4>Interpretation</h4>
<li>Model 6's performance is similar to model 5. Both models are performing resonably well and useful, but still needs improvement to be effectively used in pratival credit rating applications.</li>
<li>The model has a training accuracy of 81.9%, but just 80.4% testing accuracy, which indicates that the model might be overfitting to the training dataset.</li>
<li> The confusion matrix shows that the model makes more correct predictions for the negative class (good loan) than the positive class (bad loan). The precision of the model is 27%, which means that when the model predicts the positive class, it is correct only 27% of the time. The recall of the model is 41%, which means that the model is able to identify only 41% of the positive class instances. The F1 score is 32.5%, which measuring the precision recall tradeoff as a reasonable level. </li>
<li>The AUC value is 0.63, which means that the model is better than a random classifier, but still not good enough. The average precision is 0.18, which shows that the model's performance in identifying the positive class (bad loan) has improved but needs improvement.</li>

<h3 style="color:green;">Update results df</h3>


In [27]:
results_df["accuracy"][6] = accuracy_testing_6
results_df["precision"][6] = precision_6
results_df["recall"][6] = recall_6
results_df["f1_score"][6] = f1_6
results_df["AUC"][6] = auc_6
results_df["AP"][6] = ap_6

results_df

Unnamed: 0_level_0,accuracy,precision,recall,f1_score,AUC,AP
Model,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
1,0.884572,0.083333,0.000154,0.000306,0.499967,0.115247
2,0.557119,0.170725,0.736963,0.277228,0.635327,0.156133
3,0.877276,0.266495,0.036999,0.064977,0.511867,0.120847
4,0.815289,0.253134,0.308991,0.278287,0.595116,0.157856
5,0.763292,0.242838,0.497569,0.326385,0.647738,0.178735
6,0.803847,0.269408,0.410061,0.325177,0.632602,0.178465
7,0.0,0.0,0.0,0.0,0.0,0.0


<h1 style="color:red;font-size:xx-large">Build Model 7</h1>

<li>Same parameters but up the sample weight to 5</li>

In [28]:
from sklearn.ensemble import GradientBoostingClassifier

#sample_weight is a vector that indicates the weight of each 
#case in the training sample
#If you're interested, try values from 1 to 10 instead of 4
sample_weight = np.array([5 if i == 1 else 1 for i in y_train])

model_7 = GradientBoostingClassifier(min_samples_split=100,
                                     max_depth=8,
                                     min_samples_leaf=100,
                                     n_estimators=400,
                                     subsample=0.6)
model_7.fit(x_train,y_train,sample_weight=sample_weight)

In [29]:
#Calculate and print metrics

y_test_pred = model_7.predict(x_test)
cfm_7 = confusion_matrix(y_test, y_test_pred)
tn, fp, fn, tp = cfm_7.ravel()

y_train_pred = model_7.predict(x_train)
cfm_train_7 = confusion_matrix(y_train, y_train_pred)
tn_train, fp_train, fn_train, tp_train = cfm_train_7.ravel()

accuracy_training_7 = (tp_train+tn_train)/(tp_train+tn_train+fp_train+fn_train)
accuracy_testing_7 = (tp+tn)/(tp+tn+fp+fn)
precision_7  = precision_score(y_test, y_test_pred)
recall_7  = recall_score(y_test, y_test_pred)
f1_7  = f1_score(y_test, y_test_pred)
auc_7  = roc_auc_score(y_test, y_test_pred)
ap_7  = average_precision_score(y_test, y_test_pred)

print("Training accuracy: ",accuracy_training_7)
print("Testing  accuracy: ",accuracy_testing_7)
print("confusion matrix:")
print(cfm_7)
print("precision: ",precision_7)
print("recall: ",recall_7)
print("f1 score: ",f1_7)
print("auc",auc_7)
print("ap",ap_7)

Training accuracy:  0.7726684461700235
Testing  accuracy:  0.7563329027844129
confusion matrix:
[[117803  32207]
 [  9107  10434]]
precision:  0.24469407377875754
recall:  0.5339542500383808
f1 score:  0.33559550995464926
auc 0.659627614986526
ap 0.18436789295386047


<h3 style="color:green;">Interpreting model 7 results</h3>

<p>
    </p>
<h4>Interpretation</h4>
<li>Compared to model 6, this model's accuracy decreases to 76%, but the higher f1-score indicates that it does a better job in precision recall tradeoff. Even though the model's overal ability in predicting a good loan and bad loan correctly is worse, this model can  better balance the needs between identifying bad loans accurately and increasing the posibility of catching bad loans.</li>
<li>Precision (24%) maintains the similar level as model 6. However, recall significantly increases from 41% to 53%, indicating its ability to correctly identify positives (bad loan) increases. Therefore, f1-score increases as well.</li>
<li>AUC and average precision are also better than model 6. The AUC of the model is 0.66, which is better than random guessing (AUC of 0.5), but still has room for improvement. The average precision score of the model is 0.18, which is again still low and indicates that the model's predictions are not very precise.</li> 
<li>Overall, this model is more useful than model 6 in credit rating applications, but it still needs improvement but it can be effectively used.</li>

<h3 style="color:green;">Update results df</h3>


In [30]:
results_df["accuracy"][7] = accuracy_testing_7
results_df["precision"][7] = precision_7
results_df["recall"][7] = recall_7
results_df["f1_score"][7] = f1_7
results_df["AUC"][7] = auc_7
results_df["AP"][7] = ap_7

results_df

Unnamed: 0_level_0,accuracy,precision,recall,f1_score,AUC,AP
Model,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
1,0.884572,0.083333,0.000154,0.000306,0.499967,0.115247
2,0.557119,0.170725,0.736963,0.277228,0.635327,0.156133
3,0.877276,0.266495,0.036999,0.064977,0.511867,0.120847
4,0.815289,0.253134,0.308991,0.278287,0.595116,0.157856
5,0.763292,0.242838,0.497569,0.326385,0.647738,0.178735
6,0.803847,0.269408,0.410061,0.325177,0.632602,0.178465
7,0.756333,0.244694,0.533954,0.335596,0.659628,0.184368


<h3 style="color:red;font-size:xx-large">Model comparison</h3>
<li>Draw a graph that shows the changes to accuracy, precision, recall, and f1 score</li>
<li>The x-axis contains the five models you have created</li>
<li>Use bokeh for the charts</li>

In [31]:
from bokeh.io import output_notebook, show
from bokeh.plotting import figure
from bokeh.layouts import gridplot
from bokeh.models import ColumnDataSource, LabelSet, HoverTool
output_notebook()

In [35]:
#CHART 
models = ['Model 1', 'Model 2', 'Model 3', 'Model 4', 'Model 5', 'Model 6', 'Model7']
accuracy = [accuracy_testing_1, accuracy_testing_2, accuracy_testing_3, accuracy_testing_4, accuracy_testing_5, accuracy_testing_6, accuracy_testing_7]
precision = [precision_1, precision_2, precision_3, precision_4, precision_5, precision_6, precision_7]
recall = [recall_1, recall_2, recall_3, recall_4, recall_5, recall_6, recall_7]
f1 = [f1_1, f1_2, f1_3, f1_4, f1_5, f1_6, f1_7]

data = {'Model':models,
       'Accuracy':accuracy,
       'Precision':precision,
       'Recall':recall,
       'F1score':f1}
source = ColumnDataSource(data=data)
tooltips = [
    ('Model', '@Model'),
    ('Accuracy', '@Accuracy{0.00}'),
    ('Precision', '@Precision{0.00}'),
    ('Recall', '@Recall{0.00}'),
    ('F1-score', '@F1score{0.00}'),
]
hover_tool = HoverTool(tooltips=tooltips)

accuracy_fig = figure(title='Accuracy', x_range=models, tools=[hover_tool])
accuracy_fig.vbar(x='Model', top='Accuracy', width=0.7, source=source)

precision_fig = figure(title='Precision', x_range=models, tools=[hover_tool])
precision_fig.vbar(x='Model', top='Precision', width=0.7, source=source)

recall_fig = figure(title='Recall', x_range=models, tools=[hover_tool])
recall_fig.vbar(x='Model', top='Recall', width=0.7, source=source)

f1_fig = figure(title='F1-score', x_range=models, tools=[hover_tool])
f1_fig.vbar(x='Model', top='F1score', width=0.7, source=source)

figures = [accuracy_fig, precision_fig, recall_fig, f1_fig]
grid = gridplot([ [fig] for fig in figures ])

show(grid)

<h3 style="color:green;">Interpret the chart</h3>
<li>What can you say about the changes in precision and recall</li>
<li>Looking at the precision and recall values for each model, we can see that there is a trade-off between the two measures. Models with higher precision tend to have lower recall, and vice versa. In the context of a credit rating model, precision refers to the proportion of loans that are predicted to be bad and are actually bad. In other words, it represents the accuracy of the model in identifying risky loans. Recall, on the other hand, refers to the proportion of bad loans that are correctly identified by the model. It represents the model's ability to detect all the risky loans</li>
<li>Model 1 has very low precision and recall, meaning that it is not a good model for identifying bad loans. Models 2 has very high recall but low precision, meaning that they identify a lot of risky loans but also misclassify many good loans as bad. Models 3 has much higher precision but lower recall, indicating that they are better at identifying bad loans but may miss some risky loans. The other models tend to have a better balance beween recall and precision, which can also be reflected by higher f1-scores.</li>
<li>Overall, a credit rating model should strike a balance between precision and recall, as misclassifying too many good loans as bad can harm the bank's profitability, while missing too many risky loans can result in high default rates. The choice of the optimal model will depend on the bank's risk tolerance and business objectives. </li>

<h3 style="color:green;">Chart AUC and AP</h3>


In [33]:
#CHART

models = ['Model 1', 'Model 2', 'Model 3', 'Model 4', 'Model 5', 'Model 6', 'Model7']
auc = [auc_1, auc_2, auc_3, auc_4, auc_5, auc_6, auc_7]
ap = [ap_1, ap_2, ap_3, ap_4, ap_5, ap_6, ap_7]

data = {'Model':models,
        'AUC':auc,
        'AP':ap}
source = ColumnDataSource(data=data)
tooltips = [
    ('Model', '@Model'),
    ('AUC', '@AUC{0.00}'),
    ('AP', '@AP{0.00}')
]
hover_tool = HoverTool(tooltips=tooltips)

auc_fig = figure(title='AUC', x_range=models, tools=[hover_tool])
auc_fig.vbar(x='Model', top='AUC', width=0.7, source=source)

ap_fig = figure(title='AP', x_range=models, tools=[hover_tool])
ap_fig.vbar(x='Model', top='AP', width=0.7, source=source)

figures = [auc_fig, ap_fig]
grid = gridplot([ [fig] for fig in figures ])


show(grid)

<h3 style="color:green;">Interpret the AUC/AP chart</h3>
<li>The AUC on the first 4 models is pretty much the same. What does that mean?</li>
<ul><li>The similar AUC scores on the first 4 models indicate that these models have similar discrimination power in distinguishing between positive and negative classes. The efficacy of the models is similar to each other.</li>
    </ul>
<li>The average precision improves steadily but almost entirely by getting better at recall than at precision. What does that mean?</li>
    
<ul><li>The improvement in the average precision score with each model indicates that the models are getting better at ranking the positive samples higher in the predicted probability list. However, the fact that this improvement is mainly driven by recall suggests that the models are still not precise enough in classifying the positive samples (bad loan).</li>
</ul>
<li>Finally, what can you do to get better results? </li>
<ul>
    <li>To improve the results, several things could be tried. First, more data could be collected to improve the model's ability to capture the underlying patterns. Second, feature engineering could be performed to extract more informative features from the available data. Third, different algorithms and hyperparameters could be tested to find a better-performing model. Fourth, the class imbalance issue could be addressed by using more sophisticated sampling techniques.</li></ul>
    
   