# Decision Trees
Let's try to predict loan application approvals using a decision tree trained on the encoded loans data in loans_data_encoded.csv, the file from our earlier work on encoding the dataset.

In [1]:
# Initial imports
import pandas as pd
from pathlib import Path
from sklearn import tree
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, accuracy_score, classification_report

## Loading and Preprocessing Loans Encoded Data

In [2]:
# Loading data
file_path = Path("../Resources/loans_data_encoded.csv")
df_loans = pd.read_csv(file_path)
df_loans.head()

Unnamed: 0,amount,term,age,bad,month_num,education_Bachelor,education_High School or Below,education_Master or Above,education_college,gender_female,gender_male
0,1000,30,45,0,6,0,1,0,0,0,1
1,1000,30,50,0,7,1,0,0,0,1,0
2,1000,30,33,0,8,1,0,0,0,1,0
3,1000,15,27,0,9,0,0,0,1,0,1
4,1000,30,28,0,10,0,0,0,1,1,0


Our goal is to predict whether a loan application is worthy of approval based on the information in our df_loans DataFrame. To do this, we'll split our dataset into features (or inputs) and target (or outputs). The features set, X, will be a copy of the df_loans DataFrame without the bad column. These features are all the variables that help determine whether a loan application should be denied.

## REWIND
Recall that X is the input data and y is the output data.

In [3]:
# Define features set
X = df_loans.copy()
X = X.drop("bad", axis=1)
X.head()

Unnamed: 0,amount,term,age,month_num,education_Bachelor,education_High School or Below,education_Master or Above,education_college,gender_female,gender_male
0,1000,30,45,6,0,1,0,0,0,1
1,1000,30,50,7,1,0,0,0,1,0
2,1000,30,33,8,1,0,0,0,1,0
3,1000,15,27,9,0,0,0,1,0,1
4,1000,30,28,10,0,0,0,1,1,0


The target set is the bad column, which indicates whether a loan application is good (0) or bad (1). Run the following code to generate the target set data.

In [4]:
# Define target vector
y = df_loans["bad"].values.reshape(-1, 1)
y[:5]

array([[0],
       [0],
       [0],
       [0],
       [0]])

A preview of the target set shows that the first five applications are all good (loan-worthy).

## Split the Data Into Training and Testing Sets
To train and validate our model, we'll need to split the features and target sets into training and testing sets. The model learns the relationships between the features and the target from the training sets. We then use the features and target testing sets to check how well the fitted model performs on data it has not seen.

In Jupyter Notebook, add the following code that will split our data into training and testing sets.

In [5]:
# Splitting into Train and Test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=78)

When the train_test_split() function is executed, our data is split into a specific proportion of the original data sets. By default, our training and testing data sets are 75% and 25%, respectively, of the original data. Using the following code, we can see the data's 75-25 split.

In [6]:
# Determine the shape of our training and testing sets.
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)

(375, 10)
(125, 10)
(375, 1)
(125, 1)


The output from running the code above shows that X_train and y_train each contain 375 rows (75% of the 500 rows) and that X_test and y_test each contain 125 rows (25%).
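
The default proportions can be checked on stand-in data. The following is a minimal sketch, assuming a made-up array with the same shape as our features set (500 rows, 10 columns); the names X_demo, y_demo, X_tr, and X_te are hypothetical and not part of the notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data with the same shape as the loans features and target
X_demo = np.arange(500 * 10).reshape(500, 10)
y_demo = np.zeros((500, 1), dtype=int)

# No train_size or test_size is given, so the default split is used
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=78)

# Fraction of the original rows that ended up in each set
train_fraction = X_tr.shape[0] / X_demo.shape[0]
test_fraction = X_te.shape[0] / X_demo.shape[0]
print(train_fraction, test_fraction)
```

The random_state value only controls which rows land in each set; the proportions stay the same for any seed.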

We can manually specify the desired split with the train_size parameter.

In [7]:
# Splitting into Train and Test sets with an 80/20 split.
X_train2, X_test2, y_train2, y_test2 = train_test_split(X, y, random_state=78, train_size=0.80)

In [8]:
# Determine the shape of our training and testing sets.
print(X_train2.shape)
print(X_test2.shape)
print(y_train2.shape)
print(y_test2.shape)

(400, 10)
(100, 10)
(400, 1)
(100, 1)


## NOTE
Consult the sklearn documentation for additional information about the train_test_split() function and the parameters it takes.

## Scale the Training and Testing Data
Now that we have split our data into training and testing sets, we can scale the data using Scikit-learn's StandardScaler.

The standard scaler standardizes the data, which means that each feature is rescaled so that its mean is 0 and its standard deviation is 1.
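
To make the rescaling concrete, here is a minimal sketch on a tiny made-up array (the values and the names X_toy, scaler_demo, and X_manual are illustrative assumptions, not taken from the loans data). It compares the scaler's output with the same calculation done by hand.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy features: a loan amount column and an age column
X_toy = np.array([[1000.0, 45.0], [800.0, 50.0], [1000.0, 33.0], [300.0, 27.0]])

scaler_demo = StandardScaler()
X_toy_scaled = scaler_demo.fit_transform(X_toy)

# Same result by hand: subtract each column's mean, divide by its standard deviation
X_manual = (X_toy - X_toy.mean(axis=0)) / X_toy.std(axis=0)

print(X_toy_scaled.mean(axis=0))  # approximately 0 for each column
print(X_toy_scaled.std(axis=0))   # approximately 1 for each column
```

The sketch fits and transforms in one step with fit_transform() for brevity; the notebook fits on the training data and then transforms the training and testing sets separately.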

## NOTE
Typically, models that compute distances between data points, such as SVM, require scaled data. Although decision trees don't require scaling the data, it can be helpful when comparing the performances of different models.

To scale our data, we'll create a StandardScaler instance, scaler, as before, fit it with the training data only, and then scale both the training and testing features with the transform() method:

In [9]:
# Creating StandardScaler instance (each feature will be rescaled so that its mean is 0 and its standard deviation is 1)
scaler = StandardScaler()

In [10]:
# Fitting Standard Scaler with the training data.
X_scaler = scaler.fit(X_train)

In [11]:
# Scaling data
X_train_scaled = X_scaler.transform(X_train)
X_test_scaled = X_scaler.transform(X_test)


## Fitting the Decision Tree Model
After scaling the features data, the decision tree model can be created and trained. First, we create the decision tree classifier instance and then we train or fit the "model" with the scaled training data.

Add and run the following code block to create the decision tree instance and fit the model:

In [12]:
# Creating the decision tree classifier instance
model = tree.DecisionTreeClassifier()

In [13]:
# Fitting the model
model = model.fit(X_train_scaled, y_train)
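
Because the classifier above is created without a random_state, repeated runs of the notebook may produce slightly different trees and scores. The following is a minimal sketch of fixing the seed, assuming synthetic data from make_classification rather than the loans dataset; the seed value 1 and the names X_syn, model_a, and model_b are arbitrary choices for illustration.

```python
import numpy as np
from sklearn import tree
from sklearn.datasets import make_classification

# Synthetic stand-in data (not the loans dataset)
X_syn, y_syn = make_classification(n_samples=200, n_features=10, random_state=0)
X_fit, X_holdout = X_syn[:150], X_syn[150:]
y_fit = y_syn[:150]

# Two separate classifiers created with the same (arbitrary) seed
model_a = tree.DecisionTreeClassifier(random_state=1).fit(X_fit, y_fit)
model_b = tree.DecisionTreeClassifier(random_state=1).fit(X_fit, y_fit)

# With an identical seed and identical data, both trees make the same predictions
same_predictions = np.array_equal(model_a.predict(X_holdout), model_b.predict(X_holdout))
print(same_predictions)
```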

## Making Predictions Using the Tree Model
After fitting the model, we can run the following code to make predictions using the scaled testing data:

In [None]:
# Making predictions using the testing data
predictions = model.predict(X_test_scaled)

The output from this code will be an array of 125 predictions with either a 1 for a bad loan application or a 0 for a good, or approved, loan application.

![image.png](attachment:image.png)

## Evaluate the Model
Finally, we'll determine how well our model classifies loan applications. First, we need to use a confusion matrix.

The following code block creates the confusion_matrix using the y_test and the predictions that we just calculated and adds the confusion_matrix array to a DataFrame:

In [15]:
# Calculating the confusion matrix
cm = confusion_matrix(y_test, predictions)

# Create a DataFrame from the confusion matrix
cm_df = pd.DataFrame(
    cm, index=["Actual 0", "Actual 1"], columns=["Predicted 0", "Predicted 1"]
)

cm_df

![image.png](attachment:image.png)

The results show the following, treating a good application (class 0) as the positive class:

Out of 84 good loan applications (Actual 0), 50 were predicted to be good (Predicted 0), which we call true positives.

Out of 84 good loan applications (Actual 0), 34 were predicted to be bad (Predicted 1), which are considered false negatives.

Out of 41 bad loan applications (Actual 1), 22 were predicted to be good (Predicted 0) and are considered false positives.

Out of 41 bad loan applications (Actual 1), 19 were predicted to be bad (Predicted 1) and are considered true negatives.

We can add these terms to the confusion matrix and add the row and column totals to get the following table:

![image.png](attachment:image.png)
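
The labelled table can also be rebuilt in code. The following is a minimal sketch that reconstructs label arrays matching the counts quoted above (84 good and 41 bad applications) and adds the totals; the names y_true_demo, y_pred_demo, and cm_demo_df are hypothetical. It follows the text's convention that class 0 (good) is the positive class, whereas scikit-learn's binary metrics treat label 1 as positive by default.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import confusion_matrix

# Rebuild labels that reproduce the quoted counts (the row order is arbitrary)
y_true_demo = np.array([0] * 84 + [1] * 41)
y_pred_demo = np.array([0] * 50 + [1] * 34 + [0] * 22 + [1] * 19)

cm_demo = confusion_matrix(y_true_demo, y_pred_demo)

# Naming follows the text: class 0 (good) is treated as the positive class
tp, fn = cm_demo[0]
fp, tn = cm_demo[1]

cm_demo_df = pd.DataFrame(
    cm_demo, index=["Actual 0", "Actual 1"], columns=["Predicted 0", "Predicted 1"]
)
# Add row totals first, then a final row of column totals
cm_demo_df["Total"] = cm_demo_df.sum(axis=1)
cm_demo_df.loc["Total"] = cm_demo_df.sum(axis=0)
print(cm_demo_df)
```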

Next, we can determine the accuracy, or how often the classifier is correct, by running the following code:

In [None]:
# Calculating the accuracy score
acc_score = accuracy_score(y_test, predictions)

The accuracy of our model is 0.552, which can also be calculated as follows:

![image.png](attachment:image.png)
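
In code form, accuracy is the number of correct predictions divided by the total number of predictions. The following is a minimal sketch using the counts from the narrative above, with class 0 (good) as the positive class:

```python
# Counts quoted in the narrative, with class 0 (good) as the positive class
tp, fn, fp, tn = 50, 34, 22, 19

# Correct predictions (true positives and true negatives) over all predictions
accuracy = (tp + tn) / (tp + tn + fp + fn)
print(round(accuracy, 3))  # 0.552
```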

Lastly, we can print out the above results along with the classification report, which will give us the precision, recall, F1 score, and support for the two classes.

In [16]:
# Displaying results
print("Confusion Matrix")
display(cm_df)
print(f"Accuracy Score : {acc_score}")
print("Classification Report")
print(classification_report(y_test, predictions))


Confusion Matrix


Unnamed: 0,Predicted 0,Predicted 1
Actual 0,54,30
Actual 1,23,18


Accuracy Score : 0.576
Classification Report
              precision    recall  f1-score   support

           0       0.70      0.64      0.67        84
           1       0.38      0.44      0.40        41

    accuracy                           0.58       125
   macro avg       0.54      0.54      0.54       125
weighted avg       0.59      0.58      0.58       125



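Let's go over the results in the classification report. Note that the worked figures below use the confusion matrix counts discussed earlier (50, 34, 22, and 19), while the output printed above comes from a run that produced 54, 30, 23, and 18. This kind of run-to-run difference can occur because the DecisionTreeClassifier was created without a fixed random_state.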

#### Precision:
Precision is the measure of how reliable a positive classification is. From our results, the precision for the good loan applications is the ratio TP/(TP + FP), which is 50/(50 + 22) = 0.69. For the bad loan applications, the bad class takes the role of the positive class, so the precision is 19/(19 + 34) = 0.358. A low precision indicates a large number of false positives: of the 53 loan applications we predicted to be bad, 34 were actually good.

#### Recall:
Recall is the ability of the classifier to find all the positive samples. It can be determined by the ratio: TP/(TP + FN), or 50/(50 + 34) = 0.595 for the good loans and 19/(19 + 22) = 0.463 for the bad loans. A low recall is indicative of a large number of false negatives.

#### F1 score: 
F1 score is the harmonic mean of precision and the true positive rate (recall), where the best score is 1.0 and the worst is 0.0.
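
F1 is commonly defined as the harmonic mean of precision and recall, 2 * precision * recall / (precision + recall). The following is a minimal sketch that applies this formula to the precision and recall ratios worked out above; the helper name f1_from is hypothetical.

```python
def f1_from(precision, recall):
    """Return the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Ratios taken from the precision and recall discussion above
precision_good, recall_good = 50 / (50 + 22), 50 / (50 + 34)
precision_bad, recall_bad = 19 / (19 + 34), 19 / (19 + 22)

f1_good = f1_from(precision_good, recall_good)
f1_bad = f1_from(precision_bad, recall_bad)
print(round(f1_good, 3), round(f1_bad, 3))
```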

#### Support:
Support is the number of actual occurrences of the class in the specified dataset. For our results, there are 84 actual occurrences for the good loans and 41 actual occurrences for bad loans.
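
All four scores can be obtained per class in one call. The following is a minimal sketch using precision_recall_fscore_support on label arrays rebuilt from the counts in the narrative (50, 34, 22, and 19); the arrays are reconstructions for illustration, not the notebook's actual y_test and predictions.

```python
import numpy as np
from sklearn.metrics import precision_recall_fscore_support

# Labels rebuilt to reproduce the quoted counts (the row order is arbitrary)
y_true_demo = np.array([0] * 84 + [1] * 41)
y_pred_demo = np.array([0] * 50 + [1] * 34 + [0] * 22 + [1] * 19)

# With the default average=None, each result is an array with one entry per class
precision, recall, f1, support = precision_recall_fscore_support(y_true_demo, y_pred_demo)

for label in (0, 1):
    print(label, round(precision[label], 3), round(recall[label], 3),
          round(f1[label], 3), support[label])
```

Each class is scored as if it were the positive class in turn, which is why the report lists a separate precision and recall for class 0 and class 1.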

In summary, this model may not be the best one for screening out bad loan applications: its accuracy, 0.552, is low, and its precision and recall are not good enough to claim that it will classify bad loan applications reliably. Modeling is an iterative process: you may need more data, more cleaning, another model parameter, or a different model. It's also important to have a goal that's been agreed upon, so that you know when the model is good enough.

## NOTE
Consult the sklearn.metrics.precision_recall_fscore_support documentation for additional information about the classification scores.