# Divide and Conquer - Classification Using Decision Trees and Rules

This notebook identifies risky bank loans using a decision tree.

**Collecting data**

* Import the data from 'credit.csv' using pandas. **Source**: [GitHub Machine-Learning-with-R-datasets](https://github.com/suziyousif/Machine_Learning_DecisionTree/blob/master/credit.csv).
* Preview the first rows and check the data type of each column.


In [4]:
import pandas as pd
file = pd.read_csv("credit.csv")
file.head()

| | checking_balance | months_loan_duration | credit_history | purpose | amount | savings_balance | employment_length | installment_rate | personal_status | other_debtors | ... | property | age | installment_plan | housing | existing_credits | default | dependents | telephone | foreign_worker | job |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | < 0 DM | 6 | critical | radio/tv | 1169 | unknown | > 7 yrs | 4 | single male | none | ... | real estate | 67 | none | own | 2 | 1 | 1 | yes | yes | skilled employee |
| 1 | 1 - 200 DM | 48 | repaid | radio/tv | 5951 | < 100 DM | 1 - 4 yrs | 2 | female | none | ... | real estate | 22 | none | own | 1 | 2 | 1 | none | yes | skilled employee |
| 2 | unknown | 12 | critical | education | 2096 | < 100 DM | 4 - 7 yrs | 2 | single male | none | ... | real estate | 49 | none | own | 1 | 1 | 2 | none | yes | unskilled resident |
| 3 | < 0 DM | 42 | repaid | furniture | 7882 | < 100 DM | 4 - 7 yrs | 2 | single male | guarantor | ... | building society savings | 45 | none | for free | 1 | 1 | 2 | none | yes | skilled employee |
| 4 | < 0 DM | 24 | delayed | car (new) | 4870 | < 100 DM | 1 - 4 yrs | 3 | single male | none | ... | unknown/none | 53 | none | for free | 2 | 2 | 2 | none | yes | skilled employee |
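
`head()` previews rows but does not list the column types. A minimal sketch of the type check follows; the small `sample` frame is a stand-in built from a few values in the preview (an assumption, so the snippet runs without 'credit.csv'). On the real data the same calls would be made on the DataFrame returned by `pd.read_csv`.

```python
import pandas as pd

# Stand-in for a few columns of credit.csv (values taken from the preview);
# in the notebook these calls would run on the full DataFrame instead.
sample = pd.DataFrame({
    "checking_balance": ["< 0 DM", "1 - 200 DM", "unknown"],
    "months_loan_duration": [6, 48, 12],
    "amount": [1169, 5951, 2096],
    "default": [1, 2, 1],
})

# dtypes lists one type per column, separating text columns from numeric ones.
print(sample.dtypes)

# Non-numeric columns are the ones that need encoding before training.
text_columns = sample.select_dtypes(exclude="number").columns.tolist()
print(text_columns)
```

Selecting the non-numeric columns this way is one option for building the list of columns to encode, instead of typing the names by hand.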


**Preparing the data**

* Encode the columns that contain text categories as integers between 0 and n_classes-1.

In [5]:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
for col in ['checking_balance', 'savings_balance', 
            'employment_length', 'credit_history', 
            'purpose', 'personal_status', 'other_debtors', 
            'property', 'installment_plan', 'housing', 
            'telephone', 'foreign_worker', 'job']:
    file[col] = le.fit_transform(file[col])

In [6]:
file.head()

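| | checking_balance | months_loan_duration | credit_history | purpose | amount | savings_balance | employment_length | installment_rate | personal_status | other_debtors | ... | property | age | installment_plan | housing | existing_credits | default | dependents | telephone | foreign_worker | job |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 6 | 0 | 7 | 1169 | 4 | 3 | 4 | 3 | 2 | ... | 2 | 67 | 1 | 1 | 2 | 1 | 1 | 1 | 1 | 1 |
| 1 | 0 | 48 | 4 | 7 | 5951 | 2 | 1 | 2 | 1 | 2 | ... | 2 | 22 | 1 | 1 | 1 | 2 | 1 | 0 | 1 | 1 |
| 2 | 3 | 12 | 0 | 4 | 2096 | 2 | 2 | 2 | 3 | 2 | ... | 2 | 49 | 1 | 1 | 1 | 1 | 2 | 0 | 1 | 3 |
| 3 | 1 | 42 | 4 | 5 | 7882 | 2 | 2 | 2 | 3 | 1 | ... | 0 | 45 | 1 | 0 | 1 | 1 | 2 | 0 | 1 | 1 |
| 4 | 1 | 24 | 1 | 1 | 4870 | 2 | 1 | 3 | 3 | 2 | ... | 3 | 53 | 1 | 0 | 2 | 2 | 2 | 0 | 1 | 1 |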
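
Because one `LabelEncoder` is refitted for every column, the mapping from text to integer is lost after each step. The sketch below shows how the codes for `checking_balance` arise and how to recover the text. The value `'> 200 DM'` does not appear in the preview and is an assumption about the full column; it would explain why `'unknown'` is encoded as 3.

```python
from sklearn.preprocessing import LabelEncoder

# Category values for checking_balance; '> 200 DM' is assumed, the rest
# are visible in the preview.
values = ["< 0 DM", "1 - 200 DM", "unknown", "> 200 DM"]

encoder = LabelEncoder()
codes = encoder.fit_transform(values)

# classes_ is sorted; the position of each class is its integer code.
mapping = {label: int(code) for code, label in enumerate(encoder.classes_)}
print(mapping)

# inverse_transform recovers the original strings from the codes.
restored = list(encoder.inverse_transform(codes))
print(restored)
```

Keeping one fitted encoder per column (for example in a dictionary keyed by column name) would preserve these mappings for later interpretation of the tree.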


* Split the data into training and test sets (90% train, 10% test).

In [7]:
from sklearn.model_selection import train_test_split
credit_train, credit_test = train_test_split(file, test_size = 0.1, 
                                             random_state = 123)

* Remove the label column 'default' from both data sets and keep it as the target.

In [8]:
train_labels = credit_train.pop('default')
test_labels = credit_test.pop('default')
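
An alternative to splitting first and popping the label afterwards is to separate features and label up front and pass both to `train_test_split`. The sketch below also uses `stratify`, which keeps the class ratio equal in both splits. The data frame is a hypothetical stand-in with a 70/30 class balance, not the credit data.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the encoded data: 70 rows of class 1, 30 of class 2.
data = pd.DataFrame({
    "amount": range(100),
    "default": [1] * 70 + [2] * 30,
})

features = data.drop(columns="default")
labels = data["default"]

# stratify=labels keeps the 70/30 class ratio in both the train and test split.
X_train, X_test, y_train, y_test = train_test_split(
    features, labels, test_size=0.1, random_state=123, stratify=labels
)

print(y_train.value_counts().to_dict())
print(y_test.value_counts().to_dict())
```

With only 10% of the rows held out, stratifying avoids a test set whose class balance differs from the training set by chance.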

**Training a model on the data**

* Fit a DecisionTreeClassifier with max_depth = 4 on the training data and predict the labels of the test data.

In [46]:
from sklearn.tree import DecisionTreeClassifier
DTC = DecisionTreeClassifier(max_depth = 4)
DTC = DTC.fit(credit_train, train_labels)
y_predict = DTC.predict(credit_test)

print('number of nodes:',DTC.tree_.node_count,'\nmaximum actual depth:', DTC.tree_.max_depth)

number of nodes: 31 
maximum actual depth: 4
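
The reported 31 nodes equal 2^(4+1) - 1, the size of a complete binary tree of depth 4, which suggests the depth limit is what stopped the tree from growing. A small sketch of that bound follows, using synthetic data from `make_classification` as a stand-in (an assumption; it is not the credit data).

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data with a fixed seed for repeatability.
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

results = []
for depth in (2, 4, 6):
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X, y)
    max_nodes = 2 ** (depth + 1) - 1  # nodes in a complete binary tree of this depth
    results.append((depth, tree.tree_.node_count, max_nodes))
    print(depth, tree.tree_.node_count, max_nodes)
```

Passing `random_state` to the classifier makes repeated runs produce the same tree, which helps when comparing depths.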


**Evaluating model performance**

* Define a function that reports accuracy, precision, recall and F1 for a set of predictions.

In [14]:
from sklearn.metrics import accuracy_score, precision_score
from sklearn.metrics import recall_score, f1_score

def measure_error(y_true, y_pred, label):
    # precision, recall and f1 use scikit-learn's default pos_label=1,
    # so they describe class 1 of 'default'
    return pd.Series({'accuracy': accuracy_score(y_true, y_pred),
                      'precision': precision_score(y_true, y_pred),
                      'recall': recall_score(y_true, y_pred),
                      'f1': f1_score(y_true, y_pred)}, name=label)

measure_error(test_labels, y_predict, 'credit')

accuracy     0.740000
f1           0.814286
precision    0.770270
recall       0.863636
Name: credit, dtype: float64
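
Precision, recall and F1 depend on which class is treated as positive; scikit-learn defaults to `pos_label=1`. The sketch below uses six hypothetical labels and assumes that 1 marks a repaid loan and 2 a default (an assumption about the coding of 'default', not stated in the data preview). Setting `pos_label=2` scores the defaulters instead.

```python
from sklearn.metrics import precision_score, recall_score

# Hypothetical labels: 1 = did not default, 2 = defaulted (assumed coding).
y_true = [1, 1, 1, 1, 2, 2]
y_pred = [1, 1, 1, 2, 1, 2]

# Default pos_label=1: the scores describe class 1.
precision_class1 = precision_score(y_true, y_pred)
recall_class1 = recall_score(y_true, y_pred)

# pos_label=2: the scores describe class 2 instead.
precision_class2 = precision_score(y_true, y_pred, pos_label=2)
recall_class2 = recall_score(y_true, y_pred, pos_label=2)

print(precision_class1, recall_class1)
print(precision_class2, recall_class2)
```

If the goal is to catch risky loans, the class 2 scores are arguably the more relevant ones to report.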

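* Compute the confusion matrix to see which classes the model confuses. Rows are the actual classes and columns the predicted classes, both in sorted label order, so 'Positive' is class 1 and 'Negative' is class 2.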

In [17]:
from sklearn.metrics import confusion_matrix
confusion_matrix_ = pd.DataFrame(confusion_matrix(test_labels, y_predict),
                                 columns=['Predicted Positive', 'Predicted Negative'],
                                 index=['Actual Positive', 'Actual Negative'])
confusion_matrix_

| | Predicted Positive | Predicted Negative |
|---|---|---|
| Actual Positive | 57 | 9 |
| Actual Negative | 17 | 17 |
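
The four scores reported earlier can be reproduced by hand from these counts, which is a useful consistency check. A short worked computation, treating class 1 as positive:

```python
# Counts read from the confusion matrix (class 1 treated as positive).
tp, fn = 57, 9    # actual class 1: predicted 1, predicted 2
fp, tn = 17, 17   # actual class 2: predicted 1, predicted 2

accuracy = (tp + tn) / (tp + fn + fp + tn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(round(accuracy, 6), round(precision, 6), round(recall, 6), round(f1, 6))
# 0.74 0.77027 0.863636 0.814286
```

By the same counts, recall for class 2 is 17 / (17 + 17) = 0.5, so half of the class 2 loans in the test set are predicted as class 1.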


* Export the decision tree in DOT format.
* To visualize the tree, paste the contents of the exported file into the text area at [WebGraphviz](http://www.webgraphviz.com/).

In [50]:
from sklearn.tree import export_graphviz
with open("CREDIT_TREE.txt", "w") as f:
    # write the DOT source to the open file; do not rebind f inside the block
    export_graphviz(DTC, out_file=f)
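
Without feature names the exported nodes are labelled by column position, which is hard to read. The sketch below passes `feature_names` and also prints the tree as text rules with `export_text`. The data is synthetic and the three feature names are placeholders borrowed from the credit columns, so the splits shown are not meaningful.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_graphviz, export_text

# Synthetic stand-in data; the names below are placeholders, not real columns here.
feature_names = ["amount", "age", "months_loan_duration"]
X, y = make_classification(n_samples=200, n_features=3, n_informative=3,
                           n_redundant=0, random_state=0)
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

# out_file=None returns the DOT source as a string instead of writing a file.
dot_source = export_graphviz(tree, out_file=None,
                             feature_names=feature_names, filled=True)

# export_text renders the same tree as indented if/else rules.
rules = export_text(tree, feature_names=feature_names)
print(rules)
```

For the notebook's model, the analogous call would pass the training columns as a list, for example `feature_names=list(credit_train.columns)`.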
    