# Decision Tree Lab

### Part 1: Load data

Load "bank-data.csv" into a pandas DataFrame. The file is semicolon-separated, so pass sep=';' to read_csv.

In [1]:
import pandas as pd
bankData = pd.read_csv('bank-data.csv', sep = ';')
bankData.head()

Unnamed: 0,age,job,marital,education,default,balance,housing,loan,contact,day,month,duration,campaign,pdays,previous,poutcome,y
0,30,unemployed,married,primary,no,1787,no,no,cellular,19,oct,79,1,-1,0,unknown,no
1,33,services,married,secondary,no,4789,yes,yes,cellular,11,may,220,1,339,4,failure,no
2,35,management,single,tertiary,no,1350,yes,no,cellular,16,apr,185,1,330,1,failure,no
3,30,management,married,tertiary,no,1476,yes,yes,unknown,3,jun,199,4,-1,0,unknown,no
4,59,blue-collar,married,secondary,no,0,yes,no,unknown,5,may,226,1,-1,0,unknown,no


### Part 2: Preprocess data

Preprocess the dataset as you have done before

In [2]:
bankData.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4521 entries, 0 to 4520
Data columns (total 17 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   age        4521 non-null   int64 
 1   job        4521 non-null   object
 2   marital    4521 non-null   object
 3   education  4521 non-null   object
 4   default    4521 non-null   object
 5   balance    4521 non-null   int64 
 6   housing    4521 non-null   object
 7   loan       4521 non-null   object
 8   contact    4521 non-null   object
 9   day        4521 non-null   int64 
 10  month      4521 non-null   object
 11  duration   4521 non-null   int64 
 12  campaign   4521 non-null   int64 
 13  pdays      4521 non-null   int64 
 14  previous   4521 non-null   int64 
 15  poutcome   4521 non-null   object
 16  y          4521 non-null   object
dtypes: int64(7), object(10)
memory usage: 600.6+ KB


In [3]:
bankData.shape

(4521, 17)

#### 2.1 Binary encoding

Use LabelEncoder to encode the following binary (yes/no) columns as 0/1:
- y
- default
- housing
- loan

In [4]:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()

#example
bankData['y'] = le.fit_transform(bankData['y'])
bankData.head()

Unnamed: 0,age,job,marital,education,default,balance,housing,loan,contact,day,month,duration,campaign,pdays,previous,poutcome,y
0,30,unemployed,married,primary,no,1787,no,no,cellular,19,oct,79,1,-1,0,unknown,0
1,33,services,married,secondary,no,4789,yes,yes,cellular,11,may,220,1,339,4,failure,0
2,35,management,single,tertiary,no,1350,yes,no,cellular,16,apr,185,1,330,1,failure,0
3,30,management,married,tertiary,no,1476,yes,yes,unknown,3,jun,199,4,-1,0,unknown,0
4,59,blue-collar,married,secondary,no,0,yes,no,unknown,5,may,226,1,-1,0,unknown,0


In [5]:
#Encode the remaining columns

bankData['housing'] = le.fit_transform(bankData['housing'])
bankData['default'] = le.fit_transform(bankData['default'])
bankData['loan'] = le.fit_transform(bankData['loan'])

bankData.head()

Unnamed: 0,age,job,marital,education,default,balance,housing,loan,contact,day,month,duration,campaign,pdays,previous,poutcome,y
0,30,unemployed,married,primary,0,1787,0,0,cellular,19,oct,79,1,-1,0,unknown,0
1,33,services,married,secondary,0,4789,1,1,cellular,11,may,220,1,339,4,failure,0
2,35,management,single,tertiary,0,1350,1,0,cellular,16,apr,185,1,330,1,failure,0
3,30,management,married,tertiary,0,1476,1,1,unknown,3,jun,199,4,-1,0,unknown,0
4,59,blue-collar,married,secondary,0,0,1,0,unknown,5,may,226,1,-1,0,unknown,0
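
The four binary columns can also be encoded in one loop. The snippet below is a minimal sketch on a toy frame (the values are made up; only the column names come from the lab). It keeps one encoder per column in a hypothetical `encoders` dict so the original labels can be recovered later. LabelEncoder sorts the labels, so 'no' maps to 0 and 'yes' maps to 1.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy stand-in for bankData: hypothetical values, same column names as the lab
toy = pd.DataFrame({
    'default': ['no', 'no', 'yes'],
    'housing': ['yes', 'no', 'yes'],
    'loan':    ['no', 'yes', 'no'],
    'y':       ['no', 'yes', 'no'],
})

binary_cols = ['y', 'default', 'housing', 'loan']
encoders = {}  # one fitted encoder per column (hypothetical helper dict)
for col in binary_cols:
    enc = LabelEncoder()
    toy[col] = enc.fit_transform(toy[col])
    encoders[col] = enc

print(encoders['y'].classes_)  # label order learned for y
print(toy)
```

Keeping a separate encoder per column means `encoders['y'].inverse_transform(...)` can map predictions back to 'no'/'yes'.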


#### 2.2 Convert categorical variables into dummy columns

(1) Use pd.get_dummies to convert the following categorical variables into dummy columns:
- job
- marital
- education
- contact
- month
- poutcome

(2) Drop the original columns once they have been converted

In [6]:
#example
bankData = pd.concat([bankData,pd.get_dummies(bankData['job'],prefix='job')],axis=1)
bankData = bankData.drop(columns=['job'])
bankData.head()

Unnamed: 0,age,marital,education,default,balance,housing,loan,contact,day,month,...,job_entrepreneur,job_housemaid,job_management,job_retired,job_self-employed,job_services,job_student,job_technician,job_unemployed,job_unknown
0,30,married,primary,0,1787,0,0,cellular,19,oct,...,0,0,0,0,0,0,0,0,1,0
1,33,married,secondary,0,4789,1,1,cellular,11,may,...,0,0,0,0,0,1,0,0,0,0
2,35,single,tertiary,0,1350,1,0,cellular,16,apr,...,0,0,1,0,0,0,0,0,0,0
3,30,married,tertiary,0,1476,1,1,unknown,3,jun,...,0,0,1,0,0,0,0,0,0,0
4,59,married,secondary,0,0,1,0,unknown,5,may,...,0,0,0,0,0,0,0,0,0,0


In [7]:
#convert the remaining categorical variables
bankData = pd.concat([bankData,pd.get_dummies(bankData['marital'],prefix='marital')],axis=1)
bankData = pd.concat([bankData,pd.get_dummies(bankData['education'],prefix='education')],axis=1)
bankData = pd.concat([bankData,pd.get_dummies(bankData['contact'],prefix='contact')],axis=1)
bankData = pd.concat([bankData,pd.get_dummies(bankData['month'],prefix='month')],axis=1)
bankData = pd.concat([bankData,pd.get_dummies(bankData['poutcome'],prefix='poutcome')],axis=1)

bankData = bankData.drop(columns=['marital', 'education', 'contact', 'month', 'poutcome'])
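
As an alternative sketch, pd.get_dummies can encode several columns and drop the originals in a single call through its `columns` argument. The toy frame below is hypothetical. Passing `dtype=int` is an assumption made here to get 0/1 columns like the output shown earlier, since recent pandas versions return booleans by default.

```python
import pandas as pd

# Toy stand-in for bankData: hypothetical values
toy = pd.DataFrame({
    'age': [30, 33, 35],
    'marital': ['married', 'married', 'single'],
    'contact': ['cellular', 'unknown', 'cellular'],
})

cat_cols = ['marital', 'contact']
# Encodes every listed column, prefixes the new columns with the source
# column name, and removes the original columns
toy = pd.get_dummies(toy, columns=cat_cols, dtype=int)

print(toy.columns.tolist())
```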

#### 2.3 Train/Test separation

Split the data using the hold-out method:
- 60% training set
- 40% testing set

In [8]:
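# Randomly draw 60% of the rows for training (unseeded, so the split changes on every run)
bankData_train = bankData.sample(frac = 0.6)
# The remaining 40% of the rows form the test set
bankData_test = bankData.drop(bankData_train.index)
# Check the class balance of y in each split
print(pd.crosstab(bankData_train['y'],columns = 'count'))
print(pd.crosstab(bankData_test['y'],columns = 'count'))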

col_0  count
y           
0       2402
1        311
col_0  count
y           
0       1598
1        210
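
Because the target is imbalanced, a reproducible and stratified split may be worth trying. The sketch below uses scikit-learn's train_test_split on a toy frame (hypothetical values). The seed `random_state=42` is an arbitrary choice, and `stratify=y` keeps the share of class 1 the same in both sets.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for bankData: 20 rows, 25% positive class (hypothetical values)
toy = pd.DataFrame({
    'age': range(20, 40),
    'balance': range(0, 2000, 100),
    'y': [0] * 15 + [1] * 5,
})

X = toy.drop(columns=['y'])
y = toy['y']

# 60/40 hold-out split, seeded for reproducibility and stratified on y
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=42, stratify=y)

print(y_train.value_counts())
print(y_test.value_counts())
```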


##### X/y separation

In [9]:
bankData_train_y = bankData_train['y']
bankData_train_X = bankData_train.drop(columns=['y'])

bankData_test_y = bankData_test['y']
bankData_test_X = bankData_test.drop(columns=['y'])

### Part 3: Train a decision tree model

In [10]:
from sklearn import tree
clf = tree.DecisionTreeClassifier(min_samples_leaf=30, max_depth=5)
clf = clf.fit(bankData_train_X, bankData_train_y)
print(clf)

DecisionTreeClassifier(max_depth=5, min_samples_leaf=30)


##### Tree Visualization

You MUST first install Graphviz to run the following code: both the 'graphviz' Python package and the Graphviz system binaries (the 'dot' executable) are required.

In [17]:
import graphviz
dot_data = tree.export_graphviz(clf, out_file=None, 
                              feature_names=bankData_train_X.columns,
                              class_names=['0','1'],
                              filled=True, rounded=True,
                              special_characters=True, rotate=True)
graph = graphviz.Source(dot_data)
graph.render('dtree_render')

'dtree_render.pdf'
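
If installing Graphviz is not an option, scikit-learn's export_text prints the learned splits as indented text. The sketch below fits a small tree on synthetic data. The feature names 'f0' to 'f3' and the tree settings are placeholders, not the lab's model.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic data standing in for the bank features
X, y = make_classification(n_samples=200, n_features=4, random_state=0)
feature_names = ['f0', 'f1', 'f2', 'f3']  # placeholder names

clf_toy = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

# One line per split or leaf, indented by depth
rules = export_text(clf_toy, feature_names=feature_names)
print(rules)
```

For the lab's model, the equivalent call would presumably be `export_text(clf, feature_names=list(bankData_train_X.columns))`.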

##### Variable importance

In [12]:
tree_feature = pd.DataFrame({'feature':bankData_train_X.columns,
                             'Score':clf.feature_importances_})

tree_feature.sort_values(by = 'Score', ascending=False).head()

Unnamed: 0,feature,Score
6,duration,0.540705
46,poutcome_success,0.267881
42,month_oct,0.098448
0,age,0.042396
2,balance,0.02746


##### Prediction

In [13]:
clf.predict(bankData_test_X)

array([0, 0, 0, ..., 0, 0, 0])

In [14]:
clf.predict_proba(bankData_test_X)

array([[0.98492118, 0.01507882],
       [0.98492118, 0.01507882],
       [0.98492118, 0.01507882],
       ...,
       [0.92134831, 0.07865169],
       [0.88653367, 0.11346633],
       [0.98492118, 0.01507882]])
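
With class labels 0 and 1, predict returns the class with the higher probability, which amounts to a 0.5 cut-off on the second column of predict_proba. The sketch below applies a lower cut-off to three of the probability rows shown above. The value 0.10 is purely illustrative; lowering the threshold generally trades precision for recall.

```python
import numpy as np

# Three probability rows copied from the predict_proba output above
proba = np.array([
    [0.98492118, 0.01507882],
    [0.92134831, 0.07865169],
    [0.88653367, 0.11346633],
])

# Default behaviour: pick the most probable class
default_pred = proba.argmax(axis=1)

# Custom rule: predict class 1 when its probability reaches the threshold
threshold = 0.10  # illustrative value, not a recommendation
custom_pred = (proba[:, 1] >= threshold).astype(int)

print(default_pred)
print(custom_pred)
```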

### Part 4: Model Evaluation

Evaluation metrics
- confusion matrix
- accuracy
- precision, recall, f1-score

In [15]:
#confusion matrix (rows: actual y, columns: predicted class)
res = clf.predict(bankData_test_X)
pd.crosstab(bankData_test_y, res)

y (actual) \ predicted,0,1
0,1546,52
1,147,63
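
The metrics for class 1 can be derived by hand from the four counts in the confusion matrix. The sketch below does that arithmetic and adds the accuracy of a trivial model that always predicts class 0, which is a useful reference point on an imbalanced target. The variable names are chosen here for illustration.

```python
# Counts read off the confusion matrix above (rows: actual, columns: predicted)
tn, fp = 1546, 52   # actual 0: predicted 0, predicted 1
fn, tp = 147, 63    # actual 1: predicted 0, predicted 1

total = tn + fp + fn + tp
accuracy = (tp + tn) / total
precision = tp / (tp + fp)   # of the predicted 1s, how many are truly 1
recall = tp / (tp + fn)      # of the true 1s, how many were found
f1 = 2 * precision * recall / (precision + recall)
baseline = (tn + fp) / total  # accuracy of always predicting class 0

print(f"accuracy  = {accuracy:.3f}")
print(f"precision = {precision:.2f}")
print(f"recall    = {recall:.2f}")
print(f"f1-score  = {f1:.2f}")
print(f"baseline  = {baseline:.3f}")
```

The baseline comes out only slightly below the model's accuracy, so precision, recall, and f1-score for class 1 say more about the model than accuracy alone.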


In [16]:
from sklearn.metrics import classification_report
from sklearn.metrics import accuracy_score

print("Accuracy:\t %.3f" %accuracy_score(bankData_test_y, res))
print(classification_report(bankData_test_y, res))

Accuracy:	 0.890
              precision    recall  f1-score   support

           0       0.91      0.97      0.94      1598
           1       0.55      0.30      0.39       210

    accuracy                           0.89      1808
   macro avg       0.73      0.63      0.66      1808
weighted avg       0.87      0.89      0.88      1808



### Part 5: Model tuning

#### Note:

After building the decision tree classifier, try answering the following questions.

1. What is the accuracy score?
2. If you change your preprocessing method, can you improve the model?
3. If you change your parameter settings, can you improve the model?

##### Pruning Parameters
- max_leaf_nodes
    - Cap the total number of leaf nodes
- min_samples_leaf
    - Set the minimum number of samples required in each leaf
    - Common choices for the minimum sample size in terminal nodes are 30, 100, 300, or 5% of the total
- max_depth
    - Limit the depth of the tree so that it generalizes better
    - Try a depth of 3, 5, or 10 and choose the value after verifying it on the test data
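
One way to explore these parameters systematically is a grid search. The sketch below runs scikit-learn's GridSearchCV on synthetic imbalanced data rather than the bank dataset. The grid values, the `scoring='f1'` choice, and the 5-fold cross-validation are assumptions for illustration. Cross-validating on the training data is one alternative to verifying on the test set, and it leaves the test set untouched for the final evaluation.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic imbalanced data standing in for the bank training set
X, y = make_classification(n_samples=600, n_features=8,
                           weights=[0.88, 0.12], random_state=0)

# Candidate values for the three pruning parameters (illustrative grid)
param_grid = {
    'max_depth': [3, 5, 10],
    'min_samples_leaf': [30, 100],
    'max_leaf_nodes': [None, 8],
}

# Score each combination by the f1-score of class 1 with 5-fold cross-validation
search = GridSearchCV(DecisionTreeClassifier(random_state=0),
                      param_grid, scoring='f1', cv=5)
search.fit(X, y)

print(search.best_params_)
print(round(search.best_score_, 3))
```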