# Decision Tree Lab

### Part 1: Load data

Import "bank-data.csv"

In [None]:
import pandas as pd
bankData = pd.read_csv('bank-data.csv', sep = ';')
bankData.head()

### Part 2: Preprocess data

Preprocess the dataset as you have done before

In [None]:
bankData.info()

In [None]:
bankData.shape

#### 2.1 Binary encoding

Use LabelEncoder to encode the following columns:
- y
- default
- housing
- loan

In [None]:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()

#example
bankData['y'] = le.fit_transform(bankData['y'])
bankData.head()

In [None]:
#Encode the remaining columns

bankData['housing'] = le.fit_transform(bankData['housing'])
bankData['default'] = le.fit_transform(bankData['default'])
bankData['loan'] = le.fit_transform(bankData['loan'])

bankData.head()
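
LabelEncoder assigns integer codes in sorted order of the labels, so for a yes/no column 'no' becomes 0 and 'yes' becomes 1. The following is a minimal sketch on a hypothetical toy column (not the bank data; `toy`, `le_toy` and `mapping` are illustrative names) that inspects the mapping through `classes_`:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy column standing in for a yes/no field such as 'housing'
toy = pd.DataFrame({'housing': ['yes', 'no', 'yes', 'no']})

le_toy = LabelEncoder()
toy['housing_encoded'] = le_toy.fit_transform(toy['housing'])

# classes_ holds the original labels; the label at position i is encoded as i
mapping = {str(label): code for code, label in enumerate(le_toy.classes_)}
print(mapping)
print(toy)
```

Each call to `fit_transform` refits the encoder, so a single shared encoder only remembers the last column it saw. If you need `inverse_transform` later, one option is to keep a separate encoder per column.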

#### 2.2 Convert categorical variables into dummy columns

(1) Use pd.get_dummies to convert the following categorical variables into dummy columns
- job
- marital
- education
- contact
- month
- poutcome

(2) Drop columns that have been converted

In [None]:
#example
bankData = pd.concat([bankData,pd.get_dummies(bankData['job'],prefix='job')],axis=1)
bankData = bankData.drop(columns=['job'])
bankData.head()

In [None]:
#convert the remaining categorical variables
bankData = pd.concat([bankData,pd.get_dummies(bankData['marital'],prefix='marital')],axis=1)
bankData = pd.concat([bankData,pd.get_dummies(bankData['education'],prefix='education')],axis=1)
bankData = pd.concat([bankData,pd.get_dummies(bankData['contact'],prefix='contact')],axis=1)
bankData = pd.concat([bankData,pd.get_dummies(bankData['month'],prefix='month')],axis=1)
bankData = pd.concat([bankData,pd.get_dummies(bankData['poutcome'],prefix='poutcome')],axis=1)

bankData = bankData.drop(columns=['marital', 'education', 'contact', 'month', 'poutcome'])
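
As a possible shortcut, `pd.get_dummies` also accepts a `columns` argument, which encodes the listed columns and drops the originals in one call. A minimal sketch on a hypothetical toy frame (the values are made up for illustration):

```python
import pandas as pd

# Hypothetical toy frame with one numeric and two categorical columns
toy = pd.DataFrame({
    'age': [30, 45, 52],
    'marital': ['married', 'single', 'married'],
    'contact': ['cellular', 'unknown', 'cellular'],
})

# Encode both categorical columns at once; the column name is used as prefix
toy_encoded = pd.get_dummies(toy, columns=['marital', 'contact'])
print(toy_encoded.columns.tolist())
```

The untouched columns come first, followed by the dummy columns in the order the source columns were listed.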

#### 2.3 Train/Test separation

Split the data using the hold-out method:
- 60% training set
- 40% testing set

In [None]:
bankData_train = bankData.sample(frac = 0.6)
bankData_test = bankData.drop(bankData_train.index)
print(pd.crosstab(bankData_train['y'],columns = 'count'))
print(pd.crosstab(bankData_test['y'],columns = 'count'))
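
Without a seed, `sample` produces a different split on every run, and it does not guarantee that both sets keep the same class balance. One alternative, shown here as a sketch rather than the lab's required method, is scikit-learn's `train_test_split` with `stratify` and `random_state`. The toy data and the seed value 42 are assumptions for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced toy data: 80 negatives, 20 positives
toy = pd.DataFrame({'x': range(100), 'y': [0] * 80 + [1] * 20})

# 60/40 hold-out split that preserves the class ratio of 'y' in both parts
toy_train, toy_test = train_test_split(
    toy, train_size=0.6, stratify=toy['y'], random_state=42)

print(toy_train['y'].value_counts())
print(toy_test['y'].value_counts())
```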

##### X/y separation

In [None]:
bankData_train_y = bankData_train['y']
bankData_train_X = bankData_train.copy()
del bankData_train_X['y']

bankData_test_y = bankData_test['y']
bankData_test_X = bankData_test.copy()
del bankData_test_X['y']

### Part 3: Train a decision tree model

In [None]:
from sklearn import tree
clf = tree.DecisionTreeClassifier(min_samples_leaf=30, max_depth=5)
clf = clf.fit(bankData_train_X, bankData_train_y)
print(clf)

##### Tree Visualization

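You MUST first install 'graphviz' in order to run the following code. This means both the Python package and the Graphviz system software, so that the 'dot' executable is on your PATH.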

In [None]:
import graphviz
dot_data = tree.export_graphviz(clf, out_file=None, 
                              feature_names=bankData_train_X.columns,
                              class_names=['0','1'],
                              filled=True, rounded=True,
                              special_characters=True, rotate=True)
graph = graphviz.Source(dot_data)
graph.render('dtree_render')

##### Variable importance

In [None]:
tree_feature = pd.DataFrame({'feature':bankData_train_X.columns,
                             'Score':clf.feature_importances_})

tree_feature.sort_values(by = 'Score', ascending=False).head()

##### Prediction

In [None]:
clf.predict(bankData_test_X)

In [None]:
clf.predict_proba(bankData_test_X)
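
The columns returned by `predict_proba` follow the order of `clf.classes_`, and `predict` simply picks the class with the highest probability. The sketch below, on synthetic data from `make_classification`, shows how the positive-class column could be used with a custom cut-off instead. The names `toy_clf`, `threshold` and the value 0.3 are hypothetical.

```python
import numpy as np
from sklearn import tree
from sklearn.datasets import make_classification

# Synthetic binary data, used only for illustration
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
toy_clf = tree.DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

proba = toy_clf.predict_proba(X)

# Column order matches toy_clf.classes_, so look up where class 1 sits
pos_col = list(toy_clf.classes_).index(1)

threshold = 0.3  # hypothetical cut-off; lower values flag more positives
custom_pred = (proba[:, pos_col] >= threshold).astype(int)

print(toy_clf.classes_)
print(proba[:5])
print(custom_pred[:5])
```

Lowering the threshold below 0.5 trades precision for recall, which can matter when the positive class is rare.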

### Part 4: Model Evaluation

Evaluation metrics
- confusion matrix
- accuracy
- precision, recall, f1-score

In [None]:
#confusion matrix
res = clf.predict(bankData_test_X)
pd.crosstab(bankData_test_y, res)

In [None]:
from sklearn.metrics import classification_report
from sklearn.metrics import accuracy_score

print("Accuracy:\t %.3f" %accuracy_score(bankData_test_y, res))
print(classification_report(bankData_test_y, res))
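
To see where the numbers in the report come from, the sketch below derives precision, recall and F1 for the positive class directly from the four cells of a confusion matrix. The label vectors are made up for illustration.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score

# Hypothetical true and predicted labels
y_true = [0, 0, 0, 0, 1, 1, 1, 0, 1, 0]
y_pred = [0, 0, 1, 0, 1, 0, 1, 0, 1, 0]

# For binary labels, ravel() yields the cells in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision = tp / (tp + fp)   # share of predicted positives that are correct
recall = tp / (tp + fn)      # share of actual positives that were found
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(tn, fp, fn, tp)
print(precision, recall, f1, accuracy)
```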

### Part 5: Model tuning

#### Note:

After building the decision tree classifier, try answering the following questions.

1. What is the accuracy score?
2. If you change your preprocessing method, can you improve the model?
3. If you change your parameter settings, can you improve the model?

##### Pruning Parameters
- max_leaf_nodes
    - Reduce the number of leaf nodes
- min_samples_leaf
    - Require a minimum number of samples in each leaf
    - The minimum sample size in terminal nodes can be fixed to, e.g., 30, 100, 300, or 5% of the total
- max_depth
    - Reduce the depth of the tree to build a generalized tree
    - Set the depth of the tree to, e.g., 3, 5, or 10, depending on the results of verification on test data
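
One way to try the pruning parameters above systematically is a grid search. The sketch below uses `GridSearchCV` on synthetic data, with candidate values taken from the list above. It scores each combination by 5-fold cross-validation on the training data, which is a substitution for checking against the test set directly. The data set and the choice of `cv=5` are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary data standing in for the preprocessed training set
X, y = make_classification(n_samples=2000, n_features=10, random_state=0)

# Candidate pruning settings to compare
param_grid = {
    'max_depth': [3, 5, 10],
    'min_samples_leaf': [30, 100, 300],
}

# Fit one tree per parameter combination and fold, scored by accuracy
search = GridSearchCV(DecisionTreeClassifier(random_state=0),
                      param_grid, cv=5, scoring='accuracy')
search.fit(X, y)

print(search.best_params_)
print(round(search.best_score_, 3))
```

After the search, `search.best_estimator_` holds a tree refit on all the data passed to `fit` with the best settings, and it can then be evaluated once on the held-out test set.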