# Chapter 4: Decision trees

## Decision tree model building
We have survey data from a call center. We want to build a decision tree model that predicts whether a customer will be satisfied or dissatisfied. The prediction should be available as soon as the customer calls, before any survey is taken. If the model predicts, from the customer attributes, that a customer is likely to be dissatisfied, we route the call to top agents with high scores. If a customer has a high chance of being satisfied, we can route the call to agents with low scores or to inexperienced agents. This strategy helps with resource planning and with increasing the resolution rate.

In [None]:
import pandas as pd

In [None]:
from google.colab import drive
drive.mount('/content/drive')

### Data importing and basic data exploration

In [None]:
#survey_data = pd.read_csv('/content/drive/My Drive/DataSets/Chapter-4/datasets/Call_center_survey.csv')
survey_data = pd.read_csv('https://raw.githubusercontent.com/venkatareddykonasani/ML_DL_py_TF/master/Chapter4_Decison_Trees/Datasets/Call_center_survey.csv')

In [None]:
print(survey_data.shape)
print(survey_data.columns)

In [None]:
pd.set_option('display.max_columns', None)

When we call survey_data.head(), pandas hides some of the middle columns by default.

The display.max_columns option controls how many columns are displayed by df.head() or df.tail().

Setting it to None, as in the above code, removes the limit, so all the columns are displayed.

In [None]:
survey_data.head()

In [None]:
summary=survey_data.describe()
round(summary,2)

The above code gives us summary statistics for each numeric column.

A few observations from the above output:

* The Age variable has an average value of 44.
* Account balance has an average of 41,177. The minimum is 4,904, and the maximum is 109,776.
* Personal loan indicator, Home loan indicator and Prime customer indicator are categorical variables. They take only two values, 0 and 1. A better way to summarize these variables is a frequency table, using the function value_counts().
* Overall_Satisfaction is not shown in the above output because it is a non-numeric column. It also takes two values, “Satisfied” and “Dis Satisfied”.

We will look at the frequency tables for these indicator and categorical variables.


In [None]:
print(survey_data['Overall_Satisfaction'].value_counts())
print(survey_data["Personal_loan_ind"].value_counts())
print(survey_data["Home_loan_ind"].value_counts())
print(survey_data["Prime_Customer_ind"].value_counts())

From the above output, we can note the following:

* Overall, 6,707 customers are dissatisfied and the rest are satisfied. Dissatisfied customers outnumber satisfied customers.
* Almost 50% of the customers have personal loans.
* Almost 50% of the customers have an existing home loan.
* Nearly 58% of the customers are prime category customers.
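Percentages like these can be read off directly by asking value_counts() for proportions instead of raw counts. The sketch below uses a small made-up DataFrame (the name `toy` and its values are illustrative assumptions, not the real survey data).

```python
import pandas as pd

# Toy stand-in for survey_data; the values are made up for illustration.
toy = pd.DataFrame({"Prime_Customer_ind": [1, 1, 0, 1, 0]})

# normalize=True returns the share of each value instead of its count.
shares = toy["Prime_Customer_ind"].value_counts(normalize=True)

# Convert the shares to percentages rounded to one decimal place.
pct = (shares * 100).round(1)
print(pct)
```

On the real data, the same call on survey_data["Prime_Customer_ind"] would give the share of prime customers.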

Now we will build the model using 'Age', 'Account_balance', 'Personal_loan_ind', 'Home_loan_ind' and 'Prime_Customer_ind' as predictor variables, with 'Overall_Satisfaction' as the target variable.

### Model building
Before building a model, we need to convert the data into a specific format: the non-numeric variables must be converted to numeric variables before we can build the decision tree model. If a non-numeric column contains only two values, we can simply map them to 0 and 1.

For example, a variable like gender can be converted to numeric by mapping Male and Female to 0 and 1.

If a categorical variable takes several values, we need to convert it into multiple dummy variables.

For example, a variable like Region takes four values: East, West, North and South. We cannot map these values to 1, 2, 3 and 4, because that would impose an artificial order on the regions. Instead, we create four new binary columns: East_ind, West_ind, North_ind and South_ind.
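One possible way to create such dummy columns is pandas' get_dummies() function. This is a minimal sketch on a made-up Region column; the DataFrame `toy` and the `_ind` suffix are assumptions chosen to match the naming in the text.

```python
import pandas as pd

# Made-up data with a four-valued categorical column.
toy = pd.DataFrame({"Region": ["East", "West", "North", "South", "East"]})

# One binary column per category, named East_ind, North_ind, and so on.
dummies = pd.get_dummies(toy["Region"], dtype=int).add_suffix("_ind")

# Replace the original text column with the new indicator columns.
toy = pd.concat([toy.drop(columns="Region"), dummies], axis=1)
print(toy)
```

Each row has exactly one indicator set to 1, so no ordering between the regions is implied.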


In this example, all the predictor variables are numeric, but the target variable is categorical, so we need to convert it to numeric. The code below maps the non-numeric values to numeric values.

In [None]:
survey_data['Overall_Satisfaction'] = survey_data['Overall_Satisfaction'].map( {'Dis Satisfied': 0, 'Satisfied': 1} ).astype(int)

Now we will check the value_counts since we converted the non-numeric values to numeric values.

In [None]:
survey_data['Overall_Satisfaction'].value_counts()

We will store the names of the predictor variables in a list called features.

In [None]:
features=list(survey_data.columns[1:6])
print(features)

We can prepare the final feature matrix and target vector using the code below.


In [None]:
X=survey_data[features]
y = survey_data['Overall_Satisfaction']

We are going to use these two objects to build the model. Below is the code for configuring and fitting the model.

In [None]:
from sklearn import tree
DT_Model = tree.DecisionTreeClassifier(max_depth=2)
DT_Model.fit(X,y)

Let us walk through the above code.


DT_Model – This is the model name. It can be any name.

DecisionTreeClassifier() – The function that builds the decision tree. It configures the decision tree algorithm; scikit-learn uses the Gini impurity criterion by default.

max_depth – This is a pruning parameter. It limits how many levels of splits the tree can have, so it directly controls the size of the tree.

**"DT_Model.fit(X,y)"**

The call to DecisionTreeClassifier() is the model configuration. In the fit() step, we supply the actual data, X and y. Once we call the fit() function, the algorithm starts the impurity calculations and the other steps of building the decision tree model.


The output printed by the above code is not the model output that we are expecting. It only shows the function and its parameters. We need to draw the tree to understand the model stored in DT_Model.
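Before drawing anything, a fitted tree can also be inspected numerically. The sketch below fits a tree on a tiny made-up dataset (the names `X_toy`, `y_toy` and `toy_model` are illustrative assumptions) and shows that max_depth is only an upper limit: the tree stops earlier when the data is already separated.

```python
from sklearn import tree

# Tiny made-up data: one feature, and the label is 1 for the larger values.
X_toy = [[1], [2], [3], [6], [7], [8]]
y_toy = [0, 0, 0, 1, 1, 1]

# Same configuration style as DT_Model, with the depth capped at 2.
toy_model = tree.DecisionTreeClassifier(max_depth=2, random_state=0)
toy_model.fit(X_toy, y_toy)

# Actual size of the fitted tree, which may be smaller than the cap.
depth = toy_model.get_depth()
n_leaves = toy_model.get_n_leaves()
print("depth:", depth, "leaves:", n_leaves)
```

A single split separates the two classes here, so the tree does not need the second level it was allowed.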

### Drawing the Decision tree
All the measures are calculated and stored in DT_Model. We will access these values and draw the decision tree. Below is the code for drawing the decision tree. We need two packages to draw it, Graphviz and pydotplus. We supply the fitted model to the export function; the code extracts all the values from the model and returns the tree image with all the details.

#### Using GraphViz package

In [None]:
from IPython.display import Image
from io import StringIO  # sklearn.externals.six was removed from recent scikit-learn releases
import pydotplus
dot_data = StringIO()
tree.export_graphviz(DT_Model, 
                     out_file = dot_data,
                     filled=True, 
                     rounded=True,
                     impurity=False,
                     feature_names = features)
graph = pydotplus.graph_from_dot_data(dot_data.getvalue())
Image(graph.create_png())

#### Using plot_tree function

In [None]:
import matplotlib.pyplot as plt
from sklearn.tree import plot_tree, export_text
plt.figure(figsize=(15,7))
plot_tree(DT_Model,
          filled=True,
          rounded=True,
          impurity=False,
          feature_names=features)
plt.show()
# Print the same tree as plain-text rules
print(export_text(DT_Model, feature_names=features))

### Tree validation and accuracy
After building the decision tree model, we get the decision tree rules. Before going ahead with predictions, we need to note the accuracy of the model. The actual values of the target variable are 0s and 1s. We can get the predicted values and create a confusion matrix to derive the accuracy.

In [None]:
predict1 = DT_Model.predict(X)
print(predict1)

In [None]:
from sklearn.metrics import confusion_matrix 
cm = confusion_matrix(y, predict1)
print(cm)

In [None]:
total = sum(sum(cm))
accuracy = (cm[0,0]+cm[1,1])/total
print(accuracy)

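sum(cm) gives the column-wise sums of the confusion matrix, so sum(sum(cm)) is the total number of observations in the confusion matrix.

From the output, we can see that the accuracy of our decision tree model is 92.6%. Note that this accuracy is measured on the same data that was used to build the model.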
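The accuracy derived from the confusion matrix is the sum of the diagonal divided by the total count. As a cross-check, the sketch below computes it that way on made-up labels and compares it with scikit-learn's accuracy_score(); the label lists are illustrative assumptions.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Made-up actual and predicted labels.
y_true = [0, 0, 1, 1, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0]

cm_toy = confusion_matrix(y_true, y_pred)

# Correct predictions sit on the diagonal of the confusion matrix.
acc_manual = np.trace(cm_toy) / cm_toy.sum()

# The library function should agree with the manual calculation.
acc_sklearn = accuracy_score(y_true, y_pred)
print(cm_toy)
print(acc_manual, acc_sklearn)
```

Using cm.sum() and np.trace(cm) avoids the nested sum(sum(cm)) and works for more than two classes as well.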


## Problem of overfitting
* If a model works really well on the training data and fails on the test data, we call it an overfitted model.
* **Train Data**: The data set used for building the model is known as train data. The model tries to learn the patterns in this train data, which is completely exposed to the model. Sometimes a model memorizes this dataset instead of learning the generic patterns in it. A fully grown decision tree returns the individual data points as rules, which is an example of memorizing the training data.
* **Test Data**: Test data is sampled from the same population, but it is kept aside while building the model. We know the actual target values in the test data, so we use it to validate the model. High accuracy on train data does not guarantee high accuracy on test data. We build the model on train data, apply it to test data, and compute the accuracy on test data as well. The model is considered good if it shows high accuracy on train data and almost matching accuracy on test data.
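The files used in this section come already split into train and test sets. When only a single dataset is available, one common way to keep a test set aside is scikit-learn's train_test_split(). The sketch below uses a made-up DataFrame; the column names and the 25% test share are illustrative assumptions.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up data with 20 rows.
toy = pd.DataFrame({"Age": range(20, 40), "Bought": [0, 1] * 10})

# Keep 25% of the rows aside as test data; random_state makes the split repeatable.
train_toy, test_toy = train_test_split(toy, test_size=0.25, random_state=42)

print(train_toy.shape)
print(test_toy.shape)
```

The two parts do not share any rows, so the test rows stay unseen while the model is built.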


In [None]:
import pandas as pd

In [None]:
#train = pd.read_csv("/content/drive/My Drive/DataSets/Chapter-4/datasets/Buyers Profiles/Train_data.csv")
#test = pd.read_csv("/content/drive/My Drive/DataSets/Chapter-4/datasets/Buyers Profiles/Test_data.csv")

train = pd.read_csv("https://raw.githubusercontent.com/venkatareddykonasani/ML_DL_py_TF/master/Chapter4_Decison_Trees/Datasets/Buyers%20Profiles/Train_data.csv")
test = pd.read_csv("https://raw.githubusercontent.com/venkatareddykonasani/ML_DL_py_TF/master/Chapter4_Decison_Trees/Datasets/Buyers%20Profiles/Test_data.csv")

In [None]:
print(train.shape)
print(test.shape)

In [None]:
train['Gender'] = train['Gender'].map( {'Male': 1, 'Female': 0} ).astype(int)
train['Bought'] = train['Bought'].map({'Yes':1, 'No':0}).astype(int)

In [None]:
test['Gender'] = test['Gender'].map( {'Male': 1, 'Female': 0} ).astype(int)
test['Bought'] = test['Bought'].map({'Yes':1, 'No':0}).astype(int)

**What is Overfitting?**
* An overfitted model has high accuracy on train data and significantly lower accuracy on test data.
* The model learns patterns specific to the training data instead of generic patterns; in effect, it memorizes the training data.
* For small changes in the training data, the model and its parameters change a lot. For example, if a decision tree is overfitted, small changes in the training data cause a huge change in the final rules. Since these overfitted models show a huge variance in their parameters, they are also known as high-variance models.
* An overfitted model is overcomplicated, with too many parameters, and needs to be simplified. For a decision tree, this means a really large tree with too many rules; such trees need to be pruned.
* Overfitting is a generic concept. It can happen to any model, including a regression model or a logistic regression model. Any model that shows high accuracy on train data and low accuracy on test data is called an overfitted model.


In [None]:
from sklearn import tree

In [None]:
features = list(train.columns[:2])
X_train = train[features]
y_train = train['Bought']
X_test = test[features]
y_test = test['Bought']

In [None]:
clf = tree.DecisionTreeClassifier()
clf.fit(X_train,y_train)

In [None]:
from IPython.display import Image
from io import StringIO  # sklearn.externals.six was removed from recent scikit-learn releases
import pydotplus
dot_data = StringIO()
tree.export_graphviz(clf,
                     out_file = dot_data,
                     feature_names = features,
                     filled=True, rounded=True,
                     impurity=False)

graph = pydotplus.graph_from_dot_data(dot_data.getvalue())
Image(graph.create_png())

***How to detect overfitting?***

Find the accuracy on train data and on test data. If the model has significantly lower accuracy on test data, then the model is overfitted. As a rule of thumb, a difference of more than 5 percentage points is considered significant. If we have 90% accuracy on train data, we expect the model accuracy on test data to be more than 85%.
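The check described above can be wrapped in a small helper. This is only a sketch: the function name `accuracy_gap` is hypothetical, and the data is random noise generated for illustration, so there is no real pattern to learn and a fully grown tree can only memorize it.

```python
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier

def accuracy_gap(model, X_train, y_train, X_test, y_test):
    """Return (train accuracy, test accuracy, train minus test)."""
    train_acc = accuracy_score(y_train, model.predict(X_train))
    test_acc = accuracy_score(y_test, model.predict(X_test))
    return train_acc, test_acc, train_acc - test_acc

# Made-up data: the labels are random and unrelated to the features.
rng = np.random.default_rng(0)
X_noise = rng.normal(size=(300, 2))
y_noise = rng.integers(0, 2, size=300)
X_tr, y_tr = X_noise[:200], y_noise[:200]
X_te, y_te = X_noise[200:], y_noise[200:]

# No pruning parameters, so the tree grows until it fits the train data.
full_tree = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
train_acc, test_acc, gap = accuracy_gap(full_tree, X_tr, y_tr, X_te, y_te)

# Rule of thumb from the text: a gap above 5 points signals overfitting.
overfitted = gap > 0.05
print(train_acc, test_acc, overfitted)
```

The same helper could be called with clf, X_train, y_train, X_test and y_test from this section.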

In [None]:
predict1 = clf.predict(X_train)
print(predict1)

In [None]:
predict2 = clf.predict(X_test)
print(predict2)

In [None]:
from sklearn.metrics import confusion_matrix  # for building the confusion matrix
cm1 = confusion_matrix(y_train,predict1)
cm1

In [None]:
total1 = sum(sum(cm1))
accuracy1 = (cm1[0,0]+cm1[1,1])/total1
accuracy1

In [None]:
cm2 = confusion_matrix(y_test,predict2)
cm2

In [None]:
total2 = sum(sum(cm2))
accuracy2 = (cm2[0,0]+cm2[1,1])/total2
accuracy2

From the above outputs, we can clearly say that the model is overfitted, since the accuracy on the training dataset is 100% and the accuracy on the test dataset is only 16.66%.

### Choosing the optimal value of the pruning parameter
While building a model, we have to make sure that it is neither overfitted nor underfitted.

First of all, you cannot build and finalize the decision tree model in one attempt. You need to build several models and choose the optimal one. No one can guess the optimal depth of a decision tree for a given dataset. We have to discover it.
* Start by building a really large tree. Depending on the training data, try to reach the maximum possible depth. This model (Model_1) will be overfitted.
* In the second attempt, build a very small tree with max_depth=1. This model (Model_2) will most probably be underfitted. You can look at the training accuracy to confirm it. Model_1 is overfitted and Model_2 is underfitted, but together they give us the boundaries. Now we can search for the optimal value between these boundaries.
* Now build a model with a parameter value between these boundaries. If the model is overfitted, reduce the value of the parameter; if it is underfitted, increase it.
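The search between the two boundaries can be sketched as a loop over candidate depths. The example below is an illustration under assumptions: it uses synthetic data from make_classification() instead of the chapter's files, searches max_depth only, and picks the depth with the highest hold-out accuracy. In practice, a separate validation set or cross-validation is often preferred for this selection.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data; the real train and test sets could be used instead.
X_demo, y_demo = make_classification(n_samples=400, n_features=5,
                                     n_informative=3, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo,
                                          test_size=0.3, random_state=0)

# Try every depth from a very small tree (1) up to a fairly large one (10).
results = []
for depth in range(1, 11):
    model = DecisionTreeClassifier(max_depth=depth, random_state=0)
    model.fit(X_tr, y_tr)
    train_acc = accuracy_score(y_tr, model.predict(X_tr))
    test_acc = accuracy_score(y_te, model.predict(X_te))
    results.append((depth, train_acc, test_acc))
    print(depth, round(train_acc, 3), round(test_acc, 3))

# Pick the depth with the best hold-out accuracy; ties go to the smaller depth.
best_depth, best_train_acc, best_test_acc = max(results, key=lambda r: r[2])
print("chosen max_depth:", best_depth)
```

Reading the printed table top to bottom usually shows train accuracy climbing with depth while hold-out accuracy levels off or drops, which is the overfitting boundary described above.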



In [None]:
dtree = tree.DecisionTreeClassifier(max_leaf_nodes = 10, 
                                    min_samples_leaf = 5, 
                                    max_depth= 5)
dtree.fit(X_train,y_train)

In [None]:
predict3 = dtree.predict(X_train)
predict4 = dtree.predict(X_test)

In [None]:
from sklearn.metrics import confusion_matrix  # for building the confusion matrix
cm1 = confusion_matrix(y_train,predict3)
cm1

In [None]:
total1 = sum(sum(cm1))
accuracy1 = (cm1[0,0]+cm1[1,1])/total1
accuracy1

In [None]:
cm2 = confusion_matrix(y_test,predict4)
cm2

In [None]:
total2 = sum(sum(cm2))
accuracy2 = (cm2[0,0]+cm2[1,1])/total2
accuracy2

Now this model is not overfitted, so we have found suitable values for the pruning parameters.