Decision Trees in Python with Scikit-Learn
-------------------------------------------------------------

Introduction
-----------------
A decision tree is one of the most widely **used supervised machine learning algorithms, and it can perform both regression and classification tasks**. Hence the name <font color='green'><b>CART</b> - <u>C</u>lassification <u>A</u>nd <u>R</u>egression <u>T</u>rees.</font>

The decision tree algorithm builds a tree of nodes, each of which tests one attribute of the dataset; the most important attribute is placed at the root node. To make a prediction, we start at the root node and work our way down the tree, at each node following the branch whose condition (or "decision") our sample meets. This process continues until a leaf node is reached, which contains the prediction, i.e. the outcome of the decision tree.
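
To make the root-to-leaf walk concrete, here is a minimal sketch in plain Python. The tree is written by hand: the feature names, thresholds, and leaf values in `toy_tree` are made up for illustration and are not what scikit-learn would learn from any dataset.

```python
# A hand-built toy tree (hypothetical thresholds, for illustration only).
# Internal nodes test one feature; leaves carry the prediction.
toy_tree = {
    "feature": "variance", "threshold": 0.3,
    "left": {"leaf": 1},                      # variance <= 0.3
    "right": {                                # variance > 0.3
        "feature": "skewness", "threshold": 5.0,
        "left": {"leaf": 0},                  # skewness <= 5.0
        "right": {"leaf": 1},                 # skewness > 5.0
    },
}


def predict(node, sample):
    """Walk from the root to a leaf, following one branch per decision."""
    while "leaf" not in node:
        go_left = sample[node["feature"]] <= node["threshold"]
        node = node["left"] if go_left else node["right"]
    return node["leaf"]


first = predict(toy_tree, {"variance": -1.2, "skewness": 2.0})
second = predict(toy_tree, {"variance": 2.5, "skewness": 3.4})
print(first, second)
```

The first sample stops at the root's left leaf; the second needs two decisions before it reaches a leaf.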



>**Advantages of Decision Trees**
------------------------------

There are several advantages of using decision trees for predictive analysis:

1> Decision trees can be used to predict both continuous and discrete values, i.e. they work for both regression and classification tasks.

2> They require relatively little effort to train.

3> They can be used to classify non-linearly separable data.

4> Prediction is fast: the tree only follows a single path from the root to a leaf, whereas algorithms such as KNN have to search the stored training data for every prediction.
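
As a small illustration of point 3, the sketch below fits a tree to synthetic one-dimensional data in which one class sits in a middle band and the other class lies on both sides, so no single threshold separates them. The data, the names `X_toy`, `y_toy`, and `toy_clf`, and the `random_state=0` seed are assumptions chosen for this example.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Synthetic data: class 1 where |x| < 1, class 0 on both sides of that band.
X_toy = np.linspace(-3, 3, 61).reshape(-1, 1)
y_toy = (np.abs(X_toy.ravel()) < 1).astype(int)

# An unrestricted tree can carve out the middle band with two thresholds.
toy_clf = DecisionTreeClassifier(random_state=0)
toy_clf.fit(X_toy, y_toy)

train_accuracy = toy_clf.score(X_toy, y_toy)
toy_predictions = toy_clf.predict([[-2.5], [0.0], [2.5]])
print(train_accuracy)
print(toy_predictions)
```

A perfect score on the training data says nothing about generalization; the example only shows that the tree can represent a boundary that a single linear cut cannot.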


# 1. Decision Tree for Classification
---------------------------------------------------------
<font color='green'><b>We will be using DecisionTreeClassifier from sklearn.tree.</b> It is fast, simple, and takes care of all the math.</font>
<font color='red'>
Here, we will predict whether a <b>bank note is authentic or fake</b> based on four attributes of the image of the note. The <u>attributes</u> are the variance of the wavelet-transformed image, the kurtosis of the image (spelled "Curtosis" in the dataset), the entropy, and the skewness of the image.</font>

**Note :** In the dataset, the **Class** variable can be **0 or 1**. **0 indicates an authentic bank note and 1 indicates a fake bank note.**

In [1]:
# doing the minimum necessary imports
# more modules would be imported as and when needed

import pandas as pd  
import numpy as np  
import matplotlib.pyplot as plt  
%matplotlib inline

# reading data from CSV file. 
# reading bank currency note data into pandas dataframe.
bankdata = pd.read_csv("bill_authentication.csv")  

# Exploratory Data Analysis
print(bankdata.shape)  
print("------------")

#bankdata.head()

# preview 10 rows of a shuffled copy (frac=1 samples 100% of the rows)
print(bankdata.sample(random_state=100, frac=1).head(10)) 

# uncomment the next line to shuffle the original dataframe itself
# bankdata = bankdata.sample(random_state=100, frac=1)

(1372, 5)
------------
      Variance  Skewness  Curtosis  Entropy  Class
1058  -1.56210   -2.2121   4.25910  0.27972      1
714    2.55590    3.3605   2.03210  0.26809      0
1061  -2.31470    3.6668  -0.69690 -1.24740      1
399    2.96950    5.6222   0.27561 -1.15560      0
382    0.86202    2.6963   4.29080  0.54739      0
376    3.23030    7.8384  -3.53480 -1.21510      0
987   -0.55648    3.2136  -3.30850 -2.79650      1
416    4.34830   11.1079  -4.08570 -4.25390      0
945   -1.76970    3.4329  -1.21440 -2.37890      1
595    3.18360    7.2321  -1.07130 -2.59090      0


In [2]:
bankdata.info()  # this helps in finding any missing values

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1372 entries, 0 to 1371
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Variance  1372 non-null   float64
 1   Skewness  1372 non-null   float64
 2   Curtosis  1372 non-null   float64
 3   Entropy   1372 non-null   float64
 4   Class     1372 non-null   int64  
dtypes: float64(4), int64(1)
memory usage: 53.7 KB


<b> Analysis : </b> There is no missing data: every column has 1372 non-null values. This data is clean.
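
Besides reading the non-null counts from `info()`, missing values can be counted directly. The sketch below uses a small made-up frame (`toy`, not the bank note data) with one deliberately missing entry, so that the check has something to find.

```python
import numpy as np
import pandas as pd

# Made-up frame with one missing value in the "Variance" column.
toy = pd.DataFrame({
    "Variance": [2.5, -1.6, np.nan],
    "Class": [0, 1, 1],
})

# Number of missing values per column.
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Single yes/no answer for the whole frame.
has_missing = bool(toy.isna().any().any())
print(has_missing)
```

Applied to `bankdata`, the same two calls would be expected to report zero missing values in every column, in line with the `info()` output above.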

In [3]:
# Data Preprocessing
# Data preprocessing involves 
# (1) Dividing the data into attributes and labels and 
# (2) dividing the data into training and testing sets.

# To divide the data into attributes and labels, do :
X = bankdata.drop('Class', axis=1)  
y = bankdata['Class']  

# the final preprocessing step is to divide data into training and test sets
from sklearn.model_selection import train_test_split  
X_train, X_test, y_train, y_test = train_test_split(X,y,random_state=100)
# default test_size parameter value is 0.25

# Training the Algorithm. Here we would use DecisionTreeClassifier
from sklearn.tree import DecisionTreeClassifier  
classifier = DecisionTreeClassifier()  
classifier.fit(X_train, y_train)

# make predictions on the test data
y_pred = classifier.predict(X_test)

# Evaluating the Algorithm
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score  
print(confusion_matrix(y_test, y_pred))  
print(classification_report(y_test, y_pred))
print(accuracy_score(y_test, y_pred))



[[197   1]
 [  3 142]]
              precision    recall  f1-score   support

           0       0.98      0.99      0.99       198
           1       0.99      0.98      0.99       145

    accuracy                           0.99       343
   macro avg       0.99      0.99      0.99       343
weighted avg       0.99      0.99      0.99       343

0.9883381924198251


<b><font color='green'>Analysis</font></b> : From the confusion matrix, you can see that out of 343 test instances, our algorithm misclassified only 4 (the off-diagonal entries, 1 + 3). This is approximately 99% accuracy. 
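
The numbers in the report can be recomputed by hand from the confusion matrix. The sketch below copies the four counts printed above (rows are the true classes, columns are the predicted classes) and derives the accuracy plus the precision and recall for class 1; the variable names are chosen for this example.

```python
# Confusion matrix copied from the output above.
# Row = true class, column = predicted class.
cm = [[197, 1],
      [3, 142]]

total = sum(sum(row) for row in cm)       # all test instances
correct = cm[0][0] + cm[1][1]             # diagonal entries
misclassified = total - correct
accuracy = correct / total

# Class 1: of everything predicted as 1, how much was really 1 (precision),
# and of everything that was really 1, how much was found (recall).
precision_1 = cm[1][1] / (cm[0][1] + cm[1][1])
recall_1 = cm[1][1] / (cm[1][0] + cm[1][1])

print(total, misclassified)
print(round(accuracy, 4), round(precision_1, 2), round(recall_1, 2))
```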

# 2. Decision Tree for Regression
------------------------------------------------------
<font color='green'><b>( We will be using DecisionTreeRegressor from sklearn.tree.</b> It is fast, simple, and takes care of all the math. )</font><br><br>
<font color='red'>
We will use the petrol_consumption.csv dataset and <b>try to predict gas consumption</b> (in millions of gallons) in 48 US states <u>based upon</u> gas tax (in cents), per capita income (in dollars), paved highways (in miles), and the proportion of the population with a driver's license. </font>

**Note :** In the dataset **Petrol_Consumption** is the target variable. 

In [14]:
# Importing Libraries
import pandas as pd  
import numpy as np  
import matplotlib.pyplot as plt  
%matplotlib inline

# Importing the Dataset
dataset = pd.read_csv('petrol_consumption.csv')

dataset.head()  

   Petrol_tax  Average_income  Paved_Highways  Population_Driver_licence(%)  Petrol_Consumption
0         9.0            3571            1976                         0.525                 541
1         9.0            4092            1250                         0.572                 524
2         9.0            3865            1586                         0.580                 561
3         7.5            4870            2351                         0.529                 414
4         8.0            4399             431                         0.544                 410


In [19]:
# To see statistical details of the dataset, uncomment the following command:
#dataset.describe()

# 10% of the mean of the target variable (not the mean itself)
dataset['Petrol_Consumption'].mean()*0.1

57.67708333333334

In [17]:
# Preparing the Data
# divide the data into attributes and labels
X = dataset.drop('Petrol_Consumption', axis=1)  
y = dataset['Petrol_Consumption']  

# dividing data into training and testing set
from sklearn.model_selection import train_test_split  
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, 
                                                    random_state=0)  

# Training and Making Predictions
# Note : we will be using the DecisionTreeRegressor class, not DecisionTreeClassifier
from sklearn.tree import DecisionTreeRegressor
regressor = DecisionTreeRegressor()  
regressor.fit(X_train, y_train)

# make predictions on the test data
y_pred = regressor.predict(X_test)





**Note** : In your case the results may be different, depending upon the training and testing split. The train_test_split method splits the data randomly, so without a fixed seed you are unlikely to get the same training and test sets. Calling train_test_split with random_state=0, as above, reproduces the same split on every run. The tree itself is created without a random_state, so its predictions may still vary slightly between runs.
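
The effect of `random_state` on the split can be checked directly. This is a minimal sketch on made-up arrays (`data` and `target` are placeholders, not the petrol dataset): two calls with the same seed should return identical test sets.

```python
import numpy as np
from sklearn.model_selection import train_test_split

data = np.arange(20).reshape(10, 2)   # made-up features, 10 rows
target = np.arange(10)                # made-up labels

# Two independent calls with the same seed.
_, test_a, _, y_test_a = train_test_split(data, target, test_size=0.2, random_state=0)
_, test_b, _, y_test_b = train_test_split(data, target, test_size=0.2, random_state=0)

same_split = np.array_equal(test_a, test_b) and np.array_equal(y_test_a, y_test_b)
print(same_split)
print(test_a.shape)
```

Leaving `random_state` out makes each call draw a fresh shuffle, which is why results reported without a seed are hard to reproduce.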


In [18]:
# Evaluating the Algorithm
from sklearn import metrics  
print('Mean Absolute Error:', metrics.mean_absolute_error(y_test, y_pred))  
print('Mean Squared Error:', metrics.mean_squared_error(y_test, y_pred))  
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_test, y_pred)))  

Mean Absolute Error: 51.7
Mean Squared Error: 4686.9
Root Mean Squared Error: 68.46093776745977


The root mean squared error for our algorithm is 68.46, which is more than *10 percent of the mean* of all the values in the '**Petrol_Consumption**' column (10 percent of the mean is about **57.68**). This means that our algorithm did not do a good prediction job. 
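
The comparison can be reproduced from the two numbers printed earlier. The sketch below takes the reported mean squared error and the reported "10% of the mean" value as given inputs and expresses the RMSE as a fraction of the target mean; treating 10% as the cut-off is the rule of thumb used in this notebook, not a universal standard.

```python
import math

mse = 4686.9                              # Mean Squared Error reported above
ten_percent_of_mean = 57.67708333333334   # output of the mean()*0.1 cell

rmse = math.sqrt(mse)
target_mean = ten_percent_of_mean * 10    # mean of Petrol_Consumption
relative_rmse = rmse / target_mean        # RMSE as a fraction of the mean

print(round(rmse, 2))
print(round(target_mean, 2))
print(round(relative_rmse, 3))
print(rmse > ten_percent_of_mean)
```
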
There could be many reasons for a regression algorithm to not perform that well; some of them are: 





