Decision Trees in Python with Scikit-Learn
-------------------------------------------------------------

Introduction
-----------------
A decision tree is one of the most widely used supervised machine learning algorithms; it can perform both regression and classification tasks.

For each attribute in the dataset, the decision tree algorithm forms a node, with the most important attribute placed at the root node. To evaluate an instance, we start at the root node and work our way down the tree, at each node following the branch whose condition or "decision" the instance meets. This process continues until a leaf node is reached, which contains the prediction or outcome of the decision tree.

Consider a scenario where a person asks you to lend them your car for a day, and you have to make a decision whether or not to lend them the car. There are several factors that help determine your decision, some of which have been listed below:

![decison_tree_image](datasets_n_images/datasets_n_images/images/decison_tree_image.png 'decison_tree_image')
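
The root-to-leaf evaluation described above can be sketched in a few lines of plain Python. The factors used here (`has_license`, `is_close_friend`, `trip_km`) and the 50 km threshold are hypothetical, chosen only for illustration; they are not taken from the figure.

```python
# A hand-written decision tree for the car-lending scenario.
# All feature names and thresholds are hypothetical.
tree = {
    "question": lambda p: p["has_license"],
    "yes": {
        "question": lambda p: p["is_close_friend"],
        "yes": "lend",
        "no": {
            "question": lambda p: p["trip_km"] <= 50,
            "yes": "lend",
            "no": "do not lend",
        },
    },
    "no": "do not lend",
}


def decide(node, person):
    """Walk from the root to a leaf; a leaf is a plain string."""
    while isinstance(node, dict):
        branch = "yes" if node["question"](person) else "no"
        node = node[branch]
    return node


request = {"has_license": True, "is_close_friend": False, "trip_km": 30}
print(decide(tree, request))
```

A learned tree works the same way, except that the questions and thresholds are chosen by the training algorithm rather than written by hand.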

Advantages of Decision Trees
------------------------------

There are several advantages of using decision trees for predictive analysis:

>1. Decision trees can be used to predict both continuous and discrete values i.e. they work well for both regression and classification tasks.

>2. They require relatively less effort for training the algorithm.

>3. They can be used to classify non-linearly separable data.

>4. Prediction is fast compared to KNN: a prediction follows a single path from the root to a leaf, whereas KNN has to compare the query against the stored training points.
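
To illustrate the third point, here is a minimal sketch on synthetic XOR-style data, which no single straight line can separate. The data, the 500-point sample size, and the comparison against `LogisticRegression` are assumptions made for this example; exact accuracies will vary with the random seed.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic XOR-style data: the label is 1 when both coordinates share a sign.
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(500, 2))
y = (X[:, 0] * X[:, 1] > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Fit a tree and a linear model on the same split and compare test accuracy.
tree_acc = DecisionTreeClassifier(random_state=0).fit(X_train, y_train).score(X_test, y_test)
linear_acc = LogisticRegression().fit(X_train, y_train).score(X_test, y_test)

print("Decision tree accuracy:      ", tree_acc)
print("Logistic regression accuracy:", linear_acc)
```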

# 1. Decision Tree for Classification
---------------------------------------------------------

Here, we will predict whether a bank note is authentic or fake based on four attributes of an image of the note: the variance of the wavelet-transformed image, the kurtosis of the image, the entropy, and the skewness of the image.

In [1]:
# doing the minimum necessary imports
# more modules would be imported as and when needed

import pandas as pd  
import numpy as np  
import matplotlib.pyplot as plt  
%matplotlib inline
#try %matplotlib notebook

# reading data from CSV file. 
# reading bank currency note data into pandas dataframe.
bankdata = pd.read_csv("./datasets_n_images/datasets_n_images/datasets_module_4/bill_authentication.csv")  

# Exploratory Data Analysis
# your code goes here


# Class = 0: not fake
# Class = 1: fake

In [None]:
# Data Preprocessing
# Data preprocessing involves 
# (1) Dividing the data into attributes and labels and 
# (2) dividing the data into training and testing sets.

# To divide the data into attributes and labels, do :
# your code goes here



# the final preprocessing step is to divide data into training and test sets

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20)


# Training the Algorithm. Here we would use DecisionTreeClassifier
# your code goes here


# make predictions on the test data
# your code goes here



# Evaluating the Algorithm
# your code goes here
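
One possible way to fill in the placeholders above is sketched below. Because the CSV file is not available here, the sketch builds a synthetic stand-in with `make_classification`; the column names (`Variance`, `Skewness`, `Curtosis`, `Entropy`, `Class`) are assumptions about the file's layout, and the numbers printed will not match the real bank note data.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the bank note data; column names are assumptions.
features, labels = make_classification(
    n_samples=500, n_features=4, n_informative=4, n_redundant=0, random_state=0
)
bankdata = pd.DataFrame(features, columns=["Variance", "Skewness", "Curtosis", "Entropy"])
bankdata["Class"] = labels

# (1) divide the data into attributes and labels
X = bankdata.drop("Class", axis=1)
y = bankdata["Class"]

# (2) divide the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=0
)

# train the classifier and make predictions on the test data
classifier = DecisionTreeClassifier(random_state=0)
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)

# evaluate the predictions
cm = confusion_matrix(y_test, y_pred)
print(cm)
print(classification_report(y_test, y_pred))
print("Accuracy:", accuracy_score(y_test, y_pred))
```

With the real data, the `make_classification` lines would be replaced by the `pd.read_csv` call from the first cell.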



Conclusion:
Try 1:
From the confusion matrix, the accuracy on the 275 test instances was about 98.5%, which corresponds to roughly 4 misclassified instances. The result changes on every run because train_test_split shuffles the data randomly when no random_state is given.
Try 2:
From the confusion matrix, you can see that out of 275 test instances, our algorithm misclassified only 2. This is 99.27% accuracy.
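
As a quick check on how a misclassification count maps to accuracy for 275 test instances, the arithmetic can be written out as follows. The helper name `accuracy_percent` is made up for this sketch.

```python
def accuracy_percent(total, misclassified):
    """Accuracy in percent, given the test-set size and the error count."""
    return 100 * (total - misclassified) / total


for errors in (2, 4):
    print(errors, "misclassified ->", round(accuracy_percent(275, errors), 2), "%")
```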

In [None]:
#Loop the above process for several test-set sizes
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report, confusion_matrix

test_sz = 0.85
for i in range(1, 10):
    # shrink the test set by 0.05 per iteration: 0.80, 0.75, ..., 0.40
    test_size = round(test_sz - 0.05 * i, 2)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_size)

    # Training the Algorithm. Here we would use DecisionTreeClassifier
    classifier = DecisionTreeClassifier()
    classifier.fit(X_train, y_train)

    # make predictions on the test data
    y_pred = classifier.predict(X_test)

    # Evaluating the Algorithm
    print("Iteration", i, "(test_size =", test_size, "):\n")
    print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred), "\n")
    print("\nclassification_report:\n", classification_report(y_test, y_pred))

# Remember : for evaluating classification-based ML algo use  
# confusion_matrix, classification_report and accuracy_score.
# And for evaluating regression-based ML Algo use Mean Squared Error(MSE), ...
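
The reminder above can be made concrete with a small sketch. The labels and target values below are toy numbers made up for illustration, not taken from either dataset.

```python
from sklearn.metrics import accuracy_score, confusion_matrix, mean_squared_error

# Toy classification labels (made up for illustration)
y_true_cls = [0, 1, 1, 0, 1]
y_pred_cls = [0, 1, 0, 0, 1]
cls_cm = confusion_matrix(y_true_cls, y_pred_cls)
cls_acc = accuracy_score(y_true_cls, y_pred_cls)
print(cls_cm)
print("Accuracy:", cls_acc)

# Toy regression targets (made up for illustration)
y_true_reg = [3.0, 5.0]
y_pred_reg = [2.0, 7.0]
reg_mse = mean_squared_error(y_true_reg, y_pred_reg)
print("MSE:", reg_mse)
```

Rows of the confusion matrix correspond to the true classes and columns to the predicted classes, so the off-diagonal entries count the misclassified instances.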

# 2. Decision Tree for Regression
------------------------------------------------------

We will use the petrol_consumption.csv dataset to try to predict gas consumption (in millions of gallons) in 48 US states based upon gas tax (in cents), per capita income (dollars), paved highways (in miles) and the proportion of the population with a driver's license.

In [2]:
# Importing Libraries
import pandas as pd  
import numpy as np  
import matplotlib.pyplot as plt  
%matplotlib inline

# Importing the Dataset
dataset = pd.read_csv('./datasets_n_images/datasets_n_images/datasets_module_4/petrol_consumption.csv')

# your code goes here
dataset.sample(5)

Unnamed: 0,Petrol_tax,Average_income,Paved_Highways,Population_Driver_licence(%),Petrol_Consumption
22,9.0,4897,2449,0.511,464
5,10.0,5342,1333,0.571,457
33,7.5,3357,4121,0.547,628
20,7.0,4593,7834,0.663,649
2,9.0,3865,1586,0.58,561


In [3]:
# To see statistical details of the dataset, execute the following command:
dataset.describe()

Unnamed: 0,Petrol_tax,Average_income,Paved_Highways,Population_Driver_licence(%),Petrol_Consumption
count,48.0,48.0,48.0,48.0,48.0
mean,7.668333,4241.833333,5565.416667,0.570333,576.770833
std,0.95077,573.623768,3491.507166,0.05547,111.885816
min,5.0,3063.0,431.0,0.451,344.0
25%,7.0,3739.0,3110.25,0.52975,509.5
50%,7.5,4298.0,4735.5,0.5645,568.5
75%,8.125,4578.75,7156.0,0.59525,632.75
max,10.0,5342.0,17782.0,0.724,968.0


In [5]:
# Preparing the Data
# divide the data into attributes and labels

# your code goes here
X = dataset.drop('Petrol_Consumption',axis=1)
y= dataset['Petrol_Consumption']

# dividing data into training and testing set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.10,random_state=0)


# Training and Making Predictions
# Note: we will use the DecisionTreeRegressor class, not DecisionTreeClassifier
# your code goes here
from sklearn.tree import DecisionTreeRegressor
regressor = DecisionTreeRegressor()
regressor.fit(X_train,y_train)


# To make predictions on the test set, 
# your code goes here
y_pred = regressor.predict(X_test)

# Now let's compare some of our predicted values with the actual values 
df=pd.DataFrame({'Actual':y_test, 'Predicted':y_pred})  
df  

Unnamed: 0,Actual,Predicted
29,534,487.0
4,410,524.0
26,577,574.0
30,571,554.0
32,577,574.0


**Remember:** In your case the records compared may be different, depending upon the training and testing split. The train_test_split method splits the data randomly, so unless the same random_state is used we likely won't have the same training and test sets.
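
The effect of `random_state` on the split can be seen in a small sketch. The toy arrays below are made up for illustration and stand in for the real attributes and labels.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)  # toy attributes
y = np.arange(10)                 # toy labels

# Same random_state: identical splits on every call
_, test_a, _, _ = train_test_split(X, y, test_size=0.2, random_state=0)
_, test_b, _, _ = train_test_split(X, y, test_size=0.2, random_state=0)
same_split = np.array_equal(test_a, test_b)
print("Same test rows with random_state=0:", same_split)

# No random_state: the split is drawn afresh each time and usually differs
_, test_c, _, _ = train_test_split(X, y, test_size=0.2)
print("Test rows without random_state:", test_c.tolist())
```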

In [8]:
# Evaluating the Algorithm
# your code goes here
from sklearn import metrics
print("\n",metrics.mean_absolute_error(y_test,y_pred))
print("\n",metrics.mean_squared_error(y_test,y_pred))
print("\n",np.sqrt(metrics.mean_squared_error(y_test,y_pred)))


 36.8

 3102.4

 55.69919209467943


The mean absolute error for our algorithm is 36.8, which is less than 10% of the mean of all the values in the 'Petrol_Consumption' column (10% of 576.77 is 57.68). This means that our algorithm did a fine prediction job, although with only 5 test records the estimate is rough.
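
The three error figures can be reproduced by hand from the five actual/predicted pairs in the comparison table, which also makes the comparison against the column mean explicit. This is a sketch of the arithmetic with NumPy only; the variable names are chosen for this example.

```python
import numpy as np

# Actual and predicted values copied from the comparison table above
actual = np.array([534, 410, 577, 571, 577])
predicted = np.array([487.0, 524.0, 574.0, 554.0, 574.0])

errors = actual - predicted
mae = np.mean(np.abs(errors))   # mean absolute error
mse = np.mean(errors ** 2)      # mean squared error
rmse = np.sqrt(mse)             # root mean squared error

mean_consumption = 576.770833   # mean of Petrol_Consumption from dataset.describe()
print("MAE:", mae)
print("MSE:", mse)
print("RMSE:", rmse)
print("MAE as % of mean:", round(100 * mae / mean_consumption, 2))
```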

In [1]:
# necessary imports
import pandas as pd  
import numpy as np  
import matplotlib.pyplot as plt  
%matplotlib inline

# loading the dataset
dataset = pd.read_csv('./datasets_n_images/datasets_n_images/datasets_module_4/petrol_consumption.csv')

dataset.head() 

Unnamed: 0,Petrol_tax,Average_income,Paved_Highways,Population_Driver_licence(%),Petrol_Consumption
0,9.0,3571,1976,0.525,541
1,9.0,4092,1250,0.572,524
2,9.0,3865,1586,0.58,561
3,7.5,4870,2351,0.529,414
4,8.0,4399,431,0.544,410


In [2]:
# your code goes here
dataset.sample(5)

Unnamed: 0,Petrol_tax,Average_income,Paved_Highways,Population_Driver_licence(%),Petrol_Consumption
12,7.0,4817,6930,0.574,525
17,7.0,3718,4725,0.54,714
14,7.0,4332,8159,0.608,566
32,8.0,3063,6524,0.578,577
42,7.0,4300,3635,0.603,632


In [3]:
# your code goes here
dataset.describe()

Unnamed: 0,Petrol_tax,Average_income,Paved_Highways,Population_Driver_licence(%),Petrol_Consumption
count,48.0,48.0,48.0,48.0,48.0
mean,7.668333,4241.833333,5565.416667,0.570333,576.770833
std,0.95077,573.623768,3491.507166,0.05547,111.885816
min,5.0,3063.0,431.0,0.451,344.0
25%,7.0,3739.0,3110.25,0.52975,509.5
50%,7.5,4298.0,4735.5,0.5645,568.5
75%,8.125,4578.75,7156.0,0.59525,632.75
max,10.0,5342.0,17782.0,0.724,968.0


In [4]:
X = dataset.drop('Petrol_Consumption', axis=1)  
y = dataset['Petrol_Consumption']  

In [8]:
# divide the data into attributes and labels
X = dataset.drop('Petrol_Consumption', axis=1)
y = dataset['Petrol_Consumption']

# split the data, then train the algorithm
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.15,random_state=0)

from sklearn.tree import DecisionTreeRegressor
regressor = DecisionTreeRegressor()
regressor.fit(X_train,y_train)

DecisionTreeRegressor(ccp_alpha=0.0, criterion='mse', max_depth=None,
                      max_features=None, max_leaf_nodes=None,
                      min_impurity_decrease=0.0, min_impurity_split=None,
                      min_samples_leaf=1, min_samples_split=2,
                      min_weight_fraction_leaf=0.0, presort='deprecated',
                      random_state=None, splitter='best')
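
The printed representation above lists the regressor's hyperparameters as they appeared in the scikit-learn version used for this run. A minimal way to inspect the defaults of whichever version is installed is `get_params()`; the exact set of names and values may differ from the output above (for instance, newer releases name the default criterion `squared_error` rather than `mse`).

```python
from sklearn.tree import DecisionTreeRegressor

# Collect the default hyperparameters of an unfitted regressor
params = DecisionTreeRegressor().get_params()
for name in sorted(params):
    print(name, "=", params[name])
```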

In [9]:
y_pred = regressor.predict(X_test)  

# Now let's compare some of our predicted values with the actual values 
df=pd.DataFrame({'Actual':y_test, 'Predicted':y_pred})  
df  

Unnamed: 0,Actual,Predicted
29,534,541.0
4,410,632.0
26,577,574.0
30,571,554.0
32,577,631.0
37,704,640.0
34,487,648.0
40,587,649.0


In [10]:
# Evaluating the Algorithm
from sklearn import metrics  
print('Mean Absolute Error:', metrics.mean_absolute_error(y_test, y_pred))  
print('Mean Squared Error:', metrics.mean_squared_error(y_test, y_pred))  
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_test, y_pred)))  

Mean Absolute Error: 73.75
Mean Squared Error: 10801.0
Root Mean Squared Error: 103.92785959500947
