In [None]:
pwd

## About Data and Objective

#### Use Decision Trees to build a regressor and Grid Search to find the optimal hyperparameter values for the given dataset, and evaluate the model on appropriate metrics. The goal is to predict gas consumption (in millions of gallons) in 48 US states based on:
1. Gas Tax (in cents)
2. Per Capita Income (dollars)
3. Paved Highways (in miles)
4. The proportion of the population with a driver's license.

#### Importing Packages

In [None]:
import warnings
warnings.filterwarnings('ignore')
import os
import pandas as pd
from pandas import DataFrame
import pylab as pl
import numpy as np
import seaborn as sns
from scipy import stats
import matplotlib.pyplot as plt
%matplotlib inline

In [None]:
Fuel_cons=pd.read_csv("../input/petrol_consumption.csv") #Importing Data

In [None]:
Fuel_cons.head()

#### Structure and Datatypes of Dataset along with Summary Statistics

In [None]:
print(Fuel_cons.shape)
Fuel_cons.info()

In [None]:
pd.options.display.float_format = '{:.4f}'.format
data_summary=Fuel_cons.describe()
data_summary.T

###### Checking for Outliers

In [None]:
# Flag values on or beyond the 1.5 * IQR fences in every column
for k, v in Fuel_cons.items():
    q1 = v.quantile(0.25)
    q3 = v.quantile(0.75)
    iqr = q3 - q1  # interquartile range
    v_col = v[(v <= q1 - 1.5 * iqr) | (v >= q3 + 1.5 * iqr)]
    perc = np.shape(v_col)[0] * 100.0 / np.shape(Fuel_cons)[0]
    print("Column %s outliers = %.2f%%" % (k, perc))
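
The IQR rule used in the loop above can be checked on a small hand-made series. This is only a sketch: the helper `iqr_outliers` and the `toy` values are hypothetical and not part of the dataset.

```python
import pandas as pd

def iqr_outliers(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Return the entries lying on or beyond the IQR fences (hypothetical helper)."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return values[(values <= lower) | (values >= upper)]

# Made-up values: six close together and one far away
toy = pd.Series([10, 12, 11, 13, 12, 11, 50])
flagged = iqr_outliers(toy)
print(flagged.tolist())  # only the value 50 lies outside the fences
```

Here q1 = 11 and q3 = 12.5, so the fences are 8.75 and 14.75 and only 50 is flagged.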

In [None]:
plt.figure(figsize=(12,5))
Fuel_cons.boxplot(patch_artist=True,vert=False)

###### Correlation Chart

In [None]:
my_corr=Fuel_cons.corr()
my_corr

In [None]:
plt.figure(figsize=(12,5))
sns.heatmap(my_corr,linewidth=0.5)
plt.show()

##### Understanding Data: Exploratory Data Analysis 

Calculating Correlation, P-value and Regression plot

To understand the spread of the data points, each predictor is plotted against the target with a regression line, together with its Pearson correlation coefficient and p-value.

In [None]:
pearson_coef, p_value = stats.pearsonr(Fuel_cons['Petrol_tax'], Fuel_cons['Petrol_Consumption'])
print("The Pearson Correlation Coefficient of Petrol_tax is", pearson_coef, " with a P-value of P =", p_value)  
sns.regplot(x="Petrol_tax", y="Petrol_Consumption", data=Fuel_cons)
plt.ylim(0,)

In [None]:
pearson_coef, p_value = stats.pearsonr(Fuel_cons['Average_income'], Fuel_cons['Petrol_Consumption'])
print("The Pearson Correlation Coefficient of Average_income is", pearson_coef, " with a P-value of P =", p_value)  
sns.regplot(x="Average_income", y="Petrol_Consumption", data=Fuel_cons)
plt.ylim(0,)

In [None]:
pearson_coef, p_value = stats.pearsonr(Fuel_cons['Paved_Highways'], Fuel_cons['Petrol_Consumption'])
print("The Pearson Correlation Coefficient of Paved_Highways is", pearson_coef, " with a P-value of P =", p_value)  
sns.regplot(x="Paved_Highways", y="Petrol_Consumption", data=Fuel_cons)
plt.ylim(0,)

In [None]:
pearson_coef, p_value = stats.pearsonr(Fuel_cons['Population_Driver_licence(%)'], Fuel_cons['Petrol_Consumption'])
print("The Pearson Correlation Coefficient of Population_Driver_licence(%) is", pearson_coef, " with a P-value of P =", p_value)  
sns.regplot(x="Population_Driver_licence(%)", y="Petrol_Consumption", data=Fuel_cons)
plt.ylim(0,)
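
As a cross-check on what `stats.pearsonr` reports in the cells above, the coefficient can be computed by hand as the covariance divided by the product of the standard deviations. The arrays `x` and `y` below are made-up values, not columns of the dataset.

```python
import numpy as np
from scipy import stats

# Made-up, roughly linear data (an assumption for illustration)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# Pearson r = sum of centred products / sqrt(product of centred sums of squares)
x_c = x - x.mean()
y_c = y - y.mean()
r_manual = (x_c * y_c).sum() / np.sqrt((x_c ** 2).sum() * (y_c ** 2).sum())

# scipy returns the same coefficient plus a two-sided p-value
r_scipy, p_value = stats.pearsonr(x, y)
print(r_manual, r_scipy, p_value)
```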

##### Distribution of the predictors and the target of the dataset

In [None]:
plt.figure(figsize=(12,5))
sns.distplot(Fuel_cons['Petrol_tax'])

In [None]:
plt.figure(figsize=(12,5))
sns.distplot(Fuel_cons['Paved_Highways'])

In [None]:
plt.figure(figsize=(12,5))
sns.distplot(Fuel_cons['Average_income'])

In [None]:
plt.figure(figsize=(12,5))
sns.distplot(Fuel_cons['Petrol_Consumption'])

In [None]:
plt.figure(figsize=(12,5))
sns.distplot(Fuel_cons['Population_Driver_licence(%)'])

#### Viz 3: The plot below shows Petrol Consumption by petrol tax.

In the graph plotted below, the curve for petrol consumption at lower petrol tax (brown curve) is smooth: it has a high bandwidth, a shallow kernel, and high density. The area under the blue curve has low density and amplitude, which indicates that high petrol consumption is less common at that petrol tax level.

In [None]:
a = sns.FacetGrid(Fuel_cons, hue = 'Petrol_Consumption', aspect=4 )
a.map(sns.kdeplot, 'Petrol_tax', shade= True )
a.set(xlim=(0 ,Fuel_cons['Petrol_tax'].max()))
a.add_legend()

#### Viz 4: The factor plot shows that petrol taxes are mostly lower where there are more paved highways, and higher where there are fewer paved highways. This suggests that higher petrol taxes have not translated into better highways, which may point to poor management by the authorities.

In [None]:
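# factorplot was renamed to catplot in newer seaborn releases; kind='point' keeps the old default
axes = sns.catplot(x='Petrol_tax', y='Paved_Highways', data=Fuel_cons, kind='point', aspect=2.5)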

##### Dividing Data into Predictors and Target Variables

In [None]:
predictor_var= Fuel_cons[['Petrol_tax','Average_income','Paved_Highways','Population_Driver_licence(%)']] #all columns except the last one
target_var= Fuel_cons['Petrol_Consumption'] #only the last column

In [None]:
predictor_var.shape

In [None]:
target_var.shape

### Plotting Decision Tree without using any external tool for Optimization such as Grid Search CV

##### Importing Test Train Split from sklearn package

In [None]:
from sklearn.model_selection import train_test_split

**train_test_split returns four objects. We will name them:\
X_train, X_test, Y_train, Y_test**

**train_test_split needs the parameters:\
X, y, test_size=0.3, and random_state=123.**

**X and y are the predictors and the target before the split, test_size is the fraction of the data held out for testing, and random_state ensures that we obtain the same split on every run.**

In [None]:
X_train, X_test, Y_train, Y_test = train_test_split(predictor_var,target_var, test_size=0.30, random_state=123)
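
To see what these parameters do, here is a small sketch on placeholder arrays with the same number of rows as the dataset (48). The arrays `X` and `y` are dummy values, not the petrol data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data: 48 rows and 4 columns, like the predictors (values are dummies)
X = np.arange(48 * 4).reshape(48, 4)
y = np.arange(48)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.30, random_state=123)
print(X_tr.shape, X_te.shape)  # 33 training rows and 15 test rows

# Repeating the call with the same random_state reproduces the same split
X_tr2, X_te2, _, _ = train_test_split(X, y, test_size=0.30, random_state=123)
print(np.array_equal(X_te, X_te2))
```

With 48 rows and test_size=0.30, the test set is rounded up to 15 rows, which leaves 33 rows for training.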

### Objective 1: Applied Decision Tree algorithm for regression.

Import Decision Tree Regressor and fit the model to the training data.



In [None]:
from sklearn.tree import DecisionTreeRegressor

In [None]:
tree = DecisionTreeRegressor(max_depth=4,max_features=4)

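##### The regressor uses the squared-error criterion by default (named "mse" in older scikit-learn releases and "squared_error" in newer ones), so the plotted tree shows the MSE of each node.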

In [None]:
tree.fit(X_train, Y_train)

##### Make predictions and evaluate output.



In [None]:
predictions = tree.predict(X_test)

In [None]:
df=pd.DataFrame({'Actual':Y_test, 'Predicted':predictions})
df.head(5)

We see that the predictions are not accurate. Let's evaluate the prediction accuracy.

##### Evaluating the Prediction Accuracy


In [None]:
from sklearn import metrics
print('Mean Absolute Error:', metrics.mean_absolute_error(Y_test,predictions))
print('Mean Squared Error:', metrics.mean_squared_error(Y_test,predictions))
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(Y_test,predictions)))
print('r2_score:', metrics.r2_score(Y_test,predictions))
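
The four metrics printed above can be worked out by hand on a few made-up values, which makes it easier to read the real numbers. `y_true` and `y_pred` below are illustrative only.

```python
import numpy as np
from sklearn import metrics

# Made-up actual and predicted values (not from the dataset)
y_true = np.array([500.0, 600.0, 700.0, 800.0])
y_pred = np.array([520.0, 580.0, 730.0, 770.0])

errors = y_true - y_pred
mae = np.abs(errors).mean()          # mean absolute error
mse = (errors ** 2).mean()           # mean squared error
rmse = np.sqrt(mse)                  # same units as the target
ss_res = (errors ** 2).sum()
ss_tot = ((y_true - y_true.mean()) ** 2).sum()
r2 = 1 - ss_res / ss_tot             # share of variance explained

print(mae, mse, rmse, r2)
# The hand-computed values agree with sklearn's implementations
print(np.isclose(mae, metrics.mean_absolute_error(y_true, y_pred)))
print(np.isclose(r2, metrics.r2_score(y_true, y_pred)))
```

An r2 close to 1 means the predictions explain most of the variance in the target, while a value near 0 means the model does little better than always predicting the mean.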

##### Looking at Feature Importances

In [None]:
tree.feature_importances_
pd.Series(tree.feature_importances_,index=predictor_var.columns).sort_values(ascending=False)

Importing export_graphviz from the sklearn library to plot the decision tree

In [None]:
from sklearn.tree import export_graphviz

In [None]:
dot_data = export_graphviz(tree, filled=True, rounded=True, feature_names=predictor_var.columns, out_file=None)

In [None]:
import graphviz

In [None]:
graphviz.Source(dot_data)

The trees follow a top-down, greedy approach known as recursive binary splitting. It is called ‘top-down’ because it begins at the top of the tree, where all the observations belong to a single region, and successively splits the predictor space into two new branches further down the tree. It is called ‘greedy’ because at each step the algorithm looks only for the best variable and threshold for the current split, and does not consider future splits that might lead to a better tree.

Stopping Criteria: The most common stopping procedure is to use a minimum count on the number of training instances assigned to each leaf node. If the count is less than some minimum then the split is not accepted and the node is taken as a final leaf node.
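
One greedy step and the minimum-count stopping rule described above can be sketched for a single feature as follows. The function `best_split` is a hypothetical illustration, not scikit-learn's implementation, and the arrays are made up.

```python
import numpy as np

def best_split(x, y, min_samples_leaf=1):
    """Return (threshold, weighted_mse) of the best single split on feature x.

    Returns (None, inf) when no split leaves min_samples_leaf rows on both sides.
    """
    order = np.argsort(x)
    x_sorted, y_sorted = x[order], y[order]
    n = len(y_sorted)
    best = (None, np.inf)
    for i in range(min_samples_leaf, n - min_samples_leaf + 1):
        if x_sorted[i] == x_sorted[i - 1]:
            continue  # no threshold can separate equal values
        left, right = y_sorted[:i], y_sorted[i:]
        # MSE of a node that predicts its mean equals the variance of its targets
        weighted = (len(left) * left.var() + len(right) * right.var()) / n
        if weighted < best[1]:
            threshold = (x_sorted[i - 1] + x_sorted[i]) / 2
            best = (threshold, weighted)
    return best

# Made-up data with an obvious jump between x = 7 and x = 8
x = np.array([5.0, 6.0, 7.0, 8.0, 9.0, 10.0])
y = np.array([900.0, 880.0, 910.0, 500.0, 520.0, 510.0])

threshold, weighted_mse = best_split(x, y)
print(threshold, weighted_mse)

# With six rows, requiring four rows per leaf makes every split unacceptable
print(best_split(x, y, min_samples_leaf=4))
```

A full tree repeats this search over every feature and then recurses into the two resulting regions.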

For this dataset, Petrol_Consumption is the target variable and the rest are predictors.

**Q. Why does the tree choose Petrol_tax as its root?**\
**Ans:** At the root, the tree picks the split that gives the largest reduction in MSE over the training samples, and here that split is on Petrol_tax. The feature importances computed above carry related information, since they measure how much the splits on each feature reduce the MSE in total. Petrol_tax is therefore selected as the root of the tree, and the other features are used in the decision nodes below it.

**The Split:** Each split increases the purity of the nodes with respect to the target variable. The decision tree evaluates splits on all available variables and then selects the split that results in the most homogeneous sub-nodes, which for regression means the lowest MSE.

##### Hyperparameters Considered:   max_depth=4,  max_features=4
1. The tree uses the Petrol Tax feature with a threshold value of 7.25 to initially divide the samples.
2. Out of 33 samples, the root (Petrol Tax) splits into two other decision nodes:\
   i.  Where Petrol_tax <= 7.25, it uses another feature (Paved Highways) as the decision node, with a sample count of 15.\
   ii. Where Petrol_tax > 7.25, it uses the feature Average Income as the decision node, with a sample count of 18.
3. The tree uses Mean Squared Error (MSE) as its splitting criterion.
4. In the decision tree plotted above, every decision node is a condition on how to split the values. The lower the MSE, the better the result.
5. Below the decision nodes there are several boxes known as leaf nodes. Where a leaf has MSE=0, the tree stops branching from that node.
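
The same details that are read off the plotted tree (split feature, threshold, sample count, and MSE per node) can also be listed programmatically through the fitted estimator's `tree_` attribute. The sketch below uses synthetic data with hypothetical feature names, so the numbers will not match the tree above.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (an assumption): the target jumps when the first feature crosses 5
rng = np.random.RandomState(0)
X = rng.uniform(0, 10, size=(60, 2))
y = np.where(X[:, 0] <= 5.0, 100.0, 300.0) + rng.normal(0, 5, size=60)

demo_tree = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X, y)
t = demo_tree.tree_
feature_names = ["feature_a", "feature_b"]  # hypothetical names

for node in range(t.node_count):
    samples = t.n_node_samples[node]
    mse = t.impurity[node]
    if t.children_left[node] == -1:  # -1 marks a leaf
        print(f"node {node}: leaf, samples={samples}, mse={mse:.2f}")
    else:
        name = feature_names[t.feature[node]]
        print(f"node {node}: {name} <= {t.threshold[node]:.2f}, samples={samples}, mse={mse:.2f}")
```

Node 0 is the root, so its line shows which feature and threshold the tree chose first.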



### Using Grid Search & Cross Validation

#### Now I will use grid search cv to find the optimal value of hyper_parameters to plot the Decision Tree

In [None]:
from sklearn.model_selection import GridSearchCV

In [None]:
# max_features cannot exceed the number of predictors (4), so larger values are left out
param_grid = [{"max_depth":[3,4,5, None], "max_features":[3,4]}]

In [None]:
gs = GridSearchCV(estimator=DecisionTreeRegressor(random_state=123),param_grid = param_grid,cv=10)

In [None]:
gs.fit(X_train, Y_train)

In [None]:
gs.cv_results_['params']

In [None]:
gs.cv_results_['rank_test_score']

In [None]:
gs.best_estimator_
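
Rather than copying the winning values by hand, the fitted search object exposes them directly. The sketch below runs on synthetic data, so its result is not the notebook's; the explicit `scoring` argument is an assumption made here so that the search minimises MSE, whereas without it GridSearchCV ranks regressors by r2.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the training data (an assumption)
rng = np.random.RandomState(123)
X = rng.uniform(0, 10, size=(40, 4))
y = 50 * X[:, 0] - 20 * X[:, 1] + rng.normal(0, 10, size=40)

grid = {"max_depth": [3, 4, 5, None], "max_features": [3, 4]}
search = GridSearchCV(
    DecisionTreeRegressor(random_state=123),
    param_grid=grid,
    cv=5,
    scoring="neg_mean_squared_error",  # higher is better, so MSE is negated
)
search.fit(X, y)

print(search.best_params_)   # winning combination
print(-search.best_score_)   # its mean cross-validated MSE
tuned_tree = search.best_estimator_  # already refitted on all of X, y
```

Using `search.best_estimator_` for the later predictions keeps the final model in step with whatever the search selected.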

In [None]:
tree = DecisionTreeRegressor(max_depth=3,max_features=4)

In [None]:
tree.fit(X_train, Y_train)

In [None]:
predictions = tree.predict(X_test)

In [None]:
df=pd.DataFrame({'Actual':Y_test, 'Predicted':predictions})
df.head(5) #Check the top 5 predictions and actual values.

In [None]:
from sklearn import metrics
print('Mean Absolute Error:', metrics.mean_absolute_error(Y_test,predictions))
print('Mean Squared Error:', metrics.mean_squared_error(Y_test,predictions))
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(Y_test,predictions)))
print('r2_score:', metrics.r2_score(Y_test,predictions))

We use graphviz to plot the dot data as a decision tree.

In [None]:
from sklearn.tree import export_graphviz

In [None]:
dot_data = export_graphviz(tree, filled=True, rounded=True, feature_names=predictor_var.columns, out_file=None)

In [None]:
import graphviz

In [None]:
graphviz.Source(dot_data)

The tree above is the optimised version of the base tree used earlier in this assignment. For the optimisation I used Grid Search to find the best estimator, which limits the depth of the tree, with the hyperparameters shown below:

##### Hyperparameters Considered:   max_depth=3,  max_features=4
**The reasons for setting these are:\
i. max_depth, the maximum (vertical) depth of the tree, controls over-fitting, because a deeper tree lets the model learn relations that are very specific to a particular sample.\
ii. max_features is the number of features considered at each split. The candidates are selected randomly, and higher values can result in over-fitting.**

We used the parameters mentioned above to improve the accuracy of the decision tree and reduce its MSE. Restricting the tree in this way can also be seen as one way of pruning it.
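
The effect of max_depth on over-fitting can be illustrated by comparing training and test r2 for several depths. This is a sketch on synthetic, noisy data (an assumption), so the scores say nothing about the petrol dataset itself.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data: one informative feature, three noise features, noisy target
rng = np.random.RandomState(0)
X = rng.uniform(0, 10, size=(200, 4))
y = 30 * X[:, 0] + rng.normal(0, 40, size=200)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.30, random_state=123)

scores = {}
for depth in [1, 3, 5, None]:
    model = DecisionTreeRegressor(max_depth=depth, random_state=0).fit(X_tr, y_tr)
    # score() returns r2 for regressors
    scores[depth] = (model.score(X_tr, y_tr), model.score(X_te, y_te))
    print(f"max_depth={depth}: train r2={scores[depth][0]:.3f}, test r2={scores[depth][1]:.3f}")
```

The unrestricted tree (max_depth=None) fits the training rows essentially perfectly, and the gap between its training and test scores is the over-fitting that a depth limit is meant to reduce.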

Observation
1. The tree uses the Petrol Tax feature with a threshold value of 7.25 to initially divide the samples.
2. Out of 33 samples, the root (Petrol Tax) splits into two other decision nodes:\
   i.  Where Petrol_tax <= 7.25, it uses another feature (Paved Highways) as the decision node, with a sample count of 15.\
   ii. Where Petrol_tax > 7.25, it uses the feature Average Income as the decision node, with a sample count of 18.
3. The tree uses Mean Squared Error (MSE) as its splitting criterion.
4. In the decision tree plotted above, every decision node is a condition on how to split the values. The lower the MSE, the better the result.
5. Below the decision nodes there are several boxes known as leaf nodes. Where a leaf has MSE=0, the tree stops branching from that node.



#### Comparison of the Decision Tree without Grid Search and with Grid Search

In [None]:
# Metrics copied from the two evaluation cells above
DT_Regressor = [
    ['Max_Depth', 4, 3],
    ['Max_Feature', 4, 4],
    ['Mean Abs. Error', 106.73, 96.6],
    ['Mean Square Error', 18466.34, 15143.73],
    ['Root Mean Square', 135.89, 123.05],
    ['r2_Score', 0.20, 0.344],
]
Result_Summary2 = pd.DataFrame(DT_Regressor, columns=['Parameters', 'Without Grid Search', 'With Grid Search'])
Result_Summary2

**From the comparison DataFrame above we can see how using Grid Search affected the end result of the Decision Tree model. The r2 score increased after substituting the optimal hyperparameter values, and the MAE, MSE, and RMSE all decreased.**

References:

1. https://scikit-learn.org/stable/auto_examples/tree/plot_unveil_tree_structure.html
2. https://webfocusinfocenter.informationbuilders.com/wfappent/TLs/TL_rstat/source/DecisionTree47.html
3. https://machinelearningmastery.com/classification-and-regression-trees-for-machine-learning/