# 1. Decision Trees:

1. Decision Trees (DTs) are a _non-parametric supervised_ learning method.
2. They are used for _classification_ and _regression_.
3. The goal is to create a model that predicts the value of a target variable by learning simple decision rules inferred from the data features.

The data come in the form: $(X,Y)=(x_{1},x_{2},x_{3},...,x_{k},Y)$

where the input samples are given as an array of shape (n_samples, n_features)

and the output values are given as an array of shape (n_samples,).

The goal is to build an estimator $\phi_{L} : X \rightarrow Y$ that minimizes the expected loss:

$Err(\phi_{L}) = E_{X,Y} \{L (Y, \phi_{L}.predict(X))\}$
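
In practice the expectation is not available, so the error is approximated by an average over a finite sample. The sketch below is only an illustration: the zero-one loss and the one-rule `toy_predict` function are assumptions standing in for $L$ and $\phi_{L}$, and the four data points are made up.

```python
def zero_one_loss(y_true, y_pred):
    """L(Y, prediction): 0 when the prediction is right, 1 otherwise."""
    return 0 if y_true == y_pred else 1


def empirical_error(predict, X, y, loss=zero_one_loss):
    """Average loss over a sample, approximating E_{X,Y}{L(Y, phi(X))}."""
    return sum(loss(yi, predict(xi)) for xi, yi in zip(X, y)) / len(y)


# Hypothetical one-rule estimator: predict class 1 when the first feature exceeds 0.5
def toy_predict(x):
    return 1 if x[0] > 0.5 else 0


# Made-up data: X has shape (n_samples, n_features), y has shape (n_samples,)
X = [[0.1, 3.0], [0.7, 1.0], [0.9, 2.0], [0.4, 0.0]]
y = [0, 1, 0, 1]

error = empirical_error(toy_predict, X, y)
print(error)  # two of the four samples are misclassified
```

A decision tree learner searches for the rules that make this kind of average small on the training data.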


<img src="files/images/1.png">

## Another Tree : 

<img src="files/images/2.png">


## 1.1 Using Scikit Learn for DTs (Growing a Tree)

A tree is grown iteratively from the top down by choosing, at each step, the variable that _best splits_ the set of items.

### Gini Impurity :
 $I_{G}(f)=\sum _{i=1}^{J}f_{i}(1-f_{i})=\sum _{i=1}^{J}(f_{i}-{f_{i}}^{2})=\sum _{i=1}^{J}f_{i}-\sum _{i=1}^{J}{f_{i}}^{2}=1-\sum _{i=1}^{J}{f_{i}}^{2}=\sum _{i\neq k}f_{i}f_{k}$
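
Reading $f_{i}$ as the fraction of items in a node that belong to class $i$, the closed form $1-\sum f_{i}^{2}$ is easy to compute by hand. The helper below is a minimal sketch (the function name and the toy label lists are illustrative, not part of scikit-learn):

```python
from collections import Counter


def gini_impurity(labels):
    """Gini impurity 1 - sum_i f_i^2, where f_i is the fraction of class i."""
    n = len(labels)
    counts = Counter(labels)
    return 1.0 - sum((count / n) ** 2 for count in counts.values())


pure = gini_impurity([1, 1, 1, 1])      # a single class: no impurity
balanced = gini_impurity([0, 0, 1, 1])  # two classes in equal proportion
skewed = gini_impurity([0, 0, 0, 1])    # one class dominates

print(pure, balanced, skewed)
```

For two classes the impurity peaks at 0.5 when the classes are evenly mixed and falls to 0 for a pure node, which is why a split that produces purer children is preferred.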

### Information Gain (IG)
$H(T)=I_{E}(p_{1},p_{2},...,p_{J})=-\sum _{i=1}^{J}p_{i}\log _{2}p_{i}$

$IG(T,a) = H(T) - H(T|a)$
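
The two formulas above can be checked on a tiny example. This is a sketch under stated assumptions: $H(T|a)$ is taken to be the size-weighted average entropy of the groups induced by attribute $a$, and the four-passenger toy data (`survived`, `sex`, `coin`) is invented for illustration.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """H(T) = -sum_i p_i * log2(p_i) over the classes present in labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def information_gain(labels, attribute_values):
    """IG(T, a) = H(T) - H(T|a), with H(T|a) the weighted entropy of each group."""
    n = len(labels)
    groups = {}
    for label, value in zip(labels, attribute_values):
        groups.setdefault(value, []).append(label)
    conditional = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - conditional


# Made-up data: the first attribute separates the classes, the second does not
survived = [1, 1, 0, 0]
sex = ["female", "female", "male", "male"]
coin = ["a", "b", "a", "b"]

gain_sex = information_gain(survived, sex)
gain_coin = information_gain(survived, coin)
print(gain_sex, gain_coin)
```

An attribute that splits the node into pure groups recovers the full entropy of the parent, while an uninformative attribute yields no gain.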



### Classification Models: 

http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html#sklearn.tree.DecisionTreeClassifier.fit

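class sklearn.tree.DecisionTreeClassifier(criterion='gini', splitter='best', max_depth=None, min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_features=None, random_state=None, max_leaf_nodes=None, min_impurity_split=1e-07, class_weight=None, presort=False)

Note: this is the signature from an older scikit-learn release. The `min_impurity_split` and `presort` parameters have been removed in later releases.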


<img src="files/images/3.png">

# 2. Titanic Example: 

Given a certain set of parameters about a passenger, can we predict whether he/she will survive or not?

## 2.1 Importing Packages:

First, we import the packages that we are going to use:

In [7]:
#####################################################################
# Importing the Relevant Packages                                   #
#####################################################################

import pandas as pd
import numpy as np
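from sklearn.impute import SimpleImputer  # replaces the Imputer class, which was removed in scikit-learn 0.22
from sklearn.preprocessing import LabelBinarizer, StandardScaler, LabelEncoder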
from sklearn.metrics import confusion_matrix
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier, AdaBoostClassifier, GradientBoostingClassifier
from sklearn.model_selection import train_test_split, cross_val_score, StratifiedKFold, GridSearchCV, RandomizedSearchCV
from sklearn.pipeline import Pipeline
#import xgboost as xgb
#import keras as krs
#from keras.models import Sequential
#from keras.layers import Dense
#from keras.wrappers.scikit_learn import KerasClassifier
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns
#import graphviz as gv
from sklearn import tree
from sklearn.tree import export_graphviz
#import pandoc
sns.set_style("darkgrid")
sns.set_context("paper")
sns.set(font_scale=1.25)


## 2.2 Importing, Pruning and visualizing the Data: 

Now we will import the data set from input files

In [9]:

#####################################################################
# Reading, Cleaning & Transforming Training Data                    #
#####################################################################

training_data = pd.read_csv("data/raw/train.csv")
print(training_data)


     PassengerId  Survived  Pclass  \
0              1         0       3   
1              2         1       1   
2              3         1       3   
3              4         1       1   
4              5         0       3   
5              6         0       3   
6              7         0       1   
7              8         0       3   
8              9         1       3   
9             10         1       2   
10            11         1       3   
11            12         1       1   
12            13         0       3   
13            14         0       3   
14            15         0       3   
15            16         1       2   
16            17         0       3   
17            18         1       2   
18            19         0       3   
19            20         1       3   
20            21         0       2   
21            22         1       2   
22            23         1       3   
23            24         1       1   
24            25         0       3   
25          

Removing some columns : 

In [3]:
training_data.drop(["Name", "Ticket"], axis=1, inplace=True)


In [4]:

#####################################################################
# Replacing NaN in "Age" with the (rounded) mean age                #
#####################################################################

mean_age = np.round(training_data["Age"].mean())  # Series.mean() skips NaN values
training_data_replaced_nan_with_mean = training_data.copy()  # copy so the raw frame is not modified
# fillna replaces the row-by-row loop; DataFrame.set_value was removed in pandas 1.0
training_data_replaced_nan_with_mean["Age"] = training_data_replaced_nan_with_mean["Age"].fillna(mean_age)

In [5]:
training_data_replaced_nan_with_mean

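|   | PassengerId | Survived | Pclass | Sex | Age | SibSp | Parch | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | male | 22.0 | 1 | 0 | 7.2500 |  | S |
| 1 | 2 | 1 | 1 | female | 38.0 | 1 | 0 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | female | 26.0 | 0 | 0 | 7.9250 |  | S |
| 3 | 4 | 1 | 1 | female | 35.0 | 1 | 0 | 53.1000 | C123 | S |
| 4 | 5 | 0 | 3 | male | 35.0 | 0 | 0 | 8.0500 |  | S |
| 5 | 6 | 0 | 3 | male | 30.0 | 0 | 0 | 8.4583 |  | Q |
| 6 | 7 | 0 | 1 | male | 54.0 | 0 | 0 | 51.8625 | E46 | S |
| 7 | 8 | 0 | 3 | male | 2.0 | 3 | 1 | 21.0750 |  | S |
| 8 | 9 | 1 | 3 | female | 27.0 | 0 | 2 | 11.1333 |  | S |
| 9 | 10 | 1 | 2 | female | 14.0 | 1 | 0 | 30.0708 |  | C |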


In [10]:

#####################################################################
# Data Visualization                                                #
#####################################################################

plt.figure(figsize=(10, 20))
ax1 = plt.subplot2grid((4,2), (0, 0))
ax1 = sns.histplot(training_data["Survived"], kde=False)  # histplot supersedes the deprecated distplot
plt.title("Histogram")
ax2 = plt.subplot2grid((4,2), (0, 1))
ax2 = sns.boxplot(x=training_data["Survived"], y=training_data["Pclass"])
plt.title("Variation of Survived with Pclass")
ax3 = plt.subplot2grid((4,2), (1, 0))
ax3 = sns.boxplot(x=training_data["Survived"], y=training_data["Sex"])
plt.title("Variation of Survived with Sex")
ax4 = plt.subplot2grid((4,2), (1, 1))
ax4 = sns.boxplot(x=training_data["Survived"], y=training_data["Age"])
plt.title("Variation of Survived with Age")
ax5 = plt.subplot2grid((4,2), (2, 0))
ax5 = sns.boxplot(x=training_data["Survived"], y=training_data["SibSp"])
plt.title("Variation of Survived with SibSp")
ax6 = plt.subplot2grid((4,2), (2, 1))
ax6 = sns.boxplot(x=training_data["Survived"], y=training_data["Parch"])
plt.title("Variation of Survived with Parch")
ax7 = plt.subplot2grid((4,2), (3, 0), colspan=2)
ax7 = sns.boxplot(x=training_data["Survived"], y=training_data["Fare"])
plt.title("Variation of Survived with Fare")
plt.tight_layout()

## 2.3 Building the Tree Model: 

Now we will read the training and testing data, separate the dependent variable (y) column, and use scikit-learn to build a Decision Tree model:

In [7]:
#####################################################################
# reading the Training Dataset and the Testing DataSet          #
#####################################################################
train = pd.read_csv("data/raw/cleaned_decomposed_training_data_replaced_nan_with_mean.csv")
test = pd.read_csv("data/raw/cleaned_decomposed_testing_data_replaced_nan_with_mean.csv")

# Identifier, target and raw categorical columns are excluded from the features
tabu_list = ["PassengerId", "Survived", "Pclass", "Sex", "Cabin", "Embarked"]
# Select features by column name; indexing the frame with integer positions would raise a KeyError
x_columns = [column for column in train.columns if column not in tabu_list]

train_X = train[x_columns].values
train_y = train["Survived"].values

test_X = test[x_columns].values
test_y = test["Survived"].values

In [13]:
#####################################################################
### Decision Tree Classification                                  ###
#####################################################################

dt_model = DecisionTreeClassifier(criterion = 'gini', splitter='best', max_depth=None, min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_features=None, random_state=None, max_leaf_nodes=10)
dt_model.fit(train_X, train_y)
dt_model_predictions = dt_model.predict(test_X)
print(dt_model_predictions)
dt_confusion_matrix = confusion_matrix(test_y, dt_model_predictions)
print(dt_confusion_matrix)
print("Accuracy of Decision Tree Classifier =", np.mean(dt_model_predictions == test_y))
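
The confusion matrix printed by the cell above can be turned into summary metrics by hand. The sketch below assumes the binary layout returned by `confusion_matrix` (rows are true labels, columns are predicted labels, so the matrix reads `[[TN, FP], [FN, TP]]`); the counts in `example_cm` are hypothetical and are not the Titanic results.

```python
def summarize_confusion_matrix(cm):
    """Return (accuracy, precision, recall) for a 2x2 matrix [[TN, FP], [FN, TP]]."""
    (tn, fp), (fn, tp) = cm
    total = tn + fp + fn + tp
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0  # of predicted survivors, how many survived
    recall = tp / (tp + fn) if (tp + fn) else 0.0     # of actual survivors, how many were found
    return accuracy, precision, recall


# Hypothetical counts, for illustration only
example_cm = [[50, 10], [5, 35]]
accuracy, precision, recall = summarize_confusion_matrix(example_cm)
print(accuracy, precision, recall)
```

With the real model, `summarize_confusion_matrix(dt_confusion_matrix.tolist())` would give the same accuracy as `dt_model.score(test_X, test_y)`.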



In [9]:
dt_apply = dt_model.apply(test_X)  # Returns the index of the leaf that each sample is predicted as.
print(dt_apply)
dt_dp = dt_model.decision_path(test_X) #Return the decision path in the tree
#dt_logProb = dt_model.predict_log_proba(test_X) #Predict class log-probabilities of the input samples X.
dt_prob = dt_model.predict_proba(test_X) #Predict class probabilities of the input samples X.
print(dt_prob)
dt_score = dt_model.score(test_X, test_y) #Returns the mean accuracy on the given test data and labels.
print(dt_score)

[13 13 12 12 12 12 15 12 15 13  8 17 15 12 12  7 12 12 12 12 17 12 15 12 10
 12 13 12 12 12 12 15 12 12 12 12 13 12 13 12 15 12 12 12 12 12 12 12 12 10
 15 12 15 12 17 15 17 17 13 13 12 13 15 12 12 13 12 17 12 13 12 12 13 13 12
 12 12 15 12 10 12 12 12 15 12 13 13 12 17 12 12 12 13 17 12 12 17 12 13 12
 12 12 13 12 13 12 13 12 13 13 10 13 12 12 12 12 15 12 12 13 12 12 12 12 12
 12 17 12 12 18 12 17 15 12 17 12 12 12 12 15 17 13 17 12 12 17 12 12 17 12
 17 12 12 12 13 12 17 12 12 13 12 13 12 17 12  8 12 15 12 12 13 13 13 15 12
 10 15 12]
[[ 0.03703704  0.96296296]
 [ 0.03703704  0.96296296]
 [ 0.86725664  0.13274336]
 [ 0.86725664  0.13274336]
 [ 0.86725664  0.13274336]
 [ 0.86725664  0.13274336]
 [ 0.39583333  0.60416667]
 [ 0.86725664  0.13274336]
 [ 0.39583333  0.60416667]
 [ 0.03703704  0.96296296]
 [ 0.92857143  0.07142857]
 [ 0.5625      0.4375    ]
 [ 0.39583333  0.60416667]
 [ 0.86725664  0.13274336]
 [ 0.86725664  0.13274336]
 [ 0.05        0.95      ]
 [ 0.86725664  0.13274336

In [10]:
import networkx as nx
#import pygraphviz as pgv


In [11]:
import pydotplus 
from IPython.display import Image  

#export_graphviz(dt_model, out_file = "tree.dot", feature_names = )
dot_data = tree.export_graphviz(dt_model, out_file=None,  filled=True, rounded=True, special_characters=True)  
graph = pydotplus.graph_from_dot_data(dot_data) 
#Image(graph.create_png())  


## 2.4  Advantages of Decision Trees:
Some advantages of decision trees are:

1. Non-parametric.
2. Easily interpretable. Trees can be visualised.
3. Require little data preparation.
4. Fast to train, fast to predict. Complexity: $\Theta(pN\log^2 N)$, with $N$ samples and $p$ features.
5. Able to handle both numerical and categorical data.
6. Able to handle multi-output problems.
7. Use a white box model.
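
To illustrate the interpretability point, a fitted tree can be printed as plain-text rules with `sklearn.tree.export_text` (available in scikit-learn 0.21 and later). This is only a sketch: the bundled iris dataset is used as a stand-in because the Titanic CSV files are not available here, and `max_depth=2` is an arbitrary choice to keep the output short.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

# Stand-in dataset (assumption): iris ships with scikit-learn
iris = load_iris()

# A shallow tree keeps the printed rules readable
clf = DecisionTreeClassifier(max_depth=2, random_state=0)
clf.fit(iris.data, iris.target)

# One line per node: the split condition, or the predicted class at a leaf
rules = export_text(clf, feature_names=list(iris.feature_names))
print(rules)
```

Each root-to-leaf path in the printed output is a rule that can be read and checked by a person, which is what "white box" refers to.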


## 2.5  Disadvantages of Decision Trees:

1. Overfitting. Mitigate it by limiting the minimum number of samples required at a leaf node or by setting the maximum depth of the tree.
2. High variance. Trees are sensitive to small variations in the data. This problem is mitigated by ensemble models.
3. The problem of learning an optimal decision tree is known to be NP-complete under several aspects of optimality and even for simple concepts. Practical learners therefore rely on greedy heuristics that make locally optimal decisions at each node and cannot guarantee a globally optimal tree. This can be mitigated by training multiple trees in an ensemble learner, where the features and samples are randomly sampled with replacement.
4. Decision tree learners create biased trees if some classes dominate. It is therefore recommended to balance the dataset prior to fitting the decision tree.
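
The overfitting point can be seen by comparing an unrestricted tree with one constrained by `max_depth` and `min_samples_leaf`. The sketch below uses synthetic data from `make_classification` with label noise (`flip_y`) as an assumption; the dataset sizes and the particular limits are arbitrary illustrative values, not recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, noisy data (assumption): 20% of the labels are assigned at random
X, y = make_classification(n_samples=500, n_features=10, n_informative=4,
                           flip_y=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# Unrestricted tree: grows until every training leaf is pure
deep = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# Constrained tree: limited depth and a minimum number of samples per leaf
shallow = DecisionTreeClassifier(max_depth=3, min_samples_leaf=10,
                                 random_state=0).fit(X_train, y_train)

for name, model in [("unrestricted", deep), ("constrained", shallow)]:
    print(name, "depth:", model.get_depth(),
          "train acc:", model.score(X_train, y_train),
          "test acc:", model.score(X_test, y_test))
```

The unrestricted tree memorizes the noisy training labels, so its training accuracy says little about how it will do on held-out data; the gap between the two scores is the symptom to watch for.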

# 3. Ensemble Models

Ensemble models, such as bagging and boosting, are models that consist of multiple learners.

## 3.1 Random Forests: 

1. In random forests, each tree in the ensemble is built from a sample drawn with replacement (i.e., a bootstrap sample) from the training set. 
2. In addition, when splitting a node during the construction of the tree, the split that is chosen is no longer the best split among all features. Instead, the split that is picked is the best split among a random subset of the features. 
3. As a result of this randomness, the bias of the forest usually increases slightly (with respect to the bias of a single non-random tree), but, due to averaging, its variance also decreases, usually more than compensating for the increase in bias and hence yielding an overall better model.

The scikit-learn implementation combines classifiers by averaging their probabilistic prediction, instead of letting each classifier vote for a single class.
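
The averaging behaviour described above can be checked directly: the forest's `predict_proba` should match the mean of the per-tree probabilities taken over `estimators_`. This is a sketch on synthetic data (an assumption, since the Titanic files are not available here); the sample and tree counts are arbitrary.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic two-class data (assumption)
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

forest = RandomForestClassifier(n_estimators=25, random_state=0).fit(X, y)

# Stack each tree's class probabilities: shape (n_trees, n_samples, n_classes)
per_tree = np.stack([tree.predict_proba(X) for tree in forest.estimators_])

# Average over the trees and compare with the forest's own output
averaged = per_tree.mean(axis=0)
forest_proba = forest.predict_proba(X)
print(np.allclose(averaged, forest_proba))
```

The predicted class is then the one with the highest averaged probability, which can differ from a simple majority vote when individual trees are unsure.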


<img src="files/images/4.png">

## Why More Trees?
## Why Randomization?

<img src="files/images/5.png">


## 3.2 Using Scikit learn for Ensemble Models:

http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html#sklearn.ensemble.RandomForestClassifier

class sklearn.ensemble.RandomForestClassifier(n_estimators=10, criterion='gini', max_depth=None, min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_features='auto', max_leaf_nodes=None, min_impurity_split=1e-07, bootstrap=True, oob_score=False, n_jobs=1, random_state=None, verbose=0, warm_start=False, class_weight=None)

Note: this is the signature from an older scikit-learn release. In later releases the default `n_estimators` is 100 and `min_impurity_split` has been removed.

In [14]:
#####################################################################
### Random Forest Classification                                  ###
#####################################################################

rf_model = RandomForestClassifier(n_estimators=50)
rf_model.fit(train_X, train_y)
rf_model_predictions = rf_model.predict(test_X)
print(rf_model_predictions)
rf_confusion_matrix = confusion_matrix(test_y, rf_model_predictions)
print(rf_confusion_matrix)
print("Accuracy of Random Forest Classifier =", np.mean(rf_model_predictions == test_y))



In [13]:
rf_apply = rf_model.apply(test_X)  # Returns, for each sample, the index of the leaf it ends up in within each tree.
print(rf_apply)
rf_dp = rf_model.decision_path(test_X) #Return the decision path in the forest
#rf_logProb = rf_model.predict_log_proba(test_X) #Predict class log-probabilities of the input samples X.
rf_prob = rf_model.predict_proba(test_X) #Predict class probabilities of the input samples X.
print(rf_prob)
rf_score = rf_model.score(test_X, test_y) #Returns the mean accuracy on the given test data and labels.
print(rf_score)

[[310 343 246 ..., 117 368 100]
 [211 243  27 ..., 249 187  76]
 [ 80  82 161 ..., 212 232 177]
 ..., 
 [301 184 129 ...,  23  99 334]
 [287 212 126 ..., 270 308 275]
 [ 33 165 219 ...,  80  45 143]]
[[ 0.          1.        ]
 [ 0.08        0.92      ]
 [ 0.87433333  0.12566667]
 [ 0.96        0.04      ]
 [ 0.54        0.46      ]
 [ 0.34        0.66      ]
 [ 0.64        0.36      ]
 [ 1.          0.        ]
 [ 0.46        0.54      ]
 [ 0.14        0.86      ]
 [ 1.          0.        ]
 [ 0.66        0.34      ]
 [ 0.74        0.26      ]
 [ 0.76666667  0.23333333]
 [ 0.90666667  0.09333333]
 [ 0.44        0.56      ]
 [ 0.98        0.02      ]
 [ 0.82        0.18      ]
 [ 0.95        0.05      ]
 [ 0.98        0.02      ]
 [ 0.62        0.38      ]
 [ 0.96866667  0.03133333]
 [ 0.1         0.9       ]
 [ 0.88        0.12      ]
 [ 0.94        0.06      ]
 [ 0.9         0.1       ]
 [ 0.          1.        ]
 [ 0.995       0.005     ]
 [ 0.78        0.22      ]
 [ 0.68        0.

## 3.3 Advantages of Random Forest :

1. Off the shelf, no tuning required
2. Control of bias and variance through the extent of randomization
3. Moderately fast
4. Good for parallelization
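
Because the trees are built independently of one another, training can be spread over several cores with the `n_jobs` parameter. The sketch below, on synthetic data (an assumption), fits the same forest serially and in parallel; with a fixed `random_state` both runs are expected to build the same trees.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic two-class data (assumption)
X, y = make_classification(n_samples=400, n_features=10, random_state=0)

# n_jobs=1 trains the trees one after another; n_jobs=-1 uses all available cores
serial = RandomForestClassifier(n_estimators=50, n_jobs=1, random_state=0).fit(X, y)
parallel = RandomForestClassifier(n_estimators=50, n_jobs=-1, random_state=0).fit(X, y)

# The parallel run should change the wall-clock time, not the model
same_predictions = np.array_equal(serial.predict(X), parallel.predict(X))
print(same_predictions)
```

On a dataset this small the speed-up is negligible; the benefit shows up with many trees or many samples.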