# Decision Trees: Algorithm and Implementation

## [It's a Nonparametric Algorithm](https://machinelearningmastery.com/parametric-and-nonparametric-machine-learning-algorithms/)

### Parametric Algorithm
*A learning model that summarizes data with a set of parameters of fixed size (independent of the number of training examples) is called a parametric model. No matter how much data you throw at a parametric model, it won’t change its mind about how many parameters it needs.*

— Artificial Intelligence: A Modern Approach, page 737



#### Two Steps of a Parametric Algorithm
1. Select a form for the function.
2. Learn the coefficients for the function from the training data.

An easy-to-understand functional form for the mapping function is a line, as is used in linear regression:

b0 + b1*x1 + b2*x2 = 0
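The two steps above can be sketched with NumPy. This is a minimal illustration, not part of the original notebook: it assumes the regression form `y = b0 + b1*x1 + b2*x2`, and the synthetic data, the true coefficients (1, 2, -3), and the helper names `fit_linear` and `make_data` are all made up for the example.

```python
import numpy as np

def fit_linear(X, y):
    """Least-squares fit of y ~ b0 + b1*x1 + b2*x2; returns [b0, b1, b2]."""
    # Step 1 (choose the form): a column of ones lets b0 be learned with the slopes.
    design = np.column_stack([np.ones(len(X)), X])
    # Step 2 (learn the coefficients) by ordinary least squares.
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef

rng = np.random.default_rng(0)  # fixed seed, for illustration only

def make_data(n):
    """Hypothetical data drawn from y = 1 + 2*x1 - 3*x2 plus small Gaussian noise."""
    X = rng.uniform(0.0, 1.0, size=(n, 2))
    y = 1.0 + 2.0 * X[:, 0] - 3.0 * X[:, 1] + rng.normal(scale=0.1, size=n)
    return X, y

coef_small = fit_linear(*make_data(20))
coef_large = fit_linear(*make_data(1000))

# The number of parameters is fixed by the chosen form, not by the data size.
print(coef_small.shape, coef_large.shape)
print(np.round(coef_large, 2))
```

Whether the model sees 20 rows or 1000, it still learns exactly three coefficients, which is the sense in which the model is parametric.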

#### Benefits of Parametric Machine Learning Algorithms:

- Simpler: These methods are easier to understand, and their results are easier to interpret.
- Speed: Parametric models are very fast to learn from data.
- Less Data: They do not require as much training data and can work well even if the fit to the data is not perfect.

#### Limitations of Parametric Machine Learning Algorithms:

- Constrained: By choosing a functional form, these methods are highly constrained to the specified form.
- Limited Complexity: The methods are more suited to simpler problems.
- Poor Fit: In practice, the methods are unlikely to match the underlying mapping function.

### Nonparametric Algorithm
Algorithms that do not make strong assumptions about the form of the mapping function are called nonparametric machine learning algorithms. By not making assumptions, they are free to learn any functional form from the training data.

*Nonparametric methods are good when you have a lot of data and no prior knowledge, and when you don’t want to worry too much about choosing just the right features.*

— Artificial Intelligence: A Modern Approach, page 757
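A decision tree illustrates the nonparametric idea well, because its size is not fixed in advance. The sketch below is an illustration under assumptions rather than part of the original analysis: the noisy synthetic data and the helper name `tree_size` are hypothetical, and it relies on scikit-learn's `tree_.node_count` attribute to measure how large each fitted tree is.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)  # fixed seed, for illustration only

def tree_size(n):
    """Fit an unrestricted tree on n noisy synthetic points; return its node count."""
    X = rng.uniform(0.0, 1.0, size=(n, 2))
    noise = rng.normal(scale=0.5, size=n)
    # Hypothetical labelling rule: a noisy linear boundary between two classes.
    y = (X[:, 0] + X[:, 1] + noise > 1.0).astype(int)
    clf = DecisionTreeClassifier(random_state=0).fit(X, y)
    return clf.tree_.node_count

# Map each training-set size to the number of nodes in the fitted tree.
sizes = {n: tree_size(n) for n in (50, 500, 5000)}
print(sizes)
```

With more training data, the unrestricted tree keeps adding nodes, so the number of things it "learns" grows with the data instead of staying fixed.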




In [1]:
import pandas as pd

# The data file has no header row, so read it without one and name the columns.
contra = pd.read_csv('contraceptive_data.data', header=None)

contra.columns = ['wifeage', 'wifeedu', 'husedu', 'numchild', 'religion',
                  'wifework', 'husjob', 'living', 'media', 'method']
contra.head()

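|   | wifeage | wifeedu | husedu | numchild | religion | wifework | husjob | living | media | method |
|---|---------|---------|--------|----------|----------|----------|--------|--------|-------|--------|
| 0 | 24 | 2 | 3 | 3 | 1 | 1 | 2 | 3 | 0 | 1 |
| 1 | 45 | 1 | 3 | 10 | 1 | 1 | 3 | 4 | 0 | 1 |
| 2 | 43 | 2 | 3 | 7 | 1 | 1 | 3 | 4 | 0 | 1 |
| 3 | 42 | 3 | 2 | 9 | 1 | 1 | 3 | 3 | 0 | 1 |
| 4 | 36 | 3 | 3 | 8 | 1 | 1 | 3 | 2 | 0 | 1 |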


From this information we can state our goal: to predict `method` (the type of contraceptive) from features such as the wife's age, education, and number of children. We can use pandas to show the three contraceptive types:

In [2]:
print("* contra types:", contra["method"].unique(), sep="\n")

* contra types:
[1 2 3]


Let's group the feature columns together:

In [3]:
features = list(contra.columns[:9])
print("* features:", features, sep="\n")  

* features:
['wifeage', 'wifeedu', 'husedu', 'numchild', 'religion', 'wifework', 'husjob', 'living', 'media']


### Let's fit the decision tree now
Source for this example: http://stackabuse.com/decision-trees-in-python-with-scikit-learn/

In [4]:
from __future__ import print_function

import os
import subprocess
import numpy as np
import graphviz
from sklearn.tree import DecisionTreeClassifier, export_graphviz

We pull the X and y data from the pandas DataFrame using simple indexing.
The decision tree classifier, imported above, is initialized with two parameters: `min_samples_split=10` requires at least 10 samples in a node for it to be split (this will make more sense when we see the result), and `random_state=10` seeds the random number generator.

In [5]:
y = contra["method"]
X = contra[features]
dt = DecisionTreeClassifier(min_samples_split=10, random_state=10)
dt.fit(X, y)

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=10,
            min_weight_fraction_leaf=0.0, presort=False, random_state=10,
            splitter='best')
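To get a feel for what `min_samples_split` does, the sketch below fits trees with several values on synthetic data and compares their sizes. It is a hedged illustration only: the data, the names `X_demo`, `y_demo`, and `node_counts`, and the values 2 and 100 are assumptions made for the example (only `min_samples_split=10` and `random_state=10` come from the cell above).

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)  # fixed seed, for illustration only

# Hypothetical noisy two-class data, unrelated to the contraceptive dataset.
X_demo = rng.uniform(0.0, 1.0, size=(1000, 2))
noise = rng.normal(scale=0.5, size=1000)
y_demo = (X_demo[:, 0] + X_demo[:, 1] + noise > 1.0).astype(int)

node_counts = {}
for mss in (2, 10, 100):
    # A node holding fewer than `mss` samples is never split further.
    clf = DecisionTreeClassifier(min_samples_split=mss, random_state=10)
    clf.fit(X_demo, y_demo)
    node_counts[mss] = clf.tree_.node_count

print(node_counts)
```

Raising `min_samples_split` stops the tree from carving up small groups of samples, so the fitted tree has fewer nodes and is easier to read once it is drawn.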

In [12]:
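def visualize_tree(tree, feature_names):
    """Write the fitted tree to dt.dot and render it to dt.png with Graphviz."""
    with open("dt.dot", 'w') as f:
        export_graphviz(tree, out_file=f,
                        feature_names=feature_names)

    # Machine-specific path to the Graphviz `dot` executable; adjust for your setup.
    command = [r"C:\Users\SUE Kwon\Anaconda3\Library\bin\graphviz\dot", "-Tpng", "dt.dot", "-o", "dt.png"]
    try:
        subprocess.check_call(command)
    except (OSError, subprocess.CalledProcessError) as err:
        # OSError: dot could not be started; CalledProcessError: dot exited with an error.
        raise RuntimeError("Could not run dot, i.e. graphviz, to "
                           "produce visualization") from err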

In [13]:
visualize_tree(dt, features)

## [Second Attempt](http://chrisstrelioff.ws/sandbox/2015/06/08/decision_trees_in_python_with_scikit_learn_and_pandas.html)

In [8]:
from sklearn import tree
clf = tree.DecisionTreeClassifier()
clf = clf.fit(X, y)
dot_data = tree.export_graphviz(clf, out_file=None) 
graph = graphviz.Source(dot_data) 
graph.render("ContrA") 

CalledProcessError: Command '['dot.bat', '-Tpdf', '-O', 'ContrA']' returned non-zero exit status 1.

## The error I get with the second attempt
---------------------------------------------------------------------------
FileNotFoundError                         Traceback (most recent call last)
~\Anaconda3\lib\site-packages\graphviz\backend.py in render(engine, format, filepath, quiet)
    123         try:
--> 124             subprocess.check_call(args, startupinfo=STARTUPINFO, stderr=stderr)
    125         except OSError as e:

~\Anaconda3\lib\subprocess.py in check_call(*popenargs, **kwargs)
    285     """
--> 286     retcode = call(*popenargs, **kwargs)
    287     if retcode:

~\Anaconda3\lib\subprocess.py in call(timeout, *popenargs, **kwargs)
    266     """
--> 267     with Popen(*popenargs, **kwargs) as p:
    268         try:

~\Anaconda3\lib\subprocess.py in __init__(self, args, bufsize, executable, stdin, stdout, stderr, preexec_fn, close_fds, shell, cwd, env, universal_newlines, startupinfo, creationflags, restore_signals, start_new_session, pass_fds, encoding, errors)
    708                                 errread, errwrite,
--> 709                                 restore_signals, start_new_session)
    710         except:

~\Anaconda3\lib\subprocess.py in _execute_child(self, args, executable, preexec_fn, close_fds, pass_fds, cwd, env, startupinfo, creationflags, shell, p2cread, p2cwrite, c2pread, c2pwrite, errread, errwrite, unused_restore_signals, unused_start_new_session)
    996                                          os.fspath(cwd) if cwd is not None else None,
--> 997                                          startupinfo)
    998             finally:

FileNotFoundError: [WinError 2] The system cannot find the file specified

During handling of the above exception, another exception occurred:

ExecutableNotFound                        Traceback (most recent call last)
<ipython-input-34-c9e82e904cc1> in <module>()
      4 dot_data = tree.export_graphviz(clf, out_file=None)
      5 graph = graphviz.Source(dot_data)
----> 6 graph.render("ContrA")

~\Anaconda3\lib\site-packages\graphviz\files.py in render(self, filename, directory, view, cleanup)
    174         filepath = self.save(filename, directory)
    175 
--> 176         rendered = backend.render(self._engine, self._format, filepath)
    177 
    178         if cleanup:

~\Anaconda3\lib\site-packages\graphviz\backend.py in render(engine, format, filepath, quiet)
    125         except OSError as e:
    126             if e.errno == errno.ENOENT:
--> 127                 raise ExecutableNotFound(args)
    128             else:  # pragma: no cover
    129                 raise

ExecutableNotFound: failed to execute ['dot', '-Tpdf', '-O', 'ContrA'], make sure the Graphviz executables are on your systems' PATH
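
The final message says the `dot` executable is not on the `PATH`. One way to check this from Python is sketched below using the standard library's `shutil.which`. This is a diagnostic sketch and not a confirmed fix: the helper name `find_dot` is hypothetical, and the candidate directory is simply the one from the hard-coded command in the first attempt, so it may not exist on another machine.

```python
import os
import shutil

def find_dot(extra_dirs=()):
    """Return the full path to Graphviz `dot`, or None if it cannot be found."""
    found = shutil.which("dot")  # search the current PATH first
    if found:
        return found
    for directory in extra_dirs:
        found = shutil.which("dot", path=directory)
        if found:
            # Make `dot` visible to later subprocess calls made from this process.
            os.environ["PATH"] = directory + os.pathsep + os.environ.get("PATH", "")
            return found
    return None

# Directory taken from the hard-coded command in the first attempt (machine-specific).
candidate_dirs = [r"C:\Users\SUE Kwon\Anaconda3\Library\bin\graphviz"]
dot_path = find_dot(candidate_dirs)
print("dot found at:" if dot_path else "dot not found", dot_path or "")
```

If this prints a path, calling `graph.render("ContrA")` again in the same session should at least be able to locate `dot`. If it prints "dot not found", Graphviz itself probably still needs to be installed or its `bin` directory added to the system `PATH`.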
