 ## Decision Tree Learning
A decision tree is a supervised learning algorithm that classifies a dataset into categories by repeatedly splitting it on the feature that maximizes the information gain, i.e. the feature that reduces the impurity the most.

In [9]:
import pandas as pd

In [10]:
df = pd.read_csv('movies.csv')
df

Unnamed: 0,gender,education,age,status,movie
0,male,bachelors,25,single,action
1,female,masters,30,married,drama
2,male,highschool,18,single,comedy
3,female,highschool,15,single,animation
4,male,highschool,25,married,comedy
5,male,masters,35,married,drama
6,female,bachelors,22,single,comedy


## Impurity
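Impurity measures how mixed (non-homogeneous) the labels are within a subset of the data; a subset in which every item has the same label has zero impurity. Gini impurity is the chance of being incorrect if we label a randomly chosen item by drawing a label at random from the subset's label distribution. In the formulas below, $p_{i}$ is the fraction of items with label $i$ and $J$ is the number of distinct labels.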
* Gini impurity 
    * $I_{G}(p)=1-\sum _{i=1}^{J}{p_{i}}^{2}$
* Entropy
    * $H(T)=-\sum _{i=1}^{J}p_{i}\log _{2}p_{i}$
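
The two formulas above can be sketched in a few lines of Python. This is a minimal illustration, not part of the original notebook: the helper names `class_probabilities`, `gini` and `entropy` are made up here, and the `movies` list is the `movie` column of the table shown earlier.

```python
import math
from collections import Counter

def class_probabilities(labels):
    """Relative frequency p_i of each distinct label."""
    total = len(labels)
    return [count / total for count in Counter(labels).values()]

def gini(labels):
    """Gini impurity: 1 - sum(p_i^2)."""
    return 1.0 - sum(p * p for p in class_probabilities(labels))

def entropy(labels):
    """Shannon entropy in bits: -sum(p_i * log2(p_i))."""
    return -sum(p * math.log2(p) for p in class_probabilities(labels))

# The 'movie' column of the seven-row dataset
movies = ["action", "drama", "comedy", "animation", "comedy", "drama", "comedy"]

print(gini(movies))     # 1 - 15/49, roughly 0.694
print(entropy(movies))  # roughly 1.842
```

Both measures are zero for a subset with a single label and grow as the labels become more evenly mixed.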

## Information Gain

Information gain is the reduction in entropy obtained by splitting the dataset $T$ on an attribute $a$: the entropy of the parent node minus the size-weighted sum of the entropies of its children.

$\overbrace {IG(T,a)} ^{\text{Information Gain}}=\overbrace {H(T)} ^{\text{Entropy(parent)}}-\overbrace {H(T|a)} ^{\text{Weighted Sum of Entropy(Children)}}$
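
As a rough sketch of the formula above, the function below takes the labels of a parent node and the labels of each child and returns the information gain. The names `entropy` and `information_gain` are illustrative only; the example lists are the `movie` labels of the whole dataset and of its male and female rows.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy in bits of a list of labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def information_gain(parent, children):
    """H(parent) minus the size-weighted sum of the children's entropies."""
    total = len(parent)
    weighted = sum(len(child) / total * entropy(child) for child in children)
    return entropy(parent) - weighted

parent = ["action", "drama", "comedy", "animation", "comedy", "drama", "comedy"]
male = ["action", "comedy", "comedy", "drama"]
female = ["drama", "animation", "comedy"]

ig_gender = information_gain(parent, [male, female])
print(ig_gender)  # roughly 0.306
```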

### Decision Tree building algorithm
The trick is knowing which question to ask, and when to ask it.
1. Start at the root node with the whole dataset.
2. At each node, consider Yes/No questions about the features, then divide the dataset.
    1. Pick the question that maximizes the information gain.
    2. For each resulting subset that is still impure, repeat step 2.
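
The steps above can be sketched as a small recursive function. This is a simplified illustration under stated assumptions, not what scikit-learn does internally: rows are plain dictionaries, a Yes/No question is "feature == value" for categorical features and "feature >= value" for numeric ones, and all helper names (`matches`, `best_question`, `build_tree`, `predict`) are invented for this example.

```python
import math
from collections import Counter

def entropy(labels):
    total = len(labels)
    return -sum(n / total * math.log2(n / total) for n in Counter(labels).values())

def matches(row, feature, value):
    """Yes/No question: numeric features test >=, categorical ones test ==."""
    if isinstance(value, (int, float)):
        return row[feature] >= value
    return row[feature] == value

def best_question(rows, target):
    """Return (gain, feature, value) of the question with the highest information gain."""
    parent_entropy = entropy([r[target] for r in rows])
    best = (0.0, None, None)
    for feature in rows[0]:
        if feature == target:
            continue
        for value in sorted({r[feature] for r in rows}):
            yes = [r[target] for r in rows if matches(r, feature, value)]
            no = [r[target] for r in rows if not matches(r, feature, value)]
            if not yes or not no:
                continue  # the question does not divide the rows
            weighted = (len(yes) * entropy(yes) + len(no) * entropy(no)) / len(rows)
            if parent_entropy - weighted > best[0]:
                best = (parent_entropy - weighted, feature, value)
    return best

def build_tree(rows, target):
    gain, feature, value = best_question(rows, target)
    if feature is None:  # pure node, or no question improves on it: make a leaf
        return Counter(r[target] for r in rows).most_common(1)[0][0]
    yes = [r for r in rows if matches(r, feature, value)]
    no = [r for r in rows if not matches(r, feature, value)]
    return {"question": (feature, value),
            "yes": build_tree(yes, target),
            "no": build_tree(no, target)}

def predict(tree, row):
    while isinstance(tree, dict):
        feature, value = tree["question"]
        tree = tree["yes"] if matches(row, feature, value) else tree["no"]
    return tree

columns = ["gender", "education", "age", "status", "movie"]
data = [
    ("male", "bachelors", 25, "single", "action"),
    ("female", "masters", 30, "married", "drama"),
    ("male", "highschool", 18, "single", "comedy"),
    ("female", "highschool", 15, "single", "animation"),
    ("male", "highschool", 25, "married", "comedy"),
    ("male", "masters", 35, "married", "drama"),
    ("female", "bachelors", 22, "single", "comedy"),
]
rows = [dict(zip(columns, record)) for record in data]

movie_tree = build_tree(rows, "movie")
print(movie_tree)
```

The recursion stops when no question yields a positive gain, which for this dataset happens only once every leaf is pure.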

In [3]:
df

Unnamed: 0,gender,education,age,status,movie
0,male,bachelors,25,single,action
1,female,masters,30,married,drama
2,male,highschool,18,single,comedy
3,female,highschool,15,single,animation
4,male,highschool,25,married,comedy
5,male,masters,35,married,drama
6,female,bachelors,22,single,comedy


In [4]:
import math
p_action = 1/7.0
p_drama = 2/7.0
p_anime = 1/7.0
p_comedy = 3/7.0
p=[p_action,p_drama,p_anime,p_comedy]
h = [pi*math.log(pi,2) for pi in p]
H= -1*sum(h)
H

1.8423709931771086

### Suppose the question we ask is "what is the gender?"

In [5]:
df_male = df[df.gender=='male']
df_male

Unnamed: 0,gender,education,age,status,movie
0,male,bachelors,25,single,action
2,male,highschool,18,single,comedy
4,male,highschool,25,married,comedy
5,male,masters,35,married,drama


In [6]:
p_action = 1/4.0
p_comedy = 2/4.0
p_drama = 1/4.0
p=[p_action,p_comedy,p_drama]
h = [pi*math.log(pi,2) for pi in p]
H_male = -1*sum(h)
H_male

1.5

In [7]:
df_female = df[df.gender=='female']
df_female

Unnamed: 0,gender,education,age,status,movie
1,female,masters,30,married,drama
3,female,highschool,15,single,animation
6,female,bachelors,22,single,comedy


In [8]:
p_comedy = 1/3.0
p_anime = 1/3.0
p_drama = 1/3.0
p=[p_comedy,p_anime,p_drama]
h = [pi*math.log(pi,2) for pi in p]
H_female = -1*sum(h)
H_female

1.584962500721156

In [9]:
# Information gain of the gender split (4 male rows, 3 female rows)
IG_gender = H - ((4/7.0)*H_male + (3/7.0)*H_female)
IG_gender

0.30595849286804166

### Suppose instead the question we ask is "what is the education?"


In [10]:
df_highschool = df[df.education=='highschool']
df_highschool

Unnamed: 0,gender,education,age,status,movie
2,male,highschool,18,single,comedy
3,female,highschool,15,single,animation
4,male,highschool,25,married,comedy


In [11]:
p_comedy = 2/3.0
p_anime = 1/3.0
p=[p_comedy,p_anime]
h = [pi*math.log(pi,2) for pi in p]
H_highschool = -1 * sum(h)
H_highschool

0.9182958340544896

In [12]:
df_bachelors = df[df.education=='bachelors']
df_bachelors

Unnamed: 0,gender,education,age,status,movie
0,male,bachelors,25,single,action
6,female,bachelors,22,single,comedy


In [13]:
p_comedy = 1/2.0
p_action = 1/2.0
p=[p_comedy,p_action]
h = [pi*math.log(pi,2) for pi in p]
H_bachelors = -1*sum(h)
H_bachelors

1.0

In [14]:
df_masters = df[df.education=='masters']
df_masters

Unnamed: 0,gender,education,age,status,movie
1,female,masters,30,married,drama
5,male,masters,35,married,drama


In [15]:
# Both masters rows are 'drama', so the subset is pure
p_drama = 1
p =[p_drama]
h = [pi*math.log(pi,2) for pi in p]
H_masters = -1*sum(h)
H_masters

-0.0

In [16]:
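# Information gain of the education split (3 highschool, 2 bachelors, 2 masters rows)
IG_education = H - ((3/7.0)*H_highschool + (2/7.0)*H_bachelors + (2/7.0)*H_masters)
IG_education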

1.1631013500108989

### Clearly, "what is the education?" is the better question to ask (information gain of about 1.163 versus 0.306)
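
The same comparison can be made for every categorical feature at once. The sketch below is an assumption-laden illustration rather than part of the original notebook: it rebuilds the table inline instead of reading `movies.csv`, treats each feature as a multiway split (one child per distinct value), leaves out the numeric `age` column, and uses made-up helper names.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "gender": ["male", "female", "male", "female", "male", "male", "female"],
    "education": ["bachelors", "masters", "highschool", "highschool",
                  "highschool", "masters", "bachelors"],
    "age": [25, 30, 18, 15, 25, 35, 22],
    "status": ["single", "married", "single", "single", "married", "married", "single"],
    "movie": ["action", "drama", "comedy", "animation", "comedy", "drama", "comedy"],
})

def entropy(labels):
    """Shannon entropy in bits of a pandas Series of labels."""
    p = labels.value_counts(normalize=True)
    return float(-(p * np.log2(p)).sum())

def information_gain(frame, feature, target="movie"):
    """Parent entropy minus the size-weighted entropy of each group of `feature`."""
    weighted = sum(len(group) / len(frame) * entropy(group)
                   for _, group in frame.groupby(feature)[target])
    return entropy(frame[target]) - weighted

gains = {col: information_gain(df, col) for col in ["gender", "education", "status"]}
best = max(gains, key=gains.get)
print(gains)
print(best)  # education
```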

In [11]:
from sklearn.preprocessing import LabelEncoder
enc = LabelEncoder()
for col in ['gender','education','status']:
    enc.fit(df[col])
    df[col+'_enc'] = enc.transform(df[col])
df

Unnamed: 0,gender,education,age,status,movie,gender_enc,education_enc,status_enc
0,male,bachelors,25,single,action,1,0,1
1,female,masters,30,married,drama,0,2,0
2,male,highschool,18,single,comedy,1,1,1
3,female,highschool,15,single,animation,0,1,1
4,male,highschool,25,married,comedy,1,1,0
5,male,masters,35,married,drama,1,2,0
6,female,bachelors,22,single,comedy,0,0,1


In [12]:
feature_cols = ['gender_enc','education_enc','status_enc','age']
# export_graphviz expects class names in ascending (sorted) order, matching clf.classes_
cls_names = sorted(df.movie.unique())
X = df.loc[:, feature_cols]
Y = df.movie

In [13]:
from sklearn import tree
clf = tree.DecisionTreeClassifier()
clf = clf.fit(X, Y)

In [7]:
import graphviz 
dot_data = tree.export_graphviz(clf, out_file=None) 
graph = graphviz.Source(dot_data) 
graph.render("movie") 

ExecutableNotFound: failed to execute ['dot', '-Tpdf', '-O', 'movie'], make sure the Graphviz executables are on your systems' PATH

In [10]:
dot_data = tree.export_graphviz(clf, out_file=None, 
                         feature_names=feature_cols,  
                         class_names=cls_names,  
                         filled=True, rounded=True,  
                         special_characters=True)  
graph = graphviz.Source(dot_data)  
graph

ExecutableNotFound: failed to execute ['dot', '-Tsvg'], make sure the Graphviz executables are on your systems' PATH

<graphviz.files.Source at 0x11215f550>

In [15]:
# sklearn.externals.six was removed from recent scikit-learn releases; use the standard library
from io import StringIO
import pydotplus
dot_data = StringIO()
tree.export_graphviz(clf, out_file=dot_data, 
                         feature_names=feature_cols,  
                         class_names=cls_names,  
                         filled=True, rounded=True,  
                         impurity=False,special_characters=True)
graph = pydotplus.graph_from_dot_data(dot_data.getvalue())
graph.write_pdf("movies.pdf")

InvocationException: GraphViz's executables not found
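
All three attempts above fail for the same reason: the Graphviz `dot` executable is not installed or not on the `PATH`. One way to inspect the tree without Graphviz is scikit-learn's plain-text export, `tree.export_text`. The sketch below assumes a scikit-learn version that provides this function (0.21 or later) and, to stay self-contained, retypes the encoded feature values from the table above instead of reusing `df`; `random_state=0` is an added assumption to make the fitted tree repeatable.

```python
from sklearn import tree

feature_cols = ["gender_enc", "education_enc", "status_enc", "age"]

# Encoded rows copied from the table above, in feature_cols order
X = [
    [1, 0, 1, 25],
    [0, 2, 0, 30],
    [1, 1, 1, 18],
    [0, 1, 1, 15],
    [1, 1, 0, 25],
    [1, 2, 0, 35],
    [0, 0, 1, 22],
]
y = ["action", "drama", "comedy", "animation", "comedy", "drama", "comedy"]

clf = tree.DecisionTreeClassifier(random_state=0).fit(X, y)

# Indented text rendering of the fitted tree; needs no external executables
report = tree.export_text(clf, feature_names=feature_cols)
print(report)
```

Installing Graphviz itself (the system package, not only the Python bindings) should also make the original cells work.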

### Random Forest
A random forest grows many classification trees. To classify a new object from an input vector, pass the input vector down each of the trees in the forest. Each tree gives a classification, and we say the tree "votes" for that class. The forest chooses the classification that has the most votes over all the trees in the forest.
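
The voting step can be illustrated with a toy sketch. The three "trees" below are hand-written stand-ins invented for this example; in a real random forest each tree would be trained on a bootstrap sample of the data using random subsets of the features. The function name `forest_predict` is likewise hypothetical.

```python
from collections import Counter

def forest_predict(trees, sample):
    """Let every tree vote and return the class with the most votes."""
    votes = Counter(t(sample) for t in trees)
    return votes.most_common(1)[0][0]

# Hypothetical stand-ins for trained trees: each maps a sample to a class
trees = [
    lambda s: "drama" if s["education"] == "masters" else "comedy",
    lambda s: "drama" if s["age"] >= 30 else "comedy",
    lambda s: "drama" if s["status"] == "married" else "action",
]

viewer = {"education": "highschool", "age": 25, "status": "married"}
print(forest_predict(trees, viewer))  # two votes for comedy, one for drama
```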