# 31 Decision Tree Classifier on Guitar Models

### Overview
<span>
    <table>
        <tr><td>What is a Decision Tree?</td></tr>
        <tr><td>Building the Decision Tree Classifier</td></tr>
    </table>
</span>

### Setup

In [1]:
%matplotlib inline
import sys
import string
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn import feature_extraction
from sklearn import tree
from io import StringIO
from IPython.display import Image
import pydot

# turn on data table rendering
pd.set_option('display.notebook_repr_html', True)
sns.set_palette(['#00A99D', '#F5CA0C', '#B6129F', '#76620C', '#095C57'])
sys.version

'3.9.12 (main, Apr  4 2022, 05:22:27) [MSC v.1916 64 bit (AMD64)]'

## What is a Decision Tree?
A decision tree is a structure of questions and answers used to separate data points into classes. We can use supervised machine learning to build such a structure from existing data. Decision trees can be used for classification and regression. In this example we focus on classification.
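To make the "questions and answers" idea concrete, here is a minimal sketch of a decision tree written by hand as nested conditions. The two questions and their thresholds are made up for illustration (they are not learned from the data set), and `classify_guitar` is a hypothetical helper name.

```python
def classify_guitar(guitar):
    """Walk a tiny hand-written decision tree and return a class label."""
    # Question 1 (hypothetical rule): what kind of pickups does it have?
    if guitar["elements"] == "single coil":
        return "st"
    # Question 2 (hypothetical rule): how many frets does it have?
    if guitar["frets"] <= 21:
        return "st"
    return "lp"


print(classify_guitar({"elements": "single coil", "frets": 22}))
print(classify_guitar({"elements": "humbuckers", "frets": 22}))
```

A supervised learning algorithm produces the same kind of structure, but it picks the questions and thresholds from labeled examples instead of having them written by hand.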

### Classification of Guitar Models
In the example below we use a decision tree to predict the class of a guitar from its features. There are only two classes in this case: 10 Stratocaster (st) and 16 Les Paul (lp) models. The feature set contains the body material, the fretboard wood, the number of frets, and the kind of pickup elements. Note: this is probably *highly inaccurate* toy data, created only to illustrate a point. You can [download the data set](https://raw.githubusercontent.com/remondo/NoteBooks-Unsupervised-Learning/master/data/guitar-model.csv) from my GitHub repo.

In [2]:
# Load the guitar model data set
df = pd.read_csv('data/guitar-model.csv')
df

Unnamed: 0,model,material,fretboard,frets,elements
0,st,alder,maple,21,humbuckers
1,st,alder,maple,21,humbuckers
2,st,lime,maple,22,single coil
3,st,lime,maple,22,single coil
4,st,alder,maple,24,single coil
5,st,alder,maple,24,single coil
6,st,alder,rosewood,24,single coil
7,st,alder,rosewood,24,single coil
8,st,maple,rosewood,24,single coil
9,st,maple,rosewood,24,single coil


## Feature Extraction
We are confronted with a lot of categorical data, so we need to do some feature extraction first. We use [binary one-hot encoding](http://unsupervised-learning.com/binary-one-hot-encoding-for-machine-learning-in-python/) for this.
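As a rough illustration of what one-hot encoding does, the sketch below encodes two made-up records in plain Python. It imitates the `feature=value` column naming that `DictVectorizer` produces and passes numeric values through unchanged. The `one_hot` helper is hypothetical and only meant to show the idea, not to replace the library.

```python
def one_hot(records, categorical):
    """One-hot encode the given categorical keys; pass other values through."""
    # One output column per distinct (key, value) pair seen in the data
    columns = sorted({f"{key}={rec[key]}" for rec in records for key in categorical})
    encoded = []
    for rec in records:
        row = {col: 0 for col in columns}
        for key, value in rec.items():
            if key in categorical:
                row[f"{key}={value}"] = 1  # mark the category that is present
            else:
                row[key] = value  # numeric features stay as they are
        encoded.append(row)
    return encoded


# Made-up sample records, not rows from the actual data set
guitars = [
    {"fretboard": "maple", "frets": 21},
    {"fretboard": "rosewood", "frets": 24},
]
for row in one_hot(guitars, categorical=["fretboard"]):
    print(row)
```

Each categorical value becomes its own 0/1 column, so the classifier never has to interpret a string or assume an ordering between categories.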

In [3]:
# One-hot encode the categorical columns (the numeric 'frets' values pass through unchanged)
cat_columns = ['material', 'fretboard', 'frets', 'elements']
cat_dict = df[cat_columns].to_dict(orient='records')

vec = feature_extraction.DictVectorizer()
cat_vector = vec.fit_transform(cat_dict).toarray()

df_vector = pd.DataFrame(cat_vector)
vector_columns = vec.get_feature_names_out()
df_vector.columns = vector_columns
df_vector.index = df.index

df = df.drop(cat_columns, axis=1)
df = df.join(df_vector)
df.head()



Unnamed: 0,model,elements=humbuckers,elements=single coil,fretboard=ebony,fretboard=maple,fretboard=rosewood,frets,material=alder,material=lime,material=mahogany,material=maple
0,st,1.0,0.0,0.0,1.0,0.0,21.0,1.0,0.0,0.0,0.0
1,st,1.0,0.0,0.0,1.0,0.0,21.0,1.0,0.0,0.0,0.0
2,st,0.0,1.0,0.0,1.0,0.0,22.0,0.0,1.0,0.0,0.0
3,st,0.0,1.0,0.0,1.0,0.0,22.0,0.0,1.0,0.0,0.0
4,st,0.0,1.0,0.0,1.0,0.0,24.0,1.0,0.0,0.0,0.0


In [4]:
# Assign an ID to the models
df.loc[df.model == 'st','model'] = 0
df.loc[df.model == 'lp','model'] = 1
df.model.value_counts()

1    16
0    10
Name: model, dtype: int64

## Building the Decision Tree Classifier
We use scikit-learn's [DecisionTreeClassifier](http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html) to construct a decision tree. To choose which feature gives the largest information gain at any given point in the tree, we use the entropy criterion. Entropy measures how mixed a set of labels is: with two classes, 0.0 means perfectly pure (all labels identical) and 1.0 means the largest possible mix (an even split between both labels).
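To get a feel for these numbers, here is a small sketch that computes Shannon entropy (in bits) from class counts. The `entropy` helper is written for this illustration and is not part of scikit-learn. The last call uses the 10 st / 16 lp class sizes of this data set.

```python
import math


def entropy(counts):
    """Shannon entropy in bits of a label distribution given as class counts."""
    total = sum(counts)
    # Skip empty classes: they contribute nothing to the entropy
    probs = [c / total for c in counts if c > 0]
    return sum(p * math.log2(1 / p) for p in probs)


print(entropy([26, 0]))             # one class only: 0.0
print(entropy([13, 13]))            # even split of two classes: 1.0
print(round(entropy([10, 16]), 3))  # the full guitar data set: roughly 0.961
```

The full data set is therefore close to maximally mixed, and each split in the tree tries to produce child nodes whose entropy is lower than that of their parent.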

In [6]:
# Split the data set into features and labels
features = df.drop(['model'], axis=1)
# The 0/1 IDs assigned above leave the column with object dtype, which
# scikit-learn rejects as an unknown label type, so cast to int
labels = df.model.astype(int)

# Hold out the last row as a single test sample
test_features = features[-1:]
test_label = labels[-1:]

# Train the decision tree based on the entropy criterion
clf = tree.DecisionTreeClassifier(criterion='entropy')
clf = clf.fit(features[:-1], labels[:-1])
clf

In [7]:
# Visualize the decision tree
dot_data = StringIO()
tree.export_graphviz(clf, out_file=dot_data, feature_names=list(features.columns))
# Recent pydot versions return a list of graphs; take the first one
graph = pydot.graph_from_dot_data(dot_data.getvalue())[0]
Image(graph.create_png())

In [8]:
# Make a prediction with test data
pred = clf.predict(test_features)
print(test_features.T)
print('Predicted class:', pred)
print('Accurate prediction?', pred[0] == test_label.values[0])

It seems that the 'material' of the guitar body does not play any role in deciding which label belongs to a given feature set.
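One way to check an observation like this without reading the rendered graph is to inspect the `feature_importances_` attribute of a fitted tree, where a feature that is never used in a split gets an importance of 0. The sketch below uses a handful of made-up rows (not the guitar data set) in which the pickup column separates the classes perfectly and the material column does not. `export_text` prints the learned splits as plain text.

```python
import pandas as pd
from sklearn import tree

# Made-up toy rows for illustration only, not the guitar data set
X = pd.DataFrame({
    "elements=single coil": [1, 1, 1, 0, 0, 0],
    "material=alder":       [1, 0, 1, 0, 1, 0],
})
y = [0, 0, 0, 1, 1, 1]

clf = tree.DecisionTreeClassifier(criterion="entropy", random_state=0)
clf.fit(X, y)

# Importance per feature; a feature never used in a split scores 0
importances = pd.Series(clf.feature_importances_, index=X.columns)
print(importances)

# Text rendering of the learned splits
print(tree.export_text(clf, feature_names=list(X.columns)))
```

Applied to the classifier trained above, the same two calls would show whether any `material=...` column takes part in a split.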

### Done!

#### Next: _Shannon's Entropy and Information Gain for Decision Trees_