# Tree-structure classifier model implementation
## Case workbook
<br><br>
### Source: 
[F. Provost, T. Fawcett, "Data Science for Business"](https://data-science-for-biz.com/)
<br><br>
### Dataset source: 
[Mushroom Data Set](https://archive.ics.uci.edu/ml/datasets/Mushroom)
<br><br>
### Problem outline: 
Implement a tree-structured classifier to predict the target variable (edible or poisonous). This is a supervised classification problem. [In the previous workbook](https://github.com/nefiu/data_science_for_business_implementations/blob/main/3_intro_to_predictive_modeling/attribute-selection-with-information-gain-3-1-dsfb.ipynb), we calculated [information gain](https://en.wikipedia.org/wiki/Information_gain_in_decision_trees) (IG) for each feature; here it serves as the feature selection criterion. I will use the two features with the highest IG values in the model. The IG table is attached below.  
An additional problem is that decision trees in sklearn do not handle nominal data directly, so I need to dummy encode the features into numerical values using OneHotEncoder. [This lecture](https://www.youtube.com/watch?v=irHhDMbw3xo) helped me a lot to understand how it works.
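
To make the dummy-encoding idea concrete, here is a minimal hand-rolled sketch of what one-hot encoding does to a nominal column. It is only an illustration: the notebook itself relies on sklearn's OneHotEncoder, and the function name `one_hot_encode` and the sample odor codes are hypothetical.

```python
def one_hot_encode(values):
    """Return (categories, rows); each row is a list of 0/1 indicators."""
    # Sort the categories so the column order is deterministic.
    categories = sorted(set(values))
    index = {category: i for i, category in enumerate(categories)}
    rows = []
    for value in values:
        row = [0] * len(categories)
        row[index[value]] = 1  # exactly one column is "hot" per row
        rows.append(row)
    return categories, rows


# Hypothetical odor codes in the dataset's single-letter style.
odor = ["p", "a", "l", "p", "n"]
categories, encoded = one_hot_encode(odor)
print(categories)  # ['a', 'l', 'n', 'p']
print(encoded[0])  # [0, 0, 0, 1]
```

Each nominal value becomes its own 0/1 column, which is the numeric form a decision tree in sklearn can split on.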
<br><br>
Problem type: classification

Dataset values: categorical

Target variable: edible (e), poisonous (p)

Splitting criterion: [Information gain](https://en.wikipedia.org/wiki/Information_gain_in_decision_trees)
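
As a reminder of how the criterion is computed, the sketch below implements entropy and information gain for a single categorical feature in plain Python. The helper names and the four-row sample are hypothetical and are not taken from the previous workbook or the mushroom data.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a sequence of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(feature_values, labels):
    """Parent entropy minus the size-weighted entropy of each feature-value subset."""
    total = len(labels)
    subsets = {}
    for value, label in zip(feature_values, labels):
        subsets.setdefault(value, []).append(label)
    children = sum(len(s) / total * entropy(s) for s in subsets.values())
    return entropy(labels) - children


# Hypothetical toy sample, not taken from the mushroom data.
odor = ["f", "f", "n", "n"]
target = ["p", "p", "e", "e"]
print(information_gain(odor, target))  # 1.0: odor separates the classes perfectly
```

A feature that carries no information about the target (for example, the same value in every row) would score 0 with the same function.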
<br><br>
### TODO, TOANSWER
- How to visualize a dummy-encoded decision tree?
- Is cross_val_score enough for pipeline validation (what are the pitfalls)?
- What is the impact of handle_unknown on the model's cross-validation score?
<br><br>
<img src="./static/ig_table.png"></img>

## Import section

In [1]:
## Data analytics ##
####################
import pandas as pd

## Machine learning ##
######################

## Get pipeline constructor
from sklearn.pipeline import make_pipeline

## Get preprocessing tools
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import make_column_transformer

## Get model
from sklearn.tree import DecisionTreeClassifier

## Get cross validation tool
from sklearn.model_selection import cross_val_score

## Read data and explore

In [7]:
mushroom_set = pd.read_csv('./data/mushrooms.csv')
mushroom_set.head()

Unnamed: 0,class,cap-shape,cap-surface,cap-color,bruises,odor,gill-attachment,gill-spacing,gill-size,gill-color,...,stalk-surface-below-ring,stalk-color-above-ring,stalk-color-below-ring,veil-type,veil-color,ring-number,ring-type,spore-print-color,population,habitat
0,p,x,s,n,t,p,f,c,n,k,...,s,w,w,p,w,o,p,k,s,u
1,e,x,s,y,t,a,f,c,b,k,...,s,w,w,p,w,o,p,n,n,g
2,e,b,s,w,t,l,f,c,b,n,...,s,w,w,p,w,o,p,n,n,m
3,p,x,y,w,t,p,f,c,n,n,...,s,w,w,p,w,o,p,k,s,u
4,e,x,s,g,f,n,f,w,b,k,...,s,w,w,p,w,o,e,n,a,g


In [8]:
mushroom_set.shape

(8124, 23)

In [9]:
## Luckily, there are no NaNs. Otherwise, I would drop the rows containing NaNs
## before feeding the data to the model.
mushroom_set.isna().sum()

class                       0
cap-shape                   0
cap-surface                 0
cap-color                   0
bruises                     0
odor                        0
gill-attachment             0
gill-spacing                0
gill-size                   0
gill-color                  0
stalk-shape                 0
stalk-root                  0
stalk-surface-above-ring    0
stalk-surface-below-ring    0
stalk-color-above-ring      0
stalk-color-below-ring      0
veil-type                   0
veil-color                  0
ring-number                 0
ring-type                   0
spore-print-color           0
population                  0
habitat                     0
dtype: int64
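
One caveat worth checking: the UCI description of the Mushroom data set states that missing values are denoted by "?" (all of them in stalk-root). If the CSV keeps that encoding, `isna()` will not report them. The sketch below counts such placeholder cells; the helper name `count_placeholder_values` and the three-row sample are hypothetical, and whether this particular file contains "?" is an assumption I have not verified here.

```python
import pandas as pd


def count_placeholder_values(frame, placeholder="?"):
    """Count, per column, the cells that equal the placeholder string."""
    return (frame == placeholder).sum()


# Hypothetical three-row sample in the dataset's single-letter style.
sample = pd.DataFrame({"odor": ["p", "a", "n"], "stalk-root": ["e", "?", "?"]})
counts = count_placeholder_values(sample)
print(counts[counts > 0])  # only the columns that contain the placeholder
```

In the notebook this would be called as `count_placeholder_values(mushroom_set)`. According to the same description, the two features used below (odor, spore-print-color) are not affected.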

## Building feature vector

In [10]:
## I choose odor and spore-print-color because they have the top 2 IG values (0.90, 0.48)
features_to_use = 'odor spore-print-color'.split()

In [47]:
## Create feature and class frames
X = mushroom_set[features_to_use]
y = mushroom_set['class']

## Because I use cross_val_score, I do not need to split the data explicitly into
## train and test frames. However, I still use train_test_split to hold out a few
## out-of-sample rows and use them later for a prediction test.
X_train, X_out, y_train, y_out = train_test_split(X, 
                                                  y, 
                                                  test_size = 0.001, 
                                                  random_state = 42)
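
A quick arithmetic check of how many rows this split holds out. The sketch assumes that a fractional `test_size` is rounded up to a whole number of rows, which is consistent with the nine out-of-sample predictions shown at the end of the notebook.

```python
from math import ceil

n_samples = 8124   # rows reported by mushroom_set.shape
test_size = 0.001  # fraction passed to train_test_split

# Assumption: for a float test_size, the held-out count is rounded up.
n_out = ceil(test_size * n_samples)
n_train = n_samples - n_out
print(n_out, n_train)  # 9 8115
```

Nine rows are enough for a smoke test of `predict`, but far too few to estimate accuracy on their own, which is why the evaluation relies on cross-validation.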

## Preprocessing

In [55]:
## Instantiate One Hot Encoder
ohe = OneHotEncoder(handle_unknown='error')  ## TODO: check the impact of handle_unknown on
## the model's cross-validation score
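
As a first step towards that TODO, the sketch below contrasts the two settings on a tiny hypothetical example: a category that was not seen during `fit` raises a `ValueError` with `handle_unknown='error'`, while `handle_unknown='ignore'` encodes it as an all-zero row. The sample values are made up for illustration.

```python
from sklearn.preprocessing import OneHotEncoder

train = [["a"], ["n"]]  # categories seen during fit (hypothetical)
unseen = [["f"]]        # category that appears only at transform time

strict = OneHotEncoder(handle_unknown="error").fit(train)
lenient = OneHotEncoder(handle_unknown="ignore").fit(train)

try:
    strict.transform(unseen)
    strict_outcome = "encoded"
except ValueError:
    strict_outcome = "ValueError"

# The default output is a sparse matrix; convert it for display.
lenient_row = lenient.transform(unseen).toarray().tolist()
print(strict_outcome)  # ValueError
print(lenient_row)     # [[0.0, 0.0]]
```

In cross-validation this matters whenever a validation fold contains a category that is absent from the corresponding training folds: with 'error' that fold cannot be scored, whereas with 'ignore' the row simply carries no odor or spore-print-color information.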

In [56]:
## Set column transformer
column_trans = make_column_transformer(
    (ohe,
    features_to_use),
    remainder='passthrough')

## Instantiate a model

In [57]:
## Instantiate a decision tree with entropy as splitting criterion
classifier_en = DecisionTreeClassifier(criterion='entropy', 
                                       max_depth=4, 
                                       random_state=42)

## Build a pipeline

In [68]:
pipe = make_pipeline(column_trans, classifier_en)

In [69]:
## Evaluate
cross_val_score(pipe, X_train, y_train, cv=6, scoring='accuracy').mean()

0.9940851180297708
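
The mean alone hides how much the folds differ, so it may be worth inspecting the individual fold scores and their spread as well. The sketch below does this on hypothetical toy data (odor "f" is always poisonous), so it runs without the mushroom CSV; `toy_X`, `toy_y` and `toy_pipe` are illustrative names, and `handle_unknown='ignore'` is my choice for the toy example, not the setting used above.

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data: odor "f" is always poisonous, everything else edible.
toy_X = pd.DataFrame({"odor": ["n", "f", "a", "f", "n", "f"] * 6})
toy_y = toy_X["odor"].map(lambda code: "p" if code == "f" else "e")

toy_pipe = make_pipeline(
    make_column_transformer((OneHotEncoder(handle_unknown="ignore"), ["odor"])),
    DecisionTreeClassifier(criterion="entropy", max_depth=4, random_state=42),
)

fold_scores = cross_val_score(toy_pipe, toy_X, toy_y, cv=6, scoring="accuracy")
print(fold_scores)                            # one accuracy value per fold
print(fold_scores.mean(), fold_scores.std())  # centre and spread
```

Because the encoder sits inside the pipeline, it is refitted on the training part of every fold, so no information from the validation part leaks into the preprocessing.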

## Train & predict

In [60]:
## Train
pipe.fit(X_train, y_train)

Pipeline(steps=[('columntransformer',
                 ColumnTransformer(remainder='passthrough',
                                   transformers=[('onehotencoder',
                                                  OneHotEncoder(),
                                                  ['odor',
                                                   'spore-print-color'])])),
                ('decisiontreeclassifier',
                 DecisionTreeClassifier(criterion='entropy', max_depth=4,
                                        random_state=42))])

In [61]:
## Use out of sample data to predict target variable
prediction = pipe.predict(X_out)
y_out == prediction

1971    True
6654    True
5606    True
3332    True
6988    True
5761    True
5798    True
3064    True
1811    True
Name: class, dtype: bool

## Visualize