# Tree-structure classifier model implementation
## Case workbook
<br><br>
### Source: 
[F. Provost, T. Fawcett, "Data Science for Business"](https://data-science-for-biz.com/)
<br><br>
### Dataset source: 
[Mushroom Data Set](https://archive.ics.uci.edu/ml/datasets/Mushroom)
<br><br>
### Problem outline: 
Implement a tree-structured classifier to predict a target variable (edible or poisonous). This is a supervised classification problem. [In the previous workbook](https://www.kaggle.com/rafpast/attribute-selection-with-information-gain-3-1-dsfb), we calculated [information gain](https://en.wikipedia.org/wiki/Information_gain_in_decision_trees) (IG) for each feature, which serves here as the feature selection criterion. I will use the two features with the highest IG values in the model. The IG table is attached below.  
An additional problem is that decision trees in sklearn do not handle nominal data. That is why I need to dummy-encode the categories into numerical values using OneHotEncoder. [This lecture](https://www.youtube.com/watch?v=irHhDMbw3xo) helped me a lot in understanding how it works.
<br><br>
Problem type: classification

Dataset values: categorical

Target variable: edible (e), poisonous (p)

Splitting criterion: [Information gain](https://en.wikipedia.org/wiki/Information_gain_in_decision_trees)
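
As a reminder of what the splitting criterion measures, the sketch below computes entropy and information gain for a single categorical feature in plain Python. The helper names `entropy` and `information_gain` and the toy labels are illustrative assumptions; this is not the code from the previous workbook.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(feature_values, labels):
    """Parent entropy minus the size-weighted entropy of each feature-value group."""
    total = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    weighted = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - weighted


# Toy data: one feature separates the classes perfectly, the other not at all
labels = ["e", "e", "p", "p"]
perfect_feature = ["x", "x", "y", "y"]
useless_feature = ["x", "y", "x", "y"]

print(information_gain(perfect_feature, labels))
print(information_gain(useless_feature, labels))
```

A feature that splits the classes cleanly recovers all of the parent entropy, while a feature unrelated to the class recovers none.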
<br><br>
### TODO, TOANSWER
- How to visualize a dummy-encoded decision tree?
- Is `cross_val_score` enough for pipeline validation (what are the pitfalls)?
- What is the impact of `handle_unknown` on the model's cross-validation score?
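
As a starting point for the `handle_unknown` question, here is a small sketch of how the two settings behave when the encoder meets a category it did not see during fitting. The category `"z"` is a hypothetical value invented for the example.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

train = pd.DataFrame({"odor": ["a", "n", "f"]})
# "z" is a hypothetical category that is absent from the training frame
unseen = pd.DataFrame({"odor": ["z"]})

# handle_unknown='error' refuses to transform an unseen category
strict = OneHotEncoder(handle_unknown="error").fit(train)
try:
    strict.transform(unseen)
    strict_raised = False
except ValueError:
    strict_raised = True

# handle_unknown='ignore' encodes an unseen category as an all-zero row
lenient = OneHotEncoder(handle_unknown="ignore").fit(train)
ignored_row = lenient.transform(unseen).toarray()

print(strict_raised)
print(ignored_row)
```

In cross-validation this matters whenever a validation fold contains a category that is missing from the corresponding training fold: with `'error'` that fold cannot be scored, while with `'ignore'` the row is scored with all encoded columns set to zero.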
<br><br>
<img src="https://raw.githubusercontent.com/nefiu/data_science_for_business_implementations/1be90fcf25d74235a1636cac93abaf97e112077e/3_intro_to_predictive_modeling/static/ig_table.png"></img>

## Import section

In [None]:
## Data analytics ##
####################
import pandas as pd

## Machine learning ##
######################

## Get pipeline constructor
from sklearn.pipeline import make_pipeline

## Get preprocessing tools
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import make_column_transformer

## Get model
from sklearn.tree import DecisionTreeClassifier

## Get cross validation tool
from sklearn.model_selection import cross_val_score

## Read data and explore

In [None]:
mushroom_set = pd.read_csv('../input/mushroom-classification/mushrooms.csv')
mushroom_set.head()

In [None]:
mushroom_set.shape

In [None]:
## There are no NaNs in this dataset. Otherwise, I would drop the instances with NaNs
## before feeding the data to the model.
mushroom_set.isna().sum()

## Building feature vector

In [None]:
## I choose odor and spore-print-color because they have the two highest IG values (0.90, 0.48)
features_to_use = 'odor spore-print-color'.split()

In [None]:
## Create the feature and class frames
X = mushroom_set[features_to_use]
y = mushroom_set['class']

## Since I use cross_val_score, I do not need to split the data explicitly into train and
## test frames. However, I use train_test_split to hold out a small out-of-sample set
## that I use later for a prediction test.
X_train, X_out, y_train, y_out = train_test_split(X, 
                                                  y, 
                                                  test_size = 0.001, 
                                                  random_state = 42)

## Preprocessing

In [None]:
## Instantiate One Hot Encoder
ohe = OneHotEncoder(handle_unknown='error')

In [None]:
## Set column transformer
column_trans = make_column_transformer(
    (ohe,
    features_to_use),
    remainder='passthrough')

## Instantiate a model

In [None]:
## Instantiate a decision tree with entropy as splitting criterion
classifier_en = DecisionTreeClassifier(criterion='entropy', 
                                       max_depth=4, 
                                       random_state=42)

## Build a pipeline

In [None]:
pipe = make_pipeline(column_trans, classifier_en)

In [None]:
## Evaluate
cross_val_score(pipe, X_train, y_train, cv=6, scoring='accuracy').mean()
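
The mean alone hides how much the score varies between folds. The sketch below, on synthetic stand-in data rather than the mushroom set, keeps the per-fold scores and uses an explicit `StratifiedKFold` with shuffling; the names `toy_X`, `toy_y`, `toy_pipe` and `folds` are illustrative assumptions.

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data: the class is fully determined by the odor code
toy_X = pd.DataFrame({"odor": ["a", "n", "f", "n", "a", "f"] * 5})
toy_y = pd.Series(["e", "e", "p", "e", "e", "p"] * 5)

toy_pipe = make_pipeline(
    make_column_transformer((OneHotEncoder(handle_unknown="ignore"), ["odor"])),
    DecisionTreeClassifier(criterion="entropy", random_state=42),
)

# Shuffled, stratified folds keep the class ratio similar in every fold
folds = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)
scores = cross_val_score(toy_pipe, toy_X, toy_y, cv=folds, scoring="accuracy")

# Inspect each fold as well as the mean and the spread
print(scores, scores.mean(), scores.std())
```

Because the encoder sits inside the pipeline, it is refitted on the training part of every fold, so no information from the validation part leaks into preprocessing.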

## Train & predict

In [None]:
## Train
pipe.fit(X_train, y_train)

In [None]:
## Check the order of classes in the pipeline
pipe.classes_

In [None]:
## Use the out-of-sample data to predict the target variable. I use .predict_proba() to see
## the estimated probability of each prediction. The output is a NumPy array that I pass to
## a DataFrame for a clearer representation. Given these features and the model's accuracy,
## I get a probability of 100% for each prediction.
##
## It's worth experimenting with different features to see how the probabilities change.
prediction = pipe.predict_proba(X_out)
pd.DataFrame(prediction, columns=pipe.classes_)

## Visualize
work in progress