# Decision Tree Model

## Table of Contents

- [Data Takeover](#Data-Takeover)
    - [Train/Test Split](#Train/Test-Split)
- [DecisionTree Classifier](#DecisionTree-Classifier)
    - [Performance Measurement](#Performance-Measurement)
- [Results Handover](#Results-Handover)

## Data Takeover

Load the labelled feature matrix produced in chapter [Feature Matrix Generation](./3_FeatureMatrixGeneration.ipynb). This DataFrame is the input for all processing in this chapter.

In [1]:
import os
import pandas as pd

path_goldstandard = './daten_goldstandard'

# Restore results so far
df_labelled_feature_matrix = pd.read_pickle(
    os.path.join(path_goldstandard, 'labelled_feature_matrix.pkl'),
    compression=None)

df_labelled_feature_matrix.head()

| | duplicates | century_delta | corporate_110_delta | corporate_710_delta | edition_delta | format_prefix_delta | format_postfix_delta | person_245c_delta | ttlfull_245_delta | ttlfull_246_delta | volumes_delta |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
| 1 | 1 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 0.818905 | 0.363636 | 1.0 | 1.0 |
| 2 | 1 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 0.69774 | 1.0 | 1.0 | 1.0 |
| 3 | 1 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 0.818905 | 0.363636 | 1.0 | 1.0 |
| 4 | 1 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |


In [2]:
print('Share of duplicates (1) and uniques (0) in units of [%]')
print(df_labelled_feature_matrix.duplicates.value_counts(normalize=True)*100)

Share of duplicates (1) and uniques (0) in units of [%]
0    99.435054
1     0.564946
Name: duplicates, dtype: float64


### Train/Test Split

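The data is split into a training set (80%) and a test set (20%). The split is stratified on the target, so both sets keep the same small share of duplicates. The same split is meant to be reused as a general function in the model chapters.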

In [3]:
X = df_labelled_feature_matrix.drop(columns=['duplicates']).values
y = df_labelled_feature_matrix.duplicates.values

In [4]:
from sklearn.model_selection import train_test_split

X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, test_size=0.2, random_state=0)
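
The text above mentions a general function for the split. A minimal sketch of such a wrapper could look as follows. The function name `split_feature_matrix` and the synthetic demo data are assumptions made for illustration and are not part of the original notebook; only the `train_test_split` call mirrors the cell above.

```python
import numpy as np
from sklearn.model_selection import train_test_split


def split_feature_matrix(X, y, test_size=0.2, random_state=0):
    """Hypothetical helper: stratified train/test split as used in this chapter."""
    return train_test_split(X, y, stratify=y,
                            test_size=test_size, random_state=random_state)


# Synthetic, imbalanced demo data (assumption, not the gold standard data)
rng = np.random.default_rng(0)
X_demo = rng.random((1000, 10))
y_demo = np.array([1] * 50 + [0] * 950)

X_tr_demo, X_te_demo, y_tr_demo, y_te_demo = split_feature_matrix(X_demo, y_demo)

# Stratification keeps the share of class 1 equal in both parts
print(X_tr_demo.shape, X_te_demo.shape)
print(y_tr_demo.mean(), y_te_demo.mean())
```

With a target as imbalanced as the one in this chapter, stratification matters: without it, the few duplicates could end up unevenly distributed between the training and test sets.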

## DecisionTree Classifier

In [5]:
X_tr[:5], y_tr[:5]

(array([[0.5       , 1.        , 1.        , 1.        , 0.        ,
         0.42857143, 0.50165426, 0.44444444, 1.        , 0.07142857],
        [0.        , 1.        , 0.        , 1.        , 0.        ,
         0.42857143, 0.54435379, 0.109375  , 1.        , 0.        ],
        [0.        , 1.        , 0.        , 0.        , 1.        ,
         1.        , 0.6020276 , 0.38028169, 1.        , 0.22222222],
        [0.        , 1.        , 1.        , 1.        , 1.        ,
         1.        , 0.        , 0.17910448, 1.        , 0.375     ],
        [0.75      , 1.        , 1.        , 0.        , 1.        ,
         1.        , 0.51341896, 0.05479452, 1.        , 0.2       ]]),
 array([0, 0, 0, 0, 0]))

In [6]:
from sklearn.tree import DecisionTreeClassifier

dt = DecisionTreeClassifier(random_state=0)
dt.fit(X_tr, y_tr)
y_pred = dt.predict(X_te)
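
To get a feeling for what the fitted tree has learnt, its size and feature importances can be inspected. The following is only a sketch on synthetic data: the labelling rule and the names `X_demo`, `y_demo` and `dt_demo` are assumptions for illustration, while the feature names are taken from the columns of the feature matrix shown above.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

feature_names = [
    'century_delta', 'corporate_110_delta', 'corporate_710_delta',
    'edition_delta', 'format_prefix_delta', 'format_postfix_delta',
    'person_245c_delta', 'ttlfull_245_delta', 'ttlfull_246_delta',
    'volumes_delta',
]

# Synthetic similarity values in [0, 1] (assumption, not the real data)
rng = np.random.default_rng(0)
X_demo = rng.random((500, len(feature_names)))
# Artificial rule: a pair counts as duplicate if person and title both match well
y_demo = ((X_demo[:, 6] > 0.7) & (X_demo[:, 7] > 0.7)).astype(int)

dt_demo = DecisionTreeClassifier(random_state=0)
dt_demo.fit(X_demo, y_demo)

# Size of the fitted tree
print('depth', dt_demo.get_depth(), '- leaves', dt_demo.get_n_leaves())

# Features ranked by their importance in the fitted tree
importances = sorted(zip(feature_names, dt_demo.feature_importances_),
                     key=lambda pair: pair[1], reverse=True)
for name, value in importances[:3]:
    print('{:<22} {:.3f}'.format(name, value))
```

Applied to the real classifier `dt`, the same calls would show which similarity features drive the duplicate decision.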

### Performance Measurement

In [7]:
from sklearn.metrics import confusion_matrix

confusion_matrix(y_te, y_pred)

array([[51833,    19],
       [   14,   281]])

In [8]:
from sklearn.metrics import roc_auc_score, accuracy_score, precision_score, recall_score

# dt.score() returns the mean accuracy on the test set
print('Score {:.1f}%'.format(100*dt.score(X_te, y_te)))
# Note: roc_auc_score() receives hard class predictions here, not probabilities
print('Area under the curve {:.1f}% - accuracy {:.1f}% - precision {:.1f}% - recall {:.1f}%'.format(
    100*roc_auc_score(y_te, y_pred),
    100*accuracy_score(y_te, y_pred),
    100*precision_score(y_te, y_pred),
    100*recall_score(y_te, y_pred)
))

Score 99.9%
Area under the curve 97.6% - accuracy 99.9% - precision 93.7% - recall 95.3%
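
The reported metrics can be reproduced by hand from the confusion matrix above, which makes their meaning easier to see. The sketch below uses only the four counts from that matrix; the variable names are chosen for illustration. Because `roc_auc_score` receives hard class predictions here, the ROC curve has a single interior point and the area reduces to the mean of recall and specificity.

```python
# Confusion matrix from above: rows = true class, columns = predicted class
tn, fp = 51833, 19   # true uniques (0): correctly / wrongly classified
fn, tp = 14, 281     # true duplicates (1): missed / found

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)        # share of predicted duplicates that are real
recall = tp / (tp + fn)           # share of real duplicates that are found
specificity = tn / (tn + fp)      # share of real uniques that are kept

# AUC for hard 0/1 predictions: mean of recall and specificity
auc_hard_labels = (recall + specificity) / 2

print('Area under the curve {:.1f}% - accuracy {:.1f}% - precision {:.1f}% - recall {:.1f}%'.format(
    100*auc_hard_labels, 100*accuracy, 100*precision, 100*recall))
# Area under the curve 97.6% - accuracy 99.9% - precision 93.7% - recall 95.3%
```

Since more than 99% of the records are uniques, accuracy is high almost by construction. Precision and recall on the duplicates class are the more informative numbers.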


## Results Handover

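This chapter is the entry point for saving results: the performance metrics of the decision tree are written to `results.pkl` in the gold standard directory.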

In [9]:
# Add result of this section
# (test_score is stored as a fraction in [0, 1], the other metrics in percent)
df_result = pd.DataFrame.from_dict({
    'model': ['DecisionTree Classifier'],
    'test_score': [dt.score(X_te, y_te)],
    'auc': [100*roc_auc_score(y_te, y_pred)],
    'accuracy': [100*accuracy_score(y_te, y_pred)],
    'precision': [100*precision_score(y_te, y_pred)],
    'recall': [100*recall_score(y_te, y_pred)]
})

# Save full DataFrame into pickle file
df_result.to_pickle(os.path.join(path_goldstandard,
                                 'results.pkl'), compression=None)
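
A later chapter could restore the saved results and append its own row. The following sketch shows one possible way to do this. It is an assumption about how the handover might continue, not code from the original notebooks: it writes to a temporary directory, keeps only two columns, and the second row is a placeholder without real values.

```python
import os
import tempfile

import pandas as pd

with tempfile.TemporaryDirectory() as tmp_dir:
    path_results = os.path.join(tmp_dir, 'results.pkl')

    # Stand-in for the file written in this chapter (reduced to two columns)
    df_first = pd.DataFrame({'model': ['DecisionTree Classifier'],
                             'accuracy': [99.9]})
    df_first.to_pickle(path_results, compression=None)

    # In a later chapter: restore the results and append a new row
    df_results = pd.read_pickle(path_results, compression=None)
    df_next = pd.DataFrame({'model': ['Hypothetical next model'],
                            'accuracy': [float('nan')]})  # placeholder value
    df_results = pd.concat([df_results, df_next], ignore_index=True)
    df_results.to_pickle(path_results, compression=None)

    df_restored = pd.read_pickle(path_results, compression=None)

print(df_restored)
```

Using `ignore_index=True` gives the combined DataFrame a continuous index, so each model ends up in its own uniquely numbered row.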