## Training the classification model and inspecting it

The model and data set were assessed previously; here we examine how well subsets of the features can predict cancellation.

- The model is a gradient boosting classifier (`GradientBoostingClassifier`) with 400 or 800 estimators
- The data set consists of absolute click counts (page impressions, visits, visits including home PI, etc.); the metric is Gauss-based

In this notebook:
- different subgroups of the data set are examined (long/short contracts, with/without sleepers, etc.)
- rules are extracted that describe how the models classify

In [1]:
import pandas as pd
import pickle
from datetime import datetime
%pylab inline

dat_all, dat_long, dat_short = pickle.load(open('dat_to_train.pkl','rb'))
dat_all_wBehav, dat_long_wBehav, dat_short_wBehav = pickle.load(open('dat_to_train_wlastBehav.pkl','rb'))

Populating the interactive namespace from numpy and matplotlib


`%matplotlib` prevents importing * from pylab and numpy


Contract length is an excellent predictor of cancellation probability!

Long-term customers who cancel are hard to distinguish from long-term customers who do not. Even on the training set, performance is not optimal.

User behaviour before cancellation plays a minor role in classification.

### Training

In [2]:
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingClassifier
from datetime import datetime

# prepare data
keys = ['dat_all', 'dat_long', 'dat_short', 'dat_all_wBehav', 'dat_long_wBehav', 'dat_short_wBehav']
dats = [dat_all, dat_long, dat_short, dat_all_wBehav, dat_long_wBehav, dat_short_wBehav]

dat_train = {}
dat_test = {}
label_train = {}
label_test = {}
for idx, key in enumerate(keys):
    label = dats[idx]['dat_abs_gauss_final']['Kuendigungsstatus']
    dat_cp = dats[idx]['dat_abs_gauss_final'].drop(columns = ['Kuendigungsstatus'])
    
    dat_train_, dat_test_, label_train_, label_test_ = train_test_split(dat_cp, label, test_size=0.2, random_state=42)
    dat_train[key] = dat_train_
    dat_test[key] = dat_test_
    label_train[key] = label_train_
    label_test[key] = label_test_
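
The classes are imbalanced (the `dat_all` training split reported further down holds 1710 cancelled and 5114 active contracts). One option, not used in the cell above, is to stratify the split so that train and test keep the same class ratio. The sketch below runs on synthetic data; the frame, its column names `f0` to `f2` and the 25 % cancellation rate are illustrative assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for one of the notebook's feature tables (assumption)
rng = np.random.default_rng(0)
demo = pd.DataFrame(rng.normal(size=(1000, 3)), columns=['f0', 'f1', 'f2'])
demo['Kuendigungsstatus'] = np.where(rng.random(1000) < 0.25,
                                     'gekuendigt', 'ungekuendigt')

label = demo['Kuendigungsstatus']
feats = demo.drop(columns=['Kuendigungsstatus'])

# stratify=label keeps the share of each class (almost) equal in both splits
X_train, X_test, y_train, y_test = train_test_split(
    feats, label, test_size=0.2, random_state=42, stratify=label)

share_train = (y_train == 'gekuendigt').mean()
share_test = (y_test == 'gekuendigt').mean()
print(round(share_train, 3), round(share_test, 3))
```

Whether stratification changes the results here is an open question; it mainly removes one source of run-to-run variation when subsets are small.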

In [3]:
# model training
classifiers = {}
for key in keys:
    
    print(datetime.now().time())
    print(key)

    classifiers[key] = GradientBoostingClassifier( n_estimators = 800 )
    classifiers[key].fit(dat_train[key], label_train[key])

18:29:29.615357
dat_all
18:29:39.955434
dat_long
18:29:47.789998
dat_short
18:29:50.276349
dat_all_wBehav
18:30:12.373738
dat_long_wBehav
18:30:29.094101
dat_short_wBehav


Training the models with the chosen parameters takes only seconds.
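
The results in this notebook are reported for both 400 and 800 estimators. As a possible shortcut, `GradientBoostingClassifier.staged_predict` yields the predictions after every boosting stage, so a single fit can be scored at several ensemble sizes. The sketch below uses synthetic data and smaller sizes (100 and 200) to stay quick; the helper name `accuracy_at_stages` is hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

def accuracy_at_stages(clf, X, y, stages):
    """Return {number of trees: accuracy} for the requested boosting stages."""
    wanted = set(stages)
    scores = {}
    # staged_predict yields predictions after 1, 2, ..., n_estimators trees
    for n, pred in enumerate(clf.staged_predict(X), start=1):
        if n in wanted:
            scores[n] = accuracy_score(y, pred)
    return scores

# Synthetic stand-in data (assumption)
X, y = make_classification(n_samples=600, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

clf = GradientBoostingClassifier(n_estimators=200, random_state=0)
clf.fit(X_train, y_train)

stage_scores = accuracy_at_stages(clf, X_test, y_test, stages=[100, 200])
print(stage_scores)
```

The score at the last stage equals what `clf.predict` would give, so the larger model only needs to be fitted once.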

### Make Prediction

In [4]:
from sklearn.metrics import classification_report

pred_train={}
pred_test ={}

for key in keys:
    pred_train[key] = classifiers[key].predict(dat_train[key])
    pred_test[key] = classifiers[key].predict(dat_test[key])

In [45]:
print('#### Training with n_estimators = 800 ###')

for key in keys:
    print('#########################')
    print(key)
    print('#########################')
    print('TRAINING SET')
    print('')
    print(pd.crosstab(label_train[key], pred_train[key], rownames=['True'], colnames=['Predicted'], margins=True))
    print()
    print(classification_report(label_train[key], pred_train[key]))
    print('#########################')
    print('TEST SET')
    print('')
    print(pd.crosstab(label_test[key], pred_test[key], rownames=['True'], colnames=['Predicted'], margins=True))
    print()
    print(classification_report(label_test[key], pred_test[key]))

#### Training with n_estimators = 800 ###
#########################
dat_all
#########################
TRAINING SET

Predicted     gekuendigt  ungekuendigt   All
True                                        
gekuendigt          1662            48  1710
ungekuendigt         713          4401  5114
All                 2375          4449  6824

              precision    recall  f1-score   support

  gekuendigt       0.70      0.97      0.81      1710
ungekuendigt       0.99      0.86      0.92      5114

 avg / total       0.92      0.89      0.89      6824

#########################
TEST SET

Predicted     gekuendigt  ungekuendigt   All
True                                        
gekuendigt           359            76   435
ungekuendigt         227          1044  1271
All                  586          1120  1706

              precision    recall  f1-score   support

  gekuendigt       0.61      0.83      0.70       435
ungekuendigt       0.93      0.82      0.87      1271

 avg / total 
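
The figures in the classification report follow directly from the confusion matrix. As a check, the sketch below recomputes precision, recall and F1 for the `dat_all` test set from the four counts in the crosstab above; the variable names are illustrative.

```python
import numpy as np

# Rows: true class, columns: predicted class (gekuendigt, ungekuendigt),
# copied from the dat_all test-set crosstab
cm = np.array([[359, 76],
               [227, 1044]])

true_pos = np.diag(cm)
precision = true_pos / cm.sum(axis=0)   # correct / all predicted as that class
recall = true_pos / cm.sum(axis=1)      # correct / all truly in that class
f1 = 2 * precision * recall / (precision + recall)

for name, p, r, f in zip(['gekuendigt', 'ungekuendigt'], precision, recall, f1):
    print(f'{name:>12}  precision={p:.2f}  recall={r:.2f}  f1={f:.2f}')
```

For `gekuendigt` this gives 359/586 for precision and 359/435 for recall: the model finds most cancellations, but a sizeable share of its cancellation predictions are false alarms.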

In [42]:
print('#### Training with n_estimators = 400 ###')

for key in keys:
    print('#########################')
    print(key)
    print('#########################')
    print('TRAINING SET')
    print('')
    print(pd.crosstab(label_train[key], pred_train[key], rownames=['True'], colnames=['Predicted'], margins=True))
    print()
    print(classification_report(label_train[key], pred_train[key]))
    print('#########################')
    print('TEST SET')
    print('')
    print(pd.crosstab(label_test[key], pred_test[key], rownames=['True'], colnames=['Predicted'], margins=True))
    print()
    print(classification_report(label_test[key], pred_test[key]))

#### Training with n_estimators = 400 ###
#########################
dat_all
#########################
TRAINING SET

Predicted     gekuendigt  ungekuendigt   All
True                                        
gekuendigt          1577           133  1710
ungekuendigt         754          4360  5114
All                 2331          4493  6824

              precision    recall  f1-score   support

  gekuendigt       0.68      0.92      0.78      1710
ungekuendigt       0.97      0.85      0.91      5114

 avg / total       0.90      0.87      0.88      6824

#########################
TEST SET

Predicted     gekuendigt  ungekuendigt   All
True                                        
gekuendigt           362            73   435
ungekuendigt         219          1052  1271
All                  581          1125  1706

              precision    recall  f1-score   support

  gekuendigt       0.62      0.83      0.71       435
ungekuendigt       0.94      0.83      0.88      1271

 avg / total 

##### With contract length (`Vertragslaenge`)

In [22]:
print(pd.crosstab(label_train_vl, pred_vl, rownames=['True'], colnames=['Predicted'], margins=True))
print()
print(classification_report(label_train_vl, pred_vl))

Predicted     gekuendigt  ungekuendigt   All
True                                        
gekuendigt          1693            17  1710
ungekuendigt           7          5107  5114
All                 1700          5124  6824

              precision    recall  f1-score   support

  gekuendigt       1.00      0.99      0.99      1710
ungekuendigt       1.00      1.00      1.00      5114

 avg / total       1.00      1.00      1.00      6824



##### Without contract length (`Vertragslaenge`)

In [27]:
print(pd.crosstab(label_train, pred, rownames=['True'], colnames=['Predicted'], margins=True))
print()
print(classification_report(label_train, pred))

Predicted     gekuendigt  ungekuendigt   All
True                                        
gekuendigt          1560           150  1710
ungekuendigt         751          4363  5114
All                 2311          4513  6824

              precision    recall  f1-score   support

  gekuendigt       0.68      0.91      0.78      1710
ungekuendigt       0.97      0.85      0.91      5114

 avg / total       0.89      0.87      0.87      6824



##### Short contracts

In [24]:
print(pd.crosstab(label_short_train, pred_short, rownames=['True'], colnames=['Predicted'], margins=True))
print()
print(classification_report(label_short_train, pred_short))

Predicted     gekuendigt  ungekuendigt   All
True                                        
gekuendigt          1050             9  1059
ungekuendigt         100           728   828
All                 1150           737  1887

              precision    recall  f1-score   support

  gekuendigt       0.91      0.99      0.95      1059
ungekuendigt       0.99      0.88      0.93       828

 avg / total       0.95      0.94      0.94      1887



##### Long contracts

In [26]:
print(pd.crosstab(label_long_test, pred_long, rownames=['True'], colnames=['Predicted'], margins=True))
print()
print(classification_report(label_long_test, pred_long))

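ValueError: could not broadcast input array from shape (4936) into shape (1235)

The lengths differ: `pred_long` has 4936 entries while `label_long_test` has 1235. The other cells in this section compare training labels with training predictions, so `pred_long` was presumably computed on the training set and would need to be paired with the training labels.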

### Feature Importance

In [9]:
features = pd.DataFrame({'feature_names': dat_cp.columns.values,
            'feature_importances': clf.feature_importances_})

features_vl = pd.DataFrame({'feature_names': dat_cp_vl.columns.values,
            'feature_importances': clf_vl.feature_importances_})

features_short = pd.DataFrame({'feature_names': data_short.columns.values,
            'feature_importances': clf_short.feature_importances_})

features_long = pd.DataFrame({'feature_names': data_long.columns.values,
            'feature_importances': clf_long.feature_importances_})

In [10]:
features_vl.sort_values(by='feature_importances', ascending=False).head(20)

                    feature_names  feature_importances
51                 Vertragslaenge             0.496273
5                       pi_median             0.034164
11               piArticle_median             0.033870
10  piArticle_diff_max_min_normed             0.032661
43             vArticle_activ_wks             0.032144
20   vArticle_diff_max_min_normed             0.028822
4          pi_diff_max_min_normed             0.026566
13           piPlusComment_median             0.022946
0           v_diff_max_min_normed             0.020100
6   piComment_diff_max_min_normed             0.019687


In [11]:
features.sort_values(by='feature_importances', ascending=False).head(20)

                        feature_names  feature_importances
43                 vArticle_activ_wks             0.133559
5                           pi_median             0.069136
20       vArticle_diff_max_min_normed             0.065503
4              pi_diff_max_min_normed             0.047191
34                        v_activ_wks             0.046827
0               v_diff_max_min_normed             0.040547
47             vPlusArticle_activ_wks             0.037606
13               piPlusComment_median             0.032667
7                    piComment_median             0.032137
12  piPlusComment_diff_max_min_normed             0.031100


In [12]:
features_short.sort_values(by='feature_importances',ascending=False).head(20)

                        feature_names  feature_importances
43                 vArticle_activ_wks             0.087183
5                           pi_median             0.086677
12  piPlusComment_diff_max_min_normed             0.062513
4              pi_diff_max_min_normed             0.052538
13               piPlusComment_median             0.051434
7                    piComment_median             0.042925
34                        v_activ_wks             0.037691
20       vArticle_diff_max_min_normed             0.036323
28   vPlusArticle_diff_max_min_normed             0.036010
1                            v_median             0.034987


In [13]:
features_long.sort_values(by='feature_importances',ascending=False).head(20)

                       feature_names  feature_importances
43                vArticle_activ_wks             0.130287
47            vPlusArticle_activ_wks             0.055161
5                          pi_median             0.053278
20      vArticle_diff_max_min_normed             0.052971
11                  piArticle_median             0.052926
4             pi_diff_max_min_normed             0.040209
21                   vArticle_median             0.037410
0              v_diff_max_min_normed             0.036407
28  vPlusArticle_diff_max_min_normed             0.036030
24      vDesktop_diff_max_min_normed             0.035204


In [51]:
for key in keys:
    # take the feature names from the frame each classifier was fitted on,
    # so that names and importances are guaranteed to line up
    features = pd.DataFrame({'feature_names': dat_train[key].columns.values,
            'feature_importances': classifiers[key].feature_importances_})
    print(key)
    print(features.sort_values(by='feature_importances',ascending=False).head(20))
    print('')
    

dat_all
              feature_names  feature_importances
43       vArticle_activ_wks             0.088751
4                   pi_mean             0.063634
21       vArticle_sd_normed             0.060064
1               v_sd_normed             0.055456
5              pi_sd_normed             0.048761
13  piPlusComment_sd_normed             0.048214
12       piPlusComment_mean             0.037706
24            vDesktop_mean             0.036552
10           piArticle_mean             0.035139
20            vArticle_mean             0.031748
47   vPlusArticle_activ_wks             0.031242
7       piComment_sd_normed             0.029271
29   vPlusArticle_sd_normed             0.028483
11      piArticle_sd_normed             0.027641
6            piComment_mean             0.027331
0                    v_mean             0.024553
3           vPlus_sd_normed             0.023576
25       vDesktop_sd_normed             0.022890
34              v_activ_wks             0.021571
28        vP
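
`feature_importances_` is impurity-based and derived from the training data. As a cross-check, which the notebook itself does not perform, permutation importance on held-out data measures how much the score drops when a single column is shuffled. The sketch assumes scikit-learn 0.22 or later (for `sklearn.inspection.permutation_importance`) and uses synthetic data with placeholder column names.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic stand-in data; column names are placeholders (assumption)
X, y = make_classification(n_samples=600, n_features=6, n_informative=3,
                           random_state=0)
X = pd.DataFrame(X, columns=[f'feat_{i}' for i in range(6)])
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

clf = GradientBoostingClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)

# Shuffle each column 10 times on the test set and record the mean score drop
result = permutation_importance(clf, X_test, y_test,
                                n_repeats=10, random_state=0)

importances = pd.DataFrame({
    'feature_names': X.columns.values,
    'impurity_importance': clf.feature_importances_,
    'permutation_importance': result.importances_mean,
}).sort_values(by='permutation_importance', ascending=False)
print(importances)
```

If both rankings agree on the top features, that supports the ordering in the tables above; large disagreements would suggest looking at correlated features.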

### FIND DECISION RULES: Make several trees in order to extract boundaries

In [12]:
# function to plot n decision trees for subsets of data
from sklearn import tree
import graphviz
from sklearn.model_selection import train_test_split

def plot_trees(dat, label, word, n=10):
    """Fit n depth-4 trees on different random splits and render each to a file."""
    for i in range(1, n + 1):
        # split test, train data; a different seed per tree varies the split
        dat_train, dat_test, label_train, label_test = train_test_split(
            dat, label, test_size=0.2, random_state=3 * i)

        clf = tree.DecisionTreeClassifier(max_depth=4)
        clf = clf.fit(dat_train, label_train)

        features = dat_train.columns.values
        dot_data = tree.export_graphviz(clf, out_file=None,
                                        feature_names=features,
                                        class_names=True,
                                        filled=True, rounded=True)

        graph = graphviz.Source(dot_data)
        # file name, e.g. decision_trees/tree_shortVL_0101
        fname = 'decision_trees/tree_' + word + str(i).zfill(2)
        graph.render(fname)

In [22]:
from sklearn import tree
import graphviz 
from sklearn.model_selection import train_test_split

# prepare data
data_long, data_short = dat_long_wBehav['dat_abs_median_final'], dat_short_wBehav['dat_abs_median_final']
label_long, label_short = data_long['Kuendigungsstatus'], data_short['Kuendigungsstatus']
data_long, data_short = data_long.drop(columns = ['Kuendigungsstatus','Vertragslaenge']),\
                        data_short.drop(columns = ['Kuendigungsstatus','Vertragslaenge'])



# plot generation of trees
# first generation: all variables
plot_trees(data_short,label_short, 'shortVL_01',n=5)
plot_trees(data_long,label_long, 'longVL_01',n=5)

# second generation: 'vArticle_activ_wks' and 'v_activ_wks' are excluded
data_short_02 = data_short.drop(columns=['vArticle_activ_wks','v_activ_wks'])
data_long_02 = data_long.drop(columns=['vArticle_activ_wks','v_activ_wks'])

plot_trees(data_short_02,label_short, 'shortVL_02',n=5)
plot_trees(data_long_02,label_long, 'longVL_02',n=5)

# third generation
data_short_03 = data_short_02.drop(columns=['piPlusComment_activ_wks'])
data_long_03 = data_long_02.drop(columns=['vPlusArticle_activ_wks'])

plot_trees(data_short_03,label_short, 'shortVL_03',n=5)
plot_trees(data_long_03,label_long, 'longVL_03',n=5)

# fourth generation
data_short_04 = data_short_03.drop(columns=['vPlusArticle_activ_wks','vPlus_activ_wks'])
data_long_04 = data_long_03.drop(columns=['piPlusComment_activ_wks'])

plot_trees(data_short_04,label_short, 'shortVL_04',n=5)
plot_trees(data_long_04,label_long, 'longVL_04',n=5)

# fifth generation
data_short_05 = data_short_04.drop(columns=['piComment_activ_wks'])
plot_trees(data_short_05,label_short, 'shortVL_05',n=5)
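
The rendered graphs have to be read by eye. If the rules are also wanted as text, `sklearn.tree.export_text` (scikit-learn 0.21 or later) prints the split thresholds of a fitted tree. This is a minimal sketch on synthetic data, not part of the original analysis; the feature names are placeholders.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic stand-in for one of the subsets (assumption)
X, y = make_classification(n_samples=500, n_features=4, n_informative=2,
                           n_redundant=0, random_state=0)
feature_names = ['feat_a', 'feat_b', 'feat_c', 'feat_d']  # placeholder names

# Same depth limit as in plot_trees
clf = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X, y)

# One line per split or leaf, indented by depth
rules = export_text(clf, feature_names=feature_names)
print(rules)
```

Thresholds that recur across trees fitted on different splits are the more trustworthy candidates for decision boundaries.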
