We are going to improve the result of the decision tree proposed in the following Kaggle kernel:

https://www.kaggle.com/ricardorios/decision-trees-don-t-overfit

In that kernel, the decision tree was built with default parameters and turned out to be an overfitting model. One important aspect is how the decision tree is built, which relies on [mathematical optimization](https://en.wikipedia.org/wiki/Mathematical_optimization). If you have time, you can review the following web page:

[How decision trees work internally](https://medium.com/cracking-the-data-science-interview/decision-trees-how-to-optimize-my-decision-making-process-e1f327999c7a)


To improve the model, we consider the parameter max_depth, which controls the complexity of the model. We start with max_depth = 1 (a simpler model) and increase this parameter until we find a good model.
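To make the link between max_depth and model complexity concrete, the following minimal sketch fits trees of increasing depth and reports how many leaves each one ends up with. It uses a synthetic dataset from make_classification, which is an assumption for illustration and not the competition data; the helper name tree_size is hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption; not the competition dataset)
X_demo, y_demo = make_classification(n_samples=300, n_features=20, random_state=0)

def tree_size(max_depth):
    """Fit a tree and return its actual depth and number of leaves (hypothetical helper)."""
    tree = DecisionTreeClassifier(max_depth=max_depth, random_state=0)
    tree.fit(X_demo, y_demo)
    return tree.get_depth(), tree.get_n_leaves()

# max_depth=None is the default: the tree keeps splitting until its leaves
# are pure or cannot be split further
sizes = {depth: tree_size(depth) for depth in (1, 2, 3, None)}
for depth, (actual_depth, n_leaves) in sizes.items():
    print(f"max_depth={depth}: depth={actual_depth}, leaves={n_leaves}")
```

A tree of depth d has at most 2^d leaves, so max_depth = 1 yields a single split, while the unrestricted tree is free to grow many more leaves and memorize the training data.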

In [None]:
# Loading the packages
import numpy as np
import pandas as pd 
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score
from sklearn.tree import DecisionTreeClassifier
#from sklearn.model_selection import cross_validate
from sklearn.model_selection import StratifiedKFold
import matplotlib.pyplot as plt


In [None]:
# Loading the training dataset
df_train = pd.read_csv("../input/train.csv")

In [None]:
y = df_train["target"]
# We exclude the target and id columns from the training dataset
df_train.pop("target");
df_train.pop("id")
X = df_train 
del df_train
X = X.values # Converting pandas dataframe to numpy array 
y = y.values # Converting pandas series to numpy array 


In order to perform our analysis, we take the following facts into account. 

"Typically, given these considerations, one performs k-fold cross-validation using k = 5 or k = 10, as these values have been shown empirically to yield test error rate estimates that suffer neither from excessively high bias nor from very high variance".[1]

https://machinelearningmastery.com/k-fold-cross-validation/

We are going to use stratified cross-validation because "Stratification is a technique where we rearrange the data in a way that each fold has a good representation of the whole dataset. It forces each fold to have at least m instances of each class. This approach ensures that one class of data is not overrepresented especially when the target variable is unbalanced". [2]

https://medium.com/datadriveninvestor/k-fold-and-other-cross-validation-techniques-6c03a2563f1e
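The effect of stratification can be sketched on a small unbalanced target. The example below compares how many positive samples land in each validation fold with a plain KFold and with StratifiedKFold; the toy arrays y_toy and X_toy and the helper positives_per_fold are hypothetical and only serve as an illustration.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Toy unbalanced target: 90 negatives followed by 10 positives (illustrative only)
y_toy = np.array([0] * 90 + [1] * 10)
X_toy = np.zeros((len(y_toy), 1))  # feature values are irrelevant for the split itself

def positives_per_fold(splitter):
    """Count the positive samples that land in each validation fold."""
    return [int(y_toy[test_idx].sum()) for _, test_idx in splitter.split(X_toy, y_toy)]

plain_counts = positives_per_fold(KFold(n_splits=5))
stratified_counts = positives_per_fold(StratifiedKFold(n_splits=5))

print("KFold:          ", plain_counts)
print("StratifiedKFold:", stratified_counts)
```

On this toy target, the unshuffled KFold puts all ten positives in the last fold, while StratifiedKFold places two in every fold. This matters here because AUC is undefined on a fold that contains a single class.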


In [None]:
skf = StratifiedKFold(n_splits=10)
skf.get_n_splits(X, y)

We are going to use the following values for the max_depth argument: 1, 2, and 3.

In [None]:
def fit_decision_tree(max_depth=1, nbins=5):
    """Cross-validate a tree of the given depth and plot the AUC of each fold."""
    train_auc = []
    test_auc = []

    for train_index, test_index in skf.split(X, y):
        model = DecisionTreeClassifier(max_depth=max_depth)
        model.fit(X[train_index], y[train_index])
        y_train = y[train_index]
        y_test = y[test_index]

        # Keep the predicted probability of the positive class (column 1)
        y_train_predict = model.predict_proba(X[train_index])[:, 1]
        y_test_predict = model.predict_proba(X[test_index])[:, 1]
        train_auc.append(roc_auc_score(y_train, y_train_predict))
        test_auc.append(roc_auc_score(y_test, y_test_predict))

    # One histogram per split; the shared y axis makes them comparable
    fig, (ax1, ax2) = plt.subplots(1, 2, sharey=True, tight_layout=True)

    ax1.hist(train_auc, bins=nbins)  # nbins was previously ignored (hard-coded to 5)
    ax1.set_title("Histogram of AUC training")
    ax2.hist(test_auc, bins=nbins)
    ax2.set_title("Histogram of AUC validation")
    

In [None]:
fit_decision_tree(1, 5)

In [None]:
fit_decision_tree(2, 5)

In [None]:
fit_decision_tree(3, 5)
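
As a numerical complement to the histograms, the fold scores can also be summarized by their mean, standard deviation, and the gap between training and validation AUC. The following is a minimal sketch using cross_validate from scikit-learn; the helper summarize_auc is hypothetical, the fixed random_state is an added assumption, and the synthetic data only stands in for the notebook's own X and y.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.tree import DecisionTreeClassifier

def summarize_auc(X, y, max_depth, n_splits=10):
    """Return mean/std of training and validation AUC across folds (hypothetical helper)."""
    scores = cross_validate(
        DecisionTreeClassifier(max_depth=max_depth, random_state=0),
        X, y,
        scoring="roc_auc",
        cv=StratifiedKFold(n_splits=n_splits),
        return_train_score=True,
    )
    train, val = scores["train_score"], scores["test_score"]
    return {
        "train_mean": float(np.mean(train)), "train_std": float(np.std(train)),
        "val_mean": float(np.mean(val)), "val_std": float(np.std(val)),
        "gap": float(np.mean(train) - np.mean(val)),
    }

# Synthetic stand-in data (an assumption); in the notebook, pass its own X and y instead
X_demo, y_demo = make_classification(n_samples=300, n_features=20, random_state=0)
summaries = {depth: summarize_auc(X_demo, y_demo, depth) for depth in (1, 2, 3)}
for depth, s in summaries.items():
    print(f"max_depth={depth}: train={s['train_mean']:.3f}, "
          f"validation={s['val_mean']:.3f}, gap={s['gap']:.3f}")
```

A large positive gap between the training and validation means is one common symptom of overfitting, so this table can be read alongside the histograms rather than instead of them.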

From the plots above, it seems that max_depth values of 2 or 3 produce overfitting models. On the other hand, with max_depth equal to 1 we obtain a similar variation in the distribution of AUC values on the training and validation sets, which is why we choose this model. Next, we fit this model on the whole training dataset.

In [None]:
# Note: class_weight='balanced' was not used in the cross-validation above
model = DecisionTreeClassifier(max_depth=1, class_weight='balanced')
model.fit(X, y)

In [None]:
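df_test = pd.read_csv("../input/test.csv")
df_test.pop("id");
X_test = df_test.values # Converting to numpy array, as done for the training data
del df_test
y_pred = model.predict_proba(X_test)
y_pred = y_pred[:,1] # Probability of the positive class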

In [None]:
# submit prediction
smpsb_df = pd.read_csv("../input/sample_submission.csv")
smpsb_df["target"] = y_pred
smpsb_df.to_csv("decision_tree_improved.csv", index=False)


## References: 

[1] https://machinelearningmastery.com/k-fold-cross-validation/

[2] https://medium.com/datadriveninvestor/k-fold-and-other-cross-validation-techniques-6c03a2563f1e