<a href="https://colab.research.google.com/github/xtbtds/ml-zoomcamp/blob/main/lesson6.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 6.3 Decision trees

**Decision trees** make predictions by applying a sequence of if/else conditions, each of which splits a node into two or more sub-nodes.

Decision trees are prone to overfitting, and one of the main reasons is **depth**. A deep tree tends to **memorize** the patterns in the training data but struggles to perform well on unseen data (the validation or test set). To counter this, we limit the depth of the tree.

```
from sklearn.tree import DecisionTreeClassifier
dt = DecisionTreeClassifier(max_depth=3)
```
To print the tree:
```
from sklearn.tree import export_text
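# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
print(export_text(dt, feature_names=list(dv.get_feature_names_out())))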
```

# 6.4 Decision Tree Learning Algorithm

***Find the best split algorithm:***
```
for F in features:
    find all thresholds for F
    for T in thresholds:
        split dataset using "F > T"
        compute impurity of this split  (misclassification)
select the condition with the lowest impurity
```
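
The pseudocode above can be sketched in plain Python. This is only a minimal illustration, not the scikit-learn implementation: the function names are made up for this example, the thresholds are simply the distinct values of each feature, and the impurity of a split is assumed to be the size-weighted average of the misclassification rates of the two groups.

```python
def misclassification_rate(labels):
    """Share of labels that differ from the majority class (0 for an empty group)."""
    if not labels:
        return 0.0
    majority = max(set(labels), key=labels.count)
    return sum(1 for y in labels if y != majority) / len(labels)


def find_best_split(rows, labels):
    """Return (feature_index, threshold, impurity) of the lowest-impurity split."""
    best = None
    for f in range(len(rows[0])):
        # Candidate thresholds: every distinct value except the largest,
        # so that both sides of the split are non-empty
        thresholds = sorted(set(row[f] for row in rows))[:-1]
        for t in thresholds:
            # Split the dataset using the condition "F > T"
            left = [y for row, y in zip(rows, labels) if row[f] <= t]
            right = [y for row, y in zip(rows, labels) if row[f] > t]
            # Assumption: weight each side's impurity by its size
            impurity = (
                len(left) * misclassification_rate(left)
                + len(right) * misclassification_rate(right)
            ) / len(labels)
            if best is None or impurity < best[2]:
                best = (f, t, impurity)
    return best


# Tiny hand-made dataset: feature 0 separates the classes, feature 1 does not
rows = [[1, 10], [2, 20], [3, 10], [4, 20]]
labels = [0, 0, 1, 1]
print(find_best_split(rows, labels))  # (0, 2, 0.0)
```

On this toy data the condition "feature 0 > 2" produces two pure groups, so it is selected with impurity 0.
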
***Stopping criteria:***
- group is already pure
- tree reached depth limit
- group too small to split

***Decision tree learning algorithm:***
- find the best split
- stop if max depth reached
- if left is sufficiently large and not pure:  
     -> repeat for left
- if right is sufficiently large and not pure:  
     -> repeat for right
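
The steps and stopping criteria above can be put together into a small recursive sketch. It is a simplified illustration under a few assumptions, not how scikit-learn builds trees: the names (`build_tree`, `min_size`, the dictionary layout of a node) are invented for this example, and impurity is the misclassification rate.

```python
from collections import Counter


def impurity(labels):
    """Misclassification rate of a group: share of non-majority labels."""
    if not labels:
        return 0.0
    return 1 - Counter(labels).most_common(1)[0][1] / len(labels)


def best_split(rows, labels):
    """Try every 'feature > threshold' condition; return the best (f, t) or None."""
    best, best_score = None, None
    for f in range(len(rows[0])):
        for t in sorted(set(r[f] for r in rows))[:-1]:
            left = [y for r, y in zip(rows, labels) if r[f] <= t]
            right = [y for r, y in zip(rows, labels) if r[f] > t]
            score = (len(left) * impurity(left) + len(right) * impurity(right)) / len(labels)
            if best_score is None or score < best_score:
                best, best_score = (f, t), score
    return best


def build_tree(rows, labels, depth=0, max_depth=3, min_size=2):
    """Grow a tree recursively; leaves store the majority class."""
    majority = Counter(labels).most_common(1)[0][0]
    # Stopping criteria: group is pure, depth limit reached, or group too small
    if impurity(labels) == 0 or depth >= max_depth or len(labels) < min_size:
        return {"leaf": majority}
    split = best_split(rows, labels)
    if split is None:  # all rows are identical, nothing to split on
        return {"leaf": majority}
    f, t = split
    left = [(r, y) for r, y in zip(rows, labels) if r[f] <= t]
    right = [(r, y) for r, y in zip(rows, labels) if r[f] > t]
    return {
        "feature": f,
        "threshold": t,
        # Repeat the same procedure for the left and right groups
        "left": build_tree([r for r, _ in left], [y for _, y in left], depth + 1, max_depth, min_size),
        "right": build_tree([r for r, _ in right], [y for _, y in right], depth + 1, max_depth, min_size),
    }


def predict(tree, row):
    """Follow the conditions from the root down to a leaf."""
    while "leaf" not in tree:
        tree = tree["right"] if row[tree["feature"]] > tree["threshold"] else tree["left"]
    return tree["leaf"]


rows = [[1, 5], [2, 7], [3, 6], [4, 8]]
labels = [0, 0, 1, 1]
tree = build_tree(rows, labels)
print(tree)
print(predict(tree, [1.5, 6]), predict(tree, [3.5, 6]))
```

Here the recursion stops after one split because both resulting groups are already pure.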

# 6.5 Decision Trees Parameter Tuning
Two parameters, **max_depth** and **min_samples_leaf**, matter more than the others.   
```
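from sklearn.metrics import roc_auc_score
from sklearn.tree import DecisionTreeClassifier

scores = []
for m in [4, 5, 6]:
    print('depth: %s' % m)

    for s in [1, 5, 10, 15, 20, 50, 100, 200]:
        dt = DecisionTreeClassifier(max_depth=m, min_samples_leaf=s)
        dt.fit(X_train, y_train)
        y_pred = dt.predict_proba(X_val)[:, 1]
        auc = roc_auc_score(y_val, y_pred)
        # record (max_depth, min_samples_leaf, auc) for this combination
        scores.append((m, s, auc))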
```
A dataframe is created with every combination of *max_depth* and *min_samples_leaf* and the corresponding *AUC score*.
```
df_scores = pd.DataFrame(scores, columns = ["max_depth", "...", "..."])
df_scores_pivot = df_scores.pivot(index='min_samples_leaf', columns=['max_depth'], values=['auc'])
``` 
These results are visualized using a **heatmap** by pivoting the dataframe to easily determine the best possible max_depth and min_samples_leaf combination. 
```
sns.heatmap(df_scores_pivot, annot=True)
```
Finally, the decision tree is retrained with the selected parameter combination, and the trained tree can be viewed as a tree diagram.

# 6.6 Ensemble learning and random forest
**Random Forest** is an example of **ensemble** learning in which each model is a decision tree and the trees' predictions are aggregated, for example by taking the most popular class or by averaging the predicted probabilities. Each tree sees only part of the original data (a bootstrap sample of the rows and a random subset of the features), so the trees differ from one another. The decision trees in a random forest are trained independently of each other.
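
To make the aggregation step concrete, the sketch below trains a small forest and checks that averaging the probabilities predicted by the individual trees gives the same result as the forest itself. The synthetic dataset from `make_classification` is an assumption used only to keep the example self-contained; in these notes the real data is `X_train` / `y_train`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (assumption, not the dataset used in the notes)
X, y = make_classification(n_samples=300, n_features=8, random_state=1)

rf = RandomForestClassifier(n_estimators=10, max_depth=5, random_state=1)
rf.fit(X, y)

# The fitted trees are available in rf.estimators_
tree_probas = np.array([tree.predict_proba(X)[:, 1] for tree in rf.estimators_])

# Average the per-tree probabilities and compare with the forest's own output
manual_proba = tree_probas.mean(axis=0)
forest_proba = rf.predict_proba(X)[:, 1]
print(np.allclose(manual_proba, forest_proba))
```

In scikit-learn the forest's predicted probability is the mean of the trees' predicted probabilities, which is why the two arrays match.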

Tuning the *max_depth* parameter:
```
all_aucs = {}

for depth in [5, 10, 20]:
    print('depth: %s' % depth)
    aucs = []

    for i in range(10, 201, 10):
        rf = RandomForestClassifier(n_estimators=i, max_depth=depth, random_state=1)
        rf.fit(X_train, y_train)
        y_pred = rf.predict_proba(X_val)[:, 1]
        auc = roc_auc_score(y_val, y_pred)
        print('%s -> %.3f' % (i, auc))
        aucs.append(auc)
    
    all_aucs[depth] = aucs
```
Tuning the *min_samples_leaf* parameter:
```
all_aucs = {}

for m in [3, 5, 10]:
    print('min_samples_leaf: %s' % m)
    aucs = []

    for i in range(10, 201, 20):
        rf = RandomForestClassifier(n_estimators=i, max_depth=10, min_samples_leaf=m, random_state=1)
        rf.fit(X_train, y_train)
        y_pred = rf.predict_proba(X_val)[:, 1]
        auc = roc_auc_score(y_val, y_pred)
        print('%s -> %.3f' % (i, auc))
        aucs.append(auc)
    
    all_aucs[m] = aucs
    print()
```

# 6.7 Gradient boosting and XGBoost
Unlike Random Forest, where each decision tree is trained *independently*, in **Gradient Boosting Trees** the models are combined *sequentially*: each model looks at the prediction errors made by the previous model and tries to correct them. This process is repeated for `n` iterations, and in the end all the predictions are combined into the *final prediction*.  
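
The sequential idea can be sketched by hand in a few lines. This is a simplified illustration and not what XGBoost does internally: it assumes a regression task with squared error, uses scikit-learn's `DecisionTreeRegressor` as the base model, and runs on synthetic data. The names `eta` and `n_rounds` are chosen here to mirror the XGBoost parameters used below.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data (assumption for illustration only)
rng = np.random.default_rng(1)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)

eta = 0.3       # learning rate: how much of each new tree's prediction is added
n_rounds = 10   # number of boosting iterations

prediction = np.full_like(y, y.mean())  # start from a constant prediction
trees, errors = [], []

for _ in range(n_rounds):
    residuals = y - prediction                     # errors of the current ensemble
    tree = DecisionTreeRegressor(max_depth=2, random_state=1)
    tree.fit(X, residuals)                         # the next model learns those errors
    prediction += eta * tree.predict(X)            # add a shrunken correction
    trees.append(tree)
    errors.append(np.mean((y - prediction) ** 2))  # training MSE after this round

print(errors[0], errors[-1])
```

Each round fits a new tree to what the ensemble still gets wrong, so the training error shrinks as rounds are added. This is also why too many rounds can overfit.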

**XGBoost** is one of the libraries which implements the gradient boosting technique.
```
!pip install xgboost
```
- To train and evaluate the model, we need to wrap our train and validation data into a special data structure from XGBoost which is called **DMatrix**. This data structure is optimized to train xgboost models faster.  
```
import xgboost as xgb
dtrain = xgb.DMatrix(X_train, label=y_train, feature_names=dv.feature_names_)
dval = xgb.DMatrix(X_val, label=y_val, feature_names=dv.feature_names_)
```
- **xgb_params**: key-value pairs of hyperparameters to train xgboost model.
```
xgb_params = {
    'eta': 0.3,
    'max_depth': 6,
    'min_child_weight': 1,

    'objective': 'binary:logistic',
    'nthread': 8,
    'seed': 1
}
model = xgb.train(xgb_params, dtrain, num_boost_round=10)
```
- **watchlist**: a list of `(DMatrix, name)` tuples naming the datasets on which the model is evaluated after each boosting iteration. Passing it to `xgb.train` through the `evals` argument prints the evaluation metric on the training and validation sets after each round:
```
watchlist = [(dtrain, 'train'), (dval, 'val')]
```
- **`%%capture output`**: IPython magic command which captures the standard output and standard error of a cell.
- The last step is to **parse** the captured XGBoost output and plot it.
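
The parsing step could look roughly like the sketch below. It assumes that each captured line has the form `[iteration]`, then `train-auc:value`, then `val-auc:value`, separated by whitespace, which is what XGBoost prints when the watchlist above is used with AUC as the evaluation metric. The function name and the sample text are hypothetical, and the numbers in `sample` are made up purely to show the format.

```python
def parse_xgb_output(text):
    """Turn captured XGBoost log lines into (iteration, train_score, val_score) tuples."""
    results = []
    for line in text.strip().split("\n"):
        # Each line has three whitespace-separated parts
        it_part, train_part, val_part = line.split()
        iteration = int(it_part.strip("[]"))
        train_score = float(train_part.split(":")[1])
        val_score = float(val_part.split(":")[1])
        results.append((iteration, train_score, val_score))
    return results


# Made-up sample in the assumed format; in a notebook the text would come
# from the object filled by `%%capture output`, i.e. output.stdout
sample = (
    "[0]\ttrain-auc:0.86\tval-auc:0.77\n"
    "[5]\ttrain-auc:0.93\tval-auc:0.80\n"
    "[10]\ttrain-auc:0.95\tval-auc:0.81"
)

for iteration, train_score, val_score in parse_xgb_output(sample):
    print(iteration, train_score, val_score)
```

The resulting tuples can be loaded into a dataframe and plotted as two curves (train and validation) against the iteration number.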



# 6.8 XGBoost parameter tuning
XGBoost has various tunable parameters but the three most important ones are:

- **eta** (default=0.3)
  - Also called learning_rate. It is the step-size shrinkage applied at each boosting step: the contribution of every newly added tree is scaled by eta, which makes boosting more conservative and helps prevent overfitting. range: [0, 1]

- **max_depth** (default=6)
  - Maximum depth of a tree. Increasing this value makes the model more complex and more likely to overfit. range: [0, inf]

- **min_child_weight** (default=1)
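  - Minimum sum of instance weights (hessian) needed in a child node. For squared-error regression this equals the minimum number of samples in a leaf, so it plays a role similar to *min_samples_leaf*. range: [0, inf] 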


Sequence:

1) find the best value for eta  
2) find the best value for max_depth  
3) find the best value for min_child_weight  
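
The sequence above is a one-parameter-at-a-time search, which can be sketched generically as follows. Everything here is hypothetical scaffolding: `tune_sequentially` and `evaluate` are names invented for this example, the candidate values are arbitrary, and `toy_evaluate` is a stand-in formula. In practice `evaluate` would train an XGBoost model with the given parameters and return its validation AUC.

```python
def tune_sequentially(evaluate, base_params, search_space):
    """Tune one parameter at a time, keeping the best value found so far.

    evaluate: callable mapping a params dict to a validation score (higher is better)
    search_space: dict of parameter name -> candidate values, in tuning order
    """
    params = dict(base_params)
    for name, candidates in search_space.items():
        # Score every candidate while all other parameters stay fixed
        scored = [(evaluate({**params, name: value}), value) for value in candidates]
        best_score, best_value = max(scored, key=lambda pair: pair[0])
        params[name] = best_value
    return params


def toy_evaluate(params):
    """Stand-in for 'train the model and return validation AUC' (not a real score)."""
    return -(
        (params["eta"] - 0.1) ** 2
        + (params["max_depth"] - 3) ** 2
        + (params["min_child_weight"] - 10) ** 2
    )


base_params = {"eta": 0.3, "max_depth": 6, "min_child_weight": 1}
search_space = {
    "eta": [0.01, 0.05, 0.1, 0.3, 1.0],
    "max_depth": [3, 4, 6, 10],
    "min_child_weight": [1, 10, 30],
}
best_params = tune_sequentially(toy_evaluate, base_params, search_space)
print(best_params)
```

Tuning one parameter at a time is much cheaper than a full grid, at the cost of possibly missing combinations that only work well together.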


___________

Other useful parameters are:

- subsample (default=1)
  - Subsample ratio of the training instances. Setting it to 0.5 means that the model randomly samples half of the training data before growing each tree. range: (0, 1]
- colsample_bytree (default=1)
  - Subsample ratio of the columns used to build each tree. This is similar to random forest, where each tree is built from a randomly chosen subset of the features. range: (0, 1]
- lambda (default=1)
  - Also called reg_lambda. L2 regularization term on weights. Increasing this value makes the model more conservative.
- alpha (default=0)
  - Also called reg_alpha. L1 regularization term on weights. Increasing this value makes the model more conservative.

# 6.9 Selecting the best model
We select the final model among the *decision tree, random forest, and XGBoost* candidates by comparing their AUC scores. We then prepare `df_full_train` and `df_test` to train and evaluate the final model. If the model's AUC scores on the training data and on the test data do not differ much, the model has generalized the patterns well enough.

Generally, XGBoost models perform better on tabular data than other machine learning models, but the downside is that they are easy to overfit because of the large number of hyperparameters. XGBoost models therefore require more careful parameter tuning.