# Hands-On Machine Learning with Scikit-Learn and TensorFlow

## Chapter 6 - Decision Trees

#### 1. What is the approximate depth of a Decision Tree trained (without restrictions) on a training set with 1 million instances?

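> A Decision Tree trained without restrictions keeps splitting until every leaf is pure, which in the extreme case means one training instance per leaf. If the tree is reasonably well balanced, a tree with N leaves has a depth of about log2(N). For N = 1,000,000 this gives roughly 20 levels (slightly more in practice, since trees are rarely perfectly balanced).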

In [7]:
import numpy as np
np.log2(1000000)

19.931568569324174

#### 2. Is a node’s Gini impurity generally lower or greater than its parent’s? Is it generally lower/greater, or always lower/greater?

> Generally lower, but not always. At each split, the CART algorithm minimizes the weighted sum of the two children's Gini impurities. That weighted sum is usually lower than the parent's impurity, so a node's impurity is generally lower than its parent's. An individual child can still end up with a higher impurity than its parent, as long as the increase is more than offset by the drop in the other child's impurity.
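
As an illustration, consider a hypothetical node holding four instances of class A and one of class B (these counts are made up for the example, not taken from the exercise). A minimal sketch of the arithmetic, with `gini` as a helper name introduced here:

```python
def gini(class_counts):
    """Gini impurity of a node, given the number of instances per class."""
    total = sum(class_counts)
    return 1.0 - sum((count / total) ** 2 for count in class_counts)

parent = gini([4, 1])   # node with four A and one B
left = gini([1, 1])     # child receiving one A and one B
right = gini([3, 0])    # child receiving the remaining three A

# CART compares the parent against the size-weighted impurity of the children
weighted_children = (2 / 5) * left + (3 / 5) * right

print(round(parent, 2), left, right, round(weighted_children, 2))
# 0.32 0.5 0.0 0.2
```

The left child is more impure than its parent (0.5 versus 0.32), yet the split still lowers the weighted impurity from 0.32 to 0.2.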

#### 3. If a Decision Tree is overfitting the training set, is it a good idea to try decreasing max_depth?

> Yes. Reducing max_depth limits how many decision nodes the tree can create, which constrains (regularizes) the model and reduces the risk of overfitting.
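
A rough way to see this effect is to compare an unrestricted tree with a depth-limited one on noisy data. The sketch below is only illustrative: the dataset size, the seeds, and `max_depth=3` are arbitrary choices made here, not tuned values.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Small noisy dataset, used only for this illustration
X_demo, y_demo = make_moons(n_samples=2000, noise=0.4, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0
)

scores = {}
for depth in (None, 3):  # None means no depth limit
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    tree.fit(X_tr, y_tr)
    # (train accuracy, test accuracy, actual depth reached)
    scores[depth] = (tree.score(X_tr, y_tr), tree.score(X_te, y_te), tree.get_depth())
    print(f"max_depth={depth}: train={scores[depth][0]:.3f}, "
          f"test={scores[depth][1]:.3f}, depth={scores[depth][2]}")
```

A large gap between training and test accuracy for the unrestricted tree, and a smaller gap for the shallow one, is the usual sign that limiting depth is regularizing the model.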

#### 4. If a Decision Tree is underfitting the training set, is it a good idea to try scaling the input features?

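> No. The CART algorithm only compares each feature's values against thresholds, so what matters is the ordering of the values, not their scale. Scaling or centering the features (or applying any other strictly increasing transformation to a feature) leaves the resulting splits unchanged, so it will not help an underfitting tree.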
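
The invariance can be checked with a simplified, single-feature stand-in for the CART split search. This is a sketch written for illustration, not scikit-learn's implementation; `gini`, `best_split`, and the toy data are names and values introduced here.

```python
def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def best_split(values, labels):
    """Return the set of sample indices sent left by the lowest-impurity threshold."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    best_cost, best_left = float("inf"), None
    for k in range(1, len(order)):
        # A threshold can only fall between two distinct consecutive values
        if values[order[k - 1]] == values[order[k]]:
            continue
        left, right = order[:k], order[k:]
        cost = (len(left) * gini([labels[i] for i in left])
                + len(right) * gini([labels[i] for i in right])) / len(order)
        if cost < best_cost:
            best_cost, best_left = cost, frozenset(left)
    return best_left

x = [0.5, 1.2, 2.7, 3.1, 4.8, 5.0]
y = [0, 0, 0, 1, 1, 1]
x_scaled = [1000.0 * v + 5.0 for v in x]  # rescaled and shifted copy of the feature

same_partition = best_split(x, y) == best_split(x_scaled, y)
print(same_partition)  # True
```

The impurity of each candidate split depends only on which samples fall on each side, so rescaling the feature changes the threshold value but not the partition.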

#### 5. If it takes one hour to train a Decision Tree on a training set containing 1 million instances, roughly how much time will it take to train another Decision Tree on a training set containing 10 million instances?

> The computational complexity of training a Decision Tree is O(n × m log(m)), where m is the number of instances and n the number of features. Multiplying the training set size by 10 therefore multiplies the training time by K = (n × 10m × log(10m)) / (n × m × log(m)) = 10 × log(10m) / log(m). With m = 10^6, K = 10 × 7 / 6 ≈ 11.7, so training should take roughly 11.7 hours.

In [12]:
# Ratio of m*log(m) for 10 million instances (1e7) versus 1 million (1e6)
1e7 * np.log2(1e7) / (1e6 * np.log2(1e6))

11.666666666666666

#### 6. If your training set contains 100,000 instances, will setting presort=True speed up training?

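> Probably not. Presorting the data (presort=True) only speeds up training on small training sets, up to a few thousand instances. With 100,000 instances, the cost of sorting outweighs the gain, and training will most likely be considerably slower. Note also that the presort parameter was deprecated in scikit-learn 0.22 and later removed, so this question only applies to older versions.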

#### 7. Train and fine-tune a Decision Tree for the moons dataset.
- a. Generate a moons dataset using make_moons(n_samples=10000, noise=0.4)
- b. Split it into a training set and a test set using train_test_split()
- c. Use grid search with cross-validation (with the help of the GridSearchCV class) to find good hyperparameter values for a DecisionTreeClassifier. Hint: try various values for max_leaf_nodes.
- d. Train it on the full training set using these hyperparameters, and measure your model’s performance on the test set. You should get roughly 85% to 87% accuracy.

In [24]:
from sklearn.datasets import make_moons
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.tree import DecisionTreeClassifier

In [25]:
dataset = make_moons(n_samples=10000, noise=0.4, random_state=42)

In [26]:
X_train, X_test, y_train, y_test = train_test_split(dataset[0], dataset[1], test_size = 0.33, random_state = 42)

In [35]:
clf = DecisionTreeClassifier()

In [46]:
params = {
    'max_depth': [1, 2, 4, 6, 8, 10, 12, 14, 16, 18],
    # min_samples_split must be an integer greater than 1, so the grid starts at 2
    'min_samples_split': [2, 4, 6, 8, 10, 12, 14, 16, 18],
    # "auto" is left out: for classifiers it behaved like "sqrt", and recent
    # scikit-learn releases no longer accept it
    'max_features': ["sqrt", "log2"],
    # max_leaf_nodes must also be greater than 1
    'max_leaf_nodes': [2, 4, 6, 8, 10, 12, 14, 16, 18]
}

In [47]:
grid_search = GridSearchCV(clf, params, cv=3)

In [48]:
grid_search.fit(X_train, y_train)

In [49]:
best_model = grid_search.best_estimator_

In [50]:
y_pred = best_model.predict(X_test)

In [51]:
accuracy_score(y_test, y_pred)

0.8442424242424242

#### 8. Grow a forest.
- a. Continuing the previous exercise, generate 1,000 subsets of the training set, each containing 100 instances selected randomly. Hint: you can use Scikit-Learn’s ShuffleSplit class for this.
- b. Train one Decision Tree on each subset, using the best hyperparameter values found above. Evaluate these 1,000 Decision Trees on the test set. Since they were trained on smaller sets, these Decision Trees will likely perform worse than the first Decision Tree, achieving only about 80% accuracy.
- c. Now comes the magic. For each test set instance, generate the predictions of the 1,000 Decision Trees, and keep only the most frequent prediction (you can use SciPy’s mode() function for this). This gives you majority-vote predictions over the test set.
- d. Evaluate these predictions on the test set: you should obtain a slightly higher accuracy than your first model (about 0.5 to 1.5% higher). Congratulations, you have trained a Random Forest classifier!
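
The majority vote in step c can be illustrated on a toy example. The sketch below uses NumPy only and assumes binary 0/1 labels, as in the moons dataset; the `votes` matrix is made up for the illustration, and ties (impossible with an odd number of trees) would go to class 0.

```python
import numpy as np

# Rows are trees, columns are test instances (3 trees, 4 instances)
votes = np.array([
    [0, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 1, 1, 0],
])

# For 0/1 labels, the majority class is 1 when more than half of the trees vote 1
majority = (votes.mean(axis=0) > 0.5).astype(int)
print(majority)  # [1 1 1 0]
```

SciPy's mode() generalizes this to any number of classes by returning the most frequent value along the chosen axis.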

In [2]:
from sklearn.model_selection import ShuffleSplit

In [11]:
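# Note: with test_size=0.33, each split keeps about 67% of X_train for training,
# not the 100 instances the exercise asks for (train_size=100 would do that).
rs = ShuffleSplit(n_splits=1000, test_size=0.33, random_state=42)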
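
The exercise asks for subsets of exactly 100 instances. A minimal sketch of how ShuffleSplit could produce them through its `train_size` parameter is shown below. `X_demo` is a stand-in array with the same number of rows as the training set (6,700, assuming the 67/33 split used earlier), and only 5 splits are drawn to keep the example quick.

```python
import numpy as np
from sklearn.model_selection import ShuffleSplit

# Stand-in for X_train: only the number of rows matters here
X_demo = np.arange(6700).reshape(-1, 1)

# train_size=100 makes every training subset contain exactly 100 instances
splitter = ShuffleSplit(n_splits=5, train_size=100, random_state=42)
subset_sizes = [len(train_idx) for train_idx, _ in splitter.split(X_demo)]
print(subset_sizes)  # [100, 100, 100, 100, 100]
```

Trees trained on such small subsets would be weaker individually than the ones trained below, so the accuracies reported in this notebook would not carry over unchanged.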

In [27]:
decision_tree_list = []
acc_scores = []
predictions = []

# Train one tree per random subset and record its predictions and accuracy on the test set
for train_index, _ in rs.split(X_train):
    weak_learner = DecisionTreeClassifier(
        max_depth=2,
        min_samples_split=12,
        max_features="sqrt",
        max_leaf_nodes=8
    )
    weak_learner.fit(X_train[train_index], y_train[train_index])
    decision_tree_list.append(weak_learner)

    y_pred = weak_learner.predict(X_test)
    predictions.append(y_pred)

    acc_scores.append(accuracy_score(y_test, y_pred))

In [63]:
average_score = sum(acc_scores)/len(acc_scores)

In [64]:
print(f"The average accuracy score of the individual weak learners is {average_score}")

The average accuracy score of the individual weak learners is 0.7825415151515164


In [50]:
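import numpy as np
from scipy.stats import mode
predictions_array = np.array(predictions)
# Most frequent prediction per test instance (axis 0 runs over the trees).
# np.ravel gives a flat array whether or not SciPy keeps a leading axis of length 1.
majority_votes = np.ravel(mode(predictions_array, axis=0).mode)

In [60]:
len(majority_votes)

3300

In [61]:
accuracy_score_rf = accuracy_score(y_test, majority_votes)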

In [62]:
print(f"The accuracy score of the majority-vote ensemble is {accuracy_score_rf}")

The accuracy score of the majority-vote ensemble is 0.8606060606060606


In [65]:
# Relative improvement of the majority vote over the average individual tree
accuracy_score_rf/average_score - 1

0.09975770479017876