# What is the approximate depth of a Decision Tree trained (without restrictions) on a training set with one million instances?

<font size=3>&ensp;&ensp;**The depth of a well-balanced binary tree containing m leaves is equal to log2(m), rounded up. A binary Decision Tree (one that makes only binary decisions) will end up more or less well balanced at the end of training, with one leaf per training instance if it is trained without restrictions. Thus, if the training set contains one million instances, the Decision Tree will have a depth of log2(10^6) ≈ 20 (actually a bit more, since the tree will generally not be perfectly well balanced).**</font>
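The arithmetic can be checked with a few lines of Python. This is only a sketch of the idealized, perfectly balanced case; the variable names are illustrative.

```python
import math

m = 1_000_000  # training instances, i.e. leaves of an unrestricted tree

# Depth of a perfectly balanced binary tree with m leaves: log2(m), rounded up
exact = math.log2(m)
depth = math.ceil(exact)

print(f"log2({m}) = {exact:.2f}, so the balanced depth is {depth}")
```

A real tree will usually be somewhat deeper than this, because its splits are not perfectly balanced.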

# Is a node's Gini impurity generally lower or greater than its parent's? Is it generally lower/greater, or always lower/greater?

<font size=3>&ensp;&ensp;**A node's Gini impurity is generally lower than its parent's. This is due to the CART training algorithm's cost function, which splits each node in a way that minimizes the weighted sum of its children's Gini impurities. However, it is possible for a node to have a higher Gini impurity than its parent, as long as this increase is more than compensated for by a decrease in the other child's impurity.**</font>
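A small worked example may make the "higher than its parent" case concrete. The sketch below assumes a hypothetical one-dimensional node whose five instances are ordered A, B, A, A, A, and tries every split position the way a greedy CART-style search would. The helper names `gini` and `weighted_gini` are illustrative, not part of any library.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels: 1 - sum of squared class ratios."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def weighted_gini(left, right):
    """Size-weighted sum of the two children's Gini impurities."""
    n = len(left) + len(right)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

# Hypothetical node: instances sorted along a single feature
parent = ["A", "B", "A", "A", "A"]

# Try every split position and keep the one with the lowest weighted impurity
best_split = min(range(1, len(parent)),
                 key=lambda i: weighted_gini(parent[:i], parent[i:]))
left, right = parent[:best_split], parent[best_split:]

print("parent Gini:   ", round(gini(parent), 3))
print("left child:    ", left, round(gini(left), 3))
print("right child:   ", right, round(gini(right), 3))
print("weighted Gini: ", round(weighted_gini(left, right), 3))
```

In this toy case the left child ends up less pure than the parent, but the right child is pure, so the weighted sum still drops below the parent's impurity.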

# If a Decision Tree is overfitting the training set, is it a good idea to try decreasing max_depth?

<font size=3>&ensp;&ensp;**If a Decision Tree is overfitting the training set, it may be a good idea to decrease max_depth, since this will constrain the model, regularizing it.**</font>
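As a rough illustration, the sketch below compares an unrestricted tree with one limited to `max_depth=3` on a small noisy moons dataset. The dataset size, the depth of 3, and the variable names are arbitrary choices for the demonstration, not tuned values.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Small noisy dataset, chosen only for this demonstration
X_demo, y_demo = make_moons(n_samples=1000, noise=0.4, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

# Same model, with and without a depth restriction
unrestricted = DecisionTreeClassifier(random_state=42).fit(X_tr, y_tr)
shallow = DecisionTreeClassifier(max_depth=3, random_state=42).fit(X_tr, y_tr)

for name, tree in [("unrestricted", unrestricted), ("max_depth=3", shallow)]:
    print(f"{name}: depth={tree.get_depth()}, "
          f"train acc={tree.score(X_tr, y_tr):.3f}, "
          f"test acc={tree.score(X_te, y_te):.3f}")
```

A large gap between training and test accuracy for the unrestricted tree is the usual sign of overfitting. Whether the shallower tree generalizes better depends on the data, so the depth is best chosen by cross-validation.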

# If a Decision Tree is underfitting the training set, is it a good idea to try scaling the input features?

<font size=3>&ensp;&ensp;**No. Decision Trees don't care whether or not the training data is scaled or centered, so if a Decision Tree underfits the training set, scaling the input features will not help.**</font>
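One way to see this is to train the same tree on raw and on standardized features and compare the predictions. The following is a sketch under arbitrary assumptions (moons data, `max_depth=3`, `StandardScaler`). Since the splits depend only on the ordering of feature values, the two trees should agree on essentially every instance.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Demo data: one set for training, one for comparing predictions
X_fit, y_fit = make_moons(n_samples=1000, noise=0.4, random_state=42)
X_new, _ = make_moons(n_samples=200, noise=0.4, random_state=0)

scaler = StandardScaler().fit(X_fit)

# Identical hyperparameters; only the feature scaling differs
tree_raw = DecisionTreeClassifier(max_depth=3, random_state=42).fit(X_fit, y_fit)
tree_scaled = DecisionTreeClassifier(max_depth=3, random_state=42).fit(
    scaler.transform(X_fit), y_fit)

# Fraction of new instances on which the two trees predict the same class
agreement = np.mean(
    tree_raw.predict(X_new) == tree_scaled.predict(scaler.transform(X_new)))
print(f"prediction agreement: {agreement:.3f}")
```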

# If it takes one hour to train a Decision Tree on a training set containing 1 million instances, roughly how much time will it take to train another Decision Tree on a training set containing 10 million instances?

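<font size=3>&ensp;&ensp;**Roughly 11.7 hours. The computational complexity of training a Decision Tree is O(n × m log(m)), where m is the number of instances and n the number of features. Multiplying the training set size by 10 therefore multiplies the training time by K = (n × 10m × log(10m)) / (n × m × log(m)) = 10 × log(10m) / log(m). With m = 10^6, K ≈ 11.7.**</font>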
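The 11.7 figure can be reproduced numerically. This sketch assumes training cost scales as n × m × log(m), where m is the number of instances and n the number of features, and that n is the same for both training sets so it cancels out. The variable names are illustrative.

```python
import math

m = 1_000_000      # instances in the original training set
factor = 10        # the new training set is 10 times larger
base_hours = 1.0   # measured training time on the original set

# Ratio of (factor*m) * log(factor*m) to m * log(m); the feature count n cancels
ratio = (factor * m * math.log(factor * m)) / (m * math.log(m))
estimated_hours = base_hours * ratio

print(f"time multiplier K = {ratio:.2f}, estimated training time = {estimated_hours:.1f} hours")
```

This is only a back-of-the-envelope estimate; actual training time also depends on hardware, memory, and the shape of the resulting tree.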

# If your training set contains 100,000 instances, will setting presort=True speed up training?

<font size=3>&ensp;&ensp;**No. Presorting the training instances speeds up training only if the dataset is smaller than a few thousand instances. With 100,000 instances, setting presort=True will considerably slow down training.**</font>

# Exercise 7

In [1]:
import numpy as np
import sklearn
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = make_moons(n_samples=10000, noise=0.4)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

params = {'max_leaf_nodes': list(range(2, 100)), 'min_samples_split': [2, 3, 4]}
grid_search_cv = GridSearchCV(DecisionTreeClassifier(random_state=42), params, verbose=1, cv=3)

grid_search_cv.fit(X_train, y_train)

Fitting 3 folds for each of 294 candidates, totalling 882 fits


GridSearchCV(cv=3, estimator=DecisionTreeClassifier(random_state=42),
             param_grid={'max_leaf_nodes': [2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12,
                                            13, 14, 15, 16, 17, 18, 19, 20, 21,
                                            22, 23, 24, 25, 26, 27, 28, 29, 30,
                                            31, ...],
                         'min_samples_split': [2, 3, 4]},
             verbose=1)

In [2]:
grid_search_cv.best_estimator_

DecisionTreeClassifier(max_leaf_nodes=28, random_state=42)

In [3]:
grid_search_cv.best_score_

0.8663756209018508

In [4]:
from sklearn.metrics import accuracy_score

y_pred = grid_search_cv.predict(X_test)
accuracy_score(y_test, y_pred)

0.87

# Exercise 8

In [5]:
from sklearn.model_selection import ShuffleSplit

n_trees = 1000      # number of subsets (and trees) to generate
n_instances = 100   # instances per subset

mini_sets = []

# The "train" side of each ShuffleSplit is a random subset of n_instances instances
rs = ShuffleSplit(n_splits=n_trees, test_size=len(X_train) - n_instances, random_state=42)
for mini_train_index, mini_test_index in rs.split(X_train):
    X_mini_train = X_train[mini_train_index]
    y_mini_train = y_train[mini_train_index]
    mini_sets.append((X_mini_train, y_mini_train))

In [12]:
from sklearn.base import clone

# One untrained copy of the best tree from Exercise 7 per subset
forest = [clone(grid_search_cv.best_estimator_) for _ in range(n_trees)]

accuracy_scores = []

# Train each tree on its own subset, then evaluate it on the full test set
for tree, (X_mini_train, y_mini_train) in zip(forest, mini_sets):
    tree.fit(X_mini_train, y_mini_train)
    y_pred = tree.predict(X_test)
    accuracy_scores.append(accuracy_score(y_test, y_pred))

np.mean(accuracy_scores)

0.7945225

In [13]:
# Row i holds the predictions of tree i for every test instance
Y_pred = np.empty([n_trees, len(X_test)], dtype=np.uint8)

for tree_index, tree in enumerate(forest):
    Y_pred[tree_index] = tree.predict(X_test)

In [14]:
from scipy.stats import mode

# Majority vote: most frequent prediction per test instance (column-wise mode)
y_pred_majority_votes, n_votes = mode(Y_pred, axis=0)

In [15]:
accuracy_score(y_test, y_pred_majority_votes.reshape([-1]))

0.872