Exercise:
Train and fine-tune a Decision Tree for the moons dataset


In [3]:
# Generate a moons dataset using make_moons(n_samples=10000, noise=0.4)
# https://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_moons.html

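# Import the function explicitly: a bare `import sklearn` is not guaranteed to load the datasets submodule
from sklearn.datasets import make_moons

moon_data = make_moons(n_samples=10000, noise=0.4)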
moon_data

(array([[ 1.48348663,  0.50466074],
        [-0.79255326, -0.07392339],
        [-0.2424727 ,  0.16771531],
        ...,
        [ 1.40387372,  0.21485577],
        [ 1.83604438,  0.32314912],
        [-0.48030005,  1.30123248]]),
 array([0, 0, 0, ..., 1, 1, 0]))

In [11]:
# split it into a training set and a test set using train_test_split()
# http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html
X,y = moon_data

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [12]:
# Use grid search with cross-validation to find good hyperparameter values for a DecisionTreeClassifier
# https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html?highlight=gridsearchcv#sklearn.model_selection.GridSearchCV

from sklearn.model_selection import GridSearchCV

from sklearn.tree import DecisionTreeClassifier

# Try every max_leaf_nodes value from 2 to 99 (range() excludes the upper bound, giving 98 candidates)
params = {'max_leaf_nodes': list(range(2, 100))}
grid_search_cv = GridSearchCV(DecisionTreeClassifier(random_state=42), params, n_jobs=-1, verbose=1, cv=3)

grid_search_cv.fit(X_train, y_train)

Fitting 3 folds for each of 98 candidates, totalling 294 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  68 tasks      | elapsed:    1.7s
[Parallel(n_jobs=-1)]: Done 294 out of 294 | elapsed:    2.4s finished


GridSearchCV(cv=3, estimator=DecisionTreeClassifier(random_state=42), n_jobs=-1,
             param_grid={'max_leaf_nodes': [2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12,
                                            13, 14, 15, 16, 17, 18, 19, 20, 21,
                                            22, 23, 24, 25, 26, 27, 28, 29, 30,
                                            31, ...]},
             verbose=1)

In [13]:
# Show the best estimator found by the search, including its max_leaf_nodes
grid_search_cv.best_estimator_

DecisionTreeClassifier(max_leaf_nodes=14, random_state=42)
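
Besides best_estimator_, the fitted search object also exposes the winning hyperparameters and their mean cross-validated score directly. The following is a minimal, self-contained sketch of reading them; the random_state=42 passed to make_moons and the variable names are assumptions added here for reproducibility and are not part of the original cells.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Assumption: the seed on make_moons is added for reproducibility;
# the original notebook cell did not fix it.
X, y = make_moons(n_samples=10000, noise=0.4, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Same grid as in the notebook: max_leaf_nodes from 2 to 99.
params = {"max_leaf_nodes": list(range(2, 100))}
search = GridSearchCV(DecisionTreeClassifier(random_state=42), params, cv=3)
search.fit(X_train, y_train)

# best_params_ is a dict with the winning value for each searched parameter;
# best_score_ is the mean cross-validated accuracy of that candidate.
best_params = search.best_params_
best_cv_score = search.best_score_
print(best_params, best_cv_score)
```

Note that best_score_ is measured on the validation folds of the training set, so it can differ slightly from the accuracy on the held-out test set.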

In [14]:
# Train on the full training set using the best max_leaf_nodes; the target is a test accuracy above 85%

# Use the same random_state as the grid search so the result is reproducible
best_moon_clf = DecisionTreeClassifier(max_leaf_nodes=14, random_state=42)
best_moon_clf.fit(X_train, y_train)

from sklearn.metrics import accuracy_score

y_pred = best_moon_clf.predict(X_test)
accuracy_score(y_test, y_pred)

0.8575

In [15]:
# By default (refit=True), GridSearchCV retrains the best estimator on the full training set,
# so grid_search_cv.predict() uses that best estimator
y_pred = grid_search_cv.predict(X_test)
accuracy_score(y_test, y_pred)

0.8575
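
The matching accuracy above follows from the default refit=True behavior. A small sketch that checks this directly is given below; it assumes a fixed make_moons seed and a reduced grid (both hypothetical choices made here to keep the example quick), and it compares the predictions of the search object with those of its best_estimator_.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Assumption: seed and reduced grid are illustrative, not from the notebook.
X, y = make_moons(n_samples=10000, noise=0.4, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

params = {"max_leaf_nodes": list(range(2, 30))}
search = GridSearchCV(DecisionTreeClassifier(random_state=42), params, cv=3)
search.fit(X_train, y_train)  # refit=True is the default

# predict() on the search object delegates to the refit best estimator.
pred_via_search = search.predict(X_test)
pred_via_best = search.best_estimator_.predict(X_test)
same_predictions = bool(np.array_equal(pred_via_search, pred_via_best))
print(same_predictions)
```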

In [16]:
# Try to improve the accuracy by also searching over another hyperparameter, min_samples_split

params = {'max_leaf_nodes': list(range(2, 100)),'min_samples_split': list(range(2,10))}
grid_search_cv = GridSearchCV(DecisionTreeClassifier(random_state=42), params, n_jobs=-1, verbose=1, cv=3)

grid_search_cv.fit(X_train, y_train)

y_pred = grid_search_cv.predict(X_test)
accuracy_score(y_test, y_pred)


Fitting 3 folds for each of 784 candidates, totalling 2352 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done 296 tasks      | elapsed:    1.6s
[Parallel(n_jobs=-1)]: Done 2321 tasks      | elapsed:    8.2s
[Parallel(n_jobs=-1)]: Done 2352 out of 2352 | elapsed:    8.3s finished


0.8575

With this setting, adding min_samples_split to the grid does not improve the accuracy: the test score stays at 0.8575.
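
One way to probe why min_samples_split might have no effect is to check how many training samples reach each split node of the tree. If every split node already holds far more samples than the largest min_samples_split tried (9), the constraint never becomes active. The sketch below is only an illustration under stated assumptions: the make_moons seed is added here, and max_leaf_nodes=14 is taken from the earlier search result, which may differ for other data draws.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Assumption: the seed on make_moons is added for reproducibility.
X, y = make_moons(n_samples=10000, noise=0.4, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

clf = DecisionTreeClassifier(max_leaf_nodes=14, random_state=42)
clf.fit(X_train, y_train)

tree = clf.tree_
# In the fitted tree arrays, children_left == -1 marks a leaf node.
is_split_node = tree.children_left != -1
split_node_sizes = tree.n_node_samples[is_split_node]

# Smallest number of training samples at any node that was actually split.
smallest_split_node = int(split_node_sizes.min())
print("split nodes:", len(split_node_sizes))
print("smallest split node size:", smallest_split_node)
```

If the smallest split node size printed here is well above 9, that would be consistent with min_samples_split values from 2 to 9 leaving the tree unchanged; parameters such as min_samples_leaf or max_depth with larger values might be more informative to search over.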