# Using several classifiers and tuning parameters - Parameters grid

**Model selection** is the process of choosing one final machine learning model from a collection of candidate models for a given training dataset.
It is a critical step in building a machine learning model: the right choice can greatly improve performance, while the wrong choice can leave you with unacceptable results.

We're going to use the *model selection* features of scikit-learn to tune and compare several classification methods.



In [2]:
"""
http://scikit-learn.org/stable/auto_examples/model_selection/plot_grid_search_digits.html
@author: scikit-learn.org and Claudio Sartori
"""
import warnings
warnings.filterwarnings('ignore') # suppresses warnings; comment this line out to see them

# A few imports from the sklearn library
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import classification_report
from sklearn.svm import SVC
from sklearn.linear_model import Perceptron
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier

print(__doc__) # print information included in the triple quotes at the beginning


# Loading a standard dataset
# For this lab we are going to use the iris dataset

#dataset = datasets.load_digits()
#dataset = datasets.fetch_olivetti_faces()
#dataset = datasets.fetch_covtype()
dataset = datasets.load_iris()
#dataset = datasets.load_wine()
#dataset = datasets.load_breast_cancer()


http://scikit-learn.org/stable/auto_examples/model_selection/plot_grid_search_digits.html
@author: scikit-learn.org and Claudio Sartori



Now prepare the environment by specifying the attributes (`X`), the targets (`y`), the test fraction, and the random seed.

In [3]:
X = dataset.data
y = dataset.target
ts = 0.3 # Fraction of the test data (between 0.2 and 0.5); ts = test size
random_state = 42

Split the dataset into train and test, and print the shapes.

In [10]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = ts, random_state = random_state)

print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)

(105, 4)
(45, 4)
(105,)
(45,)


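To simplify the remainder of the session, define the model labels, the parameter grid for each model, and a dictionary that collects the estimators.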

In [11]:
model_lbls = [
    'dt',
    'nb',
    'lp',
    'svc',
    'knn',
    'adb',
    'rf',
]

# Parameter grids to be explored by cross-validation
tuned_param_dt = [{'max_depth': [*range(1,20)]}]
tuned_param_nb = [{'var_smoothing': [10, 1, 1e-1, 1e-2, 1e-3, 1e-4, 1e-5, 1e-6, 1e-07, 1e-8, 1e-9, 1e-10]}]
tuned_param_lp = [{'early_stopping': [True]}]
tuned_param_svc = [{'kernel': ['rbf'], 
                    'gamma': [1e-3, 1e-4],
                    'C': [1, 10, 100, 1000],
                    },
                    {'kernel': ['linear'],
                     'C': [1, 10, 100, 1000],                     
                    },
                   ]
tuned_param_knn =[{'n_neighbors': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]}]
tuned_param_adb = [{'n_estimators':[20,30,40,50],
                   'learning_rate':[0.5,0.75,1,1.25,1.5]}]
tuned_param_rf = [{'max_depth': [*range(5,15)],
                   'n_estimators':[*range(10,100,10)]}]

models = {
    'dt': {'name': 'Decision Tree       ',
           'estimator': DecisionTreeClassifier(), 
           'param': tuned_param_dt,
          },
    'nb': {'name': 'Gaussian Naive Bayes',
           'estimator': GaussianNB(),
           'param': tuned_param_nb
          },
    'lp': {'name': 'Linear Perceptron   ',
           'estimator': Perceptron(),
           'param': tuned_param_lp,
          },
    'svc':{'name': 'Support Vector      ',
           'estimator': SVC(), 
           'param': tuned_param_svc
          },
    'knn':{'name': 'K Nearest Neighbor ',
           'estimator': KNeighborsClassifier(),
           'param': tuned_param_knn
          },
    'adb':{'name': 'AdaBoost           ',
           'estimator': AdaBoostClassifier(),
           'param': tuned_param_adb
          },
    'rf': {'name': 'Random forest       ',
           'estimator': RandomForestClassifier(),
           'param': tuned_param_rf
          }

}

scores = ['precision', 'recall']
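
The two entries of `scores` are later expanded into the scikit-learn scoring names `precision_macro` and `recall_macro`. As a minimal sketch of what "macro" means, the example below computes the per-class scores and their unweighted mean. The labels `y_true_toy` and `y_pred_toy` are invented for illustration and do not come from the iris data.

```python
from sklearn.metrics import precision_score, recall_score

# Toy labels, invented for this illustration only
y_true_toy = [0, 0, 0, 0, 1, 1, 2, 2]
y_pred_toy = [0, 0, 0, 1, 1, 1, 2, 0]

# Build the scoring names with the same string formatting used in the tuning loop
scoring_names = ['%s_macro' % s for s in ['precision', 'recall']]
print(scoring_names)

# average=None returns one score per class
per_class_precision = precision_score(y_true_toy, y_pred_toy, average=None)
per_class_recall = recall_score(y_true_toy, y_pred_toy, average=None)

# average='macro' is the unweighted mean of the per-class scores,
# so every class counts equally regardless of its size
macro_precision = precision_score(y_true_toy, y_pred_toy, average='macro')
macro_recall = recall_score(y_true_toy, y_pred_toy, average='macro')

print("per-class precision:", per_class_precision, "macro:", macro_precision)
print("per-class recall:   ", per_class_recall, "macro:", macro_recall)
```

Because the macro average ignores class sizes, a poorly predicted small class lowers the score as much as a poorly predicted large one.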

Group the outputs with a simple function that takes the fitted model as its parameter and uses the model's components to inspect the results of the search over the parameter grid.

The components are:<br>
`model.best_params_`<br>
`model.cv_results_['mean_test_score']`<br>
`model.cv_results_['std_test_score']`<br>
`model.cv_results_['params']`

The `classification_report()` function, imported above from `sklearn.metrics`, takes the true and the predicted test labels as arguments.
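
As a rough, self-contained sketch of where these components come from, the example below fits a small grid search and reads the same attributes. The three-value grid `small_grid`, the fixed `random_state` on the tree, and the variable names are assumptions chosen for illustration; they are not the grids used in the rest of this notebook.

```python
from sklearn import datasets
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Same dataset and split settings as the notebook, under different names
X_all, y_all = datasets.load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X_all, y_all, test_size=0.3, random_state=42)

# A deliberately tiny grid (illustrative assumption)
small_grid = [{'max_depth': [2, 3, 4]}]
search = GridSearchCV(DecisionTreeClassifier(random_state=42), small_grid,
                      cv=5, scoring='precision_macro')
search.fit(X_tr, y_tr)

# Best combination found by cross-validation on the train set
print(search.best_params_)

# One entry per candidate: mean and standard deviation of the fold scores
results = search.cv_results_
for mean, std, params in zip(results['mean_test_score'],
                             results['std_test_score'],
                             results['params']):
    print("%0.3f (+/-%0.03f) for %r" % (mean, std * 2, params))

# predict() uses the best estimator, refitted on the whole train set
print(classification_report(y_te, search.predict(X_te)))
```

The three arrays in `cv_results_` are aligned, which is why they can be zipped together to print one line per parameter combination.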

In [12]:
def print_results(model):
    print("Best parameters set found on train set:")
    print()
    # if best is linear there is no gamma parameter
    print(model.best_params_)
    print()
    print("Grid scores on train set:")
    print()
    means = model.cv_results_['mean_test_score']
    stds = model.cv_results_['std_test_score']
    params = model.cv_results_['params']
    for mean, std, params_tuple in zip(means, stds, params):
        print("%0.3f (+/-%0.03f) for %r"
              % (mean, std * 2, params_tuple))
    print()
    print("Detailed classification report for the best parameter set:")
    print()
    print("The model is trained on the full train set.")
    print("The scores are computed on the full test set.")
    print()
    y_true, y_pred = y_test, model.predict(X_test)
    print(classification_report(y_true, y_pred))
    print()

### Loop on scores and model labels

**Grid search** is a tuning technique that looks for the best values of a model's hyperparameters.

It is an exhaustive search: every combination of the specified parameter values is tried and evaluated with cross-validation.

In scikit-learn the model is also known as an estimator. Grid search automates the tuning, which saves the time and effort of trying each combination by hand.
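
To make "exhaustive" concrete, the sketch below expands the SVC grid defined earlier in this notebook with `ParameterGrid` from `sklearn.model_selection` and counts the candidates. The names `svc_grid`, `candidates`, and `n_folds` are chosen here for illustration.

```python
from sklearn.model_selection import ParameterGrid

# Copy of the SVC grid from this notebook: a list of two sub-grids
svc_grid = [
    {'kernel': ['rbf'], 'gamma': [1e-3, 1e-4], 'C': [1, 10, 100, 1000]},
    {'kernel': ['linear'], 'C': [1, 10, 100, 1000]},
]

# Each sub-grid expands to the Cartesian product of its value lists;
# the sub-grids are then concatenated
candidates = list(ParameterGrid(svc_grid))
n_candidates = len(candidates)

# With cv=5 every candidate is fitted once per fold
n_folds = 5
n_fits = n_candidates * n_folds

for params in candidates:
    print(params)
print("candidates:", n_candidates, "cross-validation fits:", n_fits)
```

The count grows multiplicatively with each added parameter, so it is worth checking the grid size before launching a search on a larger dataset.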

In [21]:
results_short = {}

for score in scores:
    print('='*40)
    print("# Tuning hyper-parameters for %s" % score)
    print()

    # '%s_macro' % score is a string formatting expression:
    # the value after % replaces the %s placeholder in the string
    for model in model_lbls:
        print('-'*40)
        print("Trying model {}".format(models[model]['name']))

        # Run the grid search with 5-fold cross-validation
        clf = GridSearchCV(models[model]['estimator'], models[model]['param'], cv=5,
                           scoring='%s_macro' % score,
                           return_train_score=False,
                           n_jobs=2,  # run the fits in parallel on two cores
                           )
        clf.fit(X_train, y_train)
        print_results(clf)
        results_short[model] = clf.best_score_
    print("Summary of results for {}".format(score))
    print("Estimator")
    for m in results_short.keys():
        # best_score_ is a fraction in [0, 1], so format it as a percentage
        print("{}\t - score: {:.1%}".format(models[m]['name'], results_short[m]))
          


# Tuning hyper-parameters for precision

----------------------------------------
Trying model Decision Tree       
Best parameters set found on train set:

{'max_depth': 4}

Grid scores on train set:

0.491 (+/-0.009) for {'max_depth': 1}
0.924 (+/-0.070) for {'max_depth': 2}
0.941 (+/-0.072) for {'max_depth': 3}
0.943 (+/-0.042) for {'max_depth': 4}
0.933 (+/-0.048) for {'max_depth': 5}
0.943 (+/-0.042) for {'max_depth': 6}
0.943 (+/-0.042) for {'max_depth': 7}
0.943 (+/-0.042) for {'max_depth': 8}
0.943 (+/-0.042) for {'max_depth': 9}
0.943 (+/-0.042) for {'max_depth': 10}
0.943 (+/-0.042) for {'max_depth': 11}
0.943 (+/-0.042) for {'max_depth': 12}
0.943 (+/-0.042) for {'max_depth': 13}
0.943 (+/-0.042) for {'max_depth': 14}
0.943 (+/-0.042) for {'max_depth': 15}
0.943 (+/-0.042) for {'max_depth': 16}
0.943 (+/-0.042) for {'max_depth': 17}
0.943 (+/-0.042) for {'max_depth': 18}
0.943 (+/-0.042) for {'max_depth': 19}

Detailed classification report for the best parameter set:

The m