# Random search
A popular way of tuning a model's hyper-parameters is to conduct a random search. A random search is similar to a grid search, except that it does not cover every possible combination in the hyper-parameter space. Only a smaller subset of combinations, picked at random, is evaluated.

In this notebook we will review the basic concepts you need to know in order to successfully tune your model using ```RandomizedSearchCV``` from **Sklearn**.

We will use an XGBoost classifier to demonstrate how to conduct a random search. However, the technique can be used with any other model. We will start by importing the classifier, the random search class, and the function that creates the (dummy) data we will use.

In [4]:
#Import the classifier
from xgboost import XGBClassifier

#Import random search
from sklearn.model_selection import RandomizedSearchCV

#Import functions to create dummy data
from mlb_misc_functions import obtain_train_clf_data

# Data Creation
We will start by creating the data we will be using in this notebook. We will create two lists (**train_x** and **train_y**) representing the features and the targets already in the format needed for training. In the following block of code these lists are created using the function ```obtain_train_clf_data()```. After that, we print the first 3 elements of each list and the number of elements in each list.

In [16]:
#Create the lists
train_x, train_y = obtain_train_clf_data()

#Print the first 3 elements of each list, and the number of elements in the lists
print("the first 3 elements of train_x:")
print(train_x[:3])
print("the first 3 elements of train_y:")
print(train_y[:3])
print("number of elements in the lists")
print(len(train_x))

the first 3 elements of train_x:
[[ -9   2  -3   5   9]
 [  8   9   4 -10   6]
 [ -2   5   8   3   1]]
the first 3 elements of train_y:
[0 1 1]
number of elements in the lists
8000


Each list contains 8,000 elements. Each element of **train_x** is a list of 5 numbers: these are the model's features. The list **train_y**, on the other hand, contains only the numbers 0 or 1: these are the model's targets.
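The module ```mlb_misc_functions``` is a local helper, so you may not have it available. The block below is a minimal sketch of a stand-in that produces data with the same shape as the output printed above. The function name ```make_dummy_clf_data``` and the rule used to build the targets are assumptions made for illustration; they are not the original helper's implementation.

```python
import numpy as np


def make_dummy_clf_data(n_samples=8000, n_features=5, seed=0):
    """Hypothetical stand-in for obtain_train_clf_data() (not the original helper)."""
    rng = np.random.default_rng(seed)
    # Integer features in [-10, 10], similar to the sample printed above
    x = rng.integers(-10, 11, size=(n_samples, n_features))
    # Assumed rule for the binary target: 1 when the features sum to more than 0
    y = (x.sum(axis=1) > 0).astype(int)
    return x, y


demo_x, demo_y = make_dummy_clf_data()
print(demo_x[:3])
print(demo_y[:3])
print(len(demo_x))
```

Any feature matrix with one row per sample and a matching vector of 0/1 targets can be used in the rest of the notebook.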

# Random Search Implementation
The first step is to create the classifier, in this case XGBoost. After that, we define the hyper-parameter space we are going to explore. See the block of code below:

In [13]:
#Create classifier
xgb_clf = XGBClassifier()

#Define hyper-parameter space
parameters = {"n_estimators": [10, 20, 40, 60, 80, 100, 110, 120, 130, 140], 
              "max_depth": [1, 2, 3, 4, 5],
             }

As can be seen, the hyper-parameter space is defined using a dictionary. In our case there are 10 possible values for **n_estimators** and 5 possible values for **max_depth**, for a total of 50 possible combinations. However, we will not test all possible combinations.
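To make the difference between the full grid and a random subset concrete, the sketch below uses two scikit-learn helpers, ```ParameterGrid``` and ```ParameterSampler```, on the same dictionary. The ```random_state=0``` value is an arbitrary choice added here so the sample is repeatable.

```python
from sklearn.model_selection import ParameterGrid, ParameterSampler

parameters = {"n_estimators": [10, 20, 40, 60, 80, 100, 110, 120, 130, 140],
              "max_depth": [1, 2, 3, 4, 5]}

# A grid search would visit every combination in the space
full_grid = list(ParameterGrid(parameters))
print(len(full_grid))

# A random search only visits n_iter combinations drawn from that space
sampled = list(ParameterSampler(parameters, n_iter=10, random_state=0))
print(len(sampled))
for combination in sampled[:3]:
    print(combination)
```

Because every entry in the dictionary is a list, the combinations are drawn without replacement, so no combination is tested twice.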

The next step is to create our random search object. This is done by passing the classifier, the dictionary with the hyper-parameter space, the number of combinations to sample (```n_iter```), a method to evaluate the predictions on the test set, and the cross validation splitting strategy to ```RandomizedSearchCV()```. After the object is created, we use the ```fit()``` method to launch our random search.

Before we move on, it is important to mention that we will use the f1 score to evaluate the performance of the classifier. You can use other scoring methods. See [here](https://scikit-learn.org/stable/modules/model_evaluation.html#scoring-parameter).
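As a quick reminder of what the f1 score measures, the sketch below computes it by hand as the harmonic mean of precision and recall, and compares the result with scikit-learn's ```f1_score```. The labels in ```y_true``` and ```y_pred``` are made up for illustration.

```python
from sklearn.metrics import f1_score, precision_score, recall_score

# Made-up labels and predictions (illustration only)
y_true = [1, 1, 1, 0, 0, 1, 0, 1]
y_pred = [1, 0, 1, 0, 1, 1, 0, 1]

precision = precision_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)

# f1 is the harmonic mean of precision and recall
manual_f1 = 2 * precision * recall / (precision + recall)
library_f1 = f1_score(y_true, y_pred)

print(precision, recall)
print(manual_f1, library_f1)
```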

In [17]:
#Create the random search object
random_search = RandomizedSearchCV(xgb_clf, parameters, n_iter=10, scoring='f1', cv=5, verbose=0)

#launch the random search
random_search.fit(train_x, train_y)

RandomizedSearchCV(cv=5, error_score='raise-deprecating',
          estimator=XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
       colsample_bynode=1, colsample_bytree=1, gamma=0, learning_rate=0.1,
       max_delta_step=0, max_depth=3, min_child_weight=1, missing=None,
       n_estimators=100, n_jobs=1, nthread=None,
       objective='binary:logistic', random_state=0, reg_alpha=0,
       reg_lambda=1, scale_pos_weight=1, seed=None, silent=None,
       subsample=1, verbosity=1),
          fit_params=None, iid='warn', n_iter=10, n_jobs=None,
          param_distributions={'n_estimators': [10, 20, 40, 60, 80, 100, 110, 120, 130, 140], 'max_depth': [1, 2, 3, 4, 5]},
          pre_dispatch='2*n_jobs', random_state=None, refit=True,
          return_train_score='warn', scoring='f1', verbose=0)

It is important to mention that we are using 5-fold cross validation (cv=5) and randomly choosing 10 points in our 10x5 hyper-parameter grid (50 combinations). This means that we are doing the following:

* Split the data into 5 sets (cv=5); let's call them A, B, C, D, E.
* Select at random 10 (out of 50) hyper-parameter combinations.
* Pick one combination of hyper-parameters.
* Train the classifier using 4 of the sets (let's say A, B, C, D) and obtain scores on the last set (in this case E).
* Repeat the last two steps for every sampled combination and every held-out set (10 hyper-parameter combinations and 5 cv groups, for a total of 50 fits).
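The steps above can be sketched by hand, which may help to see what ```RandomizedSearchCV``` does internally. This is a simplified illustration, not the library's actual implementation. To keep it runnable with scikit-learn alone, it makes two assumptions: ```GradientBoostingClassifier``` stands in for ```XGBClassifier```, and a small synthetic data set from ```make_classification``` stands in for **train_x** and **train_y**.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import ParameterSampler, StratifiedKFold

# Assumption: small synthetic data set standing in for train_x / train_y
X, y = make_classification(n_samples=300, n_features=5, random_state=0)

parameters = {"n_estimators": [10, 20, 40, 60, 80, 100, 110, 120, 130, 140],
              "max_depth": [1, 2, 3, 4, 5]}

# Split the data into 5 folds (stratified, as scikit-learn does for classifiers)
folds = list(StratifiedKFold(n_splits=5).split(X, y))

# Select at random 10 of the 50 hyper-parameter combinations
candidates = list(ParameterSampler(parameters, n_iter=10, random_state=0))

results = []
n_fits = 0
for params in candidates:
    fold_scores = []
    for train_idx, test_idx in folds:
        # Train on 4 folds and score on the held-out fold
        clf = GradientBoostingClassifier(random_state=0, **params)
        clf.fit(X[train_idx], y[train_idx])
        fold_scores.append(f1_score(y[test_idx], clf.predict(X[test_idx])))
        n_fits += 1
    # The score of a combination is its mean score over the 5 folds
    results.append((np.mean(fold_scores), params))

best_score, best_params = max(results, key=lambda item: item[0])
print(n_fits)
print(best_params, best_score)
```

The counter ```n_fits``` confirms the arithmetic: 10 combinations times 5 folds gives 50 model fits.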

# Results

Now we can obtain the best hyper-parameter combination and its score using ```best_params_``` and ```best_score_```, as shown below:

In [18]:
print(random_search.best_params_)
print(random_search.best_score_)

{'n_estimators': 140, 'max_depth': 3}
0.8934610174906333
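Besides the best combination, the fitted object also keeps the scores of every sampled combination in its ```cv_results_``` attribute. The sketch below is self-contained, so it makes the following assumptions: ```GradientBoostingClassifier``` and synthetic data stand in for the XGBoost classifier and the training lists used above, and ```random_state=0``` is added so that the same 10 combinations are sampled on every run.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import RandomizedSearchCV

# Assumption: synthetic data and a scikit-learn classifier as stand-ins
X, y = make_classification(n_samples=300, n_features=5, random_state=0)

parameters = {"n_estimators": [10, 20, 40, 60, 80, 100, 110, 120, 130, 140],
              "max_depth": [1, 2, 3, 4, 5]}

search = RandomizedSearchCV(GradientBoostingClassifier(random_state=0), parameters,
                            n_iter=10, scoring="f1", cv=5, random_state=0)
search.fit(X, y)

# cv_results_ holds one entry per sampled combination
results = search.cv_results_
order = results["rank_test_score"].argsort()

# Print the three best combinations with the mean and std of their fold scores
for i in order[:3]:
    print(results["params"][i],
          results["mean_test_score"][i],
          results["std_test_score"][i])
```

Looking at the spread of scores, rather than only the best one, helps to judge whether the winning combination is clearly better or just marginally ahead.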


# Final words
We went over the basic concepts behind **random search**. By now you should be able to use **RandomizedSearchCV** to tune the hyper-parameters of your model. It is now your time to start coding! Try the following:
* Add other hyper-parameters to the dictionary.
* Try a different value for your cross validation splitting strategy (cv = 3, or 6).
* Use a different classifier.
* Use a different scoring method to measure the performance of your classifiers.
* Try setting verbose=2 when you create the random search object.

More information about the arguments passed to ```RandomizedSearchCV()``` can be found [here](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.RandomizedSearchCV.html).