# Overview

Recently, the Databricks team contributed the SparkTrials object to the hyperopt project. This enhancement allows hyperopt to distribute a tuning job across an Apache Spark cluster.

If you are unfamiliar with Apache Spark, review the [corresponding notebooks](../../Big%20Data%20And%20Big%20Compute/Apache%20Spark/README.md).

As we saw in the [Hyperopt Search Trials notebook](Hyperopt%20Search%20Trials.ipynb), philosophically, a **Trial** is a sample taken from the Search Space. Said another way, a Trial is an observation of one of the possible scenarios defined as part of the Search Space.

Under the hood, the Trials object is what controls how the Hyperopt framework iterates over the Search Space and selects the next set of hyperparameters and produces the next Trial object.

There are three types of Trials objects: Trials, MongoTrials, and SparkTrials. In the [Hyperopt Search Trials notebook](Hyperopt%20Search%20Trials.ipynb) we looked at the "base" Trials object.

In this notebook we will look at the SparkTrials object produced through the [Spark Integration](Hyperopt%20Spark%20Integration.ipynb). Long story short, we will see that the interface and analysis of this object look the same as for the base Trials object.

# Gotchas
## Hyperopt version 0.2.7+ required for Apache Spark 3.0+
It appears that the 0.2.5 release of hyperopt has a bug which makes it incompatible with Spark 3.0. This is documented in an [issue on GitHub](https://github.com/hyperopt/hyperopt/issues/798). We must ensure we are running 0.2.7 or later to integrate with Spark 3.0.
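
A quick way to guard against this is to check the installed release before creating a SparkTrials object. The sketch below is only an illustration: the helper names are made up for this notebook, and it assumes plain dotted version strings such as `0.2.7`.

```python
from importlib import metadata
from itertools import takewhile

MINIMUM_HYPEROPT = (0, 2, 7)  # minimum release named in the heading above


def parse_version(text):
    """Turn '0.2.7' (or '0.2.7rc1') into a tuple of ints such as (0, 2, 7)."""
    parts = []
    for piece in text.split(".")[:3]:
        digits = "".join(takewhile(str.isdigit, piece))
        parts.append(int(digits) if digits else 0)
    return tuple(parts)


def is_supported(version_text, minimum=MINIMUM_HYPEROPT):
    """Return True when version_text is at least the minimum release."""
    return parse_version(version_text) >= minimum


def installed_version(package="hyperopt"):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


found = installed_version()
if found is None:
    print("hyperopt is not installed")
elif is_supported(found):
    print("hyperopt {0} meets the 0.2.7 minimum".format(found))
else:
    print("hyperopt {0} is too old; upgrade to 0.2.7 or later".format(found))
```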

## Spark Limitations Affecting fmin() function
When using the SparkTrials object, the objective function passed to *fmin()* is effectively running on a Spark executor. As such, there are a few limitations inherited from Spark that we need to keep in mind.

### Object must be pickle-able
To send information and data to the Spark Worker the Spark framework uses pickle to serialize Python objects. As such, the objects we use must be serializable through the pickle framework. If not, we might see an error like:


```
TypeError: can't pickle _thread.RLock objects
```
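
Before handing an object to the objective function, it can help to test whether it survives serialization. The sketch below uses the standard `pickle` module as a stand-in for whatever serializer Spark uses in a given setup; `ModelWrapper` and `is_picklable` are hypothetical names invented for this example.

```python
import pickle
import threading


class ModelWrapper:
    """Hypothetical helper that keeps a lock as an attribute."""

    def __init__(self):
        self.lock = threading.RLock()


def is_picklable(obj):
    """Return True if obj survives a pickle round trip."""
    try:
        pickle.loads(pickle.dumps(obj))
        return True
    except (TypeError, AttributeError, pickle.PicklingError):
        return False


wrapper_ok = is_picklable(ModelWrapper())
params_ok = is_picklable({"max_depth": 3, "learning_rate": 0.1})
print("wrapper picklable:", wrapper_ok)
print("params picklable:", params_ok)
```

A common way around the problem is to pass only plain data (numbers, strings, dictionaries) into the objective function and to construct locks, connections, and similar resources inside it.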

### Spark Doesn't Support Nested Parallelism
As discussed in the [series of notebooks related to Spark](../../Big%20Data%20And%20Big%20Compute/Apache%20Spark/README.md), the Spark executor cannot kick off training sessions for MLlib algorithms or other components of Spark.

We may see an error similar to the one listed above when trying to use Spark functionality within the fmin() function.

# 1. How It Works

Recall from the [README.md](README.md) that hyperopt's hyperparameter tuning functionality is invoked via the *fmin()* function. As we have seen, this function allows us to pass in a trials object in which hyperopt records information about the training trials it conducts while searching the search space.

Under the hood (looking at the [GitHub code](https://github.com/hyperopt/hyperopt/blob/master/hyperopt/fmin.py#L540)) we can see that the trials object is what actually implements the searching process. Thus the SparkTrials object's *fmin()* function is configured to run batches of training tasks in parallel, one on each Spark executor, allowing massive scale-out for tuning.

## 1.1. Considerations

### 1.1.1. The Parallelism Parameter
The Databricks team has a [post](https://databricks.com/blog/2021/04/15/how-not-to-tune-your-model-with-hyperopt.html) providing insights about running hyperopt on Spark. One important note is how one might use the **parallelism** parameter. This parameter dictates how many Spark jobs (trials) run in parallel.

One of the major gotchas of this parameter comes from the fact that hyperopt's [TPE search algorithm](Hyperopt%20Search%20Algorithms.ipynb) is iterative; information from the previous trials is used to determine where to look next. If parallelism is set equal to the **max_evals** parameter, we are effectively doing random search, as all the trials are conducted in parallel.

Another gotcha has to do with spark cluster utilization. Setting the parallelism parameter too low wastes resources. If running on a cluster with 32 cores, then running just 2 trials in parallel leaves 30 cores idle. Setting parallelism too high can cause a subtler problem. With a 32-core cluster, it’s natural to choose parallelism=32 of course, to maximize usage of the cluster’s resources. Setting it higher than cluster parallelism is counterproductive, as each wave of trials will see some trials waiting to execute.

The article recommends we **set parallelism to a small multiple of the number of hyperparameters**, and allocate cluster resources accordingly. For example, if searching over 4 hyperparameters, parallelism should not be much larger than 4. 8 or 16 may be fine, but 64 may not help a lot. 64 doesn’t hurt, it just may not help much.
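
To make the trade-off concrete, here is a small back-of-the-envelope sketch. It assumes a simplified model in which trials run in synchronised waves of size `parallelism`; the real scheduler may start a new trial as soon as a slot frees up, so treat the numbers as rough guidance only. The function name `wave_summary` is made up for this example.

```python
import math


def wave_summary(max_evals, parallelism):
    """Describe a search under a simplified 'synchronised waves' model."""
    parallelism = min(parallelism, max_evals)
    waves = math.ceil(max_evals / parallelism)
    # Trials in the first wave start before any result exists,
    # so they cannot be guided by earlier losses.
    uninformed = parallelism
    return {
        "waves": waves,
        "uninformed_trials": uninformed,
        "informed_trials": max_evals - uninformed,
    }


for parallelism in (1, 4, 16, 100):
    print(parallelism, wave_summary(max_evals=100, parallelism=parallelism))
```

With `parallelism` equal to `max_evals`, every trial lands in the first wave and none of them benefits from earlier results, which is the random-search situation described above.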

### 1.1.2. ML Library Built-in Parallelism
Some ML libraries have the ability to take advantage of multithreading while training models. For example scikit-learn accepts the **n_jobs** parameter while xgboost accepts the **nthread** parameter. 

Although a single Spark task is assumed to use one core, nothing stops the task from using multiple cores. For example, with 16 cores available, one can run 16 single-threaded tasks, or 4 tasks that use 4 cores each. The latter is actually advantageous if the fitting process can efficiently use 4 cores and return the results in a timely manner. This is because Hyperopt is iterative, and returning fewer results faster improves its ability to learn from early results to schedule the next trials. That is, in this scenario, trials 5-8 could learn from the results of 1-4 if those first 4 tasks used 4 cores each to complete quickly and so on, whereas if all were run at once, none of the trials’ hyperparameter choices have the benefit of information from any of the others’ results.

This affects thinking about the setting of parallelism. If a Hyperopt fitting process can reasonably use parallelism = 8, then by default one would allocate a cluster with 8 cores to execute it. But if the individual tasks can each use 4 cores, then allocating a 4 * 8 = 32-core cluster would be advantageous.
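
The arithmetic in the last two paragraphs can be written down as a pair of helpers. This is only a sketch of the sizing rule described above; the function names are invented here.

```python
def cores_needed(parallelism, cores_per_trial):
    """Cluster cores required to run `parallelism` trials at once."""
    return parallelism * cores_per_trial


def max_parallelism(total_cores, cores_per_trial):
    """Largest number of trials that fit on the cluster at the same time."""
    return total_cores // cores_per_trial


# 8 concurrent trials, each fitting a model on 4 threads
print(cores_needed(parallelism=8, cores_per_trial=4))

# 16 cores: either 16 single-threaded trials or 4 trials with 4 threads each
print(max_parallelism(total_cores=16, cores_per_trial=1))
print(max_parallelism(total_cores=16, cores_per_trial=4))
```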

One of the dangers of this approach is that Spark may schedule too many core-hungry tasks on one machine, causing the cluster to be slow or unresponsive. This can be particularly troublesome in a shared environment where other users and workflows are trying to share the cluster resources.

A workaround is to set the **spark.task.cpus** parameter, which tells Spark the number of cores to allocate to each task. The disadvantage is that this is a cluster-wide configuration: with a value of 4, for example, all Spark jobs executed in the session will assume 4 cores for any task.
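
The effect of this setting on scheduling can be estimated with simple division: an executor offers as many task slots as its core count divided by the cores reserved per task. The sketch below assumes example values for `spark.executor.cores` and `spark.task.cpus`; the helper name `task_slots` is made up.

```python
def task_slots(executor_cores, task_cpus):
    """Number of tasks that fit on one executor at the same time."""
    if task_cpus < 1:
        raise ValueError("task_cpus must be at least 1")
    return executor_cores // task_cpus


# Example values only; use the figures from your own cluster
settings = {"spark.executor.cores": 8, "spark.task.cpus": 4}

slots = task_slots(settings["spark.executor.cores"], settings["spark.task.cpus"])
print("concurrent tasks per executor:", slots)
```

With three such executors, at most six trials would run at once, so a parallelism value above 6 would leave trials waiting for a slot.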

### 1.1.3. Spark's Serialization Impacts To Objective Function Definitions
Recall that Spark was written as a master/slave (rebranded as driver/executor) architecture. The work on the driver is split into chunks and sent to the executors for processing. The mechanism by which this information is sent is serialization; i.e., objects are converted into a serial byte stream, sent over the network, and then rebuilt at the destination.

This process has an obvious overhead of computation as well as network bandwidth.

In the use case of hyperparameter tuning, we will likely be using the same train/test data within our objective function, which is grading the search space (we are likely only changing the hyperparameters or ML algorithm). While prototyping we may be inclined to pass this data directly to the objective function within each call. This is a bad idea: every time we run a task, the data has to be serialized and sent from the driver to the executor, which is unnecessary work.
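
The size difference is easy to see with the standard `pickle` module. The sketch below compares the bytes needed to ship a small synthetic dataset with each task against shipping only a path to it. The dataset and the path are stand-ins invented for this example, and the exact byte counts will differ from what Spark sends.

```python
import pickle

params = {"max_depth": 3, "learning_rate": 0.1}

# Stand-in for a training set: 50,000 rows with two float columns
train_rows = [[float(i), float(i) * 2.0] for i in range(50_000)]

# Hypothetical location the executor could read the data from
data_path = "/data/train.parquet"

payload_with_data = pickle.dumps((params, train_rows))
payload_with_path = pickle.dumps((params, data_path))

print("with data:", len(payload_with_data), "bytes")
print("with path:", len(payload_with_path), "bytes")
```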

We might instead decide to broadcast the data. According to the Spark documentation, broadcast variables are read-only variables that are cached and available to tasks on all nodes in a cluster. Instead of sending this data along with every task, Spark distributes broadcast variables to each machine using efficient broadcast algorithms to reduce communication costs. The problem with broadcast variables, however, is that they are limited to 2GB of data.

As each spark task is a separate java process, the only other option is to do some advanced programming to load the data into a large shared memory object in the heap that can be used between tasks. Or possibly setting up a caching server of some kind that can quickly serve a raw byte stream. But that is outside our scope.

Practically speaking, if the train/test set is larger than 2GB you might simply have to reach out to the datastore and load the data each time the function runs. The benefit of this approach is that you free up resources on the executor, which will likely allow your workflow to run more smoothly.
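
A minimal sketch of the "load it inside the function" approach follows. It uses a temporary CSV file as a stand-in for the datastore and a toy one-parameter model; `load_rows` and `make_objective` are hypothetical helpers, not part of hyperopt. The point is that the objective captures only a path, so very little has to be serialized.

```python
import csv
import os
import tempfile


def load_rows(path):
    """Read (x, y) pairs from a CSV file."""
    with open(path, newline="") as handle:
        return [(float(x), float(y)) for x, y in csv.reader(handle)]


def make_objective(path):
    """Build an objective that captures only the path, not the data."""
    def objective(args):
        rows = load_rows(path)  # reloaded on every call
        w = args["w"]
        # Mean squared error of the toy model y = w * x
        return sum((w * x - y) ** 2 for x, y in rows) / len(rows)
    return objective


# Write a tiny stand-in dataset where y = 2 * x
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False, newline="") as tmp:
    csv.writer(tmp).writerows((i, 2 * i) for i in range(1, 6))

objective = make_objective(tmp.name)
loss_exact = objective({"w": 2.0})
loss_off = objective({"w": 1.0})
print(loss_exact, loss_off)

os.remove(tmp.name)
```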

### 1.1.4. Suggestions on setting max_evals
We have seen that the *fmin()* function takes a parameter called max_evals which dictates how many trials within the search space to conduct. For example, if we set max_evals=20 then the search algorithm would run 20 times on data points selected from the search space.

Databricks has made a [recommendation](https://databricks.com/blog/2021/04/15/how-not-to-tune-your-model-with-hyperopt.html) on a method for choosing the value for max_evals:

<table>
  <tbody>
<tr>
<th>Parameter Expression</th>
<th>Optimal Results</th>
<th>Fastest Results</th>
</tr>
<tr>
<td>(ordinal parameters)<p></p>
<p>hp.uniform<br>
hp.quniform<br>
hp.loguniform<br>
hp.qloguniform</p></td>
<td>20 x # parameters</td>
<td>10 x # parameters</td>
</tr>
<tr>
<td>(categorical parameters)<p></p>
<p>hp.choice</p></td>
<td colspan="2">15 x total categorical breadth*</td>
</tr>
</tbody>  
</table>

**Note:** “total categorical breadth” is the total number of categorical choices in the space.  If you have hp.choice with two options “on, off”, and another with five options “a, b, c, d, e”, your total categorical breadth is 10.
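
The table can be turned into a small calculator. This is a sketch under two assumptions that are not spelled out above: the breadth is computed as the product of the option counts (which matches the example in the note, where 2 and 5 options give 10), and the ordinal and categorical contributions are simply added together. The function names are made up.

```python
from functools import reduce
from operator import mul


def categorical_breadth(option_counts):
    """Product of the option counts, following the example in the note."""
    if not option_counts:
        return 0
    return reduce(mul, option_counts, 1)


def suggest_max_evals(n_ordinal, option_counts, fastest=False):
    """Rough max_evals estimate based on the table above."""
    per_ordinal = 10 if fastest else 20
    # Assumption: the two contributions are summed
    return per_ordinal * n_ordinal + 15 * categorical_breadth(option_counts)


# Three ordinal parameters plus the two hp.choice parameters from the note
print(suggest_max_evals(3, [2, 5]))
print(suggest_max_evals(3, [2, 5], fastest=True))
```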

# 2. Define Search Space And Conduct Search Trials

## 2.1. Setup Spark

In [2]:
import pyprojroot
project_root_dir  = pyprojroot.here()
print(project_root_dir)

/root/ml-training-jupyter-notebooks


In [3]:
# Load a helper module
import os
import importlib.util
module_name = "spark_helper"
module_dir = os.path.join(project_root_dir, "Utilities", "{0}.py".format(module_name))
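if not os.path.exists(module_dir):
    # Fail early; loading a missing file below would raise a less obvious error
    raise FileNotFoundError("The helper module does not exist: {0}".format(module_dir))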
print("Loading module: {0}".format(module_dir))
spec = importlib.util.spec_from_file_location(module_name, module_dir)
spark_helper = importlib.util.module_from_spec(spec)
spec.loader.exec_module(spark_helper)

Loading module: /root/ml-training-jupyter-notebooks/Utilities/spark_helper.py


In [4]:
spark_app_name = "spark-jupyter-mlib"
docker_image = "tschneider/pyspark:v6-beta"
k8_master_ip = "15.4.7.11"
spark_session = spark_helper.create_spark_session(spark_app_name, docker_image, k8_master_ip)
sc = spark_session.sparkContext

Setting SPARK_HOME
/opt/spark

Running findspark.init() function
['/opt/spark/python', '/opt/spark/python/lib/py4j-0.10.9-src.zip', '/usr/lib64/python36.zip', '/usr/lib64/python3.6', '/usr/lib64/python3.6/lib-dynload', '', '/usr/local/lib64/python3.6/site-packages', '/usr/local/lib/python3.6/site-packages', '/usr/lib64/python3.6/site-packages', '/usr/lib/python3.6/site-packages', '/usr/local/lib/python3.6/site-packages/IPython/extensions', '/root/.ipython']

Setting PYSPARK_PYTHON
/usr/bin/python3

Configuring URL for kubernetes master
k8s://https://15.4.7.11:6443

Determining IP Of Server
The ip was detected as: 15.4.12.12

Creating SparkConf Object
('spark.master', 'k8s://https://15.4.7.11:6443')
('spark.app.name', 'spark-jupyter-mlib')
('spark.submit.deploy.mode', 'cluster')
('spark.kubernetes.container.image', 'tschneider/pyspark:v6-beta')
('spark.kubernetes.namespace', 'spark')
('spark.kubernetes.pyspark.pythonVersion', '3')
('spark.kubernetes.authenticate.driver.serviceAccountNam

In [5]:
! kubectl -n spark get pod

NAME                                         READY     STATUS    RESTARTS   AGE
spark-jupyter-mlib-cffb937dd4846132-exec-1   1/1       Running   0          32s
spark-jupyter-mlib-cffb937dd4846132-exec-2   1/1       Running   0          31s
spark-jupyter-mlib-cffb937dd4846132-exec-3   1/1       Running   0          31s


## 2.2. Define Search Space

In [6]:
import hyperopt

# Define the search space
space = hyperopt.hp.choice('my_choice', [
    {
        'name': 'model a',
        'x': hyperopt.hp.choice('model_a_x', [1,2,4,6,8])
    },
    {
        'name': 'model b',
        'x': hyperopt.hp.choice('model_b_x', [0,1,2,3,4]),
        'y': hyperopt.hp.choice('model_b_y', [0,3,5,7,9])        
    }
])

We can have a look at a single sample to see what is passed to the objective function:

In [7]:
print(hyperopt.pyll.stochastic.sample(space))

{'name': 'model b', 'x': 3, 'y': 3}


## 2.3. Define Objective Function

In [8]:
def objective(args):
    x = args['x']
    y = args.get('y', 0)  # 'model a' samples have no 'y' key
    return x + y

## 2.4. Create SparkTrials Object

In [12]:
trials = hyperopt.SparkTrials(parallelism=3, spark_session=spark_session)

## 2.5. Perform Search

We can define a loss function and search through our parameter space for the optimal value using the hyperopt framework:

In [13]:
import hyperopt
import numpy

# Search the space; fmin returns the indices of the best choices found, not the values themselves
optimal_args_index = hyperopt.fmin(objective, 
                                   space, 
                                   algo=hyperopt.tpe.suggest, 
                                   max_evals=10, 
                                   trials=trials, 
                                   rstate=numpy.random.default_rng(42))

# Retrieve the resulting hyperparameter set from the search space using the index
optimal_hyperparams = hyperopt.space_eval(space, optimal_args_index)

# Print the results
print("=========================")
print("Optimal args index:")
print(optimal_args_index)
print("Best hyperparameters:")
print(optimal_hyperparams)

100%|██████████| 10/10 [00:11<00:00,  1.12s/trial, best loss: 1.0]


Total Trials: 10: 10 succeeded, 0 failed, 0 cancelled.


Optimal args index:
{'model_a_x': 0, 'my_choice': 0}
Best hyperparameters:
{'name': 'model a', 'x': 1}


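We can see that "model a" with x=1 was selected, as it yields the lowest loss (1.0) among the 10 trials evaluated. Note that this is not the global minimum of the search space: "model b" with x=0 and y=0 would give a loss of 0, but this short search did not find that combination.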
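
Because this search space is tiny, we can cross-check the result by enumerating every combination directly. The sketch below rebuilds the candidates from section 2.2 as plain dictionaries, so it runs without hyperopt or Spark. It shows that the lowest loss in the whole space is 0, lower than the 1.0 found by the 10 trials.

```python
import itertools


def objective(args):
    # Same logic as the objective function in section 2.3
    return args['x'] + args.get('y', 0)


# Every combination that the search space in section 2.2 can produce
model_a = [{'name': 'model a', 'x': x} for x in [1, 2, 4, 6, 8]]
model_b = [{'name': 'model b', 'x': x, 'y': y}
           for x, y in itertools.product([0, 1, 2, 3, 4], [0, 3, 5, 7, 9])]
candidates = model_a + model_b

best = min(candidates, key=objective)
print("candidates:", len(candidates))
print("best:", best, "loss:", objective(best))
```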

**Note:** We see that the search algorithm has a lot of repetitions. This is because the only boundary on the search is the max_evals parameter. We will look at optimizing the search algorithm later on, for example by using the loss_threshold parameter to allow for early termination.

Having a look at the trials object, we inspect its type and its useful properties.

In [14]:
trials

<hyperopt.spark.SparkTrials at 0x7f6bb18376d8>

In [15]:
dir(trials)

['MAX_CONCURRENT_JOBS_ALLOWED',
 '__class__',
 '__delattr__',
 '__dict__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__getitem__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__iter__',
 '__le__',
 '__len__',
 '__lt__',
 '__module__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 '__weakref__',
 '_decide_parallelism',
 '_dynamic_trials',
 '_exp_key',
 '_fmin_cancelled',
 '_fmin_cancelled_reason',
 '_ids',
 '_insert_trial_docs',
 '_spark',
 '_spark_context',
 '_spark_pinned_threads_enabled',
 '_spark_supports_job_cancelling',
 '_trials',
 'aname',
 'argmin',
 'assert_valid_trial',
 'asynchronous',
 'attachments',
 'average_best_error',
 'best_trial',
 'count_by_state_synced',
 'count_by_state_unsynced',
 'count_cancelled_trials',
 'count_failed_trials',
 'count_successful_trials',
 'count_total_trials',
 'delete_all',
 'fmin',
 'idxs',
 'idxs

# 3. Common Analysis Tasks
In this section we look at code examples of extracting valuable information from the Trials object.

## 3.1. Get Trial Metadata

In [16]:
trials.trials[0:2]

[{'state': 2,
  'tid': 0,
  'spec': None,
  'result': {'loss': 8.0, 'status': 'ok'},
  'misc': {'tid': 0,
   'cmd': ('domain_attachment', 'FMinIter_Domain'),
   'workdir': None,
   'idxs': {'model_a_x': [0],
    'model_b_x': [],
    'model_b_y': [],
    'my_choice': [0]},
   'vals': {'model_a_x': [4],
    'model_b_x': [],
    'model_b_y': [],
    'my_choice': [0]}},
  'exp_key': None,
  'owner': None,
  'version': 0,
  'book_time': datetime.datetime(2021, 12, 19, 21, 8, 8, 913000),
  'refresh_time': datetime.datetime(2021, 12, 19, 21, 8, 9, 980000)},
 {'state': 2,
  'tid': 1,
  'spec': None,
  'result': {'loss': 4.0, 'status': 'ok'},
  'misc': {'tid': 1,
   'cmd': ('domain_attachment', 'FMinIter_Domain'),
   'workdir': None,
   'idxs': {'model_a_x': [],
    'model_b_x': [1],
    'model_b_y': [1],
    'my_choice': [1]},
   'vals': {'model_a_x': [],
    'model_b_x': [1],
    'model_b_y': [1],
    'my_choice': [1]}},
  'exp_key': None,
  'owner': None,
  'version': 0,
  'book_time': datet

## 3.2. Get Best Trial Metadata
We can see the best trial:

In [17]:
trials.best_trial

{'state': 2,
 'tid': 2,
 'spec': None,
 'result': {'loss': 1.0, 'status': 'ok'},
 'misc': {'tid': 2,
  'cmd': ('domain_attachment', 'FMinIter_Domain'),
  'workdir': None,
  'idxs': {'model_a_x': [2],
   'model_b_x': [],
   'model_b_y': [],
   'my_choice': [2]},
  'vals': {'model_a_x': [0],
   'model_b_x': [],
   'model_b_y': [],
   'my_choice': [0]}},
 'exp_key': None,
 'owner': None,
 'version': 0,
 'book_time': datetime.datetime(2021, 12, 19, 21, 8, 10, 918000),
 'refresh_time': datetime.datetime(2021, 12, 19, 21, 8, 12, 7000)}

## 3.3. Get Trial ID

In [18]:
trials.trials[0]["tid"]

0

## 3.4. Get Hyperparameters For Best Trial

In [19]:
trials.argmin

{'model_a_x': 0, 'my_choice': 0}

In [20]:
hyperopt.space_eval(space, trials.argmin)

{'name': 'model a', 'x': 1}

## 3.5. Get Hyperparameters For Arbitrary Trial

In [21]:
trials.trials[2]["misc"]["vals"]

{'model_a_x': [0], 'model_b_x': [], 'model_b_y': [], 'my_choice': [0]}

We need to convert this value into another form. The [source code](https://github.com/hyperopt/hyperopt/blob/0b49cde7c0860542b17e8f6102dcf46af4739d23/hyperopt/base.py#L619) for doing this is in the *argmin* property on the Trials object.

In [22]:
vals = trials.trials[2]["misc"]["vals"]

def correct_vals_object_format(vals):
    rval = {}
    for k, v in list(vals.items()):
        if v:
            rval[k] = v[0]
    return rval
        
corrected_vals = correct_vals_object_format(vals)
corrected_vals

{'model_a_x': 0, 'my_choice': 0}

Once converted, we can use the same *space_eval* function to derive the hyperparameters.

In [23]:
hyperopt.space_eval(space, corrected_vals)

{'name': 'model a', 'x': 1}

## 3.6. Get Scores For A Trial

In [24]:
trials.results[0]

{'loss': 8.0, 'status': 'ok'}
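
The individual lookups above can be combined into one summary of all trials. The sketch below works on plain dictionaries shaped like the entries of `trials.trials` (trimmed copies of trials 0, 1 and 2 from the output above), so it runs without a cluster; with a live object one would pass `trials.trials` instead. The `summarise` helper is made up for this example.

```python
# Trimmed copies of the trial documents shown earlier in this section
trial_docs = [
    {'tid': 0, 'result': {'loss': 8.0, 'status': 'ok'},
     'misc': {'vals': {'model_a_x': [4], 'model_b_x': [],
                       'model_b_y': [], 'my_choice': [0]}}},
    {'tid': 1, 'result': {'loss': 4.0, 'status': 'ok'},
     'misc': {'vals': {'model_a_x': [], 'model_b_x': [1],
                       'model_b_y': [1], 'my_choice': [1]}}},
    {'tid': 2, 'result': {'loss': 1.0, 'status': 'ok'},
     'misc': {'vals': {'model_a_x': [0], 'model_b_x': [],
                       'model_b_y': [], 'my_choice': [0]}}},
]


def summarise(docs):
    """Return (loss, tid, flattened vals) for successful trials, best first."""
    rows = []
    for doc in docs:
        if doc['result'].get('status') != 'ok':
            continue  # skip failed or unfinished trials
        vals = {k: v[0] for k, v in doc['misc']['vals'].items() if v}
        rows.append((doc['result']['loss'], doc['tid'], vals))
    return sorted(rows, key=lambda row: row[0])


for loss, tid, vals in summarise(trial_docs):
    print(tid, loss, vals)
```

Each flattened `vals` dictionary has the same form as `trials.argmin`, so it can be passed to `hyperopt.space_eval` as in section 3.5.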