# Overview

Recently, the Databricks team contributed the SparkTrials object, which allows hyperopt to distribute a tuning job across an Apache Spark cluster.

If you are unfamiliar with Apache Spark, consider reviewing the [corresponding notebooks](../../Big%20Data%20And%20Big%20Compute/Apache%20Spark/README.md).

# Gotchas
## Hyperopt version 0.2.7+ required for Apache Spark 3.0+
It appears that there is a gotcha here. The hyperopt 0.2.5 release on PyPI has a bug that makes it incompatible with Spark 3.0. This is documented in an [open issue on GitHub](https://github.com/hyperopt/hyperopt/issues/798). The issue links to a [merged PR](https://github.com/hyperopt/hyperopt/pull/765) that explains how to install the merged code until the next package is released on PyPI.

One will need to uninstall hyperopt and then reinstall the patched version 0.2.7 using the following commands:

```
pip uninstall hyperopt
pip install git+https://github.com/hyperopt/hyperopt.git
```

## Limitation of parallel executions
I ran into an issue where the *fmin()* function raised an error while trying to optimize the SARIMAX model:
```
TypeError: can't pickle _thread.RLock objects
```
I googled and found an [article](https://giters.com/hyperopt/hyperopt/issues/720) which stated that there is a limitation to the way hyperopt does things in parallel. Our options are:
- Run distributed Hyperopt with single-machine training algorithms (SparkTrials)
- Run single-machine Hyperopt with distributed training algorithms (base.Trials)

AFAIK the SARIMAX model from stat

## Pickle search space

If you can't pickle the search space, you'll have a problem. Recall that search space samples are sent to the Spark executor nodes.

```
TypeError: can't pickle _thread.RLock objects
```
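
A quick way to see this failure mode locally is to try pickling a scenario before handing it to Spark. The sketch below uses the standard library `pickle` module as an approximation (Spark ships objects with its own cloudpickle-based serializer), and both scenario dictionaries are hypothetical stand-ins rather than the notebook's real search space.

```python
import pickle
import threading

def can_pickle(obj):
    """Return True if obj survives pickle.dumps, False otherwise."""
    try:
        pickle.dumps(obj)
        return True
    except Exception:
        return False

# Hypothetical scenario that drags a lock along (e.g. via a client or session object)
bad_scenario = {"name": "sarimax", "resource": threading.RLock()}

# The same scenario carrying only plain, serializable values
good_scenario = {"name": "sarimax", "feature_names": ["open"]}

print(can_pickle(bad_scenario))
print(can_pickle(good_scenario))
```

Keeping only plain values (names, numbers, lists) in the search space and building heavier objects inside the objective function is one way to stay on the safe side.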

# 1. How It Works

Recall from the [README.md](README.md) that hyperopt's hyperparameter tuning functionality is invoked via the *fmin()* function. As we have seen, this function allows us to pass in a trials object to which hyperopt records information about the various training trials that are conducted while it is searching the search space.

Under the hood (looking at the [github code](https://github.com/hyperopt/hyperopt/blob/master/hyperopt/fmin.py#L540)) we can see that the trials object is what actually implements the searching process. Thus the SparkTrials object's *fmin()* function is configured to run batches of training tasks in parallel, one on each Spark executor, allowing massive scale-out for tuning.

## 1.1. Considerations

### 1.1.1. The Parallelism Parameter
The Databricks team has a [post](https://databricks.com/blog/2021/04/15/how-not-to-tune-your-model-with-hyperopt.html) providing insights about running hyperopt on Spark. One important note is how one might use the **parallelism** parameter. This parameter dictates how many Spark jobs to run in parallel.

One of the major gotchas of this parameter comes from the fact that hyperopt's [TPE search algorithm](Hyperopt%20Search%20Algorithms.ipynb) is iterative; information from the previous trials is used to determine where to look next. If parallelism is set equal to the **max_evals** parameter, then we are effectively doing random search, as all the trials would be conducted in parallel.
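
To make the effect concrete, here is a small sketch of how much history each trial can draw on. It assumes an idealised model in which trials run in synchronized waves of size `parallelism`; the real scheduler dispatches trials as slots free up, so treat the numbers as an illustration only. The function name is made up for this example.

```python
def completed_trials_visible(max_evals, parallelism):
    """For each trial, count how many finished results exist when it is proposed.

    Idealised model: trials run in synchronized waves of size `parallelism`,
    so a trial only sees the results of the waves that finished before it.
    """
    return [(i // parallelism) * parallelism for i in range(max_evals)]

print(completed_trials_visible(8, 2))  # [0, 0, 2, 2, 4, 4, 6, 6]
print(completed_trials_visible(8, 8))  # [0, 0, 0, 0, 0, 0, 0, 0]
```

With parallelism equal to max_evals every trial is proposed with zero results available, which is why the search degenerates to random sampling.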

Another gotcha has to do with spark cluster utilization. Setting the parallelism parameter too low wastes resources. If running on a cluster with 32 cores, then running just 2 trials in parallel leaves 30 cores idle. Setting parallelism too high can cause a subtler problem. With a 32-core cluster, it’s natural to choose parallelism=32 of course, to maximize usage of the cluster’s resources. Setting it higher than cluster parallelism is counterproductive, as each wave of trials will see some trials waiting to execute.

The article recommends we **set parallelism to a small multiple of the number of hyperparameters**, and allocate cluster resources accordingly. For example, if searching over 4 hyperparameters, parallelism should not be much larger than 4. 8 or 16 may be fine, but 64 may not help a lot. 64 doesn’t hurt, it just may not help much.
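
As a rough sketch of that rule of thumb, the helper below picks a small multiple of the hyperparameter count and caps it by the cluster size and by max_evals. The function name, the default multiple of 2, and the capping logic are assumptions made for illustration; they are not part of hyperopt or of the article.

```python
def suggest_parallelism(n_hyperparameters, cluster_cores, max_evals, multiple=2):
    """Cap a small multiple of the hyperparameter count by cores and max_evals."""
    if min(n_hyperparameters, cluster_cores, max_evals) < 1:
        raise ValueError("all inputs must be at least 1")
    return min(multiple * n_hyperparameters, cluster_cores, max_evals)

# 4 hyperparameters on a 32-core cluster with 80 evaluations
print(suggest_parallelism(n_hyperparameters=4, cluster_cores=32, max_evals=80))  # 8
```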

### 1.1.2. ML Library Built-in Parallelism
Some ML libraries have the ability to take advantage of multithreading while training models. For example scikit-learn accepts the **n_jobs** parameter while xgboost accepts the **nthread** parameter. 

Although a single Spark task is assumed to use one core, nothing stops the task from using multiple cores. For example, with 16 cores available, one can run 16 single-threaded tasks, or 4 tasks that use 4 cores each. The latter is actually advantageous if the fitting process can efficiently use 4 cores and return the results in a timely manner. This is because Hyperopt is iterative, and returning fewer results faster improves its ability to learn from early results to schedule the next trials. That is, in this scenario, trials 5-8 could learn from the results of 1-4 if those first 4 tasks used 4 cores each to complete quickly and so on, whereas if all were run at once, none of the trials’ hyperparameter choices have the benefit of information from any of the others’ results.

This affects thinking about the setting of parallelism. If a Hyperopt fitting process can reasonably use parallelism = 8, then by default one would allocate a cluster with 8 cores to execute it. But if the individual tasks can each use 4 cores, then allocating a 4 * 8 = 32-core cluster would be advantageous.

One of the dangers of this approach is that Spark may schedule too many core-hungry tasks on one machine, causing the cluster to be slow or unresponsive. This can be particularly troublesome if operating in a shared environment where other users/workflows are trying to share the cluster resources.

A workaround is to set the **spark.task.cpus** parameter to tell Spark how many cores to allocate to each task (4 in the example above). The disadvantage is that this is a cluster-wide configuration, which will cause all Spark jobs executed in the session to assume 4 cores for any task.
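
The arithmetic behind this setting can be sketched as follows. `spark.executor.cores` and `spark.task.cpus` are real Spark configuration keys; the values and the helper function are illustrative assumptions, and in practice the pairs would be passed with something like `SparkSession.builder.config(key, value)` when the session is created.

```python
def concurrent_tasks_per_executor(executor_cores, task_cpus):
    """Number of tasks Spark can schedule at once on one executor."""
    if task_cpus < 1 or executor_cores < task_cpus:
        raise ValueError("need 1 <= task_cpus <= executor_cores")
    return executor_cores // task_cpus

# Example settings (values are assumptions for illustration)
spark_conf = {
    "spark.executor.cores": "8",
    "spark.task.cpus": "4",
}

slots = concurrent_tasks_per_executor(
    int(spark_conf["spark.executor.cores"]),
    int(spark_conf["spark.task.cpus"]),
)
print(slots)  # 2
```

With these values an 8-core executor runs 2 trials at a time, each free to use 4 threads.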

### 1.1.3. Spark's Serialization Impacts To Objective Function Definitions
Recall that Spark was written as a master/slave (rebranded as driver/executor) architecture. The work on the master node is split into chunks and sent to the slaves for processing. The mechanism by which this information is sent is serialization; ie. objects are converted into a serial byte stream and sent over the network and then rebuilt at the destination.

This process has an obvious overhead of computation as well as network bandwidth.

In the use case of hyperparameter tuning, we will likely be using the same train/test data within our objective function which is grading the search space (we will likely only be changing the hyperparameters or ML algorithm). While prototyping we may be inclined to pass this data directly to the objective function within each call. This is a bad idea. It means that every time we run a task, we have to serialize the data from the driver to the executor. This is unnecessary computation.
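
The size difference is easy to measure. The sketch below compares the serialized size of a task's inputs when the data travels with the task versus when only a path does. It uses the standard library `pickle` module as a stand-in for Spark's serializer, and the data, hyperparameters, and path are all made up for the example.

```python
import pickle

def payload_bytes(*task_inputs):
    """Size in bytes of the pickled inputs a single task would receive."""
    return len(pickle.dumps(task_inputs))

hyperparameters = {"order": (1, 0, 0)}           # hypothetical search-space sample
train_data = [float(i) for i in range(100000)]   # stand-in for a real data set
data_path = "/data/nasdaq_2019.csv"              # hypothetical location on the executor

with_data = payload_bytes(hyperparameters, train_data)
with_path = payload_bytes(hyperparameters, data_path)
print(with_data, with_path)
```

Every trial pays the larger cost again when the data is passed by value, whereas the path costs only a few dozen bytes per task.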

We might instead decide to broadcast the data. Reading through the Spark documentation, we see that broadcast variables are read-only variables that are cached and available to tasks on all nodes in a cluster. Instead of sending this data along with every task, Spark distributes broadcast variables to each machine using efficient broadcast algorithms to reduce communication costs. The problem with broadcast variables, however, is that they are limited to 2GB of data.
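
A minimal sketch of the broadcast pattern follows. It assumes `sc` is a live SparkContext (only `sc.broadcast()` and the resulting `.value` attribute are used) and that `train_data` is a plain in-memory object such as a list or pandas DataFrame rather than a Spark or koalas dataframe. `make_objective` and `score_fn` are hypothetical names introduced for this example.

```python
def make_objective(sc, train_data, score_fn):
    """Build an objective that reads its data from a Spark broadcast variable.

    sc        -- a live SparkContext (only sc.broadcast is used here)
    score_fn  -- hypothetical callable: (hyperparameters, data) -> loss
    """
    # Ship the data to each executor once instead of once per task
    broadcast_data = sc.broadcast(train_data)

    def objective(hyperparameters):
        # .value returns the read-only copy cached on the executor
        data = broadcast_data.value
        return score_fn(hyperparameters, data)

    return objective
```

The returned `objective` closes over the small broadcast handle rather than the data itself, so it is the handle that gets serialized with each task.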

As each spark task is a separate java process, the only other option is to do some advanced programming to load the data into a large shared memory object in the heap that can be used between tasks. Or possibly setting up a caching server of some kind that can quickly serve a raw byte stream. But that is outside our scope.

Practically speaking, if the train/test set is larger than 2GB you might simply have to reach out to the datastore and load the data each time the function runs. The benefit to this approach is that you free up resources on the executor, which will likely allow your workflow to run more smoothly.

# 2. Suggestions on setting max_evals
We have seen that the *fmin()* function takes a parameter called max_evals which dictates how many trials within the search space to conduct. For example, if we set max_evals=20 then the search algorithm would run 20 times on data points selected from the search space.

Databricks has made a [recommendation](https://databricks.com/blog/2021/04/15/how-not-to-tune-your-model-with-hyperopt.html) on a method for choosing the value for max_evals:

<table>
  <tbody>
<tr>
<th>Parameter Expression</th>
<th>Optimal Results</th>
<th>Fastest Results</th>
</tr>
<tr>
<td>(ordinal parameters)<p></p>
<p>hp.uniform<br>
hp.quniform<br>
hp.loguniform<br>
hp.qloguniform</p></td>
<td>20 x # parameters</td>
<td>10 x # parameters</td>
</tr>
<tr>
<td>(categorical parameters)<p></p>
<p>hp.choice</p></td>
<td colspan="2">15 x total categorical breadth*</td>
</tr>
</tbody>  
</table>

**Note:** “total categorical breadth” is the total number of categorical choices in the space.  If you have hp.choice with two options “on, off”, and another with five options “a, b, c, d, e”, your total categorical breadth is 10.
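
The table can be turned into a small calculator. The sketch below makes two assumptions that should be checked against the article: that the total categorical breadth is the product of the option counts (which matches the 2 and 5 giving 10 in the note above), and that the ordinal and categorical contributions are simply added. The function name is made up for this example.

```python
def suggest_max_evals(n_ordinal, categorical_sizes=(), fastest=False):
    """Estimate max_evals from the rule-of-thumb table.

    n_ordinal          -- number of hp.uniform / hp.quniform / hp.loguniform /
                          hp.qloguniform parameters
    categorical_sizes  -- number of options in each hp.choice
    fastest            -- use the "Fastest Results" multiplier instead of "Optimal"
    """
    # Assumption: breadth is the product of the option counts
    breadth = 1
    for size in categorical_sizes:
        breadth *= size
    categorical_part = 15 * breadth if categorical_sizes else 0
    ordinal_part = (10 if fastest else 20) * n_ordinal
    # Assumption: the two contributions are added together
    return ordinal_part + categorical_part

print(suggest_max_evals(4))                # 80
print(suggest_max_evals(4, fastest=True))  # 40
print(suggest_max_evals(0, [2, 5]))        # 150
```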

# 3. Example
In a previous [set of notebooks](../../Big%20Data%20And%20Big%20Compute/Apache%20Spark/README.md) we looked at using Apache Spark. In one [notebook](../../Big%20Data%20And%20Big%20Compute/Apache%20Spark/Running%20Scikit-Learn%20Apache%20Spark.ipynb) we looked at running scikit-learn models on Apache Spark. We will continue where that notebook left off. This time we will optimize the hyperparameters for the model using hyperopt.

## 3.1. Get Spark Setup

### 3.1.1. Confirm Kubernetes Is Online
Recall Apache Spark is running on a Kubernetes cluster. Before checking Spark, which is higher in the tech stack, we will check that the prerequisite Kubernetes cluster is online.

In [1]:
! kubectl version --client

Client Version: version.Info{Major:"1", Minor:"21", GitVersion:"v1.21.0", GitCommit:"cb303e613a121a29364f75cc67d3d580833a7479", GitTreeState:"clean", BuildDate:"2021-04-08T16:31:21Z", GoVersion:"go1.16.1", Compiler:"gc", Platform:"windows/amd64"}


In [2]:
! kubectl cluster-info

Kubernetes control plane is running at https://15.4.7.11:6443
CoreDNS is running at https://15.4.7.11:6443/api/v1/namespaces/kube-system/services/kube-dns:dns/proxy

To further debug and diagnose cluster problems, use 'kubectl cluster-info dump'.


In [3]:
! kubectl get node

NAME                           STATUS   ROLES                  AGE    VERSION
os004k8-master001.foobar.com   Ready    control-plane,master   220d   v1.21.1
os004k8-worker001.foobar.com   Ready    <none>                 220d   v1.21.1
os004k8-worker002.foobar.com   Ready    <none>                 220d   v1.21.1
os004k8-worker003.foobar.com   Ready    <none>                 220d   v1.21.1


### 3.1.2. Create SparkContext

In [4]:
from spark_helper import create_spark_session
spark_app_name = "spark-jupyter-win"
docker_image = "tschneider/pyspark:v6-beta"
k8_master_ip = "15.4.7.11"
spark_session = create_spark_session(spark_app_name, docker_image, k8_master_ip)
sc = spark_session.sparkContext

Setting SPARK_HOME
c:\spark\spark-3.1.1-bin-hadoop2.7

Running findspark.init() function
['c:\\spark\\spark-3.1.1-bin-hadoop2.7\\python', 'c:\\spark\\spark-3.1.1-bin-hadoop2.7\\python\\lib\\py4j-0.10.9-src.zip', 'c:\\program files\\python36\\python36.zip', 'c:\\program files\\python36\\DLLs', 'c:\\program files\\python36\\lib', 'c:\\program files\\python36', '', 'c:\\program files\\python36\\lib\\site-packages', 'c:\\program files\\python36\\lib\\site-packages\\win32', 'c:\\program files\\python36\\lib\\site-packages\\win32\\lib', 'c:\\program files\\python36\\lib\\site-packages\\Pythonwin', 'c:\\program files\\python36\\lib\\site-packages\\IPython\\extensions', 'C:\\Users\\Administrator\\.ipython']

Setting PYSPARK_PYTHON
/usr/bin/python3

Determine IP Of Server
The ip was detected as: 192.168.1.14

Create SparkContext



### 3.1.3. Ensure Kubernetes Has Created Containers
We can look at Kubernetes to see that our worker nodes were created.

The first time we create the spark context with a given docker image, the image will need to be downloaded (which takes some time). As a result, we may see the pods with a status of "ContainerCreating". In this case, we will need to wait until the containers are in a "Running" state.

```
kubectl -n spark get pod
NAME                                        READY   STATUS              RESTARTS   AGE
spark-jupyter-win-3ed7f27984f7563a-exec-1   0/1     ContainerCreating   0          12m
spark-jupyter-win-3ed7f27984f7563a-exec-2   0/1     ContainerCreating   0          12m
spark-jupyter-win-3ed7f27984f7563a-exec-3   0/1     ContainerCreating   0          12m
```
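
If we would rather check this programmatically than by eye, the table that `kubectl get pod` prints can be parsed with a few lines of Python. The helper names below are made up for this sketch, and the sample text is the output shown above.

```python
def pod_statuses(kubectl_output):
    """Parse `kubectl get pod` table text into {pod_name: status}."""
    lines = [line for line in kubectl_output.strip().splitlines() if line.strip()]
    header = lines[0].split()
    name_col, status_col = header.index("NAME"), header.index("STATUS")
    statuses = {}
    for line in lines[1:]:
        fields = line.split()
        statuses[fields[name_col]] = fields[status_col]
    return statuses

def all_running(kubectl_output):
    """True only if at least one pod is listed and every pod is Running."""
    statuses = pod_statuses(kubectl_output)
    return bool(statuses) and all(s == "Running" for s in statuses.values())

sample = """
NAME                                        READY   STATUS              RESTARTS   AGE
spark-jupyter-win-3ed7f27984f7563a-exec-1   0/1     ContainerCreating   0          12m
spark-jupyter-win-3ed7f27984f7563a-exec-2   0/1     ContainerCreating   0          12m
spark-jupyter-win-3ed7f27984f7563a-exec-3   0/1     ContainerCreating   0          12m
"""
print(all_running(sample))  # False
```

In a notebook the text could come from `subprocess.run(["kubectl", "-n", "spark", "get", "pod"], capture_output=True, text=True).stdout`, polled in a loop until `all_running` returns True.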

Sometimes creating the container takes time as the docker image needs to be downloaded from the registry. If we want to check on the status of the container being created, we can log into the worker node and run the *docker pull* command. This command will attach to an in-progress pull if one exists and will show the current pull status. An example would be as follows:

```
[root@os004k8-worker002 ~]# docker pull tschneider/pyspark:v3
v3: Pulling from tschneider/pyspark
2d473b07cdd5: Already exists
71d236fb1195: Already exists
2e22160d8cab: Already exists
e99d962ac218: Pull complete
Digest: sha256:eb74701b4ae909c40046ff68b1044b09b11895e175c955dfd8afe9fe680309cf
Status: Downloaded newer image for tschneider/pyspark:v3
docker.io/tschneider/pyspark:v3
[root@os004k8-worker002 ~]# docker pull tschneider/pyspark:v4
v4: Pulling from tschneider/pyspark
2d473b07cdd5: Already exists
71d236fb1195: Already exists
2e22160d8cab: Already exists
c556a717fe5d: Downloading [=======================>                           ]  578.7MB/1.246GB
```

We check the status of our pod with:

In [5]:
! kubectl -n spark get pod

NAME                                        READY   STATUS    RESTARTS   AGE
spark-jupyter-win-127dfb7d808790bc-exec-1   1/1     Running   0          32s
spark-jupyter-win-127dfb7d808790bc-exec-2   1/1     Running   0          31s
spark-jupyter-win-127dfb7d808790bc-exec-3   1/1     Running   0          31s


### 3.1.4. Create web server to host data

Determine the current working directory. 

Note: There is a trick to doing this inside a jupyter notebook and so we will use a special library to get that information.

In [6]:
import pyprojroot
project_root_dir  = pyprojroot.here()
print(project_root_dir)

C:\Users\Administrator\git\ml-training-jupyter-notebooks


Load the module for the webserver from our utilities directory

In [7]:
# Import the module for the web server we wrote
import importlib.util
spec = importlib.util.spec_from_file_location("PythonHttpFileServer", "../../../Utilities/PythonHttpFileServer.py")
PythonHttpFileServer = importlib.util.module_from_spec(spec)
spec.loader.exec_module(PythonHttpFileServer)

Configure logging so that messages are collected and displayed asynchronously, allowing the server to run in the background without causing a jupyter cell to block.

In [9]:
# Configure the logger and log level
import logging
logger = logging.getLogger()
logger.setLevel(logging.INFO)

# Remove all handlers
# Iterate over a copy, since removing from the list while looping over it skips entries
for handler in list(logger.handlers):
    logger.removeHandler(handler)
    
# Start the webserver in a thread so the cell is not stuck in a running state
import threading
web_server_port = 80
web_server_args = (web_server_port, project_root_dir)
web_server_thread = threading.Thread(target=PythonHttpFileServer.run_server, args=web_server_args)
web_server_thread.start()

INFO:root:Starting server on port 80
INFO:root:Web root specified as: C:\Users\Administrator\git\ml-training-jupyter-notebooks


 * Serving Flask app 'PythonHttpFileServer' (lazy loading)
 * Environment: production
   Use a production WSGI server instead.
 * Debug mode: off


INFO:werkzeug: * Running on http://192.168.1.14:80/ (Press CTRL+C to quit)
INFO:root:Get C:\Users\Administrator\git\ml-training-jupyter-notebooks\Example Data Sets/nasdaq_2019.csv
INFO:werkzeug:192.168.1.14 - - [03/Dec/2021 07:43:43] "GET /Example%20Data%20Sets/nasdaq_2019.csv HTTP/1.1" 200 -
INFO:root:Get C:\Users\Administrator\git\ml-training-jupyter-notebooks\Example Data Sets/nasdaq_2019.csv
INFO:werkzeug:15.4.11.1 - - [03/Dec/2021 07:43:53] "GET /Example%20Data%20Sets/nasdaq_2019.csv HTTP/1.1" 200 -
INFO:root:Get C:\Users\Administrator\git\ml-training-jupyter-notebooks\favicon.ico
ERROR:PythonHttpFileServer:Exception on /favicon.ico [GET]
Traceback (most recent call last):
  File "c:\program files\python36\lib\site-packages\flask\app.py", line 2051, in wsgi_app
    response = self.full_dispatch_request()
  File "c:\program files\python36\lib\site-packages\flask\app.py", line 1501, in full_dispatch_request
    rv = self.handle_user_exception(e)
  File "c:\program files\python36\lib

### 3.1.5. Load The Data

Instruct the Spark cluster to download a file from the web server.

In [12]:
from spark_helper import determine_ip_address
csv_file_name = "Example%20Data%20Sets/nasdaq_2019.csv"
ip_address = determine_ip_address()
ip_address = "15.4.11.1"
csv_file_url = "http://{0}:{1}/{2}".format(ip_address, web_server_port, csv_file_name)
print("Data file should be available at: '{0}'".format(csv_file_url))
sc.addFile(csv_file_url)

Data file should be available at: 'http://15.4.11.1:80/Example%20Data%20Sets/nasdaq_2019.csv'
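
The `%20` sequences in the file name above are percent-encoded spaces. As a small sketch, the standard library can produce that encoding from the plain relative path, which avoids typing it by hand. `build_file_url` is a name made up for this example.

```python
from urllib.parse import quote

def build_file_url(host, port, relative_path):
    """Build an HTTP URL, percent-encoding spaces but keeping '/' separators."""
    return "http://{0}:{1}/{2}".format(host, port, quote(relative_path))

print(build_file_url("15.4.11.1", 80, "Example Data Sets/nasdaq_2019.csv"))
# http://15.4.11.1:80/Example%20Data%20Sets/nasdaq_2019.csv
```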


Import the utility function to convert a date string to a datetime object from our utilities module

In [13]:
# Import the utilities module we wrote
import importlib.util
spec = importlib.util.spec_from_file_location("utilities", "../../../Utilities/utilities.py")
utilities = importlib.util.module_from_spec(spec)
spec.loader.exec_module(utilities)

# Define a mapping to convert our data field to the correct type
converter_mapping = {
    "date": utilities.convert_date_string_to_date
}

Load our OHLCV data into a koalas dataframe and pull out a single day in the same way we would in pandas.

In [None]:
# Avoid a warning
import os
os.environ["PYARROW_IGNORE_TIMEZONE"] = "1"

# Load the data
from databricks import koalas
koalas_dataframe = koalas.read_csv(u"file:////nasdaq_2019.csv", converters=converter_mapping)

INFO:spark:Patching spark automatically. You can disable it by setting SPARK_KOALAS_AUTOPATCH=false in your environment


We should see the workers download the file in the logs. If we log into the nodes we can see the file is located on the filesystem root.

With the data loaded into a koalas dataframe we can access the data in the same way we would from a pandas dataframe

In [None]:
koalas_dataframe.head()

### 3.1.6. We need to prepare our worker nodes

Note: We need to install relevant python libraries on the worker nodes. If you do not, you might see an error as follows:
```
PythonException: 
  An exception was thrown from the Python worker. Please see the stack trace below.
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/site-packages/pyspark/worker.py", line 588, in main
    func, profiler, deserializer, serializer = read_udfs(pickleSer, infile, eval_type)
  File "/usr/local/lib/python3.6/site-packages/pyspark/worker.py", line 421, in read_udfs
    arg_offsets, f = read_single_udf(pickleSer, infile, eval_type, runner_conf, udf_index=0)
  File "/usr/local/lib/python3.6/site-packages/pyspark/worker.py", line 249, in read_single_udf
    f, return_type = read_command(pickleSer, infile)
  File "/usr/local/lib/python3.6/site-packages/pyspark/worker.py", line 69, in read_command
    command = serializer._read_with_length(file)
  File "/usr/local/lib/python3.6/site-packages/pyspark/serializers.py", line 160, in _read_with_length
    return self.loads(obj)
  File "/usr/local/lib/python3.6/site-packages/pyspark/serializers.py", line 430, in loads
    return pickle.loads(obj, encoding=encoding)
  File "/usr/local/lib/python3.6/site-packages/pyspark/cloudpickle/cloudpickle.py", line 562, in subimport
    __import__(name)
ModuleNotFoundError: No module named 'pandas'
```

In our case we needed to install pandas, numpy, koalas, scikit-learn, sklearn. If you are unsure of what is installed on your workers, you can log into the Kubernetes pods and execute shell commands.

Note: We must do this on all workers.

In [None]:
! kubectl -n spark get pod

In [None]:
! kubectl -n spark exec -ti spark-jupyter-win-7716697d71777b7b-exec-1 -- pip3 list
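
Since the step has to be repeated on every worker, it can help to generate the commands rather than type them. The sketch below only builds the argument lists and does not run anything; the helper name is made up, and the pod name is the one from the cell above. Each list could be handed to `subprocess.run` once reviewed.

```python
def pip_install_commands(pod_names, packages, namespace="spark"):
    """Build one `kubectl exec ... pip3 install` argument list per pod."""
    return [
        ["kubectl", "-n", namespace, "exec", pod, "--", "pip3", "install", *packages]
        for pod in pod_names
    ]

# Pod names come from `kubectl -n spark get pod`
pods = ["spark-jupyter-win-7716697d71777b7b-exec-1"]
for command in pip_install_commands(pods, ["pandas", "numpy"]):
    print(" ".join(command))
```

Packages installed this way live only as long as the pod does, so adding them to the docker image is likely the more durable option.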

## 3.2. Submit Python Code To Spark Cluster

In this section of the notebook we are going to use hyperopt to tune a SARIMAX model from statsmodels against the data for a single stock in our koalas_dataframe object.
To do this, we are going to write functions that train and test the model on a dataframe; the assumption being the dataframe only contains data related to the same stock.
Note: Most of this is a review and reworking of the content contained in 
<a href="../Cluster%20Analysis/K-Means.ipynb">Cluster Analysis/K-Means.ipynb</a>.

### 3.2.1. Process Data
To kick things off we need a dataframe representing a given stock. We will use the AABA stock.

In [None]:
# Sort based on the date column
koalas_dataframe = koalas_dataframe.sort_values("date")
aaba_df = koalas_dataframe.loc[koalas_dataframe["ticker"] == 'AABA'].copy()
aaba_df.head()

### 3.2.2. Define Search Space
We will be using several models to make predictions about future data points. Here is where we define these models and their hyperparameters.

In [None]:
# Import required libraries
import hyperopt
import numpy
import pandas
import sklearn
import statsmodels.api

# Define the search space
space = hyperopt.hp.choice('model_scenarios', [
    {
        'name': 'sarimax',
        'model_type': statsmodels.api.tsa.statespace.SARIMAX,
        'model_args': [],
        'model_kwargs': {},
        'feature_names': ["open"]
    }
])

### 3.2.3. Define Utility Functions

Our ML workflow requires a few functions to work properly. The basic process for selecting hyperparameters will be as follows:
1. Perform train/test split
2. Select training scenario (select a point from the hyperparameter search space)
3. Train model (calibrate model to fit the data)
4. Test model (make forecasts and calculate accuracy scores)

We will define some of these steps as standalone functions while others will be defined in an OOP way (i.e. we will use classes, polymorphism, and interface programming).

#### 3.2.3.1. Perform the train test split
We will split the data so that the last three days are held out to look ahead and predict for.

In [None]:
def train_test_split(df):
    return df.iloc[0:-3], df.iloc[-3:]

aaba_train_df, aaba_test_df = train_test_split(aaba_df)

#### 3.2.3.2. Select a training scenario
The scenario will be selected by the *fmin()* function provided by hyperopt so we do not need to define anything.

However, we can get a glimpse of what will be passed to the objective function during a trial by sampling from the search space:

In [None]:
scenario = hyperopt.pyll.stochastic.sample(space)
import pprint
pprint.pprint(scenario)

Check that we can serialize the scenario (per the gotchas above):

In [None]:
import pickle
d = pickle.dumps(scenario)

#### 3.2.3.3. Train a model
We need to write a generic class that can train our algorithm.

In [None]:
class hyperopt_model():
    def __init__(self, scenario):
        
        # Set some params for later
        self.model_type = None
        self.model = None
        self.trained_model = None
        self.is_built = False
        self.is_trained = False
        self.is_tested = False
        self.scores = {}
        
        # Retrieve override parameters from the scenario
        self.__dict__.update(scenario)
        
    def train(self, train_df):
        raise NotImplementedError()
    
    def test(self, test_df):
        raise NotImplementedError()

In [None]:
class sarimax(hyperopt_model):
    def __init__(self, scenario):
        super(sarimax, self).__init__(scenario)

    def train(self, train_df):
        try:
            # Convert data from koalas to numpy
            # Annoyingly, some algorithms cannot handle the koalas datatype
            model_parameters = train_df[[*self.feature_names]].to_numpy()

            args = list(self.model_args)
            args.insert(0, model_parameters)

            # Create an instance fit to the data
            self.model = self.model_type(*args, **self.model_kwargs)
            self.trained_model = self.model.fit()

            # Set some flags
            self.is_built = True
            self.is_trained = True
        except Exception as e:
            raise Exception("Unable to train the model") from e
    
    def test(self, test_df):
        # Import before the try block so reset_option is always defined in finally
        from databricks.koalas.config import set_option, reset_option

        try:
            # Overcome issue with koalas api
            set_option("compute.ops_on_diff_frames", True)

            # Predict next data points
            # Note: We need to do some index magic
            #       https://stackoverflow.com/questions/67662652/adding-a-new-column-to-an-existing-koalas-dataframe-results-in-nans
            #
            actuals = test_df[self.feature_names[0]]
            predictions = self.trained_model.forecast(steps=test_df.shape[0])
            test_df["predictions"] = koalas.Series(predictions, test_df.index.tolist())

            # Calculate the error for each prediction
            test_df["error"] = actuals - test_df["predictions"]
            
            # Calculate aggregate error metrics
            #     mae - mean absolute error
            #     rmse - root mean squared error
            #     mape - mean absolute percentage error
            
            self.scores = {
                "mae": test_df["error"].abs().mean(),
                "rmse": numpy.sqrt((test_df["error"]**2).mean()),
                "mape": 100 * (test_df["error"] / actuals).abs().mean()
            }
            self.is_tested = True
            
            return self.scores, test_df

        except Exception as e:
            raise Exception("Unable to test the model") from e
        finally:
            # Overcome issue with koalas api
            reset_option("compute.ops_on_diff_frames")

In [None]:
def hyperopt_model_factory(scenario):
    name = scenario["name"]
    constructor = globals()[name]
    return constructor(scenario)
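
Looking classes up in `globals()` works in a notebook, but any name in the namespace becomes a valid "model". One alternative, sketched here as a design option rather than as what this notebook does, is an explicit registry that classes opt into. All names in the block are made up for the example.

```python
MODEL_REGISTRY = {}

def register_model(cls):
    """Class decorator: make cls constructible by its class name."""
    MODEL_REGISTRY[cls.__name__] = cls
    return cls

def model_factory(scenario):
    """Build the model class named by scenario["name"], or fail clearly."""
    try:
        constructor = MODEL_REGISTRY[scenario["name"]]
    except KeyError:
        raise ValueError("Unknown model name: {0!r}".format(scenario.get("name")))
    return constructor(scenario)

@register_model
class dummy_model:  # hypothetical stand-in for a class such as sarimax
    def __init__(self, scenario):
        self.scenario = scenario

model = model_factory({"name": "dummy_model"})
print(type(model).__name__)  # dummy_model
```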

In [None]:
my_hyperopt_model = hyperopt_model_factory(scenario)
my_hyperopt_model.train(aaba_train_df)
my_hyperopt_model.trained_model

#### 3.2.3.4. Test the model
We will make predictions for the test data set and calculate accuracy scores.

In [None]:
scores, error_df = my_hyperopt_model.test(aaba_test_df)

In [None]:
error_df

In [None]:
scores

In [None]:
scores["mape"]
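
For reference, the three scores can be reproduced on plain Python lists. The sketch below uses the textbook definitions of MAE, RMSE, and MAPE; the function name and the sample numbers are made up for illustration and have nothing to do with the AABA data.

```python
import math

def error_metrics(actuals, predictions):
    """Compute MAE, RMSE and MAPE (in percent) for two equal-length sequences."""
    errors = [a - p for a, p in zip(actuals, predictions)]
    n = len(errors)
    return {
        "mae": sum(abs(e) for e in errors) / n,
        "rmse": math.sqrt(sum(e * e for e in errors) / n),
        "mape": 100 * sum(abs(e / a) for e, a in zip(errors, actuals)) / n,
    }

print(error_metrics([100.0, 200.0], [90.0, 220.0]))
```

Note that MAPE divides by each actual value, so it is undefined when an actual is zero.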

### 3.2.4. Define The Objective Function

In [None]:
def objective_function(scenario):
    
    my_hyperopt_model = hyperopt_model_factory(scenario)
    my_hyperopt_model.train(aaba_train_df)
    scores, error_df = my_hyperopt_model.test(aaba_test_df)
    
    return scores["mape"]

In [None]:
objective_function(scenario)
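
Besides a bare number, hyperopt also accepts a dictionary with `loss` and `status` keys as the return value of an objective. The sketch below wraps an objective so that a failing trial is reported as failed instead of aborting the whole search. The wrapper name is made up, and the two constants are written out as plain strings holding the same values as `hyperopt.STATUS_OK` and `hyperopt.STATUS_FAIL` so that the block runs without hyperopt installed.

```python
STATUS_OK = "ok"      # same string value as hyperopt.STATUS_OK
STATUS_FAIL = "fail"  # same string value as hyperopt.STATUS_FAIL

def safe_objective(objective):
    """Wrap an objective so a failed trial is reported instead of raised."""
    def wrapped(scenario):
        try:
            return {"loss": float(objective(scenario)), "status": STATUS_OK}
        except Exception as error:
            # Keep the message so the failure can be inspected in the trials object
            return {"status": STATUS_FAIL, "error": str(error)}
    return wrapped

def flaky(scenario):  # hypothetical objective used only to exercise the wrapper
    if scenario.get("broken"):
        raise RuntimeError("model did not converge")
    return 1.5

print(safe_objective(flaky)({}))
print(safe_objective(flaky)({"broken": True}))
```

Whether to swallow failures like this is a judgment call; while debugging it is usually more useful to let the exception surface.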

### 3.2.5. Confirm Everything Can Pickle
As we mentioned in the gotcha, some objects cannot pickle and thus cannot ...

Our call to the *fmin()* function will look like this:

```
best_hyperparameters = hyperopt.fmin(
  fn = objective_function,
  space = space,
  algo = hyperopt.tpe.suggest,
  max_evals = 200,
  trials = spark_trials,
  loss_threshold = 0.05,
  rstate = numpy.random.default_rng(42))
```

So we need to test if we can pickle all the parameters being passed in:
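
One way to run such a check is to attempt to pickle each argument by name and collect the failures. The sketch below uses the standard library `pickle` module as an approximation: Spark serializes functions with cloudpickle, which also captures the objects a function references, so it is worth adding those objects (for example the data frames used inside the objective) to the dictionary as well. The helper name and the sample entries are made up for illustration.

```python
import pickle
import threading

def pickle_report(named_objects):
    """Map each name to None if it pickles, else to the error message."""
    report = {}
    for name, obj in named_objects.items():
        try:
            pickle.dumps(obj)
            report[name] = None
        except Exception as error:
            report[name] = "{0}: {1}".format(type(error).__name__, error)
    return report

candidates = {
    "max_evals": 200,
    "loss_threshold": 0.05,
    "holds_a_lock": threading.RLock(),  # stand-in for an object that cannot pickle
}
for name, problem in pickle_report(candidates).items():
    print(name, "->", problem or "ok")
```

In the notebook the dictionary would hold the real arguments, such as the search space and the objects the objective function closes over.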

### 3.2.6. Run Hyperopt Against Spark

We first create an instance of the SparkTrials object to hold the trials from our hyperopt search. We pass our session in via the *spark_session* parameter below; if it is omitted, SparkTrials will auto-detect the active Spark session.

In [None]:
spark_trials = hyperopt.SparkTrials(parallelism=3, spark_session = spark_session)

Once this object is created, we run the *fmin()* function as we have done in the past.

In [None]:
seed = numpy.random.RandomState(42)

In [None]:
dir(seed)
seed.rand()

In [None]:
def foobar(scenario):
    return 1

In [None]:
best_hyperparameters = hyperopt.fmin(
  fn = foobar,
  space = space,
  algo = hyperopt.tpe.suggest,
  max_evals = 200,
  trials = spark_trials,
  loss_threshold = 0.05,
  rstate = numpy.random.default_rng(42))

In [None]:
best_hyperparameters = hyperopt.fmin(
  fn = objective_function,
  space = space,
  algo = hyperopt.tpe.suggest,
  max_evals = 200,
  trials = spark_trials,
  loss_threshold = 0.05,
  rstate = numpy.random.default_rng(42))

# 4. Cleanup Spark Cluster On Kubernetes

In [None]:
sc.stop()

In [None]:
! kubectl -n spark get pod