## Step 1: Initialize Combiner

```python
from spark.combiner.combiner import Combiner
combiner = Combiner(4, 26544)
```

- **4** is the **number of cores** per worker node.
- **26544** is the **total memory** per worker node in MB. **90%** of the total memory is allocated to executors. If the executor memory is already decided, say X MB, pass X / 0.9 MB as the total memory instead, so that the 90% share comes out to X.
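
The 90% rule can be checked with a little arithmetic. The sketch below is only an illustration: the helper names are hypothetical and the rounding (truncate one way, round up the other) is an assumption, not something the library documents.

```python
import math

# Share of node memory that goes to executors, as stated above.
EXECUTOR_FRACTION = 0.9


def executor_memory_mb(total_memory_mb):
    """Memory (MB) left for executors on a node with the given total memory."""
    return int(total_memory_mb * EXECUTOR_FRACTION)


def total_memory_for_executor_mb(executor_mb):
    """Total memory (MB) to pass to Combiner so executors end up with executor_mb."""
    return int(math.ceil(executor_mb / EXECUTOR_FRACTION))


print(executor_memory_mb(26544))            # 23889
print(total_memory_for_executor_mb(23889))  # 26544
```

With 26544 MB per node, the executor share works out to 23889 MB, which is the largest `spark.executor.memory` value that appears later in this walkthrough.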

## Step 2: Add Training Data
```python
training_data_1 = {
        "spark.executor.memory": 11945,
        "spark.sql.shuffle.partitions": 200,
        "spark.executor.cores": 2,
        "spark.driver.memory": 1024 * 2,
        "spark.sql.autoBroadcastJoinThreshold": 10,
        "spark.sql.statistics.fallBackToHdfs": 0
    }
runtime_in_sec = 248
combiner.add_training_data(training_data_1, runtime_in_sec)
```
Add the training data as a dictionary mapping each config name to its value.

For Boolean configs, use 0 or 1 instead of *False* or *True*.

These are the only configs currently modeled out of the box:

- spark.executor.memory
- spark.sql.shuffle.partitions
- spark.executor.cores
- spark.driver.memory
- spark.sql.autoBroadcastJoinThreshold
- spark.sql.statistics.fallBackToHdfs

To model more configs, add them to ConfigSet manually, which is beyond the scope of this tutorial.
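
If your job settings live in a larger dictionary with Python booleans, a small helper can reduce it to the shape described above. This is a minimal sketch: `to_training_data` and `MODELED_CONFIGS` are hypothetical names, not part of the library, and the numeric values are assumed to already be in the units the model expects.

```python
# The six configs listed above as modeled out of the box.
MODELED_CONFIGS = (
    "spark.executor.memory",
    "spark.sql.shuffle.partitions",
    "spark.executor.cores",
    "spark.driver.memory",
    "spark.sql.autoBroadcastJoinThreshold",
    "spark.sql.statistics.fallBackToHdfs",
)


def to_training_data(raw_config):
    """Keep only the modeled configs and encode booleans as 0 or 1."""
    encoded = {}
    for key in MODELED_CONFIGS:
        if key not in raw_config:
            continue  # skip configs the job did not set
        value = raw_config[key]
        encoded[key] = int(value) if isinstance(value, bool) else value
    return encoded


raw = {
    "spark.executor.memory": 11945,
    "spark.sql.shuffle.partitions": 200,
    "spark.executor.cores": 2,
    "spark.driver.memory": 1024 * 2,
    "spark.sql.autoBroadcastJoinThreshold": 10,
    "spark.sql.statistics.fallBackToHdfs": False,  # becomes 0
    "spark.app.name": "demo",                      # not modeled, dropped
}
training_data = to_training_data(raw)
```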


## Step 3: Compute best config for next run
```python
best_config = combiner.get_best_config()
print(best_config)
```
Compute the best config for next run. Then run the job with the suggested configuration.

Output:

```
{'spark.executor.cores': 4, 'spark.driver.memory': 2048, 'spark.sql.statistics.fallBackToHdfs': 1, 'spark.sql.autoBroadcastJoinThreshold': 100, 'spark.executor.memory': 23889, 'spark.sql.shuffle.partitions': 200}
```
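
To run the job with a suggested configuration, the dictionary has to be turned into Spark settings. One possible way is to render it as `spark-submit --conf` arguments, sketched below. `to_spark_conf_args` is a hypothetical helper, and it rests on assumptions: memory values are taken to be in MB (the unit used for the total memory in Step 1), 0/1 is mapped back to `false`/`true`, and because the unit of `spark.sql.autoBroadcastJoinThreshold` is not stated here while Spark expects bytes, the caller must supply the multiplier.

```python
def to_spark_conf_args(config, threshold_bytes_per_unit):
    """Render a suggested config as a list of spark-submit --conf arguments."""
    memory_keys = ("spark.executor.memory", "spark.driver.memory")
    boolean_keys = ("spark.sql.statistics.fallBackToHdfs",)
    threshold_key = "spark.sql.autoBroadcastJoinThreshold"

    args = []
    for key in sorted(config):
        value = int(config[key])
        if key in memory_keys:
            rendered = "%dm" % value  # assumption: value is in MB
        elif key in boolean_keys:
            rendered = "true" if value else "false"
        elif key == threshold_key:
            # Spark reads this setting in bytes; the unit here is an assumption.
            rendered = str(value * threshold_bytes_per_unit)
        else:
            rendered = str(value)
        args.extend(["--conf", "%s=%s" % (key, rendered)])
    return args


best_config = {
    "spark.executor.cores": 4,
    "spark.driver.memory": 2048,
    "spark.sql.statistics.fallBackToHdfs": 1,
    "spark.sql.autoBroadcastJoinThreshold": 100,
    "spark.executor.memory": 23889,
    "spark.sql.shuffle.partitions": 200,
}
# Assumes the threshold is expressed in MB; adjust if your setup differs.
args = to_spark_conf_args(best_config, threshold_bytes_per_unit=1024 * 1024)
print(" ".join(args))
```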


## Step 4: Add new training data from Step 3 and repeat the process `n` times

* Run the job with the config suggested in Step 3, then add that config and its measured runtime to the model as new training data.
* Ask for the new best config and run the job again.
* Repeat this process for a predefined number of iterations `n`.
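
The loop described by these bullets can be sketched as follows. It uses only the two `Combiner` methods shown in this tutorial; `tune`, `run_job`, and `StubCombiner` are hypothetical names introduced for illustration, and the stub with its made-up runtime exists only so the sketch runs without a cluster.

```python
def tune(combiner, run_job, initial_config, n):
    """Run the job n times, feeding each result back to the model.

    combiner: object with add_training_data(config, runtime) and get_best_config()
    run_job:  hypothetical callable that runs the job and returns its runtime in seconds
    """
    history = []
    config = initial_config
    for _ in range(n):
        runtime_in_sec = run_job(config)
        combiner.add_training_data(config, runtime_in_sec)
        history.append((dict(config), runtime_in_sec))
        config = combiner.get_best_config()  # config for the next run
    return history


class StubCombiner(object):
    """Stand-in for Combiner; always suggests the same config."""

    def __init__(self, suggestion):
        self.suggestion = suggestion
        self.training = []

    def add_training_data(self, config, runtime_in_sec):
        self.training.append((config, runtime_in_sec))

    def get_best_config(self):
        return dict(self.suggestion)


stub = StubCombiner({"spark.executor.cores": 4})
history = tune(
    stub,
    run_job=lambda config: 100.0 / config["spark.executor.cores"],  # made-up runtime
    initial_config={"spark.executor.cores": 2},
    n=3,
)
# Step 5 in miniature: keep the config with the lowest measured runtime.
best_config, best_runtime = min(history, key=lambda run: run[1])
```

In real use, `combiner` would be the object created in Step 1 and `run_job` would submit the Spark job and time it.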

```python
training_data_2 = {
        "spark.executor.memory": 23889,
        "spark.sql.shuffle.partitions": 200,
        "spark.executor.cores": 4,
        "spark.driver.memory": 1024 * 2,
        "spark.sql.autoBroadcastJoinThreshold": 100,
        "spark.sql.statistics.fallBackToHdfs": 1
    }
runtime_in_sec = 92
combiner.add_training_data(training_data_2, runtime_in_sec)
best_config = combiner.get_best_config()
print(best_config)
```

Output:

```
OrderedDict([('spark.sql.shuffle.partitions', 1730.0), ('spark.executor.memory', 23876.0), ('spark.driver.memory', 5632.0), ('spark.executor.cores', 4.0), ('spark.sql.autoBroadcastJoinThreshold', 85.0), ('spark.sql.statistics.fallBackToHdfs', 1.0)])
```


```python
training_data_3 = {
        "spark.executor.memory": 11940,
        "spark.sql.shuffle.partitions": 460,
        "spark.executor.cores": 2,
        "spark.driver.memory": 2304,
        "spark.sql.autoBroadcastJoinThreshold": 30,
        "spark.sql.statistics.fallBackToHdfs": 1
    }
runtime_in_sec = 105
combiner.add_training_data(training_data_3, runtime_in_sec)
best_config = combiner.get_best_config()
print(best_config)
```

Output:

```
OrderedDict([('spark.sql.shuffle.partitions', 1930.0), ('spark.executor.memory', 5972.0), ('spark.driver.memory', 4864.0), ('spark.executor.cores', 1.0), ('spark.sql.autoBroadcastJoinThreshold', 95.0), ('spark.sql.statistics.fallBackToHdfs', 1.0)])
```


```python
training_data_4 = {
        "spark.executor.memory": 5972,
        "spark.sql.shuffle.partitions": 1930,
        "spark.executor.cores": 1,
        "spark.driver.memory": 4864,
        "spark.sql.autoBroadcastJoinThreshold": 95,
        "spark.sql.statistics.fallBackToHdfs": 1
    }
runtime_in_sec = 121
combiner.add_training_data(training_data_4, runtime_in_sec)
best_config = combiner.get_best_config()
print(best_config)
```

Output:

```
OrderedDict([('spark.sql.shuffle.partitions', 1750.0), ('spark.executor.memory', 23503.0), ('spark.driver.memory', 5888.0), ('spark.executor.cores', 4.0), ('spark.sql.autoBroadcastJoinThreshold', 30.0), ('spark.sql.statistics.fallBackToHdfs', 0.0)])
```
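
From the second iteration on, the suggestions come back as floats (for example `1930.0`), while the training data fed back in this walkthrough uses integers. A minimal sketch of that conversion is shown below. `as_int_config` is a hypothetical helper, and whether `add_training_data` also accepts floats directly is not stated here.

```python
from collections import OrderedDict


def as_int_config(best_config):
    """Round each suggested value to the nearest integer, keeping key order."""
    return OrderedDict(
        (key, int(round(value))) for key, value in best_config.items()
    )


# The suggestion printed after adding training_data_3.
suggested = OrderedDict([
    ("spark.sql.shuffle.partitions", 1930.0),
    ("spark.executor.memory", 5972.0),
    ("spark.driver.memory", 4864.0),
    ("spark.executor.cores", 1.0),
    ("spark.sql.autoBroadcastJoinThreshold", 95.0),
    ("spark.sql.statistics.fallBackToHdfs", 1.0),
])
next_training_data = as_int_config(suggested)
```

The result has the same values as `training_data_4` above.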


## Step 5: Choose the config with the best runtime
In our example, the best config is `training_data_2`, which ran in 92 seconds:
```python
training_data_2 = {
        "spark.executor.memory": 23889,
        "spark.sql.shuffle.partitions": 200,
        "spark.executor.cores": 4,
        "spark.driver.memory": 1024 * 2,
        "spark.sql.autoBroadcastJoinThreshold": 100,
        "spark.sql.statistics.fallBackToHdfs": 1
    }
```
The best config gave a 2.7x improvement over the initial config (248 seconds down to 92 seconds).
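
The comparison can be reproduced from the four runtimes recorded in this walkthrough. The sketch below simply picks the lowest runtime and divides the initial runtime by it; the variable names are illustrative.

```python
# (config name, runtime in seconds) for each run in this tutorial.
runs = [
    ("training_data_1", 248),
    ("training_data_2", 92),
    ("training_data_3", 105),
    ("training_data_4", 121),
]

best_name, best_runtime = min(runs, key=lambda run: run[1])
initial_runtime = runs[0][1]
speedup = initial_runtime / float(best_runtime)

print("%s: %d s, %.1fx faster than the initial config"
      % (best_name, best_runtime, speedup))
```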
