## Step 1: Initialize Combiner

```python
from spark.combiner.combiner import Combiner
combiner = Combiner(4, 26544)
```

- **4** is the **number of cores** per worker node.
- **26544** is the **total memory** per worker node in MB. **90%** of the total memory is allocated to executors. If the executor memory is already fixed at X MB, pass X/0.9 MB as the total memory instead, so that the 90% share equals X.
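The 90% rule above can be checked with a few lines of arithmetic. This is a minimal sketch: the helper names `executor_memory_mb` and `total_memory_for_executor_mb` are hypothetical and not part of the library, and rounding to whole MB is an assumption.

```python
import math

EXECUTOR_FRACTION = 0.9  # share of node memory given to executors, per the text above


def executor_memory_mb(total_memory_mb):
    """Memory (MB) left for executors on a node with the given total memory."""
    return int(total_memory_mb * EXECUTOR_FRACTION)  # rounding down is an assumption


def total_memory_for_executor_mb(executor_mb):
    """Total memory (MB) to pass to Combiner so executors get at least executor_mb."""
    return int(math.ceil(executor_mb / EXECUTOR_FRACTION))


print(executor_memory_mb(26544))
print(total_memory_for_executor_mb(12000))
```

For the 26544 MB node used here, the first call prints 23889, which is consistent with the `spark.executor.memory` value in the first suggested configuration later in this example.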

## Step 2: Add Training Data
```python
training_data_1 = {
        "spark.executor.memory": 11945,
        "spark.sql.shuffle.partitions": 200,
        "spark.executor.cores": 2,
        "spark.driver.memory": 1024 * 2,
        "spark.sql.autoBroadcastJoinThreshold": 10,
        "spark.sql.statistics.fallBackToHdfs": 0
    }
runtime_in_sec = 71
combiner.add_training_data(training_data_1, runtime_in_sec)
```
Add the training data as a dictionary mapping each config name to its value.
<br>
Note: for Boolean configs, use 1 and 0 instead of *True* and *False*.
<br>
These are the only configs currently modeled out of the box:
- spark.executor.memory
- spark.sql.shuffle.partitions
- spark.executor.cores
- spark.driver.memory
- spark.sql.autoBroadcastJoinThreshold
- spark.sql.statistics.fallBackToHdfs

To model more configs, add them to ConfigSet manually; doing so is outside the scope of this tutorial.
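The two rules above (Booleans as 0 or 1, and only the six listed configs) can be enforced before calling `add_training_data`. The following is a minimal sketch; `MODELED_CONFIGS` and `encode_training_data` are hypothetical names introduced for illustration and are not part of the library.

```python
# The six configs listed above as modeled out of the box.
MODELED_CONFIGS = (
    "spark.executor.memory",
    "spark.sql.shuffle.partitions",
    "spark.executor.cores",
    "spark.driver.memory",
    "spark.sql.autoBroadcastJoinThreshold",
    "spark.sql.statistics.fallBackToHdfs",
)


def encode_training_data(conf):
    """Return a copy of conf with Booleans as 0/1; reject configs that are not modeled."""
    unknown = sorted(set(conf) - set(MODELED_CONFIGS))
    if unknown:
        raise ValueError("configs not modeled out of the box: " + ", ".join(unknown))
    encoded = {}
    for name, value in conf.items():
        # bool is checked explicitly so that True/False become plain integers
        encoded[name] = int(value) if isinstance(value, bool) else value
    return encoded


example = encode_training_data({
    "spark.executor.cores": 2,
    "spark.sql.statistics.fallBackToHdfs": True,
})
print(example)
```

The returned dictionary can then be passed to `combiner.add_training_data` together with the measured runtime.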

### Example with TPC-DS Q17


|conf|value|
|:-|:-|
|spark.driver.memory|2g|
|spark.executor.cores|3|
|spark.executor.memory|2g|
|spark.sql.shuffle.partitions|400|


<h4 align="center">Timing for the query: q17 - 806750 </h4>

================================================================================================================================================
<br>


|conf|value|
|:-|:-|
|spark.driver.memory|4g|
|spark.executor.cores|8|
|spark.executor.memory|5g|
|spark.sql.shuffle.partitions|600|


<h4 align="center">Timing for the query: q17 - 1191319 </h4> 

================================================================================================================================================
<br>


|conf|value|
|:-|:-|
|spark.driver.memory|1g|
|spark.executor.cores|2|
|spark.executor.memory|2g|
|spark.sql.shuffle.partitions|100|


<h4 align="center">Timing for the query: q17 - 1138390 </h4> 
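The tables above give memory as Spark size strings such as `2g`, while the training data in Step 2 uses plain numbers (for example `1024 * 2` for `spark.driver.memory`). A minimal sketch of converting one table into a training-data dictionary follows. It assumes the memory configs are expected in MB; `memory_to_mb` and `table_to_training_data` are hypothetical helpers, not part of the library. The unit of the reported timings is not stated, so the sketch leaves the timing out.

```python
MEMORY_CONFIGS = ("spark.driver.memory", "spark.executor.memory")


def memory_to_mb(value):
    """Convert a Spark size string such as '2g' or '512m' to whole MB."""
    text = str(value).strip().lower()
    factors = {"m": 1, "g": 1024}  # Spark treats 1g as 1024m
    if text and text[-1] in factors:
        return int(text[:-1]) * factors[text[-1]]
    return int(text)  # no suffix: assumed to be MB already


def table_to_training_data(table):
    """Turn {conf: value-as-text} rows into numeric training data."""
    data = {}
    for name, value in table.items():
        data[name] = memory_to_mb(value) if name in MEMORY_CONFIGS else int(value)
    return data


# First Q17 table from above.
q17_run_1 = table_to_training_data({
    "spark.driver.memory": "2g",
    "spark.executor.cores": "3",
    "spark.executor.memory": "2g",
    "spark.sql.shuffle.partitions": "400",
})
print(q17_run_1)
```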


```python
training_data_1 = {
        "spark.executor.memory": 11945,
        "spark.sql.shuffle.partitions": 200,
        "spark.executor.cores": 2,
        "spark.driver.memory": 1024 * 2,
        "spark.sql.autoBroadcastJoinThreshold": 10,
        "spark.sql.statistics.fallBackToHdfs": 0
    }
runtime_in_sec = 248
combiner.add_training_data(training_data_1, runtime_in_sec)
```

```python
best_config = combiner.get_best_config()
print(best_config)
```

```
{'spark.executor.cores': 4, 'spark.driver.memory': 2048, 'spark.sql.statistics.fallBackToHdfs': 1, 'spark.sql.autoBroadcastJoinThreshold': 100, 'spark.executor.memory': 23889, 'spark.sql.shuffle.partitions': 200}
```


```python
training_data_2 = {
        "spark.executor.memory": 23889,
        "spark.sql.shuffle.partitions": 200,
        "spark.executor.cores": 4,
        "spark.driver.memory": 1024 * 2,
        "spark.sql.autoBroadcastJoinThreshold": 100,
        "spark.sql.statistics.fallBackToHdfs": 1
    }
runtime_in_sec = 92
combiner.add_training_data(training_data_2, runtime_in_sec)
best_config = combiner.get_best_config()
print(best_config)
```

```
OrderedDict([('spark.sql.shuffle.partitions', 460.0), ('spark.executor.memory', 11940.0), ('spark.driver.memory', 2304.0), ('spark.executor.cores', 2.0), ('spark.sql.autoBroadcastJoinThreshold', 30.0), ('spark.sql.statistics.fallBackToHdfs', 1.0)])
```


```python
training_data_3 = {
        "spark.executor.memory": 11940,
        "spark.sql.shuffle.partitions": 460,
        "spark.executor.cores": 2,
        "spark.driver.memory": 2304,
        "spark.sql.autoBroadcastJoinThreshold": 30,
        "spark.sql.statistics.fallBackToHdfs": 1
    }
runtime_in_sec = 105
combiner.add_training_data(training_data_3, runtime_in_sec)
best_config = combiner.get_best_config()
print(best_config)
```

```
OrderedDict([('spark.sql.shuffle.partitions', 460.0), ('spark.executor.memory', 11940.0), ('spark.driver.memory', 2304.0), ('spark.executor.cores', 2.0), ('spark.sql.autoBroadcastJoinThreshold', 30.0), ('spark.sql.statistics.fallBackToHdfs', 1.0)])
```
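The cells above repeat one pattern: measure a configuration, add it as training data, ask for the next suggestion, and stop once the suggestion repeats the configuration that was just measured. That pattern can be sketched as a loop. This is only an illustration under stated assumptions: `tune` and `run_job` are hypothetical names, `run_job` stands for whatever runs the Spark job with a given config and returns its runtime in seconds, and only `add_training_data` and `get_best_config` come from the steps shown above.

```python
def tune(combiner, run_job, initial_config, max_rounds=10):
    """Alternate between measuring a config and asking for the next suggestion.

    run_job is a hypothetical callable: it runs the job with the given config
    and returns the runtime in seconds.
    """
    config = dict(initial_config)
    history = []
    for _ in range(max_rounds):
        runtime = run_job(config)
        history.append((config, runtime))
        combiner.add_training_data(config, runtime)
        suggestion = dict(combiner.get_best_config())
        if suggestion == config:
            break  # the suggestion repeats the config that was just measured
        config = suggestion
    # Keep the fastest measured config, which is not always the last suggestion.
    best_config, best_runtime = min(history, key=lambda item: item[1])
    return best_config, best_runtime, history
```

Returning the fastest measured configuration rather than the last suggestion is a deliberate choice: in the run above, the second configuration measured 92 s while the later suggestion measured 105 s.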


```python
print(best)
```

```
{'theta': 7.230629701585487, 'beta': 412696.6304998827, 'alpha1': 998.0269377497508, 'gamma': 0.5985143784548778}
```
