# Performance Tuning - Understanding Parameter Impact

A modern IT system comes with hundreds of tunable parameters. Selecting the right configuration is crucial to extract all the available performance and to save costs.
When exploring huge search spaces, random search is surprisingly effective at finding good configurations.
We can thus replicate the environment, run a series of performance tests, and collect data to find good configurations.
Once the search is complete, we need to transfer the best configuration found to the production deployment.
However, modifying a parameter in production is risky, so we want to minimise the number of parameters to change while maintaining good performance.

## Data loading
Here we have an example dataset collected on a MongoDB instance: some MongoDB and Linux parameters were tuned while monitoring the MongoDB throughput.

- Each row represents an experiment (the first is the baseline, i.e. the vendor default configuration).
- The first column is the `experiment_id`.
- The second column is the target performance metric (throughput, to be maximized in this case).
- The other columns contain the applied parameters (some are categorical).

Let's open the dataset and have a look:

In [1]:
import pandas as pd
filename = 'polimi.csv'
df = pd.read_csv(filename)
print(df.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 90 entries, 0 to 89
Data columns (total 38 columns):
 #   Column                                                      Non-Null Count  Dtype  
---  ------                                                      --------------  -----  
 0   experiment_id                                               90 non-null     int64  
 1   throughput (ops/sec)                                        90 non-null     float64
 2   parameter-MongoDB.mongodb_cache_size (megabytes)            90 non-null     int64  
 3   parameter-MongoDB.mongodb_eviction_dirty_target (percent)   90 non-null     int64  
 4   parameter-MongoDB.mongodb_eviction_dirty_trigger (percent)  90 non-null     int64  
 5   parameter-MongoDB.mongodb_syncdelay (seconds)               90 non-null     int64  
 6   parameter-OS.os_CPUSchedAutogroupEnabled                    90 non-null     int64  
 7   parameter-OS.os_CPUSchedChildRunsFirst                      90 non-null     int64  
 8   pa

In [2]:
# first column is the id (useless), then we have the metric and all the
# parameters
y = df.iloc[:, 1]
X = df.iloc[:, 2:]

### Normalization
Our idea is to use a Random Forest (RF) as a regression model to make predictions about the performance of candidate configurations.
To build a better model, we normalise the target variable (throughput) with a common function in performance tuning: the Normalized Performance Improvement (NPI).
NPI is defined as the achieved performance improvement over the available performance improvement. The baseline configuration thus has an NPI of 0 and the best configuration an NPI of 1.
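To make the definition concrete, here is a minimal sketch on made-up numbers. The `npi` helper and the throughput values are hypothetical and are not taken from the dataset; the sketch assumes a metric to be maximized and that the first value is the baseline.

```python
def npi(values, baseline_index=0):
    """Normalized Performance Improvement for a metric to be maximized."""
    baseline = values[baseline_index]
    best = max(values)
    if best == baseline:
        raise ValueError("no configuration improves on the baseline")
    return [(v - baseline) / (best - baseline) for v in values]


# Hypothetical throughputs (ops/sec); the first entry is the baseline.
throughputs = [1000.0, 900.0, 1500.0, 1250.0]
scores = npi(throughputs)
print(scores)  # [0.0, -0.2, 1.0, 0.5]
```

Note that a configuration that performs worse than the baseline gets a negative NPI.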

In [3]:
# npi = (y - baseline) / (best - baseline)
# the baseline is the first line
y = (y - y[0]) / (max(y) - y[0])

%matplotlib notebook
import matplotlib.pyplot as plt
plt.plot(y)
plt.show()
# Clearly, we did not use a random search

### Feature preparation and model creation
Here is some boilerplate code for common preliminary steps:
- we split the dataset into training and test subsets
- we transform categorical features using a one-hot encoder.

At the end, we fit a Random Forest Regressor:

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestRegressor

# split the dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Other examples: https://scikit-learn.org/stable/auto_examples/ensemble/plot_gradient_boosting_categorical.html
categorical_columns = X.dtypes == 'object'
numerical_columns = X.dtypes != 'object'
# ignore categories that appear only in the test split instead of raising
categorical_encoder = OneHotEncoder(handle_unknown='ignore')
preprocessing = ColumnTransformer([('cat', categorical_encoder, categorical_columns)],
                                  remainder='passthrough')
# now we can fit the RF
rf = Pipeline([
    ('preprocess', preprocessing),
    ('regressor', RandomForestRegressor(random_state=42))
])
rf.fit(X_train, y_train)
# for a regressor, score() returns the coefficient of determination (R^2)
print("RF train R^2: %0.3f" % rf.score(X_train, y_train))
print("RF test R^2: %0.3f" % rf.score(X_test, y_test))

## Solution 1: using RF feature importance

In [None]:
import numpy as np
# boilerplate to recover the feature names after one-hot encoding:
# the encoded categorical columns come first, then the passthrough ones
ohe = (rf.named_steps['preprocess']
         .named_transformers_['cat'])
# get_feature_names_out replaces get_feature_names (removed in scikit-learn 1.2)
feature_names = ohe.get_feature_names_out(X.columns[categorical_columns])
feature_names = np.r_[feature_names, X.columns[numerical_columns]]

tree_feature_importances = (
    rf.named_steps['regressor'].feature_importances_)
sorted_idx = tree_feature_importances.argsort()

y_ticks = np.arange(0, len(feature_names))
fig, ax = plt.subplots()
ax.barh(y_ticks, tree_feature_importances[sorted_idx])
# set the tick positions before the labels so that they match
ax.set_yticks(y_ticks)
ax.set_yticklabels(feature_names[sorted_idx])
ax.set_title("Random Forest Feature Importances (MDI)")
fig.tight_layout()
plt.show()

## Solution 2: permutation importance

In [None]:
#https://scikit-learn.org/stable/auto_examples/inspection/plot_permutation_importance.html
from sklearn.inspection import permutation_importance

result = permutation_importance(rf, X_test, y_test, n_repeats=10,
                                random_state=42, n_jobs=-1)
sorted_idx = result.importances_mean.argsort()

fig, ax = plt.subplots()
ax.boxplot(result.importances[sorted_idx].T,
           vert=False, labels=X_test.columns[sorted_idx])
ax.set_title("Permutation Importances (test set)")
fig.tight_layout()
plt.show()

## Solution 3: ablation analysis
With solutions 1 and 2 we might select parameters which are important because they *degrade* performance!
What we are really interested in are parameters which *increase* performance.
We can measure this by changing one parameter at a time, starting from the baseline configuration, and predicting the score.

In [None]:
best_idx = np.argmax(y)
best_x = X.iloc[best_idx, :]
bsl_x = X.iloc[0, :]
xs = []
for param in X.columns:
    x = bsl_x.copy()
    x[param] = best_x[param]
    xs.append(x)

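# The baseline has an NPI of 0 by definition, so the prediction is the gain
# (rebuild a DataFrame so the pipeline sees the original column names and dtypes)
gains = rf.predict(pd.DataFrame(xs).astype(X.dtypes))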
sorted_idx = gains.argsort()
y_ticks = np.arange(0, len(X.columns))
fig, ax = plt.subplots()
ax.barh(y_ticks, gains[sorted_idx])
# set the tick positions before the labels so that they match
ax.set_yticks(y_ticks)
ax.set_yticklabels(X.columns[sorted_idx])
ax.set_title("Gain from baseline to best")
fig.tight_layout()
plt.show()

## Solution 4: combined ablation
The previous solution does not consider the *combined* effect of multiple parameters.
To do that, we can change one parameter at a time, without restarting from the baseline.
We can start from the ordering we just computed, check whether the combined effect gives a new ordering, and iterate.
We have no guarantee of convergence to any ordering, let alone the optimal one.
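A toy example may help to see why the combined effect matters and why the ordering changes the result. The sketch below uses a hypothetical `predict` function in place of the fitted model, with made-up parameter names and scores expressed in NPI percentage points; here the larger cache only pays off once the scheduler option is switched on.

```python
def predict(config):
    """Hypothetical stand-in for rf.predict, scored in NPI percentage points."""
    score = 0
    if config["sched"] == "on":
        score += 30
        # the larger cache only pays off once the scheduler option is on
        if config["cache"] == "large":
            score += 60
    if config["sync"] == "slow":
        score += 10
    return score


baseline = {"cache": "small", "sched": "off", "sync": "fast"}
best = {"cache": "large", "sched": "on", "sync": "slow"}

# Solution 3 style: change one parameter at a time, always from the baseline
single = {p: predict({**baseline, p: best[p]}) - predict(baseline)
          for p in baseline}


def cumulative_gains(order):
    """Apply the best values one after another and record each step's gain."""
    config = dict(baseline)
    previous = predict(config)
    gains = {}
    for param in order:
        config[param] = best[param]
        current = predict(config)
        gains[param] = current - previous
        previous = current
    return gains


print(single)                                        # {'cache': 0, 'sched': 30, 'sync': 10}
print(cumulative_gains(["sched", "cache", "sync"]))  # {'sched': 30, 'cache': 60, 'sync': 10}
print(cumulative_gains(["cache", "sched", "sync"]))  # {'cache': 0, 'sched': 90, 'sync': 10}
```

Changed alone, `cache` looks useless, yet it accounts for most of the gain once `sched` is applied first. The gain attributed to each parameter depends on the order, which is why we iterate on the ordering.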

In [None]:
new_ordering = True
sorted_cols = X.columns
iters = 0
while new_ordering and iters < 100:
    new_ordering = False
    iters += 1
    # we have a new param ordering, build the points
    xs = []
    x = bsl_x.copy()
    sorted_cols = sorted_cols[sorted_idx[::-1]]
    for param in sorted_cols:
        x = x.copy()
        x[param] = best_x[param]
        xs.append(x)

    # This time, to compute the gain we need to compute the difference for the
    # single parameter (predict on a DataFrame with the original dtypes)
    predictions = rf.predict(pd.DataFrame(xs).astype(X.dtypes))
    gains = np.ediff1d(predictions, to_begin=predictions[0])
    old_idx = sorted_idx
    sorted_idx = gains.argsort()
    if any(old_idx != sorted_idx):
        new_ordering = True

# Use the final ordering, do not recompute the gains
xs = []
x = bsl_x.copy()
sorted_cols = sorted_cols[sorted_idx[::-1]]
for param in sorted_cols:
    x = x.copy()
    x[param] = best_x[param]
    xs.append(x)
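# predict on a DataFrame so the pipeline sees the original column names and dtypes
predictions = rf.predict(pd.DataFrame(xs).astype(X.dtypes))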

y_ticks = np.arange(0, len(X.columns))
fig, ax = plt.subplots()
ax.barh(y_ticks, gains[sorted_idx], color='blue', height=0.4)
ax.barh(y_ticks+0.4, predictions[::-1], color='red', height=0.4)
# set the tick positions before the labels so that they match
ax.set_yticks(y_ticks)
ax.set_yticklabels(sorted_cols[::-1])
ax.set_title("Combined gain from baseline to best. Single parameter contribution in blue, overall in red")
fig.tight_layout()
plt.show()
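
Coming back to the initial goal of changing as few parameters as possible, one possible way to use such an ordering is to keep the shortest prefix of parameters whose cumulative predicted score reaches a chosen fraction of the final one. The sketch below is only an illustration: the `smallest_prefix` helper, the 0.9 threshold, the parameter names and the scores are all hypothetical.

```python
def smallest_prefix(ordered_params, cumulative_scores, target_fraction=0.9):
    """Shortest prefix whose cumulative score reaches the target fraction."""
    if len(ordered_params) != len(cumulative_scores):
        raise ValueError("one cumulative score is needed per parameter")
    target = target_fraction * cumulative_scores[-1]
    for count, score in enumerate(cumulative_scores, start=1):
        if score >= target:
            return list(ordered_params[:count])
    return list(ordered_params)


# Hypothetical ordering and cumulative predicted NPI after each change
params = ["cache_size", "syncdelay", "sched_autogroup", "dirty_target"]
cumulative = [0.55, 0.80, 0.93, 0.95]
selected = smallest_prefix(params, cumulative)
print(selected)  # ['cache_size', 'syncdelay', 'sched_autogroup']
```

With the variables of the previous cell, a call such as `smallest_prefix(list(sorted_cols), list(predictions))` would play the same role, assuming the predicted scores are trusted enough to make that decision.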