# grid_search_log_explorer

To iterate quickly on grid search parameter ranges, this notebook provides an interactive way to explore the log files generated by the `grid_search.ipynb` notebook. The available settings are explained in the following sections. The same log evaluation structure is also used in the `grid_search_thesis_plots.ipynb` notebook, which generates the plots in chapter 4 of the thesis.

---
## Selection of algorithm, target metric, and log file
To start, one of the four algorithms is selected. The target metric must also be provided, together with whether a high or a low value is desired. Any metric returned by the pipeline functions can be used as the evaluation target. Refer to the individual notebooks for an exhaustive list. Once the log file path is set, the data can be loaded and the exploration can start.


In [None]:
import pandas as pd

# select pipeline: uncomment one of the following blocks

algorithm = 'coastdown'
target_metric = 'loss_low_suspension' #"loss_very_low_suspension", "loss_low_suspension", "loss_medium_suspension", "loss_high_suspension", "loss_CoC_values", "loss_EPA_values", 'elapsed_time'
higher_is_better = False

# algorithm = 'constspeed'
# target_metric = 'loss_low_suspension' #"loss_very_low_suspension", "loss_low_suspension", "loss_medium_suspension", "loss_high_suspension", "loss_CoC_values", "loss_EPA_values"
# higher_is_better = False

# algorithm = 'efficiencymap'
# target_metric = 'rmse' # 'elapsed_time', 'num_total_data_points', 'num_valid_data_points', 'num_removed', 'num_removed_neg_torque', 'num_removed_soc', 'rmse_diff', 'mean_abs_diff'
# higher_is_better = False
# select_gear = '1' # None, '1', '2', 'vw'

# algorithm = 'gearstrategy'
# target_metric = 'prmse_global' # hd_global, dtw_global, cd_global, fd_global, prmse_global, fd_normal, fd_sport, etc., fd_rmt_global
# higher_is_better = False


# load log
if algorithm == 'coastdown':
    log_path = 'data/logs/coastdown_log.csv'
    select_gear = None
elif algorithm == 'constspeed':
    log_path = 'data/logs/constspeed_log.csv'
    select_gear = None
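elif algorithm == 'efficiencymap':
    # select_gear is set in the efficiencymap configuration block above
    log_path = 'data/logs/efficiencymap_log.csv'
elif algorithm == 'gearstrategy':
    log_path = 'data/logs/gear_strategy_log.csv'
    select_gear = None
else:
    raise ValueError(f'Unknown algorithm: {algorithm}')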

results_df = pd.read_csv(log_path)

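# fix windows file paths
# regex=False matches the backslashes literally, independent of the pandas version
if 'files' in results_df.columns:
    results_df['files'] = results_df['files'].str.replace('\\\\', '/', regex=False)
    results_df['files'] = results_df['files'].str.replace('\\', '/', regex=False)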
if 'gear' in results_df.columns:
    results_df['gear'] = results_df['gear'].astype(str)

print(f'Total runs: {len(results_df)}')
results_df.value_counts('comment')

---
## Results space filtering

At this point, all grid search runs present in the log file are loaded. A specific run can be selected by entering its comment in `specific_comment`. If a number of the most recent log entries should be selected instead, set `number_of_recents` to something other than None. If all entries from the log file should be selected, set `select_all` to True. The options are evaluated in this order, so `specific_comment` takes precedence over `number_of_recents`, which takes precedence over `select_all`. If none of them is set, the most recent run is selected automatically.

Additionally, a `limit` can be supplied to filter out runs that performed extremely poorly, so that they do not skew the subsequent analysis. When a low value is desired, the limit should be at least 10 times the expected value of the target metric to avoid filtering out good runs.
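
The order in which these options take effect can be sketched on a toy log. The helper `select_runs`, the column `loss`, and the example data are hypothetical and only mirror the logic of the filtering cell; they are not part of the notebook's modules.

```python
import pandas as pd

# Toy log with three grid search runs, identified by their comment (hypothetical data).
log = pd.DataFrame({
    'comment': ['run_a', 'run_a', 'run_b', 'run_b', 'run_c', 'run_c'],
    'loss': [12.0, 950.0, 8.0, 9.5, 7.0, 300.0],
})


def select_runs(df, metric, limit=None, higher_is_better=False,
                specific_comment=None, number_of_recents=None, select_all=False):
    """Apply the limit first, then exactly one of the selection modes."""
    if limit is not None:
        keep = (df[metric] >= limit) if higher_is_better else (df[metric] <= limit)
        df = df[keep]
    if specific_comment is not None:
        return df[df['comment'] == specific_comment]
    if number_of_recents is not None:
        return df.iloc[-number_of_recents:]  # last N rows, not last N comments
    if select_all:
        return df
    # Default: all rows that share the comment of the last remaining row
    return df[df['comment'] == df['comment'].iloc[-1]]


default_selection = select_runs(log, 'loss', limit=200)
print(default_selection)
```

Because the limit is applied before the selection, an outlier such as the `loss` of 950 never reaches the later analysis, even when `select_all` is used.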

In [None]:
# Filter options - set to None to automatically select the most recent run
specific_comment = None
number_of_recents = None

select_all = False
limit = 200


# Filtering logic
if limit is not None:
    if higher_is_better:
        results_df = results_df[results_df[target_metric] >= limit]
    else:
        results_df = results_df[results_df[target_metric] <= limit]
if select_gear is not None:
    results_df = results_df[results_df['gear'] == select_gear]
if specific_comment is not None:
    results_df = results_df[results_df['comment'] == specific_comment]
elif number_of_recents is not None:
    results_df = results_df.iloc[-number_of_recents:]
elif select_all:
    pass
else:
    results_df = results_df[results_df['comment'] == results_df['comment'].iloc[-1]]

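# several comments can be present when number_of_recents or select_all is used
selected_comments = results_df['comment'].unique().tolist()
print(f'Selected {len(results_df)} runs with comment(s): {selected_comments}')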

---
## First look at the data

This function scans the results space to automatically detect which parameters were varied in the grid search. It also extracts the run with the best target metric value (lowest or highest, depending on `higher_is_better`), which can later be used to run the pipeline with these parameters. A histogram of the target metric values is also shown to give a first impression of the results space.

The `varied_parameters` list contains all the design parameters that will be used in the subsequent analysis. If any parameters should be excluded, they can be removed from this list here.
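
The internals of `evaluate_results` are not shown in this notebook. As a rough illustration of the idea only, the following sketch detects varied parameters by counting distinct values and picks the best row. The column names and data are hypothetical, and the real function may work differently, for example in how it separates parameter columns from metric columns.

```python
import pandas as pd

# Hypothetical results table: two parameters were varied, one was held constant.
results = pd.DataFrame({
    'speed_threshold': [1, 1, 2, 2],
    'smoothing': ['none', 'savgol', 'none', 'savgol'],
    'bucket_size': [10, 10, 10, 10],
    'rmse': [0.42, 0.35, 0.51, 0.38],
})
target = 'rmse'
higher_is_better = False

# A parameter counts as varied if it takes more than one distinct value.
candidates = [c for c in results.columns if c != target]
varied = [c for c in candidates if results[c].nunique() > 1]

# Pick the row with the best target value, respecting the optimisation direction.
best_idx = results[target].idxmax() if higher_is_better else results[target].idxmin()
best = results.loc[best_idx, varied].to_dict()

print(varied)
print(best)
```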

In [None]:
from modules.parametric_pipelines import evaluate_results

varied_parameters, best_combination = evaluate_results(results_df, target_metric, algorithm, higher_is_better)

# remove non-parameters
varied_parameters = [param for param in varied_parameters if param not in ('gear', 'files', 'columns_to_smooth')]

---
## Statistical significance testing

To test whether the design parameters (independent variables, IVs) have a significant impact on the target metric (dependent variable, DV), an ordinary least squares (OLS) model is fitted and a Type-II ANOVA is performed on it.

In [None]:
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Distinguish categorical vs numeric parameters
categorical_params = [p for p in varied_parameters if results_df[p].dtype == object or results_df[p].dtype == bool]
numeric_params = [p for p in varied_parameters if p not in categorical_params]

# Construct the formula for the linear model
# C() indicates categorical variables. Numeric variables are included as-is.
formula_terms = []
for p in numeric_params:
    formula_terms.append(f"{p}")  # numeric variables included directly
for p in categorical_params:
    formula_terms.append(f"C({p})")  # categorical variables indicated with C()
formula = f"{target_metric} ~ " + " + ".join(formula_terms)

print("Fitting model with formula:")
print(formula)

# Fit the model using OLS
model = smf.ols(formula=formula, data=results_df).fit()

# Perform ANOVA on the fitted model
anova_results = sm.stats.anova_lm(model, typ=2)  # Type-II ANOVA

print("ANOVA results:")
print(anova_results)
from modules.plotter import plot_anova_bar
plot_anova_bar(anova_results, color='#0065bd', name='testing', figsize=(10,3))

---
## Pairplot investigation

To identify parameter interactions, a pairplot is generated. The list `params_to_plot` can be adjusted here. By default, it uses `varied_parameters` from the first look at the data. Depending on the length of the list, different appropriate plots are generated: a distribution plot for one parameter, a 3D plot for two parameters, and heatmaps for more than two. In the last case, an additional sensitivity analysis plot is generated. It makes use of the random forest's ability to provide feature importances. After a random forest has been trained to regress the target metric, its feature importances are used to plot the sensitivity of the target metric to the different parameters.
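
The idea behind the sensitivity analysis can be sketched as follows. This is a minimal example that assumes scikit-learn's `RandomForestRegressor`. The implementation inside `plot_parameter_importance_analysis` is not shown here and may differ. The parameters `a`, `b`, and `c` and the data are synthetic.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

# Synthetic results (hypothetical parameters): the target depends strongly on 'a',
# weakly on 'b', and not at all on 'c'.
n = 300
X = pd.DataFrame({
    'a': rng.uniform(0, 1, n),
    'b': rng.uniform(0, 1, n),
    'c': rng.uniform(0, 1, n),
})
y = 10.0 * X['a'] + 1.0 * X['b'] + rng.normal(0.0, 0.1, n)

# Train the forest to regress the target metric from the parameters.
forest = RandomForestRegressor(n_estimators=200, random_state=0)
forest.fit(X, y)

# Impurity-based importances sum to 1; a larger value means the forest
# relied more on that parameter to predict the target.
importances = pd.Series(forest.feature_importances_, index=X.columns)
importances = importances.sort_values(ascending=False)
print(importances)
```

Categorical parameters would need to be encoded numerically before they can be passed to the forest in this form.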

In [None]:
from modules.plotter import plot_distribution_against_hyperparameter, threeDplot, plot_hyperparameter_heatmaps, plot_parameter_importance_analysis

# Plot the results
params_to_plot = [param for param in varied_parameters if param != 'files']
#params_to_plot = ['soc_limit_lower', 'soc_limit_upper']

if len(params_to_plot) == 1:
    plot_distribution_against_hyperparameter(results_df, params_to_plot, target_metric, figsize=(12, 8))
elif len(params_to_plot) == 2:
    threeDplot(results_df, params_to_plot, target_metric, figsize=(800,800), higher_is_better=higher_is_better)
else:
    plot_hyperparameter_heatmaps(results_df, params_to_plot, target_metric, figsize=(15,15), higher_is_better=higher_is_better, use_identical_scale=False, aggregation='mean')
    plot_parameter_importance_analysis(results_df, params_to_plot, target_metric, figsize=(12, 6))

---
## Individual parameter insights

This widget allows scrolling through the list of parameters. For the selected hyperparameter, the results are grouped by its value and shown as box plots. In addition, it prints which distribution has the lowest median target metric value, the lowest variance, and the lowest interquartile range.
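
The three printed statistics can be reproduced with a pandas group-by, as in the following sketch. The parameter `window`, the metric `loss`, and the values are hypothetical, and the plotting function itself may compute them differently.

```python
import pandas as pd

# Hypothetical results: one hyperparameter with three tested values.
results = pd.DataFrame({
    'window': [5, 5, 5, 5, 10, 10, 10, 10, 20, 20, 20, 20],
    'loss': [4.0, 5.0, 6.0, 9.0, 3.0, 3.5, 4.0, 4.5, 1.0, 2.0, 3.0, 12.0],
})

# One distribution of the target metric per tested parameter value.
grouped = results.groupby('window')['loss']
summary = pd.DataFrame({
    'median': grouped.median(),
    'variance': grouped.var(),
    'iqr': grouped.quantile(0.75) - grouped.quantile(0.25),
})

print(summary)
print('lowest median:  ', summary['median'].idxmin())
print('lowest variance:', summary['variance'].idxmin())
print('lowest IQR:     ', summary['iqr'].idxmin())
```

In this toy data, the value with the lowest median is not the one with the lowest spread, which is why the three statistics are reported separately.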

In [None]:
import ipywidgets as widgets
from IPython.display import display

# optionally enforce axis limits
axis_limits = {
    'y': [0, 50]
} 

from modules.plotter import plot_distribution_against_hyperparameter
# Define the interactive widget function
def interactive_plot(param_index):
    param = varied_parameters[param_index]
    param = [param]
    if axis_limits is not None:
        plot_distribution_against_hyperparameter(results_df, param, target_metric, figsize=(8, 6), axis_limits=axis_limits)
    else:
        plot_distribution_against_hyperparameter(results_df, param, target_metric, figsize=(8, 6))

# Create the slider
param_slider = widgets.IntSlider(value=0, min=0, max=len(varied_parameters)-1, step=1, description='Parameter')
widgets.interactive(interactive_plot, param_index=param_slider)

---
## Run pipeline with "best" parameters

Here the pipeline is run with the parameters that achieved the "best" target metric. This has to be used with caution, as the run with the lowest target metric value is not necessarily the best choice. It mainly serves as a quick check of whether the pipeline is prone to overfitting.

In [None]:
from modules.parametric_pipelines import run_pipeline_with_best_parameters
run_pipeline_with_best_parameters(best_combination, algorithm, generate_plots=True, verbose=False)