# Tutorial: Evaluating Model Using Steric Descriptors

Author: Hideya Tanaka

Reviewer: Tomoyuki Miyao

---

This tutorial evaluates the univariate models built from the steric descriptors generated in `Tutorial 1`, using the following module:

- **univariate_linear_regression_loocv.py**

### Table of Contents

1. **Analysis of the Dataset Containing Primary and Secondary Alcohols**
2. **Analysis of the Dataset Containing Only Secondary Alcohols**

---

## 1. Analysis of the Dataset Containing Primary and Secondary Alcohols

We use the final CSV file obtained in `Tutorial 1`, `steric-descriptors-alcohol/data/output_data/morfeus_qo_centerC_zaxisO_xzplaneH(C).csv`, as the dataset for this analysis. With this dataset and the module, we build one univariate linear regression model per steric descriptor and evaluate each descriptor for the target reaction by leave-one-out cross-validation (LOOCV), using three metrics: q², MAE, and RMSE. The target variables are the conversions after 10 minutes in the oxidation of alcohols, without added water (`conv_woH2O_10min`) and with added water (`conv_wH2O_10min`).

The `run_univariate_linear_regression_loocv_evaluation` function from `univariate_linear_regression_loocv.py` saves all metric values (q², MAE, RMSE) to a CSV file, but it generates PNG images only for the metric given in the `eval_metric` argument. To obtain PNG images for every metric, call the function once per metric, setting `eval_metric` to q², MAE, and RMSE in turn. In addition, passing groups in the `search_keywords_list` argument identifies the best-performing descriptor within each group and generates a corresponding PNG image for each group.

For usage details, please refer to the module.
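
As a rough illustration of what such an evaluation involves, the sketch below computes LOOCV-based q², MAE, and RMSE for a single descriptor in plain Python. It is not the module's implementation: the function name `loocv_univariate_metrics`, the definition q² = 1 - PRESS / SS_tot (with SS_tot taken about the mean of all observed values), and the toy numbers are assumptions made for this example.

```python
import math


def loocv_univariate_metrics(x, y):
    """Leave-one-out CV of the model y = slope * x + intercept.

    Hypothetical helper for illustration; returns q2, MAE, and RMSE.
    x and y are plain Python lists of equal length.
    """
    n = len(x)
    predictions = []
    for i in range(n):
        # Hold out sample i and fit ordinary least squares on the rest
        x_train = x[:i] + x[i + 1:]
        y_train = y[:i] + y[i + 1:]
        x_mean = sum(x_train) / len(x_train)
        y_mean = sum(y_train) / len(y_train)
        sxx = sum((xi - x_mean) ** 2 for xi in x_train)
        sxy = sum((xi - x_mean) * (yi - y_mean)
                  for xi, yi in zip(x_train, y_train))
        # A constant descriptor carries no information: predict the mean
        slope = sxy / sxx if sxx > 0 else 0.0
        intercept = y_mean - slope * x_mean
        predictions.append(slope * x[i] + intercept)

    residuals = [yi - pi for yi, pi in zip(y, predictions)]
    press = sum(r ** 2 for r in residuals)  # predictive residual sum of squares
    y_all_mean = sum(y) / n
    ss_tot = sum((yi - y_all_mean) ** 2 for yi in y)
    return {
        'q2': 1.0 - press / ss_tot,  # assumed definition of q2
        'MAE': sum(abs(r) for r in residuals) / n,
        'RMSE': math.sqrt(press / n),
    }


# Toy data (made up): one descriptor and a conversion-like target
descriptor = [1.0, 2.0, 3.0, 4.0, 5.0]
conversion = [92.0, 80.0, 71.0, 58.0, 50.0]
print(loocv_univariate_metrics(descriptor, conversion))
```

Under this definition, q² can become negative when the held-out predictions are worse than simply predicting the overall mean, which is one way to spot uninformative descriptors.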

1. **Target Variable: Conversion without Water (AllDataset)**

In [None]:
import os
import sys

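# Directory this notebook runs in (current working directory)
fd = os.getcwd()
# Parent directory (project root), added to sys.path so that `src` can be imported
parent_dir = os.path.dirname(fd)
sys.path.append(parent_dir)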

from src.univariate_linear_regression_loocv import run_univariate_linear_regression_loocv_evaluation

dataset_file_path_input = f'{parent_dir}/data/output_data/morfeus_qo_centerC_zaxisO_xzplaneH(C).csv'
target_column = 'conv_woH2O_10min'
exclude_columns = [target_column, 'smiles', 'confid', 'total_energy_xTB', 'Zero_point_correction', 'Thermal_correction_to_Energy',
                    'Thermal_correction_to_Enthalpy', 'Thermal_correction_to_Gibbs_Free_Energy',
                    'Sum_of_electronic_and_zero_point_Energies', 'Sum_of_electronic_and_thermal_Energies',
                    'Sum_of_electronic_and_thermal_Enthalpies', 'Sum_of_electronic_and_thermal_Free_Energies',
                    'HOMO', 'LUMO', 'filepath', 'HOMO_eV', 'LUMO_eV', 'chemical_potential_eV', 'chemical_hardness_eV',
                    'GEI_eV', 'conv_wH2O_10min']
n_jobs = -1
max_features_per_plot = 53
fig_width = 9

In [None]:
keyword = 'results_LR_LOOCV_woH2O_q2_AllDataset'
eval_metric = 'q2'
sort_flag = False
search_keywords_list = []

run_univariate_linear_regression_loocv_evaluation(fd, dataset_file_path_input, keyword, target_column, exclude_columns, eval_metric, sort_flag, n_jobs, max_features_per_plot, fig_width, search_keywords_list)
print('Finish')

In [None]:
keyword = 'LinearRegression_LOOCV_woH2O_q2_AllDataset'
eval_metric = 'q2'
sort_flag = True
search_keywords_list = ['center_C', 'center_O', 'center_H(O)', 'center_H(C)', 'quadrant', 'octant']

run_univariate_linear_regression_loocv_evaluation(fd, dataset_file_path_input, keyword, target_column, exclude_columns, eval_metric, sort_flag, n_jobs, max_features_per_plot, fig_width, search_keywords_list)
print('Finish')
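
The grouping by `search_keywords_list` can be pictured as follows. This is a minimal sketch that assumes a descriptor belongs to a group when its column name contains the keyword; how the module actually matches names is not shown in this tutorial. The helper `best_descriptor_per_group`, the descriptor names, and the scores are hypothetical.

```python
def best_descriptor_per_group(scores, keywords, higher_is_better=True):
    """Return {keyword: (descriptor, score)} for the best descriptor per group.

    Assumes group membership is a plain substring match on the column name.
    """
    pick = max if higher_is_better else min
    best = {}
    for keyword in keywords:
        members = {name: s for name, s in scores.items() if keyword in name}
        if members:  # skip keywords that match no descriptor
            name = pick(members, key=members.get)
            best[keyword] = (name, members[name])
    return best


# Made-up q2 values for made-up descriptor names
q2_scores = {
    'Vbur_center_C_r3.5': 0.62,
    'Vbur_center_C_r5.0': 0.71,
    'Vbur_center_O_r3.5': 0.48,
    'quadrant_NE_center_O': 0.55,
}
print(best_descriptor_per_group(q2_scores, ['center_C', 'center_O', 'octant']))
```

For error metrics such as MAE or RMSE, where smaller is better, pass `higher_is_better=False`.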

In [None]:
keyword = 'LinearRegression_LOOCV_woH2O_MAE_AllDataset'
eval_metric = 'MAE'
sort_flag = True
search_keywords_list = ['center_C', 'center_O', 'center_H(O)', 'center_H(C)', 'quadrant', 'octant']

run_univariate_linear_regression_loocv_evaluation(fd, dataset_file_path_input, keyword, target_column, exclude_columns, eval_metric, sort_flag, n_jobs, max_features_per_plot, fig_width, search_keywords_list)
print('Finish')
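
Since PNG images are produced only for the metric named in `eval_metric`, covering all three metrics means repeating the call with a different `eval_metric` and `keyword` each time. The sketch below builds those pairs in a loop. The helper `build_metric_runs` is hypothetical, the `keyword` pattern simply mirrors the cells above, and the string `'RMSE'` is assumed to be the accepted spelling for that metric; the call to the module function is left as a comment because it needs the variables defined earlier in this notebook.

```python
def build_metric_runs(condition, dataset_label, metrics=('q2', 'MAE', 'RMSE')):
    """Return one (keyword, eval_metric) pair per metric (hypothetical helper)."""
    return [
        (f'LinearRegression_LOOCV_{condition}_{metric}_{dataset_label}', metric)
        for metric in metrics
    ]


for keyword, eval_metric in build_metric_runs('woH2O', 'AllDataset'):
    print(keyword, eval_metric)
    # In the notebook, each pair would be passed on like this:
    # run_univariate_linear_regression_loocv_evaluation(
    #     fd, dataset_file_path_input, keyword, target_column, exclude_columns,
    #     eval_metric, sort_flag, n_jobs, max_features_per_plot, fig_width,
    #     search_keywords_list)
```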

2. **Target Variable: Conversion with Water (AllDataset)**

In [None]:
target_column = 'conv_wH2O_10min'
exclude_columns = [target_column, 'smiles', 'confid', 'total_energy_xTB', 'Zero_point_correction', 'Thermal_correction_to_Energy',
                    'Thermal_correction_to_Enthalpy', 'Thermal_correction_to_Gibbs_Free_Energy',
                    'Sum_of_electronic_and_zero_point_Energies', 'Sum_of_electronic_and_thermal_Energies',
                    'Sum_of_electronic_and_thermal_Enthalpies', 'Sum_of_electronic_and_thermal_Free_Energies',
                    'HOMO', 'LUMO', 'filepath', 'HOMO_eV', 'LUMO_eV', 'chemical_potential_eV', 'chemical_hardness_eV',
                    'GEI_eV', 'conv_woH2O_10min']

In [None]:
keyword = 'results_LR_LOOCV_wH2O_q2_AllDataset'
eval_metric = 'q2'
sort_flag = False
search_keywords_list = []

run_univariate_linear_regression_loocv_evaluation(fd, dataset_file_path_input, keyword, target_column, exclude_columns, eval_metric, sort_flag, n_jobs, max_features_per_plot, fig_width, search_keywords_list)
print('Finish')

In [None]:
keyword = 'LinearRegression_LOOCV_wH2O_q2_AllDataset'
eval_metric = 'q2'
sort_flag = True
search_keywords_list = ['center_C', 'center_O', 'center_H(O)', 'center_H(C)', 'quadrant', 'octant']

run_univariate_linear_regression_loocv_evaluation(fd, dataset_file_path_input, keyword, target_column, exclude_columns, eval_metric, sort_flag, n_jobs, max_features_per_plot, fig_width, search_keywords_list)
print('Finish')

In [None]:
keyword = 'LinearRegression_LOOCV_wH2O_MAE_AllDataset'
eval_metric = 'MAE'
sort_flag = True
search_keywords_list = ['center_C', 'center_O', 'center_H(O)', 'center_H(C)', 'quadrant', 'octant']

run_univariate_linear_regression_loocv_evaluation(fd, dataset_file_path_input, keyword, target_column, exclude_columns, eval_metric, sort_flag, n_jobs, max_features_per_plot, fig_width, search_keywords_list)
print('Finish')

## 2. Analysis of the Dataset Containing Only Secondary Alcohols

Secondary alcohols generally undergo oxidation more slowly than primary alcohols because of their greater steric hindrance. We therefore repeated the analysis on a dataset consisting exclusively of secondary alcohols, in which the effect of steric hindrance on the reaction rate is more pronounced, to examine the performance of the descriptors in greater detail.

In [None]:
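import pandas as pd
# index_col=0 makes the first CSV column the row index, so drop() below works on index labels
df = pd.read_csv(f'{parent_dir}/data/output_data/morfeus_qo_centerC_zaxisO_xzplaneH(C).csv', index_col=0)
# Remove the rows labelled 0, 1, 2, and 9 so that only secondary alcohols remain
filtered_df = df.drop([0, 1, 2, 9])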

print(f'df.shape: {df.shape}')
print(f'filtered_df.shape: {filtered_df.shape}')

# Save the secondary-alcohol subset in the working directory and use it as the input from here on
dataset_file_path_input = f'{fd}/morfeus_qo_centerC_zaxisO_xzplaneH(C)_SecondaryDataset.csv'
filtered_df.to_csv(dataset_file_path_input)
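
Because the CSV is read with `index_col=0`, `df.drop([...])` removes rows by index label rather than by position. The sketch below shows this on a small in-memory table; the SMILES strings and conversion values are invented for illustration, and the membership check before dropping is an optional safeguard, not part of the tutorial's workflow.

```python
import io

import pandas as pd

# Invented stand-in for the descriptor CSV: the first column is the row index
csv_text = """,smiles,conv_woH2O_10min
0,CCO,90.0
1,CCCO,85.0
2,CC(C)O,40.0
3,CCC(C)O,35.0
"""
df = pd.read_csv(io.StringIO(csv_text), index_col=0)

# Index labels of the primary alcohols in this toy table
primary_labels = [0, 1]

# drop() raises KeyError on unknown labels; checking first gives a clearer message
missing = [label for label in primary_labels if label not in df.index]
if missing:
    raise ValueError(f'Index labels not found: {missing}')

filtered_df = df.drop(primary_labels)
print(f'df.shape: {df.shape}')
print(f'filtered_df.shape: {filtered_df.shape}')
```

Comparing the two shapes, as the cell above does, is a quick check that exactly the intended number of rows was removed.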

1. **Target Variable: Conversion without Water (SecondaryDataset)**

In [None]:
target_column = 'conv_woH2O_10min'
exclude_columns = [target_column, 'smiles', 'confid', 'total_energy_xTB', 'Zero_point_correction', 'Thermal_correction_to_Energy',
                    'Thermal_correction_to_Enthalpy', 'Thermal_correction_to_Gibbs_Free_Energy',
                    'Sum_of_electronic_and_zero_point_Energies', 'Sum_of_electronic_and_thermal_Energies',
                    'Sum_of_electronic_and_thermal_Enthalpies', 'Sum_of_electronic_and_thermal_Free_Energies',
                    'HOMO', 'LUMO', 'filepath', 'HOMO_eV', 'LUMO_eV', 'chemical_potential_eV', 'chemical_hardness_eV',
                    'GEI_eV', 'conv_wH2O_10min']

In [None]:
keyword = 'results_LR_LOOCV_woH2O_q2_SecondaryDataset'
eval_metric = 'q2'
sort_flag = False
search_keywords_list = []

run_univariate_linear_regression_loocv_evaluation(fd, dataset_file_path_input, keyword, target_column, exclude_columns, eval_metric, sort_flag, n_jobs, max_features_per_plot, fig_width, search_keywords_list)
print('Finish')

In [None]:
keyword = 'LinearRegression_LOOCV_woH2O_q2_SecondaryDataset'
eval_metric = 'q2'
sort_flag = True
search_keywords_list = ['center_C', 'center_O', 'center_H(O)', 'center_H(C)', 'quadrant', 'octant']

run_univariate_linear_regression_loocv_evaluation(fd, dataset_file_path_input, keyword, target_column, exclude_columns, eval_metric, sort_flag, n_jobs, max_features_per_plot, fig_width, search_keywords_list)
print('Finish')

In [None]:
keyword = 'LinearRegression_LOOCV_woH2O_MAE_SecondaryDataset'
eval_metric = 'MAE'
sort_flag = True
search_keywords_list = ['center_C', 'center_O', 'center_H(O)', 'center_H(C)', 'quadrant', 'octant']

run_univariate_linear_regression_loocv_evaluation(fd, dataset_file_path_input, keyword, target_column, exclude_columns, eval_metric, sort_flag, n_jobs, max_features_per_plot, fig_width, search_keywords_list)
print('Finish')

2. **Target Variable: Conversion with Water (SecondaryDataset)**

In [None]:
target_column = 'conv_wH2O_10min'
exclude_columns = [target_column, 'smiles', 'confid', 'total_energy_xTB', 'Zero_point_correction', 'Thermal_correction_to_Energy',
                    'Thermal_correction_to_Enthalpy', 'Thermal_correction_to_Gibbs_Free_Energy',
                    'Sum_of_electronic_and_zero_point_Energies', 'Sum_of_electronic_and_thermal_Energies',
                    'Sum_of_electronic_and_thermal_Enthalpies', 'Sum_of_electronic_and_thermal_Free_Energies',
                    'HOMO', 'LUMO', 'filepath', 'HOMO_eV', 'LUMO_eV', 'chemical_potential_eV', 'chemical_hardness_eV',
                    'GEI_eV', 'conv_woH2O_10min']

In [None]:
keyword = 'results_LR_LOOCV_wH2O_q2_SecondaryDataset'
eval_metric = 'q2'
sort_flag = False
search_keywords_list = []

run_univariate_linear_regression_loocv_evaluation(fd, dataset_file_path_input, keyword, target_column, exclude_columns, eval_metric, sort_flag, n_jobs, max_features_per_plot, fig_width, search_keywords_list)
print('Finish')

In [None]:
keyword = 'LinearRegression_LOOCV_wH2O_q2_SecondaryDataset'
eval_metric = 'q2'
sort_flag = True
search_keywords_list = ['center_C', 'center_O', 'center_H(O)', 'center_H(C)', 'quadrant', 'octant']

run_univariate_linear_regression_loocv_evaluation(fd, dataset_file_path_input, keyword, target_column, exclude_columns, eval_metric, sort_flag, n_jobs, max_features_per_plot, fig_width, search_keywords_list)
print('Finish')

In [None]:
keyword = 'LinearRegression_LOOCV_wH2O_MAE_SecondaryDataset'
eval_metric = 'MAE'
sort_flag = True
search_keywords_list = ['center_C', 'center_O', 'center_H(O)', 'center_H(C)', 'quadrant', 'octant']

run_univariate_linear_regression_loocv_evaluation(fd, dataset_file_path_input, keyword, target_column, exclude_columns, eval_metric, sort_flag, n_jobs, max_features_per_plot, fig_width, search_keywords_list)
print('Finish')