# Preprocessing

This notebook quantifies the categorical data. First, the categorical entries are mapped to numerical values. Then the reproducibility scores are calculated for the classes of variables and for the reproducibility dimensions according to Fig. 3 in the paper. The dimensions are:

- **D1: Data**
- **D2: Methodology**
- **D3: Experiment**

The function $v_j(s)$ is defined as:

$
v_j(s) =
\begin{cases}
1, & \text{if study } s \text{ reports variable } j \\
0, & \text{otherwise, including "No Information"} \\
\end{cases}
$

The dimensions $D_{1}$ (Data), $D_{2}$ (Methodology), and $D_{3}$ (Experiment) are each treated as a set of variables and are quantified as follows. For a given dimension $i$, where $i \in \{1,2,3\}$:

$D_i(s) = \frac{\sum_{j \in D_{i}} v_j(s)}{|D_{i}|}$

Then the overall degree for the given study is computed as follows:

$degree(s) = \frac{\sum_{j \in V} v_{j}(s)}{|V|}$

where $V = D_{1} \cup D_{2} \cup D_{3}$, so a variable that belongs to more than one dimension is counted only once.
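
As a worked illustration of the definitions above, the following minimal sketch computes $D_i(s)$ and $degree(s)$ for a single hypothetical study. The variable names (`a` to `e`), the dimension memberships, and the reported values are invented for illustration and are not taken from the dataset.

```python
# Hypothetical dimensions: each dimension is a set of variable names.
D1 = {"a", "b"}
D2 = {"b", "c", "d"}
D3 = {"d", "e"}

# v_j(s) for one hypothetical study s: 1 if variable j is reported, else 0.
v = {"a": 1, "b": 0, "c": 1, "d": 1, "e": 0}

def dimension_score(dimension, v):
    """D_i(s): share of the dimension's variables that the study reports."""
    return sum(v[j] for j in dimension) / len(dimension)

def degree(dimensions, v):
    """degree(s): share of reported variables over the union V of all dimensions."""
    V = set().union(*dimensions)
    return sum(v[j] for j in V) / len(V)

scores = [dimension_score(D, v) for D in (D1, D2, D3)]
overall = degree((D1, D2, D3), v)
print(scores, overall)
```

In this toy case the dimension scores are 1/2, 2/3 and 1/2, while the degree is 3/5. Because $V$ is a union, the shared variables `b` and `d` enter the degree once, so the degree is in general not the mean of the three dimension scores.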

## Setup

In [1]:
# Imports
import pandas as pd

In [2]:
# Parameters
data_path = '../data/'
save_data = True


In [3]:
# Load Dataset
data_categoric = pd.read_csv(f'{data_path}papers_reviewed_reprod_variables_categoric.csv', index_col=0)
data_categoric.head(5)

Paper ID,DOI,DOI_short,Methodology,Publisher,Year,data_listed,data_metadata,data_stats,data_type,data_access,...,eval_metrics,eval_sig_test,code_link,code_empty,code_preproc,code_features_gen,code_eval,code_params_opt,code_info,code_runable
1,https://doi.org/10.1109/CCDC.2014.6852414,10.1109/CCDC.2014.6852414,MB+ML,IEEE,2014,n,y,n,real-world,No Information,...,"Charts, Error Est.",n,n,na,na,na,na,na,na,na
2,https://doi.org/10.1109/ICTAI.2018.00136,10.1109/ICTAI.2018.00136,ML,IEEE,2018,n,y,n,real-world,proprietary,...,"Precision, Recall",y,n,na,na,na,na,na,na,na
3,https://doi.org/10.1109/ICVR57957.2023.10169760,10.1109/ICVR57957.2023.10169760,MB+ML+KB,IEEE,2023,n,n,n,No Information,No Information,...,Operational KPIs,n,n,na,na,na,na,na,na,na
4,https://doi.org/10.1109/AIKIIE60097.2023.10390401,10.1109/AIKIIE60097.2023.10390401,MB+ML,IEEE,2023,n,n,n,simulation,No Information,...,"Accuracy, F1 score, FPR, TPR",n,n,na,na,na,na,na,na,na
5,https://doi.org/10.1109/ICICT55905.2022.00043,10.1109/ICICT55905.2022.00043,MB+ML+KB,IEEE,2022,n,y,n,"real-world, simulation",No Information,...,"Accuracy, Loss",n,n,na,na,na,na,na,na,na


## Convert to numerical data

Each possible categorical entry is mapped to a numerical value: 1 if the entry provides the information, 0 otherwise.
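
The conversion relies on the pandas pattern `Series.map(...)` followed by `fillna(...)`. The sketch below shows how that pattern behaves on a small hypothetical column; `toy_mapping` and `toy_column` are illustrative names and values, not part of the dataset.

```python
import pandas as pd

# Hypothetical toy mapping and column; values are illustrative only.
toy_mapping = {"y": 1, "n": 0, "No Information": 0}
toy_column = pd.Series(["y", "n", "No Information", "MB+ML"])

# map() yields NaN for entries that have no key in the mapping;
# fillna() then restores the original entry at exactly those positions.
mapped = toy_column.map(toy_mapping)
converted = mapped.fillna(toy_column)

print(mapped.isna().tolist())
print(converted.tolist())
```

Entries without a mapping, such as the methodology label here, pass through unchanged. Since the intermediate result of `map` contains `NaN`, pandas stores it as floating point, which is why a purely numeric column such as `Year` can come out as `2014.0` after this step.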

In [4]:
numeric_mapping = {
    "y": 1,
    "n": 0,
    "na": 0,
    "not mentioned": 0,
    "real-world": 1,
    "simulation": 1,
    "experiment": 1,
    "real-world, simulation": 1,
    "proprietary": 1,
    "public": 1, 
    "purchasable": 1,
    "Single-Split": 1,
    "Train/Validation/Test": 1,
    "Cross Validation": 1,
    "Out of Sample": 1,
    "No Information": 0,
}

In [5]:
# Convert categorical data to numeric
data_numeric = data_categoric.copy()
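for col in data_numeric.columns:
    # Replace known categories by 0/1; entries without a mapping (e.g. DOI) keep their original value
    data_numeric[col] = data_numeric[col].map(numeric_mapping).fillna(data_numeric[col])
    if col == "eval_metrics":
        # Special case: eval_metrics holds free-text lists of metrics,
        # so every entry other than "no metrics" counts as reported (1)
        data_numeric[col] = data_numeric[col].apply(lambda x: 0 if x == "no metrics" else 1)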

if save_data:
    data_numeric.to_csv(f'{data_path}papers_reviewed_reprod_variables_numeric.csv')
    
data_numeric.head(5)

Paper ID,DOI,DOI_short,Methodology,Publisher,Year,data_listed,data_metadata,data_stats,data_type,data_access,...,eval_metrics,eval_sig_test,code_link,code_empty,code_preproc,code_features_gen,code_eval,code_params_opt,code_info,code_runable
1,https://doi.org/10.1109/CCDC.2014.6852414,10.1109/CCDC.2014.6852414,MB+ML,IEEE,2014.0,0,1,0,1,0,...,1,0,0,0,0,0,0,0,0,0
2,https://doi.org/10.1109/ICTAI.2018.00136,10.1109/ICTAI.2018.00136,ML,IEEE,2018.0,0,1,0,1,1,...,1,1,0,0,0,0,0,0,0,0
3,https://doi.org/10.1109/ICVR57957.2023.10169760,10.1109/ICVR57957.2023.10169760,MB+ML+KB,IEEE,2023.0,0,0,0,0,0,...,1,0,0,0,0,0,0,0,0,0
4,https://doi.org/10.1109/AIKIIE60097.2023.10390401,10.1109/AIKIIE60097.2023.10390401,MB+ML,IEEE,2023.0,0,0,0,1,0,...,1,0,0,0,0,0,0,0,0,0
5,https://doi.org/10.1109/ICICT55905.2022.00043,10.1109/ICICT55905.2022.00043,MB+ML+KB,IEEE,2022.0,0,1,0,1,0,...,1,0,0,0,0,0,0,0,0,0
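
Because unmapped entries are kept as they are, a categorical value that is missing from the mapping would stay in the table as a string. A possible sanity check, sketched here under the assumption that all score variables should end up numeric, lists such leftovers per column. The helper `find_unmapped_values` and the toy value `"experiment, simulation"` are hypothetical and only serve as illustration.

```python
import pandas as pd

def find_unmapped_values(df, columns):
    """Return {column: sorted unique non-numeric values} for the given columns."""
    leftovers = {}
    for col in columns:
        # Values that cannot be parsed as numbers become NaN.
        numeric = pd.to_numeric(df[col], errors="coerce")
        # Keep entries that were present but not numeric.
        bad = df.loc[numeric.isna() & df[col].notna(), col]
        if not bad.empty:
            leftovers[col] = sorted(bad.astype(str).unique())
    return leftovers

# Hypothetical converted table with one entry that was not covered by the mapping.
toy = pd.DataFrame({
    "data_type": [1, 0, "experiment, simulation"],
    "data_access": [1, 0, 1],
})
report = find_unmapped_values(toy, ["data_type", "data_access"])
print(report)
```

An empty result would indicate that every entry in the checked columns was converted; otherwise the reported values are candidates for additional keys in the mapping.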


## Calculate Reproducibility Scores
1. For each class of reproducibility variables (see Table 1), the average score is calculated (`score_class`).
2. For each reproducibility dimension (see Fig. 3, Eq. 2, and the top of the notebook), the average score is calculated (`score_dimension`).
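
Both steps reduce to row-wise means over groups of columns. The sketch below, using hypothetical variables `a1`, `a2`, `b1` and hypothetical classes `A` and `B`, shows that pattern and one consequence of it: an overall score formed as the mean of class means weighs every class equally, whereas a mean over all variables weighs every variable equally.

```python
import pandas as pd

# Hypothetical numeric table: two studies, three variables in two classes.
toy = pd.DataFrame(
    {"a1": [1, 0], "a2": [1, 0], "b1": [0, 1]},
    index=pd.Index([1, 2], name="Paper ID"),
)
class_a = ["a1", "a2"]
class_b = ["b1"]

scores = pd.DataFrame(index=toy.index)
# Row-wise mean over the columns of each class.
scores["A"] = toy[class_a].mean(axis=1)
scores["B"] = toy[class_b].mean(axis=1)
# Mean of class means: every class weighs the same, regardless of its size.
scores["Total"] = scores[["A", "B"]].mean(axis=1)
# Mean over all variables: every variable weighs the same.
scores["all_variables"] = toy[class_a + class_b].mean(axis=1)
print(scores)
```

Both toy studies receive the same mean of class means (0.5), although the first reports two of three variables and the second only one. The two aggregates therefore answer slightly different questions and are not interchangeable when classes differ in size.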

In [6]:
# Calculate scores for each sub-category of the variables
def calculate_scores_classes(df):
    scores = pd.DataFrame(index=df.index)
    scores['Data'] = df[['data_listed', 'data_metadata', 'data_stats', 'data_type', 'data_access']].mean(axis=1)
    scores['Preprocessing'] = df[['preproc_data', 'preproc_features', 'multiple data']].mean(axis=1)
    scores['Model'] = df[['opt_mentioned', 'opt_baseline', 'opt_procedure', 'params_models', 'params_baselines', 'params_best_model', 'params_best_baseline']].mean(axis=1)
    scores['Experiment'] = df[['eval_splitting', 'eval_metrics', 'eval_sig_test']].mean(axis=1)
    scores['Code'] = df[['code_link', 'code_empty', 'code_preproc', 'code_features_gen', 'code_eval', 'code_params_opt', 'code_info', 'code_runable']].mean(axis=1)
    scores['Total'] = scores[['Data', 'Preprocessing', 'Model', 'Experiment', 'Code']].mean(axis=1)
    return scores

score_class = calculate_scores_classes(data_numeric)

if save_data:
    score_class.to_csv(f'{data_path}score_class.csv', index=True)


In [7]:
# Calculate scores for the dimensions and the overall degree
def calculate_scores_dimensions(df):
    list_D1 = ['data_access', 'data_listed', 'data_type', 'data_stats']
    list_D2 = ['data_access', 'data_listed', 'preproc_features', 'preproc_data', 'code_link', 'code_features_gen', 'code_preproc', 'opt_procedure', 'opt_mentioned', 'params_models']
    list_D3 = ['data_access', 'data_listed', 'preproc_features', 'preproc_data', 'code_link', 'code_features_gen', 'code_preproc', 'code_eval', 'eval_metrics', 'params_best_model', 'params_best_baseline', 'eval_splitting']
    combined = list_D1 + list_D2 + list_D3
    combined_unique = list(set(combined))

    dimensions = pd.DataFrame(index=df.index)
    dimensions['D1'] = df[list_D1].mean(axis=1)
    dimensions['D2'] = df[list_D2].mean(axis=1)
    dimensions['D3'] = df[list_D3].mean(axis=1)

    dimensions['degree'] = df[combined_unique].mean(axis=1)
    return dimensions

score_dimension = calculate_scores_dimensions(data_numeric)

if save_data:
    score_dimension.to_csv(f'{data_path}score_dimension.csv', index=True)
