⚔️ Side Quest Notebook: Imputation Optimization ⚔️
==============================================================

**Author:** Xavier R Nogueira

**Overview:** In my first competition notebook, `NB1_PreProcessing_Data.ipynb`, missing values in the Protein and Peptide training datasets were imputed using both Iterative and KNN imputation. That notebook remains the first notebook in my workflow. In this notebook we explore whether imputation accuracy can be improved for each method by tuning its parameters.

Additionally, it is important to have some idea of how realistic the imputed values are. For example, if we use imputation to prepare our training set but the imputation has a very low R2 that we are unaware of, the imputed values are relatively meaningless and add noise to our model. Basing a model on inconsistent imputed values also raises the risk of overfitting to the specific imputed values in our training set.
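
To make the R2 scale concrete, here is a minimal sketch in plain Python. The helper `r2` and the toy values are assumptions for illustration only; the helper follows the usual definition, 1 - SS_res / SS_tot, for a single column. Always guessing the column mean scores exactly 0, and guesses worse than the mean score below 0.

```python
from typing import Sequence


def r2(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Coefficient of determination: 1 - SS_res / SS_tot (hypothetical helper)."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# toy "real" values for one column (made up for illustration)
real = [1.0, 2.0, 3.0, 4.0]

perfect = r2(real, [1.0, 2.0, 3.0, 4.0])    # imputation recovers every value
mean_only = r2(real, [2.5, 2.5, 2.5, 2.5])  # imputation always returns the column mean
reversed_ = r2(real, [4.0, 3.0, 2.0, 1.0])  # imputation is anti-correlated with reality

print(perfect, mean_only, reversed_)  # 1.0 0.0 -3.0
```

Read this way, a column with an R2 near 0 is being imputed no better than a simple mean fill would manage.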

We will test both `sklearn.impute.KNNImputer` and `sklearn.impute.IterativeImputer` across all combinations of parameters defined in an input parameter space. Notably, prediction results for all method-parameter-column combinations will be saved in a single `.parquet` file. The motivation is to allow myself (and others) to choose an imputation method and parameters based on prediction accuracy for any subset of Protein/Peptide columns one decides to focus a model on. Since at this point I do not know which columns will be relevant, it is in my interest to track performance across all of them independently. Alternatively, one could base a model only on Protein/Peptide columns where values can be imputed with some degree of certainty.

**Methodology:**
1. Pull in the columnar formatted `protein_data_raw.parquet` and `peptide_data_raw.parquet` produced in my [other notebook](https://www.kaggle.com/code/xaviernogueira/pre-processing-making-labels-and-imputation) and saved as a [Kaggle dataset](https://www.kaggle.com/datasets/xaviernogueira/protein-and-peptide-preproccessed). Combine them into one table.
2. Do the same for the missing data boolean mask tables (also part of the output dataset I uploaded). Make a dictionary that returns indices where there IS data for a given column.
3. Set up a version of K-Fold CV where a different subset of cells is converted to `np.nan` in each fold such that no cell is converted twice. Evaluate imputation accuracy.
4. Create a parameter grid search for both `sklearn.impute.KNNImputer` and `sklearn.impute.IterativeImputer`, run K-Fold for all possible parameter combinations.
5. Record all results in a `pd.DataFrame` such that if we eliminate features later, we can focus on the imputation method/parameters that provided the best performance for our subselection of features.

In [1]:
# core imports
import random
import itertools
import datetime
import pandas as pd
import numpy as np
import sklearn.metrics
from typing import (
    List,
    Dict,
    Tuple,
    Any,
    Optional,
)

# enable experimental imputer
from sklearn.experimental import enable_iterative_imputer

# import our imputation algos
from sklearn.impute import (
    IterativeImputer,
    KNNImputer,
)

# Pull in data

## Combine raw data tables

In [2]:
# load in data from parquet
proteins_df = pd.read_parquet(
    'prepped_inputs/protein_data_raw.parquet',
    engine='pyarrow',
)
peptide_df = pd.read_parquet(
    'prepped_inputs/peptide_data_raw.parquet',
    engine='pyarrow',
)

In [3]:
# keep track of our protein / peptide columns
protein_cols = proteins_df.columns
peptide_cols = peptide_df.columns

# join the protein / peptide data
prot_and_peps_df = pd.concat(
    [proteins_df, peptide_df],
    axis=1,
)

In [4]:
prot_and_peps_df.head()

visit_id,O00391,O00533,O00584,O14498,O14773,O14791,O15240,O15394,O43505,O60888,...,YSLTYIYTGLSK,YTTEIIK,YVGGQEHFAHLLILR,YVM(UniMod_35)LPVADQDQC(UniMod_4)IR,YVMLPVADQDQC(UniMod_4)IR,YVNKEIQNAVNGVK,YWGVASFLQK,YYC(UniMod_4)FQGNQFLR,YYTYLIMNK,YYWGGQYTWDMAK
10053_0,9104.27,402321.0,,,7150.57,2497.84,83002.9,15113.6,167327.0,129048.0,...,202274.0,,4401830.0,77482.6,583075.0,76705.7,104260.0,530223.0,,7207.3
10053_12,10464.2,435586.0,,,,,197117.0,15099.1,164268.0,108114.0,...,201009.0,,5001750.0,36745.3,355643.0,92078.1,123254.0,453883.0,49281.9,25332.8
10053_18,13235.7,507386.0,7126.96,24525.7,,2372.71,126506.0,16289.6,168107.0,163776.0,...,220728.0,,5424380.0,39016.0,496021.0,63203.6,128336.0,447505.0,52389.1,21235.7
10138_12,12600.2,494581.0,9165.06,27193.5,22506.1,6015.9,156313.0,54546.4,204013.0,56725.0,...,188362.0,9433.71,3900280.0,48210.3,328482.0,89822.1,129964.0,552232.0,65657.8,9876.98
10138_24,12003.2,522138.0,4498.51,17189.8,29112.4,2665.15,151169.0,52338.1,240892.0,85767.1,...,206187.0,6365.15,3521800.0,69984.6,496737.0,80919.3,111799.0,,56977.6,4903.09


## Combine missing value matrices

In [5]:
# load in data from parquet
proteins_mask_df = pd.read_parquet(
    'prepped_inputs/protein_data_missing_values_mask.parquet',
    engine='pyarrow',
)
peptide_mask_df = pd.read_parquet(
    'prepped_inputs/peptide_data_missing_values_mask.parquet',
    engine='pyarrow',
)

In [6]:
# join the protein / peptide data
bool_mask_df = pd.concat(
    [proteins_mask_df, peptide_mask_df],
    axis=1,
)
bool_mask_df.head()

visit_id,O00391,O00533,O00584,O14498,O14773,O14791,O15240,O15394,O43505,O60888,...,YSLTYIYTGLSK,YTTEIIK,YVGGQEHFAHLLILR,YVM(UniMod_35)LPVADQDQC(UniMod_4)IR,YVMLPVADQDQC(UniMod_4)IR,YVNKEIQNAVNGVK,YWGVASFLQK,YYC(UniMod_4)FQGNQFLR,YYTYLIMNK,YYWGGQYTWDMAK
10053_0,False,False,True,True,False,False,False,False,False,False,...,False,True,False,False,False,False,False,False,True,False
10053_12,False,False,True,True,True,True,False,False,False,False,...,False,True,False,False,False,False,False,False,False,False
10053_18,False,False,False,False,True,False,False,False,False,False,...,False,True,False,False,False,False,False,False,False,False
10138_12,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
10138_24,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,True,False,False


## Make a dictionary containing column headers as keys, and non-empty cell indices as values

In [7]:
def get_valid_values_indices(column: str) -> Tuple[str, List[str]]:
    out_list = list(
        bool_mask_df.loc[
            (bool_mask_df[column] == False)
        ].index
    )
    return (column, out_list)

# make function to remake values_exist_dict (seems to get altered)
def get_value_exist_dict() -> Dict[str, List[str]]:
    values_exist_dict = {}
    results = map(get_valid_values_indices, bool_mask_df.columns)
    for result in results:
        values_exist_dict[result[0]] = result[1]
    assert len(values_exist_dict) == len(bool_mask_df.columns)
    return values_exist_dict
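
One plausible explanation for a dictionary of lists that "seems to get altered": `dict.copy()` is a shallow copy, so the copy shares the same list objects, and any in-place change such as `list.pop` shows up in the original too. The sketch below uses a made-up one-column dictionary to show the difference between a shallow and a deep copy.

```python
import copy

# shallow copy: the inner list is shared with the original
original_a = {'O00391': ['10053_0', '10053_12', '10053_18']}
shallow = original_a.copy()
shallow['O00391'].pop(0)
print(len(original_a['O00391']))  # 2, the original lost an entry as well

# deep copy: the inner list is duplicated, so the original is untouched
original_b = {'O00391': ['10053_0', '10053_12', '10053_18']}
deep = copy.deepcopy(original_b)
deep['O00391'].pop(0)
print(len(original_b['O00391']))  # 3
```

Rebuilding the dictionary before each run, as `get_value_exist_dict()` does, sidesteps the problem at the cost of recomputing it.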

In [8]:
%%time
values_exist_dict = get_value_exist_dict()

CPU times: total: 1.55 s
Wall time: 2.1 s


# Define our K-Fold evaluation functions!

**Method:**
1. Find the column with the fewest non-NaN values and divide that count by the number of K-folds desired. Let's call this number N.
2. In the first K-fold we randomly convert N non-NaN values in each column to NaN. We keep track of which values were converted.
3. Next we use our imputation class and parameters to impute all missing values.
4. We then keep track of the N real vs. predicted values.
5. Next we repeat steps 2-4 but without re-converting any cell that was converted to NaN in a previous trial. At the end of all K-folds, N * # of Folds values in each column will have been imputed where real data existed.
6. We then compute an evaluation metric (e.g., R-squared) for each column, as well as the mean/std across all columns. This is appended to a DataFrame.

The idea here is that we can run this over a set of imputation class parameters, and later, when we know which columns/features matter, we can go back and find the parameter set that does the best job imputing them.
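
As a rough end-to-end illustration of the procedure, here is a toy sketch. The data, the column names, `k_folds = 4`, and `n_neighbors=3` are all assumptions made for the example, and the folds are built by slicing a per-column shuffle rather than by the pop-based sampling used later in this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)

# toy data: three correlated columns, 40 rows, no missing values (hypothetical)
base = rng.normal(size=40)
toy_df = pd.DataFrame({
    'a': base + rng.normal(scale=0.1, size=40),
    'b': 2 * base + rng.normal(scale=0.1, size=40),
    'c': -base + rng.normal(scale=0.1, size=40),
})

k_folds = 4
n_per_fold = len(toy_df) // k_folds  # this is N from step 1

# shuffle row positions once per column, then slice into disjoint folds
fold_rows = {col: rng.permutation(len(toy_df)) for col in toy_df.columns}

# collects the imputed value for every cell that was hidden in some fold
held_out = pd.DataFrame(np.nan, index=toy_df.index, columns=toy_df.columns)

for fold in range(k_folds):
    masked = toy_df.copy()
    hidden = {}
    for j, col in enumerate(toy_df.columns):
        rows = fold_rows[col][fold * n_per_fold:(fold + 1) * n_per_fold]
        masked.iloc[rows, j] = np.nan  # hide N real values in this column
        hidden[col] = rows

    imputed = KNNImputer(n_neighbors=3).fit_transform(masked)

    # keep only the imputed values for cells hidden in this fold
    for j, col in enumerate(toy_df.columns):
        held_out.iloc[hidden[col], j] = imputed[hidden[col], j]

# one score per column, comparing real values to their imputed counterparts
scores = {col: r2_score(toy_df[col], held_out[col]) for col in toy_df.columns}
print(scores)
```

In this toy case every cell ends up hidden exactly once because the columns have no genuine gaps; with real data only N * K cells per column are scored.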

## Define a function to randomly convert existing values to NaN

This function will need to randomly convert some proportion (1/K) of real cell values to NaN for each trial, without repeating the same cell twice.
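
The no-repeat requirement can be sketched with the standard library alone: draw random positions from a shrinking pool and pop them, so a value that has been drawn can never be drawn again. The helper name `draw_without_reuse` and the toy pool are assumptions for illustration.

```python
import random
from typing import Dict, List


def draw_without_reuse(pool: Dict[str, List[str]], column: str, n: int) -> List[str]:
    """Remove and return up to n random entries from pool[column] (hypothetical helper)."""
    n = min(n, len(pool[column]))
    positions = random.sample(range(len(pool[column])), n)
    # pop the highest positions first so the remaining positions stay valid
    return [pool[column].pop(i) for i in sorted(positions, reverse=True)]


random.seed(0)
pool = {'col_a': [f'visit_{i}' for i in range(10)]}

# five folds of two values each exhaust the pool without any overlap
folds = [draw_without_reuse(pool, 'col_a', 2) for _ in range(5)]
all_drawn = [visit for fold in folds for visit in fold]

print(len(all_drawn), len(set(all_drawn)), pool['col_a'])  # 10 10 []
```

Because the pool is mutated in place, it needs to be rebuilt (or copied deeply) before another round of folds.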

In [9]:
def get_fold_matrices(
    kfold_df: pd.DataFrame,
    choose_from_dict: Dict[str, List[str]],
    num_values_to_convert: int,
) -> Tuple[pd.DataFrame, pd.DataFrame]:
    """Get a dataframe to impute NaNs, and a mask of removed values.
    
    Returns:
        (kfold_df, mask_df). Note that 1 in mask_df corresponds to values
        that existed but were randomly removed.
    """
    # track which cells were removed in a mask dataframe (1 = removed)
    mask_df = pd.DataFrame(
        np.zeros_like(kfold_df.to_numpy()),
        columns=kfold_df.columns,
        index=kfold_df.index, 
        dtype=np.int8,
    )
    assert mask_df.shape == kfold_df.shape

    # replace values randomly
    for col in kfold_df.columns:
        # cap per column so one sparse column cannot shrink the folds of later columns
        n_convert = min(num_values_to_convert, len(choose_from_dict[col]))

        # randomly select existing values to impute
        random_idxs = random.sample(range(len(choose_from_dict[col])), n_convert)
        random_visit_ids = [choose_from_dict[col].pop(index) for index in sorted(random_idxs, reverse=True)]

        # replace randomly selected values with NaN
        kfold_df.loc[random_visit_ids, col] = np.nan

        # keep track of where they were replaced
        mask_df.loc[random_visit_ids, col] = 1
        assert mask_df[col].sum() == n_convert
    return (kfold_df, mask_df)

## Define the function that performs our K-Fold analysis, and saves results to a DataFrame

Yes, I know this function is way too long, but bear with me!

In [10]:
def k_fold_cv(
    data_df: pd.DataFrame,
    values_exist_dict: Dict[str, List[str]],
    imputation_class: object,
    imputation_kwargs: Dict[str, Any],
    results_df: Optional[pd.DataFrame] = None,
    kfolds: Optional[int] = 5,
    eval_metric: Optional[callable] = sklearn.metrics.r2_score,
    verbose: Optional[bool] = True,
) -> Tuple[float, pd.DataFrame]:
    """Run K-Fold analysis and save results to a dataframe.
    
    Arguments:
        data_df: full data table.
        values_exist_dict: a dictionary with data_df.columns as keys, and 
            a list of data_df.index values where data is not NaN.
            NOTE: I recommend setting this to get_value_exist_dict() in order 
            to prevent odd behavior when re-running cells.
        imputation_class: an sklearn imputation class (e.g., KNNImputer).
        imputation_kwargs: a dictionary of kwargs for the imputation_class.__init__.
        results_df: if provided, imputation results are appended to the end of it.
        kfolds: the number of K-folds to perform.
        eval_metric: an sklearn evaluation metric (default is R2).
        verbose: whether to print out each time a K-Fold is run.
    Returns:
        (The mean score across all columns, the results_df).
    """

    # find the number of values to include in our folds for each column
    num_values_to_convert = data_df.apply(lambda x: len(x.dropna()) // kfolds).min()

    # copy each list too: dict.copy() is shallow, so popping would alter the caller's lists
    choose_from_dict = {col: list(ids) for col, ids in values_exist_dict.items()}

    # set up results_df (or check if the input one is as expected)
    results_cols = (
        [
            'imputation_method', 
            'params_dict', 
            f'mean_{eval_metric.__name__}', 
            f'std_{eval_metric.__name__}',
        ] + list(data_df.columns)
    )
    if results_df is None:
        results_df = pd.DataFrame(
            columns=results_cols,
        )
    else:
        assert list(results_df.columns) == results_cols

    imputed_dfs = []
    for fold in range(kfolds):
        if verbose:
            print(f'K-Fold {fold +1} | datetime={datetime.datetime.now()}')

        # get matrix to impute, and mask
        kfold_df, mask_df = get_fold_matrices(
            data_df.copy(),
            choose_from_dict,
            num_values_to_convert,
        )

        # init sklearn imputation class
        imputer = imputation_class(**imputation_kwargs)

        # apply imputation and convert to dataframe, add to list
        imputed_kfold_array = imputer.fit_transform(kfold_df)
        assert imputed_kfold_array.shape == mask_df.values.shape

        imputed_df = pd.DataFrame(
            data=imputed_kfold_array,
            columns=kfold_df.columns,
            index=kfold_df.index,
        )
        del imputed_kfold_array

        # check that things look as expected
        assert list(imputed_df.columns) == list(mask_df.columns)
        assert list(imputed_df.index) == list(mask_df.index)

        # replace all non-imputed values with NaN
        imputed_df = imputed_df.where(mask_df.values == 1, np.nan)
        imputed_dfs.append(imputed_df)

    # stack our imputed dataframes
    all_imputed_df = imputed_dfs[0]
    for df in imputed_dfs[1:]:
        all_imputed_df = all_imputed_df.combine_first(df)
    del imputed_dfs

    # for each column get all values that were real in data_df but NaN in kfold_df
    metric_scores = []
    for col in kfold_df.columns:
        real_vals = data_df.loc[all_imputed_df[col].notna(), col].values
        imputed_vals = all_imputed_df.loc[all_imputed_df[col].notna(), col].values

        # calculate metric score
        try:
            metric_scores.append(
                eval_metric(
                    y_true=real_vals,
                    y_pred=imputed_vals,
                )
            )
        except ValueError:
            print(f'An issue was hit for column={col}, metric_score = NaN')
            metric_scores.append(np.nan)
    metric_scores = np.array(metric_scores, dtype='object')

    # summarize the per-column scores into our overall fold score
    mean_metric_score = np.nanmean(metric_scores)
    std_metric_score = np.nanstd(metric_scores)

    # insert the necessary data for our results df at the start
    metric_scores = np.insert(
        metric_scores, 
        0, 
        [
            imputation_class.__name__, 
            str(imputation_kwargs), 
            mean_metric_score, 
            std_metric_score,
        ]
    )

    # append to our results df
    metric_scores_df = pd.DataFrame(
        data=metric_scores.reshape(1, len(metric_scores)),
        columns=results_df.columns,
        dtype='object',
    )
    out_results_df = pd.concat(
        [results_df, metric_scores_df],
        axis=0,
        ignore_index=True,
    )

    # return mean score as the fold score
    print(f'Mean {eval_metric.__name__} score: {mean_metric_score} | {datetime.datetime.now()}')
    return (mean_metric_score, out_results_df)


# Use a custom grid search to test a variety of imputation parameter combinations

All results are saved to the same `results_df`, which will then be saved to `.parquet`. The idea here is that we don't yet know which Proteins/Peptides will be valuable for predictions, so we create our own dataset that lets us go back and find the best imputation parameters for the subset of features we find important.
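
Once such a results table exists, picking the best row for a chosen subset of features might look like the sketch below. The helper `best_row_for`, the toy table, and its scores are made up for illustration; only the layout (one row per method/parameter combination, one score column per feature) mirrors `results_df`.

```python
from typing import List

import pandas as pd

# toy stand-in for results_df; the scores are invented for this example
toy_results = pd.DataFrame({
    'imputation_method': ['KNNImputer', 'KNNImputer', 'IterativeImputer'],
    'params_dict': ["{'n_neighbors': 2}", "{'n_neighbors': 5}", "{'max_iter': 5}"],
    'col_a': [0.10, 0.30, 0.20],
    'col_b': [0.50, 0.40, 0.90],
    'col_c': [0.20, 0.60, 0.10],
})


def best_row_for(results: pd.DataFrame, columns: List[str]) -> pd.Series:
    """Return the row with the highest mean score over the given columns."""
    # cast to float because score columns may be stored with object dtype
    subset_mean = results[columns].astype(float).mean(axis=1)
    return results.loc[subset_mean.idxmax()]


best_ab = best_row_for(toy_results, ['col_a', 'col_b'])
best_c = best_row_for(toy_results, ['col_c'])

print(best_ab['imputation_method'], best_ab['params_dict'])  # IterativeImputer {'max_iter': 5}
print(best_c['imputation_method'], best_c['params_dict'])    # KNNImputer {'n_neighbors': 5}
```

The winner changes with the subset, which is the reason for keeping every per-column score rather than only the overall mean.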

In [11]:
# init results_df, can read from parquet if desired
RESULTS_PARQUET = None
if not RESULTS_PARQUET:
    results_df = None
else:
    results_df = pd.read_parquet(
        RESULTS_PARQUET,
        engine='pyarrow',
    )

In [12]:
# define the number of K-folds to evaluate over
K_FOLDS: int = 5

## Search for `KNNImputer`

In [13]:
# define our parameter grid
knn_param_grid = {
    'n_neighbors': list(range(2, 21, 3)),
    'weights': ['uniform', 'distance'],
}

# calculate all param combinations
knn_param_combinations = list(itertools.product(*knn_param_grid.values()))
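
For reference, a small standalone sketch of how `itertools.product` expands a grid like this one into keyword dictionaries. The variable names `param_grid` and `combos` are chosen for the example; the grid values are the same as above.

```python
import itertools

param_grid = {
    'n_neighbors': list(range(2, 21, 3)),  # 2, 5, 8, 11, 14, 17, 20
    'weights': ['uniform', 'distance'],
}

# pair each tuple from the Cartesian product back up with its parameter names
combos = [
    dict(zip(param_grid.keys(), values))
    for values in itertools.product(*param_grid.values())
]

print(len(combos))  # 14 (7 neighbor counts x 2 weightings)
print(combos[0])    # {'n_neighbors': 2, 'weights': 'uniform'}
print(combos[-1])   # {'n_neighbors': 20, 'weights': 'distance'}
```

The last key in the grid varies fastest, so both `weights` options are tried for each `n_neighbors` value before moving on.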

In [14]:
%%time
print(f'Testing {len(knn_param_combinations)} KNNImputer params...')
for param_combo in knn_param_combinations:
    params_dict = dict(zip(knn_param_grid.keys(), param_combo))
    print(f'\nTesting: {params_dict}')
    score, results_df = k_fold_cv(
        prot_and_peps_df,
        get_value_exist_dict(),
        imputation_class=KNNImputer,
        imputation_kwargs=params_dict,
        results_df=results_df,
        kfolds=K_FOLDS,
        verbose=False,
    )
print('Done!')

Testing 14 KNNImputer params...

Testing: {'n_neighbors': 2, 'weights': 'uniform'}
Mean r2_score score: 0.18236277445513477 | 2023-04-16 02:57:35.371549

Testing: {'n_neighbors': 2, 'weights': 'distance'}
Mean r2_score score: 0.18713362602099826 | 2023-04-16 02:58:14.653948

Testing: {'n_neighbors': 5, 'weights': 'uniform'}
Mean r2_score score: 0.28007697252597724 | 2023-04-16 02:58:56.598814

Testing: {'n_neighbors': 5, 'weights': 'distance'}
Mean r2_score score: 0.2912689042984318 | 2023-04-16 02:59:39.460724

Testing: {'n_neighbors': 8, 'weights': 'uniform'}
Mean r2_score score: 0.29287821867435054 | 2023-04-16 03:00:21.927977

Testing: {'n_neighbors': 8, 'weights': 'distance'}
Mean r2_score score: 0.30184001109382985 | 2023-04-16 03:01:09.986926

Testing: {'n_neighbors': 11, 'weights': 'uniform'}
Mean r2_score score: 0.2931754670248502 | 2023-04-16 03:01:51.899509

Testing: {'n_neighbors': 11, 'weights': 'distance'}
Mean r2_score score: 0.3017973423288525 | 2023-04-16 03:02:35.9434

In [15]:
results_df.head(n=5)

Unnamed: 0,imputation_method,params_dict,mean_r2_score,std_r2_score,O00391,O00533,O00584,O14498,O14773,O14791,...,YSLTYIYTGLSK,YTTEIIK,YVGGQEHFAHLLILR,YVM(UniMod_35)LPVADQDQC(UniMod_4)IR,YVMLPVADQDQC(UniMod_4)IR,YVNKEIQNAVNGVK,YWGVASFLQK,YYC(UniMod_4)FQGNQFLR,YYTYLIMNK,YYWGGQYTWDMAK
0,KNNImputer,"{'n_neighbors': 2, 'weights': 'uniform'}",0.182363,0.216865,0.121767,0.615388,0.290944,0.229909,-0.077902,-0.012051,...,-0.095072,0.323095,-0.039521,0.224295,0.256143,0.245224,0.028727,-0.029504,0.341731,0.122736
1,KNNImputer,"{'n_neighbors': 2, 'weights': 'distance'}",0.187134,0.213192,0.140218,0.539986,0.278423,0.2902,0.160262,0.082461,...,-0.091191,0.382571,0.0512,0.094238,0.221503,0.188845,0.014452,-0.000677,0.229565,0.120847
2,KNNImputer,"{'n_neighbors': 5, 'weights': 'uniform'}",0.280077,0.181587,0.204337,0.634316,0.303693,0.362852,0.253829,0.155422,...,0.079782,0.498926,0.05101,0.120241,0.205945,0.332093,0.170402,0.179671,0.387852,0.156667
3,KNNImputer,"{'n_neighbors': 5, 'weights': 'distance'}",0.291269,0.18215,0.28336,0.642646,0.334484,0.434463,0.262832,0.159615,...,0.019384,0.466665,0.092569,0.170675,0.172758,0.326487,0.053301,0.171498,0.379182,0.181235
4,KNNImputer,"{'n_neighbors': 8, 'weights': 'uniform'}",0.292878,0.173131,0.258248,0.588604,0.371501,0.393089,0.243892,0.170903,...,0.071173,0.446771,0.128921,0.096146,0.157134,0.353228,0.159601,0.196671,0.404361,0.232213


## Search for `IterativeImputer`

In [16]:
from sklearn.linear_model import (
    ARDRegression,
    BayesianRidge,
)

In [17]:
# define our parameter grid
iter_param_grid = {
    'estimator': [ARDRegression(), BayesianRidge(),],
    'max_iter': [5],
    'n_nearest_features': list(range(50, 111, 20)),
    'sample_posterior': [True],
}

# calculate all param combinations
iter_param_combinations = list(itertools.product(*iter_param_grid.values()))

In [18]:
%%time
print(f'Testing {len(iter_param_combinations)} IterativeImputer params...')
for param_combo in iter_param_combinations:
    params_dict = dict(zip(iter_param_grid.keys(), param_combo))
    print(f'\nTesting: {params_dict}')
    score, results_df = k_fold_cv(
        prot_and_peps_df,
        get_value_exist_dict(),
        imputation_class=IterativeImputer,
        imputation_kwargs=params_dict,
        results_df=results_df,
        kfolds=K_FOLDS,
        verbose=False,
    )
print('Done!')

Testing 8 IterativeImputer params...

Testing: {'estimator': ARDRegression(), 'max_iter': 5, 'n_nearest_features': 50, 'sample_posterior': True}
Mean r2_score score: 0.21335630242369547 | 2023-04-16 03:16:44.098890

Testing: {'estimator': ARDRegression(), 'max_iter': 5, 'n_nearest_features': 70, 'sample_posterior': True}
Mean r2_score score: 0.2955188583161393 | 2023-04-16 03:31:52.414309

Testing: {'estimator': ARDRegression(), 'max_iter': 5, 'n_nearest_features': 90, 'sample_posterior': True}
Mean r2_score score: 0.3597963457708354 | 2023-04-16 03:51:59.793753

Testing: {'estimator': ARDRegression(), 'max_iter': 5, 'n_nearest_features': 110, 'sample_posterior': True}
Mean r2_score score: 0.4058245635469679 | 2023-04-16 04:18:08.601868

Testing: {'estimator': BayesianRidge(), 'max_iter': 5, 'n_nearest_features': 50, 'sample_posterior': True}
Mean r2_score score: 0.2004183954847902 | 2023-04-16 04:20:53.261207

Testing: {'estimator': BayesianRidge(), 'max_iter': 5, 'n_nearest_features'

In [19]:
results_df.tail()

Unnamed: 0,imputation_method,params_dict,mean_r2_score,std_r2_score,O00391,O00533,O00584,O14498,O14773,O14791,...,YSLTYIYTGLSK,YTTEIIK,YVGGQEHFAHLLILR,YVM(UniMod_35)LPVADQDQC(UniMod_4)IR,YVMLPVADQDQC(UniMod_4)IR,YVNKEIQNAVNGVK,YWGVASFLQK,YYC(UniMod_4)FQGNQFLR,YYTYLIMNK,YYWGGQYTWDMAK
17,IterativeImputer,"{'estimator': ARDRegression(), 'max_iter': 5, ...",0.405825,0.290665,0.215991,0.873559,0.148967,0.550286,0.10842,0.61039,...,-0.005878,0.292892,0.702772,0.439464,0.932664,0.261024,0.153584,0.13832,0.615008,0.265595
18,IterativeImputer,"{'estimator': BayesianRidge(), 'max_iter': 5, ...",0.200418,0.318966,0.118181,0.705925,-0.029124,0.466217,-0.013219,0.014038,...,-0.160374,0.414211,-0.184243,0.217544,0.845128,-0.03181,-0.043605,-0.151174,0.543387,0.115601
19,IterativeImputer,"{'estimator': BayesianRidge(), 'max_iter': 5, ...",0.267385,0.327774,-0.060826,0.816276,0.121168,0.292292,0.058093,0.421451,...,-0.25054,0.356562,0.284778,0.259837,0.891777,0.032449,-0.071722,-0.207444,0.504375,0.120079
20,IterativeImputer,"{'estimator': BayesianRidge(), 'max_iter': 5, ...",0.315123,0.3124,-0.015119,0.768125,0.236775,0.527927,0.328193,0.353725,...,-0.142057,0.523655,0.249935,0.226626,0.909515,0.084076,-0.031858,0.218327,0.564412,0.083414
21,IterativeImputer,"{'estimator': BayesianRidge(), 'max_iter': 5, ...",0.357893,0.310639,0.410341,0.828581,0.217252,0.62167,-0.020368,0.34261,...,0.192516,0.406141,0.529886,0.295797,0.910809,0.154364,0.186039,0.256965,0.566092,0.07754


## Save results to parquet

In [20]:
results_df.to_parquet(
    'imputation_param_search.parquet',
    engine='pyarrow',
)