⚔️ Side Quest Notebook: Imputation Optimization ⚔️
==============================================================

**Author:** Xavier R Nogueira

**Overview:** In my first competition notebook, `NB1_PreProcessing_Data.ipynb`, missing values in the Protein and Peptide training datasets were imputed using both Iterative and KNN imputation. That notebook remains the first notebook in my workflow. In this notebook we explore whether tuning each method's parameters improves its imputation accuracy. In later notebooks we will make predictions using training data filled with both methods, and evaluate results at the prediction task level.

**Methodology:**
1. Pull in the columnar formatted `protein_data_raw.parquet` and `peptide_data_raw.parquet` training data files into `pd.DataFrame`s. Combine them into one table.
2. Combine the Protein/Peptide boolean missing data masks. Make a dictionary that returns indices where there IS data for a given column.
3. Set up a version of K-Fold CV where a different subset of cells is converted to `np.nan` in each fold, such that every non-empty cell is converted at most once. Evaluate imputation accuracy on the converted cells.
4. Run `Optuna` evaluation for both imputation methods across their parameter space.
5. Record all results in a `pd.DataFrame` such that if we eliminate features later, we can focus on the imputation method that provides the best performance for our subselection of columns.

In [197]:
# core imports
import random
import datetime
import pandas as pd
import numpy as np
import hvplot.pandas
import sklearn.metrics
from typing import (
    List,
    Dict,
    Tuple,
    Any,
    Optional,
)

# enable experimental imputer
from sklearn.experimental import enable_iterative_imputer

# import our imputation algos
from sklearn.impute import (
    IterativeImputer,
    KNNImputer,
)

# Pull in data

## Combine raw data tables

In [2]:
# load in data from parquet
proteins_df = pd.read_parquet(
    'prepped_inputs/protein_data_raw.parquet',
    engine='pyarrow',
)
peptide_df = pd.read_parquet(
    'prepped_inputs/peptide_data_raw.parquet',
    engine='pyarrow',
)

In [3]:
# keep track of our protein / peptide columns
protein_cols = proteins_df.columns
peptide_cols = peptide_df.columns

# join the protein / peptide data
prot_and_peps_df = pd.concat(
    [proteins_df, peptide_df],
    axis=1,
)

In [4]:
prot_and_peps_df.head()

visit_id,O00391,O00533,O00584,O14498,O14773,O14791,O15240,O15394,O43505,O60888,...,YSLTYIYTGLSK,YTTEIIK,YVGGQEHFAHLLILR,YVM(UniMod_35)LPVADQDQC(UniMod_4)IR,YVMLPVADQDQC(UniMod_4)IR,YVNKEIQNAVNGVK,YWGVASFLQK,YYC(UniMod_4)FQGNQFLR,YYTYLIMNK,YYWGGQYTWDMAK
10053_0,9104.27,402321.0,,,7150.57,2497.84,83002.9,15113.6,167327.0,129048.0,...,202274.0,,4401830.0,77482.6,583075.0,76705.7,104260.0,530223.0,,7207.3
10053_12,10464.2,435586.0,,,,,197117.0,15099.1,164268.0,108114.0,...,201009.0,,5001750.0,36745.3,355643.0,92078.1,123254.0,453883.0,49281.9,25332.8
10053_18,13235.7,507386.0,7126.96,24525.7,,2372.71,126506.0,16289.6,168107.0,163776.0,...,220728.0,,5424380.0,39016.0,496021.0,63203.6,128336.0,447505.0,52389.1,21235.7
10138_12,12600.2,494581.0,9165.06,27193.5,22506.1,6015.9,156313.0,54546.4,204013.0,56725.0,...,188362.0,9433.71,3900280.0,48210.3,328482.0,89822.1,129964.0,552232.0,65657.8,9876.98
10138_24,12003.2,522138.0,4498.51,17189.8,29112.4,2665.15,151169.0,52338.1,240892.0,85767.1,...,206187.0,6365.15,3521800.0,69984.6,496737.0,80919.3,111799.0,,56977.6,4903.09


## Combine missing value matrices

In [5]:
# load in data from parquet
proteins_mask_df = pd.read_parquet(
    'prepped_inputs/protein_data_missing_values_mask.parquet',
    engine='pyarrow',
)
peptide_mask_df = pd.read_parquet(
    'prepped_inputs/peptide_data_missing_values_mask.parquet',
    engine='pyarrow',
)

In [6]:
# join the protein / peptide missing value masks
bool_mask_df = pd.concat(
    [proteins_mask_df, peptide_mask_df],
    axis=1,
)
bool_mask_df.head()

visit_id,O00391,O00533,O00584,O14498,O14773,O14791,O15240,O15394,O43505,O60888,...,YSLTYIYTGLSK,YTTEIIK,YVGGQEHFAHLLILR,YVM(UniMod_35)LPVADQDQC(UniMod_4)IR,YVMLPVADQDQC(UniMod_4)IR,YVNKEIQNAVNGVK,YWGVASFLQK,YYC(UniMod_4)FQGNQFLR,YYTYLIMNK,YYWGGQYTWDMAK
10053_0,False,False,True,True,False,False,False,False,False,False,...,False,True,False,False,False,False,False,False,True,False
10053_12,False,False,True,True,True,True,False,False,False,False,...,False,True,False,False,False,False,False,False,False,False
10053_18,False,False,False,False,True,False,False,False,False,False,...,False,True,False,False,False,False,False,False,False,False
10138_12,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
10138_24,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,True,False,False


## Make a dictionary containing column headers as keys, and non-empty cell indices as values

In [210]:
def get_valid_values_indices(column: str) -> Tuple[str, List[str]]:
    # visit_ids where the mask is False, i.e. where a real value exists
    valid_ids = list(
        bool_mask_df.loc[
            (bool_mask_df[column] == False)
        ].index
    )

    return (column, valid_ids)

In [243]:
%%time
values_exist_dict = {}
results = map(get_valid_values_indices, bool_mask_df.columns)
for result in results:
    values_exist_dict[result[0]] = result[1]
assert len(values_exist_dict) == len(bool_mask_df.columns)

CPU times: total: 1.53 s
Wall time: 2.12 s
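
For comparison, the same kind of lookup can be built in a single dict comprehension. This is only a sketch on a toy mask: `toy_mask_df` and `toy_values_exist` are hypothetical names, and the toy frame follows the same convention as `bool_mask_df` (`True` marks a missing cell).

```python
import pandas as pd

# toy stand-in for bool_mask_df: True marks a missing cell
toy_mask_df = pd.DataFrame(
    {
        'P1': [False, True, False],
        'P2': [True, True, False],
    },
    index=['1_0', '1_6', '2_0'],
)

# for each column, keep the index labels where the mask is False (value present)
toy_values_exist = {
    col: toy_mask_df.index[~toy_mask_df[col].to_numpy()].tolist()
    for col in toy_mask_df.columns
}
print(toy_values_exist)
```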


# Define functions for evaluation

**Note:** The following is a modified version of our main workflow defined in `Tabular_MachineLearning_Projects/ml_models`.

## Define function to run full matrix K-Fold

This function will need to randomly convert some proportion (1/K) of real cell values to NaN for each trial, without repeating the same cell twice.
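
One way to picture the "no cell twice" requirement for a single column: shuffle that column's valid `visit_id`s once, then cut the shuffled list into K disjoint chunks. The sketch below uses a hypothetical helper, `split_into_folds`, and made-up ids. It is not the approach implemented below, which instead pops a fixed number of ids per column in each fold.

```python
import numpy as np
from typing import List


def split_into_folds(valid_ids: List[str], k: int, seed: int = 0) -> List[List[str]]:
    """Shuffle the ids once, then cut them into k disjoint chunks."""
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(np.asarray(valid_ids, dtype=object))
    return [chunk.tolist() for chunk in np.array_split(shuffled, k)]


# made-up visit ids for one column
toy_ids = [f'visit_{i}' for i in range(11)]
folds = split_into_folds(toy_ids, k=5)

for fold_number, fold_ids in enumerate(folds, start=1):
    print(fold_number, fold_ids)
```

Because the chunks come from one shuffled list, every id lands in exactly one fold, so each real value is hidden exactly once across the K trials.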

In [244]:
# define the number of K-folds to evaluate over
K_FOLDS: int = 5

In [245]:
def get_fold_matrix(
    kfold_df: pd.DataFrame,
    choose_from_dict: Dict[str, List[str]],
    num_values_to_convert: int,
) -> pd.DataFrame:
    # NOTE: the chosen visit_ids are popped out of choose_from_dict so that
    # later folds cannot select the same cells again
    for col in kfold_df.columns:
        # never ask for more values than this column has left
        n_to_convert = min(num_values_to_convert, len(choose_from_dict[col]))

        # randomly select existing values to impute
        random_idxs = random.sample(range(len(choose_from_dict[col])), n_to_convert)
        random_visit_ids = [choose_from_dict[col].pop(index) for index in sorted(random_idxs, reverse=True)]

        # replace randomly selected values with NaN
        kfold_df.loc[random_visit_ids, col] = np.nan

    return kfold_df

In [246]:
def k_fold_cv(
    data_df: pd.DataFrame,
    values_exist_dict: Dict[str, List[str]],
    imputation_class: object,
    imputation_kwargs: Dict[str, Any],
    results_df: Optional[pd.DataFrame] = None,
    kfolds: Optional[int] = 5,
    eval_metric: Optional[callable] = sklearn.metrics.r2_score,
) -> Tuple[float, pd.DataFrame]:
    
    # find the number of values to include in our folds for each column
    num_values_to_convert = data_df.apply(lambda x: len(x.dropna()) // kfolds).min()

    # make a copy of our values_exist_dict to choose from
    choose_from_dict = values_exist_dict.copy()

    # set up results_df (or check if the input one is as expected)
    results_cols = (
        ['imputation_method', 'params_dict', 'mean_r2', 'std_r2'] +
        list(data_df.columns)
    )
    if results_df is None:
        results_df = pd.DataFrame(
            columns=results_cols,
        )
    else:
        assert list(results_df.columns) == results_cols
    
    imputed_dfs = []
    for fold in range(kfolds):
        print(f'K-Fold {fold +1} | datetime={datetime.datetime.now()}')
        kfold_df = get_fold_matrix(
            data_df.copy(),
            choose_from_dict,
            num_values_to_convert,
        )
        
        # init sklearn imputation class
        imputer = imputation_class(**imputation_kwargs)
        
        # apply imputation and convert to dataframe, add to list
        imputed_kfold_array = imputer.fit_transform(kfold_df)
        imputed_dfs.append(
            pd.DataFrame(
                data=imputed_kfold_array,
                columns=kfold_df.columns,
                index=kfold_df.index,
            ),
        )
        del imputed_kfold_array
    
    # stack our imputed dataframes
    all_imputed_df = imputed_dfs[0]
    for df in imputed_dfs[1:]:
        all_imputed_df = all_imputed_df.combine_first(df)
    del imputed_dfs
    print(all_imputed_df.shape)
    
    # for each column get all values that were real in data_df but NaN in kfold_df
    metric_scores = []
    for col in kfold_df.columns:
        real_vals = data_df.loc[values_exist_dict[col], col].values
        imputed_vals = all_imputed_df.loc[values_exist_dict[col], col].values
        
        # calculate metric score
        try:
            metric_scores.append(
                eval_metric(
                    y_true=real_vals,
                    y_pred=imputed_vals,
                )
            )
        except ValueError:
            print(col)
            print(real_vals)
            print(imputed_vals)
            metric_scores.append(np.nan)
    metric_scores = np.array(metric_scores, dtype='object')
    
    # aggregate the per-column scores into our overall fold score
    mean_metric_score = np.nanmean(metric_scores)
    std_metric_score = np.nanstd(metric_scores)
    
    # insert the necessary data for our results df at the start
    metric_scores = np.insert(
        metric_scores, 
        0, 
        [
            imputation_class.__name__, 
            imputation_kwargs, 
            mean_metric_score, 
            std_metric_score,
        ]
    )
    
    # append to our results df
    metric_scores_series = pd.Series(
        data=metric_scores,
        index=results_df.columns,
        dtype='object',
    )
    # DataFrame.append was removed in pandas 2.0; concat a one-row frame instead
    results_df = pd.concat(
        [results_df, metric_scores_series.to_frame().T],
        ignore_index=True,
    )
    
    # return mean score as the fold score
    return (mean_metric_score, results_df)


In [255]:
kfold_df = get_fold_matrix(
    df.copy(),
    values_exist_dict,
    96,
)

In [257]:
df

visit_id,O00391,O00533,O00584,O14498,O14773,O14791,O15240,O15394,O43505,O60888,...,YSLTYIYTGLSK,YTTEIIK,YVGGQEHFAHLLILR,YVM(UniMod_35)LPVADQDQC(UniMod_4)IR,YVMLPVADQDQC(UniMod_4)IR,YVNKEIQNAVNGVK,YWGVASFLQK,YYC(UniMod_4)FQGNQFLR,YYTYLIMNK,YYWGGQYTWDMAK
10053_0,9104.27,402321.0,,,7150.57,2497.84,83002.9,15113.6,167327.0,129048.0,...,202274.0,,4401830.0,77482.6,583075.0,76705.7,104260.0,530223.0,,7207.30
10053_12,10464.20,435586.0,,,,,197117.0,15099.1,164268.0,108114.0,...,201009.0,,5001750.0,36745.3,355643.0,92078.1,123254.0,453883.0,49281.9,25332.80
10053_18,13235.70,507386.0,7126.96,24525.7,,2372.71,126506.0,16289.6,168107.0,163776.0,...,220728.0,,5424380.0,39016.0,496021.0,63203.6,128336.0,447505.0,52389.1,21235.70
10138_12,12600.20,494581.0,9165.06,27193.5,22506.10,6015.90,156313.0,54546.4,204013.0,56725.0,...,188362.0,9433.71,3900280.0,48210.3,328482.0,89822.1,129964.0,552232.0,65657.8,9876.98
10138_24,12003.20,522138.0,4498.51,17189.8,29112.40,2665.15,151169.0,52338.1,240892.0,85767.1,...,206187.0,6365.15,3521800.0,69984.6,496737.0,80919.3,111799.0,,56977.6,4903.09
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8699_24,9983.00,400290.0,24240.10,,16943.50,6303.17,77493.6,46435.3,254247.0,138910.0,...,289888.0,8615.27,8770410.0,33599.1,926094.0,118897.0,133682.0,571879.0,80268.3,54889.70
942_12,6757.32,360858.0,18367.60,14760.7,18603.40,1722.77,86847.4,37741.3,212132.0,100519.0,...,173259.0,4767.63,374307.0,35767.3,250397.0,65966.9,77976.8,486239.0,45032.7,
942_24,,352722.0,22834.90,23393.1,16693.50,1487.91,114772.0,36095.7,185836.0,99183.5,...,185428.0,5554.53,,64049.8,479473.0,68505.7,74483.1,561398.0,52916.4,21847.60
942_48,,251820.0,22046.50,26360.5,22440.20,2117.43,82241.9,30146.6,167633.0,84875.1,...,137611.0,6310.09,,28008.8,231359.0,63265.8,64601.8,632782.0,51123.7,20700.30


In [256]:
kfold_df

visit_id,O00391,O00533,O00584,O14498,O14773,O14791,O15240,O15394,O43505,O60888,...,YSLTYIYTGLSK,YTTEIIK,YVGGQEHFAHLLILR,YVM(UniMod_35)LPVADQDQC(UniMod_4)IR,YVMLPVADQDQC(UniMod_4)IR,YVNKEIQNAVNGVK,YWGVASFLQK,YYC(UniMod_4)FQGNQFLR,YYTYLIMNK,YYWGGQYTWDMAK
10053_0,9104.27,402321.0,,,,2497.84,83002.9,15113.6,167327.0,129048.0,...,202274.0,,4401830.0,77482.6,583075.0,76705.7,104260.0,530223.0,,7207.30
10053_12,10464.20,435586.0,,,,,197117.0,15099.1,164268.0,108114.0,...,201009.0,,5001750.0,36745.3,355643.0,92078.1,123254.0,453883.0,49281.9,25332.80
10053_18,13235.70,,7126.96,24525.7,,2372.71,,16289.6,168107.0,163776.0,...,220728.0,,5424380.0,39016.0,496021.0,63203.6,128336.0,447505.0,52389.1,21235.70
10138_12,12600.20,494581.0,9165.06,27193.5,22506.10,6015.90,156313.0,54546.4,204013.0,56725.0,...,188362.0,9433.71,3900280.0,48210.3,328482.0,89822.1,129964.0,552232.0,65657.8,9876.98
10138_24,12003.20,522138.0,4498.51,17189.8,29112.40,2665.15,,52338.1,240892.0,,...,206187.0,6365.15,3521800.0,69984.6,496737.0,80919.3,111799.0,,56977.6,4903.09
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8699_24,9983.00,400290.0,24240.10,,16943.50,6303.17,77493.6,46435.3,254247.0,138910.0,...,289888.0,8615.27,8770410.0,33599.1,926094.0,118897.0,,571879.0,80268.3,54889.70
942_12,,360858.0,18367.60,14760.7,18603.40,1722.77,86847.4,37741.3,212132.0,100519.0,...,173259.0,4767.63,374307.0,35767.3,250397.0,65966.9,77976.8,486239.0,45032.7,
942_24,,352722.0,22834.90,23393.1,16693.50,1487.91,,36095.7,185836.0,99183.5,...,185428.0,5554.53,,64049.8,479473.0,68505.7,74483.1,561398.0,52916.4,21847.60
942_48,,251820.0,22046.50,26360.5,22440.20,2117.43,82241.9,30146.6,,84875.1,...,137611.0,6310.09,,28008.8,231359.0,63265.8,64601.8,632782.0,51123.7,20700.30


In [247]:
score, results_df = k_fold_cv(
    prot_and_peps_df,
    values_exist_dict,
    imputation_class=KNNImputer,
    imputation_kwargs={'weights':'uniform', 'n_neighbors':5},
    kfolds=K_FOLDS,
)

K-Fold 1 | datetime=2023-04-15 01:51:38.360063
K-Fold 2 | datetime=2023-04-15 01:51:44.427870
K-Fold 3 | datetime=2023-04-15 01:51:50.809150
K-Fold 4 | datetime=2023-04-15 01:51:56.627066
K-Fold 5 | datetime=2023-04-15 01:52:02.428454
(1113, 1195)




# TODO: Fix R2=1

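Every column scores exactly 1 because the evaluation compares real values against themselves. The `combine_first` step starts from a fully imputed frame, so it keeps the real values of every cell that was not held out in the first fold. On top of that, `values_exist_dict.copy()` is a shallow copy, so `get_fold_matrix()` pops the held-out `visit_id`s out of the original lists, and the scoring loop then only looks up cells that were never held out. We should update `get_fold_matrix()` to also produce a mask df that we can use to convert all non-imputed values to NaN before adding each imputed fold to the list, and score only the held-out cells.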
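
A minimal sketch of the masking idea, on toy data. Everything here is an assumption for illustration: `toy_df`, the fixed `held_out_rows` split, and the column-mean fill that stands in for the sklearn imputers. The point is only the `where(held_out_mask)` step before `combine_first`.

```python
import numpy as np
import pandas as pd

# toy data with no missing values, so every cell can be held out once
toy_df = pd.DataFrame(
    {'a': [1.0, 2.0, 4.0, 8.0], 'b': [10.0, 20.0, 40.0, 80.0]},
    index=['v0', 'v1', 'v2', 'v3'],
)

# two folds, each holding out a disjoint set of rows (hypothetical split)
held_out_rows = [['v0', 'v2'], ['v1', 'v3']]

masked_imputed_dfs = []
for rows in held_out_rows:
    fold_df = toy_df.copy()
    fold_df.loc[rows, :] = np.nan

    # remember which cells were real but hidden in this fold
    held_out_mask = fold_df.isna() & toy_df.notna()

    # stand-in imputer: fill with the column mean of the remaining values
    imputed_df = fold_df.fillna(fold_df.mean())

    # keep only the cells that were actually imputed; the rest become NaN
    masked_imputed_dfs.append(imputed_df.where(held_out_mask))

# combine_first can now only fill from imputed cells, never from real values
all_imputed_df = masked_imputed_dfs[0]
for masked_df in masked_imputed_dfs[1:]:
    all_imputed_df = all_imputed_df.combine_first(masked_df)

print(all_imputed_df)
```

With the real values masked out, any score computed against `toy_df` reflects imputation quality rather than a comparison of real values with themselves.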

I got sleepy lol.

In [253]:
results_df.loc[0][4:].min()

1.0